Voice interaction represents the biggest UX challenge since the birth of the smartphone, so we break down the implications and opportunities for this paradigm shift in UX design.

It’s a brand new year, and by most reliable indicators – the latest demos at CES 2017, the buzz on all the tech blogs and even the pre-roll ads interrupting my binge watching of Crazy Ex-Girlfriend – it looks like 2017 will be the year that voice interaction reaches mainstream adoption.

Voice interaction – the ability to speak to your devices, and have them understand and act upon whatever you’re asking them – was everywhere this year. Device manufacturers of all shapes and sizes heavily integrated voice capabilities into their offerings at CES 2017, with Amazon’s Alexa stealing the show as their AI platform of choice.


Meet your new interface – for everything

The rapid proliferation of voice interaction capabilities in our individual digital ecosystems raises critical questions for any designer whose work plays a role in the customer experience. It’s becoming clear that voice interaction will soon become an expected offering, either as an alternative to, or even a full replacement for, traditional visual interfaces.

Voice is poised to impact UX design, just as mobile touchscreens turned web design on its head – except this shift is going to arrive way faster, and far from being limited to screen-based interactions, the transformation is going to permeate every aspect of our users’ lives. As consumers start to talk to and be understood by their products, user-centered companies must learn to apply the same intentional design principles to these interactions as they do with visual interfaces, if they hope to satisfy users’ high expectations for this new wave of tech to “just work”.

In this post, we’re going to explain some of the profound implications of the rise of voice interaction for UX design. Just as the internet began as a playground of raw new technical capability that embraced the guiding principles of intuitive, user-friendly product design over time, so too do I see today’s voice-enabled tools and devices in their infancy, with limitless potential ready to be unlocked through innovative, user-centered design.

What’s driving adoption of voice interaction?

Before we dive into the specific implications of voice for our industry, it’s important to understand some of the forces that are propelling the rapid adoption of this new interaction medium.

Moore’s Law

Accurate natural language processing has, until very recently, existed only in the realm of science fiction, in part because it takes a lot of computing power to break down and interpret human speech in real-time. 2016 saw numerous significant breakthroughs in language processing, and we’ve reached a tipping point where there’s enough computational power available to us to make speech recognition and interaction a viable alternative to visual interfaces.

“…improvements in natural language processing have set the stage for a revolution in how we interact with tech: more and more, we’re bypassing screens altogether through the medium of voice… Shawn DuBravac of CTA said that 2017 would represent an inflection point in voice recognition as computers reach parity with humans, accurately transcribing speech about 94% of the time. “We’re ushering in an entirely new era of faceless computing,” DuBravac said.” ~CES 2017: Key trends, J. Walter Thompson Intelligence

In an age where almost a third of the global population is carrying a microphone connected to a supercomputer in their pocket, it’s not hard to guess at the huge swath of people that are primed and ready to adopt voice interaction as their input method of choice.

A viable, cross-device voice platform

Getting the machines to understand us correctly is just one milestone in the quest for frictionless voice interaction; another is making it available to users across multiple use cases and contexts.

Just as the availability of internet access was one of the major growth factors driving more people online, so the adoption of voice interaction will be limited by the variety of scenarios in which we can simply speak to our devices and be understood. Alexa demonstrated its viability at CES 2017 as such a unifying platform, based on the sheer number of software and hardware developers who’ve chosen to hop onboard thus far, as well as a massive 9X jump in sales numbers of their Echo devices. It may not be the ultimate incarnation of the medium, but it’s currently a strong favorite to become the first voice-driven application to truly find a mainstream audience.

This isn’t a new direction, just the next step in UX design

As designers, we have to understand that humans have always used intermediaries to interact with technology – from levers & pedals, to punch cards, to code, to GUIs, to touchscreens and now to voice and beyond. Each advance in the way we use our tools was motivated by the need to reduce friction: to get more done, faster, more easily and by more people.

Voice represents the new pinnacle of intuitive interfaces that democratize the use of technology – at least until direct brain-to-brain communication becomes a reality (ahem, “Digital Telepathy”, anyone?).


So with this basic understanding of the driving forces behind voice interaction, what does this trend actually mean for designers of the customer experience?

The implications of voice for designers

Vocabulary: words matter more than ever

It was only recently that the movement among UX designers to ditch the use of placeholder text and lorem ipsum in visual interface designs started to gain traction. With the rise of voice interaction, now more than ever, our choice of words will influence how people perceive the customer experiences we design for them, because there are no accompanying visual cues to serve as a guide.

Designers for the voice context must realize they’re relying 100% on what the user perceives the chosen words and phrases to mean – a notoriously squishy concept!

Clearly, there’s an immediate need for some kind of standardized set of command phrases and keywords, so that users are able to intuitively navigate between different AI systems. It’s a safe bet that few will want to memorize proprietary sets of commands for each of their AI assistants.

As designers, we also must adapt and innovate to cope with some of the inherent limitations of this new medium. There are no images we can use to articulate processes more clearly. We can’t use animation to communicate complex concepts more easily. Telling the user to “Click Here” no longer has any meaning when applied to the invisible interface of voice, so we’ll need to develop a whole new lexicon of commonly-understood and intuitive cues for the user to act. Think about that for a sec: the most fundamental design element of the web, the clickable link, no longer has any place in the future standard of interface design.

Understanding users’ intent

Consistent interpretation of commands between visual and voice interfaces will become a key concern for UX designers in charge of navigating this transition phase, particularly for web applications. Without the clear signal of a button click with which to interpret a user’s desired action, it will fall to the designers to anticipate their intent at each point in the conversation, and shape the appropriate response.

A hypothetical example: saying the phrase “Delete this” could be a valid command for voice-enabled versions of both a Microsoft Word document, and your Facebook profile settings – but the consequences and intent behind uttering the same words in each scenario are drastically different!
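To make the stakes concrete, here’s a minimal, purely hypothetical sketch of how a dialogue policy might treat that same utterance differently by context – the context names and risk table below are invented for illustration, not drawn from any real assistant’s API:

```python
# Hypothetical sketch: the same utterance ("Delete this") carries very
# different risk depending on context, so the dialogue policy asks for
# confirmation only when the action is hard to undo. The context names
# and the risk table are assumptions made up for this illustration.
DESTRUCTIVE_CONTEXTS = {
    "document.paragraph": False,  # easily undone with Ctrl+Z
    "account.profile": True,      # effectively irreversible
}

def respond_to_delete(context: str) -> str:
    """Respond to 'Delete this', confirming first if the action is risky."""
    # Unknown contexts default to True: when in doubt, confirm.
    if DESTRUCTIVE_CONTEXTS.get(context, True):
        return "This can't be undone – are you sure you want to delete it?"
    return "Deleted."
```

The design choice worth noting is the safe default: a context the policy has never seen triggers confirmation rather than silent destruction.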

This will not always be such an easy distinction to make. Consider how visual and voice interfaces handle a common digital interaction – opting-in for an email newsletter. In a traditional visual interface, the typical email subscription process for the user goes something like this:

[Image: a typical email opt-in form]

Simple, quick, and unambiguous, right?

Now, how might the same process be initiated by voice?

“Subscribe me to this blog.”

“Add my email address to their mailing list”

“Give me updates from this site”

“Opt me in for this blog’s email newsletter”

There are innumerable ways to articulate the same basic intent via voice – which means UX designers must make sure they’re asking the right questions to elicit the appropriate verbal responses from users.
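One common way to handle this many-phrasings-to-one-intent problem is to normalize utterances to a canonical intent before acting on them. A minimal sketch follows – the patterns and the intent name are assumptions invented for illustration, not any real voice platform’s grammar:

```python
import re

# Hypothetical illustration: collapsing many phrasings of the same
# request ("subscribe to this newsletter") onto one canonical intent.
SUBSCRIBE_PATTERNS = [
    r"\bsubscribe\b",
    r"\b(add|put)\b.*\b(mailing list|email list)\b",
    r"\b(give|send)\b.*\bupdates\b",
    r"\bopt me in\b",
]

def detect_intent(utterance: str) -> str:
    """Return a canonical intent name for a spoken phrase."""
    text = utterance.lower()
    if any(re.search(p, text) for p in SUBSCRIBE_PATTERNS):
        return "newsletter.subscribe"
    return "unknown"
```

All four phrasings above would resolve to the same `newsletter.subscribe` intent – though in practice this keyword matching would be handled by a statistical NLP model rather than hand-written patterns.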

Maintaining engagement with variability

Once the novelty of voice interaction wears off for the mainstream user, product designers will be challenged to maintain user engagement. As we saw in the email subscription example above, there are many ways to articulate even clear-cut binary choices when it comes to voice – but it’s this variability that offers opportunities for intuitive design to foster user engagement.


The nucleus accumbens is the part of the human brain that lights up when we crave something, and in particular it’s highly stimulated by unpredictability. This means when we can’t predict what’s going to happen, we tend to pay a lot closer attention – which partly explains the addictiveness of gambling, and Netflix’s The OA (seriously, try it, it’s an amazing show).

This neurological trait is already employed by many designers at the forefront of visual UX, and will likely continue to be leveraged as we start to shape conversations with our technology. Designing variability into these interactions opens the door to anthropomorphization, with users ascribing mood and even personality to the voices in the machines.


This wide variation in potential responses also places much more emphasis on the importance of crafting meaningful error messaging that steers the conversation with the user back on track, without being incessantly annoying. Users will quickly lose interest in conversing with a voice that robotically repeats “I’m sorry, I didn’t quite catch that”, like a broken phone tree menu.
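As a sketch of what designed-in variability might look like in practice, consider rotating through a pool of fallback prompts so the assistant never repeats itself back-to-back. The phrasings and class below are illustrative assumptions, not any platform’s built-in behavior:

```python
import random

# Hypothetical pool of varied fallback prompts, written by a designer
# rather than defaulting to one canned "I didn't catch that" line.
FALLBACK_PROMPTS = [
    "Sorry, I missed that – could you say it another way?",
    "Hmm, I'm not sure I followed. What would you like to do?",
    "My mistake – can you rephrase that for me?",
]

class FallbackPicker:
    """Pick a fallback prompt at random, avoiding immediate repetition."""

    def __init__(self, prompts):
        self.prompts = list(prompts)
        self.last = None

    def next_prompt(self) -> str:
        # Exclude whatever we said last time, then choose randomly.
        choices = [p for p in self.prompts if p != self.last]
        self.last = random.choice(choices)
        return self.last
```

A real system would go further – escalating to more specific help after repeated failures – but even this small amount of variation keeps the conversation from sounding like a broken phone tree.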

Brands & personalities: An extension of voice?

Outside of what they’re actually saying, voices convey a wealth of meta-information to the listener – so it’s easy to imagine brands leveraging the medium of voice interaction as an extension of their personalities. Gender, age, inflection, tone, accent, cadence and pace are all elements that can be used by UX designers seeking to craft a particular customer experience with their brand.

Virgin America may opt to converse with you in a saucy, flirty and suggestive British voice that’s in line with their brand, whereas the New York Times might opt for a more mature, assertive voice for their announcements. The kids may finally get to talk directly to Mickey as you book your Disney World vacation! Apple may be searching for the perfectly engaging yet soothing voice for your next operating system (spoiler alert: it’s Scarlett Johansson’s voice, in Her).

On the flipside, some brands may opt to let the user customize the voices they interact with – which leads to a looming philosophical debate: who actually controls a brand? Is it the company behind it? Or the customer’s perception of it?

It’s not hard to envision a new role for “VX Designers” – a cross between a casting director and a sound engineer, tweaking synthetic voices in search of that je ne sais quoi they think will best engage their users.

Celebrities will likely find a brand new income stream from licensing not just the sound of their voices, but their entire personalities as AI assistants. Sound ridiculous? It does, but you can already pay about $10 to make your TomTom GPS nav unit speak like Snoop Dogg. (Oooo-wee!)

Preferences, prioritization & smart summarization

One of the advantages visual interfaces retain over voice is the ability to present multiple options to users clearly and in a hierarchical manner – search results and pricing pages are perfect examples of this. But how exactly could you present a list of options to the user without an accompanying visual aid?

In this age of expected instant gratification, it’s hard to imagine the average user patiently listening to their AI assistant as it narrates a laundry list of all sushi restaurants within walking distance, one by one. This would be a classic case of the new medium being limited by the conventions of the past for the sake of familiarity – like someone printing out their emails before reading them: it kinda defeats the purpose, and absolutely doesn’t scale to accommodate today’s needs.

A more viable approach could be to prioritize and summarize the information based on known user preferences, prior to delivering an answer – in other words, doing what a normal person would naturally do in a conversation.

“Hey, Jason, where’s a good place to go for sushi?”

“There are several sushi restaurants in the area – would you like to walk, or drive?”

“It’s a nice day, I’m down to walk”

“Ok, Emperor Sushi is a 2-minute walk from here, but if you want something cheaper, Ninja Sushi Deli is a 5-minute drive.”

“Good to know – let’s do Emperor Sushi today.”

In the case of our hypothetical quest for sushi, a more user-oriented voice interaction asks relevant follow-up questions (“How far do you want to walk?”, “How much do you want to spend?”) to narrow the list down to the very best options before recommending them.

There are wide applications for these sorts of branching, dialog tree-driven interactions – hospitals, info kiosks and hotel concierges could all ditch the clunky touchscreens, and migrate to entirely conversation-driven interactions with voice-enabled devices in, say, your hotel room – with each response crafted by designers according to the latest findings and best practices in hospitality research.

It’ll be up to designers to identify this logical throughline for all kinds of requests, and craft the conversation with the user around it, so the machine is able to collect the data it needs to provide the best possible answer.
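The sushi conversation above can be sketched as a tiny decision tree: instead of narrating every option, the assistant asks one narrowing question and filters by the answer. The restaurant data and field names below are invented for illustration:

```python
# Hypothetical data a voice assistant might filter against. In a real
# system this would come from a places API plus stored user preferences.
RESTAURANTS = [
    {"name": "Emperor Sushi", "mode": "walk", "minutes": 2, "price": "$$"},
    {"name": "Ninja Sushi Deli", "mode": "drive", "minutes": 5, "price": "$"},
]

def follow_up_question() -> str:
    """Ask a narrowing question before listing anything at all."""
    return ("There are several sushi restaurants nearby – "
            "would you like to walk, or drive?")

def recommend(mode: str) -> str:
    """Recommend the single best match for the user's chosen mode."""
    matches = [r for r in RESTAURANTS if r["mode"] == mode]
    if not matches:
        return "I couldn't find a good match – want me to widen the search?"
    best = min(matches, key=lambda r: r["minutes"])
    return f"{best['name']} is a {best['minutes']}-minute {mode} from here."
```

The key design move is that the dialogue collects a preference before answering, so the final response is one confident recommendation rather than a narrated laundry list.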

Accessibility & privacy

The shift to voice interaction must account for multiple accessibility considerations – for instance people who are deaf, mute, or sick and have temporarily lost their voices. Indeed, the amplified potential of voice has been recognized in these communities long before it gained traction in the mainstream:

“Able-bodied individuals gain convenience from voice-control technology, while the disability community gains the greatest reward of all: independence.” ~ Talk to the Machine: Voice Control Comes Into Its Own

If able-bodied users see large benefits from well-refined voice interfaces, consider the huge impact those interfaces will have on the quality of life of those with advanced impairment of their motor function – it could literally mean the difference between life and death!

Without intentionally and thoughtfully designed interactions, the disabled will miss out on the benefits of the ease and intuitiveness of voice interactions. This could involve designers building accommodations into their experiences to run in a hybridized interface configuration that provides both audio and visual cues to these categories of users. Don’t expect chatbots to fade away.

Privacy concerns also abound in this new medium, and we’re walking a fine line between easing friction at the risk of opening up entirely new fronts of vulnerability. Most voice-driven devices currently store and automatically remember the necessary user credentials for the sake of reduced friction, but this will likely come back to bite us – as parents are starting to find out when hundreds of dollars’ worth of cookies unexpectedly arrive at the front door, and their 6 year-old starts looking guilty…

If our voices are our passports in this new medium, what’s to stop someone forging yours by recording you speaking your password out loud, or editing your voice to synthesize commands you never gave? These are imminent privacy concerns that UX designers must address to instill confidence in their users, and propel voice interaction further into the mainstream.

In summary: Pay attention and design with your eyes closed

Voice interaction is the next great leap forward in UX design, and we’ll see it proliferate rapidly in 2017, across software and hardware products.

Many of the old paradigms we’ve come to rely on in visual UX design simply don’t apply to this new medium, so designers must step up to embrace and refine this raw new technology – carefully and intentionally honing the vocabulary used in these interactions, and working to better understand users’ intent at every step.

Once the novelty of voice interaction wears off, it’ll be incumbent upon UX designers to maintain their users’ engagement by leveraging personality and innovating in how AI assistants deliver answers to the questions they’re asked. It’ll begin with crafting the responses for clarity and understanding, and progress toward creating entire branded personas.

I’m under no illusions; what I’ve described here is a huge design challenge. In fact, it’s the biggest design challenge we’ve faced since Steve Jobs’ legendary “One more thing” back in 2007, a product demo that, in hindsight, heralded a transformative change for web and software design.

Voice interaction may not have garnered the same fanfare just yet, but I believe the same moment is upon us in the field of UX design, as voice interaction proliferates, augments and in many situations completely replaces visual UX design as the new standard user interface. For decades, the limitations of our technology have forced us to design our interfaces within a 2-dimensional space – an unnatural expectation for 3-dimensional humans! Designing for voice could be the catalyst that helps us return to the original goal of UX: treating people like people again.

Comments

  • pedroggomez

    Great post, Jason, thanks for sharing. In your opinion, what are the main challenges and opportunities in translating this voice interaction to support global audiences?

    • Thanks for reading @pedroggomez:disqus – great question!

      To be honest, I hadn’t considered this, but in thinking about it, I imagine this will be less of a technical barrier than a design one. We already have near seamless translation technology (check out the recent New York Times article The Great AI Awakening – https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html), so I believe the important questions about a multi-national voice interface would be more centered around culture-specific concerns, e.g. what accent should the interface use to communicate with, and should its character remain consistent regardless of where the user hails from?

      • Kevin Buckley

        @pedroggomez brings up a good point – not just language translation, but cultural translation. In a global market, is it enough to rely on site design to account for cultural sensitivities? For example:

        A Chinese college student was learning to use the Chinese version of Windows 3.2—the first graphical user interface (GUI) she had ever encountered—on her new computer. With the help of the translated The Complete Idiot’s Guide to Windows, she made fast progress. However, every time she opened and moved a file, she was a bit puzzled as to why she needed to click a small yellow rectangular icon on the desktop. She was told that the yellow rectangle was wen jian jia (a Chinese translation of file folder), and its function was to organize files. But what is a file folder? Why did she need to organize her files? She had no idea. As someone who was unfamiliar with American office culture, she had never used a file folder before nor had any experience with filing documents—Chinese culture was not as obsessed with paper trails as American culture, at least not at this time in the 1990s.

        FROM – http://www.oxfordscholarship.com/view/10.1093/acprof:oso/9780199744763.001.0001/acprof-9780199744763-chapter-1

        • It’s a great point, @disqus_TVwMEs9zdw:disqus! For the longest time, running a multi-lingual site was only really feasible for the very largest companies, or the most dedicated site owners. With recent breakthroughs in translation, the *words* might not be an issue, but accurately translating the cultural context of their use is a huge design – and indeed technical – challenge.

          We also have to remember that it’s not just cultural, but generational. I once heard of a mid-20’s client asking a designer what that “banana icon” was on the footer of each page. After several puzzled seconds spent trying to decipher what exactly he was referring to, it dawned on the designer: the “banana” was actually a traditional phone handset icon – the client was so young that they couldn’t actually conceive of a phone that wasn’t shaped like a smartphone!

          • Kevin Buckley

            This is a very real issue @jaffy:disqus ! Just this morning the Marketing Director and I were having a conversation about “Fax Blasting” in front of some of our younger peers, and the looks of confusion and disbelief were rather humorous.

            The scenario works both ways and it certainly highlights the need to understand the proficiency levels of your user base. For example, our primary customers are insurance agents with an average age of 50+. As I try to develop our replacement systems to support “newer” technology (web services, data modeling, ad hoc reporting, etc.) I have to constantly remind myself that there are certain features that won’t be used by the majority of users without significant education. I can’t tell you how many calls my techs get complaining about faxes that don’t go through. If those are the critical issues of my users, it is indeed an uphill battle.

            With that being said, there’s a very fine line to be walked when trying to support the needs of a less technical user base while still courting a more tech-savvy generation of potential users.

  • krasalexander

    Hey @jaffy:disqus ! Thanks for the great summary of current state of Voice UX technology and design.

    Let me share my thoughts on several matters that have been raised:
    – Standardized set of command phrases – I feel it won’t be a real breakthrough. People have used voice as their primary means of communication for centuries, and making them follow a set of rules when “talking to their app or washing machine” may be more challenging than we techies hope. I’d say it will be a good thing to have, but the app’s ability to understand the user’s natural way of talking is the most important challenge. Actually, you’ve provided a fantastic example with that sushi place – it was a truly human-to-human style of interaction.
    – Voice UX and other ways of delivering info to people – I think that with voice only, the industry will be pretty limited in growth. However, combined with other existing delivery channels (smartphones, TVs, wearables, etc.) it may become the must-have option for apps, much like the mobile-friendly approach disrupted the industry in the 2010s.
    – Voice recognition and natural language processing: I think there is sometimes confusion between those two terms (though not in your article 🙂). I think we are very close to market fit for voice recognition, but NLP is still on the way to the state where we can have human conversations with our OpenTable or CRM app. The chatbot and Voice UX industries could work toward that goal together. Actually, I think Google acquired such a company before releasing Google Assistant to the market.
    – Privacy: I think there is one big problem with privacy in Voice UX – authentication. There is no way for either Alexa or Google Assistant to verify that it’s really you asking the question, and this may be an issue for both personal and business apps.
    – Some people issues: given that people are, in general, ready to talk to their fancy TV at home, it will be interesting to see them interact with devices and apps outside the house – say, using a voice interface to check on information that isn’t shown on screen. Such an ability could really help in heavy enterprise apps, where so much data needs to be taken into consideration.

    Hope my thoughts were interesting to you 🙂

    • All great thoughts, thanks for weighing in @krasalexander:disqus!

      – Agreed on standardized commands – they’ll be a temporary band-aid solution while the technology improves to pick up on the kinds of subtle cues we give each other when talking. A favorite exercise of mine: when riding in a Lyft with a friend, notice that you instinctively give a combination of signals to identify when you’re talking to your friend versus addressing your human driver – facing forward, projecting your voice in their direction, modifying your tone of voice, etc. Once machines can detect these intentional cues, the need for predefined commands will go away, and it’ll all be contextual.

      – Agreed, of all the issues, privacy is probably one of the biggest. My guess is we’ll need to go to multi-factor authentication as a default: the computer will need to check not only that it’s your voice, but also that you’re physically present, were actually addressing the computer and not another person, and that you’re not under duress, etc.