Hi! I'm Ottomatias from Speechly. We are building developer tools that enable developers to add voice user interfaces to their apps.
I'd be interested to hear why you haven't already added voice to your app. Have you tried it but run into some issues? Or don't you think it would be beneficial in your use case?
Would be great to hear your thoughts! Tell me what you are building, too, and on which platform.
Top comments (3)
It must give a benefit/value higher than the cost of implementing (and maintaining) it. On the other hand, users are skeptical about these interactions because they don't know how long your app will be listening to their ambient sounds and conversations (privacy concerns).
Apart from that, it requires a redesign to add a button somewhere to activate the feature, which adds more cost.
This can't be properly validated through A/B testing, because when you add this kind of feature users will usually "push the button" because it's new, but in the long term they will go back to touch/mouse/keyboard.
Moreover, users who do "take advantage" of voice functionality don't use it in public, so these features are reserved for an "alone at home" situation.
Another point is voice recognition and natural language processing, which are far from perfect and make users lose time instead of making things easier.
After all that... I think this functionality is reserved for a very specific market niche, so there's no point in adding it in general.
If you search for the most-asked questions to Siri/Cortana/Alexa/Google Assistant you will understand what I'm saying (Are you married? I'm drunk, can you drive? Do aliens exist? Can you make me laugh?).
I may have missed something... So, reversing the question: what are the advantages we can get from using voice functionality in our applications?
Great comment, thanks Joel!
To recap your comment, I'd say that you gave four different reasons:
1) privacy issues on end-users side
2) implementation costs
3) speech recognition accuracy
4) the benefits of a voice user interface are not big enough
So, starting with privacy: it's clearly an issue and something that app developers need to think about. And when it comes to machine learning, it's not a trivial issue, as developing the models requires at least some way for a human to a) validate the results and b) correct them if needed, so that the model can be improved.
Our approach to privacy is based on being as open as possible about how the data is used and who can access it, and on also supporting a "private mode" where voice data is never heard by anyone. But like I said, it is a valid question and something that app developers should really think about. Not only with voice, of course.
Then about implementation costs. If we skip the problems of ASR accuracy and whatnot, implementing voice user interfaces is actually pretty simple with modern tools (such as Speechly ;) ). We have a good client for React, for example, and the extra work is pretty much:
a) configuring the model by providing annotated sample utterances such as
b) streaming audio from the application to our servers to receive the actual transcript, but also the intent and entities. So if the user said something like "Set brightness in living room to 25", the API would return the intent `set_brightness` and entities `living room` and the value of `25`. By providing a few more examples, our system should be able to generalize these to also support other similar ways of expressing the same things – so for example if the user would say "I want the kitchen brightness to be 56", it would still work. So implementation costs do not need to be that huge!
When it comes to speech recognition accuracy, it's a hard problem. There is a range of accents and different voices and, just like us humans, sometimes the system doesn't hear it just right. We are solving this issue at Speechly (and similar ways are used by others, too) a bit like us humans do, too: if you know the context and hear at least some of the words right, you can probably guess the rest.
This requires some natural language understanding on top of the actual speech recognition part. Let's say the baseline ASR (automatic speech recognition) hears something like "Turd ofter flights". If you've provided "Turn off the lights" as one of the example utterances, it's pretty easy to guess what was really said.
Of course, not all the results are correct, but I'd say that with most current speech recognition systems you can achieve a level of confidence that makes building real-life applications feasible.
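As a rough illustration of the "guess the rest" idea above, here is a minimal Python sketch that matches a noisy transcript against the example utterances using plain string similarity. It is not how Speechly's system works internally: the `EXAMPLES` table, the intent names other than `set_brightness`, the `guess_intent` helper and the `cutoff` value are all hypothetical, and a real system would rely on a trained language understanding model rather than `difflib`.

```python
from difflib import SequenceMatcher

# Hypothetical example utterances mapped to hypothetical intent names.
EXAMPLES = {
    "turn off the lights": "turn_off_lights",
    "turn on the lights": "turn_on_lights",
    "set brightness in living room to 25": "set_brightness",
}


def guess_intent(transcript, examples=EXAMPLES, cutoff=0.6):
    """Return (utterance, intent, score) for the closest example, or None.

    `cutoff` is an assumed similarity threshold between 0 and 1; anything
    below it is treated as "did not understand".
    """
    heard = transcript.lower().strip()
    best_utterance, best_score = None, 0.0
    for utterance in examples:
        # ratio() is a character-level similarity score in [0, 1].
        score = SequenceMatcher(None, heard, utterance).ratio()
        if score > best_score:
            best_utterance, best_score = utterance, score
    if best_utterance is None or best_score < cutoff:
        return None
    return best_utterance, examples[best_utterance], best_score


if __name__ == "__main__":
    # The misheard phrase from the text; the closest example is
    # "turn off the lights".
    print(guess_intent("Turd ofter flights"))
    # Nothing similar among the examples, so this prints None.
    print(guess_intent("xyzzy"))
```

A lower `cutoff` would accept more garbled input but also more wrong guesses, which is the same accuracy trade-off discussed here.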
And then the benefits: I don't think voice will ever be a replacement for touch and vision. Speech is seldom the fastest means of transmitting information, because we humans are just not very good at expressing complex ideas in short sentences. If you want to create a spreadsheet, for example, a keyboard and mouse are probably the best UI for most tasks.
But let's say you want to send the finished spreadsheet to your boss after it's done. It's pretty easy to express that intent by voice. With a keyboard and mouse, on the other hand, you'll need to switch between apps, copy-paste links and whatnot. That's a task that's a lot easier to do with voice.
The same goes with almost every application. Every application has subtasks and use cases where using voice would make a lot of sense, but you should not replace the current UI with a voice UI but rather add voice functionalities to improve the current UI and make use of voice whenever it's best suited.
There are also applications that really benefit from a voice UI. For example, we built a grocery shopping application with a voice UI. It's a lot faster to say something like "2 liters of milk, one bag of crisps, six-pack of Heineken and a loaf of bread" than to do 4 different searches and click ADD next to the correct products.
Sorry for a humongous answer!
Hahaha no problem, that's fine, I'll probably write another bible here.
I'm looking at it from my business point of view rather than in a generic way, being as skeptical as a "client" could be and thinking about our customers.
The use case you provided, sending a spreadsheet to your boss, could be perfectly automated with a "send to" button where you just add an instant search linked to your contacts list, for example (every automation can be achieved in multiple ways).
That's why I said that I can see there's a specific market share for these features, but it's not valid for every app. I mean, Gboard, SwiftKey and other smartphone keyboard apps included this feature a long, long time ago and I've never seen (heard) a single person using it to write WhatsApp messages or emails.
That's - I think - due to concerns about other people knowing what you are doing.
I mean, no one wants others to hear what they're saying to their family or friends or whatever. Also, you may not care about others knowing that you are currently buying something on the internet, but I think no one wants others to hear the entire list of what they are about to buy.
Of course, we are not talking about replacing one way of interacting with another; we are talking about combining both, and that's great if it matches your app and could benefit users. Adding it without a reason could add cognitive load for your potential customers and be counterproductive.
Now I'm talking as a customer/user:
I like the voice features of Google Assistant while driving (call someone, send [TEXT] to [CONTACT] using WhatsApp, or search for places on Google Maps), and I like voice features to turn the lights on/off.
I don't want, don't need and am not going to use voice control to buy something on Amazon. I'm aware that using voice control through Alexa could be fine, but I may want to read some reviews, compare prices and product properties, and finally choose the delivery address and payment method visually.
As a developer again:
I think it's not about adding this functionality to your app as a whole; it should be about adding these features for specific use cases.
For example, I may want to add voice recognition for customers asking a question through the contact chat, and answer them using speech too. But not for adding products to the cart: I want to push related products when the user clicks the "add to cart" button, and if I let users add products to the cart by voice they don't need to pay attention to the screen, so the marketing push will be useless for those customers.