Large Language Models (LLMs) have drastically transformed our perception of machines and their grasp of language. With models like OpenAI's GPT series and those available through AWS Bedrock, we've witnessed a sea change in machines' ability to comprehend and generate human-like text. An integral feature of these LLMs is their capacity to use context. When prodded for further dialogue or refinement, they typically yield better, more focused answers, a testament to their ability to adapt and learn. However, widely used consumer interfaces like Amazon's Alexa do not seem to make any use of context. This may explain the limits of their adoption by consumers.
The Contextual Prowess of LLMs
Advanced LLMs capture the essence of context through various means:
• Iterative Refinement: Upon receiving a query, LLMs produce a response based on their vast training data. If the user feels the answer isn't satisfactory and asks for more detail or clarity, the LLM can delve deeper, offering a refined and potentially more accurate answer.
• Handling Ambiguity: Natural language is often ambiguous. Faced with multifaceted statements, LLMs can utilize prior interactions or seek further clarity to pinpoint the user's exact intention.
• Adapting to Conversational Flow: Unlike older chatbots that treated each user input in isolation, LLMs can maintain a semblance of conversational continuity, building upon prior exchanges to ensure a cohesive and contextual dialogue.
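As a concrete illustration of that last point, chat-style LLM APIs generally accept the entire running transcript with each request, so a follow-up is interpreted against everything said so far. The sketch below uses a hypothetical `call_llm` function as a stand-in for whatever chat-completion API you happen to use:

```python
# Minimal sketch of the chat-history pattern; `call_llm` is a hypothetical
# stand-in for any chat-completion API, not a specific vendor's SDK.

def call_llm(messages):
    """Hypothetical: a real implementation would send `messages` to a model."""
    return f"(model reply to: {messages[-1]['content']})"

history = [{"role": "user", "content": "What's a good beginner telescope?"}]
reply = call_llm(history)                  # answer based on the question alone
history.append({"role": "assistant", "content": reply})

# "Something cheaper?" only makes sense because the earlier turns
# travel with the request.
history.append({"role": "user", "content": "Something cheaper?"})
refined = call_llm(history)                # answer builds on the prior exchange
```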
Alexa Does Not Contextualize
One type of failure is that Alexa treats each interaction as a standalone "conversation". If you ask it the same question repeatedly, it generally gives the same repetitive answer, even after you tell it that the answer was wrong.
User: “play xxxx”
Alexa: “playing yyyy”
User: “no, play xxxx”
Alexa: “playing yyyy”
You can keep repeating “no” all day and Alexa will never change its response.
Another way it lacks context is in Home Automation. The words "on" and "off" can sound very similar. So, imagine your lights are on and you say: "turn lights off". Alexa's natural language understanding (NLU) can easily hear "turn lights on" and conclude there is nothing to do. If the user repeats the instruction, Alexa simply does not have the ability to notice that, if the lights are already on, it's far more likely the user said "off" rather than "on". So, Alexa continues to decide there is nothing to do, and the user gets more and more frustrated.
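As a thought experiment (this is not how Alexa's pipeline actually works), here is a minimal sketch of how knowing the device state could be used to re-rank two acoustically similar interpretations:

```python
# Illustrative sketch: use device state to re-rank acoustically similar
# interpretations of "turn lights on/off". Not Alexa's actual pipeline.

def rerank(hypotheses, lights_are_on):
    """hypotheses: list of (utterance, acoustic_score) pairs from speech recognition."""
    reranked = []
    for text, score in hypotheses:
        # A command that would change nothing ("turn lights on" while they
        # are already on) is a much less likely user intent, so penalize it.
        if lights_are_on and text == "turn lights on":
            score *= 0.2
        if not lights_are_on and text == "turn lights off":
            score *= 0.2
        reranked.append((text, score))
    return max(reranked, key=lambda pair: pair[1])

# "on" and "off" are acoustically close, so the raw scores nearly tie.
best = rerank([("turn lights on", 0.51), ("turn lights off", 0.49)],
              lights_are_on=True)
print(best)  # ("turn lights off", 0.49) wins once state is considered
```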
Now, all of this applies primarily to what we call First Party interactions, meaning interactions with Amazon-developed features. These are requests for things like "what is the weather", "set a timer", or "when is the next Red Sox game." No context is thought to be required for these requests, as they are completely standalone.
By contrast, there are over a hundred thousand "skills" (Amazon's word for applications) written by independent developers. These skills often do maintain context, but the extent of that context is hit and miss. In my own Premier League skill, for example (which covers UK football), if you ask about the "Red Sox" the skill reminds you that you're talking to a football skill and that the Red Sox are a baseball team. If this were a First Party feature, Alexa would likely not respond at all.
A third way that Alexa ignores context is its lack of emotive sensitivity. While detecting the emotional content of a voice utterance isn't a completely solved problem, it can certainly be done (https://towardsdatascience.com/detecting-emotions-from-voice-clips-f1f7cc5d4827).
One of the hallmarks of a bad Alexa interaction is a user becoming increasingly frustrated or even angry at Alexa while Alexa remains unaware of the fact. At Voice22 and at a Meetup presentation I spoke about how Alexa could incorporate emotion in its responses (https://www.youtube.com/watch?v=4LyQy-Aq79o). Unfortunately, for privacy reasons Alexa is designed so that the actual voice utterance is not available to a third-party developer. This means there is no opportunity to analyze the interaction for emotion. First Party features, however, could use emotion detection, since they are internal to Amazon and could access the raw voice recording. Additionally, many third-party skills notice if a user is saying "No" a lot and offer a help message or even an apology.
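A skill can approximate at least the "repeated no" part with plain bookkeeping. Here's a minimal sketch, assuming a hypothetical `session` dict that persists between turns (in a real skill this state would live in session attributes):

```python
# Illustrative sketch: track consecutive "no" replies in session state so
# the skill can change tone instead of repeating itself.

def handle_no(session):
    """Called each time the user answers "no"; returns the skill's reply."""
    session["no_count"] = session.get("no_count", 0) + 1
    if session["no_count"] >= 3:
        return ("Sorry, I'm clearly not getting this right. "
                "Try saying the team name on its own, for example 'Arsenal'.")
    return "Okay, what would you like instead?"

session = {}                      # hypothetical per-user state between turns
for _ in range(3):
    print(handle_no(session))     # the third reply apologizes and offers help
```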
Until recently this wasn't much of a concern because, compared with the alternatives, Alexa's interactions were pretty good. With LLMs raising the bar, however, Alexa's interactions seem increasingly lacking.
Can Alexa Learn Context?
One would certainly think so. Context, after all, is simply awareness of what has happened and what is happening. Alexa certainly knows that the lights are on, that it's nighttime, and that on each of the previous 100 days the last user interaction was to turn off the lights. We know that Alexa keeps track of these interactions because you can see them all via the Alexa app. So, it's actually shocking that this readily available information is not being used.
Many skill writers are incorporating LLMs into their responses. It’s early days for this approach and there are several stumbling blocks … not the least of which is that Alexa only has eight seconds to create a response, and LLMs are often not that fast. Various skill writers are designing solutions to this issue but as yet no general solution has emerged.
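One plausible workaround (a sketch, not an established solution) is to race the LLM call against a time budget and fall back to a quick holding response when it runs long. The `call_llm` function below is a hypothetical stand-in for whatever model is being used:

```python
# Illustrative sketch: fit an LLM call inside Alexa's roughly eight-second
# response window by giving it a time budget and falling back if it misses.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def call_llm(prompt):
    """Hypothetical stand-in for a slow model call."""
    time.sleep(10)                       # pretend the model is slow
    return "A long, thoughtful answer."

_pool = ThreadPoolExecutor(max_workers=1)

def respond(prompt, budget_seconds=6.0):
    """Return the model's answer if it arrives in time, else a fallback."""
    future = _pool.submit(call_llm, prompt)
    try:
        return future.result(timeout=budget_seconds)
    except TimeoutError:
        # Stay inside the response window; the full answer could be
        # deferred to a follow-up turn instead.
        return "Good question. Give me a moment and ask me again."

print(respond("Who won the 1966 World Cup final?"))
```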
On the other hand, a third-party skill can easily maintain its own context and feed it into the next LLM prompt, which should produce a higher-quality response.
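A minimal sketch of that idea, again with a hypothetical `session` dict and `call_llm` stand-in: keep a short transcript per user and prepend it to each prompt.

```python
# Illustrative sketch: the skill keeps its own short transcript and prepends
# it to the next prompt so the model can answer in context.

def call_llm(prompt):
    """Hypothetical: a real implementation would call an LLM here."""
    return "(model reply)"

def answer(session, user_utterance, max_turns=6):
    history = session.setdefault("history", [])
    history.append(f"User: {user_utterance}")
    prompt = ("You are a Premier League football assistant.\n"
              + "\n".join(history[-max_turns:])   # only the most recent turns
              + "\nAssistant:")
    reply = call_llm(prompt)
    history.append(f"Assistant: {reply}")
    return reply

session = {}                          # persisted per user between turns
answer(session, "Who do Arsenal play next?")
answer(session, "And after that?")    # this prompt now carries the earlier turn
```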
What's really needed, however, is a large-scale change in Alexa's First Party responses. This change will not be free or easy, and the large-scale layoffs in the Alexa division are not encouraging. On the other hand, perhaps the division can stop creating strange products like the flying Alexa or the $1599 Astro robot and concentrate on fixing the basics.
Alexa could use the new era of LLMs to retake the lead in intelligent home automation. I hope they do.