Speech and Natural Language Input for Your Mobile App Using LLMs
Utilizing OpenAI GPT-4 Functions for Navigating Your GUI
Introduction
A Large Language Model (LLM) is a machine learning model that is remarkably good at processing natural language. At the time of writing, the most capable LLM available is GPT-4, the model behind the premium version of ChatGPT. This article shows product owners, UX designers, and mobile developers how to give an application highly flexible speech interpretation through GPT-4 function calling, integrated with the app's Graphical User Interface (GUI).
Background
Mobile phone digital assistants, available on both Android and iOS platforms, have encountered challenges in gaining widespread popularity due to a range of factors. These include their unreliability, constrained functionality, and frequently cumbersome user experience. However, LLMs, particularly OpenAI GPT-4, present an opportunity to address these issues. Their capacity to comprehend user intentions on a deeper level, as opposed to relying on rudimentary pattern matching of spoken phrases, holds the potential to revolutionize this landscape.
Android offers Google Assistant's 'app actions', while iOS has SiriKit intents. These provide predefined templates for registering voice commands that your application can handle. The capabilities of Google Assistant and Siri have improved substantially in recent years, perhaps more than you realize, but their effectiveness depends on app integration: for instance, you can use voice commands to play your preferred track on Spotify. Note, however, that the contextual understanding of these built-in OS services predates the major advances brought by Large Language Models (LLMs). The logical next step is therefore to use LLMs to make speech input more reliable and more flexible.
While it’s anticipated that operating system functions such as Siri and Google Assistant will likely adjust their approaches to leverage Large Language Models (LLMs) in the near future, we can presently empower our applications to incorporate speech capabilities independently from these services. Once you’ve embraced the principles outlined in this piece, your app will also be prepared to utilize upcoming virtual assistants once they are accessible.
The type of language model you choose (GPT, PaLM, Llama 2, MPT, Falcon, etc.) does influence reliability, but the fundamental principles taught here apply to any of them. Users will be able to fully utilize the app's features simply by expressing their intent in a single utterance. The language model translates a natural language command into an action within the app's structure and functions. This doesn't require a robotic sentence structure; the language model's comprehension abilities enable users to communicate naturally, using their own words and style. Users often hesitate, err, and correct themselves when interacting with voice assistants, causing frustration. However, the adaptability of a language model can foster a more organic and dependable interaction, ultimately leading to greater user acceptance, especially among those who have been hesitant due to previous experiences of misunderstood intentions by traditional voice assistants.
Why speech input in your app, and why now?
Pros:
- Go to a page and input all the settings using a single spoken command.
- Easy to learn: users don't have to find where data or functionality lives within the app, or understand how the graphical user interface (GUI) works.
- Hands-free
- Coordinated and integrated, unlike separate elements in a voice user interface (VUI): speech and GUI complement each other seamlessly.
- Beneficial for individuals with visual impairments.
- Why now: thanks to the advances in natural language understanding that LLMs bring, interpretation of spoken requests has become notably more dependable.
Cons:
- Privacy when speaking
- Accuracy and misinterpretations
- Still relatively slow
- Knowing what the system knows ('What can I say?'): users cannot tell which spoken phrases the system understands and which information it has.
Applications that could take advantage of voice input include those that assist car or bicycle navigation. More generally, users may prefer not to deal with app navigation through touch when their hands are occupied, such as when they are on the move, wearing gloves, or busy with manual tasks.
Shopping applications could also benefit: users can express their preferences in natural language instead of navigating through shopping screens and configuring filters.
When applying this approach to improve accessibility for people with visual impairments, also consider adding natural language responses and text-to-speech.
Your app
The provided illustration displays the layout for navigating a common application, demonstrated by a train journey planning tool you might recognize. The upper part illustrates the default layout designed for touch-based navigation, which is controlled by the Navigation Component. All navigation interactions are directed to the Navigation Component, which subsequently carries out the navigation command. The lower part illustrates how we can leverage this layout using voice input.
Users express their preferences verbally, which are then converted into text through a speech recognition system. The system generates a request containing this text and transmits it to the LLM. The LLM provides the application with relevant information, indicating which screen to display along with specific parameters. This information is transformed into a deep link and passed to the navigation component. The navigation component then triggers the appropriate screen with the designated parameters – for instance, activating the ‘Outings’ screen with the parameter ‘Amsterdam’ in this scenario. It’s important to acknowledge that this overview is an oversimplification; we will provide a more detailed explanation below.
Numerous contemporary applications incorporate a centralized navigation element in their underlying structure. Android utilizes Jetpack Navigation, Flutter employs the Router, and iOS relies on NavigationStack. These centralized navigation components facilitate deep linking—a method enabling users to directly access a particular screen within a mobile app, bypassing the need to navigate through the app’s primary screen or menu. While the concepts discussed in this article can be applied without mandating a navigation component and centralized deep linking, their presence simplifies the implementation of these ideas.
Deep linking means creating a unique URI path that points to a specific piece of content or a specific section within an application. Such a path can also carry parameters that determine the state of the GUI elements on the screen the deep link points to.
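As a rough sketch of what such a parameterized deep link destination can look like, here is a hypothetical 'outings' route with an 'area' parameter declared with Jetpack Navigation for Compose on Android. The route name, the parameter, and the OutingsScreen composable are illustrative examples, not taken from a real app.

import androidx.compose.runtime.Composable
import androidx.navigation.NavHostController
import androidx.navigation.NavType
import androidx.navigation.compose.NavHost
import androidx.navigation.compose.composable
import androidx.navigation.navArgument

// Stub screen; a real app would render outings for the given area here.
@Composable
fun OutingsScreen(area: String?) { /* ... */ }

@Composable
fun AppNavHost(navController: NavHostController) {
    NavHost(navController = navController, startDestination = "home") {
        composable("home") { /* home screen */ }
        // Hypothetical deep-linkable route: the "area" argument sets the
        // state of the GUI elements on the screen it points to.
        composable(
            route = "outings?area={area}",
            arguments = listOf(navArgument("area") {
                type = NavType.StringType
                nullable = true
                defaultValue = null
            })
        ) { backStackEntry ->
            OutingsScreen(area = backStackEntry.arguments?.getString("area"))
        }
    }
}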
Function calling for your app
We instruct the LLM to map a natural language phrase onto a navigation function call through prompt engineering. In essence, the prompt reads: 'Given the following function templates with parameters, map the natural language query below onto one of them and return the result.'
The majority of LLMs possess this capability. LangChain has effectively utilized this feature through Zero Shot ReAct Agents, referring to the intended actions as Tools. OpenAI has fine-tuned their GPT-3.5 and GPT-4 models for this purpose, with specialized versions (namely gpt-3.5-turbo-0613 and gpt-4-0613) that are particularly good at it, and has introduced specific API entries to facilitate this functionality. While we'll use OpenAI's terminology in this article, the underlying ideas are applicable to any LLM, including through the ReAct mechanism. Furthermore, LangChain has a distinct agent type (AgentType.OPENAI_FUNCTIONS) that internally converts Tools into OpenAI function templates. For Llama 2, you can use llama-api with the same syntax as OpenAI.
Function calling for LLMs works as follows:
- You include a JSON schema containing function templates within your prompt, combined with the user’s natural language input as a user message.
- The LLM endeavors to associate the user’s natural language input with one of these templates.
- The LLM furnishes the resultant JSON object, enabling your code to execute a function call.
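For illustration, a minimal request to the OpenAI chat completions endpoint could look like the following Kotlin sketch, built with OkHttp and org.json. The single 'outings' function template included here is a trimmed-down, hypothetical version of the templates in the gist referenced below; a real app would include one template per screen.

import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody
import org.json.JSONArray
import org.json.JSONObject

fun interpretUtterance(utterance: String, apiKey: String): JSONObject {
    // One trimmed-down, hypothetical function template; a real app sends one per screen.
    val outingsFunction = JSONObject()
        .put("name", "outings")
        .put("description", "Show leisure activities in a given area")
        .put("parameters", JSONObject()
            .put("type", "object")
            .put("properties", JSONObject()
                .put("area", JSONObject()
                    .put("type", "string")
                    .put("description", "City or region the user asks about")))
            .put("required", JSONArray().put("area")))

    val body = JSONObject()
        .put("model", "gpt-4-0613")
        .put("messages", JSONArray().put(JSONObject()
            .put("role", "user")
            .put("content", utterance)))
        .put("functions", JSONArray().put(outingsFunction))
        .toString()
        .toRequestBody("application/json".toMediaType())

    val request = Request.Builder()
        .url("https://api.openai.com/v1/chat/completions")
        .header("Authorization", "Bearer $apiKey")
        .post(body)
        .build()

    OkHttpClient().newCall(request).execute().use { response ->
        // The returned message contains either a function_call or plain natural language content.
        return JSONObject(response.body!!.string())
            .getJSONArray("choices").getJSONObject(0)
            .getJSONObject("message")
    }
}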
Within this article, the function descriptions represent straightforward translations of the graphical user interface (GUI) within a (mobile) application, wherein each function aligns with a screen, and each parameter corresponds to a GUI element on that screen. When a natural language statement is directed to the LLM, it yields a JSON object featuring a function name and its associated parameters. This JSON object can be employed to navigate to the appropriate screen, activate the designated function within your view model, consequently retrieving the necessary data, and configuring the values of pertinent GUI elements on that screen based on the provided parameters.
This is illustrated in the following figure:
It shows a simplified version of the function templates included in the prompt for the LLM. The full prompt for the user message 'What activities are possible in Amsterdam?' can be found via this link (GitHub Gist). It contains a complete curl request that can be run from the command line or imported into Postman. Replace the placeholder with your own OpenAI key to execute it.
Screens without parameters
Certain screens within your application might lack parameters, or at the very least, not the parameters that the LLM needs to recognize. To minimize token consumption and simplify matters, you can consolidate several of these screen activations into a singular function featuring a solitary parameter: the screen to be displayed.
{
  "name": "show_screen",
  "description": "Determine which screen the user wants to see",
  "parameters": {
    "type": "object",
    "properties": {
      "screen_to_show": {
        "description": "type of screen to show. Either 'account': 'all personal data of the user', 'settings': 'if the user wants to change the settings of the app'",
        "enum": ["account", "settings"],
        "type": "string"
      }
    },
    "required": ["screen_to_show"]
  }
},
Whether a screen-triggering function needs its own parameters comes down to whether the user has anything to choose on that screen: if any kind of search or navigation happens there, such as search fields or selectable tabs, the screen deserves its own function.
If not, the LLM does not need to know the specifics, and screen activation can be handled by your app's generic screen-triggering function. It mainly comes down to experimenting with descriptions of the screen's purpose to see what works. If a longer description is needed, it may be better to give the screen a dedicated function definition, which puts more emphasis on its description than an entry in the enumeration of the generic parameter does.
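A minimal sketch of handling the consolidated show_screen result on the app side could look as follows; the route names mirror the enum values above, and the helper itself is a hypothetical example.

import android.util.Log
import androidx.navigation.NavController
import org.json.JSONObject

// Dispatch the generic show_screen function call to a navigation route.
fun handleShowScreen(arguments: JSONObject, navController: NavController) {
    when (arguments.getString("screen_to_show")) {
        "account" -> navController.navigate("account")
        "settings" -> navController.navigate("settings")
        else -> Log.w("VoiceNav", "Unknown screen requested")
    }
}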
Prompt instruction guidance and repair
Within the initial instructions of your prompt, you provide general guidance for navigation. In our scenario, it could be crucial for the language model to be aware of the current date and time, particularly if you intend to arrange a trip for the following day. Another significant aspect is to control the level of assumption the model makes. Frequently, it’s preferable for the model to exhibit excessive confidence rather than troubling the user with its lack of certainty. A suitable system message for our example application might be:
"messages": [
  {
    "role": "system",
    "content": "The current date and time is 2023-07-13T08:21:16+02:00. Be very presumptive when guessing the values of function parameters."
  },
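To keep the model's notion of 'now' accurate, the current timestamp can simply be inserted in the same ISO 8601 format when the system message is built. A small sketch (the function name is just an example):

import java.time.ZonedDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

fun buildSystemMessage(): String {
    // Produces e.g. "2023-07-13T08:21:16+02:00" in the device's time zone.
    val now = ZonedDateTime.now()
        .truncatedTo(ChronoUnit.SECONDS)
        .format(DateTimeFormatter.ISO_OFFSET_DATE_TIME)
    return "The current date and time is $now. " +
        "Be very presumptive when guessing the values of function parameters."
}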
Function parameter descriptions can require quite a bit of tuning. An example is the trip_date_time when planning a train trip. A reasonable parameter description is:
"trip_date_time": {
  "description": "Requested DateTime for the departure or arrival of the trip in 'YYYY-MM-DDTHH:MM:SS+02:00' format. The user will use a time in a 12 hour system, make an intelligent guess about what the user is most likely to mean in terms of a 24 hour system, e.g. not planning for the past.",
  "type": "string"
},
Hence, assuming the current time is 3:00 PM, when users say they want to depart at 8, they mean 8:00 PM unless they explicitly state the time of day. This instruction works reasonably well with GPT-4, but there are edge cases where it still falls short. In those cases we can add extra parameters to the function template and use them to make further corrections in our own code. For example, we could add:
"explicit_day_part_reference": {
  "description": "Always prefer None! None if the request refers to the current day, otherwise the part of the day the request refers to.",
  "enum": ["none", "morning", "afternoon", "evening", "night"]
}
In your app you are likely going to find parameters that require post-processing to enhance their success ratio.
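As an illustration of such post-processing, here is a small sketch of how trip_date_time might be repaired using the explicit_day_part_reference parameter above. The exact rules, and the function name, are hypothetical and will depend on your app.

import java.time.OffsetDateTime

// Sketch of parameter repair: if the model planned a trip in the past and the
// user gave no explicit day part, assume a 12-hour misread and shift forward.
fun repairTripDateTime(raw: String, dayPart: String, now: OffsetDateTime): OffsetDateTime {
    var dateTime = OffsetDateTime.parse(raw)
    if (dayPart == "none" && dateTime.isBefore(now)) {
        dateTime = dateTime.plusHours(12)      // e.g. "8" meant 20:00, not 08:00
        if (dateTime.isBefore(now)) {
            dateTime = dateTime.plusHours(12)  // otherwise move on to the next morning
        }
    }
    return dateTime
}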
System requests for clarification
Sometimes the user's request does not contain enough information to act on, and it may be that no suitable function is available to handle it. In that case the LLM responds in natural language, and this response can be shown to the user, for instance via a Toast notification.
There might also be instances where the LLM recognizes a potential function to invoke, but essential information is missing to complete all the required function parameters. If this occurs, consider the option of making certain parameters optional, if feasible. However, if that’s not a viable solution, the LLM could formulate a request for the absent parameters in natural language, using the user’s language. This text should be displayed to the users, for instance, via a Toast message or text-to-speech functionality. This way, users can provide the missing information through spoken input. For example, if a user states “I want to go to Amsterdam” (assuming your app hasn’t supplied a default or current location through the system message), the LLM might respond with “I understand you intend to take a train trip. Could you please specify your departure location?”
This introduces the matter of maintaining conversational context. I recommend always incorporating the four most recent messages from the user in the prompt. This allows requests for information to span across multiple conversational turns. To simplify this, exclude the responses generated by the system from the history, as they often have a tendency to be more counterproductive than helpful in this particular use case.
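A minimal sketch of this history handling, assuming the prompt is assembled with org.json; the function name is illustrative, and keeping exactly four user messages follows the recommendation above.

import org.json.JSONArray
import org.json.JSONObject

// Keep conversational context by sending only the most recent user utterances,
// deliberately dropping the assistant's own replies from the history.
fun buildMessages(systemMessage: String, userHistory: List<String>, newUtterance: String): JSONArray {
    val messages = JSONArray()
    messages.put(JSONObject().put("role", "system").put("content", systemMessage))
    (userHistory + newUtterance).takeLast(4).forEach { utterance ->
        messages.put(JSONObject().put("role", "user").put("content", utterance))
    }
    return messages
}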
Speech recognition
Speech recognition plays a pivotal role in the process of converting spoken words into actionable navigation commands within the application. While various elements contribute to this transformation, poor speech recognition can often be the most vulnerable point. Although mobile phones offer onboard speech recognition with acceptable accuracy, speech recognition systems based on LLM technology such as Whisper, Google Chirp/USM, Meta MMS, or DeepGram generally yield superior outcomes, particularly when tailored to suit your specific usage scenario.
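As an example, transcription with OpenAI's Whisper API boils down to a single multipart request. A minimal Kotlin sketch using OkHttp (the function name is illustrative):

import java.io.File
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.MultipartBody
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.asRequestBody
import org.json.JSONObject

// Minimal sketch: send a recorded m4a file to Whisper and return the transcript.
fun transcribe(audio: File, apiKey: String): String {
    val body = MultipartBody.Builder()
        .setType(MultipartBody.FORM)
        .addFormDataPart("model", "whisper-1")
        .addFormDataPart("file", audio.name, audio.asRequestBody("audio/m4a".toMediaType()))
        .build()
    val request = Request.Builder()
        .url("https://api.openai.com/v1/audio/transcriptions")
        .header("Authorization", "Bearer $apiKey")
        .post(body)
        .build()
    OkHttpClient().newCall(request).execute().use { response ->
        return JSONObject(response.body!!.string()).getString("text")
    }
}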
Architecture
Storing the function definitions is likely most advisable on the server, although they can alternatively be managed by the application and transmitted with each request. Each approach presents its own advantages and drawbacks. Including them in every request offers greater adaptability, potentially simplifying the alignment of functions and screens. Nevertheless, the function templates encompass not only the function names and parameters but also their accompanying descriptions, which might require more frequent updates than what the app store update process allows. These descriptions are somewhat reliant on the capabilities of the language model (LLM) and are meticulously tailored to achieve desired outcomes. Considering the scenario where you might opt to replace the LLM with a more advanced or cost-effective alternative, or even switch dynamically, having the function templates hosted on the server could also prove beneficial. This approach centralizes their management, particularly if your app is native on both iOS and Android platforms. In instances where OpenAI services are utilized for both speech recognition and natural language processing, the overarching technical framework of the process appears as follows:
Users verbalize their inquiries, which are then captured and stored in an m4a buffer/file (or optionally in mp3 format). This audio data is subsequently transmitted to your server, which forwards it to the Whisper speech recognition system. Whisper returns the transcribed text, which your server integrates with your system message and function templates to compose a prompt for the language model. The server then acquires the unprocessed JSON representation of the function call, which it further refines into a structured function call JSON object suitable for your application’s usage.
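Put together, the server-side flow described above amounts to something like the following sketch, reusing the hypothetical transcribe and interpretUtterance helpers sketched earlier.

import java.io.File
import org.json.JSONObject

// High-level sketch of the server flow: audio in, structured function call out.
fun handleVoiceCommand(audio: File, apiKey: String): JSONObject {
    val transcript = transcribe(audio, apiKey)            // Whisper speech recognition
    val message = interpretUtterance(transcript, apiKey)  // GPT-4 with function templates
    return if (message.has("function_call")) {
        message.getJSONObject("function_call")            // forwarded to the app
    } else {
        // No matching function: pass the natural language reply back instead.
        JSONObject().put("clarification", message.optString("content"))
    }
}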
From function call to deep link
To illustrate how a function call translates into a deep link we take the function call response from the initial example:
"function_call": {
  "name": "outings",
  "arguments": "{\n  \"area\": \"Amsterdam\"\n}"
}
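A small sketch of how such a response could be turned into a deep link string like 'outings/?area=Amsterdam'; the helper is hypothetical, and the resulting path is then handed to the platform's navigation mechanism as shown below.

import android.net.Uri
import org.json.JSONObject

// Turn the function_call above into a deep link path such as "outings/?area=Amsterdam".
// The key-to-query mapping is an app-specific example.
fun functionCallToDeepLink(functionCall: JSONObject): String {
    val name = functionCall.getString("name")
    val arguments = JSONObject(functionCall.getString("arguments"))
    val query = arguments.keys().asSequence()
        .joinToString("&") { key -> "$key=${Uri.encode(arguments.getString(key))}" }
    return if (query.isEmpty()) name else "$name/?$query"
}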
Various platforms manage this process in distinct ways, and throughout the years, a multitude of navigation methods have been employed, many of which remain in active use. The specifics of implementing these approaches are not covered within the scope of this article. However, to provide a broad overview, the latest versions of these platforms can employ deep linking in the following manner:
On Android:
navController.navigate("outings/?area=Amsterdam")
On Flutter:
Navigator.pushNamed(
  context,
  '/outings',
  arguments: ScreenArguments(
    area: 'Amsterdam',
  ),
);
On iOS things are a little less standardized, but using NavigationStack:
NavigationStack(path: $router.path) { … }
And then issuing:
router.path.append("outings?area=Amsterdam")
More on deep linking can be found in the platform documentation: for Android, for Flutter, and for iOS.
Free text field for apps
There exist two modes for entering free-form text: using voice or typing. While our discussion has primarily focused on speech input, the availability of a text field for manual typing is also a viable choice. Natural language input is typically more extensive, which might make it challenging to rival graphical user interface (GUI) interactions in terms of efficiency. Nevertheless, GPT-4 demonstrates an aptitude for accurately deducing parameters from abbreviations, enabling accurate interpretation of even concise and abbreviated typed inputs.
Incorporating functions with parameters within the prompt significantly constrains the context of interpretation for an LLM. As a result, it requires minimal input, and even less so if directed to assume certain information. This emerging trend holds potential for enhancing mobile interactions. For the scenario of planning train trips between stations, the LLM generated the subsequent interpretations using the provided prompt structure exemplified in this article. You can explore these outcomes firsthand by employing the prompt gist referenced earlier.
Examples:
‘ams utr’: show me a list of train itineraries from Amsterdam central station to Utrecht central station departing from now
'utr ams arr 9' (given that it is 13:00 at the moment): show me a list of train itineraries from Utrecht Central Station to Amsterdam Central Station arriving before 21:00
Follow up interaction
Just like in ChatGPT you can refine your query if you send a short piece of the interaction history along:
Using the history feature the following also works very well (presume it is 9:00 in the morning now):
Type: ‘ams utr’ and get the answer as above. Then type ‘arr 7’ in the next turn. And yes, it can actually translate that into a trip being planned from Amsterdam Central to Utrecht Central arriving before 19:00.
I made an example web app about this; you can find a video about it here, and the link to the actual app is in the video description.
Future
You can anticipate that this structural framework for managing functions within your app will evolve into an integral aspect of your smartphone’s operating system, whether it’s Android or iOS. A universal assistant embedded in the phone will manage speech-based requests, while applications can expose their functions to the operating system for activation through a deep linking mechanism. This concept draws a parallel to how plugins are made accessible for ChatGPT. Currently, a preliminary version of this is already accessible through intents within the AndroidManifest and App Actions on Android, as well as SiriKit intents on iOS. However, the degree of control you possess over these functionalities is restricted, and users often need to employ a somewhat robotic style of speech to reliably activate them. It’s beyond doubt that this situation will progressively enhance as language model-powered assistants take center stage.
Virtual Reality (VR) and Augmented Reality (AR), often referred to as Extended Reality (XR), present promising prospects for speech recognition due to the frequent engagement of users’ hands in various tasks.
It's likely that in the near future, running your own high-quality Large Language Model (LLM) will become commonplace, accompanied by rapidly falling costs and increasing speed over the coming year. Before long, smaller LLMs fine-tuned with techniques such as LoRA (Low-Rank Adaptation) will run on smartphones, enabling inference directly on the device and bringing further cost and speed benefits. Additionally, the competitive landscape is set to expand, with the emergence of various alternatives, including open-source models like Llama 2 as well as closed-source options like PaLM.
Ultimately, the potential of combining modalities goes beyond merely giving speech access to your entire app's GUI. The true potential lies in the ability of LLMs to combine information from diverse sources, allowing a better level of assistance to emerge. Several noteworthy articles delve into this topic, including discussions on multimodal dialog, a Google blog on the interplay between GUIs and LLMs, the interpretation of GUI interaction as language, and the emergence of LLM-powered assistants.
Conclusion
This article has equipped you with the knowledge of implementing function calling to enable speech capabilities within your application. By taking the provided Gist as a starting point, you have the opportunity to experiment using Postman or command-line interfaces to grasp the substantial potential of function calling. Should you intend to conduct a proof of concept (POC) for incorporating speech functionality into your app, my suggestion would be to directly integrate the server component from the architecture section into your application. The process essentially involves two HTTP calls, constructing prompts, and integrating microphone recording. Depending on your proficiency and the structure of your codebase, you can expect to have your POC operational within a few days.