Introduction
The launch of OpenAI’s GPT-4V(ision) model in November 2023 marked a significant shift in the development of OS Agents. By OS Agents (also called UI-focused agents) we mean generative AI agents capable of navigating and using the applications and functions available on a host device, or on any other device they can interact with and control. Interaction can take the form of plain text from the user, such as an end goal, or of voice input (e.g., transcribed through the Whisper API).
Before then, efforts to build UI automations relied mainly on purely textual agents that grounded the LLM in the Document Object Model (DOM) of the app and page under test, such as its XML or HTML representation. This approach often proved ineffective: such representations are noisy and voluminous, obscuring the true semantic meaning of the page from the user’s perspective. In contrast, GPT-4V and other vision models offer a more effective approach by leveraging the rich semantic information embedded in the visual appearance of interactable objects: an object’s look and its spatial relationships on the page help the model discern its role and predict the kind of actions applicable to it (e.g., whether it is a search box, a button, or a draggable element).
UI-focused agents typically perform an action on a target app in two steps per turn, sketched in code right after this list:
First, predicting the next probable action given the current page state. The primary methods of control involve UI automation, including actions like tapping, typing, and scrolling within the user interface of the target system.
Then, generating automation code by relying on accessibility IDs and XPath selectors of the underlying page DOM.
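As a rough illustration, here is what that per-turn loop looks like in TypeScript. Every name in this sketch is a hypothetical placeholder for clarity; it is not NavAIGuide’s actual API.

type NLAction = {
  type: "tap" | "type" | "scroll";
  targetDescription: string; // natural-language description of the target element
  text?: string;             // text to type, when applicable
};

// Hypothetical sketch of a single turn: a vision model proposes the action,
// a text model grounds it against the DOM, and a driver executes it.
async function runTurn(
  screenshot: Buffer,
  pageDom: string,
  predictNextAction: (screenshot: Buffer) => Promise<NLAction>,            // step 1: visual prediction
  generateSelectors: (action: NLAction, dom: string) => Promise<string[]>, // step 2: selector generation
  execute: (selector: string, action: NLAction) => Promise<void>           // UI automation driver
): Promise<void> {
  const action = await predictNextAction(screenshot);
  const selectors = await generateSelectors(action, pageDom);
  await execute(selectors[0], action); // e.g. tap the element matched by an XPath selector
}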
Many research papers published since the release of GPT-4V demonstrate the effectiveness of a purely visual approach for the first prediction step, relying on textual grounding only for the second, where applicable. Some useful papers that discuss this approach in depth include:
Below are some results on Windows, in terms of the task completion rate across common apps, using such vision + text agents. Credits to UFO for the benchmark.
Regardless of the specific focus of these studies, the methodology they describe can be applied across application domains and operating systems. This is where NavAIGuide comes into play.
The project’s objective is to offer a TypeScript-based, extensible, multi-modal, UI-focused framework crafted to execute plans and address user queries effectively. Designed to be cross-OS, it supports mobile (iOS, Android), web, and desktop environments. It also supports a range of vision and textual models, such as GPT-4V, Claude, LLaVA, and Open-Interpreter, to power its prediction and code-generation steps.
As of the time of writing this article, only an iOS implementation of the framework is available. To my knowledge, this is the first UI-focused agent on iOS 🥳. Contributions to backend implementations for Android and macOS are more than welcome!
The framework is broken down into different packages:
- navaiguide/core: Exposes the core, OS-agnostic set of agents for planning, predicting the next action, and generating code selectors for each action.
- navaiguide/ios: Exposes the iOS implementation of the NavAIGuideBaseAgent, along with the glue code for Xcode, WDA, and Appium that makes UI automation possible on a real iOS device.
navaiguide/core
Let’s start by delving deeper into navaiguide/core. The NavAIGuide class exposes three agents worth mentioning:
- startTaskPlanner_Agent: A simple planner that inspects the ecosystem of apps available on your device and formulates a cross-app plan to fulfill the request.
- predictNextNLAction_Visual_Agent: This visual agent analyzes the current page screenshot to predict the next probable action: the action type (tap, type, scroll), a visual description of the target element, and its position context (e.g., bounding box or coordinates, if supported by the model). The agent also acts as a feedback loop, comparing the current page state with the previously attempted action by analyzing both the previous and current page screenshots. Since GPT-4V currently cannot reliably tell apart multiple images supplied in the same request, this is achieved by drawing watermarks on the captured screenshots, as shown in the image below.
An example of the type of output that can be expected from this agent is as follows:
{
  "previousActionSuccess": false,
  "previousActionSuccessExplanation": "The previous action was not successful as the keyboard is not visible in the second screenshot.",
  "endGoalMet": false,
  "endGoalMetExplanation": "The goal of finding a coffee shop nearby hasn't been met yet.",
  "actionType": "tap",
  "actionTarget": "Search bar",
  "actionDescription": "Retry tapping the search bar with corrected coordinates.",
  "actionExpectedOutcome": "The keyboard becomes visible.",
  "actionTargetVisualDescription": "A white search bar at the top of the page with a magnifying glass icon and a placeholder text 'Search for a place or address'",
  "actionTargetPositionContext": "The search bar is located at the top of the page, just below the app's title and the search bar's placeholder text."
}
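For reference, the shape of this output can be captured in a TypeScript interface like the one below. It is derived solely from the example above; the actual type exported by navaiguide/core may differ in naming and optionality.

// Field names mirror the example output above; the type shipped by
// @navaiguide/core itself may differ.
interface NLActionPrediction {
  previousActionSuccess: boolean;
  previousActionSuccessExplanation: string;
  endGoalMet: boolean;
  endGoalMetExplanation: string;
  actionType: "tap" | "type" | "scroll";
  actionTarget: string;
  actionDescription: string;
  actionExpectedOutcome: string;
  actionTargetVisualDescription: string;
  actionTargetPositionContext: string;
}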
- generateCodeSelectorsWithRetry_Agent: This textual agent processes the natural-language (NL) action provided by the previous agent, along with the Document Object Model (DOM) representation of the page (for iOS, an XCUITest XML DOM), to generate code selectors. In some scenarios multiple selectors are generated; each is returned with an accompanying confidence score and is subject to a retry mechanism for improved accuracy. Additionally, if the DOM is too large to fit into a single request to the textual model, it is automatically split into smaller chunks.
An example of the type of output that can be expected from the generateCodeSelectorsWithRetry_Agent:
{
  "selectorsByRelevance": [
    {
      "selector": "//XCUIElementTypeButton[@name=\"Home\"]",
      "relevanceScore": 10
    },
    {
      "selector": "//XCUIElementTypeButton[@name=\"Hom\"]",
      "relevanceScore": 8
    },
    {
      "selector": "//XCUIElementTypeButton[@name=\"Search\"]",
      "relevanceScore": 1
    }
  ]
}
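One way to consume such a result, sketched here under the assumption of an existing WebdriverIO session and not taken from the framework’s internals, is to try the selectors in descending relevance order and fall back to the next candidate whenever an element cannot be found on the page:

import type { Browser } from "webdriverio";

interface CodeSelector {
  selector: string;        // XPath such as //XCUIElementTypeButton[@name="Home"]
  relevanceScore: number;  // higher means more likely to match the intended element
}

// Hypothetical consumer of selectorsByRelevance: tap the first selector that
// actually resolves to an element, starting from the highest relevance score.
async function tapFirstMatching(driver: Browser, selectorsByRelevance: CodeSelector[]): Promise<boolean> {
  const sorted = [...selectorsByRelevance].sort((a, b) => b.relevanceScore - a.relevanceScore);
  for (const { selector } of sorted) {
    const element = driver.$(selector);
    if (await element.isExisting()) {
      await element.click();
      return true;
    }
  }
  return false; // no selector matched; the caller can retry with a fresh prediction
}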
navaiguide/ios
navaiguide/ios builds upon the core component to enable building AI agents that can command real iOS devices. Here’s how:
Some pre-requisites before running navaiguide/ios:
Follow the core pre-requisites.
macOS with Xcode 15.
iOS device (simulators not supported).
Apple Developer Free Account.
Go Build Tools (currently required as a dependency for go-ios).
Appium Server with XCUITest Driver.
Steps:
1. ⚡Install NavAIGuide-iOS
You can either clone the repository or install NavAIGuide with npm, yarn, or pnpm.
git clone https://github.com/francedot/NavAIGuide-TS
npm:
npm install @navaiguide/ios
Yarn:
yarn add @navaiguide/ios
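pnpm (mentioned above; the equivalent command should be):
pnpm add @navaiguide/ios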
2. Go-iOS Setup
Go-iOS is required for NavAIGuide to list apps and start a pre-installed WDA Runner on the target device. If your device runs iOS 17, support for waking up the WDA Runner is still experimental and not yet available in the published go-ios npm packages. Therefore, you need to build an executable yourself from the latest ios-17 branch, which requires the Go build tools.
# Install Go build tools on macOS
brew install go
Once installed, you can run this utility script to build go-ios. This will copy the go-ios executable to the ./packages/ios/bin directory, which is necessary for the next steps.
# If you cloned the repository:
cd packages/ios
npm run build-go-ios
# If installed through the npm package:
npm explore @navaiguide/ios -- npm run build-go-ios
3. Appium
Install the Appium server globally:
npm install -g appium
Launching Appium from the terminal should result in output similar to the following:
Install and run appium-doctor to diagnose and fix any iOS configuration issues:
npm install -g appium-doctor
appium-doctor --ios
Install the Appium XCUITest Driver:
appium driver install xcuitest
This step will also clone the Appium WebDriverAgent (WDA) Xcode project, required in the next step. Check that the Xcode project exists at ~/.appium/node_modules/appium-xcuitest-driver/node_modules/appium-webdriveragent.
4. Enable Developer Settings & UI Automation
If you haven’t already, enable Developer Mode on your target device. If required, reboot your phone.
Next, enable UI Automation from Settings/Developer. This will allow the WDA Runner to control your device and execute XCUITests.
5. WDA Building and Signing
The next step is to build and sign the Appium WDA project through Xcode.
cd ~/.appium/node_modules/appium-xcuitest-driver/node_modules/appium-webdriveragent
open 'WebDriverAgent.xcodeproj'
Select WebDriverAgentRunner from the target section.
Click on Signing & Capabilities.
Check the Automatically manage signing checkbox.
Choose your Team from the Team dropdown.
In Bundle Identifier, replace the value with a bundle identifier of your choice, for example: com.<YOUR_NAME>.wda.runner.
Building the WDA project from Xcode (and macOS) is optional if you already have a pre-built IPA, but it must be re-signed with your Apple Developer account’s certificate. For instructions on how to do this, see Daniel Paulus’ wda-signer.
Next, to deploy and run the WDA Runner on the target real device:
xcodebuild build-for-testing test-without-building -project WebDriverAgent.xcodeproj -scheme WebDriverAgentRunner -destination 'id=<YOUR_DEVICE_UDID>'
You can find your connected device UDID with go-ios.
./go-ios list
The xcodebuild step is only required once, as we will later use go-ios to wake up a previously installed WDA Runner on your device.
If the xcodebuild is successful, you should see a WDA Runner app installed, and your device will enter the UI Automation mode (indicated by a watermark on your screen).
6. Wake Up WDA Runner & Run Appium Server
Next, let’s exit UI Automation mode by holding the Volume Up and Volume Down buttons simultaneously, and make sure we can use go-ios to wake up the WDA Runner we just installed.
# If you cloned the repository:
npm run run-wda -- --WDA_BUNDLE_ID=com.example.wdabundleid --WDA_TEST_RUNNER_BUNDLE_ID=com.example.wdabundleid --DEVICE_UDID=12345
# If installed through the npm package:
npm explore @navaiguide/ios -- npm run run-wda -- --WDA_BUNDLE_ID=com.example.wdabundleid --WDA_TEST_RUNNER_BUNDLE_ID=com.example.wdabundleid --DEVICE_UDID=12345
If successful, you should see the device entering UI Automation mode again. What’s changed? Because the WDA Runner was started through go-ios this time rather than xcodebuild, the same setup can, in principle, be driven from Linux and, soon, Windows (see the latest go-ios release).
As we are not running XCUITests directly but through Appium, we will need to run the Appium Server next, which will listen for any WebDriverIO commands and translate them into XCUITests for the WDA Runner to execute.
# Provided that you've installed Appium as a global npm package
appium
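To make that division of labor concrete, here is a minimal, illustrative WebdriverIO session of the kind the Appium server brokers. It is not how navaiguide/ios is implemented internally; the capability values simply mirror the configuration used in step 7.

import { remote } from "webdriverio";

// Illustrative only: WebdriverIO sends W3C WebDriver commands to Appium,
// which translates them into XCUITest actions executed by the WDA Runner.
const driver = await remote({
  hostname: "127.0.0.1",
  port: 4723,
  capabilities: {
    platformName: "iOS",
    "appium:automationName": "XCUITest",
    "appium:platformVersion": "17.3.0",
    "appium:udid": "<DEVICE_UDID>"
  }
});

// Any command issued here ends up running as an XCUITest on the device.
await driver.$('//XCUIElementTypeButton[@name="Home"]').click();
await driver.deleteSession();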
7. Run an iOS AI Agent
With the Appium Server and WDA running, we can finally run our first AI-powered iOS agent. Let’s see how:
import { iOSAgent } from "@navaiguide/ios";

const iosAgent = new iOSAgent({
  // openAIApiKey: "YOUR_OPEN_AI_API_KEY", // Optional if set through process.env.OPEN_AI_API_KEY
  appiumBaseUrl: "http://127.0.0.1",
  appiumPort: 4723,
  iOSVersion: "17.3.0",
  deviceUdid: "<DEVICE_UDID>"
});

const fitnessPlannerQuery = "Help me run a 30-day fitness challenge.";

await iosAgent.runAsync({
  query: fitnessPlannerQuery
});
This is all that’s needed to build UI-focused agents for iOS.
I encourage you to experiment with building your own and share your experiences and insights. Your feedback and contributions are invaluable as we strive to push the boundaries of what’s possible with vision models.
Stay tuned for more updates, and in the meantime, happy hacking! 🤖
Curious about a cool OSS application of this tech? Check out OwlAIProject/Owl: A personal wearable AI that runs locally.
Appendix — Pills of iOS UI Automation
Following are some technical definitions covering the glue components that make running UI-focused agents possible on iOS:
Go-iOS: A set of tools written in Go that allows you to control iOS devices on Linux and Windows. It notably includes the capability to start and kill apps and run UI tests on iOS devices. It uses a reverse-engineered version of the DTX Message Framework. Kudos to Daniel Paulus and the whole OSS community at go-ios for recently adding support for iOS 17.
WebDriverAgent (WDA): Automating tasks on an iOS device requires installing WebDriverAgent (WDA). WDA, initially a Facebook project and now maintained by Appium, acts as the core of virtually all iOS automation tools and services. iOS’s strict security model prevents direct input simulation or screenshot capture via public APIs or shell commands; WebDriverAgent circumvents these limitations by launching an HTTP server on the device that exposes XCUITest framework functions as REST calls (a minimal probe of this HTTP interface follows this list).
UI Automation Mode: This is the state triggered on the target device when running UI Automation XCUITests. It must be enabled through the Developer Settings before installing the WDA Runner App.
WebdriverIO: An open-source testing utility for Node.js that enables developers to automate testing for web applications. In the context of UI automation for iOS applications, WebdriverIO can be used alongside Appium, a mobile application automation framework. Appium acts as a bridge between WebdriverIO tests and the iOS platform, allowing tests written in WebdriverIO to interact with iOS applications as if a real user were using them. This integration supports the automation of both native and hybrid iOS apps.
Appium: An open-source, cross-platform test automation tool used for automating native, mobile web, and hybrid applications on iOS and Android platforms. Appium uses the WebDriver protocol to interact with iOS and Android applications. For iOS, it primarily relies on Apple’s XCUITest framework (for iOS 9.3 and above), and for older versions, it used the UIAutomation framework. XCUITest, part of XCTest, is Apple’s official UI testing framework, which Appium leverages to perform actions on the iOS UI.
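As referenced in the WebDriverAgent entry above, here is a minimal probe of that HTTP interface. It assumes WDA’s port (8100 by default) has been made reachable from the host, e.g. via port forwarding, and Node 18+ for the built-in fetch.

// Minimal probe of the WDA HTTP server: the /status endpoint reports whether
// the agent is ready to accept automation (XCUITest) commands. Assumes port
// 8100 is forwarded from the device to the host.
const response = await fetch("http://127.0.0.1:8100/status");
const status = await response.json();
console.log("WDA ready:", status.value?.ready);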