Give AI
hands.

One protocol to see, decide, and act on any interface. macOS apps, web pages, iOS simulators — all through snapshot → think → act.

macOS
Web
iOS
Android
agent-control — observe → act → observe
# Install
$ npm install -g agent-control
$ agent-control doctor
✅ Node.js >= 18   ✅ Playwright   ✅ Chromium   All checks passed.

# See the screen
$ agent-control -p web snapshot
12 interactive elements
@e8 text "Name"   @e10 email "Email"   @e18 submit "Create Account"

# Act
$ agent-control -p web fill @e8 "Alice"
✓ { ok: true }

The Loop

No pre-scripted steps. The AI observes the current state, decides what to do, acts, then observes again. Like a human would.

👁
Observe
Screenshot + element tree with @ref identifiers
→
🧠
Decide
LLM sees the UI, picks the next action
→
🤚
Act
Click, type, scroll — through the unified protocol
→
🔄
Repeat
Until the goal is reached
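The loop above can be sketched in a few lines of Python. This is purely illustrative: `run_agent`, `snapshot`, `decide`, and `act` are hypothetical stand-ins for the real observe/decide/act machinery, not part of the agent-control API.

```python
# Illustrative sketch of the observe -> decide -> act loop.
# snapshot(), decide(), and act() are hypothetical stand-ins,
# not the agent-control API.

def run_agent(goal, snapshot, decide, act, max_steps=20):
    """Loop until decide() signals the goal is reached."""
    for _ in range(max_steps):
        state = snapshot()             # observe: elements with @ref ids
        action = decide(goal, state)   # the LLM picks the next action
        if action is None:             # goal reached
            return True
        act(action)                    # click, fill, scroll, ...
    return False                       # gave up after max_steps

# Toy drivers: a one-field form that must be filled.
form = {"@e8": ""}
def snapshot(): return dict(form)
def decide(goal, state):
    return ("fill", "@e8", "Alice") if state["@e8"] == "" else None
def act(a): form[a[1]] = a[2]
```

No step is pre-scripted: `decide` looks at the freshest snapshot each iteration, so the loop adapts if the UI changes underneath it.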

Two Ways to Use

Let the AI figure it out, or define every step. Pick the right tool for the job.

Auto Mode

Give a goal, AI decides how

Natural-language objective → the LLM loops snapshot → think → act until done. Best for exploratory tasks, testing new apps, and one-off automation.

$ agent-control auto -p web \
  --goal "Sign up with name Alice and email alice@test.com" \
  --url https://example.com/signup

# AI observes the form, fills fields, clicks submit
# No scripting needed
Flow DSL

Define steps, run deterministically

JSON-declared action sequences with verify/retry. Best for regression tests, CI pipelines, repeatable workflows.

// signup-flow.json
{ "platform": "web",
  "steps": [
    { "action": "fill", "find": ["Name"], "value": "Alice" },
    { "action": "click", "find": ["Create Account"] },
    { "action": "verify", "contains": "Welcome" }
  ] }
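A minimal runner for such a flow might look like the sketch below. The real Flow DSL engine and its retry semantics may differ; `FakeDriver` and the `driver` interface here are invented for illustration.

```python
import json

def run_flow(flow_json, driver, retries=2):
    """Execute Flow DSL steps in order; retry a failed step before giving up."""
    flow = json.loads(flow_json)
    for step in flow["steps"]:
        for attempt in range(retries + 1):
            try:
                if step["action"] == "fill":
                    driver.fill(step["find"][0], step["value"])
                elif step["action"] == "click":
                    driver.click(step["find"][0])
                elif step["action"] == "verify":
                    assert step["contains"] in driver.page_text()
                break                      # step succeeded
            except Exception:
                if attempt == retries:
                    return False           # step failed after all retries
    return True

class FakeDriver:
    """Stand-in driver so the sketch is self-contained."""
    def __init__(self): self.text = ""
    def fill(self, label, value): self.text += f"{label}={value} "
    def click(self, label): self.text += "Welcome"   # pretend submit worked
    def page_text(self): return self.text
```

The key property is determinism: the same JSON always produces the same action sequence, which is what makes the DSL suitable for regression tests and CI.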

Four Drivers, One Protocol

macOS

Accessibility API

Native Swift CLI that reads the AX tree and acts via AXPress, CGEvent, or coordinate fallback. Drives native macOS apps, Electron apps, and menu bar apps alike. Use --app to target by name.

Swift · ApplicationServices · CGEvent
Web

Playwright

Headless Chromium with chain commands. Open a URL, snapshot the DOM, fill forms, click buttons — all in one pipeline.

Node.js · Playwright · Chromium
iOS

idb

Uses Facebook's idb to describe UI elements and tap by coordinates. Auto-detects booted Simulator. Future: real device support via USB.

idb · xcrun simctl · Simulator
Android Experimental

adb + uiautomator

Dumps the UI hierarchy via uiautomator, taps by coordinates, swipes, launches apps. Works with emulators and physical devices over adb.

adb · uiautomator · Android SDK

Unified Actions

Action        Syntax                           macOS   Web   iOS
observe       observe [--ss] [--tree]            ✓      ✓     ✓
click         click @ref                         ✓      ✓     ✓
longpress     longpress @ref [--duration=ms]     ✓      ✓     ✓
dblclick      dblclick @ref                      ✓      ✓     —
rightclick    rightclick @ref                    ✓      ✓     —
fill          fill @ref "text"                   ✓      ✓     ✓
press         press <key>                        ✓      ✓     —
scroll        scroll <up|down>                   ✓      ✓     ✓
drag          drag @ref1 @ref2                   ✓      ✓     —
screenshot    screenshot [path]                  ✓      ✓     ✓
open          open <url>                         —      ✓     —
wait          wait --idle | @ref | --url         —      ✓     —
find          find <text>                        —      ✓     —
eval          eval <js>                          —      ✓     —
back/forward  back | forward                     —      ✓     —
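One way to picture "four drivers, one protocol" is a single action vocabulary dispatched to per-platform backends. The sketch below is a conceptual illustration only — class names, return shapes, and the registry are invented here, not agent-control's actual implementation.

```python
# Conceptual sketch: one protocol, many drivers (names are illustrative).

class Driver:
    """Common interface every platform backend implements."""
    def click(self, ref): raise NotImplementedError
    def fill(self, ref, text): raise NotImplementedError

class WebDriver(Driver):       # would wrap Playwright
    def click(self, ref): return {"ok": True, "via": "playwright"}
    def fill(self, ref, text): return {"ok": True, "via": "playwright"}

class MacDriver(Driver):       # would wrap the Swift AX CLI
    def click(self, ref): return {"ok": True, "via": "ax"}
    def fill(self, ref, text): return {"ok": True, "via": "ax"}

DRIVERS = {"web": WebDriver, "macos": MacDriver}

def dispatch(platform, action, *args):
    """Route a unified action (click, fill, ...) to the platform's driver."""
    driver = DRIVERS[platform]()
    return getattr(driver, action)(*args)
```

Because callers only ever speak the unified vocabulary, adding a platform means adding one driver class, not touching every call site.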

For AI Agents

If you're an AI agent reading this page, here's what you need to know:

Installation:

git clone https://github.com/momomo-agent/agent-control
cd agent-control && npm install && npm link

Basic workflow:

  1. agent-control -p <platform> snapshot — See what's on screen
  2. Analyze the output, find target element (e.g. @e3)
  3. agent-control -p <platform> click @e3 — Interact with it
  4. agent-control -p <platform> screenshot /tmp/verify.png — Verify result
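An agent scripting that workflow might wrap the CLI like the sketch below. `run` is an injected stand-in for actually shelling out to agent-control (e.g. via subprocess); the function names and return format are assumptions for illustration, not a published API.

```python
# Sketch of the snapshot -> act -> verify workflow.
# `run(cmd)` is a hypothetical stand-in for invoking the agent-control
# CLI and returning its stdout; a fake can be injected for testing.

def do_task(run, platform, ref):
    """Snapshot the screen, click `ref` if visible, then verify."""
    tree = run(f"agent-control -p {platform} snapshot")
    if ref not in tree:
        return False    # target element not on screen
    run(f"agent-control -p {platform} click {ref}")
    run(f"agent-control -p {platform} screenshot /tmp/verify.png")
    return True
```

Checking the snapshot before acting mirrors step 2 above: never click a ref you haven't just observed, since refs can go stale between snapshots.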

Platforms:

  • -p web β€” Control web pages (Playwright)
  • -p macos --app <name> β€” Control macOS apps (Accessibility API)
  • -p ios β€” Control iOS Simulator (idb)
  • -p android β€” Control Android emulator/device (adb)

Full documentation: GitHub README