One protocol to see, decide, and act on any interface. macOS apps, web pages, iOS simulators, Android devices: all through snapshot → think → act.
No pre-scripted steps. The AI observes the current state, decides what to do, acts, then observes again. Like a human would.
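That loop is easy to picture in shell. The sketch below is illustrative only: `observe`, `decide`, and `act` are stub functions standing in for a real UI snapshot, an LLM call, and a real action; none of them are actual agent-control subcommands.

```shell
# Illustrative snapshot -> think -> act loop. All three functions are stubs:
# a real agent would call agent-control to observe/act and an LLM to decide.
observe() { echo "button @e1 'Create Account'"; }   # stub: current UI state
decide()  { echo "click @e1"; }                     # stub: LLM picks the next action
act()     { echo "acting: $1"; }                    # stub: perform the action

state=$(observe)             # see
action=$(decide "$state")    # think
act "$action"                # act; a real agent would then observe again
```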
Let the AI figure it out, or define every step. Pick the right tool for the job.
Natural language objective → LLM loops snapshot→think→act until done. Best for exploratory tasks, testing new apps, one-off automation.
```shell
$ agent-control auto -p web \
    --goal "Sign up with name Alice and email alice@test.com" \
    --url https://example.com/signup
# AI observes the form, fills fields, clicks submit
# No scripting needed
```
JSON-declared action sequences with verify/retry. Best for regression tests, CI pipelines, repeatable workflows.
```json
// signup-flow.json
{
  "platform": "web",
  "steps": [
    { "action": "fill", "find": ["Name"], "value": "Alice" },
    { "action": "click", "find": ["Create Account"] },
    { "action": "verify", "contains": "Welcome" }
  ]
}
```
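The verify/retry part of scripted mode amounts to a bounded re-check loop. This is a sketch of the idea, not the CLI's actual implementation; `verify` here is a stub that passes on the third check.

```shell
# Bounded verify/retry loop: re-check until the condition passes or the retry
# budget runs out. A real runner would re-run the step and re-read the UI here.
tries=0
verify() { [ "$tries" -ge 2 ]; }      # stub: passes on the third check
until verify; do
  tries=$((tries + 1))
  if [ "$tries" -ge 5 ]; then break; fi   # retry budget
done
echo "verified after $tries retries"
```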
Native Swift CLI that reads the AX tree and acts via AXPress, CGEvent, or coordinate fallback. Operates any macOS app, including Electron and menubar apps. Use --app to target by name.
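Typical macOS invocations look like this. To keep the sketch runnable without the tool installed, a stub shell function echoes each command line; the app name and element ref are made up for illustration.

```shell
# Stub: echo the command line instead of invoking the real binary.
agent-control() { echo "agent-control $*"; }

agent-control -p macos --app "Notes" observe --tree   # read the AX tree
agent-control -p macos --app "Notes" click @e4        # press element @e4
```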
Headless Chromium with chain commands. Open a URL, snapshot the DOM, fill forms, click buttons, all in one pipeline.
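A typical web session strings the cheat-sheet commands together. The exact chaining syntax isn't shown here, so this sketch runs them one per invocation, with a stub function echoing each command line and made-up element refs.

```shell
agent-control() { echo "agent-control $*"; }    # stub: echo instead of running

agent-control -p web open https://example.com/signup
agent-control -p web fill @e2 "Alice"           # refs come from a prior observe
agent-control -p web click @e5
agent-control -p web screenshot /tmp/done.png   # capture the result
```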
Uses Facebook's idb to describe UI elements and tap by coordinates. Auto-detects booted Simulator. Future: real device support via USB.
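iOS commands follow the same shape. Stub function again; the element ref and duration are illustrative.

```shell
agent-control() { echo "agent-control $*"; }        # stub: echo the command line
agent-control -p ios observe --ss                   # observe with the optional --ss flag
agent-control -p ios longpress @e7 --duration=800   # tap-and-hold via idb coordinates
```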
Dumps the UI hierarchy via uiautomator, taps by coordinates, swipes, launches apps. Works with emulators and physical devices over adb.
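Android usage mirrors the other platforms. Stub function once more; the element ref is illustrative.

```shell
agent-control() { echo "agent-control $*"; }    # stub: echo the command line
agent-control -p android observe --tree         # uiautomator hierarchy dump
agent-control -p android scroll down            # swipe-based scroll
agent-control -p android click @e3              # tap element @e3 by coordinates
```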
| Action | Syntax | macOS | Web | iOS |
|---|---|---|---|---|
| observe | observe [--ss] [--tree] | ✓ | ✓ | ✓ |
| click | click @ref | ✓ | ✓ | ✓ |
| longpress | longpress @ref [--duration=ms] | ✓ | ✓ | ✓ |
| dblclick | dblclick @ref | ✓ | ✓ | ✗ |
| rightclick | rightclick @ref | ✓ | ✓ | ✗ |
| fill | fill @ref "text" | ✓ | ✓ | ✓ |
| press | press <key> | ✓ | ✓ | ✗ |
| scroll | scroll <up\|down> | ✓ | ✓ | ✓ |
| drag | drag @ref1 @ref2 | ✓ | ✓ | ✗ |
| screenshot | screenshot [path] | ✓ | ✓ | ✓ |
| open | open <url> | ✗ | ✓ | ✗ |
| wait | wait --idle \| @ref \| --url | ✓ | ✓ | ✓ |
| find | find <text> | ✓ | ✓ | ✓ |
| eval | eval <js> | ✗ | ✓ | ✗ |
| back/forward | back \| forward | ✗ | ✓ | ✗ |
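Putting a few table rows together for the web driver. As above, a stub function echoes each command line so the sketch runs anywhere, and the text and JS are illustrative.

```shell
agent-control() { echo "agent-control $*"; }    # stub: echo instead of running
agent-control -p web wait --idle                # wait for the page to settle
agent-control -p web find "Welcome"             # locate text on screen
agent-control -p web eval "document.title"      # run JavaScript (web only)
```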
If you're an AI agent reading this page, here's what you need to know:
Installation:
```shell
git clone https://github.com/momomo-agent/agent-control
cd agent-control && npm install && npm link
```
Basic workflow:
1. `agent-control -p <platform> snapshot` → See what's on screen (elements get refs like @e3)
2. `agent-control -p <platform> click @e3` → Interact with it
3. `agent-control -p <platform> screenshot /tmp/verify.png` → Verify result

Platforms:
- `-p web` → Control web pages (Playwright)
- `-p macos --app <name>` → Control macOS apps (Accessibility API)
- `-p ios` → Control iOS Simulator (idb)
- `-p android` → Control Android emulator/device (adb)

Full documentation: GitHub README