Remove vision tools (except take_screenshot) and macax tools

Vision tools removed:
- extract_text (OCR from image files)
- extract_text_with_boxes (OCR with bounding boxes)
- vision_find_text (find text in app windows)
- vision_click_text (find and click on text)
- vision_click_near_text (click near text labels)

macax tools removed:
- macax_list_apps
- macax_get_frontmost_app
- macax_activate_app
- macax_press_key
- macax_type_text

The LLM can now read images directly via the read_image tool.
take_screenshot is retained for capturing application windows.

Files deleted:
- crates/g3-core/src/tools/vision.rs
- crates/g3-core/src/tools/macax.rs
- docs/macax-tools.md

Updated tool counts: 12 core + 15 webdriver = 27 total
Dhanji R. Prasanna
2026-01-03 17:38:25 +11:00
parent 29e263ac49
commit 386176899e
19 changed files with 15 additions and 1408 deletions

@@ -191,7 +191,6 @@ Key modules:
- `platform/` - Platform-specific implementations (macOS, Linux, Windows)
- `webdriver/` - Safari and Chrome WebDriver integration
- `ocr/` - Text extraction (Tesseract, Apple Vision)
- `macax/` - macOS Accessibility API controller
**Platform support**:
- **macOS**: Core Graphics, Cocoa, screencapture, Vision framework

@@ -27,7 +27,6 @@ G3 uses TOML format. The configuration is organized into sections:
[agent] # Agent behavior settings
[computer_control] # Mouse/keyboard automation
[webdriver] # Browser automation
[macax] # macOS Accessibility API
```
## Provider Configuration
@@ -236,13 +235,11 @@ apt install chromium-chromedriver
## macOS Accessibility API Configuration
```toml
[macax]
enabled = false # Set to true to enable
```
**Required permissions**: System Preferences → Security & Privacy → Privacy → Accessibility → Add your terminal app
See [macOS Accessibility Tools Guide](macax-tools.md) for detailed usage.
## Multi-Role Configuration
@@ -295,7 +292,6 @@ g3 --model claude-opus-4-5
# Enable features
g3 --webdriver # Enable WebDriver (Safari)
g3 --chrome-headless # Enable WebDriver (Chrome headless)
g3 --macax # Enable macOS Accessibility API
# Specify config file
g3 --config /path/to/config.toml
@@ -340,7 +336,6 @@ enabled = true
browser = "safari"
safari_port = 4444
[macax]
enabled = false
```

@@ -1,472 +0,0 @@
# macOS Accessibility Tools Guide
**Last updated**: January 2025
**Source of truth**: `crates/g3-computer-control/src/macax/`
## Purpose
G3 includes tools for controlling macOS applications via the Accessibility API. This enables automation of native macOS apps, including those you're building with G3.
## Overview
The macOS Accessibility API provides programmatic access to UI elements in any application. G3 exposes this through the `macax_*` tools, allowing you to:
- List and activate applications
- Inspect UI element hierarchies
- Find elements by role, title, or identifier
- Click buttons and interact with controls
- Read and set values in text fields
- Simulate keyboard input
## Setup
### 1. Enable in Configuration
```toml
# ~/.config/g3/config.toml
[macax]
enabled = true
```
Or use the CLI flag:
```bash
g3 --macax
```
### 2. Grant Accessibility Permissions
1. Open **System Preferences** → **Security & Privacy** → **Privacy**
2. Select **Accessibility** in the left sidebar
3. Click the lock icon and authenticate
4. Add your terminal application (Terminal, iTerm2, etc.)
5. Restart your terminal
**Note**: If using VS Code's integrated terminal, add VS Code to the list.
### 3. Verify Setup
```json
{"tool": "macax_list_apps", "args": {}}
```
This should return a list of running applications.
## Available Tools
### macax_list_apps
List all running applications.
**Parameters**: None
**Example**:
```json
{"tool": "macax_list_apps", "args": {}}
```
**Returns**:
```
Running Applications:
- Safari (com.apple.Safari)
- Finder (com.apple.finder)
- Terminal (com.apple.Terminal)
- MyApp (com.example.myapp)
```
---
### macax_get_frontmost_app
Get the currently active (frontmost) application.
**Parameters**: None
**Example**:
```json
{"tool": "macax_get_frontmost_app", "args": {}}
```
**Returns**:
```
Frontmost Application: Safari (com.apple.Safari)
```
---
### macax_activate_app
Bring an application to the front.
**Parameters**:
- `app_name` (string, required): Application name
**Example**:
```json
{"tool": "macax_activate_app", "args": {"app_name": "Safari"}}
```
---
### macax_get_ui_tree
Get the UI element hierarchy of an application.
**Parameters**:
- `app_name` (string, required): Application name
- `max_depth` (integer, optional): Maximum tree depth (default: 5)
**Example**:
```json
{"tool": "macax_get_ui_tree", "args": {"app_name": "Calculator", "max_depth": 3}}
```
**Returns**:
```
UI Tree for Calculator:
└── AXApplication "Calculator"
└── AXWindow "Calculator"
├── AXGroup
│ ├── AXButton "1" [id: digit_1]
│ ├── AXButton "2" [id: digit_2]
│ ├── AXButton "+" [id: add]
│ └── AXButton "=" [id: equals]
└── AXStaticText "0" [id: display]
```
**Notes**:
- Use a lower `max_depth` for complex apps to avoid overwhelming output
- Elements show role, title, and accessibility identifier (if set)
---
### macax_find_elements
Find UI elements matching criteria.
**Parameters**:
- `app_name` (string, required): Application name
- `role` (string, optional): Element role (e.g., "button", "textField")
- `title` (string, optional): Element title/label
- `identifier` (string, optional): Accessibility identifier
**Example**:
```json
{"tool": "macax_find_elements", "args": {
"app_name": "Safari",
"role": "button"
}}
```
**Returns**:
```
Found 5 elements:
1. AXButton "Back" [id: BackButton]
2. AXButton "Forward" [id: ForwardButton]
3. AXButton "Reload" [id: ReloadButton]
4. AXButton "Share" [id: ShareButton]
5. AXButton "New Tab" [id: NewTabButton]
```
---
### macax_click
Click a UI element.
**Parameters**:
- `app_name` (string, required): Application name
- `identifier` (string, optional): Accessibility identifier
- `title` (string, optional): Element title
- `role` (string, optional): Element role
At least one of `identifier`, `title`, or `role` must be provided.
**Examples**:
```json
// Click by identifier (most reliable)
{"tool": "macax_click", "args": {
"app_name": "Calculator",
"identifier": "digit_5"
}}
// Click by title
{"tool": "macax_click", "args": {
"app_name": "Calculator",
"title": "5"
}}
// Click by role and title
{"tool": "macax_click", "args": {
"app_name": "Safari",
"role": "button",
"title": "Reload"
}}
```
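The "at least one selector" rule above could be enforced before any Accessibility call is made. The following is a minimal sketch of that check, not the actual G3 implementation: the `ElementSelector` type and its `validate` method are hypothetical names introduced here for illustration only.

```rust
/// Hypothetical selector mirroring the `macax_click` parameters
/// (`identifier`, `title`, `role`). Not part of the real G3 API.
#[derive(Debug, Default)]
struct ElementSelector {
    identifier: Option<String>,
    title: Option<String>,
    role: Option<String>,
}

impl ElementSelector {
    /// Returns an error unless at least one selector field is set.
    fn validate(&self) -> Result<(), String> {
        if self.identifier.is_none() && self.title.is_none() && self.role.is_none() {
            return Err(
                "at least one of `identifier`, `title`, or `role` must be provided".to_string(),
            );
        }
        Ok(())
    }
}
```

Rejecting an empty selector up front gives a clearer error than letting the element search run with no criteria.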
---
### macax_set_value
Set the value of a UI element (text fields, sliders, etc.).
**Parameters**:
- `app_name` (string, required): Application name
- `identifier` (string, optional): Accessibility identifier
- `title` (string, optional): Element title
- `value` (string, required): Value to set
**Example**:
```json
{"tool": "macax_set_value", "args": {
"app_name": "TextEdit",
"role": "textArea",
"value": "Hello, World!"
}}
```
---
### macax_get_value
Get the current value of a UI element.
**Parameters**:
- `app_name` (string, required): Application name
- `identifier` (string, optional): Accessibility identifier
- `title` (string, optional): Element title
**Example**:
```json
{"tool": "macax_get_value", "args": {
"app_name": "Calculator",
"identifier": "display"
}}
```
**Returns**:
```
Value: 42
```
---
### macax_press_key
Simulate a key press.
**Parameters**:
- `key` (string, required): Key to press
- `modifiers` (array, optional): Modifier keys
**Supported modifiers**: `command`, `shift`, `option`, `control`
**Examples**:
```json
// Simple key press
{"tool": "macax_press_key", "args": {"key": "a"}}
// With modifiers (Cmd+S)
{"tool": "macax_press_key", "args": {
"key": "s",
"modifiers": ["command"]
}}
// Multiple modifiers (Cmd+Shift+N)
{"tool": "macax_press_key", "args": {
"key": "n",
"modifiers": ["command", "shift"]
}}
// Special keys
{"tool": "macax_press_key", "args": {"key": "return"}}
{"tool": "macax_press_key", "args": {"key": "escape"}}
{"tool": "macax_press_key", "args": {"key": "tab"}}
{"tool": "macax_press_key", "args": {"key": "delete"}}
```
**Special key names**:
- `return`, `enter`
- `escape`, `esc`
- `tab`
- `delete`, `backspace`
- `space`
- `up`, `down`, `left`, `right`
- `home`, `end`, `pageup`, `pagedown`
- `f1` through `f12`
## Common Roles
| Role | Description |
|------|-------------|
| `button` | Clickable button |
| `textField` | Single-line text input |
| `textArea` | Multi-line text input |
| `checkbox` | Checkbox control |
| `radioButton` | Radio button |
| `popUpButton` | Dropdown/popup menu |
| `slider` | Slider control |
| `table` | Table view |
| `list` | List view |
| `outline` | Outline/tree view |
| `group` | Container group |
| `window` | Application window |
| `sheet` | Modal sheet |
| `dialog` | Dialog window |
| `staticText` | Non-editable text |
| `image` | Image element |
| `scrollArea` | Scrollable container |
| `toolbar` | Toolbar |
| `menuBar` | Menu bar |
| `menu` | Menu |
| `menuItem` | Menu item |
## Best Practices
### 1. Use Accessibility Identifiers
When building apps you'll automate with G3, add accessibility identifiers:
**SwiftUI**:
```swift
Button("Submit") { ... }
.accessibilityIdentifier("submit_button")
```
**UIKit**:
```swift
button.accessibilityIdentifier = "submit_button"
```
**AppKit**:
```swift
button.setAccessibilityIdentifier("submit_button")
```
Identifiers are more reliable than titles (which may be localized).
### 2. Inspect Before Automating
Always inspect the UI tree first:
```json
{"tool": "macax_get_ui_tree", "args": {"app_name": "MyApp", "max_depth": 4}}
```
This helps you understand:
- Element hierarchy
- Available identifiers
- Correct role names
### 3. Activate App First
Some actions require the app to be frontmost:
```json
{"tool": "macax_activate_app", "args": {"app_name": "MyApp"}}
{"tool": "macax_click", "args": {"app_name": "MyApp", "identifier": "button1"}}
```
### 4. Handle Timing
UI updates may take time. If an element isn't found:
1. Wait briefly
2. Retry the operation
3. Check if the app state changed
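The wait-and-retry steps above can be sketched as a small generic helper. This is an illustrative sketch under stated assumptions, not code from G3: `retry_with_delay` is a hypothetical name, and the closure stands in for whatever call performs the element lookup or click.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper: run `op` up to `max_attempts` times, sleeping
/// `delay` between attempts. Returns the first success, or the last
/// error once the attempts are exhausted. At least one attempt is
/// always made, even if `max_attempts` is 0.
fn retry_with_delay<T, E>(
    max_attempts: u32,
    delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the most recent error.
            Err(err) if attempt >= max_attempts => return Err(err),
            // Otherwise wait briefly and try again.
            Err(_) => {
                sleep(delay);
                attempt += 1;
            }
        }
    }
}
```

A fixed delay keeps the sketch short. For UIs that can take a variable time to settle, a growing delay between attempts may be a better fit.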
### 5. Prefer Identifiers Over Titles
```json
// Good: Uses identifier
{"tool": "macax_click", "args": {"app_name": "MyApp", "identifier": "save_btn"}}
// Less reliable: Uses title (may be localized)
{"tool": "macax_click", "args": {"app_name": "MyApp", "title": "Save"}}
```
## Example: Automating Calculator
```json
// 1. Activate Calculator
{"tool": "macax_activate_app", "args": {"app_name": "Calculator"}}
// 2. Inspect UI
{"tool": "macax_get_ui_tree", "args": {"app_name": "Calculator", "max_depth": 3}}
// 3. Click "5"
{"tool": "macax_click", "args": {"app_name": "Calculator", "title": "5"}}
// 4. Click "+"
{"tool": "macax_click", "args": {"app_name": "Calculator", "title": "+"}}
// 5. Click "3"
{"tool": "macax_click", "args": {"app_name": "Calculator", "title": "3"}}
// 6. Click "="
{"tool": "macax_click", "args": {"app_name": "Calculator", "title": "="}}
// 7. Read result
{"tool": "macax_get_value", "args": {"app_name": "Calculator", "role": "staticText"}}
```
## Troubleshooting
### "Accessibility permission denied"
1. Check System Preferences → Security & Privacy → Privacy → Accessibility
2. Ensure your terminal app is listed and checked
3. Restart the terminal after granting permission
### "Application not found"
1. Use exact app name (case-sensitive)
2. Run `macax_list_apps` to see available apps
3. App must be running
### "Element not found"
1. Inspect UI tree to verify element exists
2. Check identifier/title spelling
3. Element may be in a different window or sheet
4. App state may have changed
### "Cannot perform action"
1. Element may be disabled
2. App may need to be frontmost
3. Element may not support the action
4. Check element role supports the operation
### Slow Performance
1. Reduce `max_depth` in `macax_get_ui_tree`
2. Use specific identifiers instead of searching
3. Complex apps have large UI trees
## Comparison with Other Tools
| Feature | macax | Vision Tools | WebDriver |
|---------|-------|--------------|----------|
| Native apps | ✅ | ✅ (via OCR) | ❌ |
| Web browsers | ✅ | ✅ | ✅ |
| Electron apps | ✅ | ✅ | Partial |
| Reliability | High | Medium | High |
| Setup | Permissions | None | Driver |
| Speed | Fast | Slower | Medium |
**Use macax when**:
- Automating native macOS apps
- You control the app and can add identifiers
- Need reliable, fast automation
**Use Vision tools when**:
- App doesn't expose accessibility
- Need to find text visually
- Cross-platform approach needed
**Use WebDriver when**:
- Automating web content
- Need JavaScript execution
- Testing web applications

@@ -12,12 +12,10 @@ This document describes all tools available to the G3 agent. Tools are the prima
| Category | Tools | Enabled By |
|----------|-------|------------|
| **Core** | shell, read_file, write_file, str_replace, final_output, background_process | Always |
| **Images** | read_image, take_screenshot, extract_text | Always |
| **Images** | read_image, take_screenshot | Always |
| **Task Management** | todo_read, todo_write | Always |
| **Code Intelligence** | code_search, code_coverage | Always |
| **WebDriver** | webdriver_* (12 tools) | `--webdriver` or `--chrome-headless` |
| **Vision** | vision_find_text, vision_click_text, vision_click_near_text | Always (macOS) |
| **macOS Accessibility** | macax_* (9 tools) | `--macax` |
| **Computer Control** | mouse_click, type_text, find_element, list_windows | `computer_control.enabled = true` |
---
@@ -82,7 +80,6 @@ Read file contents with optional character range.
```
**Notes**:
- For image files (png, jpg, gif, etc.), automatically extracts text using OCR
- Supports tilde expansion (`~`)
- Reports file size and line count
@@ -105,7 +102,6 @@ Read image files for visual analysis by the LLM.
**Notes**:
- Images are sent to the LLM for visual analysis
- Use for inspecting sprites, UI screenshots, diagrams, etc.
- Different from `extract_text` which only does OCR
---
@@ -197,23 +193,6 @@ Capture a screenshot of an application window.
---
### extract_text
Extract text from an image using OCR.
**Parameters**:
- `path` (string, optional): Path to image file
**Example**:
```json
{"tool": "extract_text", "args": {"path": "screenshot.png"}}
```
**Notes**:
- Uses Tesseract OCR or Apple Vision framework
- For window-based OCR, use `vision_find_text` instead
---
## Task Management Tools
@@ -386,98 +365,7 @@ Close browser and end session.
---
## Vision Tools (macOS)
Use Apple Vision framework for text recognition.
### vision_find_text
Find text in an application window.
**Parameters**:
- `app_name` (string, required): Application name
- `text` (string, required): Text to search for
**Returns**: Bounding box coordinates and confidence score
### vision_click_text
Find and click on text.
**Parameters**:
- `app_name` (string, required): Application name
- `text` (string, required): Text to click
### vision_click_near_text
Click near a text label (useful for form fields).
**Parameters**:
- `app_name` (string, required): Application name
- `text` (string, required): Label text to find
- `direction` (string, optional): "right", "below", "left", "above" (default: "right")
- `distance` (integer, optional): Pixels from text (default: 50)
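
To illustrate how `direction` and `distance` might translate into a click position, here is a minimal sketch. It rests on assumptions that the source does not confirm: the text bounding box uses a top-left origin with y increasing downward, the offset is measured from the nearest edge of the box, and the click is centred on the other axis. `Rect`, `Direction`, and `click_point` are hypothetical names, not the removed tool's actual implementation.

```rust
/// Hypothetical bounding box of the matched label, in screen pixels
/// (assumed top-left origin, y growing downward).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    x: f64,
    y: f64,
    width: f64,
    height: f64,
}

/// The four `direction` values listed for `vision_click_near_text`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Direction {
    Right,
    Below,
    Left,
    Above,
}

/// Compute a click point `distance` pixels away from the label's edge
/// in the given direction, centred on the perpendicular axis.
fn click_point(label: Rect, direction: Direction, distance: f64) -> (f64, f64) {
    let center_x = label.x + label.width / 2.0;
    let center_y = label.y + label.height / 2.0;
    match direction {
        Direction::Right => (label.x + label.width + distance, center_y),
        Direction::Left => (label.x - distance, center_y),
        Direction::Below => (center_x, label.y + label.height + distance),
        Direction::Above => (center_x, label.y - distance),
    }
}
```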
---
## macOS Accessibility Tools
Enabled with `--macax`. See [macOS Accessibility Tools Guide](macax-tools.md).
### macax_list_apps
List running applications.
### macax_get_frontmost_app
Get the frontmost application.
### macax_activate_app
Bring an application to front.
**Parameters**:
- `app_name` (string, required): Application name
### macax_get_ui_tree
Get UI element hierarchy.
**Parameters**:
- `app_name` (string, required): Application name
- `max_depth` (integer, optional): Tree depth limit
### macax_find_elements
Find UI elements by criteria.
**Parameters**:
- `app_name` (string, required): Application name
- `role` (string, optional): Element role (button, textField, etc.)
- `title` (string, optional): Element title
- `identifier` (string, optional): Accessibility identifier
### macax_click
Click a UI element.
**Parameters**:
- `app_name` (string, required): Application name
- `identifier` or `title` or `role`: Element selector
### macax_set_value / macax_get_value
Set or get element value.
### macax_press_key
Simulate key press.
**Parameters**:
- `key` (string, required): Key to press
- `modifiers` (array, optional): ["command", "shift", "option", "control"]
---
## Computer Control Tools