Remove vision tools (except take_screenshot) and macax tools

Vision tools removed:
- extract_text (OCR from image files)
- extract_text_with_boxes (OCR with bounding boxes)
- vision_find_text (find text in app windows)
- vision_click_text (find and click on text)
- vision_click_near_text (click near text labels)

macax tools removed:
- macax_list_apps
- macax_get_frontmost_app
- macax_activate_app
- macax_press_key
- macax_type_text

The LLM can now read images directly via read_image tool.
take_screenshot is retained for capturing application windows.

Files deleted:
- crates/g3-core/src/tools/vision.rs
- crates/g3-core/src/tools/macax.rs
- docs/macax-tools.md

Updated tool counts: 12 core + 15 webdriver = 27 total
This commit is contained in:
Dhanji R. Prasanna
2026-01-03 17:38:25 +11:00
parent 29e263ac49
commit 386176899e
19 changed files with 15 additions and 1408 deletions

View File

@@ -202,4 +202,3 @@ g3/
- [Control Commands](docs/CONTROL_COMMANDS.md) - Interactive commands - [Control Commands](docs/CONTROL_COMMANDS.md) - Interactive commands
- [Code Search](docs/CODE_SEARCH.md) - Tree-sitter search guide - [Code Search](docs/CODE_SEARCH.md) - Tree-sitter search guide
- [Flock Mode](docs/FLOCK_MODE.md) - Multi-agent development - [Flock Mode](docs/FLOCK_MODE.md) - Multi-agent development
- [macOS Accessibility](docs/macax-tools.md) - macOS automation

View File

@@ -106,7 +106,6 @@ g3/
- `type_text`: Type text at the current cursor position - `type_text`: Type text at the current cursor position
- `find_element`: Find UI elements by text, role, or attributes - `find_element`: Find UI elements by text, role, or attributes
- `take_screenshot`: Capture screenshots of screen, region, or window - `take_screenshot`: Capture screenshots of screen, region, or window
- `extract_text`: Extract text from images or screen regions using OCR
- `find_text_on_screen`: Find text visually on screen and return coordinates - `find_text_on_screen`: Find text visually on screen and return coordinates
- `list_windows`: List all open windows with IDs and titles - `list_windows`: List all open windows with IDs and titles

View File

@@ -103,10 +103,8 @@ These commands give you fine-grained control over context management, allowing y
- **TODO Management**: Read and write TODO lists with markdown checkbox format - **TODO Management**: Read and write TODO lists with markdown checkbox format
- **Computer Control** (Experimental): Automate desktop applications - **Computer Control** (Experimental): Automate desktop applications
- Mouse and keyboard control - Mouse and keyboard control
- macOS Accessibility API for native app automation (via `--macax` flag)
- UI element inspection - UI element inspection
- Screenshot capture and window management - Screenshot capture and window management
- OCR text extraction from images and screen regions
- Window listing and identification - Window listing and identification
- **Code Search**: Embedded tree-sitter for syntax-aware code search (Rust, Python, JavaScript, TypeScript, Go, Java, C, C++) - see [Code Search Guide](docs/CODE_SEARCH.md) - **Code Search**: Embedded tree-sitter for syntax-aware code search (Rust, Python, JavaScript, TypeScript, Go, Java, C, C++) - see [Code Search Guide](docs/CODE_SEARCH.md)
- **Final Output**: Formatted result presentation - **Final Output**: Formatted result presentation
@@ -305,24 +303,11 @@ chrome_binary = "/Users/yourname/.chrome-for-testing/chrome-mac-arm64/Google Chr
**Note**: If you see "ChromeDriver version doesn't match Chrome version" errors, use Option 1 (Chrome for Testing) which bundles matching versions. **Note**: If you see "ChromeDriver version doesn't match Chrome version" errors, use Option 1 (Chrome for Testing) which bundles matching versions.
## macOS Accessibility API Tools
G3 includes support for controlling macOS applications via the Accessibility API, allowing you to automate native macOS apps.
**Available Tools**: `macax_list_apps`, `macax_get_frontmost_app`, `macax_activate_app`, `macax_get_ui_tree`, `macax_find_elements`, `macax_click`, `macax_set_value`, `macax_get_value`, `macax_press_key`
**Setup**: Enable with the `--macax` flag or in config with `macax.enabled = true`. Grant accessibility permissions:
- **macOS**: System Preferences → Security & Privacy → Privacy → Accessibility → Add your terminal app
**For detailed documentation**, see [macOS Accessibility Tools Guide](docs/macax-tools.md).
**Note**: This is particularly useful for testing and automating apps you're building with G3, as you can add accessibility identifiers to your UI elements.
## Computer Control (Experimental) ## Computer Control (Experimental)
G3 can interact with your computer's GUI for automation tasks: G3 can interact with your computer's GUI for automation tasks:
**Available Tools**: `mouse_click`, `type_text`, `find_element`, `take_screenshot`, `extract_text`, `find_text_on_screen`, `list_windows` **Available Tools**: `mouse_click`, `type_text`, `find_element`, `take_screenshot`, `list_windows`
**Setup**: Enable in config with `computer_control.enabled = true` and grant OS accessibility permissions: **Setup**: Enable in config with `computer_control.enabled = true` and grant OS accessibility permissions:
- **macOS**: System Preferences → Security & Privacy → Accessibility - **macOS**: System Preferences → Security & Privacy → Accessibility
@@ -351,7 +336,6 @@ Detailed documentation is available in the `docs/` directory:
| [Control Commands](docs/CONTROL_COMMANDS.md) | Interactive `/` commands for context management | | [Control Commands](docs/CONTROL_COMMANDS.md) | Interactive `/` commands for context management |
| [Code Search](docs/CODE_SEARCH.md) | Tree-sitter code search query patterns | | [Code Search](docs/CODE_SEARCH.md) | Tree-sitter code search query patterns |
| [Flock Mode](docs/FLOCK_MODE.md) | Parallel multi-agent development | | [Flock Mode](docs/FLOCK_MODE.md) | Parallel multi-agent development |
| [macOS Accessibility](docs/macax-tools.md) | macOS Accessibility API automation |
For AI agents working with this codebase, see [AGENTS.md](AGENTS.md). For AI agents working with this codebase, see [AGENTS.md](AGENTS.md).

View File

@@ -345,10 +345,6 @@ pub struct Cli {
#[arg(long)] #[arg(long)]
pub quiet: bool, pub quiet: bool,
/// Enable macOS Accessibility API tools for native app automation
#[arg(long)]
pub macax: bool,
/// Enable WebDriver browser automation tools /// Enable WebDriver browser automation tools
#[arg(long)] #[arg(long)]
pub webdriver: bool, pub webdriver: bool,
@@ -540,11 +536,6 @@ pub async fn run() -> Result<()> {
cli.model.clone(), cli.model.clone(),
)?; )?;
// Apply macax flag override
if cli.macax {
config.macax.enabled = true;
}
// Apply webdriver flag override // Apply webdriver flag override
if cli.webdriver { if cli.webdriver {
config.webdriver.enabled = true; config.webdriver.enabled = true;
@@ -992,11 +983,6 @@ async fn run_accumulative_mode(
cli.model.clone(), cli.model.clone(),
)?; )?;
// Apply macax flag override
if cli.macax {
config.macax.enabled = true;
}
// Apply webdriver flag override // Apply webdriver flag override
if cli.webdriver { if cli.webdriver {
config.webdriver.enabled = true; config.webdriver.enabled = true;
@@ -1099,11 +1085,6 @@ async fn run_accumulative_mode(
cli.model.clone(), cli.model.clone(),
)?; )?;
// Apply macax flag override
if cli.macax {
config.macax.enabled = true;
}
// Apply webdriver flag override // Apply webdriver flag override
if cli.webdriver { if cli.webdriver {
config.webdriver.enabled = true; config.webdriver.enabled = true;
@@ -2604,7 +2585,7 @@ Review the current state of the project and provide a concise critique focusing
2. Whether the project compiles successfully 2. Whether the project compiles successfully
3. What requirements are missing or incorrect 3. What requirements are missing or incorrect
4. Specific improvements needed to satisfy requirements 4. Specific improvements needed to satisfy requirements
5. Use UI tools such as webdriver or macax to test functionality thoroughly 5. Use UI tools such as webdriver to test functionality thoroughly
CRITICAL INSTRUCTIONS: CRITICAL INSTRUCTIONS:
1. You MUST use the final_output tool to provide your feedback 1. You MUST use the final_output tool to provide your feedback

View File

@@ -10,7 +10,6 @@ pub struct Config {
pub agent: AgentConfig, pub agent: AgentConfig,
pub computer_control: ComputerControlConfig, pub computer_control: ComputerControlConfig,
pub webdriver: WebDriverConfig, pub webdriver: WebDriverConfig,
pub macax: MacAxConfig,
} }
/// Provider configuration with named configs per provider type /// Provider configuration with named configs per provider type
@@ -139,17 +138,6 @@ pub struct WebDriverConfig {
pub browser: WebDriverBrowser, pub browser: WebDriverBrowser,
} }
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct MacAxConfig {
pub enabled: bool,
}
impl Default for MacAxConfig {
fn default() -> Self {
Self { enabled: false }
}
}
impl Default for WebDriverConfig { impl Default for WebDriverConfig {
fn default() -> Self { fn default() -> Self {
Self { Self {
@@ -212,7 +200,6 @@ impl Default for Config {
}, },
computer_control: ComputerControlConfig::default(), computer_control: ComputerControlConfig::default(),
webdriver: WebDriverConfig::default(), webdriver: WebDriverConfig::default(),
macax: MacAxConfig::default(),
} }
} }
} }

View File

@@ -14,9 +14,6 @@ max_actions_per_second = 10
[webdriver] [webdriver]
enabled = false enabled = false
safari_port = 4444 safari_port = 4444
[macax]
enabled = false
"# "#
} }

View File

@@ -108,8 +108,6 @@ pub struct Agent<W: UiWriter> {
>, >,
>, >,
webdriver_process: std::sync::Arc<tokio::sync::RwLock<Option<tokio::process::Child>>>, webdriver_process: std::sync::Arc<tokio::sync::RwLock<Option<tokio::process::Child>>>,
macax_controller:
std::sync::Arc<tokio::sync::RwLock<Option<g3_computer_control::MacAxController>>>,
tool_call_count: usize, tool_call_count: usize,
requirements_sha: Option<String>, requirements_sha: Option<String>,
/// Working directory for tool execution (set by --codebase-fast-start) /// Working directory for tool execution (set by --codebase-fast-start)
@@ -389,9 +387,6 @@ impl<W: UiWriter> Agent<W> {
None None
}; };
// Capture macax_enabled before moving config
let macax_enabled = config.macax.enabled;
Ok(Self { Ok(Self {
providers, providers,
context_window, context_window,
@@ -411,13 +406,6 @@ impl<W: UiWriter> Agent<W> {
computer_controller, computer_controller,
webdriver_session: std::sync::Arc::new(tokio::sync::RwLock::new(None)), webdriver_session: std::sync::Arc::new(tokio::sync::RwLock::new(None)),
webdriver_process: std::sync::Arc::new(tokio::sync::RwLock::new(None)), webdriver_process: std::sync::Arc::new(tokio::sync::RwLock::new(None)),
macax_controller: {
std::sync::Arc::new(tokio::sync::RwLock::new(if macax_enabled {
Some(g3_computer_control::MacAxController::new()?)
} else {
None
}))
},
tool_call_count: 0, tool_call_count: 0,
requirements_sha: None, requirements_sha: None,
working_dir: None, working_dir: None,
@@ -921,7 +909,6 @@ impl<W: UiWriter> Agent<W> {
Some(tool_definitions::create_tool_definitions( Some(tool_definitions::create_tool_definitions(
tool_definitions::ToolConfig::new( tool_definitions::ToolConfig::new(
self.config.webdriver.enabled, self.config.webdriver.enabled,
self.config.macax.enabled,
self.config.computer_control.enabled, self.config.computer_control.enabled,
))) )))
} else { } else {
@@ -2674,7 +2661,6 @@ impl<W: UiWriter> Agent<W> {
request.tools = Some(tool_definitions::create_tool_definitions( request.tools = Some(tool_definitions::create_tool_definitions(
tool_definitions::ToolConfig::new( tool_definitions::ToolConfig::new(
self.config.webdriver.enabled, self.config.webdriver.enabled,
self.config.macax.enabled,
self.config.computer_control.enabled, self.config.computer_control.enabled,
))); )));
} }
@@ -3289,7 +3275,6 @@ impl<W: UiWriter> Agent<W> {
computer_controller: self.computer_controller.as_ref(), computer_controller: self.computer_controller.as_ref(),
webdriver_session: &self.webdriver_session, webdriver_session: &self.webdriver_session,
webdriver_process: &self.webdriver_process, webdriver_process: &self.webdriver_process,
macax_controller: &self.macax_controller,
background_process_manager: &self.background_process_manager, background_process_manager: &self.background_process_manager,
todo_content: &self.todo_content, todo_content: &self.todo_content,
pending_images: &mut self.pending_images, pending_images: &mut self.pending_images,

View File

@@ -11,15 +11,13 @@ use serde_json::json;
#[derive(Debug, Clone, Copy, Default)] #[derive(Debug, Clone, Copy, Default)]
pub struct ToolConfig { pub struct ToolConfig {
pub webdriver: bool, pub webdriver: bool,
pub macax: bool,
pub computer_control: bool, pub computer_control: bool,
} }
impl ToolConfig { impl ToolConfig {
pub fn new(webdriver: bool, macax: bool, computer_control: bool) -> Self { pub fn new(webdriver: bool, computer_control: bool) -> Self {
Self { Self {
webdriver, webdriver,
macax,
computer_control, computer_control,
} }
} }
@@ -36,14 +34,6 @@ pub fn create_tool_definitions(config: ToolConfig) -> Vec<Tool> {
tools.extend(create_webdriver_tools()); tools.extend(create_webdriver_tools());
} }
if config.macax {
tools.extend(create_macax_tools());
}
if config.computer_control {
tools.extend(create_computer_control_tools());
}
tools tools
} }
@@ -88,7 +78,7 @@ fn create_core_tools() -> Vec<Tool> {
}, },
Tool { Tool {
name: "read_file".to_string(), name: "read_file".to_string(),
description: "Read the contents of a file. For image files (png, jpg, jpeg, gif, bmp, tiff, webp), automatically extracts text using OCR. For text files, optionally read a specific character range.".to_string(), description: "Read the contents of a file. Optionally read a specific character range.".to_string(),
input_schema: json!({ input_schema: json!({
"type": "object", "type": "object",
"properties": { "properties": {
@@ -208,19 +198,6 @@ fn create_core_tools() -> Vec<Tool> {
"required": ["path", "window_id"] "required": ["path", "window_id"]
}), }),
}, },
Tool {
name: "extract_text".to_string(),
description: "Extract text from an image file using OCR. For extracting text from a specific window, use vision_find_text instead which automatically handles window capture.".to_string(),
input_schema: json!({
"type": "object",
"properties": {
"path": {
"type": "string",
"description": "Path to image file (optional if region is provided)"
},
}
}),
},
Tool { Tool {
name: "todo_read".to_string(), name: "todo_read".to_string(),
description: "Read your current TODO list from todo.g3.md file in the session directory. Shows what tasks are planned and their status. Call this at the start of multi-step tasks to check for existing plans, and during execution to review progress before updating. TODO lists are scoped to the current session.".to_string(), description: "Read your current TODO list from todo.g3.md file in the session directory. Shows what tasks are planned and their status. Call this at the start of multi-step tasks to check for existing plans, and during execution to review progress before updating. TODO lists are scoped to the current session.".to_string(),
@@ -476,174 +453,6 @@ fn create_webdriver_tools() -> Vec<Tool> {
] ]
} }
/// Create macOS Accessibility tools
fn create_macax_tools() -> Vec<Tool> {
vec![
Tool {
name: "macax_list_apps".to_string(),
description: "List all running applications that can be controlled via macOS Accessibility API".to_string(),
input_schema: json!({
"type": "object",
"properties": {},
"required": []
}),
},
Tool {
name: "macax_get_frontmost_app".to_string(),
description: "Get the name of the currently active (frontmost) application".to_string(),
input_schema: json!({
"type": "object",
"properties": {},
"required": []
}),
},
Tool {
name: "macax_activate_app".to_string(),
description: "Bring an application to the front (activate it)".to_string(),
input_schema: json!({
"type": "object",
"properties": {
"app_name": {
"type": "string",
"description": "Name of the application to activate (e.g., 'Safari', 'TextEdit')"
}
},
"required": ["app_name"]
}),
},
Tool {
name: "macax_press_key".to_string(),
description: "Press a keyboard key or shortcut in an application (e.g., Cmd+S to save)".to_string(),
input_schema: json!({
"type": "object",
"properties": {
"app_name": {
"type": "string",
"description": "Name of the application"
},
"key": {
"type": "string",
"description": "Key to press (e.g., 's', 'return', 'tab')"
},
"modifiers": {
"type": "array",
"items": {
"type": "string"
},
"description": "Modifier keys (e.g., ['command', 'shift'])"
}
},
"required": ["app_name", "key"]
}),
},
Tool {
name: "macax_type_text".to_string(),
description: "Type arbitrary text into the currently focused element in an application (supports unicode, emojis, etc.)".to_string(),
input_schema: json!({
"type": "object",
"properties": {
"app_name": {
"type": "string",
"description": "Name of the application"
},
"text": {
"type": "string",
"description": "Text to type (can include unicode, emojis, special characters)"
}
},
"required": ["app_name", "text"]
}),
},
Tool {
name: "extract_text_with_boxes".to_string(),
description: "Extract all text from an image file with bounding box coordinates for each text element. Returns JSON array with text, position (x, y), size (width, height), and confidence for each detected text. Uses Apple Vision Framework for precise sub-pixel accuracy.".to_string(),
input_schema: json!({
"type": "object",
"properties": {
"path": {
"type": "string",
"description": "Path to image file to extract text from"
},
"app_name": {
"type": "string",
"description": "Optional: Name of application to screenshot first (e.g., 'Safari', 'Things3'). If provided, takes screenshot of app before extracting text."
}
},
"required": ["path"]
}),
},
]
}
/// Create computer control / vision-guided tools
fn create_computer_control_tools() -> Vec<Tool> {
vec![
Tool {
name: "vision_find_text".to_string(),
description: "Find text in a specific application window and return its location with bounding box coordinates (x, y, width, height) and confidence score. Useful for locating UI elements. Uses Apple Vision Framework for precise sub-pixel accuracy.".to_string(),
input_schema: json!({
"type": "object",
"properties": {
"app_name": {
"type": "string",
"description": "Name of the application to search in (e.g., 'Things3', 'Safari', 'TextEdit')"
},
"text": {
"type": "string",
"description": "The text to search for on screen"
}
},
"required": ["app_name", "text"]
}),
},
Tool {
name: "vision_click_text".to_string(),
description: "Find text in a specific application window and click on it (useful for clicking buttons, links, menu items)".to_string(),
input_schema: json!({
"type": "object",
"properties": {
"app_name": {
"type": "string",
"description": "Name of the application (e.g., 'Things3', 'Safari', 'TextEdit')"
},
"text": {
"type": "string",
"description": "The text to click on (e.g., 'Submit', 'OK', 'Cancel', '+')"
}
},
"required": ["app_name", "text"]
}),
},
Tool {
name: "vision_click_near_text".to_string(),
description: "Find text in a specific application window and click near it (useful for clicking text fields next to labels)".to_string(),
input_schema: json!({
"type": "object",
"properties": {
"app_name": {
"type": "string",
"description": "Name of the application (e.g., 'Things3', 'Safari', 'TextEdit')"
},
"text": {
"type": "string",
"description": "The label text to find (e.g., 'Name:', 'Email:', 'Task:')"
},
"direction": {
"type": "string",
"enum": ["right", "below", "left", "above"],
"description": "Direction to click relative to the text (default: right)"
},
"distance": {
"type": "integer",
"description": "Distance in pixels from the text (default: 50)"
}
},
"required": ["app_name", "text"]
}),
},
]
}
#[cfg(test)] #[cfg(test)]
mod tests { mod tests {
use super::*; use super::*;
@@ -652,9 +461,9 @@ mod tests {
fn test_core_tools_count() { fn test_core_tools_count() {
let tools = create_core_tools(); let tools = create_core_tools();
// Should have the core tools: shell, background_process, read_file, read_image, // Should have the core tools: shell, background_process, read_file, read_image,
// write_file, str_replace, final_output, take_screenshot, extract_text, // write_file, str_replace, final_output, take_screenshot,
// todo_read, todo_write, code_coverage, code_search // todo_read, todo_write, code_coverage, code_search (12 total)
assert_eq!(tools.len(), 13); assert_eq!(tools.len(), 12);
} }
#[test] #[test]
@@ -664,33 +473,19 @@ mod tests {
assert_eq!(tools.len(), 15); assert_eq!(tools.len(), 15);
} }
#[test]
fn test_macax_tools_count() {
let tools = create_macax_tools();
// 6 macax tools
assert_eq!(tools.len(), 6);
}
#[test]
fn test_computer_control_tools_count() {
let tools = create_computer_control_tools();
// 3 vision tools
assert_eq!(tools.len(), 3);
}
#[test] #[test]
fn test_create_tool_definitions_core_only() { fn test_create_tool_definitions_core_only() {
let config = ToolConfig::default(); let config = ToolConfig::default();
let tools = create_tool_definitions(config); let tools = create_tool_definitions(config);
assert_eq!(tools.len(), 13); assert_eq!(tools.len(), 12);
} }
#[test] #[test]
fn test_create_tool_definitions_all_enabled() { fn test_create_tool_definitions_all_enabled() {
let config = ToolConfig::new(true, true, true); let config = ToolConfig::new(true, true);
let tools = create_tool_definitions(config); let tools = create_tool_definitions(config);
// 13 core + 15 webdriver + 6 macax + 3 computer_control = 37 // 12 core + 15 webdriver = 27
assert_eq!(tools.len(), 37); assert_eq!(tools.len(), 27);
} }
#[test] #[test]

View File

@@ -7,7 +7,7 @@ use anyhow::Result;
use tracing::{debug, warn}; use tracing::{debug, warn};
use crate::tools::executor::ToolContext; use crate::tools::executor::ToolContext;
use crate::tools::{file_ops, macax, misc, shell, todo, vision, webdriver}; use crate::tools::{file_ops, misc, shell, todo, webdriver};
use crate::ui_writer::UiWriter; use crate::ui_writer::UiWriter;
use crate::ToolCall; use crate::ToolCall;
@@ -43,7 +43,6 @@ pub async fn dispatch_tool<W: UiWriter>(
Ok(result) Ok(result)
} }
"take_screenshot" => misc::execute_take_screenshot(tool_call, ctx).await, "take_screenshot" => misc::execute_take_screenshot(tool_call, ctx).await,
"extract_text" => misc::execute_extract_text(tool_call, ctx).await,
"code_coverage" => misc::execute_code_coverage(tool_call, ctx).await, "code_coverage" => misc::execute_code_coverage(tool_call, ctx).await,
"code_search" => misc::execute_code_search(tool_call, ctx).await, "code_search" => misc::execute_code_search(tool_call, ctx).await,
@@ -64,19 +63,6 @@ pub async fn dispatch_tool<W: UiWriter>(
"webdriver_refresh" => webdriver::execute_webdriver_refresh(tool_call, ctx).await, "webdriver_refresh" => webdriver::execute_webdriver_refresh(tool_call, ctx).await,
"webdriver_quit" => webdriver::execute_webdriver_quit(tool_call, ctx).await, "webdriver_quit" => webdriver::execute_webdriver_quit(tool_call, ctx).await,
// macOS Accessibility tools
"macax_list_apps" => macax::execute_macax_list_apps(tool_call, ctx).await,
"macax_get_frontmost_app" => macax::execute_macax_get_frontmost_app(tool_call, ctx).await,
"macax_activate_app" => macax::execute_macax_activate_app(tool_call, ctx).await,
"macax_press_key" => macax::execute_macax_press_key(tool_call, ctx).await,
"macax_type_text" => macax::execute_macax_type_text(tool_call, ctx).await,
// Vision tools
"vision_find_text" => vision::execute_vision_find_text(tool_call, ctx).await,
"vision_click_text" => vision::execute_vision_click_text(tool_call, ctx).await,
"vision_click_near_text" => vision::execute_vision_click_near_text(tool_call, ctx).await,
"extract_text_with_boxes" => vision::execute_extract_text_with_boxes(tool_call, ctx).await,
// Unknown tool // Unknown tool
_ => { _ => {
warn!("Unknown tool: {}", tool_call.tool); warn!("Unknown tool: {}", tool_call.tool);

View File

@@ -20,7 +20,6 @@ pub struct ToolContext<'a, W: UiWriter> {
pub computer_controller: Option<&'a Box<dyn g3_computer_control::ComputerController>>, pub computer_controller: Option<&'a Box<dyn g3_computer_control::ComputerController>>,
pub webdriver_session: &'a Arc<RwLock<Option<Arc<tokio::sync::Mutex<WebDriverSession>>>>>, pub webdriver_session: &'a Arc<RwLock<Option<Arc<tokio::sync::Mutex<WebDriverSession>>>>>,
pub webdriver_process: &'a Arc<RwLock<Option<tokio::process::Child>>>, pub webdriver_process: &'a Arc<RwLock<Option<tokio::process::Child>>>,
pub macax_controller: &'a Arc<RwLock<Option<g3_computer_control::MacAxController>>>,
pub background_process_manager: &'a Arc<BackgroundProcessManager>, pub background_process_manager: &'a Arc<BackgroundProcessManager>,
pub todo_content: &'a Arc<RwLock<String>>, pub todo_content: &'a Arc<RwLock<String>>,
pub pending_images: &'a mut Vec<g3_providers::ImageContent>, pub pending_images: &'a mut Vec<g3_providers::ImageContent>,

View File

@@ -13,7 +13,7 @@ use super::executor::ToolContext;
/// Execute the `read_file` tool. /// Execute the `read_file` tool.
pub async fn execute_read_file<W: UiWriter>( pub async fn execute_read_file<W: UiWriter>(
tool_call: &ToolCall, tool_call: &ToolCall,
ctx: &ToolContext<'_, W>, _ctx: &ToolContext<'_, W>,
) -> Result<String> { ) -> Result<String> {
debug!("Processing read_file tool call"); debug!("Processing read_file tool call");
@@ -28,35 +28,6 @@ pub async fn execute_read_file<W: UiWriter>(
let resolved_path = resolve_path_with_unicode_fallback(expanded_path.as_ref()); let resolved_path = resolve_path_with_unicode_fallback(expanded_path.as_ref());
let path_str = resolved_path.as_ref(); let path_str = resolved_path.as_ref();
// Check if this is an image file
let is_image = path_str.to_lowercase().ends_with(".png")
|| path_str.to_lowercase().ends_with(".jpg")
|| path_str.to_lowercase().ends_with(".jpeg")
|| path_str.to_lowercase().ends_with(".gif")
|| path_str.to_lowercase().ends_with(".bmp")
|| path_str.to_lowercase().ends_with(".tiff")
|| path_str.to_lowercase().ends_with(".tif")
|| path_str.to_lowercase().ends_with(".webp");
// If it's an image file, use OCR via extract_text
if is_image {
if let Some(controller) = ctx.computer_controller {
match controller.extract_text_from_image(path_str).await {
Ok(text) => {
return Ok(format!("📄 Image file (OCR extracted):\n{}", text));
}
Err(e) => {
return Ok(format!(
"❌ Failed to extract text from image '{}': {}",
path_str, e
));
}
}
} else {
return Ok("❌ Computer control not enabled. Cannot perform OCR on image files. Set computer_control.enabled = true in config.".to_string());
}
}
// Extract optional start and end positions // Extract optional start and end positions
let start_char = tool_call let start_char = tool_call
.args .args

View File

@@ -1,178 +0,0 @@
//! macOS Accessibility API tools.
use anyhow::Result;
use tracing::debug;
use crate::ui_writer::UiWriter;
use crate::ToolCall;
use super::executor::ToolContext;
/// Execute the `macax_list_apps` tool.
pub async fn execute_macax_list_apps<W: UiWriter>(
tool_call: &ToolCall,
ctx: &ToolContext<'_, W>,
) -> Result<String> {
debug!("Processing macax_list_apps tool call");
let _ = tool_call; // unused
if !ctx.config.macax.enabled {
return Ok(
"❌ macOS Accessibility is not enabled. Use --macax flag to enable.".to_string(),
);
}
let controller_guard = ctx.macax_controller.read().await;
let controller = match controller_guard.as_ref() {
Some(c) => c,
None => return Ok("❌ macOS Accessibility controller not initialized.".to_string()),
};
match controller.list_applications() {
Ok(apps) => {
let app_list: Vec<String> = apps.iter().map(|a| a.name.clone()).collect();
Ok(format!("Running applications:\n{}", app_list.join("\n")))
}
Err(e) => Ok(format!("❌ Failed to list applications: {}", e)),
}
}
/// Execute the `macax_get_frontmost_app` tool.
pub async fn execute_macax_get_frontmost_app<W: UiWriter>(
tool_call: &ToolCall,
ctx: &ToolContext<'_, W>,
) -> Result<String> {
debug!("Processing macax_get_frontmost_app tool call");
let _ = tool_call; // unused
if !ctx.config.macax.enabled {
return Ok(
"❌ macOS Accessibility is not enabled. Use --macax flag to enable.".to_string(),
);
}
let controller_guard = ctx.macax_controller.read().await;
let controller = match controller_guard.as_ref() {
Some(c) => c,
None => return Ok("❌ macOS Accessibility controller not initialized.".to_string()),
};
match controller.get_frontmost_app() {
Ok(app) => Ok(format!("Frontmost application: {}", app.name)),
Err(e) => Ok(format!("❌ Failed to get frontmost app: {}", e)),
}
}
/// Execute the `macax_activate_app` tool.
pub async fn execute_macax_activate_app<W: UiWriter>(
tool_call: &ToolCall,
ctx: &ToolContext<'_, W>,
) -> Result<String> {
debug!("Processing macax_activate_app tool call");
if !ctx.config.macax.enabled {
return Ok(
"❌ macOS Accessibility is not enabled. Use --macax flag to enable.".to_string(),
);
}
let app_name = match tool_call.args.get("app_name").and_then(|v| v.as_str()) {
Some(n) => n,
None => return Ok("❌ Missing app_name argument".to_string()),
};
let controller_guard = ctx.macax_controller.read().await;
let controller = match controller_guard.as_ref() {
Some(c) => c,
None => return Ok("❌ macOS Accessibility controller not initialized.".to_string()),
};
match controller.activate_app(app_name) {
Ok(_) => Ok(format!("✅ Activated application: {}", app_name)),
Err(e) => Ok(format!("❌ Failed to activate app: {}", e)),
}
}
/// Execute the `macax_press_key` tool.
pub async fn execute_macax_press_key<W: UiWriter>(
tool_call: &ToolCall,
ctx: &ToolContext<'_, W>,
) -> Result<String> {
debug!("Processing macax_press_key tool call");
if !ctx.config.macax.enabled {
return Ok(
"❌ macOS Accessibility is not enabled. Use --macax flag to enable.".to_string(),
);
}
let app_name = match tool_call.args.get("app_name").and_then(|v| v.as_str()) {
Some(n) => n,
None => return Ok("❌ Missing app_name argument".to_string()),
};
let key = match tool_call.args.get("key").and_then(|v| v.as_str()) {
Some(k) => k,
None => return Ok("❌ Missing key argument".to_string()),
};
let modifiers_vec: Vec<&str> = tool_call
.args
.get("modifiers")
.and_then(|v| v.as_array())
.map(|arr| arr.iter().filter_map(|v| v.as_str()).collect())
.unwrap_or_default();
let controller_guard = ctx.macax_controller.read().await;
let controller = match controller_guard.as_ref() {
Some(c) => c,
None => return Ok("❌ macOS Accessibility controller not initialized.".to_string()),
};
match controller.press_key(app_name, key, modifiers_vec.clone()) {
Ok(_) => {
let modifier_str = if modifiers_vec.is_empty() {
String::new()
} else {
format!(" with modifiers: {}", modifiers_vec.join("+"))
};
Ok(format!("✅ Pressed key: {}{}", key, modifier_str))
}
Err(e) => Ok(format!("❌ Failed to press key: {}", e)),
}
}
/// Execute the `macax_type_text` tool.
pub async fn execute_macax_type_text<W: UiWriter>(
tool_call: &ToolCall,
ctx: &ToolContext<'_, W>,
) -> Result<String> {
debug!("Processing macax_type_text tool call");
if !ctx.config.macax.enabled {
return Ok(
"❌ macOS Accessibility is not enabled. Use --macax flag to enable.".to_string(),
);
}
let app_name = match tool_call.args.get("app_name").and_then(|v| v.as_str()) {
Some(n) => n,
None => return Ok("❌ Missing app_name argument".to_string()),
};
let text = match tool_call.args.get("text").and_then(|v| v.as_str()) {
Some(t) => t,
None => return Ok("❌ Missing text argument".to_string()),
};
let controller_guard = ctx.macax_controller.read().await;
let controller = match controller_guard.as_ref() {
Some(c) => c,
None => return Ok("❌ macOS Accessibility controller not initialized.".to_string()),
};
match controller.type_text(app_name, text) {
Ok(_) => Ok(format!("✅ Typed text into {}", app_name)),
Err(e) => Ok(format!("❌ Failed to type text: {}", e)),
}
}

View File

@@ -1,4 +1,4 @@
//! Miscellaneous tools: final_output, take_screenshot, extract_text, code_coverage, code_search. //! Miscellaneous tools: final_output, take_screenshot, code_coverage, code_search.
use anyhow::Result; use anyhow::Result;
use tracing::debug; use tracing::debug;
@@ -118,35 +118,6 @@ pub async fn execute_take_screenshot<W: UiWriter>(
} }
} }
/// Execute the `extract_text` tool.
pub async fn execute_extract_text<W: UiWriter>(
tool_call: &ToolCall,
ctx: &ToolContext<'_, W>,
) -> Result<String> {
debug!("Processing extract_text tool call");
let controller = match ctx.computer_controller {
Some(c) => c,
None => {
return Ok(
"❌ Computer control not enabled. Set computer_control.enabled = true in config."
.to_string(),
)
}
};
let path = tool_call
.args
.get("path")
.and_then(|v| v.as_str())
.ok_or_else(|| anyhow::anyhow!("Missing path argument"))?;
match controller.extract_text_from_image(path).await {
Ok(text) => Ok(format!("✅ Extracted text:\n{}", text)),
Err(e) => Ok(format!("❌ Failed to extract text: {}", e)),
}
}
/// Execute the `code_coverage` tool. /// Execute the `code_coverage` tool.
pub async fn execute_code_coverage<W: UiWriter>( pub async fn execute_code_coverage<W: UiWriter>(
tool_call: &ToolCall, tool_call: &ToolCall,

View File

@@ -6,17 +6,13 @@
//! - `file_ops` - File reading, writing, and editing //! - `file_ops` - File reading, writing, and editing
//! - `todo` - TODO list management //! - `todo` - TODO list management
//! - `webdriver` - Browser automation via WebDriver //! - `webdriver` - Browser automation via WebDriver
//! - `macax` - macOS Accessibility API tools
//! - `vision` - Vision-based text finding and clicking
//! - `misc` - Other tools (screenshots, code search, etc.) //! - `misc` - Other tools (screenshots, code search, etc.)
pub mod executor; pub mod executor;
pub mod file_ops; pub mod file_ops;
pub mod macax;
pub mod misc; pub mod misc;
pub mod shell; pub mod shell;
pub mod todo; pub mod todo;
pub mod vision;
pub mod webdriver; pub mod webdriver;
pub use executor::ToolExecutor; pub use executor::ToolExecutor;

View File

@@ -1,275 +0,0 @@
//! Vision-based tools: vision_find_text, vision_click_text, vision_click_near_text, extract_text_with_boxes.
use anyhow::Result;
use tracing::debug;
use crate::ui_writer::UiWriter;
use crate::ToolCall;
use super::executor::ToolContext;
/// Execute the `vision_find_text` tool.
pub async fn execute_vision_find_text<W: UiWriter>(
tool_call: &ToolCall,
ctx: &ToolContext<'_, W>,
) -> Result<String> {
debug!("Processing vision_find_text tool call");
let controller = match ctx.computer_controller {
Some(c) => c,
None => {
return Ok(
"❌ Computer control not enabled. Set computer_control.enabled = true in config."
.to_string(),
)
}
};
let app_name = tool_call
.args
.get("app_name")
.and_then(|v| v.as_str())
.ok_or_else(|| anyhow::anyhow!("Missing app_name parameter"))?;
let text = tool_call
.args
.get("text")
.and_then(|v| v.as_str())
.ok_or_else(|| anyhow::anyhow!("Missing text parameter"))?;
match controller.find_text_in_app(app_name, text).await {
Ok(Some(location)) => Ok(format!(
"✅ Found '{}' in {} at position ({}, {}) with size {}x{} (confidence: {:.0}%)",
location.text,
app_name,
location.x,
location.y,
location.width,
location.height,
location.confidence * 100.0
)),
Ok(None) => Ok(format!("❌ Could not find '{}' in {}", text, app_name)),
Err(e) => Ok(format!("❌ Error finding text: {}", e)),
}
}
/// Execute the `vision_click_text` tool.
pub async fn execute_vision_click_text<W: UiWriter>(
tool_call: &ToolCall,
ctx: &ToolContext<'_, W>,
) -> Result<String> {
debug!("Processing vision_click_text tool call");
let controller = match ctx.computer_controller {
Some(c) => c,
None => {
return Ok(
"❌ Computer control not enabled. Set computer_control.enabled = true in config."
.to_string(),
)
}
};
let app_name = tool_call
.args
.get("app_name")
.and_then(|v| v.as_str())
.ok_or_else(|| anyhow::anyhow!("Missing app_name parameter"))?;
let text = tool_call
.args
.get("text")
.and_then(|v| v.as_str())
.ok_or_else(|| anyhow::anyhow!("Missing text parameter"))?;
match controller.find_text_in_app(app_name, text).await {
Ok(Some(location)) => {
// Click on center of text
// IMPORTANT: location coordinates are in NSScreen space (Y=0 at BOTTOM, increases UPWARD)
// location.x is the LEFT edge of the bounding box
// location.y is the TOP edge of the bounding box (highest Y value in NSScreen space)
// location.width and location.height are already scaled to screen space
// To get center: we need to add half the SCALED width and subtract half the SCALED height
if location.width == 0 || location.height == 0 {
return Ok(format!(
"❌ Invalid bounding box dimensions: width={}, height={}",
location.width, location.height
));
}
debug!(
"[vision_click_text] Location from find_text_in_app: x={}, y={}, width={}, height={}, text='{}'",
location.x, location.y, location.width, location.height, location.text
);
// Calculate center using the SCALED dimensions
// X: Use right edge instead of center (Vision OCR bounding box seems offset)
// This gives us: left edge + full width = right edge
// Y: top edge - half of scaled height (subtract because Y increases upward)
let click_x = location.x + location.width; // Right edge
let half_height = location.height / 2;
let click_y = location.y - half_height;
debug!(
"[vision_click_text] Click position calculation: x={} + {} = {} (right edge), y={} - {} = {}",
location.x, location.width, click_x, location.y, half_height, click_y
);
match controller.click_at(click_x, click_y, Some(app_name)) {
Ok(_) => Ok(format!(
"✅ Clicked on '{}' in {} at ({}, {})",
text, app_name, click_x, click_y
)),
Err(e) => Ok(format!("❌ Failed to click: {}", e)),
}
}
Ok(None) => Ok(format!("❌ Could not find '{}' in {}", text, app_name)),
Err(e) => Ok(format!("❌ Error finding text: {}", e)),
}
}
/// Execute the `vision_click_near_text` tool.
pub async fn execute_vision_click_near_text<W: UiWriter>(
tool_call: &ToolCall,
ctx: &ToolContext<'_, W>,
) -> Result<String> {
debug!("Processing vision_click_near_text tool call");
let controller = match ctx.computer_controller {
Some(c) => c,
None => {
return Ok(
"❌ Computer control not enabled. Set computer_control.enabled = true in config."
.to_string(),
)
}
};
let app_name = tool_call
.args
.get("app_name")
.and_then(|v| v.as_str())
.ok_or_else(|| anyhow::anyhow!("Missing app_name parameter"))?;
let text = tool_call
.args
.get("text")
.and_then(|v| v.as_str())
.ok_or_else(|| anyhow::anyhow!("Missing text parameter"))?;
let direction = tool_call
.args
.get("direction")
.and_then(|v| v.as_str())
.unwrap_or("right");
let distance = tool_call
.args
.get("distance")
.and_then(|v| v.as_i64())
.unwrap_or(50) as i32;
match controller.find_text_in_app(app_name, text).await {
Ok(Some(location)) => {
// Calculate click position based on direction
// location.x is LEFT edge, location.y is TOP edge (in NSScreen space)
let (click_x, click_y) = match direction {
"right" => (
location.x + location.width + distance,
location.y - (location.height / 2),
),
"below" => (
location.x + (location.width / 2),
location.y - location.height - distance,
),
"left" => (location.x - distance, location.y - (location.height / 2)),
"above" => (location.x + (location.width / 2), location.y + distance),
_ => (
location.x + location.width + distance,
location.y - (location.height / 2),
),
};
debug!(
"[vision_click_near_text] Clicking {} of text at ({}, {})",
direction, click_x, click_y
);
match controller.click_at(click_x, click_y, Some(app_name)) {
Ok(_) => Ok(format!(
"✅ Clicked {} of '{}' in {} at ({}, {})",
direction, text, app_name, click_x, click_y
)),
Err(e) => Ok(format!("❌ Failed to click: {}", e)),
}
}
Ok(None) => Ok(format!("❌ Could not find '{}' in {}", text, app_name)),
Err(e) => Ok(format!("❌ Error finding text: {}", e)),
}
}
/// Execute the `extract_text_with_boxes` tool.
pub async fn execute_extract_text_with_boxes<W: UiWriter>(
tool_call: &ToolCall,
ctx: &ToolContext<'_, W>,
) -> Result<String> {
debug!("Processing extract_text_with_boxes tool call");
if !ctx.config.macax.enabled {
return Ok(
"❌ extract_text_with_boxes requires --macax flag to be enabled".to_string(),
);
}
let controller = match ctx.computer_controller {
Some(c) => c,
None => {
return Ok(
"❌ Computer control not enabled. Set computer_control.enabled = true in config."
.to_string(),
)
}
};
let path = tool_call
.args
.get("path")
.and_then(|v| v.as_str())
.ok_or_else(|| anyhow::anyhow!("Missing path parameter"))?;
// Optional: take screenshot of app first
let final_path = if let Some(app_name) = tool_call.args.get("app_name").and_then(|v| v.as_str())
{
let temp_path = format!("/tmp/g3_extract_boxes_{}.png", uuid::Uuid::new_v4());
match controller
.take_screenshot(&temp_path, None, Some(app_name))
.await
{
Ok(_) => temp_path,
Err(e) => return Ok(format!("❌ Failed to take screenshot: {}", e)),
}
} else {
path.to_string()
};
// Extract text with locations
match controller.extract_text_with_locations(&final_path).await {
Ok(locations) => {
// Clean up temp file if we created one
if final_path != path {
let _ = std::fs::remove_file(&final_path);
}
// Return as JSON
match serde_json::to_string_pretty(&locations) {
Ok(json) => Ok(format!(
"✅ Extracted {} text elements:\n{}",
locations.len(),
json
)),
Err(e) => Ok(format!("❌ Failed to serialize results: {}", e)),
}
}
Err(e) => Ok(format!("❌ Failed to extract text: {}", e)),
}
}

View File

@@ -191,7 +191,6 @@ Key modules:
- `platform/` - Platform-specific implementations (macOS, Linux, Windows) - `platform/` - Platform-specific implementations (macOS, Linux, Windows)
- `webdriver/` - Safari and Chrome WebDriver integration - `webdriver/` - Safari and Chrome WebDriver integration
- `ocr/` - Text extraction (Tesseract, Apple Vision) - `ocr/` - Text extraction (Tesseract, Apple Vision)
- `macax/` - macOS Accessibility API controller
**Platform support**: **Platform support**:
- **macOS**: Core Graphics, Cocoa, screencapture, Vision framework - **macOS**: Core Graphics, Cocoa, screencapture, Vision framework

View File

@@ -27,7 +27,6 @@ G3 uses TOML format. The configuration is organized into sections:
[agent] # Agent behavior settings [agent] # Agent behavior settings
[computer_control] # Mouse/keyboard automation [computer_control] # Mouse/keyboard automation
[webdriver] # Browser automation [webdriver] # Browser automation
[macax] # macOS Accessibility API
``` ```
## Provider Configuration ## Provider Configuration
@@ -236,13 +235,11 @@ apt install chromium-chromedriver
## macOS Accessibility API Configuration ## macOS Accessibility API Configuration
```toml ```toml
[macax]
enabled = false # Set to true to enable enabled = false # Set to true to enable
``` ```
**Required permissions**: System Preferences → Security & Privacy → Privacy → Accessibility → Add your terminal app **Required permissions**: System Preferences → Security & Privacy → Privacy → Accessibility → Add your terminal app
See [macOS Accessibility Tools Guide](macax-tools.md) for detailed usage.
## Multi-Role Configuration ## Multi-Role Configuration
@@ -295,7 +292,6 @@ g3 --model claude-opus-4-5
# Enable features # Enable features
g3 --webdriver # Enable WebDriver (Safari) g3 --webdriver # Enable WebDriver (Safari)
g3 --chrome-headless # Enable WebDriver (Chrome headless) g3 --chrome-headless # Enable WebDriver (Chrome headless)
g3 --macax # Enable macOS Accessibility API
# Specify config file # Specify config file
g3 --config /path/to/config.toml g3 --config /path/to/config.toml
@@ -340,7 +336,6 @@ enabled = true
browser = "safari" browser = "safari"
safari_port = 4444 safari_port = 4444
[macax]
enabled = false enabled = false
``` ```

View File

@@ -1,472 +0,0 @@
# macOS Accessibility Tools Guide
**Last updated**: January 2025
**Source of truth**: `crates/g3-computer-control/src/macax/`
## Purpose
G3 includes tools for controlling macOS applications via the Accessibility API. This enables automation of native macOS apps, including those you're building with G3.
## Overview
The macOS Accessibility API provides programmatic access to UI elements in any application. G3 exposes this through the `macax_*` tools, allowing you to:
- List and activate applications
- Inspect UI element hierarchies
- Find elements by role, title, or identifier
- Click buttons and interact with controls
- Read and set values in text fields
- Simulate keyboard input
## Setup
### 1. Enable in Configuration
```toml
# ~/.config/g3/config.toml
[macax]
enabled = true
```
Or use the CLI flag:
```bash
g3 --macax
```
### 2. Grant Accessibility Permissions
1. Open **System Preferences****Security & Privacy****Privacy**
2. Select **Accessibility** in the left sidebar
3. Click the lock icon and authenticate
4. Add your terminal application (Terminal, iTerm2, etc.)
5. Restart your terminal
**Note**: If using VS Code's integrated terminal, add VS Code to the list.
### 3. Verify Setup
```json
{"tool": "macax_list_apps", "args": {}}
```
This should return a list of running applications.
## Available Tools
### macax_list_apps
List all running applications.
**Parameters**: None
**Example**:
```json
{"tool": "macax_list_apps", "args": {}}
```
**Returns**:
```
Running Applications:
- Safari (com.apple.Safari)
- Finder (com.apple.finder)
- Terminal (com.apple.Terminal)
- MyApp (com.example.myapp)
```
---
### macax_get_frontmost_app
Get the currently active (frontmost) application.
**Parameters**: None
**Example**:
```json
{"tool": "macax_get_frontmost_app", "args": {}}
```
**Returns**:
```
Frontmost Application: Safari (com.apple.Safari)
```
---
### macax_activate_app
Bring an application to the front.
**Parameters**:
- `app_name` (string, required): Application name
**Example**:
```json
{"tool": "macax_activate_app", "args": {"app_name": "Safari"}}
```
---
### macax_get_ui_tree
Get the UI element hierarchy of an application.
**Parameters**:
- `app_name` (string, required): Application name
- `max_depth` (integer, optional): Maximum tree depth (default: 5)
**Example**:
```json
{"tool": "macax_get_ui_tree", "args": {"app_name": "Calculator", "max_depth": 3}}
```
**Returns**:
```
UI Tree for Calculator:
└── AXApplication "Calculator"
└── AXWindow "Calculator"
├── AXGroup
│ ├── AXButton "1" [id: digit_1]
│ ├── AXButton "2" [id: digit_2]
│ ├── AXButton "+" [id: add]
│ └── AXButton "=" [id: equals]
└── AXStaticText "0" [id: display]
```
**Notes**:
- Use lower `max_depth` for complex apps to avoid overwhelming output
- Elements show role, title, and accessibility identifier (if set)
---
### macax_find_elements
Find UI elements matching criteria.
**Parameters**:
- `app_name` (string, required): Application name
- `role` (string, optional): Element role (e.g., "button", "textField")
- `title` (string, optional): Element title/label
- `identifier` (string, optional): Accessibility identifier
**Example**:
```json
{"tool": "macax_find_elements", "args": {
"app_name": "Safari",
"role": "button"
}}
```
**Returns**:
```
Found 5 elements:
1. AXButton "Back" [id: BackButton]
2. AXButton "Forward" [id: ForwardButton]
3. AXButton "Reload" [id: ReloadButton]
4. AXButton "Share" [id: ShareButton]
5. AXButton "New Tab" [id: NewTabButton]
```
---
### macax_click
Click a UI element.
**Parameters**:
- `app_name` (string, required): Application name
- `identifier` (string, optional): Accessibility identifier
- `title` (string, optional): Element title
- `role` (string, optional): Element role
At least one of `identifier`, `title`, or `role` must be provided.
**Examples**:
```json
// Click by identifier (most reliable)
{"tool": "macax_click", "args": {
"app_name": "Calculator",
"identifier": "digit_5"
}}
// Click by title
{"tool": "macax_click", "args": {
"app_name": "Calculator",
"title": "5"
}}
// Click by role and title
{"tool": "macax_click", "args": {
"app_name": "Safari",
"role": "button",
"title": "Reload"
}}
```
---
### macax_set_value
Set the value of a UI element (text fields, sliders, etc.).
**Parameters**:
- `app_name` (string, required): Application name
- `identifier` (string, optional): Accessibility identifier
- `title` (string, optional): Element title
- `value` (string, required): Value to set
**Example**:
```json
{"tool": "macax_set_value", "args": {
"app_name": "TextEdit",
"role": "textArea",
"value": "Hello, World!"
}}
```
---
### macax_get_value
Get the current value of a UI element.
**Parameters**:
- `app_name` (string, required): Application name
- `identifier` (string, optional): Accessibility identifier
- `title` (string, optional): Element title
**Example**:
```json
{"tool": "macax_get_value", "args": {
"app_name": "Calculator",
"identifier": "display"
}}
```
**Returns**:
```
Value: 42
```
---
### macax_press_key
Simulate a key press.
**Parameters**:
- `key` (string, required): Key to press
- `modifiers` (array, optional): Modifier keys
**Supported modifiers**: `command`, `shift`, `option`, `control`
**Examples**:
```json
// Simple key press
{"tool": "macax_press_key", "args": {"key": "a"}}
// With modifiers (Cmd+S)
{"tool": "macax_press_key", "args": {
"key": "s",
"modifiers": ["command"]
}}
// Multiple modifiers (Cmd+Shift+N)
{"tool": "macax_press_key", "args": {
"key": "n",
"modifiers": ["command", "shift"]
}}
// Special keys
{"tool": "macax_press_key", "args": {"key": "return"}}
{"tool": "macax_press_key", "args": {"key": "escape"}}
{"tool": "macax_press_key", "args": {"key": "tab"}}
{"tool": "macax_press_key", "args": {"key": "delete"}}
```
**Special key names**:
- `return`, `enter`
- `escape`, `esc`
- `tab`
- `delete`, `backspace`
- `space`
- `up`, `down`, `left`, `right`
- `home`, `end`, `pageup`, `pagedown`
- `f1` through `f12`
## Common Roles
| Role | Description |
|------|-------------|
| `button` | Clickable button |
| `textField` | Single-line text input |
| `textArea` | Multi-line text input |
| `checkbox` | Checkbox control |
| `radioButton` | Radio button |
| `popUpButton` | Dropdown/popup menu |
| `slider` | Slider control |
| `table` | Table view |
| `list` | List view |
| `outline` | Outline/tree view |
| `group` | Container group |
| `window` | Application window |
| `sheet` | Modal sheet |
| `dialog` | Dialog window |
| `staticText` | Non-editable text |
| `image` | Image element |
| `scrollArea` | Scrollable container |
| `toolbar` | Toolbar |
| `menuBar` | Menu bar |
| `menu` | Menu |
| `menuItem` | Menu item |
## Best Practices
### 1. Use Accessibility Identifiers
When building apps you'll automate with G3, add accessibility identifiers:
**SwiftUI**:
```swift
Button("Submit") { ... }
.accessibilityIdentifier("submit_button")
```
**UIKit**:
```swift
button.accessibilityIdentifier = "submit_button"
```
**AppKit**:
```swift
button.setAccessibilityIdentifier("submit_button")
```
Identifiers are more reliable than titles (which may be localized).
### 2. Inspect Before Automating
Always inspect the UI tree first:
```json
{"tool": "macax_get_ui_tree", "args": {"app_name": "MyApp", "max_depth": 4}}
```
This helps you understand:
- Element hierarchy
- Available identifiers
- Correct role names
### 3. Activate App First
Some actions require the app to be frontmost:
```json
{"tool": "macax_activate_app", "args": {"app_name": "MyApp"}}
{"tool": "macax_click", "args": {"app_name": "MyApp", "identifier": "button1"}}
```
### 4. Handle Timing
UI updates may take time. If an element isn't found:
1. Wait briefly
2. Retry the operation
3. Check if the app state changed
### 5. Prefer Identifiers Over Titles
```json
// Good: Uses identifier
{"tool": "macax_click", "args": {"app_name": "MyApp", "identifier": "save_btn"}}
// Less reliable: Uses title (may be localized)
{"tool": "macax_click", "args": {"app_name": "MyApp", "title": "Save"}}
```
## Example: Automating Calculator
```json
// 1. Activate Calculator
{"tool": "macax_activate_app", "args": {"app_name": "Calculator"}}
// 2. Inspect UI
{"tool": "macax_get_ui_tree", "args": {"app_name": "Calculator", "max_depth": 3}}
// 3. Click "5"
{"tool": "macax_click", "args": {"app_name": "Calculator", "title": "5"}}
// 4. Click "+"
{"tool": "macax_click", "args": {"app_name": "Calculator", "title": "+"}}
// 5. Click "3"
{"tool": "macax_click", "args": {"app_name": "Calculator", "title": "3"}}
// 6. Click "="
{"tool": "macax_click", "args": {"app_name": "Calculator", "title": "="}}
// 7. Read result
{"tool": "macax_get_value", "args": {"app_name": "Calculator", "role": "staticText"}}
```
## Troubleshooting
### "Accessibility permission denied"
1. Check System Preferences → Security & Privacy → Accessibility
2. Ensure your terminal app is listed and checked
3. Restart the terminal after granting permission
### "Application not found"
1. Use exact app name (case-sensitive)
2. Run `macax_list_apps` to see available apps
3. App must be running
### "Element not found"
1. Inspect UI tree to verify element exists
2. Check identifier/title spelling
3. Element may be in a different window or sheet
4. App state may have changed
### "Cannot perform action"
1. Element may be disabled
2. App may need to be frontmost
3. Element may not support the action
4. Check element role supports the operation
### Slow Performance
1. Reduce `max_depth` in `macax_get_ui_tree`
2. Use specific identifiers instead of searching
3. Complex apps have large UI trees
## Comparison with Other Tools
| Feature | macax | Vision Tools | WebDriver |
|---------|-------|--------------|----------|
| Native apps | ✅ | ✅ (via OCR) | ❌ |
| Web browsers | ✅ | ✅ | ✅ |
| Electron apps | ✅ | ✅ | Partial |
| Reliability | High | Medium | High |
| Setup | Permissions | None | Driver |
| Speed | Fast | Slower | Medium |
**Use macax when**:
- Automating native macOS apps
- You control the app and can add identifiers
- Need reliable, fast automation
**Use Vision tools when**:
- App doesn't expose accessibility
- Need to find text visually
- Cross-platform approach needed
**Use WebDriver when**:
- Automating web content
- Need JavaScript execution
- Testing web applications

View File

@@ -12,12 +12,10 @@ This document describes all tools available to the G3 agent. Tools are the prima
| Category | Tools | Enabled By | | Category | Tools | Enabled By |
|----------|-------|------------| |----------|-------|------------|
| **Core** | shell, read_file, write_file, str_replace, final_output, background_process | Always | | **Core** | shell, read_file, write_file, str_replace, final_output, background_process | Always |
| **Images** | read_image, take_screenshot, extract_text | Always | | **Images** | read_image, take_screenshot | Always |
| **Task Management** | todo_read, todo_write | Always | | **Task Management** | todo_read, todo_write | Always |
| **Code Intelligence** | code_search, code_coverage | Always | | **Code Intelligence** | code_search, code_coverage | Always |
| **WebDriver** | webdriver_* (12 tools) | `--webdriver` or `--chrome-headless` | | **WebDriver** | webdriver_* (12 tools) | `--webdriver` or `--chrome-headless` |
| **Vision** | vision_find_text, vision_click_text, vision_click_near_text | Always (macOS) |
| **macOS Accessibility** | macax_* (9 tools) | `--macax` |
| **Computer Control** | mouse_click, type_text, find_element, list_windows | `computer_control.enabled = true` | | **Computer Control** | mouse_click, type_text, find_element, list_windows | `computer_control.enabled = true` |
--- ---
@@ -82,7 +80,6 @@ Read file contents with optional character range.
``` ```
**Notes**: **Notes**:
- For image files (png, jpg, gif, etc.), automatically extracts text using OCR
- Supports tilde expansion (`~`) - Supports tilde expansion (`~`)
- Reports file size and line count - Reports file size and line count
@@ -105,7 +102,6 @@ Read image files for visual analysis by the LLM.
**Notes**: **Notes**:
- Images are sent to the LLM for visual analysis - Images are sent to the LLM for visual analysis
- Use for inspecting sprites, UI screenshots, diagrams, etc. - Use for inspecting sprites, UI screenshots, diagrams, etc.
- Different from `extract_text` which only does OCR
--- ---
@@ -197,23 +193,6 @@ Capture a screenshot of an application window.
--- ---
### extract_text
Extract text from an image using OCR.
**Parameters**:
- `path` (string, optional): Path to image file
**Example**:
```json
{"tool": "extract_text", "args": {"path": "screenshot.png"}}
```
**Notes**:
- Uses Tesseract OCR or Apple Vision framework
- For window-based OCR, use `vision_find_text` instead
---
## Task Management Tools ## Task Management Tools
@@ -386,98 +365,7 @@ Close browser and end session.
--- ---
## Vision Tools (macOS)
Use Apple Vision framework for text recognition.
### vision_find_text
Find text in an application window.
**Parameters**:
- `app_name` (string, required): Application name
- `text` (string, required): Text to search for
**Returns**: Bounding box coordinates and confidence score
### vision_click_text
Find and click on text.
**Parameters**:
- `app_name` (string, required): Application name
- `text` (string, required): Text to click
### vision_click_near_text
Click near a text label (useful for form fields).
**Parameters**:
- `app_name` (string, required): Application name
- `text` (string, required): Label text to find
- `direction` (string, optional): "right", "below", "left", "above" (default: "right")
- `distance` (integer, optional): Pixels from text (default: 50)
---
## macOS Accessibility Tools
Enabled with `--macax`. See [macOS Accessibility Tools Guide](macax-tools.md).
### macax_list_apps
List running applications.
### macax_get_frontmost_app
Get the frontmost application.
### macax_activate_app
Bring an application to front.
**Parameters**:
- `app_name` (string, required): Application name
### macax_get_ui_tree
Get UI element hierarchy.
**Parameters**:
- `app_name` (string, required): Application name
- `max_depth` (integer, optional): Tree depth limit
### macax_find_elements
Find UI elements by criteria.
**Parameters**:
- `app_name` (string, required): Application name
- `role` (string, optional): Element role (button, textField, etc.)
- `title` (string, optional): Element title
- `identifier` (string, optional): Accessibility identifier
### macax_click
Click a UI element.
**Parameters**:
- `app_name` (string, required): Application name
- `identifier` or `title` or `role`: Element selector
### macax_set_value / macax_get_value
Set or get element value.
### macax_press_key
Simulate key press.
**Parameters**:
- `key` (string, required): Key to press
- `modifiers` (array, optional): ["command", "shift", "option", "control"]
---
## Computer Control Tools ## Computer Control Tools