From 2488cc54d59451cdf80158f829045eb6035e18bf Mon Sep 17 00:00:00 2001
From: Dhanji Prasanna
Date: Mon, 20 Oct 2025 15:03:22 +1100
Subject: [PATCH] docs: update README and DESIGN to reflect current project state

- Add g3-computer-control crate to architecture documentation
- Document all 13 tools including computer control and TODO management
- Add context thinning feature documentation (50-80% thresholds)
- Update tool ecosystem section with complete tool list
- Remove broken link to non-existent COMPUTER_CONTROL.md
- Update workspace count from 5 to 6 crates
- Add platform-specific implementation details for computer control
- Document OCR support via Tesseract
- Clarify setup instructions for computer control features
---
 DESIGN.md | 62 ++++++++++++++++++++++++++++++++++++++++++++++++-------
 README.md | 42 ++++++++++++++-----------------------
 2 files changed, 71 insertions(+), 33 deletions(-)

diff --git a/DESIGN.md b/DESIGN.md
index aabee07..4e25b24 100644
--- a/DESIGN.md
+++ b/DESIGN.md
@@ -29,7 +29,8 @@ g3/
 │ ├── g3-core/ # Core agent engine, tools, and streaming logic
 │ ├── g3-providers/ # LLM provider abstractions and implementations
 │ ├── g3-config/ # Configuration management
-│ └── g3-execution/ # Code execution engine
+│ ├── g3-execution/ # Code execution engine
+│ └── g3-computer-control/ # Computer control and automation
 ├── logs/ # Session logs (auto-created)
 ├── README.md # Project documentation
 └── DESIGN.md # This design document
@@ -48,6 +49,7 @@ g3/
 │ • Retro TUI │ │ • Tool system │ │ • Embedded │
 │ • Autonomous │ │ • Streaming │ │ (llama.cpp) │
 │ mode │ │ • Task exec │ │ • OAuth flow │
+│ │ │ • TODO mgmt │ │ │
 └─────────────────┘ └─────────────────┘ └─────────────────┘
 │ │ │
 └───────────────────────┼───────────────────────┘
@@ -59,7 +61,18 @@ g3/
 │ • Shell cmds │ │ • Env overrides │
 │ • Streaming │ │ • Provider │
 │ • Error hdlg │ │ settings │
- └─────────────────┘ └─────────────────┘
+ └─────────────────┘ │ • Computer │
+ │ │ control cfg │
+ │ └─────────────────┘
+ │ │
+ ┌─────────────────┐ │
+ │ g3-computer- │◄────────────┘
+ │ control │
+ │ • Mouse/kbd │
+ │ • Screenshots │
+ │ • OCR/Tesseract │
+ │ • Windows/UI │
+ └─────────────────┘
 ```
 
 ## Core Components
@@ -79,6 +92,7 @@ g3/
 - **Streaming Parser**: Real-time parsing of LLM responses with tool call detection and execution
 - **Session Management**: Automatic session logging with detailed conversation history and token usage
 - **Error Recovery**: Sophisticated error classification and retry logic for recoverable errors
+- **TODO Management**: In-memory TODO list with read/write tools for task tracking
 
 **Available Tools:**
 - `shell`: Execute shell commands with streaming output
@@ -86,7 +100,15 @@ g3/
 - `write_file`: Create or overwrite files with content
 - `str_replace`: Apply unified diffs to files with precise editing
 - `final_output`: Signal task completion with detailed summaries
-- **Project Management**: Workspace handling, requirements.md processing for autonomous mode
+- `todo_read`: Read the entire TODO list content
+- `todo_write`: Write or overwrite the entire TODO list
+- `mouse_click`: Click the mouse at specific coordinates
+- `type_text`: Type text at the current cursor position
+- `find_element`: Find UI elements by text, role, or attributes
+- `take_screenshot`: Capture screenshots of screen, region, or window
+- `extract_text`: Extract text from images or screen regions using OCR
+- `find_text_on_screen`: Find text visually on screen and return coordinates
+- `list_windows`: List all open windows with IDs and titles
 
 ### 2. g3-providers: LLM Provider Abstraction
 
@@ -172,6 +194,26 @@ g3/
 - **Validation**: Configuration validation with helpful error messages
 - **Flexible Paths**: Support for shell expansion (`~`, environment variables)
 
+### 6. g3-computer-control: Computer Control & Automation
+
+**Primary Responsibilities:**
+- Cross-platform computer control and automation
+- Mouse and keyboard input simulation
+- Window management and screenshot capture
+- OCR text extraction from images and screen regions
+
+**Platform Support:**
+- **macOS**: Core Graphics, Cocoa, screencapture integration
+- **Linux**: X11/Xtest for input, X11 for window management
+- **Windows**: Win32 APIs for input and window control
+
+**Key Features:**
+- **OCR Integration**: Tesseract-based text extraction from images
+- **Window Management**: List, identify, and capture specific application windows
+- **UI Automation**: Find elements, simulate clicks, type text
+- **Screenshot Capture**: Full screen, regions, or specific windows
+- **Accessibility**: Requires OS-level permissions for automation
+
 ## Advanced Features
 
 ### Context Window Management
@@ -180,6 +222,7 @@ G3 implements sophisticated context window management:
 
 - **Automatic Monitoring**: Tracks token usage with percentage-based thresholds
 - **Smart Summarization**: Auto-triggers at 80% capacity to prevent context overflow
+- **Context Thinning**: Progressive thinning at 50%, 60%, 70%, 80% thresholds - replaces large tool results with file references
 - **Conversation Preservation**: Maintains conversation continuity through intelligent summaries
 - **Provider-Specific Limits**: Adapts to different model context windows (4k to 200k+ tokens)
 - **Cumulative Tracking**: Monitors total token usage across entire sessions
@@ -354,20 +397,23 @@ This design document reflects the current state of G3 as a mature, production-re
 ### Fully Implemented
 - ✅ **Core Agent Engine**: Complete with streaming, tool execution, and context management
 - ✅ **Provider System**: Anthropic, Databricks, and Embedded providers with OAuth support
-- ✅ **Tool System**: All 5 core tools (shell, read_file, write_file, str_replace, final_output)
+- ✅ **Tool System**: 13 tools including file ops, shell, TODO management, and computer control
 - ✅ **CLI Interface**: Interactive mode, single-shot mode, retro TUI
 - ✅ **Autonomous Mode**: Coach-player feedback loop with requirements.md processing
 - ✅ **Configuration**: TOML-based config with environment overrides
 - ✅ **Error Handling**: Comprehensive retry logic and error classification
 - ✅ **Session Logging**: Automatic session tracking and JSON logs
-- ✅ **Context Management**: Auto-summarization at 80% capacity
+- ✅ **Context Management**: Context thinning (50-80%) and auto-summarization at 80% capacity
+- ✅ **Computer Control**: Cross-platform automation with OCR support
+- ✅ **TODO Management**: In-memory TODO list with read/write tools
 
 ### Architecture Highlights
-- **Workspace**: 5 crates with clear separation of concerns
+- **Workspace**: 6 crates with clear separation of concerns
 - **Dependencies**: Modern Rust ecosystem (Tokio, Clap, Serde, etc.)
 - **Streaming**: Real-time response processing with tool call detection
 - **Cross-Platform**: Works on macOS, Linux, and Windows
-- **GPU Support**: Metal acceleration for local models on macOS
+- **GPU Support**: Metal acceleration for local models on macOS, CUDA on Linux
+- **OCR Support**: Tesseract integration for text extraction from images
 
 ### Key Files
 - `src/main.rs`: main entry point delegating to g3-cli
@@ -376,3 +422,5 @@ This design document reflects the current state of G3 as a mature, production-re
 - `crates/g3-providers/src/lib.rs`: provider trait and registry
 - `crates/g3-config/src/lib.rs`: configuration management
 - `crates/g3-execution/src/lib.rs`: code execution engine
+- `crates/g3-computer-control/src/lib.rs`: computer control and automation
+- `crates/g3-computer-control/src/platform/`: platform-specific implementations
diff --git a/README.md b/README.md
index 2310343..f9faf75 100644
--- a/README.md
+++ b/README.md
@@ -11,8 +11,8 @@ G3 follows a modular architecture organized as a Rust workspace with multiple cr
 #### **g3-core**
 The heart of the agent system, containing:
 - **Agent Engine**: Main orchestration logic for handling conversations, tool execution, and task management
-- **Context Window Management**: Intelligent tracking of token usage with auto-summarization capabilities when approaching context limits (~80% capacity)
-- **Tool System**: Built-in tools for file operations (read, write, edit), shell command execution, and structured output generation
+- **Context Window Management**: Intelligent tracking of token usage with context thinning (50-80%) and auto-summarization at 80% capacity
+- **Tool System**: Built-in tools for file operations, shell commands, computer control, TODO management, and structured output
 - **Streaming Response Parser**: Real-time parsing of LLM responses with tool call detection and execution
 - **Task Execution**: Support for single and iterative task execution with automatic retry logic
 
@@ -44,8 +44,8 @@ Task execution framework:
 Computer control capabilities:
 - Mouse and keyboard automation
 - UI element inspection and interaction
-- Screenshot capture
-- OCR text extraction
+- Screenshot capture and window management
+- OCR text extraction via Tesseract
 
 #### **g3-cli**
 Command-line interface:
@@ -68,19 +68,21 @@ G3 includes robust error handling with automatic retry logic:
 ### Intelligent Context Management
 - Automatic context window monitoring with percentage-based tracking
 - Smart auto-summarization when approaching token limits
+- **Context thinning** at 50%, 60%, 70%, 80% thresholds - automatically replaces large tool results with file references
 - Conversation history preservation through summaries
-- Dynamic token allocation for different providers
+- Dynamic token allocation for different providers (4k to 200k+ tokens)
 
 ### Tool Ecosystem
 - **File Operations**: Read, write, and edit files with line-range precision
 - **Shell Integration**: Execute system commands with output capture
 - **Code Generation**: Structured code generation with syntax awareness
+- **TODO Management**: Read and write TODO lists with markdown checkbox format
 - **Computer Control** (Experimental): Automate desktop applications
-  - **OCR Support**: Extract and find text from images and screen regions using Tesseract
   - Mouse and keyboard control
   - UI element inspection
-  - Screenshot capture
-  - See [Computer Control Guide](docs/COMPUTER_CONTROL.md) for details
+  - Screenshot capture and window management
+  - OCR text extraction from images and screen regions
+  - Window listing and identification
 - **Final Output**: Formatted result presentation
 
 ### Provider Flexibility
@@ -111,7 +113,7 @@ G3 is designed for:
 - Automated code generation and refactoring
 - File manipulation and project scaffolding
 - System administration tasks
-- Data processing and transformation
+- Data processing and transformation
 - API integration and testing
 - Documentation generation
 - Complex multi-step workflows
@@ -134,24 +136,12 @@ g3 "implement a function to calculate fibonacci numbers"
 
 G3 can interact with your computer's GUI for automation tasks:
 
-### Setup
+**Available Tools**: `mouse_click`, `type_text`, `find_element`, `take_screenshot`, `extract_text`, `find_text_on_screen`, `list_windows`
 
-1. Enable in config:
-```toml
-[computer_control]
-enabled = true
-```
-
-2. Grant OS permissions:
-   - **macOS**: System Preferences → Security & Privacy → Accessibility
-   - **Linux**: Ensure X11 or Wayland access
-   - **Windows**: Run as administrator (first time only)
-
-3. Use computer control:
-```bash
-```
-
-See [Computer Control Guide](docs/COMPUTER_CONTROL.md) for detailed documentation.
+**Setup**: Enable in config with `computer_control.enabled = true` and grant OS accessibility permissions:
+- **macOS**: System Preferences → Security & Privacy → Accessibility
+- **Linux**: Ensure X11 or Wayland access
+- **Windows**: Run as administrator (first time only)
 
 ## Session Logs