- Add auto-approval logic in execute_plan_write() when ctx.is_autonomous is true
- Update system prompt to document auto-approval behavior
- Plans still require explicit approval in interactive mode
Added plan_approve to the compact tool list in format_tool_result_summary()
so it displays in the same format as other tools like read_file and write_file.
The format_plan_approve_summary() function already existed but was never
called because plan_approve was missing from the matches! block.
Plan tools (plan_read, plan_write) now display with elegant tree-style
formatting similar to the old todo_write UI:
- State indicators: □ (todo), ◐ (doing), ■ (done), ⊘ (blocked)
- Tree prefixes (├/└) for items with child details
- Strikethrough for completed items
- Shows touches and all three checks (happy/negative/boundary)
- Displays plan file path link at the end
plan_approve uses compact single-line format like read_file:
- Shows approval status and revision number
- Handles already-approved and error cases
Changes:
- Add print_plan_compact() to UiWriter trait with default impl
- Implement print_plan_compact() in ConsoleUiWriter
- Call print_plan_compact() from execute_plan_read/write
- Add plan_read/plan_write to is_self_handled_tool()
- Add plan_approve to is_compact_tool() with format_plan_approve_summary()
- Add serde_yaml dependency to g3-cli
Adds a verification system that checks evidence in completed plan items:
- Evidence parsing: supports code locations (file:line, file:line-line, file only)
and test references (file::test_name)
- Code location verification: checks file exists, validates line numbers in range
- Test reference verification: checks test file exists, searches for fn pattern
- Verification results: Verified, Warning, Error, Skipped statuses
- Loud output formatting with emoji indicators for warnings/errors
- Integration with execute_plan_write(): runs when plan is complete and approved
- 12 new unit tests covering parsing and verification
Warnings are advisory (don't block), errors are loud but also don't block.
Blocked items are skipped during verification.
Plan Mode is a cognitive forcing system that requires reasoning about:
- Happy path
- Negative case
- Boundary condition
New tools:
- plan_read: Read current plan for session
- plan_write: Create/update plan with YAML content (validates structure)
- plan_approve: Mark current revision as approved
New command:
- /feature <description>: Start Plan Mode for a new feature
Plan schema requires:
- plan_id, revision, approved_revision
- items with id, description, state, touches, checks (happy/negative/boundary)
- evidence and notes required when marking items done
Verification:
- plan_verify() called automatically when all items are done/blocked
Removed:
- todo_read, todo_write tools
- todo.rs module and related tests
Fixes issues in the last 11 commits:
1. pending_research.rs: Fix flaky test_generate_id_uniqueness
- Replaced random u16 suffix with atomic counter for guaranteed uniqueness
- The timestamp+random approach could collide when generating IDs rapidly
- Now uses static AtomicU32 counter that increments monotonically
2. embedded/adapters/glm.rs: Remove unused in_code_fence field
- Field was written but never read (dead code)
- Removed from struct definition, constructor, and reset()
3. embedded/adapters/glm.rs: Fix orphaned tests
- Two tests (test_strip_code_fences, test_code_fenced_tool_call) were
outside the #[cfg(test)] mod tests block
- Moved closing brace to include them in the test module
All 446 library tests pass.
Agent: fowler
When the LLM 'stutters' and emits incomplete tool call fragments like:
{"tool": "shell", "args": {...}}
{"tool":
{"tool": "shell", "args": {...}}
The parser would get stuck waiting for the incomplete fragment to complete,
causing the entire response to be lost (no tool executed, no text displayed).
This was observed in butler session butler_c6ab59af2e4f991c where the user's
'send!' command produced no response.
Fix: Enhanced is_json_invalidated() to detect when a new tool call pattern
({"tool"}) appears after a newline while parsing an incomplete JSON fragment.
This indicates the previous fragment was abandoned and should be invalidated.
Safety:
- Tool patterns inside JSON strings (e.g., writing example code) are not
affected because the check only runs outside strings
- Added tests for the stuttering pattern and the file-writing edge case
When background research completes, g3 now immediately prints a status
message instead of waiting for the next user interaction:
- Added ResearchCompletionNotification and broadcast channel to
PendingResearchManager for push-based notifications
- Added spawn_research_notification_handler() in interactive mode that
listens for completions in a background task
- When idle (at prompt): clears line, prints status, reprints prompt
- When busy (processing): prints status inline (interleaving is fine)
- Added G3Status::research_complete() for consistent formatting
- Added enable_research_notifications() method to Agent
Output format: "g3: 1 research report ... [done]"
The research tool now spawns the scout agent in a background tokio task
and returns immediately with a research_id placeholder. This allows the
agent to continue working while research runs (30-120 seconds).
Key changes:
- New PendingResearchManager for tracking async research tasks
- research tool returns immediately with placeholder containing research_id
- research_status tool to check progress of pending research
- Auto-injection of completed research at natural break points:
- Start of each tool iteration (before LLM call)
- Before prompting user in interactive mode
- /research CLI command to list all research tasks
- Updated system prompt to explain async behavior
The agent can:
- Continue with other work while research runs
- Check status with research_status tool
- Yield turn to user if results are critical before continuing
README.md is no longer auto-loaded into the LLM context at startup.
This saves ~4,600 tokens per session while AGENTS.md and memory.md
still provide all critical information for code tasks.
Changes:
- Delete read_project_readme() function
- Remove readme_content parameter from combine_project_content()
- Rename extract_readme_heading() -> extract_project_heading()
- Rename Agent constructors: *_with_readme_* -> *_with_project_context_*
- Update context preservation to only check for Agent Configuration
- Remove has_readme field from LoadedContent
- Update all tests to use new markers and function names
The LLM can still read README.md on-demand via read_file when needed.
- Add GeminiProvider with streaming and native tool calling
- Support gemini-2.5-pro, gemini-2.0-flash, gemini-1.5-pro/flash models
- Model-specific context window detection (1M-2M tokens)
- Message conversion: assistant -> model role mapping
- System messages extracted to system_instruction field
- Tool schema conversion with functionCall/functionResponse parts
- SSE streaming with JSON array buffer parsing
- 8 unit tests for conversion and parsing logic
- Register provider in g3-core and validate in g3-cli
The resolve_max_tokens() function was returning 2048 for embedded providers,
which caused responses to be truncated prematurely. Increased to 8192 to
allow the provider's own effective_max_tokens() calculation to work properly.
- Add context_window_size() method to LLMProvider trait
- Implement for EmbeddedProvider to return the auto-detected context length
- Update Agent to query provider directly instead of using hardcoded defaults
- Removes need for model-specific context length mappings
Eliminate code-path aliasing in Agent construction methods by introducing
a single `build_agent()` helper that all constructors delegate to.
Before: 3 nearly-identical `Ok(Self { ... })` blocks (~30 lines each)
with subtle differences in auto_compact, is_autonomous, quiet, and
computer_controller fields - prone to drift over time.
After: Single canonical `build_agent()` method that constructs Agent
with all fields. All public constructors delegate to this single path:
- new_for_test() -> new_for_test_with_readme() -> build_agent()
- new_with_mode_and_readme() -> build_agent()
Changes:
- Add `build_agent()` private helper method (single source of truth)
- Simplify `new_for_test()` to delegate to `new_for_test_with_readme()`
- Update `new_for_test_with_readme()` to use `build_agent()`
- Update `new_with_mode_and_readme()` to use `build_agent()`
Net reduction: ~43 lines (-109/+66)
All 190 tests pass.
Agent: fowler
- Extend Usage struct with cache_creation_tokens and cache_read_tokens fields
- Parse Anthropic cache_creation_input_tokens and cache_read_input_tokens
- Parse OpenAI prompt_tokens_details.cached_tokens for automatic prefix caching
- Add CacheStats struct to Agent for cumulative tracking across API calls
- Add "Prompt Cache Statistics" section to /stats output showing:
- API call count and cache hit count
- Hit rate percentage
- Total input tokens and cache read/creation tokens
- Cache efficiency (% of input served from cache)
- Update all provider implementations and test files
- Fix test_rehydrate_success race condition by using UUID for unique session IDs
- Add #[serial] attribute to prevent parallel execution conflicts
- Improve cleanup to remove entire session directory tree
- Add characterization test for resize_image_to_dimensions fallback behavior
(documents fix from commit af8b849 for media type preservation)
Agent: hopper
The previous implementation added the summary as a System message, which
caused "Conversation must start with a user message" errors because the
first non-system message after compaction was Assistant (the preserved
last assistant message).
Fix: Change summary from System to User message, creating valid alternation:
[System Prompt] -> [Summary as USER] -> [Last Assistant] -> [Latest User]
This also prevents system message bloat across multiple compactions since
the summary is now part of the conversation flow and gets replaced on
each compaction.
Added test_second_compaction_no_bloat to verify no accumulation.
When context window compaction occurs, the last assistant message is now
preserved in addition to the system prompt, README, and summary. This
improves continuity after compaction by keeping the LLM's most recent
response, which often contains important context about what was just
done or what comes next.
New message order after compaction:
[System Prompt] -> [README/AGENTS.md] -> [ACD Stub?] -> [Summary] -> [Last Assistant] -> [Latest User?]
Changes:
- Add last_assistant_message field to PreservedMessages struct
- Modify extract_preserved_messages() to find last assistant message
- Modify reset_with_summary_and_stub() to include last assistant message
- Add comprehensive integration tests using MockProvider
Tests cover edge cases:
- No assistant message exists
- Tool-call-only assistant messages (still preserved)
- Multiple assistant messages (only last one preserved)
- No trailing user message
When resize_image_to_dimensions() returns a larger file than the original,
we fall back to using the original bytes. Previously, was_resized was set
to true if the original dimensions exceeded MAX_IMAGE_DIMENSION, which
caused final_media_type to be set to 'image/jpeg' even though we were
using the original PNG bytes.
This caused Anthropic API errors like:
'Image does not match the provided media type image/jpeg'
Fix: Set was_resized=false when falling back to original bytes, so the
original media type (detected from magic bytes) is preserved.
Rename all references from "Project Memory" to "Workspace Memory" to avoid
future conflation if a "project" concept is introduced later.
Changes:
- Rename read_project_memory() -> read_workspace_memory()
- Update all prompts, tool descriptions, and comments
- Update header parsing in memory.rs to use "# Workspace Memory"
- Update display detection for "=== Workspace Memory ==="
- Update documentation and analysis/memory.md
11 files changed, ~36 occurrences updated.
Warnings fixed:
- Remove unused 'warn' import from retry.rs
- Prefix unused 'output' param with underscore
- Prefix unused 'rel_start' with underscore
- Add #[allow(dead_code)] to G3Status::info()
Message format tweaked per feedback:
- 'g3: model overloaded [error]' (no attempt info)
- 'g3: retrying in 2.2s (1/3) ... [done]' (attempt info moved here)
- Handle empty error message in Status::Error to show just '[error]'
The prefix was causing duplication when users typed 'Task: ...' themselves,
resulting in '📋 Task: Task: ...' in context dumps.
User messages are now stored as-is without any prefix.
Change from multi-line verbose format to single-line compact format:
Before:
⚡ DEHYDRATED CONTEXT (fragment_id: 188c7ac71613)
• 8 messages (4 user, 4 assistant)
• 3 tool calls (shell ×3)
• ~299 tokens saved
To restore this history, call: rehydrate(fragment_id: "188c7ac71613")
After:
⚡ DEHYDRATED CONTEXT: 3 tool calls (shell x3), 8 total msgs. To restore, call: rehydrate(fragment_id: "188c7ac71613")
- Combine all info into single line
- Remove tokens saved (not essential for rehydration decision)
- Use ASCII 'x' instead of '×' for simplicity
- Add 'no tool calls' case for fragments without tools
- Update related tests
Centralize tool output formatting logic that was duplicated/scattered in
stream_completion_with_tools(). This eliminates code-path aliasing where
tool type checks were done in multiple places.
Changes:
- Add ToolOutputFormat enum (SelfHandled, Compact, Regular)
- Add format_tool_result_summary() for centralized formatting decisions
- Add is_compact_tool() and is_self_handled_tool() helper functions
- Move parse_diff_stats() from lib.rs to streaming.rs
- Simplify tool execution display logic in lib.rs using new helpers
Net effect: -86 lines in lib.rs, +112 lines in streaming.rs
The streaming.rs additions are reusable, well-named functions.
All 585+ workspace tests pass.
Agent: fowler
Consolidate scattered state variables in the 834-line stream_completion_with_tools()
function to use the existing StreamingState and IterationState structs from
streaming.rs. This eliminates code-path aliasing where state was tracked in
multiple places and makes the streaming loop easier to reason about.
Changes:
- Add assistant_message_added field to StreamingState
- Add stream_stop_reason field to IterationState
- Replace 8 inline state variables with StreamingState::new()
- Replace 7 iteration-local variables with IterationState::new()
- All 585 workspace tests pass
This is a pure refactor with no behavior changes. The state structs were already
defined in streaming.rs but not used in the main streaming loop.
Agent: fowler
Extract a new g3_status module in g3-cli that provides consistent formatting
for all 'g3:' prefixed system status messages.
Key changes:
- Add G3Status struct with methods for progress, done, failed, error, etc.
- Add Status enum with Done, Failed, Error, Resolved, Insufficient, NoChanges
- Add ThinResult struct in g3-core for semantic thinning data
- Update UiWriter trait with print_thin_result() method
- Refactor context thinning to return ThinResult instead of formatted strings
- Update all callers to use the new centralized formatting
- Session resume/decline messages now use G3Status
- Compaction status messages now use G3Status
This maintains clean separation of concerns: g3-core emits semantic data,
g3-cli handles all terminal formatting and colors.
Adds 8 unit tests verifying:
- Research tool has 20-minute timeout
- All other tools (shell, read_file, write_file, str_replace, code_search,
webdriver_*, etc.) have standard 8-minute timeout
- Comprehensive test_only_research_has_extended_timeout covers 19 tools
This ensures future changes don't accidentally affect other tool timeouts.
The research tool often runs past 8 minutes due to web browsing and
analysis. Increased its timeout to 20 minutes while keeping other
tools at 8 minutes.
Changes:
- Tool timeout is now tool-specific (20 min for research, 8 min for others)
- Timeout error message now shows the correct duration for each tool
Adds 3 new tests to json_parsing_stress_test.rs:
- test_tool_result_with_json_not_parsed: Full agent integration test proving
that JSON in tool results (sent TO the LLM) is never parsed by the
streaming parser (which only sees LLM output)
- test_parser_only_processes_completion_chunks: Documents that StreamingToolParser
only accepts CompletionChunk, not Message objects
- test_architectural_separation_documented: Documents the data flow showing
tool results flow TO the LLM while the parser only sees FROM the LLM
This proves the architectural guarantee: there is no code path where
tool result content could be parsed as a tool call, because:
1. Tool results are Message objects added to context_window
2. The streaming parser only processes CompletionChunk from provider.stream_completion()
3. These are completely separate data types flowing in opposite directions
Total: 41 JSON parsing stress tests now pass.
The bug was caused by mark_tool_calls_consumed() being called after
displaying each chunk, which advanced last_consumed_position to the
end of the current buffer. When the next chunk arrived with JSON,
the unchecked_buffer started at position 0 of the slice, causing
is_on_own_line() to return true (position 0 is always "on its own line").
Removed the problematic mark_tool_calls_consumed() call from the
"no tool executed" branch. The remaining call after actual tool
execution is correct and necessary.
Added integration test that verifies inline JSON in prose is not
detected as a tool call.
Bug: When the LLM responded with text-only (no tool calls), the assistant
message was sometimes not saved to the context window. This caused consecutive
user messages where the LLM would lose track of previous responses.
Root causes found and fixed:
1. Early return path (line ~2535): When stream finishes with no tools executed
in previous iterations (any_tool_executed=false), the code returned early
without saving the assistant message. Fixed by adding save before return.
2. Post-loop path (line ~2657): When raw_clean was empty but current_response
had content, no message was saved. Fixed by falling back to current_response.
Both paths now properly save the assistant message before returning.
The assistant_message_added flag prevents any duplication.
Added tests:
- missing_assistant_message_test.rs: verifies the fallback logic
- assistant_message_dedup_test.rs: verifies no duplicate messages
- consecutive_assistant_message_test.rs: verifies alternation invariant
The Anthropic API was rejecting requests with multiple high-resolution images
(~2000x3000 pixels each) even though individual file sizes were under limits.
Root cause: Code only checked per-image file size (3.75MB), not dimensions.
Claude recommends images ≤1568px on longest edge and has 32MB total request limit.
Changes:
- Add MAX_IMAGE_DIMENSION (1568px) and MAX_TOTAL_IMAGE_PAYLOAD (20MB) constants
- Trigger resize when dimensions > 1568px (not just file size > 3.75MB)
- Add new resize_image_to_dimensions() for dimension-constrained resizing
- Track cumulative payload size across multiple images
- Warn if total payload exceeds recommended limit
Test results with Walking Dead comic images:
- WD_0001_0001.jpg: 800KB 1987x3057 → 321KB 1019x1568
- WD_0001_1064.png: 150KB 1988x3057 → 143KB 1020x1568
- WD_0002_0001.jpg: 1023KB 1988x3056 → 292KB 1020x1568
- Total payload: ~2.5MB → ~1MB base64
Removed the persistent_chrome config flag - chromedriver is now always
kept running after webdriver_quit. This eliminates startup latency for
subsequent WebDriver sessions.
Safaridriver is still killed on quit since it doesn't benefit from
persistence in the same way.
Updated quit message to correctly indicate chromedriver remains running.
When webdriver_start is called, now checks if chromedriver is already
running on the configured port and reuses it instead of spawning a new
process. This significantly reduces startup time for subsequent sessions.
New config option:
[webdriver]
persistent_chrome = true # Keep chromedriver running between sessions
When enabled, webdriver_quit closes the browser session but leaves
chromedriver running for reuse by the next session.