- Use global OnceLock for llama.cpp backend to prevent BackendAlreadyInitialized error
- Suppress verbose llama.cpp stderr logging during model loading
- Fix provider validation to accept "embedded.name" format (extract type before dot)
- Extend Usage struct with cache_creation_tokens and cache_read_tokens fields
- Parse Anthropic cache_creation_input_tokens and cache_read_input_tokens
- Parse OpenAI prompt_tokens_details.cached_tokens for automatic prefix caching
- Add CacheStats struct to Agent for cumulative tracking across API calls
- Add "Prompt Cache Statistics" section to /stats output showing:
- API call count and cache hit count
- Hit rate percentage
- Total input tokens and cache read/creation tokens
- Cache efficiency (% of input served from cache)
- Update all provider implementations and test files
Added 6 new integration tests for stream_completion_with_tools:
- test_text_before_tool_call_preserved: text before native tool call is saved
- test_native_tool_call_execution: native tool calls execute correctly
- test_duplicate_tool_calls_skipped: sequential duplicates are detected
- test_json_fallback_tool_calling: JSON tool calls work without native support
- test_text_after_tool_execution_preserved: follow-up text is saved
- test_multiple_tool_calls_executed: multiple tool calls in sequence work
Also added MockResponse helper methods:
- text_then_native_tool(): text followed by native tool call
- duplicate_native_tool_calls(): same tool call twice (for dedup testing)
Fixed text_with_json_tool() to ensure "tool" key comes before "args"
(serde_json alphabetizes keys, breaking pattern detection).
Total: 18 integration tests covering historical bugs and core behaviors.
Adds a configurable mock LLM provider that can simulate various behaviors:
- Text-only responses (single or multi-chunk streaming)
- Native tool calls
- JSON tool calls in text
- Truncated responses (max_tokens)
- Multi-turn conversations
Features:
- Builder pattern for easy test setup
- Request tracking for verification
- Preset scenarios for common patterns
- Full LLMProvider trait implementation
Also adds integration tests that use MockProvider to test the
stream_completion_with_tools code path, including:
- test_butler_bug_scenario: reproduces the exact bug where text-only
responses were not saved to context, causing consecutive user messages
This enables testing complex streaming behaviors without real API calls.
- Fix aliasing issue where resolve_max_tokens() used fallback_default_max_tokens
(8192) instead of provider-specific defaults
- Update fallback_default_max_tokens from 8192 to 32000
- Set provider-specific max_tokens defaults:
- Anthropic: 32000
- OpenAI: 32000 (was 16000)
- Databricks: 32000 (was 50000, now matches Anthropic as passthru)
- Embedded: 2048
- Context window lengths unchanged:
- OpenAI: 400,000
- Anthropic: 200,000
- Databricks (Claude): 200,000
This fixes the 'LLM response was cut off due to max_tokens limit' error
in agent mode that occurred because 8192 was being used instead of 32000.
- Add ToolParsingHint enum (Detected/Active/Complete) for UI feedback
- New UiWriter methods: print_tool_streaming_hint(), print_tool_streaming_active()
- Refactor ConsoleUiWriter state to use atomics in ParsingHintState
- Add tool_call_streaming field to CompletionChunk for provider hints
- Anthropic provider sends streaming hints when tool name detected
- New streaming helpers: make_tool_streaming_hint(), make_tool_streaming_active()
Parser improvements:
- Add is_json_invalidated() to detect false positive tool patterns
- Fix tool result poisoning when file contents contain partial JSON
- Unescaped newlines in strings or prose after JSON invalidates detection
User sees ' ● tool_name |' immediately when tool call starts streaming,
with blinking indicator while args are received.
ACD (Aggressive Context Dehydration) fixes:
- Fixed dehydrate_context() to extract turn summary from context window
instead of using the passed-in final_response (which contained only
the timing footer, not the actual LLM response)
- Removed final_response parameter from dehydrate_context() since it
now self-extracts the last assistant message as the summary
- This ensures the actual turn summary is preserved after dehydration,
not just the timing footer
New /dump command:
- Added /dump command to dump entire context window to tmp/ for debugging
- Shows message index, role, kind, content length, and full content
- Available in both console and machine modes
UTF-8 safety:
- Fixed truncate_to_word_boundary() to use character indices instead of
byte indices, preventing panics on multi-byte UTF-8 characters
- Added UTF-8 string slicing guidance to AGENTS.md
Agent: g3
The buffer truncation code was slicing at a raw byte offset which could
land in the middle of a multi-byte character (like emojis), causing a
panic. Fixed by using char_indices() to find valid character boundaries.
Also added stop_reason field to CompletionChunk initializers in tests
to complete the stop_reason feature addition.
- Fix byte boundary panic in filter_json.rs line 327
- Add test for multi-byte character handling
- Update test files with missing stop_reason field
Agent: carmack
openai.rs:
- Use make_text_chunk() for streaming text content
- Use make_final_chunk() for final completion chunk
- Simplify tool_calls conversion logic
embedded.rs:
- Use make_text_chunk() for all 4 streaming text chunks
- Use make_final_chunk() for final completion chunk
- Remove unused CompletionChunk import
Net reduction: 35 lines removed
All tests pass. Behavior unchanged.
Agent: carmack
databricks.rs:
- Extract ToolCallAccumulator struct to replace opaque (String, String, String) tuple
- Add decode_utf8_streaming() helper for cleaner UTF-8 handling
- Add is_incomplete_json_error() helper for JSON parse error detection
- Add make_final_chunk() helper to reduce duplication
- Add finalize_tool_calls() to convert accumulators to final format
- Refactor parse_streaming_response from ~270 lines to ~100 lines
- Reduce nesting depth from 8+ levels to 4 levels
- Use early returns and let-else for cleaner control flow
file_ops.rs:
- Replace repetitive if-let chains with declarative PATH_CONTENT_KEYS table
- Use match expression instead of nested if-else
- Reduce extract_path_and_content from 44 lines to 20 lines
All tests pass. Behavior unchanged.
New test files:
- crates/g3-cli/tests/cli_integration_test.rs (14 tests)
Blackbox CLI tests: help/version flags, argument validation,
conflicting modes, flock mode requirements
- crates/g3-core/tests/tool_execution_test.rs (20 tests)
Tool call structure tests and unified diff application:
read_file, write_file, str_replace, shell, background_process,
todo, final_output, code_search, take_screenshot
- crates/g3-providers/tests/message_serialization_test.rs (20 tests)
Round-trip serialization tests for Message, MessageRole,
CacheControl, and Tool types. Covers Unicode, special chars,
and edge cases.
All tests follow blackbox/integration-first principles with
documentation of what they protect and intentionally do not assert.
- Remove unused assignment to final_output_called (returns immediately after)
- Mark cache_config field as #[allow(dead_code)] (reserved for future use)
- Mark print_status_line method as #[allow(dead_code)] (reserved for future use)
Converted ~77 info! macro calls to debug! across the codebase to prevent
log messages from interrupting the CLI experience during normal operation.
Users can still see these logs by setting RUST_LOG=debug if needed.
Affected crates:
- g3-cli
- g3-computer-control
- g3-console
- g3-core
- g3-ensembles
- g3-execution
- g3-providers
Writes the current context window to logs/current_context_window (uses a symlink to a session ID).
This PR was unfortunately generated by a different LLM and did a ton of superficial reformating, it's actually a fairly small and benign change, but I don't want to roll back everything. Hope that's ok.
This tries to short-circuit multiple round-trips to llm for reading code.
It's a precursor to trying to context engineer tailored to specific tasks.
In initial experiments, it's only marginally faster than regular mode, and burns more tokens.