Agent: fowler
Eliminate code-path aliasing and near-duplicates across recent commits:
1. Deduplicate find_json_object_end: Three near-identical copies in
streaming_parser.rs, context_window.rs, and acd.rs consolidated into
a single canonical implementation in utils.rs. All callers now route
through the canonical version. The utils.rs version uses the most
defensive variant (with found_start guard). (-84 lines)
2. Deduplicate provider constructors: AnthropicProvider::new() and
GeminiProvider::new() now delegate to their respective new_with_name()
methods instead of duplicating the full constructor body.
(OpenAI already delegated.) (-28 lines)
3. Inline convert_cache_control: Removed identity function that just
cloned CacheControl. Call sites now use .map(|cc| cc.clone())
directly. (-4 lines)
Net: -65 lines, 0 behavior changes, all 683 library tests pass.
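For illustration only, a minimal sketch of what a defensive find_json_object_end with a found_start guard could look like. The signature and return convention (byte index of the closing brace) are assumptions, not the actual utils.rs code.

```rust
/// Returns the byte index of the `}` that closes the first top-level JSON
/// object in `text`, or `None` if no complete object is present.
/// Hypothetical signature; the canonical utils.rs version may differ.
pub fn find_json_object_end(text: &str) -> Option<usize> {
    let mut depth: usize = 0;
    let mut in_string = false;
    let mut escaped = false;
    // Guard: ignore quotes and closing braces seen before the first '{'.
    let mut found_start = false;

    for (i, ch) in text.char_indices() {
        if escaped {
            // The previous character was a backslash inside a string.
            escaped = false;
            continue;
        }
        match ch {
            '\\' if in_string => escaped = true,
            '"' if found_start => in_string = !in_string,
            '{' if !in_string => {
                found_start = true;
                depth += 1;
            }
            '}' if !in_string && found_start => {
                // depth is at least 1 here, so this cannot underflow.
                depth -= 1;
                if depth == 0 {
                    return Some(i);
                }
            }
            _ => {}
        }
    }
    None
}
```

Without the guard, a stray `}` or `"` ahead of the object could corrupt the depth or string state before parsing starts.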
read_image tool results placed images as top-level Image content blocks
alongside ToolResult blocks in user messages. The Anthropic API rejects
this combination, reporting orphaned tool_use IDs even though the
tool_result was present — the malformed message structure prevented
the API from recognizing it as a valid tool result.
Added ToolResultContent enum (Text | Blocks) with custom serde so that
when images are attached to a tool result, they are nested inside the
tool_result content array as structured blocks, matching the Anthropic
API's expected format for multi-modal tool results.
Regular tool results (no images) continue to use simple string content.
Regular user messages (not tool results) continue to use top-level
Image blocks.
4 new tests covering image nesting, string fallback, regular user
messages, and orphan detection with structured content.
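A std-only sketch of the Text | Blocks split described above. The real change uses custom serde; here the JSON is rendered by hand so the example has no dependencies. The Block type, new(), and to_json() are hypothetical names for this sketch.

```rust
/// Content carried by a tool result (variant names follow the commit text).
pub enum ToolResultContent {
    Text(String),
    Blocks(Vec<Block>),
}

/// Hypothetical block type for this sketch.
pub enum Block {
    Text(String),
    Image { media_type: String, data_base64: String },
}

/// Minimal JSON string escaping, enough for this sketch.
fn json_string(s: &str) -> String {
    let mut out = String::from("\"");
    for ch in s.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

impl ToolResultContent {
    /// Plain string when there are no images, nested blocks otherwise.
    pub fn new(text: &str, images: Vec<Block>) -> Self {
        if images.is_empty() {
            return ToolResultContent::Text(text.to_string());
        }
        let mut blocks = vec![Block::Text(text.to_string())];
        blocks.extend(images);
        ToolResultContent::Blocks(blocks)
    }

    /// Renders the value of the tool_result "content" field.
    pub fn to_json(&self) -> String {
        match self {
            ToolResultContent::Text(s) => json_string(s),
            ToolResultContent::Blocks(blocks) => {
                let items: Vec<String> = blocks
                    .iter()
                    .map(|b| match b {
                        Block::Text(t) => {
                            format!("{{\"type\":\"text\",\"text\":{}}}", json_string(t))
                        }
                        Block::Image { media_type, data_base64 } => format!(
                            "{{\"type\":\"image\",\"source\":{{\"type\":\"base64\",\"media_type\":{},\"data\":{}}}}}",
                            json_string(media_type),
                            json_string(data_base64)
                        ),
                    })
                    .collect();
                format!("[{}]", items.join(","))
            }
        }
    }
}
```

The point of the split is that the no-image path keeps emitting the same simple string as before, so existing tool results are unaffected.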
After context compaction, the preserved last assistant message retained
its structured tool_calls field, but the corresponding tool_result was
summarized away. This created orphaned tool_use blocks that violated
the Anthropic API constraint: 'Each tool_use block must have a
corresponding tool_result block in the next message', causing 400 errors.
Primary fix: clear tool_calls from the preserved assistant message in
extract_preserved_messages(). The tool call was already executed and
its result is captured in the summary.
Defense-in-depth: added strip_orphaned_tool_use() post-processing in
Anthropic convert_messages() to detect and strip any orphaned tool_use
blocks before they reach the API.
Added 7 tests: 3 unit tests for compaction stripping, 3 unit tests for
Anthropic orphan detection, 1 integration test reproducing the exact
bug scenario from the h3 session.
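A rough sketch of the defense-in-depth pass, under simplified assumptions: the Content and ApiMessage types below are stand-ins for the real Anthropic message types, and the function signature is hypothetical.

```rust
use std::collections::HashSet;

/// Simplified stand-in for the Anthropic content blocks.
#[derive(Debug, Clone, PartialEq)]
pub enum Content {
    Text(String),
    ToolUse { id: String },
    ToolResult { tool_use_id: String },
}

/// Simplified stand-in for a converted API message.
#[derive(Debug, Clone, PartialEq)]
pub struct ApiMessage {
    pub role: &'static str,
    pub content: Vec<Content>,
}

/// Drops every tool_use block whose id has no matching tool_result in the
/// immediately following message. A tool_use in the final message has no
/// follow-up at all, so it is treated as orphaned too.
pub fn strip_orphaned_tool_use(messages: &mut [ApiMessage]) {
    for i in 0..messages.len() {
        // Collect the tool_use ids answered by the next message, if any.
        let answered: HashSet<String> = messages
            .get(i + 1)
            .map(|next| {
                next.content
                    .iter()
                    .filter_map(|c| match c {
                        Content::ToolResult { tool_use_id } => Some(tool_use_id.clone()),
                        _ => None,
                    })
                    .collect()
            })
            .unwrap_or_default();

        messages[i].content.retain(|c| match c {
            Content::ToolUse { id } => answered.contains(id),
            _ => true,
        });
    }
}
```

This sketch does not handle a message left empty after stripping; a real implementation would also need to decide what to do in that case.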
The agent would stop mid-task because native tool calls were stored as
inline JSON text in Message.content. When sent back to the Anthropic API
via convert_messages(), they went as plain text instead of structured
tool_use/tool_result blocks. The model would occasionally get confused
and emit text describing what it wanted to do instead of invoking the
tool mechanism.
Changes:
- Add MessageToolCall struct and tool_calls/tool_result_id fields to Message
- Add id field to core ToolCall struct to preserve provider tool call IDs
- Update Anthropic convert_messages() to emit tool_use and tool_result blocks
- Add ToolResult variant to AnthropicContent enum
- Store tool calls structurally in tool message construction (not inline JSON)
- Fix add_message() to preserve empty-content messages with tool_calls
- Fix check_duplicate_in_previous_message() to check structured tool_calls
- Generate valid IDs for JSON fallback tool calls (Anthropic pattern requirement)
- Update planner create_tool_message() to use structured tool calls
When the LLM emits identical JSON tool calls as text content (JSON
fallback mode), the raw duplicate JSON was being stored in the assistant
message in conversation history. This confused the model on subsequent
turns, causing it to stall or repeat itself.
Root cause: raw_content_for_log used get_text_content() which returns
the full parser buffer including all duplicate tool call JSONs.
Fix: Added get_text_before_tool_calls() to StreamingToolParser that
returns only the text before the first JSON tool call. Changed
raw_content_for_log to use this method so the assistant message only
contains the preamble text + the single executed tool call.
Added 5 integration tests covering stuttered duplicates, triple
stutter, cross-turn dedup, and different-args boundary case.
Added MockResponse helpers for simulating LLM stutter patterns.
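A simplified illustration of the idea behind get_text_before_tool_calls(): keep only the preamble ahead of the first JSON tool call. The free function and the literal marker below are assumptions; the real StreamingToolParser method presumably uses its own pattern detection.

```rust
/// Returns the text that precedes the first JSON tool call in `buffer`.
/// MARKER is a hypothetical stand-in for the parser's real detection logic.
pub fn text_before_tool_calls(buffer: &str) -> &str {
    const MARKER: &str = "{\"tool\"";
    match buffer.find(MARKER) {
        // `find` returns a byte offset on a char boundary, so slicing is safe.
        Some(idx) => buffer[..idx].trim_end(),
        None => buffer,
    }
}
```

With a stuttered buffer containing the same tool call twice, only the preamble survives, and the single executed call is appended separately.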
Fixes issues in the last 11 commits:
1. pending_research.rs: Fix flaky test_generate_id_uniqueness
- Replaced random u16 suffix with atomic counter for guaranteed uniqueness
- The timestamp+random approach could collide when generating IDs rapidly
- Now uses static AtomicU32 counter that increments monotonically
2. embedded/adapters/glm.rs: Remove unused in_code_fence field
- Field was written but never read (dead code)
- Removed from struct definition, constructor, and reset()
3. embedded/adapters/glm.rs: Fix orphaned tests
- Two tests (test_strip_code_fences, test_code_fenced_tool_call) were
outside the #[cfg(test)] mod tests block
- Moved closing brace to include them in the test module
All 446 library tests pass.
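A minimal sketch of the counter-based ID scheme from item 1. The "research_" prefix, the function name, and the exact ID layout are assumptions made for illustration.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

/// Process-wide counter; each call to `generate_id` takes the next value.
static NEXT_ID: AtomicU32 = AtomicU32::new(0);

/// Builds an ID from a millisecond timestamp plus a monotonically
/// increasing sequence number (hypothetical format).
pub fn generate_id() -> String {
    let millis = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_millis())
        .unwrap_or(0);
    // fetch_add is atomic, so two threads can never observe the same value.
    let seq = NEXT_ID.fetch_add(1, Ordering::Relaxed);
    format!("research_{millis}_{seq}")
}
```

Unlike a random suffix, the counter cannot repeat within a process until it wraps around after 2^32 IDs, so rapid generation within the same millisecond stays unique.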
Agent: fowler
GLM-4 models wrap tool calls in markdown code fences and inline backticks,
which prevents the streaming parser from detecting them. This adapter:
- Strips ```json and ``` code fence markers during streaming
- Strips inline backticks from tool call JSON
- Handles chunked streaming correctly (buffers potential fence lines)
- Transforms GLM native format (<|assistant|>tool_name) to g3 JSON format
Also refactors embedded provider into module structure:
- embedded/mod.rs - module exports
- embedded/provider.rs - main EmbeddedProvider (moved from embedded.rs)
- embedded/adapters/mod.rs - ToolFormatAdapter trait
- embedded/adapters/glm.rs - GLM-specific adapter
Includes 22 unit tests covering edge cases like nested JSON in strings,
chunk boundary handling, and false pattern detection.
Updates README to show GLM-4 9B now works (⭐⭐) for agentic tasks.
embedded.rs (937→789 lines, -16%):
- Extract duplicated inference setup into prepare_context() helper
- Extract stop sequence handling into find_stop_sequence() and truncate_at_stop_sequence()
- Add InferenceParams struct to consolidate request parameter extraction
- Add clear section markers for code organization
- Tests now use module-level format functions directly (no duplication)
gemini.rs:
- Extract common request building into build_request() method
- Reduces duplication between complete() and stream() methods
All 399 unit tests pass. Behavior unchanged.
Agent: carmack
Agent: hopper
Added two new integration test files:
1. cache_stats_integration_test.rs (g3-core)
- Tests CacheStats accumulation through streaming completion flow
- Verifies cache hit detection (cache_read_tokens > 0)
- Tests multi-request accumulation of cache statistics
- Verifies cache efficiency and hit rate calculations
- Uses MockProvider to simulate provider usage data
2. gemini_serialization_test.rs (g3-providers)
- Tests Gemini API message format conversion
- Verifies system messages become system_instruction
- Verifies assistant role maps to "model" (Gemini terminology)
- Tests tool conversion to function_declarations format
- Characterizes multi-system-message behavior (last wins)
Both test files follow blackbox/integration testing principles:
- Test observable behavior through stable surfaces
- Do not assert internal implementation details
- Include documentation of what is/is not asserted
- Add GeminiProvider with streaming and native tool calling
- Support gemini-2.5-pro, gemini-2.0-flash, gemini-1.5-pro/flash models
- Model-specific context window detection (1M-2M tokens)
- Message conversion: assistant -> model role mapping
- System messages extracted to system_instruction field
- Tool schema conversion with functionCall/functionResponse parts
- SSE streaming with JSON array buffer parsing
- 8 unit tests for conversion and parsing logic
- Register provider in g3-core and validate in g3-cli
- Add context_window_size() method to LLMProvider trait
- Implement for EmbeddedProvider to return the auto-detected context length
- Update Agent to query provider directly instead of using hardcoded defaults
- Removes need for model-specific context length mappings
- Use global OnceLock for llama.cpp backend to prevent BackendAlreadyInitialized error
- Suppress verbose llama.cpp stderr logging during model loading
- Fix provider validation to accept "embedded.name" format (extract type before dot)
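A sketch of the two fixes above: one-time backend initialization through a global OnceLock, and extracting the provider type ahead of the dot. The Backend struct, INIT_COUNT, and both function names are hypothetical; the real code wraps the llama.cpp backend type.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical stand-in for the llama.cpp backend handle.
pub struct Backend {
    pub name: &'static str,
}

/// Counts how often the initializer actually ran (for demonstration only).
pub static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);

static BACKEND: OnceLock<Backend> = OnceLock::new();

/// Returns the shared backend, initializing it on first use only.
pub fn backend() -> &'static Backend {
    BACKEND.get_or_init(|| {
        INIT_COUNT.fetch_add(1, Ordering::SeqCst);
        Backend { name: "llama.cpp" }
    })
}

/// Extracts the provider type from names such as "embedded.name".
pub fn provider_type(provider: &str) -> &str {
    provider.split('.').next().unwrap_or(provider)
}
```

Because get_or_init runs its closure at most once, creating a second embedded provider reuses the existing backend instead of initializing it again.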
- Extend Usage struct with cache_creation_tokens and cache_read_tokens fields
- Parse Anthropic cache_creation_input_tokens and cache_read_input_tokens
- Parse OpenAI prompt_tokens_details.cached_tokens for automatic prefix caching
- Add CacheStats struct to Agent for cumulative tracking across API calls
- Add "Prompt Cache Statistics" section to /stats output showing:
- API call count and cache hit count
- Hit rate percentage
- Total input tokens and cache read/creation tokens
- Cache efficiency (% of input served from cache)
- Update all provider implementations and test files
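A sketch of how the cumulative statistics could be tracked. The field and method names are assumptions, as is the definition of efficiency as cache-read tokens over total input tokens; providers differ on whether input token counts already include cached tokens.

```rust
/// Cumulative prompt-cache statistics across API calls (hypothetical layout).
#[derive(Debug, Default, Clone, PartialEq)]
pub struct CacheStats {
    pub api_calls: u64,
    pub cache_hits: u64,
    pub total_input_tokens: u64,
    pub cache_read_tokens: u64,
    pub cache_creation_tokens: u64,
}

impl CacheStats {
    /// Folds the usage numbers of one API call into the totals.
    pub fn record(&mut self, input_tokens: u64, cache_creation_tokens: u64, cache_read_tokens: u64) {
        self.api_calls += 1;
        if cache_read_tokens > 0 {
            // A call counts as a cache hit when it read any cached tokens.
            self.cache_hits += 1;
        }
        self.total_input_tokens += input_tokens;
        self.cache_creation_tokens += cache_creation_tokens;
        self.cache_read_tokens += cache_read_tokens;
    }

    /// Percentage of API calls that hit the cache.
    pub fn hit_rate(&self) -> f64 {
        if self.api_calls == 0 {
            return 0.0;
        }
        self.cache_hits as f64 * 100.0 / self.api_calls as f64
    }

    /// Percentage of input tokens served from the cache.
    pub fn efficiency(&self) -> f64 {
        if self.total_input_tokens == 0 {
            return 0.0;
        }
        self.cache_read_tokens as f64 * 100.0 / self.total_input_tokens as f64
    }
}
```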
Added 6 new integration tests for stream_completion_with_tools:
- test_text_before_tool_call_preserved: text before native tool call is saved
- test_native_tool_call_execution: native tool calls execute correctly
- test_duplicate_tool_calls_skipped: sequential duplicates are detected
- test_json_fallback_tool_calling: JSON tool calls work without native support
- test_text_after_tool_execution_preserved: follow-up text is saved
- test_multiple_tool_calls_executed: multiple tool calls in sequence work
Also added MockResponse helper methods:
- text_then_native_tool(): text followed by native tool call
- duplicate_native_tool_calls(): same tool call twice (for dedup testing)
Fixed text_with_json_tool() to ensure "tool" key comes before "args"
(serde_json alphabetizes keys, breaking pattern detection).
Total: 18 integration tests covering historical bugs and core behaviors.
Adds a configurable mock LLM provider that can simulate various behaviors:
- Text-only responses (single or multi-chunk streaming)
- Native tool calls
- JSON tool calls in text
- Truncated responses (max_tokens)
- Multi-turn conversations
Features:
- Builder pattern for easy test setup
- Request tracking for verification
- Preset scenarios for common patterns
- Full LLMProvider trait implementation
Also adds integration tests that use MockProvider to test the
stream_completion_with_tools code path, including:
- test_butler_bug_scenario: reproduces the exact bug where text-only
responses were not saved to context, causing consecutive user messages
This enables testing complex streaming behaviors without real API calls.
- Fix aliasing issue where resolve_max_tokens() used fallback_default_max_tokens
(8192) instead of provider-specific defaults
- Update fallback_default_max_tokens from 8192 to 32000
- Set provider-specific max_tokens defaults:
- Anthropic: 32000
- OpenAI: 32000 (was 16000)
- Databricks: 32000 (was 50000, now matches Anthropic as a passthrough)
- Embedded: 2048
- Context window lengths unchanged:
- OpenAI: 400,000
- Anthropic: 200,000
- Databricks (Claude): 200,000
This fixes the 'LLM response was cut off due to max_tokens limit' error
in agent mode that occurred because 8192 was being used instead of 32000.
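A sketch of the resolution order after this fix, using the defaults listed above. The lowercase provider keys, the `configured` parameter, and both function signatures are assumptions; only the numeric values come from the commit.

```rust
/// Used only when the provider is unknown (value from the commit text).
pub const FALLBACK_DEFAULT_MAX_TOKENS: u32 = 32_000;

/// Provider-specific defaults; the provider keys are hypothetical.
pub fn provider_default_max_tokens(provider: &str) -> u32 {
    // Accept "type.name" forms such as "embedded.glm4" by looking at the type.
    let provider_type = provider.split('.').next().unwrap_or(provider);
    match provider_type {
        "anthropic" | "openai" | "databricks" => 32_000,
        "embedded" => 2_048,
        _ => FALLBACK_DEFAULT_MAX_TOKENS,
    }
}

/// An explicit setting wins; otherwise the provider default applies.
/// The global fallback is reached only for unknown providers.
pub fn resolve_max_tokens(provider: &str, configured: Option<u32>) -> u32 {
    configured.unwrap_or_else(|| provider_default_max_tokens(provider))
}
```

The aliasing bug corresponds to skipping the provider lookup and returning the global fallback directly.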
- Add ToolParsingHint enum (Detected/Active/Complete) for UI feedback
- New UiWriter methods: print_tool_streaming_hint(), print_tool_streaming_active()
- Refactor ConsoleUiWriter state to use atomics in ParsingHintState
- Add tool_call_streaming field to CompletionChunk for provider hints
- Anthropic provider sends streaming hints when tool name detected
- New streaming helpers: make_tool_streaming_hint(), make_tool_streaming_active()
Parser improvements:
- Add is_json_invalidated() to detect false positive tool patterns
- Fix tool result poisoning when file contents contain partial JSON
- Unescaped newlines in strings, or prose after the JSON, invalidate detection
User sees ' ● tool_name |' immediately when tool call starts streaming,
with blinking indicator while args are received.
ACD (Aggressive Context Dehydration) fixes:
- Fixed dehydrate_context() to extract turn summary from context window
instead of using the passed-in final_response (which contained only
the timing footer, not the actual LLM response)
- Removed final_response parameter from dehydrate_context() since it
now self-extracts the last assistant message as the summary
- This ensures the actual turn summary is preserved after dehydration,
not just the timing footer
New /dump command:
- Added /dump command to dump entire context window to tmp/ for debugging
- Shows message index, role, kind, content length, and full content
- Available in both console and machine modes
UTF-8 safety:
- Fixed truncate_to_word_boundary() to use character indices instead of
byte indices, preventing panics on multi-byte UTF-8 characters
- Added UTF-8 string slicing guidance to AGENTS.md
Agent: g3
The buffer truncation code was slicing at a raw byte offset which could
land in the middle of a multi-byte character (like emojis), causing a
panic. Fixed by using char_indices() to find valid character boundaries.
Also added stop_reason field to CompletionChunk initializers in tests
to complete the stop_reason feature addition.
- Fix byte boundary panic in filter_json.rs line 327
- Add test for multi-byte character handling
- Update test files with missing stop_reason field
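A minimal sketch of boundary-safe truncation with char_indices(). The function name and signature are hypothetical and not the actual filter_json.rs code.

```rust
/// Truncates `s` to at most `max_bytes` bytes without splitting a
/// multi-byte UTF-8 character (hypothetical helper).
pub fn truncate_to_char_boundary(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    // char_indices yields the byte offset where each character starts, so
    // the last offset <= max_bytes is always a valid place to cut.
    let end = s
        .char_indices()
        .map(|(i, _)| i)
        .take_while(|&i| i <= max_bytes)
        .last()
        .unwrap_or(0);
    &s[..end]
}
```

Slicing with `&s[..max_bytes]` directly panics whenever max_bytes falls inside a character, which is the failure mode described above.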
Agent: carmack
openai.rs:
- Use make_text_chunk() for streaming text content
- Use make_final_chunk() for final completion chunk
- Simplify tool_calls conversion logic
embedded.rs:
- Use make_text_chunk() for all 4 streaming text chunks
- Use make_final_chunk() for final completion chunk
- Remove unused CompletionChunk import
Net reduction: 35 lines removed
All tests pass. Behavior unchanged.
Agent: carmack
databricks.rs:
- Extract ToolCallAccumulator struct to replace opaque (String, String, String) tuple
- Add decode_utf8_streaming() helper for cleaner UTF-8 handling
- Add is_incomplete_json_error() helper for JSON parse error detection
- Add make_final_chunk() helper to reduce duplication
- Add finalize_tool_calls() to convert accumulators to final format
- Refactor parse_streaming_response from ~270 lines to ~100 lines
- Reduce nesting depth from 8+ levels to 4 levels
- Use early returns and let-else for cleaner control flow
file_ops.rs:
- Replace repetitive if-let chains with declarative PATH_CONTENT_KEYS table
- Use match expression instead of nested if-else
- Reduce extract_path_and_content from 44 lines to 20 lines
All tests pass. Behavior unchanged.
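To illustrate the tuple-to-struct refactor in databricks.rs, a sketch of an accumulator with named fields. The field meanings (id, name, arguments), apply_delta(), and the finalize rules are assumptions about what the (String, String, String) tuple held.

```rust
/// Named replacement for an opaque (String, String, String) tuple.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct ToolCallAccumulator {
    pub id: String,
    pub name: String,
    /// JSON arguments, appended fragment by fragment while streaming.
    pub arguments: String,
}

impl ToolCallAccumulator {
    /// Merges one streamed delta into the accumulator (hypothetical helper).
    pub fn apply_delta(&mut self, id: Option<&str>, name: Option<&str>, args_fragment: &str) {
        if let Some(id) = id {
            self.id = id.to_string();
        }
        if let Some(name) = name {
            self.name = name.to_string();
        }
        self.arguments.push_str(args_fragment);
    }
}

/// Drops accumulators that never received a name and defaults empty
/// arguments to an empty JSON object (assumed finalize behavior).
pub fn finalize_tool_calls(accumulators: Vec<ToolCallAccumulator>) -> Vec<ToolCallAccumulator> {
    accumulators
        .into_iter()
        .filter(|a| !a.name.is_empty())
        .map(|mut a| {
            if a.arguments.is_empty() {
                a.arguments = "{}".to_string();
            }
            a
        })
        .collect()
}
```

Named fields make call sites self-describing, where the tuple form required remembering which position held the id, the name, or the arguments.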
New test files:
- crates/g3-cli/tests/cli_integration_test.rs (14 tests)
Blackbox CLI tests: help/version flags, argument validation,
conflicting modes, flock mode requirements
- crates/g3-core/tests/tool_execution_test.rs (20 tests)
Tool call structure tests and unified diff application:
read_file, write_file, str_replace, shell, background_process,
todo, final_output, code_search, take_screenshot
- crates/g3-providers/tests/message_serialization_test.rs (20 tests)
Round-trip serialization tests for Message, MessageRole,
CacheControl, and Tool types. Covers Unicode, special chars,
and edge cases.
All tests follow blackbox/integration-first principles with
documentation of what they protect and intentionally do not assert.
- Remove unused assignment to final_output_called (returns immediately after)
- Mark cache_config field as #[allow(dead_code)] (reserved for future use)
- Mark print_status_line method as #[allow(dead_code)] (reserved for future use)
Converted ~77 info! macro calls to debug! across the codebase to prevent
log messages from interrupting the CLI experience during normal operation.
Users can still see these logs by setting RUST_LOG=debug if needed.
Affected crates:
- g3-cli
- g3-computer-control
- g3-console
- g3-core
- g3-ensembles
- g3-execution
- g3-providers
Writes the current context window to logs/current_context_window (uses a symlink to a session ID).
This PR was unfortunately generated by a different LLM, which did a lot of superficial reformatting. The actual change is fairly small and benign, but I don't want to roll everything back. Hope that's OK.
This tries to short-circuit multiple round-trips to the LLM when reading code.
It's a precursor to context engineering tailored to specific tasks.
In initial experiments, it's only marginally faster than regular mode and burns more tokens.