Add test_project_content_survives_compaction() to verify that project
content loaded via /project command persists through context compaction.
This is a CHARACTERIZATION test that validates:
- Project content appended to README message survives compaction
- The README message (containing project content) is preserved as message[1]
- PROJECT INSTRUCTIONS, ACTIVE PROJECT markers, Brief and Status sections
all survive the compaction process
Agent: hopper
The previous implementation added the summary as a System message, which
caused "Conversation must start with a user message" errors because the
first non-system message after compaction was Assistant (the preserved
last assistant message).
Fix: Change summary from System to User message, creating valid alternation:
[System Prompt] -> [Summary as USER] -> [Last Assistant] -> [Latest User]
This also prevents system message bloat across multiple compactions since
the summary is now part of the conversation flow and gets replaced on
each compaction.
Added test_second_compaction_no_bloat to verify no accumulation.
When context window compaction occurs, the last assistant message is now
preserved in addition to the system prompt, README, and summary. This
improves continuity after compaction by keeping the LLM's most recent
response, which often contains important context about what was just
done or what comes next.
New message order after compaction:
[System Prompt] -> [README/AGENTS.md] -> [ACD Stub?] -> [Summary] -> [Last Assistant] -> [Latest User?]
Changes:
- Add last_assistant_message field to PreservedMessages struct
- Modify extract_preserved_messages() to find last assistant message
- Modify reset_with_summary_and_stub() to include last assistant message
- Add comprehensive integration tests using MockProvider
Tests cover edge cases:
- No assistant message exists
- Tool-call-only assistant messages (still preserved)
- Multiple assistant messages (only last one preserved)
- No trailing user message
Change from multi-line verbose format to single-line compact format:
Before:
⚡ DEHYDRATED CONTEXT (fragment_id: 188c7ac71613)
• 8 messages (4 user, 4 assistant)
• 3 tool calls (shell ×3)
• ~299 tokens saved
To restore this history, call: rehydrate(fragment_id: "188c7ac71613")
After:
⚡ DEHYDRATED CONTEXT: 3 tool calls (shell x3), 8 total msgs. To restore, call: rehydrate(fragment_id: "188c7ac71613")
- Combine all info into single line
- Remove tokens saved (not essential for rehydration decision)
- Use ASCII 'x' instead of '×' for simplicity
- Add 'no tool calls' case for fragments without tools
- Update related tests
Extract a new g3_status module in g3-cli that provides consistent formatting
for all 'g3:' prefixed system status messages.
Key changes:
- Add G3Status struct with methods for progress, done, failed, error, etc.
- Add Status enum with Done, Failed, Error, Resolved, Insufficient, NoChanges
- Add ThinResult struct in g3-core for semantic thinning data
- Update UiWriter trait with print_thin_result() method
- Refactor context thinning to return ThinResult instead of formatted strings
- Update all callers to use the new centralized formatting
- Session resume/decline messages now use G3Status
- Compaction status messages now use G3Status
This maintains clean separation of concerns: g3-core emits semantic data,
g3-cli handles all terminal formatting and colors.
Adds test_llm_repeats_text_before_each_tool_call() which documents the
scenario where the LLM re-outputs the same preamble text before each
tool call in a multi-tool response.
Analysis showed this is LLM behavior, not a g3 bug:
- Each assistant message is correctly stored with different tool calls
- The duplicate display is the LLM choosing to repeat context
- Storage is correct, display accurately reflects LLM output
Decision: Accept as LLM behavior (Option B). Future LLM improvements
may resolve this naturally without g3 code changes.
Adds 3 new tests to json_parsing_stress_test.rs:
- test_tool_result_with_json_not_parsed: Full agent integration test proving
that JSON in tool results (sent TO the LLM) is never parsed by the
streaming parser (which only sees LLM output)
- test_parser_only_processes_completion_chunks: Documents that StreamingToolParser
only accepts CompletionChunk, not Message objects
- test_architectural_separation_documented: Documents the data flow showing
tool results flow TO the LLM while the parser only sees FROM the LLM
This proves the architectural guarantee: there is no code path where
tool result content could be parsed as a tool call, because:
1. Tool results are Message objects added to context_window
2. The streaming parser only processes CompletionChunk from provider.stream_completion()
3. These are completely separate data types flowing in opposite directions
Total: 41 JSON parsing stress tests now pass.
Added 6 new integration tests for stream_completion_with_tools:
- test_text_before_tool_call_preserved: text before native tool call is saved
- test_native_tool_call_execution: native tool calls execute correctly
- test_duplicate_tool_calls_skipped: sequential duplicates are detected
- test_json_fallback_tool_calling: JSON tool calls work without native support
- test_text_after_tool_execution_preserved: follow-up text is saved
- test_multiple_tool_calls_executed: multiple tool calls in sequence work
Also added MockResponse helper methods:
- text_then_native_tool(): text followed by native tool call
- duplicate_native_tool_calls(): same tool call twice (for dedup testing)
Fixed text_with_json_tool() to ensure "tool" key comes before "args"
(serde_json alphabetizes keys, breaking pattern detection).
Total: 18 integration tests covering historical bugs and core behaviors.
The bug was caused by mark_tool_calls_consumed() being called after
displaying each chunk, which advanced last_consumed_position to the
end of the current buffer. When the next chunk arrived with JSON,
the unchecked_buffer started at position 0 of the slice, causing
is_on_own_line() to return true (position 0 is always "on its own line").
Removed the problematic mark_tool_calls_consumed() call from the
"no tool executed" branch. The remaining call after actual tool
execution is correct and necessary.
Added integration test that verifies inline JSON in prose is not
detected as a tool call.
Adds a configurable mock LLM provider that can simulate various behaviors:
- Text-only responses (single or multi-chunk streaming)
- Native tool calls
- JSON tool calls in text
- Truncated responses (max_tokens)
- Multi-turn conversations
Features:
- Builder pattern for easy test setup
- Request tracking for verification
- Preset scenarios for common patterns
- Full LLMProvider trait implementation
Also adds integration tests that use MockProvider to test the
stream_completion_with_tools code path, including:
- test_butler_bug_scenario: reproduces the exact bug where text-only
responses were not saved to context, causing consecutive user messages
This enables testing complex streaming behaviors without real API calls.
Bug: When the LLM responded with text-only (no tool calls), the assistant
message was sometimes not saved to the context window. This caused consecutive
user messages where the LLM would lose track of previous responses.
Root causes found and fixed:
1. Early return path (line ~2535): When stream finishes with no tools executed
in previous iterations (any_tool_executed=false), the code returned early
without saving the assistant message. Fixed by adding save before return.
2. Post-loop path (line ~2657): When raw_clean was empty but current_response
had content, no message was saved. Fixed by falling back to current_response.
Both paths now properly save the assistant message before returning.
The assistant_message_added flag prevents any duplication.
Added tests:
- missing_assistant_message_test.rs: verifies the fallback logic
- assistant_message_dedup_test.rs: verifies no duplicate messages
- consecutive_assistant_message_test.rs: verifies alternation invariant
Change format from verbose emoji-based message to cleaner status line:
Before: ✨🥒 Context thinned at 70%: 7 tool results, ~33839 chars saved ✨
After: g3: thinning context ... 70% -> 40% ... [done]
The new format shows before/after percentages and uses bold green for
'g3:' and '[done]' to match other status messages.
Also removes unused emoji() and label() methods from ThinScope.
Images >= 5MB are now automatically resized to < 4.9MB using ImageMagick
before being sent to the LLM. This prevents API errors from oversized images.
- Uses iterative quality/scale reduction to find optimal size
- Converts to JPEG for better compression
- Shows original and resized size in terminal output (e.g., '6.2 MB → 4.1 MB (resized)')
- Falls back to original if ImageMagick fails or isn't available
Adds tests to verify that:
- All streaming chunks are processed before control returns to caller
- Both tool calls in a multi-tool-call stream are executed
- The finished signal properly terminates stream processing
Also adds Agent::new_for_test() to allow injecting mock providers.
- Shell outputs > 8KB are truncated to first 500 chars
- Full output saved to .g3/sessions/<session_id>/tools/shell_stdout_<id>.txt
- LLM can use read_file with start/end to paginate through large outputs
- read_file now uses seek() for O(1) random access instead of reading entire file
- UTF-8 safe: reads extra bytes at boundaries to find valid char positions
- Falls back to lossy conversion for binary files (no panics)
Files changed:
- paths.rs: get_tools_output_dir(), generate_short_id()
- shell.rs: truncate_large_output() integration
- file_ops.rs: seek-based read_file_range() helper
- New test: read_file_utf8_test.rs
- Add ToolParsingHint enum (Detected/Active/Complete) for UI feedback
- New UiWriter methods: print_tool_streaming_hint(), print_tool_streaming_active()
- Refactor ConsoleUiWriter state to use atomics in ParsingHintState
- Add tool_call_streaming field to CompletionChunk for provider hints
- Anthropic provider sends streaming hints when tool name detected
- New streaming helpers: make_tool_streaming_hint(), make_tool_streaming_active()
Parser improvements:
- Add is_json_invalidated() to detect false positive tool patterns
- Fix tool result poisoning when file contents contain partial JSON
- Unescaped newlines in strings or prose after JSON invalidates detection
User sees ' ● tool_name |' immediately when tool call starts streaming,
with blinking indicator while args are received.
When partial JSON tool call patterns appear in LLM output (e.g., from
quoting file content), the parser would incorrectly report them as
"incomplete tool calls", triggering auto-continue loops.
Fix: Added is_json_invalidated() to detect when partial JSON has been
invalidated by subsequent content that cannot be valid JSON:
- Unescaped newline inside a string (invalid JSON)
- Newline followed by prose text outside a string
The check is only applied to incomplete JSON - complete tool calls
with trailing text are still correctly detected.
Added 6 new tests covering:
- Tool results with partial JSON patterns
- LLM quoting file content inline vs on own line
- Comment prefixes (// # -- etc) with partial patterns
- Real incomplete tool calls (should still be detected)
The streaming parser was incorrectly detecting tool call patterns that
appeared inline in prose (e.g., when explaining the format), causing
g3 to return control mid-task.
Fix: Modified find_first_tool_call_start() and find_last_tool_call_start()
to only recognize patterns that appear on their own line (at start of
buffer or after newline with only whitespace before the pattern).
Changes:
- Added is_on_own_line() helper to check line-boundary conditions
- Updated detection methods to skip inline patterns
- Removed sanitize_inline_tool_patterns() and LBRACE_HOMOGLYPH (no longer needed)
- Rewrote tests for new behavior
- Added streaming_repro tests that use process_chunk() to verify the exact bug scenario
28 tests covering: streaming repro, line boundaries, Unicode, code contexts, edge cases
- Rename take_screenshot -> screenshot, code_coverage -> coverage (shorter names)
- Align | character across all compact tools (pad to 11 chars for str_replace)
- Make code_search a compact tool with summary display
- Show language and search name in code_search output (e.g., rust:"find structs")
- Add format_code_search_summary() to extract match/file counts from JSON response
Agent: hopper
Adds 56 new integration tests covering the observable end-of-turn
behaviors in the streaming module:
- Timing footer formatting (5 tests): verifies user-facing timing display
with various durations, token counts, and context percentages
- Tool call duplicate detection (6 tests): ensures identical sequential
tool calls are detected while different tools/args are not
- Empty response detection (9 tests): validates detection of empty,
whitespace-only, and timing-only responses that trigger auto-continue
- Connection error classification (5 tests): verifies EOF, connection,
chunk, and body errors are correctly identified for graceful recovery
- Tool output summary formatting (17 tests): covers read_file, write_file,
str_replace, remember, screenshot, coverage, and rehydrate summaries
- Duration formatting (4 tests): milliseconds, seconds, minutes, zero
- Text truncation (4 tests): short/long strings, multiline, flag behavior
- LLM token cleaning (3 tests): removal of stop tokens like <|im_end|>
- Edge cases (4 tests): empty inputs, unicode handling, large numbers
All tests are blackbox/characterization style - they test observable
outputs through stable public interfaces without encoding internal
implementation details. Tests remain stable under refactoring that
preserves behavior.
This change removes the legacy logs/ directory and consolidates all
session data, error logs, and discovery files under the .g3/ directory.
New directory structure:
- .g3/sessions/<session_id>/session.json - session logs
- .g3/errors/ - error logs (was logs/errors/)
- .g3/background_processes/ - background process logs
- .g3/discovery/ - planner discovery files (was workspace/logs/)
Changes:
- paths.rs: Remove get_logs_dir()/logs_dir(), add get_errors_dir(),
get_background_processes_dir(), get_discovery_dir()
- session.rs: Anonymous sessions now use .g3/sessions/anonymous_<ts>/
- error_handling.rs: Errors now saved to .g3/errors/
- project.rs: Remove logs_dir() and ensure_logs_dir() methods
- feedback_extraction.rs: Remove logs_dir field and fallback logic
- planner: Use .g3/ for workspace data and .g3/discovery/ for reports
- flock.rs: Look for session metrics in .g3/sessions/
- coach_feedback.rs: Remove fallback to logs/ path
- Update all tests to use new paths
- Update README.md and .gitignore
Agent: hopper
Added 4 new test files with blackbox/characterization-style integration tests:
- compaction_behavior_test.rs (14 tests): Token cap calculation, thinking mode
disable logic, summary message building, CompactionResult behavior
- retry_behavior_test.rs (17 tests): RetryConfig presets and customization,
RetryResult state handling, retry_operation behavior with simulated errors
- tool_execution_roundtrip_test.rs (16 tests): End-to-end tool execution through
Agent interface for read_file, write_file, shell, str_replace, and TODO tools
- error_classification_test.rs (25 tests): Recoverable vs non-recoverable error
classification, retry delay calculation, edge cases and priority handling
All tests follow integration-first philosophy:
- Test through stable public interfaces
- Assert observable behavior, not implementation details
- Use characterization style to document current behavior
- Enable refactoring by not encoding internal structure
ACD (Aggressive Context Dehydration) fixes:
- Fixed dehydrate_context() to extract turn summary from context window
instead of using the passed-in final_response (which contained only
the timing footer, not the actual LLM response)
- Removed final_response parameter from dehydrate_context() since it
now self-extracts the last assistant message as the summary
- This ensures the actual turn summary is preserved after dehydration,
not just the timing footer
New /dump command:
- Added /dump command to dump entire context window to tmp/ for debugging
- Shows message index, role, kind, content length, and full content
- Available in both console and machine modes
UTF-8 safety:
- Fixed truncate_to_word_boundary() to use character indices instead of
byte indices, preventing panics on multi-byte UTF-8 characters
- Added UTF-8 string slicing guidance to AGENTS.md
Agent: g3
final_output removal:
- Remove final_output from tool definitions and dispatch
- Update system prompts to request summaries as regular text
- Remove final_output_called field from StreamingState
- Update auto_continue tests to remove final_output_called parameter
- Remove final_output test from tool_execution_test.rs
- Update planner and flock prompts to not reference final_output
- Keep backwards-compat code in feedback_extraction.rs and task_result.rs
Scout report handback:
- Change from file-based to delimiter-based report extraction
- Scout outputs report between ---SCOUT_REPORT_START/END--- markers
- Research tool extracts content between markers, strips ANSI codes
- Add comprehensive tests for extraction and ANSI stripping
657 tests pass.
The buffer truncation code was slicing at a raw byte offset which could
land in the middle of a multi-byte character (like emojis), causing a
panic. Fixed by using char_indices() to find valid character boundaries.
Also added stop_reason field to CompletionChunk initializers in tests
to complete the stop_reason feature addition.
- Fix byte boundary panic in filter_json.rs line 327
- Add test for multi-byte character handling
- Update test files with missing stop_reason field
- Remove final_output from tool definitions, dispatch, and misc tools
- Update system prompts to request summaries as regular markdown text
- Remove print_final_output from UiWriter trait and all implementations
- Remove final_output handling from agent core logic
- Rename final_output_summary → summary in session continuation
- Delete final_output test files
- Update tool count tests (12→11, 27→26)
This allows LLM summaries to stream through the markdown formatter
for a more natural, responsive user experience instead of buffering
everything into a tool call.
This fixes a bug where the agent would stop responding abruptly without
calling final_output. The root cause was the allow_multiple_tool_calls
config option (default: false) which caused the agent to break out of
the streaming loop mid-stream after executing the first tool, losing
any subsequent content.
Changes:
- Remove allow_multiple_tool_calls config option entirely
- Always process all tool calls without breaking mid-stream
- Simplify system prompt generation (no longer needs boolean param)
- Let the stream complete fully before continuing to next iteration
- Change find_last_tool_call_start to find_first_tool_call_start
- Remove parser.reset() call on duplicate detection
Benefits:
- Simpler logic with less conditional branching
- No lost content after tool calls
- Consistent behavior for all users
- Reduced config complexity
New test files:
- crates/g3-cli/tests/cli_integration_test.rs (14 tests)
Blackbox CLI tests: help/version flags, argument validation,
conflicting modes, flock mode requirements
- crates/g3-core/tests/tool_execution_test.rs (20 tests)
Tool call structure tests and unified diff application:
read_file, write_file, str_replace, shell, background_process,
todo, final_output, code_search, take_screenshot
- crates/g3-providers/tests/message_serialization_test.rs (20 tests)
Round-trip serialization tests for Message, MessageRole,
CacheControl, and Tool types. Covers Unicode, special chars,
and edge cases.
All tests follow blackbox/integration-first principles with
documentation of what they protect and intentionally do not assert.
- Add TODO completion check to final_output tool in autonomous mode only
- When incomplete TODO items exist, reject final_output and prompt LLM to continue
- Non-autonomous modes (interactive, chat) are unaffected
- Add 6 tests verifying behavior in both autonomous and non-autonomous modes
Fixes issue where LLM would call final_output after completing first phase,
causing agent to stop prematurely instead of continuing with remaining phases.
Adds a new tool that allows launching processes (like game servers) in the
background while g3 continues to operate. The process runs independently
with stdout/stderr captured to a log file.
Features:
- Named process tracking for easy reference
- Automatic log capture to logs/background_processes/
- Returns PID and log file path for use with shell tool
- Automatic cleanup on agent shutdown via Drop trait
Usage: Use shell tool to interact with the process:
- Read logs: tail -100 <logfile>
- Check status: ps -p <pid>
- Stop process: kill <pid>
Files:
- New: crates/g3-core/src/background_process.rs
- New: crates/g3-core/tests/background_process_demo_test.rs
- Modified: crates/g3-core/src/lib.rs (tool definition + handler)
- Modified: crates/g3-core/src/prompts.rs (documentation)
Added 13 tests to verify that duplicate detection only catches
IMMEDIATELY SEQUENTIAL duplicates:
- test_find_complete_json_object_end_* - Tests for JSON parsing helper
- test_same_tool_with_text_between_not_duplicate - Key test ensuring
tool calls separated by text are NOT duplicates
- test_different_tools_back_to_back_not_duplicate
- test_same_tool_different_args_not_duplicate
- test_identical_tool_calls_back_to_back_are_duplicates
- test_has_text_after_tool_call - Tests text detection logic
- test_tool_call_with_newlines_between
- test_tool_call_with_whitespace_text_between
- test_tool_call_in_middle_of_text
- test_multiple_different_tool_calls_with_text
Also made find_complete_json_object_end public for testing.
- Add last_consumed_position tracking to StreamingToolParser to prevent
re-detecting already-executed tool calls
- Add mark_tool_calls_consumed() method to mark tool calls as processed
- Add find_first_tool_call_start() for forward scanning of tool patterns
- Replace try_parse_json_tool_call_from_buffer() with
try_parse_all_json_tool_calls_from_buffer() to find ALL tool calls
- Update has_incomplete_tool_call() and has_unexecuted_tool_call() to
only check unconsumed portion of buffer
- Fix tool execution loop to not reset parser when unexecuted tools remain
- Simplify should_auto_continue logic (remove redundant condition)
- Add comprehensive tests for auto-continue condition logic