Commit Graph

81 Commits

Author SHA1 Message Date
Dhanji R. Prasanna
570a824780 Rename archivist agent to huffman
Named after David Huffman, inventor of Huffman coding -
compression that preserves information with fewer bits.

Fits the agent's purpose: compact memory, preserve semantics.
2026-01-29 11:22:59 +11:00
Dhanji R. Prasanna
56f558dc1b Fix compiler warnings in test files
Eliminate unused variable and import warnings across test files:
- streaming_parser_test.rs: prefix unused `tools` with underscore
- webdriver_session.rs: remove unused `use super::*` import
- mock_provider_integration_test.rs: prefix unused `result` and `task_result`
- test_preflight_max_tokens.rs: prefix unused `proposed_max`
- todo_staleness_test.rs: add #[allow(dead_code)] for test helper methods
- json_parsing_stress_test.rs: prefix unused `tools`
- read_file_token_limit_test.rs: add #[allow(dead_code)] for unused helper
- background_process_demo_test.rs: remove unused PathBuf import
- test_session_continuation.rs: prefix unused `temp_dir` in 7 tests

All tests pass. No behavior changes.

Agent: fowler
2026-01-29 11:15:10 +11:00
Dhanji R. Prasanna
7bfb9efa19 Remove automatic README loading from context window
README.md is no longer auto-loaded into the LLM context at startup.
This saves ~4,600 tokens per session while AGENTS.md and memory.md
still provide all critical information for code tasks.

Changes:
- Delete read_project_readme() function
- Remove readme_content parameter from combine_project_content()
- Rename extract_readme_heading() -> extract_project_heading()
- Rename Agent constructors: *_with_readme_* -> *_with_project_context_*
- Update context preservation to only check for Agent Configuration
- Remove has_readme field from LoadedContent
- Update all tests to use new markers and function names

The LLM can still read README.md on-demand via read_file when needed.
2026-01-29 11:07:41 +11:00
Dhanji R. Prasanna
a902be1562 Refactor system prompts to eliminate duplication; upgrade embedded provider
- Refactor prompts.rs: extract shared sections (intro, TODO, workspace memory,
  web research, response guidelines) used by both native and non-native prompts
- Fix typo in native prompt: "save them.." -> "save them."
- Fix non-native prompt: add missing closing braces in JSON examples,
  add IMPORTANT steps section, align with native prompt quality
- Add 9 unit tests to verify both prompts contain required sections
- Upgrade llama-cpp-2 dependency and refactor embedded provider
- Update config.example.toml with embedded model examples
- Update workspace memory
2026-01-28 09:56:39 +11:00
Dhanji R. Prasanna
5b4079e861 Add prompt cache statistics tracking to /stats command
- Extend Usage struct with cache_creation_tokens and cache_read_tokens fields
- Parse Anthropic cache_creation_input_tokens and cache_read_input_tokens
- Parse OpenAI prompt_tokens_details.cached_tokens for automatic prefix caching
- Add CacheStats struct to Agent for cumulative tracking across API calls
- Add "Prompt Cache Statistics" section to /stats output showing:
  - API call count and cache hit count
  - Hit rate percentage
  - Total input tokens and cache read/creation tokens
  - Cache efficiency (% of input served from cache)
- Update all provider implementations and test files
2026-01-27 11:32:45 +11:00
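A minimal sketch of how the hit rate and cache efficiency figures listed in commit 5b4079e861 could be derived. The struct layout, field names, and the `record` method are assumptions for illustration, not the actual `CacheStats` in g3. The sketch also assumes `input_tokens` is the full input size, including any tokens served from cache.

```rust
/// Hypothetical cumulative counters; names are illustrative only.
#[derive(Debug, Default)]
struct CacheStats {
    api_calls: u64,
    cache_hits: u64,
    total_input_tokens: u64,
    cache_read_tokens: u64,
    cache_creation_tokens: u64,
}

impl CacheStats {
    /// Record the usage of one API call.
    /// Assumption: a call is a "hit" if it read at least one cached token.
    fn record(&mut self, input_tokens: u64, cache_read: u64, cache_creation: u64) {
        self.api_calls += 1;
        if cache_read > 0 {
            self.cache_hits += 1;
        }
        self.total_input_tokens += input_tokens;
        self.cache_read_tokens += cache_read;
        self.cache_creation_tokens += cache_creation;
    }

    /// Percentage of API calls that hit the cache.
    fn hit_rate_percent(&self) -> f64 {
        if self.api_calls == 0 {
            return 0.0;
        }
        100.0 * self.cache_hits as f64 / self.api_calls as f64
    }

    /// Percentage of input tokens that were served from the cache.
    fn efficiency_percent(&self) -> f64 {
        if self.total_input_tokens == 0 {
            return 0.0;
        }
        100.0 * self.cache_read_tokens as f64 / self.total_input_tokens as f64
    }
}
```

Both percentages guard against division by zero so that `/stats` could be called before any API request has been made.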
Dhanji R. Prasanna
2e84f1ece0 test: fix ACD test race condition and add read_image characterization test
- Fix test_rehydrate_success race condition by using UUID for unique session IDs
- Add #[serial] attribute to prevent parallel execution conflicts
- Improve cleanup to remove entire session directory tree
- Add characterization test for resize_image_to_dimensions fallback behavior
  (documents fix from commit af8b849 for media type preservation)

Agent: hopper
2026-01-26 16:19:53 +11:00
Dhanji R. Prasanna
726e2d71f5 test: add integration test for project content surviving compaction
Add test_project_content_survives_compaction() to verify that project
content loaded via /project command persists through context compaction.

This is a CHARACTERIZATION test that validates:
- Project content appended to README message survives compaction
- The README message (containing project content) is preserved as message[1]
- PROJECT INSTRUCTIONS, ACTIVE PROJECT markers, Brief and Status sections
  all survive the compaction process

Agent: hopper
2026-01-26 16:09:17 +11:00
Dhanji R. Prasanna
9de8e8cc76 Fix compaction bug: use User role for summary to maintain alternation
The previous implementation added the summary as a System message, which
caused "Conversation must start with a user message" errors because the
first non-system message after compaction was Assistant (the preserved
last assistant message).

Fix: Change summary from System to User message, creating valid alternation:
[System Prompt] -> [Summary as USER] -> [Last Assistant] -> [Latest User]

This also prevents system message bloat across multiple compactions since
the summary is now part of the conversation flow and gets replaced on
each compaction.

Added test_second_compaction_no_bloat to verify no accumulation.
2026-01-26 15:24:04 +11:00
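The alternation rule behind commit 9de8e8cc76 can be sketched as follows. The `Role` enum and both functions are hypothetical stand-ins, not g3's message types. The sketch only models message roles, which is enough to show why a System-role summary fails the check and a User-role summary passes.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Role {
    System,
    User,
    Assistant,
}

/// Role sequence after compaction, following the layout in the commit message:
/// [System Prompt] -> [Summary as USER] -> [Last Assistant] -> [Latest User]
fn compacted_roles(has_latest_user: bool) -> Vec<Role> {
    let mut roles = vec![
        Role::System,    // system prompt
        Role::User,      // summary, now sent with the User role
        Role::Assistant, // preserved last assistant message
    ];
    if has_latest_user {
        roles.push(Role::User);
    }
    roles
}

/// True when the first non-system message is from the user and
/// user/assistant messages strictly alternate afterwards.
fn is_valid_alternation(roles: &[Role]) -> bool {
    let convo: Vec<Role> = roles
        .iter()
        .copied()
        .filter(|r| *r != Role::System)
        .collect();
    match convo.first() {
        None => return true,
        Some(Role::User) => {}
        Some(_) => return false,
    }
    // Only User and Assistant remain, so "no two equal neighbours" means alternation.
    convo.windows(2).all(|pair| pair[0] != pair[1])
}
```

With the previous layout, `[System, System, Assistant, User]`, the first non-system message is the assistant's, so the check fails in the same way the provider rejected the conversation.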
Dhanji R. Prasanna
5d0d532b47 feat: preserve last assistant message during compaction
When context window compaction occurs, the last assistant message is now
preserved in addition to the system prompt, README, and summary. This
improves continuity after compaction by keeping the LLM's most recent
response, which often contains important context about what was just
done or what comes next.

New message order after compaction:
[System Prompt] -> [README/AGENTS.md] -> [ACD Stub?] -> [Summary] -> [Last Assistant] -> [Latest User?]

Changes:
- Add last_assistant_message field to PreservedMessages struct
- Modify extract_preserved_messages() to find last assistant message
- Modify reset_with_summary_and_stub() to include last assistant message
- Add comprehensive integration tests using MockProvider

Tests cover edge cases:
- No assistant message exists
- Tool-call-only assistant messages (still preserved)
- Multiple assistant messages (only last one preserved)
- No trailing user message
2026-01-23 09:54:03 +05:30
Dhanji R. Prasanna
feb7c3e40d Add /project and /unproject commands for project-specific context
- Add Project struct in crates/g3-cli/src/project.rs with file loading logic
- Load brief.md, contacts.yaml, status.md from project path
- Load projects.md from workspace root for cross-project context
- Project content appended to system message (survives compaction/dehydration)
- /project <path> loads project and auto-submits prompt asking about state
- /unproject clears project content and resets context
- Add set_project_content(), clear_project_content(), has_project_content() to Agent
- Add new_for_test_with_readme() for testing with custom README content
- Add 6 unit tests for Project struct
- Add 9 integration tests for project context behavior
2026-01-21 14:53:30 +05:30
Dhanji R. Prasanna
6a5ce11e7b Consolidate redundant assistant message test files
Deleted 4 redundant test files (~956 lines):
- assistant_message_dedup_test.rs (416 lines, 12 tests)
- consecutive_assistant_message_test.rs (248 lines, 6 tests)
- missing_assistant_message_test.rs (100 lines, 4 tests)
- early_return_path_test.rs (192 lines, 5 tests) - whitebox test

Created consolidated assistant_message_test.rs (369 lines, 14 tests):
- Helper function tests for consecutive message detection
- ContextWindow unit tests for normal and tool execution flows
- Bug demonstration tests documenting what bugs looked like
- Invariant tests for user/assistant alternation
- Missing assistant message fallback logic tests

The early_return_path_test was removed because it:
- Referenced specific line numbers in production code (brittle)
- Reimplemented internal logic (whitebox anti-pattern)
- Duplicated coverage from mock_provider_integration_test.rs

All 729 g3-core tests pass.
2026-01-21 10:27:07 +05:30
Dhanji R. Prasanna
9a0a2a2726 Make dehydration stub more compact
Change from multi-line verbose format to single-line compact format:

Before:
   DEHYDRATED CONTEXT (fragment_id: 188c7ac71613)
     • 8 messages (4 user, 4 assistant)
     • 3 tool calls (shell ×3)
     • ~299 tokens saved

     To restore this history, call: rehydrate(fragment_id: "188c7ac71613")

After:
   DEHYDRATED CONTEXT: 3 tool calls (shell x3), 8 total msgs. To restore, call: rehydrate(fragment_id: "188c7ac71613")

- Combine all info into single line
- Remove tokens saved (not essential for rehydration decision)
- Use ASCII 'x' instead of '×' for simplicity
- Add 'no tool calls' case for fragments without tools
- Update related tests
2026-01-20 21:26:42 +05:30
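A sketch of a formatter that produces the single-line stub shown under "After" in commit 9a0a2a2726. The function name and parameters are hypothetical, and the exact wording of the 'no tool calls' case is an assumption, since the commit message does not show it.

```rust
/// Build the compact one-line dehydration stub.
/// `tool_counts` holds (tool name, call count) pairs in display order.
fn format_stub(fragment_id: &str, total_msgs: usize, tool_counts: &[(&str, usize)]) -> String {
    let total_calls: usize = tool_counts.iter().map(|(_, n)| n).sum();
    let tools = if total_calls == 0 {
        // Assumed wording for fragments without any tool calls.
        "no tool calls".to_string()
    } else {
        let breakdown: Vec<String> = tool_counts
            .iter()
            .map(|(name, n)| format!("{name} x{n}")) // ASCII 'x', not '×'
            .collect();
        format!("{} tool calls ({})", total_calls, breakdown.join(", "))
    };
    format!(
        "DEHYDRATED CONTEXT: {tools}, {total_msgs} total msgs. \
         To restore, call: rehydrate(fragment_id: \"{fragment_id}\")"
    )
}
```

Calling `format_stub("188c7ac71613", 8, &[("shell", 3)])` reproduces the example line from the commit message. The sketch does not special-case the singular "1 tool call".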
Dhanji R. Prasanna
182f5f98fe Centralize g3 status message formatting
Extract a new g3_status module in g3-cli that provides consistent formatting
for all 'g3:' prefixed system status messages.

Key changes:
- Add G3Status struct with methods for progress, done, failed, error, etc.
- Add Status enum with Done, Failed, Error, Resolved, Insufficient, NoChanges
- Add ThinResult struct in g3-core for semantic thinning data
- Update UiWriter trait with print_thin_result() method
- Refactor context thinning to return ThinResult instead of formatted strings
- Update all callers to use the new centralized formatting
- Session resume/decline messages now use G3Status
- Compaction status messages now use G3Status

This maintains clean separation of concerns: g3-core emits semantic data,
g3-cli handles all terminal formatting and colors.
2026-01-20 09:50:55 +05:30
Dhanji R. Prasanna
f4cce22db3 Add test documenting LLM duplicate text behavior
Adds test_llm_repeats_text_before_each_tool_call() which documents the
scenario where the LLM re-outputs the same preamble text before each
tool call in a multi-tool response.

Analysis showed this is LLM behavior, not a g3 bug:
- Each assistant message is correctly stored with different tool calls
- The duplicate display is the LLM choosing to repeat context
- Storage is correct, display accurately reflects LLM output

Decision: Accept as LLM behavior (Option B). Future LLM improvements
may resolve this naturally without g3 code changes.
2026-01-19 18:44:01 +05:30
Dhanji R. Prasanna
1604ed613a Add integration tests proving tool results are never parsed as tool calls
Adds 3 new tests to json_parsing_stress_test.rs:
- test_tool_result_with_json_not_parsed: Full agent integration test proving
  that JSON in tool results (sent TO the LLM) is never parsed by the
  streaming parser (which only sees LLM output)
- test_parser_only_processes_completion_chunks: Documents that StreamingToolParser
  only accepts CompletionChunk, not Message objects
- test_architectural_separation_documented: Documents the data flow showing
  tool results flow TO the LLM while the parser only sees FROM the LLM

This proves the architectural guarantee: there is no code path where
tool result content could be parsed as a tool call, because:
1. Tool results are Message objects added to context_window
2. The streaming parser only processes CompletionChunk from provider.stream_completion()
3. These are completely separate data types flowing in opposite directions

Total: 41 JSON parsing stress tests now pass.
2026-01-19 16:21:36 +05:30
Dhanji R. Prasanna
2043a83e7d Add comprehensive MockProvider integration tests
Added 6 new integration tests for stream_completion_with_tools:
- test_text_before_tool_call_preserved: text before native tool call is saved
- test_native_tool_call_execution: native tool calls execute correctly
- test_duplicate_tool_calls_skipped: sequential duplicates are detected
- test_json_fallback_tool_calling: JSON tool calls work without native support
- test_text_after_tool_execution_preserved: follow-up text is saved
- test_multiple_tool_calls_executed: multiple tool calls in sequence work

Also added MockResponse helper methods:
- text_then_native_tool(): text followed by native tool call
- duplicate_native_tool_calls(): same tool call twice (for dedup testing)

Fixed text_with_json_tool() to ensure "tool" key comes before "args"
(serde_json alphabetizes keys, breaking pattern detection).

Total: 18 integration tests covering historical bugs and core behaviors.
2026-01-19 14:44:30 +05:30
Dhanji R. Prasanna
5caa101b84 Fix inline JSON being incorrectly detected as tool call
The bug was caused by mark_tool_calls_consumed() being called after
displaying each chunk, which advanced last_consumed_position to the
end of the current buffer. When the next chunk arrived with JSON,
the unchecked_buffer started at position 0 of the slice, causing
is_on_own_line() to return true (position 0 is always "on its own line").

Removed the problematic mark_tool_calls_consumed() call from the
"no tool executed" branch. The remaining call after actual tool
execution is correct and necessary.

Added integration test that verifies inline JSON in prose is not
detected as a tool call.
2026-01-19 14:35:01 +05:30
Dhanji R. Prasanna
292a3aa48d Add MockProvider for integration testing
Adds a configurable mock LLM provider that can simulate various behaviors:
- Text-only responses (single or multi-chunk streaming)
- Native tool calls
- JSON tool calls in text
- Truncated responses (max_tokens)
- Multi-turn conversations

Features:
- Builder pattern for easy test setup
- Request tracking for verification
- Preset scenarios for common patterns
- Full LLMProvider trait implementation

Also adds integration tests that use MockProvider to test the
stream_completion_with_tools code path, including:
- test_butler_bug_scenario: reproduces the exact bug where text-only
  responses were not saved to context, causing consecutive user messages

This enables testing complex streaming behaviors without real API calls.
2026-01-19 13:59:31 +05:30
Dhanji R. Prasanna
349230d0b7 Fix missing assistant messages in context window
Bug: When the LLM responded with text-only (no tool calls), the assistant
message was sometimes not saved to the context window. This caused consecutive
user messages where the LLM would lose track of previous responses.

Root causes found and fixed:

1. Early return path (line ~2535): When stream finishes with no tools executed
   in previous iterations (any_tool_executed=false), the code returned early
   without saving the assistant message. Fixed by adding save before return.

2. Post-loop path (line ~2657): When raw_clean was empty but current_response
   had content, no message was saved. Fixed by falling back to current_response.

Both paths now properly save the assistant message before returning.
The assistant_message_added flag prevents any duplication.

Added tests:
- missing_assistant_message_test.rs: verifies the fallback logic
- assistant_message_dedup_test.rs: verifies no duplicate messages
- consecutive_assistant_message_test.rs: verifies alternation invariant
2026-01-19 13:50:28 +05:30
Dhanji R. Prasanna
74b1b9bea3 refactor: simplify context thinning status message
Change format from verbose emoji-based message to cleaner status line:
  Before:  🥒 Context thinned at 70%: 7 tool results, ~33839 chars saved 
  After:  g3: thinning context ... 70% -> 40% ... [done]

The new format shows before/after percentages and uses bold green for
'g3:' and '[done]' to match other status messages.

Also removes unused emoji() and label() methods from ThinScope.
2026-01-17 04:47:16 +05:30
Dhanji R. Prasanna
1003386f7f Auto-resize large images (>=5MB) in read_image tool
Images >= 5MB are now automatically resized to < 4.9MB using ImageMagick
before being sent to the LLM. This prevents API errors from oversized images.

- Uses iterative quality/scale reduction to find optimal size
- Converts to JPEG for better compression
- Shows original and resized size in terminal output (e.g., '6.2 MB → 4.1 MB (resized)')
- Falls back to original if ImageMagick fails or isn't available
2026-01-16 21:09:38 +05:30
Dhanji R. Prasanna
fc702168ab Add streaming completion integration test with mock LLM provider
Adds tests to verify that:
- All streaming chunks are processed before control returns to caller
- Both tool calls in a multi-tool-call stream are executed
- The finished signal properly terminates stream processing

Also adds Agent::new_for_test() to allow injecting mock providers.
2026-01-16 20:52:32 +05:30
Dhanji R. Prasanna
0e33465342 Add print_g3_progress/print_g3_status methods for consistent status messages
2026-01-16 20:28:24 +05:30
Dhanji R. Prasanna
6bd9c51e8e feat: shell output pagination and optimized read_file with seek
- Shell outputs > 8KB are truncated to first 500 chars
- Full output saved to .g3/sessions/<session_id>/tools/shell_stdout_<id>.txt
- LLM can use read_file with start/end to paginate through large outputs
- read_file now uses seek() for O(1) random access instead of reading entire file
- UTF-8 safe: reads extra bytes at boundaries to find valid char positions
- Falls back to lossy conversion for binary files (no panics)

Files changed:
- paths.rs: get_tools_output_dir(), generate_short_id()
- shell.rs: truncate_large_output() integration
- file_ops.rs: seek-based read_file_range() helper
- New test: read_file_utf8_test.rs
2026-01-16 09:16:16 +05:30
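A rough sketch of the seek-based, UTF-8 safe range read described in commit 6bd9c51e8e. The signature and the boundary strategy are assumptions; the real `read_file_range()` in file_ops.rs may differ. Here the range is given as byte offsets `[start, end)`, leading continuation bytes are dropped, and up to three extra bytes are read so that a character straddling `end` is completed.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;

/// UTF-8 continuation bytes have the bit pattern 10xxxxxx.
fn is_continuation(byte: u8) -> bool {
    (byte & 0xC0) == 0x80
}

/// Read roughly bytes [start, end) of a file without loading the whole file.
fn read_file_range(path: &Path, start: u64, end: u64) -> io::Result<String> {
    let mut file = File::open(path)?;
    let len = file.metadata()?.len();
    let start = start.min(len);
    let end = end.clamp(start, len);
    // A UTF-8 char is at most 4 bytes, so 3 extra bytes can complete the last one.
    let read_end = (end + 3).min(len);

    file.seek(SeekFrom::Start(start))?;
    let mut buf = vec![0u8; (read_end - start) as usize];
    file.read_exact(&mut buf)?;

    // Skip continuation bytes of a char that began before `start`.
    let mut lo = 0;
    while lo < buf.len() && lo < 3 && is_continuation(buf[lo]) {
        lo += 1;
    }
    // Extend past `end` while we are still inside a multi-byte char.
    let mut hi = (end - start) as usize;
    while hi < buf.len() && is_continuation(buf[hi]) {
        hi += 1;
    }
    let hi = hi.max(lo);

    // Lossy conversion keeps binary files from causing a panic or an error.
    Ok(String::from_utf8_lossy(&buf[lo..hi]).into_owned())
}
```

Because only the requested window is read after the seek, the cost depends on the range size rather than on the file size.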
Dhanji R. Prasanna
0ae1a13cdb feat: real-time tool call streaming indicator with blinking UI
- Add ToolParsingHint enum (Detected/Active/Complete) for UI feedback
- New UiWriter methods: print_tool_streaming_hint(), print_tool_streaming_active()
- Refactor ConsoleUiWriter state to use atomics in ParsingHintState
- Add tool_call_streaming field to CompletionChunk for provider hints
- Anthropic provider sends streaming hints when tool name detected
- New streaming helpers: make_tool_streaming_hint(), make_tool_streaming_active()

Parser improvements:
- Add is_json_invalidated() to detect false positive tool patterns
- Fix tool result poisoning when file contents contain partial JSON
- Unescaped newlines in strings or prose after JSON invalidates detection

User sees ' ● tool_name |' immediately when tool call starts streaming,
with blinking indicator while args are received.
2026-01-15 13:49:29 +05:30
Dhanji R. Prasanna
d68f059acf fix: detect invalidated JSON tool calls to prevent parser poisoning
When partial JSON tool call patterns appear in LLM output (e.g., from
quoting file content), the parser would incorrectly report them as
"incomplete tool calls", triggering auto-continue loops.

Fix: Added is_json_invalidated() to detect when partial JSON has been
invalidated by subsequent content that cannot be valid JSON:
- Unescaped newline inside a string (invalid JSON)
- Newline followed by prose text outside a string

The check is only applied to incomplete JSON - complete tool calls
with trailing text are still correctly detected.

Added 6 new tests covering:
- Tool results with partial JSON patterns
- LLM quoting file content inline vs on own line
- Comment prefixes (// # -- etc) with partial patterns
- Real incomplete tool calls (should still be detected)
2026-01-15 13:49:29 +05:30
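The two invalidation rules in commit d68f059acf can be approximated as below. This is a simplified sketch, not the actual `is_json_invalidated()`: it treats any alphabetic character at the start of a line outside a string as prose, except for the JSON literals `true`, `false`, and `null`.

```rust
/// Returns true when a partial JSON fragment can no longer become valid JSON.
fn is_json_invalidated(partial: &str) -> bool {
    let mut in_string = false;
    let mut escaped = false;
    for (i, c) in partial.char_indices() {
        if in_string {
            if escaped {
                escaped = false; // this char is consumed by the preceding backslash
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                in_string = false;
            } else if c == '\n' {
                // Rule 1: a raw newline inside a string is invalid JSON.
                return true;
            }
        } else if c == '"' {
            in_string = true;
        } else if c == '\n' {
            // Rule 2: newline followed by prose text outside a string.
            let rest = partial[i + 1..].trim_start();
            let is_literal = ["true", "false", "null"]
                .iter()
                .any(|lit| rest.starts_with(lit));
            let starts_with_letter = rest.chars().next().map_or(false, char::is_alphabetic);
            if !is_literal && starts_with_letter {
                return true;
            }
        }
    }
    false
}
```

A fragment that is merely incomplete, such as a tool call whose arguments are still streaming, returns `false` and would still be reported as an incomplete tool call.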
Dhanji R. Prasanna
999ac6fe66 fix: prevent parser poisoning from inline tool-call JSON patterns
The streaming parser was incorrectly detecting tool call patterns that
appeared inline in prose (e.g., when explaining the format), causing
g3 to return control mid-task.

Fix: Modified find_first_tool_call_start() and find_last_tool_call_start()
to only recognize patterns that appear on their own line (at start of
buffer or after newline with only whitespace before the pattern).

Changes:
- Added is_on_own_line() helper to check line-boundary conditions
- Updated detection methods to skip inline patterns
- Removed sanitize_inline_tool_patterns() and LBRACE_HOMOGLYPH (no longer needed)
- Rewrote tests for new behavior
- Added streaming_repro tests that use process_chunk() to verify the exact bug scenario

28 tests covering: streaming repro, line boundaries, Unicode, code contexts, edge cases
2026-01-15 13:49:29 +05:30
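The line-boundary condition from commit 999ac6fe66 (start of buffer, or after a newline with only whitespace before the pattern) could look roughly like this. The signature is an assumption; `pos` is taken to be the byte offset of the candidate pattern, for example as returned by `str::find`.

```rust
/// True when only whitespace sits between the start of the line and `pos`.
/// `pos` must lie on a char boundary of `buffer`.
fn is_on_own_line(buffer: &str, pos: usize) -> bool {
    let before = &buffer[..pos];
    // The line starts just after the previous newline, or at the buffer start.
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    before[line_start..].chars().all(char::is_whitespace)
}
```

A pattern quoted in the middle of a sentence has non-whitespace text before it on the same line, so it is skipped, while an indented tool call on a fresh line is still recognized.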
Dhanji R. Prasanna
38828c7757 Clean up tool output formatting
- Shell: "Command executed successfully" → "ran successfully"
- Write file: Remove ✏️ emoji, use plain "wrote N lines | M chars"
2026-01-14 19:42:54 +05:30
Dhanji R. Prasanna
dea0e6b1ca Compact tool output improvements
- Rename take_screenshot -> screenshot, code_coverage -> coverage (shorter names)
- Align | character across all compact tools (pad to 11 chars for str_replace)
- Make code_search a compact tool with summary display
- Show language and search name in code_search output (e.g., rust:"find structs")
- Add format_code_search_summary() to extract match/file counts from JSON response
2026-01-14 08:12:50 +05:30
Dhanji R. Prasanna
3a47ebe668 better racket example support
2026-01-13 21:16:14 +05:30
Dhanji R. Prasanna
151b8c4658 Add Racket tree-sitter support, remove Kotlin
- Add tree-sitter-racket dependency (v0.24)
- Initialize Racket parser in code search
- Add .rkt, .rktl, .rktd file extensions
- Add test_racket_search test
- Remove Kotlin from supported languages (was disabled)
- Clean up duplicate test files

Supported languages: Rust, Python, JavaScript, TypeScript, Go, Java, C, C++, Racket
2026-01-13 18:44:59 +05:30
Dhanji R. Prasanna
b89d55a9ff Add characterization tests for stream_completion_with_tools
Add 32 blackbox characterization tests to lock down the behavior of the
stream_completion_with_tools function (1067 lines) before refactoring.

Tests cover key behaviors through stable boundaries:
- StreamingToolParser: tool call detection, incomplete detection, text accumulation
- Auto-continue logic: autonomous mode decisions, priority ordering
- Duplicate detection: sequential duplicates, cross-message duplicates
- Context window: token tracking, compaction threshold, history preservation
- Tool execution: read_file, shell, write_file, todo tools through Agent
- Streaming utilities: LLM token cleaning, duration formatting, truncation
- Parser sanitization: inline tool pattern handling, homoglyph replacement

These tests intentionally do NOT assert:
- Internal parser state or implementation details
- Specific timing values
- UI output formatting
- Provider-specific behavior

Agent: hopper
2026-01-13 16:25:33 +05:30
Dhanji R. Prasanna
dc45987e8d Add characterization tests for UTF-8 truncation and parser sanitization
Agent: hopper

Adds 32 new integration tests covering recent commits:

## UTF-8 Safe Truncation Tests (14 tests)
Covers commit f30f145 (Fix UTF-8 panics):
- Topic extraction with emoji, CJK, and multi-byte characters
- Truncation at character boundaries (not byte boundaries)
- Edge cases: exactly 50 chars, 51 chars, 2-byte/3-byte/4-byte UTF-8
- Stub generation with multi-byte topics
- Combining characters and diacritics

## Parser Sanitization Tests (18 tests)
Covers commit 4c36cc0 (Prevent parser poisoning):
- Code block contexts (inline code, after fences, prose)
- Line boundary edge cases (empty lines, whitespace, indentation)
- Unicode handling (emoji, bullets, CJK before patterns)
- Multiple patterns on same line
- Negative cases (similar but different patterns, partial patterns)
- Real-world scenarios from the original bug report

All tests are blackbox/characterization style - they test observable
outputs through stable public interfaces without encoding internal
implementation details.
2026-01-13 11:22:46 +05:30
Dhanji R. Prasanna
6f50d01ab6 Add comprehensive end-of-turn behavior tests for g3-core
Agent: hopper

Adds 56 new integration tests covering the observable end-of-turn
behaviors in the streaming module:

- Timing footer formatting (5 tests): verifies user-facing timing display
  with various durations, token counts, and context percentages

- Tool call duplicate detection (6 tests): ensures identical sequential
  tool calls are detected while different tools/args are not

- Empty response detection (9 tests): validates detection of empty,
  whitespace-only, and timing-only responses that trigger auto-continue

- Connection error classification (5 tests): verifies EOF, connection,
  chunk, and body errors are correctly identified for graceful recovery

- Tool output summary formatting (17 tests): covers read_file, write_file,
  str_replace, remember, screenshot, coverage, and rehydrate summaries

- Duration formatting (4 tests): milliseconds, seconds, minutes, zero

- Text truncation (4 tests): short/long strings, multiline, flag behavior

- LLM token cleaning (3 tests): removal of stop tokens like <|im_end|>

- Edge cases (4 tests): empty inputs, unicode handling, large numbers

All tests are blackbox/characterization style - they test observable
outputs through stable public interfaces without encoding internal
implementation details. Tests remain stable under refactoring that
preserves behavior.
2026-01-12 21:17:32 +05:30
Dhanji R. Prasanna
c2aa80647a Remove legacy logs/ directory, consolidate all data under .g3/
This change removes the legacy logs/ directory and consolidates all
session data, error logs, and discovery files under the .g3/ directory.

New directory structure:
- .g3/sessions/<session_id>/session.json - session logs
- .g3/errors/ - error logs (was logs/errors/)
- .g3/background_processes/ - background process logs
- .g3/discovery/ - planner discovery files (was workspace/logs/)

Changes:
- paths.rs: Remove get_logs_dir()/logs_dir(), add get_errors_dir(),
  get_background_processes_dir(), get_discovery_dir()
- session.rs: Anonymous sessions now use .g3/sessions/anonymous_<ts>/
- error_handling.rs: Errors now saved to .g3/errors/
- project.rs: Remove logs_dir() and ensure_logs_dir() methods
- feedback_extraction.rs: Remove logs_dir field and fallback logic
- planner: Use .g3/ for workspace data and .g3/discovery/ for reports
- flock.rs: Look for session metrics in .g3/sessions/
- coach_feedback.rs: Remove fallback to logs/ path
- Update all tests to use new paths
- Update README.md and .gitignore
2026-01-12 18:20:08 +05:30
Dhanji R. Prasanna
5dfabaf19a Add 72 integration tests for compaction, retry, tool execution, and error classification
Agent: hopper

Added 4 new test files with blackbox/characterization-style integration tests:

- compaction_behavior_test.rs (14 tests): Token cap calculation, thinking mode
  disable logic, summary message building, CompactionResult behavior

- retry_behavior_test.rs (17 tests): RetryConfig presets and customization,
  RetryResult state handling, retry_operation behavior with simulated errors

- tool_execution_roundtrip_test.rs (16 tests): End-to-end tool execution through
  Agent interface for read_file, write_file, shell, str_replace, and TODO tools

- error_classification_test.rs (25 tests): Recoverable vs non-recoverable error
  classification, retry delay calculation, edge cases and priority handling

All tests follow integration-first philosophy:
- Test through stable public interfaces
- Assert observable behavior, not implementation details
- Use characterization style to document current behavior
- Enable refactoring by not encoding internal structure
2026-01-12 11:40:19 +05:30
Dhanji R. Prasanna
f415dbb84b Fix ACD turn summary loss and add /dump command
ACD (Aggressive Context Dehydration) fixes:
- Fixed dehydrate_context() to extract turn summary from context window
  instead of using the passed-in final_response (which contained only
  the timing footer, not the actual LLM response)
- Removed final_response parameter from dehydrate_context() since it
  now self-extracts the last assistant message as the summary
- This ensures the actual turn summary is preserved after dehydration,
  not just the timing footer

New /dump command:
- Added /dump command to dump entire context window to tmp/ for debugging
- Shows message index, role, kind, content length, and full content
- Available in both console and machine modes

UTF-8 safety:
- Fixed truncate_to_word_boundary() to use character indices instead of
  byte indices, preventing panics on multi-byte UTF-8 characters
- Added UTF-8 string slicing guidance to AGENTS.md

Agent: g3
2026-01-12 05:13:02 +05:30
Dhanji R. Prasanna
83c9b5d434 Add integration blackbox tests for g3-core
Adds 18 new integration tests covering:

- Background process lifecycle (start, check running, kill, list)
- Unified diff edge cases (multi-hunk, additions-only, deletions-only,
  CRLF normalization, range constraints, error handling)
- Error classification boundaries (rate limit, server error, timeout,
  network error, context length exceeded, model busy, non-recoverable)

These tests follow blackbox/integration-first principles:
- Test through stable public interfaces
- Do not encode internal implementation details
- Focus on observable behavior
- Enable refactoring without test breakage

Agent: hopper
2026-01-11 16:32:59 +05:30
Dhanji R. Prasanna
e731bc8217 Make remember tool instructions more imperative in system prompts
- Change 'call remember' to 'you MUST call remember' in native prompt
- Change 'IF you discovered' to 'ALWAYS...when you discovered'
- Add explicit list of trigger tools (code_search, rg, grep, find, read_file)
- Add reminder to Response Guidelines section
- Add remember tool and Project Memory section to non-native prompt
- Remove redundant console output from remember tool
- Fix test compilation errors (missing summary parameter, temporary borrow)
2026-01-11 06:49:45 +08:00
Dhanji R. Prasanna
0aa1287ca6 Remove final_output tool and improve scout report handback
final_output removal:
- Remove final_output from tool definitions and dispatch
- Update system prompts to request summaries as regular text
- Remove final_output_called field from StreamingState
- Update auto_continue tests to remove final_output_called parameter
- Remove final_output test from tool_execution_test.rs
- Update planner and flock prompts to not reference final_output
- Keep backwards-compat code in feedback_extraction.rs and task_result.rs

Scout report handback:
- Change from file-based to delimiter-based report extraction
- Scout outputs report between ---SCOUT_REPORT_START/END--- markers
- Research tool extracts content between markers, strips ANSI codes
- Add comprehensive tests for extraction and ANSI stripping

657 tests pass.
2026-01-10 13:43:04 +11:00
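Delimiter-based report extraction, as described in commit 0aa1287ca6, might be sketched like this. The full marker strings are an assumption expanded from the abbreviated "---SCOUT_REPORT_START/END---" in the commit message, and the function name is hypothetical. ANSI stripping is left out of the sketch.

```rust
// Assumed expansions of the abbreviated marker names.
const REPORT_START: &str = "---SCOUT_REPORT_START---";
const REPORT_END: &str = "---SCOUT_REPORT_END---";

/// Return the trimmed text between the two markers, or None if either is missing.
fn extract_scout_report(output: &str) -> Option<&str> {
    let start = output.find(REPORT_START)? + REPORT_START.len();
    // Search for the end marker only after the start marker.
    let end = start + output[start..].find(REPORT_END)?;
    Some(output[start..end].trim())
}
```

Returning `None` when a marker is missing lets the caller distinguish "no report produced" from an empty report.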
Dhanji R. Prasanna
e301075666 Fix panic on multi-byte chars in filter_json buffer truncation
The buffer truncation code was slicing at a raw byte offset which could
land in the middle of a multi-byte character (like emojis), causing a
panic. Fixed by using char_indices() to find valid character boundaries.

Also added stop_reason field to CompletionChunk initializers in tests
to complete the stop_reason feature addition.

- Fix byte boundary panic in filter_json.rs line 327
- Add test for multi-byte character handling
- Update test files with missing stop_reason field
2026-01-09 15:20:57 +11:00
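The kind of fix described in commit e301075666 can be illustrated with a small helper that uses `char_indices()` to find a safe cut point. This is a generic sketch with a hypothetical function name, not the code from filter_json.rs; it keeps the head of the string, which may not match the direction of the real truncation.

```rust
/// Truncate `s` to at most `max_bytes` bytes without splitting a character.
fn truncate_at_char_boundary(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    // End offset of the last whole character that still fits in the budget.
    let end = s
        .char_indices()
        .map(|(start, ch)| start + ch.len_utf8())
        .take_while(|&char_end| char_end <= max_bytes)
        .last()
        .unwrap_or(0);
    &s[..end]
}
```

For the input `"ab🙂cd"` and a 4-byte budget, slicing with `&s[..4]` would panic because byte 4 falls inside the 4-byte emoji, whereas the helper stops after `"ab"`.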
Dhanji R. Prasanna
777191b3cb Remove final_output tool - let summaries stream naturally
- Remove final_output from tool definitions, dispatch, and misc tools
- Update system prompts to request summaries as regular markdown text
- Remove print_final_output from UiWriter trait and all implementations
- Remove final_output handling from agent core logic
- Rename final_output_summary → summary in session continuation
- Delete final_output test files
- Update tool count tests (12→11, 27→26)

This allows LLM summaries to stream through the markdown formatter
for a more natural, responsive user experience instead of buffering
everything into a tool call.
2026-01-09 14:57:24 +11:00
Dhanji R. Prasanna
67be0f20c7 fix: remove allow_multiple_tool_calls config and simplify tool execution flow
This fixes a bug where the agent would stop responding abruptly without
calling final_output. The root cause was the allow_multiple_tool_calls
config option (default: false) which caused the agent to break out of
the streaming loop mid-stream after executing the first tool, losing
any subsequent content.

Changes:
- Remove allow_multiple_tool_calls config option entirely
- Always process all tool calls without breaking mid-stream
- Simplify system prompt generation (no longer needs boolean param)
- Let the stream complete fully before continuing to next iteration
- Change find_last_tool_call_start to find_first_tool_call_start
- Remove parser.reset() call on duplicate detection

Benefits:
- Simpler logic with less conditional branching
- No lost content after tool calls
- Consistent behavior for all users
- Reduced config complexity
2026-01-09 13:28:07 +11:00
Dhanji R. Prasanna
5bfaee8dd5 use consistent naming for compaction
2026-01-08 12:54:03 +11:00
Dhanji R. Prasanna
5d20da2609 Add 54 integration tests for CLI, tools, and message serialization
New test files:
- crates/g3-cli/tests/cli_integration_test.rs (14 tests)
  Blackbox CLI tests: help/version flags, argument validation,
  conflicting modes, flock mode requirements

- crates/g3-core/tests/tool_execution_test.rs (20 tests)
  Tool call structure tests and unified diff application:
  read_file, write_file, str_replace, shell, background_process,
  todo, final_output, code_search, take_screenshot

- crates/g3-providers/tests/message_serialization_test.rs (20 tests)
  Round-trip serialization tests for Message, MessageRole,
  CacheControl, and Tool types. Covers Unicode, special chars,
  and edge cases.

All tests follow blackbox/integration-first principles with
documentation of what they protect and intentionally do not assert.
2026-01-07 09:23:34 +11:00
Dhanji R. Prasanna
f4a1bf5e93 fix agent-mode session resumption bug
2026-01-03 16:44:58 +11:00
Dhanji R. Prasanna
595ad6ad21 agent mode resumption
2026-01-03 14:50:08 +11:00
Dhanji R. Prasanna
016efc1db6 Prevent agent mode from stopping after first TODO phase
- Add TODO completion check to final_output tool in autonomous mode only
- When incomplete TODO items exist, reject final_output and prompt LLM to continue
- Non-autonomous modes (interactive, chat) are unaffected
- Add 6 tests verifying behavior in both autonomous and non-autonomous modes

Fixes issue where LLM would call final_output after completing first phase,
causing agent to stop prematurely instead of continuing with remaining phases.
2025-12-27 12:35:31 +11:00
Dhanji R. Prasanna
3601cc0547 Enhance read_image tool with magic byte detection and multi-image support
- Fix media type detection using magic bytes instead of file extension
  - Correctly identifies JPEG files with .png extension (and vice versa)
  - Supports PNG, JPEG, GIF, and WebP formats

- Add multi-image support with file_paths array parameter
  - Load multiple images in a single tool call
  - All images queued for LLM analysis

- Enhanced CLI output:
  - Inline image preview via iTerm2 imgcat protocol (height=5)
  - Dimmed info line showing: path | dimensions | media type | file size
  - Proper │ prefix alignment with tool output boxing
  - Human-readable file sizes (bytes, KB, MB)

- Add image dimension extraction from file headers
  - PNG, JPEG, GIF, WebP dimension parsing

- Add comprehensive tests for magic byte detection and dimensions
2025-12-26 11:19:37 +11:00
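Magic byte detection for the four formats named in commit 3601cc0547 can be sketched as follows. The function name and return type are assumptions; the signatures themselves are the standard file headers for PNG, JPEG, GIF, and WebP.

```rust
/// Guess the media type from the leading bytes of a file, ignoring its extension.
fn detect_media_type(bytes: &[u8]) -> Option<&'static str> {
    if bytes.starts_with(&[0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A]) {
        Some("image/png")
    } else if bytes.starts_with(&[0xFF, 0xD8, 0xFF]) {
        Some("image/jpeg")
    } else if bytes.starts_with(b"GIF87a") || bytes.starts_with(b"GIF89a") {
        Some("image/gif")
    } else if bytes.len() >= 12 && &bytes[..4] == b"RIFF" && &bytes[8..12] == b"WEBP" {
        // WebP is a RIFF container: "RIFF", 4-byte size, then "WEBP".
        Some("image/webp")
    } else {
        None
    }
}
```

Because only the content is inspected, a JPEG saved with a `.png` extension is still reported as `image/jpeg`.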
Dhanji R. Prasanna
d9c58576a1 feat: add background_process tool for launching long-running processes
Adds a new tool that allows launching processes (like game servers) in the
background while g3 continues to operate. The process runs independently
with stdout/stderr captured to a log file.

Features:
- Named process tracking for easy reference
- Automatic log capture to logs/background_processes/
- Returns PID and log file path for use with shell tool
- Automatic cleanup on agent shutdown via Drop trait

Usage: Use shell tool to interact with the process:
- Read logs: tail -100 <logfile>
- Check status: ps -p <pid>
- Stop process: kill <pid>

Files:
- New: crates/g3-core/src/background_process.rs
- New: crates/g3-core/tests/background_process_demo_test.rs
- Modified: crates/g3-core/src/lib.rs (tool definition + handler)
- Modified: crates/g3-core/src/prompts.rs (documentation)
2025-12-25 18:23:10 +11:00