Commit Graph

350 Commits

Author SHA1 Message Date
Dhanji R. Prasanna
4b7be3f9ee Increase research tool timeout to 20 minutes
The research tool often runs past 8 minutes due to web browsing and
analysis. Increased its timeout to 20 minutes while keeping other
tools at 8 minutes.

Changes:
- Tool timeout is now tool-specific (20 min for research, 8 min for others)
- Timeout error message now shows the correct duration for each tool
2026-01-19 21:51:08 +05:30
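The tool-specific timeout described above can be sketched as a simple lookup; the function name and constants here are illustrative, not g3's actual API:

```rust
use std::time::Duration;

// Hypothetical per-tool timeout lookup mirroring the commit's policy:
// 20 minutes for the research tool, 8 minutes for everything else.
fn tool_timeout(tool_name: &str) -> Duration {
    match tool_name {
        // Research involves web browsing and analysis, so it gets longer.
        "research" => Duration::from_secs(20 * 60),
        // All other tools keep the original 8-minute budget.
        _ => Duration::from_secs(8 * 60),
    }
}

fn main() {
    assert_eq!(tool_timeout("research").as_secs(), 1200);
    assert_eq!(tool_timeout("shell").as_secs(), 480);
    println!("timeouts ok");
}
```

The same lookup can feed the timeout error message, so each tool reports its own duration.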
Dhanji R. Prasanna
f4cce22db3 Add test documenting LLM duplicate text behavior
Adds test_llm_repeats_text_before_each_tool_call() which documents the
scenario where the LLM re-outputs the same preamble text before each
tool call in a multi-tool response.

Analysis showed this is LLM behavior, not a g3 bug:
- Each assistant message is correctly stored with different tool calls
- The duplicate display is the LLM choosing to repeat context
- Storage is correct, display accurately reflects LLM output

Decision: Accept as LLM behavior (Option B). Future LLM improvements
may resolve this naturally without g3 code changes.
2026-01-19 18:44:01 +05:30
Dhanji R. Prasanna
1604ed613a Add integration tests proving tool results are never parsed as tool calls
Adds 3 new tests to json_parsing_stress_test.rs:
- test_tool_result_with_json_not_parsed: Full agent integration test proving
  that JSON in tool results (sent TO the LLM) is never parsed by the
  streaming parser (which only sees LLM output)
- test_parser_only_processes_completion_chunks: Documents that StreamingToolParser
  only accepts CompletionChunk, not Message objects
- test_architectural_separation_documented: Documents the data flow showing
  tool results flow TO the LLM while the parser only sees FROM the LLM

This proves the architectural guarantee: there is no code path where
tool result content could be parsed as a tool call, because:
1. Tool results are Message objects added to context_window
2. The streaming parser only processes CompletionChunk from provider.stream_completion()
3. These are completely separate data types flowing in opposite directions

Total: 41 JSON parsing stress tests now pass.
2026-01-19 16:21:36 +05:30
Dhanji R. Prasanna
2043a83e7d Add comprehensive MockProvider integration tests
Added 6 new integration tests for stream_completion_with_tools:
- test_text_before_tool_call_preserved: text before native tool call is saved
- test_native_tool_call_execution: native tool calls execute correctly
- test_duplicate_tool_calls_skipped: sequential duplicates are detected
- test_json_fallback_tool_calling: JSON tool calls work without native support
- test_text_after_tool_execution_preserved: follow-up text is saved
- test_multiple_tool_calls_executed: multiple tool calls in sequence work

Also added MockResponse helper methods:
- text_then_native_tool(): text followed by native tool call
- duplicate_native_tool_calls(): same tool call twice (for dedup testing)

Fixed text_with_json_tool() to ensure "tool" key comes before "args"
(serde_json alphabetizes keys, breaking pattern detection).

Total: 18 integration tests covering historical bugs and core behaviors.
2026-01-19 14:44:30 +05:30
Dhanji R. Prasanna
5caa101b84 Fix inline JSON being incorrectly detected as tool call
The bug was caused by mark_tool_calls_consumed() being called after
displaying each chunk, which advanced last_consumed_position to the
end of the current buffer. When the next chunk arrived with JSON,
the unchecked_buffer started at position 0 of the slice, causing
is_on_own_line() to return true (position 0 is always "on its own line").

Removed the problematic mark_tool_calls_consumed() call from the
"no tool executed" branch. The remaining call after actual tool
execution is correct and necessary.

Added integration test that verifies inline JSON in prose is not
detected as a tool call.
2026-01-19 14:35:01 +05:30
Dhanji R. Prasanna
292a3aa48d Add MockProvider for integration testing
Adds a configurable mock LLM provider that can simulate various behaviors:
- Text-only responses (single or multi-chunk streaming)
- Native tool calls
- JSON tool calls in text
- Truncated responses (max_tokens)
- Multi-turn conversations

Features:
- Builder pattern for easy test setup
- Request tracking for verification
- Preset scenarios for common patterns
- Full LLMProvider trait implementation

Also adds integration tests that use MockProvider to test the
stream_completion_with_tools code path, including:
- test_butler_bug_scenario: reproduces the exact bug where text-only
  responses were not saved to context, causing consecutive user messages

This enables testing complex streaming behaviors without real API calls.
2026-01-19 13:59:31 +05:30
Dhanji R. Prasanna
349230d0b7 Fix missing assistant messages in context window
Bug: When the LLM responded with text-only (no tool calls), the assistant
message was sometimes not saved to the context window. This caused consecutive
user messages where the LLM would lose track of previous responses.

Root causes found and fixed:

1. Early return path (line ~2535): When stream finishes with no tools executed
   in previous iterations (any_tool_executed=false), the code returned early
   without saving the assistant message. Fixed by adding save before return.

2. Post-loop path (line ~2657): When raw_clean was empty but current_response
   had content, no message was saved. Fixed by falling back to current_response.

Both paths now properly save the assistant message before returning.
The assistant_message_added flag prevents any duplication.

Added tests:
- missing_assistant_message_test.rs: verifies the fallback logic
- assistant_message_dedup_test.rs: verifies no duplicate messages
- consecutive_assistant_message_test.rs: verifies alternation invariant
2026-01-19 13:50:28 +05:30
Dhanji R. Prasanna
02655110d6 fix: auto-resize images exceeding 1568px dimension to prevent 413 Payload Too Large
The Anthropic API was rejecting requests with multiple high-resolution images
(~2000x3000 pixels each) even though individual file sizes were under limits.

Root cause: Code only checked per-image file size (3.75MB), not dimensions.
Claude recommends images ≤1568px on longest edge and has 32MB total request limit.

Changes:
- Add MAX_IMAGE_DIMENSION (1568px) and MAX_TOTAL_IMAGE_PAYLOAD (20MB) constants
- Trigger resize when dimensions > 1568px (not just file size > 3.75MB)
- Add new resize_image_to_dimensions() for dimension-constrained resizing
- Track cumulative payload size across multiple images
- Warn if total payload exceeds recommended limit

Test results with Walking Dead comic images:
- WD_0001_0001.jpg: 800KB 1987x3057 → 321KB 1019x1568
- WD_0001_1064.png: 150KB 1988x3057 → 143KB 1020x1568
- WD_0002_0001.jpg: 1023KB 1988x3056 → 292KB 1020x1568
- Total payload: ~2.5MB → ~1MB base64
2026-01-18 10:05:45 +05:30
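The dimension math behind the resize can be sketched as follows; `fit_dimensions` is a hypothetical helper (the real `resize_image_to_dimensions()` also re-encodes the image):

```rust
const MAX_IMAGE_DIMENSION: u32 = 1568;

// Scale so the longest edge fits MAX_IMAGE_DIMENSION while
// preserving aspect ratio; images already within bounds pass through.
fn fit_dimensions(width: u32, height: u32) -> (u32, u32) {
    let longest = width.max(height);
    if longest <= MAX_IMAGE_DIMENSION {
        return (width, height);
    }
    let scale = MAX_IMAGE_DIMENSION as f64 / longest as f64;
    (
        (width as f64 * scale).round() as u32,
        (height as f64 * scale).round() as u32,
    )
}

fn main() {
    // 1987x3057 from the test results scales to roughly 1019x1568.
    let (w, h) = fit_dimensions(1987, 3057);
    println!("{}x{}", w, h);
    assert_eq!((w, h), (1019, 1568));
}
```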
Dhanji R. Prasanna
3a03ed0585 Fix imgcat aspect ratio by adding preserveAspectRatio=1
Images were being displayed as narrow vertical strips because
iTerm2 wasn't preserving aspect ratio when only height was specified.
2026-01-17 18:50:00 +05:30
Dhanji R. Prasanna
d600b600b8 Always keep chromedriver running for faster subsequent startups
Removed the persistent_chrome config flag - chromedriver is now always
kept running after webdriver_quit. This eliminates startup latency for
subsequent WebDriver sessions.

Safaridriver is still killed on quit since it doesn't benefit from
persistence in the same way.

Updated quit message to correctly indicate chromedriver remains running.
2026-01-17 09:48:10 +05:30
Dhanji R. Prasanna
8ed360024f Add persistent ChromeDriver support for faster WebDriver startup
When webdriver_start is called, now checks if chromedriver is already
running on the configured port and reuses it instead of spawning a new
process. This significantly reduces startup time for subsequent sessions.

New config option:
  [webdriver]
  persistent_chrome = true  # Keep chromedriver running between sessions

When enabled, webdriver_quit closes the browser session but leaves
chromedriver running for reuse by the next session.
2026-01-17 09:26:25 +05:30
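The reuse check can be approximated with a plain TCP probe; this is a sketch, and the real g3 check may additionally hit chromedriver's status endpoint to confirm what is listening:

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

// If something is already listening on the configured port, reuse it
// instead of spawning a new chromedriver process.
fn port_in_use(port: u16) -> bool {
    let addr: SocketAddr = ([127, 0, 0, 1], port).into();
    TcpStream::connect_timeout(&addr, Duration::from_millis(200)).is_ok()
}

fn main() {
    // Bind an ephemeral listener just to demonstrate the probe.
    let listener = std::net::TcpListener::bind("127.0.0.1:0").unwrap();
    let port = listener.local_addr().unwrap().port();
    assert!(port_in_use(port));
    println!("port {} detected as in use", port);
}
```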
Dhanji R. Prasanna
b8193bf9f9 style: use orange color for [no changes] status in thinning message 2026-01-17 04:53:42 +05:30
Dhanji R. Prasanna
74b1b9bea3 refactor: simplify context thinning status message
Change format from verbose emoji-based message to cleaner status line:
  Before: 🥒 Context thinned at 70%: 7 tool results, ~33839 chars saved
  After:  g3: thinning context ... 70% -> 40% ... [done]

The new format shows before/after percentages and uses bold green for
'g3:' and '[done]' to match other status messages.

Also removes unused emoji() and label() methods from ThinScope.
2026-01-17 04:47:16 +05:30
Dhanji R. Prasanna
c7984fd4c2 fix: account for base64 encoding overhead in image size limit
The Anthropic API has a 5MB limit on base64-encoded images, not raw file
size. Base64 encoding increases size by ~33% (4/3 ratio), so a 4MB raw
image becomes ~5.3MB encoded, exceeding the limit.

Changed MAX_IMAGE_SIZE from 5MB to ~3.75MB (5MB * 3/4) to trigger
resizing before the base64-encoded result exceeds the API limit.

Also updated target resize size to 3.6MB to leave margin.
2026-01-16 21:29:05 +05:30
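The overhead arithmetic behind the new limit: base64 encodes every 3 raw bytes as 4 output characters, so the encoded size is ceil(n / 3) * 4. A minimal sketch (helper name is illustrative):

```rust
// Encoded length of n raw bytes under base64 with padding.
fn base64_len(raw: usize) -> usize {
    (raw + 2) / 3 * 4
}

fn main() {
    let five_mb = 5 * 1024 * 1024;
    let four_mb = 4 * 1024 * 1024;
    // A 4 MB raw image blows past the 5 MB encoded limit...
    assert!(base64_len(four_mb) > five_mb);
    // ...while 3.75 MB (5 MB * 3/4) encodes to exactly 5 MB.
    assert!(base64_len(five_mb * 3 / 4) <= five_mb);
    println!("3.75 MB raw -> {} bytes encoded", base64_len(five_mb * 3 / 4));
}
```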
Dhanji R. Prasanna
1003386f7f Auto-resize large images (>=5MB) in read_image tool
Images >= 5MB are now automatically resized to < 4.9MB using ImageMagick
before being sent to the LLM. This prevents API errors from oversized images.

- Uses iterative quality/scale reduction to find optimal size
- Converts to JPEG for better compression
- Shows original and resized size in terminal output (e.g., '6.2 MB → 4.1 MB (resized)')
- Falls back to original if ImageMagick fails or isn't available
2026-01-16 21:09:38 +05:30
Dhanji R. Prasanna
fc702168ab Add streaming completion integration test with mock LLM provider
Adds tests to verify that:
- All streaming chunks are processed before control returns to caller
- Both tool calls in a multi-tool-call stream are executed
- The finished signal properly terminates stream processing

Also adds Agent::new_for_test() to allow injecting mock providers.
2026-01-16 20:52:32 +05:30
Dhanji R. Prasanna
0e33465342 Add print_g3_progress/print_g3_status methods for consistent status messages 2026-01-16 20:28:24 +05:30
Dhanji R. Prasanna
95f89d3f8e Simplify compaction status messages 2026-01-16 20:26:35 +05:30
Dhanji R. Prasanna
7c59d1993c Fix auto-memory JSON leak: tool call printed raw to UI
The JSON filter only suppresses tool calls at line boundaries. When
"Memory checkpoint: " was printed without a trailing newline, the LLM
response `{"tool": "remember", ...}` appeared on the same line and
leaked through to the UI.

Fix:
- Add trailing newline to "Memory checkpoint:" message
- Reset JSON filter state before streaming the response

Added test: test_tool_call_not_at_line_start_passes_through
Documents the filter behavior and references the fix location.
2026-01-16 13:10:18 +05:30
Dhanji R. Prasanna
6bd9c51e8e feat: shell output pagination and optimized read_file with seek
- Shell outputs > 8KB are truncated to first 500 chars
- Full output saved to .g3/sessions/<session_id>/tools/shell_stdout_<id>.txt
- LLM can use read_file with start/end to paginate through large outputs
- read_file now uses seek() for O(1) random access instead of reading entire file
- UTF-8 safe: reads extra bytes at boundaries to find valid char positions
- Falls back to lossy conversion for binary files (no panics)

Files changed:
- paths.rs: get_tools_output_dir(), generate_short_id()
- shell.rs: truncate_large_output() integration
- file_ops.rs: seek-based read_file_range() helper
- New test: read_file_utf8_test.rs
2026-01-16 09:16:16 +05:30
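The UTF-8-safe boundary handling can be sketched like this: after seeking to an arbitrary byte offset, snap back to the nearest character boundary so slicing never lands inside a multi-byte character. The function name is illustrative, not g3's actual helper:

```rust
// Snap a byte index down to the nearest UTF-8 character boundary.
fn floor_char_boundary(s: &str, mut idx: usize) -> usize {
    if idx >= s.len() {
        return s.len();
    }
    // Continuation bytes of a UTF-8 sequence match the bit pattern 10xxxxxx.
    while idx > 0 && (s.as_bytes()[idx] & 0b1100_0000) == 0b1000_0000 {
        idx -= 1;
    }
    idx
}

fn main() {
    let text = "a─b"; // '─' (U+2500) occupies 3 bytes
    // Byte 2 is inside the box-drawing character; snap back to byte 1.
    assert_eq!(floor_char_boundary(text, 2), 1);
    assert_eq!(&text[..floor_char_boundary(text, 2)], "a");
    println!("safe slice: {:?}", &text[..floor_char_boundary(text, 2)]);
}
```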
Dhanji R. Prasanna
01cb4f6691 fix: use consistent max_tokens defaults across providers
- Fix aliasing issue where resolve_max_tokens() used fallback_default_max_tokens
  (8192) instead of provider-specific defaults
- Update fallback_default_max_tokens from 8192 to 32000
- Set provider-specific max_tokens defaults:
  - Anthropic: 32000
  - OpenAI: 32000 (was 16000)
  - Databricks: 32000 (was 50000, now matches Anthropic as passthru)
  - Embedded: 2048
- Context window lengths unchanged:
  - OpenAI: 400,000
  - Anthropic: 200,000
  - Databricks (Claude): 200,000

This fixes the 'LLM response was cut off due to max_tokens limit' error
in agent mode that occurred because 8192 was being used instead of 32000.
2026-01-16 07:05:57 +05:30
Dhanji R. Prasanna
a84fead03b refactor: improve readability of streaming parser and JSON filter
Agent: carmack

Changes:
- streaming_parser.rs: Unified find_first/last_tool_call_start into single
  find_tool_call_start with SearchDirection enum, reducing duplication.
  Simplified is_json_invalidated from 45 to 20 lines with clearer logic.
  Fixed redundant !escape_next check in find_complete_json_object_end.

- filter_json.rs: Simplified check_tool_pattern from 40 to 24 lines.
  Replaced repetitive prefix checks with loop over ["t", "to", "too", "tool"].
  Reduced trailing return statements with direct expression returns.

- ui_writer_impl.rs: Added ansi module for duration color constants.
  Simplified duration_color function by removing redundant comments.

- language_prompts.rs: Fixed test assertions to match actual prompt content
  ("obvious, readable Racket" instead of "RACKET-SPECIFIC GUIDANCE").

All 174+ tests pass. No behavior changes.
2026-01-15 13:49:29 +05:30
Dhanji R. Prasanna
0ae1a13cdb feat: real-time tool call streaming indicator with blinking UI
- Add ToolParsingHint enum (Detected/Active/Complete) for UI feedback
- New UiWriter methods: print_tool_streaming_hint(), print_tool_streaming_active()
- Refactor ConsoleUiWriter state to use atomics in ParsingHintState
- Add tool_call_streaming field to CompletionChunk for provider hints
- Anthropic provider sends streaming hints when tool name detected
- New streaming helpers: make_tool_streaming_hint(), make_tool_streaming_active()

Parser improvements:
- Add is_json_invalidated() to detect false positive tool patterns
- Fix tool result poisoning when file contents contain partial JSON
- Unescaped newlines in strings, or prose after the JSON, invalidate detection

User sees ' ● tool_name |' immediately when tool call starts streaming,
with blinking indicator while args are received.
2026-01-15 13:49:29 +05:30
Dhanji R. Prasanna
d68f059acf fix: detect invalidated JSON tool calls to prevent parser poisoning
When partial JSON tool call patterns appear in LLM output (e.g., from
quoting file content), the parser would incorrectly report them as
"incomplete tool calls", triggering auto-continue loops.

Fix: Added is_json_invalidated() to detect when partial JSON has been
invalidated by subsequent content that cannot be valid JSON:
- Unescaped newline inside a string (invalid JSON)
- Newline followed by prose text outside a string

The check is only applied to incomplete JSON - complete tool calls
with trailing text are still correctly detected.

Added 6 new tests covering:
- Tool results with partial JSON patterns
- LLM quoting file content inline vs on own line
- Comment prefixes (// # -- etc) with partial patterns
- Real incomplete tool calls (should still be detected)
2026-01-15 13:49:29 +05:30
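The core of the invalidation idea can be sketched as a scan for a raw newline inside a string literal, which can never be valid JSON; this is an approximation, and the real `is_json_invalidated()` also checks for prose after the JSON:

```rust
// Report a partial JSON buffer as invalidated if an unescaped
// newline appears inside a string literal.
fn has_unescaped_newline_in_string(partial: &str) -> bool {
    let mut in_string = false;
    let mut escaped = false;
    for c in partial.chars() {
        match c {
            _ if escaped => escaped = false,
            '\\' if in_string => escaped = true,
            '"' => in_string = !in_string,
            '\n' if in_string => return true,
            _ => {}
        }
    }
    false
}

fn main() {
    // A quoted file snippet with a literal newline inside a string.
    assert!(has_unescaped_newline_in_string("{\"tool\": \"wri\nte\""));
    // A genuinely incomplete tool call is not invalidated.
    assert!(!has_unescaped_newline_in_string("{\"tool\": \"shell\", \"args\":"));
    println!("invalidation check ok");
}
```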
Dhanji R. Prasanna
999ac6fe66 fix: prevent parser poisoning from inline tool-call JSON patterns
The streaming parser was incorrectly detecting tool call patterns that
appeared inline in prose (e.g., when explaining the format), causing
g3 to return control mid-task.

Fix: Modified find_first_tool_call_start() and find_last_tool_call_start()
to only recognize patterns that appear on their own line (at start of
buffer or after newline with only whitespace before the pattern).

Changes:
- Added is_on_own_line() helper to check line-boundary conditions
- Updated detection methods to skip inline patterns
- Removed sanitize_inline_tool_patterns() and LBRACE_HOMOGLYPH (no longer needed)
- Rewrote tests for new behavior
- Added streaming_repro tests that use process_chunk() to verify the exact bug scenario

28 tests covering: streaming repro, line boundaries, Unicode, code contexts, edge cases
2026-01-15 13:49:29 +05:30
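The line-boundary condition can be sketched as follows; this approximates the `is_on_own_line()` helper described above rather than reproducing g3's exact code:

```rust
// A tool-call pattern only counts if, from the start of the buffer or
// the last newline, everything before it is whitespace.
fn is_on_own_line(buffer: &str, pattern_start: usize) -> bool {
    let line_start = buffer[..pattern_start]
        .rfind('\n')
        .map(|i| i + 1)
        .unwrap_or(0);
    buffer[line_start..pattern_start]
        .chars()
        .all(char::is_whitespace)
}

fn main() {
    let inline = r#"the format looks like {"tool": "shell"}"#;
    let own_line = "I will run it:\n{\"tool\": \"shell\"}";
    assert!(!is_on_own_line(inline, inline.find('{').unwrap()));
    assert!(is_on_own_line(own_line, own_line.find('{').unwrap()));
    println!("line-boundary check ok");
}
```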
Dhanji R. Prasanna
f4562cd4c9 config: default agent settings and provider override 2026-01-14 20:14:33 +05:30
Dhanji R. Prasanna
38828c7757 Clean up tool output formatting
- Shell: " Command executed successfully" → " ran successfully"
- Write file: Remove ✏️ emoji, use plain "wrote N lines | M chars"
2026-01-14 19:42:54 +05:30
Dhanji R. Prasanna
9ef064a041 Add guidance to shell tool description to avoid unnecessary cd prefixes
LLMs were prefixing shell commands with `cd <workspace> &&` unnecessarily,
wasting tokens and cluttering CLI display. Added clear guidance in the
shell tool description that commands already execute in the working directory.
2026-01-14 19:00:53 +05:30
Dhanji R. Prasanna
5104bd53b6 refactor(g3-core): improve stream_completion_with_tools readability
Extract and simplify the streaming completion function:

- Extract ensure_context_capacity() helper for pre-loop context management
  (thinning + compaction logic now in dedicated async method)
- Simplify compact_summary generation block: flatten nested if/match,
  remove redundant comments, reorder branches for clarity
- Remove dead code: unused _last_error variable and modified_tool_call
- Streamline duplicate detection block: reduce verbose logging
- Clean up text content display block: remove redundant comments,
  tighten variable declarations
- Remove redundant is_todo_tool redefinition inside block expression

Net reduction: 79 lines (-187/+108)
Behavior unchanged, all unit tests pass.

Agent: carmack
2026-01-14 15:11:53 +05:30
Dhanji R. Prasanna
dea0e6b1ca Compact tool output improvements
- Rename take_screenshot -> screenshot, code_coverage -> coverage (shorter names)
- Align | character across all compact tools (pad to 11 chars for str_replace)
- Make code_search a compact tool with summary display
- Show language and search name in code_search output (e.g., rust:"find structs")
- Add format_code_search_summary() to extract match/file counts from JSON response
2026-01-14 08:12:50 +05:30
Dhanji R. Prasanna
7d17b436f9 refactor(g3-core): remove 3 unused Agent constructor variants
Remove dead code - constructor variants that had no callers:
- new_with_readme()
- new_autonomous_with_readme()
- new_with_quiet()

These were thin wrappers around new_with_mode_and_readme() that were
never used externally. All 5 remaining constructors have verified callers.

Results:
- lib.rs reduced from 2817 to 2797 lines (-20 lines)
- Eliminated code-path aliasing: 8 constructors → 5 constructors
- All g3-core tests pass
- Full workspace compiles cleanly

Agent: fowler
2026-01-14 04:26:42 +05:30
Dhanji R. Prasanna
a1dfd9c0b6 Enhanced auto-memory with rich few-shot format
- Updated memory reminder prompt with per-symbol char ranges
- Added two few-shot examples: Session Continuation (feature) + UTF-8 Safe Slicing (pattern)
- Updated system prompt Memory Format section to match
- Format: file -> nested symbols with [start..end] ranges and descriptions
- Enables direct read_file navigation to specific functions
2026-01-13 21:49:48 +05:30
Dhanji R. Prasanna
3a47ebe668 better racket example support 2026-01-13 21:16:14 +05:30
Dhanji R. Prasanna
151b8c4658 Add Racket tree-sitter support, remove Kotlin
- Add tree-sitter-racket dependency (v0.24)
- Initialize Racket parser in code search
- Add .rkt, .rktl, .rktd file extensions
- Add test_racket_search test
- Remove Kotlin from supported languages (was disabled)
- Clean up duplicate test files

Supported languages: Rust, Python, JavaScript, TypeScript, Go, Java, C, C++, Racket
2026-01-13 18:44:59 +05:30
Dhanji R. Prasanna
5e45e110e2 refactor(g3-core): extract finalize_streaming_turn() to unify return paths
Extract a single canonical helper function for completing streaming turns,
eliminating 3 nearly-identical return paths in stream_completion_with_tools().

Changes:
- Add finalize_streaming_turn() helper that handles:
  - Finishing streaming markdown
  - Saving context window
  - Adding timing footer (when requested)
  - Dehydrating context (when ACD enabled)
  - Building TaskResult
- Replace 3 duplicated return blocks with calls to the helper
- Remove unused mut on full_response variable

Results:
- Function reduced from 1067 to 999 lines (-68 lines)
- Eliminated code-path aliasing: 3 paths → 1 canonical path
- All 32 characterization tests pass
- Full g3-core test suite passes

Agent: fowler
2026-01-13 16:52:48 +05:30
Dhanji R. Prasanna
b89d55a9ff Add characterization tests for stream_completion_with_tools
Add 32 blackbox characterization tests to lock down the behavior of the
stream_completion_with_tools function (1067 lines) before refactoring.

Tests cover key behaviors through stable boundaries:
- StreamingToolParser: tool call detection, incomplete detection, text accumulation
- Auto-continue logic: autonomous mode decisions, priority ordering
- Duplicate detection: sequential duplicates, cross-message duplicates
- Context window: token tracking, compaction threshold, history preservation
- Tool execution: read_file, shell, write_file, todo tools through Agent
- Streaming utilities: LLM token cleaning, duration formatting, truncation
- Parser sanitization: inline tool pattern handling, homoglyph replacement

These tests intentionally do NOT assert:
- Internal parser state or implementation details
- Specific timing values
- UI output formatting
- Provider-specific behavior

Agent: hopper
2026-01-13 16:25:33 +05:30
Dhanji R. Prasanna
47e3a88cf6 refactor(g3-core): extract stats formatting to dedicated module
Extract the get_stats() function (158 lines) from lib.rs to a new stats.rs module.

Changes:
- Create stats.rs with AgentStatsSnapshot struct for capturing agent state
- Replace inline formatting logic with delegation to snapshot.format()
- Add unit tests for stats formatting (empty and populated states)
- Reduce lib.rs from 2961 to 2818 lines (-143 lines)

The new module improves:
- Testability: Stats formatting can now be unit tested in isolation
- Separation of concerns: Formatting logic is decoupled from Agent struct
- Readability: lib.rs is more focused on core agent behavior

All 271 workspace tests pass.

Agent: fowler
2026-01-13 16:11:53 +05:30
Dhanji R. Prasanna
82c0165765 Fix unused variable warning and UTF-8 panic in string slicing
- Remove unused total_lines variable in file_ops.rs
- Fix UTF-8 boundary panic in utils.rs when generating diff error preview
  The code was slicing at byte index 200 which could land inside a
  multi-byte character (e.g., box-drawing chars like ─). Now uses
  character-based slicing with chars().take() instead.
2026-01-13 14:52:52 +05:30
Dhanji R. Prasanna
118935d2da Remove unused variable total_lines in file_ops.rs 2026-01-13 14:25:17 +05:30
Dhanji R. Prasanna
a09967eb27 refactor(streaming): Extract deduplication and auto-continue logic into helpers
Improve readability of stream_completion_with_tools (~1000 line function):

- Add deduplicate_tool_calls() helper with closure for previous-message check
- Add should_auto_continue() with AutoContinueReason enum for clearer control flow
- Replace inline deduplication loop with helper call (-19 lines)
- Replace complex auto-continue conditional with match on reason enum (-13 lines)
- Add section comments for major phases (State Init, Pre-loop, Main Loop, Auto-Continue, Post-Loop)
- Add comprehensive tests for new helpers

Net reduction: 82 deletions, behavior unchanged (172+ tests pass)

Agent: carmack
2026-01-13 11:44:06 +05:30
Dhanji R. Prasanna
dc45987e8d Add characterization tests for UTF-8 truncation and parser sanitization
Agent: hopper

Adds 32 new integration tests covering recent commits:

## UTF-8 Safe Truncation Tests (14 tests)
Covers commit f30f145 (Fix UTF-8 panics):
- Topic extraction with emoji, CJK, and multi-byte characters
- Truncation at character boundaries (not byte boundaries)
- Edge cases: exactly 50 chars, 51 chars, 2-byte/3-byte/4-byte UTF-8
- Stub generation with multi-byte topics
- Combining characters and diacritics

## Parser Sanitization Tests (18 tests)
Covers commit 4c36cc0 (Prevent parser poisoning):
- Code block contexts (inline code, after fences, prose)
- Line boundary edge cases (empty lines, whitespace, indentation)
- Unicode handling (emoji, bullets, CJK before patterns)
- Multiple patterns on same line
- Negative cases (similar but different patterns, partial patterns)
- Real-world scenarios from the original bug report

All tests are blackbox/characterization style - they test observable
outputs through stable public interfaces without encoding internal
implementation details.
2026-01-13 11:22:46 +05:30
Dhanji R. Prasanna
8dcb7a3dba feat: add compact styled output for TODO tools
TODO tools (todo_read, todo_write) now display with a cleaner, more
compact format:

- Styled header: " ● todo_read" or " ● todo_write"
- Tree-style prefixes for content lines (│ and └)
- Checkbox conversion: "- [ ]" → □, "- [x]" → ■
- Dimmed content for visual distinction
- No timing footer (cleaner output)

Changes:
- Add print_todo_compact() method to UiWriter trait
- Implement print_todo_compact() in ConsoleUiWriter
- Update todo.rs to call print_todo_compact() instead of line-by-line output
- Skip tool header, output header, and timing for TODO tools in agent streaming
2026-01-13 10:58:55 +05:30
Dhanji R. Prasanna
4c36cc058c fix: prevent parser poisoning from inline tool-call JSON patterns
When the streaming parser encountered fragments of JSON that looked like
partial tool calls (e.g., {"tool":) embedded in inline text (like code
examples or prose), it would incorrectly enter JSON parsing mode and
poison the parser state, causing control to be returned to the user
mid-task.

This fix:
- Adds sanitize_inline_tool_patterns() to detect tool-call patterns that
  are NOT on their own line and replace the opening brace with a Unicode
  homoglyph (fullwidth left curly bracket U+FF5B)
- Integrates sanitization into process_chunk() before text is buffered
- Updates system prompts to instruct LLMs to use homoglyphs when showing
  example tool call JSON in prose
- Adds comprehensive tests for the sanitization logic

Real tool calls from LLMs always appear on their own line, so those are
left untouched. Only inline patterns (with non-whitespace before them)
are sanitized.
2026-01-13 10:58:41 +05:30
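The homoglyph substitution can be sketched on a single line of text; the real `sanitize_inline_tool_patterns()` operates on streamed chunks, so treat this as a simplified illustration:

```rust
// Replace the opening brace of an inline {"tool": pattern with the
// fullwidth left curly bracket (U+FF5B) so the streaming parser never
// treats it as a real call.
const LBRACE_HOMOGLYPH: char = '\u{FF5B}'; // ｛

fn sanitize_inline(line: &str) -> String {
    match line.find("{\"tool\"") {
        // Only inline occurrences (non-whitespace before the brace).
        Some(idx) if !line[..idx].trim().is_empty() => {
            let mut out = line.to_string();
            out.replace_range(idx..idx + 1, &LBRACE_HOMOGLYPH.to_string());
            out
        }
        _ => line.to_string(),
    }
}

fn main() {
    let prose = r#"calls look like {"tool": "shell"}"#;
    assert!(sanitize_inline(prose).contains('\u{FF5B}'));
    // A pattern on its own line is a real tool call; leave it alone.
    let real = r#"{"tool": "shell"}"#;
    assert_eq!(sanitize_inline(real), real);
    println!("{}", sanitize_inline(prose));
}
```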
Dhanji R. Prasanna
a0b9126555 Revert "refactor(g3-core): extract streaming logic to agent_streaming.rs"
This reverts commit a2e51cf075.
2026-01-13 07:59:18 +05:30
Dhanji R. Prasanna
6907fa36c0 UI: Add newline before auto-memory skip message 2026-01-13 07:03:42 +05:30
Dhanji R. Prasanna
a2e51cf075 refactor(g3-core): extract streaming logic to agent_streaming.rs
Reduce lib.rs complexity by extracting the streaming completion logic:

- Extract stream_completion_with_tools (~1080 lines) to agent_streaming.rs
- Extract stream_with_retry helper method
- Extract parse_diff_stats helper function
- Add handle_pre_stream_compaction helper for cleaner pre-stream logic
- Add format_tool_output helper for tool output formatting
- Remove 3 unused constructor variants:
  - new_with_readme
  - new_autonomous_with_readme
  - new_with_quiet

Results:
- lib.rs reduced from 2974 to 1791 lines (40% reduction)
- Streaming logic cleanly separated into dedicated module
- All tests pass, no behavior changes

Agent: fowler
2026-01-13 06:14:56 +05:30
Dhanji R. Prasanna
f30f145c85 Fix UTF-8 panics and inconsistent retry logic
- Fix 7 UTF-8 byte slicing panics that crash on multi-byte characters:
  - acd.rs: extract_topic_from_text() [..50] slice
  - streaming.rs: log_stream_error() [..500] slice
  - tools/acd.rs: rehydrate message truncation [..2000] slice
  - history.rs: git commit message truncation [..69] slice
  - planner.rs: commit summary/description truncation [..69] slices
  - llm.rs: requirements summary line truncation [..117] slice

- All now use chars().count() and chars().take(N).collect() for
  UTF-8 safe truncation

- Fix inconsistent retry logic in task_execution.rs:
  - Previously only retried on Timeout errors
  - Now retries on ALL recoverable errors (rate limits, network,
    server errors, model busy, token limits, context length)
  - Added error-specific base delays (rate limit: 5s, server: 2s, etc.)
  - Added exponential backoff with ±20% jitter
  - Consistent with autonomous mode retry behavior
2026-01-13 05:49:45 +05:30
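The retry delay policy can be sketched as below. The base delays mirror the examples in the commit message; the function name is hypothetical and the clock-derived jitter is a stand-in for a real RNG:

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Error-specific base delay, doubled each attempt, with +/-20% jitter.
fn retry_delay(error_kind: &str, attempt: u32) -> Duration {
    let base_ms: u64 = match error_kind {
        "rate_limit" => 5_000,
        "server" => 2_000,
        _ => 1_000,
    };
    // Exponential backoff, capped at 2^6 times the base.
    let backoff = base_ms.saturating_mul(1 << attempt.min(6));
    // Cheap jitter in [-20%, +20%] derived from the clock.
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .subsec_nanos() as u64;
    let span = 2 * backoff / 5; // 40% of backoff
    let offset = (nanos % (span + 1)) as i64 - (backoff / 5) as i64;
    Duration::from_millis((backoff as i64 + offset) as u64)
}

fn main() {
    for attempt in 0..3 {
        println!("attempt {}: wait {:?}", attempt, retry_delay("rate_limit", attempt));
    }
}
```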
Dhanji R. Prasanna
6f50d01ab6 Add comprehensive end-of-turn behavior tests for g3-core
Agent: hopper

Adds 56 new integration tests covering the observable end-of-turn
behaviors in the streaming module:

- Timing footer formatting (5 tests): verifies user-facing timing display
  with various durations, token counts, and context percentages

- Tool call duplicate detection (6 tests): ensures identical sequential
  tool calls are detected while different tools/args are not

- Empty response detection (9 tests): validates detection of empty,
  whitespace-only, and timing-only responses that trigger auto-continue

- Connection error classification (5 tests): verifies EOF, connection,
  chunk, and body errors are correctly identified for graceful recovery

- Tool output summary formatting (17 tests): covers read_file, write_file,
  str_replace, remember, screenshot, coverage, and rehydrate summaries

- Duration formatting (4 tests): milliseconds, seconds, minutes, zero

- Text truncation (4 tests): short/long strings, multiline, flag behavior

- LLM token cleaning (3 tests): removal of stop tokens like <|im_end|>

- Edge cases (4 tests): empty inputs, unicode handling, large numbers

All tests are blackbox/characterization style - they test observable
outputs through stable public interfaces without encoding internal
implementation details. Tests remain stable under refactoring that
preserves behavior.
2026-01-12 21:17:32 +05:30
Dhanji R. Prasanna
d164c97ad2 Fix multi-line error messages in compact tool output
The truncate_for_display() function now takes only the first line
of input before truncating. This prevents multi-line error messages
(like str_replace failures) from breaking the compact single-line
format.

Added tests for multi-line input handling.
2026-01-12 20:55:05 +05:30
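The first-line fix can be sketched as follows; the signature and the character budget are illustrative, not g3's exact `truncate_for_display()`:

```rust
// Take only the first line of the input, then truncate by characters
// (not bytes) for the compact single-line display.
fn truncate_for_display(s: &str, max_chars: usize) -> String {
    let first_line = s.lines().next().unwrap_or("");
    if first_line.chars().count() <= max_chars {
        first_line.to_string()
    } else {
        let truncated: String = first_line.chars().take(max_chars).collect();
        format!("{}...", truncated)
    }
}

fn main() {
    // A multi-line str_replace failure collapses to its first line.
    let err = "str_replace failed:\nexpected:\n  foo\nfound:\n  bar";
    assert_eq!(truncate_for_display(err, 80), "str_replace failed:");
    println!("{}", truncate_for_display(err, 80));
}
```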
Dhanji R. Prasanna
1b051aad94 Fix write_file compact summary to show actual line/char counts
The write_file compact display was showing 1 line because it was
counting lines in the success message, not the actual written content.

Now parses the tool result (e.g. ' wrote 150 lines | 4.2k chars')
to extract and display the correct counts.

Added format_write_file_result() to parse the tool output.
2026-01-12 20:32:54 +05:30