Commit Graph

819 Commits

Dhanji R. Prasanna
7bd72a4a51 Add tests for tool-specific timeout durations
Adds 8 unit tests verifying:
- Research tool has 20-minute timeout
- All other tools (shell, read_file, write_file, str_replace, code_search,
  webdriver_*, etc.) have standard 8-minute timeout
- Comprehensive test_only_research_has_extended_timeout covers 19 tools

This ensures future changes don't accidentally affect other tool timeouts.
2026-01-19 21:58:16 +05:30
Dhanji R. Prasanna
4b7be3f9ee Increase research tool timeout to 20 minutes
The research tool often runs past 8 minutes due to web browsing and
analysis. Increased its timeout to 20 minutes while keeping other
tools at 8 minutes.

Changes:
- Tool timeout is now tool-specific (20 min for research, 8 min for others)
- Timeout error message now shows the correct duration for each tool
2026-01-19 21:51:08 +05:30
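The tool-specific lookup described in the commit above can be sketched as a single match. Function and tool names mirror the commit text, but this is an illustrative sketch, not the actual implementation:

```rust
use std::time::Duration;

/// Sketch of per-tool timeout selection: the research tool gets an
/// extended 20-minute budget, every other tool keeps the standard 8.
fn timeout_for_tool(tool_name: &str) -> Duration {
    match tool_name {
        // Research browses the web and analyzes results, so it can
        // legitimately run long.
        "research" => Duration::from_secs(20 * 60),
        // shell, read_file, write_file, str_replace, code_search,
        // webdriver_*, etc. all fall through to the default.
        _ => Duration::from_secs(8 * 60),
    }
}
```

A match with a catch-all arm also makes the invariant tested in the follow-up commit easy to see: adding a new tool cannot accidentally change any other tool's timeout.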
Dhanji R. Prasanna
f4cce22db3 Add test documenting LLM duplicate text behavior
Adds test_llm_repeats_text_before_each_tool_call() which documents the
scenario where the LLM re-outputs the same preamble text before each
tool call in a multi-tool response.

Analysis showed this is LLM behavior, not a g3 bug:
- Each assistant message is correctly stored with different tool calls
- The duplicate display is the LLM choosing to repeat context
- Storage is correct, display accurately reflects LLM output

Decision: Accept as LLM behavior (Option B). Future LLM improvements
may resolve this naturally without g3 code changes.
2026-01-19 18:44:01 +05:30
Dhanji R. Prasanna
6ff21a7d47 Fix JSON filter to preserve code fence and indented content
Two cosmetic bugs fixed:
1. JSON inside code fences was being filtered - now tracks fence state
   and passes through all content inside ``` ... ``` blocks
2. Indented JSON was being filtered - now recognizes that real tool
   calls are never indented, so indented JSON is always documentation

Changes:
- Added in_code_fence and fence_buffer fields to FilterState
- Added track_code_fence() to detect ``` markers (with/without language)
- Added pass_through_char() for content inside code fences
- Modified '{' handling to only filter when no leading whitespace
- Added 4 new unit tests for code fence and indentation cases
- Updated 3 stress tests to expect new (correct) behavior

All 16 filter_json unit tests and 59 stress tests pass.
2026-01-19 17:00:43 +05:30
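The fence-state logic from the commit above can be sketched line-by-line. The real filter is streaming and character-based; `FilterState` and `in_code_fence` mirror names in the commit, but the shape here is illustrative:

```rust
/// Minimal line-based sketch of the fence-aware JSON filter.
struct FilterState {
    in_code_fence: bool,
}

impl FilterState {
    fn new() -> Self {
        FilterState { in_code_fence: false }
    }

    /// Returns true if the line should be shown to the user.
    fn keep_line(&mut self, line: &str) -> bool {
        if line.trim_start().starts_with("```") {
            // ``` markers (with or without a language tag) toggle fence state.
            self.in_code_fence = !self.in_code_fence;
            return true;
        }
        if self.in_code_fence {
            // Everything inside a code fence passes through untouched.
            return true;
        }
        // Real tool calls are never indented: only filter a '{' that
        // starts at column 0; indented JSON is documentation.
        !line.starts_with('{')
    }
}
```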
Dhanji R. Prasanna
1604ed613a Add integration tests proving tool results are never parsed as tool calls
Adds 3 new tests to json_parsing_stress_test.rs:
- test_tool_result_with_json_not_parsed: Full agent integration test proving
  that JSON in tool results (sent TO the LLM) is never parsed by the
  streaming parser (which only sees LLM output)
- test_parser_only_processes_completion_chunks: Documents that StreamingToolParser
  only accepts CompletionChunk, not Message objects
- test_architectural_separation_documented: Documents the data flow showing
  tool results flow TO the LLM while the parser only sees FROM the LLM

This proves the architectural guarantee: there is no code path where
tool result content could be parsed as a tool call, because:
1. Tool results are Message objects added to context_window
2. The streaming parser only processes CompletionChunk from provider.stream_completion()
3. These are completely separate data types flowing in opposite directions

Total: 41 JSON parsing stress tests now pass.
2026-01-19 16:21:36 +05:30
Dhanji R. Prasanna
2043a83e7d Add comprehensive MockProvider integration tests
Added 6 new integration tests for stream_completion_with_tools:
- test_text_before_tool_call_preserved: text before native tool call is saved
- test_native_tool_call_execution: native tool calls execute correctly
- test_duplicate_tool_calls_skipped: sequential duplicates are detected
- test_json_fallback_tool_calling: JSON tool calls work without native support
- test_text_after_tool_execution_preserved: follow-up text is saved
- test_multiple_tool_calls_executed: multiple tool calls in sequence work

Also added MockResponse helper methods:
- text_then_native_tool(): text followed by native tool call
- duplicate_native_tool_calls(): same tool call twice (for dedup testing)

Fixed text_with_json_tool() to ensure the "tool" key comes before "args"
(serde_json's default map sorts keys alphabetically, which broke pattern
detection).

Total: 18 integration tests covering historical bugs and core behaviors.
2026-01-19 14:44:30 +05:30
Dhanji R. Prasanna
5caa101b84 Fix inline JSON being incorrectly detected as tool call
The bug was caused by mark_tool_calls_consumed() being called after
displaying each chunk, which advanced last_consumed_position to the
end of the current buffer. When the next chunk arrived with JSON,
the unchecked_buffer started at position 0 of the slice, causing
is_on_own_line() to return true (position 0 is always "on its own line").

Removed the problematic mark_tool_calls_consumed() call from the
"no tool executed" branch. The remaining call after actual tool
execution is correct and necessary.

Added integration test that verifies inline JSON in prose is not
detected as a tool call.
2026-01-19 14:35:01 +05:30
Dhanji R. Prasanna
292a3aa48d Add MockProvider for integration testing
Adds a configurable mock LLM provider that can simulate various behaviors:
- Text-only responses (single or multi-chunk streaming)
- Native tool calls
- JSON tool calls in text
- Truncated responses (max_tokens)
- Multi-turn conversations

Features:
- Builder pattern for easy test setup
- Request tracking for verification
- Preset scenarios for common patterns
- Full LLMProvider trait implementation

Also adds integration tests that use MockProvider to test the
stream_completion_with_tools code path, including:
- test_butler_bug_scenario: reproduces the exact bug where text-only
  responses were not saved to context, causing consecutive user messages

This enables testing complex streaming behaviors without real API calls.
2026-01-19 13:59:31 +05:30
Dhanji R. Prasanna
349230d0b7 Fix missing assistant messages in context window
Bug: When the LLM responded with text-only (no tool calls), the assistant
message was sometimes not saved to the context window. This caused consecutive
user messages where the LLM would lose track of previous responses.

Root causes found and fixed:

1. Early return path (line ~2535): When stream finishes with no tools executed
   in previous iterations (any_tool_executed=false), the code returned early
   without saving the assistant message. Fixed by adding save before return.

2. Post-loop path (line ~2657): When raw_clean was empty but current_response
   had content, no message was saved. Fixed by falling back to current_response.

Both paths now properly save the assistant message before returning.
The assistant_message_added flag prevents any duplication.

Added tests:
- missing_assistant_message_test.rs: verifies the fallback logic
- assistant_message_dedup_test.rs: verifies no duplicate messages
- consecutive_assistant_message_test.rs: verifies alternation invariant
2026-01-19 13:50:28 +05:30
Dhanji R. Prasanna
07bff7691a Make /resume session prompt more compact
Output is now a single line:
  Session number to resume (Enter to cancel): 1 ... resuming scout_88871653e8e5f4f7 [done]

- Session ID displayed in cyan
- [done] displayed in bold green
- [error: ...] displayed in bold red on failure
- Added print_inline() to SimpleOutput for inline prompts
2026-01-18 18:41:24 +05:30
Dhanji R. Prasanna
02655110d6 fix: auto-resize images exceeding 1568px dimension to prevent 413 Payload Too Large
The Anthropic API was rejecting requests with multiple high-resolution images
(~2000x3000 pixels each) even though individual file sizes were under limits.

Root cause: Code only checked per-image file size (3.75MB), not dimensions.
Anthropic's docs recommend images ≤1568px on the longest edge, and the
API has a 32MB total request limit.

Changes:
- Add MAX_IMAGE_DIMENSION (1568px) and MAX_TOTAL_IMAGE_PAYLOAD (20MB) constants
- Trigger resize when dimensions > 1568px (not just file size > 3.75MB)
- Add new resize_image_to_dimensions() for dimension-constrained resizing
- Track cumulative payload size across multiple images
- Warn if total payload exceeds recommended limit

Test results with Walking Dead comic images:
- WD_0001_0001.jpg: 800KB 1987x3057 → 321KB 1019x1568
- WD_0001_1064.png: 150KB 1988x3057 → 143KB 1020x1568
- WD_0002_0001.jpg: 1023KB 1988x3056 → 292KB 1020x1568
- Total payload: ~2.5MB → ~1MB base64
2026-01-18 10:05:45 +05:30
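The dimension math behind the resize is simple scaling with aspect ratio preserved, and it reproduces the before/after sizes reported in the commit above. A minimal sketch (the function name is hypothetical; the real resize_image_to_dimensions() delegates the actual pixel work to an image library):

```rust
const MAX_IMAGE_DIMENSION: u32 = 1568;

/// Scale so the longest edge is at most 1568px, preserving aspect
/// ratio and rounding to the nearest pixel.
fn fit_to_max_dimension(w: u32, h: u32) -> (u32, u32) {
    let longest = w.max(h);
    if longest <= MAX_IMAGE_DIMENSION {
        return (w, h); // already within limits, no resize needed
    }
    let scale = MAX_IMAGE_DIMENSION as f64 / longest as f64;
    (
        (w as f64 * scale).round() as u32,
        (h as f64 * scale).round() as u32,
    )
}
```

For example, 1987x3057 scales to 1019x1568 and 1988x3056 to 1020x1568, matching the Walking Dead test results above.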
Dhanji R. Prasanna
3a03ed0585 Fix imgcat aspect ratio by adding preserveAspectRatio=1
Images were being displayed as narrow vertical strips because
iTerm2 wasn't preserving aspect ratio when only height was specified.
2026-01-17 18:50:00 +05:30
Dhanji R. Prasanna
0234920446 Print g3 progress and status on same line
- print_g3_progress now uses print! instead of println!
- print_g3_status completes the line with just the status
- Result: 'g3: compacting session ... [done]' on one line
2026-01-17 17:28:20 +05:30
Dhanji R. Prasanna
8dad00bdd0 Colorize session name in cyan in continuation prompt 2026-01-17 15:58:46 +05:30
Dhanji R. Prasanna
0d6a66a252 Compress session continuation prompt to single line
- Combine session info and resume prompt on one line
- Show result inline after user input (y/n)
- Green '... resuming ... [done]' on successful resume
- Dark grey '... starting fresh' when declining
- Yellow '... failed: <error>' on restore failure
2026-01-17 15:56:05 +05:30
Dhanji R. Prasanna
5622e5b21e refactor(cli): show only loaded items in startup status line
Changes the startup status line to only display items that were
actually loaded, instead of showing dots for missing items.

Before: "   · README  · AGENTS.md  ✓ Memory"
After:  "   ✓ Memory"

Also adds include prompt to the status line when specified:
"   ✓ prompt.md  ✓ Memory"

The order matches the load order: README → AGENTS.md → include prompt → Memory
2026-01-17 15:35:37 +05:30
Dhanji R. Prasanna
4877f8ae8a test(cli): add integration tests for --include-prompt and --no-auto-memory flags
Adds blackbox tests to verify:
- --include-prompt option is recognized by CLI parser
- --include-prompt appears in help output
- --no-auto-memory option is recognized by CLI parser
- --no-auto-memory appears in help output
2026-01-17 15:27:04 +05:30
Dhanji R. Prasanna
b0740b63c2 feat(cli): add --no-auto-memory flag to disable memory reminder in agent mode
Adds a flag to disable the automatic memory update reminder that runs
at the end of agent mode. Useful when running agents that should not
modify project memory.
2026-01-17 15:24:16 +05:30
Dhanji R. Prasanna
6bb5448d3f feat(project_files): add read_include_prompt() and update combine_project_content()
- Add read_include_prompt() function to read prompt content from a file
- Update combine_project_content() to accept include_prompt parameter
- Change prompt order: cwd → agents → readme → language → include_prompt → memory
- Add section markers around Project Memory for clearer boundaries
- Add comprehensive tests for include prompt functionality and ordering
2026-01-17 15:20:01 +05:30
Dhanji R. Prasanna
e45d5b25f3 feat(cli): wire up --include-prompt in main CLI and agent mode
Updates lib.rs and agent_mode.rs to read the include prompt file
and pass it through to combine_project_content(). The include prompt
is placed after language prompts but before project memory.
2026-01-17 15:19:55 +05:30
Dhanji R. Prasanna
56e8fddfc4 feat(cli): add --include-prompt flag for dynamic prompt injection
Adds a new CLI flag that allows users to include additional prompt
content from a file. The content is appended to the system prompt
before project memory is loaded.
2026-01-17 15:19:49 +05:30
Dhanji R. Prasanna
d89439d4b8 Fix macOS security policy rejection after install
After copying binaries to ~/.local/bin, macOS AppleSystemPolicy would
reject them because the linker-signed code signature becomes invalid.

Now re-sign binaries with ad-hoc signature after copying on macOS.
2026-01-17 11:41:45 +05:30
Dhanji R. Prasanna
d600b600b8 Always keep chromedriver running for faster subsequent startups
Removed the persistent_chrome config flag - chromedriver is now always
kept running after webdriver_quit. This eliminates startup latency for
subsequent WebDriver sessions.

Safaridriver is still killed on quit since it doesn't benefit from
persistence in the same way.

Updated quit message to correctly indicate chromedriver remains running.
2026-01-17 09:48:10 +05:30
Dhanji R. Prasanna
8ed360024f Add persistent ChromeDriver support for faster WebDriver startup
When webdriver_start is called, now checks if chromedriver is already
running on the configured port and reuses it instead of spawning a new
process. This significantly reduces startup time for subsequent sessions.

New config option:
  [webdriver]
  persistent_chrome = true  # Keep chromedriver running between sessions

When enabled, webdriver_quit closes the browser session but leaves
chromedriver running for reuse by the next session.
2026-01-17 09:26:25 +05:30
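The "already running on the configured port" check from the commit above amounts to a TCP probe of localhost. A sketch under assumed names (is_port_open is hypothetical; the real code may probe chromedriver's /status endpoint instead):

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Returns true if something is already listening on the given local
/// port, in which case the existing chromedriver can be reused rather
/// than spawning a new process.
fn is_port_open(port: u16) -> bool {
    let addr: SocketAddr = ([127, 0, 0, 1], port).into();
    TcpStream::connect_timeout(&addr, Duration::from_millis(200)).is_ok()
}
```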
Dhanji R. Prasanna
eb6268641f Fix --safari flag being blocked by Chrome diagnostics
When --safari was passed, Chrome diagnostics were still running because
--chrome-headless defaults to true. This caused the CLI to hang while
running diagnostics for a browser that wouldn't be used.

Now skip Chrome diagnostics when --safari is explicitly set.
2026-01-17 09:20:21 +05:30
Dhanji R. Prasanna
e3967a9948 refactor: remove animation from context thinning display
Simplify print_context_thinning to just print the message directly.
The message already contains proper ANSI formatting from context_window.rs.

Removes the flash animation and 'Context optimized successfully' footer.
2026-01-17 05:00:12 +05:30
Dhanji R. Prasanna
b8193bf9f9 style: use orange color for [no changes] status in thinning message 2026-01-17 04:53:42 +05:30
Dhanji R. Prasanna
74b1b9bea3 refactor: simplify context thinning status message
Change format from verbose emoji-based message to cleaner status line:
  Before: 🥒 Context thinned at 70%: 7 tool results, ~33839 chars saved
  After:  g3: thinning context ... 70% -> 40% ... [done]

The new format shows before/after percentages and uses bold green for
'g3:' and '[done]' to match other status messages.

Also removes unused emoji() and label() methods from ThinScope.
2026-01-17 04:47:16 +05:30
Dhanji R. Prasanna
c7984fd4c2 fix: account for base64 encoding overhead in image size limit
The Anthropic API has a 5MB limit on base64-encoded images, not raw file
size. Base64 encoding increases size by ~33% (4/3 ratio), so a 4MB raw
image becomes ~5.3MB encoded, exceeding the limit.

Changed MAX_IMAGE_SIZE from 5MB to ~3.75MB (5MB * 3/4) to trigger
resizing before the base64-encoded result exceeds the API limit.

Also updated target resize size to 3.6MB to leave margin.
2026-01-16 21:29:05 +05:30
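The arithmetic in the commit above follows from how base64 works: every 3 raw bytes become 4 output characters. A sketch of the sizing (constant names mirror the commit; exact values are illustrative):

```rust
/// The API limit applies to the base64-encoded payload, not the raw file.
const API_ENCODED_LIMIT: u64 = 5 * 1024 * 1024;

/// Raw size must stay under 3/4 of the encoded limit (~3.75MB) so the
/// ~4/3 base64 expansion doesn't push the payload over 5MB.
const MAX_IMAGE_SIZE: u64 = API_ENCODED_LIMIT * 3 / 4;

/// Each 3-byte group (rounded up) becomes 4 base64 characters.
fn base64_encoded_len(raw_len: u64) -> u64 {
    (raw_len + 2) / 3 * 4
}
```

A 4MB raw image encodes to ~5.3MB, over the limit; a 3.75MB raw image encodes to exactly 5MB, which is why resizing triggers at the lower threshold.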
Dhanji R. Prasanna
1003386f7f Auto-resize large images (>=5MB) in read_image tool
Images >= 5MB are now automatically resized to < 4.9MB using ImageMagick
before being sent to the LLM. This prevents API errors from oversized images.

- Uses iterative quality/scale reduction to find optimal size
- Converts to JPEG for better compression
- Shows original and resized size in terminal output (e.g., '6.2 MB → 4.1 MB (resized)')
- Falls back to original if ImageMagick fails or isn't available
2026-01-16 21:09:38 +05:30
Dhanji R. Prasanna
fc702168ab Add streaming completion integration test with mock LLM provider
Adds tests to verify that:
- All streaming chunks are processed before control returns to caller
- Both tool calls in a multi-tool-call stream are executed
- The finished signal properly terminates stream processing

Also adds Agent::new_for_test() to allow injecting mock providers.
2026-01-16 20:52:32 +05:30
Dhanji R. Prasanna
0e33465342 Add print_g3_progress/print_g3_status methods for consistent status messages 2026-01-16 20:28:24 +05:30
Dhanji R. Prasanna
95f89d3f8e Simplify compaction status messages 2026-01-16 20:26:35 +05:30
Dhanji R. Prasanna
415226ca84 Add newline before context progress display 2026-01-16 20:24:29 +05:30
Dhanji R. Prasanna
cebec23075 Fix duplicate response printing in interactive mode
The response was being printed twice: once during streaming and again
after task completion. Removed the redundant print_smart() call since
streaming already displays the response in real-time.
2026-01-16 14:48:50 +05:30
Dhanji R. Prasanna
4c6878a63d Set process title to agent name in agent mode
When running g3 --agent butler, the process title is now "g3 [butler]"
which shows up in ps, Activity Monitor, top, etc.

Uses the proctitle crate for cross-platform support.
2026-01-16 14:37:58 +05:30
Dhanji R. Prasanna
1f6a5671b2 Use agent name as prompt in --agent --chat mode (e.g., "butler>")
Changed run_interactive() parameter from bool to Option<&str> agent_name.
When agent_name is Some, use it as the prompt instead of "g3>".
2026-01-16 13:58:45 +05:30
Dhanji R. Prasanna
2e6bef4b24 Auto-memory: call once on exit for --agent --chat, per-turn for single-shot
When running g3 --agent <name> --chat:
- Skip per-turn memory checkpoint calls (too onerous)
- Call memory checkpoint once when exiting (Ctrl-D)

When running g3 --agent <name> (single-shot):
- Preserve existing behavior: call memory checkpoint after each turn

This keeps the auto-memory feature useful without being intrusive
in interactive agent sessions.
2026-01-16 13:35:40 +05:30
Dhanji R. Prasanna
6068249827 Simplify --agent --chat startup: minimal output, no session resume
When running g3 --agent <name> --chat, the output is now minimal:
- Workspace path (-> ~/path)
- Status line (README/AGENTS.md/Memory)
- Context progress bar
- Prompt (g3>)

Skipped in this mode:
- Session resume prompts
- "agent mode | name (source)" header
- "g3 programming agent" welcome
- Provider info display
- Language guidance messages

Added from_agent_mode parameter to run_interactive() to control
whether verbose welcome and session resume are shown.
2026-01-16 13:31:10 +05:30
Dhanji R. Prasanna
7c59d1993c Fix auto-memory JSON leak: tool call printed raw to UI
The JSON filter only suppresses tool calls at line boundaries. When
"Memory checkpoint: " was printed without a trailing newline, the LLM
response `{"tool": "remember", ...}` appeared on the same line and
leaked through to the UI.

Fix:
- Add trailing newline to "Memory checkpoint:" message
- Reset JSON filter state before streaming the response

Added test: test_tool_call_not_at_line_start_passes_through
Documents the filter behavior and references the fix location.
2026-01-16 13:10:18 +05:30
Dhanji R. Prasanna
94544c8f6a Add interactive mode support for agents with --chat flag
- Remove chat from conflicts_with_all for --agent flag
- Add chat parameter to run_agent_mode()
- Run interactive loop instead of single task when --chat is passed

Usage: g3 --agent <name> --chat
2026-01-16 12:01:56 +05:30
Dhanji R. Prasanna
6bd9c51e8e feat: shell output pagination and optimized read_file with seek
- Shell outputs > 8KB are truncated to first 500 chars
- Full output saved to .g3/sessions/<session_id>/tools/shell_stdout_<id>.txt
- LLM can use read_file with start/end to paginate through large outputs
- read_file now uses seek() for O(1) random access instead of reading entire file
- UTF-8 safe: reads extra bytes at boundaries to find valid char positions
- Falls back to lossy conversion for binary files (no panics)

Files changed:
- paths.rs: get_tools_output_dir(), generate_short_id()
- shell.rs: truncate_large_output() integration
- file_ops.rs: seek-based read_file_range() helper
- New test: read_file_utf8_test.rs
2026-01-16 09:16:16 +05:30
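The seek-plus-boundary-repair idea from the commit above can be sketched as follows. This is a simplified sketch, not the real read_file_range() helper: it seeks to a byte offset, reads a window, then skips any leading UTF-8 continuation bytes so a multi-byte character is never split at the front, falling back to lossy conversion for binary content:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

/// Read up to `len` bytes starting at byte offset `start`, nudging the
/// start forward to the next UTF-8 character boundary.
fn read_range_utf8(path: &str, start: u64, len: usize) -> std::io::Result<String> {
    let mut f = File::open(path)?;
    f.seek(SeekFrom::Start(start))?; // O(1) random access, no full read
    let mut buf = vec![0u8; len + 4]; // a few spare bytes for boundary repair
    let n = f.read(&mut buf)?;
    buf.truncate(n);
    // UTF-8 continuation bytes match 0b10xxxxxx; skip them at the front.
    let skip = buf.iter().take_while(|b| (*b & 0xC0) == 0x80).count();
    // Lossy conversion keeps binary files from panicking.
    Ok(String::from_utf8_lossy(&buf[skip..]).into_owned())
}
```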
Dhanji R. Prasanna
ce5183b296 style: compress studio auto-accept output
- Replace verbose auto-accept messages with single line
- Format: 'studio: session <id> ... [merged]'
- Refactor cmd_accept to use accept_session() with configurable prefix
- Remove 'completed successfully' and 'Auto-accepting' messages
2026-01-16 07:30:27 +05:30
Dhanji R. Prasanna
e2385faba1 style: compress studio session startup output
- Replace verbose multi-line output with single line
- Format: 'studio: new session <id>'
- 'studio:' in bold green, session id in inline-code orange (RGB 216,177,114)
- Remove separator lines and 'Starting g3 agent' message
2026-01-16 07:24:22 +05:30
Dhanji R. Prasanna
ef5aa75e6b style: simplify studio accept/discard output messages
- Change verbose emoji messages to minimal format
- Print '> session <id> ...' first, then status after operation completes
- 'merged' shown in bold green
- 'discarded' shown in bold yellow
2026-01-16 07:17:36 +05:30
Dhanji R. Prasanna
01cb4f6691 fix: use consistent max_tokens defaults across providers
- Fix aliasing issue where resolve_max_tokens() used fallback_default_max_tokens
  (8192) instead of provider-specific defaults
- Update fallback_default_max_tokens from 8192 to 32000
- Set provider-specific max_tokens defaults:
  - Anthropic: 32000
  - OpenAI: 32000 (was 16000)
  - Databricks: 32000 (was 50000, now matches Anthropic as passthru)
  - Embedded: 2048
- Context window lengths unchanged:
  - OpenAI: 400,000
  - Anthropic: 200,000
  - Databricks (Claude): 200,000

This fixes the 'LLM response was cut off due to max_tokens limit' error
in agent mode that occurred because 8192 was being used instead of 32000.
2026-01-16 07:05:57 +05:30
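The fixed resolution order described above is: explicit configuration wins, then the provider-specific default, then the last-resort fallback. A sketch under assumed names (the real resolve_max_tokens() signature may differ):

```rust
/// Last-resort fallback, raised from 8192 to 32000 by this commit.
const FALLBACK_DEFAULT_MAX_TOKENS: u32 = 32_000;

/// Explicit config > provider default > global fallback.
fn resolve_max_tokens(configured: Option<u32>, provider_default: Option<u32>) -> u32 {
    configured
        .or(provider_default)
        .unwrap_or(FALLBACK_DEFAULT_MAX_TOKENS)
}
```

The pre-fix bug was the equivalent of ignoring `provider_default` entirely and jumping straight to the 8192 fallback, which is what produced the cut-off responses.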
Dhanji R. Prasanna
65e0217c68 Add unit tests for studio session management
New tests:
- test_new_session_has_short_id
- test_new_interactive_session
- test_branch_name_format
- test_session_save_and_load
- test_session_mark_complete
- test_session_mark_paused
- test_list_empty_sessions
- test_backwards_compatibility_no_session_type

Added tempfile as dev dependency for temp directory tests.
2026-01-16 06:52:23 +05:30
Dhanji R. Prasanna
78f9207d27 Add interactive mode to studio
New commands:
- studio cli (alias: c) - Start a new interactive g3 session in an isolated worktree
- studio resume <id> (alias: r) - Resume a paused interactive session
- Bare 'studio' now defaults to 'studio cli'

Session changes:
- Added SessionStatus::Paused for sessions that can be resumed
- Added SessionType enum (OneShot, Interactive) for future use
- Interactive sessions use inherited stdio for direct TTY access
- Sessions are marked as Paused when user exits g3

Workflow:
1. studio        # creates worktree, runs g3 interactively
2. (work in g3, exit when done)
3. studio resume <id>  # continue working
4. studio accept <id>  # merge to main when finished
2026-01-16 06:48:24 +05:30
Dhanji R. Prasanna
637884f84b Fix duplicate todo_read display in agent mode
The print_todo_compact() function was missing the call to clear the
streaming hint line before printing the final tool output. This caused
the tool name to appear twice when the hint line wasn't cleared:

  ● todo_read     ● todo_read   | empty

Added the missing handle_hint(ToolParsingHint::Complete) call to match
the behavior of print_tool_compact().
2026-01-16 06:38:11 +05:30
Dhanji R. Prasanna
25d35529e7 Fix --accept flag being passed through to g3 in studio run
When --accept was passed after positional args (e.g., 'studio run --agent
carmack task --accept'), clap's trailing_var_arg captured it as part of
g3_args instead of parsing it as the studio flag. This caused g3 to error
with 'unexpected argument --accept'.

- Extract filter_accept_flag() helper to detect and remove --accept from
  trailing args
- Set auto_accept=true if --accept found in either position
- Add 5 unit tests for the filtering logic
2026-01-15 21:05:13 +05:30
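The repair for the swallowed flag can be sketched as a filter over the trailing args. This mirrors the filter_accept_flag() helper named in the commit, but the exact signature is illustrative:

```rust
/// Remove any --accept that clap's trailing_var_arg captured into the
/// pass-through g3 args, reporting whether it was present so the
/// caller can set auto_accept.
fn filter_accept_flag(args: Vec<String>) -> (Vec<String>, bool) {
    let mut found = false;
    let filtered: Vec<String> = args
        .into_iter()
        .filter(|a| {
            if a == "--accept" {
                found = true;
                false // drop it from the args forwarded to g3
            } else {
                true
            }
        })
        .collect();
    (filtered, found)
}
```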