alex/g3 - g3 - Millerson GIT hosting

alex/g3

Author	SHA1	Message	Date
Dhanji R. Prasanna	1b26de6cd2	Fix context window progress bar showing wrong token counts Calibrate used_tokens from API prompt_tokens (ground truth) to fix progress bar drift in interactive mode. Three issues fixed: 1. update_usage_from_response() only updated cumulative_tokens, never calibrated used_tokens. Now snaps used_tokens to prompt_tokens when available (falls back to heuristic when prompt_tokens is 0). 2. Moved calibration call inline during streaming (when usage chunk arrives) instead of after the loop. Text-only responses — the most common case in interactive mode — take an early return path that bypassed the post-loop usage update entirely. 3. Removed mock Usage with hardcoded prompt_tokens=100 from execute_single_task() which corrupted calibration.	2026-03-18 15:31:20 +11:00
Dhanji R. Prasanna	7347d92ae8	Make plan approval gate non-destructive and baseline-aware - Remove all file revert/delete logic from check_plan_approval_gate: no more git checkout or fs::remove_file calls. The gate only warns. - Remove reverted_files field from ApprovalGateResult::Blocked. - Add get_dirty_files() helper to snapshot dirty files as a HashSet. - Capture baseline dirty files when plan mode starts (set_plan_mode). Pre-existing dirty files are excluded from gate checks so they never trigger blocking. - Add 5 new unit tests covering non-destructive behavior, baseline exclusion, and mixed baseline/new file scenarios. - Update integration test to match new non-destructive semantics.	2026-02-15 09:53:14 +11:00
Dhanji R. Prasanna	1d77f3f865	fix: allow new plan_write after completed approved plan When an approved plan was fully complete (all items done/blocked), plan_write blocked creating a new plan with 'Cannot remove item' error. Now checks is_complete() first — complete plans allow fresh plan creation without carrying over approved_revision or enforcing item ID preservation. Adds 4 end-to-end integration tests covering happy path, negative (in-progress still blocks), and boundary cases (all-blocked, mixed).	2026-02-14 12:27:38 +11:00
Dhanji R. Prasanna	0410efd41b	Add 1% safety buffer to context window to prevent API token limit errors Our token estimation heuristic (chars/3 * 1.1 for code, chars/4 * 1.1 for text) slightly undercounts over long sessions with hundreds of tool calls. This accumulated drift of ~89 tokens caused Anthropic API 400 errors: 'prompt is too long: 200089 tokens > 200000 maximum' Fix: ContextWindow::new() now applies a 1% buffer, setting total_tokens to 99% of the provider-reported limit. For a 200k window this gives 198k, providing a 2000-token safety margin that absorbs estimation drift. All percentage calculations, compaction thresholds, and thinning triggers operate against the buffered limit, so compaction fires earlier and we never send a request the API will reject.	2026-02-13 15:46:53 +11:00
Dhanji R. Prasanna	68f8f13b38	Fix research_status polling tool falsely deduplicated across auto-continue iterations The dedup logic compared only tool name+args, ignoring the unique tool call IDs that native providers (Anthropic) assign to each invocation. When the model called research_status {} in iteration 1, auto-continued, and called it again in iteration 2 with identical args but a new ID, the second call was marked DUP IN MSG and skipped. With no tool executed and no text, the stream errored with 'No response received from the model.' Three-part fix: - ID-aware DUP IN MSG: check_duplicate_in_previous_message() uses tool call IDs when both are non-empty (different IDs = different invocations) - History cutoff: only checks messages from before the current iteration to prevent within-iteration false positives - DUP IN ITER: last_executed_tool on IterationState catches stuttered duplicates across chunks within the same response Regression test reproduces the exact bug (fails without fix, passes with).	2026-02-12 15:54:24 +11:00
Dhanji R. Prasanna	88d2b9592b	Fix tool_call input tokens invisible to context window tracker estimate_tokens() only counted message.content chars, completely ignoring message.tool_calls[].input JSON. When sent to the API, tool_use blocks include full input, so the token tracker massively undercounted — in one session, 303k chars (101k tokens) of tool input were invisible, showing 39% usage when actual was >100%. Compaction never triggered, causing an API 400 error. Added estimate_message_tokens() that accounts for both content and tool_call input. Updated add_message_with_tokens(), recalculate_tokens(), and clear_conversation() to use it. 7 unit tests + 1 integration test reproducing the exact session trace.	2026-02-11 16:12:13 +11:00
Dhanji R. Prasanna	d61be719c2	fix: strip orphaned tool_calls from preserved assistant message during compaction After context compaction, the preserved last assistant message retained its structured tool_calls field, but the corresponding tool_result was summarized away. This created orphaned tool_use blocks that violated the Anthropic API constraint: 'Each tool_use block must have a corresponding tool_result block in the next message', causing 400 errors. Primary fix: clear tool_calls from the preserved assistant message in extract_preserved_messages(). The tool call was already executed and its result is captured in the summary. Defense-in-depth: added strip_orphaned_tool_use() post-processing in Anthropic convert_messages() to detect and strip any orphaned tool_use blocks before they reach the API. Added 7 tests: 3 unit tests for compaction stripping, 3 unit tests for Anthropic orphan detection, 1 integration test reproducing the exact bug scenario from the h3 session.	2026-02-11 15:22:03 +11:00
Dhanji R. Prasanna	d3f0112f46	fix: store tool calls structurally for proper API roundtripping The agent would stop mid-task because native tool calls were stored as inline JSON text in Message.content. When sent back to the Anthropic API via convert_messages(), they went as plain text instead of structured tool_use/tool_result blocks. The model would occasionally get confused and emit text describing what it wanted to do instead of invoking the tool mechanism. Changes: - Add MessageToolCall struct and tool_calls/tool_result_id fields to Message - Add id field to core ToolCall struct to preserve provider tool call IDs - Update Anthropic convert_messages() to emit tool_use and tool_result blocks - Add ToolResult variant to AnthropicContent enum - Store tool calls structurally in tool message construction (not inline JSON) - Fix add_message() to preserve empty-content messages with tool_calls - Fix check_duplicate_in_previous_message() to check structured tool_calls - Generate valid IDs for JSON fallback tool calls (Anthropic pattern requirement) - Update planner create_tool_message() to use structured tool calls	2026-02-11 08:48:07 +11:00
Dhanji R. Prasanna	2a4cd1f4d6	fix: strip duplicate tool call JSON from assistant messages when LLM stutters When the LLM emits identical JSON tool calls as text content (JSON fallback mode), the raw duplicate JSON was being stored in the assistant message in conversation history. This confused the model on subsequent turns, causing it to stall or repeat itself. Root cause: raw_content_for_log used get_text_content() which returns the full parser buffer including all duplicate tool call JSONs. Fix: Added get_text_before_tool_calls() to StreamingToolParser that returns only the text before the first JSON tool call. Changed raw_content_for_log to use this method so the assistant message only contains the preamble text + the single executed tool call. Added 5 integration tests covering stuttered duplicates, triple stutter, cross-turn dedup, and different-args boundary case. Added MockResponse helpers for simulating LLM stutter patterns.	2026-02-10 19:53:11 +11:00
Dhanji R. Prasanna	328eecfcad	fix: extract_facts fallback for facts-prefixed selectors in datalog verification Root cause: ActionEnvelope.to_yaml_value() creates a Mapping from the facts HashMap without a 'facts:' wrapper key, but rulespec selectors may include a 'facts.' prefix (e.g. 'facts.feature.done' instead of 'feature.done'). This caused zero facts to be extracted, making all predicate evaluations fail. Fix: extract_facts() now tries the selector against the unwrapped envelope value first, and if empty, retries against a facts-wrapped version as fallback. Also: - Strengthened write_envelope tool description to require top-level facts: key, file paths for evidence, and allow free-form notes - Updated system prompt with matching rules - Added 6 new tests (4 unit, 2 integration) - Strengthened existing integration test to verify fact count > 0	2026-02-07 14:42:39 +11:00
Dhanji R. Prasanna	7032e75fc6	Add write_envelope tool with verify_envelope for explicit envelope creation - New crates/g3-core/src/tools/envelope.rs with execute_write_envelope() and verify_envelope() (moved from shadow_datalog_verify in plan.rs) - write_envelope accepts YAML facts, writes envelope.yaml to session dir, then runs datalog verification against analysis/rulespec.yaml in shadow mode - plan_verify() now only checks envelope existence (no longer runs datalog) - Tool count: 13 -> 14 - Updated system prompt to instruct agents to call write_envelope before marking last plan item done - Updated integration tests to use write_envelope tool directly Workflow: write_envelope -> verify_envelope -> datalog shadow artifacts plan_write(done) -> plan_verify -> checks envelope exists	2026-02-06 16:09:07 +11:00
Dhanji R. Prasanna	f7a240a99b	refactor: decouple rulespec from plan_write, read from analysis/rulespec.yaml - Remove rulespec parameter from plan_write tool definition and execution - Remove rulespec compilation from plan_approve (no longer pre-compiles) - Remove write_rulespec, get_rulespec_path, format_rulespec_yaml/markdown from invariants.rs; read_rulespec() now takes &Path working dir - Remove save/load_compiled_rulespec, get_compiled_rulespec_path from datalog.rs - Update shadow_datalog_verify() to compile on-the-fly from analysis/rulespec.yaml, writing rulespec.compiled.dl and datalog_evaluation.txt to session dir - Remove rulespec display from plan_read output - Remove Invariants/Rulespec section from native.md system prompt - Remove rulespec from prompts.rs plan_write format and examples - Update existing tests to remove rulespec from plan_write calls - Add 3 integration tests for on-the-fly rulespec verification	2026-02-06 15:31:23 +11:00
Dhanji R. Prasanna	3823f8b5f3	Optimize native system prompt - 48% size reduction Removed redundant and vague content from prompts/system/native.md: - Simplified intro from 17 lines to 3 lines - Reduced Code Search section to one line - Removed duplicate Plan Mode example (kept one) - Removed Action Envelope section (rarely used correctly) - Removed verbose Memory Format details (tool description covers it) - Removed Response Guidelines (obvious to modern LLMs) Size: 8,620 chars -> 4,498 chars Also updated: - G3_IDENTITY_LINE constant for agent mode compatibility - Test assertions to check for new prompt markers - System prompt validation to use new marker string	2026-02-05 22:16:34 +11:00
Dhanji R. Prasanna	7e2d9bc22c	Enforce rulespec creation with plan_write for new plans Solves the tautology problem where the LLM would write invariants after implementation, making them match what was done rather than constrain it. Changes: - plan_write now accepts 'rulespec' parameter - New plans REQUIRE rulespec (fails with helpful error if missing) - Plan updates don't require rulespec (backward compatible) - Rulespec is parsed, validated, and written atomically with plan - Updated system prompt with clear examples for new vs update - Updated tool definition schema - Updated all affected tests New flow: task → plan+rulespec → user reviews BOTH → approve → implement	2026-02-05 21:12:02 +11:00
Dhanji R. Prasanna	0f919237ea	Make plan approval gate only active in plan mode - Add in_plan_mode flag to Agent struct - Add set_plan_mode() and is_plan_mode() methods - Gate check now only runs when in_plan_mode is true - CLI calls set_plan_mode(true) on /plan command and EnterPlanMode - CLI calls set_plan_mode(false) on approval and CTRL-D exit - Update integration test to enable plan mode - Fix test YAML to use Vec<Check> for negative/boundary checks	2026-02-05 11:41:52 +11:00
Dhanji R. Prasanna	c347a73cbd	Add plan approval gate to block file changes without approved plan - Add check_plan_approval_gate() in tools/plan.rs that runs after each tool call - Detects file changes via git status --porcelain when plan exists but not approved - Reverts changes: git checkout for modified files, rm for new untracked files - Returns blocking message instructing LLM to create/approve plan first - Add ApprovalGateResult enum with Allowed/Blocked/NotGitRepo variants - Add set_session_id() and set_working_dir() methods on Agent for testing - Add integration test using MockProvider to simulate blocked write_file	2026-02-05 11:34:10 +11:00
Dhanji R. Prasanna	263a838d31	Remove redundant 'No plan exists' message from plan_read output The UI already shows 'empty' via print_plan_compact, so returning an empty string avoids duplicate output.	2026-02-02 17:19:01 +11:00
Dhanji R. Prasanna	a63950d8f5	Add Plan Mode to replace TODO system Plan Mode is a cognitive forcing system that requires reasoning about: - Happy path - Negative case - Boundary condition New tools: - plan_read: Read current plan for session - plan_write: Create/update plan with YAML content (validates structure) - plan_approve: Mark current revision as approved New command: - /feature <description>: Start Plan Mode for a new feature Plan schema requires: - plan_id, revision, approved_revision - items with id, description, state, touches, checks (happy/negative/boundary) - evidence and notes required when marking items done Verification: - plan_verify() called automatically when all items are done/blocked Removed: - todo_read, todo_write tools - todo.rs module and related tests	2026-02-02 14:38:25 +11:00
Dhanji R. Prasanna	58bbfde6f4	test: add integration tests for streaming parser stuttering bug fix Add characterization tests for the streaming parser stuttering bug fix (`fa3c920`). These tests verify that when an LLM "stutters" and emits incomplete tool call fragments followed by complete tool calls, the parser: 1. Does not get stuck waiting for the incomplete fragment to complete 2. Successfully parses complete tool calls that appear after the fragment Tests cover: - The exact pattern from butler session butler_c6ab59af2e4f991c - Edge cases that should NOT trigger invalidation (nested JSON, patterns in strings) - Recovery behavior after reset - Multiple complete tool calls - Boundary conditions (chunk boundaries, minimal patterns) Agent: hopper	2026-01-30 14:30:27 +11:00
Dhanji R. Prasanna	570a824780	Rename archivist agent to huffman Named after David Huffman, inventor of Huffman coding - compression that preserves information with fewer bits. Fits the agent's purpose: compact memory, preserve semantics.	2026-01-29 11:22:59 +11:00
Dhanji R. Prasanna	56f558dc1b	Fix compiler warnings in test files Eliminate unused variable and import warnings across test files: - streaming_parser_test.rs: prefix unused `tools` with underscore - webdriver_session.rs: remove unused `use super::*` import - mock_provider_integration_test.rs: prefix unused `result` and `task_result` - test_preflight_max_tokens.rs: prefix unused `proposed_max` - todo_staleness_test.rs: add #[allow(dead_code)] for test helper methods - json_parsing_stress_test.rs: prefix unused `tools` - read_file_token_limit_test.rs: add #[allow(dead_code)] for unused helper - background_process_demo_test.rs: remove unused PathBuf import - test_session_continuation.rs: prefix unused `temp_dir` in 7 tests All tests pass. No behavior changes. Agent: fowler	2026-01-29 11:15:10 +11:00
Dhanji R. Prasanna	7bfb9efa19	Remove automatic README loading from context window README.md is no longer auto-loaded into the LLM context at startup. This saves ~4,600 tokens per session while AGENTS.md and memory.md still provide all critical information for code tasks. Changes: - Delete read_project_readme() function - Remove readme_content parameter from combine_project_content() - Rename extract_readme_heading() -> extract_project_heading() - Rename Agent constructors: _with_readme_ -> _with_project_context_ - Update context preservation to only check for Agent Configuration - Remove has_readme field from LoadedContent - Update all tests to use new markers and function names The LLM can still read README.md on-demand via read_file when needed.	2026-01-29 11:07:41 +11:00
Dhanji R. Prasanna	a902be1562	Refactor system prompts to eliminate duplication; upgrade embedded provider - Refactor prompts.rs: extract shared sections (intro, TODO, workspace memory, web research, response guidelines) used by both native and non-native prompts - Fix typo in native prompt: "save them.." -> "save them." - Fix non-native prompt: add missing closing braces in JSON examples, add IMPORTANT steps section, align with native prompt quality - Add 9 unit tests to verify both prompts contain required sections - Upgrade llama-cpp-2 dependency and refactor embedded provider - Update config.example.toml with embedded model examples - Update workspace memory	2026-01-28 09:56:39 +11:00
Dhanji R. Prasanna	5b4079e861	Add prompt cache statistics tracking to /stats command - Extend Usage struct with cache_creation_tokens and cache_read_tokens fields - Parse Anthropic cache_creation_input_tokens and cache_read_input_tokens - Parse OpenAI prompt_tokens_details.cached_tokens for automatic prefix caching - Add CacheStats struct to Agent for cumulative tracking across API calls - Add "Prompt Cache Statistics" section to /stats output showing: - API call count and cache hit count - Hit rate percentage - Total input tokens and cache read/creation tokens - Cache efficiency (% of input served from cache) - Update all provider implementations and test files	2026-01-27 11:32:45 +11:00
Dhanji R. Prasanna	2e84f1ece0	test: fix ACD test race condition and add read_image characterization test - Fix test_rehydrate_success race condition by using UUID for unique session IDs - Add #[serial] attribute to prevent parallel execution conflicts - Improve cleanup to remove entire session directory tree - Add characterization test for resize_image_to_dimensions fallback behavior (documents fix from commit `af8b849` for media type preservation) Agent: hopper	2026-01-26 16:19:53 +11:00
Dhanji R. Prasanna	726e2d71f5	test: add integration test for project content surviving compaction Add test_project_content_survives_compaction() to verify that project content loaded via /project command persists through context compaction. This is a CHARACTERIZATION test that validates: - Project content appended to README message survives compaction - The README message (containing project content) is preserved as message[1] - PROJECT INSTRUCTIONS, ACTIVE PROJECT markers, Brief and Status sections all survive the compaction process Agent: hopper	2026-01-26 16:09:17 +11:00
Dhanji R. Prasanna	9de8e8cc76	Fix compaction bug: use User role for summary to maintain alternation The previous implementation added the summary as a System message, which caused "Conversation must start with a user message" errors because the first non-system message after compaction was Assistant (the preserved last assistant message). Fix: Change summary from System to User message, creating valid alternation: [System Prompt] -> [Summary as USER] -> [Last Assistant] -> [Latest User] This also prevents system message bloat across multiple compactions since the summary is now part of the conversation flow and gets replaced on each compaction. Added test_second_compaction_no_bloat to verify no accumulation.	2026-01-26 15:24:04 +11:00
Dhanji R. Prasanna	5d0d532b47	feat: preserve last assistant message during compaction When context window compaction occurs, the last assistant message is now preserved in addition to the system prompt, README, and summary. This improves continuity after compaction by keeping the LLM's most recent response, which often contains important context about what was just done or what comes next. New message order after compaction: [System Prompt] -> [README/AGENTS.md] -> [ACD Stub?] -> [Summary] -> [Last Assistant] -> [Latest User?] Changes: - Add last_assistant_message field to PreservedMessages struct - Modify extract_preserved_messages() to find last assistant message - Modify reset_with_summary_and_stub() to include last assistant message - Add comprehensive integration tests using MockProvider Tests cover edge cases: - No assistant message exists - Tool-call-only assistant messages (still preserved) - Multiple assistant messages (only last one preserved) - No trailing user message	2026-01-23 09:54:03 +05:30
Dhanji R. Prasanna	feb7c3e40d	Add /project and /unproject commands for project-specific context - Add Project struct in crates/g3-cli/src/project.rs with file loading logic - Load brief.md, contacts.yaml, status.md from project path - Load projects.md from workspace root for cross-project context - Project content appended to system message (survives compaction/dehydration) - /project <path> loads project and auto-submits prompt asking about state - /unproject clears project content and resets context - Add set_project_content(), clear_project_content(), has_project_content() to Agent - Add new_for_test_with_readme() for testing with custom README content - Add 6 unit tests for Project struct - Add 9 integration tests for project context behavior	2026-01-21 14:53:30 +05:30
Dhanji R. Prasanna	6a5ce11e7b	Consolidate redundant assistant message test files Deleted 4 redundant test files (~956 lines): - assistant_message_dedup_test.rs (416 lines, 12 tests) - consecutive_assistant_message_test.rs (248 lines, 6 tests) - missing_assistant_message_test.rs (100 lines, 4 tests) - early_return_path_test.rs (192 lines, 5 tests) - whitebox test Created consolidated assistant_message_test.rs (369 lines, 14 tests): - Helper function tests for consecutive message detection - ContextWindow unit tests for normal and tool execution flows - Bug demonstration tests documenting what bugs looked like - Invariant tests for user/assistant alternation - Missing assistant message fallback logic tests The early_return_path_test was removed because it: - Referenced specific line numbers in production code (brittle) - Reimplemented internal logic (whitebox anti-pattern) - Duplicated coverage from mock_provider_integration_test.rs All 729 g3-core tests pass.	2026-01-21 10:27:07 +05:30
Dhanji R. Prasanna	9a0a2a2726	Make dehydration stub more compact Change from multi-line verbose format to single-line compact format: Before: ⚡ DEHYDRATED CONTEXT (fragment_id: 188c7ac71613) • 8 messages (4 user, 4 assistant) • 3 tool calls (shell ×3) • ~299 tokens saved To restore this history, call: rehydrate(fragment_id: "188c7ac71613") After: ⚡ DEHYDRATED CONTEXT: 3 tool calls (shell x3), 8 total msgs. To restore, call: rehydrate(fragment_id: "188c7ac71613") - Combine all info into single line - Remove tokens saved (not essential for rehydration decision) - Use ASCII 'x' instead of '×' for simplicity - Add 'no tool calls' case for fragments without tools - Update related tests	2026-01-20 21:26:42 +05:30
Dhanji R. Prasanna	182f5f98fe	Centralize g3 status message formatting Extract a new g3_status module in g3-cli that provides consistent formatting for all 'g3:' prefixed system status messages. Key changes: - Add G3Status struct with methods for progress, done, failed, error, etc. - Add Status enum with Done, Failed, Error, Resolved, Insufficient, NoChanges - Add ThinResult struct in g3-core for semantic thinning data - Update UiWriter trait with print_thin_result() method - Refactor context thinning to return ThinResult instead of formatted strings - Update all callers to use the new centralized formatting - Session resume/decline messages now use G3Status - Compaction status messages now use G3Status This maintains clean separation of concerns: g3-core emits semantic data, g3-cli handles all terminal formatting and colors.	2026-01-20 09:50:55 +05:30
Dhanji R. Prasanna	f4cce22db3	Add test documenting LLM duplicate text behavior Adds test_llm_repeats_text_before_each_tool_call() which documents the scenario where the LLM re-outputs the same preamble text before each tool call in a multi-tool response. Analysis showed this is LLM behavior, not a g3 bug: - Each assistant message is correctly stored with different tool calls - The duplicate display is the LLM choosing to repeat context - Storage is correct, display accurately reflects LLM output Decision: Accept as LLM behavior (Option B). Future LLM improvements may resolve this naturally without g3 code changes.	2026-01-19 18:44:01 +05:30
Dhanji R. Prasanna	1604ed613a	Add integration tests proving tool results are never parsed as tool calls Adds 3 new tests to json_parsing_stress_test.rs: - test_tool_result_with_json_not_parsed: Full agent integration test proving that JSON in tool results (sent TO the LLM) is never parsed by the streaming parser (which only sees LLM output) - test_parser_only_processes_completion_chunks: Documents that StreamingToolParser only accepts CompletionChunk, not Message objects - test_architectural_separation_documented: Documents the data flow showing tool results flow TO the LLM while the parser only sees FROM the LLM This proves the architectural guarantee: there is no code path where tool result content could be parsed as a tool call, because: 1. Tool results are Message objects added to context_window 2. The streaming parser only processes CompletionChunk from provider.stream_completion() 3. These are completely separate data types flowing in opposite directions Total: 41 JSON parsing stress tests now pass.	2026-01-19 16:21:36 +05:30
Dhanji R. Prasanna	2043a83e7d	Add comprehensive MockProvider integration tests Added 6 new integration tests for stream_completion_with_tools: - test_text_before_tool_call_preserved: text before native tool call is saved - test_native_tool_call_execution: native tool calls execute correctly - test_duplicate_tool_calls_skipped: sequential duplicates are detected - test_json_fallback_tool_calling: JSON tool calls work without native support - test_text_after_tool_execution_preserved: follow-up text is saved - test_multiple_tool_calls_executed: multiple tool calls in sequence work Also added MockResponse helper methods: - text_then_native_tool(): text followed by native tool call - duplicate_native_tool_calls(): same tool call twice (for dedup testing) Fixed text_with_json_tool() to ensure "tool" key comes before "args" (serde_json alphabetizes keys, breaking pattern detection). Total: 18 integration tests covering historical bugs and core behaviors.	2026-01-19 14:44:30 +05:30
Dhanji R. Prasanna	5caa101b84	Fix inline JSON being incorrectly detected as tool call The bug was caused by mark_tool_calls_consumed() being called after displaying each chunk, which advanced last_consumed_position to the end of the current buffer. When the next chunk arrived with JSON, the unchecked_buffer started at position 0 of the slice, causing is_on_own_line() to return true (position 0 is always "on its own line"). Removed the problematic mark_tool_calls_consumed() call from the "no tool executed" branch. The remaining call after actual tool execution is correct and necessary. Added integration test that verifies inline JSON in prose is not detected as a tool call.	2026-01-19 14:35:01 +05:30
Dhanji R. Prasanna	292a3aa48d	Add MockProvider for integration testing Adds a configurable mock LLM provider that can simulate various behaviors: - Text-only responses (single or multi-chunk streaming) - Native tool calls - JSON tool calls in text - Truncated responses (max_tokens) - Multi-turn conversations Features: - Builder pattern for easy test setup - Request tracking for verification - Preset scenarios for common patterns - Full LLMProvider trait implementation Also adds integration tests that use MockProvider to test the stream_completion_with_tools code path, including: - test_butler_bug_scenario: reproduces the exact bug where text-only responses were not saved to context, causing consecutive user messages This enables testing complex streaming behaviors without real API calls.	2026-01-19 13:59:31 +05:30
Dhanji R. Prasanna	349230d0b7	Fix missing assistant messages in context window Bug: When the LLM responded with text-only (no tool calls), the assistant message was sometimes not saved to the context window. This caused consecutive user messages where the LLM would lose track of previous responses. Root causes found and fixed: 1. Early return path (line ~2535): When stream finishes with no tools executed in previous iterations (any_tool_executed=false), the code returned early without saving the assistant message. Fixed by adding save before return. 2. Post-loop path (line ~2657): When raw_clean was empty but current_response had content, no message was saved. Fixed by falling back to current_response. Both paths now properly save the assistant message before returning. The assistant_message_added flag prevents any duplication. Added tests: - missing_assistant_message_test.rs: verifies the fallback logic - assistant_message_dedup_test.rs: verifies no duplicate messages - consecutive_assistant_message_test.rs: verifies alternation invariant	2026-01-19 13:50:28 +05:30
Dhanji R. Prasanna	74b1b9bea3	refactor: simplify context thinning status message Change format from verbose emoji-based message to cleaner status line: Before: ✨ 🥒 Context thinned at 70%: 7 tool results, ~33839 chars saved ✨ After: g3: thinning context ... 70% -> 40% ... [done] The new format shows before/after percentages and uses bold green for 'g3:' and '[done]' to match other status messages. Also removes unused emoji() and label() methods from ThinScope.	2026-01-17 04:47:16 +05:30
Dhanji R. Prasanna	1003386f7f	Auto-resize large images (>=5MB) in read_image tool Images >= 5MB are now automatically resized to < 4.9MB using ImageMagick before being sent to the LLM. This prevents API errors from oversized images. - Uses iterative quality/scale reduction to find optimal size - Converts to JPEG for better compression - Shows original and resized size in terminal output (e.g., '6.2 MB → 4.1 MB (resized)') - Falls back to original if ImageMagick fails or isn't available	2026-01-16 21:09:38 +05:30
Dhanji R. Prasanna	fc702168ab	Add streaming completion integration test with mock LLM provider Adds tests to verify that: - All streaming chunks are processed before control returns to caller - Both tool calls in a multi-tool-call stream are executed - The finished signal properly terminates stream processing Also adds Agent::new_for_test() to allow injecting mock providers.	2026-01-16 20:52:32 +05:30
Dhanji R. Prasanna	0e33465342	Add print_g3_progress/print_g3_status methods for consistent status messages	2026-01-16 20:28:24 +05:30
Dhanji R. Prasanna	6bd9c51e8e	feat: shell output pagination and optimized read_file with seek - Shell outputs > 8KB are truncated to first 500 chars - Full output saved to .g3/sessions/<session_id>/tools/shell_stdout_<id>.txt - LLM can use read_file with start/end to paginate through large outputs - read_file now uses seek() for O(1) random access instead of reading entire file - UTF-8 safe: reads extra bytes at boundaries to find valid char positions - Falls back to lossy conversion for binary files (no panics) Files changed: - paths.rs: get_tools_output_dir(), generate_short_id() - shell.rs: truncate_large_output() integration - file_ops.rs: seek-based read_file_range() helper - New test: read_file_utf8_test.rs	2026-01-16 09:16:16 +05:30
Dhanji R. Prasanna	0ae1a13cdb	feat: real-time tool call streaming indicator with blinking UI - Add ToolParsingHint enum (Detected/Active/Complete) for UI feedback - New UiWriter methods: print_tool_streaming_hint(), print_tool_streaming_active() - Refactor ConsoleUiWriter state to use atomics in ParsingHintState - Add tool_call_streaming field to CompletionChunk for provider hints - Anthropic provider sends streaming hints when tool name detected - New streaming helpers: make_tool_streaming_hint(), make_tool_streaming_active() Parser improvements: - Add is_json_invalidated() to detect false positive tool patterns - Fix tool result poisoning when file contents contain partial JSON - Unescaped newlines in strings or prose after JSON invalidates detection User sees ' ● tool_name \|' immediately when tool call starts streaming, with blinking indicator while args are received.	2026-01-15 13:49:29 +05:30
Dhanji R. Prasanna	d68f059acf	fix: detect invalidated JSON tool calls to prevent parser poisoning When partial JSON tool call patterns appear in LLM output (e.g., from quoting file content), the parser would incorrectly report them as "incomplete tool calls", triggering auto-continue loops. Fix: Added is_json_invalidated() to detect when partial JSON has been invalidated by subsequent content that cannot be valid JSON: - Unescaped newline inside a string (invalid JSON) - Newline followed by prose text outside a string The check is only applied to incomplete JSON - complete tool calls with trailing text are still correctly detected. Added 6 new tests covering: - Tool results with partial JSON patterns - LLM quoting file content inline vs on own line - Comment prefixes (// # -- etc) with partial patterns - Real incomplete tool calls (should still be detected)	2026-01-15 13:49:29 +05:30
Dhanji R. Prasanna	999ac6fe66	fix: prevent parser poisoning from inline tool-call JSON patterns The streaming parser was incorrectly detecting tool call patterns that appeared inline in prose (e.g., when explaining the format), causing g3 to return control mid-task. Fix: Modified find_first_tool_call_start() and find_last_tool_call_start() to only recognize patterns that appear on their own line (at start of buffer or after newline with only whitespace before the pattern). Changes: - Added is_on_own_line() helper to check line-boundary conditions - Updated detection methods to skip inline patterns - Removed sanitize_inline_tool_patterns() and LBRACE_HOMOGLYPH (no longer needed) - Rewrote tests for new behavior - Added streaming_repro tests that use process_chunk() to verify the exact bug scenario 28 tests covering: streaming repro, line boundaries, Unicode, code contexts, edge cases	2026-01-15 13:49:29 +05:30
Dhanji R. Prasanna	38828c7757	Clean up tool output formatting - Shell: "✅ Command executed successfully" → "⚡️ ran successfully" - Write file: Remove ✏️ emoji, use plain "wrote N lines \| M chars"	2026-01-14 19:42:54 +05:30
Dhanji R. Prasanna	dea0e6b1ca	Compact tool output improvements - Rename take_screenshot -> screenshot, code_coverage -> coverage (shorter names) - Align \| character across all compact tools (pad to 11 chars for str_replace) - Make code_search a compact tool with summary display - Show language and search name in code_search output (e.g., rust:"find structs") - Add format_code_search_summary() to extract match/file counts from JSON response	2026-01-14 08:12:50 +05:30
Dhanji R. Prasanna	3a47ebe668	better racket example support	2026-01-13 21:16:14 +05:30
Dhanji R. Prasanna	151b8c4658	Add Racket tree-sitter support, remove Kotlin - Add tree-sitter-racket dependency (v0.24) - Initialize Racket parser in code search - Add .rkt, .rktl, .rktd file extensions - Add test_racket_search test - Remove Kotlin from supported languages (was disabled) - Clean up duplicate test files Supported languages: Rust, Python, JavaScript, TypeScript, Go, Java, C, C++, Racket	2026-01-13 18:44:59 +05:30

1 2

100 Commits