alex/g3 - g3 - Millerson GIT hosting

alex/g3

Author	SHA1	Message	Date
Dhanji R. Prasanna	d3f0112f46	fix: store tool calls structurally for proper API roundtripping The agent would stop mid-task because native tool calls were stored as inline JSON text in Message.content. When sent back to the Anthropic API via convert_messages(), they went as plain text instead of structured tool_use/tool_result blocks. The model would occasionally get confused and emit text describing what it wanted to do instead of invoking the tool mechanism. Changes: - Add MessageToolCall struct and tool_calls/tool_result_id fields to Message - Add id field to core ToolCall struct to preserve provider tool call IDs - Update Anthropic convert_messages() to emit tool_use and tool_result blocks - Add ToolResult variant to AnthropicContent enum - Store tool calls structurally in tool message construction (not inline JSON) - Fix add_message() to preserve empty-content messages with tool_calls - Fix check_duplicate_in_previous_message() to check structured tool_calls - Generate valid IDs for JSON fallback tool calls (Anthropic pattern requirement) - Update planner create_tool_message() to use structured tool calls	2026-02-11 08:48:07 +11:00
Dhanji R. Prasanna	2a4cd1f4d6	fix: strip duplicate tool call JSON from assistant messages when LLM stutters When the LLM emits identical JSON tool calls as text content (JSON fallback mode), the raw duplicate JSON was being stored in the assistant message in conversation history. This confused the model on subsequent turns, causing it to stall or repeat itself. Root cause: raw_content_for_log used get_text_content() which returns the full parser buffer including all duplicate tool call JSONs. Fix: Added get_text_before_tool_calls() to StreamingToolParser that returns only the text before the first JSON tool call. Changed raw_content_for_log to use this method so the assistant message only contains the preamble text + the single executed tool call. Added 5 integration tests covering stuttered duplicates, triple stutter, cross-turn dedup, and different-args boundary case. Added MockResponse helpers for simulating LLM stutter patterns.	2026-02-10 19:53:11 +11:00
Dhanji R. Prasanna	328eecfcad	fix: extract_facts fallback for facts-prefixed selectors in datalog verification Root cause: ActionEnvelope.to_yaml_value() creates a Mapping from the facts HashMap without a 'facts:' wrapper key, but rulespec selectors may include a 'facts.' prefix (e.g. 'facts.feature.done' instead of 'feature.done'). This caused zero facts to be extracted, making all predicate evaluations fail. Fix: extract_facts() now tries the selector against the unwrapped envelope value first, and if empty, retries against a facts-wrapped version as fallback. Also: - Strengthened write_envelope tool description to require top-level facts: key, file paths for evidence, and allow free-form notes - Updated system prompt with matching rules - Added 6 new tests (4 unit, 2 integration) - Strengthened existing integration test to verify fact count > 0	2026-02-07 14:42:39 +11:00
Dhanji R. Prasanna	7032e75fc6	Add write_envelope tool with verify_envelope for explicit envelope creation - New crates/g3-core/src/tools/envelope.rs with execute_write_envelope() and verify_envelope() (moved from shadow_datalog_verify in plan.rs) - write_envelope accepts YAML facts, writes envelope.yaml to session dir, then runs datalog verification against analysis/rulespec.yaml in shadow mode - plan_verify() now only checks envelope existence (no longer runs datalog) - Tool count: 13 -> 14 - Updated system prompt to instruct agents to call write_envelope before marking last plan item done - Updated integration tests to use write_envelope tool directly Workflow: write_envelope -> verify_envelope -> datalog shadow artifacts plan_write(done) -> plan_verify -> checks envelope exists	2026-02-06 16:09:07 +11:00
Dhanji R. Prasanna	f7a240a99b	refactor: decouple rulespec from plan_write, read from analysis/rulespec.yaml - Remove rulespec parameter from plan_write tool definition and execution - Remove rulespec compilation from plan_approve (no longer pre-compiles) - Remove write_rulespec, get_rulespec_path, format_rulespec_yaml/markdown from invariants.rs; read_rulespec() now takes &Path working dir - Remove save/load_compiled_rulespec, get_compiled_rulespec_path from datalog.rs - Update shadow_datalog_verify() to compile on-the-fly from analysis/rulespec.yaml, writing rulespec.compiled.dl and datalog_evaluation.txt to session dir - Remove rulespec display from plan_read output - Remove Invariants/Rulespec section from native.md system prompt - Remove rulespec from prompts.rs plan_write format and examples - Update existing tests to remove rulespec from plan_write calls - Add 3 integration tests for on-the-fly rulespec verification	2026-02-06 15:31:23 +11:00
Dhanji R. Prasanna	3823f8b5f3	Optimize native system prompt - 48% size reduction Removed redundant and vague content from prompts/system/native.md: - Simplified intro from 17 lines to 3 lines - Reduced Code Search section to one line - Removed duplicate Plan Mode example (kept one) - Removed Action Envelope section (rarely used correctly) - Removed verbose Memory Format details (tool description covers it) - Removed Response Guidelines (obvious to modern LLMs) Size: 8,620 chars -> 4,498 chars Also updated: - G3_IDENTITY_LINE constant for agent mode compatibility - Test assertions to check for new prompt markers - System prompt validation to use new marker string	2026-02-05 22:16:34 +11:00
Dhanji R. Prasanna	7e2d9bc22c	Enforce rulespec creation with plan_write for new plans Solves the tautology problem where the LLM would write invariants after implementation, making them match what was done rather than constrain it. Changes: - plan_write now accepts 'rulespec' parameter - New plans REQUIRE rulespec (fails with helpful error if missing) - Plan updates don't require rulespec (backward compatible) - Rulespec is parsed, validated, and written atomically with plan - Updated system prompt with clear examples for new vs update - Updated tool definition schema - Updated all affected tests New flow: task → plan+rulespec → user reviews BOTH → approve → implement	2026-02-05 21:12:02 +11:00
Dhanji R. Prasanna	0f919237ea	Make plan approval gate only active in plan mode - Add in_plan_mode flag to Agent struct - Add set_plan_mode() and is_plan_mode() methods - Gate check now only runs when in_plan_mode is true - CLI calls set_plan_mode(true) on /plan command and EnterPlanMode - CLI calls set_plan_mode(false) on approval and CTRL-D exit - Update integration test to enable plan mode - Fix test YAML to use Vec<Check> for negative/boundary checks	2026-02-05 11:41:52 +11:00
Dhanji R. Prasanna	c347a73cbd	Add plan approval gate to block file changes without approved plan - Add check_plan_approval_gate() in tools/plan.rs that runs after each tool call - Detects file changes via git status --porcelain when plan exists but not approved - Reverts changes: git checkout for modified files, rm for new untracked files - Returns blocking message instructing LLM to create/approve plan first - Add ApprovalGateResult enum with Allowed/Blocked/NotGitRepo variants - Add set_session_id() and set_working_dir() methods on Agent for testing - Add integration test using MockProvider to simulate blocked write_file	2026-02-05 11:34:10 +11:00
Dhanji R. Prasanna	263a838d31	Remove redundant 'No plan exists' message from plan_read output The UI already shows 'empty' via print_plan_compact, so returning an empty string avoids duplicate output.	2026-02-02 17:19:01 +11:00
Dhanji R. Prasanna	a63950d8f5	Add Plan Mode to replace TODO system Plan Mode is a cognitive forcing system that requires reasoning about: - Happy path - Negative case - Boundary condition New tools: - plan_read: Read current plan for session - plan_write: Create/update plan with YAML content (validates structure) - plan_approve: Mark current revision as approved New command: - /feature <description>: Start Plan Mode for a new feature Plan schema requires: - plan_id, revision, approved_revision - items with id, description, state, touches, checks (happy/negative/boundary) - evidence and notes required when marking items done Verification: - plan_verify() called automatically when all items are done/blocked Removed: - todo_read, todo_write tools - todo.rs module and related tests	2026-02-02 14:38:25 +11:00
Dhanji R. Prasanna	58bbfde6f4	test: add integration tests for streaming parser stuttering bug fix Add characterization tests for the streaming parser stuttering bug fix (`fa3c920`). These tests verify that when an LLM "stutters" and emits incomplete tool call fragments followed by complete tool calls, the parser: 1. Does not get stuck waiting for the incomplete fragment to complete 2. Successfully parses complete tool calls that appear after the fragment Tests cover: - The exact pattern from butler session butler_c6ab59af2e4f991c - Edge cases that should NOT trigger invalidation (nested JSON, patterns in strings) - Recovery behavior after reset - Multiple complete tool calls - Boundary conditions (chunk boundaries, minimal patterns) Agent: hopper	2026-01-30 14:30:27 +11:00
Dhanji R. Prasanna	570a824780	Rename archivist agent to huffman Named after David Huffman, inventor of Huffman coding - compression that preserves information with fewer bits. Fits the agent's purpose: compact memory, preserve semantics.	2026-01-29 11:22:59 +11:00
Dhanji R. Prasanna	56f558dc1b	Fix compiler warnings in test files Eliminate unused variable and import warnings across test files: - streaming_parser_test.rs: prefix unused `tools` with underscore - webdriver_session.rs: remove unused `use super::*` import - mock_provider_integration_test.rs: prefix unused `result` and `task_result` - test_preflight_max_tokens.rs: prefix unused `proposed_max` - todo_staleness_test.rs: add #[allow(dead_code)] for test helper methods - json_parsing_stress_test.rs: prefix unused `tools` - read_file_token_limit_test.rs: add #[allow(dead_code)] for unused helper - background_process_demo_test.rs: remove unused PathBuf import - test_session_continuation.rs: prefix unused `temp_dir` in 7 tests All tests pass. No behavior changes. Agent: fowler	2026-01-29 11:15:10 +11:00
Dhanji R. Prasanna	7bfb9efa19	Remove automatic README loading from context window README.md is no longer auto-loaded into the LLM context at startup. This saves ~4,600 tokens per session while AGENTS.md and memory.md still provide all critical information for code tasks. Changes: - Delete read_project_readme() function - Remove readme_content parameter from combine_project_content() - Rename extract_readme_heading() -> extract_project_heading() - Rename Agent constructors: _with_readme_ -> _with_project_context_ - Update context preservation to only check for Agent Configuration - Remove has_readme field from LoadedContent - Update all tests to use new markers and function names The LLM can still read README.md on-demand via read_file when needed.	2026-01-29 11:07:41 +11:00
Dhanji R. Prasanna	a902be1562	Refactor system prompts to eliminate duplication; upgrade embedded provider - Refactor prompts.rs: extract shared sections (intro, TODO, workspace memory, web research, response guidelines) used by both native and non-native prompts - Fix typo in native prompt: "save them.." -> "save them." - Fix non-native prompt: add missing closing braces in JSON examples, add IMPORTANT steps section, align with native prompt quality - Add 9 unit tests to verify both prompts contain required sections - Upgrade llama-cpp-2 dependency and refactor embedded provider - Update config.example.toml with embedded model examples - Update workspace memory	2026-01-28 09:56:39 +11:00
Dhanji R. Prasanna	5b4079e861	Add prompt cache statistics tracking to /stats command - Extend Usage struct with cache_creation_tokens and cache_read_tokens fields - Parse Anthropic cache_creation_input_tokens and cache_read_input_tokens - Parse OpenAI prompt_tokens_details.cached_tokens for automatic prefix caching - Add CacheStats struct to Agent for cumulative tracking across API calls - Add "Prompt Cache Statistics" section to /stats output showing: - API call count and cache hit count - Hit rate percentage - Total input tokens and cache read/creation tokens - Cache efficiency (% of input served from cache) - Update all provider implementations and test files	2026-01-27 11:32:45 +11:00
Dhanji R. Prasanna	2e84f1ece0	test: fix ACD test race condition and add read_image characterization test - Fix test_rehydrate_success race condition by using UUID for unique session IDs - Add #[serial] attribute to prevent parallel execution conflicts - Improve cleanup to remove entire session directory tree - Add characterization test for resize_image_to_dimensions fallback behavior (documents fix from commit `af8b849` for media type preservation) Agent: hopper	2026-01-26 16:19:53 +11:00
Dhanji R. Prasanna	726e2d71f5	test: add integration test for project content surviving compaction Add test_project_content_survives_compaction() to verify that project content loaded via /project command persists through context compaction. This is a CHARACTERIZATION test that validates: - Project content appended to README message survives compaction - The README message (containing project content) is preserved as message[1] - PROJECT INSTRUCTIONS, ACTIVE PROJECT markers, Brief and Status sections all survive the compaction process Agent: hopper	2026-01-26 16:09:17 +11:00
Dhanji R. Prasanna	9de8e8cc76	Fix compaction bug: use User role for summary to maintain alternation The previous implementation added the summary as a System message, which caused "Conversation must start with a user message" errors because the first non-system message after compaction was Assistant (the preserved last assistant message). Fix: Change summary from System to User message, creating valid alternation: [System Prompt] -> [Summary as USER] -> [Last Assistant] -> [Latest User] This also prevents system message bloat across multiple compactions since the summary is now part of the conversation flow and gets replaced on each compaction. Added test_second_compaction_no_bloat to verify no accumulation.	2026-01-26 15:24:04 +11:00
Dhanji R. Prasanna	5d0d532b47	feat: preserve last assistant message during compaction When context window compaction occurs, the last assistant message is now preserved in addition to the system prompt, README, and summary. This improves continuity after compaction by keeping the LLM's most recent response, which often contains important context about what was just done or what comes next. New message order after compaction: [System Prompt] -> [README/AGENTS.md] -> [ACD Stub?] -> [Summary] -> [Last Assistant] -> [Latest User?] Changes: - Add last_assistant_message field to PreservedMessages struct - Modify extract_preserved_messages() to find last assistant message - Modify reset_with_summary_and_stub() to include last assistant message - Add comprehensive integration tests using MockProvider Tests cover edge cases: - No assistant message exists - Tool-call-only assistant messages (still preserved) - Multiple assistant messages (only last one preserved) - No trailing user message	2026-01-23 09:54:03 +05:30
Dhanji R. Prasanna	feb7c3e40d	Add /project and /unproject commands for project-specific context - Add Project struct in crates/g3-cli/src/project.rs with file loading logic - Load brief.md, contacts.yaml, status.md from project path - Load projects.md from workspace root for cross-project context - Project content appended to system message (survives compaction/dehydration) - /project <path> loads project and auto-submits prompt asking about state - /unproject clears project content and resets context - Add set_project_content(), clear_project_content(), has_project_content() to Agent - Add new_for_test_with_readme() for testing with custom README content - Add 6 unit tests for Project struct - Add 9 integration tests for project context behavior	2026-01-21 14:53:30 +05:30
Dhanji R. Prasanna	6a5ce11e7b	Consolidate redundant assistant message test files Deleted 4 redundant test files (~956 lines): - assistant_message_dedup_test.rs (416 lines, 12 tests) - consecutive_assistant_message_test.rs (248 lines, 6 tests) - missing_assistant_message_test.rs (100 lines, 4 tests) - early_return_path_test.rs (192 lines, 5 tests) - whitebox test Created consolidated assistant_message_test.rs (369 lines, 14 tests): - Helper function tests for consecutive message detection - ContextWindow unit tests for normal and tool execution flows - Bug demonstration tests documenting what bugs looked like - Invariant tests for user/assistant alternation - Missing assistant message fallback logic tests The early_return_path_test was removed because it: - Referenced specific line numbers in production code (brittle) - Reimplemented internal logic (whitebox anti-pattern) - Duplicated coverage from mock_provider_integration_test.rs All 729 g3-core tests pass.	2026-01-21 10:27:07 +05:30
Dhanji R. Prasanna	9a0a2a2726	Make dehydration stub more compact Change from multi-line verbose format to single-line compact format: Before: ⚡ DEHYDRATED CONTEXT (fragment_id: 188c7ac71613) • 8 messages (4 user, 4 assistant) • 3 tool calls (shell ×3) • ~299 tokens saved To restore this history, call: rehydrate(fragment_id: "188c7ac71613") After: ⚡ DEHYDRATED CONTEXT: 3 tool calls (shell x3), 8 total msgs. To restore, call: rehydrate(fragment_id: "188c7ac71613") - Combine all info into single line - Remove tokens saved (not essential for rehydration decision) - Use ASCII 'x' instead of '×' for simplicity - Add 'no tool calls' case for fragments without tools - Update related tests	2026-01-20 21:26:42 +05:30
Dhanji R. Prasanna	182f5f98fe	Centralize g3 status message formatting Extract a new g3_status module in g3-cli that provides consistent formatting for all 'g3:' prefixed system status messages. Key changes: - Add G3Status struct with methods for progress, done, failed, error, etc. - Add Status enum with Done, Failed, Error, Resolved, Insufficient, NoChanges - Add ThinResult struct in g3-core for semantic thinning data - Update UiWriter trait with print_thin_result() method - Refactor context thinning to return ThinResult instead of formatted strings - Update all callers to use the new centralized formatting - Session resume/decline messages now use G3Status - Compaction status messages now use G3Status This maintains clean separation of concerns: g3-core emits semantic data, g3-cli handles all terminal formatting and colors.	2026-01-20 09:50:55 +05:30
Dhanji R. Prasanna	f4cce22db3	Add test documenting LLM duplicate text behavior Adds test_llm_repeats_text_before_each_tool_call() which documents the scenario where the LLM re-outputs the same preamble text before each tool call in a multi-tool response. Analysis showed this is LLM behavior, not a g3 bug: - Each assistant message is correctly stored with different tool calls - The duplicate display is the LLM choosing to repeat context - Storage is correct, display accurately reflects LLM output Decision: Accept as LLM behavior (Option B). Future LLM improvements may resolve this naturally without g3 code changes.	2026-01-19 18:44:01 +05:30
Dhanji R. Prasanna	1604ed613a	Add integration tests proving tool results are never parsed as tool calls Adds 3 new tests to json_parsing_stress_test.rs: - test_tool_result_with_json_not_parsed: Full agent integration test proving that JSON in tool results (sent TO the LLM) is never parsed by the streaming parser (which only sees LLM output) - test_parser_only_processes_completion_chunks: Documents that StreamingToolParser only accepts CompletionChunk, not Message objects - test_architectural_separation_documented: Documents the data flow showing tool results flow TO the LLM while the parser only sees FROM the LLM This proves the architectural guarantee: there is no code path where tool result content could be parsed as a tool call, because: 1. Tool results are Message objects added to context_window 2. The streaming parser only processes CompletionChunk from provider.stream_completion() 3. These are completely separate data types flowing in opposite directions Total: 41 JSON parsing stress tests now pass.	2026-01-19 16:21:36 +05:30
Dhanji R. Prasanna	2043a83e7d	Add comprehensive MockProvider integration tests Added 6 new integration tests for stream_completion_with_tools: - test_text_before_tool_call_preserved: text before native tool call is saved - test_native_tool_call_execution: native tool calls execute correctly - test_duplicate_tool_calls_skipped: sequential duplicates are detected - test_json_fallback_tool_calling: JSON tool calls work without native support - test_text_after_tool_execution_preserved: follow-up text is saved - test_multiple_tool_calls_executed: multiple tool calls in sequence work Also added MockResponse helper methods: - text_then_native_tool(): text followed by native tool call - duplicate_native_tool_calls(): same tool call twice (for dedup testing) Fixed text_with_json_tool() to ensure "tool" key comes before "args" (serde_json alphabetizes keys, breaking pattern detection). Total: 18 integration tests covering historical bugs and core behaviors.	2026-01-19 14:44:30 +05:30
Dhanji R. Prasanna	5caa101b84	Fix inline JSON being incorrectly detected as tool call The bug was caused by mark_tool_calls_consumed() being called after displaying each chunk, which advanced last_consumed_position to the end of the current buffer. When the next chunk arrived with JSON, the unchecked_buffer started at position 0 of the slice, causing is_on_own_line() to return true (position 0 is always "on its own line"). Removed the problematic mark_tool_calls_consumed() call from the "no tool executed" branch. The remaining call after actual tool execution is correct and necessary. Added integration test that verifies inline JSON in prose is not detected as a tool call.	2026-01-19 14:35:01 +05:30
Dhanji R. Prasanna	292a3aa48d	Add MockProvider for integration testing Adds a configurable mock LLM provider that can simulate various behaviors: - Text-only responses (single or multi-chunk streaming) - Native tool calls - JSON tool calls in text - Truncated responses (max_tokens) - Multi-turn conversations Features: - Builder pattern for easy test setup - Request tracking for verification - Preset scenarios for common patterns - Full LLMProvider trait implementation Also adds integration tests that use MockProvider to test the stream_completion_with_tools code path, including: - test_butler_bug_scenario: reproduces the exact bug where text-only responses were not saved to context, causing consecutive user messages This enables testing complex streaming behaviors without real API calls.	2026-01-19 13:59:31 +05:30
Dhanji R. Prasanna	349230d0b7	Fix missing assistant messages in context window Bug: When the LLM responded with text-only (no tool calls), the assistant message was sometimes not saved to the context window. This caused consecutive user messages where the LLM would lose track of previous responses. Root causes found and fixed: 1. Early return path (line ~2535): When stream finishes with no tools executed in previous iterations (any_tool_executed=false), the code returned early without saving the assistant message. Fixed by adding save before return. 2. Post-loop path (line ~2657): When raw_clean was empty but current_response had content, no message was saved. Fixed by falling back to current_response. Both paths now properly save the assistant message before returning. The assistant_message_added flag prevents any duplication. Added tests: - missing_assistant_message_test.rs: verifies the fallback logic - assistant_message_dedup_test.rs: verifies no duplicate messages - consecutive_assistant_message_test.rs: verifies alternation invariant	2026-01-19 13:50:28 +05:30
Dhanji R. Prasanna	74b1b9bea3	refactor: simplify context thinning status message Change format from verbose emoji-based message to cleaner status line: Before: ✨ 🥒 Context thinned at 70%: 7 tool results, ~33839 chars saved ✨ After: g3: thinning context ... 70% -> 40% ... [done] The new format shows before/after percentages and uses bold green for 'g3:' and '[done]' to match other status messages. Also removes unused emoji() and label() methods from ThinScope.	2026-01-17 04:47:16 +05:30
Dhanji R. Prasanna	1003386f7f	Auto-resize large images (>=5MB) in read_image tool Images >= 5MB are now automatically resized to < 4.9MB using ImageMagick before being sent to the LLM. This prevents API errors from oversized images. - Uses iterative quality/scale reduction to find optimal size - Converts to JPEG for better compression - Shows original and resized size in terminal output (e.g., '6.2 MB → 4.1 MB (resized)') - Falls back to original if ImageMagick fails or isn't available	2026-01-16 21:09:38 +05:30
Dhanji R. Prasanna	fc702168ab	Add streaming completion integration test with mock LLM provider Adds tests to verify that: - All streaming chunks are processed before control returns to caller - Both tool calls in a multi-tool-call stream are executed - The finished signal properly terminates stream processing Also adds Agent::new_for_test() to allow injecting mock providers.	2026-01-16 20:52:32 +05:30
Dhanji R. Prasanna	0e33465342	Add print_g3_progress/print_g3_status methods for consistent status messages	2026-01-16 20:28:24 +05:30
Dhanji R. Prasanna	6bd9c51e8e	feat: shell output pagination and optimized read_file with seek - Shell outputs > 8KB are truncated to first 500 chars - Full output saved to .g3/sessions/<session_id>/tools/shell_stdout_<id>.txt - LLM can use read_file with start/end to paginate through large outputs - read_file now uses seek() for O(1) random access instead of reading entire file - UTF-8 safe: reads extra bytes at boundaries to find valid char positions - Falls back to lossy conversion for binary files (no panics) Files changed: - paths.rs: get_tools_output_dir(), generate_short_id() - shell.rs: truncate_large_output() integration - file_ops.rs: seek-based read_file_range() helper - New test: read_file_utf8_test.rs	2026-01-16 09:16:16 +05:30
Dhanji R. Prasanna	0ae1a13cdb	feat: real-time tool call streaming indicator with blinking UI - Add ToolParsingHint enum (Detected/Active/Complete) for UI feedback - New UiWriter methods: print_tool_streaming_hint(), print_tool_streaming_active() - Refactor ConsoleUiWriter state to use atomics in ParsingHintState - Add tool_call_streaming field to CompletionChunk for provider hints - Anthropic provider sends streaming hints when tool name detected - New streaming helpers: make_tool_streaming_hint(), make_tool_streaming_active() Parser improvements: - Add is_json_invalidated() to detect false positive tool patterns - Fix tool result poisoning when file contents contain partial JSON - Unescaped newlines in strings or prose after JSON invalidates detection User sees ' ● tool_name \|' immediately when tool call starts streaming, with blinking indicator while args are received.	2026-01-15 13:49:29 +05:30
Dhanji R. Prasanna	d68f059acf	fix: detect invalidated JSON tool calls to prevent parser poisoning When partial JSON tool call patterns appear in LLM output (e.g., from quoting file content), the parser would incorrectly report them as "incomplete tool calls", triggering auto-continue loops. Fix: Added is_json_invalidated() to detect when partial JSON has been invalidated by subsequent content that cannot be valid JSON: - Unescaped newline inside a string (invalid JSON) - Newline followed by prose text outside a string The check is only applied to incomplete JSON - complete tool calls with trailing text are still correctly detected. Added 6 new tests covering: - Tool results with partial JSON patterns - LLM quoting file content inline vs on own line - Comment prefixes (// # -- etc) with partial patterns - Real incomplete tool calls (should still be detected)	2026-01-15 13:49:29 +05:30
Dhanji R. Prasanna	999ac6fe66	fix: prevent parser poisoning from inline tool-call JSON patterns The streaming parser was incorrectly detecting tool call patterns that appeared inline in prose (e.g., when explaining the format), causing g3 to return control mid-task. Fix: Modified find_first_tool_call_start() and find_last_tool_call_start() to only recognize patterns that appear on their own line (at start of buffer or after newline with only whitespace before the pattern). Changes: - Added is_on_own_line() helper to check line-boundary conditions - Updated detection methods to skip inline patterns - Removed sanitize_inline_tool_patterns() and LBRACE_HOMOGLYPH (no longer needed) - Rewrote tests for new behavior - Added streaming_repro tests that use process_chunk() to verify the exact bug scenario 28 tests covering: streaming repro, line boundaries, Unicode, code contexts, edge cases	2026-01-15 13:49:29 +05:30
Dhanji R. Prasanna	38828c7757	Clean up tool output formatting - Shell: "✅ Command executed successfully" → "⚡️ ran successfully" - Write file: Remove ✏️ emoji, use plain "wrote N lines \| M chars"	2026-01-14 19:42:54 +05:30
Dhanji R. Prasanna	dea0e6b1ca	Compact tool output improvements - Rename take_screenshot -> screenshot, code_coverage -> coverage (shorter names) - Align \| character across all compact tools (pad to 11 chars for str_replace) - Make code_search a compact tool with summary display - Show language and search name in code_search output (e.g., rust:"find structs") - Add format_code_search_summary() to extract match/file counts from JSON response	2026-01-14 08:12:50 +05:30
Dhanji R. Prasanna	3a47ebe668	better racket example support	2026-01-13 21:16:14 +05:30
Dhanji R. Prasanna	151b8c4658	Add Racket tree-sitter support, remove Kotlin - Add tree-sitter-racket dependency (v0.24) - Initialize Racket parser in code search - Add .rkt, .rktl, .rktd file extensions - Add test_racket_search test - Remove Kotlin from supported languages (was disabled) - Clean up duplicate test files Supported languages: Rust, Python, JavaScript, TypeScript, Go, Java, C, C++, Racket	2026-01-13 18:44:59 +05:30
Dhanji R. Prasanna	b89d55a9ff	Add characterization tests for stream_completion_with_tools Add 32 blackbox characterization tests to lock down the behavior of the stream_completion_with_tools function (1067 lines) before refactoring. Tests cover key behaviors through stable boundaries: - StreamingToolParser: tool call detection, incomplete detection, text accumulation - Auto-continue logic: autonomous mode decisions, priority ordering - Duplicate detection: sequential duplicates, cross-message duplicates - Context window: token tracking, compaction threshold, history preservation - Tool execution: read_file, shell, write_file, todo tools through Agent - Streaming utilities: LLM token cleaning, duration formatting, truncation - Parser sanitization: inline tool pattern handling, homoglyph replacement These tests intentionally do NOT assert: - Internal parser state or implementation details - Specific timing values - UI output formatting - Provider-specific behavior Agent: hopper	2026-01-13 16:25:33 +05:30
Dhanji R. Prasanna	dc45987e8d	Add characterization tests for UTF-8 truncation and parser sanitization Agent: hopper Adds 32 new integration tests covering recent commits: ## UTF-8 Safe Truncation Tests (14 tests) Covers commit `f30f145` (Fix UTF-8 panics): - Topic extraction with emoji, CJK, and multi-byte characters - Truncation at character boundaries (not byte boundaries) - Edge cases: exactly 50 chars, 51 chars, 2-byte/3-byte/4-byte UTF-8 - Stub generation with multi-byte topics - Combining characters and diacritics ## Parser Sanitization Tests (18 tests) Covers commit `4c36cc0` (Prevent parser poisoning): - Code block contexts (inline code, after fences, prose) - Line boundary edge cases (empty lines, whitespace, indentation) - Unicode handling (emoji, bullets, CJK before patterns) - Multiple patterns on same line - Negative cases (similar but different patterns, partial patterns) - Real-world scenarios from the original bug report All tests are blackbox/characterization style - they test observable outputs through stable public interfaces without encoding internal implementation details.	2026-01-13 11:22:46 +05:30
Dhanji R. Prasanna	6f50d01ab6	Add comprehensive end-of-turn behavior tests for g3-core Agent: hopper Adds 56 new integration tests covering the observable end-of-turn behaviors in the streaming module: - Timing footer formatting (5 tests): verifies user-facing timing display with various durations, token counts, and context percentages - Tool call duplicate detection (6 tests): ensures identical sequential tool calls are detected while different tools/args are not - Empty response detection (9 tests): validates detection of empty, whitespace-only, and timing-only responses that trigger auto-continue - Connection error classification (5 tests): verifies EOF, connection, chunk, and body errors are correctly identified for graceful recovery - Tool output summary formatting (17 tests): covers read_file, write_file, str_replace, remember, screenshot, coverage, and rehydrate summaries - Duration formatting (4 tests): milliseconds, seconds, minutes, zero - Text truncation (4 tests): short/long strings, multiline, flag behavior - LLM token cleaning (3 tests): removal of stop tokens like <\|im_end\|> - Edge cases (4 tests): empty inputs, unicode handling, large numbers All tests are blackbox/characterization style - they test observable outputs through stable public interfaces without encoding internal implementation details. Tests remain stable under refactoring that preserves behavior.	2026-01-12 21:17:32 +05:30
Dhanji R. Prasanna	c2aa80647a	Remove legacy logs/ directory, consolidate all data under .g3/ This change removes the legacy logs/ directory and consolidates all session data, error logs, and discovery files under the .g3/ directory. New directory structure: - .g3/sessions/<session_id>/session.json - session logs - .g3/errors/ - error logs (was logs/errors/) - .g3/background_processes/ - background process logs - .g3/discovery/ - planner discovery files (was workspace/logs/) Changes: - paths.rs: Remove get_logs_dir()/logs_dir(), add get_errors_dir(), get_background_processes_dir(), get_discovery_dir() - session.rs: Anonymous sessions now use .g3/sessions/anonymous_<ts>/ - error_handling.rs: Errors now saved to .g3/errors/ - project.rs: Remove logs_dir() and ensure_logs_dir() methods - feedback_extraction.rs: Remove logs_dir field and fallback logic - planner: Use .g3/ for workspace data and .g3/discovery/ for reports - flock.rs: Look for session metrics in .g3/sessions/ - coach_feedback.rs: Remove fallback to logs/ path - Update all tests to use new paths - Update README.md and .gitignore	2026-01-12 18:20:08 +05:30
Dhanji R. Prasanna	5dfabaf19a	Add 72 integration tests for compaction, retry, tool execution, and error classification Agent: hopper Added 4 new test files with blackbox/characterization-style integration tests: - compaction_behavior_test.rs (14 tests): Token cap calculation, thinking mode disable logic, summary message building, CompactionResult behavior - retry_behavior_test.rs (17 tests): RetryConfig presets and customization, RetryResult state handling, retry_operation behavior with simulated errors - tool_execution_roundtrip_test.rs (16 tests): End-to-end tool execution through Agent interface for read_file, write_file, shell, str_replace, and TODO tools - error_classification_test.rs (25 tests): Recoverable vs non-recoverable error classification, retry delay calculation, edge cases and priority handling All tests follow integration-first philosophy: - Test through stable public interfaces - Assert observable behavior, not implementation details - Use characterization style to document current behavior - Enable refactoring by not encoding internal structure	2026-01-12 11:40:19 +05:30
Dhanji R. Prasanna	f415dbb84b	Fix ACD turn summary loss and add /dump command ACD (Aggressive Context Dehydration) fixes: - Fixed dehydrate_context() to extract turn summary from context window instead of using the passed-in final_response (which contained only the timing footer, not the actual LLM response) - Removed final_response parameter from dehydrate_context() since it now self-extracts the last assistant message as the summary - This ensures the actual turn summary is preserved after dehydration, not just the timing footer New /dump command: - Added /dump command to dump entire context window to tmp/ for debugging - Shows message index, role, kind, content length, and full content - Available in both console and machine modes UTF-8 safety: - Fixed truncate_to_word_boundary() to use character indices instead of byte indices, preventing panics on multi-byte UTF-8 characters - Added UTF-8 string slicing guidance to AGENTS.md Agent: g3	2026-01-12 05:13:02 +05:30
Dhanji R. Prasanna	83c9b5d434	Add integration blackbox tests for g3-core Adds 18 new integration tests covering: - Background process lifecycle (start, check running, kill, list) - Unified diff edge cases (multi-hunk, additions-only, deletions-only, CRLF normalization, range constraints, error handling) - Error classification boundaries (rate limit, server error, timeout, network error, context length exceeded, model busy, non-recoverable) These tests follow blackbox/integration-first principles: - Test through stable public interfaces - Do not encode internal implementation details - Focus on observable behavior - Enable refactoring without test breakage Agent: hopper	2026-01-11 16:32:59 +05:30

1 2

93 Commits