Calibrate used_tokens from API prompt_tokens (ground truth) to fix
progress bar drift in interactive mode. Three issues fixed:
1. update_usage_from_response() only updated cumulative_tokens, never
calibrated used_tokens. Now snaps used_tokens to prompt_tokens when
available (falls back to heuristic when prompt_tokens is 0).
2. Moved the calibration call inline during streaming (when the usage
chunk arrives) instead of after the loop. Text-only responses, the most
common case in interactive mode, took an early return path that
bypassed the post-loop usage update entirely.
3. Removed the mock Usage with hardcoded prompt_tokens=100 from
execute_single_task(), which corrupted calibration.
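A minimal sketch of the calibration in (1), with hypothetical field and
type names (the real g3 tracker and Usage structs may differ):

```rust
/// Illustrative tracker shape; field names are assumptions, not g3's API.
struct TokenTracker {
    used_tokens: u64,       // drives the progress bar
    cumulative_tokens: u64, // lifetime total across API calls
}

struct Usage {
    prompt_tokens: u64,
    completion_tokens: u64,
}

impl TokenTracker {
    fn update_usage_from_response(&mut self, usage: &Usage, heuristic: u64) {
        self.cumulative_tokens += usage.prompt_tokens + usage.completion_tokens;
        // Snap to the API's ground truth when reported; some responses
        // carry prompt_tokens == 0, in which case keep the local heuristic.
        self.used_tokens = if usage.prompt_tokens > 0 {
            usage.prompt_tokens
        } else {
            heuristic
        };
    }
}
```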
- Remove all file revert/delete logic from check_plan_approval_gate:
no more git checkout or fs::remove_file calls. The gate only warns.
- Remove reverted_files field from ApprovalGateResult::Blocked.
- Add get_dirty_files() helper to snapshot dirty files as a HashSet.
- Capture baseline dirty files when plan mode starts (set_plan_mode).
Pre-existing dirty files are excluded from gate checks so they
never trigger blocking.
- Add 5 new unit tests covering non-destructive behavior, baseline
exclusion, and mixed baseline/new file scenarios.
- Update integration test to match new non-destructive semantics.
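A sketch of the baseline exclusion, assuming get_dirty_files() shells
out to `git status --porcelain` (the actual implementation may differ):

```rust
use std::collections::HashSet;
use std::process::Command;

/// Snapshot the currently dirty files as a set of paths.
fn get_dirty_files() -> HashSet<String> {
    let out = Command::new("git")
        .args(["status", "--porcelain"])
        .output()
        .expect("failed to run git status");
    String::from_utf8_lossy(&out.stdout)
        .lines()
        .filter(|l| l.len() > 3)
        .map(|l| l[3..].to_string()) // skip the two status chars + space
        .collect()
}

/// Only files dirtied *after* plan mode started can trip the gate;
/// anything in the baseline snapshot is ignored.
fn gate_candidates(baseline: &HashSet<String>) -> Vec<String> {
    get_dirty_files().difference(baseline).cloned().collect()
}
```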
estimate_tokens() only counted message.content chars, completely
ignoring message.tool_calls[].input JSON. When sent to the API,
tool_use blocks include full input, so the token tracker massively
undercounted — in one session, 303k chars (101k tokens) of tool
input were invisible, showing 39% usage when actual was >100%.
Compaction never triggered, causing an API 400 error.
Added estimate_message_tokens() that accounts for both content and
tool_call input. Updated add_message_with_tokens(), recalculate_tokens(),
and clear_conversation() to use it.
7 unit tests + 1 integration test reproducing the exact session trace.
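A sketch of the widened estimate, assuming the ~3 chars/token heuristic
implied by the numbers above (303k chars ≈ 101k tokens); the real g3
message types differ in detail:

```rust
/// Illustrative message shape, not g3's actual structs.
struct ToolCall {
    name: String,
    input: serde_json::Value,
}

struct Message {
    content: String,
    tool_calls: Vec<ToolCall>,
}

/// Count both text content and tool_call input, since tool_use blocks
/// are sent to the API with their full input JSON.
fn estimate_message_tokens(msg: &Message) -> usize {
    let mut chars = msg.content.chars().count();
    for call in &msg.tool_calls {
        chars += call.name.chars().count();
        // The serialized input is what actually lands in the request body.
        chars += call.input.to_string().chars().count();
    }
    chars / 3
}
```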
After context compaction, the preserved last assistant message retained
its structured tool_calls field, but the corresponding tool_result was
summarized away. This created orphaned tool_use blocks that violated
the Anthropic API constraint: 'Each tool_use block must have a
corresponding tool_result block in the next message', causing 400 errors.
Primary fix: clear tool_calls from the preserved assistant message in
extract_preserved_messages(). The tool call was already executed and
its result is captured in the summary.
Defense-in-depth: added strip_orphaned_tool_use() post-processing in
Anthropic convert_messages() to detect and strip any orphaned tool_use
blocks before they reach the API.
Added 7 tests: 3 unit tests for compaction stripping, 3 unit tests for
Anthropic orphan detection, 1 integration test reproducing the exact
bug scenario from the h3 session.
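A sketch of the defense-in-depth pass, using simplified content types
(the real convert_messages() operates on full Anthropic request structs):

```rust
use std::collections::HashSet;

enum Content {
    Text(String),
    ToolUse { id: String },
    ToolResult { tool_use_id: String },
}

/// Drop any tool_use block whose id has no matching tool_result in the
/// immediately following message; such orphans trigger API 400 errors.
fn strip_orphaned_tool_use(messages: &mut Vec<Vec<Content>>) {
    for i in 0..messages.len() {
        let next_results: HashSet<String> = messages
            .get(i + 1)
            .map(|next| {
                next.iter()
                    .filter_map(|c| match c {
                        Content::ToolResult { tool_use_id } => Some(tool_use_id.clone()),
                        _ => None,
                    })
                    .collect()
            })
            .unwrap_or_default();
        messages[i].retain(|c| match c {
            Content::ToolUse { id } => next_results.contains(id),
            _ => true,
        });
    }
}
```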
The agent would stop mid-task because native tool calls were stored as
inline JSON text in Message.content. When sent back to the Anthropic API
via convert_messages(), they were sent as plain text instead of structured
tool_use/tool_result blocks. The model would occasionally get confused
and emit text describing what it wanted to do instead of invoking the
tool mechanism.
Changes:
- Add MessageToolCall struct and tool_calls/tool_result_id fields to Message
- Add id field to core ToolCall struct to preserve provider tool call IDs
- Update Anthropic convert_messages() to emit tool_use and tool_result blocks
- Add ToolResult variant to AnthropicContent enum
- Store tool calls structurally in tool message construction (not inline JSON)
- Fix add_message() to preserve empty-content messages with tool_calls
- Fix check_duplicate_in_previous_message() to check structured tool_calls
- Generate valid IDs for JSON fallback tool calls (Anthropic pattern requirement)
- Update planner create_tool_message() to use structured tool calls
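A sketch of the structured emission, with simplified serde types (the
actual AnthropicContent enum carries more fields):

```rust
use serde::Serialize;

#[derive(Serialize)]
#[serde(tag = "type", rename_all = "snake_case")]
enum AnthropicContent {
    Text { text: String },
    ToolUse { id: String, name: String, input: serde_json::Value },
    ToolResult { tool_use_id: String, content: String },
}

/// Assistant tool calls become structured tool_use blocks instead of
/// inline JSON text, so the model sees real tool invocations.
fn assistant_blocks(
    text: &str,
    calls: &[(String, String, serde_json::Value)], // (id, name, input)
) -> Vec<AnthropicContent> {
    let mut blocks = Vec::new();
    if !text.is_empty() {
        blocks.push(AnthropicContent::Text { text: text.to_string() });
    }
    for (id, name, input) in calls {
        blocks.push(AnthropicContent::ToolUse {
            id: id.clone(),
            name: name.clone(),
            input: input.clone(),
        });
    }
    blocks
}
```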
Features:
- New predicate rules: NotContains, AnyOf, NoneOf
- Conditional predicates via when clauses (WhenCondition/CompiledWhenCondition)
- Null handling: YAML null treated as absent for exists/not_exists
- Solon agent for rulespec authoring (agents/solon.md)
- Rulespec schema documentation (prompts/schemas/rulespec.schema.md)
Bugfix:
- Fixed when-condition evaluation in the datalog path: the catch-all branch
did a naive string-contains check instead of delegating to
evaluate_predicate_datalog(). Rules like matches (regex) were silently
ignored, causing a vacuous pass and letting violations through. Now the
catch-all delegates to evaluate_predicate_datalog(), which handles all
12 rule types correctly (see the sketch below).
Tests: 34 new tests covering all new rules, null handling, when conditions,
and the when+matches bugfix (butler rulespec pattern).
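A sketch of the bug shape, with a hypothetical two-variant subset of the
12 rule types:

```rust
/// Hypothetical rule subset; the real evaluator covers all 12 types.
enum Predicate {
    Contains(String),
    Matches(String), // regex; previously swallowed by the catch-all
}

fn evaluate_predicate_datalog(pred: &Predicate, value: &str) -> bool {
    match pred {
        Predicate::Contains(s) => value.contains(s),
        Predicate::Matches(re) => regex::Regex::new(re)
            .map(|r| r.is_match(value))
            .unwrap_or(false),
    }
}

/// Fixed `when` evaluation: delegate every rule type instead of the old
/// catch-all `value.contains(...)`, which let regex rules vacuously pass.
fn evaluate_when(pred: &Predicate, value: &str) -> bool {
    evaluate_predicate_datalog(pred, value)
}
```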
Root cause: ActionEnvelope.to_yaml_value() creates a Mapping from the
facts HashMap without a 'facts:' wrapper key, but rulespec selectors
may include a 'facts.' prefix (e.g. 'facts.feature.done' instead of
'feature.done'). This caused zero facts to be extracted, making all
predicate evaluations fail.
Fix: extract_facts() now tries the selector against the unwrapped
envelope value first, and if empty, retries against a facts-wrapped
version as fallback.
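A sketch of the fallback, assuming serde_yaml values and a hypothetical
select() that walks dotted selectors:

```rust
use serde_yaml::Value;

/// Hypothetical dotted-path lookup, e.g. "feature.done".
fn select(value: &Value, selector: &str) -> Option<Value> {
    let mut current = value;
    for key in selector.split('.') {
        current = current.get(key)?;
    }
    Some(current.clone())
}

/// Try the selector against the envelope as-is; if that finds nothing,
/// retry against a version wrapped under a top-level `facts:` key so
/// `facts.`-prefixed selectors still resolve.
fn extract_fact(envelope: &Value, selector: &str) -> Option<Value> {
    select(envelope, selector).or_else(|| {
        let wrapped = Value::Mapping(
            [(Value::from("facts"), envelope.clone())].into_iter().collect(),
        );
        select(&wrapped, selector)
    })
}
```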
Also:
- Strengthened the write_envelope tool description to require a top-level
facts: key and file paths for evidence, and to allow free-form notes
- Updated system prompt with matching rules
- Added 6 new tests (4 unit, 2 integration)
- Strengthened existing integration test to verify fact count > 0
- New crates/g3-core/src/tools/envelope.rs with execute_write_envelope()
and verify_envelope() (moved from shadow_datalog_verify in plan.rs)
- write_envelope accepts YAML facts, writes envelope.yaml to session dir,
then runs datalog verification against analysis/rulespec.yaml in shadow mode
- plan_verify() now only checks envelope existence (no longer runs datalog)
- Tool count: 13 -> 14
- Updated system prompt to instruct agents to call write_envelope before
marking last plan item done
- Updated integration tests to use write_envelope tool directly
Workflow: write_envelope -> verify_envelope -> datalog shadow artifacts
plan_write(done) -> plan_verify -> checks envelope exists
- Remove rulespec parameter from plan_write tool definition and execution
- Remove rulespec compilation from plan_approve (no longer pre-compiles)
- Remove write_rulespec, get_rulespec_path, format_rulespec_yaml/markdown
from invariants.rs; read_rulespec() now takes &Path working dir
- Remove save/load_compiled_rulespec, get_compiled_rulespec_path from datalog.rs
- Update shadow_datalog_verify() to compile on-the-fly from
analysis/rulespec.yaml, writing rulespec.compiled.dl and
datalog_evaluation.txt to session dir
- Remove rulespec display from plan_read output
- Remove Invariants/Rulespec section from native.md system prompt
- Remove rulespec from prompts.rs plan_write format and examples
- Update existing tests to remove rulespec from plan_write calls
- Add 3 integration tests for on-the-fly rulespec verification
Implement a new datalog verification layer using datafrog that:
- Compiles rulespec to datalog on plan_approve
- Extracts facts from action envelope using selectors
- Executes datalog rules on plan_verify
- Writes evaluation results to datalog_evaluation.txt (shadow mode)
Key components:
- crates/g3-core/src/tools/datalog.rs: Full datalog module with:
- compile_rulespec(): Validates and compiles rulespec
- extract_facts(): Extracts facts from envelope YAML
- execute_rules(): Runs datafrog iteration
- 23 comprehensive tests
- crates/g3-core/src/tools/plan.rs:
- execute_plan_approve(): Now compiles rulespec on approval
- shadow_datalog_verify(): Runs datalog and writes to eval file
Results are written to .g3/sessions/<id>/datalog_evaluation.txt
for inspection, NOT injected into the context window (shadow mode).
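For flavor, the fixpoint pattern execute_rules() drives via datafrog
(the real rules come from the compiled rulespec; transitive closure here
just illustrates the iterate-until-stable loop):

```rust
use datafrog::{Iteration, Relation};

/// reach(a, c) :- reach(a, b), edge(b, c).
fn transitive_closure(edges: &[(u32, u32)]) -> Relation<(u32, u32)> {
    let edge = Relation::from_vec(edges.to_vec());
    let mut iteration = Iteration::new();
    // Store reach as (b, a) so the join key (first element) lines up
    // with edge(b, c).
    let reach = iteration.variable::<(u32, u32)>("reach");
    reach.insert(Relation::from_vec(
        edges.iter().map(|&(a, b)| (b, a)).collect(),
    ));
    while iteration.changed() {
        // reach(b, a), edge(b, c) => reach(c, a)
        reach.from_join(&reach, &edge, |_b, &a, &c| (c, a));
    }
    reach.complete()
}
```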
Restores the research tool that was previously externalized as a skill:
- Add pending_research.rs: PendingResearchManager with thread-safe task tracking
- Add tools/research.rs: execute_research (async), execute_research_status
- Add research/research_status tool definitions with exclude_research config
- Integrate PendingResearchManager into Agent and ToolContext
- Inject completed research results in streaming loop
Remove research skill:
- Clear EMBEDDED_SKILLS array in embedded.rs
- Delete skills/research/ directory
- Update all tests expecting embedded research skill
- Update docs and memory to reflect the change
The research tool now:
- Spawns scout agent in background tokio task
- Returns immediately with research_id
- Automatically injects results into conversation when ready
- Supports status checks via research_status tool
- Add terminal_width module with get_terminal_width(), clip_line(),
compress_path(), and compress_command() utilities
- Update ConsoleUiWriter to use dynamic terminal width for all tool output
- Tool output lines are clipped to fit without wrapping
- Tool headers use semantic compression (paths preserve filename,
commands clip from right)
- 4-character right margin for visual clarity
- Minimum 40 columns, default 80 when terminal size unavailable
- All truncation is UTF-8 safe (char counting, not byte slicing)
- Add 13 unit tests for terminal width utilities
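A sketch of the char-safe clipping and left-compression (names match the
utilities above, but this is illustrative, not the actual code):

```rust
/// Clip to `width` columns by counting chars, never slicing bytes.
/// (Double-width glyphs are out of scope for this sketch.)
fn clip_line(line: &str, width: usize) -> String {
    if line.chars().count() <= width {
        return line.to_string();
    }
    let mut out: String = line.chars().take(width.saturating_sub(1)).collect();
    out.push('…');
    out
}

/// Compress a path from the left so the filename stays visible.
fn compress_path(path: &str, width: usize) -> String {
    let total = path.chars().count();
    if total <= width {
        return path.to_string();
    }
    let tail: String = path
        .chars()
        .skip(total - width.saturating_sub(1))
        .collect();
    format!("…{tail}")
}
```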
- Remove is_embedded_skill() from discovery.rs (unused)
- Remove get_embedded_skills_map() from embedded.rs (unused)
- Remove associated tests for deleted functions
- Inline path check in test_repo_overrides_embedded test
This eliminates dead code warnings and reduces module surface area
without changing any behavior.
Agent: fowler
Focused analysis of the past 10 commits, covering:
- New skills module in g3-core (parser, discovery, prompt, embedded, extraction)
- Research tool externalized to skills/research/ skill
- SkillsConfig added to g3-config
- SDLC pipeline state moved to .g3/sdlc/
Key findings:
- 4 crates changed, 29 files affected (8 added, 2 deleted, 19 modified)
- No dependency cycles detected
- Clean DAG structure in new skills module
- Cross-crate coupling via g3-core::skills and g3-config::SkillsConfig
- Compile-time coupling to skills/research/ via include_str!
Agent: euler
Implements a pipeline that orchestrates 7 g3 agents in sequence:
1. euler - dependency graph and hotspots analysis
2. breaker - whitebox exploration and edge-case discovery
3. hopper - deep testing and regression integrity
4. fowler - refactoring to deduplicate and reduce complexity
5. carmack - in-place rewriting for readability and concision
6. lamport - human-readable documentation and validation
7. huffman - semantic compression of memory
Features:
- Commit cursor tracking (--from flag to set starting point)
- Crash recovery (resumes from last incomplete stage)
- Git worktree isolation for all pipeline work
- Visual pipeline display with status icons
- Summary generation saved to .g3/sessions/sdlc/
- Pipeline state persisted to analysis/sdlc/pipeline.json
CLI:
- studio sdlc run [-c N] [--from COMMIT]
- studio sdlc status
- studio sdlc reset
Also adds huffman agent to embedded agents list.
Adds rulespec.yaml and envelope.yaml support for machine-readable
invariant checking during plan completion.
- Add invariants module with Rulespec, ActionEnvelope, and evaluation logic
- Add Invariants section to system prompt with workflow instructions
- Show rulespec/envelope file status in plan verification output
- Rulespec written during planning (captures constraints from task)
- Envelope written after implementation (documents what was built)
Implements the Agent Skills specification (https://agentskills.io) for
portable skill packages that give the agent new capabilities.
Changes:
- Add skills module with SKILL.md parser (YAML frontmatter + markdown body)
- Implement skill discovery from ~/.g3/skills/, config extra_paths, and .g3/skills/
- Generate <available_skills> XML for system prompt injection
- Add SkillsConfig to g3-config with enabled flag and extra_paths
- Wire skills discovery into CLI startup
- Add 29 unit tests for parser, discovery, and prompt generation
- Update README with Agent Skills documentation
Skill locations (lowest to highest priority):
1. ~/.g3/skills/ (global)
2. Config extra_paths
3. .g3/skills/ (workspace, highest priority)
At startup, g3 scans skill directories and injects a summary into the
system prompt. When the agent needs a skill, it reads the full SKILL.md
using the read_file tool.
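A sketch of the priority merge, assuming discovery returns skills keyed
by name and that later directories win (types hypothetical):

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

struct Skill {
    name: String,
    path: PathBuf,
}

/// Scan in lowest-to-highest priority order; later inserts overwrite
/// earlier ones, so a workspace skill shadows a global one of the same name.
fn discover_skills(dirs_low_to_high: &[PathBuf]) -> HashMap<String, Skill> {
    let mut skills = HashMap::new();
    for dir in dirs_low_to_high {
        for skill in scan_skill_dir(dir) {
            skills.insert(skill.name.clone(), skill);
        }
    }
    skills
}

/// Hypothetical: parse every */SKILL.md under `dir`.
fn scan_skill_dir(_dir: &Path) -> Vec<Skill> {
    Vec::new() // stub for the sketch
}
```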
- Update command matching from /feature to /plan in commands.rs
- Update help text, usage message, and example
- Update workspace memory references
- /feature is no longer recognized (completely removed)
Adds a verification system that checks evidence in completed plan items:
- Evidence parsing: supports code locations (file:line, file:line-line, file only)
and test references (file::test_name)
- Code location verification: checks file exists, validates line numbers in range
- Test reference verification: checks test file exists, searches for fn pattern
- Verification results: Verified, Warning, Error, Skipped statuses
- Loud output formatting with emoji indicators for warnings/errors
- Integration with execute_plan_write(): runs when plan is complete and approved
- 12 new unit tests covering parsing and verification
Warnings are advisory and errors are loud, but neither blocks.
Blocked items are skipped during verification.
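A sketch of the evidence grammar, with hypothetical type names:

```rust
/// Evidence forms: "src/lib.rs", "src/lib.rs:42", "src/lib.rs:42-60",
/// and "tests/it.rs::test_name" for test references.
enum Evidence {
    Code { file: String, lines: Option<(u32, u32)> },
    Test { file: String, test_name: String },
}

fn parse_evidence(s: &str) -> Option<Evidence> {
    // Test references use `::`, checked before the single-colon split.
    if let Some((file, test_name)) = s.split_once("::") {
        return Some(Evidence::Test {
            file: file.to_string(),
            test_name: test_name.to_string(),
        });
    }
    match s.split_once(':') {
        None => Some(Evidence::Code { file: s.to_string(), lines: None }),
        Some((file, range)) => {
            let (start, end) = match range.split_once('-') {
                Some((a, b)) => (a.parse().ok()?, b.parse().ok()?),
                None => {
                    let n: u32 = range.parse().ok()?;
                    (n, n)
                }
            };
            Some(Evidence::Code {
                file: file.to_string(),
                lines: Some((start, end)),
            })
        }
    }
}
```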
Plan Mode is a cognitive forcing system that requires reasoning about:
- Happy path
- Negative case
- Boundary condition
New tools:
- plan_read: Read current plan for session
- plan_write: Create/update plan with YAML content (validates structure)
- plan_approve: Mark current revision as approved
New command:
- /feature <description>: Start Plan Mode for a new feature
Plan schema requires:
- plan_id, revision, approved_revision
- items with id, description, state, touches, checks (happy/negative/boundary)
- evidence and notes required when marking items done
Verification:
- plan_verify() called automatically when all items are done/blocked
Removed:
- todo_read, todo_write tools
- todo.rs module and related tests
The research tool now spawns the scout agent in a background tokio task
and returns immediately with a research_id placeholder. This allows the
agent to continue working while research runs (30-120 seconds).
Key changes:
- New PendingResearchManager for tracking async research tasks
- research tool returns immediately with placeholder containing research_id
- research_status tool to check progress of pending research
- Auto-injection of completed research at natural break points:
- Start of each tool iteration (before LLM call)
- Before prompting user in interactive mode
- /research CLI command to list all research tasks
- Updated system prompt to explain async behavior
The agent can:
- Continue with other work while research runs
- Check status with research_status tool
- Yield turn to user if results are critical before continuing
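A sketch of the fire-and-return pattern, assuming tokio and a shared map
of pending results (names hypothetical):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// research_id -> None while running, Some(result) when finished.
type Pending = Arc<Mutex<HashMap<String, Option<String>>>>;

/// Spawn the research in the background and hand back the id at once;
/// the streaming loop later polls the map and injects finished results.
fn start_research(pending: Pending, query: String) -> String {
    let research_id = format!("research-{}", uuid::Uuid::new_v4());
    pending.lock().unwrap().insert(research_id.clone(), None);
    let id = research_id.clone();
    tokio::spawn(async move {
        let result = run_scout_agent(&query).await; // 30-120 seconds
        pending.lock().unwrap().insert(id, Some(result));
    });
    research_id
}

/// Hypothetical stand-in for the scout agent invocation.
async fn run_scout_agent(query: &str) -> String {
    format!("findings for: {query}")
}
```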
- Extend Usage struct with cache_creation_tokens and cache_read_tokens fields
- Parse Anthropic cache_creation_input_tokens and cache_read_input_tokens
- Parse OpenAI prompt_tokens_details.cached_tokens for automatic prefix caching
- Add CacheStats struct to Agent for cumulative tracking across API calls
- Add "Prompt Cache Statistics" section to /stats output showing:
- API call count and cache hit count
- Hit rate percentage
- Total input tokens and cache read/creation tokens
- Cache efficiency (% of input served from cache)
- Update all provider implementations and test files
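A sketch of the cumulative accounting behind the /stats section (field
names mirror the bullets above; the percentage definitions are assumptions):

```rust
#[derive(Default)]
struct CacheStats {
    api_calls: u64,
    cache_hits: u64,
    input_tokens: u64,
    cache_read_tokens: u64,
    cache_creation_tokens: u64,
}

impl CacheStats {
    fn record(&mut self, input: u64, cache_read: u64, cache_creation: u64) {
        self.api_calls += 1;
        if cache_read > 0 {
            self.cache_hits += 1; // any cached prefix counts as a hit
        }
        self.input_tokens += input;
        self.cache_read_tokens += cache_read;
        self.cache_creation_tokens += cache_creation;
    }

    /// Percentage of API calls that read anything from cache.
    fn hit_rate(&self) -> f64 {
        100.0 * self.cache_hits as f64 / self.api_calls.max(1) as f64
    }

    /// Percentage of all input tokens served from cache.
    fn efficiency(&self) -> f64 {
        100.0 * self.cache_read_tokens as f64 / self.input_tokens.max(1) as f64
    }
}
```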
Rename all references from "Project Memory" to "Workspace Memory" to avoid
future conflation if a "project" concept is introduced later.
Changes:
- Rename read_project_memory() -> read_workspace_memory()
- Update all prompts, tool descriptions, and comments
- Update header parsing in memory.rs to use "# Workspace Memory"
- Update display detection for "=== Workspace Memory ==="
- Update documentation and analysis/memory.md
11 files changed, ~36 occurrences updated.
Centralize tool output formatting logic that was duplicated/scattered in
stream_completion_with_tools(). This eliminates code-path aliasing where
tool type checks were done in multiple places.
Changes:
- Add ToolOutputFormat enum (SelfHandled, Compact, Regular)
- Add format_tool_result_summary() for centralized formatting decisions
- Add is_compact_tool() and is_self_handled_tool() helper functions
- Move parse_diff_stats() from lib.rs to streaming.rs
- Simplify tool execution display logic in lib.rs using new helpers
Net effect: -86 lines in lib.rs, +112 lines in streaming.rs
The streaming.rs additions are reusable, well-named functions.
All 585+ workspace tests pass.
Agent: fowler
Consolidate scattered state variables in the 834-line stream_completion_with_tools()
function to use the existing StreamingState and IterationState structs from
streaming.rs. This eliminates code-path aliasing where state was tracked in
multiple places and makes the streaming loop easier to reason about.
Changes:
- Add assistant_message_added field to StreamingState
- Add stream_stop_reason field to IterationState
- Replace 8 inline state variables with StreamingState::new()
- Replace 7 iteration-local variables with IterationState::new()
- All 585 workspace tests pass
This is a pure refactor with no behavior changes. The state structs were already
defined in streaming.rs but not used in the main streaming loop.
Agent: fowler
- Extract handle_command() from interactive.rs to new commands.rs module
(320 lines, 15 match arms for /help, /compact, /thinnify, etc.)
- Fix orphaned tests in completion.rs that were outside mod tests block
- Add #[allow(dead_code)] to with_include_prompt_filename() (used in tests)
- interactive.rs reduced from 595 to 290 lines
Agent: fowler
Extract a new g3_status module in g3-cli that provides consistent formatting
for all 'g3:' prefixed system status messages.
Key changes:
- Add G3Status struct with methods for progress, done, failed, error, etc.
- Add Status enum with Done, Failed, Error, Resolved, Insufficient, NoChanges
- Add ThinResult struct in g3-core for semantic thinning data
- Update UiWriter trait with print_thin_result() method
- Refactor context thinning to return ThinResult instead of formatted strings
- Update all callers to use the new centralized formatting
- Session resume/decline messages now use G3Status
- Compaction status messages now use G3Status
This maintains clean separation of concerns: g3-core emits semantic data,
g3-cli handles all terminal formatting and colors.
The bug was caused by mark_tool_calls_consumed() being called after
displaying each chunk, which advanced last_consumed_position to the
end of the current buffer. When the next chunk arrived with JSON,
the unchecked_buffer started at position 0 of the slice, causing
is_on_own_line() to return true (position 0 is always "on its own line").
Removed the problematic mark_tool_calls_consumed() call from the
"no tool executed" branch. The remaining call after actual tool
execution is correct and necessary.
Added integration test that verifies inline JSON in prose is not
detected as a tool call.
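A sketch of why position 0 is special (buffer handling hypothetical):

```rust
/// A tool-call candidate is only honored when it starts at the beginning
/// of a line. Position 0 trivially satisfies this, which is why resetting
/// the unchecked window to offset 0 on every chunk produced false positives.
fn is_on_own_line(buffer: &str, pos: usize) -> bool {
    // `pos` is assumed to sit on a char boundary of `buffer`.
    pos == 0 || buffer[..pos].ends_with('\n')
}
```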
- Add language_prompts module that auto-detects programming languages in the workspace
- Scan for language files (depth limit 2) to inject relevant toolchain prompts
- Add prompts/langs/ directory for language-specific markdown files
- Include Racket/raco toolchain guidance as first language prompt
- Update combine_project_content() to accept language_content parameter
- Integrate language detection into main CLI flow and agent mode
- Update project memory with new feature documentation
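A sketch of the depth-limited scan using std::fs (the real implementation
may differ):

```rust
use std::fs;
use std::path::Path;

/// True if any file under `dir`, at most `depth` directory levels deep,
/// has the given extension (e.g. "rkt" for Racket).
fn has_language_files(dir: &Path, ext: &str, depth: usize) -> bool {
    let Ok(entries) = fs::read_dir(dir) else { return false };
    for entry in entries.flatten() {
        let path = entry.path();
        if path.is_dir() {
            if depth > 0 && has_language_files(&path, ext, depth - 1) {
                return true;
            }
        } else if path.extension().is_some_and(|e| e == ext) {
            return true;
        }
    }
    false
}
```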
- Updated memory reminder prompt with per-symbol char ranges
- Added two few-shot examples: Session Continuation (feature) + UTF-8 Safe Slicing (pattern)
- Updated system prompt Memory Format section to match
- Format: file -> nested symbols with [start..end] ranges and descriptions
- Enables direct read_file navigation to specific functions
1. architecture.md: Fixed diagram to show 'studio' instead of 'g3-console'
(the crate was renamed during development)
2. analysis/memory.md: Removed reference to non-existent machine_ui_writer.rs
3. theme.rs: Clarified that 'retro' is a theme option (the default theme),
not a separate TUI mode. No --retro CLI flag exists.
Project memory is now stored at analysis/memory.md instead of .g3/memory.md.
This change enables:
- Shared memory across git worktrees (studio agent sessions)
- Version-controlled memory that persists across clones
- Memory changes tracked in git history and reviewable in PRs
Changes:
- crates/g3-core/src/tools/memory.rs: Update get_memory_path() to use analysis/
- crates/g3-cli/src/project_files.rs: Update read_project_memory() path
- crates/g3-core/src/prompts.rs: Update documentation references (2 occurrences)
- analysis/memory.md: Add memory file (copied from .g3/memory.md)