Fix context window progress bar showing wrong token counts

Calibrate used_tokens from API prompt_tokens (ground truth) to fix
progress bar drift in interactive mode. Three issues fixed:

1. update_usage_from_response() only updated cumulative_tokens, never
   calibrated used_tokens. Now snaps used_tokens to prompt_tokens when
   available (falls back to heuristic when prompt_tokens is 0).

2. Moved calibration call inline during streaming (when usage chunk
   arrives) instead of after the loop. Text-only responses — the most
   common case in interactive mode — take an early return path that
   bypassed the post-loop usage update entirely.

3. Removed the mock Usage with a hardcoded prompt_tokens=100 from
   execute_single_task(), which corrupted calibration.
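
The calibration in (1) can be sketched roughly as follows. This is a minimal sketch with hypothetical field and signature names, not the actual `update_usage_from_response()` from `crates/g3-core/src/context_window.rs`:

```rust
// Hypothetical shape of the context-window tracker: used_tokens is the
// drifting heuristic estimate, cumulative_tokens is the lifetime total.
struct ContextWindow {
    used_tokens: u32,
    cumulative_tokens: u64,
}

impl ContextWindow {
    // When the API reports prompt_tokens > 0, snap used_tokens to it
    // (ground truth). When it reports 0, leave the heuristic estimate
    // untouched. cumulative_tokens is always accumulated.
    fn update_usage_from_response(&mut self, prompt_tokens: u32, response_total: u32) {
        self.cumulative_tokens += response_total as u64;
        if prompt_tokens > 0 {
            self.used_tokens = prompt_tokens; // calibrate to API ground truth
        }
        // prompt_tokens == 0: heuristic fallback, no change to used_tokens
    }
}

fn main() {
    // Drifted heuristic (136k) gets snapped to the API-reported 201k.
    let mut cw = ContextWindow { used_tokens: 136_000, cumulative_tokens: 0 };
    cw.update_usage_from_response(201_000, 201_500);
    assert_eq!(cw.used_tokens, 201_000);
    // A response with no usage data leaves the calibrated value in place.
    cw.update_usage_from_response(0, 800);
    assert_eq!(cw.used_tokens, 201_000);
    assert_eq!(cw.cumulative_tokens, 202_300);
    println!("used={} cumulative={}", cw.used_tokens, cw.cumulative_tokens);
}
```

The point of snapping rather than accumulating is that any per-message estimation error is discarded at every calibration point instead of compounding.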
Author: Dhanji R. Prasanna
Date: 2026-03-18 15:31:20 +11:00
Commit: 1b26de6cd2
Parent: 68510e06d1
5 changed files with 128 additions and 65 deletions


@@ -1,5 +1,5 @@
# Workspace Memory
-> Updated: 2026-02-14T22:33:04Z | Size: 22.9k chars
+> Updated: 2026-03-18T03:59:01Z | Size: 25.2k chars
### Remember Tool Wiring
- `crates/g3-core/src/tools/memory.rs` [0..5686]
@@ -396,4 +396,19 @@ Tool output responsive to terminal width — no line wrapping, 4-char right marg
- `crates/g3-core/src/lib.rs` [170..171] - `baseline_dirty_files: HashSet<String>` field on Agent
- `crates/g3-core/src/lib.rs` [1675..1686] - `set_plan_mode(enabled, working_dir)` - captures baseline on enable, clears on disable
- **Key invariant**: The approval gate NEVER deletes or reverts files. It only warns.
- **Key invariant**: Pre-existing dirty files (captured at plan mode start) are excluded from gate checks.
### Context Window Calibration (Token Drift Fix)
- `crates/g3-core/src/context_window.rs` [159..189] - `update_usage_from_response()` now calibrates `used_tokens` from API `prompt_tokens` (ground truth). When `prompt_tokens > 0`, snaps `used_tokens` to it. When 0, leaves unchanged (heuristic fallback).
- `crates/g3-core/src/context_window.rs` [93..100] - No more 1% safety buffer. `total_tokens = raw` (was `raw * 0.99`).
- `crates/g3-core/src/context_window.rs` [222..250] - `estimate_message_tokens()` now adds: +4 per-message overhead, +30 per tool_use block (was 20), +15 per tool_result message.
- `crates/g3-core/src/lib.rs` [2232..2241] - `ensure_context_capacity()` called inside streaming loop for iteration > 1 (catches post-tool-execution growth).
- **Root cause**: Heuristic token estimation drifted ~48% over 809 messages / 388 tool calls (136k estimated vs 201k actual). API `prompt_tokens` is ground truth.
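
The overhead constants above can be illustrated with a minimal sketch. The `Message` shape and the chars/4 base estimate are assumptions for illustration, not taken from `estimate_message_tokens()` itself:

```rust
// Hypothetical flattened message shape, just enough to show the
// per-item overhead constants from the drift fix.
struct Message {
    text_chars: usize,     // total characters of text content
    tool_use_blocks: usize,
    is_tool_result: bool,
}

fn estimate_message_tokens(m: &Message) -> usize {
    // ASSUMPTION: a chars/4 base estimate; the real base heuristic
    // is not shown in the notes above.
    let mut tokens = m.text_chars / 4;
    tokens += 4;                      // per-message framing overhead
    tokens += 30 * m.tool_use_blocks; // per tool_use block (was 20)
    if m.is_tool_result {
        tokens += 15;                 // per tool_result message
    }
    tokens
}

fn main() {
    let m = Message { text_chars: 400, tool_use_blocks: 1, is_tool_result: false };
    // 400/4 = 100 base, +4 framing, +30 tool_use = 134
    assert_eq!(estimate_message_tokens(&m), 134);
    println!("{}", estimate_message_tokens(&m));
}
```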
### Context Window Calibration (Token Drift Fix) - CORRECTED
- `crates/g3-core/src/context_window.rs` [168..189] - `update_usage_from_response()` calibrates `used_tokens` from API `prompt_tokens` (ground truth). When `prompt_tokens > 0`, snaps `used_tokens` to it. When 0, leaves unchanged (heuristic fallback).
- `crates/g3-core/src/lib.rs` [2316..2319] - Calibration call placed **inline** during streaming (when usage chunk arrives in `chunk.usage`), NOT after the streaming loop. Critical because text-only responses take an early return path that bypasses post-loop code.
- `crates/g3-core/src/lib.rs` [2892..2898] - Post-loop code only handles fallback (no-usage) case now.
- `crates/g3-core/src/context_window.rs` [87..93] - 1% safety buffer IS still in place (`total_tokens * 0.99`). Left as safety net between calibration points.
- **Root cause of display bug**: (1) `update_usage_from_response` never calibrated `used_tokens`, only `cumulative_tokens`. (2) `execute_single_task` had mock usage with hardcoded `prompt_tokens: 100`. (3) Post-loop usage update was bypassed by early returns in text-only response paths.
- **Key streaming flow**: For text-only responses (most common in interactive mode), `chunk.finished` triggers an early `return Ok(self.finalize_streaming_turn(...))` that bypasses all post-loop code. Calibration MUST happen inline when `chunk.usage` arrives.
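
The streaming flow above can be condensed into a sketch. The `Chunk`/`Usage` types and `run_stream` are hypothetical stand-ins for the real streaming loop in `crates/g3-core/src/lib.rs`:

```rust
// Hypothetical streaming types: a chunk may carry usage data, and a
// text-only response ends with a finished chunk that early-returns.
struct Usage { prompt_tokens: u32 }
struct Chunk { usage: Option<Usage>, finished: bool }

fn run_stream(chunks: Vec<Chunk>, used_tokens: &mut u32) -> &'static str {
    for chunk in chunks {
        // Calibrate INLINE, the moment the usage chunk arrives.
        if let Some(u) = &chunk.usage {
            if u.prompt_tokens > 0 {
                *used_tokens = u.prompt_tokens;
            }
        }
        if chunk.finished {
            // Text-only early return: post-loop code never executes,
            // so calibration placed after the loop would be skipped.
            return "finalized";
        }
    }
    // Only reached when the stream never finished early (fallback case).
    "post-loop fallback"
}

fn main() {
    let mut used = 0u32;
    let out = run_stream(
        vec![
            Chunk { usage: Some(Usage { prompt_tokens: 1234 }), finished: false },
            Chunk { usage: None, finished: true },
        ],
        &mut used,
    );
    // Calibration happened before the early return fired.
    assert_eq!(used, 1234);
    assert_eq!(out, "finalized");
}
```

This is why a calibration call placed after the loop silently misses the most common interactive-mode path: the early `return` on the finished chunk bypasses it entirely.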