Add 1% safety buffer to context window to prevent API token limit errors

Our token estimation heuristic (chars/3 * 1.1 for code, chars/4 * 1.1 for text)
slightly undercounts actual usage over long sessions with hundreds of tool
calls. The accumulated drift (~89 tokens in the observed failure) caused
Anthropic API 400 errors:
  'prompt is too long: 200089 tokens > 200000 maximum'
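
The heuristic above can be sketched as follows; the function name and
signature are illustrative, not the actual implementation:

```rust
/// Rough token estimate: code averages ~3 chars/token, prose ~4, with a
/// 10% pad on top. (Hypothetical sketch of the heuristic described above.)
fn estimate_tokens(text: &str, is_code: bool) -> usize {
    let chars = text.chars().count() as f64;
    let chars_per_token = if is_code { 3.0 } else { 4.0 };
    ((chars / chars_per_token) * 1.1).round() as usize
}

fn main() {
    let code = "x".repeat(120);
    let prose = "y".repeat(120);
    println!("{}", estimate_tokens(&code, true)); // 120/3 * 1.1 = 44
    println!("{}", estimate_tokens(&prose, false)); // 120/4 * 1.1 = 33
}
```

Per-message rounding error like this is exactly what compounds into the
~89-token drift across hundreds of tool calls.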

Fix: ContextWindow::new() now applies a 1% buffer, setting total_tokens to 99%
of the provider-reported limit. For a 200k window this gives 198k, providing a
2000-token safety margin that absorbs estimation drift.
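
A minimal sketch of the buffering, assuming a simple `ContextWindow` shape
(field names beyond `total_tokens` are illustrative). Integer arithmetic
avoids float truncation (e.g. `200_000.0 * 0.99` can land just below 198000):

```rust
/// Context window that reserves a 1% buffer below the provider's limit.
struct ContextWindow {
    total_tokens: usize, // 99% of the provider-reported limit
    used_tokens: usize,
}

impl ContextWindow {
    fn new(provider_limit: usize) -> Self {
        // Keep 99% of the reported limit as the working budget.
        // Integer math: 200_000 * 99 / 100 = 198_000 exactly.
        ContextWindow {
            total_tokens: provider_limit * 99 / 100,
            used_tokens: 0,
        }
    }

    fn remaining_tokens(&self) -> usize {
        self.total_tokens.saturating_sub(self.used_tokens)
    }
}

fn main() {
    let w = ContextWindow::new(200_000);
    println!("{}", w.total_tokens); // 198000
    println!("{}", w.remaining_tokens()); // 198000
}
```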

All percentage calculations, compaction thresholds, and thinning triggers
operate against the buffered limit, so compaction fires earlier and we never
send a request the API will reject.
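
Because thresholds are fractions of the buffered total, they shift down
proportionally. A sketch (the 80% figure comes from the test below; the
helper name is assumed):

```rust
/// Compaction fires at 80% of the *buffered* limit, so with a 200k raw
/// window buffered to 198k it triggers at 158,400 used tokens instead
/// of 160,000. (Illustrative helper, not the actual implementation.)
fn should_compact(used_tokens: usize, buffered_total: usize) -> bool {
    (used_tokens as f64 / buffered_total as f64) >= 0.80
}

fn main() {
    println!("{}", should_compact(158_400, 198_000)); // true
    println!("{}", should_compact(158_399, 198_000)); // false
}
```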
Author: Dhanji R. Prasanna
Date: 2026-02-13 15:46:53 +11:00
parent a7e0b0ef9e
commit 0410efd41b
5 changed files with 203 additions and 11 deletions

@@ -94,7 +94,8 @@ fn test_percentage_based_on_used_tokens() {
     // Initially 0%
     assert_eq!(window.percentage_used(), 0.0);
-    assert_eq!(window.remaining_tokens(), 1000);
+    // After 1% buffer: total_tokens = 990
+    assert_eq!(window.remaining_tokens(), 990);

     // Add messages to increase used_tokens
     // A message with ~100 chars should be roughly 25-30 tokens
@@ -107,7 +108,7 @@ fn test_percentage_based_on_used_tokens() {
     assert!(percentage < 100.0, "percentage should be < 100");

     // remaining_tokens should decrease
-    assert!(window.remaining_tokens() < 1000, "remaining tokens should decrease");
+    assert!(window.remaining_tokens() < 990, "remaining tokens should decrease");
 }

 /// Test that the 80% compaction threshold works correctly.