Remove final_output tool and improve scout report handback

final_output removal:
- Remove final_output from tool definitions and dispatch
- Update system prompts to request summaries as regular text
- Remove final_output_called field from StreamingState
- Update auto_continue tests to remove final_output_called parameter
- Remove final_output test from tool_execution_test.rs
- Update planner and flock prompts to not reference final_output
- Keep backwards-compat code in feedback_extraction.rs and task_result.rs

Scout report handback:
- Change from file-based to delimiter-based report extraction
- Scout outputs report between ---SCOUT_REPORT_START/END--- markers
- Research tool extracts content between markers, strips ANSI codes
- Add comprehensive tests for extraction and ANSI stripping
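
The delimiter-based extraction can be sketched as follows. This is a minimal illustration of the approach, not the code from this commit: the real `extract_report` in `research.rs` also strips ANSI escape codes and reports missing markers as `anyhow` errors rather than returning `None`.

```rust
// Hypothetical minimal sketch of delimiter-based report extraction.
// Marker strings match the commit; the function shape is illustrative only.
const START: &str = "---SCOUT_REPORT_START---";
const END: &str = "---SCOUT_REPORT_END---";

/// Return the trimmed text between the start and end markers, if both exist.
fn extract_report(output: &str) -> Option<&str> {
    // Content begins immediately after the start marker...
    let start = output.find(START)? + START.len();
    // ...and ends at the first end marker that follows it.
    let end = output[start..].find(END)? + start;
    Some(output[start..end].trim())
}

fn main() {
    let raw = "log noise\n---SCOUT_REPORT_START---\n# Brief\n---SCOUT_REPORT_END---\ntrailing";
    assert_eq!(extract_report(raw), Some("# Brief"));
    // Missing markers yield None instead of a partial report.
    assert_eq!(extract_report("no markers here"), None);
    println!("ok");
}
```

Searching for the end marker only after the start marker sidesteps the "end before start" case that the committed code guards against explicitly.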

657 tests pass.
Author: Dhanji R. Prasanna
Date: 2026-01-10 13:43:04 +11:00
parent cab2fb187a
commit 0aa1287ca6
9 changed files with 247 additions and 95 deletions

View File

@@ -37,7 +37,7 @@ Short bullets comparing the options where it matters.
 ### Recommendation (Optional)
 If one option is clearly dominant, state it.
-If not, say No clear default.
+If not, say "No clear default."
 ### Unknowns / Risks
 Things that require validation, experimentation, or judgment.
@@ -46,7 +46,56 @@ Things that require validation, experimentation, or judgment.
 Links only (titles + URLs).
 Brief quotes or snippets if relevant to decision making. No page dumps.
-Write this brief out to a temporary file and write out the full path of the filename as your VERY LAST LINE of output.
+**CRITICAL**: When your research is complete, output the brief between these exact delimiters:
+```
+---SCOUT_REPORT_START---
+(your full research brief here)
+---SCOUT_REPORT_END---
+```
+---
+## Example Output
+Here is an example of the expected output format:
+---SCOUT_REPORT_START---
+# Research Brief: Best Rust JSON Parsing Libraries
+## Query
+What are the best JSON parsing libraries for Rust with streaming support?
+## Options
+### 1. **serde_json**
+- The standard JSON library for Rust
+- Pros: Mature, fast, excellent ecosystem integration
+- Cons: No built-in streaming for large files
+### 2. **simd-json**
+- SIMD-accelerated JSON parser
+- Pros: 2-4x faster than serde_json for large payloads
+- Cons: Requires mutable input buffer, x86-64 only
+## Trade-offs / Comparisons
+| Aspect | serde_json | simd-json |
+|--------|------------|-----------|
+| Speed | Fast | Fastest |
+| Portability | All platforms | x86-64 |
+| Ease of use | Excellent | Good |
+## Recommendation
+Use **serde_json** for most cases. Consider **simd-json** only for performance-critical large JSON processing on x86-64.
+## Unknowns / Risks
+- simd-json API stability for newer versions
+- Memory usage differences at scale
+## Sources
+- https://docs.rs/serde_json
+- https://github.com/simd-lite/simd-json
+---SCOUT_REPORT_END---
 ---
@@ -71,7 +120,7 @@ If nothing useful exists, say so explicitly and back this up with evidence.
 - Optimize for *your user* making a decision, not for completeness.
 You are allowed to say:
-> This exists but is immature / fragile / not worth it.
+> "This exists but is immature / fragile / not worth it."
 ---
@@ -91,6 +140,7 @@ You succeed if:
 - The reader can decide what to try or ignore in under 5 minutes.
 - The brief is calm, bounded, and opinionated where justified.
 - No context bloat is introduced.
+- **The report is wrapped in the exact delimiters shown above.**
 If nothing meets the bar, saying so is OK.

View File

@@ -2789,7 +2789,7 @@ Remember: Be clear in your review and concise in your feedback. APPROVE iff the
     // We have a valid coach result, process it
     let coach_result = coach_result_opt.unwrap();
-    // Extract the complete coach feedback from final_output
+    // Extract the complete coach feedback from the response
     let coach_feedback_text =
         extract_coach_feedback_from_logs(&coach_result, &coach_agent, &output)?;
@@ -2800,7 +2800,7 @@ Remember: Be clear in your review and concise in your feedback. APPROVE iff the
         coach_result.response.len()
     );
-    // Check if we got empty feedback (this can happen if the coach doesn't call final_output)
+    // Check if we got empty feedback (this can happen if the coach doesn't provide substantive feedback)
     if coach_feedback_text.is_empty() {
         output.print("⚠️ Coach did not provide feedback. This may be a model issue.");
         coach_feedback = "The implementation needs review. Please ensure all requirements are met and the code compiles without errors.".to_string();

View File

@@ -22,7 +22,6 @@ pub struct StreamingState {
     pub response_started: bool,
     pub any_tool_executed: bool,
     pub auto_summary_attempts: usize,
-    pub final_output_called: bool,
     pub turn_accumulated_usage: Option<g3_providers::Usage>,
 }
@@ -36,7 +35,6 @@ impl StreamingState {
             response_started: false,
             any_tool_executed: false,
             auto_summary_attempts: 0,
-            final_output_called: false,
             turn_accumulated_usage: None,
         }
     }
 }

View File

@@ -10,12 +10,16 @@ use crate::ToolCall;
 use super::executor::ToolContext;
 
+/// Delimiter markers for scout report extraction
+const REPORT_START_MARKER: &str = "---SCOUT_REPORT_START---";
+const REPORT_END_MARKER: &str = "---SCOUT_REPORT_END---";
+
 /// Execute the research tool by spawning a scout agent.
 ///
 /// This tool:
 /// 1. Spawns `g3 --agent scout` with the query
-/// 2. Captures stdout and extracts the last line (file path to report)
-/// 3. Reads the report file and returns its contents
+/// 2. Captures stdout and extracts the report between delimiter markers
+/// 3. Returns the report content directly
 pub async fn execute_research<W: UiWriter>(
     tool_call: &ToolCall,
     ctx: &mut ToolContext<'_, W>,
@@ -46,20 +50,20 @@ pub async fn execute_research<W: UiWriter>(
         .spawn()
         .map_err(|e| anyhow::anyhow!("Failed to spawn scout agent: {}", e))?;
 
-    // Capture stdout to find the report file path
+    // Capture stdout to find the report content
     let stdout = child.stdout.take()
         .ok_or_else(|| anyhow::anyhow!("Failed to capture scout agent stdout"))?;
     let mut reader = BufReader::new(stdout).lines();
-    let mut last_line = String::new();
+    let mut all_output = Vec::new();
 
     // Print a header for the scout output
-    ctx.ui_writer.println("\n📡 Scout agent output:");
+    ctx.ui_writer.println("\n📡 Scout agent researching...");
 
-    // Stream all lines to UI, keeping track of the last one for the report path
+    // Collect all lines
     while let Some(line) = reader.next_line().await? {
         ctx.ui_writer.println(&format!("  {}", line));
-        last_line = line;
+        all_output.push(line);
     }
 
     // Wait for the process to complete
@@ -70,31 +74,167 @@ pub async fn execute_research<W: UiWriter>(
         return Ok(format!("❌ Scout agent failed with exit code: {:?}", status.code()));
     }
 
-    // The last line should be the path to the report file
-    let report_path = last_line.trim();
-    if report_path.is_empty() {
-        return Ok("❌ Scout agent did not output a report file path".to_string());
-    }
-
-    // Expand tilde if present
-    let expanded_path = if report_path.starts_with('~') {
-        if let Ok(home) = std::env::var("HOME") {
-            std::path::PathBuf::from(home).join(&report_path[2..]) // Skip "~/"
-        } else {
-            std::path::PathBuf::from(report_path)
-        }
-    } else {
-        std::path::PathBuf::from(report_path)
-    };
-
-    // Read the report file
-    match std::fs::read_to_string(&expanded_path) {
-        Ok(content) => {
-            Ok(format!("📋 Research Report:\n\n{}", content))
-        }
-        Err(e) => {
-            Ok(format!("❌ Failed to read report file '{}': {}", report_path, e))
-        }
-    }
-}
+    // Join all output and extract the report between markers
+    let full_output = all_output.join("\n");
+    extract_report(&full_output)
+}
+
+/// Extract the research report from scout output.
+///
+/// Looks for content between SCOUT_REPORT_START and SCOUT_REPORT_END markers.
+/// Strips ANSI escape codes from the extracted content.
+fn extract_report(output: &str) -> Result<String> {
+    // Strip ANSI codes from the entire output first
+    let clean_output = strip_ansi_codes(output);
+
+    // Find the start marker
+    let start_pos = clean_output.find(REPORT_START_MARKER)
+        .ok_or_else(|| anyhow::anyhow!(
+            "Scout agent did not output a properly formatted report. Expected {} marker.",
+            REPORT_START_MARKER
+        ))?;
+
+    // Find the end marker
+    let end_pos = clean_output.find(REPORT_END_MARKER)
+        .ok_or_else(|| anyhow::anyhow!(
+            "Scout agent report is incomplete. Expected {} marker.",
+            REPORT_END_MARKER
+        ))?;
+
+    if end_pos <= start_pos {
+        return Err(anyhow::anyhow!("Invalid report format: end marker before start marker"));
+    }
+
+    // Extract content between markers
+    let report_start = start_pos + REPORT_START_MARKER.len();
+    let report_content = clean_output[report_start..end_pos].trim();
+
+    if report_content.is_empty() {
+        return Ok("❌ Scout agent returned an empty report.".to_string());
+    }
+
+    Ok(format!("📋 Research Report:\n\n{}", report_content))
+}
+
+/// Strip ANSI escape codes from a string.
+///
+/// Handles common ANSI sequences like:
+/// - CSI sequences: \x1b[...m (colors, styles)
+/// - OSC sequences: \x1b]...\x07 (terminal titles, etc.)
+fn strip_ansi_codes(s: &str) -> String {
+    let mut result = String::with_capacity(s.len());
+    let mut chars = s.chars().peekable();
+
+    while let Some(c) = chars.next() {
+        if c == '\x1b' {
+            // Start of escape sequence
+            match chars.peek() {
+                Some('[') => {
+                    // CSI sequence: \x1b[...X where X is a letter
+                    chars.next(); // consume '['
+                    while let Some(&next) = chars.peek() {
+                        chars.next();
+                        if next.is_ascii_alphabetic() {
+                            break;
+                        }
+                    }
+                }
+                Some(']') => {
+                    // OSC sequence: \x1b]...\x07
+                    chars.next(); // consume ']'
+                    while let Some(&next) = chars.peek() {
+                        chars.next();
+                        if next == '\x07' {
+                            break;
+                        }
+                    }
+                }
+                _ => {
+                    // Unknown escape, skip just the ESC
+                }
+            }
+        } else {
+            result.push(c);
+        }
+    }
+
+    result
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn test_strip_ansi_codes() {
+        // Simple color code
+        assert_eq!(strip_ansi_codes("\x1b[31mred\x1b[0m"), "red");
+
+        // RGB color code (like the bug we saw)
+        assert_eq!(
+            strip_ansi_codes("\x1b[38;2;216;177;114mtmp/file.md\x1b[0m"),
+            "tmp/file.md"
+        );
+
+        // Multiple codes
+        assert_eq!(
+            strip_ansi_codes("\x1b[1m\x1b[32mbold green\x1b[0m normal"),
+            "bold green normal"
+        );
+
+        // No codes
+        assert_eq!(strip_ansi_codes("plain text"), "plain text");
+
+        // Empty string
+        assert_eq!(strip_ansi_codes(""), "");
+    }
+
+    #[test]
+    fn test_extract_report_success() {
+        let output = r#"Some preamble text
+---SCOUT_REPORT_START---
+# Research Brief
+
+This is the report content.
+---SCOUT_REPORT_END---
+Some trailing text"#;
+
+        let result = extract_report(output).unwrap();
+        assert!(result.contains("Research Brief"));
+        assert!(result.contains("This is the report content."));
+        assert!(!result.contains("preamble"));
+        assert!(!result.contains("trailing"));
+    }
+
+    #[test]
+    fn test_extract_report_with_ansi_codes() {
+        let output = "\x1b[32m---SCOUT_REPORT_START---\x1b[0m\n# Report\n\x1b[31m---SCOUT_REPORT_END---\x1b[0m";
+        let result = extract_report(output).unwrap();
+        assert!(result.contains("# Report"));
+    }
+
+    #[test]
+    fn test_extract_report_missing_start() {
+        let output = "No markers here";
+        let result = extract_report(output);
+        assert!(result.is_err());
+        assert!(result.unwrap_err().to_string().contains("SCOUT_REPORT_START"));
+    }
+
+    #[test]
+    fn test_extract_report_missing_end() {
+        let output = "---SCOUT_REPORT_START---\nContent but no end";
+        let result = extract_report(output);
+        assert!(result.is_err());
+        assert!(result.unwrap_err().to_string().contains("SCOUT_REPORT_END"));
+    }
+
+    #[test]
+    fn test_extract_report_empty_content() {
+        let output = "---SCOUT_REPORT_START---\n---SCOUT_REPORT_END---";
+        let result = extract_report(output).unwrap();
+        assert!(result.contains("empty report"));
+    }
+}

View File

@@ -4,7 +4,6 @@
 //! 1. Empty/trivial responses (just timing lines)
 //! 2. Incomplete tool calls
 //! 3. Unexecuted tool calls
-//! 4. Missing final_output after tool execution
 
 /// Helper function to check if a response is considered "empty" or trivial
 /// This mirrors the logic in lib.rs for detecting empty responses
@@ -117,37 +116,26 @@ fn test_max_auto_summary_attempts_is_reasonable() {
 // =============================================================================
 
 /// Simulates the should_auto_continue logic from lib.rs
+/// After removing final_output, the logic is simpler:
+/// - Continue if there's an incomplete tool call
+/// - Continue if there's an unexecuted tool call
+/// - Continue if tool executed but response is empty (LLM stuttered)
 fn should_auto_continue(
     any_tool_executed: bool,
-    final_output_called: bool,
     has_incomplete_tool_call: bool,
     has_unexecuted_tool_call: bool,
     is_empty_response: bool,
 ) -> bool {
-    (any_tool_executed && !final_output_called)
-        || has_incomplete_tool_call
+    has_incomplete_tool_call
         || has_unexecuted_tool_call
         || (any_tool_executed && is_empty_response)
 }
 
 #[test]
-fn test_auto_continue_after_tool_no_final_output() {
-    // Tool executed but no final_output - should continue
-    assert!(should_auto_continue(
-        true,  // any_tool_executed
-        false, // final_output_called
-        false, // has_incomplete_tool_call
-        false, // has_unexecuted_tool_call
-        false, // is_empty_response
-    ));
-}
-
-#[test]
-fn test_auto_continue_with_final_output() {
-    // Tool executed AND final_output called - should NOT continue
+fn test_auto_continue_tool_executed_with_response() {
+    // Tool executed with substantive response - should NOT continue
     assert!(!should_auto_continue(
         true,  // any_tool_executed
-        true,  // final_output_called
         false, // has_incomplete_tool_call
         false, // has_unexecuted_tool_call
         false, // is_empty_response
@@ -159,7 +147,6 @@ fn test_auto_continue_incomplete_tool_call() {
     // Incomplete tool call - should continue regardless of other flags
     assert!(should_auto_continue(
         false, // any_tool_executed
-        false, // final_output_called
         true,  // has_incomplete_tool_call
         false, // has_unexecuted_tool_call
         false, // is_empty_response
@@ -171,7 +158,6 @@ fn test_auto_continue_unexecuted_tool_call() {
     // Unexecuted tool call - should continue
     assert!(should_auto_continue(
         false, // any_tool_executed
-        false, // final_output_called
         false, // has_incomplete_tool_call
         true,  // has_unexecuted_tool_call
         false, // is_empty_response
@@ -183,7 +169,6 @@ fn test_auto_continue_empty_response_after_tool() {
     // Empty response after tool execution - should continue
     assert!(should_auto_continue(
         true,  // any_tool_executed
-        false, // final_output_called
         false, // has_incomplete_tool_call
         false, // has_unexecuted_tool_call
         true,  // is_empty_response
@@ -196,7 +181,6 @@ fn test_auto_continue_empty_response_no_tool() {
     // (This is a normal case where LLM just didn't respond)
     assert!(!should_auto_continue(
         false, // any_tool_executed
-        false, // final_output_called
         false, // has_incomplete_tool_call
         false, // has_unexecuted_tool_call
         true,  // is_empty_response
@@ -208,7 +192,6 @@ fn test_auto_continue_no_conditions_met() {
     // No tools, no incomplete calls, substantive response - should NOT continue
     assert!(!should_auto_continue(
         false, // any_tool_executed
-        false, // final_output_called
         false, // has_incomplete_tool_call
         false, // has_unexecuted_tool_call
         false, // is_empty_response
@@ -216,19 +199,22 @@ fn test_auto_continue_no_conditions_met() {
 }
 
 // =============================================================================
-// Test: Redundant condition detection
+// Test: Edge cases
 // =============================================================================
 
 #[test]
-fn test_redundant_empty_response_condition() {
-    // This test documents that (any_tool_executed && is_empty_response) is redundant
-    // when (any_tool_executed && !final_output_called) is already true
-
-    // Case: tool executed, no final_output, empty response
-    let result_with_empty = should_auto_continue(true, false, false, false, true);
-    let result_without_empty = should_auto_continue(true, false, false, false, false);
-
-    // Both should be true because (any_tool_executed && !final_output_called) is true
-    assert_eq!(result_with_empty, result_without_empty,
-        "The is_empty_response condition is redundant when any_tool_executed && !final_output_called");
+fn test_auto_continue_multiple_conditions() {
+    // Multiple conditions true - should still continue
+    assert!(should_auto_continue(
+        true, // any_tool_executed
+        true, // has_incomplete_tool_call
+        true, // has_unexecuted_tool_call
+        true, // is_empty_response
+    ));
+
+    // Only incomplete tool call
+    assert!(should_auto_continue(false, true, false, false));
+
+    // Only unexecuted tool call
+    assert!(should_auto_continue(false, false, true, false));
 }

View File

@@ -326,27 +326,6 @@ mod todo_tests {
     }
 }
 
-// =============================================================================
-// Test: final_output tool
-// =============================================================================
-
-mod final_output_tests {
-    use super::*;
-
-    #[test]
-    fn test_final_output_tool_call() {
-        let tool_call = make_tool_call(
-            "final_output",
-            json!({
-                "summary": "Task completed successfully.\n\n## Changes Made\n- Added feature X"
-            }),
-        );
-
-        assert_eq!(tool_call.tool, "final_output");
-        assert!(tool_call.args.get("summary").is_some());
-    }
-}
-
 // =============================================================================
 // Test: code_search tool
 // =============================================================================

View File

@@ -236,7 +236,7 @@ impl FlockMode {
     - The specific requirements that belong to this module\n\
     - Any dependencies on other modules\n\n\
     4. Return your final partitioning exactly once, prefixed by the marker '{{PARTITION JSON}}' followed by a fenced code block that starts with \"```json\" and ends with \"```\". Place only the JSON array inside the fence.\n\
-    5. Use the final_output tool to provide your partitioning as a JSON array of objects, where each object has:\n\
+    5. The JSON array should contain objects, where each object has:\n\
     - \"module_name\": string\n\
     - \"requirements\": string (the requirements text for this module)\n\
     - \"dependencies\": array of strings (names of other modules this depends on)\n\n\

View File

@@ -391,7 +391,7 @@ Before making suggestions, please:
 After understanding the context, update the `{codepath}/g3-plan/new_requirements.md` file by prepending
 your refined requirements under the heading `{{{{CURRENT REQUIREMENTS}}}}`.
 
-Use final_output when you are done to indicate completion."#,
+When you are done, provide a brief summary of the refinements you made."#,
         codepath = codepath
     )
 }

View File

@@ -553,7 +553,6 @@ pub fn get_planner_tools() -> Vec<&'static str> {
         "shell",
         "code_search",
         "str_replace",
-        "final_output",
     ]
 }
@@ -673,7 +672,7 @@ pub async fn run_coach_player_loop(
     ).await?;
 
     let coach_prompt = format!(
-        "You are G3 in coach mode. Review the implementation against these requirements:\n\n{}\n\nCheck:\n1. Are requirements implemented correctly?\n2. Does the code compile?\n3. What's missing?\n\nUse the final_output tool to provide your feedback.\nIf implementation is COMPLETE, include 'IMPLEMENTATION_APPROVED' in your feedback.\nOtherwise, provide specific feedback for the player to fix.",
+        "You are G3 in coach mode. Review the implementation against these requirements:\n\n{}\n\nCheck:\n1. Are requirements implemented correctly?\n2. Does the code compile?\n3. What's missing?\n\nProvide your feedback as a summary.\nIf implementation is COMPLETE, include 'IMPLEMENTATION_APPROVED' in your feedback.\nOtherwise, provide specific feedback for the player to fix.",
         requirements_content
     );