From 05d253ee2a0960107852396640c318865dddad5e Mon Sep 17 00:00:00 2001
From: "Dhanji R. Prasanna"
Date: Thu, 29 Jan 2026 10:33:53 +1100
Subject: [PATCH] docs: add embedded model performance comparison for agentic tasks

Added a new section documenting local LLM performance on complex agentic
tasks (comic book repacking test case). Includes:
- Cloud model baseline (Claude Opus 4.5, Sonnet 4.5, Claude 4 family)
- Local model ratings (Qwen3-32B, Qwen3-14B, GLM-4 9B, Qwen3-4B)
- Key findings about MoE vs dense models
- Configuration example for embedded providers
---
 README.md | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/README.md b/README.md
index 346ddd1..6bb6dac 100644
--- a/README.md
+++ b/README.md
@@ -115,6 +115,44 @@ These commands give you fine-grained control over context management, allowing y
 - Provider-specific optimizations and feature support
 - Local model support for offline operation
 
+### Embedded Models (Local LLMs)
+
+g3 supports local models via llama.cpp with Metal acceleration on macOS. Here's a performance comparison for **agentic tasks** (multi-step tool-calling workflows):
+
+**Test case**: Comic book repacking - extract CBR/CBZ archives, reorder files preserving page and issue order, repack into single archive. Requires correct sequencing, file handling, and no race conditions.
+
+#### Cloud Models (Baseline)
+| Model | Agentic Score | Notes |
+|-------|---------------|-------|
+| **Claude Opus 4.5** | ⭐⭐⭐⭐⭐ | Flawless execution |
+| Claude Sonnet 4.5 | ⭐⭐⭐⭐ | Good, occasional issues |
+| Claude 4 family | ⭐⭐⭐ | Gets there eventually, needs manual checking |
+
+#### Local Models
+| Model | Size | Speed | Agentic Score | Notes |
+|-------|------|-------|---------------|-------|
+| **Qwen3-32B** (Dense) | 18 GB | Slow | ⭐⭐⭐ | Good reasoning, spotty execution |
+| Qwen3-14B | 8.4 GB | Medium | ⭐⭐ | Understands tasks but makes implementation errors |
+| GLM-4 9B | 5.7 GB | Fast | ⭐⭐ | Quick responses, may struggle with complex tasks |
+| Qwen3-4B | 2.3 GB | Very Fast | ❌ | Generates malformed tool calls - not for agentic use |
+| ~~Qwen3-30B-A3B~~ (MoE) | 17 GB | Very Fast | ❌ | **Avoid** - loops infinitely on tool calls |
+
+**Key findings**:
+- **Dense models** (Qwen3-32B, Qwen3-14B) handle agentic loops correctly
+- **MoE models** (Qwen3-30B-A3B) are fast but don't know when to stop tool-calling
+- **Metal GPU** works well with dense models on Apple Silicon
+- Even the best local models (32B) lag significantly behind Claude Opus 4.5 on complex tasks
+- Local models are best for simpler agentic tasks or when offline/privacy is required
+
+Configuration example:
+```toml
+[providers.embedded.qwen3-big]
+model_path = "~/.g3/models/Qwen_Qwen3-32B-Q4_K_M.gguf"
+model_type = "qwen"
+context_length = 40960
+gpu_layers = 99  # Full GPU offload on Apple Silicon
+```
+
 ### Task Automation
 - Single-shot task execution for quick operations
 - Iterative task mode for complex, multi-step workflows