Refactor system prompts to eliminate duplication; upgrade embedded provider

- Refactor prompts.rs: extract shared sections (intro, TODO, workspace memory,
  web research, response guidelines) used by both native and non-native prompts
- Fix typo in native prompt: "save them.." -> "save them."
- Fix non-native prompt: add missing closing braces in JSON examples,
  add IMPORTANT steps section, align with native prompt quality
- Add 9 unit tests to verify both prompts contain required sections
- Upgrade llama-cpp-2 dependency and refactor embedded provider
- Update config.example.toml with embedded model examples
- Update workspace memory
Dhanji R. Prasanna
2026-01-28 09:56:39 +11:00
parent 585684a86e
commit a902be1562
9 changed files with 1027 additions and 851 deletions
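
For reference, a minimal sketch of the shared-section pattern the commit message describes, with a test in the spirit of the new prompt checks. Constant and function names (`SHARED_INTRO`, `build_native_prompt`, ...) are illustrative assumptions, not the actual identifiers in `prompts.rs`:

```rust
// Illustrative only: names are assumptions, not the real identifiers
// in crates/g3-core/src/prompts.rs.
const SHARED_INTRO: &str = "You are g3, a coding agent.";
const SHARED_TODO: &str = "Maintain a TODO list and mark items complete as you go.";

fn build_native_prompt() -> String {
    // Native tool calling: the provider supplies the tool schema,
    // so the prompt only needs the shared sections.
    format!("{SHARED_INTRO}\n\n{SHARED_TODO}")
}

fn build_non_native_prompt() -> String {
    // Non-native tool calling: layer JSON tool-call instructions
    // on top of the same shared sections.
    format!("{SHARED_INTRO}\n\n{SHARED_TODO}\n\nEmit tool calls as JSON objects.")
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn both_prompts_contain_shared_sections() {
        for prompt in [build_native_prompt(), build_non_native_prompt()] {
            assert!(prompt.contains(SHARED_INTRO));
            assert!(prompt.contains(SHARED_TODO));
        }
    }
}
```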


@@ -1,5 +1,5 @@
# Workspace Memory
> Updated: 2026-01-27T00:12:18Z | Size: 19.5k chars
> Updated: 2026-01-27T02:55:20Z | Size: 21.9k chars
### Remember Tool Wiring
- `crates/g3-core/src/tools/memory.rs` [0..5000] - `execute_remember()`, `get_memory_path()`, `merge_memory()`
@@ -346,4 +346,58 @@ Tracks prompt/prefix caching efficacy across Anthropic and OpenAI providers.
- `crates/g3-core/src/stats.rs`
- `AgentStatsSnapshot.cache_stats` [20] - reference to cache stats for formatting
- `format_cache_stats()` [189..230] - formats cache statistics section with hit rate and efficiency metrics
### Embedded Provider (Local LLM via llama.cpp)
Local model inference using llama-cpp-rs bindings with Metal acceleration on macOS.
- `crates/g3-providers/src/embedded.rs`
- `EmbeddedProvider` [22..85] - struct with session, model_name, max_tokens, temperature, context_length
- `new()` [26..85] - constructor, handles tilde expansion, auto-downloads Qwen if missing
- `format_messages()` [87..175] - converts Message[] to prompt string, supports Qwen/Mistral/Llama templates
- `get_stop_sequences()` [280..340] - returns model-specific stop tokens
- `generate_completion()` [177..278] - non-streaming inference with timeout
- `stream()` [560..780] - streaming inference via spawn_blocking + mpsc channel
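A minimal sketch of the `spawn_blocking` + mpsc shape noted for `stream()` above; the blocking llama.cpp decode loop is stubbed with a fixed token list, and the names here are assumptions rather than the real `embedded.rs` API:
```rust
// Sketch of streaming via spawn_blocking + mpsc; inference is stubbed.
use tokio::sync::mpsc;
use tokio_stream::{wrappers::ReceiverStream, StreamExt};

fn stream_tokens() -> ReceiverStream<String> {
    let (tx, rx) = mpsc::channel::<String>(64);

    // llama.cpp decoding is blocking work, so keep it off the async runtime.
    tokio::task::spawn_blocking(move || {
        for token in ["Hello", ", ", "world"] {
            // blocking_send is the non-async send for use inside spawn_blocking.
            if tx.blocking_send(token.to_string()).is_err() {
                break; // receiver dropped; stop generating
            }
        }
    });

    ReceiverStream::new(rx)
}

#[tokio::main]
async fn main() {
    let mut tokens = stream_tokens();
    while let Some(tok) = tokens.next().await {
        print!("{tok}");
    }
}
```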
**Known Issues (as of 2026-01):**
- Provider name hardcoded as `"embedded"` instead of `"embedded.{name}"` format
- Missing GLM-4 chat template (uses `[gMASK]<sop><|role|>` NOT ChatML)
- Missing `has_native_tool_calling()` override (defaults to false)
- No streaming usage tracking (unlike Anthropic)
- No tool streaming hints (`make_tool_streaming_hint()` not used)
### Chat Template Formats (Embedded Provider)
| Model | Format | Start Token | End Token |
|-------|--------|-------------|----------|
| Qwen | ChatML | `<\|im_start\|>role\n` | `<\|im_end\|>` |
| GLM-4 | ChatGLM4 | `[gMASK]<sop><\|role\|>\n` | `<\|endoftext\|>` |
| Mistral | Instruct | `<s>[INST]` | `[/INST]` |
| Llama | Llama2 | `<<SYS>>` | `<</SYS>>` |
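A rough sketch of per-model formatting along the lines of `format_messages()`, covering the Qwen (ChatML) and GLM-4 rows above; the `Message` shape and function names are assumed for illustration only:
```rust
// Illustrative formatting per the table above; the real format_messages()
// signature and Message type in embedded.rs will differ.
struct Message {
    role: String,    // "system" | "user" | "assistant"
    content: String,
}

// Qwen (ChatML): <|im_start|>role\n ... <|im_end|>
fn format_chatml(messages: &[Message]) -> String {
    let mut out = String::new();
    for m in messages {
        out.push_str(&format!("<|im_start|>{}\n{}<|im_end|>\n", m.role, m.content));
    }
    out.push_str("<|im_start|>assistant\n"); // leave the assistant turn open
    out
}

// GLM-4 (ChatGLM4): [gMASK]<sop> prefix with <|role|> headers, NOT ChatML;
// <|endoftext|> is handled as a stop sequence rather than appended here.
fn format_glm4(messages: &[Message]) -> String {
    let mut out = String::from("[gMASK]<sop>");
    for m in messages {
        out.push_str(&format!("<|{}|>\n{}", m.role, m.content));
    }
    out.push_str("<|assistant|>\n");
    out
}
```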
### GLM-4 Model Downloads
Recommended GGUF models for Mac M4 Max with 128GB unified memory.
**Download commands:**
```bash
# GLM-4 9B Q8_0 (~10GB) - Very capable, fast
python3 -m huggingface_hub.commands.huggingface_cli download bartowski/THUDM_GLM-4-9B-0414-GGUF \
--include "THUDM_GLM-4-9B-0414-Q8_0.gguf" --local-dir ~/.g3/models/
# GLM-4 32B Q6_K_L (~27GB) - TOP TIER for coding/reasoning
python3 -m huggingface_hub.commands.huggingface_cli download bartowski/THUDM_GLM-4-32B-0414-GGUF \
--include "THUDM_GLM-4-32B-0414-Q6_K_L.gguf" --local-dir ~/.g3/models/
# Qwen3 4B Q4_K_M (~2.3GB) - Small but rivals 72B performance
python3 -m huggingface_hub.commands.huggingface_cli download Qwen/Qwen3-4B-GGUF \
--include "qwen3-4b-q4_k_m.gguf" --local-dir ~/.g3/models/
```
**Config example:**
```toml
[providers.embedded.glm4]
model_path = "~/.g3/models/THUDM_GLM-4-32B-0414-Q6_K_L.gguf"
model_type = "glm4"
context_length = 32768
max_tokens = 4096
gpu_layers = 99
```
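As a rough illustration only, the `[providers.embedded.*]` table above could map onto a config struct like the following; field names mirror the TOML keys, but the actual g3 config types and the `embedded.{name}` derivation flagged under Known Issues may differ:
```rust
// Rough illustration: field names mirror the TOML keys above, but the
// actual g3 config structs and provider wiring may differ.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct EmbeddedModelConfig {
    model_path: String,   // tilde-expanded by the provider at load time
    model_type: String,   // selects the chat template: "qwen", "glm4", ...
    context_length: u32,
    max_tokens: u32,
    gpu_layers: u32,      // 99 offloads effectively all layers to Metal
}

// Per the Known Issues note, the provider name could be derived from the
// config key instead of the hardcoded "embedded":
fn provider_name(config_key: &str) -> String {
    format!("embedded.{config_key}") // e.g. [providers.embedded.glm4] -> "embedded.glm4"
}
```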