# G3 LLM Providers Guide

Last updated: January 2025
Source of truth: `crates/g3-providers/src/`
## Purpose
This document describes the LLM providers supported by G3, their capabilities, and how to choose between them.
## Provider Overview
| Provider | Type | Tool Calling | Cache Control | Context Window | Best For |
|---|---|---|---|---|---|
| Anthropic | Cloud | Native | Yes | 200k (1M optional) | General use, complex tasks |
| Databricks | Cloud | Native | Yes (Claude models) | Varies | Enterprise, existing Databricks users |
| OpenAI | Cloud | Native | No | 128k | GPT model preference |
| OpenAI-Compatible | Cloud | Native | No | Varies | OpenRouter, Groq, Together, etc. |
| Embedded | Local | JSON fallback | No | 4k-32k | Privacy, offline, cost savings |
## Anthropic

Location: `crates/g3-providers/src/anthropic.rs`

### Features
- Native tool calling: Full support for structured tool calls
- Prompt caching: Reduce costs with ephemeral caching
- Extended context: Optional 1M token context (additional cost)
- Extended thinking: Budget tokens for complex reasoning
- Streaming: Real-time response streaming
### Configuration

```toml
[providers.anthropic.default]
api_key = "sk-ant-api03-..."     # Required
model = "claude-sonnet-4-5"      # Model name
max_tokens = 64000               # Max output tokens
temperature = 0.3                # 0.0-1.0
cache_config = "ephemeral"       # Optional: enable caching
enable_1m_context = true         # Optional: 1M context
thinking_budget_tokens = 10000   # Optional: extended thinking
```
### Available Models

| Model | Context | Best For |
|---|---|---|
| `claude-sonnet-4-5` | 200k | Balanced performance/cost |
| `claude-opus-4-5` | 200k | Complex reasoning |
| `claude-3-5-sonnet-20241022` | 200k | Previous generation |
| `claude-3-opus-20240229` | 200k | Previous generation |
### Prompt Caching

Enable caching to reduce costs for repeated context:

```toml
cache_config = "ephemeral"  # Cache for session duration
```
Caching is applied to:
- System prompts
- README/AGENTS.md content
- Large tool results
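
On the wire, caching is expressed as a `cache_control` marker on individual content blocks of the Anthropic Messages API. A minimal sketch of the request-body shape (illustrative only; G3's exact request construction lives in `anthropic.rs`):

```rust
use serde_json::json;

fn main() {
    // Sketch of an Anthropic Messages API body with an ephemeral cache
    // marker on the system block; later requests that reuse this block
    // can hit the cache instead of re-processing it.
    let body = json!({
        "model": "claude-sonnet-4-5",
        "max_tokens": 64000,
        "system": [{
            "type": "text",
            "text": "<system prompt + README/AGENTS.md content>",
            "cache_control": { "type": "ephemeral" }
        }],
        "messages": [{ "role": "user", "content": "..." }]
    });
    println!("{body}");
}
```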
### Extended Thinking

For complex tasks requiring step-by-step reasoning:

```toml
thinking_budget_tokens = 10000  # Tokens for internal reasoning
```

The model uses these tokens for planning before responding.
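
At the API level this maps to the `thinking` field of the Anthropic Messages API; a hedged sketch of the body shape (assumed to mirror what G3 sends when `thinking_budget_tokens` is configured):

```rust
use serde_json::json;

fn main() {
    // The budget caps how many tokens the model may spend on internal
    // reasoning before producing the visible answer.
    let body = json!({
        "model": "claude-sonnet-4-5",
        "max_tokens": 64000,
        "thinking": { "type": "enabled", "budget_tokens": 10000 },
        "messages": [{ "role": "user", "content": "..." }]
    });
    println!("{body}");
}
```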
## Databricks

Location: `crates/g3-providers/src/databricks.rs`

### Features
- Foundation Model APIs: Access to various models
- OAuth authentication: Secure browser-based auth
- Token authentication: Personal access tokens
- Enterprise integration: Works with existing Databricks setup
### Configuration

```toml
[providers.databricks.default]
host = "https://your-workspace.cloud.databricks.com"
model = "databricks-claude-sonnet-4"
max_tokens = 4096
temperature = 0.1
use_oauth = true        # Recommended
# token = "dapi..."     # Alternative: PAT
```
### Authentication

**OAuth (Recommended):**

- Set `use_oauth = true`
- On first run, a browser opens for authentication
- Tokens are cached in `~/.databricks/oauth-tokens.json`
- Tokens refresh automatically

**Personal Access Token:**

- Generate a token in your Databricks workspace
- Set `token = "dapi..."` and `use_oauth = false`
### Available Models

Models depend on your Databricks workspace configuration:

- `databricks-claude-sonnet-4` (Claude via Databricks)
- `databricks-meta-llama-3-1-70b-instruct`
- `databricks-dbrx-instruct`
- Custom fine-tuned models
## OpenAI

Location: `crates/g3-providers/src/openai.rs`

### Features
- Native tool calling: Full support
- Custom endpoints: Override base URL
- Streaming: Real-time responses
### Configuration

```toml
[providers.openai.default]
api_key = "sk-..."      # Required
model = "gpt-4-turbo"   # Model name
max_tokens = 4096
temperature = 0.1
# base_url = "https://api.openai.com/v1"  # Optional
```
### Available Models

| Model | Context | Notes |
|---|---|---|
| `gpt-4-turbo` | 128k | Latest GPT-4 |
| `gpt-4o` | 128k | Optimized GPT-4 |
| `gpt-4` | 8k | Original GPT-4 |
| `gpt-3.5-turbo` | 16k | Faster, cheaper |
## OpenAI-Compatible Providers

Location: `crates/g3-providers/src/openai.rs` (reuses the OpenAI implementation)

For services that implement the OpenAI API format.

### Configuration

```toml
# OpenRouter
[providers.openai_compatible.openrouter]
api_key = "sk-or-..."
model = "anthropic/claude-3.5-sonnet"
base_url = "https://openrouter.ai/api/v1"
max_tokens = 4096
temperature = 0.1

# Groq
[providers.openai_compatible.groq]
api_key = "gsk_..."
model = "llama-3.3-70b-versatile"
base_url = "https://api.groq.com/openai/v1"
max_tokens = 4096
temperature = 0.1

# Together
[providers.openai_compatible.together]
api_key = "..."
model = "meta-llama/Llama-3-70b-chat-hf"
base_url = "https://api.together.xyz/v1"
max_tokens = 4096
temperature = 0.1
```
### Supported Services

- OpenRouter: Access to many models through one API
- Groq: Fast inference for Llama models
- Together: Open-source model hosting
- Anyscale: Scalable model serving
- Local servers: Ollama, vLLM, text-generation-inference (see the example below)
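
For example, a local Ollama server can be used through the same mechanism. This is a hedged sketch: it assumes Ollama's OpenAI-compatible endpoint on its default port (11434) and that the model has already been fetched with `ollama pull`:

```toml
[providers.openai_compatible.ollama]
api_key = "ollama"   # Placeholder; Ollama does not check the key
model = "qwen2.5-coder"
base_url = "http://localhost:11434/v1"
max_tokens = 4096
temperature = 0.1
```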
## Embedded (Local Models)

Location: `crates/g3-providers/src/embedded.rs`

### Features
- Completely local: No data leaves your machine
- Offline capable: Works without internet
- GPU acceleration: Metal (macOS), CUDA (Linux)
- No API costs: Free after model download
### Configuration

```toml
[providers.embedded.default]
model_path = "~/.cache/g3/models/qwen2.5-7b-instruct-q3_k_m.gguf"
model_type = "qwen"       # Model architecture
context_length = 32768    # Context window
max_tokens = 2048         # Max output
temperature = 0.1
gpu_layers = 32           # GPU offload (0 = CPU only)
threads = 8               # CPU threads
```
### Supported Model Types

| Type | Models | Notes |
|---|---|---|
| `qwen` | Qwen 2.5 series | Good coding ability |
| `codellama` | Code Llama | Specialized for code |
| `llama` | Llama 2/3 | General purpose |
| `mistral` | Mistral/Mixtral | Efficient |
### Model Download

Download GGUF models from Hugging Face:

```bash
mkdir -p ~/.cache/g3/models
cd ~/.cache/g3/models

# Example: Qwen 2.5 7B
wget https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m.gguf
```
### Hardware Requirements
| Model Size | RAM Required | GPU VRAM | Notes |
|---|---|---|---|
| 7B Q4 | 6GB | 4GB | Good for most tasks |
| 7B Q8 | 10GB | 8GB | Better quality |
| 13B Q4 | 10GB | 8GB | More capable |
| 70B Q4 | 48GB | 40GB | Requires high-end hardware |
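
These figures follow from a simple rule of thumb: quantized weights take roughly `bits / 8` bytes per parameter, plus runtime overhead for the KV cache and buffers. A back-of-envelope sketch (the 20% overhead factor is an assumption, not a measured constant):

```rust
/// Rough memory estimate for a quantized GGUF model.
/// `params_b` is the parameter count in billions; `bits` is the
/// quantization width (e.g. 4 for Q4, 8 for Q8).
fn approx_model_gb(params_b: f64, bits: f64) -> f64 {
    let weights_gb = params_b * bits / 8.0; // 7B at Q4 ≈ 3.5 GB of weights
    weights_gb * 1.2 // assumed ~20% overhead for KV cache and buffers
}

fn main() {
    println!("7B  Q4 ≈ {:.1} GB", approx_model_gb(7.0, 4.0));  // ≈ 4.2 GB
    println!("70B Q4 ≈ {:.1} GB", approx_model_gb(70.0, 4.0)); // ≈ 42 GB
}
```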
### GPU Acceleration

**macOS (Metal):**

```toml
gpu_layers = 32  # Offload layers to GPU
```

**Linux (CUDA):** Requires the CUDA toolkit to be installed.

**CPU only:**

```toml
gpu_layers = 0
threads = 8  # Use more threads
```
### Tool Calling

Embedded models don't have native tool calling, so G3 uses a JSON fallback:
- System prompt includes tool definitions as JSON
- Model outputs tool calls as JSON in response
- G3 parses JSON and executes tools
This works but is less reliable than native tool calling.
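
A minimal sketch of what the parsing step can look like. The `ToolCall` shape and the extraction heuristic here are hypothetical; G3's actual fallback parser is more robust:

```rust
use serde::Deserialize;

// Hypothetical tool-call shape; the real format G3 asks the model
// to emit may differ.
#[derive(Debug, Deserialize)]
struct ToolCall {
    tool: String,
    args: serde_json::Value,
}

/// Find the outermost JSON object in the model's reply and try to
/// deserialize it as a tool call.
fn extract_tool_call(output: &str) -> Option<ToolCall> {
    let start = output.find('{')?;
    let end = output.rfind('}')?;
    serde_json::from_str(&output[start..=end]).ok()
}

fn main() {
    let reply = r#"I'll read the file first. {"tool": "read_file", "args": {"path": "src/main.rs"}}"#;
    if let Some(call) = extract_tool_call(reply) {
        println!("executing {} with args {}", call.tool, call.args);
    }
}
```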
## Provider Selection Guide

### By Use Case
| Use Case | Recommended Provider |
|---|---|
| General coding tasks | Anthropic (Claude Sonnet) |
| Complex reasoning | Anthropic (Claude Opus) |
| Enterprise/compliance | Databricks |
| Cost-sensitive | Embedded or Groq |
| Privacy-critical | Embedded |
| Offline development | Embedded |
| Fast iteration | Groq (Llama) |
| Model variety | OpenRouter |
### By Priority

**Quality first:** Anthropic Claude Opus/Sonnet

- Best reasoning and coding ability
- Native tool calling
- Prompt caching for efficiency

**Cost first:** Embedded or OpenAI-compatible

- Embedded: Free after download
- Groq: Very cheap, fast
- OpenRouter: Pay-per-use, many options

**Privacy first:** Embedded

- Data never leaves your machine
- No API calls
- Full control

**Speed first:** Groq or Embedded with GPU

- Groq: Extremely fast inference
- Embedded with Metal/CUDA: Low latency
## Provider Trait

All providers implement the `LLMProvider` trait:

```rust
#[async_trait]
pub trait LLMProvider: Send + Sync {
    /// Generate a completion
    async fn complete(&self, request: CompletionRequest) -> Result<CompletionResponse>;

    /// Stream a completion
    async fn stream(&self, request: CompletionRequest) -> Result<CompletionStream>;

    /// Provider name (e.g., "anthropic.default")
    fn name(&self) -> &str;

    /// Model name (e.g., "claude-sonnet-4-5")
    fn model(&self) -> &str;

    /// Whether the provider supports native tool calling
    fn has_native_tool_calling(&self) -> bool;

    /// Whether the provider supports cache control
    fn supports_cache_control(&self) -> bool;

    /// Configured max tokens
    fn max_tokens(&self) -> u32;

    /// Configured temperature
    fn temperature(&self) -> f32;
}
```
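
Because everything sits behind this trait, calling code can stay provider-agnostic. A small sketch using only the methods above:

```rust
// Uses only the trait methods shown above; works for any provider.
fn describe(provider: &dyn LLMProvider) {
    println!("{} -> {}", provider.name(), provider.model());
    if !provider.has_native_tool_calling() {
        // e.g. embedded models: the JSON tool-call fallback applies
        println!("note: JSON tool-call fallback will be used");
    }
    if provider.supports_cache_control() {
        println!("note: prompt caching available");
    }
}
```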
## Adding a New Provider

1. Create `crates/g3-providers/src/newprovider.rs`
2. Implement the `LLMProvider` trait (see the skeleton sketch below)
3. Add a configuration struct to `crates/g3-config/src/lib.rs`
4. Register in `crates/g3-core/src/lib.rs` (`new_with_mode_and_readme`)
5. Export from `crates/g3-providers/src/lib.rs`
6. Update documentation
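
A hedged skeleton for step 2, assuming the trait and types shown above (the fields of `CompletionRequest`/`CompletionResponse` are not documented here, so the bodies are left as `todo!()`):

```rust
use async_trait::async_trait;

pub struct NewProvider {
    model: String,
    max_tokens: u32,
    temperature: f32,
}

#[async_trait]
impl LLMProvider for NewProvider {
    async fn complete(&self, _request: CompletionRequest) -> Result<CompletionResponse> {
        // Translate the request into the upstream HTTP API and map the
        // response back into a CompletionResponse.
        todo!("call the provider's completion endpoint")
    }

    async fn stream(&self, _request: CompletionRequest) -> Result<CompletionStream> {
        todo!("open the provider's streaming endpoint")
    }

    fn name(&self) -> &str { "newprovider.default" }
    fn model(&self) -> &str { &self.model }

    // Advertise native tool calling only if the upstream API has it;
    // otherwise G3 uses the JSON fallback described earlier.
    fn has_native_tool_calling(&self) -> bool { false }
    fn supports_cache_control(&self) -> bool { false }
    fn max_tokens(&self) -> u32 { self.max_tokens }
    fn temperature(&self) -> f32 { self.temperature }
}
```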
## Troubleshooting

### Authentication Errors

**Anthropic:** Verify the API key starts with `sk-ant-`.

**Databricks OAuth:**

- Delete `~/.databricks/oauth-tokens.json` and re-authenticate
- Ensure the workspace URL is correct

**OpenAI:** Verify the API key and check billing status.
### Rate Limits

G3 automatically retries rate-limited requests with exponential backoff.
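
Conceptually, the retry loop looks like this generic sketch (illustrative only, not G3's actual code):

```rust
use std::time::Duration;

/// Retry an async operation with exponential backoff: wait 1s, 2s, 4s, ...
/// between failed attempts, then make one final attempt.
async fn with_backoff<T, E, F, Fut>(mut attempt: F, max_retries: u32) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut delay = Duration::from_secs(1);
    for _ in 0..max_retries {
        match attempt().await {
            Ok(value) => return Ok(value),
            Err(_) => {
                tokio::time::sleep(delay).await; // back off before retrying
                delay *= 2;                      // double the wait each time
            }
        }
    }
    attempt().await // last try; propagate its error
}
```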
To reduce rate-limit issues:

- Use prompt caching (Anthropic)
- Reduce `max_tokens`
- Use a provider with higher limits
### Context Window Errors

If you see "context too long" errors:

- Use `/compact` to compact the conversation
- Use `/thinnify` to replace large tool results
- Increase `max_context_length` in config
- Switch to a provider with a larger context window
### Embedded Model Issues

**Model not loading:**

- Verify `model_path` is correct
- Check file permissions
- Ensure enough RAM is available

**Slow inference:**

- Increase `gpu_layers` for GPU offload
- Reduce `context_length`
- Use a smaller quantization (e.g., Q4 instead of Q8)

**Poor tool calling:**

- Embedded models use the JSON fallback
- Consider a cloud provider for complex tool use