G3 LLM Providers Guide

Last updated: January 2025
Source of truth: crates/g3-providers/src/

Purpose

This document describes the LLM providers supported by G3, their capabilities, and how to choose between them.

Provider Overview

| Provider | Type | Tool Calling | Cache Control | Context Window | Best For |
|----------|------|--------------|---------------|----------------|----------|
| Anthropic | Cloud | Native | Yes | 200k (1M optional) | General use, complex tasks |
| Databricks | Cloud | Native | Yes (Claude models) | Varies | Enterprise, existing Databricks users |
| OpenAI | Cloud | Native | No | 128k | GPT model preference |
| OpenAI-Compatible | Cloud | Native | No | Varies | OpenRouter, Groq, Together, etc. |
| Embedded | Local | JSON fallback | No | 4k-32k | Privacy, offline, cost savings |

Anthropic

Location: crates/g3-providers/src/anthropic.rs

Features

  • Native tool calling: Full support for structured tool calls
  • Prompt caching: Reduce costs with ephemeral caching
  • Extended context: Optional 1M token context (additional cost)
  • Extended thinking: Budget tokens for complex reasoning
  • Streaming: Real-time response streaming

Configuration

[providers.anthropic.default]
api_key = "sk-ant-api03-..."     # Required
model = "claude-sonnet-4-5"      # Model name
max_tokens = 64000               # Max output tokens
temperature = 0.3                # 0.0-1.0
cache_config = "ephemeral"       # Optional: Enable caching
enable_1m_context = true          # Optional: 1M context
thinking_budget_tokens = 10000    # Optional: Extended thinking

Available Models

| Model | Context | Best For |
|-------|---------|----------|
| claude-sonnet-4-5 | 200k | Balanced performance/cost |
| claude-opus-4-5 | 200k | Complex reasoning |
| claude-3-5-sonnet-20241022 | 200k | Previous generation |
| claude-3-opus-20240229 | 200k | Previous generation |

Prompt Caching

Enable caching to reduce costs for repeated context:

cache_config = "ephemeral"  # Cache for session duration

Caching is applied to:

  • System prompts
  • README/AGENTS.md content
  • Large tool results
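
On the public Anthropic Messages API, a cache breakpoint is expressed as a cache_control field on a content block. The Rust sketch below is illustrative only (the exact request construction lives in anthropic.rs) and shows the shape of such a request body built with serde_json:

```rust
use serde_json::json;

fn main() {
    // Illustrative request body with an ephemeral cache breakpoint on the
    // system prompt; g3's actual construction in anthropic.rs may differ.
    let body = json!({
        "model": "claude-sonnet-4-5",
        "max_tokens": 64000,
        "system": [{
            "type": "text",
            "text": "<system prompt, README/AGENTS.md content>",
            "cache_control": { "type": "ephemeral" }
        }],
        "messages": [{ "role": "user", "content": "Summarize the build setup." }]
    });
    println!("{body}");
}
```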

Extended Thinking

For complex tasks requiring step-by-step reasoning:

thinking_budget_tokens = 10000  # Tokens for internal reasoning

The model uses these tokens for planning before responding.
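
On the public Anthropic API this budget is carried by a top-level thinking block; the snippet below is a hedged sketch of that mapping (thinking_budget_tokens is g3's config key, budget_tokens is the API's field):

```rust
use serde_json::json;

fn main() {
    // Hedged sketch: thinking_budget_tokens from the config presumably maps
    // onto the API's budget_tokens field.
    let thinking = json!({ "type": "enabled", "budget_tokens": 10000 });
    println!("{thinking}");
}
```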


Databricks

Location: crates/g3-providers/src/databricks.rs

Features

  • Foundation Model APIs: Access to various models
  • OAuth authentication: Secure browser-based auth
  • Token authentication: Personal access tokens
  • Enterprise integration: Works with existing Databricks setup

Configuration

[providers.databricks.default]
host = "https://your-workspace.cloud.databricks.com"
model = "databricks-claude-sonnet-4"
max_tokens = 4096
temperature = 0.1
use_oauth = true              # Recommended
# token = "dapi..."           # Alternative: PAT

Authentication

OAuth (Recommended):

  1. Set use_oauth = true
  2. On first run, browser opens for authentication
  3. Tokens are cached in ~/.databricks/oauth-tokens.json (checked in the sketch below)
  4. Tokens refresh automatically
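
As a small illustration (not code from g3), the sketch below checks whether that token cache is present; deleting the file forces a fresh browser login on the next run:

```rust
use std::path::PathBuf;

/// Locate the cached Databricks OAuth token file, if any.
fn databricks_token_cache() -> Option<PathBuf> {
    let home = std::env::var_os("HOME")?;
    let path = PathBuf::from(home).join(".databricks").join("oauth-tokens.json");
    path.exists().then_some(path)
}

fn main() {
    match databricks_token_cache() {
        Some(p) => println!("cached OAuth tokens: {}", p.display()),
        None => println!("no cached tokens; a browser window will open to authenticate"),
    }
}
```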

Personal Access Token:

  1. Generate token in Databricks workspace
  2. Set token = "dapi..." and use_oauth = false

Available Models

Models depend on your Databricks workspace configuration:

  • databricks-claude-sonnet-4 (Claude via Databricks)
  • databricks-meta-llama-3-1-70b-instruct
  • databricks-dbrx-instruct
  • Custom fine-tuned models

OpenAI

Location: crates/g3-providers/src/openai.rs

Features

  • Native tool calling: Full support
  • Custom endpoints: Override base URL
  • Streaming: Real-time responses

Configuration

[providers.openai.default]
api_key = "sk-..."               # Required
model = "gpt-4-turbo"            # Model name
max_tokens = 4096
temperature = 0.1
# base_url = "https://api.openai.com/v1"  # Optional

Available Models

| Model | Context | Notes |
|-------|---------|-------|
| gpt-4-turbo | 128k | Latest GPT-4 |
| gpt-4o | 128k | Optimized GPT-4 |
| gpt-4 | 8k | Original GPT-4 |
| gpt-3.5-turbo | 16k | Faster, cheaper |

OpenAI-Compatible Providers

Location: crates/g3-providers/src/openai.rs (reuses OpenAI implementation)

Use this provider type for any service that implements the OpenAI API format.

Configuration

# OpenRouter
[providers.openai_compatible.openrouter]
api_key = "sk-or-..."
model = "anthropic/claude-3.5-sonnet"
base_url = "https://openrouter.ai/api/v1"
max_tokens = 4096
temperature = 0.1

# Groq
[providers.openai_compatible.groq]
api_key = "gsk_..."
model = "llama-3.3-70b-versatile"
base_url = "https://api.groq.com/openai/v1"
max_tokens = 4096
temperature = 0.1

# Together
[providers.openai_compatible.together]
api_key = "..."
model = "meta-llama/Llama-3-70b-chat-hf"
base_url = "https://api.together.xyz/v1"
max_tokens = 4096
temperature = 0.1

Supported Services

  • OpenRouter: Access to many models through one API
  • Groq: Fast inference for Llama models
  • Together: Open-source model hosting
  • Anyscale: Scalable model serving
  • Local servers: Ollama, vLLM, text-generation-inference

Embedded (Local Models)

Location: crates/g3-providers/src/embedded.rs

Features

  • Completely local: No data leaves your machine
  • Offline capable: Works without internet
  • GPU acceleration: Metal (macOS), CUDA (Linux)
  • No API costs: Free after model download

Configuration

[providers.embedded.default]
model_path = "~/.cache/g3/models/qwen2.5-7b-instruct-q3_k_m.gguf"
model_type = "qwen"              # Model architecture
context_length = 32768           # Context window
max_tokens = 2048                # Max output
temperature = 0.1
gpu_layers = 32                  # GPU offload (0 = CPU only)
threads = 8                      # CPU threads

Supported Model Types

| Type | Models | Notes |
|------|--------|-------|
| qwen | Qwen 2.5 series | Good coding ability |
| codellama | Code Llama | Specialized for code |
| llama | Llama 2/3 | General purpose |
| mistral | Mistral/Mixtral | Efficient |

Model Download

Download GGUF models from Hugging Face:

mkdir -p ~/.cache/g3/models
cd ~/.cache/g3/models

# Example: Qwen 2.5 7B
wget https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m.gguf

Hardware Requirements

| Model Size | RAM Required | GPU VRAM | Notes |
|------------|--------------|----------|-------|
| 7B Q4 | 6GB | 4GB | Good for most tasks |
| 7B Q8 | 10GB | 8GB | Better quality |
| 13B Q4 | 10GB | 8GB | More capable |
| 70B Q4 | 48GB | 40GB | Requires high-end hardware |

GPU Acceleration

macOS (Metal):

gpu_layers = 32  # Offload layers to GPU

Linux (CUDA): Requires CUDA toolkit installed.

CPU Only:

gpu_layers = 0
threads = 8  # Use more threads

Tool Calling

Embedded models don't support native tool calling, so G3 falls back to JSON-based tool calls:

  1. System prompt includes tool definitions as JSON
  2. Model outputs tool calls as JSON in response
  3. G3 parses JSON and executes tools

This works but is less reliable than native tool calling.
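
As a rough sketch of step 3 (the exact JSON schema g3 asks the model to emit lives in the codebase; the {"tool": ..., "args": ...} shape below is assumed for illustration):

```rust
use serde_json::Value;

/// Pull the first {...} object out of a model reply and read a hypothetical
/// {"tool": ..., "args": ...} tool call from it.
fn extract_tool_call(reply: &str) -> Option<(String, Value)> {
    let start = reply.find('{')?;
    let end = reply.rfind('}')?;
    let value: Value = serde_json::from_str(&reply[start..=end]).ok()?;
    let tool = value.get("tool")?.as_str()?.to_string();
    let args = value.get("args").cloned().unwrap_or(Value::Null);
    Some((tool, args))
}

fn main() {
    let reply = r#"I'll read it now. {"tool": "read_file", "args": {"path": "src/main.rs"}}"#;
    if let Some((tool, args)) = extract_tool_call(reply) {
        println!("call {tool} with {args}");
    }
}
```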


Provider Selection Guide

By Use Case

| Use Case | Recommended Provider |
|----------|----------------------|
| General coding tasks | Anthropic (Claude Sonnet) |
| Complex reasoning | Anthropic (Claude Opus) |
| Enterprise/compliance | Databricks |
| Cost-sensitive | Embedded or Groq |
| Privacy-critical | Embedded |
| Offline development | Embedded |
| Fast iteration | Groq (Llama) |
| Model variety | OpenRouter |

By Priority

Quality first: Anthropic Claude Opus/Sonnet

  • Best reasoning and coding ability
  • Native tool calling
  • Prompt caching for efficiency

Cost first: Embedded or OpenAI-compatible

  • Embedded: Free after download
  • Groq: Very cheap, fast
  • OpenRouter: Pay-per-use, many options

Privacy first: Embedded

  • Data never leaves your machine
  • No API calls
  • Full control

Speed first: Groq or Embedded with GPU

  • Groq: Extremely fast inference
  • Embedded with Metal/CUDA: Low latency

Provider Trait

All providers implement the LLMProvider trait:

#[async_trait]
pub trait LLMProvider: Send + Sync {
    /// Generate a completion
    async fn complete(&self, request: CompletionRequest) -> Result<CompletionResponse>;
    
    /// Stream a completion
    async fn stream(&self, request: CompletionRequest) -> Result<CompletionStream>;
    
    /// Provider name (e.g., "anthropic.default")
    fn name(&self) -> &str;
    
    /// Model name (e.g., "claude-sonnet-4-5")
    fn model(&self) -> &str;
    
    /// Whether provider supports native tool calling
    fn has_native_tool_calling(&self) -> bool;
    
    /// Whether provider supports cache control
    fn supports_cache_control(&self) -> bool;
    
    /// Configured max tokens
    fn max_tokens(&self) -> u32;
    
    /// Configured temperature
    fn temperature(&self) -> f32;
}
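
Callers in g3-core can hold any provider behind this trait and branch on its capabilities. A purely illustrative consumer-side sketch (CompletionRequest construction omitted, since its fields aren't shown here):

```rust
// Illustrative only -- not code from g3-core.
async fn run(
    provider: Box<dyn LLMProvider>,
    request: CompletionRequest,
) -> Result<CompletionResponse> {
    if !provider.has_native_tool_calling() {
        // e.g. the embedded provider: tool definitions must already be in the
        // prompt as JSON (see the Embedded section above).
    }
    provider.complete(request).await
}
```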

Adding a New Provider

  1. Create crates/g3-providers/src/newprovider.rs
  2. Implement the LLMProvider trait (a minimal skeleton is sketched after these steps)
  3. Add configuration struct to crates/g3-config/src/lib.rs
  4. Register in crates/g3-core/src/lib.rs (new_with_mode_and_readme)
  5. Export from crates/g3-providers/src/lib.rs
  6. Update documentation
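
A hypothetical skeleton for step 2; the names, fields, and stubbed bodies below are placeholders, not code from the repository:

```rust
use async_trait::async_trait;

// Hypothetical provider skeleton -- everything below is a placeholder.
pub struct NewProvider {
    model: String,
    max_tokens: u32,
    temperature: f32,
}

#[async_trait]
impl LLMProvider for NewProvider {
    async fn complete(&self, _request: CompletionRequest) -> Result<CompletionResponse> {
        todo!("call the service's HTTP API and map the response")
    }

    async fn stream(&self, _request: CompletionRequest) -> Result<CompletionStream> {
        todo!("wrap the service's streaming endpoint")
    }

    fn name(&self) -> &str { "newprovider.default" }
    fn model(&self) -> &str { &self.model }
    fn has_native_tool_calling(&self) -> bool { false }
    fn supports_cache_control(&self) -> bool { false }
    fn max_tokens(&self) -> u32 { self.max_tokens }
    fn temperature(&self) -> f32 { self.temperature }
}
```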

Troubleshooting

Authentication Errors

Anthropic: Verify API key starts with sk-ant-

Databricks OAuth:

  • Delete ~/.databricks/oauth-tokens.json and re-authenticate
  • Ensure workspace URL is correct

OpenAI: Verify API key and check billing status

Rate Limits

G3 automatically retries on rate limits with exponential backoff.
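
For illustration only (not g3's actual retry code, and it assumes CompletionRequest is cloneable), a retry loop with exponential backoff has roughly this shape:

```rust
use std::time::Duration;

// Illustrative backoff loop: wait 1s, 2s, 4s, ... between failed attempts.
async fn complete_with_retry(
    provider: &dyn LLMProvider,
    request: CompletionRequest,
    max_attempts: u32,
) -> Result<CompletionResponse> {
    let mut delay = Duration::from_secs(1);
    let mut attempt = 1;
    loop {
        match provider.complete(request.clone()).await {
            Ok(response) => return Ok(response),
            Err(err) if attempt < max_attempts => {
                eprintln!("attempt {attempt} failed ({err}); retrying in {delay:?}");
                tokio::time::sleep(delay).await;
                delay *= 2;
                attempt += 1;
            }
            Err(err) => return Err(err),
        }
    }
}
```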

To reduce rate limit issues:

  • Use prompt caching (Anthropic)
  • Reduce max_tokens
  • Use a provider with higher limits

Context Window Errors

If you see "context too long" errors:

  1. Use /compact to compact the conversation
  2. Use /thinnify to replace large tool results
  3. Increase max_context_length in config
  4. Switch to a provider with larger context

Embedded Model Issues

Model not loading:

  • Verify model_path is correct
  • Check file permissions
  • Ensure enough RAM

Slow inference:

  • Increase gpu_layers for GPU offload
  • Reduce context_length
  • Use a smaller quantization (Q4 vs Q8)

Poor tool calling:

  • Embedded models use JSON fallback
  • Consider cloud provider for complex tool use