G3 LLM Providers Guide

Last updated: January 2025
Source of truth: crates/g3-providers/src/

Purpose

This document describes the LLM providers supported by G3, their capabilities, and how to choose between them.

Provider Overview

| Provider | Type | Tool Calling | Cache Control | Context Window | Best For |
|----------|------|--------------|---------------|----------------|----------|
| Anthropic | Cloud | Native | Yes | 200k (1M optional) | General use, complex tasks |
| Databricks | Cloud | Native | Yes (Claude models) | Varies | Enterprise, existing Databricks users |
| OpenAI | Cloud | Native | No | 128k | GPT model preference |
| OpenAI-Compatible | Cloud | Native | No | Varies | OpenRouter, Groq, Together, etc. |
| Embedded | Local | JSON fallback | No | 4k-32k | Privacy, offline, cost savings |

Anthropic

Location: crates/g3-providers/src/anthropic.rs

Features

  • Native tool calling: Full support for structured tool calls
  • Prompt caching: Reduce costs with ephemeral caching
  • Extended context: Optional 1M token context (additional cost)
  • Extended thinking: Budget tokens for complex reasoning
  • Streaming: Real-time response streaming

Configuration

[providers.anthropic.default]
api_key = "sk-ant-api03-..."     # Required
model = "claude-sonnet-4-5"      # Model name
max_tokens = 64000               # Max output tokens
temperature = 0.3                # 0.0-1.0
cache_config = "ephemeral"       # Optional: Enable caching
enable_1m_context = true          # Optional: 1M context
thinking_budget_tokens = 10000    # Optional: Extended thinking

Available Models

| Model | Context | Best For |
|-------|---------|----------|
| claude-sonnet-4-5 | 200k | Balanced performance/cost |
| claude-opus-4-5 | 200k | Complex reasoning |
| claude-3-5-sonnet-20241022 | 200k | Previous generation |
| claude-3-opus-20240229 | 200k | Previous generation |

Prompt Caching

Enable caching to reduce costs for repeated context:

cache_config = "ephemeral"  # Cache for session duration

Caching is applied to:

  • System prompts
  • README/AGENTS.md content
  • Large tool results
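
On the public Anthropic Messages API, a cache breakpoint is expressed as a cache_control field on a content block. The Rust sketch below is illustrative only (the exact request construction lives in anthropic.rs) and shows the shape of such a request body built with serde_json:

```rust
use serde_json::json;

fn main() {
    // Illustrative request body with an ephemeral cache breakpoint on the
    // system prompt; g3's actual construction in anthropic.rs may differ.
    let body = json!({
        "model": "claude-sonnet-4-5",
        "max_tokens": 64000,
        "system": [{
            "type": "text",
            "text": "<system prompt, README/AGENTS.md content>",
            "cache_control": { "type": "ephemeral" }
        }],
        "messages": [{ "role": "user", "content": "Summarize the build setup." }]
    });
    println!("{body}");
}
```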

Extended Thinking

For complex tasks requiring step-by-step reasoning:

thinking_budget_tokens = 10000  # Tokens for internal reasoning

The model uses these tokens for planning before responding.
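
On the public Anthropic API this budget is carried by a top-level thinking block; the snippet below is a hedged sketch of that mapping (thinking_budget_tokens is g3's config key, budget_tokens is the API's field):

```rust
use serde_json::json;

fn main() {
    // Hedged sketch: thinking_budget_tokens from the config presumably maps
    // onto the API's budget_tokens field.
    let thinking = json!({ "type": "enabled", "budget_tokens": 10000 });
    println!("{thinking}");
}
```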


Databricks

Location: crates/g3-providers/src/databricks.rs

Features

  • Foundation Model APIs: Access to various models
  • OAuth authentication: Secure browser-based auth
  • Token authentication: Personal access tokens
  • Enterprise integration: Works with existing Databricks setup

Configuration

[providers.databricks.default]
host = "https://your-workspace.cloud.databricks.com"
model = "databricks-claude-sonnet-4"
max_tokens = 4096
temperature = 0.1
use_oauth = true              # Recommended
# token = "dapi..."           # Alternative: PAT

Authentication

OAuth (Recommended):

  1. Set use_oauth = true
  2. On first run, browser opens for authentication
  3. Tokens are cached in ~/.databricks/oauth-tokens.json (checked in the sketch below)
  4. Tokens refresh automatically
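
As a small illustration (not code from g3), the sketch below checks whether that token cache is present; deleting the file forces a fresh browser login on the next run:

```rust
use std::path::PathBuf;

/// Locate the cached Databricks OAuth token file, if any.
fn databricks_token_cache() -> Option<PathBuf> {
    let home = std::env::var_os("HOME")?;
    let path = PathBuf::from(home).join(".databricks").join("oauth-tokens.json");
    path.exists().then_some(path)
}

fn main() {
    match databricks_token_cache() {
        Some(p) => println!("cached OAuth tokens: {}", p.display()),
        None => println!("no cached tokens; a browser window will open to authenticate"),
    }
}
```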

Personal Access Token:

  1. Generate token in Databricks workspace
  2. Set token = "dapi..." and use_oauth = false

Available Models

Models depend on your Databricks workspace configuration:

  • databricks-claude-sonnet-4 (Claude via Databricks)
  • databricks-meta-llama-3-1-70b-instruct
  • databricks-dbrx-instruct
  • Custom fine-tuned models

OpenAI

Location: crates/g3-providers/src/openai.rs

Features

  • Native tool calling: Full support
  • Custom endpoints: Override base URL
  • Streaming: Real-time responses

Configuration

[providers.openai.default]
api_key = "sk-..."               # Required
model = "gpt-4-turbo"            # Model name
max_tokens = 4096
temperature = 0.1
# base_url = "https://api.openai.com/v1"  # Optional

Available Models

| Model | Context | Notes |
|-------|---------|-------|
| gpt-4-turbo | 128k | Latest GPT-4 |
| gpt-4o | 128k | Optimized GPT-4 |
| gpt-4 | 8k | Original GPT-4 |
| gpt-3.5-turbo | 16k | Faster, cheaper |

OpenAI-Compatible Providers

Location: crates/g3-providers/src/openai.rs (reuses OpenAI implementation)

Use this provider type for any service that implements the OpenAI API format.

Configuration

# OpenRouter
[providers.openai_compatible.openrouter]
api_key = "sk-or-..."
model = "anthropic/claude-3.5-sonnet"
base_url = "https://openrouter.ai/api/v1"
max_tokens = 4096
temperature = 0.1

# Groq
[providers.openai_compatible.groq]
api_key = "gsk_..."
model = "llama-3.3-70b-versatile"
base_url = "https://api.groq.com/openai/v1"
max_tokens = 4096
temperature = 0.1

# Together
[providers.openai_compatible.together]
api_key = "..."
model = "meta-llama/Llama-3-70b-chat-hf"
base_url = "https://api.together.xyz/v1"
max_tokens = 4096
temperature = 0.1

Supported Services

  • OpenRouter: Access to many models through one API
  • Groq: Fast inference for Llama models
  • Together: Open-source model hosting
  • Anyscale: Scalable model serving
  • Local servers: Ollama, vLLM, text-generation-inference

Embedded (Local Models)

Location: crates/g3-providers/src/embedded.rs

Features

  • Completely local: No data leaves your machine
  • Offline capable: Works without internet
  • GPU acceleration: Metal (macOS), CUDA (Linux)
  • No API costs: Free after model download

Configuration

[providers.embedded.default]
model_path = "~/.cache/g3/models/qwen2.5-7b-instruct-q3_k_m.gguf"
model_type = "qwen"              # Model architecture
context_length = 32768           # Context window
max_tokens = 2048                # Max output
temperature = 0.1
gpu_layers = 32                  # GPU offload (0 = CPU only)
threads = 8                      # CPU threads

Supported Model Types

| Type | Models | Notes |
|------|--------|-------|
| qwen | Qwen 2.5 series | Good coding ability |
| codellama | Code Llama | Specialized for code |
| llama | Llama 2/3 | General purpose |
| mistral | Mistral/Mixtral | Efficient |

Model Download

Download GGUF models from Hugging Face:

mkdir -p ~/.cache/g3/models
cd ~/.cache/g3/models

# Example: Qwen 2.5 7B
wget https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m.gguf

Hardware Requirements

| Model Size | RAM Required | GPU VRAM | Notes |
|------------|--------------|----------|-------|
| 7B Q4 | 6GB | 4GB | Good for most tasks |
| 7B Q8 | 10GB | 8GB | Better quality |
| 13B Q4 | 10GB | 8GB | More capable |
| 70B Q4 | 48GB | 40GB | Requires high-end hardware |

GPU Acceleration

macOS (Metal):

gpu_layers = 32  # Offload layers to GPU

Linux (CUDA): Requires CUDA toolkit installed.

CPU Only:

gpu_layers = 0
threads = 8  # Use more threads

Tool Calling

Embedded models don't support native tool calling, so G3 falls back to JSON-based tool calls:

  1. System prompt includes tool definitions as JSON
  2. Model outputs tool calls as JSON in response
  3. G3 parses JSON and executes tools

This works but is less reliable than native tool calling.
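
As a rough sketch of step 3 (the exact JSON schema g3 asks the model to emit lives in the codebase; the {"tool": ..., "args": ...} shape below is assumed for illustration):

```rust
use serde_json::Value;

/// Pull the first {...} object out of a model reply and read a hypothetical
/// {"tool": ..., "args": ...} tool call from it.
fn extract_tool_call(reply: &str) -> Option<(String, Value)> {
    let start = reply.find('{')?;
    let end = reply.rfind('}')?;
    let value: Value = serde_json::from_str(&reply[start..=end]).ok()?;
    let tool = value.get("tool")?.as_str()?.to_string();
    let args = value.get("args").cloned().unwrap_or(Value::Null);
    Some((tool, args))
}

fn main() {
    let reply = r#"I'll read it now. {"tool": "read_file", "args": {"path": "src/main.rs"}}"#;
    if let Some((tool, args)) = extract_tool_call(reply) {
        println!("call {tool} with {args}");
    }
}
```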


Provider Selection Guide

By Use Case

| Use Case | Recommended Provider |
|----------|----------------------|
| General coding tasks | Anthropic (Claude Sonnet) |
| Complex reasoning | Anthropic (Claude Opus) |
| Enterprise/compliance | Databricks |
| Cost-sensitive | Embedded or Groq |
| Privacy-critical | Embedded |
| Offline development | Embedded |
| Fast iteration | Groq (Llama) |
| Model variety | OpenRouter |

By Priority

Quality first: Anthropic Claude Opus/Sonnet

  • Best reasoning and coding ability
  • Native tool calling
  • Prompt caching for efficiency

Cost first: Embedded or OpenAI-compatible

  • Embedded: Free after download
  • Groq: Very cheap, fast
  • OpenRouter: Pay-per-use, many options

Privacy first: Embedded

  • Data never leaves your machine
  • No API calls
  • Full control

Speed first: Groq or Embedded with GPU

  • Groq: Extremely fast inference
  • Embedded with Metal/CUDA: Low latency

Provider Trait

All providers implement the LLMProvider trait:

#[async_trait]
pub trait LLMProvider: Send + Sync {
    /// Generate a completion
    async fn complete(&self, request: CompletionRequest) -> Result<CompletionResponse>;
    
    /// Stream a completion
    async fn stream(&self, request: CompletionRequest) -> Result<CompletionStream>;
    
    /// Provider name (e.g., "anthropic.default")
    fn name(&self) -> &str;
    
    /// Model name (e.g., "claude-sonnet-4-5")
    fn model(&self) -> &str;
    
    /// Whether provider supports native tool calling
    fn has_native_tool_calling(&self) -> bool;
    
    /// Whether provider supports cache control
    fn supports_cache_control(&self) -> bool;
    
    /// Configured max tokens
    fn max_tokens(&self) -> u32;
    
    /// Configured temperature
    fn temperature(&self) -> f32;
}
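
Callers in g3-core can hold any provider behind this trait and branch on its capabilities. A purely illustrative consumer-side sketch (CompletionRequest construction omitted, since its fields aren't shown here):

```rust
// Illustrative only -- not code from g3-core.
async fn run(
    provider: Box<dyn LLMProvider>,
    request: CompletionRequest,
) -> Result<CompletionResponse> {
    if !provider.has_native_tool_calling() {
        // e.g. the embedded provider: tool definitions must already be in the
        // prompt as JSON (see the Embedded section above).
    }
    provider.complete(request).await
}
```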

Adding a New Provider

  1. Create crates/g3-providers/src/newprovider.rs
  2. Implement the LLMProvider trait (a minimal skeleton is sketched after these steps)
  3. Add configuration struct to crates/g3-config/src/lib.rs
  4. Register in crates/g3-core/src/lib.rs (new_with_mode_and_readme)
  5. Export from crates/g3-providers/src/lib.rs
  6. Update documentation
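
A hypothetical skeleton for step 2; the names, fields, and stubbed bodies below are placeholders, not code from the repository:

```rust
use async_trait::async_trait;

// Hypothetical provider skeleton -- everything below is a placeholder.
pub struct NewProvider {
    model: String,
    max_tokens: u32,
    temperature: f32,
}

#[async_trait]
impl LLMProvider for NewProvider {
    async fn complete(&self, _request: CompletionRequest) -> Result<CompletionResponse> {
        todo!("call the service's HTTP API and map the response")
    }

    async fn stream(&self, _request: CompletionRequest) -> Result<CompletionStream> {
        todo!("wrap the service's streaming endpoint")
    }

    fn name(&self) -> &str { "newprovider.default" }
    fn model(&self) -> &str { &self.model }
    fn has_native_tool_calling(&self) -> bool { false }
    fn supports_cache_control(&self) -> bool { false }
    fn max_tokens(&self) -> u32 { self.max_tokens }
    fn temperature(&self) -> f32 { self.temperature }
}
```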

Troubleshooting

Authentication Errors

Anthropic: Verify API key starts with sk-ant-

Databricks OAuth:

  • Delete ~/.databricks/oauth-tokens.json and re-authenticate
  • Ensure workspace URL is correct

OpenAI: Verify API key and check billing status

Rate Limits

G3 automatically retries on rate limits with exponential backoff.
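
For illustration only (not g3's actual retry code, and it assumes CompletionRequest is cloneable), a retry loop with exponential backoff has roughly this shape:

```rust
use std::time::Duration;

// Illustrative backoff loop: wait 1s, 2s, 4s, ... between failed attempts.
async fn complete_with_retry(
    provider: &dyn LLMProvider,
    request: CompletionRequest,
    max_attempts: u32,
) -> Result<CompletionResponse> {
    let mut delay = Duration::from_secs(1);
    let mut attempt = 1;
    loop {
        match provider.complete(request.clone()).await {
            Ok(response) => return Ok(response),
            Err(err) if attempt < max_attempts => {
                eprintln!("attempt {attempt} failed ({err}); retrying in {delay:?}");
                tokio::time::sleep(delay).await;
                delay *= 2;
                attempt += 1;
            }
            Err(err) => return Err(err),
        }
    }
}
```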

To reduce rate limit issues:

  • Use prompt caching (Anthropic)
  • Reduce max_tokens
  • Use a provider with higher limits

Context Window Errors

If you see "context too long" errors:

  1. Use /compact to compact the conversation
  2. Use /thinnify to replace large tool results
  3. Increase max_context_length in config
  4. Switch to a provider with larger context

Embedded Model Issues

Model not loading:

  • Verify model_path is correct
  • Check file permissions
  • Ensure enough RAM

Slow inference:

  • Increase gpu_layers for GPU offload
  • Reduce context_length
  • Use a smaller quantization (Q4 vs Q8)

Poor tool calling:

  • Embedded models use JSON fallback
  • Consider cloud provider for complex tool use