Dhanji R. Prasanna f6b84d864a Rename G3 -> g3 in docs and comments
Standardize project name to lowercase 'g3' throughout documentation,
comments, and configuration files. Environment variables (G3_*) are
unchanged as they follow the uppercase convention.
2026-01-13 14:36:33 +05:30


# g3 LLM Providers Guide
**Last updated**: January 2025
**Source of truth**: `crates/g3-providers/src/`
## Purpose
This document describes the LLM providers supported by g3, their capabilities, and how to choose between them.
## Provider Overview
| Provider | Type | Tool Calling | Cache Control | Context Window | Best For |
|----------|------|--------------|---------------|----------------|----------|
| **Anthropic** | Cloud | Native | Yes | 200k (1M optional) | General use, complex tasks |
| **Databricks** | Cloud | Native | Yes (Claude models) | Varies | Enterprise, existing Databricks users |
| **OpenAI** | Cloud | Native | No | 128k | GPT model preference |
| **OpenAI-Compatible** | Cloud | Native | No | Varies | OpenRouter, Groq, Together, etc. |
| **Embedded** | Local | JSON fallback | No | 4k-32k | Privacy, offline, cost savings |
## Anthropic
**Location**: `crates/g3-providers/src/anthropic.rs`
### Features
- **Native tool calling**: Full support for structured tool calls
- **Prompt caching**: Reduce costs with ephemeral caching
- **Extended context**: Optional 1M token context (additional cost)
- **Extended thinking**: Budget tokens for complex reasoning
- **Streaming**: Real-time response streaming
### Configuration
```toml
[providers.anthropic.default]
api_key = "sk-ant-api03-..." # Required
model = "claude-sonnet-4-5" # Model name
max_tokens = 64000 # Max output tokens
temperature = 0.3 # 0.0-1.0
cache_config = "ephemeral" # Optional: Enable caching
enable_1m_context = true # Optional: 1M context
thinking_budget_tokens = 10000 # Optional: Extended thinking
```
### Available Models
| Model | Context | Best For |
|-------|---------|----------|
| `claude-sonnet-4-5` | 200k | Balanced performance/cost |
| `claude-opus-4-5` | 200k | Complex reasoning |
| `claude-3-5-sonnet-20241022` | 200k | Previous generation |
| `claude-3-opus-20240229` | 200k | Previous generation |
### Prompt Caching
Enable caching to reduce costs for repeated context:
```toml
cache_config = "ephemeral" # Cache for session duration
```
Caching is applied to:
- System prompts
- README/AGENTS.md content
- Large tool results
### Extended Thinking
For complex tasks requiring step-by-step reasoning:
```toml
thinking_budget_tokens = 10000 # Tokens for internal reasoning
```
The model uses these tokens for planning before responding.
---
## Databricks
**Location**: `crates/g3-providers/src/databricks.rs`
### Features
- **Foundation Model APIs**: Access to various models
- **OAuth authentication**: Secure browser-based auth
- **Token authentication**: Personal access tokens
- **Enterprise integration**: Works with existing Databricks setup
### Configuration
```toml
[providers.databricks.default]
host = "https://your-workspace.cloud.databricks.com"
model = "databricks-claude-sonnet-4"
max_tokens = 4096
temperature = 0.1
use_oauth = true # Recommended
# token = "dapi..." # Alternative: PAT
```
### Authentication
**OAuth (Recommended)**:
1. Set `use_oauth = true`
2. On first run, browser opens for authentication
3. Tokens are cached in `~/.databricks/oauth-tokens.json`
4. Tokens refresh automatically
**Personal Access Token**:
1. Generate token in Databricks workspace
2. Set `token = "dapi..."` and `use_oauth = false`
### Available Models
Models depend on your Databricks workspace configuration:
- `databricks-claude-sonnet-4` (Claude via Databricks)
- `databricks-meta-llama-3-1-70b-instruct`
- `databricks-dbrx-instruct`
- Custom fine-tuned models
---
## OpenAI
**Location**: `crates/g3-providers/src/openai.rs`
### Features
- **Native tool calling**: Full support
- **Custom endpoints**: Override base URL
- **Streaming**: Real-time responses
### Configuration
```toml
[providers.openai.default]
api_key = "sk-..." # Required
model = "gpt-4-turbo" # Model name
max_tokens = 4096
temperature = 0.1
# base_url = "https://api.openai.com/v1" # Optional
```
### Available Models
| Model | Context | Notes |
|-------|---------|-------|
| `gpt-4-turbo` | 128k | Latest GPT-4 |
| `gpt-4o` | 128k | Optimized GPT-4 |
| `gpt-4` | 8k | Original GPT-4 |
| `gpt-3.5-turbo` | 16k | Faster, cheaper |
---
## OpenAI-Compatible Providers
**Location**: `crates/g3-providers/src/openai.rs` (reuses OpenAI implementation)
For services that implement the OpenAI API format.
### Configuration
```toml
# OpenRouter
[providers.openai_compatible.openrouter]
api_key = "sk-or-..."
model = "anthropic/claude-3.5-sonnet"
base_url = "https://openrouter.ai/api/v1"
max_tokens = 4096
temperature = 0.1
# Groq
[providers.openai_compatible.groq]
api_key = "gsk_..."
model = "llama-3.3-70b-versatile"
base_url = "https://api.groq.com/openai/v1"
max_tokens = 4096
temperature = 0.1
# Together
[providers.openai_compatible.together]
api_key = "..."
model = "meta-llama/Llama-3-70b-chat-hf"
base_url = "https://api.together.xyz/v1"
max_tokens = 4096
temperature = 0.1
```
### Supported Services
- **OpenRouter**: Access to many models through one API
- **Groq**: Fast inference for Llama models
- **Together**: Open-source model hosting
- **Anyscale**: Scalable model serving
- **Local servers**: Ollama, vLLM, text-generation-inference
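Local servers that expose an OpenAI-compatible endpoint slot in the same way. A sketch for a local Ollama server (the model name is illustrative; Ollama's OpenAI-compatible endpoint defaults to port 11434 and ignores the API key, though the field may still need to be set):
```toml
# Ollama (local)
[providers.openai_compatible.ollama]
api_key = "unused"                       # Ollama does not check this
model = "qwen2.5-coder:7b"               # Any model pulled via `ollama pull`
base_url = "http://localhost:11434/v1"
max_tokens = 4096
temperature = 0.1
```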
---
## Embedded (Local Models)
**Location**: `crates/g3-providers/src/embedded.rs`
### Features
- **Completely local**: No data leaves your machine
- **Offline capable**: Works without internet
- **GPU acceleration**: Metal (macOS), CUDA (Linux)
- **No API costs**: Free after model download
### Configuration
```toml
[providers.embedded.default]
model_path = "~/.cache/g3/models/qwen2.5-7b-instruct-q3_k_m.gguf"
model_type = "qwen" # Model architecture
context_length = 32768 # Context window
max_tokens = 2048 # Max output
temperature = 0.1
gpu_layers = 32 # GPU offload (0 = CPU only)
threads = 8 # CPU threads
```
### Supported Model Types
| Type | Models | Notes |
|------|--------|-------|
| `qwen` | Qwen 2.5 series | Good coding ability |
| `codellama` | Code Llama | Specialized for code |
| `llama` | Llama 2/3 | General purpose |
| `mistral` | Mistral/Mixtral | Efficient |
### Model Download
Download GGUF models from Hugging Face:
```bash
mkdir -p ~/.cache/g3/models
cd ~/.cache/g3/models
# Example: Qwen 2.5 7B
wget https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m.gguf
```
### Hardware Requirements
| Model Size | RAM Required | GPU VRAM | Notes |
|------------|--------------|----------|-------|
| 7B Q4 | 6GB | 4GB | Good for most tasks |
| 7B Q8 | 10GB | 8GB | Better quality |
| 13B Q4 | 10GB | 8GB | More capable |
| 70B Q4 | 48GB | 40GB | Requires high-end hardware |
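The weight-memory side of the table follows from a simple rule of thumb: parameters × bits-per-weight ÷ 8. A sketch (the bits-per-weight figures are approximate averages for the named quantizations; actual RAM use is higher because of KV cache and context buffers, which is roughly the gap to the table above):
```rust
/// Rough rule of thumb for GGUF weight memory in GB:
/// parameters (billions) x average bits per weight / 8.
fn approx_weight_gb(params_billion: f64, bits_per_weight: f64) -> f64 {
    params_billion * bits_per_weight / 8.0
}

fn main() {
    // Q4_K_M averages roughly 4.5 bits/weight; Q8_0 roughly 8.5 (approximate).
    println!("7B Q4  ~= {:.1} GB of weights", approx_weight_gb(7.0, 4.5));
    println!("7B Q8  ~= {:.1} GB of weights", approx_weight_gb(7.0, 8.5));
    println!("70B Q4 ~= {:.1} GB of weights", approx_weight_gb(70.0, 4.5));
}
```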
### GPU Acceleration
**macOS (Metal)**:
```toml
gpu_layers = 32 # Offload layers to GPU
```
**Linux (CUDA)**:
Requires CUDA toolkit installed.
**CPU Only**:
```toml
gpu_layers = 0
threads = 8 # Use more threads
```
### Tool Calling
Embedded models don't have native tool calling. g3 uses JSON fallback:
1. System prompt includes tool definitions as JSON
2. Model outputs tool calls as JSON in response
3. g3 parses JSON and executes tools
This works but is less reliable than native tool calling.
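The JSON is parsed out of the model's plain-text response; the exact schema lives in the system prompt g3 generates. A hypothetical call of this shape illustrates the idea (field names here are assumptions, not g3's actual schema):
```json
{
  "tool": "read_file",
  "parameters": { "path": "src/main.rs" }
}
```
A malformed or truncated object is the main failure mode, which is why the fallback is less reliable than native tool calling.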
---
## Provider Selection Guide
### By Use Case
| Use Case | Recommended Provider |
|----------|---------------------|
| General coding tasks | Anthropic (Claude Sonnet) |
| Complex reasoning | Anthropic (Claude Opus) |
| Enterprise/compliance | Databricks |
| Cost-sensitive | Embedded or Groq |
| Privacy-critical | Embedded |
| Offline development | Embedded |
| Fast iteration | Groq (Llama) |
| Model variety | OpenRouter |
### By Priority
**Quality first**: Anthropic Claude Opus/Sonnet
- Best reasoning and coding ability
- Native tool calling
- Prompt caching for efficiency
**Cost first**: Embedded or OpenAI-compatible
- Embedded: Free after download
- Groq: Very cheap, fast
- OpenRouter: Pay-per-use, many options
**Privacy first**: Embedded
- Data never leaves your machine
- No API calls
- Full control
**Speed first**: Groq or Embedded with GPU
- Groq: Extremely fast inference
- Embedded with Metal/CUDA: Low latency
---
## Provider Trait
All providers implement the `LLMProvider` trait:
```rust
#[async_trait]
pub trait LLMProvider: Send + Sync {
    /// Generate a completion
    async fn complete(&self, request: CompletionRequest) -> Result<CompletionResponse>;
    /// Stream a completion
    async fn stream(&self, request: CompletionRequest) -> Result<CompletionStream>;
    /// Provider name (e.g., "anthropic.default")
    fn name(&self) -> &str;
    /// Model name (e.g., "claude-sonnet-4-5")
    fn model(&self) -> &str;
    /// Whether provider supports native tool calling
    fn has_native_tool_calling(&self) -> bool;
    /// Whether provider supports cache control
    fn supports_cache_control(&self) -> bool;
    /// Configured max tokens
    fn max_tokens(&self) -> u32;
    /// Configured temperature
    fn temperature(&self) -> f32;
}
```
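The capability flags (`has_native_tool_calling`, `supports_cache_control`) let g3 adapt its prompting per provider. A simplified, synchronous mock shows the shape of an implementation (the real trait is async via `async_trait` and uses g3's request/response types; this stand-in trait is illustrative only):
```rust
// Simplified stand-in for LLMProvider: synchronous, capability methods only.
trait Provider {
    fn name(&self) -> &str;
    fn model(&self) -> &str;
    fn has_native_tool_calling(&self) -> bool;
    fn supports_cache_control(&self) -> bool;
    fn max_tokens(&self) -> u32;
    fn temperature(&self) -> f32;
}

struct MockProvider;

impl Provider for MockProvider {
    fn name(&self) -> &str { "mock.default" }
    fn model(&self) -> &str { "mock-model" }
    fn has_native_tool_calling(&self) -> bool { false } // forces JSON fallback
    fn supports_cache_control(&self) -> bool { false }
    fn max_tokens(&self) -> u32 { 4096 }
    fn temperature(&self) -> f32 { 0.1 }
}

fn main() {
    let p = MockProvider;
    println!("{} ({}) native tools: {}", p.name(), p.model(), p.has_native_tool_calling());
}
```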
---
## Adding a New Provider
1. Create `crates/g3-providers/src/newprovider.rs`
2. Implement `LLMProvider` trait
3. Add configuration struct to `crates/g3-config/src/lib.rs`
4. Register in `crates/g3-core/src/lib.rs` (`new_with_mode_and_readme`)
5. Export from `crates/g3-providers/src/lib.rs`
6. Update documentation
---
## Troubleshooting
### Authentication Errors
**Anthropic**: Verify API key starts with `sk-ant-`
**Databricks OAuth**:
- Delete `~/.databricks/oauth-tokens.json` and re-authenticate
- Ensure workspace URL is correct
**OpenAI**: Verify API key and check billing status
### Rate Limits
g3 automatically retries on rate limits with exponential backoff.
To reduce rate limit issues:
- Use prompt caching (Anthropic)
- Reduce `max_tokens`
- Use a provider with higher limits
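The retry behavior described above follows the usual exponential-backoff pattern: each failed attempt doubles the wait, up to a cap. A minimal sketch (the base delay and cap here are illustrative, not g3's actual values):
```rust
use std::time::Duration;

/// Exponential backoff with a cap: 500ms, 1s, 2s, 4s, ... up to 30s.
fn backoff_delay(attempt: u32) -> Duration {
    let base_ms: u64 = 500;
    // min(attempt, 16) keeps the shift from overflowing for large attempt counts.
    let ms = base_ms.saturating_mul(1u64 << attempt.min(16));
    Duration::from_millis(ms.min(30_000))
}

fn main() {
    for attempt in 0..6 {
        println!("attempt {attempt}: wait {:?}", backoff_delay(attempt));
    }
}
```
In practice a small random jitter is usually added to each delay so that concurrent clients don't retry in lockstep.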
### Context Window Errors
If you see "context too long" errors:
1. Use `/compact` to compact the conversation
2. Use `/thinnify` to replace large tool results
3. Increase `max_context_length` in config
4. Switch to a provider with larger context
### Embedded Model Issues
**Model not loading**:
- Verify `model_path` is correct
- Check file permissions
- Ensure enough RAM
**Slow inference**:
- Increase `gpu_layers` for GPU offload
- Reduce `context_length`
- Use a lower-bit quantization (e.g. Q4 instead of Q8)
**Poor tool calling**:
- Embedded models use JSON fallback
- Consider cloud provider for complex tool use