Local LLM Execution

Overview

MoltGhost executes language models locally within each Agent Pod, eliminating external API dependencies.

Models run natively using the Pod's allocated compute (GPU/CPU), delivering autonomous inference with full resource control.

External APIs:    Prompt → Network → Provider → Latency + Costs → Response
Local LLM:        Prompt → Pod GPU → Inference → Response

Execution Architecture

┌──────────────────────────────────────────────────────────────┐
│                         Agent Pod                            │
├──────────────────────────────────────────────────────────────┤
│  ┌─────────────────────┐                                     │
│  │   Agent Runtime     │                                     │
│  │   (OpenClaw)        │◄────────────── User Requests        │
│  └─────────────────────┘                                     │
│           │                                                  │
│           ▼                                                  │
│  ┌─────────────────────┐  ┌─────────────────────────────┐    │
│  │ Ollama Model Server │──│        Local LLM            │    │
│  │                     │  │  -  3.8B-405B Parameters    │    │
│  └─────────────────────┘  │  -  Quantized (Q4/Q8)       │    │
│                           │  -  GPU Accelerated         │    │
│                           └─────────────────────────────┘    │
│                                                              │
│  Compute Resources:                                          │
│  ├─ NVIDIA GPU (A100/H100/L40S)                              │
│  ├─ 16-512GB System Memory                                   │
│  └─ NVMe Storage (Model Weights)                             │
└──────────────────────────────────────────────────────────────┘
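
To make the Agent Runtime → Ollama hop in the diagram concrete, here is a minimal Python sketch of a request to a local Ollama server. It assumes Ollama's default REST endpoint (`http://localhost:11434/api/generate`) is reachable inside the Pod and that the model is available under the Ollama tag `llama3.1:70b`. How OpenClaw actually talks to Ollama is not specified here, and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama listens on its default port inside the Pod.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local model server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]


# Usage (requires a running Ollama server with the model already pulled):
#   text = generate("llama3.1:70b", "Summarize the benefits of local inference.")
```

Because the request never leaves `localhost`, the only latency involved is model loading and inference on the Pod's own hardware.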

Supported Models & Formats

Family	Models	Quantization	Max Context	Use Case
Llama	3.1 (8B, 70B, 405B)	Q4_K_M, Q8_0	128K	General purpose
Qwen	2.5 (7B, 32B, 72B)	Q4_K_S, Q6_K	32K	Multilingual
Mistral	Nemo (12B), Mixtral 8x22B	Q4_K_M	128K (Nemo), 64K (Mixtral)	Tool calling
Phi-3	Mini (3.8B), Medium (14B)	Q8_0	128K	Fast inference

Model Selection at Deploy:

moltghost deploy my-agent --model llama3.1-70b-q4 --context 32k
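
The context setting matters for sizing because the key/value cache grows linearly with context length, on top of the model weights. The sketch below estimates that cache for Llama 3.1 70B. It assumes the published model shape (80 layers, 8 KV heads, head dimension 128), a 16-bit cache, and that `32k` means 32,768 tokens. The actual runtime may differ, for example if it quantizes the cache.

```python
def kv_cache_bytes(context_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token


# Assumed Llama 3.1 70B shape: 80 layers, 8 KV heads (GQA), head dim 128.
LLAMA_70B = dict(n_layers=80, n_kv_heads=8, head_dim=128)

for label, tokens in [("32k", 32 * 1024), ("128k", 128 * 1024)]:
    gib = kv_cache_bytes(tokens, **LLAMA_70B) / 2**30
    print(f"{label} context: {gib:.1f} GiB of KV cache")
```

Under these assumptions a 32k context adds about 10 GiB per sequence, while the full 128K window would add about 40 GiB, which is why a reduced context can be a sensible deploy-time choice.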

Inference Pipeline

User Request → Agent Runtime → Plan Reasoning → Model Inference → Response

Performance Metrics (A100 GPU; the 405B row assumes a multi-GPU Pod, per Resource Allocation):

Model	Tokens/sec	TTFT	Latency (1K tokens)
7B Q4	150+	200ms	8s
70B Q4	45+	800ms	25s
405B Q4	12+	2.5s	90s
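
The latency column can be sanity-checked with simple arithmetic: time to first token plus output tokens divided by decode throughput. The sketch below applies that model to the table rows and also shows how throughput could be derived from the `eval_count` and `eval_duration` fields that Ollama returns with a response. The function names are illustrative, not part of MoltGhost.

```python
def end_to_end_seconds(output_tokens: int, tokens_per_sec: float, ttft_sec: float) -> float:
    """Time to first token plus steady-state decoding time."""
    return ttft_sec + output_tokens / tokens_per_sec


def tokens_per_second(ollama_response: dict) -> float:
    """Decode throughput from an Ollama response (eval_duration is in nanoseconds)."""
    return ollama_response["eval_count"] / (ollama_response["eval_duration"] / 1e9)


# Rows from the table above: (model, tokens/sec lower bound, TTFT in seconds).
ROWS = [("7B Q4", 150, 0.2), ("70B Q4", 45, 0.8), ("405B Q4", 12, 2.5)]

for name, tps, ttft in ROWS:
    print(f"{name}: ~{end_to_end_seconds(1000, tps, ttft):.1f}s for 1K tokens")
```

The estimates come out slightly below the table's rounded figures, which is consistent with the table including some scheduling and prompt-processing overhead.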

Resource Allocation

Dynamic scaling based on model requirements:

Model Size	GPU	Memory	Storage
7B-13B	1×L40S	24GB	10GB
30B-70B	1×A100	80GB	50GB
100B+	2×H100	160GB+	200GB+
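
As an illustration of how the table could drive allocation, here is a small lookup sketch. The `Tier` type and `select_tier` function are hypothetical, and the handling of sizes that fall between the table's ranges (13B-30B, 70B-100B) is an assumption: they round up to the next tier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    max_params_b: float  # largest model size (billions of parameters) this tier covers
    gpu: str
    memory_gb: int
    storage_gb: int


# Values copied from the table; the last tier is open-ended ("100B+").
TIERS = [
    Tier(13, "1×L40S", 24, 10),
    Tier(70, "1×A100", 80, 50),
    Tier(float("inf"), "2×H100", 160, 200),
]


def select_tier(params_b: float) -> Tier:
    """Return the smallest tier whose upper bound covers the model size."""
    # Assumption: sizes in the table's gaps round up to the next tier.
    for tier in TIERS:
        if params_b <= tier.max_params_b:
            return tier
    raise ValueError(f"no tier for {params_b}B parameters")


print(select_tier(8).gpu)
print(select_tier(405).gpu)
```

For the open-ended tier the memory and storage values are minimums, so a real allocator would still need the actual weight size of the chosen model.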

Model Weights → Quantized → GPU Memory → Inference Ready
Llama 3.1 70B: ~140GB FP16 → ~42GB Q4_K_M → Runs on 80GB Pod
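
The size reduction follows from bits per weight: parameters × bits ÷ 8 gives bytes. The sketch below works this through for a 70B model. The bits-per-weight averages for the quantized formats are approximations assumed for illustration; real GGUF file sizes vary by model and llama.cpp version.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


# Assumption: Q4_K_M averages roughly 4.8 bits per weight and Q8_0 about 8.5,
# because both formats store per-block scales alongside the quantized values.
for name, bits in [("FP16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
    print(f"Llama 3.1 70B {name}: ~{weights_gb(70, bits):.0f} GB")
```

Under these assumptions the 4-bit variant needs roughly 30% of the FP16 footprint, which leaves headroom on an 80GB Pod for the KV cache and runtime buffers.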

Multi-Level Isolation Benefits

Agent A (Llama 70B)   → Pod A → Independent GPU + Memory
Agent B (Qwen 32B)    → Pod B → Independent GPU + Memory
Agent C (Mistral 12B) → Pod C → Independent GPU + Memory

Guaranteed Isolation:

✅ No inference queue contention
✅ Independent scaling
✅ Zero cross-agent interference
✅ Private model deployments

Model Management Lifecycle

When a Pod starts, models are loaded and prepared for inference:

1. Select Model - Choose from supported models
2. Download Weights - Fetch pre-trained parameters
3. Quantization - Convert to efficient format (Q4/Q8)
4. GPU Load - Allocate VRAM and load weights
5. Inference Ready - Agent can process requests
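
The download and load steps above can be sketched against Ollama's REST API, which exposes `/api/pull` for fetching a model and loads a model into memory when `/api/generate` is called without a prompt. This is a minimal sketch under the assumption that the Pod runs Ollama on its default local address. The helper names are hypothetical, and whether MoltGhost drives Ollama this way is not stated here. Ollama library tags are typically distributed pre-quantized, so the Quantization step may be a no-op in this flow.

```python
import json
import urllib.request

# Assumption: Ollama's default local address; the Pod may configure this differently.
OLLAMA_BASE = "http://localhost:11434"


def _post(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request against the local Ollama server."""
    return urllib.request.Request(
        OLLAMA_BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def pull_request(model: str) -> urllib.request.Request:
    """Download Weights: ask Ollama to fetch the model (older versions use "name")."""
    return _post("/api/pull", {"model": model, "stream": False})


def preload_request(model: str) -> urllib.request.Request:
    """GPU Load: a generate call without a prompt loads the model into memory."""
    return _post("/api/generate", {"model": model})


def prepare(model: str) -> None:
    """Run both steps in order; raises urllib.error.URLError if the server is down."""
    for request in (pull_request(model), preload_request(model)):
        with urllib.request.urlopen(request) as response:
            response.read()  # drain the body so the step has completed
```

Calling `prepare("llama3.1:70b")` on a Pod with a running Ollama server would leave the model resident and ready for the first request.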

Operations:

# Update model without downtime
moltghost agent update my-agent --model qwen2.5-72b-q4

# Scale compute
moltghost agent scale my-agent --gpu h100 --memory 160gb

Summary

Local LLMs unlock production autonomy:

✅ Native GPU acceleration (no API hops)
✅ Four model families (Llama, Qwen, Mistral, Phi-3) with quantized variants
✅ Predictable performance + cost control
✅ Complete isolation per agent
✅ Dynamic model updates and scaling

Result: Agents reason and act independently using dedicated, local intelligence.

Pro Tip: Start with 7B-13B models for development, scale to 70B+ for production reasoning.