AI Shaman: An Ollama GPU Guardian Proxy

The Problem: 15+ Hard Reboots Per Day

I run two RTX 3090s on a single PSU. That sentence alone should tell you where this is going. The 3090 is a 350W card under load, and when both fire up inference simultaneously, the combined instantaneous power draw spikes well past what the PSU can cleanly deliver. The result was not graceful throttling or an error message. It was a hard power-cut reboot, the kind that skips shutdown entirely and drops you back at BIOS.

At peak, I was hitting more than 15 of these per day. Every long-running job, every concurrent API call from two different services, every moment where a second model started warming up while the first was mid-token — all of it was a lottery ticket for a system reset. Filesystem corruption risk aside, it was simply unusable as a development machine.

The naive fixes did not work. Setting power limits with nvidia-smi -pl helped somewhat but did not eliminate simultaneous spikes. The problem was not average power draw — it was that Ollama had no concept of what the other GPU was doing. Each instance made decisions in isolation.

What I needed was a single chokepoint that understood the full system state before allowing any inference request through.

The Name: Asha'man

In Robert Jordan's Wheel of Time series, the Aes Sedai channelers have a counterpart organization called the Asha'man — men who channel the One Power, trained as weapons, bound to a fortress called the Black Tower. The word loosely translates to "Guardian" in the Old Tongue of the series. It felt right. This proxy sits between all consumers and the raw power of the hardware, enforcing discipline the way a guardian would: not by being clever, but by being unmovable.

The project is called AI Shaman. The phonetics stick. The meaning fits.

Architecture Overview

AI Shaman is a Python AsyncIO reverse proxy using aiohttp. Consumers (Open WebUI, LangChain agents, shell scripts, whatever) continue hitting the same ports they always did. Ollama gets moved to internal ports. The proxy sits in between and makes every go/no-go decision before a single byte reaches Ollama.


  Consumers (any client)
       |
       v
+------+------+------+
|    Port 11434      |  <-- AI Shaman proxy (was Ollama's port)
|    AI Shaman       |
+------+------+------+
       |
  [Decision Engine]
  - Model affinity check
  - Admin endpoint block
  - Power budget circuit breaker
  - Temperature circuit breaker
  - Per-GPU semaphore (max 1 concurrent)
  - Request queue
       |
  +----+----+
  |         |
  v         v
GPU 0     GPU 1
Ollama    Ollama
:11435    :11436
(internal ports, not externally accessible)

The two Ollama instances run on internal ports. AI Shaman owns the external ports. Nothing bypasses it.

Per-GPU Semaphores: The Core Fix

The heart of the solution is one asyncio semaphore per GPU, each initialized with a limit of 1. This means only one inference request can be active on each GPU at any moment. If a second request arrives for GPU 0 while it is already processing, that request waits in the queue. It does not fire. The PSU never sees simultaneous full-load draws.


import asyncio

gpu_semaphores = {
    0: asyncio.Semaphore(1),
    1: asyncio.Semaphore(1),
}

async def handle_inference(request, gpu_id):
    async with gpu_semaphores[gpu_id]:
        # Only one request per GPU passes through here at a time
        return await forward_to_ollama(request, gpu_id)

The semaphore approach is simpler and more reliable than any form of power monitoring and backpressure. By the time you have measured the power spike, it has already happened. The semaphore prevents the spike from being possible in the first place.

Model Affinity: Preventing VRAM Thrashing

Ollama will load and unload models based on what is being requested. With two GPUs and consumers requesting different models, you can end up with a model being evicted from VRAM to make room for another, only to be requested again seconds later. Each load is slow and expensive.

AI Shaman enforces model affinity: specific models are pinned to specific GPUs. In my configuration, qwen3 and its variants live on GPU 0. gpt-oss and related models live on GPU 1. If a request comes in asking for qwen3 but targets GPU 1's Ollama instance, it is rejected immediately with a detailed JSON error.


{
  "error": "model_affinity_violation",
  "message": "Model 'qwen3:32b' is pinned to GPU 0 (port 11435). You requested routing to GPU 1 (port 11436). This is not allowed.",
  "model_requested": "qwen3:32b",
  "gpu_requested": 1,
  "gpu_required": 0,
  "correct_port": 11435,
  "hint": "Update your client to use port 11435 for this model, or use the proxy port 11434 and let AI Shaman route it correctly."
}

The rejection message is deliberately loud and specific. Silent failures waste debugging time. The JSON tells the caller exactly what went wrong, which port to use, and what the correct configuration is. If the model field in the request is empty or missing, the request is rejected — fail-closed, not fail-open.
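The check itself can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's actual code: the names AFFINITY, GPU_PORTS, and check_affinity are hypothetical, matching variants by the part before the colon is an assumed convention, the "missing_model" and "unknown_model" error codes are invented for the example, and rejecting unknown models is my assumption about how fail-closed would extend to them.

```python
from typing import Optional

# Hypothetical affinity table: model family -> GPU id (values mirror the prose).
AFFINITY = {"qwen3": 0, "gpt-oss": 1}
# Internal Ollama port per GPU, as in the architecture diagram.
GPU_PORTS = {0: 11435, 1: 11436}


def check_affinity(model: Optional[str], gpu_requested: int) -> Optional[dict]:
    """Return None if the request may proceed, else a JSON-serializable error."""
    name = (model or "").strip()  # also neutralizes stray whitespace
    if not name:
        # Fail closed: a missing or empty model is never routed anywhere.
        return {"error": "missing_model", "message": "Request has no model field."}

    family = name.split(":", 1)[0]  # "qwen3:32b" -> "qwen3" (assumed convention)
    gpu_required = AFFINITY.get(family)
    if gpu_required is None:
        # Assumption: unknown models are rejected rather than sent to a default GPU.
        return {"error": "unknown_model", "model_requested": name}

    if gpu_required != gpu_requested:
        return {
            "error": "model_affinity_violation",
            "model_requested": name,
            "gpu_requested": gpu_requested,
            "gpu_required": gpu_required,
            "correct_port": GPU_PORTS[gpu_required],
        }
    return None
```

Returning a plain dict keeps the check independent of the HTTP layer, so the same function can feed both the JSON response and the rejection log.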

Power Budget Circuit Breaker

The semaphore prevents simultaneous inference, but there are edge cases: model load operations, warmup phases, background Ollama housekeeping. The power budget circuit breaker adds a second layer of protection by polling nvidia-smi and tracking combined wattage across both GPUs.

If the combined draw across both GPUs exceeds 350 watts, the circuit breaker opens. New inference requests are queued but not forwarded. Once power drops below the threshold, the breaker closes and queued requests proceed. The nvidia-smi poll has a hard 10-second timeout — a lesson learned from the Gemini audit, which I will get to shortly.


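import asyncio

async def check_power_budget():
    """Return combined GPU power draw in watts, or 999.0 if it cannot be read."""
    proc = await asyncio.create_subprocess_exec(
        "nvidia-smi",
        "--query-gpu=power.draw",
        "--format=csv,noheader,nounits",
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        # The timeout has to wrap communicate(): spawning the process returns
        # right away, and it is the wait for its output that can hang.
        stdout, _ = await asyncio.wait_for(proc.communicate(), timeout=10.0)
    except asyncio.TimeoutError:
        # nvidia-smi hung: kill it and fail safe by assuming high power.
        # We do not wait for it to exit, since a process stuck in the driver
        # may not die promptly.
        proc.kill()
        return 999.0
    try:
        draws = [float(line) for line in stdout.decode().splitlines() if line.strip()]
    except ValueError:
        draws = []
    # No usable readings (error exit, "[N/A]", empty output): fail safe too
    return sum(draws) if draws else 999.0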
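How the reading gates requests is a separate piece. One way to sketch it, assuming an asyncio.Event as the gate: the class name PowerBreaker and its methods are hypothetical, and the polling loop that would call update() with the result of check_power_budget() is left out.

```python
import asyncio


class PowerBreaker:
    """Holds callers back while the last power reading is above the budget."""

    def __init__(self, budget_watts: float = 350.0) -> None:
        self.budget_watts = budget_watts
        self._closed = asyncio.Event()  # set -> requests may pass
        self._closed.set()

    @property
    def is_open(self) -> bool:
        return not self._closed.is_set()

    def update(self, combined_watts: float) -> None:
        """Feed in the latest combined reading from the poller."""
        if combined_watts > self.budget_watts:
            self._closed.clear()  # open the breaker: new requests wait
        else:
            self._closed.set()    # close the breaker: waiting requests resume

    async def wait_until_closed(self) -> None:
        """Return immediately if the breaker is closed, otherwise wait."""
        await self._closed.wait()
```

A request handler would await wait_until_closed() before acquiring its GPU semaphore. Because a timed-out poll reports 999.0, the same update() path opens the breaker when nvidia-smi hangs.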

Temperature Circuit Breaker

Thermal throttling degrades inference performance in ways that are hard to detect: token generation slows, latency spikes, and under sustained load the GPU can hit emergency shutdown temperatures. AI Shaman monitors GPU temperatures alongside power draw.

If either GPU exceeds 82 degrees Celsius, inference is paused for that GPU. It does not resume until the temperature drops below 75 degrees, a hysteresis gap that prevents rapid oscillation between paused and active states. The 7-degree gap is deliberate. Without it, a GPU hovering right at the threshold would flip the circuit breaker back and forth on nearly every reading.
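The hysteresis rule is a two-threshold state machine. A minimal sketch, assuming one instance per GPU; the class name ThermalBreaker is hypothetical, and the thresholds are the ones quoted above.

```python
class ThermalBreaker:
    """Pause above pause_c, resume only once the reading is below resume_c."""

    def __init__(self, pause_c: float = 82.0, resume_c: float = 75.0) -> None:
        self.pause_c = pause_c
        self.resume_c = resume_c
        self.paused = False

    def update(self, temp_c: float) -> bool:
        """Record a temperature reading and return whether inference is paused."""
        if not self.paused and temp_c > self.pause_c:
            self.paused = True
        elif self.paused and temp_c < self.resume_c:
            self.paused = False
        # Readings between the two thresholds leave the state unchanged.
        return self.paused
```

A GPU that climbs to 83 degrees and then drifts between 76 and 81 stays paused the whole time, which is the point of the gap.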

Blocked Admin Endpoints

Ollama exposes endpoints for pulling new models (/api/pull), deleting models (/api/delete), copying models (/api/copy), and other administrative operations. These endpoints are useful during setup but dangerous to leave exposed to arbitrary consumers in a production environment. A runaway agent or a misconfigured client could trigger a 20GB model download in the middle of a production workload.

AI Shaman maintains a blocklist of admin endpoints. Any request matching these paths receives an immediate 403 with a clear message. The list includes:

/api/pull — model download
/api/delete — model deletion
/api/copy — model copy
/api/push — model push to registry
/api/create — model creation from Modelfile

These operations are still available by connecting directly to the internal Ollama port, which requires being on the machine itself. The proxy enforces a clear separation between inference consumers and administrative access.
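A blocklist check along these lines might look like the sketch below. The function names and the shape of the rejection body are hypothetical, and the path normalization (dropping the query string, collapsing repeated slashes, ignoring case) is my assumption about what a careful match would need; it is stricter than an exact string comparison.

```python
import re

BLOCKED_ENDPOINTS = frozenset(
    {"/api/pull", "/api/delete", "/api/copy", "/api/push", "/api/create"}
)


def normalize_path(raw_path: str) -> str:
    """Reduce a request target to a canonical path for matching."""
    path = raw_path.split("?", 1)[0]   # drop any query string
    path = re.sub(r"/+", "/", path)    # collapse repeated slashes
    return path.rstrip("/").lower() or "/"


def is_blocked(raw_path: str) -> bool:
    """True if the request targets a blocked admin endpoint."""
    return normalize_path(raw_path) in BLOCKED_ENDPOINTS


def admin_rejection(raw_path: str) -> tuple:
    """Build a (status, body) pair for a blocked request (illustrative shape)."""
    return 403, {
        "error": "admin_endpoint_blocked",
        "message": f"{normalize_path(raw_path)} is not available through the proxy.",
    }
```

Matching on the normalized path means a trailing slash or a doubled slash does not slip past the list, while a lookalike such as /api/pulls is left alone.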

Zero-Change Deployment

The deployment philosophy was important to get right. Any existing consumer — Open WebUI, LangChain scripts, shell scripts calling curl localhost:11434 — should not need to change a single configuration line. AI Shaman owns the ports that consumers expect. Ollama moves to internal ports. The proxy is transparent to callers.

This meant the migration was risk-free. If the proxy had a bug or needed to be bypassed for any reason, I could stop it and have consumers talk directly to the internal Ollama instances by updating one config value. No consumer code needed to change in either direction.

Management API

AI Shaman exposes a management API, but only on localhost. It is not reachable from the network. The endpoints cover everything needed to observe and control the proxy:

GET /shaman/status — current state of both GPUs, circuit breaker status, queue depth
GET /shaman/metrics — request counts, rejection counts, average latency by GPU and model
POST /shaman/pause — manually open the circuit breaker (useful before running power-hungry non-inference tasks)
POST /shaman/resume — close the circuit breaker and allow inference again

Binding the management API to 0.0.0.0 was one of the issues the security audit caught. Exposing pause/resume to the network would allow any process on any machine to halt all inference. It is now bound to 127.0.0.1 exclusively.

Full Request Logging

Every request that passes through AI Shaman is logged with structured JSON: timestamp, GPU assigned, model name, request latency, power draw at time of request, and whether the request was queued or passed through immediately. Rejections are also logged with the rejection reason.

This logging has been more useful than expected. It surfaced a consumer that was sending model names with trailing whitespace, which would have silently bypassed affinity checks before the fix. It showed which models had the most queue contention. It gave me a clear picture of actual usage patterns that I did not have before.
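For a sense of what one log line carries, here is a rough sketch of a record builder. The function name and the field names are hypothetical; only the set of fields comes from the description above.

```python
import json
import time


def log_record(gpu_id, model, latency_s, power_watts, queued, rejection=None) -> str:
    """Serialize one request outcome as a single JSON log line."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "gpu": gpu_id,
        "model": model,              # logged as received, whitespace included
        "latency_s": round(latency_s, 3),
        "power_watts": power_watts,  # combined draw when the request arrived
        "queued": queued,            # True if it waited behind another request
    }
    if rejection is not None:
        record["rejection_reason"] = rejection
    return json.dumps(record)
```

Logging the model name exactly as received, rather than a cleaned-up copy, is what makes problems like trailing whitespace visible in the first place.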

Concurrency Test Results

The clearest proof that the semaphore logic works is in the concurrency test. Three requests are fired simultaneously at the same GPU. If they ran in parallel, all three would complete at roughly the same time. Instead, they complete sequentially:


Test: 3 simultaneous inference requests

Request 1 complete: 1.9s   (ran immediately, semaphore acquired)
Request 2 complete: 3.8s   (queued behind request 1, ran after)
Request 3 complete: 5.6s   (queued behind request 2, ran after)

Queue depth progression: 2 -> 1 -> 0

Each request adds roughly 1.9 seconds — the approximate inference time for the test prompt. The queue depth starts at 2 (two requests waiting while one runs), drops to 1, then to 0 as each completes. This is exactly the behavior the semaphore is supposed to enforce, and it has held up under production load.
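The same staircase can be reproduced without a GPU. The sketch below is a simulation, not the repository's test: asyncio.sleep stands in for inference, and the function names are hypothetical.

```python
import asyncio
import time


async def fake_inference(sem: asyncio.Semaphore, duration: float, t0: float) -> float:
    """Stand-in for a forwarded request; returns its completion time since t0."""
    async with sem:
        await asyncio.sleep(duration)  # simulated inference time
    return time.monotonic() - t0


async def run_concurrency_test(n: int = 3, duration: float = 0.1) -> list:
    """Fire n requests at one simulated GPU and collect completion times."""
    sem = asyncio.Semaphore(1)  # one GPU, one request at a time
    t0 = time.monotonic()
    return await asyncio.gather(
        *(fake_inference(sem, duration, t0) for _ in range(n))
    )
```

Calling asyncio.run(run_concurrency_test()) should return three completion times spaced about one duration apart, where an unguarded version would return three nearly identical values.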

Since deploying AI Shaman: zero hard reboots. Not fewer. Zero.

The Gemini Audit

After the initial implementation was working, I ran the codebase through Gemini CLI (gpt-5.3) for a security-focused review. I was not expecting much — the codebase is small and the attack surface is narrow. The audit found four real issues that needed fixing.

Empty Model Field Bypassed Affinity Check

The original affinity check read the model field from the request body and looked it up in the affinity map. If the model field was absent or empty, the lookup returned nothing and the request fell through to the default handler — which would forward to whichever GPU was available. An empty model field is not a valid inference request, but it was passing through the proxy and reaching Ollama directly. The fix was fail-closed: any request with a missing or empty model field is rejected with an explicit error before reaching the routing logic.

Admin Endpoints Were Unguarded

The initial implementation focused on the inference path and left admin endpoints unblocked. The audit correctly identified that /api/pull and /api/delete would pass through the proxy without any checks. The blocklist was added as a direct result of this finding.

nvidia-smi Could Hang Forever

The power polling function called nvidia-smi as a subprocess and awaited its output. There was no timeout. On Linux, nvidia-smi occasionally hangs when the driver is in a bad state — particularly during recovery from a power event. A hung nvidia-smi call would block the entire power check loop, causing the proxy to make routing decisions based on stale power data indefinitely. The 10-second timeout was added, with a fail-safe that treats a timeout as maximum power draw (999W), opening the circuit breaker.

Management API on 0.0.0.0

The management API was binding to all interfaces, making pause/resume accessible from any machine on the network. This was changed to bind exclusively to 127.0.0.1. Administrative access now requires a local session or an explicit SSH tunnel.

All four issues were real. None of them were theoretical edge cases — the empty model field bypass had already triggered once in production logs by the time I ran the audit. Having a second set of eyes (even an artificial one) review infrastructure code before treating it as stable is worth the time.

Cross-Platform Notes

AI Shaman runs on both Linux and Windows. The core AsyncIO and aiohttp code is identical on both platforms. The differences are in the process management layer: on Linux the proxy runs as a systemd service, on Windows it runs as a scheduled task or directly from a terminal. The nvidia-smi binary ships with the NVIDIA driver and is normally on PATH in both environments; some older Windows driver packages installed it in a separate NVSMI directory instead, so check PATH first if the poll fails there.

The subprocess handling for nvidia-smi uses asyncio.create_subprocess_exec rather than subprocess.run, which keeps it non-blocking in both environments. Windows uses ProactorEventLoop by default (since Python 3.8), which is also the event loop that supports asyncio subprocesses there, and nothing in the proxy requires the SelectorEventLoop, so it works without modification.

Configuration

Everything is driven by a JSON configuration file. GPU assignments, port mappings, power threshold, temperature thresholds, and model affinity rules are all defined there. No code changes are needed to reconfigure the proxy for a different hardware setup or model collection.


{
  "gpus": [
    {
      "id": 0,
      "ollama_port": 11435,
      "proxy_port": 11434,
      "models": ["qwen3", "qwen3:32b", "qwen3:14b", "qwen3:8b"]
    },
    {
      "id": 1,
      "ollama_port": 11436,
      "proxy_port": 11434,
      "models": ["gpt-oss", "gpt-oss:large"]
    }
  ],
  "power_budget_watts": 350,
  "temperature_pause_celsius": 82,
  "temperature_resume_celsius": 75,
  "nvidia_smi_timeout_seconds": 10,
  "management_api": {
    "host": "127.0.0.1",
    "port": 11437
  },
  "blocked_endpoints": [
    "/api/pull",
    "/api/delete",
    "/api/copy",
    "/api/push",
    "/api/create"
  ]
}
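Turning that file into a lookup table takes only a few lines. A minimal sketch, assuming the key names shown in the configuration above; the function names are hypothetical, and the two validation rules are checks I would want rather than documented behaviour.

```python
import json


def load_config(path: str) -> dict:
    """Read the JSON configuration file from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def build_affinity_map(config: dict) -> dict:
    """Map each configured model name to (gpu_id, ollama_port)."""
    affinity = {}
    for gpu in config["gpus"]:
        for model in gpu["models"]:
            if model in affinity:
                # Assumed rule: a model pinned to two GPUs is a config error.
                raise ValueError(f"model {model!r} is pinned to more than one GPU")
            affinity[model] = (gpu["id"], gpu["ollama_port"])
    return affinity


def validate_thresholds(config: dict) -> None:
    """Reject configs whose hysteresis gap is empty or inverted."""
    if config["temperature_resume_celsius"] >= config["temperature_pause_celsius"]:
        raise ValueError("resume temperature must be below pause temperature")
```

Failing at startup on a contradictory config is cheaper than discovering at request time that a model has two homes.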

What I Would Do Differently

The per-GPU semaphore limit of 1 is conservative. Some models are small enough that two could run on the same GPU simultaneously without a power problem. A future version could make the semaphore limit configurable per model — large models get a limit of 1, small models allow 2 or 3 concurrent. This would improve throughput for mixed workloads without sacrificing safety for heavy models.

The power budget polling interval is currently fixed at 2 seconds. Under transient load spikes, 2 seconds is a long time. A future version could use NVML directly (via pynvml) instead of shelling out to nvidia-smi, which would allow polling at much higher frequency with lower overhead.

The queue implementation is a simple asyncio queue per GPU with no priority. A priority queue would allow tagging certain request sources (interactive UI vs. background batch job) and letting interactive requests skip ahead of batch jobs during high load.
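A rough sketch of what that could look like with asyncio.PriorityQueue. None of this exists in the current code: the two priority tiers and the function names are hypothetical, and the counter is there so that requests of equal priority keep their arrival order.

```python
import asyncio
import itertools

INTERACTIVE, BATCH = 0, 1  # lower number is served first (hypothetical tiers)
_seq = itertools.count()   # tie-breaker that preserves arrival order


async def enqueue(queue: asyncio.PriorityQueue, priority: int, request_id: str) -> None:
    """Queue a request; tuples sort by priority first, then arrival order."""
    await queue.put((priority, next(_seq), request_id))


async def drain(queue: asyncio.PriorityQueue) -> list:
    """Pop everything currently queued, highest priority first."""
    order = []
    while not queue.empty():
        _, _, request_id = await queue.get()
        order.append(request_id)
    return order
```

With two batch jobs queued ahead of one interactive request, drain() would hand back the interactive request first and then the batch jobs in the order they arrived.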

Getting It

AI Shaman is open source. The repository includes the proxy, the configuration file, a systemd unit file for Linux, and a basic test suite that includes the concurrency test described above.

GitHub: github.com/benolenick/ai-shaman

If you are running multiple GPUs on a shared PSU, or even a single GPU with multiple consumers that can pile on simultaneously, the core semaphore pattern is worth adopting regardless of whether you use this code directly. The math is simple: simultaneous power draw causes reboots, a semaphore of 1 makes simultaneous draw impossible, and an AsyncIO semaphore adds essentially no latency overhead compared to the inference time it is protecting.