Posts

Featured

Load Balancing and Scaling LLM Serving

Load balancing for large language models requires specialized cache-aware routing strategies because traditional round-robin approaches can degrade prompt cache hit rates from 50-90% down to just 1/N across N replicas, eliminating the cost and latency benefits. A new technique called precise prefix cache-aware routing, which uses radix trees and real-time KV cache events from inference engines, can improve throughput by up to 108% compared to standard Kubernetes load balancing by ensuring requests hit servers that already have the relevant conversation history cached.
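To make the idea concrete, here is a minimal sketch of prefix cache-aware routing. All names are hypothetical, and a linear scan over cached prefixes stands in for the radix-tree lookup and live KV-cache event stream that real systems use:

```python
# Hypothetical sketch: route each request to the replica holding the
# longest cached prefix; fall back to the least-loaded replica on ties.
from collections import defaultdict


class PrefixAwareRouter:
    """Prefix cache-aware router over a fixed set of replicas."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.cached = defaultdict(set)   # replica -> cached prompt prefixes
        self.load = defaultdict(int)     # replica -> in-flight request count

    def _longest_cached_prefix(self, replica, prompt):
        # Linear scan stands in for a radix-tree longest-prefix lookup.
        best = 0
        for prefix in self.cached[replica]:
            if prompt.startswith(prefix):
                best = max(best, len(prefix))
        return best

    def route(self, prompt):
        # Prefer the longest cached prefix; break ties by lowest load.
        best = max(
            self.replicas,
            key=lambda r: (self._longest_cached_prefix(r, prompt),
                           -self.load[r]),
        )
        self.load[best] += 1
        # In real systems this update would come from KV-cache events
        # emitted by the inference engine, not from the router itself.
        self.cached[best].add(prompt)
        return best
```

With this policy, the first turn of a conversation lands on the least-loaded replica, and follow-up turns (whose prompts share the earlier turns as a prefix) are routed back to the same replica, preserving its KV cache hits instead of scattering them 1/N across the fleet.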

Latest Posts

New undersea cable cutter risks Internet's backbone

Android CLI and skills: Build Android apps 3x faster using any agent

Stop comparing price per million tokens: the hidden LLM API costs

Let the commits tell the story

Verified Argo CD deployments

Cloudflare Client-Side Security: smarter detection, now open to everyone

North Korean hackers blamed for hijacking popular Axios open source project to spread malware

Inside the Claude Code source

The Developer's Playbook: Defending Gates and Scaling Systems

A Quick Guide to CS