Building a Resilient LLM API Setup with Ollama and Load Balancing

One of my current projects is refining my local LLM (Large Language Model) setup. Right now, I’m running Ollama on my Proxmox host, with a GTX 1060 passed through to a Windows VM. While not the most powerful card, it’s capable enough to generate responses at a decent rate. Stability has been a challenge, though, because I also use that Windows VM and its GPU for other resource-intensive applications; the contention can disrupt Ollama and make the LLM API occasionally unstable.
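As a rough illustration, here’s the sort of quick probe I can run before blaming the model for a hung request. It’s a minimal sketch that assumes Ollama’s default port (11434) and a made-up LAN address; a healthy Ollama instance answers a plain GET on its root with a 200.

```python
import requests

OLLAMA_URL = "http://192.168.1.10:11434"  # placeholder address for the Windows VM

def ollama_is_up(timeout: float = 3.0) -> bool:
    """Ollama responds to GET / with 'Ollama is running' when healthy."""
    try:
        return requests.get(OLLAMA_URL, timeout=timeout).ok
    except requests.RequestException:
        return False

if __name__ == "__main__":
    print("up" if ollama_is_up() else "down")
```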

The storage requirements for these models are manageable but add up quickly. Smaller models in the 7B range only need a few gigabytes, while larger ones (13B and beyond) take up considerably more disk space. Loading a model on its first request also takes time, which hurts responsiveness, though once it’s warm in memory, responses come back quickly.
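For a concrete sense of the numbers, Ollama’s REST API exposes both pieces of this: `/api/tags` reports each pulled model’s on-disk size, and a prompt-less call to `/api/generate` pre-loads a model so the first real query doesn’t pay the warm-up cost. A minimal sketch, with a placeholder host and model name:

```python
import requests

OLLAMA_URL = "http://192.168.1.10:11434"  # placeholder host

# List locally pulled models and their on-disk sizes (reported in bytes).
tags = requests.get(f"{OLLAMA_URL}/api/tags", timeout=5).json()
for model in tags["models"]:
    print(f"{model['name']}: {model['size'] / 1e9:.1f} GB")

# A generate request with no prompt just loads the model into memory;
# keep_alive controls how long it stays resident afterward.
requests.post(
    f"{OLLAMA_URL}/api/generate",
    json={"model": "llama2:7b", "keep_alive": "30m"},  # model name is a placeholder
    timeout=120,
)
```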

To make the models accessible remotely, I’ve integrated my LLM API with a Telegram bot, allowing me to query it on the go. I use it for network insights, server information, and other tasks around my LAN. However, given the occasional instability of the Ollama API on the Windows VM, I’ve been exploring a new approach: using my HAProxy LXC container to load-balance requests across multiple machines.
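The core of that integration is simple enough to sketch: long-poll Telegram for messages and forward each one to Ollama. This is a minimal sketch rather than my actual bot; the token, host, and model name are placeholders, and a real deployment would want error handling around both APIs.

```python
import os
import requests

BOT_TOKEN = os.environ["TELEGRAM_BOT_TOKEN"]  # hypothetical env var
TG_API = f"https://api.telegram.org/bot{BOT_TOKEN}"
OLLAMA_URL = "http://192.168.1.10:11434"      # placeholder host

def ask_ollama(prompt: str) -> str:
    """One-shot, non-streaming completion against the Ollama REST API."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": "llama2:7b", "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

def main() -> None:
    offset = None
    while True:
        # Long-poll Telegram for new messages.
        updates = requests.get(
            f"{TG_API}/getUpdates",
            params={"timeout": 30, "offset": offset},
            timeout=40,
        ).json()["result"]
        for update in updates:
            offset = update["update_id"] + 1
            message = update.get("message")
            if message and "text" in message:
                reply = ask_ollama(message["text"])
                requests.post(
                    f"{TG_API}/sendMessage",
                    json={"chat_id": message["chat"]["id"], "text": reply},
                )

if __name__ == "__main__":
    main()
```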

I have another machine with an RTX 3080, which offers far greater performance, generating responses almost instantly. My plan is to make that machine the primary route for API requests, with the Windows VM demoted to a fallback. The key challenge is managing context and tokens across load-balanced instances: if a conversation is handed off to another machine mid-session, maintaining contextual continuity could be tricky, especially with the real-time aspect of responses. I’m unsure whether this will affect system prompts or whether a single, initial context setting will be enough.
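Here’s roughly what I have in mind for the HAProxy side. It’s a sketch with placeholder hostnames and addresses, assuming Ollama’s default port on both machines; the `backup` keyword keeps the Windows VM out of rotation unless the 3080 box starts failing health checks.

```haproxy
defaults
    mode http
    timeout connect 5s
    timeout client  300s   # generation can stream for a while
    timeout server  300s

frontend ollama_in
    bind *:11434
    default_backend ollama_nodes

backend ollama_nodes
    # Ollama answers GET / with a 200 when healthy, so a plain HTTP check works.
    option httpchk GET /
    server rtx3080 192.168.1.20:11434 check        # primary: the 3080 machine
    server winvm   192.168.1.10:11434 check backup # fallback: the Windows VM
```

On the context question, my working assumption is that this is solvable client-side: Ollama’s `/api/chat` endpoint keeps no conversation state on the server and expects the full message history, system prompt included, with every request, so a mid-session handoff should only cost re-sending that context.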

Despite these uncertainties, I expect this setup to significantly improve stability. If it works out, it will also let me personalize different LLMs for specific tasks. For instance, I’ve started using one model as a job-hunting assistant: after feeding it my data, I can lean on it for repetitive tasks like generating tailored cover letters and pulling up relevant details from my social media profiles. It has been a real time-saver, letting me quickly retrieve information I’d otherwise have to retype on site after site.
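That per-task personalization can live entirely in the system prompt. A minimal sketch of the pattern, pointed at the load balancer; the hostname, model, and prompt text are all placeholders, and the key detail is that the whole history travels with each request, so either backend can answer it:

```python
import requests

OLLAMA_URL = "http://haproxy.lan:11434"  # placeholder: requests go via the load balancer

# The full history, system prompt included, is sent with every request,
# so no single backend needs to remember the conversation.
history = [
    {"role": "system", "content": "You are my job-hunting assistant. "
                                  "Write concise, tailored cover letters."},
]

def chat(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    resp = requests.post(
        f"{OLLAMA_URL}/api/chat",
        json={"model": "llama2:7b", "messages": history, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    answer = resp.json()["message"]["content"]
    history.append({"role": "assistant", "content": answer})
    return answer

print(chat("Draft a cover letter for a junior sysadmin role."))
```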

In the near future, I plan to create a video showcasing this setup and my experience building it. Stay tuned for more updates on this LLM journey!