Models: Choosing for the Right Task
The most common mistake is choosing a model by size rather than by task. For almost any use case, there is a model that fits within the available VRAM while still delivering sufficient quality.
General tasks and reasoning: The top of the open-source leaderboard in 2026 looks quite different from a year ago. GLM-5 (744B) and Kimi K2.5 (1T) lead overall rankings, with Qwen 3.5 397B and GLM-4.7 close behind. For teams that need a strong balance of reasoning and coding without running a massive model, GLM-4.7 355B is a standout - it scores 85.7 on GPQA Diamond and 73.8 on SWE-bench Verified, outperforming many larger alternatives. Qwen 3.5 remains a strong multilingual choice with 262K context window support out of the box.
Coding: For code generation, Kimi K2.5 leads the open-source field with 99% on HumanEval and 85% on LiveCodeBench, under an MIT license. For real-world bug fixing in production codebases, GLM-5 and GLM-4.7 are the stronger picks - GLM-4.7 scores 94.2% on HumanEval and 73.8% on SWE-bench Verified, which tests actual GitHub issue resolution rather than synthetic benchmarks.
For constrained hardware: Qwen3.5-9B is the standout choice for 8GB VRAM in 2026 - it fits entirely in GPU memory, delivers 54–58 tokens/second, and supports a 200K+ context window at Q4_K_M quantization. GLM-Z1-9B is a strong alternative for math-heavy tasks. For systems with 16GB VRAM, Mistral Small 3.1 24B becomes an option - it fits comfortably at Q4_K_M and delivers a significant quality jump over 9B models without requiring the 40GB of VRAM a 70B-class model needs. In most practical scenarios, a well-quantized smaller model outperforms an unquantized larger one.
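If you go the quantized-GGUF route, llama-cpp-python is one common way to load such a build. The sketch below is illustrative only: the model filename is hypothetical, and the GPU offload and context settings should be tuned to your actual VRAM.

```python
# Minimal sketch: loading a Q4_K_M GGUF build with llama-cpp-python.
# The model path is a placeholder - substitute whatever quantized file
# you actually downloaded (e.g. from Hugging Face).
from llama_cpp import Llama

llm = Llama(
    model_path="./models/qwen3.5-9b-q4_k_m.gguf",  # hypothetical filename
    n_gpu_layers=-1,   # offload all layers to the GPU if they fit
    n_ctx=8192,        # raise only as far as your VRAM allows
)

out = llm(
    "Summarize the trade-offs of Q4_K_M quantization in two sentences.",
    max_tokens=128,
)
print(out["choices"][0]["text"])
```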
All of the listed open-source self-hosted LLMs are available through Hugging Face under Apache 2.0 or MIT licenses that permit commercial use.
Tools: The Selection Logic
LM Studio - the simplest starting point. A graphical interface with no command line required, built-in model search and download, and automatic optimization for available hardware. It also ships a local inference server compatible with the OpenAI API. The ideal entry point before diving into Ollama and Docker.
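Because the local server speaks the OpenAI API, any OpenAI client can talk to it. A minimal sketch, assuming the server is running on LM Studio's default port (typically 1234) with a model already loaded - adjust the base URL and model name to your setup:

```python
# Minimal sketch: calling LM Studio's local OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-model",  # LM Studio routes to whichever model is loaded
    messages=[{"role": "user", "content": "Explain LoRA in one paragraph."}],
)
print(response.choices[0].message.content)
```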
Ollama - the next step for developers. More control, a straightforward path to a production-ready API endpoint, and an active integration ecosystem.
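The same pattern works against Ollama's REST API, which listens on localhost:11434 by default. The model tag below is only an example - use whatever you have pulled locally:

```python
# Minimal sketch: calling Ollama's local REST API.
# Assumes Ollama is running and the named model has already been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5:7b",   # substitute any locally pulled model
        "prompt": "List three risks of self-hosting LLMs.",
        "stream": False,         # return a single JSON object instead of a stream
    },
    timeout=120,
)
print(resp.json()["response"])
```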
vLLM - for production with real users. Built for throughput: continuous batching and PagedAttention keep the GPU saturated when serving many concurrent requests.
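Its offline batch API illustrates the pattern; the model identifier below is only an example, and in production you would more likely run its OpenAI-compatible server (vllm serve) behind your application:

```python
# Minimal sketch: batch generation with vLLM's offline API.
# The model identifier is an example - point it at whatever weights you serve.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3")
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Write a one-line docstring for a function that merges two sorted lists.",
    "Explain continuous batching in two sentences.",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```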
Prem AI - managed self-hosting for companies without an MLOps team. Deploys on your own infrastructure, but the provider handles all operational complexity. Swiss jurisdiction, built-in fine-tuning pipeline, OpenAI-compatible API. Among enterprise-grade self-hosted LLM services, one of the most mature options available for regulated industries.
Fine-Tuning: When a Standard Model Isn't Enough
Fine-tuning is the process of further training an existing model on your own dataset. Unlike RAG, which connects external knowledge at inference time, fine-tuning modifies the model weights themselves. The result is a model that doesn't just know about your domain - it reasons in your domain's terminology and style.
The most widely used approach is LoRA (Low-Rank Adaptation): instead of retraining all billions of parameters, LoRA adds small adapter matrices on top of the existing weights. This reduces memory requirements and training time by an order of magnitude - fine-tuning a 7B model on a single RTX 5090 or RTX 4090 takes hours, not weeks.
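With the Hugging Face PEFT library, attaching LoRA adapters is a few lines. The sketch below shows only the setup - the base model, rank, and target modules are illustrative choices, and the actual training loop (for example with transformers' Trainer or trl's SFTTrainer) is omitted:

```python
# Minimal sketch: wrapping a base model with LoRA adapters via PEFT.
# Base model, rank, and target modules are illustrative - tune them for your task.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"  # example base model
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

lora_config = LoraConfig(
    r=16,                                 # rank of the adapter matrices
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total weights
# ...train on your dataset with your preferred training loop...
```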
When it makes sense: a law firm that wants the model to respond in a specific legal style; a healthcare organization with proprietary terminology; a product where the model must adhere to a strict tone of voice. RAG answers the question "what does the model know" - fine-tuning answers "how does it think and respond."