The `whichllm` command-line tool finds the best-performing local large language model (LLM) that will run on your specific hardware. Instead of just finding the largest model that fits your VRAM, it uses live benchmark data, recency, and architecture awareness to rank models by actual quality and speed.
Choosing a local LLM often feels like a guessing game based on VRAM capacity and parameter counts. This leads developers to run larger, older, or less efficient models simply because they "fit." The `whichllm` tool, available on GitHub, solves this by providing evidence-based recommendations tailored to your machine. According to the project's documentation, it can show that a newer 27-billion-parameter model outperforms an older 32-billion-parameter one on the same hardware, a distinction most tools would miss.
How Does It Rank Models?
The core of `whichllm` is a multi-factor scoring system that looks well beyond model size. It treats finding an LLM like a research project, not a storage calculation.
The tool automatically detects your hardware—NVIDIA, AMD, Apple Silicon, or CPU-only—and estimates VRAM needs by considering weights, KV cache, and framework overhead. It then ranks compatible models from Hugging Face based on a merged score from multiple sources.
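To make the VRAM estimate concrete, here is a rough sketch of how weights, KV cache, and framework overhead could add up. This is not `whichllm`'s actual code: the function name, the flat 1 GB overhead, and the example model shape (64 layers, 8 KV heads, 4.5 bits per weight) are illustrative assumptions.

```python
def estimate_vram_gb(
    params_billion: float,
    bits_per_weight: float,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    kv_bytes_per_value: int = 2,   # assumption: fp16 KV cache
    overhead_gb: float = 1.0,      # assumption: flat framework overhead
) -> float:
    """Rough VRAM estimate in GB (1 GB = 1e9 bytes)."""
    # Weights: parameter count times bits per weight, converted to bytes.
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    # KV cache: one key and one value vector per layer, per KV head, per token.
    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes_per_value
    kv_gb = kv_bytes / 1e9
    return weights_gb + kv_gb + overhead_gb


# Hypothetical 27B model, ~4.5 bits per weight, 8K context.
needed = estimate_vram_gb(27, 4.5, n_layers=64, n_kv_heads=8,
                          head_dim=128, context_len=8192)
print(f"Estimated VRAM: {needed:.1f} GB, fits in 24 GB: {needed <= 24}")
```

The KV cache term grows linearly with context length, which is why a model that fits at a short context can stop fitting at a long one.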
Key ranking factors include:
- Benchmark Quality: Scores are aggregated from trusted sources like LiveBench, Artificial Analysis, Chatbot Arena ELO, and the Open LLM Leaderboard.
- Recency-Awareness: The system automatically demotes scores from stale leaderboards, preventing an outdated 2024 model from outranking a superior 2026-generation model.
- Evidence Grading: Every benchmark score is graded by its source. A direct match gets full confidence, while scores inherited from a base model or self-reported by an uploader are heavily discounted.
- Speed & Architecture: It models tokens-per-second (t/s) based on memory bandwidth and quantization, ensuring the top pick is not just powerful but usable.
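The evidence grading and recency factors above can be pictured as discounts applied to each benchmark score before merging. The sketch below is a guess at the general shape, not the project's real formula: the evidence weights, the 12-month half-life, and the source names are all made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical confidence per evidence grade (not whichllm's real values).
EVIDENCE_WEIGHT = {"direct": 1.0, "inherited": 0.5, "self_reported": 0.3}


@dataclass
class BenchmarkScore:
    source: str        # leaderboard name (illustrative)
    value: float       # score normalized to 0-100
    evidence: str      # "direct", "inherited", or "self_reported"
    age_months: float  # how stale the leaderboard entry is


def recency_weight(age_months: float, half_life_months: float = 12.0) -> float:
    """Exponential decay: a score loses half its weight every half-life."""
    return 0.5 ** (age_months / half_life_months)


def merged_score(scores: list[BenchmarkScore]) -> float:
    """Mean of the scores after evidence and recency discounts."""
    if not scores:
        return 0.0
    discounted = [
        s.value * EVIDENCE_WEIGHT[s.evidence] * recency_weight(s.age_months)
        for s in scores
    ]
    return sum(discounted) / len(discounted)


old_model = [BenchmarkScore("leaderboard_a", 90.0, "direct", age_months=24)]
new_model = [BenchmarkScore("leaderboard_b", 85.0, "direct", age_months=0)]
print(merged_score(old_model), merged_score(new_model))
```

Under these assumed weights, a two-year-old 90 is discounted below a fresh 85, which is the kind of demotion the recency factor describes.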
For example, on an RTX 4090 with 24 GB of VRAM, `whichllm` recommends Qwen3.6-27B with a score of 92.8, even though a larger 32B model also fits. The smaller model is ranked higher because of its superior benchmark performance and newer architecture.
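The speed side of that trade-off can be approximated with a common rule of thumb: single-stream decoding is memory-bandwidth bound, so tokens per second is roughly bandwidth divided by the bytes of weights read per token. The sketch below uses that heuristic, which may differ from how `whichllm` models speed. The 0.6 efficiency factor and 4.5 bits per weight are assumptions; 1008 GB/s is the RTX 4090's published memory bandwidth.

```python
def estimate_tokens_per_second(
    params_billion: float,
    bits_per_weight: float,
    bandwidth_gb_s: float,
    efficiency: float = 0.6,  # assumption: fraction of peak bandwidth achieved
) -> float:
    """Bandwidth-bound decode speed: each token reads all weights once."""
    weights_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / weights_gb * efficiency


RTX_4090_BANDWIDTH_GB_S = 1008.0

# Compare a 27B and a 32B dense model at the same (assumed) quantization.
for size in (27, 32):
    tps = estimate_tokens_per_second(size, 4.5, RTX_4090_BANDWIDTH_GB_S)
    print(f"{size}B: ~{tps:.0f} tokens/s")
```

By this estimate the smaller model also decodes faster on the same card, so a pick with better benchmark scores and fewer parameters can win on both quality and speed.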
What Can You Do with It?
Beyond just providing a ranked list, `whichllm` includes several commands to streamline the entire local LLM workflow from planning to execution.
You can simulate hardware you don't own to plan a purchase with `whichllm --gpu "RTX 5090"`. The `plan` command works in reverse, telling you what hardware you'd need for a specific model like "llama 3 70b". Once you've chosen a model, you can immediately start a conversation using `whichllm run` or get a ready-to-use Python script with `whichllm snippet`. These commands handle the creation of an isolated environment, dependency installation, and model downloading.
This focus on actionable output helps combat the growing problem of "AI slop," where low-quality or hallucinated AI-generated content pollutes datasets and research. By prioritizing verified, benchmarked models, developers can make more informed choices. The issue has become serious enough that platforms like arXiv are now banning researchers who submit papers with unchecked, LLM-generated content, according to The Verge.
The Trending Society Take
Tools like `whichllm` represent a critical shift in the AI ecosystem from "bigger is better" to "smarter is better." For too long, parameter count has been a vanity metric. This tool gives individual builders and small teams the power to make evidence-based decisions that were previously only possible for large, well-resourced labs with dedicated evaluation teams. It's a move toward democratizing not just access to models, but access to quality.








