What is whichllm and what does it do?

whichllm is an open-source command-line utility that helps users find the best-performing large language models (LLMs) for their specific computer hardware. It automatically detects your system's specifications and ranks LLMs based on real benchmarks and recency, rather than just their parameter count, so its recommendations are models that will actually run well locally.

How does whichllm rank LLMs for performance?

whichllm ranks LLMs by calculating a holistic score that considers multiple factors, including merged benchmarks from trusted sources like LiveBench and Chatbot Arena. It also employs evidence-grading to weigh benchmark reliability, recency-aware scoring to prioritize current models, and architecture awareness to accurately model VRAM usage and speed based on specific hardware.

Why is whichllm useful for choosing local AI models?

whichllm is useful because it solves the 'biggest is not best' problem by analyzing a model's true performance rather than just its size, helping users avoid less capable models. It acts as an evidence-based filter for the thousands of options on platforms like Hugging Face, guiding developers to high-quality, well-benchmarked models that will actually perform well on their hardware.

What functionalities does whichllm offer beyond model recommendations?

Beyond recommendations, whichllm provides actionable tools such as the `whichllm run` command to automatically download and start an interactive chat session with a top-ranked model. It can also generate ready-to-run Python code snippets for model integration and assist with hardware planning by simulating GPU performance or calculating the hardware needed for specific models.

whichllm: Best Local LLM for PC Hardware & Performance

Choosing a local large language model often involves navigating a confusing landscape of hardware requirements and performance claims. The open-source tool `whichllm` provides a direct solution: it is a command-line utility that auto-detects your computer's hardware and ranks the best-performing LLMs that will actually run on it. According to its GitHub repository, as of May 2026 the tool uses live data to recommend models based on real benchmarks and recency, not just parameter count.

The core value of `whichllm` is solving the "biggest is not best" problem. Many developers assume the largest model that fits into their GPU's VRAM is the superior choice. However, a newer, more efficient model with fewer parameters can often outperform a larger, older one. `whichllm` codifies this by ranking models on measured benchmark performance rather than size, which keeps users from choosing a less capable model simply because it has a higher parameter count.

This addresses a growing pain point for developers. The Hugging Face Hub, a central repository for AI models, hosts thousands of options with numerous variations, forks, and quantization formats. `whichllm` acts as an evidence-based filter for this noise, helping users sidestep low-quality or poorly benchmarked models.

How Does It Rank Models?

The tool’s ranking engine goes far beyond simple hardware compatibility checks. It calculates a holistic score for each model by evaluating multiple factors, ensuring its top recommendation is genuinely the best practical choice.

The ranking is built on several key data points:

Merged Benchmarks: It pulls data from multiple trusted sources, including LiveBench, Artificial Analysis, and Chatbot Arena, to create a composite quality score.
Evidence-Grading: Every benchmark score is discounted based on its source. A direct, verified benchmark receives full weight, while a score inherited from a base model or self-reported by an uploader is heavily penalized.
Recency-Aware Scoring: The system actively demotes scores from stale leaderboards, so that a model from 2024 cannot outrank a current-generation model on the strength of an outdated test.
Architecture Awareness: VRAM and speed estimates are not generic. The tool models VRAM usage by accounting for weights, KV cache, and activation overhead, while speed calculations consider memory bandwidth, quantization efficiency, and even the difference between unified memory and discrete GPUs.
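
To make the evidence-grading and recency ideas concrete, here is a minimal Python sketch. It is not `whichllm`'s actual implementation: the evidence weights, the one-year half-life, and every name in it (`BenchmarkResult`, `recency_factor`, `composite_score`) are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical evidence weights (illustrative values, not whichllm's own).
EVIDENCE_WEIGHT = {"direct": 1.0, "inherited": 0.5, "self_reported": 0.3}


@dataclass
class BenchmarkResult:
    source: str        # e.g. "LiveBench"
    score: float       # assumed to be normalized to 0..100
    evidence: str      # key into EVIDENCE_WEIGHT
    measured_on: date  # date the leaderboard entry was recorded


def recency_factor(measured_on: date, today: date,
                   half_life_days: float = 365.0) -> float:
    """Exponential decay: a result loses half its weight every half-life."""
    age_days = max((today - measured_on).days, 0)
    return 0.5 ** (age_days / half_life_days)


def composite_score(results: list[BenchmarkResult], today: date) -> float:
    """Mean of the scores after discounting each by evidence grade and age."""
    if not results:
        return 0.0
    discounted = [
        r.score * EVIDENCE_WEIGHT[r.evidence] * recency_factor(r.measured_on, today)
        for r in results
    ]
    return sum(discounted) / len(discounted)


today = date(2026, 5, 1)
older_large = [BenchmarkResult("LiveBench", 80.0, "direct", date(2024, 5, 1))]
newer_small = [BenchmarkResult("LiveBench", 70.0, "direct", date(2026, 5, 1))]

print(composite_score(older_large, today))  # 20.0
print(composite_score(newer_small, today))  # 70.0
```

Under these assumed weights, a two-year-old result keeps only a quarter of its value, so the newer model with the lower raw score ranks first. That is the "biggest is not best" behavior the article describes, reduced to its simplest form.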

Beyond Recommendations: Planning and Execution

`whichllm` is more than just a recommendation list; it is an actionable tool for developers and enthusiasts. A key feature is the `whichllm run` command, which can automatically download the top-ranked model, set up an isolated environment, and start an interactive chat session.

For developers looking to integrate a model into their own applications, the `whichllm snippet` command generates ready-to-run Python code for any given model. This lowers the barrier to entry for experimenting with different LLMs.

The tool also serves hardware planning needs. Users can simulate how different GPUs would perform or determine the hardware required to run a specific large model:

`whichllm --gpu "RTX 4090"` simulates recommendations for a specific graphics card.
`whichllm plan "llama 3 70b"` calculates the GPU needed to run a desired model.
`whichllm upgrade "RTX 4090" "RTX 5090"` compares your current setup against potential upgrades.

All output can be formatted as JSON, allowing `whichllm` to be integrated into automated scripts and pipelines, such as feeding a model ID directly into a local Ollama instance.

The Trending Society Take

The proliferation of AI-generated content is creating a data quality crisis, forcing platforms like arXiv to ban researchers who upload unchecked "AI slop." `whichllm` applies a similar quality-control principle at the developer level. By prioritizing verifiable benchmarks and evidence over marketing claims and parameter counts, it gives builders a practical way to navigate the increasingly chaotic local AI ecosystem and to choose substance over size, promoting a healthier, more transparent development landscape.