
How to Run Local LLMs with Claude Code | Unsloth Documentation

Anthropic's Claude Code, an AI agent for developers, now supports running local Large Language Models (LLMs) like Qwen3.5 and GLM-4.7-Flash through an integration with llama.cpp, according to the Unsloth Documentation. Developers can swap out Anthropic's default models for optimized local alternatives, running custom or open-source models for coding tasks directly on their own machines. This brings greater privacy and customization, and the same local models can drive Claude Code's new autonomous computer control features.

Why Run Local LLMs with Claude Code?

Imagine having a super-smart coding assistant that can not only understand your code but also execute tasks on your computer. Now, imagine you can swap out its "brain" for one you’ve trained yourself or picked from a community of open-source innovators. That's the core idea behind running local LLMs with Claude Code. This integration bypasses cloud-based APIs, delivering a more private, customizable, and often faster development experience right on your local machine.

FAQ

What does the Unsloth integration enable?

Unsloth enables developers to run local Large Language Models (LLMs) like Qwen3.5 with Claude Code, enhancing privacy and customization. The integration lets developers use custom or open-source models for coding tasks directly on their machines, and it also powers Claude Code's new autonomous computer control features.

Why run local LLMs with Claude Code?

Running local LLMs with Claude Code offers enhanced privacy, customization, and potentially faster development. It bypasses cloud-based APIs, letting developers tailor the AI's intelligence to their specific needs. This is especially useful given Claude Code's ability to directly control your computer for tasks like opening files and running developer tools.

How do I set up a local AI agent?

To set up a local AI agent, compile `llama.cpp`, download your preferred open-source model (such as Qwen3.5), and serve it locally using the `llama-server` component. Unsloth provides GGUF models optimized for performance. `llama-server` then exposes an OpenAI-compatible endpoint for Claude Code, typically on port 8001.
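The steps above can be sketched as a shell session. This is a sketch only, not the exact recipe from the Unsloth docs: the Hugging Face repo name shown is a hypothetical example, and flags like `--ctx-size` and `--jinja` are typical `llama.cpp` options you may want to adjust for your hardware.

```shell
# 1. Build llama.cpp from source.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j

# 2. Serve a GGUF model with an OpenAI-compatible endpoint on port 8001.
#    -hf downloads the model from Hugging Face; the repo name below is a
#    hypothetical example -- substitute the Unsloth GGUF repo you want.
./build/bin/llama-server \
  -hf unsloth/Qwen3.5-35B-A3B-GGUF \
  --port 8001 \
  --ctx-size 16384 \
  --jinja
```

Once the server is up, anything that speaks the OpenAI API, including Claude Code, can talk to `http://127.0.0.1:8001`.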

What is 'The Claude Code Loophole'?

'The Claude Code Loophole' refers to a fix for an issue where Claude Code's attribution header invalidates the KV cache, slowing inference with local models by up to 90%. By setting `CLAUDE_CODE_ATTRIBUTION_HEADER` to an empty string, developers avoid this bottleneck and keep inference speeds efficient.
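In practice, the fix is a one-line environment variable before launching Claude Code. The `CLAUDE_CODE_ATTRIBUTION_HEADER` variable is the one named above; the `ANTHROPIC_*` variables used here to point Claude Code at the local server are a common approach, but treat the exact names and the dummy token as assumptions to verify against your Claude Code version.

```shell
# Clear the attribution header so repeated requests reuse the KV cache.
export CLAUDE_CODE_ATTRIBUTION_HEADER=""

# Point Claude Code at the local llama-server endpoint (assumed variable names).
export ANTHROPIC_BASE_URL="http://127.0.0.1:8001"
export ANTHROPIC_AUTH_TOKEN="dummy"   # llama-server does not validate the key

# claude   # then launch Claude Code in this shell against the local model
```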

Which models does Unsloth provide?

Unsloth provides dynamically quantized GGUF models, such as Qwen3.5-35B-A3B and GLM-4.7-Flash, optimized for performance and accuracy even on consumer-grade GPUs. These models are designed to work efficiently with `llama.cpp` and Claude Code, providing a tailored AI development experience.
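Before wiring everything into Claude Code, it can help to sanity-check the local endpoint directly. The snippet below assumes a `llama-server` instance is already running on port 8001; `llama-server` serves the standard OpenAI-style `/v1/models` and `/v1/chat/completions` routes, and typically ignores the `model` field since it serves a single loaded model.

```shell
# List the model the local server is serving.
curl http://127.0.0.1:8001/v1/models

# Send a minimal chat completion request.
curl http://127.0.0.1:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "local",
        "messages": [{"role": "user", "content": "Write hello world in C."}]
      }'
```

If both calls return JSON rather than connection errors, Claude Code should be able to use the endpoint as well.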
