- Transformers: reference inference behavior for Transformers-format models, broadest compatibility
- High-speed backends (vLLM, SGLang, etc.): speed and stability at large scale
- GGUF file-based backends (Ollama, llama.cpp, LM Studio, etc.): low VRAM and RAM requirements
Well, it depends on your use case, the hardware you have, and the model you want to use…
Best approach
Because you are already using a Hugging Face Transformers model and want to run it locally, the best default is:
start with plain Transformers first, inside a normal Python script or app.
Do not switch immediately to a separate runtime or local server unless you already know you need one. Hugging Face still presents pipeline() as the easiest inference entry point, and for LLM-style generation it recommends generate() when you need more control over prompting, decoding, and memory behavior. (Hugging Face)
Why this is usually the right choice
There are a few different layers that people mix together:
- Transformers is the model library itself.
- transformers serve is a local server layer on top of it.
- vLLM / SGLang are higher-performance serving engines.
- Ollama / LM Studio / llama.cpp are local deployment runtimes, often more natural for GGUF-style workflows than for native Python Transformers development.

Hugging Face’s current serving docs explicitly say transformers serve is suitable for evaluation, experimentation, and moderate-load local or self-hosted use, while vLLM and SGLang remain the recommendation for large-scale production. (Hugging Face)
So if your situation is “I have a Transformers model and want local inference,” the most sensible first move is to stay in the same stack and make that path work cleanly before adding extra layers. (Hugging Face)
The simple decision rule
Choose plain Transformers if:
You are running inference from Python, still validating the model, still tuning prompts, or still figuring out memory limits. This is the most direct route, and Hugging Face’s docs recommend the Auto classes plus dtype="auto" and device_map="auto" as a practical starting point for loading larger models. (Hugging Face)
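That starting point is short enough to sketch. A minimal loading helper, assuming a recent Transformers release (newer versions accept `dtype=`; older ones spell it `torch_dtype=`) and with the checkpoint name as a placeholder:

```python
def load_causal_lm(model_id: str):
    """Load a tokenizer/model pair with the memory-friendly loading defaults
    recommended in the HF docs.

    The import sits inside the function so the sketch can be read without the
    heavy dependency installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        dtype="auto",       # keep weights in their stored dtype (older releases: torch_dtype)
        device_map="auto",  # fill GPU memory first, then CPU, then disk
    )
    return tokenizer, model

# Usage (downloads the checkpoint on first call; the model id is illustrative):
# tokenizer, model = load_causal_lm("mistralai/Mistral-7B-Instruct-v0.3")
```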
Choose transformers serve only if:
Your local Python inference already works and you now want an OpenAI-compatible local endpoint for another tool, UI, or app to call. HF documents it as a local server with OpenAI SDK compatibility, but still positions it as experimental and best for moderate load rather than maximum throughput. (Hugging Face)
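Because the endpoint is OpenAI-compatible, the client side is just the OpenAI SDK pointed at localhost. A hypothetical sketch, assuming `transformers serve` is already running and listening on port 8000 (check `transformers serve --help` for the actual host/port flags in your version):

```python
def ask_local_server(prompt: str, model_id: str) -> str:
    """Send one chat turn to a locally running `transformers serve` instance.

    Assumes the default local address; the api_key value is a placeholder
    since the local server does not authenticate.
    """
    from openai import OpenAI  # lazy import: only needed when actually calling the server

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
    response = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Usage (requires the server to be up):
# print(ask_local_server("Hello!", "mistralai/Mistral-7B-Instruct-v0.3"))
```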
Choose something else only if your real goal changes
If your goal becomes high-throughput multi-user serving, that is where vLLM or SGLang starts to make more sense. If your goal becomes ultra-simple desktop/offline deployment, that is where Ollama, LM Studio, or llama.cpp starts to make more sense. But that is a different optimization target from “run my Transformers model locally in code.” (Hugging Face)
What I would do first in practice
If it is a chat or text-generation model
Use AutoTokenizer + AutoModelForCausalLM, format chat input with apply_chat_template(...) when the model expects chat-formatted messages, and call generate(). HF’s LLM guide explains that autoregressive generation is handled by generate(), and that GenerationConfig controls defaults such as stopping and decoding behavior. (Hugging Face)
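Those steps fit in one small helper. A sketch assuming a model and tokenizer are already loaded via the Auto classes:

```python
def build_messages(user_text: str) -> list[dict]:
    """Chat input in the role/content shape apply_chat_template expects."""
    return [{"role": "user", "content": user_text}]

def chat_once(model, tokenizer, user_text: str, max_new_tokens: int = 128) -> str:
    """One chat turn: apply the template, generate, decode only the new tokens."""
    inputs = tokenizer.apply_chat_template(
        build_messages(user_text),
        add_generation_prompt=True,  # append the assistant-turn prefix
        return_tensors="pt",
    ).to(model.device)
    output_ids = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Slice off the prompt tokens so only the model's reply is decoded
    return tokenizer.decode(output_ids[0, inputs.shape[-1]:], skip_special_tokens=True)
```

Decoding defaults (stopping criteria, sampling vs. greedy) come from the model's `GenerationConfig`; pass overrides such as `do_sample=True` directly to `generate()` when you need different behavior.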
If it is a classic NLP / vision / audio model
Start with pipeline() unless you already need low-level control. HF still describes pipelines as the easiest way to run inference across many tasks such as classification, QA, ASR, and feature extraction. (Hugging Face)
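The pipeline route is short enough to sketch directly; the task name here is illustrative, and the same pattern applies to QA, ASR, and the other supported tasks:

```python
def classify(texts):
    """Run a text-classification pipeline over a list of strings.

    pipeline() downloads a task-default model unless you pass model=...;
    pin a specific checkpoint for anything beyond quick experiments.
    """
    from transformers import pipeline  # lazy import for the sketch

    classifier = pipeline("text-classification")
    return classifier(texts)

# Usage (downloads the default checkpoint on first call):
# print(classify(["This library is great.", "This build is broken."]))
```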
How to make a too-large local model fit
The first tool is dtype="auto". Hugging Face documents that this initializes weights in the dtype they are stored in, which can avoid unnecessary extra memory use during loading. (Hugging Face)
The second tool is device_map="auto". Accelerate’s Big Model Inference guide says this fills GPU memory first, then CPU, then disk if necessary. That is extremely useful for getting a model to run locally even when it does not fully fit in VRAM. (Hugging Face)
But there is an important background detail: device_map="auto" is mainly a fit-it-into-memory strategy, not a fastest-possible strategy. Accelerate’s docs say this adds inference overhead because layers are moved between devices, and in multi-GPU model parallelism only one GPU is active at a time while the next waits for outputs from the previous one. (Hugging Face)
So the rule is:
- use device_map="auto" to make a large model run;
- do not expect it to be the best answer for throughput or latency. (Hugging Face)
If memory is still tight
The next thing to try is quantization, especially 8-bit or 4-bit. Hugging Face’s quantization docs say this reduces memory and compute costs and allows models that would not normally fit to run on more limited hardware; the bitsandbytes integration is the most common first step for local LLM inference. (Hugging Face)
That makes the usual progression:
- plain model with dtype="auto"
- add device_map="auto"
- add 8-bit quantization
- if needed, try 4-bit quantization. (Hugging Face)
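The quantization steps in that progression can be sketched with the bitsandbytes integration. This assumes `bitsandbytes` is installed and a CUDA GPU is available:

```python
def load_quantized(model_id: str, four_bit: bool = False):
    """Load a model in 8-bit by default, or 4-bit (NF4) if memory is still tight.

    Requires the bitsandbytes package and a CUDA-capable GPU.
    """
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    if four_bit:
        quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
    else:
        quant = BitsAndBytesConfig(load_in_8bit=True)
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        quantization_config=quant,
    )
```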
One important exception: Apple Silicon
If you are on a Mac with Apple Silicon, plain Transformers is still a reasonable first step, but MLX is worth a serious look. Hugging Face’s MLX integration docs say MLX keeps arrays in shared memory on Apple Silicon, avoids CPU↔GPU copies, supports native safetensors loading, and can load supported Transformers language models from the Hub without weight conversion. (Hugging Face)
So on Apple Silicon, my recommendation becomes:
- Transformers first if you want the most standard HF/PyTorch path;
- MLX if local speed and Apple-native efficiency become more important. (Hugging Face)
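One common way into the MLX path is the mlx-lm package (`pip install mlx-lm`, Apple Silicon only). A sketch; the checkpoint name is illustrative, and API details can vary across mlx-lm releases:

```python
def mlx_generate(model_id: str, prompt: str, max_tokens: int = 100) -> str:
    """Generate text via mlx-lm on Apple Silicon.

    Many Hub models have pre-converted mlx-community variants, and supported
    Transformers checkpoints can load without weight conversion.
    """
    from mlx_lm import load, generate  # Apple-Silicon-only dependency

    model, tokenizer = load(model_id)
    return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)

# Usage (Apple Silicon only; downloads the checkpoint on first call):
# print(mlx_generate("mlx-community/Mistral-7B-Instruct-v0.3-4bit", "Hello!"))
```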
Another exception: CPU-focused non-LLM inference
If your model is not a chat LLM, and you care about CPU latency for tasks like classification or QA, Optimum ONNX Runtime is worth considering. HF documents ONNX Runtime pipelines as a drop-in replacement for Transformers pipelines, with the same API and potential speedups on CPU and GPU. (Hugging Face)
That means:
- for local LLM work, start with standard Transformers;
- for local non-LLM task inference where latency matters, ONNX Runtime can be a strong second step. (Hugging Face)
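The ONNX Runtime path really is close to a drop-in swap. A sketch for a classification model, assuming `pip install optimum[onnxruntime]`:

```python
def onnx_classifier(model_id: str):
    """Export a Hub model to ONNX and wrap it in the familiar pipeline API.

    export=True converts the checkpoint on the fly; the returned object is
    called exactly like a regular Transformers pipeline.
    """
    from optimum.onnxruntime import ORTModelForSequenceClassification
    from transformers import AutoTokenizer, pipeline

    model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return pipeline("text-classification", model=model, tokenizer=tokenizer)
```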
The main pitfalls to avoid
1. Jumping runtimes too early
A lot of inference debugging is really about model choice, prompt format, tokenizer behavior, or memory fit, not the runtime itself. If you switch to a different runtime before validating those basics, you make debugging harder. HF’s docs already give you enough to validate those basics in native Transformers. (Hugging Face)
2. Treating device_map="auto" as a speed feature
It is primarily a memory survival feature. It can be the difference between “runs” and “doesn’t run,” but it is not the same thing as a tuned serving stack. (Hugging Face)
3. Copying older examples blindly
Recent Transformers releases still show ongoing v5 cleanup and changes in generation internals. The current release notes mention pipeline task updates/removals in the v5 cleanup and continued refactoring of generation input preparation away from older cache_position behavior. (GitHub)
4. Assuming chat-template behavior is identical across every model
The docs show the intended chat-template path, but there have also been recent model-specific issues around apply_chat_template(...) return types and downstream generate() behavior, especially in multimodal cases. If you hit a weird template/generation mismatch, check current issues before assuming your code is fundamentally wrong. (Hugging Face)
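A quick way to rule out template trouble is to render the template to a plain string before tokenizing, so you can eyeball the special tokens and role markers. A small debug helper:

```python
def show_rendered_prompt(tokenizer, messages) -> str:
    """Render the chat template as text (tokenize=False) so the exact prompt
    the model will see can be inspected before blaming generate()."""
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

# Usage:
# print(show_rendered_prompt(tokenizer, [{"role": "user", "content": "Hi"}]))
```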
My clear recommendation for your case
If you are currently trying to run local inference with a Hugging Face Transformers model, I would do this:
Best default path
- Stay in plain Transformers
- Load with Auto classes
- Start with dtype="auto"
- Add device_map="auto" if the model is large
- Add 8-bit / 4-bit quantization only if memory is still the blocker
- Only after the model works reliably, decide whether you need a local server like transformers serve or a different runtime. (Hugging Face)
Bottom line
For local inference with a Hugging Face Transformers model, the best approach is usually not to leave Transformers immediately.
It is:
- Transformers first for correctness and fit.
- Quantization if needed for memory.
- transformers serve only if you want a local API.
- A different runtime only if your real goal is no longer “run this Transformers model locally in Python.” (Hugging Face)