Hi, how can I duplicate a Space that is working on ZeroGPU and make it work on CPU or GPU?
This is essentially a code-porting task. If the target is a GPU Space, migration difficulty depends mainly on the GPU's capacity; if performance is sufficient, the app should run with minimal changes. If the target is a CPU Space, the copy may be useful for purposes like verifying the UI, but don't expect it to be practical for inference.
Incidentally, even when cloning from an older ZeroGPU Space to another ZeroGPU Space, minor adjustments are occasionally necessary, because dependencies and the behavior of the HF-side runtime change over time.
The reliable way to think about this is:
- Duplicate the Space
- Choose the new hardware
- Re-add secrets
- Run it once with as few code changes as possible
- Then decide whether to keep one portable codebase or split into CPU-only / dedicated-GPU versions
That order matters because, on Hugging Face, duplicating a Space is easy, but matching the runtime behavior across ZeroGPU, CPU, and dedicated GPU is the real migration work. Hugging Face’s docs say duplicated Spaces are created from the three-dot menu, are private by default, can be assigned new hardware during duplication, auto-populate public Variables, and do not copy Secrets automatically. The duplicate also starts on free CPU by default unless you choose something else. (Hugging Face)
1. Why moving off ZeroGPU is not just “change hardware”
ZeroGPU is a shared, dynamic GPU system for Spaces. Hugging Face documents it as NVIDIA H200-backed, with large = 70 GB VRAM and xlarge = 141 GB VRAM, and GPU work is requested with @spaces.GPU. The same docs also say that @spaces.GPU is effect-free outside ZeroGPU, which is the key reason many ZeroGPU apps can still run on CPU or a paid GPU without removing the decorator immediately. (Hugging Face)
That leads to a very important distinction:
- ZeroGPU → dedicated GPU is often straightforward.
- ZeroGPU → CPU is often not.
- One repo that works on all three is possible, but you need to write to the strictest contract, which is ZeroGPU’s stateless GPU model. (Hugging Face)
2. What duplication actually does
In the UI
The official flow is:
- open the Space
- click the three dots
- choose Duplicate this Space
- choose owner, Space name, visibility, hardware, storage, and variables/secrets
Hugging Face explicitly says duplicated Spaces are private by default, that the duplicate workflow can warn you about missing secrets, and that public Variables can be carried over while Secrets are not. The duplicate uses free CPU hardware by default unless you change it. (Hugging Face)
Programmatically
If you want automation, Hugging Face’s Hub API docs show duplicate_repo(..., repo_type="space", ...) and duplicate_space(...). The examples include duplicating a Space with custom hardware, and the docs say the new Space can be created in the same running/paused state as the source. (Hugging Face)
Example:
```python
from huggingface_hub import duplicate_repo

duplicate_repo(
    "owner/source-space",
    repo_type="space",
    space_hardware="t4-medium",
)
```
That is useful if you maintain many Spaces or want repeatable migrations. (Hugging Face)
3. Choose hardware by memory envelope, not by label
This is where many migrations go wrong.
Hugging Face currently lists these relevant hardware tiers:
- CPU Basic: 2 vCPU, 16 GB RAM
- CPU Upgrade: 8 vCPU, 32 GB RAM
- T4 small: 16 GB VRAM
- L4 x1: 24 GB VRAM
- A10G small: 24 GB VRAM
- A100 large: 80 GB VRAM
- ZeroGPU large: 70 GB VRAM
- ZeroGPU xlarge: 141 GB VRAM (Hugging Face)
That means a Space that runs comfortably on ZeroGPU large may still fail on a T4 or A10G, even though those are also GPU Spaces, because you are moving from 70 GB VRAM to 16–24 GB VRAM. Likewise, moving to CPU is not “same thing but slower”; it is often a change in the entire model/runtime budget. (Hugging Face)
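As a rough sanity check before picking a tier, you can estimate the memory envelope from parameter count and dtype. A minimal sketch; the 1.3 overhead factor for activations/KV cache is a rule of thumb of mine, not an official Hugging Face figure:

```python
def estimate_vram_gb(n_params_billion: float,
                     bytes_per_param: int = 2,
                     overhead: float = 1.3) -> float:
    """Rough VRAM estimate: weights * dtype size * overhead.
    bytes_per_param: 2 for fp16/bf16, 4 for fp32, 1 for int8.
    overhead: headroom for activations / KV cache (rule of thumb)."""
    return n_params_billion * 1e9 * bytes_per_param * overhead / 1e9

# A 7B model in fp16 needs roughly 18 GB, so it will not fit
# on T4 small (16 GB) but fits easily on ZeroGPU large (70 GB).
print(round(estimate_vram_gb(7), 1))  # -> 18.2
```

Crude as it is, this kind of arithmetic catches the most common mistake: assuming any "GPU" tier is interchangeable with ZeroGPU's 70 GB.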
So before touching code, decide which target you really want:
A. Dedicated GPU copy
Good when you want stable latency and a persistent GPU. This is usually the easiest destination for a ZeroGPU app. (Hugging Face)
B. CPU copy
Good for a lightweight fallback, but usually needs more code and model changes. CPU Basic and CPU Upgrade have much smaller effective inference budgets than ZeroGPU. (Hugging Face)
C. Portable one-repo design
Good when you want the same repo to work on ZeroGPU, CPU, and dedicated GPU. This is usually the best long-term design, but the code must follow ZeroGPU-safe patterns. (Hugging Face)
4. The safest migration path
This is the sequence I would recommend in practice:
Step 1: duplicate first, without rewriting code yet
Create the duplicate and restore the missing secrets. Since public Variables can be copied and Secrets cannot, missing credentials are one of the most common reasons a duplicated Space “suddenly” fails. (Hugging Face)
Step 2: let it boot once on CPU
This is useful even if your real goal is a paid GPU. It tells you whether the repo builds and starts cleanly in a fresh Space environment. Hugging Face’s duplication flow defaults to CPU anyway, so this is the natural first checkpoint. (Hugging Face)
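You can rehearse that checkpoint locally before pushing: try importing the entry module with GPUs hidden via `CUDA_VISIBLE_DEVICES`. A hedged sketch; `app` is assumed to be your Space's entry module, so adjust the name to your repo:

```python
import os
import subprocess
import sys

def cpu_smoke_test(module_name: str) -> bool:
    """Import the given module in a subprocess with GPUs hidden.
    If the import succeeds, there is no import-time CUDA work
    blocking a CPU boot (it does not prove inference will work)."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES="")
    result = subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        env=env, capture_output=True, text=True,
    )
    return result.returncode == 0

# e.g. cpu_smoke_test("app") from the repo root
```

A passing smoke test here roughly predicts whether the duplicate will at least build and start on the default free CPU.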
Step 3: only then switch hardware
If the duplicate builds and the UI appears, move it to the GPU tier you want. Paid hardware runs indefinitely by default unless you set a sleep time, while free CPU sleeps automatically after inactivity. (Hugging Face)
Step 4: change code only if the duplicate fails or performs badly
Many ZeroGPU apps already run on dedicated GPU with few or no code changes because the decorator remains harmless outside ZeroGPU. CPU is where code and model changes become much more likely. (Hugging Face)
5. Should you remove @spaces.GPU?
Usually not, at least not at first.
Hugging Face’s ZeroGPU docs state that @spaces.GPU is effect-free in non-ZeroGPU environments. So if your goal is “duplicate this ZeroGPU Space and make it run on CPU or paid GPU,” the safest first move is often to keep the decorator and only change the hardware setting. (Hugging Face)
That gives you three possible codebase strategies:
Keep it
Best if you want one repo that can still run on ZeroGPU later. (Hugging Face)
Remove it later
Best if the duplicate will live permanently on a dedicated GPU and you want simpler startup logic and lower latency. (Hugging Face)
Keep it in the portable branch, remove it in a GPU-only fork
Best if you want both portability and a highly optimized paid-GPU deployment. (Hugging Face)
6. The most important code rule
If you want the same code to work on ZeroGPU, CPU, and dedicated GPU:
Do not initialize CUDA at import time.
This is the biggest source of ZeroGPU-specific breakage. ZeroGPU is stateless, and a representative public issue shows vLLM failing because it calls torch.cuda.set_device(...) during model setup, which conflicts with the stateless GPU contract. (GitHub)
Typical anti-patterns:
```python
import torch
from diffusers import DiffusionPipeline

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("model-id").to("cuda")
```
or
```python
model = AutoModelForCausalLM.from_pretrained(...).cuda()
```
or
```python
torch.load("weights.pt", map_location="cuda")
```
Those patterns are fine in many traditional GPU environments, but they are exactly the patterns that create trouble in ZeroGPU-style runtimes. (Hugging Face)
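If you want to audit a repo for this before migrating, one crude heuristic is to flag module-level statements that mention CUDA, since code inside functions runs later (per request, under `@spaces.GPU`). A sketch only, not a real linter; substring matching on the AST dump will produce false positives on unrelated names:

```python
import ast

CUDA_MARKERS = ("cuda", "set_device")

def toplevel_cuda_calls(source: str) -> list[int]:
    """Return line numbers of top-level statements that mention CUDA.
    Function and class bodies are skipped: their code is deferred,
    which is exactly what ZeroGPU's stateless model wants."""
    hits = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                             ast.ClassDef)):
            continue  # deferred execution is fine
        if any(marker in ast.dump(node) for marker in CUDA_MARKERS):
            hits.append(node.lineno)
    return hits

src = 'import torch\npipe = load().to("cuda")\ndef f():\n    return x.cuda()\n'
print(toplevel_cuda_calls(src))  # flags line 2 only
```

Anything it flags at module level is worth moving into a function before the migration, regardless of the target hardware.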
7. The portable pattern that works best
This is the design I recommend first.
Core idea
- load safely
- detect the device only when needed
- move to GPU only inside the inference path
- keep CPU as a real fallback
- keep `@spaces.GPU` in place for portability
Example:
```python
import torch
import gradio as gr
import spaces
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "distilgpt2"

tokenizer = None
model = None
loaded_device = None

def resolve_device() -> str:
    return "cuda" if torch.cuda.is_available() else "cpu"

def load_model_for(device: str):
    global tokenizer, model, loaded_device
    if model is not None and loaded_device == device:
        return
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    dtype = torch.float16 if device == "cuda" else torch.float32
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=dtype,
    ).to(device)
    loaded_device = device

@spaces.GPU(duration=60)  # harmless outside ZeroGPU
def generate(prompt: str, max_new_tokens: int = 64):
    device = resolve_device()
    load_model_for(device)
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.inference_mode():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

demo = gr.Interface(
    fn=generate,
    inputs=[
        gr.Textbox(label="Prompt"),
        gr.Slider(1, 256, value=64, step=1, label="Max new tokens"),
    ],
    outputs=gr.Textbox(label="Output"),
)

if __name__ == "__main__":
    demo.launch()
```
Why this is good:
- still valid for ZeroGPU
- still valid for dedicated GPU
- still valid for CPU
- avoids top-level CUDA initialization
- keeps hardware switching mostly a configuration problem instead of a rewrite problem (Hugging Face)
8. How to rewrite for a dedicated-GPU-only Space
Once a dedicated GPU duplicate is stable, you may want to optimize for that environment only.
Typical changes:
- preload the model onto CUDA at startup
- remove ZeroGPU-specific duration logic
- simplify device branching
- optionally drop the `spaces` dependency if you will never go back to ZeroGPU
Example:
```python
import torch
import gradio as gr
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "distilgpt2"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.float16 if DEVICE == "cuda" else torch.float32

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=DTYPE,
).to(DEVICE)

def generate(prompt: str, max_new_tokens: int = 64):
    inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
    with torch.inference_mode():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

demo = gr.Interface(fn=generate, inputs="text", outputs="text")

if __name__ == "__main__":
    demo.launch()
```
This is cleaner and usually faster on a persistent GPU, but it is no longer the safest baseline if you still want to preserve ZeroGPU compatibility. Hugging Face’s GPU docs also note that some frameworks need CUDA-capable installs to take advantage of upgraded hardware. (Hugging Face)
9. How to rewrite for CPU
CPU migration usually needs the most adaptation.
Typical CPU changes:
- force `device = "cpu"`
- use `torch.float32` unless you know a smaller dtype is safe on CPU
- reduce batch size, resolution, token limit, or inference steps
- choose a smaller model if needed
- remove libraries that depend on GPU-specific behavior
Example:
```python
import torch
import gradio as gr
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "distilgpt2"
DEVICE = "cpu"
DTYPE = torch.float32

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=DTYPE,
).to(DEVICE)

def generate(prompt: str, max_new_tokens: int = 32):
    inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
    with torch.inference_mode():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```
For larger image or LLM apps, a CPU fork often becomes a different product tier, not just “same app, slower.” The hardware gap between CPU Basic and ZeroGPU/dedicated GPU is large enough that defaults often need to change. (Hugging Face)
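One lightweight way to express that tier difference in code is a per-device defaults table, so the CPU fork ships smaller budgets without forking the inference logic. The numbers below are illustrative, not recommendations:

```python
# Hypothetical per-device defaults: on CPU, shrink the generation
# budget instead of shipping the GPU defaults unchanged.
DEFAULTS = {
    "cuda": {"max_new_tokens": 256, "num_inference_steps": 30},
    "cpu":  {"max_new_tokens": 32,  "num_inference_steps": 8},
}

def limits_for(device: str) -> dict:
    # Unknown devices fall back to the most conservative budget.
    return DEFAULTS.get(device, DEFAULTS["cpu"])

print(limits_for("cpu")["max_new_tokens"])  # -> 32
```

Wiring UI sliders and generation calls through `limits_for(...)` keeps the same repo honest on both tiers: the app degrades its defaults rather than timing out.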
10. What usually needs changing in the repo
README.md YAML
Spaces are configured by the YAML block at the top of README.md. Hugging Face documents fields such as sdk, python_version, sdk_version, app_file, and suggested_hardware. Crucially, suggested_hardware is only a hint for duplicators; it does not assign hardware automatically. (Hugging Face)
Example:
```yaml
---
title: My Space
sdk: gradio
python_version: "3.10"
sdk_version: 5.49.1
app_file: app.py
suggested_hardware: a10g-small
---
```
(Quoting `python_version` avoids YAML parsing `3.10` as the float `3.1`.)
That is especially helpful if other people will duplicate your Space. (Hugging Face)
requirements.txt
Hugging Face says Python dependencies go in requirements.txt, earlier bootstrap packages can go in pre-requirements.txt, and Debian packages can go in packages.txt. (Hugging Face)
A typical migration-oriented requirements file might look like:
```
gradio==5.49.1
spaces>=0.30.0
torch>=2.4,<2.8
transformers>=4.45,<5
```
The exact versions depend on the app, but pinning is useful because rebuilds on Spaces are fresh deployments, and version drift is a common cause of “worked before, broken after duplicate/restart.” That is also why the official docs expose sdk_version in the README metadata. (Hugging Face)
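To catch that drift early after a duplicate's fresh rebuild, you can compare installed versions against your pins at startup and log any mismatch. A standard-library-only sketch; the pin values are examples, not recommendations:

```python
from importlib import metadata

def check_pins(pins: dict[str, str]) -> dict[str, str]:
    """Compare installed package versions against expected pins.
    Returns a map of package -> 'missing' or the installed version
    when it differs from the pin. Exact-match only; range specifiers
    like '>=2.4,<2.8' would need the `packaging` library instead."""
    drift = {}
    for pkg, expected in pins.items():
        try:
            installed = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            drift[pkg] = "missing"
            continue
        if installed != expected:
            drift[pkg] = installed
    return drift

# e.g. at app startup:
# print(check_pins({"gradio": "5.49.1", "transformers": "4.45.0"}))
```

An empty dict means the rebuilt environment matches what the source Space was tested against; anything else is the first place to look when "worked before, broken after duplicate" strikes.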
11. Common failure modes after duplication
Missing secrets
Very common. The app works in the source Space because credentials are already configured; the duplicate fails because Secrets were not copied. (Hugging Face)
Wrong target hardware
A ZeroGPU app that relied on 70 GB VRAM may not fit on 16–24 GB VRAM GPUs. Pick hardware based on actual memory needs, not on the generic word “GPU.” (Hugging Face)
CPU fallback is unrealistic
Many ZeroGPU demos were never designed for CPU-level throughput or memory. In those cases, the right solution is usually a smaller model or a separate CPU configuration. (Hugging Face)
Framework incompatibility
Some frameworks are less compatible with ZeroGPU’s stateless model than high-level Hugging Face stacks such as transformers or diffusers. The vLLM issue is the clearest public example. (Hugging Face)
Fresh rebuild reveals dependency issues
A duplicated Space gets a fresh environment. If you were relying on loosely pinned dependencies, the duplicate may surface a problem that the original had not shown yet. Hugging Face’s Spaces docs make clear that dependency installation happens from your repo files such as requirements.txt, pre-requirements.txt, and README metadata. (Hugging Face)
12. Gradio 6: only migrate it separately
If the Space also needs a Gradio 6 upgrade, treat that as a separate migration.
Gradio’s migration guide says:
- `show_api` in `launch()` becomes `footer_links`
- event-listener `show_api` / `api_name=False` become `api_visibility`
- the API surface changes in ways that can affect how clients call your Space (Gradio)
That means a safe order is:
- duplicate and stabilize
- switch hardware
- only then upgrade Gradio major versions
If you do both at once, it becomes much harder to tell whether the failure came from hardware semantics or API/framework changes. (Gradio)
13. The practical recommendations I would follow
If your goal is “I just want my own working copy”
Duplicate it, restore secrets, leave the code mostly unchanged, test the duplicate on CPU once, then switch to dedicated GPU if needed. This keeps the migration simple and uses the official duplication workflow as intended. (Hugging Face)
If your goal is “I want a strong dedicated-GPU deployment”
Duplicate it to a GPU tier, keep @spaces.GPU initially, confirm it works, then optionally refactor into a dedicated-GPU-only version with eager CUDA preload and simpler code. (Hugging Face)
If your goal is “I want one repo that runs everywhere”
Keep @spaces.GPU, avoid import-time CUDA, detect device lazily, and branch explicitly for CPU vs GPU dtype and limits. That gives you the most portable baseline. (Hugging Face)
If your goal is “I want a CPU version too”
Expect a separate configuration, and often a separate model or default UX. CPU is usually the first place where “same code, same defaults” stops being realistic. (Hugging Face)
14. Bottom line
The cleanest answer is:
- Yes, you can duplicate a ZeroGPU Space and run the duplicate on CPU or dedicated GPU.
- The simplest first step is to duplicate it and change only the hardware.
- For dedicated GPU, that often works surprisingly well with minimal code changes.
- For CPU, you usually need explicit changes to model size, dtype, limits, and sometimes the entire UX.
- If you want one codebase to work everywhere, keep `@spaces.GPU` and structure the code so CUDA is never initialized at import time. (Hugging Face)
Thank you very much! I appreciate the effort in answering.