Why does my Space keep Starting and get killed after 30 min?

exact-railcar · March 31, 2026, 12:54am

All the output in the logs looks correct, so why does the Space keep Starting instead of going to Running? And why does it get killed after 30 minutes?

John6666 · March 31, 2026, 9:42am

It seems to be Running now: Gorgeous - a Hugging Face Space by exact-railcar
Maybe it was due to the recent major HF Spaces outage?

John6666 · March 31, 2026, 10:06am

Just in case, here are some parts of the Space’s code that might be problematic:

There are real code and deployment problems here.

The key point is this:

Your 404 log lines are not evidence that the Space itself is healthy. In gorgeous.py, those 404s come from your own outbound polling to the remote ModelScope endpoints imageGet, firstLastGet, and videoGet. Your code prints each response status, and on 404 it just sleeps 60 seconds and tries again. That means “the remote job is not ready yet,” not “Hugging Face accepted the Space as Running.” (Hugging Face)
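As a rough illustration of the polling behavior described above, here is a minimal sketch. The names poll_until_ready and get_status are hypothetical and not taken from gorgeous.py, and the max_attempts cap is an added assumption; from the description, the original loop simply keeps retrying.

```python
import time


def poll_until_ready(get_status, interval=60.0, max_attempts=None, sleep=time.sleep):
    """Call get_status() until it returns something other than 404.

    get_status is a hypothetical callable returning an HTTP status code.
    Returns (status, attempts). Raises TimeoutError if max_attempts is set
    and every attempt returned 404.
    """
    attempts = 0
    while True:
        status = get_status()
        attempts += 1
        print(status)  # mirrors the per-response status line seen in the logs
        if status != 404:
            return status, attempts
        if max_attempts is not None and attempts >= max_attempts:
            raise TimeoutError(f"still 404 after {attempts} attempts")
        sleep(interval)  # 404 means "not ready yet", so wait and retry
```

Seen this way, a long run of 404 lines only says that the remote job has not finished. It says nothing about whether the Space's own web server is answering.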

Also, the exact 30 minute kill matches Hugging Face’s default startup health timeout. For Docker Spaces, app_port defaults to 7860, and startup_duration_timeout defaults to 30 minutes unless you set it in the README metadata. (Hugging Face)

What is happening

Your code starts an aiohttp server on port 7860, then immediately enters a long remote-processing pipeline. On paper, that should be enough. But if anything after await site.start() fails, your top-level except: catches it, writes a traceback file, uploads it, and then goes into an infinite sleep. That can leave the container process alive while the actual web app is no longer healthy, which is a good fit for “keeps Starting, then gets killed after 30 minutes.” (Hugging Face)

Causes

1. The broad `except:` can hide a real crash and leave the container half-dead

At the bottom of gorgeous.py, you run:

try:
    uvloop.run(main())
except:
    ...write traceback...
    ...upload file...
    time.sleep(math.inf)

So if main() fails at any point, the process does not fail fast. It goes into an infinite sleep instead. That is one of the strongest explanations for “logs look active, but the Space never becomes Running.” (Hugging Face)

2. Secret handling is fragile

You build the authorization header like this:

'Bearer ' + os.getenv('modelscope')

If the modelscope secret is missing, that expression raises immediately because it is trying to concatenate a string and None. Later, the exception path also tries to upload with os.getenv('huggingface'). Hugging Face’s Docker docs say runtime secrets are injected as environment variables, so this code path depends completely on both secrets being present and valid. (Hugging Face)
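A small sketch of that failure mode. The helper name auth_header is hypothetical; only the concatenation mirrors the line quoted above.

```python
import os


def auth_header(secret_name="modelscope"):
    # os.getenv returns None when the variable is not set ...
    token = os.getenv(secret_name)
    # ... and "Bearer " + None raises TypeError before any request is sent.
    return "Bearer " + token
```

With the secret unset, calling auth_header() raises TypeError, which the top-level except: would then swallow.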

3. Your Dockerfile does not follow Hugging Face’s recommended Docker permissions setup

Your Dockerfile uses FROM ubuntu, sets WORKDIR /home/ubuntu, and copies files without --chown. Your code writes output.mp4 and gorgeous.txt into that working directory. Hugging Face’s Docker docs say the container runs with user ID 1000 and recommend creating that user, switching to it, setting the workdir there, and using COPY --chown=user to avoid permission issues. (Hugging Face)

4. Your README metadata is incomplete

Your README only sets:

title
emoji
colorFrom
colorTo
sdk: docker
pinned: false

It does not set app_port or startup_duration_timeout. Missing app_port is not automatically fatal here, because the documented Docker default is 7860 and your code also uses 7860. But missing startup_duration_timeout is why the failure cuts off at 30 minutes. (Hugging Face)

5. The web route is fragile

You serve the root path with:

app.add_routes([aiohttp.web.static('/', ..., show_index=True)])

That means your root is a static directory handler, not a normal health endpoint. web.static() is the route-definition form of add_static(), and aiohttp’s own docs say add_static() is for development only, not production. This may still work, but it is a weak choice for a Space that needs a simple, reliable HTTP response as soon as it boots. (Hugging Face)

6. `numpy` is imported directly but not installed directly

gorgeous.py imports numpy, but the Dockerfile only installs huggingface_hub and modelscope with pip. That means you are relying on a transitive dependency to provide numpy. Since your current logs show the script starts, numpy is probably arriving indirectly right now. But it is still a packaging bug waiting to break on a future rebuild. (Hugging Face)

What is probably not the main problem

The bind host is probably not the issue. Your code uses TCPSite(runner, port=7860) without a host, and aiohttp documents that host=None means all interfaces. So this is likely fine. (Hugging Face)

Best explanation in plain terms

The most likely sequence is:

1. The container starts.
2. Your server begins listening on 7860.
3. Your worker logic starts polling remote endpoints and prints 404.
4. Somewhere after startup, an exception or unhealthy state occurs.
5. Your except: block prevents a clean crash and instead sleeps forever.
6. Hugging Face never sees the Space become healthy enough within the startup window.
7. At 30 minutes, the Space is marked unhealthy and killed. (Hugging Face)

That is why the logs can look “correct” and the Space can still stay in Starting.

Fixes

Fix 1. Add explicit README metadata

Use this at the top of README.md:

---
title: Gorgeous
sdk: docker
app_port: 7860
startup_duration_timeout: 1h
---

This makes the port explicit and raises the startup ceiling above the default 30 minutes. (Hugging Face)

Fix 2. Replace the static root with a real health endpoint

Use a simple route like:

from aiohttp import web

async def index(_request):
    # Tiny 200 response so a request to "/" succeeds as soon as the server is up.
    return web.Response(text="ok")

app = web.Application()
app.router.add_get("/", index)

That is much safer than using static('/') as the root response. aiohttp’s docs explicitly warn against add_static() as a production serving strategy. (AIOHTTP)

Fix 3. Fail fast instead of sleeping forever after errors

Change this:

except:
    ...
    time.sleep(math.inf)

to this:

except Exception:
    pathlib.Path("gorgeous.txt").write_text(traceback.format_exc())
    raise

That way, the container stops clearly and the logs show the real failure. Right now, your exception handler can mask the real bug. (Hugging Face)
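A variant of the fix above, as a minimal sketch: wrap the entrypoint so that any failure prints the traceback to stderr and exits with a non-zero code. The name run_or_exit is hypothetical. Note that a bare except: also catches KeyboardInterrupt and SystemExit, which is another reason to prefer except Exception.

```python
import sys
import traceback


def run_or_exit(entrypoint):
    """Run entrypoint(); on failure, print the traceback and exit non-zero."""
    try:
        entrypoint()
    except Exception:
        traceback.print_exc()  # traceback goes to stderr, so it shows up in the container logs
        sys.exit(1)            # non-zero exit code: the process stops instead of sleeping forever
```

In gorgeous.py this would presumably be called as run_or_exit(lambda: uvloop.run(main())), assuming main() is the existing coroutine.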

Fix 4. Validate required secrets before doing any network work

Do this near startup:

ms_token = os.environ.get("modelscope")
hf_token = os.environ.get("huggingface")

if not ms_token:
    raise RuntimeError("Missing modelscope secret")
if not hf_token:
    raise RuntimeError("Missing huggingface secret")

That turns a vague failure into an immediate, readable one. Hugging Face’s Docker docs confirm runtime secrets should be read from environment variables. (Hugging Face)
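If you prefer to report every missing secret in one message, a small helper along these lines could work. The helper name load_secrets and the env parameter are hypothetical; the two secret names are the ones used in this thread.

```python
import os

REQUIRED_SECRETS = ("modelscope", "huggingface")  # names taken from this thread


def load_secrets(names=REQUIRED_SECRETS, env=None):
    """Return {name: value} for all required secrets, or raise listing the missing ones."""
    env = os.environ if env is None else env
    # Treat unset and empty values the same way.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing secrets: " + ", ".join(missing))
    return {name: env[name] for name in names}
```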

Fix 5. Follow Hugging Face’s Docker permission pattern

A safer Dockerfile shape is:

FROM python:3.11-slim

RUN useradd -m -u 1000 user
USER user
ENV HOME=/home/user PATH=/home/user/.local/bin:$PATH
WORKDIR $HOME/app

COPY --chown=user . $HOME/app

RUN pip install --no-cache-dir aiohttp uvloop pillow av numpy huggingface_hub modelscope

ENTRYPOINT ["python", "-u", "gorgeous.py"]

This matches Hugging Face’s documented UID 1000 model and avoids the file ownership trap. (Hugging Face)
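To catch a permissions problem early rather than when output.mp4 is first written, you could add a startup check like this sketch. The helper name assert_writable is hypothetical, and os.getuid() assumes a Unix container.

```python
import os
import tempfile


def assert_writable(directory="."):
    """Raise RuntimeError if the current user cannot create a file in directory."""
    try:
        # Creating (and immediately discarding) a temp file is a direct write test.
        with tempfile.TemporaryFile(dir=directory):
            pass
    except OSError as exc:
        raise RuntimeError(
            f"{os.path.abspath(directory)} is not writable by uid {os.getuid()}"
        ) from exc
```

Calling assert_writable() before starting the server turns a late PermissionError into an immediate, readable startup failure.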

Fix 6. Separate “web server” from “long worker”

Best pattern:

start a tiny web app immediately
return 200 OK on /
launch the long ModelScope polling in a background task

That makes the Space healthy first, then starts the heavy work. Right now, both responsibilities are mixed together in one startup path. (Hugging Face)
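The pattern above can be sketched as follows. To stay self-contained, this sketch uses the standard library's asyncio.start_server instead of aiohttp, so it is an illustration of the ordering, not a drop-in replacement for your server. The names handle, long_worker, start, and poll are hypothetical; poll stands in for one round of the ModelScope polling.

```python
import asyncio


async def handle(reader, writer):
    """Answer any HTTP request with a small 200 response."""
    try:
        await reader.readuntil(b"\r\n\r\n")  # consume the request head
        body = b"ok"
        head = (
            "HTTP/1.1 200 OK\r\n"
            "Content-Type: text/plain\r\n"
            f"Content-Length: {len(body)}\r\n"
            "Connection: close\r\n\r\n"
        ).encode()
        writer.write(head + body)
        await writer.drain()
    except (asyncio.IncompleteReadError, asyncio.LimitOverrunError, ConnectionError):
        pass  # client went away or sent a malformed request
    finally:
        writer.close()


async def long_worker(poll, interval=60.0):
    """Hypothetical stand-in for the long remote job: poll until it reports done."""
    while not await poll():
        await asyncio.sleep(interval)


async def start(poll, host="0.0.0.0", port=7860, interval=60.0):
    # 1. The server is listening before any heavy work begins.
    server = await asyncio.start_server(handle, host, port)
    # 2. The long job runs as a background task on the same event loop.
    worker = asyncio.create_task(long_worker(poll, interval))
    return server, worker
```

A caller would typically await the worker, so that its exceptions propagate and stop the process, and then await server.serve_forever() if the server should stay up afterwards.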

Priority order

Do these first:

1. add startup_duration_timeout: 1h
2. add explicit app_port: 7860
3. replace static('/') with GET / -> 200 OK
4. remove the infinite sleep from the exception path
5. validate secrets at startup
6. fix Docker permissions and install numpy explicitly

Bottom line

The 404 lines in your screenshot are not the main bug. They are expected from your remote polling loop. The real problem is that your Space has several ways to become unhealthy while still looking alive, especially because of the broad except: plus infinite sleep, the fragile secret handling, the weak root route, and the missing startup timeout override. (Hugging Face)
