All the output in the logs is correct, but why does the Space keep Starting and never go to Running? And why does it get killed after 30 minutes?
Seems now Running: Gorgeous - a Hugging Face Space by exact-railcar
Maybe it was due to the recent major HF Spaces outage?
Just in case, here are some parts of the Space’s code that might be problematic:
There are real code and deployment problems here.
The key point is this:
Your 404 log lines are not evidence that the Space itself is healthy. In gorgeous.py, those 404s come from your own outbound polling to the remote ModelScope endpoints imageGet, firstLastGet, and videoGet. Your code prints each response status, and on 404 it just sleeps 60 seconds and tries again. That means “the remote job is not ready yet,” not “Hugging Face accepted the Space as Running.” (Hugging Face)
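The loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual code from gorgeous.py: the names poll_until_ready and fetch_status are hypothetical, and the 60-second delay is the only detail taken from the description.

```python
import time

def poll_until_ready(fetch_status, delay=60.0, sleep=time.sleep):
    """Call fetch_status() until it returns something other than 404.

    fetch_status is a hypothetical stand-in for one outbound request to a
    remote endpoint; it returns the HTTP status code of the response.
    """
    attempts = 0
    while True:
        status = fetch_status()
        attempts += 1
        print(status)  # this is the kind of line that ends up in the Space logs
        if status != 404:
            return status, attempts
        # 404 here only means "the remote job is not ready yet".
        sleep(delay)
```

Every 404 printed by a loop like this describes the remote job, so a long run of them says nothing about whether the Space's own web server is answering requests.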
Also, the exact 30 minute kill matches Hugging Face’s default startup health timeout. For Docker Spaces, app_port defaults to 7860, and startup_duration_timeout defaults to 30 minutes unless you set it in the README metadata. (Hugging Face)
What is happening
Your code starts an aiohttp server on port 7860, then immediately enters a long remote-processing pipeline. On paper, that should be enough. But if anything after await site.start() fails, your top-level except: catches it, writes a traceback file, uploads it, and then goes into an infinite sleep. That can leave the container process alive while the actual web app is no longer healthy, which is a good fit for “keeps Starting, then gets killed after 30 minutes.” (Hugging Face)
Causes
1. The broad except: can hide a real crash and leave the container half-dead
At the bottom of gorgeous.py, you run:
try:
    uvloop.run(main())
except:
    ...write traceback...
    ...upload file...
    time.sleep(math.inf)
So if main() fails at any point, the process does not fail fast. It goes into an infinite sleep instead. That is one of the strongest explanations for “logs look active, but the Space never becomes Running.” (Hugging Face)
2. Secret handling is fragile
You build the authorization header like this:
'Bearer ' + os.getenv('modelscope')
If the modelscope secret is missing, that expression raises immediately because it is trying to concatenate a string and None. Later, the exception path also tries to upload with os.getenv('huggingface'). Hugging Face’s Docker docs say runtime secrets are injected as environment variables, so this code path depends completely on both secrets being present and valid. (Hugging Face)
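A small reproduction of that failure, with the token passed as an argument so it runs without any environment setup. The function name auth_header is hypothetical; only the concatenation itself mirrors the quoted line.

```python
def auth_header(token):
    # Same concatenation as the quoted line; `token` stands in for
    # the value returned by os.getenv('modelscope').
    return 'Bearer ' + token

try:
    auth_header(None)  # what happens when the secret is not set
except TypeError as exc:
    error = str(exc)
    print("TypeError:", error)
```

The error message mentions NoneType but not the name of the missing secret, which is why this kind of failure is hard to read in the logs.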
3. Your Dockerfile does not follow Hugging Face’s recommended Docker permissions setup
Your Dockerfile uses FROM ubuntu, sets WORKDIR /home/ubuntu, and copies files without --chown. Your code writes output.mp4 and gorgeous.txt into that working directory. Hugging Face’s Docker docs say the container runs with user ID 1000 and recommend creating that user, switching to it, setting the workdir there, and using COPY --chown=user to avoid permission issues. (Hugging Face)
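One way to confirm or rule out a permission problem is to check at startup whether the working directory is actually writable. This is a sketch only; is_writable is a hypothetical helper and is not part of the original code.

```python
import tempfile

def is_writable(directory):
    """Return True if a file can actually be created in `directory`."""
    try:
        # TemporaryFile creates a real file and removes it again on close.
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers PermissionError, FileNotFoundError, and similar failures.
        return False
```

Calling is_writable(".") before the pipeline starts and printing the result would show whether output.mp4 and gorgeous.txt can be written at all under the user the container runs as.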
4. Your README metadata is incomplete
Your README only sets:
- title
- emoji
- colorFrom
- colorTo
- sdk: docker
- pinned: false
It does not set app_port or startup_duration_timeout. Missing app_port is not automatically fatal here, because the documented Docker default is 7860 and your code also uses 7860. But missing startup_duration_timeout is why the failure cuts off at 30 minutes. (Hugging Face)
5. The web route is fragile
You serve the root path with:
app.add_routes([aiohttp.web.static('/', ..., show_index=True)])
That means your root is a static directory handler, not a normal health endpoint. aiohttp’s own docs say add_static() is for development only, not production. This may still work, but it is a weak choice for a Space that needs a simple, reliable HTTP response as soon as it boots. (Hugging Face)
6. numpy is imported directly but not installed directly
gorgeous.py imports numpy, but the Dockerfile only installs huggingface_hub and modelscope with pip. That means you are relying on a transitive dependency to provide numpy. Since your current logs show the script starts, numpy is probably arriving indirectly right now. But it is still a packaging bug waiting to break on a future rebuild. (Hugging Face)
What is probably not the main problem
The bind host is probably not the issue. Your code uses TCPSite(runner, port=7860) without a host, and aiohttp documents that host=None means all interfaces. So this is likely fine. (Hugging Face)
Best explanation in plain terms
The most likely sequence is:
- The container starts.
- Your server begins listening on 7860.
- Your worker logic starts polling remote endpoints and prints 404.
- Somewhere after startup, an exception or unhealthy state occurs.
- Your except: block prevents a clean crash and instead sleeps forever.
- Hugging Face never sees the Space become healthy enough within the startup window.
- At 30 minutes, the Space is marked unhealthy and killed. (Hugging Face)
That is why the logs can look “correct” and the Space can still stay in Starting.
Fixes
Fix 1. Add explicit README metadata
Use this at the top of README.md:
---
title: Gorgeous
sdk: docker
app_port: 7860
startup_duration_timeout: 1h
---
This makes the port explicit and raises the startup ceiling above the default 30 minutes. (Hugging Face)
Fix 2. Replace the static root with a real health endpoint
Use a simple route like:
from aiohttp import web
import os

async def index(_):
    return web.Response(text="ok")

app = web.Application()
app.router.add_get("/", index)
That is much safer than using static('/') as the root response. aiohttp’s docs explicitly warn against add_static() as a production serving strategy. (AIOHTTP)
Fix 3. Fail fast instead of sleeping forever after errors
Change this:
except:
    ...
    time.sleep(math.inf)
to this:
except Exception:
    pathlib.Path("gorgeous.txt").write_text(traceback.format_exc())
    raise
That way, the container stops clearly and the logs show the real failure. Right now, your exception handler can mask the real bug. (Hugging Face)
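Switching from a bare except: to except Exception: matters for a second reason: a bare except: also catches SystemExit and KeyboardInterrupt, so even a deliberate exit gets swallowed. The sketch below shows the difference; both function names are hypothetical and exist only for this illustration.

```python
def caught_by_bare_except(exc):
    """Return True if a bare `except:` intercepts `exc`."""
    try:
        raise exc
    except:  # intentionally bare, to show what it catches
        return True

def caught_by_except_exception(exc):
    """Return True if `except Exception:` intercepts `exc`."""
    try:
        raise exc
    except Exception:
        return True
    except BaseException:
        # Reached for SystemExit, KeyboardInterrupt, and similar.
        return False
```

With except Exception:, ordinary errors are still logged and re-raised, while a request to shut the process down passes through untouched.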
Fix 4. Validate required secrets before doing any network work
Do this near startup:
ms_token = os.environ.get("modelscope")
hf_token = os.environ.get("huggingface")

if not ms_token:
    raise RuntimeError("Missing modelscope secret")
if not hf_token:
    raise RuntimeError("Missing huggingface secret")
That turns a vague failure into an immediate, readable one. Hugging Face’s Docker docs confirm runtime secrets should be read from environment variables. (Hugging Face)
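If more secrets get added later, the same check can be folded into one helper that reports every missing name at once. This is a sketch under that assumption; require_secrets is a hypothetical name, and the env parameter exists only so the function can be exercised without touching the real environment.

```python
import os

def require_secrets(names, env=None):
    """Return {name: value} for all names, or raise listing the missing ones."""
    env = os.environ if env is None else env
    # Treat an empty string the same as an unset variable.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing secrets: " + ", ".join(missing))
    return {name: env[name] for name in names}
```

At startup this could be called as require_secrets(["modelscope", "huggingface"]) before the server or any network request is started.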
Fix 5. Follow Hugging Face’s Docker permission pattern
A safer Dockerfile shape is:
FROM python:3.11-slim
RUN useradd -m -u 1000 user
USER user
ENV HOME=/home/user PATH=/home/user/.local/bin:$PATH
WORKDIR $HOME/app
COPY --chown=user . $HOME/app
RUN pip install --no-cache-dir aiohttp uvloop pillow av numpy huggingface_hub modelscope
ENTRYPOINT ["python", "-u", "gorgeous.py"]
This matches Hugging Face’s documented UID 1000 model and avoids the file ownership trap. (Hugging Face)
Fix 6. Separate “web server” from “long worker”
Best pattern:
- start a tiny web app immediately
- return 200 OK on /
- launch the long ModelScope polling in a background task
That makes the Space healthy first, then starts the heavy work. Right now, both responsibilities are mixed together in one startup path. (Hugging Face)
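The ordering can be sketched with the standard library alone. Note the swap: this uses asyncio.start_server as a dependency-free stand-in for the aiohttp app, and handle_request, long_worker, and start are hypothetical names. The point is only the sequence: open the port, then schedule the slow work.

```python
import asyncio

async def handle_request(reader, writer):
    # Consume the request head, then answer every request with a plain 200 OK.
    await reader.readuntil(b"\r\n\r\n")
    writer.write(
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: text/plain\r\n"
        b"Content-Length: 2\r\n"
        b"Connection: close\r\n"
        b"\r\n"
        b"ok"
    )
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def long_worker(results):
    # Placeholder for the long remote polling pipeline (hypothetical).
    await asyncio.sleep(0.05)
    results.append("done")

async def start(port, results, host="0.0.0.0"):
    # 1. Open the port first so a health check can succeed right away.
    server = await asyncio.start_server(handle_request, host, port)
    # 2. Only then schedule the slow work; it runs alongside request handling.
    worker = asyncio.create_task(long_worker(results))
    return server, worker
```

With aiohttp the shape is the same: await site.start() first, then asyncio.create_task(...) for the pipeline, so a failure in the worker cannot prevent the root route from answering.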
Priority order
Do these first:
- add startup_duration_timeout: 1h
- add explicit app_port: 7860
- replace static('/') with GET / -> 200 OK
- remove the infinite sleep from the exception path
- validate secrets at startup
- fix Docker permissions and install numpy explicitly
Bottom line
The 404 lines in your screenshot are not the main bug. They are expected from your remote polling loop. The real problem is that your Space has several ways to become unhealthy while still looking alive, especially because of the broad except: plus infinite sleep, the fragile secret handling, the weak root route, and the missing startup timeout override. (Hugging Face)
