H100 on-premise for occasional LLM usage: how do you handle GPU idle time when deploying at a client site?

Hi,

I’m a software publisher and I need to deploy my solution on-premise at a client’s site. Based on our tests, the workload requires an H100. However, the usage will be occasional and not continuous, so the GPU will be idle most of the time.

I’m not comfortable recommending a client invest in an H100 that will sit unused 80% of the time, but I don’t have a choice on the hardware side.

My questions:

  • Do other ISVs face the same issue when deploying on-premise?

  • How do you justify this to the client?

  • Does the client typically share the GPU with other internal workloads to spread the cost?

  • Are there any common organizational or contractual patterns to handle this?

This is as much a business/organizational question as a technical one. Any real-world feedback welcome.

Thanks

rent it out on vast.ai?

or crypto mining if the electricity cost makes sense.

I would not solve this by renting the card out or mining on it. That is a good way to turn a clean on-prem deployment into a security, accounting, support, and liability mess.

The first mistake is treating this as a GPU utilization problem. It is not. It is a capacity and ownership problem.

If the customer requires on-prem, then they are not buying 100 percent utilization. They are buying local capacity, data control, predictable latency, and the right to run the workload at peak without waiting for somebody else’s cloud quota. Expensive idle capacity is not unusual. Hospitals, banks, factories, backup systems, HA clusters, and DR environments all work like this. Most of that hardware is “wasted” until the day it is not.

What I would clarify very early is whether the H100 is required continuously or only at peak. If it is a peak requirement, say that plainly. Otherwise someone will run nvidia-smi two weeks after installation and conclude that the project bought a very expensive heater.
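To make the peak-versus-average point concrete for the client, here is a minimal sketch that summarizes utilization samples. It assumes you have already collected GPU utilization percentages at a fixed interval (for example by polling `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`). The function name, the 10% "busy" threshold, and the sample day are all made up for illustration.

```python
from statistics import mean


def summarize_utilization(samples, busy_threshold=10):
    """Summarize GPU utilization samples (percentages, 0-100).

    busy_threshold is an arbitrary cut-off for "the GPU was doing work".
    """
    if not samples:
        raise ValueError("need at least one sample")
    busy = [s for s in samples if s >= busy_threshold]
    return {
        "average_pct": round(mean(samples), 1),
        "peak_pct": max(samples),
        "busy_fraction": round(len(busy) / len(samples), 2),
    }


# Hypothetical day: 20 idle hourly samples, 4 samples where the workload runs.
hourly = [0] * 20 + [98, 100, 97, 99]
summary = summarize_utilization(hourly)
print(summary)
```

The average looks terrible and the peak shows the card fully used, which is exactly the gap to explain before someone else finds it.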

The normal enterprise pattern is not “make the ISV keep the GPU busy”. The normal pattern is that the client owns the idle capacity. Your application gets a reservation or priority, and the client can use the remaining time for their own batch inference, embeddings, document processing, evaluation jobs, internal ML experiments, or whatever else fits their governance. Slurm or Kubernetes can do the scheduling if they already have that kind of environment. MIG may help for smaller independent workloads, but it is not magic. If your model needs the whole card, it needs the whole card.
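The "priority for the vendor app, leftovers for the client" idea can be sketched as a toy simulation. This is not how Slurm or Kubernetes are configured; it is only a hypothetical model with invented job names and priorities, where the highest-priority runnable job gets the GPU for each one-hour slot and lower-priority work resumes afterwards.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int  # higher wins; the vendor app gets the highest value
    hours: int     # remaining GPU-hours of work


def schedule_hour(queue):
    """Give the GPU to the highest-priority job with work left, or None if idle."""
    runnable = [j for j in queue if j.hours > 0]
    if not runnable:
        return None
    job = max(runnable, key=lambda j: j.priority)
    job.hours -= 1
    return job.name


def simulate(arrivals, total_hours):
    """arrivals maps an hour to the list of jobs submitted at that hour."""
    queue, timeline = [], []
    for hour in range(total_hours):
        queue.extend(arrivals.get(hour, []))
        timeline.append(schedule_hour(queue))
    return timeline


# Hypothetical workload: a client batch job fills idle time,
# then the vendor app arrives at hour 2 and takes over for 2 hours.
arrivals = {
    0: [Job("client-embeddings", priority=1, hours=5)],
    2: [Job("vendor-app", priority=10, hours=2)],
}
timeline = simulate(arrivals, 8)
print(timeline)
```

In this model the client's batch job is paused while the vendor app runs and picks up again once it finishes. A real setup would also need to decide what happens to a preempted job's state, which this sketch ignores.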

I would be very careful with the sales story here. “We will keep the GPU busy” is the wrong promise. The honest promise is “this workload needs this class of hardware when it runs, and on-prem means paying for availability rather than consumption.”

So the contract needs to state one of three things clearly. The GPU is dedicated to your product. Or the GPU is shared, with your product having priority. Or you provide the whole thing as a managed appliance/capacity service.

Leaving that undecided is how these projects become awkward. The bad outcome is not an idle H100. The bad outcome is an idle H100 that nobody is allowed to use, nobody budgeted correctly, and everyone blames on the software vendor.