Instructions to use tiiuae/falcon-40b-instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use tiiuae/falcon-40b-instruct with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="tiiuae/falcon-40b-instruct", trust_remote_code=True)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-40b-instruct", trust_remote_code=True)
- Notebooks
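Once the pipeline is loaded (note: the 40B model needs on the order of 80+ GB of GPU memory in half precision), it is called like any text-generation pipeline. A minimal sketch of preparing a prompt and sampling settings; the simple "User:/Assistant:" turn format is an assumption based on how Falcon instruct models are commonly prompted, not an official template:

```python
# NOTE: the "User:/Assistant:" format below is an assumption, not an
# official prompt template for tiiuae/falcon-40b-instruct.

def build_prompt(user_message: str) -> str:
    """Wrap a user message in a simple instruct-style turn format."""
    return f"User: {user_message}\nAssistant:"

# Typical sampling settings; tune for your use case.
generation_kwargs = {
    "max_new_tokens": 256,
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.9,
}

prompt = build_prompt("Write a haiku about the desert.")

# The pipeline from the snippet above would then be called as:
# outputs = pipe(prompt, **generation_kwargs)
# print(outputs[0]["generated_text"])
```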
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use tiiuae/falcon-40b-instruct with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "tiiuae/falcon-40b-instruct"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "tiiuae/falcon-40b-instruct",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
Use Docker
docker model run hf.co/tiiuae/falcon-40b-instruct
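The curl call above can also be made from Python. A minimal sketch that builds the same request body for the OpenAI-compatible /v1/completions endpoint; the URL assumes a local vLLM server on the default port 8000:

```python
import json

# Endpoint assumes a local vLLM server started with the default port.
URL = "http://localhost:8000/v1/completions"

# Same request body as the curl example above.
payload = {
    "model": "tiiuae/falcon-40b-instruct",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5,
}
body = json.dumps(payload)

# With a running server, send it with any HTTP client, e.g.:
# import urllib.request
# req = urllib.request.Request(
#     URL, data=body.encode(), headers={"Content-Type": "application/json"}
# )
# resp = json.loads(urllib.request.urlopen(req).read())
# print(resp["choices"][0]["text"])
```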
- SGLang
How to use tiiuae/falcon-40b-instruct with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "tiiuae/falcon-40b-instruct" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "tiiuae/falcon-40b-instruct",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "tiiuae/falcon-40b-instruct" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "tiiuae/falcon-40b-instruct",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
- Docker Model Runner
How to use tiiuae/falcon-40b-instruct with Docker Model Runner:
docker model run hf.co/tiiuae/falcon-40b-instruct
ValueError: sharded is not supported for AutoModel ERROR
Using the latest revision of falcon-40b-instruct, there is a problem when running on SageMaker.
The endpoint cannot be started when following these instructions: https://github.com/marshmellow77/falcon-document-chatbot/blob/main/deploy-falcon-40b-instruct.ipynb
Yesterday everything worked fine.
The error message is the following:
ValueError: sharded is not supported for AutoModel
The current workaround is to pin the latest revision that works:
llm_model = HuggingFaceModel(
role=role,
image_uri=llm_image,
env={
'HF_MODEL_ID': hf_model_id,
'HF_MODEL_REVISION': "1e7fdcc9f45d13704f3826e99937917e007cd975",
# 'HF_MODEL_QUANTIZE': "bitsandbytes", # comment in to quantize
'SM_NUM_GPUS': json.dumps(number_of_gpu),
'MAX_INPUT_LENGTH': json.dumps(1900), # Max length of input text
'MAX_TOTAL_TOKENS': json.dumps(2048), # Max length of the generation (including input text)
}
)
This could be caused by this change: https://huggingface.co/tiiuae/falcon-40b/commit/f1ba7d328c06aa6fbb4a8afd3c756f46d7e6b232 together with this line:
https://github.com/huggingface/text-generation-inference/blob/b7327205a6f2f2c6349e75b8ea484e1e2823075a/server/text_generation_server/models/__init__.py#L233
This is exactly what I'm running into when trying to make this work. I thought this was an issue with the HF inference server, thanks for pointing this out!
The problematic change has been reverted with https://huggingface.co/tiiuae/falcon-40b-instruct/commit/ca78eac0ed45bf64445ff0687fabba1598daebf3 ; everything now works as before with the currently released files on main.
Hello,
I am still running into the same issue with the 7b-instruct version, even when explicitly pinning the commit that reverts the change:
config = {
'HF_MODEL_ID': "tiiuae/falcon-7b-instruct", # model_id from hf.co/models
'HF_MODEL_REVISION': "eb410fb6ffa9028e97adb801f0d6ec46d02f8b07",
'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
'MAX_INPUT_LENGTH': json.dumps(1024), # Max length of input text
'MAX_TOTAL_TOKENS': json.dumps(2048), # Max length of the generation (including input text)
# 'HF_MODEL_QUANTIZE': "bitsandbytes", # comment in to quantize
}
Running on ml.g5.48xlarge with number_of_gpu = 8 above.
Any ideas what could be wrong in my setup?
Still having the issue...
Still having the issue...
It turns out sharding is not supported for the 7B variants. You need to either choose an instance that has a single GPU, or explicitly set number_of_gpu = 1 in your config.
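Concretely, the config from the earlier post can be adjusted as follows. This is a sketch reusing the variable names from the thread; only SM_NUM_GPUS changes:

```python
import json

# 7B variants cannot be sharded, so force a single GPU.
number_of_gpu = 1

config = {
    "HF_MODEL_ID": "tiiuae/falcon-7b-instruct",  # model_id from hf.co/models
    "SM_NUM_GPUS": json.dumps(number_of_gpu),    # must be 1 for the 7B model
    "MAX_INPUT_LENGTH": json.dumps(1024),        # max length of input text
    "MAX_TOTAL_TOKENS": json.dumps(2048),        # max generation length (incl. input)
}
```

Alternatively, pick an instance type that only has one GPU; any single-GPU instance would do (a specific choice like ml.g5.2xlarge is an assumption, not something stated in this thread).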
Hello, I re-opened this discussion because I found the issue reported again after the last commit.
Here is my configuration:
{
"HF_MODEL_ID": "tiiuae/falcon-40b-instruct",
"SM_NUM_GPUS": "4",
"HF_MODEL_QUANTIZE": "bitsandbytes",
"MAX_INPUT_LENGTH": "1024",
"MAX_TOTAL_TOKENS": "2048"
}
With this configuration and using an instance ml.g5.12xlarge I get the error message: ValueError: sharded is not supported for AutoModel
Adding "HF_MODEL_REVISION": "ca78eac0ed45bf64445ff0687fabba1598daebf3" to deploy the previous commit works perfectly fine.
The issue is again in the last commit uploaded: ecb78d97ac356d098e79f0db222c9ce7c5d9ee5f
I ran into the same issue today.
Changed the revision and it works fine, as mentioned by valenlopez3 above.