Instructions to use tiiuae/falcon-40b-instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use tiiuae/falcon-40b-instruct with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="tiiuae/falcon-40b-instruct", trust_remote_code=True)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-40b-instruct", trust_remote_code=True)
- Notebooks
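Once the pipeline is loaded (note: the 40B model needs on the order of 80+ GB of GPU memory in half precision), it is called like any text-generation pipeline. A minimal sketch of preparing a prompt and sampling settings; the simple "User:/Assistant:" turn format is an assumption based on how Falcon instruct models are commonly prompted, not an official template:

```python
# NOTE: the "User:/Assistant:" format below is an assumption, not an
# official prompt template for tiiuae/falcon-40b-instruct.

def build_prompt(user_message: str) -> str:
    """Wrap a user message in a simple instruct-style turn format."""
    return f"User: {user_message}\nAssistant:"

# Typical sampling settings; tune for your use case.
generation_kwargs = {
    "max_new_tokens": 256,
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.9,
}

prompt = build_prompt("Write a haiku about the desert.")

# The pipeline from the snippet above would then be called as:
# outputs = pipe(prompt, **generation_kwargs)
# print(outputs[0]["generated_text"])
```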
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use tiiuae/falcon-40b-instruct with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "tiiuae/falcon-40b-instruct"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "tiiuae/falcon-40b-instruct",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
Use Docker
docker model run hf.co/tiiuae/falcon-40b-instruct
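The curl call above can also be made from Python. A minimal sketch that builds the same request body for the OpenAI-compatible /v1/completions endpoint; the URL assumes a local vLLM server on the default port 8000:

```python
import json

# Endpoint assumes a local vLLM server started with the default port.
URL = "http://localhost:8000/v1/completions"

# Same request body as the curl example above.
payload = {
    "model": "tiiuae/falcon-40b-instruct",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5,
}
body = json.dumps(payload)

# With a running server, send it with any HTTP client, e.g.:
# import urllib.request
# req = urllib.request.Request(
#     URL, data=body.encode(), headers={"Content-Type": "application/json"}
# )
# resp = json.loads(urllib.request.urlopen(req).read())
# print(resp["choices"][0]["text"])
```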
- SGLang
How to use tiiuae/falcon-40b-instruct with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "tiiuae/falcon-40b-instruct" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "tiiuae/falcon-40b-instruct",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "tiiuae/falcon-40b-instruct" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "tiiuae/falcon-40b-instruct",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
- Docker Model Runner
How to use tiiuae/falcon-40b-instruct with Docker Model Runner:
docker model run hf.co/tiiuae/falcon-40b-instruct
ValueError: sharded is not supported for AutoModel ERROR
Using the latest revision of falcon-40b-instruct, there is a problem when running on SageMaker.
The endpoint cannot be started when following these instructions: https://github.com/marshmellow77/falcon-document-chatbot/blob/main/deploy-falcon-40b-instruct.ipynb
Yesterday everything worked fine.
The error message is the following:
ValueError: sharded is not supported for AutoModel
The current workaround is to pin the latest revision that works:
llm_model = HuggingFaceModel(
role=role,
image_uri=llm_image,
env={
'HF_MODEL_ID': hf_model_id,
'HF_MODEL_REVISION': "1e7fdcc9f45d13704f3826e99937917e007cd975",
# 'HF_MODEL_QUANTIZE': "bitsandbytes", # comment in to quantize
'SM_NUM_GPUS': json.dumps(number_of_gpu),
'MAX_INPUT_LENGTH': json.dumps(1900), # Max length of input text
'MAX_TOTAL_TOKENS': json.dumps(2048), # Max length of the generation (including input text)
}
)
This could be caused by this change: https://huggingface.co/tiiuae/falcon-40b/commit/f1ba7d328c06aa6fbb4a8afd3c756f46d7e6b232 together with this line:
https://github.com/huggingface/text-generation-inference/blob/b7327205a6f2f2c6349e75b8ea484e1e2823075a/server/text_generation_server/models/__init__.py#L233
This is exactly what I'm running into when trying to make this work. I thought this was an issue with the HF inference server, thanks for pointing this out!
The problematic change has been reverted with https://huggingface.co/tiiuae/falcon-40b-instruct/commit/ca78eac0ed45bf64445ff0687fabba1598daebf3 ; everything now works as before with the currently released files on main.
Hello,
I am still running into the same issue with the 7b-instruct version, even when explicitly pinning the commit that reverts the change:
config = {
'HF_MODEL_ID': "tiiuae/falcon-7b-instruct", # model_id from hf.co/models
'HF_MODEL_REVISION': "eb410fb6ffa9028e97adb801f0d6ec46d02f8b07",
'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
'MAX_INPUT_LENGTH': json.dumps(1024), # Max length of input text
'MAX_TOTAL_TOKENS': json.dumps(2048), # Max length of the generation (including input text)
# 'HF_MODEL_QUANTIZE': "bitsandbytes", # comment in to quantize
}
Running on ml.g5.48xlarge with number_of_gpu = 8 above.
Any ideas what could be wrong in my setup?
Still having the issue...
Still having the issue...
It turns out sharding is not supported for the 7B variants. You need to either choose an instance that has a single GPU, or explicitly set number_of_gpu = 1 in your config.
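Concretely, the config from the earlier post can be adjusted as follows. This is a sketch reusing the variable names from the thread; only SM_NUM_GPUS changes:

```python
import json

# 7B variants cannot be sharded, so force a single GPU.
number_of_gpu = 1

config = {
    "HF_MODEL_ID": "tiiuae/falcon-7b-instruct",  # model_id from hf.co/models
    "SM_NUM_GPUS": json.dumps(number_of_gpu),    # must be 1 for the 7B model
    "MAX_INPUT_LENGTH": json.dumps(1024),        # max length of input text
    "MAX_TOTAL_TOKENS": json.dumps(2048),        # max generation length (incl. input)
}
```

Alternatively, pick an instance type that only has one GPU; any single-GPU instance would do (a specific choice like ml.g5.2xlarge is an assumption, not something stated in this thread).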
Hello, I re-opened this discussion because I found the issue reported again after the last commit.
Here is my configuration:
{
"HF_MODEL_ID": "tiiuae/falcon-40b-instruct",
"SM_NUM_GPUS": "4",
"HF_MODEL_QUANTIZE": "bitsandbytes",
"MAX_INPUT_LENGTH": "1024",
"MAX_TOTAL_TOKENS": "2048"
}
With this configuration and using an instance ml.g5.12xlarge I get the error message: ValueError: sharded is not supported for AutoModel
Adding "HF_MODEL_REVISION": "ca78eac0ed45bf64445ff0687fabba1598daebf3" to deploy the previous commit works perfectly fine.
The issue is again in the last commit uploaded: ecb78d97ac356d098e79f0db222c9ce7c5d9ee5f
I ran into the same issue today.
Changed the revision and it works fine, as mentioned by valenlopez3 above.