In the following tutorial we will focus on how to deploy an API endpoint for Meta’s Llama model with GCP’s Cloud Run.
We will build a dockerized service and:
(1) test the service locally with a local model
(2) test the service locally with the model stored on Google Cloud Storage
(3) deploy it to Google Cloud Run with the model stored on Google Cloud Storage
Downloading Llama
What is Llama?
Llama (Large Language Model Meta AI) is an advanced family of large language models developed by Meta (formerly Facebook). It is designed to perform a wide range of natural language processing tasks, such as text generation, summarization, and language translation. Llama stands out for its efficiency, making it a viable choice for running AI models on resource-constrained setups like serverless environments.
What is Hugging Face?
Hugging Face is a popular open-source platform and community for building, deploying, and sharing machine learning models, particularly in the field of natural language processing. It provides tools like the Transformers library, which simplifies the use of pre-trained models like Llama, and a repository where developers can easily access and contribute models. Check out my GitHub repo to see it in action and follow along with this tutorial:
I chose probably the smallest Llama model I could find on Hugging Face, meta-llama/Llama-3.2-1B, which is around 2.3 GB.
First, we need to install the Hugging Face CLI and create an access token that is required for logging in to Hugging Face.
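For example (assuming pip is available; your installation method may differ):
pip install -U "huggingface_hub[cli]"
huggingface-cli login
The login command prompts for the access token, which can be created on the Hugging Face website under Settings → Access Tokens.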
We need to ask for permission to use the model, because otherwise we get the following error:
"Access to model meta-llama/Llama-3.2-1B is restricted and you are not in the authorized list. Visit https://huggingface.co/meta-llama/Llama-3.2-1B to ask for access."
Downloading Llama
After the request is accepted, we can download the model:
huggingface-cli download meta-llama/Llama-3.2-1B --repo-type model --local-dir ./model
Building a dockerized service and testing it locally with a local model
Since I’m using europe-west1 with my llama GCP project, the docker build command looked like this in my case:
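docker build -t europe-west1-docker.pkg.dev/llama-443912/llama/huggingface-llama-api:local .
(This command is pieced together from the build pattern below and the image tag used in the docker run command later; the exact command in the repo may differ.)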
In your case you can use the following pattern:
docker build -t <docker repository region>/<project-id>/<repository>/<image:tag> .
Now we can run our Docker image locally and mount our downloaded model into the container:
docker run -d -p 8000:8000 \
-v $(pwd)/model:/app/model \
europe-west1-docker.pkg.dev/llama-443912/llama/huggingface-llama-api:local
Our app is now available on localhost:8000
You can also use FastAPI’s Swagger UI (served at localhost:8000/docs by default) to verify the status of the service:
Model loading can take time, especially when we use Google Cloud Storage (GCS) and Cloud Run, but the service is up and running immediately because the model loads asynchronously in the background.
We can check whether the model is loaded via the /status endpoint:
Once the model is loaded, we can try it out on the /generate-text endpoint:
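Under the hood the app is a small FastAPI service. Here is a minimal sketch of what such an app.py could look like (the /status and /generate-text endpoints and the /app/model path come from this tutorial; the request fields and the background-thread loading are assumptions, so the actual repo may differ):

import threading

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

MODEL_DIR = "/app/model"  # the model is mounted into the container at this path

app = FastAPI()
generator = None  # set once the model has finished loading


def load_model():
    # Load the tokenizer and model from the mounted directory and build a text-generation pipeline
    global generator
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForCausalLM.from_pretrained(MODEL_DIR)
    generator = pipeline("text-generation", model=model, tokenizer=tokenizer)


@app.on_event("startup")
def start_loading():
    # Load the model in a background thread so the service answers requests immediately
    threading.Thread(target=load_model, daemon=True).start()


class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 50


@app.get("/status")
def status():
    if generator is None:
        return {"status": "Model is loading. Please wait."}
    return {"status": "Model is loaded."}


@app.post("/generate-text")
def generate_text(request: GenerateRequest):
    if generator is None:
        return {"status": "Model is loading. Please wait."}
    result = generator(request.prompt, max_new_tokens=request.max_new_tokens)
    return {"generated_text": result[0]["generated_text"]}

With the container running, the endpoints can be exercised with curl (assuming the request body from the sketch above):
curl http://localhost:8000/status
curl -X POST http://localhost:8000/generate-text \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, my name is"}'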
Building a dockerized service and testing it locally with Google Cloud Storage
Now let’s create an image that can mount a Google Cloud Storage bucket into our container. In this case we need to add gcsfuse $BUCKET_NAME /app/model to the startup command in our Dockerfile (which also means the gcsfuse binary has to be installed in the image).
# Start FastAPI server
# Local run
#CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
# GCP run
CMD bash -c "sleep 5 && gcsfuse $BUCKET_NAME /app/model && uvicorn app:app --host 0.0.0.0 --port 8000"
Building the image:
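docker build -t europe-west1-docker.pkg.dev/llama-443912/llama/huggingface-llama-api:latest .
(The :latest tag matches the image used in the docker run command below; your tag may of course differ.)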
We need to mount our Google Cloud Storage service account credentials, which can be created under IAM after enabling the IAM API in our project.
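For example, with the gcloud CLI (the service account name here is just an example; adjust the role to what your setup needs):
gcloud iam service-accounts create llama-gcs-reader
gcloud projects add-iam-policy-binding llama-443912 \
  --member="serviceAccount:llama-gcs-reader@llama-443912.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"
gcloud iam service-accounts keys create gcs-key.json \
  --iam-account=llama-gcs-reader@llama-443912.iam.gserviceaccount.com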
Before running our application we need to upload our model to our Google Cloud Storage bucket:
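For example (the bucket name llamaapi matches the one used in the rest of this tutorial; the region is just an example):
gsutil mb -l europe-west1 gs://llamaapi
gsutil -m cp -r ./model/* gs://llamaapi/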
Running our Docker image:
docker run -d -p 8000:8000 \
--cap-add SYS_ADMIN \
--device /dev/fuse \
-v $(pwd)/gcs-key.json:/app/gcs-key.json \
-e GOOGLE_APPLICATION_CREDENTIALS=/app/gcs-key.json \
-e BUCKET_NAME=llamaapi \
europe-west1-docker.pkg.dev/llama-443912/llama/huggingface-llama-api:latest
Since we are now using Google Cloud Storage, it will take some time to load the model:
Building a dockerized service and deploying it to Google Cloud Run with the model stored on Google Cloud Storage
First we need to enable a couple of APIs (Cloud Build API, Artifact Registry API), install the gcloud CLI, and create our repository.
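The APIs can be enabled from the console or with gcloud, for example (the Cloud Run API is included here as well because it is needed for the deployment step later):
gcloud services enable cloudbuild.googleapis.com artifactregistry.googleapis.com run.googleapis.com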
Creating the Artifact Registry repository
We will store our Docker image in Artifact Registry:
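For example (the repository name llama and the region europe-west1 match the image path used throughout this tutorial):
gcloud artifacts repositories create llama \
  --repository-format=docker \
  --location=europe-west1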
Creating cloudbuild.yaml
steps:
  - name: 'gcr.io/cloud-builders/docker'
    script: |
      docker build -t europe-west1-docker.pkg.dev/$PROJECT_ID/llama/huggingface-llama-api:latest .
    automapSubstitutions: true
images:
  - 'europe-west1-docker.pkg.dev/$PROJECT_ID/llama/huggingface-llama-api:latest'
Executing the build with gcloud:
gcloud builds submit --region=europe-west1 --config cloudbuild.yaml
After a successful build, our image is pushed and visible in our repository:
Now we can deploy our app and mount the Google Cloud Storage bucket into our container:
gcloud run deploy huggingface-llama-api \
--image=europe-west1-docker.pkg.dev/llama-443912/llama/huggingface-llama-api:latest \
--set-env-vars="BUCKET_NAME=llamaapi" \
--add-volume name=model,type=cloud-storage,bucket=llamaapi \
--add-volume-mount volume=model,mount-path=/app/model \
--max-instances=2 --min-instances=0 --port=8000 \
--allow-unauthenticated \
--region=europe-west1 \
--memory=6Gi --cpu=4 -q
After the deployment the API is up and running, but the model is still loading in the background:
As long as the model is not loaded the status will be “Model is loading. Please wait.”
It took around 15 minutes until the model was loaded:
But after that we can access the model on Google Cloud:
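For example, against the service URL printed by gcloud run deploy (the URL and the request body below are just placeholders; see the app sketch earlier):
curl -X POST https://huggingface-llama-api-xxxxxxxxxx-ew.a.run.app/generate-text \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Tell me something about Cloud Run"}'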
And let’s hope the response is something similar :-)
Conclusion
We can use the Cloud Run service to deploy an API endpoint for Meta’s Llama model without storing the model inside the container, and since Cloud Run is a pay-per-use service, this can reduce the cost of the model deployment. With a bigger model, however, we may face issues with the model loading.
Another solution could be to load the model in a separate Cloud Run service, as the Cloud Run documentation suggests:
An advantage of using Google Cloud Storage to store our model is that we don’t need to update the code in our app; we just upload the new model to our bucket and redeploy the same Cloud Run service.
You can find the code for this tutorial here: