How to Deploy Ollama on GCP Cloud Run to Run Large Language Models

Ollama is an open-source framework that makes it easy to run open large language models such as Llama 3.3, Mistral, and Gemma 2. Google Cloud Run provides a managed container environment with GPU support, making it well suited for deploying AI inference services. This article walks through deploying Ollama on Cloud Run.

Why Use Cloud Run?

  • GPU acceleration: Supports NVIDIA L4 GPU for faster inference
  • Pay-per-use: serverless operation means you pay only while the service is running
  • Auto-scaling: Automatically adjusts the number of instances based on traffic
  • Simple management: No need to handle server configuration and maintenance

Creating a GCS Bucket

When deploying Ollama on Cloud Run, using a Google Cloud Storage (GCS) bucket is important for several reasons:

  1. Persistent storage: Cloud Run instances are stateless, and data in the container is lost when instances restart or scale. Using a GCS bucket provides persistent storage to ensure model files aren’t lost.
  2. Model sharing: Large language model files are typically very large (several GB to tens of GB), and downloading these models takes time. By storing models in a GCS bucket, multiple Cloud Run instances can share the same model files, avoiding duplicate downloads.
  3. Cost-effectiveness: When Cloud Run scales to multiple instances, if each instance needs to download the model, it consumes significant bandwidth and time. Using a GCS bucket reduces this redundancy and lowers costs.
  4. Startup time optimization: By pre-storing model files in a GCS bucket, you can significantly reduce the startup time of Cloud Run instances, as instances can directly mount the bucket instead of downloading the model.

Creating an ollama_volume Bucket

First, we need to create a GCS bucket to store Ollama’s model files:

Navigate to the Cloud Storage page in the Google Cloud Console.

Click the “Create a bucket” button to start creating a new bucket.

In the “Get Started” section, name your bucket. In our example, we use “ollama_volume” as the bucket name. Remember that GCS bucket names must be globally unique, so you may need to add some unique identifiers.

Expand the “Optimize storage for data-intensive workloads” option and check the “Enable Hierarchical namespace on this bucket” option. As shown in the image, this option is important for optimizing AI/ML workloads, as it provides a filesystem-like hierarchy, supporting atomic folder renames, faster folder listings, and other features that will help optimize LLM access efficiency.

The description for “Optimize for AI/ML and analytics with a filesystem-like hierarchical structure” notes that this is a permanent choice that enables enhancements not available in standard buckets.

bucket name

Next, we need to choose the location type and region for the bucket. In the “Choose where to store your data” step:

In the “Location type” section, select the “Region” (single region) option, which will provide the lowest latency within a single region.
From the dropdown menu, select “us-central1 (Iowa)” as the storage region. This is an important choice because:

  • It defines the geographic location of the data
  • Affects cost, performance, and availability
  • Cannot be changed once set
  • Should match the region where you plan to deploy your Cloud Run service to reduce latency and optimize performance

Location type

On the storage class selection page, use the “Standard” option, which is best suited for frequently accessed LLM model files.

Storage class

On the “Choose how to control access to objects” page, select access control settings:

Access control

This page contains two main sections:

  • Public access prevention: The “Enforce public access prevention on this bucket” option is checked, which prevents the bucket’s data from being exposed publicly
  • Access control method: Select the “Uniform” option, which ensures access to all objects is governed solely by bucket-level IAM permissions

On the data protection settings page, keep the default data protection options:
Data protection

After clicking the “CREATE” button, the system will display a confirmation dialog reminding you about the public access settings for the bucket:

Confirm access

Click the “CONFIRM” button to confirm these settings and complete the bucket creation.
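
If you prefer the command line, you can create an equivalent bucket with gcloud instead of clicking through the console. The following is a sketch assuming the ollama_volume bucket name and us-central1 region used above; the hierarchical namespace flag may require a recent gcloud version:

# Create the bucket with uniform access, public access prevention,
# and hierarchical namespace enabled (mirrors the console steps above)
gcloud storage buckets create gs://ollama_volume \
  --location=us-central1 \
  --default-storage-class=STANDARD \
  --uniform-bucket-level-access \
  --public-access-prevention \
  --enable-hierarchical-namespace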

Preparing a Service Account with GCS Bucket Access

# Set environment variables
PROJECT_ID=$(gcloud config get-value project) # Your current project ID
SA_NAME="ollama-service-account" # Service account name
SA_DISPLAY_NAME="Ollama Service Account" # Service account display name

# Create service account
gcloud iam service-accounts create $SA_NAME \
  --display-name="$SA_DISPLAY_NAME" \
  --project=$PROJECT_ID

# Get the full service account email address
SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

# Grant the service account read/write access to objects in the project's GCS buckets
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:$SA_EMAIL" \
  --role="roles/storage.objectAdmin"
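
Before moving on, you can optionally confirm that the role binding was applied. A quick sketch using the variables defined above:

# Optional: list the roles granted to the service account
gcloud projects get-iam-policy $PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:${SA_EMAIL}" \
  --format="table(bindings.role)"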

Deploying Ollama on Google Cloud Run

After creating the GCS bucket and service account, we can start deploying the Ollama service on Google Cloud Run. First, navigate to the Cloud Run page in the Google Cloud Console:

Cloud Run page

Click the “DEPLOY CONTAINER” button at the top to begin the deployment process. On the deployment page:

Deployment page

  1. Select “Deploy one revision from an existing container image”
  2. Enter ollama/ollama:0.5.13 in the “Container image URL” field
  3. Enter ollama in the “Service name” field
  4. Choose us-central1 (Iowa) for “Region”, ensuring it matches your GCS bucket region
  5. Select “Require authentication” for “Authentication”, which will provide basic security protection

In the billing and traffic settings section, configure the following:

Billing and traffic settings

  1. Select “Instance-based” for “Billing”; with this mode you are billed for the full lifetime of each instance (CPU stays allocated even between requests), which is required when attaching a GPU
  2. Choose “Auto scaling” for “Service scaling”
  3. Set “Minimum number of instances” to 0, so you won’t incur costs when there are no requests
  4. Select “All” for “Ingress”, allowing direct access to the service from the internet

These settings allow your Ollama service to scale automatically based on demand while not incurring costs when there’s no traffic, achieving truly serverless deployment.

Next, you need to link the GCS bucket as a Cloud Run volume. Click the “VOLUMES” tab at the top:

Volume settings

On the volume configuration page:

  1. Click “New Volume” to expand volume creation options
  2. Select “Cloud Storage bucket” for “Volume type”
  3. Enter gcs-1 for “Volume name”
  4. Select the ollama_volume bucket you created earlier for “Bucket”
  5. Uncheck the “Read-only” option, as Ollama needs to write model files to the bucket

This step is very important as it mounts the GCS bucket to the Cloud Run service, allowing Ollama to persistently store downloaded model files, avoiding redownloading models each time instances restart.

Click “DONE” to complete the volume setup, then we need to mount this volume to the container.

After creating the volume, you need to mount it to the container. Click the “GO TO CONTAINER(S) TAB” button, then select the “VOLUME MOUNTS” tab:

Volume settings

On the volume mount configuration page:

Volume mount settings

  1. Select the gcs-1 volume you created earlier from the “Name” dropdown
  2. Enter /root/.ollama in the “Mount path” field (note the dot before “ollama”)
  3. This path is crucial as it’s the default location where Ollama stores model files

With this setup, when Ollama downloads models, the files will be saved in the GCS bucket rather than the container’s temporary storage. This ensures model files aren’t lost when instances restart or scale, and can be shared across multiple instances, greatly improving efficiency and startup speed.

Click “DONE” to complete the volume mount setup.

In the “SECURITY” tab, you need to configure the service account to ensure Ollama can access the GCS bucket:

Security settings

On the security settings page:

  1. Select “Ollama Service Account” from the “Service account” dropdown, which is the service account we created earlier
  2. This service account has been granted storage.objectAdmin permissions, allowing Ollama to read and write to the GCS bucket
  3. Keep the default “Google-managed encryption key” option for “Encryption”

Setting the correct service account is a key step in ensuring Ollama can properly access the GCS bucket. Without appropriate permissions, the Cloud Run service won’t be able to read or write model files in the bucket.

Next, you need to set environment variables for the Ollama container. Click the “VARIABLES & SECRETS” tab:

Environment variables

On the environment variables page:

  1. Enter OLLAMA_HOST in the “Name” field
  2. Enter 0.0.0.0:8080 in the “Value” field

This environment variable is crucial: it tells Ollama to listen on all interfaces on port 8080 instead of its default of 127.0.0.1:11434. Cloud Run routes incoming requests to the container port configured for the service (8080 by default), so Ollama must listen there.

If you need to add more environment variables, you can click the “ADD VARIABLE” button. Click “DONE” after setting up the environment variables to continue to the next configuration step.

Next, you need to set up appropriate computing resources for the Ollama container. Click the “SETTINGS” tab and configure the “Resources” section:

Resource settings

On the resource settings page:

  1. Select 16 GiB for “Memory”, the minimum memory Cloud Run requires for a GPU-enabled service and enough headroom for small to mid-size models
  2. Select 4 for “CPU”, which is the minimum CPU count that supports 16 GiB of memory
  3. Check the “GPU” option to enable GPU acceleration
  4. Select NVIDIA L4 for “GPU type”, which is the GPU type supported by Cloud Run
  5. Set “Number of GPUs” to 1

These resource settings are crucial for Ollama’s performance:

  • 16 GiB of memory and 4 CPUs are the basic requirements for running medium to large LLMs
  • NVIDIA L4 GPU significantly accelerates model inference, improving response times
  • Using a GPU requires at least 4 CPUs, which is a Google Cloud requirement

Click “DONE” after setting up the resources to continue to the final configuration step.

Finally, you need to configure request handling parameters. In the request settings section:

Request settings

On the request settings page:

  1. Set “Request timeout” to 300 seconds (5 minutes), which is a reasonable timeout for processing LLM inference requests
  2. Keep “Maximum concurrent requests per instance” at 80, which is the maximum number of requests each instance can handle simultaneously
  3. Keep “Minimum number of instances” at 0, so you won’t incur costs when there’s no traffic
  4. Set “Maximum number of instances” to 3, which limits the upper bound of auto-scaling, preventing excessive costs due to traffic spikes

Limiting the maximum number of instances to a low value (like 3 or 4) is an important safety measure. If your service is attacked or experiences abnormal traffic increases, this prevents the system from automatically scaling to many instances, avoiding unexpected high costs.

After completing all the settings, click the “CREATE” button to create the service. The system will begin deploying the Ollama service, which may take a few minutes.
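
If you would rather script the deployment than click through the console, the configuration above maps roughly to a single gcloud run deploy command. The sketch below assumes the bucket, service account, image tag, and resource choices from the previous sections; the GPU flags may require the beta gcloud component, and exact flag names can vary between gcloud versions:

# Sketch of the console configuration above as a single deploy command
gcloud run deploy ollama \
  --image=ollama/ollama:0.5.13 \
  --region=us-central1 \
  --no-allow-unauthenticated \
  --service-account="$SA_EMAIL" \
  --cpu=4 --memory=16Gi \
  --gpu=1 --gpu-type=nvidia-l4 \
  --no-cpu-throttling \
  --min-instances=0 --max-instances=3 \
  --concurrency=80 --timeout=300 \
  --set-env-vars=OLLAMA_HOST=0.0.0.0:8080 \
  --add-volume=name=gcs-1,type=cloud-storage,bucket=ollama_volume \
  --add-volume-mount=volume=gcs-1,mount-path=/root/.ollama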

Once the deployment is complete, you’ll see a screen similar to this:

Deployment Complete

The deployment success page shows your Ollama service is now up and running. Key information displayed includes:

  • Service URL: The unique URL for your Ollama service (e.g., https://ollama-xxxxxxxxxxxx.us-central1.run.app)
  • Region: Confirms the service is deployed in us-central1
  • CPU and Memory: Shows the allocated resources (4 CPUs, 16 GiB memory)
  • GPU: Indicates 1 NVIDIA L4 GPU is attached
  • Revision status: “Serving traffic” means the service is active and ready to process requests
  • Authentication: “Require authentication” indicates that requests must include authentication

You can now use this service to pull models and make inference requests. The service URL is what you’ll use in your API calls, and you’ll need to include authentication as shown in the testing examples below.
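
If you need to look up the service URL later, it can also be retrieved with gcloud. A quick sketch assuming the service name and region used above:

# Print the deployed service URL
gcloud run services describe ollama \
  --region=us-central1 \
  --format='value(status.url)'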

Testing the Ollama Service

Downloading a model:

curl --location 'https://ollama-xxxxxxxxxxxx.us-central1.run.app/api/pull' \
  --header "Authorization: Bearer $(gcloud auth print-identity-token)" \
  --data '{
    "model": "deepseek-r1:1.5b"
  }'

Output:

{"status":"pulling manifest"}
{"status":"pulling aabd4debf0c8","digest":"sha256:aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc","total":1117320512}
{"status":"pulling aabd4debf0c8","digest":"sha256:aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc","total":1117320512}
{"status":"pulling aabd4debf0c8","digest":"sha256:aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc","total":1117320512,"completed":2346405}
...
...
...
{"status":"verifying sha256 digest"}
{"status":"writing manifest"}
{"status":"success"}

Next, we can test a conversation:

curl --location 'https://ollama-xxxxxxxxxxxx.us-central1.run.app/api/chat' \
  --header "Authorization: Bearer $(gcloud auth print-identity-token)" \
  --data '{
    "model": "deepseek-r1:1.5b",
    "stream": false,
    "messages": [
      { "role": "user", "content": "why is the sky blue" }
    ]
  }'

Output:

{"model":"deepseek-r1:1.5b","created_at":"2025-03-06T16:37:18.674426239Z","message":{"role":"assistant","content":"\u003cthink\u003e\n\n\u003c/think\u003e\n\nThe blue color of the sky is called **blue coloring** or **violet sky**, and it has to do with a phenomenon called **blue shift**. Here's how it works:\n\n1. **Infrared Radiation**: The Earth absorbs most of the visible light that enters our atmosphere, particularly in the red and orange wavelengths. However, it reflects some infrared radiation, which is absorbed by the atmosphere.\n\n2. **Blue Shift**: As sunlight travels through the Earth's atmosphere, the blue-inflated infrared radiation from the sun is quickly scattered away. But the longer wavelength (orange and red) light gets \"blown away\" by the atmosphere, while the shorter blue light accumulates in the sky. This causes the blue light to appear blue-violet when it reaches Earth.\n\n3. **Scattering**: The atmosphere also scatters visible light off of particles, such as nitrogen molecules and water vapor, creating the cloudy appearance. However, this scattered blue light dominates the sky's color due to the blue shift.\n\nThis phenomenon is more pronounced during clear days with less pollution, but even in polluted skies, you can still see a blue sky because some light remains visible after passing through the atmosphere."},"done_reason":"stop","done":true,"total_duration":25267446418,"load_duration":21595303847,"prompt_eval_count":8,"prompt_eval_duration":1311000000,"eval_count":245,"eval_duration":2359000000}

Conclusion

Deploying Ollama on Google Cloud Run provides a powerful, flexible, and cost-effective solution for running large language models. By following this guide, you’ve created a serverless AI inference service that:

  • Leverages GPU acceleration for much faster model inference compared to CPU-only solutions
  • Scales automatically based on your traffic needs, from zero to multiple instances
  • Maintains efficiency by using GCS buckets to avoid redundant model downloads
  • Optimizes costs by only running when needed and controlling maximum instance counts
  • Provides robust security through service account permissions and authentication