How to Deploy Ollama on GCP Cloud Run to Run Large Language Models

Ollama is an open-source framework that makes it easy to run open large language models such as Llama 3.3, Mistral, and Gemma 2. Google Cloud Run provides a managed container environment with GPU support, making it well suited for deploying AI inference services. This article walks through deploying Ollama on Cloud Run.

Why Use Cloud Run?

  • GPU acceleration: Supports NVIDIA L4 GPU for faster inference
  • Pay-per-use: serverless operation means you pay only while the service is running
  • Auto-scaling: Automatically adjusts the number of instances based on traffic
  • Simple management: No need to handle server configuration and maintenance

Creating a GCS Bucket

When deploying Ollama on Cloud Run, using a Google Cloud Storage (GCS) bucket is important for several reasons:

  1. Persistent storage: Cloud Run instances are stateless, and data in the container is lost when instances restart or scale. Using a GCS bucket provides persistent storage to ensure model files aren’t lost.
  2. Model sharing: Large language model files are typically very large (several GB to tens of GB), and downloading these models takes time. By storing models in a GCS bucket, multiple Cloud Run instances can share the same model files, avoiding duplicate downloads.
  3. Cost-effectiveness: When Cloud Run scales to multiple instances, if each instance needs to download the model, it consumes significant bandwidth and time. Using a GCS bucket reduces this redundancy and lowers costs.
  4. Startup time optimization: By pre-storing model files in a GCS bucket, you can significantly reduce the startup time of Cloud Run instances, as instances can directly mount the bucket instead of downloading the model.

Creating an ollama_volume Bucket

First, we need to create a GCS bucket to store Ollama’s model files:

Navigate to the Cloud Storage page in the Google Cloud Console.

Click the “Create a bucket” button to start creating a new bucket.

In the “Get Started” section, name your bucket. In our example, we use “ollama_volume” as the bucket name. Remember that GCS bucket names must be globally unique, so you may need to add some unique identifiers.

Expand the “Optimize storage for data-intensive workloads” option and check the “Enable Hierarchical namespace on this bucket” option. As shown in the image, this option is important for optimizing AI/ML workloads, as it provides a filesystem-like hierarchy, supporting atomic folder renames, faster folder listings, and other features that will help optimize LLM access efficiency.

The description for “Optimize for AI/ML and analytics with a filesystem-like hierarchical structure” notes that this is a permanent choice that enables enhancements not available in standard buckets.

bucket name

Next, we need to choose the location type and region for the bucket. In the “Choose where to store your data” step:

In the “Location type” section, select the “Region” (single region) option, which will provide the lowest latency within a single region.
From the dropdown menu, select “us-central1 (Iowa)” as the storage region. This is an important choice because:

  • It defines the geographic location of the data
  • Affects cost, performance, and availability
  • Cannot be changed once set
  • Should match the region where you plan to deploy your Cloud Run service to reduce latency and optimize performance

Location type

On the storage class selection page, use the “Standard” option, which is best suited for frequently accessed LLM model files.

Storage class

On the “Choose how to control access to objects” page, select access control settings:

Access control

This page contains two main sections:

  • Public access prevention: The “Enforce public access prevention on this bucket” option is checked, which prevents the bucket’s data from being exposed publicly
  • Access control method: Select the “Uniform” option, which ensures access to all objects is governed solely by bucket-level IAM permissions

On the data protection settings page, keep the default data protection options:
Data protection

After clicking the “CREATE” button, the system will display a confirmation dialog reminding you about the public access settings for the bucket:

Confirm access

Click the “CONFIRM” button to confirm these settings and complete the bucket creation.
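
If you prefer the command line, you can create an equivalent bucket with gcloud instead of clicking through the console. The following is a sketch assuming the ollama_volume bucket name and us-central1 region used above; the hierarchical namespace flag may require a recent gcloud version:

# Create the bucket with uniform access, public access prevention,
# and hierarchical namespace enabled (mirrors the console steps above)
gcloud storage buckets create gs://ollama_volume \
  --location=us-central1 \
  --default-storage-class=STANDARD \
  --uniform-bucket-level-access \
  --public-access-prevention \
  --enable-hierarchical-namespace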

Preparing a Service Account with GCS Bucket Access

# Set environment variables
PROJECT_ID=$(gcloud config get-value project) # Your current project ID
SA_NAME="ollama-service-account" # Service account name
SA_DISPLAY_NAME="Ollama Service Account" # Service account display name

# Create service account
gcloud iam service-accounts create $SA_NAME \
  --display-name="$SA_DISPLAY_NAME" \
  --project=$PROJECT_ID

# Get the full service account email address
SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

# Grant the service account read/write access to objects in the project's GCS buckets
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:$SA_EMAIL" \
  --role="roles/storage.objectAdmin"
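
Before moving on, you can optionally confirm that the role binding was applied. A quick sketch using the variables defined above:

# Optional: list the roles granted to the service account
gcloud projects get-iam-policy $PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:${SA_EMAIL}" \
  --format="table(bindings.role)"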

Deploying Ollama on Google Cloud Run

After creating the GCS bucket and service account, we can start deploying the Ollama service on Google Cloud Run. First, navigate to the Cloud Run page in the Google Cloud Console:

Cloud Run page

Click the “DEPLOY CONTAINER” button at the top to begin the deployment process. On the deployment page:

Deployment page

  1. Select “Deploy one revision from an existing container image”
  2. Enter ollama/ollama:0.5.13 in the “Container image URL” field
  3. Enter ollama in the “Service name” field
  4. Choose us-central1 (Iowa) for “Region”, ensuring it matches your GCS bucket region
  5. Select “Require authentication” for “Authentication”, which will provide basic security protection

In the billing and traffic settings section, configure the following:

Billing and traffic settings

  1. Select “Instance-based” for “Billing”; with this mode you are billed for the full lifetime of each instance (CPU stays allocated even between requests), which is required when attaching a GPU
  2. Choose “Auto scaling” for “Service scaling”
  3. Set “Minimum number of instances” to 0, so you won’t incur costs when there are no requests
  4. Select “All” for “Ingress”, allowing direct access to the service from the internet

These settings allow your Ollama service to scale automatically based on demand while not incurring costs when there’s no traffic, achieving truly serverless deployment.

Next, you need to link the GCS bucket as a Cloud Run volume. Click the “VOLUMES” tab at the top:

Volume settings

On the volume configuration page:

  1. Click “New Volume” to expand volume creation options
  2. Select “Cloud Storage bucket” for “Volume type”
  3. Enter gcs-1 for “Volume name”
  4. Select the ollama_volume bucket you created earlier for “Bucket”
  5. Uncheck the “Read-only” option, as Ollama needs to write model files to the bucket

This step is very important as it mounts the GCS bucket to the Cloud Run service, allowing Ollama to persistently store downloaded model files, avoiding redownloading models each time instances restart.

Click “DONE” to complete the volume setup, then we need to mount this volume to the container.

After creating the volume, you need to mount it to the container. Click the “GO TO CONTAINER(S) TAB” button, then select the “VOLUME MOUNTS” tab:

Volume settings

On the volume mount configuration page:

Volume mount settings

  1. Select the gcs-1 volume you created earlier from the “Name” dropdown
  2. Enter /root/.ollama in the “Mount path” field (note the dot before “ollama”)
  3. This path is crucial as it’s the default location where Ollama stores model files

With this setup, when Ollama downloads models, the files will be saved in the GCS bucket rather than the container’s temporary storage. This ensures model files aren’t lost when instances restart or scale, and can be shared across multiple instances, greatly improving efficiency and startup speed.

Click “DONE” to complete the volume mount setup.

In the “SECURITY” tab, you need to configure the service account to ensure Ollama can access the GCS bucket:

Security settings

On the security settings page:

  1. Select “Ollama Service Account” from the “Service account” dropdown, which is the service account we created earlier
  2. This service account has been granted storage.objectAdmin permissions, allowing Ollama to read and write to the GCS bucket
  3. Keep the default “Google-managed encryption key” option for “Encryption”

Setting the correct service account is a key step in ensuring Ollama can properly access the GCS bucket. Without appropriate permissions, the Cloud Run service won’t be able to read or write model files in the bucket.

Next, you need to set environment variables for the Ollama container. Click the “VARIABLES & SECRETS” tab:

Environment variables

On the environment variables page:

  1. Enter OLLAMA_HOST in the “Name” field
  2. Enter 0.0.0.0:8080 in the “Value” field

This environment variable is crucial: it tells Ollama to listen on all interfaces on port 8080 instead of its default of 127.0.0.1:11434. Cloud Run routes incoming requests to the container port configured for the service (8080 by default), so Ollama must listen there.

If you need to add more environment variables, you can click the “ADD VARIABLE” button. Click “DONE” after setting up the environment variables to continue to the next configuration step.

Next, you need to set up appropriate computing resources for the Ollama container. Click the “SETTINGS” tab and configure the “Resources” section:

Resource settings

On the resource settings page:

  1. Select 16 GiB for “Memory”, the minimum memory Cloud Run requires for a GPU-enabled service and enough headroom for small to mid-size models
  2. Select 4 for “CPU”, which is the minimum CPU count that supports 16 GiB of memory
  3. Check the “GPU” option to enable GPU acceleration
  4. Select NVIDIA L4 for “GPU type”, which is the GPU type supported by Cloud Run
  5. Set “Number of GPUs” to 1

These resource settings are crucial for Ollama’s performance:

  • 16 GiB of memory and 4 CPUs are the basic requirements for running medium to large LLMs
  • NVIDIA L4 GPU significantly accelerates model inference, improving response times
  • Using a GPU requires at least 4 CPUs, which is a Google Cloud requirement

Click “DONE” after setting up the resources to continue to the final configuration step.

Finally, you need to configure request handling parameters. In the request settings section:

Request settings

On the request settings page:

  1. Set “Request timeout” to 300 seconds (5 minutes), which is a reasonable timeout for processing LLM inference requests
  2. Keep “Maximum concurrent requests per instance” at 80, which is the maximum number of requests each instance can handle simultaneously
  3. Keep “Minimum number of instances” at 0, so you won’t incur costs when there’s no traffic
  4. Set “Maximum number of instances” to 3, which limits the upper bound of auto-scaling, preventing excessive costs due to traffic spikes

Limiting the maximum number of instances to a low value (like 3 or 4) is an important safety measure. If your service is attacked or experiences abnormal traffic increases, this prevents the system from automatically scaling to many instances, avoiding unexpected high costs.

After completing all the settings, click the “CREATE” button to create the service. The system will begin deploying the Ollama service, which may take a few minutes.
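
If you would rather script the deployment than click through the console, the configuration above maps roughly to a single gcloud run deploy command. The sketch below assumes the bucket, service account, image tag, and resource choices from the previous sections; the GPU flags may require the beta gcloud component, and exact flag names can vary between gcloud versions:

# Sketch of the console configuration above as a single deploy command
gcloud run deploy ollama \
  --image=ollama/ollama:0.5.13 \
  --region=us-central1 \
  --no-allow-unauthenticated \
  --service-account="$SA_EMAIL" \
  --cpu=4 --memory=16Gi \
  --gpu=1 --gpu-type=nvidia-l4 \
  --no-cpu-throttling \
  --min-instances=0 --max-instances=3 \
  --concurrency=80 --timeout=300 \
  --set-env-vars=OLLAMA_HOST=0.0.0.0:8080 \
  --add-volume=name=gcs-1,type=cloud-storage,bucket=ollama_volume \
  --add-volume-mount=volume=gcs-1,mount-path=/root/.ollama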

Once the deployment is complete, you’ll see a screen similar to this:

Deployment Complete

The deployment success page shows your Ollama service is now up and running. Key information displayed includes:

  • Service URL: The unique URL for your Ollama service (e.g., https://ollama-xxxxxxxxxxxx.us-central1.run.app)
  • Region: Confirms the service is deployed in us-central1
  • CPU and Memory: Shows the allocated resources (4 CPUs, 16 GiB memory)
  • GPU: Indicates 1 NVIDIA L4 GPU is attached
  • Revision status: “Serving traffic” means the service is active and ready to process requests
  • Authentication: “Require authentication” indicates that requests must include authentication

You can now use this service to pull models and make inference requests. The service URL is what you’ll use in your API calls, and you’ll need to include authentication as shown in the testing examples below.
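
If you need to look up the service URL later, it can also be retrieved with gcloud. A quick sketch assuming the service name and region used above:

# Print the deployed service URL
gcloud run services describe ollama \
  --region=us-central1 \
  --format='value(status.url)'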

Testing the Ollama Service

Downloading a model:

curl --location 'https://ollama-xxxxxxxxxxxx.us-central1.run.app/api/pull' \
  --header "Authorization: Bearer $(gcloud auth print-identity-token)" \
  --data '{
    "model": "deepseek-r1:1.5b"
  }'

Output:

{"status":"pulling manifest"}
{"status":"pulling aabd4debf0c8","digest":"sha256:aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc","total":1117320512}
{"status":"pulling aabd4debf0c8","digest":"sha256:aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc","total":1117320512}
{"status":"pulling aabd4debf0c8","digest":"sha256:aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc","total":1117320512,"completed":2346405}
...
...
...
{"status":"verifying sha256 digest"}
{"status":"writing manifest"}
{"status":"success"}

Next, we can test a conversation:

curl --location 'https://ollama-xxxxxxxxxxxx.us-central1.run.app/api/chat' \
  --header "Authorization: Bearer $(gcloud auth print-identity-token)" \
  --data '{
    "model": "deepseek-r1:1.5b",
    "stream": false,
    "messages": [
      { "role": "user", "content": "why is the sky blue" }
    ]
  }'

Output:

{"model":"deepseek-r1:1.5b","created_at":"2025-03-06T16:37:18.674426239Z","message":{"role":"assistant","content":"\u003cthink\u003e\n\n\u003c/think\u003e\n\nThe blue color of the sky is called **blue coloring** or **violet sky**, and it has to do with a phenomenon called **blue shift**. Here's how it works:\n\n1. **Infrared Radiation**: The Earth absorbs most of the visible light that enters our atmosphere, particularly in the red and orange wavelengths. However, it reflects some infrared radiation, which is absorbed by the atmosphere.\n\n2. **Blue Shift**: As sunlight travels through the Earth's atmosphere, the blue-inflated infrared radiation from the sun is quickly scattered away. But the longer wavelength (orange and red) light gets \"blown away\" by the atmosphere, while the shorter blue light accumulates in the sky. This causes the blue light to appear blue-violet when it reaches Earth.\n\n3. **Scattering**: The atmosphere also scatters visible light off of particles, such as nitrogen molecules and water vapor, creating the cloudy appearance. However, this scattered blue light dominates the sky's color due to the blue shift.\n\nThis phenomenon is more pronounced during clear days with less pollution, but even in polluted skies, you can still see a blue sky because some light remains visible after passing through the atmosphere."},"done_reason":"stop","done":true,"total_duration":25267446418,"load_duration":21595303847,"prompt_eval_count":8,"prompt_eval_duration":1311000000,"eval_count":245,"eval_duration":2359000000}

Conclusion

Deploying Ollama on Google Cloud Run provides a powerful, flexible, and cost-effective solution for running large language models. By following this guide, you’ve created a serverless AI inference service that:

  • Leverages GPU acceleration for much faster model inference compared to CPU-only solutions
  • Scales automatically based on your traffic needs, from zero to multiple instances
  • Maintains efficiency by using GCS buckets to avoid redundant model downloads
  • Optimizes costs by only running when needed and controlling maximum instance counts
  • Provides robust security through service account permissions and authentication