This section explains the steps to add Google Vertex AI models and configure the required access controls.
1. Navigate to Google Vertex Models in AI Gateway
From the TrueFoundry dashboard, navigate to AI Gateway > Models and select Google Vertex.
2. Add Google Vertex Account and Authentication
Give your Google Vertex account a unique name; this name is used to refer to its models later. Add collaborators to grant other users or teams access to the account. Learn more about access control here.
Get Google Vertex Authentication Details
Required IAM Role
The Google Cloud identity used by the gateway (a service account, whether referenced by a key, by GKE Workload Identity, or by Workload Identity Federation) must have the Agent Platform User role (roles/aiplatform.user, formerly Vertex AI User), which includes the aiplatform.endpoints.predict permission required by the gateway.
The gateway supports three authentication methods. Pick the one that matches your deployment.
1. Using Service Account JSON Key
This method works for all deployment types (GKE, EKS, AKS, on-premises, or the SaaS Gateway).
Generate a Service Account JSON key by following the official Google Cloud documentation here.
The service account must have the Agent Platform User role (formerly Vertex AI User).
When adding the provider account in TrueFoundry, select Service account key file as the authentication type and paste the JSON key into the Service account key JSON field (or store it as a secret and reference it).
2. Using Workload Identity Federation (Keyless, Cross-Cloud)
Workload Identity Federation (WIF) lets the gateway authenticate to Google Cloud without service account keys, even when running outside of GKE — for example, on Amazon EKS, Azure AKS, or on-premises Kubernetes clusters. It works by exchanging a short-lived Kubernetes service account token for a Google Cloud access token through Google’s Security Token Service.
Workload Identity Federation is the recommended approach for production deployments running outside of GKE. It eliminates long-lived service account keys while supporting any Kubernetes environment, and it also works on the SaaS version of the AI Gateway.
A Google Cloud IAM service account that the federated identity can impersonate, with the Agent Platform User role (roles/aiplatform.user, formerly Vertex AI User) granted on the project.
The Kubernetes service account used by the gateway must have permission to issue TokenRequest resources for itself. The TrueFoundry-provided Helm chart configures this RBAC automatically.
Generate the credential configuration JSON
Use the gcloud CLI (gcloud iam workload-identity-pools create-cred-config) to generate the credential configuration file, for example credential-config.json.
This produces a JSON file with "type": "external_account" describing the identity pool, audience, and STS token-exchange endpoints. It is not a private key.
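Before pasting the file into TrueFoundry, it can help to confirm that it really is a keyless configuration. The check below is a minimal sketch: the function name check_wif_credential_config and the sample values are illustrative assumptions, and the real file contains more fields than shown here.

```python
import json


def check_wif_credential_config(raw: str) -> dict:
    """Parse a credential configuration and confirm it is a keyless WIF config."""
    config = json.loads(raw)
    if config.get("type") != "external_account":
        raise ValueError(f"expected type 'external_account', got {config.get('type')!r}")
    # A service account key file carries a private_key; a WIF config must not.
    if "private_key" in config:
        raise ValueError("this looks like a service account key, not a WIF config")
    return config


# Hypothetical, trimmed-down config used only to exercise the check.
sample = json.dumps({
    "type": "external_account",
    "audience": "//iam.googleapis.com/projects/123/locations/global/"
                "workloadIdentityPools/my-pool/providers/my-provider",
    "token_url": "https://sts.googleapis.com/v1/token",
})

config = check_wif_credential_config(sample)
print("credential type:", config["type"])
```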
Configure in TrueFoundry
When adding or editing the Vertex AI provider account:
Select Workload Identity Federation file as the authentication type.
Paste the contents of the generated credential-config.json into the Key file content field, or store it as a secret and reference it.
Resumable file uploads (used for some batch and fine-tuning workflows that upload files to Google Cloud Storage via signed URLs) are not yet supported with Workload Identity Federation. If you rely on those flows, use a Service Account JSON key instead.
3. Using GCP Workload Identity on GKE (Self-Hosted Gateway only)
When running the gateway inside Google Kubernetes Engine (GKE), you can rely on GKE’s built-in Workload Identity, which lets a Kubernetes service account (KSA) act as a Google Cloud IAM service account (GSA) automatically through the GKE metadata server.
GKE Workload Identity is GKE-specific. Pods using the configured KSA authenticate as the associated GSA when accessing Google Cloud APIs, with no extra configuration on the gateway side.
To set up GKE Workload Identity, follow the official Google Cloud documentation: Configure Workload Identity on GKE.
When adding the Vertex AI provider account in TrueFoundry, leave the authentication section empty — the gateway will automatically pick up GKE Workload Identity credentials via Application Default Credentials (ADC).
GCP Workload Identity (GKE ADC) does not work on the SaaS version of the Gateway, and it only works when the gateway runs inside a GKE cluster. For all other environments, use Workload Identity Federation or a Service Account JSON key.
3. Configure Project ID and Region
Provide your Google Cloud Project ID and a default Region for all models under this account. You can override the region for individual models later.
Project ID
You can find your Project ID in the top-right corner of your Google Cloud Console.
Region
Specify a default region for all models under this account. You can override this region for individual models later.
4. Add Models
You can either select available models from the list or add them manually by clicking + Add Model. When adding a model manually, the Model ID format depends on the provider.
Adding Google (Gemini) Models
Select a Gemini model from the list or add it manually.
Model ID Format: google/<vertex-model-id>
Example: google/gemini-1.5-pro
You can find the Model ID in the Google Cloud Console.
Adding Anthropic Models
Select a Claude model from the list or add it manually.
Model ID Format: anthropic/<vertex-model-id>
Example: anthropic/claude-3-5-sonnet-v2@20241022
Adding Mistral AI Models
Select a Mistral model from the list or add it manually.
Model ID Format: mistralai/<vertex-model-id>
Example: mistralai/mistral-large-2411@001
When adding any model manually, you can specify a Region to override the default one set at the account level.
Once your Vertex provider account is configured, the following API surfaces are available through the gateway. The table below summarizes each endpoint alongside platform feature support (tracing, cost tracking).
Vertex’s chat completions endpoint is the most widely used: it supports streaming, tools, multimodal input (images, audio, video, PDF), structured JSON outputs, and extended thinking. The gateway translates OpenAI-compatible requests into Vertex’s native generateContent API based on the model family.
Full provider capability matrix: Chat Completions API.
Python
from openai import OpenAI

client = OpenAI(
    api_key="your-truefoundry-api-key",
    base_url="{GATEWAY_BASE_URL}",
)

response = client.chat.completions.create(
    model="vertex-main/gemini-2.5-flash",
    messages=[
        {"role": "user", "content": "What is TrueFoundry in one line?"},
    ],
)
print(response.choices[0].message.content)
Streaming
Set stream=True and iterate over delta chunks. Defensively check that chunk.choices is non-empty and delta.content is not None.
Python
stream = client.chat.completions.create(
    model="vertex-main/gemini-2.5-flash",
    messages=[{"role": "user", "content": "Count from 1 to 5, one number per line."}],
    stream=True,
)
for chunk in stream:
    if (
        chunk.choices
        and len(chunk.choices) > 0
        and chunk.choices[0].delta.content is not None
    ):
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
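When the full text is needed rather than incremental printing, the same defensive checks can be folded into a small accumulator. This is a sketch only: collect_stream_text and fake_chunk are hypothetical helpers, and the fake chunks merely imitate the attribute layout of the SDK's streaming chunks so the logic can be exercised without a gateway call.

```python
from types import SimpleNamespace


def collect_stream_text(stream) -> str:
    """Join delta.content from every chunk, skipping empty or content-less chunks."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices at all
        content = chunk.choices[0].delta.content
        if content is not None:
            parts.append(content)
    return "".join(parts)


def fake_chunk(content=None, empty=False):
    """Build a stand-in object shaped like a streaming chunk (illustrative only)."""
    if empty:
        return SimpleNamespace(choices=[])
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [fake_chunk("1\n"), fake_chunk(empty=True), fake_chunk(None), fake_chunk("2\n")]
text = collect_stream_text(fake_stream)
print(text, end="")
```

With a real stream, the call would be collect_stream_text(stream) in place of the for loop.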
Function calling / tools
Advertise a tool, hand the model’s tool_calls back as a tool role message, then request the final response.
Python
import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Bengaluru right now?"}]

first = client.chat.completions.create(
    model="vertex-main/gemini-2.5-flash",
    messages=messages,
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "get_weather"}},
)
assistant_msg = first.choices[0].message
tool_calls = assistant_msg.tool_calls or []

if tool_calls:
    tool_call = tool_calls[0]
    messages.append(assistant_msg)
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps({"city": "Bengaluru", "temp_c": 28, "summary": "partly cloudy"}),
    })
    second = client.chat.completions.create(
        model="vertex-main/gemini-2.5-flash",
        messages=messages,
    )
    print(second.choices[0].message.content)
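The example above hard-codes the tool result. In practice the arguments the model supplies would usually be parsed and routed to a local function. The sketch below shows one possible dispatch pattern; TOOL_REGISTRY, run_tool_calls, and the stand-in get_weather implementation are hypothetical names, and the fake tool call only mimics the id / function.name / function.arguments layout of the SDK objects.

```python
import json
from types import SimpleNamespace


def get_weather(city: str) -> dict:
    # Stand-in implementation; a real one would query a weather service.
    return {"city": city, "temp_c": 28, "summary": "partly cloudy"}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls) -> list:
    """Execute each requested tool and build the matching tool-role messages."""
    tool_messages = []
    for call in tool_calls:
        handler = TOOL_REGISTRY.get(call.function.name)
        if handler is None:
            result = {"error": f"unknown tool {call.function.name}"}
        else:
            # Arguments arrive as a JSON string and must be decoded first.
            args = json.loads(call.function.arguments or "{}")
            result = handler(**args)
        tool_messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
    return tool_messages


fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"city": "Bengaluru"}'),
)
tool_messages = run_tool_calls([fake_call])
print(tool_messages[0]["content"])
```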
Vision (multimodal images)
Gemini models support image inputs via the image_url content part. The detail parameter (low / high / auto) translates to Vertex’s native mediaResolution setting.
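As a sketch of what such a content part might look like, the helper below wraps raw image bytes in a base64 data URI and attaches the detail setting. The helper name image_part and the placeholder bytes are illustrative assumptions; a real request would read an actual image file.

```python
import base64


def image_part(image_bytes: bytes, mime_type: str = "image/png", detail: str = "auto") -> dict:
    """Build an OpenAI-style image_url content part from raw image bytes."""
    if detail not in {"low", "high", "auto"}:
        raise ValueError("detail must be one of: low, high, auto")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}", "detail": detail},
    }


# Placeholder bytes only; replace with open("photo.png", "rb").read() in practice.
part = image_part(b"not-a-real-image", detail="low")

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image in one sentence."},
        part,
    ],
}]
print(part["image_url"]["url"][:22])
```

The resulting messages list can be passed to client.chat.completions.create in the same way as the text-only examples.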
Audio input (Gemini-specific)
Gemini models accept audio files inline via image_url content parts with a mime_type hint. This is unique to Gemini: Bedrock and direct OpenAI chat models do not accept audio as chat input.
Python
import base64, wave, struct, math
from io import BytesIO

# Generate a 1-second 440 Hz sine wave PCM WAV in-memory
buf = BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    for i in range(16000):
        sample = int(32767 * 0.3 * math.sin(2 * math.pi * 440 * i / 16000))
        w.writeframes(struct.pack("<h", sample))

audio_uri = f"data:audio/wav;base64,{base64.b64encode(buf.getvalue()).decode('ascii')}"

response = client.chat.completions.create(
    model="vertex-main/gemini-2.5-flash",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this audio in one sentence."},
            {"type": "image_url", "image_url": {"url": audio_uri, "mime_type": "audio/wav"}},
        ],
    }],
)
print(response.choices[0].message.content)
Video input (Gemini-specific)
Gemini models accept video files via image_url content parts with mime_type: video/mp4. The gateway fetches the URL server-side, so you can pass any publicly reachable MP4. This is unique to Gemini.
Python
SAMPLE_VIDEO_URL = "https://www.youtube.com/watch?v=8FsHo7xoTr4"

response = client.chat.completions.create(
    model="vertex-main/gemini-2.5-flash",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what happens in this video in one sentence."},
            {"type": "image_url", "image_url": {"url": SAMPLE_VIDEO_URL, "mime_type": "video/mp4"}},
        ],
    }],
)
print(response.choices[0].message.content)
PDF document input
Gemini models accept PDFs via the file content type with base64 encoding.
Python
import base64

with open("sample.pdf", "rb") as f:
    pdf_b64 = base64.b64encode(f.read()).decode("ascii")

response = client.chat.completions.create(
    model="vertex-main/gemini-2.5-flash",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What text is in this PDF?"},
            {
                "type": "file",
                "file": {
                    "filename": "sample.pdf",
                    "file_data": f"data:application/pdf;base64,{pdf_b64}",
                },
            },
        ],
    }],
)
print(response.choices[0].message.content)
Structured outputs (JSON schema)
Gemini supports two structured-output modes via response_format:
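As one illustration, and assuming the OpenAI-style json_schema type is among the response_format values the gateway accepts, a request could carry a schema and the reply could be validated on the client side. The helper names json_schema_format and parse_structured, and the city_info schema, are hypothetical.

```python
import json


def json_schema_format(name: str, schema: dict) -> dict:
    """Build an OpenAI-style response_format value that carries a JSON schema."""
    return {"type": "json_schema", "json_schema": {"name": name, "schema": schema}}


def parse_structured(content: str, required: list) -> dict:
    """Decode the model's JSON reply and verify the required keys are present."""
    data = json.loads(content)
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"reply is missing keys: {missing}")
    return data


city_schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "country": {"type": "string"}},
    "required": ["city", "country"],
}
response_format = json_schema_format("city_info", city_schema)

# A reply string of the kind the model would be expected to return.
parsed = parse_structured('{"city": "Paris", "country": "France"}', city_schema["required"])
print(parsed["city"])
```

The response_format dict would be passed as response_format=response_format to client.chat.completions.create, and message.content handed to parse_structured.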
Extended thinking
Gemini 2.5 Pro and 2.5 Flash support extended thinking, which is on by default. Use reasoning_effort (low/medium/high); the gateway translates it to Vertex’s native thinking-budget parameter.
Gemini 3+ models additionally return thinking_blocks with signatures for multi-turn continuity.
Python
response = client.chat.completions.create(
    model="vertex-main/gemini-2.5-pro",
    messages=[{
        "role": "user",
        "content": "A bat and a ball cost $1.10. The bat costs $1.00 more than the ball. How much is the ball?",
    }],
    reasoning_effort="high",
    max_tokens=8000,
)
msg = response.choices[0].message
print("answer:", msg.content)
print("reasoning:", getattr(msg, "reasoning_content", None))
for block in getattr(msg, "thinking_blocks", []) or []:
    print(" block:", block.get("type"), "signature:", block.get("signature", "")[:30])
For Gemini 3+, always echo thinking_blocks exactly as returned when continuing a conversation. Blocks with missing or modified signature fields are rejected by Vertex.
Text-to-Speech
Vertex’s Gemini TTS models (gemini-2.5-flash-tts, gemini-2.5-flash-preview-tts, gemini-2.5-pro-preview-tts) generate audio from text. The gateway exposes them via the OpenAI-compatible /audio/speech endpoint.
Full docs: Text-to-Speech.
Model gating: Before you call the Text-to-Speech API, add a TTS-preview model to your Vertex provider account. When adding a model, select Text to Speech as the model type.
Python
response = client.audio.speech.create(
    model="vertex-main/gemini-2.5-flash-tts",
    voice="alloy",
    input="Hello from TrueFoundry. The AI gateway makes multi-provider routing simple.",
)
with open("tts.wav", "wb") as f:
    f.write(response.read())
Embeddings
Vertex exposes two embedding families through /embeddings:
Text embeddings (text-embedding-004, text-embedding-005) — accept a task_type parameter via extra_body that tunes the vector for the downstream task (RETRIEVAL_DOCUMENT, RETRIEVAL_QUERY, SEMANTIC_SIMILARITY, CLASSIFICATION, CLUSTERING).
Multimodal embeddings (multimodalembedding@001) — accept text, image, and/or video in the same request and return separate vectors per modality.
# Text embeddings with task_type
doc = client.embeddings.create(
    model="vertex-main/text-embedding-005",
    input=["TrueFoundry is an AI gateway that unifies access to multiple LLM providers."],
    extra_body={"task_type": "RETRIEVAL_DOCUMENT"},
)
query = client.embeddings.create(
    model="vertex-main/text-embedding-005",
    input=["What does TrueFoundry do?"],
    extra_body={"task_type": "RETRIEVAL_QUERY"},
)
print("dim:", len(doc.data[0].embedding))
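Document and query vectors are typically compared with cosine similarity. The sketch below uses only the standard library; the short vectors are made-up stand-ins for doc.data[0].embedding and query.data[0].embedding, and cosine_similarity is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings.
doc_vec = [0.1, 0.3, 0.5]
query_vec = [0.2, 0.6, 1.0]
score = cosine_similarity(doc_vec, query_vec)
print(f"similarity: {score:.3f}")
```

For retrieval, the document with the highest score against the query vector would be ranked first.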
Multimodal embeddings (image + text)
multimodalembedding@001 returns separate vectors per modality under embedding (text) and image_embedding. Useful for cross-modal retrieval — e.g. find the image whose embedding is closest to a text query.
Python
import base64
from io import BytesIO
from PIL import Image as PILImage

img = PILImage.new("RGB", (128, 128), (220, 20, 60))
buf = BytesIO()
img.save(buf, format="PNG")
img_b64 = base64.b64encode(buf.getvalue()).decode("ascii")

# encoding_format="float" — the gateway defaults image_embedding to
# a base64 string; set explicitly to get a list of floats.
response = client.embeddings.create(
    model="vertex-main/multimodalembedding@001",
    input=[{
        "text": "A red square with a gold circle",
        "image": {"base64": img_b64},
    }],
    encoding_format="float",
)
data = response.data[0]
print("text dim :", len(data.embedding))
print("image dim:", len(data.image_embedding))
Supported embedding models on Vertex include text-embedding-004, text-embedding-005, and multimodalembedding@001. Add them to your provider account from the model picker.
Image Generation
Vertex exposes Imagen and Gemini image-generation models via the OpenAI-compatible /images/generations endpoint.
Full docs: Image Generation.
Python
import base64

response = client.images.generate(
    model="vertex-main/imagen-4.0-generate-001",
    prompt="A minimalist isometric illustration of a cloud with a lightning bolt, flat colors.",
    size="1024x1024",
    n=1,
)
item = response.data[0]
if getattr(item, "b64_json", None):
    image_bytes = base64.b64decode(item.b64_json)
else:
    import requests
    image_bytes = requests.get(item.url, timeout=60).content

with open("generated.png", "wb") as f:
    f.write(image_bytes)
Supported text-to-image models include imagen-4.0-generate-001, imagen-3.0-generate-002, and Gemini image models such as gemini-3-pro-image-preview.
Pricing varies by model family. Imagen models are billed per image at a flat rate. Gemini image models are billed per token, where higher-resolution / HD outputs consume more tokens. Pick the family that matches your cost profile.
Image Edit
Vertex’s image edit only supports inpainting with a mask — unlike OpenAI, the mask is required. Use imagen-3.0-capability-001 (Imagen’s edit-specific variant).
Python
import base64
from PIL import Image as PILImage, ImageDraw

# Build a binary mask: white where we want the model to paint, black elsewhere
mask_img = PILImage.new("L", (1024, 1024), 0)
mask_draw = ImageDraw.Draw(mask_img)
mask_draw.rectangle((700, 0, 1024, 300), fill=255)
mask_img.save("mask.png", format="PNG")

with open("generated.png", "rb") as img_f, open("mask.png", "rb") as mask_f:
    response = client.images.edit(
        model="vertex-main/imagen-3.0-capability-001",
        image=img_f,
        mask=mask_f,
        prompt="Paint a bright yellow sun in the top-right corner.",
        n=1,
    )

item = response.data[0]
if getattr(item, "b64_json", None):
    image_bytes = base64.b64decode(item.b64_json)
else:
    import requests
    image_bytes = requests.get(item.url, timeout=60).content

with open("edited.png", "wb") as f:
    f.write(image_bytes)
Image Variation (client.images.create_variation) is not supported — Vertex Imagen only supports generation and inpainting.
Batch API
Vertex batch jobs are GCS-backed — the gateway uploads JSONL to a Cloud Storage bucket on your provider account, creates a Vertex batch prediction job, and fetches results from GCS.
Full docs: Batch Predictions.
Vertex batch prerequisites:
GCS bucket — must be in the same region as your Vertex model
Service account / federated identity — with roles/storage.objectAdmin on the bucket
Workload Identity Federation caveat — does not yet support resumable uploads. Use a Service Account JSON key for batch.
Client setup with batch-specific headers. The bucket and region headers tell the gateway where to stage the JSONL on GCS, and x-tfy-provider-model is the bare Vertex model id (no provider prefix).
Python
from openai import OpenAI

batch_client = OpenAI(
    api_key="your-truefoundry-api-key",
    base_url="{GATEWAY_BASE_URL}",
    default_headers={
        "x-tfy-provider-name": "vertex-main",
        "x-tfy-vertex-storage-bucket-name": "your-gcs-bucket-name",
        "x-tfy-vertex-region": "us-central1",
        "x-tfy-provider-model": "gemini-2.5-flash",  # bare Vertex id
    },
)
Build and upload the input JSONL.
Python
import json

LANGUAGES = ["French", "Japanese", "Hindi", "Spanish", "German"]

batch_requests = [
    {
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gemini-2.5-flash",  # bare Vertex id inside the body too
            "messages": [{"role": "user", "content": f"Say hello in {lang}."}],
            "max_tokens": 50,
        },
    }
    for i, lang in enumerate(LANGUAGES)
]

with open("batch_input.jsonl", "w") as f:
    for req in batch_requests:
        f.write(json.dumps(req) + "\n")

with open("batch_input.jsonl", "rb") as f:
    uploaded = batch_client.files.create(file=f, purpose="batch")

print(uploaded.id)  # Example: gs%3A%2F%2Fyour-bucket%2Fuuid.jsonl (URL-encoded)
2. Create Batch Job
Vertex doesn’t enforce a strict per-batch minimum like Bedrock — you can submit a small batch for testing.
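A minimal sketch of the creation call, assuming the gateway follows the OpenAI SDK's batches.create signature (input_file_id, endpoint, completion_window). The wrapper create_chat_batch and the stub client are hypothetical and exist only so the call shape can be checked offline; whether the file id should be passed URL-encoded or decoded at this step is not stated here, so the sketch passes it through unchanged.

```python
from types import SimpleNamespace


def create_chat_batch(client, input_file_id: str, completion_window: str = "24h"):
    """Submit a chat-completions batch job for a previously uploaded JSONL file."""
    return client.batches.create(
        input_file_id=input_file_id,
        endpoint="/v1/chat/completions",
        completion_window=completion_window,
    )


class _StubBatches:
    """Offline stand-in that echoes the arguments it was called with."""

    def create(self, **kwargs):
        return SimpleNamespace(id="stub-batch", status="validating", **kwargs)


stub_client = SimpleNamespace(batches=_StubBatches())
batch = create_chat_batch(stub_client, "gs%3A%2F%2Fyour-bucket%2Fuuid.jsonl")
print(batch.id, batch.status)
```

With the real client, the call would be batch = create_chat_batch(batch_client, uploaded.id).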
Poll batches.retrieve() until the batch reaches a terminal status. batch.id may be URL-encoded; call unquote() on it once before retrieve so the OpenAI SDK doesn’t double-encode the path.
Python
import time
from urllib.parse import unquote

TERMINAL = {"completed", "failed", "expired", "cancelled"}
TIMEOUT_SECONDS = 30 * 60
POLL_INTERVAL = 15

# `batch` is the object returned when the batch job was created in the previous step.
batch_id = unquote(batch.id)
start = time.monotonic()
while batch.status not in TERMINAL:
    if time.monotonic() - start > TIMEOUT_SECONDS:
        print(f"timed out after {TIMEOUT_SECONDS}s — rerun this cell to keep polling")
        break
    time.sleep(POLL_INTERVAL)
    batch = batch_client.batches.retrieve(batch_id)
    print("status:", batch.status)

print("final:", batch.status, "output_file_id:", batch.output_file_id)
4. Fetch Results
Vertex returns the payload as a single-line JSON array (not JSONL) — parse it once, then iterate.
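The parse-once-then-iterate step might look like the sketch below. The record layout (a custom_id key plus a response object) is an assumption made for illustration, as is the helper name index_batch_results; the raw text would come from reading the batch's output file.

```python
import json


def index_batch_results(raw: str) -> dict:
    """Parse a single-line JSON array of results and index the records by custom_id."""
    records = json.loads(raw)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of result records")
    return {record.get("custom_id"): record for record in records}


# Hypothetical payload shaped like a JSON array rather than JSONL.
sample_payload = (
    '[{"custom_id": "req-0", "response": {"body": {}}},'
    ' {"custom_id": "req-1", "response": {"body": {}}}]'
)

results = index_batch_results(sample_payload)
for custom_id in sorted(results):
    print(custom_id, "->", list(results[custom_id].keys()))
```

Indexing by custom_id makes it straightforward to match each result back to the request that produced it, since result order is not guaranteed to follow input order.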
Vertex’s Files API stores uploads in Google Cloud Storage on your behalf. Upload, retrieve metadata, and retrieve content are supported. List and delete are NOT supported — the GCS backend doesn’t expose those operations.
Full docs: Files API.
Python
import json
from urllib.parse import unquote
from openai import OpenAI

files_client = OpenAI(
    api_key="your-truefoundry-api-key",
    base_url="{GATEWAY_BASE_URL}",
    default_headers={
        "x-tfy-provider-name": "vertex-main",
        "x-tfy-vertex-storage-bucket-name": "your-gcs-bucket-name",
        "x-tfy-vertex-region": "us-central1",
        "x-tfy-provider-model": "gemini-2.5-flash",
    },
)

# Vertex Files API only accepts purpose="batch" (or "fine-tune" with the
# x-tfy-file-purpose header) and validates content as batch-style JSONL.
with open("files_api_test.jsonl", "w") as f:
    f.write(json.dumps({
        "custom_id": "demo",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gemini-2.5-flash",
            "messages": [{"role": "user", "content": "hi"}],
            "max_tokens": 5,
        },
    }) + "\n")

with open("files_api_test.jsonl", "rb") as f:
    uploaded = files_client.files.create(file=f, purpose="batch")
print(uploaded.id)  # gs%3A%2F%2Fbucket%2Fuuid.jsonl

# CRITICAL: unquote the id before passing it to retrieve / content.
file_id = unquote(uploaded.id)
meta = files_client.files.retrieve(file_id)
content = files_client.files.content(file_id).read()
print(f"{len(content)} bytes")
files.list() and files.delete() are not supported by Vertex because the GCS backend doesn’t expose them. Plan lifecycle management via GCS bucket policies and lifecycle rules instead of through the gateway.
Vertex’s Files API only accepts purpose="batch" for batch uploads or purpose="fine-tune" (with the x-tfy-file-purpose: fine-tune header) for tuning uploads. Plain text or non-conforming JSONL will fail validation.
Fine-tuning
Vertex supports supervised fine-tuning of Gemini models. The lifecycle:
Prepare JSONL training data (one example per line)
Upload via the Files API with purpose="fine-tune" and the x-tfy-file-purpose: fine-tune header
Submit a fine-tune job; poll for completion
Use the resulting fine-tuned model id in subsequent inference calls
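The first two steps of that lifecycle could be sketched as follows. The record layout (OpenAI-style messages) is an assumption, since the exact training format the gateway expects is not specified here; write_training_jsonl and upload_training_file are hypothetical helpers. The upload uses the OpenAI SDK's files.create with the per-request extra_headers option to send the x-tfy-file-purpose header.

```python
import json
import os
import tempfile


def write_training_jsonl(examples, path: str) -> int:
    """Write one JSON object per line and return the number of examples written."""
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example) + "\n")
    return len(examples)


def upload_training_file(client, path: str):
    """Upload the JSONL with purpose='fine-tune' plus the gateway's purpose header."""
    with open(path, "rb") as f:
        return client.files.create(
            file=f,
            purpose="fine-tune",
            extra_headers={"x-tfy-file-purpose": "fine-tune"},
        )


# Assumed record shape, for illustration only.
examples = [
    {"messages": [{"role": "user", "content": "Say hello in French."},
                  {"role": "assistant", "content": "Bonjour"}]},
    {"messages": [{"role": "user", "content": "Say hello in Spanish."},
                  {"role": "assistant", "content": "Hola"}]},
]

train_path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
written = write_training_jsonl(examples, train_path)
print(f"wrote {written} examples to {train_path}")
```

With a configured client, upload_training_file(files_client, train_path) would return the file object whose id is then used when submitting the fine-tune job.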
Do I need to add multiple provider accounts for different regions?
No. You can set a default region at the account level and override it for each individual model if needed. This allows you to use models from different regions with a single provider account.
Which authentication method should I choose?
Service Account JSON Key — Works everywhere (any cloud, on-prem, SaaS Gateway). Simplest to set up, but requires you to manage and rotate a long-lived secret.
Workload Identity Federation — Recommended for production. Keyless, works on any Kubernetes cluster (EKS, AKS, GKE, on-prem) and on the SaaS Gateway. Requires a one-time setup of a Workload Identity Pool in Google Cloud.
GCP Workload Identity (GKE) — Only available when the self-hosted gateway runs inside a GKE cluster. Keyless and zero-config on the gateway side, but does not work on the SaaS Gateway or outside of GKE.
| | Service Account Key | Workload Identity Federation | GCP Workload Identity (GKE) |
| --- | --- | --- | --- |
| Works on GKE | Yes | Yes | Yes |
| Works on EKS / AKS / on-prem | Yes | Yes | No |
| Works on SaaS Gateway | Yes | Yes | No |
| Key management required | Yes | No | No |
| Requires credential JSON in TrueFoundry | Yes (service account key) | Yes (external_account config) | No (leave empty) |
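The comparison can also be expressed as a small lookup, which may be handy in deployment scripts that need to pick a method per environment. This is a sketch that simply encodes the rows above; the environment keys (gke, eks_aks_onprem, saas) and the function name supported_auth_methods are hypothetical.

```python
# Support matrix transcribed from the comparison above.
AUTH_SUPPORT = {
    "Service Account Key": {"gke": True, "eks_aks_onprem": True, "saas": True},
    "Workload Identity Federation": {"gke": True, "eks_aks_onprem": True, "saas": True},
    "GCP Workload Identity (GKE)": {"gke": True, "eks_aks_onprem": False, "saas": False},
}

# Methods that avoid long-lived service account keys.
KEYLESS = {"Workload Identity Federation", "GCP Workload Identity (GKE)"}


def supported_auth_methods(environment: str, keyless_only: bool = False) -> list:
    """Return the auth methods usable in the given environment, in table order."""
    if environment not in {"gke", "eks_aks_onprem", "saas"}:
        raise ValueError(f"unknown environment: {environment}")
    methods = [name for name, support in AUTH_SUPPORT.items() if support[environment]]
    if keyless_only:
        methods = [name for name in methods if name in KEYLESS]
    return methods


print(supported_auth_methods("saas"))
print(supported_auth_methods("eks_aks_onprem", keyless_only=True))
```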
What is the difference between GCP Workload Identity and Workload Identity Federation?
Both are keyless authentication mechanisms, but they target different environments.
GCP Workload Identity is a GKE-only feature. The GKE metadata server automatically maps a Kubernetes service account to a Google Cloud IAM service account. The gateway picks this up through Application Default Credentials (ADC) when no auth data is configured. It does not work on the SaaS Gateway or outside of GKE.
Workload Identity Federation is a broader Google Cloud feature that works across any Kubernetes cluster (EKS, AKS, on-prem, and GKE) and on the SaaS Gateway. It requires you to provide an external_account credential configuration JSON (generated via gcloud iam workload-identity-pools create-cred-config). The gateway exchanges a short-lived Kubernetes service account token for a Google Cloud access token through Google’s Security Token Service.
How do I set up Workload Identity Federation for an EKS cluster? (Step-by-step example)
This example walks through the full setup of Workload Identity Federation to let a TrueFoundry service account running on Amazon EKS authenticate to Google Cloud. Replace the pool names, project IDs, OIDC issuer URI, namespaces, and service account names with your own values.
Step 1 — Create a Workload Identity Pool
gcloud iam workload-identity-pools create <POOL_NAME> \
  --location="global" \
  --description="Workload identity pool for <YOUR_CLUSTER>" \
  --display-name="<YOUR_CLUSTER>"
Step 2 — Create a Workload Identity Provider
The --issuer-uri must be the OIDC issuer URL of your EKS cluster. You can find it in the AWS EKS console or via aws eks describe-cluster. The --attribute-condition restricts which Kubernetes service accounts can use this provider.
Step 3 — Create a Google Cloud Service Account
gcloud iam service-accounts create <GSA_NAME> \
  --project="<GCP_PROJECT_ID>" \
  --display-name="<GSA_DISPLAY_NAME>"
Step 4 — Grant the Service Account the Required Role
Grant the Agent Platform User role (formerly Vertex AI User, or whichever role your workload needs) to the service account:
Step 5 — Allow the Federated Identity to Impersonate the Service Account
Grant the roles/iam.workloadIdentityUser role so the Kubernetes service account (via the workload identity pool) can impersonate the Google Cloud service account:
gcloud iam service-accounts add-iam-policy-binding <GSA_EMAIL> \
  --member="principal://iam.googleapis.com/projects/<PROJECT_NUMBER>/locations/global/workloadIdentityPools/<POOL_NAME>/subject/system:serviceaccount:<NAMESPACE>:<KSA_NAME>" \
  --role="roles/iam.workloadIdentityUser"
Optionally, to allow all service accounts in a namespace (instead of a single one), use principalSet:
gcloud iam service-accounts add-iam-policy-binding <GSA_EMAIL> \
  --member="principalSet://iam.googleapis.com/projects/<PROJECT_NUMBER>/locations/global/workloadIdentityPools/<POOL_NAME>/attribute.namespace/<NAMESPACE>" \
  --role="roles/iam.workloadIdentityUser"
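The --member strings are long and easy to mistype. A small helper that assembles them from their parts, following the two patterns shown in the commands above, can reduce copy-paste errors. The function names and sample values are hypothetical.

```python
POOL_PREFIX = "iam.googleapis.com/projects/{project_number}/locations/global/workloadIdentityPools/{pool}"


def ksa_principal(project_number: str, pool: str, namespace: str, ksa: str) -> str:
    """Member string for a single Kubernetes service account."""
    prefix = POOL_PREFIX.format(project_number=project_number, pool=pool)
    return f"principal://{prefix}/subject/system:serviceaccount:{namespace}:{ksa}"


def namespace_principal_set(project_number: str, pool: str, namespace: str) -> str:
    """Member string covering every service account in a namespace."""
    prefix = POOL_PREFIX.format(project_number=project_number, pool=pool)
    return f"principalSet://{prefix}/attribute.namespace/{namespace}"


# Sample values for illustration only.
single = ksa_principal("123456789", "eks-pool", "truefoundry", "gateway-sa")
whole_namespace = namespace_principal_set("123456789", "eks-pool", "truefoundry")
print(single)
print(whole_namespace)
```

The printed values can be pasted into the --member flag of the corresponding add-iam-policy-binding command.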
Step 6 — Generate the Credential Configuration File
This is the file you will paste into TrueFoundry when configuring the Vertex AI provider account.
The generated credential-configuration.json file is what you provide in TrueFoundry under Workload Identity Federation file when adding the Vertex AI provider account.
When should I use Gemini vs Vertex AI? What's the difference?
Gemini is generally recommended for individual developers and prototyping use cases, while Vertex AI is recommended for production and enterprise use cases.
Vertex AI offers everything available in the Gemini API and more, including:
More secure auth using service accounts instead of API keys
A Model Garden that includes multiple third-party models