Learn how to use TrueFoundry’s unified Chat Completions API to interact with models from multiple providers through a consistent interface
TrueFoundry AI Gateway provides a universal API for all supported models via the standard OpenAI /chat/completions endpoint. This unified interface allows you to seamlessly work with models from different providers through a consistent API.
You can use the standard OpenAI client to send requests to the gateway:
```python
from openai import OpenAI

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",  # this is the TrueFoundry model ID
    messages=[{"role": "user", "content": "Hello, how are you?"}]
)
print(response.choices[0].message.content)
```
model: TrueFoundry model ID in the format provider_account/model_name (available in the LLM playground UI)
See Integrate with code for instructions on obtaining these values. For using native provider SDKs (OpenAI, Google Gen AI, Anthropic, boto3), see Native SDK Support.
System prompts set the behavior and context for the model by defining the assistant’s role, tone, and constraints:
```python
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that specializes in Python programming."},
        {"role": "user", "content": "How do I write a function to calculate factorial?"}
    ]
)
```
The API supports various media types, including images, audio, video, and PDF documents.
Images
Supported Providers: OpenAI, Bedrock, Anthropic, Google Vertex, Google Gemini
Send images as part of your chat completion requests using either URLs or base64 encoding:
```python
import base64

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

response = client.chat.completions.create(
    model="openai-main/gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image('image.jpeg')}"
                    }
                }
            ]
        }
    ]
)
```
Media Resolution
Supported Providers: OpenAI, Azure OpenAI, Google Gemini, Google Vertex AI, xAI
The detail parameter in the image_url object allows you to control the resolution at which images are processed. This helps balance response quality, latency, and cost.
Supported Values: low, high, auto
```python
import base64
from openai import OpenAI

API_KEY = "your_truefoundry_api_key"
BASE_URL = "{GATEWAY_BASE_URL}"

# Read and encode the image as base64
with open("test-img.png", "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')

client = OpenAI(
    api_key=API_KEY,
    base_url=BASE_URL
)

response = client.chat.completions.create(
    model="test-123/gemini-3-pro-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{base64_image}",
                        "detail": "low"  # Options: "low", "high", "auto"
                    }
                }
            ]
        }
    ]
)
print(response.choices[0].message)
```
For Google Gemini and Vertex AI providers, the detail parameter is automatically translated to the mediaResolution parameter:
"low" → MEDIA_RESOLUTION_LOW (64 tokens)
"high" → MEDIA_RESOLUTION_HIGH (256+ tokens with scaling)
"auto" or omitted → No explicit media resolution (model decides)
Audio
Supported Models: Google Gemini models (Gemini 2.0 Flash, etc.)
Send audio files in supported formats (MP3, WAV, etc.):
```python
import base64

def encode_audio(audio_path):
    with open(audio_path, "rb") as audio_file:
        return base64.b64encode(audio_file.read()).decode('utf-8')

response = client.chat.completions.create(
    model="internal-google/gemini-2-0-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe and summarize this audio clip"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:audio/mp3;base64,{encode_audio('audio.mp3')}",
                        "mime_type": "audio/mp3"  # required for Gemini models
                    }
                }
            ]
        }
    ]
)
```
PDF Documents
Supported Providers: OpenAI, Bedrock, Anthropic, Google Vertex, Google Gemini
PDF document processing allows models to analyze and extract information from PDF files:
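A minimal sketch of a PDF request, assuming the gateway accepts PDFs as base64 data URLs in the same image_url content part used for images (mirroring the image example above; file name, prompt text, and model ID are placeholders):

```python
import base64

def pdf_data_url(pdf_bytes: bytes) -> str:
    """Encode raw PDF bytes as a base64 data URL."""
    return "data:application/pdf;base64," + base64.b64encode(pdf_bytes).decode("utf-8")

# In practice, read the bytes from disk, e.g. open("report.pdf", "rb").read()
request = {
    "model": "openai-main/gpt-4o",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize the key points of this document."},
                {
                    "type": "image_url",
                    "image_url": {"url": pdf_data_url(b"%PDF-1.4 ...")},
                },
            ],
        }
    ],
}
# Send with the same client shown earlier:
# response = client.chat.completions.create(**request)
```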
TrueFoundry supports vision models from all integrated providers as they become available. These models can analyze and interpret images alongside text, enabling multimodal AI applications.
Function calling allows models to invoke defined functions during conversations, enabling them to perform specific actions or retrieve external information.
```python
from openai import OpenAI
import json

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="{GATEWAY_BASE_URL}"
)

# Define a function
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA"
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"]
                }
            },
            "required": ["location"]
        }
    }
}]

# Make the request
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in New York?"}],
    tools=tools
)

# Check if the model wants to call a function
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    function_name = tool_call.function.name
    function_args = json.loads(tool_call.function.arguments)
    print(f"Function called: {function_name}")
    print(f"Arguments: {function_args}")
```
Process function calls and continue the conversation:
```python
# Initial request
messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=messages,
    tools=tools
)

# Handle function call
if response.choices[0].message.tool_calls:
    messages.append(response.choices[0].message)
    for tool_call in response.choices[0].message.tool_calls:
        function_name = tool_call.function.name
        function_args = json.loads(tool_call.function.arguments)

        # Execute your function (simulated here)
        if function_name == "get_weather":
            result = f"The weather in {function_args['location']} is 22°C and sunny"

        # Add the function result to the conversation
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": result
        })

    # Continue the conversation
    final_response = client.chat.completions.create(
        model="openai-main/gpt-4o-mini",
        messages=messages
    )
    print(final_response.choices[0].message.content)
```
Controlling When and How Functions Are Called
Use the tool_choice parameter to control when and how functions are called:
```python
# Force a specific function call
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather?"}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "get_weather"}}
)

# Allow automatic function calling (default)
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather?"}],
    tools=tools,
    tool_choice="auto"
)

# Prevent function calling
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather?"}],
    tools=tools,
    tool_choice="none"
)

# Force any function call
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather?"}],
    tools=tools,
    tool_choice="required"
)
```
Thought signatures are encrypted representations of a model’s internal reasoning process that help maintain context and coherence across multi-turn interactions, particularly during function calling. When using certain Gemini 3 preview models, the API includes a thought_signature field in tool call responses.
```python
from openai import OpenAI
import json

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="{GATEWAY_BASE_URL}"
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"]
        }
    }
}]

# First call - model requests tool
response = client.chat.completions.create(
    model="vertex-main/gemini-3-pro-preview",
    messages=[{"role": "user", "content": "What's the weather in San Francisco?"}],
    tools=tools
)

message = response.choices[0].message
if message.tool_calls:
    tool_call = message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)
    result = f"The weather in {args['location']} is 18°C and cloudy."

    # Convert message to dict (preserves thought_signature)
    assistant_message = message.model_dump(exclude_none=True)

    # Second call - send tool result back
    final_response = client.chat.completions.create(
        model="vertex-main/gemini-3-pro-preview",
        messages=[
            {"role": "user", "content": "What's the weather in San Francisco?"},
            assistant_message,  # Includes thought_signature
            {
                "role": "tool",
                "content": json.dumps(result),
                "tool_call_id": tool_call.id
            }
        ]
    )
    print(final_response.choices[0].message.content)
```
The chat completions API supports structured response formats, enabling you to receive consistent, predictable outputs in JSON format. This is useful for parsing responses programmatically.
Basic JSON Mode: Getting Valid JSON Without Structure Constraints
JSON mode ensures the model’s output is valid JSON without enforcing a specific structure:
```python
from openai import OpenAI

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="openai-main/gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant designed to output JSON."},
        {"role": "user", "content": "Extract information about the 2020 World Series winner"}
    ],
    response_format={"type": "json_object"}
)
print(response.choices[0].message.content)
```
Output:
```json
{
  "team": "Los Angeles Dodgers",
  "year": 2020,
  "opponent": "Tampa Bay Rays",
  "games_played": 6,
  "series_result": "4-2"
}
```
JSON Schema Mode: Enforcing Specific Data Structures
JSON Schema mode provides strict structure validation using predefined schemas:
```python
from openai import OpenAI
import json

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="{GATEWAY_BASE_URL}"
)

# Define JSON schema
user_info_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
        "occupation": {"type": "string"},
        "location": {"type": "string"},
        "skills": {
            "type": "array",
            "items": {"type": "string"}
        }
    },
    "required": ["name", "age", "occupation", "location", "skills"],
    "additionalProperties": False
}

response = client.chat.completions.create(
    model="openai-main/gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "Extract user information and respond according to the provided JSON schema."
        },
        {
            "role": "user",
            "content": "My name is Sarah Johnson, I'm 28 years old, and I work as a data scientist in New York. I'm skilled in Python, SQL, and machine learning."
        }
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "user_info",
            "schema": user_info_schema,
            "strict": True
        }
    }
)

# Parse response
result = json.loads(response.choices[0].message.content)
```
When using JSON schema with strict mode set to true, every property defined in the schema must be included in the required array. If any property is defined but not marked as required, the API returns a 400 Bad Request error.
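As a quick local sanity check before sending a request (the helper below is illustrative, not part of the gateway), you can verify that a schema satisfies the strict-mode rule:

```python
def strict_mode_ok(schema: dict) -> bool:
    """True when every declared property is also listed in `required`."""
    return set(schema.get("properties", {})) == set(schema.get("required", []))

# This schema would be rejected with a 400: "age" is declared but not required.
bad_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name"],
    "additionalProperties": False,
}
print(strict_mode_ok(bad_schema))  # False

bad_schema["required"].append("age")
print(strict_mode_ok(bad_schema))  # True
```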
Pydantic provides automatic validation, serialization, and type hints for structured data:
```python
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import List

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="{GATEWAY_BASE_URL}"
)

# Define Pydantic model
class UserInfo(BaseModel):
    name: str = Field(description="Full name of the user")
    age: int = Field(ge=0, description="Age in years")
    occupation: str = Field(description="Job title or profession")
    location: str = Field(description="City or location")
    skills: List[str] = Field(description="List of professional skills")

    class Config:
        extra = "forbid"  # Prevent additional fields

response = client.chat.completions.create(
    model="openai-main/gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "Extract user information and respond according to the provided schema."
        },
        {
            "role": "user",
            "content": "Hi, I'm Mike Chen, a 32-year-old software architect from Seattle. I specialize in cloud computing, microservices, and Kubernetes."
        }
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "user_info",
            "schema": UserInfo.model_json_schema(),
            "strict": True
        }
    }
)

# Parse and validate with Pydantic
user_data = UserInfo.model_validate_json(response.choices[0].message.content)
```
When using OpenAI models with Pydantic models and strict mode set to true, the model must not contain optional fields: optional fields are omitted from the generated JSON schema's required array, which strict mode rejects.
Streamlined Pydantic Integration with OpenAI's Beta Parse API
The beta parse client provides the most streamlined approach for Pydantic integration:
```python
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import List, Optional

class UserInfo(BaseModel):
    name: str = Field(description="Full name of the user")
    age: int = Field(ge=0, description="Age in years")
    occupation: str = Field(description="Job title or profession")
    location: Optional[str] = Field(None, description="City or location")
    skills: List[str] = Field(default=[], description="List of professional skills")

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="{GATEWAY_BASE_URL}"
)

completion = client.beta.chat.completions.parse(
    model="openai-main/gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "Extract user information from the provided text."
        },
        {
            "role": "user",
            "content": "Hello, I'm Alex Rodriguez, a 29-year-old product manager from Austin. I have experience in agile methodologies, data analysis, and team leadership."
        }
    ],
    response_format=UserInfo,
)

user_result = completion.choices[0].message.parsed
```
This approach allows for optional fields in your Pydantic model and provides a cleaner API for structured responses.
You can use response_format with any provider. The Gateway either uses the provider’s native structured output or converts your schema into a tool the model must call and then puts the result in message.content.
| Provider | Support |
| --- | --- |
| OpenAI | Native for json_object, and for json_schema on supported models (e.g. gpt-4o, gpt-5, gpt-4.1, o3, o4). Other models use tool conversion. |
| Azure OpenAI | Same as OpenAI. |
| Anthropic | Native for Claude 4.5/4.6 with json_schema. Other models use tool conversion. |
| Google Gemini, Google Vertex | Native when the request has no tools; otherwise tool conversion. |
| All others (Bedrock, Cohere, Mistral, OpenRouter, Groq, xAI, vLLM, etc.) | Tool conversion only. The Gateway turns your schema into a required tool and extracts the result into message.content. |
Anthropic and JSON schema constraints: The code examples in this doc use Pydantic’s ge=0 for fields such as age. Anthropic’s API does not support these constraint parameters in the schema. If you use structured output with Anthropic models, omit ge, le and similar numeric/string constraints from your schema (or use a schema without them). The code will work with Anthropic once those constraints are removed.
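For example, an Anthropic-safe variant of the earlier UserInfo model (a sketch; the field names mirror the earlier example) drops ge=0 so the generated schema carries no minimum keyword:

```python
from typing import List
from pydantic import BaseModel, Field

# Anthropic-safe variant: no ge/le constraints, so the generated JSON schema
# contains no "minimum"/"maximum" keywords. Validate ranges in your own code.
class UserInfoAnthropic(BaseModel):
    name: str = Field(description="Full name of the user")
    age: int = Field(description="Age in years")  # no ge=0 here
    occupation: str = Field(description="Job title or profession")
    location: str = Field(description="City or location")
    skills: List[str] = Field(description="List of professional skills")

schema = UserInfoAnthropic.model_json_schema()
print("minimum" in schema["properties"]["age"])  # False
```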
Prompt caching reduces processing time and costs by reusing previously computed prefixes. Some providers handle this automatically, while others require you to mark which content to cache. The gateway handles both: you can include cache_control in your requests and the gateway will forward it to providers that support it and automatically strip it for providers that don’t.
You can safely include cache_control in requests regardless of the target provider. The gateway ensures it never causes errors.
The ttl field is optional (only Anthropic supports it). The type must be "ephemeral".
Anthropic / Bedrock
For Anthropic (direct, Vertex AI, Azure AI Foundry) and AWS Bedrock, you must explicitly mark content to cache. The gateway forwards cache_control to Anthropic as is, and translates it into Bedrock’s native cachePoint format automatically.
Minimum cacheable prompt length by model:
4096 tokens: Claude Mythos Preview, Opus 4.7, Opus 4.6, Opus 4.5, Haiku 4.5
2048 tokens: Sonnet 4.6, Haiku 3.5, Haiku 3
1024 tokens: Sonnet 4.5, Opus 4.1, Opus 4, Sonnet 4, Sonnet 3.7
For Amazon Titan models on Bedrock, cache_control on tool definitions is automatically skipped since these models do not support cache points on tools.
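Putting this together, a request that marks a long, stable system prompt for caching might look like the sketch below (the model ID is a placeholder; cache_control placement on a content part follows the convention shown in Anthropic's documentation):

```python
# Mark the long, rarely-changing part of the prompt with cache_control. The
# gateway forwards this to Anthropic as-is and converts it to Bedrock cachePoint.
messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a support agent. <long, rarely-changing knowledge base here>",
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    {"role": "user", "content": "How do I reset my password?"},
]
# Send with the same client shown earlier:
# response = client.chat.completions.create(
#     model="anthropic-main/<claude-model-id>", messages=messages
# )
```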
OpenAI / Azure OpenAI
OpenAI and Azure OpenAI cache prompts automatically. No cache_control markup is needed; any cache_control in the request is stripped by the gateway.
You can optionally pass prompt_cache_key to improve cache hit rates across requests with shared prefixes:
prompt_cache_key is only available for OpenAI and Azure OpenAI.
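A sketch of passing the key through the OpenAI SDK's extra_body (the key value here is arbitrary; requests that share a long common prefix should use the same value so they route to the same cache entry):

```python
# Build the request payload; extra_body merges prompt_cache_key into the request.
request = {
    "model": "openai-main/gpt-4o-mini",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "extra_body": {"prompt_cache_key": "support-bot-v1"},
}
# Send with the same client shown earlier:
# response = client.chat.completions.create(**request)
```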
Gemini / Groq / xAI
These providers handle caching automatically. No configuration is needed; any cache_control in the request is stripped by the gateway before forwarding. Cached token counts are still reported in the response usage when the provider returns them.
TrueFoundry AI Gateway provides access to model reasoning through thinking/reasoning tokens, available for models from multiple providers including Anthropic, OpenAI, Azure OpenAI, Groq, xAI, and Vertex. These models expose their internal reasoning process, giving you step-by-step insight into how they arrive at conclusions.
Supported models: Claude Opus 4.1 (claude-opus-4-1-20250805), Claude Opus 4 (claude-opus-4-20250514), Claude Sonnet 4 (claude-sonnet-4-20250514), and Claude Sonnet 3.7 (claude-3-7-sonnet-20250219), via Anthropic, AWS Bedrock, and Google Vertex AI.
For Anthropic models (from Anthropic, Google Vertex AI, or AWS Bedrock), TrueFoundry automatically translates the reasoning_effort parameter into Anthropic's native thinking parameter, since Anthropic doesn't support reasoning_effort directly. The thinking token budget is derived as a fraction of the max_tokens parameter based on the requested effort level.
Supported models: grok-3-mini (with reasoning_effort parameter), grok-4-0709, grok-4-1-fast-reasoning, grok-4-fast-reasoning (reasoning built-in)
For grok-3-mini, you can use the reasoning_effort parameter to control reasoning depth. Other Grok models like grok-4-0709 have reasoning capabilities built-in but do not support the reasoning_effort parameter.
```python
from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

# For grok-3-mini with reasoning_effort parameter
response = client.chat.completions.create(
    model="xai-main/grok-3-mini",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "low" (only for grok-3-mini)
    max_tokens=8000
)
print(response.choices[0].message.content)
```
The reasoning_effort parameter is only supported for grok-3-mini. For other Grok models like grok-4-0709 and grok-4-1-fast-reasoning, reasoning is built-in and the reasoning_effort parameter should not be used. Reasoning tokens are included in the usage metrics for all reasoning-capable models.
Parameter restrictions: Reasoning models (like grok-4-0709 and grok-4-1-fast-reasoning) do not support the presence_penalty, frequency_penalty, or stop parameters; using them with reasoning models results in an error.
Gemini
Supported models: All Gemini 2.5 series models. These models can be accessed via the Google Vertex or Google Gemini providers.
For Gemini models (from Google Vertex AI or Google Gemini), TrueFoundry automatically translates the reasoning_effort parameter into Gemini's native thinking configuration, since Gemini doesn't support the reasoning_effort parameter directly. The translation uses the max_tokens parameter with the following ratios:
none: 0% of max_tokens
low: 30% of max_tokens
medium: 60% of max_tokens
high: 90% of max_tokens
Note: Gemini 2.5 Pro and 2.5 Flash come with reasoning enabled by default.
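The ratio table above can be sketched as a small helper (illustrative only; the gateway performs this translation server-side, and the function name is ours):

```python
# Illustrative reproduction of the documented translation: reasoning_effort
# becomes a Gemini thinking-token budget derived from max_tokens.
EFFORT_RATIOS = {"none": 0.0, "low": 0.3, "medium": 0.6, "high": 0.9}

def gemini_thinking_budget(reasoning_effort: str, max_tokens: int) -> int:
    return int(max_tokens * EFFORT_RATIOS[reasoning_effort])

print(gemini_thinking_budget("low", 8000))   # 2400
print(gemini_thinking_budget("high", 8000))  # 7200
```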
When reasoning tokens are enabled, the response includes both thinking and content sections:
```json
{
  "id": "1742890579083",
  "object": "chat.completion",
  "created": 1742890579,
  "model": "",
  "provider": "aws",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "To compute 3^3^3, you need to understand that exponentiation is evaluated from right to left (right-associative). So:\n\n3^3^3 = 3^(3^3), not (3^3)^3\n\nHere's how to calculate it step by step:\n\n**Step 1:** Calculate the rightmost exponent first\n- 3^3 = 27\n\n**Step 2:** Now calculate 3^27\n- 3^27 = 7,625,597,484,987\n\nTherefore: **3^3^3 = 7,625,597,484,987**\n\n---\n\n**Note:** If you meant (3^3)^3 instead, that would be:\n- (3^3)^3 = 27^3 = 19,683\n\nThis is much smaller than 3^3^3 because:\n- (3^3)^3 = 3^(3×3) = 3^9\n- While 3^3^3 = 3^27\n\nThe difference between 3^9 and 3^27 is enormous!",
        "reasoning_content": "The user is asking how to compute 3^3^3. This is a question about exponentiation and specifically about the order of operations when dealing with repeated exponentiation (also known as tetration or power towers).\n\nThe key thing to understand here is that exponentiation is right-associative, meaning 3^3^3 = 3^(3^3), not (3^3)^3.\n\nSo we need to:\n1. First compute 3^3 = 27\n2. Then compute 3^27\n\nLet me work through this:\n- 3^3 = 27\n- 3^27 = 3^27\n\nNow 3^27 is a large number. Let me think about how to compute it:\n3^27 = 3^(3×9) = (3^3)^9 = 27^9\n\nOr we could compute it directly:\n3^1 = 3\n3^2 = 9\n3^3 = 27\n3^4 = 81\n3^5 = 243\n3^6 = 729\n3^7 = 2,187\n3^8 = 6,561\n3^9 = 19,683\n3^10 = 59,049\n...\n\nActually, let me just state that 3^27 = 7,625,597,484,987\n\nSo 3^3^3 = 3^(3^3) = 3^27 = 7,625,597,484,987"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 45,
    "completion_tokens": 180,
    "total_tokens": 225
  }
}
```
Extended thinking exposes the model’s step-by-step reasoning as structured thinking_blocks. Unlike the plain-text reasoning_content field, thinking blocks carry cryptographic signatures — required to continue a reasoning chain across multiple turns.
Use model_dump(exclude_none=True) on the assistant message — it captures content, tool_calls, and thinking_blocks in one shot, so you don’t need to construct the dict manually.
```python
# Turn 1
response = client.chat.completions.create(
    model="anthropic-main/claude-opus-4-1-20250805",
    messages=[{"role": "user", "content": "What is 3^3^3?"}],
    reasoning_effort="high",
    max_tokens=8000
)

# Serialize the full assistant message (preserves thinking_blocks + signatures)
assistant_message = response.choices[0].message.model_dump(exclude_none=True)

# Turn 2 — pass the serialized message back as-is
response2 = client.chat.completions.create(
    model="anthropic-main/claude-opus-4-1-20250805",
    messages=[
        {"role": "user", "content": "What is 3^3^3?"},
        assistant_message,
        {"role": "user", "content": "Now explain why exponentiation is right-associative."}
    ],
    reasoning_effort="high",
    max_tokens=8000
)
```
Always echo thinking_blocks exactly as returned. Blocks with missing or modified signature fields are rejected by the provider.
When thinking is enabled, Anthropic and Bedrock require the assistant message to include thinking_blocks alongside tool_calls. Use model_dump(exclude_none=True) — it captures both in one step.
```python
import json
from openai import OpenAI

client = OpenAI(api_key="TFY_API_KEY", base_url="{GATEWAY_BASE_URL}")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
        }
    }
}]

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

# Turn 1 — model responds with thinking + tool call
response = client.chat.completions.create(
    model="anthropic-main/claude-opus-4-1-20250805",
    messages=messages,
    tools=tools,
    reasoning_effort="high",
    max_tokens=8000
)
msg = response.choices[0].message

# Append assistant message — model_dump captures tool_calls + thinking_blocks together
messages.append(msg.model_dump(exclude_none=True))

# Execute the tool and append the result
for tool_call in msg.tool_calls:
    args = json.loads(tool_call.function.arguments)
    result = f"Sunny, 24°C in {args['city']}"
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": result
    })

# Turn 2 — model summarizes with full context
response2 = client.chat.completions.create(
    model="anthropic-main/claude-opus-4-1-20250805",
    messages=messages,
    tools=tools,
    reasoning_effort="high",
    max_tokens=8000
)
print(response2.choices[0].message.content)
```
Google Gemini models support grounding with Google Search, which allows the model to augment its responses with real-time web results. When grounding is enabled, the model can call a search tool during generation to retrieve up-to-date information and incorporate it into the final answer.
```python
from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="{GATEWAY_BASE_URL}"
)

response = client.chat.completions.create(
    model="tfy-ai-gemini/gemini-2-5-pro",  # TrueFoundry Gemini model name
    messages=[{"role": "user", "content": "what date and time is right now?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "google_search",
        }
    }]
)
print(response.choices[0].message)
```