Deploy AI Gateway FAQs - TrueFoundry Docs

How to add multiple gateway planes to the control plane?

You can add multiple gateway planes to the control plane by following the steps below:

Create Kubernetes Secrets for License Key and Image Pull Secret

We will create two secrets in this step:

Store the License Key
Store the Image Pull Secret

Create Kubernetes Secret for License Key

We need to create a Kubernetes secret containing the license key.

The same license key is used for all gateway planes as for the control plane.

truefoundry-creds.yaml

apiVersion: v1
kind: Secret
metadata:
  name: truefoundry-creds
type: Opaque
stringData:
  TFY_API_KEY: <TFY_API_KEY>

Apply the secret to the Kubernetes cluster (assuming you are installing the control plane in the truefoundry namespace):

kubectl apply -f truefoundry-creds.yaml -n truefoundry

Create Kubernetes Secret for Image Pull Secret

We need to create an image pull secret to enable pulling the TrueFoundry images from the private registry.

The same image pull secret is used for all gateway planes as for the control plane. Use your own credentials if you are pulling TrueFoundry images from your registry.

truefoundry-image-pull-secret.yaml

apiVersion: v1
kind: Secret
metadata:
  name: truefoundry-image-pull-secret
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <IMAGE_PULL_SECRET> # Provided by TrueFoundry team

Apply the secret to the Kubernetes cluster (assuming you are installing the control plane in the truefoundry namespace):

kubectl apply -f truefoundry-image-pull-secret.yaml -n truefoundry
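If you pull the images from your own registry, you have to generate the .dockerconfigjson value from your own credentials. The following is a minimal sketch, assuming the standard Docker config layout (an auths map keyed by registry host, with base64 of user:password in the auth field). The function name, registry host, and credentials are hypothetical placeholders.

```shell
# Build the base64-encoded value for the .dockerconfigjson field.
# Usage: make_dockerconfigjson <registry-host> <username> <password>
make_dockerconfigjson() (
  registry=$1 user=$2 pass=$3
  # Docker stores credentials as base64("user:password") under auths.<registry>.auth
  auth=$(printf '%s:%s' "$user" "$pass" | base64 | tr -d '\n')
  # Fields under "data" in a Secret are base64-encoded, so encode the whole JSON document
  printf '{"auths":{"%s":{"auth":"%s"}}}' "$registry" "$auth" | base64 | tr -d '\n'
)

# Example with placeholder credentials (hypothetical values)
make_dockerconfigjson registry.example.com demo-user demo-pass
```

The printed string can be pasted in place of <IMAGE_PULL_SECRET>. Alternatively, kubectl create secret docker-registry builds an equivalent secret directly from the same three inputs.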

Create Helm chart Values file for gateway plane

Create a values file as given below and replace the following values:

CONTROL_PLANE_URL: URL that you will map to the control plane dashboard.
TENANT_NAME: Tenant name provided by TrueFoundry team.
GATEWAY_ENDPOINT_HOST: The domain where you will expose the gateway endpoint (e.g., gateway.example.com)

truefoundry-gateway-values.yaml

global:
  # This is the reference to the secrets we created in the previous step
  imagePullSecrets:
    - name: "truefoundry-image-pull-secret"

  # Choose the resource tier as per your needs
  resourceTier: medium # or small or large
  controlPlaneURL: <CONTROL_PLANE_URL> # eg. https://example-company.truefoundry.cloud
  tenantName: <TENANT_NAME>

ingress:
  enabled: true
  annotations: {}
  ingressClassName: nginx
  tls: []
  hosts:
    - <GATEWAY_ENDPOINT_HOST>

# Optional: Istio configuration (if using Istio instead of standard ingress)
# istio:
#   virtualservice:
#     hosts:
#       - <GATEWAY_ENDPOINT_HOST>
#     enabled: true
#     retries:
#       enabled: true
#       retryOn: gateway-error
#     gateways:
#       - istio-system/tfy-wildcard
#     annotations: {}

Install Helm chart for gateway plane

helm upgrade --install tfy-llm-gateway oci://tfy.jfrog.io/tfy-helm/tfy-llm-gateway -n truefoundry --create-namespace -f truefoundry-gateway-values.yaml
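Before running the install, it can help to confirm that no <PLACEHOLDER> tokens are left in the values file. This is a small sketch of such a check; the function name is hypothetical, and the pattern assumes placeholders are written in upper case between angle brackets, as in this guide.

```shell
# Print any lines that still contain an unreplaced <PLACEHOLDER> token.
# Returns 0 when the file is clean and 1 when placeholders remain.
check_placeholders() {
  if grep -n -E '<[A-Z][A-Z0-9_]*>' "$1"; then
    echo "error: replace the placeholders listed above in $1" >&2
    return 1
  fi
  return 0
}

# Typical use (not executed here):
#   check_placeholders truefoundry-gateway-values.yaml && helm upgrade --install ...
```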

Can I use my Artifactory as a mirror to pull images?

Yes. You can configure your Artifactory to mirror our registry.

Credentials for accessing the TrueFoundry private registry are required and will be provided during onboarding.

1. Registry Configuration

URL: https://tfy.jfrog.io/

2. Update Helm values

global:
  image:
    registry: <YOUR_REGISTRY> # Replace with your registry
postgresql:
  image:
    registry: <YOUR_REGISTRY> # Replace with your registry, use this if `devMode` is enabled

Can I copy images to my own private registry?

Yes. We provide a script that uses the truefoundry Helm Chart to identify and copy required images to your private registry.

Credentials for accessing the TrueFoundry private registry are required and will be provided during onboarding.

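The steps depend on the destination registry type: a generic registry (first set of steps) or AWS ECR (second set of steps).

Generic Registry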

1. Install required dependencies

Skopeo
- Used to perform the image copy operation.
Helm
- Used to get the list of images from the TrueFoundry Helm Chart.

2. Add TrueFoundry Helm Chart repository

helm repo add truefoundry https://truefoundry.github.io/infra-charts
helm repo update

3. Authenticate to the TrueFoundry source registry

skopeo login -u <USERNAME> -p <PASSWORD> https://tfy.jfrog.io/

Replace <USERNAME> with the TrueFoundry registry username.
Replace <PASSWORD> with the TrueFoundry registry password.

4. Authenticate to your destination registry

skopeo login -u <USERNAME> -p <PASSWORD> <YOUR_REGISTRY>

Replace <USERNAME> with your registry username.
Replace <PASSWORD> with your registry password.
Replace <YOUR_REGISTRY> with the URL of your registry.
Skopeo will use authentication details for a registry that was previously authenticated with docker login.
Alternatively, you can use the --dest-user and --dest-password flags to provide the username and password for the destination registry.

5. Run Clone Image Script

export TRUEFOUNDRY_HELM_CHART_VERSION=<TRUEFOUNDRY_HELM_CHART_VERSION>
export TRUEFOUNDRY_HELM_VALUES_FILE=<TRUEFOUNDRY_HELM_VALUES_FILE>
export DEST_REGISTRY=<YOUR_DESTINATION_REGISTRY>

# Dry-run example
curl -s https://raw.githubusercontent.com/truefoundry/infra-charts/main/scripts/clone_images_to_your_registry.sh | bash -s -- --helm-chart truefoundry --helm-version $TRUEFOUNDRY_HELM_CHART_VERSION --helm-values $TRUEFOUNDRY_HELM_VALUES_FILE --dest-registry $DEST_REGISTRY --dry-run

# Live example
curl -s https://raw.githubusercontent.com/truefoundry/infra-charts/main/scripts/clone_images_to_your_registry.sh | bash -s -- --helm-chart truefoundry --helm-version $TRUEFOUNDRY_HELM_CHART_VERSION --helm-values $TRUEFOUNDRY_HELM_VALUES_FILE --dest-registry $DEST_REGISTRY

Replace <TRUEFOUNDRY_HELM_CHART_VERSION> with the version of the TrueFoundry Helm chart you want to use. You can find the latest version in the changelog.
Replace <TRUEFOUNDRY_HELM_VALUES_FILE> with the path to the values file you created in the Installation Instructions.
Replace <YOUR_DESTINATION_REGISTRY> with the URL of your registry.

6. Update the Helm values file to use your registry

global:
  image:
    registry: <YOUR_REGISTRY> # Replace with your registry
postgresql:
  image:
    registry: <YOUR_REGISTRY> # Replace with your registry, use this if `devMode` is enabled

AWS ECR Registry

1. Install required dependencies

Skopeo
- Used to perform the image copy operation.
Helm
- Used to get the list of images from the TrueFoundry Helm Chart.
AWS CLI
- Used to perform AWS ECR actions to validate and create repositories.

2. Add TrueFoundry Helm Chart repository

helm repo add truefoundry https://truefoundry.github.io/infra-charts
helm repo update

3. Authenticate to the TrueFoundry source registry

skopeo login -u <USERNAME> -p <PASSWORD> https://tfy.jfrog.io/

Replace <USERNAME> with the TrueFoundry registry username.
Replace <PASSWORD> with the TrueFoundry registry password.

4. Authenticate to your destination registry

# Set your AWS profile
export AWS_PROFILE=<AWS_PROFILE>

# Authenticate to ECR using the profile
aws ecr get-login-password --region <AWS_REGION> | skopeo login --username AWS --password-stdin <YOUR_ECR_REGISTRY>

Replace <AWS_PROFILE> with your AWS profile name.
Replace <AWS_REGION> with the region of your ECR registry (e.g., us-east-2).
Replace <YOUR_ECR_REGISTRY> with the URL of your ECR registry (e.g., 123456789012.dkr.ecr.us-east-2.amazonaws.com).
Skopeo will use authentication details for a registry that was previously authenticated with docker login.
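The region passed to aws ecr get-login-password has to match the region embedded in the ECR registry URL. As a sketch, the region can be derived from the URL itself instead of being typed separately. The helper name is hypothetical, and it assumes the usual <account>.dkr.ecr.<region>.amazonaws.com host layout.

```shell
# Extract the AWS region from an ECR registry URL such as
# 123456789012.dkr.ecr.us-east-2.amazonaws.com[/optional/path]
ecr_region() (
  host=${1%%/*}                        # drop any repository path suffix
  printf '%s\n' "$host" | cut -d. -f4  # fields: account, dkr, ecr, region, ...
)

ECR_REGISTRY=123456789012.dkr.ecr.us-east-2.amazonaws.com/truefoundry
echo "region: $(ecr_region "$ECR_REGISTRY")"
# The result can be passed on, for example:
#   aws ecr get-login-password --region "$(ecr_region "$ECR_REGISTRY")"
```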

5. Run Clone Image Script

This script creates required ECR repositories and copies images.
Optionally append a path to your registry URL to namespace repositories (e.g., 123456789012.dkr.ecr.us-east-2.amazonaws.com/truefoundry).

export TRUEFOUNDRY_HELM_CHART_VERSION=<TRUEFOUNDRY_HELM_CHART_VERSION>
export TRUEFOUNDRY_HELM_VALUES_FILE=<TRUEFOUNDRY_HELM_VALUES_FILE>
export DEST_REGISTRY=<YOUR_DESTINATION_REGISTRY>

# Dry-run example
curl -s https://raw.githubusercontent.com/truefoundry/infra-charts/main/scripts/clone_images_to_your_registry.sh | bash -s -- --helm-chart truefoundry --helm-version $TRUEFOUNDRY_HELM_CHART_VERSION --helm-values $TRUEFOUNDRY_HELM_VALUES_FILE --dest-registry $DEST_REGISTRY --dry-run

# Live example
curl -s https://raw.githubusercontent.com/truefoundry/infra-charts/main/scripts/clone_images_to_your_registry.sh | bash -s -- --helm-chart truefoundry --helm-version $TRUEFOUNDRY_HELM_CHART_VERSION --helm-values $TRUEFOUNDRY_HELM_VALUES_FILE --dest-registry $DEST_REGISTRY

Replace <TRUEFOUNDRY_HELM_CHART_VERSION> with the TrueFoundry Helm chart version. Find the latest version in the changelog.
Replace <TRUEFOUNDRY_HELM_VALUES_FILE> with the path to your values file from Installation Instructions.
Replace <YOUR_DESTINATION_REGISTRY> with your ECR registry URL (e.g., 123456789012.dkr.ecr.us-east-2.amazonaws.com/truefoundry).

6. Update the Helm values file to use your registry

global:
  image:
    registry: <YOUR_REGISTRY> # Replace with your registry
postgresql:
  image:
    registry: <YOUR_REGISTRY> # Replace with your registry, use this if `devMode` is enabled

How to install in an air-gapped / restricted network environment?

An air-gapped environment is isolated from the internet. Since the control plane and gateway plane ship as a single helm chart (truefoundry), you only need to make the container images available in your private registry and update the helm values to point to it.

Copy images to your private registry — set up a registry mirror or copy images directly using the steps described in the FAQs above
Update helm values to point to your private registry (see the helm value overrides in the same FAQs above)
Continue with the standard installation on the overview and choose your cloud install guide (AWS, GCP, Azure, or on-prem)

How to integrate with AWS Bedrock models from a different AWS account?

You can integrate with AWS Bedrock models from a different AWS account by following the steps below:

Add the following IAM policy to the control plane IAM role so that it can assume the IAM role in the AWS account that has the Bedrock models:

{
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Effect": "Allow",
      "Resource": "*"
    }
  ],
  "Version": "2012-10-17"
}

On the IAM role in the destination AWS account (which has Bedrock access), add the following trust policy to allow the control plane IAM role to assume it:

{
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<CONTROL_PLANE_IAM_ROLE_ARN>"
      },
      "Action": "sts:AssumeRole"
    }
  ],
  "Version": "2012-10-17"
}

Now you can use the IAM role of the destination AWS account while integrating AWS Bedrock models in the TrueFoundry AI Gateway.
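The first policy above allows sts:AssumeRole on every resource. If you prefer to limit the control plane role to the single destination role, the Resource field can name that role's ARN instead. The helper below is only a sketch that prints such a policy; the function name and the example ARN are hypothetical.

```shell
# Print an IAM policy that allows sts:AssumeRole on one specific role only.
# Usage: assume_role_policy <destination-role-arn>
assume_role_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "$1"
    }
  ]
}
EOF
}

# Hypothetical ARN of the role in the account that has Bedrock access
assume_role_policy "arn:aws:iam::111122223333:role/bedrock-access"
```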

Do we need any NFS volumes in Kubernetes for the AI Gateway or Control Plane?

No. TrueFoundry only needs block storage for installation and operation. This should be supported via a CSI driver, and only the ReadWriteOnce access mode is required.

What is the structure of access logs?

The gateway writes access logs to standard output in one of the following formats:

logfmt
json

You can switch between them by setting an environment variable on the AI Gateway installation. The default is logfmt.

Log format

Standard log format structure:

time="%START_TIME%" level=%LEVEL% ip=%IP_ADDRESS% tenant=%TENANT_NAME% user=%SUBJECT_TYPE%:%SUBJECT_SLUG% model=%MODEL_ID% method=%METHOD% path=%PATH% status=%STATUS_CODE% time_taken=%DURATION%ms trace_id=%TRACE_ID%

Placeholder	Details
START_TIME	ISO timestamp for request start, e.g. 2025-08-12 13:34:50
LEVEL	info, warn, or error
IP_ADDRESS	IP address of the caller, e.g. ::ffff:10.99.55.142
TENANT_NAME	Name of the tenant, e.g. truefoundry
SUBJECT_TYPE	user or virtualaccount
SUBJECT_SLUG	Email or virtual account name, e.g. tfy-user@truefoundry.com or demo-virtualaccount
MODEL_ID	Model ID, e.g. openai-default/gpt-5
METHOD	GET, POST, or PUT
PATH	Path of the request, e.g. /api/inference/openai/chat/completions
STATUS_CODE	HTTP status code, e.g. 200, 400, 401, 403, 429, 500
DURATION	Duration of the request in milliseconds, e.g. 12
TRACE_ID	Trace ID of the request

Example:

time="2025-08-12 13:34:50" level=info ip=::ffff:10.99.55.142 tenant=truefoundry user=virtualaccount:demo-virtualaccount model=openai-default/gpt-5 method=POST path=/api/inference/openai/chat/completions status=200 time_taken=53ms trace_id=587b2a946c13f62f9160674a8c983ce3
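Because each logfmt line is a sequence of key=value pairs, individual fields can be pulled out with standard text tools. The following is a minimal sketch using sed. The function name is hypothetical, and it assumes keys are plain identifiers and that only quoted values contain spaces, as in the example line.

```shell
# Extract one field from a logfmt access-log line.
# Handles bare values (status=200) and quoted values (time="...").
# Usage: logfmt_get <key> <line>
logfmt_get() {
  printf '%s\n' "$2" |
    sed -n -E "s/(^|.* )$1=(\"([^\"]*)\"|([^ ]*)).*/\3\4/p"
}

line='time="2025-08-12 13:34:50" level=info ip=::ffff:10.99.55.142 tenant=truefoundry user=virtualaccount:demo-virtualaccount model=openai-default/gpt-5 method=POST path=/api/inference/openai/chat/completions status=200 time_taken=53ms trace_id=587b2a946c13f62f9160674a8c983ce3'

status=$(logfmt_get status "$line")
duration_ms=$(logfmt_get time_taken "$line")
duration_ms=${duration_ms%ms}   # strip the unit suffix
echo "status=$status duration_ms=$duration_ms"
```

For anything beyond ad hoc inspection, switching the gateway to the json format and using a JSON-aware tool is likely to be more robust than pattern matching.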

How to use SSO directly without using TrueFoundry Auth Server?

By default, the control plane uses the TrueFoundry Auth Server for user authentication. However, you can configure it to use your own external identity provider instead. We support both OIDC and SAML-compliant identity providers.

Requests to the gateway are timing out after a certain duration

If your LLM requests are timing out after a certain duration, the first thing to check is the traces in the TrueFoundry dashboard. Look at the request duration — if you see requests consistently timing out at exactly 60 seconds, the issue is almost certainly the load balancer, not the TrueFoundry AI Gateway. The TrueFoundry gateway does not impose any request timeout.

Figure: Traces showing requests timing out at 60 seconds

This commonly happens when an Application Load Balancer (ALB) is placed in front of the gateway to expose it. The default connection idle timeout on AWS ALBs is 60 seconds, which is too short for long-running LLM inference requests (especially streaming responses or large prompts).

Solution: Increase the idle timeout on your AWS ALB to a higher value (e.g., 300 seconds or more).

You can find this setting in the AWS Console under EC2 → Load Balancers → Select your ALB → Attributes tab → Connection idle timeout.

You can also update it via the AWS CLI:

aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn <YOUR_ALB_ARN> \
  --attributes Key=idle_timeout.timeout_seconds,Value=300

If you are using an ingress controller (e.g., NGINX Ingress) in addition to the ALB, also verify that the ingress controller’s proxy timeout settings are configured appropriately.

Can I get TrueFoundry metrics in Victoria Metrics instead of Prometheus?

Yes. TrueFoundry supports exporting metrics to Victoria Metrics as an alternative to Prometheus. To enable this, add the victoriaMetricsMonitoring setting shown below to your truefoundry-values.yaml file and upgrade the Helm release.

Note: This only installs the VMServiceScrape and related custom resources for scraping TrueFoundry metrics. It does not deploy Victoria Metrics itself. You are responsible for installing and managing your own Victoria Metrics instance.

truefoundry-values.yaml

victoriaMetricsMonitoring:
  enabled: true

Then upgrade the Helm release to apply the changes:

helm upgrade --install truefoundry oci://tfy.jfrog.io/tfy-helm/truefoundry -n truefoundry --create-namespace -f truefoundry-values.yaml

How to enable SSL for PostgreSQL connections?

The TrueFoundry control plane supports SSL connections to PostgreSQL. You can configure SSL by setting the DB_SSL_MODE environment variable in your truefoundry-values.yaml.

Supported DB_SSL_MODE values:

Mode	Encryption	Certificate Validation	Use Case
`disable`	No	No	Local development or trusted networks
`no-verify`	Yes	No	Managed databases with self-signed or unverified certs
`require`	Yes	Yes (system CA store)	When you have a valid CA certificate and want full verification
`verify-ca`	Yes	Yes (custom CA)	Same as `require` but explicitly checks CA
`verify-full`	Yes	Yes (CA + hostname)	Strictest mode, validates CA and hostname

SSL certificate environment variables:

Variable	Purpose	Required
`DB_SSL_CA_PATH`	Path to the server CA certificate file	For `require`, `verify-ca`, or `verify-full` modes
`DB_SSL_CERT_PATH`	Path to the client certificate file (for mTLS)	Only for mTLS (GCP Cloud SQL, Azure Database for PostgreSQL)
`DB_SSL_KEY_PATH`	Path to the client private key file (for mTLS)	Only for mTLS (GCP Cloud SQL, Azure Database for PostgreSQL)

The certificate requirements vary by cloud provider. AWS RDS only needs the server CA bundle (DB_SSL_CA_PATH), while GCP Cloud SQL and Azure Database for PostgreSQL may require all three certificate paths when client certificate authentication (mTLS) is enabled. Refer to the cloud-specific control plane documentation for detailed examples.
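The two tables above can be read as a simple rule: the mode decides whether a CA path is needed, and mTLS adds the client certificate and key. The helper below is an illustrative sketch of that rule only. Its name and the mtls argument are hypothetical and are not part of the TrueFoundry chart.

```shell
# List the certificate variables to set for a given DB_SSL_MODE, following
# the tables above. Pass "mtls" as the second argument when the database
# requires client certificates (GCP Cloud SQL, Azure Database for PostgreSQL).
db_ssl_required_vars() {
  case $1 in
    disable) return 0 ;;                     # no TLS, so no certificate files
    no-verify) ;;                            # encrypted, server cert not checked
    require|verify-ca|verify-full) echo DB_SSL_CA_PATH ;;
    *) echo "unknown DB_SSL_MODE: $1" >&2; return 1 ;;
  esac
  if [ "${2:-}" = mtls ]; then
    echo DB_SSL_CERT_PATH
    echo DB_SSL_KEY_PATH
  fi
}

# AWS RDS with full verification: only the CA bundle is listed
db_ssl_required_vars verify-full
# GCP Cloud SQL with client certificates: all three paths are listed
db_ssl_required_vars require mtls
```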

Scenario 1: Encrypted connection without certificate validation (no-verify)

This is the simplest option for managed databases. It encrypts the connection but skips server certificate validation.

truefoundry-values.yaml

servicefoundryServer:
  env:
    DB_SSL_MODE: "no-verify"
mlfoundryServer:
  env:
    DB_SSL_MODE: "no-verify"

Scenario 2: Encrypted connection with certificate validation (require)

This mode encrypts the connection and validates the server certificate. You must provide the appropriate certificate files for your database provider. The example below shows the full configuration with all three certificate paths (for GCP/Azure mTLS). For AWS RDS, only DB_SSL_CA_PATH is needed.

Create a Kubernetes Secret containing your certificate files:

# AWS RDS (CA bundle only)
kubectl create secret generic db-ssl-certs \
  --from-file=ca-certificate.crt=/path/to/your/ca-certificate.crt \
  -n truefoundry

# GCP Cloud SQL / Azure (full mTLS)
kubectl create secret generic db-ssl-certs \
  --from-file=ca-certificate.crt=/path/to/server-ca.pem \
  --from-file=client-cert.pem=/path/to/client-cert.pem \
  --from-file=client-key.pem=/path/to/client-key.pem \
  -n truefoundry

Then configure truefoundry-values.yaml to mount the certificates and set the SSL paths:

truefoundry-values.yaml

servicefoundryServer:
  env:
    DB_SSL_MODE: "require"
    DB_SSL_CA_PATH: "/etc/ssl/custom/ca-certificate.crt"
    # Only needed for mTLS (GCP Cloud SQL, Azure Database for PostgreSQL)
    DB_SSL_CERT_PATH: "/etc/ssl/custom/client-cert.pem"
    DB_SSL_KEY_PATH: "/etc/ssl/custom/client-key.pem"
  extraVolumes:
    - name: db-ssl-certs
      secret:
        secretName: db-ssl-certs
  extraVolumeMounts:
    - name: db-ssl-certs
      mountPath: /etc/ssl/custom
      readOnly: true
mlfoundryServer:
  env:
    DB_SSL_MODE: "require"
    DB_SSL_CA_PATH: "/etc/ssl/custom/ca-certificate.crt"
    # Only needed for mTLS (GCP Cloud SQL, Azure Database for PostgreSQL)
    DB_SSL_CERT_PATH: "/etc/ssl/custom/client-cert.pem"
    DB_SSL_KEY_PATH: "/etc/ssl/custom/client-key.pem"
  extraVolumes:
    - name: db-ssl-certs
      secret:
        secretName: db-ssl-certs
  extraVolumeMounts:
    - name: db-ssl-certs
      mountPath: /etc/ssl/custom
      readOnly: true

Upgrade the Helm release to apply the changes:

helm upgrade --install truefoundry oci://tfy.jfrog.io/tfy-helm/truefoundry -n truefoundry --create-namespace -f truefoundry-values.yaml

How to configure custom CA certificates?

If your TrueFoundry deployment needs to trust custom Certificate Authorities (e.g., for internal services, private registries, or corporate proxies), you can configure custom CA certificates in the Helm chart.

There are two methods to provide custom CA certificates:

Method 1: Pass customCA as a multiline string

You can directly provide the CA certificate content as a multiline string in your truefoundry-values.yaml:

truefoundry-values.yaml

global:
  customCA:
    enabled: true
    certificate: |
      -----BEGIN CERTIFICATE-----
      MIIDXTCCAkWgAwIBAgIJAKZ7VqHEqvmKMA0GCSqGSIb3DQEBCwUAMEUxCzAJBgNV
      BAYTAkFVMRMwEQYDVQQIDApTb21lLVN0YXRlMSEwHwYDVQQKDBhJbnRlcm5ldCBX
      ... (rest of your certificate) ...
      -----END CERTIFICATE-----

This method is suitable when you have one or a few CA certificates to add.

Method 2: Use an existing ConfigMap containing CA certificate(s)

If you already have your custom CA certificates in a Kubernetes ConfigMap, you can reference it directly. An initContainer will merge the custom CA with the system CAs.

Create a ConfigMap with your custom CA certificate(s)

Create a Kubernetes ConfigMap containing your custom CA certificate(s):

kubectl create configmap custom-ca-certificates \
  --from-file=ca-certificates.crt=custom-ca.crt \
  -n truefoundry

Alternatively, if you want to create it from a YAML file:

custom-ca-configmap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-ca-certificates
  namespace: truefoundry
data:
  ca-certificates.crt: |
    -----BEGIN CERTIFICATE-----
    ... (your custom CA certificate content) ...
    -----END CERTIFICATE-----

Apply the ConfigMap:

kubectl apply -f custom-ca-configmap.yaml

Reference the ConfigMap in your Helm values

Update your truefoundry-values.yaml to reference the ConfigMap:

truefoundry-values.yaml

global:
  customCA:
    enabled: true
    existingConfigMap:
      name: custom-ca-certificates

Upgrade the Helm installation

Apply the changes by upgrading your Helm release:

helm upgrade --install truefoundry oci://tfy.jfrog.io/tfy-helm/truefoundry \
  -n truefoundry --create-namespace -f truefoundry-values.yaml

Method 2b: Use an existing ConfigMap with `overrideCAList`

If you want the ConfigMap to replace the system CA bundle entirely instead of merging, set overrideCAList to true. In this mode, the ConfigMap is mounted directly at /etc/ssl/certs/ (no initContainer is used), so the ConfigMap must contain the full CA bundle (system + custom CAs).

Prepare your CA certificate file

Add your custom CA certificate(s) to your system’s CA bundle. On a Debian- or Ubuntu-based Linux system with the certificate file saved as custom-ca.crt:

# Copy the certificate to the CA directory
sudo cp custom-ca.crt /usr/local/share/ca-certificates/

# Update the CA certificates bundle
sudo update-ca-certificates

This will generate or update /etc/ssl/certs/ca-certificates.crt with your custom CA included (system CAs + your custom CA).

Create a ConfigMap from the complete ca-certificates.crt file

Create a Kubernetes ConfigMap containing the complete CA bundle:

kubectl create configmap custom-ca-certificates \
  --from-file=ca-certificates.crt=/etc/ssl/certs/ca-certificates.crt \
  -n truefoundry

Alternatively, if you want to create it from a YAML file:

custom-ca-configmap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-ca-certificates
  namespace: truefoundry
data:
  ca-certificates.crt: |
    -----BEGIN CERTIFICATE-----
    ... (your complete ca-certificates.crt content including system + custom CAs) ...
    -----END CERTIFICATE-----

Apply the ConfigMap:

kubectl apply -f custom-ca-configmap.yaml

Reference the ConfigMap in your Helm values with overrideCAList

Update your truefoundry-values.yaml to reference the ConfigMap with overrideCAList enabled:

truefoundry-values.yaml

global:
  customCA:
    enabled: true
    existingConfigMap:
      name: custom-ca-certificates
      overrideCAList: true

Upgrade the Helm installation

Apply the changes by upgrading your Helm release:

helm upgrade --install truefoundry oci://tfy.jfrog.io/tfy-helm/truefoundry \
  -n truefoundry --create-namespace -f truefoundry-values.yaml

When overrideCAList is set to true, the ConfigMap is mounted directly replacing the system CA bundle. Your ConfigMap must contain the complete CA bundle (system CAs + your custom CAs). If you only include your custom CAs, all standard public CA trust will be lost and outbound HTTPS connections to public services will fail.
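A quick way to catch a bundle that contains only your custom CAs is to count its certificates before creating the ConfigMap. The sketch below does this with grep. The function names are hypothetical, and the default threshold of 50 is an assumption chosen for illustration, not a TrueFoundry requirement.

```shell
# Count the PEM certificates in a bundle file.
count_pem_certs() {
  # grep -c prints the number of matching lines; "|| true" keeps a count of 0
  # from being treated as a command failure
  grep -c -e '-----BEGIN CERTIFICATE-----' "$1" || true
}

# Heuristic check for overrideCAList: a bundle with only a handful of
# certificates probably lacks the system CAs.
# Usage: check_full_bundle <bundle-file> [minimum-count]
check_full_bundle() {
  count=$(count_pem_certs "$1")
  if [ "${count:-0}" -lt "${2:-50}" ]; then
    echo "warning: $1 has only ${count:-0} certificate(s)" >&2
    return 1
  fi
  echo "ok: $1 has $count certificates"
}

# Typical use (not executed here):
#   check_full_bundle /etc/ssl/certs/ca-certificates.crt
```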

The custom CA certificates will be mounted into all TrueFoundry pods and added to the system’s trust store. This ensures that all outgoing HTTPS connections from TrueFoundry services will trust your custom CAs.

After adding custom CA certificates, verify that your TrueFoundry pods have restarted and are running correctly. You may need to restart existing pods for the changes to take effect.

How to enable in-pod TLS termination on the proxy (control plane and gateway)?

By default, TLS is terminated at your ingress controller or load balancer, and traffic reaches the TrueFoundry proxy (Caddy) over plain HTTP inside the cluster.

In-pod TLS termination moves that step into the proxy container: Caddy terminates HTTPS using a certificate you provide, then forwards to the application over loopback HTTP. This is useful when you want the same certificate inside the pod.

Plane	Helm chart	Values path	Caddy listener
Control plane	`truefoundry`	`global.proxy.tls`	`:8080` on `tfy-proxy`
Gateway	`tfy-llm-gateway`	`proxy.tls`	`:8081` on the gateway proxy sidecar (app stays on `:8787`)

Do not terminate TLS at the ingress and inside the pod for the same hostname. Pick one layer:

In-pod termination (this guide): ingress must pass through encrypted traffic (for example, NGINX ssl-passthrough). Do not attach a TLS certificate on the Ingress resource for that host.
Ingress termination (default): leave global.proxy.tls.enabled / proxy.tls.enabled as false and configure TLS on the Ingress or Gateway API parent instead.

Traffic flow (in-pod termination)

Client ──HTTPS──► Ingress (TLS passthrough) ──HTTPS──► Caddy in pod ──HTTP──► App (servicefoundry / llm-gateway)

Prerequisites

A Kubernetes TLS Secret in the release namespace with PEM certificate and private key (standard keys tls.crt and tls.key).
An ingress controller that can forward TLS without terminating it when using Ingress (see below).
For self-signed or private CAs: also configure custom CA certificates so Node.js services trust outbound HTTPS, or use in-cluster HTTP URLs for internal API calls (recommended).

Control plane (`truefoundry` chart)

Create the TLS Secret

Create a kubernetes.io/tls secret in the truefoundry namespace. Replace the paths with your certificate and key files:

kubectl create secret tls tfy-proxy-cp-tls \
  --cert=/path/to/tls.crt \
  --key=/path/to/tls.key \
  -n truefoundry

For a wildcard host such as *.primary.example.com, issue a cert that covers your control-plane hostname (for example cp.primary.example.com).
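Under the usual TLS wildcard rule, a leading *. covers exactly one DNS label, so *.primary.example.com covers cp.primary.example.com but not a.b.primary.example.com or primary.example.com itself. The sketch below encodes that rule. The function name is hypothetical, and the comparison is case-sensitive for simplicity.

```shell
# Return 0 when <host> is covered by <pattern>, where a leading "*."
# matches exactly one DNS label.
# Usage: cert_covers_host <pattern> <host>
cert_covers_host() {
  pattern=$1 host=$2
  case $pattern in
    '*.'*)
      suffix=${pattern#'*'}       # e.g. ".primary.example.com"
      label=${host%"$suffix"}     # part of the host left of the suffix
      [ "$label" != "$host" ] || return 1   # host does not end with the suffix
      case $label in
        ''|*.*) return 1 ;;       # empty, or more than one label
      esac
      return 0
      ;;
    *) [ "$pattern" = "$host" ] ;;  # no wildcard: exact match only
  esac
}

if cert_covers_host '*.primary.example.com' cp.primary.example.com; then
  echo "certificate covers the control-plane hostname"
fi
```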

Enable proxy TLS in Helm values

Add the following to truefoundry-values.yaml:

truefoundry-values.yaml

global:
  proxy:
    tls:
      enabled: true
      secretName: tfy-proxy-cp-tls
      # Optional: if your Secret uses non-standard keys
      # secretKeys:
      #   cert: tls.crt
      #   key: tls.key

Upgrade the release:

helm upgrade --install truefoundry oci://tfy.jfrog.io/tfy-helm/truefoundry \
  -n truefoundry --create-namespace -f truefoundry-values.yaml

Configure ingress for TLS passthrough

When global.proxy.tls.enabled is true, Caddy expects HTTPS on the service port. Your ingress must forward the TLS connection without terminating it.

ingress-nginx: enable passthrough on the controller (once per cluster) and annotate the control-plane Ingress:

truefoundry-values.yaml

global:
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - cp.example.com
    # Do not set global.ingress.tls when using in-pod termination — TLS is handled inside tfy-proxy.
    annotations:
      nginx.ingress.kubernetes.io/ssl-passthrough: "true"
  proxy:
    tls:
      enabled: true
      secretName: tfy-proxy-cp-tls

The ingress-nginx controller must be installed with controller.extraArgs.enable-ssl-passthrough: "true".

Istio / Gateway API: configure TLS mode PASSTHROUGH on the Gateway listener that fronts the control plane. TLS is not configured on the HTTPRoute itself.

Verify the control plane

kubectl -n truefoundry rollout status deploy/truefoundry-tfy-proxy
curl -vk https://cp.example.com/health

You should get a successful response over HTTPS. Check that the certificate presented to the client is the one from your Secret (not only the ingress default certificate).

Update gateway `CONTROL_PLANE_URL` when `tags.llmGateway` is enabled and control-plane proxy TLS is on

When global.proxy.tls.enabled is true, truefoundry-tfy-proxy listens with TLS on port 8080. In-cluster HTTP calls such as http://<release>-tfy-proxy:8080 will fail (for example ECONNRESET or certificate errors).

If you deploy the gateway with the truefoundry chart (tags.llmGateway: true), override tfy-llm-gateway.env.CONTROL_PLANE_URL to your HTTPS control-plane URL (global.controlPlaneURL), not the internal http://...-tfy-proxy:8080 address:

truefoundry-values.yaml

global:
  controlPlaneURL: https://cp.example.com
  proxy:
    tls:
      enabled: true
      secretName: tfy-proxy-cp-tls
  customCA:
    enabled: true
    existingConfigMap:
      name: custom-ca-certificates

tfy-llm-gateway:
  env:
    CONTROL_PLANE_URL: "{{ .Values.global.controlPlaneURL }}"
    PUBLIC_CONTROL_PLANE_URL: "{{ .Values.global.controlPlaneURL }}"

The standalone tfy-llm-gateway chart already sets CONTROL_PLANE_URL from global.controlPlaneURL by default. The override above is required when tags.llmGateway is true on the truefoundry chart, because the parent chart default uses http://{{ .Release.Name }}-tfy-proxy:8080.

Gateway plane (`tfy-llm-gateway` chart)

Use this when deploying the gateway as its own Helm release (gateway plane only) or when overriding the tfy-llm-gateway subchart under the parent truefoundry chart.

Create the TLS Secret

kubectl create secret tls tfy-proxy-gateway-tls \
  --cert=/path/to/tls.crt \
  --key=/path/to/tls.key \
  -n truefoundry

Enable proxy TLS and ingress passthrough

Standalone gateway release (truefoundry-values.yaml for tfy-llm-gateway):

truefoundry-values.yaml

global:
  # Must be https:// when the control-plane tfy-proxy has global.proxy.tls.enabled
  controlPlaneURL: https://cp.example.com

proxy:
  tls:
    enabled: true
    secretName: tfy-proxy-gateway-tls

# env.CONTROL_PLANE_URL defaults to global.controlPlaneURL in this chart.
# Override explicitly if a parent release set it to http://...-tfy-proxy:8080:
# env:
#   CONTROL_PLANE_URL: "{{ .Values.global.controlPlaneURL }}"

ingress:
  enabled: true
  ingressClassName: nginx
  hosts:
    - gateway.example.com
  annotations:
    nginx.ingress.kubernetes.io/ssl-passthrough: "true"
  # Do not set ingress.tls — TLS terminates inside the pod.

Gateway bundled with truefoundry (tags.llmGateway: true): nest the settings under tfy-llm-gateway:

truefoundry-values.yaml

tfy-llm-gateway:
  proxy:
    tls:
      enabled: true
      secretName: tfy-proxy-gateway-tls
  ingress:
    enabled: true
    annotations:
      nginx.ingress.kubernetes.io/ssl-passthrough: "true"

Configure environment variables for startup

The gateway loads configuration at startup over HTTP(S). Set env based on whether the control-plane proxy has in-pod TLS enabled.

When global.proxy.tls.enabled is true on the control plane (same cluster), set CONTROL_PLANE_URL to the public control-plane URL. Do not use http://<release>-tfy-proxy:8080, because that port expects HTTPS:

truefoundry-values.yaml

global:
  controlPlaneURL: https://cp.example.com

tfy-llm-gateway:
  env:
    CONTROL_PLANE_URL: "{{ .Values.global.controlPlaneURL }}"
    PUBLIC_CONTROL_PLANE_URL: "{{ .Values.global.controlPlaneURL }}"

Add custom CA certificates if controlPlaneURL uses a private or mkcert-signed certificate.

When control-plane proxy TLS is disabled (default), you can use the internal proxy URL for CONTROL_PLANE_URL if the gateway and control plane share a release:

truefoundry-values.yaml

tfy-llm-gateway:
  env:
    CONTROL_PLANE_URL: http://truefoundry-tfy-proxy:8080
    PUBLIC_CONTROL_PLANE_URL: https://cp.example.com
    SERVICEFOUNDRY_SERVER_URL: http://truefoundry-servicefoundry-server:3000
    CONTROL_PLANE_NATS_URL: http://truefoundry-tfy-nats:4222

Replace truefoundry with your Helm release name if different. SERVICEFOUNDRY_SERVER_URL is used to fetch NATS credentials (/v1/x/llm-gateway/nats-creds); pointing it at servicefoundry-server avoids TLS issues on the proxy port.
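The two cases above reduce to one decision: with in-pod TLS on the control-plane proxy, use the public HTTPS URL; otherwise the in-cluster HTTP address of tfy-proxy is fine. The helper below is a sketch of that decision for use in deployment scripts; its name and argument order are hypothetical.

```shell
# Pick the CONTROL_PLANE_URL value for the gateway.
# Usage: gateway_control_plane_url <proxy-tls-enabled: true|false> <release-name> <public-url>
gateway_control_plane_url() {
  if [ "$1" = true ]; then
    # Port 8080 on tfy-proxy expects HTTPS, so go through the public URL
    printf '%s\n' "$3"
  else
    # Plain HTTP inside the cluster (gateway and control plane share a release)
    printf 'http://%s-tfy-proxy:8080\n' "$2"
  fi
}

gateway_control_plane_url true  truefoundry https://cp.example.com
gateway_control_plane_url false truefoundry https://cp.example.com
```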

Verify the gateway

kubectl -n truefoundry rollout status deploy/tfy-llm-gateway
kubectl -n truefoundry get pods -l app.kubernetes.io/name=tfy-llm-gateway
# Expect 2/2 Ready when proxy.tls is enabled (gateway + proxy containers)
curl -vk https://gateway.example.com/health

If pods crash with unable to verify the first certificate when fetching NATS credentials, see the custom CA section or the internal HTTP env overrides above.

East-west vs north-south TLS: proxy.tls on the gateway sidecar secures traffic into the gateway pod from clients. On the control plane, global.proxy.tls makes port 8080 HTTPS on tfy-proxy. Gateway pods must use CONTROL_PLANE_URL: "{{ .Values.global.controlPlaneURL }}" (plus global.customCA for private CAs), or call servicefoundry-server / tfy-nats directly over HTTP — not http://...-tfy-proxy:8080.

How to enable and access control plane monitoring (Grafana)?

TrueFoundry ships with a built-in monitoring stack that includes Grafana dashboards for the control plane. To enable it, add the following to your truefoundry-values.yaml:

truefoundry-values.yaml

truefoundryMonitoring:
  enabled: true
  grafana:
    grafana.ini:
      auth.jwt:
        jwk_set_url: >-
          https://<your-truefoundry-control-plane-url>/api/svc/v1/keys/<tenant-name>/jwks

Then upgrade the Helm release to apply the changes:

helm upgrade --install truefoundry oci://tfy.jfrog.io/tfy-helm/truefoundry \
  -n truefoundry --create-namespace \
  -f truefoundry-values.yaml

Once enabled, platform admins can access the Grafana dashboard at:

https://<your-truefoundry-control-plane-url>/admin/grafana/

Replace <your-truefoundry-control-plane-url> with your actual control plane domain (e.g., app.example.com) and <tenant-name> with your TrueFoundry tenant name provided during onboarding.
Only users with the admin role can access this endpoint.
Make sure to include the trailing / at the end of the URL.
If you already have Prometheus or VictoriaLogs in your cluster, you can point the monitoring stack to them using externalServices instead of installing new instances.

For the full configuration reference, see the Control Plane Monitoring guide.

How do you add default metadata to all requests passing via the gateway?

You can attach default metadata to every request that passes through the AI Gateway by setting the DEFAULT_GATEWAY_METADATA environment variable on the gateway. The value should be a JSON string of key-value pairs.

Add the following to the gateway configuration in the values file of the gateway plane:

tfy-llm-gateway:
  env:
    DEFAULT_GATEWAY_METADATA: '{"org":"internal"}'

The metadata key-value pairs will be automatically included in every request routed through the gateway. You can use this to tag requests with organizational identifiers, environment labels, or any other metadata your downstream systems need.
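Because the value is a JSON string embedded in YAML, quoting mistakes are easy to make. The following Python sketch shows one way to produce and sanity-check the string before pasting it into the values file. The function names are hypothetical, and the restriction to string values is an assumption made for the example; the gateway's accepted value types are not stated here.

```python
import json


def encode_default_metadata(metadata):
    """Serialize a dict to a compact JSON object string (hypothetical helper)."""
    for key, value in metadata.items():
        # Assumption for this sketch: keys and values are plain strings.
        if not isinstance(key, str) or not isinstance(value, str):
            raise TypeError(f"metadata entry {key!r} must map a string to a string")
    return json.dumps(metadata, separators=(",", ":"))


def check_default_metadata(raw):
    """Parse an env value and confirm it is a JSON object, not a list or scalar."""
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("DEFAULT_GATEWAY_METADATA must be a JSON object")
    return parsed


print(encode_default_metadata({"org": "internal"}))  # {"org":"internal"}
```

Wrap the printed string in single quotes in the values file, as in the example above, so YAML does not interpret the braces as a mapping.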

How to expose additional metadata as Prometheus labels for gateway metrics?

By default, the AI Gateway exposes a fixed set of Prometheus labels on its metrics. If you want to slice and aggregate gateway metrics by your own metadata fields (e.g. customer_id, request_type, environment), set the LLM_GATEWAY_METADATA_LOGGING_KEYS environment variable on the gateway. The value is a JSON-encoded array of metadata keys.

Each key listed here is exposed as a Prometheus label with the prefix ai_gateway_metadata_; for example, customer_id becomes the label ai_gateway_metadata_customer_id. You can then use these labels for granular filtering and aggregation in Grafana.

Add the following to the gateway configuration in the values file of the gateway plane:

tfy-llm-gateway:
  env:
    LLM_GATEWAY_METADATA_LOGGING_KEYS: '["customer_id", "request_type"]'

Once the gateway is restarted, requests that include these metadata keys (either via default metadata or per-request metadata) will emit Prometheus metrics with the corresponding ai_gateway_metadata_customer_id and ai_gateway_metadata_request_type labels.

Only add metadata keys with bounded, low-cardinality values (e.g. customer tier, request type, environment). Adding high-cardinality keys like user IDs or trace IDs as labels can cause your Prometheus / Victoria Metrics instance to consume excessive memory and storage.
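The mapping from metadata keys to label names described above is a plain prefix rule. As a quick illustration, this Python sketch derives the expected label names from the environment variable value; the helper names are hypothetical and the code only mirrors the naming rule, it does not query the gateway.

```python
import json

LABEL_PREFIX = "ai_gateway_metadata_"


def logging_keys_env(keys):
    """Encode metadata keys as the JSON array string for the env variable."""
    return json.dumps(keys)


def prometheus_labels(env_value):
    """Return the label names expected for an LLM_GATEWAY_METADATA_LOGGING_KEYS value."""
    keys = json.loads(env_value)
    if not isinstance(keys, list) or not all(isinstance(k, str) for k in keys):
        raise ValueError("value must be a JSON array of strings")
    return [LABEL_PREFIX + key for key in keys]


value = logging_keys_env(["customer_id", "request_type"])
print(value)                     # ["customer_id", "request_type"]
print(prometheus_labels(value))
```

The resulting names are the ones to use in Grafana queries and label filters.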

How to use HTTPRoute to route traffic using Kubernetes Gateway API?

The TrueFoundry Helm charts support the Kubernetes Gateway API as an alternative to standard Ingress resources. Use HTTPRoute when your cluster uses a Gateway API-compatible controller (e.g. Envoy Gateway, Istio, NGINX Gateway Fabric, GKE Gateway).

Control plane (truefoundry chart)

Add the following to your truefoundry-values.yaml, setting parentRefs to point to your existing Gateway:

truefoundry-values.yaml

global:
  httpRoute:
    enabled: true
    parentRefs:
      - name: my-gateway        # Name of your Gateway resource
        namespace: gateway-system  # Namespace where the Gateway is deployed
        sectionName: https      # Listener section on the Gateway (e.g. http or https)
    hostnames:
      - "app.example.com"       # Hostname that this HTTPRoute should match

Then apply:

helm upgrade --install truefoundry oci://tfy.jfrog.io/tfy-helm/truefoundry \
  -n truefoundry --create-namespace \
  -f truefoundry-values.yaml

Only one routing method should be enabled at a time. Disable global.ingress.enabled and global.virtualservice.enabled when using httpRoute.
The sectionName must match a named listener on your Gateway resource. Omit it if your Gateway has a single unnamed listener.
TLS termination is handled by the parent Gateway — no TLS configuration is needed on the HTTPRoute itself.

How to restrict AWS S3 permissions to a minimal set?

By default, the installation instructions use s3:* for the S3 bucket IAM policy for simplicity. If your organization requires a least-privilege approach, you can replace s3:* with the following minimal set of permissions:

{
  "Statement": [
    {
      "Sid": "S3",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucketMultipartUploads",
        "s3:GetBucketTagging",
        "s3:GetObjectVersionTagging",
        "s3:ReplicateTags",
        "s3:PutObjectVersionTagging",
        "s3:ListMultipartUploadParts",
        "s3:PutObject",
        "s3:GetObject",
        "s3:GetObjectAcl",
        "s3:AbortMultipartUpload",
        "s3:PutBucketTagging",
        "s3:GetObjectVersionAcl",
        "s3:GetObjectTagging",
        "s3:PutObjectTagging",
        "s3:GetObjectVersion",
        "s3:ListBucket",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::<YOUR_S3_BUCKET_NAME>",
        "arn:aws:s3:::<YOUR_S3_BUCKET_NAME>/*"
      ]
    }
  ],
  "Version": "2012-10-17"
}
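If you manage several buckets, the policy above can be generated rather than edited by hand. The Python sketch below rebuilds the same statement for a given bucket name and includes a small check that flags wildcard actions such as s3:*. The function names are hypothetical; the action list is copied from the policy above.

```python
import json

MINIMAL_S3_ACTIONS = [
    "s3:ListBucketMultipartUploads", "s3:GetBucketTagging",
    "s3:GetObjectVersionTagging", "s3:ReplicateTags",
    "s3:PutObjectVersionTagging", "s3:ListMultipartUploadParts",
    "s3:PutObject", "s3:GetObject", "s3:GetObjectAcl",
    "s3:AbortMultipartUpload", "s3:PutBucketTagging",
    "s3:GetObjectVersionAcl", "s3:GetObjectTagging",
    "s3:PutObjectTagging", "s3:GetObjectVersion",
    "s3:ListBucket", "s3:DeleteObject",
]


def minimal_s3_policy(bucket):
    """Build the least-privilege policy document for one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "S3",
            "Effect": "Allow",
            "Action": list(MINIMAL_S3_ACTIONS),
            # Bucket-level and object-level ARNs are both required.
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
        }],
    }


def wildcard_actions(policy):
    """Return every action in the policy that contains a wildcard."""
    found = []
    for statement in policy["Statement"]:
        actions = statement["Action"]
        if isinstance(actions, str):  # IAM allows a single string here
            actions = [actions]
        found.extend(a for a in actions if "*" in a)
    return found


print(json.dumps(minimal_s3_policy("my-bucket"), indent=2))
```

Here my-bucket is a placeholder; substitute the bucket used by your installation.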

How to configure security context for TrueFoundry components?

By default, the TrueFoundry Helm chart ships with container and pod security contexts configured for all components to follow security best practices: pods run as a non-root user (runAsNonRoot: true), use a read-only root filesystem (readOnlyRootFilesystem: true), and drop all Linux capabilities (capabilities.drop: [ALL]).

However, NATS (used internally for messaging) does not have these defaults applied automatically. If your cluster enforces Pod Security Standards (e.g. the restricted profile) or you want a consistent security posture across all components, explicitly set the security context for NATS by adding the following to your truefoundry-values.yaml:

truefoundry-values.yaml

tfyNats:
  container:
    merge:
      securityContext:
        capabilities:
          drop:
            - ALL
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
  podTemplate:
    merge:
      spec:
        securityContext:
          fsGroup: 1000
          runAsUser: 1000
          runAsNonRoot: true

The NATS subchart uses a different values structure (container.merge and podTemplate.merge) compared to other TrueFoundry components. This is because NATS uses its own Helm chart conventions for overriding pod and container specs.

How to enable Network Policies for Control Plane?

Network policies are optional and shipped inside the truefoundry Helm chart. They apply only to the release namespace.

Network policies are opt-in (networkPolicy.enabled: false by default). Before enabling in production, add all required sources to allowedIngressFrom (monitoring and ingress at minimum). If allowedIngressFrom is empty, cross-namespace ingress is blocked — Prometheus scrapes and ingress traffic will fail until you allow those namespaces.

Prerequisites

Your cluster CNI must enforce NetworkPolicy objects. Creating policies in the API is not enough.

Platform	Requirement
Amazon EKS	VPC CNI add-on with `enableNetworkPolicy: "true"` (v1.14.0+), or Calico/Cilium
GKE	Cluster network policy enabled (Calico)
AKS	Azure CNI with a policy-capable engine (Calico or Cilium)
OpenShift (OCP)	4.5+ with OpenShift SDN, or 4.8+ with OVN-Kubernetes (default CNI from 4.12+); both enforce `NetworkPolicy` without an extra policy CNI
Generic	Any CNI that enforces `NetworkPolicy` (e.g. Calico, Cilium, Weave Net); vanilla clusters without a policy-capable CNI need one installed

On OpenShift, confirm the cluster network plugin with oc get network.config cluster -o jsonpath='{.spec.networkType}{"\n"}'. Clusters on OCP 4.5–4.11 typically report OpenShiftSDN; new installs on OCP 4.12+ default to OVNKubernetes (available as an option from 4.8). See About network policy (OpenShift).

On EKS with the Amazon VPC CNI, verify the add-on has network policy enforcement enabled before relying on these rules. See Amazon EKS network policies.

Policies created

When networkPolicy.enabled: true, the chart creates up to four NetworkPolicy objects:

Policy	Purpose
`default-deny-ingress`	Block all ingress into control-plane pods
`allow-all-egress`	Allow all egress from control-plane pods
`intra-instance`	Allow ingress between pods with `app.kubernetes.io/instance: <release-name>`
`ingress-external`	Allow ingress from namespaces listed in `allowedIngressFrom` (only when that list is non-empty)

Add network policy settings to your Helm values

Add the following to truefoundry-values.yaml. Replace namespace names with the ones used in your cluster.

Minimal enable (in-release traffic only; no Prometheus or ingress from other namespaces yet):

truefoundry-values.yaml

networkPolicy:
  enabled: true
  allowedIngressFrom: []

Typical production configuration (monitoring + ingress controller):

truefoundry-values.yaml

networkPolicy:
  enabled: true
  allowedIngressFrom:
    # All pods in the monitoring namespace (Prometheus scrapes)
    - namespace: tfy-prometheus
    # Only ingress controller pods (recommended)
    - namespace: ingress-nginx
      podSelector:
        app.kubernetes.io/name: ingress-nginx
    # Optional: Istio ingress gateway
    # - namespace: istio-system
    #   podSelector:
    #     app: istio-ingressgateway

Each allowedIngressFrom entry requires namespace. Omit podSelector to allow all pods in that namespace; set podSelector to allow only matching pods (recommended for ingress controllers).

Cross-namespace sources are combined in a single ingress-external policy. In-namespace microservice traffic uses the label app.kubernetes.io/instance: <helm-release-name> (for example app.kubernetes.io/instance: truefoundry).

Upgrade the Helm release

helm upgrade --install truefoundry oci://tfy.jfrog.io/tfy-helm/truefoundry \
  -n truefoundry --create-namespace \
  -f truefoundry-values.yaml

Verify NetworkPolicies are applied

kubectl get networkpolicy -n truefoundry
kubectl describe networkpolicy -n truefoundry

You should see three policies when allowedIngressFrom is empty, or four when cross-namespace sources are configured.
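The three-versus-four rule follows directly from the allowedIngressFrom list. The Python sketch below restates it, along with the requirement that every entry has a namespace. It is a hypothetical helper that uses the policy names from the table above; the object names rendered in your cluster may differ (for example by carrying a release prefix), so treat the names as an assumption to verify with kubectl get networkpolicy.

```python
BASE_POLICIES = ["default-deny-ingress", "allow-all-egress", "intra-instance"]


def expected_policies(allowed_ingress_from):
    """List the policy names described above for a given allowedIngressFrom value."""
    for entry in allowed_ingress_from:
        if not entry.get("namespace"):
            raise ValueError(f"allowedIngressFrom entry needs a namespace: {entry!r}")
    names = list(BASE_POLICIES)
    if allowed_ingress_from:
        # ingress-external is only created when the list is non-empty.
        names.append("ingress-external")
    return names


production = [
    {"namespace": "tfy-prometheus"},
    {"namespace": "ingress-nginx",
     "podSelector": {"app.kubernetes.io/name": "ingress-nginx"}},
]
print(len(expected_policies([])))          # 3
print(len(expected_policies(production)))  # 4
```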

Validate connectivity

After enabling, confirm:

Control plane UI loads via ingress
Prometheus scrape targets for control-plane services are UP
Pods in the same Helm release can reach each other
External dependencies still work (RDS, S3, cloud APIs) — egress is allow-all by default

In-release connectivity example (adjust service name if needed):

kubectl exec -n truefoundry deploy/truefoundry-mlfoundry-server -- \
  curl -sf http://truefoundry-servicefoundry-server:3000/health

Negative test (should fail — traffic from an unlisted namespace):

kubectl run np-test -n default --rm -it --image=curlimages/curl -- \
  curl -m 5 http://truefoundry-servicefoundry-server.truefoundry.svc.cluster.local:3000/

Troubleshooting

Symptom	Likely cause	Fix
Prometheus targets down	Monitoring namespace not in `allowedIngressFrom`	Add your Prometheus namespace (e.g. `tfy-prometheus`)
Ingress 502 / timeout	Ingress namespace not allowed or wrong `podSelector`	Add ingress namespace with labels matching your controller
Policies exist but no effect	CNI does not enforce NetworkPolicy	Enable enforcement on your cluster (EKS: `enableNetworkPolicy` on vpc-cni)
Pods cannot reach each other	Wrong or missing `app.kubernetes.io/instance` label	Verify pod labels match Helm release name

Network policies complement — but do not replace — TLS, authentication, ingress WAF, and cloud security groups. They are scoped to the control-plane namespace only; other namespaces are not modified.

How to apply tolerations, affinity, and nodeSelector to TrueFoundry components?

The truefoundry Helm chart exposes global.tolerations, global.affinity, and global.nodeSelector. These are applied automatically to all first-party TrueFoundry components (servicefoundry server, mlfoundry server, tfy-proxy, tfy-controller, tfy-k8s-controller, and so on) as well as the AI Gateway (tfy-llm-gateway) and its bundled Redis. Set them once under global and every TrueFoundry pod will schedule onto your tainted / labelled control-plane nodes.

Two bundled dependencies are the exception. They are third-party subcharts that do not read the global.* values, so their scheduling must be set on their own keys:

Component	Inherits `global` scheduling?	Where to configure
First-party components (servicefoundry, mlfoundry, tfy-proxy, controllers, …)	✅ Yes	`global.tolerations` / `global.affinity` / `global.nodeSelector`
AI Gateway (`tfy-llm-gateway`) + bundled Redis	✅ Yes	`global.tolerations` / `global.affinity` / `global.nodeSelector`
NATS (`tfyNats`)	❌ No	`tfyNats.podTemplate.patch` (JSON Patch)
PostgreSQL (`postgresql`, only when `devMode` is enabled)	❌ No	`postgresql.primary.tolerations` / `.affinity` / `.nodeSelector`

Set scheduling for all TrueFoundry components

Add your tolerations, affinity, and nodeSelector under global in truefoundry-values.yaml:

truefoundry-values.yaml

# 1) First-party components + the AI Gateway (and its bundled Redis)
#    inherit these global settings automatically.
global:
  tolerations:
    - key: class.truefoundry.io/control-plane
      effect: NoSchedule
      operator: Exists
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: <YOUR_NODE_LABEL_KEY>      # e.g. the label on your control-plane nodepool
                operator: In
                values:
                  - <YOUR_NODE_LABEL_VALUE>
  nodeSelector: {}      # leave empty unless you specifically need one

# 2) NATS does NOT read global.* — set scheduling on the pod template via a JSON Patch.
#    The patch list REPLACES the chart default entirely, so keep the default entries
#    (resolver-volume, pid, imagePullSecrets) in addition to your scheduling values.
tfyNats:
  podTemplate:
    patch:
      - op: add
        path: /spec/volumes/-
        value:
          name: resolver-volume
          secret:
            secretName: tfy-nats-accounts
            defaultMode: 420
      - op: replace
        path: /spec/volumes/1
        value:
          name: pid
          emptyDir:
            sizeLimit: "256Mi"
      - op: add
        path: /spec/imagePullSecrets
        value: []
      # Tolerations — match the taints on your control-plane nodepool
      - op: add
        path: /spec/tolerations
        value:
          - key: class.truefoundry.io/control-plane
            effect: NoSchedule
            operator: Exists
      # Affinity — match the labels on your control-plane nodes
      - op: add
        path: /spec/affinity
        value:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: <YOUR_NODE_LABEL_KEY>
                      operator: In
                      values:
                        - <YOUR_NODE_LABEL_VALUE>
      - op: add
        path: /spec/nodeSelector
        value: {}

# 3) PostgreSQL exists only when devMode.enabled is true (dev / test installs).
#    It does NOT read global.* — use the standard Bitnami primary.* keys.
#    Omit this block on production installs that use an external database.
postgresql:
  primary:
    tolerations:
      - key: class.truefoundry.io/control-plane
        effect: NoSchedule
        operator: Exists
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: <YOUR_NODE_LABEL_KEY>
                  operator: In
                  values:
                    - <YOUR_NODE_LABEL_VALUE>
    nodeSelector: {}

Unlike NATS, PostgreSQL uses the standard Bitnami primary.tolerations / primary.affinity / primary.nodeSelector keys — no JSON Patch is required. Overriding postgresql.primary.tolerations replaces the chart default (which includes spot-instance tolerations), so include any entries you still need.
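The tfyNats.podTemplate.patch entries in the values above are JSON Patch operations, so it helps to see how add and replace behave on a pod spec. The Python sketch below implements just those two operations, assuming standard RFC 6902 semantics. The starting pod spec and the apply_patch helper are hypothetical and only illustrate the mechanics; they are not how the NATS chart applies the patch internally.

```python
import copy


def _walk(doc, pointer):
    """Return (parent, last_token) for a JSON Pointer such as /spec/volumes/-."""
    tokens = [t.replace("~1", "/").replace("~0", "~") for t in pointer.split("/")[1:]]
    parent = doc
    for token in tokens[:-1]:
        parent = parent[int(token)] if isinstance(parent, list) else parent[token]
    return parent, tokens[-1]


def apply_patch(doc, ops):
    """Apply 'add' and 'replace' operations to a copy of doc."""
    doc = copy.deepcopy(doc)
    for op in ops:
        kind = op["op"]
        if kind not in ("add", "replace"):
            raise ValueError(f"unsupported op: {kind}")
        parent, key = _walk(doc, op["path"])
        if isinstance(parent, list):
            if kind == "add":
                # "-" means append; a numeric index inserts before that element.
                index = len(parent) if key == "-" else int(key)
                if index > len(parent):
                    raise IndexError(op["path"])
                parent.insert(index, op["value"])
            else:
                parent[int(key)] = op["value"]  # IndexError if the element is missing
        else:
            if kind == "replace" and key not in parent:
                raise KeyError(key)
            parent[key] = op["value"]  # 'add' on an object sets or overwrites the member
    return doc


# Hypothetical starting pod spec with two volumes.
pod = {"spec": {"volumes": [{"name": "config"}, {"name": "pid"}]}}
patched = apply_patch(pod, [
    {"op": "add", "path": "/spec/volumes/-", "value": {"name": "resolver-volume"}},
    {"op": "replace", "path": "/spec/volumes/1",
     "value": {"name": "pid", "emptyDir": {"sizeLimit": "256Mi"}}},
    {"op": "add", "path": "/spec/tolerations",
     "value": [{"key": "class.truefoundry.io/control-plane",
                "operator": "Exists", "effect": "NoSchedule"}]},
])
print([v["name"] for v in patched["spec"]["volumes"]])
```

Note that add on /spec/tolerations sets the whole field, so any tolerations already present on the pod template would be overwritten rather than merged.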

Apply the changes

Upgrade the Helm release to apply your scheduling settings:

helm upgrade --install truefoundry oci://tfy.jfrog.io/tfy-helm/truefoundry \
  -n truefoundry --create-namespace -f truefoundry-values.yaml

Why are volumes (PVCs) stuck in Pending state?

The control plane creates PersistentVolumeClaims for NATS JetStream (and PostgreSQL when devMode is enabled) without specifying a storage class. These PVCs rely on the cluster having a default StorageClass. If no default is set, the PVCs stay in Pending state and the corresponding pods (e.g., truefoundry-tfy-nats-0) never start.

1. Check the PVC status and events:

kubectl get pvc -n truefoundry
kubectl describe pvc <pvc-name> -n truefoundry

If the events show messages like no storage class is set or waiting for a volume to be created, the cluster is missing a default storage class or a provisioner.

2. Check if a default storage class exists:

kubectl get storageclass

One of the storage classes should be marked with (default). If none is, mark one as default:

kubectl patch storageclass <storage-class-name> \
  -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

On AWS EKS, ensure the EBS CSI driver addon is installed — without it, PVCs will not be provisioned even if a storage class exists.
On GKE and AKS, a default storage class is usually present out of the box.
Only block storage with ReadWriteOnce access is required — see Do we need any NFS volumes?
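The (default) marker shown by kubectl get storageclass comes from the storageclass.kubernetes.io/is-default-class annotation used in the patch command above. If you want to script the check, the Python sketch below extracts the default classes from the JSON form of the same listing. The function name is hypothetical, and the input is assumed to be the output of kubectl get storageclass -o json.

```python
import json

DEFAULT_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def default_storage_classes(storageclass_list_json):
    """Return names of StorageClasses annotated as the cluster default."""
    items = json.loads(storageclass_list_json).get("items", [])
    names = []
    for item in items:
        metadata = item.get("metadata", {})
        annotations = metadata.get("annotations") or {}
        if annotations.get(DEFAULT_ANNOTATION) == "true":
            names.append(metadata["name"])
    return names
```

An empty result corresponds to the Pending PVC situation described above. More than one name means several classes are marked as default, which is also worth cleaning up.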

3. Recreate the stuck PVCs:

Setting a default storage class does not retroactively apply to already-created PVCs, because the default is injected only at PVC creation time. Delete the Pending PVCs and their pods so they get recreated with the default storage class:

kubectl delete pvc <pvc-name> -n truefoundry
kubectl delete pod <pod-name> -n truefoundry

Alternatively, you can specify the storage class explicitly instead of relying on the cluster default. For the NATS JetStream PVC:

truefoundry-values.yaml

tfyNats:
  config:
    jetstream:
      fileStore:
        pvc:
          storageClassName: <storage-class-name>

How to deploy on OpenShift with restricted Security Context Constraints (SCC)?

OpenShift clusters enforce Security Context Constraints (SCCs) that expect pods to have an empty security context so that OpenShift can inject arbitrary user and group IDs at runtime. By default, the TrueFoundry Helm chart sets explicit podSecurityContext and securityContext values (such as runAsUser, runAsNonRoot, fsGroup, etc.) on its components, which conflicts with the restricted or restricted-v2 SCC.

To resolve this, disable both pod-level and container-level security contexts for all components by adding the following overrides to your truefoundry-values.yaml:

truefoundry-values.yaml

# Disable security contexts for OpenShift SCC compatibility
truefoundryBootstrap:
  podSecurityContext:
    enabled: false

mlfoundryServer:
  podSecurityContext:
    enabled: false

servicefoundryServer:
  podSecurityContext:
    enabled: false

tfyK8sController:
  podSecurityContext:
    enabled: false

tfyProxy:
  podSecurityContext:
    enabled: false

deltaFusionIngestor:
  podSecurityContext:
    enabled: false

deltaFusionCompaction:
  podSecurityContext:
    enabled: false

deltaFusionQueryServer:
  podSecurityContext:
    enabled: false

tfy-llm-gateway:
  podSecurityContext:
    enabled: false

tfy-otel-collector:
  podSecurityContext:
    enabled: false

Setting enabled: false removes all explicit security context fields from the pod and container specs, allowing OpenShift’s SCC admission controller to assign user and group IDs as needed.
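Since the override is identical for every component, the block above can also be generated. This Python sketch builds the same structure from the list of component keys used in the values above; the helper name is hypothetical. It prints JSON, which is a subset of YAML, so the output should be usable as a values file, though that is worth confirming with helm template before applying it.

```python
import json

# Component keys copied from the override block above.
COMPONENT_KEYS = [
    "truefoundryBootstrap", "mlfoundryServer", "servicefoundryServer",
    "tfyK8sController", "tfyProxy", "deltaFusionIngestor",
    "deltaFusionCompaction", "deltaFusionQueryServer",
    "tfy-llm-gateway", "tfy-otel-collector",
]


def openshift_overrides(component_keys=COMPONENT_KEYS):
    """Build the podSecurityContext overrides for each component key."""
    return {key: {"podSecurityContext": {"enabled": False}} for key in component_keys}


print(json.dumps(openshift_overrides(), indent=2))
```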

How to install Vector for log collection on OpenShift (OCP)?

The tfy-logs chart ships Vector as a DaemonSet that tails container logs from each node’s host filesystem and ships them to VictoriaLogs. Vector writes its checkpoint/snapshot state (the record of how far it has read in each log file) to a hostPath data directory on the node. The chart default points to a host directory that is not writable on RHCOS, so Vector cannot persist its checkpoints and the pod fails to start or restarts without retaining read positions. You must point this at a writable location on the node.

On RHCOS the writable, persistent location is under /var/home/core (other paths such as /var/lib are managed and read-only for containers). Set persistence.hostPath.path to a writable directory there, for example /var/home/core/data/vector. If you mirror images into a private registry (common in air-gapped OpenShift clusters), also override the registry for victoria-logs-single and Vector (see Can I use my Artifactory as a mirror to pull images?).

Because Vector runs as a DaemonSet that mounts the node’s host filesystem (hostPath) to read container logs, its service account must be granted the privileged SCC. Without this, OpenShift’s SCC admission controller blocks the pods and the DaemonSet will not start. Grant the SCC to the tfy-logs-vector service account in the tfy-logs namespace:

oc adm policy add-scc-to-user privileged -z tfy-logs-vector -n tfy-logs

On SELinux-enforcing nodes (RHCOS), the default container SELinux context (container_t) cannot read the host log files under /var/log. Set the pod’s SELinux type to spc_t (super-privileged container) so Vector is allowed to read them — otherwise the pod runs but collects no logs (permission denied).

tfy-logs-values.yaml

victoria-logs-single:
  enabled: true
  # Optional: only needed if you pull images from a private registry / mirror
  global:
    image:
      registry: <YOUR_REGISTRY>
  server:
    image:
      registry: <YOUR_REGISTRY>
  vector:
    enabled: true
    # Required on SELinux-enforcing nodes (RHCOS) so Vector can read host log files
    podSecurityContext:
      seLinuxOptions:
        type: spc_t
    # Optional: only needed if you pull images from a private registry / mirror
    image:
      repository: <YOUR_REGISTRY>/timberio/vector
    persistence:
      hostPath:
        enabled: true
        # Must be a writable location on the node. On RHCOS use a path under /var/home/core.
        path: /var/home/core/data/vector

# Vector for Windows nodes is not applicable on OpenShift
windowsVector:
  enabled: false

The directory is created automatically on each node by the DaemonSet. If your cluster uses a different writable mount (for example a dedicated data partition), set path to a writable directory on that mount instead — the value only needs to be writable by the Vector pod on every node.

After applying the values, verify the Vector DaemonSet is running on every node:

kubectl -n tfy-logs rollout status daemonset/<release-name>-vector
kubectl -n tfy-logs logs -l app.kubernetes.io/name=vector --tail=50

How to deploy the control plane on AWS EKS with CloudWatch Observability addon installed?

When the EKS CloudWatch Observability addon is enabled, its ADOT auto-instrumentation injects bundled Python libraries via PYTHONPATH that conflict with truefoundry-mlfoundry-server dependencies, causing the pod to enter CrashLoopBackOff. You may see errors like:

ImportError: cannot import name 'DEFAULT_CIPHERS' from 'urllib3.util.ssl_'
  (/otel-auto-instrumentation-python/urllib3/util/ssl_.py)

ImportError: cannot import name 'LogData' from 'opentelemetry.sdk._logs'
  (/otel-auto-instrumentation-python/opentelemetry/sdk/_logs/__init__.py)

To fix this, exclude the truefoundry namespace from the addon’s auto-instrumentation by updating the addon configuration:

{
  "manager": {
    "applicationSignals": {
      "autoMonitor": {
        "exclude": {
          "python": { "namespaces": ["truefoundry"] },
          "java": { "namespaces": ["truefoundry"] },
          "nodejs": { "namespaces": ["truefoundry"] },
          "dotnet": { "namespaces": ["truefoundry"] }
        }
      }
    },
    "autoAnnotateAutoInstrumentation": {
      "python": { "namespaces": [] },
      "java": { "namespaces": [] },
      "nodejs": { "namespaces": [] },
      "dotnet": { "namespaces": [] }
    }
  }
}

After updating the addon config, restart the deployment and verify the pods are running:

kubectl rollout restart deployment truefoundry-mlfoundry-server -n truefoundry
kubectl rollout status deployment truefoundry-mlfoundry-server -n truefoundry

How to enable the stdio MCP proxy?

The stdio MCP proxy enables TrueFoundry to run MCP servers that use the stdio transport. It is disabled by default.

To enable it, add the following to your truefoundry-values.yaml:

truefoundry-values.yaml

stdioMcpProxy:
  enabled: true

After upgrading the Helm release with the updated values, restart the truefoundry-tfy-proxy deployment as well:

kubectl rollout restart deployment truefoundry-tfy-proxy -n truefoundry
