Prism — Operations

Audience: Operations engineers, on-call responders, SREs, and DevOps engineers. Every section is actionable — specific commands, thresholds, and step-by-step procedures.


GCP Infrastructure Reference

Resource | Details
GCP Project | swisper (project number: 1045528868895)
Region | europe-west1
Cloud Run service | prism-gateway
Gateway URL | https://prism-gateway-xvsemyikqq-oa.a.run.app
MCP endpoint | https://prism-gateway-xvsemyikqq-oa.a.run.app/mcp
Cloud SQL instance | prism-db (connection: swisper:europe-west1:prism-db)
Cloud SQL tier | db-perf-optimized-N-2, PostgreSQL 16, ENTERPRISE_PLUS
Cloud SQL public IP | 34.14.11.205
pgvector version | 0.8.1
Container image | europe-west1-docker.pkg.dev/swisper/prism/gateway:latest
Service account | swisper-vertex-runtime@swisper.iam.gserviceaccount.com
Auth provider | GCP Identity Platform (Firebase Auth) + Developer Tokens
JWKS URL | https://www.googleapis.com/service_accounts/v1/jwk/securetoken@system.gserviceaccount.com

Deployment

Mechanism

Prism runs as a Cloud Run service (prism-gateway) in europe-west1. The container image is stored in Artifact Registry. Cloud Run connects to Cloud SQL via the Cloud SQL Auth Proxy (Unix socket). Secrets are injected from Secret Manager at startup.

Min instances: 0 (scales to zero when idle). Max instances: 3. Port: 8080.

On startup, the gateway:

  1. Initializes the asyncpg connection pool
  2. Marks any orphaned running jobs in prism.index_jobs as failed
  3. Starts a background watchdog that marks stuck jobs (>30 min) as failed every 60 seconds
  4. Starts the MCP Streamable HTTP session manager

Pre-deployment checklist

  • [ ] All Prism tests pass: cd apps/prism && uv run pytest tests/ -vv
  • [ ] Schema migrations applied if schema changed (see Runbook: Apply Schema Migration)
  • [ ] Environment variables and secrets updated in Secret Manager if config changed
  • [ ] Docker image builds successfully locally before pushing

Build and deploy

cd apps/prism

# Authenticate Docker (one-time per machine)
gcloud auth configure-docker europe-west1-docker.pkg.dev

# Build the gateway image
docker build --target gateway \
  -t europe-west1-docker.pkg.dev/swisper/prism/gateway:latest \
  -f Dockerfile.prism .

# Push to Artifact Registry
docker push europe-west1-docker.pkg.dev/swisper/prism/gateway:latest

# Deploy to Cloud Run (rolling update, zero downtime)
gcloud run services update prism-gateway \
  --region=europe-west1 \
  --project=swisper \
  --image=europe-west1-docker.pkg.dev/swisper/prism/gateway:latest

# Verify the deployment
curl https://prism-gateway-xvsemyikqq-oa.a.run.app/health
# Expected: {"status":"ok"}
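Because the service scales to zero, the first request after a deploy can hit a cold start, so a single curl may fail even when the deployment is fine. The sketch below polls /health a few times before giving up. The function name wait_for_health and its default attempt count and delay are illustrative assumptions, not part of the Prism tooling.

```shell
# Poll a health endpoint until it returns {"status":"ok"} or attempts run out.
# Usage: wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
# Hypothetical helper; the defaults below are assumptions, not documented values.
wait_for_health() {
  local url=$1 attempts=${2:-10} delay=${3:-3} i body
  for i in $(seq 1 "$attempts"); do
    # --fail makes curl exit non-zero on HTTP errors; --max-time bounds each try.
    if body=$(curl --silent --fail --max-time 10 "$url"); then
      # Strip whitespace so formatting differences in the JSON do not matter.
      if [ "$(printf '%s' "$body" | tr -d '[:space:]')" = '{"status":"ok"}' ]; then
        echo "healthy after $i attempt(s)"
        return 0
      fi
    fi
    sleep "$delay"
  done
  echo "not healthy after $attempts attempts" >&2
  return 1
}
```

Example: wait_for_health https://prism-gateway-xvsemyikqq-oa.a.run.app/health 10 3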

Full redeploy from scratch

gcloud run deploy prism-gateway \
  --image=europe-west1-docker.pkg.dev/swisper/prism/gateway:latest \
  --region=europe-west1 \
  --project=swisper \
  --service-account=swisper-vertex-runtime@swisper.iam.gserviceaccount.com \
  --add-cloudsql-instances=swisper:europe-west1:prism-db \
  --set-secrets="PRISM_DATABASE_URL=prism-database-url:latest,PRISM_WEBHOOK_SECRET=prism-webhook-secret:latest" \
  --set-env-vars="PRISM_VERTEX_AI_PROJECT=swisper,PRISM_VERTEX_AI_REGION=europe-west1,PRISM_JWT_JWKS_URL=https://www.googleapis.com/service_accounts/v1/jwk/securetoken@system.gserviceaccount.com,PRISM_JWT_AUDIENCE=swisper,PRISM_LOG_LEVEL=INFO,PRISM_RERANKER_ENABLED=true,PRISM_RERANKER_TYPE=google,PRISM_GCP_PROJECT=swisper,PRISM_GCP_LOCATION=europe-west1" \
  --port=8080 \
  --allow-unauthenticated \
  --min-instances=0 \
  --max-instances=3

Rollback

# List available revisions
gcloud run revisions list \
  --service=prism-gateway \
  --region=europe-west1 \
  --project=swisper

# Route 100% traffic to a previous revision
gcloud run services update-traffic prism-gateway \
  --region=europe-west1 \
  --project=swisper \
  --to-revisions=REVISION_NAME=100
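To find a candidate for REVISION_NAME quickly, the listing can be sorted newest first and the second entry taken. This is a minimal sketch: previous_revision is a hypothetical helper, and the second-newest revision is only assumed to be the last good one, so confirm it against the full revision list before routing traffic.

```shell
# Print the name of the second-newest revision of a Cloud Run service.
# Usage: previous_revision SERVICE REGION PROJECT
# Assumption: the second-newest revision is the one to roll back to.
previous_revision() {
  local service=$1 region=$2 project=$3
  gcloud run revisions list \
    --service="$service" --region="$region" --project="$project" \
    --sort-by="~metadata.creationTimestamp" \
    --format="value(metadata.name)" --limit=2 | sed -n '2p'
}
```

Example: previous_revision prism-gateway europe-west1 swisper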

Local development

cd apps/prism
uv sync
docker compose up -d postgres   # local pgvector on port 5433
cp .env.example .env            # fill in PRISM_VERTEX_AI_PROJECT, etc.

# Run gateway locally
uv run uvicorn "prism.gateway.app:create_gateway_app" --factory --host 0.0.0.0 --port 8080

Secrets and Credentials

All secrets live in GCP Secret Manager (project: swisper).

Secret Name | Description | Format
prism-database-url | Cloud SQL connection string via proxy socket | postgresql://prism:<password>@/prism?host=/cloudsql/swisper:europe-west1:prism-db
prism-webhook-secret | HMAC secret for GitHub webhook validation | Plain string
prism-uat-private-key | RSA-2048 private key for signing test tokens | PEM-encoded private key

Database direct access

# Method 1: Cloud SQL Proxy (recommended)
cloud-sql-proxy swisper:europe-west1:prism-db --port=15432 &
PGPASSWORD="<password>" psql -h 127.0.0.1 -p 15432 -U prism -d prism

# Method 2: Direct public IP (IP must be in authorized networks)
# PGPASSWORD supplies the password, so it is left out of the connection URI.
PGPASSWORD="<password>" psql \
  "postgresql://prism@34.14.11.205:5432/prism?sslmode=require"

Authentication

The gateway supports two authentication methods: developer tokens and Firebase JWTs.

Developer tokens

Developer tokens start with prism_ and are validated via DB lookup in prism.developer_tokens. They never expire and are the recommended method for .cursor/mcp.json and ~/.claude.json configs.

Tokens are generated in the Prism Console (Settings > Developer Tokens) or via the Console API:

curl -s -X POST https://prism-gateway-xvsemyikqq-oa.a.run.app/api/v1/developer-tokens \
  -H "Authorization: Bearer <system-jwt>" \
  -H "Content-Type: application/json" \
  -d '{"label": "cursor-laptop"}'

Firebase JWTs (used by the Console web app)

RS256 tokens issued by Google Cloud Identity Platform. Valid for 1 hour. Required claims: sub, tid, aud (= swisper), exp, iat.
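When a JWT is rejected, it helps to see which of the required claims are absent. The sketch below decodes the token payload locally and lists the missing claims. It does not verify the signature, and it assumes compact JSON (no space before the colon). The function names jwt_payload and missing_claims are hypothetical.

```shell
# Decode the payload (second segment) of a JWT. No signature verification.
jwt_payload() {
  local seg pad
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url omits padding; restore it so base64 -d accepts the input.
  pad=$(( (4 - ${#seg} % 4) % 4 ))
  while [ "$pad" -gt 0 ]; do seg="${seg}="; pad=$((pad - 1)); done
  printf '%s' "$seg" | base64 -d
}

# Print each required claim (sub, tid, aud, exp, iat) missing from the payload.
# Assumption: the payload is compact JSON, so keys appear as "name": with no space.
missing_claims() {
  local payload claim
  payload=$(jwt_payload "$1")
  for claim in sub tid aud exp iat; do
    case "$payload" in
      *"\"$claim\":"*) ;;
      *) printf '%s\n' "$claim" ;;
    esac
  done
}
```

Example: missing_claims "$TOKEN" prints nothing when all five claims are present.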


Connecting a New Repository

Via the Console

  1. Log in to the Prism Console
  2. Click Connect Repository on the dashboard
  3. Select the GitHub repository
  4. The Console installs the GitHub webhook automatically
  5. Push to trigger the first full index

Via the API (manual setup)

  1. Register the repo:

    curl -s -X POST https://prism-gateway-xvsemyikqq-oa.a.run.app/api/v1/repos \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"full_name": "org/repo"}'

  2. Add the GitHub webhook. In the GitHub repository, go to Settings > Webhooks > Add webhook and set:

    • Payload URL: https://prism-gateway-xvsemyikqq-oa.a.run.app/api/v1/ingest/webhook
    • Content type: application/json
    • Secret: value from prism-webhook-secret in Secret Manager
    • Events: Push event only

  3. Push to trigger the first full index.


Monitoring

Health check

curl https://prism-gateway-xvsemyikqq-oa.a.run.app/health
# Expected: {"status":"ok"} — HTTP 200

Cloud Run logs

# Live tail
gcloud logging tail \
  'resource.type="cloud_run_revision" AND resource.labels.service_name="prism-gateway"' \
  --project=swisper

# Recent errors only
gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name="prism-gateway" AND severity>=ERROR' \
  --project=swisper --limit=50 \
  --format="table(timestamp,textPayload)"

Indexing job status

-- Active and recent jobs
SELECT job_id, repo_id, status, started_at, completed_at, error_message
FROM prism.index_jobs
ORDER BY started_at DESC
LIMIT 20;

-- Stuck jobs (running > 30 min — watchdog should catch these)
SELECT * FROM prism.index_jobs
WHERE status = 'running'
AND started_at < now() - interval '30 minutes';

Key metrics

Metric | What it measures | Normal | Alert threshold
Cloud Run request count (/health) | Service availability | 200 OK on all requests | Any non-200 for >2 min
Cloud Run 4xx error rate | Auth failures and bad requests | <1% | >5% over 5 min
Cloud Run 5xx error rate | Server errors (DB, Vertex AI) | <0.5% | >2% over 5 min
Cloud Run instance count | Scaling behavior | 0–3 | 3 sustained (at max; may need limit increase)
Cloud SQL connections | Connection pool usage | 5–15 active | >20 sustained
Cloud SQL CPU utilization | Query load | <40% | >80% for 10 min
Vertex AI embedding latency | Embedding generation speed | <500ms per call | >2000ms
Indexing job success rate | Tier 3 webhook processing | >95% | <90% over 1 hour
Job watchdog interventions | Stuck job detection | 0 per day | >3 per day
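
The percentage thresholds above reduce to a simple ratio check over a time window. As a rough sketch, the helper below takes an error count, a total request count, and a threshold in percent, and succeeds when the rate exceeds the threshold. The name rate_exceeds is hypothetical, and collecting the two counts for the window is left to whatever monitoring query is in use.

```shell
# Exit 0 when errors/total, as a percentage, is strictly above the threshold.
# Usage: rate_exceeds ERRORS TOTAL THRESHOLD_PCT
# A window with zero requests is treated as "not exceeded".
rate_exceeds() {
  local errors=$1 total=$2 threshold_pct=$3
  awk -v e="$errors" -v t="$total" -v th="$threshold_pct" \
    'BEGIN { if (t == 0) exit 1; exit !((e / t) * 100 > th) }'
}
```

Example: 30 server errors out of 1000 requests is 3%, so rate_exceeds 30 1000 2 succeeds and the 5xx alert condition (>2%) is met.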

Common Failure Modes

1. Cloud Run service unavailable (502/503)

Trigger: Cloud Run service crashed, redeploying, or failing cold start.

Symptoms: /health returns connection error or 502. All MCP queries fail.

Impact: All AI assistants using Prism cannot make queries.

Resolution:

# Check service status
gcloud run services describe prism-gateway \
  --region=europe-west1 --project=swisper \
  --format="yaml(status.conditions)"

# Check recent logs for crash reason
gcloud logging read \
  'resource.labels.service_name="prism-gateway" AND severity>=ERROR' \
  --project=swisper --limit=20

# If bad deploy, roll back
gcloud run revisions list --service=prism-gateway --region=europe-west1 --project=swisper
gcloud run services update-traffic prism-gateway \
  --region=europe-west1 --project=swisper \
  --to-revisions=LAST_GOOD_REVISION=100

2. All requests return 401 Unauthorized

Trigger: Developer token revoked, JWT expired, JWKS URL misconfigured, or tid claim missing.

Symptoms: Every authenticated request returns {"error": "auth_failed"}. Health check still returns 200.

Impact: All authenticated MCP queries and ingestion requests fail.

Resolution:

# For developer tokens: verify token exists in DB
psql -c "SELECT token_prefix, label, revoked_at FROM prism.developer_tokens WHERE token_prefix = 'prism_abc...';"

# For JWTs: verify JWKS URL
gcloud run services describe prism-gateway \
  --region=europe-west1 --project=swisper \
  --format="yaml(spec.template.spec.containers[0].env)" | grep JWKS

# Verify JWKS endpoint is reachable
curl "https://www.googleapis.com/service_accounts/v1/jwk/securetoken@system.gserviceaccount.com"

3. Vertex AI embedding calls failing (ingestion returns 500)

Trigger: Vertex AI API unavailable, quota exceeded, or service account missing role.

Symptoms: Ingestion endpoints return 500. Error logs show Vertex AI errors. Index status stuck at indexing.

Impact: New code is not embedded. Semantic search degrades (only BM25 and exact legs work).

Resolution:

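# Check Vertex AI API status
# Note: this queries us-central1, while the gateway is deployed with
# PRISM_VERTEX_AI_REGION=europe-west1. Substitute europe-west1 in both the
# hostname and the path to test the region the gateway calls.
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://us-central1-aiplatform.googleapis.com/v1/projects/swisper/locations/us-central1/publishers/google/models/gemini-embedding-001"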

# Check service account roles
gcloud projects get-iam-policy swisper \
  --flatten="bindings[].members" \
  --format="table(bindings.role)" \
  --filter="bindings.members:swisper-vertex-runtime@swisper.iam.gserviceaccount.com"

4. Cloud SQL connection pool exhausted

Trigger: Spike in concurrent queries or slow queries holding connections.

Symptoms: asyncpg.exceptions.TooManyConnectionsError in logs. HTTP 500 responses. Health check still returns 200.

Impact: All database-backed MCP queries fail.

Resolution:

# Check active connections
psql -c "SELECT count(*), state, left(query, 80) FROM pg_stat_activity GROUP BY state, left(query, 80) ORDER BY count DESC LIMIT 20;"

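# Increase pool size if needed
# Use --update-env-vars: --set-env-vars would remove every other
# environment variable already configured on the service.
gcloud run services update prism-gateway \
  --region=europe-west1 --project=swisper \
  --update-env-vars="PRISM_DATABASE_POOL_SIZE=20"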
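
Before raising the pool size, it is worth checking the worst case against the database limit. A minimal sketch of the arithmetic, assuming each Cloud Run instance opens one pool of PRISM_DATABASE_POOL_SIZE connections (an assumption about the gateway, not something this document states); worst_case_connections is a hypothetical helper name.

```shell
# Worst-case connection count = max Cloud Run instances x pool size per instance.
# Usage: worst_case_connections MAX_INSTANCES POOL_SIZE
# Assumption: one connection pool per instance.
worst_case_connections() {
  echo $(( $1 * $2 ))
}
```

With the documented maximum of 3 instances and a pool size of 20, the worst case is 60 connections. Compare that with the result of SHOW max_connections; on the instance, leaving headroom for psql sessions and the Cloud SQL Proxy.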

5. GitHub webhook not triggering indexing

Trigger: Webhook secret mismatch, gateway error on webhook endpoint, or GitHub delivery failure.

Symptoms: GitHub > Webhooks > Recent Deliveries shows failed deliveries. Index doesn't update after push.

Impact: Pushes do not trigger re-indexing. The hosted index drifts from the pushed state.

Resolution:

# Verify webhook secret matches
gcloud secrets versions access latest --secret=prism-webhook-secret --project=swisper

# Check GitHub webhook delivery failures in GitHub UI:
# Repository > Settings > Webhooks > Recent Deliveries

# Manually trigger re-index
curl -s -X POST https://prism-gateway-xvsemyikqq-oa.a.run.app/api/v1/repos/REPO_ID/index \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"mode": "full"}'
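
To confirm a secret mismatch without guessing, the signature can be recomputed locally. GitHub signs each delivery with HMAC-SHA256 and sends the result in the X-Hub-Signature-256 header as sha256=<hex>. The sketch below computes that value for a saved payload so it can be compared with the header shown under Recent Deliveries. That the gateway validates this particular header is an assumption, and github_signature is a hypothetical helper name.

```shell
# Compute the X-Hub-Signature-256 value for a payload read from stdin.
# Usage: github_signature SECRET < payload.json
# The payload must be byte-for-byte identical to what GitHub delivered.
github_signature() {
  local secret=$1 digest
  # openssl prints "<label>= <hex>"; keep only the hex digest (last field).
  digest=$(openssl dgst -sha256 -hmac "$secret" | awk '{print $NF}')
  printf 'sha256=%s\n' "$digest"
}
```

Example: github_signature "$(gcloud secrets versions access latest --secret=prism-webhook-secret --project=swisper)" < payload.json. If the output differs from the header GitHub sent, the secret configured on the webhook does not match Secret Manager.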

6. Indexing jobs stuck in running state

Trigger: Indexing worker crashed mid-job, or Cloud Run container was evicted.

Symptoms: prism.index_jobs shows jobs with status = 'running' and started_at more than 30 minutes ago.

Impact: Repo shows "Indexing..." indefinitely in the Console. New pushes may queue behind the stuck job.

Resolution: The gateway's background watchdog automatically marks stuck jobs (>30 min) as failed every 60 seconds. On container restart, orphaned running jobs are also marked failed. If the watchdog is not running, mark the jobs as failed manually:

UPDATE prism.index_jobs
SET status = 'failed', completed_at = now(),
    error_message = 'Manual: marked as failed by operator'
WHERE status = 'running'
AND started_at < now() - interval '30 minutes';

Runbooks

Runbook: Deploy a new version

When to use: Releasing a new build of the Prism gateway. Estimated duration: 5 minutes.

  1. Run all tests:
    cd apps/prism && uv run pytest tests/ -vv
    
  2. Build and push the Docker image:
    docker build --target gateway \
      -t europe-west1-docker.pkg.dev/swisper/prism/gateway:latest \
      -f Dockerfile.prism .
    docker push europe-west1-docker.pkg.dev/swisper/prism/gateway:latest
    
  3. Deploy to Cloud Run:
    gcloud run services update prism-gateway \
      --region=europe-west1 --project=swisper \
      --image=europe-west1-docker.pkg.dev/swisper/prism/gateway:latest
    
  4. Verify:
    curl https://prism-gateway-xvsemyikqq-oa.a.run.app/health
    

Rollback: See "Roll back to a previous revision" below.


Runbook: Roll back to a previous revision

When to use: New deployment is broken. Estimated duration: 3 minutes.

  1. List recent revisions:
    gcloud run revisions list --service=prism-gateway --region=europe-west1 --project=swisper
    
  2. Route 100% traffic to the last known-good revision:
    gcloud run services update-traffic prism-gateway \
      --region=europe-west1 --project=swisper \
      --to-revisions=REVISION_NAME=100
    
  3. Verify:
    curl https://prism-gateway-xvsemyikqq-oa.a.run.app/health
    

Runbook: Rotate the database password

When to use: Scheduled quarterly rotation or suspected compromise. Estimated duration: 10 minutes.

  1. Generate a new password (minimum 24 characters, secure random).
  2. Set the new password in Cloud SQL:
    gcloud sql users set-password prism --instance=prism-db \
      --password="<NEW_PASSWORD>" --project=swisper
    
  3. Update Secret Manager:
    echo -n "postgresql://prism:<NEW_PASSWORD>@/prism?host=/cloudsql/swisper:europe-west1:prism-db" | \
      gcloud secrets versions add prism-database-url --data-file=- --project=swisper
    
  4. Redeploy Cloud Run to pick up the new secret:
    gcloud run services update prism-gateway \
      --region=europe-west1 --project=swisper \
      --image=europe-west1-docker.pkg.dev/swisper/prism/gateway:latest
    
  5. Verify: curl https://prism-gateway-xvsemyikqq-oa.a.run.app/health

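Rollback: Secret Manager retains previous versions, but the old connection string only works if the database user has the old password again. Reset the Cloud SQL user to the old password, add a new secret version containing the old connection string, and redeploy.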
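
For step 1, one way to generate the password is shown below. It is a sketch, not a mandated procedure: generate_password is a hypothetical helper, and restricting the alphabet to letters and digits is a deliberate choice so the password can be placed in the postgresql:// connection string without URL-encoding.

```shell
# Generate a random alphanumeric password (default length 32).
# Usage: generate_password [LENGTH]
# Alphanumeric only, so no characters need escaping in a connection URI.
generate_password() {
  local length=${1:-32}
  LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c "$length"
  echo
}
```

Example: NEW_PASSWORD=$(generate_password 32), then use "$NEW_PASSWORD" in steps 2 and 3 in place of <NEW_PASSWORD>.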


Runbook: Apply schema migration across all tenant schemas

When to use: A new column or index must be added to all tenant tables. Estimated duration: 1–60 minutes depending on tenant count and migration type.

  1. Connect to the database via Cloud SQL Proxy.
  2. Enumerate all tenant schemas:
    -- The backslash escapes "_", which LIKE otherwise treats as a
    -- single-character wildcard.
    SELECT schema_name FROM information_schema.schemata
    WHERE schema_name LIKE 'tenant\_%' ORDER BY schema_name;
    
  3. Run the migration on each schema:
    DO $$
    DECLARE r RECORD;
    BEGIN
      FOR r IN (SELECT schema_name FROM information_schema.schemata
                WHERE schema_name LIKE 'tenant\_%') LOOP
        EXECUTE format('ALTER TABLE %I.code_chunks ADD COLUMN IF NOT EXISTS new_col TEXT', r.schema_name);
      END LOOP;
    END;
    $$;
    
  4. Verify on a sample tenant:
    SELECT column_name FROM information_schema.columns
    WHERE table_schema = 'tenant_<name>' AND table_name = 'code_chunks';
    

Runbook: Add authorized IP for direct Cloud SQL access

When to use: A developer needs direct psql access for debugging. Estimated duration: 3 minutes.

curl -s ifconfig.me  # Get developer's public IP

gcloud sql instances patch prism-db \
  --authorized-networks="<EXISTING_IPs>/32,<NEW_IP>/32" \
  --project=swisper

Remove the IP when access is no longer needed.
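
Because --authorized-networks replaces the whole list, the new value has to include every existing entry. The helper below is a small sketch that appends the new address as a /32 unless it is already listed. merge_networks is a hypothetical name. The existing list can usually be read with gcloud sql instances describe prism-db --project=swisper --format="value(settings.ipConfiguration.authorizedNetworks[].value)", though the separator in that output may need converting to commas first.

```shell
# Append NEW_IP/32 to a comma-separated CIDR list unless it is already present.
# Usage: merge_networks "EXISTING_CIDRS" NEW_IP
merge_networks() {
  local existing=$1 new_cidr="$2/32"
  case ",$existing," in
    *",$new_cidr,"*) printf '%s\n' "$existing" ;;            # already listed
    ",,")            printf '%s\n' "$new_cidr" ;;            # list was empty
    *)               printf '%s\n' "$existing,$new_cidr" ;;  # append
  esac
}
```

Example: merge_networks "10.0.0.1/32,10.0.0.2/32" "1.2.3.4" prints 10.0.0.1/32,10.0.0.2/32,1.2.3.4/32, which can be passed to --authorized-networks.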


Escalation

Escalate when:

  • A failure mode is not listed above and persists for more than 15 minutes
  • Data loss or data exposure is suspected
  • The Cloud SQL instance is unresponsive or reporting corruption
  • A Cloud Run deployment cannot be rolled back

Role | Contact | Channel
Prism owner / tech lead | heiko.sundermann@fintama.com | Slack #prism or direct message
GCP infrastructure | heiko.sundermann@fintama.com | Slack #infra

GCP emergency contacts:

  • GCP support: https://console.cloud.google.com/support (project: swisper)
  • Cloud SQL support escalation: via GCP Console, Premium support tier required for SLA