Prism — Operations¶
Audience: Operations engineers, on-call responders, SREs, and DevOps engineers. Every section is actionable — specific commands, thresholds, and step-by-step procedures.
GCP Infrastructure Reference¶
| Resource | Details |
|---|---|
| GCP Project | swisper (project number: 1045528868895) |
| Region | europe-west1 |
| Cloud Run service | prism-gateway |
| Gateway URL | https://prism-gateway-xvsemyikqq-oa.a.run.app |
| MCP endpoint | https://prism-gateway-xvsemyikqq-oa.a.run.app/mcp |
| Cloud SQL instance | prism-db (connection: swisper:europe-west1:prism-db) |
| Cloud SQL tier | db-perf-optimized-N-2, PostgreSQL 16, ENTERPRISE_PLUS |
| Cloud SQL public IP | 34.14.11.205 |
| pgvector version | 0.8.1 |
| Container image | europe-west1-docker.pkg.dev/swisper/prism/gateway:latest |
| Service account | swisper-vertex-runtime@swisper.iam.gserviceaccount.com |
| Auth provider | GCP Identity Platform (Firebase Auth) + Developer Tokens |
| JWKS URL | https://www.googleapis.com/service_accounts/v1/jwk/securetoken@system.gserviceaccount.com |
Deployment¶
Mechanism¶
Prism runs as a Cloud Run service (prism-gateway) in europe-west1. The container image is stored in Artifact Registry. Cloud Run connects to Cloud SQL via the Cloud SQL Auth Proxy (Unix socket). Secrets are injected from Secret Manager at startup.
Min instances: 0 (scales to zero when idle). Max instances: 3. Port: 8080.
On startup, the gateway:
- Initializes the asyncpg connection pool
- Marks any orphaned running jobs in prism.index_jobs as failed
- Starts a background watchdog that marks stuck jobs (>30 min) as failed every 60 seconds
- Starts the MCP Streamable HTTP session manager
Pre-deployment checklist¶
- [ ] All Prism tests pass: cd apps/prism && uv run pytest tests/ -vv
- [ ] Schema migrations applied if schema changed (see Runbook: Apply Schema Migration)
- [ ] Environment variables and secrets updated in Secret Manager if config changed
- [ ] Docker image builds successfully locally before pushing
Build and deploy¶
cd apps/prism
# Authenticate Docker (one-time per machine)
gcloud auth configure-docker europe-west1-docker.pkg.dev
# Build the gateway image
docker build --target gateway \
-t europe-west1-docker.pkg.dev/swisper/prism/gateway:latest \
-f Dockerfile.prism .
# Push to Artifact Registry
docker push europe-west1-docker.pkg.dev/swisper/prism/gateway:latest
# Deploy to Cloud Run (rolling update, zero downtime)
gcloud run services update prism-gateway \
--region=europe-west1 \
--project=swisper \
--image=europe-west1-docker.pkg.dev/swisper/prism/gateway:latest
# Verify the deployment
curl https://prism-gateway-xvsemyikqq-oa.a.run.app/health
# Expected: {"status":"ok"}
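Because the service scales to zero, the first health probe after a deploy may hit a cold start. The following is a minimal sketch of a retry loop around the health check above; the function name, attempt count, and delay are illustrative assumptions, not part of the Prism tooling.

```shell
# Poll a health URL until it returns {"status":"ok"} or the attempts run out.
# $1: URL, $2: max attempts (default 10), $3: seconds between attempts (default 3)
wait_for_health() (
  url=$1
  attempts=${2:-10}
  delay=${3:-3}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # --max-time bounds each probe so a slow cold start cannot hang the loop
    body=$(curl -fsS --max-time 10 "$url" 2>/dev/null || true)
    # Strip whitespace so differently formatted JSON bodies still match
    if [ "$(printf '%s' "$body" | tr -d ' \n\r\t')" = '{"status":"ok"}' ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempts" >&2
  return 1
)
```

For example, `wait_for_health https://prism-gateway-xvsemyikqq-oa.a.run.app/health 10 3` returns non-zero if the gateway never reports ok, which makes it usable as a gate in a deploy script.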
Full redeploy from scratch¶
gcloud run deploy prism-gateway \
--image=europe-west1-docker.pkg.dev/swisper/prism/gateway:latest \
--region=europe-west1 \
--project=swisper \
--service-account=swisper-vertex-runtime@swisper.iam.gserviceaccount.com \
--add-cloudsql-instances=swisper:europe-west1:prism-db \
--set-secrets="PRISM_DATABASE_URL=prism-database-url:latest,PRISM_WEBHOOK_SECRET=prism-webhook-secret:latest" \
--set-env-vars="PRISM_VERTEX_AI_PROJECT=swisper,PRISM_VERTEX_AI_REGION=europe-west1,PRISM_JWT_JWKS_URL=https://www.googleapis.com/service_accounts/v1/jwk/securetoken@system.gserviceaccount.com,PRISM_JWT_AUDIENCE=swisper,PRISM_LOG_LEVEL=INFO,PRISM_RERANKER_ENABLED=true,PRISM_RERANKER_TYPE=google,PRISM_GCP_PROJECT=swisper,PRISM_GCP_LOCATION=europe-west1" \
--port=8080 \
--allow-unauthenticated \
--min-instances=0 \
--max-instances=3
Rollback¶
# List available revisions
gcloud run revisions list \
--service=prism-gateway \
--region=europe-west1 \
--project=swisper
# Route 100% traffic to a previous revision
gcloud run services update-traffic prism-gateway \
--region=europe-west1 \
--project=swisper \
--to-revisions=REVISION_NAME=100
Local development¶
cd apps/prism
uv sync
docker compose up -d postgres # local pgvector on port 5433
cp .env.example .env # fill in PRISM_VERTEX_AI_PROJECT, etc.
# Run gateway locally
uv run uvicorn "prism.gateway.app:create_gateway_app" --factory --host 0.0.0.0 --port 8080
Secrets and Credentials¶
All secrets live in GCP Secret Manager (project: swisper).
| Secret Name | Description | Format |
|---|---|---|
| prism-database-url | Cloud SQL connection string via proxy socket | postgresql://prism:<password>@/prism?host=/cloudsql/swisper:europe-west1:prism-db |
| prism-webhook-secret | HMAC secret for GitHub webhook validation | Plain string |
| prism-uat-private-key | RSA-2048 private key for signing test tokens | PEM-encoded private key |
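When a psql session needs the database password, it can be read from the prism-database-url secret instead of being copied by hand. This is a sketch that assumes the secret follows the format shown in the table; the helper name is hypothetical, and a password containing URL-encoded characters is returned still encoded.

```shell
# Extract the password from a URL shaped like:
#   postgresql://prism:<password>@/prism?host=/cloudsql/...
db_password_from_url() {
  # First expression drops "scheme://user:", second drops the last "@" and what follows
  printf '%s\n' "$1" | sed -e 's|^[a-z]*://[^:/@]*:||' -e 's|@[^@]*$||'
}

# Possible usage (requires gcloud authentication and a running Cloud SQL Proxy):
# PGPASSWORD=$(db_password_from_url "$(gcloud secrets versions access latest \
#   --secret=prism-database-url --project=swisper)")
# export PGPASSWORD
# psql -h 127.0.0.1 -p 15432 -U prism -d prism
```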
Database direct access¶
# Method 1: Cloud SQL Proxy (recommended)
cloud-sql-proxy swisper:europe-west1:prism-db --port=15432 &
PGPASSWORD="<password>" psql -h 127.0.0.1 -p 15432 -U prism -d prism
# Method 2: Direct public IP (IP must be in authorized networks)
# The password comes from PGPASSWORD, so it is not repeated in the URL
PGPASSWORD="<password>" psql \
"postgresql://prism@34.14.11.205:5432/prism?sslmode=require"
Authentication¶
The gateway supports two authentication methods:
Developer tokens (recommended for MCP configs)¶
Developer tokens start with prism_ and are validated via DB lookup in prism.developer_tokens. They never expire and are the recommended method for .cursor/mcp.json and ~/.claude.json configs.
Tokens are generated in the Prism Console (Settings > Developer Tokens) or via the Console API:
curl -s -X POST https://prism-gateway-xvsemyikqq-oa.a.run.app/api/v1/developer-tokens \
-H "Authorization: Bearer <system-jwt>" \
-H "Content-Type: application/json" \
-d '{"label": "cursor-laptop"}'
Firebase JWTs (used by the Console web app)¶
RS256 tokens issued by Google Cloud Identity Platform. Valid for 1 hour. Required claims: sub, tid, aud (= swisper), exp, iat.
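When a JWT is rejected, it often helps to look at the claims it actually carries. The sketch below decodes the payload segment of a token so that sub, tid, aud, exp, and iat can be checked by eye. It does not verify the signature, the function name is hypothetical, and it assumes a base64 command that accepts -d (GNU coreutils and recent macOS).

```shell
# Print the JSON payload (second segment) of a JWT without verifying it.
jwt_payload() (
  # Take the second dot-separated segment and map base64url to standard base64
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the "=" padding that JWTs strip
  while [ $(( ${#seg} % 4 )) -ne 0 ]; do
    seg="${seg}="
  done
  printf '%s' "$seg" | base64 -d
)
```

Running `jwt_payload "$TOKEN"` prints the claims as JSON; compare aud against swisper and confirm that exp is in the future and tid is present.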
Connecting a New Repository¶
Via the Prism Console (recommended)¶
- Log in to the Prism Console
- Click Connect Repository on the dashboard
- Select the GitHub repository
- The Console installs the GitHub webhook automatically
- Push to trigger the first full index
Via the API (manual setup)¶
- Register the repo:
- Add the GitHub webhook:
  - Go to the GitHub repository > Settings > Webhooks > Add webhook
  - Payload URL: https://prism-gateway-xvsemyikqq-oa.a.run.app/api/v1/ingest/webhook
  - Content type: application/json
  - Secret: value from prism-webhook-secret in Secret Manager
  - Events: Push event only
- Push to trigger the first full index.
Monitoring¶
Health check¶
curl https://prism-gateway-xvsemyikqq-oa.a.run.app/health
# Expected: {"status":"ok"}
Cloud Run logs¶
# Live tail
gcloud logging tail \
'resource.type="cloud_run_revision" AND resource.labels.service_name="prism-gateway"' \
--project=swisper
# Recent errors only
gcloud logging read \
'resource.type="cloud_run_revision" AND resource.labels.service_name="prism-gateway" AND severity>=ERROR' \
--project=swisper --limit=50 \
--format="table(timestamp,textPayload)"
Indexing job status¶
-- Active and recent jobs
SELECT job_id, repo_id, status, started_at, completed_at, error_message
FROM prism.index_jobs
ORDER BY started_at DESC
LIMIT 20;
-- Stuck jobs (running > 30 min — watchdog should catch these)
SELECT * FROM prism.index_jobs
WHERE status = 'running'
AND started_at < now() - interval '30 minutes';
Key metrics¶
| Metric | What it measures | Normal | Alert threshold |
|---|---|---|---|
| Cloud Run request count (/health) | Service availability | 200 OK on all requests | Any non-200 for >2 min |
| Cloud Run 4xx error rate | Auth failures and bad requests | <1% | >5% over 5 min |
| Cloud Run 5xx error rate | Server errors (DB, Vertex AI) | <0.5% | >2% over 5 min |
| Cloud Run instance count | Scaling behavior | 0–3 | Sustained at 3 (the configured maximum; may need a limit increase) |
| Cloud SQL connections | Connection pool usage | 5–15 active | >20 sustained |
| Cloud SQL CPU utilization | Query load | <40% | >80% for 10 min |
| Vertex AI embedding latency | Embedding generation speed | <500ms per call | >2000ms |
| Indexing job success rate | Tier 3 webhook processing | >95% | <90% over 1 hour |
| Job watchdog interventions | Stuck job detection | 0 per day | >3 per day |
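The percentage thresholds in the table can be checked from raw request counts. This is a small sketch, assuming the error and total counts for the window have already been pulled from Cloud Monitoring or the logs; the function name is hypothetical.

```shell
# Exit 0 when errors/total, as a percentage, is strictly above the limit.
# $1: error count, $2: total request count, $3: limit in percent
rate_exceeds() {
  awk -v e="$1" -v t="$2" -v limit="$3" 'BEGIN {
    if (t <= 0) exit 1                       # no traffic, nothing to alert on
    # Compare e/t*100 > limit without dividing, to avoid rounding surprises
    exit ((e * 100 > limit * t) ? 0 : 1)
  }'
}

# Example: 12 server errors out of 400 requests against the 2% 5xx threshold
# rate_exceeds 12 400 2 && echo "5xx rate above threshold"
```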
Common Failure Modes¶
1. Cloud Run service unavailable (502/503)¶
Trigger: Cloud Run service crashed, redeploying, or failing cold start.
Symptoms: /health returns connection error or 502. All MCP queries fail.
Impact: All AI assistants using Prism cannot make queries.
Resolution:
# Check service status
gcloud run services describe prism-gateway \
--region=europe-west1 --project=swisper \
--format="yaml(status.conditions)"
# Check recent logs for crash reason
gcloud logging read \
'resource.labels.service_name="prism-gateway" AND severity>=ERROR' \
--project=swisper --limit=20
# If bad deploy, roll back
gcloud run revisions list --service=prism-gateway --region=europe-west1 --project=swisper
gcloud run services update-traffic prism-gateway \
--region=europe-west1 --project=swisper \
--to-revisions=LAST_GOOD_REVISION=100
2. All requests return 401 Unauthorized¶
Trigger: Developer token revoked, JWT expired, JWKS URL misconfigured, or tid claim missing.
Symptoms: Every authenticated request returns {"error": "auth_failed"}. Health check still returns 200.
Impact: All authenticated MCP queries and ingestion requests fail.
Resolution:
# For developer tokens: verify token exists in DB
psql -c "SELECT token_prefix, label, revoked_at FROM prism.developer_tokens WHERE token_prefix = 'prism_abc...';"
# For JWTs: verify JWKS URL
gcloud run services describe prism-gateway \
--region=europe-west1 --project=swisper \
--format="yaml(spec.template.spec.containers[0].env)" | grep JWKS
# Verify JWKS endpoint is reachable
curl "https://www.googleapis.com/service_accounts/v1/jwk/securetoken@system.gserviceaccount.com"
3. Vertex AI embedding calls failing (ingestion returns 500)¶
Trigger: Vertex AI API unavailable, quota exceeded, or service account missing role.
Symptoms: Ingestion endpoints return 500. Error logs show Vertex AI errors. Index status stuck at indexing.
Impact: New code is not embedded. Semantic search degrades (only BM25 and exact legs work).
Resolution:
# Check Vertex AI API status
# Query the region the gateway is configured for (PRISM_VERTEX_AI_REGION=europe-west1)
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
"https://europe-west1-aiplatform.googleapis.com/v1/projects/swisper/locations/europe-west1/publishers/google/models/gemini-embedding-001"
# Check service account roles
gcloud projects get-iam-policy swisper \
--flatten="bindings[].members" \
--format="table(bindings.role)" \
--filter="bindings.members:swisper-vertex-runtime@swisper.iam.gserviceaccount.com"
4. Cloud SQL connection pool exhausted¶
Trigger: Spike in concurrent queries or slow queries holding connections.
Symptoms: asyncpg.exceptions.TooManyConnectionsError in logs. HTTP 500 responses. Health check still returns 200.
Impact: All database-backed MCP queries fail.
Resolution:
# Check active connections
psql -c "SELECT count(*), state, left(query, 80) FROM pg_stat_activity GROUP BY state, left(query, 80) ORDER BY count DESC LIMIT 20;"
# Increase pool size if needed
# Use --update-env-vars: --set-env-vars would remove every other environment variable
gcloud run services update prism-gateway \
--region=europe-west1 --project=swisper \
--update-env-vars="PRISM_DATABASE_POOL_SIZE=20"
5. GitHub webhook not triggering indexing¶
Trigger: Webhook secret mismatch, gateway error on webhook endpoint, or GitHub delivery failure.
Symptoms: GitHub > Webhooks > Recent Deliveries shows failed deliveries. Index doesn't update after push.
Impact: Pushes do not trigger re-indexing. The hosted index drifts from the pushed state.
Resolution:
# Verify webhook secret matches
gcloud secrets versions access latest --secret=prism-webhook-secret --project=swisper
# Check GitHub webhook delivery failures in GitHub UI:
# Repository > Settings > Webhooks > Recent Deliveries
# Manually trigger re-index
curl -s -X POST https://prism-gateway-xvsemyikqq-oa.a.run.app/api/v1/repos/REPO_ID/index \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"mode": "full"}'
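GitHub signs each delivery with an HMAC-SHA256 of the raw request body, keyed by the webhook secret, and sends the result in the X-Hub-Signature-256 header as sha256=<hex digest>. To check a suspected secret mismatch, the expected signature can be recomputed locally. This is a sketch: the function name is hypothetical, and that the gateway validates this particular header is an assumption based on the secret's description.

```shell
# Compute the signature GitHub would send for a payload file and a secret.
# $1: webhook secret, $2: path to a file holding the exact request body
github_signature() (
  secret=$1
  payload_file=$2
  # openssl prints "<label>= <hex>"; keep only the hex digest
  digest=$(openssl dgst -sha256 -hmac "$secret" < "$payload_file" | sed 's/^.*= *//')
  printf 'sha256=%s\n' "$digest"
)
```

The body must be byte-for-byte identical to what GitHub sent; a pretty-printed copy of the payload produces a different digest. If the value computed with the Secret Manager secret differs from the header GitHub recorded for the delivery, the two secrets do not match.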
6. Indexing jobs stuck in running state¶
Trigger: Indexing worker crashed mid-job, or Cloud Run container was evicted.
Symptoms: prism.index_jobs shows jobs with status = 'running' whose started_at is more than 30 minutes old.
Impact: Repo shows "Indexing..." indefinitely in the Console. New pushes may queue behind the stuck job.
Resolution: The gateway's background watchdog automatically marks stuck jobs (>30 min) as failed every 60 seconds. On container restart, orphaned running jobs are also marked failed. If the watchdog is not running, mark the stuck jobs as failed manually:
UPDATE prism.index_jobs
SET status = 'failed', completed_at = now(),
error_message = 'Manual: marked as failed by operator'
WHERE status = 'running'
AND started_at < now() - interval '30 minutes';
Runbooks¶
Runbook: Deploy a new version¶
When to use: Releasing a new build of the Prism gateway. Estimated duration: 5 minutes.
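- Run all tests: cd apps/prism && uv run pytest tests/ -vv
- Build and push the Docker image (the docker build and docker push commands under "Build and deploy").
- Deploy to Cloud Run (the gcloud run services update command under "Build and deploy").
- Verify: curl https://prism-gateway-xvsemyikqq-oa.a.run.app/health should return {"status":"ok"}.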
Rollback: See "Roll back to a previous revision" below.
Runbook: Roll back to a previous revision¶
When to use: New deployment is broken. Estimated duration: 3 minutes.
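- List recent revisions (gcloud run revisions list, as shown under "Rollback").
- Route 100% traffic to the last known-good revision (gcloud run services update-traffic with --to-revisions=REVISION_NAME=100).
- Verify: curl https://prism-gateway-xvsemyikqq-oa.a.run.app/health should return {"status":"ok"}.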
Runbook: Rotate the database password¶
When to use: Scheduled quarterly rotation or suspected compromise. Estimated duration: 10 minutes.
- Generate a new password (minimum 24 characters, secure random).
- Set the new password in Cloud SQL:
- Update Secret Manager:
- Redeploy Cloud Run to pick up the new secret:
- Verify:
curl https://prism-gateway-xvsemyikqq-oa.a.run.app/health
Rollback: Secret Manager retains previous versions. Add a new version with the old password and redeploy.
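The rotation steps above could be scripted roughly as follows. This is a sketch under stated assumptions: the database user is prism (taken from the connection string format), the helper names are hypothetical, and the hex password is chosen so it needs no URL encoding. Passing --password on the command line exposes it to the local process list; gcloud sql users set-password also accepts --prompt-for-password.

```shell
# Generate a URL-safe random password: 24 random bytes as 48 hex characters.
gen_password() {
  openssl rand -hex 24
}

# Build the connection string in the format of the prism-database-url secret.
build_db_url() {
  printf 'postgresql://prism:%s@/prism?host=/cloudsql/swisper:europe-west1:prism-db\n' "$1"
}

# Defined but not invoked here; call it once authenticated with gcloud.
rotate_db_password() (
  new_pw=$(gen_password)
  # Set the new password on the Cloud SQL user
  gcloud sql users set-password prism --instance=prism-db \
    --project=swisper --password="$new_pw" || return 1
  # Store the new connection string as a new secret version (no trailing newline)
  build_db_url "$new_pw" | tr -d '\n' | gcloud secrets versions add prism-database-url \
    --project=swisper --data-file=- || return 1
  # Roll a new revision so the gateway reads the latest secret version
  gcloud run services update prism-gateway --region=europe-west1 --project=swisper \
    --update-secrets=PRISM_DATABASE_URL=prism-database-url:latest
)
```

Between the password change and the new revision going live, instances that open fresh connections with the old password may fail, so run the health check immediately afterwards.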
Runbook: Apply schema migration across all tenant schemas¶
When to use: A new column or index must be added to all tenant tables. Estimated duration: 1–60 minutes depending on tenant count and migration type.
- Connect to the database via Cloud SQL Proxy.
- Enumerate all tenant schemas:
- Run the migration on each schema:
- Verify on a sample tenant:
Runbook: Add authorized IP for direct Cloud SQL access¶
When to use: A developer needs direct psql access for debugging. Estimated duration: 3 minutes.
curl -s ifconfig.me # Get developer's public IP
# --authorized-networks replaces the whole list, so include every existing entry
gcloud sql instances patch prism-db \
--authorized-networks="<EXISTING_IPs>/32,<NEW_IP>/32" \
--project=swisper
Remove the IP when access is no longer needed.
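Since the patch command overwrites the authorized-networks list, it is easy to drop existing entries by accident. The sketch below merges the current list with one new address; the helper name is hypothetical, and the assumption that gcloud prints list values separated by semicolons should be checked against the actual output.

```shell
# Merge existing authorized networks with one new IPv4 address.
# $1: existing networks separated by ";" or "," (may be empty)
# $2: new IP address without a prefix length
merge_networks() {
  printf '%s,%s/32\n' "$1" "$2" \
    | tr ';' ',' | tr -d ' ' \
    | sed -e 's/^,*//' -e 's/,,*/,/g'
}

# Possible usage (requires gcloud authentication):
# EXISTING=$(gcloud sql instances describe prism-db --project=swisper \
#   --format="value(settings.ipConfiguration.authorizedNetworks[].value)")
# gcloud sql instances patch prism-db --project=swisper \
#   --authorized-networks="$(merge_networks "$EXISTING" "<NEW_IP>")"
```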
Escalation¶
Escalate when:
- A failure mode is not listed above and persists for more than 15 minutes
- Data loss or data exposure is suspected
- The Cloud SQL instance is unresponsive or reporting corruption
- A Cloud Run deployment cannot be rolled back
| Role | Contact | Channel |
|---|---|---|
| Prism owner / tech lead | heiko.sundermann@fintama.com | Slack #prism or direct message |
| GCP infrastructure | heiko.sundermann@fintama.com | Slack #infra |
GCP emergency contacts:
- GCP support: https://console.cloud.google.com/support (project: swisper)
- Cloud SQL support escalation: via GCP Console, Premium support tier required for SLA