TDR-004: Prism Indexing Must Move to a Queue-Based Worker

Status: Identified
Priority: High (before GA / multi-tenant launch)
Estimated Effort: 3–5 days
Date Identified: 2026-02-27
Identified By: Dev lead (during EPC_012 Console UAT)


Description

What: The Prism Gateway currently runs full repo indexing as a fire-and-forget asyncio.create_task inside the HTTP server process. This works for a single-tenant alpha but breaks down as soon as repos get large or several index jobs overlap.

Current State:

```python
# prism/gateway/console_routes.py
asyncio.create_task(
    _run_tier3_full_reindex(
        storage=storage, embedder=embedder,
        repo_full_name=..., clone_url=..., branch=..., tenant_id=...,
    )
)
return JSONResponse({"status": "queued"}, status_code=202)
```

Each POST /api/v1/repos/{id}/index spawns a background coroutine that:

  1. Clones the full repo into /tmp via git clone --depth=1
  2. Walks all files and generates Vertex AI embeddings (one embed() call per file)
  3. Writes chunks + embeddings to pgvector

Why it is problematic:

| Scenario | Failure mode |
|---|---|
| Single large repo (>300 MB) | OOM; the Vertex AI SDK alone uses ~400 MB baseline |
| 2+ repos indexed concurrently on the same instance | Additive memory pressure → OOM |
| Cloud Run scale-to-zero during indexing | Container killed mid-index, job silently lost |
| Indexing timeout (large repo > Cloud Run 3600s request timeout) | Job killed without status update |
| Retry on failure | No retry logic; failed jobs are silently dropped |

The memory crash was observed during EPC_012 UAT at 531 MiB for a 69 MB repo (Fintama/helvetiq) — before the clone even started, just from loading the Vertex AI SDK. Memory was increased to 2 GiB as a short-term fix.

Desired State:

POST /api/v1/repos/{id}/index enqueues a Cloud Tasks task and returns 202 immediately. A dedicated indexer worker (separate Cloud Run service or Cloud Run Job) picks up the task, clones, embeds, and writes results. The gateway HTTP server has no indexing logic.

Console API ──► Gateway /index ──► Cloud Tasks queue ──► Indexer worker
                    202 ◄──────────────────────────────   (dedicated service)
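
The enqueue step in the diagram could look roughly like the sketch below. It builds a task body in the Cloud Tasks REST JSON shape (`name`, `httpRequest`, `dispatchDeadline`). The queue path, worker URL, payload fields, and the `build_index_task` helper are hypothetical placeholders, not taken from the Prism codebase.

```python
import base64
import hashlib
import json

# Hypothetical placeholders; none of these values come from the Prism codebase.
QUEUE_PATH = "projects/example-project/locations/europe-west1/queues/prism-index"
WORKER_URL = "https://prism-indexer.example.internal/internal/run"


def build_index_task(tenant_id: str, repo_full_name: str, branch: str) -> dict:
    """Build a Cloud Tasks task body (REST JSON shape) for one index job."""
    # Same tenant + repo + branch -> same task ID, so a second create request
    # for an in-flight job can be rejected by the queue instead of running twice.
    key = f"{tenant_id}:{repo_full_name}:{branch}".encode()
    task_id = "index-" + hashlib.sha256(key).hexdigest()[:32]

    payload = {
        "tenant_id": tenant_id,
        "repo_full_name": repo_full_name,
        "branch": branch,
    }
    return {
        "name": f"{QUEUE_PATH}/tasks/{task_id}",
        "httpRequest": {
            "httpMethod": "POST",
            "url": WORKER_URL,
            "headers": {"Content-Type": "application/json"},
            # The REST API expects the request body as base64-encoded bytes.
            "body": base64.b64encode(json.dumps(payload).encode()).decode(),
        },
        # HTTP-target tasks allow a dispatch deadline of at most 30 minutes.
        "dispatchDeadline": "1800s",
    }


task = build_index_task("tenant-1", "Fintama/helvetiq", "main")
print(task["name"])
```

The gateway would submit this body to the queue and return 202. Two caveats: Cloud Tasks keeps a task name reserved for a period after the task finishes, so a real key probably needs a commit SHA or time bucket to allow later re-indexing; and index runs expected to exceed the 30-minute dispatch deadline would need a different hand-off, for example the task only starting a Cloud Run Job.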

Impact

Reliability

  • Silent failures: OOM or timeout during indexing leaves the repo in "indexing" or "pending" state with no error surfaced to the user
  • No retry: Failed jobs are lost; user must manually trigger again
  • Race condition: Two simultaneous reindex triggers for the same repo produce duplicate work and potentially corrupt chunk data
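
With a queue in front of the indexer, retry behavior follows from the worker's HTTP response: Cloud Tasks treats a 2xx response from an HTTP target as success and retries other responses according to the queue's retry settings. The sketch below shows one way a worker could map outcomes to status codes. `PermanentIndexError`, `handle_index_task`, and the response bodies are hypothetical; only the `X-CloudTasks-TaskRetryCount` header comes from Cloud Tasks.

```python
class PermanentIndexError(Exception):
    """Hypothetical: a failure that retrying cannot fix (e.g. repo not found)."""


def handle_index_task(headers: dict, work) -> tuple:
    """Run one index job and return (http_status, body) for the queue."""
    # Cloud Tasks sends 0 in this header on the first attempt.
    attempt = int(headers.get("X-CloudTasks-TaskRetryCount", "0")) + 1
    try:
        work()
    except PermanentIndexError as exc:
        # Acknowledge with 2xx so the queue stops retrying a hopeless job.
        return 200, {"status": "failed", "attempt": attempt, "error": str(exc)}
    except Exception as exc:
        # Non-2xx asks the queue to retry according to its retry configuration.
        return 500, {"status": "retrying", "attempt": attempt, "error": str(exc)}
    return 200, {"status": "succeeded", "attempt": attempt}
```

Either failure branch surfaces an explicit status instead of leaving the repo stuck in "indexing". The cap on attempts would live in the queue's retry configuration rather than in this handler.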

Scalability

  • Memory: Each concurrent index job adds ~150–400 MB. At the upper estimate, 4 concurrent jobs (4 × 400 MB on top of the ~400 MB SDK baseline) leave almost no headroom on a single 2 GiB instance and can OOM
  • CPU: Embedding generation is CPU-intensive; running it on the gateway degrades HTTP response latency for all other requests (MCP tools, auth)
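
The memory figures can be checked with a few lines of arithmetic. This sketch assumes the ~400 MB SDK baseline is paid once per process and each job's cost adds on top of it. That additive model is an assumption, as is treating MB and MiB as interchangeable.

```python
# Figures quoted in this TDR; the additive model itself is an assumption.
SDK_BASELINE_MB = 400          # Vertex AI SDK baseline
PER_JOB_MB = (150, 400)        # low and high estimate per concurrent index job
INSTANCE_LIMIT_MB = 2 * 1024   # 2 GiB Cloud Run instance


def projected_memory_mb(concurrent_jobs: int) -> tuple:
    """Return the (low, high) projected memory use for N concurrent index jobs."""
    low, high = PER_JOB_MB
    return (
        SDK_BASELINE_MB + concurrent_jobs * low,
        SDK_BASELINE_MB + concurrent_jobs * high,
    )


for jobs in range(1, 6):
    low, high = projected_memory_mb(jobs)
    headroom = INSTANCE_LIMIT_MB - high
    print(f"{jobs} jobs: {low}-{high} MB, worst-case headroom {headroom} MB")
```

Under these assumptions, four jobs reach 2000 MB in the worst case (48 MB of headroom before any clone data or gateway traffic is counted), and a fifth job exceeds the limit outright.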

Observability

  • No progress visibility: Status is updated via direct DB writes from the background task; if the task is killed the status never updates to "failed"
  • No job history: No record of past index runs, durations, or errors

Security

  • ⚠️ Token in clone URL: GitHub access token is currently embedded in the clone URL (https://TOKEN@github.com/...) and passed through the gateway request body. This is a short-term pragmatic choice. A proper solution uses GitHub App installation tokens generated at job dispatch time.
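
Until installation tokens are in place, one low-cost hardening step is to make sure the tokenized URL is never logged or persisted. The helper below is a hypothetical sketch (it is not part of the Prism codebase) that separates the secret from the URL using only the standard library.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit, urlunsplit


def strip_credentials(clone_url: str) -> Tuple[str, Optional[str]]:
    """Return (URL without userinfo, extracted secret or None)."""
    parts = urlsplit(clone_url)
    if parts.username is None:
        return clone_url, None
    # For https://TOKEN@host/... the token is the username; for
    # https://user:TOKEN@host/... it is the password.
    secret = parts.password or parts.username
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    clean = urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
    return clean, secret


safe_url, token = strip_credentials("https://TOKEN@github.com/org/repo.git")
print(safe_url)
```

Only `safe_url` would be written to logs or a job record; the secret stays in memory for the duration of the clone.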

Remediation Plan

Option A: Cloud Tasks Queue + Dedicated Indexer Service

  1. Create prism-indexer Cloud Run service (separate from gateway): same Docker image, different entry point, serving only the internal task endpoint rather than the gateway API
  2. Gateway /index endpoint calls Cloud Tasks create_task pointing at prism-indexer/internal/run
  3. Indexer handles one job at a time per instance; Cloud Tasks handles retries, deduplication (via task name), and timeout management
  4. Job status written to prism.index_jobs table; gateway /status endpoint reads from there

Effort: ~4 days
Requires: Cloud Tasks queue (1 queue, free tier covers alpha load)
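
Step 4 (job status in prism.index_jobs) can be sketched as follows. The TDR names the table but not its columns, so the schema, status values, and helper functions here are assumptions. sqlite3 stands in for PostgreSQL so the example is self-contained.

```python
import sqlite3
import time

# Hypothetical schema: the TDR names the table but not its columns.
SCHEMA = """
CREATE TABLE index_jobs (
    job_id      TEXT PRIMARY KEY,
    repo        TEXT NOT NULL,
    status      TEXT NOT NULL,      -- queued | running | succeeded | failed
    attempts    INTEGER NOT NULL DEFAULT 0,
    error       TEXT,
    updated_at  REAL NOT NULL
)
"""


def enqueue(db, job_id, repo):
    """Gateway side: record the job when the task is created."""
    db.execute(
        "INSERT INTO index_jobs (job_id, repo, status, updated_at) "
        "VALUES (?, ?, 'queued', ?)",
        (job_id, repo, time.time()),
    )


def run_job(db, job_id, work):
    """Worker side: record every state transition, including failures."""
    db.execute(
        "UPDATE index_jobs SET status = 'running', attempts = attempts + 1, "
        "updated_at = ? WHERE job_id = ?",
        (time.time(), job_id),
    )
    try:
        work()
    except Exception as exc:
        db.execute(
            "UPDATE index_jobs SET status = 'failed', error = ?, updated_at = ? "
            "WHERE job_id = ?",
            (str(exc), time.time(), job_id),
        )
        raise  # let the caller answer non-2xx so the queue can retry
    db.execute(
        "UPDATE index_jobs SET status = 'succeeded', error = NULL, updated_at = ? "
        "WHERE job_id = ?",
        (time.time(), job_id),
    )


def get_status(db, job_id):
    """Gateway /status side: read-only lookup, independent of the worker."""
    return db.execute(
        "SELECT status, attempts, error FROM index_jobs WHERE job_id = ?",
        (job_id,),
    ).fetchone()


db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
enqueue(db, "job-1", "Fintama/helvetiq")
run_job(db, "job-1", lambda: None)
print(get_status(db, "job-1"))  # ('succeeded', 1, None)
```

A hard kill such as an OOM would still leave a row in "running", so the `updated_at` column is included here to let a periodic check flag stale jobs. That check is not shown.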

Option B: Cloud Run Jobs (Simpler, Less Observability)

  1. Each /index trigger creates a Cloud Run Job execution
  2. Job runs to completion in an isolated container; no shared state
  3. Status surfaced via Cloud Run Jobs API

Effort: ~2 days
Limitation: No built-in retry policy; harder to query job status from the gateway

Short-Term Mitigations Already Applied (Alpha)

| Mitigation | Location | Removes risk? |
|---|---|---|
| Gateway memory → 2 GiB | Cloud Run config | Partially (OOM for single job) |
| asyncio.Semaphore(2) | console_routes.py | Partially (caps concurrency) |

Note: The semaphore has NOT been implemented yet. It is listed here as a recommended interim measure before GA and is tracked under Prerequisites Before GA.
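
A minimal sketch of the interim semaphore guard, assuming the index coroutine is wrapped at the point where it is scheduled. `fake_index_job` is a stand-in for `_run_tier3_full_reindex` that sleeps instead of cloning and embedding, and `demo` exists only to show the cap taking effect.

```python
import asyncio

MAX_CONCURRENT_INDEX_JOBS = 2  # value proposed in this TDR


async def demo():
    # Created inside the running loop so it also works on older Python versions.
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_INDEX_JOBS)
    running = 0
    peak = 0

    async def fake_index_job(repo: str) -> str:
        nonlocal running, peak
        async with semaphore:  # waits here while two jobs are already running
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)
            running -= 1
        return repo

    results = await asyncio.gather(*(fake_index_job(f"repo-{i}") for i in range(5)))
    return results, peak


results, peak = asyncio.run(demo())
print(results, peak)
```

The guard only bounds jobs within one process: each Cloud Run instance gets its own semaphore, and it does nothing for scale-to-zero kills or lost jobs, which is why it is only a partial mitigation.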

Prerequisites Before GA

  • [ ] Implement asyncio.Semaphore(2) in console_routes.py as an interim guard
  • [ ] Implement Option A (Cloud Tasks) before onboarding the second tenant
  • [ ] Replace token-in-URL with GitHub App installation tokens (separate TDR or ADR)
  • [ ] Add prism.index_jobs table to track job history and status

Success Criteria

  • Indexing a repo does not affect gateway HTTP response latency
  • OOM during indexing does not affect the gateway process
  • Failed index jobs automatically retry (up to 3 attempts)
  • Job status is queryable independently of the gateway process lifecycle
  • Two simultaneous reindex triggers for the same repo are deduplicated

Related

  • EPC_012: Console build where this was first observed in UAT
  • ADR-005: Developer token architecture (opaque tokens vs IDP tokens)
  • Code References:
      • apps/prism/prism/gateway/console_routes.py: _run_tier3_full_reindex()
      • apps/prism/prism/tier3/handler.py: Tier3Handler.full_reindex()
      • apps/prism/prism/tier3/clone.py: shallow_clone()
      • apps/prism-console-api/console_api/gateways/prism_gateway_client.py: trigger_index()

Status Updates

  • 2026-02-27: Identified during EPC_012 Console UAT. Memory OOM observed at 531 MiB for a 69 MB private repo. Short-term fix: increased Cloud Run memory to 2 GiB. Queue-based solution deferred to pre-GA.