Building a Fault-Tolerant Document Processing Pipeline for Healthcare

Every week, thousands of inbound documents arrive from dozens of healthcare providers — faxes, EDI files, scanned forms. Each one needs classification, data extraction, validation against patient records, and routing to the correct downstream system.

The naive approach is one lambda per file. The approach that survives production involves careful partitioning, idempotency contracts, and knowing when not to use serverless.

Here's what I learned building a pipeline that's ingested over 1M documents in production.

The Constraints That Matter

Healthcare document processing has a few non-negotiable properties:

Exactly-once semantics — losing a fax or processing it twice can mean a denied claim, which costs real money
Ordering within a session — a single fax bundle has multiple pages; they must be reassembled in order
Multi-tenant isolation — each provider has different SLAs, retention policies, and downstream endpoints
No $regex on the primary data store — DocumentDB (MongoDB-compatible) doesn't handle unindexed regex scans at volume; one bad query can take down the cluster

These constraints eliminated several architectures early.

Why Not Pure Serverless

The analysis started where most do: a lambda for every event. S3 put → lambda → write to DocumentDB. Simple, scalable, and wrong for this load.

The problem is burstiness with ordering. A provider sends 10,000 pages in a single fax batch. Lambda will spin up hundreds of concurrent invocations — each one racing to write to DocumentDB. Without coordination, pages get written out of order, timestamps drift, and the downstream reassembly logic has to guess.

The fix wasn't distributed locks (too expensive, too fragile). It was batching by arrival, with a 60-second accumulation window. Instead of one lambda per file, we buffer all files that arrive within a window, sort them by the provider's sequence numbers, then submit them as a single ordered batch.

// Example: batch accumulation state
{
  "provider_id": "29424721",
  "files": [
    {"seq": 1, "path": "...", "received_at": "2026-06-06T10:00:01Z"},
    {"seq": 2, "path": "...", "received_at": "2026-06-06T10:00:02Z"}
  ],
  "window_close": "2026-06-06T10:01:00Z"
}

This cut race-condition incidents by about 90%.
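
The accumulation step can be sketched roughly like this. It is a minimal in-memory illustration, not the production code: the BatchAccumulator class, its method names, and the choice to open the window at the first file's arrival time are all assumptions; only the field names come from the state document above.

```javascript
// Hypothetical in-memory accumulator; a real pipeline would persist this state.
const WINDOW_MS = 60 * 1000; // the 60-second accumulation window

class BatchAccumulator {
  constructor() {
    this.open = new Map(); // provider_id -> { provider_id, files, window_close }
  }

  // Record one arrival; the first file seen for a provider opens its window.
  add(providerId, file) {
    let batch = this.open.get(providerId);
    if (!batch) {
      const closeAt = new Date(Date.parse(file.received_at) + WINDOW_MS);
      batch = { provider_id: providerId, files: [], window_close: closeAt.toISOString() };
      this.open.set(providerId, batch);
    }
    batch.files.push(file);
  }

  // Return every batch whose window has closed, with files sorted by the
  // provider's sequence number. Intended to be called on a timer.
  flush(now) {
    const ready = [];
    for (const [providerId, batch] of this.open) {
      if (Date.parse(batch.window_close) <= now.getTime()) {
        batch.files.sort((a, b) => a.seq - b.seq);
        ready.push(batch);
        this.open.delete(providerId);
      }
    }
    return ready;
  }
}

// Usage: files arrive out of order, but the flushed batch is ordered by seq.
const acc = new BatchAccumulator();
acc.add("29424721", { seq: 2, path: "...", received_at: "2026-06-06T10:00:02Z" });
acc.add("29424721", { seq: 1, path: "...", received_at: "2026-06-06T10:00:01Z" });
const ready = acc.flush(new Date("2026-06-06T10:01:05Z"));
```

The point of the design is that ordering is decided once, at flush time, by a single writer, instead of by hundreds of concurrent invocations.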

The Idempotency Contract

Every step in the pipeline needs an idempotency key. We used a composite key: {provider_id}_{provider_tx_id}_{step_number}. If a step fails and retries, the downstream check "have I already processed this key?" prevents duplicates.

The storage for this is a simple TTL collection in DocumentDB: keys expire after 7 days (the maximum expected retry window). This kept the collection small (~10GB) and queries fast.
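
As a rough sketch of the two pieces involved, here is a key builder and a TTL index helper. The function names and the created_at field are my assumptions for illustration; createIndex with expireAfterSeconds is the standard MongoDB driver call, and the indexed field has to hold a real date value for expiry to work.

```javascript
// Build the composite key: {provider_id}_{provider_tx_id}_{step_number}.
function idempotencyKey(providerId, providerTxId, stepNumber) {
  return `${providerId}_${providerTxId}_${stepNumber}`;
}

const SEVEN_DAYS_SECONDS = 7 * 24 * 60 * 60; // the maximum expected retry window

// Create the TTL index. `collection` is any object exposing the MongoDB
// driver's createIndex(keys, options) method. The created_at field name is
// an assumption; records whose field is not a date are never expired.
async function ensureTtlIndex(collection) {
  return collection.createIndex(
    { created_at: 1 },
    { expireAfterSeconds: SEVEN_DAYS_SECONDS }
  );
}
```

Calling idempotencyKey("29424721", "tx-42", 3) gives "29424721_tx-42_3", where "tx-42" is a made-up transaction ID.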

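// Idempotency check (Node.js MongoDB driver; db is a connected handle in scope)
async function processIfNew(key, processor) {
  const keys = db.collection("idempotency");
  const existing = await keys.findOne({ _id: key });
  // Only a completed record carries a result; a locked one is still in flight
  if (existing && existing.completed_at) return existing.result;
  try {
    // Acquire the lock atomically. The upsert creates the record on first sight;
    // if another worker holds the lock or has finished, the filter misses and
    // the upsert fails on _id with duplicate-key error 11000.
    await keys.updateOne(
      { _id: key, locked_at: null, completed_at: null },
      { $set: { locked_at: new Date() } }, // Date, not string: TTL needs dates
      { upsert: true }
    );
  } catch (err) {
    if (err.code === 11000) throw new Error(`${key} is locked or done; retry later`);
    throw err;
  }
  let result;
  try {
    result = await processor();
  } catch (err) {
    await keys.updateOne({ _id: key }, { $unset: { locked_at: "" } }); // release for retry
    throw err;
  }
  await keys.updateOne(
    { _id: key },
    { $set: { result, completed_at: new Date() },
      $unset: { locked_at: "" } }
  );
  return result;
}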

We avoided transactional outbox patterns because multi-document transactions in DocumentDB did not hold up at this volume: the locking overhead kills throughput.

Handling the "Unindexed Query" Trap

In my experience, the number-one cause of production incidents in Mongo-based systems is unindexed queries at scale.

The trap: it works fine at 10K documents. At 1M, a single unindexed $regex query on a text field forces a scan of the entire collection, which can consume all available RAM on the DocumentDB instance and cascade to OOM.

Our rule:

Every query pattern must have an index before it ships. No exceptions.
UI filtering uses only indexed fields (provider_id, status, received_date).
Full-text search goes through a dedicated search service (OpenSearch), not $text indexes on DocumentDB.
Backfills are rate-limited to N operations/second and use cursor-based pagination, not offsets.
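
The backfill rule can be sketched as a loop that pages by the last _id seen rather than by offset. This is an illustration under assumptions: backfill, fetchPage, and handle are hypothetical names, and the fixed sleep between operations is a crude stand-in for a real rate limiter.

```javascript
// Resolve after ms milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical backfill loop. fetchPage(afterId, limit) must return documents
// sorted ascending by _id, starting after afterId (afterId is null on the
// first call). Against a real collection this would be a find on
// { _id: { $gt: afterId } } sorted by _id, which uses the default _id index.
async function backfill(fetchPage, handle, { pageSize = 100, opsPerSecond = 50 } = {}) {
  const delayMs = 1000 / opsPerSecond;
  let afterId = null;
  let processed = 0;
  for (;;) {
    const page = await fetchPage(afterId, pageSize);
    if (page.length === 0) return processed; // nothing left to backfill
    for (const doc of page) {
      await handle(doc);
      processed += 1;
      await sleep(delayMs); // at most opsPerSecond operations per second
    }
    afterId = page[page.length - 1]._id; // resume after the last document seen
  }
}
```

Unlike skip/offset paging, each page costs the same no matter how deep into the collection the backfill is, because the query always starts from an indexed position.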

This required a cultural change: engineers were used to "just query what you need" in development. In production, an unindexed query on a 200GB collection is a fireable offense — or at least a P1 incident.
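
One way to back the rule with code rather than culture is a guard in the query layer. The sketch below is an assumption about how such a guard could look, not what we shipped: assertSafeFilter and containsRegexOperator are hypothetical names, and the allow-list simply mirrors the indexed fields named above.

```javascript
// Fields assumed to be indexed, mirroring the UI filtering rule.
const INDEXED_FIELDS = new Set(["provider_id", "status", "received_date"]);

// Recursively look for a $regex operator or RegExp literal inside a condition.
function containsRegexOperator(value) {
  if (value === null || typeof value !== "object") return false;
  if (value instanceof RegExp) return true;
  return Object.entries(value).some(
    ([key, inner]) => key === "$regex" || containsRegexOperator(inner)
  );
}

// Throw if a filter touches a non-indexed field or uses a regex anywhere.
// Top-level operators such as $or are rejected too; the check is
// deliberately strict.
function assertSafeFilter(filter) {
  for (const [field, condition] of Object.entries(filter)) {
    if (!INDEXED_FIELDS.has(field)) {
      throw new Error(`filter on non-indexed field: ${field}`);
    }
    if (containsRegexOperator(condition)) {
      throw new Error(`$regex is not allowed on field: ${field}`);
    }
  }
  return filter;
}
```

A filter like { provider_id: "29424721", status: { $in: ["failed"] } } passes, while { status: { $regex: "^f" } } or any filter on a field outside the allow-list throws before it reaches the database.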

The Reprocess Pattern

Documents fail. Downstream systems go down, format parsers encounter edge cases, timestamps go stale.

The critical design decision: failed documents stay in their failure bucket, not in a success bucket that needs to be "un-done."

s3://production-bucket/
  incoming/
  processing/
  completed/
  failed/

A reprocess job scans the failed/ prefix and pushes documents back through incoming/. This avoids the bug where a success+delete pattern accidentally consumes a document that was actually in flight, not completed.

The mistake most teams make: copying failed docs to a success folder with a "needs review" flag. That creates ambiguity: did it succeed or not? Our pipeline never writes a document to completed/ unless it was fully processed. If anything breaks, it stays in failed/ and a separate monitor alerts.
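
The reprocess job itself is small. Here is a sketch under assumptions: reprocessFailed and the store interface (list, copy, remove) are hypothetical, standing in for whatever wraps the S3 list, copy, and delete calls, and MemoryStore exists only so the example runs on its own.

```javascript
// Hypothetical reprocess job: move everything under failed/ back to incoming/.
// `store` is an assumed interface with list(prefix), copy(fromKey, toKey),
// and remove(key).
async function reprocessFailed(store) {
  const moved = [];
  for (const key of await store.list("failed/")) {
    const target = "incoming/" + key.slice("failed/".length);
    await store.copy(key, target); // copy first, so a crash never loses the document
    await store.remove(key);       // then remove the failed/ copy
    moved.push(target);
  }
  return moved;
}

// In-memory stand-in for the bucket, for illustration only.
class MemoryStore {
  constructor(objects) {
    this.objects = new Map(Object.entries(objects));
  }
  async list(prefix) {
    return [...this.objects.keys()].filter((k) => k.startsWith(prefix));
  }
  async copy(fromKey, toKey) {
    this.objects.set(toKey, this.objects.get(fromKey));
  }
  async remove(key) {
    this.objects.delete(key);
  }
}
```

If the job dies between the copy and the remove, the document briefly exists under both prefixes. That is acceptable here because the idempotency keys make a second pass through the pipeline harmless.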

What I'd Do Differently

Partition earlier. We started with a single DocumentDB collection. At 1M+ documents per provider, we should have sharded by provider_id from month one.
Invest in dead-letter observability. The pipeline swallows exceptions (the handler reports success to the S3 trigger even on failure). That's fine for resilience, but it means the dead-letter queue needs better dashboards than we had.
Pre-generate batch manifests. Instead of scanning S3 for "what arrived in this window," we now have a lambda that writes an arrival manifest every minute. The processor reads that manifest rather than listing objects.

This is based on my work building Valor Solution's core document processing pipeline, which handles documents across multiple healthcare providers at scale.