Building Retry Logic That Reduced Churn at Comcast

How I designed payment recovery flows, dunning sequences, and an internal ops dashboard for Comcast's subscription billing platform.

December 1, 2025 · 4 min read

Node.js · PostgreSQL · Billing · Architecture · Enterprise

Context

Involuntary churn — customers who leave because a payment fails, not because they wanted to cancel — is one of the biggest revenue leaks in subscription businesses. At Comcast's scale, even a small percentage improvement in payment recovery translates to millions in annual revenue.

I worked on the subscription and billing system that handles upgrades, downgrades, and payment recovery across prepaid, postpaid, and family plan types. The system needed to handle complex edge cases while being transparent enough for internal ops teams to debug issues quickly.

The Problem

The existing payment recovery flow had several gaps:

Single retry, no escalation. Failed payments got one retry attempt. If that failed, the subscription was suspended — no graduated recovery path.
No failure-specific messaging. The same generic "payment failed" email went out regardless of failure reason (expired card vs. insufficient funds vs. bank decline).
Ops visibility was poor. When customers called about billing issues, support agents had to query multiple systems to understand what happened.

Approach

1. Multi-Stage Dunning Sequences

I replaced the single-retry approach with a graduated recovery pipeline:

Stage 1 (Day 0): Immediate soft retry. Many failures are transient (bank processing delays, temporary holds). A retry within minutes recovers a meaningful portion of failures with no customer-facing notification needed.

Stage 2 (Day 1–3): First notification. Context-aware messaging based on the failure code:

Expired card → "Update your payment method" with a direct link to the update flow.
Insufficient funds → "We'll try again in 3 days" (no action required from customer).
Hard decline → "Contact your bank" with specific guidance.
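A simplified sketch of how failure-aware messaging could be selected. This is not the production code: the reason names, the message bodies, and the update URL are all illustrative assumptions.

```typescript
// Hypothetical normalized failure reasons (not a real processor's codes).
type FailureReason = "expired_card" | "insufficient_funds" | "hard_decline";

interface DunningMessage {
  subject: string;
  body: string;
  actionUrl?: string; // set only when the customer has to do something
}

// Assumed location of the payment-method update flow.
const UPDATE_URL = "https://example.com/billing/payment-method";

function buildDunningMessage(reason: FailureReason, retryInDays: number): DunningMessage {
  switch (reason) {
    case "expired_card":
      return {
        subject: "Update your payment method",
        body: "Your card has expired. Update it to keep your plan active.",
        actionUrl: UPDATE_URL,
      };
    case "insufficient_funds":
      return {
        subject: `We'll try again in ${retryInDays} days`,
        body: "No action is needed. We will retry the charge automatically.",
      };
    case "hard_decline":
      return {
        subject: "Contact your bank",
        body: "Your bank declined the charge. Ask them to authorize it, or use another payment method.",
      };
  }
}
```

Because the switch is exhaustive over `FailureReason`, adding a new reason to the union forces a matching message to be written before the code compiles.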

Stage 3 (Day 3–7): Second retry with escalated notification. If the card was updated, retry immediately. Otherwise, attempt with the original method (banks sometimes clear holds after a few days).

Stage 4 (Day 7–14): Final attempt with a grace period warning. If that attempt also fails, the subscription is downgraded to a free tier rather than hard-suspended, which keeps the customer in the ecosystem.
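The four stages can be pictured as day windows measured from the first failure. The sketch below is illustrative only. It assumes half-open windows (so day 3 belongs to stage 3 and day 7 to stage 4) and assumes the downgrade happens once day 14 is reached; the stage names are hypothetical.

```typescript
type DunningStage =
  | "soft_retry"
  | "first_notice"
  | "second_retry"
  | "final_attempt"
  | "downgraded";

// Assumption: windows are [fromDay, toDay), which resolves the shared boundary days.
const STAGE_WINDOWS: ReadonlyArray<{ stage: DunningStage; fromDay: number; toDay: number }> = [
  { stage: "soft_retry", fromDay: 0, toDay: 1 },
  { stage: "first_notice", fromDay: 1, toDay: 3 },
  { stage: "second_retry", fromDay: 3, toDay: 7 },
  { stage: "final_attempt", fromDay: 7, toDay: 14 },
];

const MS_PER_DAY = 24 * 60 * 60 * 1000;

function stageFor(firstFailureAt: Date, now: Date): DunningStage {
  // Whole days elapsed since the first failure, clamped so clock skew cannot go negative.
  const elapsedMs = now.getTime() - firstFailureAt.getTime();
  const days = Math.max(0, Math.floor(elapsedMs / MS_PER_DAY));
  const match = STAGE_WINDOWS.find((w) => days >= w.fromDay && days < w.toDay);
  // Past the last window, the subscription moves to the free tier.
  return match ? match.stage : "downgraded";
}
```

Deriving the stage from the first-failure timestamp, rather than storing a counter, means a late or repeated scheduler run still lands on the same stage.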

2. Payment Recovery API Design

The retry engine was built as a background job processor:

Idempotent retries — each attempt has a unique key, preventing double-charges even if the job runner processes a message twice.
Exponential backoff with jitter — retry timing avoids thundering herd problems when payment processors have temporary outages.
Failure classification — each payment processor maps raw error codes to a normalized failure taxonomy (transient, actionable, terminal) that drives the dunning behavior.
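A minimal sketch of the three ideas above, under stated assumptions. The raw codes, the key format, and the fallback class for unknown codes are hypothetical. The in-memory `Map` stands in for durable storage; a real system would more likely enforce the key with a database unique constraint or pass it to the payment processor. The backoff uses the "full jitter" variant, which is one common choice and not necessarily the one used in production.

```typescript
type FailureClass = "transient" | "actionable" | "terminal";

// Hypothetical raw codes for one processor; real mappings come from that processor's docs.
const EXAMPLE_CODE_MAP = new Map<string, FailureClass>([
  ["processing_error", "transient"],
  ["expired_card", "actionable"],
  ["account_closed", "terminal"],
]);

// Assumption: unknown codes fall back to "actionable" so the customer is looped in.
function classify(rawCode: string): FailureClass {
  return EXAMPLE_CODE_MAP.get(rawCode) ?? "actionable";
}

// One key per (invoice, attempt): a redelivered job for the same attempt reuses the key.
function idempotencyKey(invoiceId: string, attempt: number): string {
  return `${invoiceId}:attempt:${attempt}`;
}

class RetryLedger {
  private readonly results = new Map<string, boolean>();

  // Runs `charge` at most once per key; a duplicate delivery gets the stored result.
  run(key: string, charge: () => boolean): boolean {
    const previous = this.results.get(key);
    if (previous !== undefined) return previous;
    const succeeded = charge();
    this.results.set(key, succeeded);
    return succeeded;
  }
}

// Exponential backoff with full jitter: a random delay in [0, min(cap, base * 2^attempt)).
// `random` is injectable so the function can be tested deterministically.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 1000,
  capMs: number = 60000,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}
```

Spreading each delay over the whole interval, instead of retrying at exactly 1s, 2s, 4s, keeps retries from many accounts from hitting a recovering processor at the same instant.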

3. Handling Plan Complexity

Billing recovery isn't one-size-fits-all. The system needed different paths for:

Prepaid accounts: No retry — alert immediately, provide top-up link.
Postpaid accounts: Full dunning sequence with grace periods.
Family plans: Notify the plan owner, not individual members. Handle the case where the plan owner's payment fails but individual member payments succeed.

Each plan type has its own dunning configuration (retry intervals, notification templates, grace periods), stored as versioned config rather than hardcoded logic.
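As an illustration, versioned per-plan configuration could take a shape like the one below. Every field name and value here is a made-up example, not the real configuration. The lookup assumes that callers either pin a version or take the latest one.

```typescript
type PlanType = "prepaid" | "postpaid" | "family";

interface DunningConfig {
  version: number;
  retryOffsetsDays: number[]; // days after the first failure to retry; empty means no retry
  gracePeriodDays: number;
  notify: "account_holder" | "plan_owner";
  templateSet: string; // hypothetical name of a notification template group
}

// Illustrative values only. Old versions stay in place so they can still be looked up.
const DUNNING_CONFIGS: Record<PlanType, DunningConfig[]> = {
  prepaid: [
    { version: 1, retryOffsetsDays: [], gracePeriodDays: 0, notify: "account_holder", templateSet: "prepaid_top_up" },
  ],
  postpaid: [
    { version: 1, retryOffsetsDays: [0, 3], gracePeriodDays: 7, notify: "account_holder", templateSet: "postpaid_v1" },
    { version: 2, retryOffsetsDays: [0, 3, 7], gracePeriodDays: 14, notify: "account_holder", templateSet: "postpaid_v2" },
  ],
  family: [
    { version: 1, retryOffsetsDays: [0, 3, 7], gracePeriodDays: 14, notify: "plan_owner", templateSet: "family_owner" },
  ],
};

// Returns the pinned version if given, otherwise the highest version for the plan.
function configFor(plan: PlanType, version?: number): DunningConfig {
  const versions = DUNNING_CONFIGS[plan];
  const found =
    version === undefined
      ? versions.reduce((latest, c) => (c.version > latest.version ? c : latest))
      : versions.find((c) => c.version === version);
  if (!found) {
    throw new Error(`No dunning config version ${version} for plan type ${plan}`);
  }
  return found;
}
```

One benefit of keeping old versions addressable is that an account already partway through a sequence can finish under the rules it started with, while new failures pick up the latest version.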

4. Internal Operations Dashboard

The support team's biggest pain point was context switching between multiple tools. I built a unified dashboard that showed:

Customer timeline — every billing event (charge, retry, notification, status change) in chronological order with linked payment processor reference IDs.
Recovery status — which dunning stage the customer is in, what's been tried, what's next.
Quick actions — manual retry trigger, grace period extension, plan adjustment — all with audit logging.
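A rough sketch of two of the pieces behind such a dashboard: merging billing events from several sources into one chronological timeline, and wrapping a quick action so it always leaves an audit entry. The types, field names, and the in-memory audit log are assumptions for illustration, not the actual implementation.

```typescript
type BillingEventKind = "charge" | "retry" | "notification" | "status_change";

interface BillingEvent {
  at: string; // ISO 8601 UTC timestamp
  kind: BillingEventKind;
  detail: string;
  processorRef?: string; // payment processor reference ID, when there is one
}

// Merges events from any number of sources into one list, oldest first.
// The inputs are copied, so the callers' arrays are not reordered.
function buildTimeline(...sources: BillingEvent[][]): BillingEvent[] {
  return sources
    .reduce<BillingEvent[]>((all, source) => all.concat(source), [])
    .sort((a, b) => Date.parse(a.at) - Date.parse(b.at));
}

interface AuditEntry {
  agentId: string;
  action: string;
  customerId: string;
  at: string;
}

// In-memory stand-in for a durable audit table.
const auditLog: AuditEntry[] = [];

// Records who did what before running the action, so failed attempts are logged too.
function runQuickAction<T>(agentId: string, action: string, customerId: string, fn: () => T): T {
  auditLog.push({ agentId, action, customerId, at: new Date().toISOString() });
  return fn();
}
```

For example, `buildTimeline(chargeEvents, notificationEvents)` would interleave the two lists by timestamp, and `runQuickAction("agent_7", "manual_retry", "cust_42", retryFn)` would log the attempt and then return whatever `retryFn` returns.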

The dashboard cut billing-ticket resolution time from hours (querying multiple systems, correlating timestamps) to minutes (single view, click to act).

Results

| Metric | Before | After | Impact |
| --------------------------- | ---------- | ---------------------- | ---------------------------------- |
| Failed payment recovery | Baseline | Significantly improved | Multi-stage dunning |
| Involuntary churn reduction | N/A | Measurable savings | Graduated recovery + grace periods |
| Ops ticket resolution | Hours | Minutes | Unified dashboard |
| Double-charge incidents | Occasional | Zero | Idempotent retry keys |

Key Takeaways

Classify failures, don't treat them uniformly. "Payment failed" is not one problem — it's a dozen different problems that need different responses.
Grace periods retain customers. Downgrading to a free tier instead of hard-suspending keeps customers in the ecosystem. Many recover and re-upgrade.
Ops tooling is product work. The dashboard wasn't a side project — it directly reduced support costs and improved customer experience.
Idempotency isn't optional in billing. If you can't safely retry any operation, you'll eventually double-charge someone.

This post describes work I did at Comcast from 2023–2025. Specific internal metrics and code are generalized to respect confidentiality.