Payment Gateway System Design 2026 — Idempotency, Saga Pattern, and Double-Charge Defence for Stripe-scale

Posted on: 4/16/2026 9:11:57 PM

1. Three Guarantees You Cannot Get Wrong — and Why a Payment System Is the Peak of Distributed Systems

Writing a payment system means accepting that every mistake costs hard cash. No feature in a SaaS product has a stricter "never wrong, not even once" requirement than payments: a small bug can charge a customer twice, issue the same refund twice, or worse, silently swallow a transaction so that nobody notices until end-of-month reconciliation. That's why every serious payments platform (Stripe, Adyen, PayPal, VNPay, MoMo) lives and dies by three non-negotiable guarantees.

0: Acceptable double-charges
≥99.99%: Production target for authorization success rate
<5s: End-to-end p99 auth via 3DS
T+0: On-time reconciliation with the acquirer

The three core guarantees of a payment system

  • No double charge — a single payment intent, no matter how many times it is retried, goes through the acquirer exactly once, regardless of client retries, network flaps, or a server crash mid-flight.
  • No double refund — the inverse: a refund request triggered during a maintenance window, in a webhook retry, or by a support agent's click, still refunds exactly once.
  • No lost transactions — every time money leaves a customer's account, the system records it fully and does not drop it; each payment's state is reconstructable from the ledger, even if some service dies for an hour or two.

These three guarantees aren't "features" — they're the moral contract the system must keep no matter what goes wrong around it: a DB replica drops, a message broker swallows a message, the acquirer times out, webhooks duplicate, or a PM accidentally clicks "reprocess" twice in the admin tool. This whole article revolves around industrial techniques to keep those three guarantees, expressed in the language of .NET 10 and the stack familiar to Vietnamese engineering teams.

2. Overall Architecture of a Standard 2026 Payment Gateway

A payment system isn't a single service; it's a chain of services with very sharp responsibility boundaries. Bundling them into one monolith sounds simple but turns into a disaster when operations can't say exactly who owns which part during an incident, and payment incidents always seem to happen at 2am.

flowchart LR
    CL(["Client / Checkout UI"]) --> API["Payment API<br/>(.NET 10, Minimal API)"]
    API --> IDP[("Idempotency Store<br/>Redis + SQL backup")]
    API --> LDG[("Ledger DB<br/>Postgres Serializable")]
    API --> ORC["Saga Orchestrator<br/>Temporal / MassTransit"]
    ORC --> RISK["Risk / Fraud<br/>Engine"]
    ORC --> TOK["Tokenization<br/>Vault"]
    ORC --> ACQ(("Acquirer / PSP<br/>Stripe, Adyen, VNPay"))
    ACQ -. async .-> WHK["Webhook<br/>Listener"]
    WHK --> LDG
    LDG --> RECON["Reconciliation<br/>Worker (nightly)"]
    RECON --> REP[("Acquirer Report<br/>SFTP / API")]
    LDG --> OUT[("Transactional<br/>Outbox")]
    OUT --> BUS[("Event Bus<br/>Kafka / RabbitMQ")]
    BUS --> DW[("Data Warehouse<br/>ClickHouse / BigQuery")]

Figure 1: Layered architecture of a payment gateway — API, ledger, orchestrator, acquirer, reconciliation

There are five distinct responsibility zones to identify:

  • Payment API — accepts the client request, verifies idempotency, creates an intent, returns a client_secret or redirect URL. Never calls the acquirer directly; only writes the intent and hands off to the orchestrator.
  • Saga Orchestrator — coordinates the sequence of steps (risk check → tokenize → authorize → capture → webhook) with the ability to compensate each step. This is where the state machine "lives" and where execution resumes after a crash.
  • Ledger — the system's economic source of truth. Every movement (authorize, capture, refund, chargeback) is an immutable row. No UPDATE, only INSERT; balances are sums.
  • Webhook Listener — consumes async events from the acquirer (payment_intent.succeeded, charge.refunded, dispute.created). Verifies signatures, updates the ledger, triggers downstream.
  • Reconciliation — a nightly worker reconciling the internal ledger with the acquirer's settlement file, catching mismatches before they become an accountant's dispute.

3. Idempotency Key — The First Shield Against Double-Charge

The idempotency key is the most important shield, and also the technique most frequently implemented incorrectly. The principle was popularised by Stripe, whose 2017 engineering write-up on idempotency keys remains the usual reference, and is now an industry default: the client generates a unique key per payment intent; any payment-creating request sent with the same key returns the same response, no matter how many times it is retried.

The key isn't just "de-dup": it's a contract between client and server stating "this request is the same intention, don't process it again". When a network flap causes the client to miss the response and retry, the server must recognise this and replay the old response — not create a new authorization.

// Payment API endpoint: .NET 10 Minimal API, idempotency done right
// Needs: using System.Security.Cryptography; using System.Text.Json; using Microsoft.AspNetCore.Mvc;
app.MapPost("/v1/payment_intents", async (
    [FromHeader(Name = "Idempotency-Key")] string? idemKey,   // nullable so the handler, not the binder, rejects a missing header
    [FromBody] CreateIntentRequest req,
    IIdempotencyStore idem,
    IPaymentService svc,
    CancellationToken ct) =>
{
    if (string.IsNullOrWhiteSpace(idemKey) || idemKey.Length > 255)
        return Results.BadRequest("Idempotency-Key header is required (max 255 characters)");

    // Hash the body so that reuse of a key with a different body can be detected
    var bodyHash = Convert.ToHexString(SHA256.HashData(JsonSerializer.SerializeToUtf8Bytes(req)));

    var saved = await idem.TryBeginAsync(idemKey, bodyHash, ct);
    if (saved is { Status: IdemStatus.Completed } done)
        return Results.Content(done.ResponseJson, "application/json", null, done.StatusCode);

    if (saved is { Status: IdemStatus.InFlight })
        return Results.StatusCode(409); // conflict — let client retry later

    if (saved is { Status: IdemStatus.BodyMismatch })
        return Results.StatusCode(422); // key reused with different body — client bug

    try
    {
        var intent = await svc.CreateIntentAsync(req, ct);
        var resp = JsonSerializer.Serialize(intent);
        await idem.CompleteAsync(idemKey, 201, resp, ct);
        return Results.Content(resp, "application/json", null, 201);
    }
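    catch (Exception ex)
    {
        // CancellationToken.None: the failure must be recorded even if the request itself was aborted
        await idem.FailAsync(idemKey, ex.Message, CancellationToken.None);
        throw;
    }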
});

Five details that separate correct from broken idempotency

  • Store the body hash too, not just the key. If the client sends the same key with a different body (bug or attack), the server must reject with 422 rather than replay the old response and create a false impression.
  • An explicit In-Flight status. A request currently running must be marked so retries receive 409 and wait, instead of running in parallel and creating two authorizations.
  • TTL 24–72 hours is the sweet spot. Shorter and retries after a crash won't match; longer and storage bloats indefinitely.
  • A serializable transaction for the insert phase. Races between two requests sharing a key must be stopped at the DB layer, not trusted to application-level logic.
  • Scope by tenant/merchant. Key "abc-123" for merchant A must not collide with merchant B; the composite primary key is always (tenant_id, idem_key).

Ideal idempotency storage is two-tier: Redis as the first hit serving checks under 5ms, but every change also written to Postgres as the source of truth. Losing Redis doesn't lose money; losing Postgres does. The "Redis write-through Postgres" pattern is the standard — don't use Redis as the sole store.
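
The handler above relies on an `IIdempotencyStore` that the article does not show. As a rough illustration of the semantics in the five details (tenant-scoped key, body hash, explicit in-flight state, TTL), here is a minimal in-memory sketch. `InMemoryIdempotencyStore`, `BeginResult` and `IdemRecord` are hypothetical names, and the dictionary merely stands in for the Postgres table that would be the real source of truth.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

var store = new InMemoryIdempotencyStore(TimeSpan.FromHours(24));
var now = DateTimeOffset.UtcNow;

Console.WriteLine(store.TryBegin("merchant-a", "abc-123", "hash-1", now)); // Started
Console.WriteLine(store.TryBegin("merchant-a", "abc-123", "hash-1", now)); // InFlight
store.Complete("merchant-a", "abc-123", 201, "{\"id\":\"pi_1\"}");
Console.WriteLine(store.TryBegin("merchant-a", "abc-123", "hash-1", now)); // Completed
Console.WriteLine(store.TryBegin("merchant-a", "abc-123", "hash-2", now)); // BodyMismatch
Console.WriteLine(store.TryBegin("merchant-b", "abc-123", "hash-9", now)); // Started

public enum BeginResult { Started, InFlight, Completed, BodyMismatch }

public sealed record IdemRecord(string BodyHash, DateTimeOffset ExpiresAt, bool Done, int StatusCode, string? ResponseJson);

// Hypothetical stand-in for the store; a production version would be backed by Postgres.
public sealed class InMemoryIdempotencyStore
{
    // Composite key (tenant, idem_key): the same key under two merchants never collides.
    private readonly ConcurrentDictionary<(string Tenant, string Key), IdemRecord> _rows = new();
    private readonly TimeSpan _ttl;

    public InMemoryIdempotencyStore(TimeSpan ttl) => _ttl = ttl;

    public BeginResult TryBegin(string tenant, string key, string bodyHash, DateTimeOffset now)
    {
        var id = (tenant, key);
        // Drop an expired row so the key becomes usable again after its TTL.
        if (_rows.TryGetValue(id, out var old) && old.ExpiresAt <= now)
            _rows.TryRemove(KeyValuePair.Create(id, old));

        var fresh = new IdemRecord(bodyHash, now + _ttl, Done: false, StatusCode: 0, ResponseJson: null);
        // GetOrAdd is atomic: of two racing requests, only one sees its own record come back.
        var row = _rows.GetOrAdd(id, fresh);
        if (ReferenceEquals(row, fresh)) return BeginResult.Started;
        if (row.BodyHash != bodyHash) return BeginResult.BodyMismatch;
        return row.Done ? BeginResult.Completed : BeginResult.InFlight;
    }

    public void Complete(string tenant, string key, int statusCode, string responseJson)
    {
        var id = (tenant, key);
        if (_rows.TryGetValue(id, out var row))
            _rows[id] = row with { Done = true, StatusCode = statusCode, ResponseJson = responseJson };
    }
}
```

In a database-backed version the atomic `GetOrAdd` would correspond to an insert guarded by the `(tenant_id, idem_key)` primary key.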

4. Double-Entry Ledger — The Economic Source of Truth That You Only Append, Never Update

The single most-miswritten database design in payment systems is a payments table with a status column that gets UPDATEd repeatedly. This design dies the moment there's a dispute: there's no way to know what state the payment was in at which point, no audit path, no way to rebuild balances. Accounting solved this 500 years ago: double-entry ledger.

flowchart LR
    subgraph Ledger
        LT[("ledger_txn<br/>id, type, idem_key, created_at")]
        LE[("ledger_entry<br/>txn_id, account_id, amount, sign")]
    end
    LT --- LE
    A1(["customer:123:available"])
    A2(["merchant:anhtu:pending"])
    A3(["merchant:anhtu:available"])
    A4(["bank:acquirer:stripe"])
    LE -.->|"capture"| A1
    LE -.->|"capture"| A2
    LE -.->|"settle T+2"| A2
    LE -.->|"settle T+2"| A3
    LE -.->|"payout"| A3
    LE -.->|"payout"| A4

Figure 2: Double-entry ledger — every transaction produces at least two balancing entries; the balance is a sum per account

Two principles you cannot break:

  • Entries are append-only. No UPDATE, no DELETE. "Cancelling" a transaction means writing another transaction with the opposite sign (reversal), leaving the audit log intact.
  • Each transaction's entries must sum to 0. If you take 100k from account A, exactly 100k must be written to another account. Enforce this with a trigger or domain rule so the ledger never "inflates" or "deflates" without reason.

-- Minimal but accurate Postgres schema
CREATE TABLE ledger_txn (
    id              bigint PRIMARY KEY,
    tenant_id       bigint NOT NULL,
    type            text   NOT NULL,   -- authorize | capture | refund | chargeback
    idem_key        text   NOT NULL,
    created_at      timestamptz NOT NULL DEFAULT now(),
    UNIQUE (tenant_id, idem_key)
);

CREATE TABLE ledger_entry (
    id              bigserial PRIMARY KEY,
    txn_id          bigint NOT NULL REFERENCES ledger_txn(id),
    account_id      text   NOT NULL,   -- 'customer:123:available'
    currency        char(3) NOT NULL,
    amount_minor    bigint NOT NULL,   -- signed, minor units
    created_at      timestamptz NOT NULL DEFAULT now()
);

-- invariant: per-txn sum(amount_minor) = 0
CREATE OR REPLACE FUNCTION ensure_balanced() RETURNS trigger AS $$
DECLARE s bigint;
BEGIN
    SELECT sum(amount_minor) INTO s FROM ledger_entry WHERE txn_id = NEW.txn_id;
    IF s <> 0 THEN
        RAISE EXCEPTION 'ledger txn % imbalanced by %', NEW.txn_id, s;
    END IF;
    RETURN NULL;
END; $$ LANGUAGE plpgsql;

CREATE CONSTRAINT TRIGGER ledger_balance_check
AFTER INSERT ON ledger_entry DEFERRABLE INITIALLY DEFERRED
FOR EACH ROW EXECUTE FUNCTION ensure_balanced();

Account balance is a view: SELECT sum(amount_minor) FROM ledger_entry WHERE account_id = ?. At high volume you cache balances in a continuously-rebuilt materialized table — but the source of truth is always the sum. On disputes, you can replay the entire ledger to prove every cent flowed correctly.
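
The two principles (append-only entries, every transaction summing to zero) can also be checked in the domain layer before anything reaches the database. The following is a minimal sketch under simplifying assumptions: a single currency, an in-memory list instead of `ledger_entry`, and hypothetical `Ledger` and `Entry` types. The account names come from Figure 2.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var ledger = new Ledger();
// Capture of 100,000 minor units: the customer account is debited, merchant pending is credited.
ledger.Post("capture",
    new Entry("customer:123:available", -100_000),
    new Entry("merchant:anhtu:pending", 100_000));
// A cancellation is a new transaction with opposite signs; the original rows stay untouched.
ledger.Post("reversal",
    new Entry("customer:123:available", 100_000),
    new Entry("merchant:anhtu:pending", -100_000));

Console.WriteLine(ledger.Balance("merchant:anhtu:pending")); // 0

public readonly record struct Entry(string AccountId, long AmountMinor);

// Hypothetical in-memory ledger; assumes all entries share one currency.
public sealed class Ledger
{
    // Rows are only ever appended, never updated or removed.
    private readonly List<(string Type, Entry Entry)> _rows = new();

    public void Post(string type, params Entry[] entries)
    {
        long sum = entries.Sum(e => e.AmountMinor);
        // Invariant: at least two entries, and they must balance to zero.
        if (entries.Length < 2 || sum != 0)
            throw new InvalidOperationException($"ledger txn imbalanced by {sum}");
        foreach (var e in entries) _rows.Add((type, e));
    }

    // The balance is never stored; it is always the sum of the account's entries.
    public long Balance(string accountId) =>
        _rows.Where(r => r.Entry.AccountId == accountId).Sum(r => r.Entry.AmountMinor);
}
```

A domain check like this complements the deferred trigger rather than replacing it; the database remains the final enforcement point.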

5. Saga Pattern — When One Authorization Is Five Steps That Can Fail Anywhere

A payment isn't a query. It's a sequence: risk check → tokenize card → call acquirer to authorize → write ledger → emit event. Each step can time out, fail, or return "maybe" (the acquirer especially). And each step has a compensating action if a later step fails: cancel the authorization, issue a refund, emit a compensating event. This is the essence of the Saga Pattern.

| Criterion | Saga Choreography | Saga Orchestration |
|---|---|---|
| Control | Each service listens to events and reacts | A central orchestrator calls each service |
| Coupling | Low: services don't know each other | Higher: the orchestrator knows the whole flow |
| Observability | Hard: flow scattered across the event log | Easy: centralised state machine |
| Compensation | Complex: each service remembers its own | Direct: the orchestrator calls the inverse action |
| Best for | Simple flows, <4 steps, independent teams | Payment, booking, order: many rollbackable steps |
| 2026 tooling | MassTransit, NServiceBus, Kafka | Temporal, Cadence, Dapr Workflows, AWS Step Functions |

For payments, orchestration is the right choice almost every time. Flows have 5–10 steps, each step has a clear compensation, and observability is a hard requirement for accounting and compliance. Temporal (or Dapr Workflows for lighter teams) is the standard tool.

sequenceDiagram
    autonumber
    participant C as Client
    participant A as Payment API
    participant T as Temporal Worker
    participant R as Risk Engine
    participant V as Vault
    participant P as PSP (Stripe)
    participant L as Ledger
    C->>A: POST /intents (Idempotency-Key)
    A->>T: StartWorkflow(intentId)
    T->>R: RiskCheck(card, user, ip)
    R-->>T: score=0.2 approved
    T->>V: TokenizeCard(PAN)
    V-->>T: token=tok_abc
    T->>P: Authorize(token, amount)
    P-->>T: auth_id=ch_123 approved
    T->>L: WriteAuthorizeEntries
    T-->>A: Workflow complete
    A-->>C: 201 Created (intent)
    Note over T,P: If any step fails,<br/>compensation walks it back

Figure 3: Orchestrated saga for authorize — every step has its own timeout, retry policy, and explicit compensation

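// Temporal workflow for authorize intent (.NET SDK v2)
using Temporalio.Exceptions;   // ActivityFailureException
using Temporalio.Workflows;    // [Workflow], [WorkflowRun], Workflow.ExecuteActivityAsync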
[Workflow]
public class AuthorizeIntentWorkflow
{
    [WorkflowRun]
    public async Task<IntentResult> RunAsync(AuthorizeInput input)
    {
        // Each activity has its own retry policy; exceeding attempts raises to the workflow
        var risk = await Workflow.ExecuteActivityAsync(
            (IRiskActivities a) => a.ScoreAsync(input),
            new() { StartToCloseTimeout = TimeSpan.FromSeconds(5),
                    RetryPolicy = new() { MaximumAttempts = 3 } });
        if (risk.Decision == "deny")
            return IntentResult.Declined("risk_block");

        var token = await Workflow.ExecuteActivityAsync(
            (IVaultActivities a) => a.TokenizeAsync(input.Card),
            new() { StartToCloseTimeout = TimeSpan.FromSeconds(3) });

        try
        {
            var auth = await Workflow.ExecuteActivityAsync(
                (IPsPActivities a) => a.AuthorizeAsync(token, input.AmountMinor, input.Currency),
                new() { StartToCloseTimeout = TimeSpan.FromSeconds(30),
                        RetryPolicy = new() { MaximumAttempts = 1 } });  // DO NOT retry acquirer
            await Workflow.ExecuteActivityAsync(
                (ILedgerActivities a) => a.RecordAuthorizeAsync(input.IntentId, auth),
                new() { StartToCloseTimeout = TimeSpan.FromSeconds(2) });
            return IntentResult.Approved(auth.AuthId);
        }
        catch (ActivityFailureException)
        {
            // compensation — don't call PSP because we may or may not have an auth; let recon sort it
            await Workflow.ExecuteActivityAsync(
                (ILedgerActivities a) => a.MarkIntentFailedAsync(input.IntentId),
                new() { StartToCloseTimeout = TimeSpan.FromSeconds(2) });
            throw;
        }
    }
}

Never auto-retry authorize calls to the acquirer

The natural instinct when an activity fails is to retry. But for an authorize call, retry is mortally dangerous: a timeout does not mean failure — the money may already be held on the acquirer's side while the network simply lost the response. Retrying creates a second authorization. The golden rule: authorize/capture/refund calls to the acquirer only retry when the acquirer itself supports an idempotency key (Stripe, Adyen do; some domestic PSPs do not). When it doesn't: call once, and let the reconciliation worker track it down later.
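
One way to encode the golden rule is to derive the retry budget from what the PSP supports, instead of hard-coding it per call site. The sketch below is illustrative only: `PspProfile` and `AcquirerRetry` are hypothetical names, and the budget of 3 attempts is an assumed value, not a recommendation from any PSP.

```csharp
using System;

var stripe = new PspProfile("stripe", SupportsIdempotencyKey: true);
var domestic = new PspProfile("domestic-psp", SupportsIdempotencyKey: false);

Console.WriteLine(AcquirerRetry.MaxAttempts(stripe));                     // 3
Console.WriteLine(AcquirerRetry.MaxAttempts(domestic));                   // 1
Console.WriteLine(AcquirerRetry.PspIdempotencyKey("pi_42", "authorize")); // pi_42:authorize

// Hypothetical description of a PSP's capabilities.
public sealed record PspProfile(string Name, bool SupportsIdempotencyKey);

public static class AcquirerRetry
{
    // Retrying is only safe when the PSP deduplicates on a key we control;
    // otherwise call once and leave the outcome to reconciliation.
    public static int MaxAttempts(PspProfile psp) => psp.SupportsIdempotencyKey ? 3 : 1;

    // The key is built from stable identifiers, so every retry of the same step sends
    // the same key, while authorize and capture of one intent get distinct keys.
    public static string PspIdempotencyKey(string intentId, string operation) =>
        $"{intentId}:{operation}";
}
```

The important property is that the key is a pure function of the intent and the operation: a key regenerated on each attempt would defeat the PSP's deduplication.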

6. Transactional Outbox — The Bridge Guaranteeing DB Commit and Message Publish

A very common bug: the service writes to the ledger successfully, then emits a "payment.succeeded" event to Kafka, then returns a response to the client. Problem: those two steps aren't atomic. If the service crashes between them, the ledger is written but the event never fires, and downstream (email confirmation, analytics, loyalty points) never runs. The fix pattern is called Transactional Outbox.

flowchart LR
    API["Payment API"] --> TX{"BEGIN TX"}
    TX --> L[("ledger_entry")]
    TX --> O[("outbox_event")]
    TX --> C{"COMMIT"}
    C --> R["Outbox Relay<br/>(CDC or poller)"]
    R --> B[("Kafka / RabbitMQ")]
    B --> D1["Email Service"]
    B --> D2["Loyalty Service"]
    B --> D3["Analytics"]

Figure 4: Outbox pattern — ledger and event commit in the same transaction; the relay pushes to the bus afterwards

The mechanism is simple but rock-solid: the ledger row and the event row are written in the same SQL transaction. COMMIT commits both; on crash, both roll back. A dedicated relay worker reads outbox_event and publishes to the bus, marking rows as published. The bus handles at-least-once; consumers must be idempotent.

-- Outbox table
CREATE TABLE outbox_event (
    id           bigserial PRIMARY KEY,
    aggregate_id text NOT NULL,
    event_type   text NOT NULL,
    payload      jsonb NOT NULL,
    created_at   timestamptz NOT NULL DEFAULT now(),
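    published_at timestamptz
);

-- Postgres has no inline INDEX clause in CREATE TABLE; the partial index is its own statement
CREATE INDEX outbox_unpublished ON outbox_event (created_at) WHERE published_at IS NULL;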

// Relay worker (.NET BackgroundService, assuming Dapper over Npgsql): read a batch of 100, publish, mark
while (!stoppingToken.IsCancellationRequested)
{
    await using var tx = await db.BeginTransactionAsync(stoppingToken);
    // The query must run inside tx, otherwise the row locks are not held until commit
    var batch = (await db.QueryAsync<OutboxRow>(
        "SELECT * FROM outbox_event WHERE published_at IS NULL " +
        "ORDER BY id FOR UPDATE SKIP LOCKED LIMIT 100", transaction: tx)).ToList();
    if (batch.Count == 0) { await Task.Delay(200, stoppingToken); continue; }
    // A crash after publishing but before COMMIT republishes the batch: delivery is at-least-once
    foreach (var row in batch)
        await producer.ProduceAsync("payment.events", row.ToKafkaMessage(), stoppingToken);
    await db.ExecuteAsync(
        "UPDATE outbox_event SET published_at = now() WHERE id = ANY(@ids)",
        new { ids = batch.Select(b => b.Id).ToArray() }, transaction: tx);
    await tx.CommitAsync(stoppingToken);
}

FOR UPDATE SKIP LOCKED is the crucial detail — it lets many relay workers run in parallel without stepping on each other. CDC-based outbox (Debezium reading the Postgres WAL and pushing to Kafka) is an advanced variant for throughput beyond 10k events/s.
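
Because the relay delivers at-least-once, each downstream consumer has to tolerate the same event arriving twice. A minimal sketch of that idea, with a hypothetical `IdempotentConsumer` that remembers processed event IDs in memory:

```csharp
using System;
using System.Collections.Generic;

var consumer = new IdempotentConsumer();
var emailsSent = 0;

consumer.Handle("evt_1", () => emailsSent++);
consumer.Handle("evt_1", () => emailsSent++);   // redelivery of the same event: skipped
consumer.Handle("evt_2", () => emailsSent++);

Console.WriteLine(emailsSent); // 2

// Hypothetical consumer wrapper; the HashSet stands in for a processed_event table.
public sealed class IdempotentConsumer
{
    private readonly HashSet<string> _processed = new();

    // Returns true when the side effect ran, false when the event was a duplicate.
    public bool Handle(string eventId, Action sideEffect)
    {
        if (_processed.Contains(eventId)) return false;
        sideEffect();
        // Recorded only after success, so a failed attempt can be retried later.
        _processed.Add(eventId);
        return true;
    }
}
```

In a real consumer the processed-ID marker and the side effect would need to commit in the same database transaction; otherwise a crash between the two reintroduces the duplicate.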

7. Acquirer Webhooks — All At-Least-Once, Signed, and Replay-Protected

Most of the payment flow actually completes via webhook, not via the initial HTTP response. Authorize might succeed synchronously, but 3DS challenges, async capture, refunds and chargebacks all return via webhook. The webhook listener is an extremely risky subsystem if underestimated; the table below lists five common mistakes, several of them fatal.

| Common mistake | Symptom | Consequence | Defence |
|---|---|---|---|
| No signature verification | Accepts requests pretending to be from PSP | Attacker "confirms" fake payments | HMAC check with shared secret, reject outside tolerance window |
| Duplicates not handled | PSP retries 2–3 times, logged each time | Double ledger entry, broken bookkeeping | Idempotent on the PSP's event_id |
| Return 2xx too early | PSP thinks you processed it, but you didn't | Lost events when the worker crashes mid-process | Persist to an internal queue first, ack after |
| No out-of-order handling | succeeded arrives before created | State machine rejects a valid event | Buffer and resolve by event_type precedence |
| Slow inline processing | PSP times out, retry storm | Webhook queue tens of thousands deep | Accept + persist + 200 immediately, process async |

// Proper webhook handler: .NET 10 Minimal API
app.MapPost("/webhooks/stripe", async (
    HttpRequest httpReq,
    [FromServices] IWebhookVerifier verifier,
    [FromServices] IWebhookQueue queue,
    CancellationToken ct) =>
{
    using var reader = new StreamReader(httpReq.Body);
    var rawBody = await reader.ReadToEndAsync(ct);
    var signature = httpReq.Headers["Stripe-Signature"].ToString();

    // 1. Verify HMAC with 5-minute tolerance to defeat replay
    if (!verifier.VerifyAndCheckTimestamp(rawBody, signature, TimeSpan.FromMinutes(5)))
        return Results.Unauthorized();

    var evt = JsonSerializer.Deserialize<StripeEvent>(rawBody)!;

    // 2. Idempotent on Stripe's event.id — a duplicate returns 200 immediately
    if (!await queue.EnqueueIfNewAsync(evt.Id, evt.Type, rawBody, ct))
        return Results.Ok(); // already seen; ack so Stripe stops retrying

    // 3. Return 200 immediately; the worker processes async
    return Results.Ok();
});

Rule for tolerating late events: every state machine must accept any arrival order. If payment_intent.succeeded arrives before payment_intent.created, don't reject — mark it pending and reconcile once the earlier event arrives. Major PSPs guarantee at-least-once but not total ordering.
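
Step 1 of the handler delegates to `verifier.VerifyAndCheckTimestamp`, which the article does not show. The sketch below is one plausible shape for it, assuming a Stripe-style header of the form `t=<unix seconds>,v1=<hex HMAC-SHA256 of "t.body">`. The `WebhookSignature` class and the secret are hypothetical; other PSPs use different header layouts, so check their documentation before reusing it.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

const string secret = "whsec_test"; // hypothetical endpoint secret
var now = DateTimeOffset.UtcNow;
var body = "{\"id\":\"evt_1\",\"type\":\"charge.refunded\"}";
var ts = now.ToUnixTimeSeconds();
var header = $"t={ts},v1={WebhookSignature.Compute(secret, ts, body)}";
var tolerance = TimeSpan.FromMinutes(5);

Console.WriteLine(WebhookSignature.Verify(secret, body, header, now, tolerance));               // True
Console.WriteLine(WebhookSignature.Verify(secret, body + " ", header, now, tolerance));         // False
Console.WriteLine(WebhookSignature.Verify(secret, body, header, now.AddMinutes(6), tolerance)); // False

public static class WebhookSignature
{
    // HMAC-SHA256 over "<timestamp>.<raw body>", hex-encoded in lowercase.
    public static string Compute(string secret, long timestamp, string rawBody)
    {
        var mac = HMACSHA256.HashData(
            Encoding.UTF8.GetBytes(secret),
            Encoding.UTF8.GetBytes($"{timestamp}.{rawBody}"));
        return Convert.ToHexString(mac).ToLowerInvariant();
    }

    public static bool Verify(string secret, string rawBody, string header, DateTimeOffset now, TimeSpan tolerance)
    {
        // Split "t=...,v1=..." into key/value pairs; malformed parts are dropped.
        var parts = header.Split(',').Select(p => p.Split('=', 2)).Where(p => p.Length == 2).ToList();
        var t = parts.FirstOrDefault(p => p[0] == "t")?[1];
        if (!long.TryParse(t, out var timestamp)) return false;

        // Reject events outside the tolerance window so a captured request cannot be replayed later.
        var skewSeconds = (double)now.ToUnixTimeSeconds() - timestamp;
        if (Math.Abs(skewSeconds) > tolerance.TotalSeconds) return false;

        // Constant-time comparison against every v1 signature present in the header.
        var expected = Encoding.ASCII.GetBytes(Compute(secret, timestamp, rawBody));
        return parts.Where(p => p[0] == "v1")
                    .Any(p => CryptographicOperations.FixedTimeEquals(expected, Encoding.ASCII.GetBytes(p[1])));
    }
}
```

The signature has to be computed over the raw request body exactly as received, which is why the handler reads the stream itself before deserializing.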

8. Reconciliation — The Nightly Worker That Sees Everything Webhooks Missed

No matter how hard webhooks and the saga try, there's still a class of events that never make it to your system: events swallowed when the PSP changes formats, webhook retries that ran out, network partitions lasting hours. That's why every serious payment system has a reconciliation worker running nightly, reconciling the internal ledger against the settlement report from the acquirer.

23:30 — fetch settlement
The worker pulls the settlement file or calls /v1/balance_transactions for all of day T-1 and writes into a staging table psp_settlement_raw.
23:45 — normalize
Normalise formats (Stripe, Adyen, VNPay each have their own) into a common schema: (psp_ref, type, amount_minor, currency, occurred_at).
00:00 — diff
FULL OUTER JOIN ledger against settlement by psp_ref (a LEFT JOIN from the ledger side alone cannot surface rows that exist only in the settlement). Three kinds of mismatch: (a) in ledger, not in settlement: possible phantom auth; (b) in settlement, not in ledger: lost webhook; (c) amount drift: risk change or partial capture.
00:30 — auto-heal
For type (b), re-query the PSP by psp_ref; if confirmed valid, write a supplementary ledger entry with txn type recon_backfill. Record the metric recon.backfilled_total.
01:00 — alert
Remaining mismatches after auto-heal go into recon_exception and wake the accounting PagerDuty rotation. SLA: clear every exception within 48 hours.
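
The diff step can be sketched in a few lines. This is an illustration only: `Recon`, `Row` and `Mismatch` are hypothetical types, the data is invented for the example, and the sketch assumes one row per `psp_ref` on each side (a real run would aggregate partial captures first).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var ledger = new[] { new Row("ch_1", 5000), new Row("ch_2", 7000), new Row("ch_3", 900) };
var settlement = new[] { new Row("ch_2", 6500), new Row("ch_3", 900), new Row("ch_4", 1200) };

foreach (var m in Recon.Diff(ledger, settlement))
    Console.WriteLine($"{m.PspRef}: {m.Kind}");
// ch_1: LedgerOnly
// ch_2: AmountDrift
// ch_4: SettlementOnly

public enum MismatchKind { LedgerOnly, SettlementOnly, AmountDrift }
public sealed record Row(string PspRef, long AmountMinor);
public sealed record Mismatch(string PspRef, MismatchKind Kind);

public static class Recon
{
    public static IEnumerable<Mismatch> Diff(IEnumerable<Row> ledger, IEnumerable<Row> settlement)
    {
        var l = ledger.ToDictionary(r => r.PspRef);
        var s = settlement.ToDictionary(r => r.PspRef);

        // The union of keys plays the role of a FULL OUTER JOIN on psp_ref.
        foreach (var key in l.Keys.Union(s.Keys).OrderBy(k => k, StringComparer.Ordinal))
        {
            var inLedger = l.TryGetValue(key, out var lr);
            var inSettlement = s.TryGetValue(key, out var sr);

            if (!inSettlement) yield return new Mismatch(key, MismatchKind.LedgerOnly);          // (a) possible phantom auth
            else if (!inLedger) yield return new Mismatch(key, MismatchKind.SettlementOnly);     // (b) lost webhook
            else if (lr!.AmountMinor != sr!.AmountMinor)
                yield return new Mismatch(key, MismatchKind.AmountDrift);                        // (c) amount drift
        }
    }
}
```

Only the `SettlementOnly` rows are candidates for the auto-heal step; the other two kinds go straight to the exception queue in this reading of the pipeline.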

Reconciliation isn't supporting cast — it's the third line of defence

Idempotency blocks double-charge at request time; the Temporal saga ensures workflows don't drop mid-execution; reconciliation ensures the end-of-day balance is right no matter what. These are three independent layers, defending against failures at three different moments. Skip any layer and sooner or later accountants will be counting by hand.

9. 3DS 2.x and SCA — The Async Flow That Needs Its Own State Machine

Since PSD2 in Europe and equivalents in many countries, Strong Customer Authentication (SCA) via 3DS 2.x is no longer optional. This flow turns authorize from "call the API and get a result" into "initiate challenge, redirect user, wait for the browser to come back, handle the outcome". A dedicated state machine is mandatory.

stateDiagram-v2
    [*] --> Requires_PM: create intent
    Requires_PM --> Requires_Action: attach card, PSP returns requires_action
    Requires_Action --> Processing: user completes 3DS challenge, browser returns
    Processing --> Succeeded: acquirer confirms authorize
    Processing --> Failed: acquirer declines or 3DS times out
    Requires_Action --> Failed: user closes browser past 10 minutes
    Succeeded --> Captured: capture at T+0
    Captured --> Refunded: partial or full refund
    Captured --> Disputed: chargeback
    Disputed --> Captured: dispute_won
    Disputed --> Refunded: dispute_lost

Figure 5: Intent state machine including async 3DS branches, disputes, and refunds
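
The diagram in Figure 5 translates directly into a transition table that rejects any move the diagram does not draw. The sketch below assumes the states and edges exactly as shown in the figure; `IntentState` and `IntentStateMachine` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

Console.WriteLine(IntentStateMachine.CanMove(IntentState.RequiresAction, IntentState.Processing)); // True
Console.WriteLine(IntentStateMachine.CanMove(IntentState.Disputed, IntentState.Captured));         // True
Console.WriteLine(IntentStateMachine.CanMove(IntentState.Failed, IntentState.Succeeded));          // False

public enum IntentState
{
    RequiresPm, RequiresAction, Processing, Succeeded, Failed, Captured, Refunded, Disputed
}

public static class IntentStateMachine
{
    // One entry per source state; Failed and Refunded are terminal and have no outgoing edges.
    private static readonly Dictionary<IntentState, IntentState[]> Allowed = new()
    {
        [IntentState.RequiresPm]     = new[] { IntentState.RequiresAction },
        [IntentState.RequiresAction] = new[] { IntentState.Processing, IntentState.Failed },
        [IntentState.Processing]     = new[] { IntentState.Succeeded, IntentState.Failed },
        [IntentState.Succeeded]      = new[] { IntentState.Captured },
        [IntentState.Captured]       = new[] { IntentState.Refunded, IntentState.Disputed },
        [IntentState.Disputed]       = new[] { IntentState.Captured, IntentState.Refunded },
    };

    public static bool CanMove(IntentState from, IntentState to) =>
        Allowed.TryGetValue(from, out var next) && Array.IndexOf(next, to) >= 0;
}
```

Keeping the table in one place means a webhook handler and the saga can both ask the same question before writing a state change.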

Four production principles when implementing 3DS:

  • System-side timeout on the user challenge. If an intent is requires_action for more than 15 minutes, auto-cancel it to avoid holding the acquirer's authorization and incurring fees.
  • Don't trust the client redirect. The returning browser can be forged or replayed; the authoritative 3DS result comes from the PSP's async webhook, not the URL.
  • Persist the 3DS outcome in the ledger. A sca_outcome column in ledger_txn enables audit and proves exemption eligibility (low value, recurring) when needed.
  • Fallback to low-risk exemption. Authorization rates rise noticeably when you correctly apply TRA (Transaction Risk Analysis) exemption — which requires risk engine integration from the outset.

10. Payment Observability — The Metrics That Deserve Their Own Dashboard

Observability for a payment system differs from a normal service in one way: every metric translates into money. p99 latency isn't just UX — it determines how many customers abandon their cart. Auth rate isn't just "ok or not" — it's the percentage of revenue you're losing to acquirer declines. A proper payment dashboard must have the following metrics, sliced by PSP, card scheme, BIN, and country.

| Metric | Meaning | Suggested 2026 SLO | Slice by |
|---|---|---|---|
| authorization_rate | % of intents approved / total intents | ≥ 92% for non-3DS, ≥ 87% for 3DS | PSP, scheme, BIN, country |
| capture_latency_p99 | p99 time from request to capture | <5s (non-3DS), <30s (3DS) | PSP, amount bucket |
| webhook_lag_seconds | Lag between PSP event and ledger update | <60s p99, <600s p99.9 | event_type |
| recon_mismatch_count | Mismatch rows after the nightly run | <10/day self-heal, 0 to escalate | mismatch_type |
| idempotency_replay_rate | % of requests returning a cached response | <1% normally; spike = client bug | endpoint, tenant |
| fraud_block_rate | % of intents blocked by the risk engine | Balanced against chargeback rate | risk model version |
| chargeback_rate | % of txns becoming chargebacks | <0.9%; above this you lose merchant status | scheme, MCC |

Recommended observability stack for .NET 10: OpenTelemetry for tracing and metrics, Tempo or Jaeger for distributed traces, Loki for structured logs, Prometheus + Grafana for dashboards. Most importantly: every metric must be traceable back to a ledger entry. Trace IDs should be attached to the requests sent to the PSP (where the PSP supports it) so that incident investigation has a cross-boundary audit trail.

11. Security and Compliance — PCI DSS v4, Tokenization, and Rules That Aren't a Joke

A payment system touches PANs (card numbers) and CVVs, which puts you in scope for PCI DSS v4.0.1, in full effect from March 2025. The most effective scope-reduction technique, and the right approach for a small team doing payments, is to never touch the PAN.

  • Tokenise at the edge. PSP frontend SDKs (Stripe Elements, Adyen Web Components) accept the PAN from the user and swap it directly with the PSP for a token. Your server only sees the token — PCI scope drops to SAQ A, cutting from ~400 controls to ~30.
  • Vault it if you must store. To charge a customer periodically without their presence, use the PSP's customer vault rather than rolling your own. The token vault is decryptable only by the PSP; you only hold the customer_id.
  • Encryption at rest + in transit everywhere. TLS 1.3 is mandatory for every connection to the PSP; DB columns holding sensitive data (billing address, last-4, fingerprint) are encrypted with KMS-managed keys that rotate quarterly.
  • Key management outside the service. Secrets don't live in appsettings; use Azure Key Vault, AWS KMS, HashiCorp Vault. Audit access logs for ≥ 1 year.
  • Separation of duties. The person deploying code must not also be the one approving manual refunds in the admin tool. Clear role separation is how you pass ISO 27001 and SOC 2.

A frequently overlooked detail: BIN-based routing and network tokens

The card schemes promote network tokens (Visa VTS, Mastercard MDES) over a PSP's PAN-based token: they raise auth rates by 3–5% and eliminate the risk of expired tokens when customers replace cards. Alongside this, BIN-based routing lets you pick the optimal acquirer by card country/scheme, adding another 1–2% to auth rate. These two "hidden" optimisations can add up to millions of dollars in annual revenue for a mid-sized merchant.

12. Production Checklist — 20 Non-Negotiable Items Before Go-Live

Going live with a payment system is not like launching a regular service. A day-one incident can trigger enough chargebacks to lose merchant status. The checklist below is the intersection of many public post-mortems (Stripe, GoCardless, Monzo) and experience rolling out Vietnamese domestic payments.

| Group | Mandatory item | Notes |
|---|---|---|
| Correctness | Idempotency with body-hash, scoped per tenant | Two-tier Redis + Postgres |
| Correctness | Double-entry ledger, balanced-trigger invariant | Enforce sum = 0 at the DB |
| Correctness | Saga orchestration for every flow with > 2 steps | Temporal or Dapr Workflows |
| Correctness | Transactional outbox for every externally emitted event | SKIP LOCKED for parallel relays |
| Resilience | Circuit breaker on every call to the PSP | Polly v8, 50% fail / 30s threshold |
| Resilience | Strict timeouts (no 100s defaults) | ≤30s for authorize, ≤5s for tokenize |
| Resilience | No acquirer retry without acquirer-supported idem-key | Let recon handle it |
| Resilience | Dead-letter queue for every consumer | Alert when depth exceeds threshold |
| Observability | OTel tracing across webhooks | W3C Trace Context headers |
| Observability | Dashboards for auth rate, capture latency, webhook lag | Sliced by PSP/country/BIN |
| Observability | Runbook for every recon exception type | Readable by accountants |
| Observability | PagerDuty alerts when mismatch > 10/day | Auto-ticketing |
| Security | Edge tokenisation, never touch the PAN | PCI scope drops to SAQ A |
| Security | Webhook HMAC with tolerance ≤5 minutes | Anti-replay |
| Security | Secrets in Key Vault / KMS, quarterly rotation | Never committed to appsettings |
| Security | Audit log for every manual refund, retained ≥ 1 year | Immutable, WORM storage |
| Compliance & Go-live | Load test with failure injection (toxiproxy) | Test PSP 500/slow/partition |
| Compliance & Go-live | Monthly chaos drills (Redis down, DB failover) | Game-day script |
| Compliance & Go-live | Separation of duties between deploy and refund | ISO 27001 / SOC 2 |
| Compliance & Go-live | Per-PSP kill switch | Auto-fallback to a backup PSP |

13. Conclusion — Payments Is Where Engineering Discipline Meets Business Discipline

Every pattern in this article — idempotency, double-entry ledger, saga, outbox, at-least-once webhooks, reconciliation, 3DS state machine, payment-specific observability — exists because in payments, one-in-a-million errors become real money flowing to the wrong place. They aren't "optional best practice"; they are the minimum floor. Teams that skip any of them will pay the price through a painful incident eventually.

Good news: every technique in this article is well-tooled for the .NET 10 stack. Temporal SDK for sagas, Npgsql for serializable transactions, Polly v8 for resilience, OpenTelemetry for observability, and native .NET Stripe/Adyen SDKs for PSP integration. The engineering team has both the tools and public playbooks; what's left is the discipline to put each layer in the right place. That's where a senior engineer can create the most obvious business value: turning this complex set of patterns into a system that's simple to operate, easy to debug at 2am, and never surprises customers or accountants.

14. References