Payment Gateway System Design 2026 — Idempotency, Saga Pattern, and Double-Charge Defence for Stripe-scale

Posted on: 4/16/2026 9:11:57 PM

Table of contents

1. Three Guarantees You Cannot Get Wrong — and Why a Payment System Is the Peak of Distributed Systems
1. The three core guarantees of a payment system
2. Overall Architecture of a Standard 2026 Payment Gateway
3. Idempotency Key — The First Shield Against Double-Charge
1. Five details that separate correct from broken idempotency
4. Double-Entry Ledger — The Economic Source of Truth That You Only Append, Never Update
5. Saga Pattern — When One Authorization Is Five Steps That Can Fail Anywhere
1. Never auto-retry authorize calls to the acquirer
6. Transactional Outbox — The Bridge Guaranteeing DB Commit and Message Publish
7. Acquirer Webhooks — All At-Least-Once, Signed, and Replay-Protected
8. Reconciliation — The Nightly Worker That Sees Everything Webhooks Missed
1. Reconciliation isn't supporting cast — it's the third line of defence
9. 3DS 2.x and SCA — The Async Flow That Needs Its Own State Machine
10. Payment Observability — The Metrics That Deserve Their Own Dashboard
11. Security and Compliance — PCI DSS v4, Tokenization, and Rules That Aren't a Joke
1. A frequently overlooked detail: BIN-based routing and network tokens
12. Production Checklist — 20 Non-Negotiable Items Before Go-Live
13. Conclusion — Payments Is Where Engineering Discipline Meets Business Discipline
14. References

1. Three Guarantees You Cannot Get Wrong — and Why a Payment System Is the Peak of Distributed Systems

Writing a payment system means accepting that every mistake becomes hard cash. No feature in a SaaS product has a lower tolerance for error than payments: a small bug can charge a customer twice, issue the same refund twice, or, worse, silently swallow a transaction so that nobody notices until end-of-month reconciliation. That's why every serious payments platform (Stripe, Adyen, PayPal, VNPay, MoMo) lives and dies by three non-negotiable guarantees.

0: Acceptable double-charges

≥99.99%: Production target for authorization success rate

<5s: End-to-end p99 auth via 3DS

T+0: On-time reconciliation with the acquirer

The three core guarantees of a payment system

No double charge — a single payment intent, no matter how many times it is retried, goes through the acquirer exactly once, regardless of client retries, network flaps, or a server crash mid-flight.
No double refund — the inverse: a refund request triggered during a maintenance window, in a webhook retry, or by a support agent's click, still refunds exactly once.
No lost transactions — every time money leaves a customer's account, the system records it fully and does not drop it; each payment's state is reconstructable from the ledger, even if some service dies for an hour or two.

These three guarantees aren't "features" — they're the moral contract the system must keep no matter what goes wrong around it: a DB replica drops, a message broker swallows a message, the acquirer times out, webhooks duplicate, or a PM accidentally clicks "reprocess" twice in the admin tool. This whole article revolves around industrial techniques to keep those three guarantees, expressed in the language of .NET 10 and the stack familiar to Vietnamese engineering teams.

2. Overall Architecture of a Standard 2026 Payment Gateway

A payment system isn't a single service; it's a chain of services with sharp responsibility boundaries. Bundling them into one monolith sounds simple but turns into a disaster when, during an incident, operations can't say exactly who owns which piece. And payment incidents always happen at 2am.

flowchart LR
    CL(["Client / Checkout UI"]) --> API["Payment API<br/>(.NET 10, Minimal API)"]
    API --> IDP[("Idempotency Store<br/>Redis + SQL backup")]
    API --> LDG[("Ledger DB<br/>Postgres Serializable")]
    API --> ORC["Saga Orchestrator<br/>Temporal / MassTransit"]
    ORC --> RISK["Risk / Fraud<br/>Engine"]
    ORC --> TOK["Tokenization<br/>Vault"]
    ORC --> ACQ(("Acquirer / PSP<br/>Stripe, Adyen, VNPay"))
    ACQ -. async .-> WHK["Webhook<br/>Listener"]
    WHK --> LDG
    LDG --> RECON["Reconciliation<br/>Worker (nightly)"]
    RECON --> REP[("Acquirer Report<br/>SFTP / API")]
    LDG --> OUT[("Transactional<br/>Outbox")]
    OUT --> BUS[("Event Bus<br/>Kafka / RabbitMQ")]
    BUS --> DW[("Data Warehouse<br/>ClickHouse / BigQuery")]

Figure 1: Layered architecture of a payment gateway — API, ledger, orchestrator, acquirer, reconciliation

There are five distinct responsibility zones to identify:

Payment API — accepts the client request, verifies idempotency, creates an intent, returns a client_secret or redirect URL. Never calls the acquirer directly; only writes the intent and hands off to the orchestrator.
Saga Orchestrator — coordinates the sequence of steps (risk check → tokenize → authorize → capture → webhook) with the ability to compensate each step. This is where the state machine "lives" and where execution resumes after a crash.
Ledger — the system's economic source of truth. Every movement (authorize, capture, refund, chargeback) is an immutable row. No UPDATE, only INSERT; balances are sums.
Webhook Listener — consumes async events from the acquirer (payment_intent.succeeded, charge.refunded, dispute.created). Verifies signatures, updates the ledger, triggers downstream.
Reconciliation — a nightly worker reconciling the internal ledger with the acquirer's settlement file, catching mismatches before they become an accountant's dispute.

3. Idempotency Key — The First Shield Against Double-Charge

The idempotency key is the most important shield, and also the technique most frequently implemented incorrectly. The principle was popularised by Stripe around 2017 and is now an industry default: the client generates a unique key per payment intent, and any payment-creating request sent with the same key returns the same response, no matter how many times it is retried.

The key isn't just "de-dup": it's a contract between client and server stating "this request is the same intention, don't process it again". When a network flap causes the client to miss the response and retry, the server must recognise this and replay the old response — not create a new authorization.

// Payment API endpoint (.NET 10 Minimal API) with idempotency handling
// Requires: using System.Security.Cryptography; using System.Text.Json; using Microsoft.AspNetCore.Mvc;
app.MapPost("/v1/payment_intents", async (
    [FromHeader(Name = "Idempotency-Key")] string? idemKey,
    [FromBody] CreateIntentRequest req,
    IIdempotencyStore idem,
    IPaymentService svc,
    CancellationToken ct) =>
{
    if (string.IsNullOrWhiteSpace(idemKey) || idemKey.Length > 255)
        return Results.BadRequest("Idempotency-Key header is required and must be at most 255 characters");

    // Hash the body to detect reuse of a key with a different body.
    // This hashes the re-serialized request, so it assumes serialization is deterministic.
    var bodyHash = Convert.ToHexString(SHA256.HashData(JsonSerializer.SerializeToUtf8Bytes(req)));

    var saved = await idem.TryBeginAsync(idemKey, bodyHash, ct);
    if (saved is { Status: IdemStatus.Completed } done)
        return Results.Content(done.ResponseJson, "application/json", null, done.StatusCode); // replay

    if (saved is { Status: IdemStatus.InFlight })
        return Results.StatusCode(409); // original request still running: client retries later

    if (saved is { Status: IdemStatus.BodyMismatch })
        return Results.StatusCode(422); // key reused with a different body: client bug

    try
    {
        var intent = await svc.CreateIntentAsync(req, ct);
        var resp = JsonSerializer.Serialize(intent);
        await idem.CompleteAsync(idemKey, 201, resp, ct);
        return Results.Content(resp, "application/json", null, 201);
    }
    catch (Exception ex)
    {
        // CancellationToken.None: the failure must be recorded even if the request was cancelled.
        await idem.FailAsync(idemKey, ex.Message, CancellationToken.None);
        throw;
    }
});

Five details that separate correct from broken idempotency

Store the body hash too, not just the key. If the client sends the same key with a different body (bug or attack), the server must reject with 422 rather than replay the old response and create a false impression.
An explicit In-Flight status. A request currently running must be marked so retries receive 409 and wait, instead of running in parallel and creating two authorizations.
TTL 24–72 hours is the sweet spot. Shorter and retries after a crash won't match; longer and storage bloats indefinitely.
A serializable transaction for the insert phase. Races between two requests sharing a key must be stopped at the DB layer, not trusted to application-level logic.
Scope by tenant/merchant. Key "abc-123" for merchant A must not collide with merchant B; the composite primary key is always (tenant_id, idem_key).

Ideal idempotency storage is two-tier: Redis as the first hit serving checks under 5ms, but every change also written to Postgres as the source of truth. Losing Redis doesn't lose money; losing Postgres does. The "Redis write-through Postgres" pattern is the standard — don't use Redis as the sole store.
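The store contract behind the endpoint can be sketched in memory. This is a minimal illustration, not the article's implementation: `InMemoryIdempotencyStore`, the `Started` status, and the synchronous method shapes are assumptions made for the example, and a real store would put the same logic behind a unique `(tenant_id, idem_key)` constraint in Postgres.

```csharp
using System;
using System.Collections.Concurrent;

public enum IdemStatus { Started, InFlight, Completed, BodyMismatch }

public sealed record IdemRecord(
    IdemStatus Status, string BodyHash, int StatusCode = 0, string? ResponseJson = null);

// Hypothetical in-memory stand-in for the Redis + Postgres store.
public sealed class InMemoryIdempotencyStore
{
    private readonly ConcurrentDictionary<(string Tenant, string Key), IdemRecord> _rows = new();

    // First caller for a (tenant, key) pair gets Started; retries get the state to act on.
    public IdemRecord TryBegin(string tenantId, string idemKey, string bodyHash)
    {
        var fresh = new IdemRecord(IdemStatus.InFlight, bodyHash);
        // GetOrAdd stores exactly one value per key, so only one caller sees its own instance back.
        var existing = _rows.GetOrAdd((tenantId, idemKey), fresh);

        if (ReferenceEquals(existing, fresh))
            return fresh with { Status = IdemStatus.Started };
        if (existing.BodyHash != bodyHash)
            return existing with { Status = IdemStatus.BodyMismatch };
        return existing; // InFlight or Completed
    }

    // Stores the response so later retries can replay it.
    public void Complete(string tenantId, string idemKey, int statusCode, string responseJson)
    {
        if (!_rows.TryGetValue((tenantId, idemKey), out var row))
            throw new InvalidOperationException("Complete called before TryBegin");
        _rows[(tenantId, idemKey)] = row with
        {
            Status = IdemStatus.Completed, StatusCode = statusCode, ResponseJson = responseJson
        };
    }
}
```

The tenant is part of the dictionary key, so the same key string used by two merchants never collides. TTL expiry is left out of the sketch.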

4. Double-Entry Ledger — The Economic Source of Truth That You Only Append, Never Update

The most commonly miswritten database design in payment systems is a payments table with a status column that gets UPDATEd repeatedly. This design falls apart the moment there's a dispute: there's no way to know what state the payment was in at a given point, no audit path, no way to rebuild balances. Accounting solved this 500 years ago with the double-entry ledger.

flowchart LR
    subgraph Ledger
        LT[("ledger_txn<br/>id, type, idem_key, created_at")]
        LE[("ledger_entry<br/>txn_id, account_id, amount, sign")]
    end
    LT --- LE
    A1(["customer:123:available"])
    A2(["merchant:anhtu:pending"])
    A3(["merchant:anhtu:available"])
    A4(["bank:acquirer:stripe"])
    LE -.->|"capture"| A1
    LE -.->|"capture"| A2
    LE -.->|"settle T+2"| A2
    LE -.->|"settle T+2"| A3
    LE -.->|"payout"| A3
    LE -.->|"payout"| A4

Figure 2: Double-entry ledger — every transaction produces at least two balancing entries; the balance is a sum per account

Two principles you cannot break:

Entries are append-only. No UPDATE, no DELETE. "Cancelling" a transaction means writing another transaction with the opposite sign (reversal), leaving the audit log intact.
Each transaction's entries must sum to 0. If you take 100k from account A, exactly 100k must be written to another account. Enforce this with a trigger or domain rule so the ledger never "inflates" or "deflates" without reason.

-- Minimal but accurate Postgres schema
CREATE TABLE ledger_txn (
    id              bigint PRIMARY KEY,
    tenant_id       bigint NOT NULL,
    type            text   NOT NULL,   -- authorize | capture | refund | chargeback
    idem_key        text   NOT NULL,
    created_at      timestamptz NOT NULL DEFAULT now(),
    UNIQUE (tenant_id, idem_key)
);

CREATE TABLE ledger_entry (
    id              bigserial PRIMARY KEY,
    txn_id          bigint NOT NULL REFERENCES ledger_txn(id),
    account_id      text   NOT NULL,   -- 'customer:123:available'
    currency        char(3) NOT NULL,
    amount_minor    bigint NOT NULL,   -- signed, minor units
    created_at      timestamptz NOT NULL DEFAULT now()
);

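-- invariant: per txn and per currency, sum(amount_minor) = 0
-- (summing across currencies would let a mixed-currency txn appear balanced)
CREATE OR REPLACE FUNCTION ensure_balanced() RETURNS trigger AS $$
DECLARE bad record;
BEGIN
    SELECT currency, sum(amount_minor) AS total INTO bad
    FROM ledger_entry
    WHERE txn_id = NEW.txn_id
    GROUP BY currency
    HAVING sum(amount_minor) <> 0
    LIMIT 1;
    IF FOUND THEN
        RAISE EXCEPTION 'ledger txn % imbalanced by % %', NEW.txn_id, bad.total, bad.currency;
    END IF;
    RETURN NULL;
END; $$ LANGUAGE plpgsql;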

CREATE CONSTRAINT TRIGGER ledger_balance_check
AFTER INSERT ON ledger_entry DEFERRABLE INITIALLY DEFERRED
FOR EACH ROW EXECUTE FUNCTION ensure_balanced();

Account balance is a view: SELECT sum(amount_minor) FROM ledger_entry WHERE account_id = ?. At high volume you cache balances in a continuously-rebuilt materialized table — but the source of truth is always the sum. On disputes, you can replay the entire ledger to prove every cent flowed correctly.
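The two ledger principles and the balance-as-sum rule can be shown with a small in-memory sketch. `InMemoryLedger` and its method names are hypothetical; the account names come from Figure 2, and the per-currency check is an assumption about how the zero-sum invariant should treat multiple currencies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record LedgerEntry(long TxnId, string AccountId, string Currency, long AmountMinor);

// Hypothetical append-only ledger: there is no update or delete method on purpose.
public sealed class InMemoryLedger
{
    private readonly List<LedgerEntry> _entries = new();

    // Appends all entries of one transaction, or none if they do not sum to zero per currency.
    public void Append(long txnId, params (string AccountId, string Currency, long AmountMinor)[] lines)
    {
        var unbalanced = lines
            .GroupBy(l => l.Currency)
            .Any(g => g.Sum(l => l.AmountMinor) != 0);
        if (lines.Length < 2 || unbalanced)
            throw new InvalidOperationException($"ledger txn {txnId} is not balanced");

        _entries.AddRange(lines.Select(l => new LedgerEntry(txnId, l.AccountId, l.Currency, l.AmountMinor)));
    }

    // The balance is always derived: a sum over the account's entries.
    public long Balance(string accountId, string currency) =>
        _entries.Where(e => e.AccountId == accountId && e.Currency == currency)
                .Sum(e => e.AmountMinor);
}
```

A capture of 100,000 VND would be `Append(1, ("customer:123:available", "VND", -100_000), ("merchant:anhtu:pending", "VND", 100_000))`. Cancelling it is a second `Append` with the signs flipped, which brings both balances back to zero while leaving the first transaction in the log.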

5. Saga Pattern — When One Authorization Is Five Steps That Can Fail Anywhere

A payment isn't a query. It's a sequence: risk check → tokenize card → call acquirer to authorize → write ledger → emit event. Each step can time out, fail, or return "maybe" (the acquirer especially). And each step has a compensating action if a later step fails: cancel the authorization, issue a refund, emit a compensating event. This is the essence of the Saga Pattern.

Criterion	Saga Choreography	Saga Orchestration
Control	Each service listens to events and reacts	A central orchestrator calls each service
Coupling	Low — services don't know each other	Higher — the orchestrator knows the whole flow
Observability	Hard — flow scattered across the event log	Easy — centralised state machine
Compensation	Complex — each service remembers its own	Direct — the orchestrator calls the inverse action
Best for	Simple flows, <4 steps, independent teams	Payment, booking, order — many rollbackable steps
2026 tooling	MassTransit, NServiceBus, Kafka	Temporal, Cadence, Dapr Workflows, AWS Step Functions

For payments, orchestration is the right choice almost every time. Flows have 5–10 steps, each step has a clear compensation, and observability is a hard requirement for accounting and compliance. Temporal (or Dapr Workflows for lighter teams) is the standard tool.

sequenceDiagram
    autonumber
    participant C as Client
    participant A as Payment API
    participant T as Temporal Worker
    participant R as Risk Engine
    participant V as Vault
    participant P as PSP (Stripe)
    participant L as Ledger
    C->>A: POST /intents (Idempotency-Key)
    A->>T: StartWorkflow(intentId)
    T->>R: RiskCheck(card, user, ip)
    R-->>T: score=0.2 approved
    T->>V: TokenizeCard(PAN)
    V-->>T: token=tok_abc
    T->>P: Authorize(token, amount)
    P-->>T: auth_id=ch_123 approved
    T->>L: WriteAuthorizeEntries
    T-->>A: Workflow complete
    A-->>C: 201 Created (intent)
    Note over T,P: If any step fails,<br/>compensation walks it back

Figure 3: Orchestrated saga for authorize — every step has its own timeout, retry policy, and explicit compensation

// Temporal workflow for the authorize intent (Temporal .NET SDK)
// Requires: using Temporalio.Workflows; using Temporalio.Exceptions;
[Workflow]
public class AuthorizeIntentWorkflow
{
    [WorkflowRun]
    public async Task<IntentResult> RunAsync(AuthorizeInput input)
    {
        // Each activity has its own retry policy; exceeding attempts raises to the workflow
        var risk = await Workflow.ExecuteActivityAsync(
            (IRiskActivities a) => a.ScoreAsync(input),
            new() { StartToCloseTimeout = TimeSpan.FromSeconds(5),
                    RetryPolicy = new() { MaximumAttempts = 3 } });
        if (risk.Decision == "deny")
            return IntentResult.Declined("risk_block");

        var token = await Workflow.ExecuteActivityAsync(
            (IVaultActivities a) => a.TokenizeAsync(input.Card),
            new() { StartToCloseTimeout = TimeSpan.FromSeconds(3) });

        try
        {
            var auth = await Workflow.ExecuteActivityAsync(
                (IPsPActivities a) => a.AuthorizeAsync(token, input.AmountMinor, input.Currency),
                new() { StartToCloseTimeout = TimeSpan.FromSeconds(30),
                        RetryPolicy = new() { MaximumAttempts = 1 } });  // DO NOT retry acquirer
            await Workflow.ExecuteActivityAsync(
                (ILedgerActivities a) => a.RecordAuthorizeAsync(input.IntentId, auth),
                new() { StartToCloseTimeout = TimeSpan.FromSeconds(2) });
            return IntentResult.Approved(auth.AuthId);
        }
        catch (ActivityFailureException)
        {
            // compensation — don't call PSP because we may or may not have an auth; let recon sort it
            await Workflow.ExecuteActivityAsync(
                (ILedgerActivities a) => a.MarkIntentFailedAsync(input.IntentId),
                new() { StartToCloseTimeout = TimeSpan.FromSeconds(2) });
            throw;
        }
    }
}

Never auto-retry authorize calls to the acquirer

The natural instinct when an activity fails is to retry. But for an authorize call, retry is mortally dangerous: a timeout does not mean failure — the money may already be held on the acquirer's side while the network simply lost the response. Retrying creates a second authorization. The golden rule: authorize/capture/refund calls to the acquirer only retry when the acquirer itself supports an idempotency key (Stripe, Adyen do; some domestic PSPs do not). When it doesn't: call once, and let the reconciliation worker track it down later.
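The golden rule reduces to a small decision function. The sketch below is illustrative only: the enum names, `AcquirerRetryPolicy`, and the default of three attempts are assumptions, not part of any PSP SDK.

```csharp
using System;

public enum AcquirerOutcome { Approved, Declined, Unknown }   // Unknown = timeout or lost response
public enum NextStep { Done, RetryWithSameKey, ParkForReconciliation }

public static class AcquirerRetryPolicy
{
    // Decides what to do after one authorize/capture/refund attempt.
    public static NextStep Decide(
        AcquirerOutcome outcome, bool pspSupportsIdempotencyKey, int attempt, int maxAttempts = 3)
    {
        // A definite answer, approved or declined, is final: nothing to retry.
        if (outcome != AcquirerOutcome.Unknown)
            return NextStep.Done;

        // Retrying is safe only when the PSP deduplicates on the same idempotency key.
        if (pspSupportsIdempotencyKey && attempt < maxAttempts)
            return NextStep.RetryWithSameKey;

        // Otherwise the money may already be held: stop and let reconciliation resolve it.
        return NextStep.ParkForReconciliation;
    }
}
```

The important property is that a timeout never maps to "declined": it stays `Unknown` until either a keyed retry or the reconciliation worker settles it.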

6. Transactional Outbox — The Bridge Guaranteeing DB Commit and Message Publish

A very common bug: the service writes to the ledger successfully, then emits a "payment.succeeded" event to Kafka, then returns a response to the client. Problem: those two steps aren't atomic. If the service crashes between them, the ledger is written but the event never fires, and downstream (email confirmation, analytics, loyalty points) never runs. The fix pattern is called Transactional Outbox.

flowchart LR
    API["Payment API"] --> TX{"BEGIN TX"}
    TX --> L[("ledger_entry")]
    TX --> O[("outbox_event")]
    TX --> C{"COMMIT"}
    C --> R["Outbox Relay<br/>(CDC or poller)"]
    R --> B[("Kafka / RabbitMQ")]
    B --> D1["Email Service"]
    B --> D2["Loyalty Service"]
    B --> D3["Analytics"]

Figure 4: Outbox pattern — ledger and event commit in the same transaction; the relay pushes to the bus afterwards

The mechanism is simple but rock-solid: the ledger row and the event row are written in the same SQL transaction. COMMIT commits both; on crash, both roll back. A dedicated relay worker reads outbox_event and publishes to the bus, marking rows as published. The bus handles at-least-once; consumers must be idempotent.

-- Outbox table
CREATE TABLE outbox_event (
    id           bigserial PRIMARY KEY,
    aggregate_id text NOT NULL,
    event_type   text NOT NULL,
    payload      jsonb NOT NULL,
    created_at   timestamptz NOT NULL DEFAULT now(),
    published_at timestamptz
);

-- Partial index so the relay only scans unpublished rows (matches its ORDER BY id)
CREATE INDEX outbox_event_unpublished ON outbox_event (id) WHERE published_at IS NULL;

// Relay worker (.NET BackgroundService): read a batch of 100, publish, mark as published.
// Assumes Dapper over an open Npgsql connection `db` and a Confluent.Kafka producer.
while (!stoppingToken.IsCancellationRequested)
{
    await using var tx = await db.BeginTransactionAsync(stoppingToken);
    var batch = (await db.QueryAsync<OutboxRow>(
        "SELECT * FROM outbox_event WHERE published_at IS NULL " +
        "ORDER BY id LIMIT 100 FOR UPDATE SKIP LOCKED", transaction: tx)).ToList();
    if (batch.Count == 0) { await Task.Delay(200, stoppingToken); continue; }
    foreach (var row in batch)
        await producer.ProduceAsync("payment.events", row.ToKafkaMessage(), stoppingToken);
    await db.ExecuteAsync(
        "UPDATE outbox_event SET published_at = now() WHERE id = ANY(@ids)",
        new { ids = batch.Select(b => b.Id).ToArray() }, transaction: tx);
    // A crash before this commit republishes the batch on restart: delivery is at-least-once.
    await tx.CommitAsync(stoppingToken);
}

FOR UPDATE SKIP LOCKED is the crucial detail — it lets many relay workers run in parallel without stepping on each other. CDC-based outbox (Debezium reading the Postgres WAL and pushing to Kafka) is an advanced variant for throughput beyond 10k events/s.
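Because the relay delivers at-least-once, each downstream consumer has to tolerate seeing the same event twice. A minimal sketch of that, assuming a hypothetical loyalty consumer that deduplicates on the outbox row id (the class name and the points formula are invented for the example):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical consumer of "payment.events"; the points rule is illustrative only.
public sealed class LoyaltyConsumer
{
    // In production this would be a table with a unique key on the event id,
    // written in the same transaction as the side effect.
    private readonly HashSet<long> _processedEventIds = new();

    public long PointsAwarded { get; private set; }

    // Returns true if the event was applied, false if it was a duplicate delivery.
    public bool Handle(long outboxEventId, long amountMinor)
    {
        if (!_processedEventIds.Add(outboxEventId))
            return false; // already handled: acknowledge and do nothing

        PointsAwarded += amountMinor / 1000;
        return true;
    }
}
```

Delivering event 42 twice leaves `PointsAwarded` unchanged after the first call, which is exactly what lets the relay republish a batch after a crash without side effects.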

7. Acquirer Webhooks — All At-Least-Once, Signed, and Replay-Protected

Most of the payment flow actually completes via webhook, not via the initial HTTP response. Authorize might succeed synchronously, but 3DS challenges, async captures, refunds, and chargebacks all come back via webhook. The webhook listener is an extremely risky subsystem if underestimated; five mistakes are especially common.

Common mistake	Symptom	Consequence	Defence
No signature verification	Accepts requests pretending to be from PSP	Attacker "confirms" fake payments	HMAC check with shared secret, reject outside tolerance window
Duplicates not handled	PSP retries 2–3 times, logged each time	Double ledger entry, broken bookkeeping	Idempotent on the PSP's event_id
Return 2xx too early	PSP thinks you processed it, but you didn't	Lost events when the worker crashes mid-process	Persist to an internal queue first, ack after
No out-of-order handling	succeeded arrives before created	State machine rejects a valid event	Buffer and resolve by event_type precedence
Slow inline processing	PSP times out, retry storm	Webhook queue tens of thousands deep	Accept + persist + 200 immediately, process async

// Proper webhook handler — .NET 10 Minimal API
app.MapPost("/webhooks/stripe", async (
    HttpRequest httpReq,
    [FromServices] IWebhookVerifier verifier,
    [FromServices] IWebhookQueue queue,
    CancellationToken ct) =>
{
    using var reader = new StreamReader(httpReq.Body);
    var rawBody = await reader.ReadToEndAsync(ct);
    var signature = httpReq.Headers["Stripe-Signature"].ToString();

    // 1. Verify HMAC with 5-minute tolerance to defeat replay
    if (!verifier.VerifyAndCheckTimestamp(rawBody, signature, TimeSpan.FromMinutes(5)))
        return Results.Unauthorized();

    var evt = JsonSerializer.Deserialize<StripeEvent>(rawBody)!;

    // 2. Idempotent on Stripe's event.id — a duplicate returns 200 immediately
    if (!await queue.EnqueueIfNewAsync(evt.Id, evt.Type, rawBody, ct))
        return Results.Ok(); // already seen; ack so Stripe stops retrying

    // 3. Return 200 immediately; the worker processes async
    return Results.Ok();
});
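The handler above delegates to an `IWebhookVerifier` that the article does not show. A possible shape for it, as a sketch under assumptions: the header format `t=<unix seconds>,v1=<hex HMAC-SHA256 of "t.body">` follows Stripe's documented signing scheme, while `StripeStyleSignature`, the explicit `now` parameter, and the handling of a single `v1` value are choices made for this example.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class StripeStyleSignature
{
    // HMAC-SHA256 over "<timestamp>.<raw body>", hex-encoded in lowercase.
    private static string HmacHex(string secret, long unixSeconds, string rawBody)
    {
        var mac = HMACSHA256.HashData(
            Encoding.UTF8.GetBytes(secret),
            Encoding.UTF8.GetBytes($"{unixSeconds}.{rawBody}"));
        return Convert.ToHexString(mac).ToLowerInvariant();
    }

    // Builds a header the way the sender would; handy for tests.
    public static string BuildHeader(string secret, long unixSeconds, string rawBody) =>
        $"t={unixSeconds},v1={HmacHex(secret, unixSeconds, rawBody)}";

    public static bool Verify(
        string rawBody, string header, string secret, TimeSpan tolerance, DateTimeOffset now)
    {
        string? t = null, v1 = null;
        foreach (var part in header.Split(','))
        {
            var kv = part.Split('=', 2);
            if (kv.Length != 2) continue;
            if (kv[0].Trim() == "t") t = kv[1];
            else if (kv[0].Trim() == "v1") v1 = kv[1]; // sketch keeps only the last v1
        }
        if (t is null || v1 is null || !long.TryParse(t, out var ts)) return false;

        // Replay defence: reject timestamps outside the tolerance window.
        if ((now - DateTimeOffset.FromUnixTimeSeconds(ts)).Duration() > tolerance) return false;

        // Constant-time comparison to avoid leaking how many characters matched.
        return CryptographicOperations.FixedTimeEquals(
            Encoding.ASCII.GetBytes(HmacHex(secret, ts, rawBody)),
            Encoding.ASCII.GetBytes(v1));
    }
}
```

The signature must be computed over the raw request body, which is why the handler reads the stream itself instead of binding a deserialized model.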

Rule for tolerating late events: every state machine must accept any arrival order. If payment_intent.succeeded arrives before payment_intent.created, don't reject — mark it pending and reconcile once the earlier event arrives. Major PSPs guarantee at-least-once but not total ordering.
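One way to accept any arrival order is to rank event types and only ever move forward, which is the "resolve by event_type precedence" defence from the table. The sketch below is a simplification under assumptions: `IntentProjection` and the rank values are hypothetical, and only three event types are modelled.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical read model of one intent, driven by webhook events in any order.
public sealed class IntentProjection
{
    // Assumed precedence: a higher rank wins regardless of when it arrives.
    private static readonly Dictionary<string, int> Rank = new()
    {
        ["payment_intent.created"] = 0,
        ["payment_intent.processing"] = 1,
        ["payment_intent.succeeded"] = 2,
    };

    private readonly HashSet<string> _seenEventIds = new();
    private int _rank = -1;

    public string Status { get; private set; } = "unknown";

    // Returns true only when the event advanced the status.
    public bool Apply(string eventId, string eventType)
    {
        if (!_seenEventIds.Add(eventId)) return false;            // duplicate delivery
        if (!Rank.TryGetValue(eventType, out var rank)) return false; // not modelled here
        if (rank <= _rank) return false;                          // late event, already superseded

        _rank = rank;
        Status = eventType;
        return true;
    }
}
```

If `succeeded` arrives first, the projection jumps straight to it, and the late `created` event is recorded as seen but changes nothing. Late events are accepted rather than rejected, so the PSP still receives its 200.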

8. Reconciliation — The Nightly Worker That Sees Everything Webhooks Missed

No matter how hard webhooks and the saga try, there's still a class of events that never make it to your system: events swallowed when the PSP changes formats, webhook retries that ran out, network partitions lasting hours. That's why every serious payment system has a reconciliation worker running nightly, reconciling the internal ledger against the settlement report from the acquirer.

23:30 — fetch settlement

The worker pulls the settlement file or calls /v1/balance_transactions for all of day T-1 and writes into a staging table psp_settlement_raw.

23:45 — normalize

Normalise formats (Stripe, Adyen, VNPay each have their own) into a common schema: (psp_ref, type, amount_minor, currency, occurred_at).

00:00 — diff

FULL OUTER JOIN the ledger against the settlement by psp_ref, so that rows missing on either side surface. Three kinds of mismatch: (a) in ledger, not in settlement: possible phantom auth; (b) in settlement, not in ledger: lost webhook; (c) amount drift: risk change or partial capture.

00:30 — auto-heal

For type (b), re-query the PSP by psp_ref; if confirmed valid, write a supplementary ledger entry with txn type recon_backfill. Record the metric recon.backfilled_total.

01:00 — alert

Remaining mismatches after auto-heal go into recon_exception and wake the accounting PagerDuty rotation. SLA: clear every exception within 48 hours.
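The diff step can be sketched in memory as a full outer join on `psp_ref`. This is an illustration, not the article's worker: `Reconciler`, `ReconRow`, and `MismatchKind` are hypothetical names, and the sketch assumes one row per `psp_ref` on each side after normalisation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum MismatchKind { InLedgerOnly, InSettlementOnly, AmountDrift }
public sealed record ReconRow(string PspRef, long AmountMinor);
public sealed record Mismatch(string PspRef, MismatchKind Kind);

public static class Reconciler
{
    // Compares both sides by psp_ref and classifies every disagreement.
    public static List<Mismatch> Diff(IEnumerable<ReconRow> ledger, IEnumerable<ReconRow> settlement)
    {
        var l = ledger.ToDictionary(r => r.PspRef, r => r.AmountMinor);
        var s = settlement.ToDictionary(r => r.PspRef, r => r.AmountMinor);
        var result = new List<Mismatch>();

        // Union of keys = full outer join; sorted so the report is deterministic.
        foreach (var key in l.Keys.Union(s.Keys).OrderBy(k => k, StringComparer.Ordinal))
        {
            var inLedger = l.TryGetValue(key, out var ledgerAmount);
            var inSettlement = s.TryGetValue(key, out var settledAmount);

            if (inLedger && !inSettlement) result.Add(new(key, MismatchKind.InLedgerOnly));          // (a)
            else if (!inLedger && inSettlement) result.Add(new(key, MismatchKind.InSettlementOnly)); // (b)
            else if (ledgerAmount != settledAmount) result.Add(new(key, MismatchKind.AmountDrift));  // (c)
        }
        return result;
    }
}
```

Rows of kind `InSettlementOnly` are the candidates for the auto-heal step; the other two kinds go straight to the exception queue unless a rule explains them.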

Reconciliation isn't supporting cast — it's the third line of defence

Idempotency blocks double-charge at request time, the Temporal saga ensures workflows don't drop mid-execution, reconciliation ensures the end-of-day balance is right no matter what. Three independent layers, defending against failures at three different moments. Skip any layer and sooner or later accountants will be counting by hand.

9. 3DS 2.x and SCA — The Async Flow That Needs Its Own State Machine

Since PSD2 in Europe and equivalents in many countries, Strong Customer Authentication (SCA) via 3DS 2.x is no longer optional. This flow turns authorize from "call the API and get a result" into "initiate challenge, redirect user, wait for the browser to come back, handle the outcome". A dedicated state machine is mandatory.

stateDiagram-v2
    [*] --> Requires_PM: create intent
    Requires_PM --> Requires_Action: attach card, PSP returns requires_action
    Requires_Action --> Processing: user completes 3DS challenge, browser returns
    Processing --> Succeeded: acquirer confirms authorize
    Processing --> Failed: acquirer declines or 3DS times out
    Requires_Action --> Failed: user closes browser past 10 minutes
    Succeeded --> Captured: capture at T+0
    Captured --> Refunded: partial or full refund
    Captured --> Disputed: chargeback
    Disputed --> Captured: dispute_won
    Disputed --> Refunded: dispute_lost

Figure 5: Intent state machine including async 3DS branches, disputes, and refunds

Four production principles when implementing 3DS:

System-side timeout on the user challenge. If an intent is requires_action for more than 15 minutes, auto-cancel it to avoid holding the acquirer's authorization and incurring fees.
Don't trust the client redirect. The returning browser can be forged or replayed; the authoritative 3DS result comes from the PSP's async webhook, not the URL.
Persist the 3DS outcome in the ledger. A sca_outcome column in ledger_txn enables audit and proves exemption eligibility (low value, recurring) when needed.
Fallback to low-risk exemption. Authorization rates rise noticeably when you correctly apply TRA (Transaction Risk Analysis) exemption — which requires risk engine integration from the outset.
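The transitions in Figure 5 can be encoded as an explicit table so that anything not listed is rejected. A minimal sketch, assuming the state names from the diagram; `IntentStateMachine` and `ShouldAutoCancel` are hypothetical helpers, and the timeout limit is passed in rather than fixed.

```csharp
using System;
using System.Collections.Generic;

public enum IntentState
{
    RequiresPm, RequiresAction, Processing, Succeeded, Failed, Captured, Refunded, Disputed
}

public static class IntentStateMachine
{
    // Allowed transitions, copied from the state diagram; Failed and Refunded are terminal.
    private static readonly Dictionary<IntentState, IntentState[]> Allowed = new()
    {
        [IntentState.RequiresPm]     = new[] { IntentState.RequiresAction },
        [IntentState.RequiresAction] = new[] { IntentState.Processing, IntentState.Failed },
        [IntentState.Processing]     = new[] { IntentState.Succeeded, IntentState.Failed },
        [IntentState.Succeeded]      = new[] { IntentState.Captured },
        [IntentState.Captured]       = new[] { IntentState.Refunded, IntentState.Disputed },
        [IntentState.Disputed]       = new[] { IntentState.Captured, IntentState.Refunded },
    };

    public static bool CanTransition(IntentState from, IntentState to) =>
        Allowed.TryGetValue(from, out var next) && Array.IndexOf(next, to) >= 0;

    // System-side timeout: only intents still waiting on the user challenge are cancelled.
    public static bool ShouldAutoCancel(
        IntentState state, DateTimeOffset enteredAt, DateTimeOffset now, TimeSpan limit) =>
        state == IntentState.RequiresAction && now - enteredAt > limit;
}
```

A sweeper job would call `ShouldAutoCancel` for each open intent and, when it returns true, apply the `RequiresAction` to `Failed` transition through the same table.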

10. Payment Observability — The Metrics That Deserve Their Own Dashboard

Observability for a payment system differs from a normal service in one way: every metric translates into money. p99 latency isn't just UX — it determines how many customers abandon their cart. Auth rate isn't just "ok or not" — it's the percentage of revenue you're losing to acquirer declines. A proper payment dashboard must have the following metrics, sliced by PSP, card scheme, BIN, and country.

Metric	Meaning	Suggested 2026 SLO	Slice by
authorization_rate	% of intents approved / total intents	≥ 92% for non-3DS, ≥ 87% for 3DS	PSP, scheme, BIN, country
capture_latency_p99	p99 time from request to capture	<5s (non-3DS), <30s (3DS)	PSP, amount bucket
webhook_lag_seconds	Lag between PSP event and ledger update	<60s p99, <600s p99.9	event_type
recon_mismatch_count	Mismatch rows after the nightly run	<10/day self-heal, 0 to escalate	mismatch_type
idempotency_replay_rate	% of requests returning a cached response	<1% normally; spike = client bug	endpoint, tenant
fraud_block_rate	% of intents blocked by the risk engine	Balanced against chargeback rate	risk model version
chargeback_rate	% of txns becoming chargebacks	<0.9% — above this you lose merchant status	scheme, MCC

Recommended observability stack for .NET 10: OpenTelemetry for tracing and metrics, Tempo or Jaeger for distributed traces, Loki for structured logs, Prometheus + Grafana for dashboards. Most importantly, every metric must be traceable back to a ledger entry. Trace IDs should be attached to the headers of requests sent to the PSP (where the PSP supports it) so that incident investigation has a cross-boundary audit trail.
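As a sketch of how the sliced metrics might be emitted, the built-in `System.Diagnostics.Metrics` API (which OpenTelemetry for .NET collects from) is enough. The meter name `Payments.Gateway`, the instrument name, and the tag keys are assumptions for this example; BIN is left out as a tag here because of its cardinality.

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class PaymentMetrics
{
    // Hypothetical meter name; an OpenTelemetry MeterProvider would subscribe to it by name.
    public static readonly Meter Meter = new("Payments.Gateway");

    // One counter with an "outcome" tag; authorization_rate is approved / total at query time.
    private static readonly Counter<long> IntentOutcomes =
        Meter.CreateCounter<long>(
            "payment.intent.outcomes",
            unit: "{intent}",
            description: "Intents by outcome, sliced by PSP, scheme and country");

    public static void RecordOutcome(string psp, string scheme, string country, bool approved)
    {
        IntentOutcomes.Add(1,
            new KeyValuePair<string, object?>("psp", psp),
            new KeyValuePair<string, object?>("scheme", scheme),
            new KeyValuePair<string, object?>("country", country),
            new KeyValuePair<string, object?>("outcome", approved ? "approved" : "declined"));
    }
}
```

Recording raw outcomes and computing the rate in the dashboard keeps the slices composable: the same counter answers "auth rate for Visa in VN" and "auth rate per PSP" without new instruments.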

11. Security and Compliance — PCI DSS v4, Tokenization, and Rules That Aren't a Joke

A payment system touches PANs (card numbers) and CVVs, which brings you into scope for PCI DSS v4.0.1, in full effect since March 2025. The most effective scope-reduction technique, and the right approach for a small team doing payments, is to never touch the PAN.

Tokenise at the edge. PSP frontend SDKs (Stripe Elements, Adyen Web Components) accept the PAN from the user and swap it directly with the PSP for a token. Your server only sees the token — PCI scope drops to SAQ A, cutting from ~400 controls to ~30.
Vault it if you must store. To charge a customer periodically without their presence, use the PSP's customer vault rather than rolling your own. The token vault is decryptable only by the PSP; you only hold the customer_id.
Encryption at rest + in transit everywhere. TLS 1.3 is mandatory for every connection to the PSP; DB columns holding sensitive data (billing address, last-4, fingerprint) encrypted with KMS-managed keys and quarterly rotation.
Key management outside the service. Secrets don't live in appsettings; use Azure Key Vault, AWS KMS, HashiCorp Vault. Audit access logs for ≥ 1 year.
Separation of duties. The person deploying code must not also be the one approving manual refunds in the admin tool. Clear role separation is how you pass ISO 27001 and SOC 2.

A frequently overlooked detail: BIN-based routing and network tokens

Card schemes encourage using a network token (Visa VTS, Mastercard MDES) over a PSP's PAN-based token; doing so is reported to raise auth rates by 3–5% and removes the risk of stale tokens when customers replace cards. Alongside this, BIN-based routing lets you pick the optimal acquirer by card country and scheme, which can add another 1–2% to auth rate. These two "hidden" optimisations can add up to millions of dollars in annual revenue for a mid-sized merchant.

12. Production Checklist — 20 Non-Negotiable Items Before Go-Live

Going live with a payment system is not like launching a regular service. A day-one incident can trigger enough chargebacks to lose merchant status. The checklist below is the intersection of many public post-mortems (Stripe, GoCardless, Monzo) and experience rolling out Vietnamese domestic payments.

Group	Mandatory item	Notes
Correctness	Idempotency with body-hash, scoped per tenant	Two-tier Redis + Postgres
	Double-entry ledger, balanced-trigger invariant	Enforce sum = 0 at the DB
	Saga orchestration for every flow with > 2 steps	Temporal or Dapr Workflows
	Transactional outbox for every externally emitted event	SKIP LOCKED for parallel relays
Resilience	Circuit breaker on every call to the PSP	Polly v8, 50% fail / 30s threshold
	Strict timeouts (no 100s defaults)	≤30s for authorize, ≤5s for tokenize
	No acquirer retry without acquirer-supported idem-key	Let recon handle it
	Dead-letter queue for every consumer	Alert when depth exceeds threshold
Observability	OTel tracing across webhooks	W3C Trace Context headers
	Dashboards for auth rate, capture latency, webhook lag	Sliced by PSP/country/BIN
	Runbook for every recon exception type	Readable by accountants
	PagerDuty alerts when mismatch > 10/day	Auto-ticketing
Security	Edge tokenisation — never touch the PAN	PCI scope drops to SAQ A
	Webhook HMAC with tolerance ≤5 minutes	Anti-replay
	Secrets in Key Vault / KMS, quarterly rotation	Never committed to appsettings
	Audit log for every manual refund, retained ≥ 1 year	Immutable, WORM storage
Compliance & Go-live	Load test with failure injection (toxiproxy)	Test PSP 500/slow/partition
	Monthly chaos drills (Redis down, DB failover)	Game-day script
	Separation of duties between deploy and refund	ISO 27001 / SOC 2
	Per-PSP kill switch	Auto-fallback to a backup PSP

13. Conclusion — Payments Is Where Engineering Discipline Meets Business Discipline

Every pattern in this article — idempotency, double-entry ledger, saga, outbox, at-least-once webhooks, reconciliation, 3DS state machine, payment-specific observability — exists because in payments, one-in-a-million errors become real money flowing to the wrong place. They aren't "optional best practice"; they are the minimum floor. Teams that skip any of them will pay the price through a painful incident eventually.

Good news: every technique in this article is well-tooled for the .NET 10 stack. Temporal SDK for sagas, Npgsql for serializable transactions, Polly v8 for resilience, OpenTelemetry for observability, and native .NET Stripe/Adyen SDKs for PSP integration. The engineering team has both the tools and public playbooks; what's left is the discipline to put each layer in the right place. That's where a senior engineer can create the most obvious business value: turning this complex set of patterns into a system that's simple to operate, easy to debug at 2am, and never surprises customers or accountants.

14. References


# Payment Gateway System Design 2026 — Idempotency, Saga Pattern, and Double-Charge Defence for Stripe-scale

## 1. Three Guarantees You Cannot Get Wrong — and Why a Payment System Is the Peak of Distributed Systems

- **0**: acceptable double-charges
- **≥99.99%**: production target for authorization success rate
- **<5s**: end-to-end p99 auth via 3DS
- **T+0**: on-time reconciliation with the acquirer

#### The three core guarantees of a payment system

- **No double charge** — a single payment intent, no matter how many times it is retried, goes through the acquirer exactly once, regardless of client retries, network flaps, or a server crash mid-flight.
- **No double refund** — the inverse: a refund request triggered during a maintenance window, in a webhook retry, or by a support agent's click, still refunds exactly once.
- **No lost transactions** — every time money leaves a customer's account, the system records it fully and does not drop it; each payment's state is reconstructable from the ledger, even if some service dies for an hour or two.

These three guarantees aren't "features" — they're the *moral contract* the system must keep no matter what goes wrong around it: a DB replica drops, a message broker swallows a message, the acquirer times out, webhooks duplicate, or a PM accidentally clicks "reprocess" twice in the admin tool. This whole article revolves around industrial techniques to keep those three guarantees, expressed in the language of .NET 10 and the stack familiar to Vietnamese engineering teams.

## 2. Overall Architecture of a Standard 2026 Payment Gateway

A payment system isn't a single service; it's *a chain of services* with very sharp responsibility boundaries. Bundling them into one monolith sounds simple but turns into a disaster when operations can't say exactly who owns which part during an incident, and payment incidents always happen at 2am.

```
flowchart LR
    CL(["Client / Checkout UI"]) --> API["Payment API<br/>(.NET 10, Minimal API)"]
    API --> IDP[("Idempotency Store<br/>Redis + SQL backup")]
    API --> LDG[("Ledger DB<br/>Postgres Serializable")]
    API --> ORC["Saga Orchestrator<br/>Temporal / MassTransit"]
    ORC --> RISK["Risk / Fraud<br/>Engine"]
    ORC --> TOK["Tokenization<br/>Vault"]
    ORC --> ACQ(("Acquirer / PSP<br/>Stripe, Adyen, VNPay"))
    ACQ -. async .-> WHK["Webhook<br/>Listener"]
    WHK --> LDG
    LDG --> RECON["Reconciliation<br/>Worker (nightly)"]
    RECON --> REP[("Acquirer Report<br/>SFTP / API")]
    LDG --> OUT[("Transactional<br/>Outbox")]
    OUT --> BUS[("Event Bus<br/>Kafka / RabbitMQ")]
    BUS --> DW[("Data Warehouse<br/>ClickHouse / BigQuery")]

```

Figure 1: Layered architecture of a payment gateway — API, ledger, orchestrator, acquirer, reconciliation

There are five distinct responsibility zones to identify:

- **Payment API** — accepts the client request, verifies idempotency, creates an intent, returns a client_secret or redirect URL. Never calls the acquirer directly; only writes the intent and hands off to the orchestrator.
- **Saga Orchestrator** — coordinates the sequence of steps (risk check → tokenize → authorize → capture → webhook) with the ability to compensate each step. This is where the state machine "lives" and where execution resumes after a crash.
- **Ledger** — the system's economic source of truth. Every movement (authorize, capture, refund, chargeback) is an immutable row. No UPDATE, only INSERT; balances are sums.
- **Webhook Listener** — consumes async events from the acquirer (payment_intent.succeeded, charge.refunded, dispute.created). Verifies signatures, updates the ledger, triggers downstream.
- **Reconciliation** — a nightly worker reconciling the internal ledger with the acquirer's settlement file, catching mismatches before they become an accountant's dispute.

## 3. Idempotency Key — The First Shield Against Double-Charge

The idempotency key is the most important shield, and also the technique most frequently implemented incorrectly. The principle was popularised by Stripe's 2017 engineering write-up and is now an industry default: *the client generates a unique key per payment intent; any payment-creating request sent with the same key returns the same response, no matter how many times it is retried*.

```
// Payment API endpoint (.NET 10 Minimal API) with idempotency handling
// Requires: using System.Security.Cryptography; using System.Text.Json;
app.MapPost("/v1/payment_intents", async (
    [FromHeader(Name = "Idempotency-Key")] string idemKey,
    [FromBody] CreateIntentRequest req,
    IIdempotencyStore idem,
    IPaymentService svc,
    CancellationToken ct) =>
{
    if (string.IsNullOrWhiteSpace(idemKey) || idemKey.Length > 255)
        return Results.BadRequest("Idempotency-Key header is required (max 255 characters)");

    // Hash the body to detect key reuse with a different body
    var bodyHash = Convert.ToHexString(
        SHA256.HashData(JsonSerializer.SerializeToUtf8Bytes(req)));

    var saved = await idem.TryBeginAsync(idemKey, bodyHash, ct);

    // Same key, same body, already finished: replay the stored response
    if (saved is { Status: IdemStatus.Completed } done)
        return Results.Content(done.ResponseJson, "application/json", null, done.StatusCode);

    if (saved is { Status: IdemStatus.InFlight })
        return Results.StatusCode(409); // conflict: let the client retry later

    if (saved is { Status: IdemStatus.BodyMismatch })
        return Results.StatusCode(422); // key reused with a different body: client bug

    try
    {
        var intent = await svc.CreateIntentAsync(req, ct);
        var resp = JsonSerializer.Serialize(intent);
        await idem.CompleteAsync(idemKey, 201, resp, ct);
        return Results.Content(resp, "application/json", null, 201);
    }
    catch (Exception ex)
    {
        // Record the failure so the key does not stay InFlight forever
        await idem.FailAsync(idemKey, ex.Message, ct);
        throw;
    }
});

```

#### Five details that separate correct from broken idempotency

- **Store the body hash too, not just the key.** If the client sends the same key with a different body (bug or attack), the server must reject with 422 rather than replay the old response and create a false impression.
- **An explicit In-Flight status.** A request currently running must be marked so retries receive 409 and wait, instead of running in parallel and creating two authorizations.
- **TTL 24–72 hours is the sweet spot.** Shorter and retries after a crash won't match; longer and storage bloats indefinitely.
- **A serializable transaction for the insert phase.** Races between two requests sharing a key must be stopped at the DB layer, not trusted to application-level logic.
- **Scope by tenant/merchant.** Key "abc-123" for merchant A must not collide with merchant B; the composite primary key is always `(tenant_id, idem_key)`.

Ideal idempotency storage is *two-tier*: Redis as the first hit serving checks under 5ms, but every change also written to Postgres as the source of truth. Losing Redis doesn't lose money; losing Postgres does. The "Redis write-through Postgres" pattern is the standard — don't use Redis as the sole store.
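
The status logic behind `TryBeginAsync` is not shown in the endpoint above, so here is a minimal in-memory sketch of how it could behave. `InMemoryIdempotencyStore`, `IdemRecord`, and the extra `New` status are hypothetical names introduced for illustration; a production store would sit on Postgres (with Redis in front) and add a TTL, neither of which is modelled here.

```csharp
using System;
using System.Collections.Concurrent;

var store = new InMemoryIdempotencyStore();
var first  = store.TryBegin("tenant-a", "abc-123", "hash-1");   // first sighting of the key
var retry  = store.TryBegin("tenant-a", "abc-123", "hash-1");   // retry while still running
store.Complete("tenant-a", "abc-123", 201, "{\"id\":\"pi_1\"}");
var replay = store.TryBegin("tenant-a", "abc-123", "hash-1");   // retry after completion
var clash  = store.TryBegin("tenant-a", "abc-123", "hash-2");   // same key, different body
var other  = store.TryBegin("tenant-b", "abc-123", "hash-9");   // same key, different tenant
Console.WriteLine($"{first.Status} {retry.Status} {replay.Status} {clash.Status} {other.Status}");
// prints: New InFlight Completed BodyMismatch New

public enum IdemStatus { New, InFlight, Completed, BodyMismatch }

public sealed record IdemRecord(
    IdemStatus Status, string BodyHash, int StatusCode = 0, string? ResponseJson = null);

public sealed class InMemoryIdempotencyStore
{
    // Composite key (tenant, idempotency key), mirroring the (tenant_id, idem_key) primary key.
    private readonly ConcurrentDictionary<(string Tenant, string Key), IdemRecord> _rows = new();

    public IdemRecord TryBegin(string tenant, string key, string bodyHash)
    {
        var fresh = new IdemRecord(IdemStatus.InFlight, bodyHash);
        // GetOrAdd stores exactly one row per key, even when two callers race.
        var stored = _rows.GetOrAdd((tenant, key), fresh);
        if (ReferenceEquals(stored, fresh)) return fresh with { Status = IdemStatus.New };
        if (stored.BodyHash != bodyHash) return stored with { Status = IdemStatus.BodyMismatch };
        return stored; // InFlight or Completed
    }

    public void Complete(string tenant, string key, int statusCode, string responseJson)
    {
        if (!_rows.TryGetValue((tenant, key), out var row))
            throw new InvalidOperationException("unknown idempotency key");
        _rows[(tenant, key)] = row with
        {
            Status = IdemStatus.Completed, StatusCode = statusCode, ResponseJson = responseJson
        };
    }
}
```

The winner of the race is the only caller that sees `New`; everyone else gets `InFlight` until the response is stored. In the database-backed version the same role is played by the unique constraint on `(tenant_id, idem_key)`.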

## 4. Double-Entry Ledger — The Economic Source of Truth That You Only Append, Never Update

The single most-miswritten database design in payment systems is *a `payments` table with a `status` column* that gets UPDATEd repeatedly. This design dies the moment there's a dispute: there's no way to know what state the payment was in at which point, no audit path, no way to rebuild balances. Accounting solved this 500 years ago: **double-entry ledger**.

```
flowchart LR
    subgraph Ledger
        LT[("ledger_txn<br/>id, type, idem_key, created_at")]
        LE[("ledger_entry<br/>txn_id, account_id, amount, sign")]
    end
    LT --- LE
    A1(["customer:123:available"])
    A2(["merchant:anhtu:pending"])
    A3(["merchant:anhtu:available"])
    A4(["bank:acquirer:stripe"])
    LE -.->|"capture"| A1
    LE -.->|"capture"| A2
    LE -.->|"settle T+2"| A2
    LE -.->|"settle T+2"| A3
    LE -.->|"payout"| A3
    LE -.->|"payout"| A4

```

Figure 2: Double-entry ledger — every transaction produces at least two balancing entries; the balance is a sum per account

Two principles you cannot break:

- **Entries are append-only.** No UPDATE, no DELETE. "Cancelling" a transaction means writing another transaction with the opposite sign (reversal), leaving the audit log intact.
- **Each transaction's entries must sum to 0.** If you take 100k from account A, exactly 100k must be written to another account. Enforce this with a trigger or domain rule so the ledger never "inflates" or "deflates" without reason.
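
Both principles can also be expressed as a domain rule that runs before anything reaches the database. The sketch below is illustrative only: `Entry` and `Ledger` are hypothetical names, and checking the sum per currency is an assumption added here, not something the text above prescribes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var capture = new[]
{
    new Entry("customer:123:available", "VND", -100_000),
    new Entry("merchant:anhtu:pending", "VND", +100_000),
};
Ledger.EnsureBalanced(capture);           // passes: the two entries sum to 0

// "Cancelling" is a new transaction with opposite signs, never an UPDATE or DELETE.
var reversal = Ledger.Reverse(capture);
Ledger.EnsureBalanced(reversal);
Console.WriteLine(reversal[0].AmountMinor);   // prints 100000

public sealed record Entry(string AccountId, string Currency, long AmountMinor);

public static class Ledger
{
    public static void EnsureBalanced(IReadOnlyCollection<Entry> entries)
    {
        if (entries.Count < 2)
            throw new InvalidOperationException("a transaction needs at least two entries");
        // Assumption: each currency must balance on its own.
        foreach (var group in entries.GroupBy(e => e.Currency))
        {
            long sum = group.Sum(e => e.AmountMinor);
            if (sum != 0)
                throw new InvalidOperationException($"imbalanced by {sum} {group.Key}");
        }
    }

    public static Entry[] Reverse(IReadOnlyCollection<Entry> entries) =>
        entries.Select(e => e with { AmountMinor = -e.AmountMinor }).ToArray();
}
```

A check like this gives a fast, readable failure in application code; the database-level invariant remains the backstop.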

```
-- Minimal but accurate Postgres schema
CREATE TABLE ledger_txn (
    id              bigint PRIMARY KEY,
    tenant_id       bigint NOT NULL,
    type            text   NOT NULL,   -- authorize | capture | refund | chargeback
    idem_key        text   NOT NULL,
    created_at      timestamptz NOT NULL DEFAULT now(),
    UNIQUE (tenant_id, idem_key)
);

CREATE TABLE ledger_entry (
    id              bigserial PRIMARY KEY,
    txn_id          bigint NOT NULL REFERENCES ledger_txn(id),
    account_id      text   NOT NULL,   -- 'customer:123:available'
    currency        char(3) NOT NULL,
    amount_minor    bigint NOT NULL,   -- signed, minor units
    created_at      timestamptz NOT NULL DEFAULT now()
);

-- invariant: per-txn sum(amount_minor) = 0
CREATE OR REPLACE FUNCTION ensure_balanced() RETURNS trigger AS $$
DECLARE s bigint;
BEGIN
    SELECT sum(amount_minor) INTO s FROM ledger_entry WHERE txn_id = NEW.txn_id;
    IF s <> 0 THEN
        RAISE EXCEPTION 'ledger txn % imbalanced by %', NEW.txn_id, s;
    END IF;
    RETURN NULL;
END; $$ LANGUAGE plpgsql;

CREATE CONSTRAINT TRIGGER ledger_balance_check
AFTER INSERT ON ledger_entry DEFERRABLE INITIALLY DEFERRED
FOR EACH ROW EXECUTE FUNCTION ensure_balanced();
```
Account balance is a *view*: `SELECT sum(amount_minor) FROM ledger_entry WHERE account_id = ?`. At high volume you cache balances in a continuously-rebuilt materialized table — but the source of truth is always the sum. On disputes, you can replay the entire ledger to prove every cent flowed correctly.
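
Because entries are never updated, the same sum can be taken "as of" any moment, which is what makes dispute investigations tractable. A minimal sketch, assuming a hypothetical `LedgerRow` shape that mirrors the `account_id`, `amount_minor`, and `created_at` columns:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var t0 = new DateTimeOffset(2026, 4, 1, 0, 0, 0, TimeSpan.Zero);
var log = new List<LedgerRow>
{
    new("merchant:anhtu:pending",   +100_000, t0),             // capture
    new("merchant:anhtu:pending",   -100_000, t0.AddDays(2)),  // settle T+2: out of pending
    new("merchant:anhtu:available", +100_000, t0.AddDays(2)),  // settle T+2: into available
};

Console.WriteLine(Balances.AsOf(log, "merchant:anhtu:pending", t0.AddDays(1)));    // prints 100000
Console.WriteLine(Balances.AsOf(log, "merchant:anhtu:pending", t0.AddDays(3)));    // prints 0
Console.WriteLine(Balances.AsOf(log, "merchant:anhtu:available", t0.AddDays(3)));  // prints 100000

public sealed record LedgerRow(string AccountId, long AmountMinor, DateTimeOffset CreatedAt);

public static class Balances
{
    // The balance is a pure function of the append-only log and a point in time.
    public static long AsOf(IEnumerable<LedgerRow> rows, string accountId, DateTimeOffset asOf) =>
        rows.Where(r => r.AccountId == accountId && r.CreatedAt <= asOf)
            .Sum(r => r.AmountMinor);
}
```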

## 5. Saga Pattern — When One Authorization Is Five Steps That Can Fail Anywhere

A payment isn't a query. It's a sequence: risk check → tokenize card → call acquirer to authorize → write ledger → emit event. Each step can time out, fail, or return "maybe" (the acquirer especially). And each step has a *compensating action* if a later step fails: cancel the authorization, issue a refund, emit a compensating event. This is the essence of the Saga Pattern.
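
To make "compensating action" concrete, here is a deliberately tiny in-process sketch: each completed step registers its undo, and a failure unwinds them in reverse order. The `Saga` class and the step names are hypothetical; it has no persistence or crash recovery, which is exactly what a real orchestrator adds.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

var log = new List<string>();
var saga = new Saga();
try
{
    await saga.StepAsync(
        () => { log.Add("risk-check"); return Task.CompletedTask; },
        () => { log.Add("release-risk-hold"); return Task.CompletedTask; });
    await saga.StepAsync(
        () => { log.Add("authorize"); return Task.CompletedTask; },
        () => { log.Add("void-authorization"); return Task.CompletedTask; });
    await saga.StepAsync(
        () => throw new InvalidOperationException("ledger write failed"),
        () => Task.CompletedTask);
}
catch (InvalidOperationException)
{
    await saga.CompensateAsync();
}
Console.WriteLine(string.Join(" -> ", log));
// prints: risk-check -> authorize -> void-authorization -> release-risk-hold

public sealed class Saga
{
    private readonly Stack<Func<Task>> _compensations = new();

    public async Task StepAsync(Func<Task> action, Func<Task> compensation)
    {
        await action();
        // Only steps that completed get a compensation registered.
        _compensations.Push(compensation);
    }

    public async Task CompensateAsync()
    {
        // Undo in reverse order of completion.
        while (_compensations.Count > 0)
            await _compensations.Pop()();
    }
}
```

Note the limitation: this assumes every step has a definite outcome. A timed-out acquirer call has an unknown outcome, so it cannot simply be "voided" this way.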

| Criterion | Saga Choreography | Saga Orchestration |
| --- | --- | --- |
| Control | Each service listens to events and reacts | A central orchestrator calls each service |
| Coupling | Low — services don't know each other | Higher — the orchestrator knows the whole flow |
| Observability | Hard — flow scattered across the event log | Easy — centralised state machine |
| Compensation | Complex — each service remembers its own | Direct — the orchestrator calls the inverse action |
| Best for | Simple flows, <4 steps, independent teams | Payment, booking, order — many rollbackable steps |
| 2026 tooling | MassTransit, NServiceBus, Kafka | Temporal, Cadence, Dapr Workflows, AWS Step Functions |

For payments, **orchestration is the right choice almost every time**. Flows have 5–10 steps, each step has a clear compensation, and observability is a hard requirement for accounting and compliance. Temporal (or Dapr Workflows for lighter teams) is the standard tool.

```
sequenceDiagram
    autonumber
    participant C as Client
    participant A as Payment API
    participant T as Temporal Worker
    participant R as Risk Engine
    participant V as Vault
    participant P as PSP (Stripe)
    participant L as Ledger
    C->>A: POST /intents (Idempotency-Key)
    A->>T: StartWorkflow(intentId)
    T->>R: RiskCheck(card, user, ip)
    R-->>T: score=0.2 approved
    T->>V: TokenizeCard(PAN)
    V-->>T: token=tok_abc
    T->>P: Authorize(token, amount)
    P-->>T: auth_id=ch_123 approved
    T->>L: WriteAuthorizeEntries
    T-->>A: Workflow complete
    A-->>C: 201 Created (intent)
    Note over T,P: If any step fails,<br/>compensation walks it back

```

Figure 3: Orchestrated saga for authorize — every step has its own timeout, retry policy, and explicit compensation

```
// Temporal workflow for authorize intent (Temporal .NET SDK)
using Temporalio.Exceptions;
using Temporalio.Workflows;

[Workflow]
public class AuthorizeIntentWorkflow
{
    [WorkflowRun]
    public async Task<IntentResult> RunAsync(AuthorizeInput input)
    {
        // Each activity has its own retry policy; exceeding attempts raises to the workflow
        var risk = await Workflow.ExecuteActivityAsync(
            (IRiskActivities a) => a.ScoreAsync(input),
            new() { StartToCloseTimeout = TimeSpan.FromSeconds(5),
                    RetryPolicy = new() { MaximumAttempts = 3 } });
        if (risk.Decision == "deny")
            return IntentResult.Declined("risk_block");

        var token = await Workflow.ExecuteActivityAsync(
            (IVaultActivities a) => a.TokenizeAsync(input.Card),
            new() { StartToCloseTimeout = TimeSpan.FromSeconds(3) });

        try
        {
            var auth = await Workflow.ExecuteActivityAsync(
                (IPsPActivities a) => a.AuthorizeAsync(token, input.AmountMinor, input.Currency),
                new() { StartToCloseTimeout = TimeSpan.FromSeconds(30),
                        RetryPolicy = new() { MaximumAttempts = 1 } });  // DO NOT retry acquirer
            await Workflow.ExecuteActivityAsync(
                (ILedgerActivities a) => a.RecordAuthorizeAsync(input.IntentId, auth),
                new() { StartToCloseTimeout = TimeSpan.FromSeconds(2) });
            return IntentResult.Approved(auth.AuthId);
        }
        catch (ActivityFailureException)
        {
            // Compensation: don't call the PSP, because an authorization may or may not exist;
            // mark the intent failed and let reconciliation sort it out
            await Workflow.ExecuteActivityAsync(
                (ILedgerActivities a) => a.MarkIntentFailedAsync(input.IntentId),
                new() { StartToCloseTimeout = TimeSpan.FromSeconds(2) });
            throw;
        }
    }
}

```

#### Never auto-retry authorize calls to the acquirer

The natural instinct when an activity fails is to retry. But for an authorize call, retry is **mortally dangerous**: a timeout does not mean failure — the money may already be held on the acquirer's side while the network simply lost the response. Retrying creates a second authorization. The golden rule: authorize/capture/refund calls to the acquirer only retry when the acquirer itself supports an idempotency key (Stripe, Adyen do; some domestic PSPs do not). When it doesn't: call once, and let the reconciliation worker track it down later.
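
The golden rule reduces to a small decision table. The sketch below is one way to write it down; `PspCallOutcome`, `NextAction`, and `RetryRules` are hypothetical names, and the set of "ambiguous" outcomes is an assumption that would need to match each PSP's actual error model.

```csharp
using System;

Console.WriteLine(RetryRules.Decide(PspCallOutcome.Timeout, pspSupportsIdempotencyKey: true));
// prints: RetryWithSameKey
Console.WriteLine(RetryRules.Decide(PspCallOutcome.Timeout, pspSupportsIdempotencyKey: false));
// prints: HandOffToReconciliation
Console.WriteLine(RetryRules.Decide(PspCallOutcome.Declined, pspSupportsIdempotencyKey: true));
// prints: Fail

public enum PspCallOutcome { Approved, Declined, Timeout, ConnectionReset }

public enum NextAction { Done, Fail, RetryWithSameKey, HandOffToReconciliation }

public static class RetryRules
{
    public static NextAction Decide(PspCallOutcome outcome, bool pspSupportsIdempotencyKey) =>
        outcome switch
        {
            PspCallOutcome.Approved => NextAction.Done,
            // A decline is a definite answer; retrying would not change it.
            PspCallOutcome.Declined => NextAction.Fail,
            // Everything else is ambiguous: the hold may already exist on the acquirer side.
            _ when pspSupportsIdempotencyKey => NextAction.RetryWithSameKey,
            _ => NextAction.HandOffToReconciliation,
        };
}
```

The important property is that the retry reuses the same idempotency key; generating a fresh key per attempt would defeat the acquirer-side protection.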

## 6. Transactional Outbox — The Bridge Guaranteeing DB Commit and Message Publish

A very common bug: the service writes to the ledger successfully, then emits a "payment.succeeded" event to Kafka, then returns a response to the client. Problem: *those two steps aren't atomic*. If the service crashes between them, the ledger is written but the event never fires, and downstream (email confirmation, analytics, loyalty points) never runs. The fix pattern is called **Transactional Outbox**.

```
flowchart LR
    API["Payment API"] --> TX{"BEGIN TX"}
    TX --> L[("ledger_entry")]
    TX --> O[("outbox_event")]
    TX --> C{"COMMIT"}
    C --> R["Outbox Relay<br/>(CDC or poller)"]
    R --> B[("Kafka / RabbitMQ")]
    B --> D1["Email Service"]
    B --> D2["Loyalty Service"]
    B --> D3["Analytics"]

```

Figure 4: Outbox pattern — ledger and event commit in the same transaction; the relay pushes to the bus afterwards

The mechanism is simple but rock-solid: the ledger row and the event row are written in the same SQL transaction. COMMIT commits both; on crash, both roll back. A dedicated relay worker reads `outbox_event` and publishes to the bus, marking rows as published. The bus handles at-least-once; consumers must be idempotent.

```
-- Outbox table
CREATE TABLE outbox_event (
    id           bigserial PRIMARY KEY,
    aggregate_id text NOT NULL,
    event_type   text NOT NULL,
    payload      jsonb NOT NULL,
    created_at   timestamptz NOT NULL DEFAULT now(),
    published_at timestamptz
);

-- Partial index over unpublished rows (Postgres declares indexes outside CREATE TABLE)
CREATE INDEX outbox_event_unpublished
    ON outbox_event (created_at) WHERE published_at IS NULL;
```

The relay worker (a .NET `BackgroundService`) reads a batch of 100, publishes, and marks the rows as published:

```
while (!stoppingToken.IsCancellationRequested)
{
    using var tx = await db.BeginTransactionAsync();
    var batch = await db.QueryAsync<OutboxRow>(
        "SELECT * FROM outbox_event WHERE published_at IS NULL " +
        "ORDER BY id LIMIT 100 FOR UPDATE SKIP LOCKED");
    // Nothing to publish: back off briefly (the transaction is rolled back on dispose)
    if (!batch.Any()) { await Task.Delay(200, stoppingToken); continue; }
    foreach (var row in batch)
        await producer.ProduceAsync("payment.events", row.ToKafkaMessage());
    await db.ExecuteAsync(
        "UPDATE outbox_event SET published_at = now() WHERE id = ANY(@ids)",
        new { ids = batch.Select(b => b.Id).ToArray() });
    // A crash before this commit re-publishes the batch: at-least-once delivery
    await tx.CommitAsync();
}
```
`FOR UPDATE SKIP LOCKED` is the crucial detail — it lets many relay workers run in parallel without stepping on each other. CDC-based outbox (Debezium reading the Postgres WAL and pushing to Kafka) is an advanced variant for throughput beyond 10k events/s.
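
Since the relay can publish the same row twice (a crash between publish and commit), every consumer has to deduplicate. A minimal sketch of an idempotent consumer follows; `LoyaltyConsumer` is a hypothetical example, and the in-memory `HashSet` stands in for what would normally be a processed-events table written in the same transaction as the side effect.

```csharp
using System;
using System.Collections.Generic;

var loyalty = new LoyaltyConsumer();
loyalty.Handle(eventId: 42, customerId: "cus_1", points: 10);
loyalty.Handle(eventId: 42, customerId: "cus_1", points: 10);   // redelivery: ignored
loyalty.Handle(eventId: 43, customerId: "cus_1", points: 5);
Console.WriteLine(loyalty.PointsOf("cus_1"));   // prints 15

public sealed class LoyaltyConsumer
{
    private readonly HashSet<long> _processed = new();
    private readonly Dictionary<string, int> _points = new();

    // Returns false when the event was already handled.
    public bool Handle(long eventId, string customerId, int points)
    {
        // HashSet.Add returns false if the id is already present.
        if (!_processed.Add(eventId)) return false;
        _points[customerId] = PointsOf(customerId) + points;
        return true;
    }

    public int PointsOf(string customerId) => _points.GetValueOrDefault(customerId);
}
```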

## 7. Acquirer Webhooks — All At-Least-Once, Signed, and Replay-Protected

Most of the payment flow actually completes *via webhook*, not via the initial HTTP response. Authorize might succeed synchronously, but 3DS challenges, async capture, refunds, and chargebacks all return via webhook. The webhook listener is an extremely risky subsystem if underestimated; the table below lists five common mistakes, any of which can be fatal.

| Common mistake | Symptom | Consequence | Defence |
| --- | --- | --- | --- |
| No signature verification | Accepts requests pretending to be from PSP | Attacker "confirms" fake payments | HMAC check with shared secret, reject outside tolerance window |
| Duplicates not handled | PSP retries 2–3 times, logged each time | Double ledger entry, broken bookkeeping | Idempotent on the PSP's event_id |
| Return 2xx too early | PSP thinks you processed it, but you didn't | Lost events when the worker crashes mid-process | Persist to an internal queue first, ack after |
| No out-of-order handling | succeeded arrives before created | State machine rejects a valid event | Buffer and resolve by event_type precedence |
| Slow inline processing | PSP times out, retry storm | Webhook queue tens of thousands deep | Accept + persist + 200 immediately, process async |

```
// Proper webhook handler (.NET 10 Minimal API)
app.MapPost("/webhooks/stripe", async (
    HttpRequest httpReq,
    [FromServices] IWebhookVerifier verifier,
    [FromServices] IWebhookQueue queue,
    CancellationToken ct) =>
{
    using var reader = new StreamReader(httpReq.Body);
    var rawBody = await reader.ReadToEndAsync(ct);
    var signature = httpReq.Headers["Stripe-Signature"].ToString();

    // 1. Verify HMAC with 5-minute tolerance to defeat replay
    if (!verifier.VerifyAndCheckTimestamp(rawBody, signature, TimeSpan.FromMinutes(5)))
        return Results.Unauthorized();

    var evt = JsonSerializer.Deserialize<StripeEvent>(rawBody)!;

    // 2. Idempotent on Stripe's event.id: a duplicate returns 200 immediately
    if (!await queue.EnqueueIfNewAsync(evt.Id, evt.Type, rawBody, ct))
        return Results.Ok(); // already seen; ack so Stripe stops retrying

    // 3. Return 200 immediately; the worker processes async
    return Results.Ok();
});
```
Rule for tolerating late events: every state machine must accept *any arrival order*. If `payment_intent.succeeded` arrives before `payment_intent.created`, don't reject — mark it pending and reconcile once the earlier event arrives. Major PSPs guarantee at-least-once but not total ordering.
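
The handler above delegates to `VerifyAndCheckTimestamp` without showing it. Below is a sketch of what such a check could look like for a Stripe-style header of the form `t=<unix seconds>,v1=<hex HMAC-SHA256 of "t.body">`. `WebhookSignature` and the secret value are hypothetical; for real integrations the PSP's official SDK verifier is the safer choice.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

var secret = "whsec_example";                 // hypothetical endpoint secret
var body = "{\"id\":\"evt_1\"}";
var now = DateTimeOffset.UtcNow;
var ts = now.ToUnixTimeSeconds();
var header = $"t={ts},v1={WebhookSignature.Compute(secret, ts, body)}";
var tolerance = TimeSpan.FromMinutes(5);

Console.WriteLine(WebhookSignature.Verify(secret, header, body, now, tolerance));               // True
Console.WriteLine(WebhookSignature.Verify(secret, header, body + " ", now, tolerance));         // False: body changed
Console.WriteLine(WebhookSignature.Verify(secret, header, body, now.AddMinutes(6), tolerance)); // False: too old

public static class WebhookSignature
{
    public static string Compute(string secret, long timestamp, string rawBody)
    {
        // The timestamp is part of the signed payload, so it cannot be altered on replay.
        var mac = HMACSHA256.HashData(
            Encoding.UTF8.GetBytes(secret),
            Encoding.UTF8.GetBytes($"{timestamp}.{rawBody}"));
        return Convert.ToHexString(mac).ToLowerInvariant();
    }

    public static bool Verify(string secret, string header, string rawBody,
                              DateTimeOffset now, TimeSpan tolerance)
    {
        var parts = header.Split(',').Select(p => p.Split('=', 2))
                          .Where(p => p.Length == 2).ToList();
        var t = parts.FirstOrDefault(p => p[0] == "t")?[1];
        if (!long.TryParse(t, out var timestamp)) return false;
        if ((now - DateTimeOffset.FromUnixTimeSeconds(timestamp)).Duration() > tolerance)
            return false;

        var expected = Encoding.UTF8.GetBytes(Compute(secret, timestamp, rawBody));
        // Constant-time comparison; accept any matching v1 entry (several may be present).
        return parts.Where(p => p[0] == "v1").Any(p =>
            CryptographicOperations.FixedTimeEquals(expected, Encoding.UTF8.GetBytes(p[1])));
    }
}
```

Verification must run on the raw request body, before any JSON parsing, because re-serialising the payload changes the bytes that were signed.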

## 8. Reconciliation — The Nightly Worker That Sees Everything Webhooks Missed

No matter how hard webhooks and the saga try, there's still a class of events that *never* make it to your system: events swallowed when the PSP changes formats, webhook retries that ran out, network partitions lasting hours. That's why every serious payment system has a **reconciliation worker** running nightly, reconciling the internal ledger against the settlement report from the acquirer.

- **23:30, fetch settlement.** The worker pulls the settlement file or calls `/v1/balance_transactions` for all of day T-1 and writes into a staging table `psp_settlement_raw`.
- **23:45, normalize.** Normalise formats (Stripe, Adyen, VNPay each have their own) into a common schema: `(psp_ref, type, amount_minor, currency, occurred_at)`.
- **00:00, diff.** FULL OUTER JOIN ledger against settlement by `psp_ref`; a LEFT JOIN from the ledger alone would never surface rows that exist only in the settlement. Three kinds of mismatch: (a) in ledger, not in settlement: possible phantom auth; (b) in settlement, not in ledger: lost webhook; (c) amount drift: risk change or partial capture.
- **00:30, auto-heal.** For type (b), re-query the PSP by `psp_ref`; if confirmed valid, write a supplementary ledger entry with txn type `recon_backfill`. Record the metric `recon.backfilled_total`.
- **01:00, alert.** Remaining mismatches after auto-heal go into `recon_exception` and wake the accounting PagerDuty rotation. SLA: clear every exception within 48 hours.
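
The diff step is easy to get subtly wrong, so here is a small sketch of the three-way classification over two maps keyed by `psp_ref`. The types `Recon`, `Mismatch`, and `MismatchKind` are hypothetical, and amounts are simplified to a single number per reference.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var ledger = new Dictionary<string, long> { ["ch_1"] = 1000, ["ch_2"] = 2000, ["ch_3"] = 3000 };
var settlement = new Dictionary<string, long> { ["ch_2"] = 2000, ["ch_3"] = 2900, ["ch_4"] = 4000 };

foreach (var m in Recon.Diff(ledger, settlement))
    Console.WriteLine($"{m.PspRef}: {m.Kind}");
// prints:
// ch_1: InLedgerOnly
// ch_3: AmountDrift
// ch_4: InSettlementOnly

public enum MismatchKind { InLedgerOnly, InSettlementOnly, AmountDrift }

public sealed record Mismatch(string PspRef, MismatchKind Kind);

public static class Recon
{
    public static List<Mismatch> Diff(
        IReadOnlyDictionary<string, long> ledger,
        IReadOnlyDictionary<string, long> settlement)
    {
        var result = new List<Mismatch>();
        // Walk the union of both key sets, the in-memory equivalent of a full outer join.
        foreach (var pspRef in ledger.Keys.Union(settlement.Keys).OrderBy(k => k, StringComparer.Ordinal))
        {
            var inLedger = ledger.TryGetValue(pspRef, out var booked);
            var inSettlement = settlement.TryGetValue(pspRef, out var settled);
            if (inLedger && !inSettlement) result.Add(new(pspRef, MismatchKind.InLedgerOnly));
            else if (!inLedger && inSettlement) result.Add(new(pspRef, MismatchKind.InSettlementOnly));
            else if (booked != settled) result.Add(new(pspRef, MismatchKind.AmountDrift));
        }
        return result;
    }
}
```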

#### Reconciliation isn't supporting cast — it's the third line of defence

Idempotency blocks double-charge *at request time*, the Temporal saga ensures workflows don't drop mid-execution, reconciliation ensures *the end-of-day balance* is right no matter what. Three independent layers, defending against failures at three different moments. Skip any layer and sooner or later accountants will be counting by hand.

## 9. 3DS 2.x and SCA — The Async Flow That Needs Its Own State Machine

Since PSD2 in Europe and equivalents in many countries, *Strong Customer Authentication* (SCA) via 3DS 2.x is no longer optional. This flow turns authorize from "call the API and get a result" into "initiate challenge, redirect user, wait for the browser to come back, handle the outcome". A dedicated state machine is mandatory.

```
stateDiagram-v2
    [*] --> Requires_PM: create intent
    Requires_PM --> Requires_Action: attach card, PSP returns requires_action
    Requires_Action --> Processing: user completes 3DS challenge, browser returns
    Processing --> Succeeded: acquirer confirms authorize
    Processing --> Failed: acquirer declines or 3DS times out
    Requires_Action --> Failed: user closes browser past 10 minutes
    Succeeded --> Captured: capture at T+0
    Captured --> Refunded: partial or full refund
    Captured --> Disputed: chargeback
    Disputed --> Captured: dispute_won
    Disputed --> Refunded: dispute_lost

```

Figure 5: Intent state machine including async 3DS branches, disputes, and refunds

Four production principles when implementing 3DS:

- **System-side timeout on the user challenge.** If an intent is `requires_action` for more than 15 minutes, auto-cancel it to avoid holding the acquirer's authorization and incurring fees.
- **Don't trust the client redirect.** The returning browser can be forged or replayed; the authoritative 3DS result comes from the PSP's async webhook, not the URL.
- **Persist the 3DS outcome in the ledger.** A `sca_outcome` column in `ledger_txn` enables audit and proves exemption eligibility (low value, recurring) when needed.
- **Fallback to low-risk exemption.** Authorization rates rise noticeably when you correctly apply TRA (Transaction Risk Analysis) exemption — which requires risk engine integration from the outset.
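
The state machine in Figure 5 can be encoded as an explicit transition table so that illegal moves fail loudly. This is a sketch only: `IntentState` and `IntentMachine` are hypothetical names, the edges are copied from the diagram, and timeouts are left out.

```csharp
using System;
using System.Collections.Generic;

var state = IntentState.RequiresPaymentMethod;
state = IntentMachine.Apply(state, IntentState.RequiresAction);
state = IntentMachine.Apply(state, IntentState.Processing);
state = IntentMachine.Apply(state, IntentState.Succeeded);
Console.WriteLine(state);                                                             // prints Succeeded
Console.WriteLine(IntentMachine.CanMove(IntentState.Failed, IntentState.Succeeded));  // prints False

public enum IntentState
{
    RequiresPaymentMethod, RequiresAction, Processing, Succeeded,
    Failed, Captured, Refunded, Disputed
}

public static class IntentMachine
{
    // Allowed edges, taken from the state diagram; Failed and Refunded are terminal.
    private static readonly Dictionary<IntentState, IntentState[]> Allowed = new()
    {
        [IntentState.RequiresPaymentMethod] = new[] { IntentState.RequiresAction },
        [IntentState.RequiresAction]        = new[] { IntentState.Processing, IntentState.Failed },
        [IntentState.Processing]            = new[] { IntentState.Succeeded, IntentState.Failed },
        [IntentState.Succeeded]             = new[] { IntentState.Captured },
        [IntentState.Captured]              = new[] { IntentState.Refunded, IntentState.Disputed },
        [IntentState.Disputed]              = new[] { IntentState.Captured, IntentState.Refunded },
    };

    public static bool CanMove(IntentState from, IntentState to) =>
        Allowed.TryGetValue(from, out var next) && Array.IndexOf(next, to) >= 0;

    public static IntentState Apply(IntentState from, IntentState to) =>
        CanMove(from, to)
            ? to
            : throw new InvalidOperationException($"illegal transition {from} -> {to}");
}
```

Keeping the table in one place means a forged browser redirect cannot push an intent to `Succeeded`; only a transition triggered by the verified webhook path would be applied.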

## 10. Payment Observability — The Metrics That Deserve Their Own Dashboard

Observability for a payment system differs from a normal service in one way: *every metric translates into money*. p99 latency isn't just UX — it determines how many customers abandon their cart. Auth rate isn't just "ok or not" — it's the percentage of revenue you're losing to acquirer declines. A proper payment dashboard must have the following metrics, sliced by PSP, card scheme, BIN, and country.

| Metric | Meaning | Suggested 2026 SLO | Slice by |
| --- | --- | --- | --- |
| authorization_rate | % of intents approved / total intents | ≥ 92% for non-3DS, ≥ 87% for 3DS | PSP, scheme, BIN, country |
| capture_latency_p99 | p99 time from request to capture | <5s (non-3DS), <30s (3DS) | PSP, amount bucket |
| webhook_lag_seconds | Lag between PSP event and ledger update | <60s p99, <600s p99.9 | event_type |
| recon_mismatch_count | Mismatch rows after the nightly run | <10/day self-heal, 0 to escalate | mismatch_type |
| idempotency_replay_rate | % of requests returning a cached response | <1% normally; spike = client bug | endpoint, tenant |
| fraud_block_rate | % of intents blocked by the risk engine | Balanced against chargeback rate | risk model version |
| chargeback_rate | % of txns becoming chargebacks | <0.9%; above this, card schemes may place you in a dispute-monitoring programme and can ultimately revoke merchant status | scheme, MCC |

Recommended observability stack for .NET 10: OpenTelemetry for tracing and metrics, Tempo or Jaeger for distributed traces, Loki for structured logs, Prometheus + Grafana for dashboards. Most importantly: *every metric must be traceable back to a ledger entry*. Trace IDs should be attached to the outbound requests sent to the PSP (for example as request metadata, where the PSP supports it) so that the webhooks coming back can be correlated and incident investigation has a cross-boundary audit trail.
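
As a small illustration of "sliced by PSP", the following sketch computes `authorization_rate` per PSP from raw intent outcomes. `AuthMetrics` and the sample data are hypothetical; in production the numbers would come from counters exported through OpenTelemetry rather than an in-memory list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up sample: 3 of 4 Stripe intents approved, 1 of 2 Adyen intents approved.
var intents = new List<(string Psp, bool Approved)>
{
    ("stripe", true), ("stripe", true), ("stripe", false), ("stripe", true),
    ("adyen", true), ("adyen", false),
};

foreach (var (psp, rate) in AuthMetrics.RateByPsp(intents))
    Console.WriteLine($"{psp}: {rate * 100} percent approved");

public static class AuthMetrics
{
    // authorization_rate = approved intents / total intents, grouped by PSP.
    public static IReadOnlyList<(string Psp, double Rate)> RateByPsp(
        IEnumerable<(string Psp, bool Approved)> intents) =>
        intents.GroupBy(i => i.Psp)
               .Select(g => (Psp: g.Key, Rate: g.Count(i => i.Approved) / (double)g.Count()))
               .OrderBy(x => x.Psp, StringComparer.Ordinal)
               .ToList();
}
```

The same grouping extends to card scheme, BIN, and country by widening the grouping key.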

## 11. Security and Compliance — PCI DSS v4, Tokenization, and Rules That Aren't a Joke

A payment system touches PANs (card numbers) and CVVs, which puts you in scope for **PCI DSS v4.0.1**, in full effect since March 2025. The most effective scope-reduction technique, and the right approach for a small team doing payments, is to *never touch the PAN*.

- **Tokenise at the edge.** PSP frontend SDKs (Stripe Elements, Adyen Web Components) accept the PAN from the user and swap it directly with the PSP for a token. Your server only sees the token — PCI scope drops to SAQ A, cutting from ~400 controls to ~30.
- **Vault it if you must store.** To charge a customer periodically without their presence, use the PSP's customer vault rather than rolling your own. The token vault is decryptable only by the PSP; you only hold the `customer_id`.
- **Encryption at rest + in transit everywhere.** TLS 1.3 is mandatory for every connection to the PSP; DB columns holding sensitive data (billing address, last-4, fingerprint) encrypted with KMS-managed keys and quarterly rotation.
- **Key management outside the service.** Secrets don't live in appsettings; use Azure Key Vault, AWS KMS, HashiCorp Vault. Audit access logs for ≥ 1 year.
- **Separation of duties.** The person deploying code must not also be the one approving manual refunds in the admin tool. Clear role separation is how you pass ISO 27001 and SOC 2.

#### A frequently overlooked detail: BIN-based routing and network tokens

Card schemes promote *network tokens* (Visa VTS, Mastercard MDES) as a replacement for a PSP's PAN-based token; this is a scheme programme rather than a PCI DSS v4 requirement. Network tokens are credited with raising auth rates by 3–5% and largely remove the risk of a stored token going stale when customers replace cards. Alongside, BIN-based routing lets you pick the optimal acquirer by card country/scheme, adding another 1–2% to auth rate. These two "hidden" optimisations can add up to millions of dollars in annual revenue for a mid-sized merchant.
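
BIN-based routing is essentially a longest-prefix lookup. The sketch below assumes the BIN (the leading digits of the card) is available without handling the full PAN, for example from the PSP's token metadata; `BinRouter`, the prefixes, and the acquirer names are all made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical prefixes and acquirer names, for illustration only.
var router = new BinRouter("stripe", new Dictionary<string, string>
{
    ["4"]      = "adyen",        // broad rule
    ["411111"] = "local-bank",   // more specific rule wins
});

Console.WriteLine(router.Route("411111"));   // prints local-bank
Console.WriteLine(router.Route("424242"));   // prints adyen
Console.WriteLine(router.Route("555555"));   // prints stripe

public sealed class BinRouter
{
    private readonly string _defaultAcquirer;
    private readonly List<KeyValuePair<string, string>> _rules;

    public BinRouter(string defaultAcquirer, IDictionary<string, string> acquirerByPrefix)
    {
        _defaultAcquirer = defaultAcquirer;
        // Longest prefix first, so the most specific rule is tried first.
        _rules = acquirerByPrefix.OrderByDescending(r => r.Key.Length).ToList();
    }

    public string Route(string bin)
    {
        foreach (var rule in _rules)
            if (bin.StartsWith(rule.Key, StringComparison.Ordinal))
                return rule.Value;
        return _defaultAcquirer;
    }
}
```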

## 12. Production Checklist — 20 Non-Negotiable Items Before Go-Live

| Group | Mandatory item | Notes |
| --- | --- | --- |
| Correctness | Idempotency with body-hash, scoped per tenant | Two-tier Redis + Postgres |
|  | Double-entry ledger, balanced-trigger invariant | Enforce sum = 0 at the DB |
|  | Saga orchestration for every flow with > 2 steps | Temporal or Dapr Workflows |
|  | Transactional outbox for every externally emitted event | SKIP LOCKED for parallel relays |
| Resilience | Circuit breaker on every call to the PSP | Polly v8, 50% fail / 30s threshold |
|  | Strict timeouts (no 100s defaults) | ≤30s for authorize, ≤5s for tokenize |
|  | No acquirer retry without acquirer-supported idem-key | Let recon handle it |
|  | Dead-letter queue for every consumer | Alert when depth exceeds threshold |
| Observability | OTel tracing across webhooks | W3C Trace Context headers |
|  | Dashboards for auth rate, capture latency, webhook lag | Sliced by PSP/country/BIN |
|  | Runbook for every recon exception type | Readable by accountants |
|  | PagerDuty alerts when mismatch > 10/day | Auto-ticketing |
| Security | Edge tokenisation — never touch the PAN | PCI scope drops to SAQ A |
|  | Webhook HMAC with tolerance ≤5 minutes | Anti-replay |
|  | Secrets in Key Vault / KMS, quarterly rotation | Never committed to appsettings |
|  | Audit log for every manual refund, retained ≥ 1 year | Immutable, WORM storage |
| Compliance & Go-live | Load test with failure injection (toxiproxy) | Test PSP 500/slow/partition |
|  | Monthly chaos drills (Redis down, DB failover) | Game-day script |
|  | Separation of duties between deploy and refund | ISO 27001 / SOC 2 |
|  | Per-PSP kill switch | Auto-fallback to a backup PSP |

## 13. Conclusion — Payments Is Where Engineering Discipline Meets Business Discipline

Every pattern in this article — idempotency, double-entry ledger, saga, outbox, at-least-once webhooks, reconciliation, 3DS state machine, payment-specific observability — exists because in payments, *one-in-a-million errors become real money flowing to the wrong place*. They aren't "optional best practice"; they are the minimum floor. Teams that skip any of them will pay the price through a painful incident eventually.

Good news: every technique in this article is well-tooled for the .NET 10 stack. Temporal SDK for sagas, Npgsql for serializable transactions, Polly v8 for resilience, OpenTelemetry for observability, and native .NET Stripe/Adyen SDKs for PSP integration. The engineering team has both the tools and public playbooks; what's left is the *discipline to put each layer in the right place*. That's where a senior engineer can create the most obvious business value: turning this complex set of patterns into a system that's simple to operate, easy to debug at 2am, and never surprises customers or accountants.

## 14. References

- [Stripe Engineering — Designing robust and predictable APIs with idempotency](https://stripe.com/blog/idempotency)
- [Stripe Docs — Payment Intents API and lifecycle states](https://docs.stripe.com/payments/payment-intents)
- [Chris Richardson — The Saga Pattern in microservices](https://microservices.io/patterns/data/saga.html)
- [Microservices.io — Transactional Outbox Pattern](https://microservices.io/patterns/data/transactional-outbox.html)
- [Temporal .NET SDK — workflow and activity API](https://docs.temporal.io/dotnet)
- [PCI SSC — PCI DSS v4.0.1 Requirements and Security Assessment Procedures](https://www.pcisecuritystandards.org/document_library/)
- [Adyen — 3D Secure 2 and Strong Customer Authentication](https://www.adyen.com/knowledge-hub/3d-secure-2)
- [Martin Fowler — Patterns of Distributed Systems](https://martinfowler.com/articles/patterns-of-distributed-systems/)
- [OpenTelemetry — .NET instrumentation](https://opentelemetry.io/docs/languages/net/)
- [Polly v8 — Resilience strategies for .NET](https://www.pollydocs.org/strategies/)


Disclaimer: The opinions expressed in this blog are solely my own and do not reflect the views or opinions of my employer or any affiliated organizations. The content provided is for informational and educational purposes only and should not be taken as professional advice. While I strive to provide accurate and up-to-date information, I make no warranties or guarantees about the completeness, reliability, or accuracy of the content. Readers are encouraged to verify the information and seek independent advice as needed. I disclaim any liability for decisions or actions taken based on the content of this blog.