Data Pipeline: From Batch to Streaming in Modern Data Systems
Posted on: 4/24/2026 6:14:55 AM
Table of contents
- What is a Data Pipeline?
- ETL vs ELT: Two Different Philosophies
- Batch Processing — Processing in Batches
- Stream Processing — Real-Time Processing
- Unified Batch + Streaming: The 2026 Trend
- Data Quality & Observability
- Modern Data Stack 2026
- Pipeline Design: Decision Framework
- Best Practices for Production Pipelines
- Conclusion
- References
In the modern data world, an efficient data processing system is the backbone of every business decision. From analyzing user behavior in real-time to compiling daily revenue reports — everything depends on the Data Pipeline. This article dives deep into modern Data Pipeline architecture, compares ETL vs ELT, Batch vs Streaming, and guides you to choose the right solution for each real-world scenario.
What is a Data Pipeline?
A Data Pipeline is an automated sequence of steps to move, transform, and load data from sources to destinations. A pipeline can be as simple as a nightly script copying data from a production database to a data warehouse, or as complex as a system processing millions of events per second from IoT sensors.
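At its simplest, that nightly copy can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: sqlite3 stands in for both the production database and the warehouse, and the table and column names are made up for the example.

```python
import sqlite3

def copy_orders(source: sqlite3.Connection, warehouse: sqlite3.Connection) -> int:
    """Copy all rows from the source 'orders' table into the warehouse.

    A real pipeline would copy only new rows (e.g. WHERE created_at > last_run)
    and write to a cloud warehouse; sqlite3 stands in for both systems here.
    """
    rows = source.execute("SELECT order_id, amount FROM orders").fetchall()
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id TEXT PRIMARY KEY, amount REAL)"
    )
    # INSERT OR REPLACE keyed on order_id makes reruns safe (no duplicates)
    warehouse.executemany(
        "INSERT OR REPLACE INTO orders (order_id, amount) VALUES (?, ?)", rows
    )
    warehouse.commit()
    return len(rows)

# Demo with in-memory databases
src, wh = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (order_id TEXT, amount REAL)")
src.executemany("INSERT INTO orders VALUES (?, ?)", [("o1", 10.0), ("o2", 25.5)])
copied = copy_orders(src, wh)
print(copied)  # 2
```

Everything beyond this toy, scheduling, retries, monitoring, scaling, is what the rest of this article is about.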
graph LR
A[Data Sources] --> B[Ingestion Layer]
B --> C[Processing Layer]
C --> D[Storage Layer]
D --> E[Serving Layer]
style A fill:#e94560,stroke:#fff,color:#fff
style B fill:#2c3e50,stroke:#fff,color:#fff
style C fill:#16213e,stroke:#fff,color:#fff
style D fill:#2c3e50,stroke:#fff,color:#fff
style E fill:#e94560,stroke:#fff,color:#fff
High-level Data Pipeline Architecture
ETL vs ELT: Two Different Philosophies
The difference between ETL (Extract-Transform-Load) and ELT (Extract-Load-Transform) isn't just about step ordering — it reflects two fundamentally different approaches to data processing.
ETL — The Traditional Approach
Extract → Transform → Load
Data is extracted from sources, transformed (filtered, normalized, aggregated) on an intermediate server, then loaded into the target data store. This model was dominant in the on-premise data warehouse era when compute and storage were expensive — you only wanted to store "clean" data.
graph LR
S1[MySQL] --> E[Extract]
S2[API] --> E
S3[CSV Files] --> E
E --> T[Transform Server]
T --> L[Load]
L --> DW[Data Warehouse]
style T fill:#e94560,stroke:#fff,color:#fff
style DW fill:#2c3e50,stroke:#fff,color:#fff
style E fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style L fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style S1 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
style S2 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
style S3 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
Traditional ETL processing flow
ELT — The Modern Approach
Extract → Load → Transform
Raw data is loaded directly into the storage system (data warehouse or lakehouse), then transformed in-place using the compute power of the destination system. This approach leverages the scale-out capabilities of cloud platforms like BigQuery, Snowflake, or Databricks.
graph LR
S1[MySQL] --> E[Extract]
S2[API] --> E
S3[CSV Files] --> E
E --> L[Load Raw Data]
L --> DW[Data Warehouse / Lakehouse]
DW --> T[Transform with dbt / SQL]
T --> DW
style DW fill:#e94560,stroke:#fff,color:#fff
style T fill:#2c3e50,stroke:#fff,color:#fff
style E fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style L fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style S1 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
style S2 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
style S3 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
Modern ELT flow — transform right inside the data warehouse
| Criteria | ETL | ELT |
|---|---|---|
| Transform Location | Intermediate server (staging) | Inside the data warehouse |
| Ingestion Speed | Slower (must transform first) | Faster (load first, transform later) |
| Flexibility | Hard to change transform logic | Easy — just rewrite SQL/dbt models |
| Compute Cost | Self-managed transform servers | Cloud DW compute (pay-per-query) |
| Raw Data | Lost after transformation | Always preserved |
| Popular Tools | Informatica, Talend, SSIS | dbt, Fivetran, Airbyte + Snowflake/BigQuery |
| Best For | On-premise, strict compliance | Cloud-native, modern analytics |
💡 2026 Trend
ELT has become the standard for most new analytics projects. With dbt (data build tool) as the de facto transformation layer, data teams can define business logic in SQL alone, while testing, documentation, and (via git) version control are built into the dbt workflow.
Batch Processing — Processing in Batches
Batch processing handles data in "batches" at scheduled intervals — hourly, daily, or weekly. It remains the backbone of most analytics systems due to its simplicity, low cost, and ability to handle large data volumes.
When to Use Batch?
- End-of-day or end-of-month revenue reports
- Periodic ML model training
- Data warehouse refresh
- Metrics aggregation (DAU, MAU, retention)
- Compliance reporting and audit logs
Typical Batch Pipeline Architecture
graph TD
subgraph Sources
DB[(Production DB)]
API[External APIs]
Files[Log Files / CSV]
end
subgraph Orchestration
AF[Apache Airflow / Dagster]
end
subgraph Processing
SP[Apache Spark]
DBT[dbt Models]
end
subgraph Storage
DL[(Data Lake - S3/GCS)]
DW[(Data Warehouse)]
end
subgraph Serving
BI[BI Dashboard]
ML[ML Training]
end
DB --> AF
API --> AF
Files --> AF
AF --> SP
SP --> DL
DL --> DBT
DBT --> DW
DW --> BI
DW --> ML
style AF fill:#e94560,stroke:#fff,color:#fff
style SP fill:#2c3e50,stroke:#fff,color:#fff
style DBT fill:#16213e,stroke:#fff,color:#fff
style DW fill:#e94560,stroke:#fff,color:#fff
style DL fill:#2c3e50,stroke:#fff,color:#fff
Batch Pipeline Architecture with Airflow + Spark + dbt
Example: Revenue Aggregation Pipeline with dbt
-- models/marts/daily_revenue.sql
WITH orders AS (
    SELECT
        DATE(created_at) AS order_date,
        product_id,
        quantity,
        unit_price,
        quantity * unit_price AS line_total,
        discount_amount
    FROM {{ ref('stg_orders') }}
    WHERE status = 'completed'
),

daily_summary AS (
    SELECT
        order_date,
        COUNT(DISTINCT product_id) AS unique_products,
        SUM(quantity) AS total_items,
        SUM(line_total) AS gross_revenue,
        SUM(discount_amount) AS total_discounts,
        SUM(line_total) - SUM(discount_amount) AS net_revenue
    FROM orders
    GROUP BY order_date
)

SELECT
    order_date,
    unique_products,
    total_items,
    gross_revenue,
    total_discounts,
    net_revenue,
    AVG(net_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_7d_avg
FROM daily_summary
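The 7-day rolling average in the model above can be sanity-checked locally with pandas: `rolling(window=7, min_periods=1)` matches the SQL frame `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW`. The revenue figures below are toy numbers.

```python
import pandas as pd

# Daily net revenue for one week (toy numbers)
daily = pd.DataFrame({
    "order_date": pd.date_range("2026-01-01", periods=7),
    "net_revenue": [100.0, 200.0, 150.0, 300.0, 250.0, 100.0, 300.0],
})

# Same semantics as the SQL window frame:
# ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
daily["rolling_7d_avg"] = daily["net_revenue"].rolling(window=7, min_periods=1).mean()
print(daily["rolling_7d_avg"].iloc[-1])  # 200.0 — mean of all 7 days
```

Unit-testing window logic this way, before shipping the SQL, catches off-by-one frame errors cheaply.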
Orchestrator: Airflow vs Dagster vs Prefect
Apache Airflow remains the industry standard with the largest community. Dagster stands out with its "software-defined assets" concept — you define data assets instead of tasks, better suited for data-centric workflows. Prefect simplifies configuration with a "code-first" approach — no separate DAG definition needed.
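Whichever orchestrator you pick, the core abstraction is the same: a set of tasks plus a dependency graph, executed in topological order. The toy runner below (plain Python, using the standard-library `graphlib`) shows the idea that Airflow, Dagster, and Prefect implement with scheduling, retries, and observability on top; the extract/transform/load bodies are stand-ins.

```python
from graphlib import TopologicalSorter

# Task bodies — stand-ins for real extract/transform/load operators
results = {}
tasks = {
    "extract":   lambda: results.setdefault("extract", [1, 2, 3]),
    "transform": lambda: results.setdefault("transform", [x * 2 for x in results["extract"]]),
    "load":      lambda: results.setdefault("load", sum(results["transform"])),
}

# Dependencies: load depends on transform, transform on extract —
# this mapping is what a DAG file declares
deps = {"transform": {"extract"}, "load": {"transform"}}

# Run tasks in dependency order (extract → transform → load)
for name in TopologicalSorter(deps).static_order():
    tasks[name]()
print(results["load"])  # 12
```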
Stream Processing — Real-Time Processing
Stream Processing handles data continuously as it's generated, with latency from milliseconds to seconds. It's mandatory for use cases requiring immediate response.
When to Use Streaming?
- Real-time financial fraud detection
- Infrastructure monitoring & alerting
- Personalization — product recommendations as users click
- IoT sensor data processing
- Real-time dashboards and operational analytics
Streaming Pipeline Architecture
graph LR
subgraph Producers
App[Application Events]
IoT[IoT Devices]
CDC[Change Data Capture]
end
subgraph Message Broker
K[Apache Kafka / Redpanda]
end
subgraph Stream Processor
F[Apache Flink]
end
subgraph Sinks
RT[(Real-time Store)]
DL[(Data Lake)]
Alert[Alerting System]
end
App --> K
IoT --> K
CDC --> K
K --> F
F --> RT
F --> DL
F --> Alert
style K fill:#e94560,stroke:#fff,color:#fff
style F fill:#2c3e50,stroke:#fff,color:#fff
style RT fill:#16213e,stroke:#fff,color:#fff
Streaming Pipeline Architecture with Kafka + Flink
Example: Fraud Detection with Apache Flink
// Detect anomalous transactions: 3 consecutive transactions > $1000
// from the same user within 5 minutes
DataStream<Transaction> transactions = env
    .addSource(new FlinkKafkaConsumer<>(
        "transactions",
        new TransactionDeserializer(),
        kafkaProps
    ));

// times(3).consecutive() is equivalent to chaining three next() steps;
// drop consecutive() to allow unrelated transactions in between
Pattern<Transaction, ?> fraudPattern = Pattern
    .<Transaction>begin("large")
    .where(new SimpleCondition<Transaction>() {
        @Override
        public boolean filter(Transaction t) {
            return t.getAmount() > 1000;
        }
    })
    .times(3).consecutive()
    .within(Time.minutes(5));

// keyBy ensures the pattern is matched per user, not globally
PatternStream<Transaction> patternStream = CEP.pattern(
    transactions.keyBy(Transaction::getUserId),
    fraudPattern
);

patternStream.select(
    (Map<String, List<Transaction>> pattern) -> new FraudAlert(
        pattern.get("large").get(0).getUserId(),
        "3 large transactions in 5 minutes"
    )
).addSink(new AlertSink());
Apache Kafka vs Redpanda vs Amazon Kinesis
| Criteria | Apache Kafka | Redpanda | Amazon Kinesis |
|---|---|---|---|
| Architecture | JVM-based, needs ZooKeeper/KRaft | C++, no ZooKeeper needed | Fully managed (AWS) |
| Latency | ~5-10ms (p99) | ~1-2ms (p99) | ~70-200ms |
| Throughput | Very high (millions/sec) | Comparable to Kafka | Lower (1MB/s per shard) |
| Ops Complexity | High (cluster management) | Lower (single binary) | No management needed |
| Cost | Self-host or Confluent Cloud | Self-host or Redpanda Cloud | Pay-per-shard + data |
| Kafka API Compatible | ✅ Native | ✅ 100% compatible | ❌ Proprietary API |
Unified Batch + Streaming: The 2026 Trend
One of the biggest pain points in data engineering has been maintaining two separate systems for batch and streaming — two codebases, two skill sets, two CI/CD pipelines. The 2026 trend is to unify both under a single framework.
graph TD
subgraph "Shift-Left Architecture"
S[Data Sources] --> SK[Streaming Layer - Kafka + Flink]
SK --> |Real-time| RT[Real-time Analytics]
SK --> |Persist| LH[Lakehouse - Iceberg/Delta]
LH --> |Batch Transform| DBT[dbt Models]
DBT --> DW[Serving Layer]
end
style SK fill:#e94560,stroke:#fff,color:#fff
style LH fill:#2c3e50,stroke:#fff,color:#fff
style DBT fill:#16213e,stroke:#fff,color:#fff
style DW fill:#e94560,stroke:#fff,color:#fff
style S fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style RT fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
Shift-Left Architecture — streaming is the first layer to process data
Kappa Architecture vs Lambda Architecture
Lambda Architecture (Traditional)
Maintains two parallel paths: a batch layer (for accuracy) and a speed layer (for real-time). Results are merged at the serving layer. Drawback: you must write and maintain processing logic in two places — prone to drift and hard to debug.
Kappa Architecture (Modern)
Uses a single streaming layer for both real-time and batch. Batch processing = replaying the event log. Much simpler but requires a message broker with long-term storage capability (Kafka with infinite retention or tiered storage).
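The core idea, one processing path for both live and replayed data, fits in a few lines. In this sketch a Python list stands in for a Kafka topic with long retention; reprocessing is simply running the same function over the log again from offset zero.

```python
# A list stands in for a Kafka topic with long retention
event_log = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 5},
    {"user": "a", "amount": 7},
]

def process(events):
    """The single processing path used for both live and replayed events."""
    totals = {}
    for e in events:
        totals[e["user"]] = totals.get(e["user"], 0) + e["amount"]
    return totals

live = process(event_log)      # real-time consumption
replayed = process(event_log)  # "batch" = replay from offset 0
assert live == replayed        # one codebase, identical results
print(live)  # {'a': 17, 'b': 5}
```

Contrast with Lambda, where the batch and speed layers would each implement this aggregation separately and could silently diverge.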
graph TD
subgraph "Lambda Architecture"
D1[Data] --> BL[Batch Layer - Spark]
D1 --> SL[Speed Layer - Flink]
BL --> SV1[Serving Layer]
SL --> SV1
end
subgraph "Kappa Architecture"
D2[Data] --> ST[Stream Layer - Flink/Kafka Streams]
ST --> SV2[Serving Layer]
ST --> |Replay for reprocessing| ST
end
style BL fill:#2c3e50,stroke:#fff,color:#fff
style SL fill:#e94560,stroke:#fff,color:#fff
style ST fill:#e94560,stroke:#fff,color:#fff
style SV1 fill:#16213e,stroke:#fff,color:#fff
style SV2 fill:#16213e,stroke:#fff,color:#fff
Lambda vs Kappa Architecture
dbt + Apache Flink: One Language for Both Worlds
In 2026, dbt officially supports Apache Flink as an adapter — meaning you can write dbt models (SQL) to define both batch and streaming transformations. This is a crucial milestone toward unifying the data engineering toolchain.
# dbt_project.yml
models:
  my_project:
    staging:
      +materialized: table            # Batch: create static tables
    real_time:
      +materialized: streaming_table  # Streaming: Flink streaming table
    marts:
      +materialized: incremental      # Incremental: append-only
-- models/real_time/rt_user_sessions.sql
-- Materialized as a streaming table on Flink
{{ config(materialized='streaming_table') }}

SELECT
    user_id,
    session_id,
    TUMBLE_START(event_time, INTERVAL '5' MINUTE) AS window_start,
    COUNT(*) AS event_count,
    MAX(event_time) AS last_activity,
    COLLECT(DISTINCT page_url) AS pages_visited
FROM {{ source('kafka', 'user_events') }}
GROUP BY
    user_id,
    session_id,
    TUMBLE(event_time, INTERVAL '5' MINUTE)
Data Quality & Observability
The best pipeline is meaningless if the data inside it isn't trustworthy. Data quality and observability are two essential pillars.
Data Quality Checks with dbt Tests
# models/staging/schema.yml
version: 2
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: amount
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 1000000
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'completed', 'cancelled', 'refunded']
      - name: created_at
        tests:
          - not_null
          - dbt_expectations.expect_row_values_to_have_recent_data:
              datepart: day
              interval: 3
Data Contracts — Agreements Between Producers and Consumers
What is a Data Contract?
A Data Contract is a formal agreement between the data-producing team (producer) and data-consuming team (consumer) about the schema, quality, SLA, and ownership of a dataset. It prevents silent "breaking changes" — like when the backend team renames a column without telling the analytics team.
# contracts/orders_v2.yaml
apiVersion: v2
kind: DataContract
metadata:
  name: orders
  owner: backend-team
  domain: e-commerce
schema:
  type: object
  properties:
    order_id:
      type: string
      format: uuid
      required: true
    user_id:
      type: string
      required: true
    amount:
      type: number
      minimum: 0
    currency:
      type: string
      enum: [VND, USD, EUR]
quality:
  - type: freshness
    threshold: "PT1H"    # Data must not be older than 1 hour
  - type: completeness
    column: amount
    threshold: 0.99      # 99% of rows must have values
sla:
  availability: 99.9%
  latency: "PT5M"        # Max 5 minutes from source to warehouse
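Enforcing such a contract at ingestion time can be as simple as a validation function. The sketch below hand-codes a few of the rules from the YAML above (required fields, minimum, enum) rather than parsing the contract file; `SCHEMA` and `validate` are illustrative names, and real deployments would use a schema registry or a tool like Great Expectations.

```python
# Rules mirroring the orders contract (a real setup would parse the YAML)
SCHEMA = {
    "order_id": {"type": str, "required": True},
    "user_id":  {"type": str, "required": True},
    "amount":   {"type": (int, float), "minimum": 0},
    "currency": {"type": str, "enum": {"VND", "USD", "EUR"}},
}

def validate(record: dict) -> list[str]:
    """Return a list of contract violations (empty list = record passes)."""
    errors = []
    for field, rules in SCHEMA.items():
        if field not in record:
            if rules.get("required"):
                errors.append(f"{field}: missing required field")
            continue
        value = record[field]
        if not isinstance(value, rules["type"]):
            errors.append(f"{field}: wrong type {type(value).__name__}")
        elif "minimum" in rules and value < rules["minimum"]:
            errors.append(f"{field}: below minimum {rules['minimum']}")
        elif "enum" in rules and value not in rules["enum"]:
            errors.append(f"{field}: not in {sorted(rules['enum'])}")
    return errors

ok = validate({"order_id": "o1", "user_id": "u1", "amount": 9.5, "currency": "USD"})
bad = validate({"user_id": "u1", "amount": -3, "currency": "GBP"})
print(len(ok), len(bad))  # 0 3
```

Rejecting (or quarantining) violating records at the boundary is what turns the contract from documentation into enforcement.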
Modern Data Stack 2026
Here's the most widely used toolkit in modern data pipeline architecture:
graph TD
subgraph "Ingestion"
FV[Fivetran / Airbyte]
DB[Debezium CDC]
end
subgraph "Storage"
S3[Object Storage - S3/GCS/R2]
ICE[Apache Iceberg / Delta Lake]
end
subgraph "Processing"
SPK[Apache Spark]
FLK[Apache Flink]
DBTL[dbt Core / dbt Cloud]
end
subgraph "Orchestration"
DAG[Dagster / Airflow / Prefect]
end
subgraph "Serving"
SNF[Snowflake / BigQuery / Databricks]
RS[Redis / Druid - Low Latency]
end
subgraph "Observability"
MC[Monte Carlo / Elementary]
GE[Great Expectations]
end
FV --> S3
DB --> S3
S3 --> ICE
ICE --> SPK
ICE --> FLK
SPK --> DBTL
FLK --> DBTL
DAG --> SPK
DAG --> FLK
DAG --> DBTL
DBTL --> SNF
DBTL --> RS
MC --> DBTL
GE --> DBTL
style FV fill:#e94560,stroke:#fff,color:#fff
style DBTL fill:#e94560,stroke:#fff,color:#fff
style SNF fill:#2c3e50,stroke:#fff,color:#fff
style ICE fill:#16213e,stroke:#fff,color:#fff
style DAG fill:#2c3e50,stroke:#fff,color:#fff
Modern Data Stack 2026 — from ingestion to serving
Real-World Costs
| Component | Free / Low-Cost Solution | Enterprise Solution |
|---|---|---|
| Ingestion | Airbyte OSS (self-host free) | Fivetran ($1/credit) |
| Storage | MinIO + Iceberg (self-host) | Snowflake / Databricks |
| Transform | dbt Core (free, CLI) | dbt Cloud ($100/seat/mo) |
| Orchestration | Dagster OSS / Airflow | Dagster Cloud / Astronomer |
| Message Broker | Redpanda Community (free) | Confluent Cloud |
| Observability | Elementary (dbt native, free) | Monte Carlo |
⚠️ Common Pitfalls
Over-engineering: Not every team needs Kafka + Flink + Spark. If your data is under 100GB/day and latency requirement is > 1 hour, a simple batch pipeline with Airbyte + dbt + PostgreSQL/BigQuery is enough. Streaming pipelines only add value when you genuinely need real-time processing.
Pipeline Design: Decision Framework
When designing a data pipeline, answer these 4 questions to choose the right architecture:
graph TD
Q1{How fast must data
be processed?}
Q1 -->|Under 1 minute| STREAM[Streaming Pipeline]
Q1 -->|1 min - 1 hour| MICRO[Micro-Batch]
Q1 -->|Over 1 hour| BATCH[Batch Pipeline]
STREAM --> Q2{Volume?}
Q2 -->|>100K events/sec| KAFKA[Kafka + Flink]
Q2 -->|<100K events/sec| MANAGED[Kinesis / Pub-Sub]
BATCH --> Q3{Daily data size?}
Q3 -->|>1TB| SPARK[Spark + Iceberg]
Q3 -->|<1TB| SIMPLE[dbt + DW]
MICRO --> Q4{Complexity?}
Q4 -->|Simple| SSTREAM[Spark Structured Streaming]
Q4 -->|Complex CEP| FLINK2[Apache Flink]
style Q1 fill:#e94560,stroke:#fff,color:#fff
style STREAM fill:#2c3e50,stroke:#fff,color:#fff
style BATCH fill:#2c3e50,stroke:#fff,color:#fff
style MICRO fill:#2c3e50,stroke:#fff,color:#fff
style KAFKA fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style MANAGED fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style SPARK fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style SIMPLE fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style SSTREAM fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style FLINK2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
Decision tree for choosing Data Pipeline architecture
Best Practices for Production Pipelines
1. Idempotency — Rerun Without Duplicates
Every pipeline must be idempotent — rerunning with the same input produces the same output. Use MERGE/UPSERT instead of INSERT, and partition data by date so you can reprocess individual partitions independently.
-- Idempotent load pattern
MERGE INTO fact_orders AS target
USING staging_orders AS source
    ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET
    amount = source.amount,
    status = source.status,
    updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (order_id, amount, status, created_at)
    VALUES (source.order_id, source.amount, source.status, source.created_at);
2. Schema Evolution — Change Schema Without Breaking Pipelines
Use Apache Iceberg or Delta Lake for automatic schema evolution — add new columns without rewriting all existing data. Combine with Data Contracts to ensure backward compatibility.
3. Backfill Strategy — Reprocess Historical Data
Always design pipelines with backfill capability — the ability to reprocess data from a point in the past. Partition by day/month and parameterize date ranges in DAG configuration.
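A backfill is then just the daily job with its date parameter swept over a range. A minimal sketch, assuming the partition processor takes a single date (names are illustrative):

```python
from datetime import date, timedelta

def backfill(process_partition, start: date, end: date) -> list[date]:
    """Re-run one partition per day over [start, end] inclusive — the same
    function the scheduled pipeline runs for 'yesterday', just parameterized."""
    processed = []
    current = start
    while current <= end:
        process_partition(current)  # e.g. a Spark/dbt run filtered to this date
        processed.append(current)
        current += timedelta(days=1)
    return processed

days = backfill(lambda d: None, date(2026, 1, 1), date(2026, 1, 5))
print(len(days))  # 5
```

Because each partition is processed independently (and idempotently, per practice 1), a failed backfill can resume from the last successful date instead of restarting.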
4. Monitoring & Alerting
- Pipeline health: Run duration, success/failure rate, data freshness
- Data quality: Row count anomalies, null rate, schema drift
- Cost tracking: Compute hours, storage growth, query cost
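Two of these checks are easy to express directly. The sketch below shows a freshness check and a crude row-count anomaly detector; the 50% tolerance is an arbitrary example threshold, and production systems would use the observability tools above instead.

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_loaded: datetime, max_age: timedelta) -> bool:
    """True if the newest row is recent enough (the 'data freshness' metric)."""
    return datetime.now(timezone.utc) - last_loaded <= max_age

def check_row_count(today: int, history: list[int], tolerance: float = 0.5) -> bool:
    """Flag runs whose row count deviates more than `tolerance` (50% by
    default) from the historical average — a crude anomaly detector."""
    avg = sum(history) / len(history)
    return abs(today - avg) <= tolerance * avg

# Data loaded 3 hours ago against a 1-hour freshness SLA → stale
stale_ok = check_freshness(datetime.now(timezone.utc) - timedelta(hours=3),
                           timedelta(hours=1))
# 980 rows vs a ~1005-row historical average → within tolerance
volume_ok = check_row_count(980, [1000, 1020, 990, 1010])
print(stale_ok, volume_ok)  # False True
```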
Conclusion
Data Pipeline architecture in 2026 converges on three key trends: ELT replacing ETL thanks to cloud warehouse compute power, Unified Batch + Streaming with dbt + Flink blurring the line between both worlds, and Shift-Left Architecture pushing transformation, quality checks, and governance into the streaming layer from the start. The key to success isn't choosing the "hottest" technology, but understanding your specific requirements for latency, volume, and complexity — then choosing the simplest solution that meets those requirements.
💡 Advice
Start with a simple batch pipeline (Airbyte + dbt + BigQuery/PostgreSQL). Only add streaming when you have a genuine real-time use case. Premature complexity is the biggest enemy of data engineering.
References
- dbt Meets Apache Flink: One Workflow for Data Engineers (2026)
- Data Pipeline Architecture: ETL, ELT, and Streaming Patterns — Calmops
- Data Pipeline Fundamentals: Batch, Streaming, and Idempotent Design
- Data Pipeline Architecture: Complete 2026 Guide
Disclaimer: The opinions expressed in this blog are solely my own and do not reflect the views or opinions of my employer or any affiliated organizations. The content provided is for informational and educational purposes only and should not be taken as professional advice. While I strive to provide accurate and up-to-date information, I make no warranties or guarantees about the completeness, reliability, or accuracy of the content. Readers are encouraged to verify the information and seek independent advice as needed. I disclaim any liability for decisions or actions taken based on the content of this blog.