Grafana LGTM Stack — Build a Free Observability Platform for Production

Posted on: 4/18/2026 8:11:46 AM

Are your applications running in production while, every time an incident happens, you SSH into the server and grep through logs by hand? Or worse, do you only learn which service is slow when customers complain? The Grafana LGTM Stack — a completely free, open-source observability toolkit — solves this problem by unifying Logs, Metrics, Traces, and Profiles in a single platform.

  • 100% open source — self-hosted, no vendor lock-in
  • 65% MTTR reduction vs. traditional monitoring
  • 10M+ metrics/second handled by Mimir
  • 4 signals — logs, metrics, traces, profiles

1. What Is the LGTM Stack?

LGTM stands for four core components developed by Grafana Labs:

| Component | Role | Commercial equivalent |
| --- | --- | --- |
| Loki | Log aggregation — collect, store, and query logs | Splunk, Datadog Logs |
| Grafana | Visualization — dashboards, alerting, explore | Datadog Dashboards, Kibana |
| Tempo | Distributed tracing — follow requests across services | Jaeger, Datadog APM |
| Mimir | Metrics storage — store Prometheus metrics long-term | Thanos, Cortex, Datadog Metrics |

Beyond these four, the stack also includes Grafana Alloy — a unified collector that replaces Promtail, Grafana Agent, and the OpenTelemetry Collector, acting as the stack's single collection agent that gathers every telemetry signal from your applications.

Why not use the ELK Stack?

ELK (Elasticsearch + Logstash + Kibana) fully indexes log content, which demands substantial RAM and disk. Loki indexes only labels (metadata) and stores the log content compressed, yielding roughly 10-50× storage savings. For small and mid-sized systems, the LGTM stack runs comfortably on a single 4 CPU / 8GB RAM server.

2. The Overall LGTM Stack Architecture

Understanding the architecture tells you where data comes from and where it goes — so when incidents happen, you know which component to check.

graph LR
    subgraph Applications
        A1["ASP.NET Core API"]
        A2["Vue.js Frontend"]
        A3["Background Worker"]
    end

    subgraph "Grafana Alloy (Collector)"
        C1["OTLP Receiver"]
        C2["Prometheus Scraper"]
        C3["Log Pipeline"]
    end

    subgraph "Storage Backends"
        M["Mimir<br/>Metrics"]
        L["Loki<br/>Logs"]
        T["Tempo<br/>Traces"]
    end

    G["Grafana<br/>Dashboard + Alerting"]

    A1 -->|OTLP gRPC| C1
    A2 -->|OTLP HTTP| C1
    A3 -->|OTLP gRPC| C1
    A1 -->|metrics endpoint| C2
    C1 --> M
    C1 --> T
    C2 --> M
    C3 --> L
    M --> G
    L --> G
    T --> G

    style A1 fill:#e94560,stroke:#fff,color:#fff
    style A2 fill:#e94560,stroke:#fff,color:#fff
    style A3 fill:#e94560,stroke:#fff,color:#fff
    style C1 fill:#2c3e50,stroke:#fff,color:#fff
    style C2 fill:#2c3e50,stroke:#fff,color:#fff
    style C3 fill:#2c3e50,stroke:#fff,color:#fff
    style M fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style L fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style T fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style G fill:#4CAF50,stroke:#fff,color:#fff
LGTM Stack architecture — data flows from apps via Alloy to the storage backends, while Grafana queries all of them

3. Grafana Alloy — The Unified Collector

Previously you needed to run Promtail separately (for logs), Grafana Agent (for metrics), and the OpenTelemetry Collector (for traces). Grafana Alloy unifies all three into a single binary configured with a declarative, component-based language (the Alloy configuration syntax, formerly known as River).

What does Alloy replace?

Promtail → Alloy loki pipeline · Grafana Agent → Alloy prometheus pipeline · OTel Collector → Alloy otelcol pipeline. One process, one config, one place to debug.

Example Alloy config that receives OTLP from a .NET application and forwards it to Loki + Tempo + Mimir:

// Receive telemetry via OTLP
otelcol.receiver.otlp "default" {
  grpc { endpoint = "0.0.0.0:4317" }
  http { endpoint = "0.0.0.0:4318" }

  output {
    metrics = [otelcol.processor.batch.default.input]
    logs    = [otelcol.processor.batch.default.input]
    traces  = [otelcol.processor.batch.default.input]
  }
}

// Batch to reduce network overhead
otelcol.processor.batch "default" {
  output {
    metrics = [otelcol.exporter.prometheus.mimir.input]
    logs    = [otelcol.exporter.loki.default.input]
    traces  = [otelcol.exporter.otlp.tempo.input]
  }
}

// Export metrics to Mimir
otelcol.exporter.prometheus "mimir" {
  forward_to = [prometheus.remote_write.mimir.receiver]
}

prometheus.remote_write "mimir" {
  endpoint {
    url = "http://mimir:9009/api/v1/push"
  }
}

// Export logs to Loki
otelcol.exporter.loki "default" {
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}

// Export traces to Tempo
otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo:4317"
    tls { insecure = true }
  }
}

The key point

Alloy uses a component-based model: each block is a component with inputs/outputs, connected to each other via forward_to or output. You can insert processors (filter, transform, sample) in the middle of a pipeline without changing the receiver or exporter.
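As a sketch of that model, a probabilistic sampler could be spliced between the batch processor and the Tempo exporter from the earlier config without touching either end — only the batch processor's traces output changes (the component name and sampling percentage below are illustrative):

```river
// Re-point the batch processor's traces at a new sampler component
otelcol.processor.batch "default" {
  output {
    metrics = [otelcol.exporter.prometheus.mimir.input]
    logs    = [otelcol.exporter.loki.default.input]
    traces  = [otelcol.processor.probabilistic_sampler.default.input]
  }
}

// New middle stage: keep ~20% of traces (illustrative value),
// then forward to the unchanged Tempo exporter
otelcol.processor.probabilistic_sampler "default" {
  sampling_percentage = 20
  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}
```

The receiver and exporter components are untouched — only the wiring between them changed.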

4. Loki — Economical Log Aggregation

Loki is the heart of log collection in the LGTM Stack. Unlike Elasticsearch (full-text indexing), Loki only indexes labels (e.g., {app="api", env="production"}) and stores log content compressed. That gives you:

  • 10-50× cheaper storage than Elasticsearch for the same log volume
  • Simpler operations — no JVM heap tuning, no shard rebalancing
  • Natural integration with Prometheus labels — same label set for metrics and logs
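The label-centric model is visible in Loki's push API: each entry belongs to a stream identified solely by its label set, and the log line itself is an opaque string. A minimal Python sketch of the JSON payload shape for the standard /loki/api/v1/push endpoint (the labels and log lines are illustrative):

```python
import json
import time

def build_loki_push_payload(labels: dict, lines: list) -> dict:
    """Build a Loki push payload: one stream keyed by its label set.

    Loki indexes only the labels; each line is stored as an opaque,
    compressed string paired with a nanosecond-precision timestamp.
    """
    ts_ns = str(time.time_ns())
    return {
        "streams": [
            {
                "stream": labels,  # the only part Loki indexes
                "values": [[ts_ns, line] for line in lines],
            }
        ]
    }

payload = build_loki_push_payload(
    {"app": "api", "env": "production"},
    ["GET /orders 200 12ms", "GET /orders 500 2300ms error"],
)
body = json.dumps(payload)  # POST this to http://loki:3100/loki/api/v1/push
```

Because the label set is tiny and stable, the index stays tiny too — the heavy content lives only in compressed chunks.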

LogQL — The Log Query Language

LogQL is inspired by PromQL, using label selectors combined with filter expressions:

// Find error logs for the api service over the last hour
{app="api", env="production"} |= "error" | json | status_code >= 500

// Failed-request rate per endpoint, over 5-minute windows
sum by (endpoint) (rate({app="api"} |= "HTTP" | json | status_code >= 500 [5m]))

// Calculate P99 response time from logs
quantile_over_time(0.99, {app="api"} | json | unwrap duration_ms [5m])

// Pattern matching — detect log format automatically
{app="api"} | pattern "<ip> - <method> <path> <status> <duration>ms"
  | status >= 500

Bloom Filters in Loki 3.x

Loki 3.0+ supports Bloom filters to speed up filter queries. Instead of scanning all chunks, Loki checks the Bloom filter first to quickly skip chunks that don't contain the searched keyword — significantly reducing I/O for queries like |= "OutOfMemoryException" over large datasets.

Structured Metadata

From Loki 3.0, you can attach structured metadata to log entries without turning them into labels (which would explode cardinality). Examples: trace_id, user_id, request_id — filterable but they don't create new series.

// Query logs by trace_id from structured metadata
{app="api"} | trace_id = "abc123def456"

5. Mimir — Large-Scale Metrics Storage

Prometheus is great at scraping metrics, but it has two major limitations at scale:

  1. Single-node storage — local TSDB doesn't scale horizontally
  2. Short retention — usually 15-30 days due to disk

Mimir solves both by becoming remote storage for Prometheus, supporting multi-tenancy and long-term retention on object storage (S3, MinIO, Azure Blob).
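If you already run Prometheus, pointing it at Mimir is a single remote_write block. A sketch (the URL matches the Mimir service used elsewhere in this article; the tenant ID is illustrative):

```yaml
# prometheus.yml — forward all scraped metrics to Mimir for long-term storage
remote_write:
  - url: http://mimir:9009/api/v1/push
    headers:
      X-Scope-OrgID: my-team   # Mimir tenant ID (illustrative)
```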

graph TD
    P1["Prometheus / Alloy"] -->|remote_write| D["Distributor"]
    D --> I1["Ingester 1"]
    D --> I2["Ingester 2"]
    D --> I3["Ingester 3"]
    I1 --> S["Object Storage<br/>S3 / MinIO / Azure Blob"]
    I2 --> S
    I3 --> S

    QF["Query Frontend"] --> Q["Querier"]
    Q --> I1
    Q --> I2
    Q --> I3
    Q --> S
    G["Grafana"] --> QF

    style P1 fill:#e94560,stroke:#fff,color:#fff
    style D fill:#e94560,stroke:#fff,color:#fff
    style I1 fill:#2c3e50,stroke:#fff,color:#fff
    style I2 fill:#2c3e50,stroke:#fff,color:#fff
    style I3 fill:#2c3e50,stroke:#fff,color:#fff
    style S fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style QF fill:#4CAF50,stroke:#fff,color:#fff
    style Q fill:#4CAF50,stroke:#fff,color:#fff
    style G fill:#4CAF50,stroke:#fff,color:#fff
Mimir architecture — Distributor routes metrics to Ingesters; long-term data lands on Object Storage
| Feature | Prometheus (standalone) | Mimir |
| --- | --- | --- |
| Horizontal scaling | No | Yes — sharding by tenant/series |
| Long-term retention | 15-30 days (local disk) | Unlimited (object storage) |
| Multi-tenancy | No | Yes — isolates data across teams |
| High availability | Needs Thanos sidecar | Built-in replication |
| Query performance | Degrades with data size | Query splitting + caching |
| Storage cost | Expensive SSD | Cheap object storage |

6. Tempo — Index-Free Distributed Tracing

When a request passes through 5 services, you want to know: which service is slow? Where did the error happen? Tempo answers that by storing distributed traces at very low cost.

Unlike Jaeger (needs Elasticsearch/Cassandra), Tempo only needs object storage. It doesn't index traces — it stores them by trace ID. To find a trace, you use:

  • TraceQL — a dedicated query language for traces
  • Metrics-to-traces — from a dashboard spike, click to see example traces
  • Logs-to-traces — from a log line with a trace_id, jump to Tempo to see the full trace

TraceQL — Query Traces Like a Database

// Find traces with an error span in the "order-api" service
{ resource.service.name = "order-api" && status = error }

// Traces with duration > 2 seconds
{ duration > 2s }

// Traces that pass through both order-api and payment-service
{ resource.service.name = "order-api" } >> { resource.service.name = "payment-service" }

// Spans with a specific attribute
{ span.http.status_code >= 500 && span.http.method = "POST" }

Exemplars — The Bridge Between Metrics ↔ Traces

When Prometheus/Mimir collects metrics, it can attach an exemplar — a sample trace ID for each data point. In Grafana, when you see P99 latency suddenly spike, clicking the exemplar jumps straight to the specific trace that caused that spike. This is a killer feature of running a unified LGTM stack.
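In the OpenMetrics exposition format, an exemplar rides along after a # on the sample it annotates — everything after the hash links that data point to one concrete trace (the metric name, trace ID, and values below are illustrative):

```text
# A histogram bucket with an attached exemplar (OpenMetrics format):
# <metric> <value> # {<exemplar labels>} <observed value> <timestamp>
http_request_duration_seconds_bucket{le="0.5"} 1027 # {trace_id="abc123def456"} 0.43 1708000000.0
```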

7. Grafana — Dashboards, Alerting, and Correlation

Grafana is the visualization layer that stitches everything together. Version 12.x brings many important improvements:

Grafana 12 highlights

  • Git Sync — manage dashboards as code, version-controlled through Git
  • Explore Logs — auto-detects patterns in logs, no query writing required
  • Traces to Profiles — drill from a slow span straight into flame graphs to see which functions consume CPU
  • Adaptive dashboards — layouts adjust automatically based on the data

Correlation — The Power of a Unified Stack

The biggest advantage of the LGTM stack is the ability to correlate the three signals:

graph LR
    M["📊 Metrics<br/>CPU spike at 14:05"] -->|exemplar trace_id| T["🔍 Traces<br/>3.2s slow span in payment-service"]
    T -->|trace_id in log| L["📝 Logs<br/>TimeoutException connecting to DB"]
    L -->|label match| M

    style M fill:#e94560,stroke:#fff,color:#fff
    style T fill:#2c3e50,stroke:#fff,color:#fff
    style L fill:#4CAF50,stroke:#fff,color:#fff
Correlation loop: Metrics → Traces → Logs → back to Metrics. Debug incidents in minutes instead of hours

A typical incident-debug workflow:

  1. Alert fires "P99 latency > 2s" on a Grafana dashboard
  2. Click the metric panel → view the exemplar trace ID
  3. Open the trace in Tempo → see a db.query span taking 2.8s
  4. Click the trace_id → Loki shows: Connection pool exhausted, waiting 2.5s
  5. Root cause: connection pool is too small → increase MaxPoolSize → deploy fix

8. Deploying the LGTM Stack with Docker Compose

Below is a Docker Compose configuration suited to a medium-sized system (10-50 services, ~100GB logs/month):

version: "3.8"

services:
  # --- Grafana ---
  grafana:
    image: grafana/grafana:12.4.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your-secure-password
      - GF_FEATURE_TOGGLES_ENABLE=traceToMetrics,traceToLogs
    volumes:
      - grafana-data:/var/lib/grafana
      - ./provisioning:/etc/grafana/provisioning
    depends_on: [loki, mimir, tempo]

  # --- Loki (Log Storage) ---
  loki:
    image: grafana/loki:3.4.0
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/config.yaml
    volumes:
      - ./config/loki.yaml:/etc/loki/config.yaml
      - loki-data:/loki

  # --- Mimir (Metrics Storage) ---
  mimir:
    image: grafana/mimir:2.15.0
    ports:
      - "9009:9009"
    command: -config.file=/etc/mimir/config.yaml
    volumes:
      - ./config/mimir.yaml:/etc/mimir/config.yaml
      - mimir-data:/data

  # --- Tempo (Trace Storage) ---
  tempo:
    image: grafana/tempo:2.7.0
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "3200:3200"   # Tempo query
    command: -config.file=/etc/tempo/config.yaml
    volumes:
      - ./config/tempo.yaml:/etc/tempo/config.yaml
      - tempo-data:/var/tempo

  # --- Alloy (Collector) ---
  alloy:
    image: grafana/alloy:1.6.0
    ports:
      - "12345:12345"  # Alloy UI
      - "4327:4317"    # OTLP gRPC (apps send here)
      - "4328:4318"    # OTLP HTTP
    volumes:
      - ./config/alloy.river:/etc/alloy/config.river
    command: run /etc/alloy/config.river --server.http.listen-addr=0.0.0.0:12345

volumes:
  grafana-data:
  loki-data:
  mimir-data:
  tempo-data:

Production note

The config above fits a single-node or staging setup. For large production traffic (>1TB logs/month), run Loki and Mimir in microservices mode — split distributor, ingester, and querier into separate containers, and use object storage (self-hosted MinIO or S3) instead of local disks.

9. Integrating with an ASP.NET Core Application

Sending telemetry from a .NET app to the LGTM Stack takes only two steps: install the NuGet packages and configure the exporters.

Step 1: Install the packages

dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Instrumentation.Http
dotnet add package OpenTelemetry.Instrumentation.SqlClient
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol

Step 2: Configure Program.cs

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource
        .AddService("order-api", serviceVersion: "1.0.0"))
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddSqlClientInstrumentation(o => o.SetDbStatementForText = true)
        .AddOtlpExporter(o =>
        {
            o.Endpoint = new Uri("http://alloy:4317");
            o.Protocol = OpenTelemetry.Exporter
                .OtlpExportProtocol.Grpc;
        }))
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()
        .AddOtlpExporter(o =>
        {
            o.Endpoint = new Uri("http://alloy:4317");
            o.Protocol = OpenTelemetry.Exporter
                .OtlpExportProtocol.Grpc;
        }))
    .WithLogging(logging => logging
        .AddOtlpExporter(o =>
        {
            o.Endpoint = new Uri("http://alloy:4317");
            o.Protocol = OpenTelemetry.Exporter
                .OtlpExportProtocol.Grpc;
        }));

Grafana OpenTelemetry Distribution for .NET

Grafana provides the Grafana.OpenTelemetry package — a distribution that bundles the common instrumentations with defaults tuned for the LGTM stack. A single .UseGrafana() call on the OpenTelemetry builder (builder.Services.AddOpenTelemetry().UseGrafana()) replaces most of the manual setup above.
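Rather than hard-coding the OTLP endpoint in three places, the standard OpenTelemetry environment variables can configure the exporter from outside the code — convenient with Docker Compose. A sketch for a hypothetical order-api service container (the alloy hostname matches the Compose stack above):

```yaml
# Compose fragment for the application container: the OpenTelemetry
# SDK reads these standard variables, so the AddOtlpExporter() calls
# can omit the endpoint entirely.
services:
  order-api:
    environment:
      - OTEL_SERVICE_NAME=order-api
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4317
      - OTEL_EXPORTER_OTLP_PROTOCOL=grpc
```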

10. Alerting — From Observation to Action

Observability has no value if nobody gets notified when incidents happen. Grafana Alerting supports:

  • Unified alerting — alert rules for metrics (PromQL), logs (LogQL), and traces
  • Multi-channel — Slack, Discord, Telegram, PagerDuty, email, webhook
  • Silences & Mute timings — disable alerts during maintenance windows
  • Alert grouping — bundle 100 alerts of the same kind into one notification

Example alert rule for error rate:

# Alert when error rate > 5% for 5 minutes
- alert: HighErrorRate
  expr: |
    sum(rate({app="api"} |= "error" [5m])) by (app)
    /
    sum(rate({app="api"} [5m])) by (app)
    > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Abnormally high error rate for {{ $labels.app }}"
    description: "Error rate is at {{ $value | humanizePercentage }}"

11. Real Sizing and Cost

One of the main reasons to choose the LGTM Stack is cost. Compared to SaaS:

| Scale | LGTM self-hosted | Datadog (estimate) |
| --- | --- | --- |
| 10 services, 50GB logs/month | 1 VM (4 CPU / 16GB RAM), ~$40-80/month | ~$200-500/month |
| 50 services, 500GB logs/month | 3 VMs or a K8s cluster, ~$200-400/month | ~$2,000-5,000/month |
| 200 services, 2TB logs/month | K8s cluster + S3, ~$500-1,000/month | ~$10,000+/month |

Trade-off to consider

Self-hosted saves money but costs operational time. If your team only has 1-2 DevOps, start with the Grafana Cloud Free tier (10K metrics, 50GB logs, 50GB traces free) and migrate to self-hosted once you outgrow it. Grafana Cloud runs the same LGTM stack, so migration is essentially endpoint swaps.

12. Production Best Practices

Selective labels
Only use low-cardinality labels (app, env, region). Never use user_id, request_id, or IPs as labels — use Loki 3.x structured metadata instead. High-cardinality labels are the #1 cause of Loki OOM.
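In Alloy, a log pipeline can promote such fields into structured metadata instead of labels — a sketch using the loki.process component (the component name and the assumption that trace_id arrives in JSON logs are illustrative):

```river
// Extract trace_id from JSON log lines and attach it as structured
// metadata rather than an indexed label — filterable, but it creates
// no new series and cannot explode cardinality.
loki.process "enrich" {
  stage.json {
    expressions = { trace_id = "" }
  }
  stage.structured_metadata {
    values = { trace_id = "" }
  }
  forward_to = [loki.write.default.receiver]
}
```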
Tiered retention
Hot data (7 days) on SSD, warm data (30 days) on HDD, cold data (1 year+) on object storage. Configure retention_period and compactor in Loki to automatically move data across tiers.
Trace sampling
You don't need to store 100% of traces. Use tail-based sampling in Alloy: always keep traces with errors or high latency, and sample 10-20% of successful traces. Reduces Tempo storage cost by 80% without losing important information.
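A sketch of tail-based sampling with Alloy's otelcol.processor.tail_sampling component — the policy names, thresholds, and percentage are illustrative, and the exporter reference assumes the Tempo exporter defined earlier:

```river
// Tail-based sampling: the keep/drop decision is made after the
// whole trace has been seen, so errors and slow traces survive.
otelcol.processor.tail_sampling "default" {
  // Always keep traces that contain an error span
  policy {
    name = "keep-errors"
    type = "status_code"
    status_code { status_codes = ["ERROR"] }
  }
  // Always keep slow traces
  policy {
    name = "keep-slow"
    type = "latency"
    latency { threshold_ms = 2000 }
  }
  // Sample a fraction of everything else
  policy {
    name = "sample-rest"
    type = "probabilistic"
    probabilistic { sampling_percentage = 15 }
  }
  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}
```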
Recording rules for metrics
Pre-compute complex PromQL queries into recording rules. Instead of querying raw data every time a dashboard loads, Mimir computes aggregated metrics ahead of time — dashboards load 10× faster.
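As an example, a recording rule that pre-computes P99 latency, loaded via the Mimir ruler (the metric and label names are illustrative):

```yaml
groups:
  - name: api-latency
    interval: 1m
    rules:
      # Dashboards query this cheap pre-aggregated series instead of
      # re-aggregating raw histogram buckets on every page load.
      - record: job:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))
```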
Dashboards as code
Use Grafana 12 Git Sync or the grafana/grafana Terraform provider to manage dashboards via version control. Nobody edits production dashboards by hand in the UI — every change goes through PR review.
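With the Grafana Terraform provider, a dashboard becomes a reviewed artifact like any other code (the resource and file names are illustrative):

```hcl
# Dashboard JSON lives in the repo; Terraform applies it to Grafana,
# so every change arrives through a pull request.
resource "grafana_dashboard" "api_overview" {
  config_json = file("${path.module}/dashboards/api-overview.json")
}
```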

Conclusion

The Grafana LGTM Stack — Loki, Grafana, Tempo, and Mimir, plus the Alloy collector — delivers a complete, free observability platform with no vendor lock-in. With correlation across logs, metrics, and traces in a single interface, your team can cut incident debugging from hours to minutes.

If you're running CloudWatch + Kibana + Jaeger separately, or paying thousands of dollars a month for Datadog, now is the time to consider moving to LGTM Stack — start with the Grafana Cloud Free tier to experiment, then self-host once you're comfortable.

References:
Grafana Loki Documentation · Grafana Mimir Documentation · Grafana Tempo Documentation · Grafana Alloy Documentation · Grafana 12 What's New · Instrument .NET with OpenTelemetry — Grafana