Grafana LGTM Stack — Build a Free Observability Platform for Production

Posted on: 4/18/2026 8:11:46 AM

Are your applications running in production while, every time an incident happens, you SSH into the server and grep through logs by hand? Or worse, do you only learn which service is slow when customers complain? The Grafana LGTM Stack — a completely free, open-source observability toolkit — solves this problem by unifying Logs, Metrics, Traces, and Profiles in a single platform.

  • 100% open source — self-hosted, no vendor lock-in
  • 65% MTTR reduction vs. traditional monitoring
  • 10M+ metrics/second handled by Mimir
  • 4 signals — logs, metrics, traces, profiles

1. What Is the LGTM Stack?

LGTM stands for four core components developed by Grafana Labs:

| Component | Role | Commercial equivalent |
| --- | --- | --- |
| Loki | Log aggregation — collect, store, and query logs | Splunk, Datadog Logs |
| Grafana | Visualization — dashboards, alerting, explore | Datadog Dashboards, Kibana |
| Tempo | Distributed tracing — follow requests across services | Jaeger, Datadog APM |
| Mimir | Metrics storage — store Prometheus metrics long-term | Thanos, Cortex, Datadog Metrics |

Beyond these four, the stack also includes Grafana Alloy — a unified collector that replaces Promtail, Grafana Agent, and the OpenTelemetry Collector, acting as the stack's single collection agent that gathers every telemetry signal from your applications.

Why not use the ELK Stack?

ELK (Elasticsearch + Logstash + Kibana) fully indexes log content, which demands substantial RAM and disk. Loki indexes only labels (metadata) and stores the log content compressed, yielding roughly 10-50× storage savings. For small and mid-sized systems, the LGTM stack runs comfortably on a single 4 CPU / 8GB RAM server.

2. The Overall LGTM Stack Architecture

Understanding the architecture tells you where data comes from and where it goes — so when incidents happen, you know which component to check.

graph LR
    subgraph Applications
        A1["ASP.NET Core API"]
        A2["Vue.js Frontend"]
        A3["Background Worker"]
    end

    subgraph "Grafana Alloy (Collector)"
        C1["OTLP Receiver"]
        C2["Prometheus Scraper"]
        C3["Log Pipeline"]
    end

    subgraph "Storage Backends"
        M["Mimir<br/>Metrics"]
        L["Loki<br/>Logs"]
        T["Tempo<br/>Traces"]
    end

    G["Grafana<br/>Dashboard + Alerting"]

    A1 -->|OTLP gRPC| C1
    A2 -->|OTLP HTTP| C1
    A3 -->|OTLP gRPC| C1
    A1 -->|metrics endpoint| C2
    C1 --> M
    C1 --> T
    C2 --> M
    C3 --> L
    M --> G
    L --> G
    T --> G

    style A1 fill:#e94560,stroke:#fff,color:#fff
    style A2 fill:#e94560,stroke:#fff,color:#fff
    style A3 fill:#e94560,stroke:#fff,color:#fff
    style C1 fill:#2c3e50,stroke:#fff,color:#fff
    style C2 fill:#2c3e50,stroke:#fff,color:#fff
    style C3 fill:#2c3e50,stroke:#fff,color:#fff
    style M fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style L fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style T fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style G fill:#4CAF50,stroke:#fff,color:#fff
LGTM Stack architecture — data flows from apps via Alloy to the storage backends, while Grafana queries all of them

3. Grafana Alloy — The Unified Collector

Previously you needed to run Promtail separately (for logs), Grafana Agent (for metrics), and the OpenTelemetry Collector (for traces). Grafana Alloy unifies all three into a single binary configured with a declarative, component-based language (the Alloy configuration syntax, formerly known as River).

What does Alloy replace?

Promtail → Alloy loki pipeline · Grafana Agent → Alloy prometheus pipeline · OTel Collector → Alloy otelcol pipeline. One process, one config, one place to debug.

Example Alloy config that receives OTLP from a .NET application and forwards it to Loki + Tempo + Mimir:

// Receive telemetry via OTLP
otelcol.receiver.otlp "default" {
  grpc { endpoint = "0.0.0.0:4317" }
  http { endpoint = "0.0.0.0:4318" }

  output {
    metrics = [otelcol.processor.batch.default.input]
    logs    = [otelcol.processor.batch.default.input]
    traces  = [otelcol.processor.batch.default.input]
  }
}

// Batch to reduce network overhead
otelcol.processor.batch "default" {
  output {
    metrics = [otelcol.exporter.prometheus.mimir.input]
    logs    = [otelcol.exporter.loki.default.input]
    traces  = [otelcol.exporter.otlp.tempo.input]
  }
}

// Export metrics to Mimir
otelcol.exporter.prometheus "mimir" {
  forward_to = [prometheus.remote_write.mimir.receiver]
}

prometheus.remote_write "mimir" {
  endpoint {
    url = "http://mimir:9009/api/v1/push"
  }
}

// Export logs to Loki
otelcol.exporter.loki "default" {
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}

// Export traces to Tempo
otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo:4317"
    tls { insecure = true }
  }
}

The key point

Alloy uses a component-based model: each block is a component with inputs/outputs, connected to each other via forward_to or output. You can insert processors (filter, transform, sample) in the middle of a pipeline without changing the receiver or exporter.
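As a sketch of that model, a probabilistic sampler could be spliced between the batch processor and the Tempo exporter from the earlier config without touching either end — only the batch processor's traces output changes (the component name and sampling percentage below are illustrative):

```river
// Re-point the batch processor's traces at a new sampler component
otelcol.processor.batch "default" {
  output {
    metrics = [otelcol.exporter.prometheus.mimir.input]
    logs    = [otelcol.exporter.loki.default.input]
    traces  = [otelcol.processor.probabilistic_sampler.default.input]
  }
}

// New middle stage: keep ~20% of traces (illustrative value),
// then forward to the unchanged Tempo exporter
otelcol.processor.probabilistic_sampler "default" {
  sampling_percentage = 20
  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}
```

The receiver and exporter components are untouched — only the wiring between them changed.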

4. Loki — Economical Log Aggregation

Loki is the heart of log collection in the LGTM Stack. Unlike Elasticsearch (full-text indexing), Loki only indexes labels (e.g., {app="api", env="production"}) and stores log content compressed. That gives you:

  • 10-50× cheaper storage than Elasticsearch for the same log volume
  • Simpler operations — no JVM heap tuning, no shard rebalancing
  • Natural integration with Prometheus labels — same label set for metrics and logs
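The label-centric model is visible in Loki's push API: each entry belongs to a stream identified solely by its label set, and the log line itself is an opaque string. A minimal Python sketch of the JSON payload shape for the standard /loki/api/v1/push endpoint (the labels and log lines are illustrative):

```python
import json
import time

def build_loki_push_payload(labels: dict, lines: list) -> dict:
    """Build a Loki push payload: one stream keyed by its label set.

    Loki indexes only the labels; each line is stored as an opaque,
    compressed string paired with a nanosecond-precision timestamp.
    """
    ts_ns = str(time.time_ns())
    return {
        "streams": [
            {
                "stream": labels,  # the only part Loki indexes
                "values": [[ts_ns, line] for line in lines],
            }
        ]
    }

payload = build_loki_push_payload(
    {"app": "api", "env": "production"},
    ["GET /orders 200 12ms", "GET /orders 500 2300ms error"],
)
body = json.dumps(payload)  # POST this to http://loki:3100/loki/api/v1/push
```

Because the label set is tiny and stable, the index stays tiny too — the heavy content lives only in compressed chunks.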

LogQL — The Log Query Language

LogQL is inspired by PromQL, using label selectors combined with filter expressions:

// Find error logs for the api service over the last hour
{app="api", env="production"} |= "error" | json | status_code >= 500

// Failed-request rate per endpoint, over 5-minute windows
sum by (endpoint) (rate({app="api"} |= "HTTP" | json | status_code >= 500 [5m]))

// Calculate P99 response time from logs
quantile_over_time(0.99, {app="api"} | json | unwrap duration_ms [5m])

// Pattern matching — detect log format automatically
{app="api"} | pattern "<ip> - <method> <path> <status> <duration>ms"
  | status >= 500

Bloom Filters in Loki 3.x

Loki 3.0+ supports Bloom filters to speed up filter queries. Instead of scanning all chunks, Loki checks the Bloom filter first to quickly skip chunks that don't contain the searched keyword — significantly reducing I/O for queries like |= "OutOfMemoryException" over large datasets.

Structured Metadata

From Loki 3.0, you can attach structured metadata to log entries without turning them into labels (which would explode cardinality). Examples: trace_id, user_id, request_id — filterable but they don't create new series.

// Query logs by trace_id from structured metadata
{app="api"} | trace_id = "abc123def456"

5. Mimir — Large-Scale Metrics Storage

Prometheus is great at scraping metrics, but it has two major limitations at scale:

  1. Single-node storage — local TSDB doesn't scale horizontally
  2. Short retention — usually 15-30 days due to disk

Mimir solves both by becoming remote storage for Prometheus, supporting multi-tenancy and long-term retention on object storage (S3, MinIO, Azure Blob).
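If you already run Prometheus, pointing it at Mimir is a single remote_write block. A sketch (the URL matches the Mimir service used elsewhere in this article; the tenant ID is illustrative):

```yaml
# prometheus.yml — forward all scraped metrics to Mimir for long-term storage
remote_write:
  - url: http://mimir:9009/api/v1/push
    headers:
      X-Scope-OrgID: my-team   # Mimir tenant ID (illustrative)
```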

graph TD
    P1["Prometheus / Alloy"] -->|remote_write| D["Distributor"]
    D --> I1["Ingester 1"]
    D --> I2["Ingester 2"]
    D --> I3["Ingester 3"]
    I1 --> S["Object Storage<br/>S3 / MinIO / Azure Blob"]
    I2 --> S
    I3 --> S

    QF["Query Frontend"] --> Q["Querier"]
    Q --> I1
    Q --> I2
    Q --> I3
    Q --> S
    G["Grafana"] --> QF

    style P1 fill:#e94560,stroke:#fff,color:#fff
    style D fill:#e94560,stroke:#fff,color:#fff
    style I1 fill:#2c3e50,stroke:#fff,color:#fff
    style I2 fill:#2c3e50,stroke:#fff,color:#fff
    style I3 fill:#2c3e50,stroke:#fff,color:#fff
    style S fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style QF fill:#4CAF50,stroke:#fff,color:#fff
    style Q fill:#4CAF50,stroke:#fff,color:#fff
    style G fill:#4CAF50,stroke:#fff,color:#fff
Mimir architecture — Distributor routes metrics to Ingesters; long-term data lands on Object Storage
| Feature | Prometheus (standalone) | Mimir |
| --- | --- | --- |
| Horizontal scaling | No | Yes — sharding by tenant/series |
| Long-term retention | 15-30 days (local disk) | Unlimited (object storage) |
| Multi-tenancy | No | Yes — isolates data across teams |
| High availability | Needs Thanos sidecar | Built-in replication |
| Query performance | Degrades with data size | Query splitting + caching |
| Storage cost | Expensive SSD | Cheap object storage |

6. Tempo — Index-Free Distributed Tracing

When a request passes through 5 services, you want to know: which service is slow? Where did the error happen? Tempo answers that by storing distributed traces at very low cost.

Unlike Jaeger (needs Elasticsearch/Cassandra), Tempo only needs object storage. It doesn't index traces — it stores them by trace ID. To find a trace, you use:

  • TraceQL — a dedicated query language for traces
  • Metrics-to-traces — from a dashboard spike, click to see example traces
  • Logs-to-traces — from a log line with a trace_id, jump to Tempo to see the full trace

TraceQL — Query Traces Like a Database

// Find traces with an error span in the "order-api" service
{ resource.service.name = "order-api" && status = error }

// Traces with duration > 2 seconds
{ duration > 2s }

// Traces that pass through both order-api and payment-service
{ resource.service.name = "order-api" } >> { resource.service.name = "payment-service" }

// Spans with a specific attribute
{ span.http.status_code >= 500 && span.http.method = "POST" }

Exemplars — The Bridge Between Metrics ↔ Traces

When Prometheus/Mimir collects metrics, it can attach an exemplar — a sample trace ID for each data point. In Grafana, when you see P99 latency suddenly spike, clicking the exemplar jumps straight to the specific trace that caused that spike. This is a killer feature of running a unified LGTM stack.
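In the OpenMetrics exposition format, an exemplar rides along after a # on the sample it annotates — everything after the hash links that data point to one concrete trace (the metric name, trace ID, and values below are illustrative):

```text
# A histogram bucket with an attached exemplar (OpenMetrics format):
# <metric> <value> # {<exemplar labels>} <observed value> <timestamp>
http_request_duration_seconds_bucket{le="0.5"} 1027 # {trace_id="abc123def456"} 0.43 1708000000.0
```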

7. Grafana — Dashboards, Alerting, and Correlation

Grafana is the visualization layer that stitches everything together. Version 12.x brings many important improvements:

Grafana 12 highlights

  • Git Sync — manage dashboards as code, version-controlled through Git
  • Explore Logs — auto-detects patterns in logs, no query writing required
  • Traces to Profiles — drill from a slow span straight into flame graphs to see which functions consume CPU
  • Adaptive dashboards — layouts adjust automatically based on the data

Correlation — The Power of a Unified Stack

The biggest advantage of the LGTM stack is the ability to correlate the three signals:

graph LR
    M["📊 Metrics<br/>CPU spike at 14:05"] -->|exemplar trace_id| T["🔍 Traces<br/>3.2s slow span in payment-service"]
    T -->|trace_id in log| L["📝 Logs<br/>TimeoutException connecting to DB"]
    L -->|label match| M

    style M fill:#e94560,stroke:#fff,color:#fff
    style T fill:#2c3e50,stroke:#fff,color:#fff
    style L fill:#4CAF50,stroke:#fff,color:#fff
Correlation loop: Metrics → Traces → Logs → back to Metrics. Debug incidents in minutes instead of hours

A typical incident-debug workflow:

  1. Alert fires "P99 latency > 2s" on a Grafana dashboard
  2. Click the metric panel → view the exemplar trace ID
  3. Open the trace in Tempo → see a db.query span taking 2.8s
  4. Click the trace_id → Loki shows: Connection pool exhausted, waiting 2.5s
  5. Root cause: connection pool is too small → increase MaxPoolSize → deploy fix

8. Deploying the LGTM Stack with Docker Compose

Below is a Docker Compose configuration suited to a medium-sized system (10-50 services, ~100GB logs/month):

version: "3.8"

services:
  # --- Grafana ---
  grafana:
    image: grafana/grafana:12.4.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your-secure-password
      - GF_FEATURE_TOGGLES_ENABLE=traceToMetrics,traceToLogs
    volumes:
      - grafana-data:/var/lib/grafana
      - ./provisioning:/etc/grafana/provisioning
    depends_on: [loki, mimir, tempo]

  # --- Loki (Log Storage) ---
  loki:
    image: grafana/loki:3.4.0
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/config.yaml
    volumes:
      - ./config/loki.yaml:/etc/loki/config.yaml
      - loki-data:/loki

  # --- Mimir (Metrics Storage) ---
  mimir:
    image: grafana/mimir:2.15.0
    ports:
      - "9009:9009"
    command: -config.file=/etc/mimir/config.yaml
    volumes:
      - ./config/mimir.yaml:/etc/mimir/config.yaml
      - mimir-data:/data

  # --- Tempo (Trace Storage) ---
  tempo:
    image: grafana/tempo:2.7.0
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "3200:3200"   # Tempo query
    command: -config.file=/etc/tempo/config.yaml
    volumes:
      - ./config/tempo.yaml:/etc/tempo/config.yaml
      - tempo-data:/var/tempo

  # --- Alloy (Collector) ---
  alloy:
    image: grafana/alloy:1.6.0
    ports:
      - "12345:12345"  # Alloy UI
      - "4327:4317"    # OTLP gRPC (apps send here)
      - "4328:4318"    # OTLP HTTP
    volumes:
      - ./config/alloy.river:/etc/alloy/config.river
    command: run /etc/alloy/config.river --server.http.listen-addr=0.0.0.0:12345

volumes:
  grafana-data:
  loki-data:
  mimir-data:
  tempo-data:

Production note

The config above fits a single-node or staging setup. For large production traffic (>1TB logs/month), run Loki and Mimir in microservices mode — split distributor, ingester, and querier into separate containers, and use object storage (self-hosted MinIO or S3) instead of local disks.

9. Integrating with an ASP.NET Core Application

Sending telemetry from a .NET app to the LGTM Stack takes only two steps: install the NuGet packages and configure the exporters.

Step 1: Install the packages

dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Instrumentation.Http
dotnet add package OpenTelemetry.Instrumentation.SqlClient
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol

Step 2: Configure Program.cs

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource
        .AddService("order-api", serviceVersion: "1.0.0"))
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddSqlClientInstrumentation(o => o.SetDbStatementForText = true)
        .AddOtlpExporter(o =>
        {
            o.Endpoint = new Uri("http://alloy:4317");
            o.Protocol = OpenTelemetry.Exporter
                .OtlpExportProtocol.Grpc;
        }))
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()
        .AddOtlpExporter(o =>
        {
            o.Endpoint = new Uri("http://alloy:4317");
            o.Protocol = OpenTelemetry.Exporter
                .OtlpExportProtocol.Grpc;
        }))
    .WithLogging(logging => logging
        .AddOtlpExporter(o =>
        {
            o.Endpoint = new Uri("http://alloy:4317");
            o.Protocol = OpenTelemetry.Exporter
                .OtlpExportProtocol.Grpc;
        }));

Grafana OpenTelemetry Distribution for .NET

Grafana provides the Grafana.OpenTelemetry package — a distribution that bundles the common instrumentations with defaults tuned for the LGTM stack. A single .UseGrafana() call on the OpenTelemetry builder (builder.Services.AddOpenTelemetry().UseGrafana()) replaces most of the manual setup above.
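Rather than hard-coding the OTLP endpoint in three places, the standard OpenTelemetry environment variables can configure the exporter from outside the code — convenient with Docker Compose. A sketch for a hypothetical order-api service container (the alloy hostname matches the Compose stack above):

```yaml
# Compose fragment for the application container: the OpenTelemetry
# SDK reads these standard variables, so the AddOtlpExporter() calls
# can omit the endpoint entirely.
services:
  order-api:
    environment:
      - OTEL_SERVICE_NAME=order-api
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4317
      - OTEL_EXPORTER_OTLP_PROTOCOL=grpc
```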

10. Alerting — From Observation to Action

Observability has no value if nobody gets notified when incidents happen. Grafana Alerting supports:

  • Unified alerting — alert rules for metrics (PromQL), logs (LogQL), and traces
  • Multi-channel — Slack, Discord, Telegram, PagerDuty, email, webhook
  • Silences & Mute timings — disable alerts during maintenance windows
  • Alert grouping — bundle 100 alerts of the same kind into one notification

Example alert rule for error rate:

# Alert when error rate > 5% for 5 minutes
- alert: HighErrorRate
  expr: |
    sum(rate({app="api"} |= "error" [5m])) by (app)
    /
    sum(rate({app="api"} [5m])) by (app)
    > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Abnormally high error rate for {{ $labels.app }}"
    description: "Error rate is at {{ $value | humanizePercentage }}"

11. Real Sizing and Cost

One of the main reasons to choose the LGTM Stack is cost. Compared to SaaS:

| Scale | LGTM self-hosted | Datadog (estimate) |
| --- | --- | --- |
| 10 services, 50GB logs/month | 1 VM (4 CPU / 16GB RAM), ~$40-80/month | ~$200-500/month |
| 50 services, 500GB logs/month | 3 VMs or a K8s cluster, ~$200-400/month | ~$2,000-5,000/month |
| 200 services, 2TB logs/month | K8s cluster + S3, ~$500-1,000/month | ~$10,000+/month |

Trade-off to consider

Self-hosted saves money but costs operational time. If your team only has 1-2 DevOps, start with the Grafana Cloud Free tier (10K metrics, 50GB logs, 50GB traces free) and migrate to self-hosted once you outgrow it. Grafana Cloud runs the same LGTM stack, so migration is essentially endpoint swaps.

12. Production Best Practices

Selective labels
Only use low-cardinality labels (app, env, region). Never use user_id, request_id, or IPs as labels — use Loki 3.x structured metadata instead. High-cardinality labels are the #1 cause of Loki OOM.
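In Alloy, a log pipeline can promote such fields into structured metadata instead of labels — a sketch using the loki.process component (the component name and the assumption that trace_id arrives in JSON logs are illustrative):

```river
// Extract trace_id from JSON log lines and attach it as structured
// metadata rather than an indexed label — filterable, but it creates
// no new series and cannot explode cardinality.
loki.process "enrich" {
  stage.json {
    expressions = { trace_id = "" }
  }
  stage.structured_metadata {
    values = { trace_id = "" }
  }
  forward_to = [loki.write.default.receiver]
}
```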
Tiered retention
Hot data (7 days) on SSD, warm data (30 days) on HDD, cold data (1 year+) on object storage. Configure retention_period and compactor in Loki to automatically move data across tiers.
Trace sampling
You don't need to store 100% of traces. Use tail-based sampling in Alloy: always keep traces with errors or high latency, and sample 10-20% of successful traces. Reduces Tempo storage cost by 80% without losing important information.
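A sketch of tail-based sampling with Alloy's otelcol.processor.tail_sampling component — the policy names, thresholds, and percentage are illustrative, and the exporter reference assumes the Tempo exporter defined earlier:

```river
// Tail-based sampling: the keep/drop decision is made after the
// whole trace has been seen, so errors and slow traces survive.
otelcol.processor.tail_sampling "default" {
  // Always keep traces that contain an error span
  policy {
    name = "keep-errors"
    type = "status_code"
    status_code { status_codes = ["ERROR"] }
  }
  // Always keep slow traces
  policy {
    name = "keep-slow"
    type = "latency"
    latency { threshold_ms = 2000 }
  }
  // Sample a fraction of everything else
  policy {
    name = "sample-rest"
    type = "probabilistic"
    probabilistic { sampling_percentage = 15 }
  }
  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}
```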
Recording rules for metrics
Pre-compute complex PromQL queries into recording rules. Instead of querying raw data every time a dashboard loads, Mimir computes aggregated metrics ahead of time — dashboards load 10× faster.
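As an example, a recording rule that pre-computes P99 latency, loaded via the Mimir ruler (the metric and label names are illustrative):

```yaml
groups:
  - name: api-latency
    interval: 1m
    rules:
      # Dashboards query this cheap pre-aggregated series instead of
      # re-aggregating raw histogram buckets on every page load.
      - record: job:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))
```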
Dashboards as code
Use Grafana 12 Git Sync or the grafana/grafana Terraform provider to manage dashboards via version control. Nobody edits production dashboards by hand in the UI — every change goes through PR review.
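With the Grafana Terraform provider, a dashboard becomes a reviewed artifact like any other code (the resource and file names are illustrative):

```hcl
# Dashboard JSON lives in the repo; Terraform applies it to Grafana,
# so every change arrives through a pull request.
resource "grafana_dashboard" "api_overview" {
  config_json = file("${path.module}/dashboards/api-overview.json")
}
```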

Conclusion

The Grafana LGTM Stack — Loki, Grafana, Tempo, and Mimir, plus the Alloy collector — delivers a complete, free observability platform with no vendor lock-in. With correlation across logs, metrics, and traces in a single interface, your team can cut incident debugging from hours to minutes.

If you're running CloudWatch + Kibana + Jaeger separately, or paying thousands of dollars a month for Datadog, now is the time to consider moving to LGTM Stack — start with the Grafana Cloud Free tier to experiment, then self-host once you're comfortable.

References:
Grafana Loki Documentation · Grafana Mimir Documentation · Grafana Tempo Documentation · Grafana Alloy Documentation · Grafana 12 What's New · Instrument .NET with OpenTelemetry — Grafana