Grafana LGTM Stack — Build a Free Observability Platform for Production
Posted on: 4/18/2026 8:11:46 AM
Table of contents
- 1. What Is the LGTM Stack?
- 2. The Overall LGTM Stack Architecture
- 3. Grafana Alloy — The Unified Collector
- 4. Loki — Economical Log Aggregation
- 5. Mimir — Large-Scale Metrics Storage
- 6. Tempo — Index-Free Distributed Tracing
- 7. Grafana — Dashboards, Alerting, and Correlation
- 8. Deploying the LGTM Stack with Docker Compose
- 9. Integrating with an ASP.NET Core Application
- 10. Alerting — From Observation to Action
- 11. Real Sizing and Cost
- 12. Production Best Practices
- Conclusion
Do your applications run in production while, every time an incident happens, you SSH into the server and grep through log files? Or worse, you don't know which service is slow until customers complain? The Grafana LGTM Stack — a completely free, open-source observability toolkit — solves this problem by unifying logs, metrics, and traces into a single platform.
1. What Is the LGTM Stack?
LGTM stands for four core components developed by Grafana Labs:
| Component | Role | Commercial equivalent |
|---|---|---|
| Loki | Log aggregation — collect, store, and query logs | Splunk, Datadog Logs |
| Grafana | Visualization — dashboards, alerting, explore | Datadog Dashboard, Kibana |
| Tempo | Distributed tracing — follow requests across services | Jaeger, Datadog APM |
| Mimir | Metrics storage — store Prometheus metrics long-term | Thanos, Cortex, Datadog Metrics |
Beyond these four, the stack also includes Grafana Alloy — a unified collector that replaces Promtail, Grafana Agent, and the OpenTelemetry Collector, acting as the "extended arm" that gathers every telemetry signal from your applications.
Why not use the ELK Stack?
ELK (Elasticsearch + Logstash + Kibana) indexes log content in full — needing massive RAM and disk. Loki only indexes labels (metadata) and stores logs compressed → 10-50× storage savings. For small and mid-sized systems, the LGTM stack runs comfortably on a single 4 CPU / 8GB RAM server.
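To see what "indexes only labels" means in practice, here is the shape of a payload sent to Loki's /loki/api/v1/push API (the label values, timestamp placeholder, and log line below are illustrative). Only the stream map becomes index entries; the values array is compressed into chunks and never indexed:

```json
{
  "streams": [
    {
      "stream": { "app": "api", "env": "production" },
      "values": [
        ["<unix-epoch-nanoseconds>", "GET /orders 200 12ms"]
      ]
    }
  ]
}
```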
2. The Overall LGTM Stack Architecture
Understanding the architecture tells you where data comes from and where it goes — so when incidents happen, you know which component to check.
graph LR
subgraph Applications
A1["ASP.NET Core API"]
A2["Vue.js Frontend"]
A3["Background Worker"]
end
subgraph "Grafana Alloy (Collector)"
C1["OTLP Receiver"]
C2["Prometheus Scraper"]
C3["Log Pipeline"]
end
subgraph "Storage Backends"
M["Mimir
Metrics"]
L["Loki
Logs"]
T["Tempo
Traces"]
end
G["Grafana
Dashboard + Alerting"]
A1 -->|OTLP gRPC| C1
A2 -->|OTLP HTTP| C1
A3 -->|OTLP gRPC| C1
A1 -->|metrics endpoint| C2
C1 --> M
C1 --> T
C2 --> M
C3 --> L
M --> G
L --> G
T --> G
style A1 fill:#e94560,stroke:#fff,color:#fff
style A2 fill:#e94560,stroke:#fff,color:#fff
style A3 fill:#e94560,stroke:#fff,color:#fff
style C1 fill:#2c3e50,stroke:#fff,color:#fff
style C2 fill:#2c3e50,stroke:#fff,color:#fff
style C3 fill:#2c3e50,stroke:#fff,color:#fff
style M fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style L fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style T fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style G fill:#4CAF50,stroke:#fff,color:#fff
3. Grafana Alloy — The Unified Collector
Previously you needed to run Promtail separately (for logs), Grafana Agent (for metrics), and the OpenTelemetry Collector (for traces). Grafana Alloy unifies all three into a single binary with the River declarative configuration language.
What does Alloy replace?
Promtail → Alloy loki pipeline · Grafana Agent → Alloy prometheus pipeline · OTel Collector → Alloy otelcol pipeline. One process, one config, one place to debug.
Example Alloy config that receives OTLP from a .NET application and forwards it to Loki + Tempo + Mimir:
// Receive telemetry via OTLP
otelcol.receiver.otlp "default" {
  grpc { endpoint = "0.0.0.0:4317" }
  http { endpoint = "0.0.0.0:4318" }

  output {
    metrics = [otelcol.processor.batch.default.input]
    logs    = [otelcol.processor.batch.default.input]
    traces  = [otelcol.processor.batch.default.input]
  }
}

// Batch to reduce network overhead
otelcol.processor.batch "default" {
  output {
    metrics = [otelcol.exporter.prometheus.mimir.input]
    logs    = [otelcol.exporter.loki.default.input]
    traces  = [otelcol.exporter.otlp.tempo.input]
  }
}

// Export metrics to Mimir
otelcol.exporter.prometheus "mimir" {
  forward_to = [prometheus.remote_write.mimir.receiver]
}

prometheus.remote_write "mimir" {
  endpoint {
    url = "http://mimir:9009/api/v1/push"
  }
}

// Export logs to Loki
otelcol.exporter.loki "default" {
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}

// Export traces to Tempo
otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo:4317"
    tls { insecure = true }
  }
}
The key point
Alloy uses a component-based model: each block is a component with inputs/outputs, connected to each other via forward_to or output. You can insert processors (filter, transform, sample) in the middle of a pipeline without changing the receiver or exporter.
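As a sketch of that model, here is a hypothetical filter processor inserted into the pipeline above to discard health-check spans before batching (the component names reuse the earlier config; the "drop_health" label and the /health condition are illustrative). To wire it in, the receiver's traces output would point at this component instead of the batch processor:

```river
// Drop health-check spans before they reach the batch processor.
// Point the receiver's traces output here instead of at
// otelcol.processor.batch.default.input.
otelcol.processor.filter "drop_health" {
  error_mode = "ignore"
  traces {
    span = ["attributes[\"http.target\"] == \"/health\""]
  }
  output {
    traces = [otelcol.processor.batch.default.input]
  }
}
```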
4. Loki — Economical Log Aggregation
Loki is the heart of log collection in the LGTM Stack. Unlike Elasticsearch (full-text indexing), Loki only indexes labels (e.g., {app="api", env="production"}) and stores log content compressed. That gives you:
- 10-50× cheaper storage than Elasticsearch for the same log volume
- Simpler operations — no JVM heap tuning, no shard rebalancing
- Natural integration with Prometheus labels — same label set for metrics and logs
LogQL — The Log Query Language
LogQL is inspired by PromQL, using label selectors combined with filter expressions:
# Find error logs for the api service (the time range comes from Grafana's picker)
{app="api", env="production"} |= "error" | json | status_code >= 500

# Failed-request rate per endpoint, over 5-minute windows
sum by (endpoint) (rate({app="api"} |= "HTTP" | json | status_code >= 500 [5m]))

# Calculate P99 response time (in ms) from logs
quantile_over_time(0.99, {app="api"} | json | unwrap duration_ms [5m])

# Extract fields positionally with a pattern expression
{app="api"} | pattern "<ip> - <method> <path> <status> <duration>ms"
  | status >= 500
Bloom Filters in Loki 3.x
Loki 3.0+ supports Bloom filters to speed up filter queries. Instead of scanning all chunks, Loki checks the Bloom filter first to quickly skip chunks that don't contain the searched keyword — significantly reducing I/O for queries like |= "OutOfMemoryException" over large datasets.
Structured Metadata
From Loki 3.0, you can attach structured metadata to log entries without turning them into labels (which would explode cardinality). Examples: trace_id, user_id, request_id — filterable but they don't create new series.
# Query logs by trace_id from structured metadata
{app="api"} | trace_id="abc123def456"
5. Mimir — Large-Scale Metrics Storage
Prometheus is great for scraping metrics, but it has two major limitations at scale:
- Single-node storage — local TSDB doesn't scale horizontally
- Short retention — usually 15-30 days due to disk
Mimir solves both by becoming remote storage for Prometheus, supporting multi-tenancy and long-term retention on object storage (S3, MinIO, Azure Blob).
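For teams already running vanilla Prometheus, pointing it at Mimir takes a single remote_write block. A sketch (the endpoint matches the Docker Compose setup used elsewhere in this article; the team-a tenant ID is illustrative):

```yaml
# prometheus.yml
remote_write:
  - url: http://mimir:9009/api/v1/push
    headers:
      X-Scope-OrgID: team-a   # Mimir tenant ID (multi-tenancy)
```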
graph TD
P1["Prometheus / Alloy"] -->|remote_write| D["Distributor"]
D --> I1["Ingester 1"]
D --> I2["Ingester 2"]
D --> I3["Ingester 3"]
I1 --> S["Object Storage
S3 / MinIO / Azure Blob"]
I2 --> S
I3 --> S
QF["Query Frontend"] --> Q["Querier"]
Q --> I1
Q --> I2
Q --> I3
Q --> S
G["Grafana"] --> QF
style D fill:#e94560,stroke:#fff,color:#fff
style I1 fill:#2c3e50,stroke:#fff,color:#fff
style I2 fill:#2c3e50,stroke:#fff,color:#fff
style I3 fill:#2c3e50,stroke:#fff,color:#fff
style S fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style QF fill:#4CAF50,stroke:#fff,color:#fff
style Q fill:#4CAF50,stroke:#fff,color:#fff
style G fill:#4CAF50,stroke:#fff,color:#fff
style P1 fill:#e94560,stroke:#fff,color:#fff
| Feature | Prometheus (standalone) | Mimir |
|---|---|---|
| Horizontal scaling | No | Yes — sharding by tenant/series |
| Long-term retention | 15-30 days (disk) | Unlimited (object storage) |
| Multi-tenant | No | Yes — isolates data across teams |
| High availability | Needs Thanos sidecar | Built-in replication |
| Query performance | Degrades with data size | Query splitting + caching |
| Storage cost | Expensive SSD | Cheap object storage |
6. Tempo — Index-Free Distributed Tracing
When a request passes through 5 services, you want to know: which service is slow? Where did the error happen? Tempo answers that by storing distributed traces at very low cost.
Unlike Jaeger (needs Elasticsearch/Cassandra), Tempo only needs object storage. It doesn't index traces — it stores them by trace ID. To find a trace, you use:
- TraceQL — a dedicated query language for traces
- Metrics-to-traces — from a dashboard spike, click to see example traces
- Logs-to-traces — from a log line with a trace_id, jump to Tempo to see the full trace
TraceQL — Query Traces Like a Database
// Find traces with an error span in the "order-api" service
{ resource.service.name = "order-api" && status = error }
// Traces with duration > 2 seconds
{ duration > 2s }
// Traces that pass through both order-api and payment-service
{ resource.service.name = "order-api" } >> { resource.service.name = "payment-service" }
// Spans with a specific attribute
{ span.http.status_code >= 500 && span.http.method = "POST" }
Exemplars — The Bridge Between Metrics ↔ Traces
When Prometheus/Mimir collects metrics, it can attach an exemplar — a sample trace ID for each data point. In Grafana, when you see P99 latency suddenly spike, clicking the exemplar jumps straight to the specific trace that caused that spike. This is a killer feature of running a unified LGTM stack.
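In Grafana, this link is configured on the Prometheus/Mimir datasource: exemplarTraceIdDestinations tells Grafana which exemplar label holds the trace ID and which datasource should open it. A provisioning sketch (the file path, datasource UID tempo, and the Mimir URL are assumptions consistent with the rest of this article):

```yaml
# provisioning/datasources/mimir.yaml
apiVersion: 1
datasources:
  - name: Mimir
    type: prometheus
    url: http://mimir:9009/prometheus   # Mimir's Prometheus-compatible API
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id        # exemplar label that carries the trace ID
          datasourceUid: tempo  # Tempo datasource to jump into
```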
7. Grafana — Dashboards, Alerting, and Correlation
Grafana is the visualization layer that stitches everything together. Version 12.x brings many important improvements:
Grafana 12 highlights
Git Sync — manage dashboards as code, version-controlled via Git. Explore Logs — auto-detects patterns in logs, no query writing required. Traces to Profiles — from a slow span, drill down directly into flame graphs to see which functions consume CPU. Adaptive dashboards — layouts adjust automatically based on data.
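Beyond Git Sync, dashboards-as-code also works with the grafana Terraform provider; a minimal sketch (the auth variable and the JSON file path are hypothetical):

```hcl
terraform {
  required_providers {
    grafana = { source = "grafana/grafana" }
  }
}

provider "grafana" {
  url  = "http://localhost:3000"
  auth = var.grafana_service_account_token  # hypothetical variable
}

# The dashboard JSON lives in the repo and goes through PR review
resource "grafana_dashboard" "api_overview" {
  config_json = file("${path.module}/dashboards/api-overview.json")
}
```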
Correlation — The Power of a Unified Stack
The biggest advantage of the LGTM stack is the ability to correlate the three signals:
graph LR
M["📊 Metrics
CPU spike at 14:05"] -->|exemplar trace_id| T["🔍 Traces
3.2s slow span in payment-service"]
T -->|trace_id in log| L["📝 Logs
TimeoutException connecting to DB"]
L -->|label match| M
style M fill:#e94560,stroke:#fff,color:#fff
style T fill:#2c3e50,stroke:#fff,color:#fff
style L fill:#4CAF50,stroke:#fff,color:#fff
A typical incident-debug workflow:
- Alert fires "P99 latency > 2s" on a Grafana dashboard
- Click the metric panel → view the exemplar trace ID
- Open the trace in Tempo → see a db.query span taking 2.8s
- Click the trace_id → Loki shows: Connection pool exhausted, waiting 2.5s
- Root cause: the connection pool is too small → increase MaxPoolSize → deploy the fix
8. Deploying the LGTM Stack with Docker Compose
Below is a production-ready Docker Compose configuration for a medium-sized system (10-50 services, ~100GB logs/month):
version: "3.8"

services:
  # --- Grafana ---
  grafana:
    image: grafana/grafana:12.4.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your-secure-password
      - GF_FEATURE_TOGGLES_ENABLE=traceToMetrics,traceToLogs
    volumes:
      - grafana-data:/var/lib/grafana
      - ./provisioning:/etc/grafana/provisioning
    depends_on: [loki, mimir, tempo]

  # --- Loki (Log Storage) ---
  loki:
    image: grafana/loki:3.4.0
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/config.yaml
    volumes:
      - ./config/loki.yaml:/etc/loki/config.yaml
      - loki-data:/loki

  # --- Mimir (Metrics Storage) ---
  mimir:
    image: grafana/mimir:2.15.0
    ports:
      - "9009:9009"
    command: -config.file=/etc/mimir/config.yaml
    volumes:
      - ./config/mimir.yaml:/etc/mimir/config.yaml
      - mimir-data:/data

  # --- Tempo (Trace Storage) ---
  tempo:
    image: grafana/tempo:2.7.0
    ports:
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
      - "3200:3200" # Tempo query
    command: -config.file=/etc/tempo/config.yaml
    volumes:
      - ./config/tempo.yaml:/etc/tempo/config.yaml
      - tempo-data:/var/tempo

  # --- Alloy (Collector) ---
  alloy:
    image: grafana/alloy:1.6.0
    ports:
      - "12345:12345" # Alloy UI
      - "4327:4317"   # OTLP gRPC (in-network apps use alloy:4317; host apps use 4327)
      - "4328:4318"   # OTLP HTTP
    volumes:
      - ./config/alloy.river:/etc/alloy/config.river
    command: run /etc/alloy/config.river --server.http.listen-addr=0.0.0.0:12345

volumes:
  grafana-data:
  loki-data:
  mimir-data:
  tempo-data:
Production note
The config above fits a single-node or staging setup. For large production traffic (>1TB logs/month), run Loki and Mimir in microservices mode — split distributor, ingester, and querier into separate containers, and use object storage (self-hosted MinIO or S3) instead of local disks.
9. Integrating with an ASP.NET Core Application
Sending telemetry from a .NET app to the LGTM Stack takes only 2 steps: install NuGet packages and configure the exporter.
Step 1: Install the packages
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Instrumentation.Http
dotnet add package OpenTelemetry.Instrumentation.SqlClient
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol
Step 2: Configure Program.cs
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource
        .AddService("order-api", serviceVersion: "1.0.0"))
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddSqlClientInstrumentation(o => o.SetDbStatementForText = true)
        .AddOtlpExporter(o =>
        {
            o.Endpoint = new Uri("http://alloy:4317");
            o.Protocol = OpenTelemetry.Exporter.OtlpExportProtocol.Grpc;
        }))
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()
        .AddOtlpExporter(o =>
        {
            o.Endpoint = new Uri("http://alloy:4317");
            o.Protocol = OpenTelemetry.Exporter.OtlpExportProtocol.Grpc;
        }))
    .WithLogging(logging => logging
        .AddOtlpExporter(o =>
        {
            o.Endpoint = new Uri("http://alloy:4317");
            o.Protocol = OpenTelemetry.Exporter.OtlpExportProtocol.Grpc;
        }));

var app = builder.Build();
app.Run();
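Instead of hard-coding the endpoint in each exporter, the OTLP exporters also honor the standard OpenTelemetry environment variables, which keeps Program.cs free of environment-specific URLs (the values below match the Alloy service name used in this article):

```shell
# Standard OTel env vars picked up by the .NET SDK at startup
export OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_SERVICE_NAME=order-api
```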
Grafana OpenTelemetry Distribution for .NET
Grafana provides the Grafana.OpenTelemetry package — a distribution that bundles common instrumentations and defaults tuned for the LGTM stack. Chaining a single .UseGrafana() call onto builder.Services.AddOpenTelemetry() is enough — much less configuration than the manual setup above.
10. Alerting — From Observation to Action
Observability has no value if nobody gets notified when incidents happen. Grafana Alerting supports:
- Unified alerting — alert rules for metrics (PromQL), logs (LogQL), and traces
- Multi-channel — Slack, Discord, Telegram, PagerDuty, email, webhook
- Silences & Mute timings — disable alerts during maintenance windows
- Alert grouping — bundle 100 alerts of the same kind into one notification
Example alert rule for error rate:
# Alert when the error rate exceeds 5% for 5 minutes
# (Loki ruler rule file; the group name is arbitrary)
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate({app="api"} |= "error" [5m])) by (app)
            /
          sum(rate({app="api"} [5m])) by (app)
            > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Abnormally high error rate for {{ $labels.app }}"
          description: "Error rate is at {{ $value | humanizePercentage }}"
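Routing those alerts to a channel can also be provisioned as files. A contact-point sketch for Slack (the file path, contact point name, and webhook URL are placeholders):

```yaml
# provisioning/alerting/contact-points.yaml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: oncall-slack
    receivers:
      - uid: slack-oncall
        type: slack
        settings:
          url: https://hooks.slack.com/services/XXX/YYY/ZZZ  # webhook placeholder
```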
11. Real Sizing and Cost
One of the main reasons to choose the LGTM Stack is cost. Compared to SaaS:
| Scale | LGTM Self-hosted | Datadog (estimate) |
|---|---|---|
| 10 services, 50GB logs/month | 1 VM (4 CPU / 16GB RAM), ~$40-80/month | ~$200-500/month |
| 50 services, 500GB logs/month | 3 VMs or a K8s cluster, ~$200-400/month | ~$2,000-5,000/month |
| 200 services, 2TB logs/month | K8s cluster + S3, ~$500-1,000/month | ~$10,000+/month |
Trade-off to consider
Self-hosted saves money but costs operational time. If your team only has 1-2 DevOps, start with the Grafana Cloud Free tier (10K metrics, 50GB logs, 50GB traces free) and migrate to self-hosted once you outgrow it. Grafana Cloud runs the same LGTM stack, so migration is essentially endpoint swaps.
12. Production Best Practices
- Configure retention_period and the compactor in Loki to automatically expire old data and move it across storage tiers.
- Use the grafana/grafana Terraform provider to manage dashboards via version control. Nobody edits production dashboards by hand in the UI — every change goes through PR review.

Conclusion
The Grafana LGTM Stack — Loki, Grafana, Tempo, Mimir plus the Alloy collector — delivers a complete, free, vendor-lock-in-free observability platform. With correlation across logs, metrics, and traces in a single interface, your team can cut incident debugging from hours to minutes.
If you're running CloudWatch + Kibana + Jaeger separately, or paying thousands of dollars a month for Datadog, now is the time to consider moving to LGTM Stack — start with the Grafana Cloud Free tier to experiment, then self-host once you're comfortable.
References:
Grafana Loki Documentation ·
Grafana Mimir Documentation ·
Grafana Tempo Documentation ·
Grafana Alloy Documentation ·
Grafana 12 What's New ·
Instrument .NET with OpenTelemetry — Grafana
Disclaimer: The opinions expressed in this blog are solely my own and do not reflect the views or opinions of my employer or any affiliated organizations. The content provided is for informational and educational purposes only and should not be taken as professional advice. While I strive to provide accurate and up-to-date information, I make no warranties or guarantees about the completeness, reliability, or accuracy of the content. Readers are encouraged to verify the information and seek independent advice as needed. I disclaim any liability for decisions or actions taken based on the content of this blog.