Building a Modern Observability Stack with Cribl, Loki, and Grafana
A hands-on guide to standing up a cost-effective log pipeline and visualization stack — and how it fits into the broader enterprise security and observability landscape.
Introduction
Enterprise observability is at an inflection point. Traditional approaches — shipping everything to a monolithic SIEM and paying per gigabyte — are becoming unsustainable as log volumes grow exponentially. At the same time, the open-source observability ecosystem has matured to the point where a small team can stand up a production-quality pipeline in a weekend.
This post walks through building a modern observability stack using Cribl, Grafana Loki, and Grafana — a combination that delivers powerful log visibility at a fraction of the cost of legacy tooling. Along the way, we’ll compare this approach to traditional alternatives and explore where each fits in a mature enterprise environment.
The Stack: Cribl + LGTM
The core stack covered in this project consists of three components working together:
Cribl Stream / Cribl.Cloud acts as the data pipeline layer. Its job is to receive log data from any source, parse and enrich it, apply routing logic, and forward it to one or more destinations. Think of it as a smart router for your observability data — one that can filter noise, redact sensitive fields, convert formats, and split traffic between a cheap log store and an expensive SIEM based on rules you define.
Grafana Loki is a log aggregation system designed to be cost-efficient. Unlike Elasticsearch or Splunk, which index every field in every log at ingest time, Loki only indexes labels — small pieces of metadata attached to log streams. The raw log content is compressed and stored cheaply. This design makes Loki dramatically less expensive to operate at scale, with the trade-off that full-text searches across log bodies are slower than in a fully-indexed system.
Grafana is the visualization layer. It connects to Loki (and Prometheus for metrics) and renders dashboards, supports alerting, and provides an exploration interface for ad-hoc queries. Grafana is widely considered best-in-class for observability dashboards and has strong support for combining multiple data sources in a single view.
Together, the architecture looks like this:
Log Sources (endpoints, network, cloud, applications)
↓
Cribl Stream / Cribl.Cloud
(route, filter, enrich, label)
↓
Loki
(compressed log storage, label-indexed)
↓
Grafana
(dashboards, alerting, exploration)

What We Built
Starting from scratch, the lab environment consisted of:
- A Docker Compose stack running Loki and Grafana locally on macOS
- A Cribl.Cloud deployment routing Datagen syslog events through an ngrok tunnel to local Loki
- Dynamic Loki labels configured in Cribl to expose `host` and `severityName` as queryable dimensions
- A Grafana dashboard with four panels: live log stream, log volume over time, top hosts by volume, and severity breakdown
- Discussion of adding Prometheus and Node Exporter to extend the stack into infrastructure metrics
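As a rough sketch, the local half of that lab can be expressed as a minimal Compose file. The image tags, ports, and config path below are the upstream defaults rather than values taken from the lab itself:

```yaml
# Minimal local stack: Loki for log storage, Grafana for visualization.
# Assumes upstream default images, ports, and bundled Loki config.
services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"        # Loki HTTP API (push and query)
    command: -config.file=/etc/loki/local-config.yaml
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"        # Grafana UI
    depends_on:
      - loki
```

With this running, Grafana at localhost:3000 can be pointed at http://loki:3100 as a Loki data source, and Cribl.Cloud reaches Loki through the tunnel.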
The key configuration insight was understanding that Loki’s label system is set at ingest time, not query time. Cribl’s native Loki destination makes this straightforward — labels can be defined as dynamic expressions like `${host}` that evaluate per-event before forwarding, turning any parsed field into a queryable dimension without changing the storage schema.
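Once attached at ingest, those labels become first-class selectors in LogQL. For example (the label names follow the lab config; the host value is hypothetical):

```logql
# All error-severity logs from a single host — resolved via the label index
{host="web-01", severityName="error"}

# Log volume per severity in 5-minute windows, for a breakdown panel
sum by (severityName) (count_over_time({severityName=~".+"}[5m]))
```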
Logs vs. Metrics: Understanding the Difference
One of the more nuanced aspects of modern observability is understanding when to store raw logs versus when to convert them to metrics. The distinction matters for both cost and query performance.
Logs are discrete, timestamped event records — the full text of what happened. They carry complete context: usernames, IP addresses, request payloads, stack traces. They are expensive to store at scale and slow to query via full-text search, but they are irreplaceable for forensics, debugging, and audit trails.
Metrics are numeric measurements over time — counters, gauges, histograms. They carry no individual event context, only aggregated values. They are cheap to store, fast to query, and ideal for alerting and long-term trending.
The practical implication: a high-volume stream of HTTP 200 access logs probably doesn’t need to be stored in full. What you actually need is a metric — `http_requests_total{status="200"}` incrementing in Prometheus — while the raw logs are sampled or dropped. Cribl is particularly well-suited for this log-to-metrics conversion pattern, extracting numeric values from log fields and forwarding them to a metrics store while routing the raw events to cheaper storage.
The decision framework is roughly:
| Signal type | Store as logs | Convert to metrics |
|---|---|---|
| Security and audit events | Yes — full detail required | No |
| Error and exception detail | Yes — stack traces needed | Rate metric as supplement |
| High-volume access logs | Sample only | Yes — counters and latency |
| SLA and uptime measurements | No | Yes — long retention needed |
| Alerting triggers | No | Yes — fast query required |
Comparing the Approaches
Grafana + Loki vs. Splunk
Splunk is the incumbent in enterprise log management and SIEM. It offers full-text indexing of every field at ingest, a powerful query language (SPL), a rich ecosystem of security content (Splunk ES, ESCU detection rules), built-in case management, and compliance reporting frameworks.
The trade-off is cost. Splunk licensing at scale is among the most expensive line items in an enterprise security budget. Every gigabyte ingested adds to the bill, which creates pressure to filter aggressively before data reaches the platform — often at the cost of visibility.
Grafana + Loki addresses the cost problem directly. Loki’s label-only indexing means storage is cheap enough to retain far more data for far longer. The trade-off is that you lose Splunk’s arbitrary full-text search capability — if you didn’t define something as a label at ingest time, finding it later requires a slower grep-style scan.
For security operations specifically, Grafana + Loki is not a Splunk replacement. It lacks threat correlation rules, a detection engine, case management, and compliance reporting. For infrastructure observability, application monitoring, and operational dashboards, it is genuinely excellent and dramatically cheaper.
The Mature Enterprise Answer: Both
In a well-architected enterprise stack, these tools are not competing — they serve different tiers of the same pipeline:
All log sources
↓
Cribl (route, filter, enrich)
        /              \
  Loki/Grafana    Splunk/Sentinel
 (observability)   (security ops)

Cribl routes security-relevant events (authentication, privilege changes, network anomalies) to the SIEM, where they can be correlated and investigated. Everything else — application logs, access logs, infrastructure events — goes to Loki at a fraction of the cost. Grafana provides the unified visualization layer on top of both Loki and Prometheus, giving operators a single pane for infrastructure health without touching expensive SIEM storage.
This architecture pattern — often called a tiered observability pipeline — is where most mature organizations are heading. It optimizes cost without sacrificing security coverage, and Cribl is the linchpin that makes selective routing practical.
Grafana + Loki vs. Elastic Stack
The Elastic Stack (Elasticsearch + Kibana) occupies a middle ground between Loki and Splunk. Elasticsearch fully indexes log content, giving it Splunk-like search flexibility. Kibana’s dashboards are comparable to Grafana’s. Elastic Security adds SIEM-like detection capabilities.
Compared to Loki, Elasticsearch is more powerful for ad-hoc search but significantly more resource-intensive and operationally complex. Running Elasticsearch at scale requires careful cluster management, shard tuning, and index lifecycle policies. Loki’s simpler architecture is easier to operate and cheaper to run, at the cost of search flexibility.
For teams without a dedicated platform engineering function, Loki’s operational simplicity is a meaningful advantage. For teams that need rich full-text search without Splunk’s licensing costs, Elastic is a compelling alternative.
Where This Stack Fits in an Enterprise Environment
A mature enterprise observability and security stack has five distinct layers:
- Data sources — endpoints, network devices, cloud platforms, applications generating logs and metrics
- Pipeline — Cribl routing, filtering, enriching, and shaping data before storage
- Storage — Loki/Prometheus for observability data; SIEM (Splunk, Sentinel, or Elastic Security) for security operations
- Analytics — Grafana for operational dashboards; SIEM analytics for threat detection and compliance
- Operations — alerting platforms (PagerDuty), SOC workflows, SOAR for automated response, ITSM for ticketing
The stack built in this project covers layers 2, 3 (observability side), and 4 (Grafana). Adding Prometheus rounds out the metrics side of storage. The natural next step for a more complete security stack would be standing up Elastic Security or a lightweight SIEM to experience the detection and investigation side.
Key Takeaways
Cribl is the foundation. Whether data is going to Loki, Splunk, Elastic, or a cloud data lake, having a flexible pipeline layer in front of storage is what makes tiered architectures practical. Without it, every source connects directly to every destination, and changing anything becomes a large project.
Loki’s label model requires intentional design. The decision of which fields to expose as labels should be made at pipeline design time, not after the fact. Working through this in a lab environment — discovering that severityName wasn’t queryable until it was added as a Cribl label — is exactly the kind of lesson that prevents production surprises.
Metrics and logs solve different problems. Understanding when to convert logs to metrics — and building pipelines that do so — is one of the highest-leverage cost optimizations available in a high-volume logging environment.
Open source observability has matured. The combination of Cribl, Loki, Prometheus, and Grafana delivers capabilities that would have required expensive enterprise tooling a few years ago. For infrastructure observability, it is now a serious production-grade option.
What’s Next
From here, the natural extensions to this lab are:
- Adding Prometheus and Node Exporter to bring infrastructure metrics alongside logs in Grafana
- Configuring Grafana alerting to fire on log patterns and metric thresholds
- Hardening the stack with persistent storage, proper authentication, and removing the ngrok dependency in favor of a stable endpoint
- Routing real log sources (rather than Datagen) through the pipeline to experience production-like data volumes and variety
- Standing up a lightweight Elastic Security instance to experience the SIEM side of the architecture
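For the alerting item above, a Grafana alert rule over Loki data starts from a LogQL metric query — for example, firing when error-severity events exceed a rate threshold (label name per the lab config; the threshold is an arbitrary illustration):

```logql
# Error events per second over the last 5 minutes;
# wire this into a Grafana alert rule with a condition such as "> 1"
sum(rate({severityName="error"}[5m]))
```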
The lab described here is a working foundation for any of these directions. More importantly, it’s a concrete representation of how modern observability pipelines are actually built — and a useful reference point for conversations with clients evaluating their logging architecture.
Built with Cribl.Cloud, Grafana Loki, Grafana OSS, Docker, and ngrok on macOS.
