OpenSearch for SREs: The Open-Source Observability Powerhouse on Kubernetes

The Rise of the Open-Source Alternative: OpenSearch for the Modern SRE

In the dynamic world of cloud-native infrastructure, robust observability is not just a nice-to-have; it’s a foundational requirement for any Site Reliability Engineer (SRE). For years, Elasticsearch dominated the landscape for log aggregation and search. However, in early 2021 Elastic moved Elasticsearch and Kibana off the Apache 2.0 license, prompting AWS to fork the last Apache 2.0-licensed version and launch OpenSearch.

What is OpenSearch? At its core, OpenSearch is a distributed, RESTful search and analytics engine built on Apache Lucene. It’s designed for high-volume data ingestion and rapid querying across massive datasets. Think of it as a specialized database for semi-structured data like logs, metrics, and traces.

Why is its open-source nature important for SREs? The “open” in OpenSearch isn’t just a marketing buzzword; it’s critical for SREs:

  • No Vendor Lock-in: You’re free from proprietary licenses and sudden feature restrictions. This gives you long-term stability and predictability in your tech stack.
  • Community-Driven Development: Bugs are squashed faster, features are added based on broad community needs, and you can even contribute fixes yourself.
  • Auditability & Transparency: You can inspect the source code, understand exactly how your data is being handled, and ensure there are no hidden surprises—crucial for security-conscious environments.
  • Cost Predictability: While managed services exist, you always have the option to self-host without escalating license costs as your data grows.

For an SRE, OpenSearch provides the backbone for what’s often called the “OSD Stack” (OpenSearch, OpenSearch Dashboards, and Data Prepper/Fluent Bit), empowering proactive monitoring, rapid incident response, and deep operational insights.

Core Concepts for the SRE Toolkit

To wield OpenSearch effectively, you need to understand its fundamental building blocks:

  1. Cluster & Nodes:
    • An OpenSearch Cluster is a group of interconnected servers (nodes) that work together to store and search your data.
    • Nodes specialize: some are cluster_manager (formerly “master”) nodes (brains for cluster state), others are data nodes (muscles for storing and searching data), and some can be ingest nodes (for pre-processing data). SREs design clusters with dedicated nodes for stability and scale.
  2. Indices & Documents:
    • An Index is like a logical database table, a collection of related JSON documents. For logs, you typically create time-based indices (e.g., application-logs-2026.02.08).
    • A Document is a single JSON record within an index—e.g., one log line, one metric data point, one trace span.
  3. Shards & Replicas:
    • To handle massive data, OpenSearch horizontally scales using Shards. An index is split into primary shards, distributed across data nodes.
    • Replicas are copies of primary shards. They provide high availability (if a node fails, a replica becomes primary) and scale read operations. SREs fine-tune shard/replica counts for performance and resilience (see the example after this list).
  4. OpenSearch Dashboards:
    • The browser-based UI for visualizing, analyzing, and managing your OpenSearch data. It’s where you build dashboards, run queries, and configure alerts.
  5. Index State Management (ISM):
    • An automated policy engine. SREs use ISM to define lifecycle rules for indices, like moving old data to cheaper storage, taking snapshots, or deleting it after a set period. This prevents disk exhaustion and controls costs.
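
To make these concepts concrete, here is a minimal sketch of creating an index with explicit shard and replica counts and then checking cluster health. It assumes an OpenSearch endpoint reachable at https://localhost:9200 with the admin credentials we’ll configure in the lab below; the index name and the shard/replica values are purely illustrative:

# Create an index with 3 primary shards and 1 replica (illustrative values)
curl -k -u 'admin:YourStrongPassword123!' -X PUT "https://localhost:9200/application-logs-2026.02.08" \
  -H "Content-Type: application/json" \
  -d '{ "settings": { "number_of_shards": 3, "number_of_replicas": 1 } }'

# Check overall cluster health (green/yellow/red) and shard allocation
curl -k -u 'admin:YourStrongPassword123!' "https://localhost:9200/_cluster/health?pretty"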

Getting Hands-On: Your Production-Lite OpenSearch Lab on Kubernetes

The best way to understand OpenSearch is to run it. We’ll set up a mini-stack on Minikube (or any Kubernetes cluster) that mirrors a production setup for logs:

  • OpenSearch Cluster: The data store.
  • OpenSearch Dashboards: The UI.
  • Nginx: A sample application generating logs.
  • Fluent Bit: The lightweight log shipper.

All configurations will be declarative, using Helm charts and Kubernetes Jobs, making it fully automated and version-controllable—true SRE style.

Prerequisites:

  • Kubernetes cluster (Minikube recommended for local dev)
  • kubectl installed and configured
  • helm installed

1. Initialize Minikube

Ensure your Minikube has enough resources for OpenSearch:

minikube start --cpus 4 --memory 8192 --driver docker

2. Prepare Helm Repositories

helm repo add opensearch https://opensearch-project.github.io/helm-charts/
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update

3. Deploy OpenSearch & Dashboards (with a Password)

We’ll use a values.yaml file to set up a single-node OpenSearch cluster with a secure admin password and OpenSearch Dashboards.

Save the following as opensearch-values.yaml:

# opensearch-values.yaml
singleNode: true
persistence:
  enabled: false # For a lab, we'll keep it stateless. Set to true for production with PVs.
extraEnvs:
  - name: OPENSEARCH_INITIAL_ADMIN_PASSWORD
    value: YourStrongPassword123! # <<< CHANGE THIS TO A STRONG PASSWORD

Now deploy:

helm install my-os opensearch/opensearch -f opensearch-values.yaml
helm install my-dashboards opensearch/opensearch-dashboards

Note: OpenSearch will take a few minutes to start as it initializes its JVM and security plugin. Keep an eye on kubectl get pods -w.
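
Once the pods report Running and Ready, you can optionally confirm the cluster is reachable before moving on. A quick sanity check, assuming the service name my-os-opensearch used throughout this guide and the password you set above:

# Forward the OpenSearch API port locally (run in a separate terminal or background it)
kubectl port-forward svc/my-os-opensearch 9200:9200 &

# -k skips TLS verification because the demo security config uses self-signed certificates
curl -k -u 'admin:YourStrongPassword123!' "https://localhost:9200/_cluster/health?pretty"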

4. Deploy Sample Nginx App

This is our log source:

kubectl create deployment nginx-server --image=nginx
kubectl expose deployment nginx-server --port=80

5. Deploy Fluent Bit (The Log Shipper)

Fluent Bit will scrape Nginx logs and send them to OpenSearch. We’ll use a fluent-bit-values.yaml to handle the configuration declaratively and resolve common SRE headaches (DNS, auth, mapping issues).

Save the following as fluent-bit-values.yaml:

# fluent-bit-values.yaml
config:
  service: |
    [SERVICE]
        Daemon          Off
        Flush           1
        Log_Level       info
        Parsers_File    parsers.conf
        HTTP_Server     On
        HTTP_Listen     0.0.0.0
        HTTP_Port       2020
        Health_Check    On

  inputs: |
    [INPUT]
        Name           tail
        Path           /var/log/containers/*.log
        multiline.parser docker, cri
        Tag            kube.*
        Mem_Buf_Limit  5MB
        Skip_Long_Lines On

  filters: |
    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        Merge_Log_Key       log_processed
        Keep_Log            Off
        # These help prevent mapping conflicts from inconsistent pod labels
        Labels              Off
        Annotations         Off

  outputs: |
    [OUTPUT]
        Name            es
        Match           *
        Host            my-os-opensearch # This is the Kubernetes Service name for OpenSearch
        Port            9200
        HTTP_User       admin
        HTTP_Passwd     YourStrongPassword123! # <<< USE THE SAME PASSWORD AS ABOVE
        Logstash_Format On
        Logstash_Prefix nginx-logs
        tls             On
        tls.verify      Off
        Suppress_Type_Name On # Crucial for OpenSearch 2.x
        Trace_Error        On # For better debugging in Fluent Bit logs

Deploy Fluent Bit:

helm install fluent-bit fluent/fluent-bit -f fluent-bit-values.yaml
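
Once the pods are up, it’s worth confirming that Fluent Bit itself is healthy before looking at OpenSearch. A quick check, assuming the chart created a DaemonSet named after the release (fluent-bit):

# Look for connection or output errors in the shipper's own logs
kubectl logs daemonset/fluent-bit --tail=20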

The Log Shipper: Why Fluent Bit?

In a Kubernetes ecosystem, logs are ephemeral—when a pod dies, its logs go with it. To prevent this, we need a DaemonSet that acts as a “log vacuum.” We chose Fluent Bit because it’s the lightweight, high-performance cousin of Fluentd. It has a tiny memory footprint (crucial when you’re running it on every node in a cluster) and handles the “Log Pipeline” in three distinct stages:

  • Input (Tail): It watches the raw .log files created by the Kubernetes container engine on the node’s disk.
  • Filters (Kubernetes): This is the “SRE secret sauce.” Fluent Bit talks to the Kubernetes API to enrich your logs with metadata like the Pod Name, Namespace, and Labels. We’ve also added logic here to “clean” the logs to prevent mapping conflicts.
  • Output (OpenSearch): It batches the logs and ships them securely via HTTPS to our OpenSearch cluster.

The Anatomy of a Fluent Bit Pipeline

When you look at a Fluent Bit configuration, it’s organized into distinct sections. Each one has a specific job in the “Ingestion Lifecycle.” Understanding these is the key to onboarding any new service into OpenSearch.

  • [SERVICE] (The Global Brain): This section defines the engine’s behavior. It controls how often data is “flushed” (sent) to the destination, where the internal logs go, and whether to enable a health-check server. For SREs, this is where we tune performance and monitoring for the log shipper itself.
  • [INPUT] (The Collector): This is the “vacuum cleaner.” It tells Fluent Bit where to get data. In Kubernetes, we usually use the tail input to follow the log files generated by the container runtime (CRI/Docker). You can have multiple inputs—one for system logs, one for app logs, and even one for metrics.
  • [FILTER] (The Processor): This is where the magic happens. Filters allow you to modify data in flight.
    • The Kubernetes Filter is the most popular; it reaches out to the K8s API to tag your logs with pod names and namespaces.
    • You can also use filters to drop sensitive data (PII), parse strings into JSON, or “flatten” complex labels to avoid the mapping conflicts mentioned earlier (a small example follows this list).
  • [OUTPUT] (The Destination): This defines the “Exit” for your data. In our case, it’s the es (Elasticsearch/OpenSearch) plugin. This section handles the connection details, authentication, and index naming conventions.
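
As an illustration of that last point, here is a small, optional filter you could append to the filters: block in fluent-bit-values.yaml (after the kubernetes filter, so the metadata already exists) to drop everything coming from the kube-system namespace before it is shipped. The grep filter and the record-accessor syntax are standard Fluent Bit features; the namespace choice is just an example:

    [FILTER]
        Name     grep
        Match    kube.*
        # Drop any record whose Kubernetes namespace is kube-system
        Exclude  $kubernetes['namespace_name'] kube-system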

6. Automate ISM Policy & Index Pattern with a Kubernetes Job

This is where the SRE magic happens. We’ll use a Job to apply our ISM policy (7-day retention) and create the nginx-logs-* index pattern in Dashboards, all via API calls.

Save the following as observability-setup-job.yaml:

# observability-setup-job.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: opensearch-sre-setup-script
data:
  setup.sh: |
    #!/bin/sh
    # POSIX sh, not bash: the curl image running this Job does not ship bash
    set -eu

    OPENSEARCH_USER="admin"
    OPENSEARCH_PASSWORD="YourStrongPassword123!" # <<< USE THE SAME PASSWORD
    DASHBOARDS_URL="http://my-dashboards-opensearch-dashboards.default.svc.cluster.local:5601"
    OPENSEARCH_URL="https://my-os-opensearch.default.svc.cluster.local:9200"

    echo "Waiting for OpenSearch Dashboards to be available..."
    until curl -s -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" -k "$DASHBOARDS_URL/api/status" | grep -q "available"; do
      sleep 5
    done
    echo "OpenSearch Dashboards is available."

    echo "Creating ISM policy: nginx_retention..."
    curl -X PUT "$OPENSEARCH_URL/_plugins/_ism/policies/nginx_retention" \
      -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" -k -H "Content-Type: application/json" \
      -d '{
            "policy": {
              "description": "Delete logs after 7 days",
              "default_state": "hot",
              "states": [
                {
                  "name": "hot",
                  "actions": [],
                  "transitions": [
                    {
                      "state_name": "delete",
                      "conditions": { "min_index_age": "7d" }
                    }
                  ]
                },
                {
                  "name": "delete",
                  "actions": [ { "delete": {} } ],
                  "transitions": []
                }
              ],
              "ism_template": [
                {
                  "index_patterns": ["nginx-logs-*"],
                  "priority": 100
                }
              ]
            }
          }'
    echo "ISM policy created."

    echo "Creating OpenSearch Dashboards Index Pattern: nginx-logs-pattern..."
    curl -X POST "$DASHBOARDS_URL/api/saved_objects/index-pattern/nginx-logs-pattern" \
      -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" -H "osd-xsrf: true" -H "Content-Type: application/json" \
      -d '{
            "attributes": {
              "title": "nginx-logs-*",
              "timeFieldName": "@timestamp"
            }
          }'
    echo "Index pattern created."

---
apiVersion: batch/v1
kind: Job
metadata:
  name: opensearch-observability-setup
spec:
  template:
    spec:
      containers:
      - name: setup-runner
        image: curlimages/curl:latest # A lightweight image with curl and a POSIX shell (no bash)
        command: ["sh", "/scripts/setup.sh"]
        volumeMounts:
        - name: setup-script
          mountPath: /scripts
      volumes:
      - name: setup-script
        configMap:
          name: opensearch-sre-setup-script
          defaultMode: 0744 # Make the script executable
      restartPolicy: OnFailure

Apply the setup Job:

kubectl apply -f observability-setup-job.yaml
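
The Job should complete within a minute or two once Dashboards is reachable. If it keeps restarting (wrong password, Dashboards not ready yet, a typo in the service DNS names), the Job’s logs are the fastest way to find out why:

kubectl get jobs
kubectl logs job/opensearch-observability-setup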

Understanding the Automation: ISM & Index Patterns

Before we fire off the job, let’s talk about what’s happening under the hood. We aren’t just pushing config; we are defining the lifecycle and visibility of our data.

  • ISM (Index State Management): In a production cluster, logs are a “growing fire.” If you don’t manage them, they will eventually eat your disk and crash your nodes. By defining an ISM Policy, we automate the “Hot-to-Delete” lifecycle. In this setup, we’re telling OpenSearch: “Keep these logs fresh for 7 days, then delete them automatically.” No manual cleanup, no 3 AM disk-space alerts. (You can confirm the policy is attached with the check shown after this list.)
  • Index Patterns: While the Index is where the data lives, the Index Pattern is the “Lens” used by OpenSearch Dashboards to see it. By creating this via code, we ensure that as soon as the stack is up, your Discover tab is ready to go. We’re essentially “gluing” all those daily nginx-logs-* indices into one continuous timeline so you can query across multiple days without lifting a finger.
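
To confirm the retention policy is actually attached once the Job has run, OpenSearch exposes an ISM “explain” API. A quick check over the same port-forward and credentials as before; note that the ism_template attaches the policy at index-creation time, so indices created from this point on will pick it up automatically:

curl -k -u 'admin:YourStrongPassword123!' "https://localhost:9200/_plugins/_ism/explain/nginx-logs-*?pretty"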

7. Generate Some Nginx Traffic

To see logs, your Nginx server needs visitors:

kubectl run load-gen --image=busybox --restart=Never -- /bin/sh -c "while true; do wget -qO- http://nginx-server; sleep 2; done"
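
Give the load generator a minute, then (optionally) confirm documents are landing before you open the UI, using the same port-forward and credentials as earlier:

# List the daily indices along with their document counts
curl -k -u 'admin:YourStrongPassword123!' "https://localhost:9200/_cat/indices/nginx-logs-*?v"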

8. Access OpenSearch Dashboards

Port-forward Dashboards to your local machine:

kubectl port-forward svc/my-dashboards-opensearch-dashboards 5601:5601

Then, open http://localhost:5601 in your browser. Log in with admin and your chosen password. Go to the “Discover” tab, select nginx-logs-* in the dropdown, set your time range (e.g., “Last 15 minutes”), and watch your logs flow in!
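
To narrow Discover down to just the Nginx pods, you can filter on the metadata the Fluent Bit kubernetes filter attached to every record. A hypothetical query for the Discover search bar (the kubernetes.pod_name field name assumes the filter’s default output):

kubernetes.pod_name: nginx-server*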

Conclusion

By following this automated approach, you’ve built a robust, observable, and easily reproducible log aggregation stack with OpenSearch. You’ve tackled critical SRE challenges like security, data ingestion, and lifecycle management—all as code. This hands-on experience forms a solid foundation for further exploration into metrics, traces, and advanced alerting in your cloud-native environments.

What other OpenSearch challenges will you automate next?

You can find the complete code in my repository.
