The Rise of the Open-Source Alternative: OpenSearch for the Modern SRE
In the dynamic world of cloud-native infrastructure, robust observability is not just a nice-to-have; it’s a foundational requirement for any Site Reliability Engineer (SRE). For years, Elasticsearch dominated the landscape for log aggregation and search. However, a significant licensing change by Elastic in early 2021 created a void, prompting AWS to fork the last Apache 2.0 licensed version and launch OpenSearch.
What is OpenSearch? At its core, OpenSearch is a distributed, RESTful search and analytics engine built on Apache Lucene. It’s designed for high-volume data ingestion and rapid querying across massive datasets. Think of it as a specialized database for semi-structured data like logs, metrics, and traces.
Why is its open-source nature important for SREs? The “open” in OpenSearch isn’t just a marketing buzzword; it’s critical for SREs:
- No Vendor Lock-in: You’re free from proprietary licenses and sudden feature restrictions. This gives you long-term stability and predictability in your tech stack.
- Community-Driven Development: Bugs are squashed faster, features are added based on broad community needs, and you can even contribute fixes yourself.
- Auditability & Transparency: You can inspect the source code, understand exactly how your data is being handled, and ensure there are no hidden surprises—crucial for security-conscious environments.
- Cost Predictability: While managed services exist, you always have the option to self-host without escalating license costs as your data grows.
For an SRE, OpenSearch provides the backbone for what’s often called the “OSD Stack” (OpenSearch, OpenSearch Dashboards, and Data Prepper/Fluent Bit), empowering proactive monitoring, rapid incident response, and deep operational insights.
Core Concepts for the SRE Toolkit
To wield OpenSearch effectively, you need to understand its fundamental building blocks:
- Cluster & Nodes:
- An OpenSearch Cluster is a group of interconnected servers (nodes) that work together to store and search your data.
- Nodes specialize: some are `cluster_manager` (formerly "master") nodes (the brains for cluster state), others are `data` nodes (the muscles for storing and searching data), and some can be `ingest` nodes (for pre-processing data). SREs design clusters with dedicated nodes for stability and scale.
- Indices & Documents:
- An Index is like a logical database table: a collection of related JSON documents. For logs, you typically create time-based indices (e.g., `application-logs-2026.02.08`).
- A Document is a single JSON record within an index—e.g., one log line, one metric data point, one trace span.
- Shards & Replicas:
- To handle massive data, OpenSearch horizontally scales using Shards. An index is split into primary shards, distributed across data nodes.
- Replicas are copies of primary shards. They provide high availability (if a node fails, a replica becomes primary) and scale read operations. SREs fine-tune shard/replica counts for performance and resilience. A quick hands-on sketch follows this list.
- OpenSearch Dashboards:
- The browser-based UI for visualizing, analyzing, and managing your OpenSearch data. It’s where you build dashboards, run queries, and configure alerts.
- Index State Management (ISM):
- An automated policy engine. SREs use ISM to define lifecycle rules for indices, like moving old data to cheaper storage, taking snapshots, or deleting it after a set period. This prevents disk exhaustion and controls costs.
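To make indices, documents, and shards concrete, here is a minimal sketch of the APIs involved. It assumes the lab cluster built later in this guide, port-forwarded to `localhost:9200`; the `demo-logs` index name is purely illustrative, and `-k` is needed because the security plugin ships a self-signed certificate:

# Create an index with an explicit shard/replica layout
curl -k -u admin:YourStrongPassword123! -X PUT "https://localhost:9200/demo-logs" \
  -H "Content-Type: application/json" \
  -d '{ "settings": { "number_of_shards": 2, "number_of_replicas": 1 } }'

# Index a single JSON document (one "log line")
curl -k -u admin:YourStrongPassword123! -X POST "https://localhost:9200/demo-logs/_doc" \
  -H "Content-Type: application/json" \
  -d '{ "@timestamp": "2026-02-08T12:00:00Z", "level": "info", "message": "hello opensearch" }'

# See how the primary shards and replicas are laid out across nodes
curl -k -u admin:YourStrongPassword123! "https://localhost:9200/_cat/shards/demo-logs?v"

On a single-node lab cluster, the replica shards will show as unassigned; that is expected, since a replica is never placed on the same node as its primary.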
Getting Hands-On: Your Production-Lite OpenSearch Lab on Kubernetes
The best way to understand OpenSearch is to run it. We’ll set up a mini-stack on Minikube (or any Kubernetes cluster) that mirrors a production setup for logs:
- OpenSearch Cluster: The data store.
- OpenSearch Dashboards: The UI.
- Nginx: A sample application generating logs.
- Fluent Bit: The lightweight log shipper.
All configurations will be declarative, using Helm charts and Kubernetes Jobs, making it fully automated and version-controllable—true SRE style.
Prerequisites:
- Kubernetes cluster (Minikube recommended for local dev)
- `kubectl` installed and configured
- `helm` installed
1. Initialize Minikube
Ensure your Minikube has enough resources for OpenSearch:
minikube start --cpus 4 --memory 8192 --driver docker

2. Prepare Helm Repositories
helm repo add opensearch https://opensearch-project.github.io/helm-charts/
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update

3. Deploy OpenSearch & Dashboards (with a Password)
We’ll use a values.yaml file to set up a single-node OpenSearch cluster with a secure admin password and OpenSearch Dashboards.
Save the following as opensearch-values.yaml:
# opensearch-values.yaml
singleNode: true
persistence:
  enabled: false # For a lab, we'll keep it stateless. Set to true for production with PVs.
extraEnvs:
  - name: OPENSEARCH_INITIAL_ADMIN_PASSWORD
    value: YourStrongPassword123! # <<< CHANGE THIS TO A STRONG PASSWORD

Now deploy:
helm install my-os opensearch/opensearch -f opensearch-values.yaml
helm install my-dashboards opensearch/opensearch-dashboards

Note: OpenSearch will take a few minutes to start as it initializes its JVM and security plugin. Keep an eye on `kubectl get pods -w`.
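Before moving on, it's worth a quick sanity check that the cluster actually answers on port 9200. A minimal sketch, assuming the chart exposes a Service named `my-os-opensearch` (the same name the Fluent Bit output and setup Job below rely on); check `kubectl get svc` if yours differs:

# Forward the HTTP API port to your machine (leave this running in a separate terminal)
kubectl port-forward svc/my-os-opensearch 9200:9200 &

# The security plugin serves a self-signed certificate, hence -k; use the admin password from opensearch-values.yaml
curl -k -u admin:YourStrongPassword123! "https://localhost:9200/_cluster/health?pretty"

A `green` or `yellow` status means the cluster is up; `yellow` is normal for a single-node lab, since replica shards have nowhere to go.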
4. Deploy Sample Nginx App
This is our log source:
kubectl create deployment nginx-server --image=nginx
kubectl expose deployment nginx-server --port=80
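Before pointing a log shipper at it, a quick check that the Service answers from inside the cluster doesn't hurt. A small sketch; the `curl-test` pod name is arbitrary and the pod deletes itself when done:

# The Service should have at least one endpoint backing it
kubectl get endpoints nginx-server

# One-off pod that curls the Service from inside the cluster, then removes itself
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -s http://nginx-server

Each request like this also produces an Nginx access-log line that Fluent Bit will pick up in the next step.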
5. Deploy Fluent Bit (The Log Shipper)

Fluent Bit will scrape Nginx logs and send them to OpenSearch. We'll use a `fluent-bit-values.yaml` file to handle the configuration declaratively and resolve common SRE headaches (DNS, auth, mapping issues).
Save the following as fluent-bit-values.yaml:
# fluent-bit-values.yaml
config:
  service: |
    [SERVICE]
        Daemon Off
        Flush 1
        Log_Level info
        Parsers_File parsers.conf
        HTTP_Server On
        HTTP_Listen 0.0.0.0
        HTTP_Port 2020
        Health_Check On

  inputs: |
    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        multiline.parser docker, cri
        Tag kube.*
        Mem_Buf_Limit 5MB
        Skip_Long_Lines On

  filters: |
    [FILTER]
        Name kubernetes
        Match kube.*
        Kube_URL https://kubernetes.default.svc:443
        Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix kube.var.log.containers.
        Merge_Log On
        Merge_Log_Key log_processed
        Keep_Log Off
        # These help prevent mapping conflicts from inconsistent pod labels
        Labels Off
        Annotations Off

  outputs: |
    [OUTPUT]
        Name es
        Match *
        # Host is the Kubernetes Service name for OpenSearch
        Host my-os-opensearch
        Port 9200
        HTTP_User admin
        # <<< USE THE SAME PASSWORD AS ABOVE
        HTTP_Passwd YourStrongPassword123!
        Logstash_Format On
        Logstash_Prefix nginx-logs
        tls On
        tls.verify Off
        # Suppress_Type_Name is crucial for OpenSearch 2.x
        Suppress_Type_Name On
        # Trace_Error gives better debugging output in Fluent Bit's own logs
        Trace_Error On

Deploy Fluent Bit:
helm install fluent-bit fluent/fluent-bit -f fluent-bit-values.yaml

The Log Shipper: Why Fluent Bit?
In a Kubernetes ecosystem, logs are ephemeral—when a pod dies, its logs go with it. To prevent this, we need a DaemonSet that acts as a “log vacuum.” We chose Fluent Bit because it’s the lightweight, high-performance cousin of Fluentd. It has a tiny memory footprint (crucial when you’re running it on every node in a cluster) and handles the “Log Pipeline” in three distinct stages:
- Input (Tail): It watches the raw `.log` files created by the Kubernetes container engine on the node's disk.
- Filters (Kubernetes): This is the "SRE secret sauce." Fluent Bit talks to the Kubernetes API to enrich your logs with metadata like the Pod Name, Namespace, and Labels. We've also added logic here to "clean" the logs to prevent mapping conflicts.
- Output (OpenSearch): It batches the logs and ships them securely via HTTPS to our OpenSearch cluster.
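Because Fluent Bit runs as a DaemonSet, you should see exactly one pod per node once the chart is installed. A quick check, assuming the chart's default release name and `app.kubernetes.io/name=fluent-bit` labels:

# The DaemonSet should report one ready pod per node
kubectl get daemonset fluent-bit
kubectl get pods -l app.kubernetes.io/name=fluent-bit -o wide

# Tail the shipper's own logs to confirm it connected to OpenSearch without errors
kubectl logs -l app.kubernetes.io/name=fluent-bit --tail=20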
The Anatomy of a Fluent Bit Pipeline
When you look at a Fluent Bit configuration, it’s organized into distinct sections. Each one has a specific job in the “Ingestion Lifecycle.” Understanding these is the key to onboarding any new service into OpenSearch.
- `[SERVICE]` (The Global Brain): This section defines the engine's behavior. It controls how often data is "flushed" (sent) to the destination, where the internal logs go, and whether to enable a health-check server. For SREs, this is where we tune performance and monitoring for the log shipper itself (see the quick check right after this list).
- `[INPUT]` (The Collector): This is the "vacuum cleaner." It tells Fluent Bit where to get data. In Kubernetes, we usually use the `tail` input to follow the log files generated by the container runtime (CRI/Docker). You can have multiple inputs—one for system logs, one for app logs, and even one for metrics.
- `[FILTER]` (The Processor): This is where the magic happens. Filters allow you to modify data in flight.
  - The Kubernetes Filter is the most popular; it reaches out to the K8s API to tag your logs with pod names and namespaces.
  - You can also use filters to drop sensitive data (PII), parse strings into JSON, or "flatten" complex labels to avoid the mapping conflicts we saw earlier.
- `[OUTPUT]` (The Destination): This defines the "Exit" for your data. In our case, it's the `es` (Elasticsearch/OpenSearch) plugin. This section handles the connection details, authentication, and index naming conventions.
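Because our `[SERVICE]` section enabled the built-in HTTP server on port 2020, you can ask Fluent Bit directly how the pipeline is doing. A small sketch; substitute a real pod name from `kubectl get pods`:

# Forward the monitoring port of one Fluent Bit pod (the pod name is a placeholder)
kubectl port-forward pod/<fluent-bit-pod-name> 2020:2020 &

# Shipper liveness, enabled by Health_Check On in [SERVICE]
curl -s http://localhost:2020/api/v1/health

# Per-plugin counters: records read by the tail input, plus records sent, errors, and retries on the es output
curl -s http://localhost:2020/api/v1/metrics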
6. Automate ISM Policy & Index Pattern with a Kubernetes Job
This is where the SRE magic happens. We’ll use a Job to apply our ISM policy (7-day retention) and create the nginx-logs-* index pattern in Dashboards, all via API calls.
Save the following as observability-setup-job.yaml:
# observability-setup-job.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: opensearch-sre-setup-script
data:
  setup.sh: |
    #!/bin/sh
    set -euo pipefail

    OPENSEARCH_USER="admin"
    OPENSEARCH_PASSWORD="YourStrongPassword123!" # <<< USE THE SAME PASSWORD
    DASHBOARDS_URL="http://my-dashboards-opensearch-dashboards.default.svc.cluster.local:5601"
    OPENSEARCH_URL="https://my-os-opensearch.default.svc.cluster.local:9200"

    echo "Waiting for OpenSearch Dashboards to be available..."
    until curl -s -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" -k "$DASHBOARDS_URL/api/status" | grep -q "available"; do
      sleep 5
    done
    echo "OpenSearch Dashboards is available."

    echo "Creating ISM policy: nginx_retention..."
    curl -X PUT "$OPENSEARCH_URL/_plugins/_ism/policies/nginx_retention" \
      -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" -k -H "Content-Type: application/json" \
      -d '{
        "policy": {
          "description": "Delete logs after 7 days",
          "default_state": "hot",
          "states": [
            {
              "name": "hot",
              "actions": [],
              "transitions": [
                {
                  "state_name": "delete",
                  "conditions": { "min_index_age": "7d" }
                }
              ]
            },
            {
              "name": "delete",
              "actions": [ { "delete": {} } ],
              "transitions": []
            }
          ],
          "ism_template": [
            {
              "index_patterns": ["nginx-logs-*"],
              "priority": 100
            }
          ]
        }
      }'
    echo "ISM policy created."

    echo "Creating OpenSearch Dashboards Index Pattern: nginx-logs-pattern..."
    curl -X POST "$DASHBOARDS_URL/api/saved_objects/index-pattern/nginx-logs-pattern" \
      -u "$OPENSEARCH_USER:$OPENSEARCH_PASSWORD" -H "osd-xsrf: true" -H "Content-Type: application/json" \
      -d '{
        "attributes": {
          "title": "nginx-logs-*",
          "timeFieldName": "@timestamp"
        }
      }'
    echo "Index pattern created."
---
apiVersion: batch/v1
kind: Job
metadata:
  name: opensearch-observability-setup
spec:
  template:
    spec:
      containers:
        - name: setup-runner
          image: curlimages/curl:latest # A lightweight image with curl and a POSIX sh (no bash)
          command: ["sh", "/scripts/setup.sh"]
          volumeMounts:
            - name: setup-script
              mountPath: /scripts
      volumes:
        - name: setup-script
          configMap:
            name: opensearch-sre-setup-script
            defaultMode: 0744 # Make the script executable
      restartPolicy: OnFailure

Apply the setup Job:
kubectl apply -f observability-setup-job.yaml

Understanding the Automation: ISM & Index Patterns
Now that the Job is applied, let's talk about what's happening under the hood. We aren't just pushing config; we are defining the lifecycle and visibility of our data.
- ISM (Index State Management): In a production cluster, logs are a “growing fire.” If you don’t manage them, they will eventually eat your disk and crash your nodes. By defining an ISM Policy, we automate the “Hot-to-Delete” lifecycle. In this setup, we’re telling OpenSearch: “Keep these logs fresh for 7 days, then delete them automatically.” No manual cleanup, no 3 AM disk-space alerts.
- Index Patterns: While the Index is where the data lives, the Index Pattern is the "Lens" used by OpenSearch Dashboards to see it. By creating this via code, we ensure that as soon as the stack is up, your Discover tab is ready to go. We're essentially "gluing" all those daily `nginx-logs-*` indices into one continuous timeline so you can query across multiple days without lifting a finger. A quick verification sketch follows this list.
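Once the Job has run, you can verify both halves of the automation. A quick sketch, reusing the admin credentials and the `localhost:9200` port-forward from earlier; the ISM explain call will only show managed indices after the first `nginx-logs-*` index has been created:

# Wait for the setup Job to finish and review its output
kubectl wait --for=condition=complete job/opensearch-observability-setup --timeout=300s
kubectl logs job/opensearch-observability-setup

# Confirm the ISM policy exists, and (once nginx-logs-* indices exist) that it is attached to them
curl -k -u admin:YourStrongPassword123! \
  "https://localhost:9200/_plugins/_ism/policies/nginx_retention?pretty"
curl -k -u admin:YourStrongPassword123! \
  "https://localhost:9200/_plugins/_ism/explain/nginx-logs-*?pretty"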
7. Generate Some Nginx Traffic
To see logs, your Nginx server needs visitors:
kubectl run load-gen --image=busybox --restart=Never -- /bin/sh -c "while true; do wget -qO- http://nginx-server; sleep 2; done"
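After a minute or two of traffic, today's index should exist and its document count should be climbing. A quick look through the same port-forwarded API as before:

# Today's nginx-logs index should appear with a growing docs.count
curl -k -u admin:YourStrongPassword123! \
  "https://localhost:9200/_cat/indices/nginx-logs-*?v"

# Pull one document back to confirm the pipeline end to end
curl -k -u admin:YourStrongPassword123! \
  "https://localhost:9200/nginx-logs-*/_search?size=1&pretty"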
8. Access OpenSearch Dashboards

Port-forward Dashboards to your local machine:

kubectl port-forward svc/my-dashboards-opensearch-dashboards 5601:5601

Then, open http://localhost:5601 in your browser. Log in with `admin` and your chosen password. Go to the "Discover" tab, select `nginx-logs-*` in the dropdown, set your time range (e.g., "Last 15 minutes"), and watch your logs flow in!
Conclusion
By following this automated approach, you’ve built a robust, observable, and easily reproducible log aggregation stack with OpenSearch. You’ve tackled critical SRE challenges like security, data ingestion, and lifecycle management—all as code. This hands-on experience forms a solid foundation for further exploration into metrics, traces, and advanced alerting in your cloud-native environments.
What other OpenSearch challenges will you automate next?
See the whole code at my repository
