ELK Stack Best Practices: Index Lifecycle, Sharding, and Cost Control at Scale

Reviewed by Opsio Engineering Team
Oscar Bergenbrink

CTO

Technology leadership, cloud architecture, and digital transformation strategy

An ELK Stack at 50 GB/day ingest forgives almost any configuration choice. The same stack at 5 TB/day punishes every one of them. The patterns below are the ones we apply on customer clusters above ~500 GB/day, where bad defaults compound into operational pain quickly.

Pattern 1: Size Shards Between 20 and 50 GB

The most common ELK production pathology is shard explosion: thousands of small shards on a cluster designed for hundreds of large ones. Cluster state grows, master node CPU climbs, and shard allocation slows to a crawl. The fix is the same on every cluster:

  • Target shard size: 20-50 GB on hot tier, up to 100 GB on warm tier
  • Number of primary shards per index: (GB written before rollover) / 30 GB, rounded up; across the cluster, total primaries ≈ (daily ingest GB × retention days) / 30 GB
  • Use rollover instead of date-based indices to keep shard size bounded as ingest fluctuates

The cluster-state ceiling sits at roughly 10,000 shards per master before performance degrades visibly. Stay under 5,000 with headroom.
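
The sizing arithmetic above can be sketched as a small planning helper. This is a rough aid under stated assumptions, not an Elasticsearch API: the function name, the 30 GB default target, the single replica, and the 500 GB/day example are all illustrative, and the estimate ignores any later shrink in the warm phase.

```python
import math

def plan_shards(daily_ingest_gb, retention_days, target_shard_gb=30.0, replicas=1):
    """Estimate shard counts for data that rolls over into ~target-sized shards.

    Every input is a planning assumption, not a value read from a cluster.
    """
    # Primaries needed so one day's data lands in shards near the target size.
    primaries_per_day = max(1, math.ceil(daily_ingest_gb / target_shard_gb))
    # Each day's primaries stay in the cluster for the whole retention window.
    total_primaries = primaries_per_day * retention_days
    # Replica copies count against cluster state just like primaries do.
    total_shards = total_primaries * (1 + replicas)
    return {
        "primaries_per_day": primaries_per_day,
        "total_primaries": total_primaries,
        "total_shards": total_shards,
    }

plan = plan_shards(daily_ingest_gb=500, retention_days=30)
print(plan)
```

Comparing `total_shards` against the 5,000-shard headroom figure gives a quick check on whether a proposed retention window fits the cluster.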

Pattern 2: Index Lifecycle Management Done Right

ILM is mandatory on any cluster retaining more than 30 days of data. The standard four-phase policy is shown below under the illustrative name logs-app-policy. It assumes nodes are assigned data tier roles (data_hot, data_warm, data_cold), so the warm phase uses ILM's migrate action rather than a custom allocation filter:

PUT _ilm/policy/logs-app-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_primary_shard_size": "30gb" },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "migrate": {},
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": { "snapshot_repository": "s3-cold" },
          "set_priority": { "priority": 0 }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": { "delete": {} }
      }
    }
  }
}

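Two things to call out. Force-merge in the warm phase reclaims significant disk by collapsing Lucene segments, but it is IO-intensive, and ILM triggers it when the index enters the phase rather than on a clock, so give warm nodes enough IO headroom to absorb it during peak hours. Searchable snapshots in the cold phase keep data queryable while the snapshot in object storage stands in for replicas: cold nodes hold a single local copy (a fully mounted index), and a frozen phase goes further by keeping only a local cache. This drops cold-tier infrastructure cost by 70-90%.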
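
As a sanity check on the policy's timing, the sketch below maps an index's age to the phase it should have reached. The thresholds are copied from the min_age values in the policy; the function name is hypothetical, and the sketch assumes age is measured from rollover, which is how ILM treats rolled-over indices.

```python
from datetime import timedelta

# min_age thresholds from the policy, checked from oldest phase to youngest.
PHASE_MIN_AGE = [
    ("delete", timedelta(days=365)),
    ("cold", timedelta(days=30)),
    ("warm", timedelta(days=7)),
    ("hot", timedelta(0)),
]

def expected_phase(age_since_rollover):
    """Return the ILM phase an index of this age should have entered."""
    for phase, min_age in PHASE_MIN_AGE:
        if age_since_rollover >= min_age:
            return phase
    return "hot"  # Negative ages are treated as still hot.

for days in (3, 7, 45, 400):
    print(days, expected_phase(timedelta(days=days)))
```

An index whose actual phase lags this expectation by more than a poll interval or two is usually stuck on a failed step and worth inspecting.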

Pattern 3: Use Index Templates with Explicit Mappings

Schemaless ingestion is convenient until it isn't. The day a developer logs response_time: "1.2s" in one service and response_time: 1200 in another, mapping conflicts break dashboards. The fix is index templates with explicit field types and dynamic mapping set to strict for production indices. Because the logs-app-* pattern overlaps Elasticsearch's built-in logs-*-* template (priority 100), the template needs a higher priority to take effect; Elastic's documentation recommends a value above 500.

PUT _index_template/logs-app
{
  "index_patterns": ["logs-app-*"],
  "priority": 501,
  "template": {
    "mappings": {
      "dynamic": "strict",
      "properties": {
        "@timestamp":       { "type": "date" },
        "service":          { "type": "keyword" },
        "level":            { "type": "keyword" },
        "message":          { "type": "text" },
        "response_time_ms": { "type": "long" },
        "trace_id":         { "type": "keyword" }
      }
    }
  }
}

Strict mappings reject documents with unmapped fields. That sounds harsh; it isn't. Reject-on-unknown is what surfaces accidental field-name drift before it metastasises into a hundred broken dashboards.
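
One way to catch field-name drift before it reaches the cluster is a pre-ingest check in the emitting service's tests. The sketch below is an assumption about how such a check might look, not part of Elasticsearch: the function name is hypothetical, the field set is copied from the template above, and only top-level field names are compared (types and nested objects are not validated).

```python
# Field names copied from the logs-app index template.
MAPPED_FIELDS = {
    "@timestamp", "service", "level", "message", "response_time_ms", "trace_id",
}

def unmapped_fields(doc):
    """Return the top-level keys that a strict mapping would reject."""
    return sorted(set(doc) - MAPPED_FIELDS)

good = {"@timestamp": "2024-06-30T12:00:00Z", "service": "checkout",
        "level": "info", "message": "ok", "response_time_ms": 1200}
drifted = {"@timestamp": "2024-06-30T12:00:00Z", "service": "checkout",
           "response_time": "1.2s"}

print(unmapped_fields(good))
print(unmapped_fields(drifted))
```

Failing the build when the returned list is non-empty surfaces the drift at review time rather than as rejected documents in production.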

Pattern 4: Snapshot Daily and Test Restores

Snapshots to object storage are the only real disaster recovery for an Elasticsearch cluster. Configure them on day one. The minimum standard:

  • Daily snapshots to S3 / GCS / Azure blob, retained 30 days
  • Weekly snapshots retained 12 weeks
  • Monthly snapshots retained 12 months for archival / compliance
  • Quarterly restore drill: pull the latest snapshot into a parallel cluster and verify data integrity

The fourth bullet is the one most teams skip and the one that bites them when they need it. Snapshots that have never been restored are not real backups.
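
The retention tiers above can be expressed as a single keep-or-expire rule. The sketch below is illustrative only: it assumes the weekly snapshot is the one taken on Sunday and the monthly snapshot is the one taken on the first of the month, neither of which the schedule above specifies, and the function name is hypothetical.

```python
from datetime import date

def keep_snapshot(taken, today):
    """Decide whether a daily snapshot is still inside any retention tier."""
    age_days = (today - taken).days
    if age_days <= 30:                              # daily tier: 30 days
        return True
    if taken.weekday() == 6 and age_days <= 7 * 12:  # Sunday, kept 12 weeks
        return True
    if taken.day == 1 and age_days <= 365:           # 1st of month, kept 12 months
        return True
    return False

today = date(2024, 6, 30)
for taken in (date(2024, 6, 15), date(2024, 5, 12), date(2024, 5, 13), date(2024, 1, 1)):
    print(taken.isoformat(), keep_snapshot(taken, today))
```

In practice the same rules would be configured as snapshot lifecycle policies; the function is mainly useful for auditing what a repository should contain before a restore drill.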

Pattern 5: Separate Ingest, Storage, and Compute Costs

Cost optimisation on ELK is rarely about reducing total data; it is about putting each piece of data on the right tier. Three rules of thumb:

  1. Hot tier (last 7-14 days) on local SSD: fast indexing, fast search, expensive disk. Optimise for IOPS
  2. Warm tier (15-30 days) on slower disk: read-mostly, occasional search. Cheaper IOPS, larger disk
  3. Cold / frozen tier (30-365 days) on object storage with searchable snapshots: minimal compute, cheap storage. Slower queries acceptable

For an AWS deployment, gp3 EBS for hot, st1 for warm, S3 for cold typically cuts storage cost by 50-70% versus everything-on-gp3.
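
To see how tier placement drives the storage bill, the sketch below compares a tiered layout with keeping everything on the hot tier. The price ratios are placeholders chosen for illustration, not quoted AWS rates, and the function name, 500 GB/day ingest, and day counts are assumptions; replicas and compute are ignored. Substitute current per-GB prices for your region before drawing conclusions.

```python
def storage_cost(daily_gb, days_in_tier, price_per_gb):
    """Sum storage cost for data that spends `days` days' worth of ingest in each tier."""
    return sum(daily_gb * days * price_per_gb[tier]
               for tier, days in days_in_tier.items())

# Placeholder relative prices (hot = 1.0); NOT real cloud pricing.
RELATIVE_PRICE = {"hot": 1.0, "warm": 0.55, "cold": 0.30}

tiered = storage_cost(500, {"hot": 14, "warm": 16, "cold": 335}, RELATIVE_PRICE)
all_hot = storage_cost(500, {"hot": 365}, RELATIVE_PRICE)
saving = 1 - tiered / all_hot

print(f"tiered={tiered:.0f} all_hot={all_hot:.0f} saving={saving:.0%}")
```

The result depends entirely on the ratios supplied, so the useful output is the sensitivity: shortening the hot window by a few days usually moves the total more than renegotiating any single tier's price.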

Pattern 6: Build a Cluster Monitoring Cluster

Self-monitoring breaks during outages. Run a small dedicated monitoring cluster (3 nodes, modest sizing) and ship metrics from your production cluster to it via Metricbeat with the elasticsearch module. Alert on:

  • Cluster status not green for >5 minutes
  • Indexing rate drops >50% versus the same weekday in the previous week
  • Search latency p95 >2x baseline
  • Disk watermark hit on any node
  • JVM old-gen pressure >75% sustained
  • Pending tasks count >100

These six alerts catch roughly 90% of the meaningful production incidents.
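
The six thresholds can be written down as one evaluation function, which is handy for unit-testing alert logic before wiring it into a real alerting tool. This is a sketch only: the metric key names are hypothetical rather than Metricbeat field names, and the "sustained" qualifier on JVM pressure is assumed to be handled upstream by averaging over a window.

```python
def firing_alerts(current, baseline):
    """Return the names of alerts whose thresholds are breached."""
    checks = {
        "cluster_not_green": current["status"] != "green"
                             and current["minutes_in_status"] > 5,
        "indexing_rate_drop": current["index_rate"] < 0.5 * baseline["index_rate"],
        "search_latency_p95": current["search_p95_ms"] > 2 * baseline["search_p95_ms"],
        "disk_watermark": current["nodes_over_watermark"] > 0,
        "jvm_old_gen": current["old_gen_pct"] > 75,
        "pending_tasks": current["pending_tasks"] > 100,
    }
    return [name for name, breached in checks.items() if breached]

baseline = {"index_rate": 40000, "search_p95_ms": 120}
current = {
    "status": "yellow", "minutes_in_status": 12,
    "index_rate": 15000, "search_p95_ms": 180,
    "nodes_over_watermark": 0, "old_gen_pct": 81, "pending_tasks": 4,
}
print(firing_alerts(current, baseline))
```

Keeping the baseline as an explicit input makes the week-over-week and latency comparisons testable with fixed numbers.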

Pattern 7: Apply RBAC, Field Security, and Audit Logging

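Elasticsearch security is feature-complete, but only the basics (authentication and TLS) are configured automatically on new self-managed 8.x clusters, and 7.x clusters ship with security off unless you enable it. The minimum production configuration:

  • TLS on transport and HTTP layers (auto-configured on new 8.x clusters; transport TLS is required on any multi-node cluster with security enabled)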
  • Role-based access control with index- and document-level filtering
  • Field-level security to redact PII for general users
  • Audit logging shipped to a separate cluster or to your SOC's SIEM
  • SAML / OIDC integration for SSO
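
A role that combines index-, document-, and field-level restrictions might look like the sketch below, which builds the request body for the role API (PUT _security/role/<name>) as a Python dictionary. The role's intent, the service value, and the client_ip and user_email field names are hypothetical and do not come from the template earlier in this article; the body keys (indices, names, privileges, field_security, query) follow the Elasticsearch role definition format.

```python
import json

# Hypothetical PII fields to hide from general users.
PII_FIELDS = ["client_ip", "user_email"]

role_body = {
    "indices": [
        {
            "names": ["logs-app-*"],
            "privileges": ["read"],
            # Field-level security: grant everything except the PII fields.
            "field_security": {"grant": ["*"], "except": PII_FIELDS},
            # Document-level security: the query is passed as a JSON string.
            "query": json.dumps({"term": {"service": "checkout"}}),
        }
    ]
}

print(json.dumps(role_body, indent=2))
```

Generating role bodies from code keeps the list of redacted fields in one reviewed place instead of scattered across hand-edited roles.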

Pattern 8: Plan Version Upgrades Quarterly

Recent Elasticsearch major versions have shipped roughly three years apart (7.0 in 2019, 8.0 in 2022), with minor versions arriving every few months. Skip three minor versions and the upgrade path becomes painful. The discipline that prevents this:

  • Quarterly minor-version upgrade rolling through the cluster
  • Annual major-version review with parallel-cluster migration plan
  • Always read the breaking-changes notes before scheduling
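
The three-minor-version rule is easy to automate as a periodic check. The sketch below is a minimal illustration: the function names and status labels are hypothetical, the version strings are examples, and the latest version would need to come from your own release tracking.

```python
def minor_versions_behind(running, latest):
    """Return how many minor versions `running` trails `latest`, or None across majors."""
    run_major, run_minor = (int(part) for part in running.split(".")[:2])
    new_major, new_minor = (int(part) for part in latest.split(".")[:2])
    if run_major != new_major:
        return None  # A major-version gap needs its own migration plan.
    return max(0, new_minor - run_minor)

def upgrade_status(running, latest):
    lag = minor_versions_behind(running, latest)
    if lag is None:
        return "major-upgrade-review"
    if lag >= 3:
        return "overdue"
    if lag >= 1:
        return "schedule"
    return "current"

for running in ("8.13.0", "8.12.0", "8.9.2", "7.17.20"):
    print(running, upgrade_status(running, "8.13.0"))
```

Running this against every cluster in a fleet once a quarter turns the upgrade discipline into a report rather than a memory exercise.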

How Opsio Helps

Opsio's end-to-end ELK Stack service applies these patterns as standard. Customer engagements typically reduce hot-tier disk cost by 30-50% in the first quarter through ILM tuning alone, and reduce shard count by 60-90% on clusters that had drifted into shard-explosion territory. We also handle cloud cost optimization across the wider data platform, where ELK is one of several systems whose tier and retention policies usually need joint redesign.

About the Author

Oscar Bergenbrink

CTO at Opsio

Technology leadership, cloud architecture, and digital transformation strategy

Editorial standards: This article was written by a certified practitioner and peer-reviewed by our engineering team. We update content quarterly to ensure technical accuracy. Opsio maintains editorial independence: we recommend solutions based on technical merit, not commercial relationships.