- fluentd - activate and view monitoring
- elasticsearch - set physical memory lock; set heap size; check health; check shards; allocate unassigned shards
okay, here's what happened:
- my kibana kept dying - whenever i started it, it died within a few seconds, and i didn't have kibana logging enabled to see why.
- fluentd couldn't flush - tailing the fluentd log showed it could not push logs to elasticsearch ("read timeout reached")
- elasticsearch had high cpu and memory usage - it was using up to, and sometimes exceeding, 100% cpu
so, here's what i did:
- added a fluentd monitor by inserting the following in the /etc/td-agent/td-agent.conf file:
<source>
  type monitor_agent
  bind 0.0.0.0
  port 24220
</source>
- checked fluentd's status via the url below, which showed a large and continually growing buffer_total_queued_size for multiple output plugins (sample response shown after the command)
# curl http://127.0.0.1:24220/api/plugins.json
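for reference, the response looks roughly like this - a trimmed, made-up sample rather than my actual output - where the fields to watch are buffer_total_queued_size (bytes stuck in the buffer) and retry_count:
{
  "plugins": [
    {
      "plugin_id": "object:3fe2d4e8b9a0",
      "type": "elasticsearch",
      "output_plugin": true,
      "buffer_queue_length": 512,
      "buffer_total_queued_size": 178956970,
      "retry_count": 27
    }
  ]
}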
- moved my EFK stack to a larger instance with 8GB of memory, double the previous one, to make sure it wasn't memory allocation that was choking things
- set elasticsearch to lock physical memory and avoid swapping by setting the line below in elasticsearch.yml (this requires a service restart)
bootstrap.mlockall: true
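to confirm the lock actually took effect (it can silently fail if the memlock ulimit is too low), a quick sanity check - not part of my original steps - is to query the nodes' process info and look for "mlockall" : true:
# curl http://127.0.0.1:9200/_nodes/process?pretty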
- set elasticsearch's heap allocation to 4GB - half of the instance's 8GB, in line with the usual guidance of giving the heap at most half of physical memory - by adding the line below in elasticsearch's startup script (this requires a service restart)
ES_HEAP_SIZE=4g
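likewise, to double-check the heap size the jvm actually got, the cat api can show it (again, just a suggested sanity check):
# curl 'http://127.0.0.1:9200/_cat/nodes?v&h=heap.max'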
- checked elasticsearch's health via the url below, which showed a "red" status - meaning at least one primary shard is not allocated, so something was really wrong. there were also unassigned shards (i actually had more than 50; trimming the output here for easier reading)
# curl http://127.0.0.1:9200/_cluster/health?pretty
{
"cluster_name" : "elasticsearch",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 3,
"active_shards" : 3,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 1,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0
}
- checked elasticsearch's shard status via the url below, which showed which shard was unassigned. the columns are index, shard, primary/replica (p/r), state, docs, size, ip, and node (you need the index, shard, and node values to properly allocate the shard)
# curl http://localhost:9200/_cat/shards
logstash-2015.12.26 0 p STARTED 65270 14.2mb 10.47.200.10 Magnum
logstash-2015.12.26 1 p STARTED 65261 14.2mb 10.47.200.10 Magnum
logstash-2015.12.26 2 p STARTED 65244 14.2mb 10.47.200.10 Magnum
logstash-2015.12.26 3 p UNASSIGNED
- allocated the unassigned shard through elasticsearch's reroute api, using the index, shard number, and node identified above (allow_primary is needed here because the unassigned shard is a primary with no live copy - be aware this means accepting possible data loss for that shard)
# curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands": [
    {
      "allocate": {
        "index": "logstash-2015.12.26",
        "shard": 3,
        "node": "Magnum",
        "allow_primary": true
      }
    }
  ]
}'
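after the reroute, hitting the health endpoint again is the quickest way to confirm recovery - the status should go back to green (possibly yellow for a moment while shards initialize):
# curl http://127.0.0.1:9200/_cluster/health?pretty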
after all of that, the elasticsearch cluster status turned green, fluentd could flush logs again, and kibana stopped dying! whew, two days' worth of investigation and fixing.