- fluentd - activate and view monitoring
- elasticsearch - set physical memory lock; set heap size; check health; check shards; allocate unassigned shards
okay, here's what happened:
- my kibana kept dying - whenever i started it, it died within a few seconds, and i didn't have kibana logging enabled to see why.
- fluentd couldn't flush - tailing the fluentd log showed it could not push logs to elasticsearch ("read timeout reached")
- elasticsearch had high cpu and memory usage - it was using up to, and sometimes exceeding, 100% cpu
so, here's what i did:
- added a fluentd monitor by inserting the following in the /etc/td-agent/td-agent.conf file:
<source>
  type monitor_agent
  bind 0.0.0.0
  port 24220
</source>
- checked fluentd's status via the url below, which showed a large and continually growing buffer_total_queued_size for multiple output plugins (sample response shown after the command)
# curl http://127.0.0.1:24220/api/plugins.json
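for reference, the response looks roughly like this - a trimmed, made-up sample rather than my actual output - where the fields to watch are buffer_total_queued_size (bytes stuck in the buffer) and retry_count:
{
  "plugins": [
    {
      "plugin_id": "object:3fe2d4e8b9a0",
      "type": "elasticsearch",
      "output_plugin": true,
      "buffer_queue_length": 512,
      "buffer_total_queued_size": 178956970,
      "retry_count": 27
    }
  ]
}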
- moved my EFK stack to a larger instance with 8GB of memory, double the previous one, to make sure it wasn't memory allocation that was choking things
- set elasticsearch to lock physical memory and avoid swapping by setting the line below in elasticsearch.yml (this requires a service restart)
bootstrap.mlockall: true
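to confirm the lock actually took effect (it can silently fail if the memlock ulimit is too low), a quick sanity check - not part of my original steps - is to query the nodes' process info and look for "mlockall" : true:
# curl http://127.0.0.1:9200/_nodes/process?pretty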
- set elasticsearch's heap allocation to 4GB - half of the instance's 8GB, in line with the usual guidance of giving the heap at most half of physical memory - by adding the line below in elasticsearch's startup script (this requires a service restart)
ES_HEAP_SIZE=4g
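likewise, to double-check the heap size the jvm actually got, the cat api can show it (again, just a suggested sanity check):
# curl 'http://127.0.0.1:9200/_cat/nodes?v&h=heap.max'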
- checked elasticsearch's health via the url below, which showed a "red" status - meaning at least one primary shard is not allocated, so something was really wrong. there were also unassigned shards (i actually had more than 50; trimming the output here for easier reading)
# curl http://127.0.0.1:9200/_cluster/health?pretty
{
"cluster_name" : "elasticsearch",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 3,
"active_shards" : 3,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 1,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0
}
- checked elasticsearch's shard status via the url below, which showed which shard was unassigned. the columns are index, shard, primary/replica (p/r), state, docs, size, ip, and node (you need the index, shard, and node values to properly allocate the shard)
# curl http://localhost:9200/_cat/shards
logstash-2015.12.26 0 p STARTED 65270 14.2mb 10.47.200.10 Magnum
logstash-2015.12.26 1 p STARTED 65261 14.2mb 10.47.200.10 Magnum
logstash-2015.12.26 2 p STARTED 65244 14.2mb 10.47.200.10 Magnum
logstash-2015.12.26 3 p UNASSIGNED
- allocated the unassigned shard through elasticsearch's reroute api, using the index, shard number, and node identified above (allow_primary is needed here because the unassigned shard is a primary with no live copy - be aware this means accepting possible data loss for that shard)
# curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands": [
    {
      "allocate": {
        "index": "logstash-2015.12.26",
        "shard": 3,
        "node": "Magnum",
        "allow_primary": true
      }
    }
  ]
}'
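after the reroute, hitting the health endpoint again is the quickest way to confirm recovery - the status should go back to green (possibly yellow for a moment while shards initialize):
# curl http://127.0.0.1:9200/_cluster/health?pretty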
after all of that, the elasticsearch cluster status turned green, fluentd could flush logs again, and kibana stopped dying! whew, two days' worth of investigation and fixing.