System has problems
Hi,
Every couple of hours, the system icon turns red and reports that the system has problems. I checked the Elasticsearch and Logstash status, and both are active (running). On top of that, there are plenty of errors in the logs. Currently it shows that logs are not being received, but when I checked port 5544, connections were established. Could you please guide me further?
[root@se002roamnagios ~]# systemctl status logstash
● logstash.service - LSB: Logstash
Loaded: loaded (/etc/rc.d/init.d/logstash; bad; vendor preset: disabled)
Active: active (running) since Sun 2021-01-31 23:33:53 CET; 1 day 10h ago
Docs: man:systemd-sysv-generator(8)
Process: 1535 ExecStart=/etc/rc.d/init.d/logstash start (code=exited, status=0/SUCCESS)
Tasks: 71
Memory: 792.3M
CGroup: /system.slice/logstash.service
├─1715 runuser -s /bin/sh -c exec /usr/local/nagioslogserver/logstash/bin/logstash agent -f /usr/local/nagioslogserver/logstash/etc/conf.d -l /var/log/lo...
└─1760 /bin/java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOn...
Feb 01 07:16:25 se002roamnagios.ddc.teliasonera.net logstash[1535]: Feb 01, 2021 7:16:25 AM org.apache.http.impl.execchain.RetryExec execute
Feb 01 07:16:25 se002roamnagios.ddc.teliasonera.net logstash[1535]: INFO: I/O exception (java.net.SocketException) caught when processing request to {}->http...failed)
Feb 01 07:16:25 se002roamnagios.ddc.teliasonera.net logstash[1535]: Feb 01, 2021 7:16:25 AM org.apache.http.impl.execchain.RetryExec execute
Feb 01 07:16:25 se002roamnagios.ddc.teliasonera.net logstash[1535]: INFO: Retrying request to {}->http://localhost:9200
Feb 01 07:16:25 se002roamnagios.ddc.teliasonera.net logstash[1535]: Feb 01, 2021 7:16:25 AM org.apache.http.impl.execchain.RetryExec execute
Feb 01 07:16:25 se002roamnagios.ddc.teliasonera.net logstash[1535]: INFO: Retrying request to {}->http://localhost:9200
Feb 01 07:16:25 se002roamnagios.ddc.teliasonera.net logstash[1535]: Feb 01, 2021 7:16:25 AM org.apache.http.impl.execchain.RetryExec execute
Feb 01 07:16:25 se002roamnagios.ddc.teliasonera.net logstash[1535]: INFO: I/O exception (java.net.SocketException) caught when processing request to {}->http...failed)
Feb 01 07:16:25 se002roamnagios.ddc.teliasonera.net logstash[1535]: Feb 01, 2021 7:16:25 AM org.apache.http.impl.execchain.RetryExec execute
Feb 01 07:16:25 se002roamnagios.ddc.teliasonera.net logstash[1535]: INFO: Retrying request to {}->http://localhost:9200
Hint: Some lines were ellipsized, use -l to show in full.
[root@se002roamnagios ~]# curl http://localhost:9200
{
"status" : 200,
"name" : "1ac29902-0050-4635-9803-efc2d35298bd",
"cluster_name" : "b8a56162-1bb8-4939-9f41-f90ed0b2d8c9",
"version" : {
"number" : "1.7.6",
"build_hash" : "c730b59357f8ebc555286794dcd90b3411f517c9",
"build_timestamp" : "2016-11-18T15:21:16Z",
"build_snapshot" : false,
"lucene_version" : "4.10.4"
},
"tagline" : "You Know, for Search"
}
[root@se002roamnagios ~]# systemctl status elasticservice
Unit elasticservice.service could not be found.
[root@se002roamnagios ~]# systemctl status elasticsearch
● elasticsearch.service - LSB: This service manages the elasticsearch daemon
Loaded: loaded (/etc/rc.d/init.d/elasticsearch; bad; vendor preset: disabled)
Active: active (running) since Tue 2021-02-02 08:39:33 CET; 1h 26min ago
Docs: man:systemd-sysv-generator(8)
Process: 97015 ExecStop=/etc/rc.d/init.d/elasticsearch stop (code=exited, status=0/SUCCESS)
Process: 97077 ExecStart=/etc/rc.d/init.d/elasticsearch start (code=exited, status=0/SUCCESS)
Tasks: 142
Memory: 42.6G
CGroup: /system.slice/elasticsearch.service
└─97103 /bin/java -Xms32123m -Xmx32123m -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseC...
Feb 02 08:39:33 se002roamnagios.ddc.teliasonera.net systemd[1]: Starting LSB: This service manages the elasticsearch daemon...
Feb 02 08:39:33 se002roamnagios.ddc.teliasonera.net runuser[97094]: pam_unix(runuser:session): session opened for user nagios by (uid=0)
Feb 02 08:39:33 se002roamnagios.ddc.teliasonera.net runuser[97094]: pam_unix(runuser:session): session closed for user nagios
Feb 02 08:39:33 se002roamnagios.ddc.teliasonera.net elasticsearch[97077]: Starting elasticsearch: [ OK ]
Feb 02 08:39:33 se002roamnagios.ddc.teliasonera.net systemd[1]: Started LSB: This service manages the elasticsearch daemon.
Re: System has problems
Hello again,
For the logs not coming in, I found that it was a problem with Logstash. After a Logstash restart, logs started coming in.
However, please help find the cause of the recurring issue.
These two problems are very frequent:
1.) "System has problems" in the GUI (about system status)
2.) Logs stop coming in (probably due to Logstash running out of memory or something similar).
Is there a health check script that can report component status when anything becomes unhealthy? How can we avoid this in the future? We can't depend on manual monitoring, and missing logs make the reports useless.
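A minimal sketch of what such a health check could look like, not an official NLS script. It assumes the service names (elasticsearch, logstash) and the Elasticsearch URL (http://localhost:9200) seen earlier in this thread; the function names are hypothetical. Because both services were reported as active (running) while the problem was occurring, the sketch also queries the Elasticsearch _cluster/health endpoint rather than relying on service state alone.

```shell
#!/bin/sh
# Sketch of a health check for a Nagios Log Server node.
# Assumptions: service names and URL are taken from this thread.

ES_URL="${ES_URL:-http://localhost:9200}"

# Extract the "status" value (green/yellow/red) from _cluster/health JSON on stdin.
parse_cluster_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# Map a cluster status to a Nagios-style exit code: 0 OK, 1 WARNING, 2 CRITICAL.
# An empty or unknown status (for example, Elasticsearch unreachable) is CRITICAL.
status_to_code() {
    case "$1" in
        green)  echo 0 ;;
        yellow) echo 1 ;;
        *)      echo 2 ;;
    esac
}

# Check both services and the cluster health; return the worst code found.
check_nls() {
    rc=0
    for svc in elasticsearch logstash; do
        if ! systemctl is-active --quiet "$svc"; then
            echo "CRITICAL: $svc is not active"
            rc=2
        fi
    done
    status=$(curl -s --max-time 10 "$ES_URL/_cluster/health" | parse_cluster_status)
    code=$(status_to_code "$status")
    echo "cluster status: ${status:-unreachable}"
    if [ "$code" -gt "$rc" ]; then
        rc=$code
    fi
    return "$rc"
}
```

Calling check_nls from cron or from a monitoring agent and alerting on a non-zero return code would be one way to avoid manual monitoring. It does not detect a Logstash process that is running but no longer forwarding events.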
Re: System has problems
I'd recommend setting up NCPA on the NLS machine so you can monitor it on XI or Core. See https://support.nagios.com/kb/article/n ... i-857.html.
Please provide a profile from the system so we can look for problems on the system. It can be gathered under Admin > System > System Status > Download System Profile or from the command line with:
/usr/local/nagioslogserver/scripts/profile.sh
This will create /tmp/system-profile.tar.gz.
Note that this file can be very large and may not be able to be uploaded through the ticketing system. You can split the file into smaller files with the split command on the NLS (or other Linux machine) command line:
split -b 5000000 /tmp/system-profile.tar.gz system-profile- -d
The above command will split the system-profile.tar.gz into 5MB segments and save them to files with the naming convention system-profile-nn. Please send me the profile in a private message.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
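For illustration, the split step can be paired with a reassembly check, since the pieces are put back together by concatenating them in order. This is a sketch with a hypothetical function name (split_and_verify); it assumes GNU split, whose -d option produces the numeric suffixes (00, 01, ...) described above.

```shell
# Split a file into numbered pieces, rebuild it from those pieces,
# and confirm that the rebuilt copy is byte-identical to the original.
# piece_size is in bytes; the default matches the 5000000 used in the thread.
split_and_verify() {
    src=$1
    prefix=$2
    piece_size=${3:-5000000}
    # Write pieces named <prefix>00, <prefix>01, ...
    split -b "$piece_size" -d "$src" "$prefix"
    # The shell expands the glob in sorted order, so the pieces are joined in sequence.
    cat "$prefix"[0-9]* > "${prefix}rebuilt"
    # cmp -s is silent and returns 0 only when both files are identical.
    cmp -s "$src" "${prefix}rebuilt"
}
```

For example, split_and_verify /tmp/system-profile.tar.gz system-profile- would leave the pieces plus a system-profile-rebuilt file in the current directory and return 0 if the round trip succeeded.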
Re: System has problems
Hello,
I have sent the profile data in a private message. Kindly share your feedback as soon as possible, because the GUI is again reporting "System has problems" and logs have stopped coming in.
Re: System has problems
The profile had a problem being generated and was only 131 bytes. If you tried generating the profile in the web UI, try running it from the command line instead.
If that fails, please check the disk usage as well as CPU and load from the command line with:
df -h
top
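As a rough sketch, the disk usage check can be automated by filtering df output for partitions above a threshold. The function name full_filesystems and the default threshold of 90 percent are assumptions made for illustration. It reads df -P style output on stdin, where the fifth column is the usage percentage and the sixth is the mount point, so mount points containing spaces are not handled.

```shell
# Print "mountpoint usage%" for every filesystem at or above a usage threshold.
# Reads `df -P` style output on stdin; the first line (the header) is skipped.
full_filesystems() {
    threshold=${1:-90}
    awk -v t="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # strip the percent sign before comparing
        if (use + 0 >= t) print $6, $5
    }'
}
```

Usage would be along the lines of df -P | full_filesystems 90, which prints nothing when every partition is below the threshold.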
Re: System has problems
Hello,
A 1.4M system-profile.tar.gz has been shared in a private message. Perhaps later you can help find out why the GUI didn't generate it.
I suspect there are permission issues in some directories or files. Does NLS have a script to fix permissions, like Nagios XI does?
Re: System has problems
The / partition is almost full with only 537MB available. Find out what is consuming the space with:
du -a / | sort -n -r | head -n 20
We may find some items that we can remove to clear up space, or it may be necessary to increase the amount of space on this partition.
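A possible variant of the du command, offered as a sketch rather than as the support team's recommendation: on a host where other filesystems are mounted below /, du -a / also walks those mounts. Adding -x keeps du on the filesystem of the starting directory, so the listing reflects only what fills that partition. The function name largest_entries is hypothetical.

```shell
# List the largest files and directories under a starting directory,
# biggest first, without crossing into other mounted filesystems.
largest_entries() {
    dir=${1:-/}
    count=${2:-20}
    # -x: stay on one filesystem, -a: include files, -k: sizes in kilobytes.
    # Errors from unreadable paths are discarded.
    du -x -a -k "$dir" 2>/dev/null | sort -n -r | head -n "$count"
}
```

Running largest_entries / 20 as root gives the same kind of listing as the original command, limited to the root filesystem.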
Re: System has problems
Hello,
Following your recommendations, we increased the root partition and restarted Logstash and Elasticsearch.
However, the same problems have popped up again.
[root@se002roamnagios ~]# df -kh
Filesystem               Size  Used Avail Use% Mounted on
devtmpfs                  32G     0   32G   0% /dev
tmpfs                     32G     0   32G   0% /dev/shm
tmpfs                     32G  284M   32G   1% /run
tmpfs                     32G     0   32G   0% /sys/fs/cgroup
/dev/mapper/vg00-root     34G  3.6G   31G  11% /
/dev/sda1               1014M  186M  829M  19% /boot
/dev/mapper/vg00-var     8.0G  1.2G  6.9G  15% /var
/dev/mapper/vg00-home    2.0G  482M  1.6G  24% /home
/dev/mapper/vg01-disk1   1.0T  193G  832G  19% /nagios
/dev/mapper/vg00-opt     4.0G  1.3G  2.8G  31% /opt
/dev/mapper/vg00-varlog  8.0G  576M  7.5G   8% /var/log
/dev/mapper/vg00-audit   4.0G  475M  3.6G  12% /var/log/audit
/dev/mapper/vg00-tmp     2.0G   68M  2.0G   4% /tmp
tmpfs                    6.3G     0  6.3G   0% /run/user/0
tmpfs                    6.3G     0  6.3G   0% /run/user/1002
tmpfs                    6.3G     0  6.3G   0% /run/user/48
tmpfs                    6.3G     0  6.3G   0% /run/user/504
[root@se002roamnagios ~]#
Re: System has problems
Please provide a fresh profile so we may review the logs.
Re: System has problems
Hello,
I have sent you the system profile. Currently logs have stopped coming in. Just to check: can wrong filter definitions crash the EL components even if they are applied successfully?
I can see lots of errors like the following in the Elasticsearch log.
[2021-02-08 20:56:22,560][DEBUG][action.search.type ] [1ac29902-0050-4635-9803-efc2d35298bd] [logstash-2021.02.03][2], node[eF4R0ydWT-KGKzOWYoELpw], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@2d095ea0] lastShard [true]
org.elasticsearch.search.query.QueryPhaseExecutionException: [logstash-2021.02.03][2]: query[ConstantScore(*:*)],from[0],size[0]: Query Failed [Failed to execute main query]
at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:163)
at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:301)
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:312)
at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:231)
at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:228)
at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:559)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.elasticsearch.ElasticsearchException: org.elasticsearch.ElasticsearchIllegalStateEx