System has problems

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
scv831
Posts: 12
Joined: Thu Dec 24, 2020 6:38 am

System has problems

Post by scv831 »

Hi,

Every couple of hours, the system icon turns red and reports that the system has problems. I checked the status of Elasticsearch and Logstash, and both are active (running). On top of that, there are plenty of errors in the logs. It currently shows that logs are not being received, yet when I checked port 5544, connections were established. Could you please advise on what to check next?

[root@se002roamnagios ~]# systemctl status logstash
● logstash.service - LSB: Logstash
Loaded: loaded (/etc/rc.d/init.d/logstash; bad; vendor preset: disabled)
Active: active (running) since Sun 2021-01-31 23:33:53 CET; 1 day 10h ago
Docs: man:systemd-sysv-generator(8)
Process: 1535 ExecStart=/etc/rc.d/init.d/logstash start (code=exited, status=0/SUCCESS)
Tasks: 71
Memory: 792.3M
CGroup: /system.slice/logstash.service
├─1715 runuser -s /bin/sh -c exec /usr/local/nagioslogserver/logstash/bin/logstash agent -f /usr/local/nagioslogserver/logstash/etc/conf.d -l /var/log/lo...
└─1760 /bin/java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOn...

Feb 01 07:16:25 se002roamnagios.ddc.teliasonera.net logstash[1535]: Feb 01, 2021 7:16:25 AM org.apache.http.impl.execchain.RetryExec execute
Feb 01 07:16:25 se002roamnagios.ddc.teliasonera.net logstash[1535]: INFO: I/O exception (java.net.SocketException) caught when processing request to {}->http...failed)
Feb 01 07:16:25 se002roamnagios.ddc.teliasonera.net logstash[1535]: Feb 01, 2021 7:16:25 AM org.apache.http.impl.execchain.RetryExec execute
Feb 01 07:16:25 se002roamnagios.ddc.teliasonera.net logstash[1535]: INFO: Retrying request to {}->http://localhost:9200
Feb 01 07:16:25 se002roamnagios.ddc.teliasonera.net logstash[1535]: Feb 01, 2021 7:16:25 AM org.apache.http.impl.execchain.RetryExec execute
Feb 01 07:16:25 se002roamnagios.ddc.teliasonera.net logstash[1535]: INFO: Retrying request to {}->http://localhost:9200
Feb 01 07:16:25 se002roamnagios.ddc.teliasonera.net logstash[1535]: Feb 01, 2021 7:16:25 AM org.apache.http.impl.execchain.RetryExec execute
Feb 01 07:16:25 se002roamnagios.ddc.teliasonera.net logstash[1535]: INFO: I/O exception (java.net.SocketException) caught when processing request to {}->http...failed)
Feb 01 07:16:25 se002roamnagios.ddc.teliasonera.net logstash[1535]: Feb 01, 2021 7:16:25 AM org.apache.http.impl.execchain.RetryExec execute
Feb 01 07:16:25 se002roamnagios.ddc.teliasonera.net logstash[1535]: INFO: Retrying request to {}->http://localhost:9200
Hint: Some lines were ellipsized, use -l to show in full.
[root@se002roamnagios ~]# curl http://localhost:9200
{
"status" : 200,
"name" : "1ac29902-0050-4635-9803-efc2d35298bd",
"cluster_name" : "b8a56162-1bb8-4939-9f41-f90ed0b2d8c9",
"version" : {
"number" : "1.7.6",
"build_hash" : "c730b59357f8ebc555286794dcd90b3411f517c9",
"build_timestamp" : "2016-11-18T15:21:16Z",
"build_snapshot" : false,
"lucene_version" : "4.10.4"
},
"tagline" : "You Know, for Search"
}
[root@se002roamnagios ~]# systemctl status elasticservice
Unit elasticservice.service could not be found.
[root@se002roamnagios ~]# systemctl status elasticsearch
● elasticsearch.service - LSB: This service manages the elasticsearch daemon
Loaded: loaded (/etc/rc.d/init.d/elasticsearch; bad; vendor preset: disabled)
Active: active (running) since Tue 2021-02-02 08:39:33 CET; 1h 26min ago
Docs: man:systemd-sysv-generator(8)
Process: 97015 ExecStop=/etc/rc.d/init.d/elasticsearch stop (code=exited, status=0/SUCCESS)
Process: 97077 ExecStart=/etc/rc.d/init.d/elasticsearch start (code=exited, status=0/SUCCESS)
Tasks: 142
Memory: 42.6G
CGroup: /system.slice/elasticsearch.service
└─97103 /bin/java -Xms32123m -Xmx32123m -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseC...

Feb 02 08:39:33 se002roamnagios.ddc.teliasonera.net systemd[1]: Starting LSB: This service manages the elasticsearch daemon...
Feb 02 08:39:33 se002roamnagios.ddc.teliasonera.net runuser[97094]: pam_unix(runuser:session): session opened for user nagios by (uid=0)
Feb 02 08:39:33 se002roamnagios.ddc.teliasonera.net runuser[97094]: pam_unix(runuser:session): session closed for user nagios
Feb 02 08:39:33 se002roamnagios.ddc.teliasonera.net elasticsearch[97077]: Starting elasticsearch: [ OK ]
Feb 02 08:39:33 se002roamnagios.ddc.teliasonera.net systemd[1]: Started LSB: This service manages the elasticsearch daemon.

Re: System has problems

Post by scv831 »

Hello again,

As for the logs not coming in, I found that the problem was with Logstash. After a Logstash restart, logs started coming in again.

However, please help me find the cause of this recurring issue.

These two problems occur very frequently:

1.) "System has problems" in the GUI (about system status)
2.) Logs stop coming in (probably due to Logstash running out of memory or something similar).

Is there a health check script that can report the status of each component when something becomes unhealthy? How can we avoid this in the future? We can't depend on manual monitoring, and missing logs make the reports useless.
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: System has problems

Post by cdienger »

I'd recommend setting up NCPA on the NLS machine so you can monitor it on XI or Core. See https://support.nagios.com/kb/article/n ... i-857.html.
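For illustration only, here is a rough sketch of the kind of local check such monitoring could run on the NLS machine. It is not an official Nagios script: the function names, the choice of checks and the exit codes (which follow the usual Nagios plugin convention of 0 = OK, 1 = WARNING, 2 = CRITICAL) are assumptions. The service names, the address http://localhost:9200 and port 5544 are taken from the first post; `_cluster/health` is the standard Elasticsearch cluster health endpoint.

```shell
#!/usr/bin/env bash
# Hypothetical health check for a single Nagios Log Server node.
# All function names and thresholds here are illustrative assumptions.

ES_URL="${ES_URL:-http://localhost:9200}"   # address used with curl in the first post
INPUT_PORT="${INPUT_PORT:-5544}"            # input port mentioned in the first post

# Read _cluster/health JSON on stdin and print the status (green/yellow/red).
parse_es_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# Succeed if systemd reports the named service as active.
service_active() {
    systemctl is-active --quiet "$1"
}

# Succeed if some process is listening on the given TCP port.
port_listening() {
    ss -ltn | awk '{print $4}' | grep -q ":$1\$"
}

# Run all checks, print one line per problem, return 0/1/2.
check_node() {
    local rc=0 status svc
    for svc in elasticsearch logstash; do
        if ! service_active "$svc"; then
            echo "CRITICAL: $svc is not running"
            rc=2
        fi
    done
    status=$(curl -s --max-time 10 "$ES_URL/_cluster/health" | parse_es_status)
    case "$status" in
        green) ;;
        yellow)
            echo "WARNING: cluster status is yellow"
            if [ "$rc" -lt 1 ]; then rc=1; fi
            ;;
        *)
            echo "CRITICAL: cluster status is '${status:-unreachable}'"
            rc=2
            ;;
    esac
    if ! port_listening "$INPUT_PORT"; then
        echo "CRITICAL: nothing is listening on port $INPUT_PORT"
        rc=2
    fi
    if [ "$rc" -eq 0 ]; then echo "OK: all checks passed"; fi
    return "$rc"
}
```

Adding `check_node; exit $?` at the bottom would turn this into a script that cron or a monitoring agent could run. Note that a node can pass all of these checks and still not be indexing logs, as described in the first post, so this is only a first layer.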

Please provide a profile from the system so we can look for problems on the system. It can be gathered under Admin > System > System Status > Download System Profile or from the command line with:

/usr/local/nagioslogserver/scripts/profile.sh
This will create /tmp/system-profile.tar.gz.

Note that this file can be very large and may not be able to be uploaded through the ticketing system. You can split the file into smaller files with the split command on the NLS (or another Linux machine) command line:

split -b 5000000 /tmp/system-profile.tar.gz system-profile- -d
The above command will split system-profile.tar.gz into 5 MB segments and save them to files with the naming convention system-profile-nn. Please send me the profile in a private message.
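On the receiving side, the pieces can be joined back together with cat. The sketch below runs the whole round trip on a throwaway file and confirms that the rebuilt archive is byte-for-byte identical. The 12 MB random file and the name rebuilt.tar.gz are stand-ins chosen for the demo, not part of a real profile.

```shell
#!/usr/bin/env bash
# Demo: split a file into 5 MB pieces, rebuild it, and compare.
# The random file below stands in for /tmp/system-profile.tar.gz.

workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

# Create 12,000,000 bytes of sample data.
head -c 12000000 /dev/urandom > system-profile.tar.gz

# Same split options as above: 5,000,000-byte pieces, numeric suffixes.
split -b 5000000 -d system-profile.tar.gz system-profile-

# Three pieces result: system-profile-00, -01 and -02.
ls system-profile-[0-9]*

# The shell expands the glob in sorted order, so cat restores the original order.
cat system-profile-[0-9]* > rebuilt.tar.gz

# cmp is silent and returns 0 when the two files are identical.
cmp system-profile.tar.gz rebuilt.tar.gz && echo "rebuilt archive matches the original"
```

The glob `system-profile-[0-9]*` is used rather than `system-profile-*` so that only the numbered pieces are picked up.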
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.

Re: System has problems

Post by scv831 »

Hello,

I have sent the profile data in a private message. Please share your feedback as soon as possible, because the GUI is again reporting "System has problems" and logs have stopped coming in.

Re: System has problems

Post by cdienger »

The profile had a problem being generated and was only 131 bytes. If you tried generating the profile in the web UI, try running it from the command line instead.

If that fails, please check the disk usage as well as CPU usage and load from the command line with:

df -h
top
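To make the disk check repeatable, a small filter like the following could flag filesystems that are nearly full. This is a sketch under assumptions: the function name `over_threshold` and the default limit of 90% are made up for illustration. `df -P` selects the POSIX output format, which keeps each filesystem on a single line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `df -P` output on stdin and print every
# mount point whose Use% is at or above the given limit (default 90).
over_threshold() {
    awk -v limit="${1:-90}" '
        NR > 1 {                      # skip the header line
            use = $5
            sub(/%/, "", use)         # "97%" becomes "97"
            if (use + 0 >= limit + 0) print $6, $5
        }'
}

# Example call: df -P | over_threshold 90
```

An empty result means no filesystem has reached the limit. Mount points that contain spaces would not be parsed correctly by this simple field split.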

Re: System has problems

Post by scv831 »

Hello,

A 1.4M system-profile.tar.gz has been shared in a private message. Perhaps later you can help find out why the GUI didn't generate it.

I suspect there are permission issues in some directories or files. Does NLS have a script to fix permissions, like Nagios XI does?

Re: System has problems

Post by cdienger »

The / partition is almost full with only 537MB available. Find out what is consuming the space with:

du -a / | sort -n -r | head -n 20
and we may find some items that we can remove to free up space, or it may be necessary to increase the size of this partition.
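One caveat with the command above: du / descends into every mounted filesystem, so large data volumes mounted elsewhere can crowd out the entries that actually live on the root partition. A variant that stays on one filesystem could look like the sketch below. The wrapper function `largest_entries` and its defaults are hypothetical; -a, -x and -k are standard du options.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: list the largest files and directories under a path
# without crossing into other mounted filesystems.
#   $1 = starting path (default /)
#   $2 = number of entries to show (default 20)
largest_entries() {
    local path="${1:-/}" count="${2:-20}"
    # -a: include files, -x: stay on one filesystem, -k: sizes in KiB
    du -a -x -k "$path" 2>/dev/null | sort -n -r | head -n "$count"
}

# Example call: largest_entries / 20
```

Sizes are printed in KiB, so an entry of 1048576 corresponds to 1 GiB.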

Re: System has problems

Post by scv831 »

Hello,

Following your recommendations, we increased the root partition and restarted Logstash and Elasticsearch.

However, the same problems have popped up again.

[root@se002roamnagios ~]# df -kh
Filesystem               Size  Used Avail Use% Mounted on
devtmpfs                  32G     0   32G   0% /dev
tmpfs                     32G     0   32G   0% /dev/shm
tmpfs                     32G  284M   32G   1% /run
tmpfs                     32G     0   32G   0% /sys/fs/cgroup
/dev/mapper/vg00-root     34G  3.6G   31G  11% /
/dev/sda1               1014M  186M  829M  19% /boot
/dev/mapper/vg00-var     8.0G  1.2G  6.9G  15% /var
/dev/mapper/vg00-home    2.0G  482M  1.6G  24% /home
/dev/mapper/vg01-disk1   1.0T  193G  832G  19% /nagios
/dev/mapper/vg00-opt     4.0G  1.3G  2.8G  31% /opt
/dev/mapper/vg00-varlog  8.0G  576M  7.5G   8% /var/log
/dev/mapper/vg00-audit   4.0G  475M  3.6G  12% /var/log/audit
/dev/mapper/vg00-tmp     2.0G   68M  2.0G   4% /tmp
tmpfs                    6.3G     0  6.3G   0% /run/user/0
tmpfs                    6.3G     0  6.3G   0% /run/user/1002
tmpfs                    6.3G     0  6.3G   0% /run/user/48
tmpfs                    6.3G     0  6.3G   0% /run/user/504
[root@se002roamnagios ~]#

Re: System has problems

Post by cdienger »

Please provide a fresh profile so we may review the logs.

Re: System has problems

Post by scv831 »

Hello,
I have sent you the system profile. Logs have currently stopped coming in. Just to check: can incorrect filter definitions crash EL components even if they are applied successfully?

I can see lots of errors like the following in the Elasticsearch log.



[2021-02-08 20:56:22,560][DEBUG][action.search.type ] [1ac29902-0050-4635-9803-efc2d35298bd] [logstash-2021.02.03][2], node[eF4R0ydWT-KGKzOWYoELpw], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@2d095ea0] lastShard [true]
org.elasticsearch.search.query.QueryPhaseExecutionException: [logstash-2021.02.03][2]: query[ConstantScore(*:*)],from[0],size[0]: Query Failed [Failed to execute main query]
at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:163)
at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:301)
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:312)
at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:231)
at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:228)
at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:559)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.elasticsearch.ElasticsearchException: org.elasticsearch.ElasticsearchIllegalStateEx