Re: Nagios Host/Service Check Timeouts
Posted: Fri Sep 03, 2021 9:57 am
by pbroste
Hello @Dusan.Mandic
Sounds good; the right size increment(s) will depend on your environment.
You can set and watch it:
Code: Select all
watch -n 5 ls -lahrt /usr/local/nagios/var/nagios.log /usr/local/nagios/var/nagios.debug
Let us know how things look,
Perry
Re: Nagios Host/Service Check Timeouts
Posted: Wed Sep 08, 2021 1:28 pm
by Dusan.Mandic
Here's the ethtool output. I just started the log today.
Re: Nagios Host/Service Check Timeouts
Posted: Thu Sep 09, 2021 11:44 am
by pbroste
Hello @Dusan.Mandic
Thanks for following up with the results. Let's go ahead and increase the
service and host timeouts to 120 seconds and bounce the nagios.service.
In:
/usr/local/ncpa/etc/ncpa.cfg
Change to:
plugin_timeout = 120
In:
/usr/local/nagios/etc/nagios.cfg
Change to:
host_check_timeout=120
service_check_timeout=120
Then restart the nagios service with systemctl restart nagios.
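As a rough sketch, the edits above could be scripted along these lines. The set_cfg_value helper is hypothetical (it is not part of Nagios or NCPA). It only rewrites a setting that is already present and uncommented, and it assumes one key = value setting per line. The calls against the real files are left commented out; back up both files before trying them.

```shell
# set_cfg_value FILE KEY VALUE
# Hypothetical helper: replace the value of an existing, uncommented
# "KEY = ..." or "KEY=..." line, keeping whatever spacing the file uses.
# Commented-out or missing keys are left untouched.
set_cfg_value() {
    cfg_file=$1; cfg_key=$2; cfg_value=$3
    cfg_tmp=$(mktemp) || return 1
    # Write through a temp file, then copy back so the original file
    # keeps its owner and permissions.
    sed "s|^\([[:space:]]*${cfg_key}[[:space:]]*=[[:space:]]*\).*|\1${cfg_value}|" \
        "$cfg_file" > "$cfg_tmp" && cat "$cfg_tmp" > "$cfg_file"
    cfg_status=$?
    rm -f "$cfg_tmp"
    return $cfg_status
}

# Intended use with the paths from this post (run as root, after a backup):
# set_cfg_value /usr/local/ncpa/etc/ncpa.cfg plugin_timeout 120
# set_cfg_value /usr/local/nagios/etc/nagios.cfg host_check_timeout 120
# set_cfg_value /usr/local/nagios/etc/nagios.cfg service_check_timeout 120
# Verify the config before restarting, so a typo does not take nagios down:
# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg && systemctl restart nagios
```

ncpa.cfg is read by the NCPA agent rather than by nagios itself, so the agent likely needs its own restart to pick up the new plugin_timeout (not shown here).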
Please let it run for a while and let us know how things are looking,
Perry
Re: Nagios Host/Service Check Timeouts
Posted: Thu Sep 09, 2021 4:10 pm
by Dusan.Mandic
Things did not go well.
This caused multiple alerts. I think about 90 criticals were thrown in a couple of minutes.
This is our production server. I was able to catch it before alerts went out, but we cannot have an event of that nature again.
Re: Nagios Host/Service Check Timeouts
Posted: Thu Sep 09, 2021 4:18 pm
by Dusan.Mandic
Re: Nagios Host/Service Check Timeouts
Posted: Fri Sep 10, 2021 10:20 am
by pbroste
Hello @Dusan.Mandic
Let's get an updated System Profile on this so we can see what is going on.
To send us your system profile:
- Log in to the Nagios XI GUI using a web browser.
- Click the "Admin" > "System Profile" Menu
- Click the "Download Profile" button
- Save the profile.zip file and share via PM
Or command:
Code: Select all
rm -rf /usr/local/nagiosxi/var/components/profile.zip
/usr/local/nagiosxi/scripts/components/getprofile.sh SUPPORT
Then send the resulting /usr/local/nagiosxi/var/components/profile.zip file via Private Message.
Thanks,
Perry
Re: Nagios Host/Service Check Timeouts
Posted: Fri Sep 10, 2021 12:47 pm
by Dusan.Mandic
PM sent.
The load has gone way up. It stood around 3/4/4 before and has now almost tripled.
Re: Nagios Host/Service Check Timeouts
Posted: Fri Sep 10, 2021 2:05 pm
by pbroste
Hello @Dusan.Mandic
You are correct that we see resource issues:
PHP Fatal error: Allowed memory size of 268435456 bytes exhausted (tried to allocate....)
and the NPCD has reached MAX LOAD too.
22731 nagios 20 0 11.6g 40312 11688 S 112.5 0.1 0:00.20 java
10032 apache 20 0 684800 72184 6124 S 100.0 0.2 0:10.72 httpd
1936 apache 20 0 643556 30884 6232 R 93.8 0.1 0:05.10 httpd
We see that
/opt/SumoCollector/jre/bin/java is taking up quite a bit; it is at the top of the resource users, listed as 'java'. You may want to review that process to see how it is behaving.
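To re-check which processes are on top without reading through a full top screen, a small helper along these lines may be useful. top_cpu is a hypothetical name, and it assumes rows of PID, USER, %CPU and command on stdin, such as ps produces with header-less columns.

```shell
# top_cpu [N]
# Hypothetical helper: read "PID USER %CPU COMMAND" rows on stdin and print
# the N rows (default 5) with the highest %CPU, highest first.
top_cpu() {
    # LC_ALL=C keeps "." as the decimal separator for the numeric sort.
    LC_ALL=C sort -k3,3 -n -r | head -n "${1:-5}"
}

# Example on a live system (the trailing "=" suppresses the ps header row):
# ps -eo pid=,user=,pcpu=,args= | top_cpu 5
```

Note that ps typically reports %CPU averaged over the life of the process, so the numbers will not match the instantaneous values top shows.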
It also appears that there are multiple definitions for the same service. Take a look at your configuration files to see where they occur. We see that this started recently, so I am not sure whether there were any bulk changes or imports. You may want to run through the database repair and CCM reindex.
Warning: Duplicate definition found for service 'CPU Stats' on host 'XXXXXXXXX' (config file '/usr/local/nagios/etc/services/RHEL_Jira.cfg'....)
You can run this check command on the configs to get results:
Code: Select all
/usr/local/nagios/bin/nagios -vvv /usr/local/nagios/etc/nagios.cfg
You should be able to run
grep -Eir -A 5 -B 5 'CPU Stats' /usr/local/nagios/etc/* to help search through them all.
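As a complement to the grep above, a sketch like the following lists host/service pairs that are defined more than once. find_duplicate_services is a hypothetical helper. It assumes each define service block names its hosts directly with host_name, so it will not catch duplicates that arrive through hostgroup_name or templates. The nagios -vvv check remains the authoritative one.

```shell
# find_duplicate_services DIR
# Hypothetical helper: scan *.cfg files under DIR and report every
# (host_name, service_description) pair that is defined more than once.
find_duplicate_services() {
    find "$1" -type f -name '*.cfg' -exec awk '
        # Strip the directive name, trailing blanks and any ";" comment.
        function value(line) {
            sub(/^[ \t]*[^ \t]+[ \t]+/, "", line)
            sub(/[ \t]*(;.*)?$/, "", line)
            return line
        }
        FNR == 1 { in_svc = 0 }   # never carry an open block across files
        /^[ \t]*define[ \t]+service[ \t]*[{]/ { in_svc = 1; hosts = ""; desc = ""; next }
        in_svc && $1 == "host_name"           { hosts = value($0) }
        in_svc && $1 == "service_description" { desc = value($0) }
        in_svc && /^[ \t]*[}]/ {
            in_svc = 0
            if (hosts == "" || desc == "") next
            # host_name may hold a comma-separated list of hosts.
            n = split(hosts, h, /[ \t]*,[ \t]*/)
            for (i = 1; i <= n; i++) {
                key = h[i] SUBSEP desc
                if (key in seen)
                    printf "duplicate: service \"%s\" on host \"%s\" (%s and %s)\n", desc, h[i], seen[key], FILENAME
                else
                    seen[key] = FILENAME
            }
        }
    ' {} +
}

# Example with the config directory from this thread:
# find_duplicate_services /usr/local/nagios/etc
```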
Here are the instructions for running through the re-index.
Reindex the Core Configuration Manager (CCM) configs:
- First, clear the import directory with the terminal command: rm -rf /usr/local/nagios/etc/import/*
- 1: Terminal command to list all running /bin/nagios processes -> ps -aux | grep -E '/bin/nagios'
- 2: Terminal command -> killall -9 nagios (or pkill nagios)
- 3: Check that the /bin/nagios processes are stopped (rerun the command from step 1)
- 4: Restart nagios.service by terminal command: systemctl restart nagios
- 5: Head over to the Nagios XI web console ==> Core Configuration Manager (CCM) ==> Config File Management ==> [Delete Files] ==> [Write Files] ==> [Verify Files]
- 6: Core Configuration Manager (CCM) ==> Under Quick Tools ==> "Apply Configuration"
- 7: Restart nagios.service by terminal command: systemctl restart nagios
Then verify that the hosts and services look good and that there are no errors in core by running:
Code: Select all
/usr/local/nagios/bin/nagios -vvv /usr/local/nagios/etc/nagios.cfg
Re: Nagios Host/Service Check Timeouts
Posted: Tue Sep 14, 2021 11:31 am
by Dusan.Mandic
Edited: Warning: Duplicate definition found for service 'CPU Stats' on host 'XXXXXXXXX' (config file '/usr/local/nagios/etc/services/RHEL_Jira.cfg'....oa
These were applied to all hosts. I removed them from the two hostgroups they were associated with. I did not do any bulk imports. If these aren't in the first system profile I uploaded in this thread, that would lead me to believe there was some sort of error on reconfigure.
Still hovering around the 9/9/9 load.
Re: Nagios Host/Service Check Timeouts
Posted: Wed Sep 15, 2021 10:57 am
by pbroste
Hello @Dusan.Mandic
We want to get a better picture of what is going on in real time, and would like to capture the following event timeline to a file so we can see where things are hanging up. Please change
<yourhostnameoripaddresshere> to a host that is logging timeout(s).
Code: Select all
tail -Fn0 /var/log/httpd/* /var/log/apache2/* /usr/local/nagios/var/* /usr/local/nagiosxi/tmp/* /usr/local/nagiosxi/var/* /var/log/syslog /var/log/messages /usr/local/nagios/var/spool/* /usr/local/nagiosxi/var/components/* | grep --line-buffered -Ei "warn|error|fail|timeout|<yourhostnameoripaddresshere>" >> /tmp/results.txt
Capture until you see a message that indicates a timeout, then press Ctrl-C to break out and send /tmp/results.txt via private message.
Thanks,
Perry