Nagios Host/Service Check Timeouts

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
pbroste
Posts: 1288
Joined: Tue Jun 01, 2021 1:27 pm

Re: Nagios Host/Service Check Timeouts

Post by pbroste »

Hello @Dusan.Mandic

Sounds good; the right size increment(s) will depend on your environment.

You can set it and then watch the log files:

Code: Select all

watch -n 5 ls -lahrt /usr/local/nagios/var/nagios.log /usr/local/nagios/var/nagios.debug
Let us know how things look,
Perry
Dusan.Mandic
Posts: 60
Joined: Mon Apr 06, 2020 2:30 pm

Re: Nagios Host/Service Check Timeouts

Post by Dusan.Mandic »

Here's the ethtool. Just started the log today.
pbroste
Posts: 1288
Joined: Tue Jun 01, 2021 1:27 pm

Re: Nagios Host/Service Check Timeouts

Post by pbroste »

Hello @Dusan.Mandic

Thanks for following up with the results. Let's go ahead and increase the service and host check timeouts to 120 seconds and restart nagios.service.

In: /usr/local/ncpa/etc/ncpa.cfg
Change to:
plugin_timeout = 120
In: /usr/local/nagios/etc/nagios.cfg
Change to:
host_check_timeout=120
service_check_timeout=120
Then restart the nagios service:

Code: Select all

systemctl restart nagios
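
Since this is a change to a production config, a cautious way to apply it might look like the sketch below. It is only an illustration: `set_cfg_value` is a hypothetical helper (not part of Nagios), it assumes the `key=value` format used in nagios.cfg, and it keeps a `.bak` copy of the file so the change can be rolled back.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Nagios): set key=value in a
# nagios.cfg-style file. A .bak copy is made the first time the file is touched.
set_cfg_value() {
    local file=$1 key=$2 value=$3 tmp
    # Keep the earliest backup; do not overwrite it on later calls.
    [ -e "$file.bak" ] || cp -p -- "$file" "$file.bak"
    if grep -q "^${key}=" "$file"; then
        # Key already present: rewrite its value.
        tmp=$(mktemp)
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "$tmp"
        cat "$tmp" > "$file"   # overwrite contents, keeping owner and mode
        rm -f "$tmp"
    else
        # Key missing: append it.
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Example usage (paths from this thread; run as root):
#   set_cfg_value /usr/local/nagios/etc/nagios.cfg host_check_timeout 120
#   set_cfg_value /usr/local/nagios/etc/nagios.cfg service_check_timeout 120
#   /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg && systemctl restart nagios
```

Verifying the config with `nagios -v` before the restart means a typo in the file stops the restart rather than the monitoring.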
Please let it run for a while and let us know how things are looking,
Perry
Dusan.Mandic
Posts: 60
Joined: Mon Apr 06, 2020 2:30 pm

Re: Nagios Host/Service Check Timeouts

Post by Dusan.Mandic »

Things did not go well.

This caused multiple alerts. I think there were about 90 criticals thrown in a couple of minutes.

This is our production server. I was able to catch it before alerts went out, but we cannot have an event of that nature again.
Dusan.Mandic
Posts: 60
Joined: Mon Apr 06, 2020 2:30 pm

Re: Nagios Host/Service Check Timeouts

Post by Dusan.Mandic »

.
results.txt
pbroste
Posts: 1288
Joined: Tue Jun 01, 2021 1:27 pm

Re: Nagios Host/Service Check Timeouts

Post by pbroste »

Hello @Dusan.Mandic

Let's get an updated System Profile on this so we can see what is going on.

To send us your system profile:
  • Log in to the Nagios XI GUI using a web browser.
  • Click the "Admin" > "System Profile" menu.
  • Click the "Download Profile" button.
  • Save the profile.zip file and share it via PM.
Or from the command line:

Code: Select all

rm -rf /usr/local/nagiosxi/var/components/profile.zip
/usr/local/nagiosxi/scripts/components/getprofile.sh SUPPORT

Then send the resulting /usr/local/nagiosxi/var/components/profile.zip file via Private Message.

Thanks,
Perry
Dusan.Mandic
Posts: 60
Joined: Mon Apr 06, 2020 2:30 pm

Re: Nagios Host/Service Check Timeouts

Post by Dusan.Mandic »

PM sent.

The load has gone way up. It stood around 3/4/4 before and has now almost tripled.
pbroste
Posts: 1288
Joined: Tue Jun 01, 2021 1:27 pm

Re: Nagios Host/Service Check Timeouts

Post by pbroste »

Hello @Dusan.Mandic

You are correct that we see resource issues:
PHP Fatal error: Allowed memory size of 268435456 bytes exhausted (tried to allocate....) and the NPCD has reached MAX LOAD too.
22731 nagios 20 0 11.6g 40312 11688 S 112.5 0.1 0:00.20 java
10032 apache 20 0 684800 72184 6124 S 100.0 0.2 0:10.72 httpd
1936 apache 20 0 643556 30884 6232 R 93.8 0.1 0:05.10 httpd
We see that /opt/SumoCollector/jre/bin/java is using quite a bit of CPU; it is at the top of the resource users, listed as 'java'. You may want to review that process.

It also appears that there are multiple definitions for the same service. Take a look at your configuration files to see where they occur. We see that this started only recently, so I am not sure whether there were any bulk changes or imports. You may want to run through the database repair and CCM reindex.
Warning: Duplicate definition found for service 'CPU Stats' on host 'XXXXXXXXX' (config file '/usr/local/nagios/etc/services/RHEL_Jira.cfg'....)
You can run this check command on the configs to get results:

Code: Select all

/usr/local/nagios/bin/nagios -vvv /usr/local/nagios/etc/nagios.cfg
You should be able to run grep -Eir -A 5 -B 5 'CPU Stats' /usr/local/nagios/etc/* to help search through them all.
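
If the verify output lists many of these warnings, it may help to count them per service before digging through the files. The sketch below is an illustration only: `count_duplicates` is a hypothetical helper, and it assumes the warning lines use the exact wording quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read Nagios verify output on stdin and print, per
# service name, how many "Duplicate definition" warnings were reported.
# Assumes the warning format quoted in this thread.
count_duplicates() {
    sed -n "s/.*Duplicate definition found for service '\([^']*\)' on host.*/\1/p" \
        | sort | uniq -c | sort -rn \
        | awk '{ n = $1; $1 = ""; sub(/^ /, ""); print n "\t" $0 }'
}

# Example usage:
#   /usr/local/nagios/bin/nagios -vvv /usr/local/nagios/etc/nagios.cfg 2>&1 | count_duplicates
```

Each output line is a count, a tab, and the service name, with the most frequent duplicates first.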

Here are the instructions on how to run through the reindex.

Reindex the Core Configuration Manager (CCM) configs:
  • rm -rf /usr/local/nagios/etc/import/*
  • 1: Terminal command to list all running /bin/nagios processes -> ps -aux | grep -E '/bin/nagios'
  • 2: Terminal command -> killall -9 nagios (or pkill nagios)
  • 3: Terminal command to check that the /bin/nagios processes are stopped (for example, repeat the command from step 1)
  • 4: Restart nagios.service by terminal command: systemctl restart nagios
  • 5: Head over to the Nagios XI web console ==> Core Configuration Manager (CCM) ==> Config File Management ==> [Delete Files] ==> [Write Files] ==> [Verify Files]
  • 6: Core Configuration Manager (CCM) ==> Under Quick Tools ==> "Apply Configuration"
  • 7: Restart nagios.service by terminal command:

Code: Select all

systemctl restart nagios

Then verify that the hosts and services look good and that there are no errors in core by running:

Code: Select all

/usr/local/nagios/bin/nagios -vvv /usr/local/nagios/etc/nagios.cfg
Dusan.Mandic
Posts: 60
Joined: Mon Apr 06, 2020 2:30 pm

Re: Nagios Host/Service Check Timeouts

Post by Dusan.Mandic »

Edited: Warning: Duplicate definition found for service 'CPU Stats' on host 'XXXXXXXXX' (config file '/usr/local/nagios/etc/services/RHEL_Jira.cfg'....

These were applied to all hosts. I removed them from the two hostgroups they were associated with. I did not do any bulk imports. If these aren't in the first system profile I uploaded in this thread, that would lead me to believe there was some sort of error on reconfigure.

The load is still hovering around 9/9/9.
pbroste
Posts: 1288
Joined: Tue Jun 01, 2021 1:27 pm

Re: Nagios Host/Service Check Timeouts

Post by pbroste »

Hello @Dusan.Mandic

We want to get a better picture of what is going on in real time, and would like to capture the following event timeline to a file so we can see where things are hanging up. Please change <yourhostnameoripaddresshere> to a host that is logging timeout(s).

Code: Select all

tail -Fn0 /var/log/httpd/* /var/log/apache2/* /usr/local/nagios/var/* /usr/local/nagiosxi/tmp/* /usr/local/nagiosxi/var/* /var/log/syslog /var/log/messages /usr/local/nagios/var/spool/* /usr/local/nagiosxi/var/components/* | grep --line-buffered -Ei "warn|error|fail|timeout|<yourhostnameoripaddresshere>" >> /tmp/results.txt
Capture until you see a message that indicates a timeout, then press Ctrl-C to stop the capture and send /tmp/results.txt via private message.
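
Before sending the file, a quick tally of what was captured can show whether any timeouts were actually caught. This is a rough sketch only: `summarize_capture` is a hypothetical helper, and its keywords simply mirror the grep pattern above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for each keyword from the grep pattern used above,
# count the lines of a capture file that contain it (case-insensitive).
summarize_capture() {
    local file=$1 kw count
    for kw in warn error fail timeout; do
        # grep -c prints 0 but exits non-zero when nothing matches.
        count=$(grep -ci -- "$kw" "$file" || true)
        printf '%s\t%s\n' "$kw" "$count"
    done
}

# Example usage:
#   summarize_capture /tmp/results.txt
```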

Thanks,
Perry