

Service Check Timeouts

Postby Dusan.Mandic » Wed Jan 05, 2022 7:13 pm

Hello all,

We have a recurring issue with service checks timing out and triggering notifications. It is occurring on multiple hosts. Our load is also in the 30s (16-core VM), which is probably contributing to the problem.

The system profile is attached.

Moderator's Note: The profile has been shared with the support team but has been removed from the public forum.
Dusan.Mandic
 
Posts: 60
Joined: Mon Apr 06, 2020 2:30 pm

Re: Service Check Timeouts

Postby ssax » Thu Jan 06, 2022 1:45 pm

I see this:

[Wed Jan 05 03:01:50.206051 2022] [:error] [pid 596] [client X.X.X.X:59566] PHP Warning: mysqli::mysqli(): (08004/1040): Too many connections in /usr/local/nagiosxi/html/includes/components/opscreen/merlin.php on line 25, referer: https://XXXXXXX/nagiosxi/includes/compo ... screen.php


Please add these under the [mysqld] section of your /etc/my.cnf:

Code:
[mysqld]
max_allowed_packet=512M
max_connections=1000
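
A quick way to confirm the values ended up under [mysqld] and not under another section is to read them back from the file. This is only a sketch: the function name `mysqld_setting` is made up for illustration, and it does not follow include directives or treat `max-connections` and `max_connections` as the same key the way the server does.

```shell
#!/bin/sh
# mysqld_setting KEY: read a my.cnf-style file on stdin and print the last
# value assigned to KEY inside the [mysqld] section (hypothetical helper).
mysqld_setting() {
    awk -v key="$1" '
        # A section header switches the "inside [mysqld]" flag on or off.
        /^[[:space:]]*\[/ { in_section = ($0 ~ /^[[:space:]]*\[mysqld\]/); next }
        # Skip comment lines; only look at lines inside [mysqld].
        in_section && $0 !~ /^[[:space:]]*[#;]/ {
            n = index($0, "=")
            if (n == 0) next
            name = substr($0, 1, n - 1)
            value = substr($0, n + 1)
            gsub(/[[:space:]]/, "", name)
            gsub(/^[[:space:]]+|[[:space:]]+$/, "", value)
            if (name == key) found = value
        }
        END { if (found != "") print found }'
}
```

Usage would look like `mysqld_setting max_connections < /etc/my.cnf`. No output means the key was not found under [mysqld].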


Then restart these services:

Code:
systemctl restart mariadb nagios httpd crond


If that doesn't alleviate it, it may be related to Trend Micro. It is using quite a bit of CPU (305.6%), so I would disable it as a test and see if that helps resolve your issue:

Code:
6654 root      20   0 9768056 361804  40084 S 305.6  1.1   8087:20 ds_am


It is likely interfering with how fast things need to go and processes/jobs/checks are getting queued up and timing out.
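
To spot processes like ds_am without reading top by hand, something along these lines could filter `ps` output for heavy CPU users. The function name `high_cpu` and the default threshold of 100 are assumptions for illustration. Note that `ps` reports CPU averaged over the process lifetime, so the numbers will not match top exactly.

```shell
#!/bin/sh
# high_cpu [THRESHOLD]: read "PID COMMAND %CPU" lines on stdin and print the
# ones whose %CPU is above THRESHOLD (default 100, i.e. more than one core).
# A header line is skipped naturally because "%CPU" evaluates to 0.
high_cpu() {
    awk -v limit="${1:-100}" '$3 + 0 > limit + 0 { print $1, $2, $3 }'
}

# Sample input based on the top line quoted above.
printf '%s\n' '6654 ds_am 305.6' '1234 nagios 2.0' | high_cpu 100
```

On a live system the input could come from `ps -eo pid,comm,pcpu | high_cpu 100`.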

Please PM the output of these commands as root/sudo:

Code:
sar -A
ulimit -a
su -s /bin/bash -c 'ulimit -a' nagios
su -s /bin/bash -c 'ulimit -a' mysql
su -s /bin/bash -c 'ulimit -a' apache


Additionally, please send the output of this command:
- NOTE: You may need to adjust -uroot and -pnagiosxi in the command if you've changed the root MySQL password.

Code:
echo "SELECT table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES WHERE table_schema IN ('nagios', 'nagiosql', 'nagiosxi');" | mysql -uroot -pnagiosxi --table
ssax
Dreams In Code
 
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Service Check Timeouts

Postby Dusan.Mandic » Wed Jan 12, 2022 12:50 pm

It seems to only time out on certain service checks from 10.200.247.xxx, 10.200.235.xxx and 10.200.249.xxx. Is there any way to isolate why these timed out?
Dusan.Mandic
 
Posts: 60
Joined: Mon Apr 06, 2020 2:30 pm

Re: Service Check Timeouts

Postby pbroste » Thu Jan 13, 2022 3:01 pm

Hello @Dusan.Mandic

@ssax is out of the office this week, so I want to follow up with you on his behalf.

From your previous response it sounds like you want to verify events from within the address ranges listed:

Code:
grep -Ei "10\.200\.247\.[0-9]{1,3}|10\.200\.235\.[0-9]{1,3}|10\.200\.249\.[0-9]{1,3}" /usr/local/nagiosxi/var/*.log --color=always | less -SR


Read through the results and let us know if you see anything that stands out or that the entries have in common.
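
If the output is too long to read through, a per-subnet count can show which range is noisiest. This is a rough sketch: the function name `count_by_subnet` is made up, it relies on `grep -o` (available in GNU and BSD grep), and it counts every matching address rather than only timeout events.

```shell
#!/bin/sh
# count_by_subnet: read log lines on stdin, pull out addresses in the three
# ranges mentioned above, and print "SUBNET COUNT" sorted by count, highest first.
count_by_subnet() {
    grep -Eo '10\.200\.(247|235|249)\.[0-9]{1,3}' |
        cut -d. -f1-3 |
        sort | uniq -c | sort -rn |
        awk '{ print $2, $1 }'
}
```

For example: `cat /usr/local/nagiosxi/var/*.log | count_by_subnet`.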

Thanks,
Perry
pbroste
 
Posts: 1287
Joined: Tue Jun 01, 2021 1:27 pm

Re: Service Check Timeouts

Postby Dusan.Mandic » Thu Feb 03, 2022 5:41 pm

Still experiencing timeouts from the same hosts. I was able to pare down our vROPs polling to bring the server load down (API request reduction), so I now know it's not proc cycles.

Would you like another profile sent, @ssax?
Dusan.Mandic
 
Posts: 60
Joined: Mon Apr 06, 2020 2:30 pm

Re: Service Check Timeouts

Postby pbroste » Fri Feb 04, 2022 3:54 pm

Hello @Dusan.Mandic

Thanks for following up. I will ping @ssax and let him know that you are going to send an updated System Profile to his Private Message inbox.

Thanks,
Perry
pbroste
 
Posts: 1287
Joined: Tue Jun 01, 2021 1:27 pm

Re: Service Check Timeouts

Postby ssax » Tue Feb 08, 2022 7:40 pm

If it's the same hosts that are timing out, does it occur periodically, at around the same times?

Is there any consistency to the timing of them failing?

Since it's the same ones in the same subnets, are you checking them over a VPN tunnel that could be having issues/routing issues? (VPN tunnels can be set up by subnet or by host, so if it's a subnet-based VPN tunnel and it dropped/re-established, it would take down all hosts in that tunnel, as an example.)

If it happens during the same times, check backups, vmotions, off-server jobs, vulnerability scanning, etc. that could be causing the systems OR the tunnel/interfaces in the network path to overload and drop packets. I've seen all of those take down systems like that in a network. I worked at a place with an old router: when we implemented vulnerability scanning and it scanned the remote systems, it would overload the router interface (too much data for the old hardware) and then cause connectivity issues.

Those are some good places to start. I would also check the network statistics on the network device interfaces in the path. Maybe you sometimes get an invalid route/asymmetric routing if you're using a dynamic routing protocol such as BGP/EIGRP/OSPF.

If it were a network issue on the XI server itself, other hosts/services would show the same problem, so the cause is likely external to the XI server.

You can generally increase the plugin timeouts to account for it but you may need to investigate the network path to determine where the failure is coming from if it's impacting entire subnets.
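
To check whether the timeouts cluster at particular times, the log entries could be bucketed by hour of day. This sketch assumes the Nagios Core log format of a leading `[epoch]` timestamp followed by a message containing "timed out"; the function name `timeouts_by_hour` is made up, and the hours are in UTC, so adjust for your time zone.

```shell
#!/bin/sh
# timeouts_by_hour: read nagios.log-style lines on stdin and print
# "HOUR COUNT" for every UTC hour with at least one "timed out" entry.
timeouts_by_hour() {
    awk '/timed out/ && match($0, /^\[[0-9]+\]/) {
        ts = substr($0, 2, RLENGTH - 2)      # digits between the brackets
        hour = int((ts % 86400) / 3600)      # seconds into the day -> hour
        count[hour]++
    }
    END {
        for (h = 0; h < 24; h++)
            if (count[h] > 0) printf "%02d %d\n", h, count[h]
    }'
}
```

On an XI server the log is typically at /usr/local/nagios/var/nagios.log, so `timeouts_by_hour < /usr/local/nagios/var/nagios.log` would be the usual call. A flat spread across hours points away from scheduled jobs such as backups or scans.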
ssax
Dreams In Code
 
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Service Check Timeouts

Postby Dusan.Mandic » Tue Feb 15, 2022 11:37 am

There's no correlation in timing that I can see, but it seems to be the check_Unanswered Messages in MSG Queue service (check AS400 msg plugin) across those 4 hosts. I would think that if it were a network issue, we would see different service checks dropping, not just that one. Can someone please look into the profile I sent for service check timeouts concerning that service?

Our network is all internal, and most of the timeouts occur without rhyme or reason. The firewall is all open, and I don't see any drops anywhere.
Dusan.Mandic
 
Posts: 60
Joined: Mon Apr 06, 2020 2:30 pm

Re: Service Check Timeouts

Postby ssax » Wed Feb 16, 2022 11:02 am

I see this consuming a lot of CPU:

6654 root      20   0 9768056 361804  40084 S 305.6  1.1   8087:20 ds_am


Please try disabling that Deep Security agent to see whether it is slowing down your checks and causing them to hit a limit and time out. The assumption is that everything Nagios does is slowed down by the agent scanning for threats. That would be my first guess based on what you're saying. If that resolves it, you would need to either contact the agent vendor and ask what can be done or increase the timeouts on your checks.

I see these as well (will cause gaps in your graphs):

[01-05-2022 18:05:30] NPCD: WARN: MAX load reached: load 48.040000/10.000000 at i=1
[01-05-2022 18:05:45] NPCD: WARN: MAX load reached: load 49.040000/10.000000 at i=1
[01-05-2022 18:06:00] NPCD: WARN: MAX load reached: load 47.240000/10.000000 at i=1
[01-05-2022 18:06:15] NPCD: WARN: MAX load reached: load 50.650000/10.000000 at i=1


2022-01-01 19:59:13 [10780] [0] *** TIMEOUT: Timeout after 20 secs. ***
2022-01-01 19:59:13 [10780] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2022-01-01 19:59:13 [10780] [0] *** TIMEOUT: Please check your npcd.cfg


Please follow this guide to set your load_threshold to 80.0 and your TIMEOUT to 40:

https://support.nagios.com/kb/article.php?id=9
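
To judge whether 80.0 leaves enough headroom, the peak load recorded in the NPCD warnings can be pulled out of the log. This is a sketch that assumes the line format shown above (`load CURRENT/THRESHOLD`); the function name `npcd_peak_load` is made up for illustration.

```shell
#!/bin/sh
# npcd_peak_load: read NPCD log lines on stdin and print the highest load
# value seen in "MAX load reached" warnings, to two decimal places.
npcd_peak_load() {
    awk '/MAX load reached/ {
        # Find the "load" token that is followed by CURRENT/THRESHOLD.
        for (i = 1; i < NF; i++)
            if ($i == "load" && $(i + 1) ~ /\//) {
                split($(i + 1), parts, "/")
                if (parts[1] + 0 > peak) peak = parts[1] + 0
            }
    }
    END { if (peak > 0) printf "%.2f\n", peak }'
}
```

Feed it the NPCD log, for example `npcd_peak_load < /usr/local/nagios/var/npcd.log` (the path is an assumption and may differ on your install). For the four lines quoted above it would report 50.65.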

Please send the output of these commands:

Code:
ulimit -a
su -s /bin/bash -c 'ulimit -a' nagios
su -s /bin/bash -c 'ulimit -a' mysql
su -s /bin/bash -c 'ulimit -a' apache
netstat -s
ethtool -S eth0


Additionally, please send the output of this command:
- NOTE: You may need to adjust -uroot and -pnagiosxi in the command if you've changed the root MySQL password.

Code:
echo "SELECT table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES WHERE table_schema IN ('nagios', 'nagiosql', 'nagiosxi');" | mysql -uroot -pnagiosxi --table
ssax
Dreams In Code
 
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Service Check Timeouts

Postby Dusan.Mandic » Wed Feb 16, 2022 4:15 pm

Are you using the new profile I sent in PM? The load issue has been resolved.

I confirmed with networking that we're not hitting our threshold limits, and the same goes for storage IOPS.

Please confirm you are using the latest profile, created 2/8/2022.
Dusan.Mandic
 
Posts: 60
Joined: Mon Apr 06, 2020 2:30 pm
