http service checks showing socket timeout after 10 seconds

Posted: Thu Jan 26, 2017 9:07 am
by mcbe
A number (but not all) of our http service checks (webpage and content) are suddenly showing as critical: socket timeout after 10 seconds. A few Process CPU plugin checks are also timing out at 15 seconds. Other host and service checks (mostly ncpa) are functioning normally.

We made a change yesterday to the Active Directory Domain Controller that we point to for authentication. Not sure if it's related, but the timing is suspicious.

There are no obvious issues in the log.
At points the CPU usage gets up to 100%, but never for more than a moment.
The process that shows 100% in top is:
/usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh

Attaching profile.zip

Re: http service checks showing socket timeout after 10 seco

Posted: Thu Jan 26, 2017 11:32 am
by bwallace
I'm betting the changes made to your DC are the cause. The profile contains a file called nagios.txt, which is basically a log of all your check results.
Every check that times out (and there are several) has an FQDN as the host, not an IP address. Only a handful are successful. In the UI, go to Home > Incident Management > Notifications to see for yourself; perhaps the difference between the working and non-working FQDNs will mean more to you.

By default, if Nagios does not get a reply from the monitored host or service within 10 seconds, it marks the check as "CRITICAL - Socket timeout after 10 seconds".
You could work around this by raising the "Socket timeout" value from the default 10 seconds to 20 or 30. However, that is a band-aid approach; we should find out why the timeout occurs in the first place.

Also, as a test, run the check from the command line, just to see if the results differ.

I'm not sure what your check_command is, but for example, here is one for check_http:

/usr/local/nagios/libexec/check_http -H yoursite.com

Do you get a different result when running it manually?
Feel free to post the results here along with your check command.
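
When comparing manual runs (for example FQDN against IP, or the default timeout against a longer one), it can help to record the exit code and wall-clock time of each run. The sketch below is only an illustration: `timed` is a hypothetical helper name, and the host in the commented examples is the placeholder used above, not a real target.

```shell
# timed is a hypothetical helper: run any command, then report its
# exit code and elapsed wall-clock time in whole seconds.
timed() {
    start=$(date +%s)
    "$@"
    rc=$?
    end=$(date +%s)
    echo "exit=$rc elapsed=$((end - start))s"
    return "$rc"
}

# Example invocations (yoursite.com is the placeholder from above):
# timed /usr/local/nagios/libexec/check_http -H yoursite.com
# timed /usr/local/nagios/libexec/check_http -H yoursite.com -t 30
```

Nagios plugins exit with 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN, so the reported exit code shows what state the check would have produced.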

Re: http service checks showing socket timeout after 10 seco

Posted: Thu Jan 26, 2017 11:35 am
by tgriep
In the Apache error log, I am seeing some permission errors for the AD certs, which could be preventing the system from logging in to the AD server correctly.
Can you log in to the server as root and run the following commands to fix the permissions?

chown apache:nagios /etc/openldap /etc/openldap/cacerts /etc/openldap/certs
chmod 775 /etc/openldap /etc/openldap/certs /etc/openldap/cacerts

Then restart the Apache and Nagios processes by running the following:

service httpd restart
service nagios restart

Another thing I found is that the Database Backend process is not running on your server, so the system is not storing data in the MySQL database, and that is probably causing the timeouts.
To start it, run the following:

service ndo2db start

Let the system run for 5 to 10 minutes and see if the errors are gone.
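
To confirm that the ownership and mode changes took effect on the three directories named above, something like the following could be used. This is a minimal sketch that assumes GNU coreutils `stat`; `show_perms` is a hypothetical helper name.

```shell
# show_perms is a hypothetical helper; "stat -c" is the GNU coreutils form.
show_perms() {
    for path in "$@"; do
        if [ -e "$path" ]; then
            # %U = owner, %G = group, %a = octal mode, %n = file name
            stat -c '%U:%G %a %n' "$path"
        else
            echo "missing $path"
        fi
    done
}

show_perms /etc/openldap /etc/openldap/cacerts /etc/openldap/certs
```

After the chown and chmod commands above, each directory would be expected to show `apache:nagios 775` in front of its path.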

Re: http service checks showing socket timeout after 10 seco

Posted: Thu Jan 26, 2017 3:29 pm
by mcbe
Thanks for the responses.

Running from command line returns the same error.
[nagios@nagiasp01 openldap]$ /usr/local/nagios/libexec/check_http -f follow -I dcriapps.dcri.duke.edu -u "/emove"
CRITICAL - Socket timeout after 10 seconds

Also tried with the IP instead of the FQDN. Same result:
[nagios@nagiasp01 ~]$ /usr/local/nagios/libexec/check_http -f follow -I 10.0.130.61 -u "/emove"
CRITICAL - Socket timeout after 10 seconds

Changed the ownership and permissions of those directories, but to no effect.

The ndo2db service is now running.

Attaching an updated profile.zip

Re: http service checks showing socket timeout after 10 seco

Posted: Thu Jan 26, 2017 3:36 pm
by mcbe
Also confirmed that the check completes successfully when the socket timeout is extended to 30 seconds. Still, we'd prefer to understand what is causing the timeout and resolve it.

[nagios@nagiasp01 ~]$ /usr/local/nagios/libexec/check_http -s "Employee Move" -f follow -I 10.0.130.61 -u "/emove" -S -p 443 -t 30
HTTP OK: HTTP/1.1 200 OK - 6015 bytes in 10.393 second response time |time=10.393486s;;;0.000000 size=6015B;;;0

Re: http service checks showing socket timeout after 10 seco

Posted: Thu Jan 26, 2017 3:49 pm
by dwhitfield
mcbe wrote: The ndo2db service is now running.
Was that before or after you ran the checks that timed out at 10 seconds? If after, that might be your answer.

It looks like they are just barely over 10 seconds. You could try 15. It doesn't give you an answer for why things are taking longer, but that could be a long rabbit hole.

How many of the devices you monitor do you have root access to? Do you have access to the network infrastructure? I'm just trying to figure out how easy it is going to be to get to a root cause here (assuming it wasn't the authentication or database issues already mentioned)

Re: http service checks showing socket timeout after 10 seco

Posted: Thu Jan 26, 2017 4:12 pm
by tgriep
All of the other errors look to be resolved so that is good.

As for the timeout issue: the check_http plugin has a default timeout of 10 seconds, and in your last post it took 10.393 seconds to retrieve the data from that site, which is enough to cause the timeout.
Check whether the server is under load now, or whether there is more network latency between the Nagios server and that web site.
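
To see how close a check is running to its limit, the `time=` value in the plugin's performance data can be pulled out and compared against the timeout. The sketch below uses the output line posted earlier in this thread; `response_time` and `over_timeout` are hypothetical helper names, not part of the plugin.

```shell
# Extract the "time=" value (in seconds) from one line of check_http output.
response_time() {
    sed -n 's/.*|time=\([0-9.]*\)s.*/\1/p'
}

# $1 = measured seconds, $2 = timeout in seconds.
# Exits 0 when the measured time is greater than the timeout, 1 otherwise.
over_timeout() {
    awk -v t="$1" -v limit="$2" 'BEGIN { exit !(t > limit) }'
}

# Output line copied from the successful run with -t 30 above.
line='HTTP OK: HTTP/1.1 200 OK - 6015 bytes in 10.393 second response time |time=10.393486s;;;0.000000 size=6015B;;;0'
t=$(printf '%s\n' "$line" | response_time)

if over_timeout "$t" 10; then
    echo "response ${t}s exceeds the 10s default"
fi
```

Running the same comparison with 15 instead of 10 shows that this particular response would fit inside a 15 second timeout.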

Re: http service checks showing socket timeout after 10 seco

Posted: Thu Jan 26, 2017 4:24 pm
by tgriep
One more thing I found that should be changed for the MySQL/ndo2db issue.
Edit the file /usr/local/nagiosxi/html/config.inc.php and change the following line (line 49) from

"dbserver" => '127.0.0.1',
to

"dbserver" => 'localhost',

Save the file, then restart the daemons by running the following as root:

service httpd restart
service nagios stop
service ndo2db restart
service nagios start

Try that, then check whether the /var/log/messages file shows any new ndo2db errors.
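
One rough way to spot new ndo2db entries is to count the matching lines before and after the restart. This is only a sketch: `ndo2db_lines` is a hypothetical helper name, reading /var/log/messages usually requires root, and not every line that mentions ndo2db is an error, so any new lines still need to be read.

```shell
# ndo2db_lines is a hypothetical helper; the default path is the one named above.
ndo2db_lines() {
    log=${1:-/var/log/messages}
    # -c prints the number of matching lines. grep exits 1 when that number
    # is 0, so "|| true" keeps the helper from failing under "set -e".
    grep -c 'ndo2db' "$log" || true
}

# Example use, as root:
# before=$(ndo2db_lines)
# ... restart the daemons as described above and wait a few minutes ...
# after=$(ndo2db_lines)
# echo "new ndo2db lines: $((after - before))"
```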