Lose Access to LDAP > All Hosts/Services Down

Post by **warnox** » Thu Jun 23, 2016 8:44 am

Hi,

A few days ago the LDAP server that Nagios was configured to use (in /etc/httpd/conf.d/nagios.conf, for authenticating users to the web interface) was taken offline for a few hours. I'm trying to figure out why this caused Nagios to mark all hosts as down with a 'Socket timeout after 10 seconds' error.

My understanding was that this LDAP configuration was only for authenticating users to the web interface and had nothing to do with the actual host and service checks.

Thanks for any help.

Post by **rkennedy** » Thu Jun 23, 2016 11:34 am

This shouldn't have affected Nagios at all. I wonder if some sort of loop caused resources to diminish, which then caused issues with Nagios?

Are you using WMI checks at all? Do you have any log files available from that time you could share?

Post by **warnox** » Wed Jun 29, 2016 7:17 pm

rkennedy wrote:
> This shouldn't have affected Nagios at all. I wonder if some sort of loop caused resources to diminish, which then caused issues with Nagios?
>
> Are you using WMI checks at all? Do you have any log files available from that time you could share?

That's what I thought too, as LDAP has nothing to do with Nagios executing host checks. The only other factor is that the LDAP server was also a DNS server, but the Nagios CentOS box definitely has a secondary DNS server configured, and that one remained online.

What logs are you looking for? Happy to post them.

Post by **rkennedy** » Thu Jun 30, 2016 9:47 am

Can you post your /usr/local/nagios/var/nagios.log file and your /var/log/httpd/error_log for us to take a look at? (Paths may vary depending on your setup.)

This should be a good start.

Post by **warnox** » Fri Jul 01, 2016 5:54 am

rkennedy wrote:
> Can you post your /usr/local/nagios/var/nagios.log file and your /var/log/httpd/error_log for us to take a look at? (Paths may vary depending on your setup.)
>
> This should be a good start.

Is there anything in particular you're looking for? I ask because these files contain potentially sensitive data.

Post by **rkennedy** » Tue Jul 05, 2016 9:30 am

Any and all errors that may be related to LDAP. Feel free to PM the logs over if you'd like to keep them private.

Post by **warnox** » Wed Jul 13, 2016 5:10 am

rkennedy wrote:
> Any and all errors that may be related to LDAP. Feel free to PM the logs over if you'd like to keep them private.

Sorry for the delay. I've checked nagios...log files for the date and there are no lines that contain "LDAP" anywhere.

Below are the events from the error_log file for the date the issue occurred.

[Wed Jun 22 01:42:52.228303 2016] [mpm_prefork:notice] [pid 1395] AH00170: caught SIGWINCH, shutting down gracefully
[Wed Jun 22 01:42:53.364960 2016] [core:notice] [pid 26345] SELinux policy enabled; httpd running as context system_u:system_r:httpd_t:s0
[Wed Jun 22 01:42:53.366936 2016] [suexec:notice] [pid 26345] AH01232: suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)
[Wed Jun 22 01:42:53.429775 2016] [auth_digest:notice] [pid 26345] AH01757: generating secret for digest authentication ...
[Wed Jun 22 01:42:53.430819 2016] [lbmethod_heartbeat:notice] [pid 26345] AH02282: No slotmem from mod_heartmonitor
[Wed Jun 22 01:42:53.445078 2016] [mpm_prefork:notice] [pid 26345] AH00163: Apache/2.4.6 (CentOS) PHP/5.4.16 configured -- resuming normal operations
[Wed Jun 22 01:42:53.445109 2016] [core:notice] [pid 26345] AH00094: Command line: '/usr/sbin/httpd -D FOREGROUND'
[Mon Jun 27 03:22:02.031759 2016] [mpm_prefork:notice] [pid 26345] AH00171: Graceful restart requested, doing restart

Post by **tgriep** » Wed Jul 13, 2016 12:48 pm

Is the Nagios server configured to use LDAP for UNIX shell logins?
Was the LDAP server powered off, or were only its services shut down?
My guess is that the DNS server was down and the Nagios server didn't start using the secondary DNS server.

Can you post the errors from the nagios archived log file for that day so we can see them?
The archive file can be found here.

/usr/local/nagios/var/archives

Post by **warnox** » Thu Jul 14, 2016 3:02 am

No, not using LDAP for UNIX shell login.

The LDAP server was intermittently unreachable, but the secondary DNS server would have been up the whole time. I suspected DNS too, but I can't see a reason for CentOS not to have used the secondary DNS server.

PM sent with the log file.

Post by **tgriep** » Thu Jul 14, 2016 9:38 am

Thanks for the log file.
It looks like the Nagios system couldn't connect to a router, and then the checks for all of the hosts behind it started to time out because those hosts could not be reached while the router was down.
Later on, a second router went down at another site, causing the same issue there.
From what I can see, this is all normal behavior and is not related to LDAP.
You may want to set up a parent-child relationship so that, if this happens again, you will not get notifications for the hosts behind a router while that router is down.
Take a look at this document for more details.
https://assets.nagios.com/downloads/nag ... ility.html
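
As a rough illustration of the parent-child setup, here is a minimal sketch of two host definitions. The host names, addresses, and the `generic-host` template are assumptions made up for this example and are not taken from the log file, so adjust them to match your own object configuration.

```cfg
# Hypothetical router that sits between the Nagios server and a remote site.
define host {
    use                     generic-host        ; assumption: a template with this name exists
    host_name               site-a-router       ; hypothetical name
    address                 192.0.2.1           ; documentation-range address
}

# Hypothetical host behind that router.
define host {
    use                     generic-host
    host_name               site-a-web01        ; hypothetical name
    address                 192.0.2.10
    parents                 site-a-router       ; reachable only through the router
    notification_options    d,r                 ; omit "u" to skip UNREACHABLE notifications
}
```

With `parents` set, Nagios should mark `site-a-web01` as UNREACHABLE instead of DOWN when `site-a-router` is down, and leaving `u` out of `notification_options` keeps it from notifying for that state. Running the usual config check (`nagios -v` against your main config file) before restarting is a good idea.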

Nagios Support Forum
