Lose Access to LDAP > All Hosts/Services Down
Hi,
A few days ago the LDAP server that Nagios was configured to use (in /etc/httpd/conf.d/nagios.conf, for authenticating users to the web interface) was taken offline for a few hours. I'm trying to figure out why this caused Nagios to mark all hosts as down with a 'Socket timeout after 10 seconds' error.
My understanding was that this LDAP configuration was only for authenticating users to the web interface, and had nothing to do with the actual checks.
Thanks for any help.
Re: Lose Access to LDAP > All Hosts/Services Down
This shouldn't have affected Nagios at all. I wonder if some sort of loop occurred that exhausted resources, which then caused issues with Nagios?
Are you using WMI checks at all? Do you have any log files available from that time you could share?
Former Nagios Employee
Re: Lose Access to LDAP > All Hosts/Services Down
rkennedy wrote: "This shouldn't have affected nagios at all. I wonder if there was some sort of loop that happened causing resources to diminish which then caused issues with Nagios? Are you using WMI checks at all? Do you have any log files available from that time you could share?"
That's what I thought too, as LDAP has nothing to do with Nagios executing host checks. The only other thing is that the LDAP server was also a DNS server, but the Nagios CentOS box definitely has a secondary DNS server configured, which remained online.
What logs are you looking for? Happy to post them.
Re: Lose Access to LDAP > All Hosts/Services Down
Can you post your /usr/local/nagios/var/nagios.log file, and also your /var/log/httpd/error_log for us to take a look at? (paths may vary depending on your setup.)
This should be a good start.
Re: Lose Access to LDAP > All Hosts/Services Down
rkennedy wrote: "Can you post your /usr/local/nagios/var/nagios.log file, and also your /var/log/httpd/error_log for us to take a look at? (paths may vary depending on your setup.) This should be a good start."
Is there anything in particular you're looking for? I ask only because these files contain potentially sensitive data.
Re: Lose Access to LDAP > All Hosts/Services Down
Any errors at all that may be related to LDAP. Feel free to PM the logs over if you'd like to keep them private.
Re: Lose Access to LDAP > All Hosts/Services Down
rkennedy wrote: "Any / all kind of errors that may be related to LDAP. Feel free to PM it over if you'd like to keep it private."
Sorry for the delay. I've checked the nagios...log files for the date and there are no lines that contain "LDAP" anywhere.
Below are the events from the error_log file for the date the issue occurred.
[Wed Jun 22 01:42:52.228303 2016] [mpm_prefork:notice] [pid 1395] AH00170: caught SIGWINCH, shutting down gracefully
[Wed Jun 22 01:42:53.364960 2016] [core:notice] [pid 26345] SELinux policy enabled; httpd running as context system_u:system_r:httpd_t:s0
[Wed Jun 22 01:42:53.366936 2016] [suexec:notice] [pid 26345] AH01232: suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)
[Wed Jun 22 01:42:53.429775 2016] [auth_digest:notice] [pid 26345] AH01757: generating secret for digest authentication ...
[Wed Jun 22 01:42:53.430819 2016] [lbmethod_heartbeat:notice] [pid 26345] AH02282: No slotmem from mod_heartmonitor
[Wed Jun 22 01:42:53.445078 2016] [mpm_prefork:notice] [pid 26345] AH00163: Apache/2.4.6 (CentOS) PHP/5.4.16 configured -- resuming normal operations
[Wed Jun 22 01:42:53.445109 2016] [core:notice] [pid 26345] AH00094: Command line: '/usr/sbin/httpd -D FOREGROUND'
[Mon Jun 27 03:22:02.031759 2016] [mpm_prefork:notice] [pid 26345] AH00171: Graceful restart requested, doing restart
Re: Lose Access to LDAP > All Hosts/Services Down
Is the server configured to use LDAP for the UNIX shell login?
Was the LDAP server powered off, or were just its services shut down?
My guess is that the DNS server was down and the Nagios server didn't fail over to the secondary DNS server.
Can you post the errors from the Nagios archived log file for that day so we can see them?
The archived log files can be found here:
/usr/local/nagios/var/archives
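If it helps to narrow things down before posting, here is a minimal Python sketch for pulling the host DOWN/UNREACHABLE alerts out of a Nagios log, so you can see which host failed first. It assumes the usual `[epoch] HOST ALERT: host;STATE;TYPE;attempt;output` line layout; the sample lines and host names (`router1`, `web1`) are made up for illustration and are not from the poster's log.

```python
import re
from datetime import datetime, timezone

# Assumed Nagios log layout for host alerts:
# [epoch] HOST ALERT: host;STATE;STATE_TYPE;attempt;plugin output
HOST_ALERT = re.compile(
    r"^\[(?P<epoch>\d+)\] HOST ALERT: "
    r"(?P<host>[^;]+);(?P<state>[^;]+);(?P<type>[^;]+);(?P<attempt>\d+);(?P<output>.*)$"
)


def down_events(lines):
    """Return (utc_time, host, state_type, output) for each DOWN/UNREACHABLE host alert."""
    events = []
    for line in lines:
        match = HOST_ALERT.match(line.strip())
        if match and match["state"] in ("DOWN", "UNREACHABLE"):
            when = datetime.fromtimestamp(int(match["epoch"]), tz=timezone.utc)
            events.append((when, match["host"], match["type"], match["output"]))
    # Sort by time so the first entry is the earliest failure.
    events.sort(key=lambda event: event[0])
    return events


# Hypothetical sample lines, for illustration only.
SAMPLE = [
    "[1466560000] HOST ALERT: router1;DOWN;SOFT;1;CRITICAL - Socket timeout after 10 seconds",
    "[1466560030] SERVICE ALERT: web1;HTTP;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds",
    "[1466560060] HOST ALERT: web1;DOWN;HARD;3;CRITICAL - Socket timeout after 10 seconds",
    "[1466563600] HOST ALERT: router1;UP;HARD;1;PING OK",
]

for when, host, state_type, output in down_events(SAMPLE):
    print(when.isoformat(), host, state_type, output)
```

To run it against a real file, pass an open file object for one of the logs in the archives directory instead of `SAMPLE`.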
Re: Lose Access to LDAP > All Hosts/Services Down
No, not using LDAP for UNIX shell login.
The LDAP server was intermittently not reachable but secondary DNS would've been up the whole time. I was thinking that too but I can't see a reason for CentOS not to have used the secondary DNS server.
PM sent with the log file.
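One thing that may be worth checking on the DNS side: as far as I know, the glibc resolver asks the nameservers in the order they are listed in /etc/resolv.conf and waits for a timeout (5 seconds by default, per resolv.conf(5)) before moving on to the next one. So a secondary being online does not necessarily make lookups fast while the primary is unreachable. Below is a rough Python sketch that reads those settings; the function name and the "delay before secondary" figure are my own simplification, not something the resolver reports.

```python
def resolver_settings(resolv_conf_text):
    """Parse nameservers plus timeout/attempts options from resolv.conf text."""
    nameservers = []
    timeout, attempts = 5, 2  # defaults documented in resolv.conf(5)
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            nameservers.append(fields[1])
        elif fields[0] == "options":
            for option in fields[1:]:
                if option.startswith("timeout:"):
                    timeout = int(option.split(":", 1)[1])
                elif option.startswith("attempts:"):
                    attempts = int(option.split(":", 1)[1])
    return {
        "nameservers": nameservers,
        "timeout": timeout,
        "attempts": attempts,
        # Approximation: seconds spent waiting on a dead first server
        # before the second one is asked (None if there is no second server).
        "delay_before_secondary": timeout if len(nameservers) > 1 else None,
    }


# Hypothetical addresses, for illustration only.
EXAMPLE = """\
# primary (also the LDAP server) and secondary
nameserver 10.0.0.1
nameserver 10.0.0.2
"""

print(resolver_settings(EXAMPLE))
```

With the defaults, each lookup could stall for about 5 seconds before the secondary is tried, which is a large share of a 10-second check timeout. Adding something like `options timeout:1` to resolv.conf shortens that wait.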
Re: Lose Access to LDAP > All Hosts/Services Down
Thanks for the log file.
It looks like the Nagios system couldn't connect to a router, and then all of the hosts behind it started to time out because Nagios could no longer reach them while the router was down.
Then later on, a second router went down at another site, causing the same issue.
From what I can see, this is all normal behavior.
You may want to set up a parent-child relationship so that, if this happens again, you will not get notifications for the hosts behind the router while the router is down.
Take a look at this document for more details.
https://assets.nagios.com/downloads/nag ... ility.html
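To illustrate the idea, here is a simplified Python model of how parent-child relationships change the reported state. It is only a sketch of the rule as I understand it from the reachability documentation (a failed host with at least one UP parent is DOWN; a failed host whose parents are all DOWN or UNREACHABLE is UNREACHABLE), not Nagios's actual implementation. The host names are hypothetical, and the parent map is assumed to contain no loops.

```python
def classify(parents, failed_checks):
    """Map each host to "UP", "DOWN", or "UNREACHABLE".

    parents: dict of host -> list of parent hosts (empty list means the
             host is on the same segment as the Nagios server).
    failed_checks: set of hosts whose checks are currently failing.
    """
    states = {}

    def resolve(host):
        if host in states:
            return states[host]
        if host not in failed_checks:
            states[host] = "UP"
        else:
            host_parents = parents.get(host, [])
            # No parents, or at least one parent UP: the host itself is the problem.
            if not host_parents or any(resolve(p) == "UP" for p in host_parents):
                states[host] = "DOWN"
            else:
                # Every path to the host is blocked by a failed parent.
                states[host] = "UNREACHABLE"
        return states[host]

    for host in parents:
        resolve(host)
    return states


# Hypothetical topology: two sites, each behind its own router.
PARENTS = {
    "router1": [],
    "web1": ["router1"],
    "db1": ["router1"],
    "router2": [],
    "app1": ["router2"],
}

# router1 fails, so checks for everything behind it fail too.
print(classify(PARENTS, {"router1", "web1", "db1"}))
```

In this model only `router1` ends up DOWN; `web1` and `db1` become UNREACHABLE. Whether UNREACHABLE hosts still notify depends on whether `u` is included in each host's `notification_options`.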