Nagios XI Partial Crash?
Posted: Thu Mar 31, 2016 3:50 pm
I'm trying to figure out what happened after rebooting a Nagios XI installation running on a RHEL 6 VM.
The main symptom was that passive remote hosts and services were fine, but actively checked local hosts and services showed up as unreachable, even though I could ping them from within the Nagios XI GUI. Here are the Nagios log and my actions (my notes are in << >>):
Fri, 25 Mar 2016 16:09:56 GMT
[1458922196] Caught SIGTERM, shutting down...
[1458922197] ndomod: Error writing to data sink! Some output may get lost...
[1458922197] ndomod: Please check remote ndo2db log, database connection or SSL Parameters
[1458922197] Successfully shutdown... (PID=3839)
[1458922197] ndomod: Shutdown complete.
[1458922197] Event broker module '/usr/local/nagios/bin/ndomod.o' deinitialized successfully.
<< Rebooted the VM. Before starting the Nagios services, deleted any check results older than an hour and ran repairmysql.sh on the Nagios databases >>
[1458929056] Nagios 3.5.0 starting... (PID=15470)
[1458929056] Local time is Fri Mar 25 18:04:16 UTC 2016
[1458929056] LOG VERSION: 2.0
[1458929056] ndomod: NDOMOD 1.5.2 (06-08-2012) Copyright (c) 2009 Nagios Core Development Team and Community Contributors
[1458929056] ndomod: Successfully connected to data sink. 4 queued items to flush.
[1458929056] ndomod: Successfully flushed 4 queued items to data sink.
[1458929056] Event broker module '/usr/local/nagios/bin/ndomod.o' initialized successfully.
[1458929058] Finished daemonizing... (New PID=15475)
<< Noticed in the Nagios XI GUI that the Monitoring Engine had a red "x" (all others green). No log entries between [1458929058] and [1458930154]. Clicked Action->Restart; the Monitoring Engine changed to a green check >>
[1458930154] Nagios 3.5.0 starting... (PID=19455)
[1458930154] Local time is Fri Mar 25 18:22:34 UTC 2016
[1458930154] LOG VERSION: 2.0
[1458930154] ndomod: NDOMOD 1.5.2 (06-08-2012) Copyright (c) 2009 Nagios Core Development Team and Community Contributors
[1458930154] ndomod: Successfully connected to data sink. 0 queued items to flush.
[1458930154] Event broker module '/usr/local/nagios/bin/ndomod.o' initialized successfully.
[1458930157] Finished daemonizing... (New PID=19460)
[1458930188] Warning: Could not stat() check result file '/usr/local/nagios/var/spool/checkresults/cFcOzY2'.
... << I assume these are the passive check results that were older than an hour; ~300 entries per second between [1458930188] and [1458930201] >>
[1458930201] Warning: Could not stat() check result file '/usr/local/nagios/var/spool/checkresults/cojoSy2'.
[1458930201] Caught SIGTERM, shutting down...
[1458930202] Successfully shutdown... (PID=15475)
[1458930202] ndomod: Shutdown complete.
[1458930202] Event broker module '/usr/local/nagios/bin/ndomod.o' deinitialized successfully.
[1458930248] SERVICE ALERT: remote_host_001;Ping;OK;HARD;1;OK - 10.105.0.145: rta 0.341ms, lost 0%
... << Nagios processing entries for passive remote hosts and services only, ~10 entries per second between [1458930248] and [1458931053] >>
<< Noticed in the GUI the Monitoring Engine had a red "x", all others green >>
<< Truncated nagios_logentries table (~5GB) and nagios_notifications (~1GB), repaired Nagios databases >>
<< Restarted all Nagios services (see question #2), everything returned to normal >>
[1458931061] Nagios 3.5.0 starting... (PID=6885)
[1458931061] Local time is Fri Mar 25 18:37:41 UTC 2016
[1458931061] LOG VERSION: 2.0
[1458931062] ndomod: NDOMOD 1.5.2 (06-08-2012) Copyright (c) 2009 Nagios Core Development Team and Community Contributors
[1458931062] ndomod: Successfully connected to data sink. 0 queued items to flush.
[1458931062] Event broker module '/usr/local/nagios/bin/ndomod.o' initialized successfully.
[1458931064] Finished daemonizing... (New PID=6956)
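The bracketed numbers in the log are Unix epoch seconds; for example, [1458929056] corresponds to the 18:04:16 UTC local time printed on the line after it. Here is a minimal sketch for making excerpts like the one above easier to read, assuming GNU date (as shipped with RHEL). The function names epoch_to_utc and annotate_log are illustrative only and are not part of Nagios:

```shell
#!/bin/sh
# epoch_to_utc EPOCH: print an epoch timestamp as UTC (needs GNU date).
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

# annotate_log: read log lines on stdin and replace a leading "[epoch] "
# with a readable UTC timestamp; other lines pass through unchanged.
annotate_log() {
    while IFS= read -r line; do
        case $line in
            "["[0-9]*"] "*)
                epoch=${line#\[}        # drop the leading "["
                epoch=${epoch%%\]*}     # keep only the text before "]"
                rest=${line#*\] }       # everything after "] "
                case $epoch in
                    *[!0-9]*) printf '%s\n' "$line" ;;   # not a pure number
                    *) printf '[%s] %s\n' "$(epoch_to_utc "$epoch")" "$rest" ;;
                esac
                ;;
            *)
                printf '%s\n' "$line"
                ;;
        esac
    done
}
```

Usage would be something like `annotate_log < /usr/local/nagios/var/nagios.log`, where the log path is an assumed default and may differ on your install.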
Questions:
1) What typically causes ndomod to shut down?
2) When I bring up Nagios XI services I run the following:
# Start Nagios Processes
service mysqld start
service npcd start
service ndo2db start
# Sleep for 10 seconds to ensure ndo2db is up
sleep 10
service nagios start
service nagiosxi start
Where does ndomod fit into this? Is it a process under the ndo2db service?
3) Is there a "best practice" on how to notify an administrator if any Nagios process/service stops? I created a Nagios service that checks the state of the Nagios XI services, but if active checks are not working, the administrator is never notified. In other words, how do you monitor the monitoring software?
What have other users done in the past?
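For the start sequence in question 2, one option is to poll for ndo2db instead of relying on a fixed 10-second sleep. The following is only a sketch under assumptions: wait_for and start_nagios_stack are hypothetical helper names, the 30-second timeout is arbitrary, and the process is assumed to be named exactly "ndo2db". As far as the log shows, ndomod.o is initialized as an event broker module while Nagios itself starts, so the sketch has no separate step for it:

```shell
#!/bin/sh
# wait_for TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-run COMMAND once per second until it succeeds or the timeout expires.
wait_for() {
    wf_timeout=$1
    shift
    wf_elapsed=0
    until "$@" >/dev/null 2>&1; do
        if [ "$wf_elapsed" -ge "$wf_timeout" ]; then
            return 1
        fi
        sleep 1
        wf_elapsed=$((wf_elapsed + 1))
    done
    return 0
}

# Hypothetical wrapper around the start sequence from question 2.
start_nagios_stack() {
    service mysqld start || return 1
    service npcd start || return 1
    service ndo2db start || return 1
    # Assumption: the daemon shows up under the process name "ndo2db".
    if ! wait_for 30 pgrep -x ndo2db; then
        echo "ndo2db did not come up within 30 seconds" >&2
        return 1
    fi
    service nagios start || return 1
    service nagiosxi start
}
```

Calling start_nagios_stack as root would then stop at the first service that fails to start, rather than continuing with Nagios while ndo2db is still down.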
Thanks.
Nagios XI Version
full=2012R2.9
major=2012
minor=R2.9
releasedate=2014-02-11
release=320
NRPE v2.15 (modified for 4KB messages)
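On question 3, a common pattern is a small check that runs outside Nagios (for example from cron), so it can still alert when the monitoring engine itself has stopped. The sketch below treats a missing or old status file as a sign of trouble. It rests on several assumptions: the status file lives at /usr/local/nagios/var/status.dat, GNU stat is available, a local mail command can deliver to the administrator, and the names is_stale, watchdog, STATUS_FILE, MAX_AGE and ADMIN are all made up for illustration:

```shell
#!/bin/sh
# Assumed defaults; override them via the environment for your install.
STATUS_FILE=${STATUS_FILE:-/usr/local/nagios/var/status.dat}
MAX_AGE=${MAX_AGE:-300}     # seconds before the file counts as stale
ADMIN=${ADMIN:-root}        # placeholder mail recipient

# file_age_seconds FILE: print seconds since FILE was last modified.
# Fails if FILE cannot be read (needs GNU stat).
file_age_seconds() {
    fa_mtime=$(stat -c %Y "$1" 2>/dev/null) || return 1
    echo $(( $(date +%s) - fa_mtime ))
}

# is_stale FILE MAX_AGE: succeed if FILE is missing or older than MAX_AGE.
is_stale() {
    is_age=$(file_age_seconds "$1") || return 0
    [ "$is_age" -gt "$2" ]
}

# watchdog: mail the administrator when the status file looks stale.
watchdog() {
    if is_stale "$STATUS_FILE" "$MAX_AGE"; then
        echo "$STATUS_FILE not updated in over $MAX_AGE seconds on $(hostname)" |
            mail -s "Nagios watchdog: monitoring engine may be down" "$ADMIN"
    fi
}
```

To use it, add a final `watchdog` line and run the script from cron every few minutes. This only catches an engine that has stopped or stalled completely. It would not necessarily catch a partial failure where passive results are still processed, and it cannot report a failure of the whole VM, which would need a check from a second host.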