
Nagios XI Partial Crash?

Posted: Thu Mar 31, 2016 3:50 pm
by toodaly
I'm trying to figure out what happened after rebooting a Nagios XI installation running on a RHEL 6 VM.
The main symptom was that passive remote hosts and services were fine, but active local hosts and services showed up as unreachable, even though I can ping them from within the Nagios XI GUI. Here's the Nagios log, annotated with my actions:

Fri, 25 Mar 2016 16:09:56 GMT
[1458922196] Caught SIGTERM, shutting down...
[1458922197] ndomod: Error writing to data sink! Some output may get lost...
[1458922197] ndomod: Please check remote ndo2db log, database connection or SSL Parameters
[1458922197] Successfully shutdown... (PID=3839)
[1458922197] ndomod: Shutdown complete.
[1458922197] Event broker module '/usr/local/nagios/bin/ndomod.o' deinitialized successfully.
<< Reboot of VM. Before starting Nagios services, delete any check results greater than an hour and run repairmysql.sh on Nagios databases >>
[1458929056] Nagios 3.5.0 starting... (PID=15470)
[1458929056] Local time is Fri Mar 25 18:04:16 UTC 2016
[1458929056] LOG VERSION: 2.0
[1458929056] ndomod: NDOMOD 1.5.2 (06-08-2012) Copyright (c) 2009 Nagios Core Development Team and Community Contributors
[1458929056] ndomod: Successfully connected to data sink. 4 queued items to flush.
[1458929056] ndomod: Successfully flushed 4 queued items to data sink.
[1458929056] Event broker module '/usr/local/nagios/bin/ndomod.o' initialized successfully.
[1458929058] Finished daemonizing... (New PID=15475)
<< Noticed in the Nagios XI GUI the Monitoring Engine had a red "x", all others green, no log entries in between [1458929058] and [1458930154], clicked Action->Restart, Monitoring Engine changed to green check >>
[1458930154] Nagios 3.5.0 starting... (PID=19455)
[1458930154] Local time is Fri Mar 25 18:22:34 UTC 2016
[1458930154] LOG VERSION: 2.0
[1458930154] ndomod: NDOMOD 1.5.2 (06-08-2012) Copyright (c) 2009 Nagios Core Development Team and Community Contributors
[1458930154] ndomod: Successfully connected to data sink. 0 queued items to flush.
[1458930154] Event broker module '/usr/local/nagios/bin/ndomod.o' initialized successfully.
[1458930157] Finished daemonizing... (New PID=19460)
[1458930188] Warning: Could not stat() check result file '/usr/local/nagios/var/spool/checkresults/cFcOzY2'.
... << I assume these are the passive check results that were greater than an hour ~300 entries per second between [1458930188] and [1458930201] >>
[1458930201] Warning: Could not stat() check result file '/usr/local/nagios/var/spool/checkresults/cojoSy2'.
[1458930201] Caught SIGTERM, shutting down...
[1458930202] Successfully shutdown... (PID=15475)
[1458930202] ndomod: Shutdown complete.
[1458930202] Event broker module '/usr/local/nagios/bin/ndomod.o' deinitialized successfully.
[1458930248] SERVICE ALERT: remote_host_001;Ping;OK;HARD;1;OK - 10.105.0.145: rta 0.341ms, lost 0%
... << Nagios processing entries for passive remote hosts and services only, ~10 entries per second between [1458930248] and [1458931053] >>
<< Noticed in the GUI the Monitoring Engine had a red "x", all others green >>
<< Truncated nagios_logentries table (~5GB) and nagios_notifications (~1GB), repaired Nagios databases >>
<< Restarted all Nagios services (see question #2), everything returned to normal >>
[1458931061] Nagios 3.5.0 starting... (PID=6885)
[1458931061] Local time is Fri Mar 25 18:37:41 UTC 2016
[1458931061] LOG VERSION: 2.0
[1458931062] ndomod: NDOMOD 1.5.2 (06-08-2012) Copyright (c) 2009 Nagios Core Development Team and Community Contributors
[1458931062] ndomod: Successfully connected to data sink. 0 queued items to flush.
[1458931062] Event broker module '/usr/local/nagios/bin/ndomod.o' initialized successfully.
[1458931064] Finished daemonizing... (New PID=6956)
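
For lining up the bracketed epoch stamps with wall-clock time (for example, [1458929056] against "Local time is Fri Mar 25 18:04:16 UTC 2016"), a small filter along these lines may help. It is only a sketch: the function name nagios_log_utc is made up for illustration, and it assumes GNU date (for -d @SECONDS), which RHEL ships.

```shell
# Rewrite the leading "[epoch] " stamp on each nagios.log line as UTC wall-clock time.
# Lines without a numeric stamp (such as hand-written annotations) pass through unchanged.
nagios_log_utc() {
    while IFS= read -r line; do
        ts=${line#\[}     # drop a leading "[" (no-op when absent)
        ts=${ts%%\]*}     # keep what precedes the first "]"
        case $ts in
            ''|*[!0-9]*) ts= ;;   # not a purely numeric stamp
        esac
        rest=${line#"[$ts] "}     # strips the stamp only if the line starts with it
        if [ -n "$ts" ] && [ "$rest" != "$line" ]; then
            printf '%s %s\n' "$(date -u -d "@$ts" '+%Y-%m-%d %H:%M:%S')" "$rest"
        else
            printf '%s\n' "$line"
        fi
    done
}
```

It would be used as a filter, e.g. nagios_log_utc < /usr/local/nagios/var/nagios.log, assuming the log lives at that (default source-install) path.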

Questions:
1) What typically causes ndomod to shut down?
2) When I bring up Nagios XI services I run the following:
# Start Nagios Processes
service mysqld start
service npcd start
service ndo2db start
# Sleep for 10 seconds to ensure ndo2db is up
sleep 10
service nagios start
service nagiosxi start
Where does ndomod fit into this? Is it a process under the ndo2db service?
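
As an aside on the fixed "sleep 10" in the sequence above: polling until ndo2db actually reports as running may be more robust than a fixed delay. The following is a minimal sketch, not a tested XI procedure. wait_for is a hypothetical helper name, and the usage shown in the comments assumes the ndo2db init script implements a "status" action (if it does not, pgrep -x ndo2db could stand in as the polled command).

```shell
# wait_for LIMIT COMMAND [ARGS...]
# Re-run COMMAND once per second until it succeeds; give up after LIMIT seconds.
wait_for() {
    limit=$1
    shift
    waited=0
    until "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$limit" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

# Intended use in the startup script (assumption: the init script supports "status"):
#   service ndo2db start
#   wait_for 30 service ndo2db status || { echo "ndo2db did not come up" >&2; exit 1; }
#   service nagios start
```

The point of the design is that the script fails loudly when ndo2db never comes up, instead of starting nagios against a missing data sink.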
3) Is there a "best practice" on how to notify an administrator if any Nagios process/service stops? I created a Nagios service that checks the state of the Nagios XI services, but if active checks are not working, the administrator is never notified. How to monitor the monitoring software :lol: What have other users done in the past?

Thanks.
Nagios XI Version
full=2012R2.9
major=2012
minor=R2.9
releasedate=2014-02-11
release=320
NRPE v2.15 (modified for 4KB messages)

Re: Nagios XI Partial Crash?

Posted: Thu Mar 31, 2016 4:03 pm
by hsmith
toodaly wrote:1) What typically causes ndomod to shut down?
There could be a few things. What's the output of an ipcs -q command? Is there anything in /var/log/messages when this is happening?
toodaly wrote:2) When I bring up Nagios XI services I run the following:


ndomod is bound to the nagios service.
toodaly wrote:3) Is there a "best practice" on how to notify an administrator if any Nagios process/service stops? I created a Nagios service that checks the state of the Nagios XI services, but if active checks are not working, the administrator is never notified. How to monitor the monitoring software :lol: What have other users done in the past?
It sounds kind of silly, but you could have a Nagios core instance running on a VM that checks Nagios XI, and alerts you if there are issues. Yay redundancy! Maybe even put it on a Raspberry Pi for bonus cool points.
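
A lighter-weight complement to a second instance is a cron-driven watchdog on the XI server itself, which runs outside the Nagios scheduler and so still fires when active checks stall. The sketch below rests on the assumption that a healthy Nagios process rewrites its status file periodically, so a stale file suggests the engine has stopped. is_stale is a hypothetical helper, the status.dat path is the usual source-install default and may differ on your system, and the mail command and address are placeholders. It relies on GNU stat.

```shell
# is_stale FILE MAX_AGE: succeed when FILE is missing or was last modified
# more than MAX_AGE seconds ago. Uses GNU stat (-c %Y prints mtime as epoch seconds).
is_stale() {
    [ -e "$1" ] || return 0
    mtime=$(stat -c %Y "$1") || return 0
    now=$(date +%s)
    [ $((now - mtime)) -gt "$2" ]
}

# Hypothetical cron usage; the path, threshold, and address are placeholders:
#   is_stale /usr/local/nagios/var/status.dat 300 &&
#       echo "Nagios status file is stale on $(hostname)" |
#       mail -s "Nagios watchdog" admin@example.com
```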

Re: Nagios XI Partial Crash?

Posted: Thu Mar 31, 2016 5:03 pm
by toodaly
What's the output of an ipcs -q command? Is there anything in /var/log/messages when this is happening?
[root@RHEL_LA_001 ~]# ipcs -q

------ Message Queues --------
key msqid owner perms used-bytes messages
0x26000002 360448 nagios 600 0 0

[root@RHEL_LA_001 ~]#
When I first ran this, though, the queue held some bytes (I don't remember how many) and about 3000 messages; I closed the window before realizing I should have captured the output. I imagine it behaves like the check results directory, which holds a bunch of entries one second and is empty the next.
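
To capture the queue depth the next time it backs up, something like the following could log it at intervals. This is a sketch only: queue_depth is a hypothetical helper, and it assumes the ipcs -q column layout shown above (owner in the third column, used-bytes in the fifth, messages in the sixth).

```shell
# queue_depth OWNER: read `ipcs -q` output on stdin and print
# "used-bytes messages" for every message queue owned by OWNER.
queue_depth() {
    awk -v owner="$1" '$3 == owner { print $5, $6 }'
}

# Hypothetical logging loop (the log path is a placeholder):
#   while sleep 5; do
#       echo "$(date '+%Y-%m-%d %H:%M:%S') $(ipcs -q | queue_depth nagios)"
#   done >> /tmp/ipcs_nagios.log
```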

/var/log/messages shows nothing beyond what I posted above, except that every block of "Could not stat()" entries was accompanied by:
Mar 25 18:23:08 RHEL_LA_001 rsyslogd-2177: imuxsock begins to drop messages from pid 15475 due to rate-limiting
Mar 25 18:23:14 RHEL_LA_001 rsyslogd-2177: imuxsock lost 1667 messages from pid 15475 due to rate-limiting

Things have been running smoothly since the table truncations and database repair. I'll repeat both steps the next time I see this.
ndomod is bound to the nagios service.
That makes sense as "ndomod: Shutdown complete" coincided (at least appeared to) with the monitoring engine going red. I don't see ndomod when I do a ps -aef. Is that expected?

Thanks.

Re: Nagios XI Partial Crash?

Posted: Thu Mar 31, 2016 5:16 pm
by rkennedy
Do you have a firewall running at all?

Also, just noticed -

Nagios XI Version
full=2012R2.9
major=2012
minor=R2.9
Is upgrading an option? We're now on XI 5.2.5 which is at least 3 years newer than this version.

Re: Nagios XI Partial Crash?

Posted: Thu Mar 31, 2016 5:34 pm
by toodaly
Do you have a firewall running at all?
Yes, there is a firewall in place. Nagios XI worked fine both before and after this anomaly.
Is upgrading an option?
Unfortunately, not at this time. This was the version that our requirements were verified against a few years back. There is no budget/schedule to requalify with the latest version. Yes, not ideal.

Is there a table I can truncate to flush stale checks so I don't get all of the "Could not stat()" entries? Do you think this is unrelated to ndomod shutting down?
What other possibilities would cause ndomod to shut down?

Thanks.

Re: Nagios XI Partial Crash?

Posted: Fri Apr 01, 2016 11:44 am
by tgriep
Are you still getting the "Could not stat" messages?
If so, can you run the following and post the output?
ls -l /usr/local/nagios/var/spool/checkresults/

The ndomod broker module is used by the Nagios process to write the information to the MySQL database.
If you look in the nagios.cfg file you will see the module listed there.
Every time Nagios is restarted (Applying the Config restarts Nagios too) you will see that message in the nagios.log file.
Does that help out?
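
One way to confirm which broker modules the Nagios process loads is to pull the broker_module lines out of nagios.cfg. The helper below is a sketch: list_broker_modules is a made-up name, and the config path in the usage comment is the usual source-install location, which is an assumption here. Because ndomod is a module loaded into the nagios process rather than a separate daemon, it would not be expected to show up as its own entry in ps output.

```shell
# list_broker_modules CONFIG: print the value of every active (uncommented)
# broker_module directive found in CONFIG.
list_broker_modules() {
    sed -n 's/^[[:space:]]*broker_module[[:space:]]*=[[:space:]]*//p' "$1"
}

# Hypothetical usage (path assumed):
#   list_broker_modules /usr/local/nagios/etc/nagios.cfg
```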

Re: Nagios XI Partial Crash?

Posted: Mon Apr 04, 2016 10:16 am
by toodaly
I do not get the "Could not stat" messages when Nagios is in a steady state.

I get the "Could not stat" messages after a recovery: when there is an issue (e.g. the ndo2db service stops) and /usr/local/nagios/var/spool/checkresults/ and/or /tmp start to fill up with check results, the administrator reboots the VM that Nagios is running on. The startup procedure removes check results older than an hour from these directories, runs a repair of the nagios and mysql databases, and starts the Nagios services. Normally, the /usr/local/nagios/var/spool/checkresults/ directory contains entries for the current time, which Nagios then processes and removes.
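
For reference, the cleanup step of that startup procedure can be sketched roughly as follows. This is an illustration, not the exact script in use: purge_old_results is a hypothetical name, it relies on the -maxdepth and -mmin options of GNU find, and it should only be run while the nagios service is stopped.

```shell
# purge_old_results DIR [MINUTES]: delete regular files directly inside DIR
# that were last modified more than MINUTES minutes ago (default: 60).
purge_old_results() {
    find "$1" -maxdepth 1 -type f -mmin +"${2:-60}" -exec rm -f {} +
}

# Hypothetical usage before starting any Nagios service:
#   purge_old_results /usr/local/nagios/var/spool/checkresults 60
```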

Here's where I make some assumptions and ask if you can correct me. I assume the filenames of these check results are stored somewhere (a text file, a database table). After a reboot of the VM when I go in and delete the check results that are older than an hour, Nagios will then start processing the checkresult files in /usr/local/nagios/var/spool/checkresults based on what was contained in the text file or database table. When the file that it expects (from the text file or database table) does not exist, Nagios logs:
Warning: Could not stat() check result file '/usr/local/nagios/var/spool/checkresults/cFcOzY2'.

Here are my questions:
1) Is this a correct assumption? (I haven't checked the actual filenames before a reboot to confirm that the "Could not stat" messages correspond to check result files that the startup process deletes.)

2) If so, where are the check result filenames stored?

3) Can this file or database table be flushed so I don't get the flood of "Could not stat" messages that starts the sequence ending with ndomod shutting down?
...
[1458930201] Warning: Could not stat() check result file '/usr/local/nagios/var/spool/checkresults/cojoSy2'.
[1458930201] Caught SIGTERM, shutting down...
[1458930202] Successfully shutdown... (PID=15475)
[1458930202] ndomod: Shutdown complete.
[1458930202] Event broker module '/usr/local/nagios/bin/ndomod.o' deinitialized successfully.
Every time Nagios is restarted (Applying the Config restarts Nagios too) you will see that message in the nagios.log file.
The "Could not stat" log entry or the "ndomod: Shutdown complete" log entry?

Thanks.

Re: Nagios XI Partial Crash?

Posted: Mon Apr 04, 2016 2:24 pm
by tgriep
Below is the description of what the checkresults folder is used for.
This option determines which directory Nagios will use to temporarily store host and service check results before they are processed.
When the system is rebooted and those files are deleted, is the nagios process running?
If so, that could be the cause of the stat messages. The nagios process writes them there, then the startup deletes them and then the Nagios process cannot find them and that causes the error.

When Applying the config, you should see the "ndomod: Shutdown complete" message, that is normal.

How many hosts and services is the XI system monitoring?
The next time the system stops processing the files in the checkresults folder, can you run the following and post it here?

df -h
df -i
ps -ef
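
Since the check results tend to be gone a second later, it may help to have those three commands wrapped up so their output lands in one timestamped file the moment the problem appears. A minimal sketch follows; capture_diag and the file naming are made up for illustration.

```shell
# capture_diag [DIR]: write the output of df -h, df -i and ps -ef to a
# timestamped file under DIR (default /tmp) and print the file's path.
capture_diag() {
    out="${1:-/tmp}/nagios_diag_$(date +%Y%m%d_%H%M%S).txt"
    {
        echo "### df -h"; df -h
        echo "### df -i"; df -i
        echo "### ps -ef"; ps -ef
    } > "$out" 2>&1
    echo "$out"
}
```

Running capture_diag with no argument would leave a file such as /tmp/nagios_diag_<timestamp>.txt that can be posted as-is.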

Re: Nagios XI Partial Crash?

Posted: Mon Apr 04, 2016 5:12 pm
by toodaly
When the system is rebooted and those files are deleted, is the nagios process running?
No, check results files and performance data files (older than an hour) are deleted before any Nagios service is started.

When looking at nagios.cfg, I noticed ndomod.cfg which contains this:
# BUFFER FILE
# This option is used to specify a file which will be used to store the
# contents of buffered data which could not be sent to the NDO2DB daemon
# before Nagios shuts down. Prior to shutting down, the NDO NEB module
# will write all buffered data to this file for later processing. When
# Nagios (re)starts, the NDO NEB module will read the contents of this
# file and send it to the NDO2DB daemon for processing.
buffer_file=/usr/local/nagios/var/ndomod.tmp

I believe this is the file I'm looking for.

Would deleting the contents of this temp file cause any unwanted side effects? i.e. is Nagios expecting something to be in there when it restarts? Currently the file is empty (0 bytes).
How many hosts and services is the XI system monitoring?
~300 Active Hosts
~2000 Passive Hosts (from 6 other Nagios servers)
~2000 Active Services
~8000 Passive Services (from 6 other Nagios servers)

Will do what you suggested the next time this happens.

Thanks.

Re: Nagios XI Partial Crash?

Posted: Mon Apr 04, 2016 5:19 pm
by toodaly
I forgot to add: as mentioned above, I occasionally see check results not only in /usr/local/nagios/var/spool/checkresults/ but also in /tmp.

What I see is something like this:
-rw------- 1 nagios users 256 Mar 25 22:13 check4pnssA
-rw------- 1 nagios users 256 Mar 25 22:13 checkQT1wHi
-rw------- 1 nagios users 256 Mar 25 22:13 checkcahoCo
-rw------- 1 nagios users 256 Mar 25 22:13 checkE1qdS6

[root@RHEL_LA_001 tmp]# cat checkn39ZJM
### Active Check Result File ###
file_time=1458950532

### Nagios Host Check Result ###
# Time: Sat Mar 26 00:02:12 2016
host_name=RHEL_LA_PTR001
check_type=0
check_options=0
scheduled_check=1
reschedule_check=1
latency=1.057000
start_time=1458950532.57359

As in /usr/local/nagios/var/spool/checkresults, these files are there one second and gone the next.

What are these used for? Is this expected? The only reference in nagios.cfg is:
temp_path=/tmp

Thanks.