System Status and Monitoring Engine Status Invalid

rseiwert · Post by **rseiwert** » Fri Apr 03, 2015 1:01 pm

Thanks MP, that helps. I've been starting and stopping things and attempting to determine a baseline, and I don't know what a lot of these nagios processes are. The problem is that the NOC is going to look at what's on the board, so I do need the board to be accurate. The fact that the system implies that it is OK, with the current date and time on it, is bad. Most of the people charged with watching don't have access to the nagios console.

MP, also, all of those pass for me (once tweaked for my install locations). What was crashing was the Nagios core process which can be checked with

ps -ef | grep '/usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg' | grep -v grep

which should return 2 processes from what I can tell.
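As a rough sketch of the same check in script form, the snippet below counts command lines that exactly match the daemon invocation quoted above. The helper name `count_nagios_core` is made up for illustration, the path is the one from my install, and the expected count of 2 is only what I have observed here, not a documented value.

```shell
#!/bin/sh
# Command line of the Nagios core daemon as it appears in ps on this install.
# Adjust the paths if your install locations differ.
NAGIOS_CMD='/usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg'

# Hypothetical helper: reads one command line per row on stdin and prints
# how many rows are exactly the daemon invocation.
count_nagios_core() {
    # -F fixed string, -x whole-line match, -c print the count.
    # grep exits non-zero when the count is 0, so "|| true" keeps the
    # helper usable under "set -e"; the count is still printed.
    grep -cFx -- "$NAGIOS_CMD" || true
}

# Usage on a live system (ps -eo args prints only the command line):
#   ps -eo args | count_nagios_core
```

Because the match is on the whole line, the grep process itself is never counted, so no `grep -v grep` is needed.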

rseiwert · Post by **rseiwert** » Fri Apr 03, 2015 1:13 pm

At the time there were no Nagios processes running (i.e. /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg), but just about every other process was running. I believe the nagios core process is being killed off for some reason, probably a check gone crazy. I'm still looking at this, but here is the death of the nagios core process:

Apr 2 20:09:26 nagios nagios: wproc: Socket to worker Core Worker 1609 broken, removing
Apr 2 20:09:26 nagios nagios: Caught SIGSEGV, shutting down...

I would expect the XI php icons to go red at this point.

Permissions check

Code: Select all

[root@nagios etc]# ll -d /usr/local/nagios/var
drwxrwxr-x 6 nagios nagios 4096 Apr  3 13:44 /usr/local/nagios/var
[root@nagios etc]# ll /usr/local/nagios/var
total 50940
drwxrwxr-x 2 nagios nagios    20480 Apr  1 00:00 archives
-rw-r--r-- 1 apache apache 28127830 Apr  3 13:29 graphapi.log
-rw-r--r-- 1 nagios nagios        0 Apr  3 13:44 host-perfdata
-rw-r--r-- 1 nagios nagios   349836 Apr  3 13:44 nagios.debug
-rw-r--r-- 1 nagios nagios  1000038 Apr  3 13:43 nagios.debug.old
-rw-r--r-- 1 nagios nagios        6 Apr  3 12:40 nagios.lock
-rw-r--r-- 1 nagios nagios  1929637 Apr  3 13:43 nagios.log
-rw-rw-r-- 1 nagios users   1375000 Jan 12 11:02 nagios.tmp4aAc98
-rw-r--r-- 1 nagios nagios        5 Apr  2 17:21 ndo2db.lock
-rw-r--r-- 1 nagios nagios        0 Apr  3 12:25 ndomod.tmp
srwxr-xr-x 1 nagios nagios        0 Apr  2 17:21 ndo.sock
-rw-r--r-- 1 nagios nagios  2484319 Apr  3 13:44 npcd.log
-rw-r--r-- 1 nagios nagios 10485799 Apr  2 19:37 npcd.log.old
-rw-r--r-- 1 nagios nagios   660162 Apr  3 12:25 objects.cache
-rw-rw-rw- 1 nagios nagios  3232687 Apr  3 11:45 perfdata.log
-rw------- 1 nagios nagios  1202368 Apr  3 13:25 retention.dat
drwxrwsr-x 2 nagios nagcmd     4096 Apr  3 12:40 rw
-rw-r--r-- 1 nagios nagios     5490 Apr  3 13:44 service-perfdata
drwxr-xr-x 5 nagios nagios     4096 Feb 12  2014 spool
drwxr-xr-x 2 nagios nagios     4096 Apr  1 18:18 stats
-rw-rw-r-- 1 nagios nagios  1195587 Apr  3 13:44 status.dat

The current /usr/local/nagiosxi/var/sysstat.log looks like it gets overwritten regularly with current info, and things are working now (I rebooted), so I will attach the current log and the last archive. I wish I had saved the one from this morning, but the head of it is in the original post.

rseiwert · Post by **rseiwert** » Fri Apr 03, 2015 1:14 pm

Only three attachments per post. Here is the last log requested. Too soon to post again. Arggg

ssax · Post by **ssax** » Fri Apr 03, 2015 1:42 pm

The only thing I saw in the postgresql log was when the machine was rebooted.

So if you manually stop the nagios process now does it show red? I understand if you can't do this until a scheduled time, I'm just wondering if the reboot fixed the issue.

The nagios process doesn't have to be running in order for the sysstat information to be updated. If it was unable to update for some reason, then either the postgresql DB wasn't allowing it or the sysstat.php cron was failing (it doesn't look like it from the logs).

If you run into this again, please run this command while you are experiencing the issue and post the output:

Code: Select all

echo "select * from xi_sysstat where metric = 'daemons' \x\g;" | psql nagiosxi nagiosxi

rseiwert · Post by **rseiwert** » Fri Apr 03, 2015 2:04 pm

Steps to replicate this issue.
If you run /etc/init.d/nagios stop, the Process State in the Monitoring Engine Status goes red almost immediately,
but if you kill off the nagios process with extreme prejudice to simulate a crash (kill -9), the Process State stays green, showing the running time, the process ID and that it was recently updated.

Code: Select all

[root@nagios var]# ps -ef | grep '/usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg' | grep -v grep
nagios   63164     1  0 14:33 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   63178 63164  0 14:33 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
[root@nagios var]# kill -9 63164
[root@nagios var]# /etc/init.d/nagios status
nagios is not running
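One way to tell a crash from a clean stop without going through the init script is to test whether the PID recorded in the lock file is still alive. This is only a sketch: the helper name `lock_pid_alive` is made up, and it assumes the lock file (for example /usr/local/nagios/var/nagios.lock from the listing earlier in the thread) holds a single PID and is left behind when the daemon is killed with -9.

```shell
#!/bin/sh
# Hypothetical helper: report whether the PID stored in a lock file belongs
# to a live process.
# Returns 0 if alive, 1 if not, 2 if the file is missing or unusable.
lock_pid_alive() {
    lock_file=$1
    [ -r "$lock_file" ] || return 2          # no readable lock file
    pid=$(tr -d '[:space:]' < "$lock_file")
    case $pid in
        ''|*[!0-9]*) return 2 ;;             # empty or non-numeric contents
    esac
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    # It also fails on a permission error, so run this as root or as the
    # nagios user.
    kill -0 "$pid" 2>/dev/null
}

# Usage:
#   lock_pid_alive /usr/local/nagios/var/nagios.lock || echo "stale or missing lock"
```

A PID can be reused by an unrelated process, so a check like this is a hint rather than proof that nagios itself is running.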

After 10 mins the XI interface still shows all OK,

and checking the sysstat.log shows that the last check is no longer fresh and that nagios is not running.


ssax · Post by **ssax** » Mon Apr 06, 2015 9:47 am

This looks like it's a bug, I will report it to the developers.

I'll look for a temporary solution and update the post when I've found one.

Edit:

Code: Select all

NEW TASK ID 5386 created - Nagios XI Bug Report: service nagios status or /etc/init.d/nagios status returns OK when it's not running

ssax · Post by **ssax** » Mon Apr 06, 2015 2:25 pm

Edit /etc/init.d/nagios and on line 137 you'll see this code:

Code: Select all

echo "nagios is not running"

Add a new line after that and make it look like this:

Code: Select all

echo "nagios is not running"
return 1
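To illustrate why the extra line matters, here is a minimal sketch, assuming that whatever calls the status command looks at the exit status rather than the printed text. Both function names are hypothetical stand-ins and this is not the actual content of /etc/init.d/nagios.

```shell
#!/bin/sh
# Hypothetical stand-in for the status branch of an init script.
# $1 is "yes" or "no" and replaces the real PID check for illustration.
status_nagios_sketch() {
    if [ "$1" = "yes" ]; then
        echo "nagios is running"
        return 0
    fi
    echo "nagios is not running"
    return 1    # the added line: tell the caller that the check failed
}

# Hypothetical caller that ignores the message and reads only the exit status.
report_state() {
    if status_nagios_sketch "$1" >/dev/null; then
        echo OK
    else
        echo CRITICAL
    fi
}

# Usage:
#   report_state yes
#   report_state no
```

Without the `return 1`, the function's status would be that of the last `echo`, which succeeds, so a caller like `report_state` would see success in both cases.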

rseiwert · Post by **rseiwert** » Mon Apr 06, 2015 4:54 pm

Thanks, that works for 2 out of the 3 statuses. I just want to report that it doesn't solve the problem for the Monitoring Engine Process on the Monitoring Engine Status Page (the one pictured below).

Post by **lmiltchev** » Tue Apr 07, 2015 1:20 pm

rseiwert wrote: Thanks, that works for 2 out of the 3 statuses. I just want to report that it doesn't solve the problem for the Monitoring Engine Process on the Monitoring Engine Status Page (the one pictured below).

Can you elaborate? What is the issue at the moment - nagios is not running but it shows as running in the web UI (PID 63164)?

What is the output of the following commands?

Code: Select all

service nagios status
ps -ef | grep /bin/[n]agios

rseiwert · Post by **rseiwert** » Wed Apr 08, 2015 4:21 pm

There are three places I know of in XI which tell me that things are processing normally. Referring to the scenario above, where I described how to replicate the issue, and after putting in the fix suggested by ssax, 2 out of 3 statuses now correctly tell me I have a problem, but one still tells me that things are processing normally and that a non-existent process is running just fine. 2 out of 3 ain't bad, and much better than before, when it was zero out of three.

Nagios Support Forum
