System Status and Monitoring Engine Status Invalid

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
rseiwert
Posts: 196
Joined: Wed Jun 22, 2011 10:33 pm
Location: Somewhere between Here and Now

Re: System Status and Monitoring Engine Status Invalid

Post by rseiwert »

Thanks MP, that helps. I've been starting and stopping things while trying to determine a baseline, and I don't know what a lot of these nagios processes are. The problem is that the NOC is going to look at what's on the board, so I do need the board to be accurate. The fact that the system implies everything is OK, with the current date and time on it, is bad. Most of the people charged with watching don't have access to the nagios console.

MP, also, all of those pass for me (once tweaked for my install locations). What was crashing was the Nagios core process which can be checked with

ps -ef | grep '/usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg' | grep -v grep

which should return 2 processes from what I can tell.
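
For anyone scripting this check, here is a minimal sketch of the same test using pgrep instead of the ps/grep pipeline. The function name count_matching is made up for illustration, and the count of two processes is only what was observed on this install, so the sketch only tests for zero.

```shell
#!/bin/sh
# Count processes whose full command line matches a pattern.
# pgrep never reports itself, so no "grep -v grep" step is needed.
count_matching() {
    # "|| true" keeps the pipeline alive when pgrep finds nothing (exit status 1).
    { pgrep -f -- "$1" || true; } | wc -l
}

# Command line copied from the post above; adjust for your install locations.
core_cmd='/usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg'

if [ "$(count_matching "$core_cmd")" -eq 0 ]; then
    echo "nagios core is not running"
fi
```

Note that pgrep treats the pattern as a regular expression, so the dots in the path match any character. That is harmless here but worth knowing.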
Last edited by rseiwert on Sat Apr 04, 2015 1:14 pm, edited 1 time in total.
Grumpy Olde IT Guy

Re: System Status and Monitoring Engine Status Invalid

Post by rseiwert »

At the time there were no Nagios processes running (i.e. /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg), but just about every other process was running. I believe the nagios core process is being killed off for some reason, probably a check gone crazy. I'm still looking at this, but here is the death of the nagios core process:

Apr 2 20:09:26 nagios nagios: wproc: Socket to worker Core Worker 1609 broken, removing
Apr 2 20:09:26 nagios nagios: Caught SIGSEGV, shutting down...

I would expect the XI php icons to go red at this point.
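
To look for other crashes like this one, here is a small sketch that searches a syslog file for the two messages quoted above. The function name find_core_crashes and the /var/log/messages path are assumptions; use whatever file your syslog writes nagios messages to.

```shell
#!/bin/sh
# Print log lines that record a core segfault or a broken worker socket.
# Both patterns are taken from the two log lines quoted above.
find_core_crashes() {
    grep -E 'nagios: (Caught SIGSEGV|wproc: Socket to worker .* broken)' "$1"
}

# Example call; the log path is an assumption.
# find_core_crashes /var/log/messages
```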

Permissions check

Code: Select all

[root@nagios etc]# ll -d /usr/local/nagios/var
drwxrwxr-x 6 nagios nagios 4096 Apr  3 13:44 /usr/local/nagios/var
[root@nagios etc]# ll /usr/local/nagios/var
total 50940
drwxrwxr-x 2 nagios nagios    20480 Apr  1 00:00 archives
-rw-r--r-- 1 apache apache 28127830 Apr  3 13:29 graphapi.log
-rw-r--r-- 1 nagios nagios        0 Apr  3 13:44 host-perfdata
-rw-r--r-- 1 nagios nagios   349836 Apr  3 13:44 nagios.debug
-rw-r--r-- 1 nagios nagios  1000038 Apr  3 13:43 nagios.debug.old
-rw-r--r-- 1 nagios nagios        6 Apr  3 12:40 nagios.lock
-rw-r--r-- 1 nagios nagios  1929637 Apr  3 13:43 nagios.log
-rw-rw-r-- 1 nagios users   1375000 Jan 12 11:02 nagios.tmp4aAc98
-rw-r--r-- 1 nagios nagios        5 Apr  2 17:21 ndo2db.lock
-rw-r--r-- 1 nagios nagios        0 Apr  3 12:25 ndomod.tmp
srwxr-xr-x 1 nagios nagios        0 Apr  2 17:21 ndo.sock
-rw-r--r-- 1 nagios nagios  2484319 Apr  3 13:44 npcd.log
-rw-r--r-- 1 nagios nagios 10485799 Apr  2 19:37 npcd.log.old
-rw-r--r-- 1 nagios nagios   660162 Apr  3 12:25 objects.cache
-rw-rw-rw- 1 nagios nagios  3232687 Apr  3 11:45 perfdata.log
-rw------- 1 nagios nagios  1202368 Apr  3 13:25 retention.dat
drwxrwsr-x 2 nagios nagcmd     4096 Apr  3 12:40 rw
-rw-r--r-- 1 nagios nagios     5490 Apr  3 13:44 service-perfdata
drwxr-xr-x 5 nagios nagios     4096 Feb 12  2014 spool
drwxr-xr-x 2 nagios nagios     4096 Apr  1 18:18 stats
-rw-rw-r-- 1 nagios nagios  1195587 Apr  3 13:44 status.dat

The current /usr/local/nagiosxi/var/sysstat.log looks like it gets overwritten regularly with current info, and things are working now (I rebooted), so I will attach the current log and the last archive. I wish I had saved the one from this morning, but the head of it is in the original post.

Re: System Status and Monitoring Engine Status Invalid

Post by rseiwert »

Only three attachments per post. Here is the last log requested. Too soon to post again. Arggg
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: System Status and Monitoring Engine Status Invalid

Post by ssax »

The only thing I saw in the postgresql log was when the machine was rebooted.

So if you manually stop the nagios process now does it show red? I understand if you can't do this until a scheduled time, I'm just wondering if the reboot fixed the issue.

The nagios process doesn't have to be running for the sysstat information to be updated. If it was unable to update for some reason, then either the PostgreSQL DB wasn't allowing it or the sysstat.php cron was failing (it doesn't look like it from the logs).

If you run into this again, please run this command while you are experiencing the issue and post the output:

Code: Select all

echo "select * from xi_sysstat where metric = 'daemons' \x\g;" | psql nagiosxi nagiosxi

Re: System Status and Monitoring Engine Status Invalid

Post by rseiwert »

Steps to replicate this issue:
If you run /etc/init.d/nagios stop, the Process State in the Monitoring Engine Status goes red almost immediately,
but if you kill off the nagios process with extreme prejudice to simulate a crash (kill -9), the Process State stays green, showing the running time, the process ID, and that it was recently updated.

Code: Select all

[root@nagios var]# ps -ef | grep '/usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg' | grep -v grep
nagios   63164     1  0 14:33 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   63178 63164  0 14:33 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
[root@nagios var]# kill -9 63164
[root@nagios var]# /etc/init.d/nagios status
nagios is not running

After 10 minutes the XI interface still shows all OK:
Image
Image

Checking sysstat.log shows that the last check-in is no longer fresh and that nagios is not running.

Code: Select all

[root@nagios var]# cat sysstat.log
DB BACKEND:
Array
(
    [last_checkin] => 2015-04-03 14:44:43
    [bytes_processed] => 3808714
    [entries_processed] => 6383
    [connect_time] => 2015-04-03 14:33:30
    [disconnect_time] => 0000-00-00 00:00:00
)
CMDLINE=/etc/init.d/nagios status
nagios is not running
OUTPUT=nagios is not running
RETURNCODE=0
CMDLINE=/etc/init.d/npcd status
NPCD running (pid 1558).
OUTPUT=NPCD running (pid 1558).
RETURNCODE=0
CMDLINE=/etc/init.d/ndo2db status
ndo2db (pid 1625) is running...
OUTPUT=ndo2db (pid 1625) is running...
RETURNCODE=0
DAEMONS:
Array
(
    [nagioscore] => Array
        (
            [daemon] => nagios
            [output] => nagios is not running
            [return_code] => 0
            [status] => 0
        )

    [pnp] => Array
        (
            [daemon] => npcd
            [output] => NPCD running (pid 1558).
            [return_code] => 0
            [status] => 0
        )

    [ndoutils] => Array
        (
            [daemon] => ndo2db
            [output] => ndo2db (pid 1625) is running...
            [return_code] => 0
            [status] => 0
        )

)
CORE STATS:
Array
(
    [activehostchecks] => Array
        (
            [1min] => 0
            [5min] => 0
            [15min] => 80
        )

    [passivehostchecks] => Array
        (
            [1min] => 0
            [5min] => 0
            [15min] => 0
        )

    [activeservicechecks] => Array
        (
            [1min] => 0
            [5min] => 0
            [15min] => 325
        )

    [passiveservicechecks] => Array
        (
            [1min] => 0
            [5min] => 0
            [15min] => 0
        )

    [activehostcheckperf] => Array
        (
            [min_latency] => 0
            [max_latency] => 0.00258
            [avg_latency] => 6.53932584269663e-05
            [min_execution_time] => 0.00145
            [max_execution_time] => 10.00252
            [avg_execution_time] => 0.27182797752809
        )

    [activeservicecheckperf] => Array
        (
            [min_latency] => 0
            [max_latency] => 0.181
            [avg_latency] => 0.000642213622291022
            [min_execution_time] => 0
            [max_execution_time] => 20.84528
            [avg_execution_time] => 0.857088405572756
        )

)
LOAD:
Array
(
    [load1] => 0.62
    [load5] => 0.78
    [load15] => 0.95
)
MEMORY:
Array
(
    [total] => 7865
    [used] => 2145
    [free] => 5720
    [shared] => 13
    [buffers] => 169
    [cached] => 934
)
SWAP:
Array
(
    [total] => 2015
    [used] => 0
    [free] => 2015
)
IOSTAT:
Array
(
    [user] => 3.79
    [nice] => 0.00
    [system] => 0.51
    [iowait] => 0.05
    [steal] => 0.00
    [idle] => 95.65
)
.DB BACKEND:
Array
(
    [last_checkin] => 2015-04-03 14:44:43
    [bytes_processed] => 3808714
    [entries_processed] => 6383
    [connect_time] => 2015-04-03 14:33:30
    [disconnect_time] => 0000-00-00 00:00:00
)
CMDLINE=/etc/init.d/nagios status
nagios is not running
OUTPUT=nagios is not running
RETURNCODE=0
CMDLINE=/etc/init.d/npcd status
NPCD running (pid 1558).
OUTPUT=NPCD running (pid 1558).
RETURNCODE=0
CMDLINE=/etc/init.d/ndo2db status
ndo2db (pid 1625) is running...
OUTPUT=ndo2db (pid 1625) is running...
RETURNCODE=0
DAEMONS:
Array
(
    [nagioscore] => Array
        (
            [daemon] => nagios
            [output] => nagios is not running
            [return_code] => 0
            [status] => 0
        )

    [pnp] => Array
        (
            [daemon] => npcd
            [output] => NPCD running (pid 1558).
            [return_code] => 0
            [status] => 0
        )

    [ndoutils] => Array
        (
            [daemon] => ndo2db
            [output] => ndo2db (pid 1625) is running...
            [return_code] => 0
            [status] => 0
        )

)
CORE STATS:
Array
(
    [activehostchecks] => Array
        (
            [1min] => 0
            [5min] => 0
            [15min] => 75
        )

    [passivehostchecks] => Array
        (
            [1min] => 0
            [5min] => 0
            [15min] => 0
        )

    [activeservicechecks] => Array
        (
            [1min] => 0
            [5min] => 0
            [15min] => 302
        )

    [passiveservicechecks] => Array
        (
            [1min] => 0
            [5min] => 0
            [15min] => 0
        )

    [activehostcheckperf] => Array
        (
            [min_latency] => 0
            [max_latency] => 0.00258
            [avg_latency] => 6.53932584269663e-05
            [min_execution_time] => 0.00145
            [max_execution_time] => 10.00252
            [avg_execution_time] => 0.27182797752809
        )

    [activeservicecheckperf] => Array
        (
            [min_latency] => 0
            [max_latency] => 0.181
            [avg_latency] => 0.000642213622291022
            [min_execution_time] => 0
            [max_execution_time] => 20.84528
            [avg_execution_time] => 0.857088405572756
        )

)
LOAD:
Array
(
    [load1] => 0.59
    [load5] => 0.77
    [load15] => 0.94
)
MEMORY:
Array
(
    [total] => 7865
    [used] => 2121
    [free] => 5744
    [shared] => 13
    [buffers] => 169
    [cached] => 934
)
SWAP:
Array
(
    [total] => 2015
    [used] => 0
    [free] => 2015
)
IOSTAT:
Array
(
    [user] => 5.16
    [nice] => 0.00
    [system] => 0.56
    [iowait] => 0.05
    [steal] => 0.00
    [idle] => 94.23
)
.DB BACKEND:
Array
(
    [last_checkin] => 2015-04-03 14:44:43
    [bytes_processed] => 3808714
    [entries_processed] => 6383
    [connect_time] => 2015-04-03 14:33:30
    [disconnect_time] => 0000-00-00 00:00:00
)
CMDLINE=/etc/init.d/nagios status
nagios is not running
OUTPUT=nagios is not running
RETURNCODE=0
CMDLINE=/etc/init.d/npcd status
NPCD running (pid 1558).
OUTPUT=NPCD running (pid 1558).
RETURNCODE=0
CMDLINE=/etc/init.d/ndo2db status
ndo2db (pid 1625) is running...
OUTPUT=ndo2db (pid 1625) is running...
RETURNCODE=0
DAEMONS:
Array
(
    [nagioscore] => Array
        (
            [daemon] => nagios
            [output] => nagios is not running
            [return_code] => 0
            [status] => 0
        )

    [pnp] => Array
        (
            [daemon] => npcd
            [output] => NPCD running (pid 1558).
            [return_code] => 0
            [status] => 0
        )

    [ndoutils] => Array
        (
            [daemon] => ndo2db
            [output] => ndo2db (pid 1625) is running...
            [return_code] => 0
            [status] => 0
        )

)
CORE STATS:
Array
(
    [activehostchecks] => Array
        (
            [1min] => 0
            [5min] => 0
            [15min] => 72
        )

    [passivehostchecks] => Array
        (
            [1min] => 0
            [5min] => 0
            [15min] => 0
        )

    [activeservicechecks] => Array
        (
            [1min] => 0
            [5min] => 0
            [15min] => 280
        )

    [passiveservicechecks] => Array
        (
            [1min] => 0
            [5min] => 0
            [15min] => 0
        )

    [activehostcheckperf] => Array
        (
            [min_latency] => 0
            [max_latency] => 0.00258
            [avg_latency] => 6.53932584269663e-05
            [min_execution_time] => 0.00145
            [max_execution_time] => 10.00252
            [avg_execution_time] => 0.27182797752809
        )

    [activeservicecheckperf] => Array
        (
            [min_latency] => 0
            [max_latency] => 0.181
            [avg_latency] => 0.000642213622291022
            [min_execution_time] => 0
            [max_execution_time] => 20.84528
            [avg_execution_time] => 0.857088405572756
        )

)
LOAD:
Array
(
    [load1] => 0.57
    [load5] => 0.75
    [load15] => 0.93
)
MEMORY:
Array
(
    [total] => 7865
    [used] => 2112
    [free] => 5753
    [shared] => 13
    [buffers] => 169
    [cached] => 934
)
SWAP:
Array
(
    [total] => 2015
    [used] => 0
    [free] => 2015
)
Done
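
The stale check-in can be measured rather than eyeballed. This is a minimal sketch, assuming GNU date and grep and the log format shown above. checkin_age_seconds is a made-up helper name, and the 300-second threshold in the example is arbitrary.

```shell
#!/bin/sh
# Print how many seconds have passed since the first [last_checkin] entry
# in a sysstat.log. $1 is the log path; $2 is an optional "now" in epoch seconds.
checkin_age_seconds() {
    last=$(grep -m1 '\[last_checkin\]' "$1" | sed 's/.*=> //')
    now=${2:-$(date +%s)}
    # date -d is a GNU extension; the timestamp is read as local time.
    echo $(( now - $(date -d "$last" +%s) ))
}

# Example: warn when the check-in is older than 5 minutes.
# [ "$(checkin_age_seconds /usr/local/nagiosxi/var/sysstat.log)" -gt 300 ] && echo "stale"
```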

Re: System Status and Monitoring Engine Status Invalid

Post by ssax »

This looks like a bug; I will report it to the developers.

I'll look for a temporary solution and update the post when I've found one.

Edit:

Code: Select all

NEW TASK ID 5386 created - Nagios XI Bug Report: service nagios status or /etc/init.d/nagios status returns OK when it's not running

Re: System Status and Monitoring Engine Status Invalid

Post by ssax »

Edit /etc/init.d/nagios and on line 137 you'll see this code:

Code: Select all

echo "nagios is not running"
Add a new line after that and make it look like this:

Code: Select all

echo "nagios is not running"
return 1
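
Why the extra line matters: the sysstat.log earlier in the thread records RETURNCODE=0 next to "nagios is not running", so anything that keys off the exit code rather than the text sees success. Below is a standalone sketch of a status function that fails properly. It is not the actual Nagios XI init script; nagios_status and the PID_FILE default are assumptions for illustration (the lock file path comes from the directory listing earlier in the thread).

```shell
#!/bin/sh
# Hypothetical status helper, NOT the real /etc/init.d/nagios.
# The PID_FILE default is an assumption based on the nagios.lock file listed earlier.
PID_FILE="${PID_FILE:-/usr/local/nagios/var/nagios.lock}"

nagios_status() {
    # kill -0 sends no signal; it only tests that the PID exists.
    # Run as root or the nagios user, otherwise the test can fail on permissions.
    if [ -r "$PID_FILE" ] && kill -0 "$(cat "$PID_FILE")" 2>/dev/null; then
        echo "nagios (pid $(cat "$PID_FILE")) is running..."
        return 0
    fi
    echo "nagios is not running"
    # Non-zero so callers see the failure. 1 matches the fix above;
    # the LSB init script convention uses 3 for "program is not running".
    return 1
}
```

Calling `nagios_status; echo "RETURNCODE=$?"` after a kill -9 should then print a non-zero code alongside the "not running" message.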

Re: System Status and Monitoring Engine Status Invalid

Post by rseiwert »

Thanks, that works for 2 out of the 3 statuses. I just want to report that it doesn't solve the problem for the Monitoring Engine Process on the Monitoring Engine Status page (the one pictured below).
Image
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: System Status and Monitoring Engine Status Invalid

Post by lmiltchev »

rseiwert wrote: "Thanks, that works for 2 out of the 3 statuses. I just want to report that it doesn't solve the problem for the Monitoring Engine Process on the Monitoring Engine Status page (the one pictured below)."

Can you elaborate? What is the issue at the moment - nagios is not running but it shows as running in the web UI (PID 63164)?

What is the output of the following commands?

Code: Select all

service nagios status
ps -ef | grep /bin/[n]agios

Re: System Status and Monitoring Engine Status Invalid

Post by rseiwert »

There are three places I know of in XI that tell me things are processing normally. In the scenario above, where I described how to replicate the issue, and after putting in the fix suggested by ssax, 2 out of 3 statuses now correctly tell me I have a problem, but one still tells me that things are processing normally and that a non-existent process is running just fine. 2 out of 3 ain't bad, and much better than before, when it was zero out of three.