System Status and Monitoring Engine Status Invalid
Re: System Status and Monitoring Engine Status Invalid
Thanks MP, that helps. I've been starting and stopping things to determine a baseline, and I don't know what a lot of these nagios processes are. The problem is that the NOC is going to look at what's on the board, so I do need the board to be accurate. The fact that the system implies everything is OK, with the current date and time on it, is bad. Most of the people charged with watching don't have access to the nagios console.
MP, also, all of those pass for me (once tweaked for my install locations). What was crashing was the Nagios core process which can be checked with
ps -ef | grep '/usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg' | grep -v grep
which should return 2 processes from what I can tell.
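The same check can be wrapped in a small function so it does not depend on the `grep -v grep` trick. This is only a sketch: the command line is the one quoted in this thread, and the function names `count_core_procs` and `core_is_up` are made up for illustration. It compares the full command line exactly, so other processes that merely mention the path are not counted.

```shell
# Command line of the Nagios core daemon as quoted in this thread.
# Adjust it if your install locations differ.
NAGIOS_CMD='/usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg'

# Read one command line per input line on stdin and print how many of
# them are exactly equal to $NAGIOS_CMD. (Hypothetical helper name.)
count_core_procs() {
    awk -v want="$NAGIOS_CMD" '$0 == want { n++ } END { print n + 0 }'
}

# Succeed only if at least one matching process is currently running.
# "ps -eo args=" prints the full command line of every process, no header.
core_is_up() {
    [ "$(ps -eo args= | count_core_procs)" -ge 1 ]
}
```

For example, `core_is_up || echo "nagios core is NOT running"` could be run from cron on the XI server.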
Last edited by rseiwert on Sat Apr 04, 2015 1:14 pm, edited 1 time in total.
Grumpy Olde IT Guy
Re: System Status and Monitoring Engine Status Invalid
At the time there were no Nagios processes running (i.e. /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg), but just about every other process was running. I believe the nagios core process is being killed off for some reason, probably a check gone crazy. I'm still looking at this, but here is the death of the nagios core process:
Apr 2 20:09:26 nagios nagios: wproc: Socket to worker Core Worker 1609 broken, removing
Apr 2 20:09:26 nagios nagios: Caught SIGSEGV, shutting down...
I would expect the XI php icons to go red at this point.
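To find earlier crashes of the same kind, the syslog can be searched for that message. A minimal sketch, assuming the messages land in a file such as /var/log/messages (that path is an assumption, and `find_core_crashes` is a made-up name). It matches only the SIGSEGV line shown above.

```shell
# find_core_crashes FILE...: print every log line in which Nagios core
# reports that it caught SIGSEGV. The match string is taken from the
# log excerpt in this thread. Exits non-zero if nothing matched.
find_core_crashes() {
    grep -F 'nagios: Caught SIGSEGV' "$@"
}
```

Usage, with the assumed log location: `find_core_crashes /var/log/messages`.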
The current /usr/local/nagiosxi/var/sysstat.log looks like it gets overwritten regularly with current info, and things are working (I rebooted), so I will attach the current log and the last archive. I wish I had saved the one from this morning, but the head of it is in the original post.
Permissions check
Code: Select all
[root@nagios etc]# ll -d /usr/local/nagios/var
drwxrwxr-x 6 nagios nagios 4096 Apr 3 13:44 /usr/local/nagios/var
[root@nagios etc]# ll /usr/local/nagios/var
total 50940
drwxrwxr-x 2 nagios nagios 20480 Apr 1 00:00 archives
-rw-r--r-- 1 apache apache 28127830 Apr 3 13:29 graphapi.log
-rw-r--r-- 1 nagios nagios 0 Apr 3 13:44 host-perfdata
-rw-r--r-- 1 nagios nagios 349836 Apr 3 13:44 nagios.debug
-rw-r--r-- 1 nagios nagios 1000038 Apr 3 13:43 nagios.debug.old
-rw-r--r-- 1 nagios nagios 6 Apr 3 12:40 nagios.lock
-rw-r--r-- 1 nagios nagios 1929637 Apr 3 13:43 nagios.log
-rw-rw-r-- 1 nagios users 1375000 Jan 12 11:02 nagios.tmp4aAc98
-rw-r--r-- 1 nagios nagios 5 Apr 2 17:21 ndo2db.lock
-rw-r--r-- 1 nagios nagios 0 Apr 3 12:25 ndomod.tmp
srwxr-xr-x 1 nagios nagios 0 Apr 2 17:21 ndo.sock
-rw-r--r-- 1 nagios nagios 2484319 Apr 3 13:44 npcd.log
-rw-r--r-- 1 nagios nagios 10485799 Apr 2 19:37 npcd.log.old
-rw-r--r-- 1 nagios nagios 660162 Apr 3 12:25 objects.cache
-rw-rw-rw- 1 nagios nagios 3232687 Apr 3 11:45 perfdata.log
-rw------- 1 nagios nagios 1202368 Apr 3 13:25 retention.dat
drwxrwsr-x 2 nagios nagcmd 4096 Apr 3 12:40 rw
-rw-r--r-- 1 nagios nagios 5490 Apr 3 13:44 service-perfdata
drwxr-xr-x 5 nagios nagios 4096 Feb 12 2014 spool
drwxr-xr-x 2 nagios nagios 4096 Apr 1 18:18 stats
-rw-rw-r-- 1 nagios nagios 1195587 Apr 3 13:44 status.dat
Re: System Status and Monitoring Engine Status Invalid
Only three attachments per post. Here is the last log requested. Too soon to post again. Arggg
Re: System Status and Monitoring Engine Status Invalid
The only thing I saw in the postgresql log was when the machine was rebooted.
So if you manually stop the nagios process now, does it show red? I understand if you can't do this until a scheduled time; I'm just wondering if the reboot fixed the issue.
The nagios process doesn't have to be running for the sysstat information to be updated. If it was unable to update, then either the postgresql DB wasn't allowing it or the sysstat.php cron job was failing (which doesn't look like the case from the logs).
If you run into this again, please run this command while you are experiencing the issue and post the output:
Code: Select all
echo "select * from xi_sysstat where metric = 'daemons' \x\g;" | psql nagiosxi nagiosxi
Re: System Status and Monitoring Engine Status Invalid
Steps to replicate this issue.
If you run /etc/init.d/nagios stop, the Process State in the Monitoring Engine Status goes red almost immediately,
but if you kill off the nagios process with extreme prejudice to simulate a crash (kill -9), the Process State stays green, showing the running time, the process ID, and that it was recently updated.
Code: Select all
[root@nagios var]# ps -ef | grep '/usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg' | grep -v grep
nagios 63164 1 0 14:33 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 63178 63164 0 14:33 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
[root@nagios var]# kill -9 63164
[root@nagios var]# /etc/init.d/nagios status
nagios is not running
After 10 minutes the XI interface still shows all OK,
and checking sysstat.log shows that the last check-in is no longer fresh and that nagios is not running.
Code: Select all
[root@nagios var]# cat sysstat.log
DB BACKEND:
Array
(
[last_checkin] => 2015-04-03 14:44:43
[bytes_processed] => 3808714
[entries_processed] => 6383
[connect_time] => 2015-04-03 14:33:30
[disconnect_time] => 0000-00-00 00:00:00
)
CMDLINE=/etc/init.d/nagios status
nagios is not running
OUTPUT=nagios is not running
RETURNCODE=0
CMDLINE=/etc/init.d/npcd status
NPCD running (pid 1558).
OUTPUT=NPCD running (pid 1558).
RETURNCODE=0
CMDLINE=/etc/init.d/ndo2db status
ndo2db (pid 1625) is running...
OUTPUT=ndo2db (pid 1625) is running...
RETURNCODE=0
DAEMONS:
Array
(
[nagioscore] => Array
(
[daemon] => nagios
[output] => nagios is not running
[return_code] => 0
[status] => 0
)
[pnp] => Array
(
[daemon] => npcd
[output] => NPCD running (pid 1558).
[return_code] => 0
[status] => 0
)
[ndoutils] => Array
(
[daemon] => ndo2db
[output] => ndo2db (pid 1625) is running...
[return_code] => 0
[status] => 0
)
)
CORE STATS:
Array
(
[activehostchecks] => Array
(
[1min] => 0
[5min] => 0
[15min] => 80
)
[passivehostchecks] => Array
(
[1min] => 0
[5min] => 0
[15min] => 0
)
[activeservicechecks] => Array
(
[1min] => 0
[5min] => 0
[15min] => 325
)
[passiveservicechecks] => Array
(
[1min] => 0
[5min] => 0
[15min] => 0
)
[activehostcheckperf] => Array
(
[min_latency] => 0
[max_latency] => 0.00258
[avg_latency] => 6.53932584269663e-05
[min_execution_time] => 0.00145
[max_execution_time] => 10.00252
[avg_execution_time] => 0.27182797752809
)
[activeservicecheckperf] => Array
(
[min_latency] => 0
[max_latency] => 0.181
[avg_latency] => 0.000642213622291022
[min_execution_time] => 0
[max_execution_time] => 20.84528
[avg_execution_time] => 0.857088405572756
)
)
LOAD:
Array
(
[load1] => 0.62
[load5] => 0.78
[load15] => 0.95
)
MEMORY:
Array
(
[total] => 7865
[used] => 2145
[free] => 5720
[shared] => 13
[buffers] => 169
[cached] => 934
)
SWAP:
Array
(
[total] => 2015
[used] => 0
[free] => 2015
)
IOSTAT:
Array
(
[user] => 3.79
[nice] => 0.00
[system] => 0.51
[iowait] => 0.05
[steal] => 0.00
[idle] => 95.65
)
.DB BACKEND:
Array
(
[last_checkin] => 2015-04-03 14:44:43
[bytes_processed] => 3808714
[entries_processed] => 6383
[connect_time] => 2015-04-03 14:33:30
[disconnect_time] => 0000-00-00 00:00:00
)
CMDLINE=/etc/init.d/nagios status
nagios is not running
OUTPUT=nagios is not running
RETURNCODE=0
CMDLINE=/etc/init.d/npcd status
NPCD running (pid 1558).
OUTPUT=NPCD running (pid 1558).
RETURNCODE=0
CMDLINE=/etc/init.d/ndo2db status
ndo2db (pid 1625) is running...
OUTPUT=ndo2db (pid 1625) is running...
RETURNCODE=0
DAEMONS:
Array
(
[nagioscore] => Array
(
[daemon] => nagios
[output] => nagios is not running
[return_code] => 0
[status] => 0
)
[pnp] => Array
(
[daemon] => npcd
[output] => NPCD running (pid 1558).
[return_code] => 0
[status] => 0
)
[ndoutils] => Array
(
[daemon] => ndo2db
[output] => ndo2db (pid 1625) is running...
[return_code] => 0
[status] => 0
)
)
CORE STATS:
Array
(
[activehostchecks] => Array
(
[1min] => 0
[5min] => 0
[15min] => 75
)
[passivehostchecks] => Array
(
[1min] => 0
[5min] => 0
[15min] => 0
)
[activeservicechecks] => Array
(
[1min] => 0
[5min] => 0
[15min] => 302
)
[passiveservicechecks] => Array
(
[1min] => 0
[5min] => 0
[15min] => 0
)
[activehostcheckperf] => Array
(
[min_latency] => 0
[max_latency] => 0.00258
[avg_latency] => 6.53932584269663e-05
[min_execution_time] => 0.00145
[max_execution_time] => 10.00252
[avg_execution_time] => 0.27182797752809
)
[activeservicecheckperf] => Array
(
[min_latency] => 0
[max_latency] => 0.181
[avg_latency] => 0.000642213622291022
[min_execution_time] => 0
[max_execution_time] => 20.84528
[avg_execution_time] => 0.857088405572756
)
)
LOAD:
Array
(
[load1] => 0.59
[load5] => 0.77
[load15] => 0.94
)
MEMORY:
Array
(
[total] => 7865
[used] => 2121
[free] => 5744
[shared] => 13
[buffers] => 169
[cached] => 934
)
SWAP:
Array
(
[total] => 2015
[used] => 0
[free] => 2015
)
IOSTAT:
Array
(
[user] => 5.16
[nice] => 0.00
[system] => 0.56
[iowait] => 0.05
[steal] => 0.00
[idle] => 94.23
)
.DB BACKEND:
Array
(
[last_checkin] => 2015-04-03 14:44:43
[bytes_processed] => 3808714
[entries_processed] => 6383
[connect_time] => 2015-04-03 14:33:30
[disconnect_time] => 0000-00-00 00:00:00
)
CMDLINE=/etc/init.d/nagios status
nagios is not running
OUTPUT=nagios is not running
RETURNCODE=0
CMDLINE=/etc/init.d/npcd status
NPCD running (pid 1558).
OUTPUT=NPCD running (pid 1558).
RETURNCODE=0
CMDLINE=/etc/init.d/ndo2db status
ndo2db (pid 1625) is running...
OUTPUT=ndo2db (pid 1625) is running...
RETURNCODE=0
DAEMONS:
Array
(
[nagioscore] => Array
(
[daemon] => nagios
[output] => nagios is not running
[return_code] => 0
[status] => 0
)
[pnp] => Array
(
[daemon] => npcd
[output] => NPCD running (pid 1558).
[return_code] => 0
[status] => 0
)
[ndoutils] => Array
(
[daemon] => ndo2db
[output] => ndo2db (pid 1625) is running...
[return_code] => 0
[status] => 0
)
)
CORE STATS:
Array
(
[activehostchecks] => Array
(
[1min] => 0
[5min] => 0
[15min] => 72
)
[passivehostchecks] => Array
(
[1min] => 0
[5min] => 0
[15min] => 0
)
[activeservicechecks] => Array
(
[1min] => 0
[5min] => 0
[15min] => 280
)
[passiveservicechecks] => Array
(
[1min] => 0
[5min] => 0
[15min] => 0
)
[activehostcheckperf] => Array
(
[min_latency] => 0
[max_latency] => 0.00258
[avg_latency] => 6.53932584269663e-05
[min_execution_time] => 0.00145
[max_execution_time] => 10.00252
[avg_execution_time] => 0.27182797752809
)
[activeservicecheckperf] => Array
(
[min_latency] => 0
[max_latency] => 0.181
[avg_latency] => 0.000642213622291022
[min_execution_time] => 0
[max_execution_time] => 20.84528
[avg_execution_time] => 0.857088405572756
)
)
LOAD:
Array
(
[load1] => 0.57
[load5] => 0.75
[load15] => 0.93
)
MEMORY:
Array
(
[total] => 7865
[used] => 2112
[free] => 5753
[shared] => 13
[buffers] => 169
[cached] => 934
)
SWAP:
Array
(
[total] => 2015
[used] => 0
[free] => 2015
)
Done
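The contradiction in the log above is that the status command prints "nagios is not running" while RETURNCODE is 0. A rough sketch for spotting that pattern automatically follows. It assumes the CMDLINE= / OUTPUT= / RETURNCODE= layout shown in this sysstat.log, and `find_status_mismatches` is a made-up name, not part of Nagios XI.

```shell
# find_status_mismatches FILE...: print every status command whose
# output says "not running" although its recorded return code is 0.
# Relies on the CMDLINE=/OUTPUT=/RETURNCODE= lines seen in sysstat.log.
find_status_mismatches() {
    awk '
        /^CMDLINE=/    { cmd = substr($0, 9) }    # text after "CMDLINE="
        /^OUTPUT=/     { out = substr($0, 8) }    # text after "OUTPUT="
        /^RETURNCODE=/ {
            rc = substr($0, 12)                   # text after "RETURNCODE="
            if (rc == "0" && out ~ /not running/)
                print cmd ": exit " rc " but output was \"" out "\""
        }
    ' "$@"
}
```

Run against the file from this thread it would be called as `find_status_mismatches /usr/local/nagiosxi/var/sysstat.log`. Any line it prints is a daemon that reports itself down while the exit code still says OK.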
Re: System Status and Monitoring Engine Status Invalid
This looks like a bug; I will report it to the developers.
I'll look for a temporary solution and update the post when I've found one.
Edit:
Code: Select all
NEW TASK ID 5386 created - Nagios XI Bug Report: service nagios status or /etc/init.d/nagios status returns OK when it's not running
Re: System Status and Monitoring Engine Status Invalid
Edit /etc/init.d/nagios and on line 137 you'll see this code:
Code: Select all
echo "nagios is not running"
Add a new line after that and make it look like this:
Code: Select all
echo "nagios is not running"
return 1
Re: System Status and Monitoring Engine Status Invalid
Thanks, that works for 2 out of the 3 statuses. I just want to report that it doesn't solve the problem for the Monitoring Engine Process on the Monitoring Engine Status Page (the one pictured below).


Re: System Status and Monitoring Engine Status Invalid
Can you elaborate? What is the issue at the moment - nagios is not running but it shows as running in the web UI (PID 63164)?
What is the output of the following commands?
Code: Select all
service nagios status
ps -ef | grep /bin/[n]agios
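If the web UI keeps showing a PID after a crash, one thing worth comparing is the PID recorded on disk against the live process table. The sketch below assumes that /usr/local/nagios/var/nagios.lock (listed earlier in this thread) holds the daemon's PID as plain text. That is an assumption about the file's content, and `pid_state` is a made-up helper name.

```shell
# pid_state LOCKFILE: print "running", "stale" or "missing".
# Returns 0 only when the PID stored in LOCKFILE belongs to a live process.
# Note: kill -0 sends no signal, it only tests whether the PID can be
# signalled, so run this as root or as the nagios user.
pid_state() {
    lock=$1
    if [ ! -r "$lock" ]; then
        echo "missing"
        return 3
    fi
    # Keep only the digits, in case the file has trailing whitespace.
    pid=$(tr -cd '0-9' < "$lock")
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo "running"
        return 0
    fi
    echo "stale"
    return 1
}
```

For example, `pid_state /usr/local/nagios/var/nagios.lock` printing "stale" after a kill -9 would mean the lock file was left behind with a dead PID in it.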
Re: System Status and Monitoring Engine Status Invalid
There are three places I know of in XI which tell me that things are processing normally. Referencing the scenario above, where I described how to replicate the issue, and after putting in the fix suggested by ssax, 2 out of 3 statuses now correctly tell me I have a problem. One still tells me that things are processing normally and that a non-existent process is running just fine. 2 out of 3 ain't bad, and much better than before, when it was zero out of three.