Nagios Stops Processing
For the last couple of days Nagios XI has repeatedly stopped processing checks. Everything appears to run fine, and I get green checks for all running processes. If I restart the process from the admin page, it runs for a few hours and stops again. The monitoring engine stats show zero scheduled checks. Issuing a command, such as scheduling an immediate check, times out with no response. Running 2014R1.5 on CentOS 6, kernel 2.6.32-504.8.1.el6.x86_64.
Not sure where to start looking for an issue at this point.
Last thing in messages before restarting
Apr 1 13:13:28 nagios nagios: wproc: iocache_capacity() is 0 for worker Core Worker 16273.
Apr 1 13:13:28 nagios nagios: wproc: Socket to worker Core Worker 16273 broken, removing
Apr 1 13:13:28 nagios nagios: Caught SIGSEGV, shutting down...
Apr 1 14:51:52 nagios nagios: Nagios 4.0.8 starting... (PID=7965)
Grumpy Olde IT Guy
Re: Nagios Stops Processing
Is updating to 2014r2.6 a possibility? There were some issues with scheduling in 2014r1.x. If it is not a possibility, at least try turning off auto check rescheduling.
Edit nagios.cfg, change:
Code: Select all
auto_reschedule_checks=1
To:
Code: Select all
auto_reschedule_checks=0
And restart nagios:
Code: Select all
service nagios restart
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
Re: Nagios Stops Processing
I would like to say the 2.6 upgrade has to be the most painless Nagios upgrade I have ever performed. I think not a single one of my checks got screwed up.
I don't know if the problem is fixed by upgrading as only time will tell but I have some thoughts.
It seems possible that the host and service pages could detect when their next scheduled check has not been fired off. Over the last couple of days there have been many times I didn't notice that a scheduled check was past due and the state on the screen was from the previous day. Where it states Next Check: 2015-04-01 19:22:20, the text could be in red if the check is late, or even better, something at the top of the page above the first line could state that the data is out of date. I'm sorry to say most of my people are used to Nagios working and don't notice that data is out of date. It's green so everything must be good.
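Until something like that exists in the UI, the idea can be prototyped outside XI. Here is a minimal Python sketch, assuming Nagios Core's status.dat uses the usual `hoststatus { ... }` and `servicestatus { ... }` blocks with a `next_check=<epoch>` line. The function name `overdue_checks` and the 300-second grace period are made up for illustration and are not part of Nagios.

```python
import re
import time

# Assumed status.dat layout: "servicestatus {" ... "}" blocks of key=value lines.
BLOCK_RE = re.compile(
    r"^\s*(hoststatus|servicestatus)\s*\{(.*?)^\s*\}", re.S | re.M
)


def overdue_checks(status_text, now=None, grace=300):
    """Return (host, service, seconds_late) for checks past due by more than grace.

    service is None for host checks. grace is a hypothetical tolerance in seconds.
    """
    now = time.time() if now is None else now
    late = []
    for _kind, body in BLOCK_RE.findall(status_text):
        fields = dict(
            line.strip().split("=", 1)
            for line in body.splitlines()
            if "=" in line
        )
        next_check = int(float(fields.get("next_check", 0)))
        # next_check=0 is treated as "nothing scheduled" and skipped.
        if next_check and now - next_check > grace:
            late.append(
                (
                    fields.get("host_name"),
                    fields.get("service_description"),
                    int(now - next_check),
                )
            )
    return late
```

To try it, read the file Core writes (often /usr/local/nagios/var/status.dat on a source install, but check `status_file` in your nagios.cfg) and pass its contents to `overdue_checks`. Anything returned is a check whose Next Check time has already passed.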
Grumpy Olde IT Guy
Re: Nagios Stops Processing
It stopped working again after upgrading. I doubled the RAM and the CPUs, so it is now running 8GB with 4 cores, monitoring approx. 90 hosts, and I rebooted the machine after upgrading.
In the log I see
Apr 1 22:34:19 nagios nagios: wproc: iocache_capacity() is 0 for worker Core Worker 8077.
Apr 1 22:34:19 nagios nagios: wproc: Socket to worker Core Worker 8077 broken, removing
Apr 1 22:34:19 nagios nagios: Caught SIGSEGV, shutting down...
Earlier in the messages I see
Apr 1 21:53:44 nagios nagios: Warning: The check of service 'System Log' on host 'BE1.vca.com' looks like it was orphaned (results never came back; last_check=1427938314; next_check=1427938913). I'm scheduling an immediate check of the service...
.....
Apr 1 22:02:19 nagios nagios: wproc: iocache_capacity() is 0 for worker Core Worker 8078.
Apr 1 22:02:19 nagios nagios: wproc: Socket to worker Core Worker 8078 broken, removing
Apr 1 22:02:19 nagios nagios: wproc: Job with id '5247' doesn't exist on Core Worker 8077.
Apr 1 22:02:19 nagios nagios: wproc: Job with id '5250' doesn't exist on Core Worker 8077.
Apr 1 22:02:19 nagios nagios: wproc: Job with id '5246' doesn't exist on Core Worker 8077.
Apr 1 22:02:19 nagios nagios: wproc: Job with id '5251' doesn't exist on Core Worker 8077.
Apr 1 22:02:19 nagios nagios: wproc: Job with id '5245' doesn't exist on Core Worker 8077.
Apr 1 22:02:19 nagios nagios: wproc: Job with id '5249' doesn't exist on Core Worker 8077.
Apr 1 22:02:24 nagios nagios: SERVICE ALERT: EX1;Active Sync Users;WARNING;SOFT;2;WARNING: nrm-7.5/min,
Apr 1 22:02:36 nagios nagios: wproc: Job with id '5248' doesn't exist on Core Worker 8077.
Apr 1 22:03:22 nagios nagios: SERVICE ALERT: EX1;Active Sync Users;WARNING;SOFT;3;WARNING: nrm-7.5/min,
Apr 1 22:03:43 nagios nagios: Warning: The check of service 'Actual Usage' on host 'iSCSIgroup' looks like it was orphaned (results never came back; last_check=1427938925; next_check=1427939523). I'm scheduling an immediate check of the service...
I will try auto_reschedule_checks=0
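In case it helps anyone reading the orphan warnings above: the `last_check` and `next_check` values are Unix epoch seconds. A small Python sketch (the helper name `parse_orphan` is hypothetical, and the regex only assumes the message wording shown in the log lines above) pulls out the host, service, and scheduled interval:

```python
import re
from datetime import datetime, timezone

# Matches the "orphaned" warning wording as it appears in the log above.
ORPHAN_RE = re.compile(
    r"The check of service '(?P<service>[^']*)' on host '(?P<host>[^']*)' "
    r"looks like it was orphaned \(results never came back; "
    r"last_check=(?P<last>\d+); next_check=(?P<next>\d+)\)"
)


def parse_orphan(line):
    """Return a dict describing an orphaned-check warning, or None if no match."""
    m = ORPHAN_RE.search(line)
    if not m:
        return None
    last, nxt = int(m["last"]), int(m["next"])
    return {
        "host": m["host"],
        "service": m["service"],
        # Epoch seconds rendered as an ISO timestamp in UTC.
        "last_check_utc": datetime.fromtimestamp(last, tz=timezone.utc).isoformat(),
        "interval_seconds": nxt - last,
    }
```

Feeding it each line of the messages log and counting results per host is one way to see whether the orphans cluster on particular hosts or are spread across everything.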
Grumpy Olde IT Guy
Re: Nagios Stops Processing
Try setting:
Code: Select all
auto_reschedule_checks=0
as recommended by abrist and let us know if this fixed your issue.
In regards to your previous point - you have a point. I believe it is a common practice (among many nagios users) not to read the numbers if everything is green.
Adding something on the Host/Service Status Detail page to alert users when "data is out of date" could be a good feature request candidate.
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: Nagios Stops Processing
I did set auto_reschedule_checks=0 and still NagiosXI stops processing. Core seems to continue to function normally. Currently running Nagios XI 2014R2.6.
In XI I still get System Ok: and all green checks. On the Monitor Engine Process it shows everything is green as well.
Process Start Time 2015-04-02 10:42:28
Total Running Time 2h 22m 57s (this must be client side as it is updating)
Process ID 17233
But when I look at a check in XI (at 1:10 PM) I see
Last Check: 2015-04-02 12:15:27
Next Check: 2015-04-02 12:20:27
Checking the running processes, they are indeed running:
Code: Select all
ps -ef | grep 17233
nagios 17233 1 0 10:42 ? 00:00:18 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 17235 17233 0 10:42 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 17236 17233 0 10:42 ? 00:00:04 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 17237 17233 0 10:42 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 17238 17233 0 10:42 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 17239 17233 0 10:42 ? 00:00:14 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 17240 17233 0 10:42 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 17246 17233 0 10:42 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
One interesting thing: even though I have auto_reschedule_checks=0, I see the following in the messages log
Apr 2 13:02:27 nagios nagios: Warning: The check of service 'FTP' on host 'nyctp' looks like it was orphaned (results never came back; last_check=1427993136; next_check=1427993436). I'm scheduling an immediate check of the service...
localhost (nagios) was alerting it needed updates. I ran yum update to get the new ssl. I see in the message log that Nagios recognized the update
Apr 2 13:00:27 nagios nagios: SERVICE ALERT: localhost;Yum Updates;OK;HARD;4;YUM OK: O/S is up to date.
but then I look in the XI interface
YUM WARNING: O/S requires an update.
Status Details
Service State: Warning
Duration: 15h 30m 3s
Service Stability: Unchanging (stable)
Last Check: 2015-04-02 12:25:26
Next Check: 2015-04-02 12:30:26
Checking Nagios Core, this shows the correct info; it is just XI that is failing to function.
Current Status: OK (for 0d 0h 17m 57s)
Status Information: YUM OK: O/S is up to date.
Performance Data:
Current Attempt: 1/4 (HARD state)
Last Check Time: 04-02-2015 13:15:26
Check Type: ACTIVE
Check Latency / Duration: 0.000 / 0.743 seconds
Next Scheduled Check: 04-02-2015 13:20:26
Last State Change: 04-02-2015 13:00:26
Still have no clue where to start troubleshooting why XI keeps failing with no major changes to what was once a working system.
Grumpy Olde IT Guy
Re: Nagios Stops Processing
Try following this procedure to see if it fixes the orphaned service.
http://support.nagios.com/wiki/index.ph ... g_Orphaned
Can you run the following and post back the results?
Code: Select all
ipcs -q
ulimit -a
df -h
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: Nagios Stops Processing
Tried killall -9 nagios and then restarted from the web interface. Things appeared to start processing, but the monitoring engine was either up or down depending on which view you looked at. Also, the blue ! stated active checks were disabled, but the stats panel and scheduler showed them as active. In the end I ended up rebooting. Rebooting definitely gets XI working for a few hours.
I'm starting to think that this is in the database backend. I think this because if I click on the gear on database backend and restart it then XI starts working again. Also I think this since Nagios Core is working just fine. This is the 4th time today that XI has crapped out. Make that 5th while writing this.
Code: Select all
[root@nagios security]# ipcs -q
------ Message Queues --------
key msqid owner perms used-bytes messages
0x89000002 294912 nagios 600 0 0
You have mail in /var/spool/mail/root
[root@nagios security]# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 62755
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 62755
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
[root@nagios security]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root
22G 15G 5.5G 73% /
tmpfs 3.9G 0 3.9G 0% /dev/shm
/dev/sda1 477M 143M 309M 32% /boot
//eng/massive 2.0T 361G 1.7T 18% /mnt/massive
Grumpy Olde IT Guy
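For anyone wanting to watch the `ipcs -q` numbers over time rather than eyeball them, here is a rough Python sketch that parses output in the format shown above. It assumes (this is not confirmed anywhere in this thread) that a growing `messages` count on the queue owned by `nagios` would point to a backlog between Core and the XI database backend. The function names are hypothetical.

```python
def parse_ipcs_queues(output):
    """Parse `ipcs -q` text into a list of dicts, one per message queue."""
    queues = []
    for line in output.splitlines():
        parts = line.split()
        # Data rows have six columns and start with a hex key such as 0x89000002.
        if len(parts) == 6 and parts[0].startswith("0x"):
            key, msqid, owner, perms, used_bytes, messages = parts
            queues.append(
                {
                    "key": key,
                    "msqid": int(msqid),
                    "owner": owner,
                    "perms": perms,
                    "used_bytes": int(used_bytes),
                    "messages": int(messages),
                }
            )
    return queues


def backlogged(queues, owner="nagios"):
    """Return queues for the given owner that still hold unread messages."""
    return [q for q in queues if q["owner"] == owner and q["messages"] > 0]
```

With the output posted above, `backlogged` would return an empty list, since the single nagios queue shows 0 used-bytes and 0 messages.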
Re: Nagios Stops Processing
Just fyi,
I've run into a couple of cases where there were multiple instances of nagios running. I suspect I applied a configuration
at or near the same time someone else did. It causes strange problems. So if I see something strange I generally run
ps -eaf | grep "nagios -d"
root 19028 14185 0 17:39 pts/0 00:00:00 grep nagios -d
nagios 27644 1 3 16:50 ? 00:01:48 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 27657 27644 0 16:50 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
On my system, there are normally two processes, a parent and a child. If I see two pairs, I stop nagios, then manually
kill the other parent, verify none are left, then start nagios.
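The same check can be scripted. Below is a minimal Python sketch that takes `ps -ef` style output (the column layout shown above) and returns the PIDs of `nagios -d` processes that are not children of another `nagios -d` process. The function name `nagios_parents` is made up for illustration; more than one PID in the result would match the "two pairs" situation described above.

```python
def nagios_parents(ps_output):
    """Return sorted PIDs of top-level `nagios -d` daemons found in ps -ef output."""
    daemons = {}  # pid -> ppid for every process that looks like `nagios -d ...`
    for line in ps_output.splitlines():
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD
        parts = line.split(None, 7)
        if len(parts) < 8 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue  # header or malformed line
        pid, ppid = int(parts[1]), int(parts[2])
        argv = parts[7].split()
        # Match on the executable name so the grep process itself is ignored.
        if argv[0].rsplit("/", 1)[-1] == "nagios" and "-d" in argv[1:]:
            daemons[pid] = ppid
    # A parent is a daemon whose own parent is not another nagios daemon.
    return sorted(pid for pid, ppid in daemons.items() if ppid not in daemons)
```

Worker processes started with `--worker` are deliberately not counted, since they do not carry the `-d` flag.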
Re: Nagios Stops Processing
If you are seeing multiple nagios parent processes after applying config, you may want to open a ticket by emailing [email protected], as there are a few things that can cause this, and we will most likely want to move to a remote session to resolve it.
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.