Nagios Stops Processing

rseiwert · Post by **rseiwert** » Wed Apr 01, 2015 2:03 pm

For the last couple of days Nagios XI has stopped processing checks. Everything appears to run fine: I get green checks for all running processes. If I restart the process from the admin it runs for a few hours and then stops again. The monitoring engine stats show zero scheduled checks. Issuing a command, such as scheduling an immediate check, times out with no response. Running 2014R1.5 on CentOS 6, kernel 2.6.32-504.8.1.el6.x86_64.

Not sure where to start looking for an issue at this point.

Last thing in messages before restarting
Apr 1 13:13:28 nagios nagios: wproc: iocache_capacity() is 0 for worker Core Worker 16273.
Apr 1 13:13:28 nagios nagios: wproc: Socket to worker Core Worker 16273 broken, removing
Apr 1 13:13:28 nagios nagios: Caught SIGSEGV, shutting down...
Apr 1 14:51:52 nagios nagios: Nagios 4.0.8 starting... (PID=7965)

abrist · Post by **abrist** » Wed Apr 01, 2015 2:11 pm

Is updating to 2014r2.6 a possibility? There were some issues with scheduling in 2014r1.x. If it is not a possibility, at least try turning off auto check rescheduling.
Edit nagios.cfg, change:

auto_reschedule_checks=1

To:

auto_reschedule_checks=0

And restart nagios:

service nagios restart
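For anyone who wants to script that change rather than edit by hand, here is a minimal sketch. It assumes the directive sits on its own uncommented `name=value` line; `set_directive` is a hypothetical helper, not part of Nagios, and the sample text is made up for illustration. It only rewrites a string, so reading and writing the real nagios.cfg and restarting the service are left to you.

```python
import re

def set_directive(cfg_text, name, value):
    """Return cfg_text with `name=value` set.

    Replaces an existing uncommented `name=...` line, or appends a new
    line if the directive is not present. Commented-out lines are ignored.
    """
    pattern = re.compile(rf"^[ \t]*{re.escape(name)}[ \t]*=.*$", re.MULTILINE)
    new_line = f"{name}={value}"
    if pattern.search(cfg_text):
        # A function replacement avoids backslash processing in the value.
        return pattern.sub(lambda _m: new_line, cfg_text)
    sep = "" if cfg_text.endswith("\n") or not cfg_text else "\n"
    return f"{cfg_text}{sep}{new_line}\n"

# Illustrative fragment only, not a complete nagios.cfg.
sample = "# sample fragment\nauto_reschedule_checks=1\nauto_rescheduling_interval=30\n"
updated = set_directive(sample, "auto_reschedule_checks", "0")
print(updated)
```

Keeping a backup copy of the original file before writing the result back is a sensible precaution.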

rseiwert · Post by **rseiwert** » Wed Apr 01, 2015 6:18 pm

I would like to say the 2.6 upgrade has had to be the most painless Nagios upgrade I have ever performed. I think not a single one of my checks got screwed up.

I don't know if the problem is fixed by upgrading as only time will tell but I have some thoughts.

It seems possible that the host and service pages could detect when their next scheduled check has not been fired off. Over the last couple of days there have been many times I didn't notice that a scheduled check was past due and the state on the screen was from the previous day. Where it states Next Check: 2015-04-01 19:22:20, the time could be shown in red if it is late, or even better, a message at the top of the page above the first line could state that the data is out of date. I'm sorry to say most of my people are used to Nagios working and don't notice when data is out of date. It's green, so everything must be good.

rseiwert · Post by **rseiwert** » Thu Apr 02, 2015 9:40 am

It stopped working again after upgrading. I doubled the RAM and the CPUs, so it's now running 8GB with 4 cores monitoring approx. 90 hosts, and I rebooted the machine after upgrading.

In the log I see
Apr 1 22:34:19 nagios nagios: wproc: iocache_capacity() is 0 for worker Core Worker 8077.
Apr 1 22:34:19 nagios nagios: wproc: Socket to worker Core Worker 8077 broken, removing
Apr 1 22:34:19 nagios nagios: Caught SIGSEGV, shutting down...

Earlier in the messages I see
Apr 1 21:53:44 nagios nagios: Warning: The check of service 'System Log' on host 'BE1.vca.com' looks like it was orphaned (results never came back; last_check=1427938314; next_check=1427938913). I'm scheduling an immediate check of the service...
.....
Apr 1 22:02:19 nagios nagios: wproc: iocache_capacity() is 0 for worker Core Worker 8078.
Apr 1 22:02:19 nagios nagios: wproc: Socket to worker Core Worker 8078 broken, removing
Apr 1 22:02:19 nagios nagios: wproc: Job with id '5247' doesn't exist on Core Worker 8077.
Apr 1 22:02:19 nagios nagios: wproc: Job with id '5250' doesn't exist on Core Worker 8077.
Apr 1 22:02:19 nagios nagios: wproc: Job with id '5246' doesn't exist on Core Worker 8077.
Apr 1 22:02:19 nagios nagios: wproc: Job with id '5251' doesn't exist on Core Worker 8077.
Apr 1 22:02:19 nagios nagios: wproc: Job with id '5245' doesn't exist on Core Worker 8077.
Apr 1 22:02:19 nagios nagios: wproc: Job with id '5249' doesn't exist on Core Worker 8077.
Apr 1 22:02:24 nagios nagios: SERVICE ALERT: EX1;Active Sync Users;WARNING;SOFT;2;WARNING: nrm-7.5/min,
Apr 1 22:02:36 nagios nagios: wproc: Job with id '5248' doesn't exist on Core Worker 8077.
Apr 1 22:03:22 nagios nagios: SERVICE ALERT: EX1;Active Sync Users;WARNING;SOFT;3;WARNING: nrm-7.5/min,
Apr 1 22:03:43 nagios nagios: Warning: The check of service 'Actual Usage' on host 'iSCSIgroup' looks like it was orphaned (results never came back; last_check=1427938925; next_check=1427939523). I'm scheduling an immediate check of the service...
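The orphaned-check warnings carry raw epoch timestamps, which are hard to read at a glance. As a rough sketch, the following pulls the host, service, and both timestamps out of warnings in the format shown above. `parse_orphans` is a hypothetical helper name, the regular expression is written only against these sample lines, and the times are converted as UTC, which may not match the server's local time zone.

```python
import re
from datetime import datetime, timezone

ORPHAN_RE = re.compile(
    r"The check of service '(?P<service>[^']+)' on host '(?P<host>[^']+)' "
    r"looks like it was orphaned \(results never came back; "
    r"last_check=(?P<last>\d+); next_check=(?P<next>\d+)\)"
)

def parse_orphans(log_text):
    """Return a list of (host, service, last_check, next_check) tuples."""
    found = []
    for m in ORPHAN_RE.finditer(log_text):
        last = datetime.fromtimestamp(int(m.group("last")), tz=timezone.utc)
        nxt = datetime.fromtimestamp(int(m.group("next")), tz=timezone.utc)
        found.append((m.group("host"), m.group("service"), last, nxt))
    return found

# The two warnings quoted in the post above.
log = """Apr 1 21:53:44 nagios nagios: Warning: The check of service 'System Log' on host 'BE1.vca.com' looks like it was orphaned (results never came back; last_check=1427938314; next_check=1427938913). I'm scheduling an immediate check of the service...
Apr 1 22:03:43 nagios nagios: Warning: The check of service 'Actual Usage' on host 'iSCSIgroup' looks like it was orphaned (results never came back; last_check=1427938925; next_check=1427939523). I'm scheduling an immediate check of the service...
"""

orphans = parse_orphans(log)
for host, service, last, nxt in orphans:
    interval = int((nxt - last).total_seconds())
    print(f"{host} / {service}: last={last:%H:%M:%S} next={nxt:%H:%M:%S} UTC ({interval}s apart)")
```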

I will try auto_reschedule_checks=0

Post by **lmiltchev** » Thu Apr 02, 2015 9:52 am

Try setting:

auto_reschedule_checks=0

as recommended by abrist and let us know if this fixed your issue.

Regarding your earlier suggestion, you have a point. I believe it is common practice among many Nagios users not to read the numbers if everything is green.

Adding something on the Host/Service Status Detail page to alert users when "data is out of date" could be a good feature request candidate.
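To make the idea concrete, here is a minimal sketch of the kind of test such a feature could apply. It is not how XI works today: `is_stale` and the five-minute grace period are assumptions chosen for illustration, and the viewing times are made up. The Next Check value is the one quoted earlier in the thread.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"  # timestamp format shown on the status detail page

def is_stale(next_check, now, grace=timedelta(minutes=5)):
    """True if the scheduled check is overdue by more than the grace period."""
    return now > next_check + grace

next_check = datetime.strptime("2015-04-01 19:22:20", FMT)

# Hypothetical viewing times: shortly after the due time, and much later.
shortly_after = datetime.strptime("2015-04-01 19:25:00", FMT)
much_later = datetime.strptime("2015-04-01 20:00:00", FMT)

print(is_stale(next_check, shortly_after))  # within the grace period
print(is_stale(next_check, much_later))     # overdue, page should warn
```

A grace period avoids flagging checks that are only a few seconds late because of normal scheduling latency.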

rseiwert · Post by **rseiwert** » Thu Apr 02, 2015 12:25 pm

I did set auto_reschedule_checks=0 and still NagiosXI stops processing. Core seems to continue to function normally. Currently running Nagios XI 2014R2.6.

In XI I still get System Ok: and all green checks. On the Monitor Engine Process it shows everything is green as well.
Process Start Time 2015-04-02 10:42:28
Total Running Time 2h 22m 57s (this must be client side as it is updating)
Process ID 17233
But when I look at a check in XI (at 1:10 PM) I see
Last Check: 2015-04-02 12:15:27
Next Check: 2015-04-02 12:20:27

Checking the running processes, they are indeed running:

ps -ef | grep 17233
nagios   17233     1  0 10:42 ?        00:00:18 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   17235 17233  0 10:42 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   17236 17233  0 10:42 ?        00:00:04 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   17237 17233  0 10:42 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   17238 17233  0 10:42 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   17239 17233  0 10:42 ?        00:00:14 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   17240 17233  0 10:42 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   17246 17233  0 10:42 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

One interesting thing: even though I have auto_reschedule_checks=0, I see the following in the messages log

Apr 2 13:02:27 nagios nagios: Warning: The check of service 'FTP' on host 'nyctp' looks like it was orphaned (results never came back; last_check=1427993136; next_check=1427993436). I'm scheduling an immediate check of the service...

localhost (nagios) was alerting that it needed updates. I ran yum update to get the new ssl. I see in the messages log that nagios recognized the update
Apr 2 13:00:27 nagios nagios: SERVICE ALERT: localhost;Yum Updates;OK;HARD;4;YUM OK: O/S is up to date.
but then I look in the XI interface
YUM WARNING: O/S requires an update.
Status Details
Service State: Warning
Duration: 15h 30m 3s
Service Stability: Unchanging (stable)
Last Check: 2015-04-02 12:25:26
Next Check: 2015-04-02 12:30:26

Checking Nagios Core, this shows the correct info; it is just XI that is failing to function.
Current Status: OK (for 0d 0h 17m 57s)
Status Information: YUM OK: O/S is up to date.
Performance Data:
Current Attempt: 1/4 (HARD state)
Last Check Time: 04-02-2015 13:15:26
Check Type: ACTIVE
Check Latency / Duration: 0.000 / 0.743 seconds
Next Scheduled Check: 04-02-2015 13:20:26
Last State Change: 04-02-2015 13:00:26

I still have no clue where to start troubleshooting why XI keeps failing, with no major changes to what was once a working system.

Post by **tgriep** » Thu Apr 02, 2015 1:48 pm

Try following this procedure to see if it fixes the orphaned service.
http://support.nagios.com/wiki/index.ph ... g_Orphaned

Can you run the following and post back the results?

ipcs -q
ulimit -a
df -h

rseiwert · Post by **rseiwert** » Thu Apr 02, 2015 4:23 pm

Tried the killall -9 nagios and then restarting from the web interface, and things appeared to start processing, but the monitoring engine was either up or down depending on which view you looked at. Also, the blue ! stated active checks were disabled, but the stats panel and scheduler showed them being active. In the end I ended up rebooting. Rebooting definitely gets XI working for a few hours.
I'm starting to think that this is in the database backend, because if I click on the gear on database backend and restart it, XI starts working again, and because Nagios Core is working just fine. This is the 4th time today that XI has crapped out. Make that 5th while writing this.

[root@nagios security]# ipcs -q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x89000002 294912     nagios     600        0            0

You have mail in /var/spool/mail/root
[root@nagios security]# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 62755
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 62755
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
[root@nagios security]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root
                       22G   15G  5.5G  73% /
tmpfs                 3.9G     0  3.9G   0% /dev/shm
/dev/sda1             477M  143M  309M  32% /boot
//eng/massive         2.0T  361G  1.7T  18% /mnt/massive
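If this needs to be checked repeatedly, the `ipcs -q` output can be parsed instead of read by eye. A rough sketch follows, assuming the six-column layout shown above; `queued_messages` is a hypothetical helper. The request for `ipcs -q` was presumably to see whether messages are piling up in the kernel queue, and in the output above the single queue is empty.

```python
def queued_messages(ipcs_output):
    """Parse `ipcs -q` output into {msqid: (used_bytes, messages)}."""
    queues = {}
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Data rows have six columns and start with a hex key such as 0x89000002.
        if len(fields) == 6 and fields[0].startswith("0x"):
            _key, msqid, _owner, _perms, used_bytes, messages = fields
            queues[int(msqid)] = (int(used_bytes), int(messages))
    return queues

# Output copied from the post above.
sample = """------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x89000002 294912     nagios     600        0            0
"""

queues = queued_messages(sample)
print(queues)  # prints {294912: (0, 0)}
```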

jwelch · Post by **jwelch** » Thu Apr 02, 2015 4:41 pm

Just fyi,
I've run into a couple of times where there were multiple instances of nagios running. I suspect I applied a configuration at or near the same time someone else did. It causes strange problems. So if I see something strange I generally run
ps -eaf | grep "nagios -d"

root 19028 14185 0 17:39 pts/0 00:00:00 grep nagios -d
nagios 27644 1 3 16:50 ? 00:01:48 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 27657 27644 0 16:50 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

On my system, there are normally two processes, a parent and a child. If I see two pairs, I stop nagios, then manually kill the other parent, verify none are left, then start nagios.
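The same check can be sketched in code. This assumes `ps -ef` style output with the command in the eighth column, and that a nagios parent daemon is one whose PPID is 1, as in the listing above; `nagios_parent_pids` is a hypothetical helper, and on systems where daemons are not reparented to PID 1 the test would need adjusting.

```python
def nagios_parent_pids(ps_output):
    """Return PIDs of `nagios -d` processes whose parent is PID 1."""
    parents = []
    for line in ps_output.splitlines():
        fields = line.split(None, 7)  # UID PID PPID C STIME TTY TIME CMD
        if len(fields) < 8:
            continue
        pid, ppid, cmd = fields[1], fields[2], fields[7]
        if ppid == "1" and "nagios -d" in cmd and not cmd.startswith("grep"):
            parents.append(int(pid))
    return parents

# Listing copied from the post above.
sample = """root 19028 14185 0 17:39 pts/0 00:00:00 grep nagios -d
nagios 27644 1 3 16:50 ? 00:01:48 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 27657 27644 0 16:50 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
"""

parents = nagios_parent_pids(sample)
print(parents)  # prints [27644]
if len(parents) > 1:
    print("More than one nagios parent process is running")
```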

abrist · Post by **abrist** » Thu Apr 02, 2015 4:46 pm

If you are seeing multiple nagios parent processes after applying config, you may want to open a ticket by emailing [email protected], as there are a few things that can cause this, and we will most likely want to move to a remote session to resolve it.
