This support forum board is for support questions relating to
Nagios XI, our flagship commercial network monitoring solution.
nseltzer
Posts: 18 Joined: Tue Sep 11, 2012 12:10 pm
Location: Sidney, NE
by nseltzer » Wed Jan 23, 2013 3:33 pm
Those are the hung processes after Nagios was stopped. I did, however, take a look at each of the child servers, and they all appear to be synced via NTP.
Code: Select all
papmoncp00
Wed Jan 23 13:29:41 MST 2013
papmoncp01
Wed Jan 23 13:29:44 MST 2013
papmoncp02
Wed Jan 23 13:29:46 MST 2013
papmoncp03
Wed Jan 23 13:29:47 MST 2013
papmoncp04
Wed Jan 23 13:29:49 MST 2013
papmoncp05
Wed Jan 23 13:29:51 MST 2013
papmoncp06
Wed Jan 23 13:29:53 MST 2013
papmoncp07
Wed Jan 23 13:29:54 MST 2013
scottwilkerson
DevOps Engineer
Posts: 19396 Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises
by scottwilkerson » Wed Jan 23, 2013 3:38 pm
It is strange that these are the hung processes, because one of them is the parent:
Code: Select all
nagios 4888 1 1 Jan22 ? 00:14:26 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
do you get errors when you run
It could be timing out, in which case we may need to help you adjust the init script to allow a longer timeout when stopping Nagios.
by nseltzer » Wed Jan 23, 2013 3:47 pm
We have seen "Warning - nagios did not exit in a timely manner". I compared the init script on the new server to the old server and it appears that they are identical. Please let me know what you would advise.
Thanks!
by scottwilkerson » Wed Jan 23, 2013 5:59 pm
Attached is an init file with line 160 set to allow 30 seconds to shut down (instead of the default 10).
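For readers who cannot see the attachment, the idea of a longer shutdown window can be sketched roughly as follows. This is a minimal illustration, not the stock Nagios init script: the function name `stop_with_timeout` and its arguments are hypothetical, and the real script may check for the daemon differently.

```shell
#!/bin/sh
# Hypothetical helper, NOT the stock Nagios init script.
# stop_with_timeout PID SECONDS
# Sends SIGTERM, then polls once per second until the process is gone
# or SECONDS have elapsed. Returns 0 if it exited, 1 if it is still alive.
stop_with_timeout() {
    pid=$1
    timeout=$2
    # If the signal cannot be delivered (typically because no such
    # process exists), there is nothing to wait for.
    kill -TERM "$pid" 2>/dev/null || return 0
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        # kill -0 delivers no signal; it only checks that the PID exists.
        kill -0 "$pid" 2>/dev/null || return 0
        printf '.'
        sleep 1
        waited=$((waited + 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
        printf '\nWarning - process %s did not exit within %s seconds\n' "$pid" "$timeout"
        return 1
    fi
    return 0
}
```

Calling it with the PID of the parent Nagios process and `30` would mirror the change described above. A longer window only helps if the daemon is slow to exit; it will not help if the process is genuinely stuck.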
mguthrie
Posts: 4380 Joined: Mon Jun 14, 2010 10:21 am
by mguthrie » Wed Jan 23, 2013 6:00 pm
Also, can you verify that your RAM disk has enough space left on it, and that all of the directories on it are owned by nagios:nagios?
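One possible way to run the ownership check is sketched below. The helper name `list_misowned` is hypothetical; it relies only on the standard `find` tests `-type`, `-user`, and `-group`.

```shell
#!/bin/sh
# Hypothetical helper for the ownership check.
# list_misowned DIR USER GROUP
# Prints every directory at or below DIR whose owner is not USER
# or whose group is not GROUP. No output means ownership is consistent.
list_misowned() {
    find "$1" -type d \( ! -user "$2" -o ! -group "$3" \) -print
}
```

Assuming the RAM disk is mounted at `/var/nagiosramdisk` (the mount point shown in the `df -h` output in this thread), `df -h /var/nagiosramdisk` reports the free space and `list_misowned /var/nagiosramdisk nagios nagios` should print nothing if every directory is owned correctly.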
by nseltzer » Thu Jan 24, 2013 10:21 am
Good morning,
Code: Select all
$ df -h
...snip...
tmpfs 1.0G 14M 1011M 2% /var/nagiosramdisk
...snip...
I've changed the Nagios init script to allow 30 seconds instead of 10 when shutting down. I will restart the Nagios services and update the thread accordingly.
by nseltzer » Thu Jan 24, 2013 11:44 am
All,
We're still having issues with forked processes not falling off in a timely manner.
Code: Select all
$ date
Thu Jan 24 09:40:08 MST 2013
Every 1.0s: ps -ef | grep bin/nagios | grep -v grep
nagios 7149 11171 0 09:25 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 11171 1 5 08:38 ? 00:03:18 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 15360 11171 0 09:38 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 15902 11171 0 09:39 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 26750 11171 0 09:05 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
PIDs 26750 and 7149 have been hanging around for a while, and they don't appear to be going anywhere.
I've attached a copy of the current Nagios init file we're using for your information.
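Spotting lingering forks by eye in `ps -ef` output gets tedious. A rough sketch of a filter that lists children of a given parent older than a threshold might look like this. The function name `stale_children` is hypothetical, and the input format assumes a `ps` that supports the `etimes` (elapsed seconds) column, which not every version does.

```shell
#!/bin/sh
# Hypothetical filter for finding long-lived forks.
# stale_children PARENT_PID MIN_AGE_SECONDS
# Reads "PID PPID ELAPSED_SECONDS" lines on stdin and prints the PID and
# age of each child of PARENT_PID that is at least MIN_AGE_SECONDS old.
stale_children() {
    awk -v parent="$1" -v min_age="$2" \
        '$2 == parent && $3 >= min_age { print $1, $3 }'
}
```

For example, `ps -e -o pid= -o ppid= -o etimes= | stale_children 11171 600` would list children of PID 11171 that have been alive for ten minutes or more. The 600-second threshold is an arbitrary choice for illustration.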
kdavison
Posts: 3 Joined: Tue May 08, 2012 10:23 am
by kdavison » Thu Jan 24, 2013 12:54 pm
Newer profile.txt
by nseltzer » Thu Jan 24, 2013 12:55 pm
My boss, kdavison, has posted a profile from the XI interface. This profile is from when Nagios is in a "stalled" state. The process is still running, but the scheduler has stopped processing External Commands and all of the forks are in a frozen state.
Code: Select all
Every 1.0s: ps -ef | grep bin/nagios | grep -v grep Thu Jan 24 10:55:21 2013
nagios 1053 11171 0 10:12 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 7149 11171 0 09:25 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 8273 11171 0 10:25 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 9500 11171 0 10:28 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 11171 1 4 08:38 ? 00:05:53 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 20715 11171 0 09:48 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 26750 11171 0 09:05 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 26909 11171 0 10:00 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
Stopping Nagios resulted in the following:
Code: Select all
$ for i in nagiosxi npcd ndo2db nagios;do sudo /sbin/service $i stop;done
NPCD Stopped.
Stopping ndo2db: done.
Stopping nagios: ..............................
Warning - nagios did not exit in a timely manner
by mguthrie » Thu Jan 24, 2013 5:05 pm
I've only seen something like this one other time, and I'm not sure whether the issue is related, but let's try the following commands:
Code: Select all
service nagios stop
killall -9 nagios
rm -f /usr/local/nagios/var/retention.dat
service nagios start
This will start Nagios with everything in a pending state until results come in, but I'd like to rule out a retention issue as a possibility.