XI web interface goes into a bad state
Posted: Wed Sep 25, 2013 2:07 pm
by uidaho
I apologize for the poor description and lack of details, but we have seen the same thing twice so I wanted to get this posted and get the community's advice.
We run Nagios XI 2012R2.2. The XI Web Interface has displayed the following symptoms:
All hosts and services report last check time = never.
All host statuses are pending, all service statuses are OK.
Actions such as "Schedule an immediate check" fail with an error message like "Unable to process command at this time" (I will get the exact message next time).
Notifications are disabled on every item, although "global notifications" is not disabled.
This is all resolved by stopping and starting the monitoring engine process.
Has anyone seen this?
What data can I collect while this occurs in order to better troubleshoot?
We have a second Nagios XI instance monitoring the primary, but it did not detect when the primary was in this state. Does anyone have ideas on how to build a service that will detect this? I understand this is hard to do with the little info I have provided.
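As a rough sketch of what such a detection service might look like, the function below turns "how old is the newest status update?" into a standard plugin result. The function name and thresholds are hypothetical, and how the last-update timestamp is obtained from the XI backend is an assumption left to the caller; nothing in this thread confirms where that value should come from.

```shell
#!/bin/sh
# Sketch only: a freshness check in the usual Nagios plugin style.
# Assumption: the caller already knows the epoch time of the most recent
# status update visible to the XI interface (source not shown here).

# check_freshness LAST_UPDATE_EPOCH NOW_EPOCH WARN_SECONDS CRIT_SECONDS
# Prints one status line and returns the plugin exit code:
# 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
check_freshness() {
    # Reject missing or non-numeric arguments instead of guessing.
    for v in "$1" "$2" "$3" "$4"; do
        case "$v" in
            ''|*[!0-9]*) echo "UNKNOWN - invalid arguments"; return 3 ;;
        esac
    done

    age=$(( $2 - $1 ))
    if [ "$age" -ge "$4" ]; then
        echo "CRITICAL - status data is ${age}s old"; return 2
    elif [ "$age" -ge "$3" ]; then
        echo "WARNING - status data is ${age}s old"; return 1
    fi
    echo "OK - status data is ${age}s old"; return 0
}

# Example call (values are illustrative):
#   check_freshness "$last_update" "$(date +%s)" 300 900
```

Run from the second XI instance, a check like this would only catch the bad state if the timestamp it reads stops advancing while the web interface is stuck, so that part needs verifying against a real occurrence.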
Thank you
David Summers
Re: XI web interface goes into a bad state
Posted: Wed Sep 25, 2013 2:41 pm
by abrist
You may have multiple parent nagios processes. You could try stopping nagios and then checking for any other nagios processes. If you find some, then this was your issue:
Code:
service nagios stop
ps -aef | grep nagios.cfg
If you find any extra nagios processes, kill them:
Code:
killall -9 nagios
service nagios start
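The check for leftover processes can be made a little less error-prone than eyeballing grep output. The helper below is a minimal sketch with a hypothetical name, and it assumes the daemon binary is called nagios and is started with a nagios.cfg argument. It reads ps output and prints only the PIDs of Nagios daemons, skipping the grep process itself and ndo2db.

```shell
#!/bin/sh
# Sketch only: list PIDs of Nagios daemon processes from ps output.
# Expects input in the format produced by: ps -eo pid,ppid,args
# Assumption: the daemon's executable is named "nagios" and its command
# line references nagios.cfg.
leftover_nagios_pids() {
    awk '$3 ~ /(^|\/)nagios$/ && /nagios\.cfg/ { print $1 }'
}

# Example, after "service nagios stop":
#   ps -eo pid,ppid,args | leftover_nagios_pids
# Any PID printed at that point is a process the init script did not stop.
```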
Re: XI web interface goes into a bad state
Posted: Mon Oct 14, 2013 3:44 pm
by uidaho
We just had the same problem recur. I rebooted the Nagios server, and it came back up in the same state described above. Nagios Core appeared to be running fine: nagios.log was updating, and when I logged into the Nagios Core interface, all servers/services appeared to have an accurate state.
However, the Nagios XI interface was reporting empty host groups, no status updates on any checks, etc, etc. I then restarted the "monitoring process" through the GUI. It did fix the Nagios XI interface.
We have seen this state several times in the last two months. Nagios Core is functional, but the XI GUI is hung up.
What can we do in addition to restarting "Monitoring Process" every time this happens? Can we work with you to determine the cause of this bug?
Re: XI web interface goes into a bad state
Posted: Tue Oct 15, 2013 3:00 am
by scottwilkerson
If I had to guess, it appears that Nagios is starting before the ndo2db process is connected.
Out of curiosity, do you have an offloaded MySQL database, or a delayed start of ndo2db or mysqld?
Re: XI web interface goes into a bad state
Posted: Fri Oct 18, 2013 10:37 am
by uidaho
We've seen this happen in two different scenarios:
server rebooted
server was up, but a large number of hosts went offline
We have not offloaded the mysql database, nor changed any of the startup script options. We used the Nagios XI installation script to add the various things to the startup directories.
lrwxrwxrwx 1 root root 16 Apr 12 2013 /etc/rc3.d/S99nagios -> ../init.d/nagios
lrwxrwxrwx 1 root root 18 Apr 12 2013 /etc/rc3.d/S99nagiosxi -> ../init.d/nagiosxi
lrwxrwxrwx 1 root root 16 Apr 12 2013 /etc/rc3.d/S99ndo2db -> ../init.d/ndo2db
Should we start ndo2db earlier? Say S98?
When this occurs again, should we check if ndo2db recently restarted? Are there some good log files to check?
Re: XI web interface goes into a bad state
Posted: Fri Oct 18, 2013 1:26 pm
by slansing
If/when this occurs again, grab the entire current system log "/var/log/messages" file and you should see a message about ndo2db not correctly syncing to the database. It may be as Scott mentioned that ndo is not syncing in a timely manner and nagios gets started before it. If this occurs, run a:
You "should" then see a new message in the system log showing that ndo is synced and your hosts/services should be performing checks properly. Let us know if any of the above happens.
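To pull the relevant entries out of the system log when this happens, a small filter along these lines may help. It is a sketch with a hypothetical function name; it simply keeps lines that mention ndo2db or ndomod and makes no assumption about the exact wording of the sync message.

```shell
#!/bin/sh
# Sketch only: keep syslog lines that mention the NDO components.
# Reads log text on stdin and writes matching lines to stdout.
ndo_log_lines() {
    grep -E 'ndo2db|ndomod'
}

# Example (path taken from the post above; reading it usually needs root):
#   ndo_log_lines < /var/log/messages | tail -n 50
```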
Re: XI web interface goes into a bad state
Posted: Thu Oct 31, 2013 7:29 pm
by uidaho
We had another instance (I did not witness, so it is possible it is a different but similar issue) of the XI strangeness today. I was not present to try restarting the ndo service, but here are some entries from the messages log:
Code:
Oct 31 14:39:57 monitor01 nagios: Caught SIGTERM, shutting down...
Oct 31 14:39:57 monitor01 nagios: Successfully shutdown... (PID=17559)
Oct 31 14:39:57 monitor01 nagios: ndomod: Shutdown complete.
Oct 31 14:39:57 monitor01 nagios: Event broker module '/usr/local/nagios/bin/ndomod.o' deinitialized successfully.
Oct 31 14:40:00 monitor01 nagios: Nagios 3.5.0 starting... (PID=30796)
Oct 31 14:40:00 monitor01 nagios: Local time is Thu Oct 31 14:40:00 PDT 2013
Oct 31 14:40:00 monitor01 nagios: LOG VERSION: 2.0
Oct 31 14:40:00 monitor01 nagios: ndomod: NDOMOD 1.5.2 (06-08-2012) Copyright (c) 2009 Nagios Core Development Team and Community Contributors
Oct 31 14:40:00 monitor01 nagios: ndomod: Successfully connected to data sink. 0 queued items to flush.
Oct 31 14:40:00 monitor01 nagios: Event broker module '/usr/local/nagios/bin/ndomod.o' initialized successfully.
Oct 31 14:40:02 monitor01 nagios: Finished daemonizing... (New PID=661)
Oct 31 14:40:02 monitor01 nagios: Error: Could not create external command file '/usr/local/nagios/var/rw/nagios.cmd' as named pipe: (17) -> File exists. If this file already exists and you are sure that another copy of Nagios is not running, you should delete this file.
Oct 31 14:40:02 monitor01 nagios: Bailing out due to errors encountered while trying to initialize the external command file... (PID=661)
Oct 31 14:40:02 monitor01 nagios: ndomod: Shutdown complete.
Oct 31 14:40:02 monitor01 nagios: Event broker module '/usr/local/nagios/bin/ndomod.o' deinitialized successfully.
Oct 31 14:41:55 monitor01 nagios: Nagios 3.5.0 starting... (PID=27184)
Oct 31 14:41:55 monitor01 nagios: Local time is Thu Oct 31 14:41:55 PDT 2013
Oct 31 14:41:55 monitor01 nagios: LOG VERSION: 2.0
Oct 31 14:41:55 monitor01 nagios: ndomod: NDOMOD 1.5.2 (06-08-2012) Copyright (c) 2009 Nagios Core Development Team and Community Contributors
Oct 31 14:41:55 monitor01 nagios: ndomod: Successfully connected to data sink. 0 queued items to flush.
Oct 31 14:41:55 monitor01 nagios: Event broker module '/usr/local/nagios/bin/ndomod.o' initialized successfully.
Oct 31 14:41:56 monitor01 nagios: Finished daemonizing... (New PID=27189)
Oct 31 14:41:57 monitor01 nagios: SERVICE DOWNTIME ALERT: netman-collect;Disk Usage - /usr/local 20 prcnt free ux;STARTED; Service has entered a period of scheduled downtime
Oct 31 14:41:57 monitor01 nagios: SERVICE DOWNTIME ALERT: sanman01;Svc - USB-Lib1 Drive Mounted on SanMan01;STARTED; Service has entered a period of scheduled downtime
Oct 31 14:41:57 monitor01 nagios: SERVICE DOWNTIME ALERT: sanman01;Svc - USB-Noc0 Drive Mounted on SanMan01;STARTED; Service has entered a period of scheduled downtime
Most of the entries match those of other times we apply a change in CCM, which restarts nagios. What is different this time is that the "ndomod: Shutdown complete" entry is later than normal, and preceded by the "Bailing" entry.
Does this lend credence to Scott's assessment? I think it does. We do not have an offloaded MySQL instance. I do not think it has a delayed start, but in this case the server was not rebooting. What can we do to ensure the sequence of events is correct when applying changes in CCM?
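Since the log above shows the second start failing because nagios.cmd already existed, one thing that could be checked before each start is whether the command file is present while no Nagios process is running. The sketch below uses a hypothetical function name and leaves the "is Nagios running?" decision to the caller; it only classifies the situation and deliberately does not delete anything.

```shell
#!/bin/sh
# Sketch only: classify the external command file before starting Nagios.
# pipe_state PIPE_PATH NAGIOS_RUNNING
#   NAGIOS_RUNNING is "yes" or "no", as determined by the caller.
# Prints "absent", "in-use", or "stale".
pipe_state() {
    if [ ! -e "$1" ]; then
        echo absent      # nothing in the way of a clean start
    elif [ "$2" = yes ]; then
        echo in-use      # a running daemon presumably owns the file
    else
        echo stale       # file exists but no daemon is running
    fi
}

# Example (pgrep usage is an assumption about the local system):
#   if pgrep -x nagios >/dev/null; then running=yes; else running=no; fi
#   pipe_state /usr/local/nagios/var/rw/nagios.cmd "$running"
```

A "stale" result corresponds to the case the error message itself describes, where the file can be removed once you are sure no other copy of Nagios is running.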
Thank you
Re: XI web interface goes into a bad state
Posted: Fri Nov 01, 2013 12:03 pm
by slansing
Well it seems like it does not take much time for nagios to initialize. Can you:
And show the output of:
Code:
ll -la /usr/local/nagios/var/rw/nagios.cmd
Re: XI web interface goes into a bad state
Posted: Fri Mar 14, 2014 11:48 am
by uidaho
Sorry for leaving this unattended for so long, but we have the time to work on this now.
The symptoms in this thread are exactly what we are seeing:
http://support.nagios.com/forum/viewtop ... 16&t=11719
Here is the output of me running the commands requested in the prior post (run as root):
Code:
# service nagios stop
Stopping nagios: .done.
# killall -9 nagios
nagios: no process killed
# ll -la /usr/local/nagios/var/rw/nagios.cmd
ls: cannot access /usr/local/nagios/var/rw/nagios.cmd: No such file or directory
#
# ll -la /usr/local/nagios/var/rw/nagios.cmd
ls: cannot access /usr/local/nagios/var/rw/nagios.cmd: No such file or directory
# service nagios start
Starting nagios: done.
If this is not related, I apologize, but here are some event logs we see every time we reboot the nagios server:
Code:
Information          2014-03-14 09:09:28  Event broker module '/usr/local/nagios/bin/ndomod.o' initialized successfully.
Process Information  2014-03-14 09:08:48  Successfully shutdown... (PID=2626)
Information          2014-03-14 09:08:48  ndomod: Please check remote ndo2db log, database connection or SSL Parameters
Information          2014-03-14 09:08:48  ndomod: Error writing to data sink! Some output may get lost...
What can we do next to troubleshoot this issue?
Thank you!
Re: XI web interface goes into a bad state
Posted: Fri Mar 14, 2014 4:08 pm
by slansing
Okay, so your hosts/services are all getting greyed out and showing as pending after a restart of nagios? Do they eventually go back to their states once checks are run, or are checks not even being scheduled when you look at their details pages? Do you recall disabling state retention options on your hosts/services or in nagios.cfg? In addition to answering those questions, can you edit:
And change:
To
And:
To:
Then:
If you still notice those ndo2db errors, attach a copy of:
Code:
/usr/local/nagios/var/ndo2db.debug