URGENT - Weird issues after error in config

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
WillemDH
Posts: 2320
Joined: Wed Mar 20, 2013 5:49 am
Location: Ghent

Post by WillemDH »

Hello,

We have been experiencing very weird issues with our Nagios XI production server today after a colleague made an error in the config of a new service object. He accidentally assigned a hostgroup to a service. This hostgroup contained a lot of hosts (roughly 100), resulting in an email storm, as the other hosts in the hostgroup were not able to execute this check successfully.

After removing the hostgroup from the service, we expected the issues to end, but we are still receiving emails caused by this config error. I have tried numerous things. We cannot find any errors in the configuration file of this service. The service is also no longer found on any of the impacted hosts, and the notifications are not visible on the Notification overview page.

Please advise how to fix this issue. It seems this service has become a ghost service on all impacted hosts in the hostgroup. We even tried removing the contacts from the contactgroup to which the emails are sent, but we are still receiving emails for all these hosts.

Grtz

Willem
Nagios XI 5.8.1
https://outsideit.net
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: URGENT - Weird issues after error in config

Post by tmcdonald »

How long has it been? It is possible there was some rate-limiting somewhere in the chain from your Nagios machine to your inbox, and messages are being queued up and released slowly.

Have you tried doing a nagios restart from the command line?

Do you see the service still attached to the hosts in the status.dat or objects.cache files?
Former Nagios employee
WillemDH

Re: URGENT - Weird issues after error in config

Post by WillemDH »

It started several hours ago. I don't think this has anything to do with rate limiting. I tried restarting Nagios from the command line. I'm not able to revert to a snapshot from before the issues, as too many changes have been made since.
I don't find a status.dat file. Where is it supposed to be located?

How can I cat the objects.cache for all services of a host?

We looked around in objects.cache but didn't find any services.
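
For anyone trying the same thing, here is a minimal sketch of listing every service that objects.cache defines for one host. It assumes the usual `define service { ... }` block layout with one `host_name` and one `service_description` line per block. The function name `list_services_for_host` and the cache path in the usage comment are illustrative assumptions, so point it at wherever your objects.cache actually lives (on this system, the ramdisk).

```shell
# Print the service_description of every service block in an
# objects.cache file whose host_name matches the given host.
# Usage (path is an assumption, adjust to your install):
#   list_services_for_host onderwijsvpn-30 /usr/local/nagios/var/objects.cache
list_services_for_host() {
    host="$1"
    cache="$2"
    awk -v host="$host" '
        # A new service block starts: reset the collected values.
        /^define service[ \t]*\{/ { in_svc = 1; h = ""; d = ""; next }
        in_svc && /^[ \t]*host_name[ \t]/ {
            sub(/^[ \t]*host_name[ \t]+/, ""); h = $0; next
        }
        in_svc && /^[ \t]*service_description[ \t]/ {
            sub(/^[ \t]*service_description[ \t]+/, ""); d = $0; next
        }
        # End of block: print the description if the host matches.
        in_svc && /^[ \t]*\}/ {
            if (h == host) print d
            in_svc = 0
        }
    ' "$cache"
}
```

If the ghost service is not in the output for an affected host, the running configuration most likely no longer contains it.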

EDIT 1: Found status.dat; it was on the ramdisk.

EDIT 2: In the objects.cache on the ramdisk we find the correct service on the correct host. status.dat also seems correct.

EDIT 3: It might be worth noting that all the hosts where the service was wrongfully added run their checks through Mod-Gearman on a worker node. We already tried rebooting the Nagios server and the Gearman worker node.
Last edited by WillemDH on Tue Oct 27, 2015 9:51 am, edited 2 times in total.
tmcdonald

Re: URGENT - Weird issues after error in config

Post by tmcdonald »

Do you have multiple nagios processes running? I usually run ps -ef | grep bin/nagios to check. Typical output will be something like this:

Code:

root@localhost: /usr/local/nagios/var
$ ps -ef | grep bin/nagios
root      8622 19456  0 09:48 pts/0    00:00:00 grep bin/nagios
nagios   19150     1  0 Oct23 ?        00:25:54 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   19153 19150  0 Oct23 ?        00:00:13 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   19154 19150  0 Oct23 ?        00:00:13 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   19155 19150  0 Oct23 ?        00:00:13 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   19156 19150  0 Oct23 ?        00:00:12 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   19167 19150  0 Oct23 ?        00:03:41 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root@localhost: /usr/local/nagios/var
$
That is one main nagios process, a few workers, and a subordinate child process. If you have two main processes, you probably want to kill them all and restart, making sure just one is running.
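
A rough way to automate that check is to count the `nagios -d` processes whose parent PID is 1, which is how the main daemon appears in the listing above (the `-d` child has the main daemon as its parent). The helper name `count_main_nagios` is made up for illustration, and the PPID-of-1 test is an assumption that may not hold under every init system or inside containers.

```shell
# Read `ps -ef` output on stdin and print how many main Nagios
# daemons it contains: lines running "bin/nagios -d" with PPID 1.
# Usage: ps -ef | count_main_nagios
count_main_nagios() {
    awk '$3 == 1 && /bin\/nagios -d / { n++ } END { print n + 0 }'
}
```

Anything other than 1 suggests a missing, stale, or duplicate daemon.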
WillemDH

Re: URGENT - Weird issues after error in config

Post by WillemDH »

As we have already had issues with multiple Nagios instances running, we have a check that tests for this. So we shouldn't have multiple Nagios instances running.

Code:

 ps -ef | grep bin/nagios
root     16282  3106  0 15:52 pts/1    00:00:00 grep bin/nagios
nagios   40015     1  5 15:31 ?        00:01:06 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   40017 40015  0 15:31 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   40018 40015  0 15:31 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   40019 40015  0 15:31 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   40020 40015  0 15:31 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   40021 40015  0 15:31 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   40022 40015  0 15:31 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   40023 40015  0 15:31 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   40024 40015  0 15:31 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   40025 40015  0 15:31 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   40089 40015  0 15:31 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
The emails we are receiving have a link to the service:

Code:

***** Nagios XI Alert *****

Nagios has detected a problem with this service.

Notification Type: PROBLEM

Service: Traceroute Active iFiber Router - Hop 4
Host: onderwijsvpn-30
Address: onderwijsvpn-30
State: UNKNOWN
Info:
ERROR in traceroute command.
Date/Time: 27/10/2015 14:48:38

Respond: http://nagiosserver/nagiosxi/?&xiwindow=http%3A%2F%2Fnagiosserver%2Fnagiosxi%2Fincludes%2Fcomponents%2Fxicore%2Fstatus.php%3Fshow%3Dservicedetail%26host%3Donderwijsvpn-30%26service%3DTraceroute%2BActive%2BiFiber%2BRouter%2B-%2BHop%2B4
When we click the link, we are taken to this page (check screenshot). This service does not exist in the CCM and cannot be found in objects.cache or in status.dat.

When we then go to configure and try to delete this service, we are redirected to the CCM main page.
Last edited by WillemDH on Tue Oct 27, 2015 9:59 am, edited 1 time in total.
tmcdonald

Re: URGENT - Weird issues after error in config

Post by tmcdonald »

Does this service show up in the Core interface as well?
SteveBeauchemin
Posts: 524
Joined: Mon Oct 14, 2013 7:19 pm

Re: URGENT - Weird issues after error in config

Post by SteveBeauchemin »

Are the emails timestamped for right now? Are they new? Or are the emails queued up and still coming out because there is a backlog?

On my server, I use postfix. Maybe check the queue? I used this resource to get syntax. http://www.cyberciti.biz/tips/howto-pos ... queue.html
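
To make that concrete, a small sketch for Postfix: `postqueue -p` (or `mailq`) lists the queue and ends with a summary line such as `-- 23 Kbytes in 5 Requests.`, or prints `Mail queue is empty`. The helper `queued_mail_count` is a hypothetical name; it only parses that summary line, and the format is assumed to match a stock Postfix install.

```shell
# Read `postqueue -p` output on stdin and print the number of
# queued messages, taken from the final summary line.
# Usage: postqueue -p | queued_mail_count
queued_mail_count() {
    awk '
        { last = $0 }
        END {
            # Summary looks like: "-- 23 Kbytes in 5 Requests."
            if (last ~ /^-- .* in [0-9]+ Requests?\.$/) {
                n = split(last, f, " ")
                print f[n - 1]
            } else {
                # Covers "Mail queue is empty" and empty input.
                print 0
            }
        }
    '
}
```

If the number keeps dropping while no new alerts are generated, the emails are probably a backlog draining.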

Just a thought.

Steve B
XI 5.7.3 / Core 4.4.6 / NagVis 1.9.8 / LiveStatus 1.5.0p11 / RRDCached 1.7.0 / Redis 3.2.8 /
SNMPTT / Gearman 0.33-7 / Mod_Gearman 3.0.7 / NLS 2.0.8 / NNA 2.3.1 /
NSClient 0.5.0 / NRPE Solaris 3.2.1 Linux 3.2.1 HPUX 3.2.1
WillemDH

Re: URGENT - Weird issues after error in config

Post by WillemDH »

They are not visible in the Core interface. The timestamps in the emails we receive are 'real-time', which makes it seem like the checks are still being executed.

Code:

Service: Traceroute Active iFiber Router - Hop 4
Host: onderwijsvpn-31
Address: onderwijsvpn-31
State: UNKNOWN
Info:
ERROR in traceroute command.
Date/Time: 27/10/2015 16:00:37
Current time here: 16:01

The Nagios host that is supposed to have this problematic service is also still sending emails, even though the traceroute command error has been solved in the meantime (the nagios user needed sudo for traceroute). So this service is showing healthy in XI, but we are still getting notifications for it.

We ran

Code:

while true
do
    pgrep check_traceroute
    sleep 0.1
done
on the Nagios server as well as on the mrtg worker node and nothing came up. So we suspect that no checks are actively running. The issue is the recurring notifications for all the host objects which temporarily had these two services: 'Traceroute Active iFiber Router - Hop 4' and 'Traceroute Active Backbone Router - Hop 1'.
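
One way to confirm where the notifications originate is to look for them in the Nagios Core log. The sketch below assumes the standard log line format `[timestamp] SERVICE NOTIFICATION: contact;host;service;state;command;output` and counts matching lines per host. The function name and the log path in the usage comment are assumptions for illustration.

```shell
# Count SERVICE NOTIFICATION log entries per host for one service.
# Usage (log path is an assumption, adjust to your install):
#   notifications_for_service 'Traceroute Active iFiber Router - Hop 4' \
#       /usr/local/nagios/var/nagios.log
notifications_for_service() {
    svc="$1"
    log="$2"
    awk -F';' -v svc="$svc" '
        # Field 2 is the host, field 3 the service description.
        /SERVICE NOTIFICATION:/ && $3 == svc { count[$2]++ }
        END { for (h in count) print count[h], h }
    ' "$log" | sort -k2
}
```

If nothing shows up there while emails keep arriving, the notifications are probably not coming from the running Core process.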
Last edited by WillemDH on Tue Oct 27, 2015 10:13 am, edited 1 time in total.
SteveBeauchemin

Re: URGENT - Weird issues after error in config

Post by SteveBeauchemin »

If you rename the test binary or remove it from libexec, does it stop?

If the system cannot see the executable, it should log different data, which may help track this down.

I have been flooded by tests before and ended up taking the code away so they could not run.

That will show whether it is still actively trying to run.
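
A minimal sketch of the rename, so it is easy to undo afterwards. The helper name `disable_plugin` is hypothetical and the plugin path in the usage comment is an assumption (the `check_traceroute` name is taken from the pgrep loop earlier in the thread). With Mod-Gearman, the plugin would presumably need to be renamed on the worker node as well.

```shell
# Rename a plugin so it can no longer be executed under its
# original name. Undo with: mv "$plugin.disabled" "$plugin"
# Usage (plugin path is an assumption):
#   disable_plugin /usr/local/nagios/libexec/check_traceroute
disable_plugin() {
    plugin="$1"
    [ -f "$plugin" ] || { echo "not found: $plugin" >&2; return 1; }
    mv -- "$plugin" "$plugin.disabled"
}
```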

Steve B
tmcdonald

Re: URGENT - Weird issues after error in config

Post by tmcdonald »

Humor me here, but for about 5 minutes stop the following (make sure everyone is logged out first):

service postgresql stop
service crond stop


and see if the emails stop. Then after 5 minutes start them back up again:

service crond start
service postgresql start