URGENT - Weird issues after error in config

Post by **WillemDH** » Tue Oct 27, 2015 9:22 am

Hello,

We have been experiencing very weird issues with our Nagios XI production server today after a colleague made an error in the config of a new service object. He accidentally assigned a hostgroup to a service. This hostgroup contained a lot of hosts (about 100), resulting in an email storm, as the other hosts in the hostgroup were not able to execute this check successfully.

After removing the hostgroup from the service, we expected the issues to end, but in fact we are still receiving emails caused by this config error. I have tried numerous things. We cannot find any errors in the configuration file of this service. The service is also no longer found on any of the impacted hosts. The notifications are also not visible in the Notification overview page.

Please advise how to fix this issue. It seems this service has become a ghost service on all impacted hosts in the hostgroup. We even tried removing the contacts from the contactgroup to which the emails are sent, but we are still receiving emails for all these hosts.

Grtz

Willem

Post by **tmcdonald** » Tue Oct 27, 2015 9:31 am

How long has it been? It is possible there was some rate-limiting somewhere in the chain from your Nagios machine to your inbox, and messages are being queued up and released slowly.

Have you tried doing a nagios restart from the command line?

Do you see the service still attached to the hosts in the status.dat or objects.cache files?
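One way to check this is to list every service that objects.cache defines for a single host. The following is a minimal sketch, not an official Nagios tool: it assumes the usual `define service {` block layout with `host_name` and `service_description` directives, and the function name `services_for_host` is made up for illustration.

```shell
# List the service_description of every service block that belongs to one host.
# $1 = path to an objects.cache file, $2 = host name (both supplied by the caller).
services_for_host() {
    cache_file=$1
    host=$2
    awk -v host="$host" '
        # A new service definition starts: reset the per-block state.
        /^define service[ \t]*[{]/ { in_svc = 1; h = ""; d = ""; next }
        in_svc && $1 == "host_name" { h = $2 }
        in_svc && $1 == "service_description" {
            # Keep the whole description, which may contain spaces.
            d = $0
            sub(/^[ \t]*service_description[ \t]+/, "", d)
        }
        # End of the block: print the description if the host matches.
        in_svc && /^[ \t]*[}]/ {
            if (h == host) print d
            in_svc = 0
        }
    ' "$cache_file"
}
```

Usage would be something like `services_for_host /path/to/objects.cache onderwijsvpn-30`, where the path is whatever location your installation uses for the file.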

Post by **WillemDH** » Tue Oct 27, 2015 9:42 am

It started several hours ago. I don't think this has anything to do with rate limiting. I tried restarting nagios from the command line. I'm not able to revert to a snapshot from before the issues, as too many changes have been made since.
I can't find a status.dat file. Where is it supposed to be located?

How can I cat the objects.cache for all services of a host?

We looked around in objects.cache but didn't find any services.

EDIT 1: Found the status.dat; it was on the ramdisk.

EDIT 2: In the objects.cache on the ramdisk we find the correct service on the correct host. Status.dat also seems correct.

EDIT 3: It might be worth noting that all the hosts where the service was wrongly added run their checks through Mod-Gearman on a worker node. We already tried rebooting the Nagios server and the gearman worker node.

Post by **tmcdonald** » Tue Oct 27, 2015 9:49 am

Do you have multiple nagios processes running? I usually run ps -ef | grep bin/nagios to check. Typical output will be something like this:

Code:

root@localhost: /usr/local/nagios/var
$ ps -ef | grep bin/nagios
root      8622 19456  0 09:48 pts/0    00:00:00 grep bin/nagios
nagios   19150     1  0 Oct23 ?        00:25:54 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   19153 19150  0 Oct23 ?        00:00:13 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   19154 19150  0 Oct23 ?        00:00:13 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   19155 19150  0 Oct23 ?        00:00:13 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   19156 19150  0 Oct23 ?        00:00:12 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   19167 19150  0 Oct23 ?        00:03:41 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root@localhost: /usr/local/nagios/var
$

That is one main nagios process, a few workers, and a subordinate child process. If you have two main processes, you probably want to kill them all and restart, making sure just one is running.
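If you want to script that check, here is a rough sketch that counts the main daemons in `ps -ef` output. It assumes the main process is the one started with `-d` whose parent PID (third column) is 1, as in the listing above. The helper name `count_main_nagios` is hypothetical.

```shell
# Read `ps -ef` style output on stdin and print how many main nagios daemons it shows.
# Column 3 is the parent PID; the main daemon is assumed to have PPID 1 and a "-d" flag.
count_main_nagios() {
    awk '$3 == 1 && /bin\/nagios -d/ { n++ } END { print n + 0 }'
}
```

Run it as `ps -ef | count_main_nagios`. Anything other than 1 suggests either no daemon or more than one main process.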

Post by **WillemDH** » Tue Oct 27, 2015 9:52 am

As we have already had issues with multiple nagios instances running, we have a check that tests for this. So we shouldn't have multiple nagios instances running.

Code:

 ps -ef | grep bin/nagios
root     16282  3106  0 15:52 pts/1    00:00:00 grep bin/nagios
nagios   40015     1  5 15:31 ?        00:01:06 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   40017 40015  0 15:31 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   40018 40015  0 15:31 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   40019 40015  0 15:31 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   40020 40015  0 15:31 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   40021 40015  0 15:31 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   40022 40015  0 15:31 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   40023 40015  0 15:31 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   40024 40015  0 15:31 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   40025 40015  0 15:31 ?        00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   40089 40015  0 15:31 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

The emails we are receiving have a link to the service:

Code:

***** Nagios XI Alert *****

Nagios has detected a problem with this service.

Notification Type: PROBLEM

Service: Traceroute Active iFiber Router - Hop 4
Host: onderwijsvpn-30
Address: onderwijsvpn-30
State: UNKNOWN
Info:
ERROR in traceroute command.
Date/Time: 27/10/2015 14:48:38

Respond: http://nagiosserver/nagiosxi/?&xiwindow=http%3A%2F%2Fnagiosserver%2Fnagiosxi%2Fincludes%2Fcomponents%2Fxicore%2Fstatus.php%3Fshow%3Dservicedetail%26host%3Donderwijsvpn-30%26service%3DTraceroute%2BActive%2BiFiber%2BRouter%2B-%2BHop%2B4

When we click on the link we are taken to this page (check screenshot). This service does not exist in the CCM and cannot be found in objects.cache or in status.dat.

When we then go to Configure and try to delete this service, we are taken to the CCM main page.

Post by **tmcdonald** » Tue Oct 27, 2015 9:58 am

Does this service show up in the Core interface as well?

Post by **SteveBeauchemin** » Tue Oct 27, 2015 9:59 am

Are the emails time stamped for right now? Are they new? Or were the emails queued up, and they keep coming out because there is a backlog?

On my server, I use postfix. Maybe check the queue? I used this resource to get syntax. http://www.cyberciti.biz/tips/howto-pos ... queue.html
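As a rough sketch of the queue check on a postfix system: `postqueue -p` lists the queue and ends with a summary line such as `-- 9 Kbytes in 3 Requests.`, or prints `Mail queue is empty`. The helper below just pulls the count out of that summary. The name `queued_count` is made up, and other mail servers will need a different approach.

```shell
# Read `postqueue -p` output on stdin and print the number of queued messages.
# Prints 0 when there is no summary line (for example "Mail queue is empty").
queued_count() {
    awk '/^-- .* in [0-9]+ Requests?\.$/ { n = $(NF-1) } END { print n + 0 }'
}
```

Run it as `postqueue -p | queued_count`. If the number is large and shrinking slowly, the emails are probably old ones draining from the backlog rather than new notifications.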

Just a thought.

Steve B

Post by **WillemDH** » Tue Oct 27, 2015 10:01 am

They are not visible in the Core interface. The timestamps in the emails we receive are 'real-time', which makes it seem like the checks are still being executed.

Code:

Service: Traceroute Active iFiber Router - Hop 4
Host: onderwijsvpn-31
Address: onderwijsvpn-31
State: UNKNOWN
Info:
ERROR in traceroute command.
Date/Time: 27/10/2015 16:00:37

Current time here : 16:01

The Nagios host that is supposed to have this service is also still sending emails, even though the traceroute command error has been fixed in the meantime (the nagios user needed sudo for traceroute). So this service shows as healthy in XI, but we are still getting notifications for it.

We did a

Code:

# Poll ten times per second for a running check_traceroute process
while true
do
    pgrep check_traceroute
    sleep 0.1
done

on the Nagios server as well as on the mrtg worker node, and nothing came up. So we suspect that no checks are actively running. The issue is the recurring notifications for all the host objects which temporarily had these two services: 'Traceroute Active iFiber Router - Hop 4' and 'Traceroute Active Backbone Router - Hop 1'.

Post by **SteveBeauchemin** » Tue Oct 27, 2015 10:12 am

If you rename the test binary or remove it from libexec, does it stop?

If the system cannot see the executable, it should log different data. That may help track it down.

I have been flooded by tests before and ended up taking the code away so they could not run.

That will prove that it is still actively trying to run.

Steve B

Post by **tmcdonald** » Tue Oct 27, 2015 10:18 am

Humor me here, but for about 5 minutes stop the following (make sure everyone is logged out first):

service postgresql stop
service crond stop

and see if the emails stop. Then after 5 minutes start them back up again:

service crond start
service postgresql start
