URGENT - Weird issues after error in config
Hello,
We have been experiencing very weird issues with our Nagios XI production server today after a colleague made an error in the config of a new service object. He accidentally assigned a hostgroup to a service. This hostgroup contains a lot of hosts (about 100), which resulted in an email storm, as the other hosts in the hostgroup were not able to execute this check successfully.
After removing the hostgroup from the service, we expected the issues to end, but in fact we are still receiving emails caused by this config error. I have tried numerous things. We cannot find any errors in the configuration file of this service. The service is also not found on any of the impacted hosts. The notifications are also not visible on the Notification overview page.
Please advise how to fix this issue. It seems this service has become a ghost service on all impacted hosts in the hostgroup. We even tried removing the contacts from the contactgroup to which the emails are sent, but we are still receiving emails for all these hosts.
Grtz
Willem
Nagios XI 5.8.1
https://outsideit.net
Re: URGENT - Weird issues after error in config
How long has it been? It is possible there was some rate-limiting somewhere in the chain from your Nagios machine to your inbox, and messages are being queued up and released slowly.
Have you tried doing a nagios restart from the command line?
Do you see the service still attached to the hosts in the status.dat or objects.cache files?
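In case it helps, here is a rough sketch for listing every service attached to one host in objects.cache. It assumes the usual layout of `define service {` blocks with `host_name` and `service_description` lines inside; the function name `list_services` is just made up for this example, and the file path is whatever your install uses (it may be on a ramdisk).

```shell
#!/bin/sh
# list_services FILE HOST (hypothetical helper)
# Prints the service_description of every "define service" block in FILE
# whose host_name equals HOST. Assumes the objects.cache block layout.
list_services() {
    awk -v host="$2" '
        /^define service[ \t]*[{]/ { in_svc = 1; h = ""; d = ""; next }
        in_svc && $1 == "host_name" { h = $2 }
        in_svc && $1 == "service_description" {
            # keep the full description, including embedded spaces
            sub(/^[ \t]*service_description[ \t]+/, "")
            d = $0
        }
        in_svc && /^[ \t]*[}]/ { if (h == host) print d; in_svc = 0 }
    ' "$1"
}

# Example call (path is an assumption, adjust to your system):
# list_services /usr/local/nagios/var/objects.cache onderwijsvpn-30
```

If the ghost service shows up here for the affected hosts, the running config still contains it; if not, the notifications are probably coming from somewhere else.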
Former Nagios employee
Re: URGENT - Weird issues after error in config
It started several hours ago. I don't think this has anything to do with rate limiting. I tried restarting Nagios from the command line. I'm not able to revert to a snapshot from before the issues, as too many changes have been made.
I can't find a status.dat file. Where is it supposed to be located?
How can I cat the objects.cache for all services of a host?
We looked around in objects.cache but didn't find any services.
EDIT 1: Found the status.dat => it was on the ramdisk...
EDIT 2: In the objects.cache on the ramdisk we find the correct service on the correct host. Status.dat also seems correct.
EDIT 3: It might be worth noting that all the hosts where the service was wrongly added are running their checks through Mod_Gearman on a worker node. We already tried rebooting the Nagios server and the Gearman worker node.
Last edited by WillemDH on Tue Oct 27, 2015 9:51 am, edited 2 times in total.
Nagios XI 5.8.1
https://outsideit.net
Re: URGENT - Weird issues after error in config
Do you have multiple nagios processes running? I usually run ps -ef | grep bin/nagios to check. Typical output will be something like this:
Code: Select all
root@localhost: /usr/local/nagios/var
$ ps -ef | grep bin/nagios
root 8622 19456 0 09:48 pts/0 00:00:00 grep bin/nagios
nagios 19150 1 0 Oct23 ? 00:25:54 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 19153 19150 0 Oct23 ? 00:00:13 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 19154 19150 0 Oct23 ? 00:00:13 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 19155 19150 0 Oct23 ? 00:00:13 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 19156 19150 0 Oct23 ? 00:00:12 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 19167 19150 0 Oct23 ? 00:03:41 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root@localhost: /usr/local/nagios/var
$
With one main nagios process, a few workers, and a subordinate child process. If you have two main processes you probably want to kill them all and restart, making sure just one is running.
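If you want to count the main processes rather than eyeball them, something like the following might work. It is only a sketch: it assumes `ps -ef` column order (UID, PID, PPID, ...) and that the main daemon is the `bin/nagios -d` process whose parent is PID 1, as in the listing above. The function name `count_nagios_masters` is invented for this example.

```shell
#!/bin/sh
# count_nagios_masters (hypothetical helper)
# Reads `ps -ef` output on stdin and prints how many "bin/nagios -d"
# processes have PPID 1, i.e. how many main daemons appear to be running.
count_nagios_masters() {
    awk '$3 == 1 && /bin\/nagios -d/ { n++ } END { print n + 0 }'
}

# Example call:
# ps -ef | count_nagios_masters
```

Anything other than 1 would suggest a stray instance. On systems where the daemon is not reparented to PID 1, the PPID test would need adjusting.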
Former Nagios employee
Re: URGENT - Weird issues after error in config
As we have already had issues with multiple Nagios instances running, we have a check testing for this. So we shouldn't have multiple Nagios instances running.
Code: Select all
ps -ef | grep bin/nagios
root 16282 3106 0 15:52 pts/1 00:00:00 grep bin/nagios
nagios 40015 1 5 15:31 ? 00:01:06 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 40017 40015 0 15:31 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 40018 40015 0 15:31 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 40019 40015 0 15:31 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 40020 40015 0 15:31 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 40021 40015 0 15:31 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 40022 40015 0 15:31 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 40023 40015 0 15:31 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 40024 40015 0 15:31 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 40025 40015 0 15:31 ? 00:00:01 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 40089 40015 0 15:31 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
The emails we are receiving have a link to the service:
Code: Select all
***** Nagios XI Alert *****
Nagios has detected a problem with this service.
Notification Type: PROBLEM
Service: Traceroute Active iFiber Router - Hop 4
Host: onderwijsvpn-30
Address: onderwijsvpn-30
State: UNKNOWN
Info:
ERROR in traceroute command.
Date/Time: 27/10/2015 14:48:38
Respond: http://nagiosserver/nagiosxi/?&xiwindow=http%3A%2F%2Fnagiosserver%2Fnagiosxi%2Fincludes%2Fcomponents%2Fxicore%2Fstatus.php%3Fshow%3Dservicedetail%26host%3Donderwijsvpn-30%26service%3DTraceroute%2BActive%2BiFiber%2BRouter%2B-%2BHop%2B4
When we click on the link we are taken to this page (check screenshot). This service does not exist in the CCM, and cannot be found in objects.cache or in status.dat.
When we then go to configure and try to delete this service, we are redirected to the CCM main page...
Last edited by WillemDH on Tue Oct 27, 2015 9:59 am, edited 1 time in total.
Nagios XI 5.8.1
https://outsideit.net
Re: URGENT - Weird issues after error in config
Does this service show up in the Core interface as well?
Former Nagios employee
-
SteveBeauchemin
- Posts: 524
- Joined: Mon Oct 14, 2013 7:19 pm
Re: URGENT - Weird issues after error in config
Are the emails timestamped for right now? Are they new? Or are the emails queued up and still coming out because there is a backlog?
On my server, I use postfix. Maybe check the queue? I used this resource to get syntax. http://www.cyberciti.biz/tips/howto-pos ... queue.html
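For what it's worth, a quick way to see whether a backlog exists is to look at the summary line that `postqueue -p` prints last. The sketch below assumes Postfix and its usual summary wording ("-- N Kbytes in M Requests." or "Mail queue is empty"); the helper name `queued_count` is made up here.

```shell
#!/bin/sh
# queued_count (hypothetical helper)
# Reads `postqueue -p` output on stdin and prints the number of queued
# messages taken from the final summary line; prints 0 for an empty queue.
queued_count() {
    awk '{ last = $0 }
         END {
             n = split(last, f, " ")
             if (n >= 2 && f[1] == "--" && f[n-1] ~ /^[0-9]+$/) print f[n-1]
             else print 0
         }'
}

# Example call:
# postqueue -p | queued_count
```

A large number that shrinks over time would point to queued mail draining; a number near zero while new emails keep arriving would point back at Nagios.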
Just a thought.
Steve B
XI 5.7.3 / Core 4.4.6 / NagVis 1.9.8 / LiveStatus 1.5.0p11 / RRDCached 1.7.0 / Redis 3.2.8 /
SNMPTT / Gearman 0.33-7 / Mod_Gearman 3.0.7 / NLS 2.0.8 / NNA 2.3.1 /
NSClient 0.5.0 / NRPE Solaris 3.2.1 Linux 3.2.1 HPUX 3.2.1
Re: URGENT - Weird issues after error in config
They are not visible in the Core interface. The timestamps in the emails we receive are 'real-time', which makes it seem like the checks are still getting executed.
Current time here: 16:01
Code: Select all
Service: Traceroute Active iFiber Router - Hop 4
Host: onderwijsvpn-31
Address: onderwijsvpn-31
State: UNKNOWN
Info:
ERROR in traceroute command.
Date/Time: 27/10/2015 16:00:37
The Nagios host which should have this problematic service is also still sending emails, even though the error in the traceroute command has been solved in the meantime (the nagios user needed sudo for traceroute). So this service is showing healthy in XI, but we are still getting notifications for it.
We did a
Code: Select all
while true
do
pgrep check_traceroute
sleep 0.1
done
on the Nagios server as well as on the mrtg worker node and nothing came up. So we suspect that no checks are actively running. The issue is in the recurring notifications for all the host objects which temporarily had these two services: 'Traceroute Active iFiber Router - Hop 4' and 'Traceroute Active Backbone Router - Hop 1'.
Last edited by WillemDH on Tue Oct 27, 2015 10:13 am, edited 1 time in total.
Nagios XI 5.8.1
https://outsideit.net
Re: URGENT - Weird issues after error in config
If you rename the test binary or remove it from libexec does it stop?
If the system cannot see the executable, it should provide different data in the logs. Maybe help track it.
I have been flooded by tests before and ended up taking the code away so they could not run.
That will prove that it is still actively trying to run.
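Roughly what I mean, as a sketch: move the plugin aside under a different name and put it back afterwards. The plugin name check_traceroute is taken from the pgrep loop earlier in the thread; the libexec path and the helper names `disable_plugin` and `enable_plugin` are assumptions for illustration. Since the checks here go through a worker node, the plugin would presumably have to be moved there as well.

```shell
#!/bin/sh
# disable_plugin DIR NAME (hypothetical helper)
# Renames DIR/NAME to DIR/NAME.disabled so it can no longer be run by name.
disable_plugin() {
    if [ -f "$1/$2" ]; then
        mv "$1/$2" "$1/$2.disabled"
    else
        echo "no such plugin: $1/$2" >&2
        return 1
    fi
}

# enable_plugin DIR NAME (hypothetical helper)
# Reverses disable_plugin once the test is over.
enable_plugin() {
    if [ -f "$1/$2.disabled" ]; then
        mv "$1/$2.disabled" "$1/$2"
    else
        echo "nothing to restore: $1/$2.disabled" >&2
        return 1
    fi
}

# Example calls (path is an assumption, adjust to your install):
# disable_plugin /usr/local/nagios/libexec check_traceroute
# enable_plugin  /usr/local/nagios/libexec check_traceroute
```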
Steve B
XI 5.7.3 / Core 4.4.6 / NagVis 1.9.8 / LiveStatus 1.5.0p11 / RRDCached 1.7.0 / Redis 3.2.8 /
SNMPTT / Gearman 0.33-7 / Mod_Gearman 3.0.7 / NLS 2.0.8 / NNA 2.3.1 /
NSClient 0.5.0 / NRPE Solaris 3.2.1 Linux 3.2.1 HPUX 3.2.1
Re: URGENT - Weird issues after error in config
Humor me here, but for about 5 minutes stop the following (make sure everyone is logged out first):
service postgresql stop
service crond stop
and see if the emails stop. Then after 5 minutes start them back up again:
service crond start
service postgresql start
Former Nagios employee