
False notification

Posted: Sat Apr 13, 2013 12:49 am
by sbaviswa
Hi all

We have nearly 400 network devices monitored via SNMP v3. However, the host check for the routers uses the check_host_alive command. In the 3 months since installation, we have hit the scenario below about twice.

Sometimes 4-5 hosts, and at other times 80-90 hosts, show as down with 100% packet loss in the host check, while their services still show as up. Pinging the same hosts gives a very low RTA, well below the threshold. Scheduling an immediate check for an affected host does not correct it; the host can stay in this state for hours.

We have to restart the nagios and httpd services to bring them back to normal.

Any ideas welcome.

Info that might help your analysis: we installed Nagios XI on a server with 4 CPUs and 16GB RAM. CPU usage is below 10%, but RAM utilisation is around 15GB, of which 5GB is cached. There is 12GB of swap, which is not used at all.

Regards
SBA-Viswa

Re: False notification

Posted: Mon Apr 15, 2013 10:20 am
by scottwilkerson
If you run a standard ping to these hosts from the CLI, do you see packet loss there?
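
A minimal sketch of that CLI test, assuming a Linux-style `ping` whose summary line contains "N% packet loss". The function names `parse_loss` and `check_host` are made up for illustration, and the addresses passed on the command line would be whichever hosts Nagios currently shows as down.

```shell
#!/bin/sh
# Sketch: ping each host a few times from the Nagios server's shell and
# print the packet loss that ping itself reports.

# Extract the packet-loss percentage from ping's summary line on stdin.
# Assumes the summary contains text like "0% packet loss".
parse_loss() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Ping one host with 5 probes and print "<host> <loss>".
# Prints "unknown" if no summary line could be parsed.
check_host() {
    loss=$(ping -c 5 "$1" 2>/dev/null | parse_loss)
    echo "$1 ${loss:-unknown}"
}

# Hypothetical usage: pass the addresses of the affected hosts as arguments.
for host in "$@"; do
    check_host "$host"
done
```

If this reports little or no loss for a host while the host check still reports 100% packet loss, the network path is probably not the cause.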

Re: False notification

Posted: Wed Apr 17, 2013 5:20 am
by sbaviswa
During the periods when the hosts show as down and the services show as up, the hosts are actually up. We tested this by pinging with the "ping this host" option in the monitoring GUI as well as directly from outside the Nagios environment.

We also reviewed the following document: http://assets.nagios.com/downloads/nagi ... _In_XI.pdf

It mentions that firewalls in between might block the ICMP traffic, but the problem only affects certain devices, and everything returns to normal once we restart the nagios and httpd services.

SBA-Viswa

Re: False notification

Posted: Wed Apr 17, 2013 4:32 pm
by abrist
Verify that there are not 2 separate nagios processes:

Code:

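# Stop the service, then kill any nagios processes left behind
service nagios stop
killall nagios
# Confirm nothing is still running before starting again
ps -aef | grep nagios
service nagios start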
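
While the problem is occurring (before restarting anything), the duplicate-process check could also be sketched as a count. This is only an illustration: the function name `count_nagios_daemons` is made up, and it assumes each independent daemon has parent PID 1 and a command path ending in `/nagios`. Short-lived check processes may briefly match as well, so only a count above 1 that persists across several runs would be suspicious.

```shell
#!/bin/sh
# Sketch: count nagios daemon processes in "ps -eo pid,ppid,args" output.
# Assumption: an independent daemon has parent PID 1 (field 2) and a
# command (field 3) whose path ends in /nagios.
count_nagios_daemons() {
    awk '$2 == 1 && $3 ~ /\/nagios$/ { n++ } END { print n + 0 }'
}

# Hypothetical usage on the monitoring server: run with --live to
# inspect the real process table.
if [ "${1:-}" = "--live" ]; then
    ps -eo pid,ppid,args | count_nagios_daemons
fi
```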

Re: False notification

Posted: Tue Apr 23, 2013 6:20 am
by sbaviswa
The issue cropped up again today, with around 51 hosts showing as down when they were actually up.

Our client checked the processes while the issue was occurring. There were not multiple nagios instances running that could have triggered this scenario.

Please shed some light on what to check next. Our client is embarrassed and is starting to doubt the credibility of the status of the other hosts.

Regards
SBA-Viswa

Re: False notification

Posted: Tue Apr 23, 2013 2:16 pm
by abrist
If you are a paying customer, could you please send an email to [email protected] to open a ticket? We may need to look at your configuration snapshot tarball, and that is best done through more secure means.

Re: False notification

Posted: Wed Apr 24, 2013 12:08 am
by sbaviswa
Yes, our customer has support privileges. Ultimately I will need to ask the customer to email the Nagios support team.
But can you guide me on how and when to take the configuration snapshot tarball?

Re: False notification

Posted: Wed Apr 24, 2013 11:38 am
by lmiltchev
From the Nagios XI web interface, click the "Admin" menu, then click "Config Snapshots" under the "Monitoring Config" menu on the left. Click both the "Download" and "View Output" action buttons, save both files (*.tar.gz and *.txt), and email them to [email protected].