False Positives

rj-admin2 · Post by **rj-admin2** » Wed May 22, 2024 10:47 pm

Hi,

I've recently set up email and SMS alerting on Nagios XI for our firewalls and switches. Almost immediately I started getting alerts, at different times (but never during the working day), saying the devices are DOWN. However, on closer inspection this is not the case: I also monitor System Uptime, and none of the devices has been DOWN (in the truest sense) at any time. So my question is: how can I investigate and determine the cause of this issue? We're talking about 3 firewalls and 9 switches across 3 sites, all registering as DOWN at the same time and all registering as UP at around the same time too.

TIA.

jsimon · Post by **jsimon** » Thu May 23, 2024 9:36 am

Hi @rj-admin2,

This could be a result of services going down for any length of time, depending on your monitoring settings for those services. Personally I've noticed switches can be a particular source of these, as individual ports can report outages for momentary disruptions. You can account for this on the 3rd page of a wizard, where you are asked to choose your values for rechecking a problem-reporting host/service. If that recheck window is too narrow, try increasing the number of rechecks and/or the interval between them, so the host or service is rechecked more times, or over a longer period, before you are notified.

To change this setting for an existing host or service, go to Configure -> Core Config Manager -> Monitoring (left navbar) -> Hosts (or Services) -> Edit on the item you'd like to modify. The settings you're looking for are on the Check Settings tab (Retry Interval, Max Check Attempts).

Let us know if this doesn't fully address your issue or if you have any other questions about this!

sgardil · Post by **sgardil** » Thu May 23, 2024 10:03 am

To add to what jsimon said: are the emails you are getting about specific services on the devices? If it is just certain services going down at the same time, it is possible that something in your network configuration disables certain ports outside working hours.

rj-admin2 · Post by **rj-admin2** » Tue May 28, 2024 5:34 am

Many thanks for the replies.

I have reconfigured Nagios to use IP addresses instead of DNS names for the hosts as it is these that are reported as being DOWN. I'm also now in the process of configuring Parent/Child relationships between the devices. This I hope will reduce the number of alerts we are receiving. In fact since I've done this we have not had an alert generated in the last 2 nights.

TIA.

rj-admin2 · Post by **rj-admin2** » Thu Jun 20, 2024 12:11 am

Hi guys,

After the last change made (increasing the max check attempts of the hosts from 5 to 10) things have been stable now for about 2 weeks or so. So just to clarify - each host defined has the following entries configured:

Check interval - 5 mins
Retry interval - 1 min
Max check attempts - 10

However, last night I received notifications telling me hosts were down when in fact they were not. There was, however, a very high RTA from the PING checks across all hosts at the same time. It took about 3 mins from the hosts being reported as down/unreachable to my being notified that they were back online. Based on the above settings, I would have thought I wouldn't receive any notifications, as those 3 mins should fall within the time allowed for rechecks before an alert is sent. Am I reading this wrongly? What happens if I were to increase the Retry interval from 1 to 3 mins?

These issues only seem to occur outside our normal office hours. I've never had this during the normal working day. There are backups happening in the evening, however they also happen throughout the working day.

TIA.

kg2857 · Post by **kg2857** » Thu Jun 20, 2024 1:57 am

The checks run on a schedule and I'm not sure why they are thought to be incorrect without proof.
Maybe change the notification delay.

jsimon · Post by **jsimon** » Thu Jun 20, 2024 9:58 am

Hi @rj-admin2,

It might be worth looking at either the Notification Report or the State History report. Use the Advanced filters to select the host you're looking for more info on. If you use the State History Report, you'll want to choose "Soft State". You'll get to see when the checks returned what values, and when notifications were sent. Hopefully this will give you more information about the exact intervals that caused this incongruity.
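As a rough way to sanity-check the timestamps you see in that report, here is a minimal sketch of how long a problem would have to persist before the state goes hard under the settings quoted above. It assumes a simplified model in which the first failed check counts as attempt 1 and every later attempt runs exactly one retry interval after the previous one; on-demand checks and scheduler latency are not modelled, and the function name is hypothetical rather than anything from Nagios itself.

```python
def minutes_until_hard_state(retry_interval_min, max_check_attempts):
    """Minutes from the first failed check to the attempt that makes the state HARD.

    Simplified model (an assumption, not Nagios' actual scheduler): attempt 1 is
    the first failed check, and each later attempt runs one retry interval
    after the previous one.
    """
    if max_check_attempts < 1:
        raise ValueError("max_check_attempts must be at least 1")
    return (max_check_attempts - 1) * retry_interval_min


# Settings quoted earlier in the thread: retry interval 1 min, 10 attempts.
current = minutes_until_hard_state(1, 10)

# The change asked about: retry interval 3 mins, still 10 attempts.
proposed = minutes_until_hard_state(3, 10)

print(f"current settings: about {current} min before a hard state")
print(f"with 3 min retries: about {proposed} min before a hard state")
```

Under this model a 3 minute blip should not reach a hard state with either setting, so if the report shows soft states turning hard sooner than that, it would be worth comparing the attempt numbers and timestamps there against the values actually applied to the host.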

sgardil · Post by **sgardil** » Thu Jun 20, 2024 10:31 am

kg2857 wrote: ↑Thu Jun 20, 2024 1:57 am The checks run on a schedule and I'm not sure why they are thought to be incorrect without proof.
Maybe change the notification delay.

Have you tried what kg2857 suggests here? Changing the first notification delay could do the job, but only if you are fine with that same delay applying during normal working hours as well.
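To illustrate the trade-off, here is a minimal sketch of how a first notification delay would filter short outages. It is a simplified model, not Nagios' actual logic: it assumes the delay is counted from the start of the problem and that a notification goes out only if the host is still down once both the hard state and the delay have been reached. The function and parameter names are hypothetical.

```python
def would_notify(outage_min, hard_after_min, first_notification_delay_min=0):
    """Return True if an outage of this length would trigger a notification.

    Simplified model (an assumption): the host must still be down when it
    reaches a hard state AND when the first notification delay has elapsed,
    both measured from the start of the outage.
    """
    return (outage_min >= hard_after_min
            and outage_min >= first_notification_delay_min)


# A 3 minute blip, hard state after 2 minutes, no delay: a notification is sent.
no_delay = would_notify(outage_min=3, hard_after_min=2)

# Same blip with a 5 minute first notification delay: suppressed.
with_delay = would_notify(outage_min=3, hard_after_min=2,
                          first_notification_delay_min=5)

# A genuine 20 minute outage still notifies, just later than it otherwise would.
real_outage = would_notify(outage_min=20, hard_after_min=2,
                           first_notification_delay_min=5)

print(no_delay, with_delay, real_outage)
```

The cost is the one mentioned above: a real daytime outage would also wait out the delay before anyone is told.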

Nagios Support Forum
