Nagios false alarms

Posted: Sun Feb 16, 2020 11:38 pm
Hi team,

I am having an issue with Nagios monitoring.
I am getting false alarms in Nagios XI, and the error is a timeout after 60.1 seconds.

The alerts do not stay active for more than 1 minute.

-
Harsha S

Re: Nagios false alarms

Posted: Mon Feb 17, 2020 10:30 am
by jdunitz
Well, you could increase the global timeout to be longer than 60 seconds, but if you're seeing timeouts more than occasionally, you probably have another problem that should be looked into. Perhaps your system is overloaded, or you had multiple services fail at the same time and flood the system with alerts, some of which timed out?
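
Before changing anything, it may help to confirm what the timeouts are currently set to. Here is a minimal sketch, assuming the main config file is at /usr/local/nagios/etc/nagios.cfg (the usual location, but verify it on your install). The helper name `show_check_timeouts` is made up for illustration.

```shell
#!/bin/sh
# Print the check timeout directives from a Nagios main config file.
# The default path below is an assumption; pass another path as the
# first argument if your install differs.
show_check_timeouts() {
    cfg="${1:-/usr/local/nagios/etc/nagios.cfg}"
    # service_check_timeout and host_check_timeout are given in seconds.
    # Commented-out lines (starting with #) are skipped by the anchor.
    grep -E '^[[:space:]]*(service|host)_check_timeout[[:space:]]*=' "$cfg"
}
```

Calling `show_check_timeouts` with no argument reads the default path. If you do raise `service_check_timeout`, Nagios has to be restarted before the new value takes effect.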

It would be really helpful to get a system profile from you.
To do so, go to Admin at the top, System Config on the left, System Profile inside that, and "Download Profile" in the main pane.

Once we have a look at that, we might be able to tell you more.

Thanks!

Re: Nagios false alarms

Posted: Mon Feb 17, 2020 1:01 pm
by jdunitz
Also, can you tell us which checks are timing out?
And how many retries are your checks configured to go through before alerting? We usually recommend a few (like 5) retries, to cut down on the number of false positives.
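
If you want to check the retry settings in bulk, something like the following might do it. This is only a sketch: it assumes the retries are set with `max_check_attempts` directly in your object config files, and `low_retry_lines` is a hypothetical helper name. Values inherited from a template will not show up on the service itself, so a missing line is not necessarily a problem.

```shell
#!/bin/sh
# List object config lines whose max_check_attempts is below a threshold.
# Usage: low_retry_lines THRESHOLD FILE...
low_retry_lines() {
    threshold="$1"
    shift
    awk -v min="$threshold" '
        # Match "max_check_attempts <number>" lines inside define blocks.
        $1 == "max_check_attempts" && $2 + 0 < min {
            printf "%s:%d: %s %s\n", FILENAME, FNR, $1, $2
        }
    ' "$@"
}
```

For example, `low_retry_lines 5 /usr/local/nagios/etc/services/*.cfg` would list every definition with fewer than 5 attempts, assuming your service definitions live in that directory.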

Re: Nagios false alarms

Posted: Tue Feb 18, 2020 12:27 am
The timeout alerts come in randomly; they are not limited to a single server or service.
Most of the time it is the service checks (NRPE client checks) that time out.
5 retries are configured. However, we see the alerts on the dashboard but do not receive mail, and the alerts clear within 1 minute.

Re: Nagios false alarms

Posted: Tue Feb 18, 2020 12:29 am
Please find the attachment for details.


Moderator's Note: The profile has been shared with the support team but has been removed from the public forum.

Re: Nagios false alarms

Posted: Tue Feb 18, 2020 11:34 am
by jdunitz
I don't see anything obviously wrong with your XI server. It's certainly not overloaded, at least not when the snapshot was taken.

We need to rule out the possibility of network congestion or other problems on the service end.
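
One way to look at the service end is to run one of the problem checks by hand from the XI server and see how long it takes. A rough sketch follows, assuming the plugin is at /usr/local/nagios/libexec/check_nrpe. `time_check` is a made-up helper name, the IP address is just an example, and `check_load` stands in for whatever command your nrpe.cfg actually defines.

```shell
#!/bin/sh
# Run a command and print how long it took (whole seconds) and its exit status.
# Usage: time_check COMMAND [ARGS...]
time_check() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    status=$?
    end=$(date +%s)
    echo "elapsed=$((end - start))s exit=$status"
}

# Example (path, address, and command name are assumptions):
# time_check /usr/local/nagios/libexec/check_nrpe -H 192.168.1.222 -t 60 -c check_load
```

Running that a number of times in a row against a host that has alerted should show whether the check is occasionally slow on its own or only slow when Nagios schedules it.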

Can you run a state history report for the last week on one of your host/service combinations that has had this problem recently?

If you've never done one of those before, you just go to "Reports" at the top, "State History" on the left, and set up the parameters of the report to include a period of "last week" (or a custom date range, whatever makes sense). Then set the "Type" and "State Type" to "both" and hit the blue "Run" button. Once it runs, you can click the "Download" dropdown at the top right and choose "CSV". If you could let us have a look at that CSV file, that might help us determine what the problem could be.
I've attached an annotated screenshot that might help generate the state history report.


Thanks!
state_history-annotated.png

Re: Nagios false alarms

Posted: Tue Feb 18, 2020 10:48 pm
Please find the alert.

The timeout alert you can see there is a false alarm.

Re: Nagios false alarms

Posted: Wed Feb 19, 2020 10:54 am
by jdunitz
Thanks for the screenshot, but it'd be really good to get the full report as described before.

A couple of other things to look at:
1) Is the network configuration on your XI server all correct? DNS, netmask, Ethernet duplex, etc.?
2) Is this a virtual or physical server?

Also, can you choose a few hosts you've had issues with and run some long ping tests and log them?
You can do something like this:

# ping -c 500 192.168.1.222 > /tmp/ping_log_1.222 &
# ping -c 500 192.168.4.101 > /tmp/ping_log_4.101 &
That'll take a few minutes to run, of course. If you want, you could even do it multiple times, either to separate logs, or use ">>" instead of ">" to append to the previous logs.

You don't have to send us your ping logs, but at least look through them to see if there are more than a couple missed pings, and certainly you'll want to be on the lookout for several consecutive missed pings.
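
If the logs are too long to read by eye, a small script can look for gaps in the sequence numbers. This is a sketch that assumes the Linux ping output format, where each reply line contains `icmp_seq=N` and a lost reply simply produces no line. `ping_gaps` is a hypothetical helper name.

```shell
#!/bin/sh
# Report gaps in icmp_seq numbers in a ping log (each gap means lost replies).
# Usage: ping_gaps LOGFILE
ping_gaps() {
    awk '
        match($0, /icmp_seq=[0-9]+/) {
            # Extract the number after "icmp_seq=" (that prefix is 9 characters).
            seq = substr($0, RSTART + 9, RLENGTH - 9) + 0
            # A sequence number that goes backwards means a new ping run
            # was appended to the same log, so start counting again.
            if (seq <= prev) prev = 0
            if (seq > prev + 1) {
                missed = seq - prev - 1
                printf "missed %d ping(s) after icmp_seq=%d\n", missed, prev
                total += missed
            }
            prev = seq
        }
        END { printf "total missed: %d\n", total }
    ' "$1"
}
```

For example, `ping_gaps /tmp/ping_log_1.222`. Replies lost at the very end of a run do not show up as a gap, so compare the total against the packet loss summary that ping prints when it finishes.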

Hope that helps!

Re: Nagios false alarms

Posted: Thu Feb 20, 2020 10:13 pm
All servers are VMs, and there are no packet drops between servers.

I am still getting the error below.

(Service check timed out after 60.01 seconds)

Re: Nagios false alarms

Posted: Fri Feb 21, 2020 11:34 am
by jdunitz
Would you be able to send us the last several days of archive logs?
Here's a little script that will gather the last 3-4 days of logs. Depending on the size of the logs, if we could get 3-5 days of logs, that'd be splendid.

[root@localhost archives]# cd /usr/local/nagios/var/archives
[root@localhost archives]# find . -mtime -3 -name "*.log" | while read -r a
> do
> tar -rvf /tmp/arch.tar "$a"
> done
./nagios-02-20-2020-00.log
./nagios-02-19-2020-00.log
./nagios-02-21-2020-00.log
./nagios-02-18-2020-00.log
[root@localhost archives]#
[root@localhost archives]# gzip -9 /tmp/arch.tar
[root@localhost archives]# ls -l /tmp/arch.tar.gz
-rw-r--r-- 1 root root 272936 Feb 21 10:23 /tmp/arch.tar.gz
[root@localhost archives]#
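
While you have those logs handy, you could also get a quick count of which hosts and services are timing out most often. This is a sketch, assuming the usual log line layout of `[timestamp] SERVICE ALERT: host;service;state;state type;attempt;output`. `count_timeouts` is a made-up helper name.

```shell
#!/bin/sh
# Count "timed out" service alerts per host and service in Nagios log files.
# Usage: count_timeouts LOGFILE...
count_timeouts() {
    awk -F';' '
        /SERVICE ALERT:/ && /timed out/ {
            host = $1
            sub(/^.*SERVICE ALERT: /, "", host)   # strip timestamp and prefix
            count[host ";" $2]++
        }
        END {
            for (key in count) printf "%d %s\n", count[key], key
        }
    ' "$@" | sort -rn
}
```

For example, `count_timeouts /usr/local/nagios/var/archives/nagios-02-*-2020-00.log`. If the timeouts cluster on a handful of hosts, those clients are the place to look; if they are spread evenly across everything, the XI server or the network is the more likely suspect.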

Also, the state history report as mentioned previously would be quite helpful for getting this sorted.

Thanks!
--Jeffrey