Some checks were not working

Posted: Mon Mar 02, 2020 8:22 am
by goldmund84
Hi,
I found strange behaviour in our Nagios checks.
For one host, several checks were not working for several hours, as if they had been disabled. The graph is in the attachment.
I tried to find something in the Event Log, but saw nothing suspicious. The host has about 50 checks, and only a few of them did not work. All other hosts were polled successfully. The affected check is executed via SNMP.

Please advise where I could look to find the reason for this behaviour.

We have about 2500 checks and a lot of hosts, but the issue was only with one host.
The Nagios XI version is 5.5.2 (no support).

Re: Some checks were not working

Posted: Mon Mar 02, 2020 3:00 pm
by tgriep
Try running a State History report for the host and all of its services to see what errors the plugins were generating at that time, and whether they correlate with the checks that were not working.

Re: Some checks were not working

Posted: Tue Mar 03, 2020 8:46 am
by goldmund84
Nothing suspicious in the State History, just the normal flow of events. As on the graph, the affected service has a state change at 22:16 and the next one only the following day at 15:20.
Between 22:16 and 15:20 - there is a gap with no events. But in reality, there were events and we missed the accident.

Any more thoughts?

Re: Some checks were not working

Posted: Tue Mar 03, 2020 10:32 am
by tgriep
Can you clarify what you mean by this?
"there were events and we missed the accident."
Are you saying that the host in question had issues during that time the graph stopped?
If the device had issues and stopped responding to SNMP polling, that would show the issue you are seeing.
What was the last state change at 22:16?

Re: Some checks were not working

Posted: Wed Mar 04, 2020 4:16 am
by goldmund84
The host didn't have issues. There were issues in services related to the host, but not on the host itself. The host received less traffic than usual, and that was the condition we needed to be alerted about. Everything was working on the host itself. Other SNMP checks on the same host worked correctly, with no gap.
At 22:16 the response reported OK State.

Re: Some checks were not working

Posted: Wed Mar 04, 2020 5:50 pm
by tgriep
If the services related to the host caused the check to fail and not return performance data, that would show in the graph just like you are seeing.
If no performance data is returned for a check, there is no data for the graph and a gap will be displayed.

Look in the archived log files. Do you see the check running for that service during the time the issue happened?

/usr/local/nagios/var/archives/nagios-02-26-2020-00.log
/usr/local/nagios/var/archives/nagios-02-27-2020-00.log
If so, post the entries so we can view them.
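
A minimal sketch of that search in Python, assuming the usual Nagios log line layout of `[epoch] TYPE: host;service;...`. The function names, the host and service names, and the time window are placeholders for illustration, not values from this thread.

```python
import re

# Assumed layout of a Nagios log line:
#   [<epoch seconds>] <ENTRY TYPE>: <semicolon-separated fields>
LINE_RE = re.compile(r"^\[(\d+)\] ([^:]+): (.*)$")


def service_entries(lines, host, service, start_epoch, end_epoch):
    """Yield (epoch, entry_type, fields) for entries about host/service
    whose timestamp falls within [start_epoch, end_epoch]."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # free-text lines without a "TYPE: fields" part
        epoch = int(match.group(1))
        fields = match.group(3).split(";")
        # Service entries are assumed to start with host;service
        if len(fields) < 2 or fields[0] != host or fields[1] != service:
            continue
        if start_epoch <= epoch <= end_epoch:
            yield epoch, match.group(2), fields


def scan_file(path, host, service, start_epoch, end_epoch):
    """Read one archived log file and return the matching entries as a list."""
    with open(path, errors="replace") as handle:
        return list(service_entries(handle, host, service, start_epoch, end_epoch))
```

For example, `scan_file("/usr/local/nagios/var/archives/nagios-02-27-2020-00.log", "myhost", "My SNMP Service", start, end)` would list whatever was logged for that service in the window. An empty result means nothing was logged for it in that file during that period.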

Re: Some checks were not working

Posted: Fri Mar 06, 2020 8:48 am
by goldmund84
Hi,

I checked those logs and there are no reported states for the affected service during the "gap period", as if somebody had disabled it for that period. Is there a way to check whether it was put into Scheduled Downtime, Acknowledged, Disabled, or something else?

Re: Some checks were not working

Posted: Fri Mar 06, 2020 10:08 am
by tgriep
You might be able to search the Audit Log in the Admin > Audit Log menu in the XI GUI.
It is an Enterprise feature so you need that license enabled.

Re: Some checks were not working

Posted: Tue Mar 10, 2020 6:13 am
by goldmund84
Is there any way to look into the audit log from the console rather than from the GUI? We don't have Enterprise support.

Re: Some checks were not working

Posted: Tue Mar 10, 2020 2:04 pm
by tgriep
Go to the Admin > System Settings menu and see if your version has the Audit Log enabled. If so, it will show you the path to the file.
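
Separately from the Audit Log, the archived Nagios log files may also show whether the service was disabled or put into downtime from the console side. The following is only a sketch under assumptions: it assumes external commands are being logged (lines of the form `[epoch] EXTERNAL COMMAND: NAME;host;service;...`) and that downtime appears as `SERVICE DOWNTIME ALERT` entries. The command list is illustrative and not exhaustive, and the function name is hypothetical.

```python
import re

# Assumed layout of a Nagios log line:
#   [<epoch seconds>] <ENTRY TYPE>: <semicolon-separated fields>
LINE_RE = re.compile(r"^\[(\d+)\] ([^:]+): (.*)$")

# External commands that could explain a silent service (assumed, not exhaustive)
SUSPECT_COMMANDS = {
    "DISABLE_SVC_CHECK",
    "ENABLE_SVC_CHECK",
    "SCHEDULE_SVC_DOWNTIME",
    "ACKNOWLEDGE_SVC_PROBLEM",
    "DISABLE_HOST_SVC_CHECKS",
}


def admin_actions(lines, host, service):
    """Return (epoch, action) pairs for commands or downtime entries
    that touch the given host/service."""
    found = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        epoch = int(match.group(1))
        entry_type = match.group(2)
        fields = match.group(3).split(";")
        if entry_type == "EXTERNAL COMMAND":
            # Field order assumed: command name, host, then service if any
            name = fields[0]
            if name not in SUSPECT_COMMANDS or len(fields) < 2 or fields[1] != host:
                continue
            host_wide = name == "DISABLE_HOST_SVC_CHECKS"
            if host_wide or (len(fields) >= 3 and fields[2] == service):
                found.append((epoch, name))
        elif entry_type == "SERVICE DOWNTIME ALERT":
            # Field order assumed: host, service, STARTED/STOPPED/CANCELLED, text
            if len(fields) >= 3 and fields[0] == host and fields[1] == service:
                found.append((epoch, "DOWNTIME " + fields[2]))
    return found
```

Passing the lines of each archive file from the gap period to `admin_actions` would show whether anything of this kind was logged for the affected service. An empty list does not rule it out if external command logging is turned off.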