Service state not changing
Posted: Wed May 29, 2019 12:50 pm
by jvaira
Hello,
I am running into an issue where a service check is exceeding its warning/critical thresholds, but the service state does not seem to change. In the attached graph you can see that this check exceeded its critical threshold multiple times in the last 24 hours. However, when I run the state history report (second attached screenshot), it shows that the service changed from warning to critical only once, about an hour after it actually crossed the critical threshold this morning.
Re: Service state not changing
Posted: Wed May 29, 2019 3:58 pm
by cdienger
Please PM me a profile (Admin > System Config > System Profile > Download Profile) so that we can review the configuration for this check.
Re: Service state not changing
Posted: Thu May 30, 2019 10:26 am
by jvaira
Hello
@cdienger I sent you a PM with the requested profile. Please let me know if there is anything else that I need to send over.
Re: Service state not changing
Posted: Thu May 30, 2019 1:43 pm
by cdienger
Thanks for that. There doesn't appear to be anything obviously wrong so far, so I'd like to get a copy of /usr/local/nagios/var/archives/nagios-05-30-2019-00.log and try to line up the events there with the data we have so far.
Re: Service state not changing
Posted: Tue Jun 04, 2019 10:37 am
by jvaira
Hello,
@cdienger I just wanted to make sure you got my last PM, which included the requested logs. If so, is there any update on this issue? We are concerned that this may still be happening and that there are checks over their thresholds that are not changing state.
Re: Service state not changing
Posted: Tue Jun 04, 2019 4:44 pm
by cdienger
Thanks for your patience. This is odd behavior and I don't have a reason for it yet. I'll take another look at it first thing tomorrow.
Re: Service state not changing
Posted: Wed Jun 05, 2019 11:43 am
by cdienger
There are no obvious problems in the data that was collected. One possible cause is more than one instance of nagios running at the same time. You can monitor for this with the check_procs plugin, so if it occurs again we'll have more data to look at. See the attached screenshot. Normally, when the nagios service is running, there will be two processes that look like:
/usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
and the plugin will return:
PROCS OK: 2 processes with command name 'nagios', args 'nagios.cfg' | procs=2;2:2;;0;
If more than one nagios instance is running, the number of nagios processes will be more than 2.
If it does occur again, we'll want to gather fresh nagios logs, a profile, and a copy of /usr/local/nagios/var/status.dat.
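If you want to act on the plugin output from a script as well, here is a minimal Python sketch that pulls the process count out of the perfdata shown above and flags anything over 2. The function and constant names are hypothetical and not part of Nagios. Only the output format is taken from this thread, and the expected count of 2 assumes a single healthy instance as described above.

```python
import re

# Sample line copied from the plugin output quoted earlier in this post.
SAMPLE = "PROCS OK: 2 processes with command name 'nagios', args 'nagios.cfg' | procs=2;2:2;;0;"

# Assumption: one healthy nagios instance shows exactly two matching processes.
EXPECTED_PROCS = 2


def parse_procs(output: str) -> int:
    """Return the process count from the 'procs=' perfdata field."""
    match = re.search(r"\|\s*procs=(\d+)", output)
    if match is None:
        raise ValueError(f"no procs perfdata found in: {output!r}")
    return int(match.group(1))


def extra_instance_suspected(output: str, expected: int = EXPECTED_PROCS) -> bool:
    """True when more processes are reported than one instance accounts for."""
    return parse_procs(output) > expected
```

Calling extra_instance_suspected(SAMPLE) on the healthy line above returns False. A line reporting procs=4 would return True, which would be the point at which to grab the logs, profile, and status.dat.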