Two simultaneous statuses for an active check

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
mon-team
Posts: 171
Joined: Thu Jun 28, 2012 9:22 am

Two simultaneous statuses for an active check

Post by mon-team »

Hello there,
we are experiencing an issue with some active checks, which report two service statuses at the same time.
These are the logs related to the last occurrences of the issue:

[1437329784] SERVICE ALERT: ROUTER.lan;Traffic_ge-0/0/5;UNKNOWN;HARD;1;INTERFACE_TRAFFIC UNKNOWN - Error:Time duration between plugin calls is invalid
[1437329784] GLOBAL SERVICE EVENT HANDLER: ROUTER.lan;Traffic_ge-0/0/5;UNKNOWN;HARD;1;xi_service_event_handler
[1437329784] SERVICE ALERT: ROUTER.lan;Traffic_ge-0/0/5;OK;HARD;1;INTERFACE_TRAFFIC OK - (in=819.87Mbps/out=140.09Mbps)
[1437329784] GLOBAL SERVICE EVENT HANDLER: ROUTER.lan;Traffic_ge-0/0/5;OK;HARD;1;xi_service_event_handler


[1437158769] SERVICE ALERT: ROUTER.lan;Traffic_ge-0/0/5;CRITICAL;HARD;1;INTERFACE_TRAFFIC CRITICAL - (in=0.00Mbps/out=0.00Mbps)
[1437158769] GLOBAL SERVICE EVENT HANDLER: ROUTER.lan;Traffic_ge-0/0/5;CRITICAL;HARD;1;xi_service_event_handler
[1437158769] SERVICE ALERT: ROUTER.lan;Traffic_ge-0/0/5;OK;HARD;1;INTERFACE_TRAFFIC OK - (in=822.27Mbps/out=139.35Mbps)
[1437158769] GLOBAL SERVICE EVENT HANDLER: ROUTER.lan;Traffic_ge-0/0/5;OK;HARD;1;xi_service_event_handler
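
To find every occurrence rather than spotting them by eye, something like the following Python sketch could scan nagios.log for SERVICE ALERT entries that share the same timestamp, host and service but carry different states. It is only an illustration: the name `find_conflicting_alerts` is made up, and the field layout is assumed from the log lines above.

```python
import re
from collections import defaultdict

# Layout assumed from the excerpts above:
# [epoch] SERVICE ALERT: host;service;STATE;STATE_TYPE;attempt;plugin output
ALERT_RE = re.compile(
    r"^\[(\d+)\] SERVICE ALERT: ([^;]+);([^;]+);([^;]+);([^;]+);(\d+);(.*)$"
)


def find_conflicting_alerts(lines):
    """Map (epoch, host, service) to the states logged in that second,
    keeping only keys where more than one distinct state was logged."""
    seen = defaultdict(list)
    for line in lines:
        match = ALERT_RE.match(line.strip())
        if match is None:
            continue  # event handler lines and everything else are skipped
        epoch, host, service, state = match.group(1, 2, 3, 4)
        seen[(int(epoch), host, service)].append(state)
    return {key: states for key, states in seen.items() if len(set(states)) > 1}


sample = [
    "[1437329784] SERVICE ALERT: ROUTER.lan;Traffic_ge-0/0/5;UNKNOWN;HARD;1;"
    "INTERFACE_TRAFFIC UNKNOWN - Error:Time duration between plugin calls is invalid",
    "[1437329784] GLOBAL SERVICE EVENT HANDLER: ROUTER.lan;Traffic_ge-0/0/5;"
    "UNKNOWN;HARD;1;xi_service_event_handler",
    "[1437329784] SERVICE ALERT: ROUTER.lan;Traffic_ge-0/0/5;OK;HARD;1;"
    "INTERFACE_TRAFFIC OK - (in=819.87Mbps/out=140.09Mbps)",
]

conflicts = find_conflicting_alerts(sample)
for (epoch, host, service), states in sorted(conflicts.items()):
    print(epoch, host, service, states)
```

In practice the `sample` list would be replaced by the lines of the real log file, for example `open("/usr/local/nagios/var/nagios.log")` on a default XI install (path assumed).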


These are the service, service template, and command configurations:


define service {
    host_name               ROUTER.lan
    service_description     Traffic_ge-0/0/5
    use                     service-traffic
    servicegroups           +Services_for_SD,wrk1servicegroups_grp
    check_command           check_interface_traffic_rate! -C $USER16$ -n ge-0/0/5 -u Mbps -w 30:,0: -c 30:,0:!!!!!!!
    max_check_attempts      1
    check_interval          5
    passive_checks_enabled  0
    contacts                nagios-tech,netsupport-mail,network-sms
    register                1
}

define service {
    name                    service-traffic
    use                     generic-service
    is_volatile             0
    max_check_attempts      2
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   1440
    notification_period     24x7
    notification_options    w,c,u,r
    register                0
}

define command {
    command_name            check_interface_traffic_rate
    command_line            $USER1$/check_interface_traffic.pl -H $HOSTADDRESS$ $ARG1$
}


The issue has been occurring since we upgraded Nagios XI to 2014R2.7 and mod_gearman to version 1.5.
We have 3 mod_gearman workers running, but the "Traffic_ge-0/0/5" service is configured to run only on a specific worker.

Attached you can find the following files:
-check_interface_traffic.pl, the Perl script used by the command
-mod_gearman_neb.conf
-mod_gearman_worker.conf from the worker where that service runs
-mod_gearman_worker2.conf from a worker where other services run


Can anyone help us in investigating the problem?

Thanks.
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: Two simultaneous statuses for an active check

Post by jdalrymple »

This one is definitely sketchy and maybe even a bit scary, but I'll bite.

First - you have to dig deep to find where that error message comes from: it originates all the way back in check_snmp, which is a bit weird but not terribly surprising given that your plugin uses it. If going down this road we discover that the plugin is the problem, maybe we can adjust it or just replace the check_snmp call with a Perl SNMP query as a drop-in.

Second - I recommend firing up gearman worker logs and watching what happens there:

/etc/mod_gearman/mod_gearman_worker.conf:

Code:

debug=2
Third - if the second step turns out to be utterly useless, we might want to route your check through either the localhostgroups or the localservicegroups option in the NEB config, so that we can see whether Nagios itself runs things better.

Curious, is it just this 1 interface on this 1 host?
mon-team
Posts: 171
Joined: Thu Jun 28, 2012 9:22 am

Re: Two simultaneous statuses for an active check

Post by mon-team »

Thanks jdalrymple for your reply,
The issue affects more than one check, on several hosts.
We moved one of the affected checks into localservicegroups and the issue has not occurred again so far.
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: Two simultaneous statuses for an active check

Post by jdalrymple »

Interesting - if that does turn out to be a working solution, is the subset of affected hosts/services small enough that they could be run in entirety by the local Nagios box?
mon-team
Posts: 171
Joined: Thu Jun 28, 2012 9:22 am

Re: Two simultaneous statuses for an active check

Post by mon-team »

As long as the services were running on the Nagios server itself, the issue did not occur.
jdalrymple wrote: Interesting - if that does turn out to be a working solution, is the subset of affected hosts/services small enough that they could be run in entirety by the local Nagios box?
We don't want to move the affected services to the local Nagios box, because we want to keep the workload balanced across our 3 workers.

After moving those services back to the worker queue, the problem happened again. I attached both mod_gearman_neb.log and mod_gearman_worker.log with debug=2.
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Two simultaneous statuses for an active check

Post by tgriep »

Could you edit your mod_gearman_neb.conf file on the server and change the following
from

Code:

result_workers=2
to

Code:

result_workers=1
Restart the gearman daemon by running the following and see if that resolves the issue?

Code:

service gearmand restart
Be sure to check out our Knowledgebase for helpful articles and solutions!
mon-team
Posts: 171
Joined: Thu Jun 28, 2012 9:22 am

Re: Two simultaneous statuses for an active check

Post by mon-team »

Thank you tgriep for your reply.
We reduced result_workers to 1, but this didn't resolve the issue.
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Two simultaneous statuses for an active check

Post by tgriep »

In the mod_gearman_worker.conf file you have defined "servicegroups=wrk1servicegroups_grp", but the default hosts and services options are set to no.
In the mod_gearman_worker2.conf file you have not defined servicegroups=, but the default hosts and services options are set to yes.
I think that if hosts and services are set to yes and no groups are defined, the worker will run any service or host check.
This may be causing the problem.
To verify this, can you check the logs on your workers and see if that is happening for the duplicate service checks?
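
The reasoning above can be modelled roughly in Python. This is a simplified sketch, not Mod-Gearman's actual code: the function names are invented, the precedence (localservicegroups first, then servicegroups, then hostgroups, then the generic service queue) is an assumption based on how the Mod-Gearman documentation describes queue selection, and the NEB settings used in the example are guesses because the attached mod_gearman_neb.conf is not visible here.

```python
def queue_for_service(service_groups, host_groups, neb):
    """Return the queue the NEB module is assumed to pick for a service
    check, or None when the check stays local to the Nagios core."""
    for group in service_groups:
        if group in neb.get("localservicegroups", []):
            return None  # executed by Nagios itself, never queued
    for group in neb.get("servicegroups", []):
        if group in service_groups:
            return "servicegroup_" + group
    for group in neb.get("hostgroups", []):
        if group in host_groups:
            return "hostgroup_" + group
    return "service"  # generic queue for everything else


def worker_queues(worker):
    """Return the queues a worker is assumed to listen on."""
    queues = []
    if worker.get("hosts"):
        queues.append("host")
    if worker.get("services"):
        queues.append("service")
    queues += ["hostgroup_" + g for g in worker.get("hostgroups", [])]
    queues += ["servicegroup_" + g for g in worker.get("servicegroups", [])]
    return queues


# Worker settings as described in this thread.
worker1 = {"hosts": False, "services": False,
           "servicegroups": ["wrk1servicegroups_grp"]}
worker2 = {"hosts": True, "services": True}

# Servicegroups of the Traffic_ge-0/0/5 service from the first post.
groups = ["Services_for_SD", "wrk1servicegroups_grp"]

# Case A (assumed): the NEB config lists the servicegroup.
routed = queue_for_service(groups, [], {"servicegroups": ["wrk1servicegroups_grp"]})
# Case B (assumed): the NEB config does not list it.
unrouted = queue_for_service(groups, [], {})

for label, queue in (("A", routed), ("B", unrouted)):
    print(label, queue,
          "worker1" if queue in worker_queues(worker1) else "-",
          "worker2" if queue in worker_queues(worker2) else "-")
```

Under these assumptions, a worker with hosts=yes and services=yes only picks up the check if the NEB module does not route that servicegroup to its own queue, so comparing the servicegroups= line in mod_gearman_neb.conf against the worker configs is worth doing alongside the log check.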
mon-team
Posts: 171
Joined: Thu Jun 28, 2012 9:22 am

Re: Two simultaneous statuses for an active check

Post by mon-team »

I checked all the workers' logs: the only worker running the duplicate service checks is worker1. There is no trace of those service checks on the other workers.
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Two simultaneous statuses for an active check

Post by tgriep »

Is the time in sync between the Nagios server and the worker?
You may want to change this setting on the worker to see if that fixes it.
eventhandler=yes

Can you run this command and post back the results?

Code:

grep gear /usr/local/nagios/etc/nagios.cfg 