Service Check interval

Post by **n8860104460** » Tue Jul 25, 2017 7:18 am

Hi Team,

We need to set up DB tablespace monitoring where the check interval for one specific service is 55 minutes (Nagios should send the next check to the server 55 minutes after the previous one). The reason for this delay is that when we fetch the data from the Nagios server (using the command below), it takes more than 45 minutes to return the DB tablespace data.

We have changed the check interval to 55 minutes, but the GUI still shows no output, while the same command works fine when run from the command line.

Command we use to fetch the data.

$USER1$/check_oracle_health -t 15 --connect $HOSTADDRESS$:$_HOSTDBPORT$/$_HOSTDBNAME$ --username $_HOSTDBUSER$ --password '$_HOSTDBPASS$' --warning $_HOSTTSWARN$ --critical $_HOSTTSCRIT$ --mode tablespace-usage

Error :

(Service Check Timed Out On Worker: usa*********)

Status Details
Service State: Critical
Duration: 21h 54m 4s
Service Stability: Unchanging (stable)
Last Check: 2017-07-25 08:05:16
Next Check: 2017-07-25 09:00:16

Config. File

###############################################################################
# Service configuration file
#
# Created by: Nagios Core Config Manager 2.3.3
# Date: 2017-07-25 07:50:17
# Version: Nagios 3.x config file
#
# --- DO NOT EDIT THIS FILE BY HAND ---
# Nagios CCM will overwrite all manual settings during the next update if you
# would like to edit files manually, place them in the 'static' directory or
# import your configs into the CCM by placing them in the 'import' directory.
#
###############################################################################

define service {
    host_name               mc0300*****_testing
    service_description     Oracle DB tablespace usage
    use                     xerox_service_prod
    check_command           xerox_common_db_oracle_tblspc_60_min!!!!!!!!
    check_interval          55
    register                1
}

###############################################################################
#
# Service configuration file
#
# END OF FILE
#
###############################################################################

Please advise

Post by **mcapra** » Tue Jul 25, 2017 8:27 am

For a check that is taking nearly an hour to properly return its results, I would highly recommend you schedule it as a cron job and submit the results to Nagios XI as a passive check. More info on passive checks if you go that route:
https://assets.nagios.com/downloads/nag ... ios-XI.pdf
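As a rough illustration of the passive-check approach, the sketch below runs a slow plugin and writes its result to the Nagios external command file using the `PROCESS_SERVICE_CHECK_RESULT` external command. The command file path, the function names, and the host and service names in the usage note are assumptions for illustration only; check `command_file` in your own nagios.cfg before using anything like this.

```shell
#!/bin/sh
# Sketch only: run a slow plugin (e.g. from cron) and submit the result passively.
# CMD_FILE is an assumed default path; verify command_file in your nagios.cfg.
CMD_FILE="${CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}"

# Build one PROCESS_SERVICE_CHECK_RESULT line.
# Args: timestamp host service_description return_code plugin_output
format_passive_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Run a plugin command, then append its exit code and first output line
# to the command file as a passive result.
# Args: host service_description command [args...]
submit_check() {
    host=$1
    service=$2
    shift 2
    output=$("$@" 2>&1)
    rc=$?
    first_line=$(printf '%s\n' "$output" | head -n 1)
    format_passive_result "$(date +%s)" "$host" "$service" "$rc" "$first_line" >> "$CMD_FILE"
}
```

A cron-driven script could end with something like `submit_check myhost "Oracle DB tablespace usage" /path/to/check_oracle_health ...`, where the host name and plugin path are placeholders. The service in Nagios XI would need to accept passive checks for the result to be processed.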

You might also make sure you are using the latest version of check_oracle_health as substantial performance improvements have been made in later versions:
https://labs.consol.de/nagios/check_ora ... index.html

Post by **dwhitfield** » Tue Jul 25, 2017 2:01 pm

I agree with @mcapra, but if you want to do it the way you are doing it, it seems to me the interval is not as big of an issue as the timeout. Can you attach the plugin you are using? I found several different links to check_oracle_health so I want to be sure we are using the correct one.
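To compare the plugin's real run time against whatever timeout applies, it can help to time it by hand. A minimal sketch, assuming a POSIX shell; the function name `time_check` is made up for this example, and the plugin's own output is discarded so only the exit code and duration are shown.

```shell
# Run a command, discard its output, and print its exit code
# and elapsed wall-clock time in whole seconds.
time_check() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    rc=$?
    end=$(date +%s)
    echo "exit=$rc elapsed=$((end - start))s"
}
```

For example, `time_check /path/to/check_oracle_health ...` (path and arguments are placeholders) would report how long a single run takes, which can then be compared with the timeout shown in the error.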

It may also be useful to see a profile to help determine why things are taking so long. Can you PM me your Profile? You can download it by going to Admin > System Config > System Profile and clicking the ***Download Profile*** button towards the top. If for whatever reason you *cannot* download the profile, please put the output of View System Info (5.3.4+, Show Profile if older) in the thread (that will at least get us some info). This will give us access to many of the logs we would otherwise ask for individually. If security is a concern, you can unzip the profile, take out what you like, and then zip it up again. We may end up needing something you remove, but we can ask for that specifically.

After you PM the profile, please update this thread. Updating this thread is the only way for it to show back up on our dashboard.

Post by **n8860104460** » Sat Jul 29, 2017 1:38 pm

profile.zip

Hello,

Please find attached the plugin and the system profile. See the details below for the DB instance; I hope this gives you an idea of why the Nagios server takes so long to fetch the data.

Number of tablespaces on the instance: 430
Load on the DB:

load averages: 28.1, 29.1, 29.5; up 156+12:35:23 05:43:37
1343 processes: 1315 sleeping, 3 zombie, 2 stopped, 23 on cpu
CPU states: 62.8% idle, 24.6% user, 12.6% kernel, 0.0% iowait, 0.0% swap
Memory: 192G phys mem, 14G free mem, 38G total swap, 38G free swap

Post by **tacolover101** » Mon Jul 31, 2017 9:37 am

Please find attached plugin and system profile details, and see the more details below for DB instance hope this will give you an idea why Nagios server is taking that much time to fetch the details.

If it's taking 55 minutes to return, this sounds like an issue with your DB, not Nagios. If you're making a large query, remember that a single SQL query often runs single-threaded (unless the database executes it in parallel), so no matter how large your system is, throwing more resources at it will not necessarily solve it.

Post by **lmiltchev** » Mon Jul 31, 2017 12:47 pm

Correct me if I am wrong, but it seems like you are using modgearman.

Error :

(Service Check Timed Out On Worker: usa*********)

Is everything on the modgearman worker set up as on the server (plugin, command, service, environment, etc.)? Does the check take such a long time if you disable modgearman?

To help us troubleshoot the issue, you may want to temporarily increase the log verbosity on the worker by setting:

Code: Select all

debug=3

in the "/etc/mod_gearman2/worker.conf" file, then restarting the worker process:

Code: Select all

service mod-gearman2-worker restart

Next, post the "/var/log/mod_gearman2/mod_gearman_worker.log" log after your check is run.

Post by **n8860104460** » Wed Aug 02, 2017 11:39 am

Upon checking the gearmand logs, we found that gearmand re-checks the service after a maximum of 300 seconds, and this is what causes the service check timeout error.

The setting applied under Service > Check Settings > check_interval = 60 minutes is not taking effect because of the gearmand re-check.

gearmand Log
[2017-08-02 12:08:39][109164][INFO ] timeout (300s) hit for servicecheck: mc03XXXX_ISRVE - mc0300XXXXX_ISRVE

Can we increase the gearmand check interval?
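To see which timeout value is being hit and how often, the timeout lines can be pulled out of the log. A small sketch, assuming log lines in the format shown above; the function name `extract_timeouts` is illustrative and not part of Mod-Gearman.

```shell
# Read log lines on stdin and print the timeout value (in seconds)
# from each line containing "timeout (Ns) hit".
extract_timeouts() {
    sed -n 's/.*timeout (\([0-9][0-9]*\)s) hit.*/\1/p'
}
```

For instance, `extract_timeouts < /var/log/mod_gearman2/mod_gearman_worker.log | sort | uniq -c` would count how many checks hit each timeout value, assuming that is the log file in which these lines appear.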

Post by **tgriep** » Thu Aug 03, 2017 2:06 pm

I think the timeout is hard coded in Mod Gearman and cannot be changed in the configuration files.
What you could do is create a service group in Nagios, add that service to the group, and then add the service group to the Gearman server's localservicegroups option, so that the check is run not by Gearman but by Nagios itself.
Take a look at this link for more details.
https://labs.consol.de/nagios/mod-gearm ... er_options

localservicegroups

sets a list of servicegroups which will not be executed by gearman. They are just passed through.
localservicegroups=name1,name2,name3

Post by **n8860104460** » Mon Aug 21, 2017 7:00 am

Hi Team,

The issue has finally been resolved after changing the timeout value (to 4000) in the gearmand file, and we are now able to see the tablespace details in Nagios XI.

Thank you for all your suggestions and support.

Post by **bolson** » Mon Aug 21, 2017 10:50 am

Closing topic as resolved.

Thank you for using the Nagios Support Forum.
