Host or Service Check Interval (SOLVED)

cornelp · Post by **cornelp** » Tue Aug 16, 2016 10:43 am

I'm trying to figure out how to check a service and/or a host every 20 seconds, retry every 10 seconds, and send a notification only after 3 retries.
This is what I have so far.
Template used:

define host{
    name                    host-services       ; The name of this host template
    check_period            extendhours         ; Time period during which the host is checked
    check_interval          0.30                ; 0.30 x interval_length (18 s at the default of 60)
    retry_interval          0.20                ; 0.20 x interval_length (12 s at the default of 60)
    max_check_attempts      3                   ; Up to 3 check attempts before a HARD state
    check_command           check-host-alive    ; Default command to check whether the host is "alive"
    notification_interval   0                   ; 0 = do not resend notifications
    notification_options    d,r,u               ; Only send notifications for specific host states
    contact_groups          admins              ; Notifications get sent to the admins by default
    notifications_enabled   1                   ; Host notifications are enabled
    event_handler_enabled   1                   ; Host event handler is enabled
    flap_detection_enabled  1                   ; Flap detection is enabled
    notification_period     extendhours
    register                0                   ; DONT REGISTER THIS - ITS JUST A TEMPLATE
}

Host config:
define host{
    use                     host-services       ; Inherit default values from a template
    host_name               laptop              ; The name we're giving to this host
    alias                   Laptop              ; A longer name associated with the host
    address                 10.2.10.166         ; IP address of the host
    active_checks_enabled   1
}

And this is what I get in the logs. From the moment I unplug the laptop, it takes about 40-50 seconds for the first SOFT 1 to show up. The second retry then comes 36 seconds later, and the third 36 seconds after that. Why does it take 40-50 seconds to show up as SOFT down, and then 36 seconds for every retry?

[08-16-2016 11:30:22] HOST ALERT: laptop;DOWN;HARD;3;PING CRITICAL - Packet loss = 100%
[08-16-2016 11:29:46] HOST ALERT: laptop;DOWN;SOFT;2;PING CRITICAL - Packet loss = 100%
[08-16-2016 11:29:10] HOST ALERT: laptop;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%

Thank you very much for your support.

rkennedy · Post by **rkennedy** » Tue Aug 16, 2016 11:34 am

This isn't advised at all, as checking this frequently can lead to a multitude of different issues. It will also put quite a lot of load on the system.

Instead of using 0.2 and 0.3, take a look at this page - https://assets.nagios.com/downloads/nag ... gmain.html

Specifically, the part about interval_length in your nagios configuration -

Code: Select all

Format: 	interval_length=<seconds>
Example: 	interval_length=60

This is the number of seconds per "unit interval" used for timing in the scheduling queue, re-notifications, etc. "Units intervals" are used in the object configuration file to determine how often to run a service check, how often to re-notify a contact, etc.

Important: The default value for this is set to 60, which means that a "unit value" of 1 in the object configuration file will mean 60 seconds (1 minute). I have not really tested other values for this variable, so proceed at your own risk if you decide to do so!

Adjust this to, say, 10. Then change your check_interval to 3 and your retry_interval to 2. I have a feeling the decimals in your check_interval are throwing the math off.
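
As a rough sketch of the arithmetic behind this suggestion: the nominal time in seconds is the interval value multiplied by interval_length, so with interval_length=10 a check_interval of 3 works out to 30 seconds and a retry_interval of 2 to 20 seconds. The fragment below assumes the stock layout (interval_length in nagios.cfg, the intervals in the host template) and shows only the directives relevant here. interval_length is a global setting, so other intervals expressed in time units, such as notification_interval, are rescaled as well.

```cfg
# nagios.cfg (main configuration file)
# One "time unit" is now 10 seconds instead of the default 60.
interval_length=10

# Host template (object configuration file), abridged
define host{
    name                host-services
    check_interval      3       ; 3 x 10 s = 30 s between normal checks (nominal)
    retry_interval      2       ; 2 x 10 s = 20 s between retries (nominal)
    max_check_attempts  3       ; HARD state on the 3rd failed attempt
    register            0       ; template only
}
```

By the same arithmetic, the 20-second check and 10-second retry asked about in the first post would be check_interval 2 and retry_interval 1. These are nominal figures only; the scheduler may still run checks later than this.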

cornelp · Post by **cornelp** » Tue Aug 16, 2016 1:47 pm

I did as you stated. I changed the Nagios cfg file interval_length to 10 and the host file to 20 and 10.

[08-16-2016 14:42:00] HOST NOTIFICATION: nagiosadmin;laptop;DOWN;notify-host-by-email;(Host check timed out after 30.00 seconds)
[08-16-2016 14:42:00] HOST ALERT: laptop;DOWN;HARD;3;(Host check timed out after 30.00 seconds)
[08-16-2016 14:41:10] HOST ALERT: laptop;DOWN;SOFT;2;PING CRITICAL - Packet loss = 100%
[08-16-2016 14:40:20] HOST ALERT: laptop;DOWN;SOFT;1;(Host check timed out after 30.00 seconds)

Looks like it went up to 50 secs now between checks.

rkennedy · Post by **rkennedy** » Tue Aug 16, 2016 4:51 pm

As I wrote above: "Adjust this to, say, 10. Then change your check_interval to 3 and your retry_interval to 2. I have a feeling the decimals in your check_interval are throwing the math off."

You'll want to change the check_interval to 2 or 3, not 10/20, which should put it on ~30 second timing.

cornelp · Post by **cornelp** » Wed Aug 17, 2016 8:32 am

Apologies, I mistyped. I did set the interval to 3 and the retry to 2; I just mistyped it here, sorry.
Anyway, I did a test, and this is what I got:

Code: Select all

interval_length is set in the Nagios config file; check_interval and
retry_interval are set in the Nagios host template.

interval_length   check_interval   retry_interval   This setup checks the host every
60                0.2              0.3              31 secs
60                0.1              0.2              42 secs
60                0.5              0.2              42 secs
60                0.1              0.5              60 secs
30                0.1              0.5              45 secs
10                3                2                49 secs
5                 1                1                32 secs
1                 1                1                31 secs

So I'm a bit confused as to what to set to get what I need. Can anyone help me with this, please?
Thanks...

tmcdonald · Post by **tmcdonald** » Wed Aug 17, 2016 2:31 pm

Nagios checks are not run on an exact schedule. The scheduling engine employs some tricks to keep checks from bunching up and causing CPU spikes. Otherwise, if all of your checks were set to run every minute, you would have 59 seconds of nothing and then everything would run at once. Take a look at the main config documentation:

https://assets.nagios.com/downloads/nag ... gmain.html

Specifically:

Code: Select all

service_interleave_factor
service_inter_check_delay_method
max_service_check_spread
host_inter_check_delay_method
max_host_check_spread

Likely you will need to tweak those settings if you need exact, on-the-dot timing.
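
For orientation, here is a hedged sketch of how those directives might look in nagios.cfg. The values are illustrative assumptions, not recommendations, and should be checked against the main configuration documentation linked above. Per that documentation, these options control how checks are initially spread out in the event queue after Nagios starts, so they may not produce exact timing on their own.

```cfg
# nagios.cfg (main configuration file) - illustrative values only

# How initial host checks are spread out in the event queue:
# n = no delay, d = "dumb" 1-second delay, s = "smart" calculation,
# x.xx = fixed delay of x.xx seconds
host_inter_check_delay_method=s

# Maximum number of minutes after startup within which all hosts
# get their initial check
max_host_check_spread=1

# The service-check equivalents
service_inter_check_delay_method=s
max_service_check_spread=1

# How service checks are interleaved across hosts:
# s = "smart" calculation, or a number of 1 or more (1 = no interleaving)
service_interleave_factor=s
```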

cornelp · Post by **cornelp** » Fri Aug 26, 2016 2:03 pm

So, I was able to make all the necessary changes. When the host goes down, no matter what numbers I use, it will not send an alert until 1.5 minutes later (three checks at 30-second intervals).

[08-26-2016 14:11:24] HOST NOTIFICATION: nagiosadmin;P-DFB-FW02-TEST;DOWN;notify-host-by-email;PING CRITICAL - Packet loss = 100%
[08-26-2016 14:11:24] HOST ALERT: P-DFB-FW02-TEST;DOWN;HARD;3;PING CRITICAL - Packet loss = 100%
[08-26-2016 14:10:54] HOST ALERT: P-DFB-FW02-TEST;DOWN;SOFT;2;PING CRITICAL - Packet loss = 100%
[08-26-2016 14:10:24] HOST ALERT: P-DFB-FW02-TEST;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%

Now, I got the service (ping) to run at an interval of 10 seconds between pings, with a maximum of 3 pings. After the 3 pings, I want a notification to be sent. Since I cannot do it for the host, I might as well do it for the service.
So, the service works and checks every 10 seconds. After 3 pings, it's supposed to send me an alert (via email/text). It does not. Where am I going wrong, such that the notification is not being sent?

[08-26-2016 14:52:29] wproc: Core Worker 12047: job 24 (pid=12297) timed out. Killing it
[08-26-2016 14:52:28] SERVICE ALERT: P-DFB-FW02-TEST;PING;CRITICAL;HARD;3;PING CRITICAL - Packet loss = 100%
[08-26-2016 14:52:18] SERVICE ALERT: P-DFB-FW02-TEST;PING;CRITICAL;SOFT;2;PING CRITICAL - Packet loss = 100%
[08-26-2016 14:52:08] SERVICE ALERT: P-DFB-FW02-TEST;PING;CRITICAL;SOFT;1;PING CRITICAL - Packet loss = 100%

So, as you can see above, it checks for ping every 10 seconds and times out. After 3 attempts, it's supposed to send an alert, but it does not.

Host Setup and Template:
define host{
    use                     host-services       ; Inherit default values from a template
    host_name               P-DFB-FW01-TEST     ; The name we're giving to this host
    alias                   HQ PfSense Firewall ; A longer name associated with the host
    address                 10.2.21.120         ; IP address of the host
    active_checks_enabled   1
}

define host{
    name                            host-services       ; The name of this host template
    check_period                    extendhours         ; Time period during which the host is checked
    check_interval                  0.20                ; 0.20 x interval_length (12 s at the default of 60)
    retry_interval                  0.10                ; 0.10 x interval_length (6 s at the default of 60)
    max_check_attempts              3                   ; Up to 3 check attempts before a HARD state
    check_command                   check-host-alive    ; Default command to check whether the host is "alive"
    notification_interval           0                   ; 0 = do not resend notifications
    notification_options            d,r,u               ; Only send notifications for specific host states
    contact_groups                  admins              ; Notifications get sent to the admins by default
    notifications_enabled           0                   ; Host notifications are disabled
    event_handler_enabled           1                   ; Host event handler is enabled
    flap_detection_enabled          1                   ; Flap detection is enabled
    process_perf_data               1                   ; Process performance data
    retain_status_information       1                   ; Retain status information across program restarts
    retain_nonstatus_information    1                   ; Retain non-status information across program restarts
    notification_period             extendhours
    icon_image                      /usr/local/nagios/share/images/RRT-Color-RGB.PNG
    register                        0                   ; DONT REGISTER THIS - ITS JUST A TEMPLATE
}

Service Setup and Template:
define service{
    use                     web-services        ; Inherit values from a template
    host_name               P-DFB-FW01-TEST     ; The name of the host the service is associated with
    service_description     PING                ; The service description
    check_command           check_ping!200.0,20%!600.0,60% ; The command used to monitor the service
    notifications_enabled   1
}

define service{
    name                            web-services    ; The name of this service template
    max_check_attempts              3               ; Up to 3 check attempts to determine the final (hard) state
    normal_check_interval           0.20            ; 0.20 x interval_length (12 s at the default of 60)
    retry_check_interval            0.10            ; 0.10 x interval_length (6 s at the default of 60)
    active_checks_enabled           1               ; Active service checks are enabled
    passive_checks_enabled          0               ; Passive service checks are disabled
    parallelize_check               1               ; Active service checks should be parallelized (disabling this can lead to major performance problems)
    obsess_over_service             1               ; We should obsess over this service (if necessary)
    check_freshness                 0               ; Default is to NOT check service 'freshness'
    notifications_enabled           1               ; Service notifications are enabled
    event_handler_enabled           1               ; Service event handler is enabled
    flap_detection_enabled          1               ; Flap detection is enabled
    process_perf_data               1               ; Process performance data
    retain_status_information       1               ; Retain status information across program restarts
    retain_nonstatus_information    1               ; Retain non-status information across program restarts
    is_volatile                     0               ; The service is not volatile
    check_period                    24x7            ; The service can be checked at any time of the day
    contact_groups                  admins          ; Notifications get sent out to everyone in the 'admins' group
    notification_options            w,u,c,r         ; Send notifications about warning, unknown, critical, and recovery events
    notification_interval           0               ; 0 = do not re-notify about service problems
    notification_period             24x7            ; Notifications can be sent out at any time
    register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
    icon_image                      /usr/local/nagios/share/images/RRT-Color-RGB.PNG
}

THANK YOU again for your support.

Post by **Box293** » Sun Aug 28, 2016 8:34 pm

At this point I think enabling debug mode and looking at the debug log will be required to understand what is going on.

Try turning the debug level on and then restarting Nagios.

Code: Select all

sed -i 's/.*debug_level=.*/debug_level=-1/g' /usr/local/nagios/etc/nagios.cfg
service nagios restart

Make the problem occur.
Upload the file /usr/local/nagios/var/nagios.debug

When you are finished, this turns debugging off:

Code: Select all

sed -i 's/.*debug_level=.*/debug_level=0/g' /usr/local/nagios/etc/nagios.cfg
service nagios restart

Also could you please upload the file /usr/local/nagios/var/objects.cache
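
For reference, the sed commands above only rewrite the debug_level line in nagios.cfg. A sketch of the related directives follows. The file path matches the one mentioned above, while the verbosity and size values are assumptions to adjust as needed.

```cfg
# nagios.cfg (main configuration file)

# -1 logs everything, 0 logs nothing. Individual categories can be
# added together instead (for example, 16 for host/service checks
# plus 32 for notifications gives 48).
debug_level=-1

# 0 = basic, 1 = more detailed, 2 = highly detailed (assumed value here)
debug_verbosity=1

# Where the debug log is written
debug_file=/usr/local/nagios/var/nagios.debug

# Rotate the debug file once it reaches this many bytes (assumed value)
max_debug_file_size=1000000
```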

cornelp · Post by **cornelp** » Mon Aug 29, 2016 8:08 am

Here you go. Uploaded the 2 files you requested. I had to rename them to txt as it would not allow the original ext.
Thank you VERY MUCH for your assistance.

cornelp · Post by **cornelp** » Mon Aug 29, 2016 8:42 am

So I see this in the debug file:
No contacts were found for notification purposes. No notification was sent out.
I have the notification contacts set up properly, as I get the notification for the host down. I don't get why it says that.
Here are my contacts:

define contact{
contact_name nagiosadmin ; Short name of user
use generic-contact ; Inherit default values from generic-contact template (defined above)
alias Nagios Admin ; Full name of user

email cpaunescu@peoplestrustinsurance.com ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
service_notification_period 24x7
host_notification_period 24x7
service_notifications_enabled 1
service_notification_options w,u,c,r,f,s,n
host_notification_options d,u,r,f,s
host_notification_commands notify-host-by-email
service_notification_commands notify-service-by-email
pager XXX
can_submit_commands 1
}

define contact{
contact_name CPaunescu ; Short name of user
use generic-contact ; Inherit default values from generic-contact template (defined above)
alias Chris Paunescu ; Full name of user

email XXX ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
service_notifications_enabled 1
service_notification_period 24x7
host_notification_period 24x7
service_notification_options w,u,c,r,f,s,n
host_notification_options d,u,r,f,s
host_notification_commands notify-host-by-email
service_notification_commands notify-service-by-email
pager XXX
can_submit_commands 1
}

define contact{
contact_name CPaunescu-Cell ; Short name of user
use generic-contact ; Inherit default values from generic-contact template (defined above)
alias Chris Paunescu ; Full name of user

email XXX ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
service_notifications_enabled 1
service_notification_period 24x7
host_notification_period 24x7
service_notification_options w,u,c,r,f,s,n
host_notification_options d,u,r,f,s
host_notification_commands notify-host-by-email
service_notification_commands notify-service-by-email
pager XXX
can_submit_commands 1
}

###############################################################################
###############################################################################
#
# CONTACT GROUPS
#
###############################################################################
###############################################################################

# We only have one contact in this simple configuration file, so there is
# no need to create more than one contact group.

define contactgroup{
contactgroup_name admins
alias Nagios Administrators
members nagiosadmin, CPaunescu, CPaunescu-Cell
}

define contactgroup{
contactgroup_name admins-email
alias Nagios Administrators
members nagiosadmin
}
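
One detail in the contact definitions above may be worth checking, offered as a possibility rather than a confirmed diagnosis. In the Nagios Core object documentation, n in service_notification_options means "none", and a contact with n set does not receive service notifications. Here the service option lists end in n while the host option lists do not, which would be consistent with host notifications arriving while service notifications find no contacts. A sketch of one contact with n dropped follows (abridged; only the changed directive is annotated, and the rest is copied from above):

```cfg
define contact{
    contact_name                    nagiosadmin
    use                             generic-contact
    alias                           Nagios Admin
    service_notifications_enabled   1
    service_notification_period     24x7
    host_notification_period        24x7
    service_notification_options    w,u,c,r,f,s     ; "n" (none) removed
    host_notification_options       d,u,r,f,s
    host_notification_commands      notify-host-by-email
    service_notification_commands   notify-service-by-email
}
```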

Nagios Support Forum
