XI alerts sending problem

Locked
skunk64
Posts: 9
Joined: Mon Oct 26, 2015 8:43 am

XI alerts sending problem

Post by skunk64 »

Hi,

We are experiencing some strange problems with e-mail alerts. For some services, a CRITICAL alert is sent after check 1 of 5, not after 5 of 5 as it should be. The service and the service template are configured properly.

For some of those services, the OK (recovery) alert isn't sent at all, even though it is enabled in the notification options.

This is not happening on all services, only on a few out of about 5000.

Does anyone know why this is happening?

Nagios XI 5.5.3
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: XI alerts sending problem

Post by lmiltchev »

Can you show us the actual config of a 'problem' service, along with the configs of all relevant templates that this service is using?

Also, show a screenshot of the State History and Notifications reports for this service, covering the same time period.

Note: In the State History report, select State Type = Both.
skunk64
Posts: 9
Joined: Mon Oct 26, 2015 8:43 am

Re: XI alerts sending problem

Post by skunk64 »

So, this is the problem:

Service state history:
statehistory.jpg
Notification history:
notificationshistory.jpg
Sometimes Nagios reports a HARD state even though the check is only at attempt 1/5. In this example it is a WARNING state. In addition, some alerts aren't sent at all. We noticed this started happening after the Nagios XI upgrade (5.5.2 --> 5.5.3).

The Ping service uses the xiwizard_xxxx_emerson_ping_service service template, which in turn uses the xiwizard_generic_service service template.

Ping service config:

Code: Select all

define service {
    host_name                napajanje-st-pujanke
    service_description      Ping
    use                      xiwizard_xxxx_emerson_ping_service
    check_command            check_icmp!200,20%!500,60%
    max_check_attempts       5
    check_interval           5
    retry_interval           1
    check_period             xi_timeperiod_24x7
    notification_interval    720
    notification_period      xi_timeperiod_24x7
    contacts                 xxxx
    contact_groups           xxxx,xxxx,xxxx,xxxx
    _contacts                xxxx
    _contact_groups          xxxx,xxxx,xxxx,xxxx
    _ping_critical           500ms
    _ping_critical_perct     60%
    _ping_warning            200ms
    _ping_warning_perct      20%
    _xiwizard                xxxx
    register                 1
}
xiwizard_xxxx_emerson_ping_service service template:

Code: Select all

define service {
    name                            xiwizard_xxxx_emerson_ping_service
    service_description             Checkping
    servicegroups                   check_ping
    use                             xiwizard_generic_service
    check_command                   check_icmp!200.0,20%!500.0,60%
    max_check_attempts              5
    check_interval                  3
    retry_interval                  1
    active_checks_enabled           1
    passive_checks_enabled          0
    check_period                    xi_timeperiod_24x7
    flap_detection_enabled          0
    notification_interval           720
    notification_period             xi_timeperiod_24x7
    notification_options            w,c,u,r,f,
    notifications_enabled           1
    register                        0
}
xiwizard_generic_service service template:

Code: Select all

define service {
    name                            xiwizard_generic_service
    check_command                   check_xi_service_none
    is_volatile                     0
    max_check_attempts              5
    check_interval                  5
    retry_interval                  1
    active_checks_enabled           1
    passive_checks_enabled          1
    check_period                    xi_timeperiod_24x7
    parallelize_check               1
    obsess_over_service             1
    check_freshness                 0
    event_handler_enabled           1
    flap_detection_enabled          1
    process_perf_data               1
    retain_status_information       1
    retain_nonstatus_information    1
    notification_interval           60
    notification_period             xi_timeperiod_24x7
    notifications_enabled           1
    register                        0
}
check_xi_service_none command:

Code: Select all

$USER1$/check_dummy 0 "Nothing to monitor"
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: XI alerts sending problem

Post by ssax »

I believe that it is working as expected and that you are seeing proper functionality. To be sure, though, we would need to see what state the host was in at the time the services went hard at 1 of 5.
"When a service check results in a non-OK state, Nagios will check the host that the service is associated with to determine whether or not it is UP. If the host is not UP (i.e. it is either down or unreachable), Nagios will immediately put the service into a hard non-OK state and it will reset the current attempt number to 1. Since the service is in a hard non-OK state, the service check will be rescheduled at the normal frequency specified by the check_interval option instead of the retry_interval option."
Taken from here:

https://assets.nagios.com/downloads/nag ... uling.html

After an extensive discussion with the developers and the other techs here, it seems to be working as intended (it was broken in the past, and it currently works as it should).

If the host is in a down state (hard or soft) when a service check returns a non-OK result, Nagios checks the host state, and because the host is down (whether hard or soft) the service goes into a hard problem state and its current attempt is reset to 1.

One way to work around this is to set host_down_disable_service_checks=1 in your /usr/local/nagios/etc/nagios.cfg and restart the nagios service:

Code: Select all

service nagios restart
With that setting, the service checks will not even run while the host is in a problem state (hard or soft), which prevents these alerts/notifications.
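
To make the behavior described above easier to follow, here is a rough Python model of it. This is only an illustrative sketch, not Nagios source code: the Service class and process_result function are made-up names, recovery handling is simplified, and the only things taken from this thread are the "host not UP means hard state and attempt reset to 1" rule and the effect of host_down_disable_service_checks.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical, simplified model of a service's check state."""
    max_check_attempts: int = 5
    attempt: int = 1
    state: str = "OK"
    state_type: str = "HARD"


def process_result(svc, result, host_up, host_down_disable_service_checks=False):
    """Apply one check result. Returns False if the check was skipped."""
    # With the option enabled, the check does not run while the host is down.
    if not host_up and host_down_disable_service_checks:
        return False

    if result == "OK":
        # Simplification: treat every recovery as a hard OK.
        svc.state, svc.state_type, svc.attempt = "OK", "HARD", 1
        return True

    if not host_up:
        # Host is down or unreachable: hard non-OK at once, attempt reset to 1.
        svc.state, svc.state_type, svc.attempt = result, "HARD", 1
        return True

    # Host is UP: count soft retries up to max_check_attempts.
    if svc.state == "OK":
        svc.attempt = 1
        svc.state_type = "SOFT"
    elif svc.state_type == "SOFT":
        svc.attempt += 1
    svc.state = result
    if svc.attempt >= svc.max_check_attempts:
        svc.state_type = "HARD"
    return True


if __name__ == "__main__":
    # Host UP: the service stays SOFT until attempt 5 of 5.
    svc = Service()
    for _ in range(5):
        process_result(svc, "WARNING", host_up=True)
        print(svc.state, svc.state_type, f"{svc.attempt}/{svc.max_check_attempts}")

    # Host DOWN: the very first non-OK result is already HARD at 1/5.
    svc = Service()
    process_result(svc, "WARNING", host_up=False)
    print(svc.state, svc.state_type, f"{svc.attempt}/{svc.max_check_attempts}")
```

In this model, the "HARD at 1/5" entries in the state history correspond to the host-down branch, which is why the host state for the same time period matters.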

Please include the host in the state history output if you'd like us to confirm whether that is indeed what is occurring.


Thank you