XI: users reporting NRDP is failing to respond with OK

Posted: Fri Mar 20, 2020 7:02 pm
by inversecow
Ahoy SUPP folks,

FYI, my users report that a number of their monitored nodes are failing to get return messages from our XI NRDP server.
Furthermore, this is causing NCPA on their managed servers to "hang / die".

Restarting NCPA gets it working again for a while, but it apparently dies again some time later.
So far we have had reports of this from both our Solaris and Linux customers.

This also leads to waves of notifications going out to our users, with reports like this:

"CRITICAL: Freshness threshold reached!!!" (this text comes from their service definition; it is the 2nd argument to the `check_dummy` command)

I would be happy to collect and pass along information from the monitored environments for review (e.g. a profile dump).

Thank you for your time!

Here is some supporting information on our environment (using my SOL customers as example, but this impacts other OS teams also):

XI: 5.6.9 (on a RHEL 7.x VM)
CPU: 6 vCPUs / MEM: 32 GB / DSK: 500 GB

DB: MariaDB 5.x (off-box)

monitor types:
active: 4,292, UP/DOWN (ping) checks
passive: 29,610, whole range (CPU / DISK / OS SERVICE / OS LOGS), via NCPA / NRDP rules

notes:
no delegated XIs joined (planning some to offload ACTIVE checks)
no RAM disk
no mod_gearman instances (possible option, only just "went live" with XI)

# SERVICE TEMPLATE

```
define service {
    name                            Sol_NCPA_Listener
    service_description             Solaris NCPA Listener
    display_name                    Solaris NCPA Listener
    active_checks_enabled           0
    passive_checks_enabled          1
    check_freshness                 1
    freshness_threshold             1200
    notification_period             24x7
    notification_options            w,c,r,
    notifications_enabled           1
    contacts                        certain_user,solaris_info
    register                        0
}
```

# SERVICE DEFINITION

```
define service {
    host_name                 anchornode.fqdn
    service_description       Sol_NCPA_Listener
    use                       Sol_NCPA_Listener
    hostgroup_name            solaris-10-servers,solaris-11-servers,solaris-servers
    check_command             check_dummy!2!"Freshness threshold reached\!\!\!"!!!!!!
    initial_state             o
    max_check_attempts        1
    active_checks_enabled     0
    passive_checks_enabled    1
    register                  1
}
```
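For readers following along, the freshness behaviour these two definitions set up can be sketched roughly as below. This is a simplified Python model, not Nagios Core's actual implementation (which I believe also factors in things such as check latency); `freshness_result` and `CheckResult` are names made up for illustration. The 1200-second threshold and the `check_dummy` state and text are taken from the definitions above.

```python
from dataclasses import dataclass
from typing import Optional

FRESHNESS_THRESHOLD = 1200  # seconds, from the Sol_NCPA_Listener template


@dataclass
class CheckResult:
    state: int   # 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
    output: str


def freshness_result(last_passive_ts: float, now: float,
                     threshold: int = FRESHNESS_THRESHOLD) -> Optional[CheckResult]:
    """Return the check_dummy result if the last passive result is stale.

    Hypothetical helper: models "no passive result within `threshold`
    seconds, so the check_command (check_dummy 2 "...") is run instead".
    """
    if now - last_passive_ts > threshold:
        return CheckResult(2, "CRITICAL: Freshness threshold reached!!!")
    return None  # still fresh, nothing is forced
```

In this model a service whose last passive result arrived 1201 seconds ago goes CRITICAL, while one that reported 1200 seconds ago or less is left alone.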

# NCPA: NRDP service rule

```
[passive checks]
%HOSTNAME%|Sol_NCPA_Listener|120 = processes?name=ncpa_listener&match=search&critical=1:15
```
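To relate this rule to the freshness threshold in the template, here is a small Python sketch that splits the rule into its parts and works out how many send intervals fit inside the freshness window. It assumes the third `|`-separated field (120) is the send interval in seconds; `parse_passive_rule` is a hypothetical helper, not part of NCPA.

```python
from urllib.parse import parse_qsl

FRESHNESS_THRESHOLD = 1200  # seconds, from the service template


def parse_passive_rule(line: str) -> dict:
    """Split 'host|service|interval = endpoint?query' into its parts."""
    # Split on the first '=' only; the endpoint's query string has its own '='.
    key, _, endpoint = line.partition("=")
    fields = [f.strip() for f in key.split("|")]
    path, _, query = endpoint.strip().partition("?")
    return {
        "host": fields[0],
        "service": fields[1],
        # Assumption: an optional third field is the send interval in seconds.
        "interval": int(fields[2]) if len(fields) > 2 else None,
        "endpoint": path,
        "params": dict(parse_qsl(query)),
    }


rule = parse_passive_rule(
    "%HOSTNAME%|Sol_NCPA_Listener|120 = "
    "processes?name=ncpa_listener&match=search&critical=1:15"
)
# Whole send intervals that fit inside the freshness window: 1200 // 120 = 10
sends_per_window = FRESHNESS_THRESHOLD // rule["interval"]
```

Under that assumption, roughly ten consecutive sends have to go missing before the freshness alert fires, which would be consistent with an agent that has stopped sending altogether rather than one that drops an occasional result.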

# EVENT LOG (`/usr/local/nagiosxi/var/eventman.log`)

```
    [meta] => Array
        (
            [notification-type] => service
            [contact] => user
            [contactemail] => [email protected]
            [type] => PROBLEM
            [escalated] => 0
            [author] =>
            [comments] =>
            [host] => server.fqdn
            [hostaddress] => server.fqdn
            [hostalias] => server.fqdn
            [hostdisplayname] => server.fqdn
            [service] => Sol_NCPA_Listener
            [hoststate] => UP
            [hoststateid] => 0
            [servicestate] => CRITICAL
            [servicestateid] => 2
            [lastservicestate] => CRITICAL
            [lastservicestateid] => 2
            [servicestatetype] => HARD
            [currentattempt] => 1
            [maxattempts] => 1
            [serviceeventid] => 344094
            [serviceproblemid] => 170121
            [serviceoutput] => CRITICAL: Freshness threshold reached!!!
            [longserviceoutput] =>
            [datetime] => Fri Mar 20 15:57:25 PDT 2020
        )
```
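One way to check whether the NRDP endpoint itself answers, independently of NCPA, would be to post a single test result by hand and look at the reply. The Python sketch below is based on my understanding of the form fields used by the stock NRDP client scripts (`token`, `cmd=submitcheck`, `XMLDATA`); treat those field names, the payload layout, and the `build_checkresult_xml` / `submit` helpers as assumptions to verify against your NRDP version. The URL and token are placeholders you would fill in.

```python
import urllib.parse
import urllib.request
from xml.sax.saxutils import escape


def build_checkresult_xml(host: str, service: str, state: int, output: str) -> str:
    """Build a one-result passive check payload (assumed NRDP XML layout)."""
    return (
        "<?xml version='1.0'?>"
        "<checkresults>"
        "<checkresult type='service' checktype='1'>"  # checktype 1 = passive
        f"<hostname>{escape(host)}</hostname>"
        f"<servicename>{escape(service)}</servicename>"
        f"<state>{int(state)}</state>"
        f"<output>{escape(output)}</output>"
        "</checkresult>"
        "</checkresults>"
    )


def submit(url: str, token: str, xml: str, timeout: float = 10.0) -> str:
    """POST the payload and return the raw response body for inspection."""
    data = urllib.parse.urlencode(
        {"token": token, "cmd": "submitcheck", "XMLDATA": xml}
    ).encode()
    # The explicit timeout keeps this test from hanging if the server stalls.
    with urllib.request.urlopen(url, data=data, timeout=timeout) as resp:
        return resp.read().decode("utf-8", "replace")


payload = build_checkresult_xml(
    "server.fqdn", "Sol_NCPA_Listener", 0, "OK: manual NRDP test"
)
# Not executed here; supply your own NRDP URL and token:
# print(submit("http://<xi-server>/nrdp/", "<token>", payload))
```

If a manual post like this also stalls or fails from an affected node, that points at the NRDP server or the network path rather than at NCPA; if it returns promptly, the agent side becomes the more likely suspect.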

Re: XI: users reporting NRDP is failing to respond with OK

Posted: Mon Mar 23, 2020 11:36 am
by jdunitz
Just to clarify some things:

1) this is affecting some, but not all, of your monitored machines?
2) ...and this is new?
3) ...and you're not aware that anything in particular has changed recently, apart from perhaps adding more monitored machines?

How's the load looking on your Nagios server?

Can you post the output of this? (which should just show the summary info at the top of top)

    # TERM=dumb top -n4 -u nobody

Thanks!

--Jeffrey