XI: users reporting NRDP is failing to respond with OK
Posted: Fri Mar 20, 2020 7:02 pm
Ahoy SUPP folks,
FYI, my users report an issue where a number of their monitored nodes are failing to get return messages from our XI NRDP server.
Further, this is causing NCPA on their managed servers to "hang / die".
Restarting NCPA gets it working again for a while, but it apparently dies again after some time.
So far we have had reports of this both from our Solaris & Linux customers.
This also leads to waves of notifications going out to our users, with reports like this:
"CRITICAL: Freshness threshold reached!!!" (the message text comes from their service definition, as the 2nd argument to the `check_dummy` command)
I would be happy to collect and pass along information from the monitored environments for review (e.g., a profile dump).
Thank you for your time!
Here is some supporting information on our environment (using my Solaris customers as the example, but this impacts other OS teams as well):
XI: 5.6.9 (on a RHEL 7.x VM)
CPU: 6 vCPUs / MEM: 32 GB / DSK: 500 GB
DB: MariaDB 5.x (off-box)
monitor types:
active: 4,292, UP/DOWN (ping) checks
passive: 29,610, whole range (CPU / DISK / OS SERVICE / OS LOGS), via NCPA / NRDP rules
notes:
no delegated XIs joined (planning some to offload ACTIVE checks)
no RAM disk
no mod_gearman instances (possible option, only just "went live" with XI)
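For a rough sense of the load these numbers put on NRDP, the sketch below estimates the average submission rate. It assumes every passive check submits exactly once per interval and that arrivals are spread evenly; both are simplifications. The function name and the intervals listed are illustrative, not measured values from this environment.

```python
# Rough estimate of average NRDP results per second.
# Assumption: each passive check submits once per interval, evenly spread.
PASSIVE_CHECKS = 29_610  # passive check count from the environment summary


def submissions_per_second(checks: int, interval_seconds: int) -> float:
    """Average results per second if every check reports once per interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return checks / interval_seconds


if __name__ == "__main__":
    # Hypothetical intervals; substitute the values from your own NCPA rules.
    for interval in (60, 120, 300):
        rate = submissions_per_second(PASSIVE_CHECKS, interval)
        print(f"interval {interval:>3}s -> ~{rate:.1f} results/s")
```

If the agents batch several results into one request, the request rate seen by the NRDP server will be lower than the result rate estimated here.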
# SERVICE TEMPLATE
```
define service {
    name                    Sol_NCPA_Listener
    service_description     Solaris NCPA Listener
    display_name            Solaris NCPA Listener
    active_checks_enabled   0
    passive_checks_enabled  1
    check_freshness         1
    freshness_threshold     1200
    notification_period     24x7
    notification_options    w,c,r,
    notifications_enabled   1
    contacts                certain_user,solaris_info
    register                0
}
```
# SERVICE DEFINITION
```
define service {
    host_name               anchornode.fqdn
    service_description     Sol_NCPA_Listener
    use                     Sol_NCPA_Listener
    hostgroup_name          solaris-10-servers,solaris-11-servers,solaris-servers
    check_command           check_dummy!2!"Freshness threshold reached\!\!\!"!!!!!!
    initial_state           o
    max_check_attempts      1
    active_checks_enabled   0
    passive_checks_enabled  1
    register                1
}
```
# NCPA: NRDP service rule
```
[passive checks]
%HOSTNAME%|Sol_NCPA_Listener|120 = processes?name=ncpa_listener&match=search&critical=1:15
```
# EVENT LOG (`/usr/local/nagiosxi/var/eventman.log`)
```
[meta] => Array
    (
        [notification-type] => service
        [contact] => user
        [contactemail] => [email protected]
        [type] => PROBLEM
        [escalated] => 0
        [author] =>
        [comments] =>
        [host] => server.fqdn
        [hostaddress] => server.fqdn
        [hostalias] => server.fqdn
        [hostdisplayname] => server.fqdn
        [service] => Sol_NCPA_Listener
        [hoststate] => UP
        [hoststateid] => 0
        [servicestate] => CRITICAL
        [servicestateid] => 2
        [lastservicestate] => CRITICAL
        [lastservicestateid] => 2
        [servicestatetype] => HARD
        [currentattempt] => 1
        [maxattempts] => 1
        [serviceeventid] => 344094
        [serviceproblemid] => 170121
        [serviceoutput] => CRITICAL: Freshness threshold reached!!!
        [longserviceoutput] =>
        [datetime] => Fri Mar 20 15:57:25 PDT 2020
    )
```
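To relate the NRDP service rule to the freshness settings in the template, here is a minimal Python sketch that splits the passive check line into its parts and counts how many submission intervals fit inside the freshness threshold. The helper names `parse_passive_rule` and `intervals_before_stale` are hypothetical, and the sketch ignores any extra freshness latency the monitoring core may add, so treat the result as approximate.

```python
from urllib.parse import parse_qs

# Rule line and threshold copied from the configuration above.
RULE = ("%HOSTNAME%|Sol_NCPA_Listener|120 = "
        "processes?name=ncpa_listener&match=search&critical=1:15")
FRESHNESS_THRESHOLD = 1200  # seconds, from the service template


def parse_passive_rule(line: str) -> dict:
    """Split '<host>|<service>|<interval> = <endpoint>?<query>' into parts."""
    target, _, check = line.partition("=")  # first '=' separates target/check
    host, service, interval = (part.strip() for part in target.split("|"))
    endpoint, _, query = check.strip().partition("?")
    return {
        "host": host,
        "service": service,
        "interval": int(interval),
        "endpoint": endpoint,
        "params": {key: values[0] for key, values in parse_qs(query).items()},
    }


def intervals_before_stale(interval: int, threshold: int) -> int:
    """Whole submission intervals that elapse before the threshold expires."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return threshold // interval


if __name__ == "__main__":
    rule = parse_passive_rule(RULE)
    missed = intervals_before_stale(rule["interval"], FRESHNESS_THRESHOLD)
    print(f"{rule['service']}: every {rule['interval']}s, "
          f"~{missed} missed submissions before the freshness alert")
```

If these values are what the affected hosts actually run, a host has to miss about ten consecutive 120-second submissions before the 1200-second threshold expires, so each freshness alert would point to roughly 20 minutes of silence from NCPA rather than a single dropped result.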