One of our Nagios XI instances has started acting up. I'm not sure what happened, but there are several symptoms. The Monitoring Engine Event Queue dashlet shows nothing, and the Monitoring Engine Check Statistics dashlet shows all zeros. The Monitoring Engine Performance dashlet does show a few values, but the maximum service check latency is almost an hour (3,445.18 seconds), with an average of 561.66 seconds.
If I restart Nagios (usually with 'systemctl stop nagios && systemctl stop ndo2db && systemctl restart mariadb && systemctl start ndo2db && systemctl start nagios' just to be thorough), the load average can go into the 800s before it calms down. We normally see a flurry of activity on restarts (load average maybe in the 50s or 60s), but it settles down in a few minutes. Our normal load average is 3 to 5 on these 12 vCPU VMs running Nagios XI, and less than 2 on our smaller ones.
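To put those load figures in proportion, here is a minimal sketch that divides a load average by the vCPU count. The function name `load_per_cpu` is made up for illustration; the numbers are the ones quoted above for a 12 vCPU VM.

```shell
#!/bin/sh
# Ratio of a load average to the CPU count.
# Usage: load_per_cpu LOAD NCPU
load_per_cpu() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN { printf "%.1f\n", load / ncpu }'
}

# Figures from the post, all on a 12 vCPU VM.
load_per_cpu 800 12   # restart spike on the affected instance -> 66.7
load_per_cpu 60 12    # typical restart flurry                 -> 5.0
load_per_cpu 5 12     # upper end of normal load               -> 0.4

# On a live Linux system (1-minute load from /proc/loadavg, CPUs from nproc):
# load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

A ratio well above 1 means far more tasks are runnable or blocked on I/O than there are CPUs to serve them, which is roughly what a load in the 800s on 12 vCPUs implies.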
This XI instance normally does about 2,500 hosts and 77,000 services, so I've enabled large installation tweaks and use a RAM disk. I tuned the kernel message queue and shared memory parameters, too.
Code:
sysctl -p
kernel.msgmnb = 262144000
kernel.msgmax = 262144000
kernel.shmmax = 4294967295
kernel.shmall = 268435456
kernel.msgmni = 512000
Code:
ipcs -q
------ Message Queues --------
key msqid owner perms used-bytes messages
0xbd000040 196609 nagios 600 262144000 256000
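A rough sketch for comparing each queue's used-bytes against the byte limit, using the row above as input. It assumes the queue's limit is still kernel.msgmnb (the default for newly created queues); `queue_fill` is a hypothetical helper name, not part of Nagios XI. With the numbers shown, 262144000 used bytes against a 262144000 byte limit works out to a full queue, at 1024 bytes per message.

```shell
#!/bin/sh
# Print msqid, used-bytes, message count, and fill percentage for each
# queue in `ipcs -q`-style output read from stdin.
# Usage: ipcs -q | queue_fill LIMIT_BYTES
queue_fill() {
    awk -v limit="$1" '
        # Data rows start with a hex key such as 0xbd000040;
        # the banner and column-header lines are skipped.
        $1 ~ /^0x/ {
            printf "%s %s %s %.0f%%\n", $2, $5, $6, 100 * $5 / limit
        }'
}

# The queue row and kernel.msgmnb value from the post.
printf '%s\n' \
    '------ Message Queues --------' \
    'key msqid owner perms used-bytes messages' \
    '0xbd000040 196609 nagios 600 262144000 256000' |
    queue_fill 262144000
# prints: 196609 262144000 256000 100%

# On a live system:
# ipcs -q | queue_fill "$(sysctl -n kernel.msgmnb)"
```

A queue that sits at or near 100% would suggest the consumer (ndo2db writing to the database) is not keeping up with what Nagios Core is producing, though that is an inference rather than something the output proves by itself.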
I've tried running the database repair script multiple times. It's not uncommon for the database to need repair in our environment, but not like this. The repair appears to succeed, at least temporarily, but the database log shows the nagios_statehistory table being marked as crashed over and over.
Code:
210428 13:06:03 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.64-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 3306 MariaDB Server
210428 16:20:01 [ERROR] mysqld: Table './nagios/nagios_statehistory' is marked as crashed and last (automatic?) repair failed
210428 16:25:01 [ERROR] mysqld: Table './nagios/nagios_statehistory' is marked as crashed and last (automatic?) repair failed
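To see at a glance which tables keep getting flagged, here is a small sketch that tallies the "marked as crashed" errors per table. `crashed_tables` is a hypothetical helper name, and the log path in the last comment is an assumption (a common CentOS 7 location for the MariaDB error log), so adjust it for your system.

```shell
#!/bin/sh
# Count "marked as crashed" errors per table in MariaDB error-log text
# read from stdin. Output: one "COUNT TABLE" line per table, highest first.
crashed_tables() {
    # Keep only matching lines and reduce each one to the quoted table path.
    sed -n "s/.*Table '\([^']*\)' is marked as crashed.*/\1/p" |
        sort | uniq -c | sort -rn |
        awk '{ print $1, $2 }'
}

# The two error lines from the log excerpt above.
printf '%s\n' \
    "210428 16:20:01 [ERROR] mysqld: Table './nagios/nagios_statehistory' is marked as crashed and last (automatic?) repair failed" \
    "210428 16:25:01 [ERROR] mysqld: Table './nagios/nagios_statehistory' is marked as crashed and last (automatic?) repair failed" |
    crashed_tables
# prints: 2 ./nagios/nagios_statehistory

# On a live system (log path is an assumption):
# crashed_tables < /var/log/mariadb/mariadb.log
```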
This is CentOS 7, so I updated the GRUB options to force an fsck on reboot. Reboots seem to go just fine, so I'm guessing there's no filesystem corruption.
Editing to add: this VM has 10GB of memory assigned to it, and the OOM killer has started taking out mysqld processes. I don't know a lot about databases, but that sounds suboptimal.
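For anyone wanting to confirm what the OOM killer has been taking out, here is a minimal sketch that tallies victims by process name from kernel log text. It assumes the kernel writes lines of the form "Out of memory: Kill process PID (name) ..." (or "Killed process" on newer kernels); `oom_victims` is a hypothetical helper name, and the exact message wording can vary between kernel versions.

```shell
#!/bin/sh
# Count OOM-killer victims by process name in kernel log text read from stdin.
# Output: one "COUNT NAME" line per process name, highest first.
oom_victims() {
    # Matches both "Kill process" and "Killed process" wording and keeps
    # only the name inside the parentheses after the PID.
    sed -n 's/.*Out of memory: Kill[a-z]* process [0-9]* (\([^)]*\)).*/\1/p' |
        sort | uniq -c | sort -rn |
        awk '{ print $1, $2 }'
}

# On a live system (either source of kernel messages should work):
# dmesg | oom_victims
# journalctl -k | oom_victims
```

If mysqld tops the list, repeated kills in the middle of writes would be one plausible reason for MyISAM tables to keep getting marked as crashed, but that link is a guess until the timestamps are compared.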
I would have attached a profile to this, but this is all this system can produce:
Code:
PROFILE BUILD FAILED
Array
(
)
CODE: 1
Code:
System Profile
A system profile makes it easier for our support techs to understand the system that you are running on. Including a downloaded system profile with your support ticket is always recommended.
Nagios XI - System Info
System
Nagios XI version: 5.6.14
Release info: den-nagios.unitrendscloud.com 3.10.0-1062.18.1.el7.x86_64 x86_64
CentOS Linux release 7.7.1908 (Core)
Gnome is not installed
Apache Information
PHP Version: 5.4.16
Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36
Server Name: den-nagios.unitrendscloud.com
Server Address: 10.201.255.14
Server Port: 443
Date/Time
PHP Timezone: UTC
PHP Time: Wed, 28 Apr 2021 17:33:52 +0000
System Time: Wed, 28 Apr 2021 17:33:52 +0000
Nagios XI Data
License ends in: VUVOSM
UUID: f6dc7d1c-b9b0-44f5-892c-d7fd0a105bd1
Install Type: manual/unknown
└─48011 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
└─1599 /usr/local/nagios/bin/npcd -d -f /usr/local/nagios/etc/pnp/npcd.cfg
└─47262 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg -f
CPU Load 15: 2.34
Total Hosts: 2496
Total Services: 76811
Function get_base_uri() returns: https://den-nagios.unitrendscloud.com/nagiosxi/
Function get_base_url() returns: https://den-nagios.unitrendscloud.com/nagiosxi/
Function get_backend_url(internal_call=false) returns: https://den-nagios.unitrendscloud.com/nagiosxi/includes/components/profile/profile.php
Function get_backend_url(internal_call=true) returns: https://localhost/nagiosxi/backend/
Ping Test localhost
Running:
/bin/ping -c 3 localhost 2>&1
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.082 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.047 ms
64 bytes from localhost (127.0.0.1): icmp_seq=3 ttl=64 time=0.053 ms
--- localhost ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.047/0.060/0.082/0.017 ms
Test wget To localhost
WGET From URL: https://localhost/nagiosxi/includes/components/ccm/
Running:
/usr/bin/wget https://localhost/nagiosxi/includes/components/ccm/
--2021-04-28 17:33:54-- https://localhost/nagiosxi/includes/components/ccm/
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:443... connected.
ERROR: cannot verify localhost's certificate, issued by '/C=US/O=DigiCert Inc/CN=DigiCert SHA2 Secure Server CA':
Unable to locally verify the issuer's authority.
ERROR: no certificate subject alternative name matches
requested host name 'localhost'.
To connect to localhost insecurely, use `--no-check-certificate'.
Network Settings
1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens192: mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:50:56:ba:3a:4a brd ff:ff:ff:ff:ff:ff
inet 10.201.255.14/16 brd 10.201.255.255 scope global noprefixroute ens192
valid_lft forever preferred_lft forever
inet6 fe80::250:56ff:feba:3a4a/64 scope link
valid_lft forever preferred_lft forever
3: ens224: mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:50:56:ba:83:03 brd ff:ff:ff:ff:ff:ff
4: ens256: mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:50:56:ba:1c:43 brd ff:ff:ff:ff:ff:ff
inet 10.163.31.207/23 brd 10.163.31.255 scope global noprefixroute ens256
valid_lft forever preferred_lft forever
inet6 fe80::250:56ff:feba:1c43/64 scope link
valid_lft forever preferred_lft forever
default via 10.201.0.1 dev ens192 proto static metric 100
10.7.3.148 via 10.163.31.1 dev ens256 proto static
10.163.30.0/23 dev ens256 proto kernel scope link src 10.163.31.207 metric 102
10.201.0.0/16 dev ens192 proto kernel scope link src 10.201.255.14 metric 100
10.206.0.0/16 via 10.201.255.230 dev ens192 proto static metric 100
Nagios XI Components
actions 2.2.2
alertcloud 1.2.1
alertstream 2.1.1
autodiscovery 2.2.6
backendapiurl 1.0.5
bandwidthreport 1.8.1
bbmap 1.2.1
birdseye 3.2.4
bulkmodifications 2.2.0
capacityplanning 2.3.0
ccm 3.0.5
custom-includes 1.0.5
customlogin 1.0.0
customlogo 1.2.0
deploydashboard 1.3.0
deploynotification 1.3.3
duo 1.0.2
escalationwizard 1.5.1
freevariabletab 1.1.0
globaleventhandler 1.3.0
googlemap 1.6.2
graphexplorer 2.3.0
helpsystem 2.0.1
highcharts
homepagemod 1.1.11
hypermap 1.2.1
hypermap_replay 1.2.0
isms 1.2.3
latestalerts 1.2.7
ldap_ad_integration 1.1.2
map 1.0.0
massacknowledge 2.2.2
massimmediatecheck 1.0.2
metrics 1.3.4
minemap 1.2.5
msp 1.2.0
mtr 1.0.2
nagiosbpi 2.8.3
nagioscore
nagioscorecfg
nagiosim 2.2.7
nagiosna 1.4.1
nagiosql
nagvis 2.0.4
nocscreen 1.3.3
nrdsconfigmanager 1.6.8
nxti 1.0.3
opscreen 1.8.0
perfdata
pingaction 1.1.2
pnp
profile 1.4.1
proxy 1.1.5
rdp 1.0.5
rename 1.7.0
scheduledbackups 1.2.0
scheduledreporting
similetimeline 1.5.1
snmptrapsender 1.6.2
statusmap 1.0.3
tracerouteaction 1.1.2
twilio 1.0.0
usermacros 1.1.0
xicore
Nagios XI Config Wizards
activedirectory 1.3.4
ec2 1.1.3
s3 1.1.2
java_tomcat 1.1.0
autodiscovery 1.4.2
bpiwizard 1.1.5
bulkhostimport 2.1.3
capacity-planning 1.0.1
dhcp 1.1.6
dnsquery 1.1.5
digitalocean 1.0.2
docker 1.1.2
domain_expiration 1.1.6
email-delivery 2.0.5
esensors_websensor 1.1.6
exchange 1.3.3
ftpserver 1.5.7
folder_watch 1.0.6
genericnetdevice 1.0.4
java_glassfish 1.1.0
google-cloud 1.0.2
hyperv 1.0.2
java_jboss 1.1.0
java_jetty 1.1.0
ldapserver 1.3.4
linode 1.0.2
linux_snmp 1.5.8
linux-server 1.5.8
mssql_database 1.6.4
mssql_query 1.6.7
mssql_server 1.9.2
macosx 1.3.3
mailserver 1.2.6
microsoft-azure 1.0.2
mongodb_database 1.1.4
mongodbserver 1.1.4
mountpoint 1.0.3
mysqlquery 1.2.4
mysqlserver 1.3.4
ncpa 2.2.4
nrpe 1.5.3
nagioslogserver 1.0.7
nna 1.0.7
nagiosxiserver 1.3.2
nagiostats 1.2.3
switch 2.5.2
oraclequery 1.3.8
oracleserverspace 1.5.8
oracletablespace 1.5.9
passivecheck 1.2.5
postgresdb 1.5.4
postgresquery 1.2.4
postgresserver 1.3.5
printer 1.1.4
radiusserver 2.0.3
rackspace 1.0.2
sla 1.3.4
snmp 1.6.5
snmp_trap 1.5.4
snmpwalk 2.0.0
sshproxy 1.5.8
solaris 1.3.2
tcpudpport 1.3.4
tftp 1.0.3
passiveobject 1.1.3
vmware 1.7.3
watchguard 1.4.6
webtransaction 1.2.6
java_weblogic 1.1.0
website 1.4.1
website_defacement 1.2.2
websiteurl 1.4.0
windowsdesktop 1.6.4
windowseventlog 2.0.1
windowssnmp 1.5.6
windowsserver 1.6.4
windowswmi 2.2.0
Nagios XI Dashlets
alertcloud
bbmap
capacityplanning
graphexplorer
hypermap
latestalerts
metrics
metricsguage
minemap
xicore_xi_news_feed
xicore_getting_started
xicore_admin_tasks
xicore_eventqueue_chart
xicore_component_status
xicore_server_stats
xicore_monitoring_stats
xicore_monitoring_perf
xicore_monitoring_process
xicore_perfdata_chart
xicore_host_status_summary
xicore_service_status_summary
xicore_comments
xicore_hostgroup_status_overview
xicore_hostgroup_status_grid
xicore_servicegroup_status_overview
xicore_servicegroup_status_grid
xicore_hostgroup_status_summary
xicore_servicegroup_status_summary
xicore_available_updates
xicore_network_outages
xicore_network_outages_summary
xicore_network_health
xicore_host_status_tac_summary
xicore_service_status_tac_summary
xicore_feature_status_tac_summary
availability
custom_dashlet 1.0.6
gauges 1.2.2
googlemapdashlet 1.1.0
internettrafficreport
rss_dashlet 1.1.3
sansrisingports 2.0
sla
worldtimeserver 2.0.0