We are experiencing the same type of issue. We have admin users and regular users, and we find making regular users admins unacceptable. Prior to the upgrade we had a steady load of 1-3, and our httpd and mysql services were stable. Now mysql seems to be restarting itself every 12 hours or so. (My colleague and I have also been called in a couple of times in the last 2 weeks for mysql crashes and table corruption; the load shows one of these outages on the 26th.)
I spent a good amount of time investigating why, and it seems that mysql is being hit heavily by queries that do not use an index. Last night I turned on log-queries-not-using-indexes, and within a few minutes I counted over 5600 occurrences of the following select statement.
Code:
# cat /var/log/mysqld-slow-query.log | grep 'nagios_objects WHERE TRUE ORDER BY nagios_objects.objecttype_id DESC'| wc -l
5688
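The steps above (turn on logging of queries that use no index, then count how often one statement shows up) can be sketched as a few shell helpers. This is a minimal sketch, not a definitive procedure: the function names are hypothetical, it assumes a MySQL version where slow_query_log and log_queries_not_using_indexes can be set at runtime, and it assumes each statement sits on a single line in the log. Settings changed with SET GLOBAL are lost when mysqld restarts, so they would need to be reapplied or placed in my.cnf.

```shell
#!/bin/sh
# Sketch only: these function names are hypothetical helpers,
# not part of Nagios XI or MySQL.

# Turn on slow-query logging of statements that use no index, at runtime.
# Extra arguments (e.g. -u root -p) are passed through to the mysql client.
enable_no_index_logging() {
    mysql "$@" -e "SET GLOBAL slow_query_log = ON; SET GLOBAL log_queries_not_using_indexes = ON;"
}

# Count log lines containing a fixed string (no regex interpretation).
# usage: count_in_slow_log LOGFILE STRING
count_in_slow_log() {
    grep -cF -- "$2" "$1"
}

# Rank statement lines by frequency, skipping the "# ..." header lines and
# the "SET timestamp=...;" / "use db;" lines written before each statement.
# usage: top_slow_statements LOGFILE [N]
top_slow_statements() {
    grep -v '^#' "$1" | grep -viE '^(SET timestamp=|use )' |
        sort | uniq -c | sort -rn | head -n "${2:-10}"
}
```

For this pattern, `count_in_slow_log /var/log/mysqld-slow-query.log 'nagios_objects WHERE TRUE ORDER BY nagios_objects.objecttype_id DESC'` should give the same count as the cat/grep/wc pipeline, and `top_slow_statements /var/log/mysqld-slow-query.log` would show whether any other statement is logged nearly as often.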
The upgrade from 5.4.13 to 5.5.2 happened on Aug 22. I believe the first real apply config came on the 28th, which is what sent the load way up. (I'm in the process of testing this theory by pulling the snapshot configs off previous system backups to see if this brings the load back to where it was after the upgrade on the 22nd but before the 28th.)
I would also point out that we have 569 hosts with 3381 service checks (with mod_gearman to lower load). We run this on a physical HP G8 server with 12 cores (E5-2620 v2 @ 2.10GHz), 32G of RAM, and SSD drives in RAID1, so this spike in resource usage has been puzzling, to say the least.
The system is stable enough that it hasn't caused any major problems with upper management, but it feels like a ticking time bomb: it is only a matter of time before mysql crashes again and management starts knocking. This should be the most stable system in our infrastructure, as it is the eyes into our business systems. One other noticeable clue appears when we apply config: dashboards that have service/host group queries (i.e. my default dashboard and our operations dashboards) take up to 2-3 minutes to display, where in previous versions they took 10-15 seconds at most.
Here is the top output:
Code:
top - 09:39:07 up 13:34, 2 users, load average: 6.59, 6.16, 6.02
Tasks: 392 total, 7 running, 385 sleeping, 0 stopped, 0 zombie
Cpu(s): 57.8%us, 8.8%sy, 0.0%ni, 33.3%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 32837684k total, 26061236k used, 6776448k free, 257344k buffers
Swap: 4976636k total, 0k used, 4976636k free, 4670624k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18206 mysql 20 0 6477m 190m 6960 S 246.8 0.6 1417:05 mysqld
14148 apache 20 0 575m 71m 21m R 55.1 0.2 126:22.93 httpd
13985 apache 20 0 577m 78m 28m S 54.1 0.2 117:53.61 httpd
13988 apache 20 0 575m 71m 21m R 52.1 0.2 122:24.28 httpd
57156 apache 20 0 577m 73m 21m R 50.1 0.2 119:04.73 httpd
14149 apache 20 0 578m 75m 23m R 48.8 0.2 122:09.89 httpd
35917 apache 20 0 543m 36m 17m S 28.4 0.1 0:24.55 httpd
20805 apache 20 0 555m 47m 19m S 25.1 0.1 10:17.66 httpd
42405 nagios 20 0 74036 13m 1040 S 15.5 0.0 36:15.24 ndo2db
47383 gearmand 20 0 465m 5576 984 S 2.6 0.0 15:08.67 gearmand
55119 nagios 20 0 131m 8292 2064 R 2.3 0.0 0:00.07 check_ifopersta
42368 nagios 20 0 905m 55m 2544 S 1.0 0.2 11:29.81 nagios
1516 root 20 0 0 0 0 S 0.7 0.0 1:09.08 flush-253:0
50672 root 20 0 15288 1568 988 R 0.7 0.0 0:00.38 top
51996 root 20 0 15268 1616 1008 S 0.7 0.0 0:42.05 top
3 root RT 0 0 0 0 S 0.3 0.0 0:07.42 migration/0
17 root 20 0 0 0 0 S 0.3 0.0 0:59.11 ksoftirqd/3
25301 nagios 20 0 133m 3684 2228 S 0.3 0.0 0:00.77 mod_gearman2_wo
26016 nagios 20 0 133m 3688 2228 S 0.3 0.0 0:00.85 mod_gearman2_wo
42969 nagios 20 0 133m 3612 2228 S 0.3 0.0 0:00.25 mod_gearman2_wo
54974 nagios 20 0 41440 2944 2252 S 0.3 0.0 0:00.01 check_nrpe
63481 nagios 20 0 133m 3680 2228 S 0.3 0.0 0:00.86 mod_gearman2_wo
63677 nagios 20 0 133m 3688 2228 S 0.3 0.0 0:01.27 mod_gearman2_wo
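Because top lists each process separately, the combined cost of the httpd workers is easy to miss: the seven httpd processes shown above add up to about 314% CPU, more than the 246.8% of mysqld. A minimal sketch for totalling CPU per command name follows. The helper name sum_cpu_by_command is hypothetical, and note that the pcpu value reported by ps is averaged over each process's lifetime, so the numbers will not match an instantaneous top reading exactly.

```shell
#!/bin/sh
# Sketch only: sum_cpu_by_command is a hypothetical helper name.
# Reads "PCPU COMMAND" lines on stdin and prints the total %CPU
# per command name, highest first.
sum_cpu_by_command() {
    awk '{
        cpu = $1            # first field: CPU percentage
        $1 = ""             # drop it; what remains is the command name
        sub(/^ +/, "")      # trim the leading space left behind
        total[$0] += cpu
    }
    END {
        for (name in total) printf "%.1f %s\n", total[name], name
    }' | sort -rn
}
```

A possible invocation is `ps -eo pcpu= -o comm= | sum_cpu_by_command | head`, which prints one line per command with its summed CPU percentage.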
Here is the system profile, with hostname, IP, and license info removed.
Code:
Nagios XI - System Info
System
Nagios XI version: 5.5.3
XI installed from: manual
XI UUID: ******************************
Release info: *************** 2.6.32-754.3.5.el6.x86_64 x86_64
Red Hat Enterprise Linux Server release 6.10 (Santiago)
Gnome is not installed
Apache Information
PHP Version: 5.3.3
Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36
Server Name: ****************
Server Address: ************************
Server Port: 443
Date/Time
PHP Timezone: America/Toronto
PHP Time: Thu, 06 Sep 2018 09:26:45 -0400
System Time: Thu, 06 Sep 2018 09:26:45 -0400
Nagios XI Data
License ends in: ************
UUID: ******************
Install Type: manual/unknown
nagios (pid 42368) is running...
NPCD running (pid 46951).
ndo2db (pid 46680) is running...
CPU Load 15: 5.92
Total Hosts: 570
Total Services: 3404
Function get_base_uri() returns: https://***************/nagiosxi/
Function get_base_url() returns: https://***************/nagiosxi/
Function get_backend_url(internal_call=false) returns: https://***************/nagiosxi/includes/components/profile/profile.php
Function get_backend_url(internal_call=true) returns: http://localhost/nagiosxi/backend/
Ping Test localhost
Running:
/bin/ping -c 3 localhost 2>&1
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.031 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.036 ms
64 bytes from localhost (127.0.0.1): icmp_seq=3 ttl=64 time=0.026 ms
--- localhost ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.026/0.031/0.036/0.004 ms
Test wget To localhost
WGET From URL: http://localhost/nagiosxi/includes/components/ccm/
Running:
/usr/bin/wget http://localhost/nagiosxi/includes/components/ccm/
--2018-09-06 09:26:47-- http://localhost/nagiosxi/includes/components/ccm/
Resolving localhost... ::1, 127.0.0.1
Connecting to localhost|::1|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://localhost/nagiosxi/login.php?redirect=/nagiosxi/includes/components/ccm/index.php%3f&noauth=1 [following]
--2018-09-06 09:26:47-- http://localhost/nagiosxi/login.php?redirect=/nagiosxi/includes/components/ccm/index.php%3f&noauth=1
Connecting to localhost|::1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: "/usr/local/nagiosxi/tmp/ccm_index.tmp"
0K .......... ......... 541K=0.04s
2018-09-06 09:26:47 (541 KB/s) - "/usr/local/nagiosxi/tmp/ccm_index.tmp" saved [20452]
Network Settings
1: lo: mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: mtu 1500 qdisc mq master bond0 state UP qlen 1000
link/ether d8:9d:67:17:e7:34 brd ff:ff:ff:ff:ff:ff
3: eth1: mtu 1500 qdisc mq master bond0 state UP qlen 1000
link/ether d8:9d:67:17:e7:34 brd ff:ff:ff:ff:ff:ff
4: eth2: mtu 1500 qdisc noop state DOWN qlen 1000
link/ether d8:9d:67:17:e7:36 brd ff:ff:ff:ff:ff:ff
5: eth3: mtu 1500 qdisc noop state DOWN qlen 1000
link/ether d8:9d:67:17:e7:37 brd ff:ff:ff:ff:ff:ff
6: bond0: mtu 1500 qdisc noqueue state UP
link/ether d8:9d:67:17:e7:34 brd ff:ff:ff:ff:ff:ff
inet *************** brd *************** scope global bond0
inet6 fe80::da9d:67ff:fe17:e734/64 scope link
valid_lft forever preferred_lft forever
*************** dev bond0 proto kernel scope link src ***************
169.254.0.0/16 dev bond0 scope link metric 1006
default via *************** dev bond0
Nagios XI Components
actions 2.0.1
actionurl
alertcloud 1.2.1
alertstream 2.1.0
autodiscovery 2.2.5
backendapiurl 1.0.3
bandwidthreport 1.8.0
bbmap 1.2.0
birdseye 3.2.2
bulkmodifications 2.2.0
capacityplanning 2.3.0
ccm 2.7.0
custom-includes 1.0.4
customlogin 1.0.0
customlogo 1.2.0
deploydashboard 1.3.0
deploynotification 1.3.3
duo 1.0.0
escalationwizard 1.5.0
freevariabletab 1.0.1
globaleventhandler 1.2.2
graphexplorer 2.2.0
helpsystem 2.0.0
highcharts 4.0.1
homepagemod 1.1.7
hypermap 1.1.6
hypermap_replay 1.2.0
isms 1.2.3
latestalerts 1.2.6
ldap_ad_integration 1.1.0
massacknowledge 2.1.14
metrics 1.2.10
minemap 1.2.4
modgearman 1
nagiosbpi 2.7.1
nagioscore
nagioscorecfg
nagiosim 2.2.6
nagiosna 1.4.0
nagiosql
nagvis 2.0.0
nocscreen 1.1.2
nrdsconfigmanager 1.6.4
nxti 1.0.1
opscreen 1.8.0
perfdata
pingaction 1.1.1
pnp
profile 1.4.0
proxy 1.1.4
rdp 1.0.3
rename 1.6.0
scheduledbackups 1.2.0
scheduledreporting
similetimeline 1.5.0
snmptrapsender 1.5.5
statusmap 1.0.2
tracerouteaction 1.1.1
usermacros 1.1.0
xicore
Nagios XI Config Wizards
ec2 1.0.0
s3 1.0.0
autodiscovery 1.4.1
bpiwizard 1.1.4
bulkhostimport 2.0.4
digitalocean 1.0.0
google-cloud 1.0.0
linode 1.0.0
microsoft-azure 1.0.0
rackspace 1.0.0
dhcp 1.1.4
dnsquery 1.1.3
docker 1.0.0
domain_expiration 1.1.4
email-delivery 2.0.4
esensors_websensor 1.1.4
exchange 1.3.2
folder_watch 1.0.5
ftpserver 1.5.5
genericnetdevice 1.0.3
ldapserver 1.3.3
linux-server 1.5.5
linux_snmp 1.5.4
macosx 1.3.0
mailserver 1.2.4
mongodb_database 1.1.2
mongodbserver 1.1.2
mountpoint 1.0.2
mssql_database 1.6.2
mssql_query 1.6.4
mssql_server 1.9.1
mysqlquery 1.2.3
mysqlserver 1.3.3
nagioslogserver 1.0.5
nagiostats 1.2.3
nagiosxiserver 1.3.0
ncpa 2.0.0
nna 1.0.4
nrpe 1.5.2
oraclequery 1.3.3
oracleserverspace 1.5.3
oracletablespace 1.5.4
passivecheck 1.2.4
passiveobject 1.1.3
postgresdb 1.5.3
postgresquery 1.2.3
postgresserver 1.3.4
printer 1.1.3
radiusserver 2.0.1
sla 1.3.2
snmp 1.5.8
snmp_trap 1.5.3
snmpwalk 1.3.6
solaris 1.2.5
sshproxy 1.5.7
switch 2.4.0
tcpudpport 1.3.3
tftp 1.0.2
vmware 1.7.1
watchguard 1.4.5
website 1.3.0
website_defacement 1.1.5
websiteurl 1.3.7
webtransaction 1.2.5
windowseventlog 1.3.3
windowsserver 1.6.1
windowsdesktop 1.6.1
windowssnmp 1.5.1
windowswmi 2.1.0
Nagios XI Dashlets
alertcloud
bbmap
capacityplanning
graphexplorer
hypermap
latestalerts
metrics
metricsguage
minemap
xicore_xi_news_feed
xicore_getting_started
xicore_admin_tasks
xicore_eventqueue_chart
xicore_component_status
xicore_server_stats
xicore_monitoring_stats
xicore_monitoring_perf
xicore_monitoring_process
xicore_perfdata_chart
xicore_host_status_summary
xicore_service_status_summary
xicore_comments
xicore_hostgroup_status_overview
xicore_hostgroup_status_grid
xicore_servicegroup_status_overview
xicore_servicegroup_status_grid
xicore_hostgroup_status_summary
xicore_servicegroup_status_summary
xicore_available_updates
xicore_network_outages
xicore_network_outages_summary
xicore_network_health
xicore_host_status_tac_summary
xicore_service_status_tac_summary
xicore_feature_status_tac_summary
availability
custom_dashlet 1.0.5
gauges 1.2.2
googlemapdashlet 1.1.0
internettrafficreport
rss_dashlet 1.1.0
sansrisingports 2.0
sla
statusinfo 2016-08-22
text 2011-11-30
worldtimeserver 2.0.0
What else should I try? What are your recommendations for rectifying this problem? Should I just downgrade until this can be fixed?
(I'll entertain the idea of applying a config from before this date, but it looks like I will have to downgrade.)
Last night I also noticed the 5.5.3 upgrade, which I applied hoping it would lower the load. It didn't help!