Yeah, so doubling the resources has made it run like it did prior to the update. I'm not sure this is unexpected. However, at this point the VM is over-provisioned, and the VMware guys over here are asking me for justification, since it was performing fine prior to the update.
We need to understand what about the update caused utilization to triple. Doubling the resources because of an update unfortunately won't be an acceptable answer for the VMware guys. Do I need to open a proper ticket on this? What are the next steps here?
5.5.1 Httpd high load
npolovenko - Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm
Re: 5.5.1 Httpd high load
@Ehamby, I still believe this is most likely a Core 4 issue. I suggest opening a new issue on GitHub:
https://github.com/NagiosEnterprises/na ... issues/new
That way you will be able to communicate directly with the developers.
tylerhoadley
Posts: 43
Joined: Tue Jul 02, 2013 1:41 pm
Re: 5.5.1 Httpd high load
We are experiencing this same type of issue. We have an admin user and regular users, and we find setting users to admins unacceptable. Prior to the upgrade we had a nice load of 1-3. Our httpd and mysql services were stable, but now it seems mysql is restarting itself every 12 hours or so. (My colleague and I have also been paged a couple of times in the last 2 weeks for MySQL crashes and table corruption; the load graph shows one of those outages on the 26th.)


The upgrade happened on Aug 22, from 5.4.13 to 5.5.2. I believe the first real apply config came on the 28th, which is what sent the load way up. (I'm in the process of testing this theory by pulling the snapshot configs off previous system backups, to see whether the load drops back to where it was after the upgrade on the 22nd but before the 28th.)
I spent a good portion of time investigating why, and it seems that heavy MySQL queries are running without an index. Last night I turned on log-queries-not-using-indexes, and within a few minutes I had counted over 5600 queries from the following SELECT statement.
Code: Select all
# cat /var/log/mysqld-slow-query.log | grep 'nagios_objects WHERE TRUE ORDER BY nagios_objects.objecttype_id DESC'| wc -l
5688
I would also point out that we have 569 hosts with 3381 service checks (with mod_gearman to lower the load). We run this on a physical HP G8 server with 12 cores (E5-2620 v2 @ 2.10GHz) and 32 GB of RAM on SSD drives in RAID 1, so this spike in resource usage has been puzzling, to say the least.
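(For anyone wanting to reproduce this, here is roughly how the logging gets turned on at runtime from the mysql client; a sketch for the MySQL 5.x that ships with EL6, and your slow-log path may differ.)
Code: Select all
-- Log slow queries, including queries that use no index (MySQL 5.1+).
-- Both variables can also be set in my.cnf to survive a restart.
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL log_queries_not_using_indexes = 'ON';
-- Confirm where the log is being written:
SHOW VARIABLES LIKE 'slow_query_log_file';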
The system itself is stable enough that it hasn't caused any major problems with upper management, but it seems like a ticking time bomb before MySQL crashes again and management starts knocking. This system should be the most stable system in our infrastructure, as it's the eyes into our business systems. One other noticeable clue: when we apply a config, dashboards that run service/host group queries (i.e. my default dashboard and our operations dashboards) take 2-3 minutes to display, versus 10-15 seconds at most in previous versions.
Here is the top output:
Code: Select all
top - 09:39:07 up 13:34, 2 users, load average: 6.59, 6.16, 6.02
Tasks: 392 total, 7 running, 385 sleeping, 0 stopped, 0 zombie
Cpu(s): 57.8%us, 8.8%sy, 0.0%ni, 33.3%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 32837684k total, 26061236k used, 6776448k free, 257344k buffers
Swap: 4976636k total, 0k used, 4976636k free, 4670624k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18206 mysql 20 0 6477m 190m 6960 S 246.8 0.6 1417:05 mysqld
14148 apache 20 0 575m 71m 21m R 55.1 0.2 126:22.93 httpd
13985 apache 20 0 577m 78m 28m S 54.1 0.2 117:53.61 httpd
13988 apache 20 0 575m 71m 21m R 52.1 0.2 122:24.28 httpd
57156 apache 20 0 577m 73m 21m R 50.1 0.2 119:04.73 httpd
14149 apache 20 0 578m 75m 23m R 48.8 0.2 122:09.89 httpd
35917 apache 20 0 543m 36m 17m S 28.4 0.1 0:24.55 httpd
20805 apache 20 0 555m 47m 19m S 25.1 0.1 10:17.66 httpd
42405 nagios 20 0 74036 13m 1040 S 15.5 0.0 36:15.24 ndo2db
47383 gearmand 20 0 465m 5576 984 S 2.6 0.0 15:08.67 gearmand
55119 nagios 20 0 131m 8292 2064 R 2.3 0.0 0:00.07 check_ifopersta
42368 nagios 20 0 905m 55m 2544 S 1.0 0.2 11:29.81 nagios
1516 root 20 0 0 0 0 S 0.7 0.0 1:09.08 flush-253:0
50672 root 20 0 15288 1568 988 R 0.7 0.0 0:00.38 top
51996 root 20 0 15268 1616 1008 S 0.7 0.0 0:42.05 top
3 root RT 0 0 0 0 S 0.3 0.0 0:07.42 migration/0
17 root 20 0 0 0 0 S 0.3 0.0 0:59.11 ksoftirqd/3
25301 nagios 20 0 133m 3684 2228 S 0.3 0.0 0:00.77 mod_gearman2_wo
26016 nagios 20 0 133m 3688 2228 S 0.3 0.0 0:00.85 mod_gearman2_wo
42969 nagios 20 0 133m 3612 2228 S 0.3 0.0 0:00.25 mod_gearman2_wo
54974 nagios 20 0 41440 2944 2252 S 0.3 0.0 0:00.01 check_nrpe
63481 nagios 20 0 133m 3680 2228 S 0.3 0.0 0:00.86 mod_gearman2_wo
63677 nagios 20 0 133m 3688 2228 S 0.3 0.0 0:01.27 mod_gearman2_wo
Here is the system profile, with hostname, IP, and license info removed.
Code: Select all
Nagios XI - System Info
System
Nagios XI version: 5.5.3
XI installed from: manual
XI UUID: ******************************
Release info: *************** 2.6.32-754.3.5.el6.x86_64 x86_64
Red Hat Enterprise Linux Server release 6.10 (Santiago)
Gnome is not installed
Apache Information
PHP Version: 5.3.3
Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36
Server Name: ****************
Server Address: ************************
Server Port: 443
Date/Time
PHP Timezone: America/Toronto
PHP Time: Thu, 06 Sep 2018 09:26:45 -0400
System Time: Thu, 06 Sep 2018 09:26:45 -0400
Nagios XI Data
License ends in: ************
UUID: ******************
Install Type: manual/unknown
nagios (pid 42368) is running...
NPCD running (pid 46951).
ndo2db (pid 46680) is running...
CPU Load 15: 5.92
Total Hosts: 570
Total Services: 3404
Function get_base_uri() returns: https://***************/nagiosxi/
Function get_base_url() returns: https://***************/nagiosxi/
Function get_backend_url(internal_call=false) returns: https://***************/nagiosxi/includes/components/profile/profile.php
Function get_backend_url(internal_call=true) returns: http://localhost/nagiosxi/backend/
Ping Test localhost
Running:
/bin/ping -c 3 localhost 2>&1
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.031 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.036 ms
64 bytes from localhost (127.0.0.1): icmp_seq=3 ttl=64 time=0.026 ms
--- localhost ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.026/0.031/0.036/0.004 ms
Test wget To localhost
WGET From URL: http://localhost/nagiosxi/includes/components/ccm/
Running:
/usr/bin/wget http://localhost/nagiosxi/includes/components/ccm/
--2018-09-06 09:26:47-- http://localhost/nagiosxi/includes/components/ccm/
Resolving localhost... ::1, 127.0.0.1
Connecting to localhost|::1|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://localhost/nagiosxi/login.php?redirect=/nagiosxi/includes/components/ccm/index.php%3f&noauth=1 [following]
--2018-09-06 09:26:47-- http://localhost/nagiosxi/login.php?redirect=/nagiosxi/includes/components/ccm/index.php%3f&noauth=1
Connecting to localhost|::1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: "/usr/local/nagiosxi/tmp/ccm_index.tmp"
0K .......... ......... 541K=0.04s
2018-09-06 09:26:47 (541 KB/s) - "/usr/local/nagiosxi/tmp/ccm_index.tmp" saved [20452]
Network Settings
1: lo: mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: mtu 1500 qdisc mq master bond0 state UP qlen 1000
link/ether d8:9d:67:17:e7:34 brd ff:ff:ff:ff:ff:ff
3: eth1: mtu 1500 qdisc mq master bond0 state UP qlen 1000
link/ether d8:9d:67:17:e7:34 brd ff:ff:ff:ff:ff:ff
4: eth2: mtu 1500 qdisc noop state DOWN qlen 1000
link/ether d8:9d:67:17:e7:36 brd ff:ff:ff:ff:ff:ff
5: eth3: mtu 1500 qdisc noop state DOWN qlen 1000
link/ether d8:9d:67:17:e7:37 brd ff:ff:ff:ff:ff:ff
6: bond0: mtu 1500 qdisc noqueue state UP
link/ether d8:9d:67:17:e7:34 brd ff:ff:ff:ff:ff:ff
inet *************** brd *************** scope global bond0
inet6 fe80::da9d:67ff:fe17:e734/64 scope link
valid_lft forever preferred_lft forever
*************** dev bond0 proto kernel scope link src ***************
169.254.0.0/16 dev bond0 scope link metric 1006
default via *************** dev bond0
Nagios XI Components
actions 2.0.1
actionurl
alertcloud 1.2.1
alertstream 2.1.0
autodiscovery 2.2.5
backendapiurl 1.0.3
bandwidthreport 1.8.0
bbmap 1.2.0
birdseye 3.2.2
bulkmodifications 2.2.0
capacityplanning 2.3.0
ccm 2.7.0
custom-includes 1.0.4
customlogin 1.0.0
customlogo 1.2.0
deploydashboard 1.3.0
deploynotification 1.3.3
duo 1.0.0
escalationwizard 1.5.0
freevariabletab 1.0.1
globaleventhandler 1.2.2
graphexplorer 2.2.0
helpsystem 2.0.0
highcharts 4.0.1
homepagemod 1.1.7
hypermap 1.1.6
hypermap_replay 1.2.0
isms 1.2.3
latestalerts 1.2.6
ldap_ad_integration 1.1.0
massacknowledge 2.1.14
metrics 1.2.10
minemap 1.2.4
modgearman 1
nagiosbpi 2.7.1
nagioscore
nagioscorecfg
nagiosim 2.2.6
nagiosna 1.4.0
nagiosql
nagvis 2.0.0
nocscreen 1.1.2
nrdsconfigmanager 1.6.4
nxti 1.0.1
opscreen 1.8.0
perfdata
pingaction 1.1.1
pnp
profile 1.4.0
proxy 1.1.4
rdp 1.0.3
rename 1.6.0
scheduledbackups 1.2.0
scheduledreporting
similetimeline 1.5.0
snmptrapsender 1.5.5
statusmap 1.0.2
tracerouteaction 1.1.1
usermacros 1.1.0
xicore
Nagios XI Config Wizards
ec2 1.0.0
s3 1.0.0
autodiscovery 1.4.1
bpiwizard 1.1.4
bulkhostimport 2.0.4
digitalocean 1.0.0
google-cloud 1.0.0
linode 1.0.0
microsoft-azure 1.0.0
rackspace 1.0.0
dhcp 1.1.4
dnsquery 1.1.3
docker 1.0.0
domain_expiration 1.1.4
email-delivery 2.0.4
esensors_websensor 1.1.4
exchange 1.3.2
folder_watch 1.0.5
ftpserver 1.5.5
genericnetdevice 1.0.3
ldapserver 1.3.3
linux-server 1.5.5
linux_snmp 1.5.4
macosx 1.3.0
mailserver 1.2.4
mongodb_database 1.1.2
mongodbserver 1.1.2
mountpoint 1.0.2
mssql_database 1.6.2
mssql_query 1.6.4
mssql_server 1.9.1
mysqlquery 1.2.3
mysqlserver 1.3.3
nagioslogserver 1.0.5
nagiostats 1.2.3
nagiosxiserver 1.3.0
ncpa 2.0.0
nna 1.0.4
nrpe 1.5.2
oraclequery 1.3.3
oracleserverspace 1.5.3
oracletablespace 1.5.4
passivecheck 1.2.4
passiveobject 1.1.3
postgresdb 1.5.3
postgresquery 1.2.3
postgresserver 1.3.4
printer 1.1.3
radiusserver 2.0.1
sla 1.3.2
snmp 1.5.8
snmp_trap 1.5.3
snmpwalk 1.3.6
solaris 1.2.5
sshproxy 1.5.7
switch 2.4.0
tcpudpport 1.3.3
tftp 1.0.2
vmware 1.7.1
watchguard 1.4.5
website 1.3.0
website_defacement 1.1.5
websiteurl 1.3.7
webtransaction 1.2.5
windowseventlog 1.3.3
windowsserver 1.6.1
windowsdesktop 1.6.1
windowssnmp 1.5.1
windowswmi 2.1.0
Nagios XI Dashlets
alertcloud
bbmap
capacityplanning
graphexplorer
hypermap
latestalerts
metrics
metricsguage
minemap
xicore_xi_news_feed
xicore_getting_started
xicore_admin_tasks
xicore_eventqueue_chart
xicore_component_status
xicore_server_stats
xicore_monitoring_stats
xicore_monitoring_perf
xicore_monitoring_process
xicore_perfdata_chart
xicore_host_status_summary
xicore_service_status_summary
xicore_comments
xicore_hostgroup_status_overview
xicore_hostgroup_status_grid
xicore_servicegroup_status_overview
xicore_servicegroup_status_grid
xicore_hostgroup_status_summary
xicore_servicegroup_status_summary
xicore_available_updates
xicore_network_outages
xicore_network_outages_summary
xicore_network_health
xicore_host_status_tac_summary
xicore_service_status_tac_summary
xicore_feature_status_tac_summary
availability
custom_dashlet 1.0.5
gauges 1.2.2
googlemapdashlet 1.1.0
internettrafficreport
rss_dashlet 1.1.0
sansrisingports 2.0
sla
statusinfo 2016-08-22
text 2011-11-30
worldtimeserver 2.0.0
What else should I try? What are your recommendations for rectifying this problem? Should I just downgrade until this can be fixed?
(I'll entertain the idea of applying a config from before that date, but it looks like I would have to downgrade.)
Last night I also noticed the 5.5.3 upgrade, which I applied hoping it would lower the load. It didn't help!
npolovenko - Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm
Re: 5.5.1 Httpd high load
@tylerhoadley, would you be able to open a support ticket for this issue? We could schedule a remote session to take a look at your server.
https://support.nagios.com/tickets/
Also, I'd like to see your system profile. To send it to us, log in to the Nagios XI GUI using a web browser, then:
Click the "Admin" > "System Profile" Menu
Click the "Download Profile" button
Save the profile.zip file and upload it in the ticket.
Also, please share the /etc/php.ini file.
Thank you.
tylerhoadley
Posts: 43
Joined: Tue Jul 02, 2013 1:41 pm
Re: 5.5.1 Httpd high load
I stabilized the system by increasing thread_cache_size in MySQL and tuning the query_cache* settings for queries that weren't hitting the cache. MySQL hasn't crashed since this tuning. The load is still higher than usual for both httpd and mysql, but it's stable and the web UI is usable (dashboards load relatively quickly again). I'm just glad this is on physical hardware with more resources available to allocate and consume.
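For reference, the my.cnf changes were roughly the following shape; the values below are illustrative assumptions sized against our 32 GB box, not recommendations:
Code: Select all
[mysqld]
# Keep idle threads cached for reuse so httpd connection bursts
# don't pay thread-creation cost on every request.
thread_cache_size = 64

# Enable the query cache and give the repeated nagios_objects
# SELECTs room to be served from memory instead of re-executed.
query_cache_type  = 1
query_cache_size  = 128M
query_cache_limit = 2M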
I am also in the process of recovering the 5.4.13 system onto like hardware (with a matching RAID controller) and will be flipping the RAID mirror set to roll this system back to before the upgrade and the resulting load spike. Unfortunately this has to be done (I can't make ad hoc support changes to rectify this), as we are doing a massive network core refresh this coming weekend and I can't jeopardize that work with trial-and-error troubleshooting (Nagios has to be SOLID). I have done pre-flight preparation testing on the recovered system (it's sitting hot) to ensure I bring the retention/object cache and perf data over from the current system. I feel good about this approach, with a solid backout plan of flipping the RAID sets back; I'm hoping for a few minutes' outage at most.
I want to ensure I keep retention and object states... I have already synced up my flat-file configs (the stuff that has changed over the past week or two) and imported them. I've updated the thread_cache/query_cache* settings in my.cnf for my next attempt at upgrading to 5.5.x after this weekend. I don't want any of the changes from the current MySQL data, as I will already have the logs and events in our Nagios Log Server.
On a side note... I looked throughout the whole system for changes that occurred that day, and also inspected all the logs. I even went so far as to check whether httpd traffic had increased by loading the logs into our Nagios Log Server httpd dashboard, where we plot other httpd traffic data; the trends didn't change. That was the only other explanation I couldn't eliminate at the time, since the gap between the upgrade and the load spikes was a couple of days (I would hate to blame Nagios for a spike in user activity). Trends are 1-to-1 before and after.
One question I would ask: what changed structurally within the MySQL databases in this upgrade? Could a CCM configuration from before the upgrade, reverted onto the upgraded system, cause this? I'm not 100% sure this is my case, but maybe it's a theory worth exercising. Nothing else seems relevant to the date of the spike that hasn't been touched by a newer timestamp.
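(One way to check the structural question without support, assuming the NDO database is named nagios as on a default XI install: take a schema-only dump from each version and diff them.)
Code: Select all
# Schema-only dump (no row data) from the upgraded system:
mysqldump --no-data nagios > schema-5.5.x.sql
# Do the same against a restored 5.4.13 backup, then compare:
diff schema-5.4.13.sql schema-5.5.x.sql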
Thanks,
Re: 5.5.1 Httpd high load
I don't think a reverted CCM config from before the upgrade would cause this; there really haven't been many DB structure changes.
Please create a ticket for this so we can get a remote session setup to see what we can find.
https://support.nagios.com/tickets
Re: 5.5.1 Httpd high load
We believe we have found a fix for at least some of the httpd issues we were having. Every time we would apply a config, the load would jump to 40-60+ for several minutes.
While the load is still significantly higher than before we upgraded to 5.5 (how this thread started), we are no longer seeing huge load spikes when applying configs. We checked the arguments of the processes that seemed to be using the most resources at restart (the ones we were killing to get Nagios working again) and found they were all calling the same script.
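(For anyone retracing this, a sketch of how to pull those arguments; the PID below is a placeholder from our top output.)
Code: Select all
# Show the full command line of a hot process:
ps -o pid,etime,args -p 14148
# Or read the raw argument vector straight from /proc:
tr '\0' ' ' < /proc/14148/cmdline; echo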
The original caller was check_bpi.php, which invoked an api_tool script located at "/usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php". In that file we found the following snippet:
Code: Select all
// Verify that ndo is running... try waiting up to 30 seconds for it
$timeout = 30;
$i = 0;
$is_running = false;
do {
    // Verify ndo and core are running and loaded
    $data = get_program_status_xml_output(array(), true);
    if ($data['is_currently_running'] == 1) {
        $is_running = true;
    }
} while (!$is_running && $i < $timeout);
The code above suggests there is a 30-second timer, and some code for it exists, but it does not appear to be actually implemented: $i is never incremented, so the loop will call get_program_status_xml_output indefinitely until it receives the expected result. We also found a flood of XML parsing errors in the Apache error logs (which we have yet to look into); this may explain why it actually did go on indefinitely.
By changing the code above to implement the 30-second timeout, with a 1-second interval between attempts, we have again achieved normal restarts with prompt recovery.
Here is the final code:
Code: Select all
// Verify that ndo is running... try waiting up to 30 seconds for it
$timeout = 30;
$i = 0;
$is_running = false;
do {
    // Verify ndo and core are running and loaded
    $data = get_program_status_xml_output(array(), true);
    if ($data['is_currently_running'] == 1) {
        $is_running = true;
    } else {
        $i++;
        sleep(1);
    }
} while (!$is_running && $i < $timeout);
Hopefully this helps someone experiencing the same issue, where applying configurations was essentially debilitating.
scottwilkerson - DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises
Re: 5.5.1 Httpd high load
Thanks for sharing your findings. This would indeed be a problem on Apply Config if you had BPI Sync turned on (the default).
We have added the fix to the next version of XI.