Yeah, so doubling the resources has made it run like it did prior to the update. I'm not sure this is unexpected. However, at this point the VM is over-provisioned, and the VMware guys over here are asking me for justification, since it was performing fine prior to the update.
We need to understand what about the update caused utilization to triple. Doubling the resources because of an update unfortunately won't be an acceptable answer for the VMware guys. Do I need to open a proper ticket on this? What are the next steps here?
5.5.1 Httpd high load
npolovenko - Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm
Re: 5.5.1 Httpd high load
@Ehamby, I still believe this is most likely a Core 4 issue. I suggest opening a new issue on GitHub:
https://github.com/NagiosEnterprises/na ... issues/new
That way you will be able to communicate directly with the developers.
tylerhoadley
Posts: 43
Joined: Tue Jul 02, 2013 1:41 pm
Re: 5.5.1 Httpd high load
We are experiencing this same type of issue. We have an admin user and regular users, and we find setting users to admins unacceptable. Prior to the upgrade we had a nice load of 1-3. Our httpd and mysql services were stable, but now it seems mysql is restarting itself every 12 hours or so. (My colleague and I have also been paged a couple of times in the last 2 weeks for MySQL crashes and table corruption; the load graph shows one of those outages on the 26th.)


The upgrade happened on Aug 22, from 5.4.13 to 5.5.2. I believe the first real apply config came on the 28th, which is what sent the load way up. (I'm in the process of testing this theory by pulling the snapshot configs off previous system backups, to see whether the load drops back to where it was after the upgrade on the 22nd but before the 28th.)
I spent a good portion of time investigating why, and it seems that heavy MySQL queries are running without an index. Last night I turned on log-queries-not-using-indexes, and within a few minutes I had counted over 5600 queries from the following SELECT statement.
Code: Select all
# cat /var/log/mysqld-slow-query.log | grep 'nagios_objects WHERE TRUE ORDER BY nagios_objects.objecttype_id DESC'| wc -l
5688
I would also point out that we have 569 hosts with 3381 service checks (with mod_gearman to lower the load). We run this on a physical HP G8 server with 12 cores (E5-2620 v2 @ 2.10GHz) and 32 GB of RAM on SSD drives in RAID 1, so this spike in resource usage has been puzzling, to say the least.
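(For anyone wanting to reproduce this, here is roughly how the logging gets turned on at runtime from the mysql client; a sketch for the MySQL 5.x that ships with EL6, and your slow-log path may differ.)
Code: Select all
-- Log slow queries, including queries that use no index (MySQL 5.1+).
-- Both variables can also be set in my.cnf to survive a restart.
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL log_queries_not_using_indexes = 'ON';
-- Confirm where the log is being written:
SHOW VARIABLES LIKE 'slow_query_log_file';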
The system itself is stable enough that it hasn't caused any major problems with upper management, but it seems like a ticking time bomb before MySQL crashes again and management starts knocking. This system should be the most stable system in our infrastructure, as it's the eyes into our business systems. One other noticeable clue: when we apply a config, dashboards that run service/host group queries (i.e. my default dashboard and our operations dashboards) take 2-3 minutes to display, versus 10-15 seconds at most in previous versions.
Here is the top output:
Code: Select all
top - 09:39:07 up 13:34, 2 users, load average: 6.59, 6.16, 6.02
Tasks: 392 total, 7 running, 385 sleeping, 0 stopped, 0 zombie
Cpu(s): 57.8%us, 8.8%sy, 0.0%ni, 33.3%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 32837684k total, 26061236k used, 6776448k free, 257344k buffers
Swap: 4976636k total, 0k used, 4976636k free, 4670624k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18206 mysql 20 0 6477m 190m 6960 S 246.8 0.6 1417:05 mysqld
14148 apache 20 0 575m 71m 21m R 55.1 0.2 126:22.93 httpd
13985 apache 20 0 577m 78m 28m S 54.1 0.2 117:53.61 httpd
13988 apache 20 0 575m 71m 21m R 52.1 0.2 122:24.28 httpd
57156 apache 20 0 577m 73m 21m R 50.1 0.2 119:04.73 httpd
14149 apache 20 0 578m 75m 23m R 48.8 0.2 122:09.89 httpd
35917 apache 20 0 543m 36m 17m S 28.4 0.1 0:24.55 httpd
20805 apache 20 0 555m 47m 19m S 25.1 0.1 10:17.66 httpd
42405 nagios 20 0 74036 13m 1040 S 15.5 0.0 36:15.24 ndo2db
47383 gearmand 20 0 465m 5576 984 S 2.6 0.0 15:08.67 gearmand
55119 nagios 20 0 131m 8292 2064 R 2.3 0.0 0:00.07 check_ifopersta
42368 nagios 20 0 905m 55m 2544 S 1.0 0.2 11:29.81 nagios
1516 root 20 0 0 0 0 S 0.7 0.0 1:09.08 flush-253:0
50672 root 20 0 15288 1568 988 R 0.7 0.0 0:00.38 top
51996 root 20 0 15268 1616 1008 S 0.7 0.0 0:42.05 top
3 root RT 0 0 0 0 S 0.3 0.0 0:07.42 migration/0
17 root 20 0 0 0 0 S 0.3 0.0 0:59.11 ksoftirqd/3
25301 nagios 20 0 133m 3684 2228 S 0.3 0.0 0:00.77 mod_gearman2_wo
26016 nagios 20 0 133m 3688 2228 S 0.3 0.0 0:00.85 mod_gearman2_wo
42969 nagios 20 0 133m 3612 2228 S 0.3 0.0 0:00.25 mod_gearman2_wo
54974 nagios 20 0 41440 2944 2252 S 0.3 0.0 0:00.01 check_nrpe
63481 nagios 20 0 133m 3680 2228 S 0.3 0.0 0:00.86 mod_gearman2_wo
63677 nagios 20 0 133m 3688 2228 S 0.3 0.0 0:01.27 mod_gearman2_wo
Here is the system profile, with hostname, IP, and license info removed.
Code: Select all
Nagios XI - System Info
System
Nagios XI version: 5.5.3
XI installed from: manual
XI UUID: ******************************
Release info: *************** 2.6.32-754.3.5.el6.x86_64 x86_64
Red Hat Enterprise Linux Server release 6.10 (Santiago)
Gnome is not installed
Apache Information
PHP Version: 5.3.3
Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36
Server Name: ****************
Server Address: ************************
Server Port: 443
Date/Time
PHP Timezone: America/Toronto
PHP Time: Thu, 06 Sep 2018 09:26:45 -0400
System Time: Thu, 06 Sep 2018 09:26:45 -0400
Nagios XI Data
License ends in: ************
UUID: ******************
Install Type: manual/unknown
nagios (pid 42368) is running...
NPCD running (pid 46951).
ndo2db (pid 46680) is running...
CPU Load 15: 5.92
Total Hosts: 570
Total Services: 3404
Function get_base_uri() returns: https://***************/nagiosxi/
Function get_base_url() returns: https://***************/nagiosxi/
Function get_backend_url(internal_call=false) returns: https://***************/nagiosxi/includes/components/profile/profile.php
Function get_backend_url(internal_call=true) returns: http://localhost/nagiosxi/backend/
Ping Test localhost
Running:
/bin/ping -c 3 localhost 2>&1
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.031 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.036 ms
64 bytes from localhost (127.0.0.1): icmp_seq=3 ttl=64 time=0.026 ms
--- localhost ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.026/0.031/0.036/0.004 ms
Test wget To localhost
WGET From URL: http://localhost/nagiosxi/includes/components/ccm/
Running:
/usr/bin/wget http://localhost/nagiosxi/includes/components/ccm/
--2018-09-06 09:26:47-- http://localhost/nagiosxi/includes/components/ccm/
Resolving localhost... ::1, 127.0.0.1
Connecting to localhost|::1|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://localhost/nagiosxi/login.php?redirect=/nagiosxi/includes/components/ccm/index.php%3f&noauth=1 [following]
--2018-09-06 09:26:47-- http://localhost/nagiosxi/login.php?redirect=/nagiosxi/includes/components/ccm/index.php%3f&noauth=1
Connecting to localhost|::1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: "/usr/local/nagiosxi/tmp/ccm_index.tmp"
0K .......... ......... 541K=0.04s
2018-09-06 09:26:47 (541 KB/s) - "/usr/local/nagiosxi/tmp/ccm_index.tmp" saved [20452]
Network Settings
1: lo: mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: mtu 1500 qdisc mq master bond0 state UP qlen 1000
link/ether d8:9d:67:17:e7:34 brd ff:ff:ff:ff:ff:ff
3: eth1: mtu 1500 qdisc mq master bond0 state UP qlen 1000
link/ether d8:9d:67:17:e7:34 brd ff:ff:ff:ff:ff:ff
4: eth2: mtu 1500 qdisc noop state DOWN qlen 1000
link/ether d8:9d:67:17:e7:36 brd ff:ff:ff:ff:ff:ff
5: eth3: mtu 1500 qdisc noop state DOWN qlen 1000
link/ether d8:9d:67:17:e7:37 brd ff:ff:ff:ff:ff:ff
6: bond0: mtu 1500 qdisc noqueue state UP
link/ether d8:9d:67:17:e7:34 brd ff:ff:ff:ff:ff:ff
inet *************** brd *************** scope global bond0
inet6 fe80::da9d:67ff:fe17:e734/64 scope link
valid_lft forever preferred_lft forever
*************** dev bond0 proto kernel scope link src ***************
169.254.0.0/16 dev bond0 scope link metric 1006
default via *************** dev bond0
Nagios XI Components
actions 2.0.1
actionurl
alertcloud 1.2.1
alertstream 2.1.0
autodiscovery 2.2.5
backendapiurl 1.0.3
bandwidthreport 1.8.0
bbmap 1.2.0
birdseye 3.2.2
bulkmodifications 2.2.0
capacityplanning 2.3.0
ccm 2.7.0
custom-includes 1.0.4
customlogin 1.0.0
customlogo 1.2.0
deploydashboard 1.3.0
deploynotification 1.3.3
duo 1.0.0
escalationwizard 1.5.0
freevariabletab 1.0.1
globaleventhandler 1.2.2
graphexplorer 2.2.0
helpsystem 2.0.0
highcharts 4.0.1
homepagemod 1.1.7
hypermap 1.1.6
hypermap_replay 1.2.0
isms 1.2.3
latestalerts 1.2.6
ldap_ad_integration 1.1.0
massacknowledge 2.1.14
metrics 1.2.10
minemap 1.2.4
modgearman 1
nagiosbpi 2.7.1
nagioscore
nagioscorecfg
nagiosim 2.2.6
nagiosna 1.4.0
nagiosql
nagvis 2.0.0
nocscreen 1.1.2
nrdsconfigmanager 1.6.4
nxti 1.0.1
opscreen 1.8.0
perfdata
pingaction 1.1.1
pnp
profile 1.4.0
proxy 1.1.4
rdp 1.0.3
rename 1.6.0
scheduledbackups 1.2.0
scheduledreporting
similetimeline 1.5.0
snmptrapsender 1.5.5
statusmap 1.0.2
tracerouteaction 1.1.1
usermacros 1.1.0
xicore
Nagios XI Config Wizards
ec2 1.0.0
s3 1.0.0
autodiscovery 1.4.1
bpiwizard 1.1.4
bulkhostimport 2.0.4
digitalocean 1.0.0
google-cloud 1.0.0
linode 1.0.0
microsoft-azure 1.0.0
rackspace 1.0.0
dhcp 1.1.4
dnsquery 1.1.3
docker 1.0.0
domain_expiration 1.1.4
email-delivery 2.0.4
esensors_websensor 1.1.4
exchange 1.3.2
folder_watch 1.0.5
ftpserver 1.5.5
genericnetdevice 1.0.3
ldapserver 1.3.3
linux-server 1.5.5
linux_snmp 1.5.4
macosx 1.3.0
mailserver 1.2.4
mongodb_database 1.1.2
mongodbserver 1.1.2
mountpoint 1.0.2
mssql_database 1.6.2
mssql_query 1.6.4
mssql_server 1.9.1
mysqlquery 1.2.3
mysqlserver 1.3.3
nagioslogserver 1.0.5
nagiostats 1.2.3
nagiosxiserver 1.3.0
ncpa 2.0.0
nna 1.0.4
nrpe 1.5.2
oraclequery 1.3.3
oracleserverspace 1.5.3
oracletablespace 1.5.4
passivecheck 1.2.4
passiveobject 1.1.3
postgresdb 1.5.3
postgresquery 1.2.3
postgresserver 1.3.4
printer 1.1.3
radiusserver 2.0.1
sla 1.3.2
snmp 1.5.8
snmp_trap 1.5.3
snmpwalk 1.3.6
solaris 1.2.5
sshproxy 1.5.7
switch 2.4.0
tcpudpport 1.3.3
tftp 1.0.2
vmware 1.7.1
watchguard 1.4.5
website 1.3.0
website_defacement 1.1.5
websiteurl 1.3.7
webtransaction 1.2.5
windowseventlog 1.3.3
windowsserver 1.6.1
windowsdesktop 1.6.1
windowssnmp 1.5.1
windowswmi 2.1.0
Nagios XI Dashlets
alertcloud
bbmap
capacityplanning
graphexplorer
hypermap
latestalerts
metrics
metricsguage
minemap
xicore_xi_news_feed
xicore_getting_started
xicore_admin_tasks
xicore_eventqueue_chart
xicore_component_status
xicore_server_stats
xicore_monitoring_stats
xicore_monitoring_perf
xicore_monitoring_process
xicore_perfdata_chart
xicore_host_status_summary
xicore_service_status_summary
xicore_comments
xicore_hostgroup_status_overview
xicore_hostgroup_status_grid
xicore_servicegroup_status_overview
xicore_servicegroup_status_grid
xicore_hostgroup_status_summary
xicore_servicegroup_status_summary
xicore_available_updates
xicore_network_outages
xicore_network_outages_summary
xicore_network_health
xicore_host_status_tac_summary
xicore_service_status_tac_summary
xicore_feature_status_tac_summary
availability
custom_dashlet 1.0.5
gauges 1.2.2
googlemapdashlet 1.1.0
internettrafficreport
rss_dashlet 1.1.0
sansrisingports 2.0
sla
statusinfo 2016-08-22
text 2011-11-30
worldtimeserver 2.0.0
What else should I try? What are your recommendations for rectifying this problem? Should I just downgrade until this can be fixed?
(I'll entertain the idea of applying a config from before that date, but it looks like I would have to downgrade.)
Last night I also noticed the 5.5.3 upgrade, which I applied hoping it would lower the load. It didn't help!
npolovenko - Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm
Re: 5.5.1 Httpd high load
@tylerhoadley, would you be able to open a support ticket for this issue? We could schedule a remote session to take a look at your server.
https://support.nagios.com/tickets/
Also, I'd like to see your system profile. To send it to us, log in to the Nagios XI GUI using a web browser, then:
Click the "Admin" > "System Profile" Menu
Click the "Download Profile" button
Save the profile.zip file and upload it in the ticket.
Also, please share the /etc/php.ini file.
Thank you.
tylerhoadley
Posts: 43
Joined: Tue Jul 02, 2013 1:41 pm
Re: 5.5.1 Httpd high load
I stabilized the system by increasing thread_cache_size in MySQL and tuning the query_cache* settings for queries that weren't hitting the cache. MySQL hasn't crashed since this tuning. The load is still higher than usual for both httpd and mysql, but it's stable and the web UI is usable (dashboards load relatively quickly again). I'm just glad this is on physical hardware with more resources available to allocate and consume.
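For reference, the my.cnf changes were roughly the following shape; the values below are illustrative assumptions sized against our 32 GB box, not recommendations:
Code: Select all
[mysqld]
# Keep idle threads cached for reuse so httpd connection bursts
# don't pay thread-creation cost on every request.
thread_cache_size = 64

# Enable the query cache and give the repeated nagios_objects
# SELECTs room to be served from memory instead of re-executed.
query_cache_type  = 1
query_cache_size  = 128M
query_cache_limit = 2M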
I am also in the process of recovering the 5.4.13 system onto like hardware (with a matching RAID controller) and will be flipping the RAID mirror set to roll this system back to before the upgrade and the resulting load spike. Unfortunately this has to be done (I can't make ad hoc support changes to rectify this), as we are doing a massive network core refresh this coming weekend and I can't jeopardize that work with trial-and-error troubleshooting (Nagios has to be SOLID). I have done pre-flight preparation testing on the recovered system (it's sitting hot) to ensure I bring the retention/object cache and perf data over from the current system. I feel good about this approach, with a solid backout plan of flipping the RAID sets back; I'm hoping for a few minutes' outage at most.
I want to ensure I keep retention and object states... I have already synced up my flat-file configs (the stuff that has changed over the past week or two) and imported them. I've updated the thread_cache/query_cache* settings in my.cnf for my next attempt at upgrading to 5.5.x after this weekend. I don't want any of the changes from the current MySQL data, as I will already have the logs and events in our Nagios Log Server.
On a side note... I looked throughout the whole system for changes that occurred that day, and also inspected all the logs. I even went so far as to check whether httpd traffic had increased by loading the logs into our Nagios Log Server httpd dashboard, where we plot other httpd traffic data; the trends didn't change. That was the only other explanation I couldn't eliminate at the time, since the gap between the upgrade and the load spikes was a couple of days (I would hate to blame Nagios for a spike in user activity). Trends are 1-to-1 before and after.
One question I would ask: what changed structurally within the MySQL databases in this upgrade? Could a CCM configuration from before the upgrade, reverted onto the upgraded system, cause this? I'm not 100% sure this is my case, but maybe it's a theory worth exercising. Nothing else seems relevant to the date of the spike that hasn't been touched by a newer timestamp.
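(One way to check the structural question without support, assuming the NDO database is named nagios as on a default XI install: take a schema-only dump from each version and diff them.)
Code: Select all
# Schema-only dump (no row data) from the upgraded system:
mysqldump --no-data nagios > schema-5.5.x.sql
# Do the same against a restored 5.4.13 backup, then compare:
diff schema-5.4.13.sql schema-5.5.x.sql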
Thanks,
Re: 5.5.1 Httpd high load
I don't think a reverted CCM config from before the upgrade would cause this; there really haven't been many DB structure changes.
Please create a ticket for this so we can get a remote session setup to see what we can find.
https://support.nagios.com/tickets
Re: 5.5.1 Httpd high load
We believe we have found a fix for at least some of the httpd issues we were having. Every time we would apply a config, the load would jump to 40-60+ for several minutes.
While the load is still significantly higher than before we upgraded to 5.5 (how this thread started), we are no longer seeing huge load spikes when applying configs. We checked the arguments of the processes that seemed to be using the most resources at restart (the ones we were killing to get Nagios working again) and found they were all calling the same script.
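(For anyone retracing this, a sketch of how to pull those arguments; the PID below is a placeholder from our top output.)
Code: Select all
# Show the full command line of a hot process:
ps -o pid,etime,args -p 14148
# Or read the raw argument vector straight from /proc:
tr '\0' ' ' < /proc/14148/cmdline; echo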
The original caller was check_bpi.php, which invoked an api_tool script located at "/usr/local/nagiosxi/html/includes/components/nagiosbpi/api_tool.php". In that file we found the following snippet:
Code: Select all
// Verify that ndo is running... try waiting up to 30 seconds for it
$timeout = 30;
$i = 0;
$is_running = false;
do {
    // Verify ndo and core are running and loaded
    $data = get_program_status_xml_output(array(), true);
    if ($data['is_currently_running'] == 1) {
        $is_running = true;
    }
} while (!$is_running && $i < $timeout);
The code above suggests there is a 30-second timer, and some code for it exists, but it does not appear to be actually implemented: $i is never incremented, so the loop will call get_program_status_xml_output indefinitely until it receives the expected result. We also found a flood of XML parsing errors in the Apache error logs (which we have yet to look into); this may explain why it actually did go on indefinitely.
By changing the code above to implement the 30-second timeout, with a 1-second interval between attempts, we have again achieved normal restarts with prompt recovery.
Here is the final code:
Code: Select all
// Verify that ndo is running... try waiting up to 30 seconds for it
$timeout = 30;
$i = 0;
$is_running = false;
do {
    // Verify ndo and core are running and loaded
    $data = get_program_status_xml_output(array(), true);
    if ($data['is_currently_running'] == 1) {
        $is_running = true;
    } else {
        $i++;
        sleep(1);
    }
} while (!$is_running && $i < $timeout);
Hopefully this helps someone experiencing the same issue, where applying configurations was essentially debilitating.
scottwilkerson - DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises
Re: 5.5.1 Httpd high load
Thanks for sharing your findings. This would indeed be a problem on Apply Config if you had BPI Sync turned on (the default).
We have added the fix to the next version of XI.