
Nagios 4.2.0 status.cgi is really slow and uses 100% CPU

Posted: Tue Aug 30, 2016 7:33 am
by GMont
Hi all,
I'm trying to upgrade from Nagios 3.5 to 4.2.0. The environment is this:

- Virtual machine (VMware)
- Red Hat Enterprise Linux 6.7
- SELinux: disabled
- using gearman from the ConSol Labs repositories

After cloning the Nagios 3.5 VM, I built Nagios 4.2.0 from source with rpmbuild, upgraded the RPMs, and tested with a small subset
of hosts/services (just a few). That appeared to work fine.
With the full hosts/services configuration (~2600 hosts, ~25000 services), however, status.cgi uses 100% CPU
and takes 15s to 30s to complete.

status.dat is ~35MB big.

I also built a VM with CentOS and only the Nagios 4 packages (no gearman), and disabled active checks completely, so that the VM was otherwise idle
and the CGI was the only running process. I get similar timings:

Active Host / Service Checks: 2629 / 24186

Code: Select all

-sh-4.1$ ls -l /var/log/nagios/status.dat
-rw-rw-r-- 1 nagios nagios 33936069 Aug 30 11:08 /var/log/nagios/status.dat

-sh-4.1$ export REQUEST_METHOD=GET; export QUERY_STRING="host=all"; export REMOTE_USER="nagiosadmin"
-sh-4.1$ for i in 1 2 3 4 5 6; do time /usr/lib64/nagios/cgi/status.cgi > /dev/null; done

real 0m16.534s
user 0m16.428s
sys 0m0.092s

real 0m17.675s
user 0m17.527s
sys 0m0.116s
                                                                                                                                                                                                                                                                                           
real 0m25.333s
user 0m25.123s
sys 0m0.109s

real 0m21.453s
user 0m21.333s
sys 0m0.099s

real 0m17.910s
user 0m17.812s
sys 0m0.081s

real 0m16.212s
user 0m16.115s
sys 0m0.082s
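
For anyone repeating this measurement, the loop above can be wrapped in a small helper that prints the mean wall-clock time instead of one block of output per run. This is only a sketch under a few assumptions: mean_runtime is a hypothetical name (not part of Nagios), and it relies on GNU date (for %N) and awk being installed.

```shell
# mean_runtime N CMD [ARGS...]
# Runs CMD N times, discards its output, and prints the mean wall-clock
# time in seconds. Hypothetical helper; assumes GNU date and awk.
mean_runtime() {
    # Require a run count of at least 1 before touching the arguments.
    [ "${1:-0}" -ge 1 ] 2>/dev/null || {
        echo "usage: mean_runtime N CMD [ARGS...]" >&2
        return 2
    }
    mr_runs=$1
    shift
    mr_total=0
    mr_i=0
    while [ "$mr_i" -lt "$mr_runs" ]; do
        mr_start=$(date +%s.%N)     # seconds.nanoseconds since the epoch
        "$@" > /dev/null 2>&1
        mr_end=$(date +%s.%N)
        # Accumulate the elapsed time of this run in awk (floating point).
        mr_total=$(awk -v t="$mr_total" -v s="$mr_start" -v e="$mr_end" \
            'BEGIN { printf "%.6f", t + e - s }')
        mr_i=$((mr_i + 1))
    done
    awk -v t="$mr_total" -v n="$mr_runs" 'BEGIN { printf "%.3f\n", t / n }'
}

# Same CGI environment as in the manual test above.
export REQUEST_METHOD=GET QUERY_STRING="host=all" REMOTE_USER="nagiosadmin"

# Example call (path as used in this post):
# mean_runtime 6 /usr/lib64/nagios/cgi/status.cgi
```

The helper only measures elapsed time. To see whether the run is CPU-bound, the user/sys split from time is still needed.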


After moving status.dat to a filesystem in RAM, timings did not change either, so I'm assuming this is not an I/O issue:

Code: Select all

                                                                                                                                                                                                                                                                                     
-sh-4.1$ for i in 1 2 3 4 5; do time /usr/lib64/nagios/cgi/status.cgi > /dev/null; done

real 0m15.839s
user 0m15.761s
sys 0m0.065s

real 0m17.229s
user 0m17.147s
sys 0m0.065s
                                                                                                                                                                                                                                                                                           
real 0m18.395s
user 0m18.271s
sys 0m0.099s

real 0m28.587s
user 0m28.249s
sys 0m0.089s

real 0m16.609s
user 0m16.520s
sys 0m0.077s

These are the timings for the current Nagios 3.5 installation (same configuration; the Nagios 4 VM is a clone of
the original VM):

Code: Select all

 for i in 1 2 3 4 5; do time /usr/lib64/nagios/cgi-bin/status.cgi > /dev/null; done

real    0m1.517s
user    0m1.472s
sys     0m0.046s

real    0m1.543s
user    0m1.493s
sys     0m0.050s

real    0m1.534s
user    0m1.486s
sys     0m0.049s

real    0m1.594s
user    0m1.523s
sys     0m0.071s

real    0m1.564s
user    0m1.508s
sys     0m0.055s

Is there a way to profile status.cgi to understand why it is so slow? I tried valgrind/callgrind,
and it seems to spend most of its time (62%) in __strcmp_sse42.
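
The callgrind run mentioned above can be reproduced roughly like this. It is a sketch, not the exact commands used for the 62% figure: profile_cgi is a hypothetical helper name, the default output path is an assumption, and it expects REQUEST_METHOD, QUERY_STRING and REMOTE_USER to be exported as in the timing test.

```shell
# profile_cgi CGI [OUTFILE]
# Runs CGI under valgrind's callgrind tool and prints the top of the
# annotated profile. Hypothetical helper; OUTFILE default is an assumption.
profile_cgi() {
    pc_cgi=$1
    pc_out=${2:-/tmp/callgrind.status.out}
    if [ ! -x "$pc_cgi" ]; then
        echo "not executable: $pc_cgi" >&2
        return 2
    fi
    if ! command -v valgrind > /dev/null 2>&1; then
        echo "valgrind not found in PATH" >&2
        return 3
    fi
    # Write the profile to a fixed file instead of callgrind.out.<pid>.
    valgrind --tool=callgrind --callgrind-out-file="$pc_out" "$pc_cgi" > /dev/null
    # --tree=caller also lists the callers of each function, which helps to
    # see which code path is responsible for the time spent in strcmp.
    callgrind_annotate --tree=caller "$pc_out" | head -n 60
}

# Example call (path as used in this post):
# profile_cgi /usr/lib64/nagios/cgi/status.cgi
```

Expect the run to take much longer than the plain timings, since callgrind slows the program down considerably.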


Thanks.

Re: Nagios 4.2.0 status.cgi is really slow and uses 100% CPU

Posted: Tue Aug 30, 2016 6:25 pm
by Box293
There is a known issue in 4.2.0 with the verify step taking a long time; this could be related.

The maint branch on GitHub has a fix for it:

https://github.com/NagiosEnterprises/na ... tree/maint

4.2.1 is due to be released early September.

Are you seeing any errors in /var/log/httpd/*_log?


Alternatively, you could try the previous version, which does not have the issue you are reporting:

https://github.com/NagiosEnterprises/na ... gios-4.1.1

Re: Nagios 4.2.0 status.cgi is really slow and uses 100% CPU

Posted: Thu Sep 01, 2016 9:54 am
by GMont
Thanks for your reply.

I took the time to test both Nagios 4.1.1 and the maint branch from GitHub. In both cases status.cgi loaded in 2s or less,
so I think this solves the problem. I will probably wait for 4.2.1 to come out before upgrading the production environment.

Re: Nagios 4.2.0 status.cgi is really slow and uses 100% CPU

Posted: Thu Sep 01, 2016 12:04 pm
by rkennedy
Glad to hear this worked out! Are we good to mark this as resolved?

Re: Nagios 4.2.0 status.cgi is really slow and uses 100% CPU

Posted: Mon Sep 12, 2016 3:45 am
by GMont
Hi,
I tried Nagios 4.2.1, and status.cgi loads in less than 2s.
I think we can consider this problem fixed.

Thanks.