Nagios process at 100% CPU, system is crawling

consulvation
Posts: 16
Joined: Thu Jul 19, 2018 10:07 pm

Nagios process at 100% CPU, system is crawling

Post by consulvation »

We have been running Nagios 3.x for a few years with no performance issues at all. We built a new Nagios machine and installed a fresh copy of 4.4.1. It worked very well for about a week, and then it started crawling and getting slower and slower. We restarted the process and the machine itself. We also increased the logging interval, to no avail.

We ran into this the first time we built the 4.x version. We had recompiled a few times with different flags and thought that was the cause, so we ended up removing the whole installation and doing a fresh compile (again). Everything seemed really fast for about a week, which is where we are now, and no matter what we do, it crawls.

If I stop the nagios service, performance returns to normal. When I run top, the nagios process constantly shows 95-100% CPU. It takes anywhere from 5-10 minutes to load a page, if it ever gets there. We monitor about 60 devices and about 600 services, a pretty small universe for now. I don't want to expand it until we work this out.

We really like Nagios and want to get this working. I have read through other threads, and none of those suggestions were helpful for our situation. I am hoping someone can help us look in the right direction to solve this. Thanks!
npolovenko
Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm

Re: Nagios process at 100% CPU, system is crawling

Post by npolovenko »

Hello, @consulvation. What are the hardware specs of the Nagios server? How much memory does it have, and how many CPU cores? How many users are accessing the web interface at the same time? Are most of these users administrators, or do they have limited access to hosts/services?
consulvation
Posts: 16
Joined: Thu Jul 19, 2018 10:07 pm

Re: Nagios process at 100% CPU, system is crawling

Post by consulvation »

Hi,

Thanks for the response. Here is the info you requested:

How many users are accessing the web interface at the same time? 3-4 Max

Are most of these users administrators or they have a limited access to hosts/services? Admins

sysadmin@monitoring:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:            11G        5.8G        1.1G         18M        4.7G        5.5G
Swap:          3.9G        498M        3.4G
sysadmin@monitoring:~$ lspci
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor DRAM Controller (rev 09)
00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor Graphics Controller (rev 09)
00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)
00:16.0 Communication controller: Intel Corporation 7 Series/C210 Series Chipset Family MEI Controller #1 (rev 04)
00:1a.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #2 (rev 04)
00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset Family High Definition Audio Controller (rev 04)
00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 1 (rev c4)
00:1c.6 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 7 (rev c4)
00:1d.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #1 (rev 04)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev a4)
00:1f.0 ISA bridge: Intel Corporation B75 Express Chipset LPC Controller (rev 04)
00:1f.2 SATA controller: Intel Corporation 7 Series/C210 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)
00:1f.3 SMBus: Intel Corporation 7 Series/C210 Series Chipset Family SMBus Controller (rev 04)
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5761 Gigabit Ethernet PCIe (rev 10)
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Nagios process at 100% CPU, system is crawling

Post by scottwilkerson »

4.4.1 has a known bug that causes hosts/services to stay in a soft state, which makes rechecks happen more frequently.
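
To illustrate why lingering soft states raise the load: while a host or service is in a soft state, Nagios schedules rechecks at the object's retry_interval rather than its normal check_interval. The sketch below is only back-of-the-envelope arithmetic. The 600 services figure comes from this thread; the 5-minute check_interval and 1-minute retry_interval are assumed values for illustration, not settings confirmed by the poster, and checks_per_minute is a hypothetical helper.

```shell
# Rough estimate of how many service checks are scheduled per minute.
# $1 = number of services, $2 = minutes between checks of each service.
checks_per_minute() {
    echo $(( $1 / $2 ))
}

# Assumed intervals (hypothetical, not taken from the poster's config):
#   checks_per_minute 600 5   # every service healthy, checked at check_interval
#   checks_per_minute 600 1   # every service stuck soft, rechecked at retry_interval
```

Under those assumptions the scheduled load would rise from 120 to 600 checks per minute if every service stayed in a soft state, before counting any time lost to checks that run until they time out.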

Most of this is resolved in the maint branch:
https://github.com/NagiosEnterprises/na ... tree/maint

Also, you mentioned that you increased the logging interval. I would return this to normal; you do not want more logging on an already taxed system.
Former Nagios employee
Creator:
ahumandesign.com
enneagrams.com
consulvation
Posts: 16
Joined: Thu Jul 19, 2018 10:07 pm

Re: Nagios process at 100% CPU, system is crawling

Post by consulvation »

Thanks for the information. I looked at my var/nagios.log file and see that every alert is in a soft state. I am also seeing a ton of these messages for almost everything.

[1532614055] Warning: Check of host '*******************************' timed out after 31.03 seconds
[1532614055] wproc: Core Worker 21174: job 19321 (pid=18770): Dormant child reaped
[1532614055] wproc: Core Worker 21176: job 19321 (pid=18771) timed out. Killing it
[1532614055] wproc: CHECK job 19321 from worker Core Worker 21176 timed out after 31.02s
[1532614055] wproc: host=********************************; service=(null);
[1532614055] wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
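
A rough way to gauge how widespread these timeouts are is to count the matching lines in the log. The sketch below assumes the message formats shown in the excerpt above; the function names are hypothetical, and the log path in the usage comment is an assumption based on a default source install.

```shell
# Count check timeout warnings in a Nagios log read from stdin.
count_check_timeouts() {
    # grep -c prints 0 but exits non-zero when nothing matches; mask that status.
    grep -c 'Warning: Check of .* timed out after' || true
}

# Count worker jobs that were killed for running past their timeout.
count_killed_jobs() {
    grep -c 'wproc: .* timed out\. Killing it' || true
}

# Example (log path is an assumption; adjust to your install):
#   count_check_timeouts < /usr/local/nagios/var/nagios.log
#   count_killed_jobs    < /usr/local/nagios/var/nagios.log
```

Comparing these counts before and after a change gives a crude signal of whether the timeouts are easing, even while the web interface is unusable.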

As for the log interval, I don't think I was clear in how I explained it. I increased the interval value, not the frequency. I believe the default is 10 seconds, and I increased it to 30 seconds so it would update less often. This initially helped reduce our host and service latency from 5-15s down to almost 0s, for a little while at least. I don't know what it's at now, because the web interface no longer responds. Here is the setting I was speaking of:

# STATUS FILE UPDATE INTERVAL
# This option determines the frequency (in seconds) that
# Nagios will periodically dump program, host, and
# service status data.

status_update_interval=30


As for the maintenance branch, I presume I download and do a ./configure, make and then a make install? Do I do this over what is there or should I remove the existing directories? Are there any compile flags that I should be using? Thanks for your help.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Nagios process at 100% CPU, system is crawling

Post by scottwilkerson »

Yes, this is the effect.
consulvation wrote: As for the maintenance branch, I presume I download and do a ./configure, make and then a make install? Do I do this over what is there or should I remove the existing directories? Are there any compile flags that I should be using? Thanks for your help.
This sounds correct:

./configure
make all
make install
service nagios restart
If you used a different command group in the past, you may want to change to something like this:

./configure --with-command-group=nagcmd
consulvation
Posts: 16
Joined: Thu Jul 19, 2018 10:07 pm

Re: Nagios process at 100% CPU, system is crawling

Post by consulvation »

We applied the maint branch to our install, but it doesn't seem to have improved anything. The machine still crawls while the nagios service runs. Should we wait a few cycles for it to catch up? Is there a way to verify that the issue resolved in the maintenance release is actually happening on our setup? I have no idea of the status, latency, or any other stats, since I can't get the web interface to load before it times out. Thanks.
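
One possible way to check for stuck soft states without the web interface is to read the status file directly. This is a minimal sketch, assuming the status file uses the usual state_type=0 (soft) and state_type=1 (hard) fields inside its hoststatus and servicestatus blocks; summarize_state_types is a hypothetical helper, and the path in the usage comment assumes a default source install.

```shell
# Summarize soft vs. hard states from a Nagios status file read from stdin.
summarize_state_types() {
    awk -F= '
        # state_type=0 is a soft state, state_type=1 is a hard state.
        /^[ \t]*state_type=/ { if ($2 == 0) soft++; else hard++ }
        END { printf "soft=%d hard=%d\n", soft, hard }
    '
}

# Example (status file path is an assumption; adjust to your install):
#   summarize_state_types < /usr/local/nagios/var/status.dat
```

If the soft count stays high across several check cycles, that would be consistent with the soft-state behavior described earlier in the thread. The nagiostats utility that ships with Nagios Core can also report check latency from the command line, which may help while the web interface is unreachable.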
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Nagios process at 100% CPU, system is crawling

Post by scottwilkerson »

consulvation wrote: Should we wait a few cycles for it to catch up?
Yes. Also, can you send back the output of the following command?

ps -ef|grep nagios.cfg
consulvation
Posts: 16
Joined: Thu Jul 19, 2018 10:07 pm

Re: Nagios process at 100% CPU, system is crawling

Post by consulvation »

Here are the results. It's been about 4-5 hours since applying the maint code:

ps -ef|grep nagios.cfg
nagios     940     1 47 16:51 ?        00:59:20 /usr/local/nagios/bin/nagios /usr/local/nagios/etc/nagios.cfg
nagios    1747   940  0 16:53 ?        00:00:03 /usr/local/nagios/bin/nagios /usr/local/nagios/etc/nagios.cfg
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Nagios process at 100% CPU, system is crawling

Post by scottwilkerson »

Are you still seeing the nagios process at 100%?