Nagios process at 100% CPU, system is crawling

Postby consulvation » Wed Jul 25, 2018 9:29 am

We ran Nagios 3.x for a few years with no performance issues at all. We recently built a new Nagios machine and installed a fresh copy of 4.4.1. It worked very well for about a week, then started crawling and getting slower and slower. We restarted the process and then the machine itself. We also increased the logging interval, to no avail.

We ran into this the first time we built the 4.x version. We recompiled a few times with different flags and thought that was the cause, so we removed the whole installation and did a fresh compile (again). Everything seemed really fast for about a week, which brings us to where we are now: no matter what we do, it crawls.

If I stop the nagios service, performance returns to normal. When I run top, the nagios process constantly shows 95-100% CPU. It takes anywhere from 5-10 minutes to load a page, if it ever gets there. We monitor about 60 devices and about 600 services, a pretty small universe for now, and I don't want to expand it until we work this out.

We really like Nagios and want to get this working. I have read through other threads, and none of the suggestions there helped in our situation. I am hoping someone can help us look in the right direction to solve this. Thanks!
consulvation
 
Posts: 16
Joined: Thu Jul 19, 2018 10:07 pm

Re: Nagios process at 100% CPU, system is crawling

Postby npolovenko » Wed Jul 25, 2018 4:21 pm

Hello, @consulvation. What are the hardware specs of the Nagios server? How much memory does it have, and how many CPU cores? How many users access the web interface at the same time? Are most of these users administrators, or do they have limited access to hosts/services?
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
npolovenko
Support Tech
 
Posts: 2504
Joined: Mon May 15, 2017 5:00 pm

Re: Nagios process at 100% CPU, system is crawling

Postby consulvation » Wed Jul 25, 2018 4:37 pm

Hi,

Thanks for the response. Here is the info you requested:

How many users are accessing the web interface at the same time? 3-4 Max

Are most of these users administrators or they have a limited access to hosts/services? Admins

Code:
sysadmin@monitoring:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:            11G        5.8G        1.1G         18M        4.7G        5.5G
Swap:          3.9G        498M        3.4G


sysadmin@monitoring:~$ lspci
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor DRAM Controller (rev 09)
00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor Graphics Controller (rev 09)
00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)
00:16.0 Communication controller: Intel Corporation 7 Series/C210 Series Chipset Family MEI Controller #1 (rev 04)
00:1a.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #2 (rev 04)
00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset Family High Definition Audio Controller (rev 04)
00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 1 (rev c4)
00:1c.6 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 7 (rev c4)
00:1d.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #1 (rev 04)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev a4)
00:1f.0 ISA bridge: Intel Corporation B75 Express Chipset LPC Controller (rev 04)
00:1f.2 SATA controller: Intel Corporation 7 Series/C210 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)
00:1f.3 SMBus: Intel Corporation 7 Series/C210 Series Chipset Family SMBus Controller (rev 04)
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5761 Gigabit Ethernet PCIe (rev 10)
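
The lspci listing identifies the chipset but does not state the core count that was asked about. A minimal sketch for collecting both figures on a Linux host, assuming /proc/meminfo is readable; the function names are illustrative and not part of Nagios:

```python
import os


def parse_meminfo(text):
    """Return the numeric fields of /proc/meminfo text (most values are in kB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def hardware_summary(meminfo_text):
    """Summarise the logical CPU count and total memory in GiB."""
    mem = parse_meminfo(meminfo_text)
    return {
        "logical_cpus": os.cpu_count(),
        "mem_total_gib": round(mem.get("MemTotal", 0) / (1024 * 1024), 1),
    }
```

Passing the contents of /proc/meminfo to hardware_summary() would give the two numbers in one line of output. Note that os.cpu_count() reports logical CPUs, which may differ from physical cores when hyper-threading is enabled.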

Re: Nagios process at 100% CPU, system is crawling

Postby scottwilkerson » Wed Jul 25, 2018 5:18 pm

4.4.1 has a known bug that causes hosts/services to stay in a soft state, which causes rechecks to happen more frequently.

Most of this is resolved in the maint branch:
https://github.com/NagiosEnterprises/na ... tree/maint

Also, you mentioned you increased the logging interval. I would return this to normal; you do not want more logging on an already taxed system.
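
To see why objects stuck in a soft state can load the scheduler, compare check rates at the normal and retry intervals. This is only a rough sketch: the 5-minute check_interval and 1-minute retry_interval are assumed example values, not this site's configuration, and the 660 objects approximate the 60 devices and 600 services mentioned earlier in the thread.

```python
def checks_per_minute(object_count, interval_minutes):
    """Average number of checks scheduled per minute for objects on one interval."""
    return object_count / interval_minutes


def load_multiplier(check_interval, retry_interval):
    """How many times more often an object is checked while it stays in a soft state."""
    return check_interval / retry_interval


# Assumed example values, not taken from the poster's configuration.
OBJECTS = 660          # roughly 60 hosts + 600 services
CHECK_INTERVAL = 5.0   # minutes between checks in a hard state
RETRY_INTERVAL = 1.0   # minutes between rechecks in a soft state

normal_rate = checks_per_minute(OBJECTS, CHECK_INTERVAL)
soft_rate = checks_per_minute(OBJECTS, RETRY_INTERVAL)
print(f"normal: {normal_rate:.0f}/min, all soft: {soft_rate:.0f}/min, "
      f"x{load_multiplier(CHECK_INTERVAL, RETRY_INTERVAL):.0f}")
```

With these assumed numbers the check rate rises fivefold if every object is held in a soft state, before counting any time spent waiting on checks that time out.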
scottwilkerson
DevOps Engineer
 
Posts: 12598
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Nagios process at 100% CPU, system is crawling

Postby consulvation » Thu Jul 26, 2018 9:26 am

Thanks for the information. I looked at my var/nagios.log file and see that every alert is in a soft state. I am also seeing a ton of messages like these for almost everything:

[1532614055] Warning: Check of host '*******************************' timed out after 31.03 seconds
[1532614055] wproc: Core Worker 21174: job 19321 (pid=18770): Dormant child reaped
[1532614055] wproc: Core Worker 21176: job 19321 (pid=18771) timed out. Killing it
[1532614055] wproc: CHECK job 19321 from worker Core Worker 21176 timed out after 31.02s
[1532614055] wproc: host=********************************; service=(null);
[1532614055] wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
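
A minimal sketch for counting how often each host appears in timeout warnings like the one above. It matches only the "Check of host ... timed out" line format shown in this post; other message formats are not assumed, and the names here are illustrative.

```python
import re
from collections import Counter

# Matches lines such as:
# [1532614055] Warning: Check of host 'example' timed out after 31.03 seconds
TIMEOUT_RE = re.compile(
    r"^\[(?P<ts>\d+)\] Warning: Check of host '(?P<host>[^']*)' "
    r"timed out after (?P<secs>[\d.]+) seconds"
)


def host_timeouts(lines):
    """Count host-check timeout warnings per host name in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.match(line.strip())
        if match:
            counts[match.group("host")] += 1
    return counts
```

Feeding it the open log file (for example /usr/local/nagios/var/nagios.log, an assumed path based on the install prefix seen elsewhere in this thread) and calling .most_common(10) on the result would show whether the timeouts are spread across everything or concentrated on a few hosts.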

As for the log interval, I don't think I explained it clearly. I increased the interval value, not the frequency. I believe the default is 10 seconds, and I increased it to 30 seconds so the status data would update less often. This initially helped reduce our host and service latency from 5-15s down to almost 0s, for a little while at least. I don't know what the latency is now, because the web interface no longer responds. Here is the setting I was speaking of:

# STATUS FILE UPDATE INTERVAL
# This option determines the frequency (in seconds) that
# Nagios will periodically dump program, host, and
# service status data.

status_update_interval=30
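
A minimal sketch for confirming which value of a directive such as status_update_interval is actually set in the main config file. It assumes the plain key=value layout with # comments shown above; read_directives is a hypothetical helper name, not a Nagios tool.

```python
def read_directives(text):
    """Parse key=value lines from a Nagios main config, skipping comments and blanks."""
    directives = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            # Directives such as cfg_file may repeat, so keep every value.
            directives.setdefault(key.strip(), []).append(value.strip())
    return directives
```

Reading nagios.cfg into this function and looking up "status_update_interval" would show every value present, which helps catch a directive that was set twice by accident.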


As for the maintenance branch, I presume I download it and run ./configure, make, and then make install? Do I do this over what is there, or should I remove the existing directories first? Are there any compile flags that I should be using? Thanks for your help.

Re: Nagios process at 100% CPU, system is crawling

Postby scottwilkerson » Thu Jul 26, 2018 10:29 am

Yes, this is the effect.

consulvation wrote: As for the maintenance branch, I presume I download and do a ./configure, make and then a make install? Do I do this over what is there or should I remove the existing directories? Are there any compile flags that I should be using? Thanks for your help.


This sounds correct:
Code:
./configure
make all
make install
service nagios restart


If you used a different command group in the past, you may want to change the configure step to something like this:
Code:
./configure --with-command-group=nagcmd

Re: Nagios process at 100% CPU, system is crawling

Postby consulvation » Thu Jul 26, 2018 4:08 pm

We applied the maint branch to our install, but it doesn't seem to have improved anything. The machine still crawls while the nagios service runs. Should we wait a few cycles for it to catch up? Is there a way to verify that the issue resolved in the maintenance branch is actually what is happening on our setup? I have no idea of the status, latency, or any other stats, since the web interface times out before it loads. Thanks.
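
One way to look at state types and latency without the web interface is to read the status file directly. A minimal sketch, assuming the status file uses the usual "hoststatus { ... }" and "servicestatus { ... }" block layout with state_type (0 = soft, 1 = hard) and check_latency fields; the function names are illustrative.

```python
def parse_status_blocks(text):
    """Yield (block_type, fields) for each 'name {' ... '}' block in status file text."""
    block_type, fields = None, {}
    for raw in text.splitlines():
        line = raw.strip()
        if block_type is None:
            if line.endswith("{"):
                block_type, fields = line[:-1].strip(), {}
        elif line == "}":
            yield block_type, fields
            block_type = None
        elif "=" in line:
            key, _, value = line.partition("=")
            fields[key] = value


def soft_state_summary(text):
    """Count host/service entries by state type and average their check latency."""
    soft = hard = 0
    latencies = []
    for block_type, fields in parse_status_blocks(text):
        if block_type not in ("hoststatus", "servicestatus"):
            continue
        # state_type is 0 for SOFT and 1 for HARD.
        if fields.get("state_type") == "0":
            soft += 1
        else:
            hard += 1
        if "check_latency" in fields:
            latencies.append(float(fields["check_latency"]))
    mean_latency = sum(latencies) / len(latencies) if latencies else 0.0
    return {"soft": soft, "hard": hard, "mean_latency": mean_latency}
```

The file to read is whatever status_file points to in nagios.cfg (often /usr/local/nagios/var/status.dat on a source install, but that path is an assumption here). A large soft count that does not shrink over several check cycles would be consistent with the soft-state bug described earlier. The nagiostats utility that ships with Nagios Core reports similar latency figures from the command line.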

Re: Nagios process at 100% CPU, system is crawling

Postby scottwilkerson » Thu Jul 26, 2018 4:31 pm

consulvation wrote: Should we wait a few cycles for it to catch up?


Yes, and can you also send back the output of the following command?
Code:
ps -ef|grep nagios.cfg

Re: Nagios process at 100% CPU, system is crawling

Postby consulvation » Thu Jul 26, 2018 5:57 pm

Here are the results. It's been about 4-5 hours since applying the maint code:

Code:
ps -ef|grep nagios.cfg
nagios     940     1 47 16:51 ?        00:59:20 /usr/local/nagios/bin/nagios /usr/local/nagios/etc/nagios.cfg
nagios    1747   940  0 16:53 ?        00:00:03 /usr/local/nagios/bin/nagios /usr/local/nagios/etc/nagios.cfg
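
In this output the first row (PID 940, parent PID 1) is the main daemon and the second (parent PID 940) is its child. A minimal sketch for turning such rows into numbers, assuming the procps `ps -ef` column order UID, PID, PPID, C, STIME, TTY, TIME, CMD, where C is processor utilization and TIME is cumulative CPU time. The helper names are illustrative only.

```python
def cpu_time_seconds(value):
    """Convert a ps TIME value ([DD-]HH:MM:SS) to seconds."""
    days = 0
    if "-" in value:
        day_part, value = value.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def parse_ps_ef_line(line):
    """Split one `ps -ef` row into its eight standard columns."""
    uid, pid, ppid, c, stime, tty, cpu_time, cmd = line.split(None, 7)
    return {
        "uid": uid,
        "pid": int(pid),
        "ppid": int(ppid),
        "c": int(c),
        "stime": stime,
        "tty": tty,
        "cpu_seconds": cpu_time_seconds(cpu_time),
        "cmd": cmd,
    }
```

Applied to the first row above, this gives a C value of 47 and 3560 seconds of accumulated CPU time for the main daemon, against 3 seconds for its child.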

Re: Nagios process at 100% CPU, system is crawling

Postby scottwilkerson » Fri Jul 27, 2018 8:59 am

Are you still seeing the nagios process at 100%?