Nagios 4 Load issues

liquidcool
Posts: 59
Joined: Tue Feb 21, 2012 6:08 am

Re: Nagios 4 Load issues

Post by liquidcool »

Sorry about that. It was my copying and pasting. Nothing wrong with top.

Have you guys been able to find out what might be causing this? I will be reverting our servers back to 3.5.1 so I can carry on with the rollout of the project.

Below is another excerpt from when the issue kicked off:

Code: Select all

top - 02:08:54 up 27 days, 14:24,  4 users,  load average: 30.67, 21.28, 17.86
Tasks: 218 total,  51 running, 167 sleeping,   0 stopped,   0 zombie
Cpu(s): 69.9%us, 16.6%sy,  0.0%ni, 12.5%id,  0.8%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:   3927988k total,  3308128k used,   619860k free,   210456k buffers
Swap:  4200956k total,      260k used,  4200696k free,  2627300k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
26454 nagios    20   0 29376  12m 1108 S    6  0.3 128:17.41 nagios
12489 nagios    20   0 47380 7004 2820 R    4  0.2   0:00.13 check_snmp_port
12469 nagios    20   0 47388 7004 2820 R    4  0.2   0:00.12 check_snmp_port
12477 nagios    20   0 47380 7000 2816 R    4  0.2   0:00.12 check_snmp_port
12497 nagios    20   0 47380 7004 2820 R    4  0.2   0:00.11 check_snmp_port
12501 nagios    20   0 47384 7012 2824 R    4  0.2   0:00.11 check_snmp_port
12507 nagios    20   0 47384 7008 2820 R    3  0.2   0:00.10 check_snmp_port
12509 nagios    20   0 47384 7004 2820 R    3  0.2   0:00.08 check_snmp_port
12511 nagios    20   0 47388 7000 2816 R    3  0.2   0:00.08 check_snmp_port
12513 nagios    20   0 47380 7000 2820 R    3  0.2   0:00.08 check_snmp_port
12516 nagios    20   0 47388 7000 2816 R    3  0.2   0:00.08 check_snmp_port
12518 nagios    20   0 47388 7004 2820 R    3  0.2   0:00.08 check_snmp_port
12519 nagios    20   0 47384 7000 2816 R    3  0.2   0:00.08 check_snmp_port
12520 nagios    20   0 47384 7004 2820 R    3  0.2   0:00.08 check_snmp_port
12523 nagios    20   0 47384 7000 2816 R    3  0.2   0:00.08 check_snmp_port
12537 nagios    20   0 47380 7000 2820 R    3  0.2   0:00.08 check_snmp_port
12510 nagios    20   0 47388 7004 2820 R    2  0.2   0:00.07 check_snmp_port
12512 nagios    20   0 47384 7004 2820 R    2  0.2   0:00.07 check_snmp_port
12514 nagios    20   0 47384 7000 2816 R    2  0.2   0:00.07 check_snmp_port
12517 nagios    20   0 47380 6996 2816 R    2  0.2   0:00.07 check_snmp_port
12521 nagios    20   0 47380 7000 2820 R    2  0.2   0:00.07 check_snmp_port
12524 nagios    20   0 47388 7004 2820 R    2  0.2   0:00.07 check_snmp_port
12525 nagios    20   0 47388 7000 2816 R    2  0.2   0:00.07 check_snmp_port
12526 nagios    20   0 47384 6996 2816 R    2  0.2   0:00.07 check_snmp_port
12527 nagios    20   0 47384 7004 2820 R    2  0.2   0:00.07 check_snmp_port
12529 nagios    20   0 47384 7004 2820 R    2  0.2   0:00.07 check_snmp_port
12530 nagios    20   0 47392 6996 2820 R    2  0.2   0:00.07 check_snmp_port
12531 nagios    20   0 47380 7000 2820 R    2  0.2   0:00.07 check_snmp_port
12532 nagios    20   0 47380 6996 2816 R    2  0.2   0:00.07 check_snmp_port
12533 nagios    20   0 47380 6996 2816 R    2  0.2   0:00.07 check_snmp_port
12538 nagios    20   0 47384 7000 2820 R    2  0.2   0:00.07 check_snmp_port
12546 nagios    20   0 47384 7004 2820 R    2  0.2   0:00.07 check_snmp_port
12522 nagios    20   0 47380 6884 2720 R    2  0.2   0:00.06 check_snmp_port
12528 nagios    20   0 47384 7000 2820 R    2  0.2   0:00.06 check_snmp_port
12534 nagios    20   0 47380 6956 2788 R    2  0.2   0:00.06 check_snmp_port
12536 nagios    20   0 47388 6888 2720 R    2  0.2   0:00.06 check_snmp_port
12539 nagios    20   0 47380 6880 2716 R    2  0.2   0:00.06 check_snmp_port
12540 nagios    20   0 47380 6880 2716 R    2  0.2   0:00.06 check_snmp_port
12542 nagios    20   0 47380 6956 2788 R    2  0.2   0:00.06 check_snmp_port
12544 nagios    20   0 47384 6960 2792 R    2  0.2   0:00.06 check_snmp_port
12547 nagios    20   0 47384 6964 2792 R    2  0.2   0:00.06 check_snmp_port
12548 nagios    20   0 47384 6884 2716 R    2  0.2   0:00.06 check_snmp_port
12549 nagios    20   0 47384 6880 2716 R    2  0.2   0:00.06 check_snmp_port
12550 nagios    20   0 47384 7000 2820 R    2  0.2   0:00.06 check_snmp_port


top - 02:09:09 up 27 days, 14:24,  4 users,  load average: 23.96, 20.26, 17.57
Tasks: 176 total,   1 running, 175 sleeping,   0 stopped,   0 zombie
Cpu(s):  6.0%us,  0.8%sy,  0.0%ni, 93.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3927988k total,  3105396k used,   822592k free,   210460k buffers
Swap:  4200956k total,      260k used,  4200696k free,  2627324k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 8645 root      20   0  8876 1376  856 R    1  0.0  10:08.55 top
 6068 root      RT   0 28192 3652 2824 S    0  0.1   1:06.44 multipathd
    1 root      20   0 10388  776  640 S    0  0.0   0:25.58 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd
    3 root      RT   0     0    0    0 S    0  0.0   0:53.35 migration/0
    4 root      20   0     0    0    0 S    0  0.0   4:50.96 ksoftirqd/0
    5 root      RT   0     0    0    0 S    0  0.0   0:55.30 migration/1
    6 root      20   0     0    0    0 S    0  0.0   5:13.70 ksoftirqd/1
    7 root      RT   0     0    0    0 S    0  0.0   0:55.78 migration/2
    8 root      20   0     0    0    0 S    0  0.0   5:21.53 ksoftirqd/2
    9 root      RT   0     0    0    0 S    0  0.0   0:56.53 migration/3
   10 root      20   0     0    0    0 S    0  0.0   5:21.39 ksoftirqd/3
   11 root      20   0     0    0    0 S    0  0.0   2:12.23 events/0
   12 root      20   0     0    0    0 S    0  0.0   2:18.20 events/1
   13 root      20   0     0    0    0 S    0  0.0   2:27.06 events/2
   14 root      20   0     0    0    0 S    0  0.0   2:48.12 events/3
   15 root      20   0     0    0    0 S    0  0.0   0:00.00 cpuset
   16 root      20   0     0    0    0 S    0  0.0   0:00.00 khelper
   17 root      20   0     0    0    0 S    0  0.0   0:00.00 netns
   18 root      20   0     0    0    0 S    0  0.0   0:00.00 async/mgr
   19 root      20   0     0    0    0 S    0  0.0   0:00.00 pm
   20 root      20   0     0    0    0 S    0  0.0   0:02.90 sync_supers
   21 root      20   0     0    0    0 S    0  0.0   0:03.67 bdi-default
   22 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/0
   23 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/1
   24 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/2
   25 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/3
   26 root      20   0     0    0    0 S    0  0.0   0:00.60 kblockd/0
   27 root      20   0     0    0    0 S    0  0.0   0:01.30 kblockd/1
   28 root      20   0     0    0    0 S    0  0.0   0:00.96 kblockd/2
   29 root      20   0     0    0    0 S    0  0.0   0:00.90 kblockd/3
   30 root      20   0     0    0    0 S    0  0.0   0:00.00 kacpid
   31 root      20   0     0    0    0 S    0  0.0   0:00.00 kacpi_notify
   32 root      20   0     0    0    0 S    0  0.0   0:00.00 kacpi_hotplug
   33 root      20   0     0    0    0 S    0  0.0   0:00.00 kseriod
   38 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/0
   39 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/1
   40 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/2
   41 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/3
   42 root      20   0     0    0    0 S    0  0.0   0:00.00 khungtaskd
   43 root      20   0     0    0    0 S    0  0.0   0:39.46 kswapd0
   44 root      25   5     0    0    0 S    0  0.0   0:00.00 ksmd
   45 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/0
   46 root      20   0     0    0    0 S    0  0.0   0:00.00 aio/1

Every 2.0s: free -m                                        Tue May 27 02:09:57 2014

             total       used       free     shared    buffers     cached
Mem:          3835       3021        814          0        205       2565
-/+ buffers/cache:        250       3585
Swap:         4102          0       4102
lmiltchev
Former Nagios Staff
Posts: 13587
Joined: Mon May 23, 2011 12:15 pm

Re: Nagios 4 Load issues

Post by lmiltchev »

Have you guys been able to find out what might be causing this?
Not yet. We need to do some more digging into this and testing.
Be sure to check out our Knowledgebase for helpful articles and solutions!
liquidcool
Posts: 59
Joined: Tue Feb 21, 2012 6:08 am

Re: Nagios 4 Load issues

Post by liquidcool »

I have noticed that 4.0.7 has been released.

Has this issue been resolved in this version? I can't seem to find it specifically in the changelog.
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: Nagios 4 Load issues

Post by sreinhardt »

There have been some improvements, but I haven't seen empirical evidence that it has been fully resolved, mostly due to lack of any external testing. As lmiltchev mentioned, this one is right at the top of our list though!
Nagios-Plugins maintainer exclusively, unless you have other C language bugs with open-source nagios projects, then I am happy to help! Please pm or use other communication to alert me to issues as I no longer track the forum.
liquidcool
Posts: 59
Joined: Tue Feb 21, 2012 6:08 am

Re: Nagios 4 Load issues

Post by liquidcool »

Is there anything more I can help you guys with in testing this? Is there something specific you want me to look at? I can certainly try as much as I can with the system I have in place.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises
Contact:

Re: Nagios 4 Load issues

Post by scottwilkerson »

We are still not able to reproduce this.

I know you have mentioned that the configs are exactly the same; does that include the nagios.cfg?

Would it be possible to diff the two files?
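Something along these lines should do it (the paths below are just placeholders; point them at wherever the 3.5.1 and 4.x copies of nagios.cfg actually live):

Code: Select all

# compare the nagios.cfg used under 3.5.1 with the one used under 4.x (example paths)
diff -u /usr/local/nagios-3.5.1/etc/nagios.cfg /usr/local/nagios/etc/nagios.cfg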
Former Nagios employee
Creator:
ahumandesign.com
enneagrams.com
liquidcool
Posts: 59
Joined: Tue Feb 21, 2012 6:08 am

Re: Nagios 4 Load issues

Post by liquidcool »

Yes, the config file is exactly the same. It may be the 3.5.1 config, but it works in 4.x (I just remove whatever it complains about when I do the verify).

I have attached it for you to look at. Pretty standard really.

If you have a version of 4.x that has extra debugging that you would like me to try, I am sure I can get it in and running for you. Just shout.
Attachments
nagios.cfg
Nagios CFG File
(44.45 KiB) Downloaded 280 times
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Nagios 4 Load issues

Post by abrist »

If you have not done so, open a bug on tracker.nagios.org. It looks like I will have to spin up a SUSE VM to test . . .
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
emislivec
Posts: 52
Joined: Tue Feb 25, 2014 10:06 am

Re: Nagios 4 Load issues

Post by emislivec »

liquidcool,

We have a bug report for a very similar problem (periodic load spikes with Core 4 on SLES 11, though I'm not sure the common SLES platform is more than a coincidence): http://tracker.nagios.org/view.php?id=576 Apparently they saw this behavior with 4.0.3, but not after downgrading to 4.0.2. I've looked at the complete code diffs between the versions and nothing stands out as problematic.

As we're having difficulty reproducing this, can you try running 4.0.3 and 4.0.2 and see how they behave?

nagios.log files from when the load spikes occur could be helpful. To get more information out of Core, debug_level=28 will write debug info on the process, scheduled events, and checks to /usr/local/nagios/var/nagios.debug. The extra writes from debug logging hit performance a bit, so it's not for general use.
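For reference, a minimal sketch of the relevant nagios.cfg lines, assuming the stock file locations (debug_file controls where the output actually goes):

Code: Select all

# 28 = 4 (process info) + 8 (scheduled event info) + 16 (host/service check info)
debug_level=28
debug_verbosity=1
debug_file=/usr/local/nagios/var/nagios.debug
# optional: cap the debug file size so it cannot grow without bound
max_debug_file_size=1000000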

"ps -ef f" is also helpful, the extra 'f' gives a process tree so we can see parent process relationships easier.

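If it is useful, a small capture loop along these lines could be left running so the load and process tree around a spike get recorded (the log path and 30-second interval here are just assumptions):

Code: Select all

# append a timestamped load average and process tree snapshot every 30 seconds
while true; do
    date >> /tmp/nagios-spike.log
    cat /proc/loadavg >> /tmp/nagios-spike.log
    ps -ef f >> /tmp/nagios-spike.log
    sleep 30
done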

This may be an issue with check (re)scheduling. Looking at the last top you posted:

@ 02:08:54: 51 threads running, 167 sleeping; CPU 12.5% idle. From the processes visible, at least 43 check_snmp_port threads were running/runnable, and the sequences of consecutive pids indicate they were started in quick succession. Getting rid of those drops the instantaneous load to 8.

15 seconds later @ 02:09:09: 1 running, 175 sleeping; CPU 93.2% idle. The check_snmp_port threads have gone away.

From this it looks like the load spikes really are sharp spikes that get smoothed out when averaging over 1, 5 and 15 minutes. Does this line up with what you see in real time? Is check_snmp_port a common suspect when the load spikes, or are other checks and processes doing this too? You mentioned that despite the load, everything is working well from a service standpoint: checks completing successfully, no excessive retries or timeouts, low check latencies, etc. Is my understanding correct?

Also, are you using netsnmp in Python directly, or using an asynchronous wrapper like multicore-snmp or PyNetSNMP?
liquidcool
Posts: 59
Joined: Tue Feb 21, 2012 6:08 am

Re: Nagios 4 Load issues

Post by liquidcool »

Hi,

sorry for such a late response on this.

I looked at the bug report, and we are using the same version as the user who filed the bug: SLES 11 SP3. On some of our other servers we are running SP1, though, and we see the same issue.

I have downloaded 4.0.2 and will put it in later today, once I have pulled the necessary logs with the required debugging level as requested.

With regard to the rescheduling, I have enabled auto_reschedule_checks with an interval of 60 seconds and a window of 180 seconds. I don't think it has helped at all, as the frequency of the spikes is still an hour and 40 minutes (peak to peak).

I have also disabled use_retained_scheduling_info, but alas I have still not seen any change in the spikes.
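For reference, those two changes correspond to the following nagios.cfg directives (names as in the stock config, values as described above):

Code: Select all

auto_reschedule_checks=1
auto_rescheduling_interval=60
auto_rescheduling_window=180
use_retained_scheduling_info=0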

You are correct that the load spikes are really sharp and get smoothed out. It is almost as if Nagios is not smoothing out the frequency of the checks, but instead batching them up so that too many run at once, hence the spikes. I don't know what logic behind the scenes is meant to smooth out the checks.

check_snmp_port is just the most common check, but there are loads of others, so while it may look like that is the culprit, plenty of other checks regularly run at the same time.

Just recently (probably as a result of having added more checks to the servers, about 35,000 in total now) some of the spikes go through the roof. Loads of 200/300 have been seen. At one point the load got over 700 and was still climbing, and checks started timing out as a result. Luckily I was logged into the server at the time, so I stopped the nagios service and let it settle down. I gave the server a reboot, but the spikes still continue, along with the odd extremely high one of over 200/300.

The Python scripts use the netsnmp module compiled for Python, which is imported into the check scripts.