ndo2db Hogging ALL the CPU

Post by **mikew** » Thu Aug 21, 2014 6:12 pm

ndo2db is hogging all the CPU

Here is a line from top:
8022 nagios 20 0 51056 2324 1040 R 100.0 0.0 8:08.14 ndo2db

This now happens about 80% of the time, and nothing seems to affect it, whether I restart Nagios, mysqld, etc.

The Monitoring Engine Process seems to slowly die as the Event Queue keeps building until all checks are trying to execute at once. If I restart the monitoring engine, it levels off for 30 seconds.

No indication in logs of any issues.

System:
Red Hat 6.5 64-bit
XI 2014R1.4 (did it in 1.3 also)
6433 MB of RAM free
CPU idle 72%
Load 1.6, 1.63, 1.38

I have checked the database tables and they are OK.
I have run /usr/local/nagiosxi/scripts/repairmysql.sh nagios, which reported no issues.
I have updated ulimits:

nagios hard memlock 128
nagios soft memlock 128
nagios hard nproc 4096
nagios soft nproc 4096
root hard memlock 128
root soft memlock 128
root hard nproc 4096
root soft nproc 4096

The server was restarted.

/etc/sysctl.conf modified:
# Controls the default maximum size of a message queue
kernel.msgmnb = 131072000

# Controls the maximum size of a message, in bytes
kernel.msgmax = 131072000

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 4294967295

# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 268435456
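Since /etc/sysctl.conf was edited by hand, it can be worth confirming that the running kernel actually picked up the same values. The following is a minimal sketch, not an official Nagios XI script: `conf_settings` and `sysctl_drift` are hypothetical helper names, and the only external command assumed is the standard `sysctl -n`.

```shell
# Print "key value" for every "key = value" line read from stdin,
# skipping comments and blank lines.
conf_settings() {
  sed -e 's/#.*//' \
      -e '/=/!d' \
      -e 's/^[[:space:]]*//' \
      -e 's/[[:space:]]*$//' \
      -e 's/[[:space:]]*=[[:space:]]*/ /'
}

# Report settings in the given file whose running value differs.
# Hypothetical helper; multi-valued keys may show up as false mismatches
# because sysctl separates their values with tabs.
sysctl_drift() {
  conf_settings < "$1" | while read -r key want; do
    have=$(sysctl -n "$key" 2>/dev/null)
    [ "$have" = "$want" ] || echo "$key: file=$want running=$have"
  done
}

# Example: sysctl_drift /etc/sysctl.conf
```

No output from `sysctl_drift` would suggest the file and the running kernel agree. `ipcs -q` can then be used to look at the System V message queues and how many bytes each one currently holds.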

Post by **abrist** » Fri Aug 22, 2014 11:00 am

Mike,

Are you running livestatus or mod_gearman on this server?
How large are the db tables?
Is mysql offloaded?

Post by **gregwhite** » Fri Aug 22, 2014 12:35 pm

We were having a similar problem, along with high disk I/O. Andy suggested we implement RRDCached as one of three solutions.
We did and saw a noticeable improvement. We have yet to offload the MySQL database.

Post by **abrist** » Fri Aug 22, 2014 1:43 pm

If the tables are very large, they may cause I/O wait if the disk latency is too high.
Run top and check the I/O wait. What does it average?
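For anyone who wants a number rather than eyeballing the screen, here is a rough sketch of averaging the iowait figure from top in batch mode. `iowait_values` and `average` are hypothetical helper names; the sketch assumes the summary line contains "Cpu(s)" and a "wa" field, as procps top prints it.

```shell
# Read top "Cpu(s)" summary lines on stdin and print the iowait figure
# from each one. Handles both "0.1%wa" (older procps) and "0.1 wa".
iowait_values() {
  awk '/Cpu\(s\)/ {
    for (i = 1; i <= NF; i++) if ($i ~ /wa,?$/) {
      v = $i
      if (v ~ /^wa/) v = $(i - 1)
      sub(/%?wa,?$/, "", v)
      print v
    }
  }'
}

# Average the values printed by iowait_values.
average() {
  awk '{ sum += $1; n++ } END { if (n) printf "%.2f\n", sum / n }'
}

# Example: ten samples, two seconds apart
# top -b -d 2 -n 10 | iowait_values | average
```

Note that the first sample top prints in batch mode reflects CPU usage since boot, so a longer run gives a more representative average.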

Post by **mikew** » Fri Aug 22, 2014 3:30 pm

There is no Mod_Gearman or livestatus, and the tables are small. This server has only about 30 hosts and 300 service checks, but it is almost unusable when the CPU spikes. It makes updating the config a nightmare. MySQL is on the Nagios server.

Here is top:

top - 14:31:34 up 21:58,  1 user,  load average: 1.02, 1.03, 1.00
Tasks: 250 total,   2 running, 248 sleeping,   0 stopped,   0 zombie
Cpu(s): 25.1%us,  0.0%sy,  0.0%ni, 74.8%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8051868k total,  4821808k used,  3230060k free,   169756k buffers
Swap:  4194296k total,        0k used,  4194296k free,  3493356k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
28166 nagios    20   0 50620 1884 1036 R 99.7  0.0   1229:28 ndo2db             
19618 root      20   0 15168 1356  944 R  0.3  0.0   0:00.03 top

Post by **lmiltchev** » Mon Aug 25, 2014 10:00 am

Mike,

Are there any clues in the system log? Run the following command and show us the output:

grep ndo2db /var/log/messages

Post by **mikew** » Thu Aug 28, 2014 7:55 am

I have checked the system log with no entries indicating any problems.

I ran ndo2db in debug -1 and no information about any problems showed up.

Here is the load since upgrading! This says it all.

load4.png

Here is the load.

load.png

What is even worse, when it restarts it stops processing any checks within about 5 minutes and the queue looks like this:

que.png

Here is the log info showing workers OK and ndo2db functioning:

Aug 28 06:40:07 denvlx015 nagios: ndomod: NDOMOD 2.0.0 (02-28-2014) Copyright (c) 2009 Nagios Core Development Team and Community Contributors
Aug 28 06:40:07 denvlx015 nagios: ndomod: Successfully connected to data sink. 0 queued items to flush.
Aug 28 06:40:07 denvlx015 nagios: ndomod registered for process data
Aug 28 06:40:07 denvlx015 nagios: ndomod registered for log data'
Aug 28 06:40:07 denvlx015 nagios: ndomod registered for system command data'
Aug 28 06:40:07 denvlx015 nagios: ndomod registered for event handler data'
Aug 28 06:40:07 denvlx015 nagios: ndomod registered for notification data'
Aug 28 06:40:07 denvlx015 nagios: ndomod registered for comment data'
Aug 28 06:40:07 denvlx015 nagios: ndomod registered for downtime data'
Aug 28 06:40:07 denvlx015 nagios: ndomod registered for flapping data'
Aug 28 06:40:07 denvlx015 nagios: ndomod registered for program status data'
Aug 28 06:40:07 denvlx015 nagios: ndomod registered for host status data'
Aug 28 06:40:07 denvlx015 nagios: ndomod registered for service status data'
Aug 28 06:40:07 denvlx015 nagios: ndomod registered for adaptive program data'
Aug 28 06:40:07 denvlx015 nagios: ndomod registered for adaptive host data'
Aug 28 06:40:07 denvlx015 nagios: ndomod registered for adaptive service data'
Aug 28 06:40:07 denvlx015 nagios: ndomod registered for external command data'
Aug 28 06:40:07 denvlx015 nagios: ndomod registered for aggregated status data'
Aug 28 06:40:07 denvlx015 nagios: ndomod registered for retention data'
Aug 28 06:40:07 denvlx015 nagios: ndomod registered for contact data'
Aug 28 06:40:07 denvlx015 nagios: ndomod registered for contact notification data'
Aug 28 06:40:07 denvlx015 nagios: ndomod registered for acknowledgement data'
Aug 28 06:40:07 denvlx015 nagios: ndomod registered for state change data'
Aug 28 06:40:07 denvlx015 nagios: ndomod registered for contact status data'
Aug 28 06:40:07 denvlx015 nagios: ndomod registered for adaptive contact data'
Aug 28 06:40:07 denvlx015 nagios: Event broker module '/usr/local/nagios/bin/ndomod.o' initialized successfully.

Aug 28 06:40:07 denvlx015 nagios: wproc: Successfully registered manager as @wproc with query handler
Aug 28 06:40:07 denvlx015 nagios: wproc: Registry request: name=Core Worker 13102;pid=13102
Aug 28 06:40:07 denvlx015 nagios: wproc: Registry request: name=Core Worker 13101;pid=13101
Aug 28 06:40:07 denvlx015 nagios: wproc: Registry request: name=Core Worker 13100;pid=13100
Aug 28 06:40:07 denvlx015 nagios: wproc: Registry request: name=Core Worker 13105;pid=13105
Aug 28 06:40:07 denvlx015 nagios: wproc: Registry request: name=Core Worker 13104;pid=13104
Aug 28 06:40:07 denvlx015 nagios: wproc: Registry request: name=Core Worker 13103;pid=13103

After I posted this I checked logs again and found problems with workers:

Aug 28 06:56:56 denvlx015 nagios: wproc: Core Worker 13104: job 1767 (pid=34637) timed out. Killing it
Aug 28 06:56:56 denvlx015 nagios: wproc: CHECK job 1767 from worker Core Worker 13104 timed out after 60.01s
Aug 28 06:56:56 denvlx015 nagios: wproc: host=denlx028.dn.example.com; service=Tablespace Can Allocate Next;
Aug 28 06:56:56 denvlx015 nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Aug 28 06:56:56 denvlx015 nagios: Warning: Check of service 'Tablespace Can Allocate Next' on host 'denlx028.dn.example.com' timed out after 60.009s!
Aug 28 06:56:56 denvlx015 nagios: wproc: Core Worker 13104: job 1767 (pid=34637): Dormant child reaped
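To see whether the timeouts cluster on a few checks, the "Warning: Check of service ... timed out" lines can be tallied per host and service. This is only a sketch: `timeouts_by_service` is a hypothetical helper name, and it assumes the log lines follow the exact wording shown above.

```shell
# Read syslog lines on stdin and print a count of service check timeouts
# per "host: service", most frequent first.
timeouts_by_service() {
  sed -n "s/.*Warning: Check of service '\(.*\)' on host '\(.*\)' timed out after.*/\2: \1/p" |
    sort | uniq -c | sort -rn
}

# Example: timeouts_by_service < /var/log/messages
```

A single service dominating the list would point at that check or its target; timeouts spread evenly across many services would point more toward the monitoring server itself.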

I also noticed these PHP errors in the error_log for apache:
[Thu Aug 28 07:07:56 2014] [error] [client xxx] PHP Notice: Undefined variable: ac_needed_js_inject in /usr/local/nagiosxi/html/includes/components/ccm/page_templates/ccm_table.php on line 175, referer: http://xxx/nagiosxi/includes/components ... -index.php
[Thu Aug 28 07:07:56 2014] [error] [client xxx] PHP Notice: Undefined variable: sync_table_status in /usr/local/nagiosxi/html/includes/components/ccm/page_templates/ccm_table.php on line 195, referer: http://xxx/nagiosxi/includes/components ... -index.php

Does this make sense:
ps aux|grep ndo2db
nagios 5094 0.0 0.0 50276 1384 ? S 10:27 0:01 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
nagios 5096 96.8 0.0 51948 3072 ? R 10:27 18:50 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
nagios 42038 0.0 0.0 50276 656 ? Ss 06:29 0:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
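To pick out the busy process from a listing like this without reading it by eye, something along these lines may help. `busy_ndo2db` is a hypothetical helper name, and it assumes the standard `ps aux` column order (PID in column 2, %CPU in column 3, command in column 11).

```shell
# Read "ps aux" output on stdin and print "PID %CPU" for every ndo2db
# process using more CPU than the threshold given as the first argument.
busy_ndo2db() {
  awk -v limit="$1" '$11 ~ /(^|\/)ndo2db$/ && $3 + 0 > limit + 0 { print $2, $3 }'
}

# Example: ps aux | busy_ndo2db 50
# To also see parent PIDs and elapsed run time:
# ps -o pid,ppid,pcpu,etime,args -C ndo2db
```

The second command shows which ndo2db process is the parent of which, which may help tell whether the busy one was started by the same daemon as the others.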

Post by **lmiltchev** » Thu Aug 28, 2014 4:16 pm

Mike, could you email the profile.zip to xisupport@nagios.com?

Admin->System Profile->Download Profile

This way, we can review all of the logs/configs.

Post by **mikew** » Thu Aug 28, 2014 6:59 pm

On the way.

Post by **abrist** » Fri Aug 29, 2014 12:28 pm

profile received with ticket.
