Page 1 of 1

Nagios checks stop after 20 mins - ndo2db 99% CPU

Posted: Fri Jul 25, 2014 12:01 am
by mrochelle
profile.zip
I have a Nagios 2014R1.3 that checks stop running after about 20 mins. I can restart ndo2db and they run for a while then gradually stop again. Following are commands / actions I've taken:
ulimit -a

Code: Select all

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 62833
max locked memory       (kbytes, -l) 512
max memory size         (kbytes, -m) unlimited
open files                      (-n) 8192
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 40960
cpu time               (seconds, -t) unlimited
max user processes              (-u) 8192
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited[/list]
cd /usr/local/nagiosxi/scripts
./repairmysql.sh nagiosql
./repairmysql.sh nagios
[/code]

tail -25 /usr/local/nagios/var/nagios.log

Code: Select all

[1406259683] SERVICE ALERT: USATLVS003_172.17.0.26;Ping;CRITICAL;HARD;5;CRITICAL - 172.17.0.26: rta nan, lost 100%
[1406259686] SERVICE ALERT: USATLVS003_172.17.0.26;Memory Usage;WARNING;SOFT;3;WARNING - [Triggered by _MemUsed%>90] - Physical Memory: Total: 10GB - Used: 9.233GB (92%) - Free: 784.93MB (8%)
[1406259698] SERVICE ALERT: USATLVS002_172.17.0.25;Drive C Disk Usage;CRITICAL;SOFT;4;CRITICAL - [Triggered by _Used%>95] - C: Total=84.66GB, Used=83.72GB (98.9%), Free=0.93GB (1.1%)     CRITICAL - [Triggered by _Used%>95] - F: Total=100.00GB, Used=99.01GB (99.0%), Free=0.99GB (1.0%)
[1406259735] SERVICE ALERT: USATLVS149_10.1.160.149;Memory Usage;WARNING;SOFT;3;WARNING - [Triggered by _MemUsed%>90] - Physical Memory: Total: 16GB - Used: 14.95GB (93%) - Free: 1.049GB (7%)
[1406259746] SERVICE ALERT: USATLVS003_172.17.0.26;Memory Usage;WARNING;SOFT;4;WARNING - [Triggered by _MemUsed%>90] - Physical Memory: Total: 10GB - Used: 9.233GB (92%) - Free: 785.016MB (8%)
[1406259759] SERVICE ALERT: USATLVS002_172.17.0.25;Drive C Disk Usage;CRITICAL;HARD;5;CRITICAL - [Triggered by _Used%>95] - C: Total=84.66GB, Used=83.72GB (98.9%), Free=0.93GB (1.1%)     CRITICAL - [Triggered by _Used%>95] - F: Total=100.00GB, Used=99.01GB (99.0%), Free=0.99GB (1.0%)
[1406259794] SERVICE ALERT: USATLVS149_10.1.160.149;Memory Usage;WARNING;SOFT;4;WARNING - [Triggered by _MemUsed%>90] - Physical Memory: Total: 16GB - Used: 14.944GB (93%) - Free: 1.055GB (7%)
[1406259807] SERVICE ALERT: USATLVS003_172.17.0.26;Memory Usage;WARNING;HARD;5;WARNING - [Triggered by _MemUsed%>90] - Physical Memory: Total: 10GB - Used: 9.232GB (92%) - Free: 786.281MB (8%)
[1406259853] SERVICE ALERT: USATLVS149_10.1.160.149;Memory Usage;WARNING;HARD;5;WARNING - [Triggered by _MemUsed%>90] - Physical Memory: Total: 16GB - Used: 14.944GB (93%) - Free: 1.056GB (7%)
[1406259853] SERVICE NOTIFICATION: brian.hilliard;USATLVS149_10.1.160.149;Memory Usage;WARNING;xi_service_notification_handler;WARNING - [Triggered by _MemUsed%>90] - Physical Memory: Total: 16GB - Used: 14.944GB (93%) - Free: 1.056GB (7%)
[1406259853] SERVICE NOTIFICATION: crawfors;USATLVS149_10.1.160.149;Memory Usage;WARNING;xi_service_notification_handler;WARNING - [Triggered by _MemUsed%>90] - Physical Memory: Total: 16GB - Used: 14.944GB (93%) - Free: 1.056GB (7%)
[1406259853] SERVICE NOTIFICATION: sherwima;USATLVS149_10.1.160.149;Memory Usage;WARNING;xi_service_notification_handler;WARNING - [Triggered by _MemUsed%>90] - Physical Memory: Total: 16GB - Used: 14.944GB (93%) - Free: 1.056GB (7%)
[1406260453] SERVICE ALERT: mitel-3300-icp_10.1.184.5;check_mitel.pl via SNMP;UNKNOWN;SOFT;1;ERROR: No response from remote host "10.1.184.5"
[1406260627] SERVICE ALERT: mitel-3300-icp_10.1.184.5;check_mitel.pl via SNMP;OK;SOFT;2;OK - No Alarms on 10.1.184.5
[1406260808] SERVICE ALERT: USATLVS26_10.1.130.26;CPU Usage;UNKNOWN;SOFT;1;UNKNOWN - Plugin Timed out (15 sec). There are multiple possible reasons for this, some of them include - The host 10.1.130.26 might just be really busy, it might not even be running Windows.
[1406260848] SERVICE ALERT: USATLVS26_10.1.130.26;CPU Usage;OK;SOFT;2;OK (Sample Period 361 sec) - Average CPU Utilisation 2.25%
[1406261884] ndomod: Error writing to data sink!  Some output may get lost...
[1406261884] ndomod: Please check remote ndo2db log, database connection or SSL Parameters
[1406261950] ndomod: Successfully reconnected to data sink!  0 items lost, 156 queued items to flush.
[1406261950] ndomod: Successfully flushed 156 queued items to data sink.
[1406262070] SERVICE ALERT: USATLVS26_10.1.130.26;CPU Usage;UNKNOWN;SOFT;1;UNKNOWN - Plugin Timed out (15 sec). There are multiple possible reasons for this, some of them include - The host 10.1.130.26 might just be really busy, it might not even be running Windows.
[1406262107] SERVICE ALERT: USATLVS26_10.1.130.26;CPU Usage;OK;SOFT;2;OK (Sample Period 360 sec) - Average CPU Utilisation 14.27%
[1406262956] Auto-save of retention data completed successfully.
[1406263330] SERVICE ALERT: USATLVS26_10.1.130.26;CPU Usage;UNKNOWN;SOFT;1;UNKNOWN - Plugin Timed out (15 sec). There are multiple possible reasons for this, some of them include - The host 10.1.130.26 might just be really busy, it might not even be running Windows.
[1406263367] SERVICE ALERT: USATLVS26_10.1.130.26;CPU Usage;OK;SOFT;2;OK (Sample Period 360 sec) - Average CPU Utilisation 2.37%
tail -25 /var/log/mysqld.log

Code: Select all

140724 23:02:26 [Note] /usr/libexec/mysqld: Shutdown complete

140724 23:02:26 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
140724 23:02:28 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140724 23:02:28  InnoDB: Initializing buffer pool, size = 8.0M
140724 23:02:28  InnoDB: Completed initialization of buffer pool
140724 23:02:28  InnoDB: Started; log sequence number 0 44233
140724 23:02:28 [Note] Event Scheduler: Loaded 0 events
140724 23:02:28 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.73'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
140724 23:02:43 [Note] /usr/libexec/mysqld: Normal shutdown

140724 23:02:43 [Note] Event Scheduler: Purging the queue. 0 events
140724 23:02:45  InnoDB: Starting shutdown...
140724 23:02:49  InnoDB: Shutdown completed; log sequence number 0 44233
140724 23:02:49 [Note] /usr/libexec/mysqld: Shutdown complete

140724 23:02:49 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
140724 23:02:50 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140724 23:02:50  InnoDB: Initializing buffer pool, size = 8.0M
140724 23:02:50  InnoDB: Completed initialization of buffer pool
140724 23:02:50  InnoDB: Started; log sequence number 0 44233
140724 23:02:50 [Note] Event Scheduler: Loaded 0 events
140724 23:02:50 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.73'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
tail -25 /var/log/httpd/error_log

Code: Select all

[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice:  Undefined index: max_latency in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 109, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice:  Undefined index: avg_latency in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 110, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice:  Undefined index: min_execution_time in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 111, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice:  Undefined index: max_execution_time in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 112, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice:  Undefined index: avg_execution_time in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 113, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice:  Undefined index: min_latency in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 118, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice:  Undefined index: max_latency in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 119, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice:  Undefined index: avg_latency in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 120, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice:  Undefined index: min_execution_time in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 121, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice:  Undefined index: max_execution_time in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 122, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice:  Undefined index: avg_execution_time in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 123, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice:  Undefined index: min_latency in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 108, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice:  Undefined index: max_latency in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 109, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice:  Undefined index: avg_latency in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 110, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice:  Undefined index: min_execution_time in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 111, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice:  Undefined index: max_execution_time in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 112, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice:  Undefined index: avg_execution_time in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 113, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice:  Undefined index: min_latency in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 118, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice:  Undefined index: max_latency in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 119, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice:  Undefined index: avg_latency in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 120, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice:  Undefined index: min_execution_time in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 121, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice:  Undefined index: max_execution_time in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 122, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice:  Undefined index: avg_execution_time in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 123, referer: http://10.1.160.140/nagiosxi/
which: no gnome-session in (/sbin:/usr/sbin:/bin:/usr/bin)
which: no gnome-session in (/sbin:/usr/sbin:/bin:/usr/bin)
Monitoring Event Queue
Dashboard Capture.JPG
To be honest, I'm not convinced totally that this is a problem since the message log indicates checks are processing. It appears all 200+ checks are executed at the same time. I'll follow up in the am to see if this holds or they spread out. However I don't believe the ndo2db process should be steady at 99% CPU.

Code: Select all

top - 00:11:54 up  2:12,  1 user,  load average: 1.36, 1.21, 1.10
Tasks: 183 total,   3 running, 180 sleeping,   0 stopped,   0 zombie
Cpu(s): 27.4%us,  0.2%sy,  0.0%ni, 72.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8061268k total,  1017068k used,  7044200k free,    41936k buffers
Swap:  2064376k total,        0k used,  2064376k free,   222224k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
30452 nagios    20   0 51236 2748  964 R 99.7  0.0  49:30.19 ndo2db
16869 nagios    20   0  155m  13m 2356 R  4.0  0.2   0:00.12 check_wmi_plus.
16636 postgres  20   0  232m 6736 4404 S  0.3  0.1   0:00.01 postmaster
16796 root      20   0 15028 1392 1004 R  0.3  0.0   0:00.01 top
    1 root      20   0 19232 1524 1228 S  0.0  0.0   0:01.67 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
    3 root      RT   0     0    0    0 S  0.0  0.0   0:00.14 migration/0
    4 root      20   0     0    0    0 S  0.0  0.0   0:00.01 ksoftirqd/0
    5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
    6 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
    7 root      RT   0     0    0    0 S  0.0  0.0   0:00.09 migration/1
    8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1
    9 root      20   0     0    0    0 S  0.0  0.0   0:00.01 ksoftirqd/1
   10 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/1
   11 root      RT   0     0    0    0 S  0.0  0.0   0:00.13 migration/2
   12 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2
   13 root      20   0     0    0    0 S  0.0  0.0   0:00.01 ksoftirqd/2
   14 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/2
   15 root      RT   0     0    0    0 S  0.0  0.0   0:00.07 migration/3
   16 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/3
   17 root      20   0     0    0    0 S  0.0  0.0   0:00.01 ksoftirqd/3
   18 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/3
   19 root      20   0     0    0    0 S  0.0  0.0   0:00.19 events/0
   20 root      20   0     0    0    0 S  0.0  0.0   0:00.17 events/1
   21 root      20   0     0    0    0 S  0.0  0.0   0:00.16 events/2
   22 root      20   0     0    0    0 S  0.0  0.0   0:02.39 events/3
   23 root      20   0     0    0    0 S  0.0  0.0   0:00.00 cgroup

Re: Nagios checks stop after 20 mins - ndo2db 99% CPU

Posted: Fri Jul 25, 2014 11:39 am
by sreinhardt
I would agree that ndo should never take 99% for an extended period of time. It'd doesn't look like you have crashed tables based on the mysql log, could you share the ndo log as well, and how has it been this morning?

Re: Nagios checks stop after 20 mins - ndo2db 99% CPU

Posted: Fri Jul 25, 2014 12:21 pm
by mrochelle
Logging stop shortly after my opening this post. The last monitor update was at 2014-07-25 01:32:09 AM.
The top command has not changed with

Code: Select all

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
30452 nagios    20   0 51188 2576  968 R 99.4  0.0 766:04.20 ndo2db

Code: Select all

 tail -f /var/log/messages
Jul 25 12:15:26 nagprod01-corp ndo2db: Message sent to queue.
Jul 25 12:15:26 nagprod01-corp ndo2db: Warning: queue send error, retrying...
Jul 25 12:15:30 nagprod01-corp ndo2db: Message sent to queue.
Jul 25 12:15:30 nagprod01-corp ndo2db: Warning: queue send error, retrying...
Jul 25 12:15:34 nagprod01-corp ndo2db: Message sent to queue.
Jul 25 12:15:34 nagprod01-corp ndo2db: Warning: queue send error, retrying...
Jul 25 12:15:39 nagprod01-corp ndo2db: Message sent to queue.
Jul 25 12:15:39 nagprod01-corp ndo2db: Warning: queue send error, retrying...
Jul 25 12:15:43 nagprod01-corp ndo2db: Message sent to queue.
Jul 25 12:15:43 nagprod01-corp ndo2db: Warning: queue send error, retrying...
Jul 25 12:15:48 nagprod01-corp ndo2db: Message sent to queue.
Jul 25 12:15:48 nagprod01-corp ndo2db: Warning: queue send error, retrying...
Let me know the path of the ndo log or if you need me to turn on debug?

Also, while the nagiosXI interface has not indicated any monitor updates since 1:32AM, nagios core which I view via thruk has recent updates as indicated in /var/log/messages.

I'm going to go ahead and open a problem if anyone has time to remote in before the weekend.

Re: Nagios checks stop after 20 mins - ndo2db 99% CPU

Posted: Fri Jul 25, 2014 12:43 pm
by sreinhardt
I guess I shall lock this up and we can take the ticket when it comes in.