Nagios checks stop after 20 mins - ndo2db 99% CPU
Posted: Fri Jul 25, 2014 12:01 am
I have a Nagios 2014R1.3 that checks stop running after about 20 mins. I can restart ndo2db and they run for a while then gradually stop again. Following are commands / actions I've taken:
ulimit -a
cd /usr/local/nagiosxi/scripts
./repairmysql.sh nagiosql
./repairmysql.sh nagios
[/code]
tail -25 /usr/local/nagios/var/nagios.log
tail -25 /var/log/mysqld.log
tail -25 /var/log/httpd/error_log
Monitoring Event Queue
To be honest, I'm not convinced totally that this is a problem since the message log indicates checks are processing. It appears all 200+ checks are executed at the same time. I'll follow up in the am to see if this holds or they spread out. However I don't believe the ndo2db process should be steady at 99% CPU.
ulimit -a
Code: Select all
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 62833
max locked memory (kbytes, -l) 512
max memory size (kbytes, -m) unlimited
open files (-n) 8192
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 40960
cpu time (seconds, -t) unlimited
max user processes (-u) 8192
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited[/list]
./repairmysql.sh nagiosql
./repairmysql.sh nagios
[/code]
tail -25 /usr/local/nagios/var/nagios.log
Code: Select all
[1406259683] SERVICE ALERT: USATLVS003_172.17.0.26;Ping;CRITICAL;HARD;5;CRITICAL - 172.17.0.26: rta nan, lost 100%
[1406259686] SERVICE ALERT: USATLVS003_172.17.0.26;Memory Usage;WARNING;SOFT;3;WARNING - [Triggered by _MemUsed%>90] - Physical Memory: Total: 10GB - Used: 9.233GB (92%) - Free: 784.93MB (8%)
[1406259698] SERVICE ALERT: USATLVS002_172.17.0.25;Drive C Disk Usage;CRITICAL;SOFT;4;CRITICAL - [Triggered by _Used%>95] - C: Total=84.66GB, Used=83.72GB (98.9%), Free=0.93GB (1.1%) CRITICAL - [Triggered by _Used%>95] - F: Total=100.00GB, Used=99.01GB (99.0%), Free=0.99GB (1.0%)
[1406259735] SERVICE ALERT: USATLVS149_10.1.160.149;Memory Usage;WARNING;SOFT;3;WARNING - [Triggered by _MemUsed%>90] - Physical Memory: Total: 16GB - Used: 14.95GB (93%) - Free: 1.049GB (7%)
[1406259746] SERVICE ALERT: USATLVS003_172.17.0.26;Memory Usage;WARNING;SOFT;4;WARNING - [Triggered by _MemUsed%>90] - Physical Memory: Total: 10GB - Used: 9.233GB (92%) - Free: 785.016MB (8%)
[1406259759] SERVICE ALERT: USATLVS002_172.17.0.25;Drive C Disk Usage;CRITICAL;HARD;5;CRITICAL - [Triggered by _Used%>95] - C: Total=84.66GB, Used=83.72GB (98.9%), Free=0.93GB (1.1%) CRITICAL - [Triggered by _Used%>95] - F: Total=100.00GB, Used=99.01GB (99.0%), Free=0.99GB (1.0%)
[1406259794] SERVICE ALERT: USATLVS149_10.1.160.149;Memory Usage;WARNING;SOFT;4;WARNING - [Triggered by _MemUsed%>90] - Physical Memory: Total: 16GB - Used: 14.944GB (93%) - Free: 1.055GB (7%)
[1406259807] SERVICE ALERT: USATLVS003_172.17.0.26;Memory Usage;WARNING;HARD;5;WARNING - [Triggered by _MemUsed%>90] - Physical Memory: Total: 10GB - Used: 9.232GB (92%) - Free: 786.281MB (8%)
[1406259853] SERVICE ALERT: USATLVS149_10.1.160.149;Memory Usage;WARNING;HARD;5;WARNING - [Triggered by _MemUsed%>90] - Physical Memory: Total: 16GB - Used: 14.944GB (93%) - Free: 1.056GB (7%)
[1406259853] SERVICE NOTIFICATION: brian.hilliard;USATLVS149_10.1.160.149;Memory Usage;WARNING;xi_service_notification_handler;WARNING - [Triggered by _MemUsed%>90] - Physical Memory: Total: 16GB - Used: 14.944GB (93%) - Free: 1.056GB (7%)
[1406259853] SERVICE NOTIFICATION: crawfors;USATLVS149_10.1.160.149;Memory Usage;WARNING;xi_service_notification_handler;WARNING - [Triggered by _MemUsed%>90] - Physical Memory: Total: 16GB - Used: 14.944GB (93%) - Free: 1.056GB (7%)
[1406259853] SERVICE NOTIFICATION: sherwima;USATLVS149_10.1.160.149;Memory Usage;WARNING;xi_service_notification_handler;WARNING - [Triggered by _MemUsed%>90] - Physical Memory: Total: 16GB - Used: 14.944GB (93%) - Free: 1.056GB (7%)
[1406260453] SERVICE ALERT: mitel-3300-icp_10.1.184.5;check_mitel.pl via SNMP;UNKNOWN;SOFT;1;ERROR: No response from remote host "10.1.184.5"
[1406260627] SERVICE ALERT: mitel-3300-icp_10.1.184.5;check_mitel.pl via SNMP;OK;SOFT;2;OK - No Alarms on 10.1.184.5
[1406260808] SERVICE ALERT: USATLVS26_10.1.130.26;CPU Usage;UNKNOWN;SOFT;1;UNKNOWN - Plugin Timed out (15 sec). There are multiple possible reasons for this, some of them include - The host 10.1.130.26 might just be really busy, it might not even be running Windows.
[1406260848] SERVICE ALERT: USATLVS26_10.1.130.26;CPU Usage;OK;SOFT;2;OK (Sample Period 361 sec) - Average CPU Utilisation 2.25%
[1406261884] ndomod: Error writing to data sink! Some output may get lost...
[1406261884] ndomod: Please check remote ndo2db log, database connection or SSL Parameters
[1406261950] ndomod: Successfully reconnected to data sink! 0 items lost, 156 queued items to flush.
[1406261950] ndomod: Successfully flushed 156 queued items to data sink.
[1406262070] SERVICE ALERT: USATLVS26_10.1.130.26;CPU Usage;UNKNOWN;SOFT;1;UNKNOWN - Plugin Timed out (15 sec). There are multiple possible reasons for this, some of them include - The host 10.1.130.26 might just be really busy, it might not even be running Windows.
[1406262107] SERVICE ALERT: USATLVS26_10.1.130.26;CPU Usage;OK;SOFT;2;OK (Sample Period 360 sec) - Average CPU Utilisation 14.27%
[1406262956] Auto-save of retention data completed successfully.
[1406263330] SERVICE ALERT: USATLVS26_10.1.130.26;CPU Usage;UNKNOWN;SOFT;1;UNKNOWN - Plugin Timed out (15 sec). There are multiple possible reasons for this, some of them include - The host 10.1.130.26 might just be really busy, it might not even be running Windows.
[1406263367] SERVICE ALERT: USATLVS26_10.1.130.26;CPU Usage;OK;SOFT;2;OK (Sample Period 360 sec) - Average CPU Utilisation 2.37%
Code: Select all
140724 23:02:26 [Note] /usr/libexec/mysqld: Shutdown complete
140724 23:02:26 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
140724 23:02:28 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140724 23:02:28 InnoDB: Initializing buffer pool, size = 8.0M
140724 23:02:28 InnoDB: Completed initialization of buffer pool
140724 23:02:28 InnoDB: Started; log sequence number 0 44233
140724 23:02:28 [Note] Event Scheduler: Loaded 0 events
140724 23:02:28 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.73' socket: '/var/lib/mysql/mysql.sock' port: 3306 Source distribution
140724 23:02:43 [Note] /usr/libexec/mysqld: Normal shutdown
140724 23:02:43 [Note] Event Scheduler: Purging the queue. 0 events
140724 23:02:45 InnoDB: Starting shutdown...
140724 23:02:49 InnoDB: Shutdown completed; log sequence number 0 44233
140724 23:02:49 [Note] /usr/libexec/mysqld: Shutdown complete
140724 23:02:49 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
140724 23:02:50 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140724 23:02:50 InnoDB: Initializing buffer pool, size = 8.0M
140724 23:02:50 InnoDB: Completed initialization of buffer pool
140724 23:02:50 InnoDB: Started; log sequence number 0 44233
140724 23:02:50 [Note] Event Scheduler: Loaded 0 events
140724 23:02:50 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.73' socket: '/var/lib/mysql/mysql.sock' port: 3306 Source distributionCode: Select all
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice: Undefined index: max_latency in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 109, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice: Undefined index: avg_latency in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 110, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice: Undefined index: min_execution_time in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 111, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice: Undefined index: max_execution_time in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 112, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice: Undefined index: avg_execution_time in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 113, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice: Undefined index: min_latency in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 118, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice: Undefined index: max_latency in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 119, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice: Undefined index: avg_latency in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 120, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice: Undefined index: min_execution_time in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 121, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice: Undefined index: max_execution_time in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 122, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice: Undefined index: avg_execution_time in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 123, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice: Undefined index: min_latency in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 108, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice: Undefined index: max_latency in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 109, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice: Undefined index: avg_latency in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 110, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice: Undefined index: min_execution_time in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 111, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice: Undefined index: max_execution_time in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 112, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice: Undefined index: avg_execution_time in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 113, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice: Undefined index: min_latency in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 118, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice: Undefined index: max_latency in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 119, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice: Undefined index: avg_latency in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 120, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice: Undefined index: min_execution_time in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 121, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice: Undefined index: max_execution_time in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 122, referer: http://10.1.160.140/nagiosxi/
[Thu Jul 24 23:02:59 2014] [error] [client 10.1.159.233] PHP Notice: Undefined index: avg_execution_time in /usr/local/nagiosxi/html/includes/utils-xmlsysstat.inc.php on line 123, referer: http://10.1.160.140/nagiosxi/
which: no gnome-session in (/sbin:/usr/sbin:/bin:/usr/bin)
which: no gnome-session in (/sbin:/usr/sbin:/bin:/usr/bin)Code: Select all
top - 00:11:54 up 2:12, 1 user, load average: 1.36, 1.21, 1.10
Tasks: 183 total, 3 running, 180 sleeping, 0 stopped, 0 zombie
Cpu(s): 27.4%us, 0.2%sy, 0.0%ni, 72.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8061268k total, 1017068k used, 7044200k free, 41936k buffers
Swap: 2064376k total, 0k used, 2064376k free, 222224k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
30452 nagios 20 0 51236 2748 964 R 99.7 0.0 49:30.19 ndo2db
16869 nagios 20 0 155m 13m 2356 R 4.0 0.2 0:00.12 check_wmi_plus.
16636 postgres 20 0 232m 6736 4404 S 0.3 0.1 0:00.01 postmaster
16796 root 20 0 15028 1392 1004 R 0.3 0.0 0:00.01 top
1 root 20 0 19232 1524 1228 S 0.0 0.0 0:01.67 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 0:00.14 migration/0
4 root 20 0 0 0 0 S 0.0 0.0 0:00.01 ksoftirqd/0
5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
6 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
7 root RT 0 0 0 0 S 0.0 0.0 0:00.09 migration/1
8 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/1
9 root 20 0 0 0 0 S 0.0 0.0 0:00.01 ksoftirqd/1
10 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/1
11 root RT 0 0 0 0 S 0.0 0.0 0:00.13 migration/2
12 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
13 root 20 0 0 0 0 S 0.0 0.0 0:00.01 ksoftirqd/2
14 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/2
15 root RT 0 0 0 0 S 0.0 0.0 0:00.07 migration/3
16 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
17 root 20 0 0 0 0 S 0.0 0.0 0:00.01 ksoftirqd/3
18 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/3
19 root 20 0 0 0 0 S 0.0 0.0 0:00.19 events/0
20 root 20 0 0 0 0 S 0.0 0.0 0:00.17 events/1
21 root 20 0 0 0 0 S 0.0 0.0 0:00.16 events/2
22 root 20 0 0 0 0 S 0.0 0.0 0:02.39 events/3
23 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cgroup