nagios service stops unexpectedly

Posted: Tue Sep 15, 2015 7:54 am
by kendallchenoweth
I'm running Nagios XI 2014 R2.0, and several times in the last two weeks the nagios service has just stopped. There's nothing in nagios.log to indicate a problem. Please advise what information I can send you to help us debug this. Thanks!

-Kendall Chenoweth

Re: nagios service stops unexpectedly

Posted: Tue Sep 15, 2015 8:54 am
by tgriep
How many hosts and services are you monitoring?

Can you run the following commands in a shell and post the output here?

Code:

df -h
df -i
tail -100 /var/log/mysqld.log
Can you also check the /var/log/messages file for any errors that could be causing it to stop?

Re: nagios service stops unexpectedly

Posted: Thu Sep 17, 2015 8:23 am
by kendallchenoweth

Code:

root@nagios-aws-pro01 /backups/log$ df -h
Filesystem            Size  Used Avail Use% Mounted on
rootfs                 40G  5.1G   33G  14% /
udev                  7.4G  124K  7.4G   1% /dev
tmpfs                 7.4G     0  7.4G   0% /dev/shm
/dev/xvde1             40G  5.1G   33G  14% /
none                  7.4G     0  7.4G   0% /dev/shm
/dev/xvdj1             99G   14G   80G  15% /backups
root@nagios-aws-pro01 /backups/log$ df -i
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
rootfs               2621440  110152 2511288    5% /
udev                 1916337     546 1915791    1% /dev
tmpfs                1921482       1 1921481    1% /dev/shm
/dev/xvde1           2621440  110152 2511288    5% /
none                 1921482       1 1921481    1% /dev/shm
/dev/xvdj1           6553600     103 6553497    1% /backups


root@nagios-aws-pro01 /backups/log$ tail -100 /var/log/mysqld.log 
150209 15:32:20 [Note] /usr/libexec/mysqld: Shutdown complete

150209 15:32:20 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
150209 15:46:34 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
150209 15:46:35  InnoDB: Initializing buffer pool, size = 8.0M
150209 15:46:35  InnoDB: Completed initialization of buffer pool
150209 15:46:35  InnoDB: Started; log sequence number 0 44233
150209 15:46:35 [Note] Event Scheduler: Loaded 0 events
150209 15:46:35 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.73'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
150210 13:07:01 [Note] /usr/libexec/mysqld: Normal shutdown

150210 13:07:01 [Note] Event Scheduler: Purging the queue. 0 events
150210 13:07:01  InnoDB: Starting shutdown...
150210 13:07:03  InnoDB: Shutdown completed; log sequence number 0 44233
150210 13:07:03 [Note] /usr/libexec/mysqld: Shutdown complete

150210 13:07:03 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
150210 13:08:00 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
150210 13:08:00  InnoDB: Initializing buffer pool, size = 8.0M
150210 13:08:00  InnoDB: Completed initialization of buffer pool
150210 13:08:00  InnoDB: Started; log sequence number 0 44233
150210 13:08:00 [Note] Event Scheduler: Loaded 0 events
150210 13:08:00 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.73'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
150210 13:11:53 [Note] /usr/libexec/mysqld: Normal shutdown

150210 13:11:53 [Note] Event Scheduler: Purging the queue. 0 events
150210 13:11:53  InnoDB: Starting shutdown...
150210 13:11:56  InnoDB: Shutdown completed; log sequence number 0 44233
150210 13:11:56 [Note] /usr/libexec/mysqld: Shutdown complete

150210 13:11:56 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
150210 13:23:41 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
150210 13:23:42  InnoDB: Initializing buffer pool, size = 8.0M
150210 13:23:42  InnoDB: Completed initialization of buffer pool
150210 13:23:42  InnoDB: Started; log sequence number 0 44233
150210 13:23:42 [Note] Event Scheduler: Loaded 0 events
150210 13:23:42 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.73'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
150210 13:29:22 [Note] /usr/libexec/mysqld: Normal shutdown

150210 13:29:22 [Note] Event Scheduler: Purging the queue. 0 events
150210 13:29:22  InnoDB: Starting shutdown...
150210 13:29:23  InnoDB: Shutdown completed; log sequence number 0 44233
150210 13:29:23 [Note] /usr/libexec/mysqld: Shutdown complete

150210 13:29:23 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
150210 13:30:14 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
150210 13:30:14  InnoDB: Initializing buffer pool, size = 8.0M
150210 13:30:14  InnoDB: Completed initialization of buffer pool
150210 13:30:14  InnoDB: Started; log sequence number 0 44233
150210 13:30:14 [Note] Event Scheduler: Loaded 0 events
150210 13:30:14 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.73'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
150220 14:48:56 [Note] /usr/libexec/mysqld: Normal shutdown

150220 14:48:56 [Note] Event Scheduler: Purging the queue. 0 events
150220 14:48:56  InnoDB: Starting shutdown...
150220 14:48:57  InnoDB: Shutdown completed; log sequence number 0 44233
150220 14:48:57 [Note] /usr/libexec/mysqld: Shutdown complete

150220 14:48:57 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
150220 14:50:26 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
150220 14:50:26  InnoDB: Initializing buffer pool, size = 8.0M
150220 14:50:26  InnoDB: Completed initialization of buffer pool
150220 14:50:26  InnoDB: Started; log sequence number 0 44233
150220 14:50:26 [Note] Event Scheduler: Loaded 0 events
150220 14:50:26 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.73'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
150522 13:21:02 [Note] /usr/libexec/mysqld: Normal shutdown

150522 13:21:02 [Note] Event Scheduler: Purging the queue. 0 events
150522 13:21:02  InnoDB: Starting shutdown...
150522 13:21:03  InnoDB: Shutdown completed; log sequence number 0 44233
150522 13:21:03 [Note] /usr/libexec/mysqld: Shutdown complete

150522 13:21:03 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
150522 13:22:34 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
150522 13:22:35  InnoDB: Initializing buffer pool, size = 8.0M
150522 13:22:35  InnoDB: Completed initialization of buffer pool
150522 13:22:35  InnoDB: Started; log sequence number 0 44233
150522 13:22:35 [Note] Event Scheduler: Loaded 0 events
150522 13:22:35 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.73'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
150702 13:06:10 [Note] /usr/libexec/mysqld: Normal shutdown

150702 13:06:10 [Note] Event Scheduler: Purging the queue. 0 events
150702 13:06:10  InnoDB: Starting shutdown...
150702 13:06:11  InnoDB: Shutdown completed; log sequence number 0 44233
150702 13:06:11 [Note] /usr/libexec/mysqld: Shutdown complete

150702 13:06:11 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
150702 13:08:26 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
150702 13:08:27  InnoDB: Initializing buffer pool, size = 8.0M
150702 13:08:27  InnoDB: Completed initialization of buffer pool
150702 13:08:27  InnoDB: Started; log sequence number 0 44233
150702 13:08:27 [Note] Event Scheduler: Loaded 0 events
150702 13:08:27 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.73'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution

Re: nagios service stops unexpectedly

Posted: Thu Sep 17, 2015 9:23 am
by kendallchenoweth
110 servers
423 services

Re: nagios service stops unexpectedly

Posted: Thu Sep 17, 2015 9:38 am
by kendallchenoweth
I have noticed that the times when the service stops running correlate with a spike in incoming network traffic together with a spike in paging and context switches.

I’m continuing to investigate, but I think the spike in network traffic is causing the paging and context switches, which in turn are terminating the Nagios process.

Have you seen this before? Are there any tunable parameters for the software? I've included some specs for the Amazon virtual system below.

Thanks!

Code:

root@nagios-aws-prod00 /var/log$ free
             total       used       free     shared    buffers     cached
Mem:      15371860   15082352     289508          0     165020    1746708
-/+ buffers/cache:   13170624    2201236
Swap:            0          0          0

root@nagios-aws-prod00 /var/log$  cat /proc/cpuinfo | grep vendor | uniq
vendor_id	: GenuineIntel
root@nagios-aws-prod00 /var/log$ cat /proc/cpuinfo | grep 'model name' | uniq
model name	: Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
root@nagios-aws-prod00 /var/log$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    1
CPU socket(s):         4
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2500.074
BogoMIPS:              5000.14
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-3

root@nagios-aws-prod00 /var/log$  lscpu | grep -i mhz
CPU MHz:               2500.074
root@nagios-aws-prod00 /var/log$ cat /proc/cpuinfo | grep -i mhz | uniq
cpu MHz		: 2500.074

root@nagios-aws-prod00 /var/log$ cat /proc/cpuinfo | grep processor
processor	: 0
processor	: 1
processor	: 2
processor	: 3

root@nagios-aws-prod00 /var/log$ cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 62
model name	: Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
stepping	: 4
cpu MHz		: 2500.074
cache size	: 25600 KB
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu de tsc msr pae cx8 cmov pat clflush mmx fxsr sse sse2 ss ht syscall nx lm rep_good aperfmperf unfair_spinlock pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic popcnt aes rdrand hypervisor lahf_lm ida arat epb pln pts dts fsgsbase erms
bogomips	: 5000.14
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 62
model name	: Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
stepping	: 4
cpu MHz		: 2500.074
cache size	: 25600 KB
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu de tsc msr pae cx8 cmov pat clflush mmx fxsr sse sse2 ss ht syscall nx lm rep_good aperfmperf unfair_spinlock pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic popcnt aes rdrand hypervisor lahf_lm ida arat epb pln pts dts fsgsbase erms
bogomips	: 5000.14
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 2
vendor_id	: GenuineIntel
cpu family	: 6
model		: 62
model name	: Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
stepping	: 4
cpu MHz		: 2500.074
cache size	: 25600 KB
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu de tsc msr pae cx8 cmov pat clflush mmx fxsr sse sse2 ss ht syscall nx lm rep_good aperfmperf unfair_spinlock pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic popcnt aes rdrand hypervisor lahf_lm ida arat epb pln pts dts fsgsbase erms
bogomips	: 5000.14
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 62
model name	: Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
stepping	: 4
cpu MHz		: 2500.074
cache size	: 25600 KB
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu de tsc msr pae cx8 cmov pat clflush mmx fxsr sse sse2 ss ht syscall nx lm rep_good aperfmperf unfair_spinlock pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic popcnt aes rdrand hypervisor lahf_lm ida arat epb pln pts dts fsgsbase erms
bogomips	: 5000.14
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

Re: nagios service stops unexpectedly

Posted: Thu Sep 17, 2015 4:49 pm
by jdalrymple
Where is all your memory hanging out?

Code:

 ps -aux --sort -rss | head -30
I don't know that it's related, but 16GB of memory use on a system monitoring only 110 hosts sounds a bit ridiculous. Maybe we'll get some clues anyway.

Re: nagios service stops unexpectedly

Posted: Thu Sep 17, 2015 5:34 pm
by kendallchenoweth
root@nagios-aws-pro01 ~$ ps -aux --sort -rss | head -30
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
apache    8312  0.0  0.2 468044 42976 ?        S    Sep15   2:32 /usr/sbin/httpd
apache   20194  0.0  0.2 469788 42804 ?        S    Sep15   1:28 /usr/sbin/httpd
apache   12578  0.1  0.2 467724 42800 ?        S    Sep15   4:28 /usr/sbin/httpd
apache    8245  0.0  0.2 467736 42776 ?        S    Sep15   2:28 /usr/sbin/httpd
apache    8538  0.0  0.2 467708 42712 ?        S    Sep15   2:24 /usr/sbin/httpd
apache    1731  0.0  0.2 467660 42688 ?        S    Sep15   2:56 /usr/sbin/httpd
apache   19380  0.0  0.2 467700 42552 ?        S    Sep15   0:44 /usr/sbin/httpd
apache   12711  1.1  0.2 467148 42100 ?        S    10:38   5:37 /usr/sbin/httpd
apache   30117  1.2  0.2 467108 42028 ?        S    11:48   4:51 /usr/sbin/httpd
apache   27620  1.1  0.2 466872 41820 ?        S    10:23   5:47 /usr/sbin/httpd
apache    6778  1.1  0.2 466844 41804 ?        S    11:56   4:32 /usr/sbin/httpd
apache   32417  1.1  0.2 466844 41780 ?        S    11:50   4:45 /usr/sbin/httpd
apache   19110  1.1  0.2 466752 41772 ?        S    09:48   6:16 /usr/sbin/httpd
apache   23117  1.3  0.2 466980 41752 ?        S    17:00   1:12 /usr/sbin/httpd
apache   24924  1.1  0.2 466808 41692 ?        S    11:16   5:11 /usr/sbin/httpd
apache   26134  1.2  0.2 466824 41576 ?        S    17:15   0:59 /usr/sbin/httpd
apache   26752  1.2  0.2 466876 41536 ?        S    17:47   0:34 /usr/sbin/httpd
apache   19742  1.1  0.2 466868 41428 ?        S    18:19   0:10 /usr/sbin/httpd
apache   20312  1.4  0.2 466872 41348 ?        S    18:22   0:09 /usr/sbin/httpd
apache    9683  1.1  0.2 462640 37600 ?        S    12:26   4:17 /usr/sbin/httpd
apache    5325  1.2  0.2 461616 36548 ?        S    12:50   4:11 /usr/sbin/httpd
nagios   25916  0.7  0.1 232600 30060 ?        S    18:33   0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php
nagios   25914  0.9  0.1 232108 25384 ?        S    18:33   0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php
nagios   25912  0.6  0.1 225060 22768 ?        S    18:33   0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php
nagios   25915  0.5  0.1 224796 22508 ?        S    18:33   0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/cmdsubsys.php
nagios   25913  0.5  0.1 224664 22336 ?        S    18:33   0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/feedproc.php
root      4770  0.0  0.1 341896 19968 ?        Ss   Aug13   0:03 /usr/sbin/httpd
mysql     1008  0.0  0.1 377484 16840 ?        Sl   Jul02   0:11 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
postgres  1054  0.0  0.0 216364 10984 ?        Ss   Jul02   0:15 postgres: writer process

Re: nagios service stops unexpectedly

Posted: Fri Sep 18, 2015 9:09 am
by jdalrymple
There is nothing too terribly alarming there.

Going back to what tgriep mentioned, are there any system messages indicating what's going on?

Code:

dmesg | tail
(if it hasn't crashed recently you may need to look back a bit)

Code:

tail /var/log/messages
(same caveat as above)
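
Since the `free` output earlier in this thread shows no swap configured and most memory in use, one thing that may be worth searching those logs for is a kernel OOM-killer event. That is only a guess, not a diagnosis. The sketch below assumes a syslog-style file at /var/log/messages; the function name `find_oom_events` and the search patterns are illustrative and may need adjusting for your kernel's wording.

```shell
#!/bin/sh
# Hypothetical helper: list lines in a syslog-style file that look like
# kernel OOM-killer activity. Pass a log path as $1; defaults to
# /var/log/messages. The patterns are an assumption about typical wording.
find_oom_events() {
    log="${1:-/var/log/messages}"
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 1
    fi
    # -i: ignore case, -E: extended regex with alternation
    grep -i -E 'out of memory|oom-killer|killed process' "$log"
}

# Only run against the default log if it is readable on this machine.
if [ -r /var/log/messages ]; then
    find_oom_events || echo "no OOM-related lines found"
fi
```

Piping the result through `grep -i nagios` narrows it to lines that mention the nagios process. Rotated logs (for example /var/log/messages-YYYYMMDD, if your system rotates that way) can be passed as the argument when the crash was not recent.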

Do you have any additional modules loaded? Nagios alone is generally quite stable, but modules can blow it up.

Code:

grep module /usr/local/nagios/etc/nagios.cfg

Re: nagios service stops unexpectedly

Posted: Fri Sep 18, 2015 9:12 am
by kendallchenoweth
nagios@nagios-aws-pro01 ~$ grep module /usr/local/nagios/etc/nagios.cfg
# NDOUtils module
broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg

I'm looking into whether there is a kernel setting for IPC message queues that could be involved. When everything is running OK, ipcs -q shows about 50-60 entries. The next time I get alerted that the service has stopped, I'll capture the number of entries at that point.
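
A minimal sketch of how that capture could be scripted, so a sample can be taken quickly (or from cron) when the alert fires. The function names are made up for illustration. It assumes the Linux `ipcs -q` layout, in which each queue row starts with a hex key such as 0x00000000, and it reads the message-queue limits from /proc/sys/kernel (msgmni, msgmax, msgmnb) when they are present.

```shell
#!/bin/sh
# Count message queues from `ipcs -q` output supplied on stdin.
# Assumption: queue rows begin with a hex key (0x...); header lines do not.
count_msg_queues() {
    awk '$1 ~ /^0x/ { n++ } END { print n + 0 }'
}

# Print one timestamped sample: current queue count plus kernel limits.
sample_msg_queues() {
    printf '%s queues=%s' "$(date '+%Y-%m-%d %H:%M:%S')" "$(ipcs -q | count_msg_queues)"
    for f in msgmni msgmax msgmnb; do
        if [ -r "/proc/sys/kernel/$f" ]; then
            printf ' %s=%s' "$f" "$(cat "/proc/sys/kernel/$f")"
        fi
    done
    printf '\n'
}

# Take a sample only where ipcs is installed.
if command -v ipcs >/dev/null 2>&1; then
    sample_msg_queues
fi
```

Appending the output to a file, for example `sample_msg_queues >> /tmp/msgq.log`, gives a baseline to compare against the count seen when the service stops.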

Re: nagios service stops unexpectedly

Posted: Fri Sep 18, 2015 1:09 pm
by tgriep
Keep us in the loop.