Nagios XI gone crazy

Posted: Mon Jul 24, 2017 2:19 am
by reincarne
Hi,
We are facing a new issue with NagiosXI. I'm attaching a screenshot from when the situation is normal.
Then at some point Nagios goes crazy: suddenly all the active service checks are executed only within the 15-min window (the 1-min and 5-min values simply become empty). The average service check latency also spikes to about 700 seconds, and the load on the machine gets high.

What can be the problem?

Re: Nagios XI gone crazy

Posted: Mon Jul 24, 2017 9:35 am
by lmiltchev
Can you PM your profile (profile.zip) to any of the Nagios employees? Thank you!

Admin > System Config > System Profile > Download Profile

Re: Nagios XI gone crazy

Posted: Sun Aug 06, 2017 8:20 am
by reincarne
Well,
digging more into the problem we found that the source of the issue is related somehow to one of these:
ndo2db
ipcs

We noticed that after applying configuration, ipcs -q shows a lot of messages stuck in the queue, and they are being processed very slowly.
Restarting the ndo2db process sometimes puts it in an even worse state.

We are in a situation where we are unable to monitor our production environment. We also applied all the tweaks for increasing the message queue limits in the kernel settings, according to your documentation. What could be the reason for the slow queue processing?
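
For anyone following along, the kernel limits referred to here can be checked roughly like this. It is only a sketch and not taken from the posts: it assumes a Linux host that exposes the System V message queue limits as the files msgmax, msgmnb and msgmni under /proc/sys/kernel. The function name show_msgq_limits and its optional directory argument are made up for illustration.

```shell
#!/bin/bash
# Print the System V message queue limits the kernel currently enforces:
#   msgmax = largest single message (bytes)
#   msgmnb = maximum bytes held in one queue
#   msgmni = maximum number of queues
# The optional argument (hypothetical, handy for testing) overrides the
# directory that holds these files.
show_msgq_limits() {
    local dir=${1:-/proc/sys/kernel}
    local name
    for name in msgmax msgmnb msgmni; do
        if [ -r "$dir/$name" ]; then
            printf 'kernel.%s = %s\n' "$name" "$(cat "$dir/$name")"
        else
            printf 'kernel.%s = (not readable)\n' "$name"
        fi
    done
}

# Typical use on the monitoring server:
#   show_msgq_limits
#   ipcs -q     # compare "used-bytes" per queue against kernel.msgmnb
```

If used-bytes sits at or near kernel.msgmnb for long stretches, the queue is full and the writer is waiting on the reader, which matches the slow draining described above.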

Re: Nagios XI gone crazy

Posted: Mon Aug 07, 2017 2:36 pm
by tgriep
There could be a lot of things that cause the Kernel Message queue to process slowly.
Your system is running 25902 service checks and if they are scheduled to run fairly quickly, it takes time to process them all and it may not be able to finish them.
From your profile, it looks like the IO wait is fairly high on the server, so it cannot write the data quickly enough, and that would cause the issue you are having.
Try moving the server to faster hard drives if possible.

You should start to implement some of the performance enhancements in the document below.
https://assets.nagios.com/downloads/nag ... ios-XI.pdf

That should help the system run faster so it can process the queue better.
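
The IO wait figure mentioned above can also be measured directly on the server. The following is a rough sketch, assuming a Linux host where the aggregate "cpu" line of /proc/stat lists user, nice, system, idle, iowait, irq, softirq and steal in that order. The function name iowait_percent is hypothetical, and this is not how the system profile computes its numbers.

```shell
#!/bin/bash
# Compute the share of CPU time spent in iowait between two samples of the
# aggregate "cpu" line from /proc/stat.
#   $1 = earlier "cpu ..." line, $2 = later "cpu ..." line
iowait_percent() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            # Sum user..steal (fields 2-9). The guest columns that follow
            # are already counted inside user/nice, so they are skipped.
            total[NR] = 0
            for (i = 2; i <= 9 && i <= NF; i++) total[NR] += $i
            iowait[NR] = $6
        }
        END {
            dt = total[2] - total[1]
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (iowait[2] - iowait[1]) / dt
        }'
}

# Typical use: take two samples a few seconds apart.
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   iowait_percent "$a" "$b"
```

Sampling during and shortly after an Apply Configuration would show whether the backlog lines up with disk pressure.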

Re: Nagios XI gone crazy

Posted: Wed Aug 09, 2017 2:25 am
by reincarne
We did the optimizations (only the RAM disk was not tested).
None of them solved the issue. We noticed that when we run Apply Configuration, the queue spikes to around 140k messages and then slowly goes down. However, when we run another Apply Configuration while the queue is still being processed, the queue doubles, the system works harder, and it seems to ignore the messages that were there before the second configuration change.
P.S. Offloading the DB made it even worse.
I attached a graph of the queued messages; we created a simple plugin to track them.
Could it be the result of some checks running into a timeout after 60 seconds?
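
The plugin itself was not posted. A queue-tracking check along these lines might look like the sketch below, which is a guess at the idea rather than the poster's actual plugin. It assumes the usual util-linux ipcs -q layout (key, msqid, owner, perms, used-bytes, messages) and the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL). The function names and the thresholds in the usage line are invented for illustration.

```shell
#!/bin/bash
# Sum the "messages" column of `ipcs -q` output read from stdin.
# Only data rows are counted; they are recognised by a key starting with 0x.
count_queued_messages() {
    awk '$1 ~ /^0x/ { sum += $6 } END { printf "%d\n", sum }'
}

# Turn a message count into a Nagios-style status line and return code.
#   $1 = message count, $2 = warning threshold, $3 = critical threshold
queue_state() {
    local count=$1 warn=$2 crit=$3
    # Perfdata after the pipe lets the count be graphed over time.
    local perf="messages=$count;$warn;$crit"
    if [ "$count" -ge "$crit" ]; then
        echo "CRITICAL - $count messages queued | $perf"
        return 2
    elif [ "$count" -ge "$warn" ]; then
        echo "WARNING - $count messages queued | $perf"
        return 1
    fi
    echo "OK - $count messages queued | $perf"
    return 0
}

# Typical use as a check command (thresholds are placeholders):
#   queue_state "$(ipcs -q | count_queued_messages)" 50000 100000; exit $?
```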

Re: Nagios XI gone crazy

Posted: Wed Aug 09, 2017 1:38 pm
by tgriep
The checks that are timing out are not slowing the queue from clearing out; they only add to what is already there.

I think the IO wait, together with the number of service checks, is causing the issue.
Set up the RAM disk, as that will help performance.
Also, slow down the check interval for service checks that are not critical so the system can keep up with the queue.

Run the following as root and post the output.

ps -ef --cols=300
top -n 1 |head -20
ipcs -q
cat /etc/sysctl.conf
cat /etc/my.cnf
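
To gather everything in one file for posting, the commands above could be wrapped in a small helper. This is only a convenience sketch: the function name collect_diagnostics and the output path in the usage line are made up. It also passes -b to top, which is not in the original list; it is added here as an assumption that batch mode gives cleaner output when top is piped.

```shell
#!/bin/bash
# Run each command given after the output file and append its output under a
# header line, so the whole set can be posted or attached in one piece.
#   $1 = output file, remaining arguments = command strings to run
collect_diagnostics() {
    local outfile=$1
    shift
    : > "$outfile"                      # start with an empty file
    local cmd
    for cmd in "$@"; do
        printf '===== %s =====\n' "$cmd" >> "$outfile"
        # Keep going if a command fails; record stderr and the exit status.
        bash -c "$cmd" >> "$outfile" 2>&1 \
            || printf '(exit status %s)\n' "$?" >> "$outfile"
        printf '\n' >> "$outfile"
    done
}

# Typical use, as root:
#   collect_diagnostics /tmp/nagios_diag.txt \
#       'ps -ef --cols=300' 'top -b -n 1 | head -20' 'ipcs -q' \
#       'cat /etc/sysctl.conf' 'cat /etc/my.cnf'
```

Review the file before posting it, since process listings and config files can contain hostnames and credentials.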

Re: Nagios XI gone crazy

Posted: Sun Aug 13, 2017 3:40 am
by reincarne
ps -ef --cols=300

UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 Jul23 ?        00:00:03 /sbin/init
root         2     0  0 Jul23 ?        00:00:00 [kthreadd]
root         3     2  0 Jul23 ?        00:03:54 [migration/0]
root         4     2  0 Jul23 ?        00:00:06 [ksoftirqd/0]
root         5     2  0 Jul23 ?        00:00:00 [migration/0]
root         6     2  0 Jul23 ?        00:00:00 [watchdog/0]
root         7     2  0 Jul23 ?        00:03:29 [migration/1]
root         8     2  0 Jul23 ?        00:00:00 [migration/1]
root         9     2  0 Jul23 ?        00:00:02 [ksoftirqd/1]
root        10     2  0 Jul23 ?        00:00:00 [watchdog/1]
root        11     2  0 Jul23 ?        00:03:32 [migration/2]
root        12     2  0 Jul23 ?        00:00:00 [migration/2]
root        13     2  0 Jul23 ?        00:00:02 [ksoftirqd/2]
root        14     2  0 Jul23 ?        00:00:00 [watchdog/2]
root        15     2  0 Jul23 ?        00:03:41 [migration/3]
root        16     2  0 Jul23 ?        00:00:00 [migration/3]
root        17     2  0 Jul23 ?        00:00:02 [ksoftirqd/3]
root        18     2  0 Jul23 ?        00:00:00 [watchdog/3]
root        19     2  0 Jul23 ?        00:03:39 [migration/4]
root        20     2  0 Jul23 ?        00:00:00 [migration/4]
root        21     2  0 Jul23 ?        00:00:01 [ksoftirqd/4]
root        22     2  0 Jul23 ?        00:00:00 [watchdog/4]
root        23     2  0 Jul23 ?        00:03:34 [migration/5]
root        24     2  0 Jul23 ?        00:00:00 [migration/5]
root        25     2  0 Jul23 ?        00:00:01 [ksoftirqd/5]
root        26     2  0 Jul23 ?        00:00:00 [watchdog/5]
root        27     2  0 Jul23 ?        00:03:35 [migration/6]
root        28     2  0 Jul23 ?        00:00:00 [migration/6]
root        29     2  0 Jul23 ?        00:00:01 [ksoftirqd/6]
root        30     2  0 Jul23 ?        00:00:00 [watchdog/6]
root        31     2  0 Jul23 ?        00:03:30 [migration/7]
root        32     2  0 Jul23 ?        00:00:00 [migration/7]
root        33     2  0 Jul23 ?        00:00:01 [ksoftirqd/7]
root        34     2  0 Jul23 ?        00:00:00 [watchdog/7]
root        35     2  0 Jul23 ?        00:01:12 [events/0]
root        36     2  0 Jul23 ?        00:00:29 [events/1]
root        37     2  0 Jul23 ?        00:00:27 [events/2]
root        38     2  0 Jul23 ?        00:00:13 [events/3]
root        39     2  0 Jul23 ?        00:00:07 [events/4]
root        40     2  0 Jul23 ?        00:00:06 [events/5]
root        41     2  0 Jul23 ?        00:00:06 [events/6]
root        42     2  0 Jul23 ?        00:01:05 [events/7]
root        43     2  0 Jul23 ?        00:00:00 [cpuset]
root        44     2  0 Jul23 ?        00:00:00 [khelper]
root        45     2  0 Jul23 ?        00:00:00 [netns]
root        46     2  0 Jul23 ?        00:00:00 [async/mgr]
root        47     2  0 Jul23 ?        00:00:00 [pm]
root        48     2  0 Jul23 ?        00:00:00 [xenwatch]
root        49     2  0 Jul23 ?        00:00:00 [xenbus]
root        50     2  0 Jul23 ?        00:00:00 [sync_supers]
root        51     2  0 Jul23 ?        00:00:00 [bdi-default]
root        52     2  0 Jul23 ?        00:00:00 [kintegrityd/0]
root        53     2  0 Jul23 ?        00:00:00 [kintegrityd/1]
root        54     2  0 Jul23 ?        00:00:00 [kintegrityd/2]
root        55     2  0 Jul23 ?        00:00:00 [kintegrityd/3]
root        56     2  0 Jul23 ?        00:00:00 [kintegrityd/4]
root        57     2  0 Jul23 ?        00:00:00 [kintegrityd/5]
root        58     2  0 Jul23 ?        00:00:00 [kintegrityd/6]
root        59     2  0 Jul23 ?        00:00:00 [kintegrityd/7]
root        60     2  0 Jul23 ?        00:03:14 [kblockd/0]
root        61     2  0 Jul23 ?        00:00:00 [kblockd/1]
root        62     2  0 Jul23 ?        00:00:00 [kblockd/2]
root        63     2  0 Jul23 ?        00:00:00 [kblockd/3]
root        64     2  0 Jul23 ?        00:00:00 [kblockd/4]
root        65     2  0 Jul23 ?        00:00:00 [kblockd/5]
root        66     2  0 Jul23 ?        00:00:00 [kblockd/6]
root        67     2  0 Jul23 ?        00:00:00 [kblockd/7]
root        68     2  0 Jul23 ?        00:00:00 [ata/0]
root        69     2  0 Jul23 ?        00:00:00 [ata/1]
root        70     2  0 Jul23 ?        00:00:00 [ata/2]
root        71     2  0 Jul23 ?        00:00:00 [ata/3]
root        72     2  0 Jul23 ?        00:00:00 [ata/4]
root        73     2  0 Jul23 ?        00:00:00 [ata/5]
root        74     2  0 Jul23 ?        00:00:00 [ata/6]
root        75     2  0 Jul23 ?        00:00:00 [ata/7]
root        76     2  0 Jul23 ?        00:00:00 [ata_aux]
root        77     2  0 Jul23 ?        00:00:00 [ksuspend_usbd]
root        78     2  0 Jul23 ?        00:00:00 [khubd]
root        79     2  0 Jul23 ?        00:00:00 [kseriod]
root        80     2  0 Jul23 ?        00:00:00 [md/0]
root        81     2  0 Jul23 ?        00:00:00 [md/1]
root        82     2  0 Jul23 ?        00:00:00 [md/2]
root        83     2  0 Jul23 ?        00:00:00 [md/3]
root        84     2  0 Jul23 ?        00:00:00 [md/4]
root        85     2  0 Jul23 ?        00:00:00 [md/5]
root        86     2  0 Jul23 ?        00:00:00 [md/6]
root        87     2  0 Jul23 ?        00:00:00 [md/7]
root        88     2  0 Jul23 ?        00:00:00 [md_misc/0]
root        89     2  0 Jul23 ?        00:00:00 [md_misc/1]
root        90     2  0 Jul23 ?        00:00:00 [md_misc/2]
root        91     2  0 Jul23 ?        00:00:00 [md_misc/3]
root        92     2  0 Jul23 ?        00:00:00 [md_misc/4]
root        93     2  0 Jul23 ?        00:00:00 [md_misc/5]
root        94     2  0 Jul23 ?        00:00:00 [md_misc/6]
root        95     2  0 Jul23 ?        00:00:00 [md_misc/7]
root        96     2  0 Jul23 ?        00:00:00 [khungtaskd]
root        97     2  0 Jul23 ?        00:00:01 [kswapd0]
root        98     2  0 Jul23 ?        00:00:00 [ksmd]
root        99     2  0 Jul23 ?        00:00:00 [aio/0]
root       100     2  0 Jul23 ?        00:00:00 [aio/1]
root       101     2  0 Jul23 ?        00:00:00 [aio/2]
root       102     2  0 Jul23 ?        00:00:00 [aio/3]
root       103     2  0 Jul23 ?        00:00:00 [aio/4]
root       104     2  0 Jul23 ?        00:00:00 [aio/5]
root       105     2  0 Jul23 ?        00:00:00 [aio/6]
root       106     2  0 Jul23 ?        00:00:00 [aio/7]
root       107     2  0 Jul23 ?        00:00:00 [crypto/0]
root       108     2  0 Jul23 ?        00:00:00 [crypto/1]
root       109     2  0 Jul23 ?        00:00:00 [crypto/2]
root       110     2  0 Jul23 ?        00:00:00 [crypto/3]
root       111     2  0 Jul23 ?        00:00:00 [crypto/4]
root       112     2  0 Jul23 ?        00:00:00 [crypto/5]
root       113     2  0 Jul23 ?        00:00:00 [crypto/6]
root       114     2  0 Jul23 ?        00:00:00 [crypto/7]
root       119     2  0 Jul23 ?        00:00:00 [kthrotld/0]
root       120     2  0 Jul23 ?        00:00:00 [kthrotld/1]
root       121     2  0 Jul23 ?        00:00:00 [kthrotld/2]
root       122     2  0 Jul23 ?        00:00:00 [kthrotld/3]
root       123     2  0 Jul23 ?        00:00:00 [kthrotld/4]
root       124     2  0 Jul23 ?        00:00:00 [kthrotld/5]
root       125     2  0 Jul23 ?        00:00:00 [kthrotld/6]
root       126     2  0 Jul23 ?        00:00:00 [kthrotld/7]
root       128     2  0 Jul23 ?        00:00:00 [khvcd]
root       129     2  0 Jul23 ?        00:00:00 [kpsmoused]
root       130     2  0 Jul23 ?        00:00:00 [usbhid_resumer]
root       249     2  0 Jul23 ?        00:05:36 [jbd2/xvde1-8]
root       250     2  0 Jul23 ?        00:00:00 [ext4-dio-unwrit]
root       251     2  0 Jul23 ?        00:00:00 [ext4-dio-unwrit]
root       252     2  0 Jul23 ?        00:00:00 [ext4-dio-unwrit]
root       253     2  0 Jul23 ?        00:00:00 [ext4-dio-unwrit]
root       254     2  0 Jul23 ?        00:00:00 [ext4-dio-unwrit]
root       255     2  0 Jul23 ?        00:00:00 [ext4-dio-unwrit]
root       256     2  0 Jul23 ?        00:00:00 [ext4-dio-unwrit]
root       257     2  0 Jul23 ?        00:00:00 [ext4-dio-unwrit]
root       333     1  0 Jul23 ?        00:00:00 /sbin/udevd -d
root       550   333  0 Jul23 ?        00:00:00 /sbin/udevd -d
root       553     2  0 Jul23 ?        00:00:00 [kstriped]
root       555   333  0 Jul23 ?        00:00:00 /sbin/udevd -d
root       556     2  0 Jul23 ?        00:00:00 [kdmflush]
root       562     2  0 Jul23 ?        00:00:00 [kdmflush]
root       602     2  0 Jul23 ?        00:29:14 [jbd2/dm-1-8]
root       603     2  0 Jul23 ?        00:00:00 [ext4-dio-unwrit]
root       604     2  0 Jul23 ?        00:00:00 [ext4-dio-unwrit]
root       605     2  0 Jul23 ?        00:00:00 [ext4-dio-unwrit]
root       606     2  0 Jul23 ?        00:00:00 [ext4-dio-unwrit]
root       607     2  0 Jul23 ?        00:00:00 [ext4-dio-unwrit]
root       608     2  0 Jul23 ?        00:00:00 [ext4-dio-unwrit]
root       609     2  0 Jul23 ?        00:00:00 [ext4-dio-unwrit]
root       610     2  0 Jul23 ?        00:00:00 [ext4-dio-unwrit]
root       611     2  0 Jul23 ?        00:00:00 [ext4-dio-unwrit]
root       612     2  0 Jul23 ?        00:00:00 [ext4-dio-unwrit]
root       613     2  0 Jul23 ?        00:00:00 [ext4-dio-unwrit]
root       614     2  0 Jul23 ?        00:00:00 [ext4-dio-unwrit]
root       615     2  0 Jul23 ?        00:00:00 [ext4-dio-unwrit]
root       616     2  0 Jul23 ?        00:00:00 [ext4-dio-unwrit]
root       617     2  0 Jul23 ?        00:00:00 [ext4-dio-unwrit]
root       618     2  0 Jul23 ?        00:00:00 [ext4-dio-unwrit]
root       652     2  0 Jul23 ?        00:00:20 [kauditd]
root       656  1182  0 05:51 ?        00:00:00 sshd: ofirke [priv]
root       815     2  0 Jul23 ?        00:07:10 [flush-202:65]
root       820     2  0 Jul23 ?        01:36:38 [flush-253:1]
root       861     1  0 Jul23 ?        00:00:00 /sbin/dhclient -H vpc-nagiosxi -1 -q -lf /var/lib/dhclient/dhclient-eth0.leases -pf /var/run/dhclient-eth0.pid eth0
root       913     1  0 Jul23 ?        00:01:25 auditd
nslcd      935     1  0 Jul23 ?        00:46:37 /usr/sbin/nslcd
root       951     1  0 Jul23 ?        00:54:03 /sbin/rsyslogd -i /var/run/syslogd.pid -c 5
root       962     1  0 Jul23 ?        00:06:07 /bin/bash /etc/register/register_ip.sh
dbus       977     1  0 Jul23 ?        00:00:00 dbus-daemon --system
root       998     1  0 Jul23 ?        00:57:12 /usr/bin/ruby /usr/sbin/mcollectived --pid=/var/run/mcollectived.pid --config=/etc/mcollective/server.cfg --daemonize
root      1062     1  0 Jul23 ?        00:57:24 /usr/bin/nxlog
ofirke    1095   656  0 05:51 ?        00:00:00 sshd: ofirke@pts/1
root      1182     1  0 Jul23 ?        00:00:00 /usr/sbin/sshd
root      1193     1  0 Jul23 ?        00:01:25 xinetd -stayalive -pidfile /var/run/xinetd.pid
ntp       1204     1  0 Jul23 ?        00:00:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
ofirke    1274  1095  0 05:51 pts/1    00:00:00 -bash
root      XXXXX   1  0 Jul23 ?        00:00:18 /usr/libexec/postfix/master
root      1767     1  0 Jul23 ?        00:03:24 crond
root      1843     1  0 Jul23 ?        00:00:46 /usr/bin/ruby /usr/bin/puppet agent
ajaxterm  1855     1  0 Jul23 ?        00:00:20 python /usr/share/ajaxterm/ajaxterm.py --daemon --port=8022 --uid=ajaxterm
root      2104  1182  0 08:24 ?        00:00:00 sshd: maximni [priv]
root      2146     1  0 Jul23 hvc0     00:00:00 /sbin/agetty /dev/hvc0 38400 vt100-nav
root      2148     1  0 Jul23 tty1     00:00:00 /sbin/mingetty /dev/tty1
root      2150     1  0 Jul23 tty2     00:00:00 /sbin/mingetty /dev/tty2
root      2152     1  0 Jul23 tty3     00:00:00 /sbin/mingetty /dev/tty3
root      2154     1  0 Jul23 tty4     00:00:00 /sbin/mingetty /dev/tty4
root      2156     1  0 Jul23 tty5     00:00:00 /sbin/mingetty /dev/tty5
root      2158     1  0 Jul23 tty6     00:00:00 /sbin/mingetty /dev/tty6
maximni   2403  2104  0 08:24 ?        00:00:00 sshd: maximni@pts/6
maximni   2451  2403  0 08:24 pts/6    00:00:00 -bash
root      2559  1274  0 06:56 pts/1    00:00:00 sudo su - nagios
root      2560  2559  0 06:56 pts/1    00:00:00 su - nagios
nagios    2561  2560  0 06:56 pts/1    00:00:00 -bash
nagios    3994     1  0 Aug10 ?        00:00:05 /usr/local/nagios/bin/npcd -d -f /usr/local/nagios/etc/pnp/npcd.cfg
root      4053  1182  0 08:07 ?        00:00:00 sshd: alexle [priv]
root      4759 26715  0 08:13 pts/5    00:00:00 sudo su - nagios
root      4761  4759  0 08:13 pts/5    00:00:00 su - nagios
apache    4762 11986  7 08:20 ?        00:00:32 /usr/sbin/httpd
nagios    4763  4761  0 08:13 pts/5    00:00:00 -bash
apache    5122 11986  7 08:20 ?        00:00:33 /usr/sbin/httpd
apache    5125 11986  7 08:20 ?        00:00:32 /usr/sbin/httpd
alexle    5425  4053  0 08:07 ?        00:00:00 sshd: alexle@pts/4
alexle    5542  5425  0 08:07 pts/4    00:00:00 -bash
postfix   6858  XXXXX 08:09 ?        00:00:00 pickup -l -t fifo -u
nagios    6911  2561  0 08:02 pts/1    00:00:11 watch ipcs -q
apache    7579 11986  7 08:09 ?        00:01:25 /usr/sbin/httpd
apache    8668 11986  7 07:55 ?        00:02:24 /usr/sbin/httpd
apache   10025 11986  6 08:21 ?        00:00:28 /usr/sbin/httpd
apache   11114 11986  7 08:13 ?        00:01:03 /usr/sbin/httpd
nagios   11336     1  8 07:42 ?        00:03:44 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   11338 11336  0 07:42 ?        00:00:17 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   11339 11336  0 07:42 ?        00:00:17 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   11340 11336  0 07:42 ?        00:00:17 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   11341 11336  0 07:42 ?        00:00:18 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   11342 11336  0 07:42 ?        00:00:17 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   11343 11336  0 07:42 ?        00:00:18 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   11344 11336  0 07:42 ?        00:00:18 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   11345 11336  0 07:42 ?        00:00:17 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   11346 11336  0 07:42 ?        00:00:18 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   11348 11336  0 07:42 ?        00:00:17 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   11349 11336  0 07:42 ?        00:00:17 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   11350 11336  0 07:42 ?        00:00:17 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   11379 11336  0 07:42 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root     11986     1  0 Aug03 ?        00:00:54 /usr/sbin/httpd
apache   12876 11986  7 08:00 ?        00:02:10 /usr/sbin/httpd
apache   14241 11986  7 08:14 ?        00:01:03 /usr/sbin/httpd
nagios   14258     1  0 08:10 ?        00:00:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
apache   14430 11986  7 08:14 ?        00:01:00 /usr/sbin/httpd
apache   14613 11986  7 08:14 ?        00:01:02 /usr/sbin/httpd
postfix  15066  XXXXX 03:49 ?        00:00:00 qmgr -l -t fifo -u
root     15706  5542  0 08:10 pts/4    00:00:00 sudo su - nagios
root     15785 15706  0 08:10 pts/4    00:00:00 su - nagios
nagios   15835 15785  0 08:10 pts/4    00:00:00 -bash
apache   16793 11986  7 08:14 ?        00:01:03 /usr/sbin/httpd
nagios   17379 14258  1 08:10 ?        00:00:12 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
nagios   17380 17379 38 08:10 ?        00:06:48 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
root     18029  1182  0 04:01 ?        00:00:00 sshd: alexcher [priv]
alexcher 18579 18029  0 04:01 ?        00:00:00 sshd: alexcher@pts/0
alexcher 18585 18579  0 04:01 pts/0    00:00:00 -bash
root     18785 18585  0 04:01 pts/0    00:00:00 sudo -su nagios
nagios   18786 18785  0 04:01 pts/0    00:00:00 /bin/bash
root     20104     1  2 00:00 ?        00:14:02 /usr/bin/atop -a -w /data/atop_logs/atop_20170813 5
apache   20277 11986  8 08:26 ?        00:00:07 /usr/sbin/httpd
apache   20278 11986  7 08:26 ?        00:00:06 /usr/sbin/httpd
apache   20279 11986  7 08:26 ?        00:00:06 /usr/sbin/httpd
root     21383 30316  0 08:05 pts/3    00:00:00 sudo su - nagios
root     21387 21383  0 08:05 pts/3    00:00:00 su - nagios
nagios   21392 21387  0 08:05 pts/3    00:00:00 -bash
root     23563  1767  0 08:27 ?        00:00:00 CROND
apache   23604 11986  6 08:19 ?        00:00:37 /usr/sbin/httpd
apache   23605 11986  6 08:19 ?        00:00:35 /usr/sbin/httpd
nagios   23656 23563  0 08:27 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php > /usr/local/nagiosxi/var/sysstat.log 2>&1
nagios   23662 23656  0 08:27 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php
root     25316     2  0 Aug02 ?        00:08:24 [flush-253:0]
root     26076  1182  0 08:11 ?        00:00:00 sshd: alexle [priv]
apache   26414 11986  7 08:19 ?        00:00:41 /usr/sbin/httpd
alexle   26664 26076  0 08:11 ?        00:00:00 sshd: alexle@pts/5
alexle   26715 26664  0 08:11 pts/5    00:00:00 -bash
postfix  26842  XXXXX 08:27 ?        00:00:00 smtpd -n smtp -t inet -u -o stress=
nagios   27918 11350  0 08:27 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_puppet_agent -a 2999 3000
nagios   28501 11339  0 08:27 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_puppet_agent -a 2999 3000
apache   29446 11986  7 08:23 ?        00:00:19 /usr/sbin/httpd
nagios   29523 11339  0 08:27 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_puppet_agent -a 2999 3000
root     29981  1182  0 07:26 ?        00:00:00 sshd: ofirke [priv]
root     30157     1  2 Aug10 ?        01:28:12 /opt/BESClient/bin/BESClient
root     30279  1182  0 06:01 ?        00:00:00 sshd: ofirke [priv]
ofirke   30302 29981  0 07:26 ?        00:00:00 sshd: ofirke@pts/2
ofirke   30310 30279  0 06:01 ?        00:00:00 sshd: ofirke@pts/3
ofirke   30316 30310  0 06:01 pts/3    00:00:00 -bash
ofirke   30439 30302  0 07:26 pts/2    00:00:00 -bash
root     30791     1  0 Aug07 ?        00:00:00 /bin/sh /usr/bin/mysqld_safe --datadir=/var/lib/mysql --socket=/var/lib/mysql/mysql.sock --pid-file=/var/run/mysqld/mysqld.pid --basedir=/usr --user=mysql
mysql    30914 30791 57 Aug07 ?        3-09:31:35 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
apache   31547 11986  7 08:16 ?        00:00:51 /usr/sbin/httpd
nagios   31617 11348  0 08:27 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_puppet_agent -a 2999 3000
nagios   31618 11346  0 08:27 ?        00:00:00 /bin/bash /usr/local/nagios/libexec/check_mysql_session nagios PPP XXXXX
nagios   31620 31618  0 08:27 ?        00:00:00 mysql -u nagios -px xxxxxxxxx -h XXXXX -e exit
nagios   31692 11343  0 08:27 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_puppet_agent -a 2999 3000
nagios   31879 11340  0 08:27 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_ntp_time -a 0.amazon.pool.ntp.org 1 1.2
nagios   31904 11346  0 08:27 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_ntp_time -a 0.amazon.pool.ntp.org 1 1.2
nagios   32014 11344  0 08:27 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_ntp_time -a 0.amazon.pool.ntp.org 1 1.2
nagios   32018 11349  0 08:27 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_ntp_time -a 0.amazon.pool.ntp.org 1 1.2
nagios   32054 11349  0 08:27 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_ntp_time -a 0.amazon.pool.ntp.org 1 1.2
nagios   32141 11349  0 08:27 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_ntp_time -a 0.amazon.pool.ntp.org 1 1.2
nagios   32142 11348  0 08:27 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_ntp_time -a 0.amazon.pool.ntp.org 1 1.2
nagios   32233 11345  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_ntp_time -a 0.amazon.pool.ntp.org 1 1.2
nagios   32235 11341  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_ntp_time -a 0.amazon.pool.ntp.org 1 1.2
nagios   32240 11338  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_ntp_time -a 0.amazon.pool.ntp.org 1 1.2
nagios   32247 11350  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_puppet_agent -a 2999 3000
nagios   32248 11341  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_puppet_agent -a 2999 3000
nagios   32249 11349  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_puppet_agent -a 2999 3000
nagios   32253 11338  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_puppet_agent -a 2999 3000
nagios   32255 11340  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_puppet_agent -a 2999 3000
nagios   32258 11344  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_ntp_time -a 0.amazon.pool.ntp.org 1 1.2
nagios   32261 11341  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_ntp_time -a 0.amazon.pool.ntp.org 1 1.2
nagios   32265 11342  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_ntp_time -a 0.amazon.pool.ntp.org 1 1.2
nagios   32266 11338  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_ntp_time -a 0.amazon.pool.ntp.org 1 1.2
nagios   32274 11341  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_ntp_time -a 0.amazon.pool.ntp.org 1 1.2
root     32282  1767  0 08:28 ?        00:00:00 CROND
root     32283  1767  0 08:28 ?        00:00:00 CROND
root     32284  1767  0 08:28 ?        00:00:00 CROND
root     32285  1767  0 08:28 ?        00:00:00 CROND
root     32286  1767  0 08:28 ?        00:00:00 CROND
root     32287  1767  0 08:28 ?        00:00:00 CROND
nagios   32290 11339  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_puppet_agent -a 2999 3000
nagios   32291 11340  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_puppet_agent -a 2999 3000
nagios   32292 11343  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -H XXXXX -c check_kibana_tags -a de production_eucn_1,production_eucn_1,production_eucn1_2 240
nagios   32296 32284  0 08:28 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/event_handler.php > /usr/local/nagiosxi/var/event_handler.log 2>&1
nagios   32297 32296  6 08:28 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/event_handler.php
nagios   32302 32283  0 08:28 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/feedproc.php > /usr/local/nagiosxi/var/feedproc.log 2>&1
nagios   32304 32285  0 08:28 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php > /usr/local/nagiosxi/var/eventman.log 2>&1
nagios   32305 32282  0 08:28 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php > /usr/local/nagiosxi/var/perfdataproc.log 2>&1
nagios   32309 32304  9 08:28 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php
nagios   32310 32302  6 08:28 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/feedproc.php
nagios   32311 11348  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_puppet_agent -a 2999 3000
nagios   32313 32305 10 08:28 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php
nagios   32351 11348  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_puppet_agent -a 2999 3000
nagios   32368 11339  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_puppet_agent -a 2999 3000
nagios   32371 11343  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_ntp_time -a 0.amazon.pool.ntp.org 1 1.2
nagios   32372 11344  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_puppet_agent -a 2999 3000
nagios   32377 32287  0 08:28 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php > /usr/local/nagiosxi/var/sysstat.log 2>&1
nagios   32382 32377 12 08:28 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php
nagios   32389  3994  7 08:28 ?        00:00:00 /usr/bin/perl /usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//service-perfdata.1502612863
nagios   32390 11346  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_puppet_agent -a 2999 3000
nagios   32394  3994  7 08:28 ?        00:00:00 /usr/bin/perl /usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//service-perfdata.1502612848
nagios   32403 11343  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_ntp_time -a 0.amazon.pool.ntp.org 1 1.2
nagios   32416 32286  0 08:28 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/cmdsubsys.php > /usr/local/nagiosxi/var/cmdsubsys.log 2>&1
nagios   32418 32416  9 08:28 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/cmdsubsys.php
nagios   32421 11338  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_puppet_agent -a 2999 3000
nagios   32452 11340  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_puppet_agent -a 2999 3000
nagios   32492 11338  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_service -a crond
nagios   32526 11343  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_ntp_time -a 0.amazon.pool.ntp.org 1 1.2
nagios   32531 11341  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_service -a crond
nagios   32561 11343 20 08:28 ?        00:00:00 /home/nagios/.rvm/rubies/ruby-2.0.0-p648/bin/ruby /usr/local/nagios/libexec/check_cloudwatch_rb_2.rb ap-northeast-1 CPUUtilization Average AWS/RDS DBInstanceIdentifier=XXXXX 360 60 90 0
nagios   32566 11341  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_ntp_time -a 0.amazon.pool.ntp.org 1 1.2
nagios   32572 11338  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_ntp_time -a 0.amazon.pool.ntp.org 1 1.2
nagios   32573 11340  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_clustered_redis_cpu -a 80 90
nagios   32593 11338  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_rabbitmq_overview -a nagios Tj8tJ1dEBy 15000 15000 15000 20000 20000 20000
nagios   32595 11340  1 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_file_content2 -a data/redis/log/redis-6379.log Cannot allocate memory 1 2
nagios   32603 11341 41 08:28 ?        00:00:00 /home/nagios/.rvm/rubies/ruby-2.0.0-p648/bin/ruby /usr/local/nagios/libexec/check_cloudwatch_rb_2.rb ap-northeast-1 CPUUtilization Average AWS/RDS DBInstanceIdentifier=XXXXX 360 60 90 0
nagios   32605 11348  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_
root     32613   962  0 08:28 ?        00:00:00 sleep 20
nagios   32616 11345  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_puppet_agent -a 2999 3000
nagios   32618 11341  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_ping -H XXXXX -w 5000.0,80% -c 7000.0,100% -p 5
nagios   32620 32618  0 08:28 ?        00:00:00 /bin/ping -n -U -w 40 -c 5 XXXXX
nagios   32626 11340 40 08:28 ?        00:00:00 /home/nagios/.rvm/rubies/ruby-2.0.0-p648/bin/ruby /usr/loc CPUUtilization Average AWS/RDS DBInstanceIdentifier=XXXXX 360 60 90 0
nagios   32629 11344  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_file_content2 -a data/redis/log/redis-6379.log Cannot allocate memory 1 2
nagios   32630 11345  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_file_content2 -a /data/redis/log/redis-6379.log Timeout receiving bulk data from MASTER 0 1
nagios   32632 11350  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_log_regex_only_th -a / reached memory limit, killing process 3000000 200
nagios   32640 11346  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_file_content2 -a /data/events_v2.log error 15 20
nagios   32641 11342  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXXing_queue -a 4000 6000
nagios   32642 11338  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_puppet_agent -a 2999 3000
nagios   32647 11343  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -H XXXXX -c check__copy_port83 -a localhost 83 /monitor up 30
nagios   32649 11345 40 08:28 ?        00:00:00 /home/nagios/.rvm/rubies/ruby-2.0.0-p648/bin/ruby /usr/local/nagios/ ap-northeast-1 CPUUtilization Average AWS/RDS DBInstanceIdentifier=XXXXX 360 60 90 0
nagios   32677 32382  0 08:28 ?        00:00:00 sh -c /usr/bin/iostat -c 5 2 | tail --lines=2 | head --lines=1 | awk '{ print $1,$2,$3,$4,$5,$6 }'
nagios   32679 32677  0 08:28 ?        00:00:00 /usr/bin/iostat -c 5 2
nagios   32680 32677  0 08:28 ?        00:00:00 tail --lines=2
nagios   32681 32677  0 08:28 ?        00:00:00 head --lines=1
nagios   32682 32677  0 08:28 ?        00:00:00 awk { print $1,$2,$3,$4,$5,$6 }
nagios   32685 15835  0 08:28 pts/4    00:00:00 ps -ef --cols=300
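
A listing this long is hard to read by eye. As a rough sketch only (the helper name `summarize_checks` is made up for illustration, and it assumes the eight-column `ps -ef` layout shown above), something like this groups the running plugins by name to show which checks dominate at a given moment:

```python
from collections import Counter
import os


def summarize_checks(ps_output):
    """Count processes whose command line references a file under libexec."""
    counts = Counter()
    for line in ps_output.splitlines():
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        argv = fields[7].split()
        # First token that points into a libexec directory, if any
        plugin_path = next((tok for tok in argv if "/libexec/" in tok), None)
        if plugin_path is None:
            continue
        label = os.path.basename(plugin_path)
        # For check_nrpe, the remote command after -c is the useful part
        if label == "check_nrpe" and "-c" in argv[:-1]:
            label += ":" + argv[argv.index("-c") + 1]
        counts[label] += 1
    return counts


# A few lines copied from the listing above
sample = """\
nagios   32390 11346  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_puppet_agent -a 2999 3000
nagios   32421 11338  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_nrpe -t 60 -H XXXXX -c check_puppet_agent -a 2999 3000
nagios   32618 11341  0 08:28 ?        00:00:00 /usr/local/nagios/libexec/check_ping -H XXXXX -w 5000.0,80% -c 7000.0,100% -p 5
root     32613   962  0 08:28 ?        00:00:00 sleep 20
"""

for name, n in summarize_checks(sample).most_common():
    print(n, name)
```

Lines that were cut off in the paste (no `-c` argument or no full plugin path) are either counted under the bare plugin name or skipped, so the totals are only approximate.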

top -n 1 |head -20

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
30914 mysql     20   0 4322m 111m 6864 S 110.8  0.4   4896:43 mysqld
11336 nagios    20   0 84104  57m 1300 D 55.4  0.2   4:51.45 nagios
 2945 apache    20   0  452m  38m 6028 S 36.3  0.1   0:42.90 httpd
14241 apache    20   0  453m  39m 5844 S 36.3  0.1   1:45.40 httpd
26414 apache    20   0  454m  40m 6060 S 36.3  0.1   1:28.03 httpd
 6439 apache    20   0  453m  40m 5772 S 34.4  0.1   0:41.70 httpd
29446 apache    20   0  452m  38m 5856 S 34.4  0.1   1:03.95 httpd
 3115 apache    20   0  451m  38m 5980 S 30.6  0.1   0:43.01 httpd
 4762 apache    20   0  458m  44m 5824 S 30.6  0.1   1:18.73 httpd
10025 apache    20   0  444m  31m 6052 S 28.7  0.1   1:11.98 httpd
 8668 apache    20   0  457m  43m 6108 S 22.9  0.1   3:05.79 httpd
23604 apache    20   0  449m  36m 6096 R 19.1  0.1   1:21.07 httpd
24537 nagios    20   0  267m  24m 4988 S 19.1  0.1   0:00.47 ruby

ipcs -q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x03410002 14680064   nagios     600        18143232     17718
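
To put numbers on that backlog, here is a minimal sketch that parses `ipcs -q` output like the above and works out the average message size. The function name `parse_ipcs_q` is hypothetical, and it assumes the six-column layout shown here:

```python
def parse_ipcs_q(text):
    """Return one dict per queue row found in `ipcs -q` output."""
    queues = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows have six columns and start with a hex key
        if len(parts) == 6 and parts[0].startswith("0x"):
            queues.append({
                "key": parts[0],
                "msqid": int(parts[1]),
                "owner": parts[2],
                "perms": parts[3],
                "used_bytes": int(parts[4]),
                "messages": int(parts[5]),
            })
    return queues


# Output copied from the post above
ipcs_output = """\
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x03410002 14680064   nagios     600        18143232     17718
"""

for q in parse_ipcs_q(ipcs_output):
    avg = q["used_bytes"] / q["messages"] if q["messages"] else 0.0
    print(q["owner"], q["messages"], "messages,", avg, "bytes each on average")
```

Running the same parse on two samples taken a minute apart and comparing `messages` would show whether the queue is draining or still growing.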


cat /etc/sysctl.conf

# Kernel sysctl configuration file for Red Hat Linux
#
# For binary values, 0 is disabled, 1 is enabled.  See sysctl(8) and
# sysctl.conf(5) for more details.

# Controls IP packet forwarding
net.ipv4.ip_forward = 0

# Controls source route verification
net.ipv4.conf.default.rp_filter = 1

# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0

# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0

# Controls whether core dumps will append the PID to the core filename.
# Useful for debugging multi-threaded applications.
kernel.core_uses_pid = 1

# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 1

net.ipv4.icmp_echo_ignore_broadcasts = 1

net.ipv4.conf.all.accept_redirects = 0

# Disable netfilter on bridges.
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0

# Controls the default maxmimum size of a mesage queue
#kernel.msgmnb = 131072000
#kernel.msgmnb = 262144000
kernel.msgmnb = 524288000

# Controls the maximum size of a message, in bytes
#kernel.msgmax = 131072000
kernel.msgmax = 262144000

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 4294967295

# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 268435456
kernel.msgmni = 512000
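
For reference, a small sketch comparing the queue usage from the `ipcs -q` output against the `kernel.msgmnb` value in this file. The helper `parse_sysctl_conf` is hypothetical, and it only reads the file text, so it assumes these values have actually been loaded into the running kernel:

```python
def parse_sysctl_conf(text):
    """Return active (uncommented) key/value pairs from sysctl.conf text."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and commented-out entries such as "#kernel.msgmnb = ..."
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


# Message queue lines copied from the file above
conf_text = """\
#kernel.msgmnb = 131072000
#kernel.msgmnb = 262144000
kernel.msgmnb = 524288000
#kernel.msgmax = 131072000
kernel.msgmax = 262144000
kernel.msgmni = 512000
"""

settings = parse_sysctl_conf(conf_text)
used_bytes = 18143232  # used-bytes column from the ipcs -q output in this post
msgmnb = int(settings["kernel.msgmnb"])
pct_full = 100.0 * used_bytes / msgmnb
print("queue is at %.2f%% of kernel.msgmnb" % pct_full)
```

With these numbers the queue sits at about 3.5% of the configured per-queue limit, which would suggest the limit itself is not what is being hit and the consumer side is simply draining slowly.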

cat /etc/my.cnf

[mysqld]
query_cache_size=16M
query_cache_limit=4M
tmp_table_size=64M
max_heap_table_size=64M
key_buffer_size=32M
table_open_cache=32
max_connections=500

datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
user=mysql
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0

[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

Re: Nagios XI gone crazy

Posted: Mon Aug 14, 2017 7:36 am
by tacolover101
Is segmenting your Nagios XI system into 2-3 systems and leveraging Fusion an option? This is a rather large install, and when XI re-writes the info from the database, it seems to take time.

Another option: what type of disks is your machine running? How about the offloaded DB?

Re: Nagios XI gone crazy

Posted: Mon Aug 14, 2017 10:45 am
by tgriep
Thanks @tacolover101 for the suggestions.
Segmenting the server will help with this issue.

Try moving the system to a faster hard drive and adding a RAM disk as well.

Re: Nagios XI gone crazy

Posted: Wed Sep 06, 2017 6:26 am
by reincarne
tgriep wrote: Thanks @tacolover101 for the suggestions.
Segmenting the server will help out in this issue.

Try moving the system to a faster hard drive as well as adding a Ramdisk to the system.

For now we solved the issue by splitting the check intervals, which stabilized the system.
From my side, you can close the topic.
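
For anyone landing here later, a back-of-the-envelope sketch of why changing check intervals can matter. It takes the 25902 service checks tgriep quoted from the profile earlier in the thread; the interval mix below is purely hypothetical, since the actual intervals were not posted, and `checks_per_second` is just an illustrative helper:

```python
def checks_per_second(groups):
    """groups: iterable of (number_of_checks, interval_in_minutes) pairs."""
    return sum(n / (minutes * 60.0) for n, minutes in groups)


TOTAL_CHECKS = 25902  # figure quoted from the profile earlier in the thread

# Everything on a 5-minute interval
all_five = checks_per_second([(TOTAL_CHECKS, 5)])

# Hypothetical split: half the checks stay at 5 minutes, half move to 15
half = TOTAL_CHECKS // 2
split = checks_per_second([(half, 5), (TOTAL_CHECKS - half, 15)])

print("all at 5 min: %.2f checks/sec" % all_five)
print("half moved to 15 min: %.2f checks/sec" % split)
```

Fewer results per second means fewer messages pushed into the kernel queue for ndo2db to write out, which is one plausible reading of why the system stabilized.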