Nagios distributed monitoring

Support forum for Nagios Core, Nagios Plugins, NCPA, NRPE, NSCA, NDOUtils and more. Engage with the community of users including those using the open source solutions.
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: Nagios distributed monitoring

Post by jdalrymple »

It's hard for me to say, since in my experience Gearman has never crashed. I've never had an environment susceptible to that, though - all of mine have been small environments. Do you have debug logging turned on in your mod_gearman_worker.conf? I would think there might be some clues there.

Code: Select all

# use debug to increase the verbosity of the module.
# Possible values are:
#     0 = only errors
#     1 = debug messages
#     2 = trace messages
#     3 = trace and all gearman related logs are going to stdout.
# Default is 0.
debug=0

# Path to the logfile.
logfile=/var/log/mod_gearman/mod_gearman_worker.log
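If it's off, bumping it to 1 and tailing the log while the problem happens is usually enough. A minimal sketch (the config path and service name below are assumptions - they vary by distribution):

Code: Select all

# in mod_gearman_worker.conf (path may differ per install)
debug=1

# then restart the worker and watch the log
service mod_gearman_worker restart
tail -f /var/log/mod_gearman/mod_gearman_worker.log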
eloyd
Cool Title Here
Posts: 2129
Joined: Thu Sep 27, 2012 9:14 am
Location: Rochester, NY
Contact:

Re: Nagios distributed monitoring

Post by eloyd »

Fusion may still be for you. Rather than have one Nagios server which sends work to multiple Gearman workers, you could have multiple Nagios servers at each location instead. Then use Fusion to link them all together into one "view" or "dashboard" for you to look at to see if anything is wrong.
Eric Loyd • http://everwatch.global • 844.240.EVER • @EricLoyd • I'm a Nagios Fanatic!
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Nagios distributed monitoring

Post by tmcdonald »

@klajosh2 - I know this thread is somewhat old but we will keep it around for a bit in case you come back. Let us know about the logging you may have enabled.
Former Nagios employee
klajosh2
Posts: 38
Joined: Thu Jan 16, 2014 5:22 am

Re: Nagios distributed monitoring

Post by klajosh2 »

Sorry for getting back so late.
I still experience orphaned checks with mod_gearman, and currently I am also seeing extreme load on one of my collectors (the machine runs only mod_gearman_worker to execute checks).
It is very annoying because I did not change anything specific; I did not add a thousand more services for the collector to check. Quite strange.
Regarding the orphaned checks: when the problem struck I found that, for some reason, the check_results queue was "full". Normally Nagios, or rather Gearman's event broker module, processes check results immediately. When the problem happened, the number of waiting checks in the check_results queue was 7323. The only solution was to restart Nagios.
On another forum, the explanation given for the problem above was the following:
"The check_results queue piling up usually means that there is a problem with Nagios or the Mod-Gearman NEB module."

And now my second current problem:

As I mentioned earlier, the load on one of my collectors is extremely high, between 40 and 60. I turned on NEB debug and worker debug. What I saw is that checks were
sent not at the regular interval set in the Nagios config (3 minutes) but every 15-20 seconds. Check it out:

[2015-09-10 17:19:17][7023][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:19:36][18977][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:19:57][31012][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:20:16][18972][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:20:35][7023][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:20:54][18971][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:21:13][8427][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:21:32][12288][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:21:51][8251][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:22:12][14058][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:25:41][18762][DEBUG] got service job: host-name01 - interfaces


and sometimes it goes back to normal scheduling:
[2015-09-10 17:25:41][18762][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:28:42][10092][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:31:11][9525][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:33:28][17713][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:35:28][17714][DEBUG] got service job: host-name01 - interfaces
[2015-09-10 17:37:20][9525][DEBUG] got service job: host-name01 - interfaces


Configuration of host-name01:

Code: Select all

Host           Description    Max. Check Attempts    Normal Check Interval    Retry Check Interval    Obsess Over
host-name01    interfaces     3                      0h 3m 0s                 0h 1m 0s                Yes
Do you have any idea why this is happening?
(The collector has 16 cores and is configured to check about 200 hosts and 800-1000 services.)

Thanks for the help.
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: Nagios distributed monitoring

Post by jdalrymple »

klajosh2 wrote:"The check_results queue is pilling up usually means that there is a problem with Nagios or the Mod-Gearman NEB module."
That is a proper explanation. The check_results queue should always be empty, or if it is not, it should empty itself rather quickly. When it is not empty, it means that Nagios and/or the NEB module have stalled for one reason or another. Troubleshooting why that is happening is where you start. What does the load look like on your central server? Is anything showing up in the NEB module debug log to indicate why the stall is occurring? It is also possible that enabling gearmand logging and looking there will give insight.

There is a problem, but the problem is *not* that mod_gearman is unstable; as mentioned, when properly configured it can easily handle many tens of thousands of checks without fail. I recommend offloading as much work from the central server as you can: run all of your host and service checks on external workers and maybe even offload gearmand. I've seen great results from having gearmand on one or more separate servers.
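As a rough sketch of that last bit (the hostname is only an example), offloading gearmand just means pointing the NEB module and every worker at the external gearmand via their server lines:

Code: Select all

# mod_gearman_neb.conf on the central Nagios server
server=gearmand01.example.com:4730

# mod_gearman_worker.conf on each worker box
server=gearmand01.example.com:4730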
klajosh2 wrote:What I saw is that checks were sent not at the regular interval set in the Nagios config (3 minutes) but every 15-20 seconds.

Do you have any idea why this is happening?
Most likely on-demand checks, not an uncommon scenario:

https://assets.nagios.com/downloads/nag ... hecks.html
https://assets.nagios.com/downloads/nag ... hecks.html
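One related knob worth a look (just a suggestion on my part - the docs above cover the details) is the cached-check horizon in nagios.cfg; with a horizon of 0, every on-demand check results in a real plugin execution:

Code: Select all

# nagios.cfg - how old a cached result may be and still satisfy an on-demand check (seconds)
cached_host_check_horizon=15
cached_service_check_horizon=15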

My question would be: why is the load so high? There is no reason for a system of that size to have any issue with 1000 services unless they're all super-busy Perl plugins like check_wmi_plus or check_esx3.
klajosh2
Posts: 38
Joined: Thu Jan 16, 2014 5:22 am

Re: Nagios distributed monitoring

Post by klajosh2 »

Load of the central server: this is a good question. When the problem strikes, the load is close to zero. (About my environment: the central server not only runs the gearmand, nagios and thruk processes, it also has a mod_gearman_worker and runs checks against specific hosts.)
The NEB module's debug log does not show anything extra or specific; it shows that checks are orphaned, which is true because no results were processed.

My gearmand daemon's startup options are the following:
"--worker-wakeup=10 --retention-file=/tmp/gearmand.retention -q retention --log-file=/var/log/gearmand/gearmand.log"

I will extend this with the -v option to get more verbose gearmand logging so I can see more when the problem strikes. I will get back when I have results.
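For reference, the extended startup options would look roughly like this (the exact verbosity syntax depends on the gearmand version; newer builds take --verbose=DEBUG instead of a repeated -v):

Code: Select all

gearmand --worker-wakeup=10 --retention-file=/tmp/gearmand.retention -q retention \
  --log-file=/var/log/gearmand/gearmand.log --verbose=DEBUG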


High CPU load:

I think the problem is that time synchronization is not working properly on the server. I see these in /var/log/messages:

Sep 14 13:39:33 GEARMAN-POLLER-SERVER ntpd[2365]: 0.0.0.0 0613 03 spike_detect +192.225993 s
Sep 14 13:52:54 GEARMAN-POLLER-SERVER ntpd[2365]: 0.0.0.0 061c 0c clock_step +192.239068 s
Sep 14 13:52:54 GEARMAN-POLLER-SERVER ntpd[2365]: 0.0.0.0 0615 05 clock_sync
Sep 14 13:52:55 GEARMAN-POLLER-SERVER ntpd[2365]: 0.0.0.0 c618 08 no_sys_peer
Sep 14 14:34:02 GEARMAN-POLLER-SERVER ntpd[2365]: 0.0.0.0 0613 03 spike_detect +192.259064 s
Sep 14 14:55:17 GEARMAN-POLLER-SERVER ntpd[2365]: 0.0.0.0 061c 0c clock_step +192.263858 s
Sep 14 14:55:17 GEARMAN-POLLER-SERVER ntpd[2365]: 0.0.0.0 0615 05 clock_sync
Sep 14 14:55:18 GEARMAN-POLLER-SERVER ntpd[2365]: 0.0.0.0 c618 08 no_sys_peer
Sep 14 15:05:45 GEARMAN-POLLER-SERVER ntpd[2365]: 0.0.0.0 0628 08 no_sys_peer
Sep 14 15:37:10 GEARMAN-POLLER-SERVER ntpd[2365]: 0.0.0.0 0613 03 spike_detect +192.284144 s
Sep 14 15:52:44 GEARMAN-POLLER-SERVER ntpd[2365]: 0.0.0.0 061c 0c clock_step +192.282527 s
Sep 14 15:52:44 GEARMAN-POLLER-SERVER ntpd[2365]: 0.0.0.0 0615 05 clock_sync


So my hypothesis is that the poller server's clock is jumping back and forth by about 3 minutes, and this causes problems in the scheduler, which then sends checks to the poller like crazy. Can this be a correct
explanation of the high CPU utilization? I see a check that should run every 3 minutes actually run every 20 seconds.
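A quick way to confirm whether the clock really is stepping (assuming the standard ntp tools are installed) is to watch the reported offset on the poller:

Code: Select all

# offset column is in milliseconds; values around 192000 would match the ntpd messages above
ntpq -pn

# or query a reference server directly without setting the clock
ntpdate -q pool.ntp.org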

What do you think?
klajosh2
Posts: 38
Joined: Thu Jan 16, 2014 5:22 am

Re: Nagios distributed monitoring

Post by klajosh2 »

One more thing I noticed that is very, very strange:
the gearmand log timestamp is a month behind! How can this be?
zless /var/log/gearmand/gearmand.log-20150911.gz (one day behind is OK since that is how logrotate works)
Do you have any idea?

ERROR 2015-08-10 01:39:02.000000 [ 3 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2015-08-10 01:39:02.000000 [ 3 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2015-08-10 01:39:02.000000 [ 3 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2015-08-10 01:39:02.000000 [ 3 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2015-08-10 01:41:14.000000 [ 3 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2015-08-10 01:41:14.000000 [ 3 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2015-08-10 01:41:14.000000 [ 3 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2015-08-10 01:41:14.000000 [ 3 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2015-08-10 01:41:14.000000 [ 2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2015-08-10 01:41:14.000000 [ 2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2015-08-10 01:41:28.000000 [ 2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2015-08-10 01:41:28.000000 [ 2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
eloyd
Cool Title Here
Posts: 2129
Joined: Thu Sep 27, 2012 9:14 am
Location: Rochester, NY
Contact:

Re: Nagios distributed monitoring

Post by eloyd »

My immediate thought is - are you running NTP on all servers?
Eric Loyd • http://everwatch.global • 844.240.EVER • @EricLoyd • I'm a Nagios Fanatic!
klajosh2
Posts: 38
Joined: Thu Jan 16, 2014 5:22 am

Re: Nagios distributed monitoring

Post by klajosh2 »

Yes, we are running NTP on all servers.
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: Nagios distributed monitoring

Post by jdalrymple »

A couple of pieces of advice (a rough sketch of both follows below):

1) Set up check_ntp_time on all of your worker servers and make sure they are all pretty darn close.

2) Next time you experience orphans, run `ps -ef | grep nagios.cfg` on your primary Nagios box and make sure there is only one nagios parent process - about 90% of the time, multiple nagios processes cause orphans, and that is the beginning of the end for a Gearman-configured Nagios install.
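Roughly what both of those look like in practice (the plugin path, NTP server and thresholds are only examples):

Code: Select all

# 1) compare each worker's clock against a reference; warn at 0.5s offset, critical at 1s
/usr/local/nagios/libexec/check_ntp_time -H pool.ntp.org -w 0.5 -c 1

# 2) on the central Nagios box there should be exactly one parent nagios process
ps -ef | grep nagios.cfg | grep -v grep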
Locked