Event handler execution without calling corresponding script

jorr
Posts: 5
Joined: Tue Feb 25, 2014 2:22 pm
Location: USA, State of Jefferson

Post by jorr »

Hello,

I have a Nagios server with ~2500 defined services and approximately 550 hosts. For the most part the server operates great with a single operational event handler for a specific set of checks that require an open SSH connection to perform. If the checks fail the event handler checks the status of the SSH connection and reconnects itself if required.

I am adding a new event handler that will execute when a host reaches a pre-defined threshold of memory and swap utilization. The event handler will execute an NRPE call to the target server and pass the command adjust_swap, which evaluates a few variables, adjusts them if required, and clears memory and swap space.

The command executes perfectly when I run it from the Nagios server against the NRPE client as the nagios user: it does its work, echoes into a file, and shows up in the client-side logs. Because the following command succeeds, I do not believe this is a permissions, network, owner/group, or configuration issue on the NRPE client side.

Code: Select all

[nagios@nagios ~]$ /usr/local/nagios/libexec/check_nrpe -H <ADDRESS> -p 5666 -c adjust_swap
OK - Memory and Swap cleared, swappiness is set to 10.
[nagios@nagios ~]$
While the client is in a normal state, I execute a memory stress test to generate swap usage and throw an alert on the Nagios server. The Nagios server then calls the event handler (according to nagios.log), but nothing ever gets executed.

Code: Select all

[root@nagios ~]# tail -n 0 -f nagios/var/nagios.log | grep sgr9-test
[1395251790] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;sgr9-test;Memory and Swap Use - With Automatic Cleanup;1395251748
[1395251790] SERVICE ALERT: sgr9-test;Memory and Swap Use - With Automatic Cleanup;CRITICAL;SOFT;1;Ram : 3%, Swap : 6% : > 98, 5 : CRITICAL
[1395251790] SERVICE EVENT HANDLER: sgr9-test;Memory and Swap Use - With Automatic Cleanup;CRITICAL;SOFT;1;adjust_swap_viaNRPE
[1395251851] SERVICE ALERT: sgr9-test;Memory and Swap Use - With Automatic Cleanup;CRITICAL;SOFT;2;Ram : 3%, Swap : 6% : > 98, 5 : CRITICAL
[1395251851] SERVICE EVENT HANDLER: sgr9-test;Memory and Swap Use - With Automatic Cleanup;CRITICAL;SOFT;2;adjust_swap_viaNRPE
[1395251910] SERVICE ALERT: sgr9-test;Memory and Swap Use - With Automatic Cleanup;CRITICAL;HARD;3;Ram : 3%, Swap : 6% : > 98, 5 : CRITICAL
[1395251910] SERVICE EVENT HANDLER: sgr9-test;Memory and Swap Use - With Automatic Cleanup;CRITICAL;HARD;3;adjust_swap_viaNRPE
Here are some of the setup parameters I am currently using, with masking over IP addresses and non-relevant data purged. Also note, event handlers are enabled in nagios.cfg and operate for another set of checks present in the server configuration.

NAGIOS SERVER

Code: Select all

define host {
	use                     linux-virt-mach
	host_name               sgr9-test
	hostgroups              memoryclear
	alias                   sgr9-test
	address                 <ADDRESS>
	event_handler_enabled   1
}
define service{
	use                     generic-service,service-pnp
	service_description     Memory and Swap Use - With Automatic Cleanup
	check_command           check_memory_swap
	event_handler           adjust_swap_viaNRPE
	event_handler_enabled   1
	is_volatile             1
}
define command{
        command_name    adjust_swap_viaNRPE
        command_line    $USER1$/usr/local/nagios/libexec/check_nrpe -H <ADDRESS> -p 5666 -c adjust_swap
}
Any advice? I've looked through a number of support threads, googled, etc., and have implemented the adjustments I found there (permissions, echoing to a file), but to no avail.

Sincerely,

Jesse
Last edited by jorr on Mon Mar 24, 2014 12:58 pm, edited 1 time in total.
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Post by tmcdonald »

You could try running a tcpdump on port 5666 to see if the traffic even hits the server when run from Nagios.

Code: Select all

tcpdump port 5666
Former Nagios employee

Post by jorr »

Hello T,

I ran a tcpdump on port 5666 on the NRPE server and saw traffic both when I ran the command manually and when any of several other NRPE-based checks ran. To isolate the event handler, I halted all other NRPE checks against the target server and re-ran the tcpdump.

Here are the results of running the event handler manually. The timestamps pause while the script executes and resume when the results are returned.

Code: Select all

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
10:55:26.538907 IP unknown.servercentral.net.44902 > unknown.servercentral.net.nrpe: Flags [S], seq 3931271594, win 14600, options [mss 1460,sackOK,TS val 1743830829 ecr 0,nop,wscale 9], length 0
10:55:26.538939 IP unknown.servercentral.net.nrpe > unknown.servercentral.net.44902: Flags [S.], seq 3340450746, ack 3931271595, win 14480, options [mss 1460,sackOK,TS val 81587480 ecr 1743830829,nop,wscale 7], length 0
10:55:26.539056 IP unknown.servercentral.net.44902 > unknown.servercentral.net.nrpe: Flags [.], ack 1, win 29, options [nop,nop,TS val 1743830829 ecr 81587480], length 0
10:55:26.564788 IP unknown.servercentral.net.44902 > unknown.servercentral.net.nrpe: Flags [P.], seq 1:128, ack 1, win 29, options [nop,nop,TS val 1743830854 ecr 81587480], length 127
10:55:26.564811 IP unknown.servercentral.net.nrpe > unknown.servercentral.net.44902: Flags [.], ack 128, win 114, options [nop,nop,TS val 81587506 ecr 1743830854], length 0
10:55:26.656450 IP unknown.servercentral.net.nrpe > unknown.servercentral.net.44902: Flags [P.], seq 1:212, ack 128, win 114, options [nop,nop,TS val 81587597 ecr 1743830854], length 211
10:55:26.656646 IP unknown.servercentral.net.44902 > unknown.servercentral.net.nrpe: Flags [.], ack 212, win 31, options [nop,nop,TS val 1743830946 ecr 81587597], length 0
10:55:26.657325 IP unknown.servercentral.net.44902 > unknown.servercentral.net.nrpe: Flags [P.], seq 128:262, ack 212, win 31, options [nop,nop,TS val 1743830947 ecr 81587597], length 134
10:55:26.657351 IP unknown.servercentral.net.nrpe > unknown.servercentral.net.44902: Flags [.], ack 262, win 122, options [nop,nop,TS val 81587598 ecr 1743830947], length 0
10:55:26.657779 IP unknown.servercentral.net.nrpe > unknown.servercentral.net.44902: Flags [P.], seq 212:446, ack 262, win 122, options [nop,nop,TS val 81587599 ecr 1743830947], length 234
10:55:26.658729 IP unknown.servercentral.net.44902 > unknown.servercentral.net.nrpe: Flags [P.], seq 262:1376, ack 446, win 33, options [nop,nop,TS val 1743830948 ecr 81587599], length 1114
10:55:26.698465 IP unknown.servercentral.net.nrpe > unknown.servercentral.net.44902: Flags [.], ack 1376, win 139, options [nop,nop,TS val 81587640 ecr 1743830948], length 0
10:55:34.556957 IP unknown.servercentral.net.nrpe > unknown.servercentral.net.44902: Flags [P.], seq 446:1560, ack 1376, win 139, options [nop,nop,TS val 81595498 ecr 1743830948], length 1114
10:55:34.557275 IP unknown.servercentral.net.44902 > unknown.servercentral.net.nrpe: Flags [P.], seq 1376:1413, ack 1560, win 38, options [nop,nop,TS val 1743838847 ecr 81595498], length 37
10:55:34.557315 IP unknown.servercentral.net.nrpe > unknown.servercentral.net.44902: Flags [P.], seq 1560:1597, ack 1413, win 139, options [nop,nop,TS val 81595498 ecr 1743838847], length 37
10:55:34.557319 IP unknown.servercentral.net.44902 > unknown.servercentral.net.nrpe: Flags [F.], seq 1413, ack 1560, win 38, options [nop,nop,TS val 1743838847 ecr 81595498], length 0
10:55:34.560353 IP unknown.servercentral.net.nrpe > unknown.servercentral.net.44902: Flags [F.], seq 1597, ack 1414, win 139, options [nop,nop,TS val 81595501 ecr 1743838847], length 0
10:55:34.560485 IP unknown.servercentral.net.44902 > unknown.servercentral.net.nrpe: Flags [.], ack 1598, win 38, options [nop,nop,TS val 1743838850 ecr 81595498], length 0
These look A-OK to me.

Now, when I initiate the memory stress test and restart the tcpdump I get the following.

From Nagios Log:

Code: Select all

[1395331169] SERVICE ALERT: sgr9-test;Memory and Swap Use - With Automatic Cleanup;OK;HARD;3;Ram : 4%, Swap : 0% : : OK
[1395331169] SERVICE EVENT HANDLER: sgr9-test;Memory and Swap Use - With Automatic Cleanup;OK;HARD;3;adjust_swap_viaNRPE
[1395331469] SERVICE ALERT: sgr9-test;Memory and Swap Use - With Automatic Cleanup;CRITICAL;SOFT;1;Ram : 3%, Swap : 6% : > 98, 5 : CRITICAL
[1395331469] SERVICE EVENT HANDLER: sgr9-test;Memory and Swap Use - With Automatic Cleanup;CRITICAL;SOFT;1;adjust_swap_viaNRPE
[1395331525] SERVICE ALERT: sgr9-test;Memory and Swap Use - With Automatic Cleanup;CRITICAL;SOFT;2;Ram : 3%, Swap : 6% : > 98, 5 : CRITICAL
[1395331525] SERVICE EVENT HANDLER: sgr9-test;Memory and Swap Use - With Automatic Cleanup;CRITICAL;SOFT;2;adjust_swap_viaNRPE
[1395331588] SERVICE ALERT: sgr9-test;Memory and Swap Use - With Automatic Cleanup;CRITICAL;HARD;3;Ram : 3%, Swap : 6% : > 98, 5 : CRITICAL
[1395331588] SERVICE EVENT HANDLER: sgr9-test;Memory and Swap Use - With Automatic Cleanup;CRITICAL;HARD;3;adjust_swap_viaNRPE
[1395331886] SERVICE ALERT: sgr9-test;Memory and Swap Use - With Automatic Cleanup;CRITICAL;HARD;3;Ram : 3%, Swap : 6% : > 98, 5 : CRITICAL
[1395331886] SERVICE EVENT HANDLER: sgr9-test;Memory and Swap Use - With Automatic Cleanup;CRITICAL;HARD;3;adjust_swap_viaNRPE
[1395332187] SERVICE ALERT: sgr9-test;Memory and Swap Use - With Automatic Cleanup;CRITICAL;HARD;3;Ram : 3%, Swap : 6% : > 98, 5 : CRITICAL
[1395332187] SERVICE EVENT HANDLER: sgr9-test;Memory and Swap Use - With Automatic Cleanup;CRITICAL;HARD;3;adjust_swap_viaNRPE
[1395332487] SERVICE ALERT: sgr9-test;Memory and Swap Use - With Automatic Cleanup;CRITICAL;HARD;3;Ram : 3%, Swap : 6% : > 98, 5 : CRITICAL
[1395332487] SERVICE EVENT HANDLER: sgr9-test;Memory and Swap Use - With Automatic Cleanup;CRITICAL;HARD;3;adjust_swap_viaNRPE
From TCP Dump on NRPE Target:

Code: Select all

sgr9:~# tcpdump port 5666
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
So it looks like Nagios is not sending packets, despite logging the event handler action in the nagios logs. I checked out the configuration link on the nagios site and followed it through to HOSTS > SGR9-TEST > MEMORY CHECK > EVENT HANDLER and got this information:

Code: Select all

Command Name	Command Line
To expand:	adjust_swap_viaNRPE
adjust_swap_viaNRPE	$USER1$/usr/local/nagios/libexec/check_nrpe -H <IP ADDRESS OF NRPE CLIENT HARD CODED INTO COMMAND> -p 5666 -c adjust_swap
->	$USER1$/usr/local/nagios/libexec/check_nrpe -H <IP ADDRESS OF NRPE CLIENT HARD CODED INTO COMMAND> -p 5666 -c adjust_swap

Enter the command_check definition from a host or service definition and press Go to see the expansion of the command
The command is identical to what I run manually to successfully clear the memory on the NRPE client and return results (which generated the tcpdump output above).

I double-checked the IP addresses and they look correct, which leaves me at a loss.

Sincerely,

Jesse

Post by tmcdonald »

jorr wrote: Here is the results of running the event-handler manually, the time stamps pause while the script executes, and then resume when the results are returned.
What do you mean by "running manually"? Do you mean forcing the event handler to run by doing the memory stress test? Or do you mean by running the event handler command from the CLI?

I would test this by changing the event handler, instead of doing anything with NRPE, to simply touch a file in /tmp to see if it's actually getting called. It could be logged but not actually run.

If the touch in /tmp does run, my next step would be to edit the original command so that it pipes into tee and writes its output to a file, so we can see whether any errors were displayed.
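
The two debugging steps suggested above could be combined in a small wrapper, sketched here in shell. This is only an illustration under assumptions: the function name `run_logged` and the log path `/tmp/event_handler_debug.log` are hypothetical, and it uses plain redirection rather than tee so that the command's exit status can be recorded as well.

```shell
#!/bin/sh
# Hypothetical debug wrapper for an event handler command.
# LOG is an assumed location; any file the nagios user can write to would do.
LOG="${LOG:-/tmp/event_handler_debug.log}"

run_logged() {
    # Step 1: leave proof that the handler was invoked at all, and by which user.
    echo "$(date '+%Y-%m-%d %H:%M:%S') invoked as $(id -un): $*" >> "$LOG"
    # Step 2: run the real command, appending its stdout and stderr to the log.
    rc=0
    "$@" >> "$LOG" 2>&1 || rc=$?
    # Record the exit status so a failed launch is visible afterwards.
    echo "exit status: $rc" >> "$LOG"
    return "$rc"
}
```

A handler script could then call, for example, `run_logged /usr/local/nagios/libexec/check_nrpe -H <ADDRESS> -p 5666 -c adjust_swap`. An exit status of 126 or 127 in the log would suggest the shell could not execute or find the program at all, which points at the command line rather than at NRPE.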

Post by jorr »

T,

I will follow through with your suggestion today.

By running manually I mean by executing the command the event handler is configured to execute. So after generating some swap usage I log into nagios and su to nagios user, then execute this: /usr/local/nagios/libexec/check_nrpe -H <ADDRESS> -p 5666 -c adjust_swap

The current script does indeed write to a temp file. I will replace the script with a simple 'touch /tmp/test' and move forward from there.

Sincerely,

Jesse

Post by tmcdonald »

Yeah, running the command by hand from the CLI is likely to succeed. We need to see whether Nagios itself is running it and, if so, whether the script is failing somewhere.

Post by jorr »

Hey T,

Sorry for the delay; I got hit with deploying and configuring a cluster of VMs for a client.

While writing up the results of your suggested test, I was pasting in my configuration for reference and came across the command definition, previously defined as:

Code: Select all

define command{
        command_name    adjust_swap_viaNRPE
        command_line    $USER1$/usr/local/nagios/libexec/check_nrpe -H <ADDRESS> -p 5666 -c adjust_swap
}
I had adjusted the command to be this:

Code: Select all

define command{
        command_name    adjust_swap_viaNRPE
        command_line    $USER1$/usr/local/nagios/libexec/test
#       command_line    $USER1$/usr/local/nagios/libexec/check_nrpe -H <ADDRESS> -p 5666 -c adjust_swap
}
And it hit me!

I was doubling the directory path in the command line: $USER1$ already expands to the plugin directory, and I was following it with the full path to the script.

I adjusted the command definition to this and tested with the original scripts for adjusting memory in place, meeting with complete success:

Code: Select all

define command{
       command_name    adjust_swap_viaNRPE
       command_line    $USER1$/check_nrpe -H <ADDRESS> -p 5666 -c adjust_swap
}
So the moral of the story is: check your paths, particularly when macros are involved.
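
For anyone who wants to check this kind of mistake mechanically, here is a rough shell sketch that expands $USER1$ in a command_line and tests whether the resulting program exists. The function name `check_command_path` is made up for illustration, and the $USER1$ value has to be supplied by hand (it is normally set in resource.cfg); Nagios itself does not ship this helper.

```shell
#!/bin/sh
# Sketch: expand $USER1$ in a command_line and check that the program exists.
# check_command_path is a hypothetical helper, not part of Nagios.
check_command_path() {
    user1=$1          # value of $USER1$, e.g. /usr/local/nagios/libexec
    command_line=$2   # command_line exactly as written in the command definition
    program=${command_line%% *}       # the first word is the program to run
    case $program in
        \$USER1\$*)
            rest=${program#\$USER1\$} # strip the literal macro prefix
            expanded=$user1$rest
            ;;
        *)
            expanded=$program
            ;;
    esac
    if [ -x "$expanded" ]; then
        echo "OK: $expanded"
    else
        echo "MISSING: $expanded"
        return 1
    fi
}
```

Assuming $USER1$ is /usr/local/nagios/libexec, passing the original command_line to this function expands the program to /usr/local/nagios/libexec/usr/local/nagios/libexec/check_nrpe, which makes the doubled path easy to spot.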

Thank you for the aid and support. Without it I was starting to stumble around pretty good.

Sincerely,

Jesse