Re: All Linux Server CPU Spike at same time
Posted: Wed Mar 08, 2017 3:58 pm
by mcapra
From 10.2.8.7, can you share the output of these commands:
Code: Select all
ls -al /usr/local/nagios/libexec/
ps aux | grep xinetd
ps aux | grep nrpe
cat /usr/local/nagios/etc/nrpe.cfg
Re: All Linux Server CPU Spike at same time
Posted: Wed Mar 08, 2017 9:35 pm
by kwhogster
See attached file
Re: All Linux Server CPU Spike at same time
Posted: Thu Mar 09, 2017 1:24 pm
by mcapra
I notice in your nrpe.cfg that the hosts in allowed_hosts are not comma-delimited.
I'm not sure if that's causing these problems, but I would put a comma between those two IPs and restart the nrpe service.
From the previous remote machine (not your Nagios Core machine), can you run the following commands and share their outputs:
Code: Select all
su nagios
/usr/lib/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
ls -al /usr/lib/nagios/plugins/
/usr/lib/nagios/plugins/check_nrpe -H 127.0.0.1
/usr/lib/nagios/plugins/check_nrpe -H 127.0.0.1 -c check_load
Re: All Linux Server CPU Spike at same time
Posted: Thu Mar 09, 2017 8:55 pm
by kwhogster
I made this change on my Nagios Core server, from:
allowed_hosts=127.0.0.1 10.2.8.79
to:
allowed_hosts=127.0.0.1,10.2.8.79
I then restarted the NRPE service.
On one of the remote Linux hosts I found the plugins; the results are in the attached file.
The check_nrpe commands failed.
Do I need to modify the nrpe.cfg on all the Linux boxes?
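For anyone applying the same allowed_hosts fix on several machines, here is a minimal sketch of the edit as a shell filter. The function name normalize_allowed_hosts is hypothetical, and it only rewrites text passed on stdin; writing the result back to a real nrpe.cfg and restarting NRPE afterwards is left to you.

```shell
# Hypothetical helper (not part of NRPE): turn a space-separated
# allowed_hosts line into a comma-separated one. Other lines pass through.
normalize_allowed_hosts() {
    sed -E '/^allowed_hosts=/ { s/[[:space:]]+$//; s/[[:space:]]+/,/g; s/,+/,/g; }'
}

# Example with the line quoted in this thread:
printf 'allowed_hosts=127.0.0.1 10.2.8.79\n' | normalize_allowed_hosts
# prints: allowed_hosts=127.0.0.1,10.2.8.79

# Preview against a real config without modifying it (path as used earlier in the thread):
# normalize_allowed_hosts < /usr/local/nagios/etc/nrpe.cfg | grep '^allowed_hosts='
```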
Re: All Linux Server CPU Spike at same time
Posted: Fri Mar 10, 2017 12:00 pm
by mcapra
Does cutting SSL out of the equation fix things? Using the -n argument:
Code: Select all
/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
/usr/local/nagios/libexec/check_nrpe -H 127.0.0.1 -n
/usr/local/nagios/libexec/check_nrpe -H 127.0.0.1 -n -c check_load
Re: All Linux Server CPU Spike at same time
Posted: Fri Mar 10, 2017 8:02 pm
by kwhogster
[nagios@tgcs018 /]$ /usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
OK - load average: 0.00, 0.01, 0.05|load1=0.000;15.000;30.000;0; load5=0.010;10.000;25.000;0; load15=0.050;5.000;20.000;0;
[nagios@tgcs018 /]$ /usr/local/nagios/libexec/check_nrpe -H 127.0.0.1 -n
CHECK_NRPE: Error receiving data from daemon.
[nagios@tgcs018 /]$ /usr/local/nagios/libexec/check_nrpe -H 127.0.0.1 -n -c check_load
CHECK_NRPE: Error receiving data from daemon.
It is still doing this on the remote Linux box.
Re: All Linux Server CPU Spike at same time
Posted: Mon Mar 13, 2017 11:40 am
by mcapra
Can you run those commands again:
Code: Select all
/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
/usr/local/nagios/libexec/check_nrpe -H 127.0.0.1 -n
/usr/local/nagios/libexec/check_nrpe -H 127.0.0.1 -n -c check_load
Then shortly after, share the output of:
Code: Select all
tail -n 200 /var/log/messages | grep nrpe
Re: All Linux Server CPU Spike at same time
Posted: Mon Mar 13, 2017 8:32 pm
by kwhogster
Code: Select all
[root@tgcs018 /]# cd /usr/local/nagios/libexec
[root@tgcs018 libexec]# check_load -w 15,10,5 -c 30,25,20
-bash: check_load: command not found
[root@tgcs018 libexec]# ls
check_apt check_jabber check_services
check_asterisk.pl check_load check_simap
check_asterisk_sip_peers.sh check_log check_sip
check_breeze check_mailq check_smtp
check_by_ssh check_mrtg check_spop
check_clamd check_mrtgtraf check_ssh
check_cluster check_nagios check_ssmtp
check_cpu_stats.sh check_netstat.pl check_swap
check_dhcp check_nntp check_tcp
check_dig check_nntps check_time
check_disk check_nrpe check_udp
check_disk_smb check_nt check_ups
check_dns check_ntp check_uptime
check_dummy check_ntp_peer check_users
check_file_age check_ntp_time check_wave
check_flexlm check_nwstat check_yum
check_ftp check_open_files.pl custom_check_mem
check_http check_oracle custom_check_procs
check_icmp check_overcr nagisk.pl
check_ide_smart check_ping negate
check_ifoperstatus check_pop send_nsca
check_ifstatus check_procs urlize
check_imap check_real utils.pm
check_init_service check_rpc utils.sh
check_ircd check_sensors
[root@tgcs018 libexec]# ./check_load -w 15,10,5 -c 30,25,20
OK - load average: 0.05, 0.03, 0.05|load1=0.050;15.000;30.000;0; load5=0.030;10.000;25.000;0; load15=0.050;5.000;20.000;0;
[root@tgcs018 libexec]# ./check_nrpe -H 127.0.0.1 -n
CHECK_NRPE: Error receiving data from daemon.
[root@tgcs018 libexec]# ./check_nrpe -H 127.0.0.1 -n -c check_load
CHECK_NRPE: Error receiving data from daemon.
[root@tgcs018 libexec]# tail -n 200 /var/log/messages | grep nrpe
Mar 13 19:01:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=8273 duration=0(sec)
Mar 13 19:02:25 tgcs018 xinetd[883]: START: nrpe pid=8446 from=::ffff:10.2.8.79
Mar 13 19:02:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=8446 duration=0(sec)
Mar 13 19:03:25 tgcs018 xinetd[883]: START: nrpe pid=8619 from=::ffff:10.2.8.79
Mar 13 19:03:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=8619 duration=0(sec)
Mar 13 19:04:25 tgcs018 xinetd[883]: START: nrpe pid=8790 from=::ffff:10.2.8.79
Mar 13 19:04:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=8790 duration=0(sec)
Mar 13 19:05:25 tgcs018 xinetd[883]: START: nrpe pid=8961 from=::ffff:10.2.8.79
Mar 13 19:05:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=8961 duration=0(sec)
Mar 13 19:06:25 tgcs018 xinetd[883]: START: nrpe pid=9132 from=::ffff:10.2.8.79
Mar 13 19:06:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=9132 duration=0(sec)
Mar 13 19:07:25 tgcs018 xinetd[883]: START: nrpe pid=9303 from=::ffff:10.2.8.79
Mar 13 19:07:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=9303 duration=0(sec)
Mar 13 19:08:25 tgcs018 xinetd[883]: START: nrpe pid=9474 from=::ffff:10.2.8.79
Mar 13 19:08:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=9474 duration=0(sec)
Mar 13 19:09:25 tgcs018 xinetd[883]: START: nrpe pid=9657 from=::ffff:10.2.8.79
Mar 13 19:09:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=9657 duration=0(sec)
Mar 13 19:10:25 tgcs018 xinetd[883]: START: nrpe pid=9834 from=::ffff:10.2.8.79
Mar 13 19:10:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=9834 duration=1(sec)
Mar 13 19:11:25 tgcs018 xinetd[883]: START: nrpe pid=10005 from=::ffff:10.2.8.79
Mar 13 19:11:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=10005 duration=1(sec)
Mar 13 19:12:26 tgcs018 xinetd[883]: START: nrpe pid=10176 from=::ffff:10.2.8.79
Mar 13 19:12:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=10176 duration=0(sec)
Mar 13 19:13:26 tgcs018 xinetd[883]: START: nrpe pid=10347 from=::ffff:10.2.8.79
Mar 13 19:13:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=10347 duration=0(sec)
Mar 13 19:14:26 tgcs018 xinetd[883]: START: nrpe pid=10518 from=::ffff:10.2.8.79
Mar 13 19:14:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=10518 duration=0(sec)
Mar 13 19:15:25 tgcs018 xinetd[883]: START: nrpe pid=10689 from=::ffff:10.2.8.79
Mar 13 19:15:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=10689 duration=0(sec)
Mar 13 19:16:25 tgcs018 xinetd[883]: START: nrpe pid=10860 from=::ffff:10.2.8.79
Mar 13 19:16:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=10860 duration=0(sec)
Mar 13 19:17:25 tgcs018 xinetd[883]: START: nrpe pid=11031 from=::ffff:10.2.8.79
Mar 13 19:17:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=11031 duration=0(sec)
Mar 13 19:18:25 tgcs018 xinetd[883]: START: nrpe pid=11202 from=::ffff:10.2.8.79
Mar 13 19:18:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=11202 duration=0(sec)
Mar 13 19:19:25 tgcs018 xinetd[883]: START: nrpe pid=11373 from=::ffff:10.2.8.79
Mar 13 19:19:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=11373 duration=0(sec)
Mar 13 19:20:25 tgcs018 xinetd[883]: START: nrpe pid=11549 from=::ffff:10.2.8.79
Mar 13 19:20:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=11549 duration=0(sec)
Mar 13 19:21:25 tgcs018 xinetd[883]: START: nrpe pid=11720 from=::ffff:10.2.8.79
Mar 13 19:21:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=11720 duration=0(sec)
Mar 13 19:22:25 tgcs018 xinetd[883]: START: nrpe pid=11891 from=::ffff:10.2.8.79
Mar 13 19:22:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=11891 duration=0(sec)
Mar 13 19:23:25 tgcs018 xinetd[883]: START: nrpe pid=12068 from=::ffff:10.2.8.79
Mar 13 19:23:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=12068 duration=0(sec)
Mar 13 19:24:25 tgcs018 xinetd[883]: START: nrpe pid=12240 from=::ffff:10.2.8.79
Mar 13 19:24:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=12240 duration=1(sec)
Mar 13 19:25:26 tgcs018 xinetd[883]: START: nrpe pid=12411 from=::ffff:10.2.8.79
Mar 13 19:25:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=12411 duration=0(sec)
Mar 13 19:26:26 tgcs018 xinetd[883]: START: nrpe pid=12582 from=::ffff:10.2.8.79
Mar 13 19:26:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=12582 duration=0(sec)
Mar 13 19:27:26 tgcs018 xinetd[883]: START: nrpe pid=12753 from=::ffff:10.2.8.79
Mar 13 19:27:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=12753 duration=0(sec)
Mar 13 19:28:26 tgcs018 xinetd[883]: START: nrpe pid=12924 from=::ffff:10.2.8.79
Mar 13 19:28:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=12924 duration=0(sec)
Mar 13 19:29:26 tgcs018 xinetd[883]: START: nrpe pid=13112 from=::ffff:10.2.8.79
Mar 13 19:29:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=13112 duration=0(sec)
Mar 13 19:30:26 tgcs018 xinetd[883]: START: nrpe pid=13291 from=::ffff:10.2.8.79
Mar 13 19:30:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=13291 duration=0(sec)
Mar 13 19:31:26 tgcs018 xinetd[883]: START: nrpe pid=13465 from=::ffff:10.2.8.79
Mar 13 19:31:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=13465 duration=0(sec)
Mar 13 19:31:27 tgcs018 xinetd[883]: START: nrpe pid=13509 from=::ffff:127.0.0.1
Mar 13 19:31:27 tgcs018 xinetd[13509]: FAIL: nrpe address from=::ffff:127.0.0.1
Mar 13 19:31:27 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=13509 duration=0(sec)
Mar 13 19:31:44 tgcs018 xinetd[883]: START: nrpe pid=13541 from=::ffff:127.0.0.1
Mar 13 19:31:44 tgcs018 xinetd[13541]: FAIL: nrpe address from=::ffff:127.0.0.1
Mar 13 19:31:44 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=13541 duration=0(sec)
[root@tgcs018 libexec]#
Does this help?
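The log above is long, and the lines that matter (the FAIL entries for 127.0.0.1 at 19:31) are easy to miss among the START/EXIT pairs. Here is a minimal sketch for pulling out only the connections xinetd rejected, assuming the log format shown above. The function name rejected_sources is made up for illustration.

```shell
# Hypothetical helper: read syslog lines on stdin and print, for each source
# address that xinetd refused for the nrpe service, "<count> <address>".
rejected_sources() {
    awk '/xinetd\[[0-9]+\]: FAIL: nrpe address/ {
             sub(/^.*from=/, "")   # keep only the address after "from="
             count[$0]++
         }
         END { for (addr in count) print count[addr], addr }'
}

# Usage against the same file used in this thread:
# rejected_sources < /var/log/messages
```

Run against the excerpt above, this would report two rejections from ::ffff:127.0.0.1 and none from 10.2.8.79, which suggests the refusal is happening in xinetd's address check rather than in the plugin itself.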
Re: All Linux Server CPU Spike at same time
Posted: Tue Mar 14, 2017 4:01 pm
by tgriep
It looks like the NRPE agent is being run by xinetd and not in daemon mode, so can you edit the following file:
Comment out this line, like the example below:
Save the file and restart xinetd by running:
This will allow any server to connect to the NRPE agent.
Then run these commands and post the output.
Code: Select all
/usr/local/nagios/libexec/check_nrpe -H 127.0.0.1
/usr/local/nagios/libexec/check_nrpe
/usr/local/nagios/bin/nrpe
Then try running this from the Nagios server to see if the changes worked. Replace xxx.xxx.xxx.xxx with the IP address of the remote system.
Code: Select all
/usr/local/nagios/libexec/check_nrpe -H xxx.xxx.xxx.xxx -c check_load
Re: All Linux Server CPU Spike at same time
Posted: Tue Mar 14, 2017 4:03 pm
by ssax
EDIT - Try tgriep's solution first.
We can turn on NRPE debugging to collect more information.
On the remote machine (not the nagios server), edit the file:
Change:
To:
Then restart xinetd:
Now we need to add an option to the rsyslog server so it processes debug messages. Edit this file:
Find the /var/log/messages entry. The line in the config file will look like:
Code: Select all
*.info;mail.none;authpriv.none;cron.none /var/log/messages
We need to add daemon.debug to the line so that it reads:
Code: Select all
*.info;mail.none;authpriv.none;cron.none;daemon.debug /var/log/messages
Then restart rsyslog:
Now there should be more information logged in /var/log/messages.
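If you would rather not edit the selector line by hand, the change above can be sketched as a shell filter. This is only an illustration: the function name add_daemon_debug is hypothetical, it works on text from stdin rather than on any particular config file, and it assumes the /var/log/messages rule is a single uncommented line like the one shown above.

```shell
# Hypothetical helper: append ";daemon.debug" to the selector of the rule
# that writes to /var/log/messages. Lines that already mention daemon.debug,
# commented lines, and all other rules are passed through unchanged.
add_daemon_debug() {
    sed -E '/daemon\.debug/! s@^([^#[:space:]][^[:space:]]*)([[:space:]]+/var/log/messages)[[:space:]]*$@\1;daemon.debug\2@'
}

# Example with the line from this post:
printf '%s\n' '*.info;mail.none;authpriv.none;cron.none /var/log/messages' | add_daemon_debug
# prints: *.info;mail.none;authpriv.none;cron.none;daemon.debug /var/log/messages
```

Running the filter a second time leaves the line alone, so it is safe to reapply. Remember to remove daemon.debug again once debugging is finished, since it makes /var/log/messages noticeably noisier.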
From your Nagios server, execute this command:
- Change YOURREMOTEHOST to the IP or DNS name of your remote host
Code: Select all
/usr/local/nagios/libexec/check_nrpe -H YOURREMOTEHOST
Then from the remote machine, please run this command and send us the output:
Thank you