Re: All Linux Server CPU Spike at same time

Posted: Wed Mar 08, 2017 3:58 pm
by mcapra
From 10.2.8.7, can you share the output of these commands:

Code: Select all

ls -al /usr/local/nagios/libexec/
ps aux | grep xinetd
ps aux | grep nrpe
cat /usr/local/nagios/etc/nrpe.cfg

Re: All Linux Server CPU Spike at same time

Posted: Wed Mar 08, 2017 9:35 pm
by kwhogster
See attached file

Re: All Linux Server CPU Spike at same time

Posted: Thu Mar 09, 2017 1:24 pm
by mcapra
I notice in your nrpe.cfg that the hosts are not comma-delimited:

Code: Select all

allowed_hosts=127.0.0.1 10.2.8.79
Not sure if that's causing these problems, but I would throw a comma in between those 2 IPs and restart the nrpe service.
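
If you would rather script that edit than do it by hand, here is a minimal sketch. It works on a scratch copy created with mktemp (the scratch file and variable names are only for illustration), so nothing touches the real nrpe.cfg until you have looked at the result.

```shell
# Sketch: rewrite a space-separated allowed_hosts line as comma-separated.
# Uses a scratch file; review the output before applying it to the real nrpe.cfg.
cfg=$(mktemp)
printf 'allowed_hosts=127.0.0.1 10.2.8.79\n' > "$cfg"

# On the allowed_hosts line only: drop trailing whitespace,
# then turn each remaining run of whitespace into a single comma.
sed -e '/^allowed_hosts=/ s/[[:space:]]*$//' \
    -e '/^allowed_hosts=/ s/[[:space:]]\{1,\}/,/g' "$cfg" > "$cfg.new"

fixed=$(grep '^allowed_hosts=' "$cfg.new")
echo "$fixed"
```

Only the allowed_hosts line is touched, so the rest of the file passes through unchanged.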

From the previous remote machine (not your Nagios Core machine), can you run the following commands and share their outputs:

Code: Select all

su nagios
/usr/lib/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
ls -al /usr/lib/nagios/plugins/
/usr/lib/nagios/plugins/check_nrpe -H 127.0.0.1
/usr/lib/nagios/plugins/check_nrpe -H 127.0.0.1 -c check_load

Re: All Linux Server CPU Spike at same time

Posted: Thu Mar 09, 2017 8:55 pm
by kwhogster
On my Nagios Core server, I changed this:

allowed_hosts=127.0.0.1 10.2.8.79

to this:

allowed_hosts=127.0.0.1,10.2.8.79

Restarted the NRPE service

On one of the remote Linux hosts, I located the plugins and ran the commands. The results are attached.

The check_nrpe commands failed.

Do I need to modify the nrpe.cfg on all the Linux boxes?

Re: All Linux Server CPU Spike at same time

Posted: Fri Mar 10, 2017 12:00 pm
by mcapra
Does cutting SSL out of the equation fix things? Using the -n argument:

Code: Select all

/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
/usr/local/nagios/libexec/check_nrpe -H 127.0.0.1 -n
/usr/local/nagios/libexec/check_nrpe -H 127.0.0.1 -n -c check_load

Re: All Linux Server CPU Spike at same time

Posted: Fri Mar 10, 2017 8:02 pm
by kwhogster
[nagios@tgcs018 /]$ /usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
OK - load average: 0.00, 0.01, 0.05|load1=0.000;15.000;30.000;0; load5=0.010;10.000;25.000;0; load15=0.050;5.000;20.000;0;
[nagios@tgcs018 /]$ /usr/local/nagios/libexec/check_nrpe -H 127.0.0.1 -n
CHECK_NRPE: Error receiving data from daemon.
[nagios@tgcs018 /]$ /usr/local/nagios/libexec/check_nrpe -H 127.0.0.1 -n -c check_load
CHECK_NRPE: Error receiving data from daemon.

It is still failing on the remote Linux box.

Re: All Linux Server CPU Spike at same time

Posted: Mon Mar 13, 2017 11:40 am
by mcapra
Can you run those commands again:

Code: Select all

/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
/usr/local/nagios/libexec/check_nrpe -H 127.0.0.1 -n
/usr/local/nagios/libexec/check_nrpe -H 127.0.0.1 -n -c check_load
Then shortly after, share the output of:

Code: Select all

tail -n 200 /var/log/messages | grep nrpe

Re: All Linux Server CPU Spike at same time

Posted: Mon Mar 13, 2017 8:32 pm
by kwhogster

Code: Select all

[root@tgcs018 /]# cd /usr/local/nagios/libexec
[root@tgcs018 libexec]# check_load -w 15,10,5 -c 30,25,20
-bash: check_load: command not found
[root@tgcs018 libexec]# ls
check_apt                    check_jabber         check_services
check_asterisk.pl            check_load           check_simap
check_asterisk_sip_peers.sh  check_log            check_sip
check_breeze                 check_mailq          check_smtp
check_by_ssh                 check_mrtg           check_spop
check_clamd                  check_mrtgtraf       check_ssh
check_cluster                check_nagios         check_ssmtp
check_cpu_stats.sh           check_netstat.pl     check_swap
check_dhcp                   check_nntp           check_tcp
check_dig                    check_nntps          check_time
check_disk                   check_nrpe           check_udp
check_disk_smb               check_nt             check_ups
check_dns                    check_ntp            check_uptime
check_dummy                  check_ntp_peer       check_users
check_file_age               check_ntp_time       check_wave
check_flexlm                 check_nwstat         check_yum
check_ftp                    check_open_files.pl  custom_check_mem
check_http                   check_oracle         custom_check_procs
check_icmp                   check_overcr         nagisk.pl
check_ide_smart              check_ping           negate
check_ifoperstatus           check_pop            send_nsca
check_ifstatus               check_procs          urlize
check_imap                   check_real           utils.pm
check_init_service           check_rpc            utils.sh
check_ircd                   check_sensors
[root@tgcs018 libexec]# ./check_load -w 15,10,5 -c 30,25,20
OK - load average: 0.05, 0.03, 0.05|load1=0.050;15.000;30.000;0; load5=0.030;10.000;25.000;0; load15=0.050;5.000;20.000;0;
[root@tgcs018 libexec]# ./check_nrpe -H 127.0.0.1 -n
CHECK_NRPE: Error receiving data from daemon.
[root@tgcs018 libexec]# ./check_nrpe -H 127.0.0.1 -n -c check_load
CHECK_NRPE: Error receiving data from daemon.
[root@tgcs018 libexec]# tail -n 200 /var/log/messages | grep nrpe
Mar 13 19:01:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=8273 duration=0(sec)
Mar 13 19:02:25 tgcs018 xinetd[883]: START: nrpe pid=8446 from=::ffff:10.2.8.79
Mar 13 19:02:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=8446 duration=0(sec)
Mar 13 19:03:25 tgcs018 xinetd[883]: START: nrpe pid=8619 from=::ffff:10.2.8.79
Mar 13 19:03:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=8619 duration=0(sec)
Mar 13 19:04:25 tgcs018 xinetd[883]: START: nrpe pid=8790 from=::ffff:10.2.8.79
Mar 13 19:04:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=8790 duration=0(sec)
Mar 13 19:05:25 tgcs018 xinetd[883]: START: nrpe pid=8961 from=::ffff:10.2.8.79
Mar 13 19:05:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=8961 duration=0(sec)
Mar 13 19:06:25 tgcs018 xinetd[883]: START: nrpe pid=9132 from=::ffff:10.2.8.79
Mar 13 19:06:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=9132 duration=0(sec)
Mar 13 19:07:25 tgcs018 xinetd[883]: START: nrpe pid=9303 from=::ffff:10.2.8.79
Mar 13 19:07:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=9303 duration=0(sec)
Mar 13 19:08:25 tgcs018 xinetd[883]: START: nrpe pid=9474 from=::ffff:10.2.8.79
Mar 13 19:08:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=9474 duration=0(sec)
Mar 13 19:09:25 tgcs018 xinetd[883]: START: nrpe pid=9657 from=::ffff:10.2.8.79
Mar 13 19:09:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=9657 duration=0(sec)
Mar 13 19:10:25 tgcs018 xinetd[883]: START: nrpe pid=9834 from=::ffff:10.2.8.79
Mar 13 19:10:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=9834 duration=1(sec)
Mar 13 19:11:25 tgcs018 xinetd[883]: START: nrpe pid=10005 from=::ffff:10.2.8.79
Mar 13 19:11:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=10005 duration=1(sec)
Mar 13 19:12:26 tgcs018 xinetd[883]: START: nrpe pid=10176 from=::ffff:10.2.8.79
Mar 13 19:12:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=10176 duration=0(sec)
Mar 13 19:13:26 tgcs018 xinetd[883]: START: nrpe pid=10347 from=::ffff:10.2.8.79
Mar 13 19:13:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=10347 duration=0(sec)
Mar 13 19:14:26 tgcs018 xinetd[883]: START: nrpe pid=10518 from=::ffff:10.2.8.79
Mar 13 19:14:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=10518 duration=0(sec)
Mar 13 19:15:25 tgcs018 xinetd[883]: START: nrpe pid=10689 from=::ffff:10.2.8.79
Mar 13 19:15:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=10689 duration=0(sec)
Mar 13 19:16:25 tgcs018 xinetd[883]: START: nrpe pid=10860 from=::ffff:10.2.8.79
Mar 13 19:16:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=10860 duration=0(sec)
Mar 13 19:17:25 tgcs018 xinetd[883]: START: nrpe pid=11031 from=::ffff:10.2.8.79
Mar 13 19:17:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=11031 duration=0(sec)
Mar 13 19:18:25 tgcs018 xinetd[883]: START: nrpe pid=11202 from=::ffff:10.2.8.79
Mar 13 19:18:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=11202 duration=0(sec)
Mar 13 19:19:25 tgcs018 xinetd[883]: START: nrpe pid=11373 from=::ffff:10.2.8.79
Mar 13 19:19:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=11373 duration=0(sec)
Mar 13 19:20:25 tgcs018 xinetd[883]: START: nrpe pid=11549 from=::ffff:10.2.8.79
Mar 13 19:20:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=11549 duration=0(sec)
Mar 13 19:21:25 tgcs018 xinetd[883]: START: nrpe pid=11720 from=::ffff:10.2.8.79
Mar 13 19:21:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=11720 duration=0(sec)
Mar 13 19:22:25 tgcs018 xinetd[883]: START: nrpe pid=11891 from=::ffff:10.2.8.79
Mar 13 19:22:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=11891 duration=0(sec)
Mar 13 19:23:25 tgcs018 xinetd[883]: START: nrpe pid=12068 from=::ffff:10.2.8.79
Mar 13 19:23:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=12068 duration=0(sec)
Mar 13 19:24:25 tgcs018 xinetd[883]: START: nrpe pid=12240 from=::ffff:10.2.8.79
Mar 13 19:24:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=12240 duration=1(sec)
Mar 13 19:25:26 tgcs018 xinetd[883]: START: nrpe pid=12411 from=::ffff:10.2.8.79
Mar 13 19:25:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=12411 duration=0(sec)
Mar 13 19:26:26 tgcs018 xinetd[883]: START: nrpe pid=12582 from=::ffff:10.2.8.79
Mar 13 19:26:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=12582 duration=0(sec)
Mar 13 19:27:26 tgcs018 xinetd[883]: START: nrpe pid=12753 from=::ffff:10.2.8.79
Mar 13 19:27:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=12753 duration=0(sec)
Mar 13 19:28:26 tgcs018 xinetd[883]: START: nrpe pid=12924 from=::ffff:10.2.8.79
Mar 13 19:28:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=12924 duration=0(sec)
Mar 13 19:29:26 tgcs018 xinetd[883]: START: nrpe pid=13112 from=::ffff:10.2.8.79
Mar 13 19:29:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=13112 duration=0(sec)
Mar 13 19:30:26 tgcs018 xinetd[883]: START: nrpe pid=13291 from=::ffff:10.2.8.79
Mar 13 19:30:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=13291 duration=0(sec)
Mar 13 19:31:26 tgcs018 xinetd[883]: START: nrpe pid=13465 from=::ffff:10.2.8.79
Mar 13 19:31:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=13465 duration=0(sec)
Mar 13 19:31:27 tgcs018 xinetd[883]: START: nrpe pid=13509 from=::ffff:127.0.0.1
Mar 13 19:31:27 tgcs018 xinetd[13509]: FAIL: nrpe address from=::ffff:127.0.0.1
Mar 13 19:31:27 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=13509 duration=0(sec)
Mar 13 19:31:44 tgcs018 xinetd[883]: START: nrpe pid=13541 from=::ffff:127.0.0.1
Mar 13 19:31:44 tgcs018 xinetd[13541]: FAIL: nrpe address from=::ffff:127.0.0.1
Mar 13 19:31:44 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=13541 duration=0(sec)
[root@tgcs018 libexec]#

Does this help?
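
For anyone reading along, the lines that stand out in that output are the two FAIL entries from 127.0.0.1 at the end. The once-a-minute connections from 10.2.8.79 start and exit without one. A rough sketch for pulling just the rejected sources out of a log like this follows; the sample lines are copied from the output above into a scratch file so the snippet runs on its own.

```shell
# Sketch: count rejected nrpe connections per source address.
# The sample lines are copied from the log output posted above.
log=$(mktemp)
cat > "$log" <<'EOF'
Mar 13 19:31:26 tgcs018 xinetd[883]: START: nrpe pid=13465 from=::ffff:10.2.8.79
Mar 13 19:31:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=13465 duration=0(sec)
Mar 13 19:31:27 tgcs018 xinetd[883]: START: nrpe pid=13509 from=::ffff:127.0.0.1
Mar 13 19:31:27 tgcs018 xinetd[13509]: FAIL: nrpe address from=::ffff:127.0.0.1
Mar 13 19:31:27 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=13509 duration=0(sec)
EOF

# Keep only FAIL lines, reduce each to its "from=" address,
# then print "address count" for every distinct address.
rejected=$(grep 'FAIL: nrpe' "$log" | sed 's/.*from=//' | sort | uniq -c | awk '{print $2, $1}')
echo "$rejected"
```

Against the real system you would point the grep at /var/log/messages instead of the scratch file.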

Re: All Linux Server CPU Spike at same time

Posted: Tue Mar 14, 2017 4:01 pm
by tgriep
It looks like the NRPE agent is being run by xinetd rather than in daemon mode. Can you edit the following file:

Code: Select all

/etc/xinetd.d/nrpe
Comment out the only_from line, as in the example below:

Code: Select all

#       only_from       = 127.0.0.1
Save the file and restart xinetd by running:

Code: Select all

service xinetd restart
This will allow any server to connect to the NRPE agent.
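
If leaving the agent open to every host is broader than you want once testing is done, a narrower option is to keep only_from and list the allowed sources, separated by spaces. The sketch below assumes the Nagios server is 10.2.8.79, as seen in the log output earlier in this thread. The file contents are a minimal stand-in for /etc/xinetd.d/nrpe, not the real file, and the edit is made on a scratch copy.

```shell
# Sketch: set only_from to an explicit list instead of commenting it out.
# The heredoc is a hypothetical, minimal stand-in for /etc/xinetd.d/nrpe.
conf=$(mktemp)
cat > "$conf" <<'EOF'
service nrpe
{
#       only_from       = 127.0.0.1
}
EOF

# Replace the only_from line (commented out or not) with localhost plus
# the assumed Nagios server address, space-separated as xinetd expects.
sed 's/^[[:space:]#]*only_from[[:space:]]*=.*/        only_from       = 127.0.0.1 10.2.8.79/' \
    "$conf" > "$conf.new"

# Print the resulting line with its whitespace normalized.
line=$(awk '/only_from/ {$1=$1; print}' "$conf.new")
echo "$line"
```

As with the change above, xinetd has to be restarted before a new only_from value takes effect.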

Then run these commands and post the output.

Code: Select all

/usr/local/nagios/libexec/check_nrpe -H 127.0.0.1
/usr/local/nagios/libexec/check_nrpe
/usr/local/nagios/bin/nrpe
Then try running this from the Nagios server to see if the changes worked. Replace xxx.xxx.xxx.xxx with the IP address of the remote system.

Code: Select all

/usr/local/nagios/libexec/check_nrpe -H xxx.xxx.xxx.xxx -c check_load

Re: All Linux Server CPU Spike at same time

Posted: Tue Mar 14, 2017 4:03 pm
by ssax
EDIT - Try tgriep's solution first.

We can turn on NRPE debugging to collect more information.

On the remote machine (not the nagios server), edit the file:

Code: Select all

/usr/local/nagios/etc/nrpe.cfg
Change:

Code: Select all

debug=0
To:

Code: Select all

debug=1
Then restart xinetd:

Code: Select all

service xinetd restart
Now we need to configure rsyslog so that it processes debug messages. Edit this file:

Code: Select all

/etc/rsyslog.conf
Find the rule for /var/log/messages. The line in the config file will look like:

Code: Select all

*.info;mail.none;authpriv.none;cron.none /var/log/messages
We need to append ;daemon.debug to the selector, so that the line reads:

Code: Select all

*.info;mail.none;authpriv.none;cron.none;daemon.debug /var/log/messages
Then restart rsyslog:

Code: Select all

service rsyslog restart
Now there should be more information logged in /var/log/messages.
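
The selector change can also be scripted. This is a minimal sketch that runs against a scratch file holding the example line from above, so you can check the result before touching the real rsyslog configuration file. It assumes the rule starts with *.info and writes to /var/log/messages, as in that example.

```shell
# Sketch: append ;daemon.debug to the selector of the /var/log/messages rule.
# Runs on a scratch file holding the example rule shown above.
conf=$(mktemp)
printf '*.info;mail.none;authpriv.none;cron.none /var/log/messages\n' > "$conf"

# Skip lines that already mention daemon.debug; otherwise insert it
# between the end of the selector and the whitespace before the log path.
sed '/daemon\.debug/! s|^\(\*\.info;[^[:space:]]*\)\([[:space:]]\{1,\}/var/log/messages\)|\1;daemon.debug\2|' \
    "$conf" > "$conf.new"

updated=$(cat "$conf.new")
echo "$updated"
```

Because lines already containing daemon.debug are skipped, running it a second time does not add the entry twice.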

From your nagios server execute this command:
- Change YOURREMOTEHOST to the IP or DNS name of your remote host

Code: Select all

/usr/local/nagios/libexec/check_nrpe -H YOURREMOTEHOST
Then from the remote machine, please run this command and send us the output:

Code: Select all

tail -n 100 /var/log/messages
Thank you