All Linux Server CPU Spike at same time

mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: All Linux Server CPU Spike at same time

Post by mcapra »

From 10.2.8.7, can you share the output of these commands:

Code: Select all

ls -al /usr/local/nagios/libexec/
ps aux | grep xinetd
ps aux | grep nrpe
cat /usr/local/nagios/etc/nrpe.cfg
Former Nagios employee
https://www.mcapra.com/
kwhogster
Posts: 644
Joined: Wed Oct 14, 2015 6:51 pm
Location: Wood Ridge NJ USA

Re: All Linux Server CPU Spike at same time

Post by kwhogster »

See attached file
Attachments
CPU Spike.txt
CPU Issue
(12.55 KiB) Downloaded 403 times
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: All Linux Server CPU Spike at same time

Post by mcapra »

I notice in your nrpe.cfg that the hosts are not comma-delimited:

Code: Select all

allowed_hosts=127.0.0.1 10.2.8.79
Not sure if that's causing these problems, but I would throw a comma in between those 2 IPs and restart the nrpe service.
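
To spot this on other hosts, here is a minimal sketch that flags an allowed_hosts line whose entries are separated by whitespace rather than commas. The function name check_allowed_hosts is made up for illustration, and the demo runs against a throwaway file rather than the real nrpe.cfg. It flags any whitespace in the value, so an entry like "a, b" would also be reported.

```shell
#!/bin/sh
# Hypothetical helper: report whether each allowed_hosts line in an
# nrpe.cfg-style file uses commas (OK) or contains whitespace (BAD).
check_allowed_hosts() {
    # $1 = path to the config file to inspect
    grep -E '^[[:space:]]*allowed_hosts[[:space:]]*=' "$1" | while IFS= read -r line; do
        # Keep only the value after the first "=" and trim surrounding whitespace
        value=${line#*=}
        value=$(printf '%s' "$value" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')
        if printf '%s' "$value" | grep -q '[[:space:]]'; then
            echo "BAD: $line"
        else
            echo "OK: $line"
        fi
    done
}

# Demo on a temporary file using the two forms of the line from this thread
tmp=$(mktemp)
printf 'allowed_hosts=127.0.0.1 10.2.8.79\n' > "$tmp"
result_bad=$(check_allowed_hosts "$tmp")
printf 'allowed_hosts=127.0.0.1,10.2.8.79\n' > "$tmp"
result_ok=$(check_allowed_hosts "$tmp")
rm -f "$tmp"
echo "$result_bad"
echo "$result_ok"
```

On a real host you would call it as check_allowed_hosts /usr/local/nagios/etc/nrpe.cfg.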

From the previous remote machine (not your Nagios Core machine), can you run the following commands and share their outputs:

Code: Select all

su nagios
/usr/lib/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
ls -al /usr/lib/nagios/plugins/
/usr/lib/nagios/plugins/check_nrpe -H 127.0.0.1
/usr/lib/nagios/plugins/check_nrpe -H 127.0.0.1 -c check_load
kwhogster
Posts: 644
Joined: Wed Oct 14, 2015 6:51 pm
Location: Wood Ridge NJ USA

Re: All Linux Server CPU Spike at same time

Post by kwhogster »

I made this change on my Nagios Core server, from

allowed_hosts=127.0.0.1 10.2.8.79

to

allowed_hosts=127.0.0.1,10.2.8.79

and restarted the NRPE service.

Then, on one of the remote Linux hosts, I found the plugins and ran the commands. The results are in the attached file.

The check_nrpe commands failed.

Do I need to modify the nrpe.cfg on all the Linux boxes?
Attachments
CPU Spike 2.txt
Spike CPU 2
(5.58 KiB) Downloaded 387 times
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: All Linux Server CPU Spike at same time

Post by mcapra »

Does cutting SSL out of the equation fix things? Try again with the -n argument:

Code: Select all

/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
/usr/local/nagios/libexec/check_nrpe -H 127.0.0.1 -n
/usr/local/nagios/libexec/check_nrpe -H 127.0.0.1 -n -c check_load
kwhogster
Posts: 644
Joined: Wed Oct 14, 2015 6:51 pm
Location: Wood Ridge NJ USA

Re: All Linux Server CPU Spike at same time

Post by kwhogster »

[nagios@tgcs018 /]$ /usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
OK - load average: 0.00, 0.01, 0.05|load1=0.000;15.000;30.000;0; load5=0.010;10.000;25.000;0; load15=0.050;5.000;20.000;0;
[nagios@tgcs018 /]$ /usr/local/nagios/libexec/check_nrpe -H 127.0.0.1 -n
CHECK_NRPE: Error receiving data from daemon.
[nagios@tgcs018 /]$ /usr/local/nagios/libexec/check_nrpe -H 127.0.0.1 -n -c check_load
CHECK_NRPE: Error receiving data from daemon.

The remote Linux box is still doing this.
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: All Linux Server CPU Spike at same time

Post by mcapra »

Can you run those commands again:

Code: Select all

/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
/usr/local/nagios/libexec/check_nrpe -H 127.0.0.1 -n
/usr/local/nagios/libexec/check_nrpe -H 127.0.0.1 -n -c check_load
Then shortly after, share the output of:

Code: Select all

tail -n 200 /var/log/messages | grep nrpe
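
If that output is long, a rough sketch like the one below can narrow it to connections that xinetd rejected. It assumes NRPE is started by xinetd and that xinetd writes "FAIL: nrpe" lines to syslog in the format that appears later in this thread. The function name nrpe_rejections is made up for illustration, and the sample lines are copied from the log posted in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the source address of every nrpe connection
# that xinetd rejected, one per line.
nrpe_rejections() {
    # $1 = syslog-style file to scan
    grep 'FAIL: nrpe' "$1" | sed 's/.*from=//'
}

# Demo on a temporary file holding a few lines in xinetd's log format
log=$(mktemp)
cat > "$log" <<'EOF'
Mar 13 19:31:26 tgcs018 xinetd[883]: START: nrpe pid=13465 from=::ffff:10.2.8.79
Mar 13 19:31:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=13465 duration=0(sec)
Mar 13 19:31:27 tgcs018 xinetd[883]: START: nrpe pid=13509 from=::ffff:127.0.0.1
Mar 13 19:31:27 tgcs018 xinetd[13509]: FAIL: nrpe address from=::ffff:127.0.0.1
EOF
rejected=$(nrpe_rejections "$log")
rm -f "$log"
echo "$rejected"
```

On the remote host you would run it as nrpe_rejections /var/log/messages.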
kwhogster
Posts: 644
Joined: Wed Oct 14, 2015 6:51 pm
Location: Wood Ridge NJ USA

Re: All Linux Server CPU Spike at same time

Post by kwhogster »

Code: Select all

[root@tgcs018 /]# cd /usr/local/nagios/libexec
[root@tgcs018 libexec]# check_load -w 15,10,5 -c 30,25,20
-bash: check_load: command not found
[root@tgcs018 libexec]# ls
check_apt                    check_jabber         check_services
check_asterisk.pl            check_load           check_simap
check_asterisk_sip_peers.sh  check_log            check_sip
check_breeze                 check_mailq          check_smtp
check_by_ssh                 check_mrtg           check_spop
check_clamd                  check_mrtgtraf       check_ssh
check_cluster                check_nagios         check_ssmtp
check_cpu_stats.sh           check_netstat.pl     check_swap
check_dhcp                   check_nntp           check_tcp
check_dig                    check_nntps          check_time
check_disk                   check_nrpe           check_udp
check_disk_smb               check_nt             check_ups
check_dns                    check_ntp            check_uptime
check_dummy                  check_ntp_peer       check_users
check_file_age               check_ntp_time       check_wave
check_flexlm                 check_nwstat         check_yum
check_ftp                    check_open_files.pl  custom_check_mem
check_http                   check_oracle         custom_check_procs
check_icmp                   check_overcr         nagisk.pl
check_ide_smart              check_ping           negate
check_ifoperstatus           check_pop            send_nsca
check_ifstatus               check_procs          urlize
check_imap                   check_real           utils.pm
check_init_service           check_rpc            utils.sh
check_ircd                   check_sensors
[root@tgcs018 libexec]# ./check_load -w 15,10,5 -c 30,25,20
OK - load average: 0.05, 0.03, 0.05|load1=0.050;15.000;30.000;0; load5=0.030;10.000;25.000;0; load15=0.050;5.000;20.000;0;
[root@tgcs018 libexec]# ./check_nrpe -H 127.0.0.1 -n
CHECK_NRPE: Error receiving data from daemon.
[root@tgcs018 libexec]# ./check_nrpe -H 127.0.0.1 -n -c check_load
CHECK_NRPE: Error receiving data from daemon.
[root@tgcs018 libexec]# tail -n 200 /var/log/messages | grep nrpe
Mar 13 19:01:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=8273 duration=0(sec)
Mar 13 19:02:25 tgcs018 xinetd[883]: START: nrpe pid=8446 from=::ffff:10.2.8.79
Mar 13 19:02:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=8446 duration=0(sec)
Mar 13 19:03:25 tgcs018 xinetd[883]: START: nrpe pid=8619 from=::ffff:10.2.8.79
Mar 13 19:03:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=8619 duration=0(sec)
Mar 13 19:04:25 tgcs018 xinetd[883]: START: nrpe pid=8790 from=::ffff:10.2.8.79
Mar 13 19:04:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=8790 duration=0(sec)
Mar 13 19:05:25 tgcs018 xinetd[883]: START: nrpe pid=8961 from=::ffff:10.2.8.79
Mar 13 19:05:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=8961 duration=0(sec)
Mar 13 19:06:25 tgcs018 xinetd[883]: START: nrpe pid=9132 from=::ffff:10.2.8.79
Mar 13 19:06:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=9132 duration=0(sec)
Mar 13 19:07:25 tgcs018 xinetd[883]: START: nrpe pid=9303 from=::ffff:10.2.8.79
Mar 13 19:07:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=9303 duration=0(sec)
Mar 13 19:08:25 tgcs018 xinetd[883]: START: nrpe pid=9474 from=::ffff:10.2.8.79
Mar 13 19:08:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=9474 duration=0(sec)
Mar 13 19:09:25 tgcs018 xinetd[883]: START: nrpe pid=9657 from=::ffff:10.2.8.79
Mar 13 19:09:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=9657 duration=0(sec)
Mar 13 19:10:25 tgcs018 xinetd[883]: START: nrpe pid=9834 from=::ffff:10.2.8.79
Mar 13 19:10:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=9834 duration=1(sec)
Mar 13 19:11:25 tgcs018 xinetd[883]: START: nrpe pid=10005 from=::ffff:10.2.8.79
Mar 13 19:11:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=10005 duration=1(sec)
Mar 13 19:12:26 tgcs018 xinetd[883]: START: nrpe pid=10176 from=::ffff:10.2.8.79
Mar 13 19:12:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=10176 duration=0(sec)
Mar 13 19:13:26 tgcs018 xinetd[883]: START: nrpe pid=10347 from=::ffff:10.2.8.79
Mar 13 19:13:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=10347 duration=0(sec)
Mar 13 19:14:26 tgcs018 xinetd[883]: START: nrpe pid=10518 from=::ffff:10.2.8.79
Mar 13 19:14:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=10518 duration=0(sec)
Mar 13 19:15:25 tgcs018 xinetd[883]: START: nrpe pid=10689 from=::ffff:10.2.8.79
Mar 13 19:15:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=10689 duration=0(sec)
Mar 13 19:16:25 tgcs018 xinetd[883]: START: nrpe pid=10860 from=::ffff:10.2.8.79
Mar 13 19:16:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=10860 duration=0(sec)
Mar 13 19:17:25 tgcs018 xinetd[883]: START: nrpe pid=11031 from=::ffff:10.2.8.79
Mar 13 19:17:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=11031 duration=0(sec)
Mar 13 19:18:25 tgcs018 xinetd[883]: START: nrpe pid=11202 from=::ffff:10.2.8.79
Mar 13 19:18:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=11202 duration=0(sec)
Mar 13 19:19:25 tgcs018 xinetd[883]: START: nrpe pid=11373 from=::ffff:10.2.8.79
Mar 13 19:19:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=11373 duration=0(sec)
Mar 13 19:20:25 tgcs018 xinetd[883]: START: nrpe pid=11549 from=::ffff:10.2.8.79
Mar 13 19:20:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=11549 duration=0(sec)
Mar 13 19:21:25 tgcs018 xinetd[883]: START: nrpe pid=11720 from=::ffff:10.2.8.79
Mar 13 19:21:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=11720 duration=0(sec)
Mar 13 19:22:25 tgcs018 xinetd[883]: START: nrpe pid=11891 from=::ffff:10.2.8.79
Mar 13 19:22:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=11891 duration=0(sec)
Mar 13 19:23:25 tgcs018 xinetd[883]: START: nrpe pid=12068 from=::ffff:10.2.8.79
Mar 13 19:23:25 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=12068 duration=0(sec)
Mar 13 19:24:25 tgcs018 xinetd[883]: START: nrpe pid=12240 from=::ffff:10.2.8.79
Mar 13 19:24:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=12240 duration=1(sec)
Mar 13 19:25:26 tgcs018 xinetd[883]: START: nrpe pid=12411 from=::ffff:10.2.8.79
Mar 13 19:25:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=12411 duration=0(sec)
Mar 13 19:26:26 tgcs018 xinetd[883]: START: nrpe pid=12582 from=::ffff:10.2.8.79
Mar 13 19:26:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=12582 duration=0(sec)
Mar 13 19:27:26 tgcs018 xinetd[883]: START: nrpe pid=12753 from=::ffff:10.2.8.79
Mar 13 19:27:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=12753 duration=0(sec)
Mar 13 19:28:26 tgcs018 xinetd[883]: START: nrpe pid=12924 from=::ffff:10.2.8.79
Mar 13 19:28:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=12924 duration=0(sec)
Mar 13 19:29:26 tgcs018 xinetd[883]: START: nrpe pid=13112 from=::ffff:10.2.8.79
Mar 13 19:29:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=13112 duration=0(sec)
Mar 13 19:30:26 tgcs018 xinetd[883]: START: nrpe pid=13291 from=::ffff:10.2.8.79
Mar 13 19:30:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=13291 duration=0(sec)
Mar 13 19:31:26 tgcs018 xinetd[883]: START: nrpe pid=13465 from=::ffff:10.2.8.79
Mar 13 19:31:26 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=13465 duration=0(sec)
Mar 13 19:31:27 tgcs018 xinetd[883]: START: nrpe pid=13509 from=::ffff:127.0.0.1
Mar 13 19:31:27 tgcs018 xinetd[13509]: FAIL: nrpe address from=::ffff:127.0.0.1
Mar 13 19:31:27 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=13509 duration=0(sec)
Mar 13 19:31:44 tgcs018 xinetd[883]: START: nrpe pid=13541 from=::ffff:127.0.0.1
Mar 13 19:31:44 tgcs018 xinetd[13541]: FAIL: nrpe address from=::ffff:127.0.0.1
Mar 13 19:31:44 tgcs018 xinetd[883]: EXIT: nrpe status=0 pid=13541 duration=0(sec)
[root@tgcs018 libexec]#

Does this help?
Last edited by tmcdonald on Tue Mar 14, 2017 1:43 pm, edited 1 time in total.
Reason: Please use [code][/code] tags around long output
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: All Linux Server CPU Spike at same time

Post by tgriep »

It looks like the NRPE agent is being run by xinetd and not in daemon mode, so can you edit the following file:

Code: Select all

/etc/xinetd.d/nrpe
Comment out this line, as in the example below:

Code: Select all

#       only_from       = 127.0.0.1
Save the file and restart xinetd by running:

Code: Select all

service xinetd restart
This will allow any server to connect to the NRPE agent.
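
If leaving the agent open to every host is not acceptable once testing is done, one alternative is to widen only_from instead of removing it. The sketch below assumes the standard xinetd syntax in which only_from takes a space-separated list of addresses. The function name set_only_from is made up for illustration, the sample stanza contains only the line shown above (the real file has more attributes and may list a different address), and the edit is applied to a throwaway copy rather than to /etc/xinetd.d/nrpe.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the only_from line of an xinetd service file
# and print the result to stdout (the input file is not modified).
set_only_from() {
    # $1 = xinetd service file, $2 = space-separated list of allowed addresses
    sed "s/^\([[:space:]]*only_from[[:space:]]*=\).*/\1 $2/" "$1"
}

# Demo on a temporary, stripped-down stanza
sample=$(mktemp)
cat > "$sample" <<'EOF'
service nrpe
{
        only_from       = 127.0.0.1
}
EOF
updated=$(set_only_from "$sample" "127.0.0.1 10.2.8.79")
rm -f "$sample"
echo "$updated"
```

Here 10.2.8.79 is the Nagios server address seen in the log earlier in the thread. After changing the real file, xinetd still needs the restart shown above.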

Then run these commands and post the output.

Code: Select all

/usr/local/nagios/libexec/check_nrpe -H 127.0.0.1
/usr/local/nagios/libexec/check_nrpe
/usr/local/nagios/bin/nrpe
Then try running this from the Nagios server to see if the changes worked. Replace xxx.xxx.xxx.xxx with the IP address of the remote system.

Code: Select all

/usr/local/nagios/libexec/check_nrpe -H xxx.xxx.xxx.xxx -c check_load
Last edited by dwhitfield on Mon Mar 20, 2017 7:29 pm, edited 1 time in total.
Reason: ant to any
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: All Linux Server CPU Spike at same time

Post by ssax »

EDIT - Try tgriep's solution first.

We can turn on NRPE debugging to collect more information.

On the remote machine (not the nagios server), edit the file:

Code: Select all

/usr/local/nagios/etc/nrpe.cfg
Change:

Code: Select all

debug=0
To:

Code: Select all

debug=1
Then restart xinetd:

Code: Select all

service xinetd restart
Now we need to add an option to the rsyslog configuration so that it processes debug messages. Edit this file:

Code: Select all

/etc/rsyslog.conf
Find the rule that writes to /var/log/messages. The line in the config file will look like:

Code: Select all

*.info;mail.none;authpriv.none;cron.none /var/log/messages
Append ;daemon.debug to the selector list so the line becomes:

Code: Select all

*.info;mail.none;authpriv.none;cron.none;daemon.debug /var/log/messages
Then restart rsyslog:

Code: Select all

service rsyslog restart
Now there should be more information logged in /var/log/messages.
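
If you need to repeat the rsyslog change on several hosts, the edit can be scripted. This is only a sketch: the function name add_daemon_debug is made up for illustration, it assumes the rule is a single line ending in /var/log/messages as shown above, and the demo works on a throwaway file rather than the live config. It skips lines that already contain daemon.debug, so running it twice changes nothing.

```shell
#!/bin/sh
# Hypothetical helper: print the config with ";daemon.debug" appended to the
# selector list of the /var/log/messages rule (the input file is not modified).
add_daemon_debug() {
    # $1 = rsyslog config file
    sed '/daemon\.debug/!s|^\([^#[:space:]][^[:space:]]*\)\([[:space:]][[:space:]]*/var/log/messages\)$|\1;daemon.debug\2|' "$1"
}

# Demo on a temporary file containing the rule quoted in this post
conf=$(mktemp)
printf '%s\n' '*.info;mail.none;authpriv.none;cron.none /var/log/messages' > "$conf"
once=$(add_daemon_debug "$conf")
# Feed the result back in to confirm a second pass leaves it unchanged
printf '%s\n' "$once" > "$conf"
twice=$(add_daemon_debug "$conf")
rm -f "$conf"
echo "$once"
```

Review the output before copying it over the real file, then restart rsyslog as shown above.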

From your Nagios server, run this command, replacing YOURREMOTEHOST with the IP or DNS name of your remote host:

Code: Select all

/usr/local/nagios/libexec/check_nrpe -H YOURREMOTEHOST
Then from the remote machine, please run this command and send us the output:

Code: Select all

tail -n 100 /var/log/messages
Thank you