
All Linux Server CPU Spike at same time

Posted: Thu Mar 02, 2017 8:55 pm
by kwhogster
Nagios Core 4.1

Every day I get this on all my Linux servers:


Host | Service | Status | Last Check | Duration | Attempt | Status Information
RaspberryPi | Current Load | CRITICAL | 03-02-2017 20:46:26 | 0d 0h 5m 42s | 4/4 | CRITICAL - load average: 0.99, 7.56, 4.89
TGCS018 | Current Load | CRITICAL | 03-02-2017 20:46:30 | 0d 0h 5m 38s | 4/4 | CRITICAL - load average: 0.91, 7.44, 4.86
localhost | Current Load | CRITICAL | 03-02-2017 20:46:31 | 0d 0h 5m 37s | 4/4 | CRITICAL - load average: 0.91, 7.44, 4.86
vMA | Current Load | CRITICAL | 03-02-2017 20:46:36 | 0d 0h 5m 32s | 4/4 | CRITICAL - load average: 0.83, 7.31, 4.84

(Notifications have been disabled for all four hosts.)


The RaspberryPi is a physical device and the other three are VMs.

Are the Nagios checks causing this?

Any ideas why?


Thanks

Tom

Re: All Linux Server CPU Spike at same time

Posted: Thu Mar 02, 2017 10:51 pm
by rkennedy
kwhogster wrote: Are the Nagios checks causing this?

Any ideas why?
Doubtful. I would investigate the hosts' logs and your configuration setup. Judging by how similar the values are, you may be checking the load on the localhost across the board.

Re: All Linux Server CPU Spike at same time

Posted: Thu Mar 02, 2017 11:03 pm
by kwhogster
Which logs do you mean?

This is my service check config:

define service{
        use                             local-service         ; Name of service template to use
        host_name                       vMA
        service_description             Current Load
        servicegroups                   CPULoad
        check_command                   check_local_load!5.0,4.0,3.0!10.0,6.0,4.0
        }


They are all the same.

Here is a top extract from the local host:

top - 23:02:08 up 53 days, 10:28, 2 users, load average: 0.04, 0.08, 0.23
Tasks: 194 total, 1 running, 193 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.5 us, 0.6 sy, 0.0 ni, 98.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 8072680 total, 410528 free, 5597416 used, 2064736 buff/cache
KiB Swap: 8286204 total, 8263772 free, 22432 used. 2046844 avail Mem


Would it be better to run the Nagios server on a physical machine rather than a VM? Could that help?

Thanks

Re: All Linux Server CPU Spike at same time

Posted: Fri Mar 03, 2017 10:52 am
by tgriep
To find out why the load went up at that time, you need to check the log files on the remote Linux systems, not on the Nagios server.
Take a look at the log files in /var/log on the remote hosts. The messages file might have some clues about what is happening at that time.

Re: All Linux Server CPU Spike at same time

Posted: Fri Mar 03, 2017 10:55 am
by dwhitfield
In addition to what @tgriep said, can you run tar -zcvf /tmp/supporttar.tar.gz /usr/local/nagios/etc and attach the file? If you are concerned about security, you can PM it to me. If you choose to PM, please make sure you update the thread so it shows back up on our support dashboard.

Re: All Linux Server CPU Spike at same time

Posted: Fri Mar 03, 2017 10:25 pm
by kwhogster
I tried to send a PM but it is stuck in my outbox, so I am attaching the file here.

Note: the local host just went critical again, and when I ran top on it the numbers did not match. They were far lower than what Nagios was reporting.

Thoughts?

Also, which log file from /var/log?

Re: All Linux Server CPU Spike at same time

Posted: Mon Mar 06, 2017 1:01 pm
by tmcdonald
The command you are using (check_local_load) to check the remote servers is checking the local Nagios machine, as @rkennedy suggested. That means that for all 4 of the hosts, you are not checking their load but rather that of the Nagios server. That's why they all appear to go critical at the same time, and the values are so close.

You will need to use NRPE or something to check the remote machines.
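
As a rough sketch of what that change could look like for the vMA service posted earlier in the thread, the definition below swaps check_local_load for an NRPE-based check. It assumes a command object named check_nrpe already exists on the Nagios server and that the remote host's nrpe.cfg defines a command called check_load. Both names are assumptions for illustration, not something confirmed from the attached config.

```
define service{
        use                             local-service         ; template kept from the original definition
        host_name                       vMA
        service_description             Current Load
        servicegroups                   CPULoad
        check_command                   check_nrpe!check_load ; assumed: check_nrpe takes the remote command name as $ARG1$
        }
```

With a definition like this, the warning and critical thresholds are no longer passed from the service. They come from the command line configured in nrpe.cfg on the remote host.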

Also, if a PM is stuck in the Outbox that just means the recipient has not yet read the message. Give it time and it should clear once they do.

Re: All Linux Server CPU Spike at same time

Posted: Mon Mar 06, 2017 7:10 pm
by kwhogster
Great


I am using NRPE. This is from the nrpe.cfg on the Linux host:

command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20

Do you have a check_nrpe sample or example?

Re: All Linux Server CPU Spike at same time

Posted: Tue Mar 07, 2017 12:42 pm
by mcapra
With that command, this should be fine:

/usr/local/nagios/libexec/check_nrpe -H <host> -c check_load

Since you're not passing arguments, all you really need to do is pass the -c with the command name.
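
To run that same command from a Nagios service rather than by hand, it is usually wrapped in a command object. A minimal sketch, assuming the $USER1$ macro points at /usr/local/nagios/libexec (the usual default in resource.cfg) and that the object name check_nrpe is not already taken in your config:

```
define command{
        command_name    check_nrpe
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$   ; $ARG1$ = command name from the remote nrpe.cfg, e.g. check_load
        }
```

A service would then reference it as check_nrpe!check_load in its check_command, and Nagios fills in $HOSTADDRESS$ from the host definition.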

Re: All Linux Server CPU Spike at same time

Posted: Tue Mar 07, 2017 8:35 pm
by kwhogster
I ran this:

root@tgcs017:/usr/local/nagios/etc/objects# /usr/local/nagios/libexec/check_nrpe -H 10.2.8.7 -c check_load
NRPE: Unable to read output


Thoughts?