Re: All Linux Server CPU Spike at same time

Posted: Sun Mar 19, 2017 7:20 pm
by kwhogster
Yes, they all spike at the exact same time.

I sent a lot of info on this.

Someone must know why.

Re: All Linux Server CPU Spike at same time

Posted: Sun Mar 19, 2017 8:53 pm
by rkennedy
kwhogster wrote: Yes, they all spike at the exact same time.

I sent a lot of info on this.

Someone must know why.
rkennedy wrote: I haven't seen in this post if the nagios user (or whoever you're running NRPE under) is able to actually run your script locally. Give that a try and get it working, then watch your /var/log/messages on the client machine when running the checks through NRPE AFTER you get it working locally.

Re: All Linux Server CPU Spike at same time

Posted: Mon Mar 20, 2017 9:57 am
by tmcdonald
Quite a few things going on here:

1.) Did you ever follow the advice of @tgriep in this post? User @ssax posted right after so you might not have seen @tgriep's post.

2.) Your output here seems to indicate that the server does not have the proper glibc version, which indicates to me that the plugins might have been copied over from an incompatible system. Can you please elaborate on how you got the plugins installed on each of these systems? The specific error I am referring to is:

/usr/local/nagios/libexec/check_load: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /usr/local/nagios/libexec/check_load)

3.) As has been pointed out, this is not an issue of Nagios causing the high load. I know because the load averages are for 1, 5, and 15 minutes respectively in output such as this from your first post:

Current Load CRITICAL 03-02-2017 20:46:26 0d 0h 5m 42s 4/4 CRITICAL - load average: 0.99, 7.56, 4.89

That means that at the time the Nagios check was run, the 1-minute load average was just under 1, which on most modern systems is not awful. Since a Nagios check typically doesn't take more than a second to run, it would have to be a bug of astronomical proportions to cause your 15-minute load to be higher than your 1-minute for an instantaneous check (astronomical here meaning "involving time travel and probably quantum physics").

Balance of probability puts this as a misconfiguration, as was pointed out early on in the thread. You had each of your remote checks configured to run a check not on that remote machine, but on the Nagios server itself. That is why all the loads were so identical - they were checking the same machine at roughly the same time.
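
To make the reasoning about the three load figures concrete, here is a minimal Python sketch that pulls the 1-, 5- and 15-minute averages out of a check_load-style output line and compares the first two. The function names are hypothetical, and the "falling"/"rising" label is only a rough rule of thumb assumed for illustration, not something Nagios itself reports.

```python
import re


def parse_load_averages(plugin_output):
    """Return the (1, 5, 15)-minute load averages found in the plugin output."""
    match = re.search(
        r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", plugin_output
    )
    if match is None:
        raise ValueError("no load averages found in plugin output")
    return tuple(float(value) for value in match.groups())


def describe_trend(one_min, five_min):
    """Rough heuristic (an assumption): compare the 1-minute and 5-minute averages."""
    if one_min < five_min:
        return "falling"
    if one_min > five_min:
        return "rising"
    return "steady"


# Output line taken from the first post in this thread.
line = "CRITICAL - load average: 0.99, 7.56, 4.89"
one, five, fifteen = parse_load_averages(line)
print(one, five, fifteen, describe_trend(one, five))
```

With the line from the first post, the 1-minute figure is far below the 5-minute one, so the busy period was already over by the time the check ran.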

I understand this is frustrating, but NRPE issues are some of the most common, and most frequently solved, problems we see on a daily basis. We have a 26-page PDF detailing everything we know about NRPE issues and their solutions.

Re: All Linux Server CPU Spike at same time

Posted: Mon Mar 20, 2017 7:13 pm
by kwhogster
Does this also apply to Nagios Core 4.1?

The entire document is about Nagios XI.

NRPE Client Timeout
This timeout is how long the NRPE client on the Nagios XI server will wait for a response from the plugin it executes before returning a result to Nagios XI. You may need to change a couple settings in the remote host's /usr/local/nagios/etc/nrpe.cfg file depending on how high you set the timeout in Nagios XI. Edit the file with the following command:
vi /usr/local/nagios/etc/nrpe.cfg
Search for the command_timeout= and connection_timeout= settings which may need to be altered. Set both of these, at minimum, to the value of the timeout in Nagios XI. Usually the connection_timeout=300 is more than enough, as is the command_timeout which defaults to 60 seconds. If you do set your timeout in Nagios XI higher, increase the command_timeout to match.


My nrpe.cfg has this:

# COMMAND TIMEOUT
# This specifies the maximum number of seconds that the NRPE daemon will
# allow plugins to finish executing before killing them off.

command_timeout=60



# CONNECTION TIMEOUT
# This specifies the maximum number of seconds that the NRPE daemon will
# wait for a connection to be established before exiting. This is sometimes
# seen where a network problem stops the SSL being established even though
# all network sessions are connected. This causes the nrpe daemons to
# accumulate, eating system resources. Do not set this too low.

connection_timeout=300
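
One way to check the two settings above against the rule quoted from the document (both at least as high as the server-side timeout) is sketched below in Python. The function names and the 90-second server-side timeout are made up for illustration; the sample text is a shortened copy of the nrpe.cfg excerpt above.

```python
def read_timeouts(cfg_text):
    """Collect the two NRPE timeout settings from nrpe.cfg-style text."""
    timeouts = {}
    for raw_line in cfg_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key in ("command_timeout", "connection_timeout"):
            timeouts[key] = int(value.strip())
    return timeouts


def settings_below(timeouts, server_timeout):
    """Return the names of settings that are lower than the server-side timeout."""
    return sorted(
        name for name, seconds in timeouts.items() if seconds < server_timeout
    )


sample = """
# COMMAND TIMEOUT
command_timeout=60
# CONNECTION TIMEOUT
connection_timeout=300
"""

timeouts = read_timeouts(sample)
print(timeouts)
# 90 is a hypothetical server-side timeout, not a value from this thread.
print(settings_below(timeouts, 90))
```

With the default 60-second check timeout on the server side, neither setting would be flagged, so the values shown would not need changing on that basis alone.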



Should I change them?

In all the suggestions on here, not one was to change anything. All I get is "please send me this" and "send me that".

I have provided tons of info.

Yes, I am running the commands on the Nagios server, because it is one of the 4 Linux servers having the same problem.

Re: All Linux Server CPU Spike at same time

Posted: Mon Mar 20, 2017 7:36 pm
by dwhitfield
kwhogster wrote: Does this also apply to Nagios Core 4.1?
Yes.
kwhogster wrote: In all the suggestions on here, not one was to change anything. All I get is "please send me this" and "send me that".
This is not true. As was pointed out in the last post, you were asked to comment out the only_from line (so that it reads # only_from = 127.0.0.1), save the file, and then run service xinetd restart.

Earlier on page two, you were told to run the command with the -n flag to see if that fixed it. You were also asked to add a comma to your only_from line.

You never confirmed whether you made it so that any server can connect, which is why Trevor brought it back up.

As for why I asked about the cron files, that was to confirm that it was not something else, but rather a misconfiguration. This has seemed pretty obvious from the beginning, but I figured it was worth double-checking. There are no jobs running at those times that would cause that load. One other thing to ask, though: do you have anything like Puppet, Chef, backups, or some job on a remote machine that you have not mentioned that would be running at 20:46:26?

Re: All Linux Server CPU Spike at same time

Posted: Mon Mar 20, 2017 7:47 pm
by kwhogster
I do not see any comment about only_from =

It was allowed_hosts=127.0.0.1,10,2,8,79

I did that, and the -n did not work, as I replied with the results.

Where is only_from?????????????

Also, this warning comes up a lot:

CHECK_NRPE: Received 0 bytes from daemon. Check the remote server logs for error messages.

The doc says this is about arguments, but other times the check runs fine, so that is a false positive.

Just like when the Linux servers spike: again, all false positives.

Re: All Linux Server CPU Spike at same time

Posted: Mon Mar 20, 2017 10:33 pm
by rkennedy
I want to help you, I really do, but @tmcdonald already beat me to it. Make sure to read his post, specifically #1 and #2.

Re: All Linux Server CPU Spike at same time

Posted: Tue Mar 21, 2017 5:57 am
by kwhogster
Rkennedy,

Thanks. The light bulb just went off; that's what a good night's sleep does for you.

I think I just copied over some files to the other Linux servers.

So I am thinking:

1. Upgrade NRPE on all Linux servers.
2. Install Nagios plugins on all servers.

What else do I need? GLIBC_2.14? Where do I get that? What package does it come in?

This way the check will run on each machine, not just on the Nagios server, which does tend to spike.

Thanks

Re: All Linux Server CPU Spike at same time

Posted: Tue Mar 21, 2017 10:35 am
by dwhitfield
This may have already been sorted based on @rkennedy's response, but for clarity:
kwhogster wrote: Where is only_from?????????????
Tom's post: https://support.nagios.com/forum/viewto ... 10#p215246

That is the one Trevor linked to yesterday.

You shouldn't need to upgrade NRPE, unless you are trying to use configuration options not available in earlier versions. Now, if you want to upgrade and get new features, I'm not going to talk you out of it, but it shouldn't be necessary, strictly speaking.

As for glibc, on CentOS the package is called glibc. If you search your repos for glibc I bet you will find it, but if you don't, let us know.
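
For anyone comparing versions by hand, the following Python sketch extracts the glibc version that the check_load error quoted earlier in the thread asks for and compares it with an installed version. The function names are hypothetical, and the installed version "2.12" is an assumed example value, not something reported in this thread.

```python
import re


def required_glibc(error_line):
    """Extract the version from a "GLIBC_x.y not found" loader error, if present."""
    match = re.search(r"GLIBC_(\d+(?:\.\d+)+)", error_line)
    return match.group(1) if match else None


def version_tuple(version):
    """Turn "2.14" into (2, 14) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def is_too_old(installed, required):
    """True when the installed glibc is older than the version the binary needs."""
    return version_tuple(installed) < version_tuple(required)


# Error line quoted earlier in this thread.
error = (
    "/usr/local/nagios/libexec/check_load: /lib64/libc.so.6: version "
    "`GLIBC_2.14' not found (required by /usr/local/nagios/libexec/check_load)"
)

needed = required_glibc(error)
print(needed)
# "2.12" is a hypothetical installed version used only for this example.
print(is_too_old("2.12", needed))
```

On a real machine the installed version could be read with Python's platform.libc_ver() or from the package manager instead of being typed in by hand.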

Re: All Linux Server CPU Spike at same time

Posted: Sun Jul 23, 2017 5:56 pm
by kwhogster
This has been resolved.

It was the networking setup on the ESXi host.
There were dual NICs, with one set up in standby mode. I changed that, and now the network is working much better.


Please lock this.