NRPE master high load issue

ahiya
Posts: 11
Joined: Wed Nov 20, 2019 1:04 am

NRPE master high load issue

Post by ahiya »

Hi all,

I'm using Nagios Core 4.4.3 with nagios-nrpe-plugin 3.2.1 installed on Ubuntu 18.04.
It runs on an AWS EC2 t2.medium instance (2 CPUs, 4 GB RAM). The server is configured with 3 check_workers because of its 2 CPUs.
It serves as an "on-site" monitor with direct host/service checks over VPN, and as an NRPE master server.
The external commands are mostly ping, plus around 4 HTTP/DNS checks.
There are around 100 direct services and 350 NRPE services (one host).
When I add more NRPE agents (400 services each), the load on the master rises and I get "localhost load" alerts:
`localhost/Current Load is CRITICAL:
CRITICAL - load average: 1.31, 1.48, 4.01`
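
For context, a load average is easier to judge relative to the number of CPUs. Below is a minimal Python sketch, assuming the 2 CPUs of the t2.medium described above; the function name is made up for illustration.

```python
def per_cpu_load(load_averages, cpu_count):
    """Return each load average divided by the CPU count."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return tuple(value / cpu_count for value in load_averages)


# Values from the alert above: 1-, 5- and 15-minute load averages.
alert_load = (1.31, 1.48, 4.01)

# Assumption: 2 CPUs, as stated for the t2.medium instance.
labels = ("1 min", "5 min", "15 min")
for label, value in zip(labels, per_cpu_load(alert_load, 2)):
    print(f"{label}: {value:.2f} per CPU")
```

Under that assumption only the 15-minute figure is well above 1.0 per CPU (about 2), which suggests the heaviest load came earlier in that window rather than at the moment of the alert.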

While monitoring the server with htop, I see that CPU usage repeatedly reaches 100%.
I've looked online and found some recommendations, but they didn't really help:
- using the check_fping plugin instead of check_ping
- external_command_buffer_slots=512
- use_large_installation_tweaks=1

In htop I see that the CPU spikes occur when external commands are executed.

Does anyone have any idea why my CPU usage is so high?
Shouldn't Nagios be able to handle thousands of services (with the right configuration)?
I'd appreciate any tips and recommendations.

thanks :roll:
ahiya
Posts: 11
Joined: Wed Nov 20, 2019 1:04 am

Re: NRPE master high load issue

Post by ahiya »

hi

any ideas about that?
benjaminsmith
Posts: 5324
Joined: Wed Aug 22, 2018 4:39 pm
Location: saint paul

Re: NRPE master high load issue

Post by benjaminsmith »

Hi
any ideas about that?
I would post a screenshot of the top command output. Nagios Core is a pretty efficient program, so this shouldn't be a problem. What kind of check interval have you configured on these services?

Reference:
https://assets.nagios.com/downloads/nag ... uning.html
ahiya
Posts: 11
Joined: Wed Nov 20, 2019 1:04 am

Re: NRPE master high load issue

Post by ahiya »

hi
thank you for responding.

Attached are screenshots from htop, taken with 1135 checks configured.
Regarding the link you recommended: I had already looked at it, and accordingly enabled use_large_installation_tweaks and switched from ping to fping.
Are there any other recommendations I should follow?

thanks
:!:
This is the type of external command I'm using:
check_nrpe_shell!/usr/lib/nagios/plugins/check_fping 172.16.0.58 -w 300,30% -c 1000,80%

And these are the intervals I'm using:

check_interval 0.20
retry_interval 0.10
max_check_attempts 20
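
To get a feel for what these intervals mean for the master, here is a rough Python sketch of the resulting check rate. It assumes interval_length is left at its default of 60 seconds (the post does not say), so check_interval 0.20 would be 12 seconds. The function names are only for illustration.

```python
def interval_seconds(check_interval, interval_length=60):
    """Convert a Nagios interval (in time units) to seconds."""
    return check_interval * interval_length


def checks_per_second(service_count, check_interval, interval_length=60):
    """Average number of check executions per second, ignoring retries."""
    return service_count / interval_seconds(check_interval, interval_length)


# 1135 checks and check_interval 0.20 are taken from the post above.
# Assumption: interval_length=60 (the Nagios default).
rate = checks_per_second(1135, 0.20)
print(f"{rate:.1f} checks per second")
```

Under that assumption, 1135 services at a 12-second interval come to roughly 95 check executions per second before any retries, each of which forks a plugin process on the master.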
Attachments
3.JPG
2.JPG
1.JPG
ahiya
Posts: 11
Joined: Wed Nov 20, 2019 1:04 am

Re: NRPE master high load issue

Post by ahiya »

Now it's even worse.

Screenshots attached.
Attachments
5.JPG
4.JPG
ahiya
Posts: 11
Joined: Wed Nov 20, 2019 1:04 am

Re: NRPE master high load issue

Post by ahiya »

hi

what do you think?
snapier3
Posts: 61
Joined: Tue Apr 23, 2019 7:12 pm

Re: NRPE master high load issue

Post by snapier3 »

XI Server
2 cores, 4 GB RAM

1. Your instance is not right-sized for your active monitoring load.
400 service checks (active monitoring via NRPE)
NagiosXI Server Specs
2. Your timing is very, very, very aggressive for an "active" monitoring scenario, IMHO.
Executing every 20 seconds with a retry interval of 10 seconds and 10 retries (total time to incident is 2 min).
The plugin timeout is 30 seconds (I think), which means a check could be executed a second time before the first run has timed out, adding more load to the system.

I know this is for XI, but I have noticed that you can only push past the 1:5 recommended ratio if you're at 8 GB of RAM or more.
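
The "time to incident" estimate in point 2 can be sketched in Python as one check interval (worst-case wait until the first failing check) plus the retries. This is a simplified model that ignores plugin execution time and scheduling jitter, and the function name is hypothetical.

```python
def worst_case_seconds(check_interval_s, retry_interval_s, retries):
    """Rough upper bound on the time from a failure to the final retry.

    One full check interval may pass before the first failing check,
    then each retry adds one retry interval.
    """
    return check_interval_s + retries * retry_interval_s


# Reading used in the reply above: 20 s interval, 10 s retry, 10 retries.
print(worst_case_seconds(20, 10, 10), "seconds")

# Posted config, assuming interval_length=60: 12 s interval, 6 s retry,
# and max_check_attempts 20 meaning 19 retries after the first failure.
print(worst_case_seconds(12, 6, 19), "seconds")
```

Both readings land at about two minutes, so relaxing the intervals mainly reduces how often checks run, not how quickly an incident is raised, as long as the number of attempts is adjusted to match.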
ahiya
Posts: 11
Joined: Wed Nov 20, 2019 1:04 am

Re: NRPE master high load issue

Post by ahiya »

hi

Thanks for providing this table.
I will upgrade my hardware and monitor it to understand how "aggressive" I can be with my checks.
Regarding the ratio issue: the whole point of NRPE is that I can check a remote site through one "agent",
and most of my sites have more than 100 services (service == on-site network device).
Do you think NRPE is not right for my scenario?

again, thanks for responding

ahiya
snapier3
Posts: 61
Joined: Tue Apr 23, 2019 7:12 pm

Re: NRPE master high load issue

Post by snapier3 »

An agent is just an agent.

I have run through most of them and even helped create a couple myself over the years. These days it's more an exercise in choosing the right tool for the job.

My go-to agent is NCPA.
https://www.nagios.org/ncpa/
I use this in a passive/active scenario in my deployments, leaning toward passive (the target is responsible for sending the telemetry to Nagios) wherever possible.

When it's a network device like a switch or router, most of the time it becomes an SNMP solution, unless of course you have an API available, as with VMware or F5 Networks.
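
To illustrate what "passive" means on the Nagios Core side, here is a small Python sketch that formats a PROCESS_SERVICE_CHECK_RESULT external command line, which is the form in which a passive service result is submitted to Core. It is not how NCPA itself transmits results; the host and service names are made up, and the sketch only builds the line without writing it anywhere.

```python
import time


def passive_result_line(host, service, return_code, output, timestamp=None):
    """Format a passive service check result as an external command line."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0 (OK), 1, 2 or 3")
    if timestamp is None:
        timestamp = int(time.time())
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}\n"
    )


# Hypothetical host and service names, for illustration only.
line = passive_result_line("site-a-gw", "PING", 0, "OK - rta 12ms",
                           timestamp=1574200000)
print(line, end="")
```

With results pushed in this way, the master only processes incoming lines instead of forking a plugin for every check, which is why leaning toward passive checks can lower its load.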
ahiya
Posts: 11
Joined: Wed Nov 20, 2019 1:04 am

Re: NRPE master high load issue

Post by ahiya »

hi
I've made some changes to my system:
- increased my server resources to 4 CPUs, 16 GB RAM and a 100 GB disk
- reconfigured the number of workers to 6 (4 x 1.5)
- changed the intervals to a check_interval of 1 min and a retry_interval of 30 sec

I still see my server's CPU load jump to 80-100% every few minutes.

Any idea how to proceed from here?
An htop screenshot is attached.

thanks
Attachments
6.JPG