Hi all,
I'm using Nagios Core 4.4.3 with nagios-nrpe-plugin 3.2.1 installed on Ubuntu 18.04.
It is installed on an AWS EC2 t2.medium instance (2 CPUs, 4 GB RAM). My server is configured with 3 check_workers because of my 2 CPUs.
It serves as an "on-site" server with direct host/service checks via VPN, and as an NRPE master server.
The external commands are mostly ping, plus around 4 HTTP/DNS checks.
There are around 100 direct services and 350 NRPE services (one host).
When I add more NRPE agents (400 services each), the load on the master rises and I get "localhost load" alerts:
`localhost/Current Load is CRITICAL:
CRITICAL - load average: 1.31, 1.48, 4.01`
While monitoring the server with htop, I see that CPU usage repeatedly reaches 100%.
I've looked online and found some recommendations, but they didn't really help:
- using the check_fping plugin instead of check_ping
- external_command_buffer_slots=512
- use_large_installation_tweaks=1
In htop I see that the CPU spikes occur when external commands are executed.
Does anyone have any idea why my CPU usage is so high?
Shouldn't Nagios be able to handle thousands of services (with the right configuration)?
I'd appreciate any tips and recommendations.
Thanks
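A note on reading the alert in this post: a load average is usually judged relative to the number of CPUs. The sketch below is only an illustration, assuming the 2-CPU count stated in the post; the function name is made up, and the check_load thresholds actually in use are not shown in this thread.

```python
def per_cpu_load(load_averages, cpu_count):
    """Divide each load average (1, 5 and 15 minute) by the CPU count."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return tuple(load / cpu_count for load in load_averages)


# Load averages copied from the alert; 2 CPUs is the t2.medium count from the post.
one_min, five_min, fifteen_min = per_cpu_load((1.31, 1.48, 4.01), 2)
print(f"per-CPU load: {one_min:.2f} {five_min:.2f} {fifteen_min:.2f}")
```

Read this way, the 1 and 5 minute values are below one runnable process per CPU, while the 15 minute value is roughly two per CPU, which suggests the alert was triggered by an earlier sustained spike.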
NRPE master high load issue
Re: NRPE master high load issue
hi
any ideas about that?
Re: NRPE master high load issue
Hi
Reference:
https://assets.nagios.com/downloads/nag ... uning.html
I would post a screenshot of the top command output. Nagios Core is a pretty efficient program, so this shouldn't be a problem. What kind of check interval have you configured on these services?
Re: NRPE master high load issue
Hi,
Thank you for responding.
Attached are screenshots from htop, taken with 1135 checks configured.
Regarding the link you recommended: I had already looked at it, and accordingly I enabled use_large_installation_tweaks and switched to fping instead of ping.
Are there any other recommendations I should follow?
Thanks
This is the type of external command I'm using:
check_nrpe_shell!/usr/lib/nagios/plugins/check_fping 172.16.0.58 -w 300,30% -c 1000,80%
These are the intervals I'm using:
check_interval 0.20
retry_interval 0.10
max_check_attempts 20
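To put these intervals in perspective: Nagios expresses check_interval and retry_interval in units of interval_length, which defaults to 60 seconds. The sketch below converts the values above to seconds and estimates the check rate. It assumes the default interval_length and, as a worst case, that all 1135 checks use the same check_interval; neither is confirmed in this thread, and the function names are made up for illustration.

```python
def interval_seconds(interval, interval_length=60):
    """Convert a Nagios interval (in interval_length units) to seconds."""
    return interval * interval_length


def checks_per_second(service_count, check_interval, interval_length=60):
    """Average number of checks started per second for a given interval."""
    return service_count / interval_seconds(check_interval, interval_length)


# Values from the post; interval_length=60 is the assumed Nagios default.
print(f"check_interval: {interval_seconds(0.20):.0f} s")
print(f"retry_interval: {interval_seconds(0.10):.0f} s")
print(f"check rate:     {checks_per_second(1135, 0.20):.1f} checks/s")
```

Under those assumptions each service is checked every 12 seconds, which works out to roughly 95 plugin executions per second on the master. The same function with a 1-minute interval gives about 19 per second.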
Re: NRPE master high load issue
hi
what do you think?
Re: NRPE master high load issue
XI Server
2 cores, 4 GB RAM
1. Your instance is not right-sized for your active monitoring load.
400 service checks (active monitoring via NRPE)
2. Your timing is very, very, very aggressive for an "active" monitoring scenario, IMHO.
Executing every 20 seconds with a retry interval of 10 seconds and 10 retries (total time to incident is 2 minutes).
The plugin timeout is 30 seconds (I think), which means you have the possibility of trying to execute the check twice before the first timeout is reached, adding greater load to the system.
I know this is for XI, but I have noticed that you can only push past the 1:5 recommended ratio if you're at 8 GB of RAM or more.
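The "time to incident" estimate can be checked with a little arithmetic. After the first failed check, Nagios rechecks at retry_interval until max_check_attempts is reached, so the delay before a HARD state is about (max_check_attempts - 1) retries. A rough sketch, using the values the original poster listed (max_check_attempts 20, retry_interval 0.10) and assuming the default interval_length of 60 seconds; it ignores plugin execution time and scheduling latency, and the function name is made up for illustration.

```python
def seconds_to_hard_state(max_check_attempts, retry_interval, interval_length=60):
    """Approximate delay between the first failed check and the HARD state."""
    if max_check_attempts < 1:
        raise ValueError("max_check_attempts must be at least 1")
    retries = max_check_attempts - 1  # the first failure is attempt 1
    return retries * retry_interval * interval_length


# Values from the original poster's configuration; interval_length=60 is assumed.
print(f"{seconds_to_hard_state(20, 0.10):.0f} s until HARD state")
```

With those inputs the result is about 114 seconds, which is in line with the roughly 2 minutes mentioned above.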
Re: NRPE master high load issue
Hi,
Thanks for providing this table.
I will upgrade my hardware and monitor the server to understand how "aggressive" I can be with my checks.
Regarding the ratio issue: the whole point of NRPE is that I can communicate with (check) a remote site via one "agent",
and most of my sites have more than 100 services (service == on-site network device).
Do you think NRPE is not right for my scenario?
Again, thanks for responding.
ahiya
Re: NRPE master high load issue
An agent is just an agent.
I have run through most of them and even helped create a couple myself over the years. It has become more of an exercise in choosing the right tool for the job now.
My go-to agent is NCPA.
https://www.nagios.org/ncpa/
I use this in a passive/active scenario in my deployments, leaning toward passive (the target is responsible for sending the telemetry to Nagios) wherever possible.
When it's a network device like a switch or router, most times it becomes an SNMP solution, unless of course you have an API available, as with VMware or F5 Networks.
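To illustrate what "passive" means at the Nagios Core level: the remote side produces the result and the server only records it, instead of forking a plugin for every check. The sketch below formats a result in the PROCESS_SERVICE_CHECK_RESULT external command format. It is not NCPA's own mechanism (NCPA passive checks are typically delivered through NRDP), and the host name, service name and function name are hypothetical.

```python
import time


def passive_service_result(host, service, return_code, output, timestamp=None):
    """Format a passive service check result as a Nagios external command line."""
    if return_code not in (0, 1, 2, 3):  # OK, WARNING, CRITICAL, UNKNOWN
        raise ValueError("return_code must be 0, 1, 2 or 3")
    if timestamp is None:
        timestamp = int(time.time())
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}"
    )


# Hypothetical host and service names, for illustration only.
print(passive_service_result("site-router-01", "PING", 0, "PING OK - rta 12ms"))
```

A line like this would be written to the Nagios command file configured by command_file in nagios.cfg; the service must also be defined on the server with passive checks enabled.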
Re: NRPE master high load issue
Hi,
I've made some changes to my system:
- increased my server resources to 4 CPUs, 16 GB RAM and a 100 GB disk
- reconfigured the number of workers to 6 (4 x 1.5)
- changed the intervals to a check_interval of 1 min and a retry_interval of 30 sec
I still see my server's CPU load jump to 80-100% every few minutes.
Any idea how to proceed from here?
Attached is a screenshot of htop.
Thanks