Nagios Sizing and Polling Interval

Post by **conston_rd** » Mon Apr 20, 2020 3:38 am

Hi,

Currently we have two Nagios environments, both running Nagios XI 5.6.6 on CentOS 7.

Each server has 8 CPUs and 32 GB of RAM. Each environment monitors 1100 servers with 5000 to 6000 checks.

They are configured for HA with DRBD, with no RAM disk, and SQL is not offloaded.

Currently a lot of service checks use a one-minute polling interval.

We are planning to roll out new monitoring templates with 25 services per host.
There will be an addition of 1000 more servers, so it will be around 2000 servers with an average of 50k services.

Can you please clarify the points below?

1) How can we check and confirm whether Nagios is struggling with a high service check load?

2) Is it advisable to have one third of the service checks on one-minute polling (currently CPU, Memory, Disk and Swap all have 1 min polling)?

3) Can the existing setup cope when the monitored servers and service checks double in number, to 2000-plus servers and above 50K services?

4) How far can we push the current setup before adding new instances?

5) Is it recommended to have 1 min polling for performance metrics such as CPU, Memory, Swap and Disk?

Thanks

Post by **jdunitz** » Mon Apr 20, 2020 10:59 am

1) How can we check and confirm whether Nagios is struggling with a high service check load?

The clearest indicator is the output of "ipcs -a": if you regularly have thousands of items in the queue, that's one sign of a problem. Also look at "top": a high load average together with low free memory is another indication. And if you look in /var/log/messages and see:

ndo2db: Error: queue recv error.
ndo2db: Error: max retries exceeded sending message to queue. 
ndo2db: Warning: queue send error, retrying...

...or similar errors, those are indications of something being wrong. There are other items in /var/log/messages that can indicate overload conditions, so read and interpret the error messages you find.
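If you want to watch those two indicators over time rather than eyeballing them, here is a minimal Python sketch. It is not an official Nagios tool; the function names are made up for illustration, and it assumes the usual Linux util-linux layout of "ipcs -q" output (six columns, with the message count last).

```python
import re

# Matches the ndo2db queue errors and warnings quoted above.
NDO2DB_QUEUE_MSG = re.compile(r"ndo2db: (Error|Warning): .*queue", re.IGNORECASE)


def count_ndo2db_queue_errors(log_text):
    """Count log lines that look like ndo2db queue errors or warnings."""
    return sum(1 for line in log_text.splitlines() if NDO2DB_QUEUE_MSG.search(line))


def total_queued_messages(ipcs_output):
    """Sum the 'messages' column across all queues in `ipcs -q` output.

    Assumed row layout: key msqid owner perms used-bytes messages
    """
    total = 0
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Data rows start with a hex key such as 0x6e010002; skip headers.
        if len(fields) == 6 and fields[0].startswith("0x"):
            total += int(fields[5])
    return total
```

To use it, pass in the text of /var/log/messages and the output of something like subprocess.run(["ipcs", "-q"], capture_output=True, text=True).stdout, then log the two numbers every few minutes. A queue count that stays in the thousands, or an error count that keeps growing, points to the overload described above.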

2) Is it advisable to have one third of the service checks on one-minute polling (currently CPU, Memory, Disk and Swap all have 1 min polling)?

sdf

3) Can the existing setup cope when the monitored servers and service checks double in number, to 2000-plus servers and above 50K services?

...

4) How far can we push the current setup before adding new instances ?

You might be running pretty close to some limits already. I definitely suggest setting up a ramdisk, making sure your other disks are as fast as they can be, and setting up mod-gearman. Here are some links to documents for those:

https://assets.nagios.com/downloads/nag ... ios_XI.pdf

https://assets.nagios.com/downloads/nag ... giosXI.pdf

5) Is it recommended to have 1 min polling for performance metrics such as CPU, Memory, Swap and Disk?

Not in your case. With as much stuff as you're monitoring, I suggest saving the 1-minute interval for only your most critical machines and checks, and have the rest at five minutes.
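To put rough numbers on that, here is a back-of-the-envelope Python sketch of how many service checks per second the scheduler has to run for different interval mixes. It uses the 50,000-service figure from the question. It assumes checks are spread evenly over their interval and ignores host checks and retries, so treat the results as lower bounds rather than measurements.

```python
def checks_per_second(services_by_interval):
    """Average scheduled checks per second.

    services_by_interval maps a check interval in minutes to the number
    of services using that interval.
    """
    return sum(
        count / (minutes * 60)
        for minutes, count in services_by_interval.items()
    )


TOTAL_SERVICES = 50_000  # planned total from the original question

all_one_minute = checks_per_second({1: TOTAL_SERVICES})
one_third_one_minute = checks_per_second(
    {1: TOTAL_SERVICES / 3, 5: TOTAL_SERVICES * 2 / 3}
)
all_five_minute = checks_per_second({5: TOTAL_SERVICES})

print(f"all at 1 min:           {all_one_minute:.1f} checks/s")
print(f"1/3 at 1 min, 2/3 at 5: {one_third_one_minute:.1f} checks/s")
print(f"all at 5 min:           {all_five_minute:.1f} checks/s")
```

That works out to roughly 833, 389 and 167 checks per second respectively, so keeping even a third of the services at one minute more than doubles the check rate compared with running everything at five minutes.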

I hope that helps you get started! Let us know if you have more questions.

--Jeffrey

Nagios Support Forum
