Nagios Sizing and Polling Interval

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
conston_rd
Posts: 14
Joined: Tue Nov 27, 2018 4:27 am

Post by conston_rd »

Hi,

Currently we have two Nagios environments, both running Nagios XI 5.6.6 on CentOS 7.

The servers have 8 CPUs and 32 GB of RAM. Each environment monitors 1100 servers with 5000 to 6000 checks.

They are configured in HA with DRBD, with no RAM disk, and SQL is not offloaded.

Currently, a lot of the service checks use a one-minute polling interval.

We are planning to roll out new monitoring templates with 25 services per host.
There will be an addition of 1000 more servers, so it will be around 2000 servers with about 50k services.

Can you please clarify the points below?

1) How can we check and confirm whether Nagios is struggling with a high service check load?

2) Is it advisable to have one third of the service checks on one-minute polling? (Currently CPU, memory, disk, and swap all have 1-minute polling.)

3) Can the existing setup cope when the monitored servers and service checks double in number, to 2000-plus servers and more than 50k services?

4) How far can we push the current setup before adding new instances?

5) Is it recommended to have 1-minute polling for performance metrics such as CPU, memory, swap, and disk?



Thanks
jdunitz
Posts: 235
Joined: Wed Feb 05, 2020 2:50 pm

Re: Nagios Sizing and Polling Interval

Post by jdunitz »

1) How can we check and confirm whether Nagios is struggling with a high service check load?
The clearest indicator of problems is the output of "ipcs -a": if you regularly have thousands of items in the message queue, that's one sign. Also look at "top": a high load average combined with low free memory is another indication. And if you look in /var/log/messages and see:

Code: Select all

ndo2db: Error: queue recv error.
ndo2db: Error: max retries exceeded sending message to queue. 
ndo2db: Warning: queue send error, retrying...
...or similar errors, that indicates something is wrong. Other entries in /var/log/messages can also indicate overload conditions, so read and interpret the error messages you find.
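
As a rough illustration of the two checks above, here is a minimal Python sketch that totals the queued messages from "ipcs" output and counts ndo2db queue errors in a log. It is only a sketch under assumptions: the column layout of the sample "ipcs -q" output (util-linux style, message count in the last column), the "nagios" owner, and the function names are all hypothetical, so compare them against what your own host prints before relying on it.

```python
import re

# Assumed `ipcs -q` layout (util-linux); the rows below are made-up sample data.
SAMPLE_IPCS = """\
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x00000000 32768      nagios     600        131072       2048
0x00000000 65537      nagios     600        0            0
"""

# The three ndo2db lines come from the post above; the last line is filler.
SAMPLE_LOG = """\
ndo2db: Error: queue recv error.
ndo2db: Error: max retries exceeded sending message to queue.
ndo2db: Warning: queue send error, retrying...
kernel: unrelated line
"""

# Matches ndo2db errors or warnings that mention the queue.
NDO2DB_PATTERN = re.compile(r"ndo2db: (Error|Warning): .*queue")


def queued_messages(ipcs_text):
    """Sum the last column (messages) over all message queue rows."""
    total = 0
    for line in ipcs_text.splitlines():
        fields = line.split()
        # Data rows start with a hex key and end with the message count.
        if len(fields) >= 6 and fields[0].startswith("0x") and fields[-1].isdigit():
            total += int(fields[-1])
    return total


def count_ndo2db_queue_errors(log_text):
    """Count log lines that look like ndo2db queue errors or warnings."""
    return sum(1 for line in log_text.splitlines() if NDO2DB_PATTERN.search(line))


if __name__ == "__main__":
    print("queued messages:", queued_messages(SAMPLE_IPCS))
    print("ndo2db queue errors:", count_ndo2db_queue_errors(SAMPLE_LOG))
```

On a live system you would feed these functions the real output of "ipcs -q" and the contents of /var/log/messages instead of the sample strings, and watch how the numbers trend over time rather than reading a single snapshot.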
2) Is it advisable to have one third of the service checks on one-minute polling? (Currently CPU, memory, disk, and swap all have 1-minute polling.)

3) Can the existing setup cope when the monitored servers and service checks double in number, to 2000-plus servers and more than 50k services?
...
4) How far can we push the current setup before adding new instances?
You might be running pretty close to some limits already. I definitely suggest setting up a ramdisk, making sure your other disks are as fast as they can be, and setting up mod-gearman. Here are some links to documents for those:

https://assets.nagios.com/downloads/nag ... ios_XI.pdf

https://assets.nagios.com/downloads/nag ... giosXI.pdf





5) Is it recommended to have 1-minute polling for performance metrics such as CPU, memory, swap, and disk?
Not in your case. With as much as you're monitoring, I suggest reserving the 1-minute interval for only your most critical machines and checks, and running the rest at five minutes.
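
To put rough numbers on that, here is a small Python sketch of the average check rate for about 50k services under different mixes of 1-minute and 5-minute intervals. It assumes checks are spread evenly over their interval and ignores retries and host checks; the function name and the 5% scenario are hypothetical and only there for comparison.

```python
def checks_per_second(total_services, fast_fraction,
                      fast_interval_s=60, slow_interval_s=300):
    """Average service checks per second for a mix of two polling intervals.

    fast_fraction is the share of services polled every fast_interval_s
    seconds; the remainder is polled every slow_interval_s seconds.
    """
    fast = total_services * fast_fraction
    slow = total_services - fast
    return fast / fast_interval_s + slow / slow_interval_s


if __name__ == "__main__":
    total = 50_000  # planned service count from the original post
    scenarios = [
        ("all at 1 min", 1.0),
        ("one third at 1 min", 1 / 3),
        ("5% at 1 min (hypothetical)", 0.05),
        ("all at 5 min", 0.0),
    ]
    for label, fraction in scenarios:
        rate = checks_per_second(total, fraction)
        print(f"{label:>28}: {rate:6.1f} checks/s")
```

Under these assumptions, keeping one third of the checks at 1 minute works out to roughly 389 checks per second, compared with about 167 if everything runs at five minutes, so the 1-minute checks dominate the load even though they are the minority.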


I hope that helps you get started! Let us know if you have more questions.

--Jeffrey