Nagios Support Forum

Posted: **Wed Nov 12, 2014 3:51 pm**

Having some issues after cloning/adding 300+ hosts to my Nagios test environment. The host checks are worrying me, as most of them are an hour overdue and the "Next Update Time" just keeps moving up.

I set these hosts up with two service checks each, Ping and a port check (TCP port 22 for SSH, or port 135 for Windows), both set to run every minute.

The service checks are mostly fine, just a small percentage that are 20 to 30 mins old most of the time instead of every 1 or 2 minutes.

The System Performance and Monitoring Performance are spectacular. Hoping this data is accurate. Pretty beast machine: dedicated hardware with 24 cores, 128 GB RAM, and a 2 TB SSD RAID running ESXi. This Nagios XI VM is configured with 4 cores and 8 GB of RAM. Avg Service Check Latency is 0.00; it sometimes pops up to .06, but rarely.

I have checked the perf logs as well and they are not showing anything alarming.

I am just concerned these host checks, and some service checks, are not firing when they are supposed to. Could it be due to bulk cloning 100 at a time?

Here are some screenshots

Posted: **Wed Nov 12, 2014 3:54 pm**

More detail on service checks and performance

Let me know if I need to add any logs here.

Posted: **Wed Nov 12, 2014 4:03 pm**

Forgot to mention I grabbed these hosts from our CMDB and purposely included 40 or so that had been retired so critical host down messages would generate. I did notice this in host-perfdata.log

Code: Select all

DATATYPE::HOSTPERFDATA	TIMET::1415826040	HOSTNAME::PSP01SQLV01	HOSTPERFDATA::	HOSTCHECKCOMMAND::check_xi_host_ping!3000.0!80%!5000.0!100%	HOSTSTATE::DOWN	HOSTSTATETYPE::HARD	HOSTOUTPUT::check_icmp: Failed to resolve PSP01SQLV01

The log was moving right along and seemed to hold here for a bit.

Could this be slowing things down, not being able to resolve hosts, especially 40 or more?
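One way to see how many hosts are affected is to try resolving each name up front. The following is only a minimal sketch: `find_unresolvable` is a hypothetical helper, not part of Nagios, and the host list is assumed to come from wherever the hosts were exported (the CMDB, in this case).

```python
import socket


def find_unresolvable(hostnames, resolve=socket.gethostbyname):
    """Return the hostnames for which the resolver raises an error."""
    failed = []
    for name in hostnames:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(name)
    return failed
```

Calling `find_unresolvable(["PSP01SQLV01", ...])` on the monitoring server would list the names that fail the same way `check_icmp` reports. The `resolve` parameter only exists so a different lookup function can be substituted.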

Posted: **Wed Nov 12, 2014 4:12 pm**

Sorry to keep adding data, but I just grepped the retention.dat file for check_execution_time and here is a sample. Really good.

check_execution_time=0.007
check_execution_time=0.010
check_execution_time=0.005
check_execution_time=0.007
check_execution_time=0.005
check_execution_time=0.007
check_execution_time=0.005
check_execution_time=0.006
check_execution_time=3.004
check_execution_time=3.008
check_execution_time=3.004
check_execution_time=3.007
check_execution_time=3.004
check_execution_time=3.007
check_execution_time=0.005
check_execution_time=0.007
check_execution_time=0.005
check_execution_time=0.008
check_execution_time=3.004
check_execution_time=3.015
check_execution_time=0.007
check_execution_time=0.006
check_execution_time=3.004
check_execution_time=3.007
check_execution_time=0.005
check_execution_time=0.007
check_execution_time=0.004
check_execution_time=0.006
check_execution_time=0.004
check_execution_time=0.008
check_execution_time=0.004
check_execution_time=0.006
check_execution_time=0.005
check_execution_time=0.007
check_execution_time=2.007

Latency on almost all of them is 0.00, with two in the file that were .47.
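The sample above falls into two groups: values of a few milliseconds and values of about 3 seconds. A rough sketch for counting how many checks land in each whole-second bucket is shown below. It assumes the lines have the `check_execution_time=<seconds>` form seen in the sample; `bucket_execution_times` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches lines such as "check_execution_time=3.004"
PATTERN = re.compile(r"^\s*check_execution_time=(\d+(?:\.\d+)?)\s*$")


def bucket_execution_times(lines):
    """Count check_execution_time values by whole-second bucket."""
    buckets = Counter()
    for line in lines:
        match = PATTERN.match(line)
        if match:
            # Truncate to whole seconds: 0.007 -> 0, 3.004 -> 3
            buckets[int(float(match.group(1)))] += 1
    return buckets
```

Feeding it the lines of retention.dat (for example via `open(path)`, where the path depends on the installation) gives a quick count of how many checks take around 3 seconds versus a few milliseconds.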

Posted: **Wed Nov 12, 2014 4:34 pm**

Try using the following settings in the "nagios.cfg" file:

Code: Select all

auto_reschedule_checks=1
auto_rescheduling_interval=30
auto_rescheduling_window=45

I believe the "default" setting for the "auto_rescheduling_window" is 180. This causes issues with small (1-2 min) check intervals and retries. Modify the "nagios.cfg" file with the above settings, and restart nagios:

Code: Select all

service nagios restart

Let us know if this helped.
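Before editing, it can help to confirm what the three directives are currently set to. This is a small sketch under stated assumptions: `read_settings` is a hypothetical helper, and it treats the file as plain `key=value` lines with `#` comments, keeping the last value seen for each key.

```python
RESCHEDULING_KEYS = {
    "auto_reschedule_checks",
    "auto_rescheduling_interval",
    "auto_rescheduling_window",
}


def read_settings(lines, keys=RESCHEDULING_KEYS):
    """Return {directive: value} for the requested keys found in the lines."""
    found = {}
    for raw in lines:
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not key=value
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key in keys:
            found[key] = value.strip()  # a later occurrence overwrites an earlier one
    return found
```

For example, `read_settings(open("/usr/local/nagios/etc/nagios.cfg"))` would return the current values; that path is an assumption and may differ on a given install. A directive missing from the result is simply not set in the file.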

Posted: **Wed Nov 12, 2014 5:39 pm**

Actually changed

Code: Select all

auto_rescheduling_window=45

to

auto_rescheduling_window=40

and BAAAMMMMM!!!!!!

I feel stupid cause after I read your post it hit me someone was talking about this at the conference. Too much data, too little brain matter.

Here is a screenshot of everything now. CPU barely moved (screenshot shows 10, but mainly under 8). Service and host checks are still .00 or .01, with 700+ service checks per minute and 358 host checks per minute.

All running on a 4 x CPU 2.5 GHz, 8 GB RAM VM, which is running on an SSD RAID. The dedicated ESX host for this has 24 cores, 128 GB of RAM, a 2 TB SSD RAID and a 3 TB 15k RAID (the Nagios log server is going there).

I know people have much bigger deployments, but this is just a stress test on a small vm and I wanted to see how it would hold checking this much every single minute.

In your opinion, am I right to think we will be able to run a hell of a lot of checks when we go to production, with check intervals averaging 3 to 6 minutes and more CPU and RAM added?

Posted: **Wed Nov 12, 2014 5:41 pm**

Indeed.

Posted: **Wed Nov 12, 2014 5:43 pm**

Oh and the dashboard loading time is SICK.

I can load the host dashboard, with all host and service checks showing, in about 2 to 3 seconds.

Loving this tool and hardware!

What is most impressive is not all software can perform this well no matter how much hardware you throw at it!

BRAVO NAGIOS!

Host/Service Check issues after bulk cloning 100+ hosts
