Host/Service Check issues after bulk cloning 100+ hosts
Posted: Wed Nov 12, 2014 3:51 pm
by krobertson71
I'm having some issues after cloning/adding 300+ hosts to my Nagios test environment. The host checks are worrying me, as most of them are an hour overdue and the "Next Update Time" just keeps moving up.
Each host has two service checks, Ping plus a port check (TCP port 22 for SSH, or port 135 for Windows), all set to run every minute.
The service checks are mostly fine; only a small percentage are 20 to 30 minutes old most of the time, instead of 1 or 2 minutes.
The System Performance and Monitoring Performance numbers are spectacular. I'm hoping this data is accurate. It's a pretty beastly machine: dedicated hardware with 24 cores, 128 GB of RAM, and a 2 TB SSD RAID running ESXi. This Nagios XI VM is configured with 4 cores and 8 GB of RAM. Average service check latency is 0.00; it sometimes pops up to 0.06, but rarely.
I have checked the perf logs as well and they are not showing anything alarming.
I am just concerned that these host checks, and some service checks, are not firing when they are supposed to. Could it be due to bulk clone loading 100 at a time?
Here are some screenshots
Re: Host/Service Check issues after bulk cloning 100+ hosts
Posted: Wed Nov 12, 2014 3:54 pm
by krobertson71
More detail on service checks and performance
Let me know if I need to add any logs here.
Re: Host/Service Check issues after bulk cloning 100+ hosts
Posted: Wed Nov 12, 2014 4:03 pm
by krobertson71
Forgot to mention that I grabbed these hosts from our CMDB and purposely included 40 or so that had been retired, so critical host down messages would generate. I did notice this in host-perfdata.log:
Code:
DATATYPE::HOSTPERFDATA TIMET::1415826040 HOSTNAME::PSP01SQLV01 HOSTPERFDATA:: HOSTCHECKCOMMAND::check_xi_host_ping!3000.0!80%!5000.0!100% HOSTSTATE::DOWN HOSTSTATETYPE::HARD HOSTOUTPUT::check_icmp: Failed to resolve PSP01SQLV01
The log was moving right along and seemed to hold here for a bit.
Could this be slowing things down: not being able to resolve hosts, especially 40 or more?
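To see how many hosts are hitting this, the perfdata log can be scanned for the "Failed to resolve" message. This is only a minimal sketch, assuming the log keeps the KEY::VALUE layout shown in the line above; the function name `hosts_failing_dns` is made up for illustration and is not part of Nagios.

```python
import re

# Matches the HOSTNAME::value field of a host perfdata line.
HOSTNAME_RE = re.compile(r"HOSTNAME::(\S+)")


def hosts_failing_dns(lines):
    """Return the sorted, de-duplicated host names whose check output
    contains 'Failed to resolve'."""
    failed = set()
    for line in lines:
        if "Failed to resolve" in line:
            match = HOSTNAME_RE.search(line)
            if match:
                failed.add(match.group(1))
    return sorted(failed)


# Sample line copied from the host-perfdata.log excerpt in this post.
sample = [
    "DATATYPE::HOSTPERFDATA TIMET::1415826040 HOSTNAME::PSP01SQLV01 "
    "HOSTPERFDATA:: HOSTCHECKCOMMAND::check_xi_host_ping!3000.0!80%!5000.0!100% "
    "HOSTSTATE::DOWN HOSTSTATETYPE::HARD "
    "HOSTOUTPUT::check_icmp: Failed to resolve PSP01SQLV01"
]
print(hosts_failing_dns(sample))
```

To run it against the real file, pass an open file handle for host-perfdata.log instead of `sample`.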
Re: Host/Service Check issues after bulk cloning 100+ hosts
Posted: Wed Nov 12, 2014 4:12 pm
by krobertson71
Sorry to keep adding data, but I just grepped the retention.dat file for check_execution_time and here is a sample. It looks really good.
check_execution_time=0.007
check_execution_time=0.010
check_execution_time=0.005
check_execution_time=0.007
check_execution_time=0.005
check_execution_time=0.007
check_execution_time=0.005
check_execution_time=0.006
check_execution_time=3.004
check_execution_time=3.008
check_execution_time=3.004
check_execution_time=3.007
check_execution_time=3.004
check_execution_time=3.007
check_execution_time=0.005
check_execution_time=0.007
check_execution_time=0.005
check_execution_time=0.008
check_execution_time=3.004
check_execution_time=3.015
check_execution_time=0.007
check_execution_time=0.006
check_execution_time=3.004
check_execution_time=3.007
check_execution_time=0.005
check_execution_time=0.007
check_execution_time=0.004
check_execution_time=0.006
check_execution_time=0.004
check_execution_time=0.008
check_execution_time=0.004
check_execution_time=0.006
check_execution_time=0.005
check_execution_time=0.007
check_execution_time=2.007
Latency on almost all of them is 0.00, with two in the file that were 0.47.
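The values around 2 and 3 seconds stand out from the sub-10 ms ones, so it may be worth counting them across the whole file rather than eyeballing a sample. Below is a minimal sketch that tallies the grepped lines; the 1-second `slow_threshold` and the function name are arbitrary choices for illustration, not Nagios settings.

```python
def summarize_execution_times(lines, slow_threshold=1.0):
    """Summarize check_execution_time=<seconds> lines.

    Returns the number of values seen, how many are at or above
    slow_threshold, and the largest value (None if nothing matched).
    """
    times = []
    for line in lines:
        line = line.strip()
        if line.startswith("check_execution_time="):
            times.append(float(line.split("=", 1)[1]))
    slow = [t for t in times if t >= slow_threshold]
    return {
        "count": len(times),
        "slow": len(slow),
        "max": max(times) if times else None,
    }


# A few values taken from the sample above.
sample = [
    "check_execution_time=0.007",
    "check_execution_time=3.004",
    "check_execution_time=0.005",
    "check_execution_time=2.007",
]
print(summarize_execution_times(sample))
```

Feeding it the full retention.dat (opened as a file) gives the same summary for every host and service entry.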
Re: Host/Service Check issues after bulk cloning 100+ hosts
Posted: Wed Nov 12, 2014 4:34 pm
by lmiltchev
Try using the following settings in the "nagios.cfg" file:
Code:
auto_reschedule_checks=1
auto_rescheduling_interval=30
auto_rescheduling_window=45
I believe the default for "auto_rescheduling_window" is 180 seconds, which causes issues with small (1-2 min) check intervals and retries. Modify the "nagios.cfg" file with the settings above, then restart Nagios.
Let us know if this helped.
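Before restarting, it can help to confirm which rescheduling values are actually active in the file, since a commented-out line is easy to mistake for a live one. This is a rough sketch that reads the directives from nagios.cfg text; the function name is hypothetical, and the path in the usage note is the usual source-install location, which may differ on your system.

```python
WANTED = (
    "auto_reschedule_checks",
    "auto_rescheduling_interval",
    "auto_rescheduling_window",
)


def read_rescheduling_settings(cfg_text):
    """Return the auto-rescheduling directives found in nagios.cfg text.

    Blank lines and lines starting with '#' are skipped. If a directive
    appears more than once, the last uncommented occurrence is reported.
    """
    found = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key in WANTED:
            found[key] = value.strip()
    return found


# The three settings suggested in this post, plus a commented-out default.
sample_cfg = """\
auto_reschedule_checks=1
auto_rescheduling_interval=30
#auto_rescheduling_window=180
auto_rescheduling_window=45
"""
print(read_rescheduling_settings(sample_cfg))
```

On a typical install you would pass in the contents of /usr/local/nagios/etc/nagios.cfg (an assumed path) instead of `sample_cfg`.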
Re: Host/Service Check issues after bulk cloning 100+ hosts
Posted: Wed Nov 12, 2014 5:39 pm
by krobertson71
Actually, I changed the setting to
auto_rescheduling_window=40
and BAAAMMMMM!!!!!!
I feel stupid, because after I read your post it hit me that someone was talking about this at the conference. Too much data, too little brain matter.
Here is a screenshot of everything now. CPU barely moved (the screenshot shows 10, but it is mainly under 8). Service and host checks are still at 0.00 or 0.01, with 700+ service checks per minute and 358 host checks per minute.
All of this is running on a 4-CPU 2.5 GHz, 8 GB RAM VM, which sits on an SSD RAID. The dedicated ESX host for this has 24 cores, 128 GB of RAM, a 2 TB SSD RAID, and a 3 TB 15k RAID (the Nagios log server is going there).
I know people have much bigger deployments, but this is just a stress test on a small VM, and I wanted to see how it would hold up checking this much every single minute.
In your opinion, am I right to think we will be able to run a hell of a lot of checks when we go to production, with average check intervals of 3 to 6 minutes and more CPU and RAM added?
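As a back-of-the-envelope estimate, the numbers above work out to roughly 700 + 358 = 1058 checks per minute. The sketch below assumes that rate stays the same at longer intervals and that capacity scales linearly, which ignores heavier plugins, perfdata processing, and retries, so treat the result as a rough guide only. The function name is just for illustration.

```python
def checks_supported(checks_per_minute, interval_minutes):
    """Number of distinct checks that fit if each runs once per interval
    and the scheduler sustains checks_per_minute executions."""
    return checks_per_minute * interval_minutes


# Approximate rate from this test: ~700 service + 358 host checks per minute.
observed_rate = 700 + 358

for interval in (3, 6):
    print(interval, "min interval:", checks_supported(observed_rate, interval), "checks")
```

That is before adding any CPU or RAM to the VM.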
Re: Host/Service Check issues after bulk cloning 100+ hosts
Posted: Wed Nov 12, 2014 5:41 pm
by abrist
Indeed.

Re: Host/Service Check issues after bulk cloning 100+ hosts
Posted: Wed Nov 12, 2014 5:43 pm
by krobertson71
Oh and the dashboard loading time is SICK.
I can load the host dashboard, with all hosts and service checks showing, in about 2 to 3 seconds.
Loving this tool and hardware!
What is most impressive is that not all software can perform this well, no matter how much hardware you throw at it!
BRAVO NAGIOS!