Re: OCSP problem after upgrade to Nagios 4.0 (NSCA)

Posted: Fri Nov 08, 2013 11:16 am
by MalcolmPreen
The master system is monitoring about 3000 services - and these are being reported on every 10 minutes or so... so I'd guesstimate about 1500 results every 5 minutes.

Each of the monitoring servers (7 in total at present) reports some of these... but some are reporting about 100, and others nearer to 900.

I've currently got three master servers... and each monitoring server reports to each of them, using a customised version of the script:

contrib/eventhandlers/distributed-monitoring/submit_check_result_via_nsca
supplied with the Nagios source distribution.

The reason I believe it is a local resource issue is that I'm timing each send_nsca command to each host... and recording these timings in a log file...
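For anyone wanting to reproduce this kind of timing, here is a minimal Python sketch, not the customised script described above. It assumes results are piped to send_nsca on stdin in the usual tab-delimited form; the function names and the 2-second "slow" threshold are made up for illustration.

```python
import subprocess
import time


def format_check_result(host, service, return_code, output):
    """Build one tab-delimited passive service result line for send_nsca."""
    return f"{host}\t{service}\t{return_code}\t{output}\n"


def timed_send(argv, payload):
    """Run argv with payload on stdin; return (exit_code, elapsed_seconds)."""
    start = time.monotonic()
    proc = subprocess.run(
        argv,
        input=payload.encode(),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return proc.returncode, time.monotonic() - start


def log_line(target, exit_code, seconds, slow_after=2.0):
    """Format one log entry, flagging sends slower than slow_after seconds."""
    flag = "SLOW" if seconds > slow_after else "ok"
    return f"{target} rc={exit_code} elapsed={seconds:.3f}s {flag}"
```

A call might look like `timed_send(["send_nsca", "-H", "master1", "-c", "/path/to/send_nsca.cfg"], format_check_result("web01", "HTTP", 0, "OK"))`, where `master1`, `web01` and the config path are placeholders. Logging one line per master makes it easy to see whether only a single target is slow.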

Most of the time, it is only one host which is having delays.... and I have managed to stop the errors by halving the number of NSCA messages being sent to the "problem" host...

Obviously, halving the number of messages means I don't get the whole picture... so it's not a real solution.

The investigation continues....

Malcolm

Re: OCSP problem after upgrade to Nagios 4.0 (NSCA)

Posted: Fri Nov 08, 2013 11:35 am
by abrist
Sounds good. Is the server that is experiencing delays under heavy load or heavy network usage?

Re: OCSP problem after upgrade to Nagios 4.0 (NSCA)

Posted: Fri Nov 08, 2013 11:50 am
by MalcolmPreen
The load should be no different to the other masters... the major difference is that it is a physical server... as opposed to a VMware virtual host.

My fear is that it may just "not be up for the task".... but whether that is CPU, memory, or network (or a combination) is what I need to determine next.

Re: OCSP problem after upgrade to Nagios 4.0 (NSCA)

Posted: Fri Nov 08, 2013 11:58 am
by abrist
Alright. Keep us posted. As the issues are particular to a server that differs from the other servers, I think you are on the right track. If you need a hand drilling down into the performance problems, let us know.

Re: OCSP problem after upgrade to Nagios 4.0 (NSCA)

Posted: Fri Nov 22, 2013 9:52 am
by MalcolmPreen
OK... I think I have a resolution...

I compared the configuration (CPUs, memory and network) between the two servers - they weren't identical... but they were in the same ballpark...

So I attempted a couple of Nagios tuning recommendations... "large installation tweaks"... and "nagios using a ram-disk".

It certainly didn't fix the problem... although it might have made some improvement...

The problem only seemed to be nsca... so I stepped back and checked the config files...

Both systems were identical... but I did notice that both had debug=1 (a holdover from the initial set-up).

So... I changed the nsca.cfg for the problem system to have debug=0 - and almost immediately the problem was gone....

My understanding is that nsca debug info is sent to syslog (and ends up in /var/log/messages)... so I guess there could be more disk latency for the disk on the physical system...
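A quick way to audit several masters for a leftover debug=1 is to read the setting out of each nsca.cfg. The sketch below is only an illustration: it assumes simple key=value lines with # comments, treats a missing debug line as off, and lets the last assignment win, which may not match how nsca itself resolves duplicates. The function names are hypothetical.

```python
def read_setting(config_text, key, default=None):
    """Return the last value assigned to key in key=value config text."""
    value = default
    for raw in config_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        name, sep, rest = line.partition("=")
        if sep and name.strip() == key:
            value = rest.strip()
    return value


def debug_enabled(config_text):
    """True when the config text turns nsca debugging on (debug is not 0)."""
    return read_setting(config_text, "debug", "0") != "0"
```

Feeding it the contents of each server's nsca.cfg (for example via `open(path).read()`, with `path` pointing wherever the file lives on that system) shows at a glance which daemons are still logging every packet to syslog.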

So... for the time being... the problem is gone... I wonder how many more hosts / services it will take to either break the virtual server (for which I now have a "fix")... or to break the physical server...

Thanks for listening, Malcolm

Re: OCSP problem after upgrade to Nagios 4.0 (NSCA)

Posted: Fri Nov 22, 2013 12:14 pm
by abrist
MalcolmPreen wrote: OK... I think I have a resolution...
Awesome!
MalcolmPreen wrote: So... for the time being... the problem is gone... I wonder how many more hosts / services it will take to either break the virtual server (for which I now have a "fix")... or to break the physical server...
All you can do is add objects and watch disk I/O and CPU load. I suggest you set up checks for the Nagios server itself for these metrics (if you have not done so already) so that when you have a future problem, you have historical data to help with the hunt.
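As a rough illustration of such a self-check, here is a minimal Python sketch of a load-average plugin. It follows the standard plugin conventions (exit code 0/1/2 for OK/WARNING/CRITICAL, performance data after the pipe), but the thresholds and function names are assumptions, and the stock check_load plugin already covers this in practice.

```python
import os

OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def evaluate(value, warn, crit):
    """Map a measurement onto a Nagios status code."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def plugin_line(load1, warn, crit):
    """Return (status, output) with perfdata after the pipe for graphing."""
    status = evaluate(load1, warn, crit)
    text = f"LOAD {LABELS[status]} - load1={load1:.2f}"
    perfdata = f"load1={load1:.2f};{warn};{crit};0"
    return status, f"{text} | {perfdata}"


def main(warn=4.0, crit=8.0):
    """Check the 1-minute load average; thresholds here are placeholders."""
    status, line = plugin_line(os.getloadavg()[0], warn, crit)
    print(line)
    return status
```

Run as a plugin, the script would end with `sys.exit(main())` so Nagios sees the status code. The perfdata portion is what gives you the historical trend to look back on when the next slowdown appears.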