Re: Hundreds of Active Check result files in /tmp

Posted: Tue Oct 09, 2012 3:29 pm
by jbennett
No errors noticed. Here is the output from the log:

Code:

[root@nagiosxivm ~]# tail -f /var/log/mysqld.log
121009 14:25:57 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.61'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
121009 15:26:35 [Note] /usr/libexec/mysqld: Normal shutdown

121009 15:26:35 [Note] Event Scheduler: Purging the queue. 0 events
121009 15:26:37  InnoDB: Starting shutdown...
121009 15:26:39  InnoDB: Shutdown completed; log sequence number 0 44263
121009 15:26:39 [Note] /usr/libexec/mysqld: Shutdown complete

121009 15:26:39 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
121009 15:27:53 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
121009 15:27:53  InnoDB: Initializing buffer pool, size = 8.0M
121009 15:27:53  InnoDB: Completed initialization of buffer pool
121009 15:27:53  InnoDB: Started; log sequence number 0 44263
121009 15:27:53 [Note] Event Scheduler: Loaded 0 events
121009 15:27:53 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.61'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution

Re: Hundreds of Active Check result files in /tmp

Posted: Wed Oct 10, 2012 9:47 am
by mguthrie
Would you be interested in scheduling a remote session for this? I'd be available either this afternoon between 1-3pm CST (UTC -06:00) or anytime tomorrow between 9:30am-3:30pm.

Re: Hundreds of Active Check result files in /tmp

Posted: Wed Oct 10, 2012 9:54 am
by jbennett
I am sending you a PM.

Re: Hundreds of Active Check result files in /tmp

Posted: Wed Oct 10, 2012 9:54 am
by mguthrie
Sounds good, we'll follow up there.

Re: Hundreds of Active Check result files in /tmp

Posted: Wed Oct 10, 2012 12:05 pm
by mguthrie
OK, just a follow-up for everyone reading along on this. The large numbers of check files in the /tmp directory are generated by the Nagios Core daemon itself. Some plugins utilize a temp file, and the Core process is supposed to clean them up, but for some reason it isn't. This could happen if the checks are being forcibly terminated on a regular basis, or possibly because of some sort of permissions problem.

Can you post your nagios.log file? If it's very large, a recent chunk of it will do:

Code:

tail -1000 /usr/local/nagios/var/nagios.log > mynagios.log
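
While gathering the log, it can also help to keep an eye on how many leftover files there are and how old they are. A minimal sketch, assuming the stray files sit directly in /tmp and have names starting with "check" (both are assumptions, so adjust the directory and pattern to whatever you actually see). count_stale_files is a made-up helper name, not part of Nagios:

```shell
# count_stale_files DIR PATTERN MINUTES
# Counts regular files directly inside DIR whose names match PATTERN and
# that were last modified more than MINUTES minutes ago.
# The defaults (/tmp, "check*", 60) are assumptions for illustration only.
count_stale_files() {
    dir="${1:-/tmp}"
    pattern="${2:-check*}"
    minutes="${3:-60}"
    find "$dir" -maxdepth 1 -type f -name "$pattern" -mmin +"$minutes" | wc -l | tr -d ' '
}

# Example: files in /tmp matching check* that are older than one hour
count_stale_files /tmp 'check*' 60
```

A count that keeps climbing between runs would suggest the files are not being cleaned up.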

Re: Hundreds of Active Check result files in /tmp

Posted: Wed Oct 10, 2012 1:47 pm
by jbennett
Since you mentioned plug-ins, I should note that I did update the plug-ins package after I updated Nagios XI. Could something in there be causing this? check_icmp or check_ping maybe?

Re: Hundreds of Active Check result files in /tmp

Posted: Wed Oct 10, 2012 1:58 pm
by mguthrie
So maybe you can help us understand a bit more about how your checks are set up. What I'm seeing is that you have passive checks enabled for a large number of your hosts and services, but they all appear to be showing up as stale on a regular basis, so freshness checks are being initiated quite frequently. Can you send a copy of your configuration snapshot tarball, either in a PM or to [email protected] (if you have an XI Support contract)?

Re: Hundreds of Active Check result files in /tmp

Posted: Wed Oct 10, 2012 2:41 pm
by jbennett
mguthrie wrote: So maybe you can help us understand a bit more about how your checks are set up. What I'm seeing is that you have passive checks enabled for a large number of your hosts and services, but they all appear to be showing up as stale on a regular basis, so freshness checks are being initiated quite frequently. Can you send a copy of your configuration snapshot tarball, either in a PM or to [email protected] (if you have an XI Support contract)?
Sending a PM now.

Re: Hundreds of Active Check result files in /tmp

Posted: Thu Oct 11, 2012 10:01 am
by mguthrie
OK, having reviewed your configs, I think it would be worthwhile to spend a little time with the following docs, because there seems to be some confusion about what some of the config directives do, and about active vs. passive checks.

http://nagios.sourceforge.net/docs/3_0/ ... .html#host
http://nagios.sourceforge.net/docs/3_0/ ... ml#service

name xiwizard_ITS_Camera_host
alias
check_command check_xi_host_ping!3000.0!80%!5000.0!100%
use xiwizard_generic_host
max_check_attempts 1000
check_interval 10
retry_interval 5
active_checks_enabled 1
check_period 24x7
check_freshness 1
freshness_threshold 1800

So I'm seeing settings like this on several of the templates, and I'm honestly not quite sure what kind of effect this will have on the monitoring engine, other than to say your results will be... unpredictable.

The only time you want to utilize freshness checking is if you're using purely passive checks. If you've got a passive check that Nagios is simply waiting for results for, the freshness check can be used to trigger an alert if the results are stale. You should never use freshness checking with active checks.

Max check attempts is how many times Nagios will retry a check if it detects a problem.
X = max_check_attempts
Y = retry_interval

If Nagios detects a problem, it will retry the check every Y minutes, up to X times, to determine whether the problem persists. If the host or service stays in a problem state for X checks, an alert will be sent. The way things are set up right now on the system creates an enormous number of retries, and it seems like the setting that you might actually want for some of these is simply:

notifications_enabled 0
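
To put rough numbers on the retry behaviour, here is a back-of-the-envelope calculation using the template values quoted above (max_check_attempts 1000, retry_interval 5). It assumes the default interval_length of 60 seconds, so intervals are in minutes, and that the initial failed check counts as the first attempt, leaving X - 1 retries. minutes_to_hard_state is just an illustrative helper name:

```shell
# minutes_to_hard_state MAX_CHECK_ATTEMPTS RETRY_INTERVAL
# Approximate minutes spent retrying before the problem is confirmed,
# assuming the first failed check is attempt 1 and intervals are minutes.
minutes_to_hard_state() {
    echo $(( ($1 - 1) * $2 ))
}

minutes_to_hard_state 1000 5   # prints 4995 (roughly 83 hours of retries)
minutes_to_hard_state 10 5     # prints 45
```

With the current values, a host could sit in a problem state for more than three days before any alert goes out, which is probably not what was intended.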

So, as for where to go from here, I would:
- Stop your monitoring engine.
- Delete /usr/local/nagios/var/retention.dat
- Remove ALL freshness checking from all templates and objects.
- Revise max_check_attempts on your templates and objects. I would recommend against setting max_check_attempts higher than 10 if you can help it; otherwise you're just wasting resources on the monitoring engine.
- Although there are exceptions, most of the time retries should happen inside the regular check interval. If you're checking the host/service every 10 minutes, it doesn't make a lot of sense to have 60 minutes' worth of retries at 5-minute intervals. If you want a 60-minute buffer before notifications go out, use first_notification_delay instead.
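
If it helps with the cleanup, the two directive checks in the list above could be scripted roughly like this. It is only a sketch: flag_directives is a hypothetical helper, the threshold of 10 comes from the recommendation above, and you would pass it the paths of your own template and object config files:

```shell
# flag_directives FILE...
# Prints file:line for every "check_freshness 1" and for every
# max_check_attempts value above 10 found in the given config files.
flag_directives() {
    awk '
        $1 == "check_freshness" && $2 == 1 {
            print FILENAME ":" FNR ": " $1 " " $2
        }
        $1 == "max_check_attempts" && $2 + 0 > 10 {
            print FILENAME ":" FNR ": " $1 " " $2
        }
    ' "$@"
}
```

Running it against a file that contains the template quoted earlier should flag both the max_check_attempts 1000 line and the check_freshness 1 line. It only reports; the edits themselves are still done by hand or through the XI interface.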

Re: Hundreds of Active Check result files in /tmp

Posted: Thu Oct 11, 2012 1:45 pm
by jbennett
I can't thank you enough for your time on this.

As I have stated previously, I picked this box up from a previous admin who didn't share any details.

I went into it assuming he had everything set up correctly and didn't backtrack through his work. That's my fault and I should have known better.

I have adjusted all templates to fall within the bounds you have suggested.

I now plan to go back through and re-activate, in small batches, the 1500+ hosts that I had deactivated, to ensure everything runs smoothly.