checkresults not being processed after upgrade to 2014R2.0

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
Locked
nagiosadmin42
Posts: 96
Joined: Sat Feb 11, 2012 2:16 pm

checkresults not being processed after upgrade to 2014R2.0

Post by nagiosadmin42 »

I noticed that our passive service results are no longer being processed since our upgrade from 2012 to 2014R2.0.

NOTE: our NRDP version is 1.2 (from /usr/local/nrdp/server/config.inc.php)

I've done the usual debugging of npcd and process_perfdata.pl, and am not seeing any of the passive services being processed.

That led me to checking the spool directory, and sure enough, the checkresults folder contained 800k files!

So I whacked them and have been trying to figure out why they're not being processed. I don't see any errors in either npcd.log or perfdata.log.
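
For anyone measuring a similar backlog, here is a minimal sketch that counts pending result files and their ".ok" markers. The helper name `count_spool_backlog` is made up for illustration, and it assumes a `find` that supports `-maxdepth` (GNU and BSD both do).

```shell
# count_spool_backlog DIR
# Prints "<data files> <.ok markers>" for the files sitting directly in DIR.
# find -L is used so that a symlinked spool directory is followed.
count_spool_backlog() {
    dir=$1
    total=$(find -L "$dir" -maxdepth 1 -type f | wc -l)
    ok=$(find -L "$dir" -maxdepth 1 -type f -name '*.ok' | wc -l)
    echo "$((total - ok)) $((ok))"
}
```

Calling `count_spool_backlog /usr/local/nagios/var/spool/checkresults` a few times, some seconds apart, should show whether the numbers keep growing or are being drained.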

The permissions of the parent folder are:

Code: Select all

# ls -dl /usr/local/nagios/var/spool/
drwxrwxr-x 2 nagios nagcmd 1880 Dec  3 13:07 checkresults
And data files within checkresults are:

Code: Select all

-rwxrwx--- 1 apache nagcmd 260 Dec  3 12:59 cZZv9JS
-rw-r--r-- 1 apache apache   0 Dec  3 12:59 cZZv9JS.ok
And I verified that the nagios user is part of the nagcmd group:

Code: Select all

# grep nagios /etc/group
nagios:x:501:nagios,apache
nagcmd:x:502:nagios,apache
Here's the output of the npcd service start (with debug enabled):

Code: Select all

# service npcd start
DEBUG: Config File = /usr/local/nagios/etc/pnp/npcd.cfg
CONFIG_OPT_LOGTYPE = file
CONFIG_OPT_LOGFILE = /usr/local/nagios/var/npcd.log
CONFIG_OPT_LOGFILESIZE = 10485760
CONFIG_OPT_LOGLEVEL = -1
CONFIG_OPT_SCANDIR = /usr/local/nagios/var/spool/perfdata/
CONFIG_OPT_RUNCMD = /usr/local/nagios/libexec/process_perfdata.pl
CONFIG_OPT_RUNCMD_ARG = -b
CONFIG_OPT_MAXTHREADS = 5
CONFIG_OPT_LOAD = 50.0
CONFIG_OPT_USER = nagios
CONFIG_OPT_GROUP = nagios
CONFIG_OPT_PIDFILE = /usr/local/nagiosxi/var/subsys/npcd.pid
CONFIG_OPT_SLEEPTIME = 15
CONFIG_OPT_IDENTMYSELF = (null)
---------------------------
DEBUG: load_threshold is enabled - ('50.000000')
NPCD started.

Why is the .ok file owned by "apache apache" instead of "apache nagcmd" like the corresponding data file?

What else should I check? Thanks for any assistance!

Alan
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: checkresults not being processed after upgrade to 2014R2

Post by sreinhardt »

npcd and perfdata.log don't actually look at check results; Nagios Core reaps those directly. Since the other parts of Nagios sound like they are working correctly, please paste your nagios.cfg. It sounds like it might be an issue with the core worker picking the files up or deleting them. Once we see the config, we will likely move on to debug logging if it doesn't reveal the issue right away.
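
A minimal sketch for pulling just the check-result settings out of that file, in case the whole config is too long to read through. `show_reaper_settings` is a hypothetical helper name; the four setting names are the ones that appear in the nagios.cfg posted further down this thread.

```shell
# show_reaper_settings CFG
# Prints the active (uncommented) check-result settings from a
# nagios.cfg-style file; commented-out lines are skipped by the ^ anchor.
show_reaper_settings() {
    grep -E '^(check_result_path|check_result_reaper_frequency|max_check_result_reaper_time|max_check_result_file_age)=' "$1"
}
```

For example: `show_reaper_settings /usr/local/nagios/etc/nagios.cfg`.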
Nagios-Plugins maintainer exclusively, unless you have other C language bugs with open-source nagios projects, then I am happy to help! Please pm or use other communication to alert me to issues as I no longer track the forum.
nagiosadmin42
Posts: 96
Joined: Sat Feb 11, 2012 2:16 pm

Re: checkresults not being processed after upgrade to 2014R2

Post by nagiosadmin42 »

Here's our nagios.cfg file.

Note that I've been tweaking log/debug settings in various files.

Also, I had gotten focused on the low-level details, and forgotten to look at the GUI. A co-worker let me know that the graphs now have data, I think due to the permissions change I made to the checkresults folder.

However, while it appears the data is being ingested (I need to confirm this), the checkresults files are NOT being deleted.

Code: Select all

# MODIFIED
admin_email=root@localhost
admin_pager=root@localhost
translate_passive_host_checks=1
log_event_handlers=0
use_large_installation_tweaks=1
enable_environment_macros=0


# NDOUtils module
broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg


# PNP settings - bulk mode with NPCD
#
# ADE 08/23/2012 PNP Comments added from http://bitflip.net/files/pnp4nagios-presentation-20090409.pdf
#

# PROCESS PERFORMANCE DATA OPTION
# This determines whether or not Nagios will process performance
# data returned from service and host checks. If this option is
# enabled, host performance data will be processed using the
# host_perfdata_command (defined below) and service performance
# data will be processed using the service_perfdata_command (also
# defined below). Read the HTML docs for more information on
# performance data.
# Values: 1 = process performance data, 0 = do not process performance data

process_performance_data=1

# HOST AND SERVICE PERFORMANCE DATA FILES
# These files are used to store host and service performance data.
# Performance data is only written to these files if the
# process_performance_data option (above) is set to 1.

host_perfdata_file=/usr/local/nagios/var/host-perfdata
service_perfdata_file=/usr/local/nagios/var/service-perfdata

# HOST AND SERVICE PERFORMANCE DATA FILE TEMPLATES
# These options determine what data is written (and how) to the
# performance data files. The templates may contain macros, special
# characters (\t for tab, \r for carriage return, \n for newline)
# and plain text. A newline is automatically added after each write
# to the performance data file. Some examples of what you can do are
# shown below.

host_perfdata_file_template=DATATYPE::HOSTPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tHOSTPERFDATA::$HOSTPERFDATA$\tHOSTCHECKCOMMAND::$HOSTCHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tHOSTOUTPUT::$HOSTOUTPUT$
service_perfdata_file_template=DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\tSERVICEPERFDATA::$SERVICEPERFDATA$\tSERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tSERVICESTATE::$SERVICESTATE$\tSERVICESTATETYPE::$SERVICESTATETYPE$\tSERVICEOUTPUT::$SERVICEOUTPUT$

# HOST AND SERVICE PERFORMANCE DATA FILE MODES
# This option determines whether or not the host and service
# performance data files are opened in write ("w") or append ("a")
# mode. If you want to use named pipes, you should use the special
# pipe ("p") mode which avoids blocking at startup, otherwise you will
# likely want the default append ("a") mode.

host_perfdata_file_mode=a
service_perfdata_file_mode=a

# HOST AND SERVICE PERFORMANCE DATA FILE PROCESSING INTERVAL
# These options determine how often (in seconds) the host and service
# performance data files are processed using the commands defined
# below. A value of 0 indicates the files should not be periodically
# processed.

host_perfdata_file_processing_interval=15
service_perfdata_file_processing_interval=15

# HOST AND SERVICE PERFORMANCE DATA FILE PROCESSING COMMANDS
# These commands are used to periodically process the host and
# service performance data files. The interval at which the
# processing occurs is determined by the options above.
# Commands are defined in /usr/local/nagios/etc/objects/commands.cfg

host_perfdata_file_processing_command=process-host-perfdata-file-bulk
service_perfdata_file_processing_command=process-service-perfdata-file-bulk



# OBJECTS - UNMODIFIED
#cfg_file=/usr/local/nagios/etc/objects/commands.cfg
#cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
#cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
#cfg_file=/usr/local/nagios/etc/objects/templates.cfg
#cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg


# STATIC OBJECT DEFINITIONS (THESE DON'T GET EXPORTED/IMPORTED BY NAGIOSQL)
cfg_dir=/usr/local/nagios/etc/static

# OBJECTS EXPORTED FROM NAGIOSQL
cfg_file=/usr/local/nagios/etc/contacttemplates.cfg
cfg_file=/usr/local/nagios/etc/contactgroups.cfg
cfg_file=/usr/local/nagios/etc/contacts.cfg
cfg_file=/usr/local/nagios/etc/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/commands.cfg
cfg_file=/usr/local/nagios/etc/hostgroups.cfg
cfg_file=/usr/local/nagios/etc/servicegroups.cfg
cfg_file=/usr/local/nagios/etc/hosttemplates.cfg
cfg_file=/usr/local/nagios/etc/servicetemplates.cfg
cfg_file=/usr/local/nagios/etc/servicedependencies.cfg
cfg_file=/usr/local/nagios/etc/serviceescalations.cfg
cfg_file=/usr/local/nagios/etc/hostdependencies.cfg
cfg_file=/usr/local/nagios/etc/hostescalations.cfg
cfg_file=/usr/local/nagios/etc/hostextinfo.cfg
cfg_file=/usr/local/nagios/etc/serviceextinfo.cfg
cfg_dir=/usr/local/nagios/etc/hosts
cfg_dir=/usr/local/nagios/etc/services

# GLOBAL EVENT HANDLERS
global_host_event_handler=xi_host_event_handler
global_service_event_handler=xi_service_event_handler



# UNMODIFIED
accept_passive_host_checks=1
accept_passive_service_checks=1
additional_freshness_latency=15
auto_reschedule_checks=1
auto_rescheduling_interval=30
auto_rescheduling_window=45
bare_update_check=0
cached_host_check_horizon=15
cached_service_check_horizon=15
check_external_commands=1
check_for_orphaned_hosts=1
check_for_orphaned_services=1
check_for_updates=1
# ADE 10/11/2013 enable host freshness!
check_host_freshness=1
check_result_path=/usr/local/nagios/var/spool/checkresults
#ADE 02/14/12 reaper settings updated based on http://assets.nagios.com/downloads/nagiosxi/docs/Maximizing_XI_Performance.pdf
#check_result_reaper_frequency=10
check_result_reaper_frequency=3
check_service_freshness=1
# Note: By setting command_check_interval to -1, Nagios will check for external commands as often as possible.
#command_check_interval=-1
command_file=/usr/local/nagios/var/rw/nagios.cmd
daemon_dumps_core=0
date_format=us
debug_file=/usr/local/nagios/var/nagios.debug
debug_level=0
debug_verbosity=1
#enable_embedded_perl=1
enable_event_handlers=1
enable_flap_detection=1
enable_notifications=1
enable_predictive_host_dependency_checks=1
enable_predictive_service_dependency_checks=1
event_broker_options=-1
event_handler_timeout=30
execute_host_checks=1
execute_service_checks=1
# ADE 12/2/2014 external_command_buffer_slots is deprecated, gives this warning on nagios startup:
# Warning: external_command_buffer_slots is deprecated and will be removed. All commands are always processed upon arrival
#external_command_buffer_slots=4096
high_host_flap_threshold=20.0
high_service_flap_threshold=20.0
host_check_timeout=30
host_freshness_check_interval=60
host_inter_check_delay_method=s
illegal_macro_output_chars=`~$&|'"<>
illegal_object_name_chars=`~!$%^&*|'"<>?,()=
interval_length=60
lock_file=/usr/local/nagios/var/nagios.lock
log_archive_path=/usr/local/nagios/var/archives
#xxx ADE 12/3/2014 set log_external_commands=1
log_external_commands=1
log_file=/usr/local/nagios/var/nagios.log
log_host_retries=1
log_initial_states=0
log_notifications=1
#xxx ADE 12/3/2014 set log_passive_checks=1
log_passive_checks=1
log_rotation_method=d
log_service_retries=1
low_host_flap_threshold=5.0
low_service_flap_threshold=5.0
# ADE 11/12/2014 set max_check_result_file_age to zero so we don't delete check result files if the system experiences a glitch
#max_check_result_file_age=3600
max_check_result_file_age=0
#ADE 02/14/12 reaper settings updated based on http://assets.nagios.com/downloads/nagiosxi/docs/Maximizing_XI_Performance.pdf
#max_check_result_reaper_time=30
max_check_result_reaper_time=10
max_concurrent_checks=0
max_debug_file_size=1000000
max_host_check_spread=30
max_service_check_spread=30
nagios_group=nagios
nagios_user=nagios
notification_timeout=30
object_cache_file=/usr/local/nagios/var/objects.cache
obsess_over_hosts=0
obsess_over_services=0
ocsp_timeout=5
#p1_file=/usr/local/nagios/bin/p1.pl
passive_host_checks_are_soft=0
perfdata_timeout=5
precached_object_file=/usr/local/nagios/var/objects.precache
resource_file=/usr/local/nagios/etc/resource.cfg
retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0
retained_host_attribute_mask=0
retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0
retained_service_attribute_mask=0
retain_state_information=1
retention_update_interval=60
service_check_timeout=60
service_freshness_check_interval=60
service_inter_check_delay_method=s
service_interleave_factor=s
#sleep_time=0.25
soft_state_dependencies=0
state_retention_file=/usr/local/nagios/var/retention.dat
status_file=/usr/local/nagios/var/status.dat
status_update_interval=10
temp_file=/usr/local/nagios/var/nagios.tmp
temp_path=/tmp
use_aggressive_host_checking=0
####use_embedded_perl_implicitly=1
use_regexp_matching=0
use_retained_program_state=1
use_retained_scheduling_info=1
use_syslog=1
use_true_regexp_matching=0
nagiosadmin42
Posts: 96
Joined: Sat Feb 11, 2012 2:16 pm

Re: checkresults not being processed after upgrade to 2014R2

Post by nagiosadmin42 »

Regarding my earlier note about the file permissions within the checkresults folder:

Code: Select all

-rwxrwx--- 1 apache nagcmd 312 Dec  3 11:54 cZzzOpT
-rw-r--r-- 1 apache apache   0 Dec  3 11:54 cZzzOpT.ok
I thought this might be part of the problem, i.e. that the ".ok" file is owned by the apache group, so I modified the NRDP plugin script to set the same file permissions on the ".ok" file as the related data file:

Code: Select all

# diff nagioscorepassivecheck.inc.php.ORIG nagioscorepassivecheck.inc.php
124c124,125
<               $fh=fopen($tmpname.".ok","w+");
---
>               $ok_filename = $tmpname . ".ok";
>               $fh=fopen($ok_filename,"w+");
125a127,128
>               chgrp($ok_filename,$cfg["nagios_command_group"]);
>               chmod($ok_filename,0770);
147c150
< ?>
\ No newline at end of file
---
> ?>
Files are now created with the same permissions:

Code: Select all

-rwxrwx--- 1 apache nagcmd 282 Dec  3 14:22 cZzWNCD
-rwxrwx--- 1 apache nagcmd   0 Dec  3 14:22 cZzWNCD.ok
However, this doesn't seem to have resolved the issue of the files not being deleted from the checkresults folder.
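
To confirm that no mismatched pairs are left after a change like this, something along these lines could be used. It is only a sketch: `list_group_mismatches` is an illustrative name, and it assumes GNU `stat` (the `-c` option).

```shell
# list_group_mismatches DIR
# Prints each ".ok" marker in DIR whose group differs from the group of
# its data file (the same name without the ".ok" suffix).
# Assumes GNU stat, where -c '%G' prints the group name.
list_group_mismatches() {
    for ok in "$1"/*.ok; do
        [ -e "$ok" ] || continue        # glob matched nothing
        data=${ok%.ok}
        [ -e "$data" ] || continue      # marker without a data file
        if [ "$(stat -c '%G' "$ok")" != "$(stat -c '%G' "$data")" ]; then
            echo "$ok"
        fi
    done
}
```

An empty result from `list_group_mismatches /usr/local/nagios/var/spool/checkresults` means every marker shares its data file's group.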
nagiosadmin42
Posts: 96
Joined: Sat Feb 11, 2012 2:16 pm

Re: checkresults not being processed after upgrade to 2014R2

Post by nagiosadmin42 »

FYI, while we're investigating this issue, I've added a cron job that runs every minute to delete checkresults files older than five minutes:

Code: Select all

#!/bin/sh

# Get rid of result files older than five minutes
find -L /usr/local/nagios/var/spool/checkresults -mmin +5 -type f -exec rm -f {} \;
Oh crap... having to add the "-L" option to the find command in this script reminded me that we had recently implemented use of the RAM disk. Instead of setting up a separate mount point on tmpfs as detailed in your document Utilizing_A_RAM_Disk_In_NagiosXI.pdf, we simply changed the three spool subdirectories to be symbolic links to /dev/shm/ where tmpfs is mounted:

Code: Select all

service nagios stop
service npcd stop
service ndo2db stop

cd /usr/local/nagios/var/spool

mv checkresults /dev/shm/
ln -s /dev/shm/checkresults checkresults

mv perfdata /dev/shm/
ln -s /dev/shm/perfdata perfdata

mv xidpe /dev/shm/
ln -s /dev/shm/xidpe xidpe

service nagios start
service npcd start
service ndo2db start
And then in the init.d script added code to re-create these on boot:

Code: Select all

# Re-create the tmpfs-backed spool directories if they are missing after a reboot.
# chown uses the owner:group form; the older owner.group form is deprecated.
if [ ! -d /dev/shm/checkresults ]; then
    mkdir /dev/shm/checkresults
    chown nagios:nagcmd /dev/shm/checkresults
    chmod 775 /dev/shm/checkresults
fi
if [ ! -d /dev/shm/perfdata ]; then
    mkdir /dev/shm/perfdata
    chown nagios:nagios /dev/shm/perfdata
    chmod 775 /dev/shm/perfdata
fi
if [ ! -d /dev/shm/xidpe ]; then
    mkdir /dev/shm/xidpe
    chown nagios:nagios /dev/shm/xidpe
    chmod 775 /dev/shm/xidpe
fi
This had been working prior to the 2014R2.0 upgrade.
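
As an aside, the three near-identical stanzas could be folded into one helper. This is a sketch only; `ensure_spool_dir` is a made-up name and is not part of the Nagios init script.

```shell
# ensure_spool_dir DIR OWNER:GROUP MODE
# Creates DIR with the given ownership and mode if it is missing.
# Does nothing when DIR already exists, like the stanzas above.
ensure_spool_dir() {
    [ -d "$1" ] && return 0
    mkdir "$1" && chown "$2" "$1" && chmod "$3" "$1"
}
```

The boot-time code would then be three calls, for example `ensure_spool_dir /dev/shm/checkresults nagios:nagcmd 775`, with `nagios:nagios` for the perfdata and xidpe directories.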

So, is this the smoking gun? Does the symbolic link for checkresults somehow cause the Nagios reaper problems?
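
One quick way to rule the link itself in or out is to check what the path actually resolves to. A minimal sketch, with `describe_spool_dir` as a hypothetical helper name; it relies on `readlink`, which is not in POSIX but ships with most Linux systems.

```shell
# describe_spool_dir PATH
# Prints "directory", "symlink -> TARGET", or "missing".
# Returns non-zero when PATH does not resolve to a directory.
describe_spool_dir() {
    path=$1
    if [ ! -d "$path" ]; then       # -d follows symlinks, so a dangling link lands here
        echo "missing"
        return 1
    fi
    if [ -L "$path" ]; then
        echo "symlink -> $(readlink "$path")"
    else
        echo "directory"
    fi
}
```

Running it against each of the three spool paths right after boot should show whether the /dev/shm targets exist before Nagios starts.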
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: checkresults not being processed after upgrade to 2014R2

Post by sreinhardt »

Nah, I would highly doubt that it's a symlinking issue; that's pretty commonplace. If I had to guess any one thing from that config, I'd go with this guy:

Code: Select all

# ADE 11/12/2014 set max_check_result_file_age to zero so we don't delete check result files if the system experiences a glitch
#max_check_result_file_age=3600
max_check_result_file_age=0
Since you moved to a non-permanent RAM disk, that setting won't matter after a reboot anyway, as the files would most definitely be gone. I think the key point, which our manual discusses but doesn't spell out, is that it only claims to clean up files that are older than the max age. So with a lifetime of 0 (that is, infinite), Nagios most likely never deletes them despite having reaped them. I'll have to talk to one of the core devs to get his opinion, but if this is the case, I would expect Nagios to remove them post-reap regardless of age, which does not seem to be happening. Another note: 3600 seconds is 60 minutes. If your system really goes down for 60 minutes, those check results are going to be very stale and won't get properly placed into logs or RRDs anyway.
nagiosadmin42
Posts: 96
Joined: Sat Feb 11, 2012 2:16 pm

Re: checkresults not being processed after upgrade to 2014R2

Post by nagiosadmin42 »

Thanks Spenser. I modified our config to use five minutes: max_check_result_file_age=300 and restarted the nagios service.

Incoming checkresults files are now being processed and deleted, so it appears that the setting max_check_result_file_age=0 was indeed the problem.
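
For reference, the same change can be scripted. This is a sketch under assumptions: `set_cfg_value` is an illustrative helper, it expects plain keys, and it expects values that do not contain the `|` character used as the sed delimiter.

```shell
# set_cfg_value FILE KEY VALUE
# Rewrites the active "KEY=..." line in FILE to "KEY=VALUE".
# Commented-out lines (starting with '#') are left untouched.
# Writes through a temp file because "sed -i" is not portable.
set_cfg_value() {
    file=$1 key=$2 value=$3
    tmp=$(mktemp) || return 1
    sed "s|^${key}=.*|${key}=${value}|" "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}
```

For example, `set_cfg_value /usr/local/nagios/etc/nagios.cfg max_check_result_file_age 300`, followed by a restart of the nagios service as described above.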

Although, it was working on 2012R1.1 (yeah I know, we were really behind on upgrades).

FYI, just as a side note, there were 79 files sitting in the checkresults directory as I made that change, and they are not being deleted.

While Nagios appears to be processing and deleting files that came in AFTER the service restart, it is leaving this handful of files there. Not a big deal, just unexpected behavior.

In the Core 3.0 docs (http://nagios.sourceforge.net/docs/3_0/configmain.html), it says for max_check_result_file_age:

This option determines the maximum age in seconds that Nagios will consider check result files found in the check_result_path directory to be valid. Check result files that are older that this threshold will be deleted by Nagios and the check results they contain will not be processed. By using a value of zero (0) with this option, Nagios will process all check result files - even if they're older than your hardware :-).

So there appears to be a discrepancy between the docs and the Nagios implementation.
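
The documented rule quoted above can be sketched as follows. This illustrates the wording of the docs only; it is not taken from the Nagios Core source, and `is_stale` is a made-up name. It assumes GNU `stat` (`-c '%Y'` prints the modification time in epoch seconds).

```shell
# is_stale FILE MAX_AGE_SECONDS
# Succeeds (exit 0) when FILE is older than MAX_AGE_SECONDS.
# A MAX_AGE_SECONDS of 0 disables the age check, so nothing is ever stale,
# which is how the documentation describes the zero setting.
is_stale() {
    max_age=$2
    [ "$max_age" -gt 0 ] || return 1
    now=$(date +%s)
    mtime=$(stat -c '%Y' "$1") || return 2
    [ $((now - mtime)) -gt "$max_age" ]
}
```

Under this reading, a file from an hour-long outage is stale with `max_check_result_file_age=300` but never stale with `max_check_result_file_age=0`.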

You can close this topic. Thanks again!

Alan
cmerchant
Posts: 546
Joined: Wed Sep 24, 2014 11:19 am

Re: checkresults not being processed after upgrade to 2014R2

Post by cmerchant »

We'll go ahead and close this thread. Thanks.