Nagios Support Forum

Posted: **Wed Jan 13, 2016 7:03 pm**

I needed to change the thresholds on a service, so I copied it in CCM, changed the name slightly, updated the thresholds and added hosts. The original, which runs on all Windows hosts except 2, gathers performance data. The 2 hosts on the copied service don't. The service that doesn't gather the data is the 2nd one below.

# grep -v ^# /usr/local/nagios/etc/services/FS_Win_Usage.cfg

define service {
        service_description             FS_Win_Usage
        use                             default_service
        hostgroup_name                  Windows_Physical_Most,1VZW_Windows_Virtual_Most
        check_command                   check_nrpe!CheckDriveSize!-a ShowAll=long MinWarnFree=20% MinCritFree=10% FilterType=fixed!!!!!!
        register                        1
        }

# grep -v ^# /usr/local/nagios/etc/services/FS_Win_Usage_SQL.cfg

define service {
        service_description             FS_Win_Usage_SQL
        use                             default_service
        hostgroup_name                  Windows_SQL
        check_command                   check_nrpe!CheckDriveSize!-a ShowAll=long MinWarnFree=10% MinCritFree=5% FilterType=fixed!!!!!!
        register                        1
        }

The above config looked a bit old fashioned, so I updated it to work w/ the newer syntax. Same result. No perfdata in the DB and no graphs.

# grep -v ^# /usr/local/nagios/etc/services/FS_Win_Usage_SQL_test.cfg

define service {
        service_description             FS_Win_Usage_SQL_test
        use                             default_service
        hostgroup_name                  Windows_SQL
        check_command                   check_nrpe!check_drivesize!-a --show-all "warn=free<10%" "crit=free<5%" "perf-config=*(unit:g)"!-a "filter=type = 'fixed' and drive regexp '.*[C-Z].*'" "warn=free<10%" "crit=free<5%"!!!!!
        notifications_enabled           0
        register                        1
        }

Since the same template is used for all 3, it seems like it isn't the template. I doubt it's the fault of the servers. That seems to leave my service definitions.
I restarted the monitoring engine and performance grapher just for grins.

I assume the first request will be to post the templates. Here they are.

define service {
       name                                     base_service
       service_description                      Base service sourced by others
       display_name                             Base template for most templates
       is_volatile                              0
       max_check_attempts                       1
       check_interval                           5
       active_checks_enabled                    1
       passive_checks_enabled                   1
       check_period                             24x7
       parallelize_check                        1
       obsess_over_service                      0
       check_freshness                          0
       event_handler_enabled                    1
       flap_detection_enabled                   1
       process_perf_data                        1
       retain_status_information                1
       retain_nonstatus_information             1
       notification_interval                    60
       first_notification_delay                 15
       notification_period                      24x7
       notification_options                     w,c,u,
       register                                 0

}

define service {
       name                                     default_service
       service_description                      default_service
       display_name                             Template for most services
       use                                      base_service
       active_checks_enabled                    1
       process_perf_data                        1
       retain_status_information                1
       retain_nonstatus_information             1
       notification_options                     w,c,u,f,
       notifications_enabled                    1
       register                                 0

}

The templates along w/ the hostgroups sort of show how I try to separate hosts/services from monitoring variables, and services from hosts, both of which seem abnormal.

So where have I screwed this up?

Thanks

Posted: **Wed Jan 13, 2016 7:23 pm**

gormank wrote: The above config looked a bit old fashioned so I updated to make it work w/ the newer syntax. Same result. No perfdata in the DB and no graphs.

I suspect there is a difference in the output performance data from the old command to the new command, especially if extra datasources exist that were not present when the rrd file was initially created.

You should watch the perfdata log to see if there are any errors:

tail -f /usr/local/nagios/var/perfdata.log

The easy solution is to delete the rrd and xml files for these services in /usr/local/nagios/share/perfdata/xxxxx/
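
Before deleting anything, it can help to list exactly which files belong to the service. The following is only a minimal sketch: it assumes the files are named after the service description inside a per-host directory (the xxxxx above), and `list_perfdata_files` with its host and service arguments are hypothetical names, not part of Nagios. It lists files and removes nothing.

```shell
# Base directory as quoted above; override PERFDATA_DIR if yours differs.
PERFDATA_DIR="${PERFDATA_DIR:-/usr/local/nagios/share/perfdata}"

# Print the .rrd and .xml files that exist for one host/service pair.
# Assumption: files are named <service_description>.rrd / .xml.
list_perfdata_files() {
    local host="$1" service="$2"
    local f
    for f in "$PERFDATA_DIR/$host/$service.rrd" "$PERFDATA_DIR/$host/$service.xml"; do
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# Example call (hypothetical host name); review the list, then delete by hand:
# list_perfdata_files sqlhost FS_Win_Usage_SQL
```

If the list is empty, there are no existing files for that service that could conflict with new performance data.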

Let us know if that works and if anything is reported by the perfdata log.

You may need to increase the logging verbosity and then take a deeper look into the logs. Follow the FAQ entry below to increase the log level of process_perfdata and npcd:

http://support.nagios.com/wiki/index.ph ... leshooting

Then tail the logs:

tail -f /usr/local/nagios/var/perfdata.log
tail -f /usr/local/nagios/var/npcd.log

Don't forget to turn down the log level as per the FAQ when you are done!

Posted: **Wed Jan 13, 2016 7:51 pm**

I never even thought of the rrd files. That said, the rrd and xml files hadn't been updated in a long time so that could be it.
On the other hand, nothing is in the perfdata column of the nagios_servicestatus table for these hosts.

Not much activity in npcd.log since these are from yesterday and the system time is UTC.

# tail -f /usr/local/nagios/var/npcd.log
[01-12-2016 20:07:21] NPCD: WARN: MAX load reached: load 11.690000/10.000000 at i=1
[01-12-2016 20:07:36] NPCD: WARN: MAX load reached: load 12.570000/10.000000 at i=1
[01-12-2016 20:07:51] NPCD: WARN: MAX load reached: load 11.610000/10.000000 at i=1

The rrd/xml file names seem to follow the service names so for the new service, that shouldn't be a problem. For the old servicename, you imply that a change in perfdata output from a service can or may cause writing to the RRD to stop. That makes sense.

But, I copied and renamed the service, which should have started a new RRD, creating new files while leaving the old ones in the host directory. It sounds like old rrd/xml files and host dirs need to be cleaned up regularly if services change, and periodically regardless. Are there guidelines for this? It wouldn't be tough, but...

I don't see anything in the logs other than above, but I'll keep looking.

Posted: **Wed Jan 13, 2016 11:03 pm**

Generally speaking, the advanced logging should highlight the issues.

gormank wrote: But, I copied and renamed the service, which should have started a new RRD, creating new files while leaving the old ones in the host directory.

gormank wrote: I needed to change the thresholds on a service, so I copied it in CCM, changed the name slightly, updated the thresholds and added hosts. The original running on all windows hosts except 2, gather performance data. The 2 on this service don't. The one that doesn't gather the data is the 2nd one below.

gormank wrote: The above config looked a bit old fashioned so I updated to make it work w/ the newer syntax. Same result. No perfdata in the DB and no graphs.

It's possible the RRD files were created while the older commands were in use. After the service command is changed, if the performance data or the number of datasources is different, it will cause issues.

gormank wrote: On the other hand, nothing is in the perfdata column of the nagios_servicestatus table for these hosts.

Can you show us a screenshot of the Advanced tab of the service that is not working?

gormank wrote: It sounds like old rrd/xml files and host dirs need to be cleaned up regularly if services change, and periodically regardless. Are there guidelines for this? It wouldn't be tough, but...

Normally they don't need to be touched, but if plugins are updated and the performance data changes or the number of datasources changes, then it can cause mayhem. The performance data stuff is pretty finicky.
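
To check whether two versions of a command return a different number of datasources, you can count the label=value pairs after the `|` in the plugin output. This is a rough sketch rather than anything shipped with Nagios: `count_datasources` is a hypothetical helper, and it assumes single-line output in which labels and values contain no extra `=` characters.

```shell
# Count the datasources in one line of plugin output.
# Everything after the first '|' is treated as performance data, and
# each label=value pair is assumed to contain exactly one '='.
count_datasources() {
    local output="$1" perf
    case "$output" in
        *'|'*) perf="${output#*|}" ;;
        *)     echo 0; return 0 ;;      # no '|' means no perfdata at all
    esac
    printf '%s' "$perf" | awk -F'=' '{ n += NF - 1 } END { print n + 0 }'
}

# Example with a shortened, made-up output line:
# count_datasources "OK C: 62%|'C:\ free'=62GB;9;4;0;99 'C:\ free %'=62%;9;4;0;100"
```

Running it against the output of the old and the new command shows whether the count changed between them.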

In saying that, you could use this tool I created called the Performance Data Tool.

http://exchange.nagios.org/directory/Ad ... ol/details

Upload it into Nagios XI via Admin > System Extensions > Manage Components.

You use it via the Tools menu. It lets you browse the performance data files.

Posted: **Thu Jan 14, 2016 3:07 pm**

Here's the pic of the advanced tab.
I deleted the perfdata files, and no new ones are created, and there are none for disk usage for the old or new services. The perfdata tool is installed, but as there are no files, it's not too informative.
Since the files have been deleted, and the new service has a new name, there's really no reason I can think of for the new or old service to not create new RRDs.
I'll look at the logging now and post results.

Thanks!

Posted: **Thu Jan 14, 2016 5:04 pm**

Can you run the commands from a command prompt on the server and post the output here?
The screen capture doesn't show any performance data, and I would like to verify that the check is returning performance data.

Posted: **Thu Jan 14, 2016 6:18 pm**

Nothing in the logs about this after changing the log level.
There's no perfdata in the output, which explains why nothing was logged. The sparehost is an example of when things work, and sqlhost shows it not working. Same command, different result. Interestingly, the hosts where there's no perfdata all have more than 2 drives.

# /usr/local/nagios/libexec/check_nrpe -H sparehost -u -t 30 -c CheckDriveSize -a ShowAll=long MinWarnFree=10% MinCritFree=5% FilterType=fixed | tr , "\n"
OK C:\: Total: 99.902GB - Used: 37.424GB (38%) - Free: 62.478GB (62%)
D:\: Total: 179.265GB - Used: 35.177GB (20%) - Free: 144.088GB (80%)
: Total: 99.996MB - Used: 30.375MB (31%) - Free: 69.621MB (69%)|'C:\ free'=62.47847GB;9.99023;4.99511;0;99.90234 'C:\ free %'=62%;9;4;0;100 'D:\ free'=144.08753GB;17.92646;8.96323;0;179.26464 'D:\ free %'=80%;9;4;0;100 '\\?\Volume{20157401-60b7-11e4-879c-806e6f6e6963}\ free'=69.62109MB;9.9996;4.9998;0;99.99609 '\\?\Volume{20157401-60b7-11e4-879c-806e6f6e6963}\ free %'=69%;9;4;0;100

# /usr/local/nagios/libexec/check_nrpe -H sqlhost -u -t 30 -c CheckDriveSize -a ShowAll=long MinWarnFree=10% MinCritFree=5% FilterType=fixed | tr , "\n"
OK C:\: Total: 299.901GB - Used: 158.707GB (53%) - Free: 141.195GB (47%)
D:\: Total: 99.996GB - Used: 51.638GB (52%) - Free: 48.358GB (48%)
G:\: Total: 2.441TB - Used: 233.248GB (10%) - Free: 2.214TB (90%)
J:\: Total: 10TB - Used: 3.982TB (40%) - Free: 6.018TB (60%)
L:\: Total: 4TB - Used: 328.096GB (9%) - Free: 3.679TB (91%)
M:\: Total: 4TB - Used: 397.326GB (10%) - Free: 3.612TB (90%)
N:\: Total: 13TB - Used: 6.564TB (51%) - Free: 6.436TB (49%)
O:\: Total: 12TB - Used: 4.226TB (36%) - Free: 7.774TB (64%)
R:\: Total: 1.75TB - Used: 508.166GB (29%) - Free: 1.254TB (71%)
T:\: Total: 1.065TB - Used: 481.642GB (45%) - Free: 608.595GB (55%)
W:\: Total: 1TB - Used: 782.761GB (77%) - Free: 241.112GB (23%)
X:\: Total: 10TB - Used: 2.027TB (21%) - Free: 7.973TB (79%)
: Total: 99.996MB - Used: 32.141MB (33%) - Free: 67.855MB (67%)
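
A quick way to compare the two hosts is to report whether the raw output contains the `|` separator and how many bytes it is. This is only a sketch with a hypothetical helper name, `describe_output`. The size matters on the assumption that NRPE 2.x limits the payload to 1024 bytes; whether that limit is what drops the perfdata on the hosts with many drives has not been confirmed in this thread.

```shell
# Report whether plugin output has a perfdata section and its size in bytes.
# Assumption (unconfirmed here): output near 1024 bytes may be hitting the
# NRPE 2.x payload limit.
describe_output() {
    local output="$1" bytes
    bytes=$(printf '%s' "$output" | wc -c | tr -d ' ')
    case "$output" in
        *'|'*) echo "perfdata=yes bytes=$bytes" ;;
        *)     echo "perfdata=no bytes=$bytes" ;;
    esac
}

# Example call, using the same check as above without the tr:
# describe_output "$(/usr/local/nagios/libexec/check_nrpe -H sqlhost -u -t 30 -c CheckDriveSize -a ShowAll=long MinWarnFree=10% MinCritFree=5% FilterType=fixed)"
```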

Posted: **Thu Jan 14, 2016 6:22 pm**

gormank wrote: # /usr/local/nagios/libexec/check_nrpe -H sparehost -u -t 30 -c CheckDriveSize -a ShowAll=long MinWarnFree=10% MinCritFree=5% FilterType=fixed | tr , "\n"

gormank wrote: # /usr/local/nagios/libexec/check_nrpe -H sqlhost -u -t 30 -c CheckDriveSize -a ShowAll=long MinWarnFree=10% MinCritFree=5% FilterType=fixed | tr , "\n"

I suspect the piping at the end of your command is causing the issue: | tr , "\n"

I've not seen that done before.

What happens if you remove | tr , "\n" from the command?

Posted: **Thu Jan 14, 2016 6:57 pm**

No, it isn't causing the problem. tr converts a character to another character, in this case, a comma to a newline.
You can see the perfdata in the good output example above.
Here's an example w/o the tr:

# /usr/local/nagios/libexec/check_nrpe -H wspr001 -u -t 30 -c CheckDriveSize -a ShowAll=long MinWarnFree=10%
OK C:\: Total: 99.902GB - Used: 37.423GB (38%) - Free: 62.479GB (62%), D:\: Total: 179.265GB - Used: 35.177GB (20%) - Free: 144.088GB (80%), : Total: 99.996MB - Used: 30.375MB (31%) - Free: 69.621MB (69%)|'C:\ free'=62.47918GB;9.99023;0;0;99.90234 'C:\ free %'=62%;9;0;0;100 'C:\ used'=37.42315GB;0;89.9121;0;99.90234 'C:\ used %'=37%;0;89;0;100 'D:\ free'=144.08753GB;17.92646;0;0;179.26464 'D:\ free %'=80%;9;0;0;100 'D:\ used'=35.1771GB;0;161.33818;0;179.26464 'D:\ used %'=19%;0;89;0;100 '\\?\Volume{20157401-60b7-11e4-879c-806e6f6e6963}\ free'=69.62109MB;9.9996;0;0;99.99609 '\\?\Volume{20157401-60b7-11e4-879c-806e6f6e6963}\ free %'=69%;9;0;0;100 '\\?\Volume{20157401-60b7-11e4-879c-806e6f6e6963}\ used'=30.375MB;0;89.99648;0;99.99609 '\\?\Volume{20157401-60b7-11e4-879c-806e6f6e6963}\ used %'=30%;0;89;0;100

# /usr/local/nagios/libexec/check_nrpe -H sqlhost -u -t 30 -c CheckDriveSize -a ShowAll=long MinWarnFree=10%
OK C:\: Total: 299.901GB - Used: 158.562GB (53%) - Free: 141.339GB (47%), D:\: Total: 99.996GB - Used: 51.639GB (52%) - Free: 48.357GB (48%), G:\: Total: 2.441TB - Used: 233.25GB (10%) - Free: 2.213TB (90%), J:\: Total: 10TB - Used: 3.985TB (40%) - Free: 6.015TB (60%), L:\: Total: 4TB - Used: 328.096GB (9%) - Free: 3.679TB (91%), M:\: Total: 4TB - Used: 397.326GB (10%) - Free: 3.612TB (90%), N:\: Total: 13TB - Used: 6.567TB (51%) - Free: 6.432TB (49%), O:\: Total: 12TB - Used: 4.229TB (36%) - Free: 7.77TB (64%), R:\: Total: 1.75TB - Used: 508.166GB (29%) - Free: 1.254TB (71%), T:\: Total: 1.065TB - Used: 481.642GB (45%) - Free: 608.595GB (55%), W:\: Total: 1TB - Used: 782.761GB (77%) - Free: 241.112GB (23%), X:\: Total: 10TB - Used: 2.027TB (21%) - Free: 7.973TB (79%), : Total: 99.996MB - Used: 32.141MB (33%) - Free: 67.855MB (67%)
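
To illustrate the point about tr above, here is a tiny sketch on a made-up output line. `commas_to_lines` is just a hypothetical wrapper around the same tr call. Only the commas become newlines, so anything after the `|` passes through untouched.

```shell
# Same translation used in the commands above: every comma becomes a newline.
commas_to_lines() {
    tr , "\n"
}

# Made-up sample line; the perfdata after '|' is left as it was.
printf 'OK C: 62%%, D: 80%%|c=62%%;9;4 d=80%%;9;4\n' | commas_to_lines
```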

Posted: **Thu Jan 14, 2016 7:07 pm**

Can you show us that output for sqlhost, please, so we can keep the troubleshooting consistent?

No perfdata on a copied service