Couldn't capture perf data graph

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
anish
Posts: 161
Joined: Tue Jul 19, 2016 5:29 am

Couldn't capture perf data graph

Post by anish »

Hello,

We are facing issues capturing the performance data graph for check_disk on AIX and Solaris hosts in the Nagios XI console. It seems the perf data output is too long and Nagios is unable to parse it completely.

I read somewhere about changing MAX_PLUGIN_OUTPUT_LENGTH in a file in the <nagios path>/include directory, but on checking, /include is an empty directory.

Kindly suggest how to capture performance data for check_disk service.

Below is the sample Performance data:


check_disk Output:

DISK OK - free space: / 546 MB (53% inode=83%); /usr 1536 MB (17% inode=72%); /var 1059 MB (34% inode=94%); /tmp 4081 MB (99% inode=99%); /home 2522 MB (61% inode=99%); /opt 1134 MB (28% inode=89%); /opt/DoOnceAIX 28 MB (11% inode=93%); /opt/IBM/SCM 336 MB (87% inode=99%); /opt/IBM/TPC 507 MB (99% inode=99%); /u001 40177 MB (78% inode=99%); /u101 178666 MB (58% inode=99%); /u102 104418 MB (92% inode=99%); /u103 80087 MB (97% inode=99%); /u104 80087 MB (97% inode=99%); /u105 247236 MB (96% inode=99%); /u106 212296 MB (82% inode=99%); /usr/local 976 MB (95% inode=99%); /var/log/eprise 462 MB (72% inode=97%);| /=477MB;921;972;0;1024 /usr=7423MB;8064;8512;0;8960 /var=2012MB;2764;2918;0;3072 /tmp=14MB;3686;3891;0;4096 /home=1573MB;3686;3891;0;4096 /opt=2833MB;3571;3769;0;3968 /opt/DoOnceAIX=227MB;230;243;0;256 /opt/IBM/SCM=47MB;345;364;0;384 /opt/IBM/TPC=4MB;460;486;0;512 /u001=11022MB;46080;48640;0;51200 /u101=128533MB;276480;291840;0;307200 /u102=8221MB;101376;107008;0;112640 /u103=1832MB;73728;77824;0;81920 /u

Performance data under Advanced Tab:

/=477MB;921;972;0;1024 /usr=7423MB;8064;8512;0;8960 /var=2012MB;2764;2918;0;3072 /tmp=14MB;3686;3891;0;4096 /home=1573MB;3686;3891;0;4096 /opt=2833MB;3571;3769;0;3968 /opt/DoOnceAIX=227MB;230;243;0;256 /opt/IBM/SCM=47MB;345;364;0;384 /opt/IBM/TPC=4MB;460;486;0;512 /u001=11022MB;46080;48640;0;51200 /u101=128533MB;276480;291840;0;307200 /u102=8221MB;101376;107008;0;112640 /u103=1832MB;73728;77824;0;81920 /u
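Nagios perf data is a series of label=value;warn;crit;min;max tokens after the pipe, so a quick way to see whether a length limit is being hit is to count the bytes in the string. A minimal sketch using two sample tokens copied from the output above:

```shell
# Count the bytes in a sample of the perf data string (two tokens taken
# from the check_disk output above; format is label=value;warn;crit;min;max).
perfdata='/=477MB;921;972;0;1024 /usr=7423MB;8064;8512;0;8960'
printf '%s' "$perfdata" | wc -c
```

Running the same count against the full string makes it obvious when the stored result is shorter than what the plugin emitted.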
avandemore
Posts: 1597
Joined: Tue Sep 27, 2016 4:57 pm

Re: Couldn't capture perf data graph

Post by avandemore »

I think you're speaking of this document:

https://assets.nagios.com/downloads/nag ... inapi.html

What version of XI are you using? Please show the full service definition.
Previous Nagios employee
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: Couldn't capture perf data graph

Post by rkennedy »

You should be able to fix this by expanding the SQL tables. See this article - https://support.nagios.com/kb/article.php?id=478

It will only apply to future checks at this point, since past results were stored under the old 256-character limit.

If that doesn't work, the problem may lie in your agent having a sending limit. Check that by comparing the CLI result against the result Nagios records, and by executing check_disk locally on the remote machines.
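To make that comparison concrete, you can simulate what a fixed-size transport buffer does to a long perf data string. The 1024-byte cutoff below is only an illustrative figure, not the exact limit of any particular NRPE build, and the filesystem tokens are synthetic:

```shell
# Build a synthetic perf data string of 40 filesystem tokens (34 bytes each)
# and chop it at 1024 bytes, the way a fixed-size transport buffer would.
perfdata=$(for i in $(seq 1 40); do printf '/u%03d=11022MB;46080;48640;0;51200 ' "$i"; done)
truncated=$(printf '%s' "$perfdata" | cut -c1-1024)
printf 'full: %d bytes, truncated: %d bytes\n' \
    "$(printf '%s' "$perfdata" | wc -c)" \
    "$(printf '%s' "$truncated" | wc -c)"
```

Compare the byte count of the output you get over check_nrpe with the byte count of the same plugin run locally; if the server-side count always stops at the same number, you have found the transport's cutoff.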
Former Nagios Employee
anish
Posts: 161
Joined: Tue Jul 19, 2016 5:29 am

Re: Couldn't capture perf data graph

Post by anish »

Thanks for the reply:

1. Nagios XI version: 5.3.2

2. Service definition:

define service {
        host_name                       <hostname>
        service_description             Check All drives
        check_command                   check_nrpe!check_disk!-a '-w 10% -c 5%'
        max_check_attempts              5
        check_interval                  5
        retry_interval                  1
        active_checks_enabled           1
        passive_checks_enabled          1
        check_period                    xi_timeperiod_24x7
        check_freshness                 1
        retain_status_information       1
        retain_nonstatus_information    1
        notification_interval           60
        notification_period             xi_timeperiod_24x7
        notifications_enabled           1
        stalking_options                o,w,c,u
        register                        1
}

As suggested, I ran all four DB queries and the output is now captured completely. However, the performance data is still not being captured completely.

output: DISK OK - free space: / 633 MB (61% inode=86%): /usr 1424 MB (16% inode=68%): /var 1075 MB (52% inode=96%): /tmp 3405 MB (83% inode=99%): /home 2171 MB (70% inode=99%): /opt 728 MB (20% inode=87%): /apps 511 MB (99% inode=99%): /oraarch 102384 MB (99% inode=99%): /oradump 204768 MB (99% inode=99%): /u001/app/oracle 51191 MB (99% inode=99%): /u101/oradata 51191 MB (99% inode=99%): /u102/oradata 204768 MB (99% inode=99%): /u103/oradata 51191 MB (99% inode=99%): /u104/oradata 51191 MB (99% inode=99%): /u105/oradata 102384 MB (99% inode=99%): /u106/oradata 102384 MB (99% inode=99%): /u107/oradata 102384 MB (99% inode=99%): /u108/oradata 102384 MB (99% inode=99%): /var/log/audit 1023 MB (99% inode=99%): /var/nmon 1009 MB (98% inode=99%):

Perf data:

/=390MB;921;972;0;1024 /usr=7151MB;7718;8147;0;8576 /var=972MB;1843;1945;0;2048 /tmp=690MB;3686;3891;0;4096 /home=900MB;2764;2918;0;3072 /opt=2855MB;3225;3404;0;3584 /apps=0MB;460;486;0;512 /oraarch=15MB;92160;97280;0;102400 /oradump=31MB;184320;194560;0;204800 /u001/app/oracle=

And the issue still exists.

Also, I am able to capture the complete performance data at the agent end.

I am still not sure where to look for nagios.h.in to change MAX_PLUGIN_OUTPUT_LENGTH.
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Couldn't capture perf data graph

Post by ssax »

If you run the check from the command line is it cut off?

Code:

/usr/local/nagios/libexec/check_nrpe -H YOURHOST -c check_disk -a '-w 10% -c 5%'
anish
Posts: 161
Joined: Tue Jul 19, 2016 5:29 am

Re: Couldn't capture perf data graph

Post by anish »

From the server end, yes, it is cut off; but not from the agent end.
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: Couldn't capture perf data graph

Post by rkennedy »

The issue then is that the agent can't handle all of that output. The agent end works because nothing is transmitted over a network there, whereas the NRPE transport restricts the amount of data the agent can send. This is an issue that was fixed in NRPEv3.

You'll want to use NRPEv3, which supports much larger output. These links might help -
https://support.nagios.com/kb/article.p ... ategory=22
https://support.nagios.com/kb/article.p ... ategory=22
Former Nagios Employee
anish
Posts: 161
Joined: Tue Jul 19, 2016 5:29 am

Re: Couldn't capture perf data graph

Post by anish »

Is there any other option to capture this, or is upgrading the agent the only option?

I would like to try changing the MAX_PLUGIN_OUTPUT_LENGTH value and test whether the issue still persists.

Kindly suggest where I can find the appropriate file to make the required changes.
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: Couldn't capture perf data graph

Post by rkennedy »

Upgrading the agent is the only solution. If you'd like to make custom changes to NRPE, feel free, but that isn't something we can support.
Former Nagios Employee
SteveBeauchemin
Posts: 524
Joined: Mon Oct 14, 2013 7:19 pm

Re: Couldn't capture perf data graph

Post by SteveBeauchemin »

Have you considered splitting the one test that returns all your Unix volume data into multiple tests to solve the problem?

Try to reduce the number of volumes returned all at once.

From the /usr/local/nagios/libexec/check_disk --help output:

Code:

 -p, --path=PATH, --partition=PARTITION
    Mount point or block device as emitted by the mount(8) command (may be repeated)
Here are a couple of examples of what I do and why I do it; maybe this will help.

In the following example I test the root drive and check for stale mounts. I do this as a separate test from the other volumes so I can alert only the Unix admin folks.

Code:

Description: _Check_Disk_fs_/
Check command: check_nrpe
Command view:
$USER1$/check_nrpe -u -t 60 -H $HOSTADDRESS$ -c $ARG1$ $ARG2$ $ARG3$ $ARG4$ $ARG5$ $ARG6$ $ARG7$ $ARG8$
where:
$ARG1$ = check_disk
$ARG2$ = -a "--stat-remote-fs --units GB
$ARG3$ = --warning 3%
$ARG4$ = --critical 1%
$ARG5$ = --path /"
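Spliced together by XI, the $ARG$ values above amount to a single NRPE call along these lines (the host name below is a placeholder, and the echo only shows the assembled command rather than running it):

```shell
# Assemble the argument string the way XI splices $ARG2$..$ARG5$ together;
# aixhost01 is a placeholder host name, and we just print the resulting call.
args='--stat-remote-fs --units GB --warning 3% --critical 1% --path /'
echo "/usr/local/nagios/libexec/check_nrpe -u -t 60 -H aixhost01 -c check_disk -a \"$args\""
```

Note how the double quote opened in $ARG2$ and closed in $ARG5$ keeps the whole check_disk argument list as one -a value.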
Example 2:
A disk test that notifies the application owner using the disk. The Unix admin cannot fix a full disk there; it is up to the owner of that application to deal with space problems.

Code:

Description: _Check_Disk_fs_/atlassian
$ARG1$ = check_disk
$ARG2$ = -a "--units GB
$ARG3$ = --warning 3%
$ARG4$ = --critical 1%
$ARG5$ = --path /atlassian"
Last example: an Oracle host with many mounts: use regular expressions.

Code:

Description: _Check_Disk_fs_/oracle
$ARG1$ = check_disk
$ARG2$ = -a "--units GB --local
$ARG3$ = --warning 5%
$ARG4$ = --critical 1%
$ARG5$ = --ereg-path /oracle$$
$ARG6$ = --ereg-path /oracle/i
$ARG7$ = --ereg-path /oracle/p"
So, I have multiple service tests to separate out different alertable items, and to reduce the size of the returned data to something useful. On some systems I test every volume individually. On some I use Regex matches to group them into one test.

Consider these options:

Code:

 -A, --all
    Explicitly select all paths. This is equivalent to -R '.*'
 -R, --eregi-path=PATH, --eregi-partition=PARTITION
    Case insensitive regular expression for path/partition (may be repeated)
 -r, --ereg-path=PATH, --ereg-partition=PARTITION
    Regular expression for path or partition (may be repeated)
 -I, --ignore-eregi-path=PATH, --ignore-eregi-partition=PARTITION
    Regular expression to ignore selected path/partition (case insensitive) (may be repeated)
 -i, --ignore-ereg-path=PATH, --ignore-ereg-partition=PARTITION
    Regular expression to ignore selected path or partition (may be repeated)
My 2 cents...

Hope this helps.

Steve B
Last edited by SteveBeauchemin on Wed Jan 18, 2017 4:33 pm, edited 1 time in total.
XI 5.7.3 / Core 4.4.6 / NagVis 1.9.8 / LiveStatus 1.5.0p11 / RRDCached 1.7.0 / Redis 3.2.8 /
SNMPTT / Gearman 0.33-7 / Mod_Gearman 3.0.7 / NLS 2.0.8 / NNA 2.3.1 /
NSClient 0.5.0 / NRPE Solaris 3.2.1 Linux 3.2.1 HPUX 3.2.1