Page 1 of 3

Long check output (>255) and dropped State History

Posted: Fri Jan 29, 2016 9:26 am
by DigNetwerk
Hi,

It looks like some check results are not saved into State History because the output is too long (>255 chars). I'm currently experiencing this with a passive check (SNMP traps, using the wizard -> with volatile+stalking options on so that is not the problem), but maybe this also applies to active checks.

Compare these:

Status information field on the Service Status page (full output show):

This event is generated whenever a threshold crossing has 10.54.17.2 2 Threshold 84 86 0 DHCP high threshold reset: Member: 10.54.17.2 Network: 172.19.4.0/24/default Range: 172.19.4.40/172.19.4.230///default/ High Trigger Mark: 86% High Reset Mark: 85% Current Usage: 84% Active Leases: 162 Available Leases: 29 Total Addresses: 191 / ibNodeName.0 (OCTETSTR):10.54.17.2 ibTrapSeverity.0 (INTEGER):2 ibObjectName.0 (OCTETSTR):Threshold ibCurThresholdValue.0 (INTEGER):84 ibThresholdHigh.0 (INTEGER):86 ibThresholdLow.0 (INTEGER):0 ibTrapDesc.0 (OCTETSTR):DHCP high threshold reset: Member: 10.54.17.2 Network: 172.19.4.0/24/default Range: 172.19.4.40/172.19.4.230///default/ High Trigger Mark: 86% High Reset Mark: 85% Current Usage: 84% Active Leases: 162 Available Leases: 29 Total Addresses: 191

Information field on the Event Log page (trimmed output, but no problem for me that this is trimmed here):

SERVICE ALERT: gen-infoblox-n02-lan;SNMP Traps;OK;HARD;1;This event is generated whenever a threshold crossing has 10.54.17.2 2 Threshold 84 86 0 DHCP high threshold reset: Member: 10.54.17.2 Network: 172.19.4.0/24/default Range: 172.19.4.40/172.19.

State History page:

Not showing these check results (while multiple similar traps have been sent). It does show 2 traps however that have a small ouput.


I did some digging and found this:

Code: Select all

mysql> DESCRIBE nagios_statehistory;
+-----------------------+---------------+------+-----+---------------------+----------------+
| Field                 | Type          | Null | Key | Default             | Extra          |
+-----------------------+---------------+------+-----+---------------------+----------------+
| statehistory_id       | int(11)       | NO   | PRI | NULL                | auto_increment |
| instance_id           | smallint(6)   | NO   |     | 0                   |                |
| state_time            | datetime      | NO   | MUL | 0000-00-00 00:00:00 |                |
| state_time_usec       | int(11)       | NO   |     | 0                   |                |
| object_id             | int(11)       | NO   | MUL | 0                   |                |
| state_change          | smallint(6)   | NO   |     | 0                   |                |
| state                 | smallint(6)   | NO   |     | 0                   |                |
| state_type            | smallint(6)   | NO   |     | 0                   |                |
| current_check_attempt | smallint(6)   | NO   |     | 0                   |                |
| max_check_attempts    | smallint(6)   | NO   |     | 0                   |                |
| last_state            | smallint(6)   | NO   |     | -1                  |                |
| last_hard_state       | smallint(6)   | NO   |     | -1                  |                |
| output                | varchar(255)  | NO   |     |                     |                |
| long_output           | varchar(8192) | NO   |     |                     |                |
+-----------------------+---------------+------+-----+---------------------+----------------+
14 rows in set (0.00 sec)
Are the results being dropped from State History because the output field is only 255 chars? What is the long_output? This does not seem to be used then?


XI version: 5.2.3

Re: Long check output (>255) and dropped State History

Posted: Fri Jan 29, 2016 3:02 pm
by tgriep
I think if the data is too large for a field in MYSQL, it will not get saved.
The following command will increase the size of the output field to 1024 characters.

Code: Select all

echo "use nagios;alter table nagios_statehistory modify output varchar(1024) not null;" | mysql -pnagiosxi
The long_output is used only if the check returns data on 2 separate lines. The second line will go in to that field.

Re: Long check output (>255) and dropped State History

Posted: Mon Feb 01, 2016 5:12 am
by DigNetwerk
Hi tgriep,

I altered the table but no avail. I even altered it to 2048 but still nothing.

New SNMP traps still show up in the Event Log and on the Service Status page, but not in the State History.

So I tried something new: I first set the state to warning (by submitting a passive check result) and had the host resend some traps, the traps generated are currently all OK state (I still have to change this).

This resulted in the first of these traps with OK state to be logged in the State History, but not the subsequent other 2. The last one however is the one that shows up on the Service Status page.

I would think that with Stalking options enabled in the xiwizard_snmptrap_service template that this would log everything? Even if the state stays OK?

So I think there is a bug in the Stalking code?


UPDATE:

Meanwhile I have changed the state for this kind of trap from OK to WARNING (in /etc/snmp/snmptt.conf) and now when the host sends these traps all of them seem to be properly logged in the State History. I also tried it with CRITICAL and that works properly too.

So maybe there is only a bug in the Stalking code regarding the OK state?

Re: Long check output (>255) and dropped State History

Posted: Mon Feb 01, 2016 11:26 am
by tgriep
I looked at how stalking works and I found this explanation.
If you enable stalking for one or more states of a particular host or service, Nagios will log the results of the host or service check if the output from the check differs from the output from the previous check
If the output doesn't change from the previous check, it will not log it.

Here is the link for reference.
https://assets.nagios.com/downloads/nag ... lking.html

Another table to look at is the servicestatus table. Maybe the fields are not large enough. Can you run the following and post the output here?

Code: Select all

echo 'describe nagios_servicestatus;' | mysql  -t -pnagiosxi nagios

Re: Long check output (>255) and dropped State History

Posted: Tue Feb 02, 2016 7:52 am
by DigNetwerk
The output *does* change and it is still not logged, else I wouldn't be complaining ;)

However it does seem to log when the status goes from warning->warning or critical->critical or x->y, but NOT when it goes from ok->ok (with changing output mind you!).

And on the service template stalking is enabled for o, w, c and u. Which should, referring to this part of the docs, log for any output change:

https://assets.nagios.com/downloads/nag ... ml#service
stalking_options: This directive determines which service states "stalking" is enabled for. Valid options are a combination of one or more of the following: o = stalk on OK states, w = stalk on WARNING states, u = stalk on UNKNOWN states, and c = stalk on CRITICAL states. More information on state stalking can be found here.
Output from describe nagios_servicestatus:

Code: Select all

+-------------------------------+--------------+------+-----+---------------------+----------------+
| Field                         | Type         | Null | Key | Default             | Extra          |
+-------------------------------+--------------+------+-----+---------------------+----------------+
| servicestatus_id              | int(11)      | NO   | PRI | NULL                | auto_increment |
| instance_id                   | smallint(6)  | NO   | MUL | 0                   |                |
| service_object_id             | int(11)      | NO   | UNI | 0                   |                |
| status_update_time            | datetime     | NO   | MUL | 0000-00-00 00:00:00 |                |
| output                        | mediumtext   | NO   |     | NULL                |                |
| long_output                   | mediumtext   | NO   |     | NULL                |                |
| perfdata                      | mediumtext   | NO   |     | NULL                |                |
| current_state                 | smallint(6)  | NO   | MUL | 0                   |                |
| has_been_checked              | smallint(6)  | NO   |     | 0                   |                |
| should_be_scheduled           | smallint(6)  | NO   |     | 0                   |                |
| current_check_attempt         | smallint(6)  | NO   |     | 0                   |                |
| max_check_attempts            | smallint(6)  | NO   |     | 0                   |                |
| last_check                    | datetime     | NO   |     | 0000-00-00 00:00:00 |                |
| next_check                    | datetime     | NO   |     | 0000-00-00 00:00:00 |                |
| check_type                    | smallint(6)  | NO   | MUL | 0                   |                |
| last_state_change             | datetime     | NO   | MUL | 0000-00-00 00:00:00 |                |
| last_hard_state_change        | datetime     | NO   |     | 0000-00-00 00:00:00 |                |
| last_hard_state               | smallint(6)  | NO   |     | 0                   |                |
| last_time_ok                  | datetime     | NO   |     | 0000-00-00 00:00:00 |                |
| last_time_warning             | datetime     | NO   |     | 0000-00-00 00:00:00 |                |
| last_time_unknown             | datetime     | NO   |     | 0000-00-00 00:00:00 |                |
| last_time_critical            | datetime     | NO   |     | 0000-00-00 00:00:00 |                |
| state_type                    | smallint(6)  | NO   | MUL | 0                   |                |
| last_notification             | datetime     | NO   |     | 0000-00-00 00:00:00 |                |
| next_notification             | datetime     | NO   |     | 0000-00-00 00:00:00 |                |
| no_more_notifications         | smallint(6)  | NO   |     | 0                   |                |
| notifications_enabled         | smallint(6)  | NO   | MUL | 0                   |                |
| problem_has_been_acknowledged | smallint(6)  | NO   | MUL | 0                   |                |
| acknowledgement_type          | smallint(6)  | NO   |     | 0                   |                |
| current_notification_number   | smallint(6)  | NO   |     | 0                   |                |
| passive_checks_enabled        | smallint(6)  | NO   | MUL | 0                   |                |
| active_checks_enabled         | smallint(6)  | NO   | MUL | 0                   |                |
| event_handler_enabled         | smallint(6)  | NO   | MUL | 0                   |                |
| flap_detection_enabled        | smallint(6)  | NO   | MUL | 0                   |                |
| is_flapping                   | smallint(6)  | NO   | MUL | 0                   |                |
| percent_state_change          | double       | NO   | MUL | 0                   |                |
| latency                       | double       | NO   | MUL | 0                   |                |
| execution_time                | double       | NO   | MUL | 0                   |                |
| scheduled_downtime_depth      | smallint(6)  | NO   | MUL | 0                   |                |
| failure_prediction_enabled    | smallint(6)  | NO   |     | 0                   |                |
| process_performance_data      | smallint(6)  | NO   |     | 0                   |                |
| obsess_over_service           | smallint(6)  | NO   |     | 0                   |                |
| modified_service_attributes   | int(11)      | NO   |     | 0                   |                |
| event_handler                 | varchar(255) | NO   |     |                     |                |
| check_command                 | varchar(255) | NO   |     |                     |                |
| normal_check_interval         | double       | NO   |     | 0                   |                |
| retry_check_interval          | double       | NO   |     | 0                   |                |
| check_timeperiod_object_id    | int(11)      | NO   |     | 0                   |                |
+-------------------------------+--------------+------+-----+---------------------+----------------+

Re: Long check output (>255) and dropped State History

Posted: Tue Feb 02, 2016 1:23 pm
by tgriep
Can you go to Reports > Legacy Reports > Alert History and see if the Alerts are showing up there?
I think the State History report only looks at state changes and not the output.
Thanks.

Re: Long check output (>255) and dropped State History

Posted: Wed Feb 03, 2016 5:01 am
by DigNetwerk
They are showing up there. Just as they are showing up in the Event log.

By saying "I think the State History report only looks at state changes and not the output" I hope you mean that is the bug right? Not expected behavior?

Because the State History report DOES show continuous sequences of WARNING or CRITICAL states where only the output has changed. But it doesn't do this for OK states somehow. I think expected and documented behavior* is that it should also show continuous sequences of OK states where only the output has changed (especially because it already does this for WARNING and CRITICAL states).

*Of course only when stalking is enabled

So can we agree that this is a bug in the State History report?

Re: Long check output (>255) and dropped State History

Posted: Wed Feb 03, 2016 1:47 pm
by ssax
If you open up /usr/local/nagios/var/status.dat and look at that service, is the full text under plugin_output or long_plugin_output?

I'm wondering if the plugin is only putting it into the long_plugin_output (newline character in the output or something).

Re: Long check output (>255) and dropped State History

Posted: Fri Feb 05, 2016 6:45 am
by DigNetwerk
Hi,

It's under plugin_output:

Code: Select all

        last_time_ok=1454490716
        last_time_warning=1454672034
        last_time_unknown=0
        last_time_critical=1454323195
        plugin_output=10.54.17.2 2 Threshold 95 95 0 DHCP high threshold crossed:   Member: 10.54.17.2   Network: 10.159.13.0/24/default   Range: 10.159.13.50/10.159.13.250///default/   High Trigger Mark: 95%   High Reset Mark: 85%   Current Usage: 95%   Active Leases: 191   Available Leases: 10   Total Addresses: 201  / ibNodeName.0 (OCTETSTR):10.54.17.2 ibTrapSeverity.0 (INTEGER):2 ibObjectName.0 (OCTETSTR):Threshold ibCurThresholdValue.0 (INTEGER):95 ibThresholdHigh.0 (INTEGER):95 ibThresholdLow.0 (INTEGER):0 ibTrapDesc.0 (OCTETSTR):DHCP high threshold crossed:   Member: 10.54.17.2   Network: 10.159.13.0/24/default   Range: 10.159.13.50/10.159.13.250///default/   High Trigger Mark: 95%   High Reset Mark: 85%   Current Usage: 95%   Active Leases: 191   Available Leases: 10   Total Addresses: 201
        long_plugin_output=
        performance_data=
        last_check=1454672034
        next_check=1454490776


Re: Long check output (>255) and dropped State History

Posted: Fri Feb 05, 2016 12:45 pm
by tgriep
I was able to test the State History report and sending multiple OK's with changing the output.
I'll fill in a bug report for you.