Long check output (>255) and dropped State History

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
DigNetwerk
Posts: 40
Joined: Fri Oct 25, 2013 7:29 am

Long check output (>255) and dropped State History

Post by DigNetwerk »

Hi,

It looks like some check results are not saved into State History because the output is too long (>255 chars). I'm currently experiencing this with a passive check (SNMP traps, using the wizard -> with volatile+stalking options on so that is not the problem), but maybe this also applies to active checks.

Compare these:

Status information field on the Service Status page (full output show):

This event is generated whenever a threshold crossing has 10.54.17.2 2 Threshold 84 86 0 DHCP high threshold reset: Member: 10.54.17.2 Network: 172.19.4.0/24/default Range: 172.19.4.40/172.19.4.230///default/ High Trigger Mark: 86% High Reset Mark: 85% Current Usage: 84% Active Leases: 162 Available Leases: 29 Total Addresses: 191 / ibNodeName.0 (OCTETSTR):10.54.17.2 ibTrapSeverity.0 (INTEGER):2 ibObjectName.0 (OCTETSTR):Threshold ibCurThresholdValue.0 (INTEGER):84 ibThresholdHigh.0 (INTEGER):86 ibThresholdLow.0 (INTEGER):0 ibTrapDesc.0 (OCTETSTR):DHCP high threshold reset: Member: 10.54.17.2 Network: 172.19.4.0/24/default Range: 172.19.4.40/172.19.4.230///default/ High Trigger Mark: 86% High Reset Mark: 85% Current Usage: 84% Active Leases: 162 Available Leases: 29 Total Addresses: 191

Information field on the Event Log page (trimmed output, but no problem for me that this is trimmed here):

SERVICE ALERT: gen-infoblox-n02-lan;SNMP Traps;OK;HARD;1;This event is generated whenever a threshold crossing has 10.54.17.2 2 Threshold 84 86 0 DHCP high threshold reset: Member: 10.54.17.2 Network: 172.19.4.0/24/default Range: 172.19.4.40/172.19.

State History page:

Not showing these check results (while multiple similar traps have been sent). It does show 2 traps however that have a small ouput.


I did some digging and found this:

Code: Select all

mysql> DESCRIBE nagios_statehistory;
+-----------------------+---------------+------+-----+---------------------+----------------+
| Field                 | Type          | Null | Key | Default             | Extra          |
+-----------------------+---------------+------+-----+---------------------+----------------+
| statehistory_id       | int(11)       | NO   | PRI | NULL                | auto_increment |
| instance_id           | smallint(6)   | NO   |     | 0                   |                |
| state_time            | datetime      | NO   | MUL | 0000-00-00 00:00:00 |                |
| state_time_usec       | int(11)       | NO   |     | 0                   |                |
| object_id             | int(11)       | NO   | MUL | 0                   |                |
| state_change          | smallint(6)   | NO   |     | 0                   |                |
| state                 | smallint(6)   | NO   |     | 0                   |                |
| state_type            | smallint(6)   | NO   |     | 0                   |                |
| current_check_attempt | smallint(6)   | NO   |     | 0                   |                |
| max_check_attempts    | smallint(6)   | NO   |     | 0                   |                |
| last_state            | smallint(6)   | NO   |     | -1                  |                |
| last_hard_state       | smallint(6)   | NO   |     | -1                  |                |
| output                | varchar(255)  | NO   |     |                     |                |
| long_output           | varchar(8192) | NO   |     |                     |                |
+-----------------------+---------------+------+-----+---------------------+----------------+
14 rows in set (0.00 sec)
Are the results being dropped from State History because the output field is only 255 chars? What is the long_output? This does not seem to be used then?


XI version: 5.2.3
User avatar
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Long check output (>255) and dropped State History

Post by tgriep »

I think if the data is too large for a field in MYSQL, it will not get saved.
The following command will increase the size of the output field to 1024 characters.

Code: Select all

echo "use nagios;alter table nagios_statehistory modify output varchar(1024) not null;" | mysql -pnagiosxi
The long_output is used only if the check returns data on 2 separate lines. The second line will go in to that field.
Be sure to check out our Knowledgebase for helpful articles and solutions!
DigNetwerk
Posts: 40
Joined: Fri Oct 25, 2013 7:29 am

Re: Long check output (>255) and dropped State History

Post by DigNetwerk »

Hi tgriep,

I altered the table but no avail. I even altered it to 2048 but still nothing.

New SNMP traps still show up in the Event Log and on the Service Status page, but not in the State History.

So I tried something new: I first set the state to warning (by submitting a passive check result) and had the host resend some traps, the traps generated are currently all OK state (I still have to change this).

This resulted in the first of these traps with OK state to be logged in the State History, but not the subsequent other 2. The last one however is the one that shows up on the Service Status page.

I would think that with Stalking options enabled in the xiwizard_snmptrap_service template that this would log everything? Even if the state stays OK?

So I think there is a bug in the Stalking code?


UPDATE:

Meanwhile I have changed the state for this kind of trap from OK to WARNING (in /etc/snmp/snmptt.conf) and now when the host sends these traps all of them seem to be properly logged in the State History. I also tried it with CRITICAL and that works properly too.

So maybe there is only a bug in the Stalking code regarding the OK state?
User avatar
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Long check output (>255) and dropped State History

Post by tgriep »

I looked at how stalking works and I found this explanation.
If you enable stalking for one or more states of a particular host or service, Nagios will log the results of the host or service check if the output from the check differs from the output from the previous check
If the output doesn't change from the previous check, it will not log it.

Here is the link for reference.
https://assets.nagios.com/downloads/nag ... lking.html

Another table to look at is the servicestatus table. Maybe the fields are not large enough. Can you run the following and post the output here?

Code: Select all

echo 'describe nagios_servicestatus;' | mysql  -t -pnagiosxi nagios
Be sure to check out our Knowledgebase for helpful articles and solutions!
DigNetwerk
Posts: 40
Joined: Fri Oct 25, 2013 7:29 am

Re: Long check output (>255) and dropped State History

Post by DigNetwerk »

The output *does* change and it is still not logged, else I wouldn't be complaining ;)

However it does seem to log when the status goes from warning->warning or critical->critical or x->y, but NOT when it goes from ok->ok (with changing output mind you!).

And on the service template stalking is enabled for o, w, c and u. Which should, referring to this part of the docs, log for any output change:

https://assets.nagios.com/downloads/nag ... ml#service
stalking_options: This directive determines which service states "stalking" is enabled for. Valid options are a combination of one or more of the following: o = stalk on OK states, w = stalk on WARNING states, u = stalk on UNKNOWN states, and c = stalk on CRITICAL states. More information on state stalking can be found here.
Output from describe nagios_servicestatus:

Code: Select all

+-------------------------------+--------------+------+-----+---------------------+----------------+
| Field                         | Type         | Null | Key | Default             | Extra          |
+-------------------------------+--------------+------+-----+---------------------+----------------+
| servicestatus_id              | int(11)      | NO   | PRI | NULL                | auto_increment |
| instance_id                   | smallint(6)  | NO   | MUL | 0                   |                |
| service_object_id             | int(11)      | NO   | UNI | 0                   |                |
| status_update_time            | datetime     | NO   | MUL | 0000-00-00 00:00:00 |                |
| output                        | mediumtext   | NO   |     | NULL                |                |
| long_output                   | mediumtext   | NO   |     | NULL                |                |
| perfdata                      | mediumtext   | NO   |     | NULL                |                |
| current_state                 | smallint(6)  | NO   | MUL | 0                   |                |
| has_been_checked              | smallint(6)  | NO   |     | 0                   |                |
| should_be_scheduled           | smallint(6)  | NO   |     | 0                   |                |
| current_check_attempt         | smallint(6)  | NO   |     | 0                   |                |
| max_check_attempts            | smallint(6)  | NO   |     | 0                   |                |
| last_check                    | datetime     | NO   |     | 0000-00-00 00:00:00 |                |
| next_check                    | datetime     | NO   |     | 0000-00-00 00:00:00 |                |
| check_type                    | smallint(6)  | NO   | MUL | 0                   |                |
| last_state_change             | datetime     | NO   | MUL | 0000-00-00 00:00:00 |                |
| last_hard_state_change        | datetime     | NO   |     | 0000-00-00 00:00:00 |                |
| last_hard_state               | smallint(6)  | NO   |     | 0                   |                |
| last_time_ok                  | datetime     | NO   |     | 0000-00-00 00:00:00 |                |
| last_time_warning             | datetime     | NO   |     | 0000-00-00 00:00:00 |                |
| last_time_unknown             | datetime     | NO   |     | 0000-00-00 00:00:00 |                |
| last_time_critical            | datetime     | NO   |     | 0000-00-00 00:00:00 |                |
| state_type                    | smallint(6)  | NO   | MUL | 0                   |                |
| last_notification             | datetime     | NO   |     | 0000-00-00 00:00:00 |                |
| next_notification             | datetime     | NO   |     | 0000-00-00 00:00:00 |                |
| no_more_notifications         | smallint(6)  | NO   |     | 0                   |                |
| notifications_enabled         | smallint(6)  | NO   | MUL | 0                   |                |
| problem_has_been_acknowledged | smallint(6)  | NO   | MUL | 0                   |                |
| acknowledgement_type          | smallint(6)  | NO   |     | 0                   |                |
| current_notification_number   | smallint(6)  | NO   |     | 0                   |                |
| passive_checks_enabled        | smallint(6)  | NO   | MUL | 0                   |                |
| active_checks_enabled         | smallint(6)  | NO   | MUL | 0                   |                |
| event_handler_enabled         | smallint(6)  | NO   | MUL | 0                   |                |
| flap_detection_enabled        | smallint(6)  | NO   | MUL | 0                   |                |
| is_flapping                   | smallint(6)  | NO   | MUL | 0                   |                |
| percent_state_change          | double       | NO   | MUL | 0                   |                |
| latency                       | double       | NO   | MUL | 0                   |                |
| execution_time                | double       | NO   | MUL | 0                   |                |
| scheduled_downtime_depth      | smallint(6)  | NO   | MUL | 0                   |                |
| failure_prediction_enabled    | smallint(6)  | NO   |     | 0                   |                |
| process_performance_data      | smallint(6)  | NO   |     | 0                   |                |
| obsess_over_service           | smallint(6)  | NO   |     | 0                   |                |
| modified_service_attributes   | int(11)      | NO   |     | 0                   |                |
| event_handler                 | varchar(255) | NO   |     |                     |                |
| check_command                 | varchar(255) | NO   |     |                     |                |
| normal_check_interval         | double       | NO   |     | 0                   |                |
| retry_check_interval          | double       | NO   |     | 0                   |                |
| check_timeperiod_object_id    | int(11)      | NO   |     | 0                   |                |
+-------------------------------+--------------+------+-----+---------------------+----------------+
User avatar
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Long check output (>255) and dropped State History

Post by tgriep »

Can you go to Reports > Legacy Reports > Alert History and see if the Alerts are showing up there?
I think the State History report only looks at state changes and not the output.
Thanks.
Be sure to check out our Knowledgebase for helpful articles and solutions!
DigNetwerk
Posts: 40
Joined: Fri Oct 25, 2013 7:29 am

Re: Long check output (>255) and dropped State History

Post by DigNetwerk »

They are showing up there. Just as they are showing up in the Event log.

By saying "I think the State History report only looks at state changes and not the output" I hope you mean that is the bug right? Not expected behavior?

Because the State History report DOES show continuous sequences of WARNING or CRITICAL states where only the output has changed. But it doesn't do this for OK states somehow. I think expected and documented behavior* is that it should also show continuous sequences of OK states where only the output has changed (especially because it already does this for WARNING and CRITICAL states).

*Of course only when stalking is enabled

So can we agree that this is a bug in the State History report?
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Long check output (>255) and dropped State History

Post by ssax »

If you open up /usr/local/nagios/var/status.dat and look at that service, is the full text under plugin_output or long_plugin_output?

I'm wondering if the plugin is only putting it into the long_plugin_output (newline character in the output or something).
DigNetwerk
Posts: 40
Joined: Fri Oct 25, 2013 7:29 am

Re: Long check output (>255) and dropped State History

Post by DigNetwerk »

Hi,

It's under plugin_output:

Code: Select all

        last_time_ok=1454490716
        last_time_warning=1454672034
        last_time_unknown=0
        last_time_critical=1454323195
        plugin_output=10.54.17.2 2 Threshold 95 95 0 DHCP high threshold crossed:   Member: 10.54.17.2   Network: 10.159.13.0/24/default   Range: 10.159.13.50/10.159.13.250///default/   High Trigger Mark: 95%   High Reset Mark: 85%   Current Usage: 95%   Active Leases: 191   Available Leases: 10   Total Addresses: 201  / ibNodeName.0 (OCTETSTR):10.54.17.2 ibTrapSeverity.0 (INTEGER):2 ibObjectName.0 (OCTETSTR):Threshold ibCurThresholdValue.0 (INTEGER):95 ibThresholdHigh.0 (INTEGER):95 ibThresholdLow.0 (INTEGER):0 ibTrapDesc.0 (OCTETSTR):DHCP high threshold crossed:   Member: 10.54.17.2   Network: 10.159.13.0/24/default   Range: 10.159.13.50/10.159.13.250///default/   High Trigger Mark: 95%   High Reset Mark: 85%   Current Usage: 95%   Active Leases: 191   Available Leases: 10   Total Addresses: 201
        long_plugin_output=
        performance_data=
        last_check=1454672034
        next_check=1454490776

User avatar
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Long check output (>255) and dropped State History

Post by tgriep »

I was able to test the State History report and sending multiple OK's with changing the output.
I'll fill in a bug report for you.
Be sure to check out our Knowledgebase for helpful articles and solutions!
Locked