Issues on scheduled reports

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
User avatar
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: Issues on scheduled reports

Post by cdienger »

Per the document https://dev.mysql.com/doc/refman/8.0/en/gone-away.html the most likely cause would be the timeout being the root of the message. I'm not sure why the difference in where it runs matters quiet yet, but I would try increasing the wait_time on the database. Also, checking the /var/log/mysqld.log may give us some clues.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
lvaillant
Posts: 57
Joined: Mon Jun 06, 2016 2:47 am
Location: Paris, France

Re: Issues on scheduled reports

Post by lvaillant »

Hello,

MariaDB variables:

Code: Select all

log_warnings = 1

slow_query_log = ON
slow_query_log_file = /var/log/mariadb/mariadb-slow.log
log_slow_filter = admin,filesort,filesort_on_disk,full_join,full_scan,query_cache,query_cache_miss,tmp_table,tmp_table_on_disk
long_query_time = 10.000000

wait_timeout = 300
interactive_timeout = 28800
connect_timeout = 10
log_warning & slow_query_log are enabled.
If some connection issues or long queries related to reports happenned the event would have been logged.
(But there are some difference between MariaDB & MySQL)

MariaBD main log and slow-queries log files have no messages related to such events.
The main log is clean/empty and the slow-queries contains only few requests.

I scheduled a new report this morning. Same behavior but no log.

Code: Select all

MariaDB [(none)]> SHOW GLOBAL STATUS WHERE Variable_name LIKE '%abort%';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| Aborted_clients  | 3162  |
| Aborted_connects | 21331 |
+------------------+-------+
2 rows in set (0.00 sec)
  • aborted_connects are clients who attempt to connect and fail. Usually this is because of incorrect credentials (wrong password or no matching host for the user).
  • aborted_clients are the client not closing the connection properly.
I suspect that the report process does not close its connection to DB after "hosts availabilty" query (queries?) is done. It keeps its connection open during it computes the "hosts availabilty" part of the report, then the "services availabilty" part may fail because of timeout.
When a report is built in less than 5min, the "services availabilty" part is filled up correctly.

I may change the wait_timeout variable.
I can't change it at the GLOBAL level without a DB restart.
But as it is a production server, it can't be done easily. It has to be explained, documented, justified and scheduled.

I need certainties before doing that.
Or maybe a small patch in Nagios report tool/script that close DB connections between queries ?
Loïc VAILLANT
User avatar
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Issues on scheduled reports

Post by tgriep »

Did you increase the max_input_vars settings in the /etc/php.ini file?

Edit the php.ini file and add the following to it.

Code: Select all

max_input_vars = 10000
Save the change and restart Apache by running

Code: Select all

service httpd restart
I have seen that if the php settings are not large enough, it can cause the MYSQL timeout error so try this change and let us know how it works out.
Be sure to check out our Knowledgebase for helpful articles and solutions!
lvaillant
Posts: 57
Joined: Mon Jun 06, 2016 2:47 am
Location: Paris, France

Re: Issues on scheduled reports

Post by lvaillant »

The current PHP configuration is:

Code: Select all

error_reporting = E_ALL & ~E_DEPRECATED & ~E_STRICT
...
; How many GET/POST/COOKIE input variables may be accepted
; max_input_vars = 1000
max_input_vars = 1000 is the default value (see documentation) since PHP 5.3.9 (my version: php-5.4.16-43.el7_4.1.x86_64)

Moreover per documentation (max-input-vars) if there are more input variables than specified by this directive, an E_WARNING is issued, and further input variables are truncated from the request.

So E_WARNING should be logged... But nothing related in httpd log files.

Simulating and stracing a scheduled report in cmdline, I detected that /usr/local/nagiosxi/var/scheduledreporting.log file did not exist.
Now created, it does not reveal error messages. But I can see the related 2 wget commands.

The strace output is more interesting: it took less than 1min to generate the host_availability part of the report, but +5min for the service_availabilty part.
(The related servicegroup contains 9 hosts & 56 services.)

Why such a difference ?

Code: Select all

time strace /usr/bin/php /usr/local/nagiosxi/html/includes/components/scheduledreporting/sendreport.php --report=2prcF0 --username="loic.vaillant" 2>&1 > /tmp/report.log
...
read(7, --2018-06-05 11:23:44--  https://XXX.YYY.ZZZ/nagiosxi//reports/availability.php?reportperiod=lastmonth&startdate=&enddate=&host=&service=&hostgroup=&servicegroup=myservicegroup&advanced=1&assumeinitialstates=yes&assumestateretention=yes&assumestatesduringdowntime=yes&includesoftstates=no&assumedhoststate=3&assumedservicestate=6&timeperiod=&dont_count_downtime=on&dont_count_unknown=on&username=loic.vaillant&ticket=foobarp&locale=en_US&mode=csv&csvtype=host
Resolving XXX.YYY.ZZZ (XXX.YYY.ZZZ)... a.b.c.d
Connecting to XXX.YYY.ZZZ (XXX.YYY.ZZZ)|a.b.c.d|:443... connected.
WARNING: cannot verify XXX.YYY.ZZZ's certificate, issued by ‘/DC=lan/DC=zodiac/CN=HQ-CA1’:
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: 430 [application/octet-stream]
Saving to: ‘/usr/local/nagiosxi/tmp/scheduledreport-loic.vaillant-2prcF0-Host_Availability.csv’

100%[===================================================================================================================================================================================================>] 430         --.-K/s   in 0s

2018-06-05 11:24:26 (32.5 MB/s) - ‘/usr/local/nagiosxi/tmp/scheduledreport-loic.vaillant-2prcF0-Host_Availability.csv’ saved [430/430]
...
read(7, --2018-06-05 11:24:26--  https://XXX.YYY.ZZZ/nagiosxi//reports/availability.php?reportperiod=lastmonth&startdate=&enddate=&host=&service=&hostgroup=&servicegroup=myservicegroup&advanced=1&assumeinitialstates=yes&assumestateretention=yes&assumestatesduringdowntime=yes&includesoftstates=no&assumedhoststate=3&assumedservicestate=6&timeperiod=&dont_count_downtime=on&dont_count_unknown=on&username=loic.vaillant&ticket=foobar&locale=en_US&mode=csv&csvtype=service
Resolving XXX.YYY.ZZZ (XXX.YYY.ZZZ)... a.b.c.d
Connecting to XXX.YYY.ZZZ (XXX.YYY.ZZZ)|a.b.c.d|:443... connected.
WARNING: cannot verify XXX.YYY.ZZZ's certificate, issued by ‘/DC=lan/DC=zodiac/CN=HQ-CA1’:
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/octet-stream]
Saving to: ‘/usr/local/nagiosxi/tmp/scheduledreport-loic.vaillant-2prcF0-Service_Availability.csv’

    [ <=>                                                                                                                                                                                                ] 24,262      --.-K/s   in 0.1s

2018-06-05 11:29:57 (207 KB/s) - ‘/usr/local/nagiosxi/tmp/scheduledreport-loic.vaillant-2prcF0-Service_Availability.csv’ saved [24262]
exit_group(0)                           = ?
+++ exited with 0 +++

real    6m13.198s
user    0m0.250s
sys     0m0.291s

For the second read(/wget), the length is unspecified.
I'm digging further more...
Loïc VAILLANT
lvaillant
Posts: 57
Joined: Mon Jun 06, 2016 2:47 am
Location: Paris, France

Re: Issues on scheduled reports

Post by lvaillant »

I took a look back to the wait_timeout attribute.

Even if it is a GLOBAL VARIABLE, it is possible to change its value without restarting MySQL/MariaDB.

Code: Select all

MariaDB [(none)]> SHOW GLOBAL VARIABLES WHERE Variable_name LIKE '%wait_timeout%';
+---------------------------------------------------+----------+
| Variable_name                                     | Value    |
+---------------------------------------------------+----------+
| wait_timeout                                      | 300      |
+---------------------------------------------------+----------+
1 row in set (0.00 sec)

MariaDB [(none)]> SET GLOBAL wait_timeout = 600;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> SHOW GLOBAL VARIABLES WHERE Variable_name LIKE '%wait_timeout%';
+---------------------------------------------------+----------+
| Variable_name                                     | Value    |
+---------------------------------------------------+----------+
| wait_timeout                                      | 600      |
+---------------------------------------------------+----------+
1 row in set (0.00 sec)
After that, the report that failed is now ok.
I'm now trying bigger reports to test the efficiency of the fix.

You should also modify the /etc/my.cnf file to make this modification durable.
Loïc VAILLANT
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises
Contact:

Re: Issues on scheduled reports

Post by scottwilkerson »

lvaillant wrote:I took a look back to the wait_timeout attribute.

Even if it is a GLOBAL VARIABLE, it is possible to change its value without restarting MySQL/MariaDB.

Code: Select all

MariaDB [(none)]> SHOW GLOBAL VARIABLES WHERE Variable_name LIKE '%wait_timeout%';
+---------------------------------------------------+----------+
| Variable_name                                     | Value    |
+---------------------------------------------------+----------+
| wait_timeout                                      | 300      |
+---------------------------------------------------+----------+
1 row in set (0.00 sec)

MariaDB [(none)]> SET GLOBAL wait_timeout = 600;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> SHOW GLOBAL VARIABLES WHERE Variable_name LIKE '%wait_timeout%';
+---------------------------------------------------+----------+
| Variable_name                                     | Value    |
+---------------------------------------------------+----------+
| wait_timeout                                      | 600      |
+---------------------------------------------------+----------+
1 row in set (0.00 sec)
After that, the report that failed is now ok.
I'm now trying bigger reports to test the efficiency of the fix.

You should also modify the /etc/my.cnf file to make this modification durable.
Thanks for reporting back and I'm glad to hear that the failed report is now ok
Former Nagios employee
Creator:
Human Design Website
Get Your Human Design Chart
lvaillant
Posts: 57
Joined: Mon Jun 06, 2016 2:47 am
Location: Paris, France

Re: Issues on scheduled reports

Post by lvaillant »

Hello,

You can new close this thread as Solved.
Thank you for your help.
Loïc VAILLANT
Locked