Issues on scheduled reports

Post by **cdienger** » Fri Jun 01, 2018 3:41 pm

Per the document https://dev.mysql.com/doc/refman/8.0/en/gone-away.html the most likely cause would be the timeout being the root of the message. I'm not sure why the difference in where it runs matters quiet yet, but I would try increasing the wait_time on the database. Also, checking the /var/log/mysqld.log may give us some clues.

lvaillant · Post by **lvaillant** » Mon Jun 04, 2018 6:45 am

Hello,

MariaDB variables:

Code: Select all

log_warnings = 1

slow_query_log = ON
slow_query_log_file = /var/log/mariadb/mariadb-slow.log
log_slow_filter = admin,filesort,filesort_on_disk,full_join,full_scan,query_cache,query_cache_miss,tmp_table,tmp_table_on_disk
long_query_time = 10.000000

wait_timeout = 300
interactive_timeout = 28800
connect_timeout = 10

log_warning & slow_query_log are enabled.
If some connection issues or long queries related to reports happenned the event would have been logged.
(But there are some difference between MariaDB & MySQL)

MariaBD main log and slow-queries log files have no messages related to such events.
The main log is clean/empty and the slow-queries contains only few requests.

I scheduled a new report this morning. Same behavior but no log.

Code: Select all

MariaDB [(none)]> SHOW GLOBAL STATUS WHERE Variable_name LIKE '%abort%';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| Aborted_clients  | 3162  |
| Aborted_connects | 21331 |
+------------------+-------+
2 rows in set (0.00 sec)

aborted_connects are clients who attempt to connect and fail. Usually this is because of incorrect credentials (wrong password or no matching host for the user).
aborted_clients are the client not closing the connection properly.

I suspect that the report process does not close its connection to DB after "hosts availabilty" query (queries?) is done. It keeps its connection open during it computes the "hosts availabilty" part of the report, then the "services availabilty" part may fail because of timeout.
When a report is built in less than 5min, the "services availabilty" part is filled up correctly.

I may change the wait_timeout variable.
I can't change it at the GLOBAL level without a DB restart.
But as it is a production server, it can't be done easily. It has to be explained, documented, justified and scheduled.

I need certainties before doing that.
Or maybe a small patch in Nagios report tool/script that close DB connections between queries ?

Post by **tgriep** » Mon Jun 04, 2018 4:13 pm

Did you increase the max_input_vars settings in the /etc/php.ini file?

Edit the php.ini file and add the following to it.

Code: Select all

max_input_vars = 10000

Save the change and restart Apache by running

Code: Select all

service httpd restart

I have seen that if the php settings are not large enough, it can cause the MYSQL timeout error so try this change and let us know how it works out.

lvaillant · Post by **lvaillant** » Tue Jun 05, 2018 5:07 am

The current PHP configuration is:

Code: Select all

error_reporting = E_ALL & ~E_DEPRECATED & ~E_STRICT
...
; How many GET/POST/COOKIE input variables may be accepted
; max_input_vars = 1000

max_input_vars = 1000 is the default value (see documentation) since PHP 5.3.9 (my version: php-5.4.16-43.el7_4.1.x86_64)

Moreover per documentation (max-input-vars) if there are more input variables than specified by this directive, an E_WARNING is issued, and further input variables are truncated from the request.

So E_WARNING should be logged... But nothing related in httpd log files.

Simulating and stracing a scheduled report in cmdline, I detected that /usr/local/nagiosxi/var/scheduledreporting.log file did not exist.
Now created, it does not reveal error messages. But I can see the related 2 wget commands.

The strace output is more interesting: it took less than 1min to generate the host_availability part of the report, but +5min for the service_availabilty part.
(The related servicegroup contains 9 hosts & 56 services.)

Why such a difference ?

Code: Select all

time strace /usr/bin/php /usr/local/nagiosxi/html/includes/components/scheduledreporting/sendreport.php --report=2prcF0 --username="loic.vaillant" 2>&1 > /tmp/report.log
...
read(7, --2018-06-05 11:23:44--  https://XXX.YYY.ZZZ/nagiosxi//reports/availability.php?reportperiod=lastmonth&startdate=&enddate=&host=&service=&hostgroup=&servicegroup=myservicegroup&advanced=1&assumeinitialstates=yes&assumestateretention=yes&assumestatesduringdowntime=yes&includesoftstates=no&assumedhoststate=3&assumedservicestate=6&timeperiod=&dont_count_downtime=on&dont_count_unknown=on&username=loic.vaillant&ticket=foobarp&locale=en_US&mode=csv&csvtype=host
Resolving XXX.YYY.ZZZ (XXX.YYY.ZZZ)... a.b.c.d
Connecting to XXX.YYY.ZZZ (XXX.YYY.ZZZ)|a.b.c.d|:443... connected.
WARNING: cannot verify XXX.YYY.ZZZ's certificate, issued by ‘/DC=lan/DC=zodiac/CN=HQ-CA1’:
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: 430 [application/octet-stream]
Saving to: ‘/usr/local/nagiosxi/tmp/scheduledreport-loic.vaillant-2prcF0-Host_Availability.csv’

100%[===================================================================================================================================================================================================>] 430         --.-K/s   in 0s

2018-06-05 11:24:26 (32.5 MB/s) - ‘/usr/local/nagiosxi/tmp/scheduledreport-loic.vaillant-2prcF0-Host_Availability.csv’ saved [430/430]
...
read(7, --2018-06-05 11:24:26--  https://XXX.YYY.ZZZ/nagiosxi//reports/availability.php?reportperiod=lastmonth&startdate=&enddate=&host=&service=&hostgroup=&servicegroup=myservicegroup&advanced=1&assumeinitialstates=yes&assumestateretention=yes&assumestatesduringdowntime=yes&includesoftstates=no&assumedhoststate=3&assumedservicestate=6&timeperiod=&dont_count_downtime=on&dont_count_unknown=on&username=loic.vaillant&ticket=foobar&locale=en_US&mode=csv&csvtype=service
Resolving XXX.YYY.ZZZ (XXX.YYY.ZZZ)... a.b.c.d
Connecting to XXX.YYY.ZZZ (XXX.YYY.ZZZ)|a.b.c.d|:443... connected.
WARNING: cannot verify XXX.YYY.ZZZ's certificate, issued by ‘/DC=lan/DC=zodiac/CN=HQ-CA1’:
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/octet-stream]
Saving to: ‘/usr/local/nagiosxi/tmp/scheduledreport-loic.vaillant-2prcF0-Service_Availability.csv’

    [ <=>                                                                                                                                                                                                ] 24,262      --.-K/s   in 0.1s

2018-06-05 11:29:57 (207 KB/s) - ‘/usr/local/nagiosxi/tmp/scheduledreport-loic.vaillant-2prcF0-Service_Availability.csv’ saved [24262]
exit_group(0)                           = ?
+++ exited with 0 +++

real    6m13.198s
user    0m0.250s
sys     0m0.291s

For the second read(/wget), the length is unspecified.
I'm digging further more...

lvaillant · Post by **lvaillant** » Tue Jun 05, 2018 6:52 am

I took a look back to the wait_timeout attribute.

Even if it is a GLOBAL VARIABLE, it is possible to change its value without restarting MySQL/MariaDB.

Code: Select all

MariaDB [(none)]> SHOW GLOBAL VARIABLES WHERE Variable_name LIKE '%wait_timeout%';
+---------------------------------------------------+----------+
| Variable_name                                     | Value    |
+---------------------------------------------------+----------+
| wait_timeout                                      | 300      |
+---------------------------------------------------+----------+
1 row in set (0.00 sec)

MariaDB [(none)]> SET GLOBAL wait_timeout = 600;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> SHOW GLOBAL VARIABLES WHERE Variable_name LIKE '%wait_timeout%';
+---------------------------------------------------+----------+
| Variable_name                                     | Value    |
+---------------------------------------------------+----------+
| wait_timeout                                      | 600      |
+---------------------------------------------------+----------+
1 row in set (0.00 sec)

After that, the report that failed is now ok.
I'm now trying bigger reports to test the efficiency of the fix.

You should also modify the /etc/my.cnf file to make this modification durable.

scottwilkerson · Post by **scottwilkerson** » Tue Jun 05, 2018 9:48 am

lvaillant wrote:I took a look back to the wait_timeout attribute.

Even if it is a GLOBAL VARIABLE, it is possible to change its value without restarting MySQL/MariaDB.

Code: Select all

MariaDB [(none)]> SHOW GLOBAL VARIABLES WHERE Variable_name LIKE '%wait_timeout%';
+---------------------------------------------------+----------+
| Variable_name                                     | Value    |
+---------------------------------------------------+----------+
| wait_timeout                                      | 300      |
+---------------------------------------------------+----------+
1 row in set (0.00 sec)

MariaDB [(none)]> SET GLOBAL wait_timeout = 600;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> SHOW GLOBAL VARIABLES WHERE Variable_name LIKE '%wait_timeout%';
+---------------------------------------------------+----------+
| Variable_name                                     | Value    |
+---------------------------------------------------+----------+
| wait_timeout                                      | 600      |
+---------------------------------------------------+----------+
1 row in set (0.00 sec)

After that, the report that failed is now ok.
I'm now trying bigger reports to test the efficiency of the fix.

You should also modify the /etc/my.cnf file to make this modification durable.

Thanks for reporting back and I'm glad to hear that the failed report is now ok

lvaillant · Post by **lvaillant** » Tue Jun 12, 2018 1:41 am

Hello,

You can new close this thread as Solved.
Thank you for your help.

Nagios Support Forum

Issues on scheduled reports

Re: Issues on scheduled reports

Re: Issues on scheduled reports

Re: Issues on scheduled reports

Re: Issues on scheduled reports

Re: Issues on scheduled reports

Re: Issues on scheduled reports

Re: Issues on scheduled reports