Fusion 4.1.6

Post by **martinnick** » Fri Jan 11, 2019 8:57 am

I've searched those logs high and low, but I could only find more of the same. We don't seem to have any logs that go back to when we first started experiencing the issues. I've included more errors below, taken directly from Fusion. The database query didn't produce much, as you expected, and I'm unsure what type of errors I should be looking for that would cause these issues.

2019-01-10 19:48:10 Error System 0 0 insert_polled_data() caught exception: Unable to UPDATE polled_data ( [:polled_data_id] => 26299686, [:hosts_pending] => 0, [:hosts_up] => 86, [:hosts_down] => 0, [:hosts_unreachable] => 0, [:hosts_count] => 86, [:hosts_problems] => 0, [:hosts_problems_unhandled] => 0, [:hosts_disabled] => 0, [:hosts_acknowledged] => 0, [:hosts_flapping] => 0, [:hosts_downtime] => 0, [:hosts_pending_disabled] => 0, [:hosts_up_disabled] => 0, [:hosts_up_downtime] => 0, [:hosts_down_disabled] => 0, [:hosts_down_acknowledged] => 0, [:hosts_down_downtime] => 0, [:hosts_unreachable_disabled] => 0, [:hosts_unreachable_acknowledged] => 0, [:hosts_unreachable_downtime] => 0, [:services_pending] => 0, [:services_ok] => 1466, [:services_warning] => 1, [:services_critical] => 5, [:services_unknown] => 0, [:services_count] => 1472, [:services_problems] => 6, [:services_problems_unhandled] => 6, [:services_disabled] => 0, [:services_acknowledged] => 0, [:services_flapping] => 1, [:services_downtime] => 0, [:services_pending_disabled] => 0, [:services_ok_disabled] => 0, [:services_ok_downtime] => 0, [:services_warning_disabled] => 0, [:services_warning_acknowledged] => 0, [:services_warning_downtime] => 0, [:services_critical_disabled] => 0, [:services_critical_acknowledged] => 0, [:services_critical_downtime] => 0, [:services_unknown_disabled] => 0, [:services_unknown_acknowledged] => 0, [:services_unknown_downtime] => 0 )

2019-01-10 15:42:33 Error System 0 0 insert_polled_data() caught exception: Unable to UPDATE polled_data ( [:polled_data_id] => 26272294, [:services_pending] => 0, [:services_ok] => 41, [:services_warning] => 0, [:services_critical] => 0, [:services_unknown] => 0, [:services_count] => 41, [:services_problems] => 0, [:services_problems_unhandled] => 0, [:services_disabled] => 0, [:services_acknowledged] => 0, [:services_flapping] => 0, [:services_downtime] => 0, [:services_pending_disabled] => 0, [:services_ok_disabled] => 0, [:services_ok_downtime] => 0, [:services_warning_disabled] => 0, [:services_warning_acknowledged] => 0, [:services_warning_downtime] => 0, [:services_critical_disabled] => 0, [:services_critical_acknowledged] => 0, [:services_critical_downtime] => 0, [:services_unknown_disabled] => 0, [:services_unknown_acknowledged] => 0, [:services_unknown_downtime] => 0 )
2019-01-10 15:42:33 Error System 0 0 insert_polled_data() db error: ( [0] => ( [7] => exec(): SQLSTATE[40001]: Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction ), [1] => ( [8] => exec_query(DELETE FROM polled_extras WHERE polled_data_id NOT IN (SELECT polled_data_id FROM polled_data), ''): Failed to execute 'DELETE FROM polled_extras WHERE polled_data_id NOT IN (SELECT polled_data_id FROM polled_data)' ) )

Post by **tgriep** » Fri Jan 11, 2019 2:53 pm

Those errors don't help much in troubleshooting the issue.

What you could try is truncating all of the polled data from the tables so the server will start over with fresh data.
To truncate the polled tables in MySQL, run this as root:

Code: Select all

cd /usr/local/nagiosfusion/scripts
./truncate_polled.php

This will remove all of the temporary data used in the Fusion interface, so you will have to wait until each server's next poll to see updated data.

Post by **hbouma** » Fri Jan 11, 2019 3:09 pm

This has been run. We will report if we continue to have issues.

Post by **scottwilkerson** » Fri Jan 11, 2019 3:29 pm

hbouma wrote:This has been run. We will report if we continue to have issues.

Sounds good.

Post by **hbouma** » Mon Jan 14, 2019 7:57 am

Unfortunately, the issue is still occurring.

Since the last time we stopped all Nagios processes and manually ran the dbmaint_subsys.php script an hour and a half ago, we have 39 scripts attempting to run.

Post by **scottwilkerson** » Mon Jan 14, 2019 4:35 pm

I believe you may be hitting a limit where the database on your Fusion server cannot keep up with the amount of data you are throwing at it (and subsequently removing with the dbmaint_subsys.php script).

First, let's reduce how often the dbmaint cron runs.

Edit /etc/cron.d/nagiosfusion and change this line:

Code: Select all

*/5 * * * * nagios /usr/bin/php -q /usr/local/nagiosfusion/cron/dbmaint_subsys.php >>/usr/local/nagiosfusion/var/log/dbmaint_subsys.log 2>&1

to this:

Code: Select all

*/30 * * * * nagios /usr/bin/php -q /usr/local/nagiosfusion/cron/dbmaint_subsys.php >>/usr/local/nagiosfusion/var/log/dbmaint_subsys.log 2>&1
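
For anyone who prefers to script this change, here is a minimal sketch. The function name `slow_dbmaint_cron` is hypothetical, and it assumes GNU sed (as found on CentOS/RHEL). It only touches the minute field on the line that mentions dbmaint_subsys.php and leaves every other cron entry alone.

```shell
#!/usr/bin/env bash
# Sketch only: slow_dbmaint_cron is a hypothetical helper, not part of Fusion.
# Usage: slow_dbmaint_cron FILE [MINUTES]
# Rewrites the leading "*/N" minute field to "*/MINUTES" (default 30),
# but only on the line that runs dbmaint_subsys.php. Assumes GNU sed.
slow_dbmaint_cron() {
    local file="$1" minutes="${2:-30}"
    sed -i -E "/dbmaint_subsys\.php/ s|^\*/[0-9]+ |*/${minutes} |" "$file"
}

# Example (as root), after copying the file somewhere safe first:
#   cp -p /etc/cron.d/nagiosfusion /root/nagiosfusion.cron.bak
#   slow_dbmaint_cron /etc/cron.d/nagiosfusion 30
```

The backup is deliberately kept outside /etc/cron.d, since some cron implementations may also load a stray copy left in that directory.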

Then let's stop the current dbmaint scripts from running:

Code: Select all

killall -9 php
systemctl restart mariadb

Now I'd like to see some logs from the database:

Code: Select all

tail -1000 /var/log/mariadb/mariadb.log
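
If the full tail is too noisy to read, a rough filter like the one below can pull out just the error and crash lines first. This is only a sketch: the function name `mariadb_errors` is hypothetical, and the two patterns are simply guesses at what is most interesting in a MariaDB error log.

```shell
#!/usr/bin/env bash
# Sketch only: mariadb_errors is a hypothetical helper.
# Usage: mariadb_errors FILE
# Prints, with line numbers, every line that is tagged [ERROR] or that
# reports mysqld receiving a signal. Prints nothing if there are none.
mariadb_errors() {
    grep -nE '\[ERROR\]|got signal' "$1" || true
}

# Example: mariadb_errors /var/log/mariadb/mariadb.log
```

The line numbers make it easier to jump back into the full log and read the context around each hit.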

Post by **hbouma** » Tue Jan 15, 2019 7:32 am

All steps have been completed.

The issue occurred this morning at 1:50AM EST, but the mariadb logs don't show anything at that time.

Here is the output from the mariadb.log file:

Code: Select all

190111 15:07:55 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 5.5.60-MariaDB
key_buffer_size=33554432
read_buffer_size=131072
max_used_connections=538
max_threads=820
thread_count=536
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1831734 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x55e3f25680a0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f15c9642d80 thread_stack 0x48000
/usr/libexec/mysqld(my_print_stacktrace+0x3d)[0x55e3e4a14cbd]
/usr/libexec/mysqld(handle_fatal_signal+0x515)[0x55e3e46294a5]
/lib64/libpthread.so.0(+0xf5d0)[0x7f17753795d0]
/lib64/libc.so.6(gsignal+0x37)[0x7f1773aa4207]
/lib64/libc.so.6(abort+0x148)[0x7f1773aa58f8]
/lib64/libc.so.6(+0x78d27)[0x7f1773ae6d27]
/lib64/libc.so.6(+0x81489)[0x7f1773aef489]
/usr/libexec/mysqld(+0x7adb48)[0x55e3e4932b48]
/usr/libexec/mysqld(+0x77d5fd)[0x55e3e49025fd]
/usr/libexec/mysqld(_ZN7handler7ha_openEP5TABLEPKcij+0x33)[0x55e3e462d3e3]
/usr/libexec/mysqld(_Z14open_tmp_tableP5TABLE+0x2d)[0x55e3e452d7ed]
/usr/libexec/mysqld(_Z16create_tmp_tableP3THDP15TMP_TABLE_PARAMR4ListI4ItemEP8st_orderbbyyPKcbb+0x1e15)[0x55e3e45318f5]
/usr/libexec/mysqld(_Z19create_schema_tableP3THDP10TABLE_LIST+0x6b7)[0x55e3e454c937]
/usr/libexec/mysqld(_Z18mysql_schema_tableP3THDP3LEXP10TABLE_LIST+0x27)[0x55e3e45607d7]
/usr/libexec/mysqld(_Z11open_tablesP3THDPP10TABLE_LISTPjjP19Prelocking_strategy+0x3ce)[0x55e3e44bc38e]
/usr/libexec/mysqld(_Z20open_and_lock_tablesP3THDP10TABLE_LISTbjP19Prelocking_strategy+0x43)[0x55e3e44bceb3]
/usr/libexec/mysqld(+0x36c478)[0x55e3e44f1478]
/usr/libexec/mysqld(_Z21mysql_execute_commandP3THD+0x2e8e)[0x55e3e44fa49e]
/usr/libexec/mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x125)[0x55e3e44fe445]
/usr/libexec/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x1750)[0x55e3e45004a0]
/usr/libexec/mysqld(_Z24do_handle_one_connectionP3THD+0x1c2)[0x55e3e45b2a22]
/usr/libexec/mysqld(handle_one_connection+0x4a)[0x55e3e45b2aca]
/lib64/libpthread.so.0(+0x7dd5)[0x7f1775371dd5]
/lib64/libc.so.6(clone+0x6d)[0x7f1773b6bead]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7f14d400d7f8): DESCRIBE servers
Connection ID (thread ID): 190
Status: NOT_KILLED

Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=off

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

190111 15:07:55 mysqld_safe Number of processes running now: 0
190111 15:07:55 mysqld_safe mysqld restarted
190111 15:07:55 [Note] /usr/libexec/mysqld (mysqld 5.5.60-MariaDB) starting as process 9008 ...
190111 15:07:55 InnoDB: The InnoDB memory heap is disabled
190111 15:07:55 InnoDB: Mutexes and rw_locks use GCC atomic builtins
190111 15:07:55 InnoDB: Compressed tables use zlib 1.2.7
190111 15:07:55 InnoDB: Using Linux native AIO
190111 15:07:55 InnoDB: Initializing buffer pool, size = 6.0G
190111 15:07:56 InnoDB: Completed initialization of buffer pool
190111 15:07:56 InnoDB: highest supported file format is Barracuda.
190111 15:07:56  InnoDB: Starting crash recovery from checkpoint LSN=4832196353319
InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
190111 15:07:56  InnoDB: Starting final batch to recover 4136 pages from redo log
190111 15:07:57  InnoDB: Waiting for the background threads to start
190111 15:07:58 Percona XtraDB (http://www.percona.com) 5.5.59-MariaDB-38.11 started; log sequence number 4832263564081
190111 15:07:58 [Note] Plugin 'FEEDBACK' is disabled.
190111 15:07:58 [Note] Server socket created on IP: '0.0.0.0'.
190111 15:07:58 [Note] Event Scheduler: Loaded 0 events
190111 15:07:58 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.60-MariaDB'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Server
190115  7:28:19 [Note] /usr/libexec/mysqld: Normal shutdown
190115  7:28:19 [Note] Event Scheduler: Purging the queue. 0 events
190115  7:28:19  InnoDB: Starting shutdown...
190115  7:28:20  InnoDB: Waiting for 200 pages to be flushed
190115  7:28:31  InnoDB: Shutdown completed; log sequence number 4992770068032
190115  7:28:31 [Note] /usr/libexec/mysqld: Shutdown complete

190115 07:28:31 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
190115 07:28:32 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
190115  7:28:32 [Note] /usr/libexec/mysqld (mysqld 5.5.60-MariaDB) starting as process 9656 ...
190115  7:28:32 InnoDB: The InnoDB memory heap is disabled
190115  7:28:32 InnoDB: Mutexes and rw_locks use GCC atomic builtins
190115  7:28:32 InnoDB: Compressed tables use zlib 1.2.7
190115  7:28:32 InnoDB: Using Linux native AIO
190115  7:28:32 InnoDB: Initializing buffer pool, size = 6.0G
190115  7:28:32 InnoDB: Completed initialization of buffer pool
190115  7:28:32 InnoDB: highest supported file format is Barracuda.
190115  7:28:32  InnoDB: Waiting for the background threads to start
190115  7:28:33 Percona XtraDB (http://www.percona.com) 5.5.59-MariaDB-38.11 started; log sequence number 4992770068032
190115  7:28:33 [Note] Plugin 'FEEDBACK' is disabled.
190115  7:28:33 [Note] Server socket created on IP: '0.0.0.0'.
190115  7:28:33 [Note] Event Scheduler: Loaded 0 events
190115  7:28:33 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.60-MariaDB'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Server

Post by **hbouma** » Tue Jan 15, 2019 11:57 am

This seems to be helping. The dbmaint_subsys.php scripts are completing in about 20 to 25 minutes time.

Post by **scottwilkerson** » Tue Jan 15, 2019 4:20 pm

hbouma wrote:This seems to be helping. The dbmaint_subsys.php scripts are completing in about 20 to 25 minutes time.

Good to hear. It's possible that on a large system the removal of items just wasn't happening fast enough, so the runs stacked up.

If we have to, changing the cron to run every 60 minutes instead of 30 would still be a good option.
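
Another way to keep runs from stacking, independent of the interval, might be to wrap the job in `flock -n` from util-linux so a new run is skipped while the previous one still holds the lock. The sketch below is an assumption, not something Fusion ships: the function name `run_exclusive` and the lock file path are hypothetical, and it has not been checked whether dbmaint_subsys.php already does its own locking.

```shell
#!/usr/bin/env bash
# Sketch only: run_exclusive and the lock path below are hypothetical.
# Usage: run_exclusive LOCKFILE COMMAND [ARGS...]
# Runs COMMAND while holding an exclusive lock on LOCKFILE. If another
# process already holds the lock, flock -n gives up immediately and
# returns a non-zero status instead of waiting.
run_exclusive() {
    local lock="$1"
    shift
    flock -n "$lock" "$@"
}

# The same idea written directly as a cron entry (lock path is an assumption):
#   */30 * * * * nagios flock -n /tmp/dbmaint_subsys.lock /usr/bin/php -q /usr/local/nagiosfusion/cron/dbmaint_subsys.php >>/usr/local/nagiosfusion/var/log/dbmaint_subsys.log 2>&1
```

With this in place, a run that takes longer than the cron interval would cause the next invocation to exit right away rather than pile up behind it.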

Nagios Support Forum
