NDO3 problems after mariadb upgrade (5.5.68 - 10.6)

Posted: Fri Nov 11, 2022 7:02 am
by jmsanesteban.sgre


I have a weird situation. I probably made some mistakes, and I hope the community can help.

We have a problem with table locks during the optimize process, so after some research we decided to upgrade the database to a version compatible with MySQL 5.7 because, as far as I know, there is a way to avoid locking tables during the optimize process there.

I'm running some tests in my INT environment:

Nagios Core 4.4.6
NagiosXI 5.8.7
MariaDB 5.5.68
NDO 3.0.7

I've upgraded the MariaDB component to 10.6.10, and now I'm receiving these messages in the nagios.log file:

cat /usr/local/nagios/var/nagios.log | grep NDO-3

[1668164808] NDO-3: Callbacks deregistered
[1668164808] NDO-3: NDO - Shutdown complete
[1668164810] NDO-3: NDO 3.0.7 (c) Copyright 2009-2020 Nagios - Nagios Core Development Team
[1668164810] NDO-3: Unable to connect to mysql. Configuration may be incorrect or database may have temporarily disconnected.
[1668164810] NDO-3: NDO was not able to initialize the database (main context) and will not start.
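
The bracketed prefix on each nagios.log line is a Unix epoch timestamp. As a minimal sketch (the helper name `readable_ndo_lines` is made up for illustration), the following keeps only the NDO-3 lines and converts the prefix to UTC, which makes it easier to line the failures up with the time of the MariaDB upgrade or a service restart:

```python
import re
from datetime import datetime, timezone

# Matches lines such as "[1668164808] NDO-3: Callbacks deregistered"
LOG_LINE = re.compile(r"^\[(\d+)\]\s+(NDO-3: .*)$")


def readable_ndo_lines(lines):
    """Yield NDO-3 log lines with the epoch prefix rewritten as UTC ISO 8601."""
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match:
            stamp = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
            yield f"{stamp.isoformat()} {match.group(2)}"


sample = [
    "[1668164808] NDO-3: NDO - Shutdown complete",
    "[1668164810] NDO-3: Unable to connect to mysql. Configuration may be "
    "incorrect or database may have temporarily disconnected.",
    "[1668164811] some other module: ignored",
]
for entry in readable_ndo_lines(sample):
    print(entry)
```

To run it against the real file, pass an open file object for /usr/local/nagios/var/nagios.log instead of `sample`.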
select * from mysql.global_priv where host = 'localhost' and user = 'ndoutils':

| localhost | ndoutils | {"access":549755813887,"version_id":100610,"plugin":"mysql_native_password","authentication_string":"*244733929909A95DDF1A7F78DD067589B4092EE7","password_last_changed":1667467358}
I've also tried removing the plugin, with the same results...
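
One check that can be done offline: with the mysql_native_password plugin, authentication_string is `*` followed by the uppercase hex of SHA1(SHA1(password)). The sketch below recomputes that value so the password stored in ndo.cfg can be compared with the hash shown in mysql.global_priv. The function names are illustrative, and this only tells you whether the two passwords agree, not why NDO fails to connect.

```python
import hashlib
import hmac


def mysql_native_password_hash(password: str) -> str:
    """Return '*' + upper-case hex of SHA1(SHA1(password))."""
    inner = hashlib.sha1(password.encode("utf-8")).digest()
    return "*" + hashlib.sha1(inner).hexdigest().upper()


def matches_stored_hash(password: str, authentication_string: str) -> bool:
    """Compare a candidate password with a stored mysql_native_password hash."""
    candidate = mysql_native_password_hash(password)
    return hmac.compare_digest(candidate, authentication_string.strip().upper())


# Well-known reference value: the hash of the literal string "password".
print(mysql_native_password_hash("password"))
```

For this thread the call would be `matches_stored_hash(<db password from ndo.cfg>, "*244733929909A95DDF1A7F78DD067589B4092EE7")`. A `False` result would point at a password mismatch between ndo.cfg and the account.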

/usr/local/nagios/etc/ndo.cfg with "default" conf:

Default NDO config for Nagios XI

# NDOUtils module
# Commented out by NDO 'make install-broker-line' on Tue Feb  8 12:09:40 CET 2022
#broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg
# Added by NDO 'make install-broker-line' on Tue Feb  8 12:09:40 CET 2022
broker_module=/usr/local/nagios/bin/ /usr/local/nagios/etc/ndo.cfg
I can log in to the database using the credentials in the ndo.cfg file:

mysql -undoutils -D nagios -p
Enter password:
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 827
Server version: 10.6.10-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [nagios]>
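
A successful login from the mysql client does not prove that NDO connects the same way. Without `-h`, the client on a Unix host normally uses the local socket, while NDO uses whatever host, port, or socket its config file specifies. As a rough sketch, the snippet below lists the connection-related settings from a key=value config so they can be compared with what the client used. The `db_` prefix and the example keys are assumptions about the ndo.cfg layout, and the sample values are hypothetical, not taken from this post.

```python
def read_settings(text, prefix="db_"):
    """Collect key=value pairs whose key starts with `prefix`, skipping comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key.startswith(prefix):
            settings[key] = value.strip()
    return settings


# Hypothetical config content used only to demonstrate the parser.
EXAMPLE = """\
# connection settings (assumed key names)
db_host=localhost
db_port=3306
db_socket=/var/lib/mysql/mysql.sock
db_pass=secret
"""

for key, value in sorted(read_settings(EXAMPLE).items()):
    shown = "***" if key == "db_pass" else value  # never echo the password
    print(f"{key} = {shown}")
```

If the socket path or port listed there differs from what the upgraded MariaDB server actually uses, that could explain why the CLI connects and NDO does not. This is only a hypothesis to rule out.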

Why can't NDO3 log in to the database?




Re: NDO3 problems after mariadb upgrade (5.5.68 - 10.6)

Posted: Mon Nov 21, 2022 4:28 am
by jmsanesteban.sgre


There is a small update on this. The problem is not related to the DB upgrade: it happened this weekend on the PROD server, which still runs the old DB version, so it seems to be a problem with NDO3 regardless of the DB version. NDO logs flooded the system and the server ran out of disk space.

Could anyone please help debug this?

We can't offload the DB or downgrade NDO to 2.x due to poor performance.




Re: NDO3 problems after mariadb upgrade (5.5.68 - 10.6)

Posted: Fri Nov 25, 2022 6:18 am
by jmsanesteban.sgre


I want to elaborate on this a bit more. The problem on our PROD server is not about the DB context; it is about NDO3 trying to insert data into the nagios_servicestatus table:

[1669199184] NDO-3: The following query failed while MySQL appears to be connected:
[1669199184] NDO-3: INSERT INTO nagios_servicestatus (instance_id, service_object_id, status_update_time, output, long_output, perfdata, current_state, has_been_checked, should_be_scheduled, current_check_attempt, max_check_attempts, last_check, next_check, check_type, check_options, last_state_change, last_hard_state_change, last_hard_state, last_time_ok, last_time_warning, last_time_unknown, last_time_critical, state_type, last_notification, next_notification, no_more_notifications, notifications_enabled, problem_has_been_acknowledged, acknowledgement_type, current_notification_number, passive_checks_enabled, active_checks_enabled, event_handler_enabled, flap_detection_enabled, is_flapping, percent_state_change, latency, execution_time, scheduled_downtime_depth, failure_prediction_enabled, process_performance_data, obsess_over_service, modified_service_attributes, event_handler, check_command, normal_check_interval, retry_check_interval, check_timeperiod_object_id) VALUES (1,24302,FROM_UNIXTIME(1669199184),'CHECK_NRPE: Receive header underflow - only 0 bytes received (4 expected).','','',3,1,1,5,5,FROM_UNIXTIME(1669198922),FROM_UNIXTIME(1669199221),0,0,FROM_UNIXTIME(1663651268),FROM_UNIXTIME(1663651268),3,FROM_UNIXTIME(1661943887),FROM_UNIXTIME(0),FROM_UNIXTIME(1669198922),FROM_UNIXTIME(1662738723),1,FROM_UNIXTIME(0),FROM_UNIXTIME(3600),0,1,0,0,0,1,1,1,1,0,0.000000,0.000000,0.585218,0,0,1,1,0,'','sgre_plt_disk_usage_nrpe!30!75!85!!!!!',5.000000,1.000000,157) ON DUPLICATE KEY UPDATE instance_id = VALUES(instance_id), service_object_id = VALUES(service_object_id), status_update_time = VALUES(status_update_time), output = VALUES(output), long_output = VALUES(long_output), perfdata = VALUES(perfdata), current_state = VALUES(current_state), has_been_checked = VALUES(has_been_checked), should_be_scheduled = VALUES(should_be_scheduled), current_check_attempt = VALUES(current_check_attempt), max_check_attempts = VALUES(max_check_attempts), last_check = 
VALUES(last_check), next_check = VALUES(next_check), check_type = VALUES(check_type), check_options = VALUES(check_options), last_state_change = VALUES(last_state_change), last_hard_state_change = VALUES(last_hard_state_change), last_hard_state = VALUES(last_hard_state), last_time_ok = VALUES(last_time_ok), last_time_warning = VALUES(last_time_warning), last_time_unknown = VALUES(last_time_unknown), last_time_critical = VALUES(last_time_critical), state_type = VALUES(state_type), last_notification = VALUES(last_notification), next_notification = VALUES(next_notification), no_more_notifications = VALUES(no_more_notifications), notifications_enabled = VALUES(notifications_enabled), problem_has_been_acknowledged = VALUES(problem_has_been_acknowledged), acknowledgement_type = VALUES(acknowledgement_type), current_notification_number = VALUES(current_notification_number), passive_checks_enabled = VALUES(passive_checks_enabled), active_checks_enabled = VALUES(active_checks_enabled), event_handler_enabled = VALUES(event_handler_enabled), flap_detection_enabled = VALUES(flap_detection_enabled), is_flapping = VALUES(is_flapping), percent_state_change = VALUES(percent_state_change), latency = VALUES(latency), execution_time = VALUES(execution_time), scheduled_downtime_depth = VALUES(scheduled_downtime_depth), failure_prediction_enabled = VALUES(failure_prediction_enabled), process_performance_data = VALUES(process_performance_data), obsess_over_service = VALUES(obsess_over_service), modified_service_attributes = VALUES(modified_service_attributes), event_handler = VALUES(event_handler), check_command = VALUES(check_command), normal_check_interval = VALUES(normal_check_interval), retry_check_interval = VALUES(retry_check_interval), check_timeperiod_object_id = VALUES(check_timeperiod_object_id)
So I'm trying to set a non-debug mode for this kind of error, or at least reduce the number of these entries in the log: today alone the log contains 5804826 failed inserts, and it has grown to 20GB. We have been suffering from this problem for some days; the biggest log was about 54GB, so we ran out of space and the app collapsed.
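
To quantify the flood without opening a multi-gigabyte file in an editor, the log can be streamed line by line. This is a minimal sketch, assuming each failure is logged as a "The following query failed" line followed by the statement itself, as in the excerpt above. It counts failures per target table. The function name and regular expression are illustrative only.

```python
import re
from collections import Counter

FAILED_MARKER = "NDO-3: The following query failed"
# Captures the table name from the statement logged right after the marker.
TABLE = re.compile(r"NDO-3: (?:INSERT INTO|UPDATE|DELETE FROM)\s+(\w+)", re.IGNORECASE)


def count_failed_queries(lines):
    """Count failed-query log entries per table, reading lines lazily."""
    counts = Counter()
    expect_query = False
    for line in lines:
        if FAILED_MARKER in line:
            expect_query = True
            continue
        if expect_query:
            match = TABLE.search(line)
            counts[match.group(1) if match else "(unparsed)"] += 1
            expect_query = False
    return counts


sample = [
    "[1669199184] NDO-3: The following query failed while MySQL appears to be connected:",
    "[1669199184] NDO-3: INSERT INTO nagios_servicestatus (instance_id) VALUES (1)",
    "VALUES(last_check), next_check = VALUES(next_check)",
]
for table, total in count_failed_queries(sample).most_common():
    print(table, total)
```

Because the function accepts any iterable of lines, an open file object (for example `open("/usr/local/nagios/var/nagios.log", errors="replace")`) can be passed directly, and memory use stays flat regardless of log size.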

The problem is that, as far as I know, we don't have access to the NDO3 code or debug options, so from the customer side we can't do anything except downgrade to NDO2. In our case that is not possible, because we had to upgrade to NDO3 due to performance problems.
