Nagios Support Forum

Posted: **Tue May 05, 2020 5:27 am**

Dears,

During the night of 5 May, from midnight until around 05:00 am (GMT+2), ALL graphs stopped working.
Our NOC also informed us that Nagios itself somehow stopped working during that time (we are not 100% sure of this).
Everything seemed to be back to normal at around 05:00 am.

Which logs or other means can we use to verify what happened?

Rgds,
Matthew

Posted: **Tue May 05, 2020 10:39 am**

The first thing you may want to do is look at the Nagios log (/usr/local/nagios/var/nagios.log) to see whether it contains any more information on what exactly was failing at that time. Please sanitize the logs from that time frame and post them here so that we can examine them.

We may be able to pinpoint certain problems by looking at your Nagios profile. You can download the profile by navigating to System Profile under System Config in the left pane and clicking the Download Profile button on that page.

Also, for reference, here's our guide with detailed information about the diagnostic logs available in Nagios XI:

Nagios XI Log File Locations and Descriptions

Posted: **Wed May 06, 2020 3:33 am**

Hi,

Thanks for the explanation. I had already looked at those logs but didn't find anything out of the ordinary.
Attached are the profile and the nagios.log for 5 May; maybe I missed something.

Moderator's Note: The profile has been shared with the support team but has been removed from the public forum.

As I explained, the issue seems to have started at midnight and lasted until 05:00 am (our time zone is GMT+2).

I hope we can pinpoint this since we were left blind for approx. 4 hours.

Rgds,
Matt

Posted: **Wed May 06, 2020 4:26 pm**

Hi,

I'm seeing the entries below in the log you sent over. There was a gap in the data, and then the server restarted. Do these times reflect what you observed?

Code:

[1588630189] SERVICE ALERT: vip-tis-hlrdra01-p_v-ncc;210_S-ipops-BFX016-DIAMETER-TrafficSummation HLR DRA sums for TIS;WARNING;SOFT;1;Traffic is higher than Threshold (900) - Risk of throttling - In last minute the sum ( TIS-MILAN TIS-ROME ) of messages sent from vip-int-dra01-p : 946, received by vip-int-dra01-p : 945.
[1588648502] Caught SIGTERM, shutting down...
[1588648595] Nagios 4.4.5 starting... (PID=20446)
[1588648595] Local time is Tue May 05 05:16:35 CEST 2020
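
For readers who want to check a log for this kind of silence themselves, here is a minimal sketch, not an official Nagios tool. It assumes every nagios.log line starts with a bracketed epoch timestamp (as in the excerpt above) and that the server runs at a fixed UTC+2 offset; the helper names `to_local` and `find_gaps` and the one-hour threshold are made up for illustration.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC+2 (CEST) offset, as stated in the thread.
LOCAL_TZ = timezone(timedelta(hours=2))
# nagios.log lines begin with "[<epoch seconds>]".
STAMP = re.compile(r"^\[(\d+)\]")


def to_local(epoch):
    """Format an epoch timestamp as local (UTC+2) time."""
    return datetime.fromtimestamp(epoch, tz=LOCAL_TZ).strftime("%Y-%m-%d %H:%M:%S")


def find_gaps(lines, min_gap_seconds=3600):
    """Return (previous_epoch, next_epoch) pairs at least min_gap_seconds apart."""
    gaps = []
    previous = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # skip lines without a leading timestamp
        current = int(match.group(1))
        if previous is not None and current - previous >= min_gap_seconds:
            gaps.append((previous, current))
        previous = current
    return gaps


# Timestamps taken from the excerpt above; message text shortened.
sample = [
    "[1588630189] SERVICE ALERT: ...",
    "[1588648502] Caught SIGTERM, shutting down...",
    "[1588648595] Nagios 4.4.5 starting... (PID=20446)",
]
for start, end in find_gaps(sample):
    print(f"no entries from {to_local(start)} to {to_local(end)} ({end - start} s)")
```

On the three timestamps quoted above this reports a silence from 2020-05-05 00:09:49 to 2020-05-05 05:15:02 (18313 s). To scan a real file, pass `open("/usr/local/nagios/var/nagios.log")` instead of `sample`.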

I also noticed some crashed database tables in the log, so I would recommend running the database repair script. Run the following command as root from the terminal.

Code:

/usr/local/nagiosxi/scripts/repair_databases.sh

There is a built-in wizard for monitoring the XI server. Go to Configure > Start Monitoring Now and search for Nagios XI. I would recommend setting this up so that you receive a notification if the Nagios process stops.

Hope that helps. Let me know if you need further assistance.

Posted: **Thu May 07, 2020 3:25 am**

Hi,

That time is actually when the daily backup starts.

Code:

#Nagios XI Backup
15 5 * * * /usr/local/nagiosxi/scripts/backup_xi.sh -n nagiosxi_2_production -d /mnt/Backups/Nagios > /usr/local/nagiosxi/scripts/backup_xi.log

That of course occurs daily (example for today):

Code:

[1588821303] Caught SIGTERM, shutting down...
[1588821304] Successfully shutdown... (PID=10982)
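
As a quick cross-check of the point above, the sketch below converts the two SIGTERM timestamps quoted in this thread to local time and compares them with the 05:15 slot from the crontab entry. It assumes a fixed UTC+2 offset for the server; `matches_backup_schedule` is a hypothetical helper name, not part of Nagios XI.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the server clock is at a fixed UTC+2 (CEST) offset.
LOCAL_TZ = timezone(timedelta(hours=2))


def matches_backup_schedule(epoch, hour=5, minute=15):
    """True if the timestamp falls in the cron slot '15 5 * * *' (05:15 local)."""
    local = datetime.fromtimestamp(epoch, tz=LOCAL_TZ)
    return (local.hour, local.minute) == (hour, minute)


# "Caught SIGTERM" timestamps quoted in this thread (5 May and 7 May).
sigterms = [1588648502, 1588821303]
for epoch in sigterms:
    local = datetime.fromtimestamp(epoch, tz=LOCAL_TZ)
    print(local.strftime("%Y-%m-%d %H:%M:%S"), matches_backup_schedule(epoch))
```

Both timestamps land within a few seconds of 05:15 local time, which is consistent with the shutdown being triggered by the backup job rather than by whatever caused the earlier gap.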

The issue seems to have started about four hours before that, and I can't work out why it happened. The only thing I observed the next day was a four-hour gap in the graphs, and according to our night shift, Nagios was stalled.

The OS was never rebooted according to the logs.

As for the XI monitor, I already have jobs and daemons being monitored, but there weren't any related alerts around those times.

Yes, running the database repair is a good idea and does no harm. May I ask where the crashed tables were noticed?
Can you also suggest a way to check the count of crashed tables in MySQL so that we can monitor it? I was thinking of something along these lines:

SELECT count(*) FROM information_schema.tables where engine not like '%MyISAM%' AND table_schema = 'nagios';

and monitor it from another XI server that we have running in parallel.

Also, do crashed tables repair themselves (it seems so)?

Rgds,

Posted: **Thu May 07, 2020 3:43 pm**

We found evidence of crashed tables in the database_log.txt file in your profile.

Yes, crashed tables can repair themselves. It is certainly possible that the database issue also affected the display of graphs. Please let us know if you continue to experience graphing issues after running the database repair.

You could use the following linked plugin to monitor the database tables of one XI instance from another.
https://exchange.nagios.org/directory/P ... us/details
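
If you would rather roll your own check, the following is only a rough sketch of the decision logic, not the linked plugin. It assumes you have already fetched (table name, comment) pairs yourself, for example from the Name and Comment columns of `SHOW TABLE STATUS`, and that a crashed MyISAM table reports a comment containing the word "crashed"; please verify that assumption on your MySQL version. The function name `check_crashed` and the sample rows are hypothetical. The exit codes follow the standard Nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
# Standard Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def check_crashed(rows):
    """Return (exit_code, message) for an iterable of (table_name, comment) pairs.

    Assumption: a crashed table carries a comment containing "crashed".
    """
    crashed = [name for name, comment in rows if "crashed" in (comment or "").lower()]
    if crashed:
        return CRITICAL, f"CRITICAL - {len(crashed)} crashed table(s): {', '.join(crashed)}"
    return OK, "OK - no crashed tables"


# Hypothetical rows for illustration only; real rows would come from the database.
rows = [
    ("example_table_a", ""),
    ("example_table_b", "Table './nagios/example_table_b' is marked as crashed and should be repaired"),
    ("example_table_c", None),
]
code, message = check_crashed(rows)
print(code, message)
```

A real plugin would print the message and call `sys.exit(code)` so that the monitoring XI instance picks up the state.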

Posted: **Mon May 11, 2020 1:50 am**

Hi,

Is the database_log.txt file located somewhere else apart from the profile?
What concerns me is that I did not see any crashed-table messages in the log for that period of time. The evidence of crashed tables dates from 6 days before the issue.

Rgds,
Matthew

Posted: **Mon May 11, 2020 10:10 am**

The information in this text file is taken from a couple of MySQL logs. The location of these can vary depending on your operating system, but on CentOS 6 they are located at /var/log/mysqld.log and /var/log/mysql.err.

If you are not running CentOS, we may have to look at the script that generates your profile to determine exactly where your instance is pulling the information from.

Code:

/usr/local/nagiosxi/scripts/components/getprofile.sh
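
To see when crashed-table messages were actually logged, you could scan those MySQL logs directly. The sketch below is an illustration only: it assumes the usual MySQL wording "Table '...' is marked as crashed and should be repaired", the helper name `crashed_table_lines` is made up, and the sample lines (including their timestamp prefix) are invented to show the shape of the input, not copied from any real log.

```python
import re

# Assumption: MySQL reports crashed tables with this wording.
CRASHED = re.compile(r"Table '([^']+)' is marked as crashed")


def crashed_table_lines(lines):
    """Return (line_number, table_path, line) for each crashed-table message."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = CRASHED.search(line)
        if match:
            hits.append((number, match.group(1), line.rstrip("\n")))
    return hits


# Invented sample input for illustration only.
sample = [
    "200429  3:12:45 [Note] example informational message\n",
    "200429  3:12:46 [ERROR] mysqld: Table './nagios/example_table' is marked as crashed and should be repaired\n",
]
for number, table, line in crashed_table_lines(sample):
    print(f"line {number}: {table}")
```

To run it against a real file, pass `open("/var/log/mysqld.log")` instead of `sample` and compare the timestamps of the hits with the time of the outage.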

Logs to check for NagiosXI issues