service checks stop working for no apparent reason

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
warapp
Posts: 33
Joined: Thu Jan 25, 2018 3:22 pm

service checks stop working for no apparent reason

Post by warapp »

I am having an issue in which all service checks stop working at the same time and for no apparent reason. When this happens no errors are reported on the xi -> admin status page, no activity appears in /usr/local/nagios/var/nagios.log and values in the Last Check column in xi -> Service Status page are not updated.

When the issue occurs, the only item(s) printed to /usr/local/nagios/var/nagios.log is activity related to external commands ... "[1628629148] SERVICE DOWNTIME ALERT: 00000-0 -- as-tst-001.example.com;sshd daemon;CANCELLED; Scheduled downtime for service has been cancelled."
We use external commands to schedule/unschedule service downtimes.

The condition continues for a random period of time and appears to self-recover. When it recovers, the following is printed to /usr/local/nagios/var/nagios.log:

[1628629199] Warning: A system time change of 2001 seconds (0d 0h 33m 21s forwards in time) has been detected. Compensating...

In addition, the following is printed:

[1628629243] NDO-3: The following query failed while MySQL appears to be connected:
[1628629243] NDO-3: INSERT INTO nagios_downtimehistory (instance_id, downtime_type, object_id, entry_time, author_name, comment_data, internal_downtime_id, triggered_by_id, is_fixed, duration, scheduled_start_time, scheduled_end_time) VALUES (1,1,28595,FROM_UNIXTIME(1628628063),'joe','00893254',878153,0,1,93600,FROM_UNIXTIME(1628627925),FROM_UNIXTIME(1628721525)) ON DUPLICATE KEY UPDATE instance_id = VALUES(instance_id), downtime_type = VALUES(downtime_type), object_id = VALUES(object_id), entry_time = VALUES(entry_time), author_name = VALUES(author_name), comment_data = VALUES(comment_data), internal_downtime_id = VALUES(internal_downtime_id), triggered_by_id = VALUES(triggered_by_id), is_fixed = VALUES(is_fixed), duration = VALUES(duration), scheduled_start_time = VALUES(scheduled_start_time), scheduled_end_time = VALUES(scheduled_end_time)
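The bracketed numbers in these log entries are Unix epoch seconds, which makes it hard to line them up against wall-clock time when a time jump is suspected. A minimal sketch for converting them, assuming GNU date is available (the `-d @SECONDS` form); `humanize_nagios_log` is a hypothetical helper name, not part of Nagios:

```shell
# Sketch: rewrite the leading "[epoch]" of each nagios.log line as a UTC timestamp.
# Assumes GNU date; humanize_nagios_log is a hypothetical helper, not a Nagios tool.
humanize_nagios_log() {
  while IFS= read -r line || [ -n "$line" ]; do
    case $line in
      \[[0-9]*\]*)
        ts=${line#\[}      # drop the leading "["
        ts=${ts%%\]*}      # keep only the text before the first "]"
        rest=${line#*\]}   # remainder of the line after the first "]"
        case $ts in
          *[!0-9]*) printf '%s\n' "$line" ;;   # not purely numeric: leave untouched
          *) printf '[%s]%s\n' "$(date -u -d "@$ts" '+%Y-%m-%d %H:%M:%S UTC')" "$rest" ;;
        esac
        ;;
      *) printf '%s\n' "$line" ;;              # lines without a timestamp pass through
    esac
  done
}
```

Something like `tail -n 50 /usr/local/nagios/var/nagios.log | humanize_nagios_log` would then show the most recent entries with readable times.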

This system was built with a manual install of Nagios XI 5.7.5 (which completed without error) on top of a fresh install of Oracle Linux 8. I then restored a Nagios XI backup taken on a machine running CentOS 6 with the same Nagios XI version, 5.7.5. The restore completed without error, and I have a "functioning" Nagios XI running on Oracle Linux 8.

I suspect the issue is related to the time drift, but I am not sure. It could also be related to external commands.

I need to understand why the service checks stop working at random times, and correct the issue.
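One way to pin down when the stalls begin and end is to look for long silences in nagios.log. The following is a rough sketch, not an official tool: `find_log_gaps` is a hypothetical helper that reads log lines on stdin and reports any gap of at least a given number of seconds between consecutive timestamped entries. Because downtime-related entries keep appearing during a stall, it accepts an optional pattern so that only selected entry types are counted.

```shell
# Sketch: report windows where matching nagios.log entries stop for too long.
# find_log_gaps is a hypothetical helper; it reads log lines on stdin.
find_log_gaps() {
  # $1 = minimum gap in seconds to report (default 300)
  # $2 = optional awk regex; only lines matching it are considered (default: all)
  awk -v min="${1:-300}" -v pat="${2:-.}" '
    /^\[[0-9]+\]/ && $0 ~ pat {
      ts = substr($0, 2, index($0, "]") - 2) + 0   # epoch between "[" and "]"
      if (prev && ts - prev >= min)
        printf "gap of %d s between %d and %d\n", ts - prev, prev, ts
      prev = ts
    }'
}
```

For example, `find_log_gaps 600 'SERVICE ALERT|SERVICE NOTIFICATION' < /usr/local/nagios/var/nagios.log` would list silences of ten minutes or more among alert and notification entries only. The threshold and the pattern are assumptions to adjust for how busy the system normally is.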

Thanks in advance for your help.

-wr
warapp
Posts: 33
Joined: Thu Jan 25, 2018 3:22 pm

Re: service checks stop working for no apparent reason

Post by warapp »

Nagios event processing is stalled again.

This appears to be related to a table lock issue.

Here's what I see:
mysql> show full processlist;

| 8 | ndoutils | localhost | nagios | Execute | 39 | updating | UPDATE nagios_commenthistory SET deletion_time = FROM_UNIXTIME(1628686898), deletion_time_usec = 792110 WHERE comment_time = FROM_UNIXTIME(1628684788) AND internal_comment_id = 1211543 |
| 9 | ndoutils | localhost | nagios | Execute | 39 | Waiting for table level lock | INSERT INTO nagios_commenthistory (instance_id, comment_type, entry_type, object_id, comment_time, internal_comment_id, author_name, comment_data, is_persistent, comment_source, expires, expiration_time, entry_time, entry_time_usec) VALUES (1,2,2,38965,FROM_UNIXTIME(1628691410),1211857,'joe','This service has been scheduled for fixed downtime from 08-12-2021 00:00:00 to 08-13-2021 06:00:00. Notifications for the service will not be sent out during that time period.',0,0,0,FROM_UNIXTIME(0),FROM_UNIXTIME(1628691410),66950) ON DUPLICATE KEY UPDATE instance_id = VALUES(instance_id), comment_type = VALUES(comment_type), entry_type = VALUES(entry_type), object_id = VALUES(object_id), comment_time = VALUES(comment_time), internal_comment_id = VALUES(internal_comment_id), author_name = VALUES(author_name), comment_data = VALUES(comment_data), is_persistent = VALUES(is_persistent), comment_source = VALUES(comment_source), expires = VALUES(expires), expiration_time = VALUES(expiration_time), entry_time = VALUES(entry_time), entry_time_usec = VALUES(entry_time_usec) |
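To watch for this condition over time, the process list can be filtered down to the threads that are waiting on a lock. This is only a sketch under assumptions: `lock_waiters` is a hypothetical helper, and it expects the pipe-delimited table output of `SHOW FULL PROCESSLIST` (Id, User, Host, db, Command, Time, State, Info) on stdin.

```shell
# Sketch: print "id seconds state" for process-list rows whose State mentions a lock.
# lock_waiters is a hypothetical helper; input is the table output of
# SHOW FULL PROCESSLIST, one row per line.
lock_waiters() {
  awk -F'|' '
    NF >= 9 && tolower($8) ~ /lock/ {
      id = $2; secs = $7; state = $8
      gsub(/^ +| +$/, "", id)      # trim the padding around each column
      gsub(/^ +| +$/, "", secs)
      gsub(/^ +| +$/, "", state)
      printf "%s %s %s\n", id, secs, state
    }'
}
```

It could be fed with something like `mysql -u root -p -t -e 'SHOW FULL PROCESSLIST' | lock_waiters`, where `-t` keeps the table layout when the output is piped; the credentials are an assumption. Table-level lock waits are typical of MyISAM tables, so it may also be worth checking the storage engine of the nagios tables in information_schema.tables.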
pbroste
Posts: 1288
Joined: Tue Jun 01, 2021 1:27 pm

Re: service checks stop working for no apparent reason

Post by pbroste »

Hello @warapp

Thanks for reaching out. What is the time difference between the Nagios XI server and the device that you are checking? Also, let's get the System Profile so we can see what things look like from that end.

To send us your system profile:
  • Login to the Nagios XI GUI using a web browser.
  • Click the "Admin" > "System Profile" Menu
  • Click the "Download Profile" button
  • Save the profile.zip file and send via Private Message
Thanks,
Perry
warapp
Posts: 33
Joined: Thu Jan 25, 2018 3:22 pm

Re: service checks stop working for no apparent reason

Post by warapp »

I've sent you the profile via pm.

This system is not yet stable.

I now see activity in /usr/local/nagios/var/nagios.log indicating that service checks are running and notifications are being sent, but the values in the Last Check column on the xi -> Service Status page are not correct. They show times about 10 hours ago.
pbroste
Posts: 1288
Joined: Tue Jun 01, 2021 1:27 pm

Re: service checks stop working for no apparent reason

Post by pbroste »

Hello @warapp

Thanks for following up. Yes, it looks like things have been a bit wonky since the migration over to your 8.4.

I want to have you go ahead and run the database repair and then bounce the nagios.service.

/usr/local/nagiosxi/scripts/repair_databases.sh
Please let us know how the service checks are running, and also test whether you are able to add a host via the API, per your other support forum post.

Follow up with the updated System Profile and let us know what works and what does not.

Thanks,
Perry
warapp
Posts: 33
Joined: Thu Jan 25, 2018 3:22 pm

Re: service checks stop working for no apparent reason

Post by warapp »

I've run the repair, restarted the service and sent the latest profile to you via pm.

I'm still seeing activity in /usr/local/nagios/var/nagios.log indicating that service checks are running and notifications are being sent, but the time values in the Last Check column on the xi -> Service Status page are not current; they show times about 1 hour ago.
warapp
Posts: 33
Joined: Thu Jan 25, 2018 3:22 pm

Re: service checks stop working for no apparent reason

Post by warapp »

The service checks are running, I see activity in /usr/local/nagios/var/nagios.log, but, nothing gets updated in xi, see attached.

This is a restored system. It is using postgresql and mysql.
warapp
Posts: 33
Joined: Thu Jan 25, 2018 3:22 pm

Re: service checks stop working for no apparent reason

Post by warapp »

Reviewing article here, https://support.nagios.com/kb/article/n ... ng-19.html

I see correct "Last Check Time" information in Core for the services. The issue is xi does not show correct times. I've reviewed the article, in particular step 2, and nothing has corrected this issue.
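To compare the two views more systematically, Core's side can be read straight from its status file and checked against what the XI Service Status page shows. A minimal sketch, assuming the default status file location /usr/local/nagios/var/status.dat; `stale_services` is a hypothetical helper that lists services whose last_check is older than a given age.

```shell
# Sketch: print "host;service;age_seconds" for services in a Nagios Core
# status.dat whose last_check is older than a limit.
# stale_services is a hypothetical helper; status.dat content is read from stdin.
stale_services() {
  # $1 = current epoch time, $2 = maximum acceptable age in seconds
  awk -v now="$1" -v max="$2" '
    { sub(/^[ \t]+/, "") }                         # fields are indented in status.dat
    /^servicestatus \{/ { in_svc = 1; host = svc = ""; last = 0; next }
    in_svc && /^host_name=/           { host = substr($0, 11) }
    in_svc && /^service_description=/ { svc  = substr($0, 21) }
    in_svc && /^last_check=/          { last = substr($0, 12) + 0 }
    in_svc && /^\}/ {
      if (now - last > max) printf "%s;%s;%d\n", host, svc, now - last
      in_svc = 0
    }'
}
```

For example, `stale_services "$(date +%s)" 3600 < /usr/local/nagios/var/status.dat` prints services that Core itself has not checked in the last hour. If that list is empty while XI still shows old Last Check times, the delay is between Core and the database that XI reads, not in check scheduling.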
pbroste
Posts: 1288
Joined: Tue Jun 01, 2021 1:27 pm

Re: service checks stop working for no apparent reason

Post by pbroste »

Hello @warapp

To follow up, I spoke to a couple of colleagues about the migration issues you have been having since the move over to 8.4. We would like to go ahead and downgrade NDO3. Please take a full backup or VM snapshot before proceeding.

### STANDARD DOWNGRADE OF NDO3

systemctl stop nagios
cd /tmp
rm -rf /tmp/nagiosxi
wget https://assets.nagios.com/downloads/nagiosxi/5/xi-5.6.14.tar.gz
tar zxf xi-5.6.14.tar.gz
cd /tmp/nagiosxi/subcomponents/ndoutils
./install
systemctl enable ndo2db
Then edit your /usr/local/nagios/etc/nagios.cfg and make sure this line is uncommented:

broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg
Make sure this line is commented out:

#broker_module=/usr/local/nagios/bin/ndo.so /usr/local/nagios/etc/ndo.cfg
Then start the nagios service:

systemctl start nagios
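A quick way to confirm the nagios.cfg edits above took effect is to list only the broker_module lines that are not commented out. This is a small sketch; `active_broker_modules` is a hypothetical helper name, and the default path is the one used in this thread.

```shell
# Sketch: show which broker_module lines are active (not commented out).
# active_broker_modules is a hypothetical helper name.
active_broker_modules() {
  # $1 = path to nagios.cfg (defaults to the path used in this thread)
  grep -E '^[[:space:]]*broker_module=' "${1:-/usr/local/nagios/etc/nagios.cfg}"
}
```

After the downgrade, running `active_broker_modules` should print the ndomod.o line and not the ndo.so line. Running `systemctl is-active ndo2db` would additionally show whether the ndo2db daemon enabled above is actually running.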
Please follow up with the results,
Perry
warapp
Posts: 33
Joined: Thu Jan 25, 2018 3:22 pm

Re: service checks stop working for no apparent reason

Post by warapp »

Followed your instructions and downgraded NDO. This had no effect on my two issues: API host add not working and XI statuses not updating consistently.

A couple more observations that may help direct us to a cause/fix.

1) Last Check times in XI, when I first logged in today, were reporting correctly. Up-to-date and aligned to what I saw in Core.

2) I applied your NDO changes soon after and since then XI Last Check times have not changed; they show times around when I applied the NDO changes and restarted.