
Problem (Backend: nagios): NDO claims that Nagios did not update for more than 180 seconds

Posted: Fri Dec 05, 2025 1:39 pm
by alexh4e
Hello there,

we are experiencing a random, intermittent issue with the NDO backend in NagVis.

Software versions:

Nagios Core 4.5.3

Nagios XI 2024R1.2.2

NagVis 1.9.40b

All components (Nagios Core, Nagios XI, NagVis) are running in a PCS cluster with Pacemaker, Corosync and DRBD.

At random intervals, typically every few minutes, NagVis reports the following error on multiple objects:

Problem (Backend: nagios): NDO claims that Nagios did not update for more than 180 seconds.

Visual behavior:

The NagVis map shows Summary State = ERROR (all objects turn blue)

The summary output reports “Contains ERROR objects”

Many objects simultaneously switch to ERROR

The output column shows the same NDO 180-second timeout message

After a few minutes, the map returns automatically to OK

No Pacemaker failover occurs and all cluster resources remain Started.

We suspect a temporary interruption in the Nagios → NDO → DB data flow.
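To make the suspected failure mode concrete: the message suggests that NagVis compares the time of the last status update recorded in the NDO database against a threshold (the 180 seconds quoted in the error). The sketch below only illustrates that comparison. The function names are hypothetical, and where the timestamp comes from is an assumption (in NDO schemas it is commonly the `status_update_time` column of the `nagios_programstatus` table, which should be verified against the actual database).

```python
import time


def is_backend_stale(last_update_epoch, now_epoch=None, max_age_seconds=180):
    """Return True when the last status update is older than max_age_seconds.

    last_update_epoch is assumed to be the Unix time of the most recent
    status update written to the NDO database (hypothetical input).
    """
    if now_epoch is None:
        now_epoch = time.time()
    return (now_epoch - last_update_epoch) > max_age_seconds


def stale_periods(update_epochs, max_age_seconds=180):
    """Return (start, end) pairs for gaps between consecutive updates
    that exceed the threshold. update_epochs must be sorted ascending."""
    gaps = []
    for prev, cur in zip(update_epochs, update_epochs[1:]):
        if cur - prev > max_age_seconds:
            gaps.append((prev, cur))
    return gaps
```

If the update timestamp is sampled periodically, `stale_periods` would show whether the updates really stop for more than 180 seconds or whether the error appears without a matching gap, which would point at the query side instead.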

Any pointers on common causes, or on recommended checks and tuning for this scenario in HA clustered environments, would be appreciated.

Thank you.

Re: Problem (Backend: nagios): NDO claims that Nagios did not update for more than 180 seconds

Posted: Fri Dec 05, 2025 3:35 pm
by ekapsner
Hello @alexh4e,

There are a couple of potential causes for this issue. To narrow it down, could you check whether there are any NDO-related messages in /var/local/nagios/var/nagios.log or any database errors in /var/log/mysql/mysqld.log? We're specifically looking for messages logged at about the same time as the NagVis timeouts.

Also, is the database running on the same node as XI and NDO or on a separate server?

It might also be worth checking the monitoring engine's status under Admin -> System Information -> Monitoring Engine Status. This can give you some insight into whether the issue affects only NagVis or the rest of XI as well.
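The log check described above could be sketched roughly as follows. This is a minimal example, not an official tool: it relies on Nagios Core log lines starting with a Unix timestamp in square brackets, and it assumes that NDO-related entries contain the substring "ndo" (the keyword, the function name, and the 5-minute window are all assumptions to adjust).

```python
import re

# Nagios Core log lines begin with "[<unix epoch>] " followed by the message.
LOG_LINE = re.compile(r"^\[(\d+)\]\s+(.*)$")


def ndo_messages_near(lines, event_epoch, window_seconds=300, keyword="ndo"):
    """Return (timestamp, message) pairs within window_seconds of event_epoch
    whose message contains keyword (case-insensitive)."""
    matches = []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if not m:
            continue  # skip lines that do not follow the timestamp format
        ts = int(m.group(1))
        text = m.group(2)
        if abs(ts - event_epoch) <= window_seconds and keyword in text.lower():
            matches.append((ts, text))
    return matches
```

To use it, open the nagios.log file, pass the file object as `lines`, and pass the Unix time of one of the NagVis timeouts as `event_epoch`. An empty result for several timeouts would suggest looking at the database log next.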

Thanks,
Emmett

Re: Problem (Backend: nagios): NDO claims that Nagios did not update for more than 180 seconds

Posted: Sat Dec 06, 2025 11:18 am
by alexh4e
The Nagios DB is on the same server. As I said, all Nagios applications are managed by PCS and DRBD, and everything lives under the /drbd filesystem for HA.

Here are some more details about the configuration. This example is from our lab, but the production setup is essentially the same.

[root@nagsrv1 ~]# pcs status
Cluster name: nag_cluster

WARNINGS:
No stonith devices and stonith-enabled is not false

Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: nagsrv2 (version 2.1.7-5.3.el8_10-0f7f88312) - partition with quorum
  * Last updated: Sat Dec 6 17:07:31 2025 on nagsrv1
  * Last change: Fri Dec 5 15:26:13 2025 by root via root on nagsrv1
  * 2 nodes configured
  * 11 resource instances configured

Node List:
  * Online: [ nagsrv1 nagsrv2 ]

Full List of Resources:
  * Clone Set: ms_drbd_r0 [drbd_r0] (promotable):
    * Masters: [ nagsrv1 ]
    * Slaves: [ nagsrv2 ]
  * Resource Group: g_nagios:
    * p_fs_drbd (ocf::heartbeat:Filesystem): Started nagsrv1
    * p_vipSPV (ocf::heartbeat:IPaddr2): Started nagsrv1
    * p_mysql (ocf::heartbeat:mysql): Started nagsrv1
    * p_snmptrapd (systemd:snmptrapd): Started nagsrv1
    * p_snmptt (systemd:snmptt): Started nagsrv1
    * p_crond (systemd:crond): Started nagsrv1
    * p_nagios (systemd:nagios): Started nagsrv1
    * p_npcd (systemd:npcd): Started nagsrv1
    * p_httpd (systemd:httpd): Started nagsrv1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


[root@nagsrv1 ~]# cd /drbd/
[root@nagsrv1 drbd]# ll
total 8
drwxr-xr-x 4 root root 35 5 dic 13.25 backups
drwxrwsr-x. 4 root nagios 4096 5 dic 15.06 mibs
drwxrwxr-x 2 apache nagios 21 6 dic 17.10 mrtg
drwxr-xr-x 10 mysql mysql 4096 5 dic 14.32 mysql
drwxr-xr-x 8 root root 79 5 dic 13.24 nagios
drwxr-xr-x 10 root nagios 102 5 dic 13.25 nagiosxi
drwxrwxr-x 5 apache apache 70 5 dic 13.24 nagvis
drwxr-xr-x 2 root nagios 181 5 dic 15.12 snmp
[root@nagsrv1 drbd]# pwd
/drbd
[root@nagsrv1 drbd]#
[root@nagsrv1 drbd]#
[root@nagsrv1 drbd]# drbdadm status
r0 role:Primary
  disk:UpToDate open:yes
  nagsrv2 role:Secondary
    peer-disk:UpToDate

[root@nagsrv1 drbd]# lsblk
NAME                MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                   8:0    0   40G  0 disk
├─sda1                8:1    0  600M  0 part /boot/efi
├─sda2                8:2    0    1G  0 part /boot
└─sda3                8:3    0 38,4G  0 part
  ├─rhel-root       253:0    0 34,5G  0 lvm  /
  └─rhel-swap       253:1    0    4G  0 lvm  [SWAP]
sdb                   8:16   0   20G  0 disk
└─sdb1                8:17   0   20G  0 part
  └─drbd-drbddata   253:2    0   19G  0 lvm
    └─drbd0         147:0    0   19G  0 disk /drbd
sr0                  11:0    1 13,3G  0 rom

Re: Problem (Backend: nagios): NDO claims that Nagios did not update for more than 180 seconds

Posted: Mon Dec 08, 2025 4:56 pm
by DoubleDoubleA
Hi @alexh4e,

I think you will be best served by putting in a support ticket on this one.

In general, I would be surprised if there were an interruption of the data flow from Core to MySQL. We have recently found some instances where VERY heavy NDO traffic of a very specific type will crash NDO and Core, and we have fixes for that which we hope to release soon. Still, that doesn't sound like what you are experiencing: Core is not crashing for you, and things somehow recover on their own.

The Nagios development team answers questions on the forum, while the support team handles the actual tickets. They will have an SLA for you, a private means of sharing your XI profile, and a lot more experience with DRBD.

Aaron