Hello,
The nagios service stopped unexpectedly; below is the error from the log file. This server is part of a Nagios HA cluster.
nagiosxi version: 5.6.6
Because of this issue we switched the service from init.d to systemd, but the nagios service keeps stopping.
+++++++++++++++++++++++++++++++++++++++++++
Sep 08 02:00:03 in-nagios-a.informatica.com systemd[1]: Stopping Cluster Controlled nagios...
Sep 08 02:00:03 in-nagios-a.informatica.com nagios[9633]: Caught SIGTERM, shutting down...
Sep 08 02:00:03 in-nagios-a.informatica.com nagios[9633]: Caught SIGTERM, shutting down...
Sep 08 02:00:03 in-nagios-a.informatica.com nagios[9813]: Caught SIGTERM, shutting down...
Sep 08 02:00:03 in-nagios-a.informatica.com nagios[9633]: Successfully shutdown... (PID=9633)
Sep 08 02:00:04 in-nagios-a.informatica.com nagios[9633]: livestatus: Socket thread has terminated
Sep 08 02:01:57 in-nagios-a.informatica.com systemd[1]: nagios.service stop-sigterm timed out. Killing.
Sep 08 02:01:57 in-nagios-a.informatica.com systemd[1]: nagios.service: main process exited, code=killed, status=9/KILL
Sep 08 02:01:57 in-nagios-a.informatica.com systemd[1]: Stopped Nagios Core 4.4.3.
Sep 08 02:01:57 in-nagios-a.informatica.com systemd[1]: Unit nagios.service entered failed state.
Sep 08 02:01:57 in-nagios-a.informatica.com systemd[1]: nagios.service failed.
+++++++++++++++++++++++++++++++++++++++++++
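What stands out in this excerpt is the gap: Nagios reports "Successfully shutdown" at 02:00:03, yet systemd only gives up and SIGKILLs the unit at 02:01:57. That usually means some process was still alive in the service's cgroup after the main process exited. A quick sanity check on the timestamps (the year is an assumption, since the journal excerpt omits it):

```shell
# Seconds between Nagios reporting a clean shutdown and systemd's SIGKILL.
# Timestamps copied from the journal excerpt above; the year 2020 is assumed.
t_shutdown=$(date -u -d "2020-09-08 02:00:03" +%s)
t_sigkill=$(date -u -d "2020-09-08 02:01:57" +%s)
echo $(( t_sigkill - t_shutdown ))   # prints 114
```

That is roughly 114 seconds of waiting, so it is worth checking what effective stop timeout applies to the unit and whether anything (a leftover worker, the livestatus broker) was still running under the service when the main Nagios process had already exited.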
Below is the systemd file for nagios service.
+++++++++++++++++++++++++++
[root@in-nagios-a ~]# cat /usr/lib/systemd/system/nagios.service
[Unit]
Description=Nagios Core 4.4.3
Documentation=https://www.nagios.org/documentation
After=network.target local-fs.target
[Service]
Type=forking
ExecStartPre=/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
ExecStart=/usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
ExecStop=/usr/bin/kill -s TERM ${MAINPID}
ExecStopPost=/usr/bin/rm -f /usr/local/nagios/var/rw/nagios.cmd
ExecReload=/usr/bin/kill -s HUP ${MAINPID}
[Install]
WantedBy=multi-user.target
+++++++++++++++++++++++++++
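One thing worth noting about this unit: it sets no explicit `TimeoutStopSec`, so systemd falls back to its default stop timeout, and with the default `KillMode=control-group` it waits for every process in the service's cgroup (not just `MAINPID`) to exit before the stop is considered complete. If any child process survives the SIGTERM, systemd waits out the timeout and then SIGKILLs, which matches the "stop-sigterm timed out. Killing." line. A drop-in override is one way to experiment with this; the values below are illustrative assumptions, not recommendations:

```ini
# /etc/systemd/system/nagios.service.d/override.conf  (drop-in, standard systemd path)
[Service]
# Give up on straggler processes sooner than the distribution default:
TimeoutStopSec=30
# SIGTERM only the main process; SIGKILL the rest of the cgroup on timeout:
KillMode=mixed
```

After creating the drop-in, run `systemctl daemon-reload` so systemd picks it up.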
nagios service stopped frequently
Re: nagios service stopped frequently
This appears to be a shutdown initiated by the HA setup. Was the other XI instance started when this instance was shut down? Tell us a bit more about the HA setup. What criteria must be met for the HA setup to switch the roles of the XI instances?
Re: nagios service stopped frequently
Hello,
After the nagios service shut down, it was not started on the other instance, apparently because of the kill timeout I mentioned earlier from the logs.
++++++++++++++++++++++++++
Sep 08 02:01:57 in-nagios-a.informatica.com systemd[1]: nagios.service stop-sigterm timed out. Killing.
++++++++++++++++++++++++++
The cluster is a two-node DRBD cluster managed by CRM.
crm status looked fine when the nagios service shut down. Below is the crm status output, yet the nagios service is in a stopped state.
+++++++++++++++++++++++++++++++++++++++++++++
[root@qy-nagios-a ~]# crm status
Stack: corosync
Current DC: qy-nagios-b.informatica.com (version 1.1.15.linbit-1.0+20160622+e174ec8.el7-e174ec8) - partition with quorum
Last updated: Wed Sep 9 23:22:03 2020 Last change: Wed Sep 9 09:40:16 2020 by root via crm_attribute on qy-nagios-a.informatica.com
2 nodes and 7 resources configured
Online: [ qy-nagios-a.informatica.com qy-nagios-b.informatica.com ]
Full list of resources:
Resource Group: g_nagios
p_fs_drbd (ocf:Filesystem): Started qy-nagios-a.informatica.com
p_crond (systemd:crond): Started qy-nagios-a.informatica.com
p_nagios (systemd:nagios): Started qy-nagios-a.informatica.com
p_npcd (systemd:npcd): Started qy-nagios-a.informatica.com
p_virtip (ocf:IPaddr2): Started qy-nagios-a.informatica.com
Master/Slave Set: ms_drbd_r0 [p_drbd_r0]
Masters: [ qy-nagios-a.informatica.com ]
Slaves: [ qy-nagios-b.informatica.com ]
+++++++++++++++++++++++++++++++++++++++++++++
Regards,
Dhinil KV
Re: nagios service stopped frequently
Please attach your nagios.service unit file for systemd.
What is the full output of this command?
Code: Select all
crm configure show
Re: nagios service stopped frequently
Hello,
I have attached the nagios.service file.
Below is the output of crm configure show.
+++++++++++++++++++++++++++
node 1: in-nagios-a.informatica.com \
attributes standby=off maintenance=off
node 2: in-nagios-b.informatica.com \
attributes standby=off maintenance=off
primitive p_crond systemd:crond \
op start interval=0 timeout=20s \
op stop interval=0 timeout=20s \
op monitor interval=20s timeout=20s \
meta is-managed=true target-role=Started
primitive p_drbd_r0 ocf:linbit:drbd \
params drbd_resource=r0 \
op monitor interval=29 role=Master \
op monitor interval=30 role=Slave \
op start interval=0 timeout=240s \
op stop interval=0 timeout=100s
primitive p_fs_drbd Filesystem \
params device="/dev/drbd0" directory="/drbd" fstype=xfs options=noatime \
op start interval=0 timeout=60s \
op stop interval=0 timeout=100s \
op monitor interval=10s timeout=40s \
meta maintenance=false is-managed=true target-role=Started
primitive p_mysql mysql \
params binary="/usr/bin/mysqld_safe" client_binary="/usr/bin/mysql" datadir="/drbd/mysql" \
op monitor interval=30s timeout=30s \
op start interval=0 timeout=120s \
op stop interval=0 timeout=120s \
meta is-managed=true target-role=Started
primitive p_nagios systemd:nagios \
op start interval=0 timeout=60s \
op stop interval=0 timeout=60s \
op monitor interval=20s timeout=30s \
meta is-managed=true target-role=Started
primitive p_ndo2db systemd:ndo2db \
op start interval=0 timeout=30s \
op stop interval=0 timeout=30s \
op monitor interval=20s timeout=30s \
meta is-managed=true target-role=Started
primitive p_npcd systemd:npcd \
op start interval=0 timeout=100s \
op stop interval=0 timeout=100s \
op monitor interval=20s timeout=100s
primitive p_virtip IPaddr2 \
params ip=10.65.32.89 cidr_netmask=24 \
op start interval=0 timeout=20s \
op stop interval=0 timeout=20s \
op monitor interval=10s timeout=20s \
meta is-managed=true target-role=Started
group g_nagios p_fs_drbd p_mysql p_crond p_ndo2db p_nagios p_npcd p_virtip \
meta target-role=Started
ms ms_drbd_r0 p_drbd_r0 \
meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true maintenance=false is-managed=true target-role=Started
colocation c_nagios_with_drbd inf: g_nagios ms_drbd_r0:Master
location cli-prefer-g_nagios g_nagios role=Started inf: in-nagios-a.informatica.com
order o_drbd_before_nagios inf: ms_drbd_r0:promote g_nagios:start
property cib-bootstrap-options: \
have-watchdog=false \
dc-version="1.1.15.linbit-2.0+20160622+e174ec8.el7-e174ec8" \
cluster-infrastructure=corosync \
cluster-name=nagiosxi-cluster \
stonith-enabled=false \
no-quorum-policy=ignore \
maintenance-mode=false \
last-lrm-refresh=1597808668
rsc_defaults rsc-options: \
failure-timeout=2m \
migration-threshold=3 \
resource-stickiness=200
+++++++++++++++++++++++++++
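One detail in this configuration worth cross-checking: `p_nagios` gives its stop operation only a 60 s timeout, while the journal shows systemd taking well over 100 s to finish stopping the unit. If systemd's stop outlasts Pacemaker's stop timeout, Pacemaker counts the stop as failed even though systemd eventually completes it, which can leave the resource stranded instead of failing over. A sketch of widening that timeout (the 180 s value is an illustrative assumption; pick something comfortably above the unit's effective `TimeoutStopSec`):

```
# Sketch: raise the Pacemaker stop timeout for p_nagios (180s is illustrative).
# 'crm configure edit p_nagios' opens the primitive in an editor; change
#     op stop interval=0 timeout=60s
# to, for example,
#     op stop interval=0 timeout=180s
crm configure edit p_nagios
crm configure verify
```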
Re: nagios service stopped frequently
What did systemctl status nagios show when nagios was down like that?
What does your /usr/local/nagios/var/nagios.log show right before you see the nagios service stopped?
When you apply configuration, the nagios service restarts. I'm wondering if those apply-config restarts are affecting your cluster setup/monitoring in some fashion. How does CRM determine that the service has failed? (That's why I requested the systemd status output from when this issue occurred.)
Why CRM/Pacemaker wouldn't show it as stopped when it was stopped would be more something you'd need to talk with the CRM/Pacemaker admin/support on. We do not use or admin DRBD/CRM/Pacemaker here.
Re: nagios service stopped frequently
systemctl status nagios showed the nagios service as stopped/failed.
Below is the log output at that time.
+++++++++++++++++++++++++++++++++++++++++++
Sep 08 02:00:03 systemd[1]: Stopping Cluster Controlled nagios...
Sep 08 02:00:03 nagios[9633]: Caught SIGTERM, shutting down...
Sep 08 02:00:03 nagios[9633]: Caught SIGTERM, shutting down...
Sep 08 02:00:03 nagios[9813]: Caught SIGTERM, shutting down...
Sep 08 02:00:03 nagios[9633]: Successfully shutdown... (PID=9633)
Sep 08 02:00:04 nagios[9633]: livestatus: Socket thread has terminated
Sep 08 02:01:57 systemd[1]: nagios.service stop-sigterm timed out. Killing.
Sep 08 02:01:57 systemd[1]: nagios.service: main process exited, code=killed, status=9/KILL
Sep 08 02:01:57 systemd[1]: Stopped Nagios Core 4.4.3.
Sep 08 02:01:57 systemd[1]: Unit nagios.service entered failed state.
Sep 08 02:01:57 systemd[1]: nagios.service failed.
+++++++++++++++++++++++++++++++++++++++++++
Re: nagios service stopped frequently
This looks like it's the cluster that's shutting it down:
Sep 08 02:00:03 systemd[1]: Stopping Cluster Controlled nagios...
You will need to investigate your cluster to see why it's stopping it.
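To investigate from the cluster side, one approach is to pull Pacemaker/Corosync activity for the window around the stop and check whether a failed monitor operation triggered it. A command sketch (the time window matches the journal excerpt; adjust to the actual incident):

```
# Cluster-side context around the stop (times from the journal excerpt).
journalctl -u pacemaker -u corosync \
    --since "2020-09-08 01:55:00" --until "2020-09-08 02:05:00"

# One-shot cluster status including inactive resources and failure counts:
crm_mon -1rf
```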