nagios.service restart breaks checks

Post by **ppear** » Tue Nov 14, 2023 7:28 am

I'm trying to track down a few issues with a Nagios Core implementation. The first problem, whose fix may also resolve the second, is this: when I make a configuration change, validate it with /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg, and restart nagios.service, the application "breaks" and stops running checks. A reboot of the entire server clears these errors and the application runs checks again. I'm looking for someone to steer me in the right direction. Here is my nagios.log following a nagios.service restart.

[root@nagioshost etc]# systemctl restart nagios.service
[root@nagioshost etc]# ls
cgi.cfg cgi.cfg.bak htpasswd.users nagios.cfg objects resource.cfg
[root@nagioshost etc]# cd ..
[root@nagioshost nagios]# tail -f var/nagios.log
[1699945778] Caught SIGTERM, shutting down...
[1699945778] Successfully shutdown... (PID=1620)
[1699945778] Nagios 4.4.13 starting... (PID=4069163)
[1699945778] Local time is Tue Nov 14 08:09:38 CET 2023
[1699945778] LOG VERSION: 2.0
[1699945778] qh: Socket '/usr/local/nagios/var/rw/nagios.qh' successfully initialized
[1699945778] qh: core query handler registered
[1699945778] qh: echo service query handler registered
[1699945778] qh: help for the query handler registered
[1699945778] wproc: Successfully registered manager as @wproc with query handler
[1699945796] Successfully launched command file worker with pid 4069913
[1699945796] Unable to send check for host 'epo website' to worker (ret=-2)
[1699945796] Unable to run check for service 'System Uptime' on host 'exchange mail'
[1699945796] Unable to run check for service 'Process Count' on host 'tenable sc'
[1699945796] Unable to send check for host 'backup 1' to worker (ret=-2)
[1699945806] Unable to send check for host 'backup 2' to worker (ret=-2)
[1699945807] Unable to run check for service 'System Uptime' on host 'sql server'
[1699945807] Unable to run check for service 'Memory Usage' on host 'tenable sc'
[1699945807] Unable to send check for host 'owa website' to worker (ret=-2)
[1699945807] Unable to send check for host 'sql server' to worker (ret=-2)
[1699945808] Unable to run check for service 'Disk Space C:' on host 'chat'
[1699945811] Unable to run check for service 'CPU Usage' on host 'file server'
[1699945812] Unable to run check for service 'Chat Client' on host 'chat'
[1699945816] Unable to run check for service 'System Uptime' on host 'chat'
[1699945817] Unable to run check for service 'Memory Usage' on host 'backup 2'

Post by **danderson** » Tue Nov 14, 2023 4:42 pm

Thanks for reaching out @ppear,

Just to clarify, this issue wasn't resolved by restarting with systemctl but it was fixed by restarting the entire server?

Post by **ppear** » Wed Nov 15, 2023 2:03 am

That is correct. The results above are from after the 'systemctl restart nagios.service' command is executed following validated changes. A full server OS reboot returns the nagios application to normal operation: the "unable to ..." lines in the log go away and checks run as expected.

Post by **kg2857** » Wed Nov 15, 2023 10:45 pm

Try running a check from the shell as defined by the service as the nagios user and read the output.
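When running a plugin by hand like this, the exit code matters as much as the printed output. The following is a minimal sketch, assuming the standard Nagios plugin exit-code convention (0 to 3); `plugin_state` is a hypothetical helper name, and the plugin path and arguments in the usage comment are placeholders for whatever the service definition actually runs.

```shell
# Map a plugin exit code to the state name used by the Nagios plugin convention.
# plugin_state is a hypothetical helper, not part of Nagios itself.
plugin_state() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        *) echo "OUT OF RANGE ($1)" ;;
    esac
}

# Usage (not run here): execute the plugin as the nagios user, then report its state.
#   sudo -u nagios /usr/local/nagios/libexec/check_ping -H remotehost -w 50%,100 -c 100%,250 -p 5
#   plugin_state $?
```

Running the command through `sudo -u nagios` from another account, rather than calling `sudo` while logged in as nagios, keeps the check running with the nagios user's own privileges.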

Post by **ppear** » Thu Nov 16, 2023 2:17 am

Sure thing. I ran check_ping from the shell as the nagios user before and after a nagios.service restart. Both manual checks succeeded. Could it be that something in scheduling breaks, perhaps? Here are my results as a sanity check.

Checking from shell as -u nagios before nagios.service restart

[nagios@nagioshost nagios]$ sudo /usr/local/nagios/libexec/check_ping -H remotehost -w 50%,100 -c 100%,250 -p 5
PING OK - Packet loss = 0%, RTA = 1.01 ms|rta=1.008000ms;50.000000;100.000000;0.000000 pl=0%;50;100;0
[nagios@nagioshost nagios]$

Checking from shell as -u nagios after nagios.service restart

[nagios@nagioshost nagios]$ sudo systemctl restart nagios.service
[nagios@nagioshost nagios]$ sudo /usr/local/nagios/libexec/check_ping -H remotehost -w 50%,100 -c 100%,250 -p 5
PING OK - Packet loss = 0%, RTA = 0.82 ms|rta=0.818000ms;50.000000;100.000000;0.000000 pl=0%;50;100;0
[nagios@nagioshost nagios]$ tail -f var/nagios.log
[1700118508] qh: help for the query handler registered
[1700118508] wproc: Successfully registered manager as @wproc with query handler
[1700118525] Successfully launched command file worker with pid 32857
[1700118525] Unable to run check for service 'Total Processes' on host 'localhost'
[1700118525] Unable to run check for service 'Process Count' on host 'backup 2'
[1700118526] Unable to run check for service 'http ion' on host 'ion website'
[1700118528] Unable to send check for host 'secondary domain controller' to worker (ret=-2)
[1700118531] Unable to run check for service 'Disk Space C:' on host 'sharepoint'
[1700118532] Unable to run check for service 'SSH' on host 'localhost'
[1700118534] Unable to run check for service 'Disk Space C:' on host 'sql server'
[1700118535] Unable to run check for service 'Veeam Backup VSS Int Service' on host 'backup 1'

Post by **danderson** » Thu Nov 16, 2023 12:04 pm

Could you try modifying nagios.cfg to set the "debug_level" option to debug_level=16?

Please post the log message after this.

Thanks

Post by **ppear** » Fri Nov 17, 2023 4:12 am

Of course. Thank you for taking a look. In the nagios.debug log, entries for "Unable to run scheduled service check at this time" start occurring.

# 4096 = Interprocess communication
# 8192 = Scheduling
# 16384 = Workers

debug_level=16
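
For reference, the debug_level values are bit flags that can be OR'ed together, so worker debugging could be enabled alongside check debugging. This is a small sketch of the arithmetic only: `combine_debug_flags` is a hypothetical helper, and the meaning of 16 (host/service checks) is taken from the comments in the stock sample nagios.cfg rather than from this thread.

```shell
# Combine nagios.cfg debug_level flags with a bitwise OR.
# 16 = host/service checks, 16384 = workers (per the nagios.cfg comments).
# combine_debug_flags is a hypothetical helper, not a Nagios tool.
combine_debug_flags() {
    level=0
    for flag in "$@"; do
        level=$(( level | flag ))
    done
    echo "$level"
}

combine_debug_flags 16 16384   # prints 16400
```

The resulting number would go into nagios.cfg as debug_level=16400.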

after nagios.service restart

nagios.log
[1700212009] wproc: Successfully registered manager as @wproc with query handler
[1700212011] SERVICE FLAPPING ALERT: localhost;Current Load;STARTED; Service appears to have started flapping (20.9% change >= 20.0% threshold)
[1700212011] SERVICE FLAPPING ALERT: nessus manager;Process Count;STARTED; Service appears to have started flapping (24.0% change >= 20.0% threshold)
[1700212026] Successfully launched command file worker with pid 32016
[1700212026] Unable to run check for service 'Process Count' on host 'backup 1'
[1700212035] Unable to run check for service 'Memory Usage' on host 'secondary domain controller'
[1700212036] Unable to run check for service 'Disk Space C:' on host 'chat'
[1700212044] Unable to run check for service 'Process Count' on host 'file server'
[1700212047] Unable to run check for service 'Root Partition' on host 'localhost'
[1700212051] Unable to run check for service 'Disk Space C:' on host 'primary domain controller'
[1700212055] Unable to run check for service 'Disk Space C:' on host 'sharepoint'

nagios.debug
[1700212044.779976] [016.1] [pid=31335] Unable to run scheduled service check at this time
[1700212044.780045] [016.1] [pid=31335] Rescheduled next service check for Fri Nov 17 10:12:24 2023
[1700212044.780073] [016.0] [pid=31335] Scheduling a non-forced, active check of service 'Process Count' on host 'file server' @ Fri Nov 17 10:12:24 2023
[1700212044.780087] [016.2] [pid=31335] Scheduling new service check event.
[1700212047.780180] [016.0] [pid=31335] Attempting to run scheduled check of service 'Root Partition' on host 'localhost': check options=0, latency=0.000743
[1700212047.780413] [016.0] [pid=31335] Checking service 'Root Partition' on host 'localhost'...
[1700212047.780442] [2320.2] [pid=31335] Raw Command Input: sudo $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
[1700212047.780463] [2320.2] [pid=31335] Expanded Command Output: sudo $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
[1700212047.781488] [016.1] [pid=31335] Unable to run scheduled service check at this time
[1700212047.781637] [016.1] [pid=31335] Rescheduled next service check for Fri Nov 17 10:12:27 2023
[1700212047.781730] [016.0] [pid=31335] Scheduling a non-forced, active check of service 'Root Partition' on host 'localhost' @ Fri Nov 17 10:12:27 2023
[1700212047.781957] [016.2] [pid=31335] Scheduling new service check event.
[1700212051.000359] [016.0] [pid=31335] Starting to read check result queue '/usr/local/nagios/var/spool/checkresults'...
[1700212051.000454] [016.0] [pid=31335] Finished reaping 0 check results
[1700212051.779788] [016.0] [pid=31335] Attempting to run scheduled check of service 'Disk Space C:' on host 'primary domain controller': check options=0, latency=0.000332
[1700212051.779925] [016.0] [pid=31335] Checking service 'Disk Space C:' on host 'primary domain controller'...
[1700212051.779951] [2320.2] [pid=31335] Raw Command Input: sudo $USER1$/check_ncpa.py -H $HOSTADDRESS$ $ARG1$
[1700212051.779970] [2320.2] [pid=31335] Expanded Command Output: sudo $USER1$/check_ncpa.py -H $HOSTADDRESS$ $ARG1$
[1700212051.780150] [016.1] [pid=31335] Unable to run scheduled service check at this time
[1700212051.780227] [016.1] [pid=31335] Rescheduled next service check for Fri Nov 17 10:12:31 2023
[1700212051.780253] [016.0] [pid=31335] Scheduling a non-forced, active check of service 'Disk Space C:' on host 'primary domain controller' @ Fri Nov 17 10:12:31 2023
[1700212051.780267] [016.2] [pid=31335] Scheduling new service check event

Post by **swolf** » Fri Nov 17, 2023 10:32 am

Hi @ppear,

I read through the source code corresponding to these log messages. I do see several "sad paths" that don't have debug logging, but they all appear to be related to either 1) running out of memory, or 2) failing to find workers to run the checks.

Given that, my next questions to you are:

1) How is your Nagios Core server doing on RAM? Core itself tends not to take much, but are any other important processes running on that server?

2) I'd also like to see output for this command, both after server boot (when everything is running fine) and after nagios is restarted (once you're unable to run checks).

ps -ef | grep nagios

On a default installation, I would expect to see about 10 processes - 2 for the daemon itself, and 8 for the workers.
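
The comparison above can be made quicker to read by summarizing the ps output instead of eyeballing it. This is a rough sketch, assuming workers appear with `--worker` on their command line as in a stock install; `count_nagios_procs` is a hypothetical helper name. It also counts `<defunct>` (zombie) children, since those would suggest workers that exited shortly after being spawned.

```shell
# Summarize `ps -ef` output read from stdin: live workers vs. defunct children.
# count_nagios_procs is a hypothetical helper, not part of Nagios.
count_nagios_procs() {
    awk '
        /nagios --worker/ && !/grep/ { workers++ }
        /\[nagios\] <defunct>/       { defunct++ }
        END { printf "workers=%d defunct=%d\n", workers, defunct }
    '
}

# Usage (not run here):
#   ps -ef | count_nagios_procs
```

Because the function reads stdin, it can also be run against saved output, for example `count_nagios_procs < ps_after_restart.txt` (a hypothetical file name).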

As a side note, I think there's cause for us to go and improve the debug logging for Core 4.5.1. It may not be helpful to you and your current timeline but it bugs me that we got a (nearly) silent failure here.

Post by **ppear** » Tue Nov 21, 2023 2:11 am

I watched my resources over a period and didn't see anything of concern. The VM is provisioned with 32GB RAM and 16 CPUs. RAM hovered at 20% utilization, with a few brief CPU spikes. I do have Splunk Enterprise running on the same server, but the environment being monitored is small from my perspective: I monitor about 24 "hosts" with Nagios, and even fewer assets report to Splunk. The server has most of the DISA Red Hat 8 STIGs applied, but it is running in permissive mode, and enabling or disabling FIPS mode didn't make a difference.

I think you are on to something. It looks like the workers have an issue following the service restart. Here are the results of the requested commands.

ps -ef | grep nagios while running normally

[root@nagioshost ~]# ps -ef | grep nagios
nagios 1704 1 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 1706 1704 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 1707 1704 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 1708 1704 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 1709 1704 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 1710 1704 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 1711 1704 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 1712 1704 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 1713 1704 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 1714 1704 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 1715 1704 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 1716 1704 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 1717 1704 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 1718 1704 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 1719 1704 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 1720 1704 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 1722 1704 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 1723 1704 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 1724 1704 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 1726 1704 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 1727 1704 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 1728 1704 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 1729 1704 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 1733 1704 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 1734 1704 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 1773 1704 0 08:21 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root 10236 10105 0 08:25 pts/1 00:00:00 grep --color=auto nagios

ps -ef | grep nagios after nagios.service restart

[root@nagioshost ~]# ps -ef | grep nagios
nagios 29400 1 0 08:30 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 29401 29400 0 08:30 ? 00:00:00 [nagios] <defunct>
nagios 29402 29400 0 08:30 ? 00:00:00 [nagios] <defunct>
nagios 29403 29400 0 08:30 ? 00:00:00 [nagios] <defunct>
nagios 29404 29400 0 08:30 ? 00:00:00 [nagios] <defunct>
nagios 29405 29400 0 08:30 ? 00:00:00 [nagios] <defunct>
nagios 29406 29400 0 08:30 ? 00:00:00 [nagios] <defunct>
nagios 29407 29400 0 08:30 ? 00:00:00 [nagios] <defunct>
nagios 29408 29400 0 08:30 ? 00:00:00 [nagios] <defunct>
nagios 29409 29400 0 08:30 ? 00:00:00 [nagios] <defunct>
nagios 29410 29400 0 08:30 ? 00:00:00 [nagios] <defunct>
nagios 29411 29400 0 08:30 ? 00:00:00 [nagios] <defunct>
nagios 29412 29400 0 08:30 ? 00:00:00 [nagios] <defunct>
nagios 29413 29400 0 08:30 ? 00:00:00 [nagios] <defunct>
nagios 29414 29400 0 08:30 ? 00:00:00 [nagios] <defunct>
nagios 29415 29400 0 08:30 ? 00:00:00 [nagios] <defunct>
nagios 29416 29400 0 08:30 ? 00:00:00 [nagios] <defunct>
nagios 29417 29400 0 08:30 ? 00:00:00 [nagios] <defunct>
nagios 29418 29400 0 08:30 ? 00:00:00 [nagios] <defunct>
nagios 29419 29400 0 08:30 ? 00:00:00 [nagios] <defunct>
nagios 29420 29400 0 08:30 ? 00:00:00 [nagios] <defunct>
nagios 29421 29400 0 08:30 ? 00:00:00 [nagios] <defunct>
nagios 29422 29400 0 08:30 ? 00:00:00 [nagios] <defunct>
nagios 29423 29400 0 08:30 ? 00:00:00 [nagios] <defunct>
nagios 29424 29400 0 08:30 ? 00:00:00 [nagios] <defunct>
nagios 29588 29400 0 08:30 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root 32975 10105 0 08:31 pts/1 00:00:00 grep --color=auto nagios

Post by **ppear** » Wed Nov 29, 2023 10:37 am

This one is still open for me. I haven't been able to locate any further information based on the latest results posted above.
