New Nagios Core server - multiple failures

NeoMatrixJR
Posts: 17
Joined: Fri Dec 21, 2018 11:18 am

New Nagios Core server - multiple failures

Post by NeoMatrixJR »

I'm trying to stand up a new Nagios Core server to replace an existing server that is badly out of date.
I've completed most of the work, but I keep hitting hurdle after hurdle. I have a working config and Nagios comes up on system boot, but whenever I restart the service, /usr/local/nagios/var/rw/nagios.cmd is not recreated, even though the service reports that it has started. Also, while Nagios is running, any attempt to run something against nagios.cmd fails with `The permissions on the external command file and/or directory may be incorrect. Read the FAQs on how to setup proper permissions.`

The file gets created with nagios:nagios ownership, and the rw folder is also owned by nagios:nagios.
The apache user has been added to the nagios group.

Also, once the service has been restarted for the first time, any further attempt to stop or restart the nagios service, or to reboot the machine, is VERY slow.
danderson
Posts: 111
Joined: Wed Aug 09, 2023 10:05 am

Re: New Nagios Core server - multiple failures

Post by danderson »

Thanks for reaching out @NeoMatrixJR,

Can you give me a little more info about your system? Namely: what OS are you installing on, and how many checks are you configuring? If you are using one of the RPM linuxes (linuces?), do you have SELinux enabled?
NeoMatrixJR
Posts: 17
Joined: Fri Dec 21, 2018 11:18 am

Re: New Nagios Core server - multiple failures

Post by NeoMatrixJR »

Debian 12
VMware VM
Installed the latest versions of everything: Core from source, Plugins from source, plus nagiosgraph, all deployed via Ansible.
No SELinux, but we're looking to eventually deploy more, following CIS hardening standards. We'll be using the Debian 11 standards until the Debian 12 ones are out.
I copied and cleaned up the configs from the system this is replacing, an old Ubuntu 16 system running Nagios Core 4.4.2.
149 hosts
318 services

I fixed the `The permissions on the external command file and/or directory may be incorrect. Read the FAQs on how to setup proper permissions.` error.
All the guides say to add "apache" to the nagcmd group. On Debian, I found I needed to add www-data instead.
Almost everything seems to work after a reboot, but if I restart the service, it doesn't recreate nagios.cmd: it just removes the existing copy and never recreates it. Also, every call to stop or restart nagios, or to reboot the system, takes a LONG time to complete (even after I disabled all host/service checks by removing the references to the .cfg files while working on this issue).
On boot the file is created with nagios:nagcmd ownership.
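
For anyone hitting the same permissions message, here is a rough Python sketch of the checks described above: is the command file a group-writable named pipe, and is the web server user in the file's group? The function names are made up for illustration, and the path and the `www-data` user are simply the values from this thread, so adjust them for your install. It does not check the permissions on the rw directory itself.

```python
import grp
import os
import pwd
import stat


def group_can_write(mode):
    """True if the group-write bit is set in an st_mode value."""
    return bool(mode & stat.S_IWGRP)


def user_group_ids(username):
    """Primary plus supplementary group IDs for a local user (raises KeyError if unknown)."""
    primary = pwd.getpwnam(username).pw_gid
    supplementary = {g.gr_gid for g in grp.getgrall() if username in g.gr_mem}
    return {primary} | supplementary


def check_command_file(path, web_user):
    """Return a list of human-readable problems; an empty list means nothing obvious is wrong."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return [f"{path} does not exist (is Nagios running?)"]

    problems = []
    if not stat.S_ISFIFO(st.st_mode):
        problems.append(f"{path} is not a named pipe")
    if not group_can_write(st.st_mode):
        problems.append(f"{path} is not group-writable")

    try:
        gids = user_group_ids(web_user)
    except KeyError:
        return problems + [f"user {web_user} does not exist"]
    if st.st_gid not in gids:
        problems.append(f"{web_user} is not in the group that owns {path} (gid {st.st_gid})")
    return problems
```

Calling `check_command_file("/usr/local/nagios/var/rw/nagios.cmd", "www-data")` on the Nagios host and printing the result should show which of these conditions fails, if any. Group membership changes usually only take effect for newly started processes, so the web server may need a restart after the user is added to the group.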
danderson
Posts: 111
Joined: Wed Aug 09, 2023 10:05 am

Re: New Nagios Core server - multiple failures

Post by danderson »

Are you getting anything in the nagios logs?

Are you getting any log lines like `Error: Could not create external command file`?
NeoMatrixJR
Posts: 17
Joined: Fri Dec 21, 2018 11:18 am

Re: New Nagios Core server - multiple failures

Post by NeoMatrixJR »

No errors, but I did see a discrepancy below.
Also, we're a Nagios XI & Fusion customer...is there a better way I can request support/expedite? I'm under a time-crunch to get this completed by end of week.

It seems I wasn't auto-subscribed to updates to this thread either. I should at least have that turned on now.

Nagios Log - System Startup:
[1711398649] Nagios 4.5.1 starting... (PID=784)
[1711398649] Local time is Mon Mar 25 15:30:49 CDT 2024
[1711398649] LOG VERSION: 2.0
[1711398649] qh: Socket '/usr/local/nagios/var/rw/nagios.qh' successfully initialized
[1711398649] qh: core query handler registered
[1711398649] qh: echo service query handler registered
[1711398649] qh: help for the query handler registered
[1711398649] wproc: Successfully registered manager as @wproc with query handler
[1711398649] wproc: Registry request: name=Core Worker 805;pid=805
[1711398649] wproc: Registry request: name=Core Worker 807;pid=807
[1711398649] wproc: Registry request: name=Core Worker 806;pid=806
[1711398649] wproc: Registry request: name=Core Worker 804;pid=804
[1711398649] Successfully launched command file worker with pid 823
[1711398679] Warning: Attempting to execute the command "/usr/local/nagiosgraph/bin/insert.pl" resulted in a return code of 127. Make sure the script or binary you are trying to execute actually exists...
[1711398709] Warning: Attempting to execute the command "/usr/local/nagiosgraph/bin/insert.pl" resulted in a return code of 127. Make sure the script or binary you are trying to execute actually exists...

Nagios Log - `systemctl restart nagios`
[1711398709] Caught SIGTERM, shutting down...
[1711398709] Caught SIGTERM, shutting down...
[1711398709] Caught SIGTERM, shutting down...
[1711398709] Successfully shutdown... (PID=784)
[1711398709] Nagios 4.5.1 starting... (PID=1594)
[1711398709] Local time is Mon Mar 25 15:31:49 CDT 2024
[1711398709] LOG VERSION: 2.0
[1711398709] qh: Socket '/usr/local/nagios/var/rw/nagios.qh' successfully initialized
[1711398709] qh: core query handler registered
[1711398709] qh: echo service query handler registered
[1711398709] qh: help for the query handler registered
[1711398709] wproc: Successfully registered manager as @wproc with query handler
[1711398709] wproc: Registry request: name=Core Worker 1598;pid=1598
[1711398709] wproc: Registry request: name=Core Worker 1599;pid=1599
[1711398709] wproc: Registry request: name=Core Worker 1597;pid=1597
[1711398709] wproc: Registry request: name=Core Worker 1596;pid=1596
After the restart there is no `Successfully launched command file worker` line.
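
The comparison between the two log excerpts above can be automated. The following is a minimal Python sketch, assuming the log format shown in this thread (`[epoch] message`); the function name is hypothetical. It reports the PID of every Nagios start that was not followed by a command file worker launch before the next start or the end of the log.

```python
import re

# Matches lines such as "[1711398649] Nagios 4.5.1 starting... (PID=784)"
LINE_RE = re.compile(r"^\[(\d+)\] (.*)$")
START_RE = re.compile(r"Nagios \S+ starting\.\.\. \(PID=(\d+)\)")
WORKER_MSG = "Successfully launched command file worker"


def starts_missing_command_worker(log_lines):
    """Return PIDs of Nagios starts with no command file worker launch logged afterwards."""
    missing = []
    current_pid = None
    saw_worker = False
    for line in log_lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not a timestamped log line
        message = m.group(2)
        start = START_RE.match(message)
        if start:
            # A new start closes out the previous one.
            if current_pid is not None and not saw_worker:
                missing.append(current_pid)
            current_pid = int(start.group(1))
            saw_worker = False
        elif message.startswith(WORKER_MSG):
            saw_worker = True
    if current_pid is not None and not saw_worker:
        missing.append(current_pid)
    return missing
```

Fed the lines from a nagios.log file (for example `open(path).read().splitlines()`), it flags the restart with PID 1594 from the excerpt above. A log captured before the worker line was written would be flagged too, so treat a hit as a hint rather than proof.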
gwesterman
Posts: 97
Joined: Wed Aug 23, 2023 11:29 am

Re: New Nagios Core server - multiple failures

Post by gwesterman »

Hi @NeoMatrixJR,

For quick support, you can open a ticket here: https://answerhub.nagios.com/support/s/.
NeoMatrixJR
Posts: 17
Joined: Fri Dec 21, 2018 11:18 am

Re: New Nagios Core server - multiple failures

Post by NeoMatrixJR »

I may not be able to use that :( I wasn't aware there was a separate "Nagios Core Support Plan"...I need to find out if we have that. Guess I need to continue here for now. Sadly it took a day to get login information for the portal.
NeoMatrixJR
Posts: 17
Joined: Fri Dec 21, 2018 11:18 am

Re: New Nagios Core server - multiple failures

Post by NeoMatrixJR »

Found the problem!

You have a bug.

On restart, nagios was using 100% of one CPU core.
On a whim I pulled up tcpdump and restarted it. It kept reaching out to the network.
We have outbound firewalls.
I set check_for_updates=0 instead of the default of 1.
FIXED.
You have an endless loop: Nagios gets stuck checking for updates if it can't reach the web.
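
Since this install is deployed with automation, the same change can be scripted. Below is a small Python sketch that sets a directive in the text of nagios.cfg, replacing an existing uncommented line or appending one if none exists. The helper name is made up for illustration, and the config path in the note after it assumes a default source install.

```python
import re


def set_directive(cfg_text, name, value):
    """Return cfg_text with `name=value` set, replacing an existing line or appending one."""
    # Only uncommented lines that start with the exact directive name are replaced.
    pattern = re.compile(rf"^[ \t]*{re.escape(name)}[ \t]*=.*$", re.MULTILINE)
    new_line = f"{name}={value}"
    if pattern.search(cfg_text):
        # A callable replacement avoids backslash handling in the value.
        return pattern.sub(lambda _match: new_line, cfg_text)
    separator = "" if (not cfg_text or cfg_text.endswith("\n")) else "\n"
    return f"{cfg_text}{separator}{new_line}\n"
```

For example, read /usr/local/nagios/etc/nagios.cfg (if that is where your config lives), pass its contents through `set_directive(text, "check_for_updates", "0")`, and write the result back before restarting the service. Running `nagios -v` against the config file first is a cheap sanity check.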