AIX monitoring issue for just some servers

kbauma01
Posts: 54
Joined: Wed May 25, 2022 6:39 am

AIX monitoring issue for just some servers

Post by kbauma01 »

I have several AIX servers being monitored successfully but there are about 18 that are having issues. Everything is exactly the same - OS version, RPMs, python versions, etc. I've reinstalled the agent and even reinstalled python.

Using the command line, I get this error:

/usr/local/nagios/libexec/check_ncpa.py -H hostname -t mytoken -P 5693 --list
UNKNOWN: An error occurred connecting to API. (HTTP error: '500 INTERNAL SERVER ERROR')

When I go to the GUI on port 5693, I don't see any checks or any live data, and the API page just spins and then throws an error into the listener log.

Here is what I am seeing in the /usr/local/ncpa/var/log/ncpa_listener.log

2024-01-04 09:29:06,082 8323480 INFO ::ffff:10.15.14.86 - - [2024-01-04 09:29:06] "GET /api/services/?token=mytoken&check=1&service=sshd&status=running HTTP/1.1" 500 2362 0.004727
2024-01-04 09:29:15,245 8323480 ERROR Exception on /api/services/ [GET]
Traceback (most recent call last):
File "/opt/freeware/lib/python2.7/site-packages/flask/app.py", line 1817, in wsgi_app
File "/opt/freeware/lib/python2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request
File "/opt/freeware/lib/python2.7/site-packages/flask/app.py", line 1381, in handle_user_exception
File "/opt/freeware/lib/python2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
File "/opt/freeware/lib/python2.7/site-packages/flask/app.py", line 1461, in dispatch_request
File "/tmp/test/ncpa/agent/listener/server.py", line 185, in token_auth_decoration
File "/tmp/test/ncpa/agent/listener/server.py", line 931, in api
File "/tmp/test/ncpa/agent/listener/psapi.py", line 279, in getter
File "/tmp/test/ncpa/agent/listener/psapi.py", line 260, in refresh
File "/tmp/test/ncpa/agent/listener/psapi.py", line 230, in get_root_node
File "/tmp/test/ncpa/agent/listener/psapi.py", line 185, in get_disk_node
File "/opt/freeware/lib/python2.7/site-packages/psutil/__init__.py", line 2133, in disk_partitions
File "/opt/freeware/lib/python2.7/site-packages/psutil/_psaix.py", line 186, in disk_partitions
OSError: [Errno 13] Permission denied

I cannot figure out why most AIX servers work but some do not. Any help would be very much appreciated!
bbahn
Posts: 386
Joined: Thu Jan 12, 2023 5:42 pm

Re: AIX monitoring issue for just some servers

Post by bbahn »

Hello kbauma01,

It seems you have a permissions issue that is preventing the psutil library from reading your disks (the traceback ends in psutil's disk_partitions call with Errno 13). Can you check the permissions on your disks and compare the working servers against the ones that aren't working?
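One way to confirm this outside of NCPA is to make the same psutil call directly. The following is only a rough sketch: the helper name `probe` is made up for illustration, and it assumes you run it as the same user the listener runs as, with the same interpreter the agent uses, so that psutil is importable.

```python
import errno


def probe(call):
    """Run a zero-argument callable and classify the outcome.

    Returns ("ok", result), ("denied", message) for EACCES/EPERM,
    or ("error", message) for any other OSError.
    """
    try:
        return ("ok", call())
    except OSError as exc:
        if exc.errno in (errno.EACCES, errno.EPERM):
            return ("denied", str(exc))
        return ("error", str(exc))


if __name__ == "__main__":
    try:
        import psutil  # third-party; the traceback shows NCPA calling it
    except ImportError:
        print("psutil is not importable by this interpreter")
    else:
        # Same call that fails in the listener traceback.
        status, detail = probe(psutil.disk_partitions)
        if status == "ok":
            print("ok: %d partitions visible" % len(detail))
        else:
            print("%s: %s" % (status, detail))
```

If this reports "denied" for the listener's user but "ok" for root, the problem is with that user's access rather than with the agent install.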
kbauma01
Posts: 54
Joined: Wed May 25, 2022 6:39 am

Re: AIX monitoring issue for just some servers

Post by kbauma01 »

They look the same to me.

Permissions on a non-working server (aka "bad")

# ls -la /dev/hd*
brw-rw---- 1 root system 10, 9 May 12 2023 hd10opt
brw-rw---- 1 root system 10, 10 May 12 2023 hd11admin
brw-rw---- 1 root system 10, 6 May 12 2023 hd2
brw-rw---- 1 root system 10, 8 May 12 2023 hd3
brw-rw---- 1 root system 10, 5 May 12 2023 hd4
brw-rw---- 1 root system 10, 1 Dec 28 21:30 hd5
brw-rw---- 1 root system 10, 2 May 12 2023 hd6
brw-rw---- 1 root system 10, 4 May 12 2023 hd8
brw-rw---- 1 root system 10, 7 May 12 2023 hd9var
cr--r--r-T 1 root system 49, 0 May 12 2023 hdcrypt
brw------- 1 root system 13, 8 May 12 2023 hdisk3
brw------- 1 root system 13, 3 May 12 2023 hdisk4
brw------- 1 root system 13, 4 May 12 2023 hdisk5
brw------- 1 root system 13, 9 Jun 08 2023 hdisk8
brw------- 1 root system 13, 6 Jun 08 2023 hdisk9

Permissions on a working server (aka "good")
# ls -la /dev/hd*
brw-rw---- 1 root system 10, 8 May 12 2023 hd10opt
brw-rw---- 1 root system 10, 9 May 12 2023 hd11admin
brw-rw---- 1 root system 10, 5 May 12 2023 hd2
brw-rw---- 1 root system 10, 7 May 12 2023 hd3
brw-rw---- 1 root system 10, 4 May 12 2023 hd4
brw-rw---- 1 root system 10, 1 Jun 08 2023 hd5
brw-rw---- 1 root system 10, 2 May 12 2023 hd6
brw-rw---- 1 root system 10, 3 May 12 2023 hd8
brw-rw---- 1 root system 10, 6 May 12 2023 hd9var
cr--r--r-T 1 root system 21, 0 May 12 2023 hdcrypt
brw------- 1 root system 18, 3 May 12 2023 hdisk3
brw------- 1 root system 18, 4 Jun 08 2023 hdisk4
brw------- 1 root system 18, 5 Jun 08 2023 hdisk5
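For anyone comparing listings like these across many servers, a small script can do the eyeballing. This is a sketch, not an official tool: `parse_ls` and `diff_permissions` are made-up names, and it assumes the `ls -la` output has been pasted in as text. It compares only mode, owner and group for device names present on both servers, and ignores major/minor numbers and dates.

```python
def parse_ls(listing):
    """Map device name -> (mode, owner, group) from `ls -la` output lines."""
    entries = {}
    for line in listing.strip().splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank lines and the command line itself
        mode, owner, group, name = fields[0], fields[2], fields[3], fields[-1]
        entries[name] = (mode, owner, group)
    return entries


def diff_permissions(bad, good):
    """Return {name: (bad_entry, good_entry)} for shared names that differ."""
    bad_map, good_map = parse_ls(bad), parse_ls(good)
    return {
        name: (bad_map[name], good_map[name])
        for name in sorted(set(bad_map) & set(good_map))
        if bad_map[name] != good_map[name]
    }
```

On the two listings above this should come back empty, which agrees with the observation that the device permissions look the same.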
jmichaelson
Posts: 383
Joined: Wed Aug 23, 2023 1:02 pm

Re: AIX monitoring issue for just some servers

Post by jmichaelson »

Since the permissions are the same, the next step is to see which user the ncpa_listener is actually running as, and whether that user is a member of the system group (since NCPA shouldn't be running as root). The ncpa.cfg file (located in /usr/local/ncpa/etc) should show the uid and gid that the listener runs as; compare that to the list of users in the system group in /etc/group.

I don't recall the exact ps options on AIX, but you can also use ps to determine the user of the running process (which should match the uid in ncpa.cfg).
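The ncpa.cfg versus /etc/group comparison could be scripted roughly as follows. This is a sketch for Python 3 with made-up helper names; it works on the text of the two files, and it only looks at the member list in /etc/group, so a user's primary group (which is recorded in /etc/passwd) may not show up here.

```python
import configparser


def listener_identity(cfg_text):
    """Return the (uid, gid) values from the [listener] section of ncpa.cfg."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(cfg_text)
    return parser.get("listener", "uid"), parser.get("listener", "gid")


def group_members(group_text, group_name):
    """Return the member list for group_name, or None if the group is absent."""
    for line in group_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")  # name:password:gid:member1,member2
        if len(fields) >= 4 and fields[0] == group_name:
            return [m for m in fields[3].split(",") if m]
    return None


def in_group(cfg_text, group_text, group_name):
    """True if the listener's configured uid is listed as a member of group_name."""
    uid, _gid = listener_identity(cfg_text)
    members = group_members(group_text, group_name)
    return members is not None and uid in members
```

For example, `in_group(open("/usr/local/ncpa/etc/ncpa.cfg").read(), open("/etc/group").read(), "system")` would answer the question for one host, assuming the config file is at that path and readable.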
Please let us know if you have any other questions or concerns.

-Jason
kbauma01
Posts: 54
Joined: Wed May 25, 2022 6:39 am

Re: AIX monitoring issue for just some servers

Post by kbauma01 »

The ncpa agent is running as nagios and the ncpa.cfg has that set.

[listener]
# This is for Unix only (Linux, Mac OS X, etc)
#
uid = nagios
gid = nagios

nagios 19923354 6357452 0 09:42:22 - 0:01 /usr/local/ncpa/ncpa_listener -n

When I try to start it as root, it dies.
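To run the same check across a batch of hosts, the owner of the listener process can be pulled out of captured `ps -ef` output. This is a sketch under assumptions: `listener_owners` is a made-up name, the first column is taken to be the user (as in the line above), and any line for a grep process is skipped.

```python
def listener_owners(ps_output, needle="ncpa_listener"):
    """Return the set of users owning processes whose line mentions needle."""
    owners = set()
    for line in ps_output.splitlines():
        fields = line.split()
        if not fields or needle not in line:
            continue
        if "grep" in fields:
            continue  # ignore the grep process itself
        owners.add(fields[0])
    return owners
```

An empty set means no listener process was found in the captured output, which is a separate problem from the listener running as the wrong user.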
kbauma01
Posts: 54
Joined: Wed May 25, 2022 6:39 am

Re: AIX monitoring issue for just some servers

Post by kbauma01 »

# startsrc -e LIBPATH=/usr/local/ncpa -s ncpa_listener
0513-059 The ncpa_listener Subsystem has been started. Subsystem PID is 10420568.

In /var/log/messages:
daemon:info src[19071252]: The ncpa_listener subsystem was requested to STARTED by user root

ps -ef|grep ncpa
root 19071292 16974322 0 10:04:11 pts/0 0:00 grep ncpa

For some reason, it doesn't start.
jmichaelson
Posts: 383
Joined: Wed Aug 23, 2023 1:02 pm

Re: AIX monitoring issue for just some servers

Post by jmichaelson »

If you look at /etc/group, is the nagios user listed as being in the system group? I'm not sure whether it should be or not. Compare that against the /etc/group on an AIX system where the disk monitoring does work.
kbauma01
Posts: 54
Joined: Wed May 25, 2022 6:39 am

Re: AIX monitoring issue for just some servers

Post by kbauma01 »

jmichaelson wrote: Mon Jan 08, 2024 10:46 am if you look at /etc/group, is the nagios user listed as being in the system group? I'm not sure whether it should be or not. Compare that against the /etc/group on the AIX system where the disk monitoring does work.
THAT DID IT! Most of my servers have nagios in the staff group, and that works fine. But on those 18 servers, nagios has to be in the system group for whatever reason. Thank you @jmichaelson! I've been banging my head against my screen about this and I'm glad to have a solution.