AIX monitoring issue for just some servers

Posted: Thu Jan 04, 2024 9:38 am
by kbauma01
I have several AIX servers being monitored successfully but there are about 18 that are having issues. Everything is exactly the same - OS version, RPMs, python versions, etc. I've reinstalled the agent and even reinstalled python.

Using the command line, I get this error:

/usr/local/nagios/libexec/check_ncpa.py -H hostname -t mytoken -P 5693 --list
UNKNOWN: An error occurred connecting to API. (HTTP error: '500 INTERNAL SERVER ERROR')

When I go to the GUI on port 5693, I don't see any checks or any live data, and the API page just spins and then throws an error into the listener log.

Here is what I am seeing in /usr/local/ncpa/var/log/ncpa_listener.log:

2024-01-04 09:29:06,082 8323480 INFO ::ffff:10.15.14.86 - - [2024-01-04 09:29:06] "GET /api/services/?token=mytoken&check=1&service=sshd&status=running HTTP/1.1" 500 2362 0.004727
2024-01-04 09:29:15,245 8323480 ERROR Exception on /api/services/ [GET]
Traceback (most recent call last):
File "/opt/freeware/lib/python2.7/site-packages/flask/app.py", line 1817, in wsgi_app
File "/opt/freeware/lib/python2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request
File "/opt/freeware/lib/python2.7/site-packages/flask/app.py", line 1381, in handle_user_exception
File "/opt/freeware/lib/python2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
File "/opt/freeware/lib/python2.7/site-packages/flask/app.py", line 1461, in dispatch_request
File "/tmp/test/ncpa/agent/listener/server.py", line 185, in token_auth_decoration
File "/tmp/test/ncpa/agent/listener/server.py", line 931, in api
File "/tmp/test/ncpa/agent/listener/psapi.py", line 279, in getter
File "/tmp/test/ncpa/agent/listener/psapi.py", line 260, in refresh
File "/tmp/test/ncpa/agent/listener/psapi.py", line 230, in get_root_node
File "/tmp/test/ncpa/agent/listener/psapi.py", line 185, in get_disk_node
File "/opt/freeware/lib/python2.7/site-packages/psutil/__init__.py", line 2133, in disk_partitions
File "/opt/freeware/lib/python2.7/site-packages/psutil/_psaix.py", line 186, in disk_partitions
OSError: [Errno 13] Permission denied

I cannot figure out why most AIX servers work but some do not. Any help would be very much appreciated!

Re: AIX monitoring issue for just some servers

Posted: Thu Jan 04, 2024 11:25 am
by bbahn
Hello kbauma01,

It seems you have a permissions issue that is making the psutil library unable to check your disks. Can you check your permissions for your disks and compare the working servers against the ones that aren't?
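
To take NCPA out of the picture, it may help to call psutil directly as the same user the listener runs as. Below is a minimal Python sketch of that idea. `probe_disk_partitions` is a helper name made up for this post, and it assumes the failure surfaces as a plain OSError with errno 13, as in the traceback above. Run it with the same interpreter NCPA uses so the same psutil build gets loaded.

```python
import errno


def probe_disk_partitions(list_partitions):
    """Call a partition-listing function and summarize the outcome.

    `list_partitions` is expected to be psutil.disk_partitions; it is
    passed in as an argument so the helper can be tried without psutil.
    """
    try:
        partitions = list_partitions()
    except OSError as exc:
        # Errno 13 (EACCES) is the error shown in the listener traceback.
        if exc.errno == errno.EACCES:
            return "permission denied (errno %d)" % exc.errno
        raise
    return "ok: %d partitions" % len(partitions)


if __name__ == "__main__":
    try:
        import psutil
    except ImportError:
        print("psutil is not installed for this interpreter")
    else:
        print(probe_disk_partitions(psutil.disk_partitions))
```

If this reports a permission error on a bad server and a partition count on a good one when run as the listener's user, the difference is in what that user is allowed to read rather than in the agent install.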

Re: AIX monitoring issue for just some servers

Posted: Thu Jan 04, 2024 1:35 pm
by kbauma01
They look the same to me.

Permissions on a non-working server (aka "bad"):

# ls -la /dev/hd*
brw-rw---- 1 root system 10, 9 May 12 2023 hd10opt
brw-rw---- 1 root system 10, 10 May 12 2023 hd11admin
brw-rw---- 1 root system 10, 6 May 12 2023 hd2
brw-rw---- 1 root system 10, 8 May 12 2023 hd3
brw-rw---- 1 root system 10, 5 May 12 2023 hd4
brw-rw---- 1 root system 10, 1 Dec 28 21:30 hd5
brw-rw---- 1 root system 10, 2 May 12 2023 hd6
brw-rw---- 1 root system 10, 4 May 12 2023 hd8
brw-rw---- 1 root system 10, 7 May 12 2023 hd9var
cr--r--r-T 1 root system 49, 0 May 12 2023 hdcrypt
brw------- 1 root system 13, 8 May 12 2023 hdisk3
brw------- 1 root system 13, 3 May 12 2023 hdisk4
brw------- 1 root system 13, 4 May 12 2023 hdisk5
brw------- 1 root system 13, 9 Jun 08 2023 hdisk8
brw------- 1 root system 13, 6 Jun 08 2023 hdisk9

Permissions on a working server (aka "good"):
# ls -la /dev/hd*
brw-rw---- 1 root system 10, 8 May 12 2023 hd10opt
brw-rw---- 1 root system 10, 9 May 12 2023 hd11admin
brw-rw---- 1 root system 10, 5 May 12 2023 hd2
brw-rw---- 1 root system 10, 7 May 12 2023 hd3
brw-rw---- 1 root system 10, 4 May 12 2023 hd4
brw-rw---- 1 root system 10, 1 Jun 08 2023 hd5
brw-rw---- 1 root system 10, 2 May 12 2023 hd6
brw-rw---- 1 root system 10, 3 May 12 2023 hd8
brw-rw---- 1 root system 10, 6 May 12 2023 hd9var
cr--r--r-T 1 root system 21, 0 May 12 2023 hdcrypt
brw------- 1 root system 18, 3 May 12 2023 hdisk3
brw------- 1 root system 18, 4 Jun 08 2023 hdisk4
brw------- 1 root system 18, 5 Jun 08 2023 hdisk5

Re: AIX monitoring issue for just some servers

Posted: Thu Jan 04, 2024 3:05 pm
by jmichaelson
Since the permissions are the same, the next step is to see what user the ncpa_listener is actually running as, and whether that user is a member of the system group (since NCPA shouldn't be running as root). The ncpa.cfg file (located in /usr/local/ncpa/etc) should show the uid and gid that the listener runs as; compare that to the list of users in the system group in /etc/group.

I don't recall the exact ps options on AIX, but you can also use ps to determine the user of the running process (which should match the uid in ncpa.cfg).
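
To show why group membership matters here, this is a rough Python sketch of the classic Unix permission rule applied to the `ls -la` lines posted above. `can_read` is a made-up helper, it ignores ACLs and any AIX-specific access controls, and I am only assuming that the denied read involves one of these device nodes; the traceback does not say which path psutil touched.

```python
def can_read(mode, owner, group, user, user_groups):
    """Return True if `user` may read a node with the given `ls -l` mode.

    Only one permission class is consulted: owner if the user owns the
    node, else group if the user is in the node's group, else other.
    root bypasses the check. ACLs are not considered.
    """
    if user == "root":
        return True
    perms = mode[1:10]  # drop the file-type character ('b', 'c', '-', ...)
    if user == owner:
        return perms[0] == "r"
    if group in user_groups:
        return perms[3] == "r"
    return perms[6] == "r"


# Mode, owner, and group copied from the hd4 line in the listings above.
print(can_read("brw-rw----", "root", "system", "nagios", {"nagios", "staff"}))
print(can_read("brw-rw----", "root", "system", "nagios", {"nagios", "system"}))
```

With those inputs the first call gives False and the second gives True, so a user outside the system group cannot read the hd* nodes. Note that the hdisk* nodes are `brw-------`, which under this rule only root can read regardless of group.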

Re: AIX monitoring issue for just some servers

Posted: Mon Jan 08, 2024 9:53 am
by kbauma01
The ncpa agent is running as nagios and the ncpa.cfg has that set.

[listener]
# This is for Unix only (Linux, Mac OS X, etc)
#
uid = nagios
gid = nagios

nagios 19923354 6357452 0 09:42:22 - 0:01 /usr/local/ncpa/ncpa_listener -n

When I try to start it as root, it dies.

Re: AIX monitoring issue for just some servers

Posted: Mon Jan 08, 2024 10:04 am
by kbauma01
# startsrc -e LIBPATH=/usr/local/ncpa -s ncpa_listener
0513-059 The ncpa_listener Subsystem has been started. Subsystem PID is 10420568.

In /var/log/messages:
daemon:info src[19071252]: The ncpa_listener subsystem was requested to STARTED by user root

ps -ef|grep ncpa
root 19071292 16974322 0 10:04:11 pts/0 0:00 grep ncpa

For some reason, it doesn't start.

Re: AIX monitoring issue for just some servers

Posted: Mon Jan 08, 2024 10:46 am
by jmichaelson
If you look at /etc/group, is the nagios user listed as being in the system group? I'm not sure whether it should be or not. Compare that against the /etc/group on the AIX system where the disk monitoring does work.
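
If it is easier than eyeballing the file on 18 servers, something like the following Python sketch could list the groups a user appears in. `groups_for_user` is a made-up helper and the sample lines are invented for illustration, not taken from your servers. It assumes the usual colon-separated name:password:gid:members layout, and it only sees supplementary memberships listed in /etc/group, not the primary group recorded in /etc/passwd.

```python
def groups_for_user(group_file_text, user):
    """Return the names of groups whose member list contains `user`."""
    groups = []
    for line in group_file_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 4:
            continue  # not a name:password:gid:members line
        members = [m.strip() for m in fields[3].split(",") if m.strip()]
        if user in members:
            groups.append(fields[0])
    return groups


# Invented sample content, NOT copied from a real server.
SAMPLE_GROUP_FILE = "system:!:0:root\nstaff:!:1:ipsec,nagios\nnagios:!:204:\n"

print(groups_for_user(SAMPLE_GROUP_FILE, "nagios"))
```

On a real host you would pass in the contents of /etc/group instead of the sample string, then compare the result from a working server with the result from a failing one.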

Re: AIX monitoring issue for just some servers

Posted: Tue Jan 09, 2024 9:10 am
by kbauma01
jmichaelson wrote: Mon Jan 08, 2024 10:46 am if you look at /etc/group, is the nagios user listed as being in the system group? I'm not sure whether it should be or not. Compare that against the /etc/group on the AIX system where the disk monitoring does work.
THAT DID IT! Most of my servers had nagios in the staff group and that is working fine. But on those 18 servers, nagios has to be in the system group for whatever reason. Thank you @jmichaelson! I've been banging my head against my screen about this and I'm glad to have a solution.