Nagios ramdisk full and no performance graphs
One of our servers suddenly has no performance graphs and its ramdisk has filled up to 100%. This is Nagios XI 5.8.6 on RHEL 7 64-bit VMs.
When we manually run the /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php >> /usr/local/nagiosxi/var/perfdataproc.log command, nothing is cleaned up from /var/nagiosramdisk.
I have had to manually clean up the ramdisk directory, but the mount is growing again.
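For reference, this is roughly how we have been checking the usage (the paths are the default XI ramdisk locations; adjust if yours differ):
Code: Select all
# Overall usage, plus the largest items inside the ramdisk
df -h /var/nagiosramdisk
du -ah /var/nagiosramdisk | sort -h | tail -n 20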
Re: Nagios ramdisk full and no performance graphs
Hello @hbouma
Thanks for reaching out. It sounds like you have already gone through the support article on optimizing perfdata. We would like to take a look at the System Profile of your environment so we can see what is going on.
Please PM your updated system profile for us to review.
To send us your system profile:
- Login to the Nagios XI GUI using a web browser.
- Click the "Admin" > "System Profile" Menu
- Click the "Download Profile" button
- Save the profile.zip file and send via Private Message
Perry
Re: Nagios ramdisk full and no performance graphs
PM sent with profile
Re: Nagios ramdisk full and no performance graphs
Hello @hbouma
Thanks for sending over the System Profile. Wanted to send a quick reply: we can see that the 'NPCD' service is stopping. Please restart the 'NPCD' service:
Code: Select all
systemctl restart npcd.service
Then take a look at the status to verify that it is running:
Code: Select all
systemctl status npcd.service
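If it does not stay running, the NPCD log usually shows why. A quick way to check it, assuming the stock XI paths (the actual location is whatever log_file is set to in npcd.cfg):
Code: Select all
# Confirm the configured log location, then view the most recent entries
grep ^log_file /usr/local/nagios/etc/pnp/npcd.cfg
tail -n 50 /usr/local/nagios/var/npcd.log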
Please also take some time to review the optimal perfdata configuration when you get a chance.
Thanks,
Perry
Re: Nagios ramdisk full and no performance graphs
Code: Select all
systemctl status npcd
● npcd.service - LSB: Nagios NPCD Initscript
Loaded: loaded (/etc/rc.d/init.d/npcd; bad; vendor preset: disabled)
Active: active (running) since Wed 2021-11-03 10:28:32 EDT; 5h 7min ago
Docs: man:systemd-sysv-generator(8)
Main PID: 20943 (npcd)
CGroup: /system.slice/npcd.service
└─20943 /usr/local/nagios/bin/npcd -d -f /usr/local/nagios/etc/pnp/npcd.cfg
Nov 03 10:28:32 SERVERNAME systemd[1]: Starting LSB: Nagios NPCD Initscript...
Nov 03 10:28:32 SERVERNAME npcd[20940]: NPCD started.
Nov 03 10:28:32 SERVERNAME systemd[1]: Failed to parse PID from file /usr/local/nagiosxi/var/subsys/npcd.pid: Invalid argument
Nov 03 10:28:32 SERVERNAME systemd[1]: Started LSB: Nagios NPCD Initscript.
$ cat /usr/local/nagiosxi/var/subsys/npcd.pid
20943
ps -ef | grep 20943
root 6128 5333 0 15:36 pts/0 00:00:00 grep --color=auto 20943
nagios 20943 1 0 10:28 ? 00:00:00 /usr/local/nagios/bin/npcd -d -f /usr/local/nagios/etc/pnp/npcd.cfg
If I restart NPCD, I see it stop, start, and update the PID file with the new PID, but systemctl status npcd still gives the same output.
Re: Nagios ramdisk full and no performance graphs
Hello @hbouma
Looks like NPCD is hitting its load_threshold; let's bump that up.
Edit /usr/local/nagios/etc/pnp/npcd.cfg and change this:
Code: Select all
load_threshold = XX.X
to this:
Code: Select all
load_threshold = 100.0
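As a quick sanity check before restarting (standard tools, nothing XI-specific), you can compare the box's current load average against the configured threshold:
Code: Select all
# Show the configured threshold and the current 1/5/15-minute load averages
grep load_threshold /usr/local/nagios/etc/pnp/npcd.cfg
cat /proc/loadavg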
Then restart NPCD.
Thanks,
Perry
Re: Nagios ramdisk full and no performance graphs
The NPCD process now stays running and is not exiting. However, the perfdata in /var/nagiosramdisk is not clearing up.
Code: Select all
systemctl status npcd
● npcd.service - LSB: Nagios NPCD Initscript
Loaded: loaded (/etc/rc.d/init.d/npcd; bad; vendor preset: disabled)
Active: active (running) since Thu 2021-11-04 14:44:57 EDT; 2min 43s ago
Docs: man:systemd-sysv-generator(8)
Main PID: 18795 (npcd)
CGroup: /system.slice/npcd.service
└─18795 /usr/local/nagios/bin/npcd -d -f /usr/local/nagios/etc/pnp/npcd.cfg
Nov 04 14:44:57 HOST systemd[1]: Starting LSB: Nagios NPCD Initscript...
Nov 04 14:44:57 HOST npcd[18792]: NPCD started.
Nov 04 14:44:57 HOST systemd[1]: Failed to parse PID from file /usr/local/nagiosxi/var/subsys/npcd.pid: Invalid argument
Nov 04 14:44:57 HOST systemd[1]: Started LSB: Nagios NPCD Initscript.
Re: Nagios ramdisk full and no performance graphs
Hello @hbouma
Thanks for following up. This took a bit of digging, but first I would like to point out that the mount point '/var/nagiosramdisk' is a 'tmpfs', a temporary file system that resides in virtual memory: "Tmpfs is a file system that keeps all of its files in virtual memory. Everything in tmpfs is temporary in the sense that no files will be created on your hard drive."
When we look at the number of files that appear in '/var/nagiosramdisk/....' we see a small number, which means that processing is working properly:
Total files in /var/nagiosramdisk/spool/perfdata/
3
But when we look at the disk usage ('df -h') we see the mount is still holding space:
tmpfs 500M 59M 442M 12% /var/nagiosramdisk
This is expected when files have been deleted but are still open elsewhere: the deletion request succeeds and appears effective, but the space is only freed once the files are no longer in use. You can see those deletion-pending files using 'lsof' this way:
Code: Select all
lsof -nP +L1 /var/nagiosramdisk
Code: Select all
lsof -nP +L1 /tmp
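For what it is worth, the NLINK column in that lsof output is the one to watch: 0 means the file has been deleted but is still held open, so its space is not released yet. A rough one-liner (just a sketch using standard awk, nothing XI-specific) to total the space pinned that way:
Code: Select all
# Sum the SIZE/OFF column for entries whose link count is 0 (deleted but still open)
lsof -nP +L1 /var/nagiosramdisk | awk 'NR > 1 && $8 == 0 { sum += $7 } END { printf "%d bytes held by deleted-but-open files\n", sum }'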
Also, since 'tmpfs' uses virtual memory, we can decrease the total allocated ramdisk from 500M to something like 100M to 250M (around the maximum we typically see used).
Let's also take a look at these configs:
/usr/local/nagios/etc/pnp/process_perfdata.cfg
Code: Select all
TIMEOUT = 15
/usr/local/nagios/etc/pnp/npcd.cfg
Code: Select all
sleep_time = 10
If you set logging to 0 in both files you'll also notice a performance increase.
This can sometimes happen if there are a lot of files in the /usr/local/nagios/var/spool/perfdata directory. The directory scan for results can back up the processing queue, and then things snowball from there. Changing the configs above should prevent it in the future, but if the issue persists you may need to clear the contents of /usr/local/nagios/var/spool/perfdata so the system can catch up, after stopping npcd.service and killing any running perfdata processes (ps aux | grep -Ei 'npcd|perf').
When we look at commands.cfg we see that files are being moved: '/bin/mv /var/nagiosramdisk/host-perfdata /var/nagiosramdisk/spool/xidpe/...'. To verify that is happening, watch this (Ctrl-C to break out):
Code: Select all
while true; do ps aux --sort -rss | grep -E 'mv|process_perfdata.pl' && ls /var/nagiosramdisk/host-perfdata | wc -l && ls /usr/local/nagios/var/spool/perfdata/ | wc -l ; sleep 2; done
Please let me know how things are looking,
Perry
Re: Nagios ramdisk full and no performance graphs
What I am seeing is that the host-perfdata and service-perfdata files themselves are continuously growing. We have been cleaning them up to free space (the profile you were sent was created after those files were cleared), but they keep growing back, even though the file count in the spool stays small:
Total files in /var/nagiosramdisk/spool/perfdata/
3
For instance, after cleaning out at 4 PM yesterday, they have already grown to this:
-rw-r--r-- 1 nagios nagios 60M Nov 5 13:37 host-perfdata
-rw-r--r-- 1 nagios nagios 2.7M Nov 5 04:34 objects.cache
-rw-r--r-- 1 nagios nagios 226M Nov 5 13:37 service-perfdata
Code: Select all
lsof -nP +L1 /var/nagiosramdisk
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
polkitd 1213 polkitd 3r REG 253,5 11031312 0 4485 /var/lib/sss/mc/initgroups (deleted)
puppet 1613 root 8r REG 253,5 8825056 0 845 /var/lib/sss/mc/passwd (deleted)
puppet 1613 root 10r REG 253,5 6618808 0 1102 /var/lib/sss/mc/group (deleted)
freshclam 1623 clamupdate 5r REG 253,5 11031312 0 1117 /var/lib/sss/mc/initgroups (deleted)
qmgr 2347 postfix 8r REG 253,5 11031312 0 1117 /var/lib/sss/mc/initgroups (deleted)
b9daemon 2425 root 20r REG 253,5 8825056 0 845 /var/lib/sss/mc/passwd (deleted)
nagios 3792 nagios 23w REG 0,42 62009590 1 56340188 /var/nagiosramdisk/host-perfdata
nagios 3792 nagios 24w REG 0,42 235923565 1 56340186 /var/nagiosramdisk/service-perfdata
nagios 3908 nagios 24w REG 0,42 62009590 1 56340188 /var/nagiosramdisk/host-perfdata
nagios 3908 nagios 38w REG 0,42 235923565 1 56340186 /var/nagiosramdisk/service-perfdata
java 9174 tidal 3r REG 253,5 8825056 0 845 /var/lib/sss/mc/passwd (deleted)
bash 13284 root cwd DIR 0,42 160 4 56342033 /var/nagiosramdisk
sssd 14448 root 15r REG 253,5 11031312 0 1117 /var/lib/sss/mc/initgroups (deleted)
sssd_be 14449 root 20r REG 253,5 11031312 0 1117 /var/lib/sss/mc/initgroups (deleted)
lsof 15538 root cwd DIR 0,42 160 4 56342033 /var/nagiosramdisk
lsof 15539 root cwd DIR 0,42 160 4 56342033 /var/nagiosramdisk
npcd 18795 nagios cwd DIR 0,42 40 2 56340176 /var/nagiosramdisk/spool/perfdata
Code: Select all
lsof -nP +L1 /tmp
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
polkitd 1213 polkitd 3r REG 253,5 11031312 0 4485 /var/lib/sss/mc/initgroups (deleted)
puppet 1613 root 8r REG 253,5 8825056 0 845 /var/lib/sss/mc/passwd (deleted)
puppet 1613 root 10r REG 253,5 6618808 0 1102 /var/lib/sss/mc/group (deleted)
freshclam 1623 clamupdate 5r REG 253,5 11031312 0 1117 /var/lib/sss/mc/initgroups (deleted)
qmgr 2347 postfix 8r REG 253,5 11031312 0 1117 /var/lib/sss/mc/initgroups (deleted)
rrdcached 2367 nagios 3w REG 253,6 0 1 3795 /tmp/rrd.journal.1636131426.921381
b9daemon 2425 root 20r REG 253,5 8825056 0 845 /var/lib/sss/mc/passwd (deleted)
nagios 3793 nagios cwd DIR 253,6 12288 69 2 /tmp
nagios 3794 nagios cwd DIR 253,6 12288 69 2 /tmp
nagios 3795 nagios cwd DIR 253,6 12288 69 2 /tmp
nagios 3796 nagios cwd DIR 253,6 12288 69 2 /tmp
nagios 3797 nagios cwd DIR 253,6 12288 69 2 /tmp
nagios 3798 nagios cwd DIR 253,6 12288 69 2 /tmp
nagios 3799 nagios cwd DIR 253,6 12288 69 2 /tmp
nagios 3800 nagios cwd DIR 253,6 12288 69 2 /tmp
nagios 3801 nagios cwd DIR 253,6 12288 69 2 /tmp
nagios 3802 nagios cwd DIR 253,6 12288 69 2 /tmp
nagios 3803 nagios cwd DIR 253,6 12288 69 2 /tmp
nagios 3804 nagios cwd DIR 253,6 12288 69 2 /tmp
nagios 3908 nagios cwd DIR 253,6 12288 69 2 /tmp
java 9174 tidal mem REG 253,6 32768 1 4352 /tmp/hsperfdata_tidal/9174
java 9174 tidal 3r REG 253,5 8825056 0 845 /var/lib/sss/mc/passwd (deleted)
sshd 12946 root 11u REG 253,6 3644 1 1462 /var/tmp/host_0
sshd 12979 SAhbouma 11u REG 253,6 3644 1 1462 /var/tmp/host_0
sssd 14448 root 15r REG 253,5 11031312 0 1117 /var/lib/sss/mc/initgroups (deleted)
sssd_be 14449 root 20r REG 253,5 11031312 0 1117 /var/lib/sss/mc/initgroups (deleted)
check_nrp 16101 nagios cwd DIR 253,6 12288 69 2 /tmp
check_nrp 16142 nagios cwd DIR 253,6 12288 69 2 /tmp
check_nrp 16152 nagios cwd DIR 253,6 12288 69 2 /tmp
When running the while command you provided, we see the following repeated over and over:
Code: Select all
while true; do ps aux --sort -rss | grep 'mv|process_perfdata.pl' && ls /var/nagiosramdisk/host-perfdata | wc -l && ls /usr/local/nagios/var/spool/perfdata/ | wc -l ; sleep 2; done
root 19430 0.0 0.0 112812 996 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
root 19493 0.0 0.0 112812 996 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
root 19604 0.0 0.0 112812 996 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
root 19663 0.0 0.0 112812 996 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
root 19722 0.0 0.0 112812 996 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
root 19781 0.0 0.0 112812 992 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
root 19840 0.0 0.0 112812 996 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
root 19900 0.0 0.0 112812 996 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
root 19960 0.0 0.0 112812 996 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
root 20022 0.0 0.0 112812 996 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
root 20086 0.0 0.0 112812 996 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
root 20146 0.0 0.0 112812 996 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
root 20250 0.0 0.0 112812 996 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
Re: Nagios ramdisk full and no performance graphs
Hello @hbouma
Thanks for getting the details over so quickly. It is frustrating that an 'ls' of those directories shows only one or two files while the total size of the ramdisk keeps climbing. Let's do some backtracking to see if there is any configuration that is out of place; once we rule that out, we will look for anything on the OS side causing issues.
To start, let's make sure that the 'nagios' backend system user account is active:
Code: Select all
chage -I -1 -m 0 -M 99999 -E -1 nagios
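To double-check the account afterwards (standard shadow-utils, nothing XI-specific):
Code: Select all
# List the aging/expiry settings for the nagios account
chage -l nagios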
Grab the configs:
Code: Select all
tar -czvf /tmp/ramdisk_config.tar.gz /lib/systemd/system/ramdisk.service /usr/local/nrdp/server/config.inc.php /usr/local/nagios/etc/pnp/npcd.cfg
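If you want to verify what was captured before sending it over (optional):
Code: Select all
# List the contents of the archive without extracting it
tar -tzvf /tmp/ramdisk_config.tar.gz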
We got a count; now let's take a look at the directory:
Code: Select all
ls -l /var/nagiosramdisk/*perfdata
How about the xidpe and perfdata spools:
Code: Select all
ls -l /var/nagiosramdisk/spool/xidpe/
ls -l /var/nagiosramdisk/spool/perfdata/
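And, if useful, a straight count of everything sitting in the ramdisk spool (just a convenience command, not an XI tool):
Code: Select all
# Count every file under the ramdisk spool directories
find /var/nagiosramdisk/spool -type f | wc -l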
The status of the ramdisk system service:
Code: Select all
systemctl status ramdisk.service
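It can also help to confirm how the ramdisk is actually mounted, in particular the size= option (assuming the default mount point):
Code: Select all
# Show the tmpfs mount and its options
findmnt --target /var/nagiosramdisk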
Then restart it:
Code: Select all
systemctl restart ramdisk.service
Please send over the results when you get a chance so we can check them out.
Thanks,
Perry