Nagios ramdisk full and no performance graphs
One of our servers suddenly has no performance graphs and its ramdisk has filled up to 100%. This is Nagios XI 5.8.6 on RHEL 7 64-bit VMs.
When we manually run the /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php >> /usr/local/nagiosxi/var/perfdataproc.log command, nothing is cleaned up from /var/nagiosramdisk.
I have had to manually clean up the ramdisk directory, but the mount is growing again.
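For reference, this is roughly how we have been checking the usage (the paths are the default XI ramdisk locations; adjust if yours differ):
Code: Select all
# Overall usage, plus the largest items inside the ramdisk
df -h /var/nagiosramdisk
du -ah /var/nagiosramdisk | sort -h | tail -n 20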
Re: Nagios ramdisk full and no performance graphs
Hello @hbouma
Thanks for reaching out. It sounds like you have already gone through the support article on optimizing perfdata. We would like to take a look at the System Profile of your environment so we can see what is going on.
Please PM your updated system profile for us to review.
To send us your system profile:
- Login to the Nagios XI GUI using a web browser.
- Click the "Admin" > "System Profile" Menu
- Click the "Download Profile" button
- Save the profile.zip file and send via Private Message
Perry
Re: Nagios ramdisk full and no performance graphs
PM sent with profile
Re: Nagios ramdisk full and no performance graphs
Hello @hbouma
Thanks for sending over the System Profile. Wanted to send a quick reply: we can see that the 'NPCD' service is stopping. Please restart the 'NPCD' service:
Code: Select all
systemctl restart npcd.service
Then take a look at the status to verify that it is running:
Code: Select all
systemctl status npcd.service
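If it does not stay running, the NPCD log usually shows why. A quick way to check it, assuming the stock XI paths (the actual location is whatever log_file is set to in npcd.cfg):
Code: Select all
# Confirm the configured log location, then view the most recent entries
grep ^log_file /usr/local/nagios/etc/pnp/npcd.cfg
tail -n 50 /usr/local/nagios/var/npcd.log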
Please also take some time to review the optimal perfdata configuration when you get a chance.
Thanks,
Perry
Re: Nagios ramdisk full and no performance graphs
Code: Select all
systemctl status npcd
● npcd.service - LSB: Nagios NPCD Initscript
Loaded: loaded (/etc/rc.d/init.d/npcd; bad; vendor preset: disabled)
Active: active (running) since Wed 2021-11-03 10:28:32 EDT; 5h 7min ago
Docs: man:systemd-sysv-generator(8)
Main PID: 20943 (npcd)
CGroup: /system.slice/npcd.service
└─20943 /usr/local/nagios/bin/npcd -d -f /usr/local/nagios/etc/pnp/npcd.cfg
Nov 03 10:28:32 SERVERNAME systemd[1]: Starting LSB: Nagios NPCD Initscript...
Nov 03 10:28:32 SERVERNAME npcd[20940]: NPCD started.
Nov 03 10:28:32 SERVERNAME systemd[1]: Failed to parse PID from file /usr/local/nagiosxi/var/subsys/npcd.pid: Invalid argument
Nov 03 10:28:32 SERVERNAME systemd[1]: Started LSB: Nagios NPCD Initscript.
$ cat /usr/local/nagiosxi/var/subsys/npcd.pid
20943
ps -ef | grep 20943
root 6128 5333 0 15:36 pts/0 00:00:00 grep --color=auto 20943
nagios 20943 1 0 10:28 ? 00:00:00 /usr/local/nagios/bin/npcd -d -f /usr/local/nagios/etc/pnp/npcd.cfg
If I restart NPCD, I see it stop, start, and update the PID file with the new PID, but systemctl status npcd still gives the same output.
Re: Nagios ramdisk full and no performance graphs
Hello @hbouma
Looks like NPCD is hitting its load_threshold; let's bump that up.
Edit /usr/local/nagios/etc/pnp/npcd.cfg and change this:
Code: Select all
load_threshold = XX.X
to this:
Code: Select all
load_threshold = 100.0
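As a quick sanity check before restarting (standard tools, nothing XI-specific), you can compare the box's current load average against the configured threshold:
Code: Select all
# Show the configured threshold and the current 1/5/15-minute load averages
grep load_threshold /usr/local/nagios/etc/pnp/npcd.cfg
cat /proc/loadavg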
Then restart NPCD.
Thanks,
Perry
Re: Nagios ramdisk full and no performance graphs
The NPCD process now stays running and is not exiting. However, the perfdata in /var/nagiosramdisk is not clearing up.
Code: Select all
systemctl status npcd
● npcd.service - LSB: Nagios NPCD Initscript
Loaded: loaded (/etc/rc.d/init.d/npcd; bad; vendor preset: disabled)
Active: active (running) since Thu 2021-11-04 14:44:57 EDT; 2min 43s ago
Docs: man:systemd-sysv-generator(8)
Main PID: 18795 (npcd)
CGroup: /system.slice/npcd.service
└─18795 /usr/local/nagios/bin/npcd -d -f /usr/local/nagios/etc/pnp/npcd.cfg
Nov 04 14:44:57 HOST systemd[1]: Starting LSB: Nagios NPCD Initscript...
Nov 04 14:44:57 HOST npcd[18792]: NPCD started.
Nov 04 14:44:57 HOST systemd[1]: Failed to parse PID from file /usr/local/nagiosxi/var/subsys/npcd.pid: Invalid argument
Nov 04 14:44:57 HOST systemd[1]: Started LSB: Nagios NPCD Initscript.
Re: Nagios ramdisk full and no performance graphs
Hello @hbouma
Thanks for following up. This took a bit of digging, but first I would like to point out that the mount point '/var/nagiosramdisk' is a 'tmpfs', a temporary file system that resides in virtual memory: "Tmpfs is a file system that keeps all of its files in virtual memory. Everything in tmpfs is temporary in the sense that no files will be created on your hard drive."
When we look at the number of files that appear in '/var/nagiosramdisk/....' we see a small number, which means that processing is working properly:
Total files in /var/nagiosramdisk/spool/perfdata/
3
But when we look at the disk usage ('df -h') we see the mount is still holding space:
tmpfs 500M 59M 442M 12% /var/nagiosramdisk
This is expected when files have been deleted but are still open elsewhere: the deletion request succeeds and appears effective, but the space is only freed once the files are no longer in use. You can see those deletion-pending files using 'lsof' this way:
Code: Select all
lsof -nP +L1 /var/nagiosramdisk
Code: Select all
lsof -nP +L1 /tmp
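For what it is worth, the NLINK column in that lsof output is the one to watch: 0 means the file has been deleted but is still held open, so its space is not released yet. A rough one-liner (just a sketch using standard awk, nothing XI-specific) to total the space pinned that way:
Code: Select all
# Sum the SIZE/OFF column for entries whose link count is 0 (deleted but still open)
lsof -nP +L1 /var/nagiosramdisk | awk 'NR > 1 && $8 == 0 { sum += $7 } END { printf "%d bytes held by deleted-but-open files\n", sum }'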
Also, since 'tmpfs' uses virtual memory, we can decrease the total allocated ramdisk from 500M to something like 100M to 250M (around the maximum we typically see used).
Let's also take a look at these configs:
/usr/local/nagios/etc/pnp/process_perfdata.cfg
Code: Select all
TIMEOUT = 15
/usr/local/nagios/etc/pnp/npcd.cfg
Code: Select all
sleep_time = 10
If you set logging to 0 in both files you'll also notice a performance increase.
This can sometimes happen if there are a lot of files in the /usr/local/nagios/var/spool/perfdata directory. The directory scan for results can back up the processing queue, and then things snowball from there. Changing the configs above should prevent it in the future, but if the issue persists you may need to clear the contents of /usr/local/nagios/var/spool/perfdata so the system can catch up, after stopping npcd.service and killing any running perfdata processes (ps aux | grep -Ei 'npcd|perf').
When we look at commands.cfg we see that files are being moved: '/bin/mv /var/nagiosramdisk/host-perfdata /var/nagiosramdisk/spool/xidpe/...'. To verify that is happening, watch this (Ctrl-C to break out):
Code: Select all
while true; do ps aux --sort -rss | grep -E 'mv|process_perfdata.pl' && ls /var/nagiosramdisk/host-perfdata | wc -l && ls /usr/local/nagios/var/spool/perfdata/ | wc -l ; sleep 2; done
Please let me know how things are looking,
Perry
Re: Nagios ramdisk full and no performance graphs
What I am seeing is that the host-perfdata and service-perfdata files themselves are continuously growing. We have been cleaning them up to free space (the profile you were sent was created after those files were cleared), but they keep growing back, even though the file count in the spool stays small:
Total files in /var/nagiosramdisk/spool/perfdata/
3
For instance, after cleaning out at 4 PM yesterday, they have already grown to this:
-rw-r--r-- 1 nagios nagios 60M Nov 5 13:37 host-perfdata
-rw-r--r-- 1 nagios nagios 2.7M Nov 5 04:34 objects.cache
-rw-r--r-- 1 nagios nagios 226M Nov 5 13:37 service-perfdata
Code: Select all
lsof -nP +L1 /var/nagiosramdisk
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
polkitd 1213 polkitd 3r REG 253,5 11031312 0 4485 /var/lib/sss/mc/initgroups (deleted)
puppet 1613 root 8r REG 253,5 8825056 0 845 /var/lib/sss/mc/passwd (deleted)
puppet 1613 root 10r REG 253,5 6618808 0 1102 /var/lib/sss/mc/group (deleted)
freshclam 1623 clamupdate 5r REG 253,5 11031312 0 1117 /var/lib/sss/mc/initgroups (deleted)
qmgr 2347 postfix 8r REG 253,5 11031312 0 1117 /var/lib/sss/mc/initgroups (deleted)
b9daemon 2425 root 20r REG 253,5 8825056 0 845 /var/lib/sss/mc/passwd (deleted)
nagios 3792 nagios 23w REG 0,42 62009590 1 56340188 /var/nagiosramdisk/host-perfdata
nagios 3792 nagios 24w REG 0,42 235923565 1 56340186 /var/nagiosramdisk/service-perfdata
nagios 3908 nagios 24w REG 0,42 62009590 1 56340188 /var/nagiosramdisk/host-perfdata
nagios 3908 nagios 38w REG 0,42 235923565 1 56340186 /var/nagiosramdisk/service-perfdata
java 9174 tidal 3r REG 253,5 8825056 0 845 /var/lib/sss/mc/passwd (deleted)
bash 13284 root cwd DIR 0,42 160 4 56342033 /var/nagiosramdisk
sssd 14448 root 15r REG 253,5 11031312 0 1117 /var/lib/sss/mc/initgroups (deleted)
sssd_be 14449 root 20r REG 253,5 11031312 0 1117 /var/lib/sss/mc/initgroups (deleted)
lsof 15538 root cwd DIR 0,42 160 4 56342033 /var/nagiosramdisk
lsof 15539 root cwd DIR 0,42 160 4 56342033 /var/nagiosramdisk
npcd 18795 nagios cwd DIR 0,42 40 2 56340176 /var/nagiosramdisk/spool/perfdata
Code: Select all
lsof -nP +L1 /tmp
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
polkitd 1213 polkitd 3r REG 253,5 11031312 0 4485 /var/lib/sss/mc/initgroups (deleted)
puppet 1613 root 8r REG 253,5 8825056 0 845 /var/lib/sss/mc/passwd (deleted)
puppet 1613 root 10r REG 253,5 6618808 0 1102 /var/lib/sss/mc/group (deleted)
freshclam 1623 clamupdate 5r REG 253,5 11031312 0 1117 /var/lib/sss/mc/initgroups (deleted)
qmgr 2347 postfix 8r REG 253,5 11031312 0 1117 /var/lib/sss/mc/initgroups (deleted)
rrdcached 2367 nagios 3w REG 253,6 0 1 3795 /tmp/rrd.journal.1636131426.921381
b9daemon 2425 root 20r REG 253,5 8825056 0 845 /var/lib/sss/mc/passwd (deleted)
nagios 3793 nagios cwd DIR 253,6 12288 69 2 /tmp
nagios 3794 nagios cwd DIR 253,6 12288 69 2 /tmp
nagios 3795 nagios cwd DIR 253,6 12288 69 2 /tmp
nagios 3796 nagios cwd DIR 253,6 12288 69 2 /tmp
nagios 3797 nagios cwd DIR 253,6 12288 69 2 /tmp
nagios 3798 nagios cwd DIR 253,6 12288 69 2 /tmp
nagios 3799 nagios cwd DIR 253,6 12288 69 2 /tmp
nagios 3800 nagios cwd DIR 253,6 12288 69 2 /tmp
nagios 3801 nagios cwd DIR 253,6 12288 69 2 /tmp
nagios 3802 nagios cwd DIR 253,6 12288 69 2 /tmp
nagios 3803 nagios cwd DIR 253,6 12288 69 2 /tmp
nagios 3804 nagios cwd DIR 253,6 12288 69 2 /tmp
nagios 3908 nagios cwd DIR 253,6 12288 69 2 /tmp
java 9174 tidal mem REG 253,6 32768 1 4352 /tmp/hsperfdata_tidal/9174
java 9174 tidal 3r REG 253,5 8825056 0 845 /var/lib/sss/mc/passwd (deleted)
sshd 12946 root 11u REG 253,6 3644 1 1462 /var/tmp/host_0
sshd 12979 SAhbouma 11u REG 253,6 3644 1 1462 /var/tmp/host_0
sssd 14448 root 15r REG 253,5 11031312 0 1117 /var/lib/sss/mc/initgroups (deleted)
sssd_be 14449 root 20r REG 253,5 11031312 0 1117 /var/lib/sss/mc/initgroups (deleted)
check_nrp 16101 nagios cwd DIR 253,6 12288 69 2 /tmp
check_nrp 16142 nagios cwd DIR 253,6 12288 69 2 /tmp
check_nrp 16152 nagios cwd DIR 253,6 12288 69 2 /tmp
When running the while command you provided, we see the following repeated over and over:
Code: Select all
while true; do ps aux --sort -rss | grep 'mv|process_perfdata.pl' && ls /var/nagiosramdisk/host-perfdata | wc -l && ls /usr/local/nagios/var/spool/perfdata/ | wc -l ; sleep 2; done
root 19430 0.0 0.0 112812 996 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
root 19493 0.0 0.0 112812 996 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
root 19604 0.0 0.0 112812 996 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
root 19663 0.0 0.0 112812 996 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
root 19722 0.0 0.0 112812 996 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
root 19781 0.0 0.0 112812 992 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
root 19840 0.0 0.0 112812 996 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
root 19900 0.0 0.0 112812 996 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
root 19960 0.0 0.0 112812 996 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
root 20022 0.0 0.0 112812 996 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
root 20086 0.0 0.0 112812 996 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
root 20146 0.0 0.0 112812 996 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
root 20250 0.0 0.0 112812 996 pts/0 S+ 13:37 0:00 grep --color=auto mv|process_perfdata.pl
1
2
Re: Nagios ramdisk full and no performance graphs
Hello @hbouma
Thanks for getting the details over so quickly. It is frustrating that an 'ls' of those directories shows only one or two files while the total size of the ramdisk keeps climbing. Let's do some backtracking to see if there is any configuration that is out of place; once we rule that out, we will look for anything on the OS side causing issues.
To start, let's make sure that the 'nagios' backend system user account is active:
Code: Select all
chage -I -1 -m 0 -M 99999 -E -1 nagios
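To double-check the account afterwards (standard shadow-utils, nothing XI-specific):
Code: Select all
# List the aging/expiry settings for the nagios account
chage -l nagios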
Grab the configs:
Code: Select all
tar -czvf /tmp/ramdisk_config.tar.gz /lib/systemd/system/ramdisk.service /usr/local/nrdp/server/config.inc.php /usr/local/nagios/etc/pnp/npcd.cfg
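If you want to verify what was captured before sending it over (optional):
Code: Select all
# List the contents of the archive without extracting it
tar -tzvf /tmp/ramdisk_config.tar.gz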
We got a count; now let's take a look at the directory:
Code: Select all
ls -l /var/nagiosramdisk/*perfdata
How about the xidpe and perfdata spools:
Code: Select all
ls -l /var/nagiosramdisk/spool/xidpe/
ls -l /var/nagiosramdisk/spool/perfdata/
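And, if useful, a straight count of everything sitting in the ramdisk spool (just a convenience command, not an XI tool):
Code: Select all
# Count every file under the ramdisk spool directories
find /var/nagiosramdisk/spool -type f | wc -l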
The status of the ramdisk system service:
Code: Select all
systemctl status ramdisk.service
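It can also help to confirm how the ramdisk is actually mounted, in particular the size= option (assuming the default mount point):
Code: Select all
# Show the tmpfs mount and its options
findmnt --target /var/nagiosramdisk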
Then restart it:
Code: Select all
systemctl restart ramdisk.service
Please send over the results when you get a chance so we can check them out.
Thanks,
Perry