NagiosXI graph problems

Post by **ctretelea** » Wed May 09, 2018 9:22 am

Hi, I have a problem with graphs: not with all of them, only with some.
After some investigation I found that the ramdisk is full (100%).

Code: Select all

[root@ip-10-60-0-29 perfdata]# df -h
Filesystem                      Size  Used Avail Use% Mounted on
/dev/xvda1                      100G   49G   52G  49% /
devtmpfs                        7.8G     0  7.8G   0% /dev
tmpfs                           7.8G     0  7.8G   0% /dev/shm
tmpfs                           7.8G  873M  7.0G  11% /run
tmpfs                           7.8G     0  7.8G   0% /sys/fs/cgroup
s3fs                            256T     0  256T   0% /store/backups/nagiosxi1
/dev/mapper/data_vg1-mysql_lv1   20G  9.5G  9.2G  51% /var/lib/mysql
/dev/mapper/data_vg1-perf_lv1   4.8G  3.1G  1.5G  69% /apps/perf
tmpfs                           2.0G  2.0G     0 100% /var/nagiosramdisk
tmpfs                           1.6G     0  1.6G   0% /run/user/1001
tmpfs                           1.6G     0  1.6G   0% /run/user/1004
[root@ip-10-60-0-29 perfdata]# free -m
              total        used        free      shared  buff/cache   available
Mem:          15885        2630        2167        1660       11086       11140
Swap:          4095        1483        2612

After further investigation I found that all of the ramdisk space is taken by the /var/nagiosramdisk/spool/perfdata/ folder.

Here is some more information:

Code: Select all

[root@ip-10-60-0-29 perfdata]# grep ramdisk /usr/local/nagios/etc/nagios.cfg /usr/local/nrdp/server/config.inc.php /usr/local/nagiosxi/html/config.inc.php /usr/local/nagios/etc/pnp/npcd.cfg
/usr/local/nagios/etc/nagios.cfg:service_perfdata_file=/var/nagiosramdisk/service-perfdata
/usr/local/nagios/etc/nagios.cfg:host_perfdata_file=/var/nagiosramdisk/host-perfdata
/usr/local/nagios/etc/nagios.cfg:check_result_path=/var/nagiosramdisk/spool/checkresults
/usr/local/nagios/etc/nagios.cfg:object_cache_file=/var/nagiosramdisk/objects.cache
/usr/local/nagios/etc/nagios.cfg:status_file=/var/nagiosramdisk/status.dat
/usr/local/nagios/etc/nagios.cfg:temp_path=/var/nagiosramdisk/tmp
/usr/local/nrdp/server/config.inc.php:$cfg["check_results_dir"]="/var/nagiosramdisk/spool/checkresults";
/usr/local/nagiosxi/html/config.inc.php:$cfg['xidpe_dir'] = '/var/nagiosramdisk/spool/xidpe/';
/usr/local/nagiosxi/html/config.inc.php:$cfg['perfdata_spool'] = '/var/nagiosramdisk/spool/perfdata/';
/usr/local/nagios/etc/pnp/npcd.cfg:perfdata_spool_dir = /var/nagiosramdisk/spool/perfdata/
[root@ip-10-60-0-29 perfdata]# ls /var/nagiosramdisk/spool/xidpe | wc -l
2
[root@ip-10-60-0-29 perfdata]# ls /var/nagiosramdisk/spool/perfdata/ | wc -l
33660
[root@ip-10-60-0-29 perfdata]# ls /var/nagiosramdisk/spool/checkresults/ | wc -l
2
[root@ip-10-60-0-29 perfdata]# chage -l nagios
Last password change                                    : Feb 17, 2017
Password expires                                        : never
Password inactive                                       : never
Account expires                                         : never
Minimum number of days between password change          : 0
Maximum number of days between password change          : 99999
Number of days of warning before password expires       : 7
[root@ip-10-60-0-29 perfdata]#

define command {
       command_name                             process-host-perfdata-file-bulk
       command_line                             /bin/mv /var/nagiosramdisk/host-perfdata /var/nagiosramdisk/spool/xidpe/$TIMET$.perfdata.host
}


define command {
       command_name                             process-service-perfdata-file-bulk
       command_line                             /bin/mv /var/nagiosramdisk/service-perfdata /var/nagiosramdisk/spool/xidpe/$TIMET$.perfdata.service
}

My question: why is the folder /var/nagiosramdisk/spool/perfdata/ filling up and not being processed?

Post by **ctretelea** » Wed May 09, 2018 1:50 pm

Hello,
After enabling debug mode I see these errors in the file /usr/local/nagios/var/perfdata.log

Code: Select all

2018-05-09 14:41:12 [7296] [0] RRDs::update /usr/local/nagios/share/perfdata/diplomat-ntdspedi/Swap_Usage.rrd 1525891266:4091:52
2018-05-09 14:41:12 [7296] [0] RRDs::update ERROR /usr/local/nagios/share/perfdata/diplomat-ntdspedi/Swap_Usage.rrd: found extra data on update argument: 52

Code: Select all

2018-05-09 14:41:12 [7296] [0] RRDs::update /usr/local/nagios/share/perfdata/ntkncrep/Physical_Memory.rrd 1525891267:54:8.71299
2018-05-09 14:41:12 [7296] [0] RRDs::update ERROR /usr/local/nagios/share/perfdata/ntkncrep/Physical_Memory.rrd: expected 8 data source readings (got 2) from 1525891267

Is that something that can help figure out why the files are not processed?

Post by **tgriep** » Wed May 09, 2018 4:45 pm

The Nagios process takes the files out of the perfdata folder. One reason it may be unable to remove the files is that the nagios user account has expired or has a password set.
To check that, run this as root and post the output.

Code: Select all

chage -l nagios

The RRD update error happens when a check has been changed and now returns different performance data.
When the performance data changes, graphing for that service stops. To get it working again, the .xml and .rrd files for that service have to be deleted so that they are recreated and start to populate with data.
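The mismatch described above can be confirmed by comparing the number of values in a rejected update string with the number of data sources defined in the RRD file. The following is a minimal sketch, assuming `rrdtool` is installed and on the PATH; the helper names `count_update_values` and `count_rrd_sources` are hypothetical and not part of Nagios XI or PNP.

```shell
#!/bin/sh
# Count the readings in an RRD update string such as "1525891267:54:8.71299".
# The first colon-separated field is the timestamp, the rest are readings.
count_update_values() {
    printf '%s\n' "$1" | awk -F: '{ print NF - 1 }'
}

# Count the data sources defined in an RRD file.
# "rrdtool info" prints one "ds[NAME].type = ..." line per data source.
count_rrd_sources() {
    rrdtool info "$1" | grep -c '^ds\[.*\]\.type'
}

# Update string taken from the Physical_Memory.rrd error in the log above.
count_update_values "1525891267:54:8.71299"   # prints 2
```

If `count_rrd_sources` on the affected .rrd file reports a different number than `count_update_values` on the update string, the check output and the RRD layout no longer match, which is the situation the "expected 8 data source readings (got 2)" error reports.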

Post by **ctretelea** » Thu May 10, 2018 10:04 am

Hi tgriep,
Some of the files are removed, but some of them are still there. The nagios account is not expired; I posted its state in my first post.

Code: Select all

[root@ip-10-60-0-29 ctretelea]# chage -l nagios
Last password change                                    : Feb 17, 2017
Password expires                                        : never
Password inactive                                       : never
Account expires                                         : never
Minimum number of days between password change          : 0
Maximum number of days between password change          : 99999
Number of days of warning before password expires       : 7

Yes, if I look at the error and remove the rrd file, a new rrd file is created and the error disappears for that file, but I lose the graph history for that service. Does Nagios XI use only RRD for graphing? Can we use something better than it?
I removed all files from the /usr/local/nagios/var/spool/perfdata/ folder, but it is still filling up (in less than one hour I already have 52 files).

Code: Select all

Every 2.0s: ls /usr/local/nagios/var/spool/perfdata/ | wc -l                                                                                                    Thu May 10 11:02:07 2018
52

Also, in the /usr/local/nagios/var/npcd.log log file I see these messages:

... is an already in process PNP file. Leaving it untouched.

Is that normal?

Code: Select all

[05-10-2018 10:58:59] NPCD: ThreadCounter 0/5 File is 1525962054.perfdata.service-PID-29931
[05-10-2018 10:58:59] NPCD: File '1525962054.perfdata.service-PID-29931' is an already in process PNP file. Leaving it untouched.
[05-10-2018 10:58:59] NPCD: DEBUG: load 0.410000/10.000000
[05-10-2018 10:58:59] NPCD: ThreadCounter 0/5 File is 1525962099.perfdata.service-PID-31008
[05-10-2018 10:58:59] NPCD: File '1525962099.perfdata.service-PID-31008' is an already in process PNP file. Leaving it untouched.
[05-10-2018 10:58:59] NPCD: DEBUG: load 0.410000/10.000000

Post by **tgriep** » Thu May 10, 2018 1:31 pm

In the nagios.cfg file there is an option called max_check_result_file_age; the default setting is shown below.

Code: Select all

max_check_result_file_age=3600

Here is a description of that option.

This option determines the maximum age in seconds that Nagios will consider check result files found in the check_result_path directory to be valid. Check result files that are older than this threshold will be deleted by Nagios and the check results they contain will not be processed.

If it is set to that value, Nagios should remove the files after 60 minutes.
The nagios account is OK, so it could be a permission issue. Can you run the following commands and post the output?
grep nag /etc/group

Code: Select all

ls -l  /var/nagiosramdisk/spool/perfdata/
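To confirm which value is actually set, the directive can be read straight from the config file. This is a minimal sketch; `cfg_value` is a hypothetical helper name, and it only handles plain `key=value` lines. Note that, per the description quoted above, the option applies to files in the check_result_path directory.

```shell
#!/bin/sh
# Print the value of a "key=value" directive from a Nagios-style config file.
# If the key appears more than once, the last occurrence is printed.
cfg_value() {
    key=$1
    file=$2
    sed -n -E "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$file" | tail -n 1
}

# Only query the real config if it exists on this machine.
if [ -r /usr/local/nagios/etc/nagios.cfg ]; then
    cfg_value max_check_result_file_age /usr/local/nagios/etc/nagios.cfg
    cfg_value check_result_path /usr/local/nagios/etc/nagios.cfg
fi
```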

RRD files are currently the only option for the graphs, so there is no way to change that.

The message "NPCD: File '1525962054.perfdata.service-PID-29931' is an already in process PNP file. Leaving it untouched." may mean that the timeout setting in process_perfdata.cfg needs to be increased.
To increase it, edit the following file:

Code: Select all

/usr/local/nagios/etc/pnp/process_perfdata.cfg

For example, change the default value from:

Code: Select all

TIMEOUT = 5

To:

Code: Select all

TIMEOUT = 20

If it was already increased, increase it again. Then save the file and run the following to restart the processes.

Code: Select all

service npcd restart
service nagios restart
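The TIMEOUT edit described above could also be scripted. The following is a minimal sketch, not an official procedure: `set_pnp_timeout` is a hypothetical helper that assumes the file contains a line starting with `TIMEOUT`, and it keeps a `.bak` copy of the original next to it.

```shell
#!/bin/sh
# Set the TIMEOUT value in a process_perfdata.cfg-style file.
# The original is copied to "<file>.bak" first. If the file has no line
# starting with TIMEOUT, its content is left unchanged.
set_pnp_timeout() {
    file=$1
    seconds=$2
    cp -p "$file" "$file.bak" || return 1
    # Rewriting through redirection keeps the owner and mode of the file.
    sed -E "s/^TIMEOUT[[:space:]]*=.*/TIMEOUT = ${seconds}/" "$file.bak" > "$file"
}

# Example (run as root, then restart npcd and nagios as shown above):
#   set_pnp_timeout /usr/local/nagios/etc/pnp/process_perfdata.cfg 20
```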

The following two articles describe how to change the graphs, either to add the new datapoints to them:
https://support.nagios.com/kb/article/n ... g-149.html

Or to remove them from the graphs:
https://support.nagios.com/kb/article/n ... e-497.html

FYI, the articles may not work in all cases. Be sure to back up the files before editing them.

Post by **ctretelea** » Fri May 11, 2018 8:20 am

Hi tgriep,
Yes, max_check_result_file_age is 3600 in the nagios.cfg file.
I exported the output of the command ls -l /var/nagiosramdisk/spool/perfdata/ to a file and attached it here.

I increased the timeout to 20 and restarted the services, but I don't see any improvement.

Code: Select all

[root@ip-10-60-0-29 ctretelea]# tail -f /usr/local/nagios/var/perfdata.log
2018-05-11 08:34:34 [4015] [0] RRDs::update /usr/local/nagios/share/perfdata/sjsmith-tecsysdbsrv/_HOST_.rrd 1526042051:0.000:100:0.000:0.000
2018-05-11 08:34:34 [4015] [0] RRDs::update ERROR /usr/local/nagios/share/perfdata/sjsmith-tecsysdbsrv/_HOST_.rrd: illegal attempt to update using time 1526042051 when last update time is 1526042066 (minimum one second step)
2018-05-11 08:36:04 [7712] [0] RRDs::update /usr/local/nagios/share/perfdata/sjsmith-newtecsysappsrv/_HOST_.rrd 1526042135:0.000:100:0.000:0.000
2018-05-11 08:36:04 [7712] [0] RRDs::update ERROR /usr/local/nagios/share/perfdata/sjsmith-newtecsysappsrv/_HOST_.rrd: illegal attempt to update using time 1526042135 when last update time is 1526042143 (minimum one second step)
2018-05-11 09:03:32 [27783] [0] RRDs::update /usr/local/nagios/share/perfdata/ntndcdb/_HOST_.rrd 1526043778:141.741:0:170.218:106.458
2018-05-11 09:03:32 [27783] [0] RRDs::update ERROR /usr/local/nagios/share/perfdata/ntndcdb/_HOST_.rrd: illegal attempt to update using time 1526043778 when last update time is 1526043798 (minimum one second step)
2018-05-11 09:03:32 [27783] [0] RRDs::update /usr/local/nagios/share/perfdata/ACE-ntacerep/_HOST_.rrd 1526043787:155.774:0:189.633:111.053
2018-05-11 09:03:32 [27783] [0] RRDs::update ERROR /usr/local/nagios/share/perfdata/ACE-ntacerep/_HOST_.rrd: illegal attempt to update using time 1526043787 when last update time is 1526043804 (minimum one second step)
2018-05-11 09:05:33 [31587] [0] RRDs::update /usr/local/nagios/share/perfdata/ACE-ntacerep/_HOST_.rrd 1526043905:147.875:0:171.177:126.421
2018-05-11 09:05:33 [31587] [0] RRDs::update ERROR /usr/local/nagios/share/perfdata/ACE-ntacerep/_HOST_.rrd: illegal attempt to update using time 1526043905 when last update time is 1526043920 (minimum one second step)

Is that normal to receive these errors?

Code: Select all

[root@ip-10-60-0-29] 09:14 # tail -f /usr/local/nagios/var/npcd.log
[05-11-2018 09:14:32] NPCD: DEBUG: load 0.430000/10.000000
[05-11-2018 09:14:32] NPCD: ThreadCounter 1/5 File is 1526044463.perfdata.host
[05-11-2018 09:14:32] NPCD: Processing file 1526044462.perfdata.service with ID 140531020105472 - going to exec /usr/local/nagios/libexec/process_perfdata.pl -n -b /var/nagiosramdisk/spool/perfdata//1526044462.perfdata.service
[05-11-2018 09:14:32] NPCD: ThreadCounter 0/5 File is 1526044147.perfdata.service-PID-6557
[05-11-2018 09:14:32] NPCD: File '1526044147.perfdata.service-PID-6557' is an already in process PNP file. Leaving it untouched.
[05-11-2018 09:14:32] NPCD: DEBUG: load 0.430000/10.000000
[05-11-2018 09:14:32] NPCD: ThreadCounter 0/5 File is 1526044177.perfdata.service-PID-7564
[05-11-2018 09:14:32] NPCD: File '1526044177.perfdata.service-PID-7564' is an already in process PNP file. Leaving it untouched.
[05-11-2018 09:14:32] NPCD: DEBUG: load 0.430000/10.000000
[05-11-2018 09:14:32] NPCD: ThreadCounter 0/5 File is 1526044192.perfdata.host-PID-8561
[05-11-2018 09:14:32] NPCD: File '1526044192.perfdata.host-PID-8561' is an already in process PNP file. Leaving it untouched.
[05-11-2018 09:14:32] NPCD: DEBUG: load 0.430000/10.000000
[05-11-2018 09:14:32] NPCD: ThreadCounter 0/5 File is 1526044267.perfdata.service-PID-10531
[05-11-2018 09:14:32] NPCD: File '1526044267.perfdata.service-PID-10531' is an already in process PNP file. Leaving it untouched.
[05-11-2018 09:14:32] NPCD: DEBUG: load 0.430000/10.000000
[05-11-2018 09:14:32] NPCD: ThreadCounter 0/5 File is 1526044402.perfdata.service-PID-15217
[05-11-2018 09:14:32] NPCD: File '1526044402.perfdata.service-PID-15217' is an already in process PNP file. Leaving it untouched.
[05-11-2018 09:14:32] NPCD: DEBUG: load 0.430000/10.000000
[05-11-2018 09:14:32] NPCD: ThreadCounter 0/5 File is 1526044447.perfdata.service-PID-16253
[05-11-2018 09:14:32] NPCD: File '1526044447.perfdata.service-PID-16253' is an already in process PNP file. Leaving it untouched.
[05-11-2018 09:14:32] NPCD: DEBUG: load 0.430000/10.000000
...

You propose to increase it more, but by how much? Is it now every 20 seconds, and do you propose increasing it to every 30 s or 60 s?

Post by **tgriep** » Fri May 11, 2018 11:43 am

No, those errors are not normal.

It could be that multiple processes are running, which could be causing both the time errors and the "already in process" messages.

One quick thing to do is to just reboot the server. That should kill off any extra processes.
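For context on the time errors: rrdtool rejects an update whose timestamp is not later than the last update already stored in the RRD, so a sample that arrives after a newer one for the same file is refused. The sketch below (the helper name `update_lag` is hypothetical) extracts the two timestamps from such a log line and prints how many seconds the rejected sample was behind.

```shell
#!/bin/sh
# Given an "illegal attempt to update" line from perfdata.log, print how many
# seconds the rejected sample lies behind the last update stored in the RRD.
update_lag() {
    printf '%s\n' "$1" |
    sed -n -E 's/.*update using time ([0-9]+) when last update time is ([0-9]+).*/\1 \2/p' |
    while read -r attempted last; do
        echo $((last - attempted))
    done
}

# Log line taken from the perfdata.log output posted above.
update_lag "2018-05-11 08:34:34 [4015] [0] RRDs::update ERROR /usr/local/nagios/share/perfdata/sjsmith-tecsysdbsrv/_HOST_.rrd: illegal attempt to update using time 1526042051 when last update time is 1526042066 (minimum one second step)"   # prints 15
```

Samples arriving out of order by several seconds would be consistent with more than one process writing to the same RRD files, though other causes are possible.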

If you can post your System Profile or PM it to me, we can take a look at the logs and settings in it and get back to you.

To get your System Profile, log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" Menu
Click the "Download Profile" button
Save the profile.zip file and post it to the forum or PM it to me.

Post by **tgriep** » Mon May 14, 2018 9:53 am

Thanks for the profile. There are 2 copies of the NPCD daemon running on the system; there should be only one, and that is causing the issue.
To stop all of them and start a single copy, run the following as root.

Code: Select all

service npcd stop
killall -9 npcd
service npcd start

Give the system 15 to 20 minutes and see if the issue is resolved.
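While waiting, two quick checks can show whether the fix took hold: that only one npcd process is running, and that the spool backlog is shrinking. This is a minimal sketch; `count_procs` and `spool_summary` are hypothetical helper names, `pgrep` and a `find` that supports `-maxdepth` are assumed to be available, and the spool path is the one from the configuration posted earlier in the thread.

```shell
#!/bin/sh
# Print how many running processes have exactly the given name (0 if none).
count_procs() {
    pgrep -x "$1" 2>/dev/null | wc -l | tr -d ' '
}

# Print how many spool files are waiting and how many carry the
# "-PID-<pid>" suffix that npcd.log reports as already in process.
spool_summary() {
    dir=$1
    total=$(find "$dir" -maxdepth 1 -type f | wc -l | tr -d ' ')
    claimed=$(find "$dir" -maxdepth 1 -type f -name '*-PID-*' | wc -l | tr -d ' ')
    echo "waiting=$((total - claimed)) claimed=$claimed"
}

echo "npcd processes: $(count_procs npcd)"
if [ -d /var/nagiosramdisk/spool/perfdata ]; then
    spool_summary /var/nagiosramdisk/spool/perfdata
fi
```

Running this a few times over the 15 to 20 minutes should show a single npcd process and falling counts if the backlog is being worked off.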

Post by **ctretelea** » Mon May 14, 2018 12:11 pm

Hi tgriep,

Yes, you were right, and now we no longer have any in-process files, but I still have errors in the log files:

Code: Select all

[ctretelea@ip-10-60-0-29] 13:09 $ tail -f /usr/local/nagios/var/npcd.log
[05-14-2018 13:05:01] NPCD: ERROR: Executed command exits with return code '13'
[05-14-2018 13:05:01] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /var/nagiosramdisk/spool/perfdata//1526317492.perfdata.service'
[05-14-2018 13:07:48] NPCD: ERROR: Executed command exits with return code '13'
[05-14-2018 13:07:48] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /var/nagiosramdisk/spool/perfdata//1526317656.perfdata.service'
[05-14-2018 13:09:04] NPCD: ERROR: Executed command exits with return code '13'
[05-14-2018 13:09:04] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /var/nagiosramdisk/spool/perfdata//1526317732.perfdata.host'
[05-14-2018 13:09:04] NPCD: ERROR: Executed command exits with return code '13'
[05-14-2018 13:09:04] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /var/nagiosramdisk/spool/perfdata//1526317731.perfdata.service'
[05-14-2018 13:09:19] NPCD: ERROR: Executed command exits with return code '13'
[05-14-2018 13:09:19] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /var/nagiosramdisk/spool/perfdata//1526317746.perfdata.service'

Code: Select all

[root@ip-10-60-0-29] 13:09 # tail -f /usr/local/nagios/var/perfdata.log
2018-05-14 12:39:12 [1770] [0] RRDs::update /usr/local/nagios/share/perfdata/ntndcdb/_HOST_.rrd 1526315918:129.975:0:166.385:101.884
2018-05-14 12:39:12 [1770] [0] RRDs::update ERROR /usr/local/nagios/share/perfdata/ntndcdb/_HOST_.rrd: illegal attempt to update using time 1526315918 when last update time is 1526315935 (minimum one second step)
2018-05-14 12:39:12 [1770] [0] RRDs::update /usr/local/nagios/share/perfdata/ACE-ntacerep/_HOST_.rrd 1526315922:166.566:0:191.477:144.865
2018-05-14 12:39:12 [1770] [0] RRDs::update ERROR /usr/local/nagios/share/perfdata/ACE-ntacerep/_HOST_.rrd: illegal attempt to update using time 1526315922 when last update time is 1526315939 (minimum one second step)
2018-05-14 12:51:52 [25696] [0] RRDs::update /usr/local/nagios/share/perfdata/petfood-OAKWMSSQLTST01/_HOST_.rrd 1526316684:166.945:0:194.183:135.553
2018-05-14 12:51:52 [25696] [0] RRDs::update ERROR /usr/local/nagios/share/perfdata/petfood-OAKWMSSQLTST01/_HOST_.rrd: illegal attempt to update using time 1526316684 when last update time is 1526316702 (minimum one second step)
2018-05-14 12:55:55 [721] [0] RRDs::update /usr/local/nagios/share/perfdata/ACE-ntacerep/_HOST_.rrd 1526316929:159.440:0:184.440:128.976
2018-05-14 12:55:55 [721] [0] RRDs::update ERROR /usr/local/nagios/share/perfdata/ACE-ntacerep/_HOST_.rrd: illegal attempt to update using time 1526316929 when last update time is 1526316946 (minimum one second step)
2018-05-14 12:56:55 [2754] [0] RRDs::update /usr/local/nagios/share/perfdata/ACE-ntacerep/_HOST_.rrd 1526317005:148.412:0:184.385:91.777
2018-05-14 12:56:55 [2754] [0] RRDs::update ERROR could not lock RRD

How can I solve these errors?

Post by **tgriep** » Mon May 14, 2018 3:18 pm

The error 13 and the lock error could be a permission problem, so make sure the files and folders below are owned by the nagios user and group and that the nagios user has permission to write to them.

Code: Select all

/var/nagiosramdisk/spool/
/usr/local/nagios/share/perfdata/
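One way to look for ownership or permission problems under those two directories is to let `find` list anything that does not match. The sketch below is an illustration only; `not_owned_by` and `not_user_writable` are hypothetical helper names, and the second helper checks the owner write bit only, not ACLs or group access.

```shell
#!/bin/sh
# List entries under a directory that are not owned by the given user
# or not in the given group. No output means everything matches.
not_owned_by() {
    user=$1
    group=$2
    dir=$3
    find "$dir" \( ! -user "$user" -o ! -group "$group" \) -print
}

# List entries under a directory whose owner lacks write permission.
not_user_writable() {
    find "$1" ! -perm -200 -print
}

# Example (run as root):
#   not_owned_by nagios nagios /usr/local/nagios/share/perfdata/
#   not_user_writable /var/nagiosramdisk/spool/
```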

Nagios Support Forum
