High load, rrdtool graph

Post by **zaji_nms** » Thu Apr 04, 2019 4:20 am

Dear Experts,

Continuing from my last post about high CPU load (that thread is now locked): I checked and found that "rrdtool graph" is causing the high load.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26310 apache 20 0 160m 6032 3972 R 97.5 0.0 0:00.82 rrdtool
26344 apache 20 0 180m 18m 16m R 95.6 0.1 0:00.68 rrdtool
26404 apache 20 0 180m 11m 9664 R 81.9 0.0 0:00.42 rrdtool
26430 apache 20 0 175m 9144 8084 R 66.3 0.0 0:00.34 rrdtool
26431 apache 20 0 175m 11m 10m R 60.5 0.0 0:00.31 rrdtool
26503 apache 20 0 175m 15m 14m R 31.2 0.0 0:00.16 rrdtool
26509 nagios 20 0 141m 10m 2276 S 15.6 0.0 0:00.08 check_ifopersta
14186 apache 20 0 509m 44m 15m S 13.7 0.1 0:03.83 httpd

There is no disk space issue:
Filesystem Size Used Avail Use% Mounted on
tmpfs 500M 18M 483M 4% /var/nagiosramdisk

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
apache 16185 61.0 0.0 184452 5768 ? R 11:50 0:00 /usr/bin/rrdtool graph - --width=500 --height=100 --start=-4h --vertical-label ms --title xyz-7200 / --lower=0.000000 DEF:var1=/usr/local/nagios/share/perfdata/xyz-7200/_HOST_.rrd:1:AVERAGE AREA:var1#EACC00:rta LINE1:var1#000000: GPRINT:var1:LAST:%3.4lf ms LAST GPRINT:var1:MAX:%3.4lf ms MAX GPRINT:var1:AVERAGE:%3.4lf ms AVERAGE \n HRULE:50.000000#FFFF00:Warning on 50.000000\n HRULE:80.000000#FF0000:Critical on 80.000000\n COMMENT:Default Template\r COMMENT:Check Command check_icmp\r

Please note that I am 100% sure we are not displaying any host graph for xyz-7200. In the same way we found the service below, and again I am 100% sure we are not displaying a graph for this service:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
apache 17975 73.0 0.0 184484 16256 ? R 12:03 0:00 /usr/bin/rrdtool graph - --width=500 --height=100 --start=-4h --vertical-label Traffic Mb_s -X 0 -E --title Interface Traffic for xyz-asr-1 / 9467_parlon_Bandwidth DEF:var1=/usr/local/nagios/share/perfdata/xyz-asr-1/9467_parlon_Bandwidth.rrd:1:MAX DEF:var2=/usr/local/nagios/share/perfdata/xyz-asr-1/9467_parlon_Bandwidth.rrd:2:MAX CDEF:real1=var1,1,* CDEF:real2=var2,1,* LINE1:real1#0000CD:In (Mb_s) GPRINT:real1:MIN:%9.3lf Min GPRINT:real1:AVERAGE:%9.3lf Avg GPRINT:real1:MAX:%9.3lf Max GPRINT:real1:LAST:%9.3lf Last\n LINE1:real2#FF0000:Out(Mb_s) GPRINT:real2:MIN:%9.3lf Min GPRINT:real2:AVERAGE:%9.3lf Avg GPRINT:real2:MAX:%9.3lf Max GPRINT:real2:LAST:%9.3lf Last\n COMMENT:Printed on

If it were "rrdtool update", I would understand that some updating goes on in the background from time to time. But why are "rrdtool graph" processes being started when the host/service is not on display? Yes, we added the hosts/services in NagVis, but mainly for status purposes; we are not displaying any graph on the monitoring screen.

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
apache 32420 81.0 0.0 184468 17924 ? R 14:47 0:00 /usr/bin/rrdtool graph - --width=500 --height=100 --start=-4h --vertical-label --title mge420kva-2 / battery_status DEF:var1=/usr/local/nagios/share/perfdata/mge420kva-2/battery_status.rrd:1:AVERAGE AREA:var1#EACC00:capacity LINE1:var1#000000: GPRINT:var1:LAST:%3.4lf LAST GPRINT:var1:MAX:%3.4lf MAX GPRINT:var1:AVERAGE:%3.4lf AVERAGE \n HRULE:50#FFFF00:Warning on 50\n HRULE:30#FF0000:Critical on 30\n COMMENT:Default Template\r COMMENT:Check Command check_mge_battery\r

What I said above is correct. However, if we do display a live bandwidth graph, what is its default refresh/update rate, and how can we increase its auto-refresh interval?

It's all random; I think it happens for all hosts and all services.

Is there any way to dig into the PID and find which source IP the request is coming from and who is initiating it?

Regards

Post by **tgriep** » Thu Apr 04, 2019 4:38 pm

I have a question. I know you have been editing the templates to display different data on the graphs; did you make any changes to the host template or the default template?

If you are viewing a graph in the XI GUI, it should update once every 60 seconds; that is the default.

There is no simple way to determine which IP address is opening it, but you could start by looking at the Apache log files in this folder:
/var/log/httpd

Post by **zaji_nms** » Fri Apr 05, 2019 3:39 am

Dear tgriep,

If you look at my last post, from 24-Dec-2018: we were facing the issue before that, but reported it when our main NagVis display map started turning blue very frequently (FYI, we monitor only link/service status, not bandwidth), and MySQL connection errors appeared from time to time on the Home Operations Center.

Can you please have a look at my last post:
https://support.nagios.com/forum/viewto ... 16&t=51767
high CPU, database connection error, blue Nagvis map

/var/log/httpd
-rw-r--r-- 1 root root 484 Apr 1 11:11 ssl_error_log
-rw-r--r-- 1 root root 2221695 Apr 1 15:48 ssl_request_log
-rw-r--r-- 1 root root 1944859 Apr 1 15:48 ssl_access_log
-rw-r--r-- 1 root root 342949 Apr 4 14:32 error_log

[Thu Apr 04 14:32:33 2019] [error] [client homePc] File does not exist: /usr/local/nagiosxi/html/includes/components/nagioscore/ui/images/logos/switch40.png, referer: http://thisSRV/nagiosxi/includes/components/xicore/status.php?show=hosts

I will add this PNG file; it's just a minor error and not a real issue. There have been no errors in the last 24 hours.

Please note that the script was only tested and is now commented out (#); we want to use it, but only once this CPU/load issue is resolved.

My questions were: how do we increase the performance graph refresh period?
And
why does "rrdtool graph" come into the picture, taking high CPU, while we are not displaying/monitoring any performance graph?

apache 573 34.0 0.0 184460 11524 ? R 12:21 0:00 /usr/bin/rrdtool graph - --width=500 --height=100 --start=-4h --vertical-label RTA --title Ping times for xyz-abc-1 / DEF:var1=/usr/local/nagios/share/perfdata/xyz-abc-1/_HOST_.rrd:1:AVERAGE CDEF:sp1=var1,100,/,12,* CDEF:sp2=var1,100,/,30,* CDEF:sp3=var1,100,/,50,* CDEF:sp4=var1,100,/,70,* AREA:var1#FF5C00:Round Trip Times AREA:sp4#FF7C00: AREA:sp3#FF9C00: AREA:sp2#FFBC00: AREA:sp1#FFDC00: GPRINT:var1:LAST:%6.2lf ms last GPRINT:var1:MAX:%6.2lf ms max GPRINT:var1:AVERAGE:%6.2lf ms avg \n LINE1:var1#000000: HRULE:3000.000#000000:

apache 579 60.0 0.0 184452 9416 ? R 12:21 0:00 /usr/bin/rrdtool graph - --width=500 --height=100 --start=-4h --vertical-label ms --title abc-7200 / --lower=0.000000 DEF:var1=/usr/local/nagios/share/perfdata/abc-7200/_HOST_.rrd:1:AVERAGE AREA:var1#EACC00:rta LINE1:var1#000000: GPRINT:var1:LAST:%3.4lf ms LAST GPRINT:var1:MAX:%3.4lf ms MAX GPRINT:var1:AVERAGE:%3.4lf ms AVERAGE \n HRULE:100.000000#FFFF00:Warning on 100.000000\n HRULE:150.000000#FF0000:Critical on 150.000000\n COMMENT:Default Template\r COMMENT:Check Command check_icmp\r

In the two examples above it shows _HOST_.rrd, yet we are not displaying any graph on the screen. It's random and looks like it happens for all hosts (this is just a sample capture). Maybe some unnecessary parallel process is running; how can we find it?

Regards

Post by **cdienger** » Fri Apr 05, 2019 1:19 pm

lsof may give you some idea of what is using it. Try running:

Code: Select all

lsof -p 579 
lsof -p 573

Note that 579 and 573 are process IDs that may change. Make sure to run the command with the current IDs.
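Because each "rrdtool graph" process lives for well under a second, it is hard to time lsof against a specific PID by hand. A minimal sketch, assuming Linux /proc and running as root: the hypothetical watch_rrdtool function below polls /proc every 0.2 seconds for processes named rrdtool and prints each one once, together with its parent process (whoever spawned it) and its full command line. It checks each PID with the shell's builtin read, so it does not start an extra process per PID it looks at.

```shell
# Hypothetical helper: watch for short-lived "rrdtool" processes for N seconds
# (default 60) and print each one once, with its parent and full command line.
# Assumes Linux /proc; run as root so other users' processes are readable.
watch_rrdtool() {
  local end seen dir pid comm ppid _pid _comm _state _rest
  end=$(( $(date +%s) + ${1:-60} ))
  seen=' '
  while [ "$(date +%s)" -lt "$end" ]; do
    for dir in /proc/[0-9]*; do
      pid=${dir#/proc/}
      # Skip PIDs already reported during this run
      case $seen in *" $pid "*) continue ;; esac
      # Builtin read: no extra process per PID; the process may already be gone
      { read -r comm < "$dir/comm"; } 2>/dev/null || continue
      [ "$comm" = rrdtool ] || continue
      seen="$seen$pid "
      # Field 4 of /proc/PID/stat is the parent PID ("rrdtool" has no spaces)
      { read -r _pid _comm _state ppid _rest < "$dir/stat"; } 2>/dev/null || ppid=0
      printf '%s pid=%s ppid=%s parent=%s\n' "$(date +%T)" "$pid" "$ppid" \
        "$( { tr '\0' ' ' < "/proc/$ppid/cmdline"; } 2>/dev/null )"
      printf '  cmd=%s\n' "$( { tr '\0' ' ' < "$dir/cmdline"; } 2>/dev/null )"
    done
    sleep 0.2   # fractional sleep works with GNU coreutils and busybox
  done
}
```

For example, watch_rrdtool 120 > /tmp/rrdtool-parents.txt watches for two minutes. If the parent turns out to be an httpd worker, the Apache access logs around the same timestamps are the natural next place to look.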

Post by **zaji_nms** » Fri Apr 05, 2019 2:27 pm

Dear cdienger,

In top, rrdtool or "rrdtool graph" shows up for only a second, so how can I catch and check the PID?

Also, my other question: how do I increase the performance graph refresh interval?

Regards

Post by **cdienger** » Fri Apr 05, 2019 3:25 pm

The data used to create the graphs only goes down to a minute so refreshing more frequently than that will not have much if any impact on how the graphs display, but it's configurable under Admin > System Extensions > Manage Graph Templates > Dashlet Refresh Rates > Performance Graphs.
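Whether a faster refresh can show anything new depends on how often the underlying RRD file actually receives a data point. A small sketch (rrd_step is a hypothetical helper name) that reads the base step, in seconds, from the output of rrdtool info. If it prints 300, a 60-second refresh redraws the same data about five times for every new data point.

```shell
# Hypothetical helper: print the base step (seconds between primary data
# points) stored in an RRD file, parsed from "rrdtool info" output.
rrd_step() {
  rrdtool info "$1" | awk -F' = ' '$1 == "step" { print $2 }'
}
```

Usage: rrd_step /usr/local/nagios/share/perfdata/xyz-asr-1/9467_parlon_Bandwidth.rrd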

The graphs are created using graphApi.php. I would monitor the access logs for requests to this file to see when it is being requested:

tail -f /var/log/httpd/* | grep -i graphapi
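tail -f only shows requests as they arrive. To see at a glance which clients and which pages have been requesting graphs, the hypothetical graphapi_sources function below counts graphApi.php requests per client IP and referring page. It assumes Apache's combined log format (client address first, Referer in the second double-quoted field); logs written in common format have no Referer, so that column just shows "-". The ssl_request_log uses a different layout, so pass only the access logs. A Referer pointing at a NagVis page would suggest that a map is what triggers the graphs.

```shell
# Hypothetical helper: count graphApi.php requests per client IP and Referer.
# Assumes Apache "combined" format: client address first, Referer in the
# second double-quoted field; lines without a Referer show "-" instead.
# With no file arguments it reads standard input.
graphapi_sources() {
  grep -hi 'graphapi\.php' "$@" |
    awk -F'"' '{ split($1, pre, " "); print pre[1], ($4 == "" ? "-" : $4) }' |
    sort | uniq -c | sort -rn
}
```

Usage: graphapi_sources /var/log/httpd/ssl_access_log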

Post by **zaji_nms** » Sat Apr 06, 2019 2:28 am

Dear cdienger,

I have increased the refresh interval and there is relief (the load has been reduced). The main remaining issue is that we monitor our 100+ links via NagVis for status only (no host graphs, no bandwidth), so why were the processes below captured?

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
apache 16185 61.0 0.0 184452 5768 ? R 11:50 0:00 /usr/bin/rrdtool graph - --width=500 --height=100 --start=-4h --vertical-label ms --title xyz-7200 / --lower=0.000000 DEF:var1=/usr/local/nagios/share/perfdata/xyz-7200/_HOST_.rrd:1:AVERAGE AREA:var1#EACC00:rta LINE1:var1#000000: GPRINT:var1:LAST:%3.4lf ms LAST GPRINT:var1:MAX:%3.4lf ms MAX GPRINT:var1:AVERAGE:%3.4lf ms AVERAGE \n HRULE:50.000000#FFFF00:Warning on 50.000000\n HRULE:80.000000#FF0000:Critical on 80.000000\n COMMENT:Default Template\r COMMENT:Check Command check_icmp\r

apache 573 34.0 0.0 184460 11524 ? R 12:21 0:00 /usr/bin/rrdtool graph - --width=500 --height=100 --start=-4h --vertical-label RTA --title Ping times for xyz-abc-1 / DEF:var1=/usr/local/nagios/share/perfdata/xyz-abc-1/_HOST_.rrd:1:AVERAGE CDEF:sp1=var1,100,/,12,* CDEF:sp2=var1,100,/,30,* CDEF:sp3=var1,100,/,50,* CDEF:sp4=var1,100,/,70,* AREA:var1#FF5C00:Round Trip Times AREA:sp4#FF7C00: AREA:sp3#FF9C00: AREA:sp2#FFBC00: AREA:sp1#FFDC00: GPRINT:var1:LAST:%6.2lf ms last GPRINT:var1:MAX:%6.2lf ms max GPRINT:var1:AVERAGE:%6.2lf ms avg \n LINE1:var1#000000: HRULE:3000.000#000000:

apache 579 60.0 0.0 184452 9416 ? R 12:21 0:00 /usr/bin/rrdtool graph - --width=500 --height=100 --start=-4h --vertical-label ms --title abc-7200 / --lower=0.000000 DEF:var1=/usr/local/nagios/share/perfdata/abc-7200/_HOST_.rrd:1:AVERAGE AREA:var1#EACC00:rta LINE1:var1#000000: GPRINT:var1:LAST:%3.4lf ms LAST GPRINT:var1:MAX:%3.4lf ms MAX GPRINT:var1:AVERAGE:%3.4lf ms AVERAGE \n HRULE:100.000000#FFFF00:Warning on 100.000000\n HRULE:150.000000#FF0000:Critical on 150.000000\n COMMENT:Default Template\r COMMENT:Check Command check_icmp\r

rrdtool graph ... _HOST_.rrd <<< why was this captured, and who is initiating it?

lsof | grep rrdtool

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
rrdtool 23642 apache cwd DIR 8,3 4096 397333 /usr/local/nagiosxi/html/includes/components/perfdata
rrdtool 28729 apache cwd DIR 8,3 4096 397333 /usr/local/nagiosxi/html/includes/components/perfdata

Regards

Post by **tgriep** » Mon Apr 08, 2019 11:10 am

Could you post your Nagios XI System Profile so we can review it?
To get your system profile, log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" menu.
Click the "Download Profile" button.
Save the profile.zip file and upload it to the forum post.

Run the following as root on the Nagios server and upload the nagvis.zip file to the post, so we can check the NagVis settings and maps to see if we can find the host configuration that is running the rrdtool graph command.

Code: Select all

zip -r nagvis.zip /usr/local/nagvis

Also, run this command as root and post the output here.

Code: Select all

grep xyz-7200 /var/log/httpd/*

Thanks

Post by **zaji_nms** » Mon Apr 08, 2019 12:49 pm

Dear tgriep,

Thanks to the hints, advice and questions from cdienger, my CRITICAL issue is currently resolved. I am hesitant to share this information, even via PM or private email to you.

The questions you are asking also give me some hints, and I will try to dig into it myself. However, if you could point me to a narrower area to check, it would save me time and effort (I won't mind if you just refer me to a document or URL).

One final question: as you know, MRTG runs from cron.d every 5 minutes. I once read that even if we run MRTG every minute or two, it still stores the data as a 5-minute AVERAGE. So why is Dashlets > Performance Graphs set to 60 seconds? May I increase it to 120, 180 or more (less than 300)?

regards

Post by **tgriep** » Mon Apr 08, 2019 1:01 pm

I suspect that someone has set up a NagVis map on the server that tries to view the performance data for that host, which runs the rrdtool graph command.
If you search through all of the files and subfolders in the /usr/local/nagvis directory structure, you should be able to find it and remove it from the NagVis configuration.
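A minimal way to do that search, assuming the default /usr/local/nagvis location mentioned above: the hypothetical find_host_in_nagvis function greps recursively for a literal string and prints the file name, line number and matching line, skipping binary files. Searching for the host name (for example xyz-7200) is the obvious first try; since a map could also embed a graph URL rather than a host object, searching for graphApi as well may be worth a try.

```shell
# Hypothetical helper: list every line under the NagVis tree (default
# /usr/local/nagvis) that contains a literal string such as a host name.
# -r recurse, -n line numbers, -I skip binary files, -F fixed string.
find_host_in_nagvis() {
  grep -rnIF -- "$1" "${2:-/usr/local/nagvis}"
}
```

Usage: find_host_in_nagvis xyz-7200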

The 60-second refresh applies to the whole graphical interface and is not specific to the bandwidth graphs.

Nagios Support Forum
