High Load Issue In Hadoop Ubuntu Machines.
Posted: Tue Oct 04, 2016 8:32 pm
by steelwedge
Dear Team,
We are seeing a high-load issue on our Hadoop Ubuntu machines. One of the Hadoop services (the Impala service) pushes the load on the Linux machine up to around 50 while it is processing large amounts of data, and the load returns to normal once that activity completes. The problem is that during those periods the NRPE agent runs its scripts to monitor the Linux services. These scripts execute the Linux "ps" command, which gets stuck and adds further load, and the machine eventually becomes unresponsive. We then have to reboot the machine to bring it back to a normal state. Please suggest how we can mitigate this issue.
Regards,
Mohan
Re: High Load Issue In Hadoop Ubuntu Machines.
Posted: Tue Oct 04, 2016 9:28 pm
by Box293
You may need to look at using a different monitoring method. SNMP might be an option.
https://assets.nagios.com/downloads/nag ... g_SNMP.pdf
The Linux SNMP wizard should already exist in XI.
Another option is to configure your hadoop processes with a lower priority so that other things like NRPE are able to function correctly.
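A minimal sketch of the lower-priority idea, assuming a Linux host with /proc and Python 3. The process name "impalad" is an assumption (the thread only mentions the Impala service), and pids_by_name and lower_priority are hypothetical helper names. Changing the priority of processes owned by another user needs root, and nice values mainly affect CPU scheduling, so this may help less if the load comes from disk I/O.

```python
import os

def pids_by_name(name, proc_root="/proc"):
    """Return PIDs whose command name (from <proc_root>/<pid>/comm) equals `name`."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        # No readable /proc here (for example, not a Linux host).
        return []
    pids = []
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                if fh.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            # The process exited between listdir() and open(); skip it.
            continue
    return sorted(pids)

def lower_priority(pids, niceness=10):
    """Set the nice value of each PID; a higher value means lower CPU priority.

    Returns the PIDs that were actually changed.
    """
    changed = []
    for pid in pids:
        try:
            os.setpriority(os.PRIO_PROCESS, pid, niceness)
            changed.append(pid)
        except (PermissionError, ProcessLookupError):
            # Not allowed to touch this process, or it is already gone.
            continue
    return changed
```

Called as lower_priority(pids_by_name("impalad")), this would renice every matching process to 10. A setting that survives service restarts would have to go into however the service is started on your machines, which this sketch does not cover.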
Re: High Load Issue In Hadoop Ubuntu Machines.
Posted: Wed Oct 05, 2016 2:40 pm
by steelwedge
So with SNMP monitoring, Nagios will not use the "ps" command to check the services?
Re: High Load Issue In Hadoop Ubuntu Machines.
Posted: Wed Oct 05, 2016 3:05 pm
by dwhitfield
No, SNMP-based checking will not run ps and parse the output. However, it is entirely possible that the SNMP daemon itself on the remote machine uses ps internally, but it is not possible for us to tell whether this is the case.
Please give SNMP a shot and let us know if the load issue is still present. Thanks!
Re: High Load Issue In Hadoop Ubuntu Machines.
Posted: Mon Oct 10, 2016 9:04 pm
by steelwedge
Why does the NRPE agent run the ps command, and what does it do with the output?
Re: High Load Issue In Hadoop Ubuntu Machines.
Posted: Mon Oct 10, 2016 9:22 pm
by Box293
NRPE stands for "Nagios Remote Plugin Executor".
It allows you to execute plugins to check "stuff". The plugin does whatever it's supposed to and then returns the output and exit code back to NRPE and NRPE sends that back to Nagios.
Whatever plugin you are using to monitor the services uses the ps command.
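To illustrate the plugin contract described above, here is a rough sketch of a process-count check. It is not the plugin in use on your machines: it is a hypothetical example that counts processes by reading /proc directly instead of calling ps, and the names count_procs, evaluate and main are made up for illustration. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin codes.

```python
import os

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def count_procs(name, proc_root="/proc"):
    """Count processes whose command name (<proc_root>/<pid>/comm) equals `name`."""
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                if fh.read().strip() == name:
                    count += 1
        except OSError:
            # The process exited while we were scanning; skip it.
            continue
    return count

def evaluate(count, name, min_ok=1):
    """Turn a process count into (exit_code, status_line)."""
    if count >= min_ok:
        return OK, f"PROCS OK: {count} process(es) named {name}"
    return CRITICAL, f"PROCS CRITICAL: {count} process(es) named {name}"

def main(argv):
    """Print one status line and return the exit code NRPE would pass on."""
    if len(argv) != 2:
        print("PROCS UNKNOWN: usage: check_proc_sketch.py <process-name>")
        return UNKNOWN
    try:
        count = count_procs(argv[1])
    except OSError as exc:
        print(f"PROCS UNKNOWN: cannot read /proc: {exc}")
        return UNKNOWN
    code, message = evaluate(count, argv[1])
    print(message)
    return code
```

A real plugin file would end with sys.exit(main(sys.argv)) so that the exit code reaches NRPE. The printed line is the text Nagios displays, and the exit code determines the service state.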
You will need to show us your service definition for the plugin that is causing your issue. Go into CCM, find the service, click the disk icon and paste the text here.
Re: High Load Issue In Hadoop Ubuntu Machines.
Posted: Tue Oct 11, 2016 2:50 am
by steelwedge
Please find attached the service configuration file for the machine swodc01hdfs05, where we frequently see the high-load issue.
Re: High Load Issue In Hadoop Ubuntu Machines.
Posted: Tue Oct 11, 2016 1:16 pm
by lmiltchev
I can see the following commands referenced in your config: check_disk, check_cpu_stats, check_load, check_mem, check_init_service, check_open_files, check_procs, and check_users. Can you show us how they are defined on the client (remote machine)?
You will find their definitions in either "/usr/local/nagios/etc/nrpe/common.cfg" or "/usr/local/nagios/etc/nrpe.cfg" file.
Re: High Load Issue In Hadoop Ubuntu Machines.
Posted: Tue Oct 11, 2016 9:26 pm
by steelwedge
Please find attached the requested file.
Re: High Load Issue In Hadoop Ubuntu Machines.
Posted: Wed Oct 12, 2016 12:35 pm
by avandemore
How often does this wedge occur? It seems unlikely that ps is the culprit, as it simply reads and prints data from the kernel. However, ps could have run into a process in an uninterruptible sleep state and never exited. Usually this comes from disk I/O, e.g. NFS or something else where an fsync can't complete properly.
What do the system logs look like after a reboot? Can you disable the NRPE checks and see if the hang still occurs? If you can't disable them all at once, at least bisecting the checks would narrow it down.
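One way to look for processes in uninterruptible sleep (state D) without going through ps is sketched below. On Linux, tasks in state D count toward the load average, so a pile of them could also explain a very high load figure. This assumes Python 3 and a readable /proc. parse_stat and uninterruptible are hypothetical helper names, and whether reading /proc/<pid>/stat avoids the hang you see with ps on this machine is untested.

```python
import os

def parse_stat(stat_line):
    """Split the start of a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = stat_line.index("(")
    right = stat_line.rindex(")")
    pid = int(stat_line[:left].strip())
    comm = stat_line[left + 1:right]
    state = stat_line[right + 1:].split()[0]
    return pid, comm, state

def uninterruptible(proc_root="/proc"):
    """Return sorted (pid, comm) pairs for processes currently in state D."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return []
    found = []
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            # Process vanished or the line was not in the expected shape.
            continue
        if state == "D":
            found.append((pid, comm))
    return sorted(found)
```

Running uninterruptible() a few times during a high-load period and noting which commands keep showing up would point at whatever is blocked on I/O.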
Can you run top -bcn1 during such a high load? Also what happens if you send a hung ps process a SIGUSR1?