Dear Team,
We are seeing a high-load issue on our Hadoop Ubuntu machines. One of the Hadoop services (Impala) drives the load average to around 50 while it is processing large amounts of data, and the load returns to normal once the activity completes. The problem is that, during that time, the NRPE agent runs the scripts that monitor the Linux services. These scripts execute the Linux "ps" command, which gets stuck and puts even more load on the machine, and the machine then becomes unresponsive. We have to reboot the machine to bring it back to a normal state. Please suggest how we can mitigate this issue.
Regards,
Mohan
High Load Issue In Hadoop Ubuntu Machines.
-
steelwedge
- Posts: 69
- Joined: Fri Apr 24, 2015 4:58 am
- Box293
- Too Basu
- Posts: 5126
- Joined: Sun Feb 07, 2010 10:55 pm
- Location: Deniliquin, Australia
- Contact:
Re: High Load Issue In Hadoop Ubuntu Machines.
You may need to look at using a different monitoring method. SNMP might be an option.
https://assets.nagios.com/downloads/nag ... g_SNMP.pdf
The Linux SNMP wizard should already exist in XI.
Another option is to configure your hadoop processes with a lower priority so that other things like NRPE are able to function correctly.
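As a rough illustration of the lower-priority idea, here is a minimal Python sketch that raises the nice value of an already running process. The function name lower_priority and the default niceness of 10 are assumptions made for this example, not something from Nagios or Hadoop, and the PID of the Impala daemon would have to be supplied by you. It only ever makes a process "nicer", which an unprivileged user is normally allowed to do for their own processes.

```python
import os


def lower_priority(pid: int, niceness: int = 10) -> int:
    """Raise the nice value of `pid` to at least `niceness`.

    A higher nice value means a lower CPU scheduling priority.
    Returns the nice value of the process after the call.
    """
    current = os.getpriority(os.PRIO_PROCESS, pid)
    # Never lower the nice value (that would need root); cap at 19,
    # the maximum nice value on Linux.
    target = max(current, min(niceness, 19))
    if target != current:
        os.setpriority(os.PRIO_PROCESS, pid, target)
    return os.getpriority(os.PRIO_PROCESS, pid)
```

Note that niceness only affects CPU scheduling. If the load comes from processes waiting on disk I/O, this alone may not help.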
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
-
steelwedge
- Posts: 69
- Joined: Fri Apr 24, 2015 4:58 am
Re: High Load Issue In Hadoop Ubuntu Machines.
So with SNMP monitoring, Nagios will not use the "ps" command to check the services?
-
dwhitfield
- Former Nagios Staff
- Posts: 4583
- Joined: Wed Sep 21, 2016 10:29 am
- Location: NoLo, Minneapolis, MN
- Contact:
Re: High Load Issue In Hadoop Ubuntu Machines.
No, SNMP-based checking will not run ps and parse the output. However, it is entirely possible that the SNMP daemon itself on the remote machine uses ps internally, but it is not possible for us to tell whether this is the case.
Please give SNMP a shot and let us know if the load issue is still present. Thanks!
-
steelwedge
- Posts: 69
- Joined: Fri Apr 24, 2015 4:58 am
Re: High Load Issue In Hadoop Ubuntu Machines.
Why does the NRPE agent run the ps command, and what does it do with the output?
- Box293
- Too Basu
- Posts: 5126
- Joined: Sun Feb 07, 2010 10:55 pm
- Location: Deniliquin, Australia
- Contact:
Re: High Load Issue In Hadoop Ubuntu Machines.
NRPE stands for "Nagios Remote Plugin Executor".
It allows you to execute plugins to check "stuff". The plugin does whatever it's supposed to and then returns the output and exit code back to NRPE and NRPE sends that back to Nagios.
Whatever plugin you are using to monitor the services uses the ps command.
You will need to show us your service definition for the plugin that is causing your issue. Go into CCM, find the service, click the disk icon and paste the text here.
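To make the plugin contract above concrete, here is a minimal sketch of a plugin in Python. It follows the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and prints one line of output. The function names and the thresholds 5.0 and 10.0 are assumptions for illustration only; this is not the check_load plugin that ships with Nagios, and it reads the load average through os.getloadavg() rather than running ps.

```python
import os

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_load(load1: float, warn: float, crit: float):
    """Map a 1-minute load average to (exit_code, status_line)."""
    if load1 >= crit:
        return CRITICAL, f"CRITICAL - load1={load1:.2f}"
    if load1 >= warn:
        return WARNING, f"WARNING - load1={load1:.2f}"
    return OK, f"OK - load1={load1:.2f}"


def main(warn: float = 5.0, crit: float = 10.0) -> int:
    """Print one status line and return the exit code NRPE would pass on."""
    try:
        load1 = os.getloadavg()[0]
    except OSError:
        print("UNKNOWN - load average unavailable")
        return UNKNOWN
    code, message = evaluate_load(load1, warn, crit)
    print(message)
    return code

# A real plugin script would end with: sys.exit(main())
```

NRPE runs the script, captures the printed line and the exit code, and sends both back to Nagios.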
-
steelwedge
- Posts: 69
- Joined: Fri Apr 24, 2015 4:58 am
Re: High Load Issue In Hadoop Ubuntu Machines.
PFA service configuration file of the machine swodc01hdfs05 where we are seeing high load issues frequently.
You do not have the required permissions to view the files attached to this post.
Re: High Load Issue In Hadoop Ubuntu Machines.
I can see the following commands referenced in your config: check_disk, check_cpu_stats, check_load, check_mem, check_init_service, check_open_files, check_procs, and check_users. Can you show us how they are defined on the client (remote machine)?
You will find their definitions in either the "/usr/local/nagios/etc/nrpe/common.cfg" or the "/usr/local/nagios/etc/nrpe.cfg" file.
Be sure to check out our Knowledgebase for helpful articles and solutions!
-
steelwedge
- Posts: 69
- Joined: Fri Apr 24, 2015 4:58 am
Re: High Load Issue In Hadoop Ubuntu Machines.
PFA requested file.
You do not have the required permissions to view the files attached to this post.
-
avandemore
- Posts: 1597
- Joined: Tue Sep 27, 2016 4:57 pm
Re: High Load Issue In Hadoop Ubuntu Machines.
How often does this wedge occur? It seems unlikely that ps is the culprit, as it simply reads and prints data from the kernel. However, ps could have run into a process in an uninterruptible sleep state and not exited. Usually this comes from disk I/O, e.g. NFS, or something where an fsync can't complete properly.
What do the system logs look like after a reboot? Can you disable the NRPE checks and see if the hang still occurs? If you can't disable them all at once, at least bisecting the checks would narrow it down.
Can you run top -bcn1 during such a high load? Also what happens if you send a hung ps process a SIGUSR1?
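If it helps with spotting processes in uninterruptible sleep, here is a small Python sketch that lists processes whose state is "D" by reading /proc/<pid>/stat directly on Linux. The function names are made up for this example. It is only a diagnostic aid under the assumption that /proc is still readable during the high load, and it does not replace the top output requested above.

```python
import os


def parse_stat(text: str):
    """Parse the start of a /proc/<pid>/stat line into (pid, comm, state).

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the last ") " in the line.
    """
    pid_part, rest = text.split(" (", 1)
    comm, after = rest.rsplit(") ", 1)
    state = after.split()[0]
    return int(pid_part), comm, state


def blocked_processes(proc_root: str = "/proc"):
    """Return [(pid, comm)] for processes in state D (uninterruptible sleep)."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError):
            continue  # process exited meanwhile, or the line was unreadable
        if state == "D":
            found.append((pid, comm))
    return found
```

Running blocked_processes() a few times during a load spike should show whether the same PIDs stay stuck in state D.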
Previous Nagios employee