High Load Issue In Hadoop Ubuntu Machines.

steelwedge
Posts: 69
Joined: Fri Apr 24, 2015 4:58 am

Post by steelwedge »

Dear Team,

We are seeing a high-load issue on our Hadoop Ubuntu machines. One of the Hadoop services (the Impala service) puts a heavy load (around 50) on the Linux machine while it processes large amounts of data, and the load returns to normal once the activity completes. The problem is that, during that time, the NRPE agent executes scripts to monitor the Linux services. These scripts run the Linux "ps" command, which gets stuck and adds further load, and the machine becomes unresponsive. We then need to reboot the machine to bring it back to a normal state. Please suggest how we can mitigate this issue.

Regards,
Mohan
Box293
Too Basu
Posts: 5126
Joined: Sun Feb 07, 2010 10:55 pm
Location: Deniliquin, Australia

Re: High Load Issue In Hadoop Ubuntu Machines.

Post by Box293 »

You may need to look at using a different monitoring method. SNMP might be an option.

https://assets.nagios.com/downloads/nag ... g_SNMP.pdf

The Linux SNMP wizard should already exist in XI.

Another option is to configure your hadoop processes with a lower priority so that other things like NRPE are able to function correctly.
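
As a minimal sketch of the lower-priority option, assuming the process ID of the Hadoop or Impala daemon is already known, the niceness of a running process can be raised (which lowers its scheduling priority) from Python. The function name `lower_priority` and the niceness value of 10 are illustrative assumptions, not something Nagios or Hadoop prescribes.

```python
import os


def lower_priority(pid: int, niceness: int = 10) -> int:
    """Raise the niceness of `pid` so the scheduler favours other processes.

    `pid` and `niceness` are hypothetical inputs for illustration; pid 0
    means "the calling process". Returns the niceness now in effect.
    """
    current = os.getpriority(os.PRIO_PROCESS, pid)
    if niceness > current:
        # Only ever move towards lower priority; moving back usually
        # requires elevated privileges.
        os.setpriority(os.PRIO_PROCESS, pid, niceness)
    return os.getpriority(os.PRIO_PROCESS, pid)


if __name__ == "__main__":
    # Demonstrated on the current process rather than a real Hadoop daemon.
    print("niceness now:", lower_priority(0, 10))
```

Niceness mainly influences CPU scheduling, so this may not help much if the load is driven by disk I/O rather than CPU.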
steelwedge
Posts: 69
Joined: Fri Apr 24, 2015 4:58 am

Re: High Load Issue In Hadoop Ubuntu Machines.

Post by steelwedge »

So with SNMP monitoring, Nagios will not use the "ps" command to check the services?
dwhitfield
Former Nagios Staff
Posts: 4583
Joined: Wed Sep 21, 2016 10:29 am
Location: NoLo, Minneapolis, MN

Re: High Load Issue In Hadoop Ubuntu Machines.

Post by dwhitfield »

No, SNMP-based checking will not run ps and parse the output. However, it is entirely possible that the SNMP daemon itself on the remote machine uses ps internally, but it is not possible for us to tell whether this is the case.

Please give SNMP a shot and let us know if the load issue is still present. Thanks!
steelwedge
Posts: 69
Joined: Fri Apr 24, 2015 4:58 am

Re: High Load Issue In Hadoop Ubuntu Machines.

Post by steelwedge »

Why does the NRPE agent run the ps command, and what does it do with the output?
Box293
Too Basu
Posts: 5126
Joined: Sun Feb 07, 2010 10:55 pm
Location: Deniliquin, Australia

Re: High Load Issue In Hadoop Ubuntu Machines.

Post by Box293 »

NRPE stands for "Nagios Remote Plugin Executor".

It allows you to execute plugins to check "stuff". The plugin does whatever it's supposed to and then returns the output and exit code back to NRPE and NRPE sends that back to Nagios.
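
To make the "output and exit code" contract concrete, here is a minimal sketch of what a plugin does, written in Python. It is not one of the stock Nagios plugins: the names `evaluate_load` and `main` and the thresholds 5.0 and 10.0 are assumptions for illustration. The exit codes 0, 1, 2 and 3 are the conventional OK, WARNING, CRITICAL and UNKNOWN states.

```python
import os
from typing import Tuple

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_load(load1: float, warn: float, crit: float) -> Tuple[int, str]:
    """Map a 1-minute load average to an exit code and a status line."""
    if load1 >= crit:
        code, label = CRITICAL, "CRITICAL"
    elif load1 >= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    # Text after the "|" is performance data: label=value;warn;crit
    message = f"LOAD {label} - load1={load1:.2f} | load1={load1:.2f};{warn};{crit}"
    return code, message


def main() -> int:
    """Print one status line and return the exit code for NRPE to relay."""
    try:
        load1, _, _ = os.getloadavg()
    except OSError:
        print("LOAD UNKNOWN - load average unavailable")
        return UNKNOWN
    code, message = evaluate_load(load1, warn=5.0, crit=10.0)  # assumed thresholds
    print(message)
    return code
```

A script used as a plugin would finish with `sys.exit(main())`, so that NRPE can pass both the printed line and the exit code back to Nagios.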

Whatever plugin you are using to monitor the services uses the ps command.

You will need to show us your service definition for the plugin that is causing your issue. Go into CCM, find the service, click the disk icon and paste the text here.
steelwedge
Posts: 69
Joined: Fri Apr 24, 2015 4:58 am

Re: High Load Issue In Hadoop Ubuntu Machines.

Post by steelwedge »

Please find attached the service configuration file for the machine swodc01hdfs05, where we frequently see high-load issues.
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: High Load Issue In Hadoop Ubuntu Machines.

Post by lmiltchev »

steelwedge wrote:
The problem is that, during that time, the NRPE agent executes scripts to monitor the Linux services. These scripts run the Linux "ps" command, which gets stuck and adds further load, and the machine becomes unresponsive. We then need to reboot the machine to bring it back to a normal state. Please suggest how we can mitigate this issue.

I can see the following commands referenced in your config: check_disk, check_cpu_stats, check_load, check_mem, check_init_service, check_open_files, check_procs, and check_users. Can you show us how they are defined on the client (remote machine)?

You will find their definitions in either the "/usr/local/nagios/etc/nrpe/common.cfg" or the "/usr/local/nagios/etc/nrpe.cfg" file.
steelwedge
Posts: 69
Joined: Fri Apr 24, 2015 4:58 am

Re: High Load Issue In Hadoop Ubuntu Machines.

Post by steelwedge »

Please find attached the requested file.
avandemore
Posts: 1597
Joined: Tue Sep 27, 2016 4:57 pm

Re: High Load Issue In Hadoop Ubuntu Machines.

Post by avandemore »

How often does this wedge occur? It seems unlikely that ps is the culprit, as it simply reads and prints data from the kernel. However, ps could have run into some process in an uninterruptible sleep state and not exited. Usually this comes from disk I/O, e.g. NFS or something else where an fsync can't complete properly.
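
One way to look for processes in uninterruptible sleep (state D) is to read /proc/<pid>/stat directly instead of invoking ps. The following is a rough sketch under that assumption; the function names `parse_stat` and `uninterruptible_processes` are made up for illustration, and it is not guaranteed that reading /proc never blocks on a badly wedged system.

```python
import os
from typing import List, Tuple


def parse_stat(stat_line: str) -> Tuple[int, str, str]:
    """Extract (pid, command name, state) from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the LAST ')' rather than on spaces.
    """
    head, _, tail = stat_line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    return int(pid_text), comm, tail.split()[0]


def uninterruptible_processes(proc_root: str = "/proc") -> List[Tuple[int, str]]:
    """Return (pid, command name) for every process currently in state D."""
    stuck = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return stuck  # no procfs available at this path
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited, or the line was unreadable
        if state == "D":
            stuck.append((pid, comm))
    return stuck


if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

Running this during a high-load period and again a few seconds later would show whether the same processes stay stuck in state D.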

What do the system logs look like after a reboot? Can you disable the NRPE checks and see if the hang still occurs? If you can't disable them all at once, bisecting the checks would at least narrow it down.

Can you run top -bcn1 during such a high-load period? Also, what happens if you send a hung ps process a SIGUSR1?
Previous Nagios employee