mapreduce - How Hadoop decides how many nodes will run map and reduce tasks


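I'm new to Hadoop and I'm trying to understand it. I'm talking about Hadoop 2. When I have an input file that I want to run MapReduce on, the MapReduce program has a split parameter, and it creates as many map tasks as there are splits, right?

The resource manager knows where the files are and sends the tasks to the nodes that have the data, but who decides how many nodes will run the tasks? After the maps are done there is the shuffle, and the node that runs each reduce task is decided by the partitioner, which hashes the map output keys, right? How many nodes will run reduce tasks? Will the nodes that ran the maps also run reduce tasks?

Thank you.

TL;DR: If I have a cluster and I run a MapReduce job, how does Hadoop decide how many nodes will run map tasks, and which nodes will run the reduce tasks?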

How many maps?

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.

The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been set as high as 300 maps for very CPU-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.

If you have 10 TB of input data and a block size of 128 MB, you'll end up with about 82,000 maps (10 TB / 128 MB = 81,920), unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.
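The 82,000 figure above is just the input size divided by the block size, rounded up. Here is a minimal sketch of that arithmetic in plain Java. The class and method names (MapCountEstimate, estimateMaps) are made up for illustration and are not part of the Hadoop API, and the sketch assumes one map per block and binary units (1 TB = 1024^4 bytes). Real jobs compute splits per input file, so the actual count can differ.

```java
// Illustrative only: not a Hadoop class.
class MapCountEstimate {

    // Number of blocks needed to hold totalBytes, rounding up (one map per block assumed).
    static long estimateMaps(long totalBytes, long blockBytes) {
        return (totalBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long tenTb = 10L * 1024 * 1024 * 1024 * 1024; // 10 TB of input
        long blockSize = 128L * 1024 * 1024;          // 128 MB block size
        System.out.println(estimateMaps(tenTb, blockSize)); // prints 81920
    }
}
```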

How many reduces?

The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).

With 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75, the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing.

Increasing the number of reduces increases the framework overhead, but it also improves load balancing and lowers the cost of failures.
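To make the 0.95 and 1.75 rule of thumb concrete, here is a small sketch in plain Java. The names (ReduceCountEstimate, suggestedReduces) and the cluster size of 10 nodes with 8 containers each are assumptions chosen for the example, not values or APIs taken from Hadoop.

```java
// Illustrative only: not a Hadoop class.
class ReduceCountEstimate {

    // factor * (nodes * max containers per node), rounded to the nearest integer.
    static int suggestedReduces(double factor, int nodes, int maxContainersPerNode) {
        int totalContainers = nodes * maxContainersPerNode;
        return (int) Math.round(factor * totalContainers);
    }

    public static void main(String[] args) {
        int nodes = 10;             // hypothetical cluster size
        int containersPerNode = 8;  // hypothetical per-node container limit
        // A single wave: every reduce can start as soon as the job does.
        System.out.println(suggestedReduces(0.95, nodes, containersPerNode)); // prints 76
        // Two waves: faster nodes pick up a second round of reduces.
        System.out.println(suggestedReduces(1.75, nodes, containersPerNode)); // prints 140
    }
}
```

With 0.95 the count stays just under the 80 available containers, which is why all reduces can start at once. With 1.75 there are more reduces than containers, so some nodes run a second round.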

Reducer NONE

It is legal to set the number of reduce tasks to zero if no reduction is desired.

Which nodes run reduce tasks?

You can configure the number of mappers and the number of reducers per node with configuration parameters such as mapreduce.tasktracker.reduce.tasks.maximum.

If you set this parameter to zero, that node won't be considered for reduce tasks. Otherwise, all nodes in the cluster are eligible for reduce tasks.

Source: the MapReduce Tutorial from Apache.

Note: For a given job, you can set mapreduce.job.maps and mapreduce.job.reduces, but they may not be effective. You should leave it to the MapReduce framework to decide on the number of map and reduce tasks.

EDIT:

How is the reducer node decided?

Assume that you have equal reduce slots available on two nodes, N1 and N2, and that the current load on N1 is greater than on N2. Then the reduce task will be assigned to N2. If both the load and the number of slots are the same, whichever node sends the first heartbeat to the resource manager will get the task. This is the code block for reduce assignment: http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/mapred/jobqueuetaskscheduler.java#207
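The rule described above can be sketched as a simple selection over candidate nodes. This is a simplified model written for illustration, not the scheduler's actual code: the ReduceNodeChoice class, its Node type, and the field names are all hypothetical, and the real scheduler hands out tasks as heartbeats arrive rather than comparing a list of nodes at once.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: a toy model of the rule, not Hadoop's scheduler.
class ReduceNodeChoice {

    // Hypothetical view of a worker node.
    static final class Node {
        final String name;
        final int freeReduceSlots;   // 0 means the node takes no reduce tasks
        final int runningTasks;      // stand-in for "current load"
        final long heartbeatMillis;  // arrival time of the node's latest heartbeat

        Node(String name, int freeReduceSlots, int runningTasks, long heartbeatMillis) {
            this.name = name;
            this.freeReduceSlots = freeReduceSlots;
            this.runningTasks = runningTasks;
            this.heartbeatMillis = heartbeatMillis;
        }
    }

    // Picks the least loaded node with a free slot; ties go to the earliest heartbeat.
    static Node pick(List<Node> nodes) {
        Node best = null;
        for (Node n : nodes) {
            if (n.freeReduceSlots <= 0) {
                continue; // not eligible for reduce tasks
            }
            boolean lessLoaded = best != null && n.runningTasks < best.runningTasks;
            boolean tieEarlier = best != null && n.runningTasks == best.runningTasks
                    && n.heartbeatMillis < best.heartbeatMillis;
            if (best == null || lessLoaded || tieEarlier) {
                best = n;
            }
        }
        return best; // null if no node has a free reduce slot
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.asList(
                new Node("N1", 2, 5, 100L),  // more loaded
                new Node("N2", 2, 3, 200L)); // less loaded
        System.out.println(pick(nodes).name); // prints N2
    }
}
```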

