algorithm - how can I use multithreading to compute the intersection on a very large dataset -


i have file composed of 4 millions sets. every set contains 1 n words. size of file 120 mb.

set1 = {w11, w12,...,w1i} set2 = {w21, w22,...,w2j} ... setm = {wm1, wm2,...,wmk} 

i want compute intersection between sets.

set 1 ∩ {set1,...,setm} set 2 ∩ {set1,...,setm} ... set m ∩ {set1,...,setm} 

every operation takes around 1.2 seconds. did following:

  • divide 4 million sets 6 chunks. every chunk containing 666666 sets

then following. in here i'll creating 36 threads , i'll computing intersection between chuncks. slow , complicated problem.

 vector<thread> threads;  for(int = 0; i< chunk.size();i++)   {     for(int j = 0; j < chunk.size();j++)     {        threads.push_back(thread(&transform::call_intersection, this, ref(chunk[i]),ref(tmp[j]), chunk(results)));     }   }  for(auto &t : threads){ t.join(); } 

do have idea on how divide problem sub-problems , join of them in end. way in linux too?


sample

the first column represents id of set , rest of columns represents words.

m.06fl3b|hadji|barbarella catton|haji catton|haji cat|haji m.06flgy|estadio neza 86 m.06fm8g|emd gp39dc m.0md41|pavees|barbarella catton m.06fmg|round m.01012g|hadji|fannin county windom town|windom m.0101b|affray 

example

m.06fl3b has intersection m.01012g , m.0md41. output file follows:

 m.06fl3b m.01012g m.0md41  m.06flgy  m.06fm8g  m.0md41 m.06fl3b  m.06fmg  m.01012g  m.06fl3b  m.0101b 

set intersection associative , therefore amenable parallel folding (which 1 of many use cases of mapreduce). each pair of sets ((1, 2), (3, 4), ...), can compute intersection of each pair, , put results new collection of sets, have half size. repeat until you're left 1 set. total number of intersection operations equal number of sets minus one.

launching millions of threads going bog down machine, however, want use thread pool: make number of threads close amount of cpu cores have available, , create list of tasks, each task 2 sets intersected. each thread repeatedly checks task list , grabs first available task (make sure access task list in thread-safe manner).


Comments

Popular posts from this blog

How to show in django cms breadcrumbs full path? -

php - Invalid Cofiguration - yii\base\InvalidConfigException - Yii2 -

ruby on rails - npm error: tunneling socket could not be established, cause=connect ETIMEDOUT -