algorithm - how can I use multithreading to compute the intersection on a very large dataset -
i have file composed of 4 millions sets. every set contains 1 n words. size of file 120 mb.
set1 = {w11, w12,...,w1i} set2 = {w21, w22,...,w2j} ... setm = {wm1, wm2,...,wmk}
i want compute intersection between sets.
set 1 ∩ {set1,...,setm} set 2 ∩ {set1,...,setm} ... set m ∩ {set1,...,setm}
every operation takes around 1.2 seconds. did following:
- divide 4 million sets 6 chunks. every chunk containing 666666 sets
then following. in here i'll creating 36 threads , i'll computing intersection between chuncks. slow , complicated problem.
vector<thread> threads; for(int = 0; i< chunk.size();i++) { for(int j = 0; j < chunk.size();j++) { threads.push_back(thread(&transform::call_intersection, this, ref(chunk[i]),ref(tmp[j]), chunk(results))); } } for(auto &t : threads){ t.join(); }
do have idea on how divide problem sub-problems , join of them in end. way in linux too?
sample
the first column represents id of set , rest of columns represents words.
m.06fl3b|hadji|barbarella catton|haji catton|haji cat|haji m.06flgy|estadio neza 86 m.06fm8g|emd gp39dc m.0md41|pavees|barbarella catton m.06fmg|round m.01012g|hadji|fannin county windom town|windom m.0101b|affray
example
m.06fl3b has intersection m.01012g , m.0md41. output file follows:
m.06fl3b m.01012g m.0md41 m.06flgy m.06fm8g m.0md41 m.06fl3b m.06fmg m.01012g m.06fl3b m.0101b
set intersection associative , therefore amenable parallel folding (which 1 of many use cases of mapreduce). each pair of sets ((1, 2), (3, 4), ...), can compute intersection of each pair, , put results new collection of sets, have half size. repeat until you're left 1 set. total number of intersection operations equal number of sets minus one.
launching millions of threads going bog down machine, however, want use thread pool: make number of threads close amount of cpu cores have available, , create list of tasks, each task 2 sets intersected. each thread repeatedly checks task list , grabs first available task (make sure access task list in thread-safe manner).
Comments
Post a Comment