algorithm - Using 10 MB of memory for four billion integers (finding the optimal block size)



The problem: given an input file with 4 billion integers, provide an algorithm that generates an integer not contained in the file, assuming you have 10 MB of memory.

I searched for solutions, and one of them is to store the integers in bit-vector blocks (each block representing a specific sub-range of the 4 billion possible integers, and each bit in a block representing one integer), and to use one counter per block to count how many input integers fall into that block. If a block's count is less than the block's capacity in integers, the block is missing at least one integer, so you scan that block's bit vector to find the missing integers.

My confusion with this solution is the claim that the optimal (smallest) memory footprint is achieved when the array of block counters occupies the same amount of memory as the bit vector. Why is that the optimal, smallest footprint?

Here are the calculation details I was referring to:

Let n = 2^32.

    counters (bytes):   blocks * 4
    bit vector (bytes): (n / blocks) / 8

    blocks * 4 = (n / blocks) / 8
    blocks^2   = n / 32
    blocks     = sqrt(n / 32) = sqrt(n/2) / 4
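For concreteness, a quick numeric check of these formulas (my own sketch, assuming n = 2^32 and 4-byte counters):

    import math

    n = 2 ** 32
    blocks = math.isqrt(n // 32)            # about 11585 blocks
    counter_bytes = 4 * blocks              # phase 1: about 45 KB
    bitvector_bytes = (n // blocks) // 8    # phase 2: about 45 KB
    print(blocks, counter_bytes, bitvector_bytes)

Both phases come out to roughly 45 KB, comfortably below the 10 MB budget.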

Thanks in advance, Lin

Why this is the smallest memory footprint:

In the proposed solution, there are two phases:

  1. Count the number of integers in each block.

    This uses 4 * (#blocks) bytes of memory.

  2. Use a bit vector, with each bit representing one integer in the chosen block.

    This uses blocksize / 8 bytes of memory, i.e. (n / blocks) / 8.

Setting the two equal results in blocks = sqrt(n / 32), as you mentioned.

This is optimal because the total memory required is the maximum of the memory required in each phase (both of which must be executed). Phase 1's cost grows with the number of blocks while phase 2's cost shrinks with it, so the maximum of the two is minimized exactly where they are equal. After the first phase you can forget the counters, except for the one block you will search in phase 2.
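To make the two phases concrete, here is a minimal Python sketch of the whole procedure (my own illustration; the names `find_missing` and `read_ints` are assumptions, and `read_ints()` must return a fresh iterator over the input integers because the input is scanned once per phase):

    import math
    from array import array

    def find_missing(read_ints, n=2 ** 32):
        # Assumes the input integers are distinct (see the caveat below).
        block_count = math.isqrt(n // 32)                  # balances the two phases
        block_size = (n + block_count - 1) // block_count

        # Phase 1: one 4-byte counter per block, about 4 * block_count bytes.
        counts = array('I', [0] * block_count)
        for x in read_ints():
            counts[x // block_size] += 1

        # Any block whose count is below its capacity is missing an integer.
        target = next(b for b, c in enumerate(counts) if c < block_size)
        lo = target * block_size

        # Phase 2: one bit per integer of the target block, about block_size / 8 bytes.
        seen = bytearray((block_size + 7) // 8)
        for x in read_ints():
            if lo <= x < lo + block_size:
                off = x - lo
                seen[off >> 3] |= 1 << (off & 7)

        for off in range(block_size):
            if not seen[off >> 3] & (1 << (off & 7)):
                return lo + off

The peak memory is max(4 * block_count, block_size / 8) bytes, which is why balancing the two terms minimizes it.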

Optimization

If a counter saturates (stops counting) once it reaches capacity, you don't need 4 bytes per counter, only 3. A counter reaches capacity when it would exceed the number of integers in a block; since you only care whether a block's count falls short of its capacity, the exact value beyond that point no longer matters.

In that case, phase 1 uses 3 * blocks bytes of memory, and phase 2 uses (n / blocks) / 8 bytes. Therefore the optimum is blocks = sqrt(n / 24). If n is 4 billion, the number of blocks is approximately 12910, and the block size is about 309838 integers per block, which fits in 3 bytes.
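A quick check of those numbers (a sketch under the same assumptions, taking n as exactly 4 billion):

    import math

    n = 4_000_000_000
    blocks = round(math.sqrt(n / 24))      # about 12910 blocks
    block_size = -(-n // blocks)           # about 309838 integers per block
    assert block_size < 2 ** 24            # a saturating count fits in 3 bytes
    print(3 * blocks, (block_size + 7) // 8)   # both roughly 38 KB

Both phases now need roughly 38 KB instead of about 45 KB.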

Caveats, and an alternative with better average-case performance

The algorithm proposed works only if the input integers are distinct. If the integers are not necessarily distinct, I suggest a randomized candidate-set approach instead: select, say, 1000 candidate integers at random, then scan the file and check which of them never appear. If every candidate is found in the input, try another random set of candidates. While this has poor worst-case performance, it is faster in the average case. For example, if the input covers 99% of all possible integers, then on average about 10 of the 1000 candidates will be missing. You can also select the candidate integers pseudo-randomly so that no candidate is ever repeated, which guarantees that after a fixed number of rounds you will have tested every possible integer.
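A minimal sketch of that randomized approach (my own illustration; the candidate count of 1000 is just the figure used above, and `read_ints()` is again assumed to return a fresh iterator over the input):

    import random

    def find_missing_randomized(read_ints, n=2 ** 32, k=1000, max_rounds=10):
        # Try up to max_rounds rounds of k random candidates; return a missing
        # integer, or None if every round failed.
        for _ in range(max_rounds):
            candidates = set(random.sample(range(n), k))
            for x in read_ints():
                candidates.discard(x)       # drop candidates that do appear
                if not candidates:
                    break                   # all candidates present; give up early
            if candidates:
                return candidates.pop()     # any survivor is absent from the input
        return None                         # caller can fall back to the block method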

If on each pass you check sqrt(n) candidate integers, the worst-case running time can be n * sqrt(n), because you might have to scan the n input integers sqrt(n) times.

You can avoid the worst-case time by combining the two: use this alternative first, and if the first set of candidate integers does not work, switch to the proposed block-counting solution. That can give better average-case performance (this is a common strategy in sorting, where quicksort is used first and the algorithm switches to, for example, heapsort if it appears that a worst-case input is present).

