1. Hadoop MapReduce – Sampling a dataset. 50 points
Imagine you’re working with a terabyte-scale dataset and you have a MapReduce
application you want to test with that dataset. Running your MapReduce application against
the dataset may take hours, and constantly iterating with code refinements and rerunning
against it isn’t an optimal workflow.
To solve this problem, you can turn to sampling, a statistical technique for extracting
a representative subset of a population. In the context of MapReduce, sampling lets you
work with large datasets without having to wait for the entire dataset to be read and
processed.
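
One way to picture the idea is reservoir sampling over a bounded prefix of the input. The sketch below is only an illustration in plain Python, not part of the assignment or of Hadoop's API: the function name `reservoir_sample` and the parameters `k`, `max_records`, and `seed` are hypothetical, and the choice of reservoir sampling (Algorithm R) is an assumption about which sampling method fits.

```python
import itertools
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    records: Iterable[T],
    k: int,
    max_records: Optional[int] = None,
    seed: Optional[int] = None,
) -> List[T]:
    """Return up to k records chosen uniformly at random from the stream.

    If max_records is given, only that many leading records are read,
    so the rest of the dataset is never touched (hypothetical parameter).
    """
    rng = random.Random(seed)
    if max_records is not None:
        # Stop reading early instead of scanning the whole dataset.
        records = itertools.islice(records, max_records)

    reservoir: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


if __name__ == "__main__":
    # Sample 10 records while reading only the first 100,000.
    sample = reservoir_sample(range(1_000_000), k=10, max_records=100_000, seed=42)
    print(sample)
```

Capping the number of records read trades statistical quality for speed: the sample is uniform over the prefix that was read, not over the full dataset, which is only acceptable if the leading records resemble the rest.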