
1. Hadoop MapReduce – Sampling a dataset (50 points)


Imagine you’re working with a terabyte-scale dataset and you have a MapReduce application you want to test against it. Running your MapReduce application over the full dataset may take hours, so a workflow of constantly refining the code and rerunning the whole job isn’t practical.


To solve this problem you turn to sampling, a statistical methodology for extracting a representative subset of a population. In the context of MapReduce, sampling lets you develop and debug against a small slice of a large dataset without the overhead of waiting for the entire dataset to be read and processed.
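One simple way to realize this idea is Bernoulli sampling in the map phase: each input record is kept independently with some small probability, so the sample size is roughly `rate` times the input size. The sketch below assumes a Hadoop Streaming-style mapper (a plain script reading records on stdin); `sample_lines` is a hypothetical helper name, not part of any Hadoop API.

```python
import random
import sys

def sample_lines(lines, rate, rng=None):
    """Bernoulli sampling: keep each record independently with probability `rate`.

    `rate` of 0.01 yields roughly 1% of the input; the expected sample size
    scales linearly with the input, but the exact count varies run to run.
    """
    rng = rng or random.Random()
    return [line for line in lines if rng.random() < rate]

def main():
    # As a Hadoop Streaming mapper, this would read records from stdin and
    # emit about 1% of them; downstream the sampled records can feed the
    # real MapReduce application under test.
    for line in sample_lines(sys.stdin, rate=0.01):
        sys.stdout.write(line)
```

Run under Hadoop Streaming, such a script would be supplied via the `-mapper` option, with an identity reducer (or no reducer) so the output is just the sampled records. Note that Bernoulli sampling does not give an exact sample size; if an exact size is required, reservoir sampling is the usual alternative.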
