The MapReduce programming model simplifies
large-scale data processing on commodity clusters by
exploiting parallel map tasks and reduce tasks.
Although many efforts have been made to improve
the performance of MapReduce jobs, they ignore the
network traffic generated in the shuffle phase, which
plays a critical role in performance enhancement.
Traditionally, a hash function is used to partition
intermediate data among reduce tasks; this approach,
however, is not traffic-efficient because it considers
neither the network topology nor the size of the data
associated with each key.
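For context, the default partitioner in many MapReduce implementations assigns a key to a reduce task purely by hashing it. The following minimal Python sketch (our own illustration, not the paper's scheme; all names and data are hypothetical) shows this traffic-oblivious behavior: keys are spread by hash value regardless of where the intermediate data resides or how large it is.

```python
from collections import defaultdict

def hash_partition(key, num_reducers):
    # Conventional partitioning: the reducer is chosen only by the key's hash,
    # ignoring network topology and the volume of data behind each key.
    return hash(key) % num_reducers

# Hypothetical intermediate data: (key, bytes of intermediate data for that key).
intermediate = [("apple", 900), ("banana", 10), ("cherry", 5), ("date", 870)]
num_reducers = 2

traffic_per_reducer = defaultdict(int)
for key, size in intermediate:
    traffic_per_reducer[hash_partition(key, num_reducers)] += size

# Heavy keys may all land on one reducer, far from where they were produced,
# so the shuffle traffic can be both unbalanced and topology-oblivious.
print(dict(traffic_per_reducer))
```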
to reduce network traffic cost for a MapReduce job
by designing a novel intermediate data partition
scheme. Furthermore, we jointly consider the
aggregator placement problem, where each
aggregator can reduce merged traffic from multiple
map tasks. A decomposition-based distributed
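To make the role of an aggregator concrete, the sketch below (our own schematic illustration in Python, not the paper's algorithm; all names are hypothetical) merges the intermediate pairs produced by several co-located map tasks before they are shuffled, so that only one combined record per key leaves the node instead of one per map task.

```python
from collections import defaultdict

def aggregate_map_outputs(map_outputs):
    # Merge (key, value) pairs emitted by several map tasks on the same node or
    # rack, so only one combined record per key crosses the network to a reducer.
    merged = defaultdict(int)
    for output in map_outputs:
        for key, value in output:
            merged[key] += value
    return merged

# Hypothetical outputs of three co-located map tasks (word-count style values).
map_outputs = [
    [("apple", 3), ("banana", 1)],
    [("apple", 2), ("cherry", 4)],
    [("banana", 5), ("apple", 1)],
]

before = sum(len(o) for o in map_outputs)        # records shuffled without aggregation
after = len(aggregate_map_outputs(map_outputs))  # records shuffled with aggregation
print(f"shuffle records: {before} without aggregation, {after} with aggregation")
```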
A decomposition-based distributed algorithm is
proposed to deal with the large-scale optimization
problem for big data applications, and an online
algorithm is also designed to adjust data partition
and aggregation in a dynamic manner.
Finally, extensive simulation results demonstrate that
our proposals can significantly reduce the network
traffic cost in both offline and online cases.