The document proposes a novel approach to data deduplication across distributed worker nodes in an in-memory big data analytics system. The approach aims to minimize both the total storage space required and the amount of data shuffled between nodes. It first shows that finding an optimal solution is NP-hard, then presents file partitioning algorithms that incrementally find efficient solutions in polynomial time. Experimental results confirm that the partitioning approach achieves compression ratios close to those of a centralized optimal solution.
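
The summary does not spell out the partitioning algorithm itself, so the following is only a minimal sketch of the general idea of incremental, deduplication-aware file placement: each file is assigned to the worker node whose stored chunks overlap it most, so duplicate chunks are stored once and less data needs to move between nodes. The names `Node`, `chunk_fingerprints`, and `place_file`, and the fixed-size chunking, are illustrative assumptions, not the document's actual method.

```python
# Illustrative sketch (not the paper's algorithm): greedily place each file on the
# node that already holds the most of its chunks, reducing duplicate storage and
# later cross-node shuffling. All names and the chunking scheme are hypothetical.

import hashlib
from typing import List, Set

CHUNK_SIZE = 4096  # fixed-size chunking for simplicity


def chunk_fingerprints(data: bytes) -> Set[str]:
    """Split the file into fixed-size chunks and fingerprint each chunk."""
    return {
        hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
        for i in range(0, len(data), CHUNK_SIZE)
    }


class Node:
    def __init__(self, name: str):
        self.name = name
        self.chunks: Set[str] = set()  # fingerprints already stored on this node

    def stored_bytes(self) -> int:
        return len(self.chunks) * CHUNK_SIZE


def place_file(data: bytes, nodes: List[Node]) -> Node:
    """Greedy incremental placement: pick the node that already holds the most of
    this file's chunks (maximal dedup), breaking ties toward the least-loaded node."""
    fps = chunk_fingerprints(data)
    best = max(nodes, key=lambda n: (len(fps & n.chunks), -n.stored_bytes()))
    best.chunks |= fps  # only the new (non-duplicate) chunks add storage
    return best


if __name__ == "__main__":
    nodes = [Node("worker-1"), Node("worker-2")]
    files = [b"A" * 8192, b"A" * 4096 + b"B" * 4096, b"C" * 8192]
    for i, f in enumerate(files):
        chosen = place_file(f, nodes)
        print(f"file {i} -> {chosen.name}, node now stores {chosen.stored_bytes()} bytes")
```

In this toy run the second file lands on the same node as the first because they share chunks, so only the new chunks consume space; a greedy rule of this kind runs in polynomial time per file, which is in the spirit of the incremental algorithms the document describes, though the actual partitioning criteria are not given here.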