Invited Talk: Accelerating Content-Defined Chunking for Data Deduplication
Speaker: Sreeharsha Udayashankar

Time: 4:30-5:30 pm, Apr 3rd, 2025
Location: CS 2310 (online on Zoom)

Abstract: Data deduplication is used to conserve storage space and network bandwidth. Content-defined chunking (CDC) algorithms divide data into chunks and thereby dictate the space-saving efficiency of deduplication systems. However, modern CDC algorithms are slow because they are compute-intensive and must scan large amounts of data, which makes chunking one of the main bottlenecks in the deduplication pipeline. In this talk, I will present two solutions to accelerate content-defined chunking. The first, VectorCDC, uses AVX-friendly techniques to redesign and accelerate existing chunking algorithms. The second, SeqCDC, is a new vector-friendly algorithm that uses content-defined heuristics to selectively skip scanning data regions, improving throughput without significantly affecting space savings.
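
For readers new to the topic, the following is a minimal sketch of generic content-defined chunking, included only as background. It is not VectorCDC or SeqCDC; it assumes a simple Gear-style rolling hash, and the names and parameters (MIN_SIZE, MAX_SIZE, MASK_BITS, GEAR, chunk_ends, split) are hypothetical choices made for illustration. It shows why chunking is scan-heavy: the hash is updated once for nearly every input byte.

```python
import hashlib
import random

# Hypothetical parameters, chosen only for illustration.
MIN_SIZE = 64      # never cut a chunk shorter than this (except the last one)
MAX_SIZE = 1024    # force a cut if no boundary is found by this size
MASK_BITS = 8      # boundary when the top 8 hash bits are zero (about 1 in 256)

# Fixed pseudo-random table mapping each byte value to a 32-bit number.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk_ends(data: bytes) -> list[int]:
    """Return the end offset of every chunk in data."""
    ends = []
    start, n = 0, len(data)
    while start < n:
        end = min(start + MAX_SIZE, n)
        cut = end  # fall back to a forced cut at MAX_SIZE or end of data
        h = 0
        # Scan byte by byte; this loop is the cost that acceleration targets.
        for i in range(min(start + MIN_SIZE, end), end):
            h = ((h << 1) + GEAR[data[i]]) & 0xFFFFFFFF
            if h >> (32 - MASK_BITS) == 0:
                cut = i + 1  # content-defined boundary
                break
        ends.append(cut)
        start = cut
    return ends


def split(data: bytes) -> list[bytes]:
    """Split data into chunks at the offsets chosen by chunk_ends."""
    chunks, start = [], 0
    for end in chunk_ends(data):
        chunks.append(data[start:end])
        start = end
    return chunks


if __name__ == "__main__":
    original = bytes(random.Random(1).getrandbits(8) for _ in range(20000))
    edited = b"X" + original  # a one-byte insertion shifts every later byte
    seen = {hashlib.sha256(c).digest() for c in split(original)}
    new_chunks = split(edited)
    shared = sum(hashlib.sha256(c).digest() in seen for c in new_chunks)
    print(f"{shared} of {len(new_chunks)} chunks already stored")
```

Because boundaries depend on the bytes themselves rather than on fixed offsets, chunking typically resynchronizes after a small insertion, so most chunks of the edited data hash to values that are already stored. Fixed-size chunking would lose that property.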

Bio: Sreeharsha is a fourth-year PhD student at the University of Waterloo, advised by Prof. Samer Al-Kiswany. His research focuses on incorporating hardware acceleration into large-scale distributed systems.