Dedup: A Fast Stream Deduplicator package

I have just released a package that lets you deduplicate streams, removing duplicate data across gigabytes and even terabytes of input at speeds above 1 GB/s.

For an introduction, I have written a blog post called Fast Stream Deduplication in Go.

If you just want to see the package, you can go straight to the GitHub project page.

It is currently in a “stable” state, but it still lacks some corner-case tests, which I will be adding in the coming days. It is, however, not extensively battle-tested yet, so if you are interested in helping with that, please let me know.

I do not plan to modify the current API or file format unless there is a fundamental problem I cannot fix without doing so.

Comments, feedback, questions, suggestions, problems are all very welcome.


Sorry if it is in the blog, but I couldn’t find it: how are the dynamic block boundaries determined? A rolling hash like Adler32 (bup, Camlistore), or some other method?

It is a local rolling hash based on the predictability of the last 32 bytes. It is copied from ZPAQ, written by compression expert Matt Mahoney.

It is rather simple to execute, but the way it works is rather subtle. I asked Matt about it, and I sort of understand it. You can of course also read his explanation.

My “layman” understanding: for each incoming byte, we check whether it follows the same byte as the last time that byte was encountered. Depending on whether that prediction matches, one of two different constants is multiplied onto the existing hash plus the current byte value. If the hash rolls over and ends up below a threshold, which determines the average block size, a new block is started.

Therefore similar content will converge rather quickly to the same split points, and thereby produce identical blocks.

Edit: Oh yes, I forgot to mention that I adjusted the maximum block size to be 4x the approximate block size, which makes it easier to control. I believe in zpaq it is 16x the average size.
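To make the splitting rule a bit more concrete, here is a rough, self-contained Go sketch of the idea: an order-1 prediction table, two different hash multipliers depending on whether the prediction was correct, a threshold that sets the average block size, and a hard cap at 4x the average. The constants and threshold here are chosen for illustration and are not necessarily the exact values the dedup package uses.

```go
// Sketch of a ZPAQ-style prediction-based chunker. Constants, threshold and
// reset behaviour are illustrative assumptions, not the package's real values.
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

const (
	avgBlockSize = 64 << 10         // assumed target (average) block size
	maxBlockSize = 4 * avgBlockSize // hard cap at 4x the average, as mentioned above
	// Splitting when the 32-bit hash drops below 2^32/avgBlockSize gives
	// roughly one split per avgBlockSize bytes on random input.
	threshold = (1 << 32) / avgBlockSize
)

// chunk reads r and returns the length of each content-defined block.
func chunk(r io.Reader) ([]int, error) {
	br := bufio.NewReader(r)

	var (
		o1   [256]byte // last byte seen after each byte value (order-1 prediction)
		h    uint32    // rolling hash
		prev byte      // previous byte
		n    int       // bytes in the current block
		out  []int
	)

	for {
		c, err := br.ReadByte()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}

		// If the current byte matches the prediction (it followed the same
		// byte last time), mix it in with one multiplier; otherwise use a
		// different one. Similar content therefore tends to converge to the
		// same hash values, and so to the same split points.
		if c == o1[prev] {
			h = (h + uint32(c) + 1) * 314159265
		} else {
			h = (h + uint32(c) + 1) * 271828182
		}
		o1[prev] = c
		prev = c
		n++

		// Split when the hash falls below the threshold (content-defined
		// boundary) or the block reaches the maximum size.
		if h < threshold || n >= maxBlockSize {
			out = append(out, n)
			h, n, prev = 0, 0, 0
		}
	}
	if n > 0 {
		out = append(out, n)
	}
	return out, nil
}

func main() {
	data := strings.Repeat("some repetitive example data. ", 10000)
	sizes, err := chunk(strings.NewReader(data))
	if err != nil {
		panic(err)
	}
	fmt.Println("blocks:", len(sizes))
}
```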
