Recommendations for dedup of huge backup archives

Ahmed Kamal via illumos-zfs

2014-06-21 23:33:18 UTC

I would like to pick the brains on this list; the recommendations do not
necessarily have to be strictly zfs related. Basically, I would like to
cloud-archive huge backups (total size roughly 20TB uncompressed, 6TB
compressed) of 100 VMware VMs (generated by the vRanger software). The job
will run weekly, always generating full backups. My WAN upload bandwidth is
50Mbps. I would like to do this in a network-efficient manner, hopefully
finishing within the weekly backup window. To me this seemed like a
perfect zfs dedup workload at first, but then I started reading more and
understood how horribly difficult dedup really is.
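
To put the numbers above in perspective, here is a rough back-of-the-envelope
sketch of how long a full upload would take. It assumes decimal terabytes, a
fully saturated 50Mbps link and no protocol overhead; the function name is
made up for illustration.

```python
def upload_days(size_tb, link_mbps):
    """Days needed to push size_tb (decimal TB) over a link_mbps link."""
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_mbps * 1e6)
    return seconds / 86400

for label, tb in (("uncompressed", 20.0), ("compressed", 6.0)):
    print(f"{label}: {upload_days(tb, 50.0):.1f} days")
```

Under these assumptions even the 6TB compressed set needs about 11 days, so
shipping a full copy every week cannot fit in the window without dedup or
some kind of delta transfer.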

Solution-1:
==
What first hit me was to simply enable dedup on a volume with a 128k block
size. However, I now understand that any new file inserted in a VM would
cause the rest of the blocks to be shifted by a (most likely) non-multiple
of 128k, thus making me lose lots of potential dedup savings. I found online
references saying that typical dedup savings with this simple approach hover
around 30%-50%. I was looking for 10x after storing 10 copies of full
backups. Life is not good :/
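
The block-shift problem can be illustrated with a small simulation. This is
only a sketch of fixed-size block hashing on random data, not of how zfs
itself stores or checksums blocks; the helper names and the 8 MiB test file
are assumptions made for the example.

```python
import hashlib
import os

BLOCK = 128 * 1024  # fixed block size, matching the 128k case above

def split(data, block_size):
    """Cut data into consecutive fixed-size blocks (last one may be short)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def shared_fraction(old, new, block_size):
    """Fraction of new's blocks whose hash already exists among old's blocks."""
    known = {hashlib.sha256(b).digest() for b in split(old, block_size)}
    new_blocks = split(new, block_size)
    hits = sum(1 for b in new_blocks if hashlib.sha256(b).digest() in known)
    return hits / len(new_blocks)

old = os.urandom(64 * BLOCK)            # stand-in for last week's backup
where = 10 * BLOCK + 1000               # somewhere inside the 11th block

# Case 1: 100 bytes overwritten in place, nothing moves.
overwritten = old[:where] + b"x" * 100 + old[where + 100:]
# Case 2: 100 bytes inserted, everything after them shifts.
shifted = old[:where] + b"x" * 100 + old[where:]

print("overwrite:", shared_fraction(old, overwritten, BLOCK))
print("insert:   ", shared_fraction(old, shifted, BLOCK))
```

With an in-place overwrite only one block changes, while the insertion
leaves only the blocks before the insertion point deduplicable, since every
later block boundary now falls on different content.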

Solution-2:
==
Force a zfs block size of 4k. As the gurus here will quickly recognize,
this will require hundreds of GBs of RAM for the dedup table, which is
impractical for this case.
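
A rough estimate of where "hundreds of GBs" comes from, as a sketch only.
The figure of 320 bytes of RAM per dedup table entry is a commonly quoted
rule of thumb and is an assumption here, not a measured value; the estimate
also assumes the worst case where all 6TB of compressed data is unique.

```python
DDT_ENTRY_BYTES = 320  # assumption: rule-of-thumb size of one in-core DDT entry

def ddt_ram_gib(unique_bytes, block_size, entry_bytes=DDT_ENTRY_BYTES):
    """Approximate RAM (GiB) to hold one DDT entry per unique block."""
    entries = unique_bytes / block_size
    return entries * entry_bytes / 2**30

for bs in (4 * 1024, 128 * 1024):
    print(f"{bs // 1024:>3}k blocks: ~{ddt_ram_gib(6e12, bs):.0f} GiB")
```

Going from 128k to 4k blocks multiplies the number of entries, and hence the
table size, by 32.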

Solution-3:
==
rsync the new weekly full backup file onto the old one living on zfs. Rsync
uses a rolling checksum window, so it can even handle block shifts of less
than 4k (say, inserting a few bytes in the middle of a huge file!). After
rsync is done syncing the changed blocks (with the --inplace option) and I
take a zfs snapshot, I "hope" the snapshot will only contain the few blocks
rsync had to modify, not the whole file. Thoughts?
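
For readers unfamiliar with the rolling checksum idea, here is a minimal
sketch. It follows the weak checksum described in the rsync algorithm
write-up, but it is not rsync's actual code: the 1k block size, the use of
SHA-256 as the strong hash and all function names are assumptions made for
illustration.

```python
import hashlib
import os

M = 1 << 16

def weak_sum(block):
    """Weak checksum pair (a, b) of one window, computed from scratch."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window forward by one byte in O(1)."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b

def matched_blocks(old, new, n):
    """Count blocks of old that can be found at any byte offset in new."""
    index = {}
    for off in range(0, len(old) - n + 1, n):
        blk = old[off:off + n]
        index.setdefault(weak_sum(blk), set()).add(hashlib.sha256(blk).digest())
    if len(new) < n:
        return 0
    matches, pos = 0, 0
    a, b = weak_sum(new[:n])
    while True:
        strong = index.get((a, b))
        if strong and hashlib.sha256(new[pos:pos + n]).digest() in strong:
            matches += 1
            pos += n                      # skip past the matched block
            if pos + n > len(new):
                break
            a, b = weak_sum(new[pos:pos + n])
        else:
            if pos + n >= len(new):
                break
            a, b = roll(a, b, new[pos], new[pos + n], n)
            pos += 1                      # try the next byte offset
    return matches

old = os.urandom(16 * 1024)                     # 16 blocks of 1k
new = old[:5000] + b"hello" + old[5000:]        # 5 bytes inserted mid-file
print(matched_blocks(old, new, 1024), "of 16 old blocks found in new")
```

Only the block containing the insertion is lost; the blocks after it are
found again at their shifted offsets. Note that this only shows what would
not need to be transferred. It says nothing about which on-disk blocks get
rewritten at the destination, which is the open question above.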

Solution-4:
==
As my WAN upload speed is only 50Mbps, would it be OK to store the DDT on a
spinning disk (LUN)? I.e., if the slowed-down pool is still faster than my
WAN, then I guess it can be OK?
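
One way to frame this question is to ask how many blocks per second the WAN
can carry, and compare that with the random I/O rate of the disk holding the
DDT. The sketch below assumes one DDT lookup per block and that the pool
only needs to keep pace with the WAN; both are simplifications, and the
function name is made up.

```python
def blocks_per_second(link_mbps, block_size):
    """Blocks per second that fit through a link_mbps link."""
    return link_mbps * 1e6 / 8 / block_size

for bs in (128 * 1024, 4 * 1024):
    rate = blocks_per_second(50.0, bs)
    print(f"{bs // 1024:>3}k blocks: ~{rate:.0f} DDT lookups/s")
```

At 128k the rate is a few dozen lookups per second, while at 4k it is on the
order of 1500 per second. Whether a given spinning disk can sustain either
figure depends on its random read performance, which is not stated here.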

In any case, I'd be looking to snapshot the result and efficiently zfs
send it to an EC2 server. Hopefully I'll find a way to archive to S3 as
well.

Seems to me like there is no easy win. I appreciate any advice here.
Thanks!
