Subject: Re: duperemove : some real world figures on BTRFS deduplication
To: Jeff Mahoney, Swâmi Petaramesh, linux-btrfs@vger.kernel.org
References: <81bcff57-4bee-18d5-cac4-3359150730a5@petaramesh.org> <8da6fc44-2756-0bfd-7d7c-e11529f54f74@gmail.com> <808d0394-db2a-0a15-d084-309accc04ff9@suse.com>
In-Reply-To: <808d0394-db2a-0a15-d084-309accc04ff9@suse.com>
From: "Austin S. Hemmelgarn"
Date: Thu, 8 Dec 2016 15:46:55 -0500

On 2016-12-08 15:07, Jeff Mahoney wrote:
> On 12/8/16 10:42 AM, Austin S. Hemmelgarn wrote:
>> On 2016-12-08 10:11, Swâmi Petaramesh wrote:
>>> Hi, some real-world figures about running duperemove deduplication on
>>> BTRFS:
>>>
>>> I have an external 2.5", 5400 RPM, 1 TB USB3 HD on which I store the
>>> BTRFS backups (full rsync) of 5 PCs, running 2 different distros,
>>> typically at the same update level, and all of them more or less
>>> sharing all or part of the same set of user files.
>>>
>>> For each of these PCs I keep a series of 4-5 BTRFS subvolume snapshots,
>>> so I have complete backups at different points in time.
>>>
>>> The HD was 93% full and made a good testbed for deduplicating.
>>>
>>> So I ran duperemove on this HD, on a machine doing "only this", using a
>>> hashfile. The machine is an Intel i5 with 6 GB of RAM.
>>>
>>> Well, the damn thing ran for 15 days uninterrupted!
>>> ...Until I [Ctrl]-C'd it this morning because I had to move the machine
>>> (I wasn't expecting it to last THAT long...).
>>>
>>> It took about 48 hours just to calculate the file hashes.
>>>
>>> Then it took another 48 hours just for "loading the hashes of duplicate
>>> extents".
>>>
>>> Then it spent 11 days deduplicating until I killed it.
>>>
>>> In the end, the disk that was 93% full is now 76% full, so I saved 17%
>>> of 1 TB (170 GB) by deduplicating for 15 days.
>>>
>>> Well, the thing "works" and my disk isn't full anymore, so that's a
>>> partial success, but I still wonder whether the gain is worth the
>>> effort...
>> So, some general explanation here:
>> Duperemove hashes data in blocks of (by default) 128kB, which means for
>> ~930GB you've got about 7618560 blocks to hash, which partly explains
>> why it took so long to hash. Once that's done, it then has to compare
>> hashes for all combinations of those blocks, which totals
>> 58042456473600 comparisons (hence that taking a long time). The block
>> size thus becomes a trade-off between hashing performance and actual
>> space savings (a smaller block size makes hashing take longer, but
>> gives slightly better overall deduplication results).
>
> IIRC, the core of the duperemove duplicate matcher isn't an O(n^2)
> algorithm. I think Mark used a bloom filter to reduce the data set
> prior to matching, but I haven't looked at the code in a while.
>
You're right, I had completely forgotten about that. Regardless, it's still a lot of processing that needs to be done.
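
For anyone curious why the matching step doesn't have to be an all-pairs
comparison, here's a minimal sketch in Python. This is not duperemove's
code; the 128 KiB block size matches its default, but the SHA-256 hash,
the directory walk, and the in-memory dict are just assumptions for
illustration. The point is only that if you bucket blocks by their hash
first, only buckets with more than one entry need any further work, so
the cost is roughly one pass over the data rather than n^2 comparisons.

#!/usr/bin/env python3
# Illustrative sketch only -- NOT duperemove's implementation.
# Groups block hashes first, so only hashes seen more than once are
# candidates for deduplication; no block-vs-block pairwise comparison
# across the whole data set is needed.
import hashlib
import os
import sys
from collections import defaultdict

BLOCK_SIZE = 128 * 1024  # duperemove's default block size (128 KiB)

def block_hashes(path):
    """Yield (offset, digest) for each 128 KiB block of a file."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            yield offset, hashlib.sha256(block).digest()
            offset += len(block)

def find_duplicate_candidates(root):
    """Map block hash -> list of (file, offset) occurrences, keeping
    only hashes that occur at least twice."""
    seen = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path) or os.path.islink(path):
                continue
            try:
                for offset, digest in block_hashes(path):
                    seen[digest].append((path, offset))
            except OSError:
                continue  # unreadable file; skip it
    return {h: locs for h, locs in seen.items() if len(locs) > 1}

if __name__ == "__main__":
    dups = find_duplicate_candidates(sys.argv[1] if len(sys.argv) > 1 else ".")
    shared_blocks = sum(len(v) - 1 for v in dups.values())
    print(f"{len(dups)} duplicated block hashes, "
          f"{shared_blocks * BLOCK_SIZE / 2**20:.1f} MiB potentially reclaimable")

A bloom filter, as Jeff mentions, serves the same pruning purpose with
far less memory than the dict above: a first pass only remembers
"probably seen before", and only the hashes that trip the filter need to
be stored exactly for matching.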