linux-btrfs.vger.kernel.org archive mirror
From: Jeff Mahoney <jeffm@suse.com>
To: "Swâmi Petaramesh" <swami@petaramesh.org>, linux-btrfs@vger.kernel.org
Subject: Re: duperemove : some real world figures on BTRFS deduplication
Date: Thu, 8 Dec 2016 15:07:09 -0500
Message-ID: <930fe4c7-b936-8f2d-fc4c-cc5574c27f19@suse.com>
In-Reply-To: <81bcff57-4bee-18d5-cac4-3359150730a5@petaramesh.org>



On 12/8/16 10:11 AM, Swâmi Petaramesh wrote:
> Hi, Some real world figures about running duperemove deduplication on
> BTRFS :
> 
> I have an external 2,5", 5400 RPM, 1 TB HD, USB3, on which I store the
> BTRFS backups (full rsync) of 5 PCs, using 2 different distros,
> typically at the same update level, and all of them more or less sharing
> the entirety or part of the same set of user files.
> 
> For each of these PCs I keep a series of 4-5 BTRFS subvolume snapshots
> for having complete backups at different points in time.
> 
> The HD was full to 93% and made a good testbed for deduplicating.
> 
> So I ran duperemove on this HD, on a machine doing "only this", using a
> hashfile. The machine being an Intel i5 with 6 GB of RAM.
> 
> Well, the damn thing has been running for 15 days uninterrupted !
> ...Until I [Ctrl]-C it this morning as I had to move with the machine (I
> wasn't expecting it to last THAT long...).
> 
> It took about 48 hours just for calculating the files hashes.
> 
> Then it took another 48 hours just for "loading the hashes of duplicate
> extents".
> 
> Then it took 11 days deduplicating until I killed it.
> 
> At the end, the disk that was 93% full is now 76% full, so I saved 17%
> of 1 TB (170 GB) by deduplicating for 15 days.
> 
> Well the thing "works" and my disk isn't full anymore, so that's a very
> partial success, but still I wonder if the gain is worth the effort...

What version were you using?  I know Mark had put a bunch of effort into
reducing the memory footprint and runtime.  The earlier versions were
"can we get this thing working" while the newer versions are more efficient.

What throughput are you getting to that disk?  I get that it's USB3, but
reading 1TB doesn't take a terribly long time so 15 days is pretty
ridiculous.
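
For a rough sense of scale, here is a back-of-the-envelope sketch in Python
of the read rate implied by the quoted figures. The assumptions are mine and
not from the thread: 1 TB is taken as 10**12 bytes, the 48-hour hashing pass
is assumed to have read roughly the 93% of the disk that was in use exactly
once, and the 100 MB/s comparison rate is a hypothetical sequential figure,
not a measurement of this drive. If extents shared between snapshots were
read more than once, the real logical read rate would be higher than this
estimate.

```python
# Back-of-the-envelope throughput implied by the figures quoted above.
# Assumptions (not from the thread): 1 TB = 10**12 bytes, the hashing pass
# read about the 93% of the disk that was in use, and each byte was read once.

DISK_BYTES = 10**12      # nominal 1 TB drive
USED_FRACTION = 0.93     # "full to 93%" before deduplication
HASHING_HOURS = 48       # "about 48 hours just for calculating the files hashes"


def mb_per_second(num_bytes, hours):
    """Average rate in MB/s (10**6 bytes) for num_bytes moved in `hours`."""
    return num_bytes / (hours * 3600) / 10**6


def hours_to_read(num_bytes, rate_mb_per_s):
    """Hours needed to read num_bytes at a steady rate in MB/s."""
    return num_bytes / (rate_mb_per_s * 10**6) / 3600


used_bytes = DISK_BYTES * USED_FRACTION
implied_rate = mb_per_second(used_bytes, HASHING_HOURS)
print(f"implied read rate during hashing: {implied_rate:.1f} MB/s")

# Hypothetical sequential rate, chosen only for comparison.
ASSUMED_SEQ_RATE_MB_S = 100
full_read_hours = hours_to_read(DISK_BYTES, ASSUMED_SEQ_RATE_MB_S)
print(f"full 1 TB read at {ASSUMED_SEQ_RATE_MB_S} MB/s: {full_read_hours:.1f} hours")
```

Under those assumptions the hashing pass averaged only a few MB/s, while a
single sequential pass over the whole disk would be a matter of hours.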

At any rate, the good news is that when you run it again, assuming you
used the hash file, it will not have to rescan most of your data set.

-Jeff

-- 
Jeff Mahoney
SUSE Labs

