Linux Btrfs filesystem development
From: Gabriel de Perthuis <g2p.code@gmail.com>
To: linux-btrfs@vger.kernel.org
Subject: Re: Possible to deduplicate read-only snapshots for space-efficient backups
Date: Tue, 7 May 2013 23:35:06 +0000 (UTC)
Message-ID: <kmc33a$all$3@ger.gmane.org>
In-Reply-To: 9tdo5a-hde.ln1@hurikhan.ath.cx

On Wed, 08 May 2013 01:04:38 +0200, Kai Krakow wrote:
> Gabriel de Perthuis <g2p.code@gmail.com> wrote:
>> It sounds simple, and was sort-of prompted by the new syscall taking
>> short ranges, but it is tricky figuring out a sane heuristic (when to
>> hash, when to bail, when to submit without comparing, what should be the
>> source in the last case), and it's not something I have an immediate
>> need for.  It is also possible to use 9p (with standard cow and/or
>> small-file dedup) and trade a bit of configuration for much more
>> space-efficient VMs.
>> 
>> Finer-grained tracking of which ranges have changed, and maybe some
>> caching of range hashes, would be a good first step before doing any
>> crazy large-file heuristics.  The hash caching would actually benefit
>> all use cases.
> 
> Looking back to the good old peer-to-peer days (I think we all came across
> them one way or another), one name pops back into my mind:
> tiger-tree-hash...
>
> I'm not really into it, but would it be possible to use tiger-tree-hashes to
> find identical blocks? Even across differently sized files...

Possible, but bedup is all about doing as little I/O as it can get away
with: it does streaming reads only once sampling suggests the files are
likely duplicates, and it avoids spending a ton of disk space on indexing.
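
To make the sample-first idea concrete, here is a minimal sketch in Python.
It is not bedup's actual code: the function names, the 4 KiB sample size and
the choice of start/middle/end sample offsets are all assumptions made for
illustration. Files are grouped by size, then by a few small samples, and
only the surviving candidates get a full streaming comparison.

```python
import os
from collections import defaultdict

SAMPLE_SIZE = 4096  # assumed sample size, chosen for illustration


def sample(path, size, sample_size=SAMPLE_SIZE):
    """Read small samples at the start, middle and end of the file."""
    offsets = sorted({0,
                      max(0, size // 2 - sample_size // 2),
                      max(0, size - sample_size)})
    parts = []
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            parts.append(f.read(sample_size))
    return tuple(parts)


def same_content(a, b, chunk=1 << 20):
    """Full streaming comparison; only run on likely duplicates."""
    with open(a, "rb") as fa, open(b, "rb") as fb:
        while True:
            ca, cb = fa.read(chunk), fb.read(chunk)
            if ca != cb:
                return False
            if not ca:
                return True


def find_duplicates(paths):
    """Return sorted groups of paths whose contents are identical."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    groups = []
    for size, same_size in by_size.items():
        if size == 0 or len(same_size) < 2:
            continue  # nothing to compare, no I/O spent
        by_sample = defaultdict(list)
        for p in same_size:
            by_sample[sample(p, size)].append(p)
        for candidates in by_sample.values():
            # Cluster candidates by full comparison against one
            # representative of each cluster found so far.
            clusters = []
            for p in candidates:
                for c in clusters:
                    if same_content(c[0], p):
                        c.append(p)
                        break
                else:
                    clusters.append([p])
            groups.extend(sorted(c) for c in clusters if len(c) > 1)
    return sorted(groups)
```

Nothing is kept on disk here; the only state is the in-memory grouping for
the current run, and files with a unique size are never opened at all.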

Hashing everything in the hope that there are identical blocks at
unrelated places on the disk is a much more resource-intensive approach;
Liu Bo is working on that, following ZFS's design choices.
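
For contrast, the hash-everything approach can be sketched like this. This
is only an illustration of the general technique, not Liu Bo's patches and
not ZFS's implementation; the 4 KiB block size, SHA-256, and the function
names are assumptions.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 4096  # assumed block size, chosen for illustration


def index_blocks(paths, block_size=BLOCK_SIZE):
    """Hash every full block of every file: digest -> [(path, offset)]."""
    index = defaultdict(list)
    for path in paths:
        with open(path, "rb") as f:
            offset = 0
            while True:
                block = f.read(block_size)
                if not block:
                    break
                if len(block) == block_size:  # skip a short tail block
                    digest = hashlib.sha256(block).digest()
                    index[digest].append((path, offset))
                offset += len(block)
    return index


def duplicate_blocks(index):
    """Locations sharing a digest are candidates for deduplication."""
    return [locs for locs in index.values() if len(locs) > 1]
```

This does find identical blocks at unrelated offsets and across files of
different sizes, but every byte has to be read and every block needs an
index entry. With these assumed parameters, 1 TiB of data means 2^28 blocks,
so the 32-byte digests alone come to 8 GiB.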



