Re: btrfs dedup - available or experimental? Or yet to be?

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Learner Study <learner.study@gmail.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: btrfs dedup - available or experimental? Or yet to be?
Date: Wed, 13 May 2015 17:08:04 -0400	[thread overview]
Message-ID: <20150513210804.GA12262@hungrycats.org> (raw)
In-Reply-To: <CAP8+hKXZ5p80E_9dmVyzkdFGgc6NMSj2MDyp3DZ-dVi2RQ=EyA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2506 bytes --]

On Wed, May 13, 2015 at 09:23:25AM -0700, Learner Study wrote:
> I have been reading on de-duplication and how algorithms such as Bloom
> and Cuckoo filters are used for this purpose.
> 
> Does BTRFS dedup use any of these, or are there plans to incorporate
> these in future?

btrfs dedup currently lives in user space, and there are multiple
dedup userspace projects in development.

Administrators can choose the dedup tool, and can use multiple dedup tools
on the same filesystem.  This is particularly handy if you know something
about your data that a naive hashing algorithm might not (e.g. you have
two large trees derived from a common base, so you can use a much more
efficient algorithm than you would if you knew nothing about the data).

The basic kernel interface for dedup is the extent-same ioctl.
A userspace program creates a list of (fd, length, offset) tuples
referencing identical content and passes them to the kernel.  The kernel
locks the file contents, compares them, and replaces identical data
copies with references to a single extent in an atomic operation.
The kernel also provides interfaces to efficiently discover recently
modified extents in a filesystem, enabling deduplicators to follow new
data without the need to block writes.

Most btrfs deduplicators are based on a block-level hash table built by
scanning files, but every other aspect of the tools (e.g. the mechanism by
which files are discovered, block sizes, scalability, use of prefiltering
algorithms such as Bloom filters, whether the hash table is persistent
or ephemeral, etc) is different from one tool to another, and changing
over time as the tools are developed.

Because the kernel interface implies a read of both copies of duplicate
data, it is not necessary to use a collision-free hash.  Optimizing the
number of bits in the hash function for the size of the filesystem and
exploiting the statistical tendency for identical blocks to be adjacent
to other identical blocks in files enables considerable space efficiency
in the hash table--possibly so much that the Bloom/Cuckoo-style
pre-filtering benefit becomes irrelevant.  I'm not aware of a released
btrfs deduplicator that currently exploits these optimizations.

> Thanks for your guidance!
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

     prev parent reply	other threads:[~2015-05-13 21:08 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-03-23 23:10 btrfs dedup - available or experimental? Or yet to be? Martin
2015-03-23 23:22 ` Hugo Mills
2015-03-25  1:30   ` Rich Freeman
2015-03-27  0:07     ` Martin
2015-03-27  0:30       ` Rich Freeman
2015-03-29 11:43         ` Kai Krakow
2015-03-29 12:31           ` Rich Freeman
2015-03-29 14:44             ` Kai Krakow
2015-03-29 17:54               ` Christoph Anton Mitterer
2015-03-29 17:51           ` Christoph Anton Mitterer
2015-03-27 20:51       ` Mark Fasheh
2015-03-27 20:44     ` Mark Fasheh
2015-05-13 16:23   ` Learner Study
2015-05-13 21:08     ` Zygo Blaxell [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150513210804.GA12262@hungrycats.org \
    --to=ce3g8jdj@umail.furryterror.org \
    --cc=learner.study@gmail.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).