From: Gabriel <g2p.code@gmail.com>
To: linux-btrfs@vger.kernel.org
Subject: [RFC] Systemcall for offline deduplication
Date: Wed, 17 Oct 2012 11:39:42 +0000 (UTC)
Message-ID: <k5m5du$keo$1@ger.gmane.org>
In-Reply-To: 20121015201516.GB10679@twin.jikos.cz

On Mon, 15 Oct 2012 22:15:16 +0200, David Sterba wrote:
> On Mon, Oct 15, 2012 at 07:09:23PM +0200, Bob Marley wrote:
>> I would really appreciate a systemcall (or ioctl or the like) to allow
>> deduplication of a block of a file against a block of another file.
>> (ok if blocks need to be aligned to filesystem blocks)
> 
> It exists, is called
> 
> BTRFS_IOC_CLONE_RANGE
> (http://lxr.free-electrons.com/source/fs/btrfs/ioctl.h#L399)
> 
> btrfs_ioctl_clone_range_args
> http://lxr.free-electrons.com/source/fs/btrfs/ioctl.h#L254
> 
>> The syscall should presumably check that the regions are really equal
>> and perform the deduplication atomically.
>> 
>> This would be the start for a lot of deduplication algorithms in
>> userspace.
>> It would be a killer feature for backup systems.
> 
> It doesn't do any checks if the range contents match, but for a backup
> system, the ranges can be merged at a calm state, ie not new data in
> flight.

Thanks for bringing this up.

I'm the author of bedup[1], a btrfs deduplication tool which currently
uses the BTRFS_IOC_CLONE_RANGE ioctl (and a host of other btrfs features:
search, fiemap, inode-to-path backrefs, etc).

By far the biggest drawback is that the check that both ranges are
identical is done in userland; that means I need to lock files in
userland to guarantee I have exclusive access to both files at the time
the clone call is done.
I've found a way that might be okay against non-root users, but I
wouldn't swear to it, and if it isn't, that creates a security risk.

The other drawbacks come from CLONE_RANGE being a write operation.
It can't be done on read-only subvolumes, which is a shame because
backup filesystems containing mostly read-only snapshots are great
candidates for deduplication. And it updates the mtime, when deduplication
should be an implementation detail with no impact on file metadata.

Now, here's my proposal for fixing that:
A BTRFS_IOC_SAME_RANGE ioctl would be ideal. It would take two file
descriptors, two offsets, and one length, do some locking, check that the
ranges are identical (returning EINVAL if not), and defer to an
implementation that works like clone_range with the metadata update and
the writable-volume restriction removed.
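
To make the shape of the proposed call concrete, the argument layout
could look roughly like the sketch below. Everything in it is
hypothetical: the struct name, the field names, and the helper are
invented for illustration, loosely modelled on
btrfs_ioctl_clone_range_args, and no ioctl number is assigned because
BTRFS_IOC_SAME_RANGE is only a proposal here.

```c
#include <stdint.h>

/* HYPOTHETICAL argument block for the proposed BTRFS_IOC_SAME_RANGE.
 * The second file descriptor would be the one the ioctl is issued on,
 * as with clone_range. Offsets and length are in bytes. */
struct same_range_args_sketch {
    int64_t  src_fd;      /* file whose extents would be shared */
    uint64_t src_offset;  /* start of the range in src_fd */
    uint64_t dest_offset; /* start of the range in the target file */
    uint64_t length;      /* one length, applied to both ranges */
};

/* Illustrative helper that fills the block; not part of any real API. */
struct same_range_args_sketch
same_range_args_make(int src_fd, uint64_t src_offset,
                     uint64_t dest_offset, uint64_t length)
{
    struct same_range_args_sketch args = {
        .src_fd = src_fd,
        .src_offset = src_offset,
        .dest_offset = dest_offset,
        .length = length,
    };
    return args;
}
```

Under this sketch a caller would get EINVAL back when the two ranges
differ, and would never need write access to either file's metadata.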

I didn't go with something block-based or extent-based because with 
compression and fragmentation, extents would very easily fail to be 
aligned.

Thoughts on this interface?
Anyone interested in getting this implemented, or at least providing some 
guidance and patch review?

[1] https://github.com/g2p/bedup#readme

