Re: [PATCH 0/4] btrfs: offline dedupe v2

linux-btrfs.vger.kernel.org archive mirror

From: Josef Bacik <jbacik@fusionio.com>
To: Mark Fasheh <mfasheh@suse.de>
Cc: "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
	Chris Mason <clmason@fusionio.com>,
	Josef Bacik <josef@redhat.com>,
	Gabriel de Perthuis <g2p.code@gmail.com>,
	David Sterba <dsterba@suse.cz>
Subject: Re: [PATCH 0/4] btrfs: offline dedupe v2
Date: Wed, 12 Jun 2013 14:10:37 -0400
Message-ID: <20130612181037.GD658@localhost.localdomain>
In-Reply-To: <1370982698-757-1-git-send-email-mfasheh@suse.de>

On Tue, Jun 11, 2013 at 02:31:34PM -0600, Mark Fasheh wrote:
> Hi,
> 
> The following series of patches implements in btrfs an ioctl to do
> offline deduplication of file extents.
> 
> To be clear, "offline" in this sense means that the file system is
> mounted and running, but the dedupe is not done during file writes,
> but after the fact when some userspace software initiates a dedupe.
> 
> The primary patch is loosely based off of one sent by Josef Bacik back
> in January, 2011.
> 
> http://permalink.gmane.org/gmane.comp.file-systems.btrfs/8508
> 
> I've made significant updates and changes from the original. In
> particular the structure passed is more fleshed out, this series has a
> high degree of code sharing between itself and the clone code, and the
> locking has been updated.
> 
> 
> The ioctl accepts a struct:
> 
> struct btrfs_ioctl_same_args {
> 	__u64 logical_offset;	/* in - start of extent in source */
> 	__u64 length;		/* in - length of extent */
> 	__u16 dest_count;	/* in - total elements in info array */
> 	__u16 reserved1;
> 	__u32 reserved2;
> 	struct btrfs_ioctl_same_extent_info info[0];
> };
> 
> Userspace puts each duplicate extent (other than the source) in an
> item in the info array. As there can be multiple dedupes in one
> operation, each info item has its own status and 'bytes_deduped'
> member. This provides a number of benefits:
> 
> - We don't have to fail the entire ioctl because one of the dedupes failed.
> 
> - Userspace will always know how much progress was made on a file as we always
>   return the number of bytes deduped.
> 
> 
> #define BTRFS_SAME_DATA_DIFFERS	1
> /* For extent-same ioctl */
> struct btrfs_ioctl_same_extent_info {
> 	__s64 fd;		/* in - destination file */
> 	__u64 logical_offset;	/* in - start of extent in destination */
> 	__u64 bytes_deduped;	/* out - total # of bytes we were able
> 				 * to dedupe from this file */
> 	/* status of this dedupe operation:
> 	 * 0 if dedup succeeds
> 	 * < 0 for error
> 	 * == BTRFS_SAME_DATA_DIFFERS if data differs
> 	 */
> 	__s32 status;		/* out - see above description */
> 	__u32 reserved;
> };
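
As an illustration of the layout described above, here is a minimal userspace
sketch, not taken from the patch series, of building a request and reading
back the per-destination results. To build without the patched kernel headers
it uses local copies of the two structures with <stdint.h> types. The names
same_args_alloc() and same_args_summarize() are hypothetical, and the ioctl
call itself is left out because its request number is not given in this
message.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Local stand-ins for the quoted structures (assumption: same field order
 * and widths, with uint64_t/int64_t in place of __u64/__s64). */
#define BTRFS_SAME_DATA_DIFFERS 1

struct same_extent_info {
	int64_t  fd;             /* in  - destination file */
	uint64_t logical_offset; /* in  - start of extent in destination */
	uint64_t bytes_deduped;  /* out - bytes deduped from this file */
	int32_t  status;         /* out - 0, < 0, or BTRFS_SAME_DATA_DIFFERS */
	uint32_t reserved;
};

struct same_args {
	uint64_t logical_offset; /* in - start of extent in source */
	uint64_t length;         /* in - length of extent */
	uint16_t dest_count;     /* in - total elements in info array */
	uint16_t reserved1;
	uint32_t reserved2;
	struct same_extent_info info[];
};

/* Hypothetical helper: allocate a zeroed request with room for dest_count
 * info entries. The caller fills info[i].fd and info[i].logical_offset. */
static struct same_args *same_args_alloc(uint64_t src_offset, uint64_t length,
					 uint16_t dest_count)
{
	size_t size = sizeof(struct same_args) +
		      (size_t)dest_count * sizeof(struct same_extent_info);
	struct same_args *args = calloc(1, size);

	if (!args)
		return NULL;
	args->logical_offset = src_offset;
	args->length = length;
	args->dest_count = dest_count;
	return args;
}

/* Hypothetical helper: after the ioctl returns, total up bytes_deduped and
 * count entries whose data differed or that failed with an error. */
static uint64_t same_args_summarize(const struct same_args *args,
				    unsigned *differs, unsigned *errors)
{
	uint64_t total = 0;

	assert(args && differs && errors);
	*differs = 0;
	*errors = 0;
	for (uint16_t i = 0; i < args->dest_count; i++) {
		const struct same_extent_info *info = &args->info[i];

		total += info->bytes_deduped; /* always reported per entry */
		if (info->status == BTRFS_SAME_DATA_DIFFERS)
			(*differs)++;
		else if (info->status < 0)
			(*errors)++;
	}
	return total;
}
```

Because every entry carries its own status, a caller can retry or skip
individual destinations without discarding the progress reported for the
others.
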
> 
> 
> The kernel patches are based off Linux v3.9. At this point I've tested the
> ioctl against a decent variety of files and conditions.
> 
> A git tree for the kernel changes can be found at:
> 
> https://github.com/markfasheh/btrfs-extent-same
> 
> 
> I have a userspace project, duperemove, available at:
> 
> https://github.com/markfasheh/duperemove
> 
> Hopefully this can serve as an example of one possible usage of the ioctl.
> 
> duperemove takes a list of files as argument and will search them for
> duplicated extents. If given the '-D' switch, duperemove will send dedupe
> requests for same extents and display the results.
> 
> Within the duperemove repo is a file, btrfs-extent-same.c, that acts as
> a test wrapper around the ioctl. It can be compiled completely
> separately from the rest of the project via "make
> btrfs-extent-same". This makes direct testing of the ioctl more
> convenient.
> 
> 
> Limitations
> 
> We can't yet dedupe within the same file (that is, source and destination
> are the same inode). This is due to a limitation in btrfs_clone().
> 
> 
> Perhaps this isn't a limitation per se, but extent-same requires read/write
> access to the files we want to dedupe.  During my last series I had a
> conversation with Gabriel de Perthuis about access checking where we tried
> to maintain the ability for a user to run extent-same against a readonly
> snapshot. In addition, I reasoned that since the underlying data won't
> change (at least to the user) that we ought only require the files to be
> open for read.
> 
> What I found however is that neither of these is a great idea ;)
> 
> - We want to require that the inode be open for writing so that an
>   unprivileged user can't do things like run dedupe on a performance
>   sensitive file that they might only have read access to.  In addition I
>   could see it as kind of a surprise (non-standard behavior) to an
>   administrator that users could alter the layout of files they are only
>   allowed to read.
> 
> - Readonly snapshots won't let you open for write anyway (unsurprisingly,
>   open() returns -EROFS).  So that kind of kills the idea of them being able
>   to open those files for write which we want to dedupe.
> 
> That said, I still think being able to run this against a set of readonly
> snapshots makes sense especially if those snapshots are taken for backup
> purposes. I'm just not sure how we can sanely enable it.
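
Since destination files must be open for writing, a userspace tool could
check its descriptors before building a request. The following is a minimal
sketch under that assumption, using fcntl(F_GETFL) and O_ACCMODE. The helper
names are hypothetical, and the read check mirrors the changelog's statement
that the source only needs to be open for read. The kernel performs its own
checks regardless.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Return the access mode bits of fd, or -1 if fcntl() fails. */
static int fd_access_mode(int fd)
{
	int flags;

	assert(fd >= 0); /* caller must pass a plausible descriptor number */
	flags = fcntl(fd, F_GETFL);
	if (flags < 0)
		return -1;
	return flags & O_ACCMODE;
}

/* Return 1 if fd is open for reading, 0 if not, -1 on fcntl() failure. */
static int fd_open_for_read(int fd)
{
	int mode = fd_access_mode(fd);

	if (mode < 0)
		return -1;
	return mode == O_RDONLY || mode == O_RDWR;
}

/* Return 1 if fd is open for writing, 0 if not, -1 on fcntl() failure. */
static int fd_open_for_write(int fd)
{
	int mode = fd_access_mode(fd);

	if (mode < 0)
		return -1;
	return mode == O_WRONLY || mode == O_RDWR;
}
```

A tool would call fd_open_for_read() on the source and fd_open_for_write()
on each destination, and skip any file that fails the check rather than
letting the whole request be rejected.
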
> 
> 
> 
> Code review is very much appreciated. Thanks,
>      --Mark
> 
> 
> ChangeLog
> 
> - check that we have appropriate access to each file before deduping. For
>   the source, we only check that it is opened for read. Target files have to
>   be open for write.
> 
> - don't dedupe on readonly submounts (this is to maintain 
> 
> - check that we don't dedupe files with different checksumming states
>  (compare BTRFS_INODE_NODATASUM flags)
> 
> - get and maintain write access to the mount during the extent same
>   operation (mount_want_write())
> 
> - allocate our read buffers up front in btrfs_ioctl_file_extent_same() and
>   pass them through for re-use on every call to btrfs_extent_same(). (thanks
>   to David Sterba <dsterba@suse.cz> for reporting this)
> 
> - As the read buffers could possibly be up to 1MB (depending on user
>   request), we now conditionally vmalloc them.
> 
> - removed redundant check for same inode. btrfs_extent_same() catches it now
>   and bubbles the error up.
> 
> - remove some unnecessary printks
> 
> Changes from RFC to v1:
> 
> - don't error on large length value in btrfs extent-same, instead we just
>   dedupe the maximum allowed.  That way userspace doesn't have to worry
>   about an arbitrary length limit.
> 
> - btrfs_extent_same will now loop over the dedupe range in 1MB increments (for
>   a total of 16MB per request)
> 
> - cleaned up poorly coded while loop in __extent_read_full_page() (thanks to
>   David Sterba <dsterba@suse.cz> for reporting this)
> 
> - included two fixes from Gabriel de Perthuis <g2p.code@gmail.com>:
>    - allow dedupe across subvolumes
>    - don't lock compressed pages twice when deduplicating
> 
> - removed some unused / poorly designed fields in btrfs_ioctl_same_args.
>   This should also give us a bit more reserved bytes.
> 
> - return -E2BIG instead of -ENOMEM when arg list is too large (thanks to
>   David Sterba <dsterba@suse.cz> for reporting this)
> 
> - Some more reserved bytes are now included as a result of some of my
>   cleanups. Quite possibly we could add a couple more.
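
The length handling in the changelog above (clamp an oversized request to the
maximum, then walk the range in 1MB steps up to 16MB) can be sketched as
follows. This is an illustration of the arithmetic only, not the kernel code;
the constant and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SAME_CHUNK_SIZE (1024ULL * 1024)         /* 1MB step, per the changelog */
#define SAME_MAX_LEN    (16 * SAME_CHUNK_SIZE)   /* 16MB total per request */

/* Clamp a requested length instead of rejecting it. */
static uint64_t same_clamp_len(uint64_t len)
{
	return len > SAME_MAX_LEN ? SAME_MAX_LEN : len;
}

/* Count how many steps a request of the given length would take. */
static unsigned same_chunk_count(uint64_t len)
{
	uint64_t remaining = same_clamp_len(len);
	unsigned chunks = 0;

	while (remaining > 0) {
		/* The final step may be shorter than a full chunk. */
		uint64_t step = remaining < SAME_CHUNK_SIZE ?
				remaining : SAME_CHUNK_SIZE;

		remaining -= step;
		chunks++;
	}
	assert(chunks <= SAME_MAX_LEN / SAME_CHUNK_SIZE);
	return chunks;
}
```

Under these assumptions a caller with more than 16MB to dedupe would compare
bytes_deduped with the length it asked for and issue a further request for
the remainder.
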

OK, I'm relatively happy with this set. I just want to have an xfstest on deck
to test it with; once I have an xfstest to run to verify its sanity, I'll pull
it in.  Thanks,

Josef
