From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: James Pharaoh <james@wellbehavedsoftware.com>,
	Mark Fasheh <mfasheh@versity.com>,
	linux-btrfs@vger.kernel.org
Subject: Re: Announcing btrfs-dedupe
Date: Tue, 15 Nov 2016 07:26:53 -0500
Message-ID: <0b8cd1f0-fca4-b7df-2f41-13c40aee493d@gmail.com>
In-Reply-To: <20161114211013.GJ21290@hungrycats.org>

On 2016-11-14 16:10, Zygo Blaxell wrote:
> On Mon, Nov 14, 2016 at 02:56:51PM -0500, Austin S. Hemmelgarn wrote:
>> On 2016-11-14 14:51, Zygo Blaxell wrote:
>>> Deduplicating an extent that might be concurrently modified during the
>>> dedup is a reasonable userspace request.  In the general case there's
>>> no way for userspace to ensure that it's not happening.
>> I'm not even talking about the locking, I'm talking about the data
>> comparison that the ioctl does to ensure they are the same before
>> deduplicating them, and specifically about protecting against userspace
>> just passing in two random extents that happen to be the same size but do not
>> contain the same data (because deduplication _should_ reject such a
>> situation, that's what the clone ioctl is for).
>
> If I'm deduping a VM image, and the virtual host is writing to said image
> (which is likely since an incremental dedup will be intentionally doing
> dedup over recently active data sets), the extent I just compared in
> userspace might be different by the time the kernel sees it.
>
> This is an important reason why the whole lock/read/compare/replace step
> is an atomic operation from userspace's PoV.
>
> The read also saves having to confirm a short/weak hash isn't a collision.
> The RAM savings from using weak hashes (~48 bits) are a huge performance
> win.
>
> The locking overhead is very small compared to the reading overhead,
> and (in the absence of bugs) it will only block concurrent writes to the
> same offset range in the src/dst inodes (based on a read of the code...I
> don't know if there's also an inode-level or backref-level barrier that
> expands the locking scope).
I'm not arguing that it's a bad thing that the kernel is doing this; I'm
just saying that the locking overhead is minuscule in most cases
compared to the data comparison.  The locking is absolutely necessary,
for exactly the reasons you outline.
>
> I'm not sure the ioctl is well designed for simply throwing random
> data at it, especially not entire files (it can't handle files over
> 16MB anyway).  It will read more data than it has to compared to a
> block-by-block comparison from userspace with prefetches or a pair of
> IO threads.  If userspace reads both copies of the data just before
> issuing the extent-same call, the kernel will read the data from cache
> reasonably quickly.
It still depends on the use case to a certain extent.  In the case I was
using as an example, I know to a reasonably certain degree (barring
tampering, bugs, or hardware failure) that any two candidate files are
identical, and I actually don't want to trash the page-cache just to
deduplicate data faster (the data set in question is large, but most of
it is idle at any given point in time).  So there's no point in me
pre-reading everything in userspace, which in turn keeps the script I
use much simpler (the most complex part is figuring out how to split the
ranges for files bigger than the ioctl can handle in one call, so that I
don't end up with tiny tail extents but still keep the number of extents
per file to a minimum).
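
(FWIW, the splitting logic doesn't need to be clever; something like the
sketch below is enough.  Illustrative only: the 16MiB per-call cap comes
from the ioctl, but the even-spread chunk math is just one way to avoid
a tiny tail, not what my script literally does.)

/* Sketch: plan per-call ranges for one file so each call stays under
 * the ioctl's 16MiB cap while the chunks stay roughly equal in size
 * (no tiny tail).  Real code would also round the boundaries to the
 * filesystem block size. */
#include <stdio.h>

#define DEDUPE_MAX (16ULL * 1024 * 1024)

static void plan_chunks(unsigned long long total)
{
        /* number of chunks, rounded up, then spread the bytes evenly */
        unsigned long long n = (total + DEDUPE_MAX - 1) / DEDUPE_MAX;
        if (n == 0)
                n = 1;
        unsigned long long base = total / n;
        unsigned long long extra = total % n;  /* first `extra` chunks get one extra byte */

        unsigned long long off = 0;
        for (unsigned long long i = 0; i < n; i++) {
                unsigned long long len = base + (i < extra ? 1 : 0);
                printf("chunk %llu: offset=%llu length=%llu\n", i, off, len);
                off += len;
        }
}

int main(void)
{
        plan_chunks(100ULL * 1024 * 1024);     /* e.g. a 100MiB file */
        return 0;
}
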
>
>> The locking is perfectly reasonable and shouldn't contribute that much to
>> the overhead (unless you're being crazy and deduplicating thousands of tiny
>> blocks of data).
>
> Why is deduplicating thousands of blocks of data crazy?  I already
> deduplicate four orders of magnitude more than that per week.
You missed the 'tiny' qualifier.  I'm talking about really small blocks,
on the order of 64k or less (IOW, not much bigger than a few filesystem
blocks), and that is somewhat crazy because it not only takes a _really_
long time compared to larger chunks (you're running far more independent
hashes over the same amount of data; at 64k that's sixteen times as many
as at 1M), but it also tends to split extents unnecessarily and
contribute to fragmentation, which leads to all kinds of other
performance problems on the FS.

Thread overview: 41+ messages
2016-11-06 13:30 Announcing btrfs-dedupe James Pharaoh
2016-11-07 14:02 ` David Sterba
2016-11-07 17:48   ` Mark Fasheh
2016-11-07 20:54     ` Adam Borowski
2016-11-08  2:17       ` Darrick J. Wong
2016-11-08 18:59         ` Mark Fasheh
2016-11-08 19:47           ` Darrick J. Wong
2016-11-09 15:02       ` David Sterba
2016-11-08  2:40   ` Christoph Anton Mitterer
2016-11-08  6:11     ` James Pharaoh
2016-11-08 13:26     ` Austin S. Hemmelgarn
2016-11-08 16:57       ` Darrick J. Wong
2016-11-08 17:04         ` Austin S. Hemmelgarn
2016-11-08 18:49     ` Mark Fasheh
2016-11-07 17:59 ` Mark Fasheh
2016-11-07 18:49   ` James Pharaoh
2016-11-07 18:53     ` James Pharaoh
2016-11-14 18:07     ` Zygo Blaxell
2016-11-14 18:22       ` James Pharaoh
2016-11-14 18:39         ` Austin S. Hemmelgarn
2016-11-14 19:51           ` Zygo Blaxell
2016-11-14 19:56             ` Austin S. Hemmelgarn
2016-11-14 21:10               ` Zygo Blaxell
2016-11-15 12:26                 ` Austin S. Hemmelgarn [this message]
2016-11-15 17:52                   ` Zygo Blaxell
2016-11-16 22:24                     ` Niccolò Belli
2016-11-17  3:01                       ` Zygo Blaxell
2016-11-18 10:36                         ` Niccolò Belli
2016-11-14 20:07             ` James Pharaoh
2016-11-14 21:22               ` Zygo Blaxell
2016-11-14 18:43         ` Zygo Blaxell
2016-11-08 11:06 ` Niccolò Belli
2016-11-08 11:38   ` James Pharaoh
2016-11-08 16:57     ` Niccolò Belli
2016-11-08 16:58       ` James Pharaoh
2016-11-08 17:08         ` Niccolò Belli
2016-11-14 18:27   ` Zygo Blaxell
2016-11-08 22:36 ` Saint Germain
2016-11-09 11:24   ` Niccolò Belli
2016-11-09 12:47     ` Saint Germain
2016-11-13 12:45   ` James Pharaoh
