From: Austin S Hemmelgarn <ahferroin7@gmail.com>
To: Mark Fasheh <mfasheh@suse.de>
Cc: dsterba@suse.cz, Timofey Titovets <nefelim4ag@gmail.com>,
	linux-btrfs@vger.kernel.org
Subject: Re: Btrfs offline deduplication
Date: Fri, 01 Aug 2014 15:18:46 -0400
Message-ID: <53DBE816.9050209@gmail.com>
In-Reply-To: <20140801185559.GG2203@wotan.suse.de>

On 08/01/2014 02:55 PM, Mark Fasheh wrote:
> On Fri, Aug 01, 2014 at 10:16:08AM -0400, Austin S Hemmelgarn wrote:
>> On 2014-08-01 09:23, David Sterba wrote:
>>> On Fri, Aug 01, 2014 at 06:17:44AM -0400, Austin S Hemmelgarn wrote:
>>>> I do think however that having the option of a background thread doing
>>>> deduplication asynchronously is a good idea, but then you would have to
>>>> have some way to trigger it on individual files/trees, and triggering on
>>>> writes like the autodefrag thread does doesn't make much sense.  Having
>>>> some userspace program to tell it to run on a given set of files would
>>>> probably be the best approach for a trigger.  I don't remember if this
>>>> kind of thing was also included in the online deduplication patches that
>>>> got posted a while back or not.
>>>
>>> IIRC the proposed implementation only merged new writes with existing
>>> data.
>>>
>>> For the out-of-band ("off-line") dedup there's bedup
>>> (https://github.com/g2p/bedup) or Mark's duperemove tool
>>> (https://github.com/markfasheh/duperemove) that work on a set of files.
>>>
>> Something kernel-side to do the work asynchronously would be nice,
>> especially if it could leverage the check-sums that BTRFS already stores
>> for the blocks.  Having a userspace interface for offline deduplication
>> similar to that for scrub operations would be even better.
> 
> Why does this have to be kernel side? There's userspace software already to
> dedupe that can be run on a regular basis. Exporting checksums is a
> different story (you can do that via ioctl) but running the dedupe software
> itself inside the kernel is exactly what we want to avoid by having the
> dedupe ioctl in the first place.
> 	--Mark
> 
> --
> Mark Fasheh
> 
By the same logic, however, we don't need scrub to be done kernel
side either, as it would take only one more ioctl to tell it which
block out of a set to treat as valid.  I'm not saying that things need
to be done in the kernel, but duperemove doesn't use the ioctl interface
even if it exists, and bedup is buggy as hell (unless it's improved
greatly in the last two weeks), and neither of them is at all efficient.
I do understand that this isn't something that is computationally
simple (especially on x86 with its deficiency of registers), but rsync
does almost the same thing for data transmission over the network, and
it does so seemingly much more efficiently than either option available
at the moment.
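
To make the rsync comparison concrete, here is a minimal sketch of the
kind of block matching rsync is known for: a cheap rolling weak checksum
to find candidate blocks, confirmed by a strong hash.  This is an
illustration only, not code from rsync, duperemove, or bedup.  The
function names, the use of SHA-256 as the strong hash, and the
byte-at-a-time scan are assumptions made for the example; how this would
map onto Btrfs extents or the dedupe ioctl is not covered here.

```python
import hashlib

M = 1 << 16  # both halves of the weak checksum are kept modulo 2**16


def weak_checksum(block):
    """Return the (a, b) weak checksum pair for one block of bytes."""
    n = len(block)
    a = sum(block) % M
    # The first byte is weighted by n, the last byte by 1.
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window forward by one byte in O(1)."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b


def find_matches(reference, target, block_size):
    """Find windows of `target` that equal an aligned block of `reference`.

    Returns a list of (target_offset, reference_offset) pairs.
    Hypothetical helper for illustration; unlike rsync, it does not skip
    ahead by a whole block after a match.
    """
    # Index every aligned block of the reference by its weak checksum.
    index = {}
    for off in range(0, len(reference) - block_size + 1, block_size):
        block = reference[off:off + block_size]
        strong = hashlib.sha256(block).digest()
        index.setdefault(weak_checksum(block), []).append((strong, off))

    matches = []
    if len(target) < block_size:
        return matches

    a, b = weak_checksum(target[:block_size])
    pos = 0
    while True:
        candidates = index.get((a, b), ())
        if candidates:
            # Only pay for the strong hash when the weak checksum hits.
            strong = hashlib.sha256(target[pos:pos + block_size]).digest()
            for ref_strong, ref_off in candidates:
                if strong == ref_strong:
                    matches.append((pos, ref_off))
                    break
        if pos + block_size >= len(target):
            break
        a, b = roll(a, b, target[pos], target[pos + block_size], block_size)
        pos += 1
    return matches
```

The point of the two-level scheme is that the expensive hash is computed
only for windows whose weak checksum already matches an indexed block, so
most positions cost a couple of additions and a dictionary lookup.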



Thread overview: 7+ messages
2014-07-31 23:54 Btrfs offline deduplication Timofey Titovets
2014-08-01 10:17 ` Austin S Hemmelgarn
2014-08-01 13:23   ` David Sterba
2014-08-01 14:16     ` Austin S Hemmelgarn
2014-08-01 18:55       ` Mark Fasheh
2014-08-01 19:18         ` Austin S Hemmelgarn [this message]
2014-08-01 20:18           ` Mark Fasheh
