Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework


From: Qu Wenruo <quwenruo@cn.fujitsu.com>
To: <dsterba@suse.cz>, <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework
Date: Wed, 23 Mar 2016 10:25:51 +0800
Message-ID: <56F1FEAF.2070806@cn.fujitsu.com>
In-Reply-To: <20160322133812.GK8095@twin.jikos.cz>

First, thank you for your interest in the dedupe patchset.

In fact, I was quite afraid that if no one showed interest in the
patchset, it might be delayed again to 4.8.

David Sterba wrote on 2016/03/22 14:38 +0100:
> On Tue, Mar 22, 2016 at 09:35:25AM +0800, Qu Wenruo wrote:
>> This updated version of inband de-duplication has the following features:
>> 1) ONE unified dedup framework.
>> 2) TWO different back-end with different trade-off
>
> The on-disk format is defined in code, would be good to give some
> overview here.

No problem at all.
(Although I'm not sure it's a good idea to explain it in mail. Maybe the
wiki would be a better place?)

There are 3 dedupe-related on-disk items.

1) dedupe status
    Used by both dedupe backends. Mainly used to record the dedupe
    backend info, allowing btrfs to resume its dedupe setup after umount.

Key contents:
    Objectid             , Type                   , Offset
   (0                    , DEDUPE_STATUS_ITEM_KEY , 0      )

Structure contents:
   dedupe block size:     records dedupe block size
   limit_nr:              In-memory hash limit
   hash_type:             Only SHA256 is supported so far
   backend:               In-memory or on-disk
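
As a rough illustration of the status item above, here is a minimal
userspace sketch. Only the field names and the fixed (0, STATUS, 0) key
come from the description; the field widths, the key type value and the
names dedupe_key / dedupe_status_key() are assumptions for illustration,
not the patchset's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical key type value; the real number is defined by the patchset. */
#define DEDUPE_STATUS_ITEM_KEY 0

/* Generic (objectid, type, offset) key, matching the key tables. */
struct dedupe_key {
    uint64_t objectid;
    uint8_t  type;
    uint64_t offset;
};

/* Assumed field widths; only the field names come from the description. */
struct dedupe_status_item {
    uint64_t blocksize;  /* dedupe block size */
    uint64_t limit_nr;   /* in-memory hash limit */
    uint16_t hash_type;  /* only SHA256 so far */
    uint16_t backend;    /* in-memory or on-disk */
} __attribute__((packed));

/* On-disk structures must not contain compiler-inserted padding. */
static_assert(sizeof(struct dedupe_status_item) == 20, "unexpected padding");

/* The status item always lives at the fixed key (0, STATUS, 0). */
static struct dedupe_key dedupe_status_key(void)
{
    struct dedupe_key key = {
        .objectid = 0,
        .type = DEDUPE_STATUS_ITEM_KEY,
        .offset = 0,
    };
    return key;
}
```

Because the key is fixed, a mount only needs a single lookup to find out
whether dedupe was enabled and with which backend.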

2) dedupe hash item
    The main item for the on-disk dedupe backend.
    It's used for hash -> extent search.
    A duplicated hash won't be inserted into the dedupe tree.

Key contents:
    Objectid            , Type                   , Offset
   (Last 64bit of hash  , DEDUPE_HASH_ITEM_KEY   , Bytenr of the extent)

Structure contents:
   len:                   The in-memory length of the extent.
                          Should always match dedupe_bs.
   disk_len:              The on-disk length of the extent; differs from
                          len if the extent is compressed.
   compression:           Compression algorithm.
   hash:                  Complete hash (SHA256) of the extent, including
                          the last 64 bits.

   The structure is a simplified file extent item, with the hash added
   and the offset removed.
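
A minimal sketch of how such a hash item key could be built from a SHA256
digest. The little-endian interpretation of the last 8 bytes, the field
widths, the key type value and the helper names are all assumptions made
for illustration; the patchset defines the real ones.

```c
#include <stdint.h>

#define SHA256_LEN 32
/* Hypothetical key type value; the real number is defined by the patchset. */
#define DEDUPE_HASH_ITEM_KEY 1

struct dedupe_key {
    uint64_t objectid;
    uint8_t  type;
    uint64_t offset;
};

/* Assumed field widths; only the field names come from the description. */
struct dedupe_hash_item {
    uint64_t len;               /* in-memory length, matches dedupe_bs */
    uint64_t disk_len;          /* on-disk length, differs if compressed */
    uint8_t  compression;       /* compression algorithm */
    uint8_t  hash[SHA256_LEN];  /* complete hash, including the last 64 bits */
} __attribute__((packed));

/* Read the last 8 bytes of the hash as a little-endian u64 (assumed order). */
static uint64_t dedupe_hash_tail64(const uint8_t hash[SHA256_LEN])
{
    uint64_t v = 0;

    for (int i = 7; i >= 0; i--)
        v = (v << 8) | hash[SHA256_LEN - 8 + i];
    return v;
}

/* Key layout: (last 64 bits of hash, HASH_ITEM, bytenr of the extent). */
static struct dedupe_key dedupe_hash_key(const uint8_t hash[SHA256_LEN],
                                         uint64_t bytenr)
{
    struct dedupe_key key = {
        .objectid = dedupe_hash_tail64(hash),
        .type = DEDUPE_HASH_ITEM_KEY,
        .offset = bytenr,
    };
    return key;
}
```

Since the key carries only 64 bits of the hash, two different hashes may
share an objectid; the complete hash stored in the item is what confirms a
real match.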

3) dedupe bytenr item
    Helper structure, mainly used for extent -> hash lookup during
    extent freeing.
    One-to-one mapping with the dedupe hash item.

Key contents:
    Objectid       , Type                       , Offset
   (Extent bytenr  , DEDUPE_HASH_BYTENR_ITEM_KEY, Last 64 bit of hash)

Structure contents:
   Hash:                 Complete hash(SHA256) of the extent.
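
The one-to-one pairing between the two key layouts can be sketched as
follows. The key type values and helper names are hypothetical; the point
is only that objectid and offset swap places, so the extent freeing path
can derive the hash item key from the bytenr item key.

```c
#include <stdint.h>

/* Hypothetical key type values; the real numbers come from the patchset. */
#define DEDUPE_HASH_ITEM_KEY        1
#define DEDUPE_HASH_BYTENR_ITEM_KEY 2

struct dedupe_key {
    uint64_t objectid;
    uint8_t  type;
    uint64_t offset;
};

/* hash -> extent direction: (last 64 bits of hash, HASH_ITEM, bytenr) */
static struct dedupe_key dedupe_hash_key(uint64_t hash_tail, uint64_t bytenr)
{
    struct dedupe_key key = {
        .objectid = hash_tail,
        .type = DEDUPE_HASH_ITEM_KEY,
        .offset = bytenr,
    };
    return key;
}

/* extent -> hash direction: (bytenr, HASH_BYTENR_ITEM, last 64 bits of hash) */
static struct dedupe_key dedupe_bytenr_key(uint64_t bytenr, uint64_t hash_tail)
{
    struct dedupe_key key = {
        .objectid = bytenr,
        .type = DEDUPE_HASH_BYTENR_ITEM_KEY,
        .offset = hash_tail,
    };
    return key;
}

/* On extent free: from a bytenr item key, derive the paired hash item key. */
static struct dedupe_key dedupe_paired_hash_key(struct dedupe_key bytenr_key)
{
    return dedupe_hash_key(bytenr_key.offset, bytenr_key.objectid);
}
```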

>
>> 3) Support compression with dedupe
>> 4) Ioctl interface with persist dedup status
>
> I'd like to see the ioctl specified in more detail. So far there's
> enable, disable and status. I'd expect some way to control the in-memory
> limits, let it "forget" current hash cache, specify the dedupe chunk
> size, maybe sync of the in-memory hash cache to disk.

So the current and planned ioctls are as follows, with some details
related to your concerns about in-memory limit control.

1) Enable
    Enable dedupe if it's not enabled already. (disabled -> enabled)
    Or change the current dedupe setting to another one. (re-configure)

    For a change of dedupe_bs/backend/hash algorithm (only SHA256 so far),
    it will disable dedupe (dropping all hashes) and then enable it with
    the new setting.

    For the in-memory backend, if only the limit differs from the previous
    setting, the limit can be changed on the fly without dropping any hash.

2) Disable
    Disable will drop all hashes and delete the dedupe tree if it exists.
    Implies a full sync_fs().

3) Status
    Output the basic status of the current dedupe setup,
    including running status (disabled/enabled), dedupe block size, hash
    algorithm, and the limit setting for the in-memory backend.

4) (PLANNED) In-memory hash size querying
    Allows userspace to query the in-memory hash structure header size.
    Used by the "btrfs dedupe enable" '-l' option to output a warning if
    the user specifies a memory size larger than 1/4 of the total memory.

5) (PLANNED) Dedupe rate statistics
    Should be handy for users to know the dedupe rate so they can further
    fine-tune their dedupe setup.
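
The planned userspace check in 4) could look roughly like the sketch
below. Both helpers are hypothetical, and the conversion from a memory
size to a hash count via the queried per-hash size is an assumption about
how the '-l' option would use that ioctl.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Convert a user-supplied memory size into a hash count limit, given the
 * per-hash in-memory size reported by the (planned) query ioctl.
 */
static uint64_t dedupe_mem_to_limit_nr(uint64_t mem_bytes,
                                       uint64_t per_hash_bytes)
{
    if (per_hash_bytes == 0)
        return 0;
    return mem_bytes / per_hash_bytes;
}

/* Warn when the requested size exceeds 1/4 of the total memory. */
static bool dedupe_limit_needs_warning(uint64_t mem_bytes,
                                       uint64_t total_ram_bytes)
{
    return mem_bytes > total_ram_bytes / 4;
}
```

For example, with 8GiB of RAM, asking for 3GiB would trigger the warning
while asking for exactly 2GiB would not.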

So for your "in-memory limit control", just enable it with a different limit.
For "dedupe block size change", just enable it with a different dedupe_bs.
For "forget hash", just disable it.
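
The enable/re-configure rules can be summarized as a small decision
function. This is only a sketch of the semantics described in this mail;
the struct, enum and function names are hypothetical and do not match the
patchset's code.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEDUPE_HASH_SHA256 0 /* hypothetical value; the only hash so far */

enum dedupe_backend { DEDUPE_BACKEND_INMEMORY, DEDUPE_BACKEND_ONDISK };

enum dedupe_action {
    DEDUPE_ACT_NONE,          /* nothing to change */
    DEDUPE_ACT_ENABLE,        /* disabled -> enabled */
    DEDUPE_ACT_CHANGE_LIMIT,  /* on the fly, keeps all hashes */
    DEDUPE_ACT_REENABLE,      /* disable (drop all hashes), then enable */
};

struct dedupe_config {
    bool enabled;
    uint64_t blocksize;
    uint16_t hash_type;
    enum dedupe_backend backend;
    uint64_t limit_nr;        /* only meaningful for the in-memory backend */
};

static enum dedupe_action dedupe_plan_enable(const struct dedupe_config *cur,
                                             const struct dedupe_config *want)
{
    if (!cur->enabled)
        return DEDUPE_ACT_ENABLE;

    /* Any of these changes invalidates the existing hashes. */
    if (cur->blocksize != want->blocksize ||
        cur->hash_type != want->hash_type ||
        cur->backend != want->backend)
        return DEDUPE_ACT_REENABLE;

    /* Only the limit differs: adjust it without dropping any hash. */
    if (cur->backend == DEDUPE_BACKEND_INMEMORY &&
        cur->limit_nr != want->limit_nr)
        return DEDUPE_ACT_CHANGE_LIMIT;

    return DEDUPE_ACT_NONE;
}
```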

And "write in-memory hash onto disk" is not planned, and we may never do
it due to the complexity, sorry.

>
>> 5) Ability to disable dedup for given dirs/files
>
> This would be good to extend to subvolumes.

I'm sorry, but I don't quite understand the difference.
Doesn't a dir include a subvolume?

Or is the xattr for a subvolume only stored in its parent subvolume, so
it won't be copied for its snapshot?

>
>> TODO:
>> 1) Add extent-by-extent comparison for faster but more conflicting algorithm
>>     Current SHA256 hash is quite slow, and for some old(5 years ago) CPU,
>>     CPU may even be a bottleneck other than IO.
>>     But for faster hash, it will definitely cause conflicts, so we need
>>     extent comparison before we introduce new dedup algorithm.
>
> If sha256 is slow, we can use a less secure hash that's faster but will
> do a full byte-to-byte comparison in case of hash collision, and
> recompute sha256 when the blocks are going to disk. I haven't thought
> this through, so there are possibly details that could make unfeasible.

Not exactly. If we use an unsafe hash, e.g. MD5, we will use MD5 only,
for both the in-memory and on-disk backends. No SHA256 at all.

In that case, on an MD5 hit, we will do a full byte-to-byte
comparison. It may be slow or fast, depending on the cache.

But at least on an MD5 miss, it should be faster than SHA256.
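
The hit/miss paths for a weak hash can be sketched in userspace like this.
The function name and its flat-buffer arguments are hypothetical; in the
kernel the expensive part would be reading the existing extent's contents,
which this sketch simply takes as a buffer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Return true only when the weak hashes match AND the contents are
 * byte-for-byte identical, so a weak-hash collision never dedupes
 * different data.
 */
static bool dedupe_weak_hit(const uint8_t *new_hash, const uint8_t *old_hash,
                            size_t hash_len,
                            const uint8_t *new_data, const uint8_t *old_data,
                            size_t data_len)
{
    /* Miss: cheap path, no data comparison needed. */
    if (memcmp(new_hash, old_hash, hash_len) != 0)
        return false;

    /* Hit: verify with a full byte-to-byte comparison. */
    return memcmp(new_data, old_data, data_len) == 0;
}
```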

>
> The idea is to move expensive hashing to the slow IO operations and do
> fast but not 100% safe hashing on the read/write side where performance
> matters.

Yes, although on the read side we don't hash at all; we only hash on the
write side.
And in that case, if the weak hash hits, we will need to do a memory
comparison, which may also be slow.
So the performance impact may still exist.

The biggest challenge is that we need to read the (decompressed) extent
contents, even without an inode.
(So, no address_space and none of the usual working facilities.)

Considering the complexity and the uncertain performance improvement, the
priority of introducing a weak hash is quite low so far, not to mention
the many detailed design changes it would need.

A much easier and more practical enhancement is to use SHA512,
as it's faster than SHA256 on modern 64-bit machines for larger sizes.
For example, for hashing 8K of data, SHA512 is almost 40% faster than SHA256.

>
>> 2) Misc end-user related helpers
>>     Like handy and easy to implement dedup rate report.
>>     And method to query in-memory hash size for those "non-exist" users who
>>     want to use 'dedup enable -l' option but didn't ever know how much
>>     RAM they have.
>
> That's what we should try know and define in advance, that's part of the
> ioctl interface.
>
> I went through the patches, there are a lot of small things to fix, but
> first I want to be sure about the interfaces, ie. on-disk and ioctl.

I hope such small things can be pointed out, so that I can fix them
while rebasing.

>
> Then we can start to merge the patchset in smaller batches, the
> in-memory deduplication does not have implications on the on-disk
> format, so it's "just" the ioctl part.

Yes, that was my original plan: first merge the simple in-memory backend
into 4.5/4.6 and then add the on-disk backend in 4.7.

But it turned out that, since we designed the two-backend API from
the beginning, the on-disk backend didn't take much time to implement.

So that is how we got what you see now: a big patchset with both
backends implemented.

>
> The patches at the end of the series fix bugs introduced within the same
> series, these should be folded to the patches that are buggy.

I'll fold them in the next version.

Thanks,
Qu
