From: Qu Wenruo <quwenruo@cn.fujitsu.com>
To: Chris Mason <clm@fb.com>, Qu Wenruo <quwenruo.btrfs@gmx.com>,
<linux-btrfs@vger.kernel.org>, Liu Bo <bo.li.liu@oracle.com>,
Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Subject: Re: [PATCH v8 10/27] btrfs: dedupe: Add basic tree structure for on-disk dedupe method
Date: Tue, 29 Mar 2016 09:47:56 +0800
Message-ID: <56F9DECC.4080304@cn.fujitsu.com>
In-Reply-To: <20160328140952.xj63bgpizg2vfcex@floor.thefacebook.com>
Chris Mason wrote on 2016/03/28 10:09 -0400:
> On Sat, Mar 26, 2016 at 09:11:53PM +0800, Qu Wenruo wrote:
>>
>>
>> On 03/25/2016 11:11 PM, Chris Mason wrote:
>>> On Fri, Mar 25, 2016 at 09:59:39AM +0800, Qu Wenruo wrote:
>>>>
>>>>
>>>> Chris Mason wrote on 2016/03/24 16:58 -0400:
>>>>> Are you storing the entire hash, or just the parts not represented in
>>>>> the key? I'd like to keep the on-disk part as compact as possible for
>>>>> this part.
>>>>
>>>> Currently, it's entire hash.
>>>>
>>>> More detailed can be checked in another mail.
>>>>
>>>> Although it's OK to truncate the last duplicated 8 bytes(64bit) for me,
>>>> I still quite like current implementation, as one memcpy() is simpler.
>>>
>>> [ sorry FB makes urls look ugly, so I delete them from replys ;) ]
>>>
>>> Right, I saw that but wanted to reply to the specific patch. One of the
>>> lessons learned from the extent allocation tree and file extent items is
>>> they are just too big. Lets save those bytes, it'll add up.
>>
>> OK, I'll reduce the duplicated last 8 bytes.
>>
>> And also, removing the "length" member, as it can be always fetched from
>> dedupe_info->block_size.
>
> This would mean dedup_info->block_size is a write once field. I'm ok
> with that (just like metadata blocksize) but we should make sure the
> ioctls etc don't allow changing it.
Not a problem. Currently a block_size change is done by completely
disabling dedupe (which implies a sync_fs) and then re-enabling it with
the new block_size.
So it would be OK.
>
>>
>> The length itself is used to verify if we are at the transaction to a new
>> dedupe size, but later we use full sync_fs(), such behavior is not needed
>> any more.
>>
>>
>>>
>>>>
>>>>>
>>>>>> +
>>>>>> +/*
>>>>>> + * Objectid: bytenr
>>>>>> + * Type: BTRFS_DEDUPE_BYTENR_ITEM_KEY
>>>>>> + * offset: Last 64 bit of the hash
>>>>>> + *
>>>>>> + * Used for bytenr <-> hash search (for free_extent)
>>>>>> + * all its content is hash.
>>>>>> + * So no special item struct is needed.
>>>>>> + */
>>>>>> +
>>>>>
>>>>> Can we do this instead with a backref from the extent? It'll save us a
>>>>> huge amount of IO as we delete things.
>>>>
>>>> That's the original implementation from Liu Bo.
>>>>
>>>> The problem is, it changes the data backref rules(originally, only
>>>> EXTENT_DATA item can cause data backref), and will make dedupe INCOMPACT
>>>> other than current RO_COMPACT.
>>>> So I really don't like to change the data backref rule.
>>>
>>> Let me reread this part, the cost of maintaining the second index is
>>> dramatically higher than adding a backref. I do agree that's its nice
>>> to be able to delete the dedup trees without impacting the rest, but
>>> over the long term I think we'll regret the added balances.
>>
>> Thanks for pointing the problem. Yes, I didn't even consider this fact.
>>
>> But, on the other hand. such remove only happens when we remove the *last*
>> reference of the extent.
>> So, for medium to high dedupe rate case, such routine is not that frequent,
>> which will reduce the impact.
>> (Which is quite different for non-dedupe case)
>
> It's both addition and removal, and the efficiency hit does depend on
> what level of sharing you're able to achieve. But what we don't want is
> for metadata usage to explode as people make small non-duplicate changes
> to their FS.
> If that happens, we'll only end up using dedup in back up
> farms and other highly limited use cases.
Right, the current dedupe-specific backref brings unavoidable
metadata overhead.
[[People accept trade-offs when using non-default features]]
IMHO, though, dedupe is not a generic feature. Just like compression
and possibly encryption, people choose it with the trade-off in mind.
For example, compression can achieve quite high performance for easily
compressible data, but quite low performance for poorly compressible
data, like ISO files or videos.
(In my test with a 2-core VM and virtio-blk on HDD, dd of an ISO into a
btrfs file runs at about 90MB/s with the default mount options, but only
about 40~50MB/s with compression.)
If we combine all the overhead together (not only metadata overhead),
almost every current transparent data processing method benefits only
specific use cases while reducing generic performance.
So the increased metadata overhead is acceptable to me, especially when
the main overhead is CPU time spent on SHA256.
And we have workarounds, from setting the dedupe-disable prop to setting
a larger dedupe block_size, to keep small and non-dedupe writes from
filling the dedupe tree.
>
> I do agree that delayed refs are error prone, but that's a good reason
> not fix delayed refs, not to recreate the backrefs of the extent
> allocation tree in a new dedicated tree.
[[We need an idea generic for both backends]]
Another thing I want to mention: dedupe now contains 2 different
backends, so we'd better choose an idea that won't split the backends
across different incompat/ro_compat flags.
If we use the backref method, the on-disk backend will definitely make
dedupe incompatible, which also affects the in-memory backend even though
it's completely backward-compatible.
Or we split the dedupe flag into DEDUPE_ONDISK and DEDUPE_INMEMORY, where
the former is INCOMPAT and the latter is at most RO_COMPAT (if using the
dedupe tree).
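To show what the split would mean for an older kernel, here is a rough
userspace sketch. The bit values, struct and function names are all
hypothetical; only the general rule is assumed, that an unknown incompat bit
refuses the mount while an unknown ro_compat bit still allows a read-only one.

```c
#include <stdint.h>

/* Hypothetical bit values, for illustration only. */
#define SKETCH_RO_COMPAT_DEDUPE_INMEMORY	(1ULL << 0)
#define SKETCH_INCOMPAT_DEDUPE_ONDISK		(1ULL << 0)

enum mount_result_sketch { MOUNT_RW, MOUNT_RO_ONLY, MOUNT_REFUSED };

/* Feature bits recorded in the (simplified) super block. */
struct sb_flags_sketch {
	uint64_t incompat;
	uint64_t ro_compat;
};

/*
 * Decide what a kernel may do with a filesystem, given the feature bits
 * that this kernel understands.
 */
static enum mount_result_sketch
check_mount_sketch(const struct sb_flags_sketch *sb,
		   uint64_t known_incompat, uint64_t known_ro_compat)
{
	if (sb->incompat & ~known_incompat)
		return MOUNT_REFUSED;	/* unknown incompat bit: no mount at all */
	if (sb->ro_compat & ~known_ro_compat)
		return MOUNT_RO_ONLY;	/* unknown ro_compat bit: read-only only */
	return MOUNT_RW;
}
```

With one shared flag forced to INCOMPAT by the on-disk backend, an in-memory
only filesystem would also be refused by older kernels, although nothing in
its layout requires that.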
[[Cleaner layout is less bug-prone]]
The last point of using a dedupe-specific backref is to limit the impact
of possible bugs, which for me is more important than performance.
The current implementation limits a dedupe backref bug to dedupe only,
while a new backref bug would definitely impact almost all btrfs
functions.
Thanks,
Qu
>
> -chris
>
>
>