From: Miao Xie <miaox@cn.fujitsu.com>
To: bo.li.liu@oracle.com
Cc: Josef Bacik <jbacik@fusionio.com>,
"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH 1/2] Btrfs: online data deduplication
Date: Tue, 09 Apr 2013 09:40:06 +0800 [thread overview]
Message-ID: <51637176.8060806@cn.fujitsu.com> (raw)
In-Reply-To: <20130408141625.GC27988@liubo.jp.oracle.com>
On mon, 8 Apr 2013 22:16:26 +0800, Liu Bo wrote:
> On Mon, Apr 08, 2013 at 08:54:50AM -0400, Josef Bacik wrote:
>> On Sun, Apr 07, 2013 at 07:12:48AM -0600, Liu Bo wrote:
>>> (NOTE: This leads to a FORMAT CHANGE, DO NOT use it on real data.)
>>>
>>> This introduce the online data deduplication feature for btrfs.
>>>
>>> (1) WHY do we need deduplication?
>>> To improve our storage effiency.
>>>
>>> (2) WHAT is deduplication?
>>> Two key ways for practical deduplication implementations,
>>> * When the data is deduplicated
>>> (inband vs background)
>>> * The granularity of the deduplication.
>>> (block level vs file level)
>>>
>>> For btrfs, we choose
>>> * inband(synchronous)
>>> * block level
>>>
>>> We choose them because of the same reason as how zfs does.
>>> a) To get an immediate benefit.
>>> b) To remove redundant parts within a file.
>>>
>>> So we have an inband, block level data deduplication here.
>>>
>>> (3) HOW does deduplication works?
>>> This makes full use of file extent back reference, the same way as
>>> IOCTL_CLONE, which lets us easily store multiple copies of a set of
>>> data as a single copy along with an index of references to the copy.
>>>
>>> Here we have
>>> a) a new dedicated tree(DEDUP tree) and
>>> b) a new key(BTRFS_DEDUP_ITEM_KEY), which consists of
>>> (stop 64bits of hash, type, disk offset),
>>> * stop 64bits of hash
>>> It comes from sha256, which is very helpful on avoiding collision.
>>> And we take the stop 64bits as the index.
>>> * disk offset
>>> It helps to find where the data is stored.
>>>
>>> So the whole deduplication process works as,
>>> 1) write something,
>>> 2) calculate the hash of this "something",
>>> 3) try to find the match of hash value by searching DEDUP keys in
>>> a dedicated tree, DEDUP tree.
>>> 4) if found, skip real IO and link to the existing copy
>>> if not, do real IO and insert a DEDUP key to the DEDUP tree.
>>>
>>> For now, we limit the deduplication unit to PAGESIZE, 4096, and we're
>>> going to increase this unit dynamically in the future.
>>>
>>> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
>>> ---
>>> fs/btrfs/ctree.h | 53 ++++++++
>>> fs/btrfs/disk-io.c | 33 +++++-
>>> fs/btrfs/extent-tree.c | 22 +++-
>>> fs/btrfs/extent_io.c | 8 +-
>>> fs/btrfs/extent_io.h | 11 ++
>>> fs/btrfs/file-item.c | 186 ++++++++++++++++++++++++++
>>> fs/btrfs/file.c | 6 +-
>>> fs/btrfs/inode.c | 330 +++++++++++++++++++++++++++++++++++++++++++----
>>> fs/btrfs/ioctl.c | 34 +++++-
>>> fs/btrfs/ordered-data.c | 25 +++-
>>> fs/btrfs/ordered-data.h | 9 ++
>>> fs/btrfs/print-tree.c | 6 +-
>>> fs/btrfs/super.c | 7 +-
>>> 13 files changed, 685 insertions(+), 45 deletions(-)
>>>
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index 0d82922..59339bc 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -32,6 +32,7 @@
>>> #include <asm/kmap_types.h>
>>> #include <linux/pagemap.h>
>>> #include <linux/btrfs.h>
>>> +#include <crypto/hash.h>
>>> #include "extent_io.h"
>>> #include "extent_map.h"
>>> #include "async-thread.h"
>>> @@ -94,6 +95,9 @@ struct btrfs_ordered_sum;
>>> /* holds quota configuration and tracking */
>>> #define BTRFS_QUOTA_TREE_OBJECTID 8ULL
>>>
>>> +/* dedup tree(experimental) */
>>> +#define BTRFS_DEDUP_TREE_OBJECTID 9ULL
>>> +
>>> /* orhpan objectid for tracking unlinked/truncated files */
>>> #define BTRFS_ORPHAN_OBJECTID -5ULL
>>>
>>> @@ -884,12 +888,31 @@ struct btrfs_file_extent_item {
>>> */
>>> __le64 num_bytes;
>>>
>>> + /*
>>> + * the stop 64bits of sha256 hash value, this helps us find the
>>> + * corresponding item in dedup tree.
>>> + */
>>> + __le64 dedup_hash;
>>> +
>>> } __attribute__ ((__packed__));
>>>
>>
>> Please don't do this, do like what we do with the crc tree, just lookup based on
>> the disk bytenr when we free the extent and drop refs that way. Don't further
>> bloat the file extent item, we want it smaller not larger. Thanks,
>>
>> Josef
>
> So the real trouble is that I take this hash value as the first field of
> btrfs_key, and binary searching without the precise first field is not easy.
>
> Otherwise we may have to add another key type which replaces hash value with
> disk bytenr, ie. (disk bytenr, ANOTHER_KEY_TYPE, hash value), then we'll need to
> search the tree twice to free this one or drop refs.
Why need we store refs in btrfs_dedup_item structure?
I think the following one is better:
key.objectid = the stop 64bits of sha256 hash value
key.type = whatever
key.offset = bytenr /* the bytenr of block */
struct btrfs_dedup_item {
__le64 bytenr, /* the start bytenr of the extent */
__le64 len,
}
In this way, we use the refs in btrfs_extent_item to make sure the block is not freed.
And when we truncate the file, all thing we need do is delete the dedup item when we
free the extent just like checksum tree.
Thanks
Miao
>
> Either case is tradeoff, but as this is an initial version, we can try all of
> these knobs and choose the better one :)
>
> thanks,
> liubo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
next prev parent reply other threads:[~2013-04-09 1:38 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-04-07 13:12 [PATCH 0/2 RFC] Online data deduplication Liu Bo
2013-04-07 13:12 ` [PATCH 1/2] Btrfs: online " Liu Bo
2013-04-08 12:54 ` Josef Bacik
2013-04-08 14:16 ` Liu Bo
2013-04-08 20:37 ` Josef Bacik
2013-04-09 1:34 ` Liu Bo
2013-04-09 1:48 ` Josef Bacik
2013-04-10 14:21 ` Liu Bo
2013-04-09 1:40 ` Miao Xie [this message]
2013-04-08 13:47 ` David Sterba
2013-04-08 14:08 ` Liu Bo
2013-04-10 15:42 ` David Sterba
2013-04-09 1:52 ` Miao Xie
2013-04-10 15:52 ` David Sterba
2013-04-10 12:05 ` Marek Otahal
2013-04-10 14:14 ` Liu Bo
2013-04-07 13:12 ` [PATCH 2/2] Btrfs: skip merge part for delayed data refs Liu Bo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=51637176.8060806@cn.fujitsu.com \
--to=miaox@cn.fujitsu.com \
--cc=bo.li.liu@oracle.com \
--cc=jbacik@fusionio.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox