Re: [RFC PATCH v6 5/5] Btrfs: online data deduplication

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Liu Bo <bo.li.liu@oracle.com>
To: dsterba@suse.cz, linux-btrfs@vger.kernel.org
Subject: Re: [RFC PATCH v6 5/5] Btrfs: online data deduplication
Date: Mon, 9 Sep 2013 14:15:44 +0800	[thread overview]
Message-ID: <20130909061541.GA2134@localhost.localdomain> (raw)
In-Reply-To: <20130902161942.GC23113@twin.jikos.cz>

Hi Dave,

On Mon, Sep 02, 2013 at 06:19:42PM +0200, David Sterba wrote:
> I wanted to only comment on the ioctl and interface to userspace bits,
> but found more things to comment in the kernel code.

Sorry for the late reply(I'm out for vacation these days).

> 
> On Thu, Aug 08, 2013 at 04:35:45PM +0800, Liu Bo wrote:
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -94,6 +95,9 @@ struct btrfs_ordered_sum;
> >  /* for storing balance parameters in the root tree */
> >  #define BTRFS_BALANCE_OBJECTID -4ULL
> >  
> > +/* dedup tree(experimental) */
> > +#define BTRFS_DEDUP_TREE_OBJECTID 9ULL
> 
> 9 now conflicts with the uuid tree, when you update the patch, please
> put the dedup tree definition right next after the uuid tree.
> 
> I've noticed that the lockdep annotation is also missing from disk-io.c)
> and the symbolic tree name in include/trace/events/btrfs.h as well.

Good point, I just saw that uuid tree has been merged into -next.

> 
> > +enum btrfs_dedup_type {
> > +	BTRFS_DEDUP_SHA256 = 0,
> > +	BTRFS_DEDUP_LAST = 1,
> > +};
> > +
> > +static int btrfs_dedup_lens[] = { 4, 0 };
> > +static int btrfs_dedup_sizes[] = { 32, 0 };	/* 256bit, 32bytes */
> 
> This is a bit confusing, what's the difference between 'len' and 'size'
> here.

Hmm, I admit that it's ugly, as we want to store the last 64bit of
256bit hash value as key.offset, an array of u64[4] is used and
the last 64bit will be array[3], any better ideas will be appreciated.

> 
> > +struct btrfs_dedup_item {
> > +	__le64 len;	/* disk length of dedup range */
> > +	u8 type;
> > +	u8 compression;
> > +	u8 encryption;
> > +	__le16 other_encoding; /* spare for later use */
> > +
> > +	/* hash/fingerprints go here */
> > +} __attribute__ ((__packed__));
> > +
> > +struct btrfs_dedup_hash {
> > +	/* hash algorithm */
> > +	int type;
> > +	u64 bytenr;
> > +	u64 num_bytes;
> > +	int compression;
> 
> Layout of this structure is a bit suboptimal, there is a 4 byte hole
> after type and compression, and you can use u8 types for both, but as
> this would leave another 6b hole after, using ints is ok, so the final
> layout should preferrably look like
> 
> {
> 	u64 bytenr;
> 	u64 num_bytes;
> 	int type;
> 	int compression;
> 
> 	u64 hash[];
> };

Looks good, will update it.

> 
> > +
> > +	/* last field is a variable length array of dedup hash */
> > +	u64 hash[];
> > +};
> > +
> > @@ -1967,6 +2027,7 @@ struct btrfs_ioctl_defrag_range_args {
> >  #define BTRFS_MOUNT_CHECK_INTEGRITY	(1 << 20)
> >  #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
> >  #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR	(1 << 22)
> > +#define BTRFS_MOUNT_DEDUP		(1 << 24)
> >  
> >  #define btrfs_clear_opt(o, opt)		((o) &= ~BTRFS_MOUNT_##opt)
> >  #define btrfs_set_opt(o, opt)		((o) |= BTRFS_MOUNT_##opt)
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -2184,6 +2194,8 @@ int open_ctree(struct super_block *sb,
> >  	atomic64_set(&fs_info->tree_mod_seq, 0);
> >  	fs_info->sb = sb;
> >  	fs_info->max_inline = 8192 * 1024;
> > +	fs_info->dedup_bs = 8192;
> > +	fs_info->dedup_type = BTRFS_DEDUP_SHA256;
> >  	fs_info->metadata_ratio = 0;
> >  	fs_info->defrag_inodes = RB_ROOT;
> >  	fs_info->free_chunk_space = 0;
> > @@ -2282,6 +2294,7 @@ int open_ctree(struct super_block *sb,
> >  	mutex_init(&fs_info->dev_replace.lock_finishing_cancel_unmount);
> >  	mutex_init(&fs_info->dev_replace.lock_management_lock);
> >  	mutex_init(&fs_info->dev_replace.lock);
> > +	mutex_init(&fs_info->dedup_ioctl_mutex);
> >  
> >  	spin_lock_init(&fs_info->qgroup_lock);
> >  	mutex_init(&fs_info->qgroup_ioctl_lock);
> > @@ -2457,6 +2470,14 @@ int open_ctree(struct super_block *sb,
> >  		goto fail_alloc;
> >  	}
> >  
> > +	fs_info->dedup_driver = crypto_alloc_shash("sha256", 0, 0);
> 
> Isn't this supposed to use fs_info->dedup_type somehow? This is harcoded
> to sha256 and it's the only one available now, but when the defines and
> variables are there, please do use them

Okay.

> 
> > +	if (IS_ERR(fs_info->dedup_driver)) {
> > +		pr_info("Cannot load sha256 driver\n");
> > +		err = PTR_ERR(fs_info->dedup_driver);
> > +		fs_info->dedup_driver = NULL;
> > +		goto fail_alloc;
> > +	}
> > +
> >  	btrfs_init_workers(&fs_info->generic_worker,
> >  			   "genwork", 1, NULL);
> >  
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -1102,8 +1102,16 @@ static noinline int lookup_extent_data_ref(struct btrfs_trans_handle *trans,
> >  		key.offset = parent;
> >  	} else {
> >  		key.type = BTRFS_EXTENT_DATA_REF_KEY;
> > -		key.offset = hash_extent_data_ref(root_objectid,
> > -						  owner, offset);
> > +
> > +		/*
> > +		 * we've not got the right offset and owner, so search by -1
> > +		 * here.
> > +		 */
> > +		if (root_objectid == BTRFS_DEDUP_TREE_OBJECTID)
> > +			key.offset = -1;
> 
> offset is u64

Okay.

> 
> > +		else
> > +			key.offset = hash_extent_data_ref(root_objectid,
> > +							  owner, offset);
> >  	}
> >  again:
> >  	recow = 0;
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > +static int run_locked_range(struct extent_io_tree *tree, struct inode *inode,
> > +			    struct page *locked_page, u64 start, u64 end,
> > +			    get_extent_t *get_extent, int mode,
> > +			    struct btrfs_dedup_hash *hash)
> > +{
> > +	int page_started = 0;
> > +	unsigned long nr_written = 0;
> > +	int ret;
> > +
> > +	lock_extent(tree, start, end);
> > +
> > +	/* allocate blocks */
> > +	ret = cow_file_range_dedup(inode, locked_page, start, end,
> > +				   &page_started, &nr_written, 0, hash);
> > +
> > +	/* JDM XXX */
> 
> What? :)

I'm going to just clean it up as it comes from the origial code and I
don't know what it is, either :-)

> 
> > +
> > +	/*
> > +	 * if page_started, cow_file_range inserted an
> > +	 * inline extent and took care of all the unlocking
> > +	 * and IO for us.  Otherwise, we need to submit
> > +	 * all those pages down to the drive.
> > +	 */
> > +	if (!page_started && !ret)
> > +		extent_write_locked_range(tree, inode, start, end, get_extent,
> > +					  mode);
> > +	else if (ret)
> > +		unlock_page(locked_page);
> > +
> > +	return ret;
> > +}
> > --- a/fs/btrfs/super.c
> > +++ b/fs/btrfs/super.c
> > @@ -361,6 +362,8 @@ static match_table_t tokens = {
> >  	{Opt_check_integrity_including_extent_data, "check_int_data"},
> >  	{Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
> >  	{Opt_fatal_errors, "fatal_errors=%s"},
> > +	{Opt_dedup_bs, "dedup_bs=%s"},
> > +	{Opt_dedup, "dedup"},
> 
> What's the expected usage? You have the ioctl to enable/disable dedup
> for that, so if it's enabled via a mount option it is equivalent to
> calling the 'dedup enable' after mount, ok?
> 

Not true, there're two stages we need -- dedup ioctl is used to create/delete
dedup tree, while dedup mount options is to enable/disable the dedup feature,
that is, if we don't mount with the options, dedup won't happen on writes.

> I think dedup is kind of similar to quotas, you enable them and want
> them to be persistent once enabled, or disabled upon an explicit
> request.
> 
> Setting the blocksize via mount option looks more like a temporary
> solution before we have persistent per-object properties, here it would
> be eg. "filesystem.dedup_block_size".

So the goal is that we need a flexible way to enable/disable dedup
feature.

It's true that the dedup tree is similar to quota tree -- it's persistent
once it's created, but the difference is that dedup happens on every
write and we usually don't want it to be in default use.

> 
> But once the mount option is there, it's much harder to remove it so we
> should not introduce it at all. For now I suggest to add another mode
> for the dedup ioctl to set the block size.

I agree, it's tradeoff though.

> 
> Another question is whether to enable different block sizes for
> different subvolumes, then this should be reflected in the ioctl
> interface.
> 

Yeh, I'm planning to do this after all the dedup mechanism is built up :)

> 
> >  	{Opt_err, NULL},
> >  };
> >  
> > --- a/include/uapi/linux/btrfs.h
> > +++ b/include/uapi/linux/btrfs.h
> > @@ -374,6 +374,10 @@ struct btrfs_ioctl_get_dev_stats {
> >  	__u64 unused[128 - 2 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */
> >  };
> >  
> > +/* deduplication control ioctl modes */
> > +#define BTRFS_DEDUP_CTL_REG 1
> > +#define BTRFS_DEDUP_CTL_UNREG 2
> 
> And the whole point of my mail was to tell you that for consistency with
> the userspace tools I suggest to rename the ioctl defines to
> 
> BTRFS_DEDUP_CTL_REG -> BTRFS_DEDUP_CTL_ENABLE
> BTRFS_DEDUP_CTL_UNREG -> BTRFS_DEDUP_CTL_DISABLE
> 
> but it turned out to be a bit longer :)
> 

Thanks for all the comments!

-liubo

> > +
> >  #define BTRFS_QUOTA_CTL_ENABLE	1
> >  #define BTRFS_QUOTA_CTL_DISABLE	2
> >  #define BTRFS_QUOTA_CTL_RESCAN__NOTUSED	3

next prev parent reply	other threads:[~2013-09-09  6:16 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-08-08  8:35 [RFC PATCH v6 0/5] Online data deduplication Liu Bo
2013-08-08  8:35 ` [RFC PATCH v6 1/5] Btrfs: skip merge part for delayed data refs Liu Bo
2013-08-08  8:35 ` [RFC PATCH v6 2/5] Btrfs: improve the delayed refs process in rm case Liu Bo
2013-08-08  8:35 ` [RFC PATCH v6 3/5] Btrfs: introduce a head ref rbtree Liu Bo
2013-08-08  8:35 ` [RFC PATCH v6 4/5] Btrfs: disable qgroups accounting when quata_enable is 0 Liu Bo
2013-08-08  8:35 ` [RFC PATCH v6 5/5] Btrfs: online data deduplication Liu Bo
2013-09-02 16:19   ` David Sterba
2013-09-09  6:15     ` Liu Bo [this message]
2013-08-08  8:35 ` [PATCH v2] Btrfs-progs: add dedup subcommand Liu Bo
2013-08-09 12:51   ` David Sterba
2013-08-12  2:45     ` Liu Bo
2013-09-02 16:33       ` David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130909061541.GA2134@localhost.localdomain \
    --to=bo.li.liu@oracle.com \
    --cc=dsterba@suse.cz \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).