From: Nikolay Borisov <nborisov@suse.com>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>, Jeff Mahoney <jeffm@suse.com>,
	Qu Wenruo <wqu@suse.com>,
	linux-btrfs@vger.kernel.org
Cc: dsterba@suse.cz
Subject: Re: [PATCH v2 10/10] btrfs: qgroup: Use independent and accurate per inode qgroup rsv
Date: Fri, 23 Feb 2018 10:14:15 +0200	[thread overview]
Message-ID: <23c9039f-b574-b7f1-b954-ebb48dea8740@suse.com> (raw)
In-Reply-To: <d8e5c4c8-1c20-6cfb-c34d-59fc8b857264@gmx.com>



On 23.02.2018 01:34, Qu Wenruo wrote:
> 
> 
>> On 2018-02-23 06:44, Jeff Mahoney wrote:
>> On 12/22/17 1:18 AM, Qu Wenruo wrote:
>>> Unlike the reservation calculation used in the inode rsv for
>>> metadata, qgroup doesn't really need to care about things like csum
>>> size or extent usage for whole-tree COW.
>>>
>>> Qgroup cares more about the net change in extent usage. That is to
>>> say, if we're going to insert one file extent, it will mostly find
>>> its place in an already CoWed tree block, leaving no change in
>>> extent usage. Or it may cause a leaf split, resulting in one new net
>>> extent and increasing the qgroup number by nodesize. (Or, in the
>>> even rarer case, it may increase the tree level, increasing the
>>> qgroup number by 2 * nodesize.)
>>>
>>> So qgroup doesn't need the over-estimated reservation used by the
>>> extent allocator, which cares more about accuracy and avoiding
>>> errors than about a tight bound.
>>>
>>> This patch maintains 2 new members in the btrfs_block_rsv structure
>>> for qgroup, using a much smaller calculation for the qgroup rsv and
>>> reducing false EDQUOT.
>>
>>
>> I think this is the right idea but, rather than being the last step in a
>> qgroup rework, it should be the first.
> 
> That's right.
> 
> Although putting it as the 1st patch needs extra work to cooperate
> with the later type separation.
> 
>>  Don't qgroup reservation
>> lifetimes match the block reservation lifetimes?
> 
> Also correct, but...
> 
>>  We wouldn't want a
>> global reserve and we wouldn't track usage on a per-block basis, but
>> they should otherwise match.  We already have clear APIs surrounding the
>> use of block reservations, so it seems to me that it'd make sense to
>> extend those to cover the qgroup cases as well.  Otherwise, the rest of
>> your patchset makes a parallel reservation system with a different API.
>> That keeps the current state of qgroups as a bolt-on that always needs
>> to be tracked separately (and which developers need to ensure they get
>> right).
> 
> The problem is, block reservation is designed to ensure every CoW can
> be fulfilled without error.
> 
> That is to say, for a case like a CoW write with csum, we need to
> reserve space not only for the EXTENT_DATA item in the fs tree, but
> also for the later EXTENT_ITEM in the extent tree and the CSUM_ITEM
> in the csum tree.
> 
> However, the extent tree and csum tree don't contribute to quota at
> all.  If we followed such over-reservation, it would be much, much
> easier to hit false EDQUOT early.
> 
> That's the main reason for a separate (and slightly simplified) block
> rsv tracking system.
> 
> And if there is a better solution, I'm all ears.

So looking at the code, the main hurdle seems to be the fact that the
btrfs_block_rsv_* APIs take a raw byte count, which is derived from
btrfs_calc_trans_metadata_size, which in turn is passed the number of
items we want.
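
For reference, btrfs_calc_trans_metadata_size is quite pessimistic. At
the time of this thread it looked roughly like the following
(paraphrased from ctree.h, so take the exact form as approximate):

/*
 * Each item reserves a full CoW path of BTRFS_MAX_LEVEL (8) tree
 * blocks, twice over.  On a 16K-nodesize fs one item thus reserves
 * 16K * 8 * 2 = 256K, versus the single nodesize (16K) per extent
 * that the qgroup rsv in this patch charges.
 */
static inline u64 btrfs_calc_trans_metadata_size(struct btrfs_fs_info *fs_info,
						 unsigned num_items)
{
	return (u64)fs_info->nodesize * BTRFS_MAX_LEVEL * 2 * num_items;
}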

So what if we extend the block_rsv_* APIs to take an additional
'quota_bytes' (or some such) argument representing the amount we would
like to be charged to the qgroups? This would force us to revisit
every call site of the block_rsv APIs and adjust the code so that
callers pass the correct number. It would also allow code such as:

if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)) { blah }

to be contained within the block rsv APIs. This would ensure the
lifetimes of block rsvs and qgroup reservations are always consistent.
I think this only applies to qgroup metadata reservations.
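
To illustrate, here is a rough sketch of how such an extended helper
might look. To be clear, this is hypothetical: the function name and
the 'qgroup_bytes' parameter don't exist today, and it only rearranges
what btrfs_inode_rsv_refill in this patch already does:

/* Hypothetical helper, sketch only -- not an existing kernel API. */
static int btrfs_block_rsv_add_qgroup(struct btrfs_root *root,
				      struct btrfs_block_rsv *block_rsv,
				      u64 num_bytes, u64 qgroup_bytes,
				      enum btrfs_reserve_flush_enum flush)
{
	int ret;

	/*
	 * Charge the qgroups first, so the quota handling lives in one
	 * place instead of at every call site.
	 */
	ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_bytes, true);
	if (ret)
		return ret;

	ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
	if (ret) {
		/* Undo the qgroup charge if the space reservation failed. */
		btrfs_qgroup_free_meta_prealloc(root, qgroup_bytes);
		return ret;
	}

	block_rsv_add_bytes(block_rsv, num_bytes, 0);
	return 0;
}

Callers would then pass both the item-based byte count and the much
smaller qgroup figure (e.g. outstanding_extents * nodesize), and the
QUOTA_ENABLED test disappears from the call sites entirely.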




> 
> Thanks,
> Qu
> 
>>
>> -Jeff
>>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> ---
>>>  fs/btrfs/ctree.h       | 18 +++++++++++++++++
>>>  fs/btrfs/extent-tree.c | 55 ++++++++++++++++++++++++++++++++++++++++----------
>>>  2 files changed, 62 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index 0c58f92c2d44..97783ba91e00 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -467,6 +467,24 @@ struct btrfs_block_rsv {
>>>  	unsigned short full;
>>>  	unsigned short type;
>>>  	unsigned short failfast;
>>> +
>>> +	/*
>>> +	 * Qgroup equivalent of @size and @reserved.
>>> +	 *
>>> +	 * Unlike the normal @size/@reserved of the inode rsv, qgroup
>>> +	 * doesn't care about things like csum size or how many tree
>>> +	 * blocks will need to be reserved.
>>> +	 *
>>> +	 * Qgroup cares more about the *NET* change of extent usage.
>>> +	 * Even in the worst case, one newly inserted file extent only
>>> +	 * causes a leaf split plus a level increase, so nodesize per
>>> +	 * file extent is already an overestimate.
>>> +	 *
>>> +	 * In short, qgroup_rsv_size/qgroup_rsv_reserved is the upper
>>> +	 * limit of the qgroup metadata reservation that may be needed.
>>> +	 */
>>> +	u64 qgroup_rsv_size;
>>> +	u64 qgroup_rsv_reserved;
>>>  };
>>>  
>>>  /*
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 986660f301de..9d579c7bcf7f 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -5565,14 +5565,18 @@ static void space_info_add_new_bytes(struct btrfs_fs_info *fs_info,
>>>  
>>>  static u64 block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
>>>  				    struct btrfs_block_rsv *block_rsv,
>>> -				    struct btrfs_block_rsv *dest, u64 num_bytes)
>>> +				    struct btrfs_block_rsv *dest, u64 num_bytes,
>>> +				    u64 *qgroup_to_release_ret)
>>>  {
>>>  	struct btrfs_space_info *space_info = block_rsv->space_info;
>>> +	u64 qgroup_to_release = 0;
>>>  	u64 ret;
>>>  
>>>  	spin_lock(&block_rsv->lock);
>>> -	if (num_bytes == (u64)-1)
>>> +	if (num_bytes == (u64)-1) {
>>>  		num_bytes = block_rsv->size;
>>> +		qgroup_to_release = block_rsv->qgroup_rsv_size;
>>> +	}
>>>  	block_rsv->size -= num_bytes;
>>>  	if (block_rsv->reserved >= block_rsv->size) {
>>>  		num_bytes = block_rsv->reserved - block_rsv->size;
>>> @@ -5581,6 +5585,12 @@ static u64 block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
>>>  	} else {
>>>  		num_bytes = 0;
>>>  	}
>>> +	if (block_rsv->qgroup_rsv_reserved >= block_rsv->qgroup_rsv_size) {
>>> +		qgroup_to_release = block_rsv->qgroup_rsv_reserved -
>>> +				    block_rsv->qgroup_rsv_size;
>>> +		block_rsv->qgroup_rsv_reserved = block_rsv->qgroup_rsv_size;
>>> +	} else
>>> +		qgroup_to_release = 0;
>>>  	spin_unlock(&block_rsv->lock);
>>>  
>>>  	ret = num_bytes;
>>> @@ -5603,6 +5613,8 @@ static u64 block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
>>>  			space_info_add_old_bytes(fs_info, space_info,
>>>  						 num_bytes);
>>>  	}
>>> +	if (qgroup_to_release_ret)
>>> +		*qgroup_to_release_ret = qgroup_to_release;
>>>  	return ret;
>>>  }
>>>  
>>> @@ -5744,17 +5756,21 @@ int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
>>>  	struct btrfs_root *root = inode->root;
>>>  	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
>>>  	u64 num_bytes = 0;
>>> +	u64 qgroup_num_bytes = 0;
>>>  	int ret = -ENOSPC;
>>>  
>>>  	spin_lock(&block_rsv->lock);
>>>  	if (block_rsv->reserved < block_rsv->size)
>>>  		num_bytes = block_rsv->size - block_rsv->reserved;
>>> +	if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size)
>>> +		qgroup_num_bytes = block_rsv->qgroup_rsv_size -
>>> +				   block_rsv->qgroup_rsv_reserved;
>>>  	spin_unlock(&block_rsv->lock);
>>>  
>>>  	if (num_bytes == 0)
>>>  		return 0;
>>>  
>>> -	ret = btrfs_qgroup_reserve_meta_prealloc(root, num_bytes, true);
>>> +	ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true);
>>>  	if (ret)
>>>  		return ret;
>>>  	ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
>>> @@ -5762,7 +5778,13 @@ int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
>>>  		block_rsv_add_bytes(block_rsv, num_bytes, 0);
>>>  		trace_btrfs_space_reservation(root->fs_info, "delalloc",
>>>  					      btrfs_ino(inode), num_bytes, 1);
>>> -	}
>>> +
>>> +		/* Don't forget to increase qgroup_rsv_reserved */
>>> +		spin_lock(&block_rsv->lock);
>>> +		block_rsv->qgroup_rsv_reserved += qgroup_num_bytes;
>>> +		spin_unlock(&block_rsv->lock);
>>> +	} else
>>> +		btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
>>>  	return ret;
>>>  }
>>>  
>>> @@ -5784,20 +5806,23 @@ void btrfs_inode_rsv_release(struct btrfs_inode *inode, bool qgroup_free)
>>>  	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
>>>  	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
>>>  	u64 released = 0;
>>> +	u64 qgroup_to_release = 0;
>>>  
>>>  	/*
>>>  	 * Since we statically set the block_rsv->size we just want to say we
>>>  	 * are releasing 0 bytes, and then we'll just get the reservation over
>>>  	 * the size free'd.
>>>  	 */
>>> -	released = block_rsv_release_bytes(fs_info, block_rsv, global_rsv, 0);
>>> +	released = block_rsv_release_bytes(fs_info, block_rsv, global_rsv, 0,
>>> +					   &qgroup_to_release);
>>>  	if (released > 0)
>>>  		trace_btrfs_space_reservation(fs_info, "delalloc",
>>>  					      btrfs_ino(inode), released, 0);
>>>  	if (qgroup_free)
>>> -		btrfs_qgroup_free_meta_prealloc(inode->root, released);
>>> +		btrfs_qgroup_free_meta_prealloc(inode->root, qgroup_to_release);
>>>  	else
>>> -		btrfs_qgroup_convert_reserved_meta(inode->root, released);
>>> +		btrfs_qgroup_convert_reserved_meta(inode->root,
>>> +						   qgroup_to_release);
>>>  }
>>>  
>>>  void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
>>> @@ -5809,7 +5834,7 @@ void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
>>>  	if (global_rsv == block_rsv ||
>>>  	    block_rsv->space_info != global_rsv->space_info)
>>>  		global_rsv = NULL;
>>> -	block_rsv_release_bytes(fs_info, block_rsv, global_rsv, num_bytes);
>>> +	block_rsv_release_bytes(fs_info, block_rsv, global_rsv, num_bytes, NULL);
>>>  }
>>>  
>>>  static void update_global_block_rsv(struct btrfs_fs_info *fs_info)
>>> @@ -5889,7 +5914,7 @@ static void init_global_block_rsv(struct btrfs_fs_info *fs_info)
>>>  static void release_global_block_rsv(struct btrfs_fs_info *fs_info)
>>>  {
>>>  	block_rsv_release_bytes(fs_info, &fs_info->global_block_rsv, NULL,
>>> -				(u64)-1);
>>> +				(u64)-1, NULL);
>>>  	WARN_ON(fs_info->trans_block_rsv.size > 0);
>>>  	WARN_ON(fs_info->trans_block_rsv.reserved > 0);
>>>  	WARN_ON(fs_info->chunk_block_rsv.size > 0);
>>> @@ -5931,7 +5956,7 @@ void btrfs_trans_release_chunk_metadata(struct btrfs_trans_handle *trans)
>>>  	WARN_ON_ONCE(!list_empty(&trans->new_bgs));
>>>  
>>>  	block_rsv_release_bytes(fs_info, &fs_info->chunk_block_rsv, NULL,
>>> -				trans->chunk_bytes_reserved);
>>> +				trans->chunk_bytes_reserved, NULL);
>>>  	trans->chunk_bytes_reserved = 0;
>>>  }
>>>  
>>> @@ -6036,6 +6061,7 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
>>>  {
>>>  	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
>>>  	u64 reserve_size = 0;
>>> +	u64 qgroup_rsv_size = 0;
>>>  	u64 csum_leaves;
>>>  	unsigned outstanding_extents;
>>>  
>>> @@ -6048,9 +6074,16 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
>>>  						 inode->csum_bytes);
>>>  	reserve_size += btrfs_calc_trans_metadata_size(fs_info,
>>>  						       csum_leaves);
>>> +	/*
>>> +	 * For the qgroup rsv, the calculation is very simple:
>>> +	 * nodesize for each outstanding extent.
>>> +	 * This is already an overestimate in most cases.
>>> +	 */
>>> +	qgroup_rsv_size = outstanding_extents * fs_info->nodesize;
>>>  
>>>  	spin_lock(&block_rsv->lock);
>>>  	block_rsv->size = reserve_size;
>>> +	block_rsv->qgroup_rsv_size = qgroup_rsv_size;
>>>  	spin_unlock(&block_rsv->lock);
>>>  }
>>>  
>>> @@ -8405,7 +8438,7 @@ static void unuse_block_rsv(struct btrfs_fs_info *fs_info,
>>>  			    struct btrfs_block_rsv *block_rsv, u32 blocksize)
>>>  {
>>>  	block_rsv_add_bytes(block_rsv, blocksize, 0);
>>> -	block_rsv_release_bytes(fs_info, block_rsv, NULL, 0);
>>> +	block_rsv_release_bytes(fs_info, block_rsv, NULL, 0, NULL);
>>>  }
>>>  
>>>  /*
>>>
>>
>>
> 


Thread overview: 31+ messages
2017-12-22  6:18 [PATCH v2 00/10] Use split qgroup rsv type Qu Wenruo
2017-12-22  6:18 ` [PATCH v2 01/10] btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans Qu Wenruo
2017-12-22  6:18 ` [PATCH v2 02/10] btrfs: qgroup: Don't use root->qgroup_meta_rsv for qgroup Qu Wenruo
2017-12-22  6:18 ` [PATCH v2 03/10] btrfs: qgroup: Introduce function to convert META_PREALLOC into META_PERTRANS Qu Wenruo
2017-12-22  6:18 ` [PATCH v2 04/10] btrfs: qgroup: Use separate meta reservation type for delalloc Qu Wenruo
2017-12-26  5:37   ` [PATCH v2.2 " Qu Wenruo
2017-12-26  5:40     ` Qu Wenruo
2017-12-26  7:10       ` Lakshmipathi.G
2017-12-22  6:18 ` [PATCH v2 05/10] btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item Qu Wenruo
2017-12-22  6:18 ` [PATCH v2 06/10] btrfs: qgroup: Use root->qgroup_meta_rsv_* to record qgroup meta reserved space Qu Wenruo
2017-12-22  6:18 ` [PATCH v2 07/10] btrfs: qgroup: Update trace events for metadata reservation Qu Wenruo
2017-12-22  6:18 ` [PATCH v2 08/10] Revert "btrfs: qgroups: Retry after commit on getting EDQUOT" Qu Wenruo
2017-12-22  6:18 ` [PATCH v2 09/10] btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT Qu Wenruo
2017-12-22  8:06   ` [PATCH v2.1 " Qu Wenruo
2017-12-22  6:18 ` [PATCH v2 10/10] btrfs: qgroup: Use independent and accurate per inode qgroup rsv Qu Wenruo
2018-02-22 22:44   ` Jeff Mahoney
2018-02-22 23:34     ` Qu Wenruo
2018-02-23  8:14       ` Nikolay Borisov [this message]
2018-02-23  9:06         ` Qu Wenruo
2018-02-23 11:00           ` Nikolay Borisov
2018-02-23 11:22             ` Qu Wenruo
2018-02-23 14:43       ` Jeff Mahoney
2018-04-03  7:30   ` Qu Wenruo
2018-04-04  8:53     ` Nikolay Borisov
2018-04-04 12:17       ` Qu Wenruo
2018-04-12  0:03         ` Omar Sandoval
2018-04-12 12:46           ` David Sterba
2018-04-12 13:13     ` David Sterba
2018-04-16  7:53       ` Misono Tomohiro
2018-04-16 17:27         ` David Sterba
2018-04-17  0:14           ` Qu Wenruo
