[PATCH v4] btrfs: qgroup: Fix qgroup accounting when creating snapshot

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v4] btrfs: qgroup: Fix qgroup accounting when creating snapshot
@ 2016-04-15  9:08 Qu Wenruo
  2016-04-19 22:19 ` Mark Fasheh
  2016-04-22 18:12 ` Josef Bacik
  0 siblings, 2 replies; 18+ messages in thread
From: Qu Wenruo @ 2016-04-15  9:08 UTC (permalink / raw)
  To: linux-btrfs, mfasheh, fdmanana

Current btrfs qgroup design implies a requirement that after calling
btrfs_qgroup_account_extents() there must be a commit root switch.

Normally this is OK, as btrfs_qgroup_accounting_extents() is only called
inside btrfs_commit_transaction() just be commit_cowonly_roots().

However there is a exception at create_pending_snapshot(), which will
call btrfs_qgroup_account_extents() but no any commit root switch.

In case of creating a snapshot whose parent root is itself (create a
snapshot of fs tree), it will corrupt qgroup by the following trace:
(skipped unrelated data)
======
btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
======

The problem here is in first qgroup_account_extent(), the
nr_new_roots of the extent is 1, which means its reference got
increased, and qgroup increased its rfer and excl.

But at second qgroup_account_extent(), its reference got decreased, but
between these two qgroup_account_extent(), there is no switch roots.
This leads to the same nr_old_roots, and this extent just got ignored by
qgroup, which means this extent is wrongly accounted.

Fix it by call commit_cowonly_roots() after qgroup_account_extent() in
create_pending_snapshot(), with needed preparation.

Reported-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
v2:
  Fix a soft lockup caused by missing switch_commit_root() call.
  Fix a warning caused by dirty-but-not-committed root.
v3:
  Fix a bug which will cause qgroup accounting for dropping snapshot
  wrong
v4:
  Fix a bug caused by non-cowed btree modification.

To Filipe:
  I'm sorry I didn't wait for your reply on the dropped roots.
  I reverted back the version where we deleted dropped roots in
  switch_commit_roots().

  As I think as long as we called btrfs_qgroup_prepare_account_extents()
  and btrfs_qgroup_account_extents(), it should have already accounted
  extents for dropped roots, and then we are OK to drop them.

  It would be very nice if you could point out what I missed.
  Thanks
  Qu
---
 fs/btrfs/transaction.c | 117 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 93 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 43885e5..92f8193 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -311,10 +311,11 @@ loop:
  * when the transaction commits
  */
 static int record_root_in_trans(struct btrfs_trans_handle *trans,
-			       struct btrfs_root *root)
+			       struct btrfs_root *root,
+			       int force)
 {
-	if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
-	    root->last_trans < trans->transid) {
+	if ((test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
+	    root->last_trans < trans->transid) || force) {
 		WARN_ON(root == root->fs_info->extent_root);
 		WARN_ON(root->commit_root != root->node);
 
@@ -331,7 +332,7 @@ static int record_root_in_trans(struct btrfs_trans_handle *trans,
 		smp_wmb();
 
 		spin_lock(&root->fs_info->fs_roots_radix_lock);
-		if (root->last_trans == trans->transid) {
+		if (root->last_trans == trans->transid && !force) {
 			spin_unlock(&root->fs_info->fs_roots_radix_lock);
 			return 0;
 		}
@@ -402,7 +403,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
 		return 0;
 
 	mutex_lock(&root->fs_info->reloc_mutex);
-	record_root_in_trans(trans, root);
+	record_root_in_trans(trans, root, 0);
 	mutex_unlock(&root->fs_info->reloc_mutex);
 
 	return 0;
@@ -1311,6 +1312,80 @@ int btrfs_defrag_root(struct btrfs_root *root)
 }
 
 /*
+ * Do all special snapshot related qgroup dirty hack.
+ *
+ * Will do all needed qgroup inherit and dirty hack like switch commit
+ * roots inside one transaction and write all btree into disk, to make
+ * qgroup works.
+ */
+static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
+				   struct btrfs_root *src,
+				   struct btrfs_root *parent,
+				   struct btrfs_qgroup_inherit *inherit,
+				   u64 dst_objectid)
+{
+	struct btrfs_fs_info *fs_info = src->fs_info;
+	int ret;
+
+	/*
+	 * We are going to commit transaction, see btrfs_commit_transaction()
+	 * comment for reason locking tree_log_mutex
+	 */
+	mutex_lock(&fs_info->tree_log_mutex);
+
+	ret = commit_fs_roots(trans, src);
+	if (ret)
+		goto out;
+	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
+	if (ret < 0)
+		goto out;
+	ret = btrfs_qgroup_account_extents(trans, fs_info);
+	if (ret < 0)
+		goto out;
+
+	/* Now qgroup are all updated, we can inherit it to new qgroups */
+	ret = btrfs_qgroup_inherit(trans, fs_info,
+				   src->root_key.objectid, dst_objectid,
+				   inherit);
+	if (ret < 0)
+		goto out;
+
+	/*
+	 * Now we do a simplified commit transaction, which will:
+	 * 1) commit all subvolume and extent tree
+	 *    To ensure all subvolume and extent tree have a valid
+	 *    commit_root to accounting later insert_dir_item()
+	 * 2) write all btree blocks onto disk
+	 *    This is to make sure later btree modification will be cowed
+	 *    Or commit_root can be populated and cause wrong qgroup numbers
+	 * In this simplified commit, we don't really care about other trees
+	 * like chunk and root tree, as they won't affect qgroup.
+	 * And we don't write super to avoid half committed status.
+	 */
+	ret = commit_cowonly_roots(trans, src);
+	if (ret)
+		goto out;
+	switch_commit_roots(trans->transaction, fs_info);
+	ret = btrfs_write_and_wait_transaction(trans, src);
+	if (ret)
+		btrfs_std_error(fs_info, ret,
+			"Error while writing out transaction for qgroup");
+
+out:
+	mutex_unlock(&fs_info->tree_log_mutex);
+
+	/*
+	 * Force parent root to be updated, as we recorded it before so its
+	 * last_trans == cur_transid.
+	 * Or it won't be committed again onto disk after later
+	 * insert_dir_item()
+	 */
+	if (!ret)
+		record_root_in_trans(trans, parent, 1);
+	return ret;
+}
+
+/*
  * new snapshots need to be created at a very specific time in the
  * transaction commit.  This does the actual creation.
  *
@@ -1383,7 +1458,7 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 	dentry = pending->dentry;
 	parent_inode = pending->dir;
 	parent_root = BTRFS_I(parent_inode)->root;
-	record_root_in_trans(trans, parent_root);
+	record_root_in_trans(trans, parent_root, 0);
 
 	cur_time = current_fs_time(parent_inode->i_sb);
 
@@ -1420,7 +1495,7 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 		goto fail;
 	}
 
-	record_root_in_trans(trans, root);
+	record_root_in_trans(trans, root, 0);
 	btrfs_set_root_last_snapshot(&root->root_item, trans->transid);
 	memcpy(new_root_item, &root->root_item, sizeof(*new_root_item));
 	btrfs_check_and_init_root_item(new_root_item);
@@ -1516,6 +1591,17 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 		goto fail;
 	}
 
+	/*
+	 * Do special qgroup accounting for snapshot, as we do some qgroup
+	 * snapshot hack to do fast snapshot.
+	 * To co-operate with that hack, we do hack again.
+	 * Or snapshot will be greatly slowed down by a subtree qgroup rescan
+	 */
+	ret = qgroup_account_snapshot(trans, root, parent_root,
+				      pending->inherit, objectid);
+	if (ret < 0)
+		goto fail;
+
 	ret = btrfs_insert_dir_item(trans, parent_root,
 				    dentry->d_name.name, dentry->d_name.len,
 				    parent_inode, &key,
@@ -1559,23 +1645,6 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 		goto fail;
 	}
 
-	/*
-	 * account qgroup counters before qgroup_inherit()
-	 */
-	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
-	if (ret)
-		goto fail;
-	ret = btrfs_qgroup_account_extents(trans, fs_info);
-	if (ret)
-		goto fail;
-	ret = btrfs_qgroup_inherit(trans, fs_info,
-				   root->root_key.objectid,
-				   objectid, pending->inherit);
-	if (ret) {
-		btrfs_abort_transaction(trans, root, ret);
-		goto fail;
-	}
-
 fail:
 	pending->error = ret;
 dir_item_existed:
-- 
2.8.0



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kerne

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v4] btrfs: qgroup: Fix qgroup accounting when creating snapshot
  2016-04-15  9:08 [PATCH v4] btrfs: qgroup: Fix qgroup accounting when creating snapshot Qu Wenruo
@ 2016-04-19 22:19 ` Mark Fasheh
  2016-04-20 14:25   ` Holger Hoffstätte
  2016-04-22 18:12 ` Josef Bacik
  1 sibling, 1 reply; 18+ messages in thread
From: Mark Fasheh @ 2016-04-19 22:19 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, fdmanana

On Fri, Apr 15, 2016 at 05:08:22PM +0800, Qu Wenruo wrote:
> Current btrfs qgroup design implies a requirement that after calling
> btrfs_qgroup_account_extents() there must be a commit root switch.
> 
> Normally this is OK, as btrfs_qgroup_accounting_extents() is only called
> inside btrfs_commit_transaction() just be commit_cowonly_roots().
> 
> However there is a exception at create_pending_snapshot(), which will
> call btrfs_qgroup_account_extents() but no any commit root switch.
> 
> In case of creating a snapshot whose parent root is itself (create a
> snapshot of fs tree), it will corrupt qgroup by the following trace:
> (skipped unrelated data)
> ======
> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
> ======
> 
> The problem here is in first qgroup_account_extent(), the
> nr_new_roots of the extent is 1, which means its reference got
> increased, and qgroup increased its rfer and excl.
> 
> But at second qgroup_account_extent(), its reference got decreased, but
> between these two qgroup_account_extent(), there is no switch roots.
> This leads to the same nr_old_roots, and this extent just got ignored by
> qgroup, which means this extent is wrongly accounted.
> 
> Fix it by call commit_cowonly_roots() after qgroup_account_extent() in
> create_pending_snapshot(), with needed preparation.
> 
> Reported-by: Mark Fasheh <mfasheh@suse.de>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>

Ok, this version seems to be giving me the right numbers. I'll send a test
for it shortly. I'd still like to know if this patch introduces an
unintended side effects but otherwise, thanks Qu.
	--Mark

--
Mark Fasheh

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4] btrfs: qgroup: Fix qgroup accounting when creating snapshot
  2016-04-19 22:19 ` Mark Fasheh
@ 2016-04-20 14:25   ` Holger Hoffstätte
  0 siblings, 0 replies; 18+ messages in thread
From: Holger Hoffstätte @ 2016-04-20 14:25 UTC (permalink / raw)
  To: linux-btrfs

On Tue, 19 Apr 2016 15:19:46 -0700, Mark Fasheh wrote:

> On Fri, Apr 15, 2016 at 05:08:22PM +0800, Qu Wenruo wrote:
>> Current btrfs qgroup design implies a requirement that after calling
>> btrfs_qgroup_account_extents() there must be a commit root switch.
>> 
>> Normally this is OK, as btrfs_qgroup_accounting_extents() is only called
>> inside btrfs_commit_transaction() just be commit_cowonly_roots().
>> 
>> However there is a exception at create_pending_snapshot(), which will
>> call btrfs_qgroup_account_extents() but no any commit root switch.
>> 
>> In case of creating a snapshot whose parent root is itself (create a
>> snapshot of fs tree), it will corrupt qgroup by the following trace:
>> (skipped unrelated data)
>> ======
>> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
>> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
>> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
>> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
>> ======
>> 
>> The problem here is in first qgroup_account_extent(), the
>> nr_new_roots of the extent is 1, which means its reference got
>> increased, and qgroup increased its rfer and excl.
>> 
>> But at second qgroup_account_extent(), its reference got decreased, but
>> between these two qgroup_account_extent(), there is no switch roots.
>> This leads to the same nr_old_roots, and this extent just got ignored by
>> qgroup, which means this extent is wrongly accounted.
>> 
>> Fix it by call commit_cowonly_roots() after qgroup_account_extent() in
>> create_pending_snapshot(), with needed preparation.
>> 
>> Reported-by: Mark Fasheh <mfasheh@suse.de>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> 
> Ok, this version seems to be giving me the right numbers. I'll send a test
> for it shortly. I'd still like to know if this patch introduces an
> unintended side effects but otherwise, thanks Qu.
> 	--Mark

Hi Mark,

Can't speak about other side effects since I have not observed any so far,
but I can confirm that the previously failing case of deleting a renamed
snapshot [1] now works properly with v4 without getting the commit roots
in a twist. So:

Tested-by: holger.hoffstaette@googlemail.com

cheers
Holger

[1] http://thread.gmane.org/gmane.comp.file-systems.btrfs/55052


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4] btrfs: qgroup: Fix qgroup accounting when creating snapshot
  2016-04-15  9:08 [PATCH v4] btrfs: qgroup: Fix qgroup accounting when creating snapshot Qu Wenruo
  2016-04-19 22:19 ` Mark Fasheh
@ 2016-04-22 18:12 ` Josef Bacik
  2016-04-22 18:21   ` Mark Fasheh
  2016-05-11 16:57   ` Mark Fasheh
  1 sibling, 2 replies; 18+ messages in thread
From: Josef Bacik @ 2016-04-22 18:12 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs, mfasheh, fdmanana

On 04/15/2016 05:08 AM, Qu Wenruo wrote:
> Current btrfs qgroup design implies a requirement that after calling
> btrfs_qgroup_account_extents() there must be a commit root switch.
>
> Normally this is OK, as btrfs_qgroup_accounting_extents() is only called
> inside btrfs_commit_transaction() just be commit_cowonly_roots().
>
> However there is a exception at create_pending_snapshot(), which will
> call btrfs_qgroup_account_extents() but no any commit root switch.
>
> In case of creating a snapshot whose parent root is itself (create a
> snapshot of fs tree), it will corrupt qgroup by the following trace:
> (skipped unrelated data)
> ======
> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
> ======
>
> The problem here is in first qgroup_account_extent(), the
> nr_new_roots of the extent is 1, which means its reference got
> increased, and qgroup increased its rfer and excl.
>
> But at second qgroup_account_extent(), its reference got decreased, but
> between these two qgroup_account_extent(), there is no switch roots.
> This leads to the same nr_old_roots, and this extent just got ignored by
> qgroup, which means this extent is wrongly accounted.
>
> Fix it by call commit_cowonly_roots() after qgroup_account_extent() in
> create_pending_snapshot(), with needed preparation.
>
> Reported-by: Mark Fasheh <mfasheh@suse.de>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
> v2:
>    Fix a soft lockup caused by missing switch_commit_root() call.
>    Fix a warning caused by dirty-but-not-committed root.
> v3:
>    Fix a bug which will cause qgroup accounting for dropping snapshot
>    wrong
> v4:
>    Fix a bug caused by non-cowed btree modification.
>
> To Filipe:
>    I'm sorry I didn't wait for your reply on the dropped roots.
>    I reverted back the version where we deleted dropped roots in
>    switch_commit_roots().
>
>    As I think as long as we called btrfs_qgroup_prepare_account_extents()
>    and btrfs_qgroup_account_extents(), it should have already accounted
>    extents for dropped roots, and then we are OK to drop them.
>
>    It would be very nice if you could point out what I missed.
>    Thanks
>    Qu
> ---
>   fs/btrfs/transaction.c | 117 +++++++++++++++++++++++++++++++++++++++----------
>   1 file changed, 93 insertions(+), 24 deletions(-)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 43885e5..92f8193 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -311,10 +311,11 @@ loop:
>    * when the transaction commits
>    */
>   static int record_root_in_trans(struct btrfs_trans_handle *trans,
> -			       struct btrfs_root *root)
> +			       struct btrfs_root *root,
> +			       int force)
>   {
> -	if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
> -	    root->last_trans < trans->transid) {
> +	if ((test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
> +	    root->last_trans < trans->transid) || force) {
>   		WARN_ON(root == root->fs_info->extent_root);
>   		WARN_ON(root->commit_root != root->node);
>
> @@ -331,7 +332,7 @@ static int record_root_in_trans(struct btrfs_trans_handle *trans,
>   		smp_wmb();
>
>   		spin_lock(&root->fs_info->fs_roots_radix_lock);
> -		if (root->last_trans == trans->transid) {
> +		if (root->last_trans == trans->transid && !force) {
>   			spin_unlock(&root->fs_info->fs_roots_radix_lock);
>   			return 0;
>   		}
> @@ -402,7 +403,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
>   		return 0;
>
>   	mutex_lock(&root->fs_info->reloc_mutex);
> -	record_root_in_trans(trans, root);
> +	record_root_in_trans(trans, root, 0);
>   	mutex_unlock(&root->fs_info->reloc_mutex);
>
>   	return 0;
> @@ -1311,6 +1312,80 @@ int btrfs_defrag_root(struct btrfs_root *root)
>   }
>
>   /*
> + * Do all special snapshot related qgroup dirty hack.
> + *
> + * Will do all needed qgroup inherit and dirty hack like switch commit
> + * roots inside one transaction and write all btree into disk, to make
> + * qgroup works.
> + */
> +static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
> +				   struct btrfs_root *src,
> +				   struct btrfs_root *parent,
> +				   struct btrfs_qgroup_inherit *inherit,
> +				   u64 dst_objectid)
> +{
> +	struct btrfs_fs_info *fs_info = src->fs_info;
> +	int ret;
> +
> +	/*
> +	 * We are going to commit transaction, see btrfs_commit_transaction()
> +	 * comment for reason locking tree_log_mutex
> +	 */
> +	mutex_lock(&fs_info->tree_log_mutex);
> +
> +	ret = commit_fs_roots(trans, src);
> +	if (ret)
> +		goto out;
> +	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
> +	if (ret < 0)
> +		goto out;
> +	ret = btrfs_qgroup_account_extents(trans, fs_info);
> +	if (ret < 0)
> +		goto out;
> +
> +	/* Now qgroup are all updated, we can inherit it to new qgroups */
> +	ret = btrfs_qgroup_inherit(trans, fs_info,
> +				   src->root_key.objectid, dst_objectid,
> +				   inherit);
> +	if (ret < 0)
> +		goto out;
> +
> +	/*
> +	 * Now we do a simplified commit transaction, which will:
> +	 * 1) commit all subvolume and extent tree
> +	 *    To ensure all subvolume and extent tree have a valid
> +	 *    commit_root to accounting later insert_dir_item()
> +	 * 2) write all btree blocks onto disk
> +	 *    This is to make sure later btree modification will be cowed
> +	 *    Or commit_root can be populated and cause wrong qgroup numbers
> +	 * In this simplified commit, we don't really care about other trees
> +	 * like chunk and root tree, as they won't affect qgroup.
> +	 * And we don't write super to avoid half committed status.
> +	 */
> +	ret = commit_cowonly_roots(trans, src);
> +	if (ret)
> +		goto out;
> +	switch_commit_roots(trans->transaction, fs_info);
> +	ret = btrfs_write_and_wait_transaction(trans, src);
> +	if (ret)
> +		btrfs_std_error(fs_info, ret,
> +			"Error while writing out transaction for qgroup");
> +
> +out:
> +	mutex_unlock(&fs_info->tree_log_mutex);
> +
> +	/*
> +	 * Force parent root to be updated, as we recorded it before so its
> +	 * last_trans == cur_transid.
> +	 * Or it won't be committed again onto disk after later
> +	 * insert_dir_item()
> +	 */
> +	if (!ret)
> +		record_root_in_trans(trans, parent, 1);
> +	return ret;
> +}

NACK, holy shit we aren't adding a special transaction commit only for 
qgroup snapshots.  Figure out a different way.  Thanks,

Josef


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4] btrfs: qgroup: Fix qgroup accounting when creating snapshot
  2016-04-22 18:12 ` Josef Bacik
@ 2016-04-22 18:21   ` Mark Fasheh
  2016-04-22 18:23     ` Josef Bacik
  2016-05-11 16:57   ` Mark Fasheh
  1 sibling, 1 reply; 18+ messages in thread
From: Mark Fasheh @ 2016-04-22 18:21 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Qu Wenruo, linux-btrfs, fdmanana

On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote:
> On 04/15/2016 05:08 AM, Qu Wenruo wrote:
> >+	/*
> >+	 * Force parent root to be updated, as we recorded it before so its
> >+	 * last_trans == cur_transid.
> >+	 * Or it won't be committed again onto disk after later
> >+	 * insert_dir_item()
> >+	 */
> >+	if (!ret)
> >+		record_root_in_trans(trans, parent, 1);
> >+	return ret;
> >+}
> 
> NACK, holy shit we aren't adding a special transaction commit only
> for qgroup snapshots.  Figure out a different way.  Thanks,

Yeah I saw that. To be fair, we run a whole lot of the transaction stuff
multiple times (at least from my reading) so I'm really unclear on what the
performance impact is.

Do you have any suggestion though? We've been banging our heads against this
for a while now and as slow as this patch might be, it actually works where
nothing else has so far.
	--Mark

--
Mark Fasheh

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4] btrfs: qgroup: Fix qgroup accounting when creating snapshot
  2016-04-22 18:21   ` Mark Fasheh
@ 2016-04-22 18:23     ` Josef Bacik
  2016-04-22 18:29       ` Mark Fasheh
  2016-04-25  0:56       ` Qu Wenruo
  0 siblings, 2 replies; 18+ messages in thread
From: Josef Bacik @ 2016-04-22 18:23 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: Qu Wenruo, linux-btrfs, fdmanana

On 04/22/2016 02:21 PM, Mark Fasheh wrote:
> On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote:
>> On 04/15/2016 05:08 AM, Qu Wenruo wrote:
>>> +	/*
>>> +	 * Force parent root to be updated, as we recorded it before so its
>>> +	 * last_trans == cur_transid.
>>> +	 * Or it won't be committed again onto disk after later
>>> +	 * insert_dir_item()
>>> +	 */
>>> +	if (!ret)
>>> +		record_root_in_trans(trans, parent, 1);
>>> +	return ret;
>>> +}
>>
>> NACK, holy shit we aren't adding a special transaction commit only
>> for qgroup snapshots.  Figure out a different way.  Thanks,
>
> Yeah I saw that. To be fair, we run a whole lot of the transaction stuff
> multiple times (at least from my reading) so I'm really unclear on what the
> performance impact is.
>
> Do you have any suggestion though? We've been banging our heads against this
> for a while now and as slow as this patch might be, it actually works where
> nothing else has so far.

I'm less concerned about committing another transaction and more 
concerned about the fact that it is an special variant of the 
transaction commit.  If this goes wrong, or at some point in the future 
we fail to update it along with btrfs_transaction_commit we suddenly are 
corrupting metadata.  If we have to commit a transaction then call 
btrfs_commit_transaction(), don't open code a stripped down version, 
here be dragons.  Thanks,

Josef


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4] btrfs: qgroup: Fix qgroup accounting when creating snapshot
  2016-04-22 18:23     ` Josef Bacik
@ 2016-04-22 18:29       ` Mark Fasheh
  2016-04-25  0:56       ` Qu Wenruo
  1 sibling, 0 replies; 18+ messages in thread
From: Mark Fasheh @ 2016-04-22 18:29 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Qu Wenruo, linux-btrfs, fdmanana

On Fri, Apr 22, 2016 at 02:23:59PM -0400, Josef Bacik wrote:
> On 04/22/2016 02:21 PM, Mark Fasheh wrote:
> >On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote:
> >>On 04/15/2016 05:08 AM, Qu Wenruo wrote:
> >>>+	/*
> >>>+	 * Force parent root to be updated, as we recorded it before so its
> >>>+	 * last_trans == cur_transid.
> >>>+	 * Or it won't be committed again onto disk after later
> >>>+	 * insert_dir_item()
> >>>+	 */
> >>>+	if (!ret)
> >>>+		record_root_in_trans(trans, parent, 1);
> >>>+	return ret;
> >>>+}
> >>
> >>NACK, holy shit we aren't adding a special transaction commit only
> >>for qgroup snapshots.  Figure out a different way.  Thanks,
> >
> >Yeah I saw that. To be fair, we run a whole lot of the transaction stuff
> >multiple times (at least from my reading) so I'm really unclear on what the
> >performance impact is.
> >
> >Do you have any suggestion though? We've been banging our heads against this
> >for a while now and as slow as this patch might be, it actually works where
> >nothing else has so far.
> 
> I'm less concerned about committing another transaction and more
> concerned about the fact that it is an special variant of the
> transaction commit.  If this goes wrong, or at some point in the
> future we fail to update it along with btrfs_transaction_commit we
> suddenly are corrupting metadata.  If we have to commit a
> transaction then call btrfs_commit_transaction(), don't open code a
> stripped down version, here be dragons.  Thanks,

Ok yeah that makes perfect sense - I thought you were telling me that this
would be a big performance problem.
	--Mark

--
Mark Fasheh

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4] btrfs: qgroup: Fix qgroup accounting when creating snapshot
  2016-04-22 18:23     ` Josef Bacik
  2016-04-22 18:29       ` Mark Fasheh
@ 2016-04-25  0:56       ` Qu Wenruo
  2016-04-25 14:24         ` Josef Bacik
  1 sibling, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2016-04-25  0:56 UTC (permalink / raw)
  To: Josef Bacik, Mark Fasheh; +Cc: linux-btrfs, fdmanana



Josef Bacik wrote on 2016/04/22 14:23 -0400:
> On 04/22/2016 02:21 PM, Mark Fasheh wrote:
>> On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote:
>>> On 04/15/2016 05:08 AM, Qu Wenruo wrote:
>>>> +    /*
>>>> +     * Force parent root to be updated, as we recorded it before so
>>>> its
>>>> +     * last_trans == cur_transid.
>>>> +     * Or it won't be committed again onto disk after later
>>>> +     * insert_dir_item()
>>>> +     */
>>>> +    if (!ret)
>>>> +        record_root_in_trans(trans, parent, 1);
>>>> +    return ret;
>>>> +}
>>>
>>> NACK, holy shit we aren't adding a special transaction commit only
>>> for qgroup snapshots.  Figure out a different way.  Thanks,
>>
>> Yeah I saw that. To be fair, we run a whole lot of the transaction stuff
>> multiple times (at least from my reading) so I'm really unclear on
>> what the
>> performance impact is.
>>
>> Do you have any suggestion though? We've been banging our heads
>> against this
>> for a while now and as slow as this patch might be, it actually works
>> where
>> nothing else has so far.
>
> I'm less concerned about committing another transaction and more
> concerned about the fact that it is an special variant of the
> transaction commit.  If this goes wrong, or at some point in the future
> we fail to update it along with btrfs_transaction_commit we suddenly are
> corrupting metadata.  If we have to commit a transaction then call
> btrfs_commit_transaction(), don't open code a stripped down version,
> here be dragons.  Thanks,
>
> Josef
>
>
>
Yes, I also don't like the dirty hack.

Although the problem is, we have no other good choice.

If we can call commit_transaction() that's the best case, but the 
problem is, in create_pending_snapshots(), we are already inside 
commit_transaction().

Or commit_transaction() can be called inside commit_transaction()?

Thanks,
Qu



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4] btrfs: qgroup: Fix qgroup accounting when creating snapshot
  2016-04-25  0:56       ` Qu Wenruo
@ 2016-04-25 14:24         ` Josef Bacik
  2016-04-26  0:35           ` Qu Wenruo
  0 siblings, 1 reply; 18+ messages in thread
From: Josef Bacik @ 2016-04-25 14:24 UTC (permalink / raw)
  To: Qu Wenruo, Mark Fasheh; +Cc: linux-btrfs, fdmanana

On 04/24/2016 08:56 PM, Qu Wenruo wrote:
>
>
> Josef Bacik wrote on 2016/04/22 14:23 -0400:
>> On 04/22/2016 02:21 PM, Mark Fasheh wrote:
>>> On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote:
>>>> On 04/15/2016 05:08 AM, Qu Wenruo wrote:
>>>>> +    /*
>>>>> +     * Force parent root to be updated, as we recorded it before so
>>>>> its
>>>>> +     * last_trans == cur_transid.
>>>>> +     * Or it won't be committed again onto disk after later
>>>>> +     * insert_dir_item()
>>>>> +     */
>>>>> +    if (!ret)
>>>>> +        record_root_in_trans(trans, parent, 1);
>>>>> +    return ret;
>>>>> +}
>>>>
>>>> NACK, holy shit we aren't adding a special transaction commit only
>>>> for qgroup snapshots.  Figure out a different way.  Thanks,
>>>
>>> Yeah I saw that. To be fair, we run a whole lot of the transaction stuff
>>> multiple times (at least from my reading) so I'm really unclear on
>>> what the
>>> performance impact is.
>>>
>>> Do you have any suggestion though? We've been banging our heads
>>> against this
>>> for a while now and as slow as this patch might be, it actually works
>>> where
>>> nothing else has so far.
>>
>> I'm less concerned about committing another transaction and more
>> concerned about the fact that it is an special variant of the
>> transaction commit.  If this goes wrong, or at some point in the future
>> we fail to update it along with btrfs_transaction_commit we suddenly are
>> corrupting metadata.  If we have to commit a transaction then call
>> btrfs_commit_transaction(), don't open code a stripped down version,
>> here be dragons.  Thanks,
>>
>> Josef
>>
>>
>>
> Yes, I also don't like the dirty hack.
>
> Although the problem is, we have no other good choice.
>
> If we can call commit_transaction() that's the best case, but the
> problem is, in create_pending_snapshots(), we are already inside
> commit_transaction().
>
> Or commit_transaction() can be called inside commit_transaction()?
>

No, figure out a different way.  IIRC I dealt with this with the 
no_quota flag for inc_ref/dec_ref since the copy root stuff does strange 
things with the reference counts, but all this code is gone now.  I 
looked around to see if I could figure out how the refs are ending up 
this way but it doesn't make sense to me and there isn't enough 
information in your changelog for me to be able to figure it out. 
You've created this mess, clean it up without making it messier.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4] btrfs: qgroup: Fix qgroup accounting when creating snapshot
  2016-04-25 14:24         ` Josef Bacik
@ 2016-04-26  0:35           ` Qu Wenruo
  2016-04-26 14:26             ` Josef Bacik
  0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2016-04-26  0:35 UTC (permalink / raw)
  To: Josef Bacik, Mark Fasheh; +Cc: linux-btrfs, fdmanana



Josef Bacik wrote on 2016/04/25 10:24 -0400:
> On 04/24/2016 08:56 PM, Qu Wenruo wrote:
>>
>>
>> Josef Bacik wrote on 2016/04/22 14:23 -0400:
>>> On 04/22/2016 02:21 PM, Mark Fasheh wrote:
>>>> On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote:
>>>>> On 04/15/2016 05:08 AM, Qu Wenruo wrote:
>>>>>> +    /*
>>>>>> +     * Force parent root to be updated, as we recorded it before so
>>>>>> its
>>>>>> +     * last_trans == cur_transid.
>>>>>> +     * Or it won't be committed again onto disk after later
>>>>>> +     * insert_dir_item()
>>>>>> +     */
>>>>>> +    if (!ret)
>>>>>> +        record_root_in_trans(trans, parent, 1);
>>>>>> +    return ret;
>>>>>> +}
>>>>>
>>>>> NACK, holy shit we aren't adding a special transaction commit only
>>>>> for qgroup snapshots.  Figure out a different way.  Thanks,
>>>>
>>>> Yeah I saw that. To be fair, we run a whole lot of the transaction
>>>> stuff
>>>> multiple times (at least from my reading) so I'm really unclear on
>>>> what the
>>>> performance impact is.
>>>>
>>>> Do you have any suggestion though? We've been banging our heads
>>>> against this
>>>> for a while now and as slow as this patch might be, it actually works
>>>> where
>>>> nothing else has so far.
>>>
>>> I'm less concerned about committing another transaction and more
>>> concerned about the fact that it is an special variant of the
>>> transaction commit.  If this goes wrong, or at some point in the future
>>> we fail to update it along with btrfs_transaction_commit we suddenly are
>>> corrupting metadata.  If we have to commit a transaction then call
>>> btrfs_commit_transaction(), don't open code a stripped down version,
>>> here be dragons.  Thanks,
>>>
>>> Josef
>>>
>>>
>>>
>> Yes, I also don't like the dirty hack.
>>
>> Although the problem is, we have no other good choice.
>>
>> If we can call commit_transaction() that's the best case, but the
>> problem is, in create_pending_snapshots(), we are already inside
>> commit_transaction().
>>
>> Or commit_transaction() can be called inside commit_transaction()?
>>
>
> No, figure out a different way.  IIRC I dealt with this with the
> no_quota flag for inc_ref/dec_ref since the copy root stuff does strange
> things with the reference counts, but all this code is gone now.  I
> looked around to see if I could figure out how the refs are ending up
> this way but it doesn't make sense to me and there isn't enough
> information in your changelog for me to be able to figure it out. You've
> created this mess, clean it up without making it messier.  Thanks,
>
> Josef
>
>
Unfortunately, your original no_quota flag just hide the bug, and hide 
it in a bad method.

Originally, no_quota flag is used for case like this, to skip quota at 
snapshot creation, and use quota_inherit() to hack the quota accounting.
It seems work, but in fact, if the DIR_ITEM insert need to create a new 
cousin leaf, then quota is messed up.

Your quota rework doesn't really help, as it won't even accounting 
things well, just check fstest/btrfs/091 on 4.1 kernel.

The only perfect fix for this already nasty subvolume creation is to do 
full subtree rescan.
Or no one knows when higher qgroups will be broken.



If you think splitting commit_transaction into two variants can cause 
problem, I can merge this two variants into one.

As in btrfs_commit_transaction() the commit process is much the same as 
the one used in create_pending_snapshot().

If there is only one __commit_roots() to do such commit, then there is 
nothing special only for quota.

Thanks,
Qu



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4] btrfs: qgroup: Fix qgroup accounting when creating snapshot
  2016-04-26  0:35           ` Qu Wenruo
@ 2016-04-26 14:26             ` Josef Bacik
  2016-04-27  1:12               ` Qu Wenruo
  0 siblings, 1 reply; 18+ messages in thread
From: Josef Bacik @ 2016-04-26 14:26 UTC (permalink / raw)
  To: Qu Wenruo, Mark Fasheh; +Cc: linux-btrfs, fdmanana

On 04/25/2016 08:35 PM, Qu Wenruo wrote:
>
>
> Josef Bacik wrote on 2016/04/25 10:24 -0400:
>> On 04/24/2016 08:56 PM, Qu Wenruo wrote:
>>>
>>>
>>> Josef Bacik wrote on 2016/04/22 14:23 -0400:
>>>> On 04/22/2016 02:21 PM, Mark Fasheh wrote:
>>>>> On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote:
>>>>>> On 04/15/2016 05:08 AM, Qu Wenruo wrote:
>>>>>>> +    /*
>>>>>>> +     * Force parent root to be updated, as we recorded it before so
>>>>>>> its
>>>>>>> +     * last_trans == cur_transid.
>>>>>>> +     * Or it won't be committed again onto disk after later
>>>>>>> +     * insert_dir_item()
>>>>>>> +     */
>>>>>>> +    if (!ret)
>>>>>>> +        record_root_in_trans(trans, parent, 1);
>>>>>>> +    return ret;
>>>>>>> +}
>>>>>>
>>>>>> NACK, holy shit we aren't adding a special transaction commit only
>>>>>> for qgroup snapshots.  Figure out a different way.  Thanks,
>>>>>
>>>>> Yeah I saw that. To be fair, we run a whole lot of the transaction
>>>>> stuff
>>>>> multiple times (at least from my reading) so I'm really unclear on
>>>>> what the
>>>>> performance impact is.
>>>>>
>>>>> Do you have any suggestion though? We've been banging our heads
>>>>> against this
>>>>> for a while now and as slow as this patch might be, it actually works
>>>>> where
>>>>> nothing else has so far.
>>>>
>>>> I'm less concerned about committing another transaction and more
>>>> concerned about the fact that it is an special variant of the
>>>> transaction commit.  If this goes wrong, or at some point in the future
>>>> we fail to update it along with btrfs_transaction_commit we suddenly
>>>> are
>>>> corrupting metadata.  If we have to commit a transaction then call
>>>> btrfs_commit_transaction(), don't open code a stripped down version,
>>>> here be dragons.  Thanks,
>>>>
>>>> Josef
>>>>
>>>>
>>>>
>>> Yes, I also don't like the dirty hack.
>>>
>>> Although the problem is, we have no other good choice.
>>>
>>> If we can call commit_transaction() that's the best case, but the
>>> problem is, in create_pending_snapshots(), we are already inside
>>> commit_transaction().
>>>
>>> Or commit_transaction() can be called inside commit_transaction()?
>>>
>>
>> No, figure out a different way.  IIRC I dealt with this with the
>> no_quota flag for inc_ref/dec_ref since the copy root stuff does strange
>> things with the reference counts, but all this code is gone now.  I
>> looked around to see if I could figure out how the refs are ending up
>> this way but it doesn't make sense to me and there isn't enough
>> information in your changelog for me to be able to figure it out. You've
>> created this mess, clean it up without making it messier.  Thanks,
>>
>> Josef
>>
>>
> Unfortunately, your original no_quota flag just hide the bug, and hide
> it in a bad method.
>
> Originally, no_quota flag is used for case like this, to skip quota at
> snapshot creation, and use quota_inherit() to hack the quota accounting.
> It seems work, but in fact, if the DIR_ITEM insert need to create a new
> cousin leaf, then quota is messed up.
>

No, and this is the problem, you fundamentally didn't understand what I 
wrote, and instead of trying to understand it and fix the bug you just 
threw it all away.  The no_quota stuff was not a hack, it was put in 
place to deal with refs we already knew where accounted for, such as 
when we converted to mixed refs or other such operations.

There were bugs in my rework, but now the situation is untenable.  What 
we now have is something that holds delayed refs in memory for the 
entire transaction, which is a non-starter for anybody who wants to use 
this in production.  On our gluster machines we get millions of delayed 
refs per transaction, and then there are multiple file systems.  Then 
you have to post-process all of that during the critical section of the 
commit?  So now suddenly I'm walking millions of delayed refs doing 
accounting all at once, that is going to cause commit latencies in the 
seconds which is completely unacceptable.

Anyway I spent some time this morning reading through the new stuff to 
figure out how it works now and I've got a patch to fix this problem 
that doesn't involve screwing with the transaction commit stuff at all. 
I sent it along separately

Btrfs: fix qgroup accounting when snapshotting

This fixes the basic case that was described originally.  I haven't 
tested it other than that but I'm pretty sure it is correct.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4] btrfs: qgroup: Fix qgroup accounting when creating snapshot
  2016-04-26 14:26             ` Josef Bacik
@ 2016-04-27  1:12               ` Qu Wenruo
  0 siblings, 0 replies; 18+ messages in thread
From: Qu Wenruo @ 2016-04-27  1:12 UTC (permalink / raw)
  To: Josef Bacik, Mark Fasheh; +Cc: linux-btrfs, fdmanana



Josef Bacik wrote on 2016/04/26 10:26 -0400:
> On 04/25/2016 08:35 PM, Qu Wenruo wrote:
>>
>>
>> Josef Bacik wrote on 2016/04/25 10:24 -0400:
>>> On 04/24/2016 08:56 PM, Qu Wenruo wrote:
>>>>
>>>>
>>>> Josef Bacik wrote on 2016/04/22 14:23 -0400:
>>>>> On 04/22/2016 02:21 PM, Mark Fasheh wrote:
>>>>>> On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik
>>>>>> wrote:
>>>>>>> On 04/15/2016 05:08 AM, Qu Wenruo wrote:
>>>>>>>> +    /* +     * Force parent root to be updated, as we
>>>>>>>> recorded it before so its +     * last_trans ==
>>>>>>>> cur_transid. +     * Or it won't be committed again
>>>>>>>> onto disk after later +     * insert_dir_item() +
>>>>>>>> */ +    if (!ret) +        record_root_in_trans(trans,
>>>>>>>> parent, 1); +    return ret; +}
>>>>>>>
>>>>>>> NACK, holy shit we aren't adding a special transaction
>>>>>>> commit only for qgroup snapshots.  Figure out a different
>>>>>>> way.  Thanks,
>>>>>>
>>>>>> Yeah I saw that. To be fair, we run a whole lot of the
>>>>>> transaction stuff multiple times (at least from my reading)
>>>>>> so I'm really unclear on what the performance impact is.
>>>>>>
>>>>>> Do you have any suggestion though? We've been banging our
>>>>>> heads against this for a while now and as slow as this
>>>>>> patch might be, it actually works where nothing else has so
>>>>>> far.
>>>>>
>>>>> I'm less concerned about committing another transaction and
>>>>> more concerned about the fact that it is an special variant
>>>>> of the transaction commit.  If this goes wrong, or at some
>>>>> point in the future we fail to update it along with
>>>>> btrfs_transaction_commit we suddenly are corrupting metadata.
>>>>> If we have to commit a transaction then call
>>>>> btrfs_commit_transaction(), don't open code a stripped down
>>>>> version, here be dragons.  Thanks,
>>>>>
>>>>> Josef
>>>>>
>>>>>
>>>>>
>>>> Yes, I also don't like the dirty hack.
>>>>
>>>> Although the problem is, we have no other good choice.
>>>>
>>>> If we can call commit_transaction() that's the best case, but
>>>> the problem is, in create_pending_snapshots(), we are already
>>>> inside commit_transaction().
>>>>
>>>> Or commit_transaction() can be called inside
>>>> commit_transaction()?
>>>>
>>>
>>> No, figure out a different way.  IIRC I dealt with this with the
>>> no_quota flag for inc_ref/dec_ref since the copy root stuff does
>>> strange things with the reference counts, but all this code is
>>> gone now.  I looked around to see if I could figure out how the
>>> refs are ending up this way but it doesn't make sense to me and
>>> there isn't enough information in your changelog for me to be
>>> able to figure it out. You've created this mess, clean it up
>>> without making it messier.  Thanks,
>>>
>>> Josef
>>>
>>>
>> Unfortunately, your original no_quota flag just hide the bug, and
>> hide it in a bad method.
>>
>> Originally, no_quota flag is used for case like this, to skip quota
>> at snapshot creation, and use quota_inherit() to hack the quota
>> accounting. It seems work, but in fact, if the DIR_ITEM insert need
>> to create a new cousin leaf, then quota is messed up.
>>
>
> No, and this is the problem, you fundamentally didn't understand what
> I wrote, and instead of trying to understand it and fix the bug you
> just threw it all away.  The no_quota stuff was not a hack, it was
> put in place to deal with refs we already knew where accounted for,
> such as when we converted to mixed refs or other such operations.

Even we know it has been accounted, it's still a hack to jump away from
normal accounting routing.

Just like what we do in qgroup_inherit(). We believe snapshot creation
will always make the exclusive to nodesize.
But in fact, we didn't consider the higher qgroup, and qgroup_inherit()
is just messing up that case.


And Yes, I DID threw the old qgroup code away.
As there are too many existing bugs and possible bugs, not only in
qgroup, but also in backref walk.

Backref walk can't handle time_seq really well, that's one of the reason
for btrfs/091.
And new comer must handle the extra no_quota flag, which is more easy to
cause new regression.

>
> There were bugs in my rework, but now the situation is untenable.
> What we now have is something that holds delayed refs in memory for
> the entire transaction, which is a non-starter for anybody who wants
> to use this in production.

Nope, we didn't hold delayed refs in current qgroup codes.

Delayed refs can be flushed to disk at any time just as it is.
We only trace which bytenr will go through qgroup routine, nothing to do 
with delayed refs.

Yes you can have millions of delayed refs, but there won't be millions
of different extents. Or sync_fs() would have been triggered.

> On our gluster machines we get millions of delayed refs per
> transaction, and then there are multiple file systems.  Then you have
> to post-process all of that during the critical section of the
> commit?  So now suddenly I'm walking millions of delayed refs doing
> accounting all at once, that is going to cause commit latencies in
> the seconds which is completely unacceptable.

In fact qgroup code will never account or walk through delayed refs.

Qgroup code will only walk through backrefs in extent tree. (commit tree 
and just before commit tree)
So, millions is already reduced to a much smaller amount.

Then, for old qgroup code, each time a inc/dec_extent_ref() will walk
through all backref.

So in special case like in-band dedupe, thousands of inc_extent_ref()
will cause thousands backref walk through, no matter when you do the
walk through, it will hugely delayed the transaction.

But in new qgroup code, twice for any dirty extent in one trans.
Even in the best case, it is the same with old case.
And much much much much much faster in worst case.


And further more, since qgroup code won't bother with the already 
b**lsh*t delayed-refs, it only needs to care extent tree.
This also avoid the backref bugs in walking delayed_ref.


Although, the root problem is in backref walking.
It's too time consuming, and we need some cache for it.
Or dedupe(in-band or out-of-band or reflink) can easily screw it up with 
fiemap.

Just create a 1G file with all its extents pointing to a single 128K one.
Then fiemap will just hang your system.

>
> Anyway I spent some time this morning reading through the new stuff
> to figure out how it works now and I've got a patch to fix this
> problem that doesn't involve screwing with the transaction commit
> stuff at all. I sent it along separately
>
> Btrfs: fix qgroup accounting when snapshotting
>
> This fixes the basic case that was described originally.  I haven't
> tested it other than that but I'm pretty sure it is correct.
> Thanks,
>
> Josef
>
Unfortunately, it isn't correct and even didn't fix the bug.

I'll comment in that thread.

Thanks,
Qu





^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4] btrfs: qgroup: Fix qgroup accounting when creating snapshot
  2016-04-22 18:12 ` Josef Bacik
  2016-04-22 18:21   ` Mark Fasheh
@ 2016-05-11 16:57   ` Mark Fasheh
  2016-05-11 16:59     ` Josef Bacik
  1 sibling, 1 reply; 18+ messages in thread
From: Mark Fasheh @ 2016-05-11 16:57 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Qu Wenruo, linux-btrfs, fdmanana

Hi Josef, 

On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote:
> On 04/15/2016 05:08 AM, Qu Wenruo wrote:
> >Current btrfs qgroup design implies a requirement that after calling
> >btrfs_qgroup_account_extents() there must be a commit root switch.
> >
> >Normally this is OK, as btrfs_qgroup_accounting_extents() is only called
> >inside btrfs_commit_transaction() just be commit_cowonly_roots().
> >
> >However there is a exception at create_pending_snapshot(), which will
> >call btrfs_qgroup_account_extents() but no any commit root switch.
> >
> >In case of creating a snapshot whose parent root is itself (create a
> >snapshot of fs tree), it will corrupt qgroup by the following trace:
> >(skipped unrelated data)
> >======
> >btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
> >qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
> >qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
> >btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
> >======
> >
> >The problem here is in first qgroup_account_extent(), the
> >nr_new_roots of the extent is 1, which means its reference got
> >increased, and qgroup increased its rfer and excl.
> >
> >But at second qgroup_account_extent(), its reference got decreased, but
> >between these two qgroup_account_extent(), there is no switch roots.
> >This leads to the same nr_old_roots, and this extent just got ignored by
> >qgroup, which means this extent is wrongly accounted.
> >
> >Fix it by call commit_cowonly_roots() after qgroup_account_extent() in
> >create_pending_snapshot(), with needed preparation.
> >
> >Reported-by: Mark Fasheh <mfasheh@suse.de>
> >Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> >---
> >v2:
> >   Fix a soft lockup caused by missing switch_commit_root() call.
> >   Fix a warning caused by dirty-but-not-committed root.
> >v3:
> >   Fix a bug which will cause qgroup accounting for dropping snapshot
> >   wrong
> >v4:
> >   Fix a bug caused by non-cowed btree modification.
> >
> >To Filipe:
> >   I'm sorry I didn't wait for your reply on the dropped roots.
> >   I reverted back the version where we deleted dropped roots in
> >   switch_commit_roots().
> >
> >   As I think as long as we called btrfs_qgroup_prepare_account_extents()
> >   and btrfs_qgroup_account_extents(), it should have already accounted
> >   extents for dropped roots, and then we are OK to drop them.
> >
> >   It would be very nice if you could point out what I missed.
> >   Thanks
> >   Qu
> >---
> >  fs/btrfs/transaction.c | 117 +++++++++++++++++++++++++++++++++++++++----------
> >  1 file changed, 93 insertions(+), 24 deletions(-)
> >
> >diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> >index 43885e5..92f8193 100644
> >--- a/fs/btrfs/transaction.c
> >+++ b/fs/btrfs/transaction.c
> >@@ -311,10 +311,11 @@ loop:
> >   * when the transaction commits
> >   */
> >  static int record_root_in_trans(struct btrfs_trans_handle *trans,
> >-			       struct btrfs_root *root)
> >+			       struct btrfs_root *root,
> >+			       int force)
> >  {
> >-	if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
> >-	    root->last_trans < trans->transid) {
> >+	if ((test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
> >+	    root->last_trans < trans->transid) || force) {
> >  		WARN_ON(root == root->fs_info->extent_root);
> >  		WARN_ON(root->commit_root != root->node);
> >
> >@@ -331,7 +332,7 @@ static int record_root_in_trans(struct btrfs_trans_handle *trans,
> >  		smp_wmb();
> >
> >  		spin_lock(&root->fs_info->fs_roots_radix_lock);
> >-		if (root->last_trans == trans->transid) {
> >+		if (root->last_trans == trans->transid && !force) {
> >  			spin_unlock(&root->fs_info->fs_roots_radix_lock);
> >  			return 0;
> >  		}
> >@@ -402,7 +403,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
> >  		return 0;
> >
> >  	mutex_lock(&root->fs_info->reloc_mutex);
> >-	record_root_in_trans(trans, root);
> >+	record_root_in_trans(trans, root, 0);
> >  	mutex_unlock(&root->fs_info->reloc_mutex);
> >
> >  	return 0;
> >@@ -1311,6 +1312,80 @@ int btrfs_defrag_root(struct btrfs_root *root)
> >  }
> >
> >  /*
> >+ * Do all special snapshot related qgroup dirty hack.
> >+ *
> >+ * Will do all needed qgroup inherit and dirty hack like switch commit
> >+ * roots inside one transaction and write all btree into disk, to make
> >+ * qgroup works.
> >+ */
> >+static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
> >+				   struct btrfs_root *src,
> >+				   struct btrfs_root *parent,
> >+				   struct btrfs_qgroup_inherit *inherit,
> >+				   u64 dst_objectid)
> >+{
> >+	struct btrfs_fs_info *fs_info = src->fs_info;
> >+	int ret;
> >+
> >+	/*
> >+	 * We are going to commit transaction, see btrfs_commit_transaction()
> >+	 * comment for reason locking tree_log_mutex
> >+	 */
> >+	mutex_lock(&fs_info->tree_log_mutex);
> >+
> >+	ret = commit_fs_roots(trans, src);
> >+	if (ret)
> >+		goto out;
> >+	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
> >+	if (ret < 0)
> >+		goto out;
> >+	ret = btrfs_qgroup_account_extents(trans, fs_info);
> >+	if (ret < 0)
> >+		goto out;
> >+
> >+	/* Now qgroup are all updated, we can inherit it to new qgroups */
> >+	ret = btrfs_qgroup_inherit(trans, fs_info,
> >+				   src->root_key.objectid, dst_objectid,
> >+				   inherit);
> >+	if (ret < 0)
> >+		goto out;
> >+
> >+	/*
> >+	 * Now we do a simplified commit transaction, which will:
> >+	 * 1) commit all subvolume and extent tree
> >+	 *    To ensure all subvolume and extent tree have a valid
> >+	 *    commit_root to accounting later insert_dir_item()
> >+	 * 2) write all btree blocks onto disk
> >+	 *    This is to make sure later btree modification will be cowed
> >+	 *    Or commit_root can be populated and cause wrong qgroup numbers
> >+	 * In this simplified commit, we don't really care about other trees
> >+	 * like chunk and root tree, as they won't affect qgroup.
> >+	 * And we don't write super to avoid half committed status.
> >+	 */
> >+	ret = commit_cowonly_roots(trans, src);
> >+	if (ret)
> >+		goto out;
> >+	switch_commit_roots(trans->transaction, fs_info);
> >+	ret = btrfs_write_and_wait_transaction(trans, src);
> >+	if (ret)
> >+		btrfs_std_error(fs_info, ret,
> >+			"Error while writing out transaction for qgroup");
> >+
> >+out:
> >+	mutex_unlock(&fs_info->tree_log_mutex);
> >+
> >+	/*
> >+	 * Force parent root to be updated, as we recorded it before so its
> >+	 * last_trans == cur_transid.
> >+	 * Or it won't be committed again onto disk after later
> >+	 * insert_dir_item()
> >+	 */
> >+	if (!ret)
> >+		record_root_in_trans(trans, parent, 1);
> >+	return ret;
> >+}
> 
> NACK, holy shit we aren't adding a special transaction commit only
> for qgroup snapshots.  Figure out a different way.  Thanks,


Unfortunately I think we're going to have to swallow our pride on this one :(

Per our conversations on irc, and my detailed observations in this e-mail:

https://www.marc.info/?l=linux-btrfs&m=146257186311897&w=2

It seems like we have a core problem in that root counting during snapshot
create is unreliable and leads to corrupted qgroups. You add into that the
poor assumptions made by the rest of the code (such as qgroup_inherit()
always marking dst->excl = node_size) and ti looks like we won't have a
proper fix until another qgroup rewrite.

In the meantime, this would make qgroups numbers correct again. If we drop a
single check in there to only run when qgroups are enabled, we can mitigate
the performance impact. If I send that patch would you be ok to ACK it this
time around?

Thanks,
	--Mark

--
Mark Fasheh

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4] btrfs: qgroup: Fix qgroup accounting when creating snapshot
  2016-05-11 16:57   ` Mark Fasheh
@ 2016-05-11 16:59     ` Josef Bacik
  2016-05-11 19:53       ` Mark Fasheh
  0 siblings, 1 reply; 18+ messages in thread
From: Josef Bacik @ 2016-05-11 16:59 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: Qu Wenruo, linux-btrfs, fdmanana

On 05/11/2016 09:57 AM, Mark Fasheh wrote:
> Hi Josef,
>
> On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote:
>> On 04/15/2016 05:08 AM, Qu Wenruo wrote:
>>> Current btrfs qgroup design implies a requirement that after calling
>>> btrfs_qgroup_account_extents() there must be a commit root switch.
>>>
>>> Normally this is OK, as btrfs_qgroup_accounting_extents() is only called
>>> inside btrfs_commit_transaction() just be commit_cowonly_roots().
>>>
>>> However there is a exception at create_pending_snapshot(), which will
>>> call btrfs_qgroup_account_extents() but no any commit root switch.
>>>
>>> In case of creating a snapshot whose parent root is itself (create a
>>> snapshot of fs tree), it will corrupt qgroup by the following trace:
>>> (skipped unrelated data)
>>> ======
>>> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
>>> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
>>> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
>>> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
>>> ======
>>>
>>> The problem here is in first qgroup_account_extent(), the
>>> nr_new_roots of the extent is 1, which means its reference got
>>> increased, and qgroup increased its rfer and excl.
>>>
>>> But at second qgroup_account_extent(), its reference got decreased, but
>>> between these two qgroup_account_extent(), there is no switch roots.
>>> This leads to the same nr_old_roots, and this extent just got ignored by
>>> qgroup, which means this extent is wrongly accounted.
>>>
>>> Fix it by call commit_cowonly_roots() after qgroup_account_extent() in
>>> create_pending_snapshot(), with needed preparation.
>>>
>>> Reported-by: Mark Fasheh <mfasheh@suse.de>
>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>> ---
>>> v2:
>>>   Fix a soft lockup caused by missing switch_commit_root() call.
>>>   Fix a warning caused by dirty-but-not-committed root.
>>> v3:
>>>   Fix a bug which will cause qgroup accounting for dropping snapshot
>>>   wrong
>>> v4:
>>>   Fix a bug caused by non-cowed btree modification.
>>>
>>> To Filipe:
>>>   I'm sorry I didn't wait for your reply on the dropped roots.
>>>   I reverted back the version where we deleted dropped roots in
>>>   switch_commit_roots().
>>>
>>>   As I think as long as we called btrfs_qgroup_prepare_account_extents()
>>>   and btrfs_qgroup_account_extents(), it should have already accounted
>>>   extents for dropped roots, and then we are OK to drop them.
>>>
>>>   It would be very nice if you could point out what I missed.
>>>   Thanks
>>>   Qu
>>> ---
>>>  fs/btrfs/transaction.c | 117 +++++++++++++++++++++++++++++++++++++++----------
>>>  1 file changed, 93 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>> index 43885e5..92f8193 100644
>>> --- a/fs/btrfs/transaction.c
>>> +++ b/fs/btrfs/transaction.c
>>> @@ -311,10 +311,11 @@ loop:
>>>   * when the transaction commits
>>>   */
>>>  static int record_root_in_trans(struct btrfs_trans_handle *trans,
>>> -			       struct btrfs_root *root)
>>> +			       struct btrfs_root *root,
>>> +			       int force)
>>>  {
>>> -	if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
>>> -	    root->last_trans < trans->transid) {
>>> +	if ((test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
>>> +	    root->last_trans < trans->transid) || force) {
>>>  		WARN_ON(root == root->fs_info->extent_root);
>>>  		WARN_ON(root->commit_root != root->node);
>>>
>>> @@ -331,7 +332,7 @@ static int record_root_in_trans(struct btrfs_trans_handle *trans,
>>>  		smp_wmb();
>>>
>>>  		spin_lock(&root->fs_info->fs_roots_radix_lock);
>>> -		if (root->last_trans == trans->transid) {
>>> +		if (root->last_trans == trans->transid && !force) {
>>>  			spin_unlock(&root->fs_info->fs_roots_radix_lock);
>>>  			return 0;
>>>  		}
>>> @@ -402,7 +403,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
>>>  		return 0;
>>>
>>>  	mutex_lock(&root->fs_info->reloc_mutex);
>>> -	record_root_in_trans(trans, root);
>>> +	record_root_in_trans(trans, root, 0);
>>>  	mutex_unlock(&root->fs_info->reloc_mutex);
>>>
>>>  	return 0;
>>> @@ -1311,6 +1312,80 @@ int btrfs_defrag_root(struct btrfs_root *root)
>>>  }
>>>
>>>  /*
>>> + * Do all special snapshot related qgroup dirty hack.
>>> + *
>>> + * Will do all needed qgroup inherit and dirty hack like switch commit
>>> + * roots inside one transaction and write all btree into disk, to make
>>> + * qgroup works.
>>> + */
>>> +static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
>>> +				   struct btrfs_root *src,
>>> +				   struct btrfs_root *parent,
>>> +				   struct btrfs_qgroup_inherit *inherit,
>>> +				   u64 dst_objectid)
>>> +{
>>> +	struct btrfs_fs_info *fs_info = src->fs_info;
>>> +	int ret;
>>> +
>>> +	/*
>>> +	 * We are going to commit transaction, see btrfs_commit_transaction()
>>> +	 * comment for reason locking tree_log_mutex
>>> +	 */
>>> +	mutex_lock(&fs_info->tree_log_mutex);
>>> +
>>> +	ret = commit_fs_roots(trans, src);
>>> +	if (ret)
>>> +		goto out;
>>> +	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
>>> +	if (ret < 0)
>>> +		goto out;
>>> +	ret = btrfs_qgroup_account_extents(trans, fs_info);
>>> +	if (ret < 0)
>>> +		goto out;
>>> +
>>> +	/* Now qgroup are all updated, we can inherit it to new qgroups */
>>> +	ret = btrfs_qgroup_inherit(trans, fs_info,
>>> +				   src->root_key.objectid, dst_objectid,
>>> +				   inherit);
>>> +	if (ret < 0)
>>> +		goto out;
>>> +
>>> +	/*
>>> +	 * Now we do a simplified commit transaction, which will:
>>> +	 * 1) commit all subvolume and extent tree
>>> +	 *    To ensure all subvolume and extent tree have a valid
>>> +	 *    commit_root to accounting later insert_dir_item()
>>> +	 * 2) write all btree blocks onto disk
>>> +	 *    This is to make sure later btree modification will be cowed
>>> +	 *    Or commit_root can be populated and cause wrong qgroup numbers
>>> +	 * In this simplified commit, we don't really care about other trees
>>> +	 * like chunk and root tree, as they won't affect qgroup.
>>> +	 * And we don't write super to avoid half committed status.
>>> +	 */
>>> +	ret = commit_cowonly_roots(trans, src);
>>> +	if (ret)
>>> +		goto out;
>>> +	switch_commit_roots(trans->transaction, fs_info);
>>> +	ret = btrfs_write_and_wait_transaction(trans, src);
>>> +	if (ret)
>>> +		btrfs_std_error(fs_info, ret,
>>> +			"Error while writing out transaction for qgroup");
>>> +
>>> +out:
>>> +	mutex_unlock(&fs_info->tree_log_mutex);
>>> +
>>> +	/*
>>> +	 * Force parent root to be updated, as we recorded it before so its
>>> +	 * last_trans == cur_transid.
>>> +	 * Or it won't be committed again onto disk after later
>>> +	 * insert_dir_item()
>>> +	 */
>>> +	if (!ret)
>>> +		record_root_in_trans(trans, parent, 1);
>>> +	return ret;
>>> +}
>>
>> NACK, holy shit we aren't adding a special transaction commit only
>> for qgroup snapshots.  Figure out a different way.  Thanks,
>
>
> Unfortunately I think we're going to have to swallow our pride on this one :(
>
> Per our conversations on irc, and my detailed observations in this e-mail:
>
> https://www.marc.info/?l=linux-btrfs&m=146257186311897&w=2
>
> It seems like we have a core problem in that root counting during snapshot
> create is unreliable and leads to corrupted qgroups. You add into that the
> poor assumptions made by the rest of the code (such as qgroup_inherit()
> always marking dst->excl = node_size) and ti looks like we won't have a
> proper fix until another qgroup rewrite.
>
> In the meantime, this would make qgroups numbers correct again. If we drop a
> single check in there to only run when qgroups are enabled, we can mitigate
> the performance impact. If I send that patch would you be ok to ACK it this
> time around?
>

Yeah as long as it's limited to only the qgroup case then I'm fine with 
it for now.  Thanks,

Josef


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4] btrfs: qgroup: Fix qgroup accounting when creating snapshot
  2016-05-11 16:59     ` Josef Bacik
@ 2016-05-11 19:53       ` Mark Fasheh
  2016-05-11 20:30         ` Josef Bacik
  0 siblings, 1 reply; 18+ messages in thread
From: Mark Fasheh @ 2016-05-11 19:53 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Qu Wenruo, linux-btrfs, fdmanana

On Wed, May 11, 2016 at 09:59:52AM -0700, Josef Bacik wrote:
> On 05/11/2016 09:57 AM, Mark Fasheh wrote:
> >Hi Josef,
> >
> >On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote:
> >>On 04/15/2016 05:08 AM, Qu Wenruo wrote:
> >>>Current btrfs qgroup design implies a requirement that after calling
> >>>btrfs_qgroup_account_extents() there must be a commit root switch.
> >>>
> >>>Normally this is OK, as btrfs_qgroup_accounting_extents() is only called
> >>>inside btrfs_commit_transaction() just be commit_cowonly_roots().
> >>>
> >>>However there is a exception at create_pending_snapshot(), which will
> >>>call btrfs_qgroup_account_extents() but no any commit root switch.
> >>>
> >>>In case of creating a snapshot whose parent root is itself (create a
> >>>snapshot of fs tree), it will corrupt qgroup by the following trace:
> >>>(skipped unrelated data)
> >>>======
> >>>btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
> >>>qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
> >>>qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
> >>>btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
> >>>======
> >>>
> >>>The problem here is in first qgroup_account_extent(), the
> >>>nr_new_roots of the extent is 1, which means its reference got
> >>>increased, and qgroup increased its rfer and excl.
> >>>
> >>>But at second qgroup_account_extent(), its reference got decreased, but
> >>>between these two qgroup_account_extent(), there is no switch roots.
> >>>This leads to the same nr_old_roots, and this extent just got ignored by
> >>>qgroup, which means this extent is wrongly accounted.
> >>>
> >>>Fix it by call commit_cowonly_roots() after qgroup_account_extent() in
> >>>create_pending_snapshot(), with needed preparation.
> >>>
> >>>Reported-by: Mark Fasheh <mfasheh@suse.de>
> >>>Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> >>>---
> >>>v2:
> >>>  Fix a soft lockup caused by missing switch_commit_root() call.
> >>>  Fix a warning caused by dirty-but-not-committed root.
> >>>v3:
> >>>  Fix a bug which will cause qgroup accounting for dropping snapshot
> >>>  wrong
> >>>v4:
> >>>  Fix a bug caused by non-cowed btree modification.
> >>>
> >>>To Filipe:
> >>>  I'm sorry I didn't wait for your reply on the dropped roots.
> >>>  I reverted back the version where we deleted dropped roots in
> >>>  switch_commit_roots().
> >>>
> >>>  As I think as long as we called btrfs_qgroup_prepare_account_extents()
> >>>  and btrfs_qgroup_account_extents(), it should have already accounted
> >>>  extents for dropped roots, and then we are OK to drop them.
> >>>
> >>>  It would be very nice if you could point out what I missed.
> >>>  Thanks
> >>>  Qu
> >>>---
> >>> fs/btrfs/transaction.c | 117 +++++++++++++++++++++++++++++++++++++++----------
> >>> 1 file changed, 93 insertions(+), 24 deletions(-)
> >>>
> >>>diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> >>>index 43885e5..92f8193 100644
> >>>--- a/fs/btrfs/transaction.c
> >>>+++ b/fs/btrfs/transaction.c
> >>>@@ -311,10 +311,11 @@ loop:
> >>>  * when the transaction commits
> >>>  */
> >>> static int record_root_in_trans(struct btrfs_trans_handle *trans,
> >>>-			       struct btrfs_root *root)
> >>>+			       struct btrfs_root *root,
> >>>+			       int force)
> >>> {
> >>>-	if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
> >>>-	    root->last_trans < trans->transid) {
> >>>+	if ((test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
> >>>+	    root->last_trans < trans->transid) || force) {
> >>> 		WARN_ON(root == root->fs_info->extent_root);
> >>> 		WARN_ON(root->commit_root != root->node);
> >>>
> >>>@@ -331,7 +332,7 @@ static int record_root_in_trans(struct btrfs_trans_handle *trans,
> >>> 		smp_wmb();
> >>>
> >>> 		spin_lock(&root->fs_info->fs_roots_radix_lock);
> >>>-		if (root->last_trans == trans->transid) {
> >>>+		if (root->last_trans == trans->transid && !force) {
> >>> 			spin_unlock(&root->fs_info->fs_roots_radix_lock);
> >>> 			return 0;
> >>> 		}
> >>>@@ -402,7 +403,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
> >>> 		return 0;
> >>>
> >>> 	mutex_lock(&root->fs_info->reloc_mutex);
> >>>-	record_root_in_trans(trans, root);
> >>>+	record_root_in_trans(trans, root, 0);
> >>> 	mutex_unlock(&root->fs_info->reloc_mutex);
> >>>
> >>> 	return 0;
> >>>@@ -1311,6 +1312,80 @@ int btrfs_defrag_root(struct btrfs_root *root)
> >>> }
> >>>
> >>> /*
> >>>+ * Do all special snapshot related qgroup dirty hack.
> >>>+ *
> >>>+ * Will do all needed qgroup inherit and dirty hack like switch commit
> >>>+ * roots inside one transaction and write all btree into disk, to make
> >>>+ * qgroup works.
> >>>+ */
> >>>+static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
> >>>+				   struct btrfs_root *src,
> >>>+				   struct btrfs_root *parent,
> >>>+				   struct btrfs_qgroup_inherit *inherit,
> >>>+				   u64 dst_objectid)
> >>>+{
> >>>+	struct btrfs_fs_info *fs_info = src->fs_info;
> >>>+	int ret;
> >>>+
> >>>+	/*
> >>>+	 * We are going to commit transaction, see btrfs_commit_transaction()
> >>>+	 * comment for reason locking tree_log_mutex
> >>>+	 */
> >>>+	mutex_lock(&fs_info->tree_log_mutex);
> >>>+
> >>>+	ret = commit_fs_roots(trans, src);
> >>>+	if (ret)
> >>>+		goto out;
> >>>+	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
> >>>+	if (ret < 0)
> >>>+		goto out;
> >>>+	ret = btrfs_qgroup_account_extents(trans, fs_info);
> >>>+	if (ret < 0)
> >>>+		goto out;
> >>>+
> >>>+	/* Now qgroup are all updated, we can inherit it to new qgroups */
> >>>+	ret = btrfs_qgroup_inherit(trans, fs_info,
> >>>+				   src->root_key.objectid, dst_objectid,
> >>>+				   inherit);
> >>>+	if (ret < 0)
> >>>+		goto out;
> >>>+
> >>>+	/*
> >>>+	 * Now we do a simplified commit transaction, which will:
> >>>+	 * 1) commit all subvolume and extent tree
> >>>+	 *    To ensure all subvolume and extent tree have a valid
> >>>+	 *    commit_root to accounting later insert_dir_item()
> >>>+	 * 2) write all btree blocks onto disk
> >>>+	 *    This is to make sure later btree modification will be cowed
> >>>+	 *    Or commit_root can be populated and cause wrong qgroup numbers
> >>>+	 * In this simplified commit, we don't really care about other trees
> >>>+	 * like chunk and root tree, as they won't affect qgroup.
> >>>+	 * And we don't write super to avoid half committed status.
> >>>+	 */
> >>>+	ret = commit_cowonly_roots(trans, src);
> >>>+	if (ret)
> >>>+		goto out;
> >>>+	switch_commit_roots(trans->transaction, fs_info);
> >>>+	ret = btrfs_write_and_wait_transaction(trans, src);
> >>>+	if (ret)
> >>>+		btrfs_std_error(fs_info, ret,
> >>>+			"Error while writing out transaction for qgroup");
> >>>+
> >>>+out:
> >>>+	mutex_unlock(&fs_info->tree_log_mutex);
> >>>+
> >>>+	/*
> >>>+	 * Force parent root to be updated, as we recorded it before so its
> >>>+	 * last_trans == cur_transid.
> >>>+	 * Or it won't be committed again onto disk after later
> >>>+	 * insert_dir_item()
> >>>+	 */
> >>>+	if (!ret)
> >>>+		record_root_in_trans(trans, parent, 1);
> >>>+	return ret;
> >>>+}
> >>
> >>NACK, holy shit we aren't adding a special transaction commit only
> >>for qgroup snapshots.  Figure out a different way.  Thanks,
> >
> >
> >Unfortunately I think we're going to have to swallow our pride on this one :(
> >
> >Per our conversations on irc, and my detailed observations in this e-mail:
> >
> >https://www.marc.info/?l=linux-btrfs&m=146257186311897&w=2
> >
> >It seems like we have a core problem in that root counting during snapshot
> >create is unreliable and leads to corrupted qgroups. You add into that the
> >poor assumptions made by the rest of the code (such as qgroup_inherit()
> >always marking dst->excl = node_size) and ti looks like we won't have a
> >proper fix until another qgroup rewrite.
> >
> >In the meantime, this would make qgroups numbers correct again. If we drop a
> >single check in there to only run when qgroups are enabled, we can mitigate
> >the performance impact. If I send that patch would you be ok to ACK it this
> >time around?
> >
> 
> Yeah as long as it's limited to only the qgroup case then I'm fine
> with it for now.  Thanks,

Ok, here it goes. My xfstest for this issue is upstream now so I was able to
run tests/btrfs/122 to verify that this still fixes the problem. Btw, so
that makes one drop snapshot test and two create snapshot tests for qgroups
now. I'll make one for raid convert as that's broken too. Hopefully then we
can be sure that this stuff doesn't break again.

Btw, this is against Linux 4.5, let me know if you'd like it ported forward.
	--Mark

--
Mark Fasheh


From: Qu Wenruo <quwenruo@cn.fujitsu.com>

[PATCH] btrfs: qgroup: Fix qgroup accounting when creating snapshot

Current btrfs qgroup design implies a requirement that after calling
btrfs_qgroup_account_extents() there must be a commit root switch.

Normally this is OK, as btrfs_qgroup_accounting_extents() is only called
inside btrfs_commit_transaction() just be commit_cowonly_roots().

However there is a exception at create_pending_snapshot(), which will
call btrfs_qgroup_account_extents() but no any commit root switch.

In case of creating a snapshot whose parent root is itself (create a
snapshot of fs tree), it will corrupt qgroup by the following trace:
(skipped unrelated data)
======
btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
======

The problem here is in first qgroup_account_extent(), the
nr_new_roots of the extent is 1, which means its reference got
increased, and qgroup increased its rfer and excl.

But at second qgroup_account_extent(), its reference got decreased, but
between these two qgroup_account_extent(), there is no switch roots.
This leads to the same nr_old_roots, and this extent just got ignored by
qgroup, which means this extent is wrongly accounted.

Fix it by call commit_cowonly_roots() after qgroup_account_extent() in
create_pending_snapshot(), with needed preparation.

Mark: I added a check at the top of qgroup_account_snapshot() to skip this
code if qgroups are turned off. xfstest btrfs/122 exposes this problem.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
---
 fs/btrfs/transaction.c | 129 ++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 105 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index b6031ce..21f6609 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -311,10 +311,11 @@ loop:
  * when the transaction commits
  */
 static int record_root_in_trans(struct btrfs_trans_handle *trans,
-			       struct btrfs_root *root)
+			       struct btrfs_root *root,
+			       int force)
 {
-	if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
-	    root->last_trans < trans->transid) {
+	if ((test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
+	    root->last_trans < trans->transid) || force) {
 		WARN_ON(root == root->fs_info->extent_root);
 		WARN_ON(root->commit_root != root->node);
 
@@ -331,7 +332,7 @@ static int record_root_in_trans(struct btrfs_trans_handle *trans,
 		smp_wmb();
 
 		spin_lock(&root->fs_info->fs_roots_radix_lock);
-		if (root->last_trans == trans->transid) {
+		if (root->last_trans == trans->transid && !force) {
 			spin_unlock(&root->fs_info->fs_roots_radix_lock);
 			return 0;
 		}
@@ -402,7 +403,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
 		return 0;
 
 	mutex_lock(&root->fs_info->reloc_mutex);
-	record_root_in_trans(trans, root);
+	record_root_in_trans(trans, root, 0);
 	mutex_unlock(&root->fs_info->reloc_mutex);
 
 	return 0;
@@ -1309,6 +1310,92 @@ int btrfs_defrag_root(struct btrfs_root *root)
 }
 
 /*
+ * Do all special snapshot related qgroup dirty hack.
+ *
+ * Will do all needed qgroup inherit and dirty hack like switch commit
+ * roots inside one transaction and write all btree into disk, to make
+ * qgroup works.
+ */
+static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
+				   struct btrfs_root *src,
+				   struct btrfs_root *parent,
+				   struct btrfs_qgroup_inherit *inherit,
+				   u64 dst_objectid)
+{
+	struct btrfs_fs_info *fs_info = src->fs_info;
+	int ret;
+
+	/*
+	 * Save some performance in the case that qgroups are not
+	 * enabled. If this check races with the ioctl, rescan will
+	 * kick in anyway.
+	 */
+	mutex_lock(&fs_info->qgroup_ioctl_lock);
+	if (!fs_info->quota_enabled) {
+		mutex_unlock(&fs_info->qgroup_ioctl_lock);
+		return 0;
+	}
+	mutex_unlock(&fs_info->qgroup_ioctl_lock);
+
+	/*
+	 * We are going to commit transaction, see btrfs_commit_transaction()
+	 * comment for reason locking tree_log_mutex
+	 */
+	mutex_lock(&fs_info->tree_log_mutex);
+
+	ret = commit_fs_roots(trans, src);
+	if (ret)
+		goto out;
+	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
+	if (ret < 0)
+		goto out;
+	ret = btrfs_qgroup_account_extents(trans, fs_info);
+	if (ret < 0)
+		goto out;
+
+	/* Now qgroup are all updated, we can inherit it to new qgroups */
+	ret = btrfs_qgroup_inherit(trans, fs_info,
+				   src->root_key.objectid, dst_objectid,
+				   inherit);
+	if (ret < 0)
+		goto out;
+
+	/*
+	 * Now we do a simplified commit transaction, which will:
+	 * 1) commit all subvolume and extent tree
+	 *    To ensure all subvolume and extent tree have a valid
+	 *    commit_root to accounting later insert_dir_item()
+	 * 2) write all btree blocks onto disk
+	 *    This is to make sure later btree modification will be cowed
+	 *    Or commit_root can be populated and cause wrong qgroup numbers
+	 * In this simplified commit, we don't really care about other trees
+	 * like chunk and root tree, as they won't affect qgroup.
+	 * And we don't write super to avoid half committed status.
+	 */
+	ret = commit_cowonly_roots(trans, src);
+	if (ret)
+		goto out;
+	switch_commit_roots(trans->transaction, fs_info);
+	ret = btrfs_write_and_wait_transaction(trans, src);
+	if (ret)
+		btrfs_std_error(fs_info, ret,
+			"Error while writing out transaction for qgroup");
+
+out:
+	mutex_unlock(&fs_info->tree_log_mutex);
+
+	/*
+	 * Force parent root to be updated, as we recorded it before so its
+	 * last_trans == cur_transid.
+	 * Or it won't be committed again onto disk after later
+	 * insert_dir_item()
+	 */
+	if (!ret)
+		record_root_in_trans(trans, parent, 1);
+	return ret;
+}
+
+/*
  * new snapshots need to be created at a very specific time in the
  * transaction commit.  This does the actual creation.
  *
@@ -1379,7 +1466,7 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 	dentry = pending->dentry;
 	parent_inode = pending->dir;
 	parent_root = BTRFS_I(parent_inode)->root;
-	record_root_in_trans(trans, parent_root);
+	record_root_in_trans(trans, parent_root, 0);
 
 	/*
 	 * insert the directory item
@@ -1414,7 +1501,7 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 		goto fail;
 	}
 
-	record_root_in_trans(trans, root);
+	record_root_in_trans(trans, root, 0);
 	btrfs_set_root_last_snapshot(&root->root_item, trans->transid);
 	memcpy(new_root_item, &root->root_item, sizeof(*new_root_item));
 	btrfs_check_and_init_root_item(new_root_item);
@@ -1510,6 +1597,17 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 		goto fail;
 	}
 
+	/*
+	 * Do special qgroup accounting for snapshot, as we do some qgroup
+	 * snapshot hack to do fast snapshot.
+	 * To co-operate with that hack, we do hack again.
+	 * Or snapshot will be greatly slowed down by a subtree qgroup rescan
+	 */
+	ret = qgroup_account_snapshot(trans, root, parent_root,
+				      pending->inherit, objectid);
+	if (ret < 0)
+		goto fail;
+
 	ret = btrfs_insert_dir_item(trans, parent_root,
 				    dentry->d_name.name, dentry->d_name.len,
 				    parent_inode, &key,
@@ -1552,23 +1650,6 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 		goto fail;
 	}
 
-	/*
-	 * account qgroup counters before qgroup_inherit()
-	 */
-	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
-	if (ret)
-		goto fail;
-	ret = btrfs_qgroup_account_extents(trans, fs_info);
-	if (ret)
-		goto fail;
-	ret = btrfs_qgroup_inherit(trans, fs_info,
-				   root->root_key.objectid,
-				   objectid, pending->inherit);
-	if (ret) {
-		btrfs_abort_transaction(trans, root, ret);
-		goto fail;
-	}
-
 fail:
 	pending->error = ret;
 dir_item_existed:
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v4] btrfs: qgroup: Fix qgroup accounting when creating snapshot
  2016-05-11 19:53       ` Mark Fasheh
@ 2016-05-11 20:30         ` Josef Bacik
  2016-05-11 23:33           ` Qu Wenruo
  2016-05-12  9:10           ` David Sterba
  0 siblings, 2 replies; 18+ messages in thread
From: Josef Bacik @ 2016-05-11 20:30 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: Qu Wenruo, linux-btrfs, fdmanana

On 05/11/2016 12:53 PM, Mark Fasheh wrote:
> On Wed, May 11, 2016 at 09:59:52AM -0700, Josef Bacik wrote:
>> On 05/11/2016 09:57 AM, Mark Fasheh wrote:
>>> Hi Josef,
>>>
>>> On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote:
>>>> On 04/15/2016 05:08 AM, Qu Wenruo wrote:
>>>>> Current btrfs qgroup design implies a requirement that after calling
>>>>> btrfs_qgroup_account_extents() there must be a commit root switch.
>>>>>
>>>>> Normally this is OK, as btrfs_qgroup_accounting_extents() is only called
>>>>> inside btrfs_commit_transaction() just be commit_cowonly_roots().
>>>>>
>>>>> However there is a exception at create_pending_snapshot(), which will
>>>>> call btrfs_qgroup_account_extents() but no any commit root switch.
>>>>>
>>>>> In case of creating a snapshot whose parent root is itself (create a
>>>>> snapshot of fs tree), it will corrupt qgroup by the following trace:
>>>>> (skipped unrelated data)
>>>>> ======
>>>>> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
>>>>> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
>>>>> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
>>>>> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
>>>>> ======
>>>>>
>>>>> The problem here is in first qgroup_account_extent(), the
>>>>> nr_new_roots of the extent is 1, which means its reference got
>>>>> increased, and qgroup increased its rfer and excl.
>>>>>
>>>>> But at second qgroup_account_extent(), its reference got decreased, but
>>>>> between these two qgroup_account_extent(), there is no switch roots.
>>>>> This leads to the same nr_old_roots, and this extent just got ignored by
>>>>> qgroup, which means this extent is wrongly accounted.
>>>>>
>>>>> Fix it by call commit_cowonly_roots() after qgroup_account_extent() in
>>>>> create_pending_snapshot(), with needed preparation.
>>>>>
>>>>> Reported-by: Mark Fasheh <mfasheh@suse.de>
>>>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>>>> ---
>>>>> v2:
>>>>>  Fix a soft lockup caused by missing switch_commit_root() call.
>>>>>  Fix a warning caused by dirty-but-not-committed root.
>>>>> v3:
>>>>>  Fix a bug which will cause qgroup accounting for dropping snapshot
>>>>>  wrong
>>>>> v4:
>>>>>  Fix a bug caused by non-cowed btree modification.
>>>>>
>>>>> To Filipe:
>>>>>  I'm sorry I didn't wait for your reply on the dropped roots.
>>>>>  I reverted back the version where we deleted dropped roots in
>>>>>  switch_commit_roots().
>>>>>
>>>>>  As I think as long as we called btrfs_qgroup_prepare_account_extents()
>>>>>  and btrfs_qgroup_account_extents(), it should have already accounted
>>>>>  extents for dropped roots, and then we are OK to drop them.
>>>>>
>>>>>  It would be very nice if you could point out what I missed.
>>>>>  Thanks
>>>>>  Qu
>>>>> ---
>>>>> fs/btrfs/transaction.c | 117 +++++++++++++++++++++++++++++++++++++++----------
>>>>> 1 file changed, 93 insertions(+), 24 deletions(-)
>>>>>
>>>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>>>> index 43885e5..92f8193 100644
>>>>> --- a/fs/btrfs/transaction.c
>>>>> +++ b/fs/btrfs/transaction.c
>>>>> @@ -311,10 +311,11 @@ loop:
>>>>>  * when the transaction commits
>>>>>  */
>>>>> static int record_root_in_trans(struct btrfs_trans_handle *trans,
>>>>> -			       struct btrfs_root *root)
>>>>> +			       struct btrfs_root *root,
>>>>> +			       int force)
>>>>> {
>>>>> -	if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
>>>>> -	    root->last_trans < trans->transid) {
>>>>> +	if ((test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
>>>>> +	    root->last_trans < trans->transid) || force) {
>>>>> 		WARN_ON(root == root->fs_info->extent_root);
>>>>> 		WARN_ON(root->commit_root != root->node);
>>>>>
>>>>> @@ -331,7 +332,7 @@ static int record_root_in_trans(struct btrfs_trans_handle *trans,
>>>>> 		smp_wmb();
>>>>>
>>>>> 		spin_lock(&root->fs_info->fs_roots_radix_lock);
>>>>> -		if (root->last_trans == trans->transid) {
>>>>> +		if (root->last_trans == trans->transid && !force) {
>>>>> 			spin_unlock(&root->fs_info->fs_roots_radix_lock);
>>>>> 			return 0;
>>>>> 		}
>>>>> @@ -402,7 +403,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
>>>>> 		return 0;
>>>>>
>>>>> 	mutex_lock(&root->fs_info->reloc_mutex);
>>>>> -	record_root_in_trans(trans, root);
>>>>> +	record_root_in_trans(trans, root, 0);
>>>>> 	mutex_unlock(&root->fs_info->reloc_mutex);
>>>>>
>>>>> 	return 0;
>>>>> @@ -1311,6 +1312,80 @@ int btrfs_defrag_root(struct btrfs_root *root)
>>>>> }
>>>>>
>>>>> /*
>>>>> + * Do all special snapshot related qgroup dirty hack.
>>>>> + *
>>>>> + * Will do all needed qgroup inherit and dirty hack like switch commit
>>>>> + * roots inside one transaction and write all btree into disk, to make
>>>>> + * qgroup works.
>>>>> + */
>>>>> +static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
>>>>> +				   struct btrfs_root *src,
>>>>> +				   struct btrfs_root *parent,
>>>>> +				   struct btrfs_qgroup_inherit *inherit,
>>>>> +				   u64 dst_objectid)
>>>>> +{
>>>>> +	struct btrfs_fs_info *fs_info = src->fs_info;
>>>>> +	int ret;
>>>>> +
>>>>> +	/*
>>>>> +	 * We are going to commit transaction, see btrfs_commit_transaction()
>>>>> +	 * comment for reason locking tree_log_mutex
>>>>> +	 */
>>>>> +	mutex_lock(&fs_info->tree_log_mutex);
>>>>> +
>>>>> +	ret = commit_fs_roots(trans, src);
>>>>> +	if (ret)
>>>>> +		goto out;
>>>>> +	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
>>>>> +	if (ret < 0)
>>>>> +		goto out;
>>>>> +	ret = btrfs_qgroup_account_extents(trans, fs_info);
>>>>> +	if (ret < 0)
>>>>> +		goto out;
>>>>> +
>>>>> +	/* Now qgroup are all updated, we can inherit it to new qgroups */
>>>>> +	ret = btrfs_qgroup_inherit(trans, fs_info,
>>>>> +				   src->root_key.objectid, dst_objectid,
>>>>> +				   inherit);
>>>>> +	if (ret < 0)
>>>>> +		goto out;
>>>>> +
>>>>> +	/*
>>>>> +	 * Now we do a simplified commit transaction, which will:
>>>>> +	 * 1) commit all subvolume and extent tree
>>>>> +	 *    To ensure all subvolume and extent tree have a valid
>>>>> +	 *    commit_root to accounting later insert_dir_item()
>>>>> +	 * 2) write all btree blocks onto disk
>>>>> +	 *    This is to make sure later btree modification will be cowed
>>>>> +	 *    Or commit_root can be populated and cause wrong qgroup numbers
>>>>> +	 * In this simplified commit, we don't really care about other trees
>>>>> +	 * like chunk and root tree, as they won't affect qgroup.
>>>>> +	 * And we don't write super to avoid half committed status.
>>>>> +	 */
>>>>> +	ret = commit_cowonly_roots(trans, src);
>>>>> +	if (ret)
>>>>> +		goto out;
>>>>> +	switch_commit_roots(trans->transaction, fs_info);
>>>>> +	ret = btrfs_write_and_wait_transaction(trans, src);
>>>>> +	if (ret)
>>>>> +		btrfs_std_error(fs_info, ret,
>>>>> +			"Error while writing out transaction for qgroup");
>>>>> +
>>>>> +out:
>>>>> +	mutex_unlock(&fs_info->tree_log_mutex);
>>>>> +
>>>>> +	/*
>>>>> +	 * Force parent root to be updated, as we recorded it before so its
>>>>> +	 * last_trans == cur_transid.
>>>>> +	 * Or it won't be committed again onto disk after later
>>>>> +	 * insert_dir_item()
>>>>> +	 */
>>>>> +	if (!ret)
>>>>> +		record_root_in_trans(trans, parent, 1);
>>>>> +	return ret;
>>>>> +}
>>>>
>>>> NACK, holy shit we aren't adding a special transaction commit only
>>>> for qgroup snapshots.  Figure out a different way.  Thanks,
>>>
>>>
>>> Unfortunately I think we're going to have to swallow our pride on this one :(
>>>
>>> Per our conversations on irc, and my detailed observations in this e-mail:
>>>
>>> https://www.marc.info/?l=linux-btrfs&m=146257186311897&w=2
>>>
>>> It seems like we have a core problem in that root counting during snapshot
>>> create is unreliable and leads to corrupted qgroups. You add into that the
>>> poor assumptions made by the rest of the code (such as qgroup_inherit()
>>> always marking dst->excl = node_size) and ti looks like we won't have a
>>> proper fix until another qgroup rewrite.
>>>
>>> In the meantime, this would make qgroups numbers correct again. If we drop a
>>> single check in there to only run when qgroups are enabled, we can mitigate
>>> the performance impact. If I send that patch would you be ok to ACK it this
>>> time around?
>>>
>>
>> Yeah as long as it's limited to only the qgroup case then I'm fine
>> with it for now.  Thanks,
>
> Ok, here it goes. My xfstest for this issue is upstream now so I was able to
> run tests/btrfs/122 to verify that this still fixes the problem. Btw, so
> that makes one drop snapshot test and two create snapshot tests for qgroups
> now. I'll make one for raid convert as that's broken too. Hopefully then we
> can be sure that this stuff doesn't break again.
>
> Btw, this is against Linux 4.5, let me know if you'd like it ported forward.
> 	--Mark
>
> --
> Mark Fasheh
>
>
> From: Qu Wenruo <quwenruo@cn.fujitsu.com>
>
> [PATCH] btrfs: qgroup: Fix qgroup accounting when creating snapshot
>
> Current btrfs qgroup design implies a requirement that after calling
> btrfs_qgroup_account_extents() there must be a commit root switch.
>
> Normally this is OK, as btrfs_qgroup_accounting_extents() is only called
> inside btrfs_commit_transaction() just be commit_cowonly_roots().
>
> However there is a exception at create_pending_snapshot(), which will
> call btrfs_qgroup_account_extents() but no any commit root switch.
>
> In case of creating a snapshot whose parent root is itself (create a
> snapshot of fs tree), it will corrupt qgroup by the following trace:
> (skipped unrelated data)
> ======
> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1
> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0
> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384
> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0
> ======
>
> The problem here is in first qgroup_account_extent(), the
> nr_new_roots of the extent is 1, which means its reference got
> increased, and qgroup increased its rfer and excl.
>
> But at second qgroup_account_extent(), its reference got decreased, but
> between these two qgroup_account_extent(), there is no switch roots.
> This leads to the same nr_old_roots, and this extent just got ignored by
> qgroup, which means this extent is wrongly accounted.
>
> Fix it by call commit_cowonly_roots() after qgroup_account_extent() in
> create_pending_snapshot(), with needed preparation.
>
> Mark: I added a check at the top of qgroup_account_snapshot() to skip this
> code if qgroups are turned off. xfstest btrfs/122 exposes this problem.
>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> Signed-off-by: Mark Fasheh <mfasheh@suse.de>
> ---
>  fs/btrfs/transaction.c | 129 ++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 105 insertions(+), 24 deletions(-)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index b6031ce..21f6609 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -311,10 +311,11 @@ loop:
>   * when the transaction commits
>   */
>  static int record_root_in_trans(struct btrfs_trans_handle *trans,
> -			       struct btrfs_root *root)
> +			       struct btrfs_root *root,
> +			       int force)
>  {
> -	if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
> -	    root->last_trans < trans->transid) {
> +	if ((test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
> +	    root->last_trans < trans->transid) || force) {
>  		WARN_ON(root == root->fs_info->extent_root);
>  		WARN_ON(root->commit_root != root->node);
>
> @@ -331,7 +332,7 @@ static int record_root_in_trans(struct btrfs_trans_handle *trans,
>  		smp_wmb();
>
>  		spin_lock(&root->fs_info->fs_roots_radix_lock);
> -		if (root->last_trans == trans->transid) {
> +		if (root->last_trans == trans->transid && !force) {
>  			spin_unlock(&root->fs_info->fs_roots_radix_lock);
>  			return 0;
>  		}
> @@ -402,7 +403,7 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
>  		return 0;
>
>  	mutex_lock(&root->fs_info->reloc_mutex);
> -	record_root_in_trans(trans, root);
> +	record_root_in_trans(trans, root, 0);
>  	mutex_unlock(&root->fs_info->reloc_mutex);
>
>  	return 0;
> @@ -1309,6 +1310,92 @@ int btrfs_defrag_root(struct btrfs_root *root)
>  }
>
>  /*
> + * Do all special snapshot related qgroup dirty hack.
> + *
> + * Will do all needed qgroup inherit and dirty hack like switch commit
> + * roots inside one transaction and write all btree into disk, to make
> + * qgroup works.
> + */
> +static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
> +				   struct btrfs_root *src,
> +				   struct btrfs_root *parent,
> +				   struct btrfs_qgroup_inherit *inherit,
> +				   u64 dst_objectid)
> +{
> +	struct btrfs_fs_info *fs_info = src->fs_info;
> +	int ret;
> +
> +	/*
> +	 * Save some performance in the case that qgroups are not
> +	 * enabled. If this check races with the ioctl, rescan will
> +	 * kick in anyway.
> +	 */
> +	mutex_lock(&fs_info->qgroup_ioctl_lock);
> +	if (!fs_info->quota_enabled) {
> +		mutex_unlock(&fs_info->qgroup_ioctl_lock);
> +		return 0;
> +	}
> +	mutex_unlock(&fs_info->qgroup_ioctl_lock);
> +
> +	/*
> +	 * We are going to commit transaction, see btrfs_commit_transaction()
> +	 * comment for reason locking tree_log_mutex
> +	 */
> +	mutex_lock(&fs_info->tree_log_mutex);
> +
> +	ret = commit_fs_roots(trans, src);
> +	if (ret)
> +		goto out;
> +	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
> +	if (ret < 0)
> +		goto out;
> +	ret = btrfs_qgroup_account_extents(trans, fs_info);
> +	if (ret < 0)
> +		goto out;
> +
> +	/* Now qgroup are all updated, we can inherit it to new qgroups */
> +	ret = btrfs_qgroup_inherit(trans, fs_info,
> +				   src->root_key.objectid, dst_objectid,
> +				   inherit);
> +	if (ret < 0)
> +		goto out;
> +
> +	/*
> +	 * Now we do a simplified commit transaction, which will:
> +	 * 1) commit all subvolume and extent tree
> +	 *    To ensure all subvolume and extent tree have a valid
> +	 *    commit_root to accounting later insert_dir_item()
> +	 * 2) write all btree blocks onto disk
> +	 *    This is to make sure later btree modification will be cowed
> +	 *    Or commit_root can be populated and cause wrong qgroup numbers
> +	 * In this simplified commit, we don't really care about other trees
> +	 * like chunk and root tree, as they won't affect qgroup.
> +	 * And we don't write super to avoid half committed status.
> +	 */
> +	ret = commit_cowonly_roots(trans, src);
> +	if (ret)
> +		goto out;
> +	switch_commit_roots(trans->transaction, fs_info);
> +	ret = btrfs_write_and_wait_transaction(trans, src);
> +	if (ret)
> +		btrfs_std_error(fs_info, ret,
> +			"Error while writing out transaction for qgroup");
> +
> +out:
> +	mutex_unlock(&fs_info->tree_log_mutex);
> +
> +	/*
> +	 * Force parent root to be updated, as we recorded it before so its
> +	 * last_trans == cur_transid.
> +	 * Or it won't be committed again onto disk after later
> +	 * insert_dir_item()
> +	 */
> +	if (!ret)
> +		record_root_in_trans(trans, parent, 1);
> +	return ret;
> +}
> +
> +/*
>   * new snapshots need to be created at a very specific time in the
>   * transaction commit.  This does the actual creation.
>   *
> @@ -1379,7 +1466,7 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
>  	dentry = pending->dentry;
>  	parent_inode = pending->dir;
>  	parent_root = BTRFS_I(parent_inode)->root;
> -	record_root_in_trans(trans, parent_root);
> +	record_root_in_trans(trans, parent_root, 0);
>
>  	/*
>  	 * insert the directory item
> @@ -1414,7 +1501,7 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
>  		goto fail;
>  	}
>
> -	record_root_in_trans(trans, root);
> +	record_root_in_trans(trans, root, 0);
>  	btrfs_set_root_last_snapshot(&root->root_item, trans->transid);
>  	memcpy(new_root_item, &root->root_item, sizeof(*new_root_item));
>  	btrfs_check_and_init_root_item(new_root_item);
> @@ -1510,6 +1597,17 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
>  		goto fail;
>  	}
>
> +	/*
> +	 * Do special qgroup accounting for snapshot, as we do some qgroup
> +	 * snapshot hack to do fast snapshot.
> +	 * To co-operate with that hack, we do hack again.
> +	 * Or snapshot will be greatly slowed down by a subtree qgroup rescan
> +	 */
> +	ret = qgroup_account_snapshot(trans, root, parent_root,
> +				      pending->inherit, objectid);
> +	if (ret < 0)
> +		goto fail;
> +
>  	ret = btrfs_insert_dir_item(trans, parent_root,
>  				    dentry->d_name.name, dentry->d_name.len,
>  				    parent_inode, &key,
> @@ -1552,23 +1650,6 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
>  		goto fail;
>  	}
>
> -	/*
> -	 * account qgroup counters before qgroup_inherit()
> -	 */
> -	ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
> -	if (ret)
> -		goto fail;
> -	ret = btrfs_qgroup_account_extents(trans, fs_info);
> -	if (ret)
> -		goto fail;
> -	ret = btrfs_qgroup_inherit(trans, fs_info,
> -				   root->root_key.objectid,
> -				   objectid, pending->inherit);
> -	if (ret) {
> -		btrfs_abort_transaction(trans, root, ret);
> -		goto fail;
> -	}
> -
>  fail:
>  	pending->error = ret;
>  dir_item_existed:
>

Reviewed-by: Josef Bacik <jbacik@fb.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4] btrfs: qgroup: Fix qgroup accounting when creating snapshot
  2016-05-11 20:30         ` Josef Bacik
@ 2016-05-11 23:33           ` Qu Wenruo
  2016-05-12  9:10           ` David Sterba
  1 sibling, 0 replies; 18+ messages in thread
From: Qu Wenruo @ 2016-05-11 23:33 UTC (permalink / raw)
  To: Josef Bacik, Mark Fasheh; +Cc: Qu Wenruo, linux-btrfs, fdmanana



On 05/12/2016 04:30 AM, Josef Bacik wrote:
> On 05/11/2016 12:53 PM, Mark Fasheh wrote:
>> On Wed, May 11, 2016 at 09:59:52AM -0700, Josef Bacik wrote:
>>> On 05/11/2016 09:57 AM, Mark Fasheh wrote:
>>>> Hi Josef,
>>>>
>>>> On Fri, Apr 22, 2016 at 02:12:11PM -0400, Josef Bacik wrote:
>>>>> On 04/15/2016 05:08 AM, Qu Wenruo wrote:
>>>>>> Current btrfs qgroup design implies a requirement that after calling
>>>>>> btrfs_qgroup_account_extents() there must be a commit root switch.
>>>>>>
>>>>>> Normally this is OK, as btrfs_qgroup_accounting_extents() is only
>>>>>> called
>>>>>> inside btrfs_commit_transaction() just be commit_cowonly_roots().
>>>>>>
>>>>>> However there is a exception at create_pending_snapshot(), which will
>>>>>> call btrfs_qgroup_account_extents() but no any commit root switch.
>>>>>>
>>>>>> In case of creating a snapshot whose parent root is itself (create a
>>>>>> snapshot of fs tree), it will corrupt qgroup by the following trace:
>>>>>> (skipped unrelated data)
>>>>>> ======
>>>>>> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384,
>>>>>> nr_old_roots = 0, nr_new_roots = 1
>>>>>> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count
>>>>>> = 1, rfer = 0, excl = 0
>>>>>> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count
>>>>>> = 1, rfer = 16384, excl = 16384
>>>>>> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384,
>>>>>> nr_old_roots = 0, nr_new_roots = 0
>>>>>> ======
>>>>>>
>>>>>> The problem here is in first qgroup_account_extent(), the
>>>>>> nr_new_roots of the extent is 1, which means its reference got
>>>>>> increased, and qgroup increased its rfer and excl.
>>>>>>
>>>>>> But at second qgroup_account_extent(), its reference got
>>>>>> decreased, but
>>>>>> between these two qgroup_account_extent(), there is no switch roots.
>>>>>> This leads to the same nr_old_roots, and this extent just got
>>>>>> ignored by
>>>>>> qgroup, which means this extent is wrongly accounted.
>>>>>>
>>>>>> Fix it by call commit_cowonly_roots() after
>>>>>> qgroup_account_extent() in
>>>>>> create_pending_snapshot(), with needed preparation.
>>>>>>
>>>>>> Reported-by: Mark Fasheh <mfasheh@suse.de>
>>>>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>>>>> ---
>>>>>> v2:
>>>>>>  Fix a soft lockup caused by missing switch_commit_root() call.
>>>>>>  Fix a warning caused by dirty-but-not-committed root.
>>>>>> v3:
>>>>>>  Fix a bug which will cause qgroup accounting for dropping snapshot
>>>>>>  wrong
>>>>>> v4:
>>>>>>  Fix a bug caused by non-cowed btree modification.
>>>>>>
>>>>>> To Filipe:
>>>>>>  I'm sorry I didn't wait for your reply on the dropped roots.
>>>>>>  I reverted back the version where we deleted dropped roots in
>>>>>>  switch_commit_roots().
>>>>>>
>>>>>>  As I think as long as we called
>>>>>> btrfs_qgroup_prepare_account_extents()
>>>>>>  and btrfs_qgroup_account_extents(), it should have already accounted
>>>>>>  extents for dropped roots, and then we are OK to drop them.
>>>>>>
>>>>>>  It would be very nice if you could point out what I missed.
>>>>>>  Thanks
>>>>>>  Qu
>>>>>> ---
>>>>>> fs/btrfs/transaction.c | 117
>>>>>> +++++++++++++++++++++++++++++++++++++++----------
>>>>>> 1 file changed, 93 insertions(+), 24 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>>>>> index 43885e5..92f8193 100644
>>>>>> --- a/fs/btrfs/transaction.c
>>>>>> +++ b/fs/btrfs/transaction.c
>>>>>> @@ -311,10 +311,11 @@ loop:
>>>>>>  * when the transaction commits
>>>>>>  */
>>>>>> static int record_root_in_trans(struct btrfs_trans_handle *trans,
>>>>>> -                   struct btrfs_root *root)
>>>>>> +                   struct btrfs_root *root,
>>>>>> +                   int force)
>>>>>> {
>>>>>> -    if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
>>>>>> -        root->last_trans < trans->transid) {
>>>>>> +    if ((test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
>>>>>> +        root->last_trans < trans->transid) || force) {
>>>>>>         WARN_ON(root == root->fs_info->extent_root);
>>>>>>         WARN_ON(root->commit_root != root->node);
>>>>>>
>>>>>> @@ -331,7 +332,7 @@ static int record_root_in_trans(struct
>>>>>> btrfs_trans_handle *trans,
>>>>>>         smp_wmb();
>>>>>>
>>>>>>         spin_lock(&root->fs_info->fs_roots_radix_lock);
>>>>>> -        if (root->last_trans == trans->transid) {
>>>>>> +        if (root->last_trans == trans->transid && !force) {
>>>>>>             spin_unlock(&root->fs_info->fs_roots_radix_lock);
>>>>>>             return 0;
>>>>>>         }
>>>>>> @@ -402,7 +403,7 @@ int btrfs_record_root_in_trans(struct
>>>>>> btrfs_trans_handle *trans,
>>>>>>         return 0;
>>>>>>
>>>>>>     mutex_lock(&root->fs_info->reloc_mutex);
>>>>>> -    record_root_in_trans(trans, root);
>>>>>> +    record_root_in_trans(trans, root, 0);
>>>>>>     mutex_unlock(&root->fs_info->reloc_mutex);
>>>>>>
>>>>>>     return 0;
>>>>>> @@ -1311,6 +1312,80 @@ int btrfs_defrag_root(struct btrfs_root *root)
>>>>>> }
>>>>>>
>>>>>> /*
>>>>>> + * Do all special snapshot related qgroup dirty hack.
>>>>>> + *
>>>>>> + * Will do all needed qgroup inherit and dirty hack like switch
>>>>>> commit
>>>>>> + * roots inside one transaction and write all btree into disk, to
>>>>>> make
>>>>>> + * qgroup works.
>>>>>> + */
>>>>>> +static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
>>>>>> +                   struct btrfs_root *src,
>>>>>> +                   struct btrfs_root *parent,
>>>>>> +                   struct btrfs_qgroup_inherit *inherit,
>>>>>> +                   u64 dst_objectid)
>>>>>> +{
>>>>>> +    struct btrfs_fs_info *fs_info = src->fs_info;
>>>>>> +    int ret;
>>>>>> +
>>>>>> +    /*
>>>>>> +     * We are going to commit transaction, see
>>>>>> btrfs_commit_transaction()
>>>>>> +     * comment for reason locking tree_log_mutex
>>>>>> +     */
>>>>>> +    mutex_lock(&fs_info->tree_log_mutex);
>>>>>> +
>>>>>> +    ret = commit_fs_roots(trans, src);
>>>>>> +    if (ret)
>>>>>> +        goto out;
>>>>>> +    ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
>>>>>> +    if (ret < 0)
>>>>>> +        goto out;
>>>>>> +    ret = btrfs_qgroup_account_extents(trans, fs_info);
>>>>>> +    if (ret < 0)
>>>>>> +        goto out;
>>>>>> +
>>>>>> +    /* Now qgroup are all updated, we can inherit it to new
>>>>>> qgroups */
>>>>>> +    ret = btrfs_qgroup_inherit(trans, fs_info,
>>>>>> +                   src->root_key.objectid, dst_objectid,
>>>>>> +                   inherit);
>>>>>> +    if (ret < 0)
>>>>>> +        goto out;
>>>>>> +
>>>>>> +    /*
>>>>>> +     * Now we do a simplified commit transaction, which will:
>>>>>> +     * 1) commit all subvolume and extent tree
>>>>>> +     *    To ensure all subvolume and extent tree have a valid
>>>>>> +     *    commit_root to accounting later insert_dir_item()
>>>>>> +     * 2) write all btree blocks onto disk
>>>>>> +     *    This is to make sure later btree modification will be
>>>>>> cowed
>>>>>> +     *    Or commit_root can be populated and cause wrong qgroup
>>>>>> numbers
>>>>>> +     * In this simplified commit, we don't really care about
>>>>>> other trees
>>>>>> +     * like chunk and root tree, as they won't affect qgroup.
>>>>>> +     * And we don't write super to avoid half committed status.
>>>>>> +     */
>>>>>> +    ret = commit_cowonly_roots(trans, src);
>>>>>> +    if (ret)
>>>>>> +        goto out;
>>>>>> +    switch_commit_roots(trans->transaction, fs_info);
>>>>>> +    ret = btrfs_write_and_wait_transaction(trans, src);
>>>>>> +    if (ret)
>>>>>> +        btrfs_std_error(fs_info, ret,
>>>>>> +            "Error while writing out transaction for qgroup");
>>>>>> +
>>>>>> +out:
>>>>>> +    mutex_unlock(&fs_info->tree_log_mutex);
>>>>>> +
>>>>>> +    /*
>>>>>> +     * Force parent root to be updated, as we recorded it before
>>>>>> so its
>>>>>> +     * last_trans == cur_transid.
>>>>>> +     * Or it won't be committed again onto disk after later
>>>>>> +     * insert_dir_item()
>>>>>> +     */
>>>>>> +    if (!ret)
>>>>>> +        record_root_in_trans(trans, parent, 1);
>>>>>> +    return ret;
>>>>>> +}
>>>>>
>>>>> NACK, holy shit we aren't adding a special transaction commit only
>>>>> for qgroup snapshots.  Figure out a different way.  Thanks,
>>>>
>>>>
>>>> Unfortunately I think we're going to have to swallow our pride on
>>>> this one :(
>>>>
>>>> Per our conversations on irc, and my detailed observations in this
>>>> e-mail:
>>>>
>>>> https://www.marc.info/?l=linux-btrfs&m=146257186311897&w=2
>>>>
>>>> It seems like we have a core problem in that root counting during
>>>> snapshot
>>>> create is unreliable and leads to corrupted qgroups. You add into
>>>> that the
>>>> poor assumptions made by the rest of the code (such as qgroup_inherit()
>>>> always marking dst->excl = node_size) and ti looks like we won't have a
>>>> proper fix until another qgroup rewrite.
>>>>
>>>> In the meantime, this would make qgroups numbers correct again. If
>>>> we drop a
>>>> single check in there to only run when qgroups are enabled, we can
>>>> mitigate
>>>> the performance impact. If I send that patch would you be ok to ACK
>>>> it this
>>>> time around?
>>>>
>>>
>>> Yeah as long as it's limited to only the qgroup case then I'm fine
>>> with it for now.  Thanks,
>>
>> Ok, here it goes. My xfstest for this issue is upstream now so I was
>> able to
>> run tests/btrfs/122 to verify that this still fixes the problem. Btw, so
>> that makes one drop snapshot test and two create snapshot tests for
>> qgroups
>> now. I'll make one for raid convert as that's broken too. Hopefully
>> then we
>> can be sure that this stuff doesn't break again.
>>
>> Btw, this is against Linux 4.5, let me know if you'd like it ported
>> forward.
>>     --Mark
>>
>> --
>> Mark Fasheh
>>
>>
>> From: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>
>> [PATCH] btrfs: qgroup: Fix qgroup accounting when creating snapshot
>>
>> Current btrfs qgroup design implies a requirement that after calling
>> btrfs_qgroup_account_extents() there must be a commit root switch.
>>
>> Normally this is OK, as btrfs_qgroup_accounting_extents() is only called
>> inside btrfs_commit_transaction() just be commit_cowonly_roots().
>>
>> However there is a exception at create_pending_snapshot(), which will
>> call btrfs_qgroup_account_extents() but no any commit root switch.
>>
>> In case of creating a snapshot whose parent root is itself (create a
>> snapshot of fs tree), it will corrupt qgroup by the following trace:
>> (skipped unrelated data)
>> ======
>> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384,
>> nr_old_roots = 0, nr_new_roots = 1
>> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count =
>> 1, rfer = 0, excl = 0
>> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count =
>> 1, rfer = 16384, excl = 16384
>> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384,
>> nr_old_roots = 0, nr_new_roots = 0
>> ======
>>
>> The problem here is in first qgroup_account_extent(), the
>> nr_new_roots of the extent is 1, which means its reference got
>> increased, and qgroup increased its rfer and excl.
>>
>> But at second qgroup_account_extent(), its reference got decreased, but
>> between these two qgroup_account_extent(), there is no switch roots.
>> This leads to the same nr_old_roots, and this extent just got ignored by
>> qgroup, which means this extent is wrongly accounted.
>>
>> Fix it by call commit_cowonly_roots() after qgroup_account_extent() in
>> create_pending_snapshot(), with needed preparation.
>>
>> Mark: I added a check at the top of qgroup_account_snapshot() to skip
>> this
>> code if qgroups are turned off. xfstest btrfs/122 exposes this problem.
>>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> Signed-off-by: Mark Fasheh <mfasheh@suse.de>
>> ---
>>  fs/btrfs/transaction.c | 129
>> ++++++++++++++++++++++++++++++++++++++++---------
>>  1 file changed, 105 insertions(+), 24 deletions(-)
>>
>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>> index b6031ce..21f6609 100644
>> --- a/fs/btrfs/transaction.c
>> +++ b/fs/btrfs/transaction.c
>> @@ -311,10 +311,11 @@ loop:
>>   * when the transaction commits
>>   */
>>  static int record_root_in_trans(struct btrfs_trans_handle *trans,
>> -                   struct btrfs_root *root)
>> +                   struct btrfs_root *root,
>> +                   int force)
>>  {
>> -    if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
>> -        root->last_trans < trans->transid) {
>> +    if ((test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
>> +        root->last_trans < trans->transid) || force) {
>>          WARN_ON(root == root->fs_info->extent_root);
>>          WARN_ON(root->commit_root != root->node);
>>
>> @@ -331,7 +332,7 @@ static int record_root_in_trans(struct
>> btrfs_trans_handle *trans,
>>          smp_wmb();
>>
>>          spin_lock(&root->fs_info->fs_roots_radix_lock);
>> -        if (root->last_trans == trans->transid) {
>> +        if (root->last_trans == trans->transid && !force) {
>>              spin_unlock(&root->fs_info->fs_roots_radix_lock);
>>              return 0;
>>          }
>> @@ -402,7 +403,7 @@ int btrfs_record_root_in_trans(struct
>> btrfs_trans_handle *trans,
>>          return 0;
>>
>>      mutex_lock(&root->fs_info->reloc_mutex);
>> -    record_root_in_trans(trans, root);
>> +    record_root_in_trans(trans, root, 0);
>>      mutex_unlock(&root->fs_info->reloc_mutex);
>>
>>      return 0;
>> @@ -1309,6 +1310,92 @@ int btrfs_defrag_root(struct btrfs_root *root)
>>  }
>>
>>  /*
>> + * Do all special snapshot related qgroup dirty hack.
>> + *
>> + * Will do all needed qgroup inherit and dirty hack like switch commit
>> + * roots inside one transaction and write all btree into disk, to make
>> + * qgroup works.
>> + */
>> +static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
>> +                   struct btrfs_root *src,
>> +                   struct btrfs_root *parent,
>> +                   struct btrfs_qgroup_inherit *inherit,
>> +                   u64 dst_objectid)
>> +{
>> +    struct btrfs_fs_info *fs_info = src->fs_info;
>> +    int ret;
>> +
>> +    /*
>> +     * Save some performance in the case that qgroups are not
>> +     * enabled. If this check races with the ioctl, rescan will
>> +     * kick in anyway.
>> +     */
>> +    mutex_lock(&fs_info->qgroup_ioctl_lock);
>> +    if (!fs_info->quota_enabled) {
>> +        mutex_unlock(&fs_info->qgroup_ioctl_lock);
>> +        return 0;
>> +    }
>> +    mutex_unlock(&fs_info->qgroup_ioctl_lock);
>> +
>> +    /*
>> +     * We are going to commit transaction, see
>> btrfs_commit_transaction()
>> +     * comment for reason locking tree_log_mutex
>> +     */
>> +    mutex_lock(&fs_info->tree_log_mutex);
>> +
>> +    ret = commit_fs_roots(trans, src);
>> +    if (ret)
>> +        goto out;
>> +    ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
>> +    if (ret < 0)
>> +        goto out;
>> +    ret = btrfs_qgroup_account_extents(trans, fs_info);
>> +    if (ret < 0)
>> +        goto out;
>> +
>> +    /* Now qgroup are all updated, we can inherit it to new qgroups */
>> +    ret = btrfs_qgroup_inherit(trans, fs_info,
>> +                   src->root_key.objectid, dst_objectid,
>> +                   inherit);
>> +    if (ret < 0)
>> +        goto out;
>> +
>> +    /*
>> +     * Now we do a simplified commit transaction, which will:
>> +     * 1) commit all subvolume and extent tree
>> +     *    To ensure all subvolume and extent tree have a valid
>> +     *    commit_root to accounting later insert_dir_item()
>> +     * 2) write all btree blocks onto disk
>> +     *    This is to make sure later btree modification will be cowed
>> +     *    Or commit_root can be populated and cause wrong qgroup numbers
>> +     * In this simplified commit, we don't really care about other trees
>> +     * like chunk and root tree, as they won't affect qgroup.
>> +     * And we don't write super to avoid half committed status.
>> +     */
>> +    ret = commit_cowonly_roots(trans, src);
>> +    if (ret)
>> +        goto out;
>> +    switch_commit_roots(trans->transaction, fs_info);
>> +    ret = btrfs_write_and_wait_transaction(trans, src);
>> +    if (ret)
>> +        btrfs_std_error(fs_info, ret,
>> +            "Error while writing out transaction for qgroup");
>> +
>> +out:
>> +    mutex_unlock(&fs_info->tree_log_mutex);
>> +
>> +    /*
>> +     * Force parent root to be updated, as we recorded it before so its
>> +     * last_trans == cur_transid.
>> +     * Or it won't be committed again onto disk after later
>> +     * insert_dir_item()
>> +     */
>> +    if (!ret)
>> +        record_root_in_trans(trans, parent, 1);
>> +    return ret;
>> +}
>> +
>> +/*
>>   * new snapshots need to be created at a very specific time in the
>>   * transaction commit.  This does the actual creation.
>>   *
>> @@ -1379,7 +1466,7 @@ static noinline int
>> create_pending_snapshot(struct btrfs_trans_handle *trans,
>>      dentry = pending->dentry;
>>      parent_inode = pending->dir;
>>      parent_root = BTRFS_I(parent_inode)->root;
>> -    record_root_in_trans(trans, parent_root);
>> +    record_root_in_trans(trans, parent_root, 0);
>>
>>      /*
>>       * insert the directory item
>> @@ -1414,7 +1501,7 @@ static noinline int
>> create_pending_snapshot(struct btrfs_trans_handle *trans,
>>          goto fail;
>>      }
>>
>> -    record_root_in_trans(trans, root);
>> +    record_root_in_trans(trans, root, 0);
>>      btrfs_set_root_last_snapshot(&root->root_item, trans->transid);
>>      memcpy(new_root_item, &root->root_item, sizeof(*new_root_item));
>>      btrfs_check_and_init_root_item(new_root_item);
>> @@ -1510,6 +1597,17 @@ static noinline int
>> create_pending_snapshot(struct btrfs_trans_handle *trans,
>>          goto fail;
>>      }
>>
>> +    /*
>> +     * Do special qgroup accounting for snapshot, as we do some qgroup
>> +     * snapshot hack to do fast snapshot.
>> +     * To co-operate with that hack, we do hack again.
>> +     * Or snapshot will be greatly slowed down by a subtree qgroup
>> rescan
>> +     */
>> +    ret = qgroup_account_snapshot(trans, root, parent_root,
>> +                      pending->inherit, objectid);
>> +    if (ret < 0)
>> +        goto fail;
>> +
>>      ret = btrfs_insert_dir_item(trans, parent_root,
>>                      dentry->d_name.name, dentry->d_name.len,
>>                      parent_inode, &key,
>> @@ -1552,23 +1650,6 @@ static noinline int
>> create_pending_snapshot(struct btrfs_trans_handle *trans,
>>          goto fail;
>>      }
>>
>> -    /*
>> -     * account qgroup counters before qgroup_inherit()
>> -     */
>> -    ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
>> -    if (ret)
>> -        goto fail;
>> -    ret = btrfs_qgroup_account_extents(trans, fs_info);
>> -    if (ret)
>> -        goto fail;
>> -    ret = btrfs_qgroup_inherit(trans, fs_info,
>> -                   root->root_key.objectid,
>> -                   objectid, pending->inherit);
>> -    if (ret) {
>> -        btrfs_abort_transaction(trans, root, ret);
>> -        goto fail;
>> -    }
>> -
>>  fail:
>>      pending->error = ret;
>>  dir_item_existed:
>>
>
> Reviewed-by: Josef Bacik <jbacik@fb.com>
>
> Thanks,
>
> Josef

Nice to hear that.

Thanks,
Qu
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4] btrfs: qgroup: Fix qgroup accounting when creating snapshot
  2016-05-11 20:30         ` Josef Bacik
  2016-05-11 23:33           ` Qu Wenruo
@ 2016-05-12  9:10           ` David Sterba
  1 sibling, 0 replies; 18+ messages in thread
From: David Sterba @ 2016-05-12  9:10 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Mark Fasheh, Qu Wenruo, linux-btrfs, fdmanana

On Wed, May 11, 2016 at 01:30:21PM -0700, Josef Bacik wrote:
> > Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> > Signed-off-by: Mark Fasheh <mfasheh@suse.de>
> Reviewed-by: Josef Bacik <jbacik@fb.com>

Applied to for-next with the following fixup to make it bisectable:

---
    btrfs: build fixup for qgroup_account_snapshot

    The macro btrfs_std_error got renamed to btrfs_handle_fs_error in an
    independent branch for the same merge target (4.7). To make the code
    compilable for bisectability reasons, add a temporary stub.

    Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/transaction.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index d7172d7ced5f..530081388d77 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1311,6 +1311,11 @@ int btrfs_defrag_root(struct btrfs_root *root)
        return ret;
 }

+/* Bisesctability fixup, remove in 4.8 */
+#ifndef btrfs_std_error
+#define btrfs_std_error btrfs_handle_fs_error
+#endif
+
 /*
  * Do all special snapshot related qgroup dirty hack.
  *
---

^ permalink raw reply related	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2016-05-12  9:10 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-04-15  9:08 [PATCH v4] btrfs: qgroup: Fix qgroup accounting when creating snapshot Qu Wenruo
2016-04-19 22:19 ` Mark Fasheh
2016-04-20 14:25   ` Holger Hoffstätte
2016-04-22 18:12 ` Josef Bacik
2016-04-22 18:21   ` Mark Fasheh
2016-04-22 18:23     ` Josef Bacik
2016-04-22 18:29       ` Mark Fasheh
2016-04-25  0:56       ` Qu Wenruo
2016-04-25 14:24         ` Josef Bacik
2016-04-26  0:35           ` Qu Wenruo
2016-04-26 14:26             ` Josef Bacik
2016-04-27  1:12               ` Qu Wenruo
2016-05-11 16:57   ` Mark Fasheh
2016-05-11 16:59     ` Josef Bacik
2016-05-11 19:53       ` Mark Fasheh
2016-05-11 20:30         ` Josef Bacik
2016-05-11 23:33           ` Qu Wenruo
2016-05-12  9:10           ` David Sterba

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).