linux-btrfs.vger.kernel.org archive mirror
* [PATCH 0/4] btrfs: update qgroups in drop snapshot
@ 2015-09-22 20:15 Mark Fasheh
  2015-09-22 20:15 ` [PATCH 1/4] Btrfs: use btrfs_get_fs_root in resolve_indirect_ref Mark Fasheh
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: Mark Fasheh @ 2015-09-22 20:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: jbacik, clm, quwenruo, Mark Fasheh

Hi,

The following 4 patches fix a regression introduced in Linux
4.2 where btrfs_drop_snapshot() wasn't updating qgroups, resulting in
them going bad.

The original e-mail pointing this out is below:

http://www.spinics.net/lists/linux-btrfs/msg46093.html


The first two patches are from Josef and fix bugs in our counting of
roots (which is critical for qgroups to work correctly). Both were
sent to the list already:

http://www.spinics.net/lists/linux-btrfs/msg47035.html
http://www.spinics.net/lists/linux-btrfs/msg47150.html

Truth be told, most of the time fixing this was spent figuring out
those two issues. Once I realized I was seeing a bug and we fixed it
correctly, my drop snapshot patch got dramatically smaller.

I also re-added some of the tracing in qgroup.c that we recently
lost. It is again possible to debug qgroup operations on a live
system, allowing us to find issues like the two above by narrowing
down our operations and manually walking through them via the
tracepoints under /sys/kernel/debug/tracing.
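For example, to watch the accounting live while reproducing a problem
(rough sketch; assumes tracefs is mounted at the usual location and uses
the event names added in patch 3/4):

    cd /sys/kernel/debug/tracing
    echo 1 > events/btrfs/btrfs_qgroup_insert_dirty_extent/enable
    echo 1 > events/btrfs/btrfs_qgroup_account_extents/enable
    echo 1 > events/btrfs/btrfs_qgroup_account_extent/enable
    echo 1 > events/btrfs/qgroup_update_counters/enable
    cat trace_pipe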

The entire patch series can be tested with the following xfstests.git
patch (which will be sent to the appropriate list shortly):

https://github.com/markfasheh/xfstests/commit/b09ca51c012824e44546b13862ab1f93a6f2f675
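For reference, a typical way to run just that case once the xfstests
patch is applied (sketch only; the device paths below are placeholders):

    cd xfstests
    export TEST_DEV=/dev/sdb1 TEST_DIR=/mnt/test
    export SCRATCH_DEV=/dev/sdc1 SCRATCH_MNT=/mnt/scratch
    ./check btrfs/098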

Thanks,
	--Mark


From: Mark Fasheh <mfasheh@suse.de>

[PATCH] btrfs: add test for quota groups and drop snapshot

Test btrfs quota group consistency operations during snapshot
delete. Btrfs has had long standing issues with drop snapshot
failing to properly account for quota groups. This test crafts
several snapshot trees with shared and exclusive elements. One of
the trees is removed and then quota group consistency is checked.

This issue is fixed by the following linux kernel patches:
   Btrfs: use btrfs_get_fs_root in resolve_indirect_ref
   Btrfs: keep dropped roots in cache until transaction commit
   btrfs: qgroup: account shared subtree during snapshot delete

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
---
 tests/btrfs/098   | 166 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/btrfs/group |   1 +
 2 files changed, 167 insertions(+)
 create mode 100644 tests/btrfs/098

diff --git a/tests/btrfs/098 b/tests/btrfs/098
new file mode 100644
index 0000000..7977a90
--- /dev/null
+++ b/tests/btrfs/098
@@ -0,0 +1,166 @@
+#! /bin/bash
+# FS QA Test No. btrfs/098
+#
+# Test btrfs quota group consistency operations during snapshot
+# delete. Btrfs has had long standing issues with drop snapshot
+# failing to properly account for quota groups. This test crafts
+# several snapshot trees with shared and exclusive elements. One of
+# the trees is removed and then quota group consistency is checked.
+#
+# This issue is fixed by the following linux kernel patches:
+#
+#    Btrfs: use btrfs_get_fs_root in resolve_indirect_ref
+#    Btrfs: keep dropped roots in cache until transaction commit
+#    btrfs: qgroup: account shared subtree during snapshot delete
+#
+#-----------------------------------------------------------------------
+# Copyright (C) 2015 SUSE Linux Products GmbH. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	rm -fr $tmp
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_need_to_be_root
+
+rm -f $seqres.full
+
+# Create an fs tree of a given height at a target location. This is
+# done by aggressively creating inline extents to expand the number of
+# nodes required. We also add a traditional extent so that
+# drop_snapshot is forced to walk at least one extent that is not
+# stored in metadata.
+#
+# NOTE: The ability to vary tree height for this test is very useful
+# for debugging problems with drop_snapshot(). As a result we retain
+# that parameter even though the test below always does level 2 trees.
+_explode_fs_tree () {
+    local level=$1;
+    local loc="$2";
+    local bs=4095;
+    local cnt=1;
+    local n;
+
+    if [ -z $loc ]; then
+	echo "specify location for fileset"
+	exit 1;
+    fi
+
+    case $level in
+	1)# this always reproduces level 1 trees	
+	    n=10;
+	    ;;
+	2)# this always reproduces level 2 trees
+	    n=1500
+	    ;;
+	3)# this always reproduces level 3 trees
+	    n=1000000;
+	    ;;
+	*)
+	    echo "Can't make level $level trees";
+	    exit 1;
+	    ;;
+    esac
+
+    mkdir -p $loc
+    for i in `seq -w 1 $n`;
+    do
+	dd status=none if=/dev/zero of=$loc/file$i bs=$bs count=$cnt
+    done
+
+    bs=131072
+    cnt=1
+    dd status=none if=/dev/zero of=$loc/extentfile bs=$bs count=$cnt
+}
+
+# Force the default leaf size as the calculations for making our btree
+# heights are based on that.
+run_check _scratch_mkfs "--nodesize 16384"
+_scratch_mount
+
+# populate the default subvolume and create a snapshot ('snap1')
+_explode_fs_tree 1 $SCRATCH_MNT/files
+_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap1
+
+# create new btree nodes in this snapshot. They will become shared
+# with the next snapshot we create.
+_explode_fs_tree 1 $SCRATCH_MNT/snap1/files-snap1
+
+# create our final snapshot ('snap2'), populate it with
+# exclusively owned nodes.
+#
+# As a result of this action, snap2 will get an implied ref to the
+# 128K extent created in the default subvolume.
+_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT/snap1 $SCRATCH_MNT/snap2
+_explode_fs_tree 1 $SCRATCH_MNT/snap2/files-snap2
+
+# Enable qgroups now that we have our filesystem prepared. This
+# will kick off a scan which we will have to wait for.
+_run_btrfs_util_prog quota enable $SCRATCH_MNT
+_run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
+
+# Remount to clear cache, force everything to disk
+_scratch_unmount
+_scratch_mount
+
+# Finally, delete snap1 to trigger btrfs_drop_snapshot(). This
+# snapshot is most interesting to delete because it will cause some
+# nodes to go exclusively owned for snap2, while some will stay shared
+# with the default subvolume. That exercises a maximum of the drop
+# snapshot/qgroup interactions.
+#
+# snap2's implied ref to the 128K extent in files/ can be lost by
+# the root finding code in qgroup accounting due to snap1 no longer
+# providing a path to it. This was fixed by the first two patches
+# referenced above.
+_run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1
+
+# There is no way from userspace to force btrfs_drop_snapshot to run
+# at a given time (even via mount/unmount). We must wait for it to
+# start and complete. This is the shortest time on my test systems I
+# have found which always allows drop_snapshot to run to completion.
+sleep 45
+
+_scratch_unmount
+
+# generate a qgroup report and look for inconsistent groups
+#  - don't use _run_btrfs_util_prog here as it captures the output and
+#    we need to grep it.
+$BTRFS_UTIL_PROG check --qgroup-report $SCRATCH_DEV 2>&1 | grep -E -q "Counts for qgroup.*are different"
+if [ $? -ne 0 ]; then
+    status=0
+fi
+
+exit
diff --git a/tests/btrfs/group b/tests/btrfs/group
index e13865a..d78a355 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -100,3 +100,4 @@
 095 auto quick metadata
 096 auto quick clone
 097 auto quick send clone
+098 auto qgroup
-- 
2.1.2



* [PATCH 1/4] Btrfs: use btrfs_get_fs_root in resolve_indirect_ref
  2015-09-22 20:15 [PATCH 0/4] btrfs: update qgroups in drop snapshot Mark Fasheh
@ 2015-09-22 20:15 ` Mark Fasheh
  2015-09-22 20:15 ` [PATCH 2/4] Btrfs: keep dropped roots in cache until transaction commit, V2 Mark Fasheh
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Mark Fasheh @ 2015-09-22 20:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: jbacik, clm, quwenruo, Mark Fasheh

From: Josef Bacik <jbacik@fb.com>

The backref code will look up the fs_root we're trying to resolve our indirect
refs for. Unfortunately we use btrfs_read_fs_root_no_name, which returns -ENOENT
if the root's refs are 0.  This isn't helpful for the qgroup work with snapshot
delete, as it won't be able to search down the snapshot we are deleting, which
will cause us to miss roots.  So use btrfs_get_fs_root and pass false for
check_ref so we can always get the root we're looking for.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
---
 fs/btrfs/backref.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 802fabb..5de66e9 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -332,7 +332,7 @@ static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info,
 
 	index = srcu_read_lock(&fs_info->subvol_srcu);
 
-	root = btrfs_read_fs_root_no_name(fs_info, &root_key);
+	root = btrfs_get_fs_root(fs_info, &root_key, false);
 	if (IS_ERR(root)) {
 		srcu_read_unlock(&fs_info->subvol_srcu, index);
 		ret = PTR_ERR(root);
-- 
2.1.2



* [PATCH 2/4] Btrfs: keep dropped roots in cache until transaction commit, V2
  2015-09-22 20:15 [PATCH 0/4] btrfs: update qgroups in drop snapshot Mark Fasheh
  2015-09-22 20:15 ` [PATCH 1/4] Btrfs: use btrfs_get_fs_root in resolve_indirect_ref Mark Fasheh
@ 2015-09-22 20:15 ` Mark Fasheh
  2015-09-22 20:15 ` [PATCH 3/4] btrfs: Add qgroup tracing Mark Fasheh
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Mark Fasheh @ 2015-09-22 20:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: jbacik, clm, quwenruo, Mark Fasheh

From: Josef Bacik <jbacik@fb.com>

When dropping a snapshot we need to account for the qgroup changes.  If we drop
the snapshot all in one go then the backref code will fail to find blocks from
the snapshot we dropped since it won't be able to find the root in the fs root
cache.  This can lead to us failing to find refs from other roots that pointed
at blocks in the now deleted root.  To handle this we need to not remove the fs
roots from the cache until after we process the qgroup operations.  Do this by
adding dropped roots to a list on the transaction, and letting the transaction
remove the roots at the same time it drops the commit roots.  This will keep all
of the backref searching code in sync properly, and fixes a problem Mark was
seeing with snapshot delete and qgroups.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
---
 fs/btrfs/extent-tree.c |  2 +-
 fs/btrfs/transaction.c | 30 ++++++++++++++++++++++++++++++
 fs/btrfs/transaction.h |  5 ++++-
 3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 07204bf..3a70e6c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8576,7 +8576,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
 	}
 
 	if (test_bit(BTRFS_ROOT_IN_RADIX, &root->state)) {
-		btrfs_drop_and_free_fs_root(tree_root->fs_info, root);
+		btrfs_add_dropped_root(trans, root);
 	} else {
 		free_extent_buffer(root->node);
 		free_extent_buffer(root->commit_root);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index f5021fc..cd15f39 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -117,6 +117,16 @@ static noinline void switch_commit_roots(struct btrfs_transaction *trans,
 			btrfs_unpin_free_ino(root);
 		clear_btree_io_tree(&root->dirty_log_pages);
 	}
+
+	/* We can free old roots now. */
+	spin_lock(&trans->dropped_roots_lock);
+	while (!list_empty(&trans->dropped_roots)) {
+		root = list_first_entry(&trans->dropped_roots,
+					struct btrfs_root, root_list);
+		list_del_init(&root->root_list);
+		btrfs_drop_and_free_fs_root(fs_info, root);
+	}
+	spin_unlock(&trans->dropped_roots_lock);
 	up_write(&fs_info->commit_root_sem);
 }
 
@@ -255,9 +265,11 @@ loop:
 	INIT_LIST_HEAD(&cur_trans->pending_ordered);
 	INIT_LIST_HEAD(&cur_trans->dirty_bgs);
 	INIT_LIST_HEAD(&cur_trans->io_bgs);
+	INIT_LIST_HEAD(&cur_trans->dropped_roots);
 	mutex_init(&cur_trans->cache_write_mutex);
 	cur_trans->num_dirty_bgs = 0;
 	spin_lock_init(&cur_trans->dirty_bgs_lock);
+	spin_lock_init(&cur_trans->dropped_roots_lock);
 	list_add_tail(&cur_trans->list, &fs_info->trans_list);
 	extent_io_tree_init(&cur_trans->dirty_pages,
 			     fs_info->btree_inode->i_mapping);
@@ -334,6 +346,24 @@ static int record_root_in_trans(struct btrfs_trans_handle *trans,
 }
 
 
+void btrfs_add_dropped_root(struct btrfs_trans_handle *trans,
+			    struct btrfs_root *root)
+{
+	struct btrfs_transaction *cur_trans = trans->transaction;
+
+	/* Add ourselves to the transaction dropped list */
+	spin_lock(&cur_trans->dropped_roots_lock);
+	list_add_tail(&root->root_list, &cur_trans->dropped_roots);
+	spin_unlock(&cur_trans->dropped_roots_lock);
+
+	/* Make sure we don't try to update the root at commit time */
+	spin_lock(&root->fs_info->fs_roots_radix_lock);
+	radix_tree_tag_clear(&root->fs_info->fs_roots_radix,
+			     (unsigned long)root->root_key.objectid,
+			     BTRFS_ROOT_TRANS_TAG);
+	spin_unlock(&root->fs_info->fs_roots_radix_lock);
+}
+
 int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
 			       struct btrfs_root *root)
 {
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index eb09c20..ef4ff38 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -65,6 +65,7 @@ struct btrfs_transaction {
 	struct list_head switch_commits;
 	struct list_head dirty_bgs;
 	struct list_head io_bgs;
+	struct list_head dropped_roots;
 	u64 num_dirty_bgs;
 
 	/*
@@ -74,6 +75,7 @@ struct btrfs_transaction {
 	 */
 	struct mutex cache_write_mutex;
 	spinlock_t dirty_bgs_lock;
+	spinlock_t dropped_roots_lock;
 	struct btrfs_delayed_ref_root delayed_refs;
 	int aborted;
 	int dirty_bg_run;
@@ -214,5 +216,6 @@ int btrfs_transaction_blocked(struct btrfs_fs_info *info);
 int btrfs_transaction_in_commit(struct btrfs_fs_info *info);
 void btrfs_put_transaction(struct btrfs_transaction *transaction);
 void btrfs_apply_pending_changes(struct btrfs_fs_info *fs_info);
-
+void btrfs_add_dropped_root(struct btrfs_trans_handle *trans,
+			    struct btrfs_root *root);
 #endif
-- 
2.1.2



* [PATCH 3/4] btrfs: Add qgroup tracing
  2015-09-22 20:15 [PATCH 0/4] btrfs: update qgroups in drop snapshot Mark Fasheh
  2015-09-22 20:15 ` [PATCH 1/4] Btrfs: use btrfs_get_fs_root in resolve_indirect_ref Mark Fasheh
  2015-09-22 20:15 ` [PATCH 2/4] Btrfs: keep dropped roots in cache until transaction commit, V2 Mark Fasheh
@ 2015-09-22 20:15 ` Mark Fasheh
  2015-09-22 20:15 ` [PATCH 4/4] btrfs: qgroup: account shared subtree during snapshot delete Mark Fasheh
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Mark Fasheh @ 2015-09-22 20:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: jbacik, clm, quwenruo, Mark Fasheh

This patch adds tracepoints to the qgroup code on both the reporting side
(insert_dirty_extent) and the accounting side. Taken together they allow us
to see what qgroup operations have happened, and what their result was.
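If needed, the standard trace event filters can be used to narrow the
output down to a single extent, for example (the bytenr below is just an
example value):

    cd /sys/kernel/debug/tracing
    echo 'bytenr == 29360128' > events/btrfs/btrfs_qgroup_account_extent/filter
    echo 1 > events/btrfs/btrfs_qgroup_account_extent/enable
    cat trace_pipe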

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
---
 fs/btrfs/qgroup.c            | 10 +++++
 include/trace/events/btrfs.h | 88 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 8a82029..676202e 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1461,6 +1461,8 @@ struct btrfs_qgroup_extent_record
 	struct btrfs_qgroup_extent_record *entry;
 	u64 bytenr = record->bytenr;
 
+	trace_btrfs_qgroup_insert_dirty_extent(record);
+
 	while (*p) {
 		parent_node = *p;
 		entry = rb_entry(parent_node, struct btrfs_qgroup_extent_record,
@@ -1591,6 +1593,9 @@ static int qgroup_update_counters(struct btrfs_fs_info *fs_info,
 		cur_old_count = btrfs_qgroup_get_old_refcnt(qg, seq);
 		cur_new_count = btrfs_qgroup_get_new_refcnt(qg, seq);
 
+		trace_qgroup_update_counters(qg->qgroupid, cur_old_count,
+					     cur_new_count);
+
 		/* Rfer update part */
 		if (cur_old_count == 0 && cur_new_count > 0) {
 			qg->rfer += num_bytes;
@@ -1684,6 +1689,9 @@ btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans,
 		goto out_free;
 	BUG_ON(!fs_info->quota_root);
 
+	trace_btrfs_qgroup_account_extent(bytenr, num_bytes, nr_old_roots,
+					  nr_new_roots);
+
 	qgroups = ulist_alloc(GFP_NOFS);
 	if (!qgroups) {
 		ret = -ENOMEM;
@@ -1753,6 +1761,8 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans,
 		record = rb_entry(node, struct btrfs_qgroup_extent_record,
 				  node);
 
+		trace_btrfs_qgroup_account_extents(record);
+
 		if (!ret) {
 			/*
 			 * Use (u64)-1 as time_seq to do special search, which
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 0b73af9..9d7b545 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -23,7 +23,7 @@ struct map_lookup;
 struct extent_buffer;
 struct btrfs_work;
 struct __btrfs_workqueue;
-struct btrfs_qgroup_operation;
+struct btrfs_qgroup_extent_record;
 
 #define show_ref_type(type)						\
 	__print_symbolic(type,						\
@@ -1117,6 +1117,92 @@ DEFINE_EVENT(btrfs__workqueue_done, btrfs_workqueue_destroy,
 	TP_ARGS(wq)
 );
 
+DECLARE_EVENT_CLASS(btrfs_qgroup_extent,
+	TP_PROTO(struct btrfs_qgroup_extent_record *rec),
+
+	TP_ARGS(rec),
+
+	TP_STRUCT__entry(
+		__field(	u64,  bytenr		)
+		__field(	u64,  num_bytes		)
+	),
+
+	TP_fast_assign(
+		__entry->bytenr		= rec->bytenr,
+		__entry->num_bytes	= rec->num_bytes;
+	),
+
+	TP_printk("bytenr = %llu, num_bytes = %llu",
+		  (unsigned long long)__entry->bytenr,
+		  (unsigned long long)__entry->num_bytes)
+);
+
+DEFINE_EVENT(btrfs_qgroup_extent, btrfs_qgroup_account_extents,
+
+	TP_PROTO(struct btrfs_qgroup_extent_record *rec),
+
+	TP_ARGS(rec)
+);
+
+DEFINE_EVENT(btrfs_qgroup_extent, btrfs_qgroup_insert_dirty_extent,
+
+	TP_PROTO(struct btrfs_qgroup_extent_record *rec),
+
+	TP_ARGS(rec)
+);
+
+TRACE_EVENT(btrfs_qgroup_account_extent,
+
+	TP_PROTO(u64 bytenr, u64 num_bytes, u64 nr_old_roots, u64 nr_new_roots),
+
+	TP_ARGS(bytenr, num_bytes, nr_old_roots, nr_new_roots),
+
+	TP_STRUCT__entry(
+		__field(	u64,  bytenr			)
+		__field(	u64,  num_bytes			)
+		__field(	u64,  nr_old_roots		)
+		__field(	u64,  nr_new_roots		)
+	),
+
+	TP_fast_assign(
+		__entry->bytenr		= bytenr;
+		__entry->num_bytes	= num_bytes;
+		__entry->nr_old_roots	= nr_old_roots;
+		__entry->nr_new_roots	= nr_new_roots;
+	),
+
+	TP_printk("bytenr = %llu, num_bytes = %llu, nr_old_roots = %llu, "
+		  "nr_new_roots = %llu",
+		  __entry->bytenr,
+		  __entry->num_bytes,
+		  __entry->nr_old_roots,
+		  __entry->nr_new_roots)
+);
+
+TRACE_EVENT(qgroup_update_counters,
+
+	TP_PROTO(u64 qgid, u64 cur_old_count, u64 cur_new_count),
+
+	TP_ARGS(qgid, cur_old_count, cur_new_count),
+
+	TP_STRUCT__entry(
+		__field(	u64,  qgid			)
+		__field(	u64,  cur_old_count		)
+		__field(	u64,  cur_new_count		)
+	),
+
+	TP_fast_assign(
+		__entry->qgid		= qgid;
+		__entry->cur_old_count	= cur_old_count;
+		__entry->cur_new_count	= cur_new_count;
+	),
+
+	    TP_printk("	qgid = %llu, cur_old_count = %llu, cur_new_count = %llu",
+		  __entry->qgid,
+		  __entry->cur_old_count,
+		  __entry->cur_new_count)
+);
+
 #endif /* _TRACE_BTRFS_H */
 
 /* This part must be outside protection */
-- 
2.1.2



* [PATCH 4/4] btrfs: qgroup: account shared subtree during snapshot delete
  2015-09-22 20:15 [PATCH 0/4] btrfs: update qgroups in drop snapshot Mark Fasheh
                   ` (2 preceding siblings ...)
  2015-09-22 20:15 ` [PATCH 3/4] btrfs: Add qgroup tracing Mark Fasheh
@ 2015-09-22 20:15 ` Mark Fasheh
  2015-11-02  1:59   ` Qu Wenruo
  2015-09-22 21:12 ` [PATCH 0/4] btrfs: update qgroups in drop snapshot Mark Fasheh
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Mark Fasheh @ 2015-09-22 20:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: jbacik, clm, quwenruo, Mark Fasheh

Commit 0ed4792 ('btrfs: qgroup: Switch to new extent-oriented qgroup
mechanism.') removed our qgroup accounting during
btrfs_drop_snapshot(). Predictably, this results in qgroup numbers
going bad shortly after a snapshot is removed.

Fix this by adding a dirty extent record when we encounter extents during
our shared subtree walk. This effectively restores the functionality we had
with the original shared subtree walking code in commit 1152651 ('btrfs:
qgroup: account shared subtrees during snapshot delete').

The idea with the original patch (and this one) is that shared subtrees can
get skipped during drop_snapshot. The shared subtree walk then allows us a
chance to visit those extents and add them to the qgroup work for later
processing. This ultimately makes the accounting for drop snapshot work.

The new qgroup code nicely handles all the other extents during the tree
walk via the ref dec/inc functions so we don't have to add actions beyond
what we had originally.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
---
 fs/btrfs/extent-tree.c | 41 ++++++++++++++++++++++++++++++++++-------
 1 file changed, 34 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3a70e6c..89be620 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7757,17 +7757,37 @@ reada:
 }
 
 /*
- * TODO: Modify related function to add related node/leaf to dirty_extent_root,
- * for later qgroup accounting.
- *
- * Current, this function does nothing.
+ * These may not be seen by the usual inc/dec ref code so we have to
+ * add them here.
  */
+static int record_one_subtree_extent(struct btrfs_trans_handle *trans,
+				     struct btrfs_root *root, u64 bytenr,
+				     u64 num_bytes)
+{
+	struct btrfs_qgroup_extent_record *qrecord;
+	struct btrfs_delayed_ref_root *delayed_refs;
+
+	qrecord = kmalloc(sizeof(*qrecord), GFP_NOFS);
+	if (!qrecord)
+		return -ENOMEM;
+
+	qrecord->bytenr = bytenr;
+	qrecord->num_bytes = num_bytes;
+	qrecord->old_roots = NULL;
+
+	delayed_refs = &trans->transaction->delayed_refs;
+	if (btrfs_qgroup_insert_dirty_extent(delayed_refs, qrecord))
+		kfree(qrecord);
+
+	return 0;
+}
+
 static int account_leaf_items(struct btrfs_trans_handle *trans,
 			      struct btrfs_root *root,
 			      struct extent_buffer *eb)
 {
 	int nr = btrfs_header_nritems(eb);
-	int i, extent_type;
+	int i, extent_type, ret;
 	struct btrfs_key key;
 	struct btrfs_file_extent_item *fi;
 	u64 bytenr, num_bytes;
@@ -7790,6 +7810,10 @@ static int account_leaf_items(struct btrfs_trans_handle *trans,
 			continue;
 
 		num_bytes = btrfs_file_extent_disk_num_bytes(eb, fi);
+
+		ret = record_one_subtree_extent(trans, root, bytenr, num_bytes);
+		if (ret)
+			return ret;
 	}
 	return 0;
 }
@@ -7858,8 +7882,6 @@ static int adjust_slots_upwards(struct btrfs_root *root,
 
 /*
  * root_eb is the subtree root and is locked before this function is called.
- * TODO: Modify this function to mark all (including complete shared node)
- * to dirty_extent_root to allow it get accounted in qgroup.
  */
 static int account_shared_subtree(struct btrfs_trans_handle *trans,
 				  struct btrfs_root *root,
@@ -7937,6 +7959,11 @@ walk_down:
 			btrfs_tree_read_lock(eb);
 			btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
 			path->locks[level] = BTRFS_READ_LOCK_BLOCKING;
+
+			ret = record_one_subtree_extent(trans, root, child_bytenr,
+							root->nodesize);
+			if (ret)
+				goto out;
 		}
 
 		if (level == 0) {
-- 
2.1.2



* Re: [PATCH 0/4] btrfs: update qgroups in drop snapshot
  2015-09-22 20:15 [PATCH 0/4] btrfs: update qgroups in drop snapshot Mark Fasheh
                   ` (3 preceding siblings ...)
  2015-09-22 20:15 ` [PATCH 4/4] btrfs: qgroup: account shared subtree during snapshot delete Mark Fasheh
@ 2015-09-22 21:12 ` Mark Fasheh
  2015-09-23  1:40   ` Qu Wenruo
  2015-09-23  3:58 ` Qu Wenruo
  2015-09-25  3:17 ` Qu Wenruo
  6 siblings, 1 reply; 17+ messages in thread
From: Mark Fasheh @ 2015-09-22 21:12 UTC (permalink / raw)
  To: linux-btrfs; +Cc: jbacik, clm, quwenruo

On Tue, Sep 22, 2015 at 01:15:44PM -0700, Mark Fasheh wrote:
> The entire patch series can be tested with the following xfstests.git
> patch, which will be sent to the appropriate list shortly)
> 
> https://github.com/markfasheh/xfstests/commit/b09ca51c012824e44546b13862ab1f93a6f2f675

That xfstests patch doesn't apply against the latest xfstests.git; attached is
one that does. A git branch is at:

git@github.com:markfasheh/xfstests.git qgroup-drop-snapshot
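Or, to grab the branch directly, something like the following should
work (over https rather than ssh):

    git clone -b qgroup-drop-snapshot https://github.com/markfasheh/xfstests.git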

Sorry about that.
	--Mark

--
Mark Fasheh


From: Mark Fasheh <mfasheh@suse.de>

[PATCH] btrfs: add test for quota groups and drop snapshot

Test btrfs quota group consistency operations during snapshot
delete. Btrfs has had long standing issues with drop snapshot
failing to properly account for quota groups. This test crafts
several snapshot trees with shared and exclusive elements. One of
the trees is removed and then quota group consistency is checked.

This issue is fixed by the following linux kernel patches:
   Btrfs: use btrfs_get_fs_root in resolve_indirect_ref
   Btrfs: keep dropped roots in cache until transaction commit
   btrfs: qgroup: account shared subtree during snapshot delete

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
---
 tests/btrfs/104     | 166 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/btrfs/104.out |   1 +
 tests/btrfs/group   |   1 +
 3 files changed, 168 insertions(+)
 create mode 100644 tests/btrfs/104
 create mode 100644 tests/btrfs/104.out

diff --git a/tests/btrfs/104 b/tests/btrfs/104
new file mode 100644
index 0000000..29dc3be
--- /dev/null
+++ b/tests/btrfs/104
@@ -0,0 +1,166 @@
+#! /bin/bash
+# FS QA Test No. btrfs/104
+#
+# Test btrfs quota group consistency operations during snapshot
+# delete. Btrfs has had long standing issues with drop snapshot
+# failing to properly account for quota groups. This test crafts
+# several snapshot trees with shared and exclusive elements. One of
+# the trees is removed and then quota group consistency is checked.
+#
+# This issue is fixed by the following linux kernel patches:
+#
+#    Btrfs: use btrfs_get_fs_root in resolve_indirect_ref
+#    Btrfs: keep dropped roots in cache until transaction commit
+#    btrfs: qgroup: account shared subtree during snapshot delete
+#
+#-----------------------------------------------------------------------
+# Copyright (C) 2015 SUSE Linux Products GmbH. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	rm -fr $tmp
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_need_to_be_root
+
+rm -f $seqres.full
+
+# Create an fs tree of a given height at a target location. This is
+# done by aggressively creating inline extents to expand the number of
+# nodes required. We also add a traditional extent so that
+# drop_snapshot is forced to walk at least one extent that is not
+# stored in metadata.
+#
+# NOTE: The ability to vary tree height for this test is very useful
+# for debugging problems with drop_snapshot(). As a result we retain
+# that parameter even though the test below always does level 2 trees.
+_explode_fs_tree () {
+    local level=$1;
+    local loc="$2";
+    local bs=4095;
+    local cnt=1;
+    local n;
+
+    if [ -z $loc ]; then
+	echo "specify location for fileset"
+	exit 1;
+    fi
+
+    case $level in
+	1)# this always reproduces level 1 trees
+	    n=10;
+	    ;;
+	2)# this always reproduces level 2 trees
+	    n=1500
+	    ;;
+	3)# this always reproduces level 3 trees
+	    n=1000000;
+	    ;;
+	*)
+	    echo "Can't make level $level trees";
+	    exit 1;
+	    ;;
+    esac
+
+    mkdir -p $loc
+    for i in `seq -w 1 $n`;
+    do
+	dd status=none if=/dev/zero of=$loc/file$i bs=$bs count=$cnt
+    done
+
+    bs=131072
+    cnt=1
+    dd status=none if=/dev/zero of=$loc/extentfile bs=$bs count=$cnt
+}
+
+# Force the default leaf size as the calculations for making our btree
+# heights are based on that.
+run_check _scratch_mkfs "--nodesize 16384"
+_scratch_mount
+
+# populate the default subvolume and create a snapshot ('snap1')
+_explode_fs_tree 1 $SCRATCH_MNT/files
+_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap1
+
+# create new btree nodes in this snapshot. They will become shared
+# with the next snapshot we create.
+_explode_fs_tree 1 $SCRATCH_MNT/snap1/files-snap1
+
+# create our final snapshot ('snap2'), populate it with
+# exclusively owned nodes.
+#
+# As a result of this action, snap2 will get an implied ref to the
+# 128K extent created in the default subvolume.
+_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT/snap1 $SCRATCH_MNT/snap2
+_explode_fs_tree 1 $SCRATCH_MNT/snap2/files-snap2
+
+# Enable qgroups now that we have our filesystem prepared. This
+# will kick off a scan which we will have to wait for.
+_run_btrfs_util_prog quota enable $SCRATCH_MNT
+_run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
+
+# Remount to clear cache, force everything to disk
+_scratch_unmount
+_scratch_mount
+
+# Finally, delete snap1 to trigger btrfs_drop_snapshot(). This
+# snapshot is most interesting to delete because it will cause some
+# nodes to go exclusively owned for snap2, while some will stay shared
+# with the default subvolume. That exercises a maximum of the drop
+# snapshot/qgroup interactions.
+#
+# snap2's implied ref to the 128K extent in files/ can be lost by
+# the root finding code in qgroup accounting due to snap1 no longer
+# providing a path to it. This was fixed by the first two patches
+# referenced above.
+_run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1
+
+# There is no way from userspace to force btrfs_drop_snapshot to run
+# at a given time (even via mount/unmount). We must wait for it to
+# start and complete. This is the shortest time on my test systems I
+# have found which always allows drop_snapshot to run to completion.
+sleep 45
+
+_scratch_unmount
+
+# generate a qgroup report and look for inconsistent groups
+#  - don't use _run_btrfs_util_prog here as it captures the output and
+#    we need to grep it.
+$BTRFS_UTIL_PROG check --qgroup-report $SCRATCH_DEV 2>&1 | grep -E -q "Counts for qgroup.*are different"
+if [ $? -ne 0 ]; then
+    status=0
+fi
+
+exit
diff --git a/tests/btrfs/104.out b/tests/btrfs/104.out
new file mode 100644
index 0000000..1ed84bc
--- /dev/null
+++ b/tests/btrfs/104.out
@@ -0,0 +1 @@
+QA output created by 104
diff --git a/tests/btrfs/group b/tests/btrfs/group
index e92a65a..6218adf 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -106,3 +106,4 @@
 101 auto quick replace
 102 auto quick metadata enospc
 103 auto quick clone compress
+104 auto qgroup
-- 
2.1.2



* Re: [PATCH 0/4] btrfs: update qgroups in drop snapshot
  2015-09-22 21:12 ` [PATCH 0/4] btrfs: update qgroups in drop snapshot Mark Fasheh
@ 2015-09-23  1:40   ` Qu Wenruo
  2015-09-23 21:49     ` Mark Fasheh
  0 siblings, 1 reply; 17+ messages in thread
From: Qu Wenruo @ 2015-09-23  1:40 UTC (permalink / raw)
  To: Mark Fasheh, linux-btrfs; +Cc: jbacik, clm

Hi Mark,

Thanks for all your work on this.
Comment on test case is inlined below.

Mark Fasheh wrote on 2015/09/22 14:12 -0700:
> On Tue, Sep 22, 2015 at 01:15:44PM -0700, Mark Fasheh wrote:
>> The entire patch series can be tested with the following xfstests.git
>> patch, which will be sent to the appropriate list shortly)
>>
>> https://github.com/markfasheh/xfstests/commit/b09ca51c012824e44546b13862ab1f93a6f2f675
>
> That xfstests patch doesn't apply against latest xfstests.git, attached is
> one that does. Git branch is at
>
> git@github.com:markfasheh/xfstests.git qgroup-drop-snapshot
>
> Sorry about that.
> 	--Mark
>
> --
> Mark Fasheh
>
>
> From: Mark Fasheh <mfasheh@suse.de>
>
> [PATCH] btrfs: add test for quota groups and drop snapshot
>
> Test btrfs quota group consistency operations during snapshot
> delete. Btrfs has had long standing issues with drop snapshot
> failing to properly account for quota groups. This test crafts
> several snapshot trees with shared and exclusive elements. One of
> the trees is removed and then quota group consistency is checked.
>
> This issue is fixed by the following linux kernel patches:
>     Btrfs: use btrfs_get_fs_root in resolve_indirect_ref
>     Btrfs: keep dropped roots in cache until transaciton commit
>     btrfs: qgroup: account shared subtree during snapshot delete
>
> Signed-off-by: Mark Fasheh <mfasheh@suse.de>
> ---
>   tests/btrfs/104     | 166 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>   tests/btrfs/104.out |   1 +
>   tests/btrfs/group   |   1 +
>   3 files changed, 168 insertions(+)
>   create mode 100644 tests/btrfs/104
>   create mode 100644 tests/btrfs/104.out
>
> diff --git a/tests/btrfs/104 b/tests/btrfs/104
> new file mode 100644
> index 0000000..29dc3be
> --- /dev/null
> +++ b/tests/btrfs/104
> @@ -0,0 +1,166 @@
> +#! /bin/bash
> +# FS QA Test No. btrfs/104
> +#
> +# Test btrfs quota group consistency operations during snapshot
> +# delete. Btrfs has had long standing issues with drop snapshot
> +# failing to properly account for quota groups. This test crafts
> +# several snapshot trees with shared and exclusive elements. One of
> +# the trees is removed and then quota group consistency is checked.
> +#
> +# This issue is fixed by the following linux kernel patches:
> +#
> +#    Btrfs: use btrfs_get_fs_root in resolve_indirect_ref
> +#    Btrfs: keep dropped roots in cache until transaciton commit
> +#    btrfs: qgroup: account shared subtree during snapshot delete
> +#
> +#-----------------------------------------------------------------------
> +# Copyright (C) 2015 SUSE Linux Products GmbH. All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#
> +#-----------------------------------------------------------------------
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +	rm -fr $tmp
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# real QA test starts here
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +_need_to_be_root
> +
> +rm -f $seqres.full
> +
> +# Create an fs tree of a given height at a target location. This is
> +# done by agressively creating inline extents to expand the number of
> +# nodes required. We also add an traditional extent so that
> +# drop_snapshot is forced to walk at least one extent that is not
> +# stored in metadata.
> +#
> +# NOTE: The ability to vary tree height for this test is very useful
> +# for debugging problems with drop_snapshot(). As a result we retain
> +# that parameter even though the test below always does level 2 trees.
> +_explode_fs_tree () {
> +    local level=$1;
> +    local loc="$2";
> +    local bs=4095;
> +    local cnt=1;
> +    local n;
> +
> +    if [ -z $loc ]; then
> +	echo "specify location for fileset"
> +	exit 1;
> +    fi
> +
> +    case $level in
> +	1)# this always reproduces level 1 trees
> +	    n=10;

I'd prefer a more accurate calculation based on nodesize.
That would allow using any possible nodesize.
For example for $nodesize larger than 4K
l1_min_n = int(($nodesize - 101) / (4095 + 21 + 17)) + 1

And for a 4K nodesize, two 2K inline extents will cause the fs tree to
be 2 levels, as a 4K nodesize won't allow a 4095-byte inline extent.

101 is sizeof(struct btrfs_header).
So $nodesize - 101 is the available space for a leaf.
And 4095 is the maximum inline extent length, 21 is the inline
file_extent header length, i.e. offsetof(struct btrfs_file_extent_item,
disk_bytenr), and 17 is sizeof(struct btrfs_disk_key).

The above is the theoretical maximum; in practice a leaf will be split
earlier than that. So the computed n should be good enough to create the
desired tree level, and it makes the script run a little faster.

> +	    ;;
> +	2)# this always reproduces level 2 trees
> +	    n=1500
For level 2 and higher trees, it will be
l2_min_n = l1_min_n * (($nodesize - 101) / 33 + 1) ^ ($level - 1)

101 is still the header size.
33 is sizeof(struct btrfs_key_ptr).
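For example, with the 16K nodesize the test currently forces, a quick
shell sketch of these bounds (constants as above, purely illustrative):

    nodesize=16384
    l1_min_n=$(( (nodesize - 101) / (4095 + 21 + 17) + 1 ))  # -> 4
    ptrs_per_node=$(( (nodesize - 101) / 33 + 1 ))           # -> 494
    l2_min_n=$(( l1_min_n * ptrs_per_node ))                 # -> 1976 for level 2
    echo $l1_min_n $l2_min_n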

> +	    ;;
> +	3)# this always reproduces level 3 trees
> +	    n=1000000;
> +	    ;;
> +	*)
> +	    echo "Can't make level $level trees";
> +	    exit 1;
> +	    ;;
> +    esac
> +
> +    mkdir -p $loc
> +    for i in `seq -w 1 $n`;
> +    do
> +	dd status=none if=/dev/zero of=$loc/file$i bs=$bs count=$cnt
> +    done
> +
> +    bs=131072
> +    cnt=1
> +    dd status=none if=/dev/zero of=$loc/extentfile bs=$bs count=$cnt
> +}
> +
> +# Force the default leaf size as the calculations for making our btree
> +# heights are based on that.
> +run_check _scratch_mkfs "--nodesize 16384"

When using a fixed nodesize, it's better to consider other arches like
AArch64 and others that support a 64K pagesize.
(btrfs doesn't support subpage-sized nodesizes, which means a filesystem
made with a nodesize smaller than the pagesize won't mount.)

I got informed of this several times when submitting qgroup related
test cases. See btrfs/017 and btrfs/091.

But on the other hand, using a 64K nodesize is somewhat overkill and
will make the test take more time.

If it's OK to ignore the 64K pagesize case, I'd prefer to use a 4K
nodesize, which makes it much faster to create fs trees with 2~3 levels.

> +_scratch_mount
> +
> +# populate the default subvolume and create a snapshot ('snap1')
> +_explode_fs_tree 1 $SCRATCH_MNT/files
> +_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap1
> +
> +# create new btree nodes in this snapshot. They will become shared
> +# with the next snapshot we create.
> +_explode_fs_tree 1 $SCRATCH_MNT/snap1/files-snap1
> +
> +# create our final snapshot ('snap2'), populate it with
> +# exclusively owned nodes.
> +#
> +# As a result of this action, snap2 will get an implied ref to the
> +# 128K extent created in the default subvolume.
> +_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT/snap1 $SCRATCH_MNT/snap2
> +_explode_fs_tree 1 $SCRATCH_MNT/snap2/files-snap2
> +
> +# Enable qgroups now that we have our filesystem prepared. This
> +# will kick off a scan which we will have to wait for.
> +_run_btrfs_util_prog quota enable $SCRATCH_MNT
> +_run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
> +
> +# Remount to clear cache, force everything to disk
> +_scratch_unmount
> +_scratch_mount

Is there anything special here that needs umount/mount rather than just sync?

> +
> +# Finally, delete snap1 to trigger btrfs_drop_snapshot(). This
> +# snapshot is most interesting to delete because it will cause some
> +# nodes to go exclusively owned for snap2, while some will stay shared
> +# with the default subvolume. That exercises a maximum of the drop
> +# snapshot/qgroup interactions.
> +#
> +# snap2s imlied ref from to the 128K extent in files/ can be lost by
> +# the root finding code in qgroup accounting due to snap1 no longer
> +# providing a path to it. This was fixed by the first two patches
> +# referenced above.
> +_run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1
> +
> +# There is no way from userspace to force btrfs_drop_snapshot to run
> +# at a given time (even via mount/unmount). We must wait for it to
> +# start and complete. This is the shortest time on my tests systems I
> +# have found which always allows drop_snapshot to run to completion.
> +sleep 45

Does "btrfs subv delete -c" help here?

> +
> +_scratch_unmount
> +
> +# generate a qgroup report and look for inconsistent groups
> +#  - don't use _run_btrfs_util_prog here as it captures the output and
> +#    we need to grep it.
> +$BTRFS_UTIL_PROG check --qgroup-report $SCRATCH_DEV 2>&1 | grep -E -q "Counts for qgroup.*are different"
> +if [ $? -ne 0 ]; then
> +    status=0
> +fi
Quite a nice idea to use btrfsck to check qgroup validation.

But I don't see the reason not to use _run_btrfs_util_prog, as I don't
think the grep is needed.

If there is a bug in the return value of btrfsck, then I'm OK with this
as a workaround.

But if btrfsck --qgroup-report returns non-zero when it finds a qgroup
mismatch, I think it is better to just call _run_btrfs_util_prog, as it
already checks the return value.
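(If the exit code is reliable, a minimal sketch of that simplification,
assuming _run_btrfs_util_prog fails the test on a non-zero return, would
be:

    _run_btrfs_util_prog check --qgroup-report $SCRATCH_DEV
    status=0

but that only works if btrfsck really does return non-zero on a
mismatch, which is the open question.)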

Thanks,
Qu

> +
> +exit
> diff --git a/tests/btrfs/104.out b/tests/btrfs/104.out
> new file mode 100644
> index 0000000..1ed84bc
> --- /dev/null
> +++ b/tests/btrfs/104.out
> @@ -0,0 +1 @@
> +QA output created by 104
> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index e92a65a..6218adf 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -106,3 +106,4 @@
>   101 auto quick replace
>   102 auto quick metadata enospc
>   103 auto quick clone compress
> +104 auto qgroup
>


* Re: [PATCH 0/4] btrfs: update qgroups in drop snapshot
  2015-09-22 20:15 [PATCH 0/4] btrfs: update qgroups in drop snapshot Mark Fasheh
                   ` (4 preceding siblings ...)
  2015-09-22 21:12 ` [PATCH 0/4] btrfs: update qgroups in drop snapshot Mark Fasheh
@ 2015-09-23  3:58 ` Qu Wenruo
  2015-09-23  8:50   ` Holger Hoffstätte
  2015-09-23 22:08   ` Mark Fasheh
  2015-09-25  3:17 ` Qu Wenruo
  6 siblings, 2 replies; 17+ messages in thread
From: Qu Wenruo @ 2015-09-23  3:58 UTC (permalink / raw)
  To: Mark Fasheh, linux-btrfs; +Cc: jbacik, clm

Hi Mark,

I'd like to test the patchset, but it seems to be a little out of date, 
and failed to apply to integration-4.3.

Would you please rebase it to integration-4.3?

Thanks,
Qu

Mark Fasheh wrote on 2015/09/22 13:15 -0700:
> Hi,
>
> The following 4 patches fix a regression introduced in Linux
> 4.2 where btrfs_drop_snapshot() wasn't updating qgroups, resulting in
> them going bad.
>
> The original e-mail pointing this out is below:
>
> http://www.spinics.net/lists/linux-btrfs/msg46093.html
>
>
> The first two patches are from Josef and fix bugs in our counting of
> roots (which is critical for qgroups to work correctly). Both were
> sent to the list already:
>
> http://www.spinics.net/lists/linux-btrfs/msg47035.html
> http://www.spinics.net/lists/linux-btrfs/msg47150.html
>
> Truth be told, most of the time fixing this was spent figuring out
> those two issues. Once I realized I was seeing a bug and we fixed it
> correctly, my drop snapshot patch got dramatically smaller.
>
> I also re-added some of the tracing in qgroup.c that we recently
> lost. It is again possible to debug qgroup operations on a live
> system, allowing us to find issues like the two above by narrowing
> down our operations and manually walking through them via
> cat sys/debug/tracing.
>
> The entire patch series can be tested with the following xfstests.git
> patch, which will be sent to the appropriate list shortly)
>
> https://github.com/markfasheh/xfstests/commit/b09ca51c012824e44546b13862ab1f93a6f2f675
>
> Thanks,
> 	--Mark
>
>
> From: Mark Fasheh <mfasheh@suse.de>
>
> [PATCH] btrfs: add test for quota groups and drop snapshot
>
> Test btrfs quota group consistency operations during snapshot
> delete. Btrfs has had long standing issues with drop snapshot
> failing to properly account for quota groups. This test crafts
> several snapshot trees with shared and exclusive elements. One of
> the trees is removed and then quota group consistency is checked.
>
> This issue is fixed by the following linux kernel patches:
>     Btrfs: use btrfs_get_fs_root in resolve_indirect_ref
>     Btrfs: keep dropped roots in cache until transaciton commit
>     btrfs: qgroup: account shared subtree during snapshot delete
>
> Signed-off-by: Mark Fasheh <mfasheh@suse.de>
> ---
>   tests/btrfs/098   | 166 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   tests/btrfs/group |   1 +
>   2 files changed, 167 insertions(+)
>   create mode 100644 tests/btrfs/098
>
> diff --git a/tests/btrfs/098 b/tests/btrfs/098
> new file mode 100644
> index 0000000..7977a90
> --- /dev/null
> +++ b/tests/btrfs/098
> @@ -0,0 +1,166 @@
> +#! /bin/bash
> +# FS QA Test No. btrfs/098
> +#
> +# Test btrfs quota group consistency operations during snapshot
> +# delete. Btrfs has had long standing issues with drop snapshot
> +# failing to properly account for quota groups. This test crafts
> +# several snapshot trees with shared and exclusive elements. One of
> +# the trees is removed and then quota group consistency is checked.
> +#
> +# This issue is fixed by the following linux kernel patches:
> +#
> +#    Btrfs: use btrfs_get_fs_root in resolve_indirect_ref
> +#    Btrfs: keep dropped roots in cache until transaciton commit
> +#    btrfs: qgroup: account shared subtree during snapshot delete
> +#
> +#-----------------------------------------------------------------------
> +# Copyright (C) 2015 SUSE Linux Products GmbH. All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#
> +#-----------------------------------------------------------------------
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +	rm -fr $tmp
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# real QA test starts here
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +_need_to_be_root
> +
> +rm -f $seqres.full
> +
> +# Create an fs tree of a given height at a target location. This is
> +# done by agressively creating inline extents to expand the number of
> +# nodes required. We also add an traditional extent so that
> +# drop_snapshot is forced to walk at least one extent that is not
> +# stored in metadata.
> +#
> +# NOTE: The ability to vary tree height for this test is very useful
> +# for debugging problems with drop_snapshot(). As a result we retain
> +# that parameter even though the test below always does level 2 trees.
> +_explode_fs_tree () {
> +    local level=$1;
> +    local loc="$2";
> +    local bs=4095;
> +    local cnt=1;
> +    local n;
> +
> +    if [ -z $loc ]; then
> +	echo "specify location for fileset"
> +	exit 1;
> +    fi
> +
> +    case $level in
> +	1)# this always reproduces level 1 trees	
> +	    n=10;
> +	    ;;
> +	2)# this always reproduces level 2 trees
> +	    n=1500
> +	    ;;
> +	3)# this always reproduces level 3 trees
> +	    n=1000000;
> +	    ;;
> +	*)
> +	    echo "Can't make level $level trees";
> +	    exit 1;
> +	    ;;
> +    esac
> +
> +    mkdir -p $loc
> +    for i in `seq -w 1 $n`;
> +    do
> +	dd status=none if=/dev/zero of=$loc/file$i bs=$bs count=$cnt
> +    done
> +
> +    bs=131072
> +    cnt=1
> +    dd status=none if=/dev/zero of=$loc/extentfile bs=$bs count=$cnt
> +}
> +
> +# Force the default leaf size as the calculations for making our btree
> +# heights are based on that.
> +run_check _scratch_mkfs "--nodesize 16384"
> +_scratch_mount
> +
> +# populate the default subvolume and create a snapshot ('snap1')
> +_explode_fs_tree 1 $SCRATCH_MNT/files
> +_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap1
> +
> +# create new btree nodes in this snapshot. They will become shared
> +# with the next snapshot we create.
> +_explode_fs_tree 1 $SCRATCH_MNT/snap1/files-snap1
> +
> +# create our final snapshot ('snap2'), populate it with
> +# exclusively owned nodes.
> +#
> +# As a result of this action, snap2 will get an implied ref to the
> +# 128K extent created in the default subvolume.
> +_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT/snap1 $SCRATCH_MNT/snap2
> +_explode_fs_tree 1 $SCRATCH_MNT/snap2/files-snap2
> +
> +# Enable qgroups now that we have our filesystem prepared. This
> +# will kick off a scan which we will have to wait for.
> +_run_btrfs_util_prog quota enable $SCRATCH_MNT
> +_run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
> +
> +# Remount to clear cache, force everything to disk
> +_scratch_unmount
> +_scratch_mount
> +
> +# Finally, delete snap1 to trigger btrfs_drop_snapshot(). This
> +# snapshot is most interesting to delete because it will cause some
> +# nodes to go exclusively owned for snap2, while some will stay shared
> +# with the default subvolume. That exercises a maximum of the drop
> +# snapshot/qgroup interactions.
> +#
> +# snap2s imlied ref from to the 128K extent in files/ can be lost by
> +# the root finding code in qgroup accounting due to snap1 no longer
> +# providing a path to it. This was fixed by the first two patches
> +# referenced above.
> +_run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1
> +
> +# There is no way from userspace to force btrfs_drop_snapshot to run
> +# at a given time (even via mount/unmount). We must wait for it to
> +# start and complete. This is the shortest time on my tests systems I
> +# have found which always allows drop_snapshot to run to completion.
> +sleep 45
> +
> +_scratch_unmount
> +
> +# generate a qgroup report and look for inconsistent groups
> +#  - don't use _run_btrfs_util_prog here as it captures the output and
> +#    we need to grep it.
> +$BTRFS_UTIL_PROG check --qgroup-report $SCRATCH_DEV 2>&1 | grep -E -q "Counts for qgroup.*are different"
> +if [ $? -ne 0 ]; then
> +    status=0
> +fi
> +
> +exit
> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index e13865a..d78a355 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -100,3 +100,4 @@
>   095 auto quick metadata
>   096 auto quick clone
>   097 auto quick send clone
> +098 auto qgroup
>


* Re: [PATCH 0/4] btrfs: update qgroups in drop snapshot
  2015-09-23  3:58 ` Qu Wenruo
@ 2015-09-23  8:50   ` Holger Hoffstätte
  2015-09-23 22:08   ` Mark Fasheh
  1 sibling, 0 replies; 17+ messages in thread
From: Holger Hoffstätte @ 2015-09-23  8:50 UTC (permalink / raw)
  To: Qu Wenruo, Mark Fasheh, linux-btrfs

On 09/23/15 05:58, Qu Wenruo wrote:
> I'd like to test the patchset, but it seems to be a little out of
> date, and failed to apply to integration-4.3.

The first two patches were already merged yesterday, so if you just
skip those you should get further.

-h



* Re: [PATCH 0/4] btrfs: update qgroups in drop snapshot
  2015-09-23  1:40   ` Qu Wenruo
@ 2015-09-23 21:49     ` Mark Fasheh
  2015-09-24  5:47       ` Duncan
  2015-09-24  6:29       ` Qu Wenruo
  0 siblings, 2 replies; 17+ messages in thread
From: Mark Fasheh @ 2015-09-23 21:49 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, jbacik, clm

On Wed, Sep 23, 2015 at 09:40:42AM +0800, Qu Wenruo wrote:
> Thanks for all your work on this.
> Comment on test case is inlined below.

No problem, thanks for the review Qu!


> >+# Create an fs tree of a given height at a target location. This is
> >+# done by agressively creating inline extents to expand the number of
> >+# nodes required. We also add an traditional extent so that
> >+# drop_snapshot is forced to walk at least one extent that is not
> >+# stored in metadata.
> >+#
> >+# NOTE: The ability to vary tree height for this test is very useful
> >+# for debugging problems with drop_snapshot(). As a result we retain
> >+# that parameter even though the test below always does level 2 trees.
> >+_explode_fs_tree () {
> >+    local level=$1;
> >+    local loc="$2";
> >+    local bs=4095;
> >+    local cnt=1;
> >+    local n;
> >+
> >+    if [ -z $loc ]; then
> >+	echo "specify location for fileset"
> >+	exit 1;
> >+    fi
> >+
> >+    case $level in
> >+	1)# this always reproduces level 1 trees
> >+	    n=10;
> 
> I'd prefer a more accurate calculation based on nodesize.
> That would allow using any possible nodesize.

How do we query nodesize? 'btrfs fi show' doesn't seem to give it.
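Maybe the superblock dump would do, depending on btrfs-progs version
(just a thought, not something the test relies on):

    btrfs-show-super $SCRATCH_DEV | grep nodesize
    # or with newer progs: btrfs inspect-internal dump-super $SCRATCH_DEV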


But yeah in theory that sounds nice, I can certainly try it out. I'm not
really sure how complicated we want this whole function to be though. Right
now it should always work on all platforms.


> For example for $nodesize larger than 4K
> l1_min_n = int(($nodesize - 101) / (4095 + 21 + 17)) + 1
> 
> And for 4K nodesize,
> 2 2K inline extent will cause the fs_tree to be 2 level.
> As 4K nodeize won't allow 4095 inline extent length.
> 
> 101 is sizeof(struct btrfs_header).
> So $nodesize - 101 is the available space for a leaf.
> And 4095 is the maximum inline extent length, 21 is the inline
> file_extent header length,offsetof(struct btrfs_file_extent_item,
> disk_bytenr).
> 17 is sizeof(struct btrfs_disk_key).
> 
> Above is the maximum value in theory, in fact leaf will be split
> more early than hitting the number.
> So the max_n should be good enough to create desired tree level, and
> make script run a little faster.
> 
> >+	    ;;
> >+	2)# this always reproduces level 2 trees
> >+	    n=1500
> For level 2 and higher tree, it will be
> l2_min_n = l1_min_n * (($nodesize - 101) / 33 +1) ^ ($level - 1)
> 
> 101 is still header size.
> 33 is sizeof(struct btrfs_key_ptr)
> 
> >+	    ;;
> >+	3)# this always reproduces level 3 trees
> >+	    n=1000000;
> >+	    ;;
> >+	*)
> >+	    echo "Can't make level $level trees";
> >+	    exit 1;
> >+	    ;;
> >+    esac
> >+
> >+    mkdir -p $loc
> >+    for i in `seq -w 1 $n`;
> >+    do
> >+	dd status=none if=/dev/zero of=$loc/file$i bs=$bs count=$cnt
> >+    done
> >+
> >+    bs=131072
> >+    cnt=1
> >+    dd status=none if=/dev/zero of=$loc/extentfile bs=$bs count=$cnt
> >+}
> >+
> >+# Force the default leaf size as the calculations for making our btree
> >+# heights are based on that.
> >+run_check _scratch_mkfs "--nodesize 16384"
> 
> When using a given nodesize, it's better to consider other arches like
> AArch64 or any other arch which supports a 64K page size.
> (btrfs doesn't support subpage-sized nodesizes, which means that a
> filesystem using a nodesize smaller than the page size won't mount.)
> 
> I was reminded of this several times when submitting qgroup related test cases.
> See btrfs/017 and btrfs/091.
> 
> But on the other hand, using a 64K nodesize is somewhat overkill and
> will make the test take more time.
> 
> If it's OK to ignore the 64K page size case, I'd prefer to use a 4K
> nodesize, which makes it much faster to create fs trees with 2~3
> levels.

Right, so 64K nodesize is the most 'compatible' nodesize.


> >+_scratch_mount
> >+
> >+# populate the default subvolume and create a snapshot ('snap1')
> >+_explode_fs_tree 1 $SCRATCH_MNT/files
> >+_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap1
> >+
> >+# create new btree nodes in this snapshot. They will become shared
> >+# with the next snapshot we create.
> >+_explode_fs_tree 1 $SCRATCH_MNT/snap1/files-snap1
> >+
> >+# create our final snapshot ('snap2'), populate it with
> >+# exclusively owned nodes.
> >+#
> >+# As a result of this action, snap2 will get an implied ref to the
> >+# 128K extent created in the default subvolume.
> >+_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT/snap1 $SCRATCH_MNT/snap2
> >+_explode_fs_tree 1 $SCRATCH_MNT/snap2/files-snap2
> >+
> >+# Enable qgroups now that we have our filesystem prepared. This
> >+# will kick off a scan which we will have to wait for.
> >+_run_btrfs_util_prog quota enable $SCRATCH_MNT
> >+_run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
> >+
> >+# Remount to clear cache, force everything to disk
> >+_scratch_unmount
> >+_scratch_mount
> 
> Is there anything special that needs to use umount/mount other than sync?

A couple times now it's been to my advantage to force btrfs to reread the
file trees. It might not be strictly necessary any more.


> >+# Finally, delete snap1 to trigger btrfs_drop_snapshot(). This
> >+# snapshot is most interesting to delete because it will cause some
> >+# nodes to go exclusively owned for snap2, while some will stay shared
> >+# with the default subvolume. That exercises a maximum of the drop
> >+# snapshot/qgroup interactions.
> >+#
> >+# snap2's implied ref to the 128K extent in files/ can be lost by
> >+# the root finding code in qgroup accounting due to snap1 no longer
> >+# providing a path to it. This was fixed by the first two patches
> >+# referenced above.
> >+_run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1
> >+
> >+# There is no way from userspace to force btrfs_drop_snapshot to run
> >+# at a given time (even via mount/unmount). We must wait for it to
> >+# start and complete. This is the shortest time I have found on my test
> >+# systems which always allows drop_snapshot to run to completion.
> >+sleep 45
> 
> Does "btrfs subv delete -c" help here?

Unfortunately not :( We need to wait for drop_snapshot() to get run. That
flag (from memory) just waits for the initial orphaning transaction to
finish.


> >+
> >+_scratch_unmount
> >+
> >+# generate a qgroup report and look for inconsistent groups
> >+#  - don't use _run_btrfs_util_prog here as it captures the output and
> >+#    we need to grep it.
> >+$BTRFS_UTIL_PROG check --qgroup-report $SCRATCH_DEV 2>&1 | grep -E -q "Counts for qgroup.*are different"
> >+if [ $? -ne 0 ]; then
> >+    status=0
> >+fi
> Quite a nice idea to use btrfsck to check qgroup validation.
> 
> But I don't see the reason not to use _run_btrfs_util_prog, as I
> don't think the grep is needed.
> 
> If there is a bug in return value of btrfsck, then I'm OK with it as
> a workaround.
> 
> But if btrfsck --qgroup-report returns non-zero when it finds a
> qgroup mismatch, I think it is better to just call
> _run_btrfs_util_prog, as it already checks the return value.

btrfsck --qgroup-report returns zero unless there was an issue generating
the report, so the grep there is the only way to catch this consistently.

Thanks again,
	--Mark

--
Mark Fasheh

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/4] btrfs: update qgroups in drop snapshot
  2015-09-23  3:58 ` Qu Wenruo
  2015-09-23  8:50   ` Holger Hoffstätte
@ 2015-09-23 22:08   ` Mark Fasheh
  1 sibling, 0 replies; 17+ messages in thread
From: Mark Fasheh @ 2015-09-23 22:08 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, jbacik, clm

On Wed, Sep 23, 2015 at 11:58:57AM +0800, Qu Wenruo wrote:
> Hi Mark,
> 
> I'd like to test the patchset, but it seems to be a little out of
> date, and failed to apply to integration-4.3.
> 
> Would you please rebase it to integration-4.3?

Hey Qu I think you just need to drop the patch titled:

    Btrfs: keep dropped roots in cache until transaction commit

since it is already in integration-4.3. Everything else seems to apply on my
end.
	--Mark

--
Mark Fasheh

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/4] btrfs: update qgroups in drop snapshot
  2015-09-23 21:49     ` Mark Fasheh
@ 2015-09-24  5:47       ` Duncan
  2015-09-24  6:29       ` Qu Wenruo
  1 sibling, 0 replies; 17+ messages in thread
From: Duncan @ 2015-09-24  5:47 UTC (permalink / raw)
  To: linux-btrfs

Mark Fasheh posted on Wed, 23 Sep 2015 14:49:26 -0700 as excerpted:

> On Wed, Sep 23, 2015 at 09:40:42AM +0800, Qu Wenruo wrote:
>> 
>> I'd prefer a more accurate calculation based on nodesize.
>> That would allow using any possible nodesize.
> 
> How do we query nodesize? 'btrfs fi show' doesn't seem to give it.

btrfs-show-super reports nodesize.  Note, however, that unlike
btrfs fi show, show-super is used on a device, not a (mounted) filesystem.

(A not-too-technical question I can answer, so I am. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/4] btrfs: update qgroups in drop snapshot
  2015-09-23 21:49     ` Mark Fasheh
  2015-09-24  5:47       ` Duncan
@ 2015-09-24  6:29       ` Qu Wenruo
  1 sibling, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2015-09-24  6:29 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: linux-btrfs, jbacik, clm



Mark Fasheh wrote on 2015/09/23 14:49 -0700:
> On Wed, Sep 23, 2015 at 09:40:42AM +0800, Qu Wenruo wrote:
>> Thanks for all your work on this.
>> Comment on test case is inlined below.
>
> No problem, thanks for the review Qu!
>
>
>>> +# Create an fs tree of a given height at a target location. This is
>>> +# done by aggressively creating inline extents to expand the number of
>>> +# nodes required. We also add a traditional extent so that
>>> +# drop_snapshot is forced to walk at least one extent that is not
>>> +# stored in metadata.
>>> +#
>>> +# NOTE: The ability to vary tree height for this test is very useful
>>> +# for debugging problems with drop_snapshot(). As a result we retain
>>> +# that parameter even though the test below always does level 2 trees.
>>> +_explode_fs_tree () {
>>> +    local level=$1;
>>> +    local loc="$2";
>>> +    local bs=4095;
>>> +    local cnt=1;
>>> +    local n;
>>> +
>>> +    if [ -z $loc ]; then
>>> +	echo "specify location for fileset"
>>> +	exit 1;
>>> +    fi
>>> +
>>> +    case $level in
>>> +	1)# this always reproduces level 1 trees
>>> +	    n=10;
>>
>> I'd prefer a more accurate calculation based on nodesize.
>> That would allow using any possible nodesize.
>
> How do we query nodesize? 'btrfs fi show' doesn't seem to give it.

Duncan suggested 'btrfs-show-super'; that's one possible method,
although you will need to grep its output.

But the real question is, why do you need to query the nodesize at all?
In the test case the nodesize is chosen by you, so a new parameter to
the function should handle it.

>
>
> But yeah in theory that sounds nice, I can certainly try it out. I'm not
> really sure how complicated we want this whole function to be though. Right
> now it should always work on all platforms.
>

Yeah, your hard-coded numbers are OK, I'd just like to keep them to a minimum,
and hopefully make the test fast enough to be in the 'quick' group.
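
For reference, the arithmetic in the hunks quoted below works out to
roughly the following. This is an illustrative C sketch only; the
constants are the struct sizes quoted in the review and have not been
re-verified here.

#include <stdio.h>

/*
 * Guaranteed-sufficient file counts for a target fs tree level,
 * following the quoted formulas (101 = btrfs_header, 4095 = max inline
 * extent length, 21 = inline file_extent header, 17 = btrfs_disk_key,
 * 33 = btrfs_key_ptr).
 */
static unsigned long files_for_level(unsigned long nodesize, int level)
{
        unsigned long leaf_space = nodesize - 101;
        unsigned long n = leaf_space / (4095 + 21 + 17) + 1;    /* l1_min_n */
        unsigned long ptrs = leaf_space / 33 + 1;               /* key ptrs per node */
        int i;

        for (i = 1; i < level; i++)
                n *= ptrs;                                      /* l2_min_n, l3_min_n, ... */
        return n;
}

int main(void)
{
        /*
         * For nodesize 16384 this prints 4 and 1976. Real leaves split
         * before reaching full theoretical packing, so these counts are
         * upper bounds on what is actually needed.
         */
        printf("level 1: %lu files\n", files_for_level(16384, 1));
        printf("level 2: %lu files\n", files_for_level(16384, 2));
        return 0;
}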

>
>> For example for $nodesize larger than 4K
>> l1_min_n = int(($nodesize - 101) / (4095 + 21 + 17)) + 1
>>
>> And for a 4K nodesize,
>> two 2K inline extents will cause the fs_tree to be 2 levels,
>> as a 4K nodesize won't allow a 4095 byte inline extent length.
>>
>> 101 is sizeof(struct btrfs_header).
>> So $nodesize - 101 is the available space for a leaf.
>> And 4095 is the maximum inline extent length, 21 is the inline
>> file_extent header length,offsetof(struct btrfs_file_extent_item,
>> disk_bytenr).
>> 17 is sizeof(struct btrfs_disk_key).
>>
>> The above is the theoretical maximum; in practice a leaf will be split
>> earlier than that number is reached.
>> So the max_n should be good enough to create the desired tree level,
>> while keeping the script a little faster.
>>
>>> +	    ;;
>>> +	2)# this always reproduces level 2 trees
>>> +	    n=1500
>> For level 2 and higher tree, it will be
>> l2_min_n = l1_min_n * (($nodesize - 101) / 33 +1) ^ ($level - 1)
>>
>> 101 is still header size.
>> 33 is sizeof(struct btrfs_key_ptr)
>>
>>> +	    ;;
>>> +	3)# this always reproduces level 3 trees
>>> +	    n=1000000;
>>> +	    ;;
>>> +	*)
>>> +	    echo "Can't make level $level trees";
>>> +	    exit 1;
>>> +	    ;;
>>> +    esac
>>> +
>>> +    mkdir -p $loc
>>> +    for i in `seq -w 1 $n`;
>>> +    do
>>> +	dd status=none if=/dev/zero of=$loc/file$i bs=$bs count=$cnt
>>> +    done
>>> +
>>> +    bs=131072
>>> +    cnt=1
>>> +    dd status=none if=/dev/zero of=$loc/extentfile bs=$bs count=$cnt
>>> +}
>>> +
>>> +# Force the default leaf size as the calculations for making our btree
>>> +# heights are based on that.
>>> +run_check _scratch_mkfs "--nodesize 16384"
>>
>> When using a given nodesize, it's better to consider other arches like
>> AArch64 or any other arch which supports a 64K page size.
>> (btrfs doesn't support subpage-sized nodesizes, which means that a
>> filesystem using a nodesize smaller than the page size won't mount.)
>>
>> I was reminded of this several times when submitting qgroup related test cases.
>> See btrfs/017 and btrfs/091.
>>
>> But on the other hand, using a 64K nodesize is somewhat overkill and
>> will make the test take more time.
>>
>> If it's OK to ignore the 64K page size case, I'd prefer to use a 4K
>> nodesize, which makes it much faster to create fs trees with 2~3
>> levels.
>
> Right, so 64K nodesize is the most 'compatible' nodesize.
>
>
>>> +_scratch_mount
>>> +
>>> +# populate the default subvolume and create a snapshot ('snap1')
>>> +_explode_fs_tree 1 $SCRATCH_MNT/files
>>> +_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap1
>>> +
>>> +# create new btree nodes in this snapshot. They will become shared
>>> +# with the next snapshot we create.
>>> +_explode_fs_tree 1 $SCRATCH_MNT/snap1/files-snap1
>>> +
>>> +# create our final snapshot ('snap2'), populate it with
>>> +# exclusively owned nodes.
>>> +#
>>> +# As a result of this action, snap2 will get an implied ref to the
>>> +# 128K extent created in the default subvolume.
>>> +_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT/snap1 $SCRATCH_MNT/snap2
>>> +_explode_fs_tree 1 $SCRATCH_MNT/snap2/files-snap2
>>> +
>>> +# Enable qgroups now that we have our filesystem prepared. This
>>> +# will kick off a scan which we will have to wait for.
>>> +_run_btrfs_util_prog quota enable $SCRATCH_MNT
>>> +_run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
>>> +
>>> +# Remount to clear cache, force everything to disk
>>> +_scratch_unmount
>>> +_scratch_mount
>>
>> Is there anything special that needs to use umount/mount other than sync?
>
> A couple times now it's been to my advantage to force btrfs to reread the
> file trees. It might not be strictly necessary any more.
>
>
>>> +# Finally, delete snap1 to trigger btrfs_drop_snapshot(). This
>>> +# snapshot is most interesting to delete because it will cause some
>>> +# nodes to go exclusively owned for snap2, while some will stay shared
>>> +# with the default subvolume. That exercises a maximum of the drop
>>> +# snapshot/qgroup interactions.
>>> +#
>>> +# snap2's implied ref to the 128K extent in files/ can be lost by
>>> +# the root finding code in qgroup accounting due to snap1 no longer
>>> +# providing a path to it. This was fixed by the first two patches
>>> +# referenced above.
>>> +_run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1
>>> +
>>> +# There is no way from userspace to force btrfs_drop_snapshot to run
>>> +# at a given time (even via mount/unmount). We must wait for it to
>>> +# start and complete. This is the shortest time I have found on my test
>>> +# systems which always allows drop_snapshot to run to completion.
>>> +sleep 45
>>
>> Does "btrfs subv delete -c" help here?
>
> Unfortunately not :( We need to wait for drop_snapshot() to get run. That
> flag (from memory) just waits for the initial orphaning transaction to
> finish.

Too sad. :(
As Dave Chinner also pointed out, waiting with such a hard-coded number
may not be reproducible on other systems.

>
>
>>> +
>>> +_scratch_unmount
>>> +
>>> +# generate a qgroup report and look for inconsistent groups
>>> +#  - don't use _run_btrfs_util_prog here as it captures the output and
>>> +#    we need to grep it.
>>> +$BTRFS_UTIL_PROG check --qgroup-report $SCRATCH_DEV 2>&1 | grep -E -q "Counts for qgroup.*are different"
>>> +if [ $? -ne 0 ]; then
>>> +    status=0
>>> +fi
>> Quite a nice idea to use btrfsck to check qgroup validation.
>>
>> But I don't see the reason not to use _run_btrfs_util_prog, as I
>> don't think the grep is needed.
>>
>> If there is a bug in return value of btrfsck, then I'm OK with it as
>> a workaround.
>>
>> But if btrfsck --qgroup-report returns non-zero when it finds a
>> qgroup mismatch, I think it is better to just call
>> _run_btrfs_util_prog, as it already checks the return value.
>
> btrfsck --qgroup-report returns zero unless there was an issue generating
> the report, so the grep there is the only way to catch this consistently.

Oh, that's another btrfsck bug to fix.
Quite a lot of return value bugs are still in the btrfsck code.

Thanks,
Qu

>
> Thanks again,
> 	--Mark
>
> --
> Mark Fasheh
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/4] btrfs: update qgroups in drop snapshot
  2015-09-22 20:15 [PATCH 0/4] btrfs: update qgroups in drop snapshot Mark Fasheh
                   ` (5 preceding siblings ...)
  2015-09-23  3:58 ` Qu Wenruo
@ 2015-09-25  3:17 ` Qu Wenruo
  6 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2015-09-25  3:17 UTC (permalink / raw)
  To: Mark Fasheh, linux-btrfs; +Cc: jbacik, clm



Mark Fasheh wrote on 2015/09/22 13:15 -0700:
> Hi,
>
> The following 4 patches fix a regression introduced in Linux
> 4.2 where btrfs_drop_snapshot() wasn't updating qgroups, resulting in
> them going bad.
>
> The original e-mail pointing this out is below:
>
> http://www.spinics.net/lists/linux-btrfs/msg46093.html
>
>
> The first two patches are from Josef and fix bugs in our counting of
> roots (which is critical for qgroups to work correctly). Both were
> sent to the list already:
>
> http://www.spinics.net/lists/linux-btrfs/msg47035.html
> http://www.spinics.net/lists/linux-btrfs/msg47150.html
>
> Truth be told, most of the time fixing this was spent figuring out
> those two issues. Once I realized I was seeing a bug and we fixed it
> correctly, my drop snapshot patch got dramatically smaller.
>
> I also re-added some of the tracing in qgroup.c that we recently
> lost. It is again possible to debug qgroup operations on a live
> system, allowing us to find issues like the two above by narrowing
> down our operations and manually walking through them via
> cat sys/debug/tracing.
>
> The entire patch series can be tested with the following xfstests.git
> patch, which will be sent to the appropriate list shortly)
>
> https://github.com/markfasheh/xfstests/commit/b09ca51c012824e44546b13862ab1f93a6f2f675
>
> Thanks,
> 	--Mark
>
>
> From: Mark Fasheh <mfasheh@suse.de>
>
> [PATCH] btrfs: add test for quota groups and drop snapshot
>
> Test btrfs quota group consistency operations during snapshot
> delete. Btrfs has had long standing issues with drop snapshot
> failing to properly account for quota groups. This test crafts
> several snapshot trees with shared and exclusive elements. One of
> the trees is removed and then quota group consistency is checked.
>
> This issue is fixed by the following linux kernel patches:
>     Btrfs: use btrfs_get_fs_root in resolve_indirect_ref
>     Btrfs: keep dropped roots in cache until transaciton commit
>     btrfs: qgroup: account shared subtree during snapshot delete
>
> Signed-off-by: Mark Fasheh <mfasheh@suse.de>

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>

Excellent work, Mark.

The code is quite clear and clean, and adds the needed trace functions.
It handles multi-level qgroups quite well using existing facilities.

As the patchset is quite small, I would like to see it shipped in 4.3.

But due to the lack of kernel facilities, the test case may be more
tricky than the problem you fixed... :(


[[Further Enhancement]]
I think the subvolume rescan facility you use here can be further
refined into a generic subvolume qgroup rescan function, even though
it's already quite good now.

There are still a lot of other cases that may cause btrfs qgroups to go
bad, like qgroup assign, and btrfs snapshot with a higher level qgroup.

Currently those cases just inform btrfs-progs and output a warning.
But with your subvolume rescan facility, plus some additional protection,
it should be able to handle the qgroup assign case as well.

Thanks,
Qu

> ---
>   tests/btrfs/098   | 166 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   tests/btrfs/group |   1 +
>   2 files changed, 167 insertions(+)
>   create mode 100644 tests/btrfs/098
>
> diff --git a/tests/btrfs/098 b/tests/btrfs/098
> new file mode 100644
> index 0000000..7977a90
> --- /dev/null
> +++ b/tests/btrfs/098
> @@ -0,0 +1,166 @@
> +#! /bin/bash
> +# FS QA Test No. btrfs/098
> +#
> +# Test btrfs quota group consistency operations during snapshot
> +# delete. Btrfs has had long standing issues with drop snapshot
> +# failing to properly account for quota groups. This test crafts
> +# several snapshot trees with shared and exclusive elements. One of
> +# the trees is removed and then quota group consistency is checked.
> +#
> +# This issue is fixed by the following linux kernel patches:
> +#
> +#    Btrfs: use btrfs_get_fs_root in resolve_indirect_ref
> +#    Btrfs: keep dropped roots in cache until transaction commit
> +#    btrfs: qgroup: account shared subtree during snapshot delete
> +#
> +#-----------------------------------------------------------------------
> +# Copyright (C) 2015 SUSE Linux Products GmbH. All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#
> +#-----------------------------------------------------------------------
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +	rm -fr $tmp
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# real QA test starts here
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +_need_to_be_root
> +
> +rm -f $seqres.full
> +
> +# Create an fs tree of a given height at a target location. This is
> +# done by aggressively creating inline extents to expand the number of
> +# nodes required. We also add a traditional extent so that
> +# drop_snapshot is forced to walk at least one extent that is not
> +# stored in metadata.
> +#
> +# NOTE: The ability to vary tree height for this test is very useful
> +# for debugging problems with drop_snapshot(). As a result we retain
> +# that parameter even though the test below always does level 2 trees.
> +_explode_fs_tree () {
> +    local level=$1;
> +    local loc="$2";
> +    local bs=4095;
> +    local cnt=1;
> +    local n;
> +
> +    if [ -z $loc ]; then
> +	echo "specify location for fileset"
> +	exit 1;
> +    fi
> +
> +    case $level in
> +	1)# this always reproduces level 1 trees	
> +	    n=10;
> +	    ;;
> +	2)# this always reproduces level 2 trees
> +	    n=1500
> +	    ;;
> +	3)# this always reproduces level 3 trees
> +	    n=1000000;
> +	    ;;
> +	*)
> +	    echo "Can't make level $level trees";
> +	    exit 1;
> +	    ;;
> +    esac
> +
> +    mkdir -p $loc
> +    for i in `seq -w 1 $n`;
> +    do
> +	dd status=none if=/dev/zero of=$loc/file$i bs=$bs count=$cnt
> +    done
> +
> +    bs=131072
> +    cnt=1
> +    dd status=none if=/dev/zero of=$loc/extentfile bs=$bs count=$cnt
> +}
> +
> +# Force the default leaf size as the calculations for making our btree
> +# heights are based on that.
> +run_check _scratch_mkfs "--nodesize 16384"
> +_scratch_mount
> +
> +# populate the default subvolume and create a snapshot ('snap1')
> +_explode_fs_tree 1 $SCRATCH_MNT/files
> +_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap1
> +
> +# create new btree nodes in this snapshot. They will become shared
> +# with the next snapshot we create.
> +_explode_fs_tree 1 $SCRATCH_MNT/snap1/files-snap1
> +
> +# create our final snapshot ('snap2'), populate it with
> +# exclusively owned nodes.
> +#
> +# As a result of this action, snap2 will get an implied ref to the
> +# 128K extent created in the default subvolume.
> +_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT/snap1 $SCRATCH_MNT/snap2
> +_explode_fs_tree 1 $SCRATCH_MNT/snap2/files-snap2
> +
> +# Enable qgroups now that we have our filesystem prepared. This
> +# will kick off a scan which we will have to wait for.
> +_run_btrfs_util_prog quota enable $SCRATCH_MNT
> +_run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
> +
> +# Remount to clear cache, force everything to disk
> +_scratch_unmount
> +_scratch_mount
> +
> +# Finally, delete snap1 to trigger btrfs_drop_snapshot(). This
> +# snapshot is most interesting to delete because it will cause some
> +# nodes to go exclusively owned for snap2, while some will stay shared
> +# with the default subvolume. That exercises a maximum of the drop
> +# snapshot/qgroup interactions.
> +#
> +# snap2's implied ref to the 128K extent in files/ can be lost by
> +# the root finding code in qgroup accounting due to snap1 no longer
> +# providing a path to it. This was fixed by the first two patches
> +# referenced above.
> +_run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1
> +
> +# There is no way from userspace to force btrfs_drop_snapshot to run
> +# at a given time (even via mount/unmount). We must wait for it to
> +# start and complete. This is the shortest time I have found on my test
> +# systems which always allows drop_snapshot to run to completion.
> +sleep 45
> +
> +_scratch_unmount
> +
> +# generate a qgroup report and look for inconsistent groups
> +#  - don't use _run_btrfs_util_prog here as it captures the output and
> +#    we need to grep it.
> +$BTRFS_UTIL_PROG check --qgroup-report $SCRATCH_DEV 2>&1 | grep -E -q "Counts for qgroup.*are different"
> +if [ $? -ne 0 ]; then
> +    status=0
> +fi
> +
> +exit
> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index e13865a..d78a355 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -100,3 +100,4 @@
>   095 auto quick metadata
>   096 auto quick clone
>   097 auto quick send clone
> +098 auto qgroup
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/4] btrfs: qgroup: account shared subtree during snapshot delete
  2015-09-22 20:15 ` [PATCH 4/4] btrfs: qgroup: account shared subtree during snapshot delete Mark Fasheh
@ 2015-11-02  1:59   ` Qu Wenruo
  2015-11-03 23:56     ` Mark Fasheh
  0 siblings, 1 reply; 17+ messages in thread
From: Qu Wenruo @ 2015-11-02  1:59 UTC (permalink / raw)
  To: Mark Fasheh, linux-btrfs; +Cc: jbacik, clm



Mark Fasheh wrote on 2015/09/22 13:15 -0700:
> Commit 0ed4792 ('btrfs: qgroup: Switch to new extent-oriented qgroup
> mechanism.') removed our qgroup accounting during
> btrfs_drop_snapshot(). Predictably, this results in qgroup numbers
> going bad shortly after a snapshot is removed.
>
> Fix this by adding a dirty extent record when we encounter extents during
> our shared subtree walk. This effectively restores the functionality we had
> with the original shared subtree walking code in 1152651 (btrfs: qgroup:
> account shared subtrees during snapshot delete).
>
> The idea with the original patch (and this one) is that shared subtrees can
> get skipped during drop_snapshot. The shared subtree walk then allows us a
> chance to visit those extents and add them to the qgroup work for later
> processing. This ultimately makes the accounting for drop snapshot work.
>
> The new qgroup code nicely handles all the other extents during the tree
> walk via the ref dec/inc functions so we don't have to add actions beyond
> what we had originally.
>
> Signed-off-by: Mark Fasheh <mfasheh@suse.de>

Hi Mark,

Besides the performance regression reported by Stefan Priebe,
there is another problem; I'll comment inline below.

> ---
>   fs/btrfs/extent-tree.c | 41 ++++++++++++++++++++++++++++++++++-------
>   1 file changed, 34 insertions(+), 7 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 3a70e6c..89be620 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -7757,17 +7757,37 @@ reada:
>   }
>
>   /*
> - * TODO: Modify related function to add related node/leaf to dirty_extent_root,
> - * for later qgroup accounting.
> - *
> - * Current, this function does nothing.
> + * These may not be seen by the usual inc/dec ref code so we have to
> + * add them here.
>    */
> +static int record_one_subtree_extent(struct btrfs_trans_handle *trans,
> +				     struct btrfs_root *root, u64 bytenr,
> +				     u64 num_bytes)
> +{
> +	struct btrfs_qgroup_extent_record *qrecord;
> +	struct btrfs_delayed_ref_root *delayed_refs;
> +
> +	qrecord = kmalloc(sizeof(*qrecord), GFP_NOFS);
> +	if (!qrecord)
> +		return -ENOMEM;
> +
> +	qrecord->bytenr = bytenr;
> +	qrecord->num_bytes = num_bytes;
> +	qrecord->old_roots = NULL;
> +
> +	delayed_refs = &trans->transaction->delayed_refs;
> +	if (btrfs_qgroup_insert_dirty_extent(delayed_refs, qrecord))
> +		kfree(qrecord);

1) Unprotected dirty_extent_root.

Unfortunately, btrfs_qgroup_insert_dirty_extent() is not protected by
any lock/mutex.

And I'm sorry for not adding a comment about that.

In fact, btrfs_qgroup_insert_dirty_extent() should always be called with
delayed_refs->lock held,
just like in add_delayed_ref_head(), where every caller of
add_delayed_ref_head() holds delayed_refs->lock.

So here you will need to hold delayed_refs->lock.

2) Performance regression.(Reported by Stefan Priebe)

The performance regression is not caused by your code, at least not
completely.

It's my fault for not adding enough comments to the insert_dirty_extent()
function. (Just like point 1, I must say I'm a bad reviewer until there
is a bug report.)

I was only expecting it to be called inside add_delayed_ref_head(),
and the callers of add_delayed_ref_head() check whether qgroups are
enabled before calling it.

So in the qgroup-disabled case, insert_dirty_extent() is never called.



As a result, if you want to call btrfs_qgroup_insert_dirty_extent()
outside of add_delayed_ref_head(), you will need to take
delayed_refs->lock and check whether qgroups are enabled.
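
To make that concrete, here is a minimal sketch of how
record_one_subtree_extent() could handle both points. It is not the
final patch; it assumes fs_info->quota_enabled is the right gate and
that delayed_refs->lock is the spinlock protecting dirty_extent_root.

static int record_one_subtree_extent(struct btrfs_trans_handle *trans,
                                     struct btrfs_root *root, u64 bytenr,
                                     u64 num_bytes)
{
        struct btrfs_qgroup_extent_record *qrecord;
        struct btrfs_delayed_ref_root *delayed_refs;

        /* Nothing to record when qgroups are not enabled. */
        if (!root->fs_info->quota_enabled)
                return 0;

        qrecord = kmalloc(sizeof(*qrecord), GFP_NOFS);
        if (!qrecord)
                return -ENOMEM;

        qrecord->bytenr = bytenr;
        qrecord->num_bytes = num_bytes;
        qrecord->old_roots = NULL;

        delayed_refs = &trans->transaction->delayed_refs;
        /* dirty_extent_root may only be touched under delayed_refs->lock. */
        spin_lock(&delayed_refs->lock);
        if (btrfs_qgroup_insert_dirty_extent(delayed_refs, qrecord))
                kfree(qrecord);
        spin_unlock(&delayed_refs->lock);

        return 0;
}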

BTW, if it's OK with you, you could also further improve qgroup
performance by using a kmem_cache for struct btrfs_qgroup_extent_record.

I assume the kmalloc() may be a performance hot spot considering how
often it is called in the qgroup-enabled case.
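
A slab cache along these lines is presumably what is meant here; the
names and init/exit hooks below are hypothetical, shown only to sketch
the idea.

/* Hypothetical names -- not an existing btrfs API. */
static struct kmem_cache *btrfs_qgroup_record_cachep;

int btrfs_qgroup_record_cache_init(void)
{
        btrfs_qgroup_record_cachep = kmem_cache_create(
                        "btrfs_qgroup_extent_record",
                        sizeof(struct btrfs_qgroup_extent_record),
                        0, SLAB_RECLAIM_ACCOUNT, NULL);
        return btrfs_qgroup_record_cachep ? 0 : -ENOMEM;
}

void btrfs_qgroup_record_cache_exit(void)
{
        kmem_cache_destroy(btrfs_qgroup_record_cachep);
}

Allocation sites such as record_one_subtree_extent() would then use
kmem_cache_alloc(btrfs_qgroup_record_cachep, GFP_NOFS) instead of
kmalloc(), and kmem_cache_free() instead of kfree().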

Thanks,
Qu

> +
> +	return 0;
> +}
> +
>   static int account_leaf_items(struct btrfs_trans_handle *trans,
>   			      struct btrfs_root *root,
>   			      struct extent_buffer *eb)
>   {
>   	int nr = btrfs_header_nritems(eb);
> -	int i, extent_type;
> +	int i, extent_type, ret;
>   	struct btrfs_key key;
>   	struct btrfs_file_extent_item *fi;
>   	u64 bytenr, num_bytes;
> @@ -7790,6 +7810,10 @@ static int account_leaf_items(struct btrfs_trans_handle *trans,
>   			continue;
>
>   		num_bytes = btrfs_file_extent_disk_num_bytes(eb, fi);
> +
> +		ret = record_one_subtree_extent(trans, root, bytenr, num_bytes);
> +		if (ret)
> +			return ret;
>   	}
>   	return 0;
>   }
> @@ -7858,8 +7882,6 @@ static int adjust_slots_upwards(struct btrfs_root *root,
>
>   /*
>    * root_eb is the subtree root and is locked before this function is called.
> - * TODO: Modify this function to mark all (including complete shared node)
> - * to dirty_extent_root to allow it get accounted in qgroup.
>    */
>   static int account_shared_subtree(struct btrfs_trans_handle *trans,
>   				  struct btrfs_root *root,
> @@ -7937,6 +7959,11 @@ walk_down:
>   			btrfs_tree_read_lock(eb);
>   			btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
>   			path->locks[level] = BTRFS_READ_LOCK_BLOCKING;
> +
> +			ret = record_one_subtree_extent(trans, root, child_bytenr,
> +							root->nodesize);
> +			if (ret)
> +				goto out;
>   		}
>
>   		if (level == 0) {
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/4] btrfs: qgroup: account shared subtree during snapshot delete
  2015-11-02  1:59   ` Qu Wenruo
@ 2015-11-03 23:56     ` Mark Fasheh
  2015-11-04  1:10       ` Qu Wenruo
  0 siblings, 1 reply; 17+ messages in thread
From: Mark Fasheh @ 2015-11-03 23:56 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, jbacik, clm

On Mon, Nov 02, 2015 at 09:59:01AM +0800, Qu Wenruo wrote:
> 
> 
> Mark Fasheh wrote on 2015/09/22 13:15 -0700:
> >Commit 0ed4792 ('btrfs: qgroup: Switch to new extent-oriented qgroup
> >mechanism.') removed our qgroup accounting during
> >btrfs_drop_snapshot(). Predictably, this results in qgroup numbers
> >going bad shortly after a snapshot is removed.
> >
> >Fix this by adding a dirty extent record when we encounter extents during
> >our shared subtree walk. This effectively restores the functionality we had
> >with the original shared subtree walking code in 1152651 (btrfs: qgroup:
> >account shared subtrees during snapshot delete).
> >
> >The idea with the original patch (and this one) is that shared subtrees can
> >get skipped during drop_snapshot. The shared subtree walk then allows us a
> >chance to visit those extents and add them to the qgroup work for later
> >processing. This ultimately makes the accounting for drop snapshot work.
> >
> >The new qgroup code nicely handles all the other extents during the tree
> >walk via the ref dec/inc functions so we don't have to add actions beyond
> >what we had originally.
> >
> >Signed-off-by: Mark Fasheh <mfasheh@suse.de>
> 
> Hi Mark,
> 
> Despite the performance regression reported from Stefan Priebe,
> there is another problem, I'll comment inlined below.
> 
> >---
> >  fs/btrfs/extent-tree.c | 41 ++++++++++++++++++++++++++++++++++-------
> >  1 file changed, 34 insertions(+), 7 deletions(-)
> >
> >diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> >index 3a70e6c..89be620 100644
> >--- a/fs/btrfs/extent-tree.c
> >+++ b/fs/btrfs/extent-tree.c
> >@@ -7757,17 +7757,37 @@ reada:
> >  }
> >
> >  /*
> >- * TODO: Modify related function to add related node/leaf to dirty_extent_root,
> >- * for later qgroup accounting.
> >- *
> >- * Current, this function does nothing.
> >+ * These may not be seen by the usual inc/dec ref code so we have to
> >+ * add them here.
> >   */
> >+static int record_one_subtree_extent(struct btrfs_trans_handle *trans,
> >+				     struct btrfs_root *root, u64 bytenr,
> >+				     u64 num_bytes)
> >+{
> >+	struct btrfs_qgroup_extent_record *qrecord;
> >+	struct btrfs_delayed_ref_root *delayed_refs;
> >+
> >+	qrecord = kmalloc(sizeof(*qrecord), GFP_NOFS);
> >+	if (!qrecord)
> >+		return -ENOMEM;
> >+
> >+	qrecord->bytenr = bytenr;
> >+	qrecord->num_bytes = num_bytes;
> >+	qrecord->old_roots = NULL;
> >+
> >+	delayed_refs = &trans->transaction->delayed_refs;
> >+	if (btrfs_qgroup_insert_dirty_extent(delayed_refs, qrecord))
> >+		kfree(qrecord);
> 
> 1) Unprotected dirty_extent_root.
> 
> Unfortunately, btrfs_qgroup_insert_dirty_extent() is not protected
> by any lock/mutex.
>
> And I'm sorry for not adding a comment about that.
>
> In fact, btrfs_qgroup_insert_dirty_extent() should always be called with
> delayed_refs->lock held,
> just like in add_delayed_ref_head(), where every caller of
> add_delayed_ref_head() holds delayed_refs->lock.
>
> So here you will need to hold delayed_refs->lock.

Ok, thanks for pointing this out. To your knowledge, is there any reason why
the follow-up patch shouldn't just wrap the call to
btrfs_qgroup_insert_dirty_extent() in the correct lock?



> 2) Performance regression.(Reported by Stefan Priebe)
> 
> The performance regression is not caused by your codes, at least not
> completely.
> 
> It's my fault not adding enough comment for insert_dirty_extent()
> function. (just like 1, I must say I'm a bad reviewer until there is
> bug report)
> 
> As I was only expecting it called inside add_delayed_ref_head(),
> and caller of add_delayed_ref_head() has judged whether qgroup is
> enabled before calling add_delayed_ref_head().
> 
> So for qgroup disabled case, insert_dirty_extent() won't ever be called.
> 
> 
> 
> As a result, if you want to call btrfs_qgroup_insert_dirty_extent()
> out of add_delayed_ref_head(), you will need to handle the
> delayed_refs->lock and judge whether qgroup is enabled.

Ok, so callers of btrfs_qgroup_insert_dirty_extent() also have to check
whether qgroups are enabled.


> BTW, if it's OK for you, you can also further improve the
> performance of qgroup by using kmem_cache for struct
> btrfs_qgroup_extent_record.
> 
> I assume the kmalloc() may be one performance hot spot considering
> the amount it called in qgroup enabled case.

We're reading disk in that case, I hardly think the small kmalloc() matters.
	--Mark

--
Mark Fasheh

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/4] btrfs: qgroup: account shared subtree during snapshot delete
  2015-11-03 23:56     ` Mark Fasheh
@ 2015-11-04  1:10       ` Qu Wenruo
  0 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2015-11-04  1:10 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: linux-btrfs, jbacik, clm



Mark Fasheh wrote on 2015/11/03 15:56 -0800:
> On Mon, Nov 02, 2015 at 09:59:01AM +0800, Qu Wenruo wrote:
>>
>>
>> Mark Fasheh wrote on 2015/09/22 13:15 -0700:
>>> Commit 0ed4792 ('btrfs: qgroup: Switch to new extent-oriented qgroup
>>> mechanism.') removed our qgroup accounting during
>>> btrfs_drop_snapshot(). Predictably, this results in qgroup numbers
>>> going bad shortly after a snapshot is removed.
>>>
>>> Fix this by adding a dirty extent record when we encounter extents during
>>> our shared subtree walk. This effectively restores the functionality we had
>>> with the original shared subtree walking code in 1152651 (btrfs: qgroup:
>>> account shared subtrees during snapshot delete).
>>>
>>> The idea with the original patch (and this one) is that shared subtrees can
>>> get skipped during drop_snapshot. The shared subtree walk then allows us a
>>> chance to visit those extents and add them to the qgroup work for later
>>> processing. This ultimately makes the accounting for drop snapshot work.
>>>
>>> The new qgroup code nicely handles all the other extents during the tree
>>> walk via the ref dec/inc functions so we don't have to add actions beyond
>>> what we had originally.
>>>
>>> Signed-off-by: Mark Fasheh <mfasheh@suse.de>
>>
>> Hi Mark,
>>
>> Despite the performance regression reported from Stefan Priebe,
>> there is another problem, I'll comment inlined below.
>>
>>> ---
>>>   fs/btrfs/extent-tree.c | 41 ++++++++++++++++++++++++++++++++++-------
>>>   1 file changed, 34 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 3a70e6c..89be620 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -7757,17 +7757,37 @@ reada:
>>>   }
>>>
>>>   /*
>>> - * TODO: Modify related function to add related node/leaf to dirty_extent_root,
>>> - * for later qgroup accounting.
>>> - *
>>> - * Current, this function does nothing.
>>> + * These may not be seen by the usual inc/dec ref code so we have to
>>> + * add them here.
>>>    */
>>> +static int record_one_subtree_extent(struct btrfs_trans_handle *trans,
>>> +				     struct btrfs_root *root, u64 bytenr,
>>> +				     u64 num_bytes)
>>> +{
>>> +	struct btrfs_qgroup_extent_record *qrecord;
>>> +	struct btrfs_delayed_ref_root *delayed_refs;
>>> +
>>> +	qrecord = kmalloc(sizeof(*qrecord), GFP_NOFS);
>>> +	if (!qrecord)
>>> +		return -ENOMEM;
>>> +
>>> +	qrecord->bytenr = bytenr;
>>> +	qrecord->num_bytes = num_bytes;
>>> +	qrecord->old_roots = NULL;
>>> +
>>> +	delayed_refs = &trans->transaction->delayed_refs;
>>> +	if (btrfs_qgroup_insert_dirty_extent(delayed_refs, qrecord))
>>> +		kfree(qrecord);
>>
>> 1) Unprotected dirty_extent_root.
>>
>> Unfortunately, btrfs_qgroup_insert_dirty_extent() is not protected
>> by any lock/mutex.
>>
>> And I'm sorry for not adding a comment about that.
>>
>> In fact, btrfs_qgroup_insert_dirty_extent() should always be called with
>> delayed_refs->lock held,
>> just like in add_delayed_ref_head(), where every caller of
>> add_delayed_ref_head() holds delayed_refs->lock.
>>
>> So here you will need to hold delayed_refs->lock.
>
> Ok, thanks for pointing this out. To your knowledge, is there any reason why
> the follow-up patch shouldn't just wrap the call to
> btrfs_qgroup_insert_dirty_extent() in the correct lock?

Just as explained in my previous reply, all existing callers
(add_delayed_ref_head()) already hold the correct lock (delayed_refs->lock).

So I didn't see the need to add a new lock for it, or to wrap locking
into insert_dirty_extent(), at that time.
(Simply taking delayed_refs->lock inside it would cause a deadlock.)

But now that the function is called elsewhere, I'm OK with not reusing
delayed_refs->lock, and instead adding a new lock and integrating it into
insert_dirty_extent().
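
If the lock ends up integrated that way, one possible shape is sketched
below. The lock name is invented here purely for illustration; a real
patch would decide where such a lock lives and how it nests with
delayed_refs->lock.

/*
 * Hypothetical: 'qgroup_record_lock' stands for a new, dedicated lock
 * guarding dirty_extent_root, taken inside the helper so callers
 * outside add_delayed_ref_head() never touch delayed_refs->lock.
 */
static void record_qgroup_extent(struct btrfs_delayed_ref_root *delayed_refs,
                                 struct btrfs_qgroup_extent_record *qrecord)
{
        spin_lock(&delayed_refs->qgroup_record_lock);
        /* Existing helper; returns non-NULL if this bytenr was already recorded. */
        if (btrfs_qgroup_insert_dirty_extent(delayed_refs, qrecord))
                kfree(qrecord);
        spin_unlock(&delayed_refs->qgroup_record_lock);
}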

Thanks,
Qu

>
>
>
>> 2) Performance regression.(Reported by Stefan Priebe)
>>
>> The performance regression is not caused by your codes, at least not
>> completely.
>>
>> It's my fault not adding enough comment for insert_dirty_extent()
>> function. (just like 1, I must say I'm a bad reviewer until there is
>> bug report)
>>
>> As I was only expecting it called inside add_delayed_ref_head(),
>> and caller of add_delayed_ref_head() has judged whether qgroup is
>> enabled before calling add_delayed_ref_head().
>>
>> So for qgroup disabled case, insert_dirty_extent() won't ever be called.
>>
>>
>>
>> As a result, if you want to call btrfs_qgroup_insert_dirty_extent()
>> out of add_delayed_ref_head(), you will need to handle the
>> delayed_refs->lock and judge whether qgroup is enabled.
>
> Ok, so callers of btrfs_qgroup_insert_dirty_extent() also have to check
> whether qgroups are enabled.
>
>
>> BTW, if it's OK for you, you can also further improve the
>> performance of qgroup by using kmem_cache for struct
>> btrfs_qgroup_extent_record.
>>
>> I assume the kmalloc() may be one performance hot spot considering
>> the amount it called in qgroup enabled case.
>
> We're reading disk in that case, I hardly think the small kmalloc() matters.
> 	--Mark
>
> --
> Mark Fasheh
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2015-11-04  1:10 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-22 20:15 [PATCH 0/4] btrfs: update qgroups in drop snapshot Mark Fasheh
2015-09-22 20:15 ` [PATCH 1/4] Btrfs: use btrfs_get_fs_root in resolve_indirect_ref Mark Fasheh
2015-09-22 20:15 ` [PATCH 2/4] Btrfs: keep dropped roots in cache until transaction commit, V2 Mark Fasheh
2015-09-22 20:15 ` [PATCH 3/4] btrfs: Add qgroup tracing Mark Fasheh
2015-09-22 20:15 ` [PATCH 4/4] btrfs: qgroup: account shared subtree during snapshot delete Mark Fasheh
2015-11-02  1:59   ` Qu Wenruo
2015-11-03 23:56     ` Mark Fasheh
2015-11-04  1:10       ` Qu Wenruo
2015-09-22 21:12 ` [PATCH 0/4] btrfs: update qgroups in drop snapshot Mark Fasheh
2015-09-23  1:40   ` Qu Wenruo
2015-09-23 21:49     ` Mark Fasheh
2015-09-24  5:47       ` Duncan
2015-09-24  6:29       ` Qu Wenruo
2015-09-23  3:58 ` Qu Wenruo
2015-09-23  8:50   ` Holger Hoffstätte
2015-09-23 22:08   ` Mark Fasheh
2015-09-25  3:17 ` Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).