* [PATCH 0/3] btrfs: qgroup: Fix the long existing regression of btrfs/153
@ 2020-07-02 0:14 Qu Wenruo
2020-07-02 0:14 ` [PATCH 1/3] btrfs: Introduce extent_changeset_revert() for qgroup Qu Wenruo
` (4 more replies)
0 siblings, 5 replies; 19+ messages in thread
From: Qu Wenruo @ 2020-07-02 0:14 UTC (permalink / raw)
To: linux-btrfs
Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have to"),
btrfs/153 always fails with early EDQUOT.
This is caused by the fact that:
- We always reserve data space even for NODATACOW buffered writes
This is mostly to improve performance, and not practical to revert.
- Btrfs qgroup data and metadata reserved space share the same limit
So EDQUOT is not guaranteed to come from the data reservation;
the metadata reservation can also hit EDQUOT, which means we can't
use the same solution as that commit.
This patchset will solve it by doing extra qgroup space flushing when
EDQUOT is hit.
This is a little like what we do in the ticketing space reservation
system, but since there are very limited ways for qgroup to reclaim
space, it's currently still handled in the qgroup realm, not reusing
the ticketing system yet.
With this, the patchset solves the btrfs/153 problem while still
keeping btrfs qgroup space usage under the limit.
The only cost is that, when we're near the qgroup limit, we will
trigger more dirty inode flushes and transaction commits, much like
what we do when metadata space is nearly exhausted.
So the cost should still be acceptable.
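The flush-and-retry flow described above (reserve, flush on -EDQUOT, reserve again) can be sketched with a toy userspace model. This is illustrative only, not the kernel code: `toy_reserve`, `toy_flush` and the single shared limit are assumptions standing in for the real qgroup machinery.

```c
#include <assert.h>
#include <errno.h>

/* Toy model of a qgroup: data and metadata share one limit. */
struct toy_qgroup {
	long limit;       /* qgroup limit in bytes */
	long reserved;    /* currently reserved (data + metadata) */
	long reclaimable; /* space a flush could release */
};

/* Plain reservation attempt: fail with -EDQUOT when over the limit. */
int toy_reserve(struct toy_qgroup *qg, long bytes)
{
	if (qg->reserved + bytes > qg->limit)
		return -EDQUOT;
	qg->reserved += bytes;
	return 0;
}

/* Stand-in for flushing delalloc, waiting ordered extents and committing
 * the transaction: releases whatever space those steps could reclaim. */
void toy_flush(struct toy_qgroup *qg)
{
	qg->reserved -= qg->reclaimable;
	qg->reclaimable = 0;
}

/* The pattern this series introduces: retry once after flushing. */
int toy_reserve_with_flush(struct toy_qgroup *qg, long bytes)
{
	int ret = toy_reserve(qg, bytes);

	if (ret != -EDQUOT)
		return ret;
	toy_flush(qg);
	return toy_reserve(qg, bytes);
}
```

A reservation that would fail outright can then succeed after the flush reclaims space, which is exactly the btrfs/153 situation.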
Qu Wenruo (3):
btrfs: Introduce extent_changeset_revert() for qgroup
btrfs: qgroup: Try to flush qgroup space when we get -EDQUOT
Revert "btrfs: qgroup: Commit transaction in advance to reduce early
EDQUOT"
fs/btrfs/ctree.h | 6 +-
fs/btrfs/disk-io.c | 2 +-
fs/btrfs/extent_io.c | 43 +++++-
fs/btrfs/extent_io.h | 24 +++-
fs/btrfs/qgroup.c | 310 ++++++++++++++++++++++++++++-------------
fs/btrfs/transaction.c | 1 -
fs/btrfs/transaction.h | 14 --
7 files changed, 273 insertions(+), 127 deletions(-)
--
2.27.0
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 1/3] btrfs: Introduce extent_changeset_revert() for qgroup
2020-07-02 0:14 [PATCH 0/3] btrfs: qgroup: Fix the long existing regression of btrfs/153 Qu Wenruo
@ 2020-07-02 0:14 ` Qu Wenruo
2020-07-02 13:40 ` Josef Bacik
2020-07-02 0:14 ` [PATCH 2/3] btrfs: qgroup: Try to flush qgroup space when we get -EDQUOT Qu Wenruo
` (3 subsequent siblings)
4 siblings, 1 reply; 19+ messages in thread
From: Qu Wenruo @ 2020-07-02 0:14 UTC (permalink / raw)
To: linux-btrfs
[PROBLEM]
Before this patch, when btrfs_qgroup_reserve_data() fails, we free all
reserved space of the changeset.
This means the following call pattern is not possible:
ret = btrfs_qgroup_reserve_data();
if (ret == -EDQUOT) {
/* Do something to free some qgroup space */
ret = btrfs_qgroup_reserve_data();
}
Because the first btrfs_qgroup_reserve_data() frees all reserved
qgroup space on failure, the second btrfs_qgroup_reserve_data() will
always succeed, and can go beyond the qgroup limit.
[CAUSE]
This is mostly due to the fact that we don't have a good way to revert
changeset modifications accurately.
Currently the changeset infrastructure is implemented using a ulist,
whose nodes can only store two u64 values, used as the start and
inclusive end of each changed extent range.
So we can't record which changed extents were added by the last
modification, and thus can't revert to the previous status.
[FIX]
This patch re-implements the changeset using a plain rbtree, adding a
new member, changed_extent::seq, so we can remove only the changed
extents added by the most recent modification.
This allows us to implement qgroup_revert(), which lets btrfs revert
its modifications to the io_tree.
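The role of changed_extent::seq can be illustrated with a toy, array-based version of the revert logic. The names and flat array are illustrative assumptions; the real code walks an rbtree, but the seq-matching rule is the same:

```c
#include <assert.h>

/* Toy, array-based stand-in for the changed_extent rbtree. */
struct toy_extent {
	unsigned long long start;
	unsigned long long len;
	unsigned long long seq; /* sequence of the call that added it */
	int live;
};

/* Remove only the entries added by the call with sequence number @seq
 * that overlap [start, start + len), mirroring what qgroup_revert()
 * does.  Entries from earlier calls survive, so their reservations
 * stay intact after a failed retry. */
unsigned long long toy_revert(struct toy_extent *ents, int n,
			      unsigned long long seq,
			      unsigned long long start,
			      unsigned long long len)
{
	unsigned long long reverted = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (!ents[i].live || ents[i].seq != seq)
			continue;
		if (ents[i].start >= start + len ||
		    ents[i].start + ents[i].len <= start)
			continue;
		ents[i].live = 0;
		reverted += ents[i].len;
	}
	return reverted;
}
```

With a ulist there is no per-entry sequence, so a failed reservation could only wipe every recorded range, including ones from earlier successful calls.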
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 43 +++++++++++-
fs/btrfs/extent_io.h | 24 ++++++-
fs/btrfs/qgroup.c | 155 +++++++++++++++++++++++++++++--------------
3 files changed, 168 insertions(+), 54 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 8a7e9da74b87..21991d6716d9 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -142,6 +142,40 @@ struct extent_page_data {
unsigned int sync_io:1;
};
+static int insert_changed_extent(struct extent_changeset *changeset,
+ u64 start, u64 len)
+{
+ struct changed_extent *entry;
+ struct changed_extent *cur;
+ struct rb_node **p = &changeset->root.rb_node;
+ struct rb_node *parent = NULL;
+
+ entry = kzalloc(sizeof(*entry), GFP_ATOMIC);
+ if (!entry)
+ return -ENOMEM;
+
+ entry->start = start;
+ entry->len = len;
+ entry->seq = changeset->seq;
+
+ while (*p) {
+ parent = *p;
+ cur = rb_entry(parent, struct changed_extent, node);
+ if (cur->start < start) {
+ p = &(*p)->rb_right;
+ } else if (cur->start > start) {
+ p = &(*p)->rb_left;
+ } else {
+ kfree(entry);
+ return -EEXIST;
+ }
+ }
+ rb_link_node(&entry->node, parent, p);
+ rb_insert_color(&entry->node, &changeset->root);
+ changeset->bytes_changed += len;
+ return 0;
+}
+
static int add_extent_changeset(struct extent_state *state, unsigned bits,
struct extent_changeset *changeset,
int set)
@@ -154,9 +188,8 @@ static int add_extent_changeset(struct extent_state *state, unsigned bits,
return 0;
if (!set && (state->state & bits) == 0)
return 0;
- changeset->bytes_changed += state->end - state->start + 1;
- ret = ulist_add(&changeset->range_changed, state->start, state->end,
- GFP_ATOMIC);
+ ret = insert_changed_extent(changeset, state->start,
+ state->end - state->start + 1);
return ret;
}
@@ -718,6 +751,8 @@ int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
if (bits & (EXTENT_LOCKED | EXTENT_BOUNDARY))
clear = 1;
+ if (changeset)
+ changeset->seq++;
again:
if (!prealloc && gfpflags_allow_blocking(mask)) {
/*
@@ -980,6 +1015,8 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
btrfs_debug_check_extent_io_range(tree, start, end);
trace_btrfs_set_extent_bit(tree, start, end - start + 1, bits);
+ if (changeset)
+ changeset->seq++;
again:
if (!prealloc && gfpflags_allow_blocking(mask)) {
/*
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 00a88f2eb5ab..ed00980e044b 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -131,18 +131,29 @@ struct extent_buffer {
/*
* Structure to record how many bytes and which ranges are set/cleared
*/
+struct changed_extent {
+ struct rb_node node;
+ u64 start;
+ u64 len;
+ u64 seq; /* Sequence number at which this extent was added */
+};
+
struct extent_changeset {
+ /* Sequence number for current set/clear of this changeset */
+ u64 seq;
+
/* How many bytes are set/cleared in this operation */
unsigned int bytes_changed;
/* Changed ranges */
- struct ulist range_changed;
+ struct rb_root root;
};
static inline void extent_changeset_init(struct extent_changeset *changeset)
{
changeset->bytes_changed = 0;
- ulist_init(&changeset->range_changed);
+ changeset->seq = 0;
+ changeset->root = RB_ROOT;
}
static inline struct extent_changeset *extent_changeset_alloc(void)
@@ -159,10 +170,17 @@ static inline struct extent_changeset *extent_changeset_alloc(void)
static inline void extent_changeset_release(struct extent_changeset *changeset)
{
+ struct changed_extent *entry;
+ struct changed_extent *next;
+
if (!changeset)
return;
changeset->bytes_changed = 0;
- ulist_release(&changeset->range_changed);
+ changeset->seq = 0;
+ rbtree_postorder_for_each_entry_safe(entry, next, &changeset->root,
+ node)
+ kfree(entry);
+ changeset->root = RB_ROOT;
}
static inline void extent_changeset_free(struct extent_changeset *changeset)
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index ee0ad33b659c..5cfd2bcc55ef 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -3447,6 +3447,66 @@ btrfs_qgroup_rescan_resume(struct btrfs_fs_info *fs_info)
}
}
+static void qgroup_revert(struct btrfs_inode *inode,
+ struct extent_changeset *reserved, u64 start, u64 len)
+{
+ struct rb_node *n = reserved->root.rb_node;
+ struct changed_extent *entry = NULL;
+
+ while (n) {
+ entry = rb_entry(n, struct changed_extent, node);
+ if (entry->start < start)
+ n = n->rb_right;
+ else if (entry->start > start)
+ n = n->rb_left;
+ else
+ break;
+ }
+ if (!entry)
+ goto out;
+
+ if (entry->start > start && rb_prev(&entry->node))
+ entry = rb_entry(rb_prev(&entry->node), struct changed_extent,
+ node);
+
+ n = &entry->node;
+ while (n) {
+ struct rb_node *tmp = rb_next(n);
+
+ entry = rb_entry(n, struct changed_extent, node);
+
+ if (entry->start >= start + len)
+ break;
+ if (entry->start + entry->len <= start)
+ goto next;
+ if (entry->seq != reserved->seq)
+ goto next;
+ /*
+ * Now the entry is in [start, start + len) and has the seq
+ * matching reserved->seq, revert the EXTENT_QGROUP_RESERVED
+ * bit.
+ */
+ clear_extent_bits(&inode->io_tree,
+ entry->start, entry->start + entry->len - 1,
+ EXTENT_QGROUP_RESERVED);
+
+ rb_erase(&entry->node, &reserved->root);
+ RB_CLEAR_NODE(&entry->node);
+ if (likely(reserved->bytes_changed >= entry->len)) {
+ reserved->bytes_changed -= entry->len;
+ } else {
+ WARN_ON(1);
+ reserved->bytes_changed = 0;
+ }
+ kfree(entry);
+
+next:
+ n = tmp;
+ }
+out:
+ reserved->seq--;
+}
+
/*
* Reserve qgroup space for range [start, start + len).
*
@@ -3466,11 +3526,10 @@ int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
u64 len)
{
struct btrfs_root *root = inode->root;
- struct ulist_node *unode;
- struct ulist_iterator uiter;
struct extent_changeset *reserved;
u64 orig_reserved;
u64 to_reserve;
+ bool new_reserved = false;
int ret;
if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &root->fs_info->flags) ||
@@ -3481,6 +3540,7 @@ int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
if (WARN_ON(!reserved_ret))
return -EINVAL;
if (!*reserved_ret) {
+ new_reserved = true;
*reserved_ret = extent_changeset_alloc();
if (!*reserved_ret)
return -ENOMEM;
@@ -3496,23 +3556,20 @@ int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
trace_btrfs_qgroup_reserve_data(&inode->vfs_inode, start, len,
to_reserve, QGROUP_RESERVE);
if (ret < 0)
- goto cleanup;
+ goto out;
ret = qgroup_reserve(root, to_reserve, true, BTRFS_QGROUP_RSV_DATA);
if (ret < 0)
goto cleanup;
-
return ret;
cleanup:
- /* cleanup *ALL* already reserved ranges */
- ULIST_ITER_INIT(&uiter);
- while ((unode = ulist_next(&reserved->range_changed, &uiter)))
- clear_extent_bit(&inode->io_tree, unode->val,
- unode->aux, EXTENT_QGROUP_RESERVED, 0, 0, NULL);
- /* Also free data bytes of already reserved one */
- btrfs_qgroup_free_refroot(root->fs_info, root->root_key.objectid,
- orig_reserved, BTRFS_QGROUP_RSV_DATA);
- extent_changeset_release(reserved);
+ qgroup_revert(inode, reserved, start, len);
+out:
+ if (new_reserved) {
+ extent_changeset_release(reserved);
+ kfree(reserved);
+ *reserved_ret = NULL;
+ }
return ret;
}
@@ -3521,8 +3578,7 @@ static int qgroup_free_reserved_data(struct btrfs_inode *inode,
struct extent_changeset *reserved, u64 start, u64 len)
{
struct btrfs_root *root = inode->root;
- struct ulist_node *unode;
- struct ulist_iterator uiter;
+ struct rb_node *node;
struct extent_changeset changeset;
int freed = 0;
int ret;
@@ -3531,42 +3587,42 @@ static int qgroup_free_reserved_data(struct btrfs_inode *inode,
len = round_up(start + len, root->fs_info->sectorsize);
start = round_down(start, root->fs_info->sectorsize);
- ULIST_ITER_INIT(&uiter);
- while ((unode = ulist_next(&reserved->range_changed, &uiter))) {
- u64 range_start = unode->val;
- /* unode->aux is the inclusive end */
- u64 range_len = unode->aux - range_start + 1;
- u64 free_start;
- u64 free_len;
+ node = rb_first(&reserved->root);
+ while (node) {
+ struct rb_node *tmp = rb_next(node);
+ struct changed_extent *entry;
+
+ entry = rb_entry(node, struct changed_extent, node);
extent_changeset_release(&changeset);
+ if (entry->start >= start + len)
+ break;
+ if (entry->start + entry->len <= start)
+ goto next;
+ ASSERT(entry->start >= start &&
+ entry->start + entry->len <= start + len);
- /* Only free range in range [start, start + len) */
- if (range_start >= start + len ||
- range_start + range_len <= start)
- continue;
- free_start = max(range_start, start);
- free_len = min(start + len, range_start + range_len) -
- free_start;
- /*
- * TODO: To also modify reserved->ranges_reserved to reflect
- * the modification.
- *
- * However as long as we free qgroup reserved according to
- * EXTENT_QGROUP_RESERVED, we won't double free.
- * So not need to rush.
- */
- ret = clear_record_extent_bits(&inode->io_tree, free_start,
- free_start + free_len - 1,
+ ret = clear_record_extent_bits(&inode->io_tree,
+ entry->start, entry->start + entry->len - 1,
EXTENT_QGROUP_RESERVED, &changeset);
- if (ret < 0)
- goto out;
freed += changeset.bytes_changed;
+ if (reserved->bytes_changed >= changeset.bytes_changed) {
+ reserved->bytes_changed -= changeset.bytes_changed;
+ } else {
+ WARN_ON(1);
+ reserved->bytes_changed = 0;
+ }
+
+ rb_erase(&entry->node, &reserved->root);
+ RB_CLEAR_NODE(&entry->node);
+ kfree(entry);
+next:
+ node = tmp;
}
+
btrfs_qgroup_free_refroot(root->fs_info, root->root_key.objectid, freed,
BTRFS_QGROUP_RSV_DATA);
ret = freed;
-out:
extent_changeset_release(&changeset);
return ret;
}
@@ -3813,8 +3869,6 @@ void btrfs_qgroup_convert_reserved_meta(struct btrfs_root *root, int num_bytes)
void btrfs_qgroup_check_reserved_leak(struct btrfs_inode *inode)
{
struct extent_changeset changeset;
- struct ulist_node *unode;
- struct ulist_iterator iter;
int ret;
extent_changeset_init(&changeset);
@@ -3823,12 +3877,17 @@ void btrfs_qgroup_check_reserved_leak(struct btrfs_inode *inode)
WARN_ON(ret < 0);
if (WARN_ON(changeset.bytes_changed)) {
- ULIST_ITER_INIT(&iter);
- while ((unode = ulist_next(&changeset.range_changed, &iter))) {
+ struct rb_node *n;
+
+ for (n = rb_first(&changeset.root); n; n = rb_next(n)) {
+ struct changed_extent *entry;
+
+ entry = rb_entry(n, struct changed_extent, node);
btrfs_warn(inode->root->fs_info,
- "leaking qgroup reserved space, ino: %llu, start: %llu, end: %llu",
- btrfs_ino(inode), unode->val, unode->aux);
+ "leaking qgroup reserved space, ino: %llu, start: %llu, len: %llu",
+ btrfs_ino(inode), entry->start, entry->len);
}
+
btrfs_qgroup_free_refroot(inode->root->fs_info,
inode->root->root_key.objectid,
changeset.bytes_changed, BTRFS_QGROUP_RSV_DATA);
--
2.27.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 2/3] btrfs: qgroup: Try to flush qgroup space when we get -EDQUOT
2020-07-02 0:14 [PATCH 0/3] btrfs: qgroup: Fix the long existing regression of btrfs/153 Qu Wenruo
2020-07-02 0:14 ` [PATCH 1/3] btrfs: Introduce extent_changeset_revert() for qgroup Qu Wenruo
@ 2020-07-02 0:14 ` Qu Wenruo
2020-07-02 13:43 ` Josef Bacik
2020-07-02 0:14 ` [PATCH 3/3] Revert "btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT" Qu Wenruo
` (2 subsequent siblings)
4 siblings, 1 reply; 19+ messages in thread
From: Qu Wenruo @ 2020-07-02 0:14 UTC (permalink / raw)
To: linux-btrfs
[PROBLEM]
There are known problems related to how btrfs handles qgroup reserved
space.
One of the most obvious cases is the test case btrfs/153, which does a
fallocate, then writes into the preallocated range.
btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad)
--- tests/btrfs/153.out 2019-10-22 15:18:14.068965341 +0800
+++ xfstests-dev/results//btrfs/153.out.bad 2020-07-01 20:24:40.730000089 +0800
@@ -1,2 +1,5 @@
QA output created by 153
+pwrite: Disk quota exceeded
+/mnt/scratch/testfile2: Disk quota exceeded
+/mnt/scratch/testfile2: Disk quota exceeded
Silence is golden
...
(Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad' to see the entire diff)
[CAUSE]
Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have to"),
we always reserve space no matter if it's COW or not.
Such a behavior change is mostly for performance, and reverting it is
not a good idea anyway.
For a preallocated extent, we have already reserved qgroup data space
for it, and since we also reserve qgroup data space at buffered write
time, writing into the preallocated range needs twice the space.
This leads to the -EDQUOT in the buffered write routine.
And we can't follow the same solution as that commit: unlike the
data/metadata space check, qgroup reserved space is shared between
data and metadata.
The EDQUOT can happen at the metadata reservation, so doing the
NODATACOW check after a qgroup reservation failure is not a solution.
[FIX]
To solve the problem, we don't return -EDQUOT directly. Instead, every
time we get an -EDQUOT, we try to flush qgroup space by:
- Flush all inodes of the root
NODATACOW writes will free their qgroup reserved space at
run_delalloc_range().
However we don't have the infrastructure to flush only NODATACOW
inodes, so here we flush all inodes anyway.
- Wait for ordered extents
This converts the preallocated metadata space into per-trans
metadata, which can be freed by a later transaction commit.
- Commit transaction
This will free all per-trans metadata space.
Also we don't want to trigger flushes too frequently, so here we
introduce a per-root mutex: if there is already a running qgroup
flush, we wait for it to end instead of starting a new one.
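The per-root mutex pattern above (the trylock winner flushes; everyone else blocks until that flush finishes, then returns without re-flushing) can be sketched in userspace with pthreads. `toy_try_flush` and `flush_runs` are illustrative names, and note the inverted convention: kernel mutex_trylock() returns 1 on success, while pthread_mutex_trylock() returns 0 on success.

```c
#include <assert.h>
#include <pthread.h>

pthread_mutex_t flush_mutex = PTHREAD_MUTEX_INITIALIZER;
int flush_runs;

/* Mirrors the locking pattern of the flush helper in this patch:
 * only the thread that wins the trylock performs the flush. */
int toy_try_flush(void)
{
	if (pthread_mutex_trylock(&flush_mutex) != 0) {
		/* A flush is already running: wait for it to finish,
		 * then return without flushing again. */
		pthread_mutex_lock(&flush_mutex);
		pthread_mutex_unlock(&flush_mutex);
		return 0;
	}
	flush_runs++; /* the real flush work would go here */
	pthread_mutex_unlock(&flush_mutex);
	return 0;
}
```

Concurrent callers that lose the trylock still benefit from the space the running flush reclaims, without piling up redundant delalloc flushes and transaction commits.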
Fixes: c6887cd11149 ("Btrfs: don't do nocow check unless we have to")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/ctree.h | 1 +
fs/btrfs/disk-io.c | 1 +
fs/btrfs/qgroup.c | 112 ++++++++++++++++++++++++++++++++++++++++-----
3 files changed, 103 insertions(+), 11 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4dd478b4fe3a..891f47c7891f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1162,6 +1162,7 @@ struct btrfs_root {
spinlock_t qgroup_meta_rsv_lock;
u64 qgroup_meta_rsv_pertrans;
u64 qgroup_meta_rsv_prealloc;
+ struct mutex qgroup_flushing_mutex;
/* Number of active swapfiles */
atomic_t nr_swapfiles;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7c07578866f3..ca2c7d39e9f7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1116,6 +1116,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
mutex_init(&root->log_mutex);
mutex_init(&root->ordered_extent_mutex);
mutex_init(&root->delalloc_mutex);
+ mutex_init(&root->qgroup_flushing_mutex);
init_waitqueue_head(&root->log_writer_wait);
init_waitqueue_head(&root->log_commit_wait[0]);
init_waitqueue_head(&root->log_commit_wait[1]);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 5cfd2bcc55ef..24b4aed57249 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -3508,20 +3508,62 @@ static void qgroup_revert(struct btrfs_inode *inode,
}
/*
- * Reserve qgroup space for range [start, start + len).
+ * Try to free some space for qgroup.
*
- * This function will either reserve space from related qgroups or doing
- * nothing if the range is already reserved.
+ * For qgroup, there are only 3 ways to free qgroup space:
+ * - Flush nodatacow write
+ * Any nodatacow write will free its reserved data space at
+ * run_delalloc_range().
+ * In theory, we should only flush nodatacow inodes, but it's
+ * not yet possible, so we need to flush the whole root.
*
- * Return 0 for successful reserve
- * Return <0 for error (including -EQUOT)
+ * - Wait for ordered extents
+ * When ordered extents are finished, their reserved metadata
+ * is finally converted to per_trans status, which can be freed
+ * by a later transaction commit.
*
- * NOTE: this function may sleep for memory allocation.
- * if btrfs_qgroup_reserve_data() is called multiple times with
- * same @reserved, caller must ensure when error happens it's OK
- * to free *ALL* reserved space.
+ * - Commit transaction
+ * This would free the meta_per_trans space.
+ * In theory this shouldn't provide much space, but every freed byte
+ * helps when we're near the qgroup limit.
*/
-int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
+static int try_flush_qgroup(struct btrfs_root *root)
+{
+ struct btrfs_trans_handle *trans;
+ int ret;
+
+ if (!is_fstree(root->root_key.objectid))
+ return 0;
+
+ /*
+ * We don't want to run flush again and again, so if there is a running
+ * one, we wait for it to finish and return without starting a new flush.
+ */
+ ret = mutex_trylock(&root->qgroup_flushing_mutex);
+ if (!ret) {
+ mutex_lock(&root->qgroup_flushing_mutex);
+ mutex_unlock(&root->qgroup_flushing_mutex);
+ return 0;
+ }
+
+ ret = btrfs_start_delalloc_snapshot(root);
+ if (ret < 0)
+ goto unlock;
+ btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
+
+ trans = btrfs_join_transaction(root);
+ if (IS_ERR(trans)) {
+ ret = PTR_ERR(trans);
+ goto unlock;
+ }
+
+ ret = btrfs_commit_transaction(trans);
+unlock:
+ mutex_unlock(&root->qgroup_flushing_mutex);
+ return ret;
+}
+
+static int qgroup_reserve_data(struct btrfs_inode *inode,
struct extent_changeset **reserved_ret, u64 start,
u64 len)
{
@@ -3573,6 +3615,37 @@ int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
return ret;
}
+/*
+ * Reserve qgroup space for range [start, start + len).
+ *
+ * This function will either reserve space from related qgroups or doing
+ * nothing if the range is already reserved.
+ *
+ * Return 0 for successful reserve
+ * Return <0 for error (including -EDQUOT)
+ *
+ * NOTE: This function may sleep for memory allocation, dirty page flushing
+ * and transaction commit, so the caller should not hold any locked pages.
+ */
+int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
+ struct extent_changeset **reserved_ret, u64 start,
+ u64 len)
+{
+ int ret;
+
+ ret = qgroup_reserve_data(inode, reserved_ret, start, len);
+ if (!ret)
+ return ret;
+
+ if (ret < 0 && ret != -EDQUOT)
+ return ret;
+
+ ret = try_flush_qgroup(inode->root);
+ if (ret < 0)
+ return ret;
+ return qgroup_reserve_data(inode, reserved_ret, start, len);
+}
+
/* Free ranges specified by @reserved, normally in error path */
static int qgroup_free_reserved_data(struct btrfs_inode *inode,
struct extent_changeset *reserved, u64 start, u64 len)
@@ -3740,7 +3813,7 @@ static int sub_root_meta_rsv(struct btrfs_root *root, int num_bytes,
return num_bytes;
}
-int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
+static int qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
enum btrfs_qgroup_rsv_type type, bool enforce)
{
struct btrfs_fs_info *fs_info = root->fs_info;
@@ -3767,6 +3840,23 @@ int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
return ret;
}
+int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
+ enum btrfs_qgroup_rsv_type type, bool enforce)
+{
+ int ret;
+
+ ret = qgroup_reserve_meta(root, num_bytes, type, enforce);
+ if (!ret)
+ return ret;
+ if (ret < 0 && ret != -EDQUOT)
+ return ret;
+
+ ret = try_flush_qgroup(root);
+ if (ret < 0)
+ return ret;
+ return qgroup_reserve_meta(root, num_bytes, type, enforce);
+}
+
void btrfs_qgroup_free_meta_all_pertrans(struct btrfs_root *root)
{
struct btrfs_fs_info *fs_info = root->fs_info;
--
2.27.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 3/3] Revert "btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT"
2020-07-02 0:14 [PATCH 0/3] btrfs: qgroup: Fix the long existing regression of btrfs/153 Qu Wenruo
2020-07-02 0:14 ` [PATCH 1/3] btrfs: Introduce extent_changeset_revert() for qgroup Qu Wenruo
2020-07-02 0:14 ` [PATCH 2/3] btrfs: qgroup: Try to flush qgroup space when we get -EDQUOT Qu Wenruo
@ 2020-07-02 0:14 ` Qu Wenruo
2020-07-02 13:11 ` David Sterba
2020-07-02 13:22 ` [PATCH 0/3] btrfs: qgroup: Fix the long existing regression of btrfs/153 David Sterba
2020-07-02 13:28 ` Josef Bacik
4 siblings, 1 reply; 19+ messages in thread
From: Qu Wenruo @ 2020-07-02 0:14 UTC (permalink / raw)
To: linux-btrfs
This reverts commit a514d63882c3d2063b21b865447266ebcb18b04c.
Since we now have the ability to retry a qgroup reservation and flush
qgroup space, there is no need for the BTRFS_FS_NEED_ASYNC_COMMIT
mechanism anymore.
Just revert that commit to make the code a little simpler.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/ctree.h | 5 -----
fs/btrfs/disk-io.c | 1 -
fs/btrfs/qgroup.c | 43 ++----------------------------------------
fs/btrfs/transaction.c | 1 -
fs/btrfs/transaction.h | 14 --------------
5 files changed, 2 insertions(+), 62 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 891f47c7891f..373567c168ac 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -545,11 +545,6 @@ enum {
* (device replace, resize, device add/delete, balance)
*/
BTRFS_FS_EXCL_OP,
- /*
- * To info transaction_kthread we need an immediate commit so it
- * doesn't need to wait for commit_interval
- */
- BTRFS_FS_NEED_ASYNC_COMMIT,
/*
* Indicate that balance has been set up from the ioctl and is in the
* main phase. The fs_info::balance_ctl is initialized.
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ca2c7d39e9f7..7b82157c7df3 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1746,7 +1746,6 @@ static int transaction_kthread(void *arg)
now = ktime_get_seconds();
if (cur->state < TRANS_STATE_COMMIT_START &&
- !test_bit(BTRFS_FS_NEED_ASYNC_COMMIT, &fs_info->flags) &&
(now < cur->start_time ||
now - cur->start_time < fs_info->commit_interval)) {
spin_unlock(&fs_info->trans_lock);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 24b4aed57249..2080699f6971 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -11,7 +11,6 @@
#include <linux/slab.h>
#include <linux/workqueue.h>
#include <linux/btrfs.h>
-#include <linux/sizes.h>
#include "ctree.h"
#include "transaction.h"
@@ -2895,20 +2894,8 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
return ret;
}
-/*
- * Two limits to commit transaction in advance.
- *
- * For RATIO, it will be 1/RATIO of the remaining limit as threshold.
- * For SIZE, it will be in byte unit as threshold.
- */
-#define QGROUP_FREE_RATIO 32
-#define QGROUP_FREE_SIZE SZ_32M
-static bool qgroup_check_limits(struct btrfs_fs_info *fs_info,
- const struct btrfs_qgroup *qg, u64 num_bytes)
+static bool qgroup_check_limits(const struct btrfs_qgroup *qg, u64 num_bytes)
{
- u64 free;
- u64 threshold;
-
if ((qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_RFER) &&
qgroup_rsv_total(qg) + (s64)qg->rfer + num_bytes > qg->max_rfer)
return false;
@@ -2917,32 +2904,6 @@ static bool qgroup_check_limits(struct btrfs_fs_info *fs_info,
qgroup_rsv_total(qg) + (s64)qg->excl + num_bytes > qg->max_excl)
return false;
- /*
- * Even if we passed the check, it's better to check if reservation
- * for meta_pertrans is pushing us near limit.
- * If there is too much pertrans reservation or it's near the limit,
- * let's try commit transaction to free some, using transaction_kthread
- */
- if ((qg->lim_flags & (BTRFS_QGROUP_LIMIT_MAX_RFER |
- BTRFS_QGROUP_LIMIT_MAX_EXCL))) {
- if (qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_EXCL) {
- free = qg->max_excl - qgroup_rsv_total(qg) - qg->excl;
- threshold = min_t(u64, qg->max_excl / QGROUP_FREE_RATIO,
- QGROUP_FREE_SIZE);
- } else {
- free = qg->max_rfer - qgroup_rsv_total(qg) - qg->rfer;
- threshold = min_t(u64, qg->max_rfer / QGROUP_FREE_RATIO,
- QGROUP_FREE_SIZE);
- }
-
- /*
- * Use transaction_kthread to commit transaction, so we no
- * longer need to bother nested transaction nor lock context.
- */
- if (free < threshold)
- btrfs_commit_transaction_locksafe(fs_info);
- }
-
return true;
}
@@ -2990,7 +2951,7 @@ static int qgroup_reserve(struct btrfs_root *root, u64 num_bytes, bool enforce,
qg = unode_aux_to_qgroup(unode);
- if (enforce && !qgroup_check_limits(fs_info, qg, num_bytes)) {
+ if (enforce && !qgroup_check_limits(qg, num_bytes)) {
ret = -EDQUOT;
goto out;
}
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index b359d4b17658..d0a6150bf82d 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2351,7 +2351,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
*/
cur_trans->state = TRANS_STATE_COMPLETED;
wake_up(&cur_trans->commit_wait);
- clear_bit(BTRFS_FS_NEED_ASYNC_COMMIT, &fs_info->flags);
spin_lock(&fs_info->trans_lock);
list_del_init(&cur_trans->list);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 6f65fff6cf50..0b18d25aa9b6 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -208,20 +208,6 @@ int btrfs_clean_one_deleted_snapshot(struct btrfs_root *root);
int btrfs_commit_transaction(struct btrfs_trans_handle *trans);
int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans,
int wait_for_unblock);
-
-/*
- * Try to commit transaction asynchronously, so this is safe to call
- * even holding a spinlock.
- *
- * It's done by informing transaction_kthread to commit transaction without
- * waiting for commit interval.
- */
-static inline void btrfs_commit_transaction_locksafe(
- struct btrfs_fs_info *fs_info)
-{
- set_bit(BTRFS_FS_NEED_ASYNC_COMMIT, &fs_info->flags);
- wake_up_process(fs_info->transaction_kthread);
-}
int btrfs_end_transaction_throttle(struct btrfs_trans_handle *trans);
int btrfs_should_end_transaction(struct btrfs_trans_handle *trans);
void btrfs_throttle(struct btrfs_fs_info *fs_info);
--
2.27.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] Revert "btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT"
2020-07-02 0:14 ` [PATCH 3/3] Revert "btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT" Qu Wenruo
@ 2020-07-02 13:11 ` David Sterba
0 siblings, 0 replies; 19+ messages in thread
From: David Sterba @ 2020-07-02 13:11 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Thu, Jul 02, 2020 at 08:14:34AM +0800, Qu Wenruo wrote:
> This reverts commit a514d63882c3d2063b21b865447266ebcb18b04c.
>
> Since we now have the ability to retry a qgroup reservation and flush
> qgroup space, there is no need for the BTRFS_FS_NEED_ASYNC_COMMIT
> mechanism anymore.
>
> Just revert that commit to make the code a little simpler.
Interestingly, the same comment from
https://lore.kernel.org/linux-btrfs/20180629114629.GO2287@twin.jikos.cz/
applies to your patch sans the references to the other patch:
"
I do not recommend to do a revert here. Even if a patch reverts
functionality because it's not needed anymore, then it's a "forward"
change. There are also other changes that may affect the behaviour and
in this case it's 47dba17171a76ea2a2a71 that removes rcu barrier, so the
patch has to be put into the context of current code.
The commit is almost 2 years old, the idea of reverts is IMHO more to
provide an easy way to do a small step back during one development cycle
when the moving parts are still in sight.
Back then the commit fixed a deadlock, a revert here would read as 'ok,
we want the deadlock back'.
So, the code is ok. The subject needs to drop the word 'revert' and
changelog maybe mention a few more references why the logic is not needed
anymore.
"
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 0/3] btrfs: qgroup: Fix the long existing regression of btrfs/153
2020-07-02 0:14 [PATCH 0/3] btrfs: qgroup: Fix the long existing regression of btrfs/153 Qu Wenruo
` (2 preceding siblings ...)
2020-07-02 0:14 ` [PATCH 3/3] Revert "btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT" Qu Wenruo
@ 2020-07-02 13:22 ` David Sterba
2020-07-02 13:28 ` Josef Bacik
4 siblings, 0 replies; 19+ messages in thread
From: David Sterba @ 2020-07-02 13:22 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Thu, Jul 02, 2020 at 08:14:31AM +0800, Qu Wenruo wrote:
> Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have to"),
> btrfs/153 always fails with early EDQUOT.
>
> This is caused by the fact that:
> - We always reserve data space even for NODATACOW buffered writes
> This is mostly to improve performance, and not practical to revert.
>
> - Btrfs qgroup data and metadata reserved space share the same limit
> So EDQUOT is not guaranteed to come from the data reservation;
> the metadata reservation can also hit EDQUOT, which means we can't
> use the same solution as that commit.
>
> This patchset will solve it by doing extra qgroup space flushing when
> EDQUOT is hit.
>
> This is a little like what we do in the ticketing space reservation
> system, but since there are very limited ways for qgroup to reclaim
> space, it's currently still handled in the qgroup realm, not reusing
> the ticketing system yet.
>
> By this, this patch could solve the btrfs/153 problem, while still keep
> btrfs qgroup space usage under the limit.
>
> The only cost is, when we're near qgroup limit, we will cause more dirty
> inodes flush and transaction commit, much like what we do when the
> metadata space is near exhausted.
> So the cost should be still acceptable.
This sounds like a reasonable solution to me. Making the behaviour
closer to ticket reservations would probably make it easier to switch
some day.
* Re: [PATCH 0/3] btrfs: qgroup: Fix the long existing regression of btrfs/153
2020-07-02 0:14 [PATCH 0/3] btrfs: qgroup: Fix the long existing regression of btrfs/153 Qu Wenruo
` (3 preceding siblings ...)
2020-07-02 13:22 ` [PATCH 0/3] btrfs: qgroup: Fix the long existing regression of btrfs/153 David Sterba
@ 2020-07-02 13:28 ` Josef Bacik
2020-07-02 13:41 ` David Sterba
4 siblings, 1 reply; 19+ messages in thread
From: Josef Bacik @ 2020-07-02 13:28 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 7/1/20 8:14 PM, Qu Wenruo wrote:
> Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have to"),
> btrfs/153 always fails with early EDQUOT.
>
> This is caused by the fact that:
> - We always reserved data space for even NODATACOW buffered write
> This is mostly to improve performance, and not pratical to revert.
>
> - Btrfs qgroup data and meta reserved space share the same limit
> So it's not ensured to return EDQUOT just for that data reservation,
> metadata reservation can also get EDQUOT, means we can't go the same
> solution as that commit.
>
> This patchset will solve it by doing extra qgroup space flushing when
> EDQUOT is hit.
>
> This is a little like what we do in ticketing space reservation system,
> but since there are very limited ways for qgroup to reclaim space,
> currently it's still handled in qgroup realm, not reusing the ticketing
> system yet.
>
> By this, this patch could solve the btrfs/153 problem, while still keep
> btrfs qgroup space usage under the limit.
>
> The only cost is, when we're near qgroup limit, we will cause more dirty
> inodes flush and transaction commit, much like what we do when the
> metadata space is near exhausted.
> So the cost should be still acceptable.
This patchset fails to apply on misc-next as of
btrfs: allow use of global block reserve for balance item deletion
Thanks,
Josef
* Re: [PATCH 1/3] btrfs: Introduce extent_changeset_revert() for qgroup
2020-07-02 0:14 ` [PATCH 1/3] btrfs: Introduce extent_changeset_revert() for qgroup Qu Wenruo
@ 2020-07-02 13:40 ` Josef Bacik
2020-07-02 13:50 ` Qu Wenruo
0 siblings, 1 reply; 19+ messages in thread
From: Josef Bacik @ 2020-07-02 13:40 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 7/1/20 8:14 PM, Qu Wenruo wrote:
> [PROBLEM]
> Before this patch, when btrfs_qgroup_reserve_data() fails, we free all
> reserved space of the changeset.
>
> This means the following call is not possible:
> ret = btrfs_qgroup_reserve_data();
> if (ret == -EDQUOT) {
> /* Do something to free some qgroup space */
> ret = btrfs_qgroup_reserve_data();
> }
>
> As if the first btrfs_qgroup_reserve_data() fails, it will free all
> reserved qgroup space, so the next btrfs_qgroup_reserve_data() will
> always success, and can go beyond qgroup limit.
>
> [CAUSE]
> This is mostly due to the fact that we don't have a good way to revert
> changeset modification accurately.
>
> Currently the changeset infrastructure is implemented using ulist, which
> can only store two u64 values, used as start and length for each changed
> extent range.
>
> So we can't record which changed extent is modified in last
> modification, thus unable to revert to previous status.
>
> [FIX]
> This patch will re-implement using pure rbtree, adding a new member,
> changed_extent::seq, so we can remove changed extents which is
> modified in previous modification.
>
> This allows us to implement qgroup_revert(), which allow btrfs to revert
> its modification to the io_tree.
>
I'm having a hard time grokking what's going on here. These changesets are
limited to a [start, end] range, correct? Why can we have multiple changesets
for the same range? Are we reserving doubly for modifications in the same
range? Because it seems here that if you find your changeset at all we'll
clear the io_tree, but if you can have multiple changesets for the same range
then... why even bother? Thanks,
Josef
* Re: [PATCH 0/3] btrfs: qgroup: Fix the long existing regression of btrfs/153
2020-07-02 13:28 ` Josef Bacik
@ 2020-07-02 13:41 ` David Sterba
2020-07-02 13:44 ` Josef Bacik
0 siblings, 1 reply; 19+ messages in thread
From: David Sterba @ 2020-07-02 13:41 UTC (permalink / raw)
To: Josef Bacik; +Cc: Qu Wenruo, linux-btrfs
On Thu, Jul 02, 2020 at 09:28:30AM -0400, Josef Bacik wrote:
> On 7/1/20 8:14 PM, Qu Wenruo wrote:
> > Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have to"),
> > btrfs/153 always fails with early EDQUOT.
> >
> > This is caused by the fact that:
> > - We always reserved data space for even NODATACOW buffered write
> > This is mostly to improve performance, and not pratical to revert.
> >
> > - Btrfs qgroup data and meta reserved space share the same limit
> > So it's not ensured to return EDQUOT just for that data reservation,
> > metadata reservation can also get EDQUOT, means we can't go the same
> > solution as that commit.
> >
> > This patchset will solve it by doing extra qgroup space flushing when
> > EDQUOT is hit.
> >
> > This is a little like what we do in ticketing space reservation system,
> > but since there are very limited ways for qgroup to reclaim space,
> > currently it's still handled in qgroup realm, not reusing the ticketing
> > system yet.
> >
> > By this, this patch could solve the btrfs/153 problem, while still keep
> > btrfs qgroup space usage under the limit.
> >
> > The only cost is, when we're near qgroup limit, we will cause more dirty
> > inodes flush and transaction commit, much like what we do when the
> > metadata space is near exhausted.
> > So the cost should be still acceptable.
>
> This patchset fails to apply on misc-next as of
>
> btrfs: allow use of global block reserve for balance item deletion
That's about 50 patches behind some recent snapshot of misc-next, the
patches apply cleanly for me here.
* Re: [PATCH 2/3] btrfs: qgroup: Try to flush qgroup space when we get -EDQUOT
2020-07-02 0:14 ` [PATCH 2/3] btrfs: qgroup: Try to flush qgroup space when we get -EDQUOT Qu Wenruo
@ 2020-07-02 13:43 ` Josef Bacik
2020-07-02 13:54 ` Qu Wenruo
0 siblings, 1 reply; 19+ messages in thread
From: Josef Bacik @ 2020-07-02 13:43 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
On 7/1/20 8:14 PM, Qu Wenruo wrote:
> [PROBLEM]
> There are known problem related to how btrfs handles qgroup reserved
> space.
> One of the most obvious case is the the test case btrfs/153, which do
> fallocate, then write into the preallocated range.
>
> btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad)
> --- tests/btrfs/153.out 2019-10-22 15:18:14.068965341 +0800
> +++ xfstests-dev/results//btrfs/153.out.bad 2020-07-01 20:24:40.730000089 +0800
> @@ -1,2 +1,5 @@
> QA output created by 153
> +pwrite: Disk quota exceeded
> +/mnt/scratch/testfile2: Disk quota exceeded
> +/mnt/scratch/testfile2: Disk quota exceeded
> Silence is golden
> ...
> (Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad' to see the entire diff)
>
> [CAUSE]
> Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have to"),
> we always reserve space no matter if it's COW or not.
>
> Such behavior change is mostly for performance, and reverting it is not
> a good idea anyway.
>
> For preallcoated extent, we reserve qgroup data space for it already,
> and since we also reserve data space for qgroup at buffered write time,
> it needs twice the space for us to write into preallocated space.
>
> This leads to the -EDQUOT in buffered write routine.
>
> And we can't follow the same solution, unlike data/meta space check,
> qgroup reserved space is shared between data/meta.
> The EDQUOT can happen at the metadata reservation, so doing NODATACOW
> check after qgroup reservation failure is not a solution.
Why not? I get that we don't know for sure how we failed, but in the case of a
write we're way more likely to have failed for data reasons, right? So why not
just fall back to the NODATACOW check and then do the metadata reservation?
Then if it fails again you know it's a real EDQUOT and you're done.
Or if you want to get super fancy you could even break up the metadata and data
reservations here so that we only fall through to the NODATACOW check if we fail
the data reservation. Thanks,
Josef
* Re: [PATCH 0/3] btrfs: qgroup: Fix the long existing regression of btrfs/153
2020-07-02 13:41 ` David Sterba
@ 2020-07-02 13:44 ` Josef Bacik
0 siblings, 0 replies; 19+ messages in thread
From: Josef Bacik @ 2020-07-02 13:44 UTC (permalink / raw)
To: dsterba, Qu Wenruo, linux-btrfs
On 7/2/20 9:41 AM, David Sterba wrote:
> On Thu, Jul 02, 2020 at 09:28:30AM -0400, Josef Bacik wrote:
>> On 7/1/20 8:14 PM, Qu Wenruo wrote:
>>> Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have to"),
>>> btrfs/153 always fails with early EDQUOT.
>>>
>>> This is caused by the fact that:
>>> - We always reserved data space for even NODATACOW buffered write
>>> This is mostly to improve performance, and not pratical to revert.
>>>
>>> - Btrfs qgroup data and meta reserved space share the same limit
>>> So it's not ensured to return EDQUOT just for that data reservation,
>>> metadata reservation can also get EDQUOT, means we can't go the same
>>> solution as that commit.
>>>
>>> This patchset will solve it by doing extra qgroup space flushing when
>>> EDQUOT is hit.
>>>
>>> This is a little like what we do in ticketing space reservation system,
>>> but since there are very limited ways for qgroup to reclaim space,
>>> currently it's still handled in qgroup realm, not reusing the ticketing
>>> system yet.
>>>
>>> By this, this patch could solve the btrfs/153 problem, while still keep
>>> btrfs qgroup space usage under the limit.
>>>
>>> The only cost is, when we're near qgroup limit, we will cause more dirty
>>> inodes flush and transaction commit, much like what we do when the
>>> metadata space is near exhausted.
>>> So the cost should be still acceptable.
>>
>> This patchset fails to apply on misc-next as of
>>
>> btrfs: allow use of global block reserve for balance item deletion
>
> That's about 50 patches behind some recent snapshot of misc-next, the
> patches apply cleanly for me here.
>
I misread the date on my tree, I just saw "Thu" and assumed my cron job was
doing its job correctly. Thanks,
Josef
* Re: [PATCH 1/3] btrfs: Introduce extent_changeset_revert() for qgroup
2020-07-02 13:40 ` Josef Bacik
@ 2020-07-02 13:50 ` Qu Wenruo
2020-07-02 13:56 ` Josef Bacik
0 siblings, 1 reply; 19+ messages in thread
From: Qu Wenruo @ 2020-07-02 13:50 UTC (permalink / raw)
To: Josef Bacik, Qu Wenruo, linux-btrfs
On 2020/7/2 9:40 PM, Josef Bacik wrote:
> On 7/1/20 8:14 PM, Qu Wenruo wrote:
>> [PROBLEM]
>> Before this patch, when btrfs_qgroup_reserve_data() fails, we free all
>> reserved space of the changeset.
>>
>> This means the following call is not possible:
>> ret = btrfs_qgroup_reserve_data();
>> if (ret == -EDQUOT) {
>> /* Do something to free some qgroup space */
>> ret = btrfs_qgroup_reserve_data();
>> }
>>
>> As if the first btrfs_qgroup_reserve_data() fails, it will free all
>> reserved qgroup space, so the next btrfs_qgroup_reserve_data() will
>> always success, and can go beyond qgroup limit.
>>
>> [CAUSE]
>> This is mostly due to the fact that we don't have a good way to revert
>> changeset modification accurately.
>>
>> Currently the changeset infrastructure is implemented using ulist, which
>> can only store two u64 values, used as start and length for each changed
>> extent range.
>>
>> So we can't record which changed extent is modified in last
>> modification, thus unable to revert to previous status.
>>
>> [FIX]
>> This patch will re-implement using pure rbtree, adding a new member,
>> changed_extent::seq, so we can remove changed extents which is
>> modified in previous modification.
>>
>> This allows us to implement qgroup_revert(), which allow btrfs to revert
>> its modification to the io_tree.
>>
>
> I'm having a hard time groking what's going on here. These changesets
> are limited to a [start, end] range correct?
Yes, but we may only be responsible for part of the changeset.
One example: we want to falloc the range [0, 16K),
and [0, 4K) and [8K, 12K) already have existing file extents.
Then we succeeded in allocating space for [4K, 8K), but failed to
allocate space for [8K, 12K).
In that case, if we just return EDQUOT and clear the range for [4K, 8K)
and [8K, 12K), everything is completely fine.
But if we want to retry, then we should only clear the range for [8K,
12K), and not clear [4K, 8K).
That's what we need for the next patch: just revert what we did in the
previous set_extent_bit(), without touching previously set bits.
Thanks,
Qu
> Why can we have multiple
> changesets for the same range? Are we reserving doubly for
> modifications in the same range? Because it seems here if you find your
> changeset at all we'll clear the io_tree, but if you can have multiple
> changesets for the same range then....why even bother? Thanks,
>
> Josef
* Re: [PATCH 2/3] btrfs: qgroup: Try to flush qgroup space when we get -EDQUOT
2020-07-02 13:43 ` Josef Bacik
@ 2020-07-02 13:54 ` Qu Wenruo
2020-07-02 13:57 ` Josef Bacik
0 siblings, 1 reply; 19+ messages in thread
From: Qu Wenruo @ 2020-07-02 13:54 UTC (permalink / raw)
To: Josef Bacik, Qu Wenruo, linux-btrfs
On 2020/7/2 9:43 PM, Josef Bacik wrote:
> On 7/1/20 8:14 PM, Qu Wenruo wrote:
>> [PROBLEM]
>> There are known problem related to how btrfs handles qgroup reserved
>> space.
>> One of the most obvious case is the the test case btrfs/153, which do
>> fallocate, then write into the preallocated range.
>>
>> btrfs/153 1s ... - output mismatch (see
>> xfstests-dev/results//btrfs/153.out.bad)
>> --- tests/btrfs/153.out 2019-10-22 15:18:14.068965341 +0800
>> +++ xfstests-dev/results//btrfs/153.out.bad 2020-07-01
>> 20:24:40.730000089 +0800
>> @@ -1,2 +1,5 @@
>> QA output created by 153
>> +pwrite: Disk quota exceeded
>> +/mnt/scratch/testfile2: Disk quota exceeded
>> +/mnt/scratch/testfile2: Disk quota exceeded
>> Silence is golden
>> ...
>> (Run 'diff -u xfstests-dev/tests/btrfs/153.out
>> xfstests-dev/results//btrfs/153.out.bad' to see the entire diff)
>>
>> [CAUSE]
>> Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have
>> to"),
>> we always reserve space no matter if it's COW or not.
>>
>> Such behavior change is mostly for performance, and reverting it is not
>> a good idea anyway.
>>
>> For preallcoated extent, we reserve qgroup data space for it already,
>> and since we also reserve data space for qgroup at buffered write time,
>> it needs twice the space for us to write into preallocated space.
>>
>> This leads to the -EDQUOT in buffered write routine.
>>
>> And we can't follow the same solution, unlike data/meta space check,
>> qgroup reserved space is shared between data/meta.
>> The EDQUOT can happen at the metadata reservation, so doing NODATACOW
>> check after qgroup reservation failure is not a solution.
>
> Why not? I get that we don't know for sure how we failed, but in the
> case of a write we're way more likely to have failed for data reasons
> right?
Nope, mostly we fail at the metadata reservation, as that is what returns
EDQUOT to user space.
We may have some cases which get EDQUOT at the data reservation part, but
that's what we expected.
(And already what we're doing.)
The problem is when the metadata reservation fails with EDQUOT.
> So why not just fall back to the NODATACOW check and then do the
> metadata reservation. Then if it fails again you know its a real EDQUOT
> and your done.
>
> Or if you want to get super fancy you could even break up the metadata
> and data reservations here so that we only fall through to the NODATACOW
> check if we fail the data reservation. Thanks,
The problem is, qgroup doesn't split metadata and data (yet).
Currently data and meta shares the same limit.
So when we hit EDQUOT, you have no guarantee it would happen only in
qgroup data reservation.
Thanks,
Qu
>
> Josef
* Re: [PATCH 1/3] btrfs: Introduce extent_changeset_revert() for qgroup
2020-07-02 13:50 ` Qu Wenruo
@ 2020-07-02 13:56 ` Josef Bacik
2020-07-02 14:07 ` Qu Wenruo
0 siblings, 1 reply; 19+ messages in thread
From: Josef Bacik @ 2020-07-02 13:56 UTC (permalink / raw)
To: Qu Wenruo, Qu Wenruo, linux-btrfs
On 7/2/20 9:50 AM, Qu Wenruo wrote:
>
>
> On 2020/7/2 9:40 PM, Josef Bacik wrote:
>> On 7/1/20 8:14 PM, Qu Wenruo wrote:
>>> [PROBLEM]
>>> Before this patch, when btrfs_qgroup_reserve_data() fails, we free all
>>> reserved space of the changeset.
>>>
>>> This means the following call is not possible:
>>> ret = btrfs_qgroup_reserve_data();
>>> if (ret == -EDQUOT) {
>>> /* Do something to free some qgroup space */
>>> ret = btrfs_qgroup_reserve_data();
>>> }
>>>
>>> As if the first btrfs_qgroup_reserve_data() fails, it will free all
>>> reserved qgroup space, so the next btrfs_qgroup_reserve_data() will
>>> always success, and can go beyond qgroup limit.
>>>
>>> [CAUSE]
>>> This is mostly due to the fact that we don't have a good way to revert
>>> changeset modification accurately.
>>>
>>> Currently the changeset infrastructure is implemented using ulist, which
>>> can only store two u64 values, used as start and length for each changed
>>> extent range.
>>>
>>> So we can't record which changed extent is modified in last
>>> modification, thus unable to revert to previous status.
>>>
>>> [FIX]
>>> This patch will re-implement using pure rbtree, adding a new member,
>>> changed_extent::seq, so we can remove changed extents which is
>>> modified in previous modification.
>>>
>>> This allows us to implement qgroup_revert(), which allow btrfs to revert
>>> its modification to the io_tree.
>>>
>>
>> I'm having a hard time groking what's going on here. These changesets
>> are limited to a [start, end] range correct?
>
> Yes, but we may be only responsible for part of the changeset.
>
> One example is we want to falloc range [0, 16K)
> And [0, 4K), [8K, 12K) has already one existing file extent.
>
> Then we succeeded in allocating space for [4K, 8K), but failed to
> allocating space for [8K, 12K).
>
> In that case, if we just return EDQUOT and clear the range for [4K, 8k)
> and [8K, 12K), everything is completely fine.
>
> But if we want to retry, then we should only clear the range for [8K,
> 12K), but not to clear [4K, 8K).
>
> That's what we need for the next patch, just revert what we did in
> previous set_extent_bit(), but not touching previously set bits.
>
Ok so how do we get to this case then? The changeset already has the range
we're responsible for correct? Why do we need the sequence number? Like we
should have a changeset for [4k, 8k) and one for [8k, 12k) right? Or are we
handed back the whole thing? If we fail _just_ for the [8k, 12k) part, do we
know that? Or do we just know our whole [0k, 16k) thing failed, and so we need
all this convoluted tracking to be able to retry for the range that we fucked
up, and this is what you are facilitating? Thanks,
Josef
* Re: [PATCH 2/3] btrfs: qgroup: Try to flush qgroup space when we get -EDQUOT
2020-07-02 13:54 ` Qu Wenruo
@ 2020-07-02 13:57 ` Josef Bacik
2020-07-02 14:19 ` Qu Wenruo
0 siblings, 1 reply; 19+ messages in thread
From: Josef Bacik @ 2020-07-02 13:57 UTC (permalink / raw)
To: Qu Wenruo, Qu Wenruo, linux-btrfs
On 7/2/20 9:54 AM, Qu Wenruo wrote:
>
>
> On 2020/7/2 9:43 PM, Josef Bacik wrote:
>> On 7/1/20 8:14 PM, Qu Wenruo wrote:
>>> [PROBLEM]
>>> There are known problem related to how btrfs handles qgroup reserved
>>> space.
>>> One of the most obvious case is the the test case btrfs/153, which do
>>> fallocate, then write into the preallocated range.
>>>
>>> btrfs/153 1s ... - output mismatch (see
>>> xfstests-dev/results//btrfs/153.out.bad)
>>> --- tests/btrfs/153.out 2019-10-22 15:18:14.068965341 +0800
>>> +++ xfstests-dev/results//btrfs/153.out.bad 2020-07-01
>>> 20:24:40.730000089 +0800
>>> @@ -1,2 +1,5 @@
>>> QA output created by 153
>>> +pwrite: Disk quota exceeded
>>> +/mnt/scratch/testfile2: Disk quota exceeded
>>> +/mnt/scratch/testfile2: Disk quota exceeded
>>> Silence is golden
>>> ...
>>> (Run 'diff -u xfstests-dev/tests/btrfs/153.out
>>> xfstests-dev/results//btrfs/153.out.bad' to see the entire diff)
>>>
>>> [CAUSE]
>>> Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have
>>> to"),
>>> we always reserve space no matter if it's COW or not.
>>>
>>> Such behavior change is mostly for performance, and reverting it is not
>>> a good idea anyway.
>>>
>>> For preallcoated extent, we reserve qgroup data space for it already,
>>> and since we also reserve data space for qgroup at buffered write time,
>>> it needs twice the space for us to write into preallocated space.
>>>
>>> This leads to the -EDQUOT in buffered write routine.
>>>
>>> And we can't follow the same solution, unlike data/meta space check,
>>> qgroup reserved space is shared between data/meta.
>>> The EDQUOT can happen at the metadata reservation, so doing NODATACOW
>>> check after qgroup reservation failure is not a solution.
>>
>> Why not? I get that we don't know for sure how we failed, but in the
>> case of a write we're way more likely to have failed for data reasons
>> right?
>
> Nope, mostly we failed at metadata reservation, as that would return
> EDQUOT to user space.
>
> We may have some cases which get EDQUOT at data reservation part, but
> that's what we excepted.
> (And already what we're doing)
>
> The problem is when the metadata reservation failed with EDQUOT.
>
>> So why not just fall back to the NODATACOW check and then do the
>> metadata reservation. Then if it fails again you know its a real EDQUOT
>> and your done.
>>
>> Or if you want to get super fancy you could even break up the metadata
>> and data reservations here so that we only fall through to the NODATACOW
>> check if we fail the data reservation. Thanks,
>
> The problem is, qgroup doesn't split metadata and data (yet).
> Currently data and meta shares the same limit.
>
> So when we hit EDQUOT, you have no guarantee it would happen only in
> qgroup data reservation.
>
Sure, but if you are able to do the nocow thing, then presumably your quota
reservation is less now? So on failure you go do the complicated nocow check,
and if it succeeds you retry your quota reservation with just the metadata
portion, right? Thanks,
Josef
* Re: [PATCH 1/3] btrfs: Introduce extent_changeset_revert() for qgroup
2020-07-02 13:56 ` Josef Bacik
@ 2020-07-02 14:07 ` Qu Wenruo
0 siblings, 0 replies; 19+ messages in thread
From: Qu Wenruo @ 2020-07-02 14:07 UTC (permalink / raw)
To: Josef Bacik, Qu Wenruo, linux-btrfs
On 2020/7/2 9:56 PM, Josef Bacik wrote:
> On 7/2/20 9:50 AM, Qu Wenruo wrote:
>>
>>
>> On 2020/7/2 9:40 PM, Josef Bacik wrote:
>>> On 7/1/20 8:14 PM, Qu Wenruo wrote:
>>>> [PROBLEM]
>>>> Before this patch, when btrfs_qgroup_reserve_data() fails, we free all
>>>> reserved space of the changeset.
>>>>
>>>> This means the following call is not possible:
>>>> ret = btrfs_qgroup_reserve_data();
>>>> if (ret == -EDQUOT) {
>>>> /* Do something to free some qgroup space */
>>>> ret = btrfs_qgroup_reserve_data();
>>>> }
>>>>
>>>> As if the first btrfs_qgroup_reserve_data() fails, it will free all
>>>> reserved qgroup space, so the next btrfs_qgroup_reserve_data() will
>>>> always success, and can go beyond qgroup limit.
>>>>
>>>> [CAUSE]
>>>> This is mostly due to the fact that we don't have a good way to revert
>>>> changeset modification accurately.
>>>>
>>>> Currently the changeset infrastructure is implemented using ulist,
>>>> which
>>>> can only store two u64 values, used as start and length for each
>>>> changed
>>>> extent range.
>>>>
>>>> So we can't record which changed extent is modified in last
>>>> modification, thus unable to revert to previous status.
>>>>
>>>> [FIX]
>>>> This patch will re-implement using pure rbtree, adding a new member,
>>>> changed_extent::seq, so we can remove changed extents which is
>>>> modified in previous modification.
>>>>
>>>> This allows us to implement qgroup_revert(), which allow btrfs to
>>>> revert
>>>> its modification to the io_tree.
>>>>
>>>
>>> I'm having a hard time groking what's going on here. These changesets
>>> are limited to a [start, end] range correct?
>>
>> Yes, but we may be only responsible for part of the changeset.
>>
>> One example is we want to falloc range [0, 16K)
>> And [0, 4K), [8K, 12K) has already one existing file extent.
>>
>> Then we succeeded in allocating space for [4K, 8K), but failed to
>> allocating space for [8K, 12K).
Sorry, this should be [12K, 16K).
>>
>> In that case, if we just return EDQUOT and clear the range for [4K, 8k)
>> and [8K, 12K), everything is completely fine.
>>
>> But if we want to retry, then we should only clear the range for [8K,
>> 12K), but not to clear [4K, 8K).
>>
>> That's what we need for the next patch, just revert what we did in
>> previous set_extent_bit(), but not touching previously set bits.
>>
>
> Ok so how do we get to this case then? The changeset already has the
> range we're responsible for correct?
> Why do we need the sequence
> number? Like we should have a changeset for [4k, 8k) and one for [8k,
> 12k) right? Or are we handed back the whole thing? If we fail _just_
> for the [8k, 12k) part, do we know that? Or do we just know our whole
> [0k, 16k) thing failed, and so we need all this convoluted tracking to
> be able to retry for the range that we fucked up, and this is what you
> are facilitating? Thanks,
Oh, I got your point.
Right, when the failure happens, the range passed in is not [0, 16K),
but the last failed one.
We can just check the range, free any existing entry in the failed
range, and call it a day...
I made the case more complicated than it really is.
Thanks for the advice, I'll try to use the existing ulist mechanism to
do the same thing.
Thanks,
Qu
>
> Josef
* Re: [PATCH 2/3] btrfs: qgroup: Try to flush qgroup space when we get -EDQUOT
2020-07-02 13:57 ` Josef Bacik
@ 2020-07-02 14:19 ` Qu Wenruo
2020-07-02 14:58 ` Josef Bacik
0 siblings, 1 reply; 19+ messages in thread
From: Qu Wenruo @ 2020-07-02 14:19 UTC (permalink / raw)
To: Josef Bacik, Qu Wenruo, linux-btrfs
On 2020/7/2 9:57 PM, Josef Bacik wrote:
> On 7/2/20 9:54 AM, Qu Wenruo wrote:
>>
>>
>> On 2020/7/2 9:43 PM, Josef Bacik wrote:
>>> On 7/1/20 8:14 PM, Qu Wenruo wrote:
>>>> [PROBLEM]
>>>> There are known problem related to how btrfs handles qgroup reserved
>>>> space.
>>>> One of the most obvious case is the the test case btrfs/153, which do
>>>> fallocate, then write into the preallocated range.
>>>>
>>>> btrfs/153 1s ... - output mismatch (see
>>>> xfstests-dev/results//btrfs/153.out.bad)
>>>> --- tests/btrfs/153.out 2019-10-22 15:18:14.068965341 +0800
>>>> +++ xfstests-dev/results//btrfs/153.out.bad 2020-07-01
>>>> 20:24:40.730000089 +0800
>>>> @@ -1,2 +1,5 @@
>>>> QA output created by 153
>>>> +pwrite: Disk quota exceeded
>>>> +/mnt/scratch/testfile2: Disk quota exceeded
>>>> +/mnt/scratch/testfile2: Disk quota exceeded
>>>> Silence is golden
>>>> ...
>>>> (Run 'diff -u xfstests-dev/tests/btrfs/153.out
>>>> xfstests-dev/results//btrfs/153.out.bad' to see the entire diff)
>>>>
>>>> [CAUSE]
>>>> Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have
>>>> to"),
>>>> we always reserve space no matter if it's COW or not.
>>>>
>>>> Such behavior change is mostly for performance, and reverting it is not
>>>> a good idea anyway.
>>>>
>>>> For preallcoated extent, we reserve qgroup data space for it already,
>>>> and since we also reserve data space for qgroup at buffered write time,
>>>> it needs twice the space for us to write into preallocated space.
>>>>
>>>> This leads to the -EDQUOT in buffered write routine.
>>>>
>>>> And we can't follow the same solution, unlike data/meta space check,
>>>> qgroup reserved space is shared between data/meta.
>>>> The EDQUOT can happen at the metadata reservation, so doing NODATACOW
>>>> check after qgroup reservation failure is not a solution.
>>>
>>> Why not? I get that we don't know for sure how we failed, but in the
>>> case of a write we're way more likely to have failed for data reasons
>>> right?
>>
>> Nope, mostly we failed at metadata reservation, as that would return
>> EDQUOT to user space.
>>
>> We may have some cases which get EDQUOT at data reservation part, but
>> that's what we excepted.
>> (And already what we're doing)
>>
>> The problem is when the metadata reservation failed with EDQUOT.
>>
>>> So why not just fall back to the NODATACOW check and then do the
>>> metadata reservation. Then if it fails again you know its a real EDQUOT
>>> and your done.
>>>
>>> Or if you want to get super fancy you could even break up the metadata
>>> and data reservations here so that we only fall through to the NODATACOW
>>> check if we fail the data reservation. Thanks,
>>
>> The problem is, qgroup doesn't split metadata and data (yet).
>> Currently data and meta shares the same limit.
>>
>> So when we hit EDQUOT, you have no guarantee it would happen only in
>> qgroup data reservation.
>>
>
> Sure, but if you are able to do the nocow thing, then presumably your
> quota reservation is less now? So on failure you go do the complicated
> nocow check, and if it succeeds you retry your quota reservation with
> just the metadata portion, right? Thanks,
Then the metadata portion can still fail, even if we skipped the data
reservation.
The metadata portion still needs some space, while the data reservation
skip only happens after we're already near the qgroup limit, which means
there is already not much space left.
Consider this case: we have a 128M limit, we fallocated 120M, then we
have dirtied 7M of data, plus several KiB reserved for metadata.
Then within the next 1M we run out of qgroup limit, at whatever position.
Even if we skip the current 4K for data, the next metadata reservation
may still not be met, and we still get EDQUOT at the metadata reserve.
Or some other open() call to create a new file would just get EDQUOT,
without any way to free any extra space.
Instead of trying to skip just several 4K ranges for the qgroup data
reservation, we should flush the existing 7M, to free at least 7M of
data plus several KiB of metadata space.
Thanks,
Qu
>
> Josef
* Re: [PATCH 2/3] btrfs: qgroup: Try to flush qgroup space when we get -EDQUOT
2020-07-02 14:19 ` Qu Wenruo
@ 2020-07-02 14:58 ` Josef Bacik
2020-07-02 23:36 ` Qu Wenruo
0 siblings, 1 reply; 19+ messages in thread
From: Josef Bacik @ 2020-07-02 14:58 UTC (permalink / raw)
To: Qu Wenruo, Qu Wenruo, linux-btrfs
On 7/2/20 10:19 AM, Qu Wenruo wrote:
>
>
> On 2020/7/2 下午9:57, Josef Bacik wrote:
>> On 7/2/20 9:54 AM, Qu Wenruo wrote:
>>>
>>>
>>> On 2020/7/2 下午9:43, Josef Bacik wrote:
>>>> On 7/1/20 8:14 PM, Qu Wenruo wrote:
>>>>> [PROBLEM]
>>>>> There are known problems related to how btrfs handles qgroup reserved
>>>>> space.
>>>>> One of the most obvious cases is the test case btrfs/153, which does
>>>>> fallocate, then writes into the preallocated range.
>>>>>
>>>>> btrfs/153 1s ... - output mismatch (see
>>>>> xfstests-dev/results//btrfs/153.out.bad)
>>>>> --- tests/btrfs/153.out 2019-10-22 15:18:14.068965341 +0800
>>>>> +++ xfstests-dev/results//btrfs/153.out.bad 2020-07-01
>>>>> 20:24:40.730000089 +0800
>>>>> @@ -1,2 +1,5 @@
>>>>> QA output created by 153
>>>>> +pwrite: Disk quota exceeded
>>>>> +/mnt/scratch/testfile2: Disk quota exceeded
>>>>> +/mnt/scratch/testfile2: Disk quota exceeded
>>>>> Silence is golden
>>>>> ...
>>>>> (Run 'diff -u xfstests-dev/tests/btrfs/153.out
>>>>> xfstests-dev/results//btrfs/153.out.bad' to see the entire diff)
>>>>>
>>>>> [CAUSE]
>>>>> Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have
>>>>> to"),
>>>>> we always reserve space no matter if it's COW or not.
>>>>>
>>>>> Such behavior change is mostly for performance, and reverting it is not
>>>>> a good idea anyway.
>>>>>
>>>>> For a preallocated extent, we already reserve qgroup data space for it,
>>>>> and since we also reserve qgroup data space at buffered write time,
>>>>> it needs twice the space for us to write into the preallocated range.
>>>>>
>>>>> This leads to the -EDQUOT in buffered write routine.
>>>>>
>>>>> And we can't follow the same solution: unlike the data/meta space check,
>>>>> qgroup reserved space is shared between data and meta.
>>>>> The EDQUOT can happen at the metadata reservation, so doing the NODATACOW
>>>>> check after a qgroup reservation failure is not a solution.
>>>>
>>>> Why not? I get that we don't know for sure how we failed, but in the
>>>> case of a write we're way more likely to have failed for data reasons
>>>> right?
>>>
>>> Nope, mostly we failed at metadata reservation, as that would return
>>> EDQUOT to user space.
>>>
>>> We may have some cases which get EDQUOT at the data reservation part, but
>>> that's what we expected.
>>> (And that's already what we're doing)
>>>
>>> The problem is when the metadata reservation failed with EDQUOT.
>>>
>>>> So why not just fall back to the NODATACOW check and then do the
>>>> metadata reservation. Then if it fails again you know it's a real EDQUOT
>>>> and you're done.
>>>>
>>>> Or if you want to get super fancy you could even break up the metadata
>>>> and data reservations here so that we only fall through to the NODATACOW
>>>> check if we fail the data reservation. Thanks,
>>>
>>> The problem is, qgroup doesn't split metadata and data (yet).
>>> Currently data and meta share the same limit.
>>>
>>> So when we hit EDQUOT, you have no guarantee it would happen only in
>>> qgroup data reservation.
>>>
>>
>> Sure, but if you are able to do the nocow thing, then presumably your
>> quota reservation is less now? So on failure you go do the complicated
>> nocow check, and if it succeeds you retry your quota reservation with
>> just the metadata portion, right? Thanks,
>
> Then the metadata portion can still fail even if we skip the data reservation.
>
> The metadata portion still needs some space, while the data rsv skip
> only happens after we're already near the qgroup limit, which means
> there is really not much space left.
>
> Consider this case: we have a 128M limit, we fallocated 120M, and then we
> dirtied 7M of data, plus several KiB reserved for metadata.
>
> Then within the next 1M, we run out of qgroup limit, at whatever position.
> Even if we skip the current 4K data reservation, the next metadata reserve
> may still not be met, and we still get EDQUOT at the metadata reserve.
>
> Or some other open() call to create a new file would just get EDQUOT,
> without any way to free any extra space.
>
>
> Instead of trying to skip just a few 4K qgroup data reservations, we should
> flush the existing 7M, to free at least 7M of data plus several KiB of meta space.
>
Right, so I'm not against flushing in general, I just think that we can greatly
improve on this particular problem without flushing. Changing how we do the
NOCOW check with quota could be faster than doing the flushing.
Now as for the flushing part itself, I'd rather hook into the existing flushing
infrastructure we have. Obviously the ticketing is going to be different, but
the flushing part is still the same, and with data reservations now moved over
to that infrastructure we finally have it all in the same place. Thanks,
Josef
* Re: [PATCH 2/3] btrfs: qgroup: Try to flush qgroup space when we get -EDQUOT
2020-07-02 14:58 ` Josef Bacik
@ 2020-07-02 23:36 ` Qu Wenruo
0 siblings, 0 replies; 19+ messages in thread
From: Qu Wenruo @ 2020-07-02 23:36 UTC (permalink / raw)
To: Josef Bacik, Qu Wenruo, linux-btrfs
On 2020/7/2 下午10:58, Josef Bacik wrote:
> On 7/2/20 10:19 AM, Qu Wenruo wrote:
>>
>>
>> On 2020/7/2 下午9:57, Josef Bacik wrote:
>>> On 7/2/20 9:54 AM, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2020/7/2 下午9:43, Josef Bacik wrote:
>>>>> On 7/1/20 8:14 PM, Qu Wenruo wrote:
>>>>>> [PROBLEM]
>>>>>> There are known problems related to how btrfs handles qgroup reserved
>>>>>> space.
>>>>>> One of the most obvious cases is the test case btrfs/153, which does
>>>>>> fallocate, then writes into the preallocated range.
>>>>>>
>>>>>> btrfs/153 1s ... - output mismatch (see
>>>>>> xfstests-dev/results//btrfs/153.out.bad)
>>>>>> --- tests/btrfs/153.out 2019-10-22 15:18:14.068965341
>>>>>> +0800
>>>>>> +++ xfstests-dev/results//btrfs/153.out.bad 2020-07-01
>>>>>> 20:24:40.730000089 +0800
>>>>>> @@ -1,2 +1,5 @@
>>>>>> QA output created by 153
>>>>>> +pwrite: Disk quota exceeded
>>>>>> +/mnt/scratch/testfile2: Disk quota exceeded
>>>>>> +/mnt/scratch/testfile2: Disk quota exceeded
>>>>>> Silence is golden
>>>>>> ...
>>>>>> (Run 'diff -u xfstests-dev/tests/btrfs/153.out
>>>>>> xfstests-dev/results//btrfs/153.out.bad' to see the entire diff)
>>>>>>
>>>>>> [CAUSE]
>>>>>> Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we
>>>>>> have
>>>>>> to"),
>>>>>> we always reserve space no matter if it's COW or not.
>>>>>>
>>>>>> Such behavior change is mostly for performance, and reverting it
>>>>>> is not
>>>>>> a good idea anyway.
>>>>>>
>>>>>> For a preallocated extent, we already reserve qgroup data space for it,
>>>>>> and since we also reserve qgroup data space at buffered write time,
>>>>>> it needs twice the space for us to write into the preallocated range.
>>>>>>
>>>>>> This leads to the -EDQUOT in buffered write routine.
>>>>>>
>>>>>> And we can't follow the same solution: unlike the data/meta space check,
>>>>>> qgroup reserved space is shared between data and meta.
>>>>>> The EDQUOT can happen at the metadata reservation, so doing the NODATACOW
>>>>>> check after a qgroup reservation failure is not a solution.
>>>>>
>>>>> Why not? I get that we don't know for sure how we failed, but in the
>>>>> case of a write we're way more likely to have failed for data reasons
>>>>> right?
>>>>
>>>> Nope, mostly we failed at metadata reservation, as that would return
>>>> EDQUOT to user space.
>>>>
>>>> We may have some cases which get EDQUOT at the data reservation part, but
>>>> that's what we expected.
>>>> (And that's already what we're doing)
>>>>
>>>> The problem is when the metadata reservation failed with EDQUOT.
>>>>
>>>>> So why not just fall back to the NODATACOW check and then do the
>>>>> metadata reservation. Then if it fails again you know it's a real
>>>>> EDQUOT
>>>>> and you're done.
>>>>>
>>>>> Or if you want to get super fancy you could even break up the metadata
>>>>> and data reservations here so that we only fall through to the
>>>>> NODATACOW
>>>>> check if we fail the data reservation. Thanks,
>>>>
>>>> The problem is, qgroup doesn't split metadata and data (yet).
>>>> Currently data and meta share the same limit.
>>>>
>>>> So when we hit EDQUOT, you have no guarantee it would happen only in
>>>> qgroup data reservation.
>>>>
>>>
>>> Sure, but if you are able to do the nocow thing, then presumably your
>>> quota reservation is less now? So on failure you go do the complicated
>>> nocow check, and if it succeeds you retry your quota reservation with
>>> just the metadata portion, right? Thanks,
>>
>> Then the metadata portion can still fail even if we skip the data reservation.
>>
>> The metadata portion still needs some space, while the data rsv skip
>> only happens after we're already near the qgroup limit, which means
>> there is really not much space left.
>>
>> Consider this case: we have a 128M limit, we fallocated 120M, and then we
>> dirtied 7M of data, plus several KiB reserved for metadata.
>>
>> Then within the next 1M, we run out of qgroup limit, at whatever position.
>> Even if we skip the current 4K data reservation, the next metadata reserve
>> may still not be met, and we still get EDQUOT at the metadata reserve.
>>
>> Or some other open() call to create a new file would just get EDQUOT,
>> without any way to free any extra space.
>>
>>
>> Instead of trying to skip just a few 4K qgroup data reservations, we should
>> flush the existing 7M, to free at least 7M of data plus several KiB of meta
>> space.
>>
>
> Right so I'm not against flushing in general, I just think that we can
> greatly improve on this particular problem without flushing. Changing
> how we do the NOCOW check with quota could be faster than doing the
> flushing.
Yep, but as mentioned, the uncertain timing of when we get the EDQUOT is
really annoying and tricky to solve, thus we have to go with the flushing method.
The performance is definitely slower, but that's acceptable: since we're
near the limit, slowing down is pretty common.
>
> Now as for the flushing part itself, I'd rather hook into the existing
> flushing infrastructure we have.
That's the ultimate objective.
> Obviously the ticketing is going to be
> different, but the flushing part is still the same, and with data
> reservations now moved over to that infrastructure we finally have it
> all in the same place. Thanks,
Before the needed infrastructure gets merged, I'll keep the current small
retry code and look into what's needed to integrate qgroup rsv into the
ticketing system.
Thanks,
Qu
>
> Josef
end of thread, other threads:[~2020-07-02 23:36 UTC | newest]
Thread overview: 19+ messages
2020-07-02 0:14 [PATCH 0/3] btrfs: qgroup: Fix the long existing regression of btrfs/153 Qu Wenruo
2020-07-02 0:14 ` [PATCH 1/3] btrfs: Introduce extent_changeset_revert() for qgroup Qu Wenruo
2020-07-02 13:40 ` Josef Bacik
2020-07-02 13:50 ` Qu Wenruo
2020-07-02 13:56 ` Josef Bacik
2020-07-02 14:07 ` Qu Wenruo
2020-07-02 0:14 ` [PATCH 2/3] btrfs: qgroup: Try to flush qgroup space when we get -EDQUOT Qu Wenruo
2020-07-02 13:43 ` Josef Bacik
2020-07-02 13:54 ` Qu Wenruo
2020-07-02 13:57 ` Josef Bacik
2020-07-02 14:19 ` Qu Wenruo
2020-07-02 14:58 ` Josef Bacik
2020-07-02 23:36 ` Qu Wenruo
2020-07-02 0:14 ` [PATCH 3/3] Revert "btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT" Qu Wenruo
2020-07-02 13:11 ` David Sterba
2020-07-02 13:22 ` [PATCH 0/3] btrfs: qgroup: Fix the long existing regression of btrfs/153 David Sterba
2020-07-02 13:28 ` Josef Bacik
2020-07-02 13:41 ` David Sterba
2020-07-02 13:44 ` Josef Bacik