* [PATCH 0/5] btrfs: qgroups rsv fixes
@ 2023-12-01 21:00 Boris Burkov
2023-12-01 21:00 ` [PATCH 1/5] btrfs: free qgroup rsv on ioerr ordered_extent Boris Burkov
` (5 more replies)
0 siblings, 6 replies; 16+ messages in thread
From: Boris Burkov @ 2023-12-01 21:00 UTC (permalink / raw)
To: linux-btrfs, kernel-team
This series contains a number of related but relatively orthogonal fixes
for various bugs in qgroup/squota reservation accounting. Most of these
manifest as either rsv underflow WARNINGs (in qgroup_rsv_release) or as
WARNINGs at umount for unreleased space.
With these fixes, I am able to get a fully clean '-g auto' fstests run
on my setup and with -O squota in MKFS_OPTIONS.
Boris Burkov (5):
btrfs: free qgroup rsv on ioerr ordered_extent
btrfs: fix qgroup_free_reserved_data int overflow
btrfs: free qgroup pertrans rsv on trans abort
btrfs: dont clear qgroup rsv bit in release_folio
btrfs: ensure releasing squota rsv on head refs
fs/btrfs/delalloc-space.c | 2 +-
fs/btrfs/disk-io.c | 28 +++++++++++++++++++++++
fs/btrfs/extent-tree.c | 47 +++++++++++++++++++++++++++------------
fs/btrfs/extent_io.c | 3 ++-
fs/btrfs/file.c | 2 +-
fs/btrfs/inode.c | 16 ++++++-------
fs/btrfs/ordered-data.c | 10 +++++----
fs/btrfs/qgroup.c | 46 +++++++++++++++++++++++++-------------
fs/btrfs/qgroup.h | 8 +++----
9 files changed, 114 insertions(+), 48 deletions(-)
--
2.42.0
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 1/5] btrfs: free qgroup rsv on ioerr ordered_extent
2023-12-01 21:00 [PATCH 0/5] btrfs: qgroups rsv fixes Boris Burkov
@ 2023-12-01 21:00 ` Boris Burkov
2023-12-04 21:04 ` Qu Wenruo
2023-12-01 21:00 ` [PATCH 2/5] btrfs: fix qgroup_free_reserved_data int overflow Boris Burkov
` (4 subsequent siblings)
5 siblings, 1 reply; 16+ messages in thread
From: Boris Burkov @ 2023-12-01 21:00 UTC (permalink / raw)
To: linux-btrfs, kernel-team
An ordered extent completing is a critical moment in qgroup rsv
handling, as the ownership of the reservation is handed off from the
ordered extent to the delayed ref. In the happy path we release (unlock)
but do not free (decrement counter) the reservation, and the delayed ref
drives the free. However, on an error, we don't create a delayed ref,
since there is no ref to add. Therefore, free on the error path.
Signed-off-by: Boris Burkov <boris@bur.io>
---
fs/btrfs/ordered-data.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 574e8a55e24a..8d4ab5ecfa5d 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -599,7 +599,8 @@ void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
release = entry->disk_num_bytes;
else
release = entry->num_bytes;
- btrfs_delalloc_release_metadata(btrfs_inode, release, false);
+ btrfs_delalloc_release_metadata(btrfs_inode, release,
+ test_bit(BTRFS_ORDERED_IOERR, &entry->flags));
}
percpu_counter_add_batch(&fs_info->ordered_bytes, -entry->num_bytes,
--
2.42.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 2/5] btrfs: fix qgroup_free_reserved_data int overflow
2023-12-01 21:00 [PATCH 0/5] btrfs: qgroups rsv fixes Boris Burkov
2023-12-01 21:00 ` [PATCH 1/5] btrfs: free qgroup rsv on ioerr ordered_extent Boris Burkov
@ 2023-12-01 21:00 ` Boris Burkov
2023-12-04 21:07 ` Qu Wenruo
2023-12-01 21:00 ` [PATCH 3/5] btrfs: free qgroup pertrans rsv on trans abort Boris Burkov
` (3 subsequent siblings)
5 siblings, 1 reply; 16+ messages in thread
From: Boris Burkov @ 2023-12-01 21:00 UTC (permalink / raw)
To: linux-btrfs, kernel-team
The reserved data counter and input parameter is a u64, but we
inadvertantly accumulate it in an int. Overflowing that int results in
freeing the wrong amount of data and breaking rsv accounting.
Unfortunately, this overflow rot spreads from there, as the qgroup
release/free functions rely on returning an int to take advantage of
negative values for error codes.
Therefore, the full fix is to return the "released" or "freed" amount by
a u64* argument and to return 0 or negative error code via the return
value.
Most of the callsites simply ignore the return value, though some
of them handle the error and count the returned bytes. Change all of
them accordingly.
Signed-off-by: Boris Burkov <boris@bur.io>
---
fs/btrfs/delalloc-space.c | 2 +-
fs/btrfs/file.c | 2 +-
fs/btrfs/inode.c | 16 ++++++++--------
fs/btrfs/ordered-data.c | 7 ++++---
fs/btrfs/qgroup.c | 25 +++++++++++++++----------
fs/btrfs/qgroup.h | 4 ++--
6 files changed, 31 insertions(+), 25 deletions(-)
diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index 51453d4928fa..2833e8ef4c09 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -199,7 +199,7 @@ void btrfs_free_reserved_data_space(struct btrfs_inode *inode,
start = round_down(start, fs_info->sectorsize);
btrfs_free_reserved_data_space_noquota(fs_info, len);
- btrfs_qgroup_free_data(inode, reserved, start, len);
+ btrfs_qgroup_free_data(inode, reserved, start, len, NULL);
}
/*
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e9c4b947a5aa..7a71720aaed2 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3192,7 +3192,7 @@ static long btrfs_fallocate(struct file *file, int mode,
qgroup_reserved -= range->len;
} else if (qgroup_reserved > 0) {
btrfs_qgroup_free_data(BTRFS_I(inode), data_reserved,
- range->start, range->len);
+ range->start, range->len, NULL);
qgroup_reserved -= range->len;
}
list_del(&range->list);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f8647d8271b7..e79a047aa5d1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -697,7 +697,7 @@ static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 size,
* And at reserve time, it's always aligned to page size, so
* just free one page here.
*/
- btrfs_qgroup_free_data(inode, NULL, 0, PAGE_SIZE);
+ btrfs_qgroup_free_data(inode, NULL, 0, PAGE_SIZE, NULL);
btrfs_free_path(path);
btrfs_end_transaction(trans);
return ret;
@@ -5141,7 +5141,7 @@ static void evict_inode_truncate_pages(struct inode *inode)
*/
if (state_flags & EXTENT_DELALLOC)
btrfs_qgroup_free_data(BTRFS_I(inode), NULL, start,
- end - start + 1);
+ end - start + 1, NULL);
clear_extent_bit(io_tree, start, end,
EXTENT_CLEAR_ALL_BITS | EXTENT_DO_ACCOUNTING,
@@ -8076,7 +8076,7 @@ static void btrfs_invalidate_folio(struct folio *folio, size_t offset,
* reserved data space.
* Since the IO will never happen for this page.
*/
- btrfs_qgroup_free_data(inode, NULL, cur, range_end + 1 - cur);
+ btrfs_qgroup_free_data(inode, NULL, cur, range_end + 1 - cur, NULL);
if (!inode_evicting) {
clear_extent_bit(tree, cur, range_end, EXTENT_LOCKED |
EXTENT_DELALLOC | EXTENT_UPTODATE |
@@ -9513,7 +9513,7 @@ static struct btrfs_trans_handle *insert_prealloc_file_extent(
struct btrfs_path *path;
u64 start = ins->objectid;
u64 len = ins->offset;
- int qgroup_released;
+ u64 qgroup_released = 0;
int ret;
memset(&stack_fi, 0, sizeof(stack_fi));
@@ -9526,9 +9526,9 @@ static struct btrfs_trans_handle *insert_prealloc_file_extent(
btrfs_set_stack_file_extent_compression(&stack_fi, BTRFS_COMPRESS_NONE);
/* Encryption and other encoding is reserved and all 0 */
- qgroup_released = btrfs_qgroup_release_data(inode, file_offset, len);
- if (qgroup_released < 0)
- return ERR_PTR(qgroup_released);
+ ret = btrfs_qgroup_release_data(inode, file_offset, len, &qgroup_released);
+ if (ret < 0)
+ return ERR_PTR(ret);
if (trans) {
ret = insert_reserved_file_extent(trans, inode,
@@ -10423,7 +10423,7 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
btrfs_delalloc_release_metadata(inode, disk_num_bytes, ret < 0);
out_qgroup_free_data:
if (ret < 0)
- btrfs_qgroup_free_data(inode, data_reserved, start, num_bytes);
+ btrfs_qgroup_free_data(inode, data_reserved, start, num_bytes, NULL);
out_free_data_space:
/*
* If btrfs_reserve_extent() succeeded, then we already decremented
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 8d4ab5ecfa5d..c68fb78b7454 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -152,11 +152,12 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
{
struct btrfs_ordered_extent *entry;
int ret;
+ u64 qgroup_rsv = 0;
if (flags &
((1 << BTRFS_ORDERED_NOCOW) | (1 << BTRFS_ORDERED_PREALLOC))) {
/* For nocow write, we can release the qgroup rsv right now */
- ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes);
+ ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes, &qgroup_rsv);
if (ret < 0)
return ERR_PTR(ret);
} else {
@@ -164,7 +165,7 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
* The ordered extent has reserved qgroup space, release now
* and pass the reserved number for qgroup_record to free.
*/
- ret = btrfs_qgroup_release_data(inode, file_offset, num_bytes);
+ ret = btrfs_qgroup_release_data(inode, file_offset, num_bytes, &qgroup_rsv);
if (ret < 0)
return ERR_PTR(ret);
}
@@ -182,7 +183,7 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
entry->inode = igrab(&inode->vfs_inode);
entry->compress_type = compress_type;
entry->truncated_len = (u64)-1;
- entry->qgroup_rsv = ret;
+ entry->qgroup_rsv = qgroup_rsv;
entry->flags = flags;
refcount_set(&entry->refs, 1);
init_waitqueue_head(&entry->wait);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index ce446d9d7f23..a953c16c7eb8 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -4057,13 +4057,14 @@ int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
/* Free ranges specified by @reserved, normally in error path */
static int qgroup_free_reserved_data(struct btrfs_inode *inode,
- struct extent_changeset *reserved, u64 start, u64 len)
+ struct extent_changeset *reserved,
+ u64 start, u64 len, u64 *freed_ret)
{
struct btrfs_root *root = inode->root;
struct ulist_node *unode;
struct ulist_iterator uiter;
struct extent_changeset changeset;
- int freed = 0;
+ u64 freed = 0;
int ret;
extent_changeset_init(&changeset);
@@ -4104,7 +4105,9 @@ static int qgroup_free_reserved_data(struct btrfs_inode *inode,
}
btrfs_qgroup_free_refroot(root->fs_info, root->root_key.objectid, freed,
BTRFS_QGROUP_RSV_DATA);
- ret = freed;
+ if (freed_ret)
+ *freed_ret = freed;
+ ret = 0;
out:
extent_changeset_release(&changeset);
return ret;
@@ -4112,7 +4115,7 @@ static int qgroup_free_reserved_data(struct btrfs_inode *inode,
static int __btrfs_qgroup_release_data(struct btrfs_inode *inode,
struct extent_changeset *reserved, u64 start, u64 len,
- int free)
+ u64 *released, int free)
{
struct extent_changeset changeset;
int trace_op = QGROUP_RELEASE;
@@ -4128,7 +4131,7 @@ static int __btrfs_qgroup_release_data(struct btrfs_inode *inode,
/* In release case, we shouldn't have @reserved */
WARN_ON(!free && reserved);
if (free && reserved)
- return qgroup_free_reserved_data(inode, reserved, start, len);
+ return qgroup_free_reserved_data(inode, reserved, start, len, released);
extent_changeset_init(&changeset);
ret = clear_record_extent_bits(&inode->io_tree, start, start + len -1,
EXTENT_QGROUP_RESERVED, &changeset);
@@ -4143,7 +4146,8 @@ static int __btrfs_qgroup_release_data(struct btrfs_inode *inode,
btrfs_qgroup_free_refroot(inode->root->fs_info,
inode->root->root_key.objectid,
changeset.bytes_changed, BTRFS_QGROUP_RSV_DATA);
- ret = changeset.bytes_changed;
+ if (released)
+ *released = changeset.bytes_changed;
out:
extent_changeset_release(&changeset);
return ret;
@@ -4162,9 +4166,10 @@ static int __btrfs_qgroup_release_data(struct btrfs_inode *inode,
* NOTE: This function may sleep for memory allocation.
*/
int btrfs_qgroup_free_data(struct btrfs_inode *inode,
- struct extent_changeset *reserved, u64 start, u64 len)
+ struct extent_changeset *reserved,
+ u64 start, u64 len, u64 *freed)
{
- return __btrfs_qgroup_release_data(inode, reserved, start, len, 1);
+ return __btrfs_qgroup_release_data(inode, reserved, start, len, freed, 1);
}
/*
@@ -4182,9 +4187,9 @@ int btrfs_qgroup_free_data(struct btrfs_inode *inode,
*
* NOTE: This function may sleep for memory allocation.
*/
-int btrfs_qgroup_release_data(struct btrfs_inode *inode, u64 start, u64 len)
+int btrfs_qgroup_release_data(struct btrfs_inode *inode, u64 start, u64 len, u64 *released)
{
- return __btrfs_qgroup_release_data(inode, NULL, start, len, 0);
+ return __btrfs_qgroup_release_data(inode, NULL, start, len, released, 0);
}
static void add_root_meta_rsv(struct btrfs_root *root, int num_bytes,
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 855a4f978761..15b485506104 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -358,10 +358,10 @@ int btrfs_verify_qgroup_counts(struct btrfs_fs_info *fs_info, u64 qgroupid,
/* New io_tree based accurate qgroup reserve API */
int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
struct extent_changeset **reserved, u64 start, u64 len);
-int btrfs_qgroup_release_data(struct btrfs_inode *inode, u64 start, u64 len);
+int btrfs_qgroup_release_data(struct btrfs_inode *inode, u64 start, u64 len, u64 *released);
int btrfs_qgroup_free_data(struct btrfs_inode *inode,
struct extent_changeset *reserved, u64 start,
- u64 len);
+ u64 len, u64 *freed);
int btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
enum btrfs_qgroup_rsv_type type, bool enforce);
int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
--
2.42.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 3/5] btrfs: free qgroup pertrans rsv on trans abort
2023-12-01 21:00 [PATCH 0/5] btrfs: qgroups rsv fixes Boris Burkov
2023-12-01 21:00 ` [PATCH 1/5] btrfs: free qgroup rsv on ioerr ordered_extent Boris Burkov
2023-12-01 21:00 ` [PATCH 2/5] btrfs: fix qgroup_free_reserved_data int overflow Boris Burkov
@ 2023-12-01 21:00 ` Boris Burkov
2023-12-04 21:08 ` Qu Wenruo
2023-12-05 14:27 ` David Sterba
2023-12-01 21:00 ` [PATCH 4/5] btrfs: dont clear qgroup rsv bit in release_folio Boris Burkov
` (2 subsequent siblings)
5 siblings, 2 replies; 16+ messages in thread
From: Boris Burkov @ 2023-12-01 21:00 UTC (permalink / raw)
To: linux-btrfs, kernel-team
If we abort a transaction, we never run the code that frees the pertrans
qgroup reservation. This results in warnings on unmount as that
reservation has been leaked. The leak isn't a huge issue since the fs is
read-only, but it's better to clean it up when we know we can/should. Do
it during the cleanup_transaction step of aborting.
Signed-off-by: Boris Burkov <boris@bur.io>
---
fs/btrfs/disk-io.c | 28 ++++++++++++++++++++++++++++
fs/btrfs/qgroup.c | 5 +++--
2 files changed, 31 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9317606017e2..a1f440cd6d45 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4775,6 +4775,32 @@ void btrfs_cleanup_dirty_bgs(struct btrfs_transaction *cur_trans,
}
}
+static void btrfs_free_all_qgroup_pertrans(struct btrfs_fs_info *fs_info)
+{
+ struct btrfs_root *gang[8];
+ int i;
+ int ret;
+
+ spin_lock(&fs_info->fs_roots_radix_lock);
+ while (1) {
+ ret = radix_tree_gang_lookup_tag(&fs_info->fs_roots_radix,
+ (void **)gang, 0,
+ ARRAY_SIZE(gang),
+ 0); // BTRFS_ROOT_TRANS_TAG
+ if (ret == 0)
+ break;
+ for (i = 0; i < ret; i++) {
+ struct btrfs_root *root = gang[i];
+
+ btrfs_qgroup_free_meta_all_pertrans(root);
+ radix_tree_tag_clear(&fs_info->fs_roots_radix,
+ (unsigned long)root->root_key.objectid,
+ 0); // BTRFS_ROOT_TRANS_TAG
+ }
+ }
+ spin_unlock(&fs_info->fs_roots_radix_lock);
+}
+
void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
struct btrfs_fs_info *fs_info)
{
@@ -4803,6 +4829,8 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
EXTENT_DIRTY);
btrfs_destroy_pinned_extent(fs_info, &cur_trans->pinned_extents);
+ btrfs_free_all_qgroup_pertrans(fs_info);
+
cur_trans->state =TRANS_STATE_COMPLETED;
wake_up(&cur_trans->commit_wait);
}
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index a953c16c7eb8..daec90342dad 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -4337,8 +4337,9 @@ static void qgroup_convert_meta(struct btrfs_fs_info *fs_info, u64 ref_root,
qgroup_rsv_release(fs_info, qgroup, num_bytes,
BTRFS_QGROUP_RSV_META_PREALLOC);
- qgroup_rsv_add(fs_info, qgroup, num_bytes,
- BTRFS_QGROUP_RSV_META_PERTRANS);
+ if (!sb_rdonly(fs_info->sb))
+ qgroup_rsv_add(fs_info, qgroup, num_bytes,
+ BTRFS_QGROUP_RSV_META_PERTRANS);
list_for_each_entry(glist, &qgroup->groups, next_group)
qgroup_iterator_add(&qgroup_list, glist->group);
--
2.42.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 4/5] btrfs: dont clear qgroup rsv bit in release_folio
2023-12-01 21:00 [PATCH 0/5] btrfs: qgroups rsv fixes Boris Burkov
` (2 preceding siblings ...)
2023-12-01 21:00 ` [PATCH 3/5] btrfs: free qgroup pertrans rsv on trans abort Boris Burkov
@ 2023-12-01 21:00 ` Boris Burkov
2023-12-04 21:09 ` Qu Wenruo
2023-12-01 21:00 ` [PATCH 5/5] btrfs: ensure releasing squota rsv on head refs Boris Burkov
2023-12-05 17:09 ` [PATCH 0/5] btrfs: qgroups rsv fixes David Sterba
5 siblings, 1 reply; 16+ messages in thread
From: Boris Burkov @ 2023-12-01 21:00 UTC (permalink / raw)
To: linux-btrfs, kernel-team
The EXTENT_QGROUP_RESERVED bit is used to "lock" regions of the file for
duplicate reservations. That is two writes to that range in one
transaction shouldn't create two reservations, as the reservation will
only be freed once when the write finally goes down. Therefore, it is
never OK to clear that bit without freeing the associated qgroup rsv. At
this point, we don't want to be freeing the rsv, so mask off the bit.
Signed-off-by: Boris Burkov <boris@bur.io>
---
fs/btrfs/extent_io.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0143bf63044c..87283087c669 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2310,7 +2310,8 @@ static int try_release_extent_state(struct extent_io_tree *tree,
ret = 0;
} else {
u32 clear_bits = ~(EXTENT_LOCKED | EXTENT_NODATASUM |
- EXTENT_DELALLOC_NEW | EXTENT_CTLBITS);
+ EXTENT_DELALLOC_NEW | EXTENT_CTLBITS
+ | EXTENT_QGROUP_RESERVED);
/*
* At this point we can safely clear everything except the
--
2.42.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 5/5] btrfs: ensure releasing squota rsv on head refs
2023-12-01 21:00 [PATCH 0/5] btrfs: qgroups rsv fixes Boris Burkov
` (3 preceding siblings ...)
2023-12-01 21:00 ` [PATCH 4/5] btrfs: dont clear qgroup rsv bit in release_folio Boris Burkov
@ 2023-12-01 21:00 ` Boris Burkov
2023-12-05 17:09 ` [PATCH 0/5] btrfs: qgroups rsv fixes David Sterba
5 siblings, 0 replies; 16+ messages in thread
From: Boris Burkov @ 2023-12-01 21:00 UTC (permalink / raw)
To: linux-btrfs, kernel-team
A reservation goes through a 3 step lifetime:
- generated during delalloc
- released/counted by ordered_extent allocation
- freed by running delayed ref
That third step depends on must_insert_reserved on the head ref, so the
head ref with that field set owns the reservation. Once you prepare to
run the head ref, must_insert_reserved is unset, which means that
running the ref must free the reservation, whether or not it succeeds,
or else the reservation is leaked. That results in either a risk of
spurious ENOSPC if the fs stays writeable or a warning on unmount if it
is readonly.
The existing squota code was aware of these invariants, but missed a few
cases. Improve it by adding a helper function to use in the cleanup
paths and call it from the existing early returns in running delayed
refs. This also simplifies btrfs_record_squota_delta and struct
btrfs_quota_delta.
This fixes (or at least improves the reliability of) generic/475 with
mkfs -O squota. On my machine, that test failed ~4/10 times without this
patch and passed 100/100 times with it.
Signed-off-by: Boris Burkov <boris@bur.io>
---
fs/btrfs/extent-tree.c | 47 +++++++++++++++++++++++++++++-------------
fs/btrfs/qgroup.c | 16 +++++++++++---
fs/btrfs/qgroup.h | 4 ++--
3 files changed, 48 insertions(+), 19 deletions(-)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 008c4c77a847..e41251c08190 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1547,6 +1547,23 @@ static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
return ret;
}
+static void free_head_ref_squota_rsv(struct btrfs_fs_info *fs_info,
+ struct btrfs_delayed_ref_head *href)
+{
+ u64 root = href->owning_root;
+
+ /*
+ * Don't check must_insert_reserved, as this is called from contexts
+ * where it has already been unset.
+ */
+ if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_SIMPLE ||
+ !href->is_data || !is_fstree(root))
+ return;
+
+ btrfs_qgroup_free_refroot(fs_info, root, href->reserved_bytes,
+ BTRFS_QGROUP_RSV_DATA);
+}
+
static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
struct btrfs_delayed_ref_head *href,
struct btrfs_delayed_ref_node *node,
@@ -1569,7 +1586,6 @@ static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
struct btrfs_squota_delta delta = {
.root = href->owning_root,
.num_bytes = node->num_bytes,
- .rsv_bytes = href->reserved_bytes,
.is_data = true,
.is_inc = true,
.generation = trans->transid,
@@ -1586,11 +1602,9 @@ static int run_delayed_data_ref(struct btrfs_trans_handle *trans,
flags, ref->objectid,
ref->offset, &key,
node->ref_mod, href->owning_root);
+ free_head_ref_squota_rsv(trans->fs_info, href);
if (!ret)
ret = btrfs_record_squota_delta(trans->fs_info, &delta);
- else
- btrfs_qgroup_free_refroot(trans->fs_info, delta.root,
- delta.rsv_bytes, BTRFS_QGROUP_RSV_DATA);
} else if (node->action == BTRFS_ADD_DELAYED_REF) {
ret = __btrfs_inc_extent_ref(trans, node, parent, ref->root,
ref->objectid, ref->offset,
@@ -1742,7 +1756,6 @@ static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
struct btrfs_squota_delta delta = {
.root = href->owning_root,
.num_bytes = fs_info->nodesize,
- .rsv_bytes = 0,
.is_data = false,
.is_inc = true,
.generation = trans->transid,
@@ -1774,8 +1787,10 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
int ret = 0;
if (TRANS_ABORTED(trans)) {
- if (insert_reserved)
+ if (insert_reserved) {
btrfs_pin_extent(trans, node->bytenr, node->num_bytes, 1);
+ free_head_ref_squota_rsv(trans->fs_info, href);
+ }
return 0;
}
@@ -1871,6 +1886,8 @@ u64 btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info,
struct btrfs_delayed_ref_root *delayed_refs,
struct btrfs_delayed_ref_head *head)
{
+ u64 ret = 0;
+
/*
* We had csum deletions accounted for in our delayed refs rsv, we need
* to drop the csum leaves for this update from our delayed_refs_rsv.
@@ -1885,14 +1902,13 @@ u64 btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info,
btrfs_delayed_refs_rsv_release(fs_info, 0, nr_csums);
- return btrfs_calc_delayed_ref_csum_bytes(fs_info, nr_csums);
+ ret = btrfs_calc_delayed_ref_csum_bytes(fs_info, nr_csums);
}
- if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE &&
- head->must_insert_reserved && head->is_data)
- btrfs_qgroup_free_refroot(fs_info, head->owning_root,
- head->reserved_bytes, BTRFS_QGROUP_RSV_DATA);
+ /* must_insert_reserved can be set only if we didn't run the head ref */
+ if (head->must_insert_reserved)
+ free_head_ref_squota_rsv(fs_info, head);
- return 0;
+ return ret;
}
static int cleanup_ref_head(struct btrfs_trans_handle *trans,
@@ -2033,6 +2049,11 @@ static int btrfs_run_delayed_refs_for_head(struct btrfs_trans_handle *trans,
* spin lock.
*/
must_insert_reserved = locked_ref->must_insert_reserved;
+ /*
+ * Unsetting this on the head ref relinquishes ownership of
+ * the rsv_bytes, so it is critical that every possible code
+ * path from here forward frees all rsv including qgroup rsv.
+ */
locked_ref->must_insert_reserved = false;
extent_op = locked_ref->extent_op;
@@ -3292,7 +3313,6 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
struct btrfs_squota_delta delta = {
.root = delayed_ref_root,
.num_bytes = num_bytes,
- .rsv_bytes = 0,
.is_data = is_data,
.is_inc = false,
.generation = btrfs_extent_generation(leaf, ei),
@@ -4935,7 +4955,6 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
.root = root_objectid,
.num_bytes = ins->offset,
.generation = trans->transid,
- .rsv_bytes = 0,
.is_data = true,
.is_inc = true,
};
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index daec90342dad..9576d77f6f6a 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -4661,6 +4661,19 @@ void btrfs_qgroup_destroy_extent_records(struct btrfs_transaction *trans)
*root = RB_ROOT;
}
+void btrfs_free_squota_rsv(struct btrfs_fs_info *fs_info,
+ u64 root, u64 rsv_bytes)
+{
+ if (btrfs_qgroup_mode(fs_info) != BTRFS_QGROUP_MODE_SIMPLE)
+ return;
+
+ if (!is_fstree(root))
+ return;
+
+ btrfs_qgroup_free_refroot(fs_info, root, rsv_bytes,
+ BTRFS_QGROUP_RSV_DATA);
+}
+
int btrfs_record_squota_delta(struct btrfs_fs_info *fs_info,
struct btrfs_squota_delta *delta)
{
@@ -4705,8 +4718,5 @@ int btrfs_record_squota_delta(struct btrfs_fs_info *fs_info,
out:
spin_unlock(&fs_info->qgroup_lock);
- if (!ret && delta->rsv_bytes)
- btrfs_qgroup_free_refroot(fs_info, root, delta->rsv_bytes,
- BTRFS_QGROUP_RSV_DATA);
return ret;
}
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 15b485506104..3dbb4095d2f2 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -274,8 +274,6 @@ struct btrfs_squota_delta {
u64 root;
/* The number of bytes in the extent being counted. */
u64 num_bytes;
- /* The number of bytes reserved for this extent. */
- u64 rsv_bytes;
/* The generation the extent was created in. */
u64 generation;
/* Whether we are using or freeing the extent. */
@@ -422,6 +420,8 @@ int btrfs_qgroup_trace_subtree_after_cow(struct btrfs_trans_handle *trans,
struct btrfs_root *root, struct extent_buffer *eb);
void btrfs_qgroup_destroy_extent_records(struct btrfs_transaction *trans);
bool btrfs_check_quota_leak(struct btrfs_fs_info *fs_info);
+void btrfs_free_squota_rsv(struct btrfs_fs_info *fs_info,
+ u64 root, u64 rsv_bytes);
int btrfs_record_squota_delta(struct btrfs_fs_info *fs_info,
struct btrfs_squota_delta *delta);
--
2.42.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 1/5] btrfs: free qgroup rsv on ioerr ordered_extent
2023-12-01 21:00 ` [PATCH 1/5] btrfs: free qgroup rsv on ioerr ordered_extent Boris Burkov
@ 2023-12-04 21:04 ` Qu Wenruo
2023-12-05 19:42 ` Boris Burkov
0 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2023-12-04 21:04 UTC (permalink / raw)
To: Boris Burkov, linux-btrfs, kernel-team
On 2023/12/2 07:30, Boris Burkov wrote:
> An ordered extent completing is a critical moment in qgroup rsv
> handling, as the ownership of the reservation is handed off from the
> ordered extent to the delayed ref. In the happy path we release (unlock)
> but do not free (decrement counter) the reservation, and the delayed ref
> drives the free. However, on an error, we don't create a delayed ref,
> since there is no ref to add. Therefore, free on the error path.
And I believe this would cause btrfs to be noisy at the unmount time,
due to unreleased qgroup rsv.
Have you hit any one during your tests? If so, mind to add some dmesg
output for it?
Or if no hit so far, would it be possible to add a new test case for it?
>
> Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Thanks,
Qu
> ---
> fs/btrfs/ordered-data.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 574e8a55e24a..8d4ab5ecfa5d 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -599,7 +599,8 @@ void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
> release = entry->disk_num_bytes;
> else
> release = entry->num_bytes;
> - btrfs_delalloc_release_metadata(btrfs_inode, release, false);
> + btrfs_delalloc_release_metadata(btrfs_inode, release,
> + test_bit(BTRFS_ORDERED_IOERR, &entry->flags));
> }
>
> percpu_counter_add_batch(&fs_info->ordered_bytes, -entry->num_bytes,
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/5] btrfs: fix qgroup_free_reserved_data int overflow
2023-12-01 21:00 ` [PATCH 2/5] btrfs: fix qgroup_free_reserved_data int overflow Boris Burkov
@ 2023-12-04 21:07 ` Qu Wenruo
0 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2023-12-04 21:07 UTC (permalink / raw)
To: Boris Burkov, linux-btrfs, kernel-team
On 2023/12/2 07:30, Boris Burkov wrote:
> The reserved data counter and input parameter is a u64, but we
> inadvertantly accumulate it in an int. Overflowing that int results in
> freeing the wrong amount of data and breaking rsv accounting.
>
> Unfortunately, this overflow rot spreads from there, as the qgroup
> release/free functions rely on returning an int to take advantage of
> negative values for error codes.
Indeed, reusing int for both released bytes and error number is the root
cause of the overflow.
>
> Therefore, the full fix is to return the "released" or "freed" amount by
> a u64* argument and to return 0 or negative error code via the return
> value.
>
> Most of the callsites simply ignore the return value, though some
> of them handle the error and count the returned bytes. Change all of
> them accordingly.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Thanks,
Qu
> ---
> fs/btrfs/delalloc-space.c | 2 +-
> fs/btrfs/file.c | 2 +-
> fs/btrfs/inode.c | 16 ++++++++--------
> fs/btrfs/ordered-data.c | 7 ++++---
> fs/btrfs/qgroup.c | 25 +++++++++++++++----------
> fs/btrfs/qgroup.h | 4 ++--
> 6 files changed, 31 insertions(+), 25 deletions(-)
>
> diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
> index 51453d4928fa..2833e8ef4c09 100644
> --- a/fs/btrfs/delalloc-space.c
> +++ b/fs/btrfs/delalloc-space.c
> @@ -199,7 +199,7 @@ void btrfs_free_reserved_data_space(struct btrfs_inode *inode,
> start = round_down(start, fs_info->sectorsize);
>
> btrfs_free_reserved_data_space_noquota(fs_info, len);
> - btrfs_qgroup_free_data(inode, reserved, start, len);
> + btrfs_qgroup_free_data(inode, reserved, start, len, NULL);
> }
>
> /*
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index e9c4b947a5aa..7a71720aaed2 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3192,7 +3192,7 @@ static long btrfs_fallocate(struct file *file, int mode,
> qgroup_reserved -= range->len;
> } else if (qgroup_reserved > 0) {
> btrfs_qgroup_free_data(BTRFS_I(inode), data_reserved,
> - range->start, range->len);
> + range->start, range->len, NULL);
> qgroup_reserved -= range->len;
> }
> list_del(&range->list);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index f8647d8271b7..e79a047aa5d1 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -697,7 +697,7 @@ static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 size,
> * And at reserve time, it's always aligned to page size, so
> * just free one page here.
> */
> - btrfs_qgroup_free_data(inode, NULL, 0, PAGE_SIZE);
> + btrfs_qgroup_free_data(inode, NULL, 0, PAGE_SIZE, NULL);
> btrfs_free_path(path);
> btrfs_end_transaction(trans);
> return ret;
> @@ -5141,7 +5141,7 @@ static void evict_inode_truncate_pages(struct inode *inode)
> */
> if (state_flags & EXTENT_DELALLOC)
> btrfs_qgroup_free_data(BTRFS_I(inode), NULL, start,
> - end - start + 1);
> + end - start + 1, NULL);
>
> clear_extent_bit(io_tree, start, end,
> EXTENT_CLEAR_ALL_BITS | EXTENT_DO_ACCOUNTING,
> @@ -8076,7 +8076,7 @@ static void btrfs_invalidate_folio(struct folio *folio, size_t offset,
> * reserved data space.
> * Since the IO will never happen for this page.
> */
> - btrfs_qgroup_free_data(inode, NULL, cur, range_end + 1 - cur);
> + btrfs_qgroup_free_data(inode, NULL, cur, range_end + 1 - cur, NULL);
> if (!inode_evicting) {
> clear_extent_bit(tree, cur, range_end, EXTENT_LOCKED |
> EXTENT_DELALLOC | EXTENT_UPTODATE |
> @@ -9513,7 +9513,7 @@ static struct btrfs_trans_handle *insert_prealloc_file_extent(
> struct btrfs_path *path;
> u64 start = ins->objectid;
> u64 len = ins->offset;
> - int qgroup_released;
> + u64 qgroup_released = 0;
> int ret;
>
> memset(&stack_fi, 0, sizeof(stack_fi));
> @@ -9526,9 +9526,9 @@ static struct btrfs_trans_handle *insert_prealloc_file_extent(
> btrfs_set_stack_file_extent_compression(&stack_fi, BTRFS_COMPRESS_NONE);
> /* Encryption and other encoding is reserved and all 0 */
>
> - qgroup_released = btrfs_qgroup_release_data(inode, file_offset, len);
> - if (qgroup_released < 0)
> - return ERR_PTR(qgroup_released);
> + ret = btrfs_qgroup_release_data(inode, file_offset, len, &qgroup_released);
> + if (ret < 0)
> + return ERR_PTR(ret);
>
> if (trans) {
> ret = insert_reserved_file_extent(trans, inode,
> @@ -10423,7 +10423,7 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
> btrfs_delalloc_release_metadata(inode, disk_num_bytes, ret < 0);
> out_qgroup_free_data:
> if (ret < 0)
> - btrfs_qgroup_free_data(inode, data_reserved, start, num_bytes);
> + btrfs_qgroup_free_data(inode, data_reserved, start, num_bytes, NULL);
> out_free_data_space:
> /*
> * If btrfs_reserve_extent() succeeded, then we already decremented
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 8d4ab5ecfa5d..c68fb78b7454 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -152,11 +152,12 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
> {
> struct btrfs_ordered_extent *entry;
> int ret;
> + u64 qgroup_rsv = 0;
>
> if (flags &
> ((1 << BTRFS_ORDERED_NOCOW) | (1 << BTRFS_ORDERED_PREALLOC))) {
> /* For nocow write, we can release the qgroup rsv right now */
> - ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes);
> + ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes, &qgroup_rsv);
> if (ret < 0)
> return ERR_PTR(ret);
> } else {
> @@ -164,7 +165,7 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
> * The ordered extent has reserved qgroup space, release now
> * and pass the reserved number for qgroup_record to free.
> */
> - ret = btrfs_qgroup_release_data(inode, file_offset, num_bytes);
> + ret = btrfs_qgroup_release_data(inode, file_offset, num_bytes, &qgroup_rsv);
> if (ret < 0)
> return ERR_PTR(ret);
> }
> @@ -182,7 +183,7 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
> entry->inode = igrab(&inode->vfs_inode);
> entry->compress_type = compress_type;
> entry->truncated_len = (u64)-1;
> - entry->qgroup_rsv = ret;
> + entry->qgroup_rsv = qgroup_rsv;
> entry->flags = flags;
> refcount_set(&entry->refs, 1);
> init_waitqueue_head(&entry->wait);
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index ce446d9d7f23..a953c16c7eb8 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -4057,13 +4057,14 @@ int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
>
> /* Free ranges specified by @reserved, normally in error path */
> static int qgroup_free_reserved_data(struct btrfs_inode *inode,
> - struct extent_changeset *reserved, u64 start, u64 len)
> + struct extent_changeset *reserved,
> + u64 start, u64 len, u64 *freed_ret)
> {
> struct btrfs_root *root = inode->root;
> struct ulist_node *unode;
> struct ulist_iterator uiter;
> struct extent_changeset changeset;
> - int freed = 0;
> + u64 freed = 0;
> int ret;
>
> extent_changeset_init(&changeset);
> @@ -4104,7 +4105,9 @@ static int qgroup_free_reserved_data(struct btrfs_inode *inode,
> }
> btrfs_qgroup_free_refroot(root->fs_info, root->root_key.objectid, freed,
> BTRFS_QGROUP_RSV_DATA);
> - ret = freed;
> + if (freed_ret)
> + *freed_ret = freed;
> + ret = 0;
> out:
> extent_changeset_release(&changeset);
> return ret;
> @@ -4112,7 +4115,7 @@ static int qgroup_free_reserved_data(struct btrfs_inode *inode,
>
> static int __btrfs_qgroup_release_data(struct btrfs_inode *inode,
> struct extent_changeset *reserved, u64 start, u64 len,
> - int free)
> + u64 *released, int free)
> {
> struct extent_changeset changeset;
> int trace_op = QGROUP_RELEASE;
> @@ -4128,7 +4131,7 @@ static int __btrfs_qgroup_release_data(struct btrfs_inode *inode,
> /* In release case, we shouldn't have @reserved */
> WARN_ON(!free && reserved);
> if (free && reserved)
> - return qgroup_free_reserved_data(inode, reserved, start, len);
> + return qgroup_free_reserved_data(inode, reserved, start, len, released);
> extent_changeset_init(&changeset);
> ret = clear_record_extent_bits(&inode->io_tree, start, start + len -1,
> EXTENT_QGROUP_RESERVED, &changeset);
> @@ -4143,7 +4146,8 @@ static int __btrfs_qgroup_release_data(struct btrfs_inode *inode,
> btrfs_qgroup_free_refroot(inode->root->fs_info,
> inode->root->root_key.objectid,
> changeset.bytes_changed, BTRFS_QGROUP_RSV_DATA);
> - ret = changeset.bytes_changed;
> + if (released)
> + *released = changeset.bytes_changed;
> out:
> extent_changeset_release(&changeset);
> return ret;
> @@ -4162,9 +4166,10 @@ static int __btrfs_qgroup_release_data(struct btrfs_inode *inode,
> * NOTE: This function may sleep for memory allocation.
> */
> int btrfs_qgroup_free_data(struct btrfs_inode *inode,
> - struct extent_changeset *reserved, u64 start, u64 len)
> + struct extent_changeset *reserved,
> + u64 start, u64 len, u64 *freed)
> {
> - return __btrfs_qgroup_release_data(inode, reserved, start, len, 1);
> + return __btrfs_qgroup_release_data(inode, reserved, start, len, freed, 1);
> }
>
> /*
> @@ -4182,9 +4187,9 @@ int btrfs_qgroup_free_data(struct btrfs_inode *inode,
> *
> * NOTE: This function may sleep for memory allocation.
> */
> -int btrfs_qgroup_release_data(struct btrfs_inode *inode, u64 start, u64 len)
> +int btrfs_qgroup_release_data(struct btrfs_inode *inode, u64 start, u64 len, u64 *released)
> {
> - return __btrfs_qgroup_release_data(inode, NULL, start, len, 0);
> + return __btrfs_qgroup_release_data(inode, NULL, start, len, released, 0);
> }
>
> static void add_root_meta_rsv(struct btrfs_root *root, int num_bytes,
> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
> index 855a4f978761..15b485506104 100644
> --- a/fs/btrfs/qgroup.h
> +++ b/fs/btrfs/qgroup.h
> @@ -358,10 +358,10 @@ int btrfs_verify_qgroup_counts(struct btrfs_fs_info *fs_info, u64 qgroupid,
> /* New io_tree based accurate qgroup reserve API */
> int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
> struct extent_changeset **reserved, u64 start, u64 len);
> -int btrfs_qgroup_release_data(struct btrfs_inode *inode, u64 start, u64 len);
> +int btrfs_qgroup_release_data(struct btrfs_inode *inode, u64 start, u64 len, u64 *released);
> int btrfs_qgroup_free_data(struct btrfs_inode *inode,
> struct extent_changeset *reserved, u64 start,
> - u64 len);
> + u64 len, u64 *freed);
> int btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
> enum btrfs_qgroup_rsv_type type, bool enforce);
> int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/5] btrfs: free qgroup pertrans rsv on trans abort
2023-12-01 21:00 ` [PATCH 3/5] btrfs: free qgroup pertrans rsv on trans abort Boris Burkov
@ 2023-12-04 21:08 ` Qu Wenruo
2023-12-05 14:27 ` David Sterba
1 sibling, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2023-12-04 21:08 UTC (permalink / raw)
To: Boris Burkov, linux-btrfs, kernel-team
On 2023/12/2 07:30, Boris Burkov wrote:
> If we abort a transaction, we never run the code that frees the pertrans
> qgroup reservation. This results in warnings on unmount as that
> reservation has been leaked. The leak isn't a huge issue since the fs is
> read-only, but it's better to clean it up when we know we can/should. Do
> it during the cleanup_transaction step of aborting.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Thanks,
Qu
> ---
> fs/btrfs/disk-io.c | 28 ++++++++++++++++++++++++++++
> fs/btrfs/qgroup.c | 5 +++--
> 2 files changed, 31 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 9317606017e2..a1f440cd6d45 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -4775,6 +4775,32 @@ void btrfs_cleanup_dirty_bgs(struct btrfs_transaction *cur_trans,
> }
> }
>
> +static void btrfs_free_all_qgroup_pertrans(struct btrfs_fs_info *fs_info)
> +{
> + struct btrfs_root *gang[8];
> + int i;
> + int ret;
> +
> + spin_lock(&fs_info->fs_roots_radix_lock);
> + while (1) {
> + ret = radix_tree_gang_lookup_tag(&fs_info->fs_roots_radix,
> + (void **)gang, 0,
> + ARRAY_SIZE(gang),
> + 0); // BTRFS_ROOT_TRANS_TAG
> + if (ret == 0)
> + break;
> + for (i = 0; i < ret; i++) {
> + struct btrfs_root *root = gang[i];
> +
> + btrfs_qgroup_free_meta_all_pertrans(root);
> + radix_tree_tag_clear(&fs_info->fs_roots_radix,
> + (unsigned long)root->root_key.objectid,
> + 0); // BTRFS_ROOT_TRANS_TAG
> + }
> + }
> + spin_unlock(&fs_info->fs_roots_radix_lock);
> +}
> +
> void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
> struct btrfs_fs_info *fs_info)
> {
> @@ -4803,6 +4829,8 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
> EXTENT_DIRTY);
> btrfs_destroy_pinned_extent(fs_info, &cur_trans->pinned_extents);
>
> + btrfs_free_all_qgroup_pertrans(fs_info);
> +
> cur_trans->state =TRANS_STATE_COMPLETED;
> wake_up(&cur_trans->commit_wait);
> }
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index a953c16c7eb8..daec90342dad 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -4337,8 +4337,9 @@ static void qgroup_convert_meta(struct btrfs_fs_info *fs_info, u64 ref_root,
>
> qgroup_rsv_release(fs_info, qgroup, num_bytes,
> BTRFS_QGROUP_RSV_META_PREALLOC);
> - qgroup_rsv_add(fs_info, qgroup, num_bytes,
> - BTRFS_QGROUP_RSV_META_PERTRANS);
> + if (!sb_rdonly(fs_info->sb))
> + qgroup_rsv_add(fs_info, qgroup, num_bytes,
> + BTRFS_QGROUP_RSV_META_PERTRANS);
>
> list_for_each_entry(glist, &qgroup->groups, next_group)
> qgroup_iterator_add(&qgroup_list, glist->group);
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 4/5] btrfs: dont clear qgroup rsv bit in release_folio
2023-12-01 21:00 ` [PATCH 4/5] btrfs: dont clear qgroup rsv bit in release_folio Boris Burkov
@ 2023-12-04 21:09 ` Qu Wenruo
0 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2023-12-04 21:09 UTC (permalink / raw)
To: Boris Burkov, linux-btrfs, kernel-team
On 2023/12/2 07:30, Boris Burkov wrote:
> The EXTENT_QGROUP_RESERVED bit is used to "lock" regions of the file for
> duplicate reservations. That is two writes to that range in one
> transaction shouldn't create two reservations, as the reservation will
> only be freed once when the write finally goes down. Therefore, it is
> never OK to clear that bit without freeing the associated qgroup rsv. At
> this point, we don't want to be freeing the rsv, so mask off the bit.
Did you have any dmesg output for this case? I believe this would lead
to a underflow.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Thanks,
Qu
> ---
> fs/btrfs/extent_io.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 0143bf63044c..87283087c669 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2310,7 +2310,8 @@ static int try_release_extent_state(struct extent_io_tree *tree,
> ret = 0;
> } else {
> u32 clear_bits = ~(EXTENT_LOCKED | EXTENT_NODATASUM |
> - EXTENT_DELALLOC_NEW | EXTENT_CTLBITS);
> + EXTENT_DELALLOC_NEW | EXTENT_CTLBITS
> + | EXTENT_QGROUP_RESERVED);
>
> /*
> * At this point we can safely clear everything except the
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/5] btrfs: free qgroup pertrans rsv on trans abort
2023-12-01 21:00 ` [PATCH 3/5] btrfs: free qgroup pertrans rsv on trans abort Boris Burkov
2023-12-04 21:08 ` Qu Wenruo
@ 2023-12-05 14:27 ` David Sterba
2023-12-05 19:45 ` Boris Burkov
1 sibling, 1 reply; 16+ messages in thread
From: David Sterba @ 2023-12-05 14:27 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs, kernel-team
On Fri, Dec 01, 2023 at 01:00:11PM -0800, Boris Burkov wrote:
> If we abort a transaction, we never run the code that frees the pertrans
> qgroup reservation. This results in warnings on unmount as that
> reservation has been leaked. The leak isn't a huge issue since the fs is
> read-only, but it's better to clean it up when we know we can/should. Do
> it during the cleanup_transaction step of aborting.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
> fs/btrfs/disk-io.c | 28 ++++++++++++++++++++++++++++
> fs/btrfs/qgroup.c | 5 +++--
> 2 files changed, 31 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 9317606017e2..a1f440cd6d45 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -4775,6 +4775,32 @@ void btrfs_cleanup_dirty_bgs(struct btrfs_transaction *cur_trans,
> }
> }
>
> +static void btrfs_free_all_qgroup_pertrans(struct btrfs_fs_info *fs_info)
> +{
> + struct btrfs_root *gang[8];
> + int i;
> + int ret;
> +
> + spin_lock(&fs_info->fs_roots_radix_lock);
> + while (1) {
> + ret = radix_tree_gang_lookup_tag(&fs_info->fs_roots_radix,
> + (void **)gang, 0,
> + ARRAY_SIZE(gang),
> + 0); // BTRFS_ROOT_TRANS_TAG
What does the comment mean?
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 0/5] btrfs: qgroups rsv fixes
2023-12-01 21:00 [PATCH 0/5] btrfs: qgroups rsv fixes Boris Burkov
` (4 preceding siblings ...)
2023-12-01 21:00 ` [PATCH 5/5] btrfs: ensure releasing squota rsv on head refs Boris Burkov
@ 2023-12-05 17:09 ` David Sterba
5 siblings, 0 replies; 16+ messages in thread
From: David Sterba @ 2023-12-05 17:09 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs, kernel-team
On Fri, Dec 01, 2023 at 01:00:08PM -0800, Boris Burkov wrote:
> This series contains a number of related but relatively orthogonal fixes
> for various bugs in qgroup/squota reservation accounting. Most of these
> manifest as either rsv underflow WARNINGs (in qgroup_rsv_release) or as
> WARNINGs at umount for unreleased space.
>
> With these fixes, I am able to get a fully clean '-g auto' fstests run
> on my setup and with -O squota in MKFS_OPTIONS.
>
> Boris Burkov (5):
> btrfs: free qgroup rsv on ioerr ordered_extent
> btrfs: fix qgroup_free_reserved_data int overflow
> btrfs: free qgroup pertrans rsv on trans abort
> btrfs: dont clear qgroup rsv bit in release_folio
> btrfs: ensure releasing squota rsv on head refs
I've added it to misc-next for testing, some of the patches seem to be
an rc pull material. There are some comments too.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/5] btrfs: free qgroup rsv on ioerr ordered_extent
2023-12-04 21:04 ` Qu Wenruo
@ 2023-12-05 19:42 ` Boris Burkov
2023-12-05 20:16 ` Qu Wenruo
0 siblings, 1 reply; 16+ messages in thread
From: Boris Burkov @ 2023-12-05 19:42 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs, kernel-team
On Tue, Dec 05, 2023 at 07:34:10AM +1030, Qu Wenruo wrote:
>
>
> On 2023/12/2 07:30, Boris Burkov wrote:
> > An ordered extent completing is a critical moment in qgroup rsv
> > handling, as the ownership of the reservation is handed off from the
> > ordered extent to the delayed ref. In the happy path we release (unlock)
> > but do not free (decrement counter) the reservation, and the delayed ref
> > drives the free. However, on an error, we don't create a delayed ref,
> > since there is no ref to add. Therefore, free on the error path.
>
> And I believe this would cause btrfs to be noisy at the unmount time,
> due to unreleased qgroup rsv.
>
> Have you hit any one during your tests? If so, mind to add some dmesg
> output for it?
>
> Or if no hit so far, would it be possible to add a new test case for it?
I hit the conditions for all of these fixes running xfstests with
MKFS_OPTIONS including -O squota or -O quota. IIRC the failures were
almost all in the umount path in btrfs_check_quota_leak, though some of
the issues manifested as reservation freeing underflows in
qgroup_rsv_release. generic/475 triggered most of the bugs but a handful
of other tests hit some others. Unfortunately, I did not take perfect
notes on which test was fixed by which patch. I can try to recover that
information by removing the patches and running the full suite while
iteratively adding them back in. That is obviously fairly time consuming,
so I would only do it if people really want that information in the commit
messages or something.
Thanks for the review!
Boris
>
> >
> > Signed-off-by: Boris Burkov <boris@bur.io>
>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
>
> Thanks,
> Qu
>
> > ---
> > fs/btrfs/ordered-data.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> > index 574e8a55e24a..8d4ab5ecfa5d 100644
> > --- a/fs/btrfs/ordered-data.c
> > +++ b/fs/btrfs/ordered-data.c
> > @@ -599,7 +599,8 @@ void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
> > release = entry->disk_num_bytes;
> > else
> > release = entry->num_bytes;
> > - btrfs_delalloc_release_metadata(btrfs_inode, release, false);
> > + btrfs_delalloc_release_metadata(btrfs_inode, release,
> > + test_bit(BTRFS_ORDERED_IOERR, &entry->flags));
> > }
> >
> > percpu_counter_add_batch(&fs_info->ordered_bytes, -entry->num_bytes,
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/5] btrfs: free qgroup pertrans rsv on trans abort
2023-12-05 14:27 ` David Sterba
@ 2023-12-05 19:45 ` Boris Burkov
2023-12-05 22:39 ` David Sterba
0 siblings, 1 reply; 16+ messages in thread
From: Boris Burkov @ 2023-12-05 19:45 UTC (permalink / raw)
To: David Sterba; +Cc: linux-btrfs, kernel-team
On Tue, Dec 05, 2023 at 03:27:32PM +0100, David Sterba wrote:
> On Fri, Dec 01, 2023 at 01:00:11PM -0800, Boris Burkov wrote:
> > If we abort a transaction, we never run the code that frees the pertrans
> > qgroup reservation. This results in warnings on unmount as that
> > reservation has been leaked. The leak isn't a huge issue since the fs is
> > read-only, but it's better to clean it up when we know we can/should. Do
> > it during the cleanup_transaction step of aborting.
> >
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> > fs/btrfs/disk-io.c | 28 ++++++++++++++++++++++++++++
> > fs/btrfs/qgroup.c | 5 +++--
> > 2 files changed, 31 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index 9317606017e2..a1f440cd6d45 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -4775,6 +4775,32 @@ void btrfs_cleanup_dirty_bgs(struct btrfs_transaction *cur_trans,
> > }
> > }
> >
> > +static void btrfs_free_all_qgroup_pertrans(struct btrfs_fs_info *fs_info)
> > +{
> > + struct btrfs_root *gang[8];
> > + int i;
> > + int ret;
> > +
> > + spin_lock(&fs_info->fs_roots_radix_lock);
> > + while (1) {
> > + ret = radix_tree_gang_lookup_tag(&fs_info->fs_roots_radix,
> > + (void **)gang, 0,
> > + ARRAY_SIZE(gang),
> > + 0); // BTRFS_ROOT_TRANS_TAG
>
> What does the comment mean?
Oops, I forgot about this. BTRFS_ROOT_TRANS_TAG is a #define in
transaction.c, so it wasn't visible here in disk-io.c. I should move it
to some header they both include. Based on the other stuff in there,
btrfs/fs.h looks reasonable?
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/5] btrfs: free qgroup rsv on ioerr ordered_extent
2023-12-05 19:42 ` Boris Burkov
@ 2023-12-05 20:16 ` Qu Wenruo
0 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2023-12-05 20:16 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs, kernel-team
On 2023/12/6 06:12, Boris Burkov wrote:
> On Tue, Dec 05, 2023 at 07:34:10AM +1030, Qu Wenruo wrote:
>>
>>
>> On 2023/12/2 07:30, Boris Burkov wrote:
>>> An ordered extent completing is a critical moment in qgroup rsv
>>> handling, as the ownership of the reservation is handed off from the
>>> ordered extent to the delayed ref. In the happy path we release (unlock)
>>> but do not free (decrement counter) the reservation, and the delayed ref
>>> drives the free. However, on an error, we don't create a delayed ref,
>>> since there is no ref to add. Therefore, free on the error path.
>>
>> And I believe this would cause btrfs to be noisy at the unmount time,
>> due to unreleased qgroup rsv.
>>
>> Have you hit any one during your tests? If so, mind to add some dmesg
>> output for it?
>>
>> Or if no hit so far, would it be possible to add a new test case for it?
>
> I hit the conditions for all of these fixes running xfstests with
> MKFS_OPTIONS including -O squota or -O quota. IIRC the failures were
> almost all in the umount path in btrfs_check_quota_leak, though some of
> the issues manifested as reservation freeing underflows in
> qgroup_rsv_release. generic/475 triggered most of the bugs but a handful
> of other tests hit some others. Unfortunately, I did not take perfect
> notes on which test was fixed by which patch. I can try to recover that
> information by removing the patches and running the full suite while
> iteratively adding them back in. That is obviously fairly time consuming,
> so I would only do it if people really want that information in the commit
> messages or something.
Oh no worries, in that case I'm totally fine with the current commit
message.
Although a CC to stable would be nicer for patches 1~4.
Thanks,
Qu
>
> Thanks for the review!
> Boris
>
>>
>>>
>>> Signed-off-by: Boris Burkov <boris@bur.io>
>>
>> Reviewed-by: Qu Wenruo <wqu@suse.com>
>>
>> Thanks,
>> Qu
>>
>>> ---
>>> fs/btrfs/ordered-data.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
>>> index 574e8a55e24a..8d4ab5ecfa5d 100644
>>> --- a/fs/btrfs/ordered-data.c
>>> +++ b/fs/btrfs/ordered-data.c
>>> @@ -599,7 +599,8 @@ void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
>>> release = entry->disk_num_bytes;
>>> else
>>> release = entry->num_bytes;
>>> - btrfs_delalloc_release_metadata(btrfs_inode, release, false);
>>> + btrfs_delalloc_release_metadata(btrfs_inode, release,
>>> + test_bit(BTRFS_ORDERED_IOERR, &entry->flags));
>>> }
>>>
>>> percpu_counter_add_batch(&fs_info->ordered_bytes, -entry->num_bytes,
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/5] btrfs: free qgroup pertrans rsv on trans abort
2023-12-05 19:45 ` Boris Burkov
@ 2023-12-05 22:39 ` David Sterba
0 siblings, 0 replies; 16+ messages in thread
From: David Sterba @ 2023-12-05 22:39 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs, kernel-team
On Tue, Dec 05, 2023 at 11:45:52AM -0800, Boris Burkov wrote:
> On Tue, Dec 05, 2023 at 03:27:32PM +0100, David Sterba wrote:
> > On Fri, Dec 01, 2023 at 01:00:11PM -0800, Boris Burkov wrote:
> > > If we abort a transaction, we never run the code that frees the pertrans
> > > qgroup reservation. This results in warnings on unmount as that
> > > reservation has been leaked. The leak isn't a huge issue since the fs is
> > > read-only, but it's better to clean it up when we know we can/should. Do
> > > it during the cleanup_transaction step of aborting.
> > >
> > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > ---
> > > fs/btrfs/disk-io.c | 28 ++++++++++++++++++++++++++++
> > > fs/btrfs/qgroup.c | 5 +++--
> > > 2 files changed, 31 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > > index 9317606017e2..a1f440cd6d45 100644
> > > --- a/fs/btrfs/disk-io.c
> > > +++ b/fs/btrfs/disk-io.c
> > > @@ -4775,6 +4775,32 @@ void btrfs_cleanup_dirty_bgs(struct btrfs_transaction *cur_trans,
> > > }
> > > }
> > >
> > > +static void btrfs_free_all_qgroup_pertrans(struct btrfs_fs_info *fs_info)
> > > +{
> > > + struct btrfs_root *gang[8];
> > > + int i;
> > > + int ret;
> > > +
> > > + spin_lock(&fs_info->fs_roots_radix_lock);
> > > + while (1) {
> > > + ret = radix_tree_gang_lookup_tag(&fs_info->fs_roots_radix,
> > > + (void **)gang, 0,
> > > + ARRAY_SIZE(gang),
> > > + 0); // BTRFS_ROOT_TRANS_TAG
> >
> > What does the comment mean?
>
> Oops, I forgot about this. BTRFS_ROOT_TRANS_TAG is a #define in
> transaction.c, so it wasn't visible here in disk-io.c. I should move it
> to some header they both include. Based on the other stuff in there,
> btrfs/fs.h looks reasonable?
Maybe transaction.h
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2023-12-05 22:46 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-12-01 21:00 [PATCH 0/5] btrfs: qgroups rsv fixes Boris Burkov
2023-12-01 21:00 ` [PATCH 1/5] btrfs: free qgroup rsv on ioerr ordered_extent Boris Burkov
2023-12-04 21:04 ` Qu Wenruo
2023-12-05 19:42 ` Boris Burkov
2023-12-05 20:16 ` Qu Wenruo
2023-12-01 21:00 ` [PATCH 2/5] btrfs: fix qgroup_free_reserved_data int overflow Boris Burkov
2023-12-04 21:07 ` Qu Wenruo
2023-12-01 21:00 ` [PATCH 3/5] btrfs: free qgroup pertrans rsv on trans abort Boris Burkov
2023-12-04 21:08 ` Qu Wenruo
2023-12-05 14:27 ` David Sterba
2023-12-05 19:45 ` Boris Burkov
2023-12-05 22:39 ` David Sterba
2023-12-01 21:00 ` [PATCH 4/5] btrfs: dont clear qgroup rsv bit in release_folio Boris Burkov
2023-12-04 21:09 ` Qu Wenruo
2023-12-01 21:00 ` [PATCH 5/5] btrfs: ensure releasing squota rsv on head refs Boris Burkov
2023-12-05 17:09 ` [PATCH 0/5] btrfs: qgroups rsv fixes David Sterba
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox