* [PATCH 0/3] btrfs: some fixes around automatic block group reclaim
@ 2025-02-25 10:57 fdmanana
2025-02-25 10:57 ` [PATCH 1/3] btrfs: get zone unusable bytes while holding lock at btrfs_reclaim_bgs_work() fdmanana
` (4 more replies)
0 siblings, 5 replies; 9+ messages in thread
From: fdmanana @ 2025-02-25 10:57 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
Fix a couple races that should be mostly harmless but trigger KCSAN warnings
and fix the accounting of reclaimed bytes. More details in the change logs.
Filipe Manana (3):
btrfs: get zone unusable bytes while holding lock at btrfs_reclaim_bgs_work()
btrfs: get used bytes while holding lock at btrfs_reclaim_bgs_work()
btrfs: fix reclaimed bytes accounting after automatic block group reclaim
fs/btrfs/block-group.c | 55 ++++++++++++++++++++++++++++++++----------
1 file changed, 42 insertions(+), 13 deletions(-)
--
2.45.2
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 1/3] btrfs: get zone unusable bytes while holding lock at btrfs_reclaim_bgs_work()
2025-02-25 10:57 [PATCH 0/3] btrfs: some fixes around automatic block group reclaim fdmanana
@ 2025-02-25 10:57 ` fdmanana
2025-02-25 13:51 ` Johannes Thumshirn
2025-02-25 10:57 ` [PATCH 2/3] btrfs: get used " fdmanana
` (3 subsequent siblings)
4 siblings, 1 reply; 9+ messages in thread
From: fdmanana @ 2025-02-25 10:57 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
At btrfs_reclaim_bgs_work(), we are grabbing a block group's zone unusable
bytes while not under the protection of the block group's spinlock, so
this can trigger race reports from KCSAN (or similar tools) since that
field is typically updated while holding the lock, such as at
__btrfs_add_free_space_zoned() for example.
Fix this by grabbing the zone unusable bytes while we are still in the
critical section holding the block group's spinlock, which is right above
where we are currently grabbing it.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/block-group.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 18f58674a16c..2e174c14ca0a 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1887,6 +1887,17 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
up_write(&space_info->groups_sem);
goto next;
}
+
+ /*
+ * Cache the zone_unusable value before turning the block group
+ * to read only. As soon as the blog group is read only it's
+ * zone_unusable value gets moved to the block group's read-only
+ * bytes and isn't available for calculations anymore. We also
+ * cache it before unlocking the block group, to prevent races
+ * (reports from KCSAN and such tools) with tasks updating it.
+ */
+ zone_unusable = bg->zone_unusable;
+
spin_unlock(&bg->lock);
spin_unlock(&space_info->lock);
@@ -1903,13 +1914,6 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
goto next;
}
- /*
- * Cache the zone_unusable value before turning the block group
- * to read only. As soon as the blog group is read only it's
- * zone_unusable value gets moved to the block group's read-only
- * bytes and isn't available for calculations anymore.
- */
- zone_unusable = bg->zone_unusable;
ret = inc_block_group_ro(bg, 0);
up_write(&space_info->groups_sem);
if (ret < 0)
--
2.45.2
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 2/3] btrfs: get used bytes while holding lock at btrfs_reclaim_bgs_work()
2025-02-25 10:57 [PATCH 0/3] btrfs: some fixes around automatic block group reclaim fdmanana
2025-02-25 10:57 ` [PATCH 1/3] btrfs: get zone unusable bytes while holding lock at btrfs_reclaim_bgs_work() fdmanana
@ 2025-02-25 10:57 ` fdmanana
2025-02-25 10:57 ` [PATCH 3/3] btrfs: fix reclaimed bytes accounting after automatic block group reclaim fdmanana
` (2 subsequent siblings)
4 siblings, 0 replies; 9+ messages in thread
From: fdmanana @ 2025-02-25 10:57 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
At btrfs_reclaim_bgs_work(), we are grabbing twice the used bytes counter
of the block group while not holding the block group's spinlock. This can
result in races, reported by KCSAN and similar tools, since a concurrent
task can be updating that counter while at btrfs_update_block_group().
So avoid these races by grabbing the counter in the critical section right
above that is delimited by the block group's spinlock. This also avoids
using two different values of the counter in case it changes in between
each read. This silences KCSAN and is required for the next patch in the
series too.
Fixes: 243192b67649 ("btrfs: report reclaim stats in sysfs")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/block-group.c | 21 ++++++++++++++++-----
1 file changed, 16 insertions(+), 5 deletions(-)
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 2e174c14ca0a..1200c5efeb3d 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1823,7 +1823,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
list_sort(NULL, &fs_info->reclaim_bgs, reclaim_bgs_cmp);
while (!list_empty(&fs_info->reclaim_bgs)) {
u64 zone_unusable;
- u64 reclaimed;
+ u64 used;
int ret = 0;
bg = list_first_entry(&fs_info->reclaim_bgs,
@@ -1919,19 +1919,30 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
if (ret < 0)
goto next;
+ /*
+ * Grab the used bytes counter while holding the block groups's
+ * spinlock to prevent races with tasks concurrently updating it
+ * due to extent allocation and deallocation (running
+ * btrfs_update_block_group()) - we have set the block group to
+ * RO but that only prevents extent reservation, allocation
+ * happens after reservation.
+ */
+ spin_lock(&bg->lock);
+ used = bg->used;
+ spin_unlock(&bg->lock);
+
btrfs_info(fs_info,
"reclaiming chunk %llu with %llu%% used %llu%% unusable",
bg->start,
- div64_u64(bg->used * 100, bg->length),
+ div64_u64(used * 100, bg->length),
div64_u64(zone_unusable * 100, bg->length));
trace_btrfs_reclaim_block_group(bg);
- reclaimed = bg->used;
ret = btrfs_relocate_chunk(fs_info, bg->start);
if (ret) {
btrfs_dec_block_group_ro(bg);
btrfs_err(fs_info, "error relocating chunk %llu",
bg->start);
- reclaimed = 0;
+ used = 0;
spin_lock(&space_info->lock);
space_info->reclaim_errors++;
if (READ_ONCE(space_info->periodic_reclaim))
@@ -1940,7 +1951,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
}
spin_lock(&space_info->lock);
space_info->reclaim_count++;
- space_info->reclaim_bytes += reclaimed;
+ space_info->reclaim_bytes += used;
spin_unlock(&space_info->lock);
next:
--
2.45.2
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 3/3] btrfs: fix reclaimed bytes accounting after automatic block group reclaim
2025-02-25 10:57 [PATCH 0/3] btrfs: some fixes around automatic block group reclaim fdmanana
2025-02-25 10:57 ` [PATCH 1/3] btrfs: get zone unusable bytes while holding lock at btrfs_reclaim_bgs_work() fdmanana
2025-02-25 10:57 ` [PATCH 2/3] btrfs: get used " fdmanana
@ 2025-02-25 10:57 ` fdmanana
2025-02-25 11:37 ` [PATCH 0/3] btrfs: some fixes around " Daniel Vacek
2025-02-26 15:17 ` David Sterba
4 siblings, 0 replies; 9+ messages in thread
From: fdmanana @ 2025-02-25 10:57 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
We are considering the used bytes counter of a block group as the amount
to update the space info's reclaim bytes counter after relocating the
block group, but this value alone is often not enough. This is because we
may have a reserved extent (or more) and in that case its size is
reflected in the reserved counter of the block group - the size of the
extent is only transferred from the reserved counter to the used counter
of the block group when the delayed ref for the extent is run - typically
when committing the transaction (or when flushing delayed refs due to
ENOSPC on space reservation). Such call chain for data extents is:
btrfs_run_delayed_refs_for_head()
run_one_delayed_ref()
run_delayed_data_ref()
alloc_reserved_file_extent()
alloc_reserved_extent()
btrfs_update_block_group()
-> transfers the extent size from the reserved
counter to the used counter
For metadata extents:
btrfs_run_delayed_refs_for_head()
run_one_delayed_ref()
run_delayed_tree_ref()
alloc_reserved_tree_block()
alloc_reserved_extent()
btrfs_update_block_group()
-> transfers the extent size from the reserved
counter to the used counter
Since relocation flushes delalloc, waits for ordered extent completion
and commits the current transaction before doing the actual relocation
work, the correct amount of reclaimed space is therefore the sum of the
"used" and "reserved" counters of the block group before we call
btrfs_relocate_chunk() at btrfs_reclaim_bgs_work().
So fix this by taking the "reserved" counter into consideration.
Fixes: 243192b67649 ("btrfs: report reclaim stats in sysfs")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/block-group.c | 28 +++++++++++++++++++++-------
1 file changed, 21 insertions(+), 7 deletions(-)
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 1200c5efeb3d..c0c247ecbe9a 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1824,6 +1824,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
while (!list_empty(&fs_info->reclaim_bgs)) {
u64 zone_unusable;
u64 used;
+ u64 reserved;
int ret = 0;
bg = list_first_entry(&fs_info->reclaim_bgs,
@@ -1920,21 +1921,32 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
goto next;
/*
- * Grab the used bytes counter while holding the block groups's
- * spinlock to prevent races with tasks concurrently updating it
- * due to extent allocation and deallocation (running
- * btrfs_update_block_group()) - we have set the block group to
- * RO but that only prevents extent reservation, allocation
- * happens after reservation.
+ * The amount of bytes reclaimed corresponds to the sum of the
+ * "used" and "reserved" counters. We have set the block group
+ * to RO above, which prevents reservations from happening but
+ * we may have existing reservations for which allocation has
+ * not yet been done - btrfs_update_block_group() was not yet
+ * called, which is where we will transfer a reserved extent's
+ * size from the "reserved" counter to the "used" counter - this
+ * happens when running delayed references. When we relocate the
+ * chunk below, relocation first flushes dellaloc, waits for
+ * ordered extent completion (which is where we create delayed
+ * references for data extents) and commits the current
+ * transaction (which runs delayed references), and only after
+ * it does the actual work to move extents out of the block
+ * group. So the reported amount of reclaimed bytes is
+ * effectively the sum of the 'used' and 'reserved' counters.
*/
spin_lock(&bg->lock);
used = bg->used;
+ reserved = bg->reserved;
spin_unlock(&bg->lock);
btrfs_info(fs_info,
- "reclaiming chunk %llu with %llu%% used %llu%% unusable",
+ "reclaiming chunk %llu with %llu%% used %llu%% reserved %llu%% unusable",
bg->start,
div64_u64(used * 100, bg->length),
+ div64_u64(reserved * 100, bg->length),
div64_u64(zone_unusable * 100, bg->length));
trace_btrfs_reclaim_block_group(bg);
ret = btrfs_relocate_chunk(fs_info, bg->start);
@@ -1943,6 +1955,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
btrfs_err(fs_info, "error relocating chunk %llu",
bg->start);
used = 0;
+ reserved = 0;
spin_lock(&space_info->lock);
space_info->reclaim_errors++;
if (READ_ONCE(space_info->periodic_reclaim))
@@ -1952,6 +1965,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
spin_lock(&space_info->lock);
space_info->reclaim_count++;
space_info->reclaim_bytes += used;
+ space_info->reclaim_bytes += reserved;
spin_unlock(&space_info->lock);
next:
--
2.45.2
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 0/3] btrfs: some fixes around automatic block group reclaim
2025-02-25 10:57 [PATCH 0/3] btrfs: some fixes around automatic block group reclaim fdmanana
` (2 preceding siblings ...)
2025-02-25 10:57 ` [PATCH 3/3] btrfs: fix reclaimed bytes accounting after automatic block group reclaim fdmanana
@ 2025-02-25 11:37 ` Daniel Vacek
2025-02-25 11:39 ` Filipe Manana
2025-02-26 15:17 ` David Sterba
4 siblings, 1 reply; 9+ messages in thread
From: Daniel Vacek @ 2025-02-25 11:37 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs
On Tue, 25 Feb 2025 at 11:58, <fdmanana@kernel.org> wrote:
>
> From: Filipe Manana <fdmanana@suse.com>
>
> Fix a couple races that should be mostly harmless but trigger KCSAN warnings
> and fix the accounting of reclaimed bytes. More details in the change logs.
>
> Filipe Manana (3):
> btrfs: get zone unusable bytes while holding lock at btrfs_reclaim_bgs_work()
> btrfs: get used bytes while holding lock at btrfs_reclaim_bgs_work()
> btrfs: fix reclaimed bytes accounting after automatic block group reclaim
I'd say 2 and 3 would better be folded into one patch. The split is a
bit confusing.
>
> fs/btrfs/block-group.c | 55 ++++++++++++++++++++++++++++++++----------
> 1 file changed, 42 insertions(+), 13 deletions(-)
>
> --
> 2.45.2
>
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 0/3] btrfs: some fixes around automatic block group reclaim
2025-02-25 11:37 ` [PATCH 0/3] btrfs: some fixes around " Daniel Vacek
@ 2025-02-25 11:39 ` Filipe Manana
2025-02-25 11:48 ` Daniel Vacek
0 siblings, 1 reply; 9+ messages in thread
From: Filipe Manana @ 2025-02-25 11:39 UTC (permalink / raw)
To: Daniel Vacek; +Cc: linux-btrfs
On Tue, Feb 25, 2025 at 11:38 AM Daniel Vacek <neelx@suse.com> wrote:
>
> On Tue, 25 Feb 2025 at 11:58, <fdmanana@kernel.org> wrote:
> >
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Fix a couple races that should be mostly harmless but trigger KCSAN warnings
> > and fix the accounting of reclaimed bytes. More details in the change logs.
> >
> > Filipe Manana (3):
> > btrfs: get zone unusable bytes while holding lock at btrfs_reclaim_bgs_work()
> > btrfs: get used bytes while holding lock at btrfs_reclaim_bgs_work()
> > btrfs: fix reclaimed bytes accounting after automatic block group reclaim
>
> I'd say 2 and 3 would better be folded into one patch. The split is a
> bit confusing.
I find it less confusing, because it's 2 different problems being
solved and easier to review and reason about.
>
> >
> > fs/btrfs/block-group.c | 55 ++++++++++++++++++++++++++++++++----------
> > 1 file changed, 42 insertions(+), 13 deletions(-)
> >
> > --
> > 2.45.2
> >
> >
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 0/3] btrfs: some fixes around automatic block group reclaim
2025-02-25 11:39 ` Filipe Manana
@ 2025-02-25 11:48 ` Daniel Vacek
0 siblings, 0 replies; 9+ messages in thread
From: Daniel Vacek @ 2025-02-25 11:48 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs
On Tue, 25 Feb 2025 at 12:40, Filipe Manana <fdmanana@kernel.org> wrote:
>
> On Tue, Feb 25, 2025 at 11:38 AM Daniel Vacek <neelx@suse.com> wrote:
> >
> > On Tue, 25 Feb 2025 at 11:58, <fdmanana@kernel.org> wrote:
> > >
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > Fix a couple races that should be mostly harmless but trigger KCSAN warnings
> > > and fix the accounting of reclaimed bytes. More details in the change logs.
> > >
> > > Filipe Manana (3):
> > > btrfs: get zone unusable bytes while holding lock at btrfs_reclaim_bgs_work()
> > > btrfs: get used bytes while holding lock at btrfs_reclaim_bgs_work()
> > > btrfs: fix reclaimed bytes accounting after automatic block group reclaim
> >
> > I'd say 2 and 3 would better be folded into one patch. The split is a
> > bit confusing.
>
> I find it less confusing, because it's 2 different problems being
> solved and easier to review and reason about.
OK
> >
> > >
> > > fs/btrfs/block-group.c | 55 ++++++++++++++++++++++++++++++++----------
> > > 1 file changed, 42 insertions(+), 13 deletions(-)
> > >
> > > --
> > > 2.45.2
> > >
> > >
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/3] btrfs: get zone unusable bytes while holding lock at btrfs_reclaim_bgs_work()
2025-02-25 10:57 ` [PATCH 1/3] btrfs: get zone unusable bytes while holding lock at btrfs_reclaim_bgs_work() fdmanana
@ 2025-02-25 13:51 ` Johannes Thumshirn
0 siblings, 0 replies; 9+ messages in thread
From: Johannes Thumshirn @ 2025-02-25 13:51 UTC (permalink / raw)
To: fdmanana@kernel.org, linux-btrfs@vger.kernel.org
Oh shoot, my bad. Thanks for fixing
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 0/3] btrfs: some fixes around automatic block group reclaim
2025-02-25 10:57 [PATCH 0/3] btrfs: some fixes around automatic block group reclaim fdmanana
` (3 preceding siblings ...)
2025-02-25 11:37 ` [PATCH 0/3] btrfs: some fixes around " Daniel Vacek
@ 2025-02-26 15:17 ` David Sterba
4 siblings, 0 replies; 9+ messages in thread
From: David Sterba @ 2025-02-26 15:17 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs
On Tue, Feb 25, 2025 at 10:57:06AM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Fix a couple races that should be mostly harmless but trigger KCSAN warnings
> and fix the accounting of reclaimed bytes. More details in the change logs.
>
> Filipe Manana (3):
> btrfs: get zone unusable bytes while holding lock at btrfs_reclaim_bgs_work()
> btrfs: get used bytes while holding lock at btrfs_reclaim_bgs_work()
> btrfs: fix reclaimed bytes accounting after automatic block group reclaim
Reviewed-by: David Sterba <dsterba@suse.com>
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2025-02-26 15:17 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-25 10:57 [PATCH 0/3] btrfs: some fixes around automatic block group reclaim fdmanana
2025-02-25 10:57 ` [PATCH 1/3] btrfs: get zone unusable bytes while holding lock at btrfs_reclaim_bgs_work() fdmanana
2025-02-25 13:51 ` Johannes Thumshirn
2025-02-25 10:57 ` [PATCH 2/3] btrfs: get used " fdmanana
2025-02-25 10:57 ` [PATCH 3/3] btrfs: fix reclaimed bytes accounting after automatic block group reclaim fdmanana
2025-02-25 11:37 ` [PATCH 0/3] btrfs: some fixes around " Daniel Vacek
2025-02-25 11:39 ` Filipe Manana
2025-02-25 11:48 ` Daniel Vacek
2025-02-26 15:17 ` David Sterba
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.