* [PATCH v2] btrfs: fix racy bitfield write in btrfs_clear_space_info_full()
@ 2025-10-02 18:50 Boris Burkov
2025-10-02 22:16 ` Qu Wenruo
0 siblings, 1 reply; 5+ messages in thread
From: Boris Burkov @ 2025-10-02 18:50 UTC (permalink / raw)
To: linux-btrfs, kernel-team
From the memory-barriers.txt document regarding memory barrier ordering
guarantees:
(*) These guarantees do not apply to bitfields, because compilers often
generate code to modify these using non-atomic read-modify-write
sequences. Do not attempt to use bitfields to synchronize parallel
algorithms.
(*) Even in cases where bitfields are protected by locks, all fields
in a given bitfield must be protected by one lock. If two fields
in a given bitfield are protected by different locks, the compiler's
non-atomic read-modify-write sequences can cause an update to one
field to corrupt the value of an adjacent field.
btrfs_space_info has a bitfield sharing an underlying word consisting of
the fields full, chunk_alloc, and flush:
struct btrfs_space_info {
struct btrfs_fs_info * fs_info; /* 0 8 */
struct btrfs_space_info * parent; /* 8 8 */
...
int clamp; /* 172 4 */
unsigned int full:1; /* 176: 0 4 */
unsigned int chunk_alloc:1; /* 176: 1 4 */
unsigned int flush:1; /* 176: 2 4 */
...
Therefore, to be safe from parallel read-modify-writes losing a write to one of the bitfield members protected by a lock, all writes to all the
bitfields must use the lock. They almost universally do, except for
btrfs_clear_space_info_full() which iterates over the space_infos and
writes out found->full = 0 without a lock.
Imagine that we have one thread completing a transaction in which we
finished deleting a block_group and are thus calling
btrfs_clear_space_info_full() while simultaneously the data reclaim
ticket infrastructure is running do_async_reclaim_data_space():
T1 T2
btrfs_commit_transaction
btrfs_clear_space_info_full
data_sinfo->full = 0
READ: full:0, chunk_alloc:0, flush:1
do_async_reclaim_data_space(data_sinfo)
spin_lock(&space_info->lock);
if(list_empty(tickets))
space_info->flush = 0;
READ: full: 0, chunk_alloc:0, flush:1
MOD/WRITE: full: 0, chunk_alloc:0, flush:0
spin_unlock(&space_info->lock);
return;
MOD/WRITE: full:0, chunk_alloc:0, flush:1
and now data_sinfo->flush is 1 but the reclaim worker has exited. This
breaks the invariant that flush is 0 iff there is no work queued or
running. Once this invariant is violated, future allocations that go
into __reserve_bytes() will add tickets to space_info->tickets but will
see space_info->flush is set to 1 and not queue the work. After this,
they will block forever on the resulting ticket, as it is now impossible
to kick the worker again.
I also confirmed by looking at the assembly of the affected kernel that
it is doing RMW operations. For example, to set the flush (3rd) bit to 0,
the assembly is:
andb $0xfb,0x60(%rbx)
and similarly for setting the full (1st) bit to 0:
andb $0xfe,-0x20(%rax)
So I think this is really a bug on practical systems. I have observed
a number of systems in this exact state, but am currently unable to
reproduce it.
Rather than leaving this footgun lying around for the future, take
advantage of the fact that there is room in the struct anyway, and that
it is already quite large and simply change the three bitfield members to
bools. This avoids writes to space_info->full having any effect on
writes to space_info->flush, regardless of locking.
Fixes: 957780eb2788 ("Btrfs: introduce ticketed enospc infrastructure")
Signed-off-by: Boris Burkov <boris@bur.io>
---
Changelog:
v2:
- migrate the three bitfield members to bools to step around the whole
atomic RMW issue in the most straightforward way.
---
fs/btrfs/block-group.c | 6 +++---
fs/btrfs/space-info.c | 22 +++++++++++-----------
fs/btrfs/space-info.h | 6 +++---
3 files changed, 17 insertions(+), 17 deletions(-)
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 4330f5ba02dd..cd51f50a7c8b 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -4215,7 +4215,7 @@ int btrfs_chunk_alloc(struct btrfs_trans_handle *trans,
mutex_unlock(&fs_info->chunk_mutex);
} else {
/* Proceed with allocation */
- space_info->chunk_alloc = 1;
+ space_info->chunk_alloc = true;
wait_for_alloc = false;
spin_unlock(&space_info->lock);
}
@@ -4264,7 +4264,7 @@ int btrfs_chunk_alloc(struct btrfs_trans_handle *trans,
spin_lock(&space_info->lock);
if (ret < 0) {
if (ret == -ENOSPC)
- space_info->full = 1;
+ space_info->full = true;
else
goto out;
} else {
@@ -4274,7 +4274,7 @@ int btrfs_chunk_alloc(struct btrfs_trans_handle *trans,
space_info->force_alloc = CHUNK_ALLOC_NO_FORCE;
out:
- space_info->chunk_alloc = 0;
+ space_info->chunk_alloc = false;
spin_unlock(&space_info->lock);
mutex_unlock(&fs_info->chunk_mutex);
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 0e5c0c80e0fe..04a07d6f8537 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -192,7 +192,7 @@ void btrfs_clear_space_info_full(struct btrfs_fs_info *info)
struct btrfs_space_info *found;
list_for_each_entry(found, head, list)
- found->full = 0;
+ found->full = false;
}
/*
@@ -372,7 +372,7 @@ void btrfs_add_bg_to_space_info(struct btrfs_fs_info *info,
space_info->bytes_readonly += block_group->bytes_super;
btrfs_space_info_update_bytes_zone_unusable(space_info, block_group->zone_unusable);
if (block_group->length > 0)
- space_info->full = 0;
+ space_info->full = false;
btrfs_try_granting_tickets(info, space_info);
spin_unlock(&space_info->lock);
@@ -1146,7 +1146,7 @@ static void do_async_reclaim_metadata_space(struct btrfs_space_info *space_info)
spin_lock(&space_info->lock);
to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info);
if (!to_reclaim) {
- space_info->flush = 0;
+ space_info->flush = false;
spin_unlock(&space_info->lock);
return;
}
@@ -1158,7 +1158,7 @@ static void do_async_reclaim_metadata_space(struct btrfs_space_info *space_info)
flush_space(fs_info, space_info, to_reclaim, flush_state, false);
spin_lock(&space_info->lock);
if (list_empty(&space_info->tickets)) {
- space_info->flush = 0;
+ space_info->flush = false;
spin_unlock(&space_info->lock);
return;
}
@@ -1201,7 +1201,7 @@ static void do_async_reclaim_metadata_space(struct btrfs_space_info *space_info)
flush_state = FLUSH_DELAYED_ITEMS_NR;
commit_cycles--;
} else {
- space_info->flush = 0;
+ space_info->flush = false;
}
} else {
flush_state = FLUSH_DELAYED_ITEMS_NR;
@@ -1383,7 +1383,7 @@ static void do_async_reclaim_data_space(struct btrfs_space_info *space_info)
spin_lock(&space_info->lock);
if (list_empty(&space_info->tickets)) {
- space_info->flush = 0;
+ space_info->flush = false;
spin_unlock(&space_info->lock);
return;
}
@@ -1394,7 +1394,7 @@ static void do_async_reclaim_data_space(struct btrfs_space_info *space_info)
flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE, false);
spin_lock(&space_info->lock);
if (list_empty(&space_info->tickets)) {
- space_info->flush = 0;
+ space_info->flush = false;
spin_unlock(&space_info->lock);
return;
}
@@ -1411,7 +1411,7 @@ static void do_async_reclaim_data_space(struct btrfs_space_info *space_info)
data_flush_states[flush_state], false);
spin_lock(&space_info->lock);
if (list_empty(&space_info->tickets)) {
- space_info->flush = 0;
+ space_info->flush = false;
spin_unlock(&space_info->lock);
return;
}
@@ -1428,7 +1428,7 @@ static void do_async_reclaim_data_space(struct btrfs_space_info *space_info)
if (maybe_fail_all_tickets(fs_info, space_info))
flush_state = 0;
else
- space_info->flush = 0;
+ space_info->flush = false;
} else {
flush_state = 0;
}
@@ -1444,7 +1444,7 @@ static void do_async_reclaim_data_space(struct btrfs_space_info *space_info)
aborted_fs:
maybe_fail_all_tickets(fs_info, space_info);
- space_info->flush = 0;
+ space_info->flush = false;
spin_unlock(&space_info->lock);
}
@@ -1825,7 +1825,7 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
*/
maybe_clamp_preempt(fs_info, space_info);
- space_info->flush = 1;
+ space_info->flush = true;
trace_btrfs_trigger_flush(fs_info,
space_info->flags,
orig_bytes, flush,
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index 679f22efb407..a846f63585c9 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -142,11 +142,11 @@ struct btrfs_space_info {
flushing. The value is >> clamp, so turns
out to be a 2^clamp divisor. */
- unsigned int full:1; /* indicates that we cannot allocate any more
+ bool full; /* indicates that we cannot allocate any more
chunks for this space */
- unsigned int chunk_alloc:1; /* set if we are allocating a chunk */
+ bool chunk_alloc; /* set if we are allocating a chunk */
- unsigned int flush:1; /* set if we are trying to make space */
+ bool flush; /* set if we are trying to make space */
unsigned int force_alloc; /* set if we need to force a chunk
alloc for this space */
--
2.50.1
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH v2] btrfs: fix racy bitfield write in btrfs_clear_space_info_full()
2025-10-02 18:50 [PATCH v2] btrfs: fix racy bitfield write in btrfs_clear_space_info_full() Boris Burkov
@ 2025-10-02 22:16 ` Qu Wenruo
2025-11-17 9:48 ` Sam James
0 siblings, 1 reply; 5+ messages in thread
From: Qu Wenruo @ 2025-10-02 22:16 UTC (permalink / raw)
To: Boris Burkov, linux-btrfs, kernel-team
在 2025/10/3 04:20, Boris Burkov 写道:
> From the memory-barriers.txt document regarding memory barrier ordering
> guarantees:
>
> (*) These guarantees do not apply to bitfields, because compilers often
> generate code to modify these using non-atomic read-modify-write
> sequences. Do not attempt to use bitfields to synchronize parallel
> algorithms.
>
> (*) Even in cases where bitfields are protected by locks, all fields
> in a given bitfield must be protected by one lock. If two fields
> in a given bitfield are protected by different locks, the compiler's
> non-atomic read-modify-write sequences can cause an update to one
> field to corrupt the value of an adjacent field.
>
> btrfs_space_info has a bitfield sharing an underlying word consisting of
> the fields full, chunk_alloc, and flush:
>
> struct btrfs_space_info {
> struct btrfs_fs_info * fs_info; /* 0 8 */
> struct btrfs_space_info * parent; /* 8 8 */
> ...
> int clamp; /* 172 4 */
> unsigned int full:1; /* 176: 0 4 */
> unsigned int chunk_alloc:1; /* 176: 1 4 */
> unsigned int flush:1; /* 176: 2 4 */
> ...
>
> Therefore, to be safe from parallel read-modify-writes losing a write to one of the bitfield members protected by a lock, all writes to all the
> bitfields must use the lock. They almost universally do, except for
> btrfs_clear_space_info_full() which iterates over the space_infos and
> writes out found->full = 0 without a lock.
>
> Imagine that we have one thread completing a transaction in which we
> finished deleting a block_group and are thus calling
> btrfs_clear_space_info_full() while simultaneously the data reclaim
> ticket infrastructure is running do_async_reclaim_data_space():
>
> T1 T2
> btrfs_commit_transaction
> btrfs_clear_space_info_full
> data_sinfo->full = 0
> READ: full:0, chunk_alloc:0, flush:1
> do_async_reclaim_data_space(data_sinfo)
> spin_lock(&space_info->lock);
> if(list_empty(tickets))
> space_info->flush = 0;
> READ: full: 0, chunk_alloc:0, flush:1
> MOD/WRITE: full: 0, chunk_alloc:0, flush:0
> spin_unlock(&space_info->lock);
> return;
> MOD/WRITE: full:0, chunk_alloc:0, flush:1
>
> and now data_sinfo->flush is 1 but the reclaim worker has exited. This
> breaks the invariant that flush is 0 iff there is no work queued or
> running. Once this invariant is violated, future allocations that go
> into __reserve_bytes() will add tickets to space_info->tickets but will
> see space_info->flush is set to 1 and not queue the work. After this,
> they will block forever on the resulting ticket, as it is now impossible
> to kick the worker again.
>
> I also confirmed by looking at the assembly of the affected kernel that
> it is doing RMW operations. For example, to set the flush (3rd) bit to 0,
> the assembly is:
> andb $0xfb,0x60(%rbx)
> and similarly for setting the full (1st) bit to 0:
> andb $0xfe,-0x20(%rax)
>
> So I think this is really a bug on practical systems. I have observed
> a number of systems in this exact state, but am currently unable to
> reproduce it.
>
> Rather than leaving this footgun lying around for the future, take
> advantage of the fact that there is room in the struct anyway, and that
> it is already quite large and simply change the three bitfield members to
> bools. This avoids writes to space_info->full having any effect on
> writes to space_info->flush, regardless of locking.
>
> Fixes: 957780eb2788 ("Btrfs: introduce ticketed enospc infrastructure")
> Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Indeed it's much more straightfoward than asymmetric locking, and still
less changes than a full bitmap implementation.
Thanks,
Qu
> ---
> Changelog:
> v2:
> - migrate the three bitfield members to bools to step around the whole
> atomic RMW issue in the most straightforward way.
>
> ---
> fs/btrfs/block-group.c | 6 +++---
> fs/btrfs/space-info.c | 22 +++++++++++-----------
> fs/btrfs/space-info.h | 6 +++---
> 3 files changed, 17 insertions(+), 17 deletions(-)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 4330f5ba02dd..cd51f50a7c8b 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -4215,7 +4215,7 @@ int btrfs_chunk_alloc(struct btrfs_trans_handle *trans,
> mutex_unlock(&fs_info->chunk_mutex);
> } else {
> /* Proceed with allocation */
> - space_info->chunk_alloc = 1;
> + space_info->chunk_alloc = true;
> wait_for_alloc = false;
> spin_unlock(&space_info->lock);
> }
> @@ -4264,7 +4264,7 @@ int btrfs_chunk_alloc(struct btrfs_trans_handle *trans,
> spin_lock(&space_info->lock);
> if (ret < 0) {
> if (ret == -ENOSPC)
> - space_info->full = 1;
> + space_info->full = true;
> else
> goto out;
> } else {
> @@ -4274,7 +4274,7 @@ int btrfs_chunk_alloc(struct btrfs_trans_handle *trans,
>
> space_info->force_alloc = CHUNK_ALLOC_NO_FORCE;
> out:
> - space_info->chunk_alloc = 0;
> + space_info->chunk_alloc = false;
> spin_unlock(&space_info->lock);
> mutex_unlock(&fs_info->chunk_mutex);
>
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 0e5c0c80e0fe..04a07d6f8537 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -192,7 +192,7 @@ void btrfs_clear_space_info_full(struct btrfs_fs_info *info)
> struct btrfs_space_info *found;
>
> list_for_each_entry(found, head, list)
> - found->full = 0;
> + found->full = false;
> }
>
> /*
> @@ -372,7 +372,7 @@ void btrfs_add_bg_to_space_info(struct btrfs_fs_info *info,
> space_info->bytes_readonly += block_group->bytes_super;
> btrfs_space_info_update_bytes_zone_unusable(space_info, block_group->zone_unusable);
> if (block_group->length > 0)
> - space_info->full = 0;
> + space_info->full = false;
> btrfs_try_granting_tickets(info, space_info);
> spin_unlock(&space_info->lock);
>
> @@ -1146,7 +1146,7 @@ static void do_async_reclaim_metadata_space(struct btrfs_space_info *space_info)
> spin_lock(&space_info->lock);
> to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info);
> if (!to_reclaim) {
> - space_info->flush = 0;
> + space_info->flush = false;
> spin_unlock(&space_info->lock);
> return;
> }
> @@ -1158,7 +1158,7 @@ static void do_async_reclaim_metadata_space(struct btrfs_space_info *space_info)
> flush_space(fs_info, space_info, to_reclaim, flush_state, false);
> spin_lock(&space_info->lock);
> if (list_empty(&space_info->tickets)) {
> - space_info->flush = 0;
> + space_info->flush = false;
> spin_unlock(&space_info->lock);
> return;
> }
> @@ -1201,7 +1201,7 @@ static void do_async_reclaim_metadata_space(struct btrfs_space_info *space_info)
> flush_state = FLUSH_DELAYED_ITEMS_NR;
> commit_cycles--;
> } else {
> - space_info->flush = 0;
> + space_info->flush = false;
> }
> } else {
> flush_state = FLUSH_DELAYED_ITEMS_NR;
> @@ -1383,7 +1383,7 @@ static void do_async_reclaim_data_space(struct btrfs_space_info *space_info)
>
> spin_lock(&space_info->lock);
> if (list_empty(&space_info->tickets)) {
> - space_info->flush = 0;
> + space_info->flush = false;
> spin_unlock(&space_info->lock);
> return;
> }
> @@ -1394,7 +1394,7 @@ static void do_async_reclaim_data_space(struct btrfs_space_info *space_info)
> flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE, false);
> spin_lock(&space_info->lock);
> if (list_empty(&space_info->tickets)) {
> - space_info->flush = 0;
> + space_info->flush = false;
> spin_unlock(&space_info->lock);
> return;
> }
> @@ -1411,7 +1411,7 @@ static void do_async_reclaim_data_space(struct btrfs_space_info *space_info)
> data_flush_states[flush_state], false);
> spin_lock(&space_info->lock);
> if (list_empty(&space_info->tickets)) {
> - space_info->flush = 0;
> + space_info->flush = false;
> spin_unlock(&space_info->lock);
> return;
> }
> @@ -1428,7 +1428,7 @@ static void do_async_reclaim_data_space(struct btrfs_space_info *space_info)
> if (maybe_fail_all_tickets(fs_info, space_info))
> flush_state = 0;
> else
> - space_info->flush = 0;
> + space_info->flush = false;
> } else {
> flush_state = 0;
> }
> @@ -1444,7 +1444,7 @@ static void do_async_reclaim_data_space(struct btrfs_space_info *space_info)
>
> aborted_fs:
> maybe_fail_all_tickets(fs_info, space_info);
> - space_info->flush = 0;
> + space_info->flush = false;
> spin_unlock(&space_info->lock);
> }
>
> @@ -1825,7 +1825,7 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
> */
> maybe_clamp_preempt(fs_info, space_info);
>
> - space_info->flush = 1;
> + space_info->flush = true;
> trace_btrfs_trigger_flush(fs_info,
> space_info->flags,
> orig_bytes, flush,
> diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
> index 679f22efb407..a846f63585c9 100644
> --- a/fs/btrfs/space-info.h
> +++ b/fs/btrfs/space-info.h
> @@ -142,11 +142,11 @@ struct btrfs_space_info {
> flushing. The value is >> clamp, so turns
> out to be a 2^clamp divisor. */
>
> - unsigned int full:1; /* indicates that we cannot allocate any more
> + bool full; /* indicates that we cannot allocate any more
> chunks for this space */
> - unsigned int chunk_alloc:1; /* set if we are allocating a chunk */
> + bool chunk_alloc; /* set if we are allocating a chunk */
>
> - unsigned int flush:1; /* set if we are trying to make space */
> + bool flush; /* set if we are trying to make space */
>
> unsigned int force_alloc; /* set if we need to force a chunk
> alloc for this space */
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v2] btrfs: fix racy bitfield write in btrfs_clear_space_info_full()
2025-10-02 22:16 ` Qu Wenruo
@ 2025-11-17 9:48 ` Sam James
2025-11-17 9:52 ` Qu Wenruo
0 siblings, 1 reply; 5+ messages in thread
From: Sam James @ 2025-11-17 9:48 UTC (permalink / raw)
To: wqu; +Cc: boris, kernel-team, linux-btrfs
Has this one been taken into a tree yet? We've been "backporting" it for
a while now.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v2] btrfs: fix racy bitfield write in btrfs_clear_space_info_full()
2025-11-17 9:48 ` Sam James
@ 2025-11-17 9:52 ` Qu Wenruo
2025-11-17 10:07 ` Sam James
0 siblings, 1 reply; 5+ messages in thread
From: Qu Wenruo @ 2025-11-17 9:52 UTC (permalink / raw)
To: Sam James; +Cc: boris, kernel-team, linux-btrfs
在 2025/11/17 20:18, Sam James 写道:
> Has this one been taken into a tree yet? We've been "backporting" it for
> a while now.
It's already in our for-next tree, which will be pushed upstream in
v6.19 release.
Thanks,
Qu
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v2] btrfs: fix racy bitfield write in btrfs_clear_space_info_full()
2025-11-17 9:52 ` Qu Wenruo
@ 2025-11-17 10:07 ` Sam James
0 siblings, 0 replies; 5+ messages in thread
From: Sam James @ 2025-11-17 10:07 UTC (permalink / raw)
To: Qu Wenruo; +Cc: boris, kernel-team, linux-btrfs
Qu Wenruo <wqu@suse.com> writes:
> 在 2025/11/17 20:18, Sam James 写道:
>> Has this one been taken into a tree yet? We've been "backporting" it for
>> a while now.
>
> It's already in our for-next tree, which will be pushed upstream in
> v6.19 release.
Ah, sorry for missing it. Thanks!
>
> Thanks,
> Qu
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2025-11-17 10:07 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-02 18:50 [PATCH v2] btrfs: fix racy bitfield write in btrfs_clear_space_info_full() Boris Burkov
2025-10-02 22:16 ` Qu Wenruo
2025-11-17 9:48 ` Sam James
2025-11-17 9:52 ` Qu Wenruo
2025-11-17 10:07 ` Sam James
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox