* [PATCH] memory/ramblock: Fix clear of mru_block on possible race condition
@ 2026-06-01 20:59 Peter Xu
2026-06-02 4:25 ` Akihiko Odaki
0 siblings, 1 reply; 7+ messages in thread
From: Peter Xu @ 2026-06-01 20:59 UTC (permalink / raw)
To: qemu-devel
Cc: peterx, Philippe Mathieu-Daudé, Fabiano Rosas, Paolo Bonzini,
Alex Bennée, Akihiko Odaki
The race condition was not reported in any real bug, but I found it while
reviewing some relevant chnages from Akihiko [1]. It's not clear if
there's any way to reproduce it, so far it's only theoretical. Hence
there's also no Fixes tag attached. There's also no need to copy stable
until we have a solid reproducer.
Currently, mru_block might still points to ramblocks that are removed if
below race condition happens:
Reader A Writer
-------- ------
rcu_read_lock()
walk list, find block X
QLIST_REMOVE_RCU(X)
qatomic_set(&mru_block, NULL)
call_rcu(X, reclaim_ramblock)
qatomic_set(&mru_block, X) <----------- overwrites NULL
rcu_read_unlock()
grace period ends, X freed
Reader C
--------
rcu_read_lock()
qatomic_rcu_read(&mru_block) -> X
Read X's block->offset ... <----------- UAF
To fix it, we can introduce a nested RCU free for reset of the mru_block
field, and only free the ramblock until the 2nd RCU call.
Since QEMU always: (1) dequeue the RCU node before invoking the
function, (2) reset all fields in rcu_head in the call_rcu1() call, nested
RCU will work all fine like what's used in this patch.
[1] https://lore.kernel.org/r/5fd9e540-cf13-45f5-ba9a-c7faf364035b@rsg.ci.i.u-tokyo.ac.jp
Cc: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
Based-on: <5fd9e540-cf13-45f5-ba9a-c7faf364035b@rsg.ci.i.u-tokyo.ac.jp>
Since there's no reproducer, I didn't verify the problem as-is. What I did
is I ran this through some high concurrency "make check -jN" (8 processes
with 4 cores pinned, hence 8*N make-check threads running concurrently on 4
physical cores), I didn't see any regression.
---
system/physmem.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/system/physmem.c b/system/physmem.c
index 9e1ac13e82..1697568fa7 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2590,6 +2590,21 @@ static void reclaim_ramblock(RAMBlock *block)
g_free(block);
}
+static void reclaim_ramblock_prepare(RAMBlock *block)
+{
+ /*
+ * 1st round rcu free guarantees the last reader that can access this
+ * ramblock is gone. Reset this field making sure it will never point
+ * to the ramblock being freed.
+ */
+ qatomic_set(&ram_list.mru_block, NULL);
+ /*
+ * Note: this is an intended nested use of rcu_head. If needed, we can
+ * provide two rcu_heads for ramblock.
+ */
+ call_rcu(block, reclaim_ramblock, rcu);
+}
+
void qemu_ram_free(RAMBlock *block)
{
g_autofree char *name = NULL;
@@ -2607,10 +2622,11 @@ void qemu_ram_free(RAMBlock *block)
name = cpr_name(block->mr);
cpr_delete_fd(name, 0);
QLIST_REMOVE_RCU(block, next);
+ /* First attempt of reset, see reclaim_ramblock_prepare() for the 2nd */
qatomic_set(&ram_list.mru_block, NULL);
/* Write list before version */
qatomic_store_release(&ram_list.version, ram_list.version + 1);
- call_rcu(block, reclaim_ramblock, rcu);
+ call_rcu(block, reclaim_ramblock_prepare, rcu);
qemu_mutex_unlock_ramlist();
}
--
2.53.0
^ permalink raw reply related [flat|nested] 7+ messages in thread* Re: [PATCH] memory/ramblock: Fix clear of mru_block on possible race condition 2026-06-01 20:59 [PATCH] memory/ramblock: Fix clear of mru_block on possible race condition Peter Xu @ 2026-06-02 4:25 ` Akihiko Odaki 2026-06-02 19:51 ` Peter Xu 0 siblings, 1 reply; 7+ messages in thread From: Akihiko Odaki @ 2026-06-02 4:25 UTC (permalink / raw) To: Peter Xu, qemu-devel Cc: Philippe Mathieu-Daudé, Fabiano Rosas, Paolo Bonzini, Alex Bennée On 2026/06/02 5:59, Peter Xu wrote: > The race condition was not reported in any real bug, but I found it while > reviewing some relevant chnages from Akihiko [1]. It's not clear if > there's any way to reproduce it, so far it's only theoretical. Hence > there's also no Fixes tag attached. There's also no need to copy stable > until we have a solid reproducer. > > Currently, mru_block might still points to ramblocks that are removed if > below race condition happens: > > Reader A Writer > -------- ------ > rcu_read_lock() > walk list, find block X > QLIST_REMOVE_RCU(X) > qatomic_set(&mru_block, NULL) > call_rcu(X, reclaim_ramblock) > qatomic_set(&mru_block, X) <----------- overwrites NULL > rcu_read_unlock() > grace period ends, X freed > > Reader C > -------- > rcu_read_lock() > qatomic_rcu_read(&mru_block) -> X > Read X's block->offset ... <----------- UAF > > To fix it, we can introduce a nested RCU free for reset of the mru_block > field, and only free the ramblock until the 2nd RCU call. > > Since QEMU always: (1) dequeue the RCU node before invoking the > function, (2) reset all fields in rcu_head in the call_rcu1() call, nested > RCU will work all fine like what's used in this patch. > > [1] https://lore.kernel.org/r/5fd9e540-cf13-45f5-ba9a-c7faf364035b@rsg.ci.i.u-tokyo.ac.jp > > Cc: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > > Based-on: <5fd9e540-cf13-45f5-ba9a-c7faf364035b@rsg.ci.i.u-tokyo.ac.jp> > > Since there's no reproducer, I didn't verify the problem as-is. What I did > is I ran this through some high concurrency "make check -jN" (8 processes > with 4 cores pinned, hence 8*N make-check threads running concurrently on 4 > physical cores), I didn't see any regression. > --- > system/physmem.c | 18 +++++++++++++++++- > 1 file changed, 17 insertions(+), 1 deletion(-) > > diff --git a/system/physmem.c b/system/physmem.c > index 9e1ac13e82..1697568fa7 100644 > --- a/system/physmem.c > +++ b/system/physmem.c > @@ -2590,6 +2590,21 @@ static void reclaim_ramblock(RAMBlock *block) > g_free(block); > } > > +static void reclaim_ramblock_prepare(RAMBlock *block) > +{ > + /* > + * 1st round rcu free guarantees the last reader that can access this > + * ramblock is gone. Reset this field making sure it will never point > + * to the ramblock being freed. Somewhat nitpicking, but I would call this the first grace period rather than “rcu free”. call_rcu() itself does not free anything; it schedules a callback to run after a grace period, which is what matters here. > + */ > + qatomic_set(&ram_list.mru_block, NULL); > + /* > + * Note: this is an intended nested use of rcu_head. If needed, we can > + * provide two rcu_heads for ramblock. > + */ > + call_rcu(block, reclaim_ramblock, rcu); > +} > + > void qemu_ram_free(RAMBlock *block) > { > g_autofree char *name = NULL; > @@ -2607,10 +2622,11 @@ void qemu_ram_free(RAMBlock *block) > name = cpr_name(block->mr); > cpr_delete_fd(name, 0); > QLIST_REMOVE_RCU(block, next); > + /* First attempt of reset, see reclaim_ramblock_prepare() for the 2nd */ > qatomic_set(&ram_list.mru_block, NULL); I think it would be better to remove this for simplicity. mru_block is cleared later anyway, so this does not affect correctness. The only practical effect is that qemu_get_ram_block() may be slower during the first grace period. Conceptually, it is simpler to describe the two grace periods as serving two distinct purposes, rather than as two attempts to clear mru_block. With the early clear removed, readers do not first need to understand that the first grace period leaves the first clear ineffective; they can instead focus on how it makes the later clear effective. Regars, Akihiko Odaki > /* Write list before version */ > qatomic_store_release(&ram_list.version, ram_list.version + 1); > - call_rcu(block, reclaim_ramblock, rcu); > + call_rcu(block, reclaim_ramblock_prepare, rcu); > qemu_mutex_unlock_ramlist(); > } > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] memory/ramblock: Fix clear of mru_block on possible race condition 2026-06-02 4:25 ` Akihiko Odaki @ 2026-06-02 19:51 ` Peter Xu 2026-06-03 5:11 ` Akihiko Odaki 0 siblings, 1 reply; 7+ messages in thread From: Peter Xu @ 2026-06-02 19:51 UTC (permalink / raw) To: Akihiko Odaki Cc: qemu-devel, Philippe Mathieu-Daudé, Fabiano Rosas, Paolo Bonzini, Alex Bennée On Tue, Jun 02, 2026 at 01:25:02PM +0900, Akihiko Odaki wrote: > On 2026/06/02 5:59, Peter Xu wrote: > > The race condition was not reported in any real bug, but I found it while > > reviewing some relevant chnages from Akihiko [1]. It's not clear if > > there's any way to reproduce it, so far it's only theoretical. Hence > > there's also no Fixes tag attached. There's also no need to copy stable > > until we have a solid reproducer. > > > > Currently, mru_block might still points to ramblocks that are removed if > > below race condition happens: > > > > Reader A Writer > > -------- ------ > > rcu_read_lock() > > walk list, find block X > > QLIST_REMOVE_RCU(X) > > qatomic_set(&mru_block, NULL) > > call_rcu(X, reclaim_ramblock) > > qatomic_set(&mru_block, X) <----------- overwrites NULL > > rcu_read_unlock() > > grace period ends, X freed > > > > Reader C > > -------- > > rcu_read_lock() > > qatomic_rcu_read(&mru_block) -> X > > Read X's block->offset ... <----------- UAF > > > > To fix it, we can introduce a nested RCU free for reset of the mru_block > > field, and only free the ramblock until the 2nd RCU call. > > > > Since QEMU always: (1) dequeue the RCU node before invoking the > > function, (2) reset all fields in rcu_head in the call_rcu1() call, nested > > RCU will work all fine like what's used in this patch. > > > > [1] https://lore.kernel.org/r/5fd9e540-cf13-45f5-ba9a-c7faf364035b@rsg.ci.i.u-tokyo.ac.jp > > > > Cc: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> > > Signed-off-by: Peter Xu <peterx@redhat.com> > > --- > > > > Based-on: <5fd9e540-cf13-45f5-ba9a-c7faf364035b@rsg.ci.i.u-tokyo.ac.jp> > > > > Since there's no reproducer, I didn't verify the problem as-is. What I did > > is I ran this through some high concurrency "make check -jN" (8 processes > > with 4 cores pinned, hence 8*N make-check threads running concurrently on 4 > > physical cores), I didn't see any regression. > > --- > > system/physmem.c | 18 +++++++++++++++++- > > 1 file changed, 17 insertions(+), 1 deletion(-) > > > > diff --git a/system/physmem.c b/system/physmem.c > > index 9e1ac13e82..1697568fa7 100644 > > --- a/system/physmem.c > > +++ b/system/physmem.c > > @@ -2590,6 +2590,21 @@ static void reclaim_ramblock(RAMBlock *block) > > g_free(block); > > } > > +static void reclaim_ramblock_prepare(RAMBlock *block) > > +{ > > + /* > > + * 1st round rcu free guarantees the last reader that can access this > > + * ramblock is gone. Reset this field making sure it will never point [1] > > + * to the ramblock being freed. > > Somewhat nitpicking, but I would call this the first grace period rather > than “rcu free”. call_rcu() itself does not free anything; it schedules a > callback to run after a grace period, which is what matters here. Sure, I'll fix the wording. > > > + */ > > + qatomic_set(&ram_list.mru_block, NULL); > > + /* > > + * Note: this is an intended nested use of rcu_head. If needed, we can > > + * provide two rcu_heads for ramblock. > > + */ > > + call_rcu(block, reclaim_ramblock, rcu); > > +} > > + > > void qemu_ram_free(RAMBlock *block) > > { > > g_autofree char *name = NULL; > > @@ -2607,10 +2622,11 @@ void qemu_ram_free(RAMBlock *block) > > name = cpr_name(block->mr); > > cpr_delete_fd(name, 0); > > QLIST_REMOVE_RCU(block, next); > > + /* First attempt of reset, see reclaim_ramblock_prepare() for the 2nd */ > > qatomic_set(&ram_list.mru_block, NULL); > > I think it would be better to remove this for simplicity. > > mru_block is cleared later anyway, so this does not affect correctness. The > only practical effect is that qemu_get_ram_block() may be slower during the > first grace period. > > Conceptually, it is simpler to describe the two grace periods as serving two > distinct purposes, rather than as two attempts to clear mru_block. With the > early clear removed, readers do not first need to understand that the first > grace period leaves the first clear ineffective; they can instead focus on > how it makes the later clear effective. It was my intention to keep this, because then it's easier to justify "nobody has access to the ramblock being removed". If with this line removed here, it means after the first grace period (say, reaching reclaim_ramblock_prepare()), RCU readers can still access the ramblock, essentially because before reclaim_ramblock_prepare() clearing mru_block even if the block is not visible on ram_list it's still visible when a cache hit happens. But I agree removal of this line shouldn't involve correctness issues, because qemu_get_ram_block() when having a mru_block cache hit, will always return the block directly: [2] block = qatomic_rcu_read(&ram_list.mru_block); if (block && addr - block->offset < block->max_length) { return block; } And it will never reach the "found:" label later to update mru_block again. But that's obscure. So it's a niche use of it, but keeping this line to reset makes this justification slightly easier (corresponds to the comment at [1] above). I can also remove this line, but then I'll need to enrich comment explaining case [2] above. We'll also need to be careful on not changing that behavior in qemu_get_ram_block(). Thanks, > > Regars, > Akihiko Odaki > > > /* Write list before version */ > > qatomic_store_release(&ram_list.version, ram_list.version + 1); > > - call_rcu(block, reclaim_ramblock, rcu); > > + call_rcu(block, reclaim_ramblock_prepare, rcu); > > qemu_mutex_unlock_ramlist(); > > } > -- Peter Xu ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] memory/ramblock: Fix clear of mru_block on possible race condition 2026-06-02 19:51 ` Peter Xu @ 2026-06-03 5:11 ` Akihiko Odaki 2026-06-03 17:35 ` Peter Xu 0 siblings, 1 reply; 7+ messages in thread From: Akihiko Odaki @ 2026-06-03 5:11 UTC (permalink / raw) To: Peter Xu Cc: qemu-devel, Philippe Mathieu-Daudé, Fabiano Rosas, Paolo Bonzini, Alex Bennée On 2026/06/03 4:51, Peter Xu wrote: > On Tue, Jun 02, 2026 at 01:25:02PM +0900, Akihiko Odaki wrote: >> On 2026/06/02 5:59, Peter Xu wrote: >>> The race condition was not reported in any real bug, but I found it while >>> reviewing some relevant chnages from Akihiko [1]. It's not clear if >>> there's any way to reproduce it, so far it's only theoretical. Hence >>> there's also no Fixes tag attached. There's also no need to copy stable >>> until we have a solid reproducer. >>> >>> Currently, mru_block might still points to ramblocks that are removed if >>> below race condition happens: >>> >>> Reader A Writer >>> -------- ------ >>> rcu_read_lock() >>> walk list, find block X >>> QLIST_REMOVE_RCU(X) >>> qatomic_set(&mru_block, NULL) >>> call_rcu(X, reclaim_ramblock) >>> qatomic_set(&mru_block, X) <----------- overwrites NULL >>> rcu_read_unlock() >>> grace period ends, X freed >>> >>> Reader C >>> -------- >>> rcu_read_lock() >>> qatomic_rcu_read(&mru_block) -> X >>> Read X's block->offset ... <----------- UAF >>> >>> To fix it, we can introduce a nested RCU free for reset of the mru_block >>> field, and only free the ramblock until the 2nd RCU call. >>> >>> Since QEMU always: (1) dequeue the RCU node before invoking the >>> function, (2) reset all fields in rcu_head in the call_rcu1() call, nested >>> RCU will work all fine like what's used in this patch. >>> >>> [1] https://lore.kernel.org/r/5fd9e540-cf13-45f5-ba9a-c7faf364035b@rsg.ci.i.u-tokyo.ac.jp >>> >>> Cc: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> >>> Signed-off-by: Peter Xu <peterx@redhat.com> >>> --- >>> >>> Based-on: <5fd9e540-cf13-45f5-ba9a-c7faf364035b@rsg.ci.i.u-tokyo.ac.jp> >>> >>> Since there's no reproducer, I didn't verify the problem as-is. What I did >>> is I ran this through some high concurrency "make check -jN" (8 processes >>> with 4 cores pinned, hence 8*N make-check threads running concurrently on 4 >>> physical cores), I didn't see any regression. >>> --- >>> system/physmem.c | 18 +++++++++++++++++- >>> 1 file changed, 17 insertions(+), 1 deletion(-) >>> >>> diff --git a/system/physmem.c b/system/physmem.c >>> index 9e1ac13e82..1697568fa7 100644 >>> --- a/system/physmem.c >>> +++ b/system/physmem.c >>> @@ -2590,6 +2590,21 @@ static void reclaim_ramblock(RAMBlock *block) >>> g_free(block); >>> } >>> +static void reclaim_ramblock_prepare(RAMBlock *block) >>> +{ >>> + /* >>> + * 1st round rcu free guarantees the last reader that can access this >>> + * ramblock is gone. Reset this field making sure it will never point > > [1] > >>> + * to the ramblock being freed. >> >> Somewhat nitpicking, but I would call this the first grace period rather >> than “rcu free”. call_rcu() itself does not free anything; it schedules a >> callback to run after a grace period, which is what matters here. > > Sure, I'll fix the wording. > >> >>> + */ >>> + qatomic_set(&ram_list.mru_block, NULL); >>> + /* >>> + * Note: this is an intended nested use of rcu_head. If needed, we can >>> + * provide two rcu_heads for ramblock. >>> + */ >>> + call_rcu(block, reclaim_ramblock, rcu); >>> +} >>> + >>> void qemu_ram_free(RAMBlock *block) >>> { >>> g_autofree char *name = NULL; >>> @@ -2607,10 +2622,11 @@ void qemu_ram_free(RAMBlock *block) >>> name = cpr_name(block->mr); >>> cpr_delete_fd(name, 0); >>> QLIST_REMOVE_RCU(block, next); >>> + /* First attempt of reset, see reclaim_ramblock_prepare() for the 2nd */ >>> qatomic_set(&ram_list.mru_block, NULL); >> >> I think it would be better to remove this for simplicity. >> >> mru_block is cleared later anyway, so this does not affect correctness. The >> only practical effect is that qemu_get_ram_block() may be slower during the >> first grace period. >> >> Conceptually, it is simpler to describe the two grace periods as serving two >> distinct purposes, rather than as two attempts to clear mru_block. With the >> early clear removed, readers do not first need to understand that the first >> grace period leaves the first clear ineffective; they can instead focus on >> how it makes the later clear effective. > > It was my intention to keep this, because then it's easier to justify > "nobody has access to the ramblock being removed". > > If with this line removed here, it means after the first grace period (say, > reaching reclaim_ramblock_prepare()), RCU readers can still access the > ramblock, essentially because before reclaim_ramblock_prepare() clearing > mru_block even if the block is not visible on ram_list it's still visible > when a cache hit happens. qemu_get_ram_block() may set mru_block during the grace period. Once mru_block is set, RCU readers can still access the ramblock, even if the block is no longer visible in ram_list. When enriching the comment in reclaim_ramblock_prepare(), I think the phrase “the last reader that can access this ramblock is gone” could be improved. What matters here is that the last reader that could find the block in ram_list is gone. The block is no longer visible in ram_list to any later reader, and clearing the cache at that point ensures that it is no longer visible through an mru_block cache hit either. Regards, Akihiko Odaki > > But I agree removal of this line shouldn't involve correctness issues, > because qemu_get_ram_block() when having a mru_block cache hit, will always > return the block directly: > > [2] > > block = qatomic_rcu_read(&ram_list.mru_block); > if (block && addr - block->offset < block->max_length) { > return block; > } > > And it will never reach the "found:" label later to update mru_block again. > But that's obscure. > > So it's a niche use of it, but keeping this line to reset makes this > justification slightly easier (corresponds to the comment at [1] above). > > I can also remove this line, but then I'll need to enrich comment > explaining case [2] above. We'll also need to be careful on not changing > that behavior in qemu_get_ram_block(). > > Thanks, > >> >> Regars, >> Akihiko Odaki >> >>> /* Write list before version */ >>> qatomic_store_release(&ram_list.version, ram_list.version + 1); >>> - call_rcu(block, reclaim_ramblock, rcu); >>> + call_rcu(block, reclaim_ramblock_prepare, rcu); >>> qemu_mutex_unlock_ramlist(); >>> } >> > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] memory/ramblock: Fix clear of mru_block on possible race condition 2026-06-03 5:11 ` Akihiko Odaki @ 2026-06-03 17:35 ` Peter Xu 2026-06-04 5:24 ` Akihiko Odaki 0 siblings, 1 reply; 7+ messages in thread From: Peter Xu @ 2026-06-03 17:35 UTC (permalink / raw) To: Akihiko Odaki Cc: qemu-devel, Philippe Mathieu-Daudé, Fabiano Rosas, Paolo Bonzini, Alex Bennée On Wed, Jun 03, 2026 at 02:11:44PM +0900, Akihiko Odaki wrote: > qemu_get_ram_block() may set mru_block during the grace period. Once > mru_block is set, RCU readers can still access the ramblock, even if the > block is no longer visible in ram_list. True.. > > When enriching the comment in reclaim_ramblock_prepare(), I think the phrase > “the last reader that can access this ramblock is gone” could be improved. > What matters here is that the last reader that could find the block in > ram_list is gone. The block is no longer visible in ram_list to any later > reader, and clearing the cache at that point ensures that it is no longer > visible through an mru_block cache hit either. Yes that's ambiguous. I updated comments in this patch, removed the 1st reset of mru_block, and when at it also touched up the comment in qemu_get_ram_block(), let me know if there's further comments before repost. Thanks, ===8<=== From 18789a1136c753a6cdce1ea698cf4c1871121cc9 Mon Sep 17 00:00:00 2001 From: Peter Xu <peterx@redhat.com> Date: Fri, 29 May 2026 12:05:15 -0400 Subject: [PATCH] memory/ramblock: Fix clear of mru_block on possible race condition The race condition was not reported in any real bug, but I found it while reviewing some relevant chnages from Akihiko [1]. It's not clear if there's any way to reproduce it, so far it's only theoretical. Hence there's also no Fixes tag attached. There's also no need to copy stable until we have a solid reproducer. Currently, mru_block might still points to ramblocks that are removed if below race condition happens: Reader A Writer -------- ------ rcu_read_lock() walk list, find block X QLIST_REMOVE_RCU(X) qatomic_set(&mru_block, NULL) call_rcu(X, reclaim_ramblock) qatomic_set(&mru_block, X) <----------- overwrites NULL rcu_read_unlock() grace period ends, X freed Reader C -------- rcu_read_lock() qatomic_rcu_read(&mru_block) -> X Read X's block->offset ... <----------- UAF To fix it, we can introduce a nested RCU free for reset of the mru_block field, and only free the ramblock until the 2nd RCU call. Since QEMU always: (1) dequeue the RCU node before invoking the function, (2) reset all fields in rcu_head in the call_rcu1() call, nested RCU will work all fine like what's used in this patch. [1] https://lore.kernel.org/r/5fd9e540-cf13-45f5-ba9a-c7faf364035b@rsg.ci.i.u-tokyo.ac.jp Cc: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> Signed-off-by: Peter Xu <peterx@redhat.com> --- Based-on: <5fd9e540-cf13-45f5-ba9a-c7faf364035b@rsg.ci.i.u-tokyo.ac.jp> --- system/physmem.c | 46 ++++++++++++++++++++++++++++++---------------- 1 file changed, 30 insertions(+), 16 deletions(-) diff --git a/system/physmem.c b/system/physmem.c index 9e1ac13e82..c9e5696d6a 100644 --- a/system/physmem.c +++ b/system/physmem.c @@ -836,21 +836,14 @@ static RAMBlock *qemu_get_ram_block(ram_addr_t addr) abort(); found: - /* It is safe to write mru_block outside the BQL. This - * is what happens: - * - * qatomic_set(&mru_block, xxx) - * rcu_read_unlock() - * xxx removed from list - * rcu_read_lock() - * read mru_block - * qatomic_set(&mru_block, NULL); - * call_rcu(reclaim_ramblock, xxx); - * rcu_read_unlock() + /* + * It is safe to write mru_block outside the BQL, the writer (e.g. when + * QEMU frees a ramblock) is designed to be thread-safe with readers + * updating this field concurrently. See reclaim_ramblock_prepare(). * - * qatomic_rcu_set is not needed here. The block was already published - * when it was placed into the list. Here we're just making an extra - * copy of the pointer. + * qatomic_rcu_set() is not needed here, because the block was already + * published when it was placed into the list. Here we're just making + * an extra copy of the pointer. */ qatomic_set(&ram_list.mru_block, block); return block; @@ -2590,6 +2583,23 @@ static void reclaim_ramblock(RAMBlock *block) g_free(block); } +static void reclaim_ramblock_prepare(RAMBlock *block) +{ + /* + * After one round of grace period, no more reader can see this + * ramblock via ram_list. Reset this field making sure it will never + * point to the ramblock being freed. + */ + qatomic_set(&ram_list.mru_block, NULL); + /* + * Wait for a second round of grace period to make sure whoever + * accessed the ramblock previously via mru_block has finished using + * it. Note: this is an intended nested use of rcu_head. If needed, + * we can provide two rcu_heads for ramblock. + */ + call_rcu(block, reclaim_ramblock, rcu); +} + void qemu_ram_free(RAMBlock *block) { g_autofree char *name = NULL; @@ -2607,10 +2617,14 @@ void qemu_ram_free(RAMBlock *block) name = cpr_name(block->mr); cpr_delete_fd(name, 0); QLIST_REMOVE_RCU(block, next); - qatomic_set(&ram_list.mru_block, NULL); /* Write list before version */ qatomic_store_release(&ram_list.version, ram_list.version + 1); - call_rcu(block, reclaim_ramblock, rcu); + /* + * Wait for a grace period to make sure no reader can see this ramblock + * via ram_list anymore. Note that readers can still see and access + * the ramblock via mru_block, so we can't free it yet. + */ + call_rcu(block, reclaim_ramblock_prepare, rcu); qemu_mutex_unlock_ramlist(); } -- 2.53.0 -- Peter Xu ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH] memory/ramblock: Fix clear of mru_block on possible race condition 2026-06-03 17:35 ` Peter Xu @ 2026-06-04 5:24 ` Akihiko Odaki 2026-06-04 13:33 ` Peter Xu 0 siblings, 1 reply; 7+ messages in thread From: Akihiko Odaki @ 2026-06-04 5:24 UTC (permalink / raw) To: Peter Xu Cc: qemu-devel, Philippe Mathieu-Daudé, Fabiano Rosas, Paolo Bonzini, Alex Bennée On 2026/06/04 2:35, Peter Xu wrote: > On Wed, Jun 03, 2026 at 02:11:44PM +0900, Akihiko Odaki wrote: >> qemu_get_ram_block() may set mru_block during the grace period. Once >> mru_block is set, RCU readers can still access the ramblock, even if the >> block is no longer visible in ram_list. > > True.. > >> >> When enriching the comment in reclaim_ramblock_prepare(), I think the phrase >> “the last reader that can access this ramblock is gone” could be improved. >> What matters here is that the last reader that could find the block in >> ram_list is gone. The block is no longer visible in ram_list to any later >> reader, and clearing the cache at that point ensures that it is no longer >> visible through an mru_block cache hit either. > > Yes that's ambiguous. > > I updated comments in this patch, removed the 1st reset of mru_block, and > when at it also touched up the comment in qemu_get_ram_block(), let me know > if there's further comments before repost. It looks good. You may repost it with my Reviewed-by added: Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> Regards, Akihiko Odaki > > Thanks, > > ===8<=== > From 18789a1136c753a6cdce1ea698cf4c1871121cc9 Mon Sep 17 00:00:00 2001 > From: Peter Xu <peterx@redhat.com> > Date: Fri, 29 May 2026 12:05:15 -0400 > Subject: [PATCH] memory/ramblock: Fix clear of mru_block on possible race > condition > > The race condition was not reported in any real bug, but I found it while > reviewing some relevant chnages from Akihiko [1]. It's not clear if > there's any way to reproduce it, so far it's only theoretical. Hence > there's also no Fixes tag attached. There's also no need to copy stable > until we have a solid reproducer. > > Currently, mru_block might still points to ramblocks that are removed if > below race condition happens: > > Reader A Writer > -------- ------ > rcu_read_lock() > walk list, find block X > QLIST_REMOVE_RCU(X) > qatomic_set(&mru_block, NULL) > call_rcu(X, reclaim_ramblock) > qatomic_set(&mru_block, X) <----------- overwrites NULL > rcu_read_unlock() > grace period ends, X freed > > Reader C > -------- > rcu_read_lock() > qatomic_rcu_read(&mru_block) -> X > Read X's block->offset ... <----------- UAF > > To fix it, we can introduce a nested RCU free for reset of the mru_block > field, and only free the ramblock until the 2nd RCU call. > > Since QEMU always: (1) dequeue the RCU node before invoking the > function, (2) reset all fields in rcu_head in the call_rcu1() call, nested > RCU will work all fine like what's used in this patch. > > [1] https://lore.kernel.org/r/5fd9e540-cf13-45f5-ba9a-c7faf364035b@rsg.ci.i.u-tokyo.ac.jp > > Cc: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > Based-on: <5fd9e540-cf13-45f5-ba9a-c7faf364035b@rsg.ci.i.u-tokyo.ac.jp> > --- > system/physmem.c | 46 ++++++++++++++++++++++++++++++---------------- > 1 file changed, 30 insertions(+), 16 deletions(-) > > diff --git a/system/physmem.c b/system/physmem.c > index 9e1ac13e82..c9e5696d6a 100644 > --- a/system/physmem.c > +++ b/system/physmem.c > @@ -836,21 +836,14 @@ static RAMBlock *qemu_get_ram_block(ram_addr_t addr) > abort(); > > found: > - /* It is safe to write mru_block outside the BQL. This > - * is what happens: > - * > - * qatomic_set(&mru_block, xxx) > - * rcu_read_unlock() > - * xxx removed from list > - * rcu_read_lock() > - * read mru_block > - * qatomic_set(&mru_block, NULL); > - * call_rcu(reclaim_ramblock, xxx); > - * rcu_read_unlock() > + /* > + * It is safe to write mru_block outside the BQL, the writer (e.g. when > + * QEMU frees a ramblock) is designed to be thread-safe with readers > + * updating this field concurrently. See reclaim_ramblock_prepare(). > * > - * qatomic_rcu_set is not needed here. The block was already published > - * when it was placed into the list. Here we're just making an extra > - * copy of the pointer. > + * qatomic_rcu_set() is not needed here, because the block was already > + * published when it was placed into the list. Here we're just making > + * an extra copy of the pointer. > */ > qatomic_set(&ram_list.mru_block, block); > return block; > @@ -2590,6 +2583,23 @@ static void reclaim_ramblock(RAMBlock *block) > g_free(block); > } > > +static void reclaim_ramblock_prepare(RAMBlock *block) > +{ > + /* > + * After one round of grace period, no more reader can see this > + * ramblock via ram_list. Reset this field making sure it will never > + * point to the ramblock being freed. > + */ > + qatomic_set(&ram_list.mru_block, NULL); > + /* > + * Wait for a second round of grace period to make sure whoever > + * accessed the ramblock previously via mru_block has finished using > + * it. Note: this is an intended nested use of rcu_head. If needed, > + * we can provide two rcu_heads for ramblock. > + */ > + call_rcu(block, reclaim_ramblock, rcu); > +} > + > void qemu_ram_free(RAMBlock *block) > { > g_autofree char *name = NULL; > @@ -2607,10 +2617,14 @@ void qemu_ram_free(RAMBlock *block) > name = cpr_name(block->mr); > cpr_delete_fd(name, 0); > QLIST_REMOVE_RCU(block, next); > - qatomic_set(&ram_list.mru_block, NULL); > /* Write list before version */ > qatomic_store_release(&ram_list.version, ram_list.version + 1); > - call_rcu(block, reclaim_ramblock, rcu); > + /* > + * Wait for a grace period to make sure no reader can see this ramblock > + * via ram_list anymore. Note that readers can still see and access > + * the ramblock via mru_block, so we can't free it yet. > + */ > + call_rcu(block, reclaim_ramblock_prepare, rcu); > qemu_mutex_unlock_ramlist(); > } > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] memory/ramblock: Fix clear of mru_block on possible race condition 2026-06-04 5:24 ` Akihiko Odaki @ 2026-06-04 13:33 ` Peter Xu 0 siblings, 0 replies; 7+ messages in thread From: Peter Xu @ 2026-06-04 13:33 UTC (permalink / raw) To: Akihiko Odaki Cc: qemu-devel, Philippe Mathieu-Daudé, Fabiano Rosas, Paolo Bonzini, Alex Bennée On Thu, Jun 04, 2026 at 02:24:07PM +0900, Akihiko Odaki wrote: > On 2026/06/04 2:35, Peter Xu wrote: > > On Wed, Jun 03, 2026 at 02:11:44PM +0900, Akihiko Odaki wrote: > > > qemu_get_ram_block() may set mru_block during the grace period. Once > > > mru_block is set, RCU readers can still access the ramblock, even if the > > > block is no longer visible in ram_list. > > > > True.. > > > > > > > > When enriching the comment in reclaim_ramblock_prepare(), I think the phrase > > > “the last reader that can access this ramblock is gone” could be improved. > > > What matters here is that the last reader that could find the block in > > > ram_list is gone. The block is no longer visible in ram_list to any later > > > reader, and clearing the cache at that point ensures that it is no longer > > > visible through an mru_block cache hit either. > > > > Yes that's ambiguous. > > > > I updated comments in this patch, removed the 1st reset of mru_block, and > > when at it also touched up the comment in qemu_get_ram_block(), let me know > > if there's further comments before repost. > > It looks good. You may repost it with my Reviewed-by added: > > Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> Thanks. -- Peter Xu ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2026-06-04 13:34 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-06-01 20:59 [PATCH] memory/ramblock: Fix clear of mru_block on possible race condition Peter Xu 2026-06-02 4:25 ` Akihiko Odaki 2026-06-02 19:51 ` Peter Xu 2026-06-03 5:11 ` Akihiko Odaki 2026-06-03 17:35 ` Peter Xu 2026-06-04 5:24 ` Akihiko Odaki 2026-06-04 13:33 ` Peter Xu
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.