From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists1p.gnu.org (lists1p.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 1E0CDCD6E57 for ; Tue, 2 Jun 2026 19:52:36 +0000 (UTC) Received: from localhost ([::1] helo=lists1p.gnu.org) by lists1p.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1wUV9W-0003rS-5t; Tue, 02 Jun 2026 15:51:54 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists1p.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1wUV9U-0003qu-Km for qemu-devel@nongnu.org; Tue, 02 Jun 2026 15:51:52 -0400 Received: from us-smtp-delivery-124.mimecast.com ([170.10.129.124]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1wUV9S-0002r1-JF for qemu-devel@nongnu.org; Tue, 02 Jun 2026 15:51:52 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1780429909; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=xMfVwINepOmViYIvNgMpykGSmbUAr2HE8FtTMWi6j94=; b=K0TcPphMgWabeaV6ib0ziLn2w8+ItTffLB9wTyPc+RcIGFjmWDaMF+tj5jV5OhhfuF5qH6 ce4wGn0QY6WgakM3uH+7QNKGN1v9bKcAx/eHJwd6ezzfOgPRWI+IfMP9mxRlGZNjA/uR9T Vyb9EmzOnwE+15BAtvBrhIIHN/1+7QQ= Received: from mail-qk1-f200.google.com (mail-qk1-f200.google.com [209.85.222.200]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-361-fhtXyfyyNOGDuB6ovoXFAA-1; Tue, 02 Jun 2026 15:51:47 -0400 X-MC-Unique: fhtXyfyyNOGDuB6ovoXFAA-1 X-Mimecast-MFC-AGG-ID: fhtXyfyyNOGDuB6ovoXFAA_1780429907 Received: by mail-qk1-f200.google.com with SMTP id af79cd13be357-9157929bf8fso385606785a.0 for ; Tue, 02 Jun 2026 12:51:47 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=google; t=1780429907; x=1781034707; darn=nongnu.org; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:references:message-id:subject:cc:to:from:date:from:to :cc:subject:date:message-id:reply-to; bh=xMfVwINepOmViYIvNgMpykGSmbUAr2HE8FtTMWi6j94=; b=o4Kgmf103g3djlzQNtC5EdGw2L1MakXr0xnEXUoUS3CQpMj9wMsSgBkXTZOTkCy102 Aq3YrsVVGgjbQofNnj6ISS1HvKJMOwEwTukMl9QQCzzeUn43j1n5sphi2pEYPrk7hznj nWnWySJybdB3cVWkPr8aFVzLoJTmY5KVMLhly1nw5hRBcHTftCWUgpzB2v/AYHEo9DHw iyohNMDusHnqXk+7Q/sH8xh2BkEOvifmdM1T3Iqf3pbvD65+FnM7tEsUBeDdUSgMTV3u HkYPepxZycAMQlYQzONVpvdy45xbruWaXtihS/Ks9Tpal8J1JLAHrYCDs+RwR2akhlpy PRNw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1780429907; x=1781034707; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:references:message-id:subject:cc:to:from:date:x-gm-gg :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=xMfVwINepOmViYIvNgMpykGSmbUAr2HE8FtTMWi6j94=; b=AEM61ZwPSKLXpbjCY8xPoLX6diq+rFgQye8EhOjN0oYR0xiGLOW6a6nJ9W0k6dYXGi dvT44SCxIXF6sjXzIRwodMPPgCKiwsEvc5P21AVDVLX78+5fn0dSxhmjP0NvenH36w14 VQoGYa+k5A6ty9NgG2TCXI+DI5YlB/CHh3PCXH7h2C61hRvwd3kiWUq6x3Sv9gVmFTKE S0ZDHLbTUb7K4P1qwlMiN4qDoy/kUE5rwlPtaUaH1+D+4FjLpCG1LP6bN/ULDt0mKP/b 0QTijBhrfgiuoXJTWw+8mnUfGZm8ygWgEe+fur5q7GHcJqxzkM8QohqSYsoCAsI99/C5 hhiw== X-Gm-Message-State: AOJu0YwMsnTJDWCQXI0d/5rT9hIA5nLsfQL4vM1CvG1AJwk8N0L+E9LX 47DHCWOY/5rjyxPfswYzkrTLrt6oGl1jVu1sLvrIvZVivNnPJ8NbRBAkoWK/oximw3YKtIF3d6E f2+xBi9fEezLfQ5jTUGirrL/8TZAJFqqln9YmePboOzC0Lj2crXEWitpc X-Gm-Gg: Acq92OFPB4XW6PlL3lTuYEITJ4l9odSG5+IYO58DvTRViJOlFdiq8Db/L8/rVFhmoJf IZQ6X1VQmU1dBU+mpGMi6I7tKD9OMxqBJK7QTCwCSr8EO8mwK8I2BsB0eWn+Az4EsbMAH8EAuWg 18FXkbGDJQI0ndtGDxlPgG+V6A/n/JHH5qu9DZNjsgWaqFjIdvmdcbEAMgG7x2Is+Cq5aam3hDi 3Ka+NNW6R6M03HKmq9Xs1BhJYsLEAuuUsmJJXcMG9fu96CvO1Iz9hvdPh4cDg81jAg8FBwskdm+ yaLLJ3xlnRL9CmOWpFKNWhBGIpTAI+zu8N5kKPjWMdIiVei5nBXzOPsySMYtxGxGL+aFlZ50Our gf3Fa3yvyEmoQwWciEpt9AdO1gQ== X-Received: by 2002:a05:620a:4482:b0:909:c82b:9756 with SMTP id af79cd13be357-9158a659573mr92381685a.5.1780429906833; Tue, 02 Jun 2026 12:51:46 -0700 (PDT) X-Received: by 2002:a05:620a:4482:b0:909:c82b:9756 with SMTP id af79cd13be357-9158a659573mr92374185a.5.1780429905934; Tue, 02 Jun 2026 12:51:45 -0700 (PDT) Received: from x1.local ([142.189.10.167]) by smtp.gmail.com with ESMTPSA id af79cd13be357-9158a37cab6sm35366885a.22.2026.06.02.12.51.44 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 02 Jun 2026 12:51:45 -0700 (PDT) Date: Tue, 2 Jun 2026 15:51:43 -0400 From: Peter Xu To: Akihiko Odaki Cc: qemu-devel@nongnu.org, Philippe =?utf-8?Q?Mathieu-Daud=C3=A9?= , Fabiano Rosas , Paolo Bonzini , Alex =?utf-8?Q?Benn=C3=A9e?= Subject: Re: [PATCH] memory/ramblock: Fix clear of mru_block on possible race condition Message-ID: References: <20260601205900.1067714-1-peterx@redhat.com> <36749195-2c49-4f9c-8445-2e67b2097676@rsg.ci.i.u-tokyo.ac.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <36749195-2c49-4f9c-8445-2e67b2097676@rsg.ci.i.u-tokyo.ac.jp> Received-SPF: pass client-ip=170.10.129.124; envelope-from=peterx@redhat.com; helo=us-smtp-delivery-124.mimecast.com X-Spam_score_int: -24 X-Spam_score: -2.5 X-Spam_bar: -- X-Spam_report: (-2.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.445, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H3=0.001, RCVD_IN_MSPIKE_WL=0.001, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: qemu development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org On Tue, Jun 02, 2026 at 01:25:02PM +0900, Akihiko Odaki wrote: > On 2026/06/02 5:59, Peter Xu wrote: > > The race condition was not reported in any real bug, but I found it while > > reviewing some relevant chnages from Akihiko [1]. It's not clear if > > there's any way to reproduce it, so far it's only theoretical. Hence > > there's also no Fixes tag attached. There's also no need to copy stable > > until we have a solid reproducer. > > > > Currently, mru_block might still points to ramblocks that are removed if > > below race condition happens: > > > > Reader A Writer > > -------- ------ > > rcu_read_lock() > > walk list, find block X > > QLIST_REMOVE_RCU(X) > > qatomic_set(&mru_block, NULL) > > call_rcu(X, reclaim_ramblock) > > qatomic_set(&mru_block, X) <----------- overwrites NULL > > rcu_read_unlock() > > grace period ends, X freed > > > > Reader C > > -------- > > rcu_read_lock() > > qatomic_rcu_read(&mru_block) -> X > > Read X's block->offset ... <----------- UAF > > > > To fix it, we can introduce a nested RCU free for reset of the mru_block > > field, and only free the ramblock until the 2nd RCU call. > > > > Since QEMU always: (1) dequeue the RCU node before invoking the > > function, (2) reset all fields in rcu_head in the call_rcu1() call, nested > > RCU will work all fine like what's used in this patch. > > > > [1] https://lore.kernel.org/r/5fd9e540-cf13-45f5-ba9a-c7faf364035b@rsg.ci.i.u-tokyo.ac.jp > > > > Cc: Akihiko Odaki > > Signed-off-by: Peter Xu > > --- > > > > Based-on: <5fd9e540-cf13-45f5-ba9a-c7faf364035b@rsg.ci.i.u-tokyo.ac.jp> > > > > Since there's no reproducer, I didn't verify the problem as-is. What I did > > is I ran this through some high concurrency "make check -jN" (8 processes > > with 4 cores pinned, hence 8*N make-check threads running concurrently on 4 > > physical cores), I didn't see any regression. > > --- > > system/physmem.c | 18 +++++++++++++++++- > > 1 file changed, 17 insertions(+), 1 deletion(-) > > > > diff --git a/system/physmem.c b/system/physmem.c > > index 9e1ac13e82..1697568fa7 100644 > > --- a/system/physmem.c > > +++ b/system/physmem.c > > @@ -2590,6 +2590,21 @@ static void reclaim_ramblock(RAMBlock *block) > > g_free(block); > > } > > +static void reclaim_ramblock_prepare(RAMBlock *block) > > +{ > > + /* > > + * 1st round rcu free guarantees the last reader that can access this > > + * ramblock is gone. Reset this field making sure it will never point [1] > > + * to the ramblock being freed. > > Somewhat nitpicking, but I would call this the first grace period rather > than “rcu free”. call_rcu() itself does not free anything; it schedules a > callback to run after a grace period, which is what matters here. Sure, I'll fix the wording. > > > + */ > > + qatomic_set(&ram_list.mru_block, NULL); > > + /* > > + * Note: this is an intended nested use of rcu_head. If needed, we can > > + * provide two rcu_heads for ramblock. > > + */ > > + call_rcu(block, reclaim_ramblock, rcu); > > +} > > + > > void qemu_ram_free(RAMBlock *block) > > { > > g_autofree char *name = NULL; > > @@ -2607,10 +2622,11 @@ void qemu_ram_free(RAMBlock *block) > > name = cpr_name(block->mr); > > cpr_delete_fd(name, 0); > > QLIST_REMOVE_RCU(block, next); > > + /* First attempt of reset, see reclaim_ramblock_prepare() for the 2nd */ > > qatomic_set(&ram_list.mru_block, NULL); > > I think it would be better to remove this for simplicity. > > mru_block is cleared later anyway, so this does not affect correctness. The > only practical effect is that qemu_get_ram_block() may be slower during the > first grace period. > > Conceptually, it is simpler to describe the two grace periods as serving two > distinct purposes, rather than as two attempts to clear mru_block. With the > early clear removed, readers do not first need to understand that the first > grace period leaves the first clear ineffective; they can instead focus on > how it makes the later clear effective. It was my intention to keep this, because then it's easier to justify "nobody has access to the ramblock being removed". If with this line removed here, it means after the first grace period (say, reaching reclaim_ramblock_prepare()), RCU readers can still access the ramblock, essentially because before reclaim_ramblock_prepare() clearing mru_block even if the block is not visible on ram_list it's still visible when a cache hit happens. But I agree removal of this line shouldn't involve correctness issues, because qemu_get_ram_block() when having a mru_block cache hit, will always return the block directly: [2] block = qatomic_rcu_read(&ram_list.mru_block); if (block && addr - block->offset < block->max_length) { return block; } And it will never reach the "found:" label later to update mru_block again. But that's obscure. So it's a niche use of it, but keeping this line to reset makes this justification slightly easier (corresponds to the comment at [1] above). I can also remove this line, but then I'll need to enrich comment explaining case [2] above. We'll also need to be careful on not changing that behavior in qemu_get_ram_block(). Thanks, > > Regars, > Akihiko Odaki > > > /* Write list before version */ > > qatomic_store_release(&ram_list.version, ram_list.version + 1); > > - call_rcu(block, reclaim_ramblock, rcu); > > + call_rcu(block, reclaim_ramblock_prepare, rcu); > > qemu_mutex_unlock_ramlist(); > > } > -- Peter Xu