From: "Michael S. Tsirkin" <mst@redhat.com>
To: Miaohe Lin <linmiaohe@huawei.com>
Cc: "Zi Yan" <ziy@nvidia.com>,
"David Hildenbrand (Arm)" <david@kernel.org>,
"Andrew Morton" <akpm@linux-foundation.org>,
linux-kernel@vger.kernel.org, "Jason Wang" <jasowang@redhat.com>,
"Xuan Zhuo" <xuanzhuo@linux.alibaba.com>,
"Eugenio Pérez" <eperezma@redhat.com>,
"Muchun Song" <muchun.song@linux.dev>,
"Oscar Salvador" <osalvador@suse.de>,
"Lorenzo Stoakes" <ljs@kernel.org>,
"Liam R. Howlett" <liam@infradead.org>,
"Vlastimil Babka" <vbabka@kernel.org>,
"Mike Rapoport" <rppt@kernel.org>,
"Suren Baghdasaryan" <surenb@google.com>,
"Michal Hocko" <mhocko@suse.com>,
"Brendan Jackman" <jackmanb@google.com>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Baolin Wang" <baolin.wang@linux.alibaba.com>,
"Nico Pache" <npache@redhat.com>,
"Ryan Roberts" <ryan.roberts@arm.com>,
"Dev Jain" <dev.jain@arm.com>, "Barry Song" <baohua@kernel.org>,
"Lance Yang" <lance.yang@linux.dev>,
"Hugh Dickins" <hughd@google.com>,
"Matthew Brost" <matthew.brost@intel.com>,
"Joshua Hahn" <joshua.hahnjy@gmail.com>,
"Rakie Kim" <rakie.kim@sk.com>,
"Byungchul Park" <byungchul@sk.com>,
"Gregory Price" <gourry@gourry.net>,
"Ying Huang" <ying.huang@linux.alibaba.com>,
"Alistair Popple" <apopple@nvidia.com>,
"Christoph Lameter" <cl@gentwo.org>,
"David Rientjes" <rientjes@google.com>,
"Roman Gushchin" <roman.gushchin@linux.dev>,
"Harry Yoo" <harry.yoo@oracle.com>,
"Axel Rasmussen" <axelrasmussen@google.com>,
"Yuanchu Xie" <yuanchu@google.com>, "Wei Xu" <weixugc@google.com>,
"Chris Li" <chrisl@kernel.org>,
"Kairui Song" <kasong@tencent.com>,
"Kemeng Shi" <shikemeng@huaweicloud.com>,
"Nhat Pham" <nphamcs@gmail.com>, "Baoquan He" <bhe@redhat.com>,
virtualization@lists.linux.dev, linux-mm@kvack.org,
"Andrea Arcangeli" <aarcange@redhat.com>,
"Naoya Horiguchi" <nao.horiguchi@gmail.com>
Subject: Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
Date: Thu, 11 Jun 2026 01:43:53 -0400
Message-ID: <20260611013644-mutt-send-email-mst@kernel.org>
In-Reply-To: <14537566-94d9-eac5-2636-35f925a9d159@huawei.com>
On Thu, Jun 11, 2026 at 11:35:36AM +0800, Miaohe Lin wrote:
> On 2026/6/11 5:18, Michael S. Tsirkin wrote:
> > On Wed, Jun 10, 2026 at 03:24:30PM +0800, Miaohe Lin wrote:
> >> On 2026/6/10 5:00, Michael S. Tsirkin wrote:
> >>> On Tue, Jun 09, 2026 at 04:54:01PM -0400, Zi Yan wrote:
> >>>> On 9 Jun 2026, at 16:34, Michael S. Tsirkin wrote:
> >>>>
> >>>>> On Tue, Jun 09, 2026 at 02:52:47PM -0400, Zi Yan wrote:
> >>>>>> On 9 Jun 2026, at 14:39, Zi Yan wrote:
> >>>>>>
> >>>>>>> On 9 Jun 2026, at 14:38, David Hildenbrand (Arm) wrote:
> >>>>>>>
> >>>>>>>> On 6/9/26 20:10, Andrew Morton wrote:
> >>>>>>>>> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> TestSetPageHWPoison() is called without zone->lock, so its atomic
> >>>>>>>>>> update to page->flags can race with non-atomic flag operations
> >>>>>>>>>> that run under zone->lock in the buddy allocator.
> >>>>>>>>>>
> >>>>>>>>>> In particular, __free_pages_prepare() does:
> >>>>>>>>>>
> >>>>>>>>>> page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> >>>>>>>>>>
> >>>>>>>>>> This non-atomic read-modify-write, while correctly excluding
> >>>>>>>>>> __PG_HWPOISON from the mask, can still lose a concurrent
> >>>>>>>>>> TestSetPageHWPoison if the read happens before the poison bit
> >>>>>>>>>> is set and the write happens after. Will only get worse if/when
> >>>>>>>>>> we add more non-atomic flag operations.
> >>>>>>>>>>
> >>>>>>>>>> Fix by acquiring zone->lock around TestSetPageHWPoison and
> >>>>>>>>>> around ClearPageHWPoison in the retry path. This
> >>>>>>>>>> serializes with all buddy flag manipulation. The cost is
> >>>>>>>>>> negligible: one lock/unlock in an extremely rare path
> >>>>>>>>>> (hardware memory errors).
> >>>>>>>>>>
> >>>>>>>>>> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> >>>>>>>>>> in this file operate on pages already removed from the buddy
> >>>>>>>>>> allocator or on non-buddy pages (DAX, hugetlb), so they do not
> >>>>>>>>>> need zone->lock protection.
> >>>>>>>>>
> >>>>>>>>> Sashiko is saying this doesn't do anything "Because
> >>>>>>>>> __free_pages_prepare() executes entirely locklessly". Did it goof?
> >>>>>>>>>
> >>>>>>>>> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
> >>>>>>>>
> >>>>>>>> Battle of the bots: it's right.
> >>>>>>>
> >>>>>>> Yep, __free_pages_prepare() changes the page flag without holding
> >>>>>>> zone->lock.
> >>>>>>
> >>>>>> __free_pages_prepare() works on frozen pages and assumes no one else
> >>>>>> touches the input page. To avoid this race, memory_failure() might
> >>>>>> want to try_get_page() before TestClearPageHWPoison(), but I am not
> >>>>>> sure if that works along with memory failure flow.
> >>>>>>
> >>>>>> Best Regards,
> >>>>>> Yan, Zi
> >>>>>
> >>>>>
> >>>>>
> >>>>> Actually memory failure already plays with this down the road no?
> >>>>>
> >>>>> So maybe it's enough to just SetPageHWPoison afterwards again?
> >>>>>
> >>>>>
> >>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >>>>> index ee42d4361309..4758fea94a96 100644
> >>>>> --- a/mm/memory-failure.c
> >>>>> +++ b/mm/memory-failure.c
> >>>>> @@ -2415,6 +2415,7 @@ int memory_failure(unsigned long pfn, int flags)
> >>>>> if (!res) {
> >>>>> if (is_free_buddy_page(p)) {
> >>>>> if (take_page_off_buddy(p)) {
> >>>>> + SetPageHWPoison(p);
> >>>>> page_ref_inc(p);
> >>>>> res = MF_RECOVERED;
> >>>>> } else {
> >>>>>
> >>>>>
> >>>>> and maybe in a bunch of other places in there?
> >>>>
> >>>> You mean for fear of losing HWPoison flag in the earlier TestSetPageHWPoison(),
> >>>> just set it again here?
> >>>
> >>> Yea.
> >>>
> >>>> Why not do it after get_hwpoison_page(), since that
> >>>> is the expected page flag?
> >>>
> >>> It's still in the buddy at that point right? I'm worried buddy might
> >>> poke at flags.
> >>
> >> Since __free_pages_prepare() executes entirely locklessly, the only way to ensure
> >> HWPoison flag won't be lost might be only set hwpoison flag iff we can make sure
> >> pages are not on the way to buddy...
> >>
> >> Thanks.
> >> .
> >
> >
> > To clarify do you not agree repeating SetPageHWPoison is enough for
> > this? And if not, do you have suggestions on how to fix this race?
>
> Do you mean repeating SetPageHWPoison on every branch?
Right.
> Is it possible
> to make __free_pages_prepare changes page->flags atomically or this race
> is specified to memory_failure?
>
> Thanks.
> .
Adding an atomic op on every fast path page allocation is, I am
guessing, going to slow down Linux measurably.
Doing it for the benefit of memory_failure, which is the slowest of
slow paths, seems unpalatable, to me.
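
To make concrete what "atomic" would mean here, below is a minimal
userspace sketch in C11, not kernel code. The DEMO_* bit values and the
function names are made up for illustration; only the shape of the two
operations is the point. The plain variant is a separable load/AND/store,
the atomic one is a single fetch-and-AND. Nothing here measures cost; the
slowdown above remains a guess.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical bit values for illustration; not the kernel's real layout. */
#define DEMO_PG_HWPOISON   (1UL << 0)
#define DEMO_CHECK_AT_PREP (1UL << 1) /* stands in for PAGE_FLAGS_CHECK_AT_PREP */

/* Plain read-modify-write: load, AND and store are three separable steps. */
void clear_prep_plain(unsigned long *flags)
{
	*flags &= ~DEMO_CHECK_AT_PREP;
}

/* Atomic variant: one indivisible fetch-and-AND on the shared word. */
void clear_prep_atomic(_Atomic unsigned long *flags)
{
	atomic_fetch_and(flags, ~DEMO_CHECK_AT_PREP);
}

/* Without concurrency both variants leave the same bits behind. */
int demo_variants_agree(void)
{
	unsigned long plain = DEMO_PG_HWPOISON | DEMO_CHECK_AT_PREP;
	_Atomic unsigned long shared = DEMO_PG_HWPOISON | DEMO_CHECK_AT_PREP;

	clear_prep_plain(&plain);
	clear_prep_atomic(&shared);
	assert(plain == DEMO_PG_HWPOISON);
	return plain == atomic_load(&shared);
}
```

The two only differ when another CPU writes the word between the load
and the store of the plain variant.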
Neither am I sure it's the only racy place -
grep for __SetPage and __ClearPage - all these have the same issue, I
suspect.
At the same time, I'm not an mm maintainer. If you disagree, try to
upstream a change converting all non atomics in mm to atomics, and see
what others say.
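
For anyone following along, here is a small userspace sketch in C11 that
replays the interleaving from the original patch description on a single
thread, so the outcome is deterministic. It is not kernel code: the
DEMO_* bit values and function names are hypothetical, and the "re-assert"
step only models the idea of repeating SetPageHWPoison once the page can
no longer be touched by the freeing path.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical bit values for illustration; not the kernel's real layout. */
#define DEMO_PG_HWPOISON   (1UL << 0)
#define DEMO_CHECK_AT_PREP (1UL << 1) /* stands in for PAGE_FLAGS_CHECK_AT_PREP */

/* Replay: the poison bit is set between the read and the write of "&=". */
unsigned long replay_lost_update(void)
{
	_Atomic unsigned long flags = DEMO_CHECK_AT_PREP;

	/* Freeing path: read half of the non-atomic "flags &= ~mask". */
	unsigned long snapshot = atomic_load(&flags);

	/* memory_failure(): the atomic test-and-set lands in between. */
	atomic_fetch_or(&flags, DEMO_PG_HWPOISON);
	assert(atomic_load(&flags) & DEMO_PG_HWPOISON);

	/* Freeing path: write half, computed from the stale snapshot. */
	atomic_store(&flags, snapshot & ~DEMO_CHECK_AT_PREP);

	return atomic_load(&flags); /* the poison bit is gone */
}

/* Same replay, then set the bit again once the freeing path is done. */
unsigned long replay_with_reassert(void)
{
	_Atomic unsigned long flags = replay_lost_update();

	/* Models repeating SetPageHWPoison after the page is taken off buddy. */
	atomic_fetch_or(&flags, DEMO_PG_HWPOISON);

	return atomic_load(&flags);
}
```

Note the mask never contains the poison bit, yet the bit is still lost,
because the store writes back a value read before the bit was set.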
--
MST