* [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
@ 2026-06-09 10:12 Michael S. Tsirkin
2026-06-09 12:50 ` David Hildenbrand (Arm)
` (2 more replies)
0 siblings, 3 replies; 27+ messages in thread
From: Michael S. Tsirkin @ 2026-06-09 10:12 UTC (permalink / raw)
To: linux-kernel
Cc: Miaohe Lin, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
TestSetPageHWPoison() is called without zone->lock, so its atomic
update to page->flags can race with non-atomic flag operations
that run under zone->lock in the buddy allocator.
In particular, __free_pages_prepare() does:
page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
This non-atomic read-modify-write, while correctly excluding
__PG_HWPOISON from the mask, can still lose a concurrent
TestSetPageHWPoison if the read happens before the poison bit
is set and the write happens after. Will only get worse if/when
we add more non-atomic flag operations.
Fix by acquiring zone->lock around TestSetPageHWPoison and
around ClearPageHWPoison in the retry path. This
serializes with all buddy flag manipulation. The cost is
negligible: one lock/unlock in an extremely rare path
(hardware memory errors).
Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
in this file operate on pages already removed from the buddy
allocator or on non-buddy pages (DAX, hugetlb), so they do not
need zone->lock protection.
Fixes: 6a46079cf57a ("HWPOISON: The high level memory error handler in the VM v7")
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---
Sending separately as suggested by multiple people. I also added
a Fixes tag.
mm/memory-failure.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ee42d4361309..3880486028a1 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2348,6 +2348,8 @@ int memory_failure(unsigned long pfn, int flags)
unsigned long page_flags;
bool retry = true;
int hugetlb = 0;
+ struct zone *zone;
+ unsigned long mf_flags;
if (!sysctl_memory_failure_recovery)
panic("Memory failure on page %lx", pfn);
@@ -2390,7 +2392,11 @@ int memory_failure(unsigned long pfn, int flags)
if (hugetlb)
goto unlock_mutex;
+ /* Serialize with non-atomic buddy flag operations */
+ zone = page_zone(p);
+ spin_lock_irqsave(&zone->lock, mf_flags);
if (TestSetPageHWPoison(p)) {
+ spin_unlock_irqrestore(&zone->lock, mf_flags);
res = -EHWPOISON;
if (flags & MF_ACTION_REQUIRED)
res = kill_accessing_process(current, pfn, flags);
@@ -2399,6 +2405,7 @@ int memory_failure(unsigned long pfn, int flags)
action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
goto unlock_mutex;
}
+ spin_unlock_irqrestore(&zone->lock, mf_flags);
/*
* We need/can do nothing about count=0 pages.
@@ -2420,7 +2427,10 @@ int memory_failure(unsigned long pfn, int flags)
} else {
/* We lost the race, try again */
if (retry) {
+ /* Serialize with non-atomic buddy flag operations */
+ spin_lock_irqsave(&zone->lock, mf_flags);
ClearPageHWPoison(p);
+ spin_unlock_irqrestore(&zone->lock, mf_flags);
retry = false;
goto try_again;
}
--
MST
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-09 10:12 [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock Michael S. Tsirkin
@ 2026-06-09 12:50 ` David Hildenbrand (Arm)
2026-06-09 16:12 ` Zi Yan
2026-06-09 18:10 ` Andrew Morton
2 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-09 12:50 UTC (permalink / raw)
To: Michael S. Tsirkin, linux-kernel
Cc: Miaohe Lin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Muchun Song, Oscar Salvador, Andrew Morton, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
On 6/9/26 12:12, Michael S. Tsirkin wrote:
> TestSetPageHWPoison() is called without zone->lock, so its atomic
> update to page->flags can race with non-atomic flag operations
> that run under zone->lock in the buddy allocator.
>
> In particular, __free_pages_prepare() does:
>
> page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>
> This non-atomic read-modify-write, while correctly excluding
> __PG_HWPOISON from the mask, can still lose a concurrent
> TestSetPageHWPoison if the read happens before the poison bit
> is set and the write happens after. Will only get worse if/when
> we add more non-atomic flag operations.
>
> Fix by acquiring zone->lock around TestSetPageHWPoison and
> around ClearPageHWPoison in the retry path. This
> serializes with all buddy flag manipulation. The cost is
> negligible: one lock/unlock in an extremely rare path
> (hardware memory errors).
>
> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> in this file operate on pages already removed from the buddy
> allocator or on non-buddy pages (DAX, hugetlb), so they do not
> need zone->lock protection.
>
> Fixes: 6a46079cf57a ("HWPOISON: The high level memory error handler in the VM v7")
> Acked-by: Miaohe Lin <linmiaohe@huawei.com>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> ---
memory_failure() is documented that it must not be called with the spinlock
held, so this cannot deadlock. In particular, we would grab the lock already in
take_page_off_buddy().
LGTM, thanks
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
--
Cheers,
David
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-09 10:12 [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock Michael S. Tsirkin
2026-06-09 12:50 ` David Hildenbrand (Arm)
@ 2026-06-09 16:12 ` Zi Yan
2026-06-09 18:10 ` Andrew Morton
2 siblings, 0 replies; 27+ messages in thread
From: Zi Yan @ 2026-06-09 16:12 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: linux-kernel, Miaohe Lin, David Hildenbrand (Arm), Jason Wang,
Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
On 9 Jun 2026, at 6:12, Michael S. Tsirkin wrote:
> TestSetPageHWPoison() is called without zone->lock, so its atomic
> update to page->flags can race with non-atomic flag operations
> that run under zone->lock in the buddy allocator.
>
> In particular, __free_pages_prepare() does:
>
> page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>
> This non-atomic read-modify-write, while correctly excluding
> __PG_HWPOISON from the mask, can still lose a concurrent
> TestSetPageHWPoison if the read happens before the poison bit
> is set and the write happens after. Will only get worse if/when
> we add more non-atomic flag operations.
>
> Fix by acquiring zone->lock around TestSetPageHWPoison and
> around ClearPageHWPoison in the retry path. This
> serializes with all buddy flag manipulation. The cost is
> negligible: one lock/unlock in an extremely rare path
> (hardware memory errors).
>
> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> in this file operate on pages already removed from the buddy
> allocator or on non-buddy pages (DAX, hugetlb), so they do not
> need zone->lock protection.
>
> Fixes: 6a46079cf57a ("HWPOISON: The high level memory error handler in the VM v7")
> Acked-by: Miaohe Lin <linmiaohe@huawei.com>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> ---
>
> Sending separately as suggested by multiple people. I also added
> a Fixes tag.
>
>
> mm/memory-failure.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
Makes sense to me. Thanks.
Acked-by: Zi Yan <ziy@nvidia.com>
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-09 10:12 [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock Michael S. Tsirkin
2026-06-09 12:50 ` David Hildenbrand (Arm)
2026-06-09 16:12 ` Zi Yan
@ 2026-06-09 18:10 ` Andrew Morton
2026-06-09 18:38 ` David Hildenbrand (Arm)
2026-06-09 20:24 ` Michael S. Tsirkin
2 siblings, 2 replies; 27+ messages in thread
From: Andrew Morton @ 2026-06-09 18:10 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: linux-kernel, Miaohe Lin, David Hildenbrand (Arm), Jason Wang,
Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
> TestSetPageHWPoison() is called without zone->lock, so its atomic
> update to page->flags can race with non-atomic flag operations
> that run under zone->lock in the buddy allocator.
>
> In particular, __free_pages_prepare() does:
>
> page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>
> This non-atomic read-modify-write, while correctly excluding
> __PG_HWPOISON from the mask, can still lose a concurrent
> TestSetPageHWPoison if the read happens before the poison bit
> is set and the write happens after. Will only get worse if/when
> we add more non-atomic flag operations.
>
> Fix by acquiring zone->lock around TestSetPageHWPoison and
> around ClearPageHWPoison in the retry path. This
> serializes with all buddy flag manipulation. The cost is
> negligible: one lock/unlock in an extremely rare path
> (hardware memory errors).
>
> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> in this file operate on pages already removed from the buddy
> allocator or on non-buddy pages (DAX, hugetlb), so they do not
> need zone->lock protection.
Sashiko is saying this doesn't do anything "Because
__free_pages_prepare() executes entirely locklessly". Did it goof?
https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-09 18:10 ` Andrew Morton
@ 2026-06-09 18:38 ` David Hildenbrand (Arm)
2026-06-09 18:39 ` Zi Yan
2026-06-11 6:33 ` Michael S. Tsirkin
2026-06-09 20:24 ` Michael S. Tsirkin
1 sibling, 2 replies; 27+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-09 18:38 UTC (permalink / raw)
To: Andrew Morton, Michael S. Tsirkin
Cc: linux-kernel, Miaohe Lin, Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
On 6/9/26 20:10, Andrew Morton wrote:
> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
>> TestSetPageHWPoison() is called without zone->lock, so its atomic
>> update to page->flags can race with non-atomic flag operations
>> that run under zone->lock in the buddy allocator.
>>
>> In particular, __free_pages_prepare() does:
>>
>> page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>>
>> This non-atomic read-modify-write, while correctly excluding
>> __PG_HWPOISON from the mask, can still lose a concurrent
>> TestSetPageHWPoison if the read happens before the poison bit
>> is set and the write happens after. Will only get worse if/when
>> we add more non-atomic flag operations.
>>
>> Fix by acquiring zone->lock around TestSetPageHWPoison and
>> around ClearPageHWPoison in the retry path. This
>> serializes with all buddy flag manipulation. The cost is
>> negligible: one lock/unlock in an extremely rare path
>> (hardware memory errors).
>>
>> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
>> in this file operate on pages already removed from the buddy
>> allocator or on non-buddy pages (DAX, hugetlb), so they do not
>> need zone->lock protection.
>
> Sashiko is saying this doesn't do anything "Because
> __free_pages_prepare() executes entirely locklessly". Did it goof?
>
> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
Battle of the bots: it's right.
--
Cheers,
David
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-09 18:38 ` David Hildenbrand (Arm)
@ 2026-06-09 18:39 ` Zi Yan
2026-06-09 18:52 ` Zi Yan
2026-06-11 6:33 ` Michael S. Tsirkin
1 sibling, 1 reply; 27+ messages in thread
From: Zi Yan @ 2026-06-09 18:39 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Andrew Morton, Michael S. Tsirkin, linux-kernel, Miaohe Lin,
Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
Oscar Salvador, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
On 9 Jun 2026, at 14:38, David Hildenbrand (Arm) wrote:
> On 6/9/26 20:10, Andrew Morton wrote:
>> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>
>>> TestSetPageHWPoison() is called without zone->lock, so its atomic
>>> update to page->flags can race with non-atomic flag operations
>>> that run under zone->lock in the buddy allocator.
>>>
>>> In particular, __free_pages_prepare() does:
>>>
>>> page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>>>
>>> This non-atomic read-modify-write, while correctly excluding
>>> __PG_HWPOISON from the mask, can still lose a concurrent
>>> TestSetPageHWPoison if the read happens before the poison bit
>>> is set and the write happens after. Will only get worse if/when
>>> we add more non-atomic flag operations.
>>>
>>> Fix by acquiring zone->lock around TestSetPageHWPoison and
>>> around ClearPageHWPoison in the retry path. This
>>> serializes with all buddy flag manipulation. The cost is
>>> negligible: one lock/unlock in an extremely rare path
>>> (hardware memory errors).
>>>
>>> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
>>> in this file operate on pages already removed from the buddy
>>> allocator or on non-buddy pages (DAX, hugetlb), so they do not
>>> need zone->lock protection.
>>
>> Sashiko is saying this doesn't do anything "Because
>> __free_pages_prepare() executes entirely locklessly". Did it goof?
>>
>> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
>
> Battle of the bots: it's right.
Yep, __free_pages_prepare() changes the page flag without holding
zone->lock.
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-09 18:39 ` Zi Yan
@ 2026-06-09 18:52 ` Zi Yan
2026-06-09 20:34 ` Michael S. Tsirkin
0 siblings, 1 reply; 27+ messages in thread
From: Zi Yan @ 2026-06-09 18:52 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Andrew Morton, Michael S. Tsirkin, linux-kernel, Miaohe Lin,
Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
Oscar Salvador, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
On 9 Jun 2026, at 14:39, Zi Yan wrote:
> On 9 Jun 2026, at 14:38, David Hildenbrand (Arm) wrote:
>
>> On 6/9/26 20:10, Andrew Morton wrote:
>>> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>
>>>> TestSetPageHWPoison() is called without zone->lock, so its atomic
>>>> update to page->flags can race with non-atomic flag operations
>>>> that run under zone->lock in the buddy allocator.
>>>>
>>>> In particular, __free_pages_prepare() does:
>>>>
>>>> page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>>>>
>>>> This non-atomic read-modify-write, while correctly excluding
>>>> __PG_HWPOISON from the mask, can still lose a concurrent
>>>> TestSetPageHWPoison if the read happens before the poison bit
>>>> is set and the write happens after. Will only get worse if/when
>>>> we add more non-atomic flag operations.
>>>>
>>>> Fix by acquiring zone->lock around TestSetPageHWPoison and
>>>> around ClearPageHWPoison in the retry path. This
>>>> serializes with all buddy flag manipulation. The cost is
>>>> negligible: one lock/unlock in an extremely rare path
>>>> (hardware memory errors).
>>>>
>>>> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
>>>> in this file operate on pages already removed from the buddy
>>>> allocator or on non-buddy pages (DAX, hugetlb), so they do not
>>>> need zone->lock protection.
>>>
>>> Sashiko is saying this doesn't do anything "Because
>>> __free_pages_prepare() executes entirely locklessly". Did it goof?
>>>
>>> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
>>
>> Battle of the bots: it's right.
>
> Yep, __free_pages_prepare() changes the page flag without holding
> zone->lock.
__free_pages_prepare() works on frozen pages and assumes no one else
touches the input page. To avoid this race, memory_failure() might
want to try_get_page() before TestClearPageHWPoison(), but I am not
sure if that works along with memory failure flow.
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-09 18:10 ` Andrew Morton
2026-06-09 18:38 ` David Hildenbrand (Arm)
@ 2026-06-09 20:24 ` Michael S. Tsirkin
1 sibling, 0 replies; 27+ messages in thread
From: Michael S. Tsirkin @ 2026-06-09 20:24 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, Miaohe Lin, David Hildenbrand (Arm), Jason Wang,
Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
On Tue, Jun 09, 2026 at 11:10:20AM -0700, Andrew Morton wrote:
> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> > TestSetPageHWPoison() is called without zone->lock, so its atomic
> > update to page->flags can race with non-atomic flag operations
> > that run under zone->lock in the buddy allocator.
> >
> > In particular, __free_pages_prepare() does:
> >
> > page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> >
> > This non-atomic read-modify-write, while correctly excluding
> > __PG_HWPOISON from the mask, can still lose a concurrent
> > TestSetPageHWPoison if the read happens before the poison bit
> > is set and the write happens after. Will only get worse if/when
> > we add more non-atomic flag operations.
> >
> > Fix by acquiring zone->lock around TestSetPageHWPoison and
> > around ClearPageHWPoison in the retry path. This
> > serializes with all buddy flag manipulation. The cost is
> > negligible: one lock/unlock in an extremely rare path
> > (hardware memory errors).
> >
> > Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> > in this file operate on pages already removed from the buddy
> > allocator or on non-buddy pages (DAX, hugetlb), so they do not
> > need zone->lock protection.
>
> Sashiko is saying this doesn't do anything "Because
> __free_pages_prepare() executes entirely locklessly". Did it goof?
>
> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
>
Oh. So it only helps with the prezero patches. Maybe other places where
flags are touched locklessly. Not __free_pages_prepare. I was too
focused on that. Scrap this please. I'll try to think of something.
--
MST
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-09 18:52 ` Zi Yan
@ 2026-06-09 20:34 ` Michael S. Tsirkin
2026-06-09 20:54 ` Zi Yan
0 siblings, 1 reply; 27+ messages in thread
From: Michael S. Tsirkin @ 2026-06-09 20:34 UTC (permalink / raw)
To: Zi Yan
Cc: David Hildenbrand (Arm), Andrew Morton, linux-kernel, Miaohe Lin,
Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
Oscar Salvador, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
On Tue, Jun 09, 2026 at 02:52:47PM -0400, Zi Yan wrote:
> On 9 Jun 2026, at 14:39, Zi Yan wrote:
>
> > On 9 Jun 2026, at 14:38, David Hildenbrand (Arm) wrote:
> >
> >> On 6/9/26 20:10, Andrew Morton wrote:
> >>> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>>
> >>>> TestSetPageHWPoison() is called without zone->lock, so its atomic
> >>>> update to page->flags can race with non-atomic flag operations
> >>>> that run under zone->lock in the buddy allocator.
> >>>>
> >>>> In particular, __free_pages_prepare() does:
> >>>>
> >>>> page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> >>>>
> >>>> This non-atomic read-modify-write, while correctly excluding
> >>>> __PG_HWPOISON from the mask, can still lose a concurrent
> >>>> TestSetPageHWPoison if the read happens before the poison bit
> >>>> is set and the write happens after. Will only get worse if/when
> >>>> we add more non-atomic flag operations.
> >>>>
> >>>> Fix by acquiring zone->lock around TestSetPageHWPoison and
> >>>> around ClearPageHWPoison in the retry path. This
> >>>> serializes with all buddy flag manipulation. The cost is
> >>>> negligible: one lock/unlock in an extremely rare path
> >>>> (hardware memory errors).
> >>>>
> >>>> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> >>>> in this file operate on pages already removed from the buddy
> >>>> allocator or on non-buddy pages (DAX, hugetlb), so they do not
> >>>> need zone->lock protection.
> >>>
> >>> Sashiko is saying this doesn't do anything "Because
> >>> __free_pages_prepare() executes entirely locklessly". Did it goof?
> >>>
> >>> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
> >>
> >> Battle of the bots: it's right.
> >
> > Yep, __free_pages_prepare() changes the page flag without holding
> > zone->lock.
>
> __free_pages_prepare() works on frozen pages and assumes no one else
> touches the input page. To avoid this race, memory_failure() might
> want to try_get_page() before TestClearPageHWPoison(), but I am not
> sure if that works along with memory failure flow.
>
> Best Regards,
> Yan, Zi
Actually memory failure already plays with this down the road no?
So maybe it's enough to just SetPageHWPoison afterwards again?
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ee42d4361309..4758fea94a96 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2415,6 +2415,7 @@ int memory_failure(unsigned long pfn, int flags)
if (!res) {
if (is_free_buddy_page(p)) {
if (take_page_off_buddy(p)) {
+ SetPageHWPoison(p);
page_ref_inc(p);
res = MF_RECOVERED;
} else {
and maybe in a bunch of other places in there?
--
MST
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-09 20:34 ` Michael S. Tsirkin
@ 2026-06-09 20:54 ` Zi Yan
2026-06-09 21:00 ` Michael S. Tsirkin
0 siblings, 1 reply; 27+ messages in thread
From: Zi Yan @ 2026-06-09 20:54 UTC (permalink / raw)
To: Michael S. Tsirkin, Miaohe Lin
Cc: David Hildenbrand (Arm), Andrew Morton, linux-kernel, Jason Wang,
Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
On 9 Jun 2026, at 16:34, Michael S. Tsirkin wrote:
> On Tue, Jun 09, 2026 at 02:52:47PM -0400, Zi Yan wrote:
>> On 9 Jun 2026, at 14:39, Zi Yan wrote:
>>
>>> On 9 Jun 2026, at 14:38, David Hildenbrand (Arm) wrote:
>>>
>>>> On 6/9/26 20:10, Andrew Morton wrote:
>>>>> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>>
>>>>>> TestSetPageHWPoison() is called without zone->lock, so its atomic
>>>>>> update to page->flags can race with non-atomic flag operations
>>>>>> that run under zone->lock in the buddy allocator.
>>>>>>
>>>>>> In particular, __free_pages_prepare() does:
>>>>>>
>>>>>> page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>>>>>>
>>>>>> This non-atomic read-modify-write, while correctly excluding
>>>>>> __PG_HWPOISON from the mask, can still lose a concurrent
>>>>>> TestSetPageHWPoison if the read happens before the poison bit
>>>>>> is set and the write happens after. Will only get worse if/when
>>>>>> we add more non-atomic flag operations.
>>>>>>
>>>>>> Fix by acquiring zone->lock around TestSetPageHWPoison and
>>>>>> around ClearPageHWPoison in the retry path. This
>>>>>> serializes with all buddy flag manipulation. The cost is
>>>>>> negligible: one lock/unlock in an extremely rare path
>>>>>> (hardware memory errors).
>>>>>>
>>>>>> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
>>>>>> in this file operate on pages already removed from the buddy
>>>>>> allocator or on non-buddy pages (DAX, hugetlb), so they do not
>>>>>> need zone->lock protection.
>>>>>
>>>>> Sashiko is saying this doesn't do anything "Because
>>>>> __free_pages_prepare() executes entirely locklessly". Did it goof?
>>>>>
>>>>> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
>>>>
>>>> Battle of the bots: it's right.
>>>
>>> Yep, __free_pages_prepare() changes the page flag without holding
>>> zone->lock.
>>
>> __free_pages_prepare() works on frozen pages and assumes no one else
>> touches the input page. To avoid this race, memory_failure() might
>> want to try_get_page() before TestClearPageHWPoison(), but I am not
>> sure if that works along with memory failure flow.
>>
>> Best Regards,
>> Yan, Zi
>
>
>
> Actually memory failure already plays with this down the road no?
>
> So maybe it's enough to just SetPageHWPoison afterwards again?
>
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index ee42d4361309..4758fea94a96 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2415,6 +2415,7 @@ int memory_failure(unsigned long pfn, int flags)
> if (!res) {
> if (is_free_buddy_page(p)) {
> if (take_page_off_buddy(p)) {
> + SetPageHWPoison(p);
> page_ref_inc(p);
> res = MF_RECOVERED;
> } else {
>
>
> and maybe in a bunch of other places in there?
You mean for fear of losing HWPoison flag in the earlier TestSetPageHWPoison(),
just set it again here? Why not do it after get_hwpoison_page(), since that
is the expected page flag? Miaohe probably can give a better answer here.
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-09 20:54 ` Zi Yan
@ 2026-06-09 21:00 ` Michael S. Tsirkin
2026-06-10 7:24 ` Miaohe Lin
0 siblings, 1 reply; 27+ messages in thread
From: Michael S. Tsirkin @ 2026-06-09 21:00 UTC (permalink / raw)
To: Zi Yan
Cc: Miaohe Lin, David Hildenbrand (Arm), Andrew Morton, linux-kernel,
Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
Oscar Salvador, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
On Tue, Jun 09, 2026 at 04:54:01PM -0400, Zi Yan wrote:
> On 9 Jun 2026, at 16:34, Michael S. Tsirkin wrote:
>
> > On Tue, Jun 09, 2026 at 02:52:47PM -0400, Zi Yan wrote:
> >> On 9 Jun 2026, at 14:39, Zi Yan wrote:
> >>
> >>> On 9 Jun 2026, at 14:38, David Hildenbrand (Arm) wrote:
> >>>
> >>>> On 6/9/26 20:10, Andrew Morton wrote:
> >>>>> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>>>>
> >>>>>> TestSetPageHWPoison() is called without zone->lock, so its atomic
> >>>>>> update to page->flags can race with non-atomic flag operations
> >>>>>> that run under zone->lock in the buddy allocator.
> >>>>>>
> >>>>>> In particular, __free_pages_prepare() does:
> >>>>>>
> >>>>>> page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> >>>>>>
> >>>>>> This non-atomic read-modify-write, while correctly excluding
> >>>>>> __PG_HWPOISON from the mask, can still lose a concurrent
> >>>>>> TestSetPageHWPoison if the read happens before the poison bit
> >>>>>> is set and the write happens after. Will only get worse if/when
> >>>>>> we add more non-atomic flag operations.
> >>>>>>
> >>>>>> Fix by acquiring zone->lock around TestSetPageHWPoison and
> >>>>>> around ClearPageHWPoison in the retry path. This
> >>>>>> serializes with all buddy flag manipulation. The cost is
> >>>>>> negligible: one lock/unlock in an extremely rare path
> >>>>>> (hardware memory errors).
> >>>>>>
> >>>>>> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> >>>>>> in this file operate on pages already removed from the buddy
> >>>>>> allocator or on non-buddy pages (DAX, hugetlb), so they do not
> >>>>>> need zone->lock protection.
> >>>>>
> >>>>> Sashiko is saying this doesn't do anything "Because
> >>>>> __free_pages_prepare() executes entirely locklessly". Did it goof?
> >>>>>
> >>>>> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
> >>>>
> >>>> Battle of the bots: it's right.
> >>>
> >>> Yep, __free_pages_prepare() changes the page flag without holding
> >>> zone->lock.
> >>
> >> __free_pages_prepare() works on frozen pages and assumes no one else
> >> touches the input page. To avoid this race, memory_failure() might
> >> want to try_get_page() before TestClearPageHWPoison(), but I am not
> >> sure if that works along with memory failure flow.
> >>
> >> Best Regards,
> >> Yan, Zi
> >
> >
> >
> > Actually memory failure already plays with this down the road no?
> >
> > So maybe it's enough to just SetPageHWPoison afterwards again?
> >
> >
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index ee42d4361309..4758fea94a96 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -2415,6 +2415,7 @@ int memory_failure(unsigned long pfn, int flags)
> > if (!res) {
> > if (is_free_buddy_page(p)) {
> > if (take_page_off_buddy(p)) {
> > + SetPageHWPoison(p);
> > page_ref_inc(p);
> > res = MF_RECOVERED;
> > } else {
> >
> >
> > and maybe in a bunch of other places in there?
>
> You mean for fear of losing HWPoison flag in the earlier TestSetPageHWPoison(),
> just set it again here?
Yea.
> Why not do it after get_hwpoison_page(), since that
> is the expected page flag?
It's still in the buddy at that point right? I'm worried buddy might
poke at flags.
> Miaohe probably can give a better answer here.
>
>
> Best Regards,
> Yan, Zi
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-09 21:00 ` Michael S. Tsirkin
@ 2026-06-10 7:24 ` Miaohe Lin
2026-06-10 7:35 ` Michael S. Tsirkin
2026-06-10 21:18 ` Michael S. Tsirkin
0 siblings, 2 replies; 27+ messages in thread
From: Miaohe Lin @ 2026-06-10 7:24 UTC (permalink / raw)
To: Michael S. Tsirkin, Zi Yan
Cc: David Hildenbrand (Arm), Andrew Morton, linux-kernel, Jason Wang,
Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
On 2026/6/10 5:00, Michael S. Tsirkin wrote:
> On Tue, Jun 09, 2026 at 04:54:01PM -0400, Zi Yan wrote:
>> On 9 Jun 2026, at 16:34, Michael S. Tsirkin wrote:
>>
>>> On Tue, Jun 09, 2026 at 02:52:47PM -0400, Zi Yan wrote:
>>>> On 9 Jun 2026, at 14:39, Zi Yan wrote:
>>>>
>>>>> On 9 Jun 2026, at 14:38, David Hildenbrand (Arm) wrote:
>>>>>
>>>>>> On 6/9/26 20:10, Andrew Morton wrote:
>>>>>>> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>>>>
>>>>>>>> TestSetPageHWPoison() is called without zone->lock, so its atomic
>>>>>>>> update to page->flags can race with non-atomic flag operations
>>>>>>>> that run under zone->lock in the buddy allocator.
>>>>>>>>
>>>>>>>> In particular, __free_pages_prepare() does:
>>>>>>>>
>>>>>>>> page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>>>>>>>>
>>>>>>>> This non-atomic read-modify-write, while correctly excluding
>>>>>>>> __PG_HWPOISON from the mask, can still lose a concurrent
>>>>>>>> TestSetPageHWPoison if the read happens before the poison bit
>>>>>>>> is set and the write happens after. Will only get worse if/when
>>>>>>>> we add more non-atomic flag operations.
>>>>>>>>
>>>>>>>> Fix by acquiring zone->lock around TestSetPageHWPoison and
>>>>>>>> around ClearPageHWPoison in the retry path. This
>>>>>>>> serializes with all buddy flag manipulation. The cost is
>>>>>>>> negligible: one lock/unlock in an extremely rare path
>>>>>>>> (hardware memory errors).
>>>>>>>>
>>>>>>>> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
>>>>>>>> in this file operate on pages already removed from the buddy
>>>>>>>> allocator or on non-buddy pages (DAX, hugetlb), so they do not
>>>>>>>> need zone->lock protection.
>>>>>>>
>>>>>>> Sashiko is saying this doesn't do anything "Because
>>>>>>> __free_pages_prepare() executes entirely locklessly". Did it goof?
>>>>>>>
>>>>>>> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
>>>>>>
>>>>>> Battle of the bots: it's right.
>>>>>
>>>>> Yep, __free_pages_prepare() changes the page flag without holding
>>>>> zone->lock.
>>>>
>>>> __free_pages_prepare() works on frozen pages and assumes no one else
>>>> touches the input page. To avoid this race, memory_failure() might
>>>> want to try_get_page() before TestClearPageHWPoison(), but I am not
>>>> sure if that works along with memory failure flow.
>>>>
>>>> Best Regards,
>>>> Yan, Zi
>>>
>>>
>>>
>>> Actually memory failure already plays with this down the road no?
>>>
>>> So maybe it's enough to just SetPageHWPoison afterwards again?
>>>
>>>
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index ee42d4361309..4758fea94a96 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -2415,6 +2415,7 @@ int memory_failure(unsigned long pfn, int flags)
>>> if (!res) {
>>> if (is_free_buddy_page(p)) {
>>> if (take_page_off_buddy(p)) {
>>> + SetPageHWPoison(p);
>>> page_ref_inc(p);
>>> res = MF_RECOVERED;
>>> } else {
>>>
>>>
>>> and maybe in a bunch of other places in there?
>>
>> You mean for fear of losing HWPoison flag in the earlier TestSetPageHWPoison(),
>> just set it again here?
>
> Yea.
>
>> Why not do it after get_hwpoison_page(), since that
>> is the expected page flag?
>
> It's still in the buddy at that point right? I'm worried buddy might
> poke at flags.
Since __free_pages_prepare() executes entirely locklessly, the only way to ensure
HWPoison flag won't be lost might be only set hwpoison flag iff we can make sure
pages are not on the way to buddy...
Thanks.
.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-10 7:24 ` Miaohe Lin
@ 2026-06-10 7:35 ` Michael S. Tsirkin
2026-06-10 21:18 ` Michael S. Tsirkin
1 sibling, 0 replies; 27+ messages in thread
From: Michael S. Tsirkin @ 2026-06-10 7:35 UTC (permalink / raw)
To: Miaohe Lin
Cc: Zi Yan, David Hildenbrand (Arm), Andrew Morton, linux-kernel,
Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
Oscar Salvador, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
On Wed, Jun 10, 2026 at 03:24:30PM +0800, Miaohe Lin wrote:
> On 2026/6/10 5:00, Michael S. Tsirkin wrote:
> > On Tue, Jun 09, 2026 at 04:54:01PM -0400, Zi Yan wrote:
> >> On 9 Jun 2026, at 16:34, Michael S. Tsirkin wrote:
> >>
> >>> On Tue, Jun 09, 2026 at 02:52:47PM -0400, Zi Yan wrote:
> >>>> On 9 Jun 2026, at 14:39, Zi Yan wrote:
> >>>>
> >>>>> On 9 Jun 2026, at 14:38, David Hildenbrand (Arm) wrote:
> >>>>>
> >>>>>> On 6/9/26 20:10, Andrew Morton wrote:
> >>>>>>> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>>>>>>
> >>>>>>>> TestSetPageHWPoison() is called without zone->lock, so its atomic
> >>>>>>>> update to page->flags can race with non-atomic flag operations
> >>>>>>>> that run under zone->lock in the buddy allocator.
> >>>>>>>>
> >>>>>>>> In particular, __free_pages_prepare() does:
> >>>>>>>>
> >>>>>>>> page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> >>>>>>>>
> >>>>>>>> This non-atomic read-modify-write, while correctly excluding
> >>>>>>>> __PG_HWPOISON from the mask, can still lose a concurrent
> >>>>>>>> TestSetPageHWPoison if the read happens before the poison bit
> >>>>>>>> is set and the write happens after. Will only get worse if/when
> >>>>>>>> we add more non-atomic flag operations.
> >>>>>>>>
> >>>>>>>> Fix by acquiring zone->lock around TestSetPageHWPoison and
> >>>>>>>> around ClearPageHWPoison in the retry path. This
> >>>>>>>> serializes with all buddy flag manipulation. The cost is
> >>>>>>>> negligible: one lock/unlock in an extremely rare path
> >>>>>>>> (hardware memory errors).
> >>>>>>>>
> >>>>>>>> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> >>>>>>>> in this file operate on pages already removed from the buddy
> >>>>>>>> allocator or on non-buddy pages (DAX, hugetlb), so they do not
> >>>>>>>> need zone->lock protection.
> >>>>>>>
> >>>>>>> Sashiko is saying this doesn't do anything "Because
> >>>>>>> __free_pages_prepare() executes entirely locklessly". Did it goof?
> >>>>>>>
> >>>>>>> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
> >>>>>>
> >>>>>> Battle of the bots: it's right.
> >>>>>
> >>>>> Yep, __free_pages_prepare() changes the page flag without holding
> >>>>> zone->lock.
> >>>>
> >>>> __free_pages_prepare() works on frozen pages and assumes no one else
> >>>> touches the input page. To avoid this race, memory_failure() might
> >>>> want to try_get_page() before TestClearPageHWPoison(), but I am not
> >>>> sure if that works along with memory failure flow.
> >>>>
> >>>> Best Regards,
> >>>> Yan, Zi
> >>>
> >>>
> >>>
> >>> Actually memory failure already plays with this down the road no?
> >>>
> >>> So maybe it's enough to just SetPageHWPoison afterwards again?
> >>>
> >>>
> >>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >>> index ee42d4361309..4758fea94a96 100644
> >>> --- a/mm/memory-failure.c
> >>> +++ b/mm/memory-failure.c
> >>> @@ -2415,6 +2415,7 @@ int memory_failure(unsigned long pfn, int flags)
> >>> if (!res) {
> >>> if (is_free_buddy_page(p)) {
> >>> if (take_page_off_buddy(p)) {
> >>> + SetPageHWPoison(p);
> >>> page_ref_inc(p);
> >>> res = MF_RECOVERED;
> >>> } else {
> >>>
> >>>
> >>> and maybe in a bunch of other places in there?
> >>
> >> You mean for fear of losing HWPoison flag in the earlier TestSetPageHWPoison(),
> >> just set it again here?
> >
> > Yea.
> >
> >> Why not do it after get_hwpoison_page(), since that
> >> is the expected page flag?
> >
> > It's still in the buddy at that point right? I'm worried buddy might
> > poke at flags.
>
> Since __free_pages_prepare() executes entirely locklessly, the only way to ensure
> HWPoison flag won't be lost might be only set hwpoison flag iff we can make sure
> pages are not on the way to buddy...
>
> Thanks.
> .
Right so here after take_page_off_buddy it's ok, for example.
--
MST
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-10 7:24 ` Miaohe Lin
2026-06-10 7:35 ` Michael S. Tsirkin
@ 2026-06-10 21:18 ` Michael S. Tsirkin
2026-06-11 3:35 ` Miaohe Lin
1 sibling, 1 reply; 27+ messages in thread
From: Michael S. Tsirkin @ 2026-06-10 21:18 UTC (permalink / raw)
To: Miaohe Lin
Cc: Zi Yan, David Hildenbrand (Arm), Andrew Morton, linux-kernel,
Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
Oscar Salvador, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
On Wed, Jun 10, 2026 at 03:24:30PM +0800, Miaohe Lin wrote:
> On 2026/6/10 5:00, Michael S. Tsirkin wrote:
> > On Tue, Jun 09, 2026 at 04:54:01PM -0400, Zi Yan wrote:
> >> On 9 Jun 2026, at 16:34, Michael S. Tsirkin wrote:
> >>
> >>> On Tue, Jun 09, 2026 at 02:52:47PM -0400, Zi Yan wrote:
> >>>> On 9 Jun 2026, at 14:39, Zi Yan wrote:
> >>>>
> >>>>> On 9 Jun 2026, at 14:38, David Hildenbrand (Arm) wrote:
> >>>>>
> >>>>>> On 6/9/26 20:10, Andrew Morton wrote:
> >>>>>>> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>>>>>>
> >>>>>>>> TestSetPageHWPoison() is called without zone->lock, so its atomic
> >>>>>>>> update to page->flags can race with non-atomic flag operations
> >>>>>>>> that run under zone->lock in the buddy allocator.
> >>>>>>>>
> >>>>>>>> In particular, __free_pages_prepare() does:
> >>>>>>>>
> >>>>>>>> page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> >>>>>>>>
> >>>>>>>> This non-atomic read-modify-write, while correctly excluding
> >>>>>>>> __PG_HWPOISON from the mask, can still lose a concurrent
> >>>>>>>> TestSetPageHWPoison if the read happens before the poison bit
> >>>>>>>> is set and the write happens after. Will only get worse if/when
> >>>>>>>> we add more non-atomic flag operations.
> >>>>>>>>
> >>>>>>>> Fix by acquiring zone->lock around TestSetPageHWPoison and
> >>>>>>>> around ClearPageHWPoison in the retry path. This
> >>>>>>>> serializes with all buddy flag manipulation. The cost is
> >>>>>>>> negligible: one lock/unlock in an extremely rare path
> >>>>>>>> (hardware memory errors).
> >>>>>>>>
> >>>>>>>> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> >>>>>>>> in this file operate on pages already removed from the buddy
> >>>>>>>> allocator or on non-buddy pages (DAX, hugetlb), so they do not
> >>>>>>>> need zone->lock protection.
> >>>>>>>
> >>>>>>> Sashiko is saying this doesn't do anything "Because
> >>>>>>> __free_pages_prepare() executes entirely locklessly". Did it goof?
> >>>>>>>
> >>>>>>> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
> >>>>>>
> >>>>>> Battle of the bots: it's right.
> >>>>>
> >>>>> Yep, __free_pages_prepare() changes the page flag without holding
> >>>>> zone->lock.
> >>>>
> >>>> __free_pages_prepare() works on frozen pages and assumes no one else
> >>>> touches the input page. To avoid this race, memory_failure() might
> >>>> want to try_get_page() before TestClearPageHWPoison(), but I am not
> >>>> sure if that works along with memory failure flow.
> >>>>
> >>>> Best Regards,
> >>>> Yan, Zi
> >>>
> >>>
> >>>
> >>> Actually memory failure already plays with this down the road no?
> >>>
> >>> So maybe it's enough to just SetPageHWPoison afterwards again?
> >>>
> >>>
> >>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >>> index ee42d4361309..4758fea94a96 100644
> >>> --- a/mm/memory-failure.c
> >>> +++ b/mm/memory-failure.c
> >>> @@ -2415,6 +2415,7 @@ int memory_failure(unsigned long pfn, int flags)
> >>> if (!res) {
> >>> if (is_free_buddy_page(p)) {
> >>> if (take_page_off_buddy(p)) {
> >>> + SetPageHWPoison(p);
> >>> page_ref_inc(p);
> >>> res = MF_RECOVERED;
> >>> } else {
> >>>
> >>>
> >>> and maybe in a bunch of other places in there?
> >>
> >> You mean for fear of losing HWPoison flag in the earlier TestSetPageHWPoison(),
> >> just set it again here?
> >
> > Yea.
> >
> >> Why not do it after get_hwpoison_page(), since that
> >> is the expected page flag?
> >
> > It's still in the buddy at that point right? I'm worried buddy might
> > poke at flags.
>
> Since __free_pages_prepare() executes entirely locklessly, the only way to ensure
> HWPoison flag won't be lost might be only set hwpoison flag iff we can make sure
> pages are not on the way to buddy...
>
> Thanks.
> .
To clarify do you not agree repeating SetPageHWPoison is enough for
this? And if not, do you have suggestions on how to fix this race?
Thanks a lot,
--
MST
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-10 21:18 ` Michael S. Tsirkin
@ 2026-06-11 3:35 ` Miaohe Lin
2026-06-11 5:43 ` Michael S. Tsirkin
0 siblings, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2026-06-11 3:35 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Zi Yan, David Hildenbrand (Arm), Andrew Morton, linux-kernel,
Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
Oscar Salvador, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
On 2026/6/11 5:18, Michael S. Tsirkin wrote:
> On Wed, Jun 10, 2026 at 03:24:30PM +0800, Miaohe Lin wrote:
>> On 2026/6/10 5:00, Michael S. Tsirkin wrote:
>>> On Tue, Jun 09, 2026 at 04:54:01PM -0400, Zi Yan wrote:
>>>> On 9 Jun 2026, at 16:34, Michael S. Tsirkin wrote:
>>>>
>>>>> On Tue, Jun 09, 2026 at 02:52:47PM -0400, Zi Yan wrote:
>>>>>> On 9 Jun 2026, at 14:39, Zi Yan wrote:
>>>>>>
>>>>>>> On 9 Jun 2026, at 14:38, David Hildenbrand (Arm) wrote:
>>>>>>>
>>>>>>>> On 6/9/26 20:10, Andrew Morton wrote:
>>>>>>>>> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>>> TestSetPageHWPoison() is called without zone->lock, so its atomic
>>>>>>>>>> update to page->flags can race with non-atomic flag operations
>>>>>>>>>> that run under zone->lock in the buddy allocator.
>>>>>>>>>>
>>>>>>>>>> In particular, __free_pages_prepare() does:
>>>>>>>>>>
>>>>>>>>>> page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>>>>>>>>>>
>>>>>>>>>> This non-atomic read-modify-write, while correctly excluding
>>>>>>>>>> __PG_HWPOISON from the mask, can still lose a concurrent
>>>>>>>>>> TestSetPageHWPoison if the read happens before the poison bit
>>>>>>>>>> is set and the write happens after. Will only get worse if/when
>>>>>>>>>> we add more non-atomic flag operations.
>>>>>>>>>>
>>>>>>>>>> Fix by acquiring zone->lock around TestSetPageHWPoison and
>>>>>>>>>> around ClearPageHWPoison in the retry path. This
>>>>>>>>>> serializes with all buddy flag manipulation. The cost is
>>>>>>>>>> negligible: one lock/unlock in an extremely rare path
>>>>>>>>>> (hardware memory errors).
>>>>>>>>>>
>>>>>>>>>> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
>>>>>>>>>> in this file operate on pages already removed from the buddy
>>>>>>>>>> allocator or on non-buddy pages (DAX, hugetlb), so they do not
>>>>>>>>>> need zone->lock protection.
>>>>>>>>>
>>>>>>>>> Sashiko is saying this doesn't do anything "Because
>>>>>>>>> __free_pages_prepare() executes entirely locklessly". Did it goof?
>>>>>>>>>
>>>>>>>>> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
>>>>>>>>
>>>>>>>> Battle of the bots: it's right.
>>>>>>>
>>>>>>> Yep, __free_pages_prepare() changes the page flag without holding
>>>>>>> zone->lock.
>>>>>>
>>>>>> __free_pages_prepare() works on frozen pages and assumes no one else
>>>>>> touches the input page. To avoid this race, memory_failure() might
>>>>>> want to try_get_page() before TestClearPageHWPoison(), but I am not
>>>>>> sure if that works along with memory failure flow.
>>>>>>
>>>>>> Best Regards,
>>>>>> Yan, Zi
>>>>>
>>>>>
>>>>>
>>>>> Actually memory failure already plays with this down the road no?
>>>>>
>>>>> So maybe it's enough to just SetPageHWPoison afterwards again?
>>>>>
>>>>>
>>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>>> index ee42d4361309..4758fea94a96 100644
>>>>> --- a/mm/memory-failure.c
>>>>> +++ b/mm/memory-failure.c
>>>>> @@ -2415,6 +2415,7 @@ int memory_failure(unsigned long pfn, int flags)
>>>>> if (!res) {
>>>>> if (is_free_buddy_page(p)) {
>>>>> if (take_page_off_buddy(p)) {
>>>>> + SetPageHWPoison(p);
>>>>> page_ref_inc(p);
>>>>> res = MF_RECOVERED;
>>>>> } else {
>>>>>
>>>>>
>>>>> and maybe in a bunch of other places in there?
>>>>
>>>> You mean for fear of losing HWPoison flag in the earlier TestSetPageHWPoison(),
>>>> just set it again here?
>>>
>>> Yea.
>>>
>>>> Why not do it after get_hwpoison_page(), since that
>>>> is the expected page flag?
>>>
>>> It's still in the buddy at that point right? I'm worried buddy might
>>> poke at flags.
>>
>> Since __free_pages_prepare() executes entirely locklessly, the only way to ensure
>> HWPoison flag won't be lost might be only set hwpoison flag iff we can make sure
>> pages are not on the way to buddy...
>>
>> Thanks.
>> .
>
>
> To clarify do you not agree repeating SetPageHWPoison is enough for
> this? And if not, do you have suggestions on how to fix this race?
Do you mean repeating SetPageHWPoison on every branch? Is it possible
to make __free_pages_prepare changes page->flags atomically or this race
is specified to memory_failure?
Thanks.
.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-11 3:35 ` Miaohe Lin
@ 2026-06-11 5:43 ` Michael S. Tsirkin
2026-06-11 7:36 ` Miaohe Lin
0 siblings, 1 reply; 27+ messages in thread
From: Michael S. Tsirkin @ 2026-06-11 5:43 UTC (permalink / raw)
To: Miaohe Lin
Cc: Zi Yan, David Hildenbrand (Arm), Andrew Morton, linux-kernel,
Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
Oscar Salvador, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
On Thu, Jun 11, 2026 at 11:35:36AM +0800, Miaohe Lin wrote:
> On 2026/6/11 5:18, Michael S. Tsirkin wrote:
> > On Wed, Jun 10, 2026 at 03:24:30PM +0800, Miaohe Lin wrote:
> >> On 2026/6/10 5:00, Michael S. Tsirkin wrote:
> >>> On Tue, Jun 09, 2026 at 04:54:01PM -0400, Zi Yan wrote:
> >>>> On 9 Jun 2026, at 16:34, Michael S. Tsirkin wrote:
> >>>>
> >>>>> On Tue, Jun 09, 2026 at 02:52:47PM -0400, Zi Yan wrote:
> >>>>>> On 9 Jun 2026, at 14:39, Zi Yan wrote:
> >>>>>>
> >>>>>>> On 9 Jun 2026, at 14:38, David Hildenbrand (Arm) wrote:
> >>>>>>>
> >>>>>>>> On 6/9/26 20:10, Andrew Morton wrote:
> >>>>>>>>> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> TestSetPageHWPoison() is called without zone->lock, so its atomic
> >>>>>>>>>> update to page->flags can race with non-atomic flag operations
> >>>>>>>>>> that run under zone->lock in the buddy allocator.
> >>>>>>>>>>
> >>>>>>>>>> In particular, __free_pages_prepare() does:
> >>>>>>>>>>
> >>>>>>>>>> page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> >>>>>>>>>>
> >>>>>>>>>> This non-atomic read-modify-write, while correctly excluding
> >>>>>>>>>> __PG_HWPOISON from the mask, can still lose a concurrent
> >>>>>>>>>> TestSetPageHWPoison if the read happens before the poison bit
> >>>>>>>>>> is set and the write happens after. Will only get worse if/when
> >>>>>>>>>> we add more non-atomic flag operations.
> >>>>>>>>>>
> >>>>>>>>>> Fix by acquiring zone->lock around TestSetPageHWPoison and
> >>>>>>>>>> around ClearPageHWPoison in the retry path. This
> >>>>>>>>>> serializes with all buddy flag manipulation. The cost is
> >>>>>>>>>> negligible: one lock/unlock in an extremely rare path
> >>>>>>>>>> (hardware memory errors).
> >>>>>>>>>>
> >>>>>>>>>> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> >>>>>>>>>> in this file operate on pages already removed from the buddy
> >>>>>>>>>> allocator or on non-buddy pages (DAX, hugetlb), so they do not
> >>>>>>>>>> need zone->lock protection.
> >>>>>>>>>
> >>>>>>>>> Sashiko is saying this doesn't do anything "Because
> >>>>>>>>> __free_pages_prepare() executes entirely locklessly". Did it goof?
> >>>>>>>>>
> >>>>>>>>> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
> >>>>>>>>
> >>>>>>>> Battle of the bots: it's right.
> >>>>>>>
> >>>>>>> Yep, __free_pages_prepare() changes the page flag without holding
> >>>>>>> zone->lock.
> >>>>>>
> >>>>>> __free_pages_prepare() works on frozen pages and assumes no one else
> >>>>>> touches the input page. To avoid this race, memory_failure() might
> >>>>>> want to try_get_page() before TestClearPageHWPoison(), but I am not
> >>>>>> sure if that works along with memory failure flow.
> >>>>>>
> >>>>>> Best Regards,
> >>>>>> Yan, Zi
> >>>>>
> >>>>>
> >>>>>
> >>>>> Actually memory failure already plays with this down the road no?
> >>>>>
> >>>>> So maybe it's enough to just SetPageHWPoison afterwards again?
> >>>>>
> >>>>>
> >>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >>>>> index ee42d4361309..4758fea94a96 100644
> >>>>> --- a/mm/memory-failure.c
> >>>>> +++ b/mm/memory-failure.c
> >>>>> @@ -2415,6 +2415,7 @@ int memory_failure(unsigned long pfn, int flags)
> >>>>> if (!res) {
> >>>>> if (is_free_buddy_page(p)) {
> >>>>> if (take_page_off_buddy(p)) {
> >>>>> + SetPageHWPoison(p);
> >>>>> page_ref_inc(p);
> >>>>> res = MF_RECOVERED;
> >>>>> } else {
> >>>>>
> >>>>>
> >>>>> and maybe in a bunch of other places in there?
> >>>>
> >>>> You mean for fear of losing HWPoison flag in the earlier TestSetPageHWPoison(),
> >>>> just set it again here?
> >>>
> >>> Yea.
> >>>
> >>>> Why not do it after get_hwpoison_page(), since that
> >>>> is the expected page flag?
> >>>
> >>> It's still in the buddy at that point right? I'm worried buddy might
> >>> poke at flags.
> >>
> >> Since __free_pages_prepare() executes entirely locklessly, the only way to ensure
> >> HWPoison flag won't be lost might be only set hwpoison flag iff we can make sure
> >> pages are not on the way to buddy...
> >>
> >> Thanks.
> >> .
> >
> >
> > To clarify do you not agree repeating SetPageHWPoison is enough for
> > this? And if not, do you have suggestions on how to fix this race?
>
> Do you mean repeating SetPageHWPoison on every branch?
Right.
> Is it possible
> to make __free_pages_prepare changes page->flags atomically or this race
> is specified to memory_failure?
>
> Thanks.
> .
Adding an atomic op on every fast path page allocation is, I am
guessing, going to slow down Linux measureably.
Doing it for the benefit of memory_failure, which is the slowest of
slow paths, seems unpalatable, to me.
Neither am I sure it's the only racy place -
grep for __SetPage and __ClearPage - all these have the same issue, I
suspect.
At the same time, I'm not an mm maintainer. If you disagree, try to
upstream a change converting all non atomics in mm to atomics, and see
what others say.
--
MST
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-09 18:38 ` David Hildenbrand (Arm)
2026-06-09 18:39 ` Zi Yan
@ 2026-06-11 6:33 ` Michael S. Tsirkin
2026-06-11 11:33 ` David Hildenbrand (Arm)
1 sibling, 1 reply; 27+ messages in thread
From: Michael S. Tsirkin @ 2026-06-11 6:33 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Andrew Morton, linux-kernel, Miaohe Lin, Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
On Tue, Jun 09, 2026 at 08:38:09PM +0200, David Hildenbrand (Arm) wrote:
> On 6/9/26 20:10, Andrew Morton wrote:
> > On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >
> >> TestSetPageHWPoison() is called without zone->lock, so its atomic
> >> update to page->flags can race with non-atomic flag operations
> >> that run under zone->lock in the buddy allocator.
> >>
> >> In particular, __free_pages_prepare() does:
> >>
> >> page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> >>
> >> This non-atomic read-modify-write, while correctly excluding
> >> __PG_HWPOISON from the mask, can still lose a concurrent
> >> TestSetPageHWPoison if the read happens before the poison bit
> >> is set and the write happens after. Will only get worse if/when
> >> we add more non-atomic flag operations.
> >>
> >> Fix by acquiring zone->lock around TestSetPageHWPoison and
> >> around ClearPageHWPoison in the retry path. This
> >> serializes with all buddy flag manipulation. The cost is
> >> negligible: one lock/unlock in an extremely rare path
> >> (hardware memory errors).
> >>
> >> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> >> in this file operate on pages already removed from the buddy
> >> allocator or on non-buddy pages (DAX, hugetlb), so they do not
> >> need zone->lock protection.
> >
> > Sashiko is saying this doesn't do anything "Because
> > __free_pages_prepare() executes entirely locklessly". Did it goof?
> >
> > https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
>
> Battle of the bots: it's right.
Ugh it's bot against human - I remembered we have zone lock
normally in alloc and thought it helps and didn't double check we don't
have it here . The bot wins (
> --
> Cheers,
>
> Davidc
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-11 5:43 ` Michael S. Tsirkin
@ 2026-06-11 7:36 ` Miaohe Lin
2026-06-11 13:20 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2026-06-11 7:36 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Zi Yan, David Hildenbrand (Arm), Andrew Morton, linux-kernel,
Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
Oscar Salvador, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
On 2026/6/11 13:43, Michael S. Tsirkin wrote:
> On Thu, Jun 11, 2026 at 11:35:36AM +0800, Miaohe Lin wrote:
>> On 2026/6/11 5:18, Michael S. Tsirkin wrote:
>>> On Wed, Jun 10, 2026 at 03:24:30PM +0800, Miaohe Lin wrote:
>>>> On 2026/6/10 5:00, Michael S. Tsirkin wrote:
>>>>> On Tue, Jun 09, 2026 at 04:54:01PM -0400, Zi Yan wrote:
>>>>>> On 9 Jun 2026, at 16:34, Michael S. Tsirkin wrote:
>>>>>>
>>>>>>> On Tue, Jun 09, 2026 at 02:52:47PM -0400, Zi Yan wrote:
>>>>>>>> On 9 Jun 2026, at 14:39, Zi Yan wrote:
>>>>>>>>
>>>>>>>>> On 9 Jun 2026, at 14:38, David Hildenbrand (Arm) wrote:
>>>>>>>>>
>>>>>>>>>> On 6/9/26 20:10, Andrew Morton wrote:
>>>>>>>>>>> On Tue, 9 Jun 2026 06:12:49 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> TestSetPageHWPoison() is called without zone->lock, so its atomic
>>>>>>>>>>>> update to page->flags can race with non-atomic flag operations
>>>>>>>>>>>> that run under zone->lock in the buddy allocator.
>>>>>>>>>>>>
>>>>>>>>>>>> In particular, __free_pages_prepare() does:
>>>>>>>>>>>>
>>>>>>>>>>>> page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>>>>>>>>>>>>
>>>>>>>>>>>> This non-atomic read-modify-write, while correctly excluding
>>>>>>>>>>>> __PG_HWPOISON from the mask, can still lose a concurrent
>>>>>>>>>>>> TestSetPageHWPoison if the read happens before the poison bit
>>>>>>>>>>>> is set and the write happens after. Will only get worse if/when
>>>>>>>>>>>> we add more non-atomic flag operations.
>>>>>>>>>>>>
>>>>>>>>>>>> Fix by acquiring zone->lock around TestSetPageHWPoison and
>>>>>>>>>>>> around ClearPageHWPoison in the retry path. This
>>>>>>>>>>>> serializes with all buddy flag manipulation. The cost is
>>>>>>>>>>>> negligible: one lock/unlock in an extremely rare path
>>>>>>>>>>>> (hardware memory errors).
>>>>>>>>>>>>
>>>>>>>>>>>> Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
>>>>>>>>>>>> in this file operate on pages already removed from the buddy
>>>>>>>>>>>> allocator or on non-buddy pages (DAX, hugetlb), so they do not
>>>>>>>>>>>> need zone->lock protection.
>>>>>>>>>>>
>>>>>>>>>>> Sashiko is saying this doesn't do anything "Because
>>>>>>>>>>> __free_pages_prepare() executes entirely locklessly". Did it goof?
>>>>>>>>>>>
>>>>>>>>>>> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
>>>>>>>>>>
>>>>>>>>>> Battle of the bots: it's right.
>>>>>>>>>
>>>>>>>>> Yep, __free_pages_prepare() changes the page flag without holding
>>>>>>>>> zone->lock.
>>>>>>>>
>>>>>>>> __free_pages_prepare() works on frozen pages and assumes no one else
>>>>>>>> touches the input page. To avoid this race, memory_failure() might
>>>>>>>> want to try_get_page() before TestClearPageHWPoison(), but I am not
>>>>>>>> sure if that works along with memory failure flow.
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Yan, Zi
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Actually memory failure already plays with this down the road no?
>>>>>>>
>>>>>>> So maybe it's enough to just SetPageHWPoison afterwards again?
>>>>>>>
>>>>>>>
>>>>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>>>>> index ee42d4361309..4758fea94a96 100644
>>>>>>> --- a/mm/memory-failure.c
>>>>>>> +++ b/mm/memory-failure.c
>>>>>>> @@ -2415,6 +2415,7 @@ int memory_failure(unsigned long pfn, int flags)
>>>>>>> if (!res) {
>>>>>>> if (is_free_buddy_page(p)) {
>>>>>>> if (take_page_off_buddy(p)) {
>>>>>>> + SetPageHWPoison(p);
>>>>>>> page_ref_inc(p);
>>>>>>> res = MF_RECOVERED;
>>>>>>> } else {
>>>>>>>
>>>>>>>
>>>>>>> and maybe in a bunch of other places in there?
>>>>>>
>>>>>> You mean for fear of losing HWPoison flag in the earlier TestSetPageHWPoison(),
>>>>>> just set it again here?
>>>>>
>>>>> Yea.
>>>>>
>>>>>> Why not do it after get_hwpoison_page(), since that
>>>>>> is the expected page flag?
>>>>>
>>>>> It's still in the buddy at that point right? I'm worried buddy might
>>>>> poke at flags.
>>>>
>>>> Since __free_pages_prepare() executes entirely locklessly, the only way to ensure
>>>> HWPoison flag won't be lost might be only set hwpoison flag iff we can make sure
>>>> pages are not on the way to buddy...
>>>>
>>>> Thanks.
>>>> .
>>>
>>>
>>> To clarify do you not agree repeating SetPageHWPoison is enough for
>>> this? And if not, do you have suggestions on how to fix this race?
>>
>> Do you mean repeating SetPageHWPoison on every branch?
>
> Right.
>
>> Is it possible
>> to make __free_pages_prepare changes page->flags atomically or this race
>> is specified to memory_failure?
>>
>> Thanks.
>> .
>
>
> Adding an atomic op on every fast path page allocation is, I am
> guessing, going to slow down Linux measureably.
>
> Doing it for the benefit of memory_failure, which is the slowest of
> slow paths, seems unpalatable, to me.
Agree, it's not worth to do so.
>
> Neither am I sure it's the only racy place -
> grep for __SetPage and __ClearPage - all these have the same issue, I
> suspect.
>
> At the same time, I'm not an mm maintainer. If you disagree, try to
> upstream a change converting all non atomics in mm to atomics, and see
> what others say.
Since memory_failure might be the only place, this change would be unacceptable.
We should come up with a better solution. Maybe we can try repeating SetPageHWPoison
and ClearPageHWPoison at a first attempt though it looks somewhat weird to me and makes
code more complicated. But it's already complicated. :)
Thanks.
.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-11 6:33 ` Michael S. Tsirkin
@ 2026-06-11 11:33 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-11 11:33 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Andrew Morton, linux-kernel, Miaohe Lin, Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
On 6/11/26 08:33, Michael S. Tsirkin wrote:
> On Tue, Jun 09, 2026 at 08:38:09PM +0200, David Hildenbrand (Arm) wrote:
>> On 6/9/26 20:10, Andrew Morton wrote:
>>>
>>>
>>> Sashiko is saying this doesn't do anything "Because
>>> __free_pages_prepare() executes entirely locklessly". Did it goof?
>>>
>>> https://sashiko.dev/#/patchset/df06b66fe4ff8e925ee0714955abc2183a727b90.1780998980.git.mst@redhat.com
>>
>> Battle of the bots: it's right.
>
> Ugh it's bot against human - I remembered we have zone lock
> normally in alloc and thought it helps and didn't double check we don't
> have it here . The bot wins (
Heh, I was primarily looking whether taking the lock would be harmful, not if it
would be useful. :)
--
Cheers,
David
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-11 7:36 ` Miaohe Lin
@ 2026-06-11 13:20 ` David Hildenbrand (Arm)
2026-06-15 3:29 ` Miaohe Lin
0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-11 13:20 UTC (permalink / raw)
To: Miaohe Lin, Michael S. Tsirkin
Cc: Zi Yan, Andrew Morton, linux-kernel, Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
On 6/11/26 09:36, Miaohe Lin wrote:
> On 2026/6/11 13:43, Michael S. Tsirkin wrote:
>> On Thu, Jun 11, 2026 at 11:35:36AM +0800, Miaohe Lin wrote:
>>>
>>> Do you mean repeating SetPageHWPoison on every branch?
>>
>> Right.
>>
>>> Is it possible
>>> to make __free_pages_prepare changes page->flags atomically or this race
>>> is specified to memory_failure?
>>>
>>> Thanks.
>>> .
>>
>>
>> Adding an atomic op on every fast path page allocation is, I am
>> guessing, going to slow down Linux measureably.
>>
>> Doing it for the benefit of memory_failure, which is the slowest of
>> slow paths, seems unpalatable, to me.
>
> Agree, it's not worth to do so.
>
>>
>> Neither am I sure it's the only racy place -
>> grep for __SetPage and __ClearPage - all these have the same issue, I
>> suspect.
>>
>> At the same time, I'm not an mm maintainer. If you disagree, try to
>> upstream a change converting all non atomics in mm to atomics, and see
>> what others say.
>
> Since memory_failure might be the only place, this change would be unacceptable.
> We should come up with a better solution. Maybe we can try repeating SetPageHWPoison
> and ClearPageHWPoison at a first attempt though it looks somewhat weird to me and makes
> code more complicated.
And I am fairly sure we could still have some remaining races ... it's shaky.
Hm ...
--
Cheers,
David
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-11 13:20 ` David Hildenbrand (Arm)
@ 2026-06-15 3:29 ` Miaohe Lin
2026-06-15 10:54 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2026-06-15 3:29 UTC (permalink / raw)
To: David Hildenbrand (Arm), Michael S. Tsirkin
Cc: Zi Yan, Andrew Morton, linux-kernel, Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
On 2026/6/11 21:20, David Hildenbrand (Arm) wrote:
> On 6/11/26 09:36, Miaohe Lin wrote:
>> On 2026/6/11 13:43, Michael S. Tsirkin wrote:
>>> On Thu, Jun 11, 2026 at 11:35:36AM +0800, Miaohe Lin wrote:
>>>>
>>>> Do you mean repeating SetPageHWPoison on every branch?
>>>
>>> Right.
>>>
>>>> Is it possible
>>>> to make __free_pages_prepare changes page->flags atomically or this race
>>>> is specified to memory_failure?
>>>>
>>>> Thanks.
>>>> .
>>>
>>>
>>> Adding an atomic op on every fast path page allocation is, I am
>>> guessing, going to slow down Linux measureably.
>>>
>>> Doing it for the benefit of memory_failure, which is the slowest of
>>> slow paths, seems unpalatable, to me.
>>
>> Agree, it's not worth to do so.
>>
>>>
>>> Neither am I sure it's the only racy place -
>>> grep for __SetPage and __ClearPage - all these have the same issue, I
>>> suspect.
>>>
>>> At the same time, I'm not an mm maintainer. If you disagree, try to
>>> upstream a change converting all non atomics in mm to atomics, and see
>>> what others say.
>>
>> Since memory_failure might be the only place, this change would be unacceptable.
>> We should come up with a better solution. Maybe we can try repeating SetPageHWPoison
>> and ClearPageHWPoison at a first attempt though it looks somewhat weird to me and makes
>> code more complicated.
>
> And I am fairly sure we could still have some remaining races ... it's shaky.
I have to agree it's shaky. Any suggestion for next step?
Thanks.
.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-15 3:29 ` Miaohe Lin
@ 2026-06-15 10:54 ` David Hildenbrand (Arm)
2026-06-16 6:32 ` Miaohe Lin
0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-15 10:54 UTC (permalink / raw)
To: Miaohe Lin, Michael S. Tsirkin
Cc: Zi Yan, Andrew Morton, linux-kernel, Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
On 6/15/26 05:29, Miaohe Lin wrote:
> On 2026/6/11 21:20, David Hildenbrand (Arm) wrote:
>> On 6/11/26 09:36, Miaohe Lin wrote:
>>>
>>> Agree, it's not worth to do so.
>>>
>>>
>>> Since memory_failure might be the only place, this change would be unacceptable.
>>> We should come up with a better solution. Maybe we can try repeating SetPageHWPoison
>>> and ClearPageHWPoison at a first attempt though it looks somewhat weird to me and makes
>>> code more complicated.
>>
>> And I am fairly sure we could still have some remaining races ... it's shaky.
>
> I have to agree it's shaky.
Right, just let writing task reschedule after reading the flags,
but before writing the flags.
> Any suggestion for next step?
We have various code that assumes that no concurrent writes are
possible, and consequently, we use no atomics.
__free_pages_prepare() is just one user.
Then we have __folio_set_locked(), __folio_clear_active()
and __folio_clear_unevictable().
But also __folio_mark_uptodate(), which is called rather frequently.
page_cpupid_reset_last() is also a thing, but it mostly falls
under __free_pages_prepare() handling.
... and __split_folio_to_order() also messes with flags directly without atomics.
Many of these are only possible for frozen pages (refcount == 0). I think
only __folio_set_locked() and __folio_mark_uptodate() are called on
non-frozen pages, when there is the expectation that nobody will concurrently
use atomics that would be bad (e.g., don't trylock if not an lru page).
We don't want to use atomics at these places just to please memory failure code.
Would it be sufficient to know in memory-failure code that concurrent
handling succeeded?
Assume that we enlighten all non-atomics to grab the rcu read lock, such as
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7223f6f4e2b4..3c3852b60bbd 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -803,10 +803,30 @@ static inline bool PageUptodate(const struct page *page)
return folio_test_uptodate(page_folio(page));
}
+#ifdef CONFIG_MEMORY_FAILURE
+static inline void page_flags_modify_nonatomic_begin(void)
+{
+ rcu_read_lock();
+}
+static inline void page_flags_modify_nonatomic_end(void)
+{
+ rcu_read_unlock();
+}
+#else
+static inline void page_flags_modify_nonatomic_begin(void)
+{
+}
+static inline void page_flags_modify_nonatomic_end(void)
+{
+}
+#endif
+
static __always_inline void __folio_mark_uptodate(struct folio *folio)
{
smp_wmb();
+ page_flags_modify_nonatomic_begin();
__set_bit(PG_uptodate, folio_flags(folio, 0));
+ page_flags_modify_nonatomic_end();
}
And then we have some retry logic such as:
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 51508a55c405..1123c40aaf43 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -162,6 +162,62 @@ static struct rb_root_cached pfn_space_itree = RB_ROOT_CACHED;
static DEFINE_MUTEX(pfn_space_lock);
+static bool page_test_set_hwpoison(struct page *page)
+{
+ lockdep_assert_held(&mf_mutex);
+
+ while (true) {
+ /* Already set -> not our problem. */
+ if (TestSetPageHWPoison(page))
+ return true;
+ /* Make sure concurrent non-atomic writers completed. */
+ synchronize_rcu();
+ /* Setting the flag was sticky. */
+ if (PageHWPoison(page))
+ return false;
+ }
+}
+
+static bool page_test_clear_hwpoison(struct page *page)
+{
+ lockdep_assert_held(&mf_mutex);
+
+ while (true) {
+ /* Already clear -> not our problem. */
+ if (!TestClearPageHWPoison(page))
+ return false;
+ /* Make sure concurrent non-atomic writers completed. */
+ synchronize_rcu();
+ /* Clearing the flag was sticky. */
+ if (!PageHWPoison(page))
+ return true;
+ }
+}
+
+static void page_set_hwpoison(struct page *page)
+{
+ lockdep_assert_held(&mf_mutex);
+
+ while (!PageHWPoison(page)) {
+ SetPageHWPoison(page);
+
+ /* Make sure concurrent non-atomic writers completed. */
+ synchronize_rcu();
+ }
+}
+
+static void page_clear_hwpoison(struct page *page)
+{
+ lockdep_assert_held(&mf_mutex);
+
+ while (PageHWPoison(page)) {
+ ClearPageHWPoison(page);
+
+ /* Make sure concurrent non-atomic writers completed. */
+ synchronize_rcu();
+ }
+}
+
/*
* Return values:
* 1: the page is dissolved (if needed) and taken off from buddy,
@@ -199,7 +255,7 @@ static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, boo
return false;
}
- SetPageHWPoison(page);
+ page_set_hwpoison(page);
if (release)
put_page(page);
page_ref_inc(page);
@@ -1744,7 +1800,7 @@ static int mf_generic_kill_procs(unsigned long long pfn, int flags,
* Use this flag as an indication that the dax page has been
* remapped UC to prevent speculative consumption of poison.
*/
- SetPageHWPoison(&folio->page);
+ page_set_hwpoison(&folio->page);
/*
* Unlike System-RAM there is no possibility to swap in a
@@ -1789,7 +1845,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
goto unlock;
if (!pre_remove)
- SetPageHWPoison(page);
+ page_set_hwpoison(page);
/*
* The pre_remove case is revoking access, the memory is still
@@ -1866,7 +1922,7 @@ static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
head = llist_del_all(raw_hwp_list_head(folio));
llist_for_each_entry_safe(p, next, head, node) {
if (move_flag)
- SetPageHWPoison(p->page);
+ page_set_hwpoison(p->page);
else
num_poisoned_pages_sub(page_to_pfn(p->page), 1);
kfree(p);
@@ -2380,7 +2436,7 @@ int memory_failure(unsigned long pfn, int flags)
if (res != -ENOENT)
goto unlock_mutex;
- if (TestSetPageHWPoison(p)) {
+ if (page_test_set_hwpoison(p)) {
res = -EHWPOISON;
if (flags & MF_ACTION_REQUIRED)
res = kill_accessing_process(current, pfn, flags);
@@ -2410,7 +2466,7 @@ int memory_failure(unsigned long pfn, int flags)
} else {
/* We lost the race, try again */
if (retry) {
- ClearPageHWPoison(p);
+ page_clear_hwpoison(p);
retry = false;
goto try_again;
}
@@ -2431,7 +2487,7 @@ int memory_failure(unsigned long pfn, int flags)
/* filter pages that are protected from hwpoison test by users */
folio_lock(folio);
if (hwpoison_filter(p)) {
- ClearPageHWPoison(p);
+ page_clear_hwpoison(p);
folio_unlock(folio);
folio_put(folio);
res = -EOPNOTSUPP;
@@ -2751,7 +2807,7 @@ int unpoison_memory(unsigned long pfn)
}
folio_put(folio);
- if (TestClearPageHWPoison(p)) {
+ if (page_test_clear_hwpoison(p)) {
folio_put(folio);
ret = 0;
}
Maybe that would work. There would still be issues to solve
(a) We don't hold the mf_mutex on all call paths, but we really need it so a
page_test_set_hwpoison() cannot race in weird ways with the other primitives I think.
(b) There are some leftover SetPageHWPoison etc. instances. The ones in
arch/x86/kernel/cpu/mce/core.c likely cannot grab the mutex, but maybe they are
corner cases either way and we can document the situation.
Further, while I assume the synchronize_rcu() on the MCE path should be fine
(who cares about performance there?), I don't know if the added RCU read lock
on some paths could be noticable.
So one idea worth discussing, but I am sure there are more problems.
--
Cheers,
David
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-15 10:54 ` David Hildenbrand (Arm)
@ 2026-06-16 6:32 ` Miaohe Lin
2026-06-16 6:56 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2026-06-16 6:32 UTC (permalink / raw)
To: David Hildenbrand (Arm), Michael S. Tsirkin
Cc: Zi Yan, Andrew Morton, linux-kernel, Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
On 2026/6/15 18:54, David Hildenbrand (Arm) wrote:
> On 6/15/26 05:29, Miaohe Lin wrote:
>> On 2026/6/11 21:20, David Hildenbrand (Arm) wrote:
>>> On 6/11/26 09:36, Miaohe Lin wrote:
>>>>
>>>> Agree, it's not worth to do so.
>>>>
>>>>
>>>> Since memory_failure might be the only place, this change would be unacceptable.
>>>> We should come up with a better solution. Maybe we can try repeating SetPageHWPoison
>>>> and ClearPageHWPoison at a first attempt though it looks somewhat weird to me and makes
>>>> code more complicated.
>>>
>>> And I am fairly sure we could still have some remaining races ... it's shaky.
>>
>> I have to agree it's shaky.
>
> Right, just let writing task reschedule after reading the flags,
> but before writing the flags.
>
>> Any suggestion for next step?
>
> We have various code that assumes that no concurrent writes are
> possible, and consequently, we use no atomics.
>
> __free_pages_prepare() is just one user.
>
> Then we have __folio_set_locked(), __folio_clear_active()
> and __folio_clear_unevictable().
>
> But also __folio_mark_uptodate(), which is called rather frequently.
>
> page_cpupid_reset_last() is also a thing, but it mostly falls
> under __free_pages_prepare() handling.
>
> ... and __split_folio_to_order() also messes with flags directly without atomics.
>
>
> Many of these are only possible for frozen pages (refcount == 0). I think
> only __folio_set_locked() and __folio_mark_uptodate() are called on
> non-frozen pages, when there is the expectation that nobody will concurrently
> use atomics that would be bad (e.g., don't trylock if not an lru page).
>
Thanks David! This information is really helpful!
>
> We don't want to use atomics at these places just to please memory failure code.
Bad news. We have more places racing with memory failure code.
>
> Would it be sufficient to know in memory-failure code that concurrent
> handling succeeded?
I think so, that would be useful.
>
>
> Assume that we enlighten all non-atomics to grab the rcu read lock, such as
These non-atomics are defined and used because they want to avoid atomic ops overhead?
So I'm afraid using rcu read lock in these places would lead to unexpected overhead.
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 7223f6f4e2b4..3c3852b60bbd 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -803,10 +803,30 @@ static inline bool PageUptodate(const struct page *page)
> return folio_test_uptodate(page_folio(page));
> }
>
> +#ifdef CONFIG_MEMORY_FAILURE
> +static inline void page_flags_modify_nonatomic_begin(void)
> +{
> + rcu_read_lock();
> +}
> +static inline void page_flags_modify_nonatomic_end(void)
> +{
> + rcu_read_unlock();
> +}
> +#else
> +static inline void page_flags_modify_nonatomic_begin(void)
> +{
> +}
> +static inline void page_flags_modify_nonatomic_end(void)
> +{
> +}
> +#endif
> +
> static __always_inline void __folio_mark_uptodate(struct folio *folio)
> {
> smp_wmb();
> + page_flags_modify_nonatomic_begin();
> __set_bit(PG_uptodate, folio_flags(folio, 0));
> + page_flags_modify_nonatomic_end();
> }
>
>
> And then we have some retry logic such as:
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 51508a55c405..1123c40aaf43 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -162,6 +162,62 @@ static struct rb_root_cached pfn_space_itree = RB_ROOT_CACHED;
>
> static DEFINE_MUTEX(pfn_space_lock);
>
> +static bool page_test_set_hwpoison(struct page *page)
> +{
> + lockdep_assert_held(&mf_mutex);
> +
> + while (true) {
> + /* Already set -> not our problem. */
> + if (TestSetPageHWPoison(page))
> + return true;
> + /* Make sure concurrent non-atomic writers completed. */
> + synchronize_rcu();
> + /* Setting the flag was sticky. */
> + if (PageHWPoison(page))
> + return false;
> + }
> +}
> +
> +static bool page_test_clear_hwpoison(struct page *page)
> +{
> + lockdep_assert_held(&mf_mutex);
> +
> + while (true) {
> + /* Already clear -> not our problem. */
> + if (!TestClearPageHWPoison(page))
> + return false;
> + /* Make sure concurrent non-atomic writers completed. */
> + synchronize_rcu();
> + /* Clearing the flag was sticky. */
> + if (!PageHWPoison(page))
> + return true;
> + }
> +}
> +
> +static void page_set_hwpoison(struct page *page)
> +{
> + lockdep_assert_held(&mf_mutex);
> +
> + while (!PageHWPoison(page)) {
> + SetPageHWPoison(page);
> +
> + /* Make sure concurrent non-atomic writers completed. */
> + synchronize_rcu();
> + }
> +}
> +
> +static void page_clear_hwpoison(struct page *page)
> +{
> + lockdep_assert_held(&mf_mutex);
> +
> + while (PageHWPoison(page)) {
> + ClearPageHWPoison(page);
> +
> + /* Make sure concurrent non-atomic writers completed. */
> + synchronize_rcu();
> + }
> +}
> +
> /*
> * Return values:
> * 1: the page is dissolved (if needed) and taken off from buddy,
> @@ -199,7 +255,7 @@ static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, boo
> return false;
> }
>
> - SetPageHWPoison(page);
> + page_set_hwpoison(page);
> if (release)
> put_page(page);
> page_ref_inc(page);
> @@ -1744,7 +1800,7 @@ static int mf_generic_kill_procs(unsigned long long pfn, int flags,
> * Use this flag as an indication that the dax page has been
> * remapped UC to prevent speculative consumption of poison.
> */
> - SetPageHWPoison(&folio->page);
> + page_set_hwpoison(&folio->page);
>
> /*
> * Unlike System-RAM there is no possibility to swap in a
> @@ -1789,7 +1845,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> goto unlock;
>
> if (!pre_remove)
> - SetPageHWPoison(page);
> + page_set_hwpoison(page);
>
> /*
> * The pre_remove case is revoking access, the memory is still
> @@ -1866,7 +1922,7 @@ static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
> head = llist_del_all(raw_hwp_list_head(folio));
> llist_for_each_entry_safe(p, next, head, node) {
> if (move_flag)
> - SetPageHWPoison(p->page);
> + page_set_hwpoison(p->page);
> else
> num_poisoned_pages_sub(page_to_pfn(p->page), 1);
> kfree(p);
> @@ -2380,7 +2436,7 @@ int memory_failure(unsigned long pfn, int flags)
> if (res != -ENOENT)
> goto unlock_mutex;
>
> - if (TestSetPageHWPoison(p)) {
> + if (page_test_set_hwpoison(p)) {
> res = -EHWPOISON;
> if (flags & MF_ACTION_REQUIRED)
> res = kill_accessing_process(current, pfn, flags);
> @@ -2410,7 +2466,7 @@ int memory_failure(unsigned long pfn, int flags)
> } else {
> /* We lost the race, try again */
> if (retry) {
> - ClearPageHWPoison(p);
> + page_clear_hwpoison(p);
> retry = false;
> goto try_again;
> }
> @@ -2431,7 +2487,7 @@ int memory_failure(unsigned long pfn, int flags)
> /* filter pages that are protected from hwpoison test by users */
> folio_lock(folio);
> if (hwpoison_filter(p)) {
> - ClearPageHWPoison(p);
> + page_clear_hwpoison(p);
> folio_unlock(folio);
> folio_put(folio);
> res = -EOPNOTSUPP;
> @@ -2751,7 +2807,7 @@ int unpoison_memory(unsigned long pfn)
> }
>
> folio_put(folio);
> - if (TestClearPageHWPoison(p)) {
> + if (page_test_clear_hwpoison(p)) {
> folio_put(folio);
> ret = 0;
> }
>
>
> Maybe that would work. There would still be issues to solve
>
> (a) We don't hold the mf_mutex on all call paths, but we really need it so a
> page_test_set_hwpoison() cannot race in weird ways with the other primitives I think.
>
> (b) There are some leftover SetPageHWPoison etc. instances. The ones in
> arch/x86/kernel/cpu/mce/core.c likely cannot grab the mutex, but maybe they are
> corner cases either way and we can document the situation.
>
>
> Further, while I assume the synchronize_rcu() on the MCE path should be fine
> (who cares about performance there?), I don't know if the added RCU read lock
> on some paths could be noticable.
>
> So one idea worth discussing, but I am sure there are more problems.
I think this is a good idea, although there are some remaining issues.
But such race should be really rare, is it worth all this effort? Could we
simply aim to resolve, not to be flawless? I.e. could we simply check
and re-set the hwpoison flag at the end of memory_failure handling to
simply avoid losing hwpoison flag as a best-effort attempt? Would it be
acceptable?
Thanks.
.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-16 6:32 ` Miaohe Lin
@ 2026-06-16 6:56 ` David Hildenbrand (Arm)
2026-06-16 11:40 ` Miaohe Lin
2026-06-16 16:15 ` Michael S. Tsirkin
0 siblings, 2 replies; 27+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-16 6:56 UTC (permalink / raw)
To: Miaohe Lin, Michael S. Tsirkin
Cc: Zi Yan, Andrew Morton, linux-kernel, Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
>>
>>
>> Assume that we enlighten all non-atomics to grab the rcu read lock, such as
>
> These non-atomics are defined and used because they want to avoid atomic ops overhead?
> So I'm afraid using rcu read lock in these places would lead to unexpected overhead.
It should be cheaper than atomics IIUC. Further, I assume that some pages could
batch over multiple such operations (esp. page freeing path when we process tail
pages).
With !CONFIG_PREEMPT_RCU it's simply preempt_disable()/preempt_enable(), which
is either a NOP or just adjusting the preempt counter of the current thread. Cheap.
With CONFIG_PREEMPT_RCU we mostly increment current->rcu_read_lock_nesting. But
there might be a function call involved (did not look into the details). So that
variant should be slightly more expensive.
We'd have to measure what an addition rcu read lock would cost in there. that
should be fairly easy to benchmark.
>>
>> Maybe that would work. There would still be issues to solve
>>
>> (a) We don't hold the mf_mutex on all call paths, but we really need it so a
>> page_test_set_hwpoison() cannot race in weird ways with the other primitives I think.
>>
>> (b) There are some leftover SetPageHWPoison etc. instances. The ones in
>> arch/x86/kernel/cpu/mce/core.c likely cannot grab the mutex, but maybe they are
>> corner cases either way and we can document the situation.
>>
>>
>> Further, while I assume the synchronize_rcu() on the MCE path should be fine
>> (who cares about performance there?), I don't know if the added RCU read lock
>> on some paths could be noticable.
>>
>> So one idea worth discussing, but I am sure there are more problems.
>
> I think this is a good idea, although there are some remaining issues.
> But such race should be really rare, is it worth all this effort? Could we
> simply aim to resolve, not to be flawless? I.e. could we simply check
> and re-set the hwpoison flag at the end of memory_failure handling to
> simply avoid losing hwpoison flag as a best-effort attempt? Would it be
> acceptable?
Hacky. Sufficient for the hypervisor to suspend the nonatomic-setting CPU at the
wrong time to still trigger the same behavior.
I think, either we fix it properly, or we redesign hwpoison handling to deal
with setting/clearing becoming stale at some random point in the future.
--
Cheers,
David
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-16 6:56 ` David Hildenbrand (Arm)
@ 2026-06-16 11:40 ` Miaohe Lin
2026-06-16 12:18 ` David Hildenbrand (Arm)
2026-06-16 16:15 ` Michael S. Tsirkin
1 sibling, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2026-06-16 11:40 UTC (permalink / raw)
To: David Hildenbrand (Arm), Michael S. Tsirkin
Cc: Zi Yan, Andrew Morton, linux-kernel, Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
On 2026/6/16 14:56, David Hildenbrand (Arm) wrote:
>>>
>>>
>>> Assume that we enlighten all non-atomics to grab the rcu read lock, such as
>>
>> These non-atomics are defined and used because they want to avoid atomic ops overhead?
>> So I'm afraid using rcu read lock in these places would lead to unexpected overhead.
>
> It should be cheaper than atomics IIUC. Further, I assume that some pages could
> batch over multiple such operations (esp. page freeing path when we process tail
> pages).
>
> With !CONFIG_PREEMPT_RCU it's simply preempt_disable()/preempt_enable(), which
> is either a NOP or just adjusting the preempt counter of the current thread. Cheap.
>
> With CONFIG_PREEMPT_RCU we mostly increment current->rcu_read_lock_nesting. But
> there might be a function call involved (did not look into the details). So that
> variant should be slightly more expensive.
I scanned the code and found rcu_read_unlock_special might be called in some cases.
Some expensive ops, e.g. irq_work_queue_on, might be called in some corner cases.
So the overhead of rcu read lock might be fluctuating.
>
> We'd have to measure what an addition rcu read lock would cost in there. that
> should be fairly easy to benchmark.
Sure. We can do that if needed.
>
>>>
>>> Maybe that would work. There would still be issues to solve
>>>
>>> (a) We don't hold the mf_mutex on all call paths, but we really need it so a
>>> page_test_set_hwpoison() cannot race in weird ways with the other primitives I think.
>>>
>>> (b) There are some leftover SetPageHWPoison etc. instances. The ones in
>>> arch/x86/kernel/cpu/mce/core.c likely cannot grab the mutex, but maybe they are
>>> corner cases either way and we can document the situation.
>>>
>>>
>>> Further, while I assume the synchronize_rcu() on the MCE path should be fine
>>> (who cares about performance there?), I don't know if the added RCU read lock
>>> on some paths could be noticable.
>>>
>>> So one idea worth discussing, but I am sure there are more problems.
>>
>> I think this is a good idea, although there are some remaining issues.
>> But such race should be really rare, is it worth all this effort? Could we
>> simply aim to resolve, not to be flawless? I.e. could we simply check
>> and re-set the hwpoison flag at the end of memory_failure handling to
>> simply avoid losing hwpoison flag as a best-effort attempt? Would it be
>> acceptable?
>
> Hacky. Sufficient for the hypervisor to suspend the nonatomic-setting CPU at the
> wrong time to still trigger the same behavior.
Right. hypervisor could make the issue easier to trigger...
>
> I think, either we fix it properly, or we redesign hwpoison handling to deal
> with setting/clearing becoming stale at some random point in the future.
I think your proposal, although there are still some issues to be resolved, is
nevertheless a good solution. We could also wait and see if anyone comes up with
a better one.
Thanks.
.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-16 11:40 ` Miaohe Lin
@ 2026-06-16 12:18 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-16 12:18 UTC (permalink / raw)
To: Miaohe Lin, Michael S. Tsirkin
Cc: Zi Yan, Andrew Morton, linux-kernel, Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
On 6/16/26 13:40, Miaohe Lin wrote:
> On 2026/6/16 14:56, David Hildenbrand (Arm) wrote:
>>>
>>> These non-atomics are defined and used because they want to avoid atomic ops overhead?
>>> So I'm afraid using rcu read lock in these places would lead to unexpected overhead.
>>
>> It should be cheaper than atomics IIUC. Further, I assume that some pages could
>> batch over multiple such operations (esp. page freeing path when we process tail
>> pages).
>>
>> With !CONFIG_PREEMPT_RCU it's simply preempt_disable()/preempt_enable(), which
>> is either a NOP or just adjusting the preempt counter of the current thread. Cheap.
>>
>> With CONFIG_PREEMPT_RCU we mostly increment current->rcu_read_lock_nesting. But
>> there might be a function call involved (did not look into the details). So that
>> variant should be slightly more expensive.
>
> I scanned the code and found rcu_read_unlock_special might be called in some cases.
> Some expensive ops, e.g. irq_work_queue_on, might be called in some corner cases.
> So the overhead of rcu read lock might be fluctuating.
Right. Usually rcu_read_lock+unlock is supposed to be very lightweight, but that
might not be completely the case with that PREEMPT_RCU thingy ...
>
>>
>> We'd have to measure what an addition rcu read lock would cost in there. that
>> should be fairly easy to benchmark.
>
> Sure. We can do that if needed.
>
>>
>>>
>>> I think this is a good idea, although there are some remaining issues.
>>> But such race should be really rare, is it worth all this effort? Could we
>>> simply aim to resolve, not to be flawless? I.e. could we simply check
>>> and re-set the hwpoison flag at the end of memory_failure handling to
>>> simply avoid losing hwpoison flag as a best-effort attempt? Would it be
>>> acceptable?
>>
>> Hacky. Sufficient for the hypervisor to suspend the nonatomic-setting CPU at the
>> wrong time to still trigger the same behavior.
>
> Right. hypervisor could make the issue easier to trigger...
>
>>
>> I think, either we fix it properly, or we redesign hwpoison handling to deal
>> with setting/clearing becoming stale at some random point in the future.
>
> I think your proposal, although there are still some issues to be resolved, is
> nevertheless a good solution. We could also wait and see if anyone comes up with
> a better one.
I wouldn't call it "good" ... it's the only thing I was easily able to come up
with :)
The only alternative would be moving the hwpoison bit out of page->flags,
storing it in a sparse bitmap or sth. like that. It would be a bigger rework and
I am sure there are issues with that as well.
--
Cheers,
David
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
2026-06-16 6:56 ` David Hildenbrand (Arm)
2026-06-16 11:40 ` Miaohe Lin
@ 2026-06-16 16:15 ` Michael S. Tsirkin
1 sibling, 0 replies; 27+ messages in thread
From: Michael S. Tsirkin @ 2026-06-16 16:15 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Miaohe Lin, Zi Yan, Andrew Morton, linux-kernel, Jason Wang,
Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi
On Tue, Jun 16, 2026 at 08:56:42AM +0200, David Hildenbrand (Arm) wrote:
> >>
> >>
> >> Assume that we enlighten all non-atomics to grab the rcu read lock, such as
> >
> > These non-atomics are defined and used because they want to avoid atomic ops overhead?
> > So I'm afraid using rcu read lock in these places would lead to unexpected overhead.
>
> It should be cheaper than atomics IIUC. Further, I assume that some pages could
> batch over multiple such operations (esp. page freeing path when we process tail
> pages).
Doubt we have the energy to bother with this much.
We'll have to stick them in the bit manipulation macros if memory failure is
configured and be done with it.
--
MST
^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2026-06-16 16:15 UTC | newest]
Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-09 10:12 [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock Michael S. Tsirkin
2026-06-09 12:50 ` David Hildenbrand (Arm)
2026-06-09 16:12 ` Zi Yan
2026-06-09 18:10 ` Andrew Morton
2026-06-09 18:38 ` David Hildenbrand (Arm)
2026-06-09 18:39 ` Zi Yan
2026-06-09 18:52 ` Zi Yan
2026-06-09 20:34 ` Michael S. Tsirkin
2026-06-09 20:54 ` Zi Yan
2026-06-09 21:00 ` Michael S. Tsirkin
2026-06-10 7:24 ` Miaohe Lin
2026-06-10 7:35 ` Michael S. Tsirkin
2026-06-10 21:18 ` Michael S. Tsirkin
2026-06-11 3:35 ` Miaohe Lin
2026-06-11 5:43 ` Michael S. Tsirkin
2026-06-11 7:36 ` Miaohe Lin
2026-06-11 13:20 ` David Hildenbrand (Arm)
2026-06-15 3:29 ` Miaohe Lin
2026-06-15 10:54 ` David Hildenbrand (Arm)
2026-06-16 6:32 ` Miaohe Lin
2026-06-16 6:56 ` David Hildenbrand (Arm)
2026-06-16 11:40 ` Miaohe Lin
2026-06-16 12:18 ` David Hildenbrand (Arm)
2026-06-16 16:15 ` Michael S. Tsirkin
2026-06-11 6:33 ` Michael S. Tsirkin
2026-06-11 11:33 ` David Hildenbrand (Arm)
2026-06-09 20:24 ` Michael S. Tsirkin
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.