* [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
@ 2026-06-28 21:45 Michael S. Tsirkin
2026-06-28 21:45 ` [PATCH 1/2] mm: memory-failure: use RCU to fix HWPoison flag race Michael S. Tsirkin
` (3 more replies)
0 siblings, 4 replies; 10+ messages in thread
From: Michael S. Tsirkin @ 2026-06-28 21:45 UTC (permalink / raw)
To: linux-kernel
Cc: David Hildenbrand, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
Oscar Salvador, Andi Kleen, Hidehiro Kawai, Rik van Riel,
Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl,
David Hildenbrand
I don't like it that we are adding overhead to the good path for
the benefit of memory failure, which never triggers on many systems,
but I don't have a better idea. Pls take a look.
Non-atomic page flag operations (page->flags.f &= ~mask, __set_bit,
__clear_bit) can race with atomic TestSetPageHWPoison() in
memory_failure(). The non-atomic RMW reads flags, memory_failure()
atomically sets HWPoison, then the RMW writes back the old value
without HWPoison, clobbering the bit.
The race was confirmed by injecting a cpu_relax() delay between the
load and store of the non-atomic RMW in __free_pages_prepare, then
running concurrent MADV_HWPOISON injection. The clobbered HWPoison
bit was observed repeatedly.
This series fixes the race by:
1. Having memory_failure() call synchronize_rcu() + retry after
setting HWPoison, so that any in-flight non-atomic RMW that
read the old flags value completes before we proceed.
2. Wrapping all non-atomic page flag operations in
rcu_read_lock/rcu_read_unlock (CONFIG_MEMORY_FAILURE only),
so that synchronize_rcu() actually drains them.
Performance impact (page alloc+free microbenchmark, 200K iterations,
20 runs, KVM guest, error bars are 3-sigma):
!PREEMPT_RCU (x86):
insns/iter cycles/iter
base: 12237 +/- 1 17954 +/- 136
patched: +22 +/- 1 -124 +/- 122
(+0.18%) (within noise)
PREEMPT_RCU:
insns/iter cycles/iter
base: 12512 +/- 3 18541 +/- 214
patched: +95 +/- 3 -12 +/- 161
(+0.76%) (within noise)
When !CONFIG_MEMORY_FAILURE, all wrappers compile away completely.
Suggested-by: David Hildenbrand <david@redhat.com>
Michael S. Tsirkin (2):
mm: memory-failure: use RCU to fix HWPoison flag race
mm: wrap non-atomic page flag ops in RCU for HWPoison safety
include/linux/mm.h | 7 ++++
include/linux/page-flags.h | 81 +++++++++++++++++++++++++++++++++++---
mm/huge_memory.c | 2 +
mm/memory-failure.c | 54 +++++++++++++++++++++----
mm/memremap.c | 6 ++-
mm/mm_init.c | 2 +
mm/page_alloc.c | 4 ++
mm/slub.c | 2 +-
8 files changed, 143 insertions(+), 15 deletions(-)
--
MST
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 1/2] mm: memory-failure: use RCU to fix HWPoison flag race
2026-06-28 21:45 [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops Michael S. Tsirkin
@ 2026-06-28 21:45 ` Michael S. Tsirkin
2026-06-28 21:45 ` [PATCH 2/2] mm: wrap non-atomic page flag ops in RCU for HWPoison safety Michael S. Tsirkin
` (2 subsequent siblings)
3 siblings, 0 replies; 10+ messages in thread
From: Michael S. Tsirkin @ 2026-06-28 21:45 UTC (permalink / raw)
To: linux-kernel
Cc: David Hildenbrand, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
Oscar Salvador, Andi Kleen, Hidehiro Kawai, Rik van Riel,
Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl,
David Hildenbrand
Non-atomic page flag operations (page->flags.f &= ~mask, __set_bit,
__clear_bit) can race with atomic TestSetPageHWPoison() in
memory_failure(). The non-atomic RMW reads flags, memory_failure()
atomically sets HWPoison, then the RMW writes back the old value
without HWPoison - clobbering the bit.
Add synchronize_rcu() + retry helpers for setting and clearing
HWPoison, and convert all memory_failure() call sites to use them.
Follow-up patches wrap non-atomic page flag operations in
rcu_read_lock/rcu_read_unlock so that synchronize_rcu() drains
in-flight callers.
Note: the MCE handler in arch/x86/kernel/cpu/mce/core.c also calls
SetPageHWPoison() and is subject to the same race. It cannot use
the drain helpers (MCE context cannot call synchronize_rcu()).
For recoverable MCE errors, memory_failure() is queued via work
items (kill_me_maybe/kill_me_never) and will re-set the bit via
test_and_set_hwpoison_drain_rcu() if it was clobbered. The
mce_panic() path sets HWPoison for kdump right before panic() so
the race should not matter there. The MCG_STATUS_SEAM_NR path does
not queue memory_failure(), but the affected page belongs to a
TDX guest whose CPU core has already been marked dead - the page
is not subject to concurrent non-atomic flag operations in the
buddy allocator, so the race does not trigger.
Fixes: 6a46079cf57a ("HWPOISON: The high level memory error handler in the VM v7")
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: Cursor:gpt-5.4-xhigh-fast
---
mm/memory-failure.c | 54 ++++++++++++++++++++++++++++++++++++++-------
1 file changed, 46 insertions(+), 8 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ee42d4361309..351f8bbda248 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -76,6 +76,44 @@ static int sysctl_enable_soft_offline __read_mostly = 1;
atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
+/*
+ * Drain any in-flight non-atomic page flag operations that could
+ * clobber a concurrently set HWPoison bit. Retries until the bit sticks.
+ */
+static void set_hwpoison_drain_rcu(struct page *p)
+{
+ do {
+ synchronize_rcu();
+ } while (!TestSetPageHWPoison(p));
+}
+
+/*
+ * Drain any in-flight non-atomic page flag operations that could
+ * restore the HWPoison bit from stale data. Retries until it stays clear.
+ */
+static void clear_hwpoison_drain_rcu(struct page *p)
+{
+ do {
+ synchronize_rcu();
+ } while (TestClearPageHWPoison(p));
+}
+
+static bool test_and_set_hwpoison_drain_rcu(struct page *p)
+{
+ bool was_set = TestSetPageHWPoison(p);
+
+ set_hwpoison_drain_rcu(p);
+ return was_set;
+}
+
+static bool test_and_clear_hwpoison_drain_rcu(struct page *p)
+{
+ bool was_set = TestClearPageHWPoison(p);
+
+ clear_hwpoison_drain_rcu(p);
+ return was_set;
+}
+
static bool hw_memory_failure __read_mostly = false;
static DEFINE_MUTEX(mf_mutex);
@@ -211,7 +249,7 @@ static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, boo
return false;
}
- SetPageHWPoison(page);
+ set_hwpoison_drain_rcu(page);
if (release)
put_page(page);
page_ref_inc(page);
@@ -1756,7 +1794,7 @@ static int mf_generic_kill_procs(unsigned long long pfn, int flags,
* Use this flag as an indication that the dax page has been
* remapped UC to prevent speculative consumption of poison.
*/
- SetPageHWPoison(&folio->page);
+ set_hwpoison_drain_rcu(&folio->page);
/*
* Unlike System-RAM there is no possibility to swap in a
@@ -1801,7 +1839,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
goto unlock;
if (!pre_remove)
- SetPageHWPoison(page);
+ set_hwpoison_drain_rcu(page);
/*
* The pre_remove case is revoking access, the memory is still
@@ -1878,7 +1916,7 @@ static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
head = llist_del_all(raw_hwp_list_head(folio));
llist_for_each_entry_safe(p, next, head, node) {
if (move_flag)
- SetPageHWPoison(p->page);
+ set_hwpoison_drain_rcu(p->page);
else
num_poisoned_pages_sub(page_to_pfn(p->page), 1);
kfree(p);
@@ -2390,7 +2428,7 @@ int memory_failure(unsigned long pfn, int flags)
if (hugetlb)
goto unlock_mutex;
- if (TestSetPageHWPoison(p)) {
+ if (test_and_set_hwpoison_drain_rcu(p)) {
res = -EHWPOISON;
if (flags & MF_ACTION_REQUIRED)
res = kill_accessing_process(current, pfn, flags);
@@ -2420,7 +2458,7 @@ int memory_failure(unsigned long pfn, int flags)
} else {
/* We lost the race, try again */
if (retry) {
- ClearPageHWPoison(p);
+ clear_hwpoison_drain_rcu(p);
retry = false;
goto try_again;
}
@@ -2441,7 +2479,7 @@ int memory_failure(unsigned long pfn, int flags)
/* filter pages that are protected from hwpoison test by users */
folio_lock(folio);
if (hwpoison_filter(p)) {
- ClearPageHWPoison(p);
+ clear_hwpoison_drain_rcu(p);
folio_unlock(folio);
folio_put(folio);
res = -EOPNOTSUPP;
@@ -2761,7 +2799,7 @@ int unpoison_memory(unsigned long pfn)
}
folio_put(folio);
- if (TestClearPageHWPoison(p)) {
+ if (test_and_clear_hwpoison_drain_rcu(p)) {
folio_put(folio);
ret = 0;
}
--
MST
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 2/2] mm: wrap non-atomic page flag ops in RCU for HWPoison safety
2026-06-28 21:45 [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops Michael S. Tsirkin
2026-06-28 21:45 ` [PATCH 1/2] mm: memory-failure: use RCU to fix HWPoison flag race Michael S. Tsirkin
@ 2026-06-28 21:45 ` Michael S. Tsirkin
2026-06-29 2:11 ` [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops Andi Kleen
2026-06-29 6:49 ` David Hildenbrand (Arm)
3 siblings, 0 replies; 10+ messages in thread
From: Michael S. Tsirkin @ 2026-06-28 21:45 UTC (permalink / raw)
To: linux-kernel
Cc: David Hildenbrand, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
Oscar Salvador, Andi Kleen, Hidehiro Kawai, Rik van Riel,
Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl,
David Hildenbrand
Wrap all non-atomic page flag operations in rcu_read_lock/rcu_read_unlock
so that synchronize_rcu() in memory_failure() can drain in-flight
callers. This completes the RCU-based protection introduced in the
previous patch.
Add hwpoison_safe_set_bit/clear_bit wrappers in page-flags.h that
bracket __set_bit/__clear_bit with rcu_read_lock/rcu_read_unlock.
Convert all __SETPAGEFLAG/__CLEARPAGEFLAG macros and direct flag
manipulation sites (page_alloc.c, huge_memory.c, memremap.c,
mm_init.c, slub.c, mm.h) to use these wrappers.
For batched flag clearing paths (page_alloc.c order > 0), use
hwpoison_rcu_lock/unlock around the entire loop rather than
per-operation locking.
When !CONFIG_MEMORY_FAILURE, all wrappers compile to the bare
operations with no overhead.
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: Cursor:gpt-5.4-xhigh-fast
---
include/linux/mm.h | 7 ++++
include/linux/page-flags.h | 81 +++++++++++++++++++++++++++++++++++---
mm/huge_memory.c | 2 +
mm/memremap.c | 6 ++-
mm/mm_init.c | 2 +
mm/page_alloc.c | 4 ++
mm/slub.c | 2 +-
7 files changed, 97 insertions(+), 7 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 06bbe9eba636..d52a5e90cad6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2341,6 +2341,7 @@ static inline int folio_last_cpupid(struct folio *folio)
int folio_xchg_last_cpupid(struct folio *folio, int cpupid);
+/* Caller must hold hwpoison_rcu_lock() */
static inline void page_cpupid_reset_last(struct page *page)
{
page->flags.f |= LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT;
@@ -2503,8 +2504,10 @@ static inline struct zone *folio_zone(const struct folio *folio)
#ifdef SECTION_IN_PAGE_FLAGS
static inline void set_page_section(struct page *page, unsigned long section)
{
+ hwpoison_rcu_lock_flags(&page->flags.f);
page->flags.f &= ~(SECTIONS_MASK << SECTIONS_PGSHIFT);
page->flags.f |= (section & SECTIONS_MASK) << SECTIONS_PGSHIFT;
+ hwpoison_rcu_unlock_flags(&page->flags.f);
}
static inline unsigned long memdesc_section(memdesc_flags_t mdf)
@@ -2719,14 +2722,18 @@ static inline bool folio_is_longterm_pinnable(struct folio *folio)
static inline void set_page_zone(struct page *page, enum zone_type zone)
{
+ hwpoison_rcu_lock_flags(&page->flags.f);
page->flags.f &= ~(ZONES_MASK << ZONES_PGSHIFT);
page->flags.f |= (zone & ZONES_MASK) << ZONES_PGSHIFT;
+ hwpoison_rcu_unlock_flags(&page->flags.f);
}
static inline void set_page_node(struct page *page, unsigned long node)
{
+ hwpoison_rcu_lock_flags(&page->flags.f);
page->flags.f &= ~(NODES_MASK << NODES_PGSHIFT);
page->flags.f |= (node & NODES_MASK) << NODES_PGSHIFT;
+ hwpoison_rcu_unlock_flags(&page->flags.f);
}
static inline void set_page_links(struct page *page, enum zone_type zone,
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7223f6f4e2b4..25eb5a77c92f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -9,6 +9,7 @@
#include <linux/types.h>
#include <linux/bug.h>
#include <linux/mmdebug.h>
+#include <linux/rcupdate.h>
#ifndef __GENERATING_BOUNDS_H
#include <linux/mm_types.h>
#include <generated/bounds.h>
@@ -404,6 +405,68 @@ static unsigned long *folio_flags(struct folio *folio, unsigned n)
#define FOLIO_HEAD_PAGE 0
#define FOLIO_SECOND_PAGE 1
+/*
+ * Non-atomic page flag operations (__set_bit, __clear_bit, flags &= ~mask)
+ * can race with atomic TestSetPageHWPoison() in memory_failure().
+ * Wrap non-atomic ops in rcu_read_lock/rcu_read_unlock so that
+ * synchronize_rcu() in memory_failure() drains in-flight callers.
+ */
+#ifdef CONFIG_MEMORY_FAILURE
+static __always_inline void
+hwpoison_safe_set_bit(unsigned long nr, unsigned long *addr)
+{
+ rcu_read_lock();
+ __set_bit(nr, addr);
+ rcu_read_unlock();
+}
+
+static __always_inline void
+hwpoison_safe_clear_bit(unsigned long nr, unsigned long *addr)
+{
+ rcu_read_lock();
+ __clear_bit(nr, addr);
+ rcu_read_unlock();
+}
+
+static __always_inline void hwpoison_rcu_lock_flags(unsigned long *addr)
+{
+ rcu_read_lock();
+}
+
+static __always_inline void hwpoison_rcu_unlock_flags(unsigned long *addr)
+{
+ rcu_read_unlock();
+}
+
+static __always_inline void hwpoison_rcu_lock(void)
+{
+ rcu_read_lock();
+}
+
+static __always_inline void hwpoison_rcu_unlock(void)
+{
+ rcu_read_unlock();
+}
+
+#else /* !CONFIG_MEMORY_FAILURE */
+static inline void
+hwpoison_safe_set_bit(unsigned long nr, unsigned long *addr)
+{
+ __set_bit(nr, addr);
+}
+
+static inline void
+hwpoison_safe_clear_bit(unsigned long nr, unsigned long *addr)
+{
+ __clear_bit(nr, addr);
+}
+
+static inline void hwpoison_rcu_lock_flags(unsigned long *addr) { }
+static inline void hwpoison_rcu_unlock_flags(unsigned long *addr) { }
+static inline void hwpoison_rcu_lock(void) { }
+static inline void hwpoison_rcu_unlock(void) { }
+#endif
+
/*
* Macros to create function definitions for page flags
*/
@@ -421,11 +484,11 @@ static __always_inline void folio_clear_##name(struct folio *folio) \
#define __FOLIO_SET_FLAG(name, page) \
static __always_inline void __folio_set_##name(struct folio *folio) \
-{ __set_bit(PG_##name, folio_flags(folio, page)); }
+{ hwpoison_safe_set_bit(PG_##name, folio_flags(folio, page)); }
#define __FOLIO_CLEAR_FLAG(name, page) \
static __always_inline void __folio_clear_##name(struct folio *folio) \
-{ __clear_bit(PG_##name, folio_flags(folio, page)); }
+{ hwpoison_safe_clear_bit(PG_##name, folio_flags(folio, page)); }
#define FOLIO_TEST_SET_FLAG(name, page) \
static __always_inline bool folio_test_set_##name(struct folio *folio) \
@@ -458,12 +521,12 @@ static __always_inline void ClearPage##uname(struct page *page) \
#define __SETPAGEFLAG(uname, lname, policy) \
__FOLIO_SET_FLAG(lname, FOLIO_##policy) \
static __always_inline void __SetPage##uname(struct page *page) \
-{ __set_bit(PG_##lname, &policy(page, 1)->flags.f); }
+{ hwpoison_safe_set_bit(PG_##lname, &policy(page, 1)->flags.f); }
#define __CLEARPAGEFLAG(uname, lname, policy) \
__FOLIO_CLEAR_FLAG(lname, FOLIO_##policy) \
static __always_inline void __ClearPage##uname(struct page *page) \
-{ __clear_bit(PG_##lname, &policy(page, 1)->flags.f); }
+{ hwpoison_safe_clear_bit(PG_##lname, &policy(page, 1)->flags.f); }
#define TESTSETFLAG(uname, lname, policy) \
FOLIO_TEST_SET_FLAG(lname, FOLIO_##policy) \
@@ -806,7 +869,7 @@ static inline bool PageUptodate(const struct page *page)
static __always_inline void __folio_mark_uptodate(struct folio *folio)
{
smp_wmb();
- __set_bit(PG_uptodate, folio_flags(folio, 0));
+ hwpoison_safe_set_bit(PG_uptodate, folio_flags(folio, 0));
}
static __always_inline void folio_mark_uptodate(struct folio *folio)
@@ -1166,6 +1229,14 @@ static __always_inline void ClearPageAnonExclusive(struct page *page)
}
static __always_inline void __ClearPageAnonExclusive(struct page *page)
+{
+ VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
+ VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+ hwpoison_safe_clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
+}
+
+/* Caller must hold hwpoison_rcu_lock() */
+static __always_inline void ___ClearPageAnonExclusive(struct page *page)
{
VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 970e077019b7..99f600459964 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3624,6 +3624,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
* unreferenced sub-pages of an anonymous THP: we can simply drop
* PG_anon_exclusive (-> PG_mappedtodisk) for these here.
*/
+ hwpoison_rcu_lock_flags(&new_folio->flags.f);
new_folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
new_folio->flags.f |= (folio->flags.f &
((1L << PG_referenced) |
@@ -3643,6 +3644,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
#endif
(1L << PG_dirty) |
LRU_GEN_MASK | LRU_REFS_MASK));
+ hwpoison_rcu_unlock_flags(&new_folio->flags.f);
if (handle_hwpoison &&
page_range_has_hwpoisoned(new_head, new_nr_pages))
diff --git a/mm/memremap.c b/mm/memremap.c
index 053842d45cb1..5c769dc4930c 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -425,8 +425,10 @@ void free_zone_device_folio(struct folio *folio)
mem_cgroup_uncharge(folio);
if (folio_test_anon(folio)) {
+ hwpoison_rcu_lock();
for (i = 0; i < nr; i++)
- __ClearPageAnonExclusive(folio_page(folio, i));
+ ___ClearPageAnonExclusive(folio_page(folio, i));
+ hwpoison_rcu_unlock();
}
/*
@@ -494,7 +496,9 @@ void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
* blindly clear bits which could have set my order field here,
* including page head.
*/
+ hwpoison_rcu_lock_flags(&new_page->flags.f);
new_page->flags.f &= ~0xffUL; /* Clear possible order, page head */
+ hwpoison_rcu_unlock_flags(&new_page->flags.f);
#ifdef NR_PAGES_IN_LARGE_FOLIO
/*
diff --git a/mm/mm_init.c b/mm/mm_init.c
index f9f8e1af921c..881574dd14da 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -596,7 +596,9 @@ void __meminit __init_single_page(struct page *page, unsigned long pfn,
set_page_links(page, zone, nid, pfn);
init_page_count(page);
atomic_set(&page->_mapcount, -1);
+ hwpoison_rcu_lock_flags(&page->flags.f);
page_cpupid_reset_last(page);
+ hwpoison_rcu_unlock_flags(&page->flags.f);
page_kasan_tag_reset(page);
INIT_LIST_HEAD(&page->lru);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d49c254174da..60a4697ecb9a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1358,6 +1358,7 @@ __always_inline bool __free_pages_prepare(struct page *page,
if (unlikely(order)) {
int i;
+ hwpoison_rcu_lock();
if (compound) {
page[1].flags.f &= ~PAGE_FLAGS_SECOND;
#ifdef NR_PAGES_IN_LARGE_FOLIO
@@ -1375,6 +1376,7 @@ __always_inline bool __free_pages_prepare(struct page *page,
}
(page + i)->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
}
+ hwpoison_rcu_unlock();
}
if (folio_test_anon(folio)) {
mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
@@ -1391,8 +1393,10 @@ __always_inline bool __free_pages_prepare(struct page *page,
return false;
}
+ hwpoison_rcu_lock_flags(&page->flags.f);
page_cpupid_reset_last(page);
page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
+ hwpoison_rcu_unlock_flags(&page->flags.f);
page->private = 0;
reset_page_owner(page, order);
page_table_check_free(page, order);
diff --git a/mm/slub.c b/mm/slub.c
index a2bf3756ca7d..fdfb3019b562 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -617,7 +617,7 @@ static inline void slab_set_pfmemalloc(struct slab *slab)
static inline void __slab_clear_pfmemalloc(struct slab *slab)
{
- __clear_bit(SL_pfmemalloc, &slab->flags.f);
+ hwpoison_safe_clear_bit(SL_pfmemalloc, &slab->flags.f);
}
/*
--
MST
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
2026-06-28 21:45 [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops Michael S. Tsirkin
2026-06-28 21:45 ` [PATCH 1/2] mm: memory-failure: use RCU to fix HWPoison flag race Michael S. Tsirkin
2026-06-28 21:45 ` [PATCH 2/2] mm: wrap non-atomic page flag ops in RCU for HWPoison safety Michael S. Tsirkin
@ 2026-06-29 2:11 ` Andi Kleen
2026-06-29 8:10 ` Michael S. Tsirkin
2026-06-29 6:49 ` David Hildenbrand (Arm)
3 siblings, 1 reply; 10+ messages in thread
From: Andi Kleen @ 2026-06-29 2:11 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: linux-kernel, David Hildenbrand, Miaohe Lin, Naoya Horiguchi,
Andrew Morton, Oscar Salvador, Hidehiro Kawai, Rik van Riel,
Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl,
David Hildenbrand
On Sun, Jun 28, 2026 at 05:45:22PM -0400, Michael S. Tsirkin wrote:
> This series fixes the race by:
>
> 1. Having memory_failure() call synchronize_rcu() + retry after
> setting HWPoison, so that any in-flight non-atomic RMW that
> read the old flags value completes before we proceed.
>
> 2. Wrapping all non-atomic page flag operations in
> rcu_read_lock/rcu_read_unlock (CONFIG_MEMORY_FAILURE only),
> so that synchronize_rcu() actually drains them.
It wouldn't surprise me if your underlying performance assumptions
-- an non contended atomic is cheaper than a rcu_read_lock/unlock --
are not true in various CPU/kernel configuration combinations.
Modern CPUs have fast atomics when uncontended in normal circumstances.
But it probably doesn't matter much either way because the difference
shouldn't be very much. It seems very complicated for something that
could be much simpler.
But I guess it's fine.
-Andi
>
> Performance impact (page alloc+free microbenchmark, 200K iterations,
> 20 runs, KVM guest, error bars are 3-sigma):
>
> !PREEMPT_RCU (x86):
> insns/iter cycles/iter
> base: 12237 +/- 1 17954 +/- 136
> patched: +22 +/- 1 -124 +/- 122
> (+0.18%) (within noise)
>
> PREEMPT_RCU:
> insns/iter cycles/iter
> base: 12512 +/- 3 18541 +/- 214
> patched: +95 +/- 3 -12 +/- 161
> (+0.76%) (within noise)
>
> When !CONFIG_MEMORY_FAILURE, all wrappers compile away completely.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
2026-06-28 21:45 [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops Michael S. Tsirkin
` (2 preceding siblings ...)
2026-06-29 2:11 ` [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops Andi Kleen
@ 2026-06-29 6:49 ` David Hildenbrand (Arm)
2026-06-29 7:34 ` Michael S. Tsirkin
3 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-29 6:49 UTC (permalink / raw)
To: Michael S. Tsirkin, linux-kernel
Cc: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Oscar Salvador,
Andi Kleen, Hidehiro Kawai, Rik van Riel, Vlastimil Babka,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl
On 6/28/26 23:45, Michael S. Tsirkin wrote:
> I don't like it that we are adding overhead to the good path for
> the benefit of memory failure, which never triggers on many systems,
> but I don't have a better idea. Pls take a look.
As I said on Friday.
"It's also doesn't address the mf_mutex implications and the x86 thingies I
mentioned.
...
I'll either take care of that myself or find someone that can work on this with
attention to all details.
"
This is nothing to vibe-code. This needs a real expert.
>
> Non-atomic page flag operations (page->flags.f &= ~mask, __set_bit,
> __clear_bit) can race with atomic TestSetPageHWPoison() in
> memory_failure(). The non-atomic RMW reads flags, memory_failure()
> atomically sets HWPoison, then the RMW writes back the old value
> without HWPoison, clobbering the bit.
>
> The race was confirmed by injecting a cpu_relax() delay between the
> load and store of the non-atomic RMW in __free_pages_prepare, then
> running concurrent MADV_HWPOISON injection. The clobbered HWPoison
> bit was observed repeatedly.
>
> This series fixes the race by:
>
> 1. Having memory_failure() call synchronize_rcu() + retry after
> setting HWPoison, so that any in-flight non-atomic RMW that
> read the old flags value completes before we proceed.
>
> 2. Wrapping all non-atomic page flag operations in
> rcu_read_lock/rcu_read_unlock (CONFIG_MEMORY_FAILURE only),
> so that synchronize_rcu() actually drains them.
>
> Performance impact (page alloc+free microbenchmark, 200K iterations,
> 20 runs, KVM guest, error bars are 3-sigma):
>
> !PREEMPT_RCU (x86):
> insns/iter cycles/iter
> base: 12237 +/- 1 17954 +/- 136
> patched: +22 +/- 1 -124 +/- 122
> (+0.18%) (within noise)
>
> PREEMPT_RCU:
> insns/iter cycles/iter
> base: 12512 +/- 3 18541 +/- 214
> patched: +95 +/- 3 -12 +/- 161
> (+0.76%) (within noise)
>
> When !CONFIG_MEMORY_FAILURE, all wrappers compile away completely.
>
> Suggested-by: David Hildenbrand <david@redhat.com>
No ;)
--
Cheers,
David
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
2026-06-29 6:49 ` David Hildenbrand (Arm)
@ 2026-06-29 7:34 ` Michael S. Tsirkin
2026-06-29 13:05 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 10+ messages in thread
From: Michael S. Tsirkin @ 2026-06-29 7:34 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: linux-kernel, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
Oscar Salvador, Andi Kleen, Hidehiro Kawai, Rik van Riel,
Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl
On Mon, Jun 29, 2026 at 08:49:37AM +0200, David Hildenbrand (Arm) wrote:
> On 6/28/26 23:45, Michael S. Tsirkin wrote:
> > I don't like it that we are adding overhead to the good path for
> > the benefit of memory failure, which never triggers on many systems,
> > but I don't have a better idea. Pls take a look.
>
> As I said on Friday.
>
> "It's also doesn't address the mf_mutex implications and the x86 thingies I
> mentioned.
Well I did attempt addressing this. These would be these two:
(a) We don't hold the mf_mutex on all call paths, but we really need it so a
page_test_set_hwpoison() cannot race in weird ways with the other primitives I think.
page_test_set_hwpoison was this code you wrote:
+static void page_set_hwpoison(struct page *page)
+{
+ lockdep_assert_held(&mf_mutex);
+
+ while (!PageHWPoison(page)) {
+ SetPageHWPoison(page);
+
+ /* Make sure concurrent non-atomic writers completed. */
+ synchronize_rcu();
+ }
+}
and indeed the test+set combination seems racy. But consider the version I posted, for example:
+/*
+ * Drain any in-flight non-atomic page flag operations that could
+ * clobber a concurrently set HWPoison bit. Retries until the bit sticks.
+ */
+static void set_hwpoison_drain_rcu(struct page *p)
+{
+ do {
+ synchronize_rcu();
+ } while (!TestSetPageHWPoison(p));
+}
+
...
+static bool test_and_set_hwpoison_drain_rcu(struct page *p)
+{
+ bool was_set = TestSetPageHWPoison(p);
+
+ set_hwpoison_drain_rcu(p);
+ return was_set;
+}
does not seem racy without a lock. But maybe I don't get it.
(b) There are some leftover SetPageHWPoison etc. instances. The ones in
arch/x86/kernel/cpu/mce/core.c likely cannot grab the mutex, but maybe they are
corner cases either way and we can document the situation.
Well, I did try to document the situation - it's in the commit log for
patch 1:
Note: the MCE handler in arch/x86/kernel/cpu/mce/core.c also calls
SetPageHWPoison() and is subject to the same race. It cannot use
the drain helpers (MCE context cannot call synchronize_rcu()).
For recoverable MCE errors, memory_failure() is queued via work
items (kill_me_maybe/kill_me_never) and will re-set the bit via
test_and_set_hwpoison_drain_rcu() if it was clobbered. The
mce_panic() path sets HWPoison for kdump right before panic() so
the race is irrelevant there. The MCG_STATUS_SEAM_NR path does
not queue memory_failure(), but the affected page belongs to a
TDX guest whose CPU core has already been marked dead - the page
is not subject to concurrent non-atomic flag operations in the
buddy allocator, so the race does not apply.
> ...
>
> I'll either take care of that myself or find someone that can work on this with
> attention to all details.
> "
>
> This is nothing to vibe-code. This needs a real expert.
Well I had this sitting on the disk anyway, so I thought I'd post.
I wouldn't call this vibe-code - a bunch of manual work went into this,
llms mostly as a grep/sed replacement. But hey. I don't object to
someone taking over, for sure. Was fun, and maybe these patches will be
helpful as a starting point.
In particular, maybe I should have been more explicit about how your
points from Friday are addressed.
If you want to add a bit more to explain the exact concerns here, for
whoever works on this next, feel free to do so.
> >
> > Non-atomic page flag operations (page->flags.f &= ~mask, __set_bit,
> > __clear_bit) can race with atomic TestSetPageHWPoison() in
> > memory_failure(). The non-atomic RMW reads flags, memory_failure()
> > atomically sets HWPoison, then the RMW writes back the old value
> > without HWPoison, clobbering the bit.
> >
> > The race was confirmed by injecting a cpu_relax() delay between the
> > load and store of the non-atomic RMW in __free_pages_prepare, then
> > running concurrent MADV_HWPOISON injection. The clobbered HWPoison
> > bit was observed repeatedly.
> >
> > This series fixes the race by:
> >
> > 1. Having memory_failure() call synchronize_rcu() + retry after
> > setting HWPoison, so that any in-flight non-atomic RMW that
> > read the old flags value completes before we proceed.
> >
> > 2. Wrapping all non-atomic page flag operations in
> > rcu_read_lock/rcu_read_unlock (CONFIG_MEMORY_FAILURE only),
> > so that synchronize_rcu() actually drains them.
> >
> > Performance impact (page alloc+free microbenchmark, 200K iterations,
> > 20 runs, KVM guest, error bars are 3-sigma):
> >
> > !PREEMPT_RCU (x86):
> > insns/iter cycles/iter
> > base: 12237 +/- 1 17954 +/- 136
> > patched: +22 +/- 1 -124 +/- 122
> > (+0.18%) (within noise)
> >
> > PREEMPT_RCU:
> > insns/iter cycles/iter
> > base: 12512 +/- 3 18541 +/- 214
> > patched: +95 +/- 3 -12 +/- 161
> > (+0.76%) (within noise)
> >
> > When !CONFIG_MEMORY_FAILURE, all wrappers compile away completely.
> >
> > Suggested-by: David Hildenbrand <david@redhat.com>
>
> No ;)
>
> --
> Cheers,
>
> David
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
2026-06-29 2:11 ` [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops Andi Kleen
@ 2026-06-29 8:10 ` Michael S. Tsirkin
2026-06-29 8:21 ` David Hildenbrand (Arm)
2026-06-29 8:39 ` Michael S. Tsirkin
0 siblings, 2 replies; 10+ messages in thread
From: Michael S. Tsirkin @ 2026-06-29 8:10 UTC (permalink / raw)
To: Andi Kleen
Cc: linux-kernel, David Hildenbrand, Miaohe Lin, Naoya Horiguchi,
Andrew Morton, Oscar Salvador, Hidehiro Kawai, Rik van Riel,
Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl,
David Hildenbrand
On Sun, Jun 28, 2026 at 07:11:58PM -0700, Andi Kleen wrote:
> On Sun, Jun 28, 2026 at 05:45:22PM -0400, Michael S. Tsirkin wrote:
> > This series fixes the race by:
> >
> > 1. Having memory_failure() call synchronize_rcu() + retry after
> > setting HWPoison, so that any in-flight non-atomic RMW that
> > read the old flags value completes before we proceed.
> >
> > 2. Wrapping all non-atomic page flag operations in
> > rcu_read_lock/rcu_read_unlock (CONFIG_MEMORY_FAILURE only),
> > so that synchronize_rcu() actually drains them.
>
> It wouldn't surprise me if your underlying performance assumptions
> -- an non contended atomic is cheaper than a rcu_read_lock/unlock --
> are not true in various CPU/kernel configuration combinations.
>
> Modern CPUs have fast atomics when uncontended in normal circumstances.
> But it probably doesn't matter much either way because the difference
> shouldn't be very much.
Hmm. It's a bit silly that I didn't try. Seemed clear to me, but,
on this old xeon...
insns/iter cycles/iter
-------------------------------------------------------
base 12238 +/- 1.0 17889 +/- 97.9
rcu_read_lock 12251 +/- 7.3 17991 +/-191.6
atomic ops 12233 +/- 1.9 17733 +/-136.5
The diff in the noise.
And old, slow CPUs maybe don't have MF at all.
So maybe just atomics instead of all this mess.
> It seems very complicated for something that
> could be much simpler.
>
> But I guess it's fine.
>
> -Andi
Indeed. David already said he's gonnu look at this himself, but here's what I
tested, maybe helpful for whoever wants to try this approach:
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7223f6f4e2b4..8f0940cf2f29 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -404,6 +404,44 @@ static unsigned long *folio_flags(struct folio *folio, unsigned n)
#define FOLIO_HEAD_PAGE 0
#define FOLIO_SECOND_PAGE 1
+/*
+ * When CONFIG_MEMORY_FAILURE is enabled, non-atomic page flag operations
+ * must be atomic to prevent clobbering concurrent TestSetPageHWPoison()
+ * in memory_failure(). Otherwise, use the cheaper non-atomic versions.
+ */
+#ifdef CONFIG_MEMORY_FAILURE
+#define __pf_set_bit set_bit
+#define __pf_clear_bit clear_bit
+
+static __always_inline void page_flags_clear(unsigned long *flags,
+ unsigned long mask)
+{
+ atomic_long_andnot(mask, (atomic_long_t *)flags);
+}
+
+static __always_inline void page_flags_set(unsigned long *flags,
+ unsigned long mask)
+{
+ atomic_long_or(mask, (atomic_long_t *)flags);
+}
+
+#else /* !CONFIG_MEMORY_FAILURE */
+#define __pf_set_bit __set_bit
+#define __pf_clear_bit __clear_bit
+
+static __always_inline void page_flags_clear(unsigned long *flags,
+ unsigned long mask)
+{
+ *flags &= ~mask;
+}
+
+static __always_inline void page_flags_set(unsigned long *flags,
+ unsigned long mask)
+{
+ *flags |= mask;
+}
+#endif /* CONFIG_MEMORY_FAILURE */
+
/*
* Macros to create function definitions for page flags
*/
@@ -421,11 +459,11 @@ static __always_inline void folio_clear_##name(struct folio *folio) \
#define __FOLIO_SET_FLAG(name, page) \
static __always_inline void __folio_set_##name(struct folio *folio) \
-{ __set_bit(PG_##name, folio_flags(folio, page)); }
+{ __pf_set_bit(PG_##name, folio_flags(folio, page)); }
#define __FOLIO_CLEAR_FLAG(name, page) \
static __always_inline void __folio_clear_##name(struct folio *folio) \
-{ __clear_bit(PG_##name, folio_flags(folio, page)); }
+{ __pf_clear_bit(PG_##name, folio_flags(folio, page)); }
#define FOLIO_TEST_SET_FLAG(name, page) \
static __always_inline bool folio_test_set_##name(struct folio *folio) \
@@ -458,12 +496,12 @@ static __always_inline void ClearPage##uname(struct page *page) \
#define __SETPAGEFLAG(uname, lname, policy) \
__FOLIO_SET_FLAG(lname, FOLIO_##policy) \
static __always_inline void __SetPage##uname(struct page *page) \
-{ __set_bit(PG_##lname, &policy(page, 1)->flags.f); }
+{ __pf_set_bit(PG_##lname, &policy(page, 1)->flags.f); }
#define __CLEARPAGEFLAG(uname, lname, policy) \
__FOLIO_CLEAR_FLAG(lname, FOLIO_##policy) \
static __always_inline void __ClearPage##uname(struct page *page) \
-{ __clear_bit(PG_##lname, &policy(page, 1)->flags.f); }
+{ __pf_clear_bit(PG_##lname, &policy(page, 1)->flags.f); }
#define TESTSETFLAG(uname, lname, policy) \
FOLIO_TEST_SET_FLAG(lname, FOLIO_##policy) \
@@ -806,7 +844,7 @@ static inline bool PageUptodate(const struct page *page)
static __always_inline void __folio_mark_uptodate(struct folio *folio)
{
smp_wmb();
- __set_bit(PG_uptodate, folio_flags(folio, 0));
+ __pf_set_bit(PG_uptodate, folio_flags(folio, 0));
}
static __always_inline void folio_mark_uptodate(struct folio *folio)
@@ -1162,14 +1200,14 @@ static __always_inline void ClearPageAnonExclusive(struct page *page)
{
VM_BUG_ON_PGFLAGS(!PageAnonNotKsm(page), page);
VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
- clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
+ __pf_clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
}
static __always_inline void __ClearPageAnonExclusive(struct page *page)
{
VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
- __clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
+ __pf_clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
}
#ifdef CONFIG_MMU
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 06bbe9eba636..931dfc84319f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2343,7 +2343,7 @@ int folio_xchg_last_cpupid(struct folio *folio, int cpupid);
static inline void page_cpupid_reset_last(struct page *page)
{
- page->flags.f |= LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT;
+ page_flags_set(&page->flags.f, LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT);
}
#endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */
@@ -2503,8 +2503,8 @@ static inline struct zone *folio_zone(const struct folio *folio)
#ifdef SECTION_IN_PAGE_FLAGS
static inline void set_page_section(struct page *page, unsigned long section)
{
- page->flags.f &= ~(SECTIONS_MASK << SECTIONS_PGSHIFT);
- page->flags.f |= (section & SECTIONS_MASK) << SECTIONS_PGSHIFT;
+ page_flags_clear(&page->flags.f, SECTIONS_MASK << SECTIONS_PGSHIFT);
+ page_flags_set(&page->flags.f, (section & SECTIONS_MASK) << SECTIONS_PGSHIFT);
}
static inline unsigned long memdesc_section(memdesc_flags_t mdf)
@@ -2719,14 +2719,14 @@ static inline bool folio_is_longterm_pinnable(struct folio *folio)
static inline void set_page_zone(struct page *page, enum zone_type zone)
{
- page->flags.f &= ~(ZONES_MASK << ZONES_PGSHIFT);
- page->flags.f |= (zone & ZONES_MASK) << ZONES_PGSHIFT;
+ page_flags_clear(&page->flags.f, ZONES_MASK << ZONES_PGSHIFT);
+ page_flags_set(&page->flags.f, (zone & ZONES_MASK) << ZONES_PGSHIFT);
}
static inline void set_page_node(struct page *page, unsigned long node)
{
- page->flags.f &= ~(NODES_MASK << NODES_PGSHIFT);
- page->flags.f |= (node & NODES_MASK) << NODES_PGSHIFT;
+ page_flags_clear(&page->flags.f, NODES_MASK << NODES_PGSHIFT);
+ page_flags_set(&page->flags.f, (node & NODES_MASK) << NODES_PGSHIFT);
}
static inline void set_page_links(struct page *page, enum zone_type zone,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 970e077019b7..100226f59490 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3624,8 +3624,8 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
* unreferenced sub-pages of an anonymous THP: we can simply drop
* PG_anon_exclusive (-> PG_mappedtodisk) for these here.
*/
- new_folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
- new_folio->flags.f |= (folio->flags.f &
+ page_flags_clear(&new_folio->flags.f, PAGE_FLAGS_CHECK_AT_PREP);
+ page_flags_set(&new_folio->flags.f, folio->flags.f &
((1L << PG_referenced) |
(1L << PG_swapbacked) |
(1L << PG_swapcache) |
diff --git a/mm/memremap.c b/mm/memremap.c
index 053842d45cb1..194ca2f04a87 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -494,7 +494,7 @@ void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
* blindly clear bits which could have set my order field here,
* including page head.
*/
- new_page->flags.f &= ~0xffUL; /* Clear possible order, page head */
+ page_flags_clear(&new_page->flags.f, 0xffUL); /* Clear possible order, page head */
#ifdef NR_PAGES_IN_LARGE_FOLIO
/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d49c254174da..a3447124a725 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1359,7 +1359,7 @@ __always_inline bool __free_pages_prepare(struct page *page,
int i;
if (compound) {
- page[1].flags.f &= ~PAGE_FLAGS_SECOND;
+ page_flags_clear(&page[1].flags.f, PAGE_FLAGS_SECOND);
#ifdef NR_PAGES_IN_LARGE_FOLIO
folio->_nr_pages = 0;
#endif
@@ -1373,7 +1373,7 @@ __always_inline bool __free_pages_prepare(struct page *page,
continue;
}
}
- (page + i)->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
+ page_flags_clear(&(page + i)->flags.f, PAGE_FLAGS_CHECK_AT_PREP);
}
}
if (folio_test_anon(folio)) {
@@ -1392,7 +1392,7 @@ __always_inline bool __free_pages_prepare(struct page *page,
}
page_cpupid_reset_last(page);
- page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
+ page_flags_clear(&page->flags.f, PAGE_FLAGS_CHECK_AT_PREP);
page->private = 0;
reset_page_owner(page, order);
page_table_check_free(page, order);
diff --git a/mm/slub.c b/mm/slub.c
index a2bf3756ca7d..a55199f642d3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -617,7 +617,7 @@ static inline void slab_set_pfmemalloc(struct slab *slab)
static inline void __slab_clear_pfmemalloc(struct slab *slab)
{
- __clear_bit(SL_pfmemalloc, &slab->flags.f);
+ __pf_clear_bit(SL_pfmemalloc, &slab->flags.f);
}
/*
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
2026-06-29 8:10 ` Michael S. Tsirkin
@ 2026-06-29 8:21 ` David Hildenbrand (Arm)
2026-06-29 8:39 ` Michael S. Tsirkin
1 sibling, 0 replies; 10+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-29 8:21 UTC (permalink / raw)
To: Michael S. Tsirkin, Andi Kleen
Cc: linux-kernel, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
Oscar Salvador, Hidehiro Kawai, Rik van Riel, Vlastimil Babka,
Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl
On 6/29/26 10:10, Michael S. Tsirkin wrote:
> On Sun, Jun 28, 2026 at 07:11:58PM -0700, Andi Kleen wrote:
>> On Sun, Jun 28, 2026 at 05:45:22PM -0400, Michael S. Tsirkin wrote:
>>> This series fixes the race by:
>>>
>>> 1. Having memory_failure() call synchronize_rcu() + retry after
>>> setting HWPoison, so that any in-flight non-atomic RMW that
>>> read the old flags value completes before we proceed.
>>>
>>> 2. Wrapping all non-atomic page flag operations in
>>> rcu_read_lock/rcu_read_unlock (CONFIG_MEMORY_FAILURE only),
>>> so that synchronize_rcu() actually drains them.
>>
>> It wouldn't surprise me if your underlying performance assumptions
>> -- an non contended atomic is cheaper than a rcu_read_lock/unlock --
>> are not true in various CPU/kernel configuration combinations.
>>
>> Modern CPUs have fast atomics when uncontended in normal circumstances.
>> But it probably doesn't matter much either way because the difference
>> shouldn't be very much.
>
>
> Hmm. It's a bit silly that I didn't try. Seemed clear to me, but,
> on this old xeon...
>
> insns/iter cycles/iter
> -------------------------------------------------------
> base 12238 +/- 1.0 17889 +/- 97.9
> rcu_read_lock 12251 +/- 7.3 17991 +/-191.6
> atomic ops 12233 +/- 1.9 17733 +/-136.5
>
>
> The diff in the noise.
>
> And old, slow CPUs maybe don't have MF at all.
>
> So maybe just atomics instead of all this mess.
That would be much better.
What I was concerned about so far was that many distributions enable hwpoison
handling unconditionally (independent of any specific CPU!).
I recall running experiments on some not-so-dated hardware 2 years ago (when
optimizing out rmap atomics) where additional atomics really hurt, even in
uncontended cases.
>
>
>
>
>> It seems very complicated for something that
>> could be much simpler.
>>
>> But I guess it's fine.
>>
>> -Andi
>
> Indeed. David already said he's gonnu look at this himself, but he
If we can go that simple route (I'm not sure yet), your patch would be fine. I
can try finding someone to run more experiments on arm64 hardware.
--
Cheers,
David
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
2026-06-29 8:10 ` Michael S. Tsirkin
2026-06-29 8:21 ` David Hildenbrand (Arm)
@ 2026-06-29 8:39 ` Michael S. Tsirkin
1 sibling, 0 replies; 10+ messages in thread
From: Michael S. Tsirkin @ 2026-06-29 8:39 UTC (permalink / raw)
To: Andi Kleen
Cc: linux-kernel, David Hildenbrand, Miaohe Lin, Naoya Horiguchi,
Andrew Morton, Oscar Salvador, Hidehiro Kawai, Rik van Riel,
Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl,
David Hildenbrand
On Mon, Jun 29, 2026 at 04:10:33AM -0400, Michael S. Tsirkin wrote:
> On Sun, Jun 28, 2026 at 07:11:58PM -0700, Andi Kleen wrote:
> > On Sun, Jun 28, 2026 at 05:45:22PM -0400, Michael S. Tsirkin wrote:
> > > This series fixes the race by:
> > >
> > > 1. Having memory_failure() call synchronize_rcu() + retry after
> > > setting HWPoison, so that any in-flight non-atomic RMW that
> > > read the old flags value completes before we proceed.
> > >
> > > 2. Wrapping all non-atomic page flag operations in
> > > rcu_read_lock/rcu_read_unlock (CONFIG_MEMORY_FAILURE only),
> > > so that synchronize_rcu() actually drains them.
> >
> > It wouldn't surprise me if your underlying performance assumptions
> > -- an non contended atomic is cheaper than a rcu_read_lock/unlock --
> > are not true in various CPU/kernel configuration combinations.
> >
> > Modern CPUs have fast atomics when uncontended in normal circumstances.
> > But it probably doesn't matter much either way because the difference
> > shouldn't be very much.
>
>
> Hmm. It's a bit silly that I didn't try. Seemed clear to me, but,
> on this old xeon...
>
> insns/iter cycles/iter
> -------------------------------------------------------
> base 12238 +/- 1.0 17889 +/- 97.9
> rcu_read_lock 12251 +/- 7.3 17991 +/-191.6
> atomic ops 12233 +/- 1.9 17733 +/-136.5
>
>
> The diff in the noise.
>
> And old, slow CPUs maybe don't have MF at all.
>
> So maybe just atomics instead of all this mess.
However, this was a basic test, when allocating 4k pages. With 2M hugepages:
insns/iter cycles/iter
-------------------------------------------------------
base 20758 +/- 12.5 191208 +/-1946.6
rcu 20785 +/- 3.7 197108 +/- 132.1
atomic 20727 +/- 6.4 204591 +/- 160.2
rcu vs base +27 (+0.13%) +5900 (+3.09%)
atomic vs base -31 (-0.15%) +13383 (+7.00%)
and even with THP:
insns/iter cycles/iter
-------------------------------------------------------
base 27220 +/- 2.8 192151 +/- 483.3
rcu 27248 +/- 30.1 194159 +/-2746.6
atomic 27186 +/- 3.2 200526 +/- 746.2
rcu vs base +28 (+0.10%) +2008 (+1.04%)
atomic vs base -34 (-0.12%) +8374 (+4.36%)
needs more thought.
>
>
>
> > It seems very complicated for something that
> > could be much simpler.
> >
> > But I guess it's fine.
> >
> > -Andi
>
> Indeed. David already said he's gonnu look at this himself, but here's what I
> tested, maybe helpful for whoever wants to try this approach:
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 7223f6f4e2b4..8f0940cf2f29 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -404,6 +404,44 @@ static unsigned long *folio_flags(struct folio *folio, unsigned n)
> #define FOLIO_HEAD_PAGE 0
> #define FOLIO_SECOND_PAGE 1
>
> +/*
> + * When CONFIG_MEMORY_FAILURE is enabled, non-atomic page flag operations
> + * must be atomic to prevent clobbering concurrent TestSetPageHWPoison()
> + * in memory_failure(). Otherwise, use the cheaper non-atomic versions.
> + */
> +#ifdef CONFIG_MEMORY_FAILURE
> +#define __pf_set_bit set_bit
> +#define __pf_clear_bit clear_bit
> +
> +static __always_inline void page_flags_clear(unsigned long *flags,
> + unsigned long mask)
> +{
> + atomic_long_andnot(mask, (atomic_long_t *)flags);
> +}
> +
> +static __always_inline void page_flags_set(unsigned long *flags,
> + unsigned long mask)
> +{
> + atomic_long_or(mask, (atomic_long_t *)flags);
> +}
> +
> +#else /* !CONFIG_MEMORY_FAILURE */
> +#define __pf_set_bit __set_bit
> +#define __pf_clear_bit __clear_bit
> +
> +static __always_inline void page_flags_clear(unsigned long *flags,
> + unsigned long mask)
> +{
> + *flags &= ~mask;
> +}
> +
> +static __always_inline void page_flags_set(unsigned long *flags,
> + unsigned long mask)
> +{
> + *flags |= mask;
> +}
> +#endif /* CONFIG_MEMORY_FAILURE */
> +
> /*
> * Macros to create function definitions for page flags
> */
> @@ -421,11 +459,11 @@ static __always_inline void folio_clear_##name(struct folio *folio) \
>
> #define __FOLIO_SET_FLAG(name, page) \
> static __always_inline void __folio_set_##name(struct folio *folio) \
> -{ __set_bit(PG_##name, folio_flags(folio, page)); }
> +{ __pf_set_bit(PG_##name, folio_flags(folio, page)); }
>
> #define __FOLIO_CLEAR_FLAG(name, page) \
> static __always_inline void __folio_clear_##name(struct folio *folio) \
> -{ __clear_bit(PG_##name, folio_flags(folio, page)); }
> +{ __pf_clear_bit(PG_##name, folio_flags(folio, page)); }
>
> #define FOLIO_TEST_SET_FLAG(name, page) \
> static __always_inline bool folio_test_set_##name(struct folio *folio) \
> @@ -458,12 +496,12 @@ static __always_inline void ClearPage##uname(struct page *page) \
> #define __SETPAGEFLAG(uname, lname, policy) \
> __FOLIO_SET_FLAG(lname, FOLIO_##policy) \
> static __always_inline void __SetPage##uname(struct page *page) \
> -{ __set_bit(PG_##lname, &policy(page, 1)->flags.f); }
> +{ __pf_set_bit(PG_##lname, &policy(page, 1)->flags.f); }
>
> #define __CLEARPAGEFLAG(uname, lname, policy) \
> __FOLIO_CLEAR_FLAG(lname, FOLIO_##policy) \
> static __always_inline void __ClearPage##uname(struct page *page) \
> -{ __clear_bit(PG_##lname, &policy(page, 1)->flags.f); }
> +{ __pf_clear_bit(PG_##lname, &policy(page, 1)->flags.f); }
>
> #define TESTSETFLAG(uname, lname, policy) \
> FOLIO_TEST_SET_FLAG(lname, FOLIO_##policy) \
> @@ -806,7 +844,7 @@ static inline bool PageUptodate(const struct page *page)
> static __always_inline void __folio_mark_uptodate(struct folio *folio)
> {
> smp_wmb();
> - __set_bit(PG_uptodate, folio_flags(folio, 0));
> + __pf_set_bit(PG_uptodate, folio_flags(folio, 0));
> }
>
> static __always_inline void folio_mark_uptodate(struct folio *folio)
> @@ -1162,14 +1200,14 @@ static __always_inline void ClearPageAnonExclusive(struct page *page)
> {
> VM_BUG_ON_PGFLAGS(!PageAnonNotKsm(page), page);
> VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
> - clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
> + __pf_clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
> }
>
> static __always_inline void __ClearPageAnonExclusive(struct page *page)
> {
> VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
> VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
> - __clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
> + __pf_clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
> }
>
> #ifdef CONFIG_MMU
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 06bbe9eba636..931dfc84319f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2343,7 +2343,7 @@ int folio_xchg_last_cpupid(struct folio *folio, int cpupid);
>
> static inline void page_cpupid_reset_last(struct page *page)
> {
> - page->flags.f |= LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT;
> + page_flags_set(&page->flags.f, LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT);
> }
> #endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */
>
> @@ -2503,8 +2503,8 @@ static inline struct zone *folio_zone(const struct folio *folio)
> #ifdef SECTION_IN_PAGE_FLAGS
> static inline void set_page_section(struct page *page, unsigned long section)
> {
> - page->flags.f &= ~(SECTIONS_MASK << SECTIONS_PGSHIFT);
> - page->flags.f |= (section & SECTIONS_MASK) << SECTIONS_PGSHIFT;
> + page_flags_clear(&page->flags.f, SECTIONS_MASK << SECTIONS_PGSHIFT);
> + page_flags_set(&page->flags.f, (section & SECTIONS_MASK) << SECTIONS_PGSHIFT);
> }
>
> static inline unsigned long memdesc_section(memdesc_flags_t mdf)
> @@ -2719,14 +2719,14 @@ static inline bool folio_is_longterm_pinnable(struct folio *folio)
>
> static inline void set_page_zone(struct page *page, enum zone_type zone)
> {
> - page->flags.f &= ~(ZONES_MASK << ZONES_PGSHIFT);
> - page->flags.f |= (zone & ZONES_MASK) << ZONES_PGSHIFT;
> + page_flags_clear(&page->flags.f, ZONES_MASK << ZONES_PGSHIFT);
> + page_flags_set(&page->flags.f, (zone & ZONES_MASK) << ZONES_PGSHIFT);
> }
>
> static inline void set_page_node(struct page *page, unsigned long node)
> {
> - page->flags.f &= ~(NODES_MASK << NODES_PGSHIFT);
> - page->flags.f |= (node & NODES_MASK) << NODES_PGSHIFT;
> + page_flags_clear(&page->flags.f, NODES_MASK << NODES_PGSHIFT);
> + page_flags_set(&page->flags.f, (node & NODES_MASK) << NODES_PGSHIFT);
> }
>
> static inline void set_page_links(struct page *page, enum zone_type zone,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 970e077019b7..100226f59490 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3624,8 +3624,8 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
> * unreferenced sub-pages of an anonymous THP: we can simply drop
> * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
> */
> - new_folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> - new_folio->flags.f |= (folio->flags.f &
> + page_flags_clear(&new_folio->flags.f, PAGE_FLAGS_CHECK_AT_PREP);
> + page_flags_set(&new_folio->flags.f, folio->flags.f &
> ((1L << PG_referenced) |
> (1L << PG_swapbacked) |
> (1L << PG_swapcache) |
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 053842d45cb1..194ca2f04a87 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -494,7 +494,7 @@ void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
> * blindly clear bits which could have set my order field here,
> * including page head.
> */
> - new_page->flags.f &= ~0xffUL; /* Clear possible order, page head */
> + page_flags_clear(&new_page->flags.f, 0xffUL); /* Clear possible order, page head */
>
> #ifdef NR_PAGES_IN_LARGE_FOLIO
> /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d49c254174da..a3447124a725 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1359,7 +1359,7 @@ __always_inline bool __free_pages_prepare(struct page *page,
> int i;
>
> if (compound) {
> - page[1].flags.f &= ~PAGE_FLAGS_SECOND;
> + page_flags_clear(&page[1].flags.f, PAGE_FLAGS_SECOND);
> #ifdef NR_PAGES_IN_LARGE_FOLIO
> folio->_nr_pages = 0;
> #endif
> @@ -1373,7 +1373,7 @@ __always_inline bool __free_pages_prepare(struct page *page,
> continue;
> }
> }
> - (page + i)->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> + page_flags_clear(&(page + i)->flags.f, PAGE_FLAGS_CHECK_AT_PREP);
> }
> }
> if (folio_test_anon(folio)) {
> @@ -1392,7 +1392,7 @@ __always_inline bool __free_pages_prepare(struct page *page,
> }
>
> page_cpupid_reset_last(page);
> - page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> + page_flags_clear(&page->flags.f, PAGE_FLAGS_CHECK_AT_PREP);
> page->private = 0;
> reset_page_owner(page, order);
> page_table_check_free(page, order);
> diff --git a/mm/slub.c b/mm/slub.c
> index a2bf3756ca7d..a55199f642d3 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -617,7 +617,7 @@ static inline void slab_set_pfmemalloc(struct slab *slab)
>
> static inline void __slab_clear_pfmemalloc(struct slab *slab)
> {
> - __clear_bit(SL_pfmemalloc, &slab->flags.f);
> + __pf_clear_bit(SL_pfmemalloc, &slab->flags.f);
> }
>
> /*
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
2026-06-29 7:34 ` Michael S. Tsirkin
@ 2026-06-29 13:05 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 10+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-29 13:05 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: linux-kernel, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
Oscar Salvador, Andi Kleen, Hidehiro Kawai, Rik van Riel,
Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl
On 6/29/26 09:34, Michael S. Tsirkin wrote:
> On Mon, Jun 29, 2026 at 08:49:37AM +0200, David Hildenbrand (Arm) wrote:
>> On 6/28/26 23:45, Michael S. Tsirkin wrote:
>>> I don't like it that we are adding overhead to the good path for
>>> the benefit of memory failure, which never triggers on many systems,
>>> but I don't have a better idea. Pls take a look.
>>
>> As I said on Friday.
>>
>> "It's also doesn't address the mf_mutex implications and the x86 thingies I
>> mentioned.
>
> Well I did attempt addressing this. These would be these two:
>
> (a) We don't hold the mf_mutex on all call paths, but we really need it so a
> page_test_set_hwpoison() cannot race in weird ways with the other primitives I think.
>
> page_test_set_hwpoison was this code you wrote:
>
> +static void page_set_hwpoison(struct page *page)
> +{
> + lockdep_assert_held(&mf_mutex);
> +
> + while (!PageHWPoison(page)) {
> + SetPageHWPoison(page);
> +
> + /* Make sure concurrent non-atomic writers completed. */
> + synchronize_rcu();
> + }
> +}
>
> and indeed the test+set combination seems racy. But consider the version I posted, for example:
>
> +/*
> + * Drain any in-flight non-atomic page flag operations that could
> + * clobber a concurrently set HWPoison bit. Retries until the bit sticks.
> + */
> +static void set_hwpoison_drain_rcu(struct page *p)
> +{
> + do {
> + synchronize_rcu();
> + } while (!TestSetPageHWPoison(p));
> +}
> +
>
> ...
>
> +static bool test_and_set_hwpoison_drain_rcu(struct page *p)
> +{
> + bool was_set = TestSetPageHWPoison(p);
> +
> + set_hwpoison_drain_rcu(p);
> + return was_set;
> +}
>
>
>
> does not seem racy without a lock. But maybe I don't get it.
Staring at your implementation, just think about two concurrent invocations of
test_and_set_hwpoison_drain() in your code:
Assume HWPoison flag is not set.
Thread 1: test_and_set_hwpoison_drain_rcu() -> TestSetPageHWPoison()
-> was_set = false
Thread 2: update that overwrites page->flags. HWPoison accidentally cleared.
Thread 3: test_and_set_hwpoison_drain_rcu() -> TestSetPageHWPoison()
-> was_set = false
Thread 1: does RCU sync and returns "!was_set"
thread 2: does RCU sync and returns "!was_set"
So you could end up with two thread believing that they atomically cleared the
flag, and you really need to lock.
We really have to document and enforce that the mutex is involved.
And I fear there are more nasty details to be uncovered while we rework some of
this properly, mandating a detailed look.
For example, TestClearPageHWPoison() in put_page_back_buddy() likely needs a
proper treatment as well. Likely that code should be reworked entirely to not
have arbitrary hwpoison page flag modifications throughout the codebase.
>
>
>
> (b) There are some leftover SetPageHWPoison etc. instances. The ones in
> arch/x86/kernel/cpu/mce/core.c likely cannot grab the mutex, but maybe they are
> corner cases either way and we can document the situation.
>
> Well, I did try to document the situation - it's in the commit log for
> patch 1:
>
> Note: the MCE handler in arch/x86/kernel/cpu/mce/core.c also calls
> SetPageHWPoison() and is subject to the same race. It cannot use
> the drain helpers (MCE context cannot call synchronize_rcu()).
> For recoverable MCE errors, memory_failure() is queued via work
> items (kill_me_maybe/kill_me_never) and will re-set the bit via
> test_and_set_hwpoison_drain_rcu() if it was clobbered. The
> mce_panic() path sets HWPoison for kdump right before panic() so
> the race is irrelevant there. The MCG_STATUS_SEAM_NR path does
> not queue memory_failure(), but the affected page belongs to a
> TDX guest whose CPU core has already been marked dead - the page
> is not subject to concurrent non-atomic flag operations in the
> buddy allocator, so the race does not apply.
>
We should have a central mechanism in place to document this and avoid future
mistakes.
I am not even sure if we should clearly document for SetPageHWPoison() when and
how they can be used, or if we need a completely new set of helpers.
And that's something to figure out (e.g., interaction with the mutex) by looking
into all of the details, so I expect this to take a lot more time.
[...]
>> This is nothing to vibe-code. This needs a real expert.
>
> Well I had this sitting on the disk anyway, so I thought I'd post.
It would be good to coordinate here.
Like a reply to my mail, asking whether you should post a new version that you
have already in place.
>
> I wouldn't call this vibe-code - a bunch of manual work went into this,
> llms mostly as a grep/sed replacement.
The version you posted earlier had real AI vibes to it, so I can only speculate.
I know that you did some manual work on this, but the details are really ugly in
this code.
> But hey. I don't object to
> someone taking over, for sure. Was fun, and maybe these patches will be
> helpful as a starting point.
>
> In particular, maybe I should have been more explicit about how your
> points from Friday are addressed.
Yes.
>
> If you want to add a bit more to explain the exact concerns here, for
> whoever works on this next, feel free to do so.
I raised some above. I'll try to find someone to take a closer look and see to
which degree we could optimize this.
Or if there are actually more performant alternatives that we could use. (I
still doubt that using atomics is ok in general)
--
Cheers,
David
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2026-06-29 13:05 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-28 21:45 [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops Michael S. Tsirkin
2026-06-28 21:45 ` [PATCH 1/2] mm: memory-failure: use RCU to fix HWPoison flag race Michael S. Tsirkin
2026-06-28 21:45 ` [PATCH 2/2] mm: wrap non-atomic page flag ops in RCU for HWPoison safety Michael S. Tsirkin
2026-06-29 2:11 ` [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops Andi Kleen
2026-06-29 8:10 ` Michael S. Tsirkin
2026-06-29 8:21 ` David Hildenbrand (Arm)
2026-06-29 8:39 ` Michael S. Tsirkin
2026-06-29 6:49 ` David Hildenbrand (Arm)
2026-06-29 7:34 ` Michael S. Tsirkin
2026-06-29 13:05 ` David Hildenbrand (Arm)
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox