[PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops

The Linux Kernel Mailing List
 help / color / mirror / Atom feed

* [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
@ 2026-06-28 21:45 Michael S. Tsirkin
  2026-06-28 21:45 ` [PATCH 1/2] mm: memory-failure: use RCU to fix HWPoison flag race Michael S. Tsirkin
                   ` (3 more replies)
  0 siblings, 4 replies; 29+ messages in thread
From: Michael S. Tsirkin @ 2026-06-28 21:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
	Oscar Salvador, Andi Kleen, Hidehiro Kawai, Rik van Riel,
	Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl,
	David Hildenbrand

I don't like it that we are adding overhead to the good path for
the benefit of memory failure, which never triggers on many systems,
but I don't have a better idea. Pls take a look.

Non-atomic page flag operations (page->flags.f &= ~mask, __set_bit,
__clear_bit) can race with atomic TestSetPageHWPoison() in
memory_failure().  The non-atomic RMW reads flags, memory_failure()
atomically sets HWPoison, then the RMW writes back the old value
without HWPoison, clobbering the bit.

The race was confirmed by injecting a cpu_relax() delay between the
load and store of the non-atomic RMW in __free_pages_prepare, then
running concurrent MADV_HWPOISON injection.  The clobbered HWPoison
bit was observed repeatedly.

This series fixes the race by:

1. Having memory_failure() call synchronize_rcu() + retry after
   setting HWPoison, so that any in-flight non-atomic RMW that
   read the old flags value completes before we proceed.

2. Wrapping all non-atomic page flag operations in
   rcu_read_lock/rcu_read_unlock (CONFIG_MEMORY_FAILURE only),
   so that synchronize_rcu() actually drains them.

Performance impact (page alloc+free microbenchmark, 200K iterations,
20 runs, KVM guest, error bars are 3-sigma):

  !PREEMPT_RCU (x86):
                         insns/iter      cycles/iter
    base:                12237 +/- 1     17954 +/- 136
    patched:             +22 +/- 1       -124 +/- 122
                         (+0.18%)        (within noise)

  PREEMPT_RCU:
                         insns/iter      cycles/iter
    base:                12512 +/- 3     18541 +/- 214
    patched:             +95 +/- 3       -12 +/- 161
                         (+0.76%)        (within noise)

When !CONFIG_MEMORY_FAILURE, all wrappers compile away completely.

Suggested-by: David Hildenbrand <david@redhat.com>

Michael S. Tsirkin (2):
  mm: memory-failure: use RCU to fix HWPoison flag race
  mm: wrap non-atomic page flag ops in RCU for HWPoison safety

 include/linux/mm.h         |  7 ++++
 include/linux/page-flags.h | 81 +++++++++++++++++++++++++++++++++++---
 mm/huge_memory.c           |  2 +
 mm/memory-failure.c        | 54 +++++++++++++++++++++----
 mm/memremap.c              |  6 ++-
 mm/mm_init.c               |  2 +
 mm/page_alloc.c            |  4 ++
 mm/slub.c                  |  2 +-
 8 files changed, 143 insertions(+), 15 deletions(-)

-- 
MST

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 1/2] mm: memory-failure: use RCU to fix HWPoison flag race
  2026-06-28 21:45 [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops Michael S. Tsirkin
@ 2026-06-28 21:45 ` Michael S. Tsirkin
  2026-06-28 21:45 ` [PATCH 2/2] mm: wrap non-atomic page flag ops in RCU for HWPoison safety Michael S. Tsirkin
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 29+ messages in thread
From: Michael S. Tsirkin @ 2026-06-28 21:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
	Oscar Salvador, Andi Kleen, Hidehiro Kawai, Rik van Riel,
	Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl,
	David Hildenbrand

Non-atomic page flag operations (page->flags.f &= ~mask, __set_bit,
__clear_bit) can race with atomic TestSetPageHWPoison() in
memory_failure().  The non-atomic RMW reads flags, memory_failure()
atomically sets HWPoison, then the RMW writes back the old value
without HWPoison - clobbering the bit.

Add synchronize_rcu() + retry helpers for setting and clearing
HWPoison, and convert all memory_failure() call sites to use them.
Follow-up patches wrap non-atomic page flag operations in
rcu_read_lock/rcu_read_unlock so that synchronize_rcu() drains
in-flight callers.

Note: the MCE handler in arch/x86/kernel/cpu/mce/core.c also calls
SetPageHWPoison() and is subject to the same race.  It cannot use
the drain helpers (MCE context cannot call synchronize_rcu()).
For recoverable MCE errors, memory_failure() is queued via work
items (kill_me_maybe/kill_me_never) and will re-set the bit via
test_and_set_hwpoison_drain_rcu() if it was clobbered.  The
mce_panic() path sets HWPoison for kdump right before panic() so
the race should not matter there.  The MCG_STATUS_SEAM_NR path does
not queue memory_failure(), but the affected page belongs to a
TDX guest whose CPU core has already been marked dead - the page
is not subject to concurrent non-atomic flag operations in the
buddy allocator, so the race does not trigger.

Fixes: 6a46079cf57a ("HWPOISON: The high level memory error handler in the VM v7")
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: Cursor:gpt-5.4-xhigh-fast
---
 mm/memory-failure.c | 54 ++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 46 insertions(+), 8 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ee42d4361309..351f8bbda248 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -76,6 +76,44 @@ static int sysctl_enable_soft_offline __read_mostly = 1;
 
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
+/*
+ * Drain any in-flight non-atomic page flag operations that could
+ * clobber a concurrently set HWPoison bit.  Retries until the bit sticks.
+ */
+static void set_hwpoison_drain_rcu(struct page *p)
+{
+	do {
+		synchronize_rcu();
+	} while (!TestSetPageHWPoison(p));
+}
+
+/*
+ * Drain any in-flight non-atomic page flag operations that could
+ * restore the HWPoison bit from stale data.  Retries until it stays clear.
+ */
+static void clear_hwpoison_drain_rcu(struct page *p)
+{
+	do {
+		synchronize_rcu();
+	} while (TestClearPageHWPoison(p));
+}
+
+static bool test_and_set_hwpoison_drain_rcu(struct page *p)
+{
+	bool was_set = TestSetPageHWPoison(p);
+
+	set_hwpoison_drain_rcu(p);
+	return was_set;
+}
+
+static bool test_and_clear_hwpoison_drain_rcu(struct page *p)
+{
+	bool was_set = TestClearPageHWPoison(p);
+
+	clear_hwpoison_drain_rcu(p);
+	return was_set;
+}
+
 static bool hw_memory_failure __read_mostly = false;
 
 static DEFINE_MUTEX(mf_mutex);
@@ -211,7 +249,7 @@ static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, boo
 			return false;
 	}
 
-	SetPageHWPoison(page);
+	set_hwpoison_drain_rcu(page);
 	if (release)
 		put_page(page);
 	page_ref_inc(page);
@@ -1756,7 +1794,7 @@ static int mf_generic_kill_procs(unsigned long long pfn, int flags,
 	 * Use this flag as an indication that the dax page has been
 	 * remapped UC to prevent speculative consumption of poison.
 	 */
-	SetPageHWPoison(&folio->page);
+	set_hwpoison_drain_rcu(&folio->page);
 
 	/*
 	 * Unlike System-RAM there is no possibility to swap in a
@@ -1801,7 +1839,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 			goto unlock;
 
 		if (!pre_remove)
-			SetPageHWPoison(page);
+			set_hwpoison_drain_rcu(page);
 
 		/*
 		 * The pre_remove case is revoking access, the memory is still
@@ -1878,7 +1916,7 @@ static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
 	head = llist_del_all(raw_hwp_list_head(folio));
 	llist_for_each_entry_safe(p, next, head, node) {
 		if (move_flag)
-			SetPageHWPoison(p->page);
+			set_hwpoison_drain_rcu(p->page);
 		else
 			num_poisoned_pages_sub(page_to_pfn(p->page), 1);
 		kfree(p);
@@ -2390,7 +2428,7 @@ int memory_failure(unsigned long pfn, int flags)
 	if (hugetlb)
 		goto unlock_mutex;
 
-	if (TestSetPageHWPoison(p)) {
+	if (test_and_set_hwpoison_drain_rcu(p)) {
 		res = -EHWPOISON;
 		if (flags & MF_ACTION_REQUIRED)
 			res = kill_accessing_process(current, pfn, flags);
@@ -2420,7 +2458,7 @@ int memory_failure(unsigned long pfn, int flags)
 			} else {
 				/* We lost the race, try again */
 				if (retry) {
-					ClearPageHWPoison(p);
+					clear_hwpoison_drain_rcu(p);
 					retry = false;
 					goto try_again;
 				}
@@ -2441,7 +2479,7 @@ int memory_failure(unsigned long pfn, int flags)
 	/* filter pages that are protected from hwpoison test by users */
 	folio_lock(folio);
 	if (hwpoison_filter(p)) {
-		ClearPageHWPoison(p);
+		clear_hwpoison_drain_rcu(p);
 		folio_unlock(folio);
 		folio_put(folio);
 		res = -EOPNOTSUPP;
@@ -2761,7 +2799,7 @@ int unpoison_memory(unsigned long pfn)
 		}
 
 		folio_put(folio);
-		if (TestClearPageHWPoison(p)) {
+		if (test_and_clear_hwpoison_drain_rcu(p)) {
 			folio_put(folio);
 			ret = 0;
 		}
-- 
MST


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 2/2] mm: wrap non-atomic page flag ops in RCU for HWPoison safety
  2026-06-28 21:45 [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops Michael S. Tsirkin
  2026-06-28 21:45 ` [PATCH 1/2] mm: memory-failure: use RCU to fix HWPoison flag race Michael S. Tsirkin
@ 2026-06-28 21:45 ` Michael S. Tsirkin
  2026-06-29  2:11 ` [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops Andi Kleen
  2026-06-29  6:49 ` David Hildenbrand (Arm)
  3 siblings, 0 replies; 29+ messages in thread
From: Michael S. Tsirkin @ 2026-06-28 21:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
	Oscar Salvador, Andi Kleen, Hidehiro Kawai, Rik van Riel,
	Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl,
	David Hildenbrand

Wrap all non-atomic page flag operations in rcu_read_lock/rcu_read_unlock
so that synchronize_rcu() in memory_failure() can drain in-flight
callers.  This completes the RCU-based protection introduced in the
previous patch.

Add hwpoison_safe_set_bit/clear_bit wrappers in page-flags.h that
bracket __set_bit/__clear_bit with rcu_read_lock/rcu_read_unlock.
Convert all __SETPAGEFLAG/__CLEARPAGEFLAG macros and direct flag
manipulation sites (page_alloc.c, huge_memory.c, memremap.c,
mm_init.c, slub.c, mm.h) to use these wrappers.

For batched flag clearing paths (page_alloc.c order > 0), use
hwpoison_rcu_lock/unlock around the entire loop rather than
per-operation locking.

When !CONFIG_MEMORY_FAILURE, all wrappers compile to the bare
operations with no overhead.

Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Assisted-by: Cursor:gpt-5.4-xhigh-fast
---
 include/linux/mm.h         |  7 ++++
 include/linux/page-flags.h | 81 +++++++++++++++++++++++++++++++++++---
 mm/huge_memory.c           |  2 +
 mm/memremap.c              |  6 ++-
 mm/mm_init.c               |  2 +
 mm/page_alloc.c            |  4 ++
 mm/slub.c                  |  2 +-
 7 files changed, 97 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 06bbe9eba636..d52a5e90cad6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2341,6 +2341,7 @@ static inline int folio_last_cpupid(struct folio *folio)
 
 int folio_xchg_last_cpupid(struct folio *folio, int cpupid);
 
+/* Caller must hold hwpoison_rcu_lock() */
 static inline void page_cpupid_reset_last(struct page *page)
 {
 	page->flags.f |= LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT;
@@ -2503,8 +2504,10 @@ static inline struct zone *folio_zone(const struct folio *folio)
 #ifdef SECTION_IN_PAGE_FLAGS
 static inline void set_page_section(struct page *page, unsigned long section)
 {
+	hwpoison_rcu_lock_flags(&page->flags.f);
 	page->flags.f &= ~(SECTIONS_MASK << SECTIONS_PGSHIFT);
 	page->flags.f |= (section & SECTIONS_MASK) << SECTIONS_PGSHIFT;
+	hwpoison_rcu_unlock_flags(&page->flags.f);
 }
 
 static inline unsigned long memdesc_section(memdesc_flags_t mdf)
@@ -2719,14 +2722,18 @@ static inline bool folio_is_longterm_pinnable(struct folio *folio)
 
 static inline void set_page_zone(struct page *page, enum zone_type zone)
 {
+	hwpoison_rcu_lock_flags(&page->flags.f);
 	page->flags.f &= ~(ZONES_MASK << ZONES_PGSHIFT);
 	page->flags.f |= (zone & ZONES_MASK) << ZONES_PGSHIFT;
+	hwpoison_rcu_unlock_flags(&page->flags.f);
 }
 
 static inline void set_page_node(struct page *page, unsigned long node)
 {
+	hwpoison_rcu_lock_flags(&page->flags.f);
 	page->flags.f &= ~(NODES_MASK << NODES_PGSHIFT);
 	page->flags.f |= (node & NODES_MASK) << NODES_PGSHIFT;
+	hwpoison_rcu_unlock_flags(&page->flags.f);
 }
 
 static inline void set_page_links(struct page *page, enum zone_type zone,
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7223f6f4e2b4..25eb5a77c92f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -9,6 +9,7 @@
 #include <linux/types.h>
 #include <linux/bug.h>
 #include <linux/mmdebug.h>
+#include <linux/rcupdate.h>
 #ifndef __GENERATING_BOUNDS_H
 #include <linux/mm_types.h>
 #include <generated/bounds.h>
@@ -404,6 +405,68 @@ static unsigned long *folio_flags(struct folio *folio, unsigned n)
 #define FOLIO_HEAD_PAGE		0
 #define FOLIO_SECOND_PAGE	1
 
+/*
+ * Non-atomic page flag operations (__set_bit, __clear_bit, flags &= ~mask)
+ * can race with atomic TestSetPageHWPoison() in memory_failure().
+ * Wrap non-atomic ops in rcu_read_lock/rcu_read_unlock so that
+ * synchronize_rcu() in memory_failure() drains in-flight callers.
+ */
+#ifdef CONFIG_MEMORY_FAILURE
+static __always_inline void
+hwpoison_safe_set_bit(unsigned long nr, unsigned long *addr)
+{
+	rcu_read_lock();
+	__set_bit(nr, addr);
+	rcu_read_unlock();
+}
+
+static __always_inline void
+hwpoison_safe_clear_bit(unsigned long nr, unsigned long *addr)
+{
+	rcu_read_lock();
+	__clear_bit(nr, addr);
+	rcu_read_unlock();
+}
+
+static __always_inline void hwpoison_rcu_lock_flags(unsigned long *addr)
+{
+	rcu_read_lock();
+}
+
+static __always_inline void hwpoison_rcu_unlock_flags(unsigned long *addr)
+{
+	rcu_read_unlock();
+}
+
+static __always_inline void hwpoison_rcu_lock(void)
+{
+	rcu_read_lock();
+}
+
+static __always_inline void hwpoison_rcu_unlock(void)
+{
+	rcu_read_unlock();
+}
+
+#else /* !CONFIG_MEMORY_FAILURE */
+static inline void
+hwpoison_safe_set_bit(unsigned long nr, unsigned long *addr)
+{
+	__set_bit(nr, addr);
+}
+
+static inline void
+hwpoison_safe_clear_bit(unsigned long nr, unsigned long *addr)
+{
+	__clear_bit(nr, addr);
+}
+
+static inline void hwpoison_rcu_lock_flags(unsigned long *addr) { }
+static inline void hwpoison_rcu_unlock_flags(unsigned long *addr) { }
+static inline void hwpoison_rcu_lock(void) { }
+static inline void hwpoison_rcu_unlock(void) { }
+#endif
+
 /*
  * Macros to create function definitions for page flags
  */
@@ -421,11 +484,11 @@ static __always_inline void folio_clear_##name(struct folio *folio)	\
 
 #define __FOLIO_SET_FLAG(name, page)					\
 static __always_inline void __folio_set_##name(struct folio *folio)	\
-{ __set_bit(PG_##name, folio_flags(folio, page)); }
+{ hwpoison_safe_set_bit(PG_##name, folio_flags(folio, page)); }
 
 #define __FOLIO_CLEAR_FLAG(name, page)					\
 static __always_inline void __folio_clear_##name(struct folio *folio)	\
-{ __clear_bit(PG_##name, folio_flags(folio, page)); }
+{ hwpoison_safe_clear_bit(PG_##name, folio_flags(folio, page)); }
 
 #define FOLIO_TEST_SET_FLAG(name, page)					\
 static __always_inline bool folio_test_set_##name(struct folio *folio)	\
@@ -458,12 +521,12 @@ static __always_inline void ClearPage##uname(struct page *page)		\
 #define __SETPAGEFLAG(uname, lname, policy)				\
 __FOLIO_SET_FLAG(lname, FOLIO_##policy)					\
 static __always_inline void __SetPage##uname(struct page *page)		\
-{ __set_bit(PG_##lname, &policy(page, 1)->flags.f); }
+{ hwpoison_safe_set_bit(PG_##lname, &policy(page, 1)->flags.f); }
 
 #define __CLEARPAGEFLAG(uname, lname, policy)				\
 __FOLIO_CLEAR_FLAG(lname, FOLIO_##policy)				\
 static __always_inline void __ClearPage##uname(struct page *page)	\
-{ __clear_bit(PG_##lname, &policy(page, 1)->flags.f); }
+{ hwpoison_safe_clear_bit(PG_##lname, &policy(page, 1)->flags.f); }
 
 #define TESTSETFLAG(uname, lname, policy)				\
 FOLIO_TEST_SET_FLAG(lname, FOLIO_##policy)				\
@@ -806,7 +869,7 @@ static inline bool PageUptodate(const struct page *page)
 static __always_inline void __folio_mark_uptodate(struct folio *folio)
 {
 	smp_wmb();
-	__set_bit(PG_uptodate, folio_flags(folio, 0));
+	hwpoison_safe_set_bit(PG_uptodate, folio_flags(folio, 0));
 }
 
 static __always_inline void folio_mark_uptodate(struct folio *folio)
@@ -1166,6 +1229,14 @@ static __always_inline void ClearPageAnonExclusive(struct page *page)
 }
 
 static __always_inline void __ClearPageAnonExclusive(struct page *page)
+{
+	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
+	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+	hwpoison_safe_clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
+}
+
+/* Caller must hold hwpoison_rcu_lock() */
+static __always_inline void ___ClearPageAnonExclusive(struct page *page)
 {
 	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
 	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 970e077019b7..99f600459964 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3624,6 +3624,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
 		 * unreferenced sub-pages of an anonymous THP: we can simply drop
 		 * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
 		 */
+		hwpoison_rcu_lock_flags(&new_folio->flags.f);
 		new_folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
 		new_folio->flags.f |= (folio->flags.f &
 				((1L << PG_referenced) |
@@ -3643,6 +3644,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
 #endif
 				 (1L << PG_dirty) |
 				 LRU_GEN_MASK | LRU_REFS_MASK));
+		hwpoison_rcu_unlock_flags(&new_folio->flags.f);
 
 		if (handle_hwpoison &&
 		    page_range_has_hwpoisoned(new_head, new_nr_pages))
diff --git a/mm/memremap.c b/mm/memremap.c
index 053842d45cb1..5c769dc4930c 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -425,8 +425,10 @@ void free_zone_device_folio(struct folio *folio)
 	mem_cgroup_uncharge(folio);
 
 	if (folio_test_anon(folio)) {
+		hwpoison_rcu_lock();
 		for (i = 0; i < nr; i++)
-			__ClearPageAnonExclusive(folio_page(folio, i));
+			___ClearPageAnonExclusive(folio_page(folio, i));
+		hwpoison_rcu_unlock();
 	}
 
 	/*
@@ -494,7 +496,9 @@ void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
 		 * blindly clear bits which could have set my order field here,
 		 * including page head.
 		 */
+		hwpoison_rcu_lock_flags(&new_page->flags.f);
 		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
+		hwpoison_rcu_unlock_flags(&new_page->flags.f);
 
 #ifdef NR_PAGES_IN_LARGE_FOLIO
 		/*
diff --git a/mm/mm_init.c b/mm/mm_init.c
index f9f8e1af921c..881574dd14da 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -596,7 +596,9 @@ void __meminit __init_single_page(struct page *page, unsigned long pfn,
 	set_page_links(page, zone, nid, pfn);
 	init_page_count(page);
 	atomic_set(&page->_mapcount, -1);
+	hwpoison_rcu_lock_flags(&page->flags.f);
 	page_cpupid_reset_last(page);
+	hwpoison_rcu_unlock_flags(&page->flags.f);
 	page_kasan_tag_reset(page);
 
 	INIT_LIST_HEAD(&page->lru);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d49c254174da..60a4697ecb9a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1358,6 +1358,7 @@ __always_inline bool __free_pages_prepare(struct page *page,
 	if (unlikely(order)) {
 		int i;
 
+		hwpoison_rcu_lock();
 		if (compound) {
 			page[1].flags.f &= ~PAGE_FLAGS_SECOND;
 #ifdef NR_PAGES_IN_LARGE_FOLIO
@@ -1375,6 +1376,7 @@ __always_inline bool __free_pages_prepare(struct page *page,
 			}
 			(page + i)->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
 		}
+		hwpoison_rcu_unlock();
 	}
 	if (folio_test_anon(folio)) {
 		mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
@@ -1391,8 +1393,10 @@ __always_inline bool __free_pages_prepare(struct page *page,
 			return false;
 	}
 
+	hwpoison_rcu_lock_flags(&page->flags.f);
 	page_cpupid_reset_last(page);
 	page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
+	hwpoison_rcu_unlock_flags(&page->flags.f);
 	page->private = 0;
 	reset_page_owner(page, order);
 	page_table_check_free(page, order);
diff --git a/mm/slub.c b/mm/slub.c
index a2bf3756ca7d..fdfb3019b562 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -617,7 +617,7 @@ static inline void slab_set_pfmemalloc(struct slab *slab)
 
 static inline void __slab_clear_pfmemalloc(struct slab *slab)
 {
-	__clear_bit(SL_pfmemalloc, &slab->flags.f);
+	hwpoison_safe_clear_bit(SL_pfmemalloc, &slab->flags.f);
 }
 
 /*
-- 
MST


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-28 21:45 [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops Michael S. Tsirkin
  2026-06-28 21:45 ` [PATCH 1/2] mm: memory-failure: use RCU to fix HWPoison flag race Michael S. Tsirkin
  2026-06-28 21:45 ` [PATCH 2/2] mm: wrap non-atomic page flag ops in RCU for HWPoison safety Michael S. Tsirkin
@ 2026-06-29  2:11 ` Andi Kleen
  2026-06-29  8:10   ` Michael S. Tsirkin
  2026-06-29  6:49 ` David Hildenbrand (Arm)
  3 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2026-06-29  2:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand, Miaohe Lin, Naoya Horiguchi,
	Andrew Morton, Oscar Salvador, Hidehiro Kawai, Rik van Riel,
	Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl,
	David Hildenbrand

On Sun, Jun 28, 2026 at 05:45:22PM -0400, Michael S. Tsirkin wrote:
> This series fixes the race by:
> 
> 1. Having memory_failure() call synchronize_rcu() + retry after
>    setting HWPoison, so that any in-flight non-atomic RMW that
>    read the old flags value completes before we proceed.
>
> 2. Wrapping all non-atomic page flag operations in
>    rcu_read_lock/rcu_read_unlock (CONFIG_MEMORY_FAILURE only),
>    so that synchronize_rcu() actually drains them.

It wouldn't surprise me if your underlying performance assumptions 
-- an non contended atomic is cheaper than a rcu_read_lock/unlock --
are not true in various CPU/kernel configuration combinations.

Modern CPUs have fast atomics when uncontended in normal circumstances.

But it probably doesn't matter much either way because the difference
shouldn't be very much.  It seems very complicated for something that
could be much simpler.

But I guess it's fine.

-Andi

> 
> Performance impact (page alloc+free microbenchmark, 200K iterations,
> 20 runs, KVM guest, error bars are 3-sigma):
> 
>   !PREEMPT_RCU (x86):
>                          insns/iter      cycles/iter
>     base:                12237 +/- 1     17954 +/- 136
>     patched:             +22 +/- 1       -124 +/- 122
>                          (+0.18%)        (within noise)
> 
>   PREEMPT_RCU:
>                          insns/iter      cycles/iter
>     base:                12512 +/- 3     18541 +/- 214
>     patched:             +95 +/- 3       -12 +/- 161
>                          (+0.76%)        (within noise)
> 
> When !CONFIG_MEMORY_FAILURE, all wrappers compile away completely.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-29  2:11 ` [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops Andi Kleen
@ 2026-06-29  8:10   ` Michael S. Tsirkin
  2026-06-29  8:21     ` David Hildenbrand (Arm)
  2026-06-29  8:39     ` Michael S. Tsirkin
  0 siblings, 2 replies; 29+ messages in thread
From: Michael S. Tsirkin @ 2026-06-29  8:10 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, David Hildenbrand, Miaohe Lin, Naoya Horiguchi,
	Andrew Morton, Oscar Salvador, Hidehiro Kawai, Rik van Riel,
	Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl,
	David Hildenbrand

On Sun, Jun 28, 2026 at 07:11:58PM -0700, Andi Kleen wrote:
> On Sun, Jun 28, 2026 at 05:45:22PM -0400, Michael S. Tsirkin wrote:
> > This series fixes the race by:
> > 
> > 1. Having memory_failure() call synchronize_rcu() + retry after
> >    setting HWPoison, so that any in-flight non-atomic RMW that
> >    read the old flags value completes before we proceed.
> >
> > 2. Wrapping all non-atomic page flag operations in
> >    rcu_read_lock/rcu_read_unlock (CONFIG_MEMORY_FAILURE only),
> >    so that synchronize_rcu() actually drains them.
> 
> It wouldn't surprise me if your underlying performance assumptions 
> -- an non contended atomic is cheaper than a rcu_read_lock/unlock --
> are not true in various CPU/kernel configuration combinations.
> 
> Modern CPUs have fast atomics when uncontended in normal circumstances.
> But it probably doesn't matter much either way because the difference
> shouldn't be very much.


Hmm. It's a bit silly that I didn't try. Seemed clear to me, but,
on this old xeon...

                                   insns/iter    cycles/iter
     -------------------------------------------------------
       base                       12238 +/- 1.0    17889 +/- 97.9
       rcu_read_lock              12251 +/- 7.3    17991 +/-191.6
       atomic ops                 12233 +/- 1.9    17733 +/-136.5


The diff in the noise.

And old, slow CPUs maybe don't have MF at all.

So maybe just atomics instead of all this mess.




>  It seems very complicated for something that
> could be much simpler.
> 
> But I guess it's fine.
> 
> -Andi

Indeed. David already said he's gonnu look at this himself, but here's what I
tested, maybe helpful for whoever wants to try this approach:

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>


diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7223f6f4e2b4..8f0940cf2f29 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -404,6 +404,44 @@ static unsigned long *folio_flags(struct folio *folio, unsigned n)
 #define FOLIO_HEAD_PAGE		0
 #define FOLIO_SECOND_PAGE	1
 
+/*
+ * When CONFIG_MEMORY_FAILURE is enabled, non-atomic page flag operations
+ * must be atomic to prevent clobbering concurrent TestSetPageHWPoison()
+ * in memory_failure().  Otherwise, use the cheaper non-atomic versions.
+ */
+#ifdef CONFIG_MEMORY_FAILURE
+#define __pf_set_bit		set_bit
+#define __pf_clear_bit		clear_bit
+
+static __always_inline void page_flags_clear(unsigned long *flags,
+					     unsigned long mask)
+{
+	atomic_long_andnot(mask, (atomic_long_t *)flags);
+}
+
+static __always_inline void page_flags_set(unsigned long *flags,
+					   unsigned long mask)
+{
+	atomic_long_or(mask, (atomic_long_t *)flags);
+}
+
+#else /* !CONFIG_MEMORY_FAILURE */
+#define __pf_set_bit		__set_bit
+#define __pf_clear_bit		__clear_bit
+
+static __always_inline void page_flags_clear(unsigned long *flags,
+					     unsigned long mask)
+{
+	*flags &= ~mask;
+}
+
+static __always_inline void page_flags_set(unsigned long *flags,
+					   unsigned long mask)
+{
+	*flags |= mask;
+}
+#endif /* CONFIG_MEMORY_FAILURE */
+
 /*
  * Macros to create function definitions for page flags
  */
@@ -421,11 +459,11 @@ static __always_inline void folio_clear_##name(struct folio *folio)	\
 
 #define __FOLIO_SET_FLAG(name, page)					\
 static __always_inline void __folio_set_##name(struct folio *folio)	\
-{ __set_bit(PG_##name, folio_flags(folio, page)); }
+{ __pf_set_bit(PG_##name, folio_flags(folio, page)); }
 
 #define __FOLIO_CLEAR_FLAG(name, page)					\
 static __always_inline void __folio_clear_##name(struct folio *folio)	\
-{ __clear_bit(PG_##name, folio_flags(folio, page)); }
+{ __pf_clear_bit(PG_##name, folio_flags(folio, page)); }
 
 #define FOLIO_TEST_SET_FLAG(name, page)					\
 static __always_inline bool folio_test_set_##name(struct folio *folio)	\
@@ -458,12 +496,12 @@ static __always_inline void ClearPage##uname(struct page *page)		\
 #define __SETPAGEFLAG(uname, lname, policy)				\
 __FOLIO_SET_FLAG(lname, FOLIO_##policy)					\
 static __always_inline void __SetPage##uname(struct page *page)		\
-{ __set_bit(PG_##lname, &policy(page, 1)->flags.f); }
+{ __pf_set_bit(PG_##lname, &policy(page, 1)->flags.f); }
 
 #define __CLEARPAGEFLAG(uname, lname, policy)				\
 __FOLIO_CLEAR_FLAG(lname, FOLIO_##policy)				\
 static __always_inline void __ClearPage##uname(struct page *page)	\
-{ __clear_bit(PG_##lname, &policy(page, 1)->flags.f); }
+{ __pf_clear_bit(PG_##lname, &policy(page, 1)->flags.f); }
 
 #define TESTSETFLAG(uname, lname, policy)				\
 FOLIO_TEST_SET_FLAG(lname, FOLIO_##policy)				\
@@ -806,7 +844,7 @@ static inline bool PageUptodate(const struct page *page)
 static __always_inline void __folio_mark_uptodate(struct folio *folio)
 {
 	smp_wmb();
-	__set_bit(PG_uptodate, folio_flags(folio, 0));
+	__pf_set_bit(PG_uptodate, folio_flags(folio, 0));
 }
 
 static __always_inline void folio_mark_uptodate(struct folio *folio)
@@ -1162,14 +1200,14 @@ static __always_inline void ClearPageAnonExclusive(struct page *page)
 {
 	VM_BUG_ON_PGFLAGS(!PageAnonNotKsm(page), page);
 	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
-	clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
+	__pf_clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
 }
 
 static __always_inline void __ClearPageAnonExclusive(struct page *page)
 {
 	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
 	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
-	__clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
+	__pf_clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
 }
 
 #ifdef CONFIG_MMU
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 06bbe9eba636..931dfc84319f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2343,7 +2343,7 @@ int folio_xchg_last_cpupid(struct folio *folio, int cpupid);
 
 static inline void page_cpupid_reset_last(struct page *page)
 {
-	page->flags.f |= LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT;
+	page_flags_set(&page->flags.f, LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT);
 }
 #endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */
 
@@ -2503,8 +2503,8 @@ static inline struct zone *folio_zone(const struct folio *folio)
 #ifdef SECTION_IN_PAGE_FLAGS
 static inline void set_page_section(struct page *page, unsigned long section)
 {
-	page->flags.f &= ~(SECTIONS_MASK << SECTIONS_PGSHIFT);
-	page->flags.f |= (section & SECTIONS_MASK) << SECTIONS_PGSHIFT;
+	page_flags_clear(&page->flags.f, SECTIONS_MASK << SECTIONS_PGSHIFT);
+	page_flags_set(&page->flags.f, (section & SECTIONS_MASK) << SECTIONS_PGSHIFT);
 }
 
 static inline unsigned long memdesc_section(memdesc_flags_t mdf)
@@ -2719,14 +2719,14 @@ static inline bool folio_is_longterm_pinnable(struct folio *folio)
 
 static inline void set_page_zone(struct page *page, enum zone_type zone)
 {
-	page->flags.f &= ~(ZONES_MASK << ZONES_PGSHIFT);
-	page->flags.f |= (zone & ZONES_MASK) << ZONES_PGSHIFT;
+	page_flags_clear(&page->flags.f, ZONES_MASK << ZONES_PGSHIFT);
+	page_flags_set(&page->flags.f, (zone & ZONES_MASK) << ZONES_PGSHIFT);
 }
 
 static inline void set_page_node(struct page *page, unsigned long node)
 {
-	page->flags.f &= ~(NODES_MASK << NODES_PGSHIFT);
-	page->flags.f |= (node & NODES_MASK) << NODES_PGSHIFT;
+	page_flags_clear(&page->flags.f, NODES_MASK << NODES_PGSHIFT);
+	page_flags_set(&page->flags.f, (node & NODES_MASK) << NODES_PGSHIFT);
 }
 
 static inline void set_page_links(struct page *page, enum zone_type zone,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 970e077019b7..100226f59490 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3624,8 +3624,8 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
 		 * unreferenced sub-pages of an anonymous THP: we can simply drop
 		 * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
 		 */
-		new_folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
-		new_folio->flags.f |= (folio->flags.f &
+		page_flags_clear(&new_folio->flags.f, PAGE_FLAGS_CHECK_AT_PREP);
+		page_flags_set(&new_folio->flags.f, folio->flags.f &
 				((1L << PG_referenced) |
 				 (1L << PG_swapbacked) |
 				 (1L << PG_swapcache) |
diff --git a/mm/memremap.c b/mm/memremap.c
index 053842d45cb1..194ca2f04a87 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -494,7 +494,7 @@ void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
 		 * blindly clear bits which could have set my order field here,
 		 * including page head.
 		 */
-		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
+		page_flags_clear(&new_page->flags.f, 0xffUL);	/* Clear possible order, page head */
 
 #ifdef NR_PAGES_IN_LARGE_FOLIO
 		/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d49c254174da..a3447124a725 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1359,7 +1359,7 @@ __always_inline bool __free_pages_prepare(struct page *page,
 		int i;
 
 		if (compound) {
-			page[1].flags.f &= ~PAGE_FLAGS_SECOND;
+			page_flags_clear(&page[1].flags.f, PAGE_FLAGS_SECOND);
 #ifdef NR_PAGES_IN_LARGE_FOLIO
 			folio->_nr_pages = 0;
 #endif
@@ -1373,7 +1373,7 @@ __always_inline bool __free_pages_prepare(struct page *page,
 					continue;
 				}
 			}
-			(page + i)->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
+			page_flags_clear(&(page + i)->flags.f, PAGE_FLAGS_CHECK_AT_PREP);
 		}
 	}
 	if (folio_test_anon(folio)) {
@@ -1392,7 +1392,7 @@ __always_inline bool __free_pages_prepare(struct page *page,
 	}
 
 	page_cpupid_reset_last(page);
-	page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
+	page_flags_clear(&page->flags.f, PAGE_FLAGS_CHECK_AT_PREP);
 	page->private = 0;
 	reset_page_owner(page, order);
 	page_table_check_free(page, order);
diff --git a/mm/slub.c b/mm/slub.c
index a2bf3756ca7d..a55199f642d3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -617,7 +617,7 @@ static inline void slab_set_pfmemalloc(struct slab *slab)
 
 static inline void __slab_clear_pfmemalloc(struct slab *slab)
 {
-	__clear_bit(SL_pfmemalloc, &slab->flags.f);
+	__pf_clear_bit(SL_pfmemalloc, &slab->flags.f);
 }
 
 /*


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-29  8:10   ` Michael S. Tsirkin
@ 2026-06-29  8:21     ` David Hildenbrand (Arm)
  2026-06-29  8:39     ` Michael S. Tsirkin
  1 sibling, 0 replies; 29+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-29  8:21 UTC (permalink / raw)
  To: Michael S. Tsirkin, Andi Kleen
  Cc: linux-kernel, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
	Oscar Salvador, Hidehiro Kawai, Rik van Riel, Vlastimil Babka,
	Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl

On 6/29/26 10:10, Michael S. Tsirkin wrote:
> On Sun, Jun 28, 2026 at 07:11:58PM -0700, Andi Kleen wrote:
>> On Sun, Jun 28, 2026 at 05:45:22PM -0400, Michael S. Tsirkin wrote:
>>> This series fixes the race by:
>>>
>>> 1. Having memory_failure() call synchronize_rcu() + retry after
>>>    setting HWPoison, so that any in-flight non-atomic RMW that
>>>    read the old flags value completes before we proceed.
>>>
>>> 2. Wrapping all non-atomic page flag operations in
>>>    rcu_read_lock/rcu_read_unlock (CONFIG_MEMORY_FAILURE only),
>>>    so that synchronize_rcu() actually drains them.
>>
>> It wouldn't surprise me if your underlying performance assumptions 
>> -- an non contended atomic is cheaper than a rcu_read_lock/unlock --
>> are not true in various CPU/kernel configuration combinations.
>>
>> Modern CPUs have fast atomics when uncontended in normal circumstances.
>> But it probably doesn't matter much either way because the difference
>> shouldn't be very much.
> 
> 
> Hmm. It's a bit silly that I didn't try. Seemed clear to me, but,
> on this old xeon...
> 
>                                    insns/iter    cycles/iter
>      -------------------------------------------------------
>        base                       12238 +/- 1.0    17889 +/- 97.9
>        rcu_read_lock              12251 +/- 7.3    17991 +/-191.6
>        atomic ops                 12233 +/- 1.9    17733 +/-136.5
> 
> 
> The diff in the noise.
> 
> And old, slow CPUs maybe don't have MF at all.
> 
> So maybe just atomics instead of all this mess.

That would be much better.

What I was concerned about so far was that many distributions enable hwpoison
handling unconditionally (independent of any specific CPU!).

I recall running experiments on some not-so-dated hardware 2 years ago (when
optimizing out rmap atomics) where additional atomics really hurt, even in
uncontended cases.

> 
> 
> 
> 
>>  It seems very complicated for something that
>> could be much simpler.
>>
>> But I guess it's fine.
>>
>> -Andi
> 
> Indeed. David already said he's gonnu look at this himself, but he
If we can go that simple route (I'm not sure yet), your patch would be fine. I
can try finding someone to run more experiments on arm64 hardware.

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-29  8:10   ` Michael S. Tsirkin
  2026-06-29  8:21     ` David Hildenbrand (Arm)
@ 2026-06-29  8:39     ` Michael S. Tsirkin
  2026-06-29 16:54       ` Andi Kleen
  1 sibling, 1 reply; 29+ messages in thread
From: Michael S. Tsirkin @ 2026-06-29  8:39 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, David Hildenbrand, Miaohe Lin, Naoya Horiguchi,
	Andrew Morton, Oscar Salvador, Hidehiro Kawai, Rik van Riel,
	Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl,
	David Hildenbrand

On Mon, Jun 29, 2026 at 04:10:33AM -0400, Michael S. Tsirkin wrote:
> On Sun, Jun 28, 2026 at 07:11:58PM -0700, Andi Kleen wrote:
> > On Sun, Jun 28, 2026 at 05:45:22PM -0400, Michael S. Tsirkin wrote:
> > > This series fixes the race by:
> > > 
> > > 1. Having memory_failure() call synchronize_rcu() + retry after
> > >    setting HWPoison, so that any in-flight non-atomic RMW that
> > >    read the old flags value completes before we proceed.
> > >
> > > 2. Wrapping all non-atomic page flag operations in
> > >    rcu_read_lock/rcu_read_unlock (CONFIG_MEMORY_FAILURE only),
> > >    so that synchronize_rcu() actually drains them.
> > 
> > It wouldn't surprise me if your underlying performance assumptions 
> > -- an non contended atomic is cheaper than a rcu_read_lock/unlock --
> > are not true in various CPU/kernel configuration combinations.
> > 
> > Modern CPUs have fast atomics when uncontended in normal circumstances.
> > But it probably doesn't matter much either way because the difference
> > shouldn't be very much.
> 
> 
> Hmm. It's a bit silly that I didn't try. Seemed clear to me, but,
> on this old xeon...
> 
>                                    insns/iter    cycles/iter
>      -------------------------------------------------------
>        base                       12238 +/- 1.0    17889 +/- 97.9
>        rcu_read_lock              12251 +/- 7.3    17991 +/-191.6
>        atomic ops                 12233 +/- 1.9    17733 +/-136.5
> 
> 
> The diff in the noise.
> 
> And old, slow CPUs maybe don't have MF at all.
> 
> So maybe just atomics instead of all this mess.


However, this was a basic test, when allocating 4k pages. With 2M hugepages:

                                   insns/iter    cycles/iter
     -------------------------------------------------------
       base                       20758 +/- 12.5   191208 +/-1946.6
       rcu                        20785 +/-  3.7   197108 +/- 132.1
       atomic                     20727 +/-  6.4   204591 +/- 160.2

       rcu vs base                  +27 (+0.13%)     +5900 (+3.09%)
       atomic vs base               -31 (-0.15%)    +13383 (+7.00%)

and even with THP:

                                   insns/iter    cycles/iter
     -------------------------------------------------------
       base                       27220 +/-  2.8   192151 +/- 483.3
       rcu                        27248 +/- 30.1   194159 +/-2746.6
       atomic                     27186 +/-  3.2   200526 +/- 746.2

       rcu vs base                  +28 (+0.10%)     +2008 (+1.04%)
       atomic vs base               -34 (-0.12%)     +8374 (+4.36%)


needs more thought.



> 
> 
> 
> >  It seems very complicated for something that
> > could be much simpler.
> > 
> > But I guess it's fine.
> > 
> > -Andi
> 
> Indeed. David already said he's gonnu look at this himself, but here's what I
> tested, maybe helpful for whoever wants to try this approach:
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 7223f6f4e2b4..8f0940cf2f29 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -404,6 +404,44 @@ static unsigned long *folio_flags(struct folio *folio, unsigned n)
>  #define FOLIO_HEAD_PAGE		0
>  #define FOLIO_SECOND_PAGE	1
>  
> +/*
> + * When CONFIG_MEMORY_FAILURE is enabled, non-atomic page flag operations
> + * must be atomic to prevent clobbering concurrent TestSetPageHWPoison()
> + * in memory_failure().  Otherwise, use the cheaper non-atomic versions.
> + */
> +#ifdef CONFIG_MEMORY_FAILURE
> +#define __pf_set_bit		set_bit
> +#define __pf_clear_bit		clear_bit
> +
> +static __always_inline void page_flags_clear(unsigned long *flags,
> +					     unsigned long mask)
> +{
> +	atomic_long_andnot(mask, (atomic_long_t *)flags);
> +}
> +
> +static __always_inline void page_flags_set(unsigned long *flags,
> +					   unsigned long mask)
> +{
> +	atomic_long_or(mask, (atomic_long_t *)flags);
> +}
> +
> +#else /* !CONFIG_MEMORY_FAILURE */
> +#define __pf_set_bit		__set_bit
> +#define __pf_clear_bit		__clear_bit
> +
> +static __always_inline void page_flags_clear(unsigned long *flags,
> +					     unsigned long mask)
> +{
> +	*flags &= ~mask;
> +}
> +
> +static __always_inline void page_flags_set(unsigned long *flags,
> +					   unsigned long mask)
> +{
> +	*flags |= mask;
> +}
> +#endif /* CONFIG_MEMORY_FAILURE */
> +
>  /*
>   * Macros to create function definitions for page flags
>   */
> @@ -421,11 +459,11 @@ static __always_inline void folio_clear_##name(struct folio *folio)	\
>  
>  #define __FOLIO_SET_FLAG(name, page)					\
>  static __always_inline void __folio_set_##name(struct folio *folio)	\
> -{ __set_bit(PG_##name, folio_flags(folio, page)); }
> +{ __pf_set_bit(PG_##name, folio_flags(folio, page)); }
>  
>  #define __FOLIO_CLEAR_FLAG(name, page)					\
>  static __always_inline void __folio_clear_##name(struct folio *folio)	\
> -{ __clear_bit(PG_##name, folio_flags(folio, page)); }
> +{ __pf_clear_bit(PG_##name, folio_flags(folio, page)); }
>  
>  #define FOLIO_TEST_SET_FLAG(name, page)					\
>  static __always_inline bool folio_test_set_##name(struct folio *folio)	\
> @@ -458,12 +496,12 @@ static __always_inline void ClearPage##uname(struct page *page)		\
>  #define __SETPAGEFLAG(uname, lname, policy)				\
>  __FOLIO_SET_FLAG(lname, FOLIO_##policy)					\
>  static __always_inline void __SetPage##uname(struct page *page)		\
> -{ __set_bit(PG_##lname, &policy(page, 1)->flags.f); }
> +{ __pf_set_bit(PG_##lname, &policy(page, 1)->flags.f); }
>  
>  #define __CLEARPAGEFLAG(uname, lname, policy)				\
>  __FOLIO_CLEAR_FLAG(lname, FOLIO_##policy)				\
>  static __always_inline void __ClearPage##uname(struct page *page)	\
> -{ __clear_bit(PG_##lname, &policy(page, 1)->flags.f); }
> +{ __pf_clear_bit(PG_##lname, &policy(page, 1)->flags.f); }
>  
>  #define TESTSETFLAG(uname, lname, policy)				\
>  FOLIO_TEST_SET_FLAG(lname, FOLIO_##policy)				\
> @@ -806,7 +844,7 @@ static inline bool PageUptodate(const struct page *page)
>  static __always_inline void __folio_mark_uptodate(struct folio *folio)
>  {
>  	smp_wmb();
> -	__set_bit(PG_uptodate, folio_flags(folio, 0));
> +	__pf_set_bit(PG_uptodate, folio_flags(folio, 0));
>  }
>  
>  static __always_inline void folio_mark_uptodate(struct folio *folio)
> @@ -1162,14 +1200,14 @@ static __always_inline void ClearPageAnonExclusive(struct page *page)
>  {
>  	VM_BUG_ON_PGFLAGS(!PageAnonNotKsm(page), page);
>  	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
> -	clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
> +	__pf_clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
>  }
>  
>  static __always_inline void __ClearPageAnonExclusive(struct page *page)
>  {
>  	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
>  	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
> -	__clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
> +	__pf_clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
>  }
>  
>  #ifdef CONFIG_MMU
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 06bbe9eba636..931dfc84319f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2343,7 +2343,7 @@ int folio_xchg_last_cpupid(struct folio *folio, int cpupid);
>  
>  static inline void page_cpupid_reset_last(struct page *page)
>  {
> -	page->flags.f |= LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT;
> +	page_flags_set(&page->flags.f, LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT);
>  }
>  #endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */
>  
> @@ -2503,8 +2503,8 @@ static inline struct zone *folio_zone(const struct folio *folio)
>  #ifdef SECTION_IN_PAGE_FLAGS
>  static inline void set_page_section(struct page *page, unsigned long section)
>  {
> -	page->flags.f &= ~(SECTIONS_MASK << SECTIONS_PGSHIFT);
> -	page->flags.f |= (section & SECTIONS_MASK) << SECTIONS_PGSHIFT;
> +	page_flags_clear(&page->flags.f, SECTIONS_MASK << SECTIONS_PGSHIFT);
> +	page_flags_set(&page->flags.f, (section & SECTIONS_MASK) << SECTIONS_PGSHIFT);
>  }
>  
>  static inline unsigned long memdesc_section(memdesc_flags_t mdf)
> @@ -2719,14 +2719,14 @@ static inline bool folio_is_longterm_pinnable(struct folio *folio)
>  
>  static inline void set_page_zone(struct page *page, enum zone_type zone)
>  {
> -	page->flags.f &= ~(ZONES_MASK << ZONES_PGSHIFT);
> -	page->flags.f |= (zone & ZONES_MASK) << ZONES_PGSHIFT;
> +	page_flags_clear(&page->flags.f, ZONES_MASK << ZONES_PGSHIFT);
> +	page_flags_set(&page->flags.f, (zone & ZONES_MASK) << ZONES_PGSHIFT);
>  }
>  
>  static inline void set_page_node(struct page *page, unsigned long node)
>  {
> -	page->flags.f &= ~(NODES_MASK << NODES_PGSHIFT);
> -	page->flags.f |= (node & NODES_MASK) << NODES_PGSHIFT;
> +	page_flags_clear(&page->flags.f, NODES_MASK << NODES_PGSHIFT);
> +	page_flags_set(&page->flags.f, (node & NODES_MASK) << NODES_PGSHIFT);
>  }
>  
>  static inline void set_page_links(struct page *page, enum zone_type zone,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 970e077019b7..100226f59490 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3624,8 +3624,8 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>  		 * unreferenced sub-pages of an anonymous THP: we can simply drop
>  		 * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
>  		 */
> -		new_folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> -		new_folio->flags.f |= (folio->flags.f &
> +		page_flags_clear(&new_folio->flags.f, PAGE_FLAGS_CHECK_AT_PREP);
> +		page_flags_set(&new_folio->flags.f, folio->flags.f &
>  				((1L << PG_referenced) |
>  				 (1L << PG_swapbacked) |
>  				 (1L << PG_swapcache) |
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 053842d45cb1..194ca2f04a87 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -494,7 +494,7 @@ void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
>  		 * blindly clear bits which could have set my order field here,
>  		 * including page head.
>  		 */
> -		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
> +		page_flags_clear(&new_page->flags.f, 0xffUL);	/* Clear possible order, page head */
>  
>  #ifdef NR_PAGES_IN_LARGE_FOLIO
>  		/*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d49c254174da..a3447124a725 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1359,7 +1359,7 @@ __always_inline bool __free_pages_prepare(struct page *page,
>  		int i;
>  
>  		if (compound) {
> -			page[1].flags.f &= ~PAGE_FLAGS_SECOND;
> +			page_flags_clear(&page[1].flags.f, PAGE_FLAGS_SECOND);
>  #ifdef NR_PAGES_IN_LARGE_FOLIO
>  			folio->_nr_pages = 0;
>  #endif
> @@ -1373,7 +1373,7 @@ __always_inline bool __free_pages_prepare(struct page *page,
>  					continue;
>  				}
>  			}
> -			(page + i)->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> +			page_flags_clear(&(page + i)->flags.f, PAGE_FLAGS_CHECK_AT_PREP);
>  		}
>  	}
>  	if (folio_test_anon(folio)) {
> @@ -1392,7 +1392,7 @@ __always_inline bool __free_pages_prepare(struct page *page,
>  	}
>  
>  	page_cpupid_reset_last(page);
> -	page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> +	page_flags_clear(&page->flags.f, PAGE_FLAGS_CHECK_AT_PREP);
>  	page->private = 0;
>  	reset_page_owner(page, order);
>  	page_table_check_free(page, order);
> diff --git a/mm/slub.c b/mm/slub.c
> index a2bf3756ca7d..a55199f642d3 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -617,7 +617,7 @@ static inline void slab_set_pfmemalloc(struct slab *slab)
>  
>  static inline void __slab_clear_pfmemalloc(struct slab *slab)
>  {
> -	__clear_bit(SL_pfmemalloc, &slab->flags.f);
> +	__pf_clear_bit(SL_pfmemalloc, &slab->flags.f);
>  }
>  
>  /*


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-29  8:39     ` Michael S. Tsirkin
@ 2026-06-29 16:54       ` Andi Kleen
  2026-06-29 17:04         ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2026-06-29 16:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand, Miaohe Lin, Naoya Horiguchi,
	Andrew Morton, Oscar Salvador, Hidehiro Kawai, Rik van Riel,
	Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl,
	David Hildenbrand

> However, this was a basic test, when allocating 4k pages. With 2M hugepages:
> 
>                                    insns/iter    cycles/iter
>      -------------------------------------------------------
>        base                       20758 +/- 12.5   191208 +/-1946.6
>        rcu                        20785 +/-  3.7   197108 +/- 132.1
>        atomic                     20727 +/-  6.4   204591 +/- 160.2
> 
>        rcu vs base                  +27 (+0.13%)     +5900 (+3.09%)
>        atomic vs base               -31 (-0.15%)    +13383 (+7.00%)
> 
> and even with THP:
> 
>                                    insns/iter    cycles/iter
>      -------------------------------------------------------
>        base                       27220 +/-  2.8   192151 +/- 483.3
>        rcu                        27248 +/- 30.1   194159 +/-2746.6
>        atomic                     27186 +/-  3.2   200526 +/- 746.2
> 
>        rcu vs base                  +28 (+0.10%)     +2008 (+1.04%)
>        atomic vs base               -34 (-0.12%)     +8374 (+4.36%)
> 
> 
> needs more thought.

Well the alternative is to not bother with RCU, but just wait a bit and
check if the bit stuck and repeat if needed. While that could in theory
livelock it is extremely unlikely (especially if you add a bit of randomization 
to the sleep)

-Andi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-29 16:54       ` Andi Kleen
@ 2026-06-29 17:04         ` David Hildenbrand (Arm)
  2026-06-29 20:43           ` Michael S. Tsirkin
  0 siblings, 1 reply; 29+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-29 17:04 UTC (permalink / raw)
  To: Andi Kleen, Michael S. Tsirkin
  Cc: linux-kernel, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
	Oscar Salvador, Hidehiro Kawai, Rik van Riel, Vlastimil Babka,
	Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl

On 6/29/26 18:54, Andi Kleen wrote:
>> However, this was a basic test, when allocating 4k pages. With 2M hugepages:
>>
>>                                    insns/iter    cycles/iter
>>      -------------------------------------------------------
>>        base                       20758 +/- 12.5   191208 +/-1946.6
>>        rcu                        20785 +/-  3.7   197108 +/- 132.1
>>        atomic                     20727 +/-  6.4   204591 +/- 160.2
>>
>>        rcu vs base                  +27 (+0.13%)     +5900 (+3.09%)
>>        atomic vs base               -31 (-0.15%)    +13383 (+7.00%)
>>
>> and even with THP:
>>
>>                                    insns/iter    cycles/iter
>>      -------------------------------------------------------
>>        base                       27220 +/-  2.8   192151 +/- 483.3
>>        rcu                        27248 +/- 30.1   194159 +/-2746.6
>>        atomic                     27186 +/-  3.2   200526 +/- 746.2
>>
>>        rcu vs base                  +28 (+0.10%)     +2008 (+1.04%)
>>        atomic vs base               -34 (-0.12%)     +8374 (+4.36%)
>>
>>
>> needs more thought.
> 
> Well the alternative is to not bother with RCU, but just wait a bit and
> check if the bit stuck and repeat if needed. While that could in theory
> livelock it is extremely unlikely (especially if you add a bit of randomization 
> to the sleep)

We discussed that a bit already. Hypervisors make it fairly unpredictable how
long you would actually have to spin.

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-29 17:04         ` David Hildenbrand (Arm)
@ 2026-06-29 20:43           ` Michael S. Tsirkin
  0 siblings, 0 replies; 29+ messages in thread
From: Michael S. Tsirkin @ 2026-06-29 20:43 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Andi Kleen, linux-kernel, Miaohe Lin, Naoya Horiguchi,
	Andrew Morton, Oscar Salvador, Hidehiro Kawai, Rik van Riel,
	Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl

On Mon, Jun 29, 2026 at 07:04:25PM +0200, David Hildenbrand (Arm) wrote:
> On 6/29/26 18:54, Andi Kleen wrote:
> >> However, this was a basic test, when allocating 4k pages. With 2M hugepages:
> >>
> >>                                    insns/iter    cycles/iter
> >>      -------------------------------------------------------
> >>        base                       20758 +/- 12.5   191208 +/-1946.6
> >>        rcu                        20785 +/-  3.7   197108 +/- 132.1
> >>        atomic                     20727 +/-  6.4   204591 +/- 160.2
> >>
> >>        rcu vs base                  +27 (+0.13%)     +5900 (+3.09%)
> >>        atomic vs base               -31 (-0.15%)    +13383 (+7.00%)
> >>
> >> and even with THP:
> >>
> >>                                    insns/iter    cycles/iter
> >>      -------------------------------------------------------
> >>        base                       27220 +/-  2.8   192151 +/- 483.3
> >>        rcu                        27248 +/- 30.1   194159 +/-2746.6
> >>        atomic                     27186 +/-  3.2   200526 +/- 746.2
> >>
> >>        rcu vs base                  +28 (+0.10%)     +2008 (+1.04%)
> >>        atomic vs base               -34 (-0.12%)     +8374 (+4.36%)
> >>
> >>
> >> needs more thought.
> > 
> > Well the alternative is to not bother with RCU, but just wait a bit and
> > check if the bit stuck and repeat if needed. While that could in theory
> > livelock it is extremely unlikely (especially if you add a bit of randomization 
> > to the sleep)
> 
> We discussed that a bit already. Hypervisors make it fairly unpredictable how
> long you would actually have to spin.


Way I see it, this is not the issue. The issue is it does not fix the
race:


CPU1:

read flags

CPU2:

test and set

test and set #2 - sees it is set

CPU1:

write flags clearing the bit




> -- 
> Cheers,
> 
> David


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-28 21:45 [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops Michael S. Tsirkin
                   ` (2 preceding siblings ...)
  2026-06-29  2:11 ` [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops Andi Kleen
@ 2026-06-29  6:49 ` David Hildenbrand (Arm)
  2026-06-29  7:34   ` Michael S. Tsirkin
  3 siblings, 1 reply; 29+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-29  6:49 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Oscar Salvador,
	Andi Kleen, Hidehiro Kawai, Rik van Riel, Vlastimil Babka,
	Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl

On 6/28/26 23:45, Michael S. Tsirkin wrote:
> I don't like it that we are adding overhead to the good path for
> the benefit of memory failure, which never triggers on many systems,
> but I don't have a better idea. Pls take a look.

As I said on Friday.

"It's also doesn't address the mf_mutex implications and the x86 thingies I
mentioned.

...

I'll either take care of that myself or find someone that can work on this with
attention to all details.
"

This is nothing to vibe-code. This needs a real expert.


> 
> Non-atomic page flag operations (page->flags.f &= ~mask, __set_bit,
> __clear_bit) can race with atomic TestSetPageHWPoison() in
> memory_failure().  The non-atomic RMW reads flags, memory_failure()
> atomically sets HWPoison, then the RMW writes back the old value
> without HWPoison, clobbering the bit.
> 
> The race was confirmed by injecting a cpu_relax() delay between the
> load and store of the non-atomic RMW in __free_pages_prepare, then
> running concurrent MADV_HWPOISON injection.  The clobbered HWPoison
> bit was observed repeatedly.
> 
> This series fixes the race by:
> 
> 1. Having memory_failure() call synchronize_rcu() + retry after
>    setting HWPoison, so that any in-flight non-atomic RMW that
>    read the old flags value completes before we proceed.
> 
> 2. Wrapping all non-atomic page flag operations in
>    rcu_read_lock/rcu_read_unlock (CONFIG_MEMORY_FAILURE only),
>    so that synchronize_rcu() actually drains them.
> 
> Performance impact (page alloc+free microbenchmark, 200K iterations,
> 20 runs, KVM guest, error bars are 3-sigma):
> 
>   !PREEMPT_RCU (x86):
>                          insns/iter      cycles/iter
>     base:                12237 +/- 1     17954 +/- 136
>     patched:             +22 +/- 1       -124 +/- 122
>                          (+0.18%)        (within noise)
> 
>   PREEMPT_RCU:
>                          insns/iter      cycles/iter
>     base:                12512 +/- 3     18541 +/- 214
>     patched:             +95 +/- 3       -12 +/- 161
>                          (+0.76%)        (within noise)
> 
> When !CONFIG_MEMORY_FAILURE, all wrappers compile away completely.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>

No ;)

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-29  6:49 ` David Hildenbrand (Arm)
@ 2026-06-29  7:34   ` Michael S. Tsirkin
  2026-06-29 13:05     ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 29+ messages in thread
From: Michael S. Tsirkin @ 2026-06-29  7:34 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: linux-kernel, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
	Oscar Salvador, Andi Kleen, Hidehiro Kawai, Rik van Riel,
	Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl

On Mon, Jun 29, 2026 at 08:49:37AM +0200, David Hildenbrand (Arm) wrote:
> On 6/28/26 23:45, Michael S. Tsirkin wrote:
> > I don't like it that we are adding overhead to the good path for
> > the benefit of memory failure, which never triggers on many systems,
> > but I don't have a better idea. Pls take a look.
> 
> As I said on Friday.
> 
> "It's also doesn't address the mf_mutex implications and the x86 thingies I
> mentioned.

Well I did attempt addressing this.  These would be these two:

	(a) We don't hold the mf_mutex on all call paths, but we really need it so a
	page_test_set_hwpoison() cannot race in weird ways with the other primitives I think.

page_test_set_hwpoison was this code you wrote:

+static void page_set_hwpoison(struct page *page)
+{
+       lockdep_assert_held(&mf_mutex);
+
+       while (!PageHWPoison(page)) {
+               SetPageHWPoison(page);
+
+               /* Make sure concurrent non-atomic writers completed. */
+               synchronize_rcu();
+       }
+}

and indeed the test+set combination seems racy.  But consider the version I posted, for example:

+/*
+ * Drain any in-flight non-atomic page flag operations that could
+ * clobber a concurrently set HWPoison bit.  Retries until the bit sticks.
+ */
+static void set_hwpoison_drain_rcu(struct page *p)
+{
+       do {
+               synchronize_rcu();
+       } while (!TestSetPageHWPoison(p));
+}
+

...

+static bool test_and_set_hwpoison_drain_rcu(struct page *p)
+{
+       bool was_set = TestSetPageHWPoison(p);
+
+       set_hwpoison_drain_rcu(p);
+       return was_set;
+}



does not seem racy without a lock. But maybe I don't get it.



	(b) There are some leftover SetPageHWPoison etc. instances. The ones in
	arch/x86/kernel/cpu/mce/core.c likely cannot grab the mutex, but maybe they are
	corner cases either way and we can document the situation.

Well, I did try to document the situation - it's in the commit log for
patch 1:

    Note: the MCE handler in arch/x86/kernel/cpu/mce/core.c also calls
    SetPageHWPoison() and is subject to the same race.  It cannot use
    the drain helpers (MCE context cannot call synchronize_rcu()).
    For recoverable MCE errors, memory_failure() is queued via work
    items (kill_me_maybe/kill_me_never) and will re-set the bit via
    test_and_set_hwpoison_drain_rcu() if it was clobbered.  The
    mce_panic() path sets HWPoison for kdump right before panic() so
    the race is irrelevant there.  The MCG_STATUS_SEAM_NR path does
    not queue memory_failure(), but the affected page belongs to a
    TDX guest whose CPU core has already been marked dead - the page
    is not subject to concurrent non-atomic flag operations in the
    buddy allocator, so the race does not apply.






> ...
> 
> I'll either take care of that myself or find someone that can work on this with
> attention to all details.
> "

> 
> This is nothing to vibe-code. This needs a real expert.

Well I had this sitting on the disk anyway, so I thought I'd post.

I wouldn't call this vibe-code - a bunch of manual work went into this,
llms mostly as a grep/sed replacement.  But hey.  I don't object to
someone taking over, for sure. Was fun, and maybe these patches will be
helpful as a starting point.

In particular, maybe I should have been more explicit about how your
points from Friday are addressed.

If you want to add a bit more to explain the exact concerns here, for
whoever works on this next, feel free to do so.




> > 
> > Non-atomic page flag operations (page->flags.f &= ~mask, __set_bit,
> > __clear_bit) can race with atomic TestSetPageHWPoison() in
> > memory_failure().  The non-atomic RMW reads flags, memory_failure()
> > atomically sets HWPoison, then the RMW writes back the old value
> > without HWPoison, clobbering the bit.
> > 
> > The race was confirmed by injecting a cpu_relax() delay between the
> > load and store of the non-atomic RMW in __free_pages_prepare, then
> > running concurrent MADV_HWPOISON injection.  The clobbered HWPoison
> > bit was observed repeatedly.
> > 
> > This series fixes the race by:
> > 
> > 1. Having memory_failure() call synchronize_rcu() + retry after
> >    setting HWPoison, so that any in-flight non-atomic RMW that
> >    read the old flags value completes before we proceed.
> > 
> > 2. Wrapping all non-atomic page flag operations in
> >    rcu_read_lock/rcu_read_unlock (CONFIG_MEMORY_FAILURE only),
> >    so that synchronize_rcu() actually drains them.
> > 
> > Performance impact (page alloc+free microbenchmark, 200K iterations,
> > 20 runs, KVM guest, error bars are 3-sigma):
> > 
> >   !PREEMPT_RCU (x86):
> >                          insns/iter      cycles/iter
> >     base:                12237 +/- 1     17954 +/- 136
> >     patched:             +22 +/- 1       -124 +/- 122
> >                          (+0.18%)        (within noise)
> > 
> >   PREEMPT_RCU:
> >                          insns/iter      cycles/iter
> >     base:                12512 +/- 3     18541 +/- 214
> >     patched:             +95 +/- 3       -12 +/- 161
> >                          (+0.76%)        (within noise)
> > 
> > When !CONFIG_MEMORY_FAILURE, all wrappers compile away completely.
> > 
> > Suggested-by: David Hildenbrand <david@redhat.com>
> 
> No ;)
> 
> -- 
> Cheers,
> 
> David


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-29  7:34   ` Michael S. Tsirkin
@ 2026-06-29 13:05     ` David Hildenbrand (Arm)
  2026-06-29 20:08       ` Michael S. Tsirkin
  2026-06-29 23:29       ` Michael S. Tsirkin
  0 siblings, 2 replies; 29+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-29 13:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
	Oscar Salvador, Andi Kleen, Hidehiro Kawai, Rik van Riel,
	Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl

On 6/29/26 09:34, Michael S. Tsirkin wrote:
> On Mon, Jun 29, 2026 at 08:49:37AM +0200, David Hildenbrand (Arm) wrote:
>> On 6/28/26 23:45, Michael S. Tsirkin wrote:
>>> I don't like it that we are adding overhead to the good path for
>>> the benefit of memory failure, which never triggers on many systems,
>>> but I don't have a better idea. Pls take a look.
>>
>> As I said on Friday.
>>
>> "It's also doesn't address the mf_mutex implications and the x86 thingies I
>> mentioned.
> 
> Well I did attempt addressing this.  These would be these two:
> 
> 	(a) We don't hold the mf_mutex on all call paths, but we really need it so a
> 	page_test_set_hwpoison() cannot race in weird ways with the other primitives I think.
> 
> page_test_set_hwpoison was this code you wrote:
> 
> +static void page_set_hwpoison(struct page *page)
> +{
> +       lockdep_assert_held(&mf_mutex);
> +
> +       while (!PageHWPoison(page)) {
> +               SetPageHWPoison(page);
> +
> +               /* Make sure concurrent non-atomic writers completed. */
> +               synchronize_rcu();
> +       }
> +}
> 
> and indeed the test+set combination seems racy.  But consider the version I posted, for example:
> 
> +/*
> + * Drain any in-flight non-atomic page flag operations that could
> + * clobber a concurrently set HWPoison bit.  Retries until the bit sticks.
> + */
> +static void set_hwpoison_drain_rcu(struct page *p)
> +{
> +       do {
> +               synchronize_rcu();
> +       } while (!TestSetPageHWPoison(p));
> +}
> +
> 
> ...
> 
> +static bool test_and_set_hwpoison_drain_rcu(struct page *p)
> +{
> +       bool was_set = TestSetPageHWPoison(p);
> +
> +       set_hwpoison_drain_rcu(p);
> +       return was_set;
> +}
> 
> 
> 
> does not seem racy without a lock. But maybe I don't get it.


Staring at your implementation, just think about two concurrent invocations of
test_and_set_hwpoison_drain() in your code:

Assume HWPoison flag is not set.

Thread 1: test_and_set_hwpoison_drain_rcu() -> TestSetPageHWPoison()
	-> was_set = false

Thread 2: update that overwrites page->flags. HWPoison accidentally cleared.

Thread 3: test_and_set_hwpoison_drain_rcu() -> TestSetPageHWPoison()
	-> was_set = false

Thread 1: does RCU sync and returns "!was_set"
thread 2: does RCU sync and returns "!was_set"

So you could end up with two thread believing that they atomically cleared the
flag, and you really need to lock.

We really have to document and enforce that the mutex is involved.


And I fear there are more nasty details to be uncovered while we rework some of
this properly, mandating a detailed look.

For example, TestClearPageHWPoison() in put_page_back_buddy() likely needs a
proper treatment as well. Likely that code should be reworked entirely to not
have arbitrary hwpoison page flag modifications throughout the codebase.

> 
> 
> 
> 	(b) There are some leftover SetPageHWPoison etc. instances. The ones in
> 	arch/x86/kernel/cpu/mce/core.c likely cannot grab the mutex, but maybe they are
> 	corner cases either way and we can document the situation.
> 
> Well, I did try to document the situation - it's in the commit log for
> patch 1:
> 
>     Note: the MCE handler in arch/x86/kernel/cpu/mce/core.c also calls
>     SetPageHWPoison() and is subject to the same race.  It cannot use
>     the drain helpers (MCE context cannot call synchronize_rcu()).
>     For recoverable MCE errors, memory_failure() is queued via work
>     items (kill_me_maybe/kill_me_never) and will re-set the bit via
>     test_and_set_hwpoison_drain_rcu() if it was clobbered.  The
>     mce_panic() path sets HWPoison for kdump right before panic() so
>     the race is irrelevant there.  The MCG_STATUS_SEAM_NR path does
>     not queue memory_failure(), but the affected page belongs to a
>     TDX guest whose CPU core has already been marked dead - the page
>     is not subject to concurrent non-atomic flag operations in the
>     buddy allocator, so the race does not apply.
> 

We should have a central mechanism in place to document this and avoid future
mistakes.

I am not even sure if we should clearly document for SetPageHWPoison() when and
how they can be used, or if we need a completely new set of helpers.

And that's something to figure out (e.g., interaction with the mutex) by looking
into all of the details, so I expect this to take a lot more time.

[...]

>> This is nothing to vibe-code. This needs a real expert.
> 
> Well I had this sitting on the disk anyway, so I thought I'd post.

It would be good to coordinate here.

Like a reply to my mail, asking whether you should post a new version that you
have already in place.

> 
> I wouldn't call this vibe-code - a bunch of manual work went into this,
> llms mostly as a grep/sed replacement.  

The version you posted earlier had real AI vibes to it, so I can only speculate.
I know that you did some manual work on this, but the details are really ugly in
this code.

> But hey.  I don't object to
> someone taking over, for sure. Was fun, and maybe these patches will be
> helpful as a starting point.
> 
> In particular, maybe I should have been more explicit about how your
> points from Friday are addressed.

Yes.

> 
> If you want to add a bit more to explain the exact concerns here, for
> whoever works on this next, feel free to do so.

I raised some above. I'll try to find someone to take a closer look and see to
which degree we could optimize this.

Or if there are actually more performant alternatives that we could use. (I
still doubt that using atomics is ok in general)

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-29 13:05     ` David Hildenbrand (Arm)
@ 2026-06-29 20:08       ` Michael S. Tsirkin
  2026-06-29 20:55         ` Andi Kleen
  2026-06-29 21:22         ` David Hildenbrand (Arm)
  2026-06-29 23:29       ` Michael S. Tsirkin
  1 sibling, 2 replies; 29+ messages in thread
From: Michael S. Tsirkin @ 2026-06-29 20:08 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: linux-kernel, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
	Oscar Salvador, Andi Kleen, Hidehiro Kawai, Rik van Riel,
	Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl

On Mon, Jun 29, 2026 at 03:05:18PM +0200, David Hildenbrand (Arm) wrote:
> On 6/29/26 09:34, Michael S. Tsirkin wrote:
> > On Mon, Jun 29, 2026 at 08:49:37AM +0200, David Hildenbrand (Arm) wrote:
> >> On 6/28/26 23:45, Michael S. Tsirkin wrote:
> >>> I don't like it that we are adding overhead to the good path for
> >>> the benefit of memory failure, which never triggers on many systems,
> >>> but I don't have a better idea. Pls take a look.
> >>
> >> As I said on Friday.
> >>
> >> "It's also doesn't address the mf_mutex implications and the x86 thingies I
> >> mentioned.
> > 
> > Well I did attempt addressing this.  These would be these two:
> > 
> > 	(a) We don't hold the mf_mutex on all call paths, but we really need it so a
> > 	page_test_set_hwpoison() cannot race in weird ways with the other primitives I think.
> > 
> > page_test_set_hwpoison was this code you wrote:
> > 
> > +static void page_set_hwpoison(struct page *page)
> > +{
> > +       lockdep_assert_held(&mf_mutex);
> > +
> > +       while (!PageHWPoison(page)) {
> > +               SetPageHWPoison(page);
> > +
> > +               /* Make sure concurrent non-atomic writers completed. */
> > +               synchronize_rcu();
> > +       }
> > +}
> > 
> > and indeed the test+set combination seems racy.  But consider the version I posted, for example:
> > 
> > +/*
> > + * Drain any in-flight non-atomic page flag operations that could
> > + * clobber a concurrently set HWPoison bit.  Retries until the bit sticks.
> > + */
> > +static void set_hwpoison_drain_rcu(struct page *p)
> > +{
> > +       do {
> > +               synchronize_rcu();
> > +       } while (!TestSetPageHWPoison(p));
> > +}
> > +
> > 
> > ...
> > 
> > +static bool test_and_set_hwpoison_drain_rcu(struct page *p)
> > +{
> > +       bool was_set = TestSetPageHWPoison(p);
> > +
> > +       set_hwpoison_drain_rcu(p);
> > +       return was_set;
> > +}
> > 
> > 
> > 
> > does not seem racy without a lock. But maybe I don't get it.
> 
> 
> Staring at your implementation, just think about two concurrent invocations of
> test_and_set_hwpoison_drain() in your code:
> 
> Assume HWPoison flag is not set.
> 
> Thread 1: test_and_set_hwpoison_drain_rcu() -> TestSetPageHWPoison()
> 	-> was_set = false
> 
> Thread 2: update that overwrites page->flags. HWPoison accidentally cleared.
> 
> Thread 3: test_and_set_hwpoison_drain_rcu() -> TestSetPageHWPoison()
> 	-> was_set = false
> 
> Thread 1: does RCU sync and returns "!was_set"
> thread 2: does RCU sync and returns "!was_set"
> 
> So you could end up with two thread believing that they atomically cleared the
> flag, and you really need to lock.
> 
> We really have to document and enforce that the mutex is involved.

Sure, that will be somewhat easy to add, for anyone who works on this next. But
just to make sure, I think they (test_and variants, and clear) *are* all under
mutex in the patch I posted.  It just isn't enforced. Only set
is not and that's safe I think.


> 
> And I fear there are more nasty details to be uncovered while we rework some of
> this properly, mandating a detailed look.
> 
> For example, TestClearPageHWPoison() in put_page_back_buddy() likely needs a
> proper treatment as well.

Hmm I don't see why. It owns the page. The issue is only if one pokes at
page flags without owning it.


> Likely that code should be reworked entirely to not
> have arbitrary hwpoison page flag modifications throughout the codebase.
> 
> > 
> > 
> > 
> > 	(b) There are some leftover SetPageHWPoison etc. instances. The ones in
> > 	arch/x86/kernel/cpu/mce/core.c likely cannot grab the mutex, but maybe they are
> > 	corner cases either way and we can document the situation.
> > 
> > Well, I did try to document the situation - it's in the commit log for
> > patch 1:
> > 
> >     Note: the MCE handler in arch/x86/kernel/cpu/mce/core.c also calls
> >     SetPageHWPoison() and is subject to the same race.  It cannot use
> >     the drain helpers (MCE context cannot call synchronize_rcu()).
> >     For recoverable MCE errors, memory_failure() is queued via work
> >     items (kill_me_maybe/kill_me_never) and will re-set the bit via
> >     test_and_set_hwpoison_drain_rcu() if it was clobbered.  The
> >     mce_panic() path sets HWPoison for kdump right before panic() so
> >     the race is irrelevant there.  The MCG_STATUS_SEAM_NR path does
> >     not queue memory_failure(), but the affected page belongs to a
> >     TDX guest whose CPU core has already been marked dead - the page
> >     is not subject to concurrent non-atomic flag operations in the
> >     buddy allocator, so the race does not apply.
> > 
> 
> We should have a central mechanism in place to document this and avoid future
> mistakes.

Like maybe a wrapper with lots of __, "raw" and "unsafe" so people know
they need to be careful.

> I am not even sure if we should clearly document for SetPageHWPoison() when and
> how they can be used, or if we need a completely new set of helpers.
> 
> And that's something to figure out (e.g., interaction with the mutex) by looking
> into all of the details, so I expect this to take a lot more time.
> 
> [...]

And again, I'm really not sure fixing a theoretical race when memory
is failing is worth slowing the world by 0.1-1% for.



> >> This is nothing to vibe-code. This needs a real expert.
> > 
> > Well I had this sitting on the disk anyway, so I thought I'd post.
> 
> It would be good to coordinate here.
> 
> Like a reply to my mail, asking whether you should post a new version that you
> have already in place.

Sure, I'll keep this in mind, thanks!

> > 
> > I wouldn't call this vibe-code - a bunch of manual work went into this,
> > llms mostly as a grep/sed replacement.  
> 
> The version you posted earlier had real AI vibes to it, so I can only speculate.
> I know that you did some manual work on this, but the details are really ugly in
> this code.
> 
> > But hey.  I don't object to
> > someone taking over, for sure. Was fun, and maybe these patches will be
> > helpful as a starting point.
> > 
> > In particular, maybe I should have been more explicit about how your
> > points from Friday are addressed.
> 
> Yes.
> 
> > 
> > If you want to add a bit more to explain the exact concerns here, for
> > whoever works on this next, feel free to do so.
> 
> I raised some above. I'll try to find someone to take a closer look and see to
> which degree we could optimize this.


From what I saw in my testing, if we allocate 4k pages
it's hidden by the overhead. With hp and thp it's measureably
worse than rcu on !preempt config.


I also tried keeping atomics in most places but optimizing the compound
case by test and set of pg lock, just to see what happens.  Was even
worse:

  4K (base: 5105 ns):
    rcu:     +63 +/- 144 ns  (+1.2 +/- 2.8%)
    atomic:   +8 +/- 144 ns  (+0.2 +/- 2.8%)
    pg_lock: +133 +/- 144 ns  (+2.6 +/- 2.8%)

  2M-thp (base: 52965 ns):
    rcu:      +74 +/- 780 ns  (+0.1 +/- 1.5%)
    atomic:  +657 +/- 780 ns  (+1.2 +/- 1.5%)                                                                          
    pg_lock: +1691 +/- 780 ns  (+3.2 +/- 1.5%)
  
  2M-hp (base: 53788 ns):                                                                                            
    rcu:     +150 +/- 821 ns  (+0.3 +/- 1.5%)
    atomic:  +427 +/- 821 ns  (+0.8 +/- 1.5%)
    pg_lock: +2413 +/- 821 ns  (+4.5 +/- 1.5%)

this is without preempt rcu. with preempt rcu it seems
much noisier so I can't really measure it:

4K (base: 5605 ns):
    rcu:     +126 +/- 136 ns  (+2.3 +/- 2.4%)
    atomic:   +44 +/- 114 ns  (+0.8 +/- 2.0%)

  2M-regular (base: 58179 ns):
    rcu:     -834 +/- 1573 ns  (-1.4 +/- 2.7%)
    atomic:  +440 +/- 1645 ns  (+0.8 +/- 2.8%)

  2M-huge (base: 59937 ns):
    rcu:     +800 +/- 2653 ns  (+1.3 +/- 4.4%)
    atomic: -1402 +/- 2546 ns  (-2.3 +/- 4.2%)



> 
> Or if there are actually more performant alternatives that we could use. (I
> still doubt that using atomics is ok in general)

Right now I'm thinking of looking at something like stop_machine maybe.

Do you want me to try?




-- 
MST


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-29 20:08       ` Michael S. Tsirkin
@ 2026-06-29 20:55         ` Andi Kleen
  2026-06-29 21:17           ` Michael S. Tsirkin
  2026-06-29 21:22         ` David Hildenbrand (Arm)
  1 sibling, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2026-06-29 20:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Hildenbrand (Arm), linux-kernel, Miaohe Lin,
	Naoya Horiguchi, Andrew Morton, Oscar Salvador, Hidehiro Kawai,
	Rik van Riel, Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl

> > Or if there are actually more performant alternatives that we could use. (I
> > still doubt that using atomics is ok in general)
> 
> Right now I'm thinking of looking at something like stop_machine maybe.

There can be (rare) cases where yoy get a lot of memory errors in a
flood. While memory-failure doesn't need to be particularly optimized,
it shouldn't be excessively slow or impact the rest of the machine.

stop_machine is a big hammer that is likely to violate most of this.

-Andi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-29 20:55         ` Andi Kleen
@ 2026-06-29 21:17           ` Michael S. Tsirkin
  2026-06-29 21:39             ` Andi Kleen
  0 siblings, 1 reply; 29+ messages in thread
From: Michael S. Tsirkin @ 2026-06-29 21:17 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David Hildenbrand (Arm), linux-kernel, Miaohe Lin,
	Naoya Horiguchi, Andrew Morton, Oscar Salvador, Hidehiro Kawai,
	Rik van Riel, Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl

On Mon, Jun 29, 2026 at 01:55:57PM -0700, Andi Kleen wrote:
> > > Or if there are actually more performant alternatives that we could use. (I
> > > still doubt that using atomics is ok in general)
> > 
> > Right now I'm thinking of looking at something like stop_machine maybe.
> 
> There can be (rare) cases where yoy get a lot of memory errors in a
> flood. While memory-failure doesn't need to be particularly optimized,
> it shouldn't be excessively slow or impact the rest of the machine.
> 
> stop_machine is a big hammer that is likely to violate most of this.
> 
> -Andi

We can maybe batch a bunch of these, and do stop machine once?

-- 
MST


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-29 21:17           ` Michael S. Tsirkin
@ 2026-06-29 21:39             ` Andi Kleen
  2026-06-29 21:59               ` Michael S. Tsirkin
  0 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2026-06-29 21:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Hildenbrand (Arm), linux-kernel, Miaohe Lin,
	Naoya Horiguchi, Andrew Morton, Oscar Salvador, Hidehiro Kawai,
	Rik van Riel, Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl

> We can maybe batch a bunch of these, and do stop machine once?

If you add a long delay you increase the risk that the machine or
the critical process dies because there is an unrecoverable or
process killing AR access before the page can be off lined.

(BTW that's a problem even with the RCU approach, the cure might be
worse than the disease)

If you don't add a long delay you would do a lot of stop machines
on a flood, likely bringing it to a halt, even though other sockets etc.
might still be fine.

-Andi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-29 21:39             ` Andi Kleen
@ 2026-06-29 21:59               ` Michael S. Tsirkin
  0 siblings, 0 replies; 29+ messages in thread
From: Michael S. Tsirkin @ 2026-06-29 21:59 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David Hildenbrand (Arm), linux-kernel, Miaohe Lin,
	Naoya Horiguchi, Andrew Morton, Oscar Salvador, Hidehiro Kawai,
	Rik van Riel, Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl

On Mon, Jun 29, 2026 at 02:39:44PM -0700, Andi Kleen wrote:
> > We can maybe batch a bunch of these, and do stop machine once?
> 
> If you add a long delay you increase the risk that the machine or
> the critical process dies because there is an unrecoverable or
> process killing AR access before the page can be off lined.

Well nothing prevents us from opportunistically running the MF machinery
once first of all.

No regression then?

We then add it to a batch and once the batch is full we do stop machine
and reprocess them again.

Hmm?




> 
> (BTW that's a problem even with the RCU approach, the cure might be
> worse than the disease)
> 
> If you don't add a long delay you would do a lot of stop machines
> on a flood, likely bringing it to a halt, even though other sockets etc.
> might still be fine.
> 
> -Andi


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-29 20:08       ` Michael S. Tsirkin
  2026-06-29 20:55         ` Andi Kleen
@ 2026-06-29 21:22         ` David Hildenbrand (Arm)
  2026-06-29 21:43           ` David Hildenbrand (Arm)
                             ` (2 more replies)
  1 sibling, 3 replies; 29+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-29 21:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
	Oscar Salvador, Andi Kleen, Hidehiro Kawai, Rik van Riel,
	Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl

[...]

> 
> And again, I'm really not sure fixing a theoretical race when memory
> is failing is worth slowing the world by 0.1-1% for.
> 

Fully agreed. I was hoping RCU was cheaper (I mean, we were once told that RCU
read side locking is essentially for free ... well in some configs :) )

The question if we could optimize it reasonably enough ...

> 
> From what I saw in my testing, if we allocate 4k pages
> it's hidden by the overhead. With hp and thp it's measureably
> worse than rcu on !preempt config.

... for example, by doing the rcu read lock + unlock around the

	for (i = 1; i < (1 << order); i++) {

loop on the alloc path. But I suspect it's not going to make that much of a
difference.

I concluded, similar to Andi, that stop_machine() is too big of a hammer.

I wonder if something could be built out of preempt_disable() and sync SMP
calls. hmm :(

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-29 21:22         ` David Hildenbrand (Arm)
@ 2026-06-29 21:43           ` David Hildenbrand (Arm)
  2026-06-29 23:34             ` Michael S. Tsirkin
  2026-06-29 21:50           ` Michael S. Tsirkin
  2026-06-29 21:54           ` Michael S. Tsirkin
  2 siblings, 1 reply; 29+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-29 21:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
	Oscar Salvador, Andi Kleen, Hidehiro Kawai, Rik van Riel,
	Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl

On 6/29/26 23:22, David Hildenbrand (Arm) wrote:
> [...]
> 
>>
>> And again, I'm really not sure fixing a theoretical race when memory
>> is failing is worth slowing the world by 0.1-1% for.
>>
> 
> Fully agreed. I was hoping RCU was cheaper (I mean, we were once told that RCU
> read side locking is essentially for free ... well in some configs :) )
> 
> The question if we could optimize it reasonably enough ...
> 
>>
>> From what I saw in my testing, if we allocate 4k pages
>> it's hidden by the overhead. With hp and thp it's measureably
>> worse than rcu on !preempt config.
> 
> ... for example, by doing the rcu read lock + unlock around the
> 
> 	for (i = 1; i < (1 << order); i++) {
> 
> loop on the alloc path. But I suspect it's not going to make that much of a
> difference.
> 
> I concluded, similar to Andi, that stop_machine() is too big of a hammer.
> 
> I wonder if something could be built out of preempt_disable() and sync SMP
> calls. hmm :(

Scrap that, shouldn't work I think ...

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-29 21:43           ` David Hildenbrand (Arm)
@ 2026-06-29 23:34             ` Michael S. Tsirkin
  2026-06-30  6:17               ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 29+ messages in thread
From: Michael S. Tsirkin @ 2026-06-29 23:34 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: linux-kernel, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
	Oscar Salvador, Andi Kleen, Hidehiro Kawai, Rik van Riel,
	Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl

On Mon, Jun 29, 2026 at 11:43:32PM +0200, David Hildenbrand (Arm) wrote:
> On 6/29/26 23:22, David Hildenbrand (Arm) wrote:
> > [...]
> > 
> >>
> >> And again, I'm really not sure fixing a theoretical race when memory
> >> is failing is worth slowing the world by 0.1-1% for.
> >>
> > 
> > Fully agreed. I was hoping RCU was cheaper (I mean, we were once told that RCU
> > read side locking is essentially for free ... well in some configs :) )
> > 
> > The question if we could optimize it reasonably enough ...
> > 
> >>
> >> From what I saw in my testing, if we allocate 4k pages
> >> it's hidden by the overhead. With hp and thp it's measureably
> >> worse than rcu on !preempt config.
> > 
> > ... for example, by doing the rcu read lock + unlock around the
> > 
> > 	for (i = 1; i < (1 << order); i++) {
> > 
> > loop on the alloc path. But I suspect it's not going to make that much of a
> > difference.
> > 
> > I concluded, similar to Andi, that stop_machine() is too big of a hammer.
> > 
> > I wonder if something could be built out of preempt_disable() and sync SMP
> > calls. hmm :(
> 
> Scrap that, shouldn't work I think ...
> 

Wait a sec, what about call_rcu_tasks? Use that and re-check the bit is
still set?

-- 
MST


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-29 23:34             ` Michael S. Tsirkin
@ 2026-06-30  6:17               ` David Hildenbrand (Arm)
  2026-06-30  6:27                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 29+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-30  6:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
	Oscar Salvador, Andi Kleen, Hidehiro Kawai, Rik van Riel,
	Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl

On 6/30/26 01:34, Michael S. Tsirkin wrote:
> On Mon, Jun 29, 2026 at 11:43:32PM +0200, David Hildenbrand (Arm) wrote:
>> On 6/29/26 23:22, David Hildenbrand (Arm) wrote:
>>> [...]
>>>
>>>
>>> Fully agreed. I was hoping RCU was cheaper (I mean, we were once told that RCU
>>> read side locking is essentially for free ... well in some configs :) )
>>>
>>> The question if we could optimize it reasonably enough ...
>>>
>>>
>>> ... for example, by doing the rcu read lock + unlock around the
>>>
>>> 	for (i = 1; i < (1 << order); i++) {
>>>
>>> loop on the alloc path. But I suspect it's not going to make that much of a
>>> difference.
>>>
>>> I concluded, similar to Andi, that stop_machine() is too big of a hammer.
>>>
>>> I wonder if something could be built out of preempt_disable() and sync SMP
>>> calls. hmm :(
>>
>> Scrap that, shouldn't work I think ...
>>
> 
> Wait a sec, what about call_rcu_tasks? Use that and re-check the bit is
> still set?

So, in essence the idea I had yestarday when it was late was the following:

Assume we

1) Can have a way to guarantee that a function on a CPU cannot execute within
our critical section (while updating the flags)

2) We can request to execute a function on each CPU and wait for completion

I think we could just let each CPU execute our desired action (e.g., try setting
the bit).

E.g.,

	local_irq_save(flags);
	page->flags &= whatever;
	local_irq_restore(flags);

And assume we want to set the bit, do a

	SetPageHWPoison(page);
	smp_call_function(set_hwpoison_smp_sync, page, 1);

whereby

	static void set_hwpoison_smp_sync(void *info)
	{
		SetPageHWPoison(page);
	}


The idea is (that needs double checking) that a CPU will execute the
SetPageHWPoison() either before the local_irq_save() or after the
local_irq_restore(). So it's own non-atomic update cannot get interrupted.

Now, IIUC when it comes to "how expensive is this" I think we have (cheap to
expensive):

1) preempt_disable()
2) rcu_read_lock()
3) local_irq_save()


So the above wouldn't be better than an rcu-based approach we have right now.
We'd need something that relies on disabled preemption only.

Huh, but I read that "anything that disables preemption also marks an RCU-sched
read-side critical section including preempt_disable() and preempt_enable()".

So for our use case we should be able to use preempt_disable() instead of
local_irq_save(). That should already work for your existing implementation.

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-30  6:17               ` David Hildenbrand (Arm)
@ 2026-06-30  6:27                 ` Michael S. Tsirkin
  2026-06-30  6:34                   ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 29+ messages in thread
From: Michael S. Tsirkin @ 2026-06-30  6:27 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: linux-kernel, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
	Oscar Salvador, Andi Kleen, Hidehiro Kawai, Rik van Riel,
	Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl

On Tue, Jun 30, 2026 at 08:17:42AM +0200, David Hildenbrand (Arm) wrote:
> On 6/30/26 01:34, Michael S. Tsirkin wrote:
> > On Mon, Jun 29, 2026 at 11:43:32PM +0200, David Hildenbrand (Arm) wrote:
> >> On 6/29/26 23:22, David Hildenbrand (Arm) wrote:
> >>> [...]
> >>>
> >>>
> >>> Fully agreed. I was hoping RCU was cheaper (I mean, we were once told that RCU
> >>> read side locking is essentially for free ... well in some configs :) )
> >>>
> >>> The question if we could optimize it reasonably enough ...
> >>>
> >>>
> >>> ... for example, by doing the rcu read lock + unlock around the
> >>>
> >>> 	for (i = 1; i < (1 << order); i++) {
> >>>
> >>> loop on the alloc path. But I suspect it's not going to make that much of a
> >>> difference.
> >>>
> >>> I concluded, similar to Andi, that stop_machine() is too big of a hammer.
> >>>
> >>> I wonder if something could be built out of preempt_disable() and sync SMP
> >>> calls. hmm :(
> >>
> >> Scrap that, shouldn't work I think ...
> >>
> > 
> > Wait a sec, what about call_rcu_tasks? Use that and re-check the bit is
> > still set?
> 
> So, in essence the idea I had yestarday when it was late was the following:
> 
> Assume we
> 
> 1) Can have a way to guarantee that a function on a CPU cannot execute within
> our critical section (while updating the flags)
> 
> 2) We can request to execute a function on each CPU and wait for completion
> 
> I think we could just let each CPU execute our desired action (e.g., try setting
> the bit).
> 
> E.g.,
> 
> 	local_irq_save(flags);
> 	page->flags &= whatever;
> 	local_irq_restore(flags);
> 
> And assume we want to set the bit, do a
> 
> 	SetPageHWPoison(page);
> 	smp_call_function(set_hwpoison_smp_sync, page, 1);
> 
> whereby
> 
> 	static void set_hwpoison_smp_sync(void *info)
> 	{
> 		SetPageHWPoison(page);
> 	}
> 
> 
> The idea is (that needs double checking) that a CPU will execute the
> SetPageHWPoison() either before the local_irq_save() or after the
> local_irq_restore(). So it's own non-atomic update cannot get interrupted.
> 
> Now, IIUC when it comes to "how expensive is this" I think we have (cheap to
> expensive):
> 
> 1) preempt_disable()
> 2) rcu_read_lock()
> 3) local_irq_save()
> 
> 
> So the above wouldn't be better than an rcu-based approach we have right now.
> We'd need something that relies on disabled preemption only.
> 
> Huh, but I read that "anything that disables preemption also marks an RCU-sched
> read-side critical section including preempt_disable() and preempt_enable()".
> 
> So for our use case we should be able to use preempt_disable() instead of
> local_irq_save(). That should already work for your existing implementation.
> 
> -- 
> Cheers,
> 
> David

We have:

#else /* #ifdef CONFIG_PREEMPT_RCU */


static inline void __rcu_read_lock(void)
{
        preempt_disable();
}

...


static __always_inline void rcu_read_lock(void)
        __acquires_shared(RCU)
{
        __rcu_read_lock();
        __acquire_shared(RCU);
        rcu_lock_acquire(&rcu_lock_map);
        RCU_LOCKDEP_WARN(!rcu_is_watching(),
                         "rcu_read_lock() used illegally while idle");
}



So on non-debug build witout CONFIG_PREEMPT_RCU (what I tested), rcu_lock
is exactly same as preempt_disable.  It's relatively cheap but not free.


preempt_disable is not going to be cheaper.

I can test if you want but it seems clear.


But IIUC task rcu might be cheaper - IIUC it does not need rcu
lock/unlock at all, it relies on readers to invoke the scheduler
instead.
No?

-- 
MST


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-30  6:27                 ` Michael S. Tsirkin
@ 2026-06-30  6:34                   ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-30  6:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
	Oscar Salvador, Andi Kleen, Hidehiro Kawai, Rik van Riel,
	Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl

On 6/30/26 08:27, Michael S. Tsirkin wrote:
> On Tue, Jun 30, 2026 at 08:17:42AM +0200, David Hildenbrand (Arm) wrote:
>> On 6/30/26 01:34, Michael S. Tsirkin wrote:
>>>
>>> Wait a sec, what about call_rcu_tasks? Use that and re-check the bit is
>>> still set?
>>
>> So, in essence the idea I had yestarday when it was late was the following:
>>
>> Assume we
>>
>> 1) Can have a way to guarantee that a function on a CPU cannot execute within
>> our critical section (while updating the flags)
>>
>> 2) We can request to execute a function on each CPU and wait for completion
>>
>> I think we could just let each CPU execute our desired action (e.g., try setting
>> the bit).
>>
>> E.g.,
>>
>> 	local_irq_save(flags);
>> 	page->flags &= whatever;
>> 	local_irq_restore(flags);
>>
>> And assume we want to set the bit, do a
>>
>> 	SetPageHWPoison(page);
>> 	smp_call_function(set_hwpoison_smp_sync, page, 1);
>>
>> whereby
>>
>> 	static void set_hwpoison_smp_sync(void *info)
>> 	{
>> 		SetPageHWPoison(page);
>> 	}
>>
>>
>> The idea is (that needs double checking) that a CPU will execute the
>> SetPageHWPoison() either before the local_irq_save() or after the
>> local_irq_restore(). So it's own non-atomic update cannot get interrupted.
>>
>> Now, IIUC when it comes to "how expensive is this" I think we have (cheap to
>> expensive):
>>
>> 1) preempt_disable()
>> 2) rcu_read_lock()
>> 3) local_irq_save()
>>
>>
>> So the above wouldn't be better than an rcu-based approach we have right now.
>> We'd need something that relies on disabled preemption only.
>>
>> Huh, but I read that "anything that disables preemption also marks an RCU-sched
>> read-side critical section including preempt_disable() and preempt_enable()".
>>
>> So for our use case we should be able to use preempt_disable() instead of
>> local_irq_save(). That should already work for your existing implementation.
>>
>> -- 
>> Cheers,
>>
>> David
> 
> We have:
> 
> #else /* #ifdef CONFIG_PREEMPT_RCU */
> 
> 
> static inline void __rcu_read_lock(void)
> {
>         preempt_disable();
> }
> 
> ...
> 
> 
> static __always_inline void rcu_read_lock(void)
>         __acquires_shared(RCU)
> {
>         __rcu_read_lock();
>         __acquire_shared(RCU);
>         rcu_lock_acquire(&rcu_lock_map);
>         RCU_LOCKDEP_WARN(!rcu_is_watching(),
>                          "rcu_read_lock() used illegally while idle");
> }
> 
> 
> 
> So on non-debug build witout CONFIG_PREEMPT_RCU (what I tested), rcu_lock
> is exactly same as preempt_disable.  It's relatively cheap but not free.
> 
> 
> preempt_disable is not going to be cheaper.

Well, it will be cheaper in the general case (CONFIG_PREEMPT_RCU) :)

But yes, not for this case.

> 
> I can test if you want but it seems clear.
> 

If you measured only !CONFIG_PREEMPT_RCU, then yes, it won't change a thing for
that scenario.

> 
> But IIUC task rcu might be cheaper - IIUC it does not need rcu
> lock/unlock at all, it relies on readers to invoke the scheduler
> instead.
> No?

I thought that still requires protection of sorts (preempt_disable /
rcu_read_lock), because it might fire whenever the task is preempted?

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-29 21:22         ` David Hildenbrand (Arm)
  2026-06-29 21:43           ` David Hildenbrand (Arm)
@ 2026-06-29 21:50           ` Michael S. Tsirkin
  2026-06-30  6:30             ` David Hildenbrand (Arm)
  2026-06-29 21:54           ` Michael S. Tsirkin
  2 siblings, 1 reply; 29+ messages in thread
From: Michael S. Tsirkin @ 2026-06-29 21:50 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: linux-kernel, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
	Oscar Salvador, Andi Kleen, Hidehiro Kawai, Rik van Riel,
	Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl

On Mon, Jun 29, 2026 at 11:22:11PM +0200, David Hildenbrand (Arm) wrote:
> [...]
> 
> > 
> > And again, I'm really not sure fixing a theoretical race when memory
> > is failing is worth slowing the world by 0.1-1% for.
> > 
> 
> Fully agreed. I was hoping RCU was cheaper (I mean, we were once told that RCU
> read side locking is essentially for free ... well in some configs :) )
> 
> The question if we could optimize it reasonably enough ...
> 
> > 
> > From what I saw in my testing, if we allocate 4k pages
> > it's hidden by the overhead. With hp and thp it's measureably
> > worse than rcu on !preempt config.
> 
> ... for example, by doing the rcu read lock + unlock around the
> 
> 	for (i = 1; i < (1 << order); i++) {
> 
> loop on the alloc path.

Is this different from what this patch is doing?

        if (unlikely(order)) {
                int i;

                hwpoison_rcu_lock();
                if (compound) {
                        page[1].flags.f &= ~PAGE_FLAGS_SECOND;
  #ifdef NR_PAGES_IN_LARGE_FOLIO
                        folio->_nr_pages = 0;
  #endif
                }
                for (i = 1; i < (1 << order); i++) {
                        if (compound)
                                bad += free_tail_page_prepare(page, page + i);
                        if (is_check_pages_enabled()) {
                                if (free_page_is_bad(page + i)) {
                                        bad++;
                                        continue;
                                }
                        }
                        (page + i)->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
                }
                hwpoison_rcu_unlock();
        }
  


> But I suspect it's not going to make that much of a
> difference.

It makes a difference! Just not enough to make it completely unmeasureable.


> I concluded, similar to Andi, that stop_machine() is too big of a hammer.
> 
> I wonder if something could be built out of preempt_disable() and sync SMP
> calls. hmm :(

rcu_lock is basically same as preempt_disable if rcu is non preemptible,
no?


> -- 
> Cheers,
> 
> David


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-29 21:50           ` Michael S. Tsirkin
@ 2026-06-30  6:30             ` David Hildenbrand (Arm)
  2026-06-30  6:41               ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 29+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-30  6:30 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
	Oscar Salvador, Andi Kleen, Hidehiro Kawai, Rik van Riel,
	Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl

On 6/29/26 23:50, Michael S. Tsirkin wrote:
> On Mon, Jun 29, 2026 at 11:22:11PM +0200, David Hildenbrand (Arm) wrote:
>> [...]
>>
>>>
>>> And again, I'm really not sure fixing a theoretical race when memory
>>> is failing is worth slowing the world by 0.1-1% for.
>>>
>>
>> Fully agreed. I was hoping RCU was cheaper (I mean, we were once told that RCU
>> read side locking is essentially for free ... well in some configs :) )
>>
>> The question if we could optimize it reasonably enough ...
>>
>>>
>>> From what I saw in my testing, if we allocate 4k pages
>>> it's hidden by the overhead. With hp and thp it's measureably
>>> worse than rcu on !preempt config.
>>
>> ... for example, by doing the rcu read lock + unlock around the
>>
>> 	for (i = 1; i < (1 << order); i++) {
>>
>> loop on the alloc path.
> 
> Is this different from what this patch is doing?

Ah, I missed that we batch this already. We could make it include the

page_cpupid_reset_last(page);
page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;

As well, to reduce from 3 to 1 locks.

So I guess there is potential for optimization.

[...]

> 
>> I concluded, similar to Andi, that stop_machine() is too big of a hammer.
>>
>> I wonder if something could be built out of preempt_disable() and sync SMP
>> calls. hmm :(
> 
> rcu_lock is basically same as preempt_disable if rcu is non preemptible,
> no?

Yes. See my other mail, I learned that preempt_disable() should likely just do
for our use case. So the preemptible RCU case would not matter.

I assume that's as good as it gets.

1) Use preempt_disable/preempt_enable to protect
2) Batch as good as possible in the page allocator

If the overhead is then still noticeable, there is not a lot we can do to handle
this cleanly I'm afraid.

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-30  6:30             ` David Hildenbrand (Arm)
@ 2026-06-30  6:41               ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-30  6:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
	Oscar Salvador, Andi Kleen, Hidehiro Kawai, Rik van Riel,
	Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl

On 6/30/26 08:30, David Hildenbrand (Arm) wrote:
> On 6/29/26 23:50, Michael S. Tsirkin wrote:
>> On Mon, Jun 29, 2026 at 11:22:11PM +0200, David Hildenbrand (Arm) wrote:
>>> [...]
>>>
>>>
>>> Fully agreed. I was hoping RCU was cheaper (I mean, we were once told that RCU
>>> read side locking is essentially for free ... well in some configs :) )
>>>
>>> The question if we could optimize it reasonably enough ...
>>>
>>>
>>> ... for example, by doing the rcu read lock + unlock around the
>>>
>>> 	for (i = 1; i < (1 << order); i++) {
>>>
>>> loop on the alloc path.
>>
>> Is this different from what this patch is doing?
> 
> Ah, I missed that we batch this already. We could make it include the
> 
> page_cpupid_reset_last(page);
> page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> 
> As well, to reduce from 3 to 1 locks.
> 
> So I guess there is potential for optimization.
> 
> [...]
> 
>>
>>> I concluded, similar to Andi, that stop_machine() is too big of a hammer.
>>>
>>> I wonder if something could be built out of preempt_disable() and sync SMP
>>> calls. hmm :(
>>
>> rcu_lock is basically same as preempt_disable if rcu is non preemptible,
>> no?
> 
> Yes. See my other mail, I learned that preempt_disable() should likely just do
> for our use case. So the preemptible RCU case would not matter.
> 
> I assume that's as good as it gets.
> 
> 1) Use preempt_disable/preempt_enable to protect
> 2) Batch as good as possible in the page allocator
> 
> If the overhead is then still noticeable, there is not a lot we can do to handle
> this cleanly I'm afraid.

FWIW, there seems to be preempt_enable_no_resched() that is a bit more
lightweight. I'm sure there is some downside, but it might be worth a try.

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-29 21:22         ` David Hildenbrand (Arm)
  2026-06-29 21:43           ` David Hildenbrand (Arm)
  2026-06-29 21:50           ` Michael S. Tsirkin
@ 2026-06-29 21:54           ` Michael S. Tsirkin
  2 siblings, 0 replies; 29+ messages in thread
From: Michael S. Tsirkin @ 2026-06-29 21:54 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: linux-kernel, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
	Oscar Salvador, Andi Kleen, Hidehiro Kawai, Rik van Riel,
	Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl

On Mon, Jun 29, 2026 at 11:22:11PM +0200, David Hildenbrand (Arm) wrote:
> [...]
> 
> > 
> > And again, I'm really not sure fixing a theoretical race when memory
> > is failing is worth slowing the world by 0.1-1% for.
> > 
> 
> Fully agreed.

Let's say, we just send every bad page through MF machinery X times,
and hope the chances of a race are reduced enough not to matter.
WDYT?


> I was hoping RCU was cheaper (I mean, we were once told that RCU
> read side locking is essentially for free ... well in some configs :) )

It seems to be, just not to the level where you stick this around
each memory allocation.


-- 
MST


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
  2026-06-29 13:05     ` David Hildenbrand (Arm)
  2026-06-29 20:08       ` Michael S. Tsirkin
@ 2026-06-29 23:29       ` Michael S. Tsirkin
  1 sibling, 0 replies; 29+ messages in thread
From: Michael S. Tsirkin @ 2026-06-29 23:29 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: linux-kernel, Miaohe Lin, Naoya Horiguchi, Andrew Morton,
	Oscar Salvador, Andi Kleen, Hidehiro Kawai, Rik van Riel,
	Vlastimil Babka, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Hao Li,
	Kiryl Shutsemau, Byungchul Park, linux-mm, linux-cxl

On Mon, Jun 29, 2026 at 03:05:18PM +0200, David Hildenbrand (Arm) wrote:
> It would be good to coordinate here.

Case in point, do you feel this should be blocking the zeroing/page
reporting work (which adds more non atomic RMWs), or I can work more on
that putting this aside?

-- 
MST


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2026-06-30  6:41 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-28 21:45 [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops Michael S. Tsirkin
2026-06-28 21:45 ` [PATCH 1/2] mm: memory-failure: use RCU to fix HWPoison flag race Michael S. Tsirkin
2026-06-28 21:45 ` [PATCH 2/2] mm: wrap non-atomic page flag ops in RCU for HWPoison safety Michael S. Tsirkin
2026-06-29  2:11 ` [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops Andi Kleen
2026-06-29  8:10   ` Michael S. Tsirkin
2026-06-29  8:21     ` David Hildenbrand (Arm)
2026-06-29  8:39     ` Michael S. Tsirkin
2026-06-29 16:54       ` Andi Kleen
2026-06-29 17:04         ` David Hildenbrand (Arm)
2026-06-29 20:43           ` Michael S. Tsirkin
2026-06-29  6:49 ` David Hildenbrand (Arm)
2026-06-29  7:34   ` Michael S. Tsirkin
2026-06-29 13:05     ` David Hildenbrand (Arm)
2026-06-29 20:08       ` Michael S. Tsirkin
2026-06-29 20:55         ` Andi Kleen
2026-06-29 21:17           ` Michael S. Tsirkin
2026-06-29 21:39             ` Andi Kleen
2026-06-29 21:59               ` Michael S. Tsirkin
2026-06-29 21:22         ` David Hildenbrand (Arm)
2026-06-29 21:43           ` David Hildenbrand (Arm)
2026-06-29 23:34             ` Michael S. Tsirkin
2026-06-30  6:17               ` David Hildenbrand (Arm)
2026-06-30  6:27                 ` Michael S. Tsirkin
2026-06-30  6:34                   ` David Hildenbrand (Arm)
2026-06-29 21:50           ` Michael S. Tsirkin
2026-06-30  6:30             ` David Hildenbrand (Arm)
2026-06-30  6:41               ` David Hildenbrand (Arm)
2026-06-29 21:54           ` Michael S. Tsirkin
2026-06-29 23:29       ` Michael S. Tsirkin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox