All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Michael S. Tsirkin" <mst@redhat.com>
To: Andi Kleen <andi@firstfloor.org>
Cc: linux-kernel@vger.kernel.org,
	David Hildenbrand <david@kernel.org>,
	Miaohe Lin <linmiaohe@huawei.com>,
	Naoya Horiguchi <nao.horiguchi@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Oscar Salvador <osalvador@suse.de>,
	Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>,
	Rik van Riel <riel@redhat.com>,
	Vlastimil Babka <vbabka@kernel.org>,
	Lorenzo Stoakes <ljs@kernel.org>,
	"Liam R. Howlett" <liam@infradead.org>,
	Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	Brendan Jackman <jackmanb@google.com>,
	Johannes Weiner <hannes@cmpxchg.org>, Zi Yan <ziy@nvidia.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Nico Pache <npache@redhat.com>,
	Ryan Roberts <ryan.roberts@arm.com>, Dev Jain <dev.jain@arm.com>,
	Barry Song <baohua@kernel.org>, Lance Yang <lance.yang@linux.dev>,
	Christoph Lameter <cl@gentwo.org>,
	David Rientjes <rientjes@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Harry Yoo <harry@kernel.org>, Hao Li <hao.li@linux.dev>,
	Kiryl Shutsemau <kas@kernel.org>,
	Byungchul Park <byungchul@sk.com>,
	linux-mm@kvack.org, linux-cxl@vger.kernel.org,
	David Hildenbrand <david@redhat.com>
Subject: Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops
Date: Mon, 29 Jun 2026 04:39:53 -0400	[thread overview]
Message-ID: <20260629043417-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <20260629033608-mutt-send-email-mst@kernel.org>

On Mon, Jun 29, 2026 at 04:10:33AM -0400, Michael S. Tsirkin wrote:
> On Sun, Jun 28, 2026 at 07:11:58PM -0700, Andi Kleen wrote:
> > On Sun, Jun 28, 2026 at 05:45:22PM -0400, Michael S. Tsirkin wrote:
> > > This series fixes the race by:
> > > 
> > > 1. Having memory_failure() call synchronize_rcu() + retry after
> > >    setting HWPoison, so that any in-flight non-atomic RMW that
> > >    read the old flags value completes before we proceed.
> > >
> > > 2. Wrapping all non-atomic page flag operations in
> > >    rcu_read_lock/rcu_read_unlock (CONFIG_MEMORY_FAILURE only),
> > >    so that synchronize_rcu() actually drains them.
> > 
> > It wouldn't surprise me if your underlying performance assumptions 
> > -- an non contended atomic is cheaper than a rcu_read_lock/unlock --
> > are not true in various CPU/kernel configuration combinations.
> > 
> > Modern CPUs have fast atomics when uncontended in normal circumstances.
> > But it probably doesn't matter much either way because the difference
> > shouldn't be very much.
> 
> 
> Hmm. It's a bit silly that I didn't try. Seemed clear to me, but,
> on this old xeon...
> 
>                                    insns/iter    cycles/iter
>      -------------------------------------------------------
>        base                       12238 +/- 1.0    17889 +/- 97.9
>        rcu_read_lock              12251 +/- 7.3    17991 +/-191.6
>        atomic ops                 12233 +/- 1.9    17733 +/-136.5
> 
> 
> The diff in the noise.
> 
> And old, slow CPUs maybe don't have MF at all.
> 
> So maybe just atomics instead of all this mess.


However, this was a basic test, when allocating 4k pages. With 2M hugepages:

                                   insns/iter    cycles/iter
     -------------------------------------------------------
       base                       20758 +/- 12.5   191208 +/-1946.6
       rcu                        20785 +/-  3.7   197108 +/- 132.1
       atomic                     20727 +/-  6.4   204591 +/- 160.2

       rcu vs base                  +27 (+0.13%)     +5900 (+3.09%)
       atomic vs base               -31 (-0.15%)    +13383 (+7.00%)

and even with THP:

                                   insns/iter    cycles/iter
     -------------------------------------------------------
       base                       27220 +/-  2.8   192151 +/- 483.3
       rcu                        27248 +/- 30.1   194159 +/-2746.6
       atomic                     27186 +/-  3.2   200526 +/- 746.2

       rcu vs base                  +28 (+0.10%)     +2008 (+1.04%)
       atomic vs base               -34 (-0.12%)     +8374 (+4.36%)


needs more thought.



> 
> 
> 
> >  It seems very complicated for something that
> > could be much simpler.
> > 
> > But I guess it's fine.
> > 
> > -Andi
> 
> Indeed. David already said he's gonnu look at this himself, but here's what I
> tested, maybe helpful for whoever wants to try this approach:
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 7223f6f4e2b4..8f0940cf2f29 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -404,6 +404,44 @@ static unsigned long *folio_flags(struct folio *folio, unsigned n)
>  #define FOLIO_HEAD_PAGE		0
>  #define FOLIO_SECOND_PAGE	1
>  
> +/*
> + * When CONFIG_MEMORY_FAILURE is enabled, non-atomic page flag operations
> + * must be atomic to prevent clobbering concurrent TestSetPageHWPoison()
> + * in memory_failure().  Otherwise, use the cheaper non-atomic versions.
> + */
> +#ifdef CONFIG_MEMORY_FAILURE
> +#define __pf_set_bit		set_bit
> +#define __pf_clear_bit		clear_bit
> +
> +static __always_inline void page_flags_clear(unsigned long *flags,
> +					     unsigned long mask)
> +{
> +	atomic_long_andnot(mask, (atomic_long_t *)flags);
> +}
> +
> +static __always_inline void page_flags_set(unsigned long *flags,
> +					   unsigned long mask)
> +{
> +	atomic_long_or(mask, (atomic_long_t *)flags);
> +}
> +
> +#else /* !CONFIG_MEMORY_FAILURE */
> +#define __pf_set_bit		__set_bit
> +#define __pf_clear_bit		__clear_bit
> +
> +static __always_inline void page_flags_clear(unsigned long *flags,
> +					     unsigned long mask)
> +{
> +	*flags &= ~mask;
> +}
> +
> +static __always_inline void page_flags_set(unsigned long *flags,
> +					   unsigned long mask)
> +{
> +	*flags |= mask;
> +}
> +#endif /* CONFIG_MEMORY_FAILURE */
> +
>  /*
>   * Macros to create function definitions for page flags
>   */
> @@ -421,11 +459,11 @@ static __always_inline void folio_clear_##name(struct folio *folio)	\
>  
>  #define __FOLIO_SET_FLAG(name, page)					\
>  static __always_inline void __folio_set_##name(struct folio *folio)	\
> -{ __set_bit(PG_##name, folio_flags(folio, page)); }
> +{ __pf_set_bit(PG_##name, folio_flags(folio, page)); }
>  
>  #define __FOLIO_CLEAR_FLAG(name, page)					\
>  static __always_inline void __folio_clear_##name(struct folio *folio)	\
> -{ __clear_bit(PG_##name, folio_flags(folio, page)); }
> +{ __pf_clear_bit(PG_##name, folio_flags(folio, page)); }
>  
>  #define FOLIO_TEST_SET_FLAG(name, page)					\
>  static __always_inline bool folio_test_set_##name(struct folio *folio)	\
> @@ -458,12 +496,12 @@ static __always_inline void ClearPage##uname(struct page *page)		\
>  #define __SETPAGEFLAG(uname, lname, policy)				\
>  __FOLIO_SET_FLAG(lname, FOLIO_##policy)					\
>  static __always_inline void __SetPage##uname(struct page *page)		\
> -{ __set_bit(PG_##lname, &policy(page, 1)->flags.f); }
> +{ __pf_set_bit(PG_##lname, &policy(page, 1)->flags.f); }
>  
>  #define __CLEARPAGEFLAG(uname, lname, policy)				\
>  __FOLIO_CLEAR_FLAG(lname, FOLIO_##policy)				\
>  static __always_inline void __ClearPage##uname(struct page *page)	\
> -{ __clear_bit(PG_##lname, &policy(page, 1)->flags.f); }
> +{ __pf_clear_bit(PG_##lname, &policy(page, 1)->flags.f); }
>  
>  #define TESTSETFLAG(uname, lname, policy)				\
>  FOLIO_TEST_SET_FLAG(lname, FOLIO_##policy)				\
> @@ -806,7 +844,7 @@ static inline bool PageUptodate(const struct page *page)
>  static __always_inline void __folio_mark_uptodate(struct folio *folio)
>  {
>  	smp_wmb();
> -	__set_bit(PG_uptodate, folio_flags(folio, 0));
> +	__pf_set_bit(PG_uptodate, folio_flags(folio, 0));
>  }
>  
>  static __always_inline void folio_mark_uptodate(struct folio *folio)
> @@ -1162,14 +1200,14 @@ static __always_inline void ClearPageAnonExclusive(struct page *page)
>  {
>  	VM_BUG_ON_PGFLAGS(!PageAnonNotKsm(page), page);
>  	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
> -	clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
> +	__pf_clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
>  }
>  
>  static __always_inline void __ClearPageAnonExclusive(struct page *page)
>  {
>  	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
>  	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
> -	__clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
> +	__pf_clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
>  }
>  
>  #ifdef CONFIG_MMU
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 06bbe9eba636..931dfc84319f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2343,7 +2343,7 @@ int folio_xchg_last_cpupid(struct folio *folio, int cpupid);
>  
>  static inline void page_cpupid_reset_last(struct page *page)
>  {
> -	page->flags.f |= LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT;
> +	page_flags_set(&page->flags.f, LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT);
>  }
>  #endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */
>  
> @@ -2503,8 +2503,8 @@ static inline struct zone *folio_zone(const struct folio *folio)
>  #ifdef SECTION_IN_PAGE_FLAGS
>  static inline void set_page_section(struct page *page, unsigned long section)
>  {
> -	page->flags.f &= ~(SECTIONS_MASK << SECTIONS_PGSHIFT);
> -	page->flags.f |= (section & SECTIONS_MASK) << SECTIONS_PGSHIFT;
> +	page_flags_clear(&page->flags.f, SECTIONS_MASK << SECTIONS_PGSHIFT);
> +	page_flags_set(&page->flags.f, (section & SECTIONS_MASK) << SECTIONS_PGSHIFT);
>  }
>  
>  static inline unsigned long memdesc_section(memdesc_flags_t mdf)
> @@ -2719,14 +2719,14 @@ static inline bool folio_is_longterm_pinnable(struct folio *folio)
>  
>  static inline void set_page_zone(struct page *page, enum zone_type zone)
>  {
> -	page->flags.f &= ~(ZONES_MASK << ZONES_PGSHIFT);
> -	page->flags.f |= (zone & ZONES_MASK) << ZONES_PGSHIFT;
> +	page_flags_clear(&page->flags.f, ZONES_MASK << ZONES_PGSHIFT);
> +	page_flags_set(&page->flags.f, (zone & ZONES_MASK) << ZONES_PGSHIFT);
>  }
>  
>  static inline void set_page_node(struct page *page, unsigned long node)
>  {
> -	page->flags.f &= ~(NODES_MASK << NODES_PGSHIFT);
> -	page->flags.f |= (node & NODES_MASK) << NODES_PGSHIFT;
> +	page_flags_clear(&page->flags.f, NODES_MASK << NODES_PGSHIFT);
> +	page_flags_set(&page->flags.f, (node & NODES_MASK) << NODES_PGSHIFT);
>  }
>  
>  static inline void set_page_links(struct page *page, enum zone_type zone,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 970e077019b7..100226f59490 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3624,8 +3624,8 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>  		 * unreferenced sub-pages of an anonymous THP: we can simply drop
>  		 * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
>  		 */
> -		new_folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> -		new_folio->flags.f |= (folio->flags.f &
> +		page_flags_clear(&new_folio->flags.f, PAGE_FLAGS_CHECK_AT_PREP);
> +		page_flags_set(&new_folio->flags.f, folio->flags.f &
>  				((1L << PG_referenced) |
>  				 (1L << PG_swapbacked) |
>  				 (1L << PG_swapcache) |
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 053842d45cb1..194ca2f04a87 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -494,7 +494,7 @@ void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
>  		 * blindly clear bits which could have set my order field here,
>  		 * including page head.
>  		 */
> -		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
> +		page_flags_clear(&new_page->flags.f, 0xffUL);	/* Clear possible order, page head */
>  
>  #ifdef NR_PAGES_IN_LARGE_FOLIO
>  		/*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d49c254174da..a3447124a725 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1359,7 +1359,7 @@ __always_inline bool __free_pages_prepare(struct page *page,
>  		int i;
>  
>  		if (compound) {
> -			page[1].flags.f &= ~PAGE_FLAGS_SECOND;
> +			page_flags_clear(&page[1].flags.f, PAGE_FLAGS_SECOND);
>  #ifdef NR_PAGES_IN_LARGE_FOLIO
>  			folio->_nr_pages = 0;
>  #endif
> @@ -1373,7 +1373,7 @@ __always_inline bool __free_pages_prepare(struct page *page,
>  					continue;
>  				}
>  			}
> -			(page + i)->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> +			page_flags_clear(&(page + i)->flags.f, PAGE_FLAGS_CHECK_AT_PREP);
>  		}
>  	}
>  	if (folio_test_anon(folio)) {
> @@ -1392,7 +1392,7 @@ __always_inline bool __free_pages_prepare(struct page *page,
>  	}
>  
>  	page_cpupid_reset_last(page);
> -	page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> +	page_flags_clear(&page->flags.f, PAGE_FLAGS_CHECK_AT_PREP);
>  	page->private = 0;
>  	reset_page_owner(page, order);
>  	page_table_check_free(page, order);
> diff --git a/mm/slub.c b/mm/slub.c
> index a2bf3756ca7d..a55199f642d3 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -617,7 +617,7 @@ static inline void slab_set_pfmemalloc(struct slab *slab)
>  
>  static inline void __slab_clear_pfmemalloc(struct slab *slab)
>  {
> -	__clear_bit(SL_pfmemalloc, &slab->flags.f);
> +	__pf_clear_bit(SL_pfmemalloc, &slab->flags.f);
>  }
>  
>  /*


  parent reply	other threads:[~2026-06-29  8:40 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-28 21:45 [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops Michael S. Tsirkin
2026-06-28 21:45 ` [PATCH 1/2] mm: memory-failure: use RCU to fix HWPoison flag race Michael S. Tsirkin
2026-06-28 21:45 ` [PATCH 2/2] mm: wrap non-atomic page flag ops in RCU for HWPoison safety Michael S. Tsirkin
2026-06-29  2:11 ` [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops Andi Kleen
2026-06-29  8:10   ` Michael S. Tsirkin
2026-06-29  8:21     ` David Hildenbrand (Arm)
2026-06-29  8:39     ` Michael S. Tsirkin [this message]
2026-06-29  6:49 ` David Hildenbrand (Arm)
2026-06-29  7:34   ` Michael S. Tsirkin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260629043417-mutt-send-email-mst@kernel.org \
    --to=mst@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=andi@firstfloor.org \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=byungchul@sk.com \
    --cc=cl@gentwo.org \
    --cc=david@kernel.org \
    --cc=david@redhat.com \
    --cc=dev.jain@arm.com \
    --cc=hannes@cmpxchg.org \
    --cc=hao.li@linux.dev \
    --cc=harry@kernel.org \
    --cc=hidehiro.kawai.ez@hitachi.com \
    --cc=jackmanb@google.com \
    --cc=kas@kernel.org \
    --cc=lance.yang@linux.dev \
    --cc=liam@infradead.org \
    --cc=linmiaohe@huawei.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=mhocko@suse.com \
    --cc=nao.horiguchi@gmail.com \
    --cc=npache@redhat.com \
    --cc=osalvador@suse.de \
    --cc=riel@redhat.com \
    --cc=rientjes@google.com \
    --cc=roman.gushchin@linux.dev \
    --cc=rppt@kernel.org \
    --cc=ryan.roberts@arm.com \
    --cc=surenb@google.com \
    --cc=vbabka@kernel.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.