From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5C5E43ED5D6 for ; Mon, 29 Jun 2026 08:40:23 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1782722426; cv=none; b=napGccdjtiJ32VqYRQAwMQf4xpXiDxWABiZaZi+UFPEQ7rSm2lE22NZSA8l97wU2Dn43Yza3rNriFYRnYvszTkWoaJ5mFMBk1SmJaNYsMHQOTCpMOPA73OLRn9L+iqm+A3Ndp4TvyA6jZk0eChmNtJU1Qf6wk6ODzKPzW+JS/H8= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1782722426; c=relaxed/simple; bh=43YyMEcSqhqWmDsFCxxIcc2M7dYYIOwYAl0wcA7/h9s=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=p57xQyOhK9uU3h+vURGilBaT3+GvsJwCZyNRfLktfgcSupCff9qL/WQhCGef6zF1JuXBXKqvV2xKrB+xZiv3SPkUTriYXpifoHcBds8URq6AR1N76kxViVyIc7v81rM6qfo+zBBnYyLOLvKCOC06tLNvYfReEQu/Y8SK9Vqn52k= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=GlEm0PcU; dkim=pass (2048-bit key) header.d=redhat.com header.i=@redhat.com header.b=jBega8V2; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="GlEm0PcU"; dkim=pass (2048-bit key) header.d=redhat.com header.i=@redhat.com header.b="jBega8V2" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1782722422; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=apbP0/mAALs29zIL1Jivy2IR2lGWYH+nRz/CLVDLR20=; b=GlEm0PcUVdTyA4TByTFoEqkq8+UxHOkNsXqXQrCg04N7mer7BdeprPIfkVwMkWSatQKkm5 r9e8vRlJeTRNUikGols/yq/R8EDs+lk1T5qNHOHham/5UiFCrXYBaEDsmVQCKw3URT/x/+ JqzoZcRxVDQiEuCA9knENyEWEhxuuxY= Received: from mail-pj1-f70.google.com (mail-pj1-f70.google.com [209.85.216.70]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-78-T1brXsZDMJO-KoGlxjFESQ-1; Mon, 29 Jun 2026 04:40:20 -0400 X-MC-Unique: T1brXsZDMJO-KoGlxjFESQ-1 X-Mimecast-MFC-AGG-ID: T1brXsZDMJO-KoGlxjFESQ_1782722419 Received: by mail-pj1-f70.google.com with SMTP id 98e67ed59e1d1-37fe90ee192so1043903a91.3 for ; Mon, 29 Jun 2026 01:40:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=google; t=1782722419; x=1783327219; darn=vger.kernel.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=apbP0/mAALs29zIL1Jivy2IR2lGWYH+nRz/CLVDLR20=; b=jBega8V2V3qwcRBwS0T0oKR2PCc7rMBPSEHmDjLuqijWgJKPXlEFIycK7c/9R/iRY/ sNk8Uq1mEX/ELeE0FoxKn2VqeC8Ym0l9WZAQHPxLaqIQ75qTma1bsOnoIzLwsr4jSS+z r4qyxwFvbBm48Xe0zYZxDyIEahzNuN72FrPN1CuJrpJfb3hMdQXSGp+wYYblparxibrv ubwxFYVrHfKj6d47GItsB8KphiGkAPHDvM3NYjuukdfuFNwjdj9G7Nt7CkqQfngZR3rT +JdLVpbjdvKowuAJ3Unx4AZQFZyrf2+fwVxpjafnIJaR33Jhnz+xut85hVy8GonmaXwE T+hg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1782722419; x=1783327219; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-gg:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=apbP0/mAALs29zIL1Jivy2IR2lGWYH+nRz/CLVDLR20=; b=AvptT3koeH/5tAflVjhao3RYWcBuNXlykau4du9RxXBATRWRBkRazfBjqwSzJwhkKJ oqWx0zHzZ3vRTfvUZIRYMMsNplmnatpP5RQ2VXCxr+JPiG+URFBAWMkS24v9QcSf9K2h iW5NxlJzyD5I9IsF6JPOXWOj4e4xmIX/yOZ1I/0EqNyJj6YSXessLG9ab4/svTVlaZmL CuwckOgXJY6TmgTo5dj9k8cZ1M6jM5PYiZJG+zBM5cVQRF3n7m0lvqrmXVuPEny+lIcQ LgdHovH1xGLjdtDR1VzTfP9cECZnmvcQV6ssFJOs4XC9DZlFDSEM+4yWfsI8/EsBQSHZ Kb9w== X-Gm-Message-State: AOJu0YwWcLaL/XBxvMdGnF53WWSXw5wBHzsOCToHjv/p9rWh2+hfQZ6O vgmNmhYpnrIaAZvzaASm4zCXBVkpckG54QdyMLbjtrFyLfJAvzHsJKlw8BmchkNKXZD7QhVYf61 fIH/CC+nqfExGidjAfkjal5FJhf5FF1/Xe6jO7nlBapbBJUXxCjRZY9KPHLb1H6Y61w== X-Gm-Gg: AfdE7cmSxeVp4M+Nq1P1I1H1sbT6PbSkvka+0jTPuk8m4QsJljo5VFcfR7lhznshLNf 1MR+7wKMSdl9RaKE3TI11KUD58fviBOv89CUQnNiuUTpE67+sVZ8Tqjjg+Tu2dxr5lU7TC7FNzu uYhM7ohbfmpnoawH99viiB0qFD2GvhvTUKRdzOrvxayypxqVi/JbXuQek9hdVev9y2QVmX965q3 65O+d7eGoghJqcV3zpC5Z6BvTQU1Orsqp+uE/JHgtHJhAFHMBZ3d4hPJ7OR9/GrzlovkSUH4vFU RHsc1rAI3ZLwx9nLd6JX/8YtTOYyA/M5Ty9rof4NFFK6cooP6BqCJOEsdMp2Qvq0f/Tblir4EM5 qpNXJkEHpYYr1KMRzWubxDjaqGp2MhT/U X-Received: by 2002:a17:90b:1c10:b0:37f:c69d:ce69 with SMTP id 98e67ed59e1d1-37fc69dd07fmr6710639a91.10.1782722419088; Mon, 29 Jun 2026 01:40:19 -0700 (PDT) X-Received: by 2002:a17:90b:1c10:b0:37f:c69d:ce69 with SMTP id 98e67ed59e1d1-37fc69dd07fmr6710589a91.10.1782722418371; Mon, 29 Jun 2026 01:40:18 -0700 (PDT) Received: from redhat.com (IGLD-80-230-85-71.inter.net.il. [80.230.85.71]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-37fa9629f8fsm4643541a91.8.2026.06.29.01.39.58 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 29 Jun 2026 01:40:17 -0700 (PDT) Date: Mon, 29 Jun 2026 04:39:53 -0400 From: "Michael S. Tsirkin" To: Andi Kleen Cc: linux-kernel@vger.kernel.org, David Hildenbrand , Miaohe Lin , Naoya Horiguchi , Andrew Morton , Oscar Salvador , Hidehiro Kawai , Rik van Riel , Vlastimil Babka , Lorenzo Stoakes , "Liam R. Howlett" , Mike Rapoport , Suren Baghdasaryan , Michal Hocko , Brendan Jackman , Johannes Weiner , Zi Yan , Baolin Wang , Nico Pache , Ryan Roberts , Dev Jain , Barry Song , Lance Yang , Christoph Lameter , David Rientjes , Roman Gushchin , Harry Yoo , Hao Li , Kiryl Shutsemau , Byungchul Park , linux-mm@kvack.org, linux-cxl@vger.kernel.org, David Hildenbrand Subject: Re: [PATCH 0/2] mm: memory-failure: fix HWPoison flag race with non-atomic page flag ops Message-ID: <20260629043417-mutt-send-email-mst@kernel.org> References: <20260629033608-mutt-send-email-mst@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20260629033608-mutt-send-email-mst@kernel.org> On Mon, Jun 29, 2026 at 04:10:33AM -0400, Michael S. Tsirkin wrote: > On Sun, Jun 28, 2026 at 07:11:58PM -0700, Andi Kleen wrote: > > On Sun, Jun 28, 2026 at 05:45:22PM -0400, Michael S. Tsirkin wrote: > > > This series fixes the race by: > > > > > > 1. Having memory_failure() call synchronize_rcu() + retry after > > > setting HWPoison, so that any in-flight non-atomic RMW that > > > read the old flags value completes before we proceed. > > > > > > 2. Wrapping all non-atomic page flag operations in > > > rcu_read_lock/rcu_read_unlock (CONFIG_MEMORY_FAILURE only), > > > so that synchronize_rcu() actually drains them. > > > > It wouldn't surprise me if your underlying performance assumptions > > -- an non contended atomic is cheaper than a rcu_read_lock/unlock -- > > are not true in various CPU/kernel configuration combinations. > > > > Modern CPUs have fast atomics when uncontended in normal circumstances. > > But it probably doesn't matter much either way because the difference > > shouldn't be very much. > > > Hmm. It's a bit silly that I didn't try. Seemed clear to me, but, > on this old xeon... > > insns/iter cycles/iter > ------------------------------------------------------- > base 12238 +/- 1.0 17889 +/- 97.9 > rcu_read_lock 12251 +/- 7.3 17991 +/-191.6 > atomic ops 12233 +/- 1.9 17733 +/-136.5 > > > The diff in the noise. > > And old, slow CPUs maybe don't have MF at all. > > So maybe just atomics instead of all this mess. However, this was a basic test, when allocating 4k pages. With 2M hugepages: insns/iter cycles/iter ------------------------------------------------------- base 20758 +/- 12.5 191208 +/-1946.6 rcu 20785 +/- 3.7 197108 +/- 132.1 atomic 20727 +/- 6.4 204591 +/- 160.2 rcu vs base +27 (+0.13%) +5900 (+3.09%) atomic vs base -31 (-0.15%) +13383 (+7.00%) and even with THP: insns/iter cycles/iter ------------------------------------------------------- base 27220 +/- 2.8 192151 +/- 483.3 rcu 27248 +/- 30.1 194159 +/-2746.6 atomic 27186 +/- 3.2 200526 +/- 746.2 rcu vs base +28 (+0.10%) +2008 (+1.04%) atomic vs base -34 (-0.12%) +8374 (+4.36%) needs more thought. > > > > > It seems very complicated for something that > > could be much simpler. > > > > But I guess it's fine. > > > > -Andi > > Indeed. David already said he's gonnu look at this himself, but here's what I > tested, maybe helpful for whoever wants to try this approach: > > Signed-off-by: Michael S. Tsirkin > > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index 7223f6f4e2b4..8f0940cf2f29 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -404,6 +404,44 @@ static unsigned long *folio_flags(struct folio *folio, unsigned n) > #define FOLIO_HEAD_PAGE 0 > #define FOLIO_SECOND_PAGE 1 > > +/* > + * When CONFIG_MEMORY_FAILURE is enabled, non-atomic page flag operations > + * must be atomic to prevent clobbering concurrent TestSetPageHWPoison() > + * in memory_failure(). Otherwise, use the cheaper non-atomic versions. > + */ > +#ifdef CONFIG_MEMORY_FAILURE > +#define __pf_set_bit set_bit > +#define __pf_clear_bit clear_bit > + > +static __always_inline void page_flags_clear(unsigned long *flags, > + unsigned long mask) > +{ > + atomic_long_andnot(mask, (atomic_long_t *)flags); > +} > + > +static __always_inline void page_flags_set(unsigned long *flags, > + unsigned long mask) > +{ > + atomic_long_or(mask, (atomic_long_t *)flags); > +} > + > +#else /* !CONFIG_MEMORY_FAILURE */ > +#define __pf_set_bit __set_bit > +#define __pf_clear_bit __clear_bit > + > +static __always_inline void page_flags_clear(unsigned long *flags, > + unsigned long mask) > +{ > + *flags &= ~mask; > +} > + > +static __always_inline void page_flags_set(unsigned long *flags, > + unsigned long mask) > +{ > + *flags |= mask; > +} > +#endif /* CONFIG_MEMORY_FAILURE */ > + > /* > * Macros to create function definitions for page flags > */ > @@ -421,11 +459,11 @@ static __always_inline void folio_clear_##name(struct folio *folio) \ > > #define __FOLIO_SET_FLAG(name, page) \ > static __always_inline void __folio_set_##name(struct folio *folio) \ > -{ __set_bit(PG_##name, folio_flags(folio, page)); } > +{ __pf_set_bit(PG_##name, folio_flags(folio, page)); } > > #define __FOLIO_CLEAR_FLAG(name, page) \ > static __always_inline void __folio_clear_##name(struct folio *folio) \ > -{ __clear_bit(PG_##name, folio_flags(folio, page)); } > +{ __pf_clear_bit(PG_##name, folio_flags(folio, page)); } > > #define FOLIO_TEST_SET_FLAG(name, page) \ > static __always_inline bool folio_test_set_##name(struct folio *folio) \ > @@ -458,12 +496,12 @@ static __always_inline void ClearPage##uname(struct page *page) \ > #define __SETPAGEFLAG(uname, lname, policy) \ > __FOLIO_SET_FLAG(lname, FOLIO_##policy) \ > static __always_inline void __SetPage##uname(struct page *page) \ > -{ __set_bit(PG_##lname, &policy(page, 1)->flags.f); } > +{ __pf_set_bit(PG_##lname, &policy(page, 1)->flags.f); } > > #define __CLEARPAGEFLAG(uname, lname, policy) \ > __FOLIO_CLEAR_FLAG(lname, FOLIO_##policy) \ > static __always_inline void __ClearPage##uname(struct page *page) \ > -{ __clear_bit(PG_##lname, &policy(page, 1)->flags.f); } > +{ __pf_clear_bit(PG_##lname, &policy(page, 1)->flags.f); } > > #define TESTSETFLAG(uname, lname, policy) \ > FOLIO_TEST_SET_FLAG(lname, FOLIO_##policy) \ > @@ -806,7 +844,7 @@ static inline bool PageUptodate(const struct page *page) > static __always_inline void __folio_mark_uptodate(struct folio *folio) > { > smp_wmb(); > - __set_bit(PG_uptodate, folio_flags(folio, 0)); > + __pf_set_bit(PG_uptodate, folio_flags(folio, 0)); > } > > static __always_inline void folio_mark_uptodate(struct folio *folio) > @@ -1162,14 +1200,14 @@ static __always_inline void ClearPageAnonExclusive(struct page *page) > { > VM_BUG_ON_PGFLAGS(!PageAnonNotKsm(page), page); > VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page); > - clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f); > + __pf_clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f); > } > > static __always_inline void __ClearPageAnonExclusive(struct page *page) > { > VM_BUG_ON_PGFLAGS(!PageAnon(page), page); > VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page); > - __clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f); > + __pf_clear_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f); > } > > #ifdef CONFIG_MMU > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 06bbe9eba636..931dfc84319f 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -2343,7 +2343,7 @@ int folio_xchg_last_cpupid(struct folio *folio, int cpupid); > > static inline void page_cpupid_reset_last(struct page *page) > { > - page->flags.f |= LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT; > + page_flags_set(&page->flags.f, LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT); > } > #endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */ > > @@ -2503,8 +2503,8 @@ static inline struct zone *folio_zone(const struct folio *folio) > #ifdef SECTION_IN_PAGE_FLAGS > static inline void set_page_section(struct page *page, unsigned long section) > { > - page->flags.f &= ~(SECTIONS_MASK << SECTIONS_PGSHIFT); > - page->flags.f |= (section & SECTIONS_MASK) << SECTIONS_PGSHIFT; > + page_flags_clear(&page->flags.f, SECTIONS_MASK << SECTIONS_PGSHIFT); > + page_flags_set(&page->flags.f, (section & SECTIONS_MASK) << SECTIONS_PGSHIFT); > } > > static inline unsigned long memdesc_section(memdesc_flags_t mdf) > @@ -2719,14 +2719,14 @@ static inline bool folio_is_longterm_pinnable(struct folio *folio) > > static inline void set_page_zone(struct page *page, enum zone_type zone) > { > - page->flags.f &= ~(ZONES_MASK << ZONES_PGSHIFT); > - page->flags.f |= (zone & ZONES_MASK) << ZONES_PGSHIFT; > + page_flags_clear(&page->flags.f, ZONES_MASK << ZONES_PGSHIFT); > + page_flags_set(&page->flags.f, (zone & ZONES_MASK) << ZONES_PGSHIFT); > } > > static inline void set_page_node(struct page *page, unsigned long node) > { > - page->flags.f &= ~(NODES_MASK << NODES_PGSHIFT); > - page->flags.f |= (node & NODES_MASK) << NODES_PGSHIFT; > + page_flags_clear(&page->flags.f, NODES_MASK << NODES_PGSHIFT); > + page_flags_set(&page->flags.f, (node & NODES_MASK) << NODES_PGSHIFT); > } > > static inline void set_page_links(struct page *page, enum zone_type zone, > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 970e077019b7..100226f59490 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -3624,8 +3624,8 @@ static void __split_folio_to_order(struct folio *folio, int old_order, > * unreferenced sub-pages of an anonymous THP: we can simply drop > * PG_anon_exclusive (-> PG_mappedtodisk) for these here. > */ > - new_folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP; > - new_folio->flags.f |= (folio->flags.f & > + page_flags_clear(&new_folio->flags.f, PAGE_FLAGS_CHECK_AT_PREP); > + page_flags_set(&new_folio->flags.f, folio->flags.f & > ((1L << PG_referenced) | > (1L << PG_swapbacked) | > (1L << PG_swapcache) | > diff --git a/mm/memremap.c b/mm/memremap.c > index 053842d45cb1..194ca2f04a87 100644 > --- a/mm/memremap.c > +++ b/mm/memremap.c > @@ -494,7 +494,7 @@ void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap, > * blindly clear bits which could have set my order field here, > * including page head. > */ > - new_page->flags.f &= ~0xffUL; /* Clear possible order, page head */ > + page_flags_clear(&new_page->flags.f, 0xffUL); /* Clear possible order, page head */ > > #ifdef NR_PAGES_IN_LARGE_FOLIO > /* > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index d49c254174da..a3447124a725 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1359,7 +1359,7 @@ __always_inline bool __free_pages_prepare(struct page *page, > int i; > > if (compound) { > - page[1].flags.f &= ~PAGE_FLAGS_SECOND; > + page_flags_clear(&page[1].flags.f, PAGE_FLAGS_SECOND); > #ifdef NR_PAGES_IN_LARGE_FOLIO > folio->_nr_pages = 0; > #endif > @@ -1373,7 +1373,7 @@ __always_inline bool __free_pages_prepare(struct page *page, > continue; > } > } > - (page + i)->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP; > + page_flags_clear(&(page + i)->flags.f, PAGE_FLAGS_CHECK_AT_PREP); > } > } > if (folio_test_anon(folio)) { > @@ -1392,7 +1392,7 @@ __always_inline bool __free_pages_prepare(struct page *page, > } > > page_cpupid_reset_last(page); > - page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP; > + page_flags_clear(&page->flags.f, PAGE_FLAGS_CHECK_AT_PREP); > page->private = 0; > reset_page_owner(page, order); > page_table_check_free(page, order); > diff --git a/mm/slub.c b/mm/slub.c > index a2bf3756ca7d..a55199f642d3 100644 > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -617,7 +617,7 @@ static inline void slab_set_pfmemalloc(struct slab *slab) > > static inline void __slab_clear_pfmemalloc(struct slab *slab) > { > - __clear_bit(SL_pfmemalloc, &slab->flags.f); > + __pf_clear_bit(SL_pfmemalloc, &slab->flags.f); > } > > /*