From: "David Hildenbrand (Red Hat)" <david@kernel.org>
To: Harry Yoo <harry.yoo@oracle.com>
Cc: linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
linux-mm@kvack.org, Will Deacon <will@kernel.org>,
"Aneesh Kumar K.V" <aneesh.kumar@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Nick Piggin <npiggin@gmail.com>,
Peter Zijlstra <peterz@infradead.org>,
Arnd Bergmann <arnd@arndb.de>,
Muchun Song <muchun.song@linux.dev>,
Oscar Salvador <osalvador@suse.de>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
Pedro Falcato <pfalcato@suse.de>, Rik van Riel <riel@surriel.com>,
Laurence Oberman <loberman@redhat.com>,
Prakash Sangappa <prakash.sangappa@oracle.com>,
Nadav Amit <nadav.amit@gmail.com>,
stable@vger.kernel.org, Ryan Roberts <ryan.roberts@arm.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Christophe Leroy <christophe.leroy@csgroup.eu>
Subject: Re: [PATCH v2 4/4] mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather
Date: Fri, 19 Dec 2025 14:52:41 +0100
Message-ID: <d5bf88d9-aedf-4e6d-b5a0-e860bf0ed2e4@kernel.org>
In-Reply-To: <aUVHAD9G5_HKlYsR@hyeyoo>
On 12/19/25 13:37, Harry Yoo wrote:
> On Fri, Dec 12, 2025 at 08:10:19AM +0100, David Hildenbrand (Red Hat) wrote:
>> As reported, ever since commit 1013af4f585f ("mm/hugetlb: fix
>> huge_pmd_unshare() vs GUP-fast race") we can end up in some situations
>> where we perform so many IPI broadcasts when unsharing hugetlb PMD page
>> tables that it severely regresses some workloads.
>>
>> In particular, when we fork()+exit(), or when we munmap() a large
>> area backed by many shared PMD tables, we perform one IPI broadcast per
>> unshared PMD table.
>>
>
> [...snip...]
>
>> Fixes: 1013af4f585f ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race")
>> Reported-by: "Uschakow, Stanislav" <suschako@amazon.de>
>> Closes: https://lore.kernel.org/all/4d3878531c76479d9f8ca9789dc6485d@amazon.de/
>> Tested-by: Laurence Oberman <loberman@redhat.com>
>> Cc: <stable@vger.kernel.org>
>> Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
>> ---
>> include/asm-generic/tlb.h | 74 ++++++++++++++++++++++-
>> include/linux/hugetlb.h | 19 +++---
>> mm/hugetlb.c | 121 ++++++++++++++++++++++----------------
>> mm/mmu_gather.c | 7 +++
>> mm/mprotect.c | 2 +-
>> mm/rmap.c | 25 +++++---
>> 6 files changed, 179 insertions(+), 69 deletions(-)
>>
>> @@ -6522,22 +6511,16 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
>> pte = huge_pte_clear_uffd_wp(pte);
>> huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
>> pages++;
>> + tlb_remove_huge_tlb_entry(h, tlb, ptep, address);
>> }
>>
>> next:
>> spin_unlock(ptl);
>> cond_resched();
>> }
>> - /*
>> - * There is nothing protecting a previously-shared page table that we
>> - * unshared through huge_pmd_unshare() from getting freed after we
>> - * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()
>> - * succeeded, flush the range corresponding to the pud.
>> - */
>> - if (shared_pmd)
>> - flush_hugetlb_tlb_range(vma, range.start, range.end);
>> - else
>> - flush_hugetlb_tlb_range(vma, start, end);
>> +
>> + tlb_flush_mmu_tlbonly(tlb);
>> + huge_pmd_unshare_flush(tlb, vma);
>
> Shouldn't we teach mmu_gather that it has to call

I hope not :) In the worst case we could keep the
flush_hugetlb_tlb_range() call in the !shared case. Suboptimal, but I am
sick and tired of dealing with this hugetlb mess.

Let me CC Ryan and Catalin for the arm64 pieces and Christophe for the
ppc pieces: see [1], where we convert some flush_hugetlb_tlb_range()
users to operate on mmu_gather instead, using

* tlb_remove_huge_tlb_entry() for mremap() and mprotect(). Before, we
  would only use it in __unmap_hugepage_range().
* tlb_flush_pmd_range() for unsharing of shared PMD tables. We already
  used that in one call path.

[1] https://lore.kernel.org/all/20251212071019.471146-5-david@kernel.org/
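
For readers less familiar with the gather idea, here is a toy userspace
model of "record ranges now, flush once later". It is only a sketch and
not kernel code: struct toy_gather and the toy_gather_*() helpers are
made-up names, and the real struct mmu_gather tracks far more state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for struct mmu_gather: it only tracks one pending range. */
struct toy_gather {
	uint64_t start;		/* lowest address recorded so far */
	uint64_t end;		/* highest end address recorded so far */
	bool pending;		/* anything recorded since the last flush? */
	unsigned int flushes;	/* number of simulated flushes issued */
};

static void toy_gather_init(struct toy_gather *tlb)
{
	tlb->start = UINT64_MAX;
	tlb->end = 0;
	tlb->pending = false;
	tlb->flushes = 0;
}

/* Record a range by widening [start, end); nothing is flushed here. */
static void toy_gather_add_range(struct toy_gather *tlb, uint64_t addr,
				 uint64_t size)
{
	if (addr < tlb->start)
		tlb->start = addr;
	if (addr + size > tlb->end)
		tlb->end = addr + size;
	tlb->pending = true;
}

/* Issue at most one simulated flush for everything recorded so far. */
static void toy_gather_flush(struct toy_gather *tlb)
{
	if (!tlb->pending)
		return;
	tlb->flushes++;
	tlb->start = UINT64_MAX;
	tlb->end = 0;
	tlb->pending = false;
}
```

Recording N ranges and then calling toy_gather_flush() once bumps the
flush counter by one instead of N, at the price of flushing the whole
span between the lowest and the highest recorded address.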

> flush_hugetlb_tlb_range() instead of ordinary TLB flush routine,
> otherwise it will break ARCHes that has "special requirements"
> for evicting hugetlb backing TLB entries?

Yeah, I was briefly wondering about that myself (and about the
inconsistency we had in the code). I would hope that we're good, but
maybe there are some nasty corner cases we're missing. So thanks for
raising that.

Given that tlb_remove_huge_tlb_entry() exists (and is already getting
used), I would assume that it does the right thing.

In tlb_unshare_pmd_ptdesc(), I am now using tlb_flush_pmd_range(),
because we know that we are dealing with PMD-sized hugetlb folios.

And in fact, we were already doing that in case of
__unmap_hugepage_range(), where we did exactly what I do now:

	tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE);

So, again, something would already be broken there unless I am missing
something important.
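
To make the range arithmetic in that call concrete, here is a small
userspace sketch. It assumes x86-64 with 4 KiB base pages (PMD_SHIFT 21,
PUD_SHIFT 30); other architectures and configurations use different
shifts. pud_range_start() and pud_range_end() are hypothetical helpers
for illustration, not kernel functions.

```c
#include <stdint.h>

/* Assumed x86-64 geometry with 4 KiB base pages; not universal. */
#define PMD_SHIFT	21
#define PUD_SHIFT	30
#define PMD_SIZE	(UINT64_C(1) << PMD_SHIFT)
#define PUD_SIZE	(UINT64_C(1) << PUD_SHIFT)
#define PUD_MASK	(~(PUD_SIZE - 1))

/* Hypothetical helper: start of the PUD-sized range containing addr. */
static uint64_t pud_range_start(uint64_t addr)
{
	return addr & PUD_MASK;
}

/* Hypothetical helper: exclusive end of that PUD-sized range. */
static uint64_t pud_range_end(uint64_t addr)
{
	return pud_range_start(addr) + PUD_SIZE;
}
```

With these values, an address anywhere inside a shared PMD table rounds
down to a 1 GiB boundary, and the flushed range spans PUD_SIZE / PMD_SIZE
= 512 PMD-sized entries, i.e. everything that one PMD table can map.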

Looking at it, I wonder whether we must do the
tlb_remove_huge_tlb_entry() in move_hugetlb_page_tables() after the
move_huge_pte(). Looks like tlb_remove_huge_tlb_entry() might do some
flushing on ppc (and not just updating the mmu_gather) through
__tlb_remove_tlb_entry(). But it's a bit confusing.

--
Cheers
David