From: "David Hildenbrand (Red Hat)" <david@kernel.org>
To: Harry Yoo <harry.yoo@oracle.com>
Cc: linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
linux-mm@kvack.org, Will Deacon <will@kernel.org>,
"Aneesh Kumar K.V" <aneesh.kumar@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Nick Piggin <npiggin@gmail.com>,
Peter Zijlstra <peterz@infradead.org>,
Arnd Bergmann <arnd@arndb.de>,
Muchun Song <muchun.song@linux.dev>,
Oscar Salvador <osalvador@suse.de>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
Pedro Falcato <pfalcato@suse.de>, Rik van Riel <riel@surriel.com>,
Laurence Oberman <loberman@redhat.com>,
Prakash Sangappa <prakash.sangappa@oracle.com>,
Nadav Amit <nadav.amit@gmail.com>,
stable@vger.kernel.org, Ryan Roberts <ryan.roberts@arm.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Christophe Leroy <christophe.leroy@csgroup.eu>
Subject: Re: [PATCH v2 4/4] mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather
Date: Mon, 22 Dec 2025 11:10:32 +0100 [thread overview]
Message-ID: <c66cb4e4-820e-419a-ae9f-efd2c15aa570@kernel.org> (raw)
In-Reply-To: <aUioS4dkTrKgsHGP@hyeyoo>
>> Okay, the existing hugetlb mmu_gather integration is hell on earth.
>>
>> I *think* to get everything right (work around all the hacks we have) we might have to do a
>>
>> tlb_change_page_size(tlb, sz);
>> tlb_start_vma(tlb, vma);
>>
>> before adding something to the tlb and a tlb_end_vma(tlb, vma) if we
>> don't immediately call tlb_finish_mmu() already.
>
> Good point, indeed!
>
>> tlb_change_page_size() will set page_size accordingly (as required for
>> ppc IIUC).
>
> Right. PPC wants to flush TLB when the page size changes.
>
>> tlb_start_vma()->tlb_update_vma_flags() will set tlb->vma_huge for ...
>> some very good reason I am sure.
>
> :)
>
>> So something like the following might do the trick:
>>
>> From b0b854c2f91ce0931e1462774c92015183fb5b52 Mon Sep 17 00:00:00 2001
>> From: "David Hildenbrand (Red Hat)" <david@kernel.org>
>> Date: Sun, 21 Dec 2025 12:57:43 +0100
>> Subject: [PATCH] tmp
>>
>> Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
>> ---
>> mm/hugetlb.c | 12 +++++++++++-
>> mm/rmap.c | 4 ++++
>> 2 files changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 7fef0b94b5d1e..14521210181c9 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -5113,6 +5113,9 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>> /* Prevent race with file truncation */
>> hugetlb_vma_lock_write(vma);
>> i_mmap_lock_write(mapping);
>> +
>> + tlb_change_page_size(&tlb, sz);
>> + tlb_start_vma(&tlb, vma);
>> for (; old_addr < old_end; old_addr += sz, new_addr += sz) {
>> src_pte = hugetlb_walk(vma, old_addr, sz);
>> if (!src_pte) {
>> @@ -5128,13 +5131,13 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>> new_addr |= last_addr_mask;
>> continue;
>> }
>> - tlb_remove_huge_tlb_entry(h, &tlb, src_pte, old_addr);
>> dst_pte = huge_pte_alloc(mm, new_vma, new_addr, sz);
>> if (!dst_pte)
>> break;
>> move_huge_pte(vma, old_addr, new_addr, src_pte, dst_pte, sz);
>> + tlb_remove_huge_tlb_entry(h, &tlb, src_pte, old_addr);
>> }
>> tlb_flush_mmu_tlbonly(&tlb);
>> @@ -6416,6 +6419,8 @@ long hugetlb_change_protection(struct mmu_gather *tlb, struct vm_area_struct *vm
>> BUG_ON(address >= end);
>> flush_cache_range(vma, range.start, range.end);
>> + tlb_change_page_size(tlb, psize);
>> + tlb_start_vma(tlb, vma);
>> mmu_notifier_invalidate_range_start(&range);
>> hugetlb_vma_lock_write(vma);
>> @@ -6532,6 +6537,8 @@ long hugetlb_change_protection(struct mmu_gather *tlb, struct vm_area_struct *vm
>> hugetlb_vma_unlock_write(vma);
>> mmu_notifier_invalidate_range_end(&range);
>> + tlb_end_vma(tlb, vma);
>> +
>> return pages > 0 ? (pages << h->order) : pages;
>> }
>> @@ -7259,6 +7266,9 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
>> } else {
>> i_mmap_assert_write_locked(vma->vm_file->f_mapping);
>> }
>> +
>> + tlb_change_page_size(&tlb, sz);
>> + tlb_start_vma(&tlb, vma);
>> for (address = start; address < end; address += PUD_SIZE) {
>> ptep = hugetlb_walk(vma, address, sz);
>> if (!ptep)
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index d6799afe11147..27210bc6fb489 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -2015,6 +2015,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> goto walk_abort;
>> tlb_gather_mmu(&tlb, mm);
>> + tlb_change_page_size(&tlb, huge_page_size(hstate_vma(vma)));
>> + tlb_start_vma(&tlb, vma);
>> if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
>> hugetlb_vma_unlock_write(vma);
>> huge_pmd_unshare_flush(&tlb, vma);
>> @@ -2413,6 +2415,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>> }
>> tlb_gather_mmu(&tlb, mm);
>> + tlb_change_page_size(&tlb, huge_page_size(hstate_vma(vma)));
>> + tlb_start_vma(&tlb, vma);
>> if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
>> hugetlb_vma_unlock_write(vma);
>> huge_pmd_unshare_flush(&tlb, vma);
>> --
>> 2.52.0
>>
>>
>>
>> But now I'm staring at it and wonder whether we should just defer the TLB flushing changes
>> to a later point and only focus on the IPI flushes.
>
> You mean defer TLB flushing to which point? For unmapping or
> changing permission of VMAs, flushing at VMA boundary already makes sense?
Defer converting to mmu_gather to a later patch set :)
I gave it a try yesterday, but it's also a bit ugly.
In the code above, primarily the rmap change is nasty.
>
> Or if you meant batching TLB flushes in try_to_{migrate,unmap}_one()...
>
> /me starts wondering...
>
> "Hmm... for RMAP, we already have TLB flush batching
> via struct tlbflush_unmap_batch. Why not use this framework
> when unmapping shared hugetlb pages as well?"
Hm, also not what we really want in most cases. I don't think we should
be using that outside of rmap.c (and I have the gut feeling that we
should maybe make use of mmu_gather in there instead at some point).
Let me try a bit to see if I can clean the code here up, or if I just
add a temporary custom batching data structure.
Thanks for bringing this up!
--
Cheers
David
prev parent reply other threads:[~2025-12-22 10:10 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-12-12 7:10 [PATCH v2 0/4] mm/hugetlb: fixes for PMD table sharing (incl. using mmu_gather) David Hildenbrand (Red Hat)
2025-12-12 7:10 ` [PATCH v2 1/4] mm/hugetlb: fix hugetlb_pmd_shared() David Hildenbrand (Red Hat)
2025-12-12 7:10 ` [PATCH v2 2/4] mm/hugetlb: fix two comments related to huge_pmd_unshare() David Hildenbrand (Red Hat)
2025-12-19 4:44 ` Harry Yoo
2025-12-19 6:11 ` David Hildenbrand (Red Hat)
2025-12-19 11:20 ` Harry Yoo
2025-12-19 14:13 ` David Hildenbrand (Red Hat)
2025-12-19 21:37 ` Nadav Amit
2025-12-21 9:26 ` David Hildenbrand (Red Hat)
2025-12-12 7:10 ` [PATCH v2 3/4] mm/rmap: " David Hildenbrand (Red Hat)
2025-12-12 7:10 ` [PATCH v2 4/4] mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather David Hildenbrand (Red Hat)
2025-12-16 10:47 ` Lorenzo Stoakes
2025-12-19 12:37 ` Harry Yoo
2025-12-19 13:52 ` David Hildenbrand (Red Hat)
2025-12-19 13:59 ` David Hildenbrand (Red Hat)
2025-12-21 12:24 ` David Hildenbrand (Red Hat)
2025-12-22 2:09 ` Harry Yoo
2025-12-22 10:10 ` David Hildenbrand (Red Hat) [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=c66cb4e4-820e-419a-ae9f-efd2c15aa570@kernel.org \
--to=david@kernel.org \
--cc=Liam.Howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=aneesh.kumar@kernel.org \
--cc=arnd@arndb.de \
--cc=catalin.marinas@arm.com \
--cc=christophe.leroy@csgroup.eu \
--cc=harry.yoo@oracle.com \
--cc=jannh@google.com \
--cc=linux-arch@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=loberman@redhat.com \
--cc=lorenzo.stoakes@oracle.com \
--cc=muchun.song@linux.dev \
--cc=nadav.amit@gmail.com \
--cc=npiggin@gmail.com \
--cc=osalvador@suse.de \
--cc=peterz@infradead.org \
--cc=pfalcato@suse.de \
--cc=prakash.sangappa@oracle.com \
--cc=riel@surriel.com \
--cc=ryan.roberts@arm.com \
--cc=stable@vger.kernel.org \
--cc=vbabka@suse.cz \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.