Re: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Lance Yang <lance.yang@linux.dev>
To: "David Hildenbrand (Red Hat)" <david@kernel.org>
Cc: will@kernel.org, aneesh.kumar@kernel.org, npiggin@gmail.com,
	peterz@infradead.org, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, arnd@arndb.de, lorenzo.stoakes@oracle.com,
	ziy@nvidia.com, baolin.wang@linux.alibaba.com,
	Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
	dev.jain@arm.com, baohua@kernel.org, ioworker0@gmail.com,
	shy828301@gmail.com, riel@surriel.com, jannh@google.com,
	linux-arch@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, akpm@linux-foundation.org
Subject: Re: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page()
Date: Sun, 21 Dec 2025 18:43:32 +0800	[thread overview]
Message-ID: <90c1ff67-46fb-4ddd-9bdd-43633f89dda2@linux.dev> (raw)
In-Reply-To: <6dcfeb2a-dba6-4de9-ac1b-39312c6bbcb6@kernel.org>



On 2025/12/19 16:25, David Hildenbrand (Red Hat) wrote:
> On 12/18/25 15:35, Lance Yang wrote:
>>
>>
>> On 2025/12/18 21:13, David Hildenbrand (Red Hat) wrote:
>>> On 12/13/25 09:00, Lance Yang wrote:
>>>> From: Lance Yang <lance.yang@linux.dev>
>>>>
>>>> Similar to the hugetlb PMD unsharing optimization, skip the second IPI
>>>> in collapse_huge_page() when the TLB flush already provides necessary
>>>> synchronization.
>>>>
>>>> Before commit a37259732a7d ("x86/mm: Make MMU_GATHER_RCU_TABLE_FREE
>>>> unconditional"), bare metal x86 didn't enable 
>>>> MMU_GATHER_RCU_TABLE_FREE.
>>>> In that configuration, tlb_remove_table_sync_one() was a NOP. GUP-fast
>>>> synchronization relied on IRQ disabling, which blocks TLB flush IPIs.
>>>>
>>>> When Rik made MMU_GATHER_RCU_TABLE_FREE unconditional to support AMD's
>>>> INVLPGB, all x86 systems started sending the second IPI. However, on
>>>> native x86 this is redundant:
>>>>
>>>>     - pmdp_collapse_flush() calls flush_tlb_range(), sending IPIs to 
>>>> all
>>>>       CPUs to invalidate TLB entries
>>>>
>>>>     - GUP-fast runs with IRQs disabled, so when the flush IPI 
>>>> completes,
>>>>       any concurrent GUP-fast must have finished
>>>>
>>>>     - tlb_remove_table_sync_one() provides no additional 
>>>> synchronization
>>>>
>>>> On x86, skip the second IPI when running native (without paravirt) and
>>>> without INVLPGB. For paravirt with non-native flush_tlb_multi and for
>>>> INVLPGB, conservatively keep both IPIs.
>>>>
>>>> Use tlb_table_flush_implies_ipi_broadcast(), consistent with the 
>>>> hugetlb
>>>> optimization.
>>>>
>>>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>>>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>>>> ---
>>>>    mm/khugepaged.c | 7 ++++++-
>>>>    1 file changed, 6 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index 97d1b2824386..06ea793a8190 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -1178,7 +1178,12 @@ static int collapse_huge_page(struct mm_struct
>>>> *mm, unsigned long address,
>>>>        _pmd = pmdp_collapse_flush(vma, address, pmd);
>>>>        spin_unlock(pmd_ptl);
>>>>        mmu_notifier_invalidate_range_end(&range);
>>>> -    tlb_remove_table_sync_one();
>>>> +    /*
>>>> +     * Skip the second IPI if the TLB flush above already synchronized
>>>> +     * with concurrent GUP-fast via broadcast IPIs.
>>>> +     */
>>>> +    if (!tlb_table_flush_implies_ipi_broadcast())
>>>> +        tlb_remove_table_sync_one();
>>>
>>> We end up calling
>>>
>>>       flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>>>
>>>       -> flush_tlb_mm_range(freed_tables = true)
>>>
>>>       -> flush_tlb_multi(mm_cpumask(mm), info);
>>>
>>> So freed_tables=true and we should be doing the right thing.
>>
>> Yep ;)
>>
>>> BTW, I was wondering whether we should embed that
>>> tlb_table_flush_implies_ipi_broadcast() check in
>>> tlb_remove_table_sync_one() instead.
>>> It then relies on the caller to do the right thing (flush with
>>> freed_tables=true or unshared_tables = true).
>>>
>>> Thoughts?
>>
>> Good point! Let me check the other callers to ensure they
>> are all preceded by a flush with freed_tables=true (or unshared_tables).
>>
>> Will get back to you with what I find :)
> 
> The use case in tlb_table_flush() is a bit confusing. But I would assume 
> that we have a TLB flush with remove_tables=true beforehand. Otherwise 
> we cannot possibly free the page table.

Right! I assume you meant freed_tables=true (not remove_tables) ;)

Verified all callers have proper TLB flushes *beforehand*:

-> 1. mm/khugepaged.c:1188 (collapse_huge_page)

	pmdp_collapse_flush(vma, address, pmd)
	-> flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE)
		-> flush_tlb_mm_range(mm, ..., freed_tables = true)
			-> flush_tlb_multi(mm_cpumask(mm), info)

So freed_tables=true and we should be doing the right thing :)

-> 2. include/asm-generic/tlb.h:861 (tlb_flush_unshared_tables)

	tlb_flush_mmu_tlbonly(tlb)
	-> tlb_flush(tlb)
		-> flush_tlb_mm_range(mm, ..., unshared_tables = true)
			-> flush_tlb_multi(mm_cpumask(mm), info)

unshared_tables=true (equivalent to freed_tables for sending IPIs).

-> 3. mm/mmu_gather.c:341 (__tlb_remove_table_one)

When we can't allocate a batch page in tlb_remove_table(), we do:

	tlb_table_invalidate(tlb)
	-> tlb_flush_mmu_tlbonly(tlb)
		-> flush_tlb_mm_range(mm, ..., freed_tables = true)
			-> flush_tlb_multi(mm_cpumask(mm), info)

	Then:
	tlb_remove_table_one(table)
	-> __tlb_remove_table_one(table) // if !CONFIG_PT_RECLAIM
		-> tlb_remove_table_sync_one()

freed_tables=true, and this should work too.

Why is tlb->freed_tables guaranteed? Because callers like pte_free_tlb()
(via free_pte_range) set freed_tables=true before calling __pte_free_tlb(),
which then calls tlb_remove_table(). As you mentioned, we cannot free page
tables without freed_tables=true.

Note that tlb_remove_table_sync_one() was a NOP on bare metal x86
(CONFIG_MMU_GATHER_RCU_TABLE_FREE=n) before commit a37259732a7d.

-> 4-5. mm/khugepaged.c:1683,1819 (pmdp_get_lockless_sync macro)

Same as #1.

So all callers satisfy the requirement! Will embed the check in v2.

Hopefully I didn't miss any callers ;)

Cheers,
Lance

     prev parent reply	other threads:[~2025-12-21 10:44 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-13  8:00 [PATCH RFC 0/3] skip redundant TLB sync IPIs Lance Yang
2025-12-13  8:00 ` [PATCH RFC 1/3] mm/tlb: allow architectures to " Lance Yang
2025-12-15  5:48   ` Lance Yang
2025-12-18 13:08     ` David Hildenbrand (Red Hat)
2025-12-13  8:00 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for PMD unsharing Lance Yang
2025-12-18 13:08   ` David Hildenbrand (Red Hat)
2025-12-22  3:19     ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for Lance Yang
2025-12-23  9:44       ` David Hildenbrand (Red Hat)
2025-12-23 11:13         ` Lance Yang
2025-12-13  8:00 ` [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() Lance Yang
2025-12-14 13:24   ` kernel test robot
2025-12-14 13:56   ` kernel test robot
2025-12-18 13:13   ` David Hildenbrand (Red Hat)
2025-12-18 14:35     ` Lance Yang
2025-12-19  8:25       ` David Hildenbrand (Red Hat)
2025-12-21 10:43         ` Lance Yang [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=90c1ff67-46fb-4ddd-9bdd-43633f89dda2@linux.dev \
    --to=lance.yang@linux.dev \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=aneesh.kumar@kernel.org \
    --cc=arnd@arndb.de \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=david@kernel.org \
    --cc=dev.jain@arm.com \
    --cc=hpa@zytor.com \
    --cc=ioworker0@gmail.com \
    --cc=jannh@google.com \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=mingo@redhat.com \
    --cc=npache@redhat.com \
    --cc=npiggin@gmail.com \
    --cc=peterz@infradead.org \
    --cc=riel@surriel.com \
    --cc=ryan.roberts@arm.com \
    --cc=shy828301@gmail.com \
    --cc=tglx@linutronix.de \
    --cc=will@kernel.org \
    --cc=x86@kernel.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.