Re: [PATCH -v4 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault

From: "Huang, Ying" <ying.huang@linux.alibaba.com>
To: "David Hildenbrand (Red Hat)" <davidhildenbrandkernel@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
	 Will Deacon <will@kernel.org>,
	 Andrew Morton <akpm@linux-foundation.org>,
	 Ryan Roberts <ryan.roberts@arm.com>,
	 Barry Song <baohua@kernel.org>,
	 Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	 Vlastimil Babka <vbabka@suse.cz>, Zi Yan <ziy@nvidia.com>,
	 Baolin Wang <baolin.wang@linux.alibaba.com>,
	Yang Shi <yang@os.amperecomputing.com>,
	 "Christoph Lameter (Ampere)" <cl@gentwo.org>,
	 Dev Jain <dev.jain@arm.com>,
	 Anshuman Khandual <anshuman.khandual@arm.com>,
	 Kefeng Wang <wangkefeng.wang@huawei.com>,
	Kevin Brodsky <kevin.brodsky@arm.com>,
	 Yin Fengwei <fengwei_yin@linux.alibaba.com>,
	 linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org,  linux-mm@kvack.org
Subject: Re: [PATCH -v4 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault
Date: Sat, 08 Nov 2025 15:20:21 +0800
Message-ID: <87qzu97zyi.fsf@DESKTOP-5N7EMDA>
In-Reply-To: <2b9fa85b-54ff-415c-9163-461e28b6d660@gmail.com> (David Hildenbrand's message of "Thu, 6 Nov 2025 10:47:10 +0100")

Hi, David,

"David Hildenbrand (Red Hat)" <davidhildenbrandkernel@gmail.com> writes:

> On 04.11.25 10:55, Huang Ying wrote:
>> A multi-thread customer workload with large memory footprint uses
>> fork()/exec() to run some external programs every few tens of seconds.  When
>> running the workload on an arm64 server machine, it's observed that
>> quite some CPU cycles are spent in the TLB flushing functions.  While
>> running the workload on the x86_64 server machine, it's not.  This
>> causes the performance on arm64 to be much worse than that on x86_64.
>> During the workload running, after fork()/exec() write-protects all
>> pages in the parent process, memory writing in the parent process
>> will cause a write protection fault.  Then the page fault handler
>> will make the PTE/PDE writable if the page can be reused, which is
>> almost always true in the workload.  On arm64, to avoid the write
>> protection fault on other CPUs, the page fault handler flushes the TLB
>> globally with TLBI broadcast after changing the PTE/PDE.  However, this
>> isn't always necessary.  Firstly, it's safe to leave some stale
>> read-only TLB entries as long as they will be flushed finally.
>> Secondly, it's quite possible that the original read-only PTE/PDEs
>> aren't cached in remote TLB at all if the memory footprint is large.
>> In fact, on x86_64, the page fault handler doesn't flush the remote
>> TLB in this situation, which benefits the performance a lot.
>> To improve the performance on arm64, make the write protection fault
>> handler flush the TLB locally instead of globally via TLBI broadcast
>> after making the PTE/PDE writable.  If there are stale read-only TLB
>> entries in the remote CPUs, the page fault handler on these CPUs will
>> regard the page fault as spurious and flush the stale TLB entries.
>> To test the patchset, make the usemem.c from
>> vm-scalability (https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git)
>> support calling fork()/exec() periodically.  To mimic the behavior of
>> the customer workload, run usemem with 4 threads, access 100GB memory,
>> and call fork()/exec() every 40 seconds.  Test results show that with
>> the patchset the score of usemem improves by ~40.6%.  The cycles% of TLB
>> flush functions drops from ~50.5% to ~0.3% in the perf profile.
>> 
>
> All makes sense to me.
>
> Some smaller comments below.

Thanks!

> [...]
>
>> +
>> +static inline void local_flush_tlb_page_nonotify(
>> +	struct vm_area_struct *vma, unsigned long uaddr)
>
> NIT: "struct vm_area_struct *vma" fits onto the previous line.

Sure.

>> +{
>> +	__local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
>> +	dsb(nsh);
>> +}
>> +
>> +static inline void local_flush_tlb_page(struct vm_area_struct *vma,
>> +					unsigned long uaddr)
>> +{
>> +	__local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
>> +	mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK,
>> +						(uaddr & PAGE_MASK) + PAGE_SIZE);
>> +	dsb(nsh);
>> +}
>> +
>>   static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>>   					   unsigned long uaddr)
>>   {
>> @@ -472,6 +512,22 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
>>   	dsb(ish);
>>   }
>> +static inline void local_flush_tlb_contpte(struct vm_area_struct *vma,
>> +					   unsigned long addr)
>> +{
>> +	unsigned long asid;
>> +
>> +	addr = round_down(addr, CONT_PTE_SIZE);
>> +
>> +	dsb(nshst);
>> +	asid = ASID(vma->vm_mm);
>> +	__flush_tlb_range_op(vale1, addr, CONT_PTES, PAGE_SIZE, asid,
>> +			     3, true, lpa2_is_enabled());
>> +	mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, addr,
>> +						    addr + CONT_PTE_SIZE);
>> +	dsb(nsh);
>> +}
>> +
>>   static inline void flush_tlb_range(struct vm_area_struct *vma,
>>   				   unsigned long start, unsigned long end)
>>   {
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index c0557945939c..589bcf878938 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -622,8 +622,7 @@ int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>   			__ptep_set_access_flags(vma, addr, ptep, entry, 0);
>>     		if (dirty)
>> -			__flush_tlb_range(vma, start_addr, addr,
>> -							PAGE_SIZE, true, 3);
>> +			local_flush_tlb_contpte(vma, start_addr);
>
> In this case, we now flush a bigger range than we used to, no?
>
> Probably I am missing something (should this change be explained in
> more detail in the cover letter), but I'm wondering why this contpte
> handling wasn't required before on this level.

As Ryan explained in his reply email, the flush range doesn't change
here.  We just replace the global TLB flush with a local TLB flush.
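
To make the range argument concrete, here is a rough user-space sketch
(not kernel code).  It assumes 4KiB pages and 16 PTEs per contiguous
block, and the loop in old_flush_range() mirrors my reading of how
contpte_ptep_set_access_flags() advances addr before the flush.  The
struct and the helper names are made up for illustration only.

```c
#include <assert.h>

/* Assumed values for the 4K granule; other granules differ. */
#define PAGE_SIZE	4096UL
#define CONT_PTES	16UL
#define CONT_PTE_SIZE	(CONT_PTES * PAGE_SIZE)

/* Half-open address range [start, end). */
struct range {
	unsigned long start;
	unsigned long end;
};

/*
 * Old call: __flush_tlb_range(vma, start_addr, addr, ...), issued after
 * the loop has advanced addr past the last PTE of the block.
 */
static struct range old_flush_range(unsigned long fault_addr)
{
	unsigned long start_addr, addr, i;

	start_addr = addr = fault_addr & ~(CONT_PTE_SIZE - 1);
	for (i = 0; i < CONT_PTES; i++)
		addr += PAGE_SIZE;	/* one __ptep_set_access_flags() per PTE */

	return (struct range){ start_addr, addr };
}

/* New call: local_flush_tlb_contpte(vma, start_addr). */
static struct range new_flush_range(unsigned long start_addr)
{
	unsigned long addr = start_addr & ~(CONT_PTE_SIZE - 1);

	return (struct range){ addr, addr + CONT_PTES * PAGE_SIZE };
}

/* Both calls should cover exactly one contiguous block. */
static void check_same_range(unsigned long fault_addr)
{
	struct range o = old_flush_range(fault_addr);
	struct range n = new_flush_range(o.start);

	assert(o.start == n.start && o.end == n.end);
	assert(o.end - o.start == CONT_PTE_SIZE);
}
```

Calling check_same_range() with any faulting address should pass under
these assumptions, because both variants cover the whole contpte block
that contains the address.  Only the scope of the invalidation differs.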

>>   	} else {
>>   		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
>>   		__ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>> index d816ff44faff..22f54f5afe3f 100644
>> --- a/arch/arm64/mm/fault.c
>> +++ b/arch/arm64/mm/fault.c
>> @@ -235,7 +235,7 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
>>     	/* Invalidate a stale read-only entry */
>
> I would expand this comment to also explain how remote TLBs are
> handled very briefly -> flush_tlb_fix_spurious_fault().

Sure.
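
As a starting point for that explanation, below is a toy user-space
model (not kernel code) of the idea.  Each CPU caches one permission
for a single page.  The faulting CPU only flushes its own entry, and a
remote CPU with a stale read-only entry takes one spurious fault and
then flushes locally, which is the role flush_tlb_fix_spurious_fault()
plays.  All names in the sketch are hypothetical, and it ignores
ordering and barriers completely.

```c
#include <assert.h>

#define NR_CPUS 4

/* PERM_NONE means the CPU has no TLB entry cached for the page. */
enum perm { PERM_NONE, PERM_RO, PERM_RW };

struct model {
	enum perm pte;			/* page table entry, shared by all CPUs */
	enum perm tlb[NR_CPUS];		/* per-CPU cached copy of the entry */
	unsigned int spurious;		/* number of spurious faults taken */
};

/* Local invalidation only, no broadcast to other CPUs. */
static void local_flush(struct model *m, int cpu)
{
	m->tlb[cpu] = PERM_NONE;
}

/* Perform a write on @cpu, handling write protection faults until it succeeds. */
static void write_page(struct model *m, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);

	for (;;) {
		if (m->tlb[cpu] == PERM_NONE)
			m->tlb[cpu] = m->pte;	/* TLB miss: walk the page table */
		if (m->tlb[cpu] == PERM_RW)
			return;			/* write goes through */

		/* Write protection fault. */
		if (m->pte == PERM_RW)
			m->spurious++;		/* PTE already writable: stale TLB entry */
		else
			m->pte = PERM_RW;	/* page can be reused: make it writable */

		local_flush(m, cpu);		/* drop the stale read-only entry */
	}
}
```

In this model a CPU that never cached the read-only entry pays nothing,
and a CPU that did cache it pays one extra fault instead of every write
fault paying for a broadcast.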

>>   	if (dirty)
>> -		flush_tlb_page(vma, address);
>> +		local_flush_tlb_page(vma, address);
>>   	return 1;
>>   }
>>   

---
Best Regards,
Huang, Ying

Thread overview: 7+ messages
2025-11-04  9:55 [PATCH -v4 0/2] arm, tlbflush: avoid TLBI broadcast if page reused in write fault Huang Ying
2025-11-04  9:55 ` [PATCH -v4 1/2] mm: add spurious fault fixing support for huge pmd Huang Ying
2025-11-04  9:55 ` [PATCH -v4 2/2] arm64, tlbflush: don't TLBI broadcast if page reused in write fault Huang Ying
2025-11-06  9:47   ` David Hildenbrand (Red Hat)
2025-11-06 16:54     ` Ryan Roberts
2025-11-08  7:20     ` Huang, Ying [this message]
2025-11-06  1:01 ` [PATCH -v4 0/2] arm, tlbflush: avoid " Huang, Ying
