public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [RFC][PATCH] arm64: tlb: call kvm_call_hyp once during kvm_tlb_flush_vmid_range
@ 2026-02-09 13:14 yezhenyu (A)
  2026-02-09 14:35 ` Marc Zyngier
  0 siblings, 1 reply; 5+ messages in thread
From: yezhenyu (A) @ 2026-02-09 13:14 UTC (permalink / raw)
  To: rananta@google.com, will@kernel.org, maz@kernel.org,
	oliver.upton@linux.dev, catalin.marinas@arm.com,
	dmatlack@google.com
  Cc: linux-kernel@vger.kernel.org, kvmarm@lists.linux.dev,
	linux-arm-kernel@lists.infradead.org, zhengchuan, Xiexiangyou,
	guoqixin (A), Mawen (Wayne)

From 9982be89f55bd99b3683337223284f0011ed248e Mon Sep 17 00:00:00 2001
From: eillon <yezhenyu2@huawei.com>
Date: Mon, 9 Feb 2026 19:48:46 +0800
Subject: [RFC][PATCH v1] arm64: tlb: call kvm_call_hyp once during
 kvm_tlb_flush_vmid_range

The kvm_tlb_flush_vmid_range() function is performance-critical
during live migration, but on systems that support TLB flush by
range, any range larger than MAX_TLBI_RANGE_PAGES is flushed in a
while loop, one MAX_TLBI_RANGE_PAGES chunk at a time.

This results in frequent calls to kvm_call_hyp(), and a large
amount of time is spent in kvm_clear_dirty_log_protect() during
migration (more than 50%). So, when the address range is larger
than MAX_TLBI_RANGE_PAGES, directly call __kvm_tlb_flush_vmid to
optimize performance.

---
 arch/arm64/kvm/hyp/pgtable.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 874244df7..9da22b882 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -675,21 +675,19 @@ static bool stage2_has_fwb(struct kvm_pgtable *pgt)
 void kvm_tlb_flush_vmid_range(struct kvm_s2_mmu *mmu,
 				phys_addr_t addr, size_t size)
 {
-	unsigned long pages, inval_pages;
+	unsigned long pages = size >> PAGE_SHIFT;

-	if (!system_supports_tlb_range()) {
+	/*
+	 * This function is performance-critical during live migration;
+	 * thus, when the address range is larger than MAX_TLBI_RANGE_PAGES,
+	 * directly call __kvm_tlb_flush_vmid to optimize performance.
+	 */
+	if (!system_supports_tlb_range() || pages > MAX_TLBI_RANGE_PAGES) {
 		kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
 		return;
 	}

-	pages = size >> PAGE_SHIFT;
-	while (pages > 0) {
-		inval_pages = min(pages, MAX_TLBI_RANGE_PAGES);
-		kvm_call_hyp(__kvm_tlb_flush_vmid_range, mmu, addr, inval_pages);
-
-		addr += inval_pages << PAGE_SHIFT;
-		pages -= inval_pages;
-	}
+	kvm_call_hyp(__kvm_tlb_flush_vmid_range, mmu, addr, pages);
 }

 #define KVM_S2_MEMATTR(pgt, attr) PAGE_S2_MEMATTR(attr, stage2_has_fwb(pgt))
--
2.43.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread
* Re: [RFC][PATCH] arm64: tlb: call kvm_call_hyp once during kvm_tlb_flush_vmid_range
@ 2026-02-12 11:54 yezhenyu (A)
  0 siblings, 0 replies; 5+ messages in thread
From: yezhenyu (A) @ 2026-02-12 11:54 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: rananta@google.com, will@kernel.org, oliver.upton@linux.dev,
	catalin.marinas@arm.com, dmatlack@google.com,
	linux-kernel@vger.kernel.org, kvmarm@lists.linux.dev,
	linux-arm-kernel@lists.infradead.org, zhengchuan, Xiexiangyou,
	guoqixin (A), Mawen (Wayne)



On 2026/2/9 22:35, Marc Zyngier wrote:
> On Mon, 09 Feb 2026 13:14:07 +0000,
> "yezhenyu (A)" <yezhenyu2@huawei.com> wrote:
>>
>>  From 9982be89f55bd99b3683337223284f0011ed248e Mon Sep 17 00:00:00 2001
>> From: eillon <yezhenyu2@huawei.com>
>> Date: Mon, 9 Feb 2026 19:48:46 +0800
>> Subject: [RFC][PATCH v1] arm64: tlb: call kvm_call_hyp once during
>>   kvm_tlb_flush_vmid_range
>>
>> The kvm_tlb_flush_vmid_range() function is performance-critical
>> during live migration, but there is a while loop when the system
>> support flush tlb by range when the size is larger than MAX_TLBI_RANGE_PAGES.
>>
>> This results in frequent entry to kvm_call_hyp() and then a large
> 
> What is the cost of kvm_call_hyp()?

Most of the cost of kvm_tlb_flush_vmid_range() is in __tlb_switch_to_host(),
which is called on every __kvm_tlb_flush_vmid()/__kvm_tlb_flush_vmid_range()
hypercall.

> 
>> amount of time is spent in kvm_clear_dirty_log_protect() during
>> migration(more than 50%).
> 
> 50% of what time? The guest's run-time? The time spent doing TLBIs
> compared to the time spent in kvm_clear_dirty_log_protect()?
> 

kvm_clear_dirty_log_protect() costs more than 50% of the time spent in
ram_find_and_save_block(), though not on every invocation.
I captured a flame graph during the live migration, and the distribution
of several key functions is as follows (sorry, I cannot transfer the perf
data or the SVG files outside my company):

    ram_find_and_save_block(): 84.01%
        memory_region_clear_dirty_bitmap(): 33.40%
            kvm_clear_dirty_log_protect(): 26.74%
                kvm_arch_flush_remote_tlbs_range(): 9.67%
                    __tlb_switch_to_host(): 9.51%
                kvm_arch_mmu_enable_log_dirty_pt_masked(): 9.38%
        ram_save_target_page_legacy(): 43.41%

The memory_region_clear_dirty_bitmap() costs about 40% of
ram_find_and_save_block(), and kvm_arch_flush_remote_tlbs_range()
costs about 29% of memory_region_clear_dirty_bitmap().

After applying the patch, the distribution of the same functions is as follows:

    ram_find_and_save_block(): 53.84%
        memory_region_clear_dirty_bitmap(): 2.28%
            kvm_clear_dirty_log_protect(): 1.75%
                kvm_arch_flush_remote_tlbs_range(): 0.03%
                    __tlb_switch_to_host(): 0.03%
                kvm_arch_mmu_enable_log_dirty_pt_masked(): 0.96%
        ram_save_target_page_legacy(): 38.97%

The memory_region_clear_dirty_bitmap() now costs about 4% of
ram_find_and_save_block(), and kvm_arch_flush_remote_tlbs_range()
about 1% of memory_region_clear_dirty_bitmap().


>> So, when the address range is large than
>> MAX_TLBI_RANGE_PAGES, directly call __kvm_tlb_flush_vmid to
>> optimize performance.
> 
> Multiple things here:
> 
> - there is no SoB, which means that patch cannot be considered for
>    merging

If there are no other issues with this patch, I can resend it with the SoB (Signed-off-by) tag.

> 
> - there is no data showing how this change improves the situation for
>    a large enough set of workloads
> 
> - there is no description of a test that could be run on multiple
>    implementations to check whether this change has a positive or
>    negative impact
> 

This patch affects the migration bandwidth during live migration.
With the same physical bandwidth, the effect of this patch can be
observed by monitoring the actual live-migration bandwidth.

I have tested this in an RDMA-like environment where the physical
bandwidth is about 100 GBps; without this patch, the migration
bandwidth stays below 10 GBps, and with this patch applied it can
reach 50 GBps.

> If you want to progress this sort of things, you will need to address
> these points.
> 
> Thanks,
> 
> 	M.
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-02-16 13:05 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-09 13:14 [RFC][PATCH] arm64: tlb: call kvm_call_hyp once during kvm_tlb_flush_vmid_range yezhenyu (A)
2026-02-09 14:35 ` Marc Zyngier
2026-02-12 12:02   ` yezhenyu (A)
2026-02-16 13:05     ` Marc Zyngier
  -- strict thread matches above, loose matches on Subject: below --
2026-02-12 11:54 yezhenyu (A)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox