public inbox for kvm@vger.kernel.org
From: Dave Hansen <dave.hansen@intel.com>
To: Lance Yang <lance.yang@linux.dev>, akpm@linux-foundation.org
Cc: peterz@infradead.org, david@kernel.org,
	dave.hansen@linux.intel.com, ypodemsk@redhat.com,
	hughd@google.com, will@kernel.org, aneesh.kumar@kernel.org,
	npiggin@gmail.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, x86@kernel.org, hpa@zytor.com, arnd@arndb.de,
	ljs@kernel.org, ziy@nvidia.com, baolin.wang@linux.alibaba.com,
	Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
	dev.jain@arm.com, baohua@kernel.org, shy828301@gmail.com,
	riel@surriel.com, jannh@google.com, jgross@suse.com,
	seanjc@google.com, pbonzini@redhat.com,
	boris.ostrovsky@oracle.com, virtualization@lists.linux.dev,
	kvm@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	ioworker0@gmail.com
Subject: Re: [PATCH 7.2 v9 2/2] x86/tlb: skip redundant sync IPIs for native TLB flush
Date: Thu, 23 Apr 2026 10:56:03 -0700	[thread overview]
Message-ID: <f856051b-10c7-4d65-9dbe-6b1677af74bd@intel.com> (raw)
In-Reply-To: <20260420030851.6735-3-lance.yang@linux.dev>

[-- Attachment #1: Type: text/plain, Size: 2040 bytes --]

On 4/19/26 20:08, Lance Yang wrote:
> -	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
> +	/*
> +	 * Treat unshared_tables just like freed_tables, such that lazy-TLB
> +	 * CPUs also receive IPIs during unsharing of page tables, allowing
> +	 * us to safely implement tlb_table_flush_implies_ipi_broadcast().
> +	 */
> +	flush_tlb_mm_range(tlb->mm, start, end, stride_shift,
> +			   tlb->freed_tables || tlb->unshared_tables);
>  }

I've been staring at this trying to make sense of it for too long.

Right now, flush_tlb_mm_range() literally has an argument named
"freed_tables" and "tlb->freed_tables" is passed there. That seems
totally sane. It's 100% straightforward to follow.

But it makes zero logical sense to me to now mix "tlb->unshared_tables"
in there. Sure, what you _want_ is the freed_tables==1 behavior when
tlb->unshared_tables==1, and this obviously hacks that in there, but
it's not explained well enough and not maintainable like this. IOW, it's
still just a hack.

I think what's happened here is that info->freed_tables is no longer
strictly about page table freeing. It has effectively become a bit that
tells TLB flushing implementations whether they are allowed to skip
CPUs in lazy TLB mode.

It's mentioned in the comment, but then never reflected in the code.

Shouldn't we be doing something like the attached patch? Look at how
that maps over to the flushing side, like in the hyperv code:

> -       bool do_lazy = !info->freed_tables;
> +       bool do_lazy = !info->wake_lazy_cpus;
>  
>         trace_hyperv_mmu_flush_tlb_multi(cpus, info);
>  
> @@ -198,7 +198,7 @@ static u64 hyperv_flush_tlb_others_ex(co
>  
>         flush->hv_vp_set.format = HV_GENERIC_SET_SPARSE_4K;
>         nr_bank = cpumask_to_vpset_skip(&flush->hv_vp_set, cpus,
> -                       info->freed_tables ? NULL : cpu_is_lazy);
> +                       info->wake_lazy_cpus ? NULL : cpu_is_lazy);

That even makes the hyperv code easier to read than what was there
before, IMNHO.

Thoughts?

[-- Attachment #2: flush_tlb_mm_range-lazy.patch --]
[-- Type: text/x-patch, Size: 6441 bytes --]



---

 b/arch/x86/hyperv/mmu.c           |    4 ++--
 b/arch/x86/include/asm/tlb.h      |   11 ++++++++++-
 b/arch/x86/include/asm/tlbflush.h |    4 ++--
 b/arch/x86/mm/tlb.c               |   29 +++++++++++++----------------
 4 files changed, 27 insertions(+), 21 deletions(-)

diff -puN arch/x86/mm/tlb.c~flush_tlb_mm_range-lazy arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~flush_tlb_mm_range-lazy	2026-04-23 10:37:49.745839224 -0700
+++ b/arch/x86/mm/tlb.c	2026-04-23 10:45:25.670880226 -0700
@@ -1339,16 +1339,12 @@ STATIC_NOPV void native_flush_tlb_multi(
 				(info->end - info->start) >> PAGE_SHIFT);
 
 	/*
-	 * If no page tables were freed, we can skip sending IPIs to
-	 * CPUs in lazy TLB mode. They will flush the CPU themselves
-	 * at the next context switch.
-	 *
-	 * However, if page tables are getting freed, we need to send the
-	 * IPI everywhere, to prevent CPUs in lazy TLB mode from tripping
-	 * up on the new contents of what used to be page tables, while
-	 * doing a speculative memory access.
+	 * Simple TLB flushes can avoid sending IPIs to CPUs in lazy
+	 * TLB mode. But some operations like freeing page tables
+	 * could leave dangerous state in paging structure caches.
+	 * Send IPIs even to lazy CPUs when necessary.
 	 */
-	if (info->freed_tables || mm_in_asid_transition(info->mm))
+	if (info->wake_lazy_cpus || mm_in_asid_transition(info->mm))
 		on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
 	else
 		on_each_cpu_cond_mask(should_flush_tlb, flush_tlb_func,
@@ -1381,7 +1377,7 @@ static DEFINE_PER_CPU(unsigned int, flus
 
 static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
 			unsigned long start, unsigned long end,
-			unsigned int stride_shift, bool freed_tables,
+			unsigned int stride_shift, bool wake_lazy_cpus,
 			u64 new_tlb_gen)
 {
 	struct flush_tlb_info *info = this_cpu_ptr(&flush_tlb_info);
@@ -1408,7 +1404,7 @@ static struct flush_tlb_info *get_flush_
 	info->end		= end;
 	info->mm		= mm;
 	info->stride_shift	= stride_shift;
-	info->freed_tables	= freed_tables;
+	info->wake_lazy_cpus	= wake_lazy_cpus;
 	info->new_tlb_gen	= new_tlb_gen;
 	info->initiating_cpu	= smp_processor_id();
 	info->trim_cpumask	= 0;
@@ -1427,7 +1423,7 @@ static void put_flush_tlb_info(void)
 
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
-				bool freed_tables)
+				bool wake_lazy_cpus)
 {
 	struct flush_tlb_info *info;
 	int cpu = get_cpu();
@@ -1436,7 +1432,7 @@ void flush_tlb_mm_range(struct mm_struct
 	/* This is also a barrier that synchronizes with switch_mm(). */
 	new_tlb_gen = inc_mm_tlb_gen(mm);
 
-	info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
+	info = get_flush_tlb_info(mm, start, end, stride_shift, wake_lazy_cpus,
 				  new_tlb_gen);
 
 	/*
@@ -1528,10 +1524,11 @@ static void kernel_tlb_flush_range(struc
 void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
 	struct flush_tlb_info *info;
+	bool wake_lazy_cpus = false;
 
 	guard(preempt)();
 
-	info = get_flush_tlb_info(NULL, start, end, PAGE_SHIFT, false,
+	info = get_flush_tlb_info(NULL, start, end, PAGE_SHIFT, wake_lazy_cpus,
 				  TLB_GENERATION_INVALID);
 
 	if (info->end == TLB_FLUSH_ALL)
@@ -1708,10 +1705,10 @@ EXPORT_SYMBOL_FOR_KVM(__flush_tlb_all);
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
 	struct flush_tlb_info *info;
-
+	bool wake_lazy_cpus = false;
 	int cpu = get_cpu();
 
-	info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false,
+	info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, wake_lazy_cpus,
 				  TLB_GENERATION_INVALID);
 	/*
 	 * flush_tlb_multi() is not optimized for the common case in which only
diff -puN arch/x86/include/asm/tlbflush.h~flush_tlb_mm_range-lazy arch/x86/include/asm/tlbflush.h
--- a/arch/x86/include/asm/tlbflush.h~flush_tlb_mm_range-lazy	2026-04-23 10:38:02.088295820 -0700
+++ b/arch/x86/include/asm/tlbflush.h	2026-04-23 10:39:40.979965863 -0700
@@ -247,7 +247,7 @@ struct flush_tlb_info {
 	u64			new_tlb_gen;
 	unsigned int		initiating_cpu;
 	u8			stride_shift;
-	u8			freed_tables;
+	u8			wake_lazy_cpus;
 	u8			trim_cpumask;
 };
 
@@ -337,7 +337,7 @@ static inline bool mm_in_asid_transition
 extern void flush_tlb_all(void);
 extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
-				bool freed_tables);
+				bool wake_lazy_cpus);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
diff -puN arch/x86/include/asm/tlb.h~flush_tlb_mm_range-lazy arch/x86/include/asm/tlb.h
--- a/arch/x86/include/asm/tlb.h~flush_tlb_mm_range-lazy	2026-04-23 10:47:01.221483878 -0700
+++ b/arch/x86/include/asm/tlb.h	2026-04-23 10:49:26.746985616 -0700
@@ -14,13 +14,22 @@ static inline void tlb_flush(struct mmu_
 {
 	unsigned long start = 0UL, end = TLB_FLUSH_ALL;
 	unsigned int stride_shift = tlb_get_unmap_shift(tlb);
+	bool wake_lazy_cpus;
 
 	if (!tlb->fullmm && !tlb->need_flush_all) {
 		start = tlb->start;
 		end = tlb->end;
 	}
 
-	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
+	/*
+	 * Ensure all paging structure caches on all CPUs are flushed
+	 * when freeing page tables. Otherwise, a lazy CPU might wake
+	 * up and start walking previously-freed page tables and
+	 * caching garbage.
+	 */
+	wake_lazy_cpus = tlb->freed_tables;
+
+	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, wake_lazy_cpus);
 }
 
 static inline void invlpg(unsigned long addr)
diff -puN arch/x86/hyperv/mmu.c~flush_tlb_mm_range-lazy arch/x86/hyperv/mmu.c
--- a/arch/x86/hyperv/mmu.c~flush_tlb_mm_range-lazy	2026-04-23 10:53:05.251268911 -0700
+++ b/arch/x86/hyperv/mmu.c	2026-04-23 10:53:28.622156121 -0700
@@ -63,7 +63,7 @@ static void hyperv_flush_tlb_multi(const
 	struct hv_tlb_flush *flush;
 	u64 status;
 	unsigned long flags;
-	bool do_lazy = !info->freed_tables;
+	bool do_lazy = !info->wake_lazy_cpus;
 
 	trace_hyperv_mmu_flush_tlb_multi(cpus, info);
 
@@ -198,7 +198,7 @@ static u64 hyperv_flush_tlb_others_ex(co
 
 	flush->hv_vp_set.format = HV_GENERIC_SET_SPARSE_4K;
 	nr_bank = cpumask_to_vpset_skip(&flush->hv_vp_set, cpus,
-			info->freed_tables ? NULL : cpu_is_lazy);
+			info->wake_lazy_cpus ? NULL : cpu_is_lazy);
 	if (nr_bank < 0)
 		return HV_STATUS_INVALID_PARAMETER;
 
_

Thread overview: 5+ messages
2026-04-20  3:08 [PATCH 7.2 v9 0/2] skip redundant sync IPIs when TLB flush sent them Lance Yang
2026-04-20  3:08 ` [PATCH 7.2 v9 1/2] mm/mmu_gather: prepare to skip redundant sync IPIs Lance Yang
2026-04-20  3:08 ` [PATCH 7.2 v9 2/2] x86/tlb: skip redundant sync IPIs for native TLB flush Lance Yang
2026-04-23 17:56   ` Dave Hansen [this message]
2026-04-23 19:44     ` David Hildenbrand (Arm)