From: Dave Hansen <dave.hansen@intel.com>
To: Lance Yang <lance.yang@linux.dev>, akpm@linux-foundation.org
Cc: peterz@infradead.org, david@kernel.org,
dave.hansen@linux.intel.com, ypodemsk@redhat.com,
hughd@google.com, will@kernel.org, aneesh.kumar@kernel.org,
npiggin@gmail.com, tglx@linutronix.de, mingo@redhat.com,
bp@alien8.de, x86@kernel.org, hpa@zytor.com, arnd@arndb.de,
ljs@kernel.org, ziy@nvidia.com, baolin.wang@linux.alibaba.com,
Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
dev.jain@arm.com, baohua@kernel.org, shy828301@gmail.com,
riel@surriel.com, jannh@google.com, jgross@suse.com,
seanjc@google.com, pbonzini@redhat.com,
boris.ostrovsky@oracle.com, virtualization@lists.linux.dev,
kvm@vger.kernel.org, linux-arch@vger.kernel.org,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
ioworker0@gmail.com
Subject: Re: [PATCH 7.2 v9 2/2] x86/tlb: skip redundant sync IPIs for native TLB flush
Date: Thu, 23 Apr 2026 10:56:03 -0700 [thread overview]
Message-ID: <f856051b-10c7-4d65-9dbe-6b1677af74bd@intel.com> (raw)
In-Reply-To: <20260420030851.6735-3-lance.yang@linux.dev>
[-- Attachment #1: Type: text/plain, Size: 2040 bytes --]
On 4/19/26 20:08, Lance Yang wrote:
> - flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
> + /*
> + * Treat unshared_tables just like freed_tables, such that lazy-TLB
> + * CPUs also receive IPIs during unsharing of page tables, allowing
> + * us to safely implement tlb_table_flush_implies_ipi_broadcast().
> + */
> + flush_tlb_mm_range(tlb->mm, start, end, stride_shift,
> + tlb->freed_tables || tlb->unshared_tables);
> }
I've been staring at this trying to make sense of it for too long.
Right now, flush_tlb_mm_range() literally has an argument named
"freed_tables" and "tlb->freed_tables" is passed there. That seems
totally sane. It's 100% straightforward to follow.
But it makes zero logical sense to me to now mix "tlb->unshared_tables"
in there. Sure, what you _want_ is the freed_tables==1 behavior from
tlb->unshared_tables==1, and this obviously hacks that in there, but
it's not explained well enough and not maintainable like this. IOW, it's
still just a hack.
I think what's happened here is that info->freed_tables is being
repurposed: it is no longer strictly related to page table freeing, and
has become a bit which tells TLB flushing implementations whether they
can respect CPUs in lazy TLB mode.
It's mentioned in the comment, but then never reflected in the code.
Shouldn't we be doing something like the attached patch? Look at how
that maps over to the flushing side, like in the hyperv code:
> - bool do_lazy = !info->freed_tables;
> + bool do_lazy = !info->wake_lazy_cpus;
>
> trace_hyperv_mmu_flush_tlb_multi(cpus, info);
>
> @@ -198,7 +198,7 @@ static u64 hyperv_flush_tlb_others_ex(co
>
> flush->hv_vp_set.format = HV_GENERIC_SET_SPARSE_4K;
> nr_bank = cpumask_to_vpset_skip(&flush->hv_vp_set, cpus,
> - info->freed_tables ? NULL : cpu_is_lazy);
> + info->wake_lazy_cpus ? NULL : cpu_is_lazy);
That even makes the hyperv code easier to read than what was there
before, IMNHO.
Thoughts?
[-- Attachment #2: flush_tlb_mm_range-lazy.patch --]
[-- Type: text/x-patch, Size: 6441 bytes --]
---
b/arch/x86/hyperv/mmu.c | 4 ++--
b/arch/x86/include/asm/tlb.h | 11 ++++++++++-
b/arch/x86/include/asm/tlbflush.h | 4 ++--
b/arch/x86/mm/tlb.c | 29 +++++++++++++----------------
4 files changed, 27 insertions(+), 21 deletions(-)
diff -puN arch/x86/mm/tlb.c~flush_tlb_mm_range-lazy arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~flush_tlb_mm_range-lazy 2026-04-23 10:37:49.745839224 -0700
+++ b/arch/x86/mm/tlb.c 2026-04-23 10:45:25.670880226 -0700
@@ -1339,16 +1339,12 @@ STATIC_NOPV void native_flush_tlb_multi(
(info->end - info->start) >> PAGE_SHIFT);
/*
- * If no page tables were freed, we can skip sending IPIs to
- * CPUs in lazy TLB mode. They will flush the CPU themselves
- * at the next context switch.
- *
- * However, if page tables are getting freed, we need to send the
- * IPI everywhere, to prevent CPUs in lazy TLB mode from tripping
- * up on the new contents of what used to be page tables, while
- * doing a speculative memory access.
+ * Simple TLB flushes can avoid sending IPIs to CPUs in lazy
+ * TLB mode. But some operations like freeing page tables
+ * could leave dangerous state in paging structure caches.
+ * Send IPIs even to lazy CPUs when necessary.
*/
- if (info->freed_tables || mm_in_asid_transition(info->mm))
+ if (info->wake_lazy_cpus || mm_in_asid_transition(info->mm))
on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
else
on_each_cpu_cond_mask(should_flush_tlb, flush_tlb_func,
@@ -1381,7 +1377,7 @@ static DEFINE_PER_CPU(unsigned int, flus
static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
unsigned long start, unsigned long end,
- unsigned int stride_shift, bool freed_tables,
+ unsigned int stride_shift, bool wake_lazy_cpus,
u64 new_tlb_gen)
{
struct flush_tlb_info *info = this_cpu_ptr(&flush_tlb_info);
@@ -1408,7 +1404,7 @@ static struct flush_tlb_info *get_flush_
info->end = end;
info->mm = mm;
info->stride_shift = stride_shift;
- info->freed_tables = freed_tables;
+ info->wake_lazy_cpus = wake_lazy_cpus;
info->new_tlb_gen = new_tlb_gen;
info->initiating_cpu = smp_processor_id();
info->trim_cpumask = 0;
@@ -1427,7 +1423,7 @@ static void put_flush_tlb_info(void)
void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
unsigned long end, unsigned int stride_shift,
- bool freed_tables)
+ bool wake_lazy_cpus)
{
struct flush_tlb_info *info;
int cpu = get_cpu();
@@ -1436,7 +1432,7 @@ void flush_tlb_mm_range(struct mm_struct
/* This is also a barrier that synchronizes with switch_mm(). */
new_tlb_gen = inc_mm_tlb_gen(mm);
- info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
+ info = get_flush_tlb_info(mm, start, end, stride_shift, wake_lazy_cpus,
new_tlb_gen);
/*
@@ -1528,10 +1524,11 @@ static void kernel_tlb_flush_range(struc
void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
struct flush_tlb_info *info;
+ bool wake_lazy_cpus = false;
guard(preempt)();
- info = get_flush_tlb_info(NULL, start, end, PAGE_SHIFT, false,
+ info = get_flush_tlb_info(NULL, start, end, PAGE_SHIFT, wake_lazy_cpus,
TLB_GENERATION_INVALID);
if (info->end == TLB_FLUSH_ALL)
@@ -1708,10 +1705,10 @@ EXPORT_SYMBOL_FOR_KVM(__flush_tlb_all);
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
struct flush_tlb_info *info;
-
+ bool wake_lazy_cpus = false;
int cpu = get_cpu();
- info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false,
+ info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, wake_lazy_cpus,
TLB_GENERATION_INVALID);
/*
* flush_tlb_multi() is not optimized for the common case in which only
diff -puN arch/x86/include/asm/tlbflush.h~flush_tlb_mm_range-lazy arch/x86/include/asm/tlbflush.h
--- a/arch/x86/include/asm/tlbflush.h~flush_tlb_mm_range-lazy 2026-04-23 10:38:02.088295820 -0700
+++ b/arch/x86/include/asm/tlbflush.h 2026-04-23 10:39:40.979965863 -0700
@@ -247,7 +247,7 @@ struct flush_tlb_info {
u64 new_tlb_gen;
unsigned int initiating_cpu;
u8 stride_shift;
- u8 freed_tables;
+ u8 wake_lazy_cpus;
u8 trim_cpumask;
};
@@ -337,7 +337,7 @@ static inline bool mm_in_asid_transition
extern void flush_tlb_all(void);
extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
unsigned long end, unsigned int stride_shift,
- bool freed_tables);
+ bool wake_lazy_cpus);
extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
diff -puN arch/x86/include/asm/tlb.h~flush_tlb_mm_range-lazy arch/x86/include/asm/tlb.h
--- a/arch/x86/include/asm/tlb.h~flush_tlb_mm_range-lazy 2026-04-23 10:47:01.221483878 -0700
+++ b/arch/x86/include/asm/tlb.h 2026-04-23 10:49:26.746985616 -0700
@@ -14,13 +14,22 @@ static inline void tlb_flush(struct mmu_
{
unsigned long start = 0UL, end = TLB_FLUSH_ALL;
unsigned int stride_shift = tlb_get_unmap_shift(tlb);
+ bool wake_lazy_cpus;
if (!tlb->fullmm && !tlb->need_flush_all) {
start = tlb->start;
end = tlb->end;
}
- flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
+ /*
+ * Ensure all paging structure caches on all CPUs are flushed
+ * when freeing page tables. Otherwise, a lazy CPU might wake
+ * up and start walking previously-freed page tables and
+ * caching garbage.
+ */
+ wake_lazy_cpus = tlb->freed_tables;
+
+ flush_tlb_mm_range(tlb->mm, start, end, stride_shift, wake_lazy_cpus);
}
static inline void invlpg(unsigned long addr)
diff -puN arch/x86/hyperv/mmu.c~flush_tlb_mm_range-lazy arch/x86/hyperv/mmu.c
--- a/arch/x86/hyperv/mmu.c~flush_tlb_mm_range-lazy 2026-04-23 10:53:05.251268911 -0700
+++ b/arch/x86/hyperv/mmu.c 2026-04-23 10:53:28.622156121 -0700
@@ -63,7 +63,7 @@ static void hyperv_flush_tlb_multi(const
struct hv_tlb_flush *flush;
u64 status;
unsigned long flags;
- bool do_lazy = !info->freed_tables;
+ bool do_lazy = !info->wake_lazy_cpus;
trace_hyperv_mmu_flush_tlb_multi(cpus, info);
@@ -198,7 +198,7 @@ static u64 hyperv_flush_tlb_others_ex(co
flush->hv_vp_set.format = HV_GENERIC_SET_SPARSE_4K;
nr_bank = cpumask_to_vpset_skip(&flush->hv_vp_set, cpus,
- info->freed_tables ? NULL : cpu_is_lazy);
+ info->wake_lazy_cpus ? NULL : cpu_is_lazy);
if (nr_bank < 0)
return HV_STATUS_INVALID_PARAMETER;
_