From mboxrd@z Thu Jan 1 00:00:00 1970
From: Valentin Schneider <vschneid@redhat.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
 "H. Peter Anvin", Andy Lutomirski, Peter Zijlstra,
 Arnaldo Carvalho de Melo, Josh Poimboeuf, Paolo Bonzini, Arnd Bergmann,
 Frederic Weisbecker, "Paul E. McKenney", Jason Baron, Steven Rostedt,
 Ard Biesheuvel, Sami Tolvanen, "David S. Miller", Neeraj Upadhyay,
 Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
 Mathieu Desnoyers, Mel Gorman, Andrew Morton, Masahiro Yamada,
 Han Shen, Rik van Riel, Jann Horn, Dan Carpenter, Oleg Nesterov,
 Juri Lelli, Clark Williams, Tomas Glozar, Yair Podemsky,
 Marcelo Tosatti, Daniel Wagner, Petr Tesarik, Shrikanth Hegde
Peter Anvin" , Andy Lutomirski , Peter Zijlstra , Arnaldo Carvalho de Melo , Josh Poimboeuf , Paolo Bonzini , Arnd Bergmann , Frederic Weisbecker , "Paul E. McKenney" , Jason Baron , Steven Rostedt , Ard Biesheuvel , Sami Tolvanen , "David S. Miller" , Neeraj Upadhyay , Joel Fernandes , Josh Triplett , Boqun Feng , Uladzislau Rezki , Mathieu Desnoyers , Mel Gorman , Andrew Morton , Masahiro Yamada , Han Shen , Rik van Riel , Jann Horn , Dan Carpenter , Oleg Nesterov , Juri Lelli , Clark Williams , Tomas Glozar , Yair Podemsky , Marcelo Tosatti , Daniel Wagner , Petr Tesarik , Shrikanth Hegde Subject: [PATCH v9 10/10] x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs when tracking CR3 switches Date: Tue, 5 May 2026 10:23:55 +0200 Message-ID: <20260505082355.1982003-11-vschneid@redhat.com> In-Reply-To: <20260505082355.1982003-1-vschneid@redhat.com> References: <20260505082355.1982003-1-vschneid@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.4 Previous commits have added a software signal that tracks which CR3 (kernel or user) is in use for any given CPU. Combined with: o the CR3 switch itself being a flush for non-global mappings o global mappings under kPTI being limited to the CEA and entry text we now have a way to safely defer (kernel) TLB flush IPIs targeting NOHZ_FULL CPUs executing in userspace (i.e. with the user CR3 loaded). When sending a kernel TLB flush IPI to a NOHZ_FULL CPU, check whether it is using the user CR3, and if it is, do not interrupt it and instead rely on the CR3 write that happens when switching to the kernel CR3. Signed-off-by: Valentin Schneider --- arch/x86/include/asm/tlbflush.h | 1 + arch/x86/mm/tlb.c | 49 ++++++++++++++++++++++++++++----- mm/vmalloc.c | 30 ++++++++++++++++---- 3 files changed, 68 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 0ec669eb0b4e7..824304c08cd95 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -22,6 +22,7 @@ DECLARE_PER_CPU_PAGE_ALIGNED(bool, kernel_cr3_loaded); #endif void __flush_tlb_all(void); +void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end); #define TLB_FLUSH_ALL -1UL #define TLB_GENERATION_INVALID 0 diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index af43d177087e7..68bcccace0659 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include @@ -1509,23 +1510,24 @@ static void do_kernel_range_flush(void *info) flush_tlb_one_kernel(addr); } -static void kernel_tlb_flush_all(struct flush_tlb_info *info) +static void kernel_tlb_flush_all(smp_cond_func_t cond, struct flush_tlb_info *info) { if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) invlpgb_flush_all(); else - on_each_cpu(do_flush_tlb_all, NULL, 1); + on_each_cpu_cond(cond, do_flush_tlb_all, NULL, 1); } -static void kernel_tlb_flush_range(struct flush_tlb_info *info) +static void kernel_tlb_flush_range(smp_cond_func_t cond, struct flush_tlb_info *info) { if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) invlpgb_kernel_range_flush(info); else - on_each_cpu(do_kernel_range_flush, info, 1); + on_each_cpu_cond(cond, do_kernel_range_flush, info, 1); } -void flush_tlb_kernel_range(unsigned long start, unsigned long end) +static inline void +__flush_tlb_kernel_range(smp_cond_func_t cond, unsigned long start, unsigned long end) { 
 arch/x86/include/asm/tlbflush.h |  1 +
 arch/x86/mm/tlb.c               | 49 ++++++++++++++++++++++++++++-----
 mm/vmalloc.c                    | 30 ++++++++++++++++----
 3 files changed, 68 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 0ec669eb0b4e7..824304c08cd95 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -22,6 +22,7 @@ DECLARE_PER_CPU_PAGE_ALIGNED(bool, kernel_cr3_loaded);
 #endif
 
 void __flush_tlb_all(void);
+void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end);
 
 #define TLB_FLUSH_ALL		-1UL
 #define TLB_GENERATION_INVALID	0
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index af43d177087e7..68bcccace0659 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -13,6 +13,7 @@
 #include <...>
 #include <...>
 #include <...>
+#include <linux/sched/isolation.h>
 
 #include <...>
 #include <...>
@@ -1509,23 +1510,24 @@ static void do_kernel_range_flush(void *info)
 		flush_tlb_one_kernel(addr);
 }
 
-static void kernel_tlb_flush_all(struct flush_tlb_info *info)
+static void kernel_tlb_flush_all(smp_cond_func_t cond, struct flush_tlb_info *info)
 {
 	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
 		invlpgb_flush_all();
 	else
-		on_each_cpu(do_flush_tlb_all, NULL, 1);
+		on_each_cpu_cond(cond, do_flush_tlb_all, NULL, 1);
 }
 
-static void kernel_tlb_flush_range(struct flush_tlb_info *info)
+static void kernel_tlb_flush_range(smp_cond_func_t cond, struct flush_tlb_info *info)
 {
 	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
 		invlpgb_kernel_range_flush(info);
 	else
-		on_each_cpu(do_kernel_range_flush, info, 1);
+		on_each_cpu_cond(cond, do_kernel_range_flush, info, 1);
 }
 
-void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+static inline void
+__flush_tlb_kernel_range(smp_cond_func_t cond, unsigned long start, unsigned long end)
 {
 	struct flush_tlb_info *info;
 
@@ -1535,13 +1537,46 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 				  TLB_GENERATION_INVALID);
 
 	if (info->end == TLB_FLUSH_ALL)
-		kernel_tlb_flush_all(info);
+		kernel_tlb_flush_all(cond, info);
 	else
-		kernel_tlb_flush_range(info);
+		kernel_tlb_flush_range(cond, info);
 
 	put_flush_tlb_info();
 }
 
+void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(NULL, start, end);
+}
+
+#ifdef CONFIG_TRACK_CR3
+static bool flush_tlb_kernel_cond(int cpu, void *info)
+{
+	/*
+	 * Send the IPI if the target CPU is a housekeeping one, or if it is
+	 * already executing in kernelspace.
+	 */
+	bool ret = housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE);
+
+	/*
+	 * Pairs with the LOCK in NOTE_KERNEL_CR3
+	 *
+	 * Ensures any previous operations are visible on a remote CPU
+	 * entering the kernel and setting @kernel_cr3_loaded, if this one
+	 * decides to defer the IPI.
+	 */
+	smp_mb();
+	ret |= per_cpu(kernel_cr3_loaded, cpu);
+
+	return ret;
+}
+
+void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(flush_tlb_kernel_cond, start, end);
+}
+#endif
+
 /*
  * This can be used from process context to figure out what the value of
  * CR3 is without needing to do a (slow) __read_cr3().
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index aa08651ec0df6..6276c8cb2be0d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -506,6 +506,26 @@ void vunmap_range_noflush(unsigned long start, unsigned long end)
 	__vunmap_range_noflush(start, end);
 }
 
+/*
+ * !!! BIG FAT WARNING !!!
+ *
+ * The CPU is free to cache any part of the paging hierarchy it wants at any
+ * time. It's also free to set accessed and dirty bits at any time, even for
+ * instructions that may never execute architecturally.
+ *
+ * This means that deferring a TLB flush affecting freed page-table-pages
+ * (IOW, keeping them in a CPU's paging hierarchy cache) is a recipe for
+ * disaster.
+ *
+ * This isn't a problem for deferral of TLB flushes in vmalloc, because
+ * page-table-pages used for vmap() mappings are never freed - see how
+ * __vunmap_range_noflush() walks the whole mapping but only clears the
+ * leaf PTEs. If this ever changes, TLB flush deferral will cause misery.
+ */
+void __weak flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+	flush_tlb_kernel_range(start, end);
+}
+
 /**
  * vunmap_range - unmap kernel virtual addresses
  * @addr: start of the VM area to unmap
@@ -519,7 +539,7 @@ void vunmap_range(unsigned long addr, unsigned long end)
 {
 	flush_cache_vunmap(addr, end);
 	vunmap_range_noflush(addr, end);
-	flush_tlb_kernel_range(addr, end);
+	flush_tlb_kernel_range_deferrable(addr, end);
 }
 
 static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
@@ -2373,7 +2393,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
 
 	nr_purge_nodes = cpumask_weight(&purge_nodes);
 	if (nr_purge_nodes > 0) {
-		flush_tlb_kernel_range(start, end);
+		flush_tlb_kernel_range_deferrable(start, end);
 
 		/* One extra worker is per a lazy_max_pages() full set minus one. */
 		nr_purge_helpers = atomic_long_read(&vmap_lazy_nr) / lazy_max_pages();
@@ -2476,7 +2496,7 @@ static void free_unmap_vmap_area(struct vmap_area *va)
 	flush_cache_vunmap(va->va_start, va->va_end);
 	vunmap_range_noflush(va->va_start, va->va_end);
 	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(va->va_start, va->va_end);
+		flush_tlb_kernel_range_deferrable(va->va_start, va->va_end);
 
 	free_vmap_area_noflush(va);
 }
@@ -2923,7 +2943,7 @@ static void vb_free(unsigned long addr, unsigned long size)
 	vunmap_range_noflush(addr, addr + size);
 
 	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(addr, addr + size);
+		flush_tlb_kernel_range_deferrable(addr, addr + size);
 
 	spin_lock(&vb->lock);
 
@@ -2988,7 +3008,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
 	free_purged_blocks(&purge_list);
 
 	if (!__purge_vmap_area_lazy(start, end, false) && flush)
-		flush_tlb_kernel_range(start, end);
+		flush_tlb_kernel_range_deferrable(start, end);
 	mutex_unlock(&vmap_purge_lock);
 }
 
-- 
2.52.0