From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 18 Mar 2026 18:21:43 +0100
From: Sebastian Andrzej Siewior
To: Chuyi Zhou, Nadav Amit
Cc: tglx@linutronix.de, mingo@redhat.com, luto@kernel.org,
 peterz@infradead.org, paulmck@kernel.org, muchun.song@linux.dev,
 bp@alien8.de, dave.hansen@linux.intel.com, pbonzini@redhat.com,
 clrkwllms@kernel.org, rostedt@goodmis.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 10/12] x86/mm: Move flush_tlb_info back to
 the stack
Message-ID: <20260318172143.ICooJ3-U@linutronix.de>
References: <20260318045638.1572777-1-zhouchuyi@bytedance.com>
 <20260318045638.1572777-11-zhouchuyi@bytedance.com>
In-Reply-To: <20260318045638.1572777-11-zhouchuyi@bytedance.com>

+Nadav, orig post
https://lore.kernel.org/all/20260318045638.1572777-11-zhouchuyi@bytedance.com/

On 2026-03-18 12:56:36 [+0800], Chuyi Zhou wrote:
> Commit 3db6d5a5ecaf ("x86/mm/tlb: Remove 'struct flush_tlb_info' from the
> stack") converted flush_tlb_info from a stack variable to a per-CPU
> variable. This brought a performance improvement of around 3% in extreme
> tests. However, it also required that all flush_tlb* operations keep
> preemption disabled the entire time to prevent concurrent modification of
> flush_tlb_info. flush_tlb* needs to send IPIs to remote CPUs and
> synchronously wait for all remote CPUs to complete their local TLB
> flushes. That can take tens of milliseconds when interrupts are disabled
> or with a large number of remote CPUs.
…

PeterZ wasn't too happy about reversing this. The snippet below results in
the following assembly:

| 0000000000001ab0 :
| …
| 1ac9:   48 89 e5                mov    %rsp,%rbp
| 1acc:   48 83 e4 c0             and    $0xffffffffffffffc0,%rsp
| 1ad0:   48 83 ec 40             sub    $0x40,%rsp

so the compiler aligns the on-stack variable properly, which should give
the same cache-line behaviour as the per-CPU variable. I'm not sure about
the virtual-to-physical translation of the variables, i.e. TLB misses,
since here we have a virtually mapped stack and there we have virtually
mapped per-CPU memory.
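As an aside, here is a minimal userspace sketch (my own illustration, not
from the patch) of the alignment behaviour this relies on. demo_info, its
members and the hard-coded 64 are made up; only the aligned() annotation
mirrors the hack:

/*
 * Standalone sketch, not kernel code: an aligned() attribute on the
 * struct type forces the compiler to realign the stack pointer, as in
 * the "and $0xffffffffffffffc0,%rsp" above.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define CACHELINE 64    /* assumed x86-64 cache-line size */

struct demo_info {      /* illustrative stand-in for flush_tlb_info */
        unsigned long start;
        unsigned long end;
        uint64_t new_tlb_gen;
} __attribute__((aligned(CACHELINE)));

int main(void)
{
        struct demo_info info = { .start = 0x1000, .end = 0x2000 };

        /*
         * The struct starts on a cache-line boundary, so its (small)
         * contents never straddle two cache lines.
         */
        assert(((uintptr_t)&info & (CACHELINE - 1)) == 0);
        printf("&info = %p\n", (void *)&info);
        return 0;
}

Built with gcc -O2, main() should show the same stack realignment as the
disassembly above.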
Anyway, below is my quick hack. Does this work, or is it still a no? I
have no numbers, so…

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 5a3cdc439e38d..4a7f40c7f939a 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -227,7 +227,7 @@ struct flush_tlb_info {
         u8                      stride_shift;
         u8                      freed_tables;
         u8                      trim_cpumask;
-};
+} __aligned(SMP_CACHE_BYTES);
 
 void flush_tlb_local(void);
 void flush_tlb_one_user(unsigned long addr);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 621e09d049cb9..99b70e94ec281 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1394,28 +1394,12 @@ void flush_tlb_multi(const struct cpumask *cpumask,
  */
 unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
 
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct flush_tlb_info, flush_tlb_info);
-
-#ifdef CONFIG_DEBUG_VM
-static DEFINE_PER_CPU(unsigned int, flush_tlb_info_idx);
-#endif
-
-static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
-                       unsigned long start, unsigned long end,
-                       unsigned int stride_shift, bool freed_tables,
-                       u64 new_tlb_gen)
+static void get_flush_tlb_info(struct flush_tlb_info *info,
+                              struct mm_struct *mm,
+                              unsigned long start, unsigned long end,
+                              unsigned int stride_shift, bool freed_tables,
+                              u64 new_tlb_gen)
 {
-        struct flush_tlb_info *info = this_cpu_ptr(&flush_tlb_info);
-
-#ifdef CONFIG_DEBUG_VM
-        /*
-         * Ensure that the following code is non-reentrant and flush_tlb_info
-         * is not overwritten. This means no TLB flushing is initiated by
-         * interrupt handlers and machine-check exception handlers.
-         */
-        BUG_ON(this_cpu_inc_return(flush_tlb_info_idx) != 1);
-#endif
-
         /*
          * If the number of flushes is so large that a full flush
          * would be faster, do a full flush.
@@ -1433,8 +1417,6 @@ static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
         info->new_tlb_gen       = new_tlb_gen;
         info->initiating_cpu    = smp_processor_id();
         info->trim_cpumask      = 0;
-
-        return info;
 }
 
 static void put_flush_tlb_info(void)
@@ -1450,15 +1432,16 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
                                 unsigned long end, unsigned int stride_shift,
                                 bool freed_tables)
 {
-        struct flush_tlb_info *info;
+        struct flush_tlb_info _info;
+        struct flush_tlb_info *info = &_info;
         int cpu = get_cpu();
         u64 new_tlb_gen;
 
         /* This is also a barrier that synchronizes with switch_mm(). */
         new_tlb_gen = inc_mm_tlb_gen(mm);
 
-        info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
-                                  new_tlb_gen);
+        get_flush_tlb_info(&_info, mm, start, end, stride_shift, freed_tables,
+                           new_tlb_gen);
 
         /*
          * flush_tlb_multi() is not optimized for the common case in which only
@@ -1548,17 +1531,15 @@ static void kernel_tlb_flush_range(struct flush_tlb_info *info)
 
 void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
-        struct flush_tlb_info *info;
+        struct flush_tlb_info info;
 
-        guard(preempt)();
+        get_flush_tlb_info(&info, NULL, start, end, PAGE_SHIFT, false,
+                           TLB_GENERATION_INVALID);
 
-        info = get_flush_tlb_info(NULL, start, end, PAGE_SHIFT, false,
-                                  TLB_GENERATION_INVALID);
-
-        if (info->end == TLB_FLUSH_ALL)
-                kernel_tlb_flush_all(info);
+        if (info.end == TLB_FLUSH_ALL)
+                kernel_tlb_flush_all(&info);
         else
-                kernel_tlb_flush_range(info);
+                kernel_tlb_flush_range(&info);
 
         put_flush_tlb_info();
 }
@@ -1728,12 +1709,11 @@ EXPORT_SYMBOL_FOR_KVM(__flush_tlb_all);
 
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
-        struct flush_tlb_info *info;
+        struct flush_tlb_info info;
 
         int cpu = get_cpu();
-
-        info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false,
-                                  TLB_GENERATION_INVALID);
+        get_flush_tlb_info(&info, NULL, 0, TLB_FLUSH_ALL, 0, false,
+                           TLB_GENERATION_INVALID);
         /*
          * flush_tlb_multi() is not optimized for the common case in which only
          * a local TLB flush is needed. Optimize this use-case by calling
@@ -1743,11 +1723,11 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
                 invlpgb_flush_all_nonglobals();
                 batch->unmapped_pages = false;
         } else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
-                flush_tlb_multi(&batch->cpumask, info);
+                flush_tlb_multi(&batch->cpumask, &info);
         } else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
                 lockdep_assert_irqs_enabled();
                 local_irq_disable();
-                flush_tlb_func(info);
+                flush_tlb_func(&info);
                 local_irq_enable();
         }
 