From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 18 Mar 2026 18:21:43 +0100
From: Sebastian Andrzej Siewior
To: Chuyi Zhou, Nadav Amit
Cc: tglx@linutronix.de, mingo@redhat.com, luto@kernel.org,
 peterz@infradead.org, paulmck@kernel.org, muchun.song@linux.dev,
 bp@alien8.de, dave.hansen@linux.intel.com, pbonzini@redhat.com,
 clrkwllms@kernel.org, rostedt@goodmis.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 10/12] x86/mm: Move flush_tlb_info back to
 the stack
Message-ID: <20260318172143.ICooJ3-U@linutronix.de>
References: <20260318045638.1572777-1-zhouchuyi@bytedance.com>
 <20260318045638.1572777-11-zhouchuyi@bytedance.com>
In-Reply-To: <20260318045638.1572777-11-zhouchuyi@bytedance.com>

+Nadav, orig post
https://lore.kernel.org/all/20260318045638.1572777-11-zhouchuyi@bytedance.com/

On 2026-03-18 12:56:36 [+0800], Chuyi Zhou wrote:
> Commit 3db6d5a5ecaf ("x86/mm/tlb: Remove 'struct flush_tlb_info' from the
> stack") converted flush_tlb_info from a stack variable to a per-CPU
> variable. This brought a performance improvement of around 3% in extreme
> tests. However, it also required that all flush_tlb* operations keep
> preemption disabled the entire time to prevent concurrent modification of
> flush_tlb_info. flush_tlb* needs to send IPIs to remote CPUs and
> synchronously wait for all remote CPUs to complete their local TLB
> flushes. That can take tens of milliseconds when interrupts are disabled
> or with a large number of remote CPUs.
…

PeterZ wasn't too happy about reversing this. The snippet below results in
the following assembly:

| 0000000000001ab0 :
| …
| 1ac9:   48 89 e5                mov    %rsp,%rbp
| 1acc:   48 83 e4 c0             and    $0xffffffffffffffc0,%rsp
| 1ad0:   48 83 ec 40             sub    $0x40,%rsp

so the compiler aligns the on-stack variable properly, which should give
the same cache-line behaviour as the per-CPU variable. I'm not sure about
the virtual-to-physical translation of the variables, i.e. TLB misses,
since here we have a virtually mapped stack and there we have virtually
mapped per-CPU memory.
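As an aside, here is a minimal userspace sketch (my own illustration, not
from the patch) of the alignment behaviour this relies on. demo_info, its
members and the hard-coded 64 are made up; only the aligned() annotation
mirrors the hack:

/*
 * Standalone sketch, not kernel code: an aligned() attribute on the
 * struct type forces the compiler to realign the stack pointer, as in
 * the "and $0xffffffffffffffc0,%rsp" above.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define CACHELINE 64    /* assumed x86-64 cache-line size */

struct demo_info {      /* illustrative stand-in for flush_tlb_info */
        unsigned long start;
        unsigned long end;
        uint64_t new_tlb_gen;
} __attribute__((aligned(CACHELINE)));

int main(void)
{
        struct demo_info info = { .start = 0x1000, .end = 0x2000 };

        /*
         * The struct starts on a cache-line boundary, so its (small)
         * contents never straddle two cache lines.
         */
        assert(((uintptr_t)&info & (CACHELINE - 1)) == 0);
        printf("&info = %p\n", (void *)&info);
        return 0;
}

Built with gcc -O2, main() should show the same stack realignment as the
disassembly above.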
Anyway, below is my quick hack. Does this work, or is it still a no? I
have no numbers, so…

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 5a3cdc439e38d..4a7f40c7f939a 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -227,7 +227,7 @@ struct flush_tlb_info {
         u8                      stride_shift;
         u8                      freed_tables;
         u8                      trim_cpumask;
-};
+} __aligned(SMP_CACHE_BYTES);
 
 void flush_tlb_local(void);
 void flush_tlb_one_user(unsigned long addr);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 621e09d049cb9..99b70e94ec281 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1394,28 +1394,12 @@ void flush_tlb_multi(const struct cpumask *cpumask,
  */
 unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
 
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct flush_tlb_info, flush_tlb_info);
-
-#ifdef CONFIG_DEBUG_VM
-static DEFINE_PER_CPU(unsigned int, flush_tlb_info_idx);
-#endif
-
-static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
-                       unsigned long start, unsigned long end,
-                       unsigned int stride_shift, bool freed_tables,
-                       u64 new_tlb_gen)
+static void get_flush_tlb_info(struct flush_tlb_info *info,
+                              struct mm_struct *mm,
+                              unsigned long start, unsigned long end,
+                              unsigned int stride_shift, bool freed_tables,
+                              u64 new_tlb_gen)
 {
-        struct flush_tlb_info *info = this_cpu_ptr(&flush_tlb_info);
-
-#ifdef CONFIG_DEBUG_VM
-        /*
-         * Ensure that the following code is non-reentrant and flush_tlb_info
-         * is not overwritten. This means no TLB flushing is initiated by
-         * interrupt handlers and machine-check exception handlers.
-         */
-        BUG_ON(this_cpu_inc_return(flush_tlb_info_idx) != 1);
-#endif
-
         /*
          * If the number of flushes is so large that a full flush
          * would be faster, do a full flush.
@@ -1433,8 +1417,6 @@ static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
         info->new_tlb_gen       = new_tlb_gen;
         info->initiating_cpu    = smp_processor_id();
         info->trim_cpumask      = 0;
-
-        return info;
 }
 
 static void put_flush_tlb_info(void)
@@ -1450,15 +1432,16 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
                                 unsigned long end, unsigned int stride_shift,
                                 bool freed_tables)
 {
-        struct flush_tlb_info *info;
+        struct flush_tlb_info _info;
+        struct flush_tlb_info *info = &_info;
         int cpu = get_cpu();
         u64 new_tlb_gen;
 
         /* This is also a barrier that synchronizes with switch_mm(). */
         new_tlb_gen = inc_mm_tlb_gen(mm);
 
-        info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
-                                  new_tlb_gen);
+        get_flush_tlb_info(&_info, mm, start, end, stride_shift, freed_tables,
+                           new_tlb_gen);
 
         /*
          * flush_tlb_multi() is not optimized for the common case in which only
@@ -1548,17 +1531,15 @@ static void kernel_tlb_flush_range(struct flush_tlb_info *info)
 
 void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
-        struct flush_tlb_info *info;
+        struct flush_tlb_info info;
 
-        guard(preempt)();
+        get_flush_tlb_info(&info, NULL, start, end, PAGE_SHIFT, false,
+                           TLB_GENERATION_INVALID);
 
-        info = get_flush_tlb_info(NULL, start, end, PAGE_SHIFT, false,
-                                  TLB_GENERATION_INVALID);
-
-        if (info->end == TLB_FLUSH_ALL)
-                kernel_tlb_flush_all(info);
+        if (info.end == TLB_FLUSH_ALL)
+                kernel_tlb_flush_all(&info);
         else
-                kernel_tlb_flush_range(info);
+                kernel_tlb_flush_range(&info);
 
         put_flush_tlb_info();
 }
@@ -1728,12 +1709,11 @@ EXPORT_SYMBOL_FOR_KVM(__flush_tlb_all);
 
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
-        struct flush_tlb_info *info;
+        struct flush_tlb_info info;
 
         int cpu = get_cpu();
-
-        info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false,
-                                  TLB_GENERATION_INVALID);
+        get_flush_tlb_info(&info, NULL, 0, TLB_FLUSH_ALL, 0, false,
+                           TLB_GENERATION_INVALID);
         /*
          * flush_tlb_multi() is not optimized for the common case in which only
          * a local TLB flush is needed. Optimize this use-case by calling
@@ -1743,11 +1723,11 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
                 invlpgb_flush_all_nonglobals();
                 batch->unmapped_pages = false;
         } else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
-                flush_tlb_multi(&batch->cpumask, info);
+                flush_tlb_multi(&batch->cpumask, &info);
         } else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
                 lockdep_assert_irqs_enabled();
                 local_irq_disable();
-                flush_tlb_func(info);
+                flush_tlb_func(&info);
                 local_irq_enable();
         }
 