[PATCH v3 0/3] Optimize code generation during context switching

llvm.lists.linux.dev archive mirror
 help / color / mirror / Atom feed

* [PATCH v3 0/3] Optimize code generation during context switching
@ 2025-11-13 10:52 Xie Yuanbin
  2025-11-13 10:52 ` [PATCH v3 1/3] Make enter_lazy_tlb inline on x86 Xie Yuanbin
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Xie Yuanbin @ 2025-11-13 10:52 UTC (permalink / raw)
  To: tglx, riel, segher, david, peterz, hpa, osalvador, linux,
	mathieu.desnoyers, paulmck, pjw, palmer, aou, alex, hca, gor,
	agordeev, borntraeger, svens, davem, andreas, luto, mingo, bp,
	dave.hansen, acme, namhyung, mark.rutland, alexander.shishkin,
	jolsa, irogers, adrian.hunter, james.clark, anna-maria, frederic,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, nathan, nick.desaulniers+lkml, morbo,
	justinstitt, qq570070308, thuth, brauner, arnd, sforshee,
	mhiramat, andrii, oleg, jlayton, aalbersh, akpm, david,
	lorenzo.stoakes, baolin.wang, max.kellermann, ryan.roberts, nysal,
	urezki
  Cc: x86, linux-arm-kernel, linux-kernel, linux-riscv, linux-s390,
	sparclinux, linux-perf-users, llvm, will

The purpose of this series of patches is to optimize the performance of
context switching. It does not change the code logic, but only modifies
the inline attributes of some functions.

It is found that `finish_task_switch` is not inlined even in the O2 level
optimization. This may affect performance for the following reasons:
1. It is in the context switching code, which is a hot code path.
2. Because of the modern CPU mitigations for vulnerabilities, inside
switch_mm, the instruction pipeline and cache may be cleared.
finish_task_switch is right after that, so the performance is greatly
affected by function calls and branch jumps.
3. The __schedule function has __sched attribute, which makes it be
placed in the ".sched.text" section, while finish_task_switch does not,
which causes their distance to be very far in binary, aggravating the
above performance degradation.

This series of patches mainly does the following things:
1. Make enter_lazy_tlb inline on x86.
2. Make raw_spin_rq_unlock inline.
3. Let the finish_task_switch function be called inline during context
switching.
4. Set the subfunctions called by finish_task_switch to be inline:
After finish_task_switch is changed to an inline function, the number of
calls to the subfunctions (called by finish_task_switch) increases in
this translation unit due to the inline expansion of finish_task_switch.
For example, the finish_lock_switch function originally had only one
calling point in core.o (in finish_task_switch func), but because the
finish_task_switch was inlined, the calling points become two.
Due to compiler optimization strategies, these functions may transition
from inline functions to non inline functions, which can actually lead to
performance degradation.
Make the subfunctions of finish_task_stwitch inline to prevent
degradation.

Performance test data for these patches:
Time spent on calling finish_task_switch (unit: rdtsc):
 | compiler && appended cmdline | without patches | with patches  |
 | gcc + NA                     | 13.93 - 13.94   | 12.39 - 12.44 |
 | gcc + "spectre_v2_user=on"   | 24.69 - 24.85   | 13.68 - 13.73 |
 | clang + NA                   | 13.89 - 13.90   | 12.70 - 12.73 |
 | clang + "spectre_v2_user=on" | 29.00 - 29.02   | 18.88 - 18.97 |

Note: I use x86 for testing here. Different architectures have different
cmdlines for configuring mitigations. For example, on arm64, spectre v2
mitigation is enabled by default, and it should be disabled by adding
"nospectre_v2" to the cmdline.

Test info:
1. kernel source:
linux-next
commit 9c0826a5d9aa4d52206d ("Add linux-next specific files for 20251107")
2. test machine:
cpu: intel i5-8300h@4Ghz
mem: DDR4 2666MHz
Bare-metal boot, non-virtualized environment
3. compiler:
gcc: gcc version 15.2.0 (Debian 15.2.0-7) with
GNU ld (GNU Binutils for Debian) 2.45
clang: Debian clang version 21.1.4 (8) with
Debian LLD 21.1.4 (compatible with GNU linkers)
4. config:
base on default x86_64_defconfig, and setting:
CONFIG_HZ=100
CONFIG_DEBUG_ENTRY=n
CONFIG_X86_DEBUG_FPU=n
CONFIG_EXPERT=y
CONFIG_MODIFY_LDT_SYSCALL=n
CONFIG_CGROUPS=n
CONFIG_BLK_DEV_NVME=y
5. test method:
Use rdtsc (cntvct_el0 can be use on arm64/arm) to obtain timestamps
before and after finish_task_switch calling point, and created multiple
processes to trigger context switches, then calculated the average
duration of the finish_task_switch call.
Note that using multiple processes rather than threads is recommended for
testing, because this will trigger switch_mm (where spectre v2 mitigations
may be performed) during context switching.
The test code is attached at the end of the mail.

I also tested the impact on bzImage size, which may affect
embedded devices:
1. kernel source && compiler: same as above
2. config:
base on default x86_64_defconfig, and setting:
CONFIG_SCHED_CORE=y
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_NO_HZ_FULL=y

bzImage size:
 | compiler | without patches | with patches  |
 | clang    | 13722624        | 13722624      |
 | gcc      | 12596224        | 12596224      |

No size changes were found on bzImage.

testing code:
kernel(just for testing, not a commit):
```c
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 8a4ac4841be6..5a42ec008620 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -395,6 +395,7 @@
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	common	sched_test		sys_sched_test

 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 81cf8452449a..d7e2095aeb7d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5195,6 +5195,36 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
 	calculate_sigpending();
 }

+static DEFINE_PER_CPU(uint64_t, start_time);
+static DEFINE_PER_CPU(uint64_t, total_time);
+static DEFINE_PER_CPU(uint64_t, total_count);
+
+static __always_inline uint64_t test_rdtsc(void)
+{
+    register uint64_t rax __asm__("rax");
+    register uint64_t rdx __asm__("rdx");
+
+    __asm__ __volatile__ ("rdtsc" : "=a"(rax), "=d"(rdx));
+    return rax | (rdx << 32);
+}
+
+static __always_inline void test_start(void)
+{
+	raw_cpu_write(start_time, test_rdtsc());
+}
+
+static __always_inline void test_end(void)
+{
+	const uint64_t end_time = test_rdtsc();
+	const uint64_t cost_time = end_time - raw_cpu_read(start_time);
+
+	raw_cpu_add(total_time, cost_time);
+	if (unlikely(raw_cpu_inc_return(total_count) % (1 << 20) == 0)) {
+		pr_emerg("cpu %d total_time %llu\n", raw_smp_processor_id(), raw_cpu_read(total_time));
+		raw_cpu_write(total_time, 0);
+	}
+}
+
 /*
  * context_switch - switch to the new MM and the new thread's register state.
  */
@@ -5264,7 +5294,10 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	switch_to(prev, next, prev);
 	barrier();

-	return finish_task_switch(prev);
+	test_start();
+	rq = finish_task_switch(prev);
+	test_end();
+	return rq;
 }

 /*
@@ -10861,3 +10894,18 @@ void sched_change_end(struct sched_change_ctx *ctx)
 		p->sched_class->prio_changed(rq, p, ctx->prio);
 	}
 }
+
+static struct task_struct *wait_task;
+
+SYSCALL_DEFINE0(sched_test)
+{
+	preempt_disable();
+	while (1) {
+		if (likely(wait_task))
+			wake_up_process(wait_task);
+		wait_task = current;
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		__schedule(SM_NONE);
+	}
+	return 0;
+}
```

User program:
```c
int main()
{
	cpu_set_t mask;
	if (fork())
		sleep(1);

	CPU_ZERO(&mask);
	CPU_SET(5, &mask); // Assume that cpu5 exists
	assert(sched_setaffinity(0, sizeof(mask), &mask) == 0);
	syscall(471);
	// unreachable
	return 0;
}
```

Usage:
1. set core5 as isolated cpu: add "isolcpus=5" to cmdline
2. run user programe
3. wait for kernel print

v2->v3: https://lore.kernel.org/20251108172346.263590-1-qq570070308@gmail.com
  - Fix building error in patch 1
  - Simply add the __always_inline attribute to the existing function,
    Instead of adding the always inline version functions

v1->v2: https://lore.kernel.org/20251024182628.68921-1-qq570070308@gmail.com
  - Make raw_spin_rq_unlock inline
  - Make __balance_callbacks inline
  - Add comments for always inline functions
  - Add Performance Test Data

Xie Yuanbin (3):
  Make enter_lazy_tlb inline on x86
  Make raw_spin_rq_unlock inline
  Make finish_task_switch and its subfuncs inline in context switching

 arch/arm/include/asm/mmu_context.h      |  2 +-
 arch/riscv/include/asm/sync_core.h      |  2 +-
 arch/s390/include/asm/mmu_context.h     |  2 +-
 arch/sparc/include/asm/mmu_context_64.h |  2 +-
 arch/x86/include/asm/mmu_context.h      | 23 ++++++++++++++++++++++-
 arch/x86/include/asm/sync_core.h        |  2 +-
 arch/x86/mm/tlb.c                       | 21 ---------------------
 include/linux/perf_event.h              |  2 +-
 include/linux/sched/mm.h                | 10 +++++-----
 include/linux/tick.h                    |  4 ++--
 include/linux/vtime.h                   |  8 ++++----
 kernel/sched/core.c                     | 19 +++++++------------
 kernel/sched/sched.h                    | 24 ++++++++++++++----------
 13 files changed, 60 insertions(+), 61 deletions(-)

-- 
2.51.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v3 1/3] Make enter_lazy_tlb inline on x86
  2025-11-13 10:52 [PATCH v3 0/3] Optimize code generation during context switching Xie Yuanbin
@ 2025-11-13 10:52 ` Xie Yuanbin
  2025-11-13 11:02   ` Xie Yuanbin
                     ` (2 more replies)
  2025-11-13 10:52 ` [PATCH v3 2/3] Make raw_spin_rq_unlock inline Xie Yuanbin
  2025-11-13 10:52 ` [PATCH v3 3/3] Make finish_task_switch and its subfuncs inline in context switching Xie Yuanbin
  2 siblings, 3 replies; 12+ messages in thread
From: Xie Yuanbin @ 2025-11-13 10:52 UTC (permalink / raw)
  To: tglx, riel, segher, david, peterz, hpa, osalvador, linux,
	mathieu.desnoyers, paulmck, pjw, palmer, aou, alex, hca, gor,
	agordeev, borntraeger, svens, davem, andreas, luto, mingo, bp,
	dave.hansen, acme, namhyung, mark.rutland, alexander.shishkin,
	jolsa, irogers, adrian.hunter, james.clark, anna-maria, frederic,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, nathan, nick.desaulniers+lkml, morbo,
	justinstitt, qq570070308, thuth, brauner, arnd, sforshee,
	mhiramat, andrii, oleg, jlayton, aalbersh, akpm, david,
	lorenzo.stoakes, baolin.wang, max.kellermann, ryan.roberts, nysal,
	urezki
  Cc: x86, linux-arm-kernel, linux-kernel, linux-riscv, linux-s390,
	sparclinux, linux-perf-users, llvm, will, kernel test robot

This function is very short, and is called in the context switching,
which is the hot code path.

Change it to inline function on x86 to optimize performance, just like
its code on other architectures.

Signed-off-by: Xie Yuanbin <qq570070308@gmail.com>
Reviewed-by: Rik van Riel <riel@surriel.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511091959.kfmo9kPB-lkp@intel.com/
Closes: https://lore.kernel.org/oe-kbuild-all/202511092219.73aMMES4-lkp@intel.com/
Closes: https://lore.kernel.org/oe-kbuild-all/202511100042.ZklpqjOY-lkp@intel.com/
---
 arch/x86/include/asm/mmu_context.h | 23 ++++++++++++++++++++++-
 arch/x86/mm/tlb.c                  | 21 ---------------------
 2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 73bf3b1b44e8..ecd134dcfb34 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -136,8 +136,29 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm)
 }
 #endif
 
+/*
+ * Please ignore the name of this function.  It should be called
+ * switch_to_kernel_thread().
+ *
+ * enter_lazy_tlb() is a hint from the scheduler that we are entering a
+ * kernel thread or other context without an mm.  Acceptable implementations
+ * include doing nothing whatsoever, switching to init_mm, or various clever
+ * lazy tricks to try to minimize TLB flushes.
+ *
+ * The scheduler reserves the right to call enter_lazy_tlb() several times
+ * in a row.  It will notify us that we're going back to a real mm by
+ * calling switch_mm_irqs_off().
+ */
 #define enter_lazy_tlb enter_lazy_tlb
-extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
+#ifndef MODULE
+static __always_inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
+{
+	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
+		return;
+
+	this_cpu_write(cpu_tlbstate_shared.is_lazy, true);
+}
+#endif
 
 #define mm_init_global_asid mm_init_global_asid
 extern void mm_init_global_asid(struct mm_struct *mm);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5d221709353e..cb715e8e75e4 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -970,27 +970,6 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
 	}
 }
 
-/*
- * Please ignore the name of this function.  It should be called
- * switch_to_kernel_thread().
- *
- * enter_lazy_tlb() is a hint from the scheduler that we are entering a
- * kernel thread or other context without an mm.  Acceptable implementations
- * include doing nothing whatsoever, switching to init_mm, or various clever
- * lazy tricks to try to minimize TLB flushes.
- *
- * The scheduler reserves the right to call enter_lazy_tlb() several times
- * in a row.  It will notify us that we're going back to a real mm by
- * calling switch_mm_irqs_off().
- */
-void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
-{
-	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
-		return;
-
-	this_cpu_write(cpu_tlbstate_shared.is_lazy, true);
-}
-
 /*
  * Using a temporary mm allows to set temporary mappings that are not accessible
  * by other CPUs. Such mappings are needed to perform sensitive memory writes
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 1/3] Make enter_lazy_tlb inline on x86
  2025-11-13 10:52 ` [PATCH v3 1/3] Make enter_lazy_tlb inline on x86 Xie Yuanbin
@ 2025-11-13 11:02   ` Xie Yuanbin
  2025-11-13 12:11   ` David Hildenbrand (Red Hat)
  2025-11-14 19:42   ` Thomas Gleixner
  2 siblings, 0 replies; 12+ messages in thread
From: Xie Yuanbin @ 2025-11-13 11:02 UTC (permalink / raw)
  To: qq570070308, riel
  Cc: aalbersh, acme, adrian.hunter, agordeev, akpm, alex,
	alexander.shishkin, andreas, andrii, anna-maria, aou, arnd,
	baolin.wang, borntraeger, bp, brauner, bsegall, dave.hansen,
	davem, david, david, dietmar.eggemann, frederic, gor, hca, hpa,
	irogers, james.clark, jlayton, jolsa, juri.lelli, justinstitt,
	linux-arm-kernel, linux-kernel, linux-perf-users, linux-riscv,
	linux-s390, linux, lkp, llvm, lorenzo.stoakes, luto, mark.rutland,
	mathieu.desnoyers, max.kellermann, mgorman, mhiramat, mingo,
	morbo, namhyung, nathan, nick.desaulniers+lkml, nysal, oleg,
	osalvador, palmer, paulmck, peterz, pjw, rostedt, ryan.roberts,
	segher, sforshee, sparclinux, svens, tglx, thuth, urezki,
	vincent.guittot, vschneid, will, x86

Hi, Rik van Riel!

I fixed a build error in this patch, could you please review it again?
Link: https://lore.kernel.org/20251113105227.57650-2-qq570070308@gmail.com
Thanks!

Xie Yuanbin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 1/3] Make enter_lazy_tlb inline on x86
  2025-11-13 10:52 ` [PATCH v3 1/3] Make enter_lazy_tlb inline on x86 Xie Yuanbin
  2025-11-13 11:02   ` Xie Yuanbin
@ 2025-11-13 12:11   ` David Hildenbrand (Red Hat)
  2025-11-14 19:42   ` Thomas Gleixner
  2 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-13 12:11 UTC (permalink / raw)
  To: Xie Yuanbin, tglx, riel, segher, peterz, hpa, osalvador, linux,
	mathieu.desnoyers, paulmck, pjw, palmer, aou, alex, hca, gor,
	agordeev, borntraeger, svens, davem, andreas, luto, mingo, bp,
	dave.hansen, acme, namhyung, mark.rutland, alexander.shishkin,
	jolsa, irogers, adrian.hunter, james.clark, anna-maria, frederic,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, nathan, nick.desaulniers+lkml, morbo,
	justinstitt, thuth, brauner, arnd, sforshee, mhiramat, andrii,
	oleg, jlayton, aalbersh, akpm, lorenzo.stoakes, baolin.wang,
	max.kellermann, ryan.roberts, nysal, urezki
  Cc: x86, linux-arm-kernel, linux-kernel, linux-riscv, linux-s390,
	sparclinux, linux-perf-users, llvm, will, kernel test robot

On 13.11.25 11:52, Xie Yuanbin wrote:
> This function is very short, and is called in the context switching,
> which is the hot code path.
> 
> Change it to inline function

"always_inline" here and in the subject.

-- 
Cheers

David

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 1/3] Make enter_lazy_tlb inline on x86
  2025-11-13 10:52 ` [PATCH v3 1/3] Make enter_lazy_tlb inline on x86 Xie Yuanbin
  2025-11-13 11:02   ` Xie Yuanbin
  2025-11-13 12:11   ` David Hildenbrand (Red Hat)
@ 2025-11-14 19:42   ` Thomas Gleixner
  2025-11-15 13:54     ` Xie Yuanbin
  2 siblings, 1 reply; 12+ messages in thread
From: Thomas Gleixner @ 2025-11-14 19:42 UTC (permalink / raw)
  To: Xie Yuanbin, riel, segher, david, peterz, hpa, osalvador, linux,
	mathieu.desnoyers, paulmck, pjw, palmer, aou, alex, hca, gor,
	agordeev, borntraeger, svens, davem, andreas, luto, mingo, bp,
	dave.hansen, acme, namhyung, mark.rutland, alexander.shishkin,
	jolsa, irogers, adrian.hunter, james.clark, anna-maria, frederic,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, nathan, nick.desaulniers+lkml, morbo,
	justinstitt, qq570070308, thuth, brauner, arnd, sforshee,
	mhiramat, andrii, oleg, jlayton, aalbersh, akpm, david,
	lorenzo.stoakes, baolin.wang, max.kellermann, ryan.roberts, nysal,
	urezki
  Cc: x86, linux-arm-kernel, linux-kernel, linux-riscv, linux-s390,
	sparclinux, linux-perf-users, llvm, will, kernel test robot

On Thu, Nov 13 2025 at 18:52, Xie Yuanbin wrote:

Please use the documented way to denote functions in subject and change
log:

https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#function-references-in-changelogs

Also you make this __always_inline and not inline.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 1/3] Make enter_lazy_tlb inline on x86
  2025-11-14 19:42   ` Thomas Gleixner
@ 2025-11-15 13:54     ` Xie Yuanbin
  0 siblings, 0 replies; 12+ messages in thread
From: Xie Yuanbin @ 2025-11-15 13:54 UTC (permalink / raw)
  To: tglx
  Cc: aalbersh, acme, adrian.hunter, agordeev, akpm, alex,
	alexander.shishkin, andreas, andrii, anna-maria, aou, arnd,
	baolin.wang, borntraeger, bp, brauner, bsegall, dave.hansen,
	davem, david, david, dietmar.eggemann, frederic, gor, hca, hpa,
	irogers, james.clark, jlayton, jolsa, juri.lelli, justinstitt,
	linux-arm-kernel, linux-kernel, linux-perf-users, linux-riscv,
	linux-s390, linux, lkp, llvm, lorenzo.stoakes, luto, mark.rutland,
	mathieu.desnoyers, max.kellermann, mgorman, mhiramat, mingo,
	morbo, namhyung, nathan, nick.desaulniers+lkml, nysal, oleg,
	osalvador, palmer, paulmck, peterz, pjw, qq570070308, riel,
	rostedt, ryan.roberts, segher, sforshee, sparclinux, svens, thuth,
	urezki, vincent.guittot, vschneid, will, x86

On Fri, 14 Nov 2025 20:42:35 +0100, Thomas Gleixner wrote:
> Please use the documented way to denote functions in subject and change
> log:
>
> https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#function-references-in-changelogs
>
> Also you make this __always_inline and not inline.

Thanks for pointing it out, I will improve it in v4 patch.

Xie Yuanbin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v3 2/3] Make raw_spin_rq_unlock inline
  2025-11-13 10:52 [PATCH v3 0/3] Optimize code generation during context switching Xie Yuanbin
  2025-11-13 10:52 ` [PATCH v3 1/3] Make enter_lazy_tlb inline on x86 Xie Yuanbin
@ 2025-11-13 10:52 ` Xie Yuanbin
  2025-11-14 19:44   ` Thomas Gleixner
  2025-11-13 10:52 ` [PATCH v3 3/3] Make finish_task_switch and its subfuncs inline in context switching Xie Yuanbin
  2 siblings, 1 reply; 12+ messages in thread
From: Xie Yuanbin @ 2025-11-13 10:52 UTC (permalink / raw)
  To: tglx, riel, segher, david, peterz, hpa, osalvador, linux,
	mathieu.desnoyers, paulmck, pjw, palmer, aou, alex, hca, gor,
	agordeev, borntraeger, svens, davem, andreas, luto, mingo, bp,
	dave.hansen, acme, namhyung, mark.rutland, alexander.shishkin,
	jolsa, irogers, adrian.hunter, james.clark, anna-maria, frederic,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, nathan, nick.desaulniers+lkml, morbo,
	justinstitt, qq570070308, thuth, brauner, arnd, sforshee,
	mhiramat, andrii, oleg, jlayton, aalbersh, akpm, david,
	lorenzo.stoakes, baolin.wang, max.kellermann, ryan.roberts, nysal,
	urezki
  Cc: x86, linux-arm-kernel, linux-kernel, linux-riscv, linux-s390,
	sparclinux, linux-perf-users, llvm, will

This function is short, and is called in some critical hot code paths,
such as finish_lock_switch.

Make it inline to optimize performance.

Signed-off-by: Xie Yuanbin <qq570070308@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Segher Boessenkool <segher@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
---
 kernel/sched/core.c  | 5 -----
 kernel/sched/sched.h | 6 +++++-
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 81cf8452449a..0e50ef3d819a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -677,11 +677,6 @@ bool raw_spin_rq_trylock(struct rq *rq)
 	}
 }
 
-void raw_spin_rq_unlock(struct rq *rq)
-{
-	raw_spin_unlock(rq_lockp(rq));
-}
-
 /*
  * double_rq_lock - safely lock two runqueues
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f702fb452eb6..7d305ec10374 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1541,13 +1541,17 @@ static inline void lockdep_assert_rq_held(struct rq *rq)
 
 extern void raw_spin_rq_lock_nested(struct rq *rq, int subclass);
 extern bool raw_spin_rq_trylock(struct rq *rq);
-extern void raw_spin_rq_unlock(struct rq *rq);
 
 static inline void raw_spin_rq_lock(struct rq *rq)
 {
 	raw_spin_rq_lock_nested(rq, 0);
 }
 
+static inline void raw_spin_rq_unlock(struct rq *rq)
+{
+	raw_spin_unlock(rq_lockp(rq));
+}
+
 static inline void raw_spin_rq_lock_irq(struct rq *rq)
 {
 	local_irq_disable();
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 2/3] Make raw_spin_rq_unlock inline
  2025-11-13 10:52 ` [PATCH v3 2/3] Make raw_spin_rq_unlock inline Xie Yuanbin
@ 2025-11-14 19:44   ` Thomas Gleixner
  2025-11-15 14:01     ` Xie Yuanbin
  0 siblings, 1 reply; 12+ messages in thread
From: Thomas Gleixner @ 2025-11-14 19:44 UTC (permalink / raw)
  To: Xie Yuanbin, riel, segher, david, peterz, hpa, osalvador, linux,
	mathieu.desnoyers, paulmck, pjw, palmer, aou, alex, hca, gor,
	agordeev, borntraeger, svens, davem, andreas, luto, mingo, bp,
	dave.hansen, acme, namhyung, mark.rutland, alexander.shishkin,
	jolsa, irogers, adrian.hunter, james.clark, anna-maria, frederic,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, nathan, nick.desaulniers+lkml, morbo,
	justinstitt, qq570070308, thuth, brauner, arnd, sforshee,
	mhiramat, andrii, oleg, jlayton, aalbersh, akpm, david,
	lorenzo.stoakes, baolin.wang, max.kellermann, ryan.roberts, nysal,
	urezki
  Cc: x86, linux-arm-kernel, linux-kernel, linux-riscv, linux-s390,
	sparclinux, linux-perf-users, llvm, will

On Thu, Nov 13 2025 at 18:52, Xie Yuanbin wrote:

> This function is short, and is called in some critical hot code paths,
> such as finish_lock_switch.
>
> Make it inline to optimize performance.

> +static inline void raw_spin_rq_unlock(struct rq *rq)

That inline does not guarantee that the compiler actually inlines
it. clang is obnoxiously bad at that.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 2/3] Make raw_spin_rq_unlock inline
  2025-11-14 19:44   ` Thomas Gleixner
@ 2025-11-15 14:01     ` Xie Yuanbin
  0 siblings, 0 replies; 12+ messages in thread
From: Xie Yuanbin @ 2025-11-15 14:01 UTC (permalink / raw)
  To: tglx
  Cc: aalbersh, acme, adrian.hunter, agordeev, akpm, alex,
	alexander.shishkin, andreas, andrii, anna-maria, aou, arnd,
	baolin.wang, borntraeger, bp, brauner, bsegall, dave.hansen,
	davem, david, david, dietmar.eggemann, frederic, gor, hca, hpa,
	irogers, james.clark, jlayton, jolsa, juri.lelli, justinstitt,
	linux-arm-kernel, linux-kernel, linux-perf-users, linux-riscv,
	linux-s390, linux, llvm, lorenzo.stoakes, luto, mark.rutland,
	mathieu.desnoyers, max.kellermann, mgorman, mhiramat, mingo,
	morbo, namhyung, nathan, nick.desaulniers+lkml, nysal, oleg,
	osalvador, palmer, paulmck, peterz, pjw, qq570070308, riel,
	rostedt, ryan.roberts, segher, sforshee, sparclinux, svens, thuth,
	urezki, vincent.guittot, vschneid, will, x86

On Fri, 14 Nov 2025 20:44:13 +0100, Thomas Gleixner wrote:
> That inline does not guarantee that the compiler actually inlines
> it. clang is obnoxiously bad at that.

Yes, I know it. I modified it to __always_inline in patch 3.
This patch making it as inline to make the context look more harmonious.
Thanks!

Xie Yuanbin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v3 3/3] Make finish_task_switch and its subfuncs inline in context switching
  2025-11-13 10:52 [PATCH v3 0/3] Optimize code generation during context switching Xie Yuanbin
  2025-11-13 10:52 ` [PATCH v3 1/3] Make enter_lazy_tlb inline on x86 Xie Yuanbin
  2025-11-13 10:52 ` [PATCH v3 2/3] Make raw_spin_rq_unlock inline Xie Yuanbin
@ 2025-11-13 10:52 ` Xie Yuanbin
  2025-11-14 20:00   ` Thomas Gleixner
  2 siblings, 1 reply; 12+ messages in thread
From: Xie Yuanbin @ 2025-11-13 10:52 UTC (permalink / raw)
  To: tglx, riel, segher, david, peterz, hpa, osalvador, linux,
	mathieu.desnoyers, paulmck, pjw, palmer, aou, alex, hca, gor,
	agordeev, borntraeger, svens, davem, andreas, luto, mingo, bp,
	dave.hansen, acme, namhyung, mark.rutland, alexander.shishkin,
	jolsa, irogers, adrian.hunter, james.clark, anna-maria, frederic,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, nathan, nick.desaulniers+lkml, morbo,
	justinstitt, qq570070308, thuth, brauner, arnd, sforshee,
	mhiramat, andrii, oleg, jlayton, aalbersh, akpm, david,
	lorenzo.stoakes, baolin.wang, max.kellermann, ryan.roberts, nysal,
	urezki
  Cc: x86, linux-arm-kernel, linux-kernel, linux-riscv, linux-s390,
	sparclinux, linux-perf-users, llvm, will

`finish_task_switch` is a hot path in context switching, and due to
possible mitigations inside switch_mm, performance here is greatly
affected by function calls and branch jumps. Make it inline to optimize
the performance.

After `finish_task_switch` is changed to an inline function, the number of
calls to the subfunctions (called by `finish_task_switch`) increases in
this translation unit due to the inline expansion of `finish_task_switch`.
Due to compiler optimization strategies, these functions may transition
from inline functions to non inline functions, which can actually lead to
performance degradation.

Make the subfunctions of finish_task_stwitch inline to prevent
degradation.

Perf test:
Time spent on calling finish_task_switch (rdtsc):
 | compiler && appended cmdline | without patch   | with patch    |
 | gcc + NA                     | 13.93 - 13.94   | 12.39 - 12.44 |
 | gcc + "spectre_v2_user=on"   | 24.69 - 24.85   | 13.68 - 13.73 |
 | clang + NA                   | 13.89 - 13.90   | 12.70 - 12.73 |
 | clang + "spectre_v2_user=on" | 29.00 - 29.02   | 18.88 - 18.97 |

Perf test info:
1. kernel source:
linux-next
commit 9c0826a5d9aa4d52206d ("Add linux-next specific files for 20251107")
2. compiler:
gcc: gcc version 15.2.0 (Debian 15.2.0-7) with
GNU ld (GNU Binutils for Debian) 2.45
clang: Debian clang version 21.1.4 (8) with
Debian LLD 21.1.4 (compatible with GNU linkers)
3. config:
base on default x86_64_defconfig, and setting:
CONFIG_HZ=100
CONFIG_DEBUG_ENTRY=n
CONFIG_X86_DEBUG_FPU=n
CONFIG_EXPERT=y
CONFIG_MODIFY_LDT_SYSCALL=n
CONFIG_CGROUPS=n
CONFIG_BLK_DEV_NVME=y

Size test:
bzImage size:
 | compiler | without patches | with patches  |
 | clang    | 13722624        | 13722624      |
 | gcc      | 12596224        | 12596224      |

Size test info:
1. kernel source && compiler: same as above
2. config:
base on default x86_64_defconfig, and setting:
CONFIG_SCHED_CORE=y
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_NO_HZ_FULL=y

Signed-off-by: Xie Yuanbin <qq570070308@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Segher Boessenkool <segher@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
---
 arch/arm/include/asm/mmu_context.h      |  2 +-
 arch/riscv/include/asm/sync_core.h      |  2 +-
 arch/s390/include/asm/mmu_context.h     |  2 +-
 arch/sparc/include/asm/mmu_context_64.h |  2 +-
 arch/x86/include/asm/sync_core.h        |  2 +-
 include/linux/perf_event.h              |  2 +-
 include/linux/sched/mm.h                | 10 +++++-----
 include/linux/tick.h                    |  4 ++--
 include/linux/vtime.h                   |  8 ++++----
 kernel/sched/core.c                     | 14 +++++++-------
 kernel/sched/sched.h                    | 20 ++++++++++----------
 11 files changed, 34 insertions(+), 34 deletions(-)

diff --git a/arch/arm/include/asm/mmu_context.h b/arch/arm/include/asm/mmu_context.h
index db2cb06aa8cf..bebde469f81a 100644
--- a/arch/arm/include/asm/mmu_context.h
+++ b/arch/arm/include/asm/mmu_context.h
@@ -80,7 +80,7 @@ static inline void check_and_switch_context(struct mm_struct *mm,
 #ifndef MODULE
 #define finish_arch_post_lock_switch \
 	finish_arch_post_lock_switch
-static inline void finish_arch_post_lock_switch(void)
+static __always_inline void finish_arch_post_lock_switch(void)
 {
 	struct mm_struct *mm = current->mm;
 
diff --git a/arch/riscv/include/asm/sync_core.h b/arch/riscv/include/asm/sync_core.h
index 9153016da8f1..2fe6b7fe6b12 100644
--- a/arch/riscv/include/asm/sync_core.h
+++ b/arch/riscv/include/asm/sync_core.h
@@ -6,7 +6,7 @@
  * RISC-V implements return to user-space through an xRET instruction,
  * which is not core serializing.
  */
-static inline void sync_core_before_usermode(void)
+static __always_inline void sync_core_before_usermode(void)
 {
 	asm volatile ("fence.i" ::: "memory");
 }
diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
index d9b8501bc93d..c124ef6a01b3 100644
--- a/arch/s390/include/asm/mmu_context.h
+++ b/arch/s390/include/asm/mmu_context.h
@@ -97,7 +97,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 }
 
 #define finish_arch_post_lock_switch finish_arch_post_lock_switch
-static inline void finish_arch_post_lock_switch(void)
+static __always_inline void finish_arch_post_lock_switch(void)
 {
 	struct task_struct *tsk = current;
 	struct mm_struct *mm = tsk->mm;
diff --git a/arch/sparc/include/asm/mmu_context_64.h b/arch/sparc/include/asm/mmu_context_64.h
index 78bbacc14d2d..d1967214ef25 100644
--- a/arch/sparc/include/asm/mmu_context_64.h
+++ b/arch/sparc/include/asm/mmu_context_64.h
@@ -160,7 +160,7 @@ static inline void arch_start_context_switch(struct task_struct *prev)
 }
 
 #define finish_arch_post_lock_switch	finish_arch_post_lock_switch
-static inline void finish_arch_post_lock_switch(void)
+static __always_inline void finish_arch_post_lock_switch(void)
 {
 	/* Restore the state of MCDPER register for the new process
 	 * just switched to.
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
index 96bda43538ee..4b55fa353bb5 100644
--- a/arch/x86/include/asm/sync_core.h
+++ b/arch/x86/include/asm/sync_core.h
@@ -93,7 +93,7 @@ static __always_inline void sync_core(void)
  * to user-mode. x86 implements return to user-space through sysexit,
  * sysrel, and sysretq, which are not core serializing.
  */
-static inline void sync_core_before_usermode(void)
+static __always_inline void sync_core_before_usermode(void)
 {
 	/* With PTI, we unconditionally serialize before running user code. */
 	if (static_cpu_has(X86_FEATURE_PTI))
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9870d768db4c..d9de20c20f38 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1624,7 +1624,7 @@ static inline void perf_event_task_migrate(struct task_struct *task)
 		task->sched_migrated = 1;
 }
 
-static inline void perf_event_task_sched_in(struct task_struct *prev,
+static __always_inline void perf_event_task_sched_in(struct task_struct *prev,
 					    struct task_struct *task)
 {
 	if (static_branch_unlikely(&perf_sched_events))
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 0e1d73955fa5..e7787a6e7d22 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -44,7 +44,7 @@ static inline void smp_mb__after_mmgrab(void)
 
 extern void __mmdrop(struct mm_struct *mm);
 
-static inline void mmdrop(struct mm_struct *mm)
+static __always_inline void mmdrop(struct mm_struct *mm)
 {
 	/*
 	 * The implicit full barrier implied by atomic_dec_and_test() is
@@ -71,14 +71,14 @@ static inline void __mmdrop_delayed(struct rcu_head *rhp)
  * Invoked from finish_task_switch(). Delegates the heavy lifting on RT
  * kernels via RCU.
  */
-static inline void mmdrop_sched(struct mm_struct *mm)
+static __always_inline void mmdrop_sched(struct mm_struct *mm)
 {
 	/* Provides a full memory barrier. See mmdrop() */
 	if (atomic_dec_and_test(&mm->mm_count))
 		call_rcu(&mm->delayed_drop, __mmdrop_delayed);
 }
 #else
-static inline void mmdrop_sched(struct mm_struct *mm)
+static __always_inline void mmdrop_sched(struct mm_struct *mm)
 {
 	mmdrop(mm);
 }
@@ -104,7 +104,7 @@ static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
 	}
 }
 
-static inline void mmdrop_lazy_tlb_sched(struct mm_struct *mm)
+static __always_inline void mmdrop_lazy_tlb_sched(struct mm_struct *mm)
 {
 	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
 		mmdrop_sched(mm);
@@ -531,7 +531,7 @@ enum {
 #include <asm/membarrier.h>
 #endif
 
-static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
+static __always_inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
 {
 	/*
 	 * The atomic_read() below prevents CSE. The following should
diff --git a/include/linux/tick.h b/include/linux/tick.h
index ac76ae9fa36d..fce16aa10ba2 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -175,7 +175,7 @@ extern cpumask_var_t tick_nohz_full_mask;
 #ifdef CONFIG_NO_HZ_FULL
 extern bool tick_nohz_full_running;
 
-static inline bool tick_nohz_full_enabled(void)
+static __always_inline bool tick_nohz_full_enabled(void)
 {
 	if (!context_tracking_enabled())
 		return false;
@@ -299,7 +299,7 @@ static inline void __tick_nohz_task_switch(void) { }
 static inline void tick_nohz_full_setup(cpumask_var_t cpumask) { }
 #endif
 
-static inline void tick_nohz_task_switch(void)
+static __always_inline void tick_nohz_task_switch(void)
 {
 	if (tick_nohz_full_enabled())
 		__tick_nohz_task_switch();
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 29dd5b91dd7d..428464bb81b3 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -67,24 +67,24 @@ static __always_inline void vtime_account_guest_exit(void)
  * For now vtime state is tied to context tracking. We might want to decouple
  * those later if necessary.
  */
-static inline bool vtime_accounting_enabled(void)
+static __always_inline bool vtime_accounting_enabled(void)
 {
 	return context_tracking_enabled();
 }
 
-static inline bool vtime_accounting_enabled_cpu(int cpu)
+static __always_inline bool vtime_accounting_enabled_cpu(int cpu)
 {
 	return context_tracking_enabled_cpu(cpu);
 }
 
-static inline bool vtime_accounting_enabled_this_cpu(void)
+static __always_inline bool vtime_accounting_enabled_this_cpu(void)
 {
 	return context_tracking_enabled_this_cpu();
 }
 
 extern void vtime_task_switch_generic(struct task_struct *prev);
 
-static inline void vtime_task_switch(struct task_struct *prev)
+static __always_inline void vtime_task_switch(struct task_struct *prev)
 {
 	if (vtime_accounting_enabled_this_cpu())
 		vtime_task_switch_generic(prev);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0e50ef3d819a..78d2c90bc73a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4868,7 +4868,7 @@ static inline void prepare_task(struct task_struct *next)
 	WRITE_ONCE(next->on_cpu, 1);
 }
 
-static inline void finish_task(struct task_struct *prev)
+static __always_inline void finish_task(struct task_struct *prev)
 {
 	/*
 	 * This must be the very last reference to @prev from this CPU. After
@@ -4884,7 +4884,7 @@ static inline void finish_task(struct task_struct *prev)
 	smp_store_release(&prev->on_cpu, 0);
 }
 
-static void do_balance_callbacks(struct rq *rq, struct balance_callback *head)
+static __always_inline void do_balance_callbacks(struct rq *rq, struct balance_callback *head)
 {
 	void (*func)(struct rq *rq);
 	struct balance_callback *next;
@@ -4919,7 +4919,7 @@ struct balance_callback balance_push_callback = {
 	.func = balance_push,
 };
 
-static inline struct balance_callback *
+static __always_inline struct balance_callback *
 __splice_balance_callbacks(struct rq *rq, bool split)
 {
 	struct balance_callback *head = rq->balance_callback;
@@ -4949,7 +4949,7 @@ struct balance_callback *splice_balance_callbacks(struct rq *rq)
 	return __splice_balance_callbacks(rq, true);
 }
 
-static void __balance_callbacks(struct rq *rq)
+static __always_inline void __balance_callbacks(struct rq *rq)
 {
 	do_balance_callbacks(rq, __splice_balance_callbacks(rq, false));
 }
@@ -4982,7 +4982,7 @@ prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf
 #endif
 }
 
-static inline void finish_lock_switch(struct rq *rq)
+static __always_inline void finish_lock_switch(struct rq *rq)
 {
 	/*
 	 * If we are tracking spinlock dependencies then we have to
@@ -5014,7 +5014,7 @@ static inline void kmap_local_sched_out(void)
 #endif
 }
 
-static inline void kmap_local_sched_in(void)
+static __always_inline void kmap_local_sched_in(void)
 {
 #ifdef CONFIG_KMAP_LOCAL
 	if (unlikely(current->kmap_ctrl.idx))
@@ -5067,7 +5067,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
  * past. 'prev == current' is still correct but we need to recalculate this_rq
  * because prev may have moved to another CPU.
  */
-static struct rq *finish_task_switch(struct task_struct *prev)
+static __always_inline struct rq *finish_task_switch(struct task_struct *prev)
 	__releases(rq->lock)
 {
 	struct rq *rq = this_rq();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7d305ec10374..ec301a91cb43 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1374,12 +1374,12 @@ static inline struct cpumask *sched_group_span(struct sched_group *sg);
 
 DECLARE_STATIC_KEY_FALSE(__sched_core_enabled);
 
-static inline bool sched_core_enabled(struct rq *rq)
+static __always_inline bool sched_core_enabled(struct rq *rq)
 {
 	return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
 }
 
-static inline bool sched_core_disabled(void)
+static __always_inline bool sched_core_disabled(void)
 {
 	return !static_branch_unlikely(&__sched_core_enabled);
 }
@@ -1388,7 +1388,7 @@ static inline bool sched_core_disabled(void)
  * Be careful with this function; not for general use. The return value isn't
  * stable unless you actually hold a relevant rq->__lock.
  */
-static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+static __always_inline raw_spinlock_t *rq_lockp(struct rq *rq)
 {
 	if (sched_core_enabled(rq))
 		return &rq->core->__lock;
@@ -1396,7 +1396,7 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 	return &rq->__lock;
 }
 
-static inline raw_spinlock_t *__rq_lockp(struct rq *rq)
+static __always_inline raw_spinlock_t *__rq_lockp(struct rq *rq)
 {
 	if (rq->core_enabled)
 		return &rq->core->__lock;
@@ -1487,12 +1487,12 @@ static inline bool sched_core_disabled(void)
 	return true;
 }
 
-static inline raw_spinlock_t *rq_lockp(struct rq *rq)
+static __always_inline raw_spinlock_t *rq_lockp(struct rq *rq)
 {
 	return &rq->__lock;
 }
 
-static inline raw_spinlock_t *__rq_lockp(struct rq *rq)
+static __always_inline raw_spinlock_t *__rq_lockp(struct rq *rq)
 {
 	return &rq->__lock;
 }
@@ -1542,23 +1542,23 @@ static inline void lockdep_assert_rq_held(struct rq *rq)
 extern void raw_spin_rq_lock_nested(struct rq *rq, int subclass);
 extern bool raw_spin_rq_trylock(struct rq *rq);
 
-static inline void raw_spin_rq_lock(struct rq *rq)
+static __always_inline void raw_spin_rq_lock(struct rq *rq)
 {
 	raw_spin_rq_lock_nested(rq, 0);
 }
 
-static inline void raw_spin_rq_unlock(struct rq *rq)
+static __always_inline void raw_spin_rq_unlock(struct rq *rq)
 {
 	raw_spin_unlock(rq_lockp(rq));
 }
 
-static inline void raw_spin_rq_lock_irq(struct rq *rq)
+static __always_inline void raw_spin_rq_lock_irq(struct rq *rq)
 {
 	local_irq_disable();
 	raw_spin_rq_lock(rq);
 }
 
-static inline void raw_spin_rq_unlock_irq(struct rq *rq)
+static __always_inline void raw_spin_rq_unlock_irq(struct rq *rq)
 {
 	raw_spin_rq_unlock(rq);
 	local_irq_enable();
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 3/3] Make finish_task_switch and its subfuncs inline in context switching
  2025-11-13 10:52 ` [PATCH v3 3/3] Make finish_task_switch and its subfuncs inline in context switching Xie Yuanbin
@ 2025-11-14 20:00   ` Thomas Gleixner
  2025-11-15 15:09     ` Xie Yuanbin
  0 siblings, 1 reply; 12+ messages in thread
From: Thomas Gleixner @ 2025-11-14 20:00 UTC (permalink / raw)
  To: Xie Yuanbin, riel, segher, david, peterz, hpa, osalvador, linux,
	mathieu.desnoyers, paulmck, pjw, palmer, aou, alex, hca, gor,
	agordeev, borntraeger, svens, davem, andreas, luto, mingo, bp,
	dave.hansen, acme, namhyung, mark.rutland, alexander.shishkin,
	jolsa, irogers, adrian.hunter, james.clark, anna-maria, frederic,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, nathan, nick.desaulniers+lkml, morbo,
	justinstitt, qq570070308, thuth, brauner, arnd, sforshee,
	mhiramat, andrii, oleg, jlayton, aalbersh, akpm, david,
	lorenzo.stoakes, baolin.wang, max.kellermann, ryan.roberts, nysal,
	urezki
  Cc: x86, linux-arm-kernel, linux-kernel, linux-riscv, linux-s390,
	sparclinux, linux-perf-users, llvm, will

On Thu, Nov 13 2025 at 18:52, Xie Yuanbin wrote:

What are subfuncs? This is not a SMS service. Use proper words and not
made up abbreviations.

> `finish_task_switch` is a hot path in context switching, and due to

Same comment as before about functions....

> possible mitigations inside switch_mm, performance here is greatly
> affected by function calls and branch jumps. Make it inline to optimize
> the performance.

Again you mark them __always_inline and not inline. Most of them are
already 'inline'. Can you please precise in your wording?

> After `finish_task_switch` is changed to an inline function, the number of
> calls to the subfunctions (called by `finish_task_switch`) increases in
> this translation unit due to the inline expansion of `finish_task_switch`.
> Due to compiler optimization strategies, these functions may transition
> from inline functions to non inline functions, which can actually lead to
> performance degradation.

I'm having a hard time to understand this word salad.

> Make the subfunctions of finish_task_stwitch inline to prevent
> degradation.
>
> Perf test:
> Time spent on calling finish_task_switch (rdtsc):

What means (rdtsc)? 

>  | compiler && appended cmdline | without patch   | with patch    |
>  | gcc + NA                     | 13.93 - 13.94   | 12.39 - 12.44 |

What is NA and what are the time units of this?

>  | gcc + "spectre_v2_user=on"   | 24.69 - 24.85   | 13.68 - 13.73 |
>  | clang + NA                   | 13.89 - 13.90   | 12.70 - 12.73 |
>  | clang + "spectre_v2_user=on" | 29.00 - 29.02   | 18.88 - 18.97 |

So the real benefit is observable when spectre_v2_user mitigations are
enabled. You completely fail to explain that.

> Perf test info:
> 1. kernel source:
> linux-next
> commit 9c0826a5d9aa4d52206d ("Add linux-next specific files for 20251107")
> 2. compiler:
> gcc: gcc version 15.2.0 (Debian 15.2.0-7) with
> GNU ld (GNU Binutils for Debian) 2.45
> clang: Debian clang version 21.1.4 (8) with
> Debian LLD 21.1.4 (compatible with GNU linkers)
> 3. config:
> base on default x86_64_defconfig, and setting:
> CONFIG_HZ=100
> CONFIG_DEBUG_ENTRY=n
> CONFIG_X86_DEBUG_FPU=n
> CONFIG_EXPERT=y
> CONFIG_MODIFY_LDT_SYSCALL=n
> CONFIG_CGROUPS=n
> CONFIG_BLK_DEV_NVME=y

This really can go into the comment section below the first '---'
separator. No point in having this in the change log.

> Size test:
> bzImage size:
>  | compiler | without patches | with patches  |
>  | clang    | 13722624        | 13722624      |
>  | gcc      | 12596224        | 12596224      |

bzImage size is completely irrelevant. What's interesting is how the
size of the actual function changes.

> Size test info:
> 1. kernel source && compiler: same as above
> 2. config:
> base on default x86_64_defconfig, and setting:
> CONFIG_SCHED_CORE=y
> CONFIG_CC_OPTIMIZE_FOR_SIZE=y
> CONFIG_NO_HZ_FULL=y

And again, we all know how to build a kernel.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 3/3] Make finish_task_switch and its subfuncs inline in context switching
  2025-11-14 20:00   ` Thomas Gleixner
@ 2025-11-15 15:09     ` Xie Yuanbin
  0 siblings, 0 replies; 12+ messages in thread
From: Xie Yuanbin @ 2025-11-15 15:09 UTC (permalink / raw)
  To: tglx
  Cc: aalbersh, acme, adrian.hunter, agordeev, akpm, alex,
	alexander.shishkin, andreas, andrii, anna-maria, aou, arnd,
	baolin.wang, borntraeger, bp, brauner, bsegall, dave.hansen,
	davem, david, david, dietmar.eggemann, frederic, gor, hca, hpa,
	irogers, james.clark, jlayton, jolsa, juri.lelli, justinstitt,
	linux-arm-kernel, linux-kernel, linux-perf-users, linux-riscv,
	linux-s390, linux, llvm, lorenzo.stoakes, luto, mark.rutland,
	mathieu.desnoyers, max.kellermann, mgorman, mhiramat, mingo,
	morbo, namhyung, nathan, nick.desaulniers+lkml, nysal, oleg,
	osalvador, palmer, paulmck, peterz, pjw, qq570070308, riel,
	rostedt, ryan.roberts, segher, sforshee, sparclinux, svens, thuth,
	urezki, vincent.guittot, vschneid, will, x86

On Fri, 14 Nov 2025 21:00:43 +0100, Thomas Gleixner wrote:
> What are subfuncs? This is not a SMS service. Use proper words and not
> made up abbreviations.
>
> Again you mark them __always_inline and not inline. Most of them are
> already 'inline'. Can you please precise in your wording?
>
> This really can go into the comment section below the first '---'
> separator. No point in having this in the change log.

Thanks for pointing it out, I will improve it in v4 patch.

>> After `finish_task_switch` is changed to an inline function, the number of
>> calls to the subfunctions (called by `finish_task_switch`) increases in
>> this translation unit due to the inline expansion of `finish_task_switch`.
>> Due to compiler optimization strategies, these functions may transition
>> from inline functions to non inline functions, which can actually lead to
>> performance degradation.
>
> I'm having a hard time to understand this word salad.

I think the description is very important here, because it explains why
it needs to make the subfunctions as __always_inline.
Where is difficult to understand specifically? Please point it out,
and I will improve the description in v4 patch. Thank you very much!

> What means (rdtsc)?

This is a high-precision timestamp acquisition method in x86.
The description here is not sufficient, thanks for pointing it out, I
will improve it in v4 patch.

> So the real benefit is observable when spectre_v2_user mitigations are
> enabled. You completely fail to explain that.

What kind of explanation is needed here?
```txt
When spectre_v2_user mitigation is enabled, kernel is likely to
preform branch prediction hardening inside switch_mm_irq_off, which can
drastically increase the branch prediction misses in subsequently
executed code.

On x86, this mitigation is enabled conditionally by default, but on other
architectures, for example arm32/aarch64, the mitigation may be fully
enabled by default.

`finish_task_switch` is right after `switch_mm_irq_off`, so makeing it
inline can achieve high performance benefits.
```
Is it ok? Thanks very much!

> bzImage size is completely irrelevant. What's interesting is how the
> size of the actual function changes.

I think the bzImage size is meaningful, at least for many embedded
devices. Due to compression algorithms, code size cannot directly reflect
to the compressed size.

Anyway, I will supplement the size of the .text section in the v4 patch.

Thanks very much!

Xie Yuanbin

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-11-15 15:09 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-13 10:52 [PATCH v3 0/3] Optimize code generation during context switching Xie Yuanbin
2025-11-13 10:52 ` [PATCH v3 1/3] Make enter_lazy_tlb inline on x86 Xie Yuanbin
2025-11-13 11:02   ` Xie Yuanbin
2025-11-13 12:11   ` David Hildenbrand (Red Hat)
2025-11-14 19:42   ` Thomas Gleixner
2025-11-15 13:54     ` Xie Yuanbin
2025-11-13 10:52 ` [PATCH v3 2/3] Make raw_spin_rq_unlock inline Xie Yuanbin
2025-11-14 19:44   ` Thomas Gleixner
2025-11-15 14:01     ` Xie Yuanbin
2025-11-13 10:52 ` [PATCH v3 3/3] Make finish_task_switch and its subfuncs inline in context switching Xie Yuanbin
2025-11-14 20:00   ` Thomas Gleixner
2025-11-15 15:09     ` Xie Yuanbin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).