* [PATCH 0/3] Optimize code generation during context switching
@ 2025-10-24 18:26 Xie Yuanbin
2025-10-24 18:26 ` [PATCH 1/3] Change enter_lazy_tlb to inline on x86 Xie Yuanbin
` (3 more replies)
0 siblings, 4 replies; 16+ messages in thread
From: Xie Yuanbin @ 2025-10-24 18:26 UTC (permalink / raw)
To: linux, mathieu.desnoyers, paulmck, pjw, palmer, aou, alex, hca,
gor, agordeev, borntraeger, svens, davem, andreas, tglx, mingo,
bp, dave.hansen, hpa, luto, peterz, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, anna-maria,
frederic, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, qq570070308, thuth, riel, akpm, david,
lorenzo.stoakes, segher, ryan.roberts, max.kellermann, urezki,
nysal
Cc: x86, linux-arm-kernel, linux-kernel, linux-riscv, linux-s390,
sparclinux, linux-perf-users, will
The purpose of this series is to optimize the performance of context
switching. It does not change any code logic; it only modifies the inline
attributes of some functions.
The original reason for writing this series is that, while debugging a
scheduling performance problem, I discovered that finish_task_switch was
not inlined, even at the -O2 optimization level. This may hurt performance
for the following reasons:
1. It is in the context-switch path and is called frequently.
2. Because of modern CPU vulnerability mitigations, the instruction
pipeline and caches may be flushed inside switch_mm, so branch
mispredictions and cache misses increase. finish_task_switch runs right
after that, so the extra call may cause a larger performance hit.
3. The __schedule function has the __sched attribute, which places it in
the ".sched.text" section, while finish_task_switch does not, so the two
end up far apart in the binary, aggravating the degradation above.
I also noticed that on x86, enter_lazy_tlb is not inlined. It is very
short, and since the cpu_tlbstate and cpu_tlbstate_shared variables are
global, it can be fully inlined. In fact, the implementations of this
function on other architectures are already inline.
This series mainly does the following:
1. Change enter_lazy_tlb to an inline function on x86.
2. Make finish_task_switch be inlined at its context-switch call site.
3. Mark the subfunctions called by finish_task_switch as always inline:
When finish_task_switch becomes an inline function, the number of call
sites of its subfunctions in this translation unit grows because of the
inline expansion of finish_task_switch.
For example, finish_lock_switch originally had only one call site in
core.o (inside finish_task_switch), but once finish_task_switch is
inlined there are two.
Because of compiler optimization strategies, these subfunctions may then
flip from being inlined to not being inlined, which can actually cause a
performance regression (see the standalone sketch below).
So I mark some subfunctions of finish_task_switch as always inline to
prevent that regression.
These functions are either very short or are called only once in the
entire kernel, so they do not have a big impact on size.
No impact on the size of the bzImage was observed with this series
(built with -Os).
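As a standalone illustration of the effect described in point 3 (this is a
hypothetical sketch, not kernel code; the helper names are made up): with a
single caller, a small static inline helper is trivially inlined, but once
its caller is itself expanded inline into several places, the helper gains
multiple call sites and an -Os build may keep an out-of-line copy unless it
is forced, the way patch 3 does with __always_inline.

```c
#include <stdio.h>

/* With one call site this is inlined even at -Os; once outer() itself is
 * expanded inline in several places, it has several call sites and the
 * compiler may keep an out-of-line copy instead. */
static inline int helper(int x)
{
	return x * x + 1;
}

/* Forced, the way patch 3 marks the finish_task_switch helpers
 * (the kernel spells this attribute __always_inline). */
static __attribute__((always_inline)) inline int helper_forced(int x)
{
	return x * x + 1;
}

static inline int outer(int x)
{
	return helper(x) + helper_forced(x);
}

int main(void)
{
	/* Two callers of outer(): if both are expanded inline, helper()
	 * is referenced twice in this translation unit. */
	printf("%d %d\n", outer(2), outer(3));
	return 0;
}
```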
Xie Yuanbin (3):
  Change enter_lazy_tlb to inline on x86
  Provide and use an always inline version of finish_task_switch
  Set the subfunctions called by finish_task_switch to be inline
arch/arm/include/asm/mmu_context.h | 6 +++++-
arch/riscv/include/asm/sync_core.h | 2 +-
arch/s390/include/asm/mmu_context.h | 6 +++++-
arch/sparc/include/asm/mmu_context_64.h | 6 +++++-
arch/x86/include/asm/mmu_context.h | 22 +++++++++++++++++++++-
arch/x86/include/asm/sync_core.h | 2 +-
arch/x86/mm/tlb.c | 21 ---------------------
include/linux/perf_event.h | 2 +-
include/linux/sched/mm.h | 10 +++++-----
include/linux/tick.h | 4 ++--
include/linux/vtime.h | 8 ++++----
kernel/sched/core.c | 20 +++++++++++++-------
12 files changed, 63 insertions(+), 46 deletions(-)
--
2.51.0
* [PATCH 1/3] Change enter_lazy_tlb to inline on x86
2025-10-24 18:26 [PATCH 0/3] Optimize code generation during context switching Xie Yuanbin
@ 2025-10-24 18:26 ` Xie Yuanbin
2025-10-24 20:14 ` Rik van Riel
2025-10-24 18:35 ` [PATCH 2/3] Provide and use an always inline version of finish_task_switch Xie Yuanbin
` (2 subsequent siblings)
3 siblings, 1 reply; 16+ messages in thread
From: Xie Yuanbin @ 2025-10-24 18:26 UTC (permalink / raw)
To: linux, mathieu.desnoyers, paulmck, pjw, palmer, aou, alex, hca,
gor, agordeev, borntraeger, svens, davem, andreas, tglx, mingo,
bp, dave.hansen, hpa, luto, peterz, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, anna-maria,
frederic, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, qq570070308, thuth, riel, akpm, david,
lorenzo.stoakes, segher, ryan.roberts, max.kellermann, urezki,
nysal
Cc: x86, linux-arm-kernel, linux-kernel, linux-riscv, linux-s390,
sparclinux, linux-perf-users, will
This function is very short and is called during context switching, so it
runs very frequently.
Change it to an inline function on x86 to improve performance, matching
its implementation on the other architectures.
Signed-off-by: Xie Yuanbin <qq570070308@gmail.com>
---
arch/x86/include/asm/mmu_context.h | 22 +++++++++++++++++++++-
arch/x86/mm/tlb.c | 21 ---------------------
2 files changed, 21 insertions(+), 22 deletions(-)
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 73bf3b1b44e8..30e68c5ef798 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -129,22 +129,42 @@ static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
{
}
static inline void mm_reset_untag_mask(struct mm_struct *mm)
{
}
#endif
+/*
+ * Please ignore the name of this function. It should be called
+ * switch_to_kernel_thread().
+ *
+ * enter_lazy_tlb() is a hint from the scheduler that we are entering a
+ * kernel thread or other context without an mm. Acceptable implementations
+ * include doing nothing whatsoever, switching to init_mm, or various clever
+ * lazy tricks to try to minimize TLB flushes.
+ *
+ * The scheduler reserves the right to call enter_lazy_tlb() several times
+ * in a row. It will notify us that we're going back to a real mm by
+ * calling switch_mm_irqs_off().
+ */
#define enter_lazy_tlb enter_lazy_tlb
-extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
+static __always_inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
+{
+ if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
+ return;
+
+ this_cpu_write(cpu_tlbstate_shared.is_lazy, true);
+}
+
#define mm_init_global_asid mm_init_global_asid
extern void mm_init_global_asid(struct mm_struct *mm);
extern void mm_free_global_asid(struct mm_struct *mm);
/*
* Init a new mm. Used on mm copies, like at fork()
* and on mm's that are brand-new, like at execve().
*/
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5d221709353e..cb715e8e75e4 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -963,41 +963,20 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
this_cpu_write(cpu_tlbstate.loaded_mm, next);
this_cpu_write(cpu_tlbstate.loaded_mm_asid, ns.asid);
cpu_tlbstate_update_lam(new_lam, mm_untag_mask(next));
if (next != prev) {
cr4_update_pce_mm(next);
switch_ldt(prev, next);
}
}
-/*
- * Please ignore the name of this function. It should be called
- * switch_to_kernel_thread().
- *
- * enter_lazy_tlb() is a hint from the scheduler that we are entering a
- * kernel thread or other context without an mm. Acceptable implementations
- * include doing nothing whatsoever, switching to init_mm, or various clever
- * lazy tricks to try to minimize TLB flushes.
- *
- * The scheduler reserves the right to call enter_lazy_tlb() several times
- * in a row. It will notify us that we're going back to a real mm by
- * calling switch_mm_irqs_off().
- */
-void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
-{
- if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
- return;
-
- this_cpu_write(cpu_tlbstate_shared.is_lazy, true);
-}
-
/*
* Using a temporary mm allows to set temporary mappings that are not accessible
* by other CPUs. Such mappings are needed to perform sensitive memory writes
* that override the kernel memory protections (e.g., W^X), without exposing the
* temporary page-table mappings that are required for these write operations to
* other CPUs. Using a temporary mm also allows to avoid TLB shootdowns when the
* mapping is torn down. Temporary mms can also be used for EFI runtime service
* calls or similar functionality.
*
* It is illegal to schedule while using a temporary mm -- the context switch
--
2.51.0
* [PATCH 2/3] Provide and use an always inline version of finish_task_switch
2025-10-24 18:26 [PATCH 0/3] Optimize code generation during context switching Xie Yuanbin
2025-10-24 18:26 ` [PATCH 1/3] Change enter_lazy_tlb to inline on x86 Xie Yuanbin
@ 2025-10-24 18:35 ` Xie Yuanbin
2025-10-24 18:35 ` [PATCH 3/3] Set the subfunctions called by finish_task_switch to be inline Xie Yuanbin
2025-10-24 21:36 ` [PATCH 2/3] Provide and use an always inline version of finish_task_switch Rik van Riel
2025-10-25 12:26 ` [PATCH 0/3] Optimize code generation during context switching Peter Zijlstra
2025-10-27 15:21 ` Xie Yuanbin
3 siblings, 2 replies; 16+ messages in thread
From: Xie Yuanbin @ 2025-10-24 18:35 UTC (permalink / raw)
To: linux, mathieu.desnoyers, paulmck, pjw, palmer, aou, alex, hca,
gor, agordeev, borntraeger, svens, davem, andreas, tglx, mingo,
bp, dave.hansen, hpa, luto, peterz, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, anna-maria,
frederic, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, qq570070308, thuth, riel, akpm, david,
lorenzo.stoakes, segher, ryan.roberts, max.kellermann, urezki,
nysal
Cc: x86, linux-arm-kernel, linux-kernel, linux-riscv, linux-s390,
sparclinux, linux-perf-users, will
finish_task_switch is called during context switching; inlining it can
bring some performance benefit.
Add an always-inline version, finish_task_switch_ainline, to be called
during context switching, and keep the original version for callers
elsewhere, so that the impact on code size is kept in check.
Signed-off-by: Xie Yuanbin <qq570070308@gmail.com>
---
kernel/sched/core.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1842285eac1e..6cb3f57c4d35 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5069,21 +5069,21 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
* Note that we may have delayed dropping an mm in context_switch(). If
* so, we finish that here outside of the runqueue lock. (Doing it
* with the lock held can cause deadlocks; see schedule() for
* details.)
*
* The context switch have flipped the stack from under us and restored the
* local variables which were saved when this task called schedule() in the
* past. 'prev == current' is still correct but we need to recalculate this_rq
* because prev may have moved to another CPU.
*/
-static struct rq *finish_task_switch(struct task_struct *prev)
+static __always_inline struct rq *finish_task_switch_ainline(struct task_struct *prev)
__releases(rq->lock)
{
struct rq *rq = this_rq();
struct mm_struct *mm = rq->prev_mm;
unsigned int prev_state;
/*
* The previous task will have left us with a preempt_count of 2
* because it left us after:
*
@@ -5153,20 +5153,25 @@ static struct rq *finish_task_switch(struct task_struct *prev)
/* Task is done with its stack. */
put_task_stack(prev);
put_task_struct_rcu_user(prev);
}
return rq;
}
+static struct rq *finish_task_switch(struct task_struct *prev)
+{
+ return finish_task_switch_ainline(prev);
+}
+
/**
* schedule_tail - first thing a freshly forked thread must call.
* @prev: the thread we just switched away from.
*/
asmlinkage __visible void schedule_tail(struct task_struct *prev)
__releases(rq->lock)
{
/*
* New tasks start with FORK_PREEMPT_COUNT, see there and
* finish_task_switch() for details.
@@ -5247,21 +5252,21 @@ context_switch(struct rq *rq, struct task_struct *prev,
/* switch_mm_cid() requires the memory barriers above. */
switch_mm_cid(rq, prev, next);
prepare_lock_switch(rq, next, rf);
/* Here we just switch the register state and the stack. */
switch_to(prev, next, prev);
barrier();
- return finish_task_switch(prev);
+ return finish_task_switch_ainline(prev);
}
/*
* nr_running and nr_context_switches:
*
* externally visible scheduler statistics: current number of runnable
* threads, total number of context switches performed since bootup.
*/
unsigned int nr_running(void)
{
--
2.51.0
* [PATCH 3/3] Set the subfunctions called by finish_task_switch to be inline
2025-10-24 18:35 ` [PATCH 2/3] Provide and use an always inline version of finish_task_switch Xie Yuanbin
@ 2025-10-24 18:35 ` Xie Yuanbin
2025-10-24 19:44 ` Thomas Gleixner
2025-10-24 21:36 ` [PATCH 2/3] Provide and use an always inline version of finish_task_switch Rik van Riel
1 sibling, 1 reply; 16+ messages in thread
From: Xie Yuanbin @ 2025-10-24 18:35 UTC (permalink / raw)
To: linux, mathieu.desnoyers, paulmck, pjw, palmer, aou, alex, hca,
gor, agordeev, borntraeger, svens, davem, andreas, tglx, mingo,
bp, dave.hansen, hpa, luto, peterz, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, anna-maria,
frederic, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, qq570070308, thuth, riel, akpm, david,
lorenzo.stoakes, segher, ryan.roberts, max.kellermann, urezki,
nysal
Cc: x86, linux-arm-kernel, linux-kernel, linux-riscv, linux-s390,
sparclinux, linux-perf-users, will
The previous commit made finish_task_switch be inlined at the
context-switch call site, which increases the number of call sites of its
subfunctions in this translation unit due to the inline expansion of
finish_task_switch. Because of compiler optimization strategies, these
functions may then flip from being inlined to not being inlined, which can
actually cause a performance regression.
Modify some subfunctions of finish_task_switch to be always inline to
prevent that regression.
Signed-off-by: Xie Yuanbin <qq570070308@gmail.com>
---
arch/arm/include/asm/mmu_context.h | 6 +++++-
arch/riscv/include/asm/sync_core.h | 2 +-
arch/s390/include/asm/mmu_context.h | 6 +++++-
arch/sparc/include/asm/mmu_context_64.h | 6 +++++-
arch/x86/include/asm/sync_core.h | 2 +-
include/linux/perf_event.h | 2 +-
include/linux/sched/mm.h | 10 +++++-----
include/linux/tick.h | 4 ++--
include/linux/vtime.h | 8 ++++----
kernel/sched/core.c | 11 ++++++-----
10 files changed, 35 insertions(+), 22 deletions(-)
diff --git a/arch/arm/include/asm/mmu_context.h b/arch/arm/include/asm/mmu_context.h
index db2cb06aa8cf..d238f915f65d 100644
--- a/arch/arm/include/asm/mmu_context.h
+++ b/arch/arm/include/asm/mmu_context.h
@@ -73,39 +73,43 @@ static inline void check_and_switch_context(struct mm_struct *mm,
* finish_arch_post_lock_switch() call.
*/
mm->context.switch_pending = 1;
else
cpu_switch_mm(mm->pgd, mm);
}
#ifndef MODULE
#define finish_arch_post_lock_switch \
finish_arch_post_lock_switch
-static inline void finish_arch_post_lock_switch(void)
+static __always_inline void finish_arch_post_lock_switch_ainline(void)
{
struct mm_struct *mm = current->mm;
if (mm && mm->context.switch_pending) {
/*
* Preemption must be disabled during cpu_switch_mm() as we
* have some stateful cache flush implementations. Check
* switch_pending again in case we were preempted and the
* switch to this mm was already done.
*/
preempt_disable();
if (mm->context.switch_pending) {
mm->context.switch_pending = 0;
cpu_switch_mm(mm->pgd, mm);
}
preempt_enable_no_resched();
}
}
+static inline void finish_arch_post_lock_switch(void)
+{
+ finish_arch_post_lock_switch_ainline();
+}
#endif /* !MODULE */
#endif /* CONFIG_MMU */
#endif /* CONFIG_CPU_HAS_ASID */
#define activate_mm(prev,next) switch_mm(prev, next, NULL)
/*
* This is the actual mm switch as far as the scheduler
diff --git a/arch/riscv/include/asm/sync_core.h b/arch/riscv/include/asm/sync_core.h
index 9153016da8f1..2fe6b7fe6b12 100644
--- a/arch/riscv/include/asm/sync_core.h
+++ b/arch/riscv/include/asm/sync_core.h
@@ -1,19 +1,19 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _ASM_RISCV_SYNC_CORE_H
#define _ASM_RISCV_SYNC_CORE_H
/*
* RISC-V implements return to user-space through an xRET instruction,
* which is not core serializing.
*/
-static inline void sync_core_before_usermode(void)
+static __always_inline void sync_core_before_usermode(void)
{
asm volatile ("fence.i" ::: "memory");
}
#ifdef CONFIG_SMP
/*
* Ensure the next switch_mm() on every CPU issues a core serializing
* instruction for the given @mm.
*/
static inline void prepare_sync_core_cmd(struct mm_struct *mm)
diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
index d9b8501bc93d..abe734068193 100644
--- a/arch/s390/include/asm/mmu_context.h
+++ b/arch/s390/include/asm/mmu_context.h
@@ -90,21 +90,21 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
struct task_struct *tsk)
{
unsigned long flags;
local_irq_save(flags);
switch_mm_irqs_off(prev, next, tsk);
local_irq_restore(flags);
}
#define finish_arch_post_lock_switch finish_arch_post_lock_switch
-static inline void finish_arch_post_lock_switch(void)
+static __always_inline void finish_arch_post_lock_switch_ainline(void)
{
struct task_struct *tsk = current;
struct mm_struct *mm = tsk->mm;
unsigned long flags;
if (mm) {
preempt_disable();
while (atomic_read(&mm->context.flush_count))
cpu_relax();
cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
@@ -112,20 +112,24 @@ static inline void finish_arch_post_lock_switch(void)
preempt_enable();
}
local_irq_save(flags);
if (test_thread_flag(TIF_ASCE_PRIMARY))
local_ctl_load(1, &get_lowcore()->kernel_asce);
else
local_ctl_load(1, &get_lowcore()->user_asce);
local_ctl_load(7, &get_lowcore()->user_asce);
local_irq_restore(flags);
}
+static inline void finish_arch_post_lock_switch(void)
+{
+ finish_arch_post_lock_switch_ainline();
+}
#define activate_mm activate_mm
static inline void activate_mm(struct mm_struct *prev,
struct mm_struct *next)
{
switch_mm_irqs_off(prev, next, current);
cpumask_set_cpu(smp_processor_id(), mm_cpumask(next));
if (test_thread_flag(TIF_ASCE_PRIMARY))
local_ctl_load(1, &get_lowcore()->kernel_asce);
else
diff --git a/arch/sparc/include/asm/mmu_context_64.h b/arch/sparc/include/asm/mmu_context_64.h
index 78bbacc14d2d..9102ab2adfbc 100644
--- a/arch/sparc/include/asm/mmu_context_64.h
+++ b/arch/sparc/include/asm/mmu_context_64.h
@@ -153,21 +153,21 @@ static inline void arch_start_context_switch(struct task_struct *prev)
:
: "g1");
if (tmp_mcdper)
set_tsk_thread_flag(prev, TIF_MCDPER);
else
clear_tsk_thread_flag(prev, TIF_MCDPER);
}
}
#define finish_arch_post_lock_switch finish_arch_post_lock_switch
-static inline void finish_arch_post_lock_switch(void)
+static __always_inline void finish_arch_post_lock_switch_ainline(void)
{
/* Restore the state of MCDPER register for the new process
* just switched to.
*/
if (adi_capable()) {
register unsigned long tmp_mcdper;
tmp_mcdper = test_thread_flag(TIF_MCDPER);
__asm__ __volatile__(
"mov %0, %%g1\n\t"
@@ -177,20 +177,24 @@ static inline void finish_arch_post_lock_switch(void)
: "ir" (tmp_mcdper)
: "g1");
if (current && current->mm && current->mm->context.adi) {
struct pt_regs *regs;
regs = task_pt_regs(current);
regs->tstate |= TSTATE_MCDE;
}
}
}
+static inline void finish_arch_post_lock_switch(void)
+{
+ finish_arch_post_lock_switch_ainline();
+}
#define mm_untag_mask mm_untag_mask
static inline unsigned long mm_untag_mask(struct mm_struct *mm)
{
return -1UL >> adi_nbits();
}
#include <asm-generic/mmu_context.h>
#endif /* !(__ASSEMBLER__) */
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
index 96bda43538ee..4b55fa353bb5 100644
--- a/arch/x86/include/asm/sync_core.h
+++ b/arch/x86/include/asm/sync_core.h
@@ -86,21 +86,21 @@ static __always_inline void sync_core(void)
* hypervisor.
*/
iret_to_self();
}
/*
* Ensure that a core serializing instruction is issued before returning
* to user-mode. x86 implements return to user-space through sysexit,
* sysrel, and sysretq, which are not core serializing.
*/
-static inline void sync_core_before_usermode(void)
+static __always_inline void sync_core_before_usermode(void)
{
/* With PTI, we unconditionally serialize before running user code. */
if (static_cpu_has(X86_FEATURE_PTI))
return;
/*
* Even if we're in an interrupt, we might reschedule before returning,
* in which case we could switch to a different thread in the same mm
* and return using SYSRET or SYSEXIT. Instead of trying to keep
* track of our need to sync the core, just sync right away.
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fd1d91017b99..2b1c752af207 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1617,21 +1617,21 @@ static __always_inline bool __perf_sw_enabled(int swevt)
{
return static_key_false(&perf_swevent_enabled[swevt]);
}
static inline void perf_event_task_migrate(struct task_struct *task)
{
if (__perf_sw_enabled(PERF_COUNT_SW_CPU_MIGRATIONS))
task->sched_migrated = 1;
}
-static inline void perf_event_task_sched_in(struct task_struct *prev,
+static __always_inline void perf_event_task_sched_in(struct task_struct *prev,
struct task_struct *task)
{
if (static_branch_unlikely(&perf_sched_events))
__perf_event_task_sched_in(prev, task);
if (__perf_sw_enabled(PERF_COUNT_SW_CPU_MIGRATIONS) &&
task->sched_migrated) {
__perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 0);
task->sched_migrated = 0;
}
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 0e1d73955fa5..e7787a6e7d22 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -37,21 +37,21 @@ static inline void mmgrab(struct mm_struct *mm)
atomic_inc(&mm->mm_count);
}
static inline void smp_mb__after_mmgrab(void)
{
smp_mb__after_atomic();
}
extern void __mmdrop(struct mm_struct *mm);
-static inline void mmdrop(struct mm_struct *mm)
+static __always_inline void mmdrop(struct mm_struct *mm)
{
/*
* The implicit full barrier implied by atomic_dec_and_test() is
* required by the membarrier system call before returning to
* user-space, after storing to rq->curr.
*/
if (unlikely(atomic_dec_and_test(&mm->mm_count)))
__mmdrop(mm);
}
@@ -64,28 +64,28 @@ static inline void __mmdrop_delayed(struct rcu_head *rhp)
{
struct mm_struct *mm = container_of(rhp, struct mm_struct, delayed_drop);
__mmdrop(mm);
}
/*
* Invoked from finish_task_switch(). Delegates the heavy lifting on RT
* kernels via RCU.
*/
-static inline void mmdrop_sched(struct mm_struct *mm)
+static __always_inline void mmdrop_sched(struct mm_struct *mm)
{
/* Provides a full memory barrier. See mmdrop() */
if (atomic_dec_and_test(&mm->mm_count))
call_rcu(&mm->delayed_drop, __mmdrop_delayed);
}
#else
-static inline void mmdrop_sched(struct mm_struct *mm)
+static __always_inline void mmdrop_sched(struct mm_struct *mm)
{
mmdrop(mm);
}
#endif
/* Helpers for lazy TLB mm refcounting */
static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
{
if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
mmgrab(mm);
@@ -97,21 +97,21 @@ static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
mmdrop(mm);
} else {
/*
* mmdrop_lazy_tlb must provide a full memory barrier, see the
* membarrier comment finish_task_switch which relies on this.
*/
smp_mb();
}
}
-static inline void mmdrop_lazy_tlb_sched(struct mm_struct *mm)
+static __always_inline void mmdrop_lazy_tlb_sched(struct mm_struct *mm)
{
if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
mmdrop_sched(mm);
else
smp_mb(); /* see mmdrop_lazy_tlb() above */
}
/**
* mmget() - Pin the address space associated with a &struct mm_struct.
* @mm: The address space to pin.
@@ -524,21 +524,21 @@ enum {
enum {
MEMBARRIER_FLAG_SYNC_CORE = (1U << 0),
MEMBARRIER_FLAG_RSEQ = (1U << 1),
};
#ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS
#include <asm/membarrier.h>
#endif
-static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
+static __always_inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
{
/*
* The atomic_read() below prevents CSE. The following should
* help the compiler generate more efficient code on architectures
* where sync_core_before_usermode() is a no-op.
*/
if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
return;
if (current->mm != mm)
return;
diff --git a/include/linux/tick.h b/include/linux/tick.h
index ac76ae9fa36d..fce16aa10ba2 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -168,21 +168,21 @@ static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
* Mask of CPUs that are nohz_full.
*
* Users should be guarded by CONFIG_NO_HZ_FULL or a tick_nohz_full_cpu()
* check.
*/
extern cpumask_var_t tick_nohz_full_mask;
#ifdef CONFIG_NO_HZ_FULL
extern bool tick_nohz_full_running;
-static inline bool tick_nohz_full_enabled(void)
+static __always_inline bool tick_nohz_full_enabled(void)
{
if (!context_tracking_enabled())
return false;
return tick_nohz_full_running;
}
/*
* Check if a CPU is part of the nohz_full subset. Arrange for evaluating
* the cpu expression (typically smp_processor_id()) _after_ the static
@@ -292,21 +292,21 @@ static inline void tick_dep_init_task(struct task_struct *tsk) { }
static inline void tick_dep_set_signal(struct task_struct *tsk,
enum tick_dep_bits bit) { }
static inline void tick_dep_clear_signal(struct signal_struct *signal,
enum tick_dep_bits bit) { }
static inline void tick_nohz_full_kick_cpu(int cpu) { }
static inline void __tick_nohz_task_switch(void) { }
static inline void tick_nohz_full_setup(cpumask_var_t cpumask) { }
#endif
-static inline void tick_nohz_task_switch(void)
+static __always_inline void tick_nohz_task_switch(void)
{
if (tick_nohz_full_enabled())
__tick_nohz_task_switch();
}
static inline void tick_nohz_user_enter_prepare(void)
{
if (tick_nohz_full_cpu(smp_processor_id()))
rcu_nocb_flush_deferred_wakeup();
}
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 29dd5b91dd7d..428464bb81b3 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -60,38 +60,38 @@ static __always_inline void vtime_account_guest_exit(void)
}
#elif defined(CONFIG_VIRT_CPU_ACCOUNTING_GEN)
/*
* Checks if vtime is enabled on some CPU. Cputime readers want to be careful
* in that case and compute the tickless cputime.
* For now vtime state is tied to context tracking. We might want to decouple
* those later if necessary.
*/
-static inline bool vtime_accounting_enabled(void)
+static __always_inline bool vtime_accounting_enabled(void)
{
return context_tracking_enabled();
}
-static inline bool vtime_accounting_enabled_cpu(int cpu)
+static __always_inline bool vtime_accounting_enabled_cpu(int cpu)
{
return context_tracking_enabled_cpu(cpu);
}
-static inline bool vtime_accounting_enabled_this_cpu(void)
+static __always_inline bool vtime_accounting_enabled_this_cpu(void)
{
return context_tracking_enabled_this_cpu();
}
extern void vtime_task_switch_generic(struct task_struct *prev);
-static inline void vtime_task_switch(struct task_struct *prev)
+static __always_inline void vtime_task_switch(struct task_struct *prev)
{
if (vtime_accounting_enabled_this_cpu())
vtime_task_switch_generic(prev);
}
static __always_inline void vtime_account_guest_enter(void)
{
if (vtime_accounting_enabled_this_cpu())
vtime_guest_enter(current);
else
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6cb3f57c4d35..7a70d13d03fe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4869,21 +4869,21 @@ static inline void prepare_task(struct task_struct *next)
/*
* Claim the task as running, we do this before switching to it
* such that any running task will have this set.
*
* See the smp_load_acquire(&p->on_cpu) case in ttwu() and
* its ordering comment.
*/
WRITE_ONCE(next->on_cpu, 1);
}
-static inline void finish_task(struct task_struct *prev)
+static __always_inline void finish_task(struct task_struct *prev)
{
/*
* This must be the very last reference to @prev from this CPU. After
* p->on_cpu is cleared, the task can be moved to a different CPU. We
* must ensure this doesn't happen until the switch is completely
* finished.
*
* In particular, the load of prev->state in finish_task_switch() must
* happen before this.
*
@@ -4983,53 +4983,54 @@ prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf
* do an early lockdep release here:
*/
rq_unpin_lock(rq, rf);
spin_release(&__rq_lockp(rq)->dep_map, _THIS_IP_);
#ifdef CONFIG_DEBUG_SPINLOCK
/* this is a valid case when another task releases the spinlock */
rq_lockp(rq)->owner = next;
#endif
}
-static inline void finish_lock_switch(struct rq *rq)
+static __always_inline void finish_lock_switch(struct rq *rq)
{
/*
* If we are tracking spinlock dependencies then we have to
* fix up the runqueue lock - which gets 'carried over' from
* prev into current:
*/
spin_acquire(&__rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
__balance_callbacks(rq);
raw_spin_rq_unlock_irq(rq);
}
/*
* NOP if the arch has not defined these:
*/
#ifndef prepare_arch_switch
# define prepare_arch_switch(next) do { } while (0)
#endif
#ifndef finish_arch_post_lock_switch
-# define finish_arch_post_lock_switch() do { } while (0)
+# define finish_arch_post_lock_switch() do { } while (0)
+# define finish_arch_post_lock_switch_ainline() do { } while (0)
#endif
static inline void kmap_local_sched_out(void)
{
#ifdef CONFIG_KMAP_LOCAL
if (unlikely(current->kmap_ctrl.idx))
__kmap_local_sched_out();
#endif
}
-static inline void kmap_local_sched_in(void)
+static __always_inline void kmap_local_sched_in(void)
{
#ifdef CONFIG_KMAP_LOCAL
if (unlikely(current->kmap_ctrl.idx))
__kmap_local_sched_in();
#endif
}
/**
* prepare_task_switch - prepare to switch tasks
* @rq: the runqueue preparing to switch
@@ -5111,21 +5112,21 @@ static __always_inline struct rq *finish_task_switch_ainline(struct task_struct
* finish_task), otherwise a concurrent wakeup can get prev
* running on another CPU and we could rave with its RUNNING -> DEAD
* transition, resulting in a double drop.
*/
prev_state = READ_ONCE(prev->__state);
vtime_task_switch(prev);
perf_event_task_sched_in(prev, current);
finish_task(prev);
tick_nohz_task_switch();
finish_lock_switch(rq);
- finish_arch_post_lock_switch();
+ finish_arch_post_lock_switch_ainline();
kcov_finish_switch(current);
/*
* kmap_local_sched_out() is invoked with rq::lock held and
* interrupts disabled. There is no requirement for that, but the
* sched out code does not have an interrupt enabled section.
* Restoring the maps on sched in does not require interrupts being
* disabled either.
*/
kmap_local_sched_in();
--
2.51.0
* Re: [PATCH 3/3] Set the subfunctions called by finish_task_switch to be inline
2025-10-24 18:35 ` [PATCH 3/3] Set the subfunctions called by finish_task_switch to be inline Xie Yuanbin
@ 2025-10-24 19:44 ` Thomas Gleixner
2025-10-25 18:51 ` Xie Yuanbin
0 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2025-10-24 19:44 UTC (permalink / raw)
To: Xie Yuanbin, linux, mathieu.desnoyers, paulmck, pjw, palmer, aou,
alex, hca, gor, agordeev, borntraeger, svens, davem, andreas,
mingo, bp, dave.hansen, hpa, luto, peterz, acme, namhyung,
mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
anna-maria, frederic, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
qq570070308, thuth, riel, akpm, david, lorenzo.stoakes, segher,
ryan.roberts, max.kellermann, urezki, nysal
Cc: x86, linux-arm-kernel, linux-kernel, linux-riscv, linux-s390,
sparclinux, linux-perf-users, will
On Sat, Oct 25 2025 at 02:35, Xie Yuanbin wrote:
> #ifndef MODULE
> #define finish_arch_post_lock_switch \
> finish_arch_post_lock_switch
> -static inline void finish_arch_post_lock_switch(void)
> +static __always_inline void finish_arch_post_lock_switch_ainline(void)
> {
> struct mm_struct *mm = current->mm;
>
> if (mm && mm->context.switch_pending) {
> /*
> * Preemption must be disabled during cpu_switch_mm() as we
> * have some stateful cache flush implementations. Check
> * switch_pending again in case we were preempted and the
> * switch to this mm was already done.
> */
> preempt_disable();
> if (mm->context.switch_pending) {
> mm->context.switch_pending = 0;
> cpu_switch_mm(mm->pgd, mm);
> }
> preempt_enable_no_resched();
> }
> }
> +static inline void finish_arch_post_lock_switch(void)
> +{
> + finish_arch_post_lock_switch_ainline();
What exactly is the point of this indirection? Why can't you just mark
finish_arch_post_lock_switch() __always_inline and be done with it?
* Re: [PATCH 1/3] Change enter_lazy_tlb to inline on x86
2025-10-24 18:26 ` [PATCH 1/3] Change enter_lazy_tlb to inline on x86 Xie Yuanbin
@ 2025-10-24 20:14 ` Rik van Riel
0 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2025-10-24 20:14 UTC (permalink / raw)
To: Xie Yuanbin, linux, mathieu.desnoyers, paulmck, pjw, palmer, aou,
alex, hca, gor, agordeev, borntraeger, svens, davem, andreas,
tglx, mingo, bp, dave.hansen, hpa, luto, peterz, acme, namhyung,
mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
anna-maria, frederic, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, thuth,
akpm, david, lorenzo.stoakes, segher, ryan.roberts,
max.kellermann, urezki, nysal
Cc: x86, linux-arm-kernel, linux-kernel, linux-riscv, linux-s390,
sparclinux, linux-perf-users, will
On Sat, 2025-10-25 at 02:26 +0800, Xie Yuanbin wrote:
> This function is very short, and is called in the context switching,
> so it is called very frequently.
>
> Change it to inline function on x86 to improve performance, just like
> its code in other architectures
>
> Signed-off-by: Xie Yuanbin <qq570070308@gmail.com>
Might as well inline it into what I remember
being its only caller.
Reviewed-by: Rik van Riel <riel@surriel.com>
--
All Rights Reversed.
* Re: [PATCH 2/3] Provide and use an always inline version of finish_task_switch
2025-10-24 18:35 ` [PATCH 2/3] Provide and use an always inline version of finish_task_switch Xie Yuanbin
2025-10-24 18:35 ` [PATCH 3/3] Set the subfunctions called by finish_task_switch to be inline Xie Yuanbin
@ 2025-10-24 21:36 ` Rik van Riel
2025-10-25 14:36 ` Segher Boessenkool
` (2 more replies)
1 sibling, 3 replies; 16+ messages in thread
From: Rik van Riel @ 2025-10-24 21:36 UTC (permalink / raw)
To: Xie Yuanbin, linux, mathieu.desnoyers, paulmck, pjw, palmer, aou,
alex, hca, gor, agordeev, borntraeger, svens, davem, andreas,
tglx, mingo, bp, dave.hansen, hpa, luto, peterz, acme, namhyung,
mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
anna-maria, frederic, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, thuth,
akpm, david, lorenzo.stoakes, segher, ryan.roberts,
max.kellermann, urezki, nysal
Cc: x86, linux-arm-kernel, linux-kernel, linux-riscv, linux-s390,
sparclinux, linux-perf-users, will
On Sat, 2025-10-25 at 02:35 +0800, Xie Yuanbin wrote:
> finish_task_switch is called during context switching,
> inlining it can bring some performance benefits.
>
> Add an always inline version `finish_task_switch_ainline` to be
> called
> during context switching, and keep the original version for being
> called
> elsewhere, so as to take into account the size impact.
Does that actually work, or does the compiler
still inline some of those "non-inlined" versions,
anyway?
Also, what kind of performance improvement
have you measured with these changes?
--
All Rights Reversed.
* Re: [PATCH 0/3] Optimize code generation during context switching
2025-10-24 18:26 [PATCH 0/3] Optimize code generation during context switching Xie Yuanbin
2025-10-24 18:26 ` [PATCH 1/3] Change enter_lazy_tlb to inline on x86 Xie Yuanbin
2025-10-24 18:35 ` [PATCH 2/3] Provide and use an always inline version of finish_task_switch Xie Yuanbin
@ 2025-10-25 12:26 ` Peter Zijlstra
2025-10-25 18:20 ` [PATCH 0/3] Optimize code generation during context Xie Yuanbin
2025-10-27 15:21 ` Xie Yuanbin
3 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2025-10-25 12:26 UTC (permalink / raw)
To: Xie Yuanbin
Cc: linux, mathieu.desnoyers, paulmck, pjw, palmer, aou, alex, hca,
gor, agordeev, borntraeger, svens, davem, andreas, tglx, mingo,
bp, dave.hansen, hpa, luto, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, anna-maria,
frederic, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, thuth, riel, akpm, david,
lorenzo.stoakes, segher, ryan.roberts, max.kellermann, urezki,
nysal, x86, linux-arm-kernel, linux-kernel, linux-riscv,
linux-s390, sparclinux, linux-perf-users, will
On Sat, Oct 25, 2025 at 02:26:25AM +0800, Xie Yuanbin wrote:
> The purpose of this series of patches is to optimize the performance of
> context switching. It does not change the code logic, but only modifies
> the inline attributes of some functions.
>
> The original reason for writing this patch is that, when debugging a
> schedule performance problem, I discovered that the finish_task_switch
> function was not inlined, even in the O2 level optimization. This may
> affect performance for the following reasons:
Not sure what compiler you're running, but it is on the one random
compile I just checked.
> 1. It is in the context switching code, and is called frequently.
> 2. Because of the modern CPU mitigations for vulnerabilities, inside
> switch_mm, the instruction pipeline and cache may be cleared, and the
> branch and cache miss may increase. finish_task_switch is right after
> that, so this may cause greater performance degradation.
That patch really is one of the ugliest things I've seen in a while; and
you have no performance numbers included or any other justification for
any of this ugly.
> 3. The __schedule function has __sched attribute, which makes it be
> placed in the ".sched.text" section, while finish_task_switch does not,
> which causes their distance to be very far in binary, aggravating the
> above performance degradation.
How? If it doesn't get inlined it will be a direct call, in which case
the prefetcher should have no trouble.
* Re: [PATCH 2/3] Provide and use an always inline version of finish_task_switch
2025-10-24 21:36 ` [PATCH 2/3] Provide and use an always inline version of finish_task_switch Rik van Riel
@ 2025-10-25 14:36 ` Segher Boessenkool
2025-10-25 17:37 ` [PATCH 0/3] Optimize code generation during context Xie Yuanbin
2025-10-25 19:18 ` [PATCH 2/3] Provide and use an always inline version of finish_task_switch Xie Yuanbin
2 siblings, 0 replies; 16+ messages in thread
From: Segher Boessenkool @ 2025-10-25 14:36 UTC (permalink / raw)
To: Rik van Riel
Cc: Xie Yuanbin, linux, mathieu.desnoyers, paulmck, pjw, palmer, aou,
alex, hca, gor, agordeev, borntraeger, svens, davem, andreas,
tglx, mingo, bp, dave.hansen, hpa, luto, peterz, acme, namhyung,
mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
anna-maria, frederic, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, thuth,
akpm, david, lorenzo.stoakes, ryan.roberts, max.kellermann,
urezki, nysal, x86, linux-arm-kernel, linux-kernel, linux-riscv,
linux-s390, sparclinux, linux-perf-users, will
On Fri, Oct 24, 2025 at 05:36:06PM -0400, Rik van Riel wrote:
> On Sat, 2025-10-25 at 02:35 +0800, Xie Yuanbin wrote:
> > finish_task_switch is called during context switching,
> > inlining it can bring some performance benefits.
> >
> > Add an always inline version `finish_task_switch_ainline` to be
> > called
> > during context switching, and keep the original version for being
> > called
> > elsewhere, so as to take into account the size impact.
>
> Does that actually work, or does the compiler
> still inline some of those "non-inlined" versions,
> anyway?
Of course the compiler does! That is part of the compiler's job after
all, to generate fast, efficient code!
The compiler will inline stuff when a) it *can*, mostly it has to have
the function body available; and b) it estimates it to be a win to
inline it. There is a whole bunch of heuristics for this. One of those
is that the always_inline attribute will do the utmost to get inlining
to happen.
(All that is assuming you have -finline-functions turned on, like you do
have at -O2).
Segher
* Re: [PATCH 0/3] Optimize code generation during context
2025-10-24 21:36 ` [PATCH 2/3] Provide and use an always inline version of finish_task_switch Rik van Riel
2025-10-25 14:36 ` Segher Boessenkool
@ 2025-10-25 17:37 ` Xie Yuanbin
2025-10-29 10:26 ` David Hildenbrand
2025-10-25 19:18 ` [PATCH 2/3] Provide and use an always inline version of finish_task_switch Xie Yuanbin
2 siblings, 1 reply; 16+ messages in thread
From: Xie Yuanbin @ 2025-10-25 17:37 UTC (permalink / raw)
To: riel, linux, mathieu.desnoyers, paulmck, pjw, palmer, aou, alex,
hca, gor, agordeev, borntraeger, svens, davem, andreas, tglx,
mingo, bp, dave.hansen, hpa, luto, peterz, acme, namhyung,
mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
anna-maria, frederic, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
qq570070308, thuth, akpm, david, lorenzo.stoakes, segher,
ryan.roberts, max.kellermann, urezki, nysal
Cc: x86, linux-arm-kernel, linux-kernel, linux-riscv, linux-s390,
sparclinux, linux-perf-users, will
On Fri, 24 Oct 2025 17:36:06 -0400, Rik van Riel wrote:
> Also, what kind of performance improvement
> have you measured with these changes?
When I debugged performance issues before, I was using my company's
equipment, so I could only observe high-level business performance data,
not the specific scheduling times. Today I did some testing using my own
devices, and the test logic is as follows:
```
- return finish_task_switch(prev);
+ start_time = rdtsc();
+ barrier();
+ rq = finish_task_switch(prev);
+ barrier();
+ end_time = rdtsc();
+ return rq;
```
The test data is as follows:
1. mitigations Off, without patches: 13.5 - 13.7
2. mitigations Off, with patches: 13.5 - 13.7
3. mitigations On, without patches: 23.3 - 23.6
4. mitigations On, with patches: 16.6 - 16.8
On my device, these patches have very little effect with mitigations off,
but the improvement is still very noticeable with mitigations on.
I suspect this is because I'm using a recent Ryzen CPU with a very
powerful instruction cache and branch predictor, so without the Spectre
mitigations in play, inlining matters less.
However, on embedded devices with small instruction caches, these patches
should still help even with mitigations off.
Xie Yuanbin
* Re: [PATCH 0/3] Optimize code generation during context
2025-10-25 12:26 ` [PATCH 0/3] Optimize code generation during context switching Peter Zijlstra
@ 2025-10-25 18:20 ` Xie Yuanbin
0 siblings, 0 replies; 16+ messages in thread
From: Xie Yuanbin @ 2025-10-25 18:20 UTC (permalink / raw)
To: peterz, linux, mathieu.desnoyers, paulmck, pjw, palmer, aou, alex,
hca, gor, agordeev, borntraeger, svens, davem, andreas, tglx,
mingo, bp, dave.hansen, hpa, luto, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter, anna-maria,
frederic, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, qq570070308, thuth, riel, akpm, david,
lorenzo.stoakes, segher, ryan.roberts, max.kellermann, urezki,
nysal
Cc: x86, linux-arm-kernel, linux-kernel, linux-riscv, linux-s390,
sparclinux, linux-perf-users, will
On Sat, 25 Oct 2025 14:26:59 +0200, Peter Zijlstra wrote:
> Not sure what compiler you're running, but it is on the one random
> compile I just checked.
I'm using gcc 15.2 and clang 22 now; neither of them inlines the
finish_task_switch function, even at the -O2 optimization level.
> you have no performance numbers included or any other justification for
> any of this ugly.
I apologize for this. I originally discovered this missed optimization
when I was debugging a scheduling performance issue. I was using my
company's equipment and could only observe high-level business performance
data, not the specific scheduling timing data.
Today I did some testing using my own devices; the test logic is as
follows:
```
- return finish_task_switch(prev);
+ start_time = rdtsc();
+ barrier();
+ rq = finish_task_switch(prev);
+ barrier();
+ end_time = rdtsc();
+ return rq;
```
The test data is as follows:
1. mitigations Off, without patches: 13.5 - 13.7
2. mitigations Off, with patches: 13.5 - 13.7
3. mitigations On, without patches: 23.3 - 23.6
4. mitigations On, with patches: 16.6 - 16.8
Some config:
PREEMPT=n
DEBUG_PREEMPT=n
NO_HZ_FULL=n
NO_HZ_IDLE=y
STACKPROTECTOR_STRONG=y
On my device, these patches have very little effect with mitigations off,
but the improvement is still very noticeable with mitigations on.
I suspect this is because I'm using a recent Ryzen CPU with a very
powerful instruction cache and branch predictor, so without the Spectre
mitigations in play, inlining matters less.
However, on embedded devices with small instruction caches, these patches
should still help even with mitigations off.
>> 3. The __schedule function has __sched attribute, which makes it be
>> placed in the ".sched.text" section, while finish_task_switch does not,
>> which causes their distance to be very far in binary, aggravating the
>> above performance degradation.
>
> How? If it doesn't get inlined it will be a direct call, in which case
> the prefetcher should have no trouble.
Placing related functions and data close together in the binary is a
common compiler optimization. For example, the hot and cold attributes
place code in dedicated hot/cold text sections, which reduces instruction
and data cache misses.
The current code adds the __sched attribute to the __schedule function
(placing it into the ".sched.text" section), but not to finish_task_switch,
causing them to be very far apart in the binary.
If the __schedule function didn't have the __sched attribute, both would
be in the .text section of the same translation unit.
Thus, the __sched attribute on __schedule actually causes a degradation
here, and inlining finish_task_switch alleviates the problem.
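For reference, __sched itself is just a section placement attribute;
paraphrased (its definition lives in include/linux/sched/debug.h in
current kernels, exact wording may differ by version):

```c
/* Paraphrased: functions marked __sched are grouped into ".sched.text". */
#define __sched __section(".sched.text")
```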
Xie Yuanbin
* Re: [PATCH 3/3] Set the subfunctions called by finish_task_switch to be inline
2025-10-24 19:44 ` Thomas Gleixner
@ 2025-10-25 18:51 ` Xie Yuanbin
0 siblings, 0 replies; 16+ messages in thread
From: Xie Yuanbin @ 2025-10-25 18:51 UTC (permalink / raw)
To: tglx
Cc: acme, adrian.hunter, agordeev, akpm, alex, alexander.shishkin,
andreas, anna-maria, aou, borntraeger, bp, bsegall, dave.hansen,
davem, david, dietmar.eggemann, frederic, gor, hca, hpa, irogers,
jolsa, juri.lelli, linux-arm-kernel, linux-kernel,
linux-perf-users, linux-riscv, linux-s390, linux, lorenzo.stoakes,
luto, mark.rutland, mathieu.desnoyers, max.kellermann, mgorman,
mingo, namhyung, nysal, palmer, paulmck, peterz, pjw, qq570070308,
riel, rostedt, ryan.roberts, segher, sparclinux, svens, thuth,
urezki, vincent.guittot, vschneid, will, x86
On Fri, 24 Oct 2025 21:44:10 +0200, Thomas Gleixner wrote:
> What is exactly the point of this indirection. Why can't you just mark
> finish_arch_post_lock_switch() __always_inline and be done with it?
In this patch, I've added an always inline version of the function,
finish_arch_post_lock_switch_ainline. The original function,
finish_arch_post_lock_switch, retains its original inline attribute.
The reason for this is that this function is called not only during
context switches but also from other code, and I don't want to affect
those callers. In fact, with -Os/-Oz optimization, if this function is
called multiple times within one .c file, it will most likely not be
inlined even if it is marked inline.
Context switching is hot code, so I want it to always be inlined there to
improve performance. Elsewhere, where it is not performance-critical, it
can stay non-inlined for a code-size benefit.
It is up to you; I have no objection to marking
finish_arch_post_lock_switch __always_inline directly.
Xie Yuanbin
* Re: [PATCH 2/3] Provide and use an always inline version of finish_task_switch
2025-10-24 21:36 ` [PATCH 2/3] Provide and use an always inline version of finish_task_switch Rik van Riel
2025-10-25 14:36 ` Segher Boessenkool
2025-10-25 17:37 ` [PATCH 0/3] Optimize code generation during context Xie Yuanbin
@ 2025-10-25 19:18 ` Xie Yuanbin
2 siblings, 0 replies; 16+ messages in thread
From: Xie Yuanbin @ 2025-10-25 19:18 UTC (permalink / raw)
To: riel, segher
Cc: acme, adrian.hunter, agordeev, akpm, alex, alexander.shishkin,
andreas, anna-maria, aou, borntraeger, bp, bsegall, dave.hansen,
davem, david, dietmar.eggemann, frederic, gor, hca, hpa, irogers,
jolsa, juri.lelli, linux-arm-kernel, linux-kernel,
linux-perf-users, linux-riscv, linux-s390, linux, lorenzo.stoakes,
luto, mark.rutland, mathieu.desnoyers, max.kellermann, mgorman,
mingo, namhyung, nysal, palmer, paulmck, peterz, pjw, qq570070308,
rostedt, ryan.roberts, sparclinux, svens, tglx, thuth, urezki,
vincent.guittot, vschneid, will, x86
On Fri, 24 Oct 2025 17:36:06 -0400, Rik van Riel wrote:
> Does that actually work, or does the compiler
> still inline some of those "non-inlined" versions,
> anyway?
For the current code, adding a new finish_task_switch_ainline function
and calling it has the same effect as directly changing the
finish_task_switch attribute to __always_inline. This is because there
are only two references to finish_task_switch in core.c: when one is
inlined, the other becomes the only call site and will be inlined as well
(unless a noinline option/attribute is added or the static attribute is
removed), so no out-of-line copy of finish_task_switch is emitted.
However, if more call sites of finish_task_switch are added to core.c in
the future, even just one, a non-inline copy will be generated for them
and the code-size benefit will be preserved.
Xie Yuanbin
* Re: [PATCH 0/3] Optimize code generation during context
2025-10-24 18:26 [PATCH 0/3] Optimize code generation during context switching Xie Yuanbin
` (2 preceding siblings ...)
2025-10-25 12:26 ` [PATCH 0/3] Optimize code generation during context switching Peter Zijlstra
@ 2025-10-27 15:21 ` Xie Yuanbin
3 siblings, 0 replies; 16+ messages in thread
From: Xie Yuanbin @ 2025-10-27 15:21 UTC (permalink / raw)
To: peterz, riel, segher, linux, mathieu.desnoyers, paulmck, pjw,
palmer, aou, alex, hca, gor, agordeev, borntraeger, svens, davem,
andreas, tglx, mingo, bp, dave.hansen, hpa, luto, acme, namhyung,
mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
anna-maria, frederic, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
qq570070308, thuth, akpm, david, lorenzo.stoakes, ryan.roberts,
max.kellermann, urezki, nysal
Cc: x86, linux-arm-kernel, linux-kernel, linux-riscv, linux-s390,
sparclinux, linux-perf-users, will
I conducted a more detailed performance test on this series of patches.
https://lore.kernel.org/lkml/20251024182628.68921-1-qq570070308@gmail.com/t/#u
The data is as follows:
1. Time spent on calling finish_task_switch (unit: rdtsc):
| compiler && appended cmdline | without patches | with patches |
| clang + NA | 14.11 - 14.16 | 12.73 - 12.74 |
| clang + "spectre_v2_user=on" | 30.04 - 30.18 | 17.64 - 17.73 |
| gcc + NA | 16.73 - 16.83 | 15.35 - 15.44 |
| gcc + "spectre_v2_user=on" | 40.91 - 40.96 | 30.61 - 30.66 |
Note: I use x86 for testing here. Different architectures have different
cmdlines for configuring mitigations. For example, on arm64, spectre v2
mitigation is enabled by default, and it should be disabled by adding
"nospectre_v2" to the cmdline.
2. bzImage size:
| compiler | without patches | with patches |
| clang | 13173760 | 13173760 |
| gcc | 12166144 | 12166144 |
No size changes were found on bzImage.
Test info:
1. kernel source:
latest linux-next branch:
commit id 72fb0170ef1f45addf726319c52a0562b6913707
2. test machine:
cpu: intel i5-8300h@4Ghz
mem: DDR4 2666MHz
Bare-metal boot, non-virtualized environment
3. compiler:
gcc: gcc version 15.2.0 (Debian 15.2.0-7)
clang: Debian clang version 22.0.0 (++20250731080150+be449d6b6587-1~exp1+b1)
4. config:
base on default x86_64_defconfig, and setting:
CONFIG_PREEMPT=y
CONFIG_PREEMPT_DYNAMIC=n
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_HZ=100
CONFIG_DEBUG_ENTRY=n
CONFIG_X86_DEBUG_FPU=n
CONFIG_EXPERT=y
CONFIG_MODIFY_LDT_SYSCALL=n
CONFIG_CGROUPS=n
CONFIG_BUG=n
CONFIG_BLK_DEV_NVME=y
5. test method:
Use rdtsc (cntvct_el0 can be used on arm64/arm) to obtain timestamps
before and after the finish_task_switch call site, create multiple
processes to trigger context switches, then calculate the average
duration of the finish_task_switch call.
Note that using multiple processes rather than threads is recommended for
testing, because this will trigger switch_mm (where spectre v2 mitigations
may be performed) during context switching.
I put my test code here:
kernel(just for testing, not a commit):
```
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index ced2a1dee..9e72a4a1a 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -394,6 +394,7 @@
467 common open_tree_attr sys_open_tree_attr
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
+470 common mysyscall sys_mysyscall
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1842285ea..bcbfea69d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5191,6 +5191,40 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
calculate_sigpending();
}
+static DEFINE_PER_CPU(uint64_t, mytime);
+static DEFINE_PER_CPU(uint64_t, total_time);
+static DEFINE_PER_CPU(uint64_t, last_total_time);
+static DEFINE_PER_CPU(uint64_t, total_count);
+
+static __always_inline uint64_t myrdtsc(void)
+{
+ register uint64_t rax __asm__("rax");
+ register uint64_t rdx __asm__("rdx");
+
+ __asm__ __volatile__ ("rdtsc" : "=a"(rax), "=d"(rdx));
+ return rax | (rdx << 32);
+}
+
+static __always_inline void start_time(void)
+{
+ raw_cpu_write(mytime, myrdtsc());
+}
+
+static __always_inline void end_time(void)
+{
+ const uint64_t end_time = myrdtsc();
+ const uint64_t cost_time = end_time - raw_cpu_read(mytime);
+
+ raw_cpu_add(total_time, cost_time);
+ if (raw_cpu_inc_return(total_count) % (1 << 20) == 0) {
+ const uint64_t t = raw_cpu_read(total_time);
+ const uint64_t lt = raw_cpu_read(last_total_time);
+
+ pr_emerg("cpu %d total_time %llu, last_total_time %llu, cha : %llu\n", raw_smp_processor_id(), t, lt, t - lt);
+ raw_cpu_write(last_total_time, t);
+ }
+}
+
/*
* context_switch - switch to the new MM and the new thread's register state.
*/
@@ -5254,7 +5288,10 @@ context_switch(struct rq *rq, struct task_struct *prev,
switch_to(prev, next, prev);
barrier();
- return finish_task_switch(prev);
+ start_time();
+ rq = finish_task_switch(prev);
+ end_time();
+ return rq;
}
/*
@@ -10854,3 +10891,19 @@ void sched_change_end(struct sched_change_ctx *ctx)
p->sched_class->prio_changed(rq, p, ctx->prio);
}
}
+
+
+static struct task_struct *my_task;
+
+SYSCALL_DEFINE0(mysyscall)
+{
+ preempt_disable();
+ while (1) {
+ if (my_task)
+ wake_up_process(my_task);
+ my_task = current;
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ __schedule(0);
+ }
+ return 0;
+}
```
User program:
```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <unistd.h>

int main()
{
cpu_set_t mask;
if (fork())
sleep(1);
CPU_ZERO(&mask);
CPU_SET(5, &mask); // Assume that cpu5 exists
assert(sched_setaffinity(0, sizeof(mask), &mask) == 0);
syscall(470);
// unreachable
return 0;
}
```
Usage:
1. set core5 as isolated cpu: add "isolcpus=5" to cmdline
2. run user programe
3. wait for kernel print
Everyone is welcome to test it.
Xie Yuanbin
* Re: [PATCH 0/3] Optimize code generation during context
2025-10-25 17:37 ` [PATCH 0/3] Optimize code generation during context Xie Yuanbin
@ 2025-10-29 10:26 ` David Hildenbrand
2025-10-30 15:04 ` Xie Yuanbin
0 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand @ 2025-10-29 10:26 UTC (permalink / raw)
To: Xie Yuanbin, riel, linux, mathieu.desnoyers, paulmck, pjw, palmer,
aou, alex, hca, gor, agordeev, borntraeger, svens, davem, andreas,
tglx, mingo, bp, dave.hansen, hpa, luto, peterz, acme, namhyung,
mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
anna-maria, frederic, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, thuth,
akpm, lorenzo.stoakes, segher, ryan.roberts, max.kellermann,
urezki, nysal
Cc: x86, linux-arm-kernel, linux-kernel, linux-riscv, linux-s390,
sparclinux, linux-perf-users, will
On 25.10.25 19:37, Xie Yuanbin wrote:
> On Fri, 24 Oct 2025 17:36:06 -0400, Rik van Riel wrote:
>> Also, what kind of performance improvement
>> have you measured with these changes?
>
> When I debugged performance issues before, I used the company's equipment.
> I could only observe the macro business performance data, but not the
> specific scheduling time. Today I did some testing using my devices,
> and the testing logic is as follows:
> ```
> - return finish_task_switch(prev);
> + start_time = rdtsc();
> + barrier();
> + rq = finish_task_switch(prev);
> + barrier();
> + end_time = rdtsc();
> + return rq;
> ```
>
> The test data is as follows:
> 1. mitigations Off, without patches: 13.5 - 13.7
> 2. mitigations Off, with patches: 13.5 - 13.7
> 3. mitigations On, without patches: 23.3 - 23.6
> 4. mitigations On, with patches: 16.6 - 16.8
Such numbers absolutely have to be part of the relevant patches / cover
letter to show that the compiler is not actually smart enough to make a
good decision.
Having that said, sometimes it helps to understand "why" the compiler
does a bad job, and try to tackle that instead.
For example, compilers will not inline functions that might be too big
(there is a compiler tunable), factoring out slow-paths etc could help
to convince the compiler to do the right thing.
Of course, it's not always possible, and sometimes we just know that we
always want to inline.
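A sketch of that slow-path factoring (a generic, hypothetical example, not
code from this series; all names are made up): keep the hot path tiny and
always inline, and push the rarely taken work into a separate out-of-line
function so the inliner's size estimate for callers stays small.

```c
#include <stdbool.h>
#include <stdio.h>

#define likely(x) __builtin_expect(!!(x), 1)

struct foo {
	bool needs_slow_path;
};

/* Rarely executed; kept out of line (the kernel would say "noinline")
 * so it does not inflate the caller's size estimate in the inliner. */
static __attribute__((noinline)) void frobnicate_slow(struct foo *f)
{
	printf("slow path for %p\n", (void *)f);
}

/* Tiny fast path; cheap for the compiler to inline everywhere
 * (the kernel would spell the attribute __always_inline). */
static __attribute__((always_inline)) inline void frobnicate(struct foo *f)
{
	if (likely(!f->needs_slow_path))
		return;
	frobnicate_slow(f);
}

int main(void)
{
	struct foo f = { .needs_slow_path = true };

	frobnicate(&f);
	return 0;
}
```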
--
Cheers
David / dhildenb
* Re: [PATCH 0/3] Optimize code generation during context
2025-10-29 10:26 ` David Hildenbrand
@ 2025-10-30 15:04 ` Xie Yuanbin
0 siblings, 0 replies; 16+ messages in thread
From: Xie Yuanbin @ 2025-10-30 15:04 UTC (permalink / raw)
To: david
Cc: acme, adrian.hunter, agordeev, akpm, alex, alexander.shishkin,
andreas, anna-maria, aou, borntraeger, bp, bsegall, dave.hansen,
davem, dietmar.eggemann, frederic, gor, hca, hpa, irogers, jolsa,
juri.lelli, linux-arm-kernel, linux-kernel, linux-perf-users,
linux-riscv, linux-s390, linux, lorenzo.stoakes, luto,
mark.rutland, mathieu.desnoyers, max.kellermann, mgorman, mingo,
namhyung, nysal, palmer, paulmck, peterz, pjw, qq570070308, riel,
rostedt, ryan.roberts, segher, sparclinux, svens, tglx, thuth,
urezki, vincent.guittot, vschneid, will, x86
On Wed, 29 Oct 2025 11:26:39 +0100, David Hildenbrand wrote:
>> I did some testing using my devices,
>> and the testing logic is as follows:
>> ```
>> - return finish_task_switch(prev);
>> + start_time = rdtsc();
>> + barrier();
>> + rq = finish_task_switch(prev);
>> + barrier();
>> + end_time = rdtsc();
>> + return rq;
>> ```
>>
>> The test data is as follows:
>> 1. mitigations Off, without patches: 13.5 - 13.7
>> 2. mitigations Off, with patches: 13.5 - 13.7
>> 3. mitigations On, without patches: 23.3 - 23.6
>> 4. mitigations On, with patches: 16.6 - 16.8
>
> Such numbers absolutely have to be part of the relevant patches / cover
> letter to show that the compiler is not actually smart enough to make a
> good decision.
This was indeed my oversight; I did not read the submitting-patches
documentation carefully. Thank you for pointing it out, and I apologize
for this.
Do I need to send a v2 of the patches to add the relevant data?
By the way, the data above was measured in WSL. I did a more detailed test
on a physical machine; if possible, that data may be more appropriate:
Link: https://lore.kernel.org/20251027152100.62906-1-qq570070308@gmail.com
> Cheers
>
> David / dhildenb
Thanks very much.
Xie Yuanbin