* [PATCH 0/3] Optimize code generation during context switching
@ 2025-10-24 18:26 Xie Yuanbin
From: Xie Yuanbin @ 2025-10-24 18:26 UTC (permalink / raw)
  To: linux, mathieu.desnoyers, paulmck, pjw, palmer, aou, alex, hca,
	gor, agordeev, borntraeger, svens, davem, andreas, tglx, mingo,
	bp, dave.hansen, hpa, luto, peterz, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, anna-maria,
	frederic, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, qq570070308, thuth, riel, akpm, david,
	lorenzo.stoakes, segher, ryan.roberts, max.kellermann, urezki,
	nysal
  Cc: x86, linux-arm-kernel, linux-kernel, linux-riscv, linux-s390,
	sparclinux, linux-perf-users, will

The purpose of this series is to optimize the performance of context
switching. It does not change any code logic; it only modifies the
inline attributes of some functions.

The original motivation for this series is that, while debugging a
scheduling performance problem, I noticed that finish_task_switch was
not inlined, even at the -O2 optimization level. This may hurt
performance for the following reasons:
1. It is on the context-switch path and is called frequently.
2. Because of modern CPU vulnerability mitigations, the instruction
pipeline and caches may be flushed inside switch_mm, so branch
mispredictions and cache misses may increase. finish_task_switch runs
right after that, which can amplify the performance degradation.
3. The __schedule function carries the __sched attribute, which places
it in the ".sched.text" section, while finish_task_switch does not.
The two therefore end up far apart in the binary, aggravating the
degradation described above (see the sketch after this list).
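
For reference, a minimal sketch of the placement issue (the __sched
macro below is quoted from include/linux/sched/debug.h as I understand
it; the function signatures are abbreviated):

/* __sched groups a function into the ".sched.text" linker section
 * (such functions are also ignored in wchan output). */
#define __sched		__section(".sched.text")

/* __schedule() carries __sched, so it lands in .sched.text: */
static void __sched notrace __schedule(int sched_mode);

/* finish_task_switch() does not, so the linker may place it far away
 * from __schedule() in the final image: */
static struct rq *finish_task_switch(struct task_struct *prev);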

I also noticed that on x86, the enter_lazy_tlb function is not inlined.
It is very short, and since the cpu_tlbstate and cpu_tlbstate_shared
variables are global, it can be fully inlined. In fact, the
implementations of this function on other architectures are already
inline.
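
As a rough sketch (not the exact patch), the out-of-line body in
arch/x86/mm/tlb.c could move into arch/x86/include/asm/mmu_context.h
along these lines; the body below is my reading of the current x86
implementation:

/* Sketch: header (inline) version of the x86 enter_lazy_tlb(). */
static inline void enter_lazy_tlb(struct mm_struct *mm,
				  struct task_struct *tsk)
{
	/* Nothing to do if this CPU is already running init_mm. */
	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
		return;

	/* Mark this CPU as lazy; the TLB flush code checks the flag. */
	this_cpu_write(cpu_tlbstate_shared.is_lazy, true);
}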

This series mainly does the following:
1. Change enter_lazy_tlb to an inline function on x86.
2. Let finish_task_switch be inlined into the context-switch code.
3. Mark the subfunctions called by finish_task_switch as inline:
When finish_task_switch becomes an inline function, the number of call
sites of its subfunctions in the translation unit grows because of the
inline expansion of finish_task_switch. For example, finish_lock_switch
originally had only one call site in core.o (inside finish_task_switch),
but once finish_task_switch is inlined, it has two. Because of the
compiler's inlining heuristics, these subfunctions may then stop being
inlined, which can actually degrade performance. So I mark some
subfunctions of finish_task_switch as always_inline to prevent that
regression; a small demonstration of the effect follows below.
These functions are either very short or are called only once in the
whole kernel, so they do not have a big impact on code size.
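
As a self-contained illustration of that heuristic (a toy userspace
example, not kernel code; all names here are made up):

/* helper() stands in for something like finish_lock_switch(): with a
 * single caller the compiler usually inlines it at -O2, but once it
 * gains a second call site the inliner may keep it out of line.
 * Forcing always_inline avoids that flip. */
#define __always_inline	inline __attribute__((__always_inline__))

static __always_inline int helper(int x)
{
	return x * 2 + 1;
}

/* outer() plays the role of finish_task_switch(). */
static inline int outer(int x)
{
	return helper(x) + 3;
}

int caller_a(int x) { return outer(x); }	/* first call site */
int caller_b(int x) { return outer(x) - 7; }	/* second call site once outer() is inlined */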

No impact on the size of the bzImage was observed with this series
(built with -Os).

Xie Yuanbin (3):
  Change enter_lazy_tlb to inline on x86
  Provide and use an always inline version of finish_task_switch
  Set the subfunctions called by finish_task_switch to be inline

 arch/arm/include/asm/mmu_context.h      |  6 +++++-
 arch/riscv/include/asm/sync_core.h      |  2 +-
 arch/s390/include/asm/mmu_context.h     |  6 +++++-
 arch/sparc/include/asm/mmu_context_64.h |  6 +++++-
 arch/x86/include/asm/mmu_context.h      | 22 +++++++++++++++++++++-
 arch/x86/include/asm/sync_core.h        |  2 +-
 arch/x86/mm/tlb.c                       | 21 ---------------------
 include/linux/perf_event.h              |  2 +-
 include/linux/sched/mm.h                | 10 +++++-----
 include/linux/tick.h                    |  4 ++--
 include/linux/vtime.h                   |  8 ++++----
 kernel/sched/core.c                     | 20 +++++++++++++-------
 12 files changed, 63 insertions(+), 46 deletions(-)

-- 
2.51.0

Thread overview: 16+ messages
2025-10-24 18:26 [PATCH 0/3] Optimize code generation during context switching Xie Yuanbin
2025-10-24 18:26 ` [PATCH 1/3] Change enter_lazy_tlb to inline on x86 Xie Yuanbin
2025-10-24 20:14   ` Rik van Riel
2025-10-24 18:35 ` [PATCH 2/3] Provide and use an always inline version of finish_task_switch Xie Yuanbin
2025-10-24 18:35   ` [PATCH 3/3] Set the subfunctions called by finish_task_switch to be inline Xie Yuanbin
2025-10-24 19:44     ` Thomas Gleixner
2025-10-25 18:51       ` Xie Yuanbin
2025-10-24 21:36   ` [PATCH 2/3] Provide and use an always inline version of finish_task_switch Rik van Riel
2025-10-25 14:36     ` Segher Boessenkool
2025-10-25 17:37     ` [PATCH 0/3] Optimize code generation during context Xie Yuanbin
2025-10-29 10:26       ` David Hildenbrand
2025-10-30 15:04         ` Xie Yuanbin
2025-10-25 19:18     ` [PATCH 2/3] Provide and use an always inline version of finish_task_switch Xie Yuanbin
2025-10-25 12:26 ` [PATCH 0/3] Optimize code generation during context switching Peter Zijlstra
2025-10-25 18:20   ` [PATCH 0/3] Optimize code generation during context Xie Yuanbin
2025-10-27 15:21 ` Xie Yuanbin
