* [DISCUSSION] x86: In-Kernel Use of Extended General-Purpose Registers
@ 2025-11-24 21:32 Chang S. Bae
2025-11-24 21:32 ` [RFC PATCH 1/3] x86/lib: Refactor csum_partial_copy_generic() into a macro Chang S. Bae
` (3 more replies)
0 siblings, 4 replies; 8+ messages in thread
From: Chang S. Bae @ 2025-11-24 21:32 UTC (permalink / raw)
To: linux-kernel; +Cc: x86, tglx, mingo, bp, dave.hansen, chang.seok.bae
Hi all,
I’d like to initiate a discussion on this topic. The attached patchset
is *not* intended for upstream now. Instead, its purpose is simply to
serve as an example of how the kernel might use these registers. Beyond
a quick look, reviewing the attached patches in depth would likely be a
waste of your time.
== Background ==
Advanced Performance Extensions (APX) introduces additional GPRs: R16–R31
(EGPRs) [1]. These EGPRs are accessible via new prefix encodings on
legacy instructions. Their state is handled through XSAVE, and support
for this new XSTATE component was merged in v6.16 [2]. So far, APX is
primarily targeted toward userspace enablement.
However, in-kernel use still needs to be explored. Ingo previously noted
that EGPRs may help reduce kernel stack pressure [3], and the topic came
up at the x86 microconference at LPC [4]. I hope this posting can
circulate some thoughts, along with an example, ahead of time.
== Possible Approaches ==
(1) Selective and Limited Use
This follows how vector registers are used today in places like crypto
routines. AVX state usage is bracketed by kernel_fpu_begin() /
kernel_fpu_end(). EGPRs could be similarly used in a small bounded
region.
Under this model:
* No changes are needed to the existing XSTATE management API.
* Preemption and softirqs would be disabled while EGPRs are live,
which in turn limits usage to small regions.
* This lends itself mostly to hand-written assembly, which is less
scalable for broader adoption.
PATCH3 in the attached set shows an example of this kind of usage.
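In rough C terms, the bracketing pattern looks like the following
userspace model. The feature check and the begin/end calls here are
stubs standing in for the kernel primitives (cpu_has_xfeatures(),
irq_fpu_usable(), kernel_fpu_begin()/kernel_fpu_end()); PATCH3 carries
the real wrapper:

```c
#include <stdbool.h>

/* Stand-ins for the kernel primitives; purely illustrative. */
static bool apx_usable(void)
{
	return false;	/* would check XFEATURE_MASK_APX and irq_fpu_usable() */
}

static void fpu_begin(void)
{
	/* would disable preemption/softirqs and mark extended state live */
}

static void fpu_end(void)
{
	/* would invalidate extended state and re-enable preemption */
}

static long work_legacy(long x)
{
	return x + 1;
}

static long work_egpr(long x)
{
	return x + 1;	/* same result; an asm body would use r16-r31 */
}

/* Dispatch wrapper: fall back whenever EGPRs cannot be used. */
static long work(long x)
{
	long ret;

	if (!apx_usable())
		return work_legacy(x);

	fpu_begin();
	ret = work_egpr(x);
	fpu_end();
	return ret;
}
```

The key property is that EGPR state is only ever live between the
begin/end pair, so no change to XSTATE management is needed.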
(2) Broader or Tree-wide Adoption
If the goal is to substantially reduce stack pressure or improve
performance more broadly, EGPR usage would need to expand to larger
regions. This raises some considerations:
* The usage window would become too large to keep preemption disabled.
In that case, the wrapper-based approach becomes infeasible.
* The EGPR state would then need to be switched on entry to ensure a
clean separation as APX usage becomes more pervasive. This could be
handled by extending struct pt_regs or another structure.
* The kernel must be able to select between legacy mode and APX,
since APX remains optional for backward compatibility; an APX-only
kernel image won't be distributed.
* This suggests some level of code duplication or alternate code paths
as an unavoidable trade-off. As the usage grows, so does image size,
which raises the bar for demonstrating a measurable benefit.
* At that scale, adoption will likely rely on compiler support, whose
code-generation and optimization behavior needs to be examined and
validated in advance.
== Discussions ==
Given the above, a staged adoption may make sense. EGPR usage could
begin in self-contained libraries or performance-critical paths and be
evaluated incrementally as hardware becomes more broadly available.
Here are some preliminary questions to discuss:
* Does this overall framing make sense?
* Are there alternative or more pragmatic approaches for adoption?
* Which kernel subsystems or hot paths might benefit most from early
experimentation with EGPRs?
Thanks,
Chang
[1] https://cdrdv2.intel.com/v1/dl/getContent/784266
[2] https://lore.kernel.org/lkml/aDL35MA4vH0wQ6Gb@gmail.com/
[3] https://lore.kernel.org/lkml/Z8C57rzRt90obAFg@gmail.com/
[4] https://lpc.events/event/19/contributions/2028/
Chang S. Bae (3):
x86/lib: Refactor csum_partial_copy_generic() into a macro
x86/lib: Convert repeated asm sequences in checksum copy into macros
x86/lib: Use EGPRs in 64-bit checksum copy loop
arch/x86/Kconfig | 6 +
arch/x86/Kconfig.assembler | 6 +
arch/x86/include/asm/checksum_64.h | 24 ++-
arch/x86/lib/csum-copy_64.S | 282 +++++++++++++++++------------
4 files changed, 206 insertions(+), 112 deletions(-)
base-commit: ac3fd01e4c1efce8f2c054cdeb2ddd2fc0fb150d
--
2.51.0
* [RFC PATCH 1/3] x86/lib: Refactor csum_partial_copy_generic() into a macro
2025-11-24 21:32 [DISCUSSION] x86: In-Kernel Use of Extended General-Purpose Registers Chang S. Bae
@ 2025-11-24 21:32 ` Chang S. Bae
2025-11-24 21:32 ` [RFC PATCH 2/3] x86/lib: Convert repeated asm sequences in checksum copy into macros Chang S. Bae
` (2 subsequent siblings)
3 siblings, 0 replies; 8+ messages in thread
From: Chang S. Bae @ 2025-11-24 21:32 UTC (permalink / raw)
To: linux-kernel; +Cc: x86, tglx, mingo, bp, dave.hansen, chang.seok.bae
The current assembly implementation is too rigid to support new
variants that share most of the logic. Refactor the function body into a
reusable macro, with register aliasing to improve readability.
No functional change.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
Not intended for upstream; this series is just an example of how
extended GPRs can be used within the kernel.
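For orientation, the computation being refactored is a ones'-complement
sum: 64-bit adds with the carry folded back in (what the adcq chains
achieve), reduced to 32 bits at the end. A simplified portable-C model,
not the kernel code; it ignores alignment and odd tails and assumes len
is a multiple of 8:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Model of the csum loop: sum 8-byte words with end-around carry,
 * then the .Lfold-style reduction to 32 bits. */
static uint32_t csum_model(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint64_t sum = 0;

	while (len >= 8) {
		uint64_t v;

		memcpy(&v, p, 8);
		sum += v;
		if (sum < v)		/* fold the carry back in, like adcq */
			sum++;
		p += 8;
		len -= 8;
	}

	/* reduce to 32 bits, as the .Lfold block does */
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffffffu) + (sum >> 32);
	return (uint32_t)sum;
}
```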
---
arch/x86/lib/csum-copy_64.S | 187 ++++++++++++++++++++----------------
1 file changed, 103 insertions(+), 84 deletions(-)
diff --git a/arch/x86/lib/csum-copy_64.S b/arch/x86/lib/csum-copy_64.S
index d9e16a2cf285..66ed849090b7 100644
--- a/arch/x86/lib/csum-copy_64.S
+++ b/arch/x86/lib/csum-copy_64.S
@@ -26,17 +26,27 @@
* They also should align source or destination to 8 bytes.
*/
- .macro source
+.macro source
10:
_ASM_EXTABLE_UA(10b, .Lfault)
- .endm
+.endm
- .macro dest
+.macro dest
20:
_ASM_EXTABLE_UA(20b, .Lfault)
- .endm
+.endm
-SYM_FUNC_START(csum_partial_copy_generic)
+.macro restore_regs_and_ret
+ movq 0*8(%rsp), %rbx
+ movq 1*8(%rsp), %r12
+ movq 2*8(%rsp), %r14
+ movq 3*8(%rsp), %r13
+ movq 4*8(%rsp), %r15
+ addq $5*8, %rsp
+ RET
+.endm
+
+.macro _csum_partial_copy
subq $5*8, %rsp
movq %rbx, 0*8(%rsp)
movq %r12, 1*8(%rsp)
@@ -48,41 +58,52 @@ SYM_FUNC_START(csum_partial_copy_generic)
xorl %r9d, %r9d
movl %edx, %ecx
cmpl $8, %ecx
- jb .Lshort
+ jb .Lshort\@
testb $7, %sil
- jne .Lunaligned
-.Laligned:
- movl %ecx, %r12d
+ jne .Lunaligned\@
+.Laligned\@:
+ .set INP, %rdi /* input pointer */
+ .set OUTP, %rsi /* output pointer */
+ .set SUM, %rax /* checksum accumulator */
+ .set ZERO, %r9 /* zero register */
+ .set LEN, %ecx /* byte count */
+ .set LEN64B, %r12d /* 64-byte block count */
+ .set TMP1, %rbx
+ .set TMP2, %r8
+ .set TMP3, %r11
+ .set TMP4, %rdx
+ .set TMP5, %r10
+ .set TMP6, %r15
+ .set TMP7, %r14
+ .set TMP8, %r13
- shrq $6, %r12
- jz .Lhandle_tail /* < 64 */
+ movl LEN, LEN64B
+
+ shrl $6, LEN64B
+ jz .Lhandle_tail\@ /* < 64 */
clc
- /* main loop. clear in 64 byte blocks */
- /* r9: zero, r8: temp2, rbx: temp1, rax: sum, rcx: saved length */
- /* r11: temp3, rdx: temp4, r12 loopcnt */
- /* r10: temp5, r15: temp6, r14 temp7, r13 temp8 */
.p2align 4
-.Lloop:
+.Lloop\@:
source
- movq (%rdi), %rbx
+ movq (INP), TMP1
source
- movq 8(%rdi), %r8
+ movq 8(INP), TMP2
source
- movq 16(%rdi), %r11
+ movq 16(INP), TMP3
source
- movq 24(%rdi), %rdx
+ movq 24(INP), TMP4
source
- movq 32(%rdi), %r10
+ movq 32(INP), TMP5
source
- movq 40(%rdi), %r15
+ movq 40(INP), TMP6
source
- movq 48(%rdi), %r14
+ movq 48(INP), TMP7
source
- movq 56(%rdi), %r13
+ movq 56(INP), TMP8
30:
/*
@@ -92,64 +113,64 @@ SYM_FUNC_START(csum_partial_copy_generic)
_ASM_EXTABLE(30b, 2f)
prefetcht0 5*64(%rdi)
2:
- adcq %rbx, %rax
- adcq %r8, %rax
- adcq %r11, %rax
- adcq %rdx, %rax
- adcq %r10, %rax
- adcq %r15, %rax
- adcq %r14, %rax
- adcq %r13, %rax
+ adcq TMP1, SUM
+ adcq TMP2, SUM
+ adcq TMP3, SUM
+ adcq TMP4, SUM
+ adcq TMP5, SUM
+ adcq TMP6, SUM
+ adcq TMP7, SUM
+ adcq TMP8, SUM
- decl %r12d
+ decl LEN64B
dest
- movq %rbx, (%rsi)
+ movq TMP1, (OUTP)
dest
- movq %r8, 8(%rsi)
+ movq TMP2, 8(OUTP)
dest
- movq %r11, 16(%rsi)
+ movq TMP3, 16(OUTP)
dest
- movq %rdx, 24(%rsi)
+ movq TMP4, 24(OUTP)
dest
- movq %r10, 32(%rsi)
+ movq TMP5, 32(OUTP)
dest
- movq %r15, 40(%rsi)
+ movq TMP6, 40(OUTP)
dest
- movq %r14, 48(%rsi)
+ movq TMP7, 48(OUTP)
dest
- movq %r13, 56(%rsi)
+ movq TMP8, 56(OUTP)
- leaq 64(%rdi), %rdi
- leaq 64(%rsi), %rsi
+ leaq 64(INP), INP
+ leaq 64(OUTP), OUTP
- jnz .Lloop
+ jnz .Lloop\@
- adcq %r9, %rax
+ adcq ZERO, SUM
/* do last up to 56 bytes */
-.Lhandle_tail:
+.Lhandle_tail\@:
/* ecx: count, rcx.63: the end result needs to be rol8 */
movq %rcx, %r10
andl $63, %ecx
shrl $3, %ecx
- jz .Lfold
+ jz .Lfold\@
clc
.p2align 4
-.Lloop_8:
+.Lloop_8\@:
source
- movq (%rdi), %rbx
- adcq %rbx, %rax
- decl %ecx
+ movq (INP), TMP1
+ adcq TMP1, SUM
+ decl LEN
dest
- movq %rbx, (%rsi)
- leaq 8(%rsi), %rsi /* preserve carry */
- leaq 8(%rdi), %rdi
- jnz .Lloop_8
- adcq %r9, %rax /* add in carry */
+ movq TMP1, (OUTP)
+ leaq 8(INP), INP /* preserve carry */
+ leaq 8(OUTP), OUTP
+ jnz .Lloop_8\@
+ adcq ZERO, SUM /* add in carry */
-.Lfold:
+.Lfold\@:
/* reduce checksum to 32bits */
movl %eax, %ebx
shrq $32, %rax
@@ -157,17 +178,17 @@ SYM_FUNC_START(csum_partial_copy_generic)
adcl %r9d, %eax
/* do last up to 6 bytes */
-.Lhandle_7:
+.Lhandle_7\@:
movl %r10d, %ecx
andl $7, %ecx
-.L1: /* .Lshort rejoins the common path here */
+.L1\@: /* .Lshort\@ rejoins the common path here */
shrl $1, %ecx
- jz .Lhandle_1
+ jz .Lhandle_1\@
movl $2, %edx
xorl %ebx, %ebx
clc
.p2align 4
-.Lloop_1:
+.Lloop_1\@:
source
movw (%rdi), %bx
adcl %ebx, %eax
@@ -176,13 +197,13 @@ SYM_FUNC_START(csum_partial_copy_generic)
movw %bx, (%rsi)
leaq 2(%rdi), %rdi
leaq 2(%rsi), %rsi
- jnz .Lloop_1
+ jnz .Lloop_1\@
adcl %r9d, %eax /* add in carry */
/* handle last odd byte */
-.Lhandle_1:
+.Lhandle_1\@:
testb $1, %r10b
- jz .Lende
+ jz .Lende\@
xorl %ebx, %ebx
source
movb (%rdi), %bl
@@ -191,24 +212,18 @@ SYM_FUNC_START(csum_partial_copy_generic)
addl %ebx, %eax
adcl %r9d, %eax /* carry */
-.Lende:
+.Lende\@:
testq %r10, %r10
- js .Lwas_odd
-.Lout:
- movq 0*8(%rsp), %rbx
- movq 1*8(%rsp), %r12
- movq 2*8(%rsp), %r14
- movq 3*8(%rsp), %r13
- movq 4*8(%rsp), %r15
- addq $5*8, %rsp
- RET
-.Lshort:
+ js .Lwas_odd\@
+.Lout\@:
+ restore_regs_and_ret
+.Lshort\@:
movl %ecx, %r10d
- jmp .L1
-.Lunaligned:
+ jmp .L1\@
+.Lunaligned\@:
xorl %ebx, %ebx
testb $1, %sil
- jne .Lodd
+ jne .Lodd\@
1: testb $2, %sil
je 2f
source
@@ -220,7 +235,7 @@ SYM_FUNC_START(csum_partial_copy_generic)
leaq 2(%rsi), %rsi
addq %rbx, %rax
2: testb $4, %sil
- je .Laligned
+ je .Laligned\@
source
movl (%rdi), %ebx
dest
@@ -229,9 +244,9 @@ SYM_FUNC_START(csum_partial_copy_generic)
subq $4, %rcx
leaq 4(%rsi), %rsi
addq %rbx, %rax
- jmp .Laligned
+ jmp .Laligned\@
-.Lodd:
+.Lodd\@:
source
movb (%rdi), %bl
dest
@@ -245,12 +260,16 @@ SYM_FUNC_START(csum_partial_copy_generic)
addq %rbx, %rax
jmp 1b
-.Lwas_odd:
+.Lwas_odd\@:
roll $8, %eax
- jmp .Lout
+ jmp .Lout\@
+.endm
/* Exception: just return 0 */
.Lfault:
xorl %eax, %eax
- jmp .Lout
+ restore_regs_and_ret
+
+SYM_FUNC_START(csum_partial_copy_generic)
+ _csum_partial_copy
SYM_FUNC_END(csum_partial_copy_generic)
--
2.51.0
* [RFC PATCH 2/3] x86/lib: Convert repeated asm sequences in checksum copy into macros
2025-11-24 21:32 [DISCUSSION] x86: In-Kernel Use of Extended General-Purpose Registers Chang S. Bae
2025-11-24 21:32 ` [RFC PATCH 1/3] x86/lib: Refactor csum_partial_copy_generic() into a macro Chang S. Bae
@ 2025-11-24 21:32 ` Chang S. Bae
2025-11-24 21:32 ` [RFC PATCH 3/3] x86/lib: Use EGPRs in 64-bit checksum copy loop Chang S. Bae
2025-11-26 16:30 ` [DISCUSSION] x86: In-Kernel Use of Extended General-Purpose Registers Peter Zijlstra
3 siblings, 0 replies; 8+ messages in thread
From: Chang S. Bae @ 2025-11-24 21:32 UTC (permalink / raw)
To: linux-kernel; +Cc: x86, tglx, mingo, bp, dave.hansen, chang.seok.bae
Several instruction patterns are repeated in the checksum-copy function.
Replace them with small macros to make the code more concise and readable.
No functional change.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
These repetitions are related to the loop unrolling, which will be
further extended using EGPRs in the next patch.
---
arch/x86/lib/csum-copy_64.S | 106 ++++++++++++++++--------------------
1 file changed, 48 insertions(+), 58 deletions(-)
diff --git a/arch/x86/lib/csum-copy_64.S b/arch/x86/lib/csum-copy_64.S
index 66ed849090b7..5526bdfac041 100644
--- a/arch/x86/lib/csum-copy_64.S
+++ b/arch/x86/lib/csum-copy_64.S
@@ -46,6 +46,43 @@
RET
.endm
+.macro prefetch
+30:
+ /*
+ * No _ASM_EXTABLE_UA; this is used for intentional prefetch on a
+ * potentially unmapped kernel address.
+ */
+ _ASM_EXTABLE(30b, 2f)
+ prefetcht0 5*64(%rdi)
+2:
+.endm
+
+.macro loadregs offset, src, regs:vararg
+	i = 0
+.irp r, \regs
+	source
+	movq 8*(\offset + i)(\src), \r
+	i = i + 1
+.endr
+.endm
+
+.macro storeregs offset, dst, regs:vararg
+	i = 0
+.irp r, \regs
+	dest
+	movq \r, 8*(\offset + i)(\dst)
+	i = i + 1
+.endr
+.endm
+
+.macro sumregs sum, regs:vararg
+.irp r, \regs
+ adcq \r, \sum
+.endr
+.endm
+
+.macro incr ptr, count
+ leaq 8*(\count)(\ptr), \ptr
+.endm
+
.macro _csum_partial_copy
subq $5*8, %rsp
movq %rbx, 0*8(%rsp)
@@ -87,63 +124,18 @@
.p2align 4
.Lloop\@:
- source
- movq (INP), TMP1
- source
- movq 8(INP), TMP2
- source
- movq 16(INP), TMP3
- source
- movq 24(INP), TMP4
+ loadregs 0, INP, TMP1, TMP2, TMP3, TMP4, TMP5, TMP6, TMP7, TMP8
- source
- movq 32(INP), TMP5
- source
- movq 40(INP), TMP6
- source
- movq 48(INP), TMP7
- source
- movq 56(INP), TMP8
+ prefetch
-30:
- /*
- * No _ASM_EXTABLE_UA; this is used for intentional prefetch on a
- * potentially unmapped kernel address.
- */
- _ASM_EXTABLE(30b, 2f)
- prefetcht0 5*64(%rdi)
-2:
- adcq TMP1, SUM
- adcq TMP2, SUM
- adcq TMP3, SUM
- adcq TMP4, SUM
- adcq TMP5, SUM
- adcq TMP6, SUM
- adcq TMP7, SUM
- adcq TMP8, SUM
+ sumregs SUM, TMP1, TMP2, TMP3, TMP4, TMP5, TMP6, TMP7, TMP8
decl LEN64B
- dest
- movq TMP1, (OUTP)
- dest
- movq TMP2, 8(OUTP)
- dest
- movq TMP3, 16(OUTP)
- dest
- movq TMP4, 24(OUTP)
+ storeregs 0, OUTP, TMP1, TMP2, TMP3, TMP4, TMP5, TMP6, TMP7, TMP8
- dest
- movq TMP5, 32(OUTP)
- dest
- movq TMP6, 40(OUTP)
- dest
- movq TMP7, 48(OUTP)
- dest
- movq TMP8, 56(OUTP)
-
- leaq 64(INP), INP
- leaq 64(OUTP), OUTP
+ incr INP, 8
+ incr OUTP, 8
jnz .Lloop\@
@@ -159,14 +151,12 @@
clc
.p2align 4
.Lloop_8\@:
- source
- movq (INP), TMP1
- adcq TMP1, SUM
+ loadregs 0, INP, TMP1
+ sumregs SUM, TMP1
decl LEN
- dest
- movq TMP1, (OUTP)
- leaq 8(INP), INP /* preserve carry */
- leaq 8(OUTP), OUTP
+ storeregs 0, OUTP, TMP1
+ incr INP, 1 /* preserve carry */
+ incr OUTP, 1
jnz .Lloop_8\@
adcq ZERO, SUM /* add in carry */
--
2.51.0
* [RFC PATCH 3/3] x86/lib: Use EGPRs in 64-bit checksum copy loop
2025-11-24 21:32 [DISCUSSION] x86: In-Kernel Use of Extended General-Purpose Registers Chang S. Bae
2025-11-24 21:32 ` [RFC PATCH 1/3] x86/lib: Refactor csum_partial_copy_generic() into a macro Chang S. Bae
2025-11-24 21:32 ` [RFC PATCH 2/3] x86/lib: Convert repeated asm sequences in checksum copy into macros Chang S. Bae
@ 2025-11-24 21:32 ` Chang S. Bae
2025-11-25 10:37 ` david laight
2025-11-26 16:30 ` [DISCUSSION] x86: In-Kernel Use of Extended General-Purpose Registers Peter Zijlstra
3 siblings, 1 reply; 8+ messages in thread
From: Chang S. Bae @ 2025-11-24 21:32 UTC (permalink / raw)
To: linux-kernel; +Cc: x86, tglx, mingo, bp, dave.hansen, chang.seok.bae
The current checksum copy routine already uses all legacy GPRs for loop
unrolling. APX introduces additional GPRs. Use them to extend the
unrolling further.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
Caveat: This is primarily an illustrative example. I have not fully
audited all call sites or large-buffer use cases (yet). The goal is to
demonstrate the potential of the extended register set.
---
arch/x86/Kconfig | 6 +++
arch/x86/Kconfig.assembler | 6 +++
arch/x86/include/asm/checksum_64.h | 24 +++++++++++-
arch/x86/lib/csum-copy_64.S | 59 ++++++++++++++++++++++++++++--
4 files changed, 90 insertions(+), 5 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fa3b616af03a..e6d969376bf2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1890,6 +1890,12 @@ config X86_USER_SHADOW_STACK
If unsure, say N.
+config X86_APX
+ bool "In-kernel APX use"
+ depends on AS_APX
+ help
+ Experimental: enable in-kernel use of APX
+
config INTEL_TDX_HOST
bool "Intel Trust Domain Extensions (TDX) host support"
depends on CPU_SUP_INTEL
diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
index b1c59fb0a4c9..d208ac540609 100644
--- a/arch/x86/Kconfig.assembler
+++ b/arch/x86/Kconfig.assembler
@@ -5,3 +5,9 @@ config AS_WRUSS
def_bool $(as-instr64,wrussq %rax$(comma)(%rbx))
help
Supported by binutils >= 2.31 and LLVM integrated assembler
+
+config AS_APX
+ def_bool $(as-instr64,mov %r16$(comma)%r17)
+ help
+	  Assembler supports extended registers.
+ Supported by binutils >= 2.43 (LLVM version TBD)
diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
index 4d4a47a3a8ab..4cbd9e71f8c3 100644
--- a/arch/x86/include/asm/checksum_64.h
+++ b/arch/x86/include/asm/checksum_64.h
@@ -10,6 +10,7 @@
#include <linux/compiler.h>
#include <asm/byteorder.h>
+#include <asm/fpu/api.h>
/**
* csum_fold - Fold and invert a 32bit checksum.
@@ -129,7 +130,28 @@ static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
extern __wsum csum_partial(const void *buff, int len, __wsum sum);
/* Do not call this directly. Use the wrappers below */
-extern __visible __wsum csum_partial_copy_generic(const void *src, void *dst, int len);
+extern __visible __wsum csum_partial_copy(const void *src, void *dst, int len);
+#ifndef CONFIG_X86_APX
+static inline __wsum csum_partial_copy_generic(const void *src, void *dst, int len)
+{
+ return csum_partial_copy(src, dst, len);
+}
+#else
+extern __visible __wsum csum_partial_copy_apx(const void *src, void *dst, int len);
+static inline __wsum csum_partial_copy_generic(const void *src, void *dst, int len)
+{
+ __wsum sum;
+
+ if (!cpu_has_xfeatures(XFEATURE_MASK_APX, NULL) || !irq_fpu_usable())
+ return csum_partial_copy(src, dst, len);
+
+ kernel_fpu_begin();
+ sum = csum_partial_copy_apx(src, dst, len);
+ kernel_fpu_end();
+
+ return sum;
+}
+#endif
extern __wsum csum_and_copy_from_user(const void __user *src, void *dst, int len);
extern __wsum csum_and_copy_to_user(const void *src, void __user *dst, int len);
diff --git a/arch/x86/lib/csum-copy_64.S b/arch/x86/lib/csum-copy_64.S
index 5526bdfac041..dc99227af94f 100644
--- a/arch/x86/lib/csum-copy_64.S
+++ b/arch/x86/lib/csum-copy_64.S
@@ -119,11 +119,54 @@
shrl $6, LEN64B
jz .Lhandle_tail\@ /* < 64 */
+.if USE_APX
+ cmpl $3, LEN64B
+ jb .Lloop_64\@ /* < 192 */
+ clc
+ .p2align 4
+.Lloop_192\@:
+ .set TMP9, %r16
+ .set TMP10, %r17
+ .set TMP11, %r18
+ .set TMP12, %r19
+ .set TMP13, %r20
+ .set TMP14, %r21
+ .set TMP15, %r22
+ .set TMP16, %r23
+ .set TMP17, %r24
+ .set TMP18, %r25
+ .set TMP19, %r26
+ .set TMP20, %r27
+ .set TMP21, %r28
+ .set TMP22, %r29
+ .set TMP23, %r30
+ .set TMP24, %r31
+
+ .p2align 4
+ loadregs 0, INP, TMP1, TMP2, TMP3, TMP4, TMP5, TMP6, TMP7, TMP8
+ loadregs 8, INP, TMP9, TMP10, TMP11, TMP12, TMP13, TMP14, TMP15, TMP16
+ loadregs 16, INP, TMP17, TMP18, TMP19, TMP20, TMP21, TMP22, TMP23, TMP24
+
+ sumregs SUM, TMP1, TMP2, TMP3, TMP4, TMP5, TMP6, TMP7, TMP8
+ sumregs SUM, TMP9, TMP10, TMP11, TMP12, TMP13, TMP14, TMP15, TMP16
+ sumregs SUM, TMP17, TMP18, TMP19, TMP20, TMP21, TMP22, TMP23, TMP24
+
+ storeregs 0, OUTP, TMP1, TMP2, TMP3, TMP4, TMP5, TMP6, TMP7, TMP8
+ storeregs 8, OUTP, TMP9, TMP10, TMP11, TMP12, TMP13, TMP14, TMP15, TMP16
+ storeregs 16, OUTP, TMP17, TMP18, TMP19, TMP20, TMP21, TMP22, TMP23, TMP24
+
+ incr INP, 24
+ incr OUTP, 24
+ sub $3, LEN64B
+ cmp $3, LEN64B
+ jnb .Lloop_192\@
+.else
clc
.p2align 4
-.Lloop\@:
+.endif
+.Lloop_64\@:
loadregs 0, INP, TMP1, TMP2, TMP3, TMP4, TMP5, TMP6, TMP7, TMP8
prefetch
@@ -137,7 +180,7 @@
incr INP, 8
incr OUTP, 8
- jnz .Lloop\@
+ jnz .Lloop_64\@
adcq ZERO, SUM
@@ -260,6 +303,14 @@
xorl %eax, %eax
restore_regs_and_ret
-SYM_FUNC_START(csum_partial_copy_generic)
+.set USE_APX, 0
+SYM_FUNC_START(csum_partial_copy)
_csum_partial_copy
-SYM_FUNC_END(csum_partial_copy_generic)
+SYM_FUNC_END(csum_partial_copy)
+
+#ifdef CONFIG_X86_APX
+.set USE_APX, 1
+SYM_FUNC_START(csum_partial_copy_apx)
+ _csum_partial_copy
+SYM_FUNC_END(csum_partial_copy_apx)
+#endif
--
2.51.0
* Re: [RFC PATCH 3/3] x86/lib: Use EGPRs in 64-bit checksum copy loop
2025-11-24 21:32 ` [RFC PATCH 3/3] x86/lib: Use EGPRs in 64-bit checksum copy loop Chang S. Bae
@ 2025-11-25 10:37 ` david laight
2025-12-01 21:39 ` Chang S. Bae
0 siblings, 1 reply; 8+ messages in thread
From: david laight @ 2025-11-25 10:37 UTC (permalink / raw)
To: Chang S. Bae; +Cc: linux-kernel, x86, tglx, mingo, bp, dave.hansen
On Mon, 24 Nov 2025 21:32:26 +0000
"Chang S. Bae" <chang.seok.bae@intel.com> wrote:
> The current checksum copy routine already uses all legacy GPRs for loop
> unrolling. APX introduces additional GPRs. Use them to extend the
> unrolling further.
I very much doubt that unrolling this loop has any performance gain.
IIRC you can get a loop with just two 'memory read' and 'adcq' instructions
in it to execute a 'adcq' every clock.
It ought to be possible to do the same even with the extra 'memory write'.
(You can execute a '2 clock loop', but not a '1 clock loop'.)
Whatever you do, the 'loop control' instructions are independent of the copy
and adcq ones and will run in parallel.
For the fastest loop, change the memory accesses to be negative
offsets from the end of the buffer.
Indeed, I think the Intel cpu (I've not done any tests on amd ones)
end up queuing up the adcq and writes (from many loop iterations)
waiting for the reads to complete.
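The negative-offset idiom, in C terms (illustrative model; the payoff is
in the generated asm, where one register holds end-of-buffer and the
index increment also serves as the loop-termination test, so no separate
counter decrement is needed):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sum 8-byte words with end-around carry, indexing from the end of the
 * buffer with a negative offset. Assumes len > 0 and len % 8 == 0. */
static uint64_t sum64_neg(const void *buf, size_t len)
{
	const uint8_t *end = (const uint8_t *)buf + len;
	uint64_t sum = 0;
	intptr_t i;

	for (i = -(intptr_t)len; i != 0; i += 8) {
		uint64_t v;

		memcpy(&v, end + i, 8);
		sum += v;
		if (sum < v)	/* end-around carry, as adcq would fold in */
			sum++;
	}
	return sum;
}
```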
But is this function even worth having at all?
The fast checksum routine does 1.5 to 2 'adcq' per clock.
On modern cpu 'rep movsb' will (usually) copy memory at (IIRC) 32
bytes/clock (IIRC 64 on intel if the destination is aligned).
Put together that is faster than the 1 adcq per clock maximum
of the 'copy and checksum' loop.
The only issue will be buffers over 2k which are likely to generate
extra reads into a 4k L1 data cache.
But it is worse than that.
This code (or something very similar) gets used to checksum data
during copy_to/from_user for sockets.
This goes back a long way and I suspect the 'killer app' was nfsd
running over UDP (with 8k+ UDP datagrams).
Modern NICs all (well, all anyone cares about) do IP checksum offload.
So you don't need to checksum on send() - I'm sure that is still
enabled even though you pretty much never want it.
The checksum on recv() can only happen for UDP, but massively
complicates the code paths and will normally not be needed.
David
>
> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
> ---
> Caveat: This is primarily an illustrative example. I have not fully
> audited all call sites or large-buffer use cases (yet). The goal is to
> demonstrate the potential of the extended register set.
> ---
> arch/x86/Kconfig | 6 +++
> arch/x86/Kconfig.assembler | 6 +++
> arch/x86/include/asm/checksum_64.h | 24 +++++++++++-
> arch/x86/lib/csum-copy_64.S | 59 ++++++++++++++++++++++++++++--
> 4 files changed, 90 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index fa3b616af03a..e6d969376bf2 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1890,6 +1890,12 @@ config X86_USER_SHADOW_STACK
>
> If unsure, say N.
>
> +config X86_APX
> + bool "In-kernel APX use"
> + depends on AS_APX
> + help
> + Experimental: enable in-kernel use of APX
> +
> config INTEL_TDX_HOST
> bool "Intel Trust Domain Extensions (TDX) host support"
> depends on CPU_SUP_INTEL
> diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
> index b1c59fb0a4c9..d208ac540609 100644
> --- a/arch/x86/Kconfig.assembler
> +++ b/arch/x86/Kconfig.assembler
> @@ -5,3 +5,9 @@ config AS_WRUSS
> def_bool $(as-instr64,wrussq %rax$(comma)(%rbx))
> help
> Supported by binutils >= 2.31 and LLVM integrated assembler
> +
> +config AS_APX
> + def_bool $(as-instr64,mov %r16$(comma)%r17)
> + help
> +	  Assembler supports extended registers.
> + Supported by binutils >= 2.43 (LLVM version TBD)
> diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
> index 4d4a47a3a8ab..4cbd9e71f8c3 100644
> --- a/arch/x86/include/asm/checksum_64.h
> +++ b/arch/x86/include/asm/checksum_64.h
> @@ -10,6 +10,7 @@
>
> #include <linux/compiler.h>
> #include <asm/byteorder.h>
> +#include <asm/fpu/api.h>
>
> /**
> * csum_fold - Fold and invert a 32bit checksum.
> @@ -129,7 +130,28 @@ static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
> extern __wsum csum_partial(const void *buff, int len, __wsum sum);
>
> /* Do not call this directly. Use the wrappers below */
> -extern __visible __wsum csum_partial_copy_generic(const void *src, void *dst, int len);
> +extern __visible __wsum csum_partial_copy(const void *src, void *dst, int len);
> +#ifndef CONFIG_X86_APX
> +static inline __wsum csum_partial_copy_generic(const void *src, void *dst, int len)
> +{
> + return csum_partial_copy(src, dst, len);
> +}
> +#else
> +extern __visible __wsum csum_partial_copy_apx(const void *src, void *dst, int len);
> +static inline __wsum csum_partial_copy_generic(const void *src, void *dst, int len)
> +{
> + __wsum sum;
> +
> + if (!cpu_has_xfeatures(XFEATURE_MASK_APX, NULL) || !irq_fpu_usable())
> + return csum_partial_copy(src, dst, len);
> +
> + kernel_fpu_begin();
> + sum = csum_partial_copy_apx(src, dst, len);
> + kernel_fpu_end();
> +
> + return sum;
> +}
> +#endif
>
> extern __wsum csum_and_copy_from_user(const void __user *src, void *dst, int len);
> extern __wsum csum_and_copy_to_user(const void *src, void __user *dst, int len);
> diff --git a/arch/x86/lib/csum-copy_64.S b/arch/x86/lib/csum-copy_64.S
> index 5526bdfac041..dc99227af94f 100644
> --- a/arch/x86/lib/csum-copy_64.S
> +++ b/arch/x86/lib/csum-copy_64.S
> @@ -119,11 +119,54 @@
>
> shrl $6, LEN64B
> jz .Lhandle_tail\@ /* < 64 */
> +.if USE_APX
> + cmpl $3, LEN64B
> + jb .Lloop_64\@ /* < 192 */
> + clc
> + .p2align 4
> +.Lloop_192\@:
> + .set TMP9, %r16
> + .set TMP10, %r17
> + .set TMP11, %r18
> + .set TMP12, %r19
> + .set TMP13, %r20
> + .set TMP14, %r21
> + .set TMP15, %r22
> + .set TMP16, %r23
> + .set TMP17, %r24
> + .set TMP18, %r25
> + .set TMP19, %r26
> + .set TMP20, %r27
> + .set TMP21, %r28
> + .set TMP22, %r29
> + .set TMP23, %r30
> + .set TMP24, %r31
> +
> + .p2align 4
> + loadregs 0, INP, TMP1, TMP2, TMP3, TMP4, TMP5, TMP6, TMP7, TMP8
> + loadregs 8, INP, TMP9, TMP10, TMP11, TMP12, TMP13, TMP14, TMP15, TMP16
> + loadregs 16, INP, TMP17, TMP18, TMP19, TMP20, TMP21, TMP22, TMP23, TMP24
> +
> + sumregs SUM, TMP1, TMP2, TMP3, TMP4, TMP5, TMP6, TMP7, TMP8
> + sumregs SUM, TMP9, TMP10, TMP11, TMP12, TMP13, TMP14, TMP15, TMP16
> + sumregs SUM, TMP17, TMP18, TMP19, TMP20, TMP21, TMP22, TMP23, TMP24
> +
> + storeregs 0, OUTP, TMP1, TMP2, TMP3, TMP4, TMP5, TMP6, TMP7, TMP8
> + storeregs 8, OUTP, TMP9, TMP10, TMP11, TMP12, TMP13, TMP14, TMP15, TMP16
> + storeregs 16, OUTP, TMP17, TMP18, TMP19, TMP20, TMP21, TMP22, TMP23, TMP24
> +
> + incr INP, 24
> + incr OUTP, 24
>
> + sub $3, LEN64B
> + cmp $3, LEN64B
> + jnb .Lloop_192\@
> +.else
> clc
>
> .p2align 4
> -.Lloop\@:
> +.endif
> +.Lloop_64\@:
> loadregs 0, INP, TMP1, TMP2, TMP3, TMP4, TMP5, TMP6, TMP7, TMP8
>
> prefetch
> @@ -137,7 +180,7 @@
> incr INP, 8
> incr OUTP, 8
>
> - jnz .Lloop\@
> + jnz .Lloop_64\@
>
> adcq ZERO, SUM
>
> @@ -260,6 +303,14 @@
> xorl %eax, %eax
> restore_regs_and_ret
>
> -SYM_FUNC_START(csum_partial_copy_generic)
> +.set USE_APX, 0
> +SYM_FUNC_START(csum_partial_copy)
> _csum_partial_copy
> -SYM_FUNC_END(csum_partial_copy_generic)
> +SYM_FUNC_END(csum_partial_copy)
> +
> +#ifdef CONFIG_X86_APX
> +.set USE_APX, 1
> +SYM_FUNC_START(csum_partial_copy_apx)
> + _csum_partial_copy
> +SYM_FUNC_END(csum_partial_copy_apx)
> +#endif
* Re: [DISCUSSION] x86: In-Kernel Use of Extended General-Purpose Registers
2025-11-24 21:32 [DISCUSSION] x86: In-Kernel Use of Extended General-Purpose Registers Chang S. Bae
` (2 preceding siblings ...)
2025-11-24 21:32 ` [RFC PATCH 3/3] x86/lib: Use EGPRs in 64-bit checksum copy loop Chang S. Bae
@ 2025-11-26 16:30 ` Peter Zijlstra
2025-12-01 21:40 ` Chang S. Bae
3 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2025-11-26 16:30 UTC (permalink / raw)
To: Chang S. Bae; +Cc: linux-kernel, x86, tglx, mingo, bp, dave.hansen
On Mon, Nov 24, 2025 at 09:32:23PM +0000, Chang S. Bae wrote:
> This follows how vector registers are used today in places like crypto
> routines. AVX state usage is bracketed by kernel_fpu_begin() /
> kernel_fpu_end(). EGPRs could be similarly used in a small bounded
> region.
>
> Under this model:
>
> * No changes are needed to the existing XSTATE management API.
>
> * Preemption and softirqs would be disabled while EGPRs are live,
> subsequently limiting usage to small regions.
>
> * This lends itself mostly to hand-written assembly, which is less
> scalable for broader adoption.
IIRC it isn't hard to make kernel_fpu_begin/end() preemptible. It came
up with the last xsave rework -- mostly in the context of -rt, but
nobody ever picked it up and did it.
* Re: [RFC PATCH 3/3] x86/lib: Use EGPRs in 64-bit checksum copy loop
2025-11-25 10:37 ` david laight
@ 2025-12-01 21:39 ` Chang S. Bae
0 siblings, 0 replies; 8+ messages in thread
From: Chang S. Bae @ 2025-12-01 21:39 UTC (permalink / raw)
To: david laight; +Cc: linux-kernel, x86, tglx, mingo, bp, dave.hansen
On 11/25/2025 2:37 AM, david laight wrote:
>
> This code (or something very similar) gets used to checksum data
> during copy_to/from_user for sockets.
> This goes back a long way and I suspect the 'killer ap' was nfsd
> running over UDP (with 8k+ UDP datagrams).
> Modern NICs all (well, all anyone cares about) do IP checksum offload.
> So you don't need to checksum on send() - I'm sure that is still
> enabled even though you pretty much never want it.
> The checksum on recv() can only happen for UDP, but massively
> complicates the code paths and will normally not be needed.
It sounds like this optimization wouldn't provide practical benefit. I
don't see a strong case for pursuing this further either, so I'll drop it.
Thanks,
Chang
* Re: [DISCUSSION] x86: In-Kernel Use of Extended General-Purpose Registers
2025-11-26 16:30 ` [DISCUSSION] x86: In-Kernel Use of Extended General-Purpose Registers Peter Zijlstra
@ 2025-12-01 21:40 ` Chang S. Bae
0 siblings, 0 replies; 8+ messages in thread
From: Chang S. Bae @ 2025-12-01 21:40 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, x86, tglx, mingo, bp, dave.hansen
On 11/26/2025 8:30 AM, Peter Zijlstra wrote:
>
> IIRC it isn't hard to make kernel_fpu_begin/end() preemptible. It came
> up with the last xsave rework -- mostly in the context of -rt, but
> nobody ever picked it up and did it.
Thanks for looking at this! I couldn't spot that exact thread, but I
suppose it would need another XSAVE buffer. In any case, I'll bring up
this option in a slide deck as something worth revisiting.
Thanks,
Chang
end of thread, other threads:[~2025-12-01 21:40 UTC | newest]
Thread overview: 8+ messages
2025-11-24 21:32 [DISCUSSION] x86: In-Kernel Use of Extended General-Purpose Registers Chang S. Bae
2025-11-24 21:32 ` [RFC PATCH 1/3] x86/lib: Refactor csum_partial_copy_generic() into a macro Chang S. Bae
2025-11-24 21:32 ` [RFC PATCH 2/3] x86/lib: Convert repeated asm sequences in checksum copy into macros Chang S. Bae
2025-11-24 21:32 ` [RFC PATCH 3/3] x86/lib: Use EGPRs in 64-bit checksum copy loop Chang S. Bae
2025-11-25 10:37 ` david laight
2025-12-01 21:39 ` Chang S. Bae
2025-11-26 16:30 ` [DISCUSSION] x86: In-Kernel Use of Extended General-Purpose Registers Peter Zijlstra
2025-12-01 21:40 ` Chang S. Bae