* [DISCUSSION] x86: In-Kernel Use of Extended General-Purpose Registers
@ 2025-11-24 21:32 Chang S. Bae
2025-11-24 21:32 ` [RFC PATCH 1/3] x86/lib: Refactor csum_partial_copy_generic() into a macro Chang S. Bae
` (3 more replies)
0 siblings, 4 replies; 8+ messages in thread
From: Chang S. Bae @ 2025-11-24 21:32 UTC (permalink / raw)
To: linux-kernel; +Cc: x86, tglx, mingo, bp, dave.hansen, chang.seok.bae
Hi all,
I’d like to initiate a discussion on this topic. The attached patchset
is *not* intended for upstream now. Instead, its purpose is simply to
serve as an example of how the kernel might use these registers. Beyond
a quick look, reviewing the attached patches in depth would likely be a
waste of your time.
== Background ==
Advanced Performance Extensions (APX) introduces additional GPRs: R16–R31
(EGPRs) [1]. These EGPRs are accessible via new prefix encodings on
legacy instructions. Their state is handled through XSAVE, and support
for this new XSTATE component was merged in v6.16 [2]. So far, APX is
primarily targeted toward userspace enablement.
However, in-kernel use still needs to be explored. Ingo previously noted
that EGPRs may help reduce kernel stack pressure [3], and the topic came
up at the x86 microconference at LPC [4]. I hope this posting can
circulate some thoughts, along with an example, ahead of time.
== Possible Approaches ==
(1) Selective and Limited Use
This follows how vector registers are used today in places like crypto
routines. AVX state usage is bracketed by kernel_fpu_begin() /
kernel_fpu_end(). EGPRs could be similarly used in a small bounded
region.
Under this model:
* No changes are needed to the existing XSTATE management API.
* Preemption and softirqs would be disabled while EGPRs are live,
which in turn limits usage to small regions.
* This lends itself mostly to hand-written assembly, which is less
scalable for broader adoption.
PATCH3 in the attached set shows an example of this kind of usage.
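In rough C terms, the bracketing pattern looks like the following
userspace model. The feature check and the begin/end calls here are
stubs standing in for the kernel primitives (cpu_has_xfeatures(),
irq_fpu_usable(), kernel_fpu_begin()/kernel_fpu_end()); PATCH3 carries
the real wrapper:

```c
#include <stdbool.h>

/* Stand-ins for the kernel primitives; purely illustrative. */
static bool apx_usable(void)
{
	return false;	/* would check XFEATURE_MASK_APX and irq_fpu_usable() */
}

static void fpu_begin(void)
{
	/* would disable preemption/softirqs and mark extended state live */
}

static void fpu_end(void)
{
	/* would invalidate extended state and re-enable preemption */
}

static long work_legacy(long x)
{
	return x + 1;
}

static long work_egpr(long x)
{
	return x + 1;	/* same result; an asm body would use r16-r31 */
}

/* Dispatch wrapper: fall back whenever EGPRs cannot be used. */
static long work(long x)
{
	long ret;

	if (!apx_usable())
		return work_legacy(x);

	fpu_begin();
	ret = work_egpr(x);
	fpu_end();
	return ret;
}
```

The key property is that EGPR state is only ever live between the
begin/end pair, so no change to XSTATE management is needed.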
(2) Broader or Tree-wide Adoption
If the goal is to substantially reduce stack pressure or improve
performance more broadly, EGPR usage would need to expand to larger
regions. This raises some considerations:
* The usage window would become too large to keep preemption disabled.
In that case, the wrapper-based approach becomes infeasible.
* The EGPR state would then need to be switched on entry to ensure a
clean separation as APX usage becomes more pervasive. This could be
handled by extending struct pt_regs or another structure.
* The kernel must be able to select between legacy mode and APX,
since APX remains optional for backward compatibility; an APX-only
kernel image won't be distributed.
* This suggests some level of code duplication or alternate code paths
as an unavoidable trade-off. As the usage grows, so does image size,
which raises the bar for demonstrating a measurable benefit.
* At that scale, adoption will likely rely on compiler support, whose
code-generation and optimization behavior needs to be examined and
validated in advance.
== Discussions ==
Given the above, a staged adoption may make sense. EGPR usage could
begin in self-contained libraries or performance-critical paths and be
evaluated incrementally as hardware becomes more broadly available.
Here are some preliminary questions to discuss:
* Does this overall framing make sense?
* Are there alternative or more pragmatic approaches for adoption?
* Which kernel subsystems or hot paths might benefit most from early
experimentation with EGPRs?
Thanks,
Chang
[1] https://cdrdv2.intel.com/v1/dl/getContent/784266
[2] https://lore.kernel.org/lkml/aDL35MA4vH0wQ6Gb@gmail.com/
[3] https://lore.kernel.org/lkml/Z8C57rzRt90obAFg@gmail.com/
[4] https://lpc.events/event/19/contributions/2028/
Chang S. Bae (3):
x86/lib: Refactor csum_partial_copy_generic() into a macro
x86/lib: Convert repeated asm sequences in checksum copy into macros
x86/lib: Use EGPRs in 64-bit checksum copy loop
arch/x86/Kconfig | 6 +
arch/x86/Kconfig.assembler | 6 +
arch/x86/include/asm/checksum_64.h | 24 ++-
arch/x86/lib/csum-copy_64.S | 282 +++++++++++++++++------------
4 files changed, 206 insertions(+), 112 deletions(-)
base-commit: ac3fd01e4c1efce8f2c054cdeb2ddd2fc0fb150d
--
2.51.0
* [RFC PATCH 1/3] x86/lib: Refactor csum_partial_copy_generic() into a macro
2025-11-24 21:32 [DISCUSSION] x86: In-Kernel Use of Extended General-Purpose Registers Chang S. Bae
@ 2025-11-24 21:32 ` Chang S. Bae
2025-11-24 21:32 ` [RFC PATCH 2/3] x86/lib: Convert repeated asm sequences in checksum copy into macros Chang S. Bae
` (2 subsequent siblings)
3 siblings, 0 replies; 8+ messages in thread
From: Chang S. Bae @ 2025-11-24 21:32 UTC (permalink / raw)
To: linux-kernel; +Cc: x86, tglx, mingo, bp, dave.hansen, chang.seok.bae
The current assembly implementation is too rigid to support new
variants that share most of the logic. Refactor the function body into a
reusable macro, with register aliasing to improve readability.
No functional change.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
Not intended for upstream; this series is just an example of how
extended GPRs can be used within the kernel.
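For orientation, the computation being refactored is a ones'-complement
sum: 64-bit adds with the carry folded back in (what the adcq chains
achieve), reduced to 32 bits at the end. A simplified portable-C model,
not the kernel code; it ignores alignment and odd tails and assumes len
is a multiple of 8:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Model of the csum loop: sum 8-byte words with end-around carry,
 * then the .Lfold-style reduction to 32 bits. */
static uint32_t csum_model(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint64_t sum = 0;

	while (len >= 8) {
		uint64_t v;

		memcpy(&v, p, 8);
		sum += v;
		if (sum < v)		/* fold the carry back in, like adcq */
			sum++;
		p += 8;
		len -= 8;
	}

	/* reduce to 32 bits, as the .Lfold block does */
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffffffu) + (sum >> 32);
	return (uint32_t)sum;
}
```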
---
arch/x86/lib/csum-copy_64.S | 187 ++++++++++++++++++++----------------
1 file changed, 103 insertions(+), 84 deletions(-)
diff --git a/arch/x86/lib/csum-copy_64.S b/arch/x86/lib/csum-copy_64.S
index d9e16a2cf285..66ed849090b7 100644
--- a/arch/x86/lib/csum-copy_64.S
+++ b/arch/x86/lib/csum-copy_64.S
@@ -26,17 +26,27 @@
* They also should align source or destination to 8 bytes.
*/
- .macro source
+.macro source
10:
_ASM_EXTABLE_UA(10b, .Lfault)
- .endm
+.endm
- .macro dest
+.macro dest
20:
_ASM_EXTABLE_UA(20b, .Lfault)
- .endm
+.endm
-SYM_FUNC_START(csum_partial_copy_generic)
+.macro restore_regs_and_ret
+ movq 0*8(%rsp), %rbx
+ movq 1*8(%rsp), %r12
+ movq 2*8(%rsp), %r14
+ movq 3*8(%rsp), %r13
+ movq 4*8(%rsp), %r15
+ addq $5*8, %rsp
+ RET
+.endm
+
+.macro _csum_partial_copy
subq $5*8, %rsp
movq %rbx, 0*8(%rsp)
movq %r12, 1*8(%rsp)
@@ -48,41 +58,52 @@ SYM_FUNC_START(csum_partial_copy_generic)
xorl %r9d, %r9d
movl %edx, %ecx
cmpl $8, %ecx
- jb .Lshort
+ jb .Lshort\@
testb $7, %sil
- jne .Lunaligned
-.Laligned:
- movl %ecx, %r12d
+ jne .Lunaligned\@
+.Laligned\@:
+ .set INP, %rdi /* input pointer */
+ .set OUTP, %rsi /* output pointer */
+ .set SUM, %rax /* checksum accumulator */
+ .set ZERO, %r9 /* zero register */
+ .set LEN, %ecx /* byte count */
+ .set LEN64B, %r12d /* 64-byte block count */
+ .set TMP1, %rbx
+ .set TMP2, %r8
+ .set TMP3, %r11
+ .set TMP4, %rdx
+ .set TMP5, %r10
+ .set TMP6, %r15
+ .set TMP7, %r14
+ .set TMP8, %r13
- shrq $6, %r12
- jz .Lhandle_tail /* < 64 */
+ movl LEN, LEN64B
+
+ shrl $6, LEN64B
+ jz .Lhandle_tail\@ /* < 64 */
clc
- /* main loop. clear in 64 byte blocks */
- /* r9: zero, r8: temp2, rbx: temp1, rax: sum, rcx: saved length */
- /* r11: temp3, rdx: temp4, r12 loopcnt */
- /* r10: temp5, r15: temp6, r14 temp7, r13 temp8 */
.p2align 4
-.Lloop:
+.Lloop\@:
source
- movq (%rdi), %rbx
+ movq (INP), TMP1
source
- movq 8(%rdi), %r8
+ movq 8(INP), TMP2
source
- movq 16(%rdi), %r11
+ movq 16(INP), TMP3
source
- movq 24(%rdi), %rdx
+ movq 24(INP), TMP4
source
- movq 32(%rdi), %r10
+ movq 32(INP), TMP5
source
- movq 40(%rdi), %r15
+ movq 40(INP), TMP6
source
- movq 48(%rdi), %r14
+ movq 48(INP), TMP7
source
- movq 56(%rdi), %r13
+ movq 56(INP), TMP8
30:
/*
@@ -92,64 +113,64 @@ SYM_FUNC_START(csum_partial_copy_generic)
_ASM_EXTABLE(30b, 2f)
prefetcht0 5*64(%rdi)
2:
- adcq %rbx, %rax
- adcq %r8, %rax
- adcq %r11, %rax
- adcq %rdx, %rax
- adcq %r10, %rax
- adcq %r15, %rax
- adcq %r14, %rax
- adcq %r13, %rax
+ adcq TMP1, SUM
+ adcq TMP2, SUM
+ adcq TMP3, SUM
+ adcq TMP4, SUM
+ adcq TMP5, SUM
+ adcq TMP6, SUM
+ adcq TMP7, SUM
+ adcq TMP8, SUM
- decl %r12d
+ decl LEN64B
dest
- movq %rbx, (%rsi)
+ movq TMP1, (OUTP)
dest
- movq %r8, 8(%rsi)
+ movq TMP2, 8(OUTP)
dest
- movq %r11, 16(%rsi)
+ movq TMP3, 16(OUTP)
dest
- movq %rdx, 24(%rsi)
+ movq TMP4, 24(OUTP)
dest
- movq %r10, 32(%rsi)
+ movq TMP5, 32(OUTP)
dest
- movq %r15, 40(%rsi)
+ movq TMP6, 40(OUTP)
dest
- movq %r14, 48(%rsi)
+ movq TMP7, 48(OUTP)
dest
- movq %r13, 56(%rsi)
+ movq TMP8, 56(OUTP)
- leaq 64(%rdi), %rdi
- leaq 64(%rsi), %rsi
+ leaq 64(INP), INP
+ leaq 64(OUTP), OUTP
- jnz .Lloop
+ jnz .Lloop\@
- adcq %r9, %rax
+ adcq ZERO, SUM
/* do last up to 56 bytes */
-.Lhandle_tail:
+.Lhandle_tail\@:
/* ecx: count, rcx.63: the end result needs to be rol8 */
movq %rcx, %r10
andl $63, %ecx
shrl $3, %ecx
- jz .Lfold
+ jz .Lfold\@
clc
.p2align 4
-.Lloop_8:
+.Lloop_8\@:
source
- movq (%rdi), %rbx
- adcq %rbx, %rax
- decl %ecx
+ movq (INP), TMP1
+ adcq TMP1, SUM
+ decl LEN
dest
- movq %rbx, (%rsi)
- leaq 8(%rsi), %rsi /* preserve carry */
- leaq 8(%rdi), %rdi
- jnz .Lloop_8
- adcq %r9, %rax /* add in carry */
+ movq TMP1, (OUTP)
+ leaq 8(INP), INP /* preserve carry */
+ leaq 8(OUTP), OUTP
+ jnz .Lloop_8\@
+ adcq ZERO, SUM /* add in carry */
-.Lfold:
+.Lfold\@:
/* reduce checksum to 32bits */
movl %eax, %ebx
shrq $32, %rax
@@ -157,17 +178,17 @@ SYM_FUNC_START(csum_partial_copy_generic)
adcl %r9d, %eax
/* do last up to 6 bytes */
-.Lhandle_7:
+.Lhandle_7\@:
movl %r10d, %ecx
andl $7, %ecx
-.L1: /* .Lshort rejoins the common path here */
+.L1\@: /* .Lshort\@ rejoins the common path here */
shrl $1, %ecx
- jz .Lhandle_1
+ jz .Lhandle_1\@
movl $2, %edx
xorl %ebx, %ebx
clc
.p2align 4
-.Lloop_1:
+.Lloop_1\@:
source
movw (%rdi), %bx
adcl %ebx, %eax
@@ -176,13 +197,13 @@ SYM_FUNC_START(csum_partial_copy_generic)
movw %bx, (%rsi)
leaq 2(%rdi), %rdi
leaq 2(%rsi), %rsi
- jnz .Lloop_1
+ jnz .Lloop_1\@
adcl %r9d, %eax /* add in carry */
/* handle last odd byte */
-.Lhandle_1:
+.Lhandle_1\@:
testb $1, %r10b
- jz .Lende
+ jz .Lende\@
xorl %ebx, %ebx
source
movb (%rdi), %bl
@@ -191,24 +212,18 @@ SYM_FUNC_START(csum_partial_copy_generic)
addl %ebx, %eax
adcl %r9d, %eax /* carry */
-.Lende:
+.Lende\@:
testq %r10, %r10
- js .Lwas_odd
-.Lout:
- movq 0*8(%rsp), %rbx
- movq 1*8(%rsp), %r12
- movq 2*8(%rsp), %r14
- movq 3*8(%rsp), %r13
- movq 4*8(%rsp), %r15
- addq $5*8, %rsp
- RET
-.Lshort:
+ js .Lwas_odd\@
+.Lout\@:
+ restore_regs_and_ret
+.Lshort\@:
movl %ecx, %r10d
- jmp .L1
-.Lunaligned:
+ jmp .L1\@
+.Lunaligned\@:
xorl %ebx, %ebx
testb $1, %sil
- jne .Lodd
+ jne .Lodd\@
1: testb $2, %sil
je 2f
source
@@ -220,7 +235,7 @@ SYM_FUNC_START(csum_partial_copy_generic)
leaq 2(%rsi), %rsi
addq %rbx, %rax
2: testb $4, %sil
- je .Laligned
+ je .Laligned\@
source
movl (%rdi), %ebx
dest
@@ -229,9 +244,9 @@ SYM_FUNC_START(csum_partial_copy_generic)
subq $4, %rcx
leaq 4(%rsi), %rsi
addq %rbx, %rax
- jmp .Laligned
+ jmp .Laligned\@
-.Lodd:
+.Lodd\@:
source
movb (%rdi), %bl
dest
@@ -245,12 +260,16 @@ SYM_FUNC_START(csum_partial_copy_generic)
addq %rbx, %rax
jmp 1b
-.Lwas_odd:
+.Lwas_odd\@:
roll $8, %eax
- jmp .Lout
+ jmp .Lout\@
+.endm
/* Exception: just return 0 */
.Lfault:
xorl %eax, %eax
- jmp .Lout
+ restore_regs_and_ret
+
+SYM_FUNC_START(csum_partial_copy_generic)
+ _csum_partial_copy
SYM_FUNC_END(csum_partial_copy_generic)
--
2.51.0
* [RFC PATCH 2/3] x86/lib: Convert repeated asm sequences in checksum copy into macros
2025-11-24 21:32 [DISCUSSION] x86: In-Kernel Use of Extended General-Purpose Registers Chang S. Bae
2025-11-24 21:32 ` [RFC PATCH 1/3] x86/lib: Refactor csum_partial_copy_generic() into a macro Chang S. Bae
@ 2025-11-24 21:32 ` Chang S. Bae
2025-11-24 21:32 ` [RFC PATCH 3/3] x86/lib: Use EGPRs in 64-bit checksum copy loop Chang S. Bae
2025-11-26 16:30 ` [DISCUSSION] x86: In-Kernel Use of Extended General-Purpose Registers Peter Zijlstra
3 siblings, 0 replies; 8+ messages in thread
From: Chang S. Bae @ 2025-11-24 21:32 UTC (permalink / raw)
To: linux-kernel; +Cc: x86, tglx, mingo, bp, dave.hansen, chang.seok.bae
Several instruction patterns are repeated in the checksum-copy function.
Replace them with small macros to make the code more concise and readable.
No functional change.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
These repetitions are related to the loop unrolling, which will be
further extended using EGPRs in the next patch.
---
arch/x86/lib/csum-copy_64.S | 106 ++++++++++++++++--------------------
1 file changed, 48 insertions(+), 58 deletions(-)
diff --git a/arch/x86/lib/csum-copy_64.S b/arch/x86/lib/csum-copy_64.S
index 66ed849090b7..5526bdfac041 100644
--- a/arch/x86/lib/csum-copy_64.S
+++ b/arch/x86/lib/csum-copy_64.S
@@ -46,6 +46,43 @@
RET
.endm
+.macro prefetch
+30:
+ /*
+ * No _ASM_EXTABLE_UA; this is used for intentional prefetch on a
+ * potentially unmapped kernel address.
+ */
+ _ASM_EXTABLE(30b, 2f)
+ prefetcht0 5*64(%rdi)
+2:
+.endm
+
+.macro loadregs offset, src, regs:vararg
+	i = 0
+.irp r, \regs
+	source
+	movq 8*(\offset + i)(\src), \r
+	i = i + 1
+.endr
+.endm
+
+.macro storeregs offset, dst, regs:vararg
+	i = 0
+.irp r, \regs
+	dest
+	movq \r, 8*(\offset + i)(\dst)
+	i = i + 1
+.endr
+.endm
+
+.macro sumregs sum, regs:vararg
+.irp r, \regs
+ adcq \r, \sum
+.endr
+.endm
+
+.macro incr ptr, count
+ leaq 8*(\count)(\ptr), \ptr
+.endm
+
.macro _csum_partial_copy
subq $5*8, %rsp
movq %rbx, 0*8(%rsp)
@@ -87,63 +124,18 @@
.p2align 4
.Lloop\@:
- source
- movq (INP), TMP1
- source
- movq 8(INP), TMP2
- source
- movq 16(INP), TMP3
- source
- movq 24(INP), TMP4
+ loadregs 0, INP, TMP1, TMP2, TMP3, TMP4, TMP5, TMP6, TMP7, TMP8
- source
- movq 32(INP), TMP5
- source
- movq 40(INP), TMP6
- source
- movq 48(INP), TMP7
- source
- movq 56(INP), TMP8
+ prefetch
-30:
- /*
- * No _ASM_EXTABLE_UA; this is used for intentional prefetch on a
- * potentially unmapped kernel address.
- */
- _ASM_EXTABLE(30b, 2f)
- prefetcht0 5*64(%rdi)
-2:
- adcq TMP1, SUM
- adcq TMP2, SUM
- adcq TMP3, SUM
- adcq TMP4, SUM
- adcq TMP5, SUM
- adcq TMP6, SUM
- adcq TMP7, SUM
- adcq TMP8, SUM
+ sumregs SUM, TMP1, TMP2, TMP3, TMP4, TMP5, TMP6, TMP7, TMP8
decl LEN64B
- dest
- movq TMP1, (OUTP)
- dest
- movq TMP2, 8(OUTP)
- dest
- movq TMP3, 16(OUTP)
- dest
- movq TMP4, 24(OUTP)
+ storeregs 0, OUTP, TMP1, TMP2, TMP3, TMP4, TMP5, TMP6, TMP7, TMP8
- dest
- movq TMP5, 32(OUTP)
- dest
- movq TMP6, 40(OUTP)
- dest
- movq TMP7, 48(OUTP)
- dest
- movq TMP8, 56(OUTP)
-
- leaq 64(INP), INP
- leaq 64(OUTP), OUTP
+ incr INP, 8
+ incr OUTP, 8
jnz .Lloop\@
@@ -159,14 +151,12 @@
clc
.p2align 4
.Lloop_8\@:
- source
- movq (INP), TMP1
- adcq TMP1, SUM
+ loadregs 0, INP, TMP1
+ sumregs SUM, TMP1
decl LEN
- dest
- movq TMP1, (OUTP)
- leaq 8(INP), INP /* preserve carry */
- leaq 8(OUTP), OUTP
+ storeregs 0, OUTP, TMP1
+ incr INP, 1 /* preserve carry */
+ incr OUTP, 1
jnz .Lloop_8\@
adcq ZERO, SUM /* add in carry */
--
2.51.0
* [RFC PATCH 3/3] x86/lib: Use EGPRs in 64-bit checksum copy loop
2025-11-24 21:32 [DISCUSSION] x86: In-Kernel Use of Extended General-Purpose Registers Chang S. Bae
2025-11-24 21:32 ` [RFC PATCH 1/3] x86/lib: Refactor csum_partial_copy_generic() into a macro Chang S. Bae
2025-11-24 21:32 ` [RFC PATCH 2/3] x86/lib: Convert repeated asm sequences in checksum copy into macros Chang S. Bae
@ 2025-11-24 21:32 ` Chang S. Bae
2025-11-25 10:37 ` david laight
2025-11-26 16:30 ` [DISCUSSION] x86: In-Kernel Use of Extended General-Purpose Registers Peter Zijlstra
3 siblings, 1 reply; 8+ messages in thread
From: Chang S. Bae @ 2025-11-24 21:32 UTC (permalink / raw)
To: linux-kernel; +Cc: x86, tglx, mingo, bp, dave.hansen, chang.seok.bae
The current checksum copy routine already uses all legacy GPRs for loop
unrolling. APX introduces additional GPRs. Use them to extend the
unrolling further.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
Caveat: This is primarily an illustrative example. I have not fully
audited all call sites or large-buffer use cases (yet). The goal is to
demonstrate the potential of the extended register set.
---
arch/x86/Kconfig | 6 +++
arch/x86/Kconfig.assembler | 6 +++
arch/x86/include/asm/checksum_64.h | 24 +++++++++++-
arch/x86/lib/csum-copy_64.S | 59 ++++++++++++++++++++++++++++--
4 files changed, 90 insertions(+), 5 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fa3b616af03a..e6d969376bf2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1890,6 +1890,12 @@ config X86_USER_SHADOW_STACK
If unsure, say N.
+config X86_APX
+ bool "In-kernel APX use"
+ depends on AS_APX
+ help
+ Experimental: enable in-kernel use of APX
+
config INTEL_TDX_HOST
bool "Intel Trust Domain Extensions (TDX) host support"
depends on CPU_SUP_INTEL
diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
index b1c59fb0a4c9..d208ac540609 100644
--- a/arch/x86/Kconfig.assembler
+++ b/arch/x86/Kconfig.assembler
@@ -5,3 +5,9 @@ config AS_WRUSS
def_bool $(as-instr64,wrussq %rax$(comma)(%rbx))
help
Supported by binutils >= 2.31 and LLVM integrated assembler
+
+config AS_APX
+ def_bool $(as-instr64,mov %r16$(comma)%r17)
+ help
+	  Assembler supports extended registers.
+ Supported by binutils >= 2.43 (LLVM version TBD)
diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
index 4d4a47a3a8ab..4cbd9e71f8c3 100644
--- a/arch/x86/include/asm/checksum_64.h
+++ b/arch/x86/include/asm/checksum_64.h
@@ -10,6 +10,7 @@
#include <linux/compiler.h>
#include <asm/byteorder.h>
+#include <asm/fpu/api.h>
/**
* csum_fold - Fold and invert a 32bit checksum.
@@ -129,7 +130,28 @@ static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
extern __wsum csum_partial(const void *buff, int len, __wsum sum);
/* Do not call this directly. Use the wrappers below */
-extern __visible __wsum csum_partial_copy_generic(const void *src, void *dst, int len);
+extern __visible __wsum csum_partial_copy(const void *src, void *dst, int len);
+#ifndef CONFIG_X86_APX
+static inline __wsum csum_partial_copy_generic(const void *src, void *dst, int len)
+{
+ return csum_partial_copy(src, dst, len);
+}
+#else
+extern __visible __wsum csum_partial_copy_apx(const void *src, void *dst, int len);
+static inline __wsum csum_partial_copy_generic(const void *src, void *dst, int len)
+{
+ __wsum sum;
+
+ if (!cpu_has_xfeatures(XFEATURE_MASK_APX, NULL) || !irq_fpu_usable())
+ return csum_partial_copy(src, dst, len);
+
+ kernel_fpu_begin();
+ sum = csum_partial_copy_apx(src, dst, len);
+ kernel_fpu_end();
+
+ return sum;
+}
+#endif
extern __wsum csum_and_copy_from_user(const void __user *src, void *dst, int len);
extern __wsum csum_and_copy_to_user(const void *src, void __user *dst, int len);
diff --git a/arch/x86/lib/csum-copy_64.S b/arch/x86/lib/csum-copy_64.S
index 5526bdfac041..dc99227af94f 100644
--- a/arch/x86/lib/csum-copy_64.S
+++ b/arch/x86/lib/csum-copy_64.S
@@ -119,11 +119,54 @@
shrl $6, LEN64B
jz .Lhandle_tail\@ /* < 64 */
+.if USE_APX
+ cmpl $3, LEN64B
+ jb .Lloop_64\@ /* < 192 */
+ clc
+ .p2align 4
+.Lloop_192\@:
+ .set TMP9, %r16
+ .set TMP10, %r17
+ .set TMP11, %r18
+ .set TMP12, %r19
+ .set TMP13, %r20
+ .set TMP14, %r21
+ .set TMP15, %r22
+ .set TMP16, %r23
+ .set TMP17, %r24
+ .set TMP18, %r25
+ .set TMP19, %r26
+ .set TMP20, %r27
+ .set TMP21, %r28
+ .set TMP22, %r29
+ .set TMP23, %r30
+ .set TMP24, %r31
+
+ .p2align 4
+ loadregs 0, INP, TMP1, TMP2, TMP3, TMP4, TMP5, TMP6, TMP7, TMP8
+ loadregs 8, INP, TMP9, TMP10, TMP11, TMP12, TMP13, TMP14, TMP15, TMP16
+ loadregs 16, INP, TMP17, TMP18, TMP19, TMP20, TMP21, TMP22, TMP23, TMP24
+
+ sumregs SUM, TMP1, TMP2, TMP3, TMP4, TMP5, TMP6, TMP7, TMP8
+ sumregs SUM, TMP9, TMP10, TMP11, TMP12, TMP13, TMP14, TMP15, TMP16
+ sumregs SUM, TMP17, TMP18, TMP19, TMP20, TMP21, TMP22, TMP23, TMP24
+
+ storeregs 0, OUTP, TMP1, TMP2, TMP3, TMP4, TMP5, TMP6, TMP7, TMP8
+ storeregs 8, OUTP, TMP9, TMP10, TMP11, TMP12, TMP13, TMP14, TMP15, TMP16
+ storeregs 16, OUTP, TMP17, TMP18, TMP19, TMP20, TMP21, TMP22, TMP23, TMP24
+
+ incr INP, 24
+ incr OUTP, 24
+ sub $3, LEN64B
+ cmp $3, LEN64B
+ jnb .Lloop_192\@
+.else
clc
.p2align 4
-.Lloop\@:
+.endif
+.Lloop_64\@:
loadregs 0, INP, TMP1, TMP2, TMP3, TMP4, TMP5, TMP6, TMP7, TMP8
prefetch
@@ -137,7 +180,7 @@
incr INP, 8
incr OUTP, 8
- jnz .Lloop\@
+ jnz .Lloop_64\@
adcq ZERO, SUM
@@ -260,6 +303,14 @@
xorl %eax, %eax
restore_regs_and_ret
-SYM_FUNC_START(csum_partial_copy_generic)
+.set USE_APX, 0
+SYM_FUNC_START(csum_partial_copy)
_csum_partial_copy
-SYM_FUNC_END(csum_partial_copy_generic)
+SYM_FUNC_END(csum_partial_copy)
+
+#ifdef CONFIG_X86_APX
+.set USE_APX, 1
+SYM_FUNC_START(csum_partial_copy_apx)
+ _csum_partial_copy
+SYM_FUNC_END(csum_partial_copy_apx)
+#endif
--
2.51.0
* Re: [RFC PATCH 3/3] x86/lib: Use EGPRs in 64-bit checksum copy loop
2025-11-24 21:32 ` [RFC PATCH 3/3] x86/lib: Use EGPRs in 64-bit checksum copy loop Chang S. Bae
@ 2025-11-25 10:37 ` david laight
2025-12-01 21:39 ` Chang S. Bae
0 siblings, 1 reply; 8+ messages in thread
From: david laight @ 2025-11-25 10:37 UTC (permalink / raw)
To: Chang S. Bae; +Cc: linux-kernel, x86, tglx, mingo, bp, dave.hansen
On Mon, 24 Nov 2025 21:32:26 +0000
"Chang S. Bae" <chang.seok.bae@intel.com> wrote:
> The current checksum copy routine already uses all legacy GPRs for loop
> unrolling. APX introduces additional GPRs. Use them to extend the
> unrolling further.
I very much doubt that unrolling this loop has any performance gain.
IIRC you can get a loop with just two 'memory read' and 'adcq' instructions
in it to execute a 'adcq' every clock.
It ought to be possible to do the same even with the extra 'memory write'.
(You can execute a '2 clock loop', but not a '1 clock loop'.)
Whatever you do, the 'loop control' instructions are independent of the copy
and adcq ones and will run in parallel.
For the fastest loop, change the memory accesses to be negative
offsets from the end of the buffer.
Indeed, I think the Intel cpu (I've not done any tests on amd ones)
end up queuing up the adcq and writes (from many loop iterations)
waiting for the reads to complete.
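The negative-offset idiom, in C terms (illustrative model; the payoff is
in the generated asm, where one register holds end-of-buffer and the
index increment also serves as the loop-termination test, so no separate
counter decrement is needed):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sum 8-byte words with end-around carry, indexing from the end of the
 * buffer with a negative offset. Assumes len > 0 and len % 8 == 0. */
static uint64_t sum64_neg(const void *buf, size_t len)
{
	const uint8_t *end = (const uint8_t *)buf + len;
	uint64_t sum = 0;
	intptr_t i;

	for (i = -(intptr_t)len; i != 0; i += 8) {
		uint64_t v;

		memcpy(&v, end + i, 8);
		sum += v;
		if (sum < v)	/* end-around carry, as adcq would fold in */
			sum++;
	}
	return sum;
}
```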
But is this function even worth having at all?
The fast checksum routine does 1.5 to 2 'adcq' per clock.
On modern cpu 'rep movsb' will (usually) copy memory at (IIRC) 32
bytes/clock (IIRC 64 on intel if the destination is aligned).
Put together that is faster than the 1 adcq per clock maximum
of the 'copy and checksum' loop.
The only issue will be buffers over 2k which are likely to generate
extra reads into a 4k L1 data cache.
But it is worse than that.
This code (or something very similar) gets used to checksum data
during copy_to/from_user for sockets.
This goes back a long way and I suspect the 'killer app' was nfsd
running over UDP (with 8k+ UDP datagrams).
Modern NICs all (well, all anyone cares about) do IP checksum offload.
So you don't need to checksum on send() - I'm sure that is still
enabled even though you pretty much never want it.
The checksum on recv() can only happen for UDP, but massively
complicates the code paths and will normally not be needed.
David
>
> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
> ---
> Caveat: This is primarily an illustrative example. I have not fully
> audited all call sites or large-buffer use cases (yet). The goal is to
> demonstrate the potential of the extended register set.
> ---
> arch/x86/Kconfig | 6 +++
> arch/x86/Kconfig.assembler | 6 +++
> arch/x86/include/asm/checksum_64.h | 24 +++++++++++-
> arch/x86/lib/csum-copy_64.S | 59 ++++++++++++++++++++++++++++--
> 4 files changed, 90 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index fa3b616af03a..e6d969376bf2 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1890,6 +1890,12 @@ config X86_USER_SHADOW_STACK
>
> If unsure, say N.
>
> +config X86_APX
> + bool "In-kernel APX use"
> + depends on AS_APX
> + help
> + Experimental: enable in-kernel use of APX
> +
> config INTEL_TDX_HOST
> bool "Intel Trust Domain Extensions (TDX) host support"
> depends on CPU_SUP_INTEL
> diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
> index b1c59fb0a4c9..d208ac540609 100644
> --- a/arch/x86/Kconfig.assembler
> +++ b/arch/x86/Kconfig.assembler
> @@ -5,3 +5,9 @@ config AS_WRUSS
> def_bool $(as-instr64,wrussq %rax$(comma)(%rbx))
> help
> Supported by binutils >= 2.31 and LLVM integrated assembler
> +
> +config AS_APX
> + def_bool $(as-instr64,mov %r16$(comma)%r17)
> + help
> +	  Assembler supports extended registers.
> + Supported by binutils >= 2.43 (LLVM version TBD)
> diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
> index 4d4a47a3a8ab..4cbd9e71f8c3 100644
> --- a/arch/x86/include/asm/checksum_64.h
> +++ b/arch/x86/include/asm/checksum_64.h
> @@ -10,6 +10,7 @@
>
> #include <linux/compiler.h>
> #include <asm/byteorder.h>
> +#include <asm/fpu/api.h>
>
> /**
> * csum_fold - Fold and invert a 32bit checksum.
> @@ -129,7 +130,28 @@ static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
> extern __wsum csum_partial(const void *buff, int len, __wsum sum);
>
> /* Do not call this directly. Use the wrappers below */
> -extern __visible __wsum csum_partial_copy_generic(const void *src, void *dst, int len);
> +extern __visible __wsum csum_partial_copy(const void *src, void *dst, int len);
> +#ifndef CONFIG_X86_APX
> +static inline __wsum csum_partial_copy_generic(const void *src, void *dst, int len)
> +{
> + return csum_partial_copy(src, dst, len);
> +}
> +#else
> +extern __visible __wsum csum_partial_copy_apx(const void *src, void *dst, int len);
> +static inline __wsum csum_partial_copy_generic(const void *src, void *dst, int len)
> +{
> + __wsum sum;
> +
> + if (!cpu_has_xfeatures(XFEATURE_MASK_APX, NULL) || !irq_fpu_usable())
> + return csum_partial_copy(src, dst, len);
> +
> + kernel_fpu_begin();
> + sum = csum_partial_copy_apx(src, dst, len);
> + kernel_fpu_end();
> +
> + return sum;
> +}
> +#endif
>
> extern __wsum csum_and_copy_from_user(const void __user *src, void *dst, int len);
> extern __wsum csum_and_copy_to_user(const void *src, void __user *dst, int len);
> diff --git a/arch/x86/lib/csum-copy_64.S b/arch/x86/lib/csum-copy_64.S
> index 5526bdfac041..dc99227af94f 100644
> --- a/arch/x86/lib/csum-copy_64.S
> +++ b/arch/x86/lib/csum-copy_64.S
> @@ -119,11 +119,54 @@
>
> shrl $6, LEN64B
> jz .Lhandle_tail\@ /* < 64 */
> +.if USE_APX
> + cmpl $3, LEN64B
> + jb .Lloop_64\@ /* < 192 */
> + clc
> + .p2align 4
> +.Lloop_192\@:
> + .set TMP9, %r16
> + .set TMP10, %r17
> + .set TMP11, %r18
> + .set TMP12, %r19
> + .set TMP13, %r20
> + .set TMP14, %r21
> + .set TMP15, %r22
> + .set TMP16, %r23
> + .set TMP17, %r24
> + .set TMP18, %r25
> + .set TMP19, %r26
> + .set TMP20, %r27
> + .set TMP21, %r28
> + .set TMP22, %r29
> + .set TMP23, %r30
> + .set TMP24, %r31
> +
> + .p2align 4
> + loadregs 0, INP, TMP1, TMP2, TMP3, TMP4, TMP5, TMP6, TMP7, TMP8
> + loadregs 8, INP, TMP9, TMP10, TMP11, TMP12, TMP13, TMP14, TMP15, TMP16
> + loadregs 16, INP, TMP17, TMP18, TMP19, TMP20, TMP21, TMP22, TMP23, TMP24
> +
> + sumregs SUM, TMP1, TMP2, TMP3, TMP4, TMP5, TMP6, TMP7, TMP8
> + sumregs SUM, TMP9, TMP10, TMP11, TMP12, TMP13, TMP14, TMP15, TMP16
> + sumregs SUM, TMP17, TMP18, TMP19, TMP20, TMP21, TMP22, TMP23, TMP24
> +
> + storeregs 0, OUTP, TMP1, TMP2, TMP3, TMP4, TMP5, TMP6, TMP7, TMP8
> + storeregs 8, OUTP, TMP9, TMP10, TMP11, TMP12, TMP13, TMP14, TMP15, TMP16
> + storeregs 16, OUTP, TMP17, TMP18, TMP19, TMP20, TMP21, TMP22, TMP23, TMP24
> +
> + incr INP, 24
> + incr OUTP, 24
>
> + sub $3, LEN64B
> + cmp $3, LEN64B
> + jnb .Lloop_192\@
> +.else
> clc
>
> .p2align 4
> -.Lloop\@:
> +.endif
> +.Lloop_64\@:
> loadregs 0, INP, TMP1, TMP2, TMP3, TMP4, TMP5, TMP6, TMP7, TMP8
>
> prefetch
> @@ -137,7 +180,7 @@
> incr INP, 8
> incr OUTP, 8
>
> - jnz .Lloop\@
> + jnz .Lloop_64\@
>
> adcq ZERO, SUM
>
> @@ -260,6 +303,14 @@
> xorl %eax, %eax
> restore_regs_and_ret
>
> -SYM_FUNC_START(csum_partial_copy_generic)
> +.set USE_APX, 0
> +SYM_FUNC_START(csum_partial_copy)
> _csum_partial_copy
> -SYM_FUNC_END(csum_partial_copy_generic)
> +SYM_FUNC_END(csum_partial_copy)
> +
> +#ifdef CONFIG_X86_APX
> +.set USE_APX, 1
> +SYM_FUNC_START(csum_partial_copy_apx)
> + _csum_partial_copy
> +SYM_FUNC_END(csum_partial_copy_apx)
> +#endif
* Re: [DISCUSSION] x86: In-Kernel Use of Extended General-Purpose Registers
2025-11-24 21:32 [DISCUSSION] x86: In-Kernel Use of Extended General-Purpose Registers Chang S. Bae
` (2 preceding siblings ...)
2025-11-24 21:32 ` [RFC PATCH 3/3] x86/lib: Use EGPRs in 64-bit checksum copy loop Chang S. Bae
@ 2025-11-26 16:30 ` Peter Zijlstra
2025-12-01 21:40 ` Chang S. Bae
3 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2025-11-26 16:30 UTC (permalink / raw)
To: Chang S. Bae; +Cc: linux-kernel, x86, tglx, mingo, bp, dave.hansen
On Mon, Nov 24, 2025 at 09:32:23PM +0000, Chang S. Bae wrote:
> This follows how vector registers are used today in places like crypto
> routines. AVX state usage is bracketed by kernel_fpu_begin() /
> kernel_fpu_end(). EGPRs could be similarly used in a small bounded
> region.
>
> Under this model:
>
> * No changes are needed to the existing XSTATE management API.
>
> * Preemption and softirqs would be disabled while EGPRs are live,
> subsequently limiting usage to small regions.
>
> * This lends itself mostly to hand-written assembly, which is less
> scalable for broader adoption.
IIRC it isn't hard to make kernel_fpu_begin/end() preemptible. It came
up with the last xsave rework -- mostly in the context of -rt, but
nobody ever picked it up and did it.
* Re: [RFC PATCH 3/3] x86/lib: Use EGPRs in 64-bit checksum copy loop
2025-11-25 10:37 ` david laight
@ 2025-12-01 21:39 ` Chang S. Bae
0 siblings, 0 replies; 8+ messages in thread
From: Chang S. Bae @ 2025-12-01 21:39 UTC (permalink / raw)
To: david laight; +Cc: linux-kernel, x86, tglx, mingo, bp, dave.hansen
On 11/25/2025 2:37 AM, david laight wrote:
>
> This code (or something very similar) gets used to checksum data
> during copy_to/from_user for sockets.
> This goes back a long way and I suspect the 'killer ap' was nfsd
> running over UDP (with 8k+ UDP datagrams).
> Modern NICs all (well, all anyone cares about) do IP checksum offload.
> So you don't need to checksum on send() - I'm sure that is still
> enabled even though you pretty much never want it.
> The checksum on recv() can only happen for UDP, but massively
> complicates the code paths and will normally not be needed.
It sounds like this optimization wouldn't provide practical benefit. I
don't see a strong case for pursuing this further either, so I'll drop it.
Thanks,
Chang
* Re: [DISCUSSION] x86: In-Kernel Use of Extended General-Purpose Registers
2025-11-26 16:30 ` [DISCUSSION] x86: In-Kernel Use of Extended General-Purpose Registers Peter Zijlstra
@ 2025-12-01 21:40 ` Chang S. Bae
0 siblings, 0 replies; 8+ messages in thread
From: Chang S. Bae @ 2025-12-01 21:40 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, x86, tglx, mingo, bp, dave.hansen
On 11/26/2025 8:30 AM, Peter Zijlstra wrote:
>
> IIRC it isn't hard to make kernel_fpu_begin/end() preemptible. It came
> up with the last xsave rework -- mostly in the context of -rt, but
> nobody ever picked it up and did it.
Thanks for looking at this! I couldn't spot that exact thread, but I
suppose it would need another XSAVE buffer. In any case, I'll bring up
this option in a slide deck as something worth revisiting.
Thanks,
Chang
end of thread, other threads:[~2025-12-01 21:40 UTC | newest]
Thread overview: 8+ messages
2025-11-24 21:32 [DISCUSSION] x86: In-Kernel Use of Extended General-Purpose Registers Chang S. Bae
2025-11-24 21:32 ` [RFC PATCH 1/3] x86/lib: Refactor csum_partial_copy_generic() into a macro Chang S. Bae
2025-11-24 21:32 ` [RFC PATCH 2/3] x86/lib: Convert repeated asm sequences in checksum copy into macros Chang S. Bae
2025-11-24 21:32 ` [RFC PATCH 3/3] x86/lib: Use EGPRs in 64-bit checksum copy loop Chang S. Bae
2025-11-25 10:37 ` david laight
2025-12-01 21:39 ` Chang S. Bae
2025-11-26 16:30 ` [DISCUSSION] x86: In-Kernel Use of Extended General-Purpose Registers Peter Zijlstra
2025-12-01 21:40 ` Chang S. Bae