All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] x86: handle the tail in rep_movs_alternative() with an overlapping store
@ 2025-03-20 19:05 Mateusz Guzik
  2025-03-20 19:23 ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread
From: Mateusz Guzik @ 2025-03-20 19:05 UTC (permalink / raw)
  To: torvalds, x86
  Cc: hkrzesin, tglx, mingo, bp, dave.hansen, hpa, olichtne, atomasov,
	aokuliar, linux-kernel, Mateusz Guzik

Sizes ranged <8,64> are copied 8 bytes at a time with a jump out to a
1 byte at a time loop to handle the tail.

This is trivially avoidable with overlapping stores, which is the
standard technique for these kind of routines (in fact memcpy() is
already using it in a more extensive form).

I traced calls in this range during a kernel build and found that 65% of
all of them had tail to take care of.

Distribution of size & 7 in that range is as follows:
@:
[0, 1)           3282178 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, 2)            949627 |@@@@@@@@@@@@@@@                                     |
[2, 3)           1078842 |@@@@@@@@@@@@@@@@@                                   |
[3, 4)            998995 |@@@@@@@@@@@@@@@                                     |
[4, 5)            744515 |@@@@@@@@@@@                                         |
[5, 6)            925437 |@@@@@@@@@@@@@@                                      |
[6, 7)            663305 |@@@@@@@@@@                                          |
[7, ...)          786007 |@@@@@@@@@@@@                                        |

@stats[notail]: 3282178
@stats[tail]: 6146728

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---

I added a call to a custom copy_user_probe() routine to
copy_user_generic so that I can attach to it with bpftrace. Available
upon request. The one-liner is:

bpftrace -e 'kprobe:copy_user_probe /arg2 >= 8 && arg2 <= 64/ \
{ @ = lhist(arg2 & 7, 0, 7, 1); @stats[arg2 & 7 ? "tail" : "notail"] = count(); }'

Anyhow, I don't have any means to benchmark this at the moment as the
only hw I have access to is in fact a little too modern (FSRM), but this
being the standard technique I think wont require much convincing. If
anything the question is why this differs from memcpy which *does* use
overlapping stores.

Tested by forcing the kernel to use the routine, did the kernel build
just fine with the change.

That said, absent own bench results I'm not going to strongly argue for
the change but I do think it is an ok tidy-up until someone(tm) puts
more effort here.

 arch/x86/lib/copy_user_64.S | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index aa8c341b2441..2d5f42c521b5 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -65,10 +65,15 @@ SYM_FUNC_START(rep_movs_alternative)
 	je .Lexit
 	cmp $8,%ecx
 	jae .Lword
-	jmp .Lcopy_user_tail
+4:	movq -8(%rsi,%rcx),%rax
+5:	movq %rax,-8(%rdi,%rcx)
+	xorl %ecx,%ecx
+	RET
 
 	_ASM_EXTABLE_UA( 2b, .Lcopy_user_tail)
 	_ASM_EXTABLE_UA( 3b, .Lcopy_user_tail)
+	_ASM_EXTABLE_UA( 4b, .Lcopy_user_tail)
+	_ASM_EXTABLE_UA( 5b, .Lcopy_user_tail)
 
 .Llarge:
 0:	ALTERNATIVE "jmp .Llarge_movsq", "rep movsb", X86_FEATURE_ERMS
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2025-03-26 22:59 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-20 19:05 [PATCH] x86: handle the tail in rep_movs_alternative() with an overlapping store Mateusz Guzik
2025-03-20 19:23 ` Linus Torvalds
2025-03-20 19:33   ` Mateusz Guzik
2025-03-20 20:24     ` Mateusz Guzik
2025-03-22 12:02       ` David Laight
2025-03-20 21:17     ` Linus Torvalds
2025-03-20 23:53       ` Linus Torvalds
2025-03-21 20:10         ` Mateusz Guzik
2025-03-21 20:47         ` David Laight
2025-03-25 22:42           ` Herton Krzesinski
2025-03-25 22:51             ` Linus Torvalds
2025-03-26 22:45             ` David Laight
2025-03-26 22:59             ` Mateusz Guzik

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.