* Performance issues in copy_user_generic() in x86_64
@ 2025-03-14 17:53 Herton R. Krzesinski
2025-03-14 17:53 ` [PATCH] x86: add back the alignment of the destination to 8 bytes in copy_user_generic() Herton R. Krzesinski
0 siblings, 1 reply; 13+ messages in thread
From: Herton R. Krzesinski @ 2025-03-14 17:53 UTC (permalink / raw)
To: x86
Cc: tglx, mingo, bp, dave.hansen, hpa, linux-kernel, torvalds,
olichtne, atomasov, aokuliar
Hello,
recently I have got two reports of performance loss in copy_user_generic()
after updates in user copy functions in x86_64, when benchmarking with iperf3.
I believe the write alignment to 8 bytes that was done through the old
ALIGN_DESTINATION macro was helping in some cases, and when it was removed the
performance drop can be noticed. Looks like this theory is corroborated by some
performance testing I did.
Please take a look at the following email with the patch if everything is sane.
I already did some testing as explained in the changelog of the patch. I used
the following scripts to run the testing, I just wrote them to get the job done
and get some results, so there is nothing fancy about them.
---- bench.sh
#!/bin/bash
dir=$1
mkdir -p $dir
for cpu in 19 21 23 none; do
sync
echo 3 > /proc/sys/vm/drop_caches
cpu_opt=""
if [ "$cpu" != "none" ]; then
cpu_opt="taskset -c $cpu"
fi
$cpu_opt iperf3 -D -s -B 127.0.0.1 -p 12000
perf stat -o $dir/stat.$cpu.txt taskset -c 17 iperf3 -c 127.0.0.1 -b 0/1000 -V -n 50G --repeating-payload -l 16384 -p 12000 --cport 12001 2>&1 > $dir/stat-$cpu.txt
cat $dir/stat.$cpu.txt >> $dir/stat-$cpu.txt
rm -f $dir/stat.$cpu.txt
killall iperf3
done
----
---- stat.sh
#!/bin/bash
dir=$1
printf " %4s %13s %12s %12s %11s\n" "CPU" "RATE " "SYS " "TIME " "sender-receiver"
for cpu in 19 21 23 none; do
time=$(grep 'seconds time elapsed' $dir/stat-$cpu.txt | awk '{ print $1 }')
sys=$(grep 'seconds sys' $dir/stat-$cpu.txt | awk '{ print $1 }')
rate=$(grep ' sender' $dir/stat-$cpu.txt | awk '{ print $7 $8 }')
cpuu=$(grep 'CPU Utilization' $dir/stat-$cpu.txt | awk '{ printf "%s-%s\n", $4, $7 }')
printf "Server bind %4s: $rate $sys $time %s\n" $cpu $cpuu
done
----
Example of a test run:
nice -n -20 ./bench.sh align
./stat.sh align
^ permalink raw reply [flat|nested] 13+ messages in thread* [PATCH] x86: add back the alignment of the destination to 8 bytes in copy_user_generic() 2025-03-14 17:53 Performance issues in copy_user_generic() in x86_64 Herton R. Krzesinski @ 2025-03-14 17:53 ` Herton R. Krzesinski 2025-03-14 19:06 ` Linus Torvalds 0 siblings, 1 reply; 13+ messages in thread From: Herton R. Krzesinski @ 2025-03-14 17:53 UTC (permalink / raw) To: x86 Cc: tglx, mingo, bp, dave.hansen, hpa, linux-kernel, torvalds, olichtne, atomasov, aokuliar Since the upstream series with user copy updates were merged upstream with commit a5624566431d ("Merge branch 'x86-rep-insns': x86 user copy clarifications"), copy_user_generic() on x86_64 stopped doing alignment of the writes to the destination to a 8 byte boundary for the non FSRM case. Previously, this was done through the ALIGN_DESTINATION macro that was used in the now removed copy_user_generic_unrolled function. Turns out that may cause some loss of performance/throughput on some use cases and specific CPU/platforms. Lately I got two reports of performance/throughput issues after a RHEL 9 kernel pulled the same upstream series with updates to user copy functions. Both reports consisted of running specific networking/TCP related testing using iperf3. The first report was related to a Linux Bridge testing using VMs on an specific machine with an AMD CPU (EPYC 7402), and after a brief investigation it turned out that the later change through commit ca96b162bfd2 ("x86: bring back rep movsq for user access on CPUs without ERMS") helped/fixed the performance issue. However, after the later commit/fix was applied, then I got another regression reported in a multistream TCP test on a 100Gbit mlx5 nic, also running on an AMD based platform (AMD EPYC 7302 CPU), again that was using iperf3 to run the test. That regression was after applying the later fix/commit, but only this didn't help in telling the whole history. So I narrowed down the second regression use case, but running it without traffic through a nic, on localhost, in trying to narrow down CPU usage and not being limited by other factor like network bandwidth. I used another system also with an AMD CPU (AMD EPYC 7742). Basically, I run iperf3 in server and client mode in the same system, for example: - Start the server binding it to CPU core/thread 19: $ taskset -c 19 iperf3 -D -s -B 127.0.0.1 -p 12000 - Start the client always binding/running on CPU core/thread 17, using perf to get statistics: $ perf stat -o stat.txt taskset -c 17 iperf3 -c 127.0.0.1 -b 0/1000 -V \ -n 50G --repeating-payload -l 16384 -p 12000 --cport 12001 2>&1 \ > stat-19.txt For the client, always running/pinned to CPU 17. But for the iperf3 in server mode, I did test runs using CPUs 19, 21, 23 or not pinned to any specific CPU. So it basically consisted with four runs of the same commands, just changing the CPU which the server is pinned, or without pinning by removing the taskset call before the server command. The CPUs were chosen based on NUMA node they were on, this is the relevant output of lscpu on the system: $ lscpu ... Model name: AMD EPYC 7742 64-Core Processor ... Caches (sum of all): L1d: 2 MiB (64 instances) L1i: 2 MiB (64 instances) L2: 32 MiB (64 instances) L3: 256 MiB (16 instances) NUMA: NUMA node(s): 4 NUMA node0 CPU(s): 0,1,8,9,16,17,24,25,32,33,40,41,48,49,56,57,64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121 NUMA node1 CPU(s): 2,3,10,11,18,19,26,27,34,35,42,43,50,51,58,59,66,67,74,75,82,83,90,91,98,99,106,107,114,115,122,123 NUMA node2 CPU(s): 4,5,12,13,20,21,28,29,36,37,44,45,52,53,60,61,68,69,76,77,84,85,92,93,100,101,108,109,116,117,124,125 NUMA node3 CPU(s): 6,7,14,15,22,23,30,31,38,39,46,47,54,55,62,63,70,71,78,79,86,87,94,95,102,103,110,111,118,119,126,127 ... So for the server run, when picking a CPU, I chose CPUs to be not on the same node. The reason is with that I was able to get/measure relevant performance differences when changing the alignment of the destination in copy_user_generic. I made tables below, an example of a set results I got, summarizing the results: * No alignment case: CPU RATE SYS TIME sender-receiver Server bind 19: 13.0Gbits/sec 28.371851000 33.233499566 86.9%-70.8% Server bind 21: 12.9Gbits/sec 28.283381000 33.586486621 85.8%-69.9% Server bind 23: 11.1Gbits/sec 33.660190000 39.012243176 87.7%-64.5% Server bind none: 18.9Gbits/sec 19.215339000 22.875117865 86.0%-80.5% * With this patch (aligning write of destination address to 8 byte boundary): CPU RATE SYS TIME sender-receiver Server bind 19: 20.6Gbits/sec 15.084684000 21.009879874 75.8%-89.3% Server bind 21: 20.2Gbits/sec 15.322023000 21.427787353 75.5%-89.5% Server bind 23: 18.5Gbits/sec 17.520225000 23.390048552 78.1%-88.8% Server bind none: 26.0Gbits/sec 12.617328000 16.682558421 80.1%-89.6% So I consistently got better results when aligning the write. The results above were run on a 6.14.0-rc6 based kernel. The sys is sys time and then the total time to run/transfer 50G of data. The last field is the CPU usage of sender/receiver iperf3 process. All the CPUs/platforms mentioned above did not have FRMS or ERMS flag enabled/set. It's also worth to note that each pair of iperf3 runs may get slightly different results on each run, but I always got consistent higher results with the write alignment for this specific test of running the processes on CPUs in different NUMA nodes. Reported-by: Ondrej Lichtner <olichtne@redhat.com> Signed-off-by: Herton R. Krzesinski <herton@redhat.com> --- arch/x86/include/asm/uaccess_64.h | 2 +- arch/x86/lib/copy_user_64.S | 23 ++++++++++++++++++++--- 2 files changed, 21 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h index c52f0133425b..96a0710cfacd 100644 --- a/arch/x86/include/asm/uaccess_64.h +++ b/arch/x86/include/asm/uaccess_64.h @@ -130,7 +130,7 @@ copy_user_generic(void *to, const void *from, unsigned long len) "2:\n" _ASM_EXTABLE_UA(1b, 2b) :"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT - : : "memory", "rax"); + : : "memory", "rax", "rdx", "r8"); clac(); return len; } diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S index fc9fb5d06174..4e95e9ffa37e 100644 --- a/arch/x86/lib/copy_user_64.S +++ b/arch/x86/lib/copy_user_64.S @@ -30,22 +30,39 @@ * it simpler for us, we can clobber rsi/rdi and rax freely. */ SYM_FUNC_START(rep_movs_alternative) + movl %edi,%eax + andl $7,%eax + je .Laligned + cmpq $8,%rcx + jb .Lcopy_user_short + movl %eax,%edx + sub $8,%edx + negl %edx + movq %rdx,%r8 + jmp .Lcopy_bytes_loop + +.Laligned: cmpq $64,%rcx jae .Llarge cmp $8,%ecx jae .Lword +.Lcopy_user_short: testl %ecx,%ecx je .Lexit - .Lcopy_user_tail: + movq %rcx,%r8 + movl %ecx,%edx +.Lcopy_bytes_loop: 0: movb (%rsi),%al 1: movb %al,(%rdi) inc %rdi inc %rsi - dec %rcx - jne .Lcopy_user_tail + dec %edx + jne .Lcopy_bytes_loop + sub %r8,%rcx + jne .Laligned .Lexit: RET -- 2.47.1 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH] x86: add back the alignment of the destination to 8 bytes in copy_user_generic() 2025-03-14 17:53 ` [PATCH] x86: add back the alignment of the destination to 8 bytes in copy_user_generic() Herton R. Krzesinski @ 2025-03-14 19:06 ` Linus Torvalds 2025-03-14 20:33 ` Herton Krzesinski 2025-03-17 13:16 ` David Laight 0 siblings, 2 replies; 13+ messages in thread From: Linus Torvalds @ 2025-03-14 19:06 UTC (permalink / raw) To: Herton R. Krzesinski Cc: x86, tglx, mingo, bp, dave.hansen, hpa, linux-kernel, olichtne, atomasov, aokuliar On Fri, 14 Mar 2025 at 07:53, Herton R. Krzesinski <herton@redhat.com> wrote: > > --- a/arch/x86/include/asm/uaccess_64.h > +++ b/arch/x86/include/asm/uaccess_64.h > @@ -130,7 +130,7 @@ copy_user_generic(void *to, const void *from, unsigned long len) > "2:\n" > _ASM_EXTABLE_UA(1b, 2b) > :"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT > - : : "memory", "rax"); > + : : "memory", "rax", "rdx", "r8"); Please don't penalize the caller with the extra clobbers. Maybe it doesn't matter - these functions are marked always_inline, but they aren't inlined in very many places and maybe those places have registers to spare - but let's not penalize the FSRM case anyway. And we do call it "rep_movs_alternative", so let's keep it close to "rep movs" semantics (yes, we already clobber %rax, but let's not make it worse). As to the actual change to rep_movs - that should be done differently too. In particular, I doubt it makes any sense to try to align the destination for small writes or for the ERMS case when we use 'rep movsb', so I think this should all go into just the ".Llarge_movsq" case. .. and then the patch can be further optimized to just do the first - possibly unaligned - destination word unconditionally, and then updating the addresses and counts to make the rest be aligned. Something ENTIRELY UNTESTED like this, in other words. And I wrote it so that it doesn't need any new temporary registers, so no need for clobbers or for some save/restore code. NOTE! The patch below is very intentionally whitespace-damaged. Anybody who applies this needs to look at it very carefully, because I just threw this together with zero testing and only very limited thought. But if it works, and if it actually improves performance, I think it might be a fairly minimal approach to "align destination". Linus ---- arch/x86/lib/copy_user_64.S | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S index fc9fb5d06174..1c3090af3807 100644 --- a/arch/x86/lib/copy_user_64.S +++ b/arch/x86/lib/copy_user_64.S @@ -74,6 +74,23 @@ SYM_FUNC_START(rep_movs_alternative) _ASM_EXTABLE_UA( 0b, 1b) .Llarge_movsq: + /* Do the first possibly unaligned word */ +0: movq (%rsi),%rax +1: movq %rax,(%rdi) + _ASM_EXTABLE_UA( 0b, .Lcopy_user_tail) + _ASM_EXTABLE_UA( 1b, .Lcopy_user_tail) + + /* What would be the offset to the aligned destination? */ + leaq 8(%rdi),%rax + andq $-8,%rax + subq %rdi,%rax + + /* .. and update pointers and count to match */ + addq %rax,%rdi + addq %rax,%rsi + subq %rax,%rcx + + /* make %rcx contain the number of words, %rax the remainder */ movq %rcx,%rax shrq $3,%rcx andl $7,%eax ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] x86: add back the alignment of the destination to 8 bytes in copy_user_generic() 2025-03-14 19:06 ` Linus Torvalds @ 2025-03-14 20:33 ` Herton Krzesinski 2025-03-16 10:58 ` Ingo Molnar 2025-03-17 13:16 ` David Laight 1 sibling, 1 reply; 13+ messages in thread From: Herton Krzesinski @ 2025-03-14 20:33 UTC (permalink / raw) To: Linus Torvalds Cc: x86, tglx, mingo, bp, dave.hansen, hpa, linux-kernel, olichtne, atomasov, aokuliar On Fri, Mar 14, 2025 at 4:06 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Fri, 14 Mar 2025 at 07:53, Herton R. Krzesinski <herton@redhat.com> wrote: > > > > --- a/arch/x86/include/asm/uaccess_64.h > > +++ b/arch/x86/include/asm/uaccess_64.h > > @@ -130,7 +130,7 @@ copy_user_generic(void *to, const void *from, unsigned long len) > > "2:\n" > > _ASM_EXTABLE_UA(1b, 2b) > > :"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT > > - : : "memory", "rax"); > > + : : "memory", "rax", "rdx", "r8"); > > Please don't penalize the caller with the extra clobbers. > > Maybe it doesn't matter - these functions are marked always_inline, > but they aren't inlined in very many places and maybe those places > have registers to spare - but let's not penalize the FSRM case anyway. > > And we do call it "rep_movs_alternative", so let's keep it close to > "rep movs" semantics (yes, we already clobber %rax, but let's not make > it worse). > > As to the actual change to rep_movs - that should be done differently > too. In particular, I doubt it makes any sense to try to align the > destination for small writes or for the ERMS case when we use 'rep > movsb', so I think this should all go into just the ".Llarge_movsq" > case. > > .. and then the patch can be further optimized to just do the first - > possibly unaligned - destination word unconditionally, and then > updating the addresses and counts to make the rest be aligned. > > Something ENTIRELY UNTESTED like this, in other words. And I wrote it > so that it doesn't need any new temporary registers, so no need for > clobbers or for some save/restore code. > > NOTE! The patch below is very intentionally whitespace-damaged. > Anybody who applies this needs to look at it very carefully, because I > just threw this together with zero testing and only very limited > thought. > > But if it works, and if it actually improves performance, I think it > might be a fairly minimal approach to "align destination". It does look good in my testing here, I built same kernel I was using for testing the original patch (based on 6.14-rc6), this is one of the results I got in one of the runs testing on the same machine: CPU RATE SYS TIME sender-receiver Server bind 19: 20.8Gbits/sec 14.832313000 20.863476111 75.4%-89.2% Server bind 21: 18.0Gbits/sec 18.705221000 23.996913032 80.8%-89.7% Server bind 23: 20.1Gbits/sec 15.331761000 21.536657212 75.0%-89.7% Server bind none: 24.1Gbits/sec 14.164226000 18.043132731 82.3%-87.1% There are still some variations between runs, which is expected as was the same when I tested my patch or in the not aligned case, but it's consistently better/higher than the no align case. Looks really it's sufficient to align for the higher than or equal 64 bytes copy case. > > Linus > > ---- > > arch/x86/lib/copy_user_64.S | 17 +++++++++++++++++ > 1 file changed, 17 insertions(+) > > diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S > index fc9fb5d06174..1c3090af3807 100644 > --- a/arch/x86/lib/copy_user_64.S > +++ b/arch/x86/lib/copy_user_64.S > @@ -74,6 +74,23 @@ SYM_FUNC_START(rep_movs_alternative) > _ASM_EXTABLE_UA( 0b, 1b) > > .Llarge_movsq: > + /* Do the first possibly unaligned word */ > +0: movq (%rsi),%rax > +1: movq %rax,(%rdi) > + _ASM_EXTABLE_UA( 0b, .Lcopy_user_tail) > + _ASM_EXTABLE_UA( 1b, .Lcopy_user_tail) > + > + /* What would be the offset to the aligned destination? */ > + leaq 8(%rdi),%rax > + andq $-8,%rax > + subq %rdi,%rax > + > + /* .. and update pointers and count to match */ > + addq %rax,%rdi > + addq %rax,%rsi > + subq %rax,%rcx > + > + /* make %rcx contain the number of words, %rax the remainder */ > movq %rcx,%rax > shrq $3,%rcx > andl $7,%eax > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] x86: add back the alignment of the destination to 8 bytes in copy_user_generic() 2025-03-14 20:33 ` Herton Krzesinski @ 2025-03-16 10:58 ` Ingo Molnar 2025-03-16 11:09 ` Ingo Molnar 0 siblings, 1 reply; 13+ messages in thread From: Ingo Molnar @ 2025-03-16 10:58 UTC (permalink / raw) To: Herton Krzesinski Cc: Linus Torvalds, x86, tglx, mingo, bp, dave.hansen, hpa, linux-kernel, olichtne, atomasov, aokuliar * Herton Krzesinski <hkrzesin@redhat.com> wrote: > On Fri, Mar 14, 2025 at 4:06 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > On Fri, 14 Mar 2025 at 07:53, Herton R. Krzesinski <herton@redhat.com> wrote: > > > > > > --- a/arch/x86/include/asm/uaccess_64.h > > > +++ b/arch/x86/include/asm/uaccess_64.h > > > @@ -130,7 +130,7 @@ copy_user_generic(void *to, const void *from, unsigned long len) > > > "2:\n" > > > _ASM_EXTABLE_UA(1b, 2b) > > > :"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT > > > - : : "memory", "rax"); > > > + : : "memory", "rax", "rdx", "r8"); > > > > Please don't penalize the caller with the extra clobbers. > > > > Maybe it doesn't matter - these functions are marked always_inline, > > but they aren't inlined in very many places and maybe those places > > have registers to spare - but let's not penalize the FSRM case anyway. > > > > And we do call it "rep_movs_alternative", so let's keep it close to > > "rep movs" semantics (yes, we already clobber %rax, but let's not make > > it worse). > > > > As to the actual change to rep_movs - that should be done differently > > too. In particular, I doubt it makes any sense to try to align the > > destination for small writes or for the ERMS case when we use 'rep > > movsb', so I think this should all go into just the ".Llarge_movsq" > > case. > > > > .. and then the patch can be further optimized to just do the first - > > possibly unaligned - destination word unconditionally, and then > > updating the addresses and counts to make the rest be aligned. > > > > Something ENTIRELY UNTESTED like this, in other words. And I wrote it > > so that it doesn't need any new temporary registers, so no need for > > clobbers or for some save/restore code. > > > > NOTE! The patch below is very intentionally whitespace-damaged. > > Anybody who applies this needs to look at it very carefully, because I > > just threw this together with zero testing and only very limited > > thought. > > > > But if it works, and if it actually improves performance, I think it > > might be a fairly minimal approach to "align destination". > > It does look good in my testing here, I built same kernel I > was using for testing the original patch (based on > 6.14-rc6), this is one of the results I got in one of the runs > testing on the same machine: > > CPU RATE SYS TIME sender-receiver > Server bind 19: 20.8Gbits/sec 14.832313000 20.863476111 75.4%-89.2% > Server bind 21: 18.0Gbits/sec 18.705221000 23.996913032 80.8%-89.7% > Server bind 23: 20.1Gbits/sec 15.331761000 21.536657212 75.0%-89.7% > Server bind none: 24.1Gbits/sec 14.164226000 18.043132731 82.3%-87.1% > > There are still some variations between runs, which is > expected as was the same when I tested my patch or in > the not aligned case, but it's consistently better/higher than > the no align case. Looks really it's sufficient to align for the > higher than or equal 64 bytes copy case. Mind sending a v2 patch with a changelog and these benchmark numbers added in, and perhaps a Co-developed-by tag with Linus or so? Thanks, Ingo ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] x86: add back the alignment of the destination to 8 bytes in copy_user_generic() 2025-03-16 10:58 ` Ingo Molnar @ 2025-03-16 11:09 ` Ingo Molnar 2025-03-17 13:18 ` Herton Krzesinski 2025-03-18 21:59 ` David Laight 0 siblings, 2 replies; 13+ messages in thread From: Ingo Molnar @ 2025-03-16 11:09 UTC (permalink / raw) To: Herton Krzesinski Cc: Linus Torvalds, x86, tglx, mingo, bp, dave.hansen, hpa, linux-kernel, olichtne, atomasov, aokuliar * Ingo Molnar <mingo@kernel.org> wrote: > > It does look good in my testing here, I built same kernel I was > > using for testing the original patch (based on 6.14-rc6), this is > > one of the results I got in one of the runs testing on the same > > machine: > > > > CPU RATE SYS TIME sender-receiver > > Server bind 19: 20.8Gbits/sec 14.832313000 20.863476111 75.4%-89.2% > > Server bind 21: 18.0Gbits/sec 18.705221000 23.996913032 80.8%-89.7% > > Server bind 23: 20.1Gbits/sec 15.331761000 21.536657212 75.0%-89.7% > > Server bind none: 24.1Gbits/sec 14.164226000 18.043132731 82.3%-87.1% > > > > There are still some variations between runs, which is expected as > > was the same when I tested my patch or in the not aligned case, but > > it's consistently better/higher than the no align case. Looks > > really it's sufficient to align for the higher than or equal 64 > > bytes copy case. > > Mind sending a v2 patch with a changelog and these benchmark numbers > added in, and perhaps a Co-developed-by tag with Linus or so? BTW., if you have a test system available, it would be nice to test a server CPU in the Intel spectrum as well. (For completeness mostly, I'd not expect there to be as much alignment sensitivity.) The CPU you tested, AMD Epyc 7742 was launched ~6 years ago so it's still within the window of microarchitectures we care about. An Intel test would be nice from a similar timeframe as well. Older is probably better in this case, but not too old. :-) ( Note that the Intel test is not required to apply the fix IMO - we did change alignment patterns ~2 years ago in a5624566431d which regressed. ) Thanks, Ingo ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] x86: add back the alignment of the destination to 8 bytes in copy_user_generic() 2025-03-16 11:09 ` Ingo Molnar @ 2025-03-17 13:18 ` Herton Krzesinski 2025-03-18 21:59 ` David Laight 1 sibling, 0 replies; 13+ messages in thread From: Herton Krzesinski @ 2025-03-17 13:18 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, x86, tglx, mingo, bp, dave.hansen, hpa, linux-kernel, olichtne, atomasov, aokuliar On Sun, Mar 16, 2025 at 8:10 AM Ingo Molnar <mingo@kernel.org> wrote: > > > * Ingo Molnar <mingo@kernel.org> wrote: > > > > It does look good in my testing here, I built same kernel I was > > > using for testing the original patch (based on 6.14-rc6), this is > > > one of the results I got in one of the runs testing on the same > > > machine: > > > > > > CPU RATE SYS TIME sender-receiver > > > Server bind 19: 20.8Gbits/sec 14.832313000 20.863476111 75.4%-89.2% > > > Server bind 21: 18.0Gbits/sec 18.705221000 23.996913032 80.8%-89.7% > > > Server bind 23: 20.1Gbits/sec 15.331761000 21.536657212 75.0%-89.7% > > > Server bind none: 24.1Gbits/sec 14.164226000 18.043132731 82.3%-87.1% > > > > > > There are still some variations between runs, which is expected as > > > was the same when I tested my patch or in the not aligned case, but > > > it's consistently better/higher than the no align case. Looks > > > really it's sufficient to align for the higher than or equal 64 > > > bytes copy case. > > > > Mind sending a v2 patch with a changelog and these benchmark numbers > > added in, and perhaps a Co-developed-by tag with Linus or so? > > BTW., if you have a test system available, it would be nice to test a > server CPU in the Intel spectrum as well. (For completeness mostly, I'd > not expect there to be as much alignment sensitivity.) > > The CPU you tested, AMD Epyc 7742 was launched ~6 years ago so it's > still within the window of microarchitectures we care about. An Intel > test would be nice from a similar timeframe as well. Older is probably > better in this case, but not too old. :-) > > ( Note that the Intel test is not required to apply the fix IMO - we > did change alignment patterns ~2 years ago in a5624566431d which > regressed. ) Yes I'll work here to send a v2 and try to test on an Intel system, and compare. > > Thanks, > > Ingo > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] x86: add back the alignment of the destination to 8 bytes in copy_user_generic() 2025-03-16 11:09 ` Ingo Molnar 2025-03-17 13:18 ` Herton Krzesinski @ 2025-03-18 21:59 ` David Laight 2025-03-18 22:50 ` Herton Krzesinski 1 sibling, 1 reply; 13+ messages in thread From: David Laight @ 2025-03-18 21:59 UTC (permalink / raw) To: Ingo Molnar Cc: Herton Krzesinski, Linus Torvalds, x86, tglx, mingo, bp, dave.hansen, hpa, linux-kernel, olichtne, atomasov, aokuliar On Sun, 16 Mar 2025 12:09:47 +0100 Ingo Molnar <mingo@kernel.org> wrote: > * Ingo Molnar <mingo@kernel.org> wrote: > > > > It does look good in my testing here, I built same kernel I was > > > using for testing the original patch (based on 6.14-rc6), this is > > > one of the results I got in one of the runs testing on the same > > > machine: > > > > > > CPU RATE SYS TIME sender-receiver > > > Server bind 19: 20.8Gbits/sec 14.832313000 20.863476111 75.4%-89.2% > > > Server bind 21: 18.0Gbits/sec 18.705221000 23.996913032 80.8%-89.7% > > > Server bind 23: 20.1Gbits/sec 15.331761000 21.536657212 75.0%-89.7% > > > Server bind none: 24.1Gbits/sec 14.164226000 18.043132731 82.3%-87.1% > > > > > > There are still some variations between runs, which is expected as > > > was the same when I tested my patch or in the not aligned case, but > > > it's consistently better/higher than the no align case. Looks > > > really it's sufficient to align for the higher than or equal 64 > > > bytes copy case. > > > > Mind sending a v2 patch with a changelog and these benchmark numbers > > added in, and perhaps a Co-developed-by tag with Linus or so? > > BTW., if you have a test system available, it would be nice to test a > server CPU in the Intel spectrum as well. (For completeness mostly, I'd > not expect there to be as much alignment sensitivity.) > > The CPU you tested, AMD Epyc 7742 was launched ~6 years ago so it's > still within the window of microarchitectures we care about. An Intel > test would be nice from a similar timeframe as well. Older is probably > better in this case, but not too old. :-) Is that loop doing aligned 'rep movsq' ? Pretty much all the Intel (non-atom) cpu have some variant of FRSM. For FRSM you get double the throughput if the destination is 32byte aligned. No other alignment makes any difference. The cycle cost is per 16/32 byte block and different families have different costs for the first few blocks, after than you get 1 block/clock. That goes all the way back to Sandy Bridge and Ivy Bridge. I don't think anyone has tried doing that alignment. I'm sure I've measured misaligned 64bit writes and got no significant cost. It might be one extra clock for writes than cross cache line boundaries. Misaligned reads are pretty much 'cost free' - just about measurable on the ip-checksum code loop (and IIRC even running a three reads every two clocks algorithm). I don't have access to a similar range of amd chips. David > > ( Note that the Intel test is not required to apply the fix IMO - we > did change alignment patterns ~2 years ago in a5624566431d which > regressed. ) > > Thanks, > > Ingo > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] x86: add back the alignment of the destination to 8 bytes in copy_user_generic() 2025-03-18 21:59 ` David Laight @ 2025-03-18 22:50 ` Herton Krzesinski 2025-03-19 13:07 ` David Laight 0 siblings, 1 reply; 13+ messages in thread From: Herton Krzesinski @ 2025-03-18 22:50 UTC (permalink / raw) To: David Laight Cc: Ingo Molnar, Linus Torvalds, x86, tglx, mingo, bp, dave.hansen, hpa, linux-kernel, olichtne, atomasov, aokuliar On Tue, Mar 18, 2025 at 6:59 PM David Laight <david.laight.linux@gmail.com> wrote: > > On Sun, 16 Mar 2025 12:09:47 +0100 > Ingo Molnar <mingo@kernel.org> wrote: > > > * Ingo Molnar <mingo@kernel.org> wrote: > > > > > > It does look good in my testing here, I built same kernel I was > > > > using for testing the original patch (based on 6.14-rc6), this is > > > > one of the results I got in one of the runs testing on the same > > > > machine: > > > > > > > > CPU RATE SYS TIME sender-receiver > > > > Server bind 19: 20.8Gbits/sec 14.832313000 20.863476111 75.4%-89.2% > > > > Server bind 21: 18.0Gbits/sec 18.705221000 23.996913032 80.8%-89.7% > > > > Server bind 23: 20.1Gbits/sec 15.331761000 21.536657212 75.0%-89.7% > > > > Server bind none: 24.1Gbits/sec 14.164226000 18.043132731 82.3%-87.1% > > > > > > > > There are still some variations between runs, which is expected as > > > > was the same when I tested my patch or in the not aligned case, but > > > > it's consistently better/higher than the no align case. Looks > > > > really it's sufficient to align for the higher than or equal 64 > > > > bytes copy case. > > > > > > Mind sending a v2 patch with a changelog and these benchmark numbers > > > added in, and perhaps a Co-developed-by tag with Linus or so? > > > > BTW., if you have a test system available, it would be nice to test a > > server CPU in the Intel spectrum as well. (For completeness mostly, I'd > > not expect there to be as much alignment sensitivity.) > > > > The CPU you tested, AMD Epyc 7742 was launched ~6 years ago so it's > > still within the window of microarchitectures we care about. An Intel > > test would be nice from a similar timeframe as well. Older is probably > > better in this case, but not too old. :-) > > Is that loop doing aligned 'rep movsq' ? > > Pretty much all the Intel (non-atom) cpu have some variant of FRSM. > For FRSM you get double the throughput if the destination is 32byte aligned. > No other alignment makes any difference. > The cycle cost is per 16/32 byte block and different families have > different costs for the first few blocks, after than you get 1 block/clock. > That goes all the way back to Sandy Bridge and Ivy Bridge. > I don't think anyone has tried doing that alignment. The code under copy_user_generic() has mainly two paths, first is FSRM case where rep movsb is used right away. Otherwise rep_movs_alternative is used: with 8 bytes or less byte by byte copy is used, between 8-64 bytes word/8byte copy is used, and for greater or equal than 64 bytes either rep movsb is used for cpus with ERMS or rep movsq is used when without. The last case is what Linus' patch touches. And the last case also falls back to byte by byte copy for the tail of the copy/remaining bytes if any. For Intel, I was looking and looks like after Sandy Bridge based CPUs most/almost all have ERMS, and FSRM is something only newer ones have. So the way back to Ivy Bridge is ERMS and not FSRM. The patch as expected does not affect newer/last decade Intel CPUs, since they have ERMS, and will use rep movsb for large copies. The only Intel CPU I found that runs the code for the alignment case (which is the last rep movsq case) is a Sandy Bridge based system here which does not have either erms/fsrm, I tested on one today and saw no difference at least in my iperf3 benchmark I'm using, so far only the AMD I tested seems to benefit from the alignment in the case I tested in practice, and the others I tested don't regress/get worse but no benefits either. I also tried the 32 byte write alignment on the same Sandy Bridge based CPU, I hope to have got the code correct: diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S index fc9fb5d06174..75bf7e9e9318 100644 --- a/arch/x86/lib/copy_user_64.S +++ b/arch/x86/lib/copy_user_64.S @@ -74,6 +74,36 @@ SYM_FUNC_START(rep_movs_alternative) _ASM_EXTABLE_UA( 0b, 1b) .Llarge_movsq: + /* Do the first possibly unaligned word */ +0: movq (%rsi),%rax +1: movq %rax,(%rdi) +2: movq 8(%rsi),%rax +3: movq %rax,8(%rdi) +4: movq 16(%rsi),%rax +5: movq %rax,16(%rdi) +6: movq 24(%rsi),%rax +7: movq %rax,24(%rdi) + + _ASM_EXTABLE_UA( 0b, .Lcopy_user_tail) + _ASM_EXTABLE_UA( 1b, .Lcopy_user_tail) + _ASM_EXTABLE_UA( 2b, .Lcopy_user_tail) + _ASM_EXTABLE_UA( 3b, .Lcopy_user_tail) + _ASM_EXTABLE_UA( 4b, .Lcopy_user_tail) + _ASM_EXTABLE_UA( 5b, .Lcopy_user_tail) + _ASM_EXTABLE_UA( 6b, .Lcopy_user_tail) + _ASM_EXTABLE_UA( 7b, .Lcopy_user_tail) + + /* What would be the offset to the aligned destination? */ + leaq 32(%rdi),%rax + andq $-32,%rax + subq %rdi,%rax + + /* .. and update pointers and count to match */ + addq %rax,%rdi + addq %rax,%rsi + subq %rax,%rcx + + /* make %rcx contain the number of words, %rax the remainder */ movq %rcx,%rax shrq $3,%rcx andl $7,%eax But the code above doesn't perform better than the 8byte only alignment on the Intel CPU (it's Xeon E5-2667). I also tested the Linus' patch on other Intel CPU with ERMS (Xeon(R) Platinum 8280) and there is no difference with/without the patch as expected. However, I'll try to do a test with alignment before the rep movsb (erms) in the large case, to see if there is any difference, but I think the result will be the same as Sandy Bridge. It takes some time to do all the testing/setup, thus I'm almost settling on the 8 byte only alignment that was sent by Linus and I validated already, since I'm so far unable to get better results with other approaches/trials If you have any patch you want to be tested let me know and I can apply and run against my iperf3 test case. > > I'm sure I've measured misaligned 64bit writes and got no significant cost. > It might be one extra clock for writes than cross cache line boundaries. > Misaligned reads are pretty much 'cost free' - just about measurable > on the ip-checksum code loop (and IIRC even running a three reads every > two clocks algorithm). > > I don't have access to a similar range of amd chips. > > David > > > > > ( Note that the Intel test is not required to apply the fix IMO - we > > did change alignment patterns ~2 years ago in a5624566431d which > > regressed. ) > > > > Thanks, > > > > Ingo > > > ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH] x86: add back the alignment of the destination to 8 bytes in copy_user_generic() 2025-03-18 22:50 ` Herton Krzesinski @ 2025-03-19 13:07 ` David Laight 0 siblings, 0 replies; 13+ messages in thread From: David Laight @ 2025-03-19 13:07 UTC (permalink / raw) To: Herton Krzesinski Cc: Ingo Molnar, Linus Torvalds, x86, tglx, mingo, bp, dave.hansen, hpa, linux-kernel, olichtne, atomasov, aokuliar On Tue, 18 Mar 2025 19:50:41 -0300 Herton Krzesinski <hkrzesin@redhat.com> wrote: > On Tue, Mar 18, 2025 at 6:59 PM David Laight > <david.laight.linux@gmail.com> wrote: ... > For Intel, I was looking and looks like after Sandy Bridge based CPUs > most/almost all have ERMS, and FSRM is something only newer ones have. > So the way back to Ivy Bridge is ERMS and not FSRM. ERMS behaves much the same as FSRM. The cost of the first tranfser is a few clocks higher (maybe 30 not 24), and (IIRC) the overhead for the next couple of blocks is a bit bigger. Reading Agner's tables (again) Haswell will do 32 bytes/clock (for an aligned destination) whereas Sandy/Ivy bridge 'only' do 16.. I doubt it is enough to treat them differently. The real issue with using (aligned) 'rep movsq' was the 140 clock setup cost on P4 netburst (and no one cares about that and more). I don't think anything else really needs an open coded loop. There is no hint in the tables of the AND cpu (going way back) having long setup times. The differing cost of different ways of aligning the copy will show up most on short copies. You also need to benchmarks differing sizes/alignments - otherwise the branch predictor will get it right every time - which it doesn't in 'real code'. David ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] x86: add back the alignment of the destination to 8 bytes in copy_user_generic() 2025-03-14 19:06 ` Linus Torvalds 2025-03-14 20:33 ` Herton Krzesinski @ 2025-03-17 13:16 ` David Laight 2025-03-17 21:29 ` Linus Torvalds 1 sibling, 1 reply; 13+ messages in thread From: David Laight @ 2025-03-17 13:16 UTC (permalink / raw) To: Linus Torvalds Cc: Herton R. Krzesinski, x86, tglx, mingo, bp, dave.hansen, hpa, linux-kernel, olichtne, atomasov, aokuliar On Fri, 14 Mar 2025 09:06:13 -1000 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, 14 Mar 2025 at 07:53, Herton R. Krzesinski <herton@redhat.com> wrote: > > > > --- a/arch/x86/include/asm/uaccess_64.h > > +++ b/arch/x86/include/asm/uaccess_64.h > > @@ -130,7 +130,7 @@ copy_user_generic(void *to, const void *from, unsigned long len) > > "2:\n" > > _ASM_EXTABLE_UA(1b, 2b) > > :"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT > > - : : "memory", "rax"); > > + : : "memory", "rax", "rdx", "r8"); > > Please don't penalize the caller with the extra clobbers. > > Maybe it doesn't matter - these functions are marked always_inline, > but they aren't inlined in very many places and maybe those places > have registers to spare - but let's not penalize the FSRM case anyway. > > And we do call it "rep_movs_alternative", so let's keep it close to > "rep movs" semantics (yes, we already clobber %rax, but let's not make > it worse). > > As to the actual change to rep_movs - that should be done differently > too. In particular, I doubt it makes any sense to try to align the > destination for small writes or for the ERMS case when we use 'rep > movsb', so I think this should all go into just the ".Llarge_movsq" > case. The Intel cpu (sandy bridge onwards) execute 'rep mosvb' twice as fast if the destination is 32 byte aligned. Potentially this is worth optimising for - but the cost of the extra code may exceed the benefit. > .. and then the patch can be further optimized to just do the first - > possibly unaligned - destination word unconditionally, and then > updating the addresses and counts to make the rest be aligned. You can also something similar for any trailing bytes. If you are feeling 'brave' copy the last 8 bytes first. David ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] x86: add back the alignment of the destination to 8 bytes in copy_user_generic() 2025-03-17 13:16 ` David Laight @ 2025-03-17 21:29 ` Linus Torvalds 2025-03-17 22:32 ` David Laight 0 siblings, 1 reply; 13+ messages in thread From: Linus Torvalds @ 2025-03-17 21:29 UTC (permalink / raw) To: David Laight Cc: Herton R. Krzesinski, x86, tglx, mingo, bp, dave.hansen, hpa, linux-kernel, olichtne, atomasov, aokuliar On Mon, 17 Mar 2025 at 06:16, David Laight <david.laight.linux@gmail.com> wrote: > > You can also something similar for any trailing bytes. > If you are feeling 'brave' copy the last 8 bytes first. I think that would be a mistake. Not only does ti cause bad patterns on page faults - we should recover ok from it (the exception will go back and do the copy in the right order one byte at a time in the "copy_user_tail" code) - but even in the absence of page faults it quite possibly messes with CPU prefetching and write buffer coalescing etc if you hop around like that. It *might* be worth trying doing last unaligned part the same way my patch does the first one - by just doing a full-word write at the end, offset backwards. That avoids the byte-at-a-time tail case. I'm not convinced it's worth it, but if somebody spends the effort on a patch and on benchmarking... Linus ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] x86: add back the alignment of the destination to 8 bytes in copy_user_generic() 2025-03-17 21:29 ` Linus Torvalds @ 2025-03-17 22:32 ` David Laight 0 siblings, 0 replies; 13+ messages in thread From: David Laight @ 2025-03-17 22:32 UTC (permalink / raw) To: Linus Torvalds Cc: Herton R. Krzesinski, x86, tglx, mingo, bp, dave.hansen, hpa, linux-kernel, olichtne, atomasov, aokuliar On Mon, 17 Mar 2025 14:29:05 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Mon, 17 Mar 2025 at 06:16, David Laight <david.laight.linux@gmail.com> wrote: > > > > You can also something similar for any trailing bytes. > > If you are feeling 'brave' copy the last 8 bytes first. > > I think that would be a mistake. > > Not only does it cause bad patterns on page faults - we should recover > ok from it (the exception will go back and do the copy in the right > order one byte at a time in the "copy_user_tail" code) - but even in > the absence of page faults it quite possibly messes with CPU > prefetching and write buffer coalescing etc if you hop around like > that. I thought you might say that :-) > It *might* be worth trying doing last unaligned part the same way my > patch does the first one - by just doing a full-word write at the end, > offset backwards. That avoids the byte-at-a-time tail case. > > I'm not convinced it's worth it, but if somebody spends the effort on > a patch and on benchmarking... After a 'rep movsl' you'll have %rsi and %rdi pointing to the first byte left to copy and the remaining bytes in (say) %rax. So something like: mov %rsi,-8(%rsi, %rax) mov -8(%rdi, %rax), %rsi will copy the last 8 bytes. Ought to be faster than anything with a branch in it. Whether it is worth leaving yourself with [1..8] bytes to copy rather than [0..7] (and copying the last 8 bytes twice) might be debatable. For Intel FRSM copying 32 bytes and then aligning the destination will be worth while for longer copies. But I've not tried to measure the cutoff for 'rep movsb' against a copy loop - given you'll get at least on mispredicted branch for the copy loop and a 50% one to select between the algorithms. Add in a function call and the ~30 clocks [1] for a short 'rep movsb' starts looking very good. [1] I can't remember actual number, but it isn't very many even on Ivy bridge - and you get 1-16 bytes copies for the same cost. David > > Linus ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2025-03-19 13:07 UTC | newest] Thread overview: 13+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-03-14 17:53 Performance issues in copy_user_generic() in x86_64 Herton R. Krzesinski 2025-03-14 17:53 ` [PATCH] x86: add back the alignment of the destination to 8 bytes in copy_user_generic() Herton R. Krzesinski 2025-03-14 19:06 ` Linus Torvalds 2025-03-14 20:33 ` Herton Krzesinski 2025-03-16 10:58 ` Ingo Molnar 2025-03-16 11:09 ` Ingo Molnar 2025-03-17 13:18 ` Herton Krzesinski 2025-03-18 21:59 ` David Laight 2025-03-18 22:50 ` Herton Krzesinski 2025-03-19 13:07 ` David Laight 2025-03-17 13:16 ` David Laight 2025-03-17 21:29 ` Linus Torvalds 2025-03-17 22:32 ` David Laight
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.