From: "Havens, Austin" <austin.havens@anritsu.com>
To: Mark Rutland <mark.rutland@arm.com>
Cc: "catalin.marinas@arm.com" <catalin.marinas@arm.com>,
"will@kernel.org" <will@kernel.org>,
"michal.simek@amd.com" <michal.simek@amd.com>,
"Suresh, Siddarth" <siddarth.suresh@anritsu.com>,
"Lui, Vincent" <vincent.lui@anritsu.com>,
"linux-arm-kernel@lists.infradead.org"
<linux-arm-kernel@lists.infradead.org>
Subject: Re: Slowdown copying data between kernel versions 4.19 and 5.15
Date: Thu, 29 Jun 2023 19:33:39 +0000 [thread overview]
Message-ID: <OS3P301MB0421F2FCDF895AFD38356CD5EE25A@OS3P301MB0421.JPNP301.PROD.OUTLOOK.COM> (raw)
In-Reply-To: <ZJ2UIfbCWeLzRMfs@FVFF77S0Q05N>
Hi Mark,
Thanks for the reply.
On Thursday, June 29, 2023 7:74 AM Mark Rutland wrote:
> On Wed, Jun 28, 2023 at 09:38:14PM +0000, Havens, Austin wrote:
>> >After some investigation I am guessing the issue is either in the iovector
>> >iteration changes (around
>> >https://elixir.bootlin.com/linux/v5.15/source/lib/iov_iter.c#L922 ) or the
>> >lower level changes in arch/arm64/lib/copy_from_user.S, but I am pretty out
>> >of my depth so it is just speculation.
>>
>> After comparing the dissassembly of __arch_copy_from_user on both kernels and
>> going through commit logs, I figured out the slowdown was mostly due to to
>> the changes from commit c703d80130b1c9d6783f4cbb9516fd5fe4a750d, specifially
>> the changes to uao_ldp.
>
> For the benefit of others, that's commit:
>
> fc703d80130b1c9d ("arm64: uaccess: split user/kernel routine")
Sorry for the copy paste error.
>>
>> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
>> index 2c26ca5b7bb0..2b5454fa0f24 100644
>> --- a/arch/arm64/include/asm/asm-uaccess.h
>> +++ b/arch/arm64/include/asm/asm-uaccess.h
>> @@ -59,62 +59,32 @@ alternative_else_nop_endif
>> #endif
>>
>> /*
>> - * Generate the assembly for UAO alternatives with exception table entries.
>> + * Generate the assembly for LDTR/STTR with exception table entries.
>> * This is complicated as there is no post-increment or pair versions of the
>> * unprivileged instructions, and USER() only works for single instructions.
>> */
>> -#ifdef CONFIG_ARM64_UAO
>> .macro uao_ldp l, reg1, reg2, addr, post_inc
>> - alternative_if_not ARM64_HAS_UAO
>> -8888: ldp \reg1, \reg2, [\addr], \post_inc;
>> -8889: nop;
>> - nop;
>> - alternative_else
>> - ldtr \reg1, [\addr];
>> - ldtr \reg2, [\addr, #8];
>> - add \addr, \addr, \post_inc;
>> - alternative_endif
>> +8888: ldtr \reg1, [\addr];
>> +8889: ldtr \reg2, [\addr, #8];
>> + add \addr, \addr, \post_inc;
>>
>> _asm_extable 8888b,\l;
>> _asm_extable 8889b,\l;
>> .endm
>>
>> I could not directly revert the changes to test since more names changed in
>> other commits than I cared to figure out, but I hacked out that change, and
>> saw that the performance of the test program was basically back to normal.
>>
>> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
>> index ccedf548dac9..2ddf7eba46fd 100644
>> --- a/arch/arm64/include/asm/asm-uaccess.h
>> +++ b/arch/arm64/include/asm/asm-uaccess.h
>> @@ -64,9 +64,9 @@ alternative_else_nop_endif
>> * unprivileged instructions, and USER() only works for single instructions.
>> */
>> .macro user_ldp l, reg1, reg2, addr, post_inc
>> -8888: ldtr \reg1, [\addr];
>> -8889: ldtr \reg2, [\addr, #8];
>> - add \addr, \addr, \post_inc;
>> +8888: ldp \reg1, \reg2, [\addr], \post_inc;
>> +8889: nop;
>> + nop;
>
> As Catalin noted, we can't make that change generally as it'd be broken for any
> system with PAN, and in general we *really* want to use LDTR/STTR for user
> accesses to catch any misuse with kernel pointers.
I was afraid that would be the case. It puts us in a bit of a tough spot, but
at least we know we should be looking for workarounds instead of a fix.
>> Profiling with the hacked __arch_copy_from_user
>> root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
>>
>> Performance counter stats for '/mnt/usrroot/test_copy':
>>
>> 11822342 instructions # 0.23 insn per cycle
>> 50689594 cycles
>> 37627922 ld_dep_stall
>> 17933 read_alloc
>> 3421 dTLB-load-misses
>>
>> 0.043440253 seconds time elapsed
>>
>> 0.004382000 seconds user
>> 0.039442000 seconds sys
>>
>> Unfortunately the hack crashes in other cases so it is not a viable solution
>> for us. Also, on our actual workload there is still a small difference in
>> performance remaining that I have not tracked down yet (I am guessing it has
>> to do with the dTLB-load-misses remaining higher).
>>
>> Note, I think that the slow down is only noticeable in cases like ours where
>> the data being copied from is not in cache (for us, because the FPGA writes
>> it).
>
> When you say "is not in cache", what exactly do you mean? If this were just the
> latency of filling a cache I wouldn't expect the size of the first access to
> make a difference, so I'm assuming the source buffer is not mapped with
> cacheable memory attributes, which we generally assume.
>
> Which memory attribues are the source and destination buffers mapped with? Is
> that Normal-WB, Normal-NC, or Device? How exactly has that memory been mapped?
>
> I'm assuming this is with some out-of-tree driver; if that's in a public tree
> could you please provide a pointer to it?
>
> Thanks,
> Mark.
I am actually not 100% clear on how the memory gets mapped. Currently we call
ioremap in our driver, so I think that should map it as iomem. When I removed
that or used /dev/mem, nothing changed, and looking at things now I think that
is because the origional mapping is from drivers/of/of_reserved_mem.c
IIRC I mostly followed this wiki when setting things up
https://xilinx-wiki.atlassian.net/wiki/spaces/A/pages/18841683/Linux+Reserved+Memory
I think the relevant parts are from the dts (note we do it 2x, because we have some
usages that also need to be accessed by other CPUs on the SoC which have adress
space restrictions)
reserved-memory {
#address-cells = <2>;
#size-cells = <2>;
ranges;
iq_capture: fpga_mem@1 {
compatible = "shared-dma-pool";
no-map;
reg = <0x0 0x70000000 0x0 0x10000000>;
};
big_iq_capture: fpga_mem@2 {
compatible = "shared-dma-pool";
no-map;
reg = <0x8 0x0 0x0 0x80000000>;
};
};
anritsu-databuffer@0 {
compatible = "anritsu,databuffer";
memory-region = <&iq_capture>;
device-name = "databuffer-device";
};
anritsu-databuffer@1 {
compatible = "anritsu,databuffer";
memory-region = <&big_iq_capture>;
device-name = "capturebuffer-device";
};
The databuffer driver is something we made and generally build out of tree,
but I put it in tree on our github if you want to look at it. I have not actually
tried to build it in-tree yet, so I could have made some mistakes with the Makefile
or something. Here is a link to where the ioremap is.
https://github.com/Anritsu/linux-xlnx/blob/intree_databuffer_driver/drivers/char/databuffer_driver.c#L242
Despite doing my best to read the documentation, I was never really sure if I got the
memory mapping right for our use case.
If you are interested in context, the use case is in spectrum analyzers.
https://www.anritsu.com/en-us/test-measurement/products/ms2090a
The feature is IQ capture, which if you are not familiar with Spectrum Analyzers,
is basically trying to take the data from an a high speed ADC and store it as fast
as possible. Since the FPGA is writing the data is clocked to the ADC, the rates
we can stream out without losing any data depend on how fast we can copy the
data from memory to either the network or a file, which is why this performance
is important to us. I think we should probably be using scatter/gather for this,
but I could not convince the FPGA engineers to implement it (and it sounded hard
so I did not try very hard to convince them).
Thanks for the help,
Austin
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
next prev parent reply other threads:[~2023-06-29 19:34 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-06-23 21:30 Slowdown copying data between kernel versions 4.19 and 5.15 Havens, Austin
2023-06-28 21:38 ` Havens, Austin
2023-06-29 13:09 ` Robin Murphy
2023-06-29 13:36 ` Catalin Marinas
2023-06-29 14:24 ` Mark Rutland
2023-06-29 18:33 ` Robin Murphy
2023-06-29 19:33 ` Havens, Austin [this message]
2023-06-30 11:15 ` Mark Rutland
2023-07-06 19:12 ` Havens, Austin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=OS3P301MB0421F2FCDF895AFD38356CD5EE25A@OS3P301MB0421.JPNP301.PROD.OUTLOOK.COM \
--to=austin.havens@anritsu.com \
--cc=catalin.marinas@arm.com \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=mark.rutland@arm.com \
--cc=michal.simek@amd.com \
--cc=siddarth.suresh@anritsu.com \
--cc=vincent.lui@anritsu.com \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox