From: Mark Rutland <mark.rutland@arm.com>
To: "Havens, Austin" <austin.havens@anritsu.com>
Cc: "catalin.marinas@arm.com" <catalin.marinas@arm.com>,
"will@kernel.org" <will@kernel.org>,
"michal.simek@amd.com" <michal.simek@amd.com>,
"Suresh, Siddarth" <siddarth.suresh@anritsu.com>,
"Lui, Vincent" <vincent.lui@anritsu.com>,
"linux-arm-kernel@lists.infradead.org"
<linux-arm-kernel@lists.infradead.org>,
Robin Murphy <robin.murphy@arm.com>
Subject: Re: Slowdown copying data between kernel versions 4.19 and 5.15
Date: Fri, 30 Jun 2023 12:15:02 +0100 [thread overview]
Message-ID: <ZJ65LNykL5ua_fWk@FVFF77S0Q05N> (raw)
In-Reply-To: <OS3P301MB0421F2FCDF895AFD38356CD5EE25A@OS3P301MB0421.JPNP301.PROD.OUTLOOK.COM>
On Thu, Jun 29, 2023 at 07:33:39PM +0000, Havens, Austin wrote:
> Hi Mark,
> Thanks for the reply.
No problem; thanks for the info here!
> On Thursday, June 29, 2023 7:24 AM Mark Rutland wrote:
> > On Wed, Jun 28, 2023 at 09:38:14PM +0000, Havens, Austin wrote:
> >> Profiling with the hacked __arch_copy_from_user
> >> root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
> >>
> >> Performance counter stats for '/mnt/usrroot/test_copy':
> >>
> >> 11822342 instructions # 0.23 insn per cycle
> >> 50689594 cycles
> >> 37627922 ld_dep_stall
> >> 17933 read_alloc
> >> 3421 dTLB-load-misses
> >>
> >> 0.043440253 seconds time elapsed
> >>
> >> 0.004382000 seconds user
> >> 0.039442000 seconds sys
> >>
> >> Unfortunately the hack crashes in other cases so it is not a viable solution
> >> for us. Also, on our actual workload there is still a small difference in
> >> performance remaining that I have not tracked down yet (I am guessing it has
> >> to do with the dTLB-load-misses remaining higher).
> >>
> >> Note, I think that the slowdown is only noticeable in cases like ours where
> >> the data being copied from is not in cache (for us, because the FPGA writes
> >> it).
> >
> > When you say "is not in cache", what exactly do you mean? If this were just the
> > latency of filling a cache I wouldn't expect the size of the first access to
> > make a difference, so I'm assuming the source buffer is not mapped with
> > cacheable memory attributes, which we generally assume.
> >
> > Which memory attributes are the source and destination buffers mapped with? Is
> > that Normal-WB, Normal-NC, or Device? How exactly has that memory been mapped?
> >
> > I'm assuming this is with some out-of-tree driver; if that's in a public tree
> > could you please provide a pointer to it?
> >
> > Thanks,
> > Mark.
>
> I am actually not 100% clear on how the memory gets mapped. Currently we call
> ioremap in our driver, so I think that should map it as iomem. When I removed
> that or used /dev/mem, nothing changed, and looking at things now I think that
> is because the original mapping is from drivers/of/of_reserved_mem.c

A plain ioremap() will give you Device memory attributes, which
copy_{to,from}_user() aren't supposed to be used with, and which also forbid
the CPU from doing a bunch of things (e.g. gathering and prefetching), which
makes this slow.

If it's possible to use Normal Non-Cacheable instead (e.g. by using
ioremap_wc()), that will likely be faster since it permits gathering,
prefetching, etc. Even that's a bit weird, and I'd generally expect a
kernel driver to manage non-coherent DMA like this (rather than userspace
having the buffer and pointing the kernel at it).
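
For illustration, the difference comes down to which mapping helper the
driver uses. A minimal sketch (the `res` and `vaddr` names here are
illustrative, not taken from the actual databuffer driver):

```c
/* Device-nGnRE mapping: gathering and prefetching are forbidden, so bulk
 * CPU reads from this region are slow. */
void __iomem *vaddr = ioremap(res->start, resource_size(res));

/* Normal Non-Cacheable mapping: gathering and prefetching are permitted,
 * so copying large buffers out of it is typically much faster. Whether
 * this is safe depends on how the FPGA and CPU synchronise over the
 * buffer. */
void __iomem *vaddr_wc = ioremap_wc(res->start, resource_size(res));
```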
Robin might have thoughts on how to handle the non-coherent DMA.
Thanks,
Mark.
> IIRC I mostly followed this wiki when setting things up
> https://xilinx-wiki.atlassian.net/wiki/spaces/A/pages/18841683/Linux+Reserved+Memory
>
> I think the relevant parts are from the dts (note we do it 2x, because we have some
> usages that also need to be accessed by other CPUs on the SoC which have address
> space restrictions)
>
> reserved-memory {
> #address-cells = <2>;
> #size-cells = <2>;
> ranges;
>
> iq_capture: fpga_mem@1 {
> compatible = "shared-dma-pool";
> no-map;
> reg = <0x0 0x70000000 0x0 0x10000000>;
> };
> big_iq_capture: fpga_mem@2 {
> compatible = "shared-dma-pool";
> no-map;
> reg = <0x8 0x0 0x0 0x80000000>;
> };
> };
>
>
> anritsu-databuffer@0 {
> compatible = "anritsu,databuffer";
> memory-region = <&iq_capture>;
> device-name = "databuffer-device";
> };
> anritsu-databuffer@1 {
> compatible = "anritsu,databuffer";
> memory-region = <&big_iq_capture>;
> device-name = "capturebuffer-device";
> };
>
> The databuffer driver is something we made and generally build out of tree,
> but I put it in tree on our github if you want to look at it. I have not actually
> tried to build it in-tree yet, so I could have made some mistakes with the Makefile
> or something. Here is a link to where the ioremap is.
>
> https://github.com/Anritsu/linux-xlnx/blob/intree_databuffer_driver/drivers/char/databuffer_driver.c#L242
>
> Despite doing my best to read the documentation, I was never really sure if I got the
> memory mapping right for our use case.
>
>
> If you are interested in context, the use case is in spectrum analyzers.
> https://www.anritsu.com/en-us/test-measurement/products/ms2090a
> The feature is IQ capture, which, if you are not familiar with spectrum analyzers,
> is basically trying to take the data from a high-speed ADC and store it as fast
> as possible. Since the FPGA writing the data is clocked to the ADC, the rates
> we can stream out without losing any data depend on how fast we can copy the
> data from memory to either the network or a file, which is why this performance
> is important to us. I think we should probably be using scatter/gather for this,
> but I could not convince the FPGA engineers to implement it (and it sounded hard
> so I did not try very hard to convince them).
>
> Thanks for the help,
> Austin
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Thread overview: 9+ messages [~2023-06-30 11:15 UTC]
2023-06-23 21:30 Slowdown copying data between kernel versions 4.19 and 5.15 Havens, Austin
2023-06-28 21:38 ` Havens, Austin
2023-06-29 13:09 ` Robin Murphy
2023-06-29 13:36 ` Catalin Marinas
2023-06-29 14:24 ` Mark Rutland
2023-06-29 18:33 ` Robin Murphy
2023-06-29 19:33 ` Havens, Austin
2023-06-30 11:15 ` Mark Rutland [this message]
2023-07-06 19:12 ` Havens, Austin