From: "Russell King (Oracle)" <linux@armlinux.org.uk>
To: "Prof. Michael Taylor" <prof.taylor@gmail.com>
Cc: linux-arm-kernel@lists.infradead.org
Subject: Re: L_PTE_MT_BUFFERABLE / device ordered memory
Date: Tue, 3 Jan 2023 11:09:36 +0000
Message-ID: <Y7QM8EOGpzicHcI0@shell.armlinux.org.uk>
In-Reply-To: <CAKoatx34B5VSb=2qmy1TAvPzrYvQyWJ3g2q=GJMiFdAn-EFwfA@mail.gmail.com>
On Fri, Dec 30, 2022 at 02:23:40PM -0800, Prof. Michael Taylor wrote:
> Hi,
>
> Apologies in advance if I have missed an ages old thread on this. And
> apologies for the length of the description.
>
> I am trying to tune the memory mapped I/O performance of a ZYNQ 7000
> with an ARM A7 core running Linux.
Reading this email, I think there's a lot of confusion.
According to
https://www.xilinx.com/products/silicon-devices/soc/zynq-7000.html,
this is a Cortex-A9 core. I'm wondering what "ARM A7" above is referring
to. I first wondered if it was referring to Cortex-A7, but that doesn't
work out. Maybe you mean ARM architecture v7, which would make sense?
> From what I can observe (in the
> phys_mem_access_prot function in mmu.c), the default for a memory
> range that has not been given in the device tree is "strongly
> ordered" which means that the ZYNQ core will not proceed on to the
> next such memory request until the previous one has fully completed.
> This has very sub-optimal performance, requiring on average 24 cycle
> per access overhead. I believe this corresponds to the setting
> pgprot_noncached (and then to L_PTE_MT_UNCACHED) in the kernel.
That is indeed what happens with phys_mem_access_prot() for memory
areas that we don't know are memory.
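To make that decision concrete, here is a minimal userspace model of it,
not the kernel source. It assumes the arch/arm phys_mem_access_prot()
behaves roughly like this: a pfn that is not known RAM gets
pgprot_noncached, known RAM opened with O_SYNC gets pgprot_writecombine,
and anything else keeps its existing protection. The enum and function
names are made up for illustration; check arch/arm/mm/mmu.c in your tree.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the memory types the kernel would select. */
enum mem_type { MT_MODEL_UNCACHED, MT_MODEL_BUFFERABLE, MT_MODEL_UNCHANGED };

/*
 * Model of the decision only (assumed behaviour, see lead-in):
 *  - pfn is not known RAM         -> uncached (strongly ordered on ARMv7)
 *  - known RAM, opened with O_SYNC -> write-combining (bufferable)
 *  - otherwise                     -> leave the mapping's protection alone
 */
enum mem_type model_phys_mem_access_prot(bool pfn_is_ram, bool o_sync)
{
	if (!pfn_is_ram)
		return MT_MODEL_UNCACHED;
	if (o_sync)
		return MT_MODEL_BUFFERABLE;
	return MT_MODEL_UNCHANGED;
}
```

In this model an address range that is absent from the device tree always
comes out uncached, regardless of how the file was opened.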
> The
> ARM architecture, however provides for another setting in the page
> table entry of "device ordering", which maintains ordering and
> quantity of requests going out to the device, without pausing the ARM
> core. In various Xilinx forum posts, it has been confirmed that in the
> baremetal OS option, that setting the value of the ARM page table TEX
> and C B fields to 000, 0, 1 respectively, that the performance is
> greatly improved (maybe 4 cycles per access).
It's unclear whether they are taking TEX remapping into account.
>
> Q1. My goal is to unlock this functionality in the Linux kernel. Any
> best practices?
>
> (Below is what I tried/figured out.)
>
> Looking at the phys_mem_access_prot function, I therefore concluded
> that perhaps I should map in the memory location using the device
> tree, as reserved, and this would cause phys_mem_access_prot to select
> pgprot_writecombine in the kernel. After doing this successfully, I
> noticed a great improvement in performance, but also that only a small
> fraction of transactions in my test case were actually making it out
> to the I/O device. The test case was writing a series of zeros to the
> same I/O address, which corresponds to a FIFO, so I really want to see
> all of the zeros. Looking at the logic analyzer, I saw that the
> processor was optimizing away the repeated zero writes, and that the
> AWCACHE field on the AXI bus was set to 3.
With TEX=0 and TEX remapping enabled, we arrange for the C and B bits
to have the same behaviour as previous architecture versions (because
we need to) and this is the exact behaviour that older CPUs exhibit
when using bufferable memory - writes are combined and repeated writes
to the same location are optimised.
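As a toy illustration of why repeated writes to a FIFO register vanish
under a bufferable mapping, the sketch below models a single-entry merging
write buffer. It is an assumption-laden model, not a description of the
Cortex-A9 store buffer: the structure, the function names and the address
0x40000000 are all hypothetical.

```c
#include <stdint.h>

/* Hypothetical single-entry write buffer that merges same-address stores. */
struct write_buffer {
	int valid;		/* an entry is pending */
	uint32_t addr;		/* address of the pending store */
	uint32_t data;		/* latest data for that address */
	unsigned bus_writes;	/* stores that actually reached the bus */
};

/* Drain the pending entry (if any) to the "bus". */
static void wb_flush(struct write_buffer *wb)
{
	if (wb->valid) {
		wb->bus_writes++;
		wb->valid = 0;
	}
}

/* A store to the pending address overwrites it instead of queueing. */
static void wb_store(struct write_buffer *wb, uint32_t addr, uint32_t data)
{
	if (wb->valid && wb->addr != addr)
		wb_flush(wb);
	wb->valid = 1;
	wb->addr = addr;
	wb->data = data;
}

/* Issue n stores of zero to one FIFO address; return bus-visible writes. */
unsigned fifo_zero_writes_seen(unsigned n)
{
	struct write_buffer wb = { 0 };
	unsigned i;

	for (i = 0; i < n; i++)
		wb_store(&wb, 0x40000000u, 0);	/* hypothetical FIFO address */
	wb_flush(&wb);
	return wb.bus_writes;
}
```

With this model, any number of back-to-back stores to the same address
collapses into a single bus write, which is the opposite of what a FIFO
needs.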
> This was quite surprising
> to me, as these fields suggest that the PTE is, per the ARM docs
> (https://developer.arm.com/documentation/ihi0022/c/Additional-Control-Information/Cache-support),
> cacheable and bufferable, rather than just bufferable.
>
> Diving deeper into the kernel, I see that in proc-macros.S, in
> armv6_mt_table,
This table isn't used for Cortex-A9. It is used for ARM architecture v6
processors that lack the TEX remapping facility, so we have to do our
own table-based remapping.
> the L_PTE_MT_BUFFERABLE entry is set to PTE_EXT_TEX(1)
> (i.e. TEX,C,B = 001,0,0) which per
> https://developer.arm.com/documentation/ddi0406/c/System-Level-Architecture/Protected-Memory-System-Architecture--PMSA-/Memory-region-attributes/C--B--and-TEX-2-0--encodings
> is listed as "Normal memory", but with outer and inner regions given as
> non-cacheable. I would have expected PTE_BUFFERABLE (i.e.
> TEX,C,B=000,0,1).
>
> Also looking at proc-v7-2level.S, I see that BUFFERABLE is defined as
> TR=10, IR=00, OR=00, where TR memory type (per
> https://developer.arm.com/documentation/ddi0406/c/System-Level-Architecture/System-Control-Registers-in-a-VMSA-implementation/VMSA-System-control-registers-descriptions--in-register-order/PRRR--Primary-Region-Remap-Register--VMSA?lang=en)
> is defined as 00=strongly-ordered, 01=device, 10=normal memory. So I
> would have expected 01=device memory.
This would give problems when there is a normal mapping of memory
alongside a device mapping - that is prohibited by the architecture. So we
have to use TR=10 for this case.
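For reference, the TR lookup under TEX remapping can be sketched as below.
The index n = {TEX[0], C, B} and the field position PRRR[2n+1:2n] follow
the ARMv7-A architecture description of PRRR. The constant ASSUMED_PRRR is
the value I believe proc-v7-2level.S programs, but treat it as an
assumption and verify it against your tree; the helper names are
hypothetical.

```c
#include <stdint.h>

/* TR encodings as quoted above: 00, 01, 10. */
enum tr_type { TR_STRONGLY_ORDERED = 0, TR_DEVICE = 1, TR_NORMAL = 2 };

/* Assumed PRRR value; not taken from this thread, check your kernel. */
#define ASSUMED_PRRR 0xff0a81a8u

/* Remap index n = {TEX[0], C, B} when TEX remapping is enabled. */
static inline unsigned remap_index(unsigned tex0, unsigned c, unsigned b)
{
	return ((tex0 & 1u) << 2) | ((c & 1u) << 1) | (b & 1u);
}

/* TRn occupies PRRR[2n+1:2n] for n = 0..7. */
static inline unsigned prrr_tr(uint32_t prrr, unsigned n)
{
	return (prrr >> (2 * n)) & 3u;
}
```

With the assumed value, TEX[0],C,B = 0,0,1 (BUFFERABLE) selects n=1 and
decodes to TR=10, i.e. normal memory, matching the table in
proc-v7-2level.S that is quoted above.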
> Q2. What is the history behind using strong-ordering instead of
> device-ordering for I/O writes? And why is the write-combining setting
> mapping to "Normal Memory" rather than device memory? And why does
> mmu.c not provide a mechanism for accessing device-ordering (or does
> it)?
It goes all the way back to ARM architectures v3..v5, where there was
no TEX field, and the choices were:
    Uncached     (CB=00)
    Bufferable   (CB=01)
    Writethrough (CB=10)
    Writeback    (CB=11)
As previously mentioned, CB=01 merges writes, and even has some other
funny behaviours when the same region is mapped in two different
virtual address spaces - where writes bypass each other in non-program
order. So the memory location ends up with a different value than is
expected.
Essentially, on these devices, uncached is the only option for devices
to ensure that devices see the same number of writes that the program
issues.
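The four C/B choices listed above can be decoded from a page table entry
as in the sketch below. It assumes the classic ARM second-level descriptor
layout, with B in bit 2 and C in bit 3; the macro, enum and function names
are made up for this example rather than copied from the kernel headers.

```c
#include <stdint.h>

/* Assumed bit positions in a classic ARM second-level descriptor. */
#define MODEL_PTE_BUFFERABLE	(1u << 2)	/* B bit */
#define MODEL_PTE_CACHEABLE	(1u << 3)	/* C bit */

enum cb_type { CB_UNCACHED, CB_BUFFERABLE, CB_WRITETHROUGH, CB_WRITEBACK };

/* Map the C and B bits of a descriptor onto the four pre-TEX memory types. */
enum cb_type cb_decode(uint32_t pte)
{
	switch (pte & (MODEL_PTE_CACHEABLE | MODEL_PTE_BUFFERABLE)) {
	case 0:
		return CB_UNCACHED;		/* CB=00 */
	case MODEL_PTE_BUFFERABLE:
		return CB_BUFFERABLE;		/* CB=01 */
	case MODEL_PTE_CACHEABLE:
		return CB_WRITETHROUGH;		/* CB=10 */
	default:
		return CB_WRITEBACK;		/* CB=11 */
	}
}
```

Only the CB=00 case guarantees that the device sees every write the
program issues on those older CPUs.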
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!