linux-arm-kernel.lists.infradead.org archive mirror
* L_PTE_MT_BUFFERABLE / device ordered memory
@ 2022-12-30 22:23 Prof. Michael Taylor
  2023-01-03 11:09 ` Russell King (Oracle)
  0 siblings, 1 reply; 4+ messages in thread
From: Prof. Michael Taylor @ 2022-12-30 22:23 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

Apologies in advance if I have missed an ages old thread on this.  And
apologies for the length of the description.

I am trying to tune the memory mapped I/O performance of a ZYNQ 7000
with an ARM A7 core running Linux. From what I can observe (in the
phys_mem_access_prot function in mmu.c), the default for a memory
range that has not been given in the device tree is "strongly
ordered", which means that the ZYNQ core will not proceed on to the
next such memory request until the previous one has fully completed.
This has very sub-optimal performance, requiring on average 24 cycles
of overhead per access. I believe this corresponds to the setting
pgprot_noncached (and then to L_PTE_MT_UNCACHED) in the kernel. The
ARM architecture, however, provides another setting in the page
table entry, "device ordering", which maintains the ordering and
quantity of requests going out to the device without stalling the ARM
core. Various Xilinx forum posts confirm that in the bare-metal OS
option, setting the ARM page table TEX, C, and B fields to 000, 0,
and 1 respectively greatly improves performance (maybe 4 cycles per
access).

Q1. My goal is to unlock this functionality in the Linux kernel. Any
best practices?

(Below is what I tried/figured out.)

Looking at the phys_mem_access_prot function, I therefore concluded
that perhaps I should mark the memory region as reserved in the device
tree, which would cause phys_mem_access_prot to select
pgprot_writecombine in the kernel.  After doing this successfully, I
noticed a great improvement in performance, but also that only a small
fraction of transactions in my test case were actually making it out
to the I/O device.  The test case was writing a series of zeros to the
same I/O address, which corresponds to a FIFO, so I really want to see
all of the zeros. Looking at the logic analyzer, I saw that the
processor was optimizing away the repeated zero writes, and that the
AWCACHE field on the AXI bus was set to 3. This was quite surprising
to me, as this value indicates that the PTE is, per the ARM docs
(https://developer.arm.com/documentation/ihi0022/c/Additional-Control-Information/Cache-support),
cacheable and bufferable, rather than just bufferable.

Diving deeper into the kernel, I see that in proc-macros.S, in
armv6_mt_table, the L_PTE_MT_BUFFERABLE entry is set to PTE_EXT_TEX(1)
(i.e. TEX,C,B = 001,0,0), which per
https://developer.arm.com/documentation/ddi0406/c/System-Level-Architecture/Protected-Memory-System-Architecture--PMSA-/Memory-region-attributes/C--B--and-TEX-2-0--encodings
is listed as "Normal memory", but with outer and inner regions given
as non-cacheable.  I would have expected PTE_BUFFERABLE (i.e.
TEX,C,B=000,0,1).

Also, looking at proc-v7-2level.S, I see that BUFFERABLE is defined as
TR=10, IR=00, OR=00, where the TR (memory type) field (per
https://developer.arm.com/documentation/ddi0406/c/System-Level-Architecture/System-Control-Registers-in-a-VMSA-implementation/VMSA-System-control-registers-descriptions--in-register-order/PRRR--Primary-Region-Remap-Register--VMSA?lang=en)
is defined as 00=strongly-ordered, 01=device, 10=normal memory. So I
would have expected 01=device memory.

So my conclusion is that pgprot_writecombine is not what I am looking
for: not only does it buffer and combine writes into bursts, it also
eliminates repeated writes to the same address.

Q2. What is the history behind using strong-ordering instead of
device-ordering for I/O writes? And why is the write-combining setting
mapping to "Normal Memory" rather than device memory? And why does
mmu.c not provide a mechanism for accessing device-ordering (or does
it)?

Thanks!

Michael

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: L_PTE_MT_BUFFERABLE / device ordered memory
  2022-12-30 22:23 L_PTE_MT_BUFFERABLE / device ordered memory Prof. Michael Taylor
@ 2023-01-03 11:09 ` Russell King (Oracle)
  2023-01-03 18:49   ` Prof. Michael Taylor
  0 siblings, 1 reply; 4+ messages in thread
From: Russell King (Oracle) @ 2023-01-03 11:09 UTC (permalink / raw)
  To: Prof. Michael Taylor; +Cc: linux-arm-kernel

On Fri, Dec 30, 2022 at 02:23:40PM -0800, Prof. Michael Taylor wrote:
> Hi,
> 
> Apologies in advance if I have missed an ages old thread on this.  And
> apologies for the length of the description.
> 
> I am trying to tune the memory mapped I/O performance of a ZYNQ 7000
> with an ARM A7 core running Linux.

Reading this email, I think there is a lot of confusion here.

According to
https://www.xilinx.com/products/silicon-devices/soc/zynq-7000.html,
this is a Cortex-A9 core. I'm wondering what "ARM A7" above is referring
to. I first wondered if it was referring to Cortex-A7, but that doesn't
work out. Maybe you mean ARM architecture v7, which would make sense?

> From what I can observe (in the
> phys_mem_access_prot function in mmu.c), the default for a memory
> range that has not been given in the device tree is "strongly
> ordered" which means that the ZYNQ core will not proceed on to the
> next such memory request until the previous one has fully completed.
> This has very sub-optimal performance, requiring on average 24 cycles
> of overhead per access. I believe this corresponds to the setting
> pgprot_noncached (and then to L_PTE_MT_UNCACHED) in the kernel.

That is indeed what happens with phys_mem_access_prot() for areas
that we don't know are memory.

> The
> ARM architecture, however, provides another setting in the page
> table entry, "device ordering", which maintains the ordering and
> quantity of requests going out to the device without stalling the ARM
> core. Various Xilinx forum posts confirm that in the bare-metal OS
> option, setting the ARM page table TEX, C, and B fields to 000, 0,
> and 1 respectively greatly improves performance (maybe 4 cycles per
> access).

Unclear whether they are taking account of TEX remapping.

> 
> Q1. My goal is to unlock this functionality in the Linux kernel. Any
> best practices?
> 
> (Below is what I tried/figured out.)
> 
> Looking at the phys_mem_access_prot function, I therefore concluded
> that perhaps I should mark the memory region as reserved in the device
> tree, which would cause phys_mem_access_prot to select
> pgprot_writecombine in the kernel.  After doing this successfully, I
> noticed a great improvement in performance, but also that only a small
> fraction of transactions in my test case were actually making it out
> to the I/O device.  The test case was writing a series of zeros to the
> same I/O address, which corresponds to a FIFO, so I really want to see
> all of the zeros. Looking at the logic analyzer, I saw that the
> processor was optimizing away the repeated zero writes, and that the
> AWCACHE field on the AXI bus was set to 3.

With TEX=0 and tex remapping enabled, we arrange for the C and B bits
to have the same behaviour as previous architecture versions (because
we need to) and this is the exact behaviour that older CPUs exhibit
when using bufferable memory - writes are combined and repeated writes
to the same location are optimised.

> This was quite surprising
> to me, as this value indicates that the PTE is, per the ARM docs
> (https://developer.arm.com/documentation/ihi0022/c/Additional-Control-Information/Cache-support),
> cacheable and bufferable, rather than just bufferable.
> 
> Diving deeper into the kernel, I see that in proc-macros.S, in
> armv6_mt_table,

This table isn't used for Cortex-A9. It is used for ARM architecture v6
processors that lack the TEX remapping facility, so we have to do our
own table-based remapping.

> the L_PTE_MT_BUFFERABLE entry is set to PTE_EXT_TEX(1)
> (i.e TEX,C,B = 001,0,0) which per
> https://developer.arm.com/documentation/ddi0406/c/System-Level-Architecture/Protected-Memory-System-Architecture--PMSA-/Memory-region-attributes/C--B--and-TEX-2-0--encodings
> is listed as "Normal memory", but with outer and inner regions given as
> non-cacheable.  I would have expected PTE_BUFFERABLE (i.e.
> TEX,C,B=000,0,1).
> 
> Also looking at proc-v7-2level.S, I see that BUFFERABLE is defined as
> TR=10, IR=00, OR=00, where TR memory type (per
> https://developer.arm.com/documentation/ddi0406/c/System-Level-Architecture/System-Control-Registers-in-a-VMSA-implementation/VMSA-System-control-registers-descriptions--in-register-order/PRRR--Primary-Region-Remap-Register--VMSA?lang=en)
> is defined as 00=strongly-ordered, 01=device, 10=normal memory. So I
> would have expected 01=device memory.

This would give problems when there is a normal mapping of memory
alongside a device mapping - that is prohibited by the architecture. So we
have to use TR=10 for this case.

> Q2. What is the history behind using strong-ordering instead of
> device-ordering for I/O writes? And why is the write-combining setting
> mapping to "Normal Memory" rather than device memory? And why does
> mmu.c not provide a mechanism for accessing device-ordering (or does
> it)?

It goes all the way back to ARM architectures v3..v5, where there was
no TEX field, and the choices were:

Uncached (CB=00)
Bufferable (CB=01)
Writethrough (CB=10)
Writeback (CB=11)

As previously mentioned, CB=01 merges writes, and even has some other
funny behaviours when the same region is mapped in two different
virtual address spaces - where writes bypass each other in non-program
order. So the memory location ends up with a different value than is
expected.

Essentially, on these devices, uncached is the only option for devices
to ensure that devices see the same number of writes that the program
issues.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: L_PTE_MT_BUFFERABLE / device ordered memory
  2023-01-03 11:09 ` Russell King (Oracle)
@ 2023-01-03 18:49   ` Prof. Michael Taylor
  2023-01-12 17:16     ` Prof. Michael Taylor
  0 siblings, 1 reply; 4+ messages in thread
From: Prof. Michael Taylor @ 2023-01-03 18:49 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Russell,

Thanks for your reply! This is very helpful.

So it sounds like what I would like to do is to leave most of the
operation of the kernel the same, but have a select region of address
space marked as L_PTE_MT_DEV_SHARED instead of L_PTE_MT_DEV_UNCACHED.
Would that be the correct one to use for device ordered memory?

And assuming yes, would it suffice to modify phys_mem_access_prot
so that it checks (pfn << PAGE_SHIFT) to see if the address is in the
target range of physical memory, and then returns pgprot_device()
rather than pgprot_noncached() for that region?

Is there a better way for me to do this than this brute-force
approach? (I noticed you alluded to this in this post:
https://www.spinics.net/lists/arm-kernel/msg75913.html)


M


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: L_PTE_MT_BUFFERABLE / device ordered memory
  2023-01-03 18:49   ` Prof. Michael Taylor
@ 2023-01-12 17:16     ` Prof. Michael Taylor
  0 siblings, 0 replies; 4+ messages in thread
From: Prof. Michael Taylor @ 2023-01-12 17:16 UTC (permalink / raw)
  To: linux-arm-kernel

Bumping this. Thanks!

M


On Tue, Jan 3, 2023 at 10:49 AM Prof. Michael Taylor
<prof.taylor@gmail.com> wrote:
>
> Hi Russell,
>
> Thanks for your reply! This is very helpful.
>
> So it sounds like what I would like to do is to leave most of the
> operation of the kernel the same, but have a select region of address
> space marked as L_PTE_MT_DEV_SHARED instead of L_PTE_MT_DEV_UNCACHED.
> Would that be the correct one to use for device ordered memory?
>
> And assuming yes, would it suffice to modify phys_mem_access_prot
> so that it checks (pfn << PAGE_SHIFT) to see if the address is in the
> target range of physical memory, and then returns pgprot_device()
> rather than pgprot_noncached() for that region?
>
> Is there a better way for me to do this than this brute-force
> approach? (I noticed you alluded to this in this post:
> https://www.spinics.net/lists/arm-kernel/msg75913.html)
>
>
> M


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2023-01-12 17:18 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-30 22:23 L_PTE_MT_BUFFERABLE / device ordered memory Prof. Michael Taylor
2023-01-03 11:09 ` Russell King (Oracle)
2023-01-03 18:49   ` Prof. Michael Taylor
2023-01-12 17:16     ` Prof. Michael Taylor

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).