From: Robin Murphy <robin.murphy@arm.com>
To: Peter Geis <pgwipeout@gmail.com>
Cc: "open list:ARM/Rockchip SoC..."
<linux-rockchip@lists.infradead.org>,
"Christian König" <ckoenig.leichtzumerken@gmail.com>,
"Shawn Lin" <shawn.lin@rock-chips.com>,
"Kever Yang" <kever.yang@rock-chips.com>,
"amd-gfx list" <amd-gfx@lists.freedesktop.org>,
"Deucher, Alexander" <alexander.deucher@amd.com>,
"Alex Deucher" <alexdeucher@gmail.com>,
"Christian König" <christian.koenig@amd.com>
Subject: Re: radeon ring 0 test failed on arm64
Date: Thu, 17 Mar 2022 13:17:39 +0000
Message-ID: <6f5aaddd-e793-e5f1-17aa-71e7804f035f@arm.com>
In-Reply-To: <CAMdYzYpt1vOCXiDUHCnuVRKnQ51Qissj9-w75FB6nVrFWS-9iw@mail.gmail.com>

On 2022-03-17 12:26, Peter Geis wrote:
> On Thu, Mar 17, 2022 at 6:37 AM Robin Murphy <robin.murphy@arm.com> wrote:
>>
>> On 2022-03-17 00:14, Peter Geis wrote:
>>> Good Evening,
>>>
>>> I apologize for raising this email chain from the dead, but there have
>>> been some developments that have introduced even more questions.
>>> I've looped the Rockchip mailing list into this too, as this affects
>>> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
>>>
>>> TLDR for those not familiar: It seems the rk356x series (and possibly
>>> the rk3588) were built without any outer coherent cache.
>>> This means (unless Rockchip wants to clarify here) devices such as the
>>> ITS and PCIe cannot utilize cache snooping.
>>> This is based on the results of the email chain [2].
>>>
>>> The new circumstances are as follows:
>>> The RPi CM4 Adventure Team as I've taken to calling them has been
>>> attempting to get a dGPU working with the very broken Broadcom
>>> controller in the RPi CM4.
>>> Recently they acquired a SoQuartz rk3566 module which is pin
>>> compatible with the CM4, and have taken to trying it out as well.
>>>
>>> This is how I got involved.
>>> It seems they found a trivial way to force the Radeon R600 driver to
>>> use Non-Cached memory for everything.
>>> This single line change, combined with using memset_io instead of
>>> memset, allows the ring tests to pass and the card probes successfully
>>> (minus the DMA limitations of the rk356x due to the 32 bit
>>> interconnect).
>>> I discovered using this method that we start having unaligned io
>>> memory access faults (bus errors) when running glmark2-drm (running
>>> glmark2 directly was impossible, as both X and Wayland crashed too
>>> early).
>>> I traced this to using what I thought at the time was an unsafe memcpy
>>> in the mesa stack.
>>> Rewriting this function to force aligned writes solved the problem and
>>> allows glmark2-drm to run to completion.
>>> With some extensive debugging, I found about half a dozen memcpy
>>> functions in mesa that if forced to be aligned would allow Wayland to
>>> start, but with hilarious display corruption (see [3], [4]).
>>> The CM4 team is convinced this is an issue with memcpy in glibc, but
>>> I'm not convinced it's that simple.
>>>
>>> On my two hour drive in to work this morning, I got to thinking.
>>> If this was a memcpy fault, this would be universally broken on arm64
>>> which is obviously not the case.
>>> So I started thinking, what is different here than with systems known to work:
>>> 1. No IOMMU for the PCIe controller.
>>> 2. The Outer Cache Issue.
>>>
>>> Robin:
>>> My questions for you, since you're the smartest person I know about
>>> arm64 memory management:
>>> Could cache snooping permit unaligned accesses to IO to be safe?
>>
>> No.
>>
>>> Or
>>> Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
>>
>> No.
>>
>>> Or
>>> Am I insane here?
>>
>> No. (probably)
>>
>> CPU access to PCIe has nothing to do with PCIe's access to memory. From
>> what you've described, my guess is that a GPU BAR gets put in a
>> non-prefetchable window, such that it ends up mapped as Device memory
>> (whereas if it were prefetchable it would be Normal Non-Cacheable).
>
> Okay, this is perfect and I think you just put me on the right track
> for identifying the exact issue. Thanks!
>
> I've sliced up the non-prefetchable window and given it a prefetchable window.
> The 256MB BAR now resides in that window.
> However I'm still getting bus errors, so it seems the prefetch isn't
> actually happening.

Note that "prefetchable" really just means "no side-effects on reads",
i.e. we can map it with a Normal memory type that technically *allows*
the CPU to make speculative accesses because they will not be harmful,
but that's not to say the CPU will do so. Just that if it did, you
wouldn't notice anyway.

It's entirely possible that the PCIe IP itself doesn't like unaligned
accesses, so changing the memory type just moves you from an alignment
fault to an external abort.
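
As a rough illustration of the "prefetchable" attribute discussed above,
here is a minimal sketch that checks whether a BAR is flagged as
prefetchable memory. It assumes the "start end flags" line format of a
PCI device's sysfs "resource" file and the flag values from the kernel's
include/linux/ioport.h. The function name bar_is_prefetchable() is
hypothetical. The flag only says what the BAR advertises; which memory
type the CPU mapping ends up with is still decided by the kernel and
driver.

```c
#include <stdbool.h>
#include <stdio.h>

/* Flag values as defined in the Linux kernel's include/linux/ioport.h. */
#define IORESOURCE_MEM      0x00000200ULL
#define IORESOURCE_PREFETCH 0x00002000ULL

/*
 * Hypothetical helper: parse one "start end flags" line, as found in
 * /sys/bus/pci/devices/<device>/resource, and report whether it describes
 * a prefetchable memory BAR. Malformed lines are reported as false.
 */
static bool bar_is_prefetchable(const char *line)
{
    unsigned long long start, end, flags;

    /* %llx accepts the optional "0x" prefix used by sysfs. */
    if (sscanf(line, "%llx %llx %llx", &start, &end, &flags) != 3)
        return false;

    return (flags & IORESOURCE_MEM) && (flags & IORESOURCE_PREFETCH);
}
```

Feeding it each line of the GPU's "resource" file would show which window
a given BAR is eligible for. It would not show whether the PCIe IP
tolerates unaligned accesses.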
> The difference is now the GPU realizes that an error has happened and
> initiates recovery, vice before where it seemed to be clueless.
> If I understand everything correctly, that's because before the bus
> error was raised by the CPU due to the memory flag, vice now where
> it's actually the bus raising the alarm.
>
> My next question, is this something the driver should set and isn't,
> or is it just because of the broken cache coherency?

The general rule for userspace mmap()ing PCIe-attached memory and
handing it off to glibc or anyone else who might assume it's regular
system RAM is "don't do that". If it's not access size or alignment that
falls over, it could be atomic operations, MTE tags, or any other
new-fangled memory innovation. For the ultimate dream of just plugging
in a card full of RAM, you either need to look back to ISA or forward to
CXL ;)
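
To make the access size and alignment point concrete, here is a minimal
userspace sketch of a copy routine that only ever issues naturally
aligned 32-bit stores to its destination, in the spirit of the "force
aligned writes" workaround described earlier in the thread. It is not the
actual mesa change. The name copy_words_aligned() and the choice of
32-bit words are assumptions, as is relying on the compiler to emit one
store per volatile write (common in practice, not guaranteed by the C
standard). Kernel code would use helpers such as memcpy_toio() instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: copy len bytes from ordinary memory (src) to a
 * destination that only tolerates naturally aligned 32-bit stores.
 * Requires dst to be 4-byte aligned and len to be a multiple of 4;
 * returns -1 without copying anything otherwise, 0 on success.
 */
static int copy_words_aligned(volatile uint32_t *dst, const void *src,
                              size_t len)
{
    const unsigned char *s = src;

    if (((uintptr_t)dst & 3) != 0 || (len & 3) != 0)
        return -1;

    for (size_t i = 0; i < len / 4; i++) {
        uint32_t word;

        /* src is regular RAM, so it may be read at any alignment. */
        memcpy(&word, s + 4 * i, sizeof(word));
        /* One aligned, word-sized store per iteration. */
        dst[i] = word;
    }
    return 0;
}
```

A plain memcpy() gives no such guarantee: it is free to use unaligned,
wider, or overlapping accesses, which is fine for system RAM and can
fault on a Device or PCIe mapping.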
Robin.