* [PATCH] arm64/dma-mapping: Fix arch_sync_dma_for_device to respect dir parameter
@ 2025-08-20 10:28 John Cox via B4 Relay
From: John Cox via B4 Relay @ 2025-08-20 10:28 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon; +Cc: linux-arm-kernel, John Cox
From: John Cox <john.cox@raspberrypi.com>
All other architectures do different cache operations depending on the
dir parameter. Fix arm64 to do the same.
This fixes udmabuf operations when syncing for read e.g. when the CPU
reads back a V4L2 decoded frame buffer.
Signed-off-by: John Cox <john.cox@raspberrypi.com>
---
This patch makes the arch_sync_dma_for_device function on arm64
do different things depending on the value of the dir parameter. In
particular it does a cache invalidate operation if the dir flag is
set to DMA_FROM_DEVICE. The current code does a writeback without
invalidate under all circumstances. Nearly all other architectures do
an invalidate if the direction is FROM_DEVICE which seems like the
correct thing to do to me.
This patch fixes a problem I was having with udmabuf allocated
dmabufs. It also fixes a very similar problem I had with dma_heap
allocated dmabufs, but that occurred much less frequently and I
haven't traced exactly what was going on there.
My problem (on a Raspberry Pi5):
[Userland]
Alloc memory with memfd_create + ftruncate
Derive dmabuf from memfd with udmabuf
Close memfd
Queue dmabuf into V4L2 with QBUF
<decode a video frame>
Extract dmabuf from V4L2 with DQBUF
Map dmabuf for read with mmap
Sync for read with DMA_BUF_IOCTL_SYNC with (DMA_BUF_SYNC_START |
DMA_BUF_SYNC_READ)
Read buffer
Sync end
Unmap
I get old (zero) data out of the "Read buffer" stage in some cache
lines sometimes.
It doesn't matter which way round the mmap & sync are.
I am aware that there is a patchset going through for udmabuf that may
well fix the udmabuf case above, but given that this patch fixes
something similar in dma_heap/system too I think it is still worth
having.
---
arch/arm64/mm/dma-mapping.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index b2b5792b2caaf81ccfc3204c94395bb0faeabddd..51c43c1f563015139e365ed86f0f5f0d9483fa7f 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -16,8 +16,22 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
{
unsigned long start = (unsigned long)phys_to_virt(paddr);
+ unsigned long end = start + size;
- dcache_clean_poc(start, start + size);
+ switch (dir) {
+ case DMA_BIDIRECTIONAL:
+ dcache_clean_inval_poc(start, end);
+ break;
+ case DMA_TO_DEVICE:
+ dcache_clean_poc(start, end);
+ break;
+ case DMA_FROM_DEVICE:
+ dcache_inval_poc(start, end);
+ break;
+ case DMA_NONE:
+ default:
+ break;
+ }
}
void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
---
base-commit: 14131859b2615fd3076ffdad370fc5500a037432
change-id: 20250819-arm64-dma-direction-fix-7a17db452040
Best regards,
--
John Cox <john.cox@raspberrypi.com>
* Re: [PATCH] arm64/dma-mapping: Fix arch_sync_dma_for_device to respect dir parameter
From: Catalin Marinas @ 2025-08-20 13:25 UTC (permalink / raw)
To: john.cox; +Cc: Will Deacon, linux-arm-kernel, Robin Murphy
On Wed, Aug 20, 2025 at 11:28:06AM +0100, John Cox via B4 Relay wrote:
> All other architectures do different cache operations depending on the
> dir parameter. Fix arm64 to do the same.
I suspect that's a bug in the users of the DMA API. We shouldn't modify
the arm64 implementation to cope with them.
> This fixes udmabuf operations when syncing for read e.g. when the CPU
> reads back a V4L2 decoded frame buffer.
>
> Signed-off-by: John Cox <john.cox@raspberrypi.com>
> ---
> This patch makes the arch_sync_dma_for_device function on arm64
> do different things depending on the value of the dir parameter. In
> particular it does a cache invalidate operation if the dir flag is
> set to DMA_FROM_DEVICE. The current code does a writeback without
> invalidate under all circumstances. Nearly all other architectures do
> an invalidate if the direction is FROM_DEVICE which seems like the
> correct thing to do to me.
So does arm64 but in the arch_sync_dma_for_cpu(). That's the correct
place to do it, otherwise after arch_sync_dma_for_device() you may have
speculative loads by the CPU populating the caches with stale data
before the device finished writing.
> This patch fixes a problem I was having with udmabuf allocated
> dmabufs. It also fixes a very similar problem I had with dma_heap
> allocated dmabufs, but that occurred much less frequently and I
> haven't traced exactly what was going on there.
>
> My problem (on a Raspberry Pi5):
>
> [Userland]
> Alloc memory with memfd_create + ftruncate
> Derive dmabuf from memfd with udmabuf
> Close memfd
> Queue dmabuf into V4L2 with QBUF
> <decode a video frame>
> Extract dmabuf from V4L2 with DQBUF
> Map dmabuf for read with mmap
> Sync for read with DMA_BUF_IOCTL_SYNC with (DMA_BUF_SYNC_START |
> DMA_BUF_SYNC_READ)
> Read buffer
> Sync end
> Unmap
Between the device writing to the buffer and the "read buffer" step
above, is there a call to arch_sync_dma_for_cpu()? A quick look at
begin_cpu_udmabuf() shows a dma_sync_sgtable_for_cpu(), though there is
a branch where this is skipped. get_sg_table() seems to do a DMA map
which I think ends up in arch_sync_dma_for_device() but the sync
for-CPU is skipped.
An attempt at a udmabuf fix (untested):
diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
index 40399c26e6be..9ab4a6c01143 100644
--- a/drivers/dma-buf/udmabuf.c
+++ b/drivers/dma-buf/udmabuf.c
@@ -256,10 +256,11 @@ static int begin_cpu_udmabuf(struct dma_buf *buf,
ret = PTR_ERR(ubuf->sg);
ubuf->sg = NULL;
}
- } else {
- dma_sync_sgtable_for_cpu(dev, ubuf->sg, direction);
}
+ if (ubuf->sg)
+ dma_sync_sgtable_for_cpu(dev, ubuf->sg, direction);
+
return ret;
}
> I get old (zero) data out of the "Read buffer" stage in some cache
> lines sometimes.
> It doesn't matter which way round the mmap & sync are.
>
> I am aware that there is a patchset going through for udmabuf that may
> well fix the udmabuf case above, but given that this patch fixes
> something similar in dma_heap/system too I think it is still worth
> having.
> ---
> arch/arm64/mm/dma-mapping.c | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> index b2b5792b2caaf81ccfc3204c94395bb0faeabddd..51c43c1f563015139e365ed86f0f5f0d9483fa7f 100644
> --- a/arch/arm64/mm/dma-mapping.c
> +++ b/arch/arm64/mm/dma-mapping.c
> @@ -16,8 +16,22 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> enum dma_data_direction dir)
> {
> unsigned long start = (unsigned long)phys_to_virt(paddr);
> + unsigned long end = start + size;
>
> - dcache_clean_poc(start, start + size);
> + switch (dir) {
> + case DMA_BIDIRECTIONAL:
> + dcache_clean_inval_poc(start, end);
> + break;
> + case DMA_TO_DEVICE:
> + dcache_clean_poc(start, end);
> + break;
> + case DMA_FROM_DEVICE:
> + dcache_inval_poc(start, end);
> + break;
> + case DMA_NONE:
> + default:
> + break;
> + }
> }
As explained above, that's not the right fix. We need to identify what's
missing on the ioctl() paths.
--
Catalin
* Re: [PATCH] arm64/dma-mapping: Fix arch_sync_dma_for_device to respect dir parameter
From: Robin Murphy @ 2025-08-20 14:08 UTC (permalink / raw)
To: Catalin Marinas, john.cox; +Cc: Will Deacon, linux-arm-kernel
On 20/08/2025 2:25 pm, Catalin Marinas wrote:
> On Wed, Aug 20, 2025 at 11:28:06AM +0100, John Cox via B4 Relay wrote:
>> All other architectures do different cache operations depending on the
>> dir parameter. Fix arm64 to do the same.
>
> I suspect that's a bug in the users of the DMA API. We shouldn't modify
> the arm64 implementation to cope with them.
>
>> This fixes udmabuf operations when syncing for read e.g. when the CPU
>> reads back a V4L2 decoded frame buffer.
>>
>> Signed-off-by: John Cox <john.cox@raspberrypi.com>
>> ---
>> This patch makes the arch_sync_dma_for_device function on arm64
>> do different things depending on the value of the dir parameter. In
>> particular it does a cache invalidate operation if the dir flag is
>> set to DMA_FROM_DEVICE. The current code does a writeback without
>> invalidate under all circumstances. Nearly all other architectures do
>> an invalidate if the direction is FROM_DEVICE which seems like the
>> correct thing to do to me.
>
> So does arm64 but in the arch_sync_dma_for_cpu(). That's the correct
> place to do it, otherwise after arch_sync_dma_for_device() you may have
> speculative loads by the CPU populating the caches with stale data
> before the device finished writing.
Exactly, not only is it unnecessary, it's not even guaranteed to have
any lasting effect. arch_sync_dma_for_device() has two jobs to do: 1)
ensure that any new data going in the DMA_TO_DEVICE direction is visible
to the device; a clean is sufficient for that. 2) ensure that no dirty
cachelines may be written back over new DMA_FROM_DEVICE data; a clean is
sufficient for that also. Adding an invalidate at this point serves no
purpose since the CPU is still free to immediately speculatively fetch
the same cleaned data back into the cache.
>> This patch fixes a problem I was having with udmabuf allocated
>> dmabufs. It also fixes a very similar problem I had with dma_heap
>> allocated dmabufs, but that occurred much less frequently and I
>> haven't traced exactly what was going on there.
>>
>> My problem (on a Raspberry Pi5):
>>
>> [Userland]
>> Alloc memory with memfd_create + ftruncate
>> Derive dmabuf from memfd with udmabuf
>> Close memfd
>> Queue dmabuf into V4L2 with QBUF
>> <decode a video frame>
>> Extract dmabuf from V4L2 with DQBUF
>> Map dmabuf for read with mmap
>> Sync for read with DMA_BUF_IOCTL_SYNC with (DMA_BUF_SYNC_START |
>> DMA_BUF_SYNC_READ)
>> Read buffer
>> Sync end
>> Unmap
>
> Between the device writing to the buffer and the "read buffer" step
> above, is there a call to arch_sync_dma_for_cpu()? A quick look at
> begin_cpu_udmabuf() shows a dma_sync_sgtable_for_cpu(), though there is
> a branch where this is skipped. get_sg_table() seems to do a DMA map
> which I think ends up in arch_sync_dma_for_device() but the sync
> for-CPU is skipped.
Indeed that path is clearly wrong.
Thanks,
Robin.
> An attempt at a udmabuf fix (untested):
>
> diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> index 40399c26e6be..9ab4a6c01143 100644
> --- a/drivers/dma-buf/udmabuf.c
> +++ b/drivers/dma-buf/udmabuf.c
> @@ -256,10 +256,11 @@ static int begin_cpu_udmabuf(struct dma_buf *buf,
> ret = PTR_ERR(ubuf->sg);
> ubuf->sg = NULL;
> }
> - } else {
> - dma_sync_sgtable_for_cpu(dev, ubuf->sg, direction);
> }
>
> + if (ubuf->sg)
> + dma_sync_sgtable_for_cpu(dev, ubuf->sg, direction);
> +
> return ret;
> }
>
>> I get old (zero) data out of the "Read buffer" stage in some cache
>> lines sometimes.
>> It doesn't matter which way round the mmap & sync are.
>>
>> I am aware that there is a patchset going through for udmabuf that may
>> well fix the udmabuf case above, but given that this patch fixes
>> something similar in dma_heap/system too I think it is still worth
>> having.
>> ---
>> arch/arm64/mm/dma-mapping.c | 16 +++++++++++++++-
>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
>> index b2b5792b2caaf81ccfc3204c94395bb0faeabddd..51c43c1f563015139e365ed86f0f5f0d9483fa7f 100644
>> --- a/arch/arm64/mm/dma-mapping.c
>> +++ b/arch/arm64/mm/dma-mapping.c
>> @@ -16,8 +16,22 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
>> enum dma_data_direction dir)
>> {
>> unsigned long start = (unsigned long)phys_to_virt(paddr);
>> + unsigned long end = start + size;
>>
>> - dcache_clean_poc(start, start + size);
>> + switch (dir) {
>> + case DMA_BIDIRECTIONAL:
>> + dcache_clean_inval_poc(start, end);
>> + break;
>> + case DMA_TO_DEVICE:
>> + dcache_clean_poc(start, end);
>> + break;
>> + case DMA_FROM_DEVICE:
>> + dcache_inval_poc(start, end);
>> + break;
>> + case DMA_NONE:
>> + default:
>> + break;
>> + }
>> }
>
> As explained above, that's not the right fix. We need to identify what's
> missing on the ioctl() paths.
>
* Re: [PATCH] arm64/dma-mapping: Fix arch_sync_dma_for_device to respect dir parameter
From: John Cox @ 2025-08-20 14:43 UTC (permalink / raw)
To: Robin Murphy; +Cc: Catalin Marinas, Will Deacon, linux-arm-kernel
On Wed, 20 Aug 2025 at 15:08, Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 20/08/2025 2:25 pm, Catalin Marinas wrote:
> > On Wed, Aug 20, 2025 at 11:28:06AM +0100, John Cox via B4 Relay wrote:
> >> All other architectures do different cache operations depending on the
> >> dir parameter. Fix arm64 to do the same.
> >
> > I suspect that's a bug in the users of the DMA API. We shouldn't modify
> > the arm64 implementation to cope with them.
> >
> >> This fixes udmabuf operations when syncing for read e.g. when the CPU
> >> reads back a V4L2 decoded frame buffer.
> >>
> >> Signed-off-by: John Cox <john.cox@raspberrypi.com>
> >> ---
> >> This patch makes the arch_sync_dma_for_device function on arm64
> >> do different things depending on the value of the dir parameter. In
> >> particular it does a cache invalidate operation if the dir flag is
> >> set to DMA_FROM_DEVICE. The current code does a writeback without
> >> invalidate under all circumstances. Nearly all other architectures do
> >> an invalidate if the direction is FROM_DEVICE which seems like the
> >> correct thing to do to me.
> >
> > So does arm64 but in the arch_sync_dma_for_cpu(). That's the correct
> > place to do it, otherwise after arch_sync_dma_for_device() you may have
> > speculative loads by the CPU populating the caches with stale data
> > before the device finished writing.
>
> Exactly, not only is it unnecessary, it's not even guaranteed to have
> any lasting effect. arch_sync_dma_for_device() has two jobs to do: 1)
> ensure that any new data going in the DMA_TO_DEVICE direction is visible
> to the device; a clean is sufficient for that. 2) ensure that no dirty
> cachelines may be written back over new DMA_FROM_DEVICE data; a clean is
> sufficient for that also. Adding an invalidate at this point serves no
> purpose since the CPU is still free to immediately speculatively fetch
> the same cleaned data back into the cache.
An invalidate at this point for DMA_FROM_DEVICE does satisfy (2) at least as
well as a clean, and has the side benefit that any used cache lines are
now free for use by the CPU. (1) is unaffected by this patch.
I believe that my patch is no less functional than the current code.
> >> This patch fixes a problem I was having with udmabuf allocated
> >> dmabufs. It also fixes a very similar problem I had with dma_heap
> >> allocated dmabufs, but that occurred much less frequently and I
> >> haven't traced exactly what was going on there.
> >>
> >> My problem (on a Raspberry Pi5):
> >>
> >> [Userland]
> >> Alloc memory with memfd_create + ftruncate
> >> Derive dmabuf from memfd with udmabuf
> >> Close memfd
> >> Queue dmabuf into V4L2 with QBUF
> >> <decode a video frame>
> >> Extract dmabuf from V4L2 with DQBUF
> >> Map dmabuf for read with mmap
> >> Sync for read with DMA_BUF_IOCTL_SYNC with (DMA_BUF_SYNC_START |
> >> DMA_BUF_SYNC_READ)
> >> Read buffer
> >> Sync end
> >> Unmap
> >
> > Between the device writing to the buffer and the "read buffer" step
> > above, is there a call to arch_sync_dma_for_cpu()? A quick look at
> > begin_cpu_udmabuf() shows a dma_sync_sgtable_for_cpu(), though there is
> > a branch where this is skipped. get_sg_table() seems to do a DMA map
> > which I think ends up in arch_sync_dma_for_device() but the sync
> > for-CPU is skipped.
>
> Indeed that path is clearly wrong.
As I noted, there is a patch for udmabuf going through at the moment that
reworks that section of the code and may well fix the issue above. However,
I had a similar issue, though much less frequently, with dma_heap/system
allocated buffers, which this patch also fixes. I grant that finding and
fixing every bit of code that gets this wrong would be ideal, but this does
improve the situation where other drivers make incorrect / outdated
assumptions about what arch_sync_dma_for_device() does, possibly based
on what is done for other architectures (e.g. armv7).
> Thanks,
> Robin.
>
> > An attempt at a udmabuf fix (untested):
> >
> > diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> > index 40399c26e6be..9ab4a6c01143 100644
> > --- a/drivers/dma-buf/udmabuf.c
> > +++ b/drivers/dma-buf/udmabuf.c
> > @@ -256,10 +256,11 @@ static int begin_cpu_udmabuf(struct dma_buf *buf,
> > ret = PTR_ERR(ubuf->sg);
> > ubuf->sg = NULL;
> > }
> > - } else {
> > - dma_sync_sgtable_for_cpu(dev, ubuf->sg, direction);
> > }
> >
> > + if (ubuf->sg)
> > + dma_sync_sgtable_for_cpu(dev, ubuf->sg, direction);
> > +
> > return ret;
> > }
Indeed, though that does have the annoyance that you run two sets
of cache ops over the entire buffer rather than one, and for a decent
size of buffer (e.g. a video frame) that is not free, given that you
have to loop over every cache line.
> >> I get old (zero) data out of the "Read buffer" stage in some cache
> >> lines sometimes.
> >> It doesn't matter which way round the mmap & sync are.
> >>
> >> I am aware that there is a patchset going through for udmabuf that may
> >> well fix the udmabuf case above, but given that this patch fixes
> >> something similar in dma_heap/system too I think it is still worth
> >> having.
> >> ---
> >> arch/arm64/mm/dma-mapping.c | 16 +++++++++++++++-
> >> 1 file changed, 15 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> >> index b2b5792b2caaf81ccfc3204c94395bb0faeabddd..51c43c1f563015139e365ed86f0f5f0d9483fa7f 100644
> >> --- a/arch/arm64/mm/dma-mapping.c
> >> +++ b/arch/arm64/mm/dma-mapping.c
> >> @@ -16,8 +16,22 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> >> enum dma_data_direction dir)
> >> {
> >> unsigned long start = (unsigned long)phys_to_virt(paddr);
> >> + unsigned long end = start + size;
> >>
> >> - dcache_clean_poc(start, start + size);
> >> + switch (dir) {
> >> + case DMA_BIDIRECTIONAL:
> >> + dcache_clean_inval_poc(start, end);
> >> + break;
> >> + case DMA_TO_DEVICE:
> >> + dcache_clean_poc(start, end);
> >> + break;
> >> + case DMA_FROM_DEVICE:
> >> + dcache_inval_poc(start, end);
> >> + break;
> >> + case DMA_NONE:
> >> + default:
> >> + break;
> >> + }
> >> }
> >
> > As explained above, that's not the right fix. We need to identify what's
> > missing on the ioctl() paths.
> >
I take your points, but this patch should be no less functional and no slower
than the current code and does work around existing cases where other
drivers haven't got it right.
Many thanks for your swift responses
John Cox
* Re: [PATCH] arm64/dma-mapping: Fix arch_sync_dma_for_device to respect dir parameter
From: Catalin Marinas @ 2025-08-20 15:16 UTC (permalink / raw)
To: John Cox; +Cc: Robin Murphy, Will Deacon, linux-arm-kernel
On Wed, Aug 20, 2025 at 03:43:16PM +0100, John Cox wrote:
> On Wed, 20 Aug 2025 at 15:08, Robin Murphy <robin.murphy@arm.com> wrote:
> >
> > On 20/08/2025 2:25 pm, Catalin Marinas wrote:
> > > On Wed, Aug 20, 2025 at 11:28:06AM +0100, John Cox via B4 Relay wrote:
> > >> This patch makes the arch_sync_dma_for_device function on arm64
> > >> do different things depending on the value of the dir parameter. In
> > >> particular it does a cache invalidate operation if the dir flag is
> > >> set to DMA_FROM_DEVICE. The current code does a writeback without
> > >> invalidate under all circumstances. Nearly all other architectures do
> > >> an invalidate if the direction is FROM_DEVICE which seems like the
> > >> correct thing to do to me.
> > >
> > > So does arm64 but in the arch_sync_dma_for_cpu(). That's the correct
> > > place to do it, otherwise after arch_sync_dma_for_device() you may have
> > > speculative loads by the CPU populating the caches with stale data
> > > before the device finished writing.
> >
> > Exactly, not only is it unnecessary, it's not even guaranteed to have
> > any lasting effect. arch_sync_dma_for_device() has two jobs to do: 1)
> > ensure that any new data going in the DMA_TO_DEVICE direction is visible
> > to the device; a clean is sufficient for that. 2) ensure that no dirty
> > cachelines may be written back over new DMA_FROM_DEVICE data; a clean is
> > sufficient for that also. Adding an invalidate at this point serves no
> > purpose since the CPU is still free to immediately speculatively fetch
> > the same cleaned data back into the cache.
>
> An invalidate at this point for DMA_FROM_DEVICE does satisfy (2) at least as
> well as a clean, and has the side benefit that any used cache lines are
> now free for use by the CPU. (1) is unaffected by this patch.
>
> I believe that my patch is no less functional than the current code.
It's no less functional, but it does not address the problem; it simply
makes it less likely. Both Robin and I explained how speculative loads
can lead to inconsistent data if the for_cpu API is not called.
So I'd rather have the current behaviour and we can spot problems early
than hiding it and making it harder to debug.
> > > An attempt at a udmabuf fix (untested):
> > >
> > > diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> > > index 40399c26e6be..9ab4a6c01143 100644
> > > --- a/drivers/dma-buf/udmabuf.c
> > > +++ b/drivers/dma-buf/udmabuf.c
> > > @@ -256,10 +256,11 @@ static int begin_cpu_udmabuf(struct dma_buf *buf,
> > > ret = PTR_ERR(ubuf->sg);
> > > ubuf->sg = NULL;
> > > }
> > > - } else {
> > > - dma_sync_sgtable_for_cpu(dev, ubuf->sg, direction);
> > > }
> > >
> > > + if (ubuf->sg)
> > > + dma_sync_sgtable_for_cpu(dev, ubuf->sg, direction);
> > > +
> > > return ret;
> > > }
>
> Indeed, though that does have the annoyance that you run two sets
> of cache ops over the entire buffer rather than one and for a decent
> size of buffer (e.g. video frame) that is not free given that you have
> to loop over every cache line.
The udmabuf code is slightly dodgy anyway. I expect the DMA buffer to
have been mapped already and a sync to_device issued long before
begin_cpu_udmabuf() is called without needing an additional dma_map here
just to populate the sg list. Otherwise it's quite late to do the sync
for_device just before the CPU is about to start reading. I'm not
familiar with this code but at this point we really should only have the
for_cpu loop.
Maybe the problem is elsewhere; I just spotted a corner case where the
sync for_cpu is not called.
> I take your points, but this patch should be no less functional and no slower
> than the current code and does work around existing cases where other
> drivers haven't got it right.
It does not work around them, it just makes them rarer. Other
architectures may get away with doing such maintenance in the for_device
sync only if they don't have aggressive speculative loads into the
cache. That's not the case for arm64.
--
Catalin
* Re: [PATCH] arm64/dma-mapping: Fix arch_sync_dma_for_device to respect dir parameter
From: John Cox @ 2025-08-20 15:35 UTC (permalink / raw)
To: Catalin Marinas; +Cc: Robin Murphy, Will Deacon, linux-arm-kernel
On Wed, 20 Aug 2025 at 16:16, Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Wed, Aug 20, 2025 at 03:43:16PM +0100, John Cox wrote:
> > On Wed, 20 Aug 2025 at 15:08, Robin Murphy <robin.murphy@arm.com> wrote:
> > >
> > > On 20/08/2025 2:25 pm, Catalin Marinas wrote:
> > > > On Wed, Aug 20, 2025 at 11:28:06AM +0100, John Cox via B4 Relay wrote:
> > > >> This patch makes the arch_sync_dma_for_device function on arm64
> > > >> do different things depending on the value of the dir parameter. In
> > > >> particular it does a cache invalidate operation if the dir flag is
> > > >> set to DMA_FROM_DEVICE. The current code does a writeback without
> > > >> invalidate under all circumstances. Nearly all other architectures do
> > > >> an invalidate if the direction is FROM_DEVICE which seems like the
> > > >> correct thing to do to me.
> > > >
> > > > So does arm64 but in the arch_sync_dma_for_cpu(). That's the correct
> > > > place to do it, otherwise after arch_sync_dma_for_device() you may have
> > > > speculative loads by the CPU populating the caches with stale data
> > > > before the device finished writing.
> > >
> > > Exactly, not only is it unnecessary, it's not even guaranteed to have
> > > any lasting effect. arch_sync_dma_for_device() has two jobs to do: 1)
> > > ensure that any new data going in the DMA_TO_DEVICE direction is visible
> > > to the device; a clean is sufficient for that. 2) ensure that no dirty
> > > cachelines may be written back over new DMA_FROM_DEVICE data; a clean is
> > > sufficient for that also. Adding an invalidate at this point serves no
> > > purpose since the CPU is still free to immediately speculatively fetch
> > > the same cleaned data back into the cache.
> >
> > An invalidate at this point for DMA_FROM_DEVICE does satisfy (2) at least as
> > well as a clean, and has the side benefit that any used cache lines are
> > now free for use by the CPU. (1) is unaffected by this patch.
> >
> > I believe that my patch is no less functional than the current code.
>
> It's no less functional but it does not address the problem, it simply
> makes it less likely. Both Robin and I explained the speculative loads
> potentially leading to inconsistent data if the for_cpu API is not
> called.
>
> So I'd rather have the current behaviour and we can spot problems early
> than hiding it and making it harder to debug.
>
> > > > An attempt at a udmabuf fix (untested):
> > > >
> > > > diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> > > > index 40399c26e6be..9ab4a6c01143 100644
> > > > --- a/drivers/dma-buf/udmabuf.c
> > > > +++ b/drivers/dma-buf/udmabuf.c
> > > > @@ -256,10 +256,11 @@ static int begin_cpu_udmabuf(struct dma_buf *buf,
> > > > ret = PTR_ERR(ubuf->sg);
> > > > ubuf->sg = NULL;
> > > > }
> > > > - } else {
> > > > - dma_sync_sgtable_for_cpu(dev, ubuf->sg, direction);
> > > > }
> > > >
> > > > + if (ubuf->sg)
> > > > + dma_sync_sgtable_for_cpu(dev, ubuf->sg, direction);
> > > > +
> > > > return ret;
> > > > }
> >
> > Indeed, though that does have the annoyance that you run two sets
> > of cache ops over the entire buffer rather than one and for a decent
> > size of buffer (e.g. video frame) that is not free given that you have
> > to loop over every cache line.
>
> The udmabuf code is slightly dodgy anyway. I expect the DMA buffer to
> have been mapped already and a sync to_device issued long before
> begin_cpu_udmabuf() is called without needing an additional dma_map here
> just to populate the sg list. Otherwise it's quite late to do the sync
> for_device just before the CPU is about to start reading. I'm not
> familiar with this code but at this point we really should only have the
> for_cpu loop.
>
> Maybe the problem is elsewhere, I just spotted a corner case where the
> sync for_cpu is not called.
>
> > I take your points, but this patch should be no less functional and no slower
> > than the current code and does work around existing cases where other
> > drivers haven't got it right.
>
> It does not work around them, it just makes them rarer. Other
> architectures may get away with doing such maintenance in the for_device
> sync only if they don't have aggressive speculative loads into the
> cache. That's not the case for arm64.
Fair enough. I do understand your speculative load point and I do
understand that my patch just makes stuff rarer. I believe that
making such issues rarer is worthwhile when they are already
rare enough that they will often be simply dismissed as general
instability rather than leading to better debugging.
Having said that, I accept your point of view. Thank you for your
time.
John Cox
> --
> Catalin