* arm: dma-mapping: CPU may not see up-to-date data after DMA transaction
  From: JABLONSKY Jan
  Date: 2018-11-22 07:36 UTC
  To: linux-arm-kernel

Proper cleaning and invalidating also has to be performed for the inner
cache, otherwise the CPU may work with stale data from the inner cache.

The problem appears on the Altera SoC FPGA (Cortex-A9) under higher CPU and
system memory load, when the CPU reads from memory after a DMA transaction.
That means the CPU may not see the most up-to-date and correct copy of the
DMA buffer.

Replace the outer cache calls with __sync_cache_range_r/w, which handle the
situation described above. Protect against preemption to make sure that the
code runs on the current CPU.

Signed-off-by: Jan Jablonsky <jan.jablonsky@thalesgroup.com>
---
 arch/arm/mm/dma-mapping.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 661fe48ab78d..f40dc8b250d8 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -997,11 +997,18 @@ static void __dma_page_cpu_to_dev(struct page *page, unsigned long off,
 	dma_cache_maint_page(page, off, size, dir, dmac_map_area);
 
 	paddr = page_to_phys(page) + off;
+
+	/*
+	 * Protect against preemption
+	 * Ensure that the code is running on the current CPU
+	 */
+	preempt_disable();
 	if (dir == DMA_FROM_DEVICE) {
-		outer_inv_range(paddr, paddr + size);
+		__sync_cache_range_r(phys_to_virt(paddr), size);
 	} else {
-		outer_clean_range(paddr, paddr + size);
+		__sync_cache_range_w(phys_to_virt(paddr), size);
 	}
+	preempt_enable();
 	/* FIXME: non-speculating: flush on bidirectional mappings? */
 }
 
@@ -1013,7 +1020,13 @@ static void __dma_page_dev_to_cpu(struct page *page, unsigned long off,
 	/* FIXME: non-speculating: not required */
 	/* in any case, don't bother invalidating if DMA to device */
 	if (dir != DMA_TO_DEVICE) {
-		outer_inv_range(paddr, paddr + size);
+		/*
+		 * Protect against preemption
+		 * Ensure that the code is running on the current CPU
+		 */
+		preempt_disable();
+		__sync_cache_range_r(phys_to_virt(paddr), size);
+		preempt_enable();
 
 		dma_cache_maint_page(page, off, size, dir, dmac_unmap_area);
 	}
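For context on the path being patched: on a non-coherent ARM system using the
default arm_dma_ops, these two functions are reached through the streaming DMA
API. The following is a minimal, hypothetical usage sketch (function name and
comments invented for illustration, not code from this thread):

```c
#include <linux/dma-mapping.h>

/*
 * Hypothetical receive path: the device DMAs into 'buf' and the CPU reads it
 * afterwards.  With the default (non-coherent, non-IOMMU) arm_dma_ops, the
 * map call ends up in __dma_page_cpu_to_dev() and the unmap call in
 * __dma_page_dev_to_cpu().
 */
static int example_rx(struct device *dev, void *buf, size_t len)
{
	dma_addr_t dma;

	dma = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, dma))
		return -ENOMEM;

	/* ... program the device with 'dma' and wait for completion ... */

	dma_unmap_single(dev, dma, len, DMA_FROM_DEVICE);

	/* Only now may the CPU safely read buf[]. */
	return 0;
}
```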
* arm: dma-mapping: CPU may not see up-to-date data after DMA transaction
  From: Russell King - ARM Linux
  Date: 2018-11-22 09:29 UTC
  To: linux-arm-kernel

On Thu, Nov 22, 2018 at 07:36:26AM +0000, JABLONSKY Jan wrote:
> Proper cleaning and invalidating also has to be performed for the inner
> cache, otherwise the CPU may work with stale data from the inner cache.

The functions are _already_ performing cache maintenance on the inner
cache; this patch makes no sense.

> The problem appears on the Altera SoC FPGA (Cortex-A9) under higher CPU and
> system memory load, when the CPU reads from memory after a DMA transaction.
> That means the CPU may not see the most up-to-date and correct copy of the
> DMA buffer.
>
> Replace the outer cache calls with __sync_cache_range_r/w, which handle the
> situation described above. Protect against preemption to make sure that the
> code runs on the current CPU.

This is too vague to say much more.

> [...]

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up
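To make this point concrete: the unpatched code already does inner-cache
maintenance before touching the outer cache. The map side, paraphrased from
the diff context above (simplified, not a verbatim copy of dma-mapping.c), is:

```c
/* Simplified paraphrase of the existing map path (details trimmed). */
static void __dma_page_cpu_to_dev(struct page *page, unsigned long off,
				  size_t size, enum dma_data_direction dir)
{
	phys_addr_t paddr;

	/* Inner (L1) maintenance, highmem-aware, via dmac_map_area. */
	dma_cache_maint_page(page, off, size, dir, dmac_map_area);

	/* Outer (L2, e.g. PL310) maintenance on the physical range. */
	paddr = page_to_phys(page) + off;
	if (dir == DMA_FROM_DEVICE)
		outer_inv_range(paddr, paddr + size);
	else
		outer_clean_range(paddr, paddr + size);
}
```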
* arm: dma-mapping: CPU may not see up-to-date data after DMA transaction
  From: JABLONSKY Jan
  Date: 2018-11-22 12:25 UTC
  To: linux-arm-kernel

After the DMA transaction is complete, the content of the DMA buffer is
stored in main memory.

outer_inv_range invalidates cache lines only at the L2 level.
outer_cache is dedicated to the L2 cache, but the CPU must also clean the L1
dcache to avoid a possible L1 hit.

The functions __sync_cache_range_r/w handle L1 as well as L2 cache
maintenance. That means the outer cache functions (within
__dma_page_cpu_to_dev and __dma_page_dev_to_cpu) can be replaced with the
functions below:

  outer_inv_range   -> __sync_cache_range_r
  outer_clean_range -> __sync_cache_range_w

- __sync_cache_range_r:
  - when DMA writes and the CPU reads
  - to achieve reading directly from main memory (L1 miss, L2 miss)
  - ensures preceding writes to *p by other CPUs are visible to subsequent
    reads by this CPU

- __sync_cache_range_w:
  - when the CPU writes and DMA reads
  - ensures preceding writes to *p by this CPU are visible to subsequent
    reads by other CPUs

A similar problem is also described in the link below:
https://community.arm.com/processors/f/discussions/6555/need-to-invalidate-l1-cache-after-dma-on-cortex-a9

Test case:
The problem appears while calculating the md5sum of an eMMC (16GB) in a loop.
After some time (e.g. 1-2 hours), the md5sum may differ.
For example:

  5c2b3c7a6d69a2f6c4c1ddfdd3bf1ed5  /dev/mmcblk0  time 746 [s]
  5c2b3c7a6d69a2f6c4c1ddfdd3bf1ed5  /dev/mmcblk0  time 738 [s]
  c5f2bb8e9d83744d4087450d6274208e  /dev/mmcblk0  time 691 [s]
  ...

I hope this is clearer now.

On Thu, 2018-11-22 at 09:29 +0000, Russell King - ARM Linux wrote:
> [...]
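For reference, the two helpers being proposed are defined (as far as I can
tell) in arch/arm/include/asm/cacheflush.h. Paraphrased roughly from memory,
so treat details such as the exact config guards as approximate, they do the
following:

```c
/* Approximate paraphrase of the helpers, not verbatim kernel source. */
static inline void __sync_cache_range_w(volatile void *p, size_t size)
{
	char *_p = (char *)p;

	__cpuc_clean_dcache_area(_p, size);		/* clean inner (L1)     */
	outer_clean_range(__pa(_p), __pa(_p + size));	/* clean outer (L2)     */
}

static inline void __sync_cache_range_r(volatile void *p, size_t size)
{
	char *_p = (char *)p;

	if (outer_cache.flush_range) {
		__cpuc_clean_dcache_area(_p, size);	/* clean inner first    */
		outer_flush_range(__pa(_p),
				  __pa(_p + size));	/* clean+inv outer (L2) */
	}
	__cpuc_flush_dcache_area(_p, size);		/* clean+inv inner (L1) */
}
```

Note that both helpers translate the virtual address with __pa(), i.e. they
assume a lowmem (linear-map) address, which is exactly the limitation raised
in the next reply.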
* arm: dma-mapping: CPU may not see up-to-date data after DMA transaction
  From: Russell King - ARM Linux
  Date: 2018-11-22 12:51 UTC
  To: linux-arm-kernel

On Thu, Nov 22, 2018 at 12:25:18PM +0000, JABLONSKY Jan wrote:
> After the DMA transaction is complete, the content of the DMA buffer is
> stored in main memory.
>
> outer_inv_range invalidates cache lines only at the L2 level.

Correct.

> outer_cache is dedicated to the L2 cache, but the CPU must also clean the
> L1 dcache to avoid a possible L1 hit.

Correct, except we _do_ handle the L1 cache.  See the dmac_*_area functions
that are already being called via dma_cache_maint_page().  You did analyse
the code before proposing these modifications? ;)

Given that we already handle the L1 cache, why do we need more L1 cache
maintenance?

In any case, in your patch, using phys_to_virt() on the addresses is
utterly wrong, and will lead to hitting the wrong memory if the physical
address is a highmem page (this is why we have dma_cache_maint_page(),
which has the necessary code to handle this case.)  The physical to
virtual translation functions and macros are only valid for lowmem.

So, I think your patch actually ends up breaking stuff by doing cache
maintenance on unexpected memory locations.

Given, also, that we have shed-loads of Cortex-A9 systems out there,
using DMA with MMC, some using it for the rootfs, I doubt that the
problem lies here but elsewhere - which MMC host controller are you
using?  Has that driver been properly reviewed - does it use the DMA
API correctly?  Does it use the correct DMA direction (which should
always be the same for all DMA API operations on the buffer)?

It could also be a hardware bug (which is suggested by your
modifications to disable preemption).  As cache maintenance is
broadcast across all cores in an SMP system with Cortex-A9, it should
not matter whether we are pre-empted during cache handling for DMA.  If
disabling preemption makes a difference for that, it probably means you
have bigger coherency issues on your SoC.

Yes, you may have found a rare corner case, but we really need to fully
understand the details of what is going on here _before_ making radical
changes to code that has been proven by probably billions of users over
the course of the last nine years.

Thanks.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up
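The highmem point is the crux of the objection: phys_to_virt() only works for
the linearly-mapped lowmem region, whereas DMA pages can come from highmem.
The existing helper walks the buffer page by page and creates temporary
mappings where needed. Very roughly, as a simplified paraphrase with the
sparsemem and aliasing-cache details omitted:

```c
/*
 * Rough paraphrase of dma_cache_maint_page(): lowmem pages are reached via
 * page_address(), highmem pages via a temporary kmap_atomic() mapping.
 * phys_to_virt() alone would produce a bogus pointer for a highmem page.
 */
static void dma_cache_maint_page(struct page *page, unsigned long offset,
				 size_t size, enum dma_data_direction dir,
				 void (*op)(const void *, size_t, int))
{
	size_t left = size;

	do {
		size_t len = min_t(size_t, left, PAGE_SIZE - offset);
		void *vaddr;

		if (PageHighMem(page)) {
			vaddr = kmap_atomic(page);	/* temporary mapping */
			op(vaddr + offset, len, dir);
			kunmap_atomic(vaddr);
		} else {
			op(page_address(page) + offset, len, dir);
		}

		offset = 0;
		page++;		/* simplification: assumes contiguous struct pages */
		left -= len;
	} while (left);
}
```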
* arm: dma-mapping: CPU may not see up-to-date data after DMA transaction
  From: JABLONSKY Jan
  Date: 2018-11-22 15:36 UTC
  To: linux-arm-kernel

Yes, you are right.

It is the Synopsys DesignWare Multimedia Card Interface driver, dw_mmc.

The SoC is using the internal DMA controller (IDMAC).
A few days ago I also analysed the driver and its use of the DMA API.
I found in the dw_mci_dmac_complete_dma callback function that invalidation
of caches is performed only in the case of external DMA:
https://lore.kernel.org/patchwork/patch/1015401/

There is some significant progress, but the problem still occurs.

The idea behind using the function __sync_cache_range_r is that the sequence
of cache maintenance is similar to the one described in the L2 cache
controller TRM to ensure coherency:

http://infocenter.arm.com/help/topic/com.arm.doc.ddi0246h/DDI0246H_l2c310_r3p3_trm.pdf

  Perform a combination of Clean and Invalidate operations that ensures
  coherency from the outside in, and from the inside out.

  CleanLevel1 Address        ; This is broadcast within the cluster
  DSB                        ; Ensure completion of the clean as far as Level 2
  Clean&InvalLevel2 Address  ; forces the address out past level 2
  SYNC                       ; Ensures completion of the L2 inval
  Clean&InvalLevel1 Address  ; This is broadcast within the cluster
  DSB                        ; Ensure completion of the clean&inval as far as Level 2 (no data lost)

With the mentioned sequence I really can see a big difference.
But anyway, I understand what you mean; I will continue with the analysis.

On Thu, 2018-11-22 at 12:51 +0000, Russell King - ARM Linux wrote:
> [...]
* arm: dma-mapping: CPU may not see up-to-date data after DMA transaction
  From: Robin Murphy
  Date: 2018-11-22 16:36 UTC
  To: linux-arm-kernel

On 22/11/2018 15:36, JABLONSKY Jan wrote:
> Yes, you are right.
>
> It is the Synopsys DesignWare Multimedia Card Interface driver, dw_mmc.
>
> The SoC is using the internal DMA controller (IDMAC).
> A few days ago I also analysed the driver and its use of the DMA API.
> I found in the dw_mci_dmac_complete_dma callback function that invalidation
> of caches is performed only in the case of external DMA:
> https://lore.kernel.org/patchwork/patch/1015401/
>
> There is some significant progress, but the problem still occurs.

Frankly, dw_mci_dmac_complete_dma() looks pretty bogus anyway - that
dma_sync_sg_for_cpu() is immediately before a call to
host->dma_ops->cleanup(), which does the dma_unmap_sg() that I would expect
to see at the end of a DMA transfer.  So at best that sync should just be
redundant with the unmap in terms of cache maintenance, but the fact that it
uses a hard-coded direction and a possibly different device really smells of
that driver being a lot more rotten in general.

Robin.

> [...]
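In other words, once the transfer has completed, the hand-back of the buffer
to the CPU should come from the unmap alone. A hypothetical shape of such a
completion path (not the actual dw_mmc code) would simply be:

```c
/*
 * Hypothetical completion path: dma_unmap_sg() already performs the cache
 * maintenance needed to hand the buffer back to the CPU, using the direction
 * the buffer was mapped with - no extra dma_sync_sg_for_cpu() with a
 * hard-coded DMA_FROM_DEVICE is needed.
 */
static void example_dma_complete(struct dw_mci *host)
{
	struct mmc_data *data = host->data;

	if (data)
		dma_unmap_sg(host->dev, data->sg, data->sg_len,
			     mmc_get_dma_dir(data));
}
```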
* arm: dma-mapping: CPU may not see up-to-date data after DMA transaction
  From: Russell King - ARM Linux
  Date: 2018-11-22 16:57 UTC
  To: linux-arm-kernel

On Thu, Nov 22, 2018 at 03:36:13PM +0000, JABLONSKY Jan wrote:
> Yes, you are right.
>
> It is the Synopsys DesignWare Multimedia Card Interface driver, dw_mmc.
>
> The SoC is using the internal DMA controller (IDMAC).
> A few days ago I also analysed the driver and its use of the DMA API.
> I found in the dw_mci_dmac_complete_dma callback function that invalidation
> of caches is performed only in the case of external DMA:
> https://lore.kernel.org/patchwork/patch/1015401/
>
> There is some significant progress, but the problem still occurs.

From that patch, should I assume you're not using a separate DMAengine,
i.o.w. host->use_dma is TRANS_MODE_IDMAC?

Now, setting up the DMA table:

	for (i = 0; i < sg_len; i++) {
		unsigned int length = sg_dma_len(&data->sg[i]);
		u32 mem_addr = sg_dma_address(&data->sg[i]);

is wrong.  Scatterlists do not have to be contiguous arrays, they can be
fragmented, and the only way to walk a scatterlist correctly is to use
for_each_sg(), i.o.w.:

	struct scatterlist *sg;

	for_each_sg(data->sg, sg, sg_len, i) {
		unsigned int length = sg_dma_len(sg);
		u32 mem_addr = sg_dma_address(sg);

The 64-bit path needs fixing similarly.  It's interesting that the driver
already uses for_each_sg() in some places, but not for setting up these
tables.  It should always use for_each_sg().

dw_mci_dma_cleanup() does this:

	dma_unmap_sg(host->dev, data->sg, data->sg_len,
		     mmc_get_dma_dir(data));

Note that it uses host->dev.  Other DMA functions use the same.  However,
dw_mci_dmac_complete_dma() uses a different variable:

	/* Invalidate cache after read */
	dma_sync_sg_for_cpu(mmc_dev(host->slot->mmc), data->sg,
			    data->sg_len, DMA_FROM_DEVICE);

It just happens that mmc_dev(host->slot->mmc) is the same as host->dev,
but this is really bad programming style - it looks like pure obfuscation
to me.

In any case, since the driver supports DMAengine, this is incorrect - when
using DMAengine, the buffer needs to be mapped and unmapped using the
DMAengine struct device, not the host controller's struct device.  So,
using host->dev and mmc_dev(host->slot->mmc) is incorrect in this driver.

Next, the comment "/* Invalidate cache after read */" suggests that
something is very wrong - the cache handling happens when the buffer is
unmapped.  That happens at two points in this driver - either via
host->dma_ops->cleanup(host); if the buffer was mapped by
dw_mci_submit_data_dma(), or by dw_mci_post_req() if it was mapped by
dw_mci_pre_req().

Similar applies to:

	/* Flush cache before write */
	if (host->data->flags & MMC_DATA_WRITE)
		dma_sync_sg_for_device(mmc_dev(host->slot->mmc), sgl,
				       sg_elems, DMA_TO_DEVICE);

in dw_mci_edmac_start_dma() - those dma_sync_*() calls really should not
be necessary.

I'd get rid of those dma_sync_*() calls and try to validate that the
buffer has been properly mapped to the _correct_ device, and check that it
is properly unmapped after the transfer has completed.  IOW, it's mapped
to the MMC device if using dw_mci_idmac_ops (i.e. host->dev) and to the
DMAengine device if using dw_mci_edmac_ops (i.e. host->dms->ch->device->dev).

Note that there's an additional issue when using a DMA engine - an ordering
problem, where the MMC controller may report that the transfer has completed
before the DMA engine has finished transferring the data to memory - and I
see nothing in this driver that caters to that issue.  I don't think that's
your problem, because I think you're using the internal MMC controller DMA.

> Perform a combination of Clean and Invalidate operations that ensures
> coherency from the outside in, and from the inside out.
>
> CleanLevel1 Address        ; This is broadcast within the cluster
> DSB                        ; Ensure completion of the clean as far as Level 2
> Clean&InvalLevel2 Address  ; forces the address out past level 2
> SYNC                       ; Ensures completion of the L2 inval
> Clean&InvalLevel1 Address  ; This is broadcast within the cluster
> DSB                        ; Ensure completion of the clean&inval as far as Level 2 (no data lost)
>
> With the mentioned sequence I really can see a big difference.
> But anyway, I understand what you mean; I will continue with the analysis.

What we end up doing is:

- On buffer mapping for DMA_FROM_DEVICE (DMA from device to memory):
  - invalidate L1 cache
  - dsb
  - invalidate L2 cache
  - sync L2

  This is to ensure that no writebacks occur while the DMA is progressing.

  ... DMA happens ...

- On buffer unmapping for DMA_FROM_DEVICE:
  - invalidate L2 cache
  - sync L2
  - invalidate L1 cache
  - dsb

  This is to ensure that any speculatively read cache lines are killed
  prior to the CPU reading the new data in the buffer.

This reflects the L1/L2/L1 sequence talked about above without excessive
expense - these cache maintenance operations are _very_ expensive.

Note that the CPU _may_ speculatively prefetch while the DMA happens; it
may also speculatively prefetch between and during each of the steps in
the documentation quote above.

Note that the CPU is _not_ allowed to write in any way to the memory
buffer while it is mapped for an active DMA operation - that _will_
definitely lead to corrupted data.

- On buffer mapping for DMA_TO_DEVICE (DMA from memory to device):
  - clean L1 cache
  - dsb
  - clean L2 cache
  - sync L2

  ... DMA happens ...

- On buffer unmapping for DMA_TO_DEVICE:
  - nothing is necessary, since we must have made the data visible to DMA
    in the mapping case.

Note that there's a requirement that the MMIO or other write that triggers
the DMA operation _must_ have an appropriate barrier (eg, dsb) between all
previous writes and the write that starts the DMA.  That means the
non-relaxed write[bwl]() functions must be used, or the driver must use an
explicit barrier and document that.

Hope this helps.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up
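Putting those rules together, a driver-side transmit path on such a system
might look roughly like the following. This is a hypothetical sketch (the
EXAMPLE_* register offsets and names are invented), not dw_mmc code; it only
illustrates the mapping, barrier and unmapping order described above.

```c
#include <linux/dma-mapping.h>
#include <linux/io.h>

#define EXAMPLE_ADDR_REG	0x00	/* hypothetical register offsets */
#define EXAMPLE_LEN_REG		0x04
#define EXAMPLE_CTRL_REG	0x08
#define EXAMPLE_CTRL_START	0x01

static int example_tx(struct device *dma_dev, void __iomem *regs,
		      void *buf, size_t len)
{
	dma_addr_t dma;

	/*
	 * Map once, with the device that actually performs the DMA and the
	 * real transfer direction; this does the clean/dsb/clean-L2/sync work.
	 */
	dma = dma_map_single(dma_dev, buf, len, DMA_TO_DEVICE);
	if (dma_mapping_error(dma_dev, dma))
		return -ENOMEM;

	writel(dma, regs + EXAMPLE_ADDR_REG);
	writel(len, regs + EXAMPLE_LEN_REG);
	/*
	 * writel() (unlike writel_relaxed()) includes the barrier that orders
	 * the preceding memory and MMIO writes before the doorbell write.
	 */
	writel(EXAMPLE_CTRL_START, regs + EXAMPLE_CTRL_REG);

	/* ... do not touch buf[] until the completion interrupt arrives ... */

	dma_unmap_single(dma_dev, dma, len, DMA_TO_DEVICE);
	return 0;
}
```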
* Re: arm: dma-mapping: CPU may not see up-to-date data after DMA transaction
  From: JABLONSKY Jan
  Date: 2019-01-08 06:10 UTC
  To: Russell King - ARM Linux
  Cc: Chris Cole, linux-arm-kernel@lists.infradead.org

Hi again,

I saw that there are some findings in v7_dma_inv_range:
https://patchwork.kernel.org/patch/10673097/

Wouldn't it be better to replace it directly with C&I (clean & invalidate)?
Otherwise isn't there still a risk that writes are lost on the way in->out?
(That is exactly what happens in my case.)

v7_dma_inv_range:
    from: mcrlo p15, 0, r0, c7, c6, 1     @ invalidate D / U line
    to:   mcrlo p15, 0, r0, c7, c14, 1    @ clean & invalidate D / U line

Thanks
* Re: arm: dma-mapping: CPU may not see up-to-date data after DMA transaction
  From: Russell King - ARM Linux
  Date: 2019-01-08 09:51 UTC
  To: JABLONSKY Jan
  Cc: Chris Cole, linux-arm-kernel@lists.infradead.org

On Tue, Jan 08, 2019 at 06:10:33AM +0000, JABLONSKY Jan wrote:
> Hi again,
>
> I saw that there are some findings in v7_dma_inv_range:
> https://patchwork.kernel.org/patch/10673097/
>
> Wouldn't it be better to replace it directly with C&I (clean & invalidate)?

No.

> Otherwise isn't there still a risk that writes are lost on the way in->out?
> (That is exactly what happens in my case.)

For a region that is being used to DMA data _from_ the device, the data we
have in the cache need not be written back out except if there are
overlapping cache lines - writing it out would be a complete waste of bus
bandwidth.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up
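The "overlapping cache lines" caveat is the usual reason DMA buffers should
not share a cache line with unrelated data. A hypothetical illustration
(structure and field names invented):

```c
/*
 * If a DMA_FROM_DEVICE buffer shares a cache line with data the CPU writes
 * while the buffer is mapped, a pure invalidate can discard the CPU's write
 * (and a clean/writeback can overwrite freshly DMA'd data).  Keeping DMA
 * buffers in their own cache line(s) - e.g. a separate kmalloc() allocation,
 * which on ARM is aligned to at least ARCH_DMA_MINALIGN - avoids the overlap.
 */
struct bad_layout {
	u8 cpu_flag;			/* CPU may write this at any time */
	u8 dma_buf[60];			/* device DMAs into this          */
};					/* both can share one cache line  */

struct good_layout {
	u8 cpu_flag;
	u8 dma_buf[60] ____cacheline_aligned;	/* starts on its own line */
};
```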