* dma_cache_maint_contiguous should be patched as dma_cache_maint do
@ 2009-12-10 14:59 mkl lin
2009-12-10 15:35 ` Catalin Marinas
0 siblings, 1 reply; 16+ messages in thread
From: mkl lin @ 2009-12-10 14:59 UTC (permalink / raw)
To: linux-arm-kernel
hi,
I've got a deadlock while testing the SATA AHCI driver (ahci.c) with the new IPI patch.
My platform is a 2-CPU ARM11 MPCore, Linux 2.6.31.1, SMP, with L1 cache. I did a
simple write-read test with a PCIe SATA adapter and it got stuck on the first or
second round.
Here is the call stack of each CPU:
CPU1:
__smp_dma_cache_op
dma_cache_maint_page
dma_map_sg
ata_qc_issue
ata_scsi_translate
ata_scsi_queuecmd
scsi_request_fn
__generic_unplug_device
generic_unplug_device
blk_unplug
sync_buffer
__wait_on_bit_lock
out_of_line_wait_on_bit_lock
lock_buffer
__block_write_full_page
CPU0:
__raw_spin_lock
_spin_lock
ata_scsi_queuecmd
Following is my guess:
CPU1 called ata_scsi_queuecmd, took the lock, and went all the way down
into __smp_dma_cache_op, which sends the IPI message and then waits for
the other CPU to finish the cache operation (I guess so):
    while(!cpus_empty(data.unfinished))
CPU0 also called ata_scsi_queuecmd and is stuck on the spinlock already
held by CPU1, with IRQs disabled. So CPU0 cannot handle the IPI and keeps
spinning, while CPU1 keeps waiting for CPU0 to finish handling the IPI.
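To make the interaction concrete, here is a tiny userspace analogue of what I
think is happening (pthreads stand in for the two CPUs, a flag stands in for
the IPI acknowledgement; the names are made up and this is obviously not the
kernel code, only the pattern):

#include <pthread.h>

static pthread_spinlock_t host_lock;      /* stands in for the SCSI host lock */
static volatile int ipi_done;             /* stands in for data.unfinished */

static void *cpu1(void *arg)              /* ata_scsi_queuecmd() path on CPU1 */
{
        pthread_spin_lock(&host_lock);    /* lock taken, "IRQs off" */
        /* __smp_dma_cache_op(): send the IPI, then busy-wait for the other CPU */
        while (!ipi_done)
                ;                         /* never completes */
        pthread_spin_unlock(&host_lock);
        return NULL;
}

static void *cpu0(void *arg)              /* ata_scsi_queuecmd() path on CPU0 */
{
        pthread_spin_lock(&host_lock);    /* spins forever: CPU1 holds the lock and,
                                             with "IRQs off", the IPI can only be
                                             serviced once the spinning stops */
        ipi_done = 1;                     /* never reached */
        pthread_spin_unlock(&host_lock);
        return NULL;
}

int main(void)
{
        pthread_t t0, t1;

        pthread_spin_init(&host_lock, PTHREAD_PROCESS_PRIVATE);
        pthread_create(&t1, NULL, cpu1, NULL);
        pthread_create(&t0, NULL, cpu0, NULL);
        pthread_join(t1, NULL);           /* never returns: the two threads deadlock */
        pthread_join(t0, NULL);
        return 0;
}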
One strange thing is that whenever ata_scsi_queuecmd is called, IRQs have already been disabled.
The attached patch fixes the bug for me, but I'm not sure whether it will break other things.
Best Regards,
Mac Lin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch
Type: application/octet-stream
Size: 631 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20091210/520c074d/attachment-0001.obj>
^ permalink raw reply [flat|nested] 16+ messages in thread
* dma_cache_maint_contiguous should be patched as dma_cache_maint do
2009-12-10 14:59 dma_cache_maint_contiguous should be patched as dma_cache_maint do mkl lin
@ 2009-12-10 15:35 ` Catalin Marinas
2009-12-10 16:39 ` Russell King - ARM Linux
0 siblings, 1 reply; 16+ messages in thread
From: Catalin Marinas @ 2009-12-10 15:35 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, 2009-12-10 at 14:59 +0000, mkl lin wrote:
> I've got a deadlock while testing the SATA AHCI driver (ahci.c) with the new IPI patch.
>
> My platform is a 2-CPU ARM11 MPCore, Linux 2.6.31.1, SMP, with L1 cache. I did a
> simple write-read test with a PCIe SATA adapter and it got stuck on the first or
> second round.
[...]
> Following is my guess:
>
> CPU1 called ata_scsi_queuecmd, took the lock, and went all the way down
> into __smp_dma_cache_op, which sends the IPI message and then waits for
> the other CPU to finish the cache operation (I guess so):
>
> while(!cpus_empty(data.unfinished))
>
> CPU0 also called ata_scsi_queuecmd and is stuck on the spinlock already
> held by CPU1, with IRQs disabled. So CPU0 cannot handle the IPI and keeps
> spinning, while CPU1 keeps waiting for CPU0 to finish handling the IPI.
This looks like a valid scenario, thanks for the analysis.
The DMA broadcasting patch allows IPIs with interrupts disabled on the
issuing CPU, but this case is slightly different, and any IPI (not only
the DMA broadcasting one) would fail in the same way. Basically, you can't
issue an IPI while holding a lock that a different CPU may be spinning on
with interrupts disabled.
Your patch allows interrupts to be taken while in the
ata_scsi_queuecmd() function but this may have side-effects as this
function is assumed to be called with interrupts disabled.
So it leaves us with two workarounds - (1) re-implement this DMA cache
ops broadcasting with FIQs if the hardware (GIC) configuration allows
this or (2) instead of broadcasting, do a "Read or Write For Ownership"
on the calling CPU before invoking the cache operation. By doing this
the current CPU becomes the owner of the cache lines and it can do cache
maintenance on them. For large buffers, this RFO/WFO is slower than IPI
but in the general case it may be actually quicker.
If you have time (should only take an hour or so but I don't have
hardware to test it), I would be grateful if you could try the
following:
 * Drop my DMA broadcasting patch
 * In dma_cache_maint (and the contiguous variant), do the following
   based on direction:
     * TO_DEVICE: read each cache line in the buffer (reading one
       long variable every 32 bytes is enough) before the local
       cache maintenance operation. This ensures that any dirty
       cache lines on other CPUs are transferred to L2 (or main
       memory) or to the current CPU, so the cache operation has
       the intended effect. The cache lines on other CPUs may
       change to a Shared state (see the MESI protocol).
     * FROM_DEVICE: we don't care about the buffer contents, so
       just write 0 to each cache line (as above, one long every
       32 bytes is enough). This makes the cache lines Exclusive
       to the current CPU (no other CPU has any copies), so the
       invalidation ensures that no cache lines remain on any CPU.
     * BIDIRECTIONAL: read a word in each cache line and write it
       back. After the clean & invalidate, the cache lines are
       removed from all CPUs.
ARM11MPCore doesn't do speculative accesses so the above should be safe.
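For illustration, here is a rough, self-contained C sketch of the above (the
helper name, direction constants and 32-byte line size are assumptions for the
example, not existing kernel symbols; the real thing would of course live next
to the v6 cache routines):

#include <stddef.h>

#define CACHE_LINE_SIZE 32                /* ARM11 MPCore D-cache line size */

enum touch_dir { TOUCH_TO_DEVICE, TOUCH_FROM_DEVICE, TOUCH_BIDIRECTIONAL };

/*
 * Touch one word in every cache line of the buffer so that the calling CPU
 * becomes the owner of those lines, before the usual local clean/invalidate.
 */
static void touch_for_ownership(void *buf, size_t size, enum touch_dir dir)
{
        char *p = buf;
        char *end = p + size;

        for (; p < end; p += CACHE_LINE_SIZE) {
                volatile unsigned long *word = (volatile unsigned long *)p;

                if (dir == TOUCH_TO_DEVICE)
                        (void)*word;      /* read for ownership */
                else if (dir == TOUCH_FROM_DEVICE)
                        *word = 0;        /* write for ownership */
                else
                        *word = *word;    /* read + write back */
        }
        /* ...then run the local clean/invalidate over [buf, buf + size) */
}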
Thanks.
--
Catalin
^ permalink raw reply [flat|nested] 16+ messages in thread
* dma_cache_maint_contiguous should be patched as dma_cache_maint do
2009-12-10 15:35 ` Catalin Marinas
@ 2009-12-10 16:39 ` Russell King - ARM Linux
2009-12-10 17:15 ` dma_cache_maint_contiguous should be patched asdma_cache_maint do Catalin Marinas
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: Russell King - ARM Linux @ 2009-12-10 16:39 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, Dec 10, 2009 at 03:35:07PM +0000, Catalin Marinas wrote:
> On Thu, 2009-12-10 at 14:59 +0000, mkl lin wrote:
> > I've got a deadlock while testing the SATA AHCI driver (ahci.c) with the new IPI patch.
> >
> > My platform is a 2-CPU ARM11 MPCore, Linux 2.6.31.1, SMP, with L1 cache. I did a
> > simple write-read test with a PCIe SATA adapter and it got stuck on the first
> > or second round.
> [...]
> > Following is my guess:
> >
> > CPU1 called ata_scsi_queuecmd, took the lock, and went all the way down
> > into __smp_dma_cache_op, which sends the IPI message and then waits for
> > the other CPU to finish the cache operation (I guess so):
> >
> > while(!cpus_empty(data.unfinished))
> >
> > CPU0 also called ata_scsi_queuecmd and is stuck on the spinlock already
> > held by CPU1, with IRQs disabled. So CPU0 cannot handle the IPI and keeps
> > spinning, while CPU1 keeps waiting for CPU0 to finish handling the IPI.
>
> This looks like a valid scenario, thanks for the analysis.
>
> The DMA broadcasting patch allows IPIs with interrupts disabled on the
> issuing CPU, but this case is slightly different, and any IPI (not only
> the DMA broadcasting one) would fail in the same way. Basically, you can't
> issue an IPI while holding a lock that a different CPU may be spinning on
> with interrupts disabled.
I was rather expecting a deadlock report over the DMA cache ops broadcasting,
which is why I've been holding off merging it.
> Your patch allows interrupts to be taken while in the
> ata_scsi_queuecmd() function but this may have side-effects as this
> function is assumed to be called with interrupts disabled.
Having functions which enable interrupts while the parent is supposed to
be in an atomic context is definitely a recipe for things to go badly
wrong - this is not a solution anyone in their right mind should
contemplate.
> So it leaves us with two workarounds - (1) re-implement this DMA cache
> ops broadcasting with FIQs if the hardware (GIC) configuration allows
> this
I don't think the GIC allows FIQ IPIs - and even if it did, it presents
certain problems for people - the kernel would have to become the sole
owner of FIQs on SMP, and we'd have to prevent other people using them.
They'd certainly not be "fast" interrupts anymore, nor would they be
suitable for software-driven DMA (which is what FIQ gets used for on
some platforms).
> or (2) instead of broadcasting, do a "Read or Write For Ownership"
> on the calling CPU before invoking the cache operation. By doing this
> the current CPU becomes the owner of the cache lines and it can do cache
> maintenance on them. For large buffers, this RFO/WFO is slower than IPI
> but in the general case it may be actually quicker.
This is probably the best we can do, and it will make DMA performance
suck - thankfully, later SMP CPUs have solved this whole cache IPI
issue by broadcasting the cache operations in hardware (which is actually
the only sane solution to this problem.)
Let's hope that all ARMv7 SMP implementations broadcast the cache
operations. For the time being, I'm going to assume that this is the
case. If it is not, I think ARM Ltd needs to ensure that hardware
broadcasting is mandated in later versions of the architecture.
If it is just ARMv6 which is affected by this (note that we can not rely
upon ID_MMFR3 telling us since this register has a different purpose
there) then we can solve this by just modifying the ARMv6 DMA cache
methods - which becomes easier if I get the rearrangement of the DMA
stuff finally sorted. The *only* thing that remains in the way of that
is the stupid flush_ioremap_region() crap for just one MTD driver.
^ permalink raw reply [flat|nested] 16+ messages in thread
* dma_cache_maint_contiguous should be patched asdma_cache_maint do
2009-12-10 16:39 ` Russell King - ARM Linux
@ 2009-12-10 17:15 ` Catalin Marinas
2009-12-10 17:39 ` Russell King - ARM Linux
2009-12-10 18:30 ` mkl lin
2009-12-10 21:21 ` dma_cache_maint_contiguous should be patched as dma_cache_maint do Nicolas Pitre
2009-12-11 10:56 ` dma_cache_maint_contiguous should be patched asdma_cache_maint do Catalin Marinas
2 siblings, 2 replies; 16+ messages in thread
From: Catalin Marinas @ 2009-12-10 17:15 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, 2009-12-10 at 16:39 +0000, Russell King - ARM Linux wrote:
> On Thu, Dec 10, 2009 at 03:35:07PM +0000, Catalin Marinas wrote:
> > On Thu, 2009-12-10 at 14:59 +0000, mkl lin wrote:
> > > CPU0 also called ata_scsi_queuecmd and is stuck on the spinlock already
> > > held by CPU1, with IRQs disabled. So CPU0 cannot handle the IPI and keeps
> > > spinning, while CPU1 keeps waiting for CPU0 to finish handling the IPI.
> >
> > This looks like a valid scenario, thanks for the analysis.
> >
> > The DMA broadcasting patch allows IPIs with interrupts disabled on the
> > issuing CPU, but this case is slightly different, and any IPI (not only
> > the DMA broadcasting one) would fail in the same way. Basically, you can't
> > issue an IPI while holding a lock that a different CPU may be spinning on
> > with interrupts disabled.
>
> I was rather expecting a deadlock report over the DMA cache ops broadcasting,
> which is why I've been holding off merging it.
It's good that people tested it in various configurations since my
simple tests haven't shown any problem. So I won't be pushing this
patch.
> > or (2) instead of broadcasting, do a "Read or Write For Ownership"
> > on the calling CPU before invoking the cache operation. By doing this
> > the current CPU becomes the owner of the cache lines and it can do cache
> > maintenance on them. For large buffers, this RFO/WFO is slower than IPI
> > but in the general case it may be actually quicker.
>
> This is probably the best we can do, and it will make DMA performance
> suck
For relatively small buffers, this may be faster than the whole IPI
procedure but you can't say for sure without benchmarking.
> Let's hope that all ARMv7 SMP implementations broadcast the cache
> operations.
They do broadcast both cache operations and TLB maintenance (but for the
latter we have an issue with global ASID allocation, see my corresponding
patch). Speculative accesses would actually make this impossible if the
broadcasting weren't done in hardware.
> If it is just ARMv6 which is affected by this (note that we can not rely
> upon ID_MMFR3 telling us since this register has a different purpose
> there) then we can solve this by just modifying the ARMv6 DMA cache
> methods - which becomes easier if I get the rearrangement of the DMA
> stuff finally sorted.
Good point, that's actually a quick modification (2-3 lines in each
cache function). Mac Lin, can you try the patch below (without the other
broadcasting patch)? If it works, I'll add some comment:
diff --git a/arch/arm/mm/cache-v6.S b/arch/arm/mm/cache-v6.S
index 295e25d..828a78e 100644
--- a/arch/arm/mm/cache-v6.S
+++ b/arch/arm/mm/cache-v6.S
@@ -210,6 +210,9 @@ ENTRY(v6_dma_inv_range)
mcrne p15, 0, r1, c7, c15, 1 @ clean & invalidate unified line
#endif
1:
+#ifdef CONFIG_SMP
+ str r0, [r0] @ write for ownership
+#endif
#ifdef HARVARD_CACHE
mcr p15, 0, r0, c7, c6, 1 @ invalidate D line
#else
@@ -230,6 +233,9 @@ ENTRY(v6_dma_inv_range)
ENTRY(v6_dma_clean_range)
bic r0, r0, #D_CACHE_LINE_SIZE - 1
1:
+#ifdef CONFIG_SMP
+ ldr r2, [r0] @ read for ownership
+#endif
#ifdef HARVARD_CACHE
mcr p15, 0, r0, c7, c10, 1 @ clean D line
#else
@@ -250,6 +256,10 @@ ENTRY(v6_dma_clean_range)
ENTRY(v6_dma_flush_range)
bic r0, r0, #D_CACHE_LINE_SIZE - 1
1:
+#ifdef CONFIG_SMP
+ ldr r2, [r0] @ read for ownership
+ str r2, [r0] @ write for ownership
+#endif
#ifdef HARVARD_CACHE
mcr p15, 0, r0, c7, c14, 1 @ clean & invalidate D line
#else
> The *only* thing that remains in the way of that
> is the stupid flush_ioremap_region() crap for just one MTD driver.
But that calls v6_dma_inv_range() anyway.
--
Catalin
^ permalink raw reply related [flat|nested] 16+ messages in thread
* dma_cache_maint_contiguous should be patched asdma_cache_maint do
2009-12-10 17:15 ` dma_cache_maint_contiguous should be patched asdma_cache_maint do Catalin Marinas
@ 2009-12-10 17:39 ` Russell King - ARM Linux
2009-12-10 18:30 ` mkl lin
1 sibling, 0 replies; 16+ messages in thread
From: Russell King - ARM Linux @ 2009-12-10 17:39 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, Dec 10, 2009 at 05:15:43PM +0000, Catalin Marinas wrote:
> On Thu, 2009-12-10 at 16:39 +0000, Russell King - ARM Linux wrote:
> > On Thu, Dec 10, 2009 at 03:35:07PM +0000, Catalin Marinas wrote:
> > > or (2) instead of broadcasting, do a "Read or Write For Ownership"
> > > on the calling CPU before invoking the cache operation. By doing this
> > > the current CPU becomes the owner of the cache lines and it can do cache
> > > maintenance on them. For large buffers, this RFO/WFO is slower than IPI
> > > but in the general case it may be actually quicker.
> >
> > This is probably the best we can do, and it will make DMA performance
> > suck
>
> For relatively small buffers, this may be faster than the whole IPI
> procedure but you can't say for sure without benchmarking.
I'm basing that comment on the efficiency of PIO block IO vs DMA block
IO, which are typically multiples of 4K. Having to read/write the
buffer is equivalent to PIO. As a comparison for a system with IDE,
for DMA you might get 10MB/s, with PIO maybe 3-4MB/s if you're lucky.
I would be very surprised if going down this route doesn't result in
block IO performance (and network performance) dropping by more than 60%
from the DMA figure (i.e. to less than DMA performance * 0.4).
> > Let's hope that all ARMv7 SMP implementations broadcast the cache
> > operations.
>
> They do broadcast both cache operations and TLB maintenance (but for the
> latter we have an issue with global ASID allocation, see my corresponding
> patch). Speculative accesses would actually make this impossible if the
> broadcasting weren't done in hardware.
Good - so we could optimize out the MMFR3 tests for ARMv7.
> > The *only* thing that remains in the way of that
> > is the stupid flush_ioremap_region() crap for just one MTD driver.
>
> But that calls v6_dma_inv_range() anyway.
Indeed, and it will be the _only_ caller of that function. That's my point:
why should we have something called "flush_ioremap_region" be the sole
caller of something called "dma_inv_range"?
^ permalink raw reply [flat|nested] 16+ messages in thread
* dma_cache_maint_contiguous should be patched asdma_cache_maint do
2009-12-10 17:15 ` dma_cache_maint_contiguous should be patched asdma_cache_maint do Catalin Marinas
2009-12-10 17:39 ` Russell King - ARM Linux
@ 2009-12-10 18:30 ` mkl lin
2009-12-10 18:45 ` mkl lin
1 sibling, 1 reply; 16+ messages in thread
From: mkl lin @ 2009-12-10 18:30 UTC (permalink / raw)
To: linux-arm-kernel
> Good point, that's actually a quick modification (2-3 lines in each
> cache function). Mac Lin, can you try the patch below (without the other
> broadcasting patch)? If it works, I'll add some comment:
I have tested with my patch, and it seems good. No data integrity issue, nor any deadlock.
Best Regards,
Mac Lin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smp.c.patch
Type: application/octet-stream
Size: 1159 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20091210/aed3f989/attachment-0001.obj>
^ permalink raw reply [flat|nested] 16+ messages in thread
* dma_cache_maint_contiguous should be patched asdma_cache_maint do
2009-12-10 18:30 ` mkl lin
@ 2009-12-10 18:45 ` mkl lin
2009-12-11 9:36 ` Catalin Marinas
0 siblings, 1 reply; 16+ messages in thread
From: mkl lin @ 2009-12-10 18:45 UTC (permalink / raw)
To: linux-arm-kernel
>> Good point, that's actually a quick modification (2-3 lines in each
>> cache function). Mac Lin, can you try the patch below (without the other
>> broadcasting patch)? If it works, I'll add some comment:
> I have tested with my patch, and it seems good. No data integrity issue, nor any deadlock.
BTW, read performance drops from 64MBps to 44MBps on my platform.
Best Regards,
Mac Lin
^ permalink raw reply [flat|nested] 16+ messages in thread
* dma_cache_maint_contiguous should be patched as dma_cache_maint do
2009-12-10 16:39 ` Russell King - ARM Linux
2009-12-10 17:15 ` dma_cache_maint_contiguous should be patched asdma_cache_maint do Catalin Marinas
@ 2009-12-10 21:21 ` Nicolas Pitre
2009-12-10 22:14 ` Russell King - ARM Linux
2009-12-11 10:56 ` dma_cache_maint_contiguous should be patched asdma_cache_maint do Catalin Marinas
2 siblings, 1 reply; 16+ messages in thread
From: Nicolas Pitre @ 2009-12-10 21:21 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, 10 Dec 2009, Russell King - ARM Linux wrote:
> If it is just ARMv6 which is affected by this (note that we can not rely
> upon ID_MMFR3 telling us since this register has a different purpose
> there) then we can solve this by just modifying the ARMv6 DMA cache
> methods - which becomes easier if I get the rearrangement of the DMA
> stuff finally sorted. The *only* thing that remains in the way of that
> is the stupid flush_ioremap_region() crap for just one MTD driver.
My suggestion: kill flush_ioremap_region() and substitute it by an ARMv5
cache range invalidate loop implementation local to that driver.
While at it, here's the patch:
-----
Subject: ARM: kill flush_ioremap_region()
There are not enough users to warrant its existence, and it is actually
an obstacle to progress with the new DMA API, which cannot cover this
case properly.
To keep backward compatibility, let's do the necessary cache maintenance
locally in the only driver affected.
Signed-off-by: Nicolas Pitre <nico@marvell.com>
diff --git a/arch/arm/include/asm/cacheflush.h b/arch/arm/include/asm/cacheflush.h
index 3d0cdd2..244fb97 100644
--- a/arch/arm/include/asm/cacheflush.h
+++ b/arch/arm/include/asm/cacheflush.h
@@ -456,13 +456,6 @@ static inline void flush_kernel_dcache_page(struct page *page)
*/
#define flush_icache_page(vma,page) do { } while (0)
-static inline void flush_ioremap_region(unsigned long phys, void __iomem *virt,
- unsigned offset, size_t size)
-{
- const void *start = (void __force *)virt + offset;
- dmac_inv_range(start, start + size);
-}
-
/*
* flush_cache_vmap() is used when creating mappings (eg, via vmap,
* vmalloc, ioremap etc) in kernel space for pages. On non-VIPT
diff --git a/arch/arm/mm/proc-syms.c b/arch/arm/mm/proc-syms.c
index ac5c800..f604aa8 100644
--- a/arch/arm/mm/proc-syms.c
+++ b/arch/arm/mm/proc-syms.c
@@ -28,7 +28,6 @@ EXPORT_SYMBOL(__cpuc_flush_user_all);
EXPORT_SYMBOL(__cpuc_flush_user_range);
EXPORT_SYMBOL(__cpuc_coherent_kern_range);
EXPORT_SYMBOL(__cpuc_flush_dcache_page);
-EXPORT_SYMBOL(dmac_inv_range); /* because of flush_ioremap_region() */
#else
EXPORT_SYMBOL(cpu_cache);
#endif
diff --git a/drivers/mtd/maps/pxa2xx-flash.c b/drivers/mtd/maps/pxa2xx-flash.c
index 74fa075..b13f641 100644
--- a/drivers/mtd/maps/pxa2xx-flash.c
+++ b/drivers/mtd/maps/pxa2xx-flash.c
@@ -20,14 +20,23 @@
#include <asm/io.h>
#include <mach/hardware.h>
-#include <asm/cacheflush.h>
#include <asm/mach/flash.h>
+#define CACHELINESIZE 32
+
static void pxa2xx_map_inval_cache(struct map_info *map, unsigned long from,
ssize_t len)
{
- flush_ioremap_region(map->phys, map->cached, from, len);
+ unsigned long start = (unsigned long)map->cached + from;
+ unsigned long end = start + len;
+
+ start &= ~(CACHELINESIZE - 1);
+ while (start < end) {
+ /* invalidate D cache line */
+ asm volatile ("mcr p15, 0, %0, c7, c6, 1" : : "r" (start));
+ start += CACHELINESIZE;
+ }
}
struct pxa2xx_flash_info {
^ permalink raw reply related [flat|nested] 16+ messages in thread
* dma_cache_maint_contiguous should be patched as dma_cache_maint do
2009-12-10 21:21 ` dma_cache_maint_contiguous should be patched as dma_cache_maint do Nicolas Pitre
@ 2009-12-10 22:14 ` Russell King - ARM Linux
2009-12-11 1:24 ` Nicolas Pitre
0 siblings, 1 reply; 16+ messages in thread
From: Russell King - ARM Linux @ 2009-12-10 22:14 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, Dec 10, 2009 at 04:21:29PM -0500, Nicolas Pitre wrote:
> On Thu, 10 Dec 2009, Russell King - ARM Linux wrote:
>
> > If it is just ARMv6 which is affected by this (note that we can not rely
> > upon ID_MMFR3 telling us since this register has a different purpose
> > there) then we can solve this by just modifying the ARMv6 DMA cache
> > methods - which becomes easier if I get the rearrangement of the DMA
> > stuff finally sorted. The *only* thing that remains in the way of that
> > is the stupid flush_ioremap_region() crap for just one MTD driver.
>
> My suggestion: kill flush_ioremap_region() and substitute it by an ARMv5
> cache range invalidate loop implementation local to that driver.
>
> While at it, here's the patch:
Great, can it end up in the patch system?
^ permalink raw reply [flat|nested] 16+ messages in thread
* dma_cache_maint_contiguous should be patched as dma_cache_maint do
2009-12-10 22:14 ` Russell King - ARM Linux
@ 2009-12-11 1:24 ` Nicolas Pitre
0 siblings, 0 replies; 16+ messages in thread
From: Nicolas Pitre @ 2009-12-11 1:24 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, 10 Dec 2009, Russell King - ARM Linux wrote:
> On Thu, Dec 10, 2009 at 04:21:29PM -0500, Nicolas Pitre wrote:
> > On Thu, 10 Dec 2009, Russell King - ARM Linux wrote:
> >
> > > If it is just ARMv6 which is affected by this (note that we can not rely
> > > upon ID_MMFR3 telling us since this register has a different purpose
> > > there) then we can solve this by just modifying the ARMv6 DMA cache
> > > methods - which becomes easier if I get the rearrangement of the DMA
> > > stuff finally sorted. The *only* thing that remains in the way of that
> > > is the stupid flush_ioremap_region() crap for just one MTD driver.
> >
> > My suggestion: kill flush_ioremap_region() and substitute it by an ARMv5
> > cache range invalidate loop implementation local to that driver.
> >
> > While at it, here's the patch:
>
> Great, can it end up in the patch system?
Patch #5848/1.
Nicolas
^ permalink raw reply [flat|nested] 16+ messages in thread
* dma_cache_maint_contiguous should be patched asdma_cache_maint do
2009-12-10 18:45 ` mkl lin
@ 2009-12-11 9:36 ` Catalin Marinas
2009-12-11 12:32 ` mkl lin
0 siblings, 1 reply; 16+ messages in thread
From: Catalin Marinas @ 2009-12-11 9:36 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, 2009-12-10 at 18:45 +0000, mkl lin wrote:
> >> Good point, that's actually a quick modification (2-3 lines in each
> >> cache function). Mac Lin, can you try the patch below (without the other
> >> broadcasting patch)? If it works, I'll add some comment:
> > I have tested with my patch, and it seems good. No data integrity
> > issue, nor any deadlock.
>
> BTW, read performance drops from 64MBps to 44MBps on my platform.
Is this compared to the IPI patch or to the UP (maxcpus=1) version?
You can try placing a "pld [r0, #32]" before the new ldr/str so that it
fetches the next cache line while the loop is dealing with the current
one.
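In C terms (a standalone sketch, not the actual assembly; __builtin_prefetch
is GCC's rough equivalent of pld here, and the 32-byte line size is an
assumption), the idea is to run the prefetch one cache line ahead of the
read-for-ownership access:

#include <stddef.h>

#define CACHE_LINE_SIZE 32

/* Read each cache line for ownership while prefetching the next one. */
static void touch_lines_prefetched(const void *buf, size_t size)
{
        const char *p = buf;
        const char *end = p + size;

        for (; p < end; p += CACHE_LINE_SIZE) {
                __builtin_prefetch(p + CACHE_LINE_SIZE);    /* like "pld [r0, #32]" */
                (void)*(const volatile unsigned long *)p;   /* read for ownership */
        }
}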
Thanks.
--
Catalin
^ permalink raw reply [flat|nested] 16+ messages in thread
* dma_cache_maint_contiguous should be patched asdma_cache_maint do
2009-12-10 16:39 ` Russell King - ARM Linux
2009-12-10 17:15 ` dma_cache_maint_contiguous should be patched asdma_cache_maint do Catalin Marinas
2009-12-10 21:21 ` dma_cache_maint_contiguous should be patched as dma_cache_maint do Nicolas Pitre
@ 2009-12-11 10:56 ` Catalin Marinas
2 siblings, 0 replies; 16+ messages in thread
From: Catalin Marinas @ 2009-12-11 10:56 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, 2009-12-10 at 16:39 +0000, Russell King - ARM Linux wrote:
> Let's hope that all ARMv7 SMP implementations broadcast the cache
> operations.
I double-checked with the architecture people and the ARMv7 MP extension
mandates cache maintenance broadcasting in hardware. I'll modify my
ARMv7 lazy cache flushing patch to avoid reading MMFR3 in this case.
--
Catalin
^ permalink raw reply [flat|nested] 16+ messages in thread
* dma_cache_maint_contiguous should be patched asdma_cache_maint do
2009-12-11 9:36 ` Catalin Marinas
@ 2009-12-11 12:32 ` mkl lin
2009-12-16 20:33 ` Russell King - ARM Linux
0 siblings, 1 reply; 16+ messages in thread
From: mkl lin @ 2009-12-11 12:32 UTC (permalink / raw)
To: linux-arm-kernel
>> BTW, read performance drops from 64MBps to 44MBps on my platform.
> Is this compared to the IPI patch or to the UP (maxcpus=1) version?
> You can try placing a "pld [r0, #32]" before the new ldr/str so that it
> fetches the next cache line while the loop is dealing with the current
> one.
I have redone all the tests.
Linux 2.6.31.1
pcie sata adapter, ahci.c
read 1GB file
1CPU:13.56/13.56/13.60 (sec) ~ 75.5MBps
SMP+IPI:16.29/16.14/16.05 (sec) ~ 63.8MBps
SMP+RFO/WFO:21.71/21.72/21.70 (sec) ~ 47.18MBps
SMP+RFO/WFO/pld:21.63/21.46/21.41 (sec) ~ 47.82MBps
Best Regards,
Mac Lin.
^ permalink raw reply [flat|nested] 16+ messages in thread
* dma_cache_maint_contiguous should be patched asdma_cache_maint do
2009-12-11 12:32 ` mkl lin
@ 2009-12-16 20:33 ` Russell King - ARM Linux
2009-12-17 5:08 ` mkl lin
0 siblings, 1 reply; 16+ messages in thread
From: Russell King - ARM Linux @ 2009-12-16 20:33 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, Dec 11, 2009 at 12:32:06PM +0000, mkl lin wrote:
>
> >> BTW, read performance drops from 64MBps to 44MBps on my platform.
> > Is this compared to the IPI patch or to the UP (maxcpus=1) version?
> > You can try placing a "pld [r0, #32]" before the new ldr/str so that it
> > fetches the next cache line while the loop is dealing with the current
> > one.
>
> I have redone all the tests.
>
> Linux 2.6.31.1
> pcie sata adapter, ahci.c
> read 1GB file
>
> 1CPU:13.56/13.56/13.60 (sec) ~ 75.5MBps
> SMP+IPI:16.29/16.14/16.05 (sec) ~ 63.8MBps
> SMP+RFO/WFO:21.71/21.72/21.70 (sec) ~ 47.18MBps
> SMP+RFO/WFO/pld:21.63/21.46/21.41 (sec) ~ 47.82MBps
Out of interest, how does the 1CPU version perform when you boot with
'libata.force=pio' specified to the kernel?
^ permalink raw reply [flat|nested] 16+ messages in thread
* dma_cache_maint_contiguous should be patched asdma_cache_maint do
2009-12-16 20:33 ` Russell King - ARM Linux
@ 2009-12-17 5:08 ` mkl lin
2009-12-18 12:49 ` Richard Liu
0 siblings, 1 reply; 16+ messages in thread
From: mkl lin @ 2009-12-17 5:08 UTC (permalink / raw)
To: linux-arm-kernel
>> Linux 2.6.31.1
>> pcie sata adapter, ahci.c
>> read 1GB file
>>
>> 1CPU:13.56/13.56/13.60 (sec) ~ 75.5MBps
>> SMP+IPI:16.29/16.14/16.05 (sec) ~ 63.8MBps
>> SMP+RFO/WFO:21.71/21.72/21.70 (sec) ~ 47.18MBps
>> SMP+RFO/WFO/pld:21.63/21.46/21.41 (sec) ~ 47.82MBps
>
> Out of interest, how does the 1CPU version perform when you boot with
> 'libata.force=pio' specified to the kernel?
Since my environment has changed, this data is not comparable to the previous results. Also, AHCI only supports DMA, so forcing it to PIO doesn't make a difference.
1CPU:17.15/17.07/17.10 ~ 59.98MBps
1CPU,with libata.force=pio: 17.02/16.86/17.03 ~ 60.73MBps
SMP+IPI: 21.20/21.36/21.88 ~ 48.30MBps
SMP+IPI,with libata.force=pio: 21.36/21.22/21.59 ~ 48.24MBps
Best Regards,
Mac Lin
^ permalink raw reply [flat|nested] 16+ messages in thread
* dma_cache_maint_contiguous should be patched asdma_cache_maint do
2009-12-17 5:08 ` mkl lin
@ 2009-12-18 12:49 ` Richard Liu
0 siblings, 0 replies; 16+ messages in thread
From: Richard Liu @ 2009-12-18 12:49 UTC (permalink / raw)
To: linux-arm-kernel
We see a very high IPI interrupt count with some network devices.
eth0 and eth1 are Intel e1000e devices configured as a bridge (Linux
TCP/IP bridge stack), and we use iperf to generate traffic between
eth0 and eth1. Here are the interrupt counters:
# cat /proc/interrupts
           CPU0       CPU1
pcie0      1703          0    GIC  PCIe0, eth0
pcie1      2067          0    GIC  PCIe1, eth1
IPI:       2487    2407340
LOC:      16709      16794
And the bridge throughput is only about 30% of that of a single-core system.
From previous experience, each packet gets flushed/invalidated twice, so an
IPI count of roughly twice the PCIe interrupt count would make sense. But we
observe extremely high IPI interrupt counts compared to the traditional cache
flush functions.
Is this issue caused by the network stack or by a weakness of the IPI approach?
mkl lin wrote:
>>> Linux 2.6.31.1
>>> pcie sata adapter, ahci.c
>>> read 1GB file
>>>
>>> 1CPU:13.56/13.56/13.60 (sec) ~ 75.5MBps
>>> SMP+IPI:16.29/16.14/16.05 (sec) ~ 63.8MBps
>>> SMP+RFO/WFO:21.71/21.72/21.70 (sec) ~ 47.18MBps
>>> SMP+RFO/WFO/pld:21.63/21.46/21.41 (sec) ~ 47.82MBps
>>>
>> Out of interest, how does the 1CPU version perform when you boot with
>> 'libata.force=pio' specified to the kernel?
>>
>
> Since my environment has changed, this data is not comparable to the previous results. Also, AHCI only supports DMA, so forcing it to PIO doesn't make a difference.
>
> 1CPU:17.15/17.07/17.10 ~ 59.98MBps
> 1CPU,with libata.force=pio: 17.02/16.86/17.03 ~ 60.73MBps
> SMP+IPI: 21.20/21.36/21.88 ~ 48.30MBps
> SMP+IPI,with libata.force=pio: 21.36/21.22/21.59 ~ 48.24MBps
>
> Best Regards,
> Mac Lin
>
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2009-12-18 12:49 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-12-10 14:59 dma_cache_maint_contiguous should be patched as dma_cache_maint do mkl lin
2009-12-10 15:35 ` Catalin Marinas
2009-12-10 16:39 ` Russell King - ARM Linux
2009-12-10 17:15 ` dma_cache_maint_contiguous should be patched asdma_cache_maint do Catalin Marinas
2009-12-10 17:39 ` Russell King - ARM Linux
2009-12-10 18:30 ` mkl lin
2009-12-10 18:45 ` mkl lin
2009-12-11 9:36 ` Catalin Marinas
2009-12-11 12:32 ` mkl lin
2009-12-16 20:33 ` Russell King - ARM Linux
2009-12-17 5:08 ` mkl lin
2009-12-18 12:49 ` Richard Liu
2009-12-10 21:21 ` dma_cache_maint_contiguous should be patched as dma_cache_maint do Nicolas Pitre
2009-12-10 22:14 ` Russell King - ARM Linux
2009-12-11 1:24 ` Nicolas Pitre
2009-12-11 10:56 ` dma_cache_maint_contiguous should be patched asdma_cache_maint do Catalin Marinas