* dma_cache_maint_contiguous should be patched as dma_cache_maint do
  From: mkl lin
  Date: 2009-12-10 14:59 UTC
  To: linux-arm-kernel

Hi,

I've got a deadlock while testing the SATA AHCI driver ahci.c with the new
IPI patch.

My platform is an ARM11 MPCore with 2 CPUs, Linux 2.6.31.1, SMP, with L1
cache. I did a simple write-read test with a PCIe SATA adapter and it got
stuck at the first or second round. Here is the call stack of each CPU:

CPU1:
  __smp_dma_cache_op
  dma_cache_maint_page
  dma_map_sg
  ata_qc_issue
  ata_scsi_translate
  ata_scsi_queuecmd
  scsi_request_fn
  __generic_unplug_device
  generic_unplug_device
  blk_unplug
  sync_buffer
  __wait_on_bit_lock
  out_of_line_wait_on_bit_lock
  lock_buffer
  __block_write_full_page

CPU0:
  __raw_spin_lock
  _spin_lock
  ata_scsi_queuecmd

Following is my guess:

CPU1 called ata_scsi_queuecmd, got the lock, and went all the way down into
__smp_dma_cache_op, which sends the IPI message and waits for the other CPU
to finish the cache operation (I guess so):

    while (!cpus_empty(data.unfinished))

CPU0 also called ata_scsi_queuecmd and got stuck on a spinlock, which is
held by CPU1, with IRQs disabled. So CPU0 cannot handle the IPI and keeps
spinning, while CPU1 keeps waiting for CPU0 to finish handling the IPI.

One strange thing is that whenever ata_scsi_queuecmd is called, interrupts
have already been disabled.

The attached patch would fix the bug, but I'm not sure whether it will
break other things or not.

Best Regards,
Mac Lin

-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch
Type: application/octet-stream
Size: 631 bytes
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20091210/520c074d/attachment-0001.obj>
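A minimal C sketch of the deadlock pattern described above; the lock, the
broadcast helper and the acknowledgement check are illustrative placeholders,
not the actual libata or arch/arm/kernel/smp.c symbols:

	#include <linux/spinlock.h>

	static DEFINE_SPINLOCK(host_lock);	/* stands in for the ata host lock */

	/* Hypothetical stand-ins for the IPI send/ack machinery */
	static void broadcast_dma_cache_op(void) { /* sends the cache-op IPI */ }
	static int other_cpus_acked(void) { return 0; /* CPU0 never acks: IRQs off */ }

	/* CPU1: takes the lock with IRQs off, then waits for the other CPU's ack */
	static void cpu1_path(void)
	{
		unsigned long flags;

		spin_lock_irqsave(&host_lock, flags);	/* ata_scsi_queuecmd */
		broadcast_dma_cache_op();		/* __smp_dma_cache_op sends IPI */
		while (!other_cpus_acked())		/* spins forever in this scenario */
			cpu_relax();
		spin_unlock_irqrestore(&host_lock, flags);
	}

	/* CPU0: IRQs already disabled, spins on the same lock, never sees the IPI */
	static void cpu0_path(void)
	{
		spin_lock(&host_lock);			/* ata_scsi_queuecmd, IRQs off */
		/* ... command submission ... */
		spin_unlock(&host_lock);
	}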
* dma_cache_maint_contiguous should be patched as dma_cache_maint do
  From: Catalin Marinas
  Date: 2009-12-10 15:35 UTC
  To: linux-arm-kernel

On Thu, 2009-12-10 at 14:59 +0000, mkl lin wrote:
> I've got a deadlock while testing the SATA AHCI driver ahci.c with the new
> IPI patch.
>
> My platform is an ARM11 MPCore with 2 CPUs, Linux 2.6.31.1, SMP, with L1
> cache. I did a simple write-read test with a PCIe SATA adapter and it got
> stuck at the first or second round.
[...]
> Following is my guess:
>
> CPU1 called ata_scsi_queuecmd, got the lock, and went all the way down into
> __smp_dma_cache_op, which sends the IPI message and waits for the other CPU
> to finish the cache operation (I guess so):
>
>     while (!cpus_empty(data.unfinished))
>
> CPU0 also called ata_scsi_queuecmd and got stuck on a spinlock, which is
> held by CPU1, with IRQs disabled. So CPU0 cannot handle the IPI and keeps
> spinning, while CPU1 keeps waiting for CPU0 to finish handling the IPI.

This looks like a valid scenario, thanks for the analysis.

The DMA broadcasting patch allows IPIs with interrupts disabled on the
issuing CPU, but that's slightly different, and any IPI (not only the DMA
broadcasting one) would fail. Basically you can't issue an IPI and wait for
it to complete while holding a lock that a different CPU may be spinning on
with interrupts disabled.

Your patch allows interrupts to be taken while in the ata_scsi_queuecmd()
function, but this may have side effects as this function is assumed to be
called with interrupts disabled.

So it leaves us with two workarounds: (1) re-implement this DMA cache ops
broadcasting with FIQs, if the hardware (GIC) configuration allows this, or
(2) instead of broadcasting, do a "Read or Write For Ownership" on the
calling CPU before invoking the cache operation. By doing this the current
CPU becomes the owner of the cache lines and it can do cache maintenance on
them. For large buffers this RFO/WFO is slower than the IPI, but in the
general case it may actually be quicker.

If you have time (it should only take an hour or so, but I don't have
hardware to test it), I would be grateful if you could try the following:

 * Drop my DMA broadcasting patch.
 * In dma_cache_maint (and the contiguous one), do the following based on
   the direction:
   * TO_DEVICE: read each cache line in the buffer (you can read a long
     variable every 32 bytes) before the local cache maintenance
     operations. This ensures that any dirty cache lines on other CPUs are
     transferred to L2 (or main memory) or to the current CPU, so the cache
     operation has the intended effect. The cache lines on other CPUs may
     change to a Shared state (see the MESI protocol).
   * FROM_DEVICE: we don't care about the buffer, just write 0 to each
     cache line (as above, you only need to write a long every 32 bytes).
     This ensures that the cache lines become Exclusive to the current CPU
     (no other CPU has any copies), so the invalidation ensures that there
     are no cache lines left on any CPU.
   * BIDIRECTIONAL: read a word in each cache line and write it back. After
     the cache clean & invalidate, the cache lines are removed from all the
     existing CPUs.

ARM11 MPCore doesn't do speculative accesses, so the above should be safe.

Thanks.

--
Catalin
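A rough C sketch of the read/write-for-ownership idea described above; the
function name, the hard-coded 32-byte line size and the trailing call to the
local cache maintenance are assumptions for illustration, not the real
dma_cache_maint() code:

	#include <linux/dma-mapping.h>

	#define LINE_SIZE	32	/* assumed L1 D-cache line size on ARM11 MPCore */

	/* Touch every cache line so the current CPU owns it, then perform the
	 * ordinary (non-broadcast) cache maintenance on the range. */
	static void dma_cache_own_range(void *start, size_t size,
					enum dma_data_direction dir)
	{
		volatile unsigned long *p = start;
		volatile unsigned long *end = start + size;

		for (; p < end; p += LINE_SIZE / sizeof(*p)) {
			switch (dir) {
			case DMA_TO_DEVICE:		/* read for ownership */
				(void)*p;		/* pulls dirty lines to this CPU */
				break;
			case DMA_FROM_DEVICE:		/* write for ownership */
				*p = 0;			/* buffer contents don't matter */
				break;
			case DMA_BIDIRECTIONAL:		/* read then write the same word */
				*p = *p;
				break;
			default:
				break;
			}
		}

		/* ...then the usual local clean/invalidate of [start, start + size) */
	}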
* dma_cache_maint_contiguous should be patched as dma_cache_maint do
  From: Russell King - ARM Linux
  Date: 2009-12-10 16:39 UTC
  To: linux-arm-kernel

On Thu, Dec 10, 2009 at 03:35:07PM +0000, Catalin Marinas wrote:
> On Thu, 2009-12-10 at 14:59 +0000, mkl lin wrote:
> > I've got a deadlock while testing the SATA AHCI driver ahci.c with the
> > new IPI patch.
> >
> > My platform is an ARM11 MPCore with 2 CPUs, Linux 2.6.31.1, SMP, with L1
> > cache. I did a simple write-read test with a PCIe SATA adapter and it
> > got stuck at the first or second round.
> [...]
> > Following is my guess:
> >
> > CPU1 called ata_scsi_queuecmd, got the lock, and went all the way down
> > into __smp_dma_cache_op, which sends the IPI message and waits for the
> > other CPU to finish the cache operation (I guess so):
> >
> >     while (!cpus_empty(data.unfinished))
> >
> > CPU0 also called ata_scsi_queuecmd and got stuck on a spinlock, which is
> > held by CPU1, with IRQs disabled. So CPU0 cannot handle the IPI and
> > keeps spinning, while CPU1 keeps waiting for CPU0 to finish handling
> > the IPI.
>
> This looks like a valid scenario, thanks for the analysis.
>
> The DMA broadcasting patch allows IPIs with interrupts disabled on the
> issuing CPU, but that's slightly different, and any IPI (not only the DMA
> broadcasting one) would fail. Basically you can't issue an IPI and wait
> for it to complete while holding a lock that a different CPU may be
> spinning on with interrupts disabled.

I was rather expecting a deadlock report over the DMA cache ops
broadcasting, which is why I've been holding off merging it.

> Your patch allows interrupts to be taken while in the ata_scsi_queuecmd()
> function, but this may have side effects as this function is assumed to
> be called with interrupts disabled.

Having functions which enable interrupts while the parent is supposed to be
in an atomic context is definitely a recipe for things to go badly wrong -
this is not a solution anyone in their right mind should contemplate.

> So it leaves us with two workarounds: (1) re-implement this DMA cache ops
> broadcasting with FIQs, if the hardware (GIC) configuration allows this

I don't think the GIC allows FIQ IPIs - and even if it did, it presents
certain problems for people - the kernel would have to become the sole
owner of FIQs on SMP, and we'd have to prevent other people using them.
They'd certainly not be "fast" interrupts anymore, neither would they be
suitable for software-driven DMA (which is what FIQ gets used for on some
platforms).

> or (2) instead of broadcasting, do a "Read or Write For Ownership" on the
> calling CPU before invoking the cache operation. By doing this the
> current CPU becomes the owner of the cache lines and it can do cache
> maintenance on them. For large buffers this RFO/WFO is slower than the
> IPI, but in the general case it may actually be quicker.

This is probably the best we can do, and it will make DMA performance suck
- thankfully, later SMP CPUs have solved this whole cache IPI issue by
broadcasting the cache operations in hardware (which is actually the only
sane solution to this problem).

Let's hope that all ARMv7 SMP implementations broadcast the cache
operations. For the time being, I'm going to assume that this is the case.
If this is not already the case, I think ARM Ltd needs to ensure that
hardware broadcasting is mandated in later versions of the architecture.

If it is just ARMv6 which is affected by this (note that we can not rely
upon ID_MMFR3 telling us since this register has a different purpose there)
then we can solve this by just modifying the ARMv6 DMA cache methods -
which becomes easier if I get the rearrangement of the DMA stuff finally
sorted. The *only* thing that remains in the way of that is the stupid
flush_ioremap_region() crap for just one MTD driver.
* dma_cache_maint_contiguous should be patched as dma_cache_maint do
  From: Catalin Marinas
  Date: 2009-12-10 17:15 UTC
  To: linux-arm-kernel

On Thu, 2009-12-10 at 16:39 +0000, Russell King - ARM Linux wrote:
> On Thu, Dec 10, 2009 at 03:35:07PM +0000, Catalin Marinas wrote:
> > On Thu, 2009-12-10 at 14:59 +0000, mkl lin wrote:
> > > CPU0 also called ata_scsi_queuecmd and got stuck on a spinlock, which
> > > is held by CPU1, with IRQs disabled. So CPU0 cannot handle the IPI and
> > > keeps spinning, while CPU1 keeps waiting for CPU0 to finish handling
> > > the IPI.
> >
> > This looks like a valid scenario, thanks for the analysis.
> >
> > The DMA broadcasting patch allows IPIs with interrupts disabled on the
> > issuing CPU, but that's slightly different, and any IPI (not only the
> > DMA broadcasting one) would fail. Basically you can't issue an IPI and
> > wait for it to complete while holding a lock that a different CPU may
> > be spinning on with interrupts disabled.
>
> I was rather expecting a deadlock report over the DMA cache ops
> broadcasting, which is why I've been holding off merging it.

It's good that people tested it in various configurations, since my simple
tests haven't shown any problem. So I won't be pushing this patch.

> > or (2) instead of broadcasting, do a "Read or Write For Ownership" on
> > the calling CPU before invoking the cache operation. By doing this the
> > current CPU becomes the owner of the cache lines and it can do cache
> > maintenance on them. For large buffers this RFO/WFO is slower than the
> > IPI, but in the general case it may actually be quicker.
>
> This is probably the best we can do, and it will make DMA performance
> suck

For relatively small buffers this may be faster than the whole IPI
procedure, but you can't say for sure without benchmarking.

> Let's hope that all ARMv7 SMP implementations broadcast the cache
> operations.

They do broadcast both cache operations and TLB maintenance (but for the
latter we have an issue with global ASID allocation, see my corresponding
patch). Speculative accesses would actually make this impossible if
broadcasting weren't done in hardware.

> If it is just ARMv6 which is affected by this (note that we can not rely
> upon ID_MMFR3 telling us since this register has a different purpose
> there) then we can solve this by just modifying the ARMv6 DMA cache
> methods - which becomes easier if I get the rearrangement of the DMA
> stuff finally sorted.

Good point, that's actually a quick modification (2-3 lines in each cache
function). Mac Lin, can you try the patch below (without the other
broadcasting patch)? If it works, I'll add some comments:

diff --git a/arch/arm/mm/cache-v6.S b/arch/arm/mm/cache-v6.S
index 295e25d..828a78e 100644
--- a/arch/arm/mm/cache-v6.S
+++ b/arch/arm/mm/cache-v6.S
@@ -210,6 +210,9 @@ ENTRY(v6_dma_inv_range)
 	mcrne	p15, 0, r1, c7, c15, 1		@ clean & invalidate unified line
 #endif
 1:
+#ifdef CONFIG_SMP
+	str	r0, [r0]			@ write for ownership
+#endif
 #ifdef HARVARD_CACHE
 	mcr	p15, 0, r0, c7, c6, 1		@ invalidate D line
 #else
@@ -230,6 +233,9 @@ ENTRY(v6_dma_inv_range)
 ENTRY(v6_dma_clean_range)
 	bic	r0, r0, #D_CACHE_LINE_SIZE - 1
 1:
+#ifdef CONFIG_SMP
+	ldr	r2, [r0]			@ read for ownership
+#endif
 #ifdef HARVARD_CACHE
 	mcr	p15, 0, r0, c7, c10, 1		@ clean D line
 #else
@@ -250,6 +256,10 @@ ENTRY(v6_dma_clean_range)
 ENTRY(v6_dma_flush_range)
 	bic	r0, r0, #D_CACHE_LINE_SIZE - 1
 1:
+#ifdef CONFIG_SMP
+	ldr	r2, [r0]			@ read for ownership
+	str	r2, [r0]			@ write for ownership
+#endif
 #ifdef HARVARD_CACHE
 	mcr	p15, 0, r0, c7, c14, 1		@ clean & invalidate D line
 #else

> The *only* thing that remains in the way of that is the stupid
> flush_ioremap_region() crap for just one MTD driver.

But that calls v6_dma_inv_range() anyway.

--
Catalin
* dma_cache_maint_contiguous should be patched as dma_cache_maint do
  From: Russell King - ARM Linux
  Date: 2009-12-10 17:39 UTC
  To: linux-arm-kernel

On Thu, Dec 10, 2009 at 05:15:43PM +0000, Catalin Marinas wrote:
> On Thu, 2009-12-10 at 16:39 +0000, Russell King - ARM Linux wrote:
> > On Thu, Dec 10, 2009 at 03:35:07PM +0000, Catalin Marinas wrote:
> > > or (2) instead of broadcasting, do a "Read or Write For Ownership" on
> > > the calling CPU before invoking the cache operation. By doing this
> > > the current CPU becomes the owner of the cache lines and it can do
> > > cache maintenance on them. For large buffers this RFO/WFO is slower
> > > than the IPI, but in the general case it may actually be quicker.
> >
> > This is probably the best we can do, and it will make DMA performance
> > suck
>
> For relatively small buffers this may be faster than the whole IPI
> procedure, but you can't say for sure without benchmarking.

I'm basing that comment on the efficiency of PIO block IO vs DMA block IO,
which are typically multiples of 4K. Having to read/write the buffer is
equivalent to PIO.

As a comparison, on a system with IDE you might get 10MB/s with DMA, and
maybe 3-4MB/s with PIO if you're lucky. I would be very surprised if going
down this route doesn't result in block IO data performance (and network
performance) dropping by more than 60% of the DMA value (that's DMA
performance * 0.4).

> > Let's hope that all ARMv7 SMP implementations broadcast the cache
> > operations.
>
> They do broadcast both cache operations and TLB maintenance (but for the
> latter we have an issue with global ASID allocation, see my corresponding
> patch). Speculative accesses would actually make this impossible if
> broadcasting weren't done in hardware.

Good - so we could optimize out the MMFR3 tests for ARMv7.

> > The *only* thing that remains in the way of that is the stupid
> > flush_ioremap_region() crap for just one MTD driver.
>
> But that calls v6_dma_inv_range() anyway.

Indeed, and it will be the _only_ caller of that function. That's my point:
why should we have something called "flush_ioremap_region" be the sole
caller of something called "dma_inv_range"?
* dma_cache_maint_contiguous should be patched as dma_cache_maint do
  From: mkl lin
  Date: 2009-12-10 18:30 UTC
  To: linux-arm-kernel

> Good point, that's actually a quick modification (2-3 lines in each cache
> function). Mac Lin, can you try the patch below (without the other
> broadcasting patch)? If it works, I'll add some comments:

I have tested with my patch, and it seems good. No data integrity issue,
nor any deadlock.

Best Regards,
Mac Lin

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smp.c.patch
Type: application/octet-stream
Size: 1159 bytes
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20091210/aed3f989/attachment-0001.obj>
* dma_cache_maint_contiguous should be patched as dma_cache_maint do
  From: mkl lin
  Date: 2009-12-10 18:45 UTC
  To: linux-arm-kernel

>> Good point, that's actually a quick modification (2-3 lines in each cache
>> function). Mac Lin, can you try the patch below (without the other
>> broadcasting patch)? If it works, I'll add some comments:
>
> I have tested with my patch, and it seems good. No data integrity issue,
> nor any deadlock.

BTW, read performance drops from 64MBps to 44MBps on my platform.

Best Regards,
Mac Lin
* dma_cache_maint_contiguous should be patched as dma_cache_maint do
  From: Catalin Marinas
  Date: 2009-12-11 9:36 UTC
  To: linux-arm-kernel

On Thu, 2009-12-10 at 18:45 +0000, mkl lin wrote:
> >> Good point, that's actually a quick modification (2-3 lines in each
> >> cache function). Mac Lin, can you try the patch below (without the
> >> other broadcasting patch)? If it works, I'll add some comments:
> >
> > I have tested with my patch, and it seems good. No data integrity
> > issue, nor any deadlock.
>
> BTW, read performance drops from 64MBps to 44MBps on my platform.

Is this compared to the IPI patch or to the UP (maxcpus=1) version?

You can try placing a "pld [r0, #32]" before the new ldr/str so that it
fetches the next cache line while the loop is dealing with the current one.

Thanks.

--
Catalin
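For illustration, the prefetch would sit just before the ownership access in,
for example, v6_dma_clean_range from the earlier cache-v6.S patch; this is a
sketch of the suggestion above, not a tested change:

	#ifdef CONFIG_SMP
		pld	[r0, #32]		@ prefetch the next cache line
		ldr	r2, [r0]		@ read for ownership
	#endif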
* dma_cache_maint_contiguous should be patched as dma_cache_maint do
  From: mkl lin
  Date: 2009-12-11 12:32 UTC
  To: linux-arm-kernel

>> BTW, read performance drops from 64MBps to 44MBps on my platform.
>
> Is this compared to the IPI patch or to the UP (maxcpus=1) version?
>
> You can try placing a "pld [r0, #32]" before the new ldr/str so that it
> fetches the next cache line while the loop is dealing with the current
> one.

I have redone all the tests:

Linux 2.6.31.1
PCIe SATA adapter, ahci.c
read of a 1GB file

1CPU:            13.56/13.56/13.60 (sec) ~ 75.5 MBps
SMP+IPI:         16.29/16.14/16.05 (sec) ~ 63.8 MBps
SMP+RFO/WFO:     21.71/21.72/21.70 (sec) ~ 47.18 MBps
SMP+RFO/WFO/pld: 21.63/21.46/21.41 (sec) ~ 47.82 MBps

Best Regards,
Mac Lin
* dma_cache_maint_contiguous should be patched as dma_cache_maint do
  From: Russell King - ARM Linux
  Date: 2009-12-16 20:33 UTC
  To: linux-arm-kernel

On Fri, Dec 11, 2009 at 12:32:06PM +0000, mkl lin wrote:
> I have redone all the tests:
>
> Linux 2.6.31.1
> PCIe SATA adapter, ahci.c
> read of a 1GB file
>
> 1CPU:            13.56/13.56/13.60 (sec) ~ 75.5 MBps
> SMP+IPI:         16.29/16.14/16.05 (sec) ~ 63.8 MBps
> SMP+RFO/WFO:     21.71/21.72/21.70 (sec) ~ 47.18 MBps
> SMP+RFO/WFO/pld: 21.63/21.46/21.41 (sec) ~ 47.82 MBps

Out of interest, how does the 1CPU version perform when you boot with
'libata.force=pio' specified to the kernel?
* dma_cache_maint_contiguous should be patched as dma_cache_maint do
  From: mkl lin
  Date: 2009-12-17 5:08 UTC
  To: linux-arm-kernel

>> Linux 2.6.31.1
>> PCIe SATA adapter, ahci.c
>> read of a 1GB file
>>
>> 1CPU:            13.56/13.56/13.60 (sec) ~ 75.5 MBps
>> SMP+IPI:         16.29/16.14/16.05 (sec) ~ 63.8 MBps
>> SMP+RFO/WFO:     21.71/21.72/21.70 (sec) ~ 47.18 MBps
>> SMP+RFO/WFO/pld: 21.63/21.46/21.41 (sec) ~ 47.82 MBps
>
> Out of interest, how does the 1CPU version perform when you boot with
> 'libata.force=pio' specified to the kernel?

My environment has changed, so these numbers are not comparable to the
previous ones. Also, AHCI only supports DMA, so forcing it to PIO doesn't
make a difference:

1CPU:                            17.15/17.07/17.10 ~ 59.98 MBps
1CPU, with libata.force=pio:     17.02/16.86/17.03 ~ 60.73 MBps
SMP+IPI:                         21.20/21.36/21.88 ~ 48.30 MBps
SMP+IPI, with libata.force=pio:  21.36/21.22/21.59 ~ 48.24 MBps

Best Regards,
Mac Lin
* dma_cache_maint_contiguous should be patched as dma_cache_maint do
  From: Richard Liu
  Date: 2009-12-18 12:49 UTC
  To: linux-arm-kernel

The IPI interrupt count is very high with some network devices.

eth0 and eth1 are Intel e1000e devices and were configured as a bridge
(the Linux TCP/IP bridge stack). I used iperf to generate traffic between
eth0 and eth1. Here are the interrupt counters:

# cat /proc/interrupts
             CPU0       CPU1
pcie0:       1703          0   GIC   PCIe0, eth0
pcie1:       2067          0   GIC   PCIe1, eth1
IPI:         2487    2407340
LOC:        16709      16794

And the bridge throughput is only 30% of that of a single-core system.

From previous experience we flush/invalidate each packet twice, so seeing
twice as many IPI interrupts as PCIe interrupts would make sense. But we
observe extremely high IPI interrupt counts compared to the traditional
cache flush functions. Is this issue caused by the network stack or by a
weakness of the IPI approach?

mkl lin wrote:
>>> Linux 2.6.31.1
>>> PCIe SATA adapter, ahci.c
>>> read of a 1GB file
>>>
>>> 1CPU:            13.56/13.56/13.60 (sec) ~ 75.5 MBps
>>> SMP+IPI:         16.29/16.14/16.05 (sec) ~ 63.8 MBps
>>> SMP+RFO/WFO:     21.71/21.72/21.70 (sec) ~ 47.18 MBps
>>> SMP+RFO/WFO/pld: 21.63/21.46/21.41 (sec) ~ 47.82 MBps
>>
>> Out of interest, how does the 1CPU version perform when you boot with
>> 'libata.force=pio' specified to the kernel?
>
> My environment has changed, so these numbers are not comparable to the
> previous ones. Also, AHCI only supports DMA, so forcing it to PIO doesn't
> make a difference:
>
> 1CPU:                            17.15/17.07/17.10 ~ 59.98 MBps
> 1CPU, with libata.force=pio:     17.02/16.86/17.03 ~ 60.73 MBps
> SMP+IPI:                         21.20/21.36/21.88 ~ 48.30 MBps
> SMP+IPI, with libata.force=pio:  21.36/21.22/21.59 ~ 48.24 MBps
>
> Best Regards,
> Mac Lin

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20091218/36e5ff45/attachment-0001.htm>
* dma_cache_maint_contiguous should be patched as dma_cache_maint do
  From: Nicolas Pitre
  Date: 2009-12-10 21:21 UTC
  To: linux-arm-kernel

On Thu, 10 Dec 2009, Russell King - ARM Linux wrote:

> If it is just ARMv6 which is affected by this (note that we can not rely
> upon ID_MMFR3 telling us since this register has a different purpose
> there) then we can solve this by just modifying the ARMv6 DMA cache
> methods - which becomes easier if I get the rearrangement of the DMA
> stuff finally sorted. The *only* thing that remains in the way of that
> is the stupid flush_ioremap_region() crap for just one MTD driver.

My suggestion: kill flush_ioremap_region() and substitute it with an ARMv5
cache range invalidate loop implemented locally in that driver.

While at it, here's the patch:

-----

Subject: ARM: kill flush_ioremap_region()

There are not enough users to warrant its existence, and it is actually an
obstacle to progress with the new DMA API which cannot cover this case
properly.

To keep backward compatibility, let's do the necessary cache maintenance
locally in the only driver affected.

Signed-off-by: Nicolas Pitre <nico@marvell.com>

diff --git a/arch/arm/include/asm/cacheflush.h b/arch/arm/include/asm/cacheflush.h
index 3d0cdd2..244fb97 100644
--- a/arch/arm/include/asm/cacheflush.h
+++ b/arch/arm/include/asm/cacheflush.h
@@ -456,13 +456,6 @@ static inline void flush_kernel_dcache_page(struct page *page)
  */
 #define flush_icache_page(vma,page)	do { } while (0)
 
-static inline void flush_ioremap_region(unsigned long phys, void __iomem *virt,
-	unsigned offset, size_t size)
-{
-	const void *start = (void __force *)virt + offset;
-	dmac_inv_range(start, start + size);
-}
-
 /*
  * flush_cache_vmap() is used when creating mappings (eg, via vmap,
  * vmalloc, ioremap etc) in kernel space for pages.  On non-VIPT
diff --git a/arch/arm/mm/proc-syms.c b/arch/arm/mm/proc-syms.c
index ac5c800..f604aa8 100644
--- a/arch/arm/mm/proc-syms.c
+++ b/arch/arm/mm/proc-syms.c
@@ -28,7 +28,6 @@ EXPORT_SYMBOL(__cpuc_flush_user_all);
 EXPORT_SYMBOL(__cpuc_flush_user_range);
 EXPORT_SYMBOL(__cpuc_coherent_kern_range);
 EXPORT_SYMBOL(__cpuc_flush_dcache_page);
-EXPORT_SYMBOL(dmac_inv_range);	/* because of flush_ioremap_region() */
 #else
 EXPORT_SYMBOL(cpu_cache);
 #endif
diff --git a/drivers/mtd/maps/pxa2xx-flash.c b/drivers/mtd/maps/pxa2xx-flash.c
index 74fa075..b13f641 100644
--- a/drivers/mtd/maps/pxa2xx-flash.c
+++ b/drivers/mtd/maps/pxa2xx-flash.c
@@ -20,14 +20,23 @@
 #include <asm/io.h>
 #include <mach/hardware.h>
-#include <asm/cacheflush.h>
 #include <asm/mach/flash.h>
 
+#define CACHELINESIZE	32
+
 static void pxa2xx_map_inval_cache(struct map_info *map, unsigned long from,
 				      ssize_t len)
 {
-	flush_ioremap_region(map->phys, map->cached, from, len);
+	unsigned long start = (unsigned long)map->cached + from;
+	unsigned long end = start + len;
+
+	start &= ~(CACHELINESIZE - 1);
+	while (start < end) {
+		/* invalidate D cache line */
+		asm volatile ("mcr p15, 0, %0, c7, c6, 1" : : "r" (start));
+		start += CACHELINESIZE;
+	}
 }
 
 struct pxa2xx_flash_info {
* dma_cache_maint_contiguous should be patched as dma_cache_maint do
  From: Russell King - ARM Linux
  Date: 2009-12-10 22:14 UTC
  To: linux-arm-kernel

On Thu, Dec 10, 2009 at 04:21:29PM -0500, Nicolas Pitre wrote:
> On Thu, 10 Dec 2009, Russell King - ARM Linux wrote:
>
> > If it is just ARMv6 which is affected by this (note that we can not
> > rely upon ID_MMFR3 telling us since this register has a different
> > purpose there) then we can solve this by just modifying the ARMv6 DMA
> > cache methods - which becomes easier if I get the rearrangement of the
> > DMA stuff finally sorted. The *only* thing that remains in the way of
> > that is the stupid flush_ioremap_region() crap for just one MTD driver.
>
> My suggestion: kill flush_ioremap_region() and substitute it with an
> ARMv5 cache range invalidate loop implemented locally in that driver.
>
> While at it, here's the patch:

Great, can it end up in the patch system?
* dma_cache_maint_contiguous should be patched as dma_cache_maint do
  From: Nicolas Pitre
  Date: 2009-12-11 1:24 UTC
  To: linux-arm-kernel

On Thu, 10 Dec 2009, Russell King - ARM Linux wrote:

> On Thu, Dec 10, 2009 at 04:21:29PM -0500, Nicolas Pitre wrote:
> > My suggestion: kill flush_ioremap_region() and substitute it with an
> > ARMv5 cache range invalidate loop implemented locally in that driver.
> >
> > While at it, here's the patch:
>
> Great, can it end up in the patch system?

Patch #5848/1.

Nicolas
* dma_cache_maint_contiguous should be patched as dma_cache_maint do
  From: Catalin Marinas
  Date: 2009-12-11 10:56 UTC
  To: linux-arm-kernel

On Thu, 2009-12-10 at 16:39 +0000, Russell King - ARM Linux wrote:
> Let's hope that all ARMv7 SMP implementations broadcast the cache
> operations.

I double-checked with the architecture people, and the ARMv7 MP extension
mandates cache maintenance broadcasting in hardware. I'll modify my ARMv7
lazy cache flushing patch to avoid reading MMFR3 in this case.

--
Catalin