linux-arm-kernel.lists.infradead.org archive mirror
* USB mass storage and ARM cache coherency
       [not found] ` <1265045354.25750.52.camel@pc1117.cambridge.arm.com>
@ 2010-02-08  6:55   ` Pavel Machek
  2010-02-08  7:33     ` Andreas Mohr
  2010-02-08  9:51     ` Catalin Marinas
  0 siblings, 2 replies; 155+ messages in thread
From: Pavel Machek @ 2010-02-08  6:55 UTC (permalink / raw)
  To: linux-arm-kernel

Hi!

> > So, let's put this in the HCD drivers and be done with it.
> 
> The patch below is what fixes the I-D cache incoherency issues on ARM. I
> don't particularly like the solution but it seems to be the only one
> available.

Really? It looks like ARM should just flush the caches when mapping an
executable page into userspace... you can't expect all the drivers
to be modified like that.

Plus it does unnecessary flushes on x86, etc...

> @@ -904,6 +906,14 @@ __acquires(priv->lock)
>  			status = 0;
>  	}
>  
> +	if (usb_pipein(urb->pipe) && usb_pipetype(urb->pipe) == PIPE_BULK) {
> +		void *ptr;
> +		for (ptr = urb->transfer_buffer;
> +		     ptr < urb->transfer_buffer + urb->transfer_buffer_length;
> +		     ptr += PAGE_SIZE)
> +			flush_dcache_page(virt_to_page(ptr));
> +	}
> +
>  	/* complete() can reenter this HCD */
>  	usb_hcd_unlink_urb_from_ep(priv_to_hcd(priv), urb);
>  	spin_unlock(&priv->lock);
> 
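
A side note on the quoted loop: it starts at the (possibly unaligned)
buffer address and advances in PAGE_SIZE strides, so a buffer that begins
mid-page and crosses a page boundary can leave its last page unvisited.
Below is a minimal user-space sketch of a walk that rounds both ends down
to page boundaries. It assumes 4 KiB pages, and `sketch_flush_page` is a
hypothetical stand-in for `flush_dcache_page(virt_to_page(ptr))`; none of
this is kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for flush_dcache_page(); it only counts calls. */
static unsigned int sketch_flush_calls;

static void sketch_flush_page(uintptr_t page_base)
{
	assert(page_base % SKETCH_PAGE_SIZE == 0);
	sketch_flush_calls++;
}

/*
 * Visit every page overlapped by [buf, buf + len), including a partial
 * first and last page. Returns the number of pages visited.
 */
static unsigned int sketch_flush_buffer(uintptr_t buf, size_t len)
{
	uintptr_t page, last;
	unsigned int n = 0;

	if (buf == 0 || len == 0)	/* no CPU-visible buffer: nothing to do */
		return 0;

	page = buf & ~(uintptr_t)(SKETCH_PAGE_SIZE - 1);
	last = (buf + len - 1) & ~(uintptr_t)(SKETCH_PAGE_SIZE - 1);
	for (;; page += SKETCH_PAGE_SIZE) {
		sketch_flush_page(page);
		n++;
		if (page == last)
			break;
	}
	return n;
}
```

With a buffer at 0x1F00 of length 0x200, the quoted loop would touch only
the first page, while the walk above touches both.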

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-08  6:55   ` USB mass storage and ARM cache coherency Pavel Machek
@ 2010-02-08  7:33     ` Andreas Mohr
  2010-02-08 10:19       ` Catalin Marinas
  2010-02-08  9:51     ` Catalin Marinas
  1 sibling, 1 reply; 155+ messages in thread
From: Andreas Mohr @ 2010-02-08  7:33 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Mon, Feb 08, 2010 at 07:55:19AM +0100, Pavel Machek wrote:
> Plus it does unnecessary flushes on x86, etc...

Noticed that as well, there should be an arch-obeying helper for this.


On my MIPSEL, I had urb->transfer_buffer NULL ptr crashes
(I think that was expected in case of a certain DMA setup, Alan said).

However, even with NULL check added I still had:

hub 2-1.1:1.0: state 7 ports 7 chg 0000 evt 0010
Unhandled kernel unaligned access[#1]:
Cpu 0
$ 0   : 00000000 fffffffd 803b0000 00010000
$ 4   : 08002042 8143bfe0 0043bfe0 0000000d
$ 8   : 00000001 3b9aca00 c4653600 00000000
$12   : 00000049 3b9aca00 81dbc868 00000000
$16   : a1e00000 803b0000 8037f840 81dfaa80
$20   : 00000000 81dd5080 80000000 00000000
$24   : 00000000 80015a64
$28   : 8033a000 8033bc10 a1dd83cc 801da5e4
Hi    : 00000000
Lo    : 00000000
epc   : 800171e8 __flush_dcache_page+0x38/0x120
    Not tainted
ra    : 801da5e4 ehci_urb_done+0x180/0x1e4
Status: 10009002    KERNEL EXL
Cause : 00800010
BadVA : 08002056
PrId  : 00029029 (Broadcom BCM3302)
Modules linked in:
Process swapper (pid: 0, threadinfo=8033a000, task=8033c000, tls=00000000)
Stack : 00000000 00000000 81e04980 801c80ac a1dd9060 a1dd8394 ffffff6a ffffff6a
        81dfaa80 a1dd83cc a1dd8380 801db3a4 803a6a28 80068e9c 000003f8 00003fc0
        a1dd81cc 801dea58 00000001 00000000 a1dd9360 81dd5080 a1dd8380 10009001
        a1dd83cc 81dd5158 00000000 80318d44 81dd5158 00000001 00010031 801de8f4
        81dd5158 8033bce0 803a76a0 803a0000 8033d860 8004f924 00000219 00000043
        ...
Call Trace:
[<800171e8>] __flush_dcache_page+0x38/0x120
[<801da5e4>] ehci_urb_done+0x180/0x1e4
[<801db3a4>] qh_completions+0x484/0x554
[<801de8f4>] ehci_work+0x1ec/0xb68
[<801e2598>] ehci_irq+0x360/0x3a4
[<801c7cf8>] usb_hcd_irq+0x64/0x15c
[<80066d58>] handle_IRQ_event+0x90/0x280
[<80068e80>] handle_percpu_irq+0x48/0x9c
[<8000e228>] plat_irq_dispatch+0x15c/0x178
[<80001444>] ret_from_irq+0x0/0x4
[<80001680>] r4k_wait+0x20/0x40
[<8000fe34>] cpu_idle+0x30/0x60
[<80354a34>] start_kernel+0x338/0x350


Code: 00000000  10800029  3c02803b <8c820014> 14400026  3c02803b  8c83001c  2482001c  14620021
Disabling lock debugging due to kernel taint
Kernel panic - not syncing: Fatal exception in interrupt



Seems like BadVA : 08002056 really isn't as aligned (offset 0x6) as it should be.

I've given up on this for now, BTW. I'll wait until the dust has settled (i.e. some nice improvements
have found their way into the kernel) and retry in some months with a much newer kernel version
(currently a patched-up 2.6.31.9) to see whether anything remains to be fixed.
Meanwhile I'll work on more productive things, such as submitting some waiting patches.

Andreas Mohr


* USB mass storage and ARM cache coherency
  2010-02-08  6:55   ` USB mass storage and ARM cache coherency Pavel Machek
  2010-02-08  7:33     ` Andreas Mohr
@ 2010-02-08  9:51     ` Catalin Marinas
  2010-02-08 10:03       ` Andy Green
  2010-02-08 10:52       ` Pavel Machek
  1 sibling, 2 replies; 155+ messages in thread
From: Catalin Marinas @ 2010-02-08  9:51 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Mon, 2010-02-08 at 06:55 +0000, Pavel Machek wrote:
> > > So, let's put this in the HCD drivers and be done with it.
> >
> > The patch below is what fixes the I-D cache incoherency issues on ARM. I
> > don't particularly like the solution but it seems to be the only one
> > available.
> 
> Really? It looks like arm should just flush the caches when mapping
> executable page to the userspace.... you can't expect all the drivers
> to be modified like that...

We could of course flush the caches every time we get a page fault but
that's far from optimal, especially since DMA-capable drivers do not
pollute the D-cache and don't need this extra flushing. Note that
recent ARM processors have PIPT caches, but separate ones for I and D, and
it's the PIO drivers that pollute the D-cache.

The kernel API provides flush_dcache_page() to be called every time the
kernel writes to a page cache page. This is further optimised for
working in pair with update_mmu_cache() to delay the flushing until the
actual page is mapped into user space and this latter function is called
(which in general is not a cache maintenance function).

The problem with some PIO drivers and a filesystem like ext2 is that
there is no call to flush_dcache_page() when getting data into a page
cache page. Since the page isn't marked as dirty (PG_arch_1), a
subsequent call to update_mmu_cache() as a result of a page fault
doesn't flush the caches.
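
The deferred-flush scheme described above can be modelled in a few lines
of user-space C. This is only a toy sketch: the struct, its fields and the
`sketch_*` functions are illustrative names, and the single boolean stands
in for the PG_arch_1 bit. It shows why a PIO fill that never calls
flush_dcache_page() leaves nothing for update_mmu_cache() to act on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one page-cache page; field names are illustrative only. */
struct sketch_page {
	bool dcache_dirty;	/* models the PG_arch_1 "D-cache dirty" bit */
	bool icache_coherent;	/* true once the I-side can see the new data */
};

/* Models flush_dcache_page() on a not-yet-mapped page: record the debt. */
static void sketch_flush_dcache_page(struct sketch_page *pg)
{
	assert(pg != NULL);
	pg->dcache_dirty = true;
}

/* Models update_mmu_cache(): pay the debt when the page gets mapped. */
static void sketch_update_mmu_cache(struct sketch_page *pg)
{
	assert(pg != NULL);
	if (pg->dcache_dirty) {
		pg->dcache_dirty = false;
		pg->icache_coherent = true;	/* clean D-cache, invalidate I-cache */
	}
}

/* A PIO read fills the page through the D-cache. */
static void sketch_pio_fill(struct sketch_page *pg, bool driver_calls_flush)
{
	pg->icache_coherent = false;
	if (driver_calls_flush)
		sketch_flush_dcache_page(pg);
}

/* Returns true if user space would execute the freshly read data. */
static bool sketch_fault_in(bool driver_calls_flush)
{
	struct sketch_page pg = { false, true };

	sketch_pio_fill(&pg, driver_calls_flush);
	sketch_update_mmu_cache(&pg);
	return pg.icache_coherent;
}
```

In this model `sketch_fault_in(true)` ends up coherent, while
`sketch_fault_in(false)` does not, which is the ext2-over-PIO case.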

There is a flush_icache_page() function called from __do_fault(),
however, Documentation/cachetlb.txt states that all the functionality of
this function can be implemented in flush_dcache_page() and
update_mmu_cache(), hence this function is a no-op.

Please suggest a better solution that does not involve modifying generic
Linux code.

> Plus it does unnecessary flushes on x86, etc...

On x86, it should indeed be conditionally compiled based on
ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE.
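
As a rough illustration of that conditional compilation, here is a
user-space sketch. The helper names are hypothetical, the counter stands
in for the real flush, and the default of 0 is an assumption made only so
the sketch builds outside the kernel tree.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for this sketch: build as a coherent arch unless told otherwise. */
#ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
#define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 0
#endif

static unsigned int sketch_flushes;	/* counts stand-in flush calls */

#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
/* Non-coherent build: stand-in for the real per-page flush. */
static void sketch_pio_flush(const void *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	sketch_flushes++;
}
#else
/* Coherent build: compiles to nothing, so x86 pays no cost. */
static inline void sketch_pio_flush(const void *buf, size_t len)
{
	(void)buf;
	(void)len;
}
#endif

/* Hypothetical completion path of a PIO driver; returns flushes so far. */
static unsigned int sketch_complete_pio_read(const void *buf, size_t len)
{
	sketch_pio_flush(buf, len);
	return sketch_flushes;
}
```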

Regards.

-- 
Catalin


* USB mass storage and ARM cache coherency
  2010-02-08  9:51     ` Catalin Marinas
@ 2010-02-08 10:03       ` Andy Green
  2010-02-17  9:50         ` Sascha Hauer
  2010-02-08 10:52       ` Pavel Machek
  1 sibling, 1 reply; 155+ messages in thread
From: Andy Green @ 2010-02-08 10:03 UTC (permalink / raw)
  To: linux-arm-kernel

On 02/08/10 10:51, Somebody in the thread at some point said:

> We could of course flush the caches every time we get a page fault but
> that's far from optimal, especially since DMA-capable drivers do not
> pollute the D-cache and don't need this extra flushing. Note that the
> recent ARM processors have PIPT caches but separate for I and D and it's
> the PIO drivers that pollute the D-cache.

Just noting that AFAIK iMX31 USB and MMC drivers both are PIO at the 
moment, for lack of any platform DMA support of its unusual DMA engine.

-Andy


* USB mass storage and ARM cache coherency
  2010-02-08  7:33     ` Andreas Mohr
@ 2010-02-08 10:19       ` Catalin Marinas
  0 siblings, 0 replies; 155+ messages in thread
From: Catalin Marinas @ 2010-02-08 10:19 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, 2010-02-08 at 07:33 +0000, Andreas Mohr wrote:
> On Mon, Feb 08, 2010 at 07:55:19AM +0100, Pavel Machek wrote:
> > Plus it does unnecessary flushes on x86, etc...
> 
> Noticed that as well, there should be an arch-obeying helper for this.
> 
> 
> On my MIPSEL, I had urb->transfer_buffer NULL ptr crashes
> (I think that was expected in case of a certain DMA setup, Alan said).
> 
> However, even with NULL check added I still had:
> 
> hub 2-1.1:1.0: state 7 ports 7 chg 0000 evt 0010
> Unhandled kernel unaligned access[#1]:

Just to avoid confusion - that's a similar patch applied to a different
driver. The ISP1760 HCD driver works fine with my patch (transfer_buffer
never seems to be NULL with latest mainline). I can't comment on the
ehci-q.c driver (it looks like it has some support for DMA while my
patch only applies to PIO drivers where transfer_buffer should be set).

-- 
Catalin


* USB mass storage and ARM cache coherency
  2010-02-08  9:51     ` Catalin Marinas
  2010-02-08 10:03       ` Andy Green
@ 2010-02-08 10:52       ` Pavel Machek
  2010-02-08 11:28         ` Catalin Marinas
  1 sibling, 1 reply; 155+ messages in thread
From: Pavel Machek @ 2010-02-08 10:52 UTC (permalink / raw)
  To: linux-arm-kernel

> Hi,
> 
> On Mon, 2010-02-08 at 06:55 +0000, Pavel Machek wrote:
> > > > So, let's put this in the HCD drivers and be done with it.
> > >
> > > The patch below is what fixes the I-D cache incoherency issues on ARM. I
> > > don't particularly like the solution but it seems to be the only one
> > > available.
> > 
> > Really? It looks like arm should just flush the caches when mapping
> > executable page to the userspace.... you can't expect all the drivers
> > to be modified like that...
> 
> We could of course flush the caches every time we get a page fault but
> that's far from optimal, especially since DMA-capable drivers do
> not

Maybe far from optimal, but it is something that should be done
_first_. Correctness is more important than performance, and you can't
expect all drivers to behave the way you want them to.

Then you can add optimizations not to do the flushes on drivers you
audited and where you care...

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* USB mass storage and ARM cache coherency
  2010-02-08 10:52       ` Pavel Machek
@ 2010-02-08 11:28         ` Catalin Marinas
  2010-02-16  7:57           ` Shilimkar, Santosh
  0 siblings, 1 reply; 155+ messages in thread
From: Catalin Marinas @ 2010-02-08 11:28 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, 2010-02-08 at 10:52 +0000, Pavel Machek wrote:
> > On Mon, 2010-02-08 at 06:55 +0000, Pavel Machek wrote:
> > > > > So, let's put this in the HCD drivers and be done with it.
> > > >
> > > > The patch below is what fixes the I-D cache incoherency issues on ARM. I
> > > > don't particularly like the solution but it seems to be the only one
> > > > available.
> > >
> > > Really? It looks like arm should just flush the caches when mapping
> > > executable page to the userspace.... you can't expect all the drivers
> > > to be modified like that...
> >
> > We could of course flush the caches every time we get a page fault but
> > that's far from optimal, especially since DMA-capable drivers do
> > not
> 
> Maybe far from optimal, but it is something that should be done
> _first_. Correctness is more important than performance, and you can't
> expect all drivers to behave the way you want them to.

I wouldn't call heavy cache flushing "correctness". We could as well
disable the caches so that it is fully coherent.

The arch code follows an API defined in cachetlb.txt but the PIO drivers
don't (some do, like mmci.c). It may be inconvenient to call
flush_dcache_page() in the driver, hence I started a discussion on
linux-arch on a PIO mapping API that x86 or other fully coherent
architectures can leave as no-ops.

> Then you can add optimizations not to do the flushes on drivers you
> audited and where you care...

Sorry but that's not really feasible (unless I don't fully understand
what you mean) - if we do the cache flushing on the fault handling path
in the arch code, there is no way for the arch code to know what the driver
is doing, unless we make this conditionally compiled with something like
CONFIG_ARCH_NEEDS_HEAVY_FLUSHING.

-- 
Catalin


* USB mass storage and ARM cache coherency
  2010-02-08 11:28         ` Catalin Marinas
@ 2010-02-16  7:57           ` Shilimkar, Santosh
  2010-02-16  8:22             ` Oliver Neukum
  2010-02-16  8:44             ` Russell King - ARM Linux
  0 siblings, 2 replies; 155+ messages in thread
From: Shilimkar, Santosh @ 2010-02-16  7:57 UTC (permalink / raw)
  To: linux-arm-kernel

> -----Original Message-----
> From: linux-arm-kernel-bounces at lists.infradead.org [mailto:linux-arm-kernel-
> bounces at lists.infradead.org] On Behalf Of Catalin Marinas
> Sent: Monday, February 08, 2010 4:58 PM
> To: Pavel Machek
> Cc: Matthew Dharm; Sergei Shtylyov; Ming Lei; Sebastian Siewior; linux-usb at vger.kernel.org; linux-
> kernel; Greg KH; linux-arm-kernel
> Subject: Re: USB mass storage and ARM cache coherency
> 
> On Mon, 2010-02-08 at 10:52 +0000, Pavel Machek wrote:
> > > On Mon, 2010-02-08 at 06:55 +0000, Pavel Machek wrote:
> > > > > > So, let's put this in the HCD drivers and be done with it.
> > > > >
> > > > > The patch below is what fixes the I-D cache incoherency issues on ARM. I
> > > > > don't particularly like the solution but it seems to be the only one
> > > > > available.
> > > >
> > > > Really? It looks like arm should just flush the caches when mapping
> > > > executable page to the userspace.... you can't expect all the drivers
> > > > to be modified like that...
> > >
> > > We could of course flush the caches every time we get a page fault but
> > > that's far from optimal, especially since DMA-capable drivers do
> > > not
> >
> > Maybe far from optimal, but it is something that should be done
> > _first_. Correctness is more important than performance, and you can't
> > expect all drivers to behave the way you want them to.
> 
> I wouldn't call heavy cache flushing "correctness". We could as well
> disable the caches so that it is fully coherent.
> 
> The arch code follows an API defined in cachetlb.txt but the PIO drivers
> don't (some do, like mmci.c). It may be inconvenient to call
> flush_dcache_page() in the driver, hence I started a discussion on
> linux-arch on a PIO mapping API that x86 or other fully coherent
> architectures can leave as no-ops.
> 
> > Then you can add optimizations not to do the flushes on drivers you
> > audited and where you care...
> 
> Sorry but that's not really feasible (unless I don't fully understand
> what you mean) - if we do the cache flushing on the fault handling path
> in the arch code, there is no way for the arch code to know what the driver
> is doing, unless we make this conditionally compiled with something like
> CONFIG_ARCH_NEEDS_HEAVY_FLUSHING.


Continuing on the USB issue w.r.t. cache coherency: the USB host
code is violating the buffer ownership rules of the streaming DMA APIs,
from the point of view of both DMA and non-DMA transfers.

We have the temporary patch below to get around the issue, but it probably
needs to be fixed properly in the stack, because some controllers
may not have a PIO option even for control transfers (e.g. the Synopsys EHCI
controller).

From: Maulik Mankad <x0082077@ti.com>

USB: Avoid DMA map/unmap of control transfer buffers.

This patch avoids the DMA mapping of buffers for control
transfers.

Signed-off-by: Maulik Mankad <x0082077@ti.com>
---
Index: omap4_integration/drivers/usb/core/hcd.c
===================================================================
--- omap4_integration.orig/drivers/usb/core/hcd.c
+++ omap4_integration/drivers/usb/core/hcd.c
@@ -1274,6 +1274,10 @@ static int map_urb_for_dma(struct usb_hc
 	if (is_root_hub(urb->dev))
 		return 0;
 
+	if (usb_endpoint_xfer_control(&urb->ep->desc))
+		urb->transfer_flags = URB_NO_SETUP_DMA_MAP |
+					URB_NO_TRANSFER_DMA_MAP;
+
 	if (usb_endpoint_xfer_control(&urb->ep->desc)
 	    && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
 		if (hcd->self.uses_dma) {


* USB mass storage and ARM cache coherency
  2010-02-16  7:57           ` Shilimkar, Santosh
@ 2010-02-16  8:22             ` Oliver Neukum
  2010-02-16  8:55               ` Shilimkar, Santosh
  2010-02-17  9:05               ` Benjamin Herrenschmidt
  2010-02-16  8:44             ` Russell King - ARM Linux
  1 sibling, 2 replies; 155+ messages in thread
From: Oliver Neukum @ 2010-02-16  8:22 UTC (permalink / raw)
  To: linux-arm-kernel

Am Dienstag, 16. Februar 2010 08:57:53 schrieb Shilimkar, Santosh:
> Continuing on the USB issue w.r.t. cache coherency: the USB host
> code is violating the buffer ownership rules of the streaming DMA APIs,
> from the point of view of both DMA and non-DMA transfers.
> 
> We have the temporary patch below to get around the issue, but it probably
> needs to be fixed properly in the stack, because some controllers
> may not have a PIO option even for control transfers (e.g. the Synopsys EHCI
> controller).

This seems wrong to me. Buffers for control transfers may be transferred
by DMA, so the caches must be flushed on architectures whose caches
are not coherent with respect to DMA.

Would you care to elaborate on the exact nature of the bug you are fixing?

	Regards
		Oliver


* USB mass storage and ARM cache coherency
  2010-02-16  7:57           ` Shilimkar, Santosh
  2010-02-16  8:22             ` Oliver Neukum
@ 2010-02-16  8:44             ` Russell King - ARM Linux
  2010-02-16  8:51               ` Gadiyar, Anand
  1 sibling, 1 reply; 155+ messages in thread
From: Russell King - ARM Linux @ 2010-02-16  8:44 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Feb 16, 2010 at 01:27:53PM +0530, Shilimkar, Santosh wrote:
> Continuing on the USB issue w.r.t. cache coherency: the USB host
> code is violating the buffer ownership rules of the streaming DMA APIs,
> from the point of view of both DMA and non-DMA transfers.
> 
> We have the temporary patch below to get around the issue, but it probably
> needs to be fixed properly in the stack, because some controllers
> may not have a PIO option even for control transfers (e.g. the Synopsys EHCI
> controller).

        if (usb_endpoint_xfer_control(&urb->ep->desc)
            && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
                if (hcd->self.uses_dma) {		<=================
                        urb->setup_dma = dma_map_single(
                                        hcd->self.controller,
                                        urb->setup_packet,
                                        sizeof(struct usb_ctrlrequest),
                                        DMA_TO_DEVICE);

struct usb_hcd *usb_create_hcd (const struct hc_driver *driver,
                struct device *dev, const char *bus_name)
{
...
        hcd->self.uses_dma = (dev->dma_mask != NULL);

Is it easier to make sure that PIO devices don't have dev->dma_mask set?


* USB mass storage and ARM cache coherency
  2010-02-16  8:44             ` Russell King - ARM Linux
@ 2010-02-16  8:51               ` Gadiyar, Anand
  2010-02-20  7:21                 ` Pete Zaitcev
  0 siblings, 1 reply; 155+ messages in thread
From: Gadiyar, Anand @ 2010-02-16  8:51 UTC (permalink / raw)
  To: linux-arm-kernel

Russell King - ARM Linux wrote:
> On Tue, Feb 16, 2010 at 01:27:53PM +0530, Shilimkar, Santosh wrote:
> > Continuing on the USB issue w.r.t. cache coherency: the USB host
> > code is violating the buffer ownership rules of the streaming DMA APIs,
> > from the point of view of both DMA and non-DMA transfers.
> > 
> > We have the temporary patch below to get around the issue, but it probably
> > needs to be fixed properly in the stack, because some controllers
> > may not have a PIO option even for control transfers (e.g. the Synopsys EHCI
> > controller).
> 
>         if (usb_endpoint_xfer_control(&urb->ep->desc)
>             && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
>                 if (hcd->self.uses_dma) {		<=================
>                         urb->setup_dma = dma_map_single(
>                                         hcd->self.controller,
>                                         urb->setup_packet,
>                                         sizeof(struct usb_ctrlrequest),
>                                         DMA_TO_DEVICE);
> 
> struct usb_hcd *usb_create_hcd (const struct hc_driver *driver,
>                 struct device *dev, const char *bus_name)
> {
> ...
>         hcd->self.uses_dma = (dev->dma_mask != NULL);
> 
> Is it easier to make sure that PIO devices don't have dev->dma_mask set?

Not really. For instance, in the case of the DMA engine in the MUSB
controller in OMAP3, we can only use DMA with endpoints other than
EP0, and EP0 is what is used for control transfers.

It's not PIO for all the endpoints or DMA for all of them.
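
To make the per-endpoint point concrete, here is a small user-space
sketch of what a finer-grained decision could look like. Everything in it
is hypothetical: `struct sketch_hcd` is not the kernel's `usb_hcd`, and the
`ep_uses_dma` hook is an invented name used only to illustrate the idea of
asking the controller driver per endpoint instead of relying on a single
controller-wide flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a host controller. */
struct sketch_hcd {
	bool uses_dma;				/* controller-wide flag */
	bool (*ep_uses_dma)(unsigned int epnum);	/* hypothetical per-endpoint hook */
};

/* MUSB-like policy from the message: EP0 is PIO, other endpoints may use DMA. */
static bool sketch_musb_ep_uses_dma(unsigned int epnum)
{
	return epnum != 0;
}

/* Decide whether the core should DMA-map a transfer on endpoint epnum. */
static bool sketch_should_map(const struct sketch_hcd *hcd, unsigned int epnum)
{
	assert(hcd != NULL);
	if (!hcd->uses_dma)
		return false;
	if (hcd->ep_uses_dma)
		return hcd->ep_uses_dma(epnum);
	return true;	/* no hook: keep the controller-wide behaviour */
}
```

With the MUSB-like policy plugged in, control transfers on EP0 would skip
the mapping while bulk endpoints would still be mapped.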

- Anand


* USB mass storage and ARM cache coherency
  2010-02-16  8:22             ` Oliver Neukum
@ 2010-02-16  8:55               ` Shilimkar, Santosh
  2010-02-16  9:07                 ` Oliver Neukum
  2010-02-17  3:21                 ` Ming Lei
  2010-02-17  9:05               ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 155+ messages in thread
From: Shilimkar, Santosh @ 2010-02-16  8:55 UTC (permalink / raw)
  To: linux-arm-kernel

> -----Original Message-----
> From: Oliver Neukum [mailto:oliver at neukum.org]
> Sent: Tuesday, February 16, 2010 1:53 PM
> To: Shilimkar, Santosh
> Cc: Catalin Marinas; Pavel Machek; Greg KH; Russell King - ARM Linux; Matthew Dharm; Sergei Shtylyov;
> Ming Lei; Sebastian Siewior; linux-usb at vger.kernel.org; linux-kernel; linux-arm-kernel; Mankad,
> Maulik Ojas
> Subject: Re: USB mass storage and ARM cache coherency
> 
> Am Dienstag, 16. Februar 2010 08:57:53 schrieb Shilimkar, Santosh:
> > Continuing on the USB issue w.r.t. cache coherency: the USB host
> > code is violating the buffer ownership rules of the streaming DMA APIs,
> > from the point of view of both DMA and non-DMA transfers.
> >
> > We have the temporary patch below to get around the issue, but it probably
> > needs to be fixed properly in the stack, because some controllers
> > may not have a PIO option even for control transfers (e.g. the Synopsys EHCI
> > controller).
> 
> This seems wrong to me. Buffers for control transfers may be transferred
> by DMA, so the caches must be flushed on architectures whose caches
> are not coherent with respect to DMA.

Indeed, and that's what I mentioned in the comment. But we shouldn't have DMA
cache maintenance operations done for buffers that will use PIO-based transfers.

> Would you care to elaborate on the exact nature of the bug you are fixing?

On the OMAP4 (ARM Cortex-A9) platform, enumeration fails because control
transfer buffers are corrupted. On our platform, we use PIO mode for control
transfers and DMA for bulk transfers.

The current stack performs DMA cache maintenance even for PIO transfers,
which leads to the corruption. The control buffers are handled by the CPU
and are already coherent from the CPU's point of view.


Regards,
Santosh


* USB mass storage and ARM cache coherency
  2010-02-16  8:55               ` Shilimkar, Santosh
@ 2010-02-16  9:07                 ` Oliver Neukum
  2010-02-16  9:39                   ` Russell King - ARM Linux
  2010-02-17  3:21                 ` Ming Lei
  1 sibling, 1 reply; 155+ messages in thread
From: Oliver Neukum @ 2010-02-16  9:07 UTC (permalink / raw)
  To: linux-arm-kernel

Am Dienstag, 16. Februar 2010 09:55:55 schrieb Shilimkar, Santosh:
> > This seems wrong to me. Buffers for control transfers may be transferred
> > by DMA, so the caches must be flushed on architectures whose caches
> > are not coherent with respect to DMA.
> Indeed and that's what I mentioned in the comment. But we shouldn't have dma 
> cache maintenance operations done for the buffers which would use pio based transfer.

Given that the generic layer can't know which buffers will be used for DMA
that would require a callback into the hcd driver.

> > Would you care to elaborate on the exact nature of the bug you are fixing?
> On the OMAP4 (ARM cortex-a9) platform, the enumeration fails because control
> transfer buffers are corrupted. On our platform, we use PIO mode for control 
> transfers and DMA for bulk transfers.
> 
> The current stack performs dma cache maintenance even for the PIO transfers
> which leads to the corruption issue. The control buffers are handled by CPU 
> and they are already coherent from CPU point of view.

How does the mapping corrupt buffers? It might impact performance, but why
do you see corruption?

	Regards
		Oliver


* USB mass storage and ARM cache coherency
  2010-02-16  9:07                 ` Oliver Neukum
@ 2010-02-16  9:39                   ` Russell King - ARM Linux
  2010-02-16 13:32                     ` Oliver Neukum
  2010-02-17 12:29                     ` Jamie Lokier
  0 siblings, 2 replies; 155+ messages in thread
From: Russell King - ARM Linux @ 2010-02-16  9:39 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Feb 16, 2010 at 10:07:20AM +0100, Oliver Neukum wrote:
> Am Dienstag, 16. Februar 2010 09:55:55 schrieb Shilimkar, Santosh:
> > > Would you care to elaborate on the exact nature of the bug you are fixing?
> > On the OMAP4 (ARM cortex-a9) platform, the enumeration fails because control
> > transfer buffers are corrupted. On our platform, we use PIO mode for control 
> > transfers and DMA for bulk transfers.
> > 
> > The current stack performs dma cache maintenance even for the PIO transfers
> > which leads to the corruption issue. The control buffers are handled by CPU 
> > and they are already coherent from CPU point of view.
> 
> How does the mapping corrupt buffers? It might impact performance, but why
> do you see corruption?

On map, buffers are cleaned if they're being used for DMA_TO_DEVICE and
DMA_BIDIRECTIONAL, or invalidated in the case of DMA_FROM_DEVICE.

However, because ARM CPUs can now speculatively prefetch, just leaving it
at that results in corruption of buffers used for DMA.  So we have to
invalidate DMA_FROM_DEVICE and DMA_BIDIRECTIONAL buffers on unmap to
ensure coherency with DMA operations.

If the CPU writes to a DMA_FROM_DEVICE buffer between map and unmap, the
writes can sit in the cache, and on unmap, they will be discarded.

Cleaning the cache on unmap is not an option; that too can lead to DMA
buffer corruption in the DMA case.

USB and associated host driver must abide by the DMA API buffer
ownership rules otherwise the result will be data corruption; either
that or USB/host driver people need to have a discussion with the
DMA API authors to remove this sensible "restriction".
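
The failure mode described above (CPU writes between map and unmap being
discarded) can be reproduced with a toy write-back cache in user space.
This is a sketch under simplifying assumptions: one cache line, no
speculation, and `sketch_*` names that are purely illustrative. Map and
unmap for DMA_FROM_DEVICE are both modelled as an invalidate, as described
in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_LEN 8

/* Toy single-line write-back cache in front of "RAM". */
struct sketch_buf {
	unsigned char ram[SKETCH_LEN];		/* what the device sees */
	unsigned char cache[SKETCH_LEN];	/* what the CPU sees while valid */
	bool valid;				/* cache line holds a copy */
	bool dirty;				/* copy is newer than RAM */
};

/* Fill the cache line from RAM on a miss. */
static void sketch_fill(struct sketch_buf *b)
{
	if (!b->valid) {
		memcpy(b->cache, b->ram, SKETCH_LEN);
		b->valid = true;
	}
}

static void sketch_cpu_write(struct sketch_buf *b, int i, unsigned char v)
{
	assert(i >= 0 && i < SKETCH_LEN);
	sketch_fill(b);		/* allocate on write */
	b->cache[i] = v;
	b->dirty = true;
}

static unsigned char sketch_cpu_read(struct sketch_buf *b, int i)
{
	assert(i >= 0 && i < SKETCH_LEN);
	sketch_fill(b);
	return b->cache[i];
}

/* Invalidate: drop the cached copy without writing it back. */
static void sketch_invalidate(struct sketch_buf *b)
{
	b->valid = false;
	b->dirty = false;
}

/* PIO wrongly wrapped in a DMA_FROM_DEVICE map/unmap pair. */
static unsigned char sketch_pio_between_map_and_unmap(void)
{
	struct sketch_buf b = { {0}, {0}, false, false };

	sketch_invalidate(&b);			/* map */
	sketch_cpu_write(&b, 0, 0xAB);		/* CPU fills the buffer itself */
	sketch_invalidate(&b);			/* unmap: dirty line discarded */
	return sketch_cpu_read(&b, 0);		/* stale RAM contents */
}

/* Same PIO fill with no DMA mapping around it. */
static unsigned char sketch_pio_without_mapping(void)
{
	struct sketch_buf b = { {0}, {0}, false, false };

	sketch_cpu_write(&b, 0, 0xAB);
	return sketch_cpu_read(&b, 0);
}
```

In this model the mapped PIO path reads back the old RAM contents, while
the unmapped path reads back what the CPU wrote.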


* USB mass storage and ARM cache coherency
  2010-02-16  9:39                   ` Russell King - ARM Linux
@ 2010-02-16 13:32                     ` Oliver Neukum
  2010-02-16 13:40                       ` Shilimkar, Santosh
  2010-02-17 12:29                     ` Jamie Lokier
  1 sibling, 1 reply; 155+ messages in thread
From: Oliver Neukum @ 2010-02-16 13:32 UTC (permalink / raw)
  To: linux-arm-kernel

Am Dienstag, 16. Februar 2010 10:39:46 schrieb Russell King - ARM Linux:
> However, because ARM CPUs can now speculatively prefetch, just leaving it
> at that results in corruption of buffers used for DMA.  So we have to
> invalidate DMA_FROM_DEVICE and DMA_BIDIRECTIONAL buffers on unmap to
> ensure coherency with DMA operations.
> 
> If the CPU writes to a DMA_FROM_DEVICE buffer between map and unmap, the
> writes can sit in the cache, and on unmap, they will be discarded.
> 
> Cleaning the cache on unmap is not an option; that too can lead to DMA
> buffer corruption in the DMA case.

I am afraid that for these controllers the controller driver must be responsible
for all DMA and cache issues. Indicating the exact requirements to the
upper layer would be a battle already lost.
So the safe choice is not to set uses_dma, and the generic layer will leave
the issue to the lower level.

	Regards
		Oliver


* USB mass storage and ARM cache coherency
  2010-02-16 13:32                     ` Oliver Neukum
@ 2010-02-16 13:40                       ` Shilimkar, Santosh
  2010-02-16 13:46                         ` Oliver Neukum
  0 siblings, 1 reply; 155+ messages in thread
From: Shilimkar, Santosh @ 2010-02-16 13:40 UTC (permalink / raw)
  To: linux-arm-kernel

> -----Original Message-----
> From: Oliver Neukum [mailto:oliver at neukum.org]
> Sent: Tuesday, February 16, 2010 7:03 PM
> To: Russell King - ARM Linux
> Cc: Shilimkar, Santosh; Catalin Marinas; Pavel Machek; Greg KH; Matthew Dharm; Sergei Shtylyov; Ming
> Lei; Sebastian Siewior; linux-usb at vger.kernel.org; linux-kernel; linux-arm-kernel; Mankad, Maulik
> Ojas
> Subject: Re: USB mass storage and ARM cache coherency
> 
> Am Dienstag, 16. Februar 2010 10:39:46 schrieb Russell King - ARM Linux:
> > However, because ARM CPUs can now speculatively prefetch, just leaving it
> > at that results in corruption of buffers used for DMA.  So we have to
> > invalidate DMA_FROM_DEVICE and DMA_BIDIRECTIONAL buffers on unmap to
> > ensure coherency with DMA operations.
> >
> > If the CPU writes to a DMA_FROM_DEVICE buffer between map and unmap, the
> > writes can sit in the cache, and on unmap, they will be discarded.
> >
> > Cleaning the cache on unmap is not an option; that too can lead to DMA
> > buffer corruption in the DMA case.
> 
> I am afraid for these controllers the controller driver must be responsible
> for all DMA and cache issues. Indicating the exact requirements to the
> upper layer would be a battle already lost.
> So the safe choice is not to set uses_dma, and the generic layer will leave
> the issue to the lower level.

This means not using DMA at all, which would almost kill performance.

Regards,
Santosh


* USB mass storage and ARM cache coherency
  2010-02-16 13:40                       ` Shilimkar, Santosh
@ 2010-02-16 13:46                         ` Oliver Neukum
  2010-02-16 14:12                           ` Shilimkar, Santosh
  0 siblings, 1 reply; 155+ messages in thread
From: Oliver Neukum @ 2010-02-16 13:46 UTC (permalink / raw)
  To: linux-arm-kernel

Am Dienstag, 16. Februar 2010 14:40:45 schrieb Shilimkar, Santosh:
> > > If the CPU writes to a DMA_FROM_DEVICE buffer between map and unmap, the
> > > writes can sit in the cache, and on unmap, they will be discarded.
> > >
> > > Cleaning the cache on unmap is not an option; that too can lead to DMA
> > > buffer corruption in the DMA case.
> > 
> > I am afraid for these controllers the controller driver must be responsible
> > for all DMA and cache issues. Indicating the exact requirements to the
> > upper layer would be a battle already lost.
> > So the safe choice is not to set uses_dma, and the generic layer will leave
> > the issue to the lower level.
> This means not using DMA at all, which would almost kill performance.

Why would you be unable to map a buffer in the hcd driver when you know
that you'll use DMA?

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-16 13:46                         ` Oliver Neukum
@ 2010-02-16 14:12                           ` Shilimkar, Santosh
  2010-02-16 14:22                             ` Oliver Neukum
  0 siblings, 1 reply; 155+ messages in thread
From: Shilimkar, Santosh @ 2010-02-16 14:12 UTC (permalink / raw)
  To: linux-arm-kernel

> -----Original Message-----
> From: Oliver Neukum [mailto:oliver at neukum.org]
> Sent: Tuesday, February 16, 2010 7:17 PM
> To: Shilimkar, Santosh
> Cc: Russell King - ARM Linux; Catalin Marinas; Pavel Machek; Greg KH; Matthew Dharm; Sergei Shtylyov;
> Ming Lei; Sebastian Siewior; linux-usb at vger.kernel.org; linux-kernel; linux-arm-kernel; Mankad,
> Maulik Ojas
> Subject: Re: USB mass storage and ARM cache coherency
> 
> Am Dienstag, 16. Februar 2010 14:40:45 schrieb Shilimkar, Santosh:
> > > > If the CPU writes to a DMA_FROM_DEVICE buffer between map and unmap, the
> > > > writes can sit in the cache, and on unmap, they will be discarded.
> > > >
> > > > Cleaning the cache on unmap is not an option; that too can lead to DMA
> > > > buffer corruption in the DMA case.
> > >
> > > I am afraid for these controllers the controller driver must be responsible
> > > for all DMA and cache issues. Indicating the exact requirements to the
> > > upper layer would be a battle already lost.
> > > so the safe choice is not to set has_dma and the generic layer will leave
> > > the issue to the lower level.
> > This means don't use dma at all which will almost kill the performance.
> 
> Why would you be unable to map a buffer in the hcd driver when you know
> that you'll use DMA?
It probably can be. The USB stack keeps the DMA maintenance code in one common
place for all controllers, so we were trying to see whether this could be
handled there.

We shall check this possibility.

Regards,
Santosh

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-16 14:12                           ` Shilimkar, Santosh
@ 2010-02-16 14:22                             ` Oliver Neukum
  2010-02-16 14:45                               ` Shilimkar, Santosh
  2010-02-17  8:55                               ` Shilimkar, Santosh
  0 siblings, 2 replies; 155+ messages in thread
From: Oliver Neukum @ 2010-02-16 14:22 UTC (permalink / raw)
  To: linux-arm-kernel

Am Dienstag, 16. Februar 2010 15:12:45 schrieb Shilimkar, Santosh:
> > > > I am afraid for these controllers the controller driver must be responsible
> > > > for all DMA and cache issues. Indicating the exact requirements to the
> > > > upper layer would be a battle already lost.
> > > > so the safe choice is not to set has_dma and the generic layer will leave
> > > > the issue to the lower level.
> > > This means don't use dma at all which will almost kill the performance.
> > 
> > Why would you be unable to map a buffer in the hcd driver when you know
> > that you'll use DMA?
> Probably it can be. The USB stack has the dma maintenance code at common 
> place for all controllers and hence we were just trying to see if there is 
> way to handle that way.

This is true. If you can find a clean way to describe your requirements
to the generic layer, that would be better. The problem is that we must
not end up with a dozen flags.

Your original patch however kills ehci, ohci and uhci on some architectures.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-16 14:22                             ` Oliver Neukum
@ 2010-02-16 14:45                               ` Shilimkar, Santosh
  2010-02-16 15:44                                 ` Alan Stern
  2010-02-17  8:55                               ` Shilimkar, Santosh
  1 sibling, 1 reply; 155+ messages in thread
From: Shilimkar, Santosh @ 2010-02-16 14:45 UTC (permalink / raw)
  To: linux-arm-kernel

> -----Original Message-----
> From: Oliver Neukum [mailto:oliver at neukum.org]
> Sent: Tuesday, February 16, 2010 7:53 PM
> To: Shilimkar, Santosh
> Cc: Russell King - ARM Linux; Catalin Marinas; Pavel Machek; Greg KH; Matthew Dharm; Sergei Shtylyov;
> Ming Lei; Sebastian Siewior; linux-usb at vger.kernel.org; linux-kernel; linux-arm-kernel; Mankad,
> Maulik Ojas
> Subject: Re: USB mass storage and ARM cache coherency
> 
> Am Dienstag, 16. Februar 2010 15:12:45 schrieb Shilimkar, Santosh:
> > > > > I am afraid for these controllers the controller driver must be responsible
> > > > > for all DMA and cache issues. Indicating the exact requirements to the
> > > > > upper layer would be a battle already lost.
> > > > > so the safe choice is not to set has_dma and the generic layer will leave
> > > > > the issue to the lower level.
> > > > This means don't use dma at all which will almost kill the performance.
> > >
> > > Why would you be unable to map a buffer in the hcd driver when you know
> > > that you'll use DMA?
> > Probably it can be. The USB stack has the dma maintenance code at common
> > place for all controllers and hence we were just trying to see if there is
> > way to handle that way.
> 
> This is true. If you can find a clean way to describe your requirements
> to the generic layer, that would be better. The problem is that we must
> not end up with a dozen flags.
Agreed.
> Your original patch however kills ehci, ohci and uhci on some architectures.
Well, the patch made _ONLY_ control transfers use PIO; the rest of the
transfers would still use DMA, so I am not sure how much performance impact
there would be.
Another issue with that patch is that a few controllers can't do PIO
at all, and hence the patch would break those controllers.

So we need a clean way to handle it as you described.

Regards,
Santosh

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-16 14:45                               ` Shilimkar, Santosh
@ 2010-02-16 15:44                                 ` Alan Stern
  0 siblings, 0 replies; 155+ messages in thread
From: Alan Stern @ 2010-02-16 15:44 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 16 Feb 2010, Shilimkar, Santosh wrote:

> > Your original patch however kills ehci, ohci and uhci on some architectures.
> Well the patch was making _ONLY_ control transfers use PIO and rest of
> the transfer would still use dma. So not sure how much performance impact would
> be because of that.
> Another issue with that patch is there are few controllers which can't do PIO
> at all and hence the patch would broke those controllers.

More than "a few"!  None of the EHCI, OHCI, or UHCI controllers used in
Intel-compatible desktop and laptop systems can do PIO.  That's what 
Oliver meant.

Alan Stern

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-16  8:55               ` Shilimkar, Santosh
  2010-02-16  9:07                 ` Oliver Neukum
@ 2010-02-17  3:21                 ` Ming Lei
  1 sibling, 0 replies; 155+ messages in thread
From: Ming Lei @ 2010-02-17  3:21 UTC (permalink / raw)
  To: linux-arm-kernel

2010/2/16 Shilimkar, Santosh <santosh.shilimkar@ti.com>:

>> > We have a below temporary patch to get around the issue and probably it
>> > needs to be fixed in the right way in the stack because some controllers
>> > may not have PIO option even for control transfers. (e.g. Synopsis EHCI
>> > controller)

Your temporary patch only removes the DMA map and unmap for the setup buffer
in a control transfer.

>>
>> This seems wrong to me. Buffers for control transfers may be transfered
>> by DMA, so the caches must be flushed on architectures whose caches
>> are not coherent with respect to DMA.
> Indeed and that's what I mentioned in the comment. But we shouldn't have dma
> cache maintenance operations done for the buffers which would use pio based transfer.
>> Would you care to elaborate on the exact nature of the bug you are fixing?
> On the OMAP4 (ARM cortex-a9) platform, the enumeration fails because control
> transfer buffers are corrupted. On our platform, we use PIO mode for control
> transfers and DMA for bulk transfers.

I don't know whether you mean that you use PIO mode for the setup buffer only
or for the whole control transfer (setup sent, data in or data out). If you
mean that DMA is not used for setup sent, data in or data out in a control
transfer, then your temporary patch may not be enough, right?

>
> The current stack performs dma cache maintenance even for the PIO transfers
> which leads to the corruption issue. The control buffers are handled by CPU
> and they already coherent from CPU point of view.

-- 
Lei Ming

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-16 14:22                             ` Oliver Neukum
  2010-02-16 14:45                               ` Shilimkar, Santosh
@ 2010-02-17  8:55                               ` Shilimkar, Santosh
  2010-02-17  9:10                                 ` Oliver Neukum
  2010-02-17 17:02                                 ` Alan Stern
  1 sibling, 2 replies; 155+ messages in thread
From: Shilimkar, Santosh @ 2010-02-17  8:55 UTC (permalink / raw)
  To: linux-arm-kernel

> -----Original Message-----
> From: Oliver Neukum [mailto:oliver at neukum.org]
> Sent: Tuesday, February 16, 2010 7:53 PM
> To: Shilimkar, Santosh
> Cc: Russell King - ARM Linux; Catalin Marinas; Pavel Machek; Greg KH; Matthew Dharm; Sergei Shtylyov;
> Ming Lei; Sebastian Siewior; linux-usb at vger.kernel.org; linux-kernel; linux-arm-kernel; Mankad,
> Maulik Ojas
> Subject: Re: USB mass storage and ARM cache coherency
> 
> Am Dienstag, 16. Februar 2010 15:12:45 schrieb Shilimkar, Santosh:
> > > > > I am afraid for these controllers the controller driver must be responsible
> > > > > for all DMA and cache issues. Indicating the exact requirements to the
> > > > > upper layer would be a battle already lost.
> > > > > so the safe choice is not to set has_dma and the generic layer will leave
> > > > > the issue to the lower level.
> > > > This means don't use dma at all which will almost kill the performance.
> > >
> > > Why would you be unable to map a buffer in the hcd driver when you know
> > > that you'll use DMA?
> > Probably it can be. The USB stack has the dma maintenance code at common
> > place for all controllers and hence we were just trying to see if there is
> > way to handle that way.
> 
> This is true. If you can find a clean way to describe your requirements
> to the generic layer, that would be better. The problem is that we must
> not end up with a dozen flags.
> 
> Your original patch however kills ehci, ohci and uhci on some architectures.

How about the approach below? A controller driver can set
"uses_pio_for_control" if it can't do DMA for control transfers.

diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
index 80995ef..e3eae02 100644
--- a/drivers/usb/core/hcd.c
+++ b/drivers/usb/core/hcd.c
@@ -1276,7 +1276,7 @@ static int map_urb_for_dma(struct usb_hcd *hcd, struct urb *urb,
 
 	if (usb_endpoint_xfer_control(&urb->ep->desc)
 	    && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
-		if (hcd->self.uses_dma) {
+		if (hcd->self.uses_dma && !hcd->self.uses_pio_for_control) {
 			urb->setup_dma = dma_map_single(
 					hcd->self.controller,
 					urb->setup_packet,
@@ -1335,7 +1335,7 @@ static void unmap_urb_for_dma(struct usb_hcd *hcd, struct urb *urb)
 
 	if (usb_endpoint_xfer_control(&urb->ep->desc)
 	    && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
-		if (hcd->self.uses_dma)
+		if (hcd->self.uses_dma && !hcd->self.uses_pio_for_control)
 			dma_unmap_single(hcd->self.controller, urb->setup_dma,
 					sizeof(struct usb_ctrlrequest),
 					DMA_TO_DEVICE);
diff --git a/include/linux/usb.h b/include/linux/usb.h
index d7ace1b..ba5b0a2 100644
--- a/include/linux/usb.h
+++ b/include/linux/usb.h
@@ -329,6 +329,9 @@ struct usb_bus {
 	int busnum;			/* Bus number (in order of reg) */
 	const char *bus_name;		/* stable id (PCI slot_name etc) */
 	u8 uses_dma;			/* Does the host controller use DMA? */
+	u8 uses_pio_for_control;	/* Does the host controller use PIO
+					 * for control transfers?
+					 */
 	u8 otg_port;			/* 0, or number of OTG/HNP port */
 	unsigned is_b_host:1;		/* true during some HNP roleswitches */
 	unsigned b_hnp_enable:1;	/* OTG: did A-Host enable HNP? */

Regards,
Santosh

^ permalink raw reply related	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-16  8:22             ` Oliver Neukum
  2010-02-16  8:55               ` Shilimkar, Santosh
@ 2010-02-17  9:05               ` Benjamin Herrenschmidt
  2010-02-17  9:15                 ` Oliver Neukum
                                   ` (6 more replies)
  1 sibling, 7 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-17  9:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2010-02-16 at 09:22 +0100, Oliver Neukum wrote:
> This seems wrong to me. Buffers for control transfers may be
> transfered
> by DMA, so the caches must be flushed on architectures whose caches
> are not coherent with respect to DMA.
> 
> Would you care to elaborate on the exact nature of the bug you are
> fixing?

I missed part of this thread, so forgive me if I'm a bit off here, but
if the problem is indeed I$/D$ cache coherency vs. PIO transfers, then
this is a long solved issue on other archs such as ppc (and I _think_
sparc).

The way we do it, at least on powerpc which is PIPT, is to keep track, on
a per-page basis, of whether a given page is clean for execution using the
PG_arch_1 bit. This bit is cleared when a new page is popped into the
page cache, and we clear it from flush_dcache_page() iirc (you may want
to double-check; I don't have the code at hand right now, or rather, I do
but I'm too lazy to look right now :-)

Any page with that bit not set is mapped into userspace with execute
permission disabled. We do the flush and set PG_arch_1 on the first exec
fault to that page.
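
The lazy scheme described above can be sketched as a small user-space C
model. This is not kernel code: struct page_model, exec_clean and the
model_*() helpers are hypothetical names made up for illustration, and
exec_clean only stands in for the per-page arch flag mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one page cache page. */
struct page_model {
	bool exec_clean;             /* stands in for the per-page arch bit */
	unsigned int icache_flushes; /* counts the expensive flushes */
};

/* A new page enters the page cache: not yet clean for execution. */
static void model_page_init(struct page_model *pg)
{
	assert(pg != NULL);
	pg->exec_clean = false;
	pg->icache_flushes = 0;
}

/* The kernel touched the page through the D-cache (for example PIO):
 * only clear the bit, the real flush is deferred. */
static void model_flush_dcache_page(struct page_model *pg)
{
	pg->exec_clean = false;
}

/* Mapping into userspace: execute permission is granted only if the
 * page is already known to be clean. Returns true if exec is allowed. */
static bool model_map_user(const struct page_model *pg)
{
	return pg->exec_clean;
}

/* First exec fault on the page: flush once, then mark it clean. */
static void model_exec_fault(struct page_model *pg)
{
	if (!pg->exec_clean) {
		pg->icache_flushes++; /* the real cache flush would go here */
		pg->exec_clean = true;
	}
}
```

The point of the design is that the flush cost is paid once per page that
is actually executed, not once per page written by a driver.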

Cheers,
Ben.
 

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-17  8:55                               ` Shilimkar, Santosh
@ 2010-02-17  9:10                                 ` Oliver Neukum
  2010-02-17  9:17                                   ` Shilimkar, Santosh
  2010-02-17 17:02                                 ` Alan Stern
  1 sibling, 1 reply; 155+ messages in thread
From: Oliver Neukum @ 2010-02-17  9:10 UTC (permalink / raw)
  To: linux-arm-kernel

Am Mittwoch, 17. Februar 2010 09:55:08 schrieb Shilimkar, Santosh:
> > Your original patch however kills ehci, ohci and uhci on some architectures.
> 
> How about below approach? Controller driver can set 
> "uses_pio_for_control" if it can't do dma for control transfer.
> 
> diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
> index 80995ef..e3eae02 100644
> --- a/drivers/usb/core/hcd.c
> +++ b/drivers/usb/core/hcd.c
> @@ -1276,7 +1276,7 @@ static int map_urb_for_dma(struct usb_hcd *hcd, struct urb *urb,
>  
>         if (usb_endpoint_xfer_control(&urb->ep->desc)
>             && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
> -               if (hcd->self.uses_dma) {
> +               if (hcd->self.uses_dma && !hcd->self.uses_pio_for_control) {

It is not elegant to describe exceptions. It would be better if you split
the flag into two flags, called uses_dma_for_ordinary_transfers and
uses_dma_for_control_transfers. Doing so also makes sure you look at
all hcd drivers ;-)

And the tests become straightforward. And please add a detailed comment
to explain why this differentiation is needed on ARM.
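
A rough sketch of what the two-flag test could look like, written as a
stand-alone C model rather than a patch against hcd.c. struct bus_model,
enum xfer_type and model_needs_dma_map() are hypothetical names; only the
two flag names come from the suggestion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical transfer classification for the sketch. */
enum xfer_type {
	XFER_CONTROL,
	XFER_BULK,
	XFER_INTERRUPT,
	XFER_ISOC
};

/* Hypothetical stand-in for the per-bus flags. */
struct bus_model {
	bool uses_dma_for_ordinary_transfers;
	bool uses_dma_for_control_transfers;
};

/* Should the generic layer map (and later unmap) this transfer's
 * buffers for DMA? No negated exception flag is needed. */
static bool model_needs_dma_map(const struct bus_model *bus,
				enum xfer_type type)
{
	assert(bus != NULL);
	if (type == XFER_CONTROL)
		return bus->uses_dma_for_control_transfers;
	return bus->uses_dma_for_ordinary_transfers;
}
```

A controller that does PIO for control transfers and DMA for everything
else would set only the first flag; a controller that cannot do PIO at all
would set both.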

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-17  9:05               ` Benjamin Herrenschmidt
@ 2010-02-17  9:15                 ` Oliver Neukum
  2010-02-17  9:40                   ` Benjamin Herrenschmidt
  2010-02-17  9:55                 ` Russell King - ARM Linux
                                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 155+ messages in thread
From: Oliver Neukum @ 2010-02-17  9:15 UTC (permalink / raw)
  To: linux-arm-kernel

Am Mittwoch, 17. Februar 2010 10:05:43 schrieb Benjamin Herrenschmidt:
> > Would you care to elaborate on the exact nature of the bug you are
> > fixing?
> 
> I missed part of this thread, so forgive me if I'm a bit off here, but
> if the problem is indeed I$/D$ cache coherency vs. PIO transfers, then
> this is a long solved issue on other archs such as ppc (and I think
> sparc).

We should have changed the subject line.

There's a second problem. It turns out that on ARM
mapping for DMA must not be done if PIO will be used. Some HCDs
use PIO for some transfers but DMA for others. The generic layer
must learn about this.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-17  9:10                                 ` Oliver Neukum
@ 2010-02-17  9:17                                   ` Shilimkar, Santosh
  0 siblings, 0 replies; 155+ messages in thread
From: Shilimkar, Santosh @ 2010-02-17  9:17 UTC (permalink / raw)
  To: linux-arm-kernel

> -----Original Message-----
> From: Oliver Neukum [mailto:oliver at neukum.org]
> Sent: Wednesday, February 17, 2010 2:41 PM
> To: Shilimkar, Santosh
> Cc: Russell King - ARM Linux; Catalin Marinas; Pavel Machek; Greg KH; Matthew Dharm; Sergei Shtylyov;
> Ming Lei; Sebastian Siewior; linux-usb at vger.kernel.org; linux-kernel; linux-arm-kernel; Mankad,
> Maulik Ojas; Gadiyar, Anand
> Subject: Re: USB mass storage and ARM cache coherency
> 
> Am Mittwoch, 17. Februar 2010 09:55:08 schrieb Shilimkar, Santosh:
> > > Your original patch however kills ehci, ohci and uhci on some architectures.
> >
> > How about below approach? Controller driver can set
> > "uses_pio_for_control" if it can't do dma for control transfer.
> >
> > diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
> > index 80995ef..e3eae02 100644
> > --- a/drivers/usb/core/hcd.c
> > +++ b/drivers/usb/core/hcd.c
> > @@ -1276,7 +1276,7 @@ static int map_urb_for_dma(struct usb_hcd *hcd, struct urb *urb,
> >
> >         if (usb_endpoint_xfer_control(&urb->ep->desc)
> >             && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
> > -               if (hcd->self.uses_dma) {
> > +               if (hcd->self.uses_dma && !hcd->self.uses_pio_for_control) {
> 
> It is not elegant to describe exceptions. It would be better, if you split up
> the flag into two flags, called uses_dma_for_ordinary_transfers and
> uses_dma_for_control_transfers. Doing so also makes sure you look at
> all hcd drivers ;-)
> 
Good point. Negative checks are not elegant anyway.
> And the tests become straightforward. And please add a detailed comment
> to explain why this differentiation is needed on ARM.
OK. I shall create a patch with a description of the problem.

Thanks for the feedback!

Regards,
Santosh

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-17  9:15                 ` Oliver Neukum
@ 2010-02-17  9:40                   ` Benjamin Herrenschmidt
  2010-02-17 10:09                     ` Oliver Neukum
  0 siblings, 1 reply; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-17  9:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-17 at 10:15 +0100, Oliver Neukum wrote:
> We should have changed the subject line.
> 
> There's a second problem. It turns out that on ARM
> mapping for DMA must not be done if PIO will be used. Some HCDs
> use PIO for some transfers but DMA for others. The generic layer
> must learn about this. 

Ah, that makes a lot of sense and the same problem would happen on
any non-DMA coherent architecture, including some embedded ppc's.

I can see why the dma unmap would invalidate the dcache and blow
away the PIO.

What bugs me here is that the dma_map_* operation should always
be done at the lowest level, ie, the actual HCD driver, and thus
it should be up to the HCD to decide whether to dma_map or not
depending on whether it's going to do DMA or not. I haven't
scrutinized USB lately but if that isn't the case and the dma_map_*
operations are done behind your back by the USB core then that needs to
be changed one way or another, or hooked at least.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-08 10:03       ` Andy Green
@ 2010-02-17  9:50         ` Sascha Hauer
  2010-02-17  9:57           ` Andy Green
  0 siblings, 1 reply; 155+ messages in thread
From: Sascha Hauer @ 2010-02-17  9:50 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Feb 08, 2010 at 11:03:14AM +0100, Andy Green wrote:
> On 02/08/10 10:51, Somebody in the thread at some point said:
>
>> We could of course flush the caches every time we get a page fault but
>> that's far from optimal, especially since DMA-capable drivers to do not
>> pollute the D-cache and don't need this extra flushing. Note that the
>> recent ARM processors have PIPT caches but separate for I and D and it's
>> the PIO drivers that pollute the D-cache.
>
> Just noting that AFAIK iMX31 USB and MMC drivers both are PIO at the  
> moment, for lack of any platform DMA support of its unusual DMA engine.

The EHCI module has its own DMA engine and has nothing to do with the
SDMA engine.

Sascha


-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-17  9:05               ` Benjamin Herrenschmidt
  2010-02-17  9:15                 ` Oliver Neukum
@ 2010-02-17  9:55                 ` Russell King - ARM Linux
  2010-02-17 10:05                   ` Benjamin Herrenschmidt
  2010-02-17 15:27                 ` Catalin Marinas
                                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 155+ messages in thread
From: Russell King - ARM Linux @ 2010-02-17  9:55 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Feb 17, 2010 at 08:05:43PM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2010-02-16 at 09:22 +0100, Oliver Neukum wrote:
> > This seems wrong to me. Buffers for control transfers may be
> > transfered
> > by DMA, so the caches must be flushed on architectures whose caches
> > are not coherent with respect to DMA.
> > 
> > Would you care to elaborate on the exact nature of the bug you are
> > fixing?
> 
> I missed part of this thread, so forgive me if I'm a bit off here, but
> if the problem is indeed I$/D$ cache coherency vs. PIO transfers, then
> this is a long solved issue on other archs such as ppc (and I _think_
> sparc).

Nope.  It's to do with mapping a buffer for DMA, and then doing PIO
reads/writes to it.

With speculative prefetches, you have to deal with cache coherency with
hardware DMA on DMA unmap.  If you've written to the buffer in violation
of the DMA API buffer ownership rules, then your writes get thrown away
resulting in immediate data corruption.
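
A toy model of that failure, assuming a single write-back cache line in
front of one word of RAM. Everything here (struct line_model and the
helper names) is hypothetical and only mimics the invalidate-on-unmap
behaviour described above; it does not call the real DMA API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-line write-back cache in front of one RAM word. */
struct line_model {
	uint32_t ram;    /* what a DMA-capable device would see */
	uint32_t cached; /* what the CPU sees while the line is valid */
	bool valid;
	bool dirty;
};

/* CPU read: fill the line from RAM on a miss. */
static uint32_t cpu_read(struct line_model *l)
{
	assert(l != NULL);
	if (!l->valid) {
		l->cached = l->ram;
		l->valid = true;
		l->dirty = false;
	}
	return l->cached;
}

/* CPU write: lands in the cache only, RAM is not updated yet. */
static void cpu_write(struct line_model *l, uint32_t v)
{
	l->cached = v;
	l->valid = true;
	l->dirty = true;
}

/* Invalidate: drop the line WITHOUT writing dirty data back. */
static void invalidate(struct line_model *l)
{
	l->valid = false;
	l->dirty = false;
}

/* Model of mapping a DMA_FROM_DEVICE buffer. */
static void model_map_from_device(struct line_model *l)
{
	invalidate(l);
}

/* Model of unmapping: invalidate again so that speculatively
 * prefetched lines cannot hide what the device wrote. */
static void model_unmap_from_device(struct line_model *l)
{
	invalidate(l);
}
```

In this model a cpu_write() between map and unmap only ever reaches the
cache line, so the invalidate in model_unmap_from_device() silently throws
it away and the next cpu_read() returns the old RAM contents.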

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-17  9:50         ` Sascha Hauer
@ 2010-02-17  9:57           ` Andy Green
  0 siblings, 0 replies; 155+ messages in thread
From: Andy Green @ 2010-02-17  9:57 UTC (permalink / raw)
  To: linux-arm-kernel

On 02/17/10 10:50, Somebody in the thread at some point said:
> On Mon, Feb 08, 2010 at 11:03:14AM +0100, Andy Green wrote:
>> On 02/08/10 10:51, Somebody in the thread at some point said:
>>
>>> We could of course flush the caches every time we get a page fault but
>>> that's far from optimal, especially since DMA-capable drivers to do not
>>> pollute the D-cache and don't need this extra flushing. Note that the
>>> recent ARM processors have PIPT caches but separate for I and D and it's
>>> the PIO drivers that pollute the D-cache.
>>
>> Just noting that AFAIK iMX31 USB and MMC drivers both are PIO at the
>> moment, for lack of any platform DMA support of its unusual DMA engine.
>
> The EHCI module has its own DMA engine and has nothing to do with the
> SDMA engine.

You're right, my mistake.  iMX31 MMC is PIO due to no SDMA support though.

-Andy

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-17  9:55                 ` Russell King - ARM Linux
@ 2010-02-17 10:05                   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-17 10:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-17 at 09:55 +0000, Russell King - ARM Linux wrote:
> Nope.  It's to do with mapping a buffer for DMA, and then doing PIO
> reads/writes to it.
> 
> With speculative prefetches, you have to deal with cache coherency with
> hardware DMA on DMA unmap.  If you've written to the buffer in violation
> of the DMA API buffer ownership rules, then your writes get thrown away
> resulting in immediate data corruption. 

Right, and this exact same problem will bite some embedded powerpc
too I suppose :-)

Hrm... actually not :-) We don't do the invalidate at unmap time
today because we know 44x have such a broken prefetcher that we disable
it ... interesting considering that there are machines around that
do non-coherent DMA with 750's style chips who -do- have a prefetcher...
damn, we have a bug :-)

In any case, same problem here.

See my reply to Oliver. Basically, the problem boils down to the
dma_map/unmap being done at the wrong layer. The driver should
simply not do these if it's going to do PIO over that range.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-17  9:40                   ` Benjamin Herrenschmidt
@ 2010-02-17 10:09                     ` Oliver Neukum
  2010-02-17 10:18                       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 155+ messages in thread
From: Oliver Neukum @ 2010-02-17 10:09 UTC (permalink / raw)
  To: linux-arm-kernel

Am Mittwoch, 17. Februar 2010 10:40:09 schrieb Benjamin Herrenschmidt:
> What bugs me here is that the dma_map_* operation should always
> be done at the lowest level, ie, the actual HCD driver, and thus
> it should be up to the HCD to decide whether to dma_map or not
> depending on whether it's going to do DMA or not. I haven't
> scrutinized USB lately but if that isn't the case and the dma_map_*
> operations are done behind your back by the USB core then that needs to
> be changed in a way or another, or hooked at least.

No problem here. USB core does the mapping only if the low-level driver
so requests. The only exception is in usb_buffer_alloc(), but that boils
down to dma_alloc_coherent()

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-17 10:09                     ` Oliver Neukum
@ 2010-02-17 10:18                       ` Benjamin Herrenschmidt
  2010-02-17 10:23                         ` Oliver Neukum
  0 siblings, 1 reply; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-17 10:18 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-17 at 11:09 +0100, Oliver Neukum wrote:
> 
> No problem here. USB core does the mapping only if the low-level driver
> so requests. The only exception is in usb_buffer_alloc(), but that boils
> down to dma_alloc_coherent() 

All right, so why do we need to "fix" anything? Or is the whole thread
moot? :-)

It's pretty clear that between dma_map* and subsequent unmap, the memory
is owned by the device and must not be touched by the CPU. If that is
violated, then we have a driver bug.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-17 10:18                       ` Benjamin Herrenschmidt
@ 2010-02-17 10:23                         ` Oliver Neukum
  2010-02-17 12:15                           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 155+ messages in thread
From: Oliver Neukum @ 2010-02-17 10:23 UTC (permalink / raw)
  To: linux-arm-kernel

Am Mittwoch, 17. Februar 2010 11:18:01 schrieb Benjamin Herrenschmidt:
> > No problem here. USB core does the mapping only if the low-level driver
> > so requests. The only exception is in usb_buffer_alloc(), but that boils
> > down to dma_alloc_coherent() 
> 
> Allright, so why do we need to "fix" anything ? Or is the whole thread
> moot ? :-)

The request a low-level driver does is all or nothing. Either DMA
issues have to be handled by that driver alone, or a finer-grained
description of the DMA requirements is needed. A fix using the latter
approach is being worked on.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-17 10:23                         ` Oliver Neukum
@ 2010-02-17 12:15                           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-17 12:15 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-17 at 11:23 +0100, Oliver Neukum wrote:
> 
> The request a low-level driver does is all or nothing. Either DMA
> issues have to be handled by that driver alone, or a finer-grained
> description of the DMA requirements is needed. A fix using the latter
> approach is being worked on. 

Well, that's what I'm trying to understand.

IE. It's a pretty strong rule ... don't do CPU accesses between dma_map
and unmap. So it's all in driver land at that stage. I'm not sure how
the DMA requirements get into the picture here. IE. That rule is
globally true. It's not going to hurt just non-coherent archs, it's
going to hurt anybody using swiotlb too... So I don't see why you need more
info about the DMA requirements, but maybe I did miss something :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-16  9:39                   ` Russell King - ARM Linux
  2010-02-16 13:32                     ` Oliver Neukum
@ 2010-02-17 12:29                     ` Jamie Lokier
  1 sibling, 0 replies; 155+ messages in thread
From: Jamie Lokier @ 2010-02-17 12:29 UTC (permalink / raw)
  To: linux-arm-kernel

Russell King - ARM Linux wrote:
> On Tue, Feb 16, 2010 at 10:07:20AM +0100, Oliver Neukum wrote:
> > Am Dienstag, 16. Februar 2010 09:55:55 schrieb Shilimkar, Santosh:
> > > > Would you care to elaborate on the exact nature of the bug you are fixing?
> > > On the OMAP4 (ARM cortex-a9) platform, the enumeration fails because control
> > > transfer buffers are corrupted. On our platform, we use PIO mode for control 
> > > transfers and DMA for bulk transfers.
> > > 
> > > The current stack performs dma cache maintenance even for the PIO transfers
> > > which leads to the corruption issue. The control buffers are handled by CPU 
> > > and they already coherent from CPU point of view.
> > 
> > How does the mapping corrupt buffers? It might impact performance, but why
> > do you see corruption?
> 
> On map, buffers are cleaned if they're being used for DMA_TO_DEVICE and
> DMA_BIDIRECTIONAL, or invalidated in the case of DMA_FROM_DEVICE.
> 
> However, because ARM CPUs can now speculatively prefetch, just leaving it
> at that results in corruption of buffers used for DMA.  So we have to
> invalidate DMA_FROM_DEVICE and DMA_BIDIRECTIONAL buffers on unmap to
> ensure coherency with DMA operations.
> 
> If the CPU writes to a DMA_FROM_DEVICE buffer between map and unmap, the
> writes can sit in the cache, and on unmap, they will be discarded.
> 
> Cleaning the cache on unmap is not an option; that too can lead to DMA
> buffer corruption in the DMA case.

Provided the buffers are cleaned on map for
DMA_TO_DEVICE/DMA_BIDIRECTIONAL, I don't see how cleaning on unmap for
DMA_FROM_DEVICE/DMA_BIDIRECTIONAL can cause corruption.  The only way
to get dirty cache lines while mapped is if the CPU did PIO to them.
If it was real DMA, the second clean should be a no-op.  (Assume it's
all one or the other).

Can you explain why cleaning the cache on unmap (as well as on map, in
the DMA_BIDIRECTIONAL case) is not an option?  Just curious, because I
don't see what would go wrong.

> USB and associated host driver must abide by the DMA API buffer
> ownership rules otherwise the result will be data corruption; either
> that or USB/host driver people need to have a discussion with the
> DMA API authors to remove this sensible "restriction".

Just in case my question gives the wrong impression, I agree that the
DMA API must be followed. Additional flushes/cleans are not good for
performance either.
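
For reference, the ownership rule quoted above can be played through in a
toy single-line write-back cache. This is a user-space sketch under
simplifying assumptions: one cache line, no speculative prefetch, and
made-up names throughout. It only shows what happens to a CPU write made
between map and unmap of a DMA_FROM_DEVICE buffer, with and without a
clean on unmap.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-line write-back cache; all names are hypothetical. */
struct toy_line {
	unsigned char mem;     /* contents of RAM */
	unsigned char cached;  /* contents of the cache line */
	bool valid;
	bool dirty;
};

void cpu_write(struct toy_line *l, unsigned char v)
{
	l->cached = v;
	l->valid = true;
	l->dirty = true;       /* write-back: RAM is not updated yet */
}

unsigned char cpu_read(struct toy_line *l)
{
	assert(!l->dirty || l->valid);  /* a dirty line is always valid */
	if (!l->valid) {                /* miss: fill from RAM */
		l->cached = l->mem;
		l->valid = true;
	}
	return l->cached;
}

/* clean: write a dirty line back to RAM and keep it valid. */
void cache_clean(struct toy_line *l)
{
	if (l->valid && l->dirty) {
		l->mem = l->cached;
		l->dirty = false;
	}
}

/* invalidate: drop the line without writing it back. */
void cache_invalidate(struct toy_line *l)
{
	l->valid = false;
	l->dirty = false;
}

/* The device writes RAM directly, bypassing the cache. */
void dma_write(struct toy_line *l, unsigned char v)
{
	l->mem = v;
}

/* DMA_FROM_DEVICE with a CPU write between map and unmap.
 * clean_on_unmap selects the alternative asked about in the thread. */
unsigned char from_device_case(bool clean_on_unmap)
{
	struct toy_line l = { .mem = 0x00 };

	cache_invalidate(&l);      /* map */
	cpu_write(&l, 0x55);       /* CPU touches a buffer it does not own */
	dma_write(&l, 0xAA);       /* device fills the buffer */
	if (clean_on_unmap)
		cache_clean(&l);   /* dirty line overwrites the DMA data */
	cache_invalidate(&l);      /* unmap */
	return cpu_read(&l);
}
```

Here from_device_case(false) yields 0xAA (the CPU write is discarded by
the invalidate) and from_device_case(true) yields 0x55 (the clean writes
the CPU data over what the device stored). The model says nothing about
which of the two outcomes a given driver actually wants.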

-- Jamie

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-17  9:05               ` Benjamin Herrenschmidt
  2010-02-17  9:15                 ` Oliver Neukum
  2010-02-17  9:55                 ` Russell King - ARM Linux
@ 2010-02-17 15:27                 ` Catalin Marinas
  2010-02-17 20:37                   ` Benjamin Herrenschmidt
  2010-02-17 15:27                 ` Catalin Marinas
                                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 155+ messages in thread
From: Catalin Marinas @ 2010-02-17 15:27 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-17 at 20:05 +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2010-02-16 at 09:22 +0100, Oliver Neukum wrote:
> > This seems wrong to me. Buffers for control transfers may be
> > transferred
> > by DMA, so the caches must be flushed on architectures whose caches
> > are not coherent with respect to DMA.
> > 
> > Would you care to elaborate on the exact nature of the bug you are
> > fixing?
> 
> I missed part of this thread, so forgive me if I'm a bit off here, but
> if the problem is indeed I$/D$ cache coherency vs. PIO transfers, then
> this is a long solved issue on other archs such as ppc (and I _think_
> sparc).

The thread I started was indeed regarding I/D cache coherency and PIO.
But it diverged into DMA issues a few days ago (should have been a new
thread).

> The way we do it, at least on powerpc which is PIPT, is to keep track, on
> a per-page basis, of whether a given page is clean for execution using the
> PG_arch_1 bit. This bit is cleared when a new page is popped into the
> page cache, and we clear it from flush_dcache_page() iirc (you may want
> to double-check; I don't have the code at hand right now, or rather, I do
> but I'm too lazy to look right now :-)

We do the same on ARM. The problem with most (all) HCD drivers that do
PIO is that they copy the data to the transfer buffer but there is no
call in these drivers to flush_dcache_page(). The upper mass storage or
filesystem layers don't call this function either, so there isn't
anything that would set the PG_arch_1 bit.

-- 
Catalin

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-17 15:40                 ` Catalin Marinas
@ 2010-02-17 16:19                   ` Catalin Marinas
  2010-02-17 16:19                   ` Catalin Marinas
  1 sibling, 0 replies; 155+ messages in thread
From: Catalin Marinas @ 2010-02-17 16:19 UTC (permalink / raw)
  To: linux-arm-kernel

SORRY - one more message to apologise for the multiple reposts (and
automatically appended legal disclaimer). I've been moved to Exchange
2007 and trying to use Evolution + Exchange-MAPI. It looks like it went
terribly wrong.

Catalin


On Wed, 2010-02-17 at 15:40 +0000, Catalin Marinas wrote:
> On Wed, 2010-02-17 at 20:05 +1100, Benjamin Herrenschmidt wrote:
> > On Tue, 2010-02-16 at 09:22 +0100, Oliver Neukum wrote:
> > > This seems wrong to me. Buffers for control transfers may be
> > > transferred
> > > by DMA, so the caches must be flushed on architectures whose caches
> > > are not coherent with respect to DMA.
> > >
> > > Would you care to elaborate on the exact nature of the bug you are
> > > fixing?
> >
> > I missed part of this thread, so forgive me if I'm a bit off here, but
> > if the problem is indeed I$/D$ cache coherency vs. PIO transfers, then
> > this is a long solved issue on other archs such as ppc (and I _think_
> > sparc).
> 
> The thread I started was indeed regarding I/D cache coherency and PIO.
> But it diverged into DMA issues a few days ago (should have been a new
> thread).
> 
> > The way we do it, at least on powerpc which is PIPT, is to keep track, on
> > a per-page basis, of whether a given page is clean for execution using the
> > PG_arch_1 bit. This bit is cleared when a new page is popped into the
> > page cache, and we clear it from flush_dcache_page() iirc (you may want
> > to double-check; I don't have the code at hand right now, or rather, I do
> > but I'm too lazy to look right now :-)
> 
> We do the same on ARM. The problem with most (all) HCD drivers that do
> PIO is that they copy the data to the transfer buffer but there is no
> call in these drivers to flush_dcache_page(). The upper mass storage or
> filesystem layers don't call this function either, so there isn't
> anything that would set the PG_arch_1 bit.
> 
> -- 
> Catalin
> -- 
> IMPORTANT NOTICE: The contents of this email and any attachments are
> confidential and may also be privileged. If you are not the intended
> recipient, please notify the sender immediately and do not disclose
> the contents to any other person, use it for any purpose, or store or
> copy the information in any medium.  Thank you.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-17  8:55                               ` Shilimkar, Santosh
  2010-02-17  9:10                                 ` Oliver Neukum
@ 2010-02-17 17:02                                 ` Alan Stern
  2010-02-17 20:26                                   ` Russell King - ARM Linux
  2010-02-17 20:30                                   ` Gadiyar, Anand
  1 sibling, 2 replies; 155+ messages in thread
From: Alan Stern @ 2010-02-17 17:02 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 17 Feb 2010, Shilimkar, Santosh wrote:

> How about below approach? Controller driver can set 
> "uses_pio_for_control" if it can't do dma for control transfer.
> 
> diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
> index 80995ef..e3eae02 100644
> --- a/drivers/usb/core/hcd.c
> +++ b/drivers/usb/core/hcd.c
> @@ -1276,7 +1276,7 @@ static int map_urb_for_dma(struct usb_hcd *hcd, struct urb *urb,
>  
>  	if (usb_endpoint_xfer_control(&urb->ep->desc)
>  	    && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
> -		if (hcd->self.uses_dma) {
> +		if (hcd->self.uses_dma && !hcd->self.uses_pio_for_control) {
>  			urb->setup_dma = dma_map_single(
>  					hcd->self.controller,
>  					urb->setup_packet,
> @@ -1335,7 +1335,7 @@ static void unmap_urb_for_dma(struct usb_hcd *hcd, struct urb *urb)
>  
>  	if (usb_endpoint_xfer_control(&urb->ep->desc)
>  	    && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
> -		if (hcd->self.uses_dma)
> +		if (hcd->self.uses_dma && !hcd->self.uses_pio_for_control)
>  			dma_unmap_single(hcd->self.controller, urb->setup_dma,
>  					sizeof(struct usb_ctrlrequest),
>  					DMA_TO_DEVICE);
> diff --git a/include/linux/usb.h b/include/linux/usb.h
> index d7ace1b..ba5b0a2 100644
> --- a/include/linux/usb.h
> +++ b/include/linux/usb.h
> @@ -329,6 +329,9 @@ struct usb_bus {
>  	int busnum;			/* Bus number (in order of reg) */
>  	const char *bus_name;		/* stable id (PCI slot_name etc) */
>  	u8 uses_dma;			/* Does the host controller use DMA? */
> +	u8 uses_pio_for_control;	/* Does the host controller use PIO
> +					 * for control transfers?
> +					 */
>  	u8 otg_port;			/* 0, or number of OTG/HNP port */
>  	unsigned is_b_host:1;		/* true during some HNP roleswitches */
>  	unsigned b_hnp_enable:1;	/* OTG: did A-Host enable HNP? */

Why do you skip mapping the setup packet but not the data packet?

Alan Stern

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-17 17:02                                 ` Alan Stern
@ 2010-02-17 20:26                                   ` Russell King - ARM Linux
  2010-02-17 20:30                                   ` Gadiyar, Anand
  1 sibling, 0 replies; 155+ messages in thread
From: Russell King - ARM Linux @ 2010-02-17 20:26 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Feb 17, 2010 at 12:02:21PM -0500, Alan Stern wrote:
> Why do you skip mapping the setup packet but not the data packet?

This is something of a FAQ in this thread.  Here are the responses to
similar questions yesterday:

"Gadiyar, Anand" <gadiyar@ti.com> said:
> Not really. For instance, in the case of the DMA engine in the MUSB
> controller in OMAP3, we can only use DMA with endpoints other than
> EP0, and EP0 is what is used for control transfers.
>
> It's not PIO for all the endpoints or DMA for all of them.

"Shilimkar, Santosh" <santosh.shilimkar@ti.com> said:
> On the OMAP4 (ARM cortex-a9) platform, the enumeration fails because control
> transfer buffers are corrupted. On our platform, we use PIO mode for control
> transfers and DMA for bulk transfers.
>
> The current stack performs dma cache maintenance even for the PIO transfers
> which leads to the corruption issue. The control buffers are handled by CPU
> and they are already coherent from the CPU's point of view.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-17 17:02                                 ` Alan Stern
  2010-02-17 20:26                                   ` Russell King - ARM Linux
@ 2010-02-17 20:30                                   ` Gadiyar, Anand
  2010-02-18  6:56                                     ` Oliver Neukum
  1 sibling, 1 reply; 155+ messages in thread
From: Gadiyar, Anand @ 2010-02-17 20:30 UTC (permalink / raw)
  To: linux-arm-kernel

Alan Stern wrote:
> On Wed, 17 Feb 2010, Shilimkar, Santosh wrote:
> 
> > How about below approach? Controller driver can set
> > "uses_pio_for_control" if it can't do dma for control transfer.
> >
> > diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
> > index 80995ef..e3eae02 100644
> > --- a/drivers/usb/core/hcd.c
> > +++ b/drivers/usb/core/hcd.c
> > @@ -1276,7 +1276,7 @@ static int map_urb_for_dma(struct usb_hcd *hcd, struct urb *urb,
> >
> >       if (usb_endpoint_xfer_control(&urb->ep->desc)
> >           && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
> > -             if (hcd->self.uses_dma) {
> > +             if (hcd->self.uses_dma && !hcd->self.uses_pio_for_control) {
> >                       urb->setup_dma = dma_map_single(
> >                                       hcd->self.controller,
> >                                       urb->setup_packet,
> > @@ -1335,7 +1335,7 @@ static void unmap_urb_for_dma(struct usb_hcd *hcd, struct urb *urb)
> >
> >       if (usb_endpoint_xfer_control(&urb->ep->desc)
> >           && !(urb->transfer_flags & URB_NO_SETUP_DMA_MAP)) {
> > -             if (hcd->self.uses_dma)
> > +             if (hcd->self.uses_dma && !hcd->self.uses_pio_for_control)
> >                       dma_unmap_single(hcd->self.controller, urb->setup_dma,
> >                                       sizeof(struct usb_ctrlrequest),
> >                                       DMA_TO_DEVICE);
> > diff --git a/include/linux/usb.h b/include/linux/usb.h
> > index d7ace1b..ba5b0a2 100644
> > --- a/include/linux/usb.h
> > +++ b/include/linux/usb.h
> > @@ -329,6 +329,9 @@ struct usb_bus {
> >       int busnum;                     /* Bus number (in order of reg) */
> >       const char *bus_name;           /* stable id (PCI slot_name etc) */
> >       u8 uses_dma;                    /* Does the host controller use DMA? */
> > +     u8 uses_pio_for_control;        /* Does the host controller use PIO
> > +                                      * for control tansfers?
> > +                                      */
> >       u8 otg_port;                    /* 0, or number of OTG/HNP port */
> >       unsigned is_b_host:1;           /* true during some HNP roleswitches */
> >       unsigned b_hnp_enable:1;        /* OTG: did A-Host enable HNP? */
> 
> Why do you skip mapping the setup packet but not the data packet?
> 

I think that's an oversight. For this controller, we need to skip mapping
all buffers used to do transfers on EP0, which is all control transfers.

Will fix in the next version of the patch.
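
Roughly, the decision would then look like the predicate below. This is a
stand-alone sketch and not the next version of the patch: toy_bus and
toy_should_dma_map() are hypothetical names, and only the two flags come
from the diff quoted above. The same answer is meant to apply to the
setup packet and the data buffer of a control URB.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two struct usb_bus flags in the patch. */
struct toy_bus {
	bool uses_dma;
	bool uses_pio_for_control;  /* flag proposed in the quoted patch */
};

/* Should the core DMA-map the buffers of this URB?
 * is_control is true for URBs queued on a control endpoint. */
bool toy_should_dma_map(const struct toy_bus *bus, bool is_control)
{
	assert(bus);
	if (!bus->uses_dma)
		return false;   /* pure PIO controller: never map */
	if (is_control && bus->uses_pio_for_control)
		return false;   /* control transfers are done by the CPU */
	return true;            /* bulk and other transfers still use DMA */
}
```

For a controller with both flags set this gives false for control URBs
and true for bulk URBs.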

- Anand

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-17 15:27                 ` Catalin Marinas
@ 2010-02-17 20:37                   ` Benjamin Herrenschmidt
  2010-02-17 20:44                     ` Russell King - ARM Linux
  0 siblings, 1 reply; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-17 20:37 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-17 at 15:27 +0000, Catalin Marinas wrote:
> We do the same on ARM. The problem with most (all) HCD drivers that do
> PIO is that they copy the data to the transfer buffer but there is no
> call in this driver to flush_dcache_page(). The upper mass storage or
> filesystem layers don't call this function either, so there isn't
> anything that would set the PG_arch_1 bit.

Actually, clear it :-)

I suppose that's one thing that needs to be fixed in the drivers.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-17 20:37                   ` Benjamin Herrenschmidt
@ 2010-02-17 20:44                     ` Russell King - ARM Linux
  2010-02-17 22:31                       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 155+ messages in thread
From: Russell King - ARM Linux @ 2010-02-17 20:44 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Feb 18, 2010 at 07:37:00AM +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2010-02-17 at 15:27 +0000, Catalin Marinas wrote:
> > We do the same on ARM. The problem with most (all) HCD drivers that do
> > PIO is that they copy the data to the transfer buffer but there is no
> > call in this driver to flush_dcache_page(). The upper mass storage or
> > filesystem layers don't call this function either, so there isn't
> > anything that would set the PG_arch_1 bit.
> 
> Actually, clear it :-)
> 
> I suppose that's one thing that needs to be fixed in the drivers.

No, because that'd probably bugger up the Sparc64 method of delaying
flush_dcache_page.

This method works as follows:

- a page cache page is allocated - this has PG_arch_1 clear.

- IO happens on it and it's placed into the page cache.  PG_arch_1 is
  still clear.

- someone calls read()/write() which accesses the page.  The generic
  file IO layers call flush_dcache_page() in response to read()/write()
  fs calls.  flush_dcache_page() spots that the page is not yet mapped
  into userspace, and sets PG_arch_1 to mark the fact that the kernel
  mapping is dirty.

- when someone maps the page, we check PG_arch_1 in update_mmu_cache.
  If PG_arch_1 is set, we flush the kernel mapping.

Clearly, if we go around having drivers clearing PG_arch_1, this is going
to break horribly.
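
The four steps above can be sketched as a toy user-space state machine.
This is not the kernel implementation: the struct and the toy_* functions
are made-up stand-ins, and the "flush immediately if already mapped"
branch is an assumption that the description above does not spell out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-page state involved. */
struct lazy_page {
	bool pg_arch_1;       /* set: kernel mapping may hold dirty D-cache lines */
	bool mapped_to_user;
	int flushes;          /* number of real cache flushes performed */
};

void toy_flush_kernel_mapping(struct lazy_page *p)
{
	p->flushes++;
}

/* Called after the kernel has written to the page via its own mapping. */
void toy_flush_dcache_page(struct lazy_page *p)
{
	if (!p->mapped_to_user)
		p->pg_arch_1 = true;          /* defer: no user mapping yet */
	else
		toy_flush_kernel_mapping(p);  /* assumption: flush right away */
}

/* Called when the page is mapped into a user address space. */
void toy_update_mmu_cache(struct lazy_page *p)
{
	if (p->pg_arch_1) {
		p->pg_arch_1 = false;
		toy_flush_kernel_mapping(p);
	}
	p->mapped_to_user = true;
}

/* Walk the four steps; returns the number of flushes performed. */
int toy_lazy_flush_walkthrough(void)
{
	struct lazy_page p = { 0 };   /* allocated, I/O done, flag clear */

	toy_flush_dcache_page(&p);    /* read()/write() path */
	assert(p.pg_arch_1 && p.flushes == 0);
	toy_update_mmu_cache(&p);     /* page gets mapped: deferred flush */
	return p.flushes;
}

/* A PIO driver fills the page but nothing calls toy_flush_dcache_page(). */
int toy_pio_without_flush(void)
{
	struct lazy_page p = { 0 };

	toy_update_mmu_cache(&p);     /* flag was never set: no flush */
	return p.flushes;
}
```

toy_lazy_flush_walkthrough() performs exactly one flush, at map time.
toy_pio_without_flush() performs none, which is the PIO case discussed
earlier in the thread where nothing ever sets the bit.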

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-17 20:44                     ` Russell King - ARM Linux
@ 2010-02-17 22:31                       ` Benjamin Herrenschmidt
  2010-02-19 17:15                         ` Catalin Marinas
  0 siblings, 1 reply; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-17 22:31 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-17 at 20:44 +0000, Russell King - ARM Linux wrote:
> No, because that'd probably bugger up the Sparc64 method of delaying
> flush_dcache_page.
> 
> This method works as follows:
> 
> - a page cache page is allocated - this has PG_arch_1 clear.
> 
> - IO happens on it and it's placed into the page cache.  PG_arch_1 is
>   still clear.
> 
> - someone calls read()/write() which accesses the page.  The generic
>   file IO layers call flush_dcache_page() in response to
> read()/write()
>   fs calls.  flush_dcache_page() spots that the page is not yet mapped
>   into userspace, and sets PG_arch_1 to mark the fact that the kernel
>   mapping is dirty.
> 
> - when someone maps the page, we check PG_arch_1 in update_mmu_cache.
>   If PG_arch_1 is set, we flush the kernel mapping.
> 
> Clearly, if we go around having drivers clearing PG_arch_1, this is
> going to break horribly. 

Ok, you do things very differently than us on ppc then. We clear
PG_arch_1 in flush_dcache_page(), and we set it when the page has been
cache cleaned for execution.

We assume that anybody that dirties a page in the kernel will call
flush_dcache_page() which removes our PG_arch_1 bit thus marking the
page "dirty".

Note that from experience, doing the check & flushes in
update_mmu_cache() is racy on SMP. At least for I$/D$, we have the case
where processor one does set_pte followed by update_mmu_cache(). The
latter isn't done yet but processor 2 sees the PTE now and starts using
it while the cache hasn't been fully flushed yet. You may avoid that race
in some ways, but on ppc, I've stopped using that.

I now do things directly in set_pte_at(). In fact, that's why I want
your patch to change update_mmu_cache() to take a PTE pointer :-) Since
my set_pte_at() can now remove the _PAGE_EXEC bit, I need
update_mmu_cache() to re-read the PTE before it updates the hash table
or TLB.
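
As a rough illustration of the ppc convention, here is a toy user-space
sketch. It is not the powerpc code: the names are made up, and it does
not model stripping _PAGE_EXEC or the SMP race. It only shows the flag
meaning "clean for execution", flush_dcache_page() clearing it, and the
sync happening when an executable PTE is set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-page state involved. */
struct exec_page {
	bool pg_arch_1;     /* set: page has been cleaned for execution */
	int icache_syncs;   /* number of D-cache clean + I-cache invalidate ops */
};

/* The kernel dirtied the page: it is no longer clean for execution. */
void toy_flush_dcache_page(struct exec_page *p)
{
	p->pg_arch_1 = false;
}

/* Install a PTE; sync the caches first if an exec mapping needs it. */
void toy_set_pte_at(struct exec_page *p, bool want_exec)
{
	if (want_exec && !p->pg_arch_1) {
		p->icache_syncs++;
		p->pg_arch_1 = true;
	}
}

/* Map for exec twice, dirty the page, map again; returns the sync count. */
int toy_exec_map_write_remap(void)
{
	struct exec_page p = { 0 };      /* new page cache page: flag clear */

	toy_set_pte_at(&p, true);        /* first exec mapping: sync */
	toy_set_pte_at(&p, true);        /* already clean: no sync */
	assert(p.icache_syncs == 1);
	toy_flush_dcache_page(&p);       /* kernel writes to the page */
	toy_set_pte_at(&p, true);        /* must sync again */
	return p.icache_syncs;
}
```

toy_exec_map_write_remap() ends with two syncs. Note the flag has the
opposite sense to the deferred-flush scheme quoted above, where a set bit
means the kernel mapping is dirty.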

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-17 20:30                                   ` Gadiyar, Anand
@ 2010-02-18  6:56                                     ` Oliver Neukum
  2010-02-18  7:14                                       ` Gadiyar, Anand
  0 siblings, 1 reply; 155+ messages in thread
From: Oliver Neukum @ 2010-02-18  6:56 UTC (permalink / raw)
  To: linux-arm-kernel

Am Mittwoch, 17. Februar 2010 21:30:24 schrieb Gadiyar, Anand:
> > Why do you skip mapping the setup packet but not the data packet?
> > 
> 
> I think that's an oversight. For this controller, we need to skip mapping
> all buffers used to do transfers on EP0, which is all control transfers.

One thing more. Do you have an issue with EP 0 only or all control
endpoints? EP 0 must be control, but devices are within spec if they
have multiple control endpoints provided EP 0 is control.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-18  6:56                                     ` Oliver Neukum
@ 2010-02-18  7:14                                       ` Gadiyar, Anand
  0 siblings, 0 replies; 155+ messages in thread
From: Gadiyar, Anand @ 2010-02-18  7:14 UTC (permalink / raw)
  To: linux-arm-kernel

Oliver Neukum wrote:
> On Wednesday, 17 February 2010 21:30:24, Gadiyar, Anand wrote:
> > > Why do you skip mapping the setup packet but not the data packet?
> > > 
> > 
> > I think that's oversight. For this controller, we need to skip mapping
> > all buffers used to do transfers on EP0, which is all control transfers.
> 
> One thing more. Do you have an issue with EP 0 only or all control
> endpoints? EP 0 must be control, but devices are within spec if they
> have multiple control endpoints provided EP 0 is control.

Sorry for the confusion. The issue is not with EP 0 of devices
connected to the controller; the problem is with EP 0 on the host
controller itself.

The controller in question is the MUSB OTG controller present in
OMAPs, Davinci chips, and some Blackfins. The MUSB HCD driver is
written such that it carries out all control transfers on EP 0 of
the controller. All bulk transfers are carried out on other hardware
endpoints.

(This is the same "hardware endpoint" that is used when the MUSB
is in gadget mode.)


I'm not really sure why EP0 was chosen for control transfers, or
if there is a restriction that we *need* to use it. Let me study
the docs some more.

The problem is that with the driver code as written today, we use
EP 0 for all control transfers, and the DMA engine cannot do DMA
to this endpoint's FIFO.
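
The constraint described above can be sketched in plain user-space C.
All sim_* names are hypothetical and are not taken from the MUSB driver;
the sketch only encodes what is stated in this message: control transfers
go through hardware EP 0, bulk transfers go through other hardware
endpoints, and the DMA engine cannot reach EP 0's FIFO. The point it
tries to show is that "map for DMA or not" has to be decided per
transfer, not once per controller.

```c
#include <stdbool.h>

/* Hypothetical transfer classes; only the two mentioned in the thread. */
enum sim_xfer_type { SIM_XFER_CONTROL, SIM_XFER_BULK };

/*
 * Assumption from the message above: the driver runs every control
 * transfer on hardware EP 0 and bulk transfers on some other endpoint.
 */
unsigned sim_pick_hw_ep(enum sim_xfer_type type, unsigned bulk_hw_ep)
{
    return (type == SIM_XFER_CONTROL) ? 0u : bulk_hw_ep;
}

/* Assumption from the message above: no DMA to the EP 0 FIFO. */
bool sim_hw_ep_can_dma(unsigned hw_ep)
{
    return hw_ep != 0u;
}

/* Per-transfer decision: only buffers that will really be DMA'd get mapped. */
bool sim_urb_needs_dma_mapping(enum sim_xfer_type type, unsigned bulk_hw_ep)
{
    return sim_hw_ep_can_dma(sim_pick_hw_ep(type, bulk_hw_ep));
}
```

With this model a single controller answers "no" for control transfers
and "yes" for bulk transfers, which is why a single per-controller
uses_dma flag cannot describe it exactly.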

- Anand

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-17 22:31                       ` Benjamin Herrenschmidt
@ 2010-02-19 17:15                         ` Catalin Marinas
  2010-02-19 17:36                           ` Catalin Marinas
  2010-02-24  2:39                           ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 155+ messages in thread
From: Catalin Marinas @ 2010-02-19 17:15 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-17 at 22:31 +0000, Benjamin Herrenschmidt wrote:
> On Wed, 2010-02-17 at 20:44 +0000, Russell King - ARM Linux wrote:
> > No, because that'd probably bugger up the Sparc64 method of delaying
> > flush_dcache_page.
> >
> > This method works as follows:
> >
> > - a page cache page is allocated - this has PG_arch_1 clear.
> >
> > - IO happens on it and it's placed into the page cache.  PG_arch_1 is
> >   still clear.
> >
> > - someone calls read()/write() which accesses the page.  The generic
> >   file IO layers call flush_dcache_page() in response to read()/write()
> >   fs calls.  flush_dcache_page() spots that the page is not yet mapped
> >   into userspace, and sets PG_arch_1 to mark the fact that the kernel
> >   mapping is dirty.
> >
> > - when someone maps the page, we check PG_arch_1 in update_mmu_cache.
> >   If PG_arch_1 is set, we flush the kernel mapping.
> >
> > Clearly, if we go around having drivers clearing PG_arch_1, this is
> > going to break horribly.
> 
> Ok, you do things very differently than us on ppc then. We clear
> PG_arch_1 in flush_dcache_page(), and we set it when the page has been
> cache cleaned for execution.

From this perspective it's not that different, just that we use the
negated PG_arch_1.

> We assume that anybody that dirties a page in the kernel will call
> flush_dcache_page() which removes our PG_arch_1 bit thus marking the
> page "dirty".

This assumption is not valid with some drivers like USB HCD doing PIO.
But, yes, that's how it should be done.

> Note that from experience, doing the check & flushes in
> update_mmu_cache() is racy on SMP. At least for I$/D$, we have the case
> where processor one does set_pte followed by update_mmu_cache(). The
> latter isn't done yet but processor 2 sees the PTE now and starts using
> it, cache hasn't been fully flushed yet. You may avoid that race in some
> ways, but on ppc, I've stopped using that.

I think that's possible on ARM too. Having two threads on different
CPUs, one thread triggers a prefetch abort (instruction page fault) on
CPU0 but the second thread on CPU1 may branch into this page after
set_pte() (hence not fault) but before update_mmu_cache() doing the
flush.

On ARM11MPCore we flush the caches in flush_dcache_page() because the
cache maintenance operations weren't visible to the other CPUs.
Cortex-A9 broadcasts the cache operations in hardware so we can use lazy
flushing but with the race you pointed out.

Using set_pte_at() for delayed flushing may be a better option for ARM
as well (and maybe Documentation/cachetlb.txt updated).
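
The ordering problem discussed above can be modelled in a few lines of
user-space C. This is only a sketch under simplifying assumptions: the
sim_* names are hypothetical, "the other CPU" is modelled as a check that
runs after every step, and no real cache or page-table operation is
performed. It compares flushing after the PTE has been published (the
update_mmu_cache() placement) with flushing before it (the set_pte_at()
placement).

```c
#include <stdbool.h>

/* Hypothetical model of what a second CPU could observe; not kernel code. */
struct sim_mm {
    bool pte_visible;   /* PTE has been published */
    bool page_flushed;  /* D-cache cleaned / I-cache invalidated for the page */
    bool stale_exec;    /* another CPU could have executed stale contents */
};

/* Run after every step: models CPU1 getting a chance to branch into the page. */
void sim_other_cpu_runs(struct sim_mm *mm)
{
    if (mm->pte_visible && !mm->page_flushed)
        mm->stale_exec = true;
}

/* Lazy flush in an update_mmu_cache()-like hook, i.e. after the PTE is set. */
bool sim_flush_after_set_pte(void)
{
    struct sim_mm mm = { false, false, false };

    mm.pte_visible = true;      /* set_pte() */
    sim_other_cpu_runs(&mm);    /* window: PTE visible, page not yet flushed */
    mm.page_flushed = true;     /* flush done in update_mmu_cache() */
    sim_other_cpu_runs(&mm);
    return mm.stale_exec;
}

/* Flush folded into a set_pte_at()-like step, i.e. before the PTE is visible. */
bool sim_flush_before_set_pte(void)
{
    struct sim_mm mm = { false, false, false };

    mm.page_flushed = true;     /* flush first */
    sim_other_cpu_runs(&mm);
    mm.pte_visible = true;      /* then publish the PTE */
    sim_other_cpu_runs(&mm);
    return mm.stale_exec;
}
```

In this model only the first ordering leaves a window in which the PTE is
visible and the page is not yet flushed.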

Thanks.

-- 
Catalin

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-19 17:15                         ` Catalin Marinas
@ 2010-02-19 17:36                           ` Catalin Marinas
  2010-02-19 20:53                             ` Oliver Neukum
  2010-02-24  2:47                             ` Benjamin Herrenschmidt
  2010-02-24  2:39                           ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 155+ messages in thread
From: Catalin Marinas @ 2010-02-19 17:36 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-02-19 at 17:15 +0000, Catalin Marinas wrote:
> On Wed, 2010-02-17 at 22:31 +0000, Benjamin Herrenschmidt wrote:
> > On Wed, 2010-02-17 at 20:44 +0000, Russell King - ARM Linux wrote:
> > > No, because that'd probably bugger up the Sparc64 method of delaying
> > > flush_dcache_page.
> > >
> > > This method works as follows:
> > >
> > > - a page cache page is allocated - this has PG_arch_1 clear.
> > >
> > > - IO happens on it and it's placed into the page cache.  PG_arch_1 is
> > >   still clear.
> > >
> > > - someone calls read()/write() which accesses the page.  The generic
> > >   file IO layers call flush_dcache_page() in response to read()/write()
> > >   fs calls.  flush_dcache_page() spots that the page is not yet mapped
> > >   into userspace, and sets PG_arch_1 to mark the fact that the kernel
> > >   mapping is dirty.
> > >
> > > - when someone maps the page, we check PG_arch_1 in update_mmu_cache.
> > >   If PG_arch_1 is set, we flush the kernel mapping.
> > >
> > > Clearly, if we go around having drivers clearing PG_arch_1, this is
> > > going to break horribly.
> >
> > Ok, you do things very differently than us on ppc then. We clear
> > PG_arch_1 in flush_dcache_page(), and we set it when the page has been
> > cache cleaned for execution.
> 
> From this perspective it's not that different, just that we use the
> negated PG_arch_1.

I got your point now (after reading the replies on linux-arch :)).

So PPC assumes that if PG_arch_1 is clear (the default), the page wasn't
cleaned. If there is no call to flush_dcache_page() but the page gets
mapped to user space, update_mmu_cache() (or set_pte_at()) would simply
assume that the page was dirtied, flush the caches and set this bit.

We could easily do this on ARM as well and assume that the page is dirty
if !PG_arch_1. But it only partially solves the problem (only for
faulted-in pages).

If a page is already mapped in user space, flush_dcache_page() on ARM
does the flushing rather than deferring it to update_mmu_cache(). The
PIO HCD drivers, however, don't call flush_dcache_page(). Is it possible
that the HCD could transfer data into a page cache page already mapped
in user space? My understanding is that the scenario above is possible.
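
To make the two PG_arch_1 conventions concrete, here is a minimal
user-space sketch. It is not kernel code: struct sim_page and the sim_*
functions are hypothetical, and the model covers only the deferred path
for a page that is not yet mapped into user space. FLAG_MEANS_DIRTY
stands for the scheme Russell described (flag set = kernel mapping
dirty), FLAG_MEANS_CLEAN for the ppc scheme (flag set = page already
cleaned, so a clear flag is treated as dirty).

```c
#include <stdbool.h>

/* Hypothetical model of one page; not a kernel structure. */
struct sim_page {
    bool pg_arch_1;     /* the arch-private page flag */
    bool dcache_dirty;  /* ground truth: data still sits only in the D-cache */
};

enum sim_convention {
    FLAG_MEANS_DIRTY,   /* flag set   = kernel mapping is dirty */
    FLAG_MEANS_CLEAN    /* flag set   = page has been cleaned   */
};

/* A PIO driver writes the page with the CPU; nothing touches the flag. */
void sim_pio_write(struct sim_page *p)
{
    p->dcache_dirty = true;
}

/* Model of flush_dcache_page() on an unmapped page: just record "dirty". */
void sim_flush_dcache_page(struct sim_page *p, enum sim_convention c)
{
    p->pg_arch_1 = (c == FLAG_MEANS_DIRTY);
}

/*
 * Model of the map-time check (update_mmu_cache() or set_pte_at()).
 * Returns true if user space would then see coherent data.
 */
bool sim_map_to_user(struct sim_page *p, enum sim_convention c)
{
    bool must_flush = (c == FLAG_MEANS_DIRTY) ? p->pg_arch_1 : !p->pg_arch_1;

    if (must_flush) {
        p->dcache_dirty = false;                 /* the actual flush */
        p->pg_arch_1 = (c == FLAG_MEANS_CLEAN);  /* mark the page clean */
    }
    return !p->dcache_dirty;
}

/* Scenario: fresh page (flag clear), PIO write, optional driver flush, map. */
bool sim_pio_then_map(enum sim_convention c, bool driver_calls_flush)
{
    struct sim_page p = { false, false };

    sim_pio_write(&p);
    if (driver_calls_flush)
        sim_flush_dcache_page(&p, c);
    return sim_map_to_user(&p, c);
}
```

In this model a PIO driver that never calls flush_dcache_page() leaves
the page incoherent under the first convention but not under the second.
The case of a page that is already mapped when the driver writes to it
is outside the model, which is the remaining question raised above.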

-- 
Catalin

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-19 17:36                           ` Catalin Marinas
@ 2010-02-19 20:53                             ` Oliver Neukum
  2010-02-24  2:48                               ` Benjamin Herrenschmidt
  2010-02-24  2:47                             ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 155+ messages in thread
From: Oliver Neukum @ 2010-02-19 20:53 UTC (permalink / raw)
  To: linux-arm-kernel

On Friday, 19 February 2010 18:36:51, Catalin Marinas wrote:
> If a page is already mapped in user space, flush_dcache_page() on ARM
> does the flushing rather than deferring it to update_mmu_cache(). The
> PIO HCD drivers, however, don't call flush_dcache_page(). Is it possible
> that the HCD could transfer data into a page cache page already mapped
> in user space? My understanding is that the scenario above is possible.

Yes, video drivers do that.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-16  8:51               ` Gadiyar, Anand
@ 2010-02-20  7:21                 ` Pete Zaitcev
  0 siblings, 0 replies; 155+ messages in thread
From: Pete Zaitcev @ 2010-02-20  7:21 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 16 Feb 2010 14:21:48 +0530
"Gadiyar, Anand" <gadiyar@ti.com> wrote:

> >         hcd->self.uses_dma = (dev->dma_mask != NULL);
> > 
> > Is it easier to make sure that PIO devices don't have dev->dma_mask set?
> 
> Not really. For instance, in the case of the DMA engine in the MUSB
> controller in OMAP3, we can only use DMA with endpoints other than
> EP0, and EP0 is what is used for control transfers.
> 
> It's not PIO for all the endpoints or DMA for all of them.

The HC driver does not have to be 100% truthful here. If the system
is not HIGHMEM, HCD can easily set uses_dma to false yet use DMA
by mapping buffers itself, without relying on the quoted code.

On a HIGHMEM system, the block layer will bounce-buffer data in such a case.
Hopefully not a problem for ARM?

All network stack drivers work that way, BTW.

-- Pete

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-19 17:15                         ` Catalin Marinas
  2010-02-19 17:36                           ` Catalin Marinas
@ 2010-02-24  2:39                           ` Benjamin Herrenschmidt
  2010-02-26 16:44                             ` Catalin Marinas
  1 sibling, 1 reply; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-24  2:39 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-02-19 at 17:15 +0000, Catalin Marinas wrote:
> > Ok, you do things very differently than us on ppc then. We clear
> > PG_arch_1 in flush_dcache_page(), and we set it when the page has been
> > cache cleaned for execution.
> 
> From this perspective it's not that different, just that we use the
> negated PG_arch_1.

Right, though you default as "clean" while we default as "dirty".

> > We assume that anybody that dirties a page in the kernel will call
> > flush_dcache_page() which removes our PG_arch_1 bit thus marking the
> > page "dirty".
> 
> This assumption is not valid with some drivers like USB HCD doing PIO.
> But, yes, that's how it should be done.

So we're back to the fix being done at the individual driver level.
If a driver is going to write into the page cache, it needs to whack the bits.

Now there's of course the question as to whether you really only want to
do that for a PIO access and not for a DMA access. I think on power we
don't really discriminate that much (since in any case our icache still
needs flushing). Maybe it would be useful to separate the I$ and D$ bits
but I'm not sure I can be bothered.
 
> > Note that from experience, doing the check & flushes in
> > update_mmu_cache() is racy on SMP. At least for I$/D$, we have the case
> > where processor one does set_pte followed by update_mmu_cache(). The
> > latter isn't done yet but processor 2 sees the PTE now and starts using
> > it, cache hasn't been fully flushed yet. You may avoid that race in some
> > ways, but on ppc, I've stopped using that.
> 
> I think that's possible on ARM too. Having two threads on different
> CPUs, one thread triggers a prefetch abort (instruction page fault) on
> CPU0 but the second thread on CPU1 may branch into this page after
> set_pte() (hence not fault) but before update_mmu_cache() doing the
> flush.
> 
> On ARM11MPCore we flush the caches in flush_dcache_page() because the
> cache maintenance operations weren't visible to the other CPUs.

I'm not even sure that's going to be 100% correct. Don't you also need
to flush the remote icaches when you are dealing with instructions (such
as swap) anyways ?

I've had some discussions in the past with Russell and others around the
problem of non-broadcast cache ops on ARM SMP since that's also hurting
you hard with dma mappings.

Can you issue IPIs as FIQs if needed (from my old ARM knowledge, FIQs
are still on even in local_irq_save() blocks right ? I haven't touched
low level ARM for years tho, I may have forgotten things).

In this case, you should probably use the same bits as A9 and simply
make them use FIQs on 11MP to make the other cores flush as well.

> Cortex-A9 broadcasts the cache operations in hardware so we can use
> lazy flushing but with the race you pointed out.

Right.

> Using set_pte_at() for delayed flushing may be a better option for ARM
> as well (and maybe Documentation/cachetlb.txt updated). 

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-19 17:36                           ` Catalin Marinas
  2010-02-19 20:53                             ` Oliver Neukum
@ 2010-02-24  2:47                             ` Benjamin Herrenschmidt
  2010-02-24 16:19                               ` Alan Stern
  2010-02-26 16:25                               ` Catalin Marinas
  1 sibling, 2 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-24  2:47 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-02-19 at 17:36 +0000, Catalin Marinas wrote:
> 
> If a page is already mapped in user space, flush_dcache_page() on ARM
> does the flushing rather than deferring it to update_mmu_cache(). 

This is for D-cache aliases on VIVT right ? Or are you still talking
about I/D coherency on PIPT ARMs ? Because the latter should not matter
for already mapped userspace pages in the sense that if user space
explicitly read()s into a page, it's up to userspace to cache clean that
page before executing from it in my book :-)

> The PIO HCD drivers, however, don't call flush_dcache_page(). Is it possible
> that the HCD could transfer data into a page cache page already mapped
> in user space? My understanding is that the scenario above is possible.

It is but I'm not confident the responsibility for doing that cleanup
is at the HCD level. That would impact a lot of HCD activities that
don't need such flushing since the use of the page is purely in-kernel.

Though I suppose that could be optimized out in most cases using the page
use count.

But I still wonder whether it should be pushed down to the actual
interface drivers, that's always been the case I believe. In fact, in
the case of block ops, it's generally done at the BIO or even file
system layer right ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-19 20:53                             ` Oliver Neukum
@ 2010-02-24  2:48                               ` Benjamin Herrenschmidt
  2010-02-24  7:16                                 ` Oliver Neukum
  0 siblings, 1 reply; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-24  2:48 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-02-19 at 21:53 +0100, Oliver Neukum wrote:
> On Friday, 19 February 2010 18:36:51, Catalin Marinas wrote:
> > If a page is already mapped in user space, flush_dcache_page() on ARM
> > does the flushing rather than deferring it to update_mmu_cache(). The
> > PIO HCD drivers, however, don't call flush_dcache_page(). Is it possible
> > that the HCD could transfer data into a page cache page already mapped
> > in user space? My understanding is that the scenario above is possible.
> 
> Yes, video drivers do that. 

In which case it would be up to the video driver to call
flush_dcache_page() (though if it's v4l you are talking about, maybe it
might make sense to push it into the v4l layer itself).

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-24  2:48                               ` Benjamin Herrenschmidt
@ 2010-02-24  7:16                                 ` Oliver Neukum
  2010-02-24 21:12                                   ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 155+ messages in thread
From: Oliver Neukum @ 2010-02-24  7:16 UTC (permalink / raw)
  To: linux-arm-kernel

On Wednesday, 24 February 2010 03:48:09, Benjamin Herrenschmidt wrote:
> On Fri, 2010-02-19 at 21:53 +0100, Oliver Neukum wrote:
> > On Friday, 19 February 2010 18:36:51, Catalin Marinas wrote:
> > > If a page is already mapped in user space, flush_dcache_page() on ARM
> > > does the flushing rather than deferring it to update_mmu_cache(). The
> > > PIO HCD drivers, however, don't call flush_dcache_page(). Is it possible
> > > that the HCD could transfer data into a page cache page already mapped
> > > in user space? My understanding is that the scenario above is possible.
> > 
> > Yes, video drivers do that. 
> 
> In which case it would be up to the video driver to call
> flush_dcache_page() (though if it's v4l you are talking about, maybe it
> might make sense to push it into the v4l layer itself).

I don't know. The issue seems quite complex. It would seem better to
centralize it as far as practical. Do you have a wrapper drivers could
call?

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-24  2:47                             ` Benjamin Herrenschmidt
@ 2010-02-24 16:19                               ` Alan Stern
  2010-02-24 21:13                                 ` Benjamin Herrenschmidt
  2010-02-26 16:25                               ` Catalin Marinas
  1 sibling, 1 reply; 155+ messages in thread
From: Alan Stern @ 2010-02-24 16:19 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 24 Feb 2010, Benjamin Herrenschmidt wrote:

> > The PIO HCD drivers, however, don't call flush_dcache_page(). Is it possible
> > that the HCD could transfer data into a page cache page already mapped
> > in user space? My understanding is that the scenario above is possible.
> 
> It is but I'm not confident the responsibility for doing that cleanup
> is at the HCD level. That would impact a lot of HCD activities that
> don't need such flushing since the use of the page is purely in-kernel.

That's right.  The HCD merely puts data wherever it's told to.  It 
doesn't know whether the destination is in the page cache, in 
userspace, or anywhere else.  The same is true for usb-storage.

Alan Stern

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-24  7:16                                 ` Oliver Neukum
@ 2010-02-24 21:12                                   ` Benjamin Herrenschmidt
  2010-02-25  3:48                                     ` Oliver Neukum
  2010-02-25 12:36                                     ` James Bottomley
  0 siblings, 2 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-24 21:12 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-24 at 08:16 +0100, Oliver Neukum wrote:
> I don't know. The issue seems quite complex. It would seem better to
> centralize it as far as practical. Do you have a wrapper drivers could
> call?

flush_dcache_page() ? :-)

Now, the subsystem might be the one to know whether something is mapped
into userspace or not (v4l in our case) in which case a wrapper could be
created.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-24 16:19                               ` Alan Stern
@ 2010-02-24 21:13                                 ` Benjamin Herrenschmidt
  2010-02-24 21:50                                   ` Alan Stern
  2010-02-26 16:00                                   ` Catalin Marinas
  0 siblings, 2 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-24 21:13 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-24 at 11:19 -0500, Alan Stern wrote:
> > It is but I'm not confident the responsibility for doing that cleanup
> > is at the HCD level. That would impact a lot of HCD activities that
> > don't need such flushing since the use of the page is purely in-kernel.
> 
> That's right.  The HCD merely puts data wherever it's told to.  It 
> doesn't know whether the destination is in the page cache, in 
> userspace, or anywhere else.  The same is true for usb-storage.

I'm surprised that usb-storage has an issue here. It shouldn't afaik,
since it's just a SCSI driver (or not anymore ?) and the BIO or
filesystems handle things there no ? I haven't seen a single call to
flush_dcache_page() in any of drivers/scsi, drivers/ata or drivers/ide
when I looked...

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-24 21:13                                 ` Benjamin Herrenschmidt
@ 2010-02-24 21:50                                   ` Alan Stern
  2010-02-25 20:52                                     ` Benjamin Herrenschmidt
  2010-02-26 16:00                                   ` Catalin Marinas
  1 sibling, 1 reply; 155+ messages in thread
From: Alan Stern @ 2010-02-24 21:50 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 25 Feb 2010, Benjamin Herrenschmidt wrote:

> I'm surprised that usb-storage has an issue here. It shouldn't afaik,
> since it's just a SCSI driver (or not anymore ?)

It still is.  There's also the ub driver, which is a non-SCSI block 
device driver for some of the same devices handled by usb-storage.

> and the BIO or
> filesystems handle things there no ? I haven't seen a single call to
> flush_dcache_page() in any of drivers/scsi, drivers/ata or drivers/ide
> when I looked...

There is no real issue; it's just that the problem was first noted in 
connection with usb-storage reading in executable pages, so Catalin's 
initial post was oriented toward modifying usb-storage.

The main issue here is that the same host controller will use PIO
sometimes and DMA sometimes, depending on the details of the transfer.  
The USB core didn't expect this and consequently we violated the rules
for DMA mapping.  The question is: If the core is fixed so that the
rules aren't violated, will everything work correctly?

Alan Stern

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-24 21:12                                   ` Benjamin Herrenschmidt
@ 2010-02-25  3:48                                     ` Oliver Neukum
  2010-02-26  0:22                                       ` Benjamin Herrenschmidt
  2010-02-25 12:36                                     ` James Bottomley
  1 sibling, 1 reply; 155+ messages in thread
From: Oliver Neukum @ 2010-02-25  3:48 UTC (permalink / raw)
  To: linux-arm-kernel

On Wednesday, 24 February 2010 22:12:34, Benjamin Herrenschmidt wrote:
> On Wed, 2010-02-24 at 08:16 +0100, Oliver Neukum wrote:
> > I don't know. The issue seems quite complex. It would seem better to
> > centralize it as far as practical. Do you have a wrapper drivers could
> > call?
> 
> flush_dcache_page() ? :-)

Will this do anything on arches that don't need it?
Secondly, can we have a wrapper that you can pass a pointer and an
offset?
 
> Now, the subsystem might be the one to know whether something is mapped
> into userspace or not (v4l in our case) in which case a wrapper could be
> created.

If possible, I'd like to centralize this. Drivers are likely to get this wrong.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-24 21:12                                   ` Benjamin Herrenschmidt
  2010-02-25  3:48                                     ` Oliver Neukum
@ 2010-02-25 12:36                                     ` James Bottomley
  1 sibling, 0 replies; 155+ messages in thread
From: James Bottomley @ 2010-02-25 12:36 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-02-25 at 08:12 +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2010-02-24 at 08:16 +0100, Oliver Neukum wrote:
> > I don't know. The issue seems quite complex. It would seem better to
> > centralize it as far as practical. Do you have a wrapper drivers could
> > call?
> 
> flush_dcache_page() ? :-)

Actually, that can be wrong depending on the implementation.  The
problem is incoherency of the kernel page (dirty) with respect to user
space aliases (clean).  What has to happen on parisc is that the kernel
alias needs flushing.  We can guarantee the userspace aliases to be
clean (and not moved in).  We wouldn't want to incur the expense of
flushing the user space pages as well.

> Now, the subsystem might be the one to know whether something is mapped
> into userspace or not (v4l in our case) in which case a wrapper could be
> created.

Right, so it's the responsibility of the API used by the subsystem.
Thus Catalin's pio_kmap seems the right one ... I don't understand what
the additional problems are.

James

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-24 21:50                                   ` Alan Stern
@ 2010-02-25 20:52                                     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-25 20:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-24 at 16:50 -0500, Alan Stern wrote:
> The main issue here is that the same host controller will use PIO
> sometimes and DMA sometimes, depending on the details of the transfer.
> The USB core didn't expect this and consequently we violated the rules
> for DMA mapping.  The question is: If the core is fixed so that the
> rules aren't violated, will everything work correctly? 

As long as the only issue is that one (ie, doing PIO while dma-map'ed),
then yes, I'd say things should work. If not, then there is -another-
problem to be fixed :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-25  3:48                                     ` Oliver Neukum
@ 2010-02-26  0:22                                       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-26  0:22 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-02-25 at 04:48 +0100, Oliver Neukum wrote:
> On Wednesday, 24 February 2010 22:12:34, Benjamin Herrenschmidt wrote:
> > On Wed, 2010-02-24 at 08:16 +0100, Oliver Neukum wrote:
> > > I don't know. The issue seems quite complex. It would seem better to
> > > centralize it as far as practical. Do you have a wrapper drivers could
> > > call?
> > 
> > flush_dcache_page() ? :-)
> 
> Will this do anything on arches that don't need it?

No, it's going to be an empty inline:

arch/x86/include/asm/cacheflush.h:static inline void flush_dcache_page(struct page *page) { }

> Secondly, can we have a wrapper that you can pass a pointer and an
> offset?

I'm sure you can make one :-) Use virt_to_page() though that will not
work for vmap/vmalloc space of course.
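
A rough user-space sketch of the page arithmetic such a pointer-plus-length
wrapper would need. Everything here is an assumption for illustration:
the sim_* names are hypothetical, SIM_PAGE_SIZE is fixed at 4096, and the
callback merely stands in for flush_dcache_page(virt_to_page(addr)). The
sketch also shows why a loop that starts at an unaligned buffer address
and steps by one page at a time, as the patch earlier in this thread does,
can skip the last page of the buffer.

```c
#include <stddef.h>
#include <stdint.h>

#define SIM_PAGE_SIZE 4096u   /* assumed page size for this sketch */

/* Stand-in for flush_dcache_page(virt_to_page(page_start)). */
typedef void (*sim_flush_fn)(uintptr_t page_start, void *ctx);

/*
 * Visit every page overlapped by [addr, addr + len) exactly once.
 * Returns the number of pages visited. flush may be NULL to just count.
 */
size_t sim_flush_range(uintptr_t addr, size_t len, sim_flush_fn flush, void *ctx)
{
    const uintptr_t mask = ~(uintptr_t)(SIM_PAGE_SIZE - 1);
    uintptr_t page, last;
    size_t n = 0;

    if (len == 0)
        return 0;

    page = addr & mask;               /* page holding the first byte */
    last = (addr + len - 1) & mask;   /* page holding the last byte  */

    for (;;) {
        if (flush)
            flush(page, ctx);
        n++;
        if (page == last)
            break;
        page += SIM_PAGE_SIZE;
    }
    return n;
}

/* For comparison: stepping by a page from an unaligned start address. */
size_t sim_naive_page_count(uintptr_t addr, size_t len)
{
    size_t n = 0;
    uintptr_t p;

    for (p = addr; p < addr + len; p += SIM_PAGE_SIZE)
        n++;
    return n;
}
```

For a 0x200-byte buffer starting at offset 0xF00 the aligned walk visits
two pages while the naive loop visits only the first one. The
virt_to_page() restriction mentioned above still applies to any real
wrapper built this way.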
 
> > Now, the subsystem might be the one to know whether something is mapped
> > into userspace or not (v4l in our case) in which case a wrapper could be
> > created.
> 
> If possible, I'd like to centralize this. Drivers are likely to get this wrong.

Right. In the case of v4l, it's probably something that should go into
the subsystem. IE. That's how it works for block too, it's done at the
BIO and/or filesystem layer (though individual filesystems do have their
hand in the pudding). 

Cheers,
Ben.

> 	Regards
> 		Oliver
> 

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-24 21:13                                 ` Benjamin Herrenschmidt
  2010-02-24 21:50                                   ` Alan Stern
@ 2010-02-26 16:00                                   ` Catalin Marinas
  2010-02-26 21:36                                     ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 155+ messages in thread
From: Catalin Marinas @ 2010-02-26 16:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-24 at 21:13 +0000, Benjamin Herrenschmidt wrote:
> On Wed, 2010-02-24 at 11:19 -0500, Alan Stern wrote:
> > > It is but I'm not confident the responsibility for doing that cleanup
> > > is at the HCD level. That would impact a lot of HCD activities that
> > > don't need such flushing since the use of the page is purely in-kernel.
> >
> > That's right.  The HCD merely puts data wherever it's told to.  It
> > doesn't know whether the destination is in the page cache, in
> > userspace, or anywhere else.  The same is true for usb-storage.
> 
> I'm surprised that usb-storage has an issue here. It shouldn't afaik,
> since it's just a SCSI driver (or not anymore ?) and the BIO or
> filesystems handle things there no ? I haven't seen a single call to
> flush_dcache_page() in any of drivers/scsi, drivers/ata or drivers/ide
> when I looked...

The BIO or filesystem code doesn't call flush_dcache_page() either (well,
some filesystems do, like cramfs or jffs, but they decompress the data
received from the block device).

-- 
Catalin

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-24  2:47                             ` Benjamin Herrenschmidt
  2010-02-24 16:19                               ` Alan Stern
@ 2010-02-26 16:25                               ` Catalin Marinas
  2010-02-26 16:52                                 ` Alan Stern
                                                   ` (2 more replies)
  1 sibling, 3 replies; 155+ messages in thread
From: Catalin Marinas @ 2010-02-26 16:25 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-24 at 02:47 +0000, Benjamin Herrenschmidt wrote:
> On Fri, 2010-02-19 at 17:36 +0000, Catalin Marinas wrote:
> >
> > If a page is already mapped in user space, flush_dcache_page() on ARM
> > does the flushing rather than deferring it to update_mmu_cache().
> 
> This is for D-cache aliases on VIVT right ? Or are you still talking
> about I/D coherency on PIPT ARMs ? Because the latter should not matter
> for already mapped userspace pages in the sense that if user space
> explicitly read()s into a page, it's up to userspace to cache clean that
> page before executing from it in my book :-)

I was still thinking about PIPT I/D coherency. The read() case you
mention is pretty clear, no need for the kernel to ensure coherency
(especially since writing is done via copy_to_user rather than to the
page cache page).

For mmap'ed pages (and present in the page cache), is it guaranteed that
the HCD driver won't write to it once it has been mapped into user
space? If that's the case, it may solve the problem by just reversing
the meaning of PG_arch_1 on ARM and assume that a newly allocated page
has dirty D-cache by default.

> > The PIO HCD drivers, however, don't call flush_dcache_page(). Is it possible
> > that the HCD could transfer data into a page cache page already mapped
> > in user space? My understanding is that the scenario above is possible.
> 
> It is but I'm not confident the responsibility for doing that cleanup
> is at the HCD level. That would impact a lot of HCD activities that
> don't need such flushing since the use of the page is purely in-kernel.
> 
> Though I suppose that could be optimized out in most case using the page
> use count.
> 
> But I still wonder whether it should be pushed down to the actual
> interface drivers, that's always been the case I believe. In fact, in
> the case of block ops, it's generally done at the BIO or even file
> system layer right ?

The filesystem layer does it only if it needs to touch the data written
by the block device (e.g. cramfs, jffs). Some block devices call
flush_dcache_page (like mmci.c) while some others don't (and those that
use DMA actually don't since the DMA API handles the flushing).

-- 
Catalin

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-24  2:39                           ` Benjamin Herrenschmidt
@ 2010-02-26 16:44                             ` Catalin Marinas
  2010-02-26 21:49                               ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 155+ messages in thread
From: Catalin Marinas @ 2010-02-26 16:44 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-02-24 at 02:39 +0000, Benjamin Herrenschmidt wrote:
> On Fri, 2010-02-19 at 17:15 +0000, Catalin Marinas wrote:
> > > We assume that anybody that dirties a page in the kernel will call
> > > flush_dcache_page() which removes our PG_arch_1 bit thus marking the
> > > page "dirty".
> >
> > This assumption is not valid with some drivers like USB HCD doing PIO.
> > But, yes, that's how it should be done.
> 
> So we go back to the fix should be done at the individual drivers level.
> If it's going to write into the page cache, it needs to whack the bits.
> 
> Now there's of course the question as to whether you really only want to
> do that for a PIO access and not for a DMA access, I think on power, we
> don't really discriminate that much (since in any case our icache still
> needs flushing). Maybe it would be useful to separate the I$ and D$ bits
> but I'm not sure I can be bothered.

On ARM, update_mmu_cache() invalidates the I-cache (if VM_EXEC)
independent of whether the D-cache was dirty (since we can get
speculative fetches into the I-cache before it was even mapped).

> > > Note that from experience, doing the check & flushes in
> > > update_mmu_cache() is racy on SMP. At least for I$/D$, we have the case
> > > where processor one does set_pte followed by update_mmu_cache(). The
> > > later isn't done yet but processor 2 sees the PTE now and starts using
> > > it, cache hasn't been fully flushed yet. You may avoid that race in some
> > > ways, but on ppc, I've stopped using that.
> >
> > I think that's possible on ARM too. Having two threads on different
> > CPUs, one thread triggers a prefetch abort (instruction page fault) on
> > CPU0 but the second thread on CPU1 may branch into this page after
> > set_pte() (hence not fault) but before update_mmu_cache() doing the
> > flush.
> >
> > On ARM11MPCore we flush the caches in flush_dcache_page() because the
> > cache maintenance operations weren't visible to the other CPUs.
> 
> I'm not even sure that's going to be 100% correct. Don't you also need
> to flush the remote icaches when you are dealing with instructions (such
> as swap) anyways ?

I don't think we tried swap but for pages that have been mapped for the
first time, the I-cache would be clean. At mm switching, if a thread
migrates to a new CPU we invalidate the cache at that point.

> I've had some discussions in the past with Russell and others around the
> problem of non-broadcast cache ops on ARM SMP since that's also hurting
> you hard with dma mappings.
> 
> Can you issue IPIs as FIQs if needed (from my old ARM knowledge, FIQs
> are still on even in local_irq_save() blocks right ? I haven't touched
> low level ARM for years tho, I may have forgotten things).

I have a patch for using IPIs via IRQ from the DMA API functions but,
while it works, it can deadlock with some drivers (complex situation).
Note that the patch added a specific IPI implementation which can cope
with interrupts being disabled (unlike the generic one).

My latest solution - http://bit.ly/apJv3O - is to use dummy
read-for-ownership or write-for-ownership accesses in the DMA cache
flushing functions to force cache line migration from the other CPUs.
Our current benchmarks only show around 10% disc throughput penalty
compared to the normal SMP case (compared to the UP case the penalty is
bigger but that's due to other things).
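
The access pattern behind the dummy read-for-ownership and
write-for-ownership idea can be sketched in portable C as below. This
only illustrates the per-line touching; the function names, the 32-byte
TOY_CACHE_LINE value and the assumption that the buffer starts on a
line boundary are mine, and the real cache maintenance operation that
would follow each loop is not shown.

```c
#include <stddef.h>

#define TOY_CACHE_LINE 32   /* assumed cache line size in bytes */

/* Dummy read of one byte per cache line, intended to pull each line
 * into the local CPU's cache. Assumes buf is cache-line aligned.
 * Returns the number of lines touched. */
static size_t toy_read_for_ownership(const volatile unsigned char *buf,
                                     size_t len)
{
    size_t lines = 0;

    for (size_t off = 0; off < len; off += TOY_CACHE_LINE) {
        (void)buf[off];             /* volatile read, value discarded */
        lines++;
    }
    return lines;
}

/* Dummy read followed by a write of the same value, intended to make
 * the local CPU the exclusive owner of each line. The data is left
 * unchanged. Returns the number of lines touched. */
static size_t toy_write_for_ownership(volatile unsigned char *buf,
                                      size_t len)
{
    size_t lines = 0;

    for (size_t off = 0; off < len; off += TOY_CACHE_LINE) {
        unsigned char v = buf[off];
        buf[off] = v;               /* write back the same byte */
        lines++;
    }
    return lines;
}
```

Presumably the read variant would precede a clean and the write variant
an invalidate, so that the local (non-broadcast) operation acts on
lines that previously lived in another CPU's cache; that pairing is my
reading of the description above, not something taken from the patch.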

-- 
Catalin

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-26 16:25                               ` Catalin Marinas
@ 2010-02-26 16:52                                 ` Alan Stern
  2010-02-26 21:51                                   ` Benjamin Herrenschmidt
  2010-02-26 21:00                                 ` Russell King - ARM Linux
  2010-02-26 21:40                                 ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 155+ messages in thread
From: Alan Stern @ 2010-02-26 16:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 26 Feb 2010, Catalin Marinas wrote:

> For mmap'ed pages (and present in the page cache), is it guaranteed that
> the HCD driver won't write to it once it has been mapped into user
> space? If that's the case, it may solve the problem by just reversing
> the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> has dirty D-cache by default.

Nothing is guaranteed.  The HCD will write to wherever it is asked.  If 
a driver does input to an mmap'ed page, the HCD won't even know that 
the page is mmap'ed.

Alan Stern

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-26 16:25                               ` Catalin Marinas
  2010-02-26 16:52                                 ` Alan Stern
@ 2010-02-26 21:00                                 ` Russell King - ARM Linux
  2010-02-28  0:14                                   ` Benjamin Herrenschmidt
  2010-02-26 21:40                                 ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 155+ messages in thread
From: Russell King - ARM Linux @ 2010-02-26 21:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Feb 26, 2010 at 04:25:21PM +0000, Catalin Marinas wrote:
> For mmap'ed pages (and present in the page cache), is it guaranteed that
> the HCD driver won't write to it once it has been mapped into user
> space? If that's the case, it may solve the problem by just reversing
> the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> has dirty D-cache by default.

I guess we could also set PG_arch_1 in the DMA API as well, to avoid the
unnecessary D cache flushing when clean pages get mapped into userspace.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-26 16:00                                   ` Catalin Marinas
@ 2010-02-26 21:36                                     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-26 21:36 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-02-26 at 16:00 +0000, Catalin Marinas wrote:
> > I'm surprised that usb-storage has an issue here. It shouldn't
> afaik,
> > since it's just a SCSI driver (or not anymore ?) and the BIO or
> > filesystems handle things there no ? I haven't seen a single call to
> > flush_dcache_page() in any of drivers/scsi, drivers/ata or
> drivers/ide
> > when I looked...
> 
> The BIO or filesystem code don't call flush_dcache_page() either (well
> some do like cramfs or jffs but they decompress the data received from
> the block device). 

That's weird... that would mean that all existing PIO IDE or SCSI is
broken etc... Including I$/D$ cache coherency on powerpc and more. That
surprises me :-)

On an older kernel tree here:

$ grep -r flush_dcache_page fs | wc -l
118

So maybe that's where things need fixing ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-26 16:25                               ` Catalin Marinas
  2010-02-26 16:52                                 ` Alan Stern
  2010-02-26 21:00                                 ` Russell King - ARM Linux
@ 2010-02-26 21:40                                 ` Benjamin Herrenschmidt
  2010-02-26 21:49                                   ` Russell King - ARM Linux
  2 siblings, 1 reply; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-26 21:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-02-26 at 16:25 +0000, Catalin Marinas wrote:
> 
> For mmap'ed pages (and present in the page cache), is it guaranteed that
> the HCD driver won't write to it once it has been mapped into user
> space? If that's the case, it may solve the problem by just reversing
> the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> has dirty D-cache by default.

Well, I don't see why the HCD would write to it unless it's swapped out
(and thus unmapped), or the target of a read() or similar.

> The filesystem layer does it only if it needs to touch the data written
> by the block device (e.g. cramfs, jffs). Some block devices call
> flush_dcache_page (like mmci.c) while some others don't (and those that
> use DMA actually don't since the DMA API handles the flushing). 

Hrm, the DMA API certainly doesn't handle the I$/D$ coherency on
powerpc.. I'm afraid that whole cache handling stuff is totally
inconsistent since different archs have different expectations here.

Maybe we need to revisit things in that area, though it might require to
be done properly to have not one but two bits in struct page to
separately track the D$ and I$ state ...

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-26 21:40                                 ` Benjamin Herrenschmidt
@ 2010-02-26 21:49                                   ` Russell King - ARM Linux
  2010-02-28  0:24                                     ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 155+ messages in thread
From: Russell King - ARM Linux @ 2010-02-26 21:49 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Feb 27, 2010 at 08:40:29AM +1100, Benjamin Herrenschmidt wrote:
> Hrm, the DMA API certainly doesn't handle the I$/D$ coherency on
> powerpc.. I'm afraid that whole cache handling stuff is totally
> inconsistent since different archs have different expectations here.

It doesn't on ARM either.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-26 16:44                             ` Catalin Marinas
@ 2010-02-26 21:49                               ` Benjamin Herrenschmidt
  2010-02-26 22:03                                 ` Russell King - ARM Linux
  2010-02-28 23:17                                 ` Catalin Marinas
  0 siblings, 2 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-26 21:49 UTC (permalink / raw)
  To: linux-arm-kernel


> On ARM, update_mmu_cache() invalidates the I-cache (if VM_EXEC)
> independent of whether the D-cache was dirty (since we can get
> speculative fetches into the I-cache before it was even mapped).

We can get those speculative fetches too on power.

However, we only do the invalidate when PG_arch_1 is clear to avoid
doing it multiple times for a page that was already "cleaned". But it
seems that might not be such a good idea if indeed flush_dcache_page()
is not called for DMA transfers in most cases.

(In addition there is the race I mentioned with update_mmu_cache on SMP)

> > > > Note that from experience, doing the check & flushes in
> > > > update_mmu_cache() is racy on SMP. At least for I$/D$, we have the case
> > > > where processor one does set_pte followed by update_mmu_cache(). The
> > > > later isn't done yet but processor 2 sees the PTE now and starts using
> > > > it, cache hasn't been fully flushed yet. You may avoid that race in some
> > > > ways, but on ppc, I've stopped using that.
> > >
> > > I think that's possible on ARM too. Having two threads on different
> > > CPUs, one thread triggers a prefetch abort (instruction page fault) on
> > > CPU0 but the second thread on CPU1 may branch into this page after
> > > set_pte() (hence not fault) but before update_mmu_cache() doing the
> > > flush.
> > >
> > > On ARM11MPCore we flush the caches in flush_dcache_page() because the
> > > cache maintenance operations weren't visible to the other CPUs.
> > 
> > I'm not even sure that's going to be 100% correct. Don't you also need
> > to flush the remote icaches when you are dealing with instructions (such
> > as swap) anyways ?
> 
> I don't think we tried swap but for pages that have been mapped for the
> first time, the I-cache would be clean. 
>
> At mm switching, if a thread
> migrates to a new CPU we invalidate the cache at that point.

That sounds fragile. What about a multithread app with one thread on
each core hitting the pages at the same time ? Sounds racy to me...

> > I've had some discussions in the past with Russell and others around the
> > problem of non-broadcast cache ops on ARM SMP since that's also hurting
> > you hard with dma mappings.
> > 
> > Can you issue IPIs as FIQs if needed (from my old ARM knowledge, FIQs
> > are still on even in local_irq_save() blocks right ? I haven't touched
> > low level ARM for years tho, I may have forgotten things).
> 
> I have a patch for using IPIs via IRQ from the DMA API functions but,
> while it works, it can deadlock with some drivers (complex situation).
> Note that the patch added a specific IPI implementation which can cope
> with interrupts being disabled (unlike the generic one).

It will deadlock if you use normal IRQs. I don't see a good way around
that other than using a higher-level type of IRQs. I thought ARM had
something like that (FIQs ?). Can you use those guys for IPIs ?

> My latest solution - http://bit.ly/apJv3O - is to use dummy
> read-for-ownership or write-for-ownership accesses in the DMA cache
> flushing functions to force cache line migration from the other CPUs.

That might do, but won't help for the icache, will it ?

> Our current benchmarks only show around 10% disc throughput penalty
> compared to the normal SMP case (compared to the UP case the penalty is
> bigger but that's due to other things).

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-26 16:52                                 ` Alan Stern
@ 2010-02-26 21:51                                   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-26 21:51 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-02-26 at 11:52 -0500, Alan Stern wrote:
> > For mmap'ed pages (and present in the page cache), is it guaranteed that
> > the HCD driver won't write to it once it has been mapped into user
> > space? If that's the case, it may solve the problem by just reversing
> > the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> > has dirty D-cache by default.
> 
> Nothing is guaranteed.  The HCD will write to wherever it is asked.  If 
> a driver does input to an mmap'ed page, the HCD won't even know that 
> the page is mmap'ed.

Right but that won't happen unless somebody explicitly caused that
input to happen, typically, a userspace read(). I$/D$ coherency isn't
implicit in that case.
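
For the explicit case, where userspace itself fills a buffer it later
wants to execute, the userspace side might look like the sketch below.
It uses the GCC/Clang builtin __builtin___clear_cache(); the helper
name toy_install_code() is hypothetical, and the sketch deliberately
stops short of mapping the buffer executable or jumping into it.

```c
#include <stddef.h>
#include <string.h>

/* Copy len bytes of machine code into dst, then ask the toolchain's
 * builtin to make the instruction stream coherent for that range.
 * On architectures with coherent I/D caches the builtin may do nothing. */
static void toy_install_code(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
    __builtin___clear_cache((char *)dst, (char *)dst + len);
}
```

The point is only that the coherency step is an explicit call made by
the program that wrote the bytes, not something the kernel does behind
its back.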

The question is more when the kernel itself moves a page in/out from
underneath the application (mmap'ed executable pages). Once it's mapped
in, it won't be written to by the HCD unless something explicitly does
something to cause that write. If it's swapped out and back in, it will
have been unmapped.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-26 21:49                               ` Benjamin Herrenschmidt
@ 2010-02-26 22:03                                 ` Russell King - ARM Linux
  2010-02-28  0:29                                   ` Benjamin Herrenschmidt
  2010-02-28 23:20                                   ` Catalin Marinas
  2010-02-28 23:17                                 ` Catalin Marinas
  1 sibling, 2 replies; 155+ messages in thread
From: Russell King - ARM Linux @ 2010-02-26 22:03 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Feb 27, 2010 at 08:49:40AM +1100, Benjamin Herrenschmidt wrote:
> It will deadlock if you use normal IRQs. I don't see a good way around
> that other than using a higher-level type of IRQs. I though ARM has
> something like that (FIQs ?). Can you use those guys for IPIs ?

If the hardware did support using FIQs for IPIs, this would not be
desirable because then it takes it away from the SoC folk to do what
they will with it.

In the past, it's been used as a fast CPU-driven "DMA" interface -
some SoCs have been wired up in such a way that's the only use
available for the FIQ.

The other problem we'd encounter using FIQs for IPIs is that some IPIs
need to take locks - and in order to make that safe, we'd either need
another class of locks which disable IRQs and FIQs together, or we'd
need to disable FIQs everywhere we disable IRQs - at which point FIQs
become utterly pointless.

(The only differences between FIQ and IRQ are:
 - on simultaneous raising of both, the FIQ will be called before the IRQ.
 - each has its own (single) vector.
 - invocation of FIQ masks IRQ.

What I'm saying is that what gives FIQ an advantage for SoC people is
that it's bare bones light weight and therefore extremely fast - as soon
as you load it up with additional complexity, it becomes less useful.)

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-26 21:00                                 ` Russell King - ARM Linux
@ 2010-02-28  0:14                                   ` Benjamin Herrenschmidt
  2010-02-28  5:01                                     ` James Bottomley
  2010-03-01 10:42                                     ` Catalin Marinas
  0 siblings, 2 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-28  0:14 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-02-26 at 21:00 +0000, Russell King - ARM Linux wrote:
> On Fri, Feb 26, 2010 at 04:25:21PM +0000, Catalin Marinas wrote:
> > For mmap'ed pages (and present in the page cache), is it guaranteed that
> > the HCD driver won't write to it once it has been mapped into user
> > space? If that's the case, it may solve the problem by just reversing
> > the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> > has dirty D-cache by default.
> 
> I guess we could also set PG_arch_1 in the DMA API as well, to avoid the
> unnecessary D cache flushing when clean pages get mapped into userspace.

That's an interesting thought for us too. When doing I$/D$ coherency, we
have to first flush the D$ and then invalidate the I$. If we could keep
track of D$ and I$ separately, we could avoid the first step in many
cases, including the DMA API trick you mentioned.
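
A toy model of what separate D$ and I$ tracking could buy, written as
plain user-space C. Everything here is hypothetical: struct toy_page2,
its two flags and the toy2_* helpers are illustration only, and the
counters stand in for real cache operations.

```c
#include <stdbool.h>

/* Toy page with two independent state bits; NOT kernel code. */
struct toy_page2 {
    bool dcache_clean;        /* memory holds what the D$ holds */
    bool icache_valid;        /* I$ has no stale lines for this page */
    int  dcache_flushes;      /* modelled D$ flushes */
    int  icache_invalidates;  /* modelled I$ invalidates */
};

/* CPU (e.g. PIO) write into the page: both caches need attention. */
static void toy2_cpu_write(struct toy_page2 *p)
{
    p->dcache_clean = false;
    p->icache_valid = false;
}

/* DMA from a device: in this model the DMA API has already dealt with
 * the D$, so only the I$ may be stale. */
static void toy2_dma_from_device(struct toy_page2 *p)
{
    p->dcache_clean = true;
    p->icache_valid = false;
}

/* Mapping the page executable: do only the steps that are needed. */
static void toy2_map_exec(struct toy_page2 *p)
{
    if (!p->dcache_clean) {
        p->dcache_flushes++;
        p->dcache_clean = true;
    }
    if (!p->icache_valid) {
        p->icache_invalidates++;
        p->icache_valid = true;
    }
}
```

With a single bit, the DMA case would pay for a D$ flush it does not
need; with two bits the model skips straight to the I$ invalidate.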

I wonder if it's time to get a PG_arch_2 :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-26 21:49                                   ` Russell King - ARM Linux
@ 2010-02-28  0:24                                     ` Benjamin Herrenschmidt
  2010-02-28 19:17                                       ` Pavel Machek
  2010-03-01 11:10                                       ` Catalin Marinas
  0 siblings, 2 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-28  0:24 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-02-26 at 21:49 +0000, Russell King - ARM Linux wrote:
> On Sat, Feb 27, 2010 at 08:40:29AM +1100, Benjamin Herrenschmidt wrote:
> > Hrm, the DMA API certainly doesn't handle the I$/D$ coherency on
> > powerpc.. I'm afraid that whole cache handling stuff is totally
> > inconsistent since different archs have different expectations here.
> 
> It doesn't on ARM either.

OK, phew :-)

So far, my understanding with I$/D$ is that we only care in a few cases
which is executing of an mmap'ed piece of executable that is -not- being
written to, and swap.

I -think- that in both cases, the page cache always pops up a new page
with PG_arch_1 clear before the driver gets to either DMA or PIO to it
when faulted the first time around, before any PTE is inserted.

So the current approach on powerpc with I$/D$ should work fine, and it
-might- make sense to use a similar one on PIPT ARM, provided we don't
have expectations of the I$/D$ coherency being maintained on
-subsequent- writes (PIO or DMA either) to such a page by the same
program transparently by the kernel.

There are two potential problems with the approach, and maybe more that I
have missed though. One is the case of a networked filesystem where the
executable pages are modified remotely. However, I would expect such a
program to invalidate the PTE mappings before making the change visible,
so we -do- get a chance to re-flush provided something clears PG_arch_1.

Then, in the case of a multithreaded app where one thread does the
cache flush and another thread then executes, the earlier ARMs
without broadcast ops have a potential problem. In fact, some
variants of PowerPC 440 have the same problem and some people are
(ab)using those for SMP setups, I'm told.

For that case, I see two options. One is a big hammer but would make
existing code work to "most" extent: Don't allow a page to be both
writable and executable. Ping-pong the page permission lazily and flush
when transitioning from write to exec.

That means using a spare bit for Linux _PAGE_RW separate from your real
RW bit I suppose, since you have HW loaded PTEs (on 440 it's easier
since we SW load, we can do the fixup there, though it has a perf impact
obviously).
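
The write/exec ping-pong can be pictured as a small state machine. The
following user-space C sketch is a toy model under my own assumptions:
the toy_wx_* names and the icache_syncs counter are hypothetical, and
the "faults" are plain function calls rather than real page faults.

```c
/* Current permission granted to the (single) modelled page. */
enum toy_mode { TOY_NONE, TOY_WRITE, TOY_EXEC };

struct toy_wx_page {
    enum toy_mode mode;
    int icache_syncs;   /* modelled "flush D$, invalidate I$" events */
};

/* Write fault: grant write permission, revoke exec. */
static void toy_wx_write_fault(struct toy_wx_page *p)
{
    p->mode = TOY_WRITE;
}

/* Exec fault: sync caches only when coming from the writable state,
 * then grant exec permission and revoke write. */
static void toy_wx_exec_fault(struct toy_wx_page *p)
{
    if (p->mode == TOY_WRITE)
        p->icache_syncs++;
    p->mode = TOY_EXEC;
}

/* Accesses fault lazily, only when the current mode forbids them. */
static void toy_wx_write(struct toy_wx_page *p)
{
    if (p->mode != TOY_WRITE)
        toy_wx_write_fault(p);
}

static void toy_wx_exec(struct toy_wx_page *p)
{
    if (p->mode != TOY_EXEC)
        toy_wx_exec_fault(p);
}
```

Repeated writes or repeated executes cost nothing extra in this model;
only each write-to-exec transition pays for a sync, which is the lazy
behaviour described above. The model ignores the first-time mapping
flush and any cross-CPU effects.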

Another option would be to make some syscall mandatory to "sync" caches
which could then do IPIs or whatever else is needed. But that would
require changing existing userspace code.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-26 22:03                                 ` Russell King - ARM Linux
@ 2010-02-28  0:29                                   ` Benjamin Herrenschmidt
  2010-02-28 23:20                                   ` Catalin Marinas
  1 sibling, 0 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-02-28  0:29 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-02-26 at 22:03 +0000, Russell King - ARM Linux wrote:
> On Sat, Feb 27, 2010 at 08:49:40AM +1100, Benjamin Herrenschmidt wrote:
> > It will deadlock if you use normal IRQs. I don't see a good way around
> > that other than using a higher-level type of IRQs. I though ARM has
> > something like that (FIQs ?). Can you use those guys for IPIs ?
> 
> If the hardware did support using FIQs for IPIs, this would not be
> desirable because then it takes it away from the SoC folk to do what
> they will with it.
> 
> In the past, it's been used as a fast CPU-driven "DMA" interface -
> some SoCs have been wired up in such a way that's the only use
> available for the FIQ.

This is an issue indeed.

> The other problem we'd encounter using FIQs for IPIs is that some IPIs
> need to take locks - and in order to make that safe, we'd either need
> another class of locks which disable IRQs and FIQs together, or we'd
> need to disable FIQs everywhere we disable IRQs - at which point FIQs
> become utterly pointless.

That's solvable easily :-) I mentioned having potentially to deal with a
similar problem with people using PowerPC 440 for SMP (doesn't broadcast
cache ops either). 440 has critical interrupts, which are akin to FIQs.

The trick here is that you don't use -only- critical interrupts for
IPIs. You use normal interrupts for all the current IPI types. You -add-
a fast one using critical interrupts specifically for cache ops, with a
very fast asm only path.

This works for us because masking interrupts doesn't mask critical
interrupts (it's a separate mask bit in our MSR). If that isn't the case
with FIQs then the whole idea is moot.

> (There only differences between FIQ and IRQ are:
>  - on simultaneous raising of both, the FIQ will be called before the IRQ.
>  - each has its own (single) vector.
>  - invocation of FIQ masks IRQ.
> 
> What I'm saying is that what gives FIQ an advantage for SoC people is
> that it's bare bones light weight and therefore extremely fast - as soon
> as you load it up with additional complexity, it becomes less useful.)

I understand.

Then Catalin's idea of tricking the cache with loads and stores would work
for the D$ side of things. The I$ side of things probably still needs IPIs
though, and you might need to use a non-blocking async SMP call function
for that if you're going to do it from set_pte_at() instead of
update_mmu_cache(), since the latter is racy. In any case, it's a lot less
of a deadlock nest than the D$ side, which needs to be dealt with in the
DMA ops, called below layers of driver and subsystem locks.

Note: Somebody at ARM needs to be severely beaten up for coming up with
that SMP scheme without broadcast cache ops and not also mandating some
kind of FIQ IPI scheme that isn't masked with normal interrupts :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-28  0:14                                   ` Benjamin Herrenschmidt
@ 2010-02-28  5:01                                     ` James Bottomley
  2010-03-01 10:39                                       ` Catalin Marinas
  2010-03-02 12:11                                       ` FUJITA Tomonori
  2010-03-01 10:42                                     ` Catalin Marinas
  1 sibling, 2 replies; 155+ messages in thread
From: James Bottomley @ 2010-02-28  5:01 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, 2010-02-28 at 11:14 +1100, Benjamin Herrenschmidt wrote:
> On Fri, 2010-02-26 at 21:00 +0000, Russell King - ARM Linux wrote:
> > On Fri, Feb 26, 2010 at 04:25:21PM +0000, Catalin Marinas wrote:
> > > For mmap'ed pages (and present in the page cache), is it guaranteed that
> > > the HCD driver won't write to it once it has been mapped into user
> > > space? If that's the case, it may solve the problem by just reversing
> > > the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> > > has dirty D-cache by default.
> > 
> > I guess we could also set PG_arch_1 in the DMA API as well, to avoid the
> > unnecessary D cache flushing when clean pages get mapped into userspace.
> 
> That's an interesting thought for us too. When doing I$/D$ coherency, we
> have to fist flush the D$ and then invalidate the I$. If we could keep
> track of D$ and I$ separately, we could avoid the first step in many
> cases, including the DMA API trick you mentioned.
> 
> I wonder if it's time to get a PG_arch_2 :-)

Sorry to be a bit late to the party (on holiday), but I/D coherency is
supposed to be taken care of using flush_cache_page in the memory
mapping routines.  On parisc, at least, we don't use any PG_arch flags
to help.  The way it's supposed to work is that I is invalidated on
mapping or remapping, so the I/O code only needs to worry about flushing
D.  The guarantee we pass to userland is that any page we do I/O to has
a clean D cache before it goes back to userspace.  Thus if userspace
executes the page, the I cache gets its first movein there.  There is an
underlying assumption to all of this:  The CPU won't speculatively move
in I cache until the page is executed, so we can rely on the
flush_cache_page in the mapping to keep the I cache invalidated until
we're ready to execute.  The other fundamental assumption is that if
userspace needs to modify an executable region (say for dynamic linking)
it has to take care of reinvalidating the I cache itself ... although it
can do this by remapping the region to alter the flags (i.e. W without X,
then X without W).

But the point of all of this is that I cache invalidation doesn't appear
anywhere in the I/O path ... so  if we're getting I/D incoherency,
there's some problem in the mm code (or there's a missing arch
assumption ... like I cache gets moved in more aggressively than we
expect).  Parisc is very sensitive to I/D incoherency, so we'd notice if
there were a serious generic problem here.

James

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-28  0:24                                     ` Benjamin Herrenschmidt
@ 2010-02-28 19:17                                       ` Pavel Machek
  2010-03-01 11:10                                       ` Catalin Marinas
  1 sibling, 0 replies; 155+ messages in thread
From: Pavel Machek @ 2010-02-28 19:17 UTC (permalink / raw)
  To: linux-arm-kernel

> There's two potential problems with the approach, and maybe more that I
> have missed though. One is the case of a networked filesystem where the
> executable pages are modified remotely. However, I would expect such a
> program to invalidate the PTE mappings before making the change visible,
> so we -do- get a chance to re-flush provided something clears PG_arch_1.
> 
> Then, there's In the case of a multithread app, where one thread does
> the cache flush and another thread then executes, the earlier ARMs
> without broadcast ops have a potential problem there. In fact, some
> variant of PowerPC 440 have the same problem and some people are
> (ab)using those for SMP setups I'm being told.
> 
> For that case, I see two options. One is a big hammer but would make
> existing code work to "most" extent: Don't allow a page to be both
> writable and executable. Ping-pong the page permission lazily and flush
> when transitioning from write to exec.
> 
> That means using a spare bit for Linux _PAGE_RW separate from your real
> RW bit I suppose, since you have HW loaded PTEs (on 440 it's easier
> since we SW load, we can do the fixup there, though it has a perf impact
> obviously).
> 
> Another option would be to make some syscall mandatory to "sync" caches
> which could then do IPIs or whatever else is needed. But that would
> require changing existing userspace code.

Or you could do first option by default, and add mmap flag that says
that application is responsible for cross-cpu cache flushes...?
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-26 21:49                               ` Benjamin Herrenschmidt
  2010-02-26 22:03                                 ` Russell King - ARM Linux
@ 2010-02-28 23:17                                 ` Catalin Marinas
  1 sibling, 0 replies; 155+ messages in thread
From: Catalin Marinas @ 2010-02-28 23:17 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-02-26 at 21:49 +0000, Benjamin Herrenschmidt wrote:
> > > > On ARM11MPCore we flush the caches in flush_dcache_page() because the
> > > > cache maintenance operations weren't visible to the other CPUs.
> > >
> > > I'm not even sure that's going to be 100% correct. Don't you also need
> > > to flush the remote icaches when you are dealing with instructions (such
> > > as swap) anyways ?
> >
> > I don't think we tried swap but for pages that have been mapped for the
> > first time, the I-cache would be clean.
> >
> > At mm switching, if a thread
> > migrates to a new CPU we invalidate the cache at that point.
> 
> That sounds fragile. What about a multithread app with one thread on
> each core hitting the pages at the same time ? Sounds racy to me...

Interestingly, until commit 826cbdaff29 (< 2 years ago), we didn't have
any I-cache flushing in update_mmu_cache() and it was working fine. I
added it for correctness reasons rather than to fix something. My theory
is that it was working because a page cache page tends to keep the same
physical address, especially if we don't swap pages, and a 16KB PIPT
cache cannot hold enough lines to show any issues (lines are replaced
frequently).

I suspect that's one of the reasons why only invalidating the whole
I-cache when switching the mm to a new CPU seems to suffice. Once we
enable some form of swapping, it may show the problem.

-- 
Catalin

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-26 22:03                                 ` Russell King - ARM Linux
  2010-02-28  0:29                                   ` Benjamin Herrenschmidt
@ 2010-02-28 23:20                                   ` Catalin Marinas
  1 sibling, 0 replies; 155+ messages in thread
From: Catalin Marinas @ 2010-02-28 23:20 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-02-26 at 22:03 +0000, Russell King - ARM Linux wrote:
> On Sat, Feb 27, 2010 at 08:49:40AM +1100, Benjamin Herrenschmidt wrote:
> > It will deadlock if you use normal IRQs. I don't see a good way around
> > that other than using a higher-level type of IRQs. I though ARM has
> > something like that (FIQs ?). Can you use those guys for IPIs ?
[...]
> The other problem we'd encounter using FIQs for IPIs is that some IPIs
> need to take locks - and in order to make that safe, we'd either need
> another class of locks which disable IRQs and FIQs together, or we'd
> need to disable FIQs everywhere we disable IRQs - at which point FIQs
> become utterly pointless.

You could use the FIQ only for the DMA cache maintenance operations and
not as a generic IPI mechanism. But the hardware needs to be modified.


-- 
Catalin

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-28  5:01                                     ` James Bottomley
@ 2010-03-01 10:39                                       ` Catalin Marinas
  2010-03-01 11:06                                         ` Russell King - ARM Linux
  2010-03-02 12:11                                       ` FUJITA Tomonori
  1 sibling, 1 reply; 155+ messages in thread
From: Catalin Marinas @ 2010-03-01 10:39 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, 2010-02-28 at 05:01 +0000, James Bottomley wrote:
> On Sun, 2010-02-28 at 11:14 +1100, Benjamin Herrenschmidt wrote:
> > On Fri, 2010-02-26 at 21:00 +0000, Russell King - ARM Linux wrote:
> > > On Fri, Feb 26, 2010 at 04:25:21PM +0000, Catalin Marinas wrote:
> > > > For mmap'ed pages (and present in the page cache), is it guaranteed that
> > > > the HCD driver won't write to it once it has been mapped into user
> > > > space? If that's the case, it may solve the problem by just reversing
> > > > the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> > > > has dirty D-cache by default.
> > >
> > > I guess we could also set PG_arch_1 in the DMA API as well, to avoid the
> > > unnecessary D cache flushing when clean pages get mapped into userspace.
> >
> > That's an interesting thought for us too. When doing I$/D$ coherency, we
> > have to fist flush the D$ and then invalidate the I$. If we could keep
> > track of D$ and I$ separately, we could avoid the first step in many
> > cases, including the DMA API trick you mentioned.
> >
> > I wonder if it's time to get a PG_arch_2 :-)
> 
> Sorry to be a bit late to the party (on holiday), but I/D coherency is
> supposed to be taken care of using flush_cache_page in the memory
> mapping routines.  On parisc, at least, we don't use any PG_arch flags
> to help.  The way it's supposed to work is that I is invalidated on
> mapping or remapping, so the I/O code only needs to worry about flushing
> D.  The guarantee we pass to userland is that any page we do I/O to has
> a clean D cache before it goes back to userspace.  Thus if userspace
> executes the page, the I cache gets its first movein there.  There is an
> underlying assumption to all of this:  The CPU won't speculatively move
> in I cache until the page is executed, so we can rely on the
> flush_cache_page in the mapping to keep the I cache invalidated until
> we're ready to execute.  

We cannot guarantee this assumption on ARM. As soon as the page is
accessible and executable, the CPU can fetch into the I-cache
speculatively. Even if the page hasn't been mapped into user-space yet,
we still have the kernel linear mapping via which we can get the same
I-cache lines fetched (PIPT cache).

The only place we can safely invalidate the I-cache is after the D-cache
was flushed (after flush_dcache_page).

On ARM PIPT, flush_cache_page is a no-op.

> The other fundamental assumption is that if
> userspace needs to modify an executable region (say for dynamic linking)
> it has to take care of reinvalidating the I cache itself ... although it
> can do this by remapping the region to alter the flags (i.e W no X then
> X no W).

The ARM dynamic linker remaps the page with no-exec, writes the data and
then remaps it back with exec. The COW code flushes the D-cache. Anyway,
recent dynamic linkers no longer touch code pages.
> 
> But the point of all of this is that I cache invalidation doesn't appear
> anywhere in the I/O path ... so  if we're getting I/D incoherency,
> there's some problem in the mm code (or there's a missing arch
> assumption ... like I cache gets moved in more aggressively than we
> expect).  Parisc is very sensitive to I/D incoherency, so we'd notice if
> there were a serious generic problem here.

On ARM PIPT, it's probably because flush_cache_page isn't implemented.
But as I said above, given the speculative fetches I don't think it
would help much (well, it would work a bit better but not a complete
fix).

Thanks.

-- 
Catalin

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-28  0:14                                   ` Benjamin Herrenschmidt
  2010-02-28  5:01                                     ` James Bottomley
@ 2010-03-01 10:42                                     ` Catalin Marinas
  2010-03-03 20:24                                       ` Jamie Lokier
  1 sibling, 1 reply; 155+ messages in thread
From: Catalin Marinas @ 2010-03-01 10:42 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, 2010-02-28 at 00:14 +0000, Benjamin Herrenschmidt wrote:
> On Fri, 2010-02-26 at 21:00 +0000, Russell King - ARM Linux wrote:
> > On Fri, Feb 26, 2010 at 04:25:21PM +0000, Catalin Marinas wrote:
> > > For mmap'ed pages (and present in the page cache), is it guaranteed that
> > > the HCD driver won't write to it once it has been mapped into user
> > > space? If that's the case, it may solve the problem by just reversing
> > > the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> > > has dirty D-cache by default.
> >
> > I guess we could also set PG_arch_1 in the DMA API as well, to avoid the
> > unnecessary D cache flushing when clean pages get mapped into userspace.

That sounds good to me.

> That's an interesting thought for us too. When doing I$/D$ coherency, we
> have to fist flush the D$ and then invalidate the I$. If we could keep
> track of D$ and I$ separately, we could avoid the first step in many
> cases, including the DMA API trick you mentioned.
> 
> I wonder if it's time to get a PG_arch_2 :-)

As an optimisation, I think this would help (rather than always
invalidating the I-cache in update_mmu_cache or set_pte_at).

-- 
Catalin

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-01 10:39                                       ` Catalin Marinas
@ 2010-03-01 11:06                                         ` Russell King - ARM Linux
  0 siblings, 0 replies; 155+ messages in thread
From: Russell King - ARM Linux @ 2010-03-01 11:06 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 01, 2010 at 10:39:14AM +0000, Catalin Marinas wrote:
> On Sun, 2010-02-28 at 05:01 +0000, James Bottomley wrote:
> > But the point of all of this is that I cache invalidation doesn't appear
> > anywhere in the I/O path ... so  if we're getting I/D incoherency,
> > there's some problem in the mm code (or there's a missing arch
> > assumption ... like I cache gets moved in more aggressively than we
> > expect).  Parisc is very sensitive to I/D incoherency, so we'd notice if
> > there were a serious generic problem here.
> 
> On ARM PIPT, it's probably because flush_cache_page isn't implemented.
> But as I said above, given the speculative fetches I don't think it
> would help much (well, it would work a bit better but not a complete
> fix).

Not quite.  flush_cache_page() is called when we unmap or replace a page
in userspace, which is completely the wrong place to do I-cache coherency
when you have speculatively loaded caches - or even D-cache coherency if
your cache behaves as a speculatively loaded PIPT or non-aliasing VIPT.

Flushing the I-cache after a page has been in userspace does nothing to
ensure that there aren't any I-cache lines associated with that page
when you next come to map it into userspace.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-28  0:24                                     ` Benjamin Herrenschmidt
  2010-02-28 19:17                                       ` Pavel Machek
@ 2010-03-01 11:10                                       ` Catalin Marinas
  2010-03-02  4:11                                         ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 155+ messages in thread
From: Catalin Marinas @ 2010-03-01 11:10 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, 2010-02-28 at 00:24 +0000, Benjamin Herrenschmidt wrote:
> On Fri, 2010-02-26 at 21:49 +0000, Russell King - ARM Linux wrote:
> > On Sat, Feb 27, 2010 at 08:40:29AM +1100, Benjamin Herrenschmidt wrote:
> > > Hrm, the DMA API certainly doesn't handle the I$/D$ coherency on
> > > powerpc.. I'm afraid that whole cache handling stuff is totally
> > > inconsistent since different archs have different expectations here.
> >
> > It doesn't on ARM either.
> 
> Ok, pfiew :-)
> 
> So far, my understanding with I$/D$ is that we only care in a few cases
> which is executing of an mmap'ed piece of executable that is -not- being
> written to, and swap.
> 
> I -think- that in both cases, the page cache always pops up a new page
> with PG_arch_1 clear before the driver gets to either DMA or PIO to it
> when faulted the first time around, before any PTE is inserted.

That's my understanding too.

> So the current approach on powerpc with I$/D$ should work fine, and it
> -might- make sense to use a similar one on PIPT ARM, provided we don't
> have expectations of the I$/D$ coherency being maintained on
> -subsequent- writes (PIO or DMA either) to such a page by the same
> program transparently by the kernel.

Are these subsequent writes likely to happen?

> There's two potential problems with the approach, and maybe more that I
> have missed though. One is the case of a networked filesystem where the
> executable pages are modified remotely. However, I would expect such a
> program to invalidate the PTE mappings before making the change visible,
> so we -do- get a chance to re-flush provided something clears PG_arch_1.

I think the NFS code in Linux calls flush_dcache_page(). This function
can check whether the page is already mapped and do the cache flushing
rather than deferring it to set_pte_at().

> Then, there's In the case of a multithread app, where one thread does
> the cache flush and another thread then executes, the earlier ARMs
> without broadcast ops have a potential problem there. In fact, some
> variant of PowerPC 440 have the same problem and some people are
> (ab)using those for SMP setups I'm being told.

Yes. That could be solved at set_pte_at() level using IPIs.

> For that case, I see two options. One is a big hammer but would make
> existing code work to "most" extent: Don't allow a page to be both
> writable and executable. Ping-pong the page permission lazily and flush
> when transitioning from write to exec.

Are you referring to the SMP and non-broadcasting cache maintenance
issue? The same pte could be shared between multiple CPUs, so once you
make it executable on one it becomes executable on the others.

-- 
Catalin

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-01 11:10                                       ` Catalin Marinas
@ 2010-03-02  4:11                                         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-02  4:11 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, 2010-03-01 at 11:10 +0000, Catalin Marinas wrote:
> 
> 
> Yes. That could be solved at set_pte_at() level using IPIs.

Well, set_pte_at() itself is called with the PTE lock held, so you have
to be careful with IPIs at that point. You need the flush to happen
-before- the PTE is visible and you cannot synchronously send an IPI.

> > For that case, I see two options. One is a big hammer but would make
> > existing code work to "most" extent: Don't allow a page to be both
> > writable and executable. Ping-pong the page permission lazily and
> flush
> > when transitioning from write to exec.
> 
> Are you referring to the SMP and non-broadcasting cache maintenance
> issue? The same pte could be shared between multiple CPUs, so once you
> make it executable on one it becomes executable on the others.

Right, you would have to play the ping-pong trick globally. That's what
I do on ppc 440 for bluegene though that code isn't upstream.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-02-28  5:01                                     ` James Bottomley
  2010-03-01 10:39                                       ` Catalin Marinas
@ 2010-03-02 12:11                                       ` FUJITA Tomonori
  2010-03-02 17:05                                         ` Catalin Marinas
  2010-03-02 23:26                                         ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 155+ messages in thread
From: FUJITA Tomonori @ 2010-03-02 12:11 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, 28 Feb 2010 10:31:03 +0530
James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

> On Sun, 2010-02-28 at 11:14 +1100, Benjamin Herrenschmidt wrote:
> > On Fri, 2010-02-26 at 21:00 +0000, Russell King - ARM Linux wrote:
> > > On Fri, Feb 26, 2010 at 04:25:21PM +0000, Catalin Marinas wrote:
> > > > For mmap'ed pages (and present in the page cache), is it guaranteed that
> > > > the HCD driver won't write to it once it has been mapped into user
> > > > space? If that's the case, it may solve the problem by just reversing
> > > > the meaning of PG_arch_1 on ARM and assume that a newly allocated page
> > > > has dirty D-cache by default.
> > > 
> > > I guess we could also set PG_arch_1 in the DMA API as well, to avoid the
> > > unnecessary D cache flushing when clean pages get mapped into userspace.
> > 
> > That's an interesting thought for us too. When doing I$/D$ coherency, we
> > have to fist flush the D$ and then invalidate the I$. If we could keep
> > track of D$ and I$ separately, we could avoid the first step in many
> > cases, including the DMA API trick you mentioned.
> > 
> > I wonder if it's time to get a PG_arch_2 :-)
> 
> Sorry to be a bit late to the party (on holiday), but I/D coherency is
> supposed to be taken care of using flush_cache_page in the memory
> mapping routines.

powerpc does that? To be exact, powerpc doesn't need
flush_cache_page() and handles I/D coherency in the pte modification
code. powerpc uses PG_arch_1 to avoid unnecessarily handling I/D
coherency. Seems that IA64 does the same trick with PG_arch_1.


> But the point of all of this is that I cache invalidation doesn't appear
> anywhere in the I/O path ... so  if we're getting I/D incoherency,
> there's some problem in the mm code (or there's a missing arch
> assumption ... like I cache gets moved in more aggressively than we
> expect).  Parisc is very sensitive to I/D incoherency, so we'd notice if
> there were a serious generic problem here.

I'm not sure that there are any problems in the mm or common code. Is
this an issue with ARM's implementation? (Of course, the USB stack and
the driver's misuse of the DMA API need to be fixed too).

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-02 12:11                                       ` FUJITA Tomonori
@ 2010-03-02 17:05                                         ` Catalin Marinas
  2010-03-02 17:47                                           ` Catalin Marinas
                                                             ` (2 more replies)
  2010-03-02 23:26                                         ` Benjamin Herrenschmidt
  1 sibling, 3 replies; 155+ messages in thread
From: Catalin Marinas @ 2010-03-02 17:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2010-03-02 at 21:11 +0900, FUJITA Tomonori wrote:
> On Sun, 28 Feb 2010 10:31:03 +0530
> James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> > But the point of all of this is that I cache invalidation doesn't appear
> > anywhere in the I/O path ... so  if we're getting I/D incoherency,
> > there's some problem in the mm code (or there's a missing arch
> > assumption ... like I cache gets moved in more aggressively than we
> > expect).  Parisc is very sensitive to I/D incoherency, so we'd notice if
> > there were a serious generic problem here.
> 
> I'm not sure that there are some problems in the mm or common code. Is
> this ARM's implementation issue? (Of course, the usb stack and the
> driver's misuse of the DMA API needs to be fixed too).

Just to summarise - on ARM (PIPT / non-aliasing VIPT) there is I-cache
invalidation for user pages in update_mmu_cache() (it could actually be
in set_pte_at on SMP to avoid a race but that's for another thread). The
D-cache is flushed by this function only if the PG_arch_1 bit is set.
This bit is set in the ARM case by flush_dcache_page(), following the
advice in Documentation/cachetlb.txt.

With some drivers (those doing PIO) or subsystems (SCSI mass storage
over USB HCD), there is no call to flush_dcache_page() for page cache
pages, hence the ARM implementation of update_mmu_cache() doesn't flush
the D-cache (and only invalidating the I-cache doesn't help).

The viable solutions so far:

     1. Implement a PIO mapping API similar to the DMA API which takes
        care of the D-cache flushing. This means that PIO drivers would
        need to be modified to use an API like pio_kmap()/pio_kunmap()
        before writing to a page cache page.
     2. Invert the meaning of PG_arch_1 to denote a clean page. This
        means that by default newly allocated page cache pages are
        considered dirty and even if there isn't a call to
        flush_dcache_page(), update_mmu_cache() would flush the D-cache.
        This is the PowerPC approach.

Option 2 above looks pretty appealing to me since it can be done in the
ARM code exclusively. I've done some tests and it indeed solves the
cache coherency with a rootfs on a USB stick. As Russell suggested, it
can be optimised to mark a page as clean when the DMA API is involved to
avoid duplicate flushing.

It was also suggested to add a PG_arch_2 flag which would keep track of
the I-cache status as well.

I can post a proposal to modify the cachetlb.txt document to reflect the
issues we currently have on ARM.

-- 
Catalin

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-02 17:05                                         ` Catalin Marinas
@ 2010-03-02 17:47                                           ` Catalin Marinas
  2010-03-02 23:33                                             ` Benjamin Herrenschmidt
  2010-03-02 23:29                                           ` Benjamin Herrenschmidt
  2010-03-03 21:54                                           ` Pavel Machek
  2 siblings, 1 reply; 155+ messages in thread
From: Catalin Marinas @ 2010-03-02 17:47 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2010-03-02 at 17:05 +0000, Catalin Marinas wrote:
> On Tue, 2010-03-02 at 21:11 +0900, FUJITA Tomonori wrote:
> > On Sun, 28 Feb 2010 10:31:03 +0530
> > James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> > > But the point of all of this is that I cache invalidation doesn't appear
> > > anywhere in the I/O path ... so  if we're getting I/D incoherency,
> > > there's some problem in the mm code (or there's a missing arch
> > > assumption ... like I cache gets moved in more aggressively than we
> > > expect).  Parisc is very sensitive to I/D incoherency, so we'd notice if
> > > there were a serious generic problem here.
> >
> > I'm not sure that there are some problems in the mm or common code. Is
> > this ARM's implementation issue? (Of course, the usb stack and the
> > driver's misuse of the DMA API needs to be fixed too).
> 
> Just to summarise - on ARM (PIPT / non-aliasing VIPT) there is I-cache
> invalidation for user pages in update_mmu_cache() (it could actually be
> in set_pte_at on SMP to avoid a race but that's for another thread). The
> D-cache is flushed by this function only if the PG_arch_1 bit is set.
> This bit is set in the ARM case by flush_dcache_page(), following the
> advice in Documentation/cachetlb.txt.
> 
> With some drivers (those doing PIO) or subsystems (SCSI mass storage
> over USB HCD), there is no call to flush_dcache_page() for page cache
> pages, hence the ARM implementation of update_mmu_cache() doesn't flush
> the D-cache (and only invalidating the I-cache doesn't help).
> 
> The viable solutions so far:
> 
>      1. Implement a PIO mapping API similar to the DMA API which takes
>         care of the D-cache flushing. This means that PIO drivers would
>         need to be modified to use an API like pio_kmap()/pio_kunmap()
>         before writing to a page cache page.
>      2. Invert the meaning of PG_arch_1 to denote a clean page. This
>         means that by default newly allocated page cache pages are
>         considered dirty and even if there isn't a call to
>         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
>         This is the PowerPC approach.
> 
> Option 2 above looks pretty appealing to me since it can be done in the
> ARM code exclusively. I've done some tests and it indeed solves the
> cache coherency with a rootfs on a USB stick. As Russell suggested, it
> can be optimised to mark a page as clean when the DMA API is involved to
> avoid duplicate flushing.

Actually, option 2 still has an issue: it does not easily work on SMP
systems where cache maintenance operations aren't broadcast in hardware.
In this case (ARM11MPCore), flush_dcache_page() is implemented
non-lazily so that the flushing happens on the same processor that
dirtied the cache. But since with some drivers there is no call to this
function, it wouldn't make any difference.

A solution is to do something like read-for-ownership before flushing
the D-cache in update_mmu_cache() (or set_pte_at()).

-- 
Catalin

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-02 12:11                                       ` FUJITA Tomonori
  2010-03-02 17:05                                         ` Catalin Marinas
@ 2010-03-02 23:26                                         ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-02 23:26 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2010-03-02 at 21:11 +0900, FUJITA Tomonori wrote:
> 
> > Sorry to be a bit late to the party (on holiday), but I/D coherency
> is
> > supposed to be taken care of using flush_cache_page in the memory
> > mapping routines.
> 
> powerpc does that? To be exact, powerpc doesn't need
> flush_cache_page() and handles I/D coherency in the pte modification
> code. powerpc uses PG_arch_1 to avoid unnecessarily handling I/D
> coherency. Seems that IA64 does the same trick with PG_arch_1.

Right. We set PG_arch_1 to avoid doing it again for a given physical
page. We assume that it's always cleared when a page is recycled by the
page cache and we also clear it in flush_dcache_page() though the need
for that latter thing is dubious...

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-02 17:05                                         ` Catalin Marinas
  2010-03-02 17:47                                           ` Catalin Marinas
@ 2010-03-02 23:29                                           ` Benjamin Herrenschmidt
  2010-03-03  3:47                                             ` FUJITA Tomonori
  2010-03-03 10:40                                             ` Catalin Marinas
  2010-03-03 21:54                                           ` Pavel Machek
  2 siblings, 2 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-02 23:29 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2010-03-02 at 17:05 +0000, Catalin Marinas wrote:

> The viable solutions so far:
> 
>      1. Implement a PIO mapping API similar to the DMA API which takes
>         care of the D-cache flushing. This means that PIO drivers would
>         need to be modified to use an API like pio_kmap()/pio_kunmap()
>         before writing to a page cache page.
>      2. Invert the meaning of PG_arch_1 to denote a clean page. This
>         means that by default newly allocated page cache pages are
>         considered dirty and even if there isn't a call to
>         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
>         This is the PowerPC approach.

I don't see the point of a "PIO" API. I would thus vote for 2 :-) Note
that flushing the D-cache isn't enough, you also need to invalidate the
I-cache as we discussed earlier, though you mostly get away with not
doing so by luck.

There's also a question as to whether clearing PG_arch_1 in
flush_dcache_page() is really necessary or not.

> Option 2 above looks pretty appealing to me since it can be done in the
> ARM code exclusively. I've done some tests and it indeed solves the
> cache coherency with a rootfs on a USB stick. As Russell suggested, it
> can be optimised to mark a page as clean when the DMA API is involved to
> avoid duplicate flushing.

That wouldn't solve the need for invalidating the I-cache... Unless we
use another bit.

> It was also suggested to add a PG_arch_2 flag which would keep track of
> the I-cache status as well.
> 
> I can post a proposal to modify the cachetlb.txt document to reflect the
> issues we currently have on ARM.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-02 17:47                                           ` Catalin Marinas
@ 2010-03-02 23:33                                             ` Benjamin Herrenschmidt
  2010-03-03 10:21                                               ` Catalin Marinas
  0 siblings, 1 reply; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-02 23:33 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2010-03-02 at 17:47 +0000, Catalin Marinas wrote:
> 
> Actually, option 2 still has an issue - does not easily work on SMP
> systems where cache maintenance operations aren't broadcast in hardware.
> In this case (ARM11MPCore), flush_dcache_page() is implemented
> non-lazily so that the flushing happens on the same processor that
> dirtied the cache. But since with some drivers there is no call to this
> function, it wouldn't make any difference.

Also, option 1 would not solve the icache issue which has the same
problem related to IPIs. You -really- need to spank some HW folks
here :-)

> A solution is to do something like read-for-ownership before flushing
> the D-cache in update_mmu_cache() (or set_pte_at()). 

You might also want to experiment with not clearing PG_arch_1 in
flush_dcache_page(). I'm not 100% convinced it is necessary and that may
reduce the amount of flushing needed.

Another thing is, on powerpc, we only do the cleaning when we try to
execute from the pages. IE. We basically "filter out" exec permission
when pages are not clean. At least on processors that support per-page
exec permission. You may want to consider something like that as well.
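
A toy sketch of that filtering, to show the control flow only. It is
not the powerpc implementation: the xf_* names are hypothetical, the
flag stands in for PG_arch_1 in its "already made coherent" meaning,
and the counter replaces the real D-cache flush plus I-cache
invalidate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page and PTE; all names are hypothetical. */
struct xf_page {
    bool icache_clean;  /* role of PG_arch_1: I/D already coherent */
    int  sync_count;    /* D-cache flush + I-cache invalidate pairs */
};

struct xf_pte {
    struct xf_page *page;
    bool exec;          /* exec permission actually granted */
};

/* Install a PTE, filtering out exec while the page is not clean. */
static struct xf_pte xf_set_pte(struct xf_page *page, bool want_exec)
{
    assert(page != NULL);
    struct xf_pte pte = { page, want_exec && page->icache_clean };
    return pte;
}

/* Exec fault: do the cache work lazily, then grant exec. */
static void xf_exec_fault(struct xf_pte *pte)
{
    assert(pte != NULL && pte->page != NULL);
    if (!pte->page->icache_clean) {
        pte->page->sync_count++;
        pte->page->icache_clean = true;
    }
    pte->exec = true;
}
```

Pages that are only ever read or written never pay for the cache sync;
only the first attempt to execute from a page does.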

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-02 23:29                                           ` Benjamin Herrenschmidt
@ 2010-03-03  3:47                                             ` FUJITA Tomonori
  2010-03-03  5:10                                               ` Benjamin Herrenschmidt
  2010-03-03 10:43                                               ` Catalin Marinas
  2010-03-03 10:40                                             ` Catalin Marinas
  1 sibling, 2 replies; 155+ messages in thread
From: FUJITA Tomonori @ 2010-03-03  3:47 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 03 Mar 2010 10:29:54 +1100
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Tue, 2010-03-02 at 17:05 +0000, Catalin Marinas wrote:
> 
> > The viable solutions so far:
> > 
> >      1. Implement a PIO mapping API similar to the DMA API which takes
> >         care of the D-cache flushing. This means that PIO drivers would
> >         need to be modified to use an API like pio_kmap()/pio_kunmap()
> >         before writing to a page cache page.
> >      2. Invert the meaning of PG_arch_1 to denote a clean page. This
> >         means that by default newly allocated page cache pages are
> >         considered dirty and even if there isn't a call to
> >         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
> >         This is the PowerPC approach.
> 
> I don't see the point of a "PIO" API. I would thus vote for 2 :-) Note

Yeah, as powerpc and ia64 do, arm can flush the D cache and invalidate
the I cache when inserting an executable page into a pte, IIUC. No need
for the new API for I/D consistency.

The ways to improve the approach (introducing PG_arch_2 or marking a
page clean on dma_unmap_* with DMA_FROM_DEVICE like ia64 does) are up
to the architectures.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-03  3:47                                             ` FUJITA Tomonori
@ 2010-03-03  5:10                                               ` Benjamin Herrenschmidt
  2010-03-03  5:40                                                 ` James Bottomley
  2010-03-03  6:35                                                 ` FUJITA Tomonori
  2010-03-03 10:43                                               ` Catalin Marinas
  1 sibling, 2 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-03  5:10 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-03-03 at 12:47 +0900, FUJITA Tomonori wrote:
> The ways to improve the approach (introducing PG_arch_2 or marking a
> page clean on dma_unmap_* with DMA_FROM_DEVICE like ia64 does) is up
> to architectures. 

How does the above work ? IE, the dma unmap will flush the D side but
not the I side ... or is the ia64 flush primitive magic enough to do
both ?

Cheers,
Ben.


* USB mass storage and ARM cache coherency
  2010-03-03  5:10                                               ` Benjamin Herrenschmidt
@ 2010-03-03  5:40                                                 ` James Bottomley
  2010-03-03  9:36                                                   ` Russell King - ARM Linux
  2010-03-04  2:00                                                   ` Benjamin Herrenschmidt
  2010-03-03  6:35                                                 ` FUJITA Tomonori
  1 sibling, 2 replies; 155+ messages in thread
From: James Bottomley @ 2010-03-03  5:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-03-03 at 16:10 +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2010-03-03 at 12:47 +0900, FUJITA Tomonori wrote:
> > The ways to improve the approach (introducing PG_arch_2 or marking a
> > page clean on dma_unmap_* with DMA_FROM_DEVICE like ia64 does) is up
> > to architectures. 
> 
> How does the above work ? IE, the dma unmap will flush the D side but
> not the I side ... or is the ia64 flush primitive magic enough to do
> both ?

The point is that in a well regulated system, the I cache shouldn't need
extra flushing in the kernel.  We should only be faulting in R-X pages.
If we're operating on RWX pages (i.e. self modifying code), it's the job
of userspace to keep I/D coherency.

So the only case the kernel needs to worry about is the R-X fault case
for executable text code.

James


* USB mass storage and ARM cache coherency
  2010-03-03  5:10                                               ` Benjamin Herrenschmidt
  2010-03-03  5:40                                                 ` James Bottomley
@ 2010-03-03  6:35                                                 ` FUJITA Tomonori
  1 sibling, 0 replies; 155+ messages in thread
From: FUJITA Tomonori @ 2010-03-03  6:35 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 03 Mar 2010 16:10:32 +1100
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Wed, 2010-03-03 at 12:47 +0900, FUJITA Tomonori wrote:
> > The ways to improve the approach (introducing PG_arch_2 or marking a
> > page clean on dma_unmap_* with DMA_FROM_DEVICE like ia64 does) is up
> > to architectures. 
> 
> How does the above work ? IE, the dma unmap will flush the D side but
> not the I side ... or is the ia64 flush primitive magic enough to do
> both ?

On ia64 platform, I (and D) cache is coherent with the memory that you
did DMA to, I think. But better to ask an ia64 guru. :)


* USB mass storage and ARM cache coherency
  2010-03-03  5:40                                                 ` James Bottomley
@ 2010-03-03  9:36                                                   ` Russell King - ARM Linux
  2010-03-03 10:24                                                     ` James Bottomley
  2010-03-04  2:00                                                   ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 155+ messages in thread
From: Russell King - ARM Linux @ 2010-03-03  9:36 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Mar 03, 2010 at 11:10:09AM +0530, James Bottomley wrote:
> On Wed, 2010-03-03 at 16:10 +1100, Benjamin Herrenschmidt wrote:
> > On Wed, 2010-03-03 at 12:47 +0900, FUJITA Tomonori wrote:
> > > The ways to improve the approach (introducing PG_arch_2 or marking a
> > > page clean on dma_unmap_* with DMA_FROM_DEVICE like ia64 does) is up
> > > to architectures. 
> > 
> > How does the above work ? IE, the dma unmap will flush the D side but
> > not the I side ... or is the ia64 flush primitive magic enough to do
> > both ?
> 
> The point is that in a well regulated system, the I cache shouldn't need
> extra flushing in the kernel.  We should only be faulting in R-X pages.

James, that's a pipedream.  If you have a processor which doesn't support
NX, then the kernel marks all regions executable, even if the app only
asks for RW protection.

You end up with the protection masks always having VM_EXEC set in them,
so there's no way to distinguish from the kernel POV which pages are
going to be executed and those which aren't.

And if you can't do that, you have to _always_ flush the I cache for
every page fault, because you don't know if the I cache is out of sync
with the page that you've just read in from disk - and therefore you
may end up executing bad code instead of the glibc text that was
intended.

So here's the question: in a system where the responsibility for I-cache
flushing is in userspace, how do you ensure that you can execute code
in userspace to do this I-cache flushing without first having flushed
the (speculatively prefetching) I-cache?


* USB mass storage and ARM cache coherency
  2010-03-02 23:33                                             ` Benjamin Herrenschmidt
@ 2010-03-03 10:21                                               ` Catalin Marinas
  0 siblings, 0 replies; 155+ messages in thread
From: Catalin Marinas @ 2010-03-03 10:21 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2010-03-02 at 23:33 +0000, Benjamin Herrenschmidt wrote:
> On Tue, 2010-03-02 at 17:47 +0000, Catalin Marinas wrote:
> >
> > Actually, option 2 still has an issue - does not easily work on SMP
> > systems where cache maintenance operations aren't broadcast in hardware.
> > In this case (ARM11MPCore), flush_dcache_page() is implemented
> > non-lazily so that the flushing happens on the same processor that
> > dirtied the cache. But since with some drivers there is no call to this
> > function, it wouldn't make any difference.
> 
> Also, option 1 would not solve the icache issue which has the same
> problem related to IPIs. 

Correct. But that's true for both options.

It would have been simpler if we had software TLBs.

> You -really- need to spank some HW folks here :-)

I think they got the message :). Cortex-A9 does it properly.

> > A solution is to do something like read-for-ownership before flushing
> > the D-cache in update_mmu_cache() (or set_pte_at()).
> 
> You might also want to experiment with not clearing PG_arch_1 in
> flush_dcache_page(). I'm not 100% convinced it is necessary and that may
> reduce the amount of flushing needed.

Could a file-mapped page be swapped out (and the mapping removed), then the
page cache page modified (e.g. by an NFS filesystem) and flush_dcache_page()
called?

> Another thing is, on powerpc, we only do the cleaning when we try to
> execute from the pages. IE. We basically "filter out" exec permission
> when pages are not clean. At least on processors that support per-page
> exec permission. You may want to consider something like that as well.

For non-aliasing VIPT, I think that's a fair optimisation.
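
A rough user-space sketch of the "filter out exec permission until the
page is clean" idea described above, assuming toy model_page and
model_pte structs. All model_* names are hypothetical and the cache
maintenance is only counted:

```c
#include <stdbool.h>

/* Toy page state; NOT the kernel's struct page. */
struct model_page {
    bool clean;    /* caches known coherent for execution */
    int  flushes;  /* simulated D-cache flush + I-cache invalidate pairs */
};

/* Toy page table entry. */
struct model_pte {
    bool present;
    bool exec;
};

/* Map a page: exec permission is granted only if the page is already clean. */
static struct model_pte model_map(const struct model_page *pg, bool vma_exec)
{
    struct model_pte pte = { .present = true, .exec = vma_exec && pg->clean };
    return pte;
}

/* Exec fault on a present page: do the maintenance now, then grant exec.
 * Returns false if the VMA does not allow execution at all. */
static bool model_exec_fault(struct model_page *pg, struct model_pte *pte,
                             bool vma_exec)
{
    if (!vma_exec)
        return false;
    if (!pg->clean) {
        pg->flushes++;     /* simulated flush D + invalidate I */
        pg->clean = true;
    }
    pte->exec = true;
    return true;
}
```

Pages that are only ever read or written never pay for the maintenance
in this model; it is deferred to the first attempt to execute from them.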

Thanks.

-- 
Catalin


* USB mass storage and ARM cache coherency
  2010-03-03  9:36                                                   ` Russell King - ARM Linux
@ 2010-03-03 10:24                                                     ` James Bottomley
  2010-03-03 19:41                                                       ` Russell King - ARM Linux
  0 siblings, 1 reply; 155+ messages in thread
From: James Bottomley @ 2010-03-03 10:24 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-03-03 at 09:36 +0000, Russell King - ARM Linux wrote:
> On Wed, Mar 03, 2010 at 11:10:09AM +0530, James Bottomley wrote:
> > On Wed, 2010-03-03 at 16:10 +1100, Benjamin Herrenschmidt wrote:
> > > On Wed, 2010-03-03 at 12:47 +0900, FUJITA Tomonori wrote:
> > > > The ways to improve the approach (introducing PG_arch_2 or marking a
> > > > page clean on dma_unmap_* with DMA_FROM_DEVICE like ia64 does) is up
> > > > to architectures. 
> > > 
> > > How does the above work ? IE, the dma unmap will flush the D side but
> > > not the I side ... or is the ia64 flush primitive magic enough to do
> > > both ?
> > 
> > The point is that in a well regulated system, the I cache shouldn't need
> > extra flushing in the kernel.  We should only be faulting in R-X pages.
> 
> James, that's a pipedream.  If you have a processor which doesn't support
> NX, then the kernel marks all regions executable, even if the app only
> asks for RW protection.

I'm not talking about what the processor supports ... I'm talking about
what the user sets on the VMA.  My point is that the kernel only has
responsibility in specific situations ... it's those paths we do the I/D
coherency on.

> You end up with the protection masks always having VM_EXEC set in them,
> so there's no way to distinguish from the kernel POV which pages are
> going to be executed and those which aren't.

I think you're talking about the pte page flags, I'm talking about the
VMA ones above.

> And if you can't do that, you have to _always_ flush the I cache for
> every page fault, because you don't know if the I cache is out of sync
> with the page that you've just read in from disk - and therefore you
> may end up executing bad code instead of the glibc text that was
> intended.

If you're handling a not-present fault in an executable VMA region, I
agree ... since that's the start of the lifecycle, where we have to begin
I/D coherent.

> So here's the question: in a system where the responsibility for I-cache
> flushing is in userspace, how do you ensure that you can execute code
> in userspace to do this I-cache flushing without first having flushed
> the (speculatively prefetching) I-cache?

I'm not saying the common path (faulting in text sections) is the
responsibility of user space.  I'm saying the uncommon path, write
modification of binaries, is.  So the kernel only needs to worry about
the ordinary text fault path.

James


* USB mass storage and ARM cache coherency
  2010-03-02 23:29                                           ` Benjamin Herrenschmidt
  2010-03-03  3:47                                             ` FUJITA Tomonori
@ 2010-03-03 10:40                                             ` Catalin Marinas
  1 sibling, 0 replies; 155+ messages in thread
From: Catalin Marinas @ 2010-03-03 10:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2010-03-02 at 23:29 +0000, Benjamin Herrenschmidt wrote:
> On Tue, 2010-03-02 at 17:05 +0000, Catalin Marinas wrote:
> 
> > The viable solutions so far:
> >
> >      1. Implement a PIO mapping API similar to the DMA API which takes
> >         care of the D-cache flushing. This means that PIO drivers would
> >         need to be modified to use an API like pio_kmap()/pio_kunmap()
> >         before writing to a page cache page.
> >      2. Invert the meaning of PG_arch_1 to denote a clean page. This
> >         means that by default newly allocated page cache pages are
> >         considered dirty and even if there isn't a call to
> >         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
> >         This is the PowerPC approach.
[...]
> > Option 2 above looks pretty appealing to me since it can be done in the
> > ARM code exclusively. I've done some tests and it indeed solves the
> > cache coherency with a rootfs on a USB stick. As Russell suggested, it
> > can be optimised to mark a page as clean when the DMA API is involved to
> > avoid duplicate flushing.
> 
> That wouldn't solve the need for invalidating the I-cache... Unless we
> use another bit.

Indeed. We currently always invalidate the I-cache when the page is
mapped. With PG_arch_2, we could optimise this but I'm not sure it is
worth it since I think we only get one update_mmu_cache() call for a page
(unless it is unmapped and re-mapped again).

-- 
Catalin


* USB mass storage and ARM cache coherency
  2010-03-03  3:47                                             ` FUJITA Tomonori
  2010-03-03  5:10                                               ` Benjamin Herrenschmidt
@ 2010-03-03 10:43                                               ` Catalin Marinas
  1 sibling, 0 replies; 155+ messages in thread
From: Catalin Marinas @ 2010-03-03 10:43 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-03-03 at 03:47 +0000, FUJITA Tomonori wrote:
> On Wed, 03 Mar 2010 10:29:54 +1100
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 
> > On Tue, 2010-03-02 at 17:05 +0000, Catalin Marinas wrote:
> >
> > > The viable solutions so far:
> > >
> > >      1. Implement a PIO mapping API similar to the DMA API which takes
> > >         care of the D-cache flushing. This means that PIO drivers would
> > >         need to be modified to use an API like pio_kmap()/pio_kunmap()
> > >         before writing to a page cache page.
> > >      2. Invert the meaning of PG_arch_1 to denote a clean page. This
> > >         means that by default newly allocated page cache pages are
> > >         considered dirty and even if there isn't a call to
> > >         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
> > >         This is the PowerPC approach.
> >
> > I don't see the point of a "PIO" API. I would thus vote for 2 :-) Note
> 
> Yeah, as powerpc and ia64 do, arm can flush D cache and invalidate I
> cache when inserting an executable page into a pte, IIUC. No need for the
> new API for I/D consistency.

I can see that IA-64 uses the PG_arch_1 bit to mark a clean page rather
than dirty (as we did for ARM). The Documentation/cachetlb.txt needs
updating.

Thanks.

-- 
Catalin


* USB mass storage and ARM cache coherency
  2010-03-03 10:24                                                     ` James Bottomley
@ 2010-03-03 19:41                                                       ` Russell King - ARM Linux
  0 siblings, 0 replies; 155+ messages in thread
From: Russell King - ARM Linux @ 2010-03-03 19:41 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Mar 03, 2010 at 03:54:37PM +0530, James Bottomley wrote:
> On Wed, 2010-03-03 at 09:36 +0000, Russell King - ARM Linux wrote:
> > James, that's a pipedream.  If you have a processor which doesn't support
> > NX, then the kernel marks all regions executable, even if the app only
> > asks for RW protection.
> 
> I'm not talking about what the processor supports ... I'm talking about
> what the user sets on the VMA.  My point is that the kernel only has
> responsibility in specific situations ... it's those paths we do the I/D
> coherency on.

You may not be talking about what the processor supports, but it is
directly relevant.

> > You end up with the protection masks always having VM_EXEC set in them,
> > so there's no way to distinguish from the kernel POV which pages are
> > going to be executed and those which aren't.
> 
> I think you're talking about the pte page flags, I'm talking about the
> VMA ones above.

No, I'm talking about the VMA ones.

        if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
                if (!(file && (file->f_path.mnt->mnt_flags & MNT_NOEXEC)))
                        prot |= PROT_EXEC;
...
        /* Do simple checking here so the lower-level routines won't have
         * to. we assume access permissions have been handled by the open
         * of the memory object, so we don't do any here.
         */
        vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
                        mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

calc_vm_prot_bits(unsigned long prot)
{
        return _calc_vm_trans(prot, PROT_READ,  VM_READ ) |
               _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) |
               _calc_vm_trans(prot, PROT_EXEC,  VM_EXEC) |
               arch_calc_vm_prot_bits(prot);
}

So, if you have a CPU which does not support NX, then READ_IMPLIES_EXEC
is set in the personality.  That forces PROT_EXEC for anything with
PROT_READ, which in turn forces VM_EXEC.

> I'm not saying the common path (faulting in text sections) is the
> responsibility of user space.  I'm saying the uncommon path, write
> modification of binaries, is.  So the kernel only needs to worry about
> the ordinary text fault path.

What I'm saying is that you can't always tell the difference between
what's an executable page and what isn't in the kernel.  On NX-incapable
CPUs, the kernel treats *all* readable pages as executable, and there's
no way to tell from the VMA or page protection flags that this isn't
the case.
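
The effect can be reproduced in a small user-space model of the code
quoted above. The M_* bit values and the model_vm_flags() helper are
illustrative assumptions, not the kernel's definitions, and the
mnt_noexec argument stands for "file-backed mapping on a noexec mount":

```c
#include <stdbool.h>

/* Illustrative bit values for this model only. */
#define M_PROT_READ  0x1u
#define M_PROT_WRITE 0x2u
#define M_PROT_EXEC  0x4u

#define M_VM_READ    0x1u
#define M_VM_WRITE   0x2u
#define M_VM_EXEC    0x4u

/* Model of the mmap path: READ_IMPLIES_EXEC upgrades PROT_READ to
 * PROT_READ|PROT_EXEC before the prot bits are translated to VM_* bits. */
static unsigned model_vm_flags(unsigned prot, bool read_implies_exec,
                               bool mnt_noexec)
{
    unsigned flags = 0;

    if ((prot & M_PROT_READ) && read_implies_exec && !mnt_noexec)
        prot |= M_PROT_EXEC;

    if (prot & M_PROT_READ)
        flags |= M_VM_READ;
    if (prot & M_PROT_WRITE)
        flags |= M_VM_WRITE;
    if (prot & M_PROT_EXEC)
        flags |= M_VM_EXEC;

    return flags;
}
```

With read_implies_exec set, a plain read/write request comes back with
M_VM_EXEC as well, so the resulting flags look the same as those of a
mapping that really was requested executable.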


* USB mass storage and ARM cache coherency
  2010-03-01 10:42                                     ` Catalin Marinas
@ 2010-03-03 20:24                                       ` Jamie Lokier
  0 siblings, 0 replies; 155+ messages in thread
From: Jamie Lokier @ 2010-03-03 20:24 UTC (permalink / raw)
  To: linux-arm-kernel

Catalin Marinas wrote:
> > I wonder if it's time to get a PG_arch_2 :-)
> 
> As an optimisation, I think this would help (rather than always
> invalidating the I-cache in update_mmu_cache or set_pte_at).

If PG_arch_{1,2} are used in the same way on all architectures, when
they are used at all, perhaps they should be renamed :-)

-- Jamie


* USB mass storage and ARM cache coherency
  2010-03-02 17:05                                         ` Catalin Marinas
  2010-03-02 17:47                                           ` Catalin Marinas
  2010-03-02 23:29                                           ` Benjamin Herrenschmidt
@ 2010-03-03 21:54                                           ` Pavel Machek
  2010-03-04  6:54                                             ` Wolfgang Mües
  2010-03-04 13:35                                             ` Catalin Marinas
  2 siblings, 2 replies; 155+ messages in thread
From: Pavel Machek @ 2010-03-03 21:54 UTC (permalink / raw)
  To: linux-arm-kernel

Hi!

> > I'm not sure that there are some problems in the mm or common code. Is
> > this ARM's implementation issue? (Of course, the usb stack and the
> > driver's misuse of the DMA API needs to be fixed too).
> 
> Just to summarise - on ARM (PIPT / non-aliasing VIPT) there is I-cache
> invalidation for user pages in update_mmu_cache() (it could actually be
> in set_pte_at on SMP to avoid a race but that's for another thread). The
> D-cache is flushed by this function only if the PG_arch_1 bit is set.
> This bit is set in the ARM case by flush_dcache_page(), following the
> advice in Documentation/cachetlb.txt.
> 
> With some drivers (those doing PIO) or subsystems (SCSI mass storage
> over USB HCD), there is no call to flush_dcache_page() for page cache
> pages, hence the ARM implementation of update_mmu_cache() doesn't flush
> the D-cache (and only invalidating the I-cache doesn't help).
> 
> The viable solutions so far:
> 
>      1. Implement a PIO mapping API similar to the DMA API which takes
>         care of the D-cache flushing. This means that PIO drivers would
>         need to be modified to use an API like pio_kmap()/pio_kunmap()
>         before writing to a page cache page.
>      2. Invert the meaning of PG_arch_1 to denote a clean page. This
>         means that by default newly allocated page cache pages are
>         considered dirty and even if there isn't a call to
>         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
>         This is the PowerPC approach.

What about option

3. Forget about PG_arch_1 and always do the flush?

How big is the performance impact? Note that current code does not
even *work* so working, 10% slower code will be an improvement.

								Pavel

(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* USB mass storage and ARM cache coherency
  2010-03-03  5:40                                                 ` James Bottomley
  2010-03-03  9:36                                                   ` Russell King - ARM Linux
@ 2010-03-04  2:00                                                   ` Benjamin Herrenschmidt
  2010-03-04  8:26                                                     ` James Bottomley
  1 sibling, 1 reply; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-04  2:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-03-03 at 11:10 +0530, James Bottomley wrote:
> On Wed, 2010-03-03 at 16:10 +1100, Benjamin Herrenschmidt wrote:
> > On Wed, 2010-03-03 at 12:47 +0900, FUJITA Tomonori wrote:
> > > The ways to improve the approach (introducing PG_arch_2 or marking a
> > > page clean on dma_unmap_* with DMA_FROM_DEVICE like ia64 does) is up
> > > to architectures. 
> > 
> > How does the above work ? IE, the dma unmap will flush the D side but
> > not the I side ... or is the ia64 flush primitive magic enough to do
> > both ?
> 
> The point is that in a well regulated system, the I cache shouldn't need
> extra flushing in the kernel.  We should only be faulting in R-X pages.
> If we're operating on RWX pages (i.e. self modifying code), it's the job
> of userspace to keep I/D coherency.
> 
> So the only case the kernel needs to worry about is the R-X fault case
> for executable text code.

Still, you do need to flush I when a page cache page is recycled.

Cheers,
Ben.


* USB mass storage and ARM cache coherency
  2010-03-03 21:54                                           ` Pavel Machek
@ 2010-03-04  6:54                                             ` Wolfgang Mües
  2010-03-04  9:31                                               ` Russell King - ARM Linux
  2010-03-04 13:47                                               ` Catalin Marinas
  2010-03-04 13:35                                             ` Catalin Marinas
  1 sibling, 2 replies; 155+ messages in thread
From: Wolfgang Mües @ 2010-03-04  6:54 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

Pavel Machek wrote:

> 3. Forget about PG_arch_1 and always do the flush?
> 
> How big is the performance impact? Note that current code does not
> even *work* so working, 10% slower code will be an improvement.

... and this is what *I* don't understand in this discussion. Obviously a 
flush() in PIO drivers is a clean and quick solution to the problem. And how 
much execution time will it cost - given the fact that if there is NO flush, 
the flush operation will not be avoided, only delayed (up to the time the data 
cache is doing the flush itself). If the data cache is doing the flush BEFORE 
the data is used in userspace (this includes the most common case of reading 
large files from the device), there will be no performance impact.

Just my 2 cents.

regards
Wolfgang
-- 
Wahre Worte sind nicht schön - Schöne Worte sind nicht wahr. (Laotse)


* USB mass storage and ARM cache coherency
  2010-03-04  2:00                                                   ` Benjamin Herrenschmidt
@ 2010-03-04  8:26                                                     ` James Bottomley
  2010-03-04 21:25                                                       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 155+ messages in thread
From: James Bottomley @ 2010-03-04  8:26 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 13:00 +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2010-03-03 at 11:10 +0530, James Bottomley wrote:
> > On Wed, 2010-03-03 at 16:10 +1100, Benjamin Herrenschmidt wrote:
> > > On Wed, 2010-03-03 at 12:47 +0900, FUJITA Tomonori wrote:
> > > > The ways to improve the approach (introducing PG_arch_2 or marking a
> > > > page clean on dma_unmap_* with DMA_FROM_DEVICE like ia64 does) is up
> > > > to architectures. 
> > > 
> > > How does the above work ? IE, the dma unmap will flush the D side but
> > > not the I side ... or is the ia64 flush primitive magic enough to do
> > > both ?
> > 
> > The point is that in a well regulated system, the I cache shouldn't need
> > extra flushing in the kernel.  We should only be faulting in R-X pages.
> > If we're operating on RWX pages (i.e. self modifying code), it's the job
> > of userspace to keep I/D coherency.
> > 
> > So the only case the kernel needs to worry about is the R-X fault case
> > for executable text code.
> 
> Still, you do need to flush I when a page cache page is recycled.

Technically not if we've got all the I flushing when mapped executable
sorted out.  This is one of the dangers of over flushing ... if we start
flushing where we don't need it "just to be sure" we end up papering
over holes in the operating system and make catching actual bugs in
operations a lot harder.

The other thing you might not appreciate in ppc land is that for a lot
of other systems (well, like parisc) flushing a dirty cache line is
incredibly expensive (because we halt the CPU to wait for the memory
eviction), so ideally we want to flush as late as possible to give the
natural operations a chance to clean most of the cache lines.  Flushing
a clean cache line on parisc as well as invalidations are fast
operations.  That's why the kmap makes the most sense to us for
implementing PIO ops ... it's the farthest point we can flush the cache
at (because beyond it we've lost the mapping the VIPT cache requires to
flush).

James


* USB mass storage and ARM cache coherency
  2010-03-04  6:54                                             ` Wolfgang Mües
@ 2010-03-04  9:31                                               ` Russell King - ARM Linux
  2010-03-06 10:56                                                 ` Wolfgang Mües
  2010-03-04 13:47                                               ` Catalin Marinas
  1 sibling, 1 reply; 155+ messages in thread
From: Russell King - ARM Linux @ 2010-03-04  9:31 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Mar 04, 2010 at 07:54:57AM +0100, Wolfgang Mües wrote:
> ... and this is what *I* don't understand in this discussion. Obviously a 
> flush() in PIO drivers is a clean and quick solution to the problem. And how 
> much execution time will it cost - given the fact that if there is NO flush, 
> the flush operation will not be avoided, only delayed (up to the time the data 
> cache is doing the flush himself). If the data cache is doing the flush BEFORE 
> the data is used in userspace (this includes the most common case of reading 
> large files from the device), there will be no performance impact.

You're assuming that every page is used in the same way.  Here's some
examples where this is wrong:

1. A page is faulted in for an application, and it is a text page.
   - the data read in to the page needs to be visible to the instruction
     stream, so on Harvard architecture machines, this may require cache
     maintenance on both the D and I caches.

2. A page is faulted in for an application's data page.
   - data may be written to the kernel mapping, which may alias with the
     eventual userspace address.  These aliases need to be dealt with, to
     make the data visible to the user mapping of the page.

3. A page may be read in response to an application issuing a read(2) call.
   - the data is read from the kernel mapping, and isn't mapped into a
     userspace address.

So, in case (3), flushing the I and D caches could be completely wasteful
- consider if this file is a 600MB MPEG video file which is being read by
a video player.  There's no need to flush the I cache because MPEG data
will never be executed.  There's no need to flush the D cache because
there isn't a user mapping of that data yet, and therefore there aren't
any aliases.

In case (2), it would be wasteful to flush the I cache - the application
isn't going to execute the data.

In case (1), everything is required to ensure that the instruction stream
can see the instructions.

So, the PG_arch_1 'delayed flush' is not only about delaying flushes until
they're required, it's about eliminating those which are not required to
give additional system performance - maybe to the point where you can
serve MP3 files via NFS with a low enough latency that your player isn't
regularly starved of data because of all the needless flushing going on.
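
The three cases above can be written down as a small decision function.
This is only a sketch of the reasoning in this mail; the enum, the
MODEL_NEED_* flags and model_required_maintenance() are hypothetical
names, not kernel interfaces:

```c
/* How a page read from the device ends up being used. */
enum model_page_use {
    MODEL_TEXT_FAULT,   /* case 1: faulted in as application text */
    MODEL_DATA_FAULT,   /* case 2: faulted in as application data */
    MODEL_READ_SYSCALL  /* case 3: consumed via read(2), never mapped */
};

#define MODEL_NEED_DCACHE 0x1u  /* D-cache maintenance (visibility, aliases) */
#define MODEL_NEED_ICACHE 0x2u  /* I-cache maintenance */

/* Cache maintenance actually required for each kind of use. */
static unsigned model_required_maintenance(enum model_page_use use)
{
    switch (use) {
    case MODEL_TEXT_FAULT:
        return MODEL_NEED_DCACHE | MODEL_NEED_ICACHE;
    case MODEL_DATA_FAULT:
        return MODEL_NEED_DCACHE;
    case MODEL_READ_SYSCALL:
        return 0;
    }
    /* Unknown use: be conservative and do everything. */
    return MODEL_NEED_DCACHE | MODEL_NEED_ICACHE;
}
```

An unconditional flush in the driver corresponds to always returning
both flags, which is exactly the cost that cases (2) and (3) avoid.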


* USB mass storage and ARM cache coherency
  2010-03-03 21:54                                           ` Pavel Machek
  2010-03-04  6:54                                             ` Wolfgang Mües
@ 2010-03-04 13:35                                             ` Catalin Marinas
  2010-03-04 13:51                                               ` Pavel Machek
  1 sibling, 1 reply; 155+ messages in thread
From: Catalin Marinas @ 2010-03-04 13:35 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-03-03 at 21:54 +0000, Pavel Machek wrote:
> > With some drivers (those doing PIO) or subsystems (SCSI mass storage
> > over USB HCD), there is no call to flush_dcache_page() for page cache
> > pages, hence the ARM implementation of update_mmu_cache() doesn't flush
> > the D-cache (and only invalidating the I-cache doesn't help).
> >
> > The viable solutions so far:
> >
> >      1. Implement a PIO mapping API similar to the DMA API which takes
> >         care of the D-cache flushing. This means that PIO drivers would
> >         need to be modified to use an API like pio_kmap()/pio_kunmap()
> >         before writing to a page cache page.
> >      2. Invert the meaning of PG_arch_1 to denote a clean page. This
> >         means that by default newly allocated page cache pages are
> >         considered dirty and even if there isn't a call to
> >         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
> >         This is the PowerPC approach.
> 
> What about option
> 
> 3. Forget about PG_arch_1 and always do the flush?
> 
> How big is the performance impact? Note that current code does not
> even *work* so working, 10% slower code will be an improvement.

The driver fix is as simple as calling a flush_dcache_page() and I've
been carrying such patches in my tree for some time now. The question is
whether we need to do it in the driver or not (would need to update
Documentation/cachetlb.txt as well).

The reason I'm not in favour of always doing the flush is that we penalise
DMA drivers where there is no need for extra D-cache flushing (already
handled by the DMA API; option 1 above is similar, just that it is meant
for PIO usage). An ARM patch I proposed for inverting the meaning of
PG_arch_1 also marks a page as clean in the dma_map_* functions.
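
To illustrate the difference, here is a toy user-space comparison of
"always flush" against the inverted PG_arch_1 with the DMA API marking
pages clean. The model_* names and counters are assumptions made for
this sketch, and nothing here has been benchmarked:

```c
#include <stdbool.h>

enum model_policy {
    MODEL_ALWAYS_FLUSH,  /* option 3: flush on every mapping */
    MODEL_CLEAN_BIT      /* option 2: flush only pages not marked clean */
};

/* Toy page state; NOT the kernel's struct page. */
struct model_page {
    bool clean;             /* inverted PG_arch_1 */
    int  dma_maintenance;   /* cache maintenance done by the DMA API */
    int  map_time_flushes;  /* extra flushes done when mapping to user space */
};

/* Stand-in for dma_map_*: the DMA API handles the D-cache and marks
 * the page clean so that it is not flushed a second time. */
static void model_dma_map_from_device(struct model_page *p)
{
    p->dma_maintenance++;
    p->clean = true;
}

/* PIO read: the CPU writes the data, nobody flushes, page stays dirty. */
static void model_pio_read(struct model_page *p)
{
    p->clean = false;
}

/* Stand-in for update_mmu_cache() under the two policies. */
static void model_map_to_user(struct model_page *p, enum model_policy policy)
{
    if (policy == MODEL_ALWAYS_FLUSH || !p->clean) {
        p->map_time_flushes++;
        p->clean = true;
    }
}
```

Under MODEL_CLEAN_BIT a DMA-filled page is mapped without a second
flush, while a PIO-filled page still gets the one flush it needs; under
MODEL_ALWAYS_FLUSH both pay at map time.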

-- 
Catalin


* USB mass storage and ARM cache coherency
  2010-03-04  6:54                                             ` Wolfgang Mües
  2010-03-04  9:31                                               ` Russell King - ARM Linux
@ 2010-03-04 13:47                                               ` Catalin Marinas
  1 sibling, 0 replies; 155+ messages in thread
From: Catalin Marinas @ 2010-03-04 13:47 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 06:54 +0000, Wolfgang Mües wrote:
> Pavel Machek wrote:
> 
> > 3. Forget about PG_arch_1 and always do the flush?
> >
> > How big is the performance impact? Note that current code does not
> > even *work* so working, 10% slower code will be an improvement.
> 
> ... and this is what *I* don't understand in this discussion. Obviously a
> flush() in PIO drivers is a clean and quick solution to the problem. And how
> much execution time will it cost - given the fact that if there is NO flush,
> the flush operation will not be avoided, only delayed (up to the time the data
> cache is doing the flush itself). If the data cache is doing the flush BEFORE
> the data is used in userspace (this includes the most common case of reading
> large files from the device), there will be no performance impact.

Indeed, I don't care much about whether we do delayed cache flushing or
not. What I care about is that we need flushing at least once (and
ideally only once). Most PIO drivers don't call any cache flushing
function. Upper layers like USB mass storage or VFS don't do it either
(and probably they shouldn't).

This leaves us with either modifying existing PIO drivers (two patches I
submitted are already in mainline) or clarifying the flush_dcache_page()
usage throughout the kernel (and modifying the architecture code
accordingly). The Documentation/cachetlb.txt states that
flush_dcache_page() is called any time the kernel writes to a page cache
page, which is not the case for PIO drivers.

There may be a small advantage with the delayed flushing since not all
pages read from a device would be mapped in user space but I haven't
done any benchmarks to see the impact.

-- 
Catalin

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04 13:35                                             ` Catalin Marinas
@ 2010-03-04 13:51                                               ` Pavel Machek
  2010-03-04 14:21                                                 ` James Bottomley
  2010-03-04 15:35                                                 ` Catalin Marinas
  0 siblings, 2 replies; 155+ messages in thread
From: Pavel Machek @ 2010-03-04 13:51 UTC (permalink / raw)
  To: linux-arm-kernel

> On Wed, 2010-03-03 at 21:54 +0000, Pavel Machek wrote:
> > > With some drivers (those doing PIO) or subsystems (SCSI mass storage
> > > over USB HCD), there is no call to flush_dcache_page() for page cache
> > > pages, hence the ARM implementation of update_mmu_cache() doesn't flush
> > > the D-cache (and only invalidating the I-cache doesn't help).
> > >
> > > The viable solutions so far:
> > >
> > >      1. Implement a PIO mapping API similar to the DMA API which takes
> > >         care of the D-cache flushing. This means that PIO drivers would
> > >         need to be modified to use an API like pio_kmap()/pio_kunmap()
> > >         before writing to a page cache page.
> > >      2. Invert the meaning of PG_arch_1 to denote a clean page. This
> > >         means that by default newly allocated page cache pages are
> > >         considered dirty and even if there isn't a call to
> > >         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
> > >         This is the PowerPC approach.
> > 
> > What about option
> > 
> > 3. Forget about PG_arch_1 and always do the flush?
> > 
> > How big is the performance impact? Note that current code does not
> > even *work* so working, 10% slower code will be an improvement.
> 
> The driver fix is as simple as calling a flush_dcache_page() and I've
> been carrying such patches in my tree for some time now. The question is
> whether we need to do it in the driver or not (would need to update
> Documentation/cachetlb.txt as well).
> 
> The reason I'm not in favour of always doing the flush is that we penalise
> DMA drivers where there is no need for extra D-cache flushing (already
> handled by the DMA API; option 1 above is similar, just that it is meant
> for PIO usage). An ARM patch I proposed for inverting the meaning of
> PG_arch_1 also marks a page as clean in the dma_map_* functions.

But you are not fixing driver bug, are you?

Seems like ARM has requirement other architectures do not, that is
a) not documented anywhere
b) causes problems

You could argue that performance improvement (how big is it, anyway?)
is worth it, but this should be agreed to by wider community...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04 13:51                                               ` Pavel Machek
@ 2010-03-04 14:21                                                 ` James Bottomley
  2010-03-04 14:27                                                   ` Russell King - ARM Linux
                                                                     ` (2 more replies)
  2010-03-04 15:35                                                 ` Catalin Marinas
  1 sibling, 3 replies; 155+ messages in thread
From: James Bottomley @ 2010-03-04 14:21 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 14:51 +0100, Pavel Machek wrote:
> > On Wed, 2010-03-03 at 21:54 +0000, Pavel Machek wrote:
> > > > With some drivers (those doing PIO) or subsystems (SCSI mass storage
> > > > over USB HCD), there is no call to flush_dcache_page() for page cache
> > > > pages, hence the ARM implementation of update_mmu_cache() doesn't flush
> > > > the D-cache (and only invalidating the I-cache doesn't help).
> > > >
> > > > The viable solutions so far:
> > > >
> > > >      1. Implement a PIO mapping API similar to the DMA API which takes
> > > >         care of the D-cache flushing. This means that PIO drivers would
> > > >         need to be modified to use an API like pio_kmap()/pio_kunmap()
> > > >         before writing to a page cache page.
> > > >      2. Invert the meaning of PG_arch_1 to denote a clean page. This
> > > >         means that by default newly allocated page cache pages are
> > > >         considered dirty and even if there isn't a call to
> > > >         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
> > > >         This is the PowerPC approach.
> > > 
> > > What about option
> > > 
> > > 3. Forget about PG_arch_1 and always do the flush?
> > > 
> > > How big is the performance impact? Note that current code does not
> > > even *work* so working, 10% slower code will be an improvement.
> > 
> > The driver fix is as simple as calling a flush_dcache_page() and I've
> > been carrying such patches in my tree for some time now. The question is
> > whether we need to do it in the driver or not (would need to update
> > Documentation/cachetlb.txt as well).
> > 
> > The reason I'm not in favour of always doing the flush is that we penalise
> > DMA drivers where there is no need for extra D-cache flushing (already
> > handled by the DMA API; option 1 above is similar, just that it is meant
> > for PIO usage). An ARM patch I proposed for inverting the meaning of
> > PG_arch_1 also marks a page as clean in the dma_map_* functions.
> 
> But you are not fixing driver bug, are you?

Technically, he is.  In the old days, most VI architectures were high
end enough not to require PIO transfers.  The only exception was an IDE
driver used by sparc, which led to the arch specific ide in/out string
instructions, in which sparc actually did all the necessary flushing.

So no other drivers than old IDE grew up with cache flushing in the PIO
case (and almost no high end VI hardware had an IDE interface, so they
rarely got implemented in the arch layer).  However, recently, with the
transition from old IDE to libata and the prevalence of ARM with more
commodity hardware, the deficiency is becoming exposed.  Even the PA8000
workstations now come with an IDE CD, which means we're starting to have
problems with them as well.

> Seems like ARM has requirement other architectures do not, that is
> a) not documented anywhere
> b) causes problems
> 
> You could argue that performance improvement (how big is it, anyway?)
> is worth it, but this should be agreed to by wider community...

Performance is always worth it provided we don't sacrifice correctness.
The thing which was discovered in this thread is basically that ARM is
handling deferred flushing (for D/I coherency) in a slightly different
way from everyone else ... once that's fixed, ARM will likely not have
the D/I problem, but we'll still have the libata (and other PIO systems)
D flushing issue.

James

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04 14:21                                                 ` James Bottomley
@ 2010-03-04 14:27                                                   ` Russell King - ARM Linux
  2010-03-04 15:25                                                     ` Catalin Marinas
  2010-03-06 10:47                                                     ` James Bottomley
  2010-03-04 15:29                                                   ` Catalin Marinas
  2010-03-04 21:28                                                   ` Benjamin Herrenschmidt
  2 siblings, 2 replies; 155+ messages in thread
From: Russell King - ARM Linux @ 2010-03-04 14:27 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Mar 04, 2010 at 07:51:52PM +0530, James Bottomley wrote:
> On Thu, 2010-03-04 at 14:51 +0100, Pavel Machek wrote:
> > Seems like ARM has requirement other architectures do not, that is
> > a) not documented anywhere
> > b) causes problems
> > 
> > You could argue that performance improvement (how big is it, anyway?)
> > is worth it, but this should be agreed to by wider community...
> 
> Performance is always worth it provided we don't sacrifice correctness.
> The thing which was discovered in this thread is basically that ARM is
> handling deferred flushing (for D/I coherency) in a slightly different
> way from everyone else ... once that's fixed, ARM will likely not have
> the D/I problem, but we'll still have the libata (and other PIO systems)
> D flushing issue.

I think you've got that backwards.

Reversing the meaning of PG_arch_1 will probably fix the D aliasing issue -
since we'll interpret '0' to mean "page is dirty, it needs flushing before
hitting userspace", whereas '1' means "page has been cleaned; there are no
aliases."

This does not address the I/D coherency issue, where the Icache needs
attention to get rid of speculatively loaded cache lines while old data
was present in the cache.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04 14:27                                                   ` Russell King - ARM Linux
@ 2010-03-04 15:25                                                     ` Catalin Marinas
  2010-03-04 15:34                                                       ` Russell King - ARM Linux
  2010-03-04 21:31                                                       ` Benjamin Herrenschmidt
  2010-03-06 10:47                                                     ` James Bottomley
  1 sibling, 2 replies; 155+ messages in thread
From: Catalin Marinas @ 2010-03-04 15:25 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 14:27 +0000, Russell King - ARM Linux wrote:
> On Thu, Mar 04, 2010 at 07:51:52PM +0530, James Bottomley wrote:
> > On Thu, 2010-03-04 at 14:51 +0100, Pavel Machek wrote:
> > > Seems like ARM has requirement other architectures do not, that is
> > > a) not documented anywhere
> > > b) causes problems
> > >
> > > You could argue that performance improvement (how big is it, anyway?)
> > > is worth it, but this should be agreed to by wider community...
> >
> > Performance is always worth it provided we don't sacrifice correctness.
> > The thing which was discovered in this thread is basically that ARM is
> > handling deferred flushing (for D/I coherency) in a slightly different
> > way from everyone else ... once that's fixed, ARM will likely not have
> > the D/I problem, but we'll still have the libata (and other PIO systems)
> > D flushing issue.
> 
> I think you've got that backwards.
> 
> Reversing the meaning of PG_arch_1 will probably fix the D aliasing issue -
> since we'll interpret '0' to mean "page is dirty, it needs flushing before
> hitting userspace", whereas '1' means "page has been cleaned; there are no
> aliases."
> 
> This does not address the I/D coherency issue, where the Icache needs
> attention to get rid of speculatively loaded cache lines while old data
> was present in the cache.

The I-cache flushing is already handled in update_mmu_cache (or
set_pte_at in a future patch; I'm not talking about other issues on
ARM11MPCore here).

We always invalidate the I-cache currently (since we may have DMA
transfers and the page's D-cache is clean). As an optimisation, we could
use PG_arch_2 for I-cache but I don't think there is much performance
benefit compared to always invalidating the I-cache.

My understanding from this long discussion is that we cannot get the
kernel modifying a page cache page which is already mapped in user space
(well, ptrace does this but we flush the cache there already).

-- 
Catalin

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04 14:21                                                 ` James Bottomley
  2010-03-04 14:27                                                   ` Russell King - ARM Linux
@ 2010-03-04 15:29                                                   ` Catalin Marinas
  2010-03-04 15:41                                                     ` Paul Mundt
  2010-03-04 21:28                                                   ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 155+ messages in thread
From: Catalin Marinas @ 2010-03-04 15:29 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 14:21 +0000, James Bottomley wrote:
> The thing which was discovered in this thread is basically that ARM is
> handling deferred flushing (for D/I coherency) in a slightly different
> way from everyone else ... 

Doing a grep for PG_dcache_dirty defined in terms of PG_arch_1 reveals
that MIPS, Parisc, Score, SH and SPARC do similar things to ARM. PowerPC
and IA-64 use PG_arch_1 as a clean rather than dirty bit.

-- 
Catalin

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04 15:25                                                     ` Catalin Marinas
@ 2010-03-04 15:34                                                       ` Russell King - ARM Linux
  2010-03-04 21:31                                                       ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 155+ messages in thread
From: Russell King - ARM Linux @ 2010-03-04 15:34 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Mar 04, 2010 at 03:25:23PM +0000, Catalin Marinas wrote:
> On Thu, 2010-03-04 at 14:27 +0000, Russell King - ARM Linux wrote:
> > On Thu, Mar 04, 2010 at 07:51:52PM +0530, James Bottomley wrote:
> > > On Thu, 2010-03-04 at 14:51 +0100, Pavel Machek wrote:
> > > > Seems like ARM has requirement other architectures do not, that is
> > > > a) not documented anywhere
> > > > b) causes problems
> > > >
> > > > You could argue that performance improvement (how big is it, anyway?)
> > > > is worth it, but this should be agreed to by wider community...
> > >
> > > Performance is always worth it provided we don't sacrifice correctness.
> > > The thing which was discovered in this thread is basically that ARM is
> > > handling deferred flushing (for D/I coherency) in a slightly different
> > > way from everyone else ... once that's fixed, ARM will likely not have
> > > the D/I problem, but we'll still have the libata (and other PIO systems)
> > > D flushing issue.
> > 
> > I think you've got that backwards.
> > 
> > Reversing the meaning of PG_arch_1 will probably fix the D aliasing issue -
> > since we'll interpret '0' to mean "page is dirty, it needs flushing before
> > hitting userspace", whereas '1' means "page has been cleaned; there are no
> > aliases."
> > 
> > This does not address the I/D coherency issue, where the Icache needs
> > attention to get rid of speculatively loaded cache lines while old data
> > was present in the cache.
> 
> The I-cache flushing is already handled in update_mmu_cache (or
> set_pte_at in a future patch; I'm not talking about other issues on
> ARM11MPCore here).

You may not have been; my message was addressed to James to correct
his message, which seems to have the issues confused.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04 13:51                                               ` Pavel Machek
  2010-03-04 14:21                                                 ` James Bottomley
@ 2010-03-04 15:35                                                 ` Catalin Marinas
  2010-03-07  8:23                                                   ` Pavel Machek
  1 sibling, 1 reply; 155+ messages in thread
From: Catalin Marinas @ 2010-03-04 15:35 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 13:51 +0000, Pavel Machek wrote:
> > On Wed, 2010-03-03 at 21:54 +0000, Pavel Machek wrote:
> > > > With some drivers (those doing PIO) or subsystems (SCSI mass storage
> > > > over USB HCD), there is no call to flush_dcache_page() for page cache
> > > > pages, hence the ARM implementation of update_mmu_cache() doesn't flush
> > > > the D-cache (and only invalidating the I-cache doesn't help).
> > > >
> > > > The viable solutions so far:
> > > >
> > > >      1. Implement a PIO mapping API similar to the DMA API which takes
> > > >         care of the D-cache flushing. This means that PIO drivers would
> > > >         need to be modified to use an API like pio_kmap()/pio_kunmap()
> > > >         before writing to a page cache page.
> > > >      2. Invert the meaning of PG_arch_1 to denote a clean page. This
> > > >         means that by default newly allocated page cache pages are
> > > >         considered dirty and even if there isn't a call to
> > > >         flush_dcache_page(), update_mmu_cache() would flush the D-cache.
> > > >         This is the PowerPC approach.
> > >
> > > What about option
> > >
> > > 3. Forget about PG_arch_1 and always do the flush?
> > >
> > > How big is the performance impact? Note that current code does not
> > > even *work* so working, 10% slower code will be an improvement.
> >
> > The driver fix is as simple as calling a flush_dcache_page() and I've
> > been carrying such patches in my tree for some time now. The question is
> > whether we need to do it in the driver or not (would need to update
> > Documentation/cachetlb.txt as well).
> >
> > The reason I'm not in favour of always doing the flush is that we penalise
> > DMA drivers where there is no need for extra D-cache flushing (already
> > handled by the DMA API; option 1 above is similar, just that it is meant
> > for PIO usage). An ARM patch I proposed for inverting the meaning of
> > PG_arch_1 also marks a page as clean in the dma_map_* functions.
> 
> But you are not fixing driver bug, are you?

Some drivers I fixed already: db8516f61b481e8, 2d68b7fe55d9e19.

> Seems like ARM has requirement other architectures do not, that is
> a) not documented anywhere
> b) causes problems

Well, ARM is pretty similar to other architectures in this respect. And
I'm sure other architectures have similar problems, only that they only
become visible in some circumstances they may not have encountered (i.e.
PIO drivers + filesystem that doesn't call flush_dcache_page like ext*).
Some other architectures may do heavier flushing.

Of course, a Documentation/arm/cachetlb.txt file would make sense.

-- 
Catalin

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04 15:29                                                   ` Catalin Marinas
@ 2010-03-04 15:41                                                     ` Paul Mundt
  2010-03-04 16:30                                                       ` Russell King - ARM Linux
                                                                         ` (2 more replies)
  0 siblings, 3 replies; 155+ messages in thread
From: Paul Mundt @ 2010-03-04 15:41 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Mar 04, 2010 at 03:29:38PM +0000, Catalin Marinas wrote:
> On Thu, 2010-03-04 at 14:21 +0000, James Bottomley wrote:
> > The thing which was discovered in this thread is basically that ARM is
> > handling deferred flushing (for D/I coherency) in a slightly different
> > way from everyone else ... 
> 
> Doing a grep for PG_dcache_dirty defined in terms of PG_arch_1 reveals
> that MIPS, Parisc, Score, SH and SPARC do similar things to ARM. PowerPC
> and IA-64 use PG_arch_1 as a clean rather than dirty bit.
> 
SH used to use it as a PG_mapped which was roughly similar to the
PG_dcache_clean approach, at which point things like flushing for the PIO
case in the HCD wasn't necessary. It did result in rather aggressive over
flushing though, which is one of the reasons we elected to switch to
PG_dcache_dirty.

Note that the PG_dcache_dirty semantics are also outlined in
Documentation/cachetlb.txt for PG_arch_1 usage, so it's hardly esoteric.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04 15:41                                                     ` Paul Mundt
@ 2010-03-04 16:30                                                       ` Russell King - ARM Linux
  2010-03-04 17:34                                                         ` Catalin Marinas
  2010-03-04 18:07                                                       ` Catalin Marinas
  2010-03-04 21:34                                                       ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 155+ messages in thread
From: Russell King - ARM Linux @ 2010-03-04 16:30 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Mar 05, 2010 at 12:41:03AM +0900, Paul Mundt wrote:
> On Thu, Mar 04, 2010 at 03:29:38PM +0000, Catalin Marinas wrote:
> > On Thu, 2010-03-04 at 14:21 +0000, James Bottomley wrote:
> > > The thing which was discovered in this thread is basically that ARM is
> > > handling deferred flushing (for D/I coherency) in a slightly different
> > > way from everyone else ... 
> > 
> > Doing a grep for PG_dcache_dirty defined in terms of PG_arch_1 reveals
> > that MIPS, Parisc, Score, SH and SPARC do similar things to ARM. PowerPC
> > and IA-64 use PG_arch_1 as a clean rather than dirty bit.
> > 
> SH used to use it as a PG_mapped which was roughly similar to the
> PG_dcache_clean approach, at which point things like flushing for the PIO
> case in the HCD wasn't necessary. It did result in rather aggressive over
> flushing though, which is one of the reasons we elected to switch to
> PG_dcache_dirty.
> 
> Note that the PG_dcache_dirty semantics are also outlined in
> Documentation/cachetlb.txt for PG_arch_1 usage, so it's hardly esoteric.

Indeed; the ARM approach was basically taken from Sparc64.

The problem being talked about (with data from PIO drivers not being
visible to userspace) is one of those corner cases.  It's been around
for something like 6 years or more, being reported by folk on the ARM
list on and off - so it's nothing new.

However, it seems very obscure - I've never been able to reproduce it
on any platform I have here, even with people's test programs which
instantly show it on their hardware.  It seems to require a very
specific set of hardware and software conditions to trigger it.

The general criteria (from memory) seem to be:
- a virtual indexed aliasing cache (whether it be VIVT or VIPT aliasing)
- write allocate caches show the problem better than read allocate only
- using a block device for the filesystem
- mmap'ing a page and immediately accessing the last few cache lines in
  that page

The problem is that if enough of your data cache gets cycled through
in between the data being written to the page, and userspace trying to
read it, then you're going to see correct data.  So, the larger the L1
cache, the greater the chance that you'll see a problem.

Here is a program which Lothar sent me some time ago (the timestamp on
the .c is June 2004 - I can't find the original email though.)  I've
just checked with Lothar, who has given me permission to reproduce it.

I can't guarantee that this program still shows a problem - since I
believe I've never been able to reproduce it myself.  It might be worth
checking how other architectures behave.

Note that loop did get fixed with flush_dcache_page(), so trying it
against a loopback mounted filesystem won't show the problem.

/*
 * creates a testfile, 'mmap's it, and checks its content reading
 * page back to front. If a data error is found, the same page is read
 * over and over again, until data is eventually correct after some time.
 *
 * This points out a cache problem in the ARM linux kernel
 * Using the cache in Write-Through mode (kernel command line option: cachepolicy=writethrough)
 * or CONFIG_XSCALE_CACHE_ERRATA=y in older kernels prevents this problem
 *
 * (C) Lothar Wassmann, <LW@KARO-electronics.de>
 *
 */
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/ioctl.h>

#define PAGE_SIZE	4096
#define PAGE_SIZE_INT	((PAGE_SIZE)/sizeof(unsigned long))
#define PAGE_MASK	((PAGE_SIZE)-1)

#undef USE_BLKFLSBUF
#define BLKFRASET  _IO(0x12,100)/* set filesystem (mm/filemap.c) read-ahead */


size_t file_size = 256 * PAGE_SIZE;

unsigned long *buf=NULL;

const char* fn="testfile";

void usage(const char* name)
{
	printf("%s <mount point> [filename]\n", name);
	printf("\trequires <mount point> to be defined in /etc/fstab\n");
	printf("\t<mount point> will be unmounted and remounted during the test\n");
}

int create_file(const char* name, size_t size)
{
	int ret=0;
	int i;
	int fd;

	fd = open(name, O_CREAT|O_RDWR|O_SYNC|O_TRUNC, S_IWUSR|S_IRUSR|S_IRGRP|S_IROTH);
	if (fd < 0) {
		ret = errno;	/* save errno before fprintf() can clobber it */
		fprintf(stderr, "Failed to open '%s' for writing, errno=%d\n", name, ret);
		return ret;
	}

	for (i = size / sizeof(*buf); i > 0; i--) {
		buf[i-1] = i;
	}
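	/* a short or failed write would make every later check meaningless */
	if (write(fd, buf, size) != (ssize_t)size) {
		ret = errno ? errno : EIO;
		fprintf(stderr, "Failed to write '%s', rc=%d\n", name, ret);
	}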
	memset(buf, 0x55, size);

	close(fd);
	return ret;
}

int do_check(int fd, void *mapptr, size_t size)
{
	const int num_pages=size/PAGE_SIZE;
	volatile unsigned char *ptr=mapptr;
	int errors = 0;
	int soft = 0;
	int page;

	printf("Checking data from %08lx to %08lx\n", (unsigned long)(ptr + size),
	       (unsigned long)ptr);

	for (page = num_pages - 1; page >= 0; page--) {
		volatile unsigned long *pp=(volatile unsigned long *)&ptr[page*PAGE_SIZE];
		int offs;
		int page_errs=0;
		int err_offs=-1;

		for (offs = 0; offs < PAGE_SIZE; offs += sizeof(unsigned long)) {
			volatile unsigned long *lp=&pp[offs/sizeof(unsigned long)];
			unsigned long data=*lp;
			unsigned long ref=(((page*PAGE_SIZE)+offs)/sizeof(data)) + 1;

			if (data != ref) {
				const int max_tries=100000;
				int retries=max_tries;
				unsigned long new_data=*lp;

				errors++;
				page_errs++;
				while ((new_data != ref) && (--retries > 0)) {
					if (data != new_data) {
						fprintf(stderr, "Data @ page %03x:%03x (%08lx) changed to %08lx(%08lx)\n",
							page, offs, (unsigned long)lp, new_data, ref);
					}
					data = new_data;
					new_data = *lp;
				}
				if (new_data == ref) {
					fprintf(stderr, "Data @ page %03x:%03x (%08lx) OK after %d retries: %08lx\n",
						page, offs, (unsigned long)lp, max_tries - retries, new_data);
					soft++;
				} else {
					if (err_offs != offs) {
						fprintf(stderr, "Data error @ page %03x:%03x (%08lx): %08lx -> %08lx\n",
							page, offs, (unsigned long)lp, ref, data);
						err_offs = offs;
					}
					// retry the same page again, until data is correct;
					// the loop increment brings offs back to 0
					offs = -(int)sizeof(unsigned long);
				}
			}
		}
		if (page_errs) {
			page = num_pages;
		}
	}

	fprintf(stderr, "Errors reverse check: %d; soft: %d; total bytes %zu in %d pages\n",
		errors, soft, size, num_pages);

	return errors;
}

int check_file(const char* name, size_t size)
{
	int ret=0;
	int fd;
	void *ptr=NULL;
	int errors=0;
	int last_errors=0;

	fd = open(name, O_RDONLY|O_SYNC);
	if (fd < 0) {
		ret = errno;	/* save errno before fprintf() can clobber it */
		fprintf(stderr, "Failed to open '%s' for reading, errno=%d\n", name, ret);
		return ret;
	}

	ptr = mmap(NULL, size, PROT_READ, MAP_SHARED/*PRIVATE*/, fd, 0);
	if (ptr == MAP_FAILED) {
		close(fd);
		return -ENOMEM;
	}

	printf("Checking file '%s'\n", name);
	do {
		last_errors = errors;
		errors = do_check(fd, ptr, size);
		if (errors != 0) {
			ret = errors;
		}
	} while (errors > 0 && errors != last_errors);

	if (munmap(ptr, size) != 0) {
		fprintf(stderr, "Failed to unmap %08lx\n", (unsigned long)ptr);
		if (ret == 0) {
			ret = -ENOMEM;
		}
	}
	close(fd);
	if (buf != NULL) {
		memset(buf, 0x55, size);
	}

	if (ret == 0) {
		printf("check successful\n");
	} else {
		printf("check failed\n");
	}

	return ret;
}

int main(int argc, char *argv[])
{
	int rc=0;
	char fname[100];
	char mount[44];
	char umount[44];

	if (argc < 2) {
		// first argument is required
		usage(argv[0]);
		return 1;
	}
	if (argc > 2) {
		// take optional second argument as filename
		fn = argv[2];
	}

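	/* bounded copies: a long mount point must not overflow the buffers */
	snprintf(fname, sizeof(fname), "%s/%s", argv[1], fn);
	snprintf(mount, sizeof(mount), "mount %s", argv[1]);
	snprintf(umount, sizeof(umount), "umount %s", argv[1]);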

	file_size &= ~PAGE_MASK; // round size to page boundary
	buf = malloc(file_size);

	if (buf == NULL) {
		fprintf(stderr, "Failed to allocate buffer\n");
		rc = -ENOMEM;
	}

#ifdef USE_BLKFLSBUF
	printf("Mounting '%s'\n", argv[1]);
	system(mount);
#endif

	while (rc == 0) {
		printf("Opening '%s'\n", fname);
		rc = create_file(fname, file_size);
		if (rc != 0) {
			fprintf(stderr, "Failed to create file '%s', rc=%d\n", fname, rc);
			break;
		}

#ifndef USE_BLKFLSBUF
		printf("Unmounting '%s'\n", argv[1]);
		system(umount);

		printf("Remounting '%s'\n", argv[1]);
		system(mount);
#else
		{
			int fd = open("/dev/loop0", O_RDONLY);
			ioctl(fd, BLKFLSBUF, 0);
			ioctl(fd, BLKRASET, 0);
			ioctl(fd, BLKFRASET, 0);
			close(fd);
		}
#endif

		rc = check_file(fname, file_size);
	}

	if (buf != NULL) {
		free(buf);
	}

	return rc;
}

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04 16:30                                                       ` Russell King - ARM Linux
@ 2010-03-04 17:34                                                         ` Catalin Marinas
  2010-03-04 17:54                                                           ` Russell King - ARM Linux
  0 siblings, 1 reply; 155+ messages in thread
From: Catalin Marinas @ 2010-03-04 17:34 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 16:30 +0000, Russell King - ARM Linux wrote:
> On Fri, Mar 05, 2010 at 12:41:03AM +0900, Paul Mundt wrote:
> > On Thu, Mar 04, 2010 at 03:29:38PM +0000, Catalin Marinas wrote:
> > > On Thu, 2010-03-04 at 14:21 +0000, James Bottomley wrote:
> > > > The thing which was discovered in this thread is basically that ARM is
> > > > handling deferred flushing (for D/I coherency) in a slightly different
> > > > way from everyone else ...
> > >
> > > Doing a grep for PG_dcache_dirty defined in terms of PG_arch_1 reveals
> > > that MIPS, Parisc, Score, SH and SPARC do similar things to ARM. PowerPC
> > > and IA-64 use PG_arch_1 as a clean rather than dirty bit.
> > 
> > SH used to use it as a PG_mapped which was roughly similar to the
> > PG_dcache_clean approach, at which point things like flushing for the PIO
> > case in the HCD wasn't necessary. It did result in rather aggressive over
> > flushing though, which is one of the reasons we elected to switch to
> > PG_dcache_dirty.
> >
> > Note that the PG_dcache_dirty semantics are also outlined in
> > Documentation/cachetlb.txt for PG_arch_1 usage, so it's hardly esoteric.
> 
> Indeed; the ARM approach was basically taken from Sparc64.
[...]
> The general criteria (from memory) seem to be:
> - a virtual indexed aliasing cache (whether it be VIVT or VIPT aliasing)
> - write allocate caches show the problem better than read allocate only
> - using a block device for the filesystem
> - mmap'ing a page and immediately accessing the last few cache lines in
>   that page

It actually triggers easily with a non-aliasing VIPT cache (can't even
start /sbin/init). The main condition is for the caches to be in
write-allocate mode (and the processor to support this, i.e. Cortex-A9).

A simple test is to use an ext2/3 filesystem (cramfs, jffs2 etc.
wouldn't do since they call flush_dcache_page) on a compact flash card
using the pata_platform driver (and without commit 2d68b7fe55d9e19).

Another way of triggering this is to use something like slram + ext2/3.

-- 
Catalin

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04 17:34                                                         ` Catalin Marinas
@ 2010-03-04 17:54                                                           ` Russell King - ARM Linux
  0 siblings, 0 replies; 155+ messages in thread
From: Russell King - ARM Linux @ 2010-03-04 17:54 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Mar 04, 2010 at 05:34:28PM +0000, Catalin Marinas wrote:
> On Thu, 2010-03-04 at 16:30 +0000, Russell King - ARM Linux wrote:
> > On Fri, Mar 05, 2010 at 12:41:03AM +0900, Paul Mundt wrote:
> > > On Thu, Mar 04, 2010 at 03:29:38PM +0000, Catalin Marinas wrote:
> > > > On Thu, 2010-03-04 at 14:21 +0000, James Bottomley wrote:
> > > > > The thing which was discovered in this thread is basically that ARM is
> > > > > handling deferred flushing (for D/I coherency) in a slightly different
> > > > > way from everyone else ...
> > > >
> > > > Doing a grep for PG_dcache_dirty defined in terms of PG_arch_1 reveals
> > > > that MIPS, Parisc, Score, SH and SPARC do similar things to ARM. PowerPC
> > > > and IA-64 use PG_arch_1 as a clean rather than dirty bit.
> > > 
> > > SH used to use it as a PG_mapped which was roughly similar to the
> > > PG_dcache_clean approach, at which point things like flushing for the PIO
> > > case in the HCD wasn't necessary. It did result in rather aggressive over
> > > flushing though, which is one of the reasons we elected to switch to
> > > PG_dcache_dirty.
> > >
> > > Note that the PG_dcache_dirty semantics are also outlined in
> > > Documentation/cachetlb.txt for PG_arch_1 usage, so it's hardly esoteric.
> > 
> > Indeed; the ARM approach was basically taken from Sparc64.
> [...]
> > The general criteria (from memory) seem to be:
> > - a virtual indexed aliasing cache (whether it be VIVT or VIPT aliasing)
> > - write allocate caches show the problem better than read allocate only
> > - using a block device for the filesystem
> > - mmap'ing a page and immediately accessing the last few cache lines in
> >   that page
> 
> It actually triggers easily with a non-aliasing VIPT cache (can't even
> start /sbin/init). The main condition is for the caches to be in
> write-allocate mode (and the processor to support this, i.e. Cortex-A9).
> 
> A simple test is to use an ext2/3 filesystem (cramfs, jffs2 etc.
> wouldn't do since they call flush_dcache_page) on a compact flash card
> using the pata_platform driver (and without commit 2d68b7fe55d9e19).

Yes, but this combination of hardware has only become available to
me in the last three months.

Previously, I've had reports of ext2 on CF cards on PXA255 based systems
giving problems.  However, I have a PXA255 system which runs its rootfs
off a CF card (which runs applications such as Abiword and gnumeric), but
it has never exhibited the reported problems...

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04 15:41                                                     ` Paul Mundt
  2010-03-04 16:30                                                       ` Russell King - ARM Linux
@ 2010-03-04 18:07                                                       ` Catalin Marinas
  2010-03-04 21:37                                                         ` Benjamin Herrenschmidt
  2010-03-04 21:34                                                       ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 155+ messages in thread
From: Catalin Marinas @ 2010-03-04 18:07 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 15:41 +0000, Paul Mundt wrote:
> On Thu, Mar 04, 2010 at 03:29:38PM +0000, Catalin Marinas wrote:
> > On Thu, 2010-03-04 at 14:21 +0000, James Bottomley wrote:
> > > The thing which was discovered in this thread is basically that ARM is
> > > handling deferred flushing (for D/I coherency) in a slightly different
> > > way from everyone else ...
> >
> > Doing a grep for PG_dcache_dirty defined in terms of PG_arch_1 reveals
> > that MIPS, Parisc, Score, SH and SPARC do similar things to ARM. PowerPC
> > and IA-64 use PG_arch_1 as a clean rather than dirty bit.
> 
> SH used to use it as a PG_mapped which was roughly similar to the
> PG_dcache_clean approach, at which point things like flushing for the PIO
> case in the HCD wasn't necessary. It did result in rather aggressive over
> flushing though, which is one of the reasons we elected to switch to
> PG_dcache_dirty.

Are you more in favour of a PIO kmap API than inverting the meaning of
PG_arch_1?

I'm not familiar with SH but for PIO devices the flushing shouldn't be
more aggressive. For the DMA devices, Russell suggested that we mark the
page as clean (set PG_dcache_clean) in the DMA API to avoid the default
flushing.

> Note that the PG_dcache_dirty semantics are also outlined in
> Documentation/cachetlb.txt for PG_arch_1 usage, so it's hardly esoteric.

Yes, but the flush_dcache_page() semantics outlined in the same file
aren't followed by all the PIO drivers in the kernel.

-- 
Catalin

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04  8:26                                                     ` James Bottomley
@ 2010-03-04 21:25                                                       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-04 21:25 UTC (permalink / raw)
  To: linux-arm-kernel


> > Still, you do need to flush I when a page cache page is recycled.
> 
> Technically not if we've got all the I flushing when mapped executable
> sorted out.  This is one of the dangers of over flushing ... if we start
> flushing where we don't need it "just to be sure" we end up papering
> over holes in the operating system and make catching actual bugs in
> operations a lot harder.

Well, ok so we are talking past each other here :-) So let me try to
summarize what we do, and then write up what I'd like to be able to do
but can't quite see how to get there just yet.

On PPC, we keep track of whether a page is "cache clean" with PG_arch_1.

We only bother with flushing it when mapping it and yes, it's an
expensive operation.

We do it from within set_pte_at() and/or ptep_set_access_flags(), at
which point we test PG_arch_1, and if clear, do the flush and set it.

On systems that support per-page exec permission, we optimize things a
bit, in that unless this is an exec fault, we "skip" the flush when
mapping the page and filter out the exec permission (so that's a read
access for example). We later do the flush when exec is attempted.

On systems that don't (earlier 32-bit powerpc), we -have- to flush any
mapped page sadly as one could be mapped for read and actually executed
from. This is -not- a case of "let userspace shoot themselves in the
foot", letting stale icache leak through to userspace here is actually a
security hole in theory (granted, unlikely but we got barked at enough
when we tried to optimize that out).

Now, when we do the flush as described above, we do both D$ and I$
passes at once.
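
To make the scheme above concrete, here is a rough user-space model of
it. This is only a sketch and not the powerpc code: struct model_page,
struct model_pte, model_set_pte() and the flush counter are made-up
names, and the real work happens in set_pte_at() and
ptep_set_access_flags() on real page flags.

```c
#include <assert.h>
#include <stdbool.h>

/* User-space model only; not kernel code. All names are hypothetical. */
struct model_page {
	bool arch_1_clean;	/* stands in for PG_arch_1 used as a "cache clean" bit */
	unsigned int flushes;	/* number of combined D$ + I$ flushes performed */
};

enum fault_type { FAULT_READ, FAULT_WRITE, FAULT_EXEC };

struct model_pte {
	bool present;
	bool exec;
};

/*
 * Model of the check done when installing a PTE: flush only if the
 * page is not yet known to be clean, and defer the flush to the exec
 * fault on hardware with per-page exec permission.
 */
static struct model_pte model_set_pte(struct model_page *page,
				      enum fault_type fault,
				      bool want_exec, bool has_per_page_exec)
{
	struct model_pte pte = { .present = true, .exec = want_exec };

	assert(page);

	if (page->arch_1_clean)
		return pte;		/* already flushed once, nothing to do */

	if (has_per_page_exec && fault != FAULT_EXEC) {
		/* Defer: map without exec, flush later on the exec fault. */
		pte.exec = false;
		return pte;
	}

	page->flushes++;		/* D$ and I$ passes done together */
	page->arch_1_clean = true;
	return pte;
}
```

With has_per_page_exec set, a read fault on a fresh page returns a PTE
without exec and does no flush; the first exec fault then does the one
flush and sets the bit. Without it, every first mapping has to flush.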

It would be indeed nice to be able to avoid the D$ flush when the page
was the target of a DMA operation, since the D$ flush is the most
expensive part of the process.

However, I don't see how to do that without having a separate page bit
to keep track of the D$ vs. I$ state. For example, if we use PG_arch_1
exclusively for D$, and always flush I$ on mapping to userspace, we end
up with a lot of spurious I$ flushes any time glibc text, for example, is
mapped into a new process.

> The other thing you might not appreciate in ppc land is that for a lot
> of other systems (well, like parisc) flushing a dirty cache line is
> incredibly expensive (because we halt the CPU to wait for the memory
> eviction), 

Same here. High end server PPCs have the I$ snoop the D$ but on all the
other ones, we pay a dear price for those flushes, which is why I'm
trying to see how I could exploit the trick of not doing the D$ side
flush at least for targets of DMA ops, but as I said, I can't see how it
can be done properly without another tracking bit in struct page.

> so ideally we want to flush as late as possible to give the
> natural operations a chance to clean most of the cache lines.  Flushing
> a clean cache line on parisc as well as invalidations are fast
> operations.  That's why the kmap makes the most sense to us for
> implementing PIO ops ... it's the farthest point we can flush the cache
> at (because beyond it we've lost the mapping the VIPT cache requires to

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04 14:21                                                 ` James Bottomley
  2010-03-04 14:27                                                   ` Russell King - ARM Linux
  2010-03-04 15:29                                                   ` Catalin Marinas
@ 2010-03-04 21:28                                                   ` Benjamin Herrenschmidt
  2010-03-04 21:40                                                     ` Russell King - ARM Linux
  2 siblings, 1 reply; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-04 21:28 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 19:51 +0530, James Bottomley wrote:
> 
> Technically, he is.  In the old days, most VI architectures were high
> end enough not to require PIO transfers.  The only exception was an
> IDE driver used by sparc, which led to the arch specific ide in/out
> string instructions, in which sparc actually did all the necessary
> flushing.

Actually, Catalin's problem is with newer PIPT ARM :-)

> So no other drivers than old IDE grew up with cache flushing in the
> PIO case (and almost no high end VI hardware had an IDE interface, so
> they rarely got implemented in the arch layer).  However, recently,
> with the transition from old IDE to libata and the prevalence of ARM
> with more commodity hardware, the deficiency is becoming exposed.
> Even the PA8000 workstations now come with an IDE CD, which means
> we're starting to have problems with them as well.

I don't think there's a core or driver problem in this specific case. As
we discussed earlier, I believe the problem is that ARM considers a
fresh page out of the page cache as "clean" instead of "dirty", and
inverting that like we do on powerpc will fix their problem too.

> > Seems like ARM has requirement other architectures do not, that is
> > a) not documented anywhere
> > b) causes problems
> > 
> > You could argue that performance improvement (how big is it, anyway?)
> > is worth it, but this should be agreed to by wider community...
> 
> Performance is always worth it provided we don't sacrifice
> correctness.
> The thing which was discovered in this thread is basically that ARM is
> handling deferred flushing (for D/I coherency) in a slightly different
> way from everyone else ... once that's fixed, ARM will likely not have
> the D/I problem, but we'll still have the libata (and other PIO
> systems) D flushing issue. 

You mean older VIVT ARM will grow a new issue there ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04 15:25                                                     ` Catalin Marinas
  2010-03-04 15:34                                                       ` Russell King - ARM Linux
@ 2010-03-04 21:31                                                       ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-04 21:31 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 15:25 +0000, Catalin Marinas wrote:
> My understanding from this long discussion is that we cannot get the
> kernel modifying a page cache page which is already mapped in user space
> (well, ptrace does this but we flush the cache there already).

Well, we -can- but it appears that we don't have to provide coherency
in that case since the modification is always done as the result of
userspace explicitly requesting that change (aka read() syscall) and
thus userspace is responsible for the flushing.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04 15:41                                                     ` Paul Mundt
  2010-03-04 16:30                                                       ` Russell King - ARM Linux
  2010-03-04 18:07                                                       ` Catalin Marinas
@ 2010-03-04 21:34                                                       ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-04 21:34 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-03-05 at 00:41 +0900, Paul Mundt wrote:
> On Thu, Mar 04, 2010 at 03:29:38PM +0000, Catalin Marinas wrote:
> > On Thu, 2010-03-04 at 14:21 +0000, James Bottomley wrote:
> > > The thing which was discovered in this thread is basically that ARM is
> > > handling deferred flushing (for D/I coherency) in a slightly different
> > > way from everyone else ... 
> > 
> > Doing a grep for PG_dcache_dirty defined in terms of PG_arch_1 reveals
> > that MIPS, Parisc, Score, SH and SPARC do similar things to ARM. PowerPC
> > and IA-64 use PG_arch_1 as a clean rather than dirty bit.
> > 
> SH used to use it as a PG_mapped which was roughly similar to the
> PG_dcache_clean approach, at which point things like flushing for the PIO
> case in the HCD wasn't necessary. It did result in rather aggressive over
> flushing though, which is one of the reasons we elected to switch to
> PG_dcache_dirty.
> 
> Note that the PG_dcache_dirty semantics are also outlined in
> Documentation/cachetlb.txt for PG_arch_1 usage, so it's hardly esoteric.

Doing it this way, though, is a lot more fragile... since page cache pages
are no longer dirty by default, you need to ensure that any driver
writing to one without DMA sets PG_arch_1, and as we've seen, this is
generally not the case (it's almost never the case actually).
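
A small user-space model of why the dirty-bit variant is the fragile
one. This is a sketch under simplifying assumptions, not kernel code:
the struct, the enum and the three model_*() helpers are hypothetical,
and it only covers a fresh page that is filled by PIO once and then
mapped into user space.

```c
#include <assert.h>
#include <stdbool.h>

/* User-space model only; every name here is hypothetical. */
struct model_page {
	bool arch_1;		/* meaning depends on the scheme below */
	bool dcache_has_data;	/* data sits in D$, not yet written back */
};

enum scheme { ARCH_1_IS_DIRTY, ARCH_1_IS_CLEAN };

/* A fresh page cache page: the bit starts out clear in both schemes. */
static struct model_page model_new_page(void)
{
	struct model_page page = { .arch_1 = false, .dcache_has_data = false };
	return page;
}

/* A PIO driver fills the page with CPU stores; the data lands in D$. */
static void model_pio_write(struct model_page *page, enum scheme s,
			    bool driver_calls_flush_dcache_page)
{
	assert(page);
	page->dcache_has_data = true;
	if (driver_calls_flush_dcache_page) {
		if (s == ARCH_1_IS_DIRTY)
			page->arch_1 = true;	/* mark dirty, flush deferred */
		else
			page->arch_1 = false;	/* mark "not clean" */
	}
}

/* Mapping into user space: returns true if stale data could be seen. */
static bool model_map_to_user(struct model_page *page, enum scheme s)
{
	bool needs_flush = (s == ARCH_1_IS_DIRTY) ? page->arch_1
						  : !page->arch_1;

	assert(page);
	if (needs_flush) {
		page->dcache_has_data = false;	/* written back */
		page->arch_1 = (s == ARCH_1_IS_CLEAN);
	}
	return page->dcache_has_data;	/* still stuck in D$: stale view */
}
```

In this model a driver that forgets flush_dcache_page() leaves stale
data behind only when the bit means "dirty"; with the bit meaning
"clean", the fresh page gets flushed on first mapping regardless.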

Also, in the DMA case, you may not need to flush D$, but you -still-
need to invalidate I$, and unless you then get another bit for tracking
it, you end up doing a lot of over-invalidating of I$ no ?

Or am I missing a critical piece of the puzzle ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04 18:07                                                       ` Catalin Marinas
@ 2010-03-04 21:37                                                         ` Benjamin Herrenschmidt
  2010-03-04 22:11                                                           ` Catalin Marinas
  2010-03-05  1:17                                                           ` Paul Mundt
  0 siblings, 2 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-04 21:37 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 18:07 +0000, Catalin Marinas wrote:
> 
> Are you more in favour of a PIO kmap API than inverting the meaning of
> PG_arch_1? 

My main worry with this approach is the sheer amount of drivers that
need fixing. I believe inverting PG_arch_1 is a better solution and I
somewhat fail to see how we end up doing too much flushing if we have
per-page execute permission (but maybe SH doesn't ?)

> I'm not familiar with SH but for PIO devices the flushing shouldn't be
> more aggressive. For the DMA devices, Russell suggested that we mark
> the page as clean (set PG_dcache_clean) in the DMA API to avoid the
> default flushing.

I really like that idea, as I said earlier, but I'm worried about the I$
side of things. IE. What I'm trying to say is that I can't see how to do
that optimisation without ending up with missing I$ invalidations or
doing way too many of them, unless we have a separate bit to track I$
state.

> > Note that the PG_dcache_dirty semantics are also outlined in
> > Documentation/cachetlb.txt for PG_arch_1 usage, so it's hardly esoteric.
> 
> Yes, but the flush_dcache_page() semantics outlined in the same file
> aren't followed by all the PIO drivers in the kernel.
> 

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04 21:28                                                   ` Benjamin Herrenschmidt
@ 2010-03-04 21:40                                                     ` Russell King - ARM Linux
  2010-03-05  4:31                                                       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 155+ messages in thread
From: Russell King - ARM Linux @ 2010-03-04 21:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Mar 05, 2010 at 08:28:34AM +1100, Benjamin Herrenschmidt wrote:
> I don't think there's a core or driver problem in this specific case. As
> we discussed earlier, I believe the problem is that ARM considers a
> fresh page out of the page cache as "clean" instead of "dirty", and
> inverting that like we do on powerpc will fix their problem too.

The only concern is that it means we treat anonymous pages as dirty
by default.

That's quite sub-optimal since we take care (eg) on write faults to
copy the page and take care of the cache issues while we do that -
whether that be remapping the page to be coherent with the user
address, or cleaning each cache line as we copy the data.

Of course, the simple solution is to also arrange for PG_arch_1 to be
set in this case.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04 21:37                                                         ` Benjamin Herrenschmidt
@ 2010-03-04 22:11                                                           ` Catalin Marinas
  2010-03-05  4:34                                                             ` Benjamin Herrenschmidt
  2010-03-05  1:17                                                           ` Paul Mundt
  1 sibling, 1 reply; 155+ messages in thread
From: Catalin Marinas @ 2010-03-04 22:11 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 21:37 +0000, Benjamin Herrenschmidt wrote:
> On Thu, 2010-03-04 at 18:07 +0000, Catalin Marinas wrote:
> > I'm not familiar with SH but for PIO devices the flushing shouldn't be
> > more aggressive. For the DMA devices, Russell suggested that we mark
> > the page as clean (set PG_dcache_clean) in the DMA API to avoid the
> > default flushing.
> 
> I really like that idea, as I said earlier, but I'm worried about the I$
> side of things. IE. What I'm trying to say is that I can't see how to do
> that optimisation without ending up with missing I$ invalidations or
> doing way too many of them, unless we have a separate bit to track I$
> state.

But does this optimisation really matter? I think with careful checking
in set_pte_at(), you are not going to invalidate the I-cache more than
necessary. If the original page wasn't pte_present() you would need to
do the I-cache invalidation. The other cases where set_pte_at() is
called for LRU (pte_young) or COW (pte_write) we can avoid the extra
invalidation.

-- 
Catalin

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04 21:37                                                         ` Benjamin Herrenschmidt
  2010-03-04 22:11                                                           ` Catalin Marinas
@ 2010-03-05  1:17                                                           ` Paul Mundt
  2010-03-05  4:44                                                             ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 155+ messages in thread
From: Paul Mundt @ 2010-03-05  1:17 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Mar 05, 2010 at 08:37:40AM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2010-03-04 at 18:07 +0000, Catalin Marinas wrote:
> > Are you more in favour of a PIO kmap API than inverting the meaning of
> > PG_arch_1? 
> 
> My main worry with this approach is the sheer amount of drivers that
> need fixing. I believe inverting PG_arch_1 is a better solution and I
> somewhat fail to see how we end up doing too much flushing if we have
> per-page execute permission (but maybe SH doesn't ?)
> 
Basically we have two different MMUs on VIPT parts. The older one, on all
SH-4 parts, was read-implies-exec with no ability to differentiate
between read and exec access. For these parts the PG_dcache_dirty approach
saves us from a lot of flushing, and the corner cases were isolated
enough that we could tolerate fixups at the driver level, even on a
write-allocate D-cache.

For second generation SH-4A (SH-X2) and up parts, read and exec are split
out and we could reasonably adopt the PG_dcache_clean approach there
while adopting the same sort of flushing semantics as PPC to avoid
flushing constantly. The current generation of parts far outnumber their
legacy counterparts, so it's certainly something I plan to experiment
with.

We have an additional level of complexity on some of the SMP parts with a
non-coherent I-cache, some of the early CPUs have broken broadcasting of
the cacheops in hardware and so need to rely on IPIs, while the later
parts broadcast properly. We also need to deal with D-cache IPIs when
using mixed coherency protocols on different CPUs.

For older PIPT parts we've never used the deferred flush, since the only
time we ever had to bother with cache maintenance was in the DMA ops, as
anything closer to the CPU than the PCI DMAC had no opportunity to be
snooped.

> > I'm not familiar with SH but for PIO devices the flushing shouldn't be
> > more aggressive. For the DMA devices, Russell suggested that we mark
> > the page as clean (set PG_dcache_clean) in the DMA API to avoid the
> > default flushing.
> 
> I really like that idea, as I said earlier, but I'm worried about the I$
> side of things. IE. What I'm trying to say is that I can't see how to do
> that optimisation without ending up with missing I$ invalidations or
> doing way too many of them, unless we have a separate bit to track I$
> state.
> 
Using PG_dcache_clean from the DMA API sounds like a pretty good idea,
and certainly worth experimenting with. I don't know how we would do the
I-cache optimization without a PG_arch_2, though.

In any event, if there's going to be a mass exodus to PG_dcache_clean,
Documentation/cachetlb.txt could use a considerable amount of expanding.
The read/exec and I-cache optimizations are something that would be
valuable to document, as opposed to simply being pointed at the sparc64
approach with the regular PG_dcache_dirty caveats.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04 21:40                                                     ` Russell King - ARM Linux
@ 2010-03-05  4:31                                                       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-05  4:31 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 21:40 +0000, Russell King - ARM Linux wrote:
> On Fri, Mar 05, 2010 at 08:28:34AM +1100, Benjamin Herrenschmidt wrote:
> > I don't think there's a core or driver problem in this specific case. As
> > we discussed earlier, I believe the problem is that ARM considers a
> > fresh page out of the page cache as "clean" instead of "dirty", and
> > inverting that like we do on powerpc will fix their problem too.
> 
> The only concern is that it means we treat anonymous pages as dirty
> by default.
>
> That's quite sub-optimal since we take care (eg) on write faults to
> copy the page and take care of the cache issues while we do that -

If you do the cache handling inside your copy_user_highpage() then you
can just set PG_arch_1 stuff there.

> whether that be remapping the page to be coherent with the user
> address, or cleaning each cache line as we copy the data.
> 
> Of course, the simple solution is to also arrange for PG_arch_1 to be
> set in this case.

Right.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04 22:11                                                           ` Catalin Marinas
@ 2010-03-05  4:34                                                             ` Benjamin Herrenschmidt
  2010-03-05  9:27                                                               ` Catalin Marinas
  0 siblings, 1 reply; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-05  4:34 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 22:11 +0000, Catalin Marinas wrote:
> On Thu, 2010-03-04 at 21:37 +0000, Benjamin Herrenschmidt wrote:
> > On Thu, 2010-03-04 at 18:07 +0000, Catalin Marinas wrote:
> > > I'm not familiar with SH but for PIO devices the flushing shouldn't be
> > > more aggressive. For the DMA devices, Russell suggested that we mark
> > > the page as clean (set PG_dcache_clean) in the DMA API to avoid the
> > > default flushing.
> > 
> > I really like that idea, as I said earlier, but I'm worried about the I$
> > side of things. IE. What I'm trying to say is that I can't see how to do
> > that optimisation without ending up with missing I$ invalidations or
> > doing way too many of them, unless we have a separate bit to track I$
> > state.
> 
> But does this optimisation really matter? I think with careful checking
> in set_pte_at(), you are not going to invalidate the I-cache more than
> necessary. If the original page wasn't pte_present() you would need to
> do the I-cache invalidation. The other cases where set_pte_at() is
> called for LRU (pte_young) or COW (pte_write) we can avoid the extra
> invalidation.

No. Not on PIPT (or non aliasing VIPT).

Take your typical glibc text page. This is a struct page that will be
mapped in almost every process in your system. You do not want to do the
icache inval every time. Once it's been cleaned once, it's clean for
subsequent mappings. Only VIVT needs such multiple invalidates I suppose
though in this case you probably do everything differently anyways.
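
A toy count of the difference, as a sketch only. The icache_clean field
stands in for a second tracking bit that does not exist today (call it
a hypothetical PG_arch_2), and the helper names are made up for the
illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* User-space model only; the second page bit is hypothetical. */
struct model_page {
	bool icache_clean;	/* stand-in for a separate I$ tracking bit */
	unsigned int icache_invals;
};

/* Policy A: invalidate I$ whenever a previously absent PTE is installed. */
static void map_exec_per_mapping(struct model_page *page, bool pte_was_present)
{
	assert(page);
	if (!pte_was_present)
		page->icache_invals++;
}

/* Policy B: invalidate once per page and remember it in a page bit. */
static void map_exec_per_page(struct model_page *page)
{
	assert(page);
	if (!page->icache_clean) {
		page->icache_invals++;
		page->icache_clean = true;
	}
}

/* Map the same text page into nprocs processes and count invalidations. */
static unsigned int count_invals(unsigned int nprocs, bool per_page_bit)
{
	struct model_page page = { .icache_clean = false, .icache_invals = 0 };
	unsigned int i;

	for (i = 0; i < nprocs; i++) {
		if (per_page_bit)
			map_exec_per_page(&page);
		else
			map_exec_per_mapping(&page, false);
	}
	return page.icache_invals;
}
```

For a text page mapped into 100 processes, the pte_present() check
alone gives 100 invalidations in this model, while the per-page bit
gives one.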

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-05  1:17                                                           ` Paul Mundt
@ 2010-03-05  4:44                                                             ` Benjamin Herrenschmidt
  2010-03-10  3:52                                                               ` Paul Mundt
  0 siblings, 1 reply; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-05  4:44 UTC (permalink / raw)
  To: linux-arm-kernel


> Basically we have two different MMUs on VIPT parts. The older one, on all
> SH-4 parts, was read-implies-exec with no ability to differentiate
> between read and exec access.

Ok, this is the same as the older ppc32 processors.

> For these parts the PG_dcache_dirty approach
> saves us from a lot of flushing, and the corner cases were isolated
> enough that we could tolerate fixups at the driver level, even on a
> write-allocate D-cache.

But how wide a range of devices do you have to support with those ? Is
this a few SoCs or people putting any random PCI device in there for
example ?

If I were to do it that way on ppc32, I'd worry that it would be more
than a few drivers that I would have to fix :-) All the 32-bit PowerMac
and PowerBooks for example, all of freescale 74xx based parts, etc...
those guys have PCI, and all sort of random HW plugged into them.

I would -love- to avoid that horrible amount of flushing we do on these,
it's quite high on any profile run, but I haven't found a good way to do
so. There's also a nasty issue of icache content leaking between
processes which I doubt is exploitable but I had people having a go at
me about it when I tried to avoid icache cleaning anonymous pages by
default.

> For second generation SH-4A (SH-X2) and up parts, read and exec are split
> out and we could reasonably adopt the PG_dcache_clean approach there
> while adopting the same sort of flushing semantics as PPC to avoid
> flushing constantly. The current generation of parts far outnumber their
> legacy counterparts, so it's certainly something I plan to experiment
> with.

I'd be curious to see whether you get a perf improvement with that.

Note that we still have this additional thing that is floating around in
this thread which I think is definitely worthwhile to do, which is to
mark clean pages that have been written to with DMA in dma_unmap and
friends.... if we can fix the icache problem. So far, I haven't found
James' replies on this satisfactory :-) But maybe I just missed
something.

> We have an additional level of complexity on some of the SMP parts with a
> non-coherent I-cache,

I have that on some embedded ppc's too, where the icache flush
instructions aren't broadcast, like ARM11MP in fact. Pretty horrible.
Fortunately today nobody sane (apart from Bluegene) did an SMP part with
those and so we have well localized internal hacks for them. But I've
heard that some vendors might soon be pumping out SoCs with that stuff
too, which worries me.

>  some of the early CPUs have broken broadcasting of
> the cacheops in hardware and so need to rely on IPIs, while the later
> parts broadcast properly. We also need to deal with D-cache IPIs when
> using mixed coherency protocols on different CPUs.

Right, that sucks. Do those have no-exec permission support ? If they
do, then you can do what I did for BG, which is to ping pong user pages
so they are either writable or executable (since userspace code itself
will break as it will assume the cache ops -are- broadcast, since that's
what the architecture says).
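
Roughly, the ping-pong looks like the following user-space model. It is
only a sketch of the idea: struct model_upage and model_fault() are
made-up names, and it assumes the kernel does the I$ maintenance itself
whenever a page goes from writable to executable, which is a
simplification of what the BG code really does.

```c
#include <assert.h>
#include <stdbool.h>

/* User-space model only; all names are hypothetical. */
struct model_upage {
	bool writable;
	bool executable;
	unsigned int kernel_flushes;	/* I$ maintenance done by the kernel */
};

enum access { ACCESS_WRITE, ACCESS_EXEC };

/*
 * Fault handler model: a user page is never writable and executable
 * at the same time. Each switch to "executable" is the point where
 * the kernel can make the I-cache coherent on every CPU itself,
 * instead of trusting userspace cache ops that are not broadcast.
 */
static void model_fault(struct model_upage *page, enum access a)
{
	assert(page);
	if (a == ACCESS_WRITE && !page->writable) {
		page->executable = false;
		page->writable = true;
	} else if (a == ACCESS_EXEC && !page->executable) {
		page->writable = false;
		page->kernel_flushes++;
		page->executable = true;
	}
}
```

A write, exec, exec, write, exec sequence on one page costs two kernel
flushes in this model: one per transition to executable, none for the
repeated exec access.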

> For older PIPT parts we've never used the deferred flush, since the only
> time we ever had to bother with cache maintenance was in the DMA ops, as
> anything closer to the CPU than the PCI DMAC had no opportunity to be
> snooped.

Do you also, like ARM11MP, have a case of non-cache coherent DMA and
non-broadcast cache ops in SMP ? That's somewhat of a killer, I still
don't see how it can be dealt with properly other than using load/store
tricks to bring the data into the local cache and flushing it from
there. DMA ops are called way too deep into spinlock hell to rely on IPIs
(unless your HW also provides some kind of NMI IPIs).

> > > I'm not familiar with SH but for PIO devices the flushing shouldn't be
> > > more aggressive. For the DMA devices, Russell suggested that we mark
> > > the page as clean (set PG_dcache_clean) in the DMA API to avoid the
> > > default flushing.
> > 
> > I really like that idea, as I said earlier, but I'm worried about the I$
> > side of things. IE. What I'm trying to say is that I can't see how to do
> > that optimisation without ending up with missing I$ invalidations or
> > doing way too many of them, unless we have a separate bit to track I$
> > state.
> > 
> Using PG_dcache_clean from the DMA API sounds like a pretty good idea,
> and certainly worth experimenting with. I don't know how we would do the
> I-cache optimization without a PG_arch_2, though.

Right. That's the one thing I've been trying to figure out without
success. But then, is it a big deal to add PG_arch_2 ? doesn't sound
like it to me...

> In any event, if there's going to be a mass exodus to PG_dcache_clean,
> Documentation/cachetlb.txt could use a considerable amount of expanding.
> The read/exec and I-cache optimizations are something that would be
> valuable to document, as opposed to simply being pointed at the sparc64
> approach with the regular PG_dcache_dirty caveats.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-05  4:34                                                             ` Benjamin Herrenschmidt
@ 2010-03-05  9:27                                                               ` Catalin Marinas
  0 siblings, 0 replies; 155+ messages in thread
From: Catalin Marinas @ 2010-03-05  9:27 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2010-03-05 at 04:34 +0000, Benjamin Herrenschmidt wrote:
> On Thu, 2010-03-04 at 22:11 +0000, Catalin Marinas wrote:
> > On Thu, 2010-03-04 at 21:37 +0000, Benjamin Herrenschmidt wrote:
> > > On Thu, 2010-03-04 at 18:07 +0000, Catalin Marinas wrote:
> > > > I'm not familiar with SH but for PIO devices the flushing shouldn't be
> > > > more aggressive. For the DMA devices, Russell suggested that we mark
> > > > the page as clean (set PG_dcache_clean) in the DMA API to avoid the
> > > > default flushing.
> > >
> > > I really like that idea, as I said earlier, but I'm worried about the I$
> > > side of things. IE. What I'm trying to say is that I can't see how to do
> > > that optimisation without ending up with missing I$ invalidations or
> > > doing way too many of them, unless we have a separate bit to track I$
> > > state.
> >
> > But does this optimisation really matter? I think with careful checking
> > in set_pte_at(), you are not going to invalidate the I-cache more than
> > necessary. If the original page wasn't pte_present() you would need to
> > do the I-cache invalidation. The other cases where set_pte_at() is
> > called for LRU (pte_young) or COW (pte_write) we can avoid the extra
> > invalidation.
> 
> No. Not on PIPT (or non aliasing VIPT).
> 
> Take your typical glibc text page. This is a struct page that will be
> mapped in almost every process in your system. You do not want to do the
> icache inval every time. Once it's been cleaned once, it's clean for
> subsequent mappings. Only VIVT needs such multiple invalidates I suppose
> though in this case you probably do everything differently anyways.

Yes, you are right, shared libraries don't need the extra flushing with
PIPT caches.

Thanks.

-- 
Catalin

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04 14:27                                                   ` Russell King - ARM Linux
  2010-03-04 15:25                                                     ` Catalin Marinas
@ 2010-03-06 10:47                                                     ` James Bottomley
  2010-03-06 19:36                                                       ` Russell King - ARM Linux
  2010-03-06 21:03                                                       ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 155+ messages in thread
From: James Bottomley @ 2010-03-06 10:47 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2010-03-04 at 14:27 +0000, Russell King - ARM Linux wrote:
> On Thu, Mar 04, 2010 at 07:51:52PM +0530, James Bottomley wrote:
> > On Thu, 2010-03-04 at 14:51 +0100, Pavel Machek wrote:
> > > Seems like ARM has requirement other architectures do not, that is
> > > a) not documented anywhere
> > > b) causes problems
> > > 
> > > You could argue that performance improvement (how big is it, anyway?)
> > > is worth it, but this should be agreed to by wider community...
> > 
> > Performance is always worth it provided we don't sacrifice correctness.
> > The thing which was discovered in this thread is basically that ARM is
> > handling deferred flushing (for D/I coherency) in a slightly different
> > way from everyone else ... once that's fixed, ARM will likely not have
> > the D/I problem, but we'll still have the libata (and other PIO systems)
> > D flushing issue.
> 
> I think you've got that backwards.
> 
> Reversing the meaning of PG_arch_1 will probably fix the D aliasing issue -
> since we'll interpret '0' to mean "page is dirty, it needs flushing before
> hitting userspace", whereas '1' means "page has been cleaned; there are no
> aliases."

Yes, that looks about right ... I'll think about doing this for parisc
as well.
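
A minimal user-space sketch of the reversed flag meaning, assuming a
hypothetical struct sketch_page in place of struct page. The helper
names are made up for illustration; only the idea (flag clear means
"flush before hitting userspace", flag set means "already cleaned", and
the DMA API may set it) comes from this thread:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; not the kernel's definition. */
struct sketch_page {
	bool dcache_clean;	/* models PG_arch_1 with the reversed meaning */
	int  flushes;		/* counts simulated kernel-alias flushes */
};

/* A freshly allocated page has the flag clear, i.e. "assume dirty". */
static void sketch_page_init(struct sketch_page *pg)
{
	pg->dcache_clean = false;
	pg->flushes = 0;
}

/* PIO write through the kernel mapping: the alias may be dirty again. */
static void sketch_pio_write(struct sketch_page *pg)
{
	pg->dcache_clean = false;
}

/* DMA path: cache maintenance was already done there, so mark clean. */
static void sketch_dma_from_device(struct sketch_page *pg)
{
	pg->dcache_clean = true;
}

/* Called just before the page becomes visible in a user mapping. */
static void sketch_map_to_user(struct sketch_page *pg)
{
	if (!pg->dcache_clean) {
		pg->flushes++;		/* stands in for the deferred flush */
		pg->dcache_clean = true;
	}
	assert(pg->dcache_clean);
}
```

With this polarity a driver that forgets to touch the flag gets a
(possibly redundant) flush rather than stale data.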

> This does not address the I/D coherency issue, where the Icache needs
> attention to get rid of speculatively loaded cache lines while old data
> was present in the cache.

No, I understand that.  However, I/D coherency is handled way after the
writes to the page in the page cache.

On a fault in of exec data, we first try to get the page out of the page
cache.  If it's not present, we put the faulting process to sleep and
fetch it in from storage.  When we do the read, on the PIO path, the
kernel alias for the page becomes dirty.  Some time later, we place the
page into the user space (updating the pte entry that caused a fault).
At this point, we'll call both flush_icache_page() and
update_mmu_cache() ... this is where the I/D resolution should be done.
Since it's after any I/O has occurred, it doesn't matter whether the CPU
speculatively moved anything in or not.  As long as you flush the kernel
alias and invalidate the user I and D aliases, we're good to go.  Using
the page arch flags is really only to optimise this process (defer
kernel D alias flushing).

James

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04  9:31                                               ` Russell King - ARM Linux
@ 2010-03-06 10:56                                                 ` Wolfgang Mües
  2010-03-06 11:05                                                   ` Oliver Neukum
  2010-03-06 19:44                                                   ` Russell King - ARM Linux
  0 siblings, 2 replies; 155+ messages in thread
From: Wolfgang Mües @ 2010-03-06 10:56 UTC (permalink / raw)
  To: linux-arm-kernel

Russell,

On Thursday, 4 March 2010 10:31:17, Russell King - ARM Linux wrote:
> You're assuming that every page is used in the same way.  Here's some
> examples where this is wrong:
> 
> 1. A page is faulted in for an application, and it is a text page.
>    - the data read in to the page needs to be visible to the instruction
>      stream, so on Harvard architecture machines, this may require cache
>      maintenance on both the D and I caches.
Yes. I think that the EXPECTED behaviour of block devices is to give the 
result of the read back in memory. So the driver should do the flush of the 
data cache.

The invalidation of the I cache should be done by the function which makes 
this piece of data executable. (Have I missed something here?)
 
> 3. A page may be read in response to an application issuing a read(2) call.
>    - the data is read from the kernel mapping, and isn't mapped into a
>      userspace address.
> 
> So, in case (3), flushing the I and D caches could be completely wasteful
But how do you AVOID the writeback of the data cache in (3)?
IMHO, the dirty data is in the cache, and the cache will writeback this data 
on its own.

regards
Wolfgang
-- 
True words are not beautiful - beautiful words are not true. (Laotse)

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-06 10:56                                                 ` Wolfgang Mües
@ 2010-03-06 11:05                                                   ` Oliver Neukum
  2010-03-06 19:44                                                   ` Russell King - ARM Linux
  1 sibling, 0 replies; 155+ messages in thread
From: Oliver Neukum @ 2010-03-06 11:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Saturday, 6 March 2010 11:56:41, Wolfgang Mües wrote:
> > 1. A page is faulted in for an application, and it is a text page.
> >    - the data read in to the page needs to be visible to the instruction
> >      stream, so on Harvard architecture machines, this may require cache
> >      maintenance on both the D and I caches.
> Yes. I think that the EXPECTED behaviour of block devices is to give the 
> result of the read back in memory. So the driver should do the flush of the 
> data cache.
> 
> The invalidation of the I cache should be done by the function which makes 
> this piece of data executable. (Have I missed something here?)

What tells you that IO is happening before the page is made executable?

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-06 10:47                                                     ` James Bottomley
@ 2010-03-06 19:36                                                       ` Russell King - ARM Linux
  2010-03-06 21:07                                                         ` Benjamin Herrenschmidt
                                                                           ` (2 more replies)
  2010-03-06 21:03                                                       ` Benjamin Herrenschmidt
  1 sibling, 3 replies; 155+ messages in thread
From: Russell King - ARM Linux @ 2010-03-06 19:36 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Mar 06, 2010 at 04:17:23PM +0530, James Bottomley wrote:
> On a fault in of exec data, we first try to get the page out of the page
> cache.  If it's not present, we put the faulting process to sleep and
> fetch it in from storage.  When we do the read, on the PIO path, the
> kernel alias for the page becomes dirty.  Some time later, we place the
> page into the user space (updating the pte entry that caused a fault).
> At this point, we'll call both flush_icache_page() and
> update_mmu_cache() ... this is where the I/D resolution should be done.

No - this is where things get extremely icky.

The problem at this point occurs on SMP architectures.  As soon as you
update the PTE entry, it is visible to other threads of the application.
If you do I-cache handling after updating the PTE, then there is a window
where another CPU can execute the page:

CPU0			CPU1
			speculatively prefetches from page N via kernel
			mapping, loads garbage into I-cache
attempts to execute P
page fault
page N allocated
set_pte_at
			executes P
			*splat*
flush I-cache

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-06 10:56                                                 ` Wolfgang Mües
  2010-03-06 11:05                                                   ` Oliver Neukum
@ 2010-03-06 19:44                                                   ` Russell King - ARM Linux
  1 sibling, 0 replies; 155+ messages in thread
From: Russell King - ARM Linux @ 2010-03-06 19:44 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Mar 06, 2010 at 11:56:41AM +0100, Wolfgang Mües wrote:
> > 3. A page may be read in response to an application issuing a read(2) call.
> >    - the data is read from the kernel mapping, and isn't mapped into a
> >      userspace address.
> > 
> > So, in case (3), flushing the I and D caches could be completely wasteful
> But how do you AVOID the writeback of the data cache in (3)?
> IMHO, the dirty data is in the cache, and the cache will writeback this data 
> on its own.

You don't avoid the writeback - you avoid explicitly causing the
writeback _and_ having to wait for it.

If you're writing data into a page (pio) which you then access via that
same mapping (via read(2)), it is totally pointless to sit in a loop
asking the cache to write the data back to memory.

The point when you need this data written back to memory is the point
where you start to create mappings which may alias with the existing
mapping.  Up until that point, the hardware itself can deal with the
writebacks when it decides it's a good time to do so.

Also, cache replacement policies may not decide to immediately re-use
the cache lines you've just flushed - which means that by forcing them
to be written back, you're just increasing the overall latency of the
system.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-06 10:47                                                     ` James Bottomley
  2010-03-06 19:36                                                       ` Russell King - ARM Linux
@ 2010-03-06 21:03                                                       ` Benjamin Herrenschmidt
  2010-03-07  3:37                                                         ` James Bottomley
  1 sibling, 1 reply; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-06 21:03 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, 2010-03-06 at 16:17 +0530, James Bottomley wrote:
> On a fault in of exec data, we first try to get the page out of the page
> cache.  If it's not present, we put the faulting process to sleep and
> fetch it in from storage.  When we do the read, on the PIO path, the
> kernel alias for the page becomes dirty.  Some time later, we place the
> page into the user space (updating the pte entry that caused a fault).
> At this point, we'll call both flush_icache_page() and
> update_mmu_cache() ... this is where the I/D resolution should be done.
> Since it's after any I/O has occurred, it doesn't matter whether the CPU
> speculatively moved anything in or not.  As long as you flush the kernel
> alias and invalidate the user I and D aliases, we're good to go.  Using
> the page arch flags is really only to optimise this process (defer
> kernel D alias flushing).

Ok, so while flush_icache_page() looks like something we could use
instead of set_pte_at() for the icache flushing, it doesn't answer all
the questions. Off the top of my head:

- I see the calls to flush_icache_page() in mm/memory.c but I don't see
them next to all set_pte_at() that insert a valid PTE. For example, we
don't flush the icache for anonymous pages. While that might seem like a
good idea, we have been under pressure to "fix" that on powerpc to make
sure there is no stale icache content from another process leaking into
userspace.

- It needs to be done -before- set_pte_at() but I think the code does it
right, only your explanation above makes it unclear :-)

- It doesn't take the PTE pointer as an argument, so here goes our trick
on powerpc of filtering out exec permission rather than flushing when a
page is accessed by a read fault

- We -still- have the problem of tracking whether the icache has been
flushed or not yet for a given physical page on archs with PIPT (or non
aliasing VIPT) like powerpc. Without that tracking, we flush a lot more
than necessary since we'll end up flushing things like glibc text pages
for every process they are mapped into which is totally wasteful. Thus
the idea of using a new PG bit to separate D$ from I$ tracking still
makes sense.
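
A rough user-space sketch of why a separate I$ bit helps on PIPT,
assuming a hypothetical struct sketch_page with two flags (the second
one standing in for the PG_arch_2 idea mentioned earlier in the thread).
None of these names exist in the kernel:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-page state; not the kernel's struct page. */
struct sketch_page {
	bool dcache_clean;	/* models the existing D$ tracking bit */
	bool icache_clean;	/* models a proposed second bit for I$ */
	int  icache_invals;	/* counts simulated I-cache invalidations */
};

/* The page contents changed: both caches need attention again. */
static void sketch_page_modified(struct sketch_page *pg)
{
	pg->dcache_clean = false;
	pg->icache_clean = false;
}

/* Map the page executable into one more process. */
static void sketch_map_exec(struct sketch_page *pg, bool track_icache)
{
	if (track_icache && pg->icache_clean)
		return;			/* already done for this physical page */
	pg->icache_invals++;
	pg->dcache_clean = true;
	pg->icache_clean = true;
}

/* Count invalidations when one text page is mapped nr_mappings times. */
static int sketch_invals_for(int nr_mappings, bool track_icache)
{
	struct sketch_page pg = { true, false, 0 };

	for (int i = 0; i < nr_mappings; i++)
		sketch_map_exec(&pg, track_icache);
	assert(nr_mappings == 0 || pg.icache_clean);
	return pg.icache_invals;
}
```

In this model a glibc-like text page mapped into 100 processes costs
one invalidation with the extra bit and 100 without it.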

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-06 19:36                                                       ` Russell King - ARM Linux
@ 2010-03-06 21:07                                                         ` Benjamin Herrenschmidt
  2010-03-07  5:54                                                         ` James Bottomley
  2010-03-08 11:17                                                         ` Catalin Marinas
  2 siblings, 0 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-06 21:07 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, 2010-03-06 at 19:36 +0000, Russell King - ARM Linux wrote:
> On Sat, Mar 06, 2010 at 04:17:23PM +0530, James Bottomley wrote:
> > On a fault in of exec data, we first try to get the page out of the page
> > cache.  If it's not present, we put the faulting process to sleep and
> > fetch it in from storage.  When we do the read, on the PIO path, the
> > kernel alias for the page becomes dirty.  Some time later, we place the
> > page into the user space (updating the pte entry that caused a fault).
> > At this point, we'll call both flush_icache_page() and
> > update_mmu_cache() ... this is where the I/D resolution should be done.
> 
> No - this is where things get extremely icky.
> 
> The problem at this point occurs on SMP architectures.  As soon as you
> update the PTE entry, it is visible to other threads of the application.
> If you do I-cache handling after updating the PTE, then there is a window
> where another CPU can execute the page:

Right, we actually hit that bug on powerpc, however, James's explanation
is misleading, ie, I think the -code- actually is right and
flush_icache_page() is called before set_pte_at(). However, see my other
email, I have other issues with it as it is, but nothing unfixable.

So for now, I keep my flush in set_pte_at() and ptep_set_access_flags(),
we'll see if I can move that to an improved flush_icache_page(). In
fact, even set_pte_at() isn't a panacea for me, as I want the fault type
as well.

Cheers,
Ben.

> CPU0			CPU1
> 			speculatively prefetches from page N via kernel
> 			mapping, loads garbage into I-cache
> attempts to execute P
> page fault
> page N allocated
> set_pte_at
> 			executes P
> 			*splat*
> flush I-cache

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-06 21:03                                                       ` Benjamin Herrenschmidt
@ 2010-03-07  3:37                                                         ` James Bottomley
  2010-03-08  8:46                                                           ` FUJITA Tomonori
  2010-03-09  2:25                                                           ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 155+ messages in thread
From: James Bottomley @ 2010-03-07  3:37 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, 2010-03-07 at 08:03 +1100, Benjamin Herrenschmidt wrote:
> On Sat, 2010-03-06 at 16:17 +0530, James Bottomley wrote:
> > On a fault in of exec data, we first try to get the page out of the page
> > cache.  If it's not present, we put the faulting process to sleep and
> > fetch it in from storage.  When we do the read, on the PIO path, the
> > kernel alias for the page becomes dirty.  Some time later, we place the
> > page into the user space (updating the pte entry that caused a fault).
> > At this point, we'll call both flush_icache_page() and
> > update_mmu_cache() ... this is where the I/D resolution should be done.
> > Since it's after any I/O has occurred, it doesn't matter whether the CPU
> > speculatively moved anything in or not.  As long as you flush the kernel
> > alias and invalidate the user I and D aliases, we're good to go.  Using
> > the page arch flags is really only to optimise this process (defer
> > kernel D alias flushing).
> 
> Ok, so while flush_icache_page() looks like something we could use
> instead of set_pte_at() for the icache flushing, it doesn't answer all
> the questions. Off the top of my head:

OK, so what I was actually trying to get across is the point that we
don't handle I cache problems in the I/O or page cache code ... we
handle them in the mm code, so the mm piece of the above was
deliberately a bit vague.

> - I see the calls to flush_icache_page() in mm/memory.c but I don't see
> them next to all set_pte_at() that insert a valid PTE. For example, we
> don't flush the icache for anonymous pages. While that might seem like a
> good idea, we have been under pressure to "fix" that on powerpc to make
> sure there is no stale icache content from another process leaking into
> userspace.

I'm not entirely sure what flush_icache_page() is supposed to do.  On
parisc it flushes the *kernel* icache ... which has got to be wrong.
According to cachetlb.txt it's an obsolete interface.

> - It needs to be done -before- set_pte_at() but I think the code does it
> right, only your explanation above makes it unclear :-)

Sorry, like I said, I only sketched the mm piece.  However, at least on
parisc, there's a technical problem with flushing before we have the
pte:  On VIPT systems, we need a mapping before the flush will work.  I
was experimenting with a mechanism whereby we set aside in the kernel an
aligned region of our congruence size and simply flushed in that region
with the correct mappings, but we haven't got around to implementing it
in the kernel yet.

> - It doesn't take the PTE pointer as an argument, so here goes our trick
> on powerpc of filtering out exec permission rather than flushing when a
> page is accessed by a read fault
> 
> - We -still- have the problem of tracking whether the icache has been
> flushed or not yet for a given physical page on archs with PIPT (or non
> aliasing VIPT) like powerpc. Without that tracking, we flush a lot more
> than necessary since we'll end up flushing things like glibc text pages
> for every process they are mapped into which is totally wasteful. Thus
> the idea of using a new PG bit to separate D$ from I$ tracking still
> makes sense.

So, assuming full congruence of user space, can't you use the VMA as an
indicator?  i.e. if we have no user space mappings, we have to flush the
icache ... if we have one or more, the icache has been flushed and
placing the same page congruently in a different address space benefits
from that prior flush, so consequently there's no need to flush again?

I also think we've established the relevant facts for the I/O thread
(that we only need to either flush the kernel D cache or mark it as to
be flushed later on PIO reads).  We're now into deep technicalities of
how the mm system operates at the architecture level, so perhaps we
should move this to linux-arch?

James

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-06 19:36                                                       ` Russell King - ARM Linux
  2010-03-06 21:07                                                         ` Benjamin Herrenschmidt
@ 2010-03-07  5:54                                                         ` James Bottomley
  2010-03-08 11:17                                                         ` Catalin Marinas
  2 siblings, 0 replies; 155+ messages in thread
From: James Bottomley @ 2010-03-07  5:54 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, 2010-03-06 at 19:36 +0000, Russell King - ARM Linux wrote:
> On Sat, Mar 06, 2010 at 04:17:23PM +0530, James Bottomley wrote:
> > On a fault in of exec data, we first try to get the page out of the page
> > cache.  If it's not present, we put the faulting process to sleep and
> > fetch it in from storage.  When we do the read, on the PIO path, the
> > kernel alias for the page becomes dirty.  Some time later, we place the
> > page into the user space (updating the pte entry that caused a fault).
> > At this point, we'll call both flush_icache_page() and
> > update_mmu_cache() ... this is where the I/D resolution should be done.
> 
> No - this is where things get extremely icky.

OK, but the point I'm trying to make is that the page cache code,
including the I/O layer, only manages kernel D alias state (either by
flushing or marking it dirty).  The user space I/D handling is done in
the mm code (I'm not claiming it's done correctly there, just claiming
it's done there).

> The problem at this point occurs on SMP architectures.  As soon as you
> update the PTE entry, it is visible to other threads of the application.
> If you do I-cache handling after updating the PTE, then there is a window
> where another CPU can execute the page:
> 
> CPU0			CPU1
> 			speculatively prefetches from page N via kernel
> 			mapping, loads garbage into I-cache
> attempts to execute P
> page fault
> page N allocated
> set_pte_at
> 			executes P
> 			*splat*
> flush I-cache

OK, so I can believe this.  We see extremely rare segfaults on parisc
which look to be the result of some I flush race like this.  However, I
think for a discussion of problems with the arch and mm interfaces, we
should probably move off the usb list and onto linux-arch.

Our specific problem on parisc is that being VIPT we can't do an I (or
D) user flush without a mapping.  We have two schemes for fixing this:
One is to use a PAGE_FLUSH flag for the mapping ... it allows the
flushes to work but refuses any type of RWX access (can do this because
we have a software TLB).  The other is to use a flush area within the
kernel where we flush a page congruent to the userspace address ... I
haven't got this working yet, and it's a bit wasteful of kernel address
space because our congruence modulus is 4MB.

James

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-04 15:35                                                 ` Catalin Marinas
@ 2010-03-07  8:23                                                   ` Pavel Machek
  2010-03-08 10:57                                                     ` Catalin Marinas
  0 siblings, 1 reply; 155+ messages in thread
From: Pavel Machek @ 2010-03-07  8:23 UTC (permalink / raw)
  To: linux-arm-kernel

Hi!

> > Seems like ARM has requirement other architectures do not, that is
> > a) not documented anywhere
> > b) causes problems
> 
> Well, ARM is pretty similar to other architectures in this respect. And
> I'm sure other architectures have similar problems, only that they only
> become visible in some circumstances they may not have encountered (i.e.
> PIO drivers + filesystem that doesn't call flush_dcache_page like ext*).
> Some other architectures may do heavier flushing
> 
> Of course, a Documentation/arm/cachetlb.txt file would make sense.

Actually, short/simple documentation for driver authors would be even
better. Then you can claim it is bug in driver :-).
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-07  3:37                                                         ` James Bottomley
@ 2010-03-08  8:46                                                           ` FUJITA Tomonori
  2010-03-09  2:25                                                           ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 155+ messages in thread
From: FUJITA Tomonori @ 2010-03-08  8:46 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, 07 Mar 2010 09:07:17 +0530
James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

> So, assuming full congruence of user space, can't you use the VMA as an
> indicator?  i.e. if we have no user space mappings, we have to flush the
> icache ... if we have one or more, the icache has been flushed and
> placing the same page congruently in a different address space benefits
> from that prior flush, so consequently there's no need to flush again?

I'm not sure about this (sounds like the trick might work for some
though). As I said earlier, I think that IA64 could avoid flushing
I-cache even if the page has no user space mappings (if it did dma to
the page). ia64 needs to track pages for that.

As Ben said, I guess that we need two separate bits for D and I. I
think that it's a good idea to standardize how to use the bits for
optimization (some archs use none, some use only one, some need both,
though). And then we need to revisit the I/O path (fs, the block layer,
drivers). It seems that we added flush_dcache_page() everywhere.


> I also think we've established the relevant facts for the I/O thread
> (that we only need to either flush the kernel D cache or mark it as to
> be flushed later on PIO reads).

Do we have a PIO issue with D-cache aliasing now? That is, don't mm/ or
fs/ already flush the D-cache properly? I thought that Catalin only has
a D/I cache consistency issue. If not, PIO would also not work on
powerpc, which handles D/I cache consistency properly.


> We're now into deep technicalities of
> how the mm system operates at the architecture level, so perhaps we
> should move this to linux-arch?

Yeah, probably we should.

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-07  8:23                                                   ` Pavel Machek
@ 2010-03-08 10:57                                                     ` Catalin Marinas
  0 siblings, 0 replies; 155+ messages in thread
From: Catalin Marinas @ 2010-03-08 10:57 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, 2010-03-07 at 08:23 +0000, Pavel Machek wrote:
> > > Seems like ARM has requirement other architectures do not, that is
> > > a) not documented anywhere
> > > b) causes problems
> >
> > Well, ARM is pretty similar to other architectures in this respect. And
> > I'm sure other architectures have similar problems, only that they only
> > become visible in some circumstances they may not have encountered (i.e.
> > PIO drivers + filesystem that doesn't call flush_dcache_page like ext*).
> > Some other architectures may do heavier flushing
> >
> > Of course, a Documentation/arm/cachetlb.txt file would make sense.
> 
> Actually, short/simple documentation for driver authors would be even
> better. Then you can claim it is bug in driver :-).

That would help, but only once we agree whether it's a driver bug or the
arch code needs changing.

-- 
Catalin

^ permalink raw reply	[flat|nested] 155+ messages in thread

* USB mass storage and ARM cache coherency
  2010-03-06 19:36                                                       ` Russell King - ARM Linux
  2010-03-06 21:07                                                         ` Benjamin Herrenschmidt
  2010-03-07  5:54                                                         ` James Bottomley
@ 2010-03-08 11:17                                                         ` Catalin Marinas
  2 siblings, 0 replies; 155+ messages in thread
From: Catalin Marinas @ 2010-03-08 11:17 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, 2010-03-06 at 19:36 +0000, Russell King - ARM Linux wrote:
> On Sat, Mar 06, 2010 at 04:17:23PM +0530, James Bottomley wrote:
> > On a fault in of exec data, we first try to get the page out of the page
> > cache.  If it's not present, we put the faulting process to sleep and
> > fetch it in from storage.  When we do the read, on the PIO path, the
> > kernel alias for the page becomes dirty.  Some time later, we place the
> > page into the user space (updating the pte entry that caused a fault).
> > At this point, we'll call both flush_icache_page() and
> > update_mmu_cache() ... this is where the I/D resolution should be done.
> 
> No - this is where things get extremely icky.
> 
> The problem at this point occurs on SMP architectures.  As soon as you
> update the PTE entry, it is visible to other threads of the application.
> If you do I-cache handling after updating the PTE, then there is a window
> where another CPU can execute the page:
> 
> CPU0                    CPU1
>                         speculatively prefetches from page N via kernel
>                         mapping, loads garbage into I-cache
> attempts to execute P
> page fault
> page N allocated
> set_pte_at
>                         executes P
>                         *splat*
> flush I-cache

You have two choices - either invalidate the I-cache before the user pte
becomes visible or set the page as not-executable in set_pte_at() and
later mark it as executable in update_mmu_cache (via set_pte_ext).

We currently invalidate the whole I-cache for historical reasons but we
could actually only invalidate a single page. Since even on latest ARM
CPUs, the I-cache is a real VIPT (i.e. can have aliases), we would need
to invalidate on the user mapping (or create a temporary one). The
latter approach of clearing the X bit in set_pte_at may actually help
with this scenario (I haven't done any tests though).
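
The two orderings can be sketched with a toy user-space model. This is
not kernel code: every name below (toy_page, toy_pte, toy_map_*) is
hypothetical, and the "I-cache" is reduced to a single stale flag per
page. It only illustrates why a PTE that is published without the X bit
cannot lead another CPU into stale instructions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; none of these names exist in the kernel. */
struct toy_page {
	bool icache_stale;	/* I-cache may hold garbage for this page */
};

struct toy_pte {
	bool present;
	bool exec;
	struct toy_page *page;
};

enum toy_result { TOY_FAULT, TOY_OK, TOY_SPLAT };

/* What another CPU would observe if it executed through the PTE now. */
static enum toy_result toy_try_execute(const struct toy_pte *pte)
{
	if (!pte->present || !pte->exec)
		return TOY_FAULT;	/* harmless: the access is retried */
	return pte->page->icache_stale ? TOY_SPLAT : TOY_OK;
}

static void toy_invalidate_icache(struct toy_page *page)
{
	page->icache_stale = false;
}

/* Choice 1: invalidate before the user PTE becomes visible. */
static void toy_map_flush_first(struct toy_pte *pte, struct toy_page *page)
{
	toy_invalidate_icache(page);
	pte->page = page;
	pte->exec = true;
	pte->present = true;
}

/* Choice 2, first half: publish the PTE without execute permission. */
static void toy_map_noexec(struct toy_pte *pte, struct toy_page *page)
{
	pte->page = page;
	pte->exec = false;
	pte->present = true;
}

/* Choice 2, second half: invalidate, then grant execute permission. */
static void toy_make_exec(struct toy_pte *pte)
{
	assert(pte->present && pte->page != NULL);
	toy_invalidate_icache(pte->page);
	pte->exec = true;
}
```

In this model, a CPU that races with the second approach sees either a
fault or a clean page, never the stale state.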

-- 
Catalin

* USB mass storage and ARM cache coherency
  2010-03-07  3:37                                                         ` James Bottomley
  2010-03-08  8:46                                                           ` FUJITA Tomonori
@ 2010-03-09  2:25                                                           ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-09  2:25 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, 2010-03-07 at 09:07 +0530, James Bottomley wrote:
> So, assuming full congruence of user space, can't you use the VMA as an
> indicator?  i.e. if we have no user space mappings, we have to flush the
> icache ... if we have one or more, the icache has been flushed and
> placing the same page congruently in a different address space benefits
> from that prior flush, so consequently there's no need to flush again?

The VMA? Or do you mean struct page->mapping? That would work, I suppose,
in the case where we want to flush the icache for all pages mapped
into user space. But on processors that support per-page execute
permission, we really only want to flush pages that are executed from
(lazily). In that case, we do need a dedicated bit to keep track of
whether a given page has been flushed already.
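
A minimal sketch of such a per-page bit, written as a user-space toy
model. The names (lazy_page, icache_clean, lazy_*) are made up for
illustration and the flush itself is replaced by a counter; the real
flag and cache maintenance calls are exactly what is still being
debated in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; the field and function names are hypothetical. */
struct lazy_page {
	bool icache_clean;	/* the "dedicated bit" for this page */
	unsigned int flushes;	/* counts stand-in I-cache flushes */
};

/* New data landed in the page (for example a PIO or DMA read). */
static void lazy_page_written(struct lazy_page *page)
{
	page->icache_clean = false;
}

/* First execution from the page through any mapping. */
static void lazy_exec_fault(struct lazy_page *page)
{
	if (!page->icache_clean) {
		page->flushes++;	/* stands in for the real flush */
		page->icache_clean = true;
	}
	assert(page->icache_clean);
}
```

With this bookkeeping, mapping the same page executable into a second
address space costs no extra flush, and pages that are never executed
are never flushed.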

> I also think we've established the relevant facts for the I/O thread
> (that we only need to either flush the kernel D cache or mark it as to
> be flushed later on PIO reads).  We're now into deep technicalities of
> how the mm system operates at the architecture level, so perhaps we
> should move this to linux-arch? 

No objection, though moving threads after the fact is a recipe for
trouble :-)

Cheers,
Ben.

* USB mass storage and ARM cache coherency
  2010-03-05  4:44                                                             ` Benjamin Herrenschmidt
@ 2010-03-10  3:52                                                               ` Paul Mundt
  2010-03-11 21:44                                                                 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 155+ messages in thread
From: Paul Mundt @ 2010-03-10  3:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Mar 05, 2010 at 03:44:55PM +1100, Benjamin Herrenschmidt wrote:
> > For these parts the PG_dcache_dirty approach
> > saves us from a lot of flushing, and the corner cases were isolated
> > enough that we could tolerate fixups at the driver level, even on a
> > write-allocate D-cache.
> 
> But how wide a range of devices do you have to support with those ? Is
> this a few SoCs or people putting any random PCI device in there for
> example ?
> 
> If I were to do it that way on ppc32, I'd worry that it would be more
> than a few drivers that I would have to fix :-) All the 32-bit PowerMac
> and PowerBooks for example, all of freescale 74xx based parts, etc...
> those guys have PCI, and all sorts of random HW plugged into them.
> 
Many of those parts do support PCI, but are rarely used with arbitrary
devices. The PCI controller on those parts also permits one to establish
coherency for any transactions between PCI and memory through a rudimentary
snoop controller that requires the CPU to avoid entering any sleep
states. This works ok in practice since that series of host controllers
doesn't really support power management anyways (nor do any of the cores
of that generation implement any of the more complex sleep states).

> > For second generation SH-4A (SH-X2) and up parts, read and exec are split
> > out and we could reasonably adopt the PG_dcache_clean approach there
> > while adopting the same sort of flushing semantics as PPC to avoid
> > flushing constantly. The current generation of parts far outnumber their
> > legacy counterparts, so it's certainly something I plan to experiment
> > with.
> 
> I'd be curious to see whether you get a perf improvement with that.
> 
> Note that we still have this additional thing that is floating around in
> this thread which I think is definitely worthwhile to do, which is to
> mark clean pages that have been written to with DMA in dma_unmap and
> friends.... if we can fix the icache problem. So far, I haven't found
> James's replies on this satisfactory :-) But maybe I just missed
> something.
> 
I'll start in on profiling some of this once I start on 2.6.35 stuff. I
think I still have my old numbers from when we did the PG_mapped to
PG_dcache_dirty transition, so it will be interesting to see how
PG_dcache_clean stacks up against both of those.

> > We have an additional level of complexity on some of the SMP parts with a
> > non-coherent I-cache,
> 
> I've seen that on some embedded ppc's too, where the icache flush
> instructions aren't broadcast, like ARM11MP in fact. Pretty horrible.
> Fortunately today nobody sane (apart from Bluegene) did an SMP part with
> those and so we have well localized internal hacks for them. But I've
> heard that some vendors might be pumping out SoCs with that stuff too
> soon, which worries me.
> 
I-cache invalidations are broadcast on all mass produced SH-4A SMP parts,
but we do have some early proto chips that screwed that up. For the case
of mainline, we ought to be able to assume hardware broadcast though.

> >  some of the early CPUs have broken broadcasting of
> > the cacheops in hardware and so need to rely on IPIs, while the later
> > parts broadcast properly. We also need to deal with D-cache IPIs when
> > using mixed coherency protocols on different CPUs.
> 
> Right, that sucks. Do those have no-exec permission support ? If they
> do, then you can do what I did for BG, which is to ping pong user pages
> so they are either writable or executable (since userspace code itself
> will break as it will assume the cache ops -are- broadcast, since that's
> what the architecture says).
> 
Yes, these all support no-exec. I'll give the ping ponging thing a try,
thanks for the tip.
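
The ping-pong scheme described above can be sketched as a toy
user-space state machine. It is only an illustration under assumed
names (pp_page, pp_fault are hypothetical), with the I-cache
maintenance reduced to a counter: a page is writable or executable,
never both, and each switch to executable pays for one flush.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; these names do not exist in any kernel tree. */
enum pp_access { PP_WRITE, PP_EXEC };

struct pp_page {
	bool writable;
	bool executable;
	unsigned int flushes;	/* counts stand-in I-cache flushes */
};

/* Permission fault handler: flip the page to the requested mode. */
static void pp_fault(struct pp_page *p, enum pp_access access)
{
	if (access == PP_WRITE) {
		p->executable = false;
		p->writable = true;
	} else {
		p->flushes++;	/* flush on all CPUs before allowing exec */
		p->writable = false;
		p->executable = true;
	}
	/* The invariant of the scheme: never writable and executable. */
	assert(!(p->writable && p->executable));
}
```

Because userspace can never write to a page that is currently
executable, it does not matter that its own cache operations are not
broadcast; the kernel performs the flush when the mode changes.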

> Do you also, like ARM11MP, have a case of non-cache coherent DMA and
> non-broadcast cache ops in SMP ? That's somewhat of a killer, I still
> don't see how it can be dealt with properly other than using load/store
> tricks to bring the data into the local cache and flushing it from
> there. DMA ops are called way too deep into spinlock hell to rely on IPIs

The only thing we really lack is I-cache coherency, which isn't such a
big deal with invalidations being broadcast. All DMA accesses are
snooped, and the D-cache is fully coherent.

> (unless your HW also provides some kind of NMI IPIs).
> 
While we don't have anything like FIQs to work with, we do have IRQ
priority levels to play with. I'd toyed with this idea in the past of
simply having a reserved level that never gets masked, particularly for
things like broadcast backtraces.

> > Using PG_dcache_clean from the DMA API sounds like a pretty good idea,
> > and certainly worth experimenting with. I don't know how we would do the
> > I-cache optimization without a PG_arch_2, though.
> 
> Right. That's the one thing I've been trying to figure out without
> success. But then, is it a big deal to add PG_arch_2 ? doesn't sound
> like it to me...
> 
Well, it does start to get a bit painful with sparsemem section or NUMA
node IDs also digging in to the page flags on 32-bit.. the benefits would
have to be pretty compelling to offset the pain.

* USB mass storage and ARM cache coherency
  2010-03-10  3:52                                                               ` Paul Mundt
@ 2010-03-11 21:44                                                                 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 155+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-11 21:44 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 2010-03-10 at 12:52 +0900, Paul Mundt wrote:
> Well, it does start to get a bit painful with sparsemem section or
> NUMA
> node IDs also digging in to the page flags on 32-bit.. the benefits
> would
> have to be pretty compelling to offset the pain. 

Unless we play a dangerous trick and re-use another flag that isn't
meaningful for allocated pages... maybe PG_buddy? Or am I missing
something about that guy's semantics?

Cheers,
Ben.

end of thread, other threads:[~2010-03-11 21:44 UTC | newest]

Thread overview: 155+ messages
     [not found] <20100129185434.GH19501@one-eyed-alien.net>
     [not found] ` <1265045354.25750.52.camel@pc1117.cambridge.arm.com>
2010-02-08  6:55   ` USB mass storage and ARM cache coherency Pavel Machek
2010-02-08  7:33     ` Andreas Mohr
2010-02-08 10:19       ` Catalin Marinas
2010-02-08  9:51     ` Catalin Marinas
2010-02-08 10:03       ` Andy Green
2010-02-17  9:50         ` Sascha Hauer
2010-02-17  9:57           ` Andy Green
2010-02-08 10:52       ` Pavel Machek
2010-02-08 11:28         ` Catalin Marinas
2010-02-16  7:57           ` Shilimkar, Santosh
2010-02-16  8:22             ` Oliver Neukum
2010-02-16  8:55               ` Shilimkar, Santosh
2010-02-16  9:07                 ` Oliver Neukum
2010-02-16  9:39                   ` Russell King - ARM Linux
2010-02-16 13:32                     ` Oliver Neukum
2010-02-16 13:40                       ` Shilimkar, Santosh
2010-02-16 13:46                         ` Oliver Neukum
2010-02-16 14:12                           ` Shilimkar, Santosh
2010-02-16 14:22                             ` Oliver Neukum
2010-02-16 14:45                               ` Shilimkar, Santosh
2010-02-16 15:44                                 ` Alan Stern
2010-02-17  8:55                               ` Shilimkar, Santosh
2010-02-17  9:10                                 ` Oliver Neukum
2010-02-17  9:17                                   ` Shilimkar, Santosh
2010-02-17 17:02                                 ` Alan Stern
2010-02-17 20:26                                   ` Russell King - ARM Linux
2010-02-17 20:30                                   ` Gadiyar, Anand
2010-02-18  6:56                                     ` Oliver Neukum
2010-02-18  7:14                                       ` Gadiyar, Anand
2010-02-17 12:29                     ` Jamie Lokier
2010-02-17  3:21                 ` Ming Lei
2010-02-17  9:05               ` Benjamin Herrenschmidt
2010-02-17  9:15                 ` Oliver Neukum
2010-02-17  9:40                   ` Benjamin Herrenschmidt
2010-02-17 10:09                     ` Oliver Neukum
2010-02-17 10:18                       ` Benjamin Herrenschmidt
2010-02-17 10:23                         ` Oliver Neukum
2010-02-17 12:15                           ` Benjamin Herrenschmidt
2010-02-17  9:55                 ` Russell King - ARM Linux
2010-02-17 10:05                   ` Benjamin Herrenschmidt
2010-02-17 15:27                 ` Catalin Marinas
2010-02-17 20:37                   ` Benjamin Herrenschmidt
2010-02-17 20:44                     ` Russell King - ARM Linux
2010-02-17 22:31                       ` Benjamin Herrenschmidt
2010-02-19 17:15                         ` Catalin Marinas
2010-02-19 17:36                           ` Catalin Marinas
2010-02-19 20:53                             ` Oliver Neukum
2010-02-24  2:48                               ` Benjamin Herrenschmidt
2010-02-24  7:16                                 ` Oliver Neukum
2010-02-24 21:12                                   ` Benjamin Herrenschmidt
2010-02-25  3:48                                     ` Oliver Neukum
2010-02-26  0:22                                       ` Benjamin Herrenschmidt
2010-02-25 12:36                                     ` James Bottomley
2010-02-24  2:47                             ` Benjamin Herrenschmidt
2010-02-24 16:19                               ` Alan Stern
2010-02-24 21:13                                 ` Benjamin Herrenschmidt
2010-02-24 21:50                                   ` Alan Stern
2010-02-25 20:52                                     ` Benjamin Herrenschmidt
2010-02-26 16:00                                   ` Catalin Marinas
2010-02-26 21:36                                     ` Benjamin Herrenschmidt
2010-02-26 16:25                               ` Catalin Marinas
2010-02-26 16:52                                 ` Alan Stern
2010-02-26 21:51                                   ` Benjamin Herrenschmidt
2010-02-26 21:00                                 ` Russell King - ARM Linux
2010-02-28  0:14                                   ` Benjamin Herrenschmidt
2010-02-28  5:01                                     ` James Bottomley
2010-03-01 10:39                                       ` Catalin Marinas
2010-03-01 11:06                                         ` Russell King - ARM Linux
2010-03-02 12:11                                       ` FUJITA Tomonori
2010-03-02 17:05                                         ` Catalin Marinas
2010-03-02 17:47                                           ` Catalin Marinas
2010-03-02 23:33                                             ` Benjamin Herrenschmidt
2010-03-03 10:21                                               ` Catalin Marinas
2010-03-02 23:29                                           ` Benjamin Herrenschmidt
2010-03-03  3:47                                             ` FUJITA Tomonori
2010-03-03  5:10                                               ` Benjamin Herrenschmidt
2010-03-03  5:40                                                 ` James Bottomley
2010-03-03  9:36                                                   ` Russell King - ARM Linux
2010-03-03 10:24                                                     ` James Bottomley
2010-03-03 19:41                                                       ` Russell King - ARM Linux
2010-03-04  2:00                                                   ` Benjamin Herrenschmidt
2010-03-04  8:26                                                     ` James Bottomley
2010-03-04 21:25                                                       ` Benjamin Herrenschmidt
2010-03-03  6:35                                                 ` FUJITA Tomonori
2010-03-03 10:43                                               ` Catalin Marinas
2010-03-03 10:40                                             ` Catalin Marinas
2010-03-03 21:54                                           ` Pavel Machek
2010-03-04  6:54                                             ` Wolfgang Mües
2010-03-04  9:31                                               ` Russell King - ARM Linux
2010-03-06 10:56                                                 ` Wolfgang Mües
2010-03-06 11:05                                                   ` Oliver Neukum
2010-03-06 19:44                                                   ` Russell King - ARM Linux
2010-03-04 13:47                                               ` Catalin Marinas
2010-03-04 13:35                                             ` Catalin Marinas
2010-03-04 13:51                                               ` Pavel Machek
2010-03-04 14:21                                                 ` James Bottomley
2010-03-04 14:27                                                   ` Russell King - ARM Linux
2010-03-04 15:25                                                     ` Catalin Marinas
2010-03-04 15:34                                                       ` Russell King - ARM Linux
2010-03-04 21:31                                                       ` Benjamin Herrenschmidt
2010-03-06 10:47                                                     ` James Bottomley
2010-03-06 19:36                                                       ` Russell King - ARM Linux
2010-03-06 21:07                                                         ` Benjamin Herrenschmidt
2010-03-07  5:54                                                         ` James Bottomley
2010-03-08 11:17                                                         ` Catalin Marinas
2010-03-06 21:03                                                       ` Benjamin Herrenschmidt
2010-03-07  3:37                                                         ` James Bottomley
2010-03-08  8:46                                                           ` FUJITA Tomonori
2010-03-09  2:25                                                           ` Benjamin Herrenschmidt
2010-03-04 15:29                                                   ` Catalin Marinas
2010-03-04 15:41                                                     ` Paul Mundt
2010-03-04 16:30                                                       ` Russell King - ARM Linux
2010-03-04 17:34                                                         ` Catalin Marinas
2010-03-04 17:54                                                           ` Russell King - ARM Linux
2010-03-04 18:07                                                       ` Catalin Marinas
2010-03-04 21:37                                                         ` Benjamin Herrenschmidt
2010-03-04 22:11                                                           ` Catalin Marinas
2010-03-05  4:34                                                             ` Benjamin Herrenschmidt
2010-03-05  9:27                                                               ` Catalin Marinas
2010-03-05  1:17                                                           ` Paul Mundt
2010-03-05  4:44                                                             ` Benjamin Herrenschmidt
2010-03-10  3:52                                                               ` Paul Mundt
2010-03-11 21:44                                                                 ` Benjamin Herrenschmidt
2010-03-04 21:34                                                       ` Benjamin Herrenschmidt
2010-03-04 21:28                                                   ` Benjamin Herrenschmidt
2010-03-04 21:40                                                     ` Russell King - ARM Linux
2010-03-05  4:31                                                       ` Benjamin Herrenschmidt
2010-03-04 15:35                                                 ` Catalin Marinas
2010-03-07  8:23                                                   ` Pavel Machek
2010-03-08 10:57                                                     ` Catalin Marinas
2010-03-02 23:26                                         ` Benjamin Herrenschmidt
2010-03-01 10:42                                     ` Catalin Marinas
2010-03-03 20:24                                       ` Jamie Lokier
2010-02-26 21:40                                 ` Benjamin Herrenschmidt
2010-02-26 21:49                                   ` Russell King - ARM Linux
2010-02-28  0:24                                     ` Benjamin Herrenschmidt
2010-02-28 19:17                                       ` Pavel Machek
2010-03-01 11:10                                       ` Catalin Marinas
2010-03-02  4:11                                         ` Benjamin Herrenschmidt
2010-02-24  2:39                           ` Benjamin Herrenschmidt
2010-02-26 16:44                             ` Catalin Marinas
2010-02-26 21:49                               ` Benjamin Herrenschmidt
2010-02-26 22:03                                 ` Russell King - ARM Linux
2010-02-28  0:29                                   ` Benjamin Herrenschmidt
2010-02-28 23:20                                   ` Catalin Marinas
2010-02-28 23:17                                 ` Catalin Marinas
2010-02-17 15:27                 ` Catalin Marinas
2010-02-17 15:39                 ` Catalin Marinas
2010-02-17 15:40                 ` Catalin Marinas
2010-02-17 15:40                 ` Catalin Marinas
2010-02-17 16:19                   ` Catalin Marinas
2010-02-17 16:19                   ` Catalin Marinas
2010-02-16  8:44             ` Russell King - ARM Linux
2010-02-16  8:51               ` Gadiyar, Anand
2010-02-20  7:21                 ` Pete Zaitcev
