[PATCH] pxamci: close a serious race

linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] pxamci: close a serious race
       [not found]         ` <4A4DA584.7080402@gmail.com>
@ 2010-02-18  4:54           ` Vernon Sauder
  2010-02-19 12:27             ` Eric Miao
  0 siblings, 1 reply; 3+ messages in thread
From: Vernon Sauder @ 2010-02-18  4:54 UTC (permalink / raw)
  To: linux-arm-kernel

Eric Miao wrote, On 07/03/2009 02:30 AM:
> Matt Reimer wrote:
>> On Tue, Jun 16, 2009 at 7:03 AM, Eric Miao <eric.y.miao@gmail.com> wrote:
>>
>>> Aric D. Blumer wrote:
>>>> Eric Miao wrote:
>>>>
>>>>>> Matt Reimer wrote:
>>>>>>
>>>>>>>> Close a race condition that could result in the last 32 bytes not
>>>>>>>> being transferred:
>>>>>>>>
>>>>>>>> . . .
>>>>>> Mmm... sounds reasonable (though I didn't observe this on PXA320 yet).
>>>>>>
>>>>>>>> Fix this by not calling pxamci_data_done() until the DMA channel
>>> stops.
>>>>>> The patch - however - I'm not sure if DATA_TRAN_DONE happens only for
>>>>>> READ (sounds like a WRITE will also trigger DATA_TRAN_DONE?)
>>>>>>
>>>> You are correct; it does happen for writes too.  Looks like we need to
>>>> take that into account somehow, because we're probably introducing a
>>>> race with writes now:  The DMA finishes and turns off the SD clock
>>>> possibly before the transaction occurs across the SD bus.  It's not
>>>> clear to me from the documentation if the MMC controller will finish
>>>> the transaction before turning off the clock.  Either way, the code
>>>> should be cleaner to guarantee that it doesn't.
>>>>
>>> A quick solution would be waiting for DMA to stop in pxamci_data_done()
>>> if it's a read - though not clean enough as using STOPIRQEN.
>>>
>>>
>> How about the attached patch? I've been running a test reading and writing a
>> 79M file for over an hour without a failure. I'll let it run a while and
>> report back if there are any problems.
>>
>> Matt
>>
> 
> Hi Matt,
> 
> Sorry for the long long latency. This patch doesn't apply to the latest
> tree however - could you take a look and make this refreshed?
> 
> Thanks
> - eric
> 

I am sorry to bring this back up again but it looks like this patch died
without a resolution that made it into mainline. I am working on an
issue with a PXA 270 that looks identical to this. However, this patch
still does not apply cleanly and causes other problems.

I have a platform where I can cause this issue within minutes. For some
reason, running an web server serving pages off a read-only SD mount and
serving a large file across a Bluetooth network connection brings out
the worst of this bug.

The first problem with this patch is calling data_done from dma_irq
breaks the synchronization. If the DMA interrupt comes before the
END_CMD_RES interrupt (which is normally does for me), then data_done is
called before cmd_done. This either causes cmd_done to not disable the
DATA_DONE interrupt (when called from pxamci_irq but after request_done
sets cmd == NULL) or a WARN_ON if start_cmd is called from data_done to
start processing another command.

The second problem is that the slab corruption patch which I submitted
[1] causes short messages to fail. If the algorithm assumes that all
messages are DMA, then things fall down when using PIO. I would like to
reconcile these two patches and see if we can get these issues fixed and
upstream.

For an immediate fix, the simplest thing is to add a busy-wait inside
data_done to wait for the DMA to stop [2]. From my testing, this wait
takes ~3 usec at the most.

If anyone has any good ideas, let's have them. I'll try to carve out
some time to follow this through if we can agree on a solution.

[1] http://thread.gmane.org/gmane.linux.ports.arm.kernel/41288

[2] the patch:
*** src.cache/drivers/mmc/host/pxamci.c.orig    2010-02-17
18:54:54.000000000 -0500
--- src.cache/drivers/mmc/host/pxamci.c 2010-02-17 23:33:37.000000000 -0500
@@ -322,6 +307,10 @@ static int pxamci_data_done(struct pxamc
        }
    }
    else {
+       // make sure DMA is finished
+       while (!(DCSR(host->dma) & DCSR_STOPSTATE)) {
+           udelay(1);
+       }
        DCSR(host->dma) = 0;
        dma_unmap_sg(mmc_dev(host->mmc), data->sg, host->dma_len,
                 host->dma_dir);

-- 
Regards,
 Vernon Sauder

^ permalink raw reply	[flat|nested] 3+ messages in thread

* [PATCH] pxamci: close a serious race
  2010-02-18  4:54           ` [PATCH] pxamci: close a serious race Vernon Sauder
@ 2010-02-19 12:27             ` Eric Miao
  2010-02-20  0:04               ` Vernon Sauder
  0 siblings, 1 reply; 3+ messages in thread
From: Eric Miao @ 2010-02-19 12:27 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Feb 18, 2010 at 12:54 PM, Vernon Sauder <vernoninhand@gmail.com> wrote:
> Eric Miao wrote, On 07/03/2009 02:30 AM:
>> Matt Reimer wrote:
>>> On Tue, Jun 16, 2009 at 7:03 AM, Eric Miao <eric.y.miao@gmail.com> wrote:
>>>
>>>> Aric D. Blumer wrote:
>>>>> Eric Miao wrote:
>>>>>
>>>>>>> Matt Reimer wrote:
>>>>>>>
>>>>>>>>> Close a race condition that could result in the last 32 bytes not
>>>>>>>>> being transferred:
>>>>>>>>>
>>>>>>>>> . . .
>>>>>>> Mmm... sounds reasonable (though I didn't observe this on PXA320 yet).
>>>>>>>
>>>>>>>>> Fix this by not calling pxamci_data_done() until the DMA channel
>>>> stops.
>>>>>>> The patch - however - I'm not sure if DATA_TRAN_DONE happens only for
>>>>>>> READ (sounds like a WRITE will also trigger DATA_TRAN_DONE?)
>>>>>>>
>>>>> You are correct; it does happen for writes too. ?Looks like we need to
>>>>> take that into account somehow, because we're probably introducing a
>>>>> race with writes now: ?The DMA finishes and turns off the SD clock
>>>>> possibly before the transaction occurs across the SD bus. ?It's not
>>>>> clear to me from the documentation if the MMC controller will finish
>>>>> the transaction before turning off the clock. ?Either way, the code
>>>>> should be cleaner to guarantee that it doesn't.
>>>>>
>>>> A quick solution would be waiting for DMA to stop in pxamci_data_done()
>>>> if it's a read - though not clean enough as using STOPIRQEN.
>>>>
>>>>
>>> How about the attached patch? I've been running a test reading and writing a
>>> 79M file for over an hour without a failure. I'll let it run a while and
>>> report back if there are any problems.
>>>
>>> Matt
>>>
>>
>> Hi Matt,
>>
>> Sorry for the long long latency. This patch doesn't apply to the latest
>> tree however - could you take a look and make this refreshed?
>>
>> Thanks
>> - eric
>>
>
> I am sorry to bring this back up again but it looks like this patch died
> without a resolution that made it into mainline. I am working on an
> issue with a PXA 270 that looks identical to this. However, this patch
> still does not apply cleanly and causes other problems.
>
> I have a platform where I can cause this issue within minutes. For some
> reason, running an web server serving pages off a read-only SD mount and
> serving a large file across a Bluetooth network connection brings out
> the worst of this bug.
>
> The first problem with this patch is calling data_done from dma_irq
> breaks the synchronization. If the DMA interrupt comes before the
> END_CMD_RES interrupt (which is normally does for me), then data_done is
> called before cmd_done. This either causes cmd_done to not disable the
> DATA_DONE interrupt (when called from pxamci_irq but after request_done
> sets cmd == NULL) or a WARN_ON if start_cmd is called from data_done to
> start processing another command.
>
> The second problem is that the slab corruption patch which I submitted
> [1] causes short messages to fail. If the algorithm assumes that all
> messages are DMA, then things fall down when using PIO. I would like to
> reconcile these two patches and see if we can get these issues fixed and
> upstream.
>
> For an immediate fix, the simplest thing is to add a busy-wait inside
> data_done to wait for the DMA to stop [2]. From my testing, this wait
> takes ~3 usec at the most.
>

Went through all the threads in [1], I'm still a little bit concerned about
the root cause of this problem. My understanding atm (correct me pls)
is:

1. Errata 59, we need to fill the DMA at least 32-byte in 4-bit mode and
8-byte in 1-bit mode for read operation

2. That your observed issue which can be worked around by using PIO
instead of DMA for short messages

3. The DMA done interrupt comes before END_CMD_RES

For 3), I don't quite think it's a problem, data_done() is normally not
done in DMA done interrupt unless error occurs, normally the DMA
done interrupt handler just swaps the write buffer, data_done() is
performed in DATA_TRANS_DONE then.

For 2), I'm still thinking if there is an explanation.

For 1), we definitely need a workaround patch.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* [PATCH] pxamci: close a serious race
  2010-02-19 12:27             ` Eric Miao
@ 2010-02-20  0:04               ` Vernon Sauder
  0 siblings, 0 replies; 3+ messages in thread
From: Vernon Sauder @ 2010-02-20  0:04 UTC (permalink / raw)
  To: linux-arm-kernel

Eric Miao wrote, On 02/19/2010 07:27 AM:
> On Thu, Feb 18, 2010 at 12:54 PM, Vernon Sauder <vernoninhand@gmail.com> wrote:
>> Eric Miao wrote, On 07/03/2009 02:30 AM:
>>> Matt Reimer wrote:
>>>> On Tue, Jun 16, 2009 at 7:03 AM, Eric Miao <eric.y.miao@gmail.com> wrote:
>>>>
>>>>> Aric D. Blumer wrote:
>>>>>> Eric Miao wrote:
>>>>>>
>>>>>>>> Matt Reimer wrote:
>>>>>>>>
>>>>>>>>>> Close a race condition that could result in the last 32 bytes not
>>>>>>>>>> being transferred:
>>>>>>>>>>
>>>>>>>>>> . . .
>>>>>>>> Mmm... sounds reasonable (though I didn't observe this on PXA320 yet).
>>>>>>>>
>>>>>>>>>> Fix this by not calling pxamci_data_done() until the DMA channel
>>>>> stops.
>>>>>>>> The patch - however - I'm not sure if DATA_TRAN_DONE happens only for
>>>>>>>> READ (sounds like a WRITE will also trigger DATA_TRAN_DONE?)
>>>>>>>>
>>>>>> You are correct; it does happen for writes too.  Looks like we need to
>>>>>> take that into account somehow, because we're probably introducing a
>>>>>> race with writes now:  The DMA finishes and turns off the SD clock
>>>>>> possibly before the transaction occurs across the SD bus.  It's not
>>>>>> clear to me from the documentation if the MMC controller will finish
>>>>>> the transaction before turning off the clock.  Either way, the code
>>>>>> should be cleaner to guarantee that it doesn't.
>>>>>>
>>>>> A quick solution would be waiting for DMA to stop in pxamci_data_done()
>>>>> if it's a read - though not clean enough as using STOPIRQEN.
>>>>>
>>>>>
>>>> How about the attached patch? I've been running a test reading and writing a
>>>> 79M file for over an hour without a failure. I'll let it run a while and
>>>> report back if there are any problems.
>>>>
>>>> Matt
>>>>
>>> Hi Matt,
>>>
>>> Sorry for the long long latency. This patch doesn't apply to the latest
>>> tree however - could you take a look and make this refreshed?
>>>
>>> Thanks
>>> - eric
>>>
>> I am sorry to bring this back up again but it looks like this patch died
>> without a resolution that made it into mainline. I am working on an
>> issue with a PXA 270 that looks identical to this. However, this patch
>> still does not apply cleanly and causes other problems.
>>
>> I have a platform where I can cause this issue within minutes. For some
>> reason, running an web server serving pages off a read-only SD mount and
>> serving a large file across a Bluetooth network connection brings out
>> the worst of this bug.
>>
>> The first problem with this patch is calling data_done from dma_irq
>> breaks the synchronization. If the DMA interrupt comes before the
>> END_CMD_RES interrupt (which is normally does for me), then data_done is
>> called before cmd_done. This either causes cmd_done to not disable the
>> DATA_DONE interrupt (when called from pxamci_irq but after request_done
>> sets cmd == NULL) or a WARN_ON if start_cmd is called from data_done to
>> start processing another command.
>>
>> The second problem is that the slab corruption patch which I submitted
>> [1] causes short messages to fail. If the algorithm assumes that all
>> messages are DMA, then things fall down when using PIO. I would like to
>> reconcile these two patches and see if we can get these issues fixed and
>> upstream.
>>
>> For an immediate fix, the simplest thing is to add a busy-wait inside
>> data_done to wait for the DMA to stop [2]. From my testing, this wait
>> takes ~3 usec at the most.
>>
> 
> Went through all the threads in [1], I'm still a little bit concerned about
> the root cause of this problem. My understanding atm (correct me pls)
> is:
> 
> 1. Errata 59, we need to fill the DMA at least 32-byte in 4-bit mode and
> 8-byte in 1-bit mode for read operation
> 
> 2. That your observed issue which can be worked around by using PIO
> instead of DMA for short messages
> 
> 3. The DMA done interrupt comes before END_CMD_RES

No, I am saying that the DMA done interrupt comes _after_ the
DATA_TRANS_DONE interrupt. (Or it would but I don't have the DMA
interrupt enabled.) In data_done() which is triggered by
DATA_TRANS_DONE, the DMA is not in a stopped state yet and we don't want
to stop it prematurely.

> For 3), I don't quite think it's a problem, data_done() is normally not
> done in DMA done interrupt unless error occurs, normally the DMA
> done interrupt handler just swaps the write buffer, data_done() is
> performed in DATA_TRANS_DONE then.
> 
> For 2), I'm still thinking if there is an explanation.

ETA?

> 
> For 1), we definitely need a workaround patch.

The workaround is [1], no?

There is a small possibility that this DMA race condition could be a
cause of failures I saw when doing > 32-bytes transfers. My tests still
failed until I increased the DMA threshold to 64 bytes. Errata #59 could
not account for those failures.


-- 
Regards,
 Vernon Sauder
 (: !Have a great day! :)

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2010-02-20  0:04 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <f383264b0906151408u7b7087b8u815b337836a4b2c6@mail.gmail.com>
     [not found] ` <4A373E72.4030104@gmail.com>
     [not found]   ` <4A37A343.9000106@sdgsystems.com>
     [not found]     ` <4A37A622.5050507@gmail.com>
     [not found]       ` <6f7de2b00906161221l684f3cb2ra3831f2a4eec50e3@mail.gmail.com>
     [not found]         ` <4A4DA584.7080402@gmail.com>
2010-02-18  4:54           ` [PATCH] pxamci: close a serious race Vernon Sauder
2010-02-19 12:27             ` Eric Miao
2010-02-20  0:04               ` Vernon Sauder

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).