[PATCH] mmc: dw_mmc: Make sure we don't get stuck when we get an error

linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mmc: dw_mmc: Make sure we don't get stuck when we get an error
@ 2014-03-27  6:18 Yuvaraj Kumar C D
  2014-05-08  9:42 ` Yuvaraj Kumar
  0 siblings, 1 reply; 13+ messages in thread
From: Yuvaraj Kumar C D @ 2014-03-27  6:18 UTC (permalink / raw)
  To: linux-arm-kernel

From: Doug Anderson <dianders@chromium.org>

If we happened to get a data error at just the wrong time the dw_mmc
driver could get into a state where it would never complete its
request.  That would leave the caller just hanging there.

We fix this two ways and both of the two fixes on their own appear to
fix the problems we've seen:

1. Fix a race in the tasklet where the interrupt setting the data
   error happens _just after_ we check for it, then we get a
   EVENT_XFER_COMPLETE.  We fix this by repeating a bit of code.
2. Fix it so that if we detect that we've got an error in the "data
   busy" state and we're not going to do anything else we end the
   request and unblock anyone waiting.

Signed-off-by: Doug Anderson <dianders@chromium.org>
Signed-off-by: Yuvaraj Kumar C D <yuvaraj.cd@gmail.com>
---
 drivers/mmc/host/dw_mmc.c |   47 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
index 1d77431..4c589f1 100644
--- a/drivers/mmc/host/dw_mmc.c
+++ b/drivers/mmc/host/dw_mmc.c
@@ -1300,6 +1300,14 @@ static void dw_mci_tasklet_func(unsigned long priv)
 			/* fall through */
 
 		case STATE_SENDING_DATA:
+			/*
+			 * We could get a data error and never a transfer
+			 * complete so we'd better check for it here.
+			 *
+			 * Note that we don't really care if we also got a
+			 * transfer complete; stopping the DMA and sending an
+			 * abort won't hurt.
+			 */
 			if (test_and_clear_bit(EVENT_DATA_ERROR,
 					       &host->pending_events)) {
 				dw_mci_stop_dma(host);
@@ -1313,7 +1321,29 @@ static void dw_mci_tasklet_func(unsigned long priv)
 				break;
 
 			set_bit(EVENT_XFER_COMPLETE, &host->completed_events);
+
+			/*
+			 * Handle an EVENT_DATA_ERROR that might have shown up
+			 * before the transfer completed.  This might not have
+			 * been caught by the check above because the interrupt
+			 * could have gone off between the previous check and
+			 * the check for transfer complete.
+			 *
+			 * Technically this ought not be needed assuming we
+			 * get a DATA_COMPLETE eventually (we'll notice the
+			 * error and end the request), but it shouldn't hurt.
+			 *
+			 * This has the advantage of sending the stop command.
+			 */
+			if (test_and_clear_bit(EVENT_DATA_ERROR,
+					       &host->pending_events)) {
+				dw_mci_stop_dma(host);
+				send_stop_abort(host, data);
+				state = STATE_DATA_ERROR;
+				break;
+			}
 			prev_state = state = STATE_DATA_BUSY;
+
 			/* fall through */
 
 		case STATE_DATA_BUSY:
@@ -1336,6 +1366,23 @@ static void dw_mci_tasklet_func(unsigned long priv)
 				/* stop command for open-ended transfer*/
 				if (data->stop)
 					send_stop_abort(host, data);
+			} else {
+				/*
+				 * If we don't have a command complete now we'll
+				 * never get one since we just reset everything;
+				 * better end the request.
+				 *
+				 * If we do have a command complete we'll fall
+				 * through to the SENDING_STOP command and
+				 * everything will be peachy keen.
+				 *
+				 * TODO: I guess we shouldn't send a stop?
+				 */
+				if (!test_bit(EVENT_CMD_COMPLETE,
+					      &host->pending_events)) {
+					dw_mci_request_end(host, mrq);
+					goto unlock;
+				}
 			}
 
 			/*
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH] mmc: dw_mmc: Make sure we don't get stuck when we get an error
  2014-03-27  6:18 [PATCH] mmc: dw_mmc: Make sure we don't get stuck when we get an error Yuvaraj Kumar C D
@ 2014-05-08  9:42 ` Yuvaraj Kumar
  2014-05-09  3:21   ` Sonny Rao
  0 siblings, 1 reply; 13+ messages in thread
From: Yuvaraj Kumar @ 2014-05-08  9:42 UTC (permalink / raw)
  To: linux-arm-kernel

Any comments on this patch?

On Thu, Mar 27, 2014 at 11:48 AM, Yuvaraj Kumar C D
<yuvaraj.cd@gmail.com> wrote:
> From: Doug Anderson <dianders@chromium.org>
>
> If we happened to get a data error at just the wrong time the dw_mmc
> driver could get into a state where it would never complete its
> request.  That would leave the caller just hanging there.
>
> We fix this two ways and both of the two fixes on their own appear to
> fix the problems we've seen:
>
> 1. Fix a race in the tasklet where the interrupt setting the data
>    error happens _just after_ we check for it, then we get a
>    EVENT_XFER_COMPLETE.  We fix this by repeating a bit of code.
> 2. Fix it so that if we detect that we've got an error in the "data
>    busy" state and we're not going to do anything else we end the
>    request and unblock anyone waiting.
>
> Signed-off-by: Doug Anderson <dianders@chromium.org>
> Signed-off-by: Yuvaraj Kumar C D <yuvaraj.cd@gmail.com>
> ---
>  drivers/mmc/host/dw_mmc.c |   47 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 47 insertions(+)
>
> diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
> index 1d77431..4c589f1 100644
> --- a/drivers/mmc/host/dw_mmc.c
> +++ b/drivers/mmc/host/dw_mmc.c
> @@ -1300,6 +1300,14 @@ static void dw_mci_tasklet_func(unsigned long priv)
>                         /* fall through */
>
>                 case STATE_SENDING_DATA:
> +                       /*
> +                        * We could get a data error and never a transfer
> +                        * complete so we'd better check for it here.
> +                        *
> +                        * Note that we don't really care if we also got a
> +                        * transfer complete; stopping the DMA and sending an
> +                        * abort won't hurt.
> +                        */
>                         if (test_and_clear_bit(EVENT_DATA_ERROR,
>                                                &host->pending_events)) {
>                                 dw_mci_stop_dma(host);
> @@ -1313,7 +1321,29 @@ static void dw_mci_tasklet_func(unsigned long priv)
>                                 break;
>
>                         set_bit(EVENT_XFER_COMPLETE, &host->completed_events);
> +
> +                       /*
> +                        * Handle an EVENT_DATA_ERROR that might have shown up
> +                        * before the transfer completed.  This might not have
> +                        * been caught by the check above because the interrupt
> +                        * could have gone off between the previous check and
> +                        * the check for transfer complete.
> +                        *
> +                        * Technically this ought not be needed assuming we
> +                        * get a DATA_COMPLETE eventually (we'll notice the
> +                        * error and end the request), but it shouldn't hurt.
> +                        *
> +                        * This has the advantage of sending the stop command.
> +                        */
> +                       if (test_and_clear_bit(EVENT_DATA_ERROR,
> +                                              &host->pending_events)) {
> +                               dw_mci_stop_dma(host);
> +                               send_stop_abort(host, data);
> +                               state = STATE_DATA_ERROR;
> +                               break;
> +                       }
>                         prev_state = state = STATE_DATA_BUSY;
> +
>                         /* fall through */
>
>                 case STATE_DATA_BUSY:
> @@ -1336,6 +1366,23 @@ static void dw_mci_tasklet_func(unsigned long priv)
>                                 /* stop command for open-ended transfer*/
>                                 if (data->stop)
>                                         send_stop_abort(host, data);
> +                       } else {
> +                               /*
> +                                * If we don't have a command complete now we'll
> +                                * never get one since we just reset everything;
> +                                * better end the request.
> +                                *
> +                                * If we do have a command complete we'll fall
> +                                * through to the SENDING_STOP command and
> +                                * everything will be peachy keen.
> +                                *
> +                                * TODO: I guess we shouldn't send a stop?
> +                                */
> +                               if (!test_bit(EVENT_CMD_COMPLETE,
> +                                             &host->pending_events)) {
> +                                       dw_mci_request_end(host, mrq);
> +                                       goto unlock;
> +                               }
>                         }
>
>                         /*
> --
> 1.7.10.4
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] mmc: dw_mmc: Make sure we don't get stuck when we get an error
  2014-05-08  9:42 ` Yuvaraj Kumar
@ 2014-05-09  3:21   ` Sonny Rao
  2014-05-10 14:11     ` Seungwon Jeon
  0 siblings, 1 reply; 13+ messages in thread
From: Sonny Rao @ 2014-05-09  3:21 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, May 8, 2014 at 2:42 AM, Yuvaraj Kumar <yuvaraj.cd@gmail.com> wrote:
> Any comments on this patch?
>

I'll just add that without this fix, running the tuning loop for UHS
modes is not reliable on dw_mmc because errors will happen and you
will eventually hit this race and hang.  This can happen any time
there is tuning like during boot or during resume from suspend.

> On Thu, Mar 27, 2014 at 11:48 AM, Yuvaraj Kumar C D
> <yuvaraj.cd@gmail.com> wrote:
>> From: Doug Anderson <dianders@chromium.org>
>>
>> If we happened to get a data error at just the wrong time the dw_mmc
>> driver could get into a state where it would never complete its
>> request.  That would leave the caller just hanging there.
>>
>> We fix this two ways and both of the two fixes on their own appear to
>> fix the problems we've seen:
>>
>> 1. Fix a race in the tasklet where the interrupt setting the data
>>    error happens _just after_ we check for it, then we get a
>>    EVENT_XFER_COMPLETE.  We fix this by repeating a bit of code.
>> 2. Fix it so that if we detect that we've got an error in the "data
>>    busy" state and we're not going to do anything else we end the
>>    request and unblock anyone waiting.
>>
>> Signed-off-by: Doug Anderson <dianders@chromium.org>
>> Signed-off-by: Yuvaraj Kumar C D <yuvaraj.cd@gmail.com>
>> ---
>>  drivers/mmc/host/dw_mmc.c |   47 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 47 insertions(+)
>>
>> diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
>> index 1d77431..4c589f1 100644
>> --- a/drivers/mmc/host/dw_mmc.c
>> +++ b/drivers/mmc/host/dw_mmc.c
>> @@ -1300,6 +1300,14 @@ static void dw_mci_tasklet_func(unsigned long priv)
>>                         /* fall through */
>>
>>                 case STATE_SENDING_DATA:
>> +                       /*
>> +                        * We could get a data error and never a transfer
>> +                        * complete so we'd better check for it here.
>> +                        *
>> +                        * Note that we don't really care if we also got a
>> +                        * transfer complete; stopping the DMA and sending an
>> +                        * abort won't hurt.
>> +                        */
>>                         if (test_and_clear_bit(EVENT_DATA_ERROR,
>>                                                &host->pending_events)) {
>>                                 dw_mci_stop_dma(host);
>> @@ -1313,7 +1321,29 @@ static void dw_mci_tasklet_func(unsigned long priv)
>>                                 break;
>>
>>                         set_bit(EVENT_XFER_COMPLETE, &host->completed_events);
>> +
>> +                       /*
>> +                        * Handle an EVENT_DATA_ERROR that might have shown up
>> +                        * before the transfer completed.  This might not have
>> +                        * been caught by the check above because the interrupt
>> +                        * could have gone off between the previous check and
>> +                        * the check for transfer complete.
>> +                        *
>> +                        * Technically this ought not be needed assuming we
>> +                        * get a DATA_COMPLETE eventually (we'll notice the
>> +                        * error and end the request), but it shouldn't hurt.
>> +                        *
>> +                        * This has the advantage of sending the stop command.
>> +                        */
>> +                       if (test_and_clear_bit(EVENT_DATA_ERROR,
>> +                                              &host->pending_events)) {
>> +                               dw_mci_stop_dma(host);
>> +                               send_stop_abort(host, data);
>> +                               state = STATE_DATA_ERROR;
>> +                               break;
>> +                       }
>>                         prev_state = state = STATE_DATA_BUSY;
>> +
>>                         /* fall through */
>>
>>                 case STATE_DATA_BUSY:
>> @@ -1336,6 +1366,23 @@ static void dw_mci_tasklet_func(unsigned long priv)
>>                                 /* stop command for open-ended transfer*/
>>                                 if (data->stop)
>>                                         send_stop_abort(host, data);
>> +                       } else {
>> +                               /*
>> +                                * If we don't have a command complete now we'll
>> +                                * never get one since we just reset everything;
>> +                                * better end the request.
>> +                                *
>> +                                * If we do have a command complete we'll fall
>> +                                * through to the SENDING_STOP command and
>> +                                * everything will be peachy keen.
>> +                                *
>> +                                * TODO: I guess we shouldn't send a stop?
>> +                                */
>> +                               if (!test_bit(EVENT_CMD_COMPLETE,
>> +                                             &host->pending_events)) {
>> +                                       dw_mci_request_end(host, mrq);
>> +                                       goto unlock;
>> +                               }
>>                         }
>>
>>                         /*
>> --
>> 1.7.10.4
>>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] mmc: dw_mmc: Make sure we don't get stuck when we get an error
  2014-05-09  3:21   ` Sonny Rao
@ 2014-05-10 14:11     ` Seungwon Jeon
  2014-05-12 21:50       ` Doug Anderson
  0 siblings, 1 reply; 13+ messages in thread
From: Seungwon Jeon @ 2014-05-10 14:11 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, May 09, 2014, Sonny Rao wrote:
> On Thu, May 8, 2014 at 2:42 AM, Yuvaraj Kumar <yuvaraj.cd@gmail.com> wrote:
> > Any comments on this patch?
> >
> 
> I'll just add that without this fix, running the tuning loop for UHS
> modes is not reliable on dw_mmc because errors will happen and you
> will eventually hit this race and hang.  This can happen any time
> there is tuning like during boot or during resume from suspend.
> 
> > On Thu, Mar 27, 2014 at 11:48 AM, Yuvaraj Kumar C D
> > <yuvaraj.cd@gmail.com> wrote:
> >> From: Doug Anderson <dianders@chromium.org>
> >>
> >> If we happened to get a data error at just the wrong time the dw_mmc
> >> driver could get into a state where it would never complete its
> >> request.  That would leave the caller just hanging there.
> >>
> >> We fix this two ways and both of the two fixes on their own appear to
> >> fix the problems we've seen:
> >>
> >> 1. Fix a race in the tasklet where the interrupt setting the data
> >>    error happens _just after_ we check for it, then we get a
> >>    EVENT_XFER_COMPLETE.  We fix this by repeating a bit of code.
I think repeating is not good approach to fix race.
In your case, XFER_COMPLETE preceded data error and DTO didn't come?
It seems strange case.
I want to know actual error value if you can reproduce.

> >> 2. Fix it so that if we detect that we've got an error in the "data
> >>    busy" state and we're not going to do anything else we end the
> >>    request and unblock anyone waiting.
> >>
> >> Signed-off-by: Doug Anderson <dianders@chromium.org>
> >> Signed-off-by: Yuvaraj Kumar C D <yuvaraj.cd@gmail.com>
> >> ---
> >>  drivers/mmc/host/dw_mmc.c |   47 +++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 47 insertions(+)
> >>
> >> diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
> >> index 1d77431..4c589f1 100644
> >> --- a/drivers/mmc/host/dw_mmc.c
> >> +++ b/drivers/mmc/host/dw_mmc.c
> >> @@ -1300,6 +1300,14 @@ static void dw_mci_tasklet_func(unsigned long priv)
> >>                         /* fall through */
> >>
> >>                 case STATE_SENDING_DATA:
> >> +                       /*
> >> +                        * We could get a data error and never a transfer
> >> +                        * complete so we'd better check for it here.
> >> +                        *
> >> +                        * Note that we don't really care if we also got a
> >> +                        * transfer complete; stopping the DMA and sending an
> >> +                        * abort won't hurt.
> >> +                        */
> >>                         if (test_and_clear_bit(EVENT_DATA_ERROR,
> >>                                                &host->pending_events)) {
> >>                                 dw_mci_stop_dma(host);
> >> @@ -1313,7 +1321,29 @@ static void dw_mci_tasklet_func(unsigned long priv)
> >>                                 break;
> >>
> >>                         set_bit(EVENT_XFER_COMPLETE, &host->completed_events);
> >> +
> >> +                       /*
> >> +                        * Handle an EVENT_DATA_ERROR that might have shown up
> >> +                        * before the transfer completed.  This might not have
> >> +                        * been caught by the check above because the interrupt
> >> +                        * could have gone off between the previous check and
> >> +                        * the check for transfer complete.
> >> +                        *
> >> +                        * Technically this ought not be needed assuming we
> >> +                        * get a DATA_COMPLETE eventually (we'll notice the
> >> +                        * error and end the request), but it shouldn't hurt.
> >> +                        *
> >> +                        * This has the advantage of sending the stop command.
> >> +                        */
> >> +                       if (test_and_clear_bit(EVENT_DATA_ERROR,
> >> +                                              &host->pending_events)) {
> >> +                               dw_mci_stop_dma(host);
> >> +                               send_stop_abort(host, data);
> >> +                               state = STATE_DATA_ERROR;
> >> +                               break;
> >> +                       }
> >>                         prev_state = state = STATE_DATA_BUSY;
> >> +
> >>                         /* fall through */
> >>
> >>                 case STATE_DATA_BUSY:
> >> @@ -1336,6 +1366,23 @@ static void dw_mci_tasklet_func(unsigned long priv)
> >>                                 /* stop command for open-ended transfer*/
> >>                                 if (data->stop)
> >>                                         send_stop_abort(host, data);
> >> +                       } else {
> >> +                               /*
> >> +                                * If we don't have a command complete now we'll
> >> +                                * never get one since we just reset everything;
> >> +                                * better end the request.
> >> +                                *
> >> +                                * If we do have a command complete we'll fall
> >> +                                * through to the SENDING_STOP command and
> >> +                                * everything will be peachy keen.
> >> +                                *
> >> +                                * TODO: I guess we shouldn't send a stop?
> >> +                                */
> >> +                               if (!test_bit(EVENT_CMD_COMPLETE,
> >> +                                             &host->pending_events)) {
> >> +                                       dw_mci_request_end(host, mrq);
> >> +                                       goto unlock;
> >> +                               }
Can you explain what happens above?
What is it for?

Thanks,
Seungwon Jeon

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] mmc: dw_mmc: Make sure we don't get stuck when we get an error
  2014-05-10 14:11     ` Seungwon Jeon
@ 2014-05-12 21:50       ` Doug Anderson
  2014-05-13  4:52         ` Seungwon Jeon
  0 siblings, 1 reply; 13+ messages in thread
From: Doug Anderson @ 2014-05-12 21:50 UTC (permalink / raw)
  To: linux-arm-kernel

Seungwon,

On Sat, May 10, 2014 at 7:11 AM, Seungwon Jeon <tgih.jun@samsung.com> wrote:
> On Fri, May 09, 2014, Sonny Rao wrote:
>> On Thu, May 8, 2014 at 2:42 AM, Yuvaraj Kumar <yuvaraj.cd@gmail.com> wrote:
>> > Any comments on this patch?
>> >
>>
>> I'll just add that without this fix, running the tuning loop for UHS
>> modes is not reliable on dw_mmc because errors will happen and you
>> will eventually hit this race and hang.  This can happen any time
>> there is tuning like during boot or during resume from suspend.
>>
>> > On Thu, Mar 27, 2014 at 11:48 AM, Yuvaraj Kumar C D
>> > <yuvaraj.cd@gmail.com> wrote:
>> >> From: Doug Anderson <dianders@chromium.org>
>> >>
>> >> If we happened to get a data error at just the wrong time the dw_mmc
>> >> driver could get into a state where it would never complete its
>> >> request.  That would leave the caller just hanging there.
>> >>
>> >> We fix this two ways and both of the two fixes on their own appear to
>> >> fix the problems we've seen:
>> >>
>> >> 1. Fix a race in the tasklet where the interrupt setting the data
>> >>    error happens _just after_ we check for it, then we get a
>> >>    EVENT_XFER_COMPLETE.  We fix this by repeating a bit of code.
> I think repeating is not good approach to fix race.
> In your case, XFER_COMPLETE preceded data error and DTO didn't come?
> It seems strange case.
> I want to know actual error value if you can reproduce.

XFER_COMPLETE didn't necessarily precede data error.  Imagine this scenario:

1. Check for data error: nope
2. Interrupt happens and we get a data error and immediately xfer complete
3. Check for xfer complete: yup

That's the state that we are handling.

The system that dw_mmc uses where the interrupt handler has no locking
makes it incredibly difficult to get things right.  Can you propose an
alternate fix that would avoid the race?


>> >> 2. Fix it so that if we detect that we've got an error in the "data
>> >>    busy" state and we're not going to do anything else we end the
>> >>    request and unblock anyone waiting.
>> >>
>> >> Signed-off-by: Doug Anderson <dianders@chromium.org>
>> >> Signed-off-by: Yuvaraj Kumar C D <yuvaraj.cd@gmail.com>
>> >> ---
>> >>  drivers/mmc/host/dw_mmc.c |   47 +++++++++++++++++++++++++++++++++++++++++++++
>> >>  1 file changed, 47 insertions(+)
>> >>
>> >> diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
>> >> index 1d77431..4c589f1 100644
>> >> --- a/drivers/mmc/host/dw_mmc.c
>> >> +++ b/drivers/mmc/host/dw_mmc.c
>> >> @@ -1300,6 +1300,14 @@ static void dw_mci_tasklet_func(unsigned long priv)
>> >>                         /* fall through */
>> >>
>> >>                 case STATE_SENDING_DATA:
>> >> +                       /*
>> >> +                        * We could get a data error and never a transfer
>> >> +                        * complete so we'd better check for it here.
>> >> +                        *
>> >> +                        * Note that we don't really care if we also got a
>> >> +                        * transfer complete; stopping the DMA and sending an
>> >> +                        * abort won't hurt.
>> >> +                        */
>> >>                         if (test_and_clear_bit(EVENT_DATA_ERROR,
>> >>                                                &host->pending_events)) {
>> >>                                 dw_mci_stop_dma(host);
>> >> @@ -1313,7 +1321,29 @@ static void dw_mci_tasklet_func(unsigned long priv)
>> >>                                 break;
>> >>
>> >>                         set_bit(EVENT_XFER_COMPLETE, &host->completed_events);
>> >> +
>> >> +                       /*
>> >> +                        * Handle an EVENT_DATA_ERROR that might have shown up
>> >> +                        * before the transfer completed.  This might not have
>> >> +                        * been caught by the check above because the interrupt
>> >> +                        * could have gone off between the previous check and
>> >> +                        * the check for transfer complete.
>> >> +                        *
>> >> +                        * Technically this ought not be needed assuming we
>> >> +                        * get a DATA_COMPLETE eventually (we'll notice the
>> >> +                        * error and end the request), but it shouldn't hurt.
>> >> +                        *
>> >> +                        * This has the advantage of sending the stop command.
>> >> +                        */
>> >> +                       if (test_and_clear_bit(EVENT_DATA_ERROR,
>> >> +                                              &host->pending_events)) {
>> >> +                               dw_mci_stop_dma(host);
>> >> +                               send_stop_abort(host, data);
>> >> +                               state = STATE_DATA_ERROR;
>> >> +                               break;
>> >> +                       }
>> >>                         prev_state = state = STATE_DATA_BUSY;
>> >> +
>> >>                         /* fall through */
>> >>
>> >>                 case STATE_DATA_BUSY:
>> >> @@ -1336,6 +1366,23 @@ static void dw_mci_tasklet_func(unsigned long priv)
>> >>                                 /* stop command for open-ended transfer*/
>> >>                                 if (data->stop)
>> >>                                         send_stop_abort(host, data);
>> >> +                       } else {
>> >> +                               /*
>> >> +                                * If we don't have a command complete now we'll
>> >> +                                * never get one since we just reset everything;
>> >> +                                * better end the request.
>> >> +                                *
>> >> +                                * If we do have a command complete we'll fall
>> >> +                                * through to the SENDING_STOP command and
>> >> +                                * everything will be peachy keen.
>> >> +                                *
>> >> +                                * TODO: I guess we shouldn't send a stop?
>> >> +                                */
>> >> +                               if (!test_bit(EVENT_CMD_COMPLETE,
>> >> +                                             &host->pending_events)) {
>> >> +                                       dw_mci_request_end(host, mrq);
>> >> +                                       goto unlock;
>> >> +                               }
> Can you explain what happens above?
> What is it for?

This was an alternate fix for the above, but appears to actually hit
in practice too.

Said another way: if we don't add the extra checking for
EVENT_DATA_ERROR (above) we'll end up here.  ...and if we ever get
into this "else" and don't do _something_ then we'll wedge forever.

-Doug

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] mmc: dw_mmc: Make sure we don't get stuck when we get an error
  2014-05-12 21:50       ` Doug Anderson
@ 2014-05-13  4:52         ` Seungwon Jeon
  2014-05-13 15:56           ` Doug Anderson
  2014-05-20  1:51           ` Seungwon Jeon
  0 siblings, 2 replies; 13+ messages in thread
From: Seungwon Jeon @ 2014-05-13  4:52 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Doug,

On Tue, May 13, 2014, Doug Anderson wrote:
> Seungwon,
> 
> On Sat, May 10, 2014 at 7:11 AM, Seungwon Jeon <tgih.jun@samsung.com> wrote:
> > On Fri, May 09, 2014, Sonny Rao wrote:
> >> On Thu, May 8, 2014 at 2:42 AM, Yuvaraj Kumar <yuvaraj.cd@gmail.com> wrote:
> >> > Any comments on this patch?
> >> >
> >>
> >> I'll just add that without this fix, running the tuning loop for UHS
> >> modes is not reliable on dw_mmc because errors will happen and you
> >> will eventually hit this race and hang.  This can happen any time
> >> there is tuning like during boot or during resume from suspend.
> >>
> >> > On Thu, Mar 27, 2014 at 11:48 AM, Yuvaraj Kumar C D
> >> > <yuvaraj.cd@gmail.com> wrote:
> >> >> From: Doug Anderson <dianders@chromium.org>
> >> >>
> >> >> If we happened to get a data error at just the wrong time the dw_mmc
> >> >> driver could get into a state where it would never complete its
> >> >> request.  That would leave the caller just hanging there.
> >> >>
> >> >> We fix this two ways and both of the two fixes on their own appear to
> >> >> fix the problems we've seen:
> >> >>
> >> >> 1. Fix a race in the tasklet where the interrupt setting the data
> >> >>    error happens _just after_ we check for it, then we get a
> >> >>    EVENT_XFER_COMPLETE.  We fix this by repeating a bit of code.
> > I think repeating is not good approach to fix race.
> > In your case, XFER_COMPLETE preceded data error and DTO didn't come?
> > It seems strange case.
> > I want to know actual error value if you can reproduce.
> 
> XFER_COMPLETE didn't necessarily precede data error.  Imagine this scenario:
> 
> 1. Check for data error: nope
> 2. Interrupt happens and we get a data error and immediately xfer complete
> 3. Check for xfer complete: yup
> 
> That's the state that we are handling.
> 
> The system that dw_mmc uses where the interrupt handler has no locking
> makes it incredibly difficult to get things right.  Can you propose an
> alternate fix that would avoid the race?
Thank you for detailed scenario.
You're right.
Have you consider using spin_lock() in interrupt handler?
Then, we'll need to change spin_lock() to spin_lock_irqsave() in tasklet func.
And other locks in driver may need to be adjusted properly.

To return above scenario:
1. Check for data error: nope
2. Check for xfer complete: nope -> escape tasklet.
3. Interrupt happens and we get a data error and immediately xfer complete
4. Check for data error (Again in tasklet) : yup

How about this change?

Thanks,
Seungwon Jeon
> 
> 
> >> >> 2. Fix it so that if we detect that we've got an error in the "data
> >> >>    busy" state and we're not going to do anything else we end the
> >> >>    request and unblock anyone waiting.
> >> >>
> >> >> Signed-off-by: Doug Anderson <dianders@chromium.org>
> >> >> Signed-off-by: Yuvaraj Kumar C D <yuvaraj.cd@gmail.com>
> >> >> ---
> >> >>  drivers/mmc/host/dw_mmc.c |   47 +++++++++++++++++++++++++++++++++++++++++++++
> >> >>  1 file changed, 47 insertions(+)
> >> >>
> >> >> diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
> >> >> index 1d77431..4c589f1 100644
> >> >> --- a/drivers/mmc/host/dw_mmc.c
> >> >> +++ b/drivers/mmc/host/dw_mmc.c
> >> >> @@ -1300,6 +1300,14 @@ static void dw_mci_tasklet_func(unsigned long priv)
> >> >>                         /* fall through */
> >> >>
> >> >>                 case STATE_SENDING_DATA:
> >> >> +                       /*
> >> >> +                        * We could get a data error and never a transfer
> >> >> +                        * complete so we'd better check for it here.
> >> >> +                        *
> >> >> +                        * Note that we don't really care if we also got a
> >> >> +                        * transfer complete; stopping the DMA and sending an
> >> >> +                        * abort won't hurt.
> >> >> +                        */
> >> >>                         if (test_and_clear_bit(EVENT_DATA_ERROR,
> >> >>                                                &host->pending_events)) {
> >> >>                                 dw_mci_stop_dma(host);
> >> >> @@ -1313,7 +1321,29 @@ static void dw_mci_tasklet_func(unsigned long priv)
> >> >>                                 break;
> >> >>
> >> >>                         set_bit(EVENT_XFER_COMPLETE, &host->completed_events);
> >> >> +
> >> >> +                       /*
> >> >> +                        * Handle an EVENT_DATA_ERROR that might have shown up
> >> >> +                        * before the transfer completed.  This might not have
> >> >> +                        * been caught by the check above because the interrupt
> >> >> +                        * could have gone off between the previous check and
> >> >> +                        * the check for transfer complete.
> >> >> +                        *
> >> >> +                        * Technically this ought not be needed assuming we
> >> >> +                        * get a DATA_COMPLETE eventually (we'll notice the
> >> >> +                        * error and end the request), but it shouldn't hurt.
> >> >> +                        *
> >> >> +                        * This has the advantage of sending the stop command.
> >> >> +                        */
> >> >> +                       if (test_and_clear_bit(EVENT_DATA_ERROR,
> >> >> +                                              &host->pending_events)) {
> >> >> +                               dw_mci_stop_dma(host);
> >> >> +                               send_stop_abort(host, data);
> >> >> +                               state = STATE_DATA_ERROR;
> >> >> +                               break;
> >> >> +                       }
> >> >>                         prev_state = state = STATE_DATA_BUSY;
> >> >> +
> >> >>                         /* fall through */
> >> >>
> >> >>                 case STATE_DATA_BUSY:
> >> >> @@ -1336,6 +1366,23 @@ static void dw_mci_tasklet_func(unsigned long priv)
> >> >>                                 /* stop command for open-ended transfer*/
> >> >>                                 if (data->stop)
> >> >>                                         send_stop_abort(host, data);
> >> >> +                       } else {
> >> >> +                               /*
> >> >> +                                * If we don't have a command complete now we'll
> >> >> +                                * never get one since we just reset everything;
> >> >> +                                * better end the request.
> >> >> +                                *
> >> >> +                                * If we do have a command complete we'll fall
> >> >> +                                * through to the SENDING_STOP command and
> >> >> +                                * everything will be peachy keen.
> >> >> +                                *
> >> >> +                                * TODO: I guess we shouldn't send a stop?
> >> >> +                                */
> >> >> +                               if (!test_bit(EVENT_CMD_COMPLETE,
> >> >> +                                             &host->pending_events)) {
> >> >> +                                       dw_mci_request_end(host, mrq);
> >> >> +                                       goto unlock;
> >> >> +                               }
> > Can you explain what happens above?
> > What is it for?
> 
> This was an alternate fix for the above, but appears to actually hit
> in practice too.
> 
> Said another way: if we don't add the extra checking for
> EVENT_DATA_ERROR (above) we'll end up here.  ...and if we ever get
> into this "else" and don't do _something_ then we'll wedge forever.
> 
> -Doug
> --
> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] mmc: dw_mmc: Make sure we don't get stuck when we get an error
  2014-05-13  4:52         ` Seungwon Jeon
@ 2014-05-13 15:56           ` Doug Anderson
  2014-05-16  1:46             ` Seungwon Jeon
  2014-05-20  1:51           ` Seungwon Jeon
  1 sibling, 1 reply; 13+ messages in thread
From: Doug Anderson @ 2014-05-13 15:56 UTC (permalink / raw)
  To: linux-arm-kernel

Seungwon,

On Mon, May 12, 2014 at 9:52 PM, Seungwon Jeon <tgih.jun@samsung.com> wrote:
> Hi Doug,
>
> On Tue, May 13, 2014, Doug Anderson wrote:
>> Seungwon,
>>
>> On Sat, May 10, 2014 at 7:11 AM, Seungwon Jeon <tgih.jun@samsung.com> wrote:
>> > On Fri, May 09, 2014, Sonny Rao wrote:
>> >> On Thu, May 8, 2014 at 2:42 AM, Yuvaraj Kumar <yuvaraj.cd@gmail.com> wrote:
>> >> > Any comments on this patch?
>> >> >
>> >>
>> >> I'll just add that without this fix, running the tuning loop for UHS
>> >> modes is not reliable on dw_mmc because errors will happen and you
>> >> will eventually hit this race and hang.  This can happen any time
>> >> there is tuning like during boot or during resume from suspend.
>> >>
>> >> > On Thu, Mar 27, 2014 at 11:48 AM, Yuvaraj Kumar C D
>> >> > <yuvaraj.cd@gmail.com> wrote:
>> >> >> From: Doug Anderson <dianders@chromium.org>
>> >> >>
>> >> >> If we happened to get a data error at just the wrong time the dw_mmc
>> >> >> driver could get into a state where it would never complete its
>> >> >> request.  That would leave the caller just hanging there.
>> >> >>
>> >> >> We fix this two ways and both of the two fixes on their own appear to
>> >> >> fix the problems we've seen:
>> >> >>
>> >> >> 1. Fix a race in the tasklet where the interrupt setting the data
>> >> >>    error happens _just after_ we check for it, then we get a
>> >> >>    EVENT_XFER_COMPLETE.  We fix this by repeating a bit of code.
>> > I think repeating is not good approach to fix race.
>> > In your case, XFER_COMPLETE preceded data error and DTO didn't come?
>> > It seems strange case.
>> > I want to know actual error value if you can reproduce.
>>
>> XFER_COMPLETE didn't necessarily precede data error.  Imagine this scenario:
>>
>> 1. Check for data error: nope
>> 2. Interrupt happens and we get a data error and immediately xfer complete
>> 3. Check for xfer complete: yup
>>
>> That's the state that we are handling.
>>
>> The system that dw_mmc uses where the interrupt handler has no locking
>> makes it incredibly difficult to get things right.  Can you propose an
>> alternate fix that would avoid the race?
> Thank you for detailed scenario.
> You're right.
> Have you consider using spin_lock() in interrupt handler?
> Then, we'll need to change spin_lock() to spin_lock_irqsave() in tasklet func.
> And other locks in driver may need to be adjusted properly.

I have certainly considered it and I think it's the right way to go,
but I believe that this would be a pretty massive change to the design
of dw_mmc.  Someone appeared to try very hard not to use a spinlock in
the interrupt handler and came up with the whole tasklet / pending
events / completed events to deal with it.

In this particular case it would be pretty easy to just add the
spinlock around the data error / xfer complete checks, though once we
have a spinlock here it seems like we'd want to start using it in
other places.  It would be confusing if the interrupt handler grabbed
a spinlock the whole time but then only used it to protect a single
small check.

I'm really curious, though, why this driver can't just use a threaded
irq handler and eliminate the whole interrupt / tasklet split.  That
seems saner in the long run.


> To return above scenario:
> 1. Check for data error: nope
> 2. Check for xfer complete: nope -> escape tasklet.
> 3. Interrupt happens and we get a data error and immediately xfer complete
> 4. Check for data error (Again in tasklet) : yup
>
> How about this change?

I'm not sure I understand.  Are you suggesting a change to my code, or
wondering how my code handles the above scenario?

I think your scenario works find either with or without my patch.

-Doug

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] mmc: dw_mmc: Make sure we don't get stuck when we get an error
  2014-05-13 15:56           ` Doug Anderson
@ 2014-05-16  1:46             ` Seungwon Jeon
  2014-05-16 16:21               ` Doug Anderson
  0 siblings, 1 reply; 13+ messages in thread
From: Seungwon Jeon @ 2014-05-16  1:46 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, May 14, 2014, Doug Anderson wrote:
> Seungwon,
> 
> On Mon, May 12, 2014 at 9:52 PM, Seungwon Jeon <tgih.jun@samsung.com> wrote:
> > Hi Doug,
> >
> > On Tue, May 13, 2014, Doug Anderson wrote:
> >> Seungwon,
> >>
> >> On Sat, May 10, 2014 at 7:11 AM, Seungwon Jeon <tgih.jun@samsung.com> wrote:
> >> > On Fri, May 09, 2014, Sonny Rao wrote:
> >> >> On Thu, May 8, 2014 at 2:42 AM, Yuvaraj Kumar <yuvaraj.cd@gmail.com> wrote:
> >> >> > Any comments on this patch?
> >> >> >
> >> >>
> >> >> I'll just add that without this fix, running the tuning loop for UHS
> >> >> modes is not reliable on dw_mmc because errors will happen and you
> >> >> will eventually hit this race and hang.  This can happen any time
> >> >> there is tuning like during boot or during resume from suspend.
> >> >>
> >> >> > On Thu, Mar 27, 2014 at 11:48 AM, Yuvaraj Kumar C D
> >> >> > <yuvaraj.cd@gmail.com> wrote:
> >> >> >> From: Doug Anderson <dianders@chromium.org>
> >> >> >>
> >> >> >> If we happened to get a data error at just the wrong time the dw_mmc
> >> >> >> driver could get into a state where it would never complete its
> >> >> >> request.  That would leave the caller just hanging there.
> >> >> >>
> >> >> >> We fix this two ways and both of the two fixes on their own appear to
> >> >> >> fix the problems we've seen:
> >> >> >>
> >> >> >> 1. Fix a race in the tasklet where the interrupt setting the data
> >> >> >>    error happens _just after_ we check for it, then we get a
> >> >> >>    EVENT_XFER_COMPLETE.  We fix this by repeating a bit of code.
> >> > I think repeating is not good approach to fix race.
> >> > In your case, XFER_COMPLETE preceded data error and DTO didn't come?
> >> > It seems strange case.
> >> > I want to know actual error value if you can reproduce.
> >>
> >> XFER_COMPLETE didn't necessarily precede data error.  Imagine this scenario:
> >>
> >> 1. Check for data error: nope
> >> 2. Interrupt happens and we get a data error and immediately xfer complete
> >> 3. Check for xfer complete: yup
> >>
> >> That's the state that we are handling.
> >>
> >> The system that dw_mmc uses where the interrupt handler has no locking
> >> makes it incredibly difficult to get things right.  Can you propose an
> >> alternate fix that would avoid the race?
> > Thank you for detailed scenario.
> > You're right.
> > Have you consider using spin_lock() in interrupt handler?
> > Then, we'll need to change spin_lock() to spin_lock_irqsave() in tasklet func.
> > And other locks in driver may need to be adjusted properly.
> 
> I have certainly considered it and I think it's the right way to go,
> but I believe that this would be a pretty massive change to the design
> of dw_mmc.  Someone appeared to try very hard not to use a spinlock in
> the interrupt handler and came up with the whole tasklet / pending
> events / completed events to deal with it.
Yes. It might be not small changes. Moreover, it needs to test heavily.
But it should be done before long. As we experienced, there are some race issue
in current way. Will you update this patch? Please let me know.

> 
> In this particular case it would be pretty easy to just add the
> spinlock around the data error / xfer complete checks, though once we
> have a spinlock here it seems like we'd want to start using it in
> other places.  It would be confusing if the interrupt handler grabbed
> a spinlock the whole time but then only used it to protect a single
> small check.
> 
> I'm really curious, though, why this driver can't just use a threaded
> irq handler and eliminate the whole interrupt / tasklet split.  That
> seems saner in the long run.
I think dw_mmc conforms itself to typical driver's implementation,
which it may be about a choice of driver's design.

> 
> 
> > To return above scenario:
> > 1. Check for data error: nope
> > 2. Check for xfer complete: nope -> escape tasklet.
> > 3. Interrupt happens and we get a data error and immediately xfer complete
> > 4. Check for data error (Again in tasklet) : yup
> >
> > How about this change?
> 
> I'm not sure I understand.  Are you suggesting a change to my code, or
> wondering how my code handles the above scenario?
Ah, the above-mentioned steps describes scenario when changing lock scheme.

Thanks,
Seungwon

> 
> I think your scenario works find either with or without my patch.
> 
> -Doug
> --
> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] mmc: dw_mmc: Make sure we don't get stuck when we get an error
  2014-05-16  1:46             ` Seungwon Jeon
@ 2014-05-16 16:21               ` Doug Anderson
  2014-05-20  1:24                 ` Seungwon Jeon
  0 siblings, 1 reply; 13+ messages in thread
From: Doug Anderson @ 2014-05-16 16:21 UTC (permalink / raw)
  To: linux-arm-kernel

Seungwon,

On Thu, May 15, 2014 at 6:46 PM, Seungwon Jeon <tgih.jun@samsung.com> wrote:
> On Wed, May 14, 2014, Doug Anderson wrote:
>> Seungwon,
>>
>> On Mon, May 12, 2014 at 9:52 PM, Seungwon Jeon <tgih.jun@samsung.com> wrote:
>> > Hi Doug,
>> >
>> > On Tue, May 13, 2014, Doug Anderson wrote:
>> >> Seungwon,
>> >>
>> >> On Sat, May 10, 2014 at 7:11 AM, Seungwon Jeon <tgih.jun@samsung.com> wrote:
>> >> > On Fri, May 09, 2014, Sonny Rao wrote:
>> >> >> On Thu, May 8, 2014 at 2:42 AM, Yuvaraj Kumar <yuvaraj.cd@gmail.com> wrote:
>> >> >> > Any comments on this patch?
>> >> >> >
>> >> >>
>> >> >> I'll just add that without this fix, running the tuning loop for UHS
>> >> >> modes is not reliable on dw_mmc because errors will happen and you
>> >> >> will eventually hit this race and hang.  This can happen any time
>> >> >> there is tuning like during boot or during resume from suspend.
>> >> >>
>> >> >> > On Thu, Mar 27, 2014 at 11:48 AM, Yuvaraj Kumar C D
>> >> >> > <yuvaraj.cd@gmail.com> wrote:
>> >> >> >> From: Doug Anderson <dianders@chromium.org>
>> >> >> >>
>> >> >> >> If we happened to get a data error at just the wrong time the dw_mmc
>> >> >> >> driver could get into a state where it would never complete its
>> >> >> >> request.  That would leave the caller just hanging there.
>> >> >> >>
>> >> >> >> We fix this two ways and both of the two fixes on their own appear to
>> >> >> >> fix the problems we've seen:
>> >> >> >>
>> >> >> >> 1. Fix a race in the tasklet where the interrupt setting the data
>> >> >> >>    error happens _just after_ we check for it, then we get a
>> >> >> >>    EVENT_XFER_COMPLETE.  We fix this by repeating a bit of code.
>> >> > I think repeating is not good approach to fix race.
>> >> > In your case, XFER_COMPLETE preceded data error and DTO didn't come?
>> >> > It seems strange case.
>> >> > I want to know actual error value if you can reproduce.
>> >>
>> >> XFER_COMPLETE didn't necessarily precede data error.  Imagine this scenario:
>> >>
>> >> 1. Check for data error: nope
>> >> 2. Interrupt happens and we get a data error and immediately xfer complete
>> >> 3. Check for xfer complete: yup
>> >>
>> >> That's the state that we are handling.
>> >>
>> >> The system that dw_mmc uses where the interrupt handler has no locking
>> >> makes it incredibly difficult to get things right.  Can you propose an
>> >> alternate fix that would avoid the race?
>> > Thank you for detailed scenario.
>> > You're right.
>> > Have you consider using spin_lock() in interrupt handler?
>> > Then, we'll need to change spin_lock() to spin_lock_irqsave() in tasklet func.
>> > And other locks in driver may need to be adjusted properly.
>>
>> I have certainly considered it and I think it's the right way to go,
>> but I believe that this would be a pretty massive change to the design
>> of dw_mmc.  Someone appeared to try very hard not to use a spinlock in
>> the interrupt handler and came up with the whole tasklet / pending
>> events / completed events to deal with it.
> Yes. It might be not small changes. Moreover, it needs to test heavily.
> But it should be done before long. As we experienced, there are some race issue
> in current way. Will you update this patch? Please let me know.

I would prefer not to, but I will if I need to.  As you say, adding a
spinlock to the IRQ handler will need to be tested heavily and I'm not
sure we're in the best position to do that.  I can send up the change
to use a spinlock with some minimal testing if you want, though.
...of perhaps Yuvaraj can since he's the one who's been sending this
patch upstream.

Note that the original change was quite hard to test.  The problem
would show up on certain boards easier than others and would show up
with certain revisions of the kernel but not with others.  In other
words: it was heavily timing dependent.  The current patch as-is has
been tested and shown to be stable for quite a while now.


>> In this particular case it would be pretty easy to just add the
>> spinlock around the data error / xfer complete checks, though once we
>> have a spinlock here it seems like we'd want to start using it in
>> other places.  It would be confusing if the interrupt handler grabbed
>> a spinlock the whole time but then only used it to protect a single
>> small check.
>>
>> I'm really curious, though, why this driver can't just use a threaded
>> irq handler and eliminate the whole interrupt / tasklet split.  That
>> seems saner in the long run.
> I think dw_mmc conforms itself to typical driver's implementation,
> which it may be about a choice of driver's design.

Yup, there are a lot of different valid ways to accomplish the same
thing in the kernel.  ...but some of the ways drivers do things are
historical and are not the suggested way of doing things anymore.  I'm
still a kernel noob in many ways (despite a few years of working with
it), so I may be mistaken.

To me it feels like the interrupt + tasklet is effectively reinventing
threaded interrupts, but doing it in a custom way.


-Doug

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] mmc: dw_mmc: Make sure we don't get stuck when we get an error
  2014-05-16 16:21               ` Doug Anderson
@ 2014-05-20  1:24                 ` Seungwon Jeon
  0 siblings, 0 replies; 13+ messages in thread
From: Seungwon Jeon @ 2014-05-20  1:24 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, May 17, 2014, Doug Anderson wrote:
> Seungwon,
> 
> On Thu, May 15, 2014 at 6:46 PM, Seungwon Jeon <tgih.jun@samsung.com> wrote:
> > On Wed, May 14, 2014, Doug Anderson wrote:
> >> Seungwon,
> >>
> >> On Mon, May 12, 2014 at 9:52 PM, Seungwon Jeon <tgih.jun@samsung.com> wrote:
> >> > Hi Doug,
> >> >
> >> > On Tue, May 13, 2014, Doug Anderson wrote:
> >> >> Seungwon,
> >> >>
> >> >> On Sat, May 10, 2014 at 7:11 AM, Seungwon Jeon <tgih.jun@samsung.com> wrote:
> >> >> > On Fri, May 09, 2014, Sonny Rao wrote:
> >> >> >> On Thu, May 8, 2014 at 2:42 AM, Yuvaraj Kumar <yuvaraj.cd@gmail.com> wrote:
> >> >> >> > Any comments on this patch?
> >> >> >> >
> >> >> >>
> >> >> >> I'll just add that without this fix, running the tuning loop for UHS
> >> >> >> modes is not reliable on dw_mmc because errors will happen and you
> >> >> >> will eventually hit this race and hang.  This can happen any time
> >> >> >> there is tuning like during boot or during resume from suspend.
> >> >> >>
> >> >> >> > On Thu, Mar 27, 2014 at 11:48 AM, Yuvaraj Kumar C D
> >> >> >> > <yuvaraj.cd@gmail.com> wrote:
> >> >> >> >> From: Doug Anderson <dianders@chromium.org>
> >> >> >> >>
> >> >> >> >> If we happened to get a data error at just the wrong time the dw_mmc
> >> >> >> >> driver could get into a state where it would never complete its
> >> >> >> >> request.  That would leave the caller just hanging there.
> >> >> >> >>
> >> >> >> >> We fix this two ways and both of the two fixes on their own appear to
> >> >> >> >> fix the problems we've seen:
> >> >> >> >>
> >> >> >> >> 1. Fix a race in the tasklet where the interrupt setting the data
> >> >> >> >>    error happens _just after_ we check for it, then we get a
> >> >> >> >>    EVENT_XFER_COMPLETE.  We fix this by repeating a bit of code.
> >> >> > I think repeating is not good approach to fix race.
> >> >> > In your case, XFER_COMPLETE preceded data error and DTO didn't come?
> >> >> > It seems strange case.
> >> >> > I want to know actual error value if you can reproduce.
> >> >>
> >> >> XFER_COMPLETE didn't necessarily precede data error.  Imagine this scenario:
> >> >>
> >> >> 1. Check for data error: nope
> >> >> 2. Interrupt happens and we get a data error and immediately xfer complete
> >> >> 3. Check for xfer complete: yup
> >> >>
> >> >> That's the state that we are handling.
> >> >>
> >> >> The system that dw_mmc uses where the interrupt handler has no locking
> >> >> makes it incredibly difficult to get things right.  Can you propose an
> >> >> alternate fix that would avoid the race?
> >> > Thank you for detailed scenario.
> >> > You're right.
> >> > Have you consider using spin_lock() in interrupt handler?
> >> > Then, we'll need to change spin_lock() to spin_lock_irqsave() in tasklet func.
> >> > And other locks in driver may need to be adjusted properly.
> >>
> >> I have certainly considered it and I think it's the right way to go,
> >> but I believe that this would be a pretty massive change to the design
> >> of dw_mmc.  Someone appeared to try very hard not to use a spinlock in
> >> the interrupt handler and came up with the whole tasklet / pending
> >> events / completed events to deal with it.
> > Yes. It might be not small changes. Moreover, it needs to test heavily.
> > But it should be done before long. As we experienced, there are some race issue
> > in current way. Will you update this patch? Please let me know.
> 
> I would prefer not to, but I will if I need to.  As you say, adding a
> spinlock to the IRQ handler will need to be tested heavily and I'm not
> sure we're in the best position to do that.  I can send up the change
> to use a spinlock with some minimal testing if you want, though.
> ...of perhaps Yuvaraj can since he's the one who's been sending this
> patch upstream.
> 
> Note that the original change was quite hard to test.  The problem
> would show up on certain boards easier than others and would show up
> with certain revisions of the kernel but not with others.  In other
> words: it was heavily timing dependent.  The current patch as-is has
> been tested and shown to be stable for quite a while now.
> 
> 
> >> In this particular case it would be pretty easy to just add the
> >> spinlock around the data error / xfer complete checks, though once we
> >> have a spinlock here it seems like we'd want to start using it in
> >> other places.  It would be confusing if the interrupt handler grabbed
> >> a spinlock the whole time but then only used it to protect a single
> >> small check.
> >>
> >> I'm really curious, though, why this driver can't just use a threaded
> >> irq handler and eliminate the whole interrupt / tasklet split.  That
> >> seems saner in the long run.
> > I think dw_mmc conforms itself to typical driver's implementation,
> > which it may be about a choice of driver's design.
> 
> Yup, there are a lot of different valid ways to accomplish the same
> thing in the kernel.  ...but some of the ways drivers do things are
> historical and are not the suggested way of doing things anymore.  I'm
> still a kernel noob in many ways (despite a few years of working with
> it), so I may be mistaken.
> 
> To me it feels like the interrupt + tasklet is effectively reinventing
> threaded interrupts, but doing it in a custom way.

Although it may not be graceful way, this patch is plainly effective.
We can expect more in the next place.
I'll add minor comment in previous email.
Please check it.

Thanks,
Seungwon Jeon

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] mmc: dw_mmc: Make sure we don't get stuck when we get an error
  2014-05-13  4:52         ` Seungwon Jeon
  2014-05-13 15:56           ` Doug Anderson
@ 2014-05-20  1:51           ` Seungwon Jeon
  2014-05-20 22:08             ` Doug Anderson
  1 sibling, 1 reply; 13+ messages in thread
From: Seungwon Jeon @ 2014-05-20  1:51 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, May 13, 2014, Seungwon Jeon wrote:
> Hi Doug,
> 
> On Tue, May 13, 2014, Doug Anderson wrote:
> > Seungwon,
> >
> > On Sat, May 10, 2014 at 7:11 AM, Seungwon Jeon <tgih.jun@samsung.com> wrote:
> > > On Fri, May 09, 2014, Sonny Rao wrote:
> > >> On Thu, May 8, 2014 at 2:42 AM, Yuvaraj Kumar <yuvaraj.cd@gmail.com> wrote:
> > >> > Any comments on this patch?
> > >> >
> > >>
> > >> I'll just add that without this fix, running the tuning loop for UHS
> > >> modes is not reliable on dw_mmc because errors will happen and you
> > >> will eventually hit this race and hang.  This can happen any time
> > >> there is tuning like during boot or during resume from suspend.
> > >>
> > >> > On Thu, Mar 27, 2014 at 11:48 AM, Yuvaraj Kumar C D
> > >> > <yuvaraj.cd@gmail.com> wrote:
> > >> >> From: Doug Anderson <dianders@chromium.org>
> > >> >>
> > >> >> If we happened to get a data error at just the wrong time the dw_mmc
> > >> >> driver could get into a state where it would never complete its
> > >> >> request.  That would leave the caller just hanging there.
> > >> >>
> > >> >> We fix this two ways and both of the two fixes on their own appear to
> > >> >> fix the problems we've seen:
> > >> >>
> > >> >> 1. Fix a race in the tasklet where the interrupt setting the data
> > >> >>    error happens _just after_ we check for it, then we get a
> > >> >>    EVENT_XFER_COMPLETE.  We fix this by repeating a bit of code.
> > > I think repeating is not good approach to fix race.
> > > In your case, XFER_COMPLETE preceded data error and DTO didn't come?
> > > It seems strange case.
> > > I want to know actual error value if you can reproduce.
> >
> > XFER_COMPLETE didn't necessarily precede data error.  Imagine this scenario:
> >
> > 1. Check for data error: nope
> > 2. Interrupt happens and we get a data error and immediately xfer complete
> > 3. Check for xfer complete: yup
> >
> > That's the state that we are handling.
> >
> > The system that dw_mmc uses where the interrupt handler has no locking
> > makes it incredibly difficult to get things right.  Can you propose an
> > alternate fix that would avoid the race?
> Thank you for detailed scenario.
> You're right.
> Have you consider using spin_lock() in interrupt handler?
> Then, we'll need to change spin_lock() to spin_lock_irqsave() in tasklet func.
> And other locks in driver may need to be adjusted properly.
> 
> To return above scenario:
> 1. Check for data error: nope
> 2. Check for xfer complete: nope -> escape tasklet.
> 3. Interrupt happens and we get a data error and immediately xfer complete
> 4. Check for data error (Again in tasklet) : yup
> 
> How about this change?
> 
> Thanks,
> Seungwon Jeon
> >
> >
> > >> >> 2. Fix it so that if we detect that we've got an error in the "data
> > >> >>    busy" state and we're not going to do anything else we end the
> > >> >>    request and unblock anyone waiting.
> > >> >>
> > >> >> Signed-off-by: Doug Anderson <dianders@chromium.org>
> > >> >> Signed-off-by: Yuvaraj Kumar C D <yuvaraj.cd@gmail.com>
> > >> >> ---
> > >> >>  drivers/mmc/host/dw_mmc.c |   47 +++++++++++++++++++++++++++++++++++++++++++++
> > >> >>  1 file changed, 47 insertions(+)
> > >> >>
> > >> >> diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
> > >> >> index 1d77431..4c589f1 100644
> > >> >> --- a/drivers/mmc/host/dw_mmc.c
> > >> >> +++ b/drivers/mmc/host/dw_mmc.c
> > >> >> @@ -1300,6 +1300,14 @@ static void dw_mci_tasklet_func(unsigned long priv)
> > >> >>                         /* fall through */
> > >> >>
> > >> >>                 case STATE_SENDING_DATA:
> > >> >> +                       /*
> > >> >> +                        * We could get a data error and never a transfer
> > >> >> +                        * complete so we'd better check for it here.
> > >> >> +                        *
> > >> >> +                        * Note that we don't really care if we also got a
> > >> >> +                        * transfer complete; stopping the DMA and sending an
> > >> >> +                        * abort won't hurt.
> > >> >> +                        */
> > >> >>                         if (test_and_clear_bit(EVENT_DATA_ERROR,
> > >> >>                                                &host->pending_events)) {
> > >> >>                                 dw_mci_stop_dma(host);
> > >> >> @@ -1313,7 +1321,29 @@ static void dw_mci_tasklet_func(unsigned long priv)
> > >> >>                                 break;
> > >> >>
> > >> >>                         set_bit(EVENT_XFER_COMPLETE, &host->completed_events);
> > >> >> +
> > >> >> +                       /*
> > >> >> +                        * Handle an EVENT_DATA_ERROR that might have shown up
> > >> >> +                        * before the transfer completed.  This might not have
> > >> >> +                        * been caught by the check above because the interrupt
> > >> >> +                        * could have gone off between the previous check and
> > >> >> +                        * the check for transfer complete.
> > >> >> +                        *
> > >> >> +                        * Technically this ought not be needed assuming we
> > >> >> +                        * get a DATA_COMPLETE eventually (we'll notice the
> > >> >> +                        * error and end the request), but it shouldn't hurt.
> > >> >> +                        *
> > >> >> +                        * This has the advantage of sending the stop command.
> > >> >> +                        */
> > >> >> +                       if (test_and_clear_bit(EVENT_DATA_ERROR,
> > >> >> +                                              &host->pending_events)) {
> > >> >> +                               dw_mci_stop_dma(host);
> > >> >> +                               send_stop_abort(host, data);
> > >> >> +                               state = STATE_DATA_ERROR;
> > >> >> +                               break;
> > >> >> +                       }
> > >> >>                         prev_state = state = STATE_DATA_BUSY;
> > >> >> +
> > >> >>                         /* fall through */
> > >> >>
> > >> >>                 case STATE_DATA_BUSY:
> > >> >> @@ -1336,6 +1366,23 @@ static void dw_mci_tasklet_func(unsigned long priv)
> > >> >>                                 /* stop command for open-ended transfer*/
> > >> >>                                 if (data->stop)
> > >> >>                                         send_stop_abort(host, data);
> > >> >> +                       } else {
> > >> >> +                               /*
> > >> >> +                                * If we don't have a command complete now we'll
> > >> >> +                                * never get one since we just reset everything;
> > >> >> +                                * better end the request.
> > >> >> +                                *
> > >> >> +                                * If we do have a command complete we'll fall
> > >> >> +                                * through to the SENDING_STOP command and
> > >> >> +                                * everything will be peachy keen.
> > >> >> +                                *
> > >> >> +                                * TODO: I guess we shouldn't send a stop?

Please remove TODO:
We already reset controller in dw_mci_data_complete() through "mmc: dw_mmc: change to use recommended reset procedure"?
I guess it depends on that patch.
Then, we don't need to stop sequence anymore.

Thanks,
Seungwon Jeon

> > >> >> +                                */
> > >> >> +                               if (!test_bit(EVENT_CMD_COMPLETE,
> > >> >> +                                             &host->pending_events)) {
> > >> >> +                                       dw_mci_request_end(host, mrq);
> > >> >> +                                       goto unlock;
> > >> >> +                               }
> > > Can you explain what happens above?
> > > What is it for?
> >
> > This was an alternate fix for the above, but appears to actually hit
> > in practice too.
> >
> > Said another way: if we don't add the extra checking for
> > EVENT_DATA_ERROR (above) we'll end up here.  ...and if we ever get
> > into this "else" and don't do _something_ then we'll wedge forever.
> >
> > -Doug
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
> > the body of a message to majordomo at vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] mmc: dw_mmc: Make sure we don't get stuck when we get an error
  2014-05-20  1:51           ` Seungwon Jeon
@ 2014-05-20 22:08             ` Doug Anderson
  2014-05-21  9:05               ` Seungwon Jeon
  0 siblings, 1 reply; 13+ messages in thread
From: Doug Anderson @ 2014-05-20 22:08 UTC (permalink / raw)
  To: linux-arm-kernel

Seungwon,

On Mon, May 19, 2014 at 6:51 PM, Seungwon Jeon <tgih.jun@samsung.com> wrote:
>> > >> >> +                       } else {
>> > >> >> +                               /*
>> > >> >> +                                * If we don't have a command complete now we'll
>> > >> >> +                                * never get one since we just reset everything;
>> > >> >> +                                * better end the request.
>> > >> >> +                                *
>> > >> >> +                                * If we do have a command complete we'll fall
>> > >> >> +                                * through to the SENDING_STOP command and
>> > >> >> +                                * everything will be peachy keen.
>> > >> >> +                                *
>> > >> >> +                                * TODO: I guess we shouldn't send a stop?
>
> Please remove TODO:

OK

> We already reset controller in dw_mci_data_complete() through "mmc: dw_mmc: change to use recommended reset procedure"?
> I guess it depends on that patch.
> Then, we don't need to stop sequence anymore.

Even without that patch we've still pretty much stopped everything in
dw_mci_fifo_reset(), so I'm not sure there's any strong dependency.
...but in our tree this did land after the patch you mention, so you
could wait until it lands if you want.


NOTE: We found that on some machines we were getting warnings after
enabling tuning and using this patch.  The warnings were benign, but
we should probably squash the fix into this patch.  See
<https://chromium-review.googlesource.com/#/c/200652/> for details.

-Doug

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] mmc: dw_mmc: Make sure we don't get stuck when we get an error
  2014-05-20 22:08             ` Doug Anderson
@ 2014-05-21  9:05               ` Seungwon Jeon
  0 siblings, 0 replies; 13+ messages in thread
From: Seungwon Jeon @ 2014-05-21  9:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, May 21, 2014, Doug Anderson wrote:
> Seungwon,
> 
> On Mon, May 19, 2014 at 6:51 PM, Seungwon Jeon <tgih.jun@samsung.com> wrote:
> >> > >> >> +                       } else {
> >> > >> >> +                               /*
> >> > >> >> +                                * If we don't have a command complete now we'll
> >> > >> >> +                                * never get one since we just reset everything;
> >> > >> >> +                                * better end the request.
> >> > >> >> +                                *
> >> > >> >> +                                * If we do have a command complete we'll fall
> >> > >> >> +                                * through to the SENDING_STOP command and
> >> > >> >> +                                * everything will be peachy keen.
> >> > >> >> +                                *
> >> > >> >> +                                * TODO: I guess we shouldn't send a stop?
> >
> > Please remove TODO:
> 
> OK
> 
> > We already reset controller in dw_mci_data_complete() through "mmc: dw_mmc: change to use
> recommended reset procedure"?
> > I guess it depends on that patch.
> > Then, we don't need to stop sequence anymore.
> 
> Even without that patch we've still pretty much stopped everything in
> dw_mci_fifo_reset(), so I'm not sure there's any strong dependency.
> ...but in our tree this did land after the patch you mention, so you
> could wait until it lands if you want.
> 
ciu-reset or stop/abort cmd is needed for recovery of controller FSM after getting error.
dw_mci_fifo_reset eventually has ciu-reset from that patch.
I think it's fine to apply after that.

> 
> NOTE: We found that on some machines we were getting warnings after
> enabling tuning and using this patch.  The warnings were benign, but
> we should probably squash the fix into this patch.  See
> <https://chromium-review.googlesource.com/#/c/200652/> for details.
Can it be accessible?

Thanks,
Seungwon Jeon

> 
> -Doug
> --
> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2014-05-21  9:05 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-03-27  6:18 [PATCH] mmc: dw_mmc: Make sure we don't get stuck when we get an error Yuvaraj Kumar C D
2014-05-08  9:42 ` Yuvaraj Kumar
2014-05-09  3:21   ` Sonny Rao
2014-05-10 14:11     ` Seungwon Jeon
2014-05-12 21:50       ` Doug Anderson
2014-05-13  4:52         ` Seungwon Jeon
2014-05-13 15:56           ` Doug Anderson
2014-05-16  1:46             ` Seungwon Jeon
2014-05-16 16:21               ` Doug Anderson
2014-05-20  1:24                 ` Seungwon Jeon
2014-05-20  1:51           ` Seungwon Jeon
2014-05-20 22:08             ` Doug Anderson
2014-05-21  9:05               ` Seungwon Jeon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).