linux-mmc.vger.kernel.org archive mirror
* [PATCH RFT] mmc: tmio: avoid concurrent runs of mmc_request_done()
From: Wolfram Sang @ 2024-02-28 10:03 UTC
  To: linux-renesas-soc
  Cc: Wolfram Sang, Dirk Behme, Ulf Hansson, linux-mmc, linux-kernel

With the to-be-fixed commit, the reset_work handler cleared 'host->mrq'
outside of the spinlock protected critical section. That leaves a small
race window during execution of 'tmio_mmc_reset()' where the done_work
handler could grab a pointer to the now invalid 'host->mrq'. Both would
use it to call mmc_request_done() causing problems (see Link).

However, 'host->mrq' cannot simply be cleared earlier inside the
critical section. That would allow new mrqs to come in asynchronously
while the actual reset of the controller still needs to be done. So,
like 'tmio_mmc_set_ios()', an ERR_PTR is used to prevent new mrqs from
coming in but still avoiding concurrency between work handlers.

Reported-by: Dirk Behme <dirk.behme@de.bosch.com>
Closes: https://lore.kernel.org/all/20240220061356.3001761-1-dirk.behme@de.bosch.com/
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Fixes: df3ef2d3c92c ("mmc: protect the tmio_mmc driver against a theoretical race")
---

Dirk: could you get this tested on your affected setups? I am somewhat
optimistic that this is already enough. For sure, it is a needed first
step.

 drivers/mmc/host/tmio_mmc_core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/mmc/host/tmio_mmc_core.c b/drivers/mmc/host/tmio_mmc_core.c
index be7f18fd4836..c253d176db69 100644
--- a/drivers/mmc/host/tmio_mmc_core.c
+++ b/drivers/mmc/host/tmio_mmc_core.c
@@ -259,6 +259,8 @@ static void tmio_mmc_reset_work(struct work_struct *work)
 	else
 		mrq->cmd->error = -ETIMEDOUT;
 
+	/* No new calls yet, but disallow concurrent tmio_mmc_done_work() */
+	host->mrq = ERR_PTR(-EBUSY);
 	host->cmd = NULL;
 	host->data = NULL;
 
-- 
2.43.0



* Re: [PATCH RFT] mmc: tmio: avoid concurrent runs of mmc_request_done()
From: Dirk Behme @ 2024-02-29  6:21 UTC
  To: Wolfram Sang, linux-renesas-soc; +Cc: Ulf Hansson, linux-mmc, linux-kernel

Hi Wolfram,

On 28.02.2024 11:03, Wolfram Sang wrote:
> With the to-be-fixed commit, the reset_work handler cleared 'host->mrq'
> outside of the spinlock protected critical section. That leaves a small
> race window during execution of 'tmio_mmc_reset()' where the done_work
> handler could grab a pointer to the now invalid 'host->mrq'. Both would
> use it to call mmc_request_done() causing problems (see Link).
> 
> However, 'host->mrq' cannot simply be cleared earlier inside the
> critical section. That would allow new mrqs to come in asynchronously
> while the actual reset of the controller still needs to be done. So,
> like 'tmio_mmc_set_ios()', an ERR_PTR is used to prevent new mrqs from
> coming in but still avoiding concurrency between work handlers.
> 
> Reported-by: Dirk Behme <dirk.behme@de.bosch.com>
> Closes: https://lore.kernel.org/all/20240220061356.3001761-1-dirk.behme@de.bosch.com/
> Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
> Fixes: df3ef2d3c92c ("mmc: protect the tmio_mmc driver against a theoretical race")

Tested-by: Dirk Behme <dirk.behme@de.bosch.com>
Reviewed-by: Dirk Behme <dirk.behme@de.bosch.com>

> ---
> 
> Dirk: could you get this tested on your affected setups? I am somewhat
> optimistic that this is already enough. For sure, it is a needed first
> step.

Testing looks good :) Many thanks!

At least the issues we observed before are no longer seen. As we are
not exactly sure about the root cause, this is of course not 100% proof.
But since the change looks good, does not look like it will break
anything, and the system behaves well with it, I would say we are good
to go.

I think we could add something like

Cc: stable@vger.kernel.org # 3.0+

?

>   drivers/mmc/host/tmio_mmc_core.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/mmc/host/tmio_mmc_core.c b/drivers/mmc/host/tmio_mmc_core.c
> index be7f18fd4836..c253d176db69 100644
> --- a/drivers/mmc/host/tmio_mmc_core.c
> +++ b/drivers/mmc/host/tmio_mmc_core.c
> @@ -259,6 +259,8 @@ static void tmio_mmc_reset_work(struct work_struct *work)
>   	else
>   		mrq->cmd->error = -ETIMEDOUT;
>   
> +	/* No new calls yet, but disallow concurrent tmio_mmc_done_work() */
> +	host->mrq = ERR_PTR(-EBUSY);
>   	host->cmd = NULL;
>   	host->data = NULL;

Thanks again!

Dirk


* Re: [PATCH RFT] mmc: tmio: avoid concurrent runs of mmc_request_done()
From: Wolfram Sang @ 2024-02-29  7:33 UTC
  To: Dirk Behme; +Cc: linux-renesas-soc, Ulf Hansson, linux-mmc, linux-kernel

Hi Dirk,

> > With the to-be-fixed commit, the reset_work handler cleared 'host->mrq'
> > outside of the spinlock protected critical section. That leaves a small
> > race window during execution of 'tmio_mmc_reset()' where the done_work
> > handler could grab a pointer to the now invalid 'host->mrq'. Both would
> > use it to call mmc_request_done() causing problems (see Link).
> > 
> > However, 'host->mrq' cannot simply be cleared earlier inside the
> > critical section. That would allow new mrqs to come in asynchronously
> > while the actual reset of the controller still needs to be done. So,
> > like 'tmio_mmc_set_ios()', an ERR_PTR is used to prevent new mrqs from
> > coming in but still avoiding concurrency between work handlers.
> > 
> > Reported-by: Dirk Behme <dirk.behme@de.bosch.com>
> > Closes: https://lore.kernel.org/all/20240220061356.3001761-1-dirk.behme@de.bosch.com/
> > Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
> > Fixes: df3ef2d3c92c ("mmc: protect the tmio_mmc driver against a theoretical race")
> 
> Tested-by: Dirk Behme <dirk.behme@de.bosch.com>
> Reviewed-by: Dirk Behme <dirk.behme@de.bosch.com>

Awesome! Thanks for the super-fast tags!

> At least the issues we observed before are no longer seen. As we are
> not exactly sure about the root cause, this is of course not 100% proof.
> But since the change looks good, does not look like it will break
> anything, and the system behaves well with it, I would say we are good
> to go.

I agree. We don't know if it is all you need. But there definitely was a
race window and closing it removes some observed anomalies. Let's hope
all of them :) I looked many times at the code and, to the best of my
knowledge, don't see side effects. 'host->mrq' stays non-NULL, so new
mrqs won't be added like before. Changing it to an ERR_PTR will only
affect the check in the done_work handler which is what we want. But, of
course, more eyes are always welcome.
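
To make the sentinel idea concrete, here is a minimal single-threaded
userspace sketch. It is not the driver code: 'struct request',
'struct host', submit(), done_work(), reset_begin() and reset_finish()
are hypothetical names, the ERR_PTR helpers are simplified stand-ins for
the kernel's <linux/err.h>, and the spinlock is left out entirely. It
also assumes the done_work check rejects both NULL and error pointers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel's <linux/err.h> helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline bool IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static inline bool IS_ERR_OR_NULL(const void *ptr)
{
	return !ptr || IS_ERR(ptr);
}

/* Hypothetical models of struct mmc_request and struct tmio_mmc_host. */
struct request {
	int error;
	int done_calls;	/* how often the request was completed */
};

struct host {
	struct request *mrq;
};

/* Submit path: any non-NULL mrq, including an ERR_PTR, means "busy". */
static int submit(struct host *host, struct request *mrq)
{
	if (host->mrq)
		return -EBUSY;
	host->mrq = mrq;
	return 0;
}

/* done_work model: only a real request pointer may be completed. */
static bool done_work(struct host *host)
{
	struct request *mrq = host->mrq;

	if (IS_ERR_OR_NULL(mrq))
		return false;
	host->mrq = NULL;
	mrq->done_calls++;
	return true;
}

/* First half of the reset handler (the part done under the lock). */
static struct request *reset_begin(struct host *host)
{
	struct request *mrq = host->mrq;

	if (IS_ERR_OR_NULL(mrq))
		return NULL;
	mrq->error = -ETIMEDOUT;
	/* Park a sentinel: still non-NULL, but never a valid request. */
	host->mrq = ERR_PTR(-EBUSY);
	return mrq;
}

/* Second half: runs after the (omitted) controller reset. */
static void reset_finish(struct host *host, struct request *claimed)
{
	assert(IS_ERR(host->mrq));	/* sentinel must still be in place */
	host->mrq = NULL;		/* ready for new requests */
	claimed->done_calls++;
}
```

Between reset_begin() and reset_finish(), submit() still sees a non-NULL
'mrq' and refuses new requests, while done_work() sees an error pointer
and backs off, so the claimed request is completed exactly once.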

> I think we could add something like
> 
> Cc: stable@vger.kernel.org # 3.0+

Yes, we should definitely have that. I would have added it once your
testing got good results. This affects every Renesas SDHI or Uniphier SD
instance since 3.0 (12 years). Wow! So, thanks a ton for your report and
assistance in debugging it. Very much appreciated! And, phew, I am happy
that this solution does not make the locking more complex \o/

All the best,

   Wolfram

