[PATCH 0/3] mmc: dw_mmc: Fix the CTO timer patch

linux-mmc.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/3] mmc: dw_mmc: Fix the CTO timer patch
@ 2017-09-27 20:56 Douglas Anderson
  2017-09-27 20:56 ` [PATCH 1/3] mmc: dw_mmc: cancel the CTO timer after a voltage switch Douglas Anderson
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Douglas Anderson @ 2017-09-27 20:56 UTC (permalink / raw)
  To: jh80.chung, ulf.hansson, shawn.lin
  Cc: briannorris, amstan, xzy.xu, Douglas Anderson, linux-mmc,
	linux-kernel

Recently we landed 03de19212ea3 ("mmc: dw_mmc: introduce timer for
broken command transfer over scheme").  I found a bunch of problems
with that patch, so this series attempts to solve some of them.

NOTE: this series has only been lighly tested so far.  I can at least
reproduce the need for the CTO timer on one of my devices and so I can
confirm that part still works.  As mentioned in the 3rd patch I also
ran the mmc_test kernel module on this and did manage to see the 3rd
patch doing something useful.

Douglas Anderson (3):
  mmc: dw_mmc: cancel the CTO timer after a voltage switch
  mmc: dw_mmc: Fix the CTO timeout calculation
  mmc: dw_mmc: Add locking to the CTO timer

 drivers/mmc/host/dw_mmc.c | 101 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 89 insertions(+), 12 deletions(-)

-- 
2.14.2.822.g60be5d43e6-goog

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/3] mmc: dw_mmc: cancel the CTO timer after a voltage switch
  2017-09-27 20:56 [PATCH 0/3] mmc: dw_mmc: Fix the CTO timer patch Douglas Anderson
@ 2017-09-27 20:56 ` Douglas Anderson
  2017-10-09  6:57   ` Shawn Lin
  2017-09-27 20:56 ` [PATCH 2/3] mmc: dw_mmc: Fix the CTO timeout calculation Douglas Anderson
  2017-09-27 20:56 ` [PATCH 3/3] mmc: dw_mmc: Add locking to the CTO timer Douglas Anderson
  2 siblings, 1 reply; 10+ messages in thread
From: Douglas Anderson @ 2017-09-27 20:56 UTC (permalink / raw)
  To: jh80.chung, ulf.hansson, shawn.lin
  Cc: briannorris, amstan, xzy.xu, Douglas Anderson, linux-mmc,
	linux-kernel

When running with the commit 03de19212ea3 ("mmc: dw_mmc: introduce
timer for broken command transfer over scheme") I found this message
in the log:
  Unexpected command timeout, state 7

It turns out that we weren't properly cancelling the new CTO timer in
the case that a voltage switch was done.  Let's promote the cancel
into the dw_mci_cmd_interrupt() function to fix this.

Fixes: 03de19212ea3 ("mmc: dw_mmc: introduce timer for broken command transfer over scheme")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

 drivers/mmc/host/dw_mmc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
index 860313bd952a..f5b2bb4b4d98 100644
--- a/drivers/mmc/host/dw_mmc.c
+++ b/drivers/mmc/host/dw_mmc.c
@@ -2570,6 +2570,8 @@ static void dw_mci_write_data_pio(struct dw_mci *host)
 
 static void dw_mci_cmd_interrupt(struct dw_mci *host, u32 status)
 {
+	del_timer(&host->cto_timer);
+
 	if (!host->cmd_status)
 		host->cmd_status = status;
 
@@ -2662,7 +2664,6 @@ static irqreturn_t dw_mci_interrupt(int irq, void *dev_id)
 		}
 
 		if (pending & SDMMC_INT_CMD_DONE) {
-			del_timer(&host->cto_timer);
 			mci_writel(host, RINTSTS, SDMMC_INT_CMD_DONE);
 			dw_mci_cmd_interrupt(host, pending);
 		}
-- 
2.14.2.822.g60be5d43e6-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/3] mmc: dw_mmc: Fix the CTO timeout calculation
  2017-09-27 20:56 [PATCH 0/3] mmc: dw_mmc: Fix the CTO timer patch Douglas Anderson
  2017-09-27 20:56 ` [PATCH 1/3] mmc: dw_mmc: cancel the CTO timer after a voltage switch Douglas Anderson
@ 2017-09-27 20:56 ` Douglas Anderson
  2017-10-09  7:03   ` Shawn Lin
  2017-09-27 20:56 ` [PATCH 3/3] mmc: dw_mmc: Add locking to the CTO timer Douglas Anderson
  2 siblings, 1 reply; 10+ messages in thread
From: Douglas Anderson @ 2017-09-27 20:56 UTC (permalink / raw)
  To: jh80.chung, ulf.hansson, shawn.lin
  Cc: briannorris, amstan, xzy.xu, Douglas Anderson, linux-mmc,
	linux-kernel

In the commit 03de19212ea3 ("mmc: dw_mmc: introduce timer for broken
command transfer over scheme") we tried to calculate the expected
hardware command timeout value.  Unfortunately that calculation isn't
quite correct in all cases.  It used "bus_hz" but, as far as I can
tell, it's supposed to use the card clock.  Let's account for the div
value, which is documented as 2x the value stored in the register, or
1 if the register is 0.

NOTE: It's not expected that this will actually fix anything important
since the 10 ms margin added by the function will pretty much dwarf
any calculations.  The card clock should be 100 kHz at minimum and:
  1000 ms/s * (255 * 2) / 100000 Hz.
Gives us 5.1 ms.

...so really the point of this patch is just to make the code more
"correct" in case anyone ever tries to remove the 10 ms buffer.

Fixes: 03de19212ea3 ("mmc: dw_mmc: introduce timer for broken command transfer over scheme")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

 drivers/mmc/host/dw_mmc.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
index f5b2bb4b4d98..16516c528a88 100644
--- a/drivers/mmc/host/dw_mmc.c
+++ b/drivers/mmc/host/dw_mmc.c
@@ -401,10 +401,14 @@ static u32 dw_mci_prep_stop_abort(struct dw_mci *host, struct mmc_command *cmd)
 static inline void dw_mci_set_cto(struct dw_mci *host)
 {
 	unsigned int cto_clks;
+	unsigned int cto_div;
 	unsigned int cto_ms;

 	cto_clks = mci_readl(host, TMOUT) & 0xff;
-	cto_ms = DIV_ROUND_UP(cto_clks, host->bus_hz / 1000);
+	cto_div = (mci_readl(host, CLKDIV) & 0xff) * 2;
+	if (cto_div == 0)
+		cto_div = 1;
+	cto_ms = DIV_ROUND_UP(MSEC_PER_SEC * cto_clks * cto_div, host->bus_hz);

 	/* add a bit spare time */
 	cto_ms += 10;
-- 
2.14.2.822.g60be5d43e6-goog

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 3/3] mmc: dw_mmc: Add locking to the CTO timer
  2017-09-27 20:56 [PATCH 0/3] mmc: dw_mmc: Fix the CTO timer patch Douglas Anderson
  2017-09-27 20:56 ` [PATCH 1/3] mmc: dw_mmc: cancel the CTO timer after a voltage switch Douglas Anderson
  2017-09-27 20:56 ` [PATCH 2/3] mmc: dw_mmc: Fix the CTO timeout calculation Douglas Anderson
@ 2017-09-27 20:56 ` Douglas Anderson
  2017-10-03 18:45   ` Doug Anderson
  2017-10-09  7:41   ` Shawn Lin
  2 siblings, 2 replies; 10+ messages in thread
From: Douglas Anderson @ 2017-09-27 20:56 UTC (permalink / raw)
  To: jh80.chung, ulf.hansson, shawn.lin
  Cc: briannorris, amstan, xzy.xu, Douglas Anderson, linux-mmc,
	linux-kernel

This attempts to instill a bit of paranoia to the code dealing with
the CTO timer.  It's believed that this will make the CTO timer more
robust in the case that we're having very long interrupt latencies.

Note that I originally thought that perhaps this patch was being
overly paranoid and wasn't really needed, but then while I was running
mmc_test on an rk3399 board I saw one instance of the message:
  dwmmc_rockchip fe320000.dwmmc: Unexpected interrupt latency

I had debug prints in the CTO timer code and I found that it was
running CMD 13 at the time.

...so even though this patch seems like it might be overly paranoid,
maybe it really isn't?

Presumably the bad interrupt latency experienced was due to the fact
that I had serial console enabled as serial console is typically where
I place blame when I see absurdly large interrupt latencies.  In this
particular case there was an (unrelated) printout to the serial
console just before I saw the "Unexpected interrupt latency" printout.

...and actually, I managed to even reproduce the problems by running
"iw mlan0 scan > /dev/null" while mmc_test was running.  That not only
does a bunch of PCIe traffic but it also (on my system) outputs some
SELinux log spam.

Fixes: 03de19212ea3 ("mmc: dw_mmc: introduce timer for broken command transfer over scheme")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

 drivers/mmc/host/dw_mmc.c | 92 +++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 82 insertions(+), 10 deletions(-)

diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
index 16516c528a88..6ecfe35094dd 100644
--- a/drivers/mmc/host/dw_mmc.c
+++ b/drivers/mmc/host/dw_mmc.c
@@ -403,6 +403,7 @@ static inline void dw_mci_set_cto(struct dw_mci *host)
 	unsigned int cto_clks;
 	unsigned int cto_div;
 	unsigned int cto_ms;
+	unsigned long irqflags;
 
 	cto_clks = mci_readl(host, TMOUT) & 0xff;
 	cto_div = (mci_readl(host, CLKDIV) & 0xff) * 2;
@@ -413,8 +414,24 @@ static inline void dw_mci_set_cto(struct dw_mci *host)
 	/* add a bit spare time */
 	cto_ms += 10;
 
-	mod_timer(&host->cto_timer,
-		  jiffies + msecs_to_jiffies(cto_ms) + 1);
+	/*
+	 * The durations we're working with are fairly short so we have to be
+	 * extra careful about synchronization here.  Specifically in hardware a
+	 * command timeout is _at most_ 5.1 ms, so that means we expect an
+	 * interrupt (either command done or timeout) to come rather quickly
+	 * after the mci_writel.  ...but just in case we have a long interrupt
+	 * latency let's add a bit of paranoia.
+	 *
+	 * In general we'll assume that at least an interrupt will be asserted
+	 * in hardware by the time the cto_timer runs.  ...and if it hasn't
+	 * been asserted in hardware by that time then we'll assume it'll never
+	 * come.
+	 */
+	spin_lock_irqsave(&host->irq_lock, irqflags);
+	if (!test_bit(EVENT_CMD_COMPLETE, &host->pending_events))
+		mod_timer(&host->cto_timer,
+			jiffies + msecs_to_jiffies(cto_ms) + 1);
+	spin_unlock_irqrestore(&host->irq_lock, irqflags);
 }
 
 static void dw_mci_start_command(struct dw_mci *host,
@@ -429,11 +446,11 @@ static void dw_mci_start_command(struct dw_mci *host,
 	wmb(); /* drain writebuffer */
 	dw_mci_wait_while_busy(host, cmd_flags);
 
+	mci_writel(host, CMD, cmd_flags | SDMMC_CMD_START);
+
 	/* response expected command only */
 	if (cmd_flags & SDMMC_CMD_RESP_EXP)
 		dw_mci_set_cto(host);
-
-	mci_writel(host, CMD, cmd_flags | SDMMC_CMD_START);
 }
 
 static inline void send_stop_abort(struct dw_mci *host, struct mmc_data *data)
@@ -1930,6 +1947,24 @@ static void dw_mci_set_drto(struct dw_mci *host)
 	mod_timer(&host->dto_timer, jiffies + msecs_to_jiffies(drto_ms));
 }
 
+static bool dw_mci_clear_pending_cmd_complete(struct dw_mci *host)
+{
+	if (!test_bit(EVENT_CMD_COMPLETE, &host->pending_events))
+		return false;
+
+	/*
+	 * Really be certain that the timer has stopped.  This is a bit of
+	 * paranoia and could only really happen if we had really bad
+	 * interrupt latency and the interrupt routine and timeout were
+	 * running concurrently so that the del_timer() in the interrupt
+	 * handler couldn't run.
+	 */
+	WARN_ON(del_timer_sync(&host->cto_timer));
+	clear_bit(EVENT_CMD_COMPLETE, &host->pending_events);
+
+	return true;
+}
+
 static void dw_mci_tasklet_func(unsigned long priv)
 {
 	struct dw_mci *host = (struct dw_mci *)priv;
@@ -1956,8 +1991,7 @@ static void dw_mci_tasklet_func(unsigned long priv)
 
 		case STATE_SENDING_CMD11:
 		case STATE_SENDING_CMD:
-			if (!test_and_clear_bit(EVENT_CMD_COMPLETE,
-						&host->pending_events))
+			if (!dw_mci_clear_pending_cmd_complete(host))
 				break;
 
 			cmd = host->cmd;
@@ -2126,8 +2160,7 @@ static void dw_mci_tasklet_func(unsigned long priv)
 			/* fall through */
 
 		case STATE_SENDING_STOP:
-			if (!test_and_clear_bit(EVENT_CMD_COMPLETE,
-						&host->pending_events))
+			if (!dw_mci_clear_pending_cmd_complete(host))
 				break;
 
 			/* CMD error in data command */
@@ -2600,6 +2633,8 @@ static irqreturn_t dw_mci_interrupt(int irq, void *dev_id)
 	struct dw_mci *host = dev_id;
 	u32 pending;
 	struct dw_mci_slot *slot = host->slot;
+	unsigned long irqflags;
+	int i;
 
 	pending = mci_readl(host, MINTSTS); /* read-only mask reg */
 
@@ -2607,8 +2642,6 @@ static irqreturn_t dw_mci_interrupt(int irq, void *dev_id)
 		/* Check volt switch first, since it can look like an error */
 		if ((host->state == STATE_SENDING_CMD11) &&
 		    (pending & SDMMC_INT_VOLT_SWITCH)) {
-			unsigned long irqflags;
-
 			mci_writel(host, RINTSTS, SDMMC_INT_VOLT_SWITCH);
 			pending &= ~SDMMC_INT_VOLT_SWITCH;
 
@@ -2624,11 +2657,15 @@ static irqreturn_t dw_mci_interrupt(int irq, void *dev_id)
 		}
 
 		if (pending & DW_MCI_CMD_ERROR_FLAGS) {
+			spin_lock_irqsave(&host->irq_lock, irqflags);
+
 			del_timer(&host->cto_timer);
 			mci_writel(host, RINTSTS, DW_MCI_CMD_ERROR_FLAGS);
 			host->cmd_status = pending;
 			smp_wmb(); /* drain writebuffer */
 			set_bit(EVENT_CMD_COMPLETE, &host->pending_events);
+
+			spin_unlock_irqrestore(&host->irq_lock, irqflags);
 		}
 
 		if (pending & DW_MCI_DATA_ERROR_FLAGS) {
@@ -2668,8 +2705,12 @@ static irqreturn_t dw_mci_interrupt(int irq, void *dev_id)
 		}
 
 		if (pending & SDMMC_INT_CMD_DONE) {
+			spin_lock_irqsave(&host->irq_lock, irqflags);
+
 			mci_writel(host, RINTSTS, SDMMC_INT_CMD_DONE);
 			dw_mci_cmd_interrupt(host, pending);
+
+			spin_unlock_irqrestore(&host->irq_lock, irqflags);
 		}
 
 		if (pending & SDMMC_INT_CD) {
@@ -2943,7 +2984,35 @@ static void dw_mci_cmd11_timer(unsigned long arg)
 static void dw_mci_cto_timer(unsigned long arg)
 {
 	struct dw_mci *host = (struct dw_mci *)arg;
+	unsigned long irqflags;
+	u32 pending;
 
+	spin_lock_irqsave(&host->irq_lock, irqflags);
+
+	/*
+	 * If somehow we have very bad interrupt latency it's remotely possible
+	 * that the timer could fire while the interrupt is still pending or
+	 * while the interrupt is midway through running.  Let's be paranoid
+	 * and detect those two cases.  Note that this is paranoia is somewhat
+	 * justified because in this function we don't actually cancel the
+	 * pending command in the controller--we just assume it will never come.
+	 */
+	pending = mci_readl(host, MINTSTS); /* read-only mask reg */
+	if (pending & (DW_MCI_CMD_ERROR_FLAGS | SDMMC_INT_CMD_DONE)) {
+		/* The interrupt should fire; no need to act but we can warn */
+		dev_warn(host->dev, "Unexpected interrupt latency\n");
+		goto exit;
+	}
+	if (test_bit(EVENT_CMD_COMPLETE, &host->pending_events)) {
+		/* Presumably interrupt handler couldn't delete the timer */
+		dev_warn(host->dev, "CTO timeout when already completed\n");
+		goto exit;
+	}
+
+	/*
+	 * Continued paranoia to make sure we're in the state we expect.
+	 * This paranoia isn't really justified but it seems good to be safe.
+	 */
 	switch (host->state) {
 	case STATE_SENDING_CMD11:
 	case STATE_SENDING_CMD:
@@ -2962,6 +3031,9 @@ static void dw_mci_cto_timer(unsigned long arg)
 			 host->state);
 		break;
 	}
+
+exit:
+	spin_unlock_irqrestore(&host->irq_lock, irqflags);
 }
 
 static void dw_mci_dto_timer(unsigned long arg)
-- 
2.14.2.822.g60be5d43e6-goog

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 3/3] mmc: dw_mmc: Add locking to the CTO timer
  2017-09-27 20:56 ` [PATCH 3/3] mmc: dw_mmc: Add locking to the CTO timer Douglas Anderson
@ 2017-10-03 18:45   ` Doug Anderson
  2017-10-09  7:41   ` Shawn Lin
  1 sibling, 0 replies; 10+ messages in thread
From: Doug Anderson @ 2017-10-03 18:45 UTC (permalink / raw)
  To: Jaehoon Chung, Ulf Hansson, Shawn Lin
  Cc: Brian Norris, Alexandru M Stan, Ziyuan Xu, Douglas Anderson,
	linux-mmc@vger.kernel.org, linux-kernel@vger.kernel.org,
	open list:ARM/Rockchip SoC...

Hi,

On Wed, Sep 27, 2017 at 1:56 PM, Douglas Anderson <dianders@chromium.org> wrote:
> This attempts to instill a bit of paranoia to the code dealing with
> the CTO timer.  It's believed that this will make the CTO timer more
> robust in the case that we're having very long interrupt latencies.
>
> Note that I originally thought that perhaps this patch was being
> overly paranoid and wasn't really needed, but then while I was running
> mmc_test on an rk3399 board I saw one instance of the message:
>   dwmmc_rockchip fe320000.dwmmc: Unexpected interrupt latency
>
> I had debug prints in the CTO timer code and I found that it was
> running CMD 13 at the time.
>
> ...so even though this patch seems like it might be overly paranoid,
> maybe it really isn't?
>
> Presumably the bad interrupt latency experienced was due to the fact
> that I had serial console enabled as serial console is typically where
> I place blame when I see absurdly large interrupt latencies.  In this
> particular case there was an (unrelated) printout to the serial
> console just before I saw the "Unexpected interrupt latency" printout.
>
> ...and actually, I managed to even reproduce the problems by running
> "iw mlan0 scan > /dev/null" while mmc_test was running.  That not only
> does a bunch of PCIe traffic but it also (on my system) outputs some
> SELinux log spam.
>
> Fixes: 03de19212ea3 ("mmc: dw_mmc: introduce timer for broken command transfer over scheme")
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
>
>  drivers/mmc/host/dw_mmc.c | 92 +++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 82 insertions(+), 10 deletions(-)

I'm hoping that someone upstream might have some time to test and
review these three patches since I believe that they fix a regression
in 4.14.  I pinged Shawn Lin and found that it's a Chinese holiday
right now so he won't be able to test util next week, but presumably
others might be noticing problems with SD cards or eMMC using dw_mmc
on kernel 4.14, especially with serial console enabled?

Adding linux-rockchip to the CC here.  Original patches can be found
on patchwork.kernel.org...

Thanks!

-Doug

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/3] mmc: dw_mmc: cancel the CTO timer after a voltage switch
  2017-09-27 20:56 ` [PATCH 1/3] mmc: dw_mmc: cancel the CTO timer after a voltage switch Douglas Anderson
@ 2017-10-09  6:57   ` Shawn Lin
  0 siblings, 0 replies; 10+ messages in thread
From: Shawn Lin @ 2017-10-09  6:57 UTC (permalink / raw)
  To: Douglas Anderson
  Cc: jh80.chung, ulf.hansson, shawn.lin, briannorris, amstan, xzy.xu,
	linux-mmc, linux-kernel


On 2017/9/28 4:56, Douglas Anderson wrote:
> When running with the commit 03de19212ea3 ("mmc: dw_mmc: introduce
> timer for broken command transfer over scheme") I found this message
> in the log:
>    Unexpected command timeout, state 7
> 
> It turns out that we weren't properly cancelling the new CTO timer in
> the case that a voltage switch was done.  Let's promote the cancel
> into the dw_mci_cmd_interrupt() function to fix this.
> 

Reviewed-by: Shawn Lin <shawn.lin@rock-chips.com>

> Fixes: 03de19212ea3 ("mmc: dw_mmc: introduce timer for broken command transfer over scheme")
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
> 
>   drivers/mmc/host/dw_mmc.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
> index 860313bd952a..f5b2bb4b4d98 100644
> --- a/drivers/mmc/host/dw_mmc.c
> +++ b/drivers/mmc/host/dw_mmc.c
> @@ -2570,6 +2570,8 @@ static void dw_mci_write_data_pio(struct dw_mci *host)
>   
>   static void dw_mci_cmd_interrupt(struct dw_mci *host, u32 status)
>   {
> +	del_timer(&host->cto_timer);
> +
>   	if (!host->cmd_status)
>   		host->cmd_status = status;
>   
> @@ -2662,7 +2664,6 @@ static irqreturn_t dw_mci_interrupt(int irq, void *dev_id)
>   		}
>   
>   		if (pending & SDMMC_INT_CMD_DONE) {
> -			del_timer(&host->cto_timer);
>   			mci_writel(host, RINTSTS, SDMMC_INT_CMD_DONE);
>   			dw_mci_cmd_interrupt(host, pending);
>   		}
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/3] mmc: dw_mmc: Fix the CTO timeout calculation
  2017-09-27 20:56 ` [PATCH 2/3] mmc: dw_mmc: Fix the CTO timeout calculation Douglas Anderson
@ 2017-10-09  7:03   ` Shawn Lin
  2017-10-11 23:55     ` Doug Anderson
  0 siblings, 1 reply; 10+ messages in thread
From: Shawn Lin @ 2017-10-09  7:03 UTC (permalink / raw)
  To: Douglas Anderson
  Cc: jh80.chung, ulf.hansson, shawn.lin, briannorris, amstan, xzy.xu,
	linux-mmc, linux-kernel

Hi

On 2017/9/28 4:56, Douglas Anderson wrote:
> In the commit 03de19212ea3 ("mmc: dw_mmc: introduce timer for broken
> command transfer over scheme") we tried to calculate the expected
> hardware command timeout value.  Unfortunately that calculation isn't
> quite correct in all cases.  It used "bus_hz" but, as far as I can
> tell, it's supposed to use the card clock.  Let's account for the div
> value, which is documented as 2x the value stored in the register, or
> 1 if the register is 0.

Good catch.
Would you mind appending a new patch to fix the drto case?

Reviewed-by: Shawn Lin <shawn.lin@rock-chips.com>

> 
> NOTE: It's not expected that this will actually fix anything important
> since the 10 ms margin added by the function will pretty much dwarf
> any calculations.  The card clock should be 100 kHz at minimum and:
>    1000 ms/s * (255 * 2) / 100000 Hz.
> Gives us 5.1 ms.
> 
> ...so really the point of this patch is just to make the code more
> "correct" in case anyone ever tries to remove the 10 ms buffer.
> 
> Fixes: 03de19212ea3 ("mmc: dw_mmc: introduce timer for broken command transfer over scheme")
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
> 
>   drivers/mmc/host/dw_mmc.c | 6 +++++-
>   1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
> index f5b2bb4b4d98..16516c528a88 100644
> --- a/drivers/mmc/host/dw_mmc.c
> +++ b/drivers/mmc/host/dw_mmc.c
> @@ -401,10 +401,14 @@ static u32 dw_mci_prep_stop_abort(struct dw_mci *host, struct mmc_command *cmd)
>   static inline void dw_mci_set_cto(struct dw_mci *host)
>   {
>   	unsigned int cto_clks;
> +	unsigned int cto_div;
>   	unsigned int cto_ms;
>   
>   	cto_clks = mci_readl(host, TMOUT) & 0xff;
> -	cto_ms = DIV_ROUND_UP(cto_clks, host->bus_hz / 1000);
> +	cto_div = (mci_readl(host, CLKDIV) & 0xff) * 2;
> +	if (cto_div == 0)
> +		cto_div = 1;
> +	cto_ms = DIV_ROUND_UP(MSEC_PER_SEC * cto_clks * cto_div, host->bus_hz);
>   
>   	/* add a bit spare time */
>   	cto_ms += 10;
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 3/3] mmc: dw_mmc: Add locking to the CTO timer
  2017-09-27 20:56 ` [PATCH 3/3] mmc: dw_mmc: Add locking to the CTO timer Douglas Anderson
  2017-10-03 18:45   ` Doug Anderson
@ 2017-10-09  7:41   ` Shawn Lin
  2017-10-11 23:53     ` Doug Anderson
  1 sibling, 1 reply; 10+ messages in thread
From: Shawn Lin @ 2017-10-09  7:41 UTC (permalink / raw)
  To: Douglas Anderson
  Cc: jh80.chung, ulf.hansson, shawn.lin, briannorris, amstan, xzy.xu,
	linux-mmc, linux-kernel

Hi Doug,

On 2017/9/28 4:56, Douglas Anderson wrote:
> This attempts to instill a bit of paranoia to the code dealing with
> the CTO timer.  It's believed that this will make the CTO timer more
> robust in the case that we're having very long interrupt latencies.
> 

I have already got reports about the similar problem that on one of
the Rockchip platforms, some IP cyclicity idle the bus too long,
so that cto timer fires but actually it shouldn't.

> Note that I originally thought that perhaps this patch was being
> overly paranoid and wasn't really needed, but then while I was running
> mmc_test on an rk3399 board I saw one instance of the message:
>    dwmmc_rockchip fe320000.dwmmc: Unexpected interrupt latency
> 

The intention of introducing CTO and DRTO timers is simply to break the
dead loop due to the bug of controller itself but it seems the timer
should take more factors into consideration. So it's more complicated
than expected, expecially we should also fix the drto case like what
this patch does...

How about combining the cto and drto timer and re-new a catch-all
timer like what SDHCI did?


> I had debug prints in the CTO timer code and I found that it was
> running CMD 13 at the time.
> 
> ...so even though this patch seems like it might be overly paranoid,
> maybe it really isn't?
> 
> Presumably the bad interrupt latency experienced was due to the fact
> that I had serial console enabled as serial console is typically where
> I place blame when I see absurdly large interrupt latencies.  In this
> particular case there was an (unrelated) printout to the serial
> console just before I saw the "Unexpected interrupt latency" printout.
>  > ...and actually, I managed to even reproduce the problems by running
> "iw mlan0 scan > /dev/null" while mmc_test was running.  That not only
> does a bunch of PCIe traffic but it also (on my system) outputs some
> SELinux log spam.
> 
> Fixes: 03de19212ea3 ("mmc: dw_mmc: introduce timer for broken command transfer over scheme")
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
> 
>   drivers/mmc/host/dw_mmc.c | 92 +++++++++++++++++++++++++++++++++++++++++------
>   1 file changed, 82 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
> index 16516c528a88..6ecfe35094dd 100644
> --- a/drivers/mmc/host/dw_mmc.c
> +++ b/drivers/mmc/host/dw_mmc.c
> @@ -403,6 +403,7 @@ static inline void dw_mci_set_cto(struct dw_mci *host)
>   	unsigned int cto_clks;
>   	unsigned int cto_div;
>   	unsigned int cto_ms;
> +	unsigned long irqflags;
>   
>   	cto_clks = mci_readl(host, TMOUT) & 0xff;
>   	cto_div = (mci_readl(host, CLKDIV) & 0xff) * 2;
> @@ -413,8 +414,24 @@ static inline void dw_mci_set_cto(struct dw_mci *host)
>   	/* add a bit spare time */
>   	cto_ms += 10;
>   
> -	mod_timer(&host->cto_timer,
> -		  jiffies + msecs_to_jiffies(cto_ms) + 1);
> +	/*
> +	 * The durations we're working with are fairly short so we have to be
> +	 * extra careful about synchronization here.  Specifically in hardware a
> +	 * command timeout is _at most_ 5.1 ms, so that means we expect an
> +	 * interrupt (either command done or timeout) to come rather quickly
> +	 * after the mci_writel.  ...but just in case we have a long interrupt
> +	 * latency let's add a bit of paranoia.
> +	 *
> +	 * In general we'll assume that at least an interrupt will be asserted
> +	 * in hardware by the time the cto_timer runs.  ...and if it hasn't
> +	 * been asserted in hardware by that time then we'll assume it'll never
> +	 * come.
> +	 */
> +	spin_lock_irqsave(&host->irq_lock, irqflags);
> +	if (!test_bit(EVENT_CMD_COMPLETE, &host->pending_events))
> +		mod_timer(&host->cto_timer,
> +			jiffies + msecs_to_jiffies(cto_ms) + 1);
> +	spin_unlock_irqrestore(&host->irq_lock, irqflags);
>   }
>   
>   static void dw_mci_start_command(struct dw_mci *host,
> @@ -429,11 +446,11 @@ static void dw_mci_start_command(struct dw_mci *host,
>   	wmb(); /* drain writebuffer */
>   	dw_mci_wait_while_busy(host, cmd_flags);
>   
> +	mci_writel(host, CMD, cmd_flags | SDMMC_CMD_START);
> +
>   	/* response expected command only */
>   	if (cmd_flags & SDMMC_CMD_RESP_EXP)
>   		dw_mci_set_cto(host);
> -
> -	mci_writel(host, CMD, cmd_flags | SDMMC_CMD_START);
>   }
>   
>   static inline void send_stop_abort(struct dw_mci *host, struct mmc_data *data)
> @@ -1930,6 +1947,24 @@ static void dw_mci_set_drto(struct dw_mci *host)
>   	mod_timer(&host->dto_timer, jiffies + msecs_to_jiffies(drto_ms));
>   }
>   
> +static bool dw_mci_clear_pending_cmd_complete(struct dw_mci *host)
> +{
> +	if (!test_bit(EVENT_CMD_COMPLETE, &host->pending_events))
> +		return false;
> +
> +	/*
> +	 * Really be certain that the timer has stopped.  This is a bit of
> +	 * paranoia and could only really happen if we had really bad
> +	 * interrupt latency and the interrupt routine and timeout were
> +	 * running concurrently so that the del_timer() in the interrupt
> +	 * handler couldn't run.
> +	 */
> +	WARN_ON(del_timer_sync(&host->cto_timer));
> +	clear_bit(EVENT_CMD_COMPLETE, &host->pending_events);
> +
> +	return true;
> +}
> +
>   static void dw_mci_tasklet_func(unsigned long priv)
>   {
>   	struct dw_mci *host = (struct dw_mci *)priv;
> @@ -1956,8 +1991,7 @@ static void dw_mci_tasklet_func(unsigned long priv)
>   
>   		case STATE_SENDING_CMD11:
>   		case STATE_SENDING_CMD:
> -			if (!test_and_clear_bit(EVENT_CMD_COMPLETE,
> -						&host->pending_events))
> +			if (!dw_mci_clear_pending_cmd_complete(host))
>   				break;
>   
>   			cmd = host->cmd;
> @@ -2126,8 +2160,7 @@ static void dw_mci_tasklet_func(unsigned long priv)
>   			/* fall through */
>   
>   		case STATE_SENDING_STOP:
> -			if (!test_and_clear_bit(EVENT_CMD_COMPLETE,
> -						&host->pending_events))
> +			if (!dw_mci_clear_pending_cmd_complete(host))
>   				break;
>   
>   			/* CMD error in data command */
> @@ -2600,6 +2633,8 @@ static irqreturn_t dw_mci_interrupt(int irq, void *dev_id)
>   	struct dw_mci *host = dev_id;
>   	u32 pending;
>   	struct dw_mci_slot *slot = host->slot;
> +	unsigned long irqflags;
> +	int i;
>   
>   	pending = mci_readl(host, MINTSTS); /* read-only mask reg */
>   
> @@ -2607,8 +2642,6 @@ static irqreturn_t dw_mci_interrupt(int irq, void *dev_id)
>   		/* Check volt switch first, since it can look like an error */
>   		if ((host->state == STATE_SENDING_CMD11) &&
>   		    (pending & SDMMC_INT_VOLT_SWITCH)) {
> -			unsigned long irqflags;
> -
>   			mci_writel(host, RINTSTS, SDMMC_INT_VOLT_SWITCH);
>   			pending &= ~SDMMC_INT_VOLT_SWITCH;
>   
> @@ -2624,11 +2657,15 @@ static irqreturn_t dw_mci_interrupt(int irq, void *dev_id)
>   		}
>   
>   		if (pending & DW_MCI_CMD_ERROR_FLAGS) {
> +			spin_lock_irqsave(&host->irq_lock, irqflags);
> +
>   			del_timer(&host->cto_timer);
>   			mci_writel(host, RINTSTS, DW_MCI_CMD_ERROR_FLAGS);
>   			host->cmd_status = pending;
>   			smp_wmb(); /* drain writebuffer */
>   			set_bit(EVENT_CMD_COMPLETE, &host->pending_events);
> +
> +			spin_unlock_irqrestore(&host->irq_lock, irqflags);
>   		}
>   
>   		if (pending & DW_MCI_DATA_ERROR_FLAGS) {
> @@ -2668,8 +2705,12 @@ static irqreturn_t dw_mci_interrupt(int irq, void *dev_id)
>   		}
>   
>   		if (pending & SDMMC_INT_CMD_DONE) {
> +			spin_lock_irqsave(&host->irq_lock, irqflags);
> +
>   			mci_writel(host, RINTSTS, SDMMC_INT_CMD_DONE);
>   			dw_mci_cmd_interrupt(host, pending);
> +
> +			spin_unlock_irqrestore(&host->irq_lock, irqflags);
>   		}
>   
>   		if (pending & SDMMC_INT_CD) {
> @@ -2943,7 +2984,35 @@ static void dw_mci_cmd11_timer(unsigned long arg)
>   static void dw_mci_cto_timer(unsigned long arg)
>   {
>   	struct dw_mci *host = (struct dw_mci *)arg;
> +	unsigned long irqflags;
> +	u32 pending;
>   
> +	spin_lock_irqsave(&host->irq_lock, irqflags);
> +
> +	/*
> +	 * If somehow we have very bad interrupt latency it's remotely possible
> +	 * that the timer could fire while the interrupt is still pending or
> +	 * while the interrupt is midway through running.  Let's be paranoid
> +	 * and detect those two cases.  Note that this is paranoia is somewhat
> +	 * justified because in this function we don't actually cancel the
> +	 * pending command in the controller--we just assume it will never come.
> +	 */
> +	pending = mci_readl(host, MINTSTS); /* read-only mask reg */
> +	if (pending & (DW_MCI_CMD_ERROR_FLAGS | SDMMC_INT_CMD_DONE)) {
> +		/* The interrupt should fire; no need to act but we can warn */
> +		dev_warn(host->dev, "Unexpected interrupt latency\n");
> +		goto exit;
> +	}
> +	if (test_bit(EVENT_CMD_COMPLETE, &host->pending_events)) {
> +		/* Presumably interrupt handler couldn't delete the timer */
> +		dev_warn(host->dev, "CTO timeout when already completed\n");
> +		goto exit;
> +	}
> +
> +	/*
> +	 * Continued paranoia to make sure we're in the state we expect.
> +	 * This paranoia isn't really justified but it seems good to be safe.
> +	 */
>   	switch (host->state) {
>   	case STATE_SENDING_CMD11:
>   	case STATE_SENDING_CMD:
> @@ -2962,6 +3031,9 @@ static void dw_mci_cto_timer(unsigned long arg)
>   			 host->state);
>   		break;
>   	}
> +
> +exit:
> +	spin_unlock_irqrestore(&host->irq_lock, irqflags);
>   }
>   
>   static void dw_mci_dto_timer(unsigned long arg)
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 3/3] mmc: dw_mmc: Add locking to the CTO timer
  2017-10-09  7:41   ` Shawn Lin
@ 2017-10-11 23:53     ` Doug Anderson
  0 siblings, 0 replies; 10+ messages in thread
From: Doug Anderson @ 2017-10-11 23:53 UTC (permalink / raw)
  To: Shawn Lin
  Cc: Jaehoon Chung, Ulf Hansson, Brian Norris, Alexandru M Stan,
	Ziyuan Xu, linux-mmc@vger.kernel.org,
	linux-kernel@vger.kernel.org, Emil Renner Berthing

Hi,

On Mon, Oct 9, 2017 at 12:41 AM, Shawn Lin <shawn.lin@rock-chips.com> wrote:
> Hi Doug,
>
> On 2017/9/28 4:56, Douglas Anderson wrote:
>>
>> This attempts to instill a bit of paranoia to the code dealing with
>> the CTO timer.  It's believed that this will make the CTO timer more
>> robust in the case that we're having very long interrupt latencies.
>>
>
> I have already got reports about the similar problem that on one of
> the Rockchip platforms, some IP cyclicity idle the bus too long,
> so that cto timer fires but actually it shouldn't.

So presumably this patch fixes them?

>> Note that I originally thought that perhaps this patch was being
>> overly paranoid and wasn't really needed, but then while I was running
>> mmc_test on an rk3399 board I saw one instance of the message:
>>    dwmmc_rockchip fe320000.dwmmc: Unexpected interrupt latency
>>
>
> The intention of introducing CTO and DRTO timers is simply to break the
> dead loop due to the bug of controller itself but it seems the timer
> should take more factors into consideration. So it's more complicated
> than expected, expecially we should also fix the drto case like what
> this patch does...
>
> How about combining the cto and drto timer and re-new a catch-all
> timer like what SDHCI did?

I don't think that having separate timers really adds a lot of
overhead though, does it?  gdb shows "struct timer_list" on arm64 as
being 112 bytes so I guess there's some overhead there.  ...but
presumably you'd then need to add some code to differentiate the
command timeout and data timeout.  That adds more code and more
changes and makes this patch series riskier to backport to stable
trees (and, since it fixes a regression, I think it should be
backported assuming we don't land this in time for 4.14, which is
looking unlikely).

So I guess I'd say I'll keep them as separate timers in my patch
series unless someone is really upset by it and if someone wants to
see if things are cleaner by changing it to one timer then it'd be
great!

>> @@ -2600,6 +2633,8 @@ static irqreturn_t dw_mci_interrupt(int irq, void
>> *dev_id)
>>         struct dw_mci *host = dev_id;
>>         u32 pending;
>>         struct dw_mci_slot *slot = host->slot;
>> +       unsigned long irqflags;
>> +       int i;

It was pointed out by Emil that I messed up while merging this from
the Chromium OS tree (on kernel 4.4) and accidentally added "i" in
here.  While the patch submitted compiles and works it introduces a
stupid compiler warning.

I'll re-post tomorrow.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/3] mmc: dw_mmc: Fix the CTO timeout calculation
  2017-10-09  7:03   ` Shawn Lin
@ 2017-10-11 23:55     ` Doug Anderson
  0 siblings, 0 replies; 10+ messages in thread
From: Doug Anderson @ 2017-10-11 23:55 UTC (permalink / raw)
  To: Shawn Lin
  Cc: Jaehoon Chung, Ulf Hansson, Brian Norris, Alexandru M Stan,
	Ziyuan Xu, linux-mmc@vger.kernel.org,
	linux-kernel@vger.kernel.org, Emil Renner Berthing

Hi,

On Mon, Oct 9, 2017 at 12:03 AM, Shawn Lin <shawn.lin@rock-chips.com> wrote:
> Hi
>
> On 2017/9/28 4:56, Douglas Anderson wrote:
>>
>> In the commit 03de19212ea3 ("mmc: dw_mmc: introduce timer for broken
>> command transfer over scheme") we tried to calculate the expected
>> hardware command timeout value.  Unfortunately that calculation isn't
>> quite correct in all cases.  It used "bus_hz" but, as far as I can
>> tell, it's supposed to use the card clock.  Let's account for the div
>> value, which is documented as 2x the value stored in the register, or
>> 1 if the register is 0.
>
>
> Good catch.
> Would you mind appending a new patch to fix the drto case?

I can add it to the series.  It'll be a separate patch, though, since
I wouldn't suggest backporting that one.  The DRTO timeout is so long
that it's really hard to hit it in this way so I think the argument is
much more academic there.  Still good to fix it, though.

-Doug

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2017-10-11 23:55 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-09-27 20:56 [PATCH 0/3] mmc: dw_mmc: Fix the CTO timer patch Douglas Anderson
2017-09-27 20:56 ` [PATCH 1/3] mmc: dw_mmc: cancel the CTO timer after a voltage switch Douglas Anderson
2017-10-09  6:57   ` Shawn Lin
2017-09-27 20:56 ` [PATCH 2/3] mmc: dw_mmc: Fix the CTO timeout calculation Douglas Anderson
2017-10-09  7:03   ` Shawn Lin
2017-10-11 23:55     ` Doug Anderson
2017-09-27 20:56 ` [PATCH 3/3] mmc: dw_mmc: Add locking to the CTO timer Douglas Anderson
2017-10-03 18:45   ` Doug Anderson
2017-10-09  7:41   ` Shawn Lin
2017-10-11 23:53     ` Doug Anderson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).