[PATCH v2] can: sja1000: Always restart the Tx queue after an overrun

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2] can: sja1000: Always restart the Tx queue after an overrun
@ 2023-09-27 16:44 Miquel Raynal
  2023-09-28  7:53 ` Marc Kleine-Budde
  0 siblings, 1 reply; 3+ messages in thread
From: Miquel Raynal @ 2023-09-27 16:44 UTC (permalink / raw)
  To: Wolfgang Grandegger, Marc Kleine-Budde
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	netdev, linux-can, Jérémie Dautheribes,
	Thomas Petazzoni, sylvain.girard, pascal.eberhard, Miquel Raynal,
	stable

Upstream commit 717c6ec241b5 ("can: sja1000: Prevent overrun stalls with
a soft reset on Renesas SoCs") fixes an issue with Renesas own SJA1000
CAN controller reception: the Rx buffer is only 5 messages long, so when
the bus loaded (eg. a message every 50us), overrun may easily
happen. Upon an overrun situation, due to a possible internal crosstalk
situation, the controller enters a frozen state which only can be
unlocked with a soft reset (experimentally). The solution was to offload
a call to sja1000_start() in a threaded handler. This needs to happen in
process context as this operation requires to sleep. sja1000_start()
basically enters "reset mode", performs a proper software reset and
returns back into "normal mode".

Since this fix was introduced, we no longer observe any stalls in
reception. However it was sporadically observed that the transmit path
would now freeze. Further investigation blamed the fix mentioned above,
and especially the reset operation. Reproducing the reset in a loop
helped identifying what could possibly go wrong. The sja1000 is a single
Tx queue device, which leverages the netdev helpers to process one Tx
message at a time. The logic is: the queue is stopped, the message sent
to the transceiver, once properly transmitted the controller sets a
status bit which triggers an interrupt, in the interrupt handler the
transmission status is checked and the queue woken up. Unfortunately, if
an overrun happens, we might perform the soft reset precisely between
the transmission of the buffer to the transceiver and the advent of the
transmission status bit. We would then stop the transmission operation
without re-enabling the queue, leading to all further transmissions to
be ignored.

The reset interrupt can only happen while the device is "open", and
after a reset we anyway want to resume normal operations, no matter if a
packet to transmit got dropped in the process, so we shall wake up the
queue. Restarting the device and waking-up the queue is exactly what
sja1000_set_mode(CAN_MODE_START) does. In order to be consistent about
the queue state, we must acquire a lock both in the reset handler and in
the transmit path to ensure serialization of both operations. As the
reset handler might still be called after the transmission of a frame to
the transceiver but before it actually gets transmitted, we must ensure
we don't leak the skb, so we free it (the behavior is consistent, no
matter if there was an skb on the stack or not).

Fixes: 717c6ec241b5 ("can: sja1000: Prevent overrun stalls with a soft reset on Renesas SoCs")
Cc: stable@vger.kernel.org
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
---

Changes in v2:
* As Marc sugested, use netif_tx_{,un}lock() instead of our own
  spin_lock.

 drivers/net/can/sja1000/sja1000.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/net/can/sja1000/sja1000.c b/drivers/net/can/sja1000/sja1000.c
index ae47fc72aa96..91e3fb3eed20 100644
--- a/drivers/net/can/sja1000/sja1000.c
+++ b/drivers/net/can/sja1000/sja1000.c
@@ -297,6 +297,7 @@ static netdev_tx_t sja1000_start_xmit(struct sk_buff *skb,
 	if (can_dropped_invalid_skb(dev, skb))
 		return NETDEV_TX_OK;

+	netif_tx_lock(dev);
 	netif_stop_queue(dev);

 	fi = dlc = cf->can_dlc;
@@ -335,6 +336,8 @@ static netdev_tx_t sja1000_start_xmit(struct sk_buff *skb,

 	sja1000_write_cmdreg(priv, cmd_reg_val);

+	netif_tx_unlock(dev);
+
 	return NETDEV_TX_OK;
 }

@@ -396,7 +399,13 @@ static irqreturn_t sja1000_reset_interrupt(int irq, void *dev_id)
 	struct net_device *dev = (struct net_device *)dev_id;

 	netdev_dbg(dev, "performing a soft reset upon overrun\n");
-	sja1000_start(dev);
+
+	netif_tx_lock(dev);
+
+	can_free_echo_skb(dev, 0);
+	sja1000_set_mode(dev, CAN_MODE_START);
+
+	netif_tx_unlock(dev);

 	return IRQ_HANDLED;
 }
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH v2] can: sja1000: Always restart the Tx queue after an overrun
  2023-09-27 16:44 [PATCH v2] can: sja1000: Always restart the Tx queue after an overrun Miquel Raynal
@ 2023-09-28  7:53 ` Marc Kleine-Budde
  2023-10-02 14:26   ` Miquel Raynal
  0 siblings, 1 reply; 3+ messages in thread
From: Marc Kleine-Budde @ 2023-09-28  7:53 UTC (permalink / raw)
  To: Miquel Raynal
  Cc: Wolfgang Grandegger, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet, netdev, linux-can, Jérémie Dautheribes,
	Thomas Petazzoni, sylvain.girard, pascal.eberhard, stable

[-- Attachment #1: Type: text/plain, Size: 4703 bytes --]

On 27.09.2023 18:44:42, Miquel Raynal wrote:
> Upstream commit 717c6ec241b5 ("can: sja1000: Prevent overrun stalls with
> a soft reset on Renesas SoCs") fixes an issue with Renesas own SJA1000
> CAN controller reception: the Rx buffer is only 5 messages long, so when
> the bus loaded (eg. a message every 50us), overrun may easily
> happen. Upon an overrun situation, due to a possible internal crosstalk
> situation, the controller enters a frozen state which only can be
> unlocked with a soft reset (experimentally). The solution was to offload
> a call to sja1000_start() in a threaded handler. This needs to happen in
> process context as this operation requires to sleep. sja1000_start()
> basically enters "reset mode", performs a proper software reset and
> returns back into "normal mode".
> 
> Since this fix was introduced, we no longer observe any stalls in
> reception. However it was sporadically observed that the transmit path
> would now freeze. Further investigation blamed the fix mentioned above,
> and especially the reset operation. Reproducing the reset in a loop
> helped identifying what could possibly go wrong. The sja1000 is a single
> Tx queue device, which leverages the netdev helpers to process one Tx
> message at a time. The logic is: the queue is stopped, the message sent
> to the transceiver, once properly transmitted the controller sets a
> status bit which triggers an interrupt, in the interrupt handler the
> transmission status is checked and the queue woken up. Unfortunately, if
> an overrun happens, we might perform the soft reset precisely between
> the transmission of the buffer to the transceiver and the advent of the
> transmission status bit. We would then stop the transmission operation
> without re-enabling the queue, leading to all further transmissions to
> be ignored.
> 
> The reset interrupt can only happen while the device is "open", and
> after a reset we anyway want to resume normal operations, no matter if a
> packet to transmit got dropped in the process, so we shall wake up the
> queue. Restarting the device and waking-up the queue is exactly what
> sja1000_set_mode(CAN_MODE_START) does. In order to be consistent about
> the queue state, we must acquire a lock both in the reset handler and in
> the transmit path to ensure serialization of both operations. As the
> reset handler might still be called after the transmission of a frame to
> the transceiver but before it actually gets transmitted, we must ensure
> we don't leak the skb, so we free it (the behavior is consistent, no
> matter if there was an skb on the stack or not).
> 
> Fixes: 717c6ec241b5 ("can: sja1000: Prevent overrun stalls with a soft reset on Renesas SoCs")
> Cc: stable@vger.kernel.org
> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
> ---
> 
> Changes in v2:
> * As Marc sugested, use netif_tx_{,un}lock() instead of our own
>   spin_lock.
> 
>  drivers/net/can/sja1000/sja1000.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/can/sja1000/sja1000.c b/drivers/net/can/sja1000/sja1000.c
> index ae47fc72aa96..91e3fb3eed20 100644
> --- a/drivers/net/can/sja1000/sja1000.c
> +++ b/drivers/net/can/sja1000/sja1000.c
> @@ -297,6 +297,7 @@ static netdev_tx_t sja1000_start_xmit(struct sk_buff *skb,
>  	if (can_dropped_invalid_skb(dev, skb))
>  		return NETDEV_TX_OK;
>  
> +	netif_tx_lock(dev);
>  	netif_stop_queue(dev);
>  
>  	fi = dlc = cf->can_dlc;
> @@ -335,6 +336,8 @@ static netdev_tx_t sja1000_start_xmit(struct sk_buff *skb,
>  
>  	sja1000_write_cmdreg(priv, cmd_reg_val);
>  
> +	netif_tx_unlock(dev);
> +

I think netif_tx_lock() should be used in a different way. As far as I
understand it, you should call it only in the sja1000_reset_interrupt(),
where you want to tx path to interfere.

Please test the new code with lockdep enabled.

Marc

>  	return NETDEV_TX_OK;
>  }
>  
> @@ -396,7 +399,13 @@ static irqreturn_t sja1000_reset_interrupt(int irq, void *dev_id)
>  	struct net_device *dev = (struct net_device *)dev_id;
>  
>  	netdev_dbg(dev, "performing a soft reset upon overrun\n");
> -	sja1000_start(dev);
> +
> +	netif_tx_lock(dev);
> +
> +	can_free_echo_skb(dev, 0);
> +	sja1000_set_mode(dev, CAN_MODE_START);
> +
> +	netif_tx_unlock(dev);
>  
>  	return IRQ_HANDLED;
>  }
> -- 
> 2.34.1
> 
> 

-- 
Pengutronix e.K.                 | Marc Kleine-Budde          |
Embedded Linux                   | https://www.pengutronix.de |
Vertretung Nürnberg              | Phone: +49-5121-206917-129 |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-9   |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH v2] can: sja1000: Always restart the Tx queue after an overrun
  2023-09-28  7:53 ` Marc Kleine-Budde
@ 2023-10-02 14:26   ` Miquel Raynal
  0 siblings, 0 replies; 3+ messages in thread
From: Miquel Raynal @ 2023-10-02 14:26 UTC (permalink / raw)
  To: Marc Kleine-Budde
  Cc: Wolfgang Grandegger, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet, netdev, linux-can, Jérémie Dautheribes,
	Thomas Petazzoni, sylvain.girard, pascal.eberhard, stable

Hi Marc,

mkl@pengutronix.de wrote on Thu, 28 Sep 2023 09:53:17 +0200:

> On 27.09.2023 18:44:42, Miquel Raynal wrote:
> > Upstream commit 717c6ec241b5 ("can: sja1000: Prevent overrun stalls with
> > a soft reset on Renesas SoCs") fixes an issue with Renesas own SJA1000
> > CAN controller reception: the Rx buffer is only 5 messages long, so when
> > the bus loaded (eg. a message every 50us), overrun may easily
> > happen. Upon an overrun situation, due to a possible internal crosstalk
> > situation, the controller enters a frozen state which only can be
> > unlocked with a soft reset (experimentally). The solution was to offload
> > a call to sja1000_start() in a threaded handler. This needs to happen in
> > process context as this operation requires to sleep. sja1000_start()
> > basically enters "reset mode", performs a proper software reset and
> > returns back into "normal mode".
> > 
> > Since this fix was introduced, we no longer observe any stalls in
> > reception. However it was sporadically observed that the transmit path
> > would now freeze. Further investigation blamed the fix mentioned above,
> > and especially the reset operation. Reproducing the reset in a loop
> > helped identifying what could possibly go wrong. The sja1000 is a single
> > Tx queue device, which leverages the netdev helpers to process one Tx
> > message at a time. The logic is: the queue is stopped, the message sent
> > to the transceiver, once properly transmitted the controller sets a
> > status bit which triggers an interrupt, in the interrupt handler the
> > transmission status is checked and the queue woken up. Unfortunately, if
> > an overrun happens, we might perform the soft reset precisely between
> > the transmission of the buffer to the transceiver and the advent of the
> > transmission status bit. We would then stop the transmission operation
> > without re-enabling the queue, leading to all further transmissions to
> > be ignored.
> > 
> > The reset interrupt can only happen while the device is "open", and
> > after a reset we anyway want to resume normal operations, no matter if a
> > packet to transmit got dropped in the process, so we shall wake up the
> > queue. Restarting the device and waking-up the queue is exactly what
> > sja1000_set_mode(CAN_MODE_START) does. In order to be consistent about
> > the queue state, we must acquire a lock both in the reset handler and in
> > the transmit path to ensure serialization of both operations. As the
> > reset handler might still be called after the transmission of a frame to
> > the transceiver but before it actually gets transmitted, we must ensure
> > we don't leak the skb, so we free it (the behavior is consistent, no
> > matter if there was an skb on the stack or not).
> > 
> > Fixes: 717c6ec241b5 ("can: sja1000: Prevent overrun stalls with a soft reset on Renesas SoCs")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
> > ---
> > 
> > Changes in v2:
> > * As Marc sugested, use netif_tx_{,un}lock() instead of our own
> >   spin_lock.
> > 
> >  drivers/net/can/sja1000/sja1000.c | 11 ++++++++++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/net/can/sja1000/sja1000.c b/drivers/net/can/sja1000/sja1000.c
> > index ae47fc72aa96..91e3fb3eed20 100644
> > --- a/drivers/net/can/sja1000/sja1000.c
> > +++ b/drivers/net/can/sja1000/sja1000.c
> > @@ -297,6 +297,7 @@ static netdev_tx_t sja1000_start_xmit(struct sk_buff *skb,
> >  	if (can_dropped_invalid_skb(dev, skb))
> >  		return NETDEV_TX_OK;
> >  
> > +	netif_tx_lock(dev);
> >  	netif_stop_queue(dev);
> >  
> >  	fi = dlc = cf->can_dlc;
> > @@ -335,6 +336,8 @@ static netdev_tx_t sja1000_start_xmit(struct sk_buff *skb,
> >  
> >  	sja1000_write_cmdreg(priv, cmd_reg_val);
> >  
> > +	netif_tx_unlock(dev);
> > +  
> 
> I think netif_tx_lock() should be used in a different way. As far as I
> understand it, you should call it only in the sja1000_reset_interrupt(),
> where you want to tx path to interfere.

I believe you meant "don't want"? And yes you're right current use
can't properly handle my problem.

> Please test the new code with lockdep enabled.

I will fix the current implementation and test again by manually
producing overruns.

Thanks,
Miquèl

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2023-10-02 14:26 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-09-27 16:44 [PATCH v2] can: sja1000: Always restart the Tx queue after an overrun Miquel Raynal
2023-09-28  7:53 ` Marc Kleine-Budde
2023-10-02 14:26   ` Miquel Raynal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).