From: Vitor Soares <ivitro@gmail.com>
To: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>,
Thomas Kopp <thomas.kopp@microchip.com>,
Vincent Mailhol <mailhol.vincent@wanadoo.fr>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Vitor Soares <vitor.soares@toradex.com>,
linux-can@vger.kernel.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, stable@vger.kernel.org
Subject: Re: [PATCH v5] can: mcp251xfd: fix infinite loop when xmit fails
Date: Tue, 14 May 2024 15:34:01 +0100 [thread overview]
Message-ID: <465a2ddb222beed7c90b36c523633fc5648715bb.camel@gmail.com> (raw)
In-Reply-To: <20240514-corgi-of-marvelous-peace-968f5c-mkl@pengutronix.de>
Hi Marc,
Appreciate your feedback.
On Tue, 2024-05-14 at 14:23 +0200, Marc Kleine-Budde wrote:
> On 14.05.2024 11:58:22, Vitor Soares wrote:
> > From: Vitor Soares <vitor.soares@toradex.com>
> >
> > When the mcp251xfd_start_xmit() function fails, the driver stops
> > processing messages, and the interrupt routine does not return,
> > running indefinitely even after killing the running application.
> >
> > Error messages:
> > [ 441.298819] mcp251xfd spi2.0 can0: ERROR in mcp251xfd_start_xmit: -16
> > [ 441.306498] mcp251xfd spi2.0 can0: Transmit Event FIFO buffer not empty.
> > (seq=0x000017c7, tef_tail=0x000017cf, tef_head=0x000017d0,
> > tx_head=0x000017d3).
> > ... and repeat forever.
> >
> > The issue can be triggered when multiple devices share the same
> > SPI interface. And there is concurrent access to the bus.
> >
> > The problem occurs because tx_ring->head increments even if
> > mcp251xfd_start_xmit() fails. Consequently, the driver skips one
> > TX package while still expecting a response in
> > mcp251xfd_handle_tefif_one().
> >
> > This patch resolves the issue by starting a workqueue to write
> > the tx obj synchronously if err = -EBUSY. In case of another error,
> > it decrements tx_ring->head, removes skb from the echo stack, and
> > drops the message.
>
> This looks quite good! Good work!
>
> I think you better move the allocation/destroy of the wq into the open()
> and stop() callbacks. You have to destroy the workqueue before putting
> the interface to sleep.
>
> >
> > Fixes: 55e5b97f003e ("can: mcp25xxfd: add driver for Microchip MCP25xxFD SPI
> > CAN")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Vitor Soares <vitor.soares@toradex.com>
> > ---
> >
> > V4->V5:
> > - Start a workqueue to write tx obj with spi_sync() when spi_async() == -
> > EBUSY.
> >
> > V3->V4:
> > - Leave can_put_echo_skb() and stop the queue if needed, before
> > mcp251xfd_tx_obj_write().
> > - Re-sync head and remove echo skb if mcp251xfd_tx_obj_write() fails.
> > - Revert -> return NETDEV_TX_BUSY if mcp251xfd_tx_obj_write() == -EBUSY.
> >
> > V2->V3:
> > - Add tx_dropped stats.
> > - netdev_sent_queue() only if can_put_echo_skb() succeed.
> >
> > V1->V2:
> > - Return NETDEV_TX_BUSY if mcp251xfd_tx_obj_write() == -EBUSY.
> > - Rework the commit message to address the change above.
> > - Change can_put_echo_skb() to be called after mcp251xfd_tx_obj_write()
> > succeed.
> > Otherwise, we get Kernel NULL pointer dereference error.
> >
> > .../net/can/spi/mcp251xfd/mcp251xfd-core.c | 13 ++++-
> > drivers/net/can/spi/mcp251xfd/mcp251xfd-tx.c | 51 ++++++++++++++++---
> > drivers/net/can/spi/mcp251xfd/mcp251xfd.h | 5 ++
> > 3 files changed, 60 insertions(+), 9 deletions(-)
> >
> > diff --git a/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c
> > b/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c
> > index 1d9057dc44f2..6cca853f2b1e 100644
> > --- a/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c
> > +++ b/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c
> > @@ -2141,15 +2141,25 @@ static int mcp251xfd_probe(struct spi_device *spi)
> > if (err)
> > goto out_free_candev;
> >
> > + priv->tx_work_obj = NULL;
> > + priv->wq = alloc_workqueue("mcp251xfd_wq", WQ_FREEZABLE, 0);
>
> The m_can driver uses a ordered workqueue and you can add the name of
> the spi device to the wq's name :)
>
> priv->wq = alloc_ordered_workqueue("%s-mcp251xfd_wq", WQ_FREEZABLE |
> WQ_MEM_RECLAIM, dev_name(&spi->dev));
>
> > + if (!priv->wq) {
> > + err = -ENOMEM;
> > + goto out_can_rx_offload_del;
> > + }
> > + INIT_WORK(&priv->tx_work, mcp251xfd_tx_obj_write_sync);
> > +
> > err = mcp251xfd_register(priv);
> > if (err) {
> > dev_err_probe(&spi->dev, err, "Failed to detect %s.\n",
> > mcp251xfd_get_model_str(priv));
> > - goto out_can_rx_offload_del;
> > + goto out_can_free_wq;
>
> nitpick:
> out_destroy_workqueue;
>
> to match the function call.
>
> > }
> >
> > return 0;
> >
> > + out_can_free_wq:
> > + destroy_workqueue(priv->wq);
> > out_can_rx_offload_del:
> > can_rx_offload_del(&priv->offload);
> > out_free_candev:
> > @@ -2165,6 +2175,7 @@ static void mcp251xfd_remove(struct spi_device *spi)
> > struct mcp251xfd_priv *priv = spi_get_drvdata(spi);
> > struct net_device *ndev = priv->ndev;
> >
> > + destroy_workqueue(priv->wq);
> > can_rx_offload_del(&priv->offload);
> > mcp251xfd_unregister(priv);
> > spi->max_speed_hz = priv->spi_max_speed_hz_orig;
> > diff --git a/drivers/net/can/spi/mcp251xfd/mcp251xfd-tx.c
> > b/drivers/net/can/spi/mcp251xfd/mcp251xfd-tx.c
> > index 160528d3cc26..1e7ddf316643 100644
> > --- a/drivers/net/can/spi/mcp251xfd/mcp251xfd-tx.c
> > +++ b/drivers/net/can/spi/mcp251xfd/mcp251xfd-tx.c
> > @@ -131,6 +131,41 @@ mcp251xfd_tx_obj_from_skb(const struct mcp251xfd_priv
> > *priv,
> > tx_obj->xfer[0].len = len;
> > }
> >
> > +static void mcp251xfd_tx_failure_drop(const struct mcp251xfd_priv *priv,
> > + struct mcp251xfd_tx_ring *tx_ring,
> > + int err)
> > +{
> > + struct net_device *ndev = priv->ndev;
> > + struct net_device_stats *stats = &ndev->stats;
> > + unsigned int frame_len = 0;
> > + u8 tx_head;
> > +
> > + tx_ring->head--;
> > + stats->tx_dropped++;
> > + tx_head = mcp251xfd_get_tx_head(tx_ring);
> > + can_free_echo_skb(ndev, tx_head, &frame_len);
> > + netdev_completed_queue(ndev, 1, frame_len);
> > + netif_wake_queue(ndev);
> > +
> > + if (net_ratelimit())
> > + netdev_err(priv->ndev, "ERROR in %s: %d\n", __func__, err);
> > +}
> > +
> > +void mcp251xfd_tx_obj_write_sync(struct work_struct *ws)
> > +{
> > + struct mcp251xfd_priv *priv = container_of(ws, struct
> > mcp251xfd_priv,
> > + tx_work);
> > + struct mcp251xfd_tx_obj *tx_obj = priv->tx_work_obj;
> > + struct mcp251xfd_tx_ring *tx_ring = priv->tx;
> > + int err;
> > +
> > + err = spi_sync(priv->spi, &tx_obj->msg);
> > + if (err)
> > + mcp251xfd_tx_failure_drop(priv, tx_ring, err);
> > +
> > + priv->tx_work_obj = NULL;
>
> Race condition:
> - after spi_sync() the CAN frame is send
> - after the TX complete IRQ the TX queue is restarted
> - the xmit handler might get BUSY
> - fill the tx_work_obj again
>
> > +}
> > +
> > static int mcp251xfd_tx_obj_write(const struct mcp251xfd_priv *priv,
> > struct mcp251xfd_tx_obj *tx_obj)
> > {
> > @@ -175,7 +210,7 @@ netdev_tx_t mcp251xfd_start_xmit(struct sk_buff *skb,
> > if (can_dev_dropped_skb(ndev, skb))
> > return NETDEV_TX_OK;
> >
> > - if (mcp251xfd_tx_busy(priv, tx_ring))
> > + if (mcp251xfd_tx_busy(priv, tx_ring) || priv->tx_work_obj)
>
> This should not happen, but better save than sorry.
As there is the race condition you mentioned above, on this condition:
priv->tx_work_obj = tx_obj --> xmit will return NETDEV_TX_BUSY
or
priv->tx_work_obj = NULL --> It goes through the rest of the code or the
workqueue may sleep after setting tx_work_obj to NULL. Should I use work_busy()
here instead or do you have another suggestion?
Everything else is clear to me and I'm addressing it for the v6.
Best regards,
Vitor Soares
next prev parent reply other threads:[~2024-05-14 14:34 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-05-14 10:58 [PATCH v5] can: mcp251xfd: fix infinite loop when xmit fails Vitor Soares
2024-05-14 12:23 ` Marc Kleine-Budde
2024-05-14 14:34 ` Vitor Soares [this message]
2024-05-15 11:12 ` Marc Kleine-Budde
2024-05-15 16:29 ` Vitor Soares
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=465a2ddb222beed7c90b36c523633fc5648715bb.camel@gmail.com \
--to=ivitro@gmail.com \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=kuba@kernel.org \
--cc=linux-can@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mailhol.vincent@wanadoo.fr \
--cc=manivannan.sadhasivam@linaro.org \
--cc=mkl@pengutronix.de \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=stable@vger.kernel.org \
--cc=thomas.kopp@microchip.com \
--cc=vitor.soares@toradex.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox