linux-can.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/3]  can: esd_usb: Fixes
@ 2025-09-24 17:30 Stefan Mätje
  2025-09-24 17:30 ` [PATCH v3 1/3] can: esd_usb: Fix not detecting version reply in probe routine Stefan Mätje
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Stefan Mätje @ 2025-09-24 17:30 UTC (permalink / raw)
  To: Frank Jungclaus, Marc Kleine-Budde, Vincent Mailhol, linux-can,
	socketcan
  Cc: David S . Miller, Jakub Kicinski, Oliver Hartkopp, Simon Horman,
	Wolfgang Grandegger, netdev

This was originally a patch series of 5 patches that is split
into two series, removing the forth and fifth patch. I keep
increasing the reroll count from the original series.

The first patch fixes a condition where the esd_usb CAN driver
may not detect connected CAN-USB devices correctly after a
reboot. This patch was already presented on the list before
starting this series and changes due to that feedback are
integrated.

References:
https://lore.kernel.org/linux-can/d7fd564775351ea8a60a6ada83a0368a99ea6b19.camel@esd.eu/

The second patch fixes situations where the the handling of TX
context objects for each sent CAN frame could go out of sync
with the acknowledged or erroneous TX jobs and then lose free
TX context objects. This could lead to the driver incapable of
sending frames.

The third patch adds TX FIFO watermark to eliminate occasional
error messages and significantly reduce the number of calls to
netif_start_queue() and netif_stop_queue().


Previous versions:
 v1: https://lore.kernel.org/linux-can/20250811210611.3233202-1-stefan.maetje@esd.eu/
 v2: https://lore.kernel.org/linux-can/20250821143422.3567029-1-stefan.maetje@esd.eu/

Changes in v3:
  - Split the patch series of 5 patches in two keeping only the
    first three with fixes in this series.
  - Rebased to linux-can.
  - Take the patches from the pull request "can 2025-09-22" to preserve 
    reworded commit messages.
    References:
    https://lore.kernel.org/linux-can/20250922100913.392916-1-mkl@pengutronix.de/
  - In the second patch use net_ratelimit() to rate limit a changed
    netdev_warn() message.

Changes in v2:
  - Withdraw "can: esd_usb: Fix possible calls to kfree() with NULL".
  - Reworked now first patch:
    - Functions esd_usb_req_version() and esd_usb_recv_version()
      now allocate their own transfer buffers.
    - Check whether the announced message size fits into received
      data block.
  - Second patch: Added a Fixes tag
  - Third patch: Added a Fixes tag
  - Forth patch:
    - Convert all occurrences of error status prints to use
      "ERR_PTR(err)" instead of printing the decimal value
      of "err".
    - Rename retval to err in esd_usb_read_bulk_callback() to
      make the naming of error status variables consistent
      with all other functions.

Signed-off-by: Stefan Mätje <stefan.maetje@esd.eu>
---
Stefan Mätje (3):
  can: esd_usb: Fix not detecting version reply in probe routine
  can: esd_usb: Fix handling of TX context objects
  can: esd_usb: Add watermark handling for TX jobs

 drivers/net/can/usb/esd_usb.c | 175 +++++++++++++++++++++++++---------
 1 file changed, 129 insertions(+), 46 deletions(-)


base-commit: c443be70aaee42c2d1d251e0329e0a69dd96ae54
-- 
2.34.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v3 1/3] can: esd_usb: Fix not detecting version reply in probe routine
  2025-09-24 17:30 [PATCH v3 0/3] can: esd_usb: Fixes Stefan Mätje
@ 2025-09-24 17:30 ` Stefan Mätje
  2025-09-24 17:30 ` [PATCH v3 2/3] can: esd_usb: Fix handling of TX context objects Stefan Mätje
  2025-09-24 17:30 ` [PATCH v3 3/3] can: esd_usb: Add watermark handling for TX jobs Stefan Mätje
  2 siblings, 0 replies; 7+ messages in thread
From: Stefan Mätje @ 2025-09-24 17:30 UTC (permalink / raw)
  To: Frank Jungclaus, Marc Kleine-Budde, Vincent Mailhol, linux-can,
	socketcan
  Cc: David S . Miller, Jakub Kicinski, Oliver Hartkopp, Simon Horman,
	Wolfgang Grandegger, netdev

This patch fixes some problems in the esd_usb_probe() routine that render
the CAN interface unusable.

The probe routine sends a version request message to the USB device to
receive a version reply with the number of CAN ports and the hard-
& firmware versions. Then for each CAN port a CAN netdev is registered.

The previous code assumed that the version reply would be received
immediately. But if the driver was reloaded without power cycling the
USB device (i. e. on a reboot) there could already be other incoming
messages in the USB buffers. These would be in front of the version
reply and need to be skipped.

In the previous code these problems were present:
- Only one usb_bulk_msg() read was done into a buffer of
  sizeof(union esd_usb_msg) which is smaller than ESD_USB_RX_BUFFER_SIZE
  which could lead to an overflow error from the USB stack.
- The first bytes of the received data were taken without checking for
  the message type. This could lead to zero detected CAN interfaces.

To mitigate these problems:
- Move the code to send the version request message into a standalone
  function esd_usb_req_version().
- Add a function esd_usb_recv_version() using a transfer buffer
  with ESD_USB_RX_BUFFER_SIZE. It reads and skips incoming "esd_usb_msg"
  messages until a version reply message is found. This is evaluated to
  return the count of CAN ports and version information.
- The data drain loop is limited by a maximum # of bytes to read from
  the device based on its internal buffer sizes and a timeout
  ESD_USB_DRAIN_TIMEOUT_MS.

This version of the patch incorporates changes recommended on the
linux-can list for a very first version [1].

[1] https://lore.kernel.org/linux-can/20250203145810.1286331-1-stefan.maetje@esd.eu

Fixes: 80662d943075 ("can: esd_usb: Add support for esd CAN-USB/3")
Signed-off-by: Stefan Mätje <stefan.maetje@esd.eu>
Link: https://patch.msgid.link/20250821143422.3567029-2-stefan.maetje@esd.eu
[mkl: minor change patch description to imperative language]
[mkl: squash error format changes from patch 4]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
---
 drivers/net/can/usb/esd_usb.c | 149 +++++++++++++++++++++++++---------
 1 file changed, 112 insertions(+), 37 deletions(-)

diff --git a/drivers/net/can/usb/esd_usb.c b/drivers/net/can/usb/esd_usb.c
index 27a3818885c2..ed1d6ba779dc 100644
--- a/drivers/net/can/usb/esd_usb.c
+++ b/drivers/net/can/usb/esd_usb.c
@@ -3,7 +3,7 @@
  * CAN driver for esd electronics gmbh CAN-USB/2, CAN-USB/3 and CAN-USB/Micro
  *
  * Copyright (C) 2010-2012 esd electronic system design gmbh, Matthias Fuchs <socketcan@esd.eu>
- * Copyright (C) 2022-2024 esd electronics gmbh, Frank Jungclaus <frank.jungclaus@esd.eu>
+ * Copyright (C) 2022-2025 esd electronics gmbh, Frank Jungclaus <frank.jungclaus@esd.eu>
  */
 
 #include <linux/can.h>
@@ -44,6 +44,9 @@ MODULE_LICENSE("GPL v2");
 #define ESD_USB_CMD_TS			5 /* also used for TS_REPLY */
 #define ESD_USB_CMD_IDADD		6 /* also used for IDADD_REPLY */
 
+/* esd version message name size */
+#define ESD_USB_FW_NAME_SZ 16
+
 /* esd CAN message flags - dlc field */
 #define ESD_USB_RTR	BIT(4)
 #define ESD_USB_NO_BRS	BIT(4)
@@ -95,6 +98,7 @@ MODULE_LICENSE("GPL v2");
 #define ESD_USB_RX_BUFFER_SIZE		1024
 #define ESD_USB_MAX_RX_URBS		4
 #define ESD_USB_MAX_TX_URBS		16 /* must be power of 2 */
+#define ESD_USB_DRAIN_TIMEOUT_MS	100
 
 /* Modes for CAN-USB/3, to be used for esd_usb_3_set_baudrate_msg_x.mode */
 #define ESD_USB_3_BAUDRATE_MODE_DISABLE		0 /* remove from bus */
@@ -131,7 +135,7 @@ struct esd_usb_version_reply_msg {
 	u8 nets;
 	u8 features;
 	__le32 version;
-	u8 name[16];
+	u8 name[ESD_USB_FW_NAME_SZ];
 	__le32 rsvd;
 	__le32 ts;
 };
@@ -625,17 +629,106 @@ static int esd_usb_send_msg(struct esd_usb *dev, union esd_usb_msg *msg)
 			    1000);
 }
 
-static int esd_usb_wait_msg(struct esd_usb *dev,
-			    union esd_usb_msg *msg)
+static int esd_usb_req_version(struct esd_usb *dev)
 {
-	int actual_length;
+	union esd_usb_msg *msg;
+	int err;
 
-	return usb_bulk_msg(dev->udev,
-			    usb_rcvbulkpipe(dev->udev, 1),
-			    msg,
-			    sizeof(*msg),
-			    &actual_length,
-			    1000);
+	msg = kmalloc(sizeof(*msg), GFP_KERNEL);
+	if (!msg)
+		return -ENOMEM;
+
+	msg->hdr.cmd = ESD_USB_CMD_VERSION;
+	msg->hdr.len = sizeof(struct esd_usb_version_msg) / sizeof(u32); /* # of 32bit words */
+	msg->version.rsvd = 0;
+	msg->version.flags = 0;
+	msg->version.drv_version = 0;
+
+	err = esd_usb_send_msg(dev, msg);
+	kfree(msg);
+	return err;
+}
+
+static int esd_usb_recv_version(struct esd_usb *dev)
+{
+	/* Device hardware has 2 RX buffers with ESD_USB_RX_BUFFER_SIZE, * 4 to give some slack. */
+	const int max_drain_bytes = (4 * ESD_USB_RX_BUFFER_SIZE);
+	unsigned long end_jiffies;
+	void *rx_buf;
+	int cnt_other = 0;
+	int cnt_ts = 0;
+	int cnt_vs = 0;
+	int len_sum = 0;
+	int attempt = 0;
+	int err;
+
+	rx_buf = kmalloc(ESD_USB_RX_BUFFER_SIZE, GFP_KERNEL);
+	if (!rx_buf)
+		return -ENOMEM;
+
+	end_jiffies = jiffies + msecs_to_jiffies(ESD_USB_DRAIN_TIMEOUT_MS);
+	do {
+		int actual_length;
+		int pos;
+
+		err = usb_bulk_msg(dev->udev,
+				   usb_rcvbulkpipe(dev->udev, 1),
+				   rx_buf,
+				   ESD_USB_RX_BUFFER_SIZE,
+				   &actual_length,
+				   ESD_USB_DRAIN_TIMEOUT_MS);
+		dev_dbg(&dev->udev->dev, "AT %d, LEN %d, ERR %d\n", attempt, actual_length, err);
+		++attempt;
+		if (err)
+			goto bail;
+		if (actual_length == 0)
+			continue;
+
+		err = -ENOENT;
+		len_sum += actual_length;
+		pos = 0;
+		while (pos < actual_length - sizeof(struct esd_usb_header_msg)) {
+			union esd_usb_msg *p_msg;
+
+			p_msg = (union esd_usb_msg *)(rx_buf + pos);
+
+			pos += p_msg->hdr.len * sizeof(u32); /* convert to # of bytes */
+			if (pos > actual_length) {
+				dev_err(&dev->udev->dev, "format error\n");
+				err = -EPROTO;
+				goto bail;
+			}
+
+			switch (p_msg->hdr.cmd) {
+			case ESD_USB_CMD_VERSION:
+				++cnt_vs;
+				dev->net_count = min(p_msg->version_reply.nets, ESD_USB_MAX_NETS);
+				dev->version = le32_to_cpu(p_msg->version_reply.version);
+				err = 0;
+				dev_dbg(&dev->udev->dev, "TS 0x%08x, V 0x%08x, N %u, F 0x%02x, %.*s\n",
+					le32_to_cpu(p_msg->version_reply.ts),
+					le32_to_cpu(p_msg->version_reply.version),
+					p_msg->version_reply.nets,
+					p_msg->version_reply.features,
+					ESD_USB_FW_NAME_SZ, p_msg->version_reply.name);
+				break;
+			case ESD_USB_CMD_TS:
+				++cnt_ts;
+				dev_dbg(&dev->udev->dev, "TS 0x%08x\n",
+					le32_to_cpu(p_msg->rx.ts));
+				break;
+			default:
+				++cnt_other;
+				dev_dbg(&dev->udev->dev, "HDR %d\n", p_msg->hdr.cmd);
+				break;
+			}
+		}
+	} while (cnt_vs == 0 && len_sum < max_drain_bytes && time_before(jiffies, end_jiffies));
+bail:
+	dev_dbg(&dev->udev->dev, "RC=%d; ATT=%d, TS=%d, VS=%d, O=%d, B=%d\n",
+		err, attempt, cnt_ts, cnt_vs, cnt_other, len_sum);
+	kfree(rx_buf);
+	return err;
 }
 
 static int esd_usb_setup_rx_urbs(struct esd_usb *dev)
@@ -1273,13 +1366,12 @@ static int esd_usb_probe(struct usb_interface *intf,
 			 const struct usb_device_id *id)
 {
 	struct esd_usb *dev;
-	union esd_usb_msg *msg;
 	int i, err;
 
 	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
 	if (!dev) {
 		err = -ENOMEM;
-		goto done;
+		goto bail;
 	}
 
 	dev->udev = interface_to_usbdev(intf);
@@ -1288,34 +1380,19 @@ static int esd_usb_probe(struct usb_interface *intf,
 
 	usb_set_intfdata(intf, dev);
 
-	msg = kmalloc(sizeof(*msg), GFP_KERNEL);
-	if (!msg) {
-		err = -ENOMEM;
-		goto free_msg;
-	}
-
 	/* query number of CAN interfaces (nets) */
-	msg->hdr.cmd = ESD_USB_CMD_VERSION;
-	msg->hdr.len = sizeof(struct esd_usb_version_msg) / sizeof(u32); /* # of 32bit words */
-	msg->version.rsvd = 0;
-	msg->version.flags = 0;
-	msg->version.drv_version = 0;
-
-	err = esd_usb_send_msg(dev, msg);
+	err = esd_usb_req_version(dev);
 	if (err < 0) {
-		dev_err(&intf->dev, "sending version message failed\n");
-		goto free_msg;
+		dev_err(&intf->dev, "sending version message failed: %pe\n", ERR_PTR(err));
+		goto bail;
 	}
 
-	err = esd_usb_wait_msg(dev, msg);
+	err = esd_usb_recv_version(dev);
 	if (err < 0) {
-		dev_err(&intf->dev, "no version message answer\n");
-		goto free_msg;
+		dev_err(&intf->dev, "no version message answer: %pe\n", ERR_PTR(err));
+		goto bail;
 	}
 
-	dev->net_count = (int)msg->version_reply.nets;
-	dev->version = le32_to_cpu(msg->version_reply.version);
-
 	if (device_create_file(&intf->dev, &dev_attr_firmware))
 		dev_err(&intf->dev,
 			"Couldn't create device file for firmware\n");
@@ -1332,11 +1409,9 @@ static int esd_usb_probe(struct usb_interface *intf,
 	for (i = 0; i < dev->net_count; i++)
 		esd_usb_probe_one_net(intf, i);
 
-free_msg:
-	kfree(msg);
+bail:
 	if (err)
 		kfree(dev);
-done:
 	return err;
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v3 2/3] can: esd_usb: Fix handling of TX context objects
  2025-09-24 17:30 [PATCH v3 0/3] can: esd_usb: Fixes Stefan Mätje
  2025-09-24 17:30 ` [PATCH v3 1/3] can: esd_usb: Fix not detecting version reply in probe routine Stefan Mätje
@ 2025-09-24 17:30 ` Stefan Mätje
  2025-09-24 17:30 ` [PATCH v3 3/3] can: esd_usb: Add watermark handling for TX jobs Stefan Mätje
  2 siblings, 0 replies; 7+ messages in thread
From: Stefan Mätje @ 2025-09-24 17:30 UTC (permalink / raw)
  To: Frank Jungclaus, Marc Kleine-Budde, Vincent Mailhol, linux-can,
	socketcan
  Cc: David S . Miller, Jakub Kicinski, Oliver Hartkopp, Simon Horman,
	Wolfgang Grandegger, netdev

For each TX CAN frame submitted to the USB device the driver saves the
echo skb index in struct esd_tx_urb_context context objects. If the
driver runs out of free context objects CAN transmission stops.

Fix some spots where such context objects are not freed correctly.

In esd_usb_tx_done_msg() move the check for netif_device_present()
after the identification and release of TX context and the release of
the echo skb. This is allowed even if netif_device_present() would
return false because the mentioned operations don't touch the device
itself but only free local acquired resources. This keeps the context
handling with the acknowledged TX jobs in sync.

esd_usb_start_xmit() performs a check to see whether a context
object could be allocated. Add a netif_stop_queue() there before the
function is aborted. This makes sure the network queue is stopped and
avoids getting tons of log messages in a situation without free TX
objects. The adjacent log message now also prints the active jobs
counter making a cross check between active jobs and "no free context"
condition possible and is rate limited by net_ratelimit().

In esd_usb_start_xmit() the error handling of usb_submit_urb() missed to
free the context object together with the echo skb and decreasing the
job count.

Fixes: 96d8e90382dc ("can: Add driver for esd CAN-USB/2 device")
Signed-off-by: Stefan Mätje <stefan.maetje@esd.eu>
Link: https://patch.msgid.link/20250821143422.3567029-3-stefan.maetje@esd.eu
[mkl: minor change patch description to imperative language]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
---
 drivers/net/can/usb/esd_usb.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/net/can/usb/esd_usb.c b/drivers/net/can/usb/esd_usb.c
index ed1d6ba779dc..588ec02b9b21 100644
--- a/drivers/net/can/usb/esd_usb.c
+++ b/drivers/net/can/usb/esd_usb.c
@@ -460,9 +460,6 @@ static void esd_usb_tx_done_msg(struct esd_usb_net_priv *priv,
 	struct net_device *netdev = priv->netdev;
 	struct esd_tx_urb_context *context;
 
-	if (!netif_device_present(netdev))
-		return;
-
 	context = &priv->tx_contexts[msg->txdone.hnd & (ESD_USB_MAX_TX_URBS - 1)];
 
 	if (!msg->txdone.status) {
@@ -478,6 +475,9 @@ static void esd_usb_tx_done_msg(struct esd_usb_net_priv *priv,
 	context->echo_index = ESD_USB_MAX_TX_URBS;
 	atomic_dec(&priv->active_tx_jobs);
 
+	if (!netif_device_present(netdev))
+		return;
+
 	netif_wake_queue(netdev);
 }
 
@@ -975,9 +975,12 @@ static netdev_tx_t esd_usb_start_xmit(struct sk_buff *skb,
 		}
 	}
 
-	/* This may never happen */
+	/* This should never happen */
 	if (!context) {
-		netdev_warn(netdev, "couldn't find free context\n");
+		if (net_ratelimit())
+			netdev_warn(netdev, "No free context. Jobs: %d\n",
+				    atomic_read(&priv->active_tx_jobs));
+		netif_stop_queue(netdev);
 		ret = NETDEV_TX_BUSY;
 		goto releasebuf;
 	}
@@ -1008,6 +1011,8 @@ static netdev_tx_t esd_usb_start_xmit(struct sk_buff *skb,
 	if (err) {
 		can_free_echo_skb(netdev, context->echo_index, NULL);
 
+		/* Release used context on error */
+		context->echo_index = ESD_USB_MAX_TX_URBS;
 		atomic_dec(&priv->active_tx_jobs);
 		usb_unanchor_urb(urb);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v3 3/3] can: esd_usb: Add watermark handling for TX jobs
  2025-09-24 17:30 [PATCH v3 0/3] can: esd_usb: Fixes Stefan Mätje
  2025-09-24 17:30 ` [PATCH v3 1/3] can: esd_usb: Fix not detecting version reply in probe routine Stefan Mätje
  2025-09-24 17:30 ` [PATCH v3 2/3] can: esd_usb: Fix handling of TX context objects Stefan Mätje
@ 2025-09-24 17:30 ` Stefan Mätje
  2025-09-25  0:27   ` Jakub Kicinski
  2 siblings, 1 reply; 7+ messages in thread
From: Stefan Mätje @ 2025-09-24 17:30 UTC (permalink / raw)
  To: Frank Jungclaus, Marc Kleine-Budde, Vincent Mailhol, linux-can,
	socketcan
  Cc: David S . Miller, Jakub Kicinski, Oliver Hartkopp, Simon Horman,
	Wolfgang Grandegger, netdev

The driver tried to keep as much CAN frames as possible submitted to the
USB device (ESD_USB_MAX_TX_URBS). This has led to occasional "No free
context" error messages in high load situations like with
"cangen -g 0 -p 10 canX".

Now call netif_stop_queue() already if the number of active jobs
reaches ESD_USB_TX_URBS_HI_WM which is < ESD_USB_MAX_TX_URBS. The
netif_start_queue() is called in esd_usb_tx_done_msg() only if the
number of active jobs is <= ESD_USB_TX_URBS_LO_WM.

This change eliminates the occasional error messages and significantly
reduces the number of calls to netif_start_queue() and
netif_stop_queue().

The watermark limits have been chosen with the CAN-USB/Micro in mind to
not to compromise its TX throughput. This device is running on USB 1.1
only with its 1ms USB polling cycle where a ESD_USB_TX_URBS_LO_WM
value below 9 decreases the TX throughput.

Fixes: 96d8e90382dc ("can: Add driver for esd CAN-USB/2 device")
Signed-off-by: Stefan Mätje <stefan.maetje@esd.eu>
Link: https://patch.msgid.link/20250821143422.3567029-4-stefan.maetje@esd.eu
[mkl: minor change patch description to imperative language]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
---
 drivers/net/can/usb/esd_usb.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/net/can/usb/esd_usb.c b/drivers/net/can/usb/esd_usb.c
index 588ec02b9b21..a5206ff27565 100644
--- a/drivers/net/can/usb/esd_usb.c
+++ b/drivers/net/can/usb/esd_usb.c
@@ -98,6 +98,8 @@ MODULE_LICENSE("GPL v2");
 #define ESD_USB_RX_BUFFER_SIZE		1024
 #define ESD_USB_MAX_RX_URBS		4
 #define ESD_USB_MAX_TX_URBS		16 /* must be power of 2 */
+#define ESD_USB_TX_URBS_HI_WM		((15 * ESD_USB_MAX_TX_URBS) / 16)
+#define ESD_USB_TX_URBS_LO_WM		((9 * ESD_USB_MAX_TX_URBS) / 16)
 #define ESD_USB_DRAIN_TIMEOUT_MS	100
 
 /* Modes for CAN-USB/3, to be used for esd_usb_3_set_baudrate_msg_x.mode */
@@ -478,7 +480,8 @@ static void esd_usb_tx_done_msg(struct esd_usb_net_priv *priv,
 	if (!netif_device_present(netdev))
 		return;
 
-	netif_wake_queue(netdev);
+	if (atomic_read(&priv->active_tx_jobs) <= ESD_USB_TX_URBS_LO_WM)
+		netif_wake_queue(netdev);
 }
 
 static void esd_usb_read_bulk_callback(struct urb *urb)
@@ -988,7 +991,7 @@ static netdev_tx_t esd_usb_start_xmit(struct sk_buff *skb,
 	context->priv = priv;
 	context->echo_index = i;
 
-	/* hnd must not be 0 - MSB is stripped in txdone handling */
+	/* hnd must not be 0 - MSB is stripped in TX done handling */
 	msg->tx.hnd = BIT(31) | i; /* returned in TX done message */
 
 	usb_fill_bulk_urb(urb, dev->udev, usb_sndbulkpipe(dev->udev, 2), buf,
@@ -1003,8 +1006,8 @@ static netdev_tx_t esd_usb_start_xmit(struct sk_buff *skb,
 
 	atomic_inc(&priv->active_tx_jobs);
 
-	/* Slow down tx path */
-	if (atomic_read(&priv->active_tx_jobs) >= ESD_USB_MAX_TX_URBS)
+	/* Slow down TX path */
+	if (atomic_read(&priv->active_tx_jobs) >= ESD_USB_TX_URBS_HI_WM)
 		netif_stop_queue(netdev);
 
 	err = usb_submit_urb(urb, GFP_ATOMIC);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v3 3/3] can: esd_usb: Add watermark handling for TX jobs
  2025-09-24 17:30 ` [PATCH v3 3/3] can: esd_usb: Add watermark handling for TX jobs Stefan Mätje
@ 2025-09-25  0:27   ` Jakub Kicinski
  2025-09-26  1:35     ` Stefan Mätje
  0 siblings, 1 reply; 7+ messages in thread
From: Jakub Kicinski @ 2025-09-25  0:27 UTC (permalink / raw)
  To: Stefan Mätje
  Cc: Frank Jungclaus, Marc Kleine-Budde, Vincent Mailhol, linux-can,
	socketcan, David S . Miller, Oliver Hartkopp, Simon Horman,
	Wolfgang Grandegger, netdev

On Wed, 24 Sep 2025 19:30:35 +0200 Stefan Mätje wrote:
> The driver tried to keep as much CAN frames as possible submitted to the
> USB device (ESD_USB_MAX_TX_URBS). This has led to occasional "No free
> context" error messages in high load situations like with
> "cangen -g 0 -p 10 canX".

I grepped for "No free context" :) perhaps use the old message from
before the previous patch, so that users who see those in the logs
can correlate with this patch better?

> Now call netif_stop_queue() already if the number of active jobs
> reaches ESD_USB_TX_URBS_HI_WM which is < ESD_USB_MAX_TX_URBS. The
> netif_start_queue() is called in esd_usb_tx_done_msg() only if the
> number of active jobs is <= ESD_USB_TX_URBS_LO_WM.
> 
> This change eliminates the occasional error messages and significantly
> reduces the number of calls to netif_start_queue() and
> netif_stop_queue().
> 
> The watermark limits have been chosen with the CAN-USB/Micro in mind to
> not to compromise its TX throughput. This device is running on USB 1.1
> only with its 1ms USB polling cycle where a ESD_USB_TX_URBS_LO_WM
> value below 9 decreases the TX throughput.

> -	netif_wake_queue(netdev);
> +	if (atomic_read(&priv->active_tx_jobs) <= ESD_USB_TX_URBS_LO_WM)
> +		netif_wake_queue(netdev);
>  }

> -	/* Slow down tx path */
> -	if (atomic_read(&priv->active_tx_jobs) >= ESD_USB_MAX_TX_URBS)
> +	/* Slow down TX path */
> +	if (atomic_read(&priv->active_tx_jobs) >= ESD_USB_TX_URBS_HI_WM)
>  		netif_stop_queue(netdev);
>  
>  	err = usb_submit_urb(urb, GFP_ATOMIC);

I don't know much about USB. Is there some locking that makes this not
racy? I recommend using the macros from net/netdev_queues.h like
netif_txq_maybe_stop() the re-checking on one side is key.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v3 3/3] can: esd_usb: Add watermark handling for TX jobs
  2025-09-25  0:27   ` Jakub Kicinski
@ 2025-09-26  1:35     ` Stefan Mätje
  2025-09-26 19:50       ` Jakub Kicinski
  0 siblings, 1 reply; 7+ messages in thread
From: Stefan Mätje @ 2025-09-26  1:35 UTC (permalink / raw)
  To: kuba@kernel.org
  Cc: mailhol@kernel.org, socketcan@hartkopp.net, davem@davemloft.net,
	wg@grandegger.com, linux-can@vger.kernel.org, socketcan,
	Frank Jungclaus, mkl@pengutronix.de, horms@kernel.org,
	netdev@vger.kernel.org

Am Mittwoch, dem 24.09.2025 um 17:27 -0700 schrieb Jakub Kicinski:
> On Wed, 24 Sep 2025 19:30:35 +0200 Stefan Mätje wrote:
> > The driver tried to keep as much CAN frames as possible submitted to the
> > USB device (ESD_USB_MAX_TX_URBS). This has led to occasional "No free
> > context" error messages in high load situations like with
> > "cangen -g 0 -p 10 canX".
> 
> I grepped for "No free context" :) perhaps use the old message from
> before the previous patch, so that users who see those in the logs
> can correlate with this patch better?

I could change the message back to contain "couldn't find free context"
again.

> > Now call netif_stop_queue() already if the number of active jobs
> > reaches ESD_USB_TX_URBS_HI_WM which is < ESD_USB_MAX_TX_URBS. The
> > netif_start_queue() is called in esd_usb_tx_done_msg() only if the
> > number of active jobs is <= ESD_USB_TX_URBS_LO_WM.
> > 
> > This change eliminates the occasional error messages and significantly
> > reduces the number of calls to netif_start_queue() and
> > netif_stop_queue().
> > 
> > The watermark limits have been chosen with the CAN-USB/Micro in mind to
> > not to compromise its TX throughput. This device is running on USB 1.1
> > only with its 1ms USB polling cycle where a ESD_USB_TX_URBS_LO_WM
> > value below 9 decreases the TX throughput.
> 
> > -       netif_wake_queue(netdev);
> > +       if (atomic_read(&priv->active_tx_jobs) <= ESD_USB_TX_URBS_LO_WM)
> > +               netif_wake_queue(netdev);
> >  }
> 
> > -       /* Slow down tx path */
> > -       if (atomic_read(&priv->active_tx_jobs) >= ESD_USB_MAX_TX_URBS)
> > +       /* Slow down TX path */
> > +       if (atomic_read(&priv->active_tx_jobs) >= ESD_USB_TX_URBS_HI_WM)
> >                 netif_stop_queue(netdev);
> >  
> >         err = usb_submit_urb(urb, GFP_ATOMIC);
> 
> I don't know much about USB. Is there some locking that makes this not
> racy? I recommend using the macros from net/netdev_queues.h like
> netif_txq_maybe_stop() the re-checking on one side is key.

When I was changing the code I was under the impression that the atomic_xxx()
functions would do some memory barrier stuff to make other CPUs see the
results. 

This was because some other USB CAN drivers use very similar code snippets
to count their active TX jobs in flight (the grandparent seems to be ems_usb.c,
others: gs_usb.c, usb_8dev.c), but they always seem to call netif_wake_queue()
unconditionally.

The original code called netif_stop_queue() under the condition 
"if (atomic_read(&priv->active_tx_jobs) >= ESD_USB_MAX_TX_URBS)". But this
could lead to the "couldn't find free context". I assume this happened if
the callback handler called netif_wake_queue() via esd_usb_tx_done_msg()
after the condition check with network layer still in the start_xmit()
routine. So the network layer would try to xmit the next skb.

I changed therefore the stop condition to >= ESD_USB_TX_URBS_HI_WM which is
smaller than ESD_USB_MAX_TX_URBS so there is still at least one TX context
free after stopping that will handle an "overshoot" transmit.

On the wake side in esd_usb_tx_done_msg() if we would not see the increment
and are still under ESD_USB_TX_URBS_LO_WM only excess calls to
netif_wake_queue() would be generated. And each call of esd_usb_tx_done_msg()
would decrement the tx_active_jobs bringing this down.

At the moment I can't see clearly the impact of a race condition. What I can
imagine is that the "other side" doesn't see the changes done with atomic_inc()
and atomic_dec() making a reaction on changes of the "other side" too late.

Avoid this I could change the code to something like:

static void esd_usb_tx_done_msg(struct esd_usb_net_priv *priv,
				union esd_usb_msg *msg)
{
:
:
	/* Release context */
	context->echo_index = ESD_USB_MAX_TX_URBS;
	atomic_dec(&priv->active_tx_jobs);
	smp_mb__after_atomic();

	if (!netif_device_present(netdev))
		return;

	if (atomic_read(&priv->active_tx_jobs) <= ESD_USB_TX_URBS_LO_WM)
		netif_wake_queue(netdev);
}


and in the esd_usb_start_xmit() function:

:
:
	atomic_inc(&priv->active_tx_jobs);
	smp_mb__after_atomic();

	/* Slow down TX path */
	if (atomic_read(&priv->active_tx_jobs) >= ESD_USB_TX_URBS_HI_WM)
		netif_stop_queue(netdev);

	err = usb_submit_urb(urb, GFP_ATOMIC);
:
:

Not finding any issues with the published patch may be a side effect of 
testing mainly on x86 where the smp_mb__after_atomic() function is empty.

Using the netif_txq_maybe_stop() / netif_txq_completed_wake() pair seem
not the right thing for me because I don't understand the code completely
and the netif_txq_completed_wake() should only called from NAPI poll
context where I would need to call it from the USB callback handler
possibly on IRQ level.

How to proceed now?

Best regards,
    Stefan Mätje


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v3 3/3] can: esd_usb: Add watermark handling for TX jobs
  2025-09-26  1:35     ` Stefan Mätje
@ 2025-09-26 19:50       ` Jakub Kicinski
  0 siblings, 0 replies; 7+ messages in thread
From: Jakub Kicinski @ 2025-09-26 19:50 UTC (permalink / raw)
  To: Stefan Mätje
  Cc: mailhol@kernel.org, socketcan@hartkopp.net, davem@davemloft.net,
	wg@grandegger.com, linux-can@vger.kernel.org, socketcan,
	Frank Jungclaus, mkl@pengutronix.de, horms@kernel.org,
	netdev@vger.kernel.org

On Fri, 26 Sep 2025 01:35:07 +0000 Stefan Mätje wrote:
> Using the netif_txq_maybe_stop() / netif_txq_completed_wake() pair seem
> not the right thing for me because I don't understand the code completely
> and the netif_txq_completed_wake() should only called from NAPI poll
> context where I would need to call it from the USB callback handler
> possibly on IRQ level.

Ah, true, not NAPI, but you don't use BQL so we'd need a flavor of 
the macro that doesn't picky back on the mb() in BQL.

The main point of the macro is that on one side you need to re-check
after a barrier. No matter how many barriers you put in place you
can still have:


 CPU 0                                 CPU 1
	used++
	if (used > threshold)
		/* long IRQ comes */
                                         used--
					 if (used <= threshold)
						wake();
		/* long IRQ ends */
		stop();

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2025-09-26 19:50 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-24 17:30 [PATCH v3 0/3] can: esd_usb: Fixes Stefan Mätje
2025-09-24 17:30 ` [PATCH v3 1/3] can: esd_usb: Fix not detecting version reply in probe routine Stefan Mätje
2025-09-24 17:30 ` [PATCH v3 2/3] can: esd_usb: Fix handling of TX context objects Stefan Mätje
2025-09-24 17:30 ` [PATCH v3 3/3] can: esd_usb: Add watermark handling for TX jobs Stefan Mätje
2025-09-25  0:27   ` Jakub Kicinski
2025-09-26  1:35     ` Stefan Mätje
2025-09-26 19:50       ` Jakub Kicinski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).