Netdev List


* Re: [PATCH iwl-net v2 1/6] ixgbe: fix SWFW semaphore timeout for X550 family
From: Simon Horman @ 2026-04-13 10:52 UTC (permalink / raw)
  To: Aleksandr Loktionov; +Cc: intel-wired-lan, anthony.l.nguyen, netdev
In-Reply-To: <20260408131154.2661818-2-aleksandr.loktionov@intel.com>

On Wed, Apr 08, 2026 at 03:11:49PM +0200, Aleksandr Loktionov wrote:
> According to FW documentation, the most time-consuming FW operation is
> Shadow RAM (SR) dump which takes up to 3.2 seconds.  For X550 family
> devices the module-update FW command can take over 4.5 s.  The default
> semaphore loop runs 200 iterations with a 5 ms sleep each, giving a
> maximum wait of 1 s -- not "200 ms" as previously stated in error.
> This is insufficient for X550 family FW update operations and causes
> spurious EBUSY failures.
> 
> Extend the SW/FW semaphore timeout from 1 s to 5 s (1000 iterations x
> 5 ms) for all three X550 variants: ixgbe_mac_X550, ixgbe_mac_X550EM_x,
> and ixgbe_mac_x550em_a.  All three share the same FW and exhibit the
> same worst-case latency.  Use three explicit mac.type comparisons rather
> than a range check so future MAC additions are not inadvertently
> captured.
> 
> The timeout variable is set immediately before the loop so the intent
> is clear, with an inline comment stating the resulting maximum delay.
> 
> Suggested-by: Soumen Karmakar <soumen.karmakar@intel.com>
> Cc: stable@vger.kernel.org
> Suggested-by: Marta Plantykow <marta.a.plantykow@intel.com>
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> ---
> v1 -> v2:
>  - Squash with 0015 (X550EM extension); fix commit message ("200ms" was
>    wrong, actual default is 1 s); replace >= / <= range check with three
>    explicit mac.type == comparisons per Tony Nguyen.

Reviewed-by: Simon Horman <horms@kernel.org>
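
The timeout arithmetic in the commit message can be checked with a small
user-space sketch. The enum and helper names below are hypothetical
stand-ins, not the driver's definitions; only the iteration counts and the
5 ms sleep are taken from the commit message.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's MAC types (illustration only). */
enum mac_type { MAC_82599, MAC_X540, MAC_X550, MAC_X550EM_X, MAC_X550EM_A };

#define SEM_SLEEP_MS     5    /* sleep per iteration, per the commit message */
#define SEM_ITER_DEFAULT 200  /* 200 x 5 ms = 1 s */
#define SEM_ITER_X550    1000 /* 1000 x 5 ms = 5 s */

/* Three explicit comparisons, so a MAC type added later is not matched
 * by accident the way a >= / <= range check could match it.
 */
static bool is_x550_family(enum mac_type t)
{
	return t == MAC_X550 || t == MAC_X550EM_X || t == MAC_X550EM_A;
}

/* Maximum time the semaphore loop may wait, in milliseconds. */
static unsigned int swfw_sem_max_wait_ms(enum mac_type t)
{
	unsigned int iters = is_x550_family(t) ? SEM_ITER_X550 : SEM_ITER_DEFAULT;

	return iters * SEM_SLEEP_MS;
}
```

Under these assumptions a non-X550 MAC waits at most 1000 ms, while the
three X550 variants wait at most 5000 ms, which covers the 4.5 s
module-update command.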



* Re: [PATCH iwl-net v2 3/6] ixgbe: call ixgbe_setup_fc() before fc_enable() after NVM update
From: Simon Horman @ 2026-04-13 10:51 UTC (permalink / raw)
  To: aleksandr.loktionov
  Cc: 'Simon Horman', intel-wired-lan, anthony.l.nguyen, netdev
In-Reply-To: <20260408131154.2661818-4-aleksandr.loktionov@intel.com>

From: 'Simon Horman' <horms@kernel.org>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
ixgbe: call ixgbe_setup_fc() before fc_enable() after NVM update

This commit adds a call to hw->mac.ops.setup_fc() before fc_enable() in the
watchdog task when the link comes up, aiming to prevent MDD events after
an NVM update by reprogramming flow-control autonegotiation.

> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -8031,7 +8031,14 @@ static void ixgbe_watchdog_update_link(struct ixgbe_adapter *adapter)
>  		pfc_en |= !!(adapter->ixgbe_ieee_pfc->pfc_en);
>  
>  	if (link_up && !((adapter->flags & IXGBE_FLAG_DCB_ENABLED) && pfc_en)) {
> -		hw->mac.ops.fc_enable(hw);
> +		/* Re-program flow-control autoneg before applying the result.
> +		 * If setup_fc() fails its output is invalid; skip fc_enable()
> +		 * to avoid committing stale capability bits that trigger MDD.
> +		 */
> +		if (hw->mac.ops.setup_fc && hw->mac.ops.setup_fc(hw))

Will calling setup_fc() unconditionally from the link-up watchdog cause an
infinite link flapping loop on 82599 backplane interfaces?

For backplane interfaces like KX/KX4/KR, setup_fc() resolves to
ixgbe_setup_fc_generic(), which then calls hw->mac.ops.prot_autoc_write().

On 82599 hardware, this calls prot_autoc_write_82599(), which unconditionally
executes ixgbe_reset_pipeline_82599().

Since ixgbe_reset_pipeline_82599() toggles the IXGBE_AUTOC_AN_RESTART bit,
would this drop the link immediately to restart Auto-Negotiation?

If the link drops, it would trigger a link-down LSC interrupt. Once
Auto-Negotiation completes, the link would come back up, triggering another
link-up LSC interrupt. This would cause ixgbe_watchdog_update_link() to run
again, repeating the cycle.

> +			e_warn(drv, "setup_fc failed, skipping fc_enable\n");
> +		else
> +			hw->mac.ops.fc_enable(hw);
>  		ixgbe_set_rx_drop_en(adapter);
>  	}


* Re: [PATCH v11 net-next 3/5] psp: add a new netdev event for dev unregister
From: Paolo Abeni @ 2026-04-13 10:47 UTC (permalink / raw)
  To: Wei Wang, netdev, Jakub Kicinski, Daniel Zahka, Willem de Bruijn,
	David Wei, Andrew Lunn, David S . Miller, Eric Dumazet,
	Simon Horman
  Cc: Wei Wang
In-Reply-To: <20260408231415.522691-4-weibunny.kernel@gmail.com>

On 4/9/26 1:14 AM, Wei Wang wrote:
> +static int psp_netdev_event(struct notifier_block *nb, unsigned long event,
> +			    void *ptr)
> +{
> +	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
> +	struct psp_dev *psd;
> +
> +	if (event != NETDEV_UNREGISTER)
> +		return NOTIFY_DONE;
> +
> +	rcu_read_lock();
> +	psd = rcu_dereference(dev->psp_dev);
> +	if (psd && psp_dev_tryget(psd)) {
> +		rcu_read_unlock();
> +		mutex_lock(&psd->lock);
> +		psp_dev_disassoc_one(psd, dev);
> +		mutex_unlock(&psd->lock);
> +		psp_dev_put(psd);

Sashiko notes that the above is racy:

---
Can this code race with psp_nl_dev_assoc_doit() and permanently leak a
net_device reference?
If CPU1 is executing psp_nl_dev_assoc_doit() and CPU2 is unregistering the
device, the following interleaving could happen:
CPU1 (psp_nl_dev_assoc_doit)
    assoc_dev = dev_get_by_index(...) // acquires a reference
CPU2 (unregister_netdevice)
    psp_netdev_event()
        psd = rcu_dereference(dev->psp_dev); // sees NULL, returns NOTIFY_DONE
CPU1 (psp_nl_dev_assoc_doit)
    cmpxchg(&assoc_dev->psp_dev, NULL, psd); // succeeds!
    list_add(...) // adds to psd->assoc_dev_list
If this occurs, the notifier misses the unregistration event since it runs
before the device is fully associated. The unregistering thread will then
enter netdev_wait_allrefs() and wait indefinitely because the reference
held in assoc_dev_list is never released.
---

/P



* Re: [PATCH net] net: usb: cdc_ncm: reject negative chained NDP offsets
From: Greg Kroah-Hartman @ 2026-04-13 10:43 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: linux-usb, netdev, linux-kernel, Oliver Neukum, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	stable
In-Reply-To: <2a6963c8-4a87-4fed-b875-d46f3ce53e42@suse.com>

On Mon, Apr 13, 2026 at 10:36:19AM +0200, Oliver Neukum wrote:
> 
> 
> On 11.04.26 12:53, Greg Kroah-Hartman wrote:
> > cdc_ncm_rx_fixup() reads dwNextNdpIndex from each NDP32 to chain to the
> > next one.  The 32-bit value from the device is stored into the signed
> > int ndpoffset so that means values with the high bit set become
> 
> Well, then isn't the problem rather that you should not store an
> unsigned value in a signed variable?

No.  well, yes.  but no.

cdc_ncm_rx_verify_nth16() returns an int, and is negative if something
went wrong, so we need it that way, and then we need to check it, like
we properly do at the top of the loop, it's just that at the bottom of
the loop we also need to do the same exact thing.

So I really think this patch is the correct thing to do, unless you want
to add another temp variable here just for the signed -> unsigned
transition, but that might be even messier.

thanks,

greg k-h
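
For illustration only, the temp-variable alternative mentioned above might
look roughly like the user-space sketch below. The helper name
next_ndp_offset() and its buf_len parameter are hypothetical; this is not
the patch under discussion, which instead repeats the negative check at the
bottom of the loop.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper: convert a device-supplied 32-bit chain offset into
 * a non-negative int, or return -1 if it cannot be a valid offset.
 */
static int next_ndp_offset(uint32_t dw_next_ndp_index, int buf_len)
{
	int off;

	/* A value with the high bit set does not fit in a 32-bit int;
	 * reject it while it is still unsigned instead of converting it
	 * and then testing the sign.
	 */
	if (dw_next_ndp_index > (uint32_t)INT_MAX)
		return -1;

	off = (int)dw_next_ndp_index;
	if (off >= buf_len)
		return -1;

	return off;
}
```

Either way the goal is the same: an offset with the high bit set must be
rejected before it is used to index into the buffer.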


* Re: [PATCH v11 net-next 2/5] psp: add new netlink cmd for dev-assoc and dev-disassoc
From: Paolo Abeni @ 2026-04-13 10:36 UTC (permalink / raw)
  To: Wei Wang, netdev, Jakub Kicinski, Daniel Zahka, Willem de Bruijn,
	David Wei, Andrew Lunn, David S . Miller, Eric Dumazet,
	Simon Horman
  Cc: Wei Wang
In-Reply-To: <20260408231415.522691-3-weibunny.kernel@gmail.com>



On 4/9/26 1:14 AM, Wei Wang wrote:
> From: Wei Wang <weibunny@fb.com>
> 
> The main purpose of this cmd is to be able to associate a
> non-psp-capable device (e.g. veth or netkit) with a psp device.
> One use case is if we create a pair of veth/netkit, and assign 1 end
> inside a netns, while leaving the other end within the default netns,
> with a real PSP device, e.g. netdevsim or a physical PSP-capable NIC.
> With this command, we could associate the veth/netkit inside the netns
> with PSP device, so the virtual device could act as PSP-capable device
> to initiate PSP connections, and performs PSP encryption/decryption on
> the real PSP device.
> 
> Signed-off-by: Wei Wang <weibunny@fb.com>
> Reviewed-by: Daniel Zahka <daniel.zahka@gmail.com>
> ---
>  Documentation/netlink/specs/psp.yaml |  67 +++++-
>  include/net/psp/types.h              |  15 ++
>  include/uapi/linux/psp.h             |  13 ++
>  net/psp/psp-nl-gen.c                 |  32 +++
>  net/psp/psp-nl-gen.h                 |   2 +
>  net/psp/psp_main.c                   |  20 ++
>  net/psp/psp_nl.c                     | 325 ++++++++++++++++++++++++++-
>  7 files changed, 462 insertions(+), 12 deletions(-)
> 
> diff --git a/Documentation/netlink/specs/psp.yaml b/Documentation/netlink/specs/psp.yaml
> index c54e1202cbe0..3d1b7223e084 100644
> --- a/Documentation/netlink/specs/psp.yaml
> +++ b/Documentation/netlink/specs/psp.yaml
> @@ -13,6 +13,17 @@ definitions:
>                hdr0-aes-gmac-128, hdr0-aes-gmac-256]
>  
>  attribute-sets:
> +  -
> +    name: assoc-dev-info
> +    attributes:
> +      -
> +        name: ifindex
> +        doc: ifindex of an associated network device.
> +        type: u32
> +      -
> +        name: nsid
> +        doc: Network namespace ID of the associated device.
> +        type: s32
>    -
>      name: dev
>      attributes:
> @@ -24,7 +35,9 @@ attribute-sets:
>            min: 1
>        -
>          name: ifindex
> -        doc: ifindex of the main netdevice linked to the PSP device.
> +        doc: |
> +          ifindex of the main netdevice linked to the PSP device,
> +          or the ifindex to associate with the PSP device.
>          type: u32
>        -
>          name: psp-versions-cap
> @@ -38,6 +51,28 @@ attribute-sets:
>          type: u32
>          enum: version
>          enum-as-flags: true
> +      -
> +        name: assoc-list
> +        doc: List of associated virtual devices.
> +        type: nest
> +        nested-attributes: assoc-dev-info
> +        multi-attr: true
> +      -
> +        name: nsid
> +        doc: |
> +          Network namespace ID for the device to associate/disassociate.
> +          Optional for dev-assoc and dev-disassoc; if not present, the
> +          device is looked up in the caller's network namespace.
> +        type: s32
> +      -
> +        name: by-association
> +        doc: |
> +          Flag indicating the PSP device is an associated device from a
> +          different network namespace.
> +          Present when in associated namespace, absent when in primary/host
> +          namespace.
> +        type: flag
> +
>    -
>      name: assoc
>      attributes:
> @@ -170,6 +205,8 @@ operations:
>              - ifindex
>              - psp-versions-cap
>              - psp-versions-ena
> +            - assoc-list
> +            - by-association
>          pre: psp-device-get-locked
>          post: psp-device-unlock
>        dump:
> @@ -279,6 +316,34 @@ operations:
>          post: psp-device-unlock
>        dump:
>          reply: *stats-all
> +    -
> +      name: dev-assoc
> +      doc: Associate a network device with a PSP device.
> +      attribute-set: dev
> +      do:
> +        request:
> +          attributes:
> +            - id
> +            - ifindex
> +            - nsid
> +        reply:
> +          attributes: []
> +        pre: psp-device-get-locked
> +        post: psp-device-unlock
> +    -
> +      name: dev-disassoc
> +      doc: Disassociate a network device from a PSP device.
> +      attribute-set: dev
> +      do:
> +        request:
> +          attributes:
> +            - id
> +            - ifindex
> +            - nsid
> +        reply:
> +          attributes: []
> +        pre: psp-device-get-locked
> +        post: psp-device-unlock
>  
>  mcast-groups:
>    list:
> diff --git a/include/net/psp/types.h b/include/net/psp/types.h
> index 25a9096d4e7d..4bd432ed107a 100644
> --- a/include/net/psp/types.h
> +++ b/include/net/psp/types.h
> @@ -5,6 +5,7 @@
>  
>  #include <linux/mutex.h>
>  #include <linux/refcount.h>
> +#include <net/net_trackers.h>
>  
>  struct netlink_ext_ack;
>  
> @@ -43,9 +44,22 @@ struct psp_dev_config {
>  	u32 versions;
>  };
>  
> +/**
> + * struct psp_assoc_dev - wrapper for associated net_device
> + * @dev_list: list node for psp_dev::assoc_dev_list
> + * @assoc_dev: the associated net_device
> + * @dev_tracker: tracker for the net_device reference
> + */
> +struct psp_assoc_dev {
> +	struct list_head dev_list;
> +	struct net_device *assoc_dev;
> +	netdevice_tracker dev_tracker;
> +};
> +
>  /**
>   * struct psp_dev - PSP device struct
>   * @main_netdev: original netdevice of this PSP device
> + * @assoc_dev_list: list of psp_assoc_dev entries associated with this PSP device
>   * @ops:	driver callbacks
>   * @caps:	device capabilities
>   * @drv_priv:	driver priv pointer
> @@ -67,6 +81,7 @@ struct psp_dev_config {
>   */
>  struct psp_dev {
>  	struct net_device *main_netdev;
> +	struct list_head assoc_dev_list;
>  
>  	struct psp_dev_ops *ops;
>  	struct psp_dev_caps *caps;
> diff --git a/include/uapi/linux/psp.h b/include/uapi/linux/psp.h
> index a3a336488dc3..1c8899cd4da5 100644
> --- a/include/uapi/linux/psp.h
> +++ b/include/uapi/linux/psp.h
> @@ -17,11 +17,22 @@ enum psp_version {
>  	PSP_VERSION_HDR0_AES_GMAC_256,
>  };
>  
> +enum {
> +	PSP_A_ASSOC_DEV_INFO_IFINDEX = 1,
> +	PSP_A_ASSOC_DEV_INFO_NSID,
> +
> +	__PSP_A_ASSOC_DEV_INFO_MAX,
> +	PSP_A_ASSOC_DEV_INFO_MAX = (__PSP_A_ASSOC_DEV_INFO_MAX - 1)
> +};
> +
>  enum {
>  	PSP_A_DEV_ID = 1,
>  	PSP_A_DEV_IFINDEX,
>  	PSP_A_DEV_PSP_VERSIONS_CAP,
>  	PSP_A_DEV_PSP_VERSIONS_ENA,
> +	PSP_A_DEV_ASSOC_LIST,
> +	PSP_A_DEV_NSID,
> +	PSP_A_DEV_BY_ASSOCIATION,
>  
>  	__PSP_A_DEV_MAX,
>  	PSP_A_DEV_MAX = (__PSP_A_DEV_MAX - 1)
> @@ -74,6 +85,8 @@ enum {
>  	PSP_CMD_RX_ASSOC,
>  	PSP_CMD_TX_ASSOC,
>  	PSP_CMD_GET_STATS,
> +	PSP_CMD_DEV_ASSOC,
> +	PSP_CMD_DEV_DISASSOC,
>  
>  	__PSP_CMD_MAX,
>  	PSP_CMD_MAX = (__PSP_CMD_MAX - 1)
> diff --git a/net/psp/psp-nl-gen.c b/net/psp/psp-nl-gen.c
> index 1f5e73e7ccc1..114299c64423 100644
> --- a/net/psp/psp-nl-gen.c
> +++ b/net/psp/psp-nl-gen.c
> @@ -53,6 +53,20 @@ static const struct nla_policy psp_get_stats_nl_policy[PSP_A_STATS_DEV_ID + 1] =
>  	[PSP_A_STATS_DEV_ID] = NLA_POLICY_MIN(NLA_U32, 1),
>  };
>  
> +/* PSP_CMD_DEV_ASSOC - do */
> +static const struct nla_policy psp_dev_assoc_nl_policy[PSP_A_DEV_NSID + 1] = {
> +	[PSP_A_DEV_ID] = NLA_POLICY_MIN(NLA_U32, 1),
> +	[PSP_A_DEV_IFINDEX] = { .type = NLA_U32, },
> +	[PSP_A_DEV_NSID] = { .type = NLA_S32, },
> +};
> +
> +/* PSP_CMD_DEV_DISASSOC - do */
> +static const struct nla_policy psp_dev_disassoc_nl_policy[PSP_A_DEV_NSID + 1] = {
> +	[PSP_A_DEV_ID] = NLA_POLICY_MIN(NLA_U32, 1),
> +	[PSP_A_DEV_IFINDEX] = { .type = NLA_U32, },
> +	[PSP_A_DEV_NSID] = { .type = NLA_S32, },
> +};
> +
>  /* Ops table for psp */
>  static const struct genl_split_ops psp_nl_ops[] = {
>  	{
> @@ -119,6 +133,24 @@ static const struct genl_split_ops psp_nl_ops[] = {
>  		.dumpit	= psp_nl_get_stats_dumpit,
>  		.flags	= GENL_CMD_CAP_DUMP,
>  	},
> +	{
> +		.cmd		= PSP_CMD_DEV_ASSOC,
> +		.pre_doit	= psp_device_get_locked,
> +		.doit		= psp_nl_dev_assoc_doit,
> +		.post_doit	= psp_device_unlock,
> +		.policy		= psp_dev_assoc_nl_policy,
> +		.maxattr	= PSP_A_DEV_NSID,
> +		.flags		= GENL_CMD_CAP_DO,
> +	},
> +	{
> +		.cmd		= PSP_CMD_DEV_DISASSOC,
> +		.pre_doit	= psp_device_get_locked,
> +		.doit		= psp_nl_dev_disassoc_doit,
> +		.post_doit	= psp_device_unlock,
> +		.policy		= psp_dev_disassoc_nl_policy,
> +		.maxattr	= PSP_A_DEV_NSID,
> +		.flags		= GENL_CMD_CAP_DO,

Sashiko notes that the above allows deleting an association while bypassing
the netns boundaries. Do you need the ADMIN_PERM flag or explicit checks in
the doit cb?

> @@ -292,6 +455,145 @@ int psp_nl_key_rotate_doit(struct sk_buff *skb, struct genl_info *info)
>  	return err;
>  }
>  
> +int psp_nl_dev_assoc_doit(struct sk_buff *skb, struct genl_info *info)
> +{
> +	struct psp_dev *psd = info->user_ptr[0];
> +	struct psp_assoc_dev *psp_assoc_dev;
> +	struct net_device *assoc_dev;
> +	struct sk_buff *rsp;
> +	u32 assoc_ifindex;
> +	struct net *net;
> +	int nsid, err;
> +
> +	if (GENL_REQ_ATTR_CHECK(info, PSP_A_DEV_IFINDEX))
> +		return -EINVAL;
> +
> +	if (info->attrs[PSP_A_DEV_NSID]) {
> +		nsid = nla_get_s32(info->attrs[PSP_A_DEV_NSID]);
> +
> +		net = get_net_ns_by_id(genl_info_net(info), nsid);
> +		if (!net) {
> +			NL_SET_BAD_ATTR(info->extack,
> +					info->attrs[PSP_A_DEV_NSID]);
> +			return -EINVAL;
> +		}
> +	} else {
> +		net = get_net(genl_info_net(info));
> +	}

psp_nl_dev_disassoc_doit() has the same code; perhaps it would be worth
moving it into a common helper, called via pre_doit()? That should also
simplify the cleanup paths.

> +
> +	psp_assoc_dev = kzalloc(sizeof(*psp_assoc_dev), GFP_KERNEL);
> +	if (!psp_assoc_dev) {
> +		err = -ENOMEM;
> +		goto alloc_err;
> +	}
> +
> +	assoc_ifindex = nla_get_u32(info->attrs[PSP_A_DEV_IFINDEX]);
> +	assoc_dev = netdev_get_by_index(net, assoc_ifindex,
> +					&psp_assoc_dev->dev_tracker,
> +					GFP_KERNEL);
> +	if (!assoc_dev) {
> +		NL_SET_BAD_ATTR(info->extack, info->attrs[PSP_A_DEV_IFINDEX]);
> +		err = -ENODEV;
> +		goto assoc_dev_err;
> +	}
> +
> +	/* Check if device is already associated with a PSP device */
> +	if (cmpxchg(&assoc_dev->psp_dev, NULL, RCU_INITIALIZER(psd))) {
> +		NL_SET_ERR_MSG(info->extack,
> +			       "Device already associated with a PSP device");
> +		err = -EBUSY;
> +		goto cmpxchg_err;
> +	}
> +
> +	psp_assoc_dev->assoc_dev = assoc_dev;
> +	rsp = psp_nl_reply_new(info);
> +	if (!rsp) {
> +		err = -ENOMEM;
> +		goto rsp_err;
> +	}
> +
> +	list_add_tail(&psp_assoc_dev->dev_list, &psd->assoc_dev_list);

Sashiko says:

---
list_add_tail(&psp_assoc_dev->dev_list, &psd->assoc_dev_list);
There doesn't seem to be a limit on the number of devices that can be
associated with a single PSP device.
If a user repeatedly associates devices, could the generated netlink message
in psp_nl_dev_fill() exceed the maximum allowed size (GENLMSG_DEFAULT_SIZE),
causing it to fail with -EMSGSIZE and permanently break PSP_CMD_DEV_GET
and management notifications for the device?
---

/P



* [PATCH net-next v6 2/2] net: hsr: reject unresolved interlink ifindex
From: luka.gejak @ 2026-04-13 10:34 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni; +Cc: netdev, fmaurer, horms, Luka Gejak
In-Reply-To: <20260413103449.169913-1-luka.gejak@linux.dev>

From: Luka Gejak <luka.gejak@linux.dev>

In hsr_newlink(), a provided but invalid IFLA_HSR_INTERLINK attribute
was silently ignored if __dev_get_by_index() returned NULL. This leads
to incorrect RedBox topology creation without notifying the user.

Fix this by returning -EINVAL and an extack message when the
interlink attribute is present but cannot be resolved.

Assisted-by: Gemini:Gemini-3.1-flash
Reviewed-by: Felix Maurer <fmaurer@redhat.com>
Signed-off-by: Luka Gejak <luka.gejak@linux.dev>
---
 net/hsr/hsr_netlink.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/hsr/hsr_netlink.c b/net/hsr/hsr_netlink.c
index db0b0af7a692..f0ca23da3ab9 100644
--- a/net/hsr/hsr_netlink.c
+++ b/net/hsr/hsr_netlink.c
@@ -76,9 +76,14 @@ static int hsr_newlink(struct net_device *dev,
 		return -EINVAL;
 	}
 
-	if (data[IFLA_HSR_INTERLINK])
+	if (data[IFLA_HSR_INTERLINK]) {
 		interlink = __dev_get_by_index(link_net,
 					       nla_get_u32(data[IFLA_HSR_INTERLINK]));
+		if (!interlink) {
+			NL_SET_ERR_MSG_MOD(extack, "Interlink does not exist");
+			return -EINVAL;
+		}
+	}
 
 	if (interlink && interlink == link[0]) {
 		NL_SET_ERR_MSG_MOD(extack, "Interlink and Slave1 are the same");
-- 
2.53.0



* [PATCH net-next v6 1/2] net: hsr: require valid EOT supervision TLV
From: luka.gejak @ 2026-04-13 10:34 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni; +Cc: netdev, fmaurer, horms, Luka Gejak
In-Reply-To: <20260413103449.169913-1-luka.gejak@linux.dev>

From: Luka Gejak <luka.gejak@linux.dev>

Supervision frames are only valid if terminated with a zero-length EOT
TLV. The current check fails to reject non-EOT entries as the terminal
TLV, potentially allowing malformed supervision traffic.

Fix this by strictly requiring the terminal TLV to be HSR_TLV_EOT
with a length of zero, and properly linearizing the TLV header before
access.

Assisted-by: Gemini:Gemini-3.1-flash
Signed-off-by: Luka Gejak <luka.gejak@linux.dev>
---
 net/hsr/hsr_forward.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/hsr/hsr_forward.c b/net/hsr/hsr_forward.c
index 0aca859c88cb..0774981a65c1 100644
--- a/net/hsr/hsr_forward.c
+++ b/net/hsr/hsr_forward.c
@@ -84,7 +84,7 @@ static bool is_supervision_frame(struct hsr_priv *hsr, struct sk_buff *skb)
 
 	/* Get next tlv */
 	total_length += hsr_sup_tag->tlv.HSR_TLV_length;
-	if (!pskb_may_pull(skb, total_length))
+	if (!pskb_may_pull(skb, total_length + sizeof(struct hsr_sup_tlv)))
 		return false;
 	skb_pull(skb, total_length);
 	hsr_sup_tlv = (struct hsr_sup_tlv *)skb->data;
@@ -100,7 +100,7 @@ static bool is_supervision_frame(struct hsr_priv *hsr, struct sk_buff *skb)
 
 		/* make sure another tlv follows */
 		total_length += sizeof(struct hsr_sup_tlv) + hsr_sup_tlv->HSR_TLV_length;
-		if (!pskb_may_pull(skb, total_length))
+		if (!pskb_may_pull(skb, total_length + sizeof(struct hsr_sup_tlv)))
 			return false;
 
 		/* get next tlv */
@@ -110,7 +110,7 @@ static bool is_supervision_frame(struct hsr_priv *hsr, struct sk_buff *skb)
 	}
 
 	/* end of tlvs must follow at the end */
-	if (hsr_sup_tlv->HSR_TLV_type == HSR_TLV_EOT &&
+	if (hsr_sup_tlv->HSR_TLV_type != HSR_TLV_EOT ||
 	    hsr_sup_tlv->HSR_TLV_length != 0)
 		return false;
 
-- 
2.53.0
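
The effect of the last hunk can be seen in a stand-alone sketch that
compares the old and new conditions. The struct, the helper names and the
numeric value of TLV_EOT are assumptions made for illustration; they are
not copied from net/hsr.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLV_EOT 0 /* assumed value, for illustration only */

/* Simplified stand-in for the TLV header. */
struct sup_tlv {
	uint8_t type;
	uint8_t length;
};

/* Old condition: only rejects an EOT TLV that has a non-zero length,
 * so any non-EOT TLV is accepted as the terminal entry.
 */
static bool terminal_tlv_ok_old(const struct sup_tlv *tlv)
{
	if (tlv->type == TLV_EOT && tlv->length != 0)
		return false;
	return true;
}

/* New condition: the terminal TLV must be EOT and must have length 0. */
static bool terminal_tlv_ok_new(const struct sup_tlv *tlv)
{
	if (tlv->type != TLV_EOT || tlv->length != 0)
		return false;
	return true;
}
```

With a terminal TLV of, say, type 23 and length 6, the old check returns
true and the new check returns false; both accept type TLV_EOT with
length 0.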



* [PATCH net-next v6 0/2] net: hsr: strict supervision TLV validation
From: luka.gejak @ 2026-04-13 10:34 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni; +Cc: netdev, fmaurer, horms, Luka Gejak

From: Luka Gejak <luka.gejak@linux.dev>

Changes in v6:
 - Dropped capitalization comment changes per request of Jakub Kicinski

Changes in v5:
 - Reverted TLV loop in Patch 1 to strict sequential parsing per IEC
   62439-3.
 - Retained pskb_may_pull() logic to ensure memory safety for TLV
   headers.
 - Dropped Reviewed-by from Patch 1 due to the logic evolving since
   original review.
 - Added Assisted-by tag for AI-aided translation and formatting to
   both patches.

Changes in v4:
 - Split from a 4-patch series into 'net' and 'net-next' as requested.
 - Implemented a TLV walker in Patch 1 to correctly handle extension
   TLVs and avoid regressions on paged frames/non-linearized skbs.
 - Corrected pskb_may_pull() logic to include the TLV header size.

History of pre-separation series (v1-v3):
Changes in v3:
 - addressed Felix review feedback in the VLAN add unwind fix
 - removed the superfluous empty line

Changes in v2:
 - picked up Reviewed-by tags on patches 1, 3 and 4
 - changes in patch 2 per advice of Felix Maurer

Luka Gejak (2):
  net: hsr: require valid EOT supervision TLV
  net: hsr: reject unresolved interlink ifindex

 net/hsr/hsr_forward.c | 6 +++---
 net/hsr/hsr_netlink.c | 7 ++++++-
 2 files changed, 9 insertions(+), 4 deletions(-)

-- 
2.53.0



* Re: [PATCH] vsock/virtio: fix accept queue count leak on transport mismatch in recv_listen
From: Stefano Garzarella @ 2026-04-13 10:30 UTC (permalink / raw)
  To: Dudu Lu; +Cc: netdev, stefanha, mst, jasowang
In-Reply-To: <20260413085243.73200-1-phx0fer@gmail.com>

On Mon, Apr 13, 2026 at 04:52:43PM +0800, Dudu Lu wrote:
>virtio_transport_recv_listen() calls sk_acceptq_added(sk) to increment
>the listener's accept queue counter before calling
>vsock_assign_transport(). When vsock_assign_transport() fails or selects
>a different transport than the one that received the packet, the error
>path returns without calling sk_acceptq_removed(sk), permanently
>incrementing sk_ack_backlog.
>
>A malicious VM peer can exploit this by sending repeated CONNECT
>requests that trigger the transport mismatch condition. Each such
>request permanently increments sk_ack_backlog. After approximately
>backlog+1 such requests (default backlog ~128), sk_acceptq_is_full()
>returns true, causing the listener to reject ALL new connections with
>-ENOMEM. The only recovery is closing and re-creating the listener
>socket.
>
>Compare with vmci_transport.c and hyperv_transport.c which correctly
>place sk_acceptq_added() AFTER the transport check, avoiding this
>issue entirely.
>
>Fix by moving sk_acceptq_added(sk) to after the transport validation
>check, matching the pattern used by the other transports.

The issue seems legitimate, but this patch doesn't do what you're 
describing here.

Out of curiosity, how did you generate it?

Stefano


>
>Fixes: c0cfa2d8a788 ("vsock: add multi-transports support")
>Signed-off-by: Dudu Lu <phx0fer@gmail.com>
>---
> net/vmw_vsock/virtio_transport_common.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
>diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>index 8a9fb23c6e85..29e1d9833be4 100644
>--- a/net/vmw_vsock/virtio_transport_common.c
>+++ b/net/vmw_vsock/virtio_transport_common.c
>@@ -1,3 +1,4 @@
>+	sk_acceptq_added(sk);
> // SPDX-License-Identifier: GPL-2.0-only
> /*
>  * common code for virtio vsock
>@@ -1560,8 +1561,9 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
> 		return -ENOMEM;
> 	}
>
>-	sk_acceptq_added(sk);
>
>+
>+	sk_acceptq_added(sk);
> 	lock_sock_nested(child, SINGLE_DEPTH_NESTING);
>
> 	child->sk_state = TCP_ESTABLISHED;
>-- 
>2.39.3 (Apple Git-145)
>
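
The ordering that the commit message describes (as opposed to what the
quoted diff does) can be modelled in user space. Everything below is a
simplified, hypothetical model: struct listener and the recv_listen_*()
helpers are invented for illustration and are not the kernel's struct sock
or the virtio transport code.

```c
#include <stdbool.h>

/* Hypothetical, simplified listener state. */
struct listener {
	unsigned int ack_backlog;     /* connections counted as pending */
	unsigned int max_ack_backlog; /* limit set by listen() */
};

static bool acceptq_is_full(const struct listener *l)
{
	return l->ack_backlog > l->max_ack_backlog;
}

/* Leaky ordering: count first, validate the transport afterwards. */
static int recv_listen_leaky(struct listener *l, bool transport_ok)
{
	if (acceptq_is_full(l))
		return -1;
	l->ack_backlog++;
	if (!transport_ok)
		return -1; /* error path never undoes the increment */
	return 0;
}

/* Described ordering: validate the transport, then count. */
static int recv_listen_fixed(struct listener *l, bool transport_ok)
{
	if (acceptq_is_full(l))
		return -1;
	if (!transport_ok)
		return -1; /* nothing was counted, nothing to undo */
	l->ack_backlog++;
	return 0;
}
```

In this model, repeated mismatching requests fill the leaky listener's
queue until valid requests are refused, while the fixed ordering leaves the
counter untouched.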



* Re: [PATCH iwl-net v2 2/6] ixgbe: add bounds check for debugfs register access
From: Simon Horman @ 2026-04-13 10:30 UTC (permalink / raw)
  To: Aleksandr Loktionov
  Cc: intel-wired-lan, anthony.l.nguyen, netdev, Paul Greenwalt
In-Reply-To: <20260408131154.2661818-3-aleksandr.loktionov@intel.com>

On Wed, Apr 08, 2026 at 03:11:50PM +0200, Aleksandr Loktionov wrote:
> From: Paul Greenwalt <paul.greenwalt@intel.com>
> 
> Prevent out-of-bounds MMIO accesses triggered through user-controlled
> register offsets.  IXGBE_HFDR (0x15FE8) is the highest valid MMIO
> register in the ixgbe register map; any offset beyond it would address
> unmapped memory.
> 
> Add a defense-in-depth check at two levels:
> 
> 1. ixgbe_read_reg() -- the noinline register read accessor.  A
>    WARN_ON_ONCE() guard here catches any future code path (including
>    ioctl extensions) that might inadvertently pass an out-of-range
>    offset without relying on higher layers to catch it first.
>    ixgbe_write_reg() is a static inline called from the TX/RX hot path;
>    adding WARN_ON_ONCE there would inline the check at every call site,
>    so only the read path gets this guard.
> 
> 2. ixgbe_dbg_reg_ops_write() -- the debugfs 'reg_ops' interface is the
>    only current path where a raw, user-supplied offset enters the driver.
>    Gating it before invoking the register accessors provides a clean,
>    user-visible failure (silent ignore with no kernel splat) for
>    deliberately malformed debugfs writes.
> 
> Add a reg <= IXGBE_HFDR guard to both the read and write paths in
> ixgbe_dbg_reg_ops_write(), and a WARN_ON_ONCE + early-return guard to
> ixgbe_read_reg().
> 
> Fixes: 91fbd8f081e2 ("ixgbe: added reg_ops file to debugfs")
> Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> ---
> v1 -> v2:
>  - Add Fixes: tag; reroute from iwl-next to iwl-net (security-relevant
>    hardening for user-controllable out-of-bounds MMIO).

Thanks for the update.

And sorry for not thinking to ask this earlier: this patch
addresses possible overruns of the mapped address space if the
supplied value for reg is too large. But do we also need a
guard against underrun if the value for reg is too small?
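
As a rough illustration of a guard that covers both ends of the range, the
sketch below uses a hypothetical helper that takes a signed offset, so the
lower bound is meaningful. Whether the driver needs the lower check depends
on the type of reg, which this thread does not show; only the 0x15FE8 upper
limit (IXGBE_HFDR) comes from the commit message.

```c
#include <stdbool.h>

#define REG_MAX_OFFSET 0x15FE8L /* IXGBE_HFDR, per the commit message */

/* Hypothetical helper: true if reg lies within [0, REG_MAX_OFFSET].
 * With an unsigned offset the first comparison is always true and only
 * the upper bound matters.
 */
static bool reg_offset_in_range(long reg)
{
	return reg >= 0 && reg <= REG_MAX_OFFSET;
}
```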

...


* Re: [PATCH] net/sched: sch_cake: fix NAT destination port not being updated in cake_update_flowkeys
From: Toke Høiland-Jørgensen @ 2026-04-13 10:07 UTC (permalink / raw)
  To: phx; +Cc: netdev
In-Reply-To: <CAKvCo-yFnu3RBbiGkaVi-X5qX_hN1a-FYrBZfzB9UKz8k-PZtQ@mail.gmail.com>

phx <phx0fer@gmail.com> writes:

> You're right, "vulnerability" is too strong - it's a correctness
> bug, not a security issue. Thanks for picking it up.

Cool. Could you please re-send with an updated commit message? Thanks!

-Toke

pw-bot: cr


* Re: [RFC] Proposal: Add sysfs interface for PCIe TPH Steering Tag retrieval and configuration
From: Leon Romanovsky @ 2026-04-13 10:01 UTC (permalink / raw)
  To: fengchengwen
  Cc: Jason Gunthorpe, Bjorn Helgaas, linux-rdma, linux-pci, netdev,
	dri-devel, Keith Busch, Yochai Cohen, Yishai Hadas, Zhiping Zhang
In-Reply-To: <6ea4c4c2-774e-aa76-3665-918e2a24cc84@huawei.com>

On Fri, Apr 10, 2026 at 10:30:52PM +0800, fengchengwen wrote:
> Hi all,
> 
> I'm writing to propose adding a sysfs interface to expose and configure the
> PCIe TPH Steering Tag for PCIe devices, which is retrieved inside the kernel.
> 
> 
> Background: The TPH Steering Tag is tightly coupled with both a PCIe device
> (identified by its BDF) and a CPU core. It can only be obtained in kernel
> mode. To allow user-space applications to fetch and set this value securely
> and conveniently, we need a standard kernel-to-user interface.
> 
> 
> Proposed Solution: Add several sysfs attributes under each PCIe device's
> sysfs directory:
> 1. /sys/bus/pci/devices/<BDF>/tph_mode to query the TPH mode (interrupt or
>    device specific)
> 2. /sys/bus/pci/devices/<BDF>/tph_enable to control the TPH feature
> 3. /sys/bus/pci/devices/<BDF>/tph_st to support both read and write
>    operations, e.g.:
>    Read operation:
>      echo "cpu=3" > /sys/bus/pci/devices/0000:01:00.0/tph_st
>      cat /sys/bus/pci/devices/0000:01:00.0/tph_st
>    Write operation:
>      echo "index=10 st=123" > /sys/bus/pci/devices/0000:01:00.0/tph_st
> 
> 
> The design strictly follows PCI subsystem sysfs standards and has the
> following key properties:
> 
> 1. Dynamic Visibility: The sysfs attributes will only be present for PCIe
>    devices that support TPH Steering Tag. Devices without TPH capability
>    will not show these nodes, avoiding unnecessary user confusion.
> 
> 2. Permission Control: The attributes will use 0600 file permissions,
>    ensuring only privileged root users can read or write them, which
>    satisfies security requirements for hardware configuration interfaces.
> 
> 3. Standard Implementation Location: The interface will be implemented in
>    drivers/pci/pci-sysfs.c, the canonical location for all PCI device sysfs
>    attributes, ensuring consistency and maintainability within the PCI
>    subsystem.
> 
> 
> Why sysfs instead of alternatives like VFIO-PCI ioctl:
> 
> - Universality: sysfs does not require binding the device to a special
>   driver such as vfio-pci. It is available to any privileged user-space
>   component, including system utilities, daemons, and monitoring tools.
> 
> - Simplicity: Both user-space usage (cat/echo) and kernel implementation are
>   straightforward, reducing code complexity and long-term maintenance cost.
> 
> - Design Alignment: TPH Steering Tag is a generic PCIe device feature, not
>   specific to user-space drivers like DPDK or VFIO. Exposing it via sysfs
>   matches the kernel's standard pattern for hardware capabilities.
> 
> 
> I look forward to your comments about this design before submitting the
> final patch.

You need to explain more clearly why this write functionality is useful
and necessary outside the VFIO/RDMA context:
https://lore.kernel.org/all/20260324234615.3731237-1-zhipingz@meta.com/

AFAIK, for non-VFIO TPH callers, the kernel has enough knowledge to set
the right ST values.

There are several comments regarding the implementation, but those can wait
until the rationale behind the proposal is fully clarified.

Thanks

> 
> Best regards,
> Chengwen Feng
> 

^ permalink raw reply

* RE: [PATCH net 1/1] tipc: validate Gap ACK blocks in STATE message
From: Tung Quang Nguyen @ 2026-04-13 10:01 UTC (permalink / raw)
  To: Ruide Cao, Ren Wei
  Cc: jmaloy@redhat.com, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, horms@kernel.org,
	yifanwucs@gmail.com, tomapufckgml@gmail.com, yuantan098@gmail.com,
	bird@lzu.edu.cn, enjou1224z@gmail.com, netdev@vger.kernel.org
In-Reply-To: <7369ab71-e3bc-48ac-8165-439ad8595fc0@gmail.com>

>Subject: Re: [PATCH net 1/1] tipc: validate Gap ACK blocks in STATE message
>
>
>On 4/12/2026 8:06 PM, Tung Quang Nguyen wrote:
>>> Subject: [PATCH net 1/1] tipc: validate Gap ACK blocks in STATE
>>> message
>>>
>>> From: Ruide Cao <caoruide123@gmail.com>
>>>
>>> tipc_get_gap_ack_blks() reads len, ugack_cnt and bgack_cnt directly
>>> from
>>> msg_data(hdr) before verifying that a STATE message actually contains
>>> the fixed Gap ACK block header in its logical data area.
>>>
>>> A peer that negotiates TIPC_GAP_ACK_BLOCK can send a short STATE
>>> message with a declared TIPC payload shorter than struct
>>> tipc_gap_ack_blks and still append a few physical bytes after the
>>> header. The helper then trusts those bytes as Gap ACK metadata, and
>>> the forged bgack_cnt/len values can drive the broadcast receive path
>>> into kmemdup() beyond the skb boundary.
>> Can you explain how that peer can alter the STATE message? If it can, what
>> concrete values are used and on what fields of the STATE messages?
>
>Thanks for the review.
>
>To clarify, the peer is not "altering" an already received STATE message; it is
>actively sending a malformed LINK_PROTOCOL/STATE_MSG after the link has
>already negotiated the TIPC_GAP_ACK_BLOCK capability.
>
>Concretely, the crafted STATE message is sent with a modified msg_size so that
>msg_data_sz(hdr) is 0, but the actual UDP payload still carries extra physical
>bytes after the 40-byte TIPC header. Those bytes are then interpreted as the
>fixed Gap ACK header. For example:
>  len       = 0x07fc
>  ugack_cnt = 0xff
>  bgack_cnt = 0xff
>
It is surprising that you can modify any field you want in the TIPC message. I do not think that the current TIPC code can handle this corrupt message.
Can you send me the stack trace at the receiving peer when a real crash happens after you send the "crafted" STATE message?
>These values are specifically chosen so that the existing sanity check remains
>internally consistent:
>  struct_size(p, gacks, 0xff + 0xff) == 0x07fc
>
>Therefore, the existing sanity check does not reject this case. It only checks the
>self-consistency of the attacker-controlled Gap ACK fields; it completely fails to
>check if the declared Gap ACK record actually fits inside the enclosing STATE
>message's logical payload length.
>
>>> Fix this by rejecting Gap ACK parsing unless the logical STATE
>>> payload is large enough to cover the fixed header, and by rejecting
>>> declared Gap ACK lengths that are smaller than the fixed header or
>>> larger than the logical payload.
>>> Return 0 for invalid lengths so malformed Gap ACK data is not treated
>>> as a valid payload offset, and drop unicast STATE messages that
>>> advertise Gap ACK support but still yield an invalid Gap ACK length.
>>> This keeps malformed Gap ACK data ignored without misaligning monitor
>>> payload parsing.
>>>
>>> Fixes: d7626b5acff9 ("tipc: introduce Gap ACK blocks for broadcast link")
>>> Cc: stable@kernel.org
>>> Reported-by: Yifan Wu <yifanwucs@gmail.com>
>>> Reported-by: Juefei Pu <tomapufckgml@gmail.com>
>>> Co-developed-by: Yuan Tan <yuantan098@gmail.com>
>>> Signed-off-by: Yuan Tan <yuantan098@gmail.com>
>>> Suggested-by: Xin Liu <bird@lzu.edu.cn>
>>> Tested-by: Ren Wei <enjou1224z@gmail.com>
>>> Signed-off-by: Ruide Cao <caoruide123@gmail.com>
>>> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
>>> ---
>>> net/tipc/link.c | 16 ++++++++++++++--
>>> 1 file changed, 14 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/net/tipc/link.c b/net/tipc/link.c
>>> index 49dfc098d89b..44678d98939a 100644
>>> --- a/net/tipc/link.c
>>> +++ b/net/tipc/link.c
>>> @@ -1415,12 +1415,22 @@ u16 tipc_get_gap_ack_blks(struct tipc_gap_ack_blks **ga, struct tipc_link *l,
>>> 			  struct tipc_msg *hdr, bool uc)
>>> {
>>> 	struct tipc_gap_ack_blks *p;
>>> -	u16 sz = 0;
>>> +	u16 sz = 0, dlen = msg_data_sz(hdr);
>>>
>>> 	/* Does peer support the Gap ACK blocks feature? */
>>> 	if (l->peer_caps & TIPC_GAP_ACK_BLOCK) {
>>> +		u16 min_sz = struct_size(p, gacks, 0);
>>> +
>>> +		if (dlen < min_sz)
>>> +			goto ignore;
>> This checking is redundant because with existing sanity checking, the invalid
>> gap ACK blocks will not be used to release acked messages in transmit queue.
>
>The `dlen < min_sz` check is required because the existing sanity check already
>dereferences `p->len`, `p->ugack_cnt`, and `p->bgack_cnt`.
In the case of dlen > p->len with p->len = 0x07fc, p->ugack_cnt = 0xff and p->bgack_cnt = 0xff (the sender modified or kept dlen and modified the remaining 3 fields),
how could your above and subsequent sanity checks validate p->len, p->bgack_cnt and p->ugack_cnt?
>Without this new check, an Out-of-Bounds (OOB) read occurs before the old
>sanity check even has a chance to run.
>
>>> +
>>> 		p = (struct tipc_gap_ack_blks *)msg_data(hdr);
>>> 		sz = ntohs(p->len);
>>> +		if (sz < min_sz || sz > dlen) {
>>> +			sz = 0;
>>> +			goto ignore;
>>> +		}
>> This checking is redundant. Existing sanity checking is good enough.
>
>The `sz < min_sz || sz > dlen` check is not redundant because the old sanity
>check completely fails to verify if the declared Gap ACK length
>(`sz`) actually fits inside the enclosing STATE message's logical payload length
>(`dlen`).
>
>Without checking against `dlen`, an internally consistent spoofed packet will
>pass the old check and cause OOB reads during the subsequent block parsing.
>
>>> +
>>> 		/* Sanity check */
>>> 		if (sz == struct_size(p, gacks, size_add(p->ugack_cnt, p->bgack_cnt))) {
>>> 			/* Good, check if the desired type exists */
>>> @@ -1434,6 +1444,8 @@ u16 tipc_get_gap_ack_blks(struct tipc_gap_ack_blks **ga, struct tipc_link *l,
>>> 			}
>>> 		}
>>> 	}
>>> +
>>> +ignore:
>>> 	/* Other cases: ignore! */
>>> 	p = NULL;
>>>
>>> @@ -2270,7 +2282,7 @@ static int tipc_link_proto_rcv(struct tipc_link *l, struct sk_buff *skb,
>>> 	case STATE_MSG:
>>> 		/* Validate Gap ACK blocks, drop if invalid */
>>> 		glen = tipc_get_gap_ack_blks(&ga, l, hdr, true);
>>> -		if (glen > dlen)
>>> +		if (glen > dlen || ((l->peer_caps & TIPC_GAP_ACK_BLOCK) &&
>>> !glen))
>> This checking is redundant. Existing sanity checking is good enough.
>
>The unicast caller-side drop `((l->peer_caps & TIPC_GAP_ACK_BLOCK) &&
>!glen)` is also necessary. Once the capability is negotiated, a valid Gap ACK
>record MUST have at least the fixed 4-byte header. If `glen == 0` from such a
>peer, it indicates a malformed payload.
>
>The STATE message must be dropped here so it is not passed on to
>`tipc_mon_rcv()` as if monitor data started at `data + 0`, which would misalign
>the monitor payload parsing.
>
>>> 			break;
>>>
>>> 		l->rcv_nxt_state = msg_seqno(hdr) + 1;
>>> --
>>> 2.34.1
>>>
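
For readers following the arithmetic in the discussion above, here is a
minimal user-space sketch of the two kinds of check being debated. The
struct and helper names are illustrative stand-ins, not the kernel's
definitions; the layout (a 4-byte fixed header followed by 4-byte
entries) is an assumption chosen to match the quoted
struct_size(p, gacks, 0xff + 0xff) == 0x07fc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the Gap ACK structures (assumed layout). */
struct gap_ack {
	uint16_t ack;
	uint16_t gap;
};

struct gap_ack_blks {
	uint16_t len;
	uint8_t ugack_cnt;
	uint8_t bgack_cnt;
	struct gap_ack gacks[];
};

/* Size of a record holding n entries, in the spirit of struct_size(). */
static size_t gap_ack_size(size_t n)
{
	return sizeof(struct gap_ack_blks) + n * sizeof(struct gap_ack);
}

/* Self-consistency only: the sender-controlled fields agree with each other. */
static int self_consistent(uint16_t len, uint8_t ug, uint8_t bg)
{
	return (size_t)len == gap_ack_size((size_t)ug + bg);
}

/* Extra bound: declared length covers the fixed header and fits in dlen. */
static int fits_payload(uint16_t len, uint16_t dlen)
{
	assert(sizeof(struct gap_ack_blks) == 4);
	return (size_t)len >= gap_ack_size(0) && len <= dlen;
}
```

With the quoted values, self_consistent(0x07fc, 0xff, 0xff) holds, while
fits_payload(0x07fc, 0) does not: the first check only compares fields
against each other, the second compares them against the logical payload
length.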

^ permalink raw reply

* Re: [PATCH] net/sched: sch_cake: fix NAT destination port not being updated in cake_update_flowkeys
From: Toke Høiland-Jørgensen @ 2026-04-13  9:41 UTC (permalink / raw)
  To: Dudu Lu, netdev; +Cc: jhs, jiri, Dudu Lu
In-Reply-To: <20260413084715.70169-1-phx0fer@gmail.com>

Dudu Lu <phx0fer@gmail.com> writes:

> cake_update_flowkeys() is supposed to update the flow dissector keys
> with the NAT-translated addresses and ports from conntrack, so that
> CAKE's per-flow fairness correctly identifies post-NAT flows as
> belonging to the same connection.
>
> For the source port, this works correctly:
>     keys->ports.src = port;  /* writes conntrack port into keys */
>
> But for the destination port, the assignment is reversed:
>     port = keys->ports.dst;  /* reads FROM keys into local var — no-op */

Huh, what a silly mistake - nice find!
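
A minimal sketch of the difference, based only on the two lines quoted
above. The struct types and function names here are hypothetical
simplifications for illustration, not the sch_cake code itself:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the flow keys and NAT tuple. */
struct ports_sketch { uint16_t src, dst; };
struct flow_keys_sketch { struct ports_sketch ports; };
struct nat_tuple_sketch { uint16_t src_port, dst_port; };

/* Reversed dst assignment: the local variable is overwritten from the
 * keys, so the keys never receive the translated destination port. */
static int update_ports_buggy(struct flow_keys_sketch *keys,
			      const struct nat_tuple_sketch *t)
{
	uint16_t port;
	int upd = 0;

	port = t->src_port;
	if (port != keys->ports.src) {
		keys->ports.src = port;
		upd = 1;
	}
	port = t->dst_port;
	if (port != keys->ports.dst) {
		port = keys->ports.dst;	/* no effect on keys */
		upd = 1;
	}
	return upd;
}

/* Intended behaviour: write the translated port into the keys. */
static int update_ports_fixed(struct flow_keys_sketch *keys,
			      const struct nat_tuple_sketch *t)
{
	int upd = 0;

	if (t->src_port != keys->ports.src) {
		keys->ports.src = t->src_port;
		upd = 1;
	}
	if (t->dst_port != keys->ports.dst) {
		keys->ports.dst = t->dst_port;
		upd = 1;
	}
	return upd;
}
```

In the buggy variant the function still reports an update, but
keys->ports.dst keeps its pre-NAT value.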

> This means the NAT destination port is never updated in the flow keys.
> As a result, when multiple connections are NATed to the same destination
> (same IP + same port), CAKE treats them as separate flows because the
> original (pre-NAT) destination ports differ. This completely defeats
> CAKE's NAT-aware flow isolation when using the "nat" mode.
>
> The vulnerability was introduced in commit b0c19ed6088a ("sch_cake: Take advantage
> of skb->hash where appropriate")

Calling it a "vulnerability" seems perhaps a tad hyperbolic. Care to
elaborate on what you mean here?

-Toke

^ permalink raw reply

* Re: [PATCH 6.12.y] netfilter: conntrack: add missing netlink policy validations
From: Pablo Neira Ayuso @ 2026-04-13  9:47 UTC (permalink / raw)
  To: Li hongliang
  Cc: gregkh, stable, fw, patches, linux-kernel, kadlec, davem,
	edumazet, kuba, pabeni, horms, kaber, netfilter-devel, coreteam,
	netdev, imv4bel
In-Reply-To: <20260413073105.2990210-1-1468888505@139.com>

Why only 6.12?

On Mon, Apr 13, 2026 at 03:31:05PM +0800, Li hongliang wrote:
> From: Florian Westphal <fw@strlen.de>
> 
> [ Upstream commit f900e1d77ee0ef87bfb5ab3fe60f0b3d8ad5ba05 ]
> 
> Hyunwoo Kim reports out-of-bounds access in sctp and ctnetlink.
> 
> These attributes are used by the kernel without any validation.
> Extend the netlink policies accordingly.
> 
> Quoting the reporter:
>   nlattr_to_sctp() assigns the user-supplied CTA_PROTOINFO_SCTP_STATE
>   value directly to ct->proto.sctp.state without checking that it is
>   within the valid range. [..]
> 
>   and: ... with exp->dir = 100, the access at
>   ct->master->tuplehash[100] reads 5600 bytes past the start of a
>   320-byte nf_conn object, causing a slab-out-of-bounds read confirmed by
>   UBSAN.
> 
> Fixes: 076a0ca02644 ("netfilter: ctnetlink: add NAT support for expectations")
> Fixes: a258860e01b8 ("netfilter: ctnetlink: add full support for SCTP to ctnetlink")
> Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> Signed-off-by: Li hongliang <1468888505@139.com>
> ---
>  net/netfilter/nf_conntrack_netlink.c    | 2 +-
>  net/netfilter/nf_conntrack_proto_sctp.c | 3 ++-
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
> index 323e147fe282..f51cdfba68fb 100644
> --- a/net/netfilter/nf_conntrack_netlink.c
> +++ b/net/netfilter/nf_conntrack_netlink.c
> @@ -3460,7 +3460,7 @@ ctnetlink_change_expect(struct nf_conntrack_expect *x,
>  
>  #if IS_ENABLED(CONFIG_NF_NAT)
>  static const struct nla_policy exp_nat_nla_policy[CTA_EXPECT_NAT_MAX+1] = {
> -	[CTA_EXPECT_NAT_DIR]	= { .type = NLA_U32 },
> +	[CTA_EXPECT_NAT_DIR]	= NLA_POLICY_MAX(NLA_BE32, IP_CT_DIR_REPLY),
>  	[CTA_EXPECT_NAT_TUPLE]	= { .type = NLA_NESTED },
>  };
>  #endif
> diff --git a/net/netfilter/nf_conntrack_proto_sctp.c b/net/netfilter/nf_conntrack_proto_sctp.c
> index 4cc97f971264..fabb2c1ca00a 100644
> --- a/net/netfilter/nf_conntrack_proto_sctp.c
> +++ b/net/netfilter/nf_conntrack_proto_sctp.c
> @@ -587,7 +587,8 @@ static int sctp_to_nlattr(struct sk_buff *skb, struct nlattr *nla,
>  }
>  
>  static const struct nla_policy sctp_nla_policy[CTA_PROTOINFO_SCTP_MAX+1] = {
> -	[CTA_PROTOINFO_SCTP_STATE]	    = { .type = NLA_U8 },
> +	[CTA_PROTOINFO_SCTP_STATE]	    = NLA_POLICY_MAX(NLA_U8,
> +							 SCTP_CONNTRACK_HEARTBEAT_SENT),
>  	[CTA_PROTOINFO_SCTP_VTAG_ORIGINAL]  = { .type = NLA_U32 },
>  	[CTA_PROTOINFO_SCTP_VTAG_REPLY]     = { .type = NLA_U32 },
>  };
> -- 
> 2.34.1
> 
> 

^ permalink raw reply

* [PATCH net-next v2 5/5] selftests: net: add veth BQL stress test
From: hawk @ 2026-04-13  9:44 UTC (permalink / raw)
  To: netdev
  Cc: kernel-team, Jesper Dangaard Brouer, Jonas Köppeler,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Shuah Khan, linux-kernel, linux-kselftest
In-Reply-To: <20260413094442.1376022-1-hawk@kernel.org>

From: Jesper Dangaard Brouer <hawk@kernel.org>

Add a selftest that exercises veth's BQL (Byte Queue Limits) code path
under sustained UDP load. The test creates a veth pair with GRO enabled
(activating the NAPI path and BQL), attaches a qdisc, optionally loads
iptables rules in the consumer namespace to slow NAPI processing, and
floods UDP packets for a configurable duration.

The test serves two purposes: benchmarking BQL's latency impact under
configurable load (iptables rules, qdisc type and parameters), and
detecting kernel BUG/Oops from DQL accounting mismatches. It monitors
dmesg throughout the run and reports PASS/FAIL via kselftest (lib.sh).

Diagnostic output is printed every 5 seconds:
  - BQL sysfs inflight/limit and watchdog tx_timeout counter
  - qdisc stats: packets, drops, requeues, backlog, qlen, overlimits
  - consumer PPS and NAPI-64 cycle time (shows fq_codel target impact)
  - sink PPS (per-period delta), latency min/avg/max (stddev at exit)
  - ping RTT to measure latency under load

Generating enough traffic to fill the 256-entry ptr_ring requires care:
the UDP sendto() path charges each SKB to sk_wmem_alloc, and the SKB
stays charged (via sock_wfree destructor) until the consumer NAPI thread
finishes processing it -- including any iptables rules in the receive
path. With the default sk_sndbuf (~208KB from wmem_default), only ~93
packets can be in-flight before sendto(MSG_DONTWAIT) returns EAGAIN.
Since 93 < 256 ring entries, the ring never fills and no backpressure
occurs. The test raises wmem_max via sysctl and sets SO_SNDBUF=1MB on
the flood socket to remove this bottleneck. An earlier multi-namespace
routing approach avoided this limit because ip_forward creates new SKBs
detached from the sender's socket.
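
The in-flight limit described above can be approximated with a small
model. This is a sketch under stated assumptions, not kernel code: it
assumes a send is admitted while the bytes already charged are below the
send buffer size, and it takes the per-SKB charge (truesize) as an input.
The helper names are illustrative.

```c
#include <assert.h>

/* Simplified model (an assumption): a send is admitted while the bytes
 * already charged are below sndbuf, so the in-flight SKB count is the
 * ceiling of sndbuf / truesize. */
static unsigned long inflight_limit(unsigned long sndbuf,
				    unsigned long truesize)
{
	assert(truesize > 0);
	return (sndbuf + truesize - 1) / truesize;
}

/* The ring can only fill if the socket allows at least ring_entries
 * SKBs to be outstanding at once. */
static int ring_can_fill(unsigned long sndbuf, unsigned long truesize,
			 unsigned long ring_entries)
{
	return inflight_limit(sndbuf, truesize) >= ring_entries;
}
```

Assuming a truesize of 2304 bytes per packet (an assumed figure, not
taken from the patch), inflight_limit(212992, 2304) gives 93, below the
256 ring entries, while inflight_limit(1048576, 2304) gives 456, which
is enough to saturate the ring.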

The --bql-disable option (sets limit_min=1GB) enables A/B comparison.
Typical results with --nrules 6000 --qdisc-opts 'target 2ms interval 20ms':

  fq_codel + BQL disabled:  ping RTT ~10.8ms, 15% loss, 400KB in ptr_ring
  fq_codel + BQL enabled:   ping RTT ~0.6ms,   0% loss, 4KB in ptr_ring

Both cases show identical consumer speed (~20Kpps) and fq_codel drops
(~255K), proving the improvement comes purely from where packets buffer.

BQL moves buffering from the ptr_ring into the qdisc, where AQM
(fq_codel/CAKE) can act on it -- eliminating the "dark buffer" that
hides congestion from the scheduler.

The --qdisc-replace mode cycles through sfq/pfifo/fq_codel/noqueue
under active traffic to verify that stale BQL state (STACK_XOFF) is
properly handled during live qdisc transitions.

A companion wrapper (veth_bql_test_virtme.sh) launches the test inside
a virtme-ng VM, with .config validation to prevent silent stalls.

Usage:
  sudo ./veth_bql_test.sh [--duration 300] [--nrules 100]
                          [--qdisc sfq] [--qdisc-opts '...']
                          [--bql-disable] [--normal-napi]
                          [--qdisc-replace] [--tiny-flood]

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
---
 tools/testing/selftests/net/Makefile          |   3 +
 tools/testing/selftests/net/config            |   1 +
 tools/testing/selftests/net/napi_poll_hist.bt |  40 +
 tools/testing/selftests/net/veth_bql_test.sh  | 821 ++++++++++++++++++
 .../selftests/net/veth_bql_test_virtme.sh     | 124 +++
 5 files changed, 989 insertions(+)
 create mode 100644 tools/testing/selftests/net/napi_poll_hist.bt
 create mode 100755 tools/testing/selftests/net/veth_bql_test.sh
 create mode 100755 tools/testing/selftests/net/veth_bql_test_virtme.sh

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 231245a95879..7f6524169b93 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -119,6 +119,7 @@ TEST_PROGS := \
 	udpgso_bench.sh \
 	unicast_extensions.sh \
 	veth.sh \
+	veth_bql_test.sh \
 	vlan_bridge_binding.sh \
 	vlan_hw_filter.sh \
 	vrf-xfrm-tests.sh \
@@ -196,7 +197,9 @@ TEST_FILES := \
 	fcnal-test.sh \
 	in_netns.sh \
 	lib.sh \
+	napi_poll_hist.bt \
 	settings \
+	veth_bql_test_virtme.sh \
 # end of TEST_FILES
 
 # YNL files, must be before "include ..lib.mk"
diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config
index 2a390cae41bf..7b1f41421145 100644
--- a/tools/testing/selftests/net/config
+++ b/tools/testing/selftests/net/config
@@ -97,6 +97,7 @@ CONFIG_NET_PKTGEN=m
 CONFIG_NET_SCH_ETF=m
 CONFIG_NET_SCH_FQ=m
 CONFIG_NET_SCH_FQ_CODEL=m
+CONFIG_NET_SCH_SFQ=m
 CONFIG_NET_SCH_HTB=m
 CONFIG_NET_SCH_INGRESS=m
 CONFIG_NET_SCH_NETEM=y
diff --git a/tools/testing/selftests/net/napi_poll_hist.bt b/tools/testing/selftests/net/napi_poll_hist.bt
new file mode 100644
index 000000000000..34d1a43906bf
--- /dev/null
+++ b/tools/testing/selftests/net/napi_poll_hist.bt
@@ -0,0 +1,40 @@
+#!/usr/bin/env bpftrace
+// SPDX-License-Identifier: GPL-2.0
+// napi_poll work histogram for veth BQL testing.
+// Shows how many packets each NAPI poll processes (0..64).
+// Full-budget (64) polls mean more work is pending; partial (<64) means
+// the ring drained before the budget was exhausted.
+//
+// Usage: bpftrace napi_poll_hist.bt
+// Interval output is a single compact line for easy script parsing.
+
+tracepoint:napi:napi_poll
+/str(args->dev_name, 8) == "veth_bql"/
+{
+	@work = lhist(args->work, 0, 65, 1);
+	@total++;
+	@sum += args->work;
+	if (args->work == args->budget) {
+		@full++;
+	}
+}
+
+interval:s:5
+{
+	$avg = @total > 0 ? @sum / @total : 0;
+	printf("napi_poll: polls=%llu full_budget=%llu partial=%llu avg_work=%llu\n",
+	       @total, @full, @total - @full, $avg);
+	clear(@total);
+	clear(@full);
+	clear(@sum);
+}
+
+END
+{
+	printf("\n--- napi_poll work histogram (lifetime) ---\n");
+	print(@work);
+	clear(@work);
+	clear(@total);
+	clear(@full);
+	clear(@sum);
+}
diff --git a/tools/testing/selftests/net/veth_bql_test.sh b/tools/testing/selftests/net/veth_bql_test.sh
new file mode 100755
index 000000000000..bfbbb3432a8f
--- /dev/null
+++ b/tools/testing/selftests/net/veth_bql_test.sh
@@ -0,0 +1,821 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Veth BQL (Byte Queue Limits) stress test and A/B benchmarking tool.
+#
+# Creates a veth pair with GRO on and TSO off (ensures all packets use
+# the NAPI/ptr_ring path where BQL operates), attaches a configurable
+# qdisc, optionally loads iptables rules to slow the consumer NAPI
+# processing, and floods UDP packets at maximum rate.
+#
+# Primary uses:
+#   1) A/B comparison of latency with/without BQL (--bql-disable flag)
+#   2) Testing different qdiscs and their parameters (--qdisc, --qdisc-opts)
+#   3) Detecting kernel BUG/Oops from DQL accounting mismatches
+#
+# Key design detail -- SO_SNDBUF and wmem_max:
+#   The UDP sendto() path charges each SKB to the socket's sk_wmem_alloc
+#   counter.  The SKB carries a destructor (sock_wfree) that releases the
+#   charge only after the consumer NAPI thread on the peer veth finishes
+#   processing it -- including any iptables rules in the receive path.
+#   With the default sk_sndbuf (~208KB from wmem_default), only ~93
+#   packets (1442B each) can be in-flight before sendto() returns EAGAIN.
+#   Since 93 < 256 ptr_ring entries, the ring never fills and no qdisc
+#   backpressure occurs.  The test temporarily raises the global wmem_max
+#   sysctl and sets SO_SNDBUF=1MB to allow enough in-flight SKBs to
+#   saturate the ptr_ring.  The original wmem_max is restored on exit.
+#
+# Two TX-stop mechanisms and the dark-buffer problem:
+#   DRV_XOFF backpressure (commit dc82a33297fc) stops the TX queue when
+#   the 256-entry ptr_ring is full.  The queue is released at the end of
+#   veth_poll() (commit 5442a9da6978) after processing up to 64 packets
+#   (NAPI budget).  Without BQL, the entire ring is a FIFO "dark buffer"
+#   in front of the qdisc -- packets there are invisible to AQM.
+#
+#   BQL adds STACK_XOFF, which dynamically limits in-flight bytes and
+#   stops the queue *before* the ring fills.  This keeps the ring
+#   shallow and moves buffering into the qdisc where sojourn-based AQM
+#   (codel, fq_codel, CAKE/COBALT) can measure and drop packets.
+#
+# Sojourn time and NAPI budget interaction:
+#   DRV_XOFF releases backpressure once per NAPI poll (up to 64 pkts).
+#   During that cycle, packets queued in the qdisc accumulate sojourn
+#   time.  With fq_codel's default target of 5ms, the threshold is:
+#     5000us / 64 pkts = 78us/pkt --> ~12,800 pps consumer speed.
+#   Below that rate the NAPI-64 cycle exceeds the target and fq_codel
+#   starts dropping.  Use --nrules and --qdisc-opts to experiment.
+#
+cd "$(dirname -- "$0")" || exit 1
+source lib.sh
+
+# Defaults
+DURATION=30       # seconds; use longer --duration to reach DQL counter wrap
+NRULES=3500       # iptables rules in consumer NS (0 to disable)
+QDISC=sfq         # qdisc to use (sfq, pfifo, fq_codel, etc.)
+QDISC_OPTS=""     # extra qdisc parameters (e.g. "target 1ms interval 10ms")
+BQL_DISABLE=0     # 1 to disable BQL (sets limit_min high)
+NORMAL_NAPI=0     # 1 to use normal softirq NAPI (skip threaded NAPI)
+QDISC_REPLACE=0   # 1 to test qdisc replacement under active traffic
+TINY_FLOOD=0      # 1 to add 2nd UDP thread with min-size packets
+VETH_A="veth_bql0"
+VETH_B="veth_bql1"
+IP_A="10.99.0.1"
+IP_B="10.99.0.2"
+PORT=9999
+PKT_SIZE=1400     # large packets: slower producer, bigger BQL charges
+
+usage() {
+    echo "Usage: $0 [OPTIONS]"
+    echo "  --duration SEC   test duration (default: $DURATION)"
+    echo "  --nrules N       iptables rules to slow consumer (default: $NRULES, 0=disable)"
+    echo "  --qdisc NAME     qdisc to install (default: $QDISC)"
+    echo "  --qdisc-opts STR extra qdisc params (e.g. 'target 1ms interval 10ms')"
+    echo "  --bql-disable    disable BQL for A/B comparison"
+    echo "  --normal-napi    use softirq NAPI instead of threaded NAPI"
+    echo "  --qdisc-replace  test qdisc replacement under active traffic"
+    echo "  --tiny-flood     add 2nd UDP thread with min-size packets (stress BQL bytes)"
+    exit 1
+}
+
+while [ $# -gt 0 ]; do
+    case "$1" in
+    --duration)   DURATION="$2"; shift 2 ;;
+    --nrules)     NRULES="$2"; shift 2 ;;
+    --qdisc)      QDISC="$2"; shift 2 ;;
+    --qdisc-opts) QDISC_OPTS="$2"; shift 2 ;;
+    --bql-disable) BQL_DISABLE=1; shift ;;
+    --normal-napi) NORMAL_NAPI=1; shift ;;
+    --qdisc-replace) QDISC_REPLACE=1; shift ;;
+    --tiny-flood) TINY_FLOOD=1; shift ;;
+    --help|-h)    usage ;;
+    *)            echo "Unknown option: $1" >&2; usage ;;
+    esac
+done
+
+TMPDIR=$(mktemp -d)
+
+FLOOD_PID=""
+FLOOD2_PID=""
+SINK_PID=""
+PING_PID=""
+BPFTRACE_PID=""
+
+# shellcheck disable=SC2329  # cleanup is invoked indirectly via trap
+cleanup() {
+    [ -n "$BPFTRACE_PID" ] && kill_process "$BPFTRACE_PID"
+    [ -n "$FLOOD_PID" ] && kill_process "$FLOOD_PID"
+    [ -n "$FLOOD2_PID" ] && kill_process "$FLOOD2_PID"
+    [ -n "$SINK_PID" ] && kill_process "$SINK_PID"
+    [ -n "$PING_PID" ] && kill_process "$PING_PID"
+    cleanup_all_ns
+    ip link del "$VETH_A" 2>/dev/null || true
+    [ -n "$ORIG_WMEM_MAX" ] && sysctl -qw net.core.wmem_max="$ORIG_WMEM_MAX"
+    rm -rf "$TMPDIR"
+}
+trap cleanup EXIT
+
+require_command gcc
+require_command ethtool
+require_command tc
+
+# --- Function definitions ---
+
+compile_tools() {
+    echo "--- Compiling UDP flood tool ---"
+cat > "$TMPDIR"/udp_flood.c << 'CEOF'
+#include <arpa/inet.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <time.h>
+#include <unistd.h>
+
+static volatile int running = 1;
+
+static void stop(int sig) { running = 0; }
+
+struct pkt_hdr {
+	struct timespec ts;
+	unsigned long seq;
+};
+
+int main(int argc, char **argv)
+{
+	struct sockaddr_in dst;
+	struct pkt_hdr hdr;
+	unsigned long count = 0;
+	char buf[1500];
+	int sndbuf = 1048576;
+	int pkt_size, max_pkt_size;
+	int cur_size;
+	int duration;
+	int fd;
+
+	if (argc < 5) {
+		fprintf(stderr, "Usage: %s <ip> <pkt_size> <port> <duration> [max_pkt_size]\n",
+			argv[0]);
+		return 1;
+	}
+
+	pkt_size = atoi(argv[2]);
+	if (pkt_size < (int)sizeof(struct pkt_hdr))
+		pkt_size = sizeof(struct pkt_hdr);
+	if (pkt_size > (int)sizeof(buf))
+		pkt_size = sizeof(buf);
+	max_pkt_size = (argc > 5) ? atoi(argv[5]) : pkt_size;
+	if (max_pkt_size < pkt_size)
+		max_pkt_size = pkt_size;
+	if (max_pkt_size > (int)sizeof(buf))
+		max_pkt_size = sizeof(buf);
+	duration = atoi(argv[4]);
+
+	memset(&dst, 0, sizeof(dst));
+	dst.sin_family = AF_INET;
+	dst.sin_port = htons(atoi(argv[3]));
+	inet_pton(AF_INET, argv[1], &dst.sin_addr);
+
+	fd = socket(AF_INET, SOCK_DGRAM, 0);
+	if (fd < 0) {
+		perror("socket");
+		return 1;
+	}
+
+	/* Raise send buffer so sk_wmem_alloc limit doesn't cap
+	 * in-flight packets before the ptr_ring (256 entries) fills.
+	 * Default wmem_default ~208K only allows ~93 packets.
+	 */
+	setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
+
+	memset(buf, 0xAA, sizeof(buf));
+	signal(SIGINT, stop);
+	signal(SIGTERM, stop);
+	signal(SIGALRM, stop);
+	alarm(duration);
+
+	while (running) {
+		if (max_pkt_size > pkt_size)
+			cur_size = pkt_size + (rand() % (max_pkt_size - pkt_size + 1));
+		else
+			cur_size = pkt_size;
+		clock_gettime(CLOCK_MONOTONIC, &hdr.ts);
+		hdr.seq = count;
+		memcpy(buf, &hdr, sizeof(hdr));
+		sendto(fd, buf, cur_size, MSG_DONTWAIT,
+		       (struct sockaddr *)&dst, sizeof(dst));
+		count++;
+		if (!(count % 10000000))
+			fprintf(stderr, "  sent: %lu M packets\n",
+				count / 1000000);
+	}
+
+	fprintf(stderr, "Total sent: %lu packets (%.1f M)\n",
+		count, (double)count / 1e6);
+	close(fd);
+	return 0;
+}
+CEOF
+gcc -O2 -Wall -o "$TMPDIR"/udp_flood "$TMPDIR"/udp_flood.c || exit $ksft_fail
+
+# UDP sink with latency measurement
+cat > "$TMPDIR"/udp_sink.c << 'CEOF'
+#include <arpa/inet.h>
+#include <math.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <time.h>
+#include <unistd.h>
+
+static volatile int running = 1;
+
+static void stop(int sig) { running = 0; }
+
+struct pkt_hdr {
+	struct timespec ts;
+	unsigned long seq;
+};
+
+static void print_periodic(unsigned long count, unsigned long delta_count,
+			   double delta_sec, unsigned long drops,
+			   unsigned long reorders,
+			   double lat_min, double lat_sum,
+			   double lat_max)
+{
+	unsigned long pps;
+
+	if (!count)
+		return;
+	pps = delta_sec > 0 ? (unsigned long)(delta_count / delta_sec) : 0;
+	fprintf(stderr, "  sink: %lu pkts (%lu pps)  drops=%lu  reorders=%lu"
+		"  latency min/avg/max = %.3f/%.3f/%.3f ms\n",
+		count, pps, drops, reorders,
+		lat_min * 1e3, (lat_sum / count) * 1e3,
+		lat_max * 1e3);
+}
+
+static void print_final(unsigned long count, double elapsed_sec,
+			unsigned long drops, unsigned long reorders,
+			double lat_min, double lat_sum,
+			double lat_sum_sq, double lat_max)
+{
+	unsigned long pps;
+	double avg, stddev;
+
+	if (!count)
+		return;
+	pps = elapsed_sec > 0 ? (unsigned long)(count / elapsed_sec) : 0;
+	avg = lat_sum / count;
+	stddev = sqrt(lat_sum_sq / count - avg * avg);
+	fprintf(stderr, "  sink: %lu pkts (%lu avg pps)  drops=%lu  reorders=%lu"
+		"  latency min/avg/max/stddev = %.3f/%.3f/%.3f/%.3f ms\n",
+		count, pps, drops, reorders,
+		lat_min * 1e3, avg * 1e3,
+		lat_max * 1e3, stddev * 1e3);
+}
+
+int main(int argc, char **argv)
+{
+	unsigned long next_seq = 0, drops = 0, reorders = 0;
+	double lat_min = 1e9, lat_max = 0, lat_sum = 0, lat_sum_sq = 0;
+	unsigned long count = 0, last_count = 0;
+	struct sockaddr_in addr;
+	char buf[2048];
+	int fd, one = 1;
+
+	if (argc < 2) {
+		fprintf(stderr, "Usage: %s <port>\n", argv[0]);
+		return 1;
+	}
+
+	fd = socket(AF_INET, SOCK_DGRAM, 0);
+	if (fd < 0) {
+		perror("socket");
+		return 1;
+	}
+	setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
+
+	/* Timeout so recv() unblocks periodically to check 'running' flag.
+	 * Needed because glibc signal() sets SA_RESTART, so SIGTERM
+	 * does not interrupt recv().
+	 */
+	struct timeval tv = { .tv_sec = 1 };
+	setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
+
+	memset(&addr, 0, sizeof(addr));
+	addr.sin_family = AF_INET;
+	addr.sin_port = htons(atoi(argv[1]));
+	addr.sin_addr.s_addr = INADDR_ANY;
+	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
+		perror("bind");
+		return 1;
+	}
+
+	signal(SIGINT, stop);
+	signal(SIGTERM, stop);
+
+	struct timespec t_start, t_last_print;
+
+	clock_gettime(CLOCK_MONOTONIC, &t_start);
+	t_last_print = t_start;
+
+	while (running) {
+		struct pkt_hdr hdr;
+		struct timespec now;
+		ssize_t n;
+		double lat;
+
+		n = recv(fd, buf, sizeof(buf), 0);
+		if (n < (ssize_t)sizeof(struct pkt_hdr))
+			continue;
+
+		clock_gettime(CLOCK_MONOTONIC, &now);
+		memcpy(&hdr, buf, sizeof(hdr));
+
+		/* Track drops (gaps) and reorders (late arrivals) */
+		if (hdr.seq > next_seq)
+			drops += hdr.seq - next_seq;
+		if (hdr.seq < next_seq)
+			reorders++;
+		if (hdr.seq >= next_seq)
+			next_seq = hdr.seq + 1;
+
+		lat = (now.tv_sec - hdr.ts.tv_sec) +
+		      (now.tv_nsec - hdr.ts.tv_nsec) * 1e-9;
+
+		if (lat < lat_min)
+			lat_min = lat;
+		if (lat > lat_max)
+			lat_max = lat;
+		lat_sum += lat;
+		lat_sum_sq += lat * lat;
+		count++;
+
+		{
+			double since_print;
+
+			since_print = (now.tv_sec - t_last_print.tv_sec) +
+				      (now.tv_nsec - t_last_print.tv_nsec) * 1e-9;
+			if (since_print >= 5.0) {
+				print_periodic(count, count - last_count,
+					       since_print, drops,
+					       reorders, lat_min,
+					       lat_sum, lat_max);
+				last_count = count;
+				t_last_print = now;
+			}
+		}
+	}
+
+	{
+		struct timespec t_now;
+		double elapsed;
+
+		clock_gettime(CLOCK_MONOTONIC, &t_now);
+		elapsed = (t_now.tv_sec - t_start.tv_sec) +
+			  (t_now.tv_nsec - t_start.tv_nsec) * 1e-9;
+		print_final(count, elapsed, drops, reorders,
+			    lat_min, lat_sum, lat_sum_sq, lat_max);
+	}
+	close(fd);
+	return 0;
+}
+CEOF
+gcc -O2 -Wall -o "$TMPDIR"/udp_sink "$TMPDIR"/udp_sink.c -lm || exit $ksft_fail
+}
+
+setup_veth() {
+    log_info "Setting up veth pair with GRO"
+    setup_ns NS || exit $ksft_skip
+    ip link add "$VETH_A" type veth peer name "$VETH_B" || \
+        { echo "Failed to create veth pair (need root?)"; exit $ksft_skip; }
+    ip link set "$VETH_B" netns "$NS" || \
+        { echo "Failed to move veth to namespace"; exit $ksft_skip; }
+
+    # Configure IPs
+    ip addr add "${IP_A}/24" dev "$VETH_A"
+    ip link set "$VETH_A" up
+
+    ip -netns "$NS" addr add "${IP_B}/24" dev "$VETH_B"
+    ip -netns "$NS" link set "$VETH_B" up
+
+    # Raise wmem_max so the flood tool's SO_SNDBUF takes effect.
+    # Default 212992 caps in-flight to ~93 packets (sk_wmem_alloc limit),
+    # which is less than the 256-entry ptr_ring and prevents backpressure.
+    ORIG_WMEM_MAX=$(sysctl -n net.core.wmem_max)
+    sysctl -qw net.core.wmem_max=1048576
+
+    # Enable GRO on both ends -- activates NAPI -- BQL code path
+    ethtool -K "$VETH_A" gro on 2>/dev/null || true
+    ip netns exec "$NS" ethtool -K "$VETH_B" gro on 2>/dev/null || true
+
+    # Disable TSO so veth_skb_is_eligible_for_gro() returns true for all
+    # packets, ensuring every SKB takes the NAPI/ptr_ring path.  With TSO
+    # enabled, only packets matching sock_wfree + GRO features are eligible;
+    # disabling TSO removes that filter unconditionally.
+    ethtool -K "$VETH_A" tso off gso off 2>/dev/null || true
+    ip netns exec "$NS" ethtool -K "$VETH_B" tso off gso off 2>/dev/null || true
+
+    # Enable threaded NAPI -- this is critical: BQL backpressure (STACK_XOFF)
+    # only engages when producer and consumer run on separate CPUs.
+    # Without threaded NAPI, softirq completions happen too fast for BQL
+    # to build up enough in-flight bytes to trigger the limit.
+    if [ "$NORMAL_NAPI" -eq 0 ]; then
+        echo 1 > /sys/class/net/"$VETH_A"/threaded 2>/dev/null || true
+        ip netns exec "$NS" sh -c "echo 1 > /sys/class/net/$VETH_B/threaded" 2>/dev/null || true
+        log_info "Threaded NAPI enabled"
+    else
+        log_info "Using normal softirq NAPI (threaded NAPI disabled)"
+    fi
+}
+
+install_qdisc() {
+    local qdisc="${1:-$QDISC}"
+    local opts="${2:-}"
+    # Add a qdisc -- veth defaults to noqueue, but BQL needs a qdisc
+    # because STACK_XOFF is checked by the qdisc layer.
+    # Note: qdisc_create() auto-fixes txqueuelen=0 on IFF_NO_QUEUE devices
+    # to DEFAULT_TX_QUEUE_LEN (commit 84c46dd86538).
+    log_info "Installing qdisc: $qdisc $opts"
+    # shellcheck disable=SC2086  # $opts must word-split for tc arguments
+    tc qdisc replace dev "$VETH_A" root $qdisc $opts
+    # shellcheck disable=SC2086
+    ip netns exec "$NS" tc qdisc replace dev "$VETH_B" root $qdisc $opts
+}
+
+remove_qdisc() {
+    log_info "Removing qdisc (reverting to noqueue)"
+    tc qdisc del dev "$VETH_A" root 2>/dev/null || true
+    ip netns exec "$NS" tc qdisc del dev "$VETH_B" root 2>/dev/null || true
+}
+
+setup_iptables() {
+    # Bulk-load iptables rules in consumer namespace to slow NAPI processing.
+    # Many rules force per-packet linear rule traversal, increasing consumer
+    # overhead and BQL inflight bytes -- simulates realistic k8s-like workload.
+    if [ "$NRULES" -gt 0 ]; then
+        # shellcheck disable=SC2016  # single quotes intentional
+        ip netns exec "$NS" bash -c '
+        iptables-restore < <(
+        echo "*filter"
+        for n in $(seq 1 '"$NRULES"'); do
+          echo "-I INPUT -d '"$IP_B"'"
+        done
+        echo "COMMIT"
+        )
+        ' 2>/dev/null || { RET=$ksft_fail retmsg="iptables not available" \
+            log_test "iptables"; exit "$EXIT_STATUS"; }
+        log_info "Loaded $NRULES iptables rules in consumer NS"
+    fi
+}
+
+check_bql_sysfs() {
+    BQL_DIR="/sys/class/net/${VETH_A}/queues/tx-0/byte_queue_limits"
+    if [ -d "$BQL_DIR" ]; then
+        log_info "BQL sysfs found: $BQL_DIR"
+        if [ "$BQL_DISABLE" -eq 1 ]; then
+            echo 1073741824 > "$BQL_DIR/limit_min"
+            log_info "BQL effectively disabled (limit_min=1G)"
+        fi
+    else
+        log_info "BQL sysfs absent (veth IFF_NO_QUEUE+lltx, DQL accounting still active)"
+        BQL_DIR=""
+    fi
+}
+
+start_traffic() {
+    # Snapshot dmesg before test
+    DMESG_BEFORE=$(dmesg | wc -l)
+
+    log_info "Starting UDP sink in namespace"
+    ip netns exec "$NS" "$TMPDIR"/udp_sink "$PORT" &
+    SINK_PID=$!
+    sleep 0.2
+
+    log_info "Starting ping to $IP_B (5/s) to measure latency under load"
+    ping -i 0.2 -w "$DURATION" "$IP_B" > "$TMPDIR"/ping.log 2>&1 &
+    PING_PID=$!
+
+    log_info "Flooding ${PKT_SIZE}-byte UDP packets for ${DURATION}s"
+    "$TMPDIR"/udp_flood "$IP_B" "$PKT_SIZE" "$PORT" "$DURATION" &
+    FLOOD_PID=$!
+
+    # Optional: 2nd UDP flood process with tiny packets to stress byte-based BQL.
+    # Small packets charge few BQL bytes, letting many more into the
+    # ptr_ring before STACK_XOFF fires -- exposing the dark buffer.
+    if [ "$TINY_FLOOD" -eq 1 ]; then
+        local port2=$((PORT + 1))
+        ip netns exec "$NS" "$TMPDIR"/udp_sink "$port2" &
+        log_info "Starting 2nd UDP flood (min-size pkts) on port $port2"
+        "$TMPDIR"/udp_flood "$IP_B" 24 "$port2" "$DURATION" &
+        FLOOD2_PID=$!
+    fi
+
+    # Optional: start bpftrace napi_poll histogram (best-effort)
+    local bt_script
+    bt_script="$(dirname -- "$0")/napi_poll_hist.bt"
+    if command -v bpftrace >/dev/null 2>&1 && [ -f "$bt_script" ]; then
+        bpftrace "$bt_script" > "$TMPDIR"/napi_poll.log 2>&1 &
+        BPFTRACE_PID=$!
+        log_info "bpftrace napi_poll histogram started (pid=$BPFTRACE_PID)"
+    fi
+}
+
+stop_traffic() {
+    [ -n "$FLOOD_PID" ] && kill_process "$FLOOD_PID"
+    FLOOD_PID=""
+    [ -n "$FLOOD2_PID" ] && kill_process "$FLOOD2_PID"
+    FLOOD2_PID=""
+    [ -n "$SINK_PID" ] && kill_process "$SINK_PID"
+    SINK_PID=""
+    [ -n "$PING_PID" ] && kill_process "$PING_PID"
+    PING_PID=""
+    [ -n "$BPFTRACE_PID" ] && kill_process "$BPFTRACE_PID"
+    BPFTRACE_PID=""
+}
+
+check_dmesg_bug() {
+    local bug_pattern='kernel BUG|BUG:|Oops:|dql_completed'
+    local warn_pattern='WARNING:|asks to queue packet|NETDEV WATCHDOG'
+    if dmesg | tail -n +$((DMESG_BEFORE + 1)) | \
+       grep -qE "$bug_pattern"; then
+        dmesg | tail -n +$((DMESG_BEFORE + 1)) | \
+            grep -B2 -A20 -E "$bug_pattern|$warn_pattern"
+        return 1
+    fi
+    # Log new warnings since last check (don't repeat old ones)
+    local cur_lines
+    cur_lines=$(dmesg | wc -l)
+    if [ "$cur_lines" -gt "${DMESG_WARN_SEEN:-$DMESG_BEFORE}" ]; then
+        local new_warns
+        new_warns=$(dmesg | tail -n +$(("${DMESG_WARN_SEEN:-$DMESG_BEFORE}" + 1)) | \
+            grep -E "$warn_pattern") || true
+        if [ -n "$new_warns" ]; then
+            local cnt
+            cnt=$(echo "$new_warns" | wc -l)
+            echo "  WARN: $cnt new kernel warning(s):"
+            echo "$new_warns" | tail -5
+        fi
+    fi
+    DMESG_WARN_SEEN=$cur_lines
+    return 0
+}
+
+print_periodic_stats() {
+    local elapsed="$1"
+
+    # BQL stats and watchdog counter
+    WD_CNT=$(cat /sys/class/net/${VETH_A}/queues/tx-0/tx_timeout \
+        2>/dev/null) || WD_CNT="?"
+    if [ -n "$BQL_DIR" ] && [ -d "$BQL_DIR" ]; then
+        INFLIGHT=$(cat "$BQL_DIR/inflight" 2>/dev/null || echo "?")
+        LIMIT=$(cat "$BQL_DIR/limit" 2>/dev/null || echo "?")
+        echo "  [${elapsed}s] BQL inflight=${INFLIGHT} limit=${LIMIT}" \
+            "watchdog=${WD_CNT}"
+    else
+        echo "  [${elapsed}s] watchdog=${WD_CNT} (no BQL sysfs)"
+    fi
+
+    # Qdisc stats
+    JQ_FMT='"qdisc \(.kind) pkts=\(.packets) drops=\(.drops)'
+    JQ_FMT+=' requeues=\(.requeues) backlog=\(.backlog)'
+    JQ_FMT+=' qlen=\(.qlen) overlimits=\(.overlimits)"'
+    CUR_QPKTS=$(tc -j -s qdisc show dev "$VETH_A" root 2>/dev/null |
+        jq -r '.[0].packets // 0' 2>/dev/null) || CUR_QPKTS=0
+    QSTATS=$(tc -j -s qdisc show dev "$VETH_A" root 2>/dev/null |
+        jq -r ".[0] | $JQ_FMT" 2>/dev/null) &&
+        echo "  [${elapsed}s] $QSTATS" || true
+
+    # Consumer PPS and per-packet processing time
+    if [ "$PREV_QPKTS" -gt 0 ] 2>/dev/null; then
+        DELTA=$((CUR_QPKTS - PREV_QPKTS))
+        PPS=$((DELTA / INTERVAL))
+        if [ "$PPS" -gt 0 ]; then
+            PKT_MS=$(awk "BEGIN {printf \"%.3f\", 1000.0/$PPS}")
+            NAPI_MS=$(awk "BEGIN {printf \"%.1f\", 64000.0/$PPS}")
+            echo "  [${elapsed}s] consumer: ${PPS} pps" \
+                "(~${PKT_MS}ms/pkt, NAPI-64 cycle ~${NAPI_MS}ms)"
+        fi
+    fi
+    PREV_QPKTS=$CUR_QPKTS
+
+    # softnet_stat: per-CPU tracking to detect same-CPU vs multi-CPU NAPI
+    # /proc/net/softnet_stat columns: processed, dropped, time_squeeze (hex, per-CPU)
+    local cpu=0 total_proc=0 total_sq=0 active_cpus=""
+    while read -r line; do
+        # shellcheck disable=SC2086  # word splitting on $line is intentional
+        set -- $line
+        local cur_p=$((0x${1})) cur_sq=$((0x${3}))
+        if [ -f "$TMPDIR/softnet_cpu${cpu}" ]; then
+            read -r prev_p prev_sq < "$TMPDIR/softnet_cpu${cpu}"
+            local dp=$((cur_p - prev_p)) dsq=$((cur_sq - prev_sq))
+            total_proc=$((total_proc + dp))
+            total_sq=$((total_sq + dsq))
+            [ "$dp" -gt 0 ] && active_cpus="${active_cpus} cpu${cpu}(+${dp})"
+        fi
+        echo "$cur_p $cur_sq" > "$TMPDIR/softnet_cpu${cpu}"
+        cpu=$((cpu + 1))
+    done < /proc/net/softnet_stat
+    local n_active
+    n_active=$(echo "$active_cpus" | wc -w)
+    local cpu_mode="single-CPU"
+    [ "$n_active" -gt 1 ] && cpu_mode="multi-CPU(${n_active})"
+    if [ "$total_sq" -gt 0 ] && [ "$INTERVAL" -gt 0 ]; then
+        echo "  [${elapsed}s] softnet: processed=${total_proc}" \
+            "time_squeeze=${total_sq} (${total_sq}/${INTERVAL}s)" \
+            "${cpu_mode}:${active_cpus}"
+    else
+        echo "  [${elapsed}s] softnet: processed=${total_proc}" \
+            "time_squeeze=${total_sq}" \
+            "${cpu_mode}:${active_cpus}"
+    fi
+
+    # napi_poll histogram (from bpftrace, if running)
+    if [ -n "$BPFTRACE_PID" ] && [ -f "$TMPDIR"/napi_poll.log ]; then
+        local napi_line
+        napi_line=$(grep '^napi_poll:' "$TMPDIR"/napi_poll.log | tail -1)
+        [ -n "$napi_line" ] && echo "  [${elapsed}s] $napi_line"
+    fi
+
+    # Ping RTT
+    PING_RTT=$(tail -1 "$TMPDIR"/ping.log 2>/dev/null | grep -oP 'time=\K[0-9.]+') &&
+        echo "  [${elapsed}s] ping RTT=${PING_RTT}ms" || true
+}
+
+monitor_loop() {
+    ELAPSED=0
+    INTERVAL=5
+    PREV_QPKTS=0
+    # Seed per-CPU softnet baselines
+    local cpu=0
+    while read -r line; do
+        # shellcheck disable=SC2086  # word splitting on $line is intentional
+        set -- $line
+        echo "$((0x${1})) $((0x${3}))" > "$TMPDIR/softnet_cpu${cpu}"
+        cpu=$((cpu + 1))
+    done < /proc/net/softnet_stat
+    while kill -0 "$FLOOD_PID" 2>/dev/null; do
+        sleep "$INTERVAL"
+        ELAPSED=$((ELAPSED + INTERVAL))
+
+        if ! check_dmesg_bug; then
+            RET=$ksft_fail
+            retmsg="BUG_ON triggered in dql_completed at ${ELAPSED}s"
+            log_test "veth_bql"
+            exit "$EXIT_STATUS"
+        fi
+
+        print_periodic_stats "$ELAPSED"
+    done
+    wait "$FLOOD_PID" || true
+    FLOOD_PID=""
+}
+
+# Verify traffic is flowing by checking device tx_packets counter.
+# Works for both qdisc and noqueue modes.
+verify_traffic_flowing() {
+    local label="$1"
+    local prev_tx cur_tx
+
+    # Skip check if flood producer already exited (not a stall)
+    if [ -n "$FLOOD_PID" ] && ! kill -0 "$FLOOD_PID" 2>/dev/null; then
+        log_info "$label flood producer exited (duration reached)"
+        return 0
+    fi
+
+    prev_tx=$(cat /sys/class/net/${VETH_A}/statistics/tx_packets \
+        2>/dev/null) || prev_tx=0
+    sleep 0.5
+    cur_tx=$(cat /sys/class/net/${VETH_A}/statistics/tx_packets \
+        2>/dev/null) || cur_tx=0
+    if [ "$cur_tx" -gt "$prev_tx" ]; then
+        log_info "$label traffic flowing (tx: $prev_tx -> $cur_tx)"
+        return 0
+    fi
+    log_info "$label traffic STALLED (tx: $prev_tx -> $cur_tx)"
+    return 1
+}
+
+collect_results() {
+    local test_name="${1:-veth_bql}"
+
+    # Ping summary
+    wait "$PING_PID" 2>/dev/null || true
+    PING_PID=""
+    if [ -f "$TMPDIR"/ping.log ]; then
+        PING_LOSS=$(grep -o '[0-9.]*% packet loss' "$TMPDIR"/ping.log) &&
+            log_info "Ping loss: $PING_LOSS"
+        PING_SUMMARY=$(tail -1 "$TMPDIR"/ping.log)
+        log_info "Ping summary: $PING_SUMMARY"
+    fi
+
+    # Watchdog summary
+    WD_FINAL=$(cat /sys/class/net/${VETH_A}/queues/tx-0/tx_timeout \
+        2>/dev/null) || WD_FINAL=0
+    if [ "$WD_FINAL" -gt 0 ] 2>/dev/null; then
+        log_info "Watchdog fired ${WD_FINAL} time(s)"
+        dmesg | tail -n +$((DMESG_BEFORE + 1)) | \
+            grep -E 'NETDEV WATCHDOG|veth backpressure' || true
+    fi
+
+    # Final dmesg check -- only upgrade to fail, never override existing fail
+    if ! check_dmesg_bug; then
+        RET=$ksft_fail
+        retmsg="BUG_ON triggered in dql_completed"
+    fi
+    log_test "$test_name"
+    exit "$EXIT_STATUS"
+}
+
+# --- Test modes ---
+
+test_bql_stress() {
+    RET=$ksft_pass
+    compile_tools
+    setup_veth
+    install_qdisc "$QDISC" "$QDISC_OPTS"
+    setup_iptables
+    log_info "kernel: $(uname -r)"
+    check_bql_sysfs
+    start_traffic
+    monitor_loop
+    collect_results "veth_bql"
+}
+
+# Test qdisc replacement under active traffic.  Cycles through several
+# qdiscs including a transition to noqueue (tc qdisc del) to verify
+# that stale BQL state (STACK_XOFF) is properly reset during qdisc
+# transitions.
+test_qdisc_replace() {
+    local qdiscs=("sfq" "pfifo" "fq_codel")
+    local step=2
+    local elapsed=0
+    local idx
+
+    RET=$ksft_pass
+    compile_tools
+    setup_veth
+    install_qdisc "$QDISC" "$QDISC_OPTS"
+    setup_iptables
+    log_info "kernel: $(uname -r)"
+    check_bql_sysfs
+    start_traffic
+
+    while [ "$elapsed" -lt "$DURATION" ] && kill -0 "$FLOOD_PID" 2>/dev/null; do
+        sleep "$step"
+        elapsed=$((elapsed + step))
+
+        if ! check_dmesg_bug; then
+            RET=$ksft_fail
+            retmsg="BUG_ON during qdisc replacement at ${elapsed}s"
+            break
+        fi
+
+        # Cycle: sfq -> pfifo -> fq_codel -> noqueue -> sfq -> ...
+        idx=$(( (elapsed / step - 1) % (${#qdiscs[@]} + 1) ))
+        if [ "$idx" -eq "${#qdiscs[@]}" ]; then
+            remove_qdisc
+        else
+            install_qdisc "${qdiscs[$idx]}"
+        fi
+
+        # Print BQL and qdisc stats after each replacement
+        if [ -n "$BQL_DIR" ] && [ -d "$BQL_DIR" ]; then
+            local inflight limit limit_min limit_max holding
+            inflight=$(cat "$BQL_DIR/inflight" 2>/dev/null || echo "?")
+            limit=$(cat "$BQL_DIR/limit" 2>/dev/null || echo "?")
+            limit_min=$(cat "$BQL_DIR/limit_min" 2>/dev/null || echo "?")
+            limit_max=$(cat "$BQL_DIR/limit_max" 2>/dev/null || echo "?")
+            holding=$(cat "$BQL_DIR/holding_time" 2>/dev/null || echo "?")
+            echo "  [${elapsed}s] BQL inflight=${inflight} limit=${limit}" \
+                "limit_min=${limit_min} limit_max=${limit_max}" \
+                "holding=${holding}"
+        fi
+        local cur_qdisc
+        cur_qdisc=$(tc qdisc show dev "$VETH_A" root 2>/dev/null | \
+            awk '{print $2}') || cur_qdisc="none"
+        local txq_state
+        txq_state=$(cat /sys/class/net/${VETH_A}/queues/tx-0/tx_timeout \
+            2>/dev/null) || txq_state="?"
+        echo "  [${elapsed}s] qdisc=${cur_qdisc} watchdog=${txq_state}"
+
+        if ! verify_traffic_flowing "[${elapsed}s]"; then
+            RET=$ksft_fail
+            retmsg="Traffic stalled after qdisc replacement at ${elapsed}s"
+            break
+        fi
+    done
+
+    stop_traffic
+    collect_results "veth_bql_qdisc_replace"
+}
+
+# --- Main ---
+if [ "$QDISC_REPLACE" -eq 1 ]; then
+    test_qdisc_replace
+else
+    test_bql_stress
+fi
diff --git a/tools/testing/selftests/net/veth_bql_test_virtme.sh b/tools/testing/selftests/net/veth_bql_test_virtme.sh
new file mode 100755
index 000000000000..bb8dde0f6c00
--- /dev/null
+++ b/tools/testing/selftests/net/veth_bql_test_virtme.sh
@@ -0,0 +1,124 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Launch veth BQL test inside virtme-ng
+#
+# Must be run from the kernel build tree root.
+#
+# Options:
+#   --verbose       Show kernel console (vng boot messages) in real time.
+#                   Useful for debugging kernel panics / BUG_ON crashes.
+#   All other options are forwarded to veth_bql_test.sh (see --help there).
+#
+# Examples (run from kernel tree root):
+#   ./tools/testing/selftests/net/veth_bql_test_virtme.sh [OPTIONS]
+#     --duration 20 --nrules 1000
+#     --qdisc fq_codel --bql-disable
+#     --verbose --qdisc-replace --duration 60
+
+set -eu
+
+# Parse --verbose (consumed here, not forwarded to the inner test).
+VERBOSE=""
+INNER_ARGS=()
+for arg in "$@"; do
+    if [ "$arg" = "--verbose" ]; then
+        VERBOSE="--verbose"
+    else
+        INNER_ARGS+=("$arg")
+    fi
+done
+TEST_ARGS=""
+[ ${#INNER_ARGS[@]} -gt 0 ] && TEST_ARGS=$(printf '%q ' "${INNER_ARGS[@]}")
+
+if [ ! -f "vmlinux" ]; then
+    echo "ERROR: virtme-ng needs vmlinux; run from a compiled kernel tree:" >&2
+    echo "  cd /path/to/kernel && $0" >&2
+    exit 1
+fi
+
+# Verify .config has the options needed for virtme-ng and this test.
+# Without these the VM silently stalls with no output.
+KCONFIG=".config"
+if [ ! -f "$KCONFIG" ]; then
+    echo "ERROR: No .config found -- build the kernel first" >&2
+    exit 1
+fi
+
+MISSING=""
+for opt in CONFIG_VIRTIO CONFIG_VIRTIO_PCI CONFIG_VIRTIO_NET \
+           CONFIG_VIRTIO_CONSOLE CONFIG_NET_9P CONFIG_NET_9P_VIRTIO \
+           CONFIG_9P_FS CONFIG_VETH CONFIG_BQL; do
+    if ! grep -q "^${opt}=[ym]" "$KCONFIG"; then
+        MISSING+="  $opt\n"
+    fi
+done
+if [ -n "$MISSING" ]; then
+    echo "ERROR: .config is missing options required by virtme-ng or this test:" >&2
+    echo -e "$MISSING" >&2
+    echo "Consider: vng --kconfig (or make defconfig + enable above)" >&2
+    exit 1
+fi
+
+TESTDIR="tools/testing/selftests/net"
+TESTNAME="veth_bql_test.sh"
+LOGFILE="veth_bql_test.log"
+LOGPATH="$TESTDIR/$LOGFILE"
+CONSOLELOG="veth_bql_console.log"
+rm -f "$LOGPATH" "$CONSOLELOG"
+
+echo "Starting VM... test output in $LOGPATH, kernel console in $CONSOLELOG"
+echo "(VM is booting, please wait ~30s)"
+
+# Always capture kernel console to a file via a second QEMU serial port.
+# vng claims ttyS0 (mapped to /dev/null); --qemu-opts adds ttyS1 on COM2.
+# earlycon registers COM2's I/O port (0x2f8) as a persistent console.
+# (plain console=ttyS1 does NOT work: the 8250 driver registers once,
+# ttyS0 wins, and ttyS1 is never picked up.)
+# --verbose additionally shows kernel console in real time on the terminal.
+SERIAL_CONSOLE="earlycon=uart8250,io,0x2f8,115200"
+SERIAL_CONSOLE+=" console=uart8250,io,0x2f8,115200"
+set +e
+vng $VERBOSE --cpus 4 --memory 2G \
+    --rwdir "$TESTDIR" \
+    --append "panic=5 loglevel=4 $SERIAL_CONSOLE" \
+    --qemu-opts="-serial file:$CONSOLELOG" \
+    --exec "cd $TESTDIR && \
+        ./$TESTNAME $TEST_ARGS 2>&1 | \
+        tee $LOGFILE; echo EXIT_CODE=\$? >> $LOGFILE"
+VNG_RC=$?
+set -e
+
+echo ""
+if [ "$VNG_RC" -ne 0 ]; then
+    echo "***********************************************************"
+    echo "* VM CRASHED -- kernel panic or BUG_ON (vng rc=$VNG_RC)"
+    echo "***********************************************************"
+    if [ -s "$CONSOLELOG" ] && \
+       grep -qiE 'kernel BUG|BUG:|Oops:|panic|dql_completed' "$CONSOLELOG"; then
+        echo ""
+        echo "--- kernel backtrace ($CONSOLELOG) ---"
+        grep -iE -A30 'kernel BUG|BUG:|Oops:|panic|dql_completed' \
+            "$CONSOLELOG" | head -50
+    else
+        echo ""
+        echo "Re-run with --verbose to see the kernel backtrace:"
+        echo "  $0 --verbose ${INNER_ARGS[*]}"
+    fi
+    exit 1
+elif [ ! -f "$LOGPATH" ]; then
+    echo "No log file found -- VM may have crashed before writing output"
+    exit 2
+else
+    echo "=== VM finished ==="
+fi
+
+# Scan console log for unexpected kernel warnings (even on clean exit)
+if [ -s "$CONSOLELOG" ]; then
+    WARN_PATTERN='kernel BUG|BUG:|Oops:|dql_completed|WARNING:|asks to queue packet|NETDEV WATCHDOG'
+    WARN_LINES=$(grep -cE "$WARN_PATTERN" "$CONSOLELOG" 2>/dev/null) || WARN_LINES=0
+    if [ "$WARN_LINES" -gt 0 ]; then
+        echo ""
+        echo "*** kernel warnings in $CONSOLELOG ($WARN_LINES lines) ***"
+        grep -E "$WARN_PATTERN" "$CONSOLELOG" | head -20
+    fi
+fi
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next v2 4/5] net: sched: add timeout count to NETDEV WATCHDOG message
From: hawk @ 2026-04-13  9:44 UTC (permalink / raw)
  To: netdev
  Cc: kernel-team, Jesper Dangaard Brouer, Jakub Kicinski,
	Jonas Köppeler, Jamal Hadi Salim, Jiri Pirko,
	David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
	linux-kernel
In-Reply-To: <20260413094442.1376022-1-hawk@kernel.org>

From: Jesper Dangaard Brouer <hawk@kernel.org>

Add the per-queue timeout counter (trans_timeout) to the core NETDEV
WATCHDOG log message.  This makes it easy to determine how frequently
a particular queue is stalling from a single log line, without having
to search through and correlate spaced-out log entries.

This is useful for production monitoring, where timeouts are spaced by
the watchdog interval and their frequency is hard to judge from
individual log entries.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/all/20251107175445.58eba452@kernel.org/
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
---
 net/sched/sch_generic.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index a93321db8fd7..3e2e2e887a86 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -533,13 +533,12 @@ static void dev_watchdog(struct timer_list *t)
 		    netif_running(dev) &&
 		    netif_carrier_ok(dev)) {
 			unsigned int timedout_ms = 0;
+			struct netdev_queue *txq;
 			unsigned int i;
 			unsigned long trans_start;
 			unsigned long oldest_start = jiffies;
 
 			for (i = 0; i < dev->num_tx_queues; i++) {
-				struct netdev_queue *txq;
-
 				txq = netdev_get_tx_queue(dev, i);
 				if (!netif_xmit_stopped(txq))
 					continue;
@@ -561,9 +560,10 @@ static void dev_watchdog(struct timer_list *t)
 
 			if (unlikely(timedout_ms)) {
 				trace_net_dev_xmit_timeout(dev, i);
-				netdev_crit(dev, "NETDEV WATCHDOG: CPU: %d: transmit queue %u timed out %u ms\n",
+				netdev_crit(dev, "NETDEV WATCHDOG: CPU: %d: transmit queue %u timed out %u ms (n:%ld)\n",
 					    raw_smp_processor_id(),
-					    i, timedout_ms);
+					    i, timedout_ms,
+					    atomic_long_read(&txq->trans_timeout));
 				netif_freeze_queues(dev);
 				dev->netdev_ops->ndo_tx_timeout(dev, i);
 				netif_unfreeze_queues(dev);
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next v2 3/5] veth: add tx_timeout watchdog as BQL safety net
From: hawk @ 2026-04-13  9:44 UTC (permalink / raw)
  To: netdev
  Cc: kernel-team, Jesper Dangaard Brouer, Jonas Köppeler,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-kernel
In-Reply-To: <20260413094442.1376022-1-hawk@kernel.org>

From: Jesper Dangaard Brouer <hawk@kernel.org>

With the introduction of BQL (Byte Queue Limits) for veth, there are
now two independent mechanisms that can stop a transmit queue:

 - DRV_XOFF: set by netif_tx_stop_queue() when the ptr_ring is full
 - STACK_XOFF: set by BQL when the byte-in-flight limit is reached

If either mechanism stalls without a corresponding wake/completion,
the queue stops permanently. Enable the net device watchdog timer and
implement ndo_tx_timeout as a failsafe recovery.

The timeout handler clears STACK_XOFF and wakes the queue (clearing
DRV_XOFF), covering both stop mechanisms.  It only clears the stop
bits and does not reset the DQL counters, because dql_reset() would
race with the peer NAPI calling dql_completed().  The watchdog fires
after 16 seconds, which accommodates worst-case NAPI processing
(budget=64 packets x 250ms per-packet consumer delay = 16000ms)
without false positives under normal backpressure.

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
---
 drivers/net/veth.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 6431dc40f9b4..911e7e36e166 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1429,6 +1429,22 @@ static int veth_set_channels(struct net_device *dev,
 	goto out;
 }
 
+static void veth_tx_timeout(struct net_device *dev, unsigned int txqueue)
+{
+	struct netdev_queue *txq = netdev_get_tx_queue(dev, txqueue);
+
+	netdev_err(dev,
+		   "veth backpressure(0x%lX) stalled(n:%ld) TXQ(%u) re-enable\n",
+		   txq->state, atomic_long_read(&txq->trans_timeout), txqueue);
+
+	/* Cannot call netdev_tx_reset_queue(): dql_reset() races with
+	 * peer NAPI calling dql_completed() concurrently.
+	 * Just clear the stop bits; the qdisc will re-stop if still stuck.
+	 */
+	clear_bit(__QUEUE_STATE_STACK_XOFF, &txq->state);
+	netif_tx_wake_queue(txq);
+}
+
 static int veth_open(struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
@@ -1767,6 +1783,7 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_bpf		= veth_xdp,
 	.ndo_xdp_xmit		= veth_ndo_xdp_xmit,
 	.ndo_get_peer_dev	= veth_peer_dev,
+	.ndo_tx_timeout		= veth_tx_timeout,
 };
 
 static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
@@ -1806,6 +1823,7 @@ static void veth_setup(struct net_device *dev)
 	dev->priv_destructor = veth_dev_free;
 	dev->pcpu_stat_type = NETDEV_PCPU_STAT_TSTATS;
 	dev->max_mtu = ETH_MAX_MTU;
+	dev->watchdog_timeo = msecs_to_jiffies(16000);
 
 	dev->hw_features = VETH_FEATURES;
 	dev->hw_enc_features = VETH_FEATURES;
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next v2 2/5] veth: implement Byte Queue Limits (BQL) for latency reduction
From: hawk @ 2026-04-13  9:44 UTC (permalink / raw)
  To: netdev
  Cc: kernel-team, Jesper Dangaard Brouer, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Stanislav Fomichev, linux-kernel,
	bpf
In-Reply-To: <20260413094442.1376022-1-hawk@kernel.org>

From: Jesper Dangaard Brouer <hawk@kernel.org>

Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
reduce TX drops") gave qdiscs control over veth by returning
NETDEV_TX_BUSY when the ptr_ring is full (DRV_XOFF).  That commit noted
a known limitation: the 256-entry ptr_ring sits in front of the qdisc as
a dark buffer, adding base latency because the qdisc has no visibility
into how many bytes are already queued there.

Add BQL support so the qdisc gets feedback and can begin shaping traffic
before the ring fills.  In testing with fq_codel, BQL reduces ping RTT
under UDP load from ~6.61ms to ~0.36ms (18x).

Charge a fixed VETH_BQL_UNIT (1) per packet rather than skb->len, so
the DQL limit tracks packets-in-flight.  Unlike a physical NIC, veth
has no link speed -- the ptr_ring drains at CPU speed and is
packet-indexed, not byte-indexed, so bytes are not the natural unit.
With byte-based charging, small packets sneak many more entries into
the ring before STACK_XOFF fires, deepening the dark buffer under
mixed-size workloads.  Testing with a concurrent min-size packet flood
shows 3.7x ping RTT degradation with skb->len charging versus no
change with fixed-unit charging.

Charge BQL inside veth_xdp_rx() under the ptr_ring producer_lock, after
confirming the ring is not full.  The charge must precede the produce
because the NAPI consumer can run on another CPU and complete the SKB
the instant it becomes visible in the ring.  Doing both under the same
lock avoids a pre-charge/undo pattern -- BQL is only charged when
produce is guaranteed to succeed.

BQL is enabled only when a real qdisc is attached (guarded by
!qdisc_txq_has_no_queue).  TXQ modifications such as dql_queued()
are normally serialized by HARD_TX_LOCK, but lltx devices like veth
do not get that serialization.  The ptr_ring producer_lock provides
serialization that would allow BQL to work correctly even with
noqueue, though that combination is not currently enabled, as the
netstack will drop and warn.

Track per-SKB BQL state via a VETH_BQL_FLAG pointer tag in the ptr_ring
entry.  This is necessary because the qdisc can be replaced live while
SKBs are in-flight -- each SKB must carry the charge decision made at
enqueue time rather than re-checking the peer's qdisc at completion.

Complete per-SKB in veth_xdp_rcv() rather than in bulk, so STACK_XOFF
clears promptly when producer and consumer run on different CPUs.

BQL introduces a second independent queue-stop mechanism (STACK_XOFF)
alongside the existing DRV_XOFF (ring full).  Both must be clear for
the queue to transmit.  Reset BQL state in veth_napi_del_range() after
synchronize_net() to avoid racing with in-flight veth_poll() calls.

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
 drivers/net/veth.c | 74 +++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 63 insertions(+), 11 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index e35df717e65e..6431dc40f9b4 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -34,9 +34,13 @@
 #define DRV_VERSION	"1.0"
 
 #define VETH_XDP_FLAG		BIT(0)
+#define VETH_BQL_FLAG		BIT(1)
 #define VETH_RING_SIZE		256
 #define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
+/* Fixed BQL charge: DQL limit tracks packets-in-flight, not bytes */
+#define VETH_BQL_UNIT		1
+
 #define VETH_XDP_TX_BULK_SIZE	16
 #define VETH_XDP_BATCH		16
 
@@ -280,6 +284,21 @@ static bool veth_is_xdp_frame(void *ptr)
 	return (unsigned long)ptr & VETH_XDP_FLAG;
 }
 
+static bool veth_ptr_is_bql(void *ptr)
+{
+	return (unsigned long)ptr & VETH_BQL_FLAG;
+}
+
+static struct sk_buff *veth_ptr_to_skb(void *ptr)
+{
+	return (void *)((unsigned long)ptr & ~VETH_BQL_FLAG);
+}
+
+static void *veth_skb_to_ptr(struct sk_buff *skb, bool bql)
+{
+	return bql ? (void *)((unsigned long)skb | VETH_BQL_FLAG) : skb;
+}
+
 static struct xdp_frame *veth_ptr_to_xdp(void *ptr)
 {
 	return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
@@ -295,7 +314,7 @@ static void veth_ptr_free(void *ptr)
 	if (veth_is_xdp_frame(ptr))
 		xdp_return_frame(veth_ptr_to_xdp(ptr));
 	else
-		kfree_skb(ptr);
+		kfree_skb(veth_ptr_to_skb(ptr));
 }
 
 static void __veth_xdp_flush(struct veth_rq *rq)
@@ -309,19 +328,33 @@ static void __veth_xdp_flush(struct veth_rq *rq)
 	}
 }
 
-static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
+static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb, bool do_bql,
+		       struct netdev_queue *txq)
 {
-	if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
+	struct ptr_ring *ring = &rq->xdp_ring;
+
+	spin_lock(&ring->producer_lock);
+	if (unlikely(!ring->size) || __ptr_ring_full(ring)) {
+		spin_unlock(&ring->producer_lock);
 		return NETDEV_TX_BUSY; /* signal qdisc layer */
+	}
+
+	/* BQL charge before produce; consumer cannot see entry yet */
+	if (do_bql)
+		netdev_tx_sent_queue(txq, VETH_BQL_UNIT);
+
+	__ptr_ring_produce(ring, veth_skb_to_ptr(skb, do_bql));
+	spin_unlock(&ring->producer_lock);
 
 	return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
 }
 
 static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
-			    struct veth_rq *rq, bool xdp)
+			    struct veth_rq *rq, bool xdp, bool do_bql,
+			    struct netdev_queue *txq)
 {
 	return __dev_forward_skb(dev, skb) ?: xdp ?
-		veth_xdp_rx(rq, skb) :
+		veth_xdp_rx(rq, skb, do_bql, txq) :
 		__netif_rx(skb);
 }
 
@@ -352,6 +385,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct net_device *rcv;
 	int length = skb->len;
 	bool use_napi = false;
+	bool do_bql = false;
 	int ret, rxq;
 
 	rcu_read_lock();
@@ -375,8 +409,11 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 
 	skb_tx_timestamp(skb);
+	txq = netdev_get_tx_queue(dev, rxq);
 
-	ret = veth_forward_skb(rcv, skb, rq, use_napi);
+	/* BQL charge happens inside veth_xdp_rx() under producer_lock */
+	do_bql = use_napi && !qdisc_txq_has_no_queue(txq);
+	ret = veth_forward_skb(rcv, skb, rq, use_napi, do_bql, txq);
 	switch (ret) {
 	case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
 		if (!use_napi)
@@ -388,8 +425,6 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 		/* If a qdisc is attached to our virtual device, returning
 		 * NETDEV_TX_BUSY is allowed.
 		 */
-		txq = netdev_get_tx_queue(dev, rxq);
-
 		if (qdisc_txq_has_no_queue(txq)) {
 			dev_kfree_skb_any(skb);
 			goto drop;
@@ -412,6 +447,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 		net_crit_ratelimited("%s(%s): Invalid return code(%d)",
 				     __func__, dev->name, ret);
 	}
+
 	rcu_read_unlock();
 
 	return ret;
@@ -900,7 +936,8 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 
 static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 			struct veth_xdp_tx_bq *bq,
-			struct veth_stats *stats)
+			struct veth_stats *stats,
+			struct netdev_queue *peer_txq)
 {
 	int i, done = 0, n_xdpf = 0;
 	void *xdpf[VETH_XDP_BATCH];
@@ -928,9 +965,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 			}
 		} else {
 			/* ndo_start_xmit */
-			struct sk_buff *skb = ptr;
+			bool bql_charged = veth_ptr_is_bql(ptr);
+			struct sk_buff *skb = veth_ptr_to_skb(ptr);
 
 			stats->xdp_bytes += skb->len;
+			if (peer_txq && bql_charged)
+				netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
+
 			skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
 			if (skb) {
 				if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC))
@@ -975,7 +1016,7 @@ static int veth_poll(struct napi_struct *napi, int budget)
 	peer_txq = peer_dev ? netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
 
 	xdp_set_return_frame_no_direct();
-	done = veth_xdp_rcv(rq, budget, &bq, &stats);
+	done = veth_xdp_rcv(rq, budget, &bq, &stats, peer_txq);
 
 	if (stats.xdp_redirect > 0)
 		xdp_do_flush();
@@ -1073,6 +1114,7 @@ static int __veth_napi_enable(struct net_device *dev)
 static void veth_napi_del_range(struct net_device *dev, int start, int end)
 {
 	struct veth_priv *priv = netdev_priv(dev);
+	struct net_device *peer;
 	int i;
 
 	for (i = start; i < end; i++) {
@@ -1091,6 +1133,15 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
 		ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
 	}
 
+	/* Reset BQL on peer's txqs: remaining ring items were freed above
+	 * without BQL completion, so DQL state must be reset.
+	 */
+	peer = rtnl_dereference(priv->peer);
+	if (peer) {
+		for (i = start; i < end; i++)
+			netdev_tx_reset_queue(netdev_get_tx_queue(peer, i));
+	}
+
 	for (i = start; i < end; i++) {
 		page_pool_destroy(priv->rq[i].page_pool);
 		priv->rq[i].page_pool = NULL;
@@ -1740,6 +1791,7 @@ static void veth_setup(struct net_device *dev)
 	dev->priv_flags |= IFF_PHONY_HEADROOM;
 	dev->priv_flags |= IFF_DISABLE_NETPOLL;
 	dev->lltx = true;
+	dev->bql = true;
 
 	dev->netdev_ops = &veth_netdev_ops;
 	dev->xdp_metadata_ops = &veth_xdp_metadata_ops;
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next v2 1/5] net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
From: hawk @ 2026-04-13  9:44 UTC (permalink / raw)
  To: netdev
  Cc: kernel-team, Jesper Dangaard Brouer, Jonas Köppeler,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan,
	Kuniyuki Iwashima, Stanislav Fomichev, Frederic Weisbecker,
	Yajun Deng, Krishna Kumar, linux-doc, linux-kernel
In-Reply-To: <20260413094442.1376022-1-hawk@kernel.org>

From: Jesper Dangaard Brouer <hawk@kernel.org>

Virtual devices with IFF_NO_QUEUE or lltx are excluded from BQL sysfs
by netdev_uses_bql(), since they traditionally lack real hardware
queues. However, some virtual devices like veth implement a real
ptr_ring FIFO with NAPI processing and benefit from BQL to limit
in-flight bytes and reduce latency.

Add a per-device 'bql' bitfield boolean in the priv_flags_slow section
of struct net_device. When set, it overrides the IFF_NO_QUEUE/lltx
exclusion and exposes BQL sysfs entries (/sys/class/net/<dev>/queues/
tx-<n>/byte_queue_limits/). The flag is still gated on CONFIG_BQL.

This allows drivers that use BQL despite being IFF_NO_QUEUE to opt in
to sysfs visibility for monitoring and debugging.

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
---
 Documentation/networking/net_cachelines/net_device.rst | 1 +
 include/linux/netdevice.h                              | 2 ++
 net/core/net-sysfs.c                                   | 8 +++++++-
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index 1c19bb7705df..b775d3235a2d 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -170,6 +170,7 @@ unsigned_long:1                     see_all_hwtstamp_requests
 unsigned_long:1                     change_proto_down
 unsigned_long:1                     netns_immutable
 unsigned_long:1                     fcoe_mtu
+unsigned_long:1                     bql                                                                 netdev_uses_bql(net-sysfs.c)
 struct list_head                    net_notifier_list
 struct macsec_ops*                  macsec_ops
 struct udp_tunnel_nic_info*         udp_tunnel_nic_info
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 47417b2d48a4..7a1a491ecdd5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2048,6 +2048,7 @@ enum netdev_reg_state {
  *	@change_proto_down: device supports setting carrier via IFLA_PROTO_DOWN
  *	@netns_immutable: interface can't change network namespaces
  *	@fcoe_mtu:	device supports maximum FCoE MTU, 2158 bytes
+ *	@bql:		device uses BQL (DQL sysfs) despite having IFF_NO_QUEUE
  *
  *	@net_notifier_list:	List of per-net netdev notifier block
  *				that follow this device when it is moved
@@ -2462,6 +2463,7 @@ struct net_device {
 	unsigned long		change_proto_down:1;
 	unsigned long		netns_immutable:1;
 	unsigned long		fcoe_mtu:1;
+	unsigned long		bql:1;
 
 	struct list_head	net_notifier_list;
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index e430645748a7..4360efc8f241 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1945,10 +1945,16 @@ static const struct kobj_type netdev_queue_ktype = {
 
 static bool netdev_uses_bql(const struct net_device *dev)
 {
+	if (!IS_ENABLED(CONFIG_BQL))
+		return false;
+
+	if (dev->bql)
+		return true;
+
 	if (dev->lltx || (dev->priv_flags & IFF_NO_QUEUE))
 		return false;
 
-	return IS_ENABLED(CONFIG_BQL);
+	return true;
 }
 
 static int netdev_queue_add_kobject(struct net_device *dev, int index)
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next v2 0/5] veth: add Byte Queue Limits (BQL) support
From: hawk @ 2026-04-13  9:44 UTC (permalink / raw)
  To: netdev
  Cc: kernel-team, Jesper Dangaard Brouer, Chris Arges, Mike Freemon,
	Toke Høiland-Jørgensen, Shuah Khan, linux-kselftest,
	Jonas Köppeler, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski, John Fastabend,
	Stanislav Fomichev, bpf

From: Jesper Dangaard Brouer <hawk@kernel.org>

This series adds BQL (Byte Queue Limits) to the veth driver, reducing
latency by dynamically limiting in-flight packets in the ptr_ring and
moving buffering into the qdisc where AQM algorithms can act on it.

Problem:
  veth's 256-entry ptr_ring acts as a "dark buffer" -- packets queued
  there are invisible to the qdisc's AQM.  Under load, the ring fills
  completely (DRV_XOFF backpressure), adding up to 256 packets of
  unmanaged latency before the qdisc even sees congestion.

Solution:
  BQL (STACK_XOFF) dynamically limits in-flight packets, stopping the
  queue before the ring fills.  This keeps the ring shallow and pushes
  excess packets into the qdisc, where sojourn-based AQM can measure
  and drop them.

  Test setup: veth pair, UDP flood, 13000 iptables rules in consumer
  namespace (slows NAPI-64 cycle to ~6-7ms), ping measures RTT under load.

                   BQL off                    BQL on
  fq_codel:  RTT ~22ms, 4% loss         RTT ~1.3ms, 0% loss
  sfq:       RTT ~24ms, 0% loss         RTT ~1.5ms, 0% loss

  BQL reduces ping RTT by ~17x for both qdiscs.  Consumer throughput
  is unchanged (~10K pps) -- BQL adds no overhead.

CoDel bug discovered during BQL development:
  Our original motivation for BQL was fq_codel ping loss observed under
  load (4-26% depending on NAPI cycle time).  Investigating this led us
  to discover a bug in the CoDel implementation: codel_dequeue() does
  not reset vars->first_above_time when a flow goes empty, contrary to
  the reference algorithm.  This causes stale CoDel state to persist
  across empty periods in fq_codel's per-flow queues, penalizing sparse
  flows like ICMP ping.  A fix for this has been applied to the net tree:
    https://git.kernel.org/netdev/net/c/815980fe6dbb

  BQL remains valuable independently: it reduces RTT by ~17x by moving
  buffering from the dark ptr_ring into the qdisc.  Additionally, BQL
  clears STACK_XOFF per-SKB as each packet completes, rather than
  batch-waking after 64 packets (DRV_XOFF).  This keeps sojourn times
  below fq_codel's target, preventing CoDel from entering dropping
  state on non-congested flows in the first place.

Key design decisions:
  - Charge-under-lock in veth_xdp_rx(): The BQL charge must precede
    the ptr_ring produce, because the NAPI consumer can run on another
    CPU and complete the SKB immediately after it becomes visible.  To
    avoid a pre-charge/undo pattern, the charge is done under the
    ptr_ring producer_lock after confirming the ring is not full.  BQL
    is only charged when produce is guaranteed to succeed, keeping
    num_queued monotonically increasing.  HARD_TX_LOCK already
    serializes dql_queued() (veth requires a qdisc for BQL); the
    ptr_ring lock would additionally allow noqueue to work correctly.

  - Per-SKB BQL tracking via pointer tag: A VETH_BQL_FLAG bit in the
    ptr_ring pointer records whether each SKB was BQL-charged.  This is
    necessary because the qdisc can be replaced live (noqueue->sfq or
    vice versa) while SKBs are in-flight -- the completion side must
    know the charge state that was decided at enqueue time.

  - IFF_NO_QUEUE + BQL coexistence: A new dev->bql flag enables BQL
    sysfs exposure for IFF_NO_QUEUE devices that opt in to DQL
    accounting, without changing IFF_NO_QUEUE semantics.
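
  As an illustration of the pointer-tag idea above, here is a minimal
  userspace sketch. It is not the code from patch 2: struct fake_skb,
  DEMO_BQL_FLAG and the demo_* helpers are hypothetical names, and the
  bit position is an assumption (the real VETH_BQL_FLAG value is not
  shown in this cover letter). It only shows how a "was charged" bit can
  travel with the pointer through the ring and be recovered on the
  consumer side.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct sk_buff; the long member keeps the
 * object aligned to at least 4 bytes, so bit 1 of its address is zero.
 */
struct fake_skb {
	long len;
};

/* Assumed tag bit for this sketch only. */
#define DEMO_BQL_FLAG ((uintptr_t)1 << 1)

/* Producer side: record the charge decision in the pointer itself. */
static void *demo_skb_to_ptr(struct fake_skb *skb, bool charged)
{
	uintptr_t p = (uintptr_t)skb;

	return (void *)(charged ? (p | DEMO_BQL_FLAG) : p);
}

/* Consumer side: was this entry charged at enqueue time? */
static bool demo_ptr_is_bql(const void *ptr)
{
	return ((uintptr_t)ptr & DEMO_BQL_FLAG) != 0;
}

/* Consumer side: strip the tag to recover the original pointer. */
static struct fake_skb *demo_ptr_to_skb(void *ptr)
{
	return (struct fake_skb *)((uintptr_t)ptr & ~DEMO_BQL_FLAG);
}
```

  Because the decision is stored per entry, the consumer does not need
  to look at the current qdisc state, which may have changed since the
  entry was produced.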

Background and acknowledgments:
  Mike Freemon reported the veth dark buffer problem internally at
  Cloudflare and showed that recompiling the kernel with a ptr_ring
  size of 30 (down from 256) made fq_codel work dramatically better.
  This was the primary motivation for a proper BQL solution that
  achieves the same effect dynamically without a kernel rebuild.

  Chris Arges wrote a reproducer for the dark buffer latency problem:
    https://github.com/netoptimizer/veth-backpressure-performance-testing
  This is where we first observed ping packets being dropped under
  fq_codel, which became our secondary motivation for BQL.  In
  production we switched to SFQ on veth devices as a workaround.

  Jonas Koeppeler provided extensive testing and code review.
  Together we discovered that the fq_codel ping loss was actually a
  12-year-old CoDel bug (stale first_above_time in empty flows), not
  caused by the dark buffer itself.  A fix has been applied to the net tree:
    https://git.kernel.org/netdev/net/c/815980fe6dbb

Patch overview:
  1. net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
  2. veth: implement Byte Queue Limits (BQL) for latency reduction
  3. veth: add tx_timeout watchdog as BQL safety net
  4. net: sched: add timeout count to NETDEV WATCHDOG message
  5. selftests: net: add veth BQL stress test

Jesper Dangaard Brouer (5):
  net: add dev->bql flag to allow BQL sysfs for IFF_NO_QUEUE devices
  veth: implement Byte Queue Limits (BQL) for latency reduction
  veth: add tx_timeout watchdog as BQL safety net
  net: sched: add timeout count to NETDEV WATCHDOG message
  selftests: net: add veth BQL stress test

 .../networking/net_cachelines/net_device.rst  |   1 +
 drivers/net/veth.c                            |  92 +-
 include/linux/netdevice.h                     |   2 +
 net/core/net-sysfs.c                          |   8 +-
 net/sched/sch_generic.c                       |   8 +-
 tools/testing/selftests/net/Makefile          |   3 +
 tools/testing/selftests/net/config            |   1 +
 tools/testing/selftests/net/napi_poll_hist.bt |  40 +
 tools/testing/selftests/net/veth_bql_test.sh  | 821 ++++++++++++++++++
 .../selftests/net/veth_bql_test_virtme.sh     | 124 +++
 10 files changed, 1084 insertions(+), 16 deletions(-)
 create mode 100644 tools/testing/selftests/net/napi_poll_hist.bt
 create mode 100755 tools/testing/selftests/net/veth_bql_test.sh
 create mode 100755 tools/testing/selftests/net/veth_bql_test_virtme.sh

V1: https://lore.kernel.org/all/20260324174719.1224337-1-hawk@kernel.org/

Changes since V1:
  - Patch 1 (dev->bql flag): add kdoc entry for @bql in struct net_device.
  - Patch 2 (veth BQL): charge fixed VETH_BQL_UNIT (1) per packet instead
    of skb->len.  veth has no link speed; the ptr_ring is packet-indexed.
    Byte-based charging lets small packets sneak many entries into the ring.
    Testing: min-size packet flood causes 3.7x ping RTT degradation with
    skb->len vs no change with fixed-unit charging.
  - Patch 3 (tx_timeout watchdog): fix race with peer NAPI: replace
    netdev_tx_reset_queue() with clear_bit(STACK_XOFF) + netif_tx_wake_queue()
    to avoid dql_reset() racing with concurrent dql_completed().
  - Patch 5 (selftests): fix shellcheck warnings and infos:
    - Quote variables passed to kill_process and exit.
    - Declare and assign local variables separately (SC2155).
    - Use read -r to avoid mangling backslashes (SC2162).
    - Add shellcheck disable comments for intentional word splitting
      (set -- $line, tc $qdisc $opts) and indirect invocation (trap).
    - Make iptables-restore failure a hard FAIL instead of continuing.
    - Add veth_bql_test.sh to TEST_PROGS in net/Makefile.
    - Add veth_bql_test_virtme.sh to TEST_FILES (needs kernel build tree).
    - Add napi_poll_hist.bt to TEST_FILES in net/Makefile.
    - Add CONFIG_NET_SCH_SFQ=m to net/config (default qdisc is sfq).
    - Reduce default test duration from 300s to 30s for kselftest CI.
    - Fix virtme wrapper: empty args bug, check vmlinux instead of test path.
  - Cover letter: update CoDel fix reference to merged commit in net tree.

Cc: Chris Arges <carges@cloudflare.com>
Cc: Mike Freemon <mfreemon@cloudflare.com>
Cc: Toke Høiland-Jørgensen <toke@toke.dk>
Cc: Shuah Khan <shuah@kernel.org>
Cc: linux-kselftest@vger.kernel.org
Cc: Jonas Köppeler <j.koeppeler@tu-berlin.de>
Cc: kernel-team@cloudflare.com

-- 
2.43.0


^ permalink raw reply

* [PATCH] net: sched: teql: fix use-after-free in teql_master_xmit
From: Kito Xu (veritas501) @ 2026-04-13  9:44 UTC (permalink / raw)
  To: Jamal Hadi Salim, Jiri Pirko, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, netdev, linux-kernel, Kito Xu (veritas501)

teql_destroy() traverses the circular slave list to unlink a slave
qdisc. It reads master->slaves as both the starting point and the
do-while termination sentinel. However, the data-path writers
teql_dequeue() and teql_master_xmit() concurrently modify
master->slaves without holding RTNL, running in softirq or process
context respectively. If master->slaves is overwritten to an
already-visited node mid-traversal, the loop exits early without
finding the target slave. The slave is never unlinked, but its memory
is freed by call_rcu() -- leaving a dangling pointer in the circular
list. The next teql_master_xmit() traversal dereferences freed memory.

race condition:

  CPU 0 (teql_destroy, RTNL held)       CPU 1 (teql_dequeue, softirq)
  -----------------------------------   -----------------------------
  prev = master->slaves;  // = A
  q = NEXT_SLAVE(A);      // = B
  B == A? No.
  prev = B;
                                         /* slave C's queue drains */
                                         skb == NULL ->
                                         dat->m->slaves = C; /* write! */

  q = NEXT_SLAVE(B);      // = C
  C == A? No.
  prev = C;
  /* check: (prev=C) != master->slaves(C)?
     FALSE -> loop exits! */
  /* A never unlinked, freed by call_rcu */

  CPU 0 (teql_master_xmit, later)
  -----------------------------------
  q = NEXT_SLAVE(C);      // = A (freed!)
  slave = qdisc_dev(A);   // UAF!

Fix this by saving master->slaves into a local `head` variable at the
start of teql_destroy() and using it as a stable sentinel for the
entire traversal. Also annotate all data-path accesses to
master->slaves with READ_ONCE/WRITE_ONCE to prevent store-tearing and
compiler-introduced re-reads.
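
The stable-sentinel idea can be sketched in userspace as follows. This
is a simplified illustration, not the sch_teql.c code: struct node,
demo_unlink() and the simulate_writer flag are hypothetical, and the
"writer" is emulated in the same thread by overwriting the shared head
after every step, which stands in for the data-path stores to
master->slaves. Kernel details such as READ_ONCE/WRITE_ONCE and RCU are
left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node of a circular singly linked list. */
struct node {
	struct node *next;
};

/* Unlink @target from the circular list entered via *headp.
 * The entry point is snapshotted once into a local and used as the
 * loop sentinel, so later changes to *headp cannot end the walk early.
 * Returns 1 if @target was found and unlinked, 0 otherwise.
 */
static int demo_unlink(struct node **headp, struct node *target,
		       bool simulate_writer)
{
	struct node *head = *headp;	/* stable sentinel */
	struct node *prev = head, *q;

	if (!head)
		return 0;

	do {
		q = prev->next;
		if (q == target) {
			prev->next = q->next;
			if (q == head)
				*headp = (q->next == q) ? NULL : q->next;
			return 1;
		}
		/* Emulate a concurrent writer moving the shared head to
		 * the node that was just visited.
		 */
		if (simulate_writer)
			*headp = q;
	} while ((prev = q) != head);

	return 0;
}
```

With a loop condition that re-read *headp instead of the local head,
the emulated writer would make the walk stop after the first step and
the target would stay linked.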

==================================================================
BUG: KASAN: slab-use-after-free in teql_master_xmit+0xeae/0x14a0
Read of size 8 at addr ffff888018074040 by task poc/162

CPU: 2 UID: 0 PID: 162 Comm: poc Not tainted 7.0.0-rc7-next-20260410 #10 PREEMPTLAZY
Call Trace:
 <TASK>
 dump_stack_lvl+0x64/0x80
 print_report+0xd0/0x5e0
 kasan_report+0xce/0x100
 teql_master_xmit+0xeae/0x14a0
 dev_hard_start_xmit+0xcd/0x5b0
 sch_direct_xmit+0x12e/0xac0
 __qdisc_run+0x3b1/0x1a70
 __dev_queue_xmit+0x2257/0x3100
 ip_finish_output2+0x615/0x19c0
 ip_output+0x158/0x2b0
 ip_send_skb+0x11b/0x160
 udp_send_skb+0x64b/0xd80
 udp_sendmsg+0x138c/0x1ec0
 __sys_sendto+0x331/0x3a0
 __x64_sys_sendto+0xe0/0x1c0
 do_syscall_64+0x64/0x680
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

The buggy address belongs to the object at ffff888018074000
 which belongs to the cache kmalloc-512 of size 512
The buggy address is located 64 bytes inside of
 freed 512-byte region [ffff888018074000, ffff888018074200)
==================================================================

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kito Xu (veritas501) <hxzene@gmail.com>
---
 net/sched/sch_teql.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/net/sched/sch_teql.c b/net/sched/sch_teql.c
index ec4039a201a2..2e86397a5219 100644
--- a/net/sched/sch_teql.c
+++ b/net/sched/sch_teql.c
@@ -101,7 +101,7 @@ teql_dequeue(struct Qdisc *sch)
 	if (skb == NULL) {
 		struct net_device *m = qdisc_dev(q);
 		if (m) {
-			dat->m->slaves = sch;
+			WRITE_ONCE(dat->m->slaves, sch);
 			netif_wake_queue(m);
 		}
 	} else {
@@ -136,19 +136,23 @@ teql_destroy(struct Qdisc *sch)
 	if (!master)
 		return;
 
-	prev = master->slaves;
+	prev = READ_ONCE(master->slaves);
 	if (prev) {
+		struct Qdisc *head = prev;
+
 		do {
 			q = NEXT_SLAVE(prev);
 			if (q == sch) {
 				NEXT_SLAVE(prev) = NEXT_SLAVE(q);
-				if (q == master->slaves) {
-					master->slaves = NEXT_SLAVE(q);
-					if (q == master->slaves) {
+				if (q == head) {
+					WRITE_ONCE(master->slaves,
+						   NEXT_SLAVE(q));
+					if (q == NEXT_SLAVE(q)) {
 						struct netdev_queue *txq;
 
 						txq = netdev_get_tx_queue(master->dev, 0);
-						master->slaves = NULL;
+						WRITE_ONCE(master->slaves,
+							   NULL);
 
 						dev_reset_queue(master->dev,
 								txq, NULL);
@@ -158,7 +162,7 @@ teql_destroy(struct Qdisc *sch)
 				break;
 			}
 
-		} while ((prev = q) != master->slaves);
+		} while ((prev = q) != head);
 	}
 }
 
@@ -285,7 +289,7 @@ static netdev_tx_t teql_master_xmit(struct sk_buff *skb, struct net_device *dev)
 	int subq = skb_get_queue_mapping(skb);
 	struct sk_buff *skb_res = NULL;
 
-	start = master->slaves;
+	start = READ_ONCE(master->slaves);
 
 restart:
 	nores = 0;
@@ -317,7 +321,7 @@ static netdev_tx_t teql_master_xmit(struct sk_buff *skb, struct net_device *dev)
 				    netdev_start_xmit(skb, slave, slave_txq, false) ==
 				    NETDEV_TX_OK) {
 					__netif_tx_unlock(slave_txq);
-					master->slaves = NEXT_SLAVE(q);
+					WRITE_ONCE(master->slaves, NEXT_SLAVE(q));
 					netif_wake_queue(dev);
 					master->tx_packets++;
 					master->tx_bytes += length;
@@ -329,7 +333,7 @@ static netdev_tx_t teql_master_xmit(struct sk_buff *skb, struct net_device *dev)
 				busy = 1;
 			break;
 		case 1:
-			master->slaves = NEXT_SLAVE(q);
+			WRITE_ONCE(master->slaves, NEXT_SLAVE(q));
 			return NETDEV_TX_OK;
 		default:
 			nores = 1;
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH v5] net: caif: fix stack out-of-bounds write in cfctrl_link_setup()
From: Paolo Abeni @ 2026-04-13  9:30 UTC (permalink / raw)
  To: Simon Horman, Kangzheng Gu
  Cc: davem, edumazet, kuba, kees, thorsten.blum, arnd, sjur.brandeland,
	netdev, linux-kernel, stable
In-Reply-To: <20260412135743.GK469338@kernel.org>

On 4/12/26 3:57 PM, Simon Horman wrote:
> I am wondering if it would be best to follow the pattern for
> writing linkparam.u.utility.name elsewhere in this function.
> That:
> 1. Uses a somewhat more succinct loop control structure
> 2. Silently truncates input without updating cmdrsp if overrun would occur
> 
> Something like this (compile tested only!):
> 
> diff --git a/net/caif/cfctrl.c b/net/caif/cfctrl.c
> index c6cc2bfed65d..ba184c11386e 100644
> --- a/net/caif/cfctrl.c
> +++ b/net/caif/cfctrl.c
> @@ -15,6 +15,7 @@
>  #include <net/caif/cfctrl.h>
>  
>  #define container_obj(layr) container_of(layr, struct cfctrl, serv.layer)
> +#define RFM_VOLUME_LEN 20
>  #define UTILITY_NAME_LENGTH 16
>  #define CFPKT_CTRL_PKT_LEN 20
>  
> @@ -414,10 +415,11 @@ static int cfctrl_link_setup(struct cfctrl *cfctrl, struct cfpkt *pkt, u8 cmdrsp
>  		 */
>  		linkparam.u.rfm.connid = cfpkt_extr_head_u32(pkt);
>  		cp = (u8 *) linkparam.u.rfm.volume;
> -		for (tmp = cfpkt_extr_head_u8(pkt);
> -		     cfpkt_more(pkt) && tmp != '\0';
> -		     tmp = cfpkt_extr_head_u8(pkt))
> +		caif_assert(sizeof(linkparam.u.rfm.volume) >= RFM_VOLUME_LEN);
> +		for(i = 0; i < RFM_VOLUME_LEN - 1 && cfpkt_more(pkt); i++) {
> +			tmp = cfpkt_extr_head_u8(pkt);
>  			*cp++ = tmp;
> +		}
>  		*cp = '\0';
>  
>  		if (CFCTRL_ERR_BIT & cmdrsp)

I agree that the code suggested by Simon is clearer. Note that AFAICS it
lacks an additional `tmp != '\0'` check to break the loop, but even with
that added it should be preferable.
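
For illustration only, the loop shape with both the length bound and the
NUL check could look like the userspace sketch below. It is not a tested
cfctrl.c change: demo_extract_name() and DEMO_VOLUME_LEN are hypothetical
names, and a plain byte buffer stands in for the cfpkt accessors.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_VOLUME_LEN 20	/* assumed destination size for this sketch */

/* Copy bytes from @src into @dst until a NUL byte is read, @src is
 * exhausted, or @dst has room only for the terminator. @dst_len must be
 * at least 1. @dst is always NUL-terminated. Returns the number of
 * bytes consumed from @src (including a consumed NUL, if any).
 */
static size_t demo_extract_name(char *dst, size_t dst_len,
				const unsigned char *src, size_t src_len)
{
	size_t i, consumed = 0;

	for (i = 0; i < dst_len - 1 && consumed < src_len; i++) {
		unsigned char tmp = src[consumed++];

		if (tmp == '\0')
			break;
		dst[i] = (char)tmp;
	}
	dst[i] = '\0';

	return consumed;
}
```

Over-long input is silently truncated here, which matches the behaviour
Simon describes in point 2 above.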

Thanks,

Paolo


^ permalink raw reply

* Re: [PATCH net-next v2 3/3] net: mdio: treat PSE EPROBE_DEFER as non-fatal during PHY registration
From: Kory Maincent @ 2026-04-13  9:28 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Russell King (Oracle), Carlo Szelinsky, o.rempel, andrew+netdev,
	hkallweit1, kuba, davem, edumazet, pabeni, horms, netdev,
	linux-kernel
In-Reply-To: <280018e9-4499-4a13-b692-6d6a4499477b@lunn.ch>

On Thu, 9 Apr 2026 21:54:33 +0200
Andrew Lunn <andrew@lunn.ch> wrote:

> On Thu, Apr 09, 2026 at 05:08:33PM +0100, Russell King (Oracle) wrote:
> > On Thu, Apr 09, 2026 at 05:34:56PM +0200, Andrew Lunn wrote:  
> > > I still think we should be deferring probe until we have all the parts
> > > available. The question is, how do we actually do that?  
> > 
> > Indeed...
> >   
> > > We could insist that MACs being used with PSE need to call
> > > phylink_connect() in probe, so we can return EPROBE_DEFER. We might
> > > actually need a new API method, phylink_connect_probe(). That can call
> > > down into phylib, maybe again new API methods, which will not bind
> > > genphy, but return EPROBE_DEFER.  
> 
> I did not say it would be easy...
>  
> > How would MACs know whether they should call phylink_connect_probe()
> > or phylink_connect_phy() ?  
> 
> It would not. Anybody with a board using PSE would need to modify the
> MAC driver to use phylink_connect_probe(), if they have a slow to load
> PSE device.
> 
> > What do we do about MAC drivers that are a single driver and device,
> > but are made up of several network devices (like Marvell PP2) ?  
> 
> It would need more care, but it should work. You might end up removing
> a perfectly good device because the other one is missing its PHY,
> which is not ideal, but hopefully you get there in the end.
> 
> > We also have network drivers that provide a MDIO bus for a different
> > network device, which makes connecting the PHY harder in the probe
> > path.  
> 
> Yes, we would see such setups doing more deferred probing, but again,
> they should get there in the end. The most common systems doing this
> are using the FEC. Are there any boards using the FEC and a problematic
> PSE?

Not that I know of, but it is only the beginning of PSE support in Linux.

> > Lastly, what do we do where a PHY driver hasn't been configured or
> > doesn't exist for the PHY?  
> 
> I was wondering if we can get from the driver core some idea where we
> are in the deferred probing window. If we are 2/3 of the way through
> the window, fall back to genphy?

How could we decide when to fall back to genphy and when to keep
deferring?

> I'm not saying we should change all MAC drivers, or recommend new MAC
> drivers connect to the PHY in probe. I just want to offer the option
> if you have a problematic PSE or PHY, change the MAC driver.
> 
> What we have also said in the past, it is the bootloader's problem to
> download firmware into the PHY, or PSE, so that it is ready to go by
> the time Linux boots. That would also be the simpler solution here.

My thought in using MDI was to separate the hardware port from the PHY device.
In hardware, the PSE is wired directly to the MDI, so the binding should
reflect that.
I was thinking of adding a new helper to register the MDIs for the MACs.
In the MAC/switch binding we could have a list of MDIs, similar to this:
https://elixir.bootlin.com/linux/v7.0-rc7/source/Documentation/devicetree/bindings/net/ethernet-phy.yaml#L332

We could have the SFP and PSE phandles directly in that node.
For example, the prestera driver already does something similar for SFP:
https://elixir.bootlin.com/linux/v7.0-rc7/source/drivers/net/ethernet/marvell/prestera/prestera_main.c#L370

I wanted to convert this helper into a generic one. Then every MAC/switch
driver could simply call the new helper to register its MDIs. I am surely
missing lots of things, as I am not as much of a net expert as you, but what
do you think of that?

We would have one path into phylink and one through the new helper when there
are no PHY devices registered.

Maxime, what do you think, as you are actively working on MDIs?

Still, this does not answer the initial question: should we keep EPROBE_DEFER
in phylink if the PSE is not found, or should we have a way to probe the
PHY and resolve the PSE phandle later?

Regards,
-- 
Köry Maincent, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com

^ permalink raw reply

* Re: [patch 07/38] treewide: Consolidate cycles_t
From: Ojaswin Mujoo @ 2026-04-13  9:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Arnd Bergmann, x86, Lu Baolu, iommu, Michael Grzeschik,
	netdev, linux-wireless, Herbert Xu, linux-crypto, Vlastimil Babka,
	linux-mm, David Woodhouse, Bernie Thompson, linux-fbdev,
	Theodore Tso, linux-ext4, Andrew Morton, Uladzislau Rezki,
	Marco Elver, Dmitry Vyukov, kasan-dev, Andrey Ryabinin,
	Thomas Sailer, linux-hams, Jason A. Donenfeld, Richard Henderson,
	linux-alpha, Russell King, linux-arm-kernel, Catalin Marinas,
	Huacai Chen, loongarch, Geert Uytterhoeven, linux-m68k,
	Dinh Nguyen, Jonas Bonn, linux-openrisc, Helge Deller,
	linux-parisc, Michael Ellerman, linuxppc-dev, Paul Walmsley,
	linux-riscv, Heiko Carstens, linux-s390, David S. Miller,
	sparclinux
In-Reply-To: <20260410120318.045532623@kernel.org>

On Fri, Apr 10, 2026 at 02:19:03PM +0200, Thomas Gleixner wrote:
> Most architectures define cycles_t as unsigned long except:
> 
>  - x86 requires it to be 64-bit independent of the 32-bit/64-bit build.
> 
>  - parisc and mips define it as unsigned int
> 
>    parisc has no real reason to do so as there are only a few usage sites
>    which either expand it to a 64-bit value or utilize only the lower
>    32bits.
> 
>    mips has no real requirement either.
> 
> Move the typedef to types.h and provide a config switch to enforce the
> 64-bit type for x86.
> 
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> ---
>  arch/Kconfig                       |    4 ++++
>  arch/alpha/include/asm/timex.h     |    3 ---
>  arch/arm/include/asm/timex.h       |    1 -
>  arch/loongarch/include/asm/timex.h |    2 --
>  arch/m68k/include/asm/timex.h      |    2 --
>  arch/mips/include/asm/timex.h      |    2 --
>  arch/nios2/include/asm/timex.h     |    2 --
>  arch/parisc/include/asm/timex.h    |    2 --
>  arch/powerpc/include/asm/timex.h   |    4 +---
>  arch/riscv/include/asm/timex.h     |    2 --
>  arch/s390/include/asm/timex.h      |    2 --
>  arch/sparc/include/asm/timex_64.h  |    1 -
>  arch/x86/Kconfig                   |    1 +
>  arch/x86/include/asm/tsc.h         |    2 --
>  include/asm-generic/timex.h        |    1 -
>  include/linux/types.h              |    6 ++++++
>  16 files changed, 12 insertions(+), 25 deletions(-)
> 
<...>

> --- a/arch/powerpc/include/asm/timex.h
> +++ b/arch/powerpc/include/asm/timex.h
> @@ -11,9 +11,7 @@
>  #include <asm/cputable.h>
>  #include <asm/vdso/timebase.h>
>  
> -typedef unsigned long cycles_t;
> -
> -static inline cycles_t get_cycles(void)
> +ostatic inline cycles_t get_cycles(void)

Hi Thomas, I'm in the middle of testing this series on powerpc. In the
meantime I noticed that there's probably a small typo here (although this is
fixed later)

Regards,
ojaswin
>  {
>  	return mftb();
>  }

^ permalink raw reply
