Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] devlink/param: replace deprecated strcpy() with strscpy()
From: David Laight @ 2026-05-07  8:04 UTC (permalink / raw)
  To: Álvaro Costa
  Cc: jiri, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, open list:DEVLINK, open list
In-Reply-To: <20260506211415.16343-1-alvaroc.dev@gmail.com>

On Wed,  6 May 2026 18:14:11 -0300
Álvaro Costa <alvaroc.dev@gmail.com> wrote:

> Replace strcpy() call used to extract a string parameter from param_data
> with strscpy(). Since strscpy() already performs bounds checking and
> ensures the destination string is NUL-terminated, remove the string
> length check as well.
> 
> Signed-off-by: Álvaro Costa <alvaroc.dev@gmail.com>
> ---
>  net/devlink/param.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/net/devlink/param.c b/net/devlink/param.c
> index cf95268da5b0..26695b7e2861 100644
> --- a/net/devlink/param.c
> +++ b/net/devlink/param.c
> @@ -536,11 +536,9 @@ devlink_param_value_get_from_info(const struct devlink_param *param,
>  		value->vu64 = nla_get_u64(param_data);
>  		break;
>  	case DEVLINK_PARAM_TYPE_STRING:
> -		len = strnlen(nla_data(param_data), nla_len(param_data));
> -		if (len == nla_len(param_data) ||
> -		    len >= __DEVLINK_PARAM_MAX_STRING_VALUE)
> +		len = strscpy(value->vstr, nla_data(param_data));
> +		if (len < 0)
>  			return -EINVAL;
> -		strcpy(value->vstr, nla_data(param_data));

The only sensible thing here is to replace the strcpy() with:
		memcpy(value->vstr, nla_data(param_data), len + 1);

-- David

>  		break;
>  	case DEVLINK_PARAM_TYPE_BOOL:
>  		if (param_data && nla_len(param_data))


^ permalink raw reply

* Re: [PATCH net-next v6 04/10] enic: add admin CQ service with MSI-X interrupt and NAPI polling
From: Paolo Abeni @ 2026-05-07  8:03 UTC (permalink / raw)
  To: Satish Kharat, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski
  Cc: netdev, linux-kernel, Sesidhar Baddela
In-Reply-To: <20260503-enic-sriov-v2-admin-channel-v2-v6-4-0af4fbc2d86d@cisco.com>

On 5/3/26 1:22 PM, Satish Kharat wrote:
> +static void enic_admin_msg_enqueue(struct enic *enic, void *buf,
> +				   unsigned int len)
> +{
> +	struct enic_admin_msg *msg;
> +
> +	msg = kmalloc(struct_size(msg, data, len), GFP_ATOMIC);
> +	if (!msg) {
> +		enic->admin_msg_drop_cnt++;
> +		if (net_ratelimit())
> +			netdev_warn(enic->netdev,
> +				    "admin msg enqueue drop (len=%u drops=%llu)\n",
> +				    len, enic->admin_msg_drop_cnt);

Failed allocation will splat dmesg; no need to added additional error
messages (here and elsewhere in the series).

/P


^ permalink raw reply

* [PATCH v3 net-next 3/3] ipv4: Add __must_check to nexthop removal functions
From: Cosmin Ratiu @ 2026-05-07  7:56 UTC (permalink / raw)
  To: netdev
  Cc: David Ahern, Ido Schimmel, Kuniyuki Iwashima, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Simon Horman, Paolo Abeni,
	Cosmin Ratiu
In-Reply-To: <20260507075606.322405-1-cratiu@nvidia.com>

These functions return a signal whether FIB flushing is required which
must not be ignored. Use the compiler to help with enforcing this
requirement in the future.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
---
 net/ipv4/nexthop.c | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 703954c490d0..6205bd57aa85 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -20,8 +20,8 @@
 #define NH_RES_DEFAULT_IDLE_TIMER	(120 * HZ)
 #define NH_RES_DEFAULT_UNBALANCED_TIMER	0	/* No forced rebalancing. */
 
-static bool remove_nexthop(struct net *net, struct nexthop *nh,
-			   struct nl_info *nlinfo);
+static bool __must_check remove_nexthop(struct net *net, struct nexthop *nh,
+					struct nl_info *nlinfo);
 
 #define NH_DEV_HASHBITS  8
 #define NH_DEV_HASHSIZE (1U << NH_DEV_HASHBITS)
@@ -2016,9 +2016,9 @@ static void nh_hthr_group_rebalance(struct nh_group *nhg)
 	}
 }
 
-static bool remove_nh_grp_entry(struct net *net, struct nh_grp_entry *nhge,
-				struct nl_info *nlinfo,
-				struct list_head *deferred_free)
+static bool __must_check
+remove_nh_grp_entry(struct net *net, struct nh_grp_entry *nhge,
+		    struct nl_info *nlinfo, struct list_head *deferred_free)
 {
 	struct nh_grp_entry *nhges, *new_nhges;
 	struct nexthop *nhp = nhge->nh_parent;
@@ -2095,8 +2095,9 @@ static bool remove_nh_grp_entry(struct net *net, struct nh_grp_entry *nhge,
 	return false;
 }
 
-static bool remove_nexthop_from_groups(struct net *net, struct nexthop *nh,
-				       struct nl_info *nlinfo)
+static bool __must_check
+remove_nexthop_from_groups(struct net *net, struct nexthop *nh,
+			   struct nl_info *nlinfo)
 {
 	struct nh_grp_entry *nhge, *tmp;
 	LIST_HEAD(deferred_free);
@@ -2146,7 +2147,8 @@ static void remove_nexthop_group(struct nexthop *nh, struct nl_info *nlinfo)
 }
 
 /* not called for nexthop replace */
-static bool __remove_nexthop_fib(struct net *net, struct nexthop *nh)
+static bool __must_check __remove_nexthop_fib(struct net *net,
+					      struct nexthop *nh)
 {
 	bool need_flush = !list_empty(&nh->fi_list);
 	struct fib6_info *f6i;
@@ -2177,8 +2179,8 @@ static bool __remove_nexthop_fib(struct net *net, struct nexthop *nh)
 	return need_flush;
 }
 
-static bool __remove_nexthop(struct net *net, struct nexthop *nh,
-			     struct nl_info *nlinfo)
+static bool __must_check __remove_nexthop(struct net *net, struct nexthop *nh,
+					  struct nl_info *nlinfo)
 {
 	bool need_flush = __remove_nexthop_fib(net, nh);
 
@@ -2197,8 +2199,8 @@ static bool __remove_nexthop(struct net *net, struct nexthop *nh,
 	return need_flush;
 }
 
-static bool remove_nexthop(struct net *net, struct nexthop *nh,
-			   struct nl_info *nlinfo)
+static bool __must_check remove_nexthop(struct net *net, struct nexthop *nh,
+					struct nl_info *nlinfo)
 {
 	bool need_flush;
 
-- 
2.53.0


^ permalink raw reply related

* [PATCH v3 net-next 2/3] ipv4: Flush the FIB once on multiple nexthop removal
From: Cosmin Ratiu @ 2026-05-07  7:56 UTC (permalink / raw)
  To: netdev
  Cc: David Ahern, Ido Schimmel, Kuniyuki Iwashima, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Simon Horman, Paolo Abeni,
	Cosmin Ratiu
In-Reply-To: <20260507075606.322405-1-cratiu@nvidia.com>

When a device is going down or when a net namespace is deleted, all
nexthops on it are removed, and for each nexthop being removed the FIB
table is flushed, which does a full trie traversal looking for entries
marked RTNH_F_DEAD and removing them. This is O(N x R), with N being
number of dev nexthops and R being number of IPv4 routes.

The RTNL is held the entire time.

When there are many nexthops to be removed and many routing entries,
this can result in the RTNL being held for multiple minutes, which
causes unhappiness in other processes trying to acquire the RTNL (e.g.
systemd-networkd for DHCP renewals).

In a complicated deployment with multiple vxlan devices, each having
16K nexthops and a total of 128K ipv4 routes, this is exactly what
happens:

nexthop_flush_dev()                # loops over 16K nexthops
  -> remove_nexthop()
    -> __remove_nexthop()
      -> __remove_nexthop_fib()    # marks fi->fib_flags |= RTNH_F_DEAD
        -> fib_flush()             # for EACH nexthop!
	  -> fib_table_flush()     # walks the ENTIRE FIB, 128K entries

This patch makes use of the previously added FIB flushing signal to only
do a single FIB flush after all nexthops to be removed are marked as
RTNH_F_DEAD:
- __remove_nexthop_fib() no longer flushes the FIB.
- nexthop_flush_dev() and flush_all_nexthops() now keep track whether
  any nexthop was removed and trigger a FIB flush at the end.
- a new wrapper is defined, remove_one_nexthop() which calls
  remove_nexthop() and flushes if necessary. This is intended for places
  which must remove a single nexthop and shouldn't worry about the need
  to trigger a FIB flush. For now, the only caller is rtm_del_nexthop().
- The two direct callers of __remove_nexthop() get a WARN_ON_ONCE, since
  the nh about to be removed should not have any FIB entries referencing
  it when replacing or inserting a new one.

This dramatically improves performance from O(N x R) to O(N + R).

Releasing a nexthop reference in remove_nexthop() now no longer frees
it. Instead, it is deleted when the last fib_info pointing to it gets
freed via free_fib_info_rcu(). All routing code is already careful not
to take into consideration routes marked with RTNH_F_DEAD.

Tested with:
DEV=eth2
ip link set up dev $DEV
ip link add testnh0 link $DEV type macvlan mode bridge
ip addr add 198.51.100.1/24 dev testnh0
ip link set testnh0 up

seq 1 65536 | \
sed 's/.*/nexthop add id & via 198.51.100.2 dev testnh0/' | \
ip -batch -

i=1
for a in $(seq 0 255); do
  for b in $(seq 0 255); do
    echo "route add 10.${a}.${b}.0/32 nhid $i"
    i=$((i + 1))
  done
done | ip -batch -

time ip link set testnh0 down
ip link del testnh0

Without this patch:
real	0m32.601s
user	0m0.000s
sys	0m32.511s

With this patch:
real	0m0.209s
user	0m0.000s
sys	0m0.153s

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
---
 net/ipv4/nexthop.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 7177092d2605..703954c490d0 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -2154,8 +2154,6 @@ static bool __remove_nexthop_fib(struct net *net, struct nexthop *nh)
 
 	list_for_each_entry(fi, &nh->fi_list, nh_list)
 		fi->fib_flags |= RTNH_F_DEAD;
-	if (need_flush)
-		fib_flush(net);
 
 	spin_lock_bh(&nh->lock);
 
@@ -2220,6 +2218,13 @@ static bool remove_nexthop(struct net *net, struct nexthop *nh,
 	return need_flush;
 }
 
+static void remove_one_nexthop(struct net *net, struct nexthop *nh,
+			       struct nl_info *nlinfo)
+{
+	if (remove_nexthop(net, nh, nlinfo))
+		fib_flush(net);
+}
+
 /* if any FIB entries reference this nexthop, any dst entries
  * need to be regenerated
  */
@@ -2602,7 +2607,7 @@ static int replace_nexthop(struct net *net, struct nexthop *old,
 	if (!err) {
 		nh_rt_cache_flush(net, old, new);
 
-		__remove_nexthop(net, new, NULL);
+		WARN_ON_ONCE(__remove_nexthop(net, new, NULL));
 		nexthop_put(new);
 	}
 
@@ -2709,6 +2714,7 @@ static void nexthop_flush_dev(struct net_device *dev, unsigned long event)
 	unsigned int hash = nh_dev_hashfn(dev->ifindex);
 	struct net *net = dev_net(dev);
 	struct hlist_head *head = &net->nexthop.devhash[hash];
+	bool need_flush = false;
 	struct hlist_node *n;
 	struct nh_info *nhi;
 
@@ -2720,22 +2726,28 @@ static void nexthop_flush_dev(struct net_device *dev, unsigned long event)
 		    (event == NETDEV_DOWN || event == NETDEV_CHANGE))
 			continue;
 
-		remove_nexthop(net, nhi->nh_parent, NULL);
+		need_flush |= remove_nexthop(net, nhi->nh_parent, NULL);
 	}
+
+	if (need_flush)
+		fib_flush(net);
 }
 
 /* rtnl; called when net namespace is deleted */
 static void flush_all_nexthops(struct net *net)
 {
 	struct rb_root *root = &net->nexthop.rb_root;
+	bool need_flush = false;
 	struct rb_node *node;
 	struct nexthop *nh;
 
 	while ((node = rb_first(root))) {
 		nh = rb_entry(node, struct nexthop, rb_node);
-		remove_nexthop(net, nh, NULL);
+		need_flush |= remove_nexthop(net, nh, NULL);
 		cond_resched();
 	}
+	if (need_flush)
+		fib_flush(net);
 }
 
 static struct nexthop *nexthop_create_group(struct net *net,
@@ -3004,7 +3016,7 @@ static struct nexthop *nexthop_add(struct net *net, struct nh_config *cfg,
 
 	err = insert_nexthop(net, nh, cfg, extack);
 	if (err) {
-		__remove_nexthop(net, nh, NULL);
+		WARN_ON_ONCE(__remove_nexthop(net, nh, NULL));
 		nexthop_put(nh);
 		nh = ERR_PTR(err);
 	}
@@ -3373,7 +3385,7 @@ static int rtm_del_nexthop(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	nh = nexthop_find_by_id(net, id);
 	if (nh)
-		remove_nexthop(net, nh, &nlinfo);
+		remove_one_nexthop(net, nh, &nlinfo);
 	else
 		err = -ENOENT;
 
-- 
2.53.0


^ permalink raw reply related

* [PATCH v3 net-next 1/3] ipv4: Provide a FIB flushing signal from nexthop removal functions
From: Cosmin Ratiu @ 2026-05-07  7:56 UTC (permalink / raw)
  To: netdev
  Cc: David Ahern, Ido Schimmel, Kuniyuki Iwashima, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Simon Horman, Paolo Abeni,
	Cosmin Ratiu
In-Reply-To: <20260507075606.322405-1-cratiu@nvidia.com>

Plumb a bool value throughout the various nexthop removal functions,
determined in the innermost __remove_nexthop_fib() (which still does the
FIB flushing) and propagated up all callers.

The next patch will make use of this signal to optimize the removal of
multiple nexthops by moving the FIB flushing up the call hierarchy.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
---
 net/ipv4/nexthop.c | 50 +++++++++++++++++++++++++++-------------------
 1 file changed, 30 insertions(+), 20 deletions(-)

diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index f92fcc39fc4c..7177092d2605 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -20,7 +20,7 @@
 #define NH_RES_DEFAULT_IDLE_TIMER	(120 * HZ)
 #define NH_RES_DEFAULT_UNBALANCED_TIMER	0	/* No forced rebalancing. */
 
-static void remove_nexthop(struct net *net, struct nexthop *nh,
+static bool remove_nexthop(struct net *net, struct nexthop *nh,
 			   struct nl_info *nlinfo);
 
 #define NH_DEV_HASHBITS  8
@@ -2016,7 +2016,7 @@ static void nh_hthr_group_rebalance(struct nh_group *nhg)
 	}
 }
 
-static void remove_nh_grp_entry(struct net *net, struct nh_grp_entry *nhge,
+static bool remove_nh_grp_entry(struct net *net, struct nh_grp_entry *nhge,
 				struct nl_info *nlinfo,
 				struct list_head *deferred_free)
 {
@@ -2033,10 +2033,8 @@ static void remove_nh_grp_entry(struct net *net, struct nh_grp_entry *nhge,
 	newg = nhg->spare;
 
 	/* last entry, keep it visible and remove the parent */
-	if (nhg->num_nh == 1) {
-		remove_nexthop(net, nhp, nlinfo);
-		return;
-	}
+	if (nhg->num_nh == 1)
+		return remove_nexthop(net, nhp, nlinfo);
 
 	newg->has_v4 = false;
 	newg->is_multipath = nhg->is_multipath;
@@ -2093,22 +2091,26 @@ static void remove_nh_grp_entry(struct net *net, struct nh_grp_entry *nhge,
 
 	if (nlinfo)
 		nexthop_notify(RTM_NEWNEXTHOP, nhp, nlinfo);
+
+	return false;
 }
 
-static void remove_nexthop_from_groups(struct net *net, struct nexthop *nh,
+static bool remove_nexthop_from_groups(struct net *net, struct nexthop *nh,
 				       struct nl_info *nlinfo)
 {
 	struct nh_grp_entry *nhge, *tmp;
 	LIST_HEAD(deferred_free);
+	bool need_flush = false;
 
 	/* If there is nothing to do, let's avoid the costly call to
 	 * synchronize_net()
 	 */
 	if (list_empty(&nh->grp_list))
-		return;
+		return false;
 
 	list_for_each_entry_safe(nhge, tmp, &nh->grp_list, nh_list)
-		remove_nh_grp_entry(net, nhge, nlinfo, &deferred_free);
+		need_flush |= remove_nh_grp_entry(net, nhge, nlinfo,
+						  &deferred_free);
 
 	/* make sure all see the newly published array before releasing rtnl */
 	synchronize_net();
@@ -2118,6 +2120,8 @@ static void remove_nexthop_from_groups(struct net *net, struct nexthop *nh,
 		list_del(&nhge->nh_list);
 		free_percpu(nhge->stats);
 	}
+
+	return need_flush;
 }
 
 static void remove_nexthop_group(struct nexthop *nh, struct nl_info *nlinfo)
@@ -2142,17 +2146,15 @@ static void remove_nexthop_group(struct nexthop *nh, struct nl_info *nlinfo)
 }
 
 /* not called for nexthop replace */
-static void __remove_nexthop_fib(struct net *net, struct nexthop *nh)
+static bool __remove_nexthop_fib(struct net *net, struct nexthop *nh)
 {
+	bool need_flush = !list_empty(&nh->fi_list);
 	struct fib6_info *f6i;
-	bool do_flush = false;
 	struct fib_info *fi;
 
-	list_for_each_entry(fi, &nh->fi_list, nh_list) {
+	list_for_each_entry(fi, &nh->fi_list, nh_list)
 		fi->fib_flags |= RTNH_F_DEAD;
-		do_flush = true;
-	}
-	if (do_flush)
+	if (need_flush)
 		fib_flush(net);
 
 	spin_lock_bh(&nh->lock);
@@ -2173,12 +2175,14 @@ static void __remove_nexthop_fib(struct net *net, struct nexthop *nh)
 	}
 
 	spin_unlock_bh(&nh->lock);
+
+	return need_flush;
 }
 
-static void __remove_nexthop(struct net *net, struct nexthop *nh,
+static bool __remove_nexthop(struct net *net, struct nexthop *nh,
 			     struct nl_info *nlinfo)
 {
-	__remove_nexthop_fib(net, nh);
+	bool need_flush = __remove_nexthop_fib(net, nh);
 
 	if (nh->is_group) {
 		remove_nexthop_group(nh, nlinfo);
@@ -2189,13 +2193,17 @@ static void __remove_nexthop(struct net *net, struct nexthop *nh,
 		if (nhi->fib_nhc.nhc_dev)
 			hlist_del(&nhi->dev_hash);
 
-		remove_nexthop_from_groups(net, nh, nlinfo);
+		need_flush |= remove_nexthop_from_groups(net, nh, nlinfo);
 	}
+
+	return need_flush;
 }
 
-static void remove_nexthop(struct net *net, struct nexthop *nh,
+static bool remove_nexthop(struct net *net, struct nexthop *nh,
 			   struct nl_info *nlinfo)
 {
+	bool need_flush;
+
 	call_nexthop_notifiers(net, NEXTHOP_EVENT_DEL, nh, NULL);
 
 	/* remove from the tree */
@@ -2204,10 +2212,12 @@ static void remove_nexthop(struct net *net, struct nexthop *nh,
 	if (nlinfo)
 		nexthop_notify(RTM_DELNEXTHOP, nh, nlinfo);
 
-	__remove_nexthop(net, nh, nlinfo);
+	need_flush = __remove_nexthop(net, nh, nlinfo);
 	nh_base_seq_inc(net);
 
 	nexthop_put(nh);
+
+	return need_flush;
 }
 
 /* if any FIB entries reference this nexthop, any dst entries
-- 
2.53.0


^ permalink raw reply related

* [PATCH net-next v3 0/3] ipv4: Flush the FIB once on multiple nexthop removal
From: Cosmin Ratiu @ 2026-05-07  7:56 UTC (permalink / raw)
  To: netdev
  Cc: David Ahern, Ido Schimmel, Kuniyuki Iwashima, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Simon Horman, Paolo Abeni,
	Cosmin Ratiu

This series optimizes multiple nexthop removal performance from having
to do a FIB flush for each nexthop being removed to only doing a single
FIB flush after all nexthops are removed.

This dramatically improves performance in scenarios where there are
many nexthops and many ipv4 routes. Please see individual patches for
more details and for a test scenario.

V2 -> V3: https://lore.kernel.org/netdev/8fea4084-c9ec-472a-b8ab-ecc87e537216@kernel.org/T/#t
- Split the patch into 3 (Ido Schimmel, David Ahern)
- Used WARN_ON_ONCE instead of WARN_ON (Ido Schimmel)

V1 -> V2:
- Fixes xmas tree in a couple places (Kuniyuki Iwashima)
- Added __must_check to remove_nexthop_from_groups() (Kuniyuki Iwashima)

Cosmin Ratiu (3):
  ipv4: Provide a FIB flushing signal from nexthop removal functions
  ipv4: Flush the FIB once on multiple nexthop removal
  ipv4: Add __must_check to nexthop removal functions

 net/ipv4/nexthop.c | 88 +++++++++++++++++++++++++++++-----------------
 1 file changed, 56 insertions(+), 32 deletions(-)

-- 
2.53.0


^ permalink raw reply

* Re: [PATCH] mctp i2c: check packet length before marking flow active
From: William A. Kennington III @ 2026-05-07  7:50 UTC (permalink / raw)
  To: Jeremy Kerr, Matt Johnston, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Wolfram Sang
  Cc: netdev, linux-kernel
In-Reply-To: <88cb1f245485d991b845886827876f27f11792ae.camel@codeconstruct.com.au>


On 5/6/26 01:01, Jeremy Kerr wrote:
> Hi William,
>
>>> Just to clarify my understanding of the state: "being held by two
>>> owners" would indicate a violation of the lock itself. Or is it that
>>> there are two threads blocked waiting to acquire the mutex?
>> I think it’s actually this, 2 threads are waiting on acquiring the lock.
> OK, that's good news!
>
>> There was a theory that it was a lock underflow that allowed 2 threads
>> to acquire the lock that lead to this patch.
>>
>>> For NVMe-MI, you're likely using manual tag allocation, where the tag
>>> allocation (and hence flow state) is entirely controlled by userspace.
>>> It may be that the NVMe protocol-level errors are causing that tags to
>>> be held for long durations, perhaps?
>> Yeah, this is very plausible given the device(s) stop responding
>> correctly. I imagine we are getting stuck with manual allocations and
>> not releasing locks. Can we reset the state machine back to NEW instead
>> of holding the lock?
> Not sure what you're referring to here; if the userspace application is
> not releasing the tag, we have to keep the i2c bus locked, otherwise we
> may not receive a response from the device.

Isn't this inherently an approach asking for trouble, where a 
potentially buggy userspace can starve out other applications which need 
to access the bus. For us we have FRU devices on the that are 
periodically rescanned or accessed for various reasons alongside the 
MCTP endpoint on the NVME device.

> The one case I can think of (in upstream infrastructure, at least) is
> that this might be triggered by the device reporting a long MPRT value,
> and then a response gets lost. libnvme is respecting the MPRT, and not
> releasing the tag for that (excessive) duration.
Yeah, I'll have to look at the specific firmware bug more but I don't 
think it's been oot caused it fully yet.
> However, the tag -> i2c lock associations are only useful if you have
> muxes in the i2c topology. Is that the case on your platform? If not,
> perhaps we could elide all the bus locking when we can detect that...

We have at least 1 layer of mux before each NVME and FRU device.

> Cheers,
>
>
> Jeremy

^ permalink raw reply

* [PATCH v3] net: net_failover: Fix the deadlock in slave register
From: faicker.mo @ 2026-05-07  7:43 UTC (permalink / raw)
  To: faicker.mo
  Cc: kuba, Sridhar Samudrala, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Stanislav Fomichev,
	netdev, linux-kernel

From: Faicker Mo <faicker.mo@gmail.com>

There is netdev_lock_ops() before the NETDEV_REGISTER notifier
in register_netdevice(), so use the non-locking functions
in net_failover_slave_register().

Call Trace:
 <TASK>
 __schedule+0x30d/0x7a0
 schedule+0x27/0x90
 schedule_preempt_disabled+0x15/0x30
 __mutex_lock.constprop.0+0x538/0x9e0
 __mutex_lock_slowpath+0x13/0x20
 mutex_lock+0x3b/0x50
 dev_set_mtu+0x40/0xe0
 net_failover_slave_register+0x24/0x280
 failover_slave_register+0x103/0x1b0
 failover_event+0x15e/0x210
 ? dropmon_net_event+0xac/0xe0
 notifier_call_chain+0x5e/0xe0
 raw_notifier_call_chain+0x16/0x30
 call_netdevice_notifiers_info+0x52/0xa0
 register_netdevice+0x5f4/0x7c0
 register_netdev+0x1e/0x40
 _mlx5e_probe+0xe2/0x370 [mlx5_core]
 mlx5e_probe+0x59/0x70 [mlx5_core]
 ? __pfx_mlx5e_probe+0x10/0x10 [mlx5_core]

Fixes: 4c975fd70002 ("net: hold instance lock during NETDEV_REGISTER/UP")
Signed-off-by: Faicker Mo <faicker.mo@gmail.com>
---
Changes since v1:
  - Fix the space chars (Simon)
  - Change the dev_close to netif_close (Simon)
  - Change the label err_dev_open to err_netif_open
Changes since v2:
  - Add lock ops in failover_existing_slave_register (Jakub Kicinski)
---
 drivers/net/net_failover.c | 12 ++++++------
 net/core/failover.c        |  5 ++++-
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/net/net_failover.c b/drivers/net/net_failover.c
index d0361aaf25ef..3f7d31033bae 100644
--- a/drivers/net/net_failover.c
+++ b/drivers/net/net_failover.c
@@ -502,7 +502,7 @@ static int net_failover_slave_register(struct net_device *slave_dev,
 
 	/* Align MTU of slave with failover dev */
 	orig_mtu = slave_dev->mtu;
-	err = dev_set_mtu(slave_dev, failover_dev->mtu);
+	err = netif_set_mtu(slave_dev, failover_dev->mtu);
 	if (err) {
 		netdev_err(failover_dev, "unable to change mtu of %s to %u register failed\n",
 			   slave_dev->name, failover_dev->mtu);
@@ -512,11 +512,11 @@ static int net_failover_slave_register(struct net_device *slave_dev,
 	dev_hold(slave_dev);
 
 	if (netif_running(failover_dev)) {
-		err = dev_open(slave_dev, NULL);
+		err = netif_open(slave_dev, NULL);
 		if (err && (err != -EBUSY)) {
 			netdev_err(failover_dev, "Opening slave %s failed err:%d\n",
 				   slave_dev->name, err);
-			goto err_dev_open;
+			goto err_netif_open;
 		}
 	}
 
@@ -562,10 +562,10 @@ static int net_failover_slave_register(struct net_device *slave_dev,
 err_vlan_add:
 	dev_uc_unsync(slave_dev, failover_dev);
 	dev_mc_unsync(slave_dev, failover_dev);
-	dev_close(slave_dev);
-err_dev_open:
+	netif_close(slave_dev);
+err_netif_open:
 	dev_put(slave_dev);
-	dev_set_mtu(slave_dev, orig_mtu);
+	netif_set_mtu(slave_dev, orig_mtu);
 done:
 	return err;
 }
diff --git a/net/core/failover.c b/net/core/failover.c
index 11bb183c7a1b..122fbcbf915a 100644
--- a/net/core/failover.c
+++ b/net/core/failover.c
@@ -221,8 +221,11 @@ failover_existing_slave_register(struct net_device *failover_dev)
 	for_each_netdev(net, dev) {
 		if (netif_is_failover(dev))
 			continue;
-		if (ether_addr_equal(failover_dev->perm_addr, dev->perm_addr))
+		if (ether_addr_equal(failover_dev->perm_addr, dev->perm_addr)) {
+			netdev_lock_ops(dev);
 			failover_slave_register(dev);
+			netdev_unlock_ops(dev);
+		}
 	}
 	rtnl_unlock();
 }
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH net-next 2/3] ppp: unify two channel structs
From: Sebastian Andrzej Siewior @ 2026-05-07  7:40 UTC (permalink / raw)
  To: Qingfang Deng
  Cc: Paolo Abeni, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Jiri Kosina, David Sterba, Greg Kroah-Hartman,
	Jiri Slaby, Mitchell Blank Jr, Chas Williams, Simon Horman,
	James Chapman, Kees Cook, Taegu Ha, Guillaume Nault,
	Eric Woudstra, Arnd Bergmann, Dawid Osuchowski, Breno Leitao,
	linux-ppp, netdev, linux-kernel, linux-serial, linux-atm-general
In-Reply-To: <c9993ee6-4023-4331-a1c1-4e30952146fe@linux.dev>

On 2026-05-07 13:53:30 [+0800], Qingfang Deng wrote:
> > This patch is IMHO a bit too big and should be split. Also this kind of
> > refactor looks very invasive and potentially regression prone. I think
> > it should include a signficant self-test coverage increase.
> This is indeed too big. But how do I split it without breaking the build?

The current ppp tests would yell if you accidentally broke something?

Sebastian

^ permalink raw reply

* Re: [PATCH net-next v2] declance: Remove IRQF_ONESHOT
From: Sebastian Andrzej Siewior @ 2026-05-07  7:34 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: netdev, linux-mips, Jakub Kicinski, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni
In-Reply-To: <alpine.DEB.2.21.2605060040230.46195@angie.orcam.me.uk>

On 2026-05-06 10:25:02 [+0100], Maciej W. Rozycki wrote:
> On Tue, 5 May 2026, Sebastian Andrzej Siewior wrote:
> 
> > I'm not if sure if you may need to change the primary handler if the
> > interrupt flow is EOI and cascading based on what you wrote. If you have
> > access to the HW then you it should be easy to test given the
> > `threadirqs' argument should expose problems.
> 
>  The interrupt is exceedingly rare, I've only seen it actually fire maybe 
> a dozen times across all my systems in 25+ years.  It happens when there 
> is a memory read error on DMA, such as an uncorrected ECC or parity error 
> (depending on the system variant), or a bus timeout.

I assumed you have other interrupts on that hw, cascaded/ operating the
same way. But otherwise…

>  It should be possible to orchestrate it, such as by making the LANCE DMA 
> pointer register refer an unpopulated location in the system address map; 
> memory ECC errors can be induced too by the DRAM controller's diagnostic 
> feature.  It seems enough hassle though I'd rather get things right by the 
> spec.

Oh, yeah. That should do it.

>  Thanks for the hint as to the `threadirqs' facility though, it may come 
> up helpful sometime.

:)

>   Maciej

Sebastian

^ permalink raw reply

* Re: [PATCH net-next 2/3] ppp: unify two channel structs
From: Paolo Abeni @ 2026-05-07  7:32 UTC (permalink / raw)
  To: Qingfang Deng, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Jiri Kosina, David Sterba, Greg Kroah-Hartman,
	Jiri Slaby, Mitchell Blank Jr, Chas Williams, Simon Horman,
	James Chapman, Kees Cook, Sebastian Andrzej Siewior, Taegu Ha,
	Guillaume Nault, Eric Woudstra, Arnd Bergmann, Dawid Osuchowski,
	Breno Leitao, linux-ppp, netdev, linux-kernel, linux-serial,
	linux-atm-general
In-Reply-To: <c9993ee6-4023-4331-a1c1-4e30952146fe@linux.dev>

On 5/7/26 7:53 AM, Qingfang Deng wrote:
> On 2026/5/5 19:16, Paolo Abeni wrote:
>> On 4/30/26 11:05 AM, Qingfang Deng wrote:
>>> Historically, PPP maintained two separate structures for a channel:
>>> 'struct channel' was internal to ppp_generic.c, while 'struct ppp_channel'
>>> was the public interface that drivers were required to embed. This
>>> duplication was redundant and forced drivers to manage the lifecycle of
>>> the public structure.
>>>
>>> Unify these two structures into a single 'struct ppp_channel', which is
>>> now internal to ppp_generic.c. Drivers now use a 'ppp_channel_conf'
>>> structure to specify registration parameters and receive an opaque
>>> pointer to the allocated channel.
>>>
>>> Key changes:
>>> - ppp_register_channel() and ppp_register_net_channel() now return
>>>    a 'struct ppp_channel *' instead of taking a pointer to a driver-
>>>    embedded structure.
>>> - 'struct ppp_channel_ops' methods now take the driver's 'private'
>>>    pointer directly as their first argument, simplifying driver logic.
>>> - ppp_unregister_channel() now takes the opaque pointer.
>>> - Multilink-specific fields are unified and handled via the new
>>>    configuration structure.
>>>
>>> This cleanup simplifies the driver interface and makes the channel
>>> lifecycle management more robust by centralizing allocation in the PPP
>>> generic layer.
>>>
>>> Assisted-by: Gemini:gemini-3-flash
>>> Signed-off-by: Qingfang Deng <qingfang.deng@linux.dev>
>>> ---
>>>   drivers/net/ppp/ppp_async.c      |  51 +++++-----
>>>   drivers/net/ppp/ppp_generic.c    | 161 +++++++++++++++----------------
>>>   drivers/net/ppp/ppp_synctty.c    |  51 +++++-----
>>>   drivers/net/ppp/pppoe.c          |  34 ++++---
>>>   drivers/net/ppp/pppox.c          |   4 +-
>>>   drivers/net/ppp/pptp.c           |  40 ++++----
>>>   drivers/tty/ipwireless/network.c |  30 +++---
>>>   include/linux/if_pppox.h         |   2 +-
>>>   include/linux/ppp_channel.h      |  49 ++++++----
>>>   net/atm/pppoatm.c                |  61 ++++++------
>>>   net/l2tp/l2tp_ppp.c              |  34 ++++---
>>>   11 files changed, 271 insertions(+), 246 deletions(-)
>> This patch is IMHO a bit too big and should be split. Also this kind of
>> refactor looks very invasive and potentially regression prone. I think
>> it should include a signficant self-test coverage increase.
> This is indeed too big. But how do I split it without breaking the build?

This is indeed a good question, but I'm really unable to give you a good
answer without allocating to this topic much more time than I have
available.

I think that the (indeed smallish) mtu changes could easily go in a
separate patch.

You could try introducing the struct and/or variables renaming
separately, with no actual functional change, i.e.

- one patch to rename ppp_channel -> ppp_channel_conf
- one patch to rename channel -> ppp_channel
(possibly adjust accordingly the variables name if can done mechanically)

no idea if the end result would be more palatable, but possibly worth a try.

/P


^ permalink raw reply

* [PATCH] mptcp: serialize subflow->closing with RX path
From: Kalpan Jani @ 2026-05-07  7:28 UTC (permalink / raw)
  To: matttbe, martineau, mptcp, netdev, linux-kernel
  Cc: shardul.b, janak, kalpanjani009, shardulsb08, Kalpan Jani

There is a race between mptcp_data_ready() (RX path) and
mptcp_close_ssk() (teardown path) when accessing subflow->closing.

Currently, mptcp_data_ready() checks subflow->closing before acquiring
mptcp_data_lock(), while mptcp_close_ssk() may concurrently set
subflow->closing and purge backlog entries. This creates a classic
time-of-check vs time-of-use (TOCTOU) race:

  CPU A (close path)              CPU B (RX path)
  ----------------------         -------------------------
  set closing = 1
                                 read closing == 0
  purge backlog
                                 enqueue skb to backlog

As a result, skb entries referencing the subflow socket (ssk) may be
enqueued after the subflow is marked closing and scheduled for cleanup.
This can lead to:

  - WARN in inet_sock_destruct() due to non-zero sk_rmem_alloc
  - potential use-after-free via stale skb->sk references

Fix this by serializing both the closing check and backlog enqueue
under mptcp_data_lock(). This ensures that subflow->closing state and
backlog operations are observed atomically, preventing new skb from
being enqueued once teardown begins.

Also protect backlog cleanup in mptcp_close_ssk() with the same lock
to guarantee mutual exclusion with the RX path.

This restores proper synchronization between RX and teardown paths
and prevents stale skb references to closing subflows.

Signed-off-by: Kalpan Jani <kalpan.jani@mpiricsoftware.com>
---
 net/mptcp/protocol.c | 31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 718e910ff..295f8e1c0 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -910,14 +910,34 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
 	struct mptcp_sock *msk = mptcp_sk(sk);
 
+	/*
+	 * The close path can set subflow->closing while we are racing
+	 * from BH context here. The old check was done before taking
+	 * mptcp_data_lock(), leaving a TOCTOU window:
+	 *
+	 *   CPU A: close path sets closing = 1 and purges backlog
+	 *   CPU B: already observed closing == 0 and later enqueues skb
+	 *
+	 * That skb keeps skb->sk == ssk and can later trigger:
+	 * - WARN in inet_sock_destruct() (ssk->sk_rmem_alloc != 0)
+	 * - UAF in backlog purge via stale skb->sk
+	 */
+
 	/* The peer can send data while we are shutting down this
 	 * subflow at subflow destruction time, but we must avoid enqueuing
 	 * more data to the msk receive queue
 	 */
-	if (unlikely(subflow->closing))
-		return;
 
 	mptcp_data_lock(sk);
+
+	/* Serialize closing check with backlog enqueue */
+	if (unlikely(subflow->closing)) {
+		mptcp_data_unlock(sk);
+		return;
+	}
+
 	mptcp_rcv_rtt_update(msk, subflow);
 	if (!sock_owned_by_user(sk)) {
 		/* Wake-up the reader only for in-sequence data */
@@ -2653,9 +2673,12 @@ void mptcp_close_ssk(struct sock *sk, struct sock *ssk,
 	if (sk->sk_state == TCP_ESTABLISHED)
 		mptcp_event(MPTCP_EVENT_SUB_CLOSED, mptcp_sk(sk), ssk, GFP_KERNEL);
 
-	/* Remove any reference from the backlog to this ssk; backlog skbs consume
+	/* Remove any reference from the backlog to this ssk.
+	 * Serialize cleanup with RX-side enqueue using mptcp_data_lock().
+	 * Backlog skbs consume
 	 * space in the msk receive queue, no need to touch sk->sk_rmem_alloc
 	 */
+	mptcp_data_lock(sk);
 	list_for_each_entry(skb, &msk->backlog_list, list) {
 		if (skb->sk != ssk)
 			continue;
@@ -2663,6 +2686,8 @@ void mptcp_close_ssk(struct sock *sk, struct sock *ssk,
 		atomic_sub(skb->truesize, &skb->sk->sk_rmem_alloc);
 		skb->sk = NULL;
 	}
+	mptcp_data_unlock(sk);
+
 
 	/* subflow aborted before reaching the fully_established status
 	 * attempt the creation of the next subflow
-- 
2.43.0

^ permalink raw reply related

* [PATCH v2 1/7] perf trace: Sync linux/socket.h with the kernel source
From: Namhyung Kim @ 2026-05-07  7:26 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Ian Rogers, James Clark
  Cc: Jiri Olsa, Adrian Hunter, Peter Zijlstra, Ingo Molnar, LKML,
	linux-perf-users, Ravi Bangoria, netdev
In-Reply-To: <20260507072632.37152-1-namhyung@kernel.org>

To pick up changes from:

 c66e0f453d1afa82 ("net: use ktime_t in struct scm_timestamping_internal")

This would be used to beautify networking syscall arguments and not to
affect builds of other tools (e.g. objtool).

Please see tools/include/uapi/README.

Cc: netdev@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/trace/beauty/include/linux/socket.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/trace/beauty/include/linux/socket.h b/tools/perf/trace/beauty/include/linux/socket.h
index ec715ad4bf25f5f7..ec4a0a0257939a53 100644
--- a/tools/perf/trace/beauty/include/linux/socket.h
+++ b/tools/perf/trace/beauty/include/linux/socket.h
@@ -415,7 +415,7 @@ struct __kernel_timespec;
 struct old_timespec32;
 
 struct scm_timestamping_internal {
-	struct timespec64 ts[3];
+	ktime_t ts[3];
 };
 
 extern void put_cmsg_scm_timestamping64(struct msghdr *msg, struct scm_timestamping_internal *tss);
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH net-next v3 1/4] net: eth: fbnic: Fix addr validation in pcs write
From: Paolo Abeni @ 2026-05-07  7:20 UTC (permalink / raw)
  To: Jakub Kicinski, mike.marciniszyn
  Cc: Alexander Duyck, kernel-team, Andrew Lunn, David S. Miller,
	Eric Dumazet, Heiner Kallweit, Russell King, Jacob Keller,
	Mohsin Bashir, Simon Horman, Lee Trager, Andrew Lunn, netdev,
	linux-kernel
In-Reply-To: <20260506185819.1c68a71b@kernel.org>

On 5/7/26 3:58 AM, Jakub Kicinski wrote:
> On Mon,  4 May 2026 09:58:12 -0400 mike.marciniszyn@gmail.com wrote:
>> From: "Mike Marciniszyn (Meta)" <mike.marciniszyn@gmail.com>
>>
>> The DW IP has two distinct PCS address ranges cooresponding
>> to the C45 PCS registers.
>>
>> The shim translates the PCS addr/regno into specific CSR writes
>> into one of those two zero-relative.
>>
>> This patch fixes a one off in the test that could allow an invalid
>> CSR write if an addr == 2 was called.
>>
>> This patch contains a fix for addr validation in fbnic_mdio_write_pcs()
>> to only return actual CSR reads for addr 0 and 1.
>>
>> There are as of yet, no real impact for the bug as no PCS writes are
>> not yet present.
> 
> Hi Paolo! Was there a reason / do you recall why this was not applied?
> (I dropped it from patchwork now. If the omission was accidental it has
> to be reposted)

Darn, limited capacity here plus re-submission glitch: v3 had a slightly
different cover title (due to typo) WRT v2 so PW did not mark v2 as
superseded. I process patches via PW in sequence, when I reached v2 I
considered the sashiko comment not blocking and I apply it. I was unable
to reach v3 until now.

TL;DR: @Mike: please re-submit 1/4 and double check there are not other
differences between v2 and v3 - otherwise more patches needed. Also
please ensure you keep the series title consistent among revision, or at
least manually remove old revisions from PW upon resubmission.

Thanks,

Paolo


^ permalink raw reply

* Re: [PATCH net 01/12] net: shaper: drop redundant xa_lock() bracketing
From: Paolo Abeni @ 2026-05-07  7:10 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, andrew+netdev, horms, shuah,
	linux-kselftest
In-Reply-To: <20260506153348.1c278f4f@kernel.org>

On 5/7/26 12:33 AM, Jakub Kicinski wrote:
> On Wed, 6 May 2026 17:30:32 +0200 Paolo Abeni wrote:
>> On 5/6/26 2:06 AM, Jakub Kicinski wrote:
>>> The shaper insertion code is written in a way that suggests that
>>> perhaps it was expecting readers to be fenced off by xa_lock.  
>>
>> FTR, the code was shaped (pun intended :) that way to avoid acquiring
>> multiple times the xa_lock when not needed. Sort of evil early optimization.
> 
> Ah, that makes more sense, I guess. Do you prefer me to drop this patch?
> The XA lock is guaranteed not to be contended since we're under the
> netdev mutex..

Sorry, I should have been more clear: I'm fine with the patch, just
giving more context, as far as I remembered it.

/P


^ permalink raw reply

* Re: [PATCH v5 net-next 3/3] selftests:net: Implement ptp4l sync test using netdevsim
From: Maciek Machnikowski @ 2026-05-07  7:02 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, richardcochran, milena.olech, willemdebruijn.kernel,
	andrew, vadim.fedorenko, horms
In-Reply-To: <20260506150725.14c82e23@kernel.org>



On 07/05/2026 00:07, Jakub Kicinski wrote:
> On Wed, 6 May 2026 13:18:01 +0200 Maciek Machnikowski wrote:
>> On 06/05/2026 02:23, Jakub Kicinski wrote:
>>> On Tue, 5 May 2026 17:22:34 -0700 Jakub Kicinski wrote:  
>>>> You have to add it to the relevant config.
>>>>
>>>> Please read: github.com/linux-netdev/nipa/wiki/Netdev-CI-system  
>>>
>>> Sorry, wrong link, this:
>>>
>>> https://github.com/linux-netdev/nipa/wiki/How-to-run-netdev-selftests-CI-style  
>> I did follow the guid, ran the test on the vng environment and it still
>> passed the test. Is there more debuggability on the CI system? Can you
>> share the .config file it generated while running the test? Or the
>> commands it tries to run so I can verify them locally?
>>
>>
>> The commands I ran were:
>>
>> make mrproper
>> vng --build  --config tools/testing/selftests/drivers/net/netdevsim/config
>> vng -v --run . --user root --cpus 4 -- tools/testing/selftests/net/ptp.py
>>
>> The tests were done on the ARM64 Fedora 44 with the default kernel
>> 6.19.14-300.fc44.aarch64 running inside the Parallels Desktop VM.
> 
> https://netdev-ctrl.bots.linux.dev/logs/vmksft/net/results/631241/config

The PTP_1588_CLOCK_MOCK is not set in that one. Seems that options from
tools/testing/selftests/drivers/net/netdevsim/config are not correctly
populated in it (CONFIG_NET_SCH_MQPRIO, CONFIG_NET_SCH_MULTIQ are also
not set, CONFIG_PSAMPLE is set to m, not y)

^ permalink raw reply

* [PATCH net-next v3] net: stmmac: Use interrupt mode INTM=1 for per channel irq
From: muhammad.nazim.amirul.nazle.asmade @ 2026-05-07  7:01 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuba, pabeni, edumazet, andrew+netdev, linux-kernel

From: Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com>

commit 6ccf12ae111e ("net: stmmac: use interrupt mode INTM=1 for
multi-MSI") introduced INTM=1 interrupt mode for platforms using MSI.

Apply a similar approach to enable per-channel interrupts using shared
peripheral interrupt (SPI), so that only per-channel TX and RX
interrupts (TI/RI) are handled by the TX/RX ISR without invoking the
common interrupt ISR.

Signed-off-by: Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com>
---
Changes in v3:
- Rebased and reposted on the net-next tree.

Changes in v2:
- Rename macros to use XGMAC_ prefix to match dwxgmac2.h convention.
- Drop DMA_MODE_INTM_SHIFT and use FIELD_PREP() instead.
- Wire up multi_irq_en via STMMAC_FLAG_MULTI_IRQ_EN in stmmac_main.c
  so the feature is reachable by platform drivers.
- Drop unused plat_stmmacenet_data fields (ext_snapshot_num,
  int_snapshot_en, ext_snapshot_en, multi_msi_en, multi_irq_en)
  which duplicate existing flags bits.
- Remove misleading commit message paragraph about ISR decoupling
  (dwxgmac2_dma_interrupt() already handles TI/RI independently of NIS).

 drivers/net/ethernet/stmicro/stmmac/dwxgmac2.h     | 2 ++
 drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c | 9 +++++++++
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c  | 2 ++
 include/linux/stmmac.h                             | 2 ++
 4 files changed, 15 insertions(+)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2.h b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2.h
index 51943705a2b0..544541e0e2a5 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2.h
+++ b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2.h
@@ -320,6 +320,8 @@
 /* DMA Registers */
 #define XGMAC_DMA_MODE			0x00003000
 #define XGMAC_SWR			BIT(0)
+#define XGMAC_DMA_MODE_INTM_MASK	GENMASK(13, 12)
+#define XGMAC_DMA_MODE_INTM_MODE1	0x1
 #define XGMAC_DMA_SYSBUS_MODE		0x00003004
 #define XGMAC_WR_OSR_LMT		GENMASK(29, 24)
 #define XGMAC_RD_OSR_LMT		GENMASK(21, 16)
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c
index 03437f1cf3df..59fe488933d3 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c
@@ -4,6 +4,7 @@
  * stmmac XGMAC support.
  */
 
+#include <linux/bitfield.h>
 #include <linux/iopoll.h>
 #include "stmmac.h"
 #include "dwxgmac2.h"
@@ -31,6 +32,14 @@ static void dwxgmac2_dma_init(void __iomem *ioaddr,
 		value |= XGMAC_EAME;
 
 	writel(value, ioaddr + XGMAC_DMA_SYSBUS_MODE);
+
+	if (dma_cfg->multi_irq_en) {
+		value = readl(ioaddr + XGMAC_DMA_MODE);
+		value &= ~XGMAC_DMA_MODE_INTM_MASK;
+		value |= FIELD_PREP(XGMAC_DMA_MODE_INTM_MASK,
+				    XGMAC_DMA_MODE_INTM_MODE1);
+		writel(value, ioaddr + XGMAC_DMA_MODE);
+	}
 }
 
 static void dwxgmac2_dma_init_chan(struct stmmac_priv *priv,
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 3591755ea30b..531f9d7bf2b9 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -7836,6 +7836,8 @@ static int __stmmac_dvr_probe(struct device *device,
 	priv->dev->base_addr = (unsigned long)res->addr;
 	priv->plat->dma_cfg->multi_msi_en =
 		(priv->plat->flags & STMMAC_FLAG_MULTI_MSI_EN);
+	priv->plat->dma_cfg->multi_irq_en =
+		(priv->plat->flags & STMMAC_FLAG_MULTI_IRQ_EN);
 
 	priv->dev->irq = res->irq;
 	priv->wol_irq = res->wol_irq;
diff --git a/include/linux/stmmac.h b/include/linux/stmmac.h
index 4430b967abde..cdb983e49856 100644
--- a/include/linux/stmmac.h
+++ b/include/linux/stmmac.h
@@ -122,6 +122,7 @@ struct stmmac_dma_cfg {
 	bool eame;
 	/* multi_msi_en: stmmac core internal */
 	bool multi_msi_en;
+	bool multi_irq_en;
 	/* atds: stmmac core internal */
 	bool atds;
 };
@@ -211,6 +212,7 @@ enum dwmac_core_type {
 #define STMMAC_FLAG_HWTSTAMP_CORRECT_LATENCY	BIT(14)
 #define STMMAC_FLAG_KEEP_PREAMBLE_BEFORE_SFD	BIT(15)
 #define STMMAC_FLAG_SERDES_SUPPORTS_2500M	BIT(16)
+#define STMMAC_FLAG_MULTI_IRQ_EN		BIT(17)
 
 struct mac_device_info;
 
-- 
2.43.7


^ permalink raw reply related

* Re: [PATCH net v4 3/4] bonding: 3ad: fix mux port state on oper down
From: Jay Vosburgh @ 2026-05-07  6:58 UTC (permalink / raw)
  To: Louis Scalbert
  Cc: netdev, stephen, andrew+netdev, edumazet, kuba, pabeni, fbl, andy,
	shemminger, maheshb
In-Reply-To: <CAJ5u_OdCjZJLYr8OhfskTxavuW83dYSFx1_bTto9wezTn60nfA@mail.gmail.com>

Louis Scalbert <louis.scalbert@6wind.com> wrote:

>Hello Jay,
>
>Sorry for the late reply. I’ve been busy with another project these
>past few days.
>
>Le jeu. 23 avr. 2026 à 22:00, Jay Vosburgh <jv@jvosburgh.net> a écrit :
>>
>> Louis Scalbert <louis.scalbert@6wind.com> wrote:
>>
>> >When the bonding interface has carrier down due to the absence of
>> >usable slaves and a slave transitions from down to up, the bonding
>> >interface briefly goes carrier up, then down again, and finally up
>> >once LACP negotiates collecting and distributing on the port.
>> >
>> >When lacp_strict mode is on, the interface should not transition to
>> >carrier up until LACP negotiation is complete.
>> >
>> >This happens because the actor and partner port states remain in
>> >Collecting_Distributing when the port goes down. When the port
>> >comes back up, it temporarily remains in this state until LACP
>> >renegotiation occurs.
>> >
>> >Previously this was mostly cosmetic, but since the bonding carrier
>> >state may depend on the LACP negotiation state, it causes the
>> >interface to flap.
>> >
>> >Move an operationally down port to the Mux WAITING state and clear the
>> >Synchronization, Collecting, and Distributing states, in accordance with
>> >the 802.1AX Mux state machine diagram.
>> >
>> >Fixes: 655f8919d549 ("bonding: add min links parameter to 802.3ad")
>> >Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
>> >---
>> > drivers/net/bonding/bond_3ad.c | 7 +++++++
>> > 1 file changed, 7 insertions(+)
>> >
>> >diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
>> >index 9cf064243d58..bc2964ea11f5 100644
>> >--- a/drivers/net/bonding/bond_3ad.c
>> >+++ b/drivers/net/bonding/bond_3ad.c
>> >@@ -1053,6 +1053,8 @@ static void ad_mux_machine(struct port *port, bool *update_slave_arr)
>> >
>> >       if (port->sm_vars & AD_PORT_BEGIN) {
>> >               port->sm_mux_state = AD_MUX_DETACHED;
>> >+      } else if (!port->is_enabled && port->sm_mux_state != AD_MUX_DETACHED) {
>> >+              port->sm_mux_state = AD_MUX_WAITING;
>>
>>         Technically, this is not exactly following the state machines.
>>
>>         The mux machine should transition to WAITING from DETACHED when
>> Selected == SELECTED or STANDBY, not for !is_enabled ("port_enabled" in
>> the standard).
>
>The MUX machine still transitions from DETACHED to WAITING; this
>happens a few lines later and is unchanged.
>
>The relevant code is:
>
>} else if (!port->is_enabled && port->sm_mux_state != AD_MUX_DETACHED) {
>    port->sm_mux_state = AD_MUX_WAITING;
>} else {
>     switch (port->sm_mux_state) {
>     case AD_MUX_DETACHED:
>          if ((port->sm_vars & AD_PORT_SELECTED)
>              || (port->sm_vars & AD_PORT_STANDBY))
>          /* if SELECTED or STANDBY */
>          port->sm_mux_state = AD_MUX_WAITING;
>          break;
>
>My change is for NOT is_enabled AND NOT AD_MUX_DETACHED.
>
>> The check for !is_enabled happens in the receive
>> machine, and it would transition to PORT_DISABLED state (which clears
>> Synchronization).
>
>I agree with that statement. However, clearing Synchronization is not
>sufficient: the purpose of the fix is to clear Collecting and
>Distributing.
>
>I noticed that is_enabled is defined in the 802.3ad-2000 standard, while
>Port_Operational is defined in 802.1AX-2020. I am not sure about
>802.1AX-2014, as I do not have access to that version.

	Did you mean 802.1AX-2014 instead of 802.3ad-2000 here?

>The 802.1AX-2020 standard uses a different MUX state diagram. Compared to
>802.3ad, the ATTACHED state has been split into ATTACH and ATTACHED.
>There is no longer a WAITING state; it has been replaced by
>ATTACHED_WTR.
>
>The new standard says that the MUX machine should transition to
>ATTACHED_WTR when Port_Operational is FALSE and the current state is
>not DETACHED.
>
>So, in my opinion, the change is correct, at least with respect to
>802.1AX-2020.

	I'm concerned that mixing versions of the standard will cause
issues, precisely because the state machines vary between the 2014 and
2020 versions.

	As the standards are not explicitly backwards compatible,
bonding should conform to one version, and in my opinion the version
that makes the most sense is the 2014 edition.  Bonding is already very
close to the 2014 version, while compliance with the 2020 version would
require significant changes.

	The 2014 edition is freely available from the IEEE, for example at:

https://1.ieee802.org/tsn/802-1ax-2014/

>>
>>         I'm not sure if this is actually an issue or not; I need to read
>> the relevant bits again to make sure I understand how it's supposed to
>> work.
>
>Please confirm what you want me to do: should I keep the fix as it is ?

	I see that you've reposted the patch series, I'll reply there
later when I've had a chance to study the details.

	-J

>best regards,
>
>Louis Scalbert
>>
>>         -J
>>
>> >       } else {
>> >               switch (port->sm_mux_state) {
>> >               case AD_MUX_DETACHED:
>> >@@ -1200,6 +1202,11 @@ static void ad_mux_machine(struct port *port, bool *update_slave_arr)
>> >                       break;
>> >               case AD_MUX_WAITING:
>> >                       port->sm_mux_timer_counter = __ad_timer_to_ticks(AD_WAIT_WHILE_TIMER, 0);
>> >+                      port->actor_oper_port_state &= ~LACP_STATE_SYNCHRONIZATION;
>> >+                      ad_disable_collecting_distributing(port,
>> >+                                                         update_slave_arr);
>> >+                      port->actor_oper_port_state &= ~LACP_STATE_COLLECTING;
>> >+                      port->actor_oper_port_state &= ~LACP_STATE_DISTRIBUTING;
>> >                       break;
>> >               case AD_MUX_ATTACHED:
>> >                       if (port->aggregator->is_active)
>> >--
>> >2.39.2
>> >
>>
>> ---
>>         -Jay Vosburgh, jv@jvosburgh.net

---
	-Jay Vosburgh, jv@jvosburgh.net

^ permalink raw reply

* Re: [PATCH net-next 1/6] bridge: uapi: Add neigh_forward_grat netlink attributes
From: Ido Schimmel @ 2026-05-07  6:54 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Danielle Ratson, netdev@vger.kernel.org, donald.hunter@gmail.com,
	davem@davemloft.net, edumazet@google.com, pabeni@redhat.com,
	horms@kernel.org, razor@blackwall.org, andrew+netdev@lunn.ch,
	shuah@kernel.org, ast@fiberby.net, liuhangbin@gmail.com,
	daniel@iogearbox.net, Andy Roulin, fmaurer@redhat.com,
	sdf.kernel@gmail.com, sd@queasysnail.net, kees@kernel.org,
	nickgarlis@gmail.com, amorenoz@redhat.com, alasdair@mcwilliam.dev,
	johannes.wiesboeck@aisec.fraunhofer.de, Petr Machata,
	linux-kernel@vger.kernel.org, bridge@lists.linux.dev,
	linux-kselftest@vger.kernel.org
In-Reply-To: <20260506150556.4e540854@kernel.org>

On Wed, May 06, 2026 at 03:05:56PM -0700, Jakub Kicinski wrote:
> > > Related: The AI also did not catch that the spec was missing (easy
> > > to forget for rtnetlink). Do you think it's worth adding to
> > > review-prompts?  
> 
> I assumed Sashiko missed this because it doesn't spent too much time on
> the series as a whole right now (I was going to tweak that but looks
> like Claude is having another meltdown, crazy inference latency,
> patches backlogging for review..)
> 
> Are you saying that the review prompts-based test is also not catching
> this even if you give it the range of commits that form the series?

Yes. I noticed this during the STP mode review:

https://lore.kernel.org/netdev/20260324200051.GA572287@shredder/

Where the entire code changes are in a single patch.

AI review of this patch:

https://sashiko.dev/#/patchset/20260324184942.2828691-1-aroulin%40nvidia.com
https://netdev-ai.bots.linux.dev/ai-review.html?id=d0cd7a7e-a803-45df-8121-555484a45315#patch-0

We can try to add this to the review-prompts and see if it helps.

^ permalink raw reply

* Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
From: Simon Schippers @ 2026-05-07  6:54 UTC (permalink / raw)
  To: hawk, netdev
  Cc: kernel-team, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Stanislav Fomichev, linux-kernel, bpf
In-Reply-To: <20260505132159.241305-4-hawk@kernel.org>

On 5/5/26 15:21, hawk@kernel.org wrote:
> From: Jesper Dangaard Brouer <hawk@kernel.org>
> 
> Commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
> reduce TX drops") gave qdiscs control over veth by returning
> NETDEV_TX_BUSY when the ptr_ring is full (DRV_XOFF).  That commit noted
> a known limitation: the 256-entry ptr_ring sits in front of the qdisc as
> a dark buffer, adding base latency because the qdisc has no visibility
> into how many bytes are already queued there.
> 
> Add BQL support so the qdisc gets feedback and can begin shaping traffic
> before the ring fills.  In testing with fq_codel, BQL reduces ping RTT
> under UDP load from ~6.61ms to ~0.36ms (18x).
> 
> Charge a fixed VETH_BQL_UNIT (1) per packet rather than skb->len, so
> the DQL limit tracks packets-in-flight.  Unlike a physical NIC, veth
> has no link speed -- the ptr_ring drains at CPU speed and is
> packet-indexed, not byte-indexed, so bytes are not the natural unit.
> With byte-based charging, small packets sneak many more entries into
> the ring before STACK_XOFF fires, deepening the dark buffer under
> mixed-size workloads.  Testing with a concurrent min-size packet flood
> shows 3.7x ping RTT degradation with skb->len charging versus no
> change with fixed-unit charging.
> 
> Charge BQL inside veth_xdp_rx() under the ptr_ring producer_lock, after
> confirming the ring is not full.  The charge must precede the produce
> because the NAPI consumer can run on another CPU and complete the SKB
> the instant it becomes visible in the ring.  Doing both under the same
> lock avoids a pre-charge/undo pattern -- BQL is only charged when
> produce is guaranteed to succeed.
> 
> BQL is enabled only when a real qdisc is attached (guarded by
> !qdisc_txq_has_no_queue), as HARD_TX_LOCK provides serialization
> for TXQ modification like dql_queued(). For lltx devices, like veth,
> this HARD_TX_LOCK serialization isn't provided.  The ptr_ring
> producer_lock provides additional serialization that would allow
> BQL to work correctly even with noqueue, though that combination
> is not currently enabled, as the netstack will drop and warn.
> 
> Track per-SKB BQL state via a VETH_BQL_FLAG pointer tag in the ptr_ring
> entry.  This is necessary because the qdisc can be replaced live while
> SKBs are in-flight -- each SKB must carry the charge decision made at
> enqueue time rather than re-checking the peer's qdisc at completion.
> 
> Complete per-SKB in veth_xdp_rcv() rather than in bulk, so STACK_XOFF
> clears promptly when producer and consumer run on different CPUs.
> 
> BQL introduces a second independent queue-stop mechanism (STACK_XOFF)
> alongside the existing DRV_XOFF (ring full).  Both must be clear for
> the queue to transmit.  Reset BQL state in veth_napi_del_range() after
> synchronize_net() to avoid racing with in-flight veth_poll() calls.
> Clamp the reset loop to the peer's real_num_tx_queues, since the peer
> may have fewer TX queues than the local device has RX queues (e.g. when
> veth is enslaved to a bond with XDP attached).
> 
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---
>  drivers/net/veth.c | 86 ++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 75 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 0cfb19b760dd..86b78900c48e 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -34,9 +34,13 @@
>  #define DRV_VERSION	"1.0"
>  
>  #define VETH_XDP_FLAG		BIT(0)
> +#define VETH_BQL_FLAG		BIT(1)
>  #define VETH_RING_SIZE		256
>  #define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
>  
> +/* Fixed BQL charge: DQL limit tracks packets-in-flight, not bytes */
> +#define VETH_BQL_UNIT		1
> +
>  #define VETH_XDP_TX_BULK_SIZE	16
>  #define VETH_XDP_BATCH		16
>  
> @@ -280,6 +284,21 @@ static bool veth_is_xdp_frame(void *ptr)
>  	return (unsigned long)ptr & VETH_XDP_FLAG;
>  }
>  
> +static bool veth_ptr_is_bql(void *ptr)
> +{
> +	return (unsigned long)ptr & VETH_BQL_FLAG;
> +}
> +
> +static struct sk_buff *veth_ptr_to_skb(void *ptr)
> +{
> +	return (void *)((unsigned long)ptr & ~VETH_BQL_FLAG);
> +}
> +
> +static void *veth_skb_to_ptr(struct sk_buff *skb, bool bql)
> +{
> +	return bql ? (void *)((unsigned long)skb | VETH_BQL_FLAG) : skb;
> +}
> +
>  static struct xdp_frame *veth_ptr_to_xdp(void *ptr)
>  {
>  	return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
> @@ -295,7 +314,7 @@ static void veth_ptr_free(void *ptr)
>  	if (veth_is_xdp_frame(ptr))
>  		xdp_return_frame(veth_ptr_to_xdp(ptr));
>  	else
> -		kfree_skb(ptr);
> +		kfree_skb(veth_ptr_to_skb(ptr));
>  }
>  
>  static void __veth_xdp_flush(struct veth_rq *rq)
> @@ -309,19 +328,33 @@ static void __veth_xdp_flush(struct veth_rq *rq)
>  	}
>  }
>  
> -static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
> +static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb, bool do_bql,
> +		       struct netdev_queue *txq)
>  {
> -	if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb)))
> +	struct ptr_ring *ring = &rq->xdp_ring;
> +
> +	spin_lock(&ring->producer_lock);
> +	if (unlikely(!ring->size) || __ptr_ring_full(ring)) {
> +		spin_unlock(&ring->producer_lock);
>  		return NETDEV_TX_BUSY; /* signal qdisc layer */
> +	}
> +
> +	/* BQL charge before produce; consumer cannot see entry yet */
> +	if (do_bql)
> +		netdev_tx_sent_queue(txq, VETH_BQL_UNIT);
> +
> +	__ptr_ring_produce(ring, veth_skb_to_ptr(skb, do_bql));
> +	spin_unlock(&ring->producer_lock);
>  
>  	return NET_RX_SUCCESS; /* same as NETDEV_TX_OK */
>  }
>  
>  static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
> -			    struct veth_rq *rq, bool xdp)
> +			    struct veth_rq *rq, bool xdp, bool do_bql,
> +			    struct netdev_queue *txq)
>  {
>  	return __dev_forward_skb(dev, skb) ?: xdp ?
> -		veth_xdp_rx(rq, skb) :
> +		veth_xdp_rx(rq, skb, do_bql, txq) :
>  		__netif_rx(skb);
>  }
>  
> @@ -348,10 +381,11 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
>  {
>  	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
>  	struct veth_rq *rq = NULL;
> -	struct netdev_queue *txq;
> +	struct netdev_queue *txq = NULL;
>  	struct net_device *rcv;
>  	int length = skb->len;
>  	bool use_napi = false;
> +	bool do_bql = false;
>  	int ret, rxq;
>  
>  	rcu_read_lock();
> @@ -375,8 +409,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
>  	}
>  
>  	skb_tx_timestamp(skb);
> -
> -	ret = veth_forward_skb(rcv, skb, rq, use_napi);
> +	if (rxq < dev->real_num_tx_queues) {
> +		txq = netdev_get_tx_queue(dev, rxq);
> +		/* BQL charge happens inside veth_xdp_rx() under producer_lock */
> +		do_bql = use_napi && !qdisc_txq_has_no_queue(txq);
> +	}
> +	ret = veth_forward_skb(rcv, skb, rq, use_napi, do_bql, txq);
>  	switch (ret) {
>  	case NET_RX_SUCCESS: /* same as NETDEV_TX_OK */
>  		if (!use_napi)
> @@ -412,6 +450,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
>  		net_crit_ratelimited("%s(%s): Invalid return code(%d)",
>  				     __func__, dev->name, ret);
>  	}
> +
>  	rcu_read_unlock();
>  
>  	return ret;
> @@ -900,7 +939,8 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
>  
>  static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>  			struct veth_xdp_tx_bq *bq,
> -			struct veth_stats *stats)
> +			struct veth_stats *stats,
> +			struct netdev_queue *peer_txq)
>  {
>  	int i, done = 0, n_xdpf = 0;
>  	void *xdpf[VETH_XDP_BATCH];
> @@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>  			}
>  		} else {
>  			/* ndo_start_xmit */
> -			struct sk_buff *skb = ptr;
> +			bool bql_charged = veth_ptr_is_bql(ptr);
> +			struct sk_buff *skb = veth_ptr_to_skb(ptr);
>  
>  			stats->xdp_bytes += skb->len;
> +			if (peer_txq && bql_charged)
> +				netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);

In the discussion with Jonas [1], I left a comment explaining why I think
this doesn’t work.

I still think first that adding an option to modify the hard-coded
VETH_RING_SIZE is the way to go.

Thanks!

[1] Link: https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@tu-dortmund.de/

> +
>  			skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
>  			if (skb) {
>  				if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC))
> @@ -976,7 +1020,7 @@ static int veth_poll(struct napi_struct *napi, int budget)
>  		   netdev_get_tx_queue(peer_dev, queue_idx) : NULL;
>  
>  	xdp_set_return_frame_no_direct();
> -	done = veth_xdp_rcv(rq, budget, &bq, &stats);
> +	done = veth_xdp_rcv(rq, budget, &bq, &stats, peer_txq);
>  
>  	if (stats.xdp_redirect > 0)
>  		xdp_do_flush();
> @@ -1074,6 +1118,7 @@ static int __veth_napi_enable(struct net_device *dev)
>  static void veth_napi_del_range(struct net_device *dev, int start, int end)
>  {
>  	struct veth_priv *priv = netdev_priv(dev);
> +	struct net_device *peer;
>  	int i;
>  
>  	for (i = start; i < end; i++) {
> @@ -1092,6 +1137,24 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
>  		ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
>  	}
>  
> +	/* Reset BQL and wake stopped peer txqs.  A concurrent veth_xmit()
> +	 * may have set DRV_XOFF between rcu_assign_pointer(napi, NULL) and
> +	 * synchronize_net(), and NAPI can no longer clear it.
> +	 * Only wake when the device is still up.
> +	 */
> +	peer = rtnl_dereference(priv->peer);
> +	if (peer) {
> +		int peer_end = min_t(int, end, peer->real_num_tx_queues);
> +
> +		for (i = start; i < peer_end; i++) {
> +			struct netdev_queue *txq = netdev_get_tx_queue(peer, i);
> +
> +			netdev_tx_reset_queue(txq);
> +			if (netif_running(dev))
> +				netif_tx_wake_queue(txq);
> +		}
> +	}
> +
>  	for (i = start; i < end; i++) {
>  		page_pool_destroy(priv->rq[i].page_pool);
>  		priv->rq[i].page_pool = NULL;
> @@ -1741,6 +1804,7 @@ static void veth_setup(struct net_device *dev)
>  	dev->priv_flags |= IFF_PHONY_HEADROOM;
>  	dev->priv_flags |= IFF_DISABLE_NETPOLL;
>  	dev->lltx = true;
> +	dev->bql = true;
>  
>  	dev->netdev_ops = &veth_netdev_ops;
>  	dev->xdp_metadata_ops = &veth_xdp_metadata_ops;

^ permalink raw reply

* AW: assert in phylink.c with lan7801 and dp83tc811 since kernel 6.18
From: Sven Schuchmann @ 2026-05-07  6:52 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Maxime Chevallier, netdev@vger.kernel.org
In-Reply-To: <24815161-68ff-4f3e-b0e1-286871b9686a@lunn.ch>

Hi Andew,

thanks for the detailed explanation! Indeed adding  
.get_features   = genphy_c45_pma_read_ext_abilities,
to "dp83tc811.c" makes the network work again:

[    2.253619] usb 1-1.3: new high-speed USB device number 3 using dwc2
[    2.342774] usb 1-1.3: New USB device found, idVendor=0424, idProduct=7801, bcdDevice= 3.00
[    2.342810] usb 1-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[    2.342820] usb 1-1.3: Product: LAN7801
[    2.342828] usb 1-1.3: Manufacturer: Microchip
[    2.342835] usb 1-1.3: SerialNumber: 00800F780100
[    2.353766] lan78xx 1-1.3:1.0 (unnamed net_device) (uninitialized): deferred multicast write 0x00007ca0
[    2.466961] systemd[1]: System time advanced to timestamp on /var/lib/systemd/timesync/clock: Thu 2026-05-07 08:03:39 CEST
[    2.500632] lan78xx 1-1.3:1.0 (unnamed net_device) (uninitialized): registered mdiobus bus usb-001:003
[    2.500675] lan78xx 1-1.3:1.0 (unnamed net_device) (uninitialized): int urb period 64
[    2.500877] lan78xx 1-1.3:1.0 (unnamed net_device) (uninitialized): phydev->irq = 37
[    2.511949] lan78xx 1-1.3:1.0 (unnamed net_device) (uninitialized): PHY usb-001:003:00 doesn't supply possible interfaces
[    2.511990] lan78xx 1-1.3:1.0 (unnamed net_device) (uninitialized): PHY [usb-001:003:00] driver [TI DP83TC811] (irq=37)
[    2.512005] lan78xx 1-1.3:1.0 (unnamed net_device) (uninitialized): phy: rgmii-id setting supported 00000000,00000008,00000000,00006000 advertising 00000000,00000008,00000000,00006000
[...]
[   10.336369] lan78xx 1-1.3:1.0 broadr1: configuring for phy/rgmii-id link mode
[   10.336397] lan78xx 1-1.3:1.0 broadr1: major config, requested phy/rgmii-id
[   10.336422] lan78xx 1-1.3:1.0 broadr1: major config, active phy/none/rgmii-id
[   10.336432] lan78xx 1-1.3:1.0 broadr1: phylink_mac_config: mode=phy/rgmii-id/none adv=00000000,00000000,00000000,00000000 pause=00
[   10.337360] lan78xx 1-1.3:1.0 broadr1: PHY INTR: 0x00020000
[   10.341302] lan78xx 1-1.3:1.0 broadr1: receive multicast hash filter
[   10.341445] lan78xx 1-1.3:1.0 broadr1: receive multicast hash filter
[   10.342409] lan78xx 1-1.3:1.0 broadr1: deferred multicast write 0x00007ca2
[   10.345161] lan78xx 1-1.3:1.0 broadr1: phy link up rgmii-id/100Mbps/Full/none/off/nolpi
[   10.347681] lan78xx 1-1.3:1.0 broadr1: start tx path
[   10.348201] lan78xx 1-1.3:1.0 broadr1: start rx path
[   10.348730] lan78xx 1-1.3:1.0 broadr1: Link is Up - 100Mbps/Full - flow control off

I think it is worth doing a patch on this also?
I am not that experienced in sending kernel patches but I could try...

Regards,

    Sven


________________________________________
Von: Andrew Lunn <andrew@lunn.ch>
Gesendet: Mittwoch, 06. Mai 2026 18:40
An: Sven Schuchmann <schuchmann@schleissheimer.de>
Cc: Maxime Chevallier <maxime.chevallier@bootlin.com>; netdev@vger.kernel.org <netdev@vger.kernel.org>
Betreff: Re: assert in phylink.c with lan7801 and dp83tc811 since kernel 6.18


>        phylink_dbg(pl, "phy: --3.4 supported    0x%x\n", supported);



supported is a pointer, so you cannot get anything useful from

printing its value.



> [    2.771616] lan78xx 1-1.4:1.0 (unnamed net_device) (uninitialized): validation of rgmii-id with support 00000000,00000000,00000000,00006280 and advertisement 00000000,00000000,00000000,00006280 failed: -EINVAL

 

> But I do not get what phylink_is_empty_linkmode() is doing...



The supported value is a combination of a few things. It indicates if

the PHY supports auto negotiation, pause, asym pause, if the link is

using twisted pair, fibre etc. It also lists the speeds the PHY

supports. There is a list here:



https://elixir.bootlin.com/linux/v7.0.1/source/include/uapi/linux/ethtool.h#L1963



Decoding 00006280 we get:



ETHTOOL_LINK_MODE_TP_BIT,

ETHTOOL_LINK_MODE_MII_BIT,

ETHTOOL_LINK_MODE_Pause_BIT,

ETHTOOL_LINK_MODE_Asym_Pause_BIT



So Maxime was correct, it is not listing any link speeds. From what i

understand, it is a 100BaseT1? So you would expect to see:



ETHTOOL_LINK_MODE_100baseT1_Full_BIT



which is bit 67.



phylib asks the PHY what it can do. For BaseT1 that would be

genphy_c45_pma_baset1_read_abilities()



https://elixir.bootlin.com/linux/v7.0.1/source/drivers/net/phy/phy-c45.c#L971



either the PHY has the wrong value in its register, or the function is

not getting called.



The PHY is an dp83tc811?



https://elixir.bootlin.com/linux/v7.0.1/source/drivers/net/phy/dp83tc811.c#L387



It does not set .get_features. However, dp83tg720.c does:



https://elixir.bootlin.com/linux/v7.0.1/source/drivers/net/phy/dp83tg720.c#L654



So try adding:



        .get_features   = genphy_c45_pma_read_ext_abilities,



        Andrew


^ permalink raw reply

* Re: [PATCH net-next v10 4/4] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
From: Simon Schippers @ 2026-05-07  6:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20260506185515-mutt-send-email-mst@kernel.org>

On 5/7/26 00:56, Michael S. Tsirkin wrote:
> On Wed, May 06, 2026 at 06:28:06PM -0400, Michael S. Tsirkin wrote:
>> On Wed, May 06, 2026 at 04:10:33PM +0200, Simon Schippers wrote:
>>> This commit prevents tail-drop when a qdisc is present and the ptr_ring
>>> becomes full. Once an entry is successfully produced and the ptr_ring
>>> reaches capacity, the netdev queue is stopped instead of dropping
>>> subsequent packets. If no qdisc is present, the previous tail-drop
>>> behavior is preserved.
>>>
>>> If producing an entry fails anyways due to a race, tun_net_xmit() drops
>>> the packet. Such races are expected because LLTX is enabled and the
>>> transmit path operates without the usual locking.
>>>
>>> The __tun_wake_queue() function of the consumer races with the producer
>>> for waking/stopping the netdev queue, which could result in a stalled
>>> queue. Therefore, an smp_mb__after_atomic() is introduced that pairs
>>> with the smp_mb() of the consumer. It follows the principle of store
>>> buffering described in tools/memory-model/Documentation/recipes.txt:
>>>
>>> - The producer in tun_net_xmit() first sets __QUEUE_STATE_DRV_XOFF,
>>>   followed by an smp_mb__after_atomic() (= smp_mb()), and then reads the
>>>   ring with __ptr_ring_produce_peek().
>>>
>>> - The consumer in __tun_wake_queue() first writes zero to the ring in
>>>   __ptr_ring_consume(), followed by an smp_mb(), and then reads the queue
>>>   status with netif_tx_queue_stopped().
>>>
>>> => Following the aforementioned principle, it is impossible for the
>>>    producer to see a full ring (and therefore not wake the queue on the
>>>    re-check) while the consumer simultaneously fails to see a stopped
>>>    queue (and therefore also does not wake it).
>>>
>>> Benchmarks:
>>> The benchmarks show a slight regression in raw transmission performance
>>> when using two sending threads. Packet loss also occurs only in the
>>> two-thread sending case; no packet loss was observed with a single
>>> sending thread.
>>>
>>> Test setup:
>>> AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
>>> Average over 50 runs @ 100,000,000 packets. SRSO and spectre v2
>>> mitigations disabled.
>>>
>>> Note for tap+vhost-net:
>>> XDP drop program active in VM -> ~2.5x faster; slower for tap due to
>>> more syscalls (high utilization of entry_SYSRETQ_unsafe_stack in perf)
>>>
>>> +--------------------------+--------------+----------------+----------+
>>> | 1 thread                 | Stock        | Patched with   | diff     |
>>> | sending                  |              | fq_codel qdisc |          |
>>> +------------+-------------+--------------+----------------+----------+
>>> | TAP        | Received    | 1.132 Mpps   | 1.133 Mpps     | +0.1%    |
>>> |            +-------------+--------------+----------------+----------+
>>> |            | Lost/s      | 3.765 Mpps   | 0 pps          |          |
>>> +------------+-------------+--------------+----------------+----------+
>>> | TAP        | Received    | 3.857 Mpps   | 3.905 Mpps     | +1.2%    |
>>> |            +-------------+--------------+----------------+----------+
>>> | +vhost-net | Lost/s      | 0.802 Mpps   | 0 pps          |          |
>>> +------------+-------------+--------------+----------------+----------+
>>>
>>> +--------------------------+--------------+----------------+----------+
>>> | 2 threads                | Stock        | Patched with   | diff     |
>>> | sending                  |              | fq_codel qdisc |          |
>>> +------------+-------------+--------------+----------------+----------+
>>> | TAP        | Received    | 1.115 Mpps   | 1.092 Mpps     | -2.1%    |
>>> |            +-------------+--------------+----------------+----------+
>>> |            | Lost/s      | 8.490 Mpps   | 359 pps        |          |
>>> +------------+-------------+--------------+----------------+----------+
>>> | TAP        | Received    | 3.664 Mpps   | 3.549 Mpps     | -3.1%    |
>>> |            +-------------+--------------+----------------+----------+
>>> | +vhost-net | Lost/s      | 5.330 Mpps   | 832 pps        |          |
>>> +------------+-------------+--------------+----------------+----------+
>>>
>>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>>> ---
>>>  drivers/net/tun.c | 25 +++++++++++++++++++++++--
>>>  1 file changed, 23 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>> index fc358c4c355b..d9ffbf88cfd8 100644
>>> --- a/drivers/net/tun.c
>>> +++ b/drivers/net/tun.c
>>> @@ -1018,6 +1018,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>  	struct netdev_queue *queue;
>>>  	struct tun_file *tfile;
>>>  	int len = skb->len;
>>> +	int ret;
>>>  
>>>  	rcu_read_lock();
>>>  	tfile = rcu_dereference(tun->tfiles[txq]);
>>> @@ -1072,13 +1073,33 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>  
>>>  	nf_reset_ct(skb);
>>>  
>>> -	if (ptr_ring_produce(&tfile->tx_ring, skb)) {
>>> +	queue = netdev_get_tx_queue(dev, txq);
>>> +
>>> +	spin_lock(&tfile->tx_ring.producer_lock);
>>> +	ret = __ptr_ring_produce(&tfile->tx_ring, skb);
>>> +	if (!qdisc_txq_has_no_queue(queue) &&
>>> +	    (__ptr_ring_produce_peek(&tfile->tx_ring) || ret)) {
>>> +		netif_tx_stop_queue(queue);
>>> +		/* Paired with smp_mb() in __tun_wake_queue() */
>>> +		smp_mb__after_atomic();
>>> +		if (!__ptr_ring_produce_peek(&tfile->tx_ring))
>>> +			netif_tx_wake_queue(queue);
>>> +	}
>>> +	spin_unlock(&tfile->tx_ring.producer_lock);
>>> +
>>
>> There's a weird corner case here when tx_queue_len is 0
>> but a qdisc has been configured - it looks like that
>> currently it just drops all packets, with this change,
>> the qdisc will get stuck permanently.
>>
>> I suspect just checking tx_ring.size should fix it.
>> Or if you feel adventurous, change return code for __ptr_ring_produce
>> to distinguish between "no ring" and "no space".
> 
> 
> __ptr_ring_produce_peek really.
> 

Yes, I like the approach of returning this from
__ptr_ring_produce_peek(). Then I will do a switch on the return value
in tun_net_xmit().

Additionally, I should wake up in tun_queue_resize() after calling
ptr_ring_resize_multiple_bh(). For a new dev->tx_queue_len > 0, it
should be fine without waking, but for 0 it is not.

> 
>>
>>> +	if (ret) {
>>> +		/* This should be a rare case if a qdisc is present, but
>>> +		 * can happen due to lltx.
>>> +		 * Since skb_tx_timestamp(), skb_orphan(),
>>> +		 * run_ebpf_filter() and pskb_trim() could have tinkered
>>> +		 * with the SKB, returning NETDEV_TX_BUSY is unsafe and
>>> +		 * we must drop instead.
>>> +		 */
>>>  		drop_reason = SKB_DROP_REASON_FULL_RING;
>>>  		goto drop;
>>>  	}
>>>  
>>>  	/* dev->lltx requires to do our own update of trans_start */
>>> -	queue = netdev_get_tx_queue(dev, txq);
>>>  	txq_trans_cond_update(queue);
>>>  
>>>  	/* Notify and wake up reader process */
>>> -- 
>>> 2.43.0
> 

^ permalink raw reply

* RE: [Patch net-next v1 2/7] r8169: add support for multi rx queues
From: Javen @ 2026-05-07  6:26 UTC (permalink / raw)
  To: Heiner Kallweit, nic_swsd@realtek.com, andrew+netdev@lunn.ch,
	davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com, horms@kernel.org
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <9010eca8-5ae1-4b32-8641-b310aa9e317a@gmail.com>

>On 06.05.2026 10:13, javen wrote:
>> From: Javen Xu <javen_xu@realsil.com.cn>
>>
>> This patch adds support for multi rx queues. RSS requires multi rx
>> queues to receive packets. So we need struct rtl8169_rx_ring for each
>> queue.
>>
>> Signed-off-by: Javen Xu <javen_xu@realsil.com.cn>
>> ---
>>  drivers/net/ethernet/realtek/r8169_main.c | 318
>> +++++++++++++++++-----
>>  1 file changed, 251 insertions(+), 67 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/realtek/r8169_main.c
>> b/drivers/net/ethernet/realtek/r8169_main.c
>> index ef74ee02c117..bc75dbb9901d 100644
>> --- a/drivers/net/ethernet/realtek/r8169_main.c
>> +++ b/drivers/net/ethernet/realtek/r8169_main.c
>> @@ -74,10 +74,11 @@
>>  #define NUM_TX_DESC  256     /* Number of Tx descriptor registers */
>>  #define NUM_RX_DESC  256     /* Number of Rx descriptor registers */
>>  #define R8169_TX_RING_BYTES  (NUM_TX_DESC * sizeof(struct TxDesc))
>> -#define R8169_RX_RING_BYTES  (NUM_RX_DESC * sizeof(struct RxDesc))
>>  #define R8169_TX_STOP_THRS   (MAX_SKB_FRAGS + 1)
>>  #define R8169_TX_START_THRS  (2 * R8169_TX_STOP_THRS)
>> +#define R8169_MAX_RX_QUEUES  8
>>  #define R8169_MAX_MSIX_VEC   32
>> +#define R8127_MAX_RX_QUEUES  8
>
>Why two MAX_RX_QUEUES constants with the same value?

R8169_MAX_RX_QUEUES controls the number of rx queues supported by the driver.
R8127_MAX_RX_QUEUES  controls the number of rx queues that rtl8127 support.

>
>>
>>  #define OCP_STD_PHY_BASE     0xa400
>>
>> @@ -447,6 +448,7 @@ enum rtl8125_registers {
>>       RSS_CTRL_8125           = 0x4500,
>>       Q_NUM_CTRL_8125         = 0x4800,
>>       EEE_TXIDLE_TIMER_8125   = 0x6048,
>> +     RDSAR_Q1_LOW            = 0x4000,
>
>Better sort register ids by register number?

I will re-order it in the next patch.

>
>>  };
>>
>>  #define LEDSEL_MASK_8125     0x23f
>> @@ -731,6 +733,19 @@ enum rtl_dash_type {
>>       RTL_DASH_25_BP,
>>  };
>>
>> +struct rtl8169_rx_ring {
>> +     u32 index;                                      /* Rx queue index */
>> +     u32 cur_rx;                                     /* Index of next Rx pkt. */
>> +     u32 dirty_rx;                                   /* Index for recycling. */
>> +     u32 num_rx_desc;                                /* num of Rx desc */
>> +     struct RxDesc *rx_desc_array;                   /* array of Rx Desc*/
>> +     u32 rx_desc_alloc_size;                         /* memory size per descs of ring */
>> +     dma_addr_t rx_desc_phy_addr[NUM_RX_DESC];       /* Rx data buffer
>physical dma address */
>> +     dma_addr_t rx_phy_addr;                         /* Rx desc physical address */
>> +     struct page *rx_databuff[NUM_RX_DESC];          /* Rx data buffers */
>> +     u16 rdsar_reg;                                  /* Receive Descriptor Start Address */
>> +};
>> +
>>  struct rtl8169_napi {
>>       struct napi_struct napi;
>>       void *priv;
>> @@ -744,6 +759,13 @@ struct rtl8169_irq {
>>       char            name[IFNAMSIZ + 10];
>>  };
>>
>> +enum rx_desc_ring_type {
>> +     RX_DESC_RING_TYPE_UNKNOWN = 0,
>> +     RX_DESC_RING_TYPE_DEFAULT,
>> +     RX_DESC_RING_TYPE_RSS,
>> +     RX_DESC_RING_TYPE_MAX
>
>UNKNOWN and MAX seem to never be used in this series.
>So do we need them?
>

They are not needed. I will remove them.

>> +};
>> +
>>  struct rtl8169_private {
>>       void __iomem *mmio_addr;        /* memory map physical address */
>>       struct pci_dev *pci_dev;
>> @@ -752,28 +774,28 @@ struct rtl8169_private {
>>       struct napi_struct napi;
>>       enum mac_version mac_version;
>>       enum rtl_dash_type dash_type;
>> -     u32 cur_rx; /* Index into the Rx descriptor buffer of next Rx pkt. */
>>       u32 cur_tx; /* Index into the Tx descriptor buffer of next Rx pkt. */
>>       u32 dirty_tx;
>>       struct TxDesc *TxDescArray;     /* 256-aligned Tx descriptor ring */
>> -     struct RxDesc *RxDescArray;     /* 256-aligned Rx descriptor ring */
>>       dma_addr_t TxPhyAddr;
>> -     dma_addr_t RxPhyAddr;
>> -     struct page *Rx_databuff[NUM_RX_DESC];  /* Rx data buffers */
>>       struct ring_info tx_skb[NUM_TX_DESC];   /* Tx data buffers */
>>       struct rtl8169_irq irq_tbl[R8169_MAX_MSIX_VEC];
>>       struct rtl8169_napi r8169napi[R8169_MAX_MSIX_VEC];
>> +     struct rtl8169_rx_ring rx_ring[R8169_MAX_RX_QUEUES];
>>       u16 isr_reg[R8169_MAX_MSIX_VEC];
>>       u16 imr_reg[R8169_MAX_MSIX_VEC];
>>       unsigned int num_rx_rings;
>>       u16 cp_cmd;
>>       u16 tx_lpi_timer;
>>       u32 irq_mask;
>> +     u16 hw_supp_num_rx_queues;
>>       u8 min_irq_nvecs;
>>       u8 max_irq_nvecs;
>>       u8 hw_supp_isr_ver;
>>       u8 hw_curr_isr_ver;
>>       u8 irq_nvecs;
>> +     u8 init_rx_desc_type;
>> +     u8 recheck_desc_ownbit;
>
>This seems to be a flag. Then why type u8?

Change to bool.

>
>>       int irq;
>>       struct clk *clk;
>>
>> @@ -2647,9 +2669,27 @@ static void rtl_init_rxcfg(struct rtl8169_private
>*tp)
>>       }
>>  }
>>
>> +static void rtl8169_rx_desc_init(struct rtl8169_private *tp) {
>> +     for (int i = 0; i < tp->num_rx_rings; i++) {
>> +             struct rtl8169_rx_ring *ring = &tp->rx_ring[i];
>> +
>> +             memset(ring->rx_desc_array, 0x0, ring->rx_desc_alloc_size);
>> +     }
>> +}
>> +
>>  static void rtl8169_init_ring_indexes(struct rtl8169_private *tp)  {
>> -     tp->dirty_tx = tp->cur_tx = tp->cur_rx = 0;
>> +     tp->dirty_tx = 0;
>> +     tp->cur_tx = 0;
>> +
>> +     for (int i = 0; i < tp->hw_supp_num_rx_queues; i++) {
>> +             struct rtl8169_rx_ring *ring = &tp->rx_ring[i];
>> +
>> +             ring->dirty_rx = 0;
>> +             ring->cur_rx = 0;
>> +             ring->index = i;
>> +     }
>>  }
>>
>>  static void rtl_jumbo_config(struct rtl8169_private *tp) @@ -2708,8
>> +2748,18 @@ static void rtl_hw_reset(struct rtl8169_private *tp)
>>       rtl_loop_wait_low(tp, &rtl_chipcmd_cond, 100, 100);  }
>>
>> +static void rtl_set_ring_size(struct rtl8169_private *tp, u32 rx_num)
>> +{
>> +     for (int i = 0; i < tp->hw_supp_num_rx_queues; i++)
>> +             tp->rx_ring[i].num_rx_desc = rx_num; }
>> +
>>  static void rtl_setup_mqs_reg(struct rtl8169_private *tp)  {
>> +     tp->rx_ring[0].rdsar_reg = RxDescAddrLow;
>> +     for (int i = 1; i < tp->hw_supp_num_rx_queues; i++)
>> +             tp->rx_ring[i].rdsar_reg = (u16)(RDSAR_Q1_LOW + (i - 1)
>> + * 8);
>
>This looks like array rx_ring[] isn't actually needed.

I will drop num_rx_desc and rdsar_reg from struct rx_ring, and introduce a helper function to calculate the register.

>
>> +
>>       if (tp->mac_version <= RTL_GIGA_MAC_VER_52) {
>>               tp->isr_reg[0] = IntrStatus;
>>               tp->imr_reg[0] = IntrMask; @@ -2733,17 +2783,21 @@
>> static void rtl_software_parameter_initialize(struct rtl8169_private *tp)
>>       case RTL_GIGA_MAC_VER_80:
>>               tp->min_irq_nvecs = 1;
>>               tp->max_irq_nvecs = 1;
>> +             tp->hw_supp_num_rx_queues = R8127_MAX_RX_QUEUES;
>>               tp->hw_supp_isr_ver = 6;
>>               break;
>>       default:
>>               tp->min_irq_nvecs = 1;
>>               tp->max_irq_nvecs = 1;
>> +             tp->hw_supp_num_rx_queues = 1;
>>               tp->hw_supp_isr_ver = 1;
>>               break;
>>       }
>> +     tp->init_rx_desc_type = RX_DESC_RING_TYPE_DEFAULT;
>>       tp->hw_curr_isr_ver = tp->hw_supp_isr_ver;
>>
>>       rtl_setup_mqs_reg(tp);
>> +     rtl_set_ring_size(tp, NUM_RX_DESC);
>>  }
>>
>>  static void rtl_request_firmware(struct rtl8169_private *tp) @@
>> -2877,8 +2931,13 @@ static void rtl_set_rx_tx_desc_registers(struct
>rtl8169_private *tp)
>>        */
>>       RTL_W32(tp, TxDescStartAddrHigh, ((u64) tp->TxPhyAddr) >> 32);
>>       RTL_W32(tp, TxDescStartAddrLow, ((u64) tp->TxPhyAddr) &
>DMA_BIT_MASK(32));
>> -     RTL_W32(tp, RxDescAddrHigh, ((u64) tp->RxPhyAddr) >> 32);
>> -     RTL_W32(tp, RxDescAddrLow, ((u64) tp->RxPhyAddr) &
>DMA_BIT_MASK(32));
>> +
>> +     for (int i = 0; i < tp->num_rx_rings; i++) {
>> +             struct rtl8169_rx_ring *ring = &tp->rx_ring[i];
>> +
>> +             RTL_W32(tp, ring->rdsar_reg, ((u64)ring->rx_phy_addr) &
>DMA_BIT_MASK(32));
>> +             RTL_W32(tp, ring->rdsar_reg + 4, ((u64)ring->rx_phy_addr >> 32));
>> +     }
>>  }
>>
>>  static void rtl8169_set_magic_reg(struct rtl8169_private *tp) @@
>> -4214,7 +4273,7 @@ static int rtl8169_change_mtu(struct net_device *dev,
>int new_mtu)
>>       return 0;
>>  }
>>
>> -static void rtl8169_mark_to_asic(struct RxDesc *desc)
>> +static void rtl8169_mark_to_asic_default(struct RxDesc *desc)
>>  {
>>       u32 eor = le32_to_cpu(desc->opts1) & RingEnd;
>>
>> @@ -4224,13 +4283,19 @@ static void rtl8169_mark_to_asic(struct RxDesc
>*desc)
>>       WRITE_ONCE(desc->opts1, cpu_to_le32(DescOwn | eor |
>> R8169_RX_BUF_SIZE));  }
>>
>> +static void rtl8169_mark_to_asic(struct rtl8169_private *tp, struct
>> +RxDesc *desc) {
>> +     rtl8169_mark_to_asic_default(desc);
>> +}
>> +
>>  static struct page *rtl8169_alloc_rx_data(struct rtl8169_private *tp,
>> -                                       struct RxDesc *desc)
>> +                                       struct rtl8169_rx_ring *ring,
>> + unsigned int index)
>>  {
>>       struct device *d = tp_to_dev(tp);
>>       int node = dev_to_node(d);
>>       dma_addr_t mapping;
>>       struct page *data;
>> +     struct RxDesc *desc = ring->rx_desc_array + index;
>>
>>       data = alloc_pages_node(node, GFP_KERNEL,
>get_order(R8169_RX_BUF_SIZE));
>>       if (!data)
>> @@ -4244,55 +4309,111 @@ static struct page
>*rtl8169_alloc_rx_data(struct rtl8169_private *tp,
>>       }
>>
>>       desc->addr = cpu_to_le64(mapping);
>> -     rtl8169_mark_to_asic(desc);
>> +     ring->rx_desc_phy_addr[index] = mapping;
>> +     rtl8169_mark_to_asic(tp, desc);
>>
>>       return data;
>>  }
>>
>> -static void rtl8169_rx_clear(struct rtl8169_private *tp)
>> +static void rtl8169_rx_clear(struct rtl8169_private *tp, struct
>> +rtl8169_rx_ring *ring)
>>  {
>>       int i;
>>
>> -     for (i = 0; i < NUM_RX_DESC && tp->Rx_databuff[i]; i++) {
>> +     for (i = 0; i < NUM_RX_DESC && ring->rx_databuff[i]; i++) {
>>               dma_unmap_page(tp_to_dev(tp),
>> -                            le64_to_cpu(tp->RxDescArray[i].addr),
>> +                            ring->rx_desc_phy_addr[i],
>>                              R8169_RX_BUF_SIZE, DMA_FROM_DEVICE);
>> -             __free_pages(tp->Rx_databuff[i], get_order(R8169_RX_BUF_SIZE));
>> -             tp->Rx_databuff[i] = NULL;
>> -             tp->RxDescArray[i].addr = 0;
>> -             tp->RxDescArray[i].opts1 = 0;
>> +             __free_pages(ring->rx_databuff[i], get_order(R8169_RX_BUF_SIZE));
>> +             ring->rx_databuff[i] = NULL;
>> +             ring->rx_desc_phy_addr[i] = 0;
>> +             ring->rx_desc_array[i].addr = 0;
>> +             ring->rx_desc_array[i].opts1 = 0;
>>       }
>>  }
>>
>> -static int rtl8169_rx_fill(struct rtl8169_private *tp)
>> +static void rtl8169_mark_as_last_descriptor_default(struct RxDesc
>> +*desc) {
>> +     desc->opts1 |= cpu_to_le32(RingEnd); }
>> +
>> +static void rtl8169_mark_as_last_descriptor(struct rtl8169_private
>> +*tp, struct RxDesc *desc) {
>> +     rtl8169_mark_as_last_descriptor_default(desc);
>> +}
>> +
>
>Do we actually need this in this patch?

Actually we don’t need in this patch. This is a interface for the upcoming rss patch. I will introduce it in the following patch.

>
>> +static int rtl8169_rx_fill(struct rtl8169_private *tp, struct
>> +rtl8169_rx_ring *ring)
>>  {
>>       int i;
>>
>>       for (i = 0; i < NUM_RX_DESC; i++) {
>>               struct page *data;
>>
>> -             data = rtl8169_alloc_rx_data(tp, tp->RxDescArray + i);
>> +             data = rtl8169_alloc_rx_data(tp, ring, i);
>>               if (!data) {
>> -                     rtl8169_rx_clear(tp);
>> +                     rtl8169_rx_clear(tp, ring);
>>                       return -ENOMEM;
>>               }
>> -             tp->Rx_databuff[i] = data;
>> +             ring->rx_databuff[i] = data;
>>       }
>>
>>       /* mark as last descriptor in the ring */
>> -     tp->RxDescArray[NUM_RX_DESC - 1].opts1 |= cpu_to_le32(RingEnd);
>> +     rtl8169_mark_as_last_descriptor(tp,
>> + &ring->rx_desc_array[NUM_RX_DESC - 1]);
>> +
>> +     return 0;
>> +}
>> +
>> +static int rtl8169_alloc_rx_desc(struct rtl8169_private *tp) {
>> +     struct rtl8169_rx_ring *ring;
>> +     struct pci_dev *pdev = tp->pci_dev;
>>
>> +     for (int i = 0; i < tp->num_rx_rings; i++) {
>> +             ring = &tp->rx_ring[i];
>> +             ring->rx_desc_alloc_size = (ring->num_rx_desc + 1) * sizeof(struct
>RxDesc);
>> +             ring->rx_desc_array = dma_alloc_coherent(&pdev->dev,
>> +                                                      ring->rx_desc_alloc_size,
>> +                                                      &ring->rx_phy_addr,
>> +                                                      GFP_KERNEL);
>> +             if (!ring->rx_desc_array)
>> +                     return -1;
>> +     }
>>       return 0;
>>  }
>>
>> +static void rtl8169_free_rx_desc(struct rtl8169_private *tp) {
>> +     struct rtl8169_rx_ring *ring;
>> +     struct pci_dev *pdev = tp->pci_dev;
>> +
>> +     for (int i = 0; i < tp->num_rx_rings; i++) {
>> +             ring = &tp->rx_ring[i];
>> +             if (ring->rx_desc_array) {
>> +                     dma_free_coherent(&pdev->dev,
>> +                                       ring->rx_desc_alloc_size,
>> +                                       ring->rx_desc_array,
>> +                                       ring->rx_phy_addr);
>> +                     ring->rx_desc_array = NULL;
>> +             }
>> +     }
>> +}
>> +
>>  static int rtl8169_init_ring(struct rtl8169_private *tp)  {
>> +     int retval = 0;
>> +
>>       rtl8169_init_ring_indexes(tp);
>> +     rtl8169_rx_desc_init(tp);
>>
>>       memset(tp->tx_skb, 0, sizeof(tp->tx_skb));
>> -     memset(tp->Rx_databuff, 0, sizeof(tp->Rx_databuff));
>>
>> -     return rtl8169_rx_fill(tp);
>> +     for (int i = 0; i < tp->num_rx_rings; i++) {
>> +             struct rtl8169_rx_ring *ring = &tp->rx_ring[i];
>> +
>> +             memset(ring->rx_databuff, 0, sizeof(ring->rx_databuff));
>> +             retval = rtl8169_rx_fill(tp, ring);
>> +     }
>> +
>> +     return retval;
>>  }
>>
>>  static void rtl8169_unmap_tx_skb(struct rtl8169_private *tp, unsigned
>> int entry) @@ -4381,16 +4502,24 @@ static void rtl8169_cleanup(struct
>rtl8169_private *tp)
>>       rtl8169_init_ring_indexes(tp);
>>  }
>>
>> -static void rtl_reset_work(struct rtl8169_private *tp)
>> +static void rtl8169_rx_desc_reset(struct rtl8169_private *tp)
>>  {
>> -     int i;
>> +     for (int i = 0; i < tp->num_rx_rings; i++) {
>> +             struct rtl8169_rx_ring *ring = &tp->rx_ring[i];
>>
>> +             for (int j = 0; j < ring->num_rx_desc; j++)
>> +                     rtl8169_mark_to_asic(tp, ring->rx_desc_array + j);
>> +     }
>> +}
>> +
>> +static void rtl_reset_work(struct rtl8169_private *tp) {
>>       netif_stop_queue(tp->dev);
>>
>>       rtl8169_cleanup(tp);
>>
>> -     for (i = 0; i < NUM_RX_DESC; i++)
>> -             rtl8169_mark_to_asic(tp->RxDescArray + i);
>> +     rtl8169_rx_desc_reset(tp);
>> +
>>       rtl8169_napi_enable(tp);
>>
>>       rtl_hw_start(tp);
>> @@ -4784,6 +4913,11 @@ static void rtl8169_pcierr_interrupt(struct
>net_device *dev)
>>       rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);  }
>>
>> +static void rtl8169_desc_quirk(struct rtl8169_private *tp) {
>> +     RTL_R8(tp, tp->imr_reg[0]);
>> +}
>> +
>>  static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp,
>>                  int budget)
>>  {
>> @@ -4836,9 +4970,11 @@ static inline int rtl8169_fragmented_frame(u32
>status)
>>       return (status & (FirstFrag | LastFrag)) != (FirstFrag |
>> LastFrag);  }
>>
>> -static inline void rtl8169_rx_csum(struct sk_buff *skb, u32 opts1)
>> +static inline void rtl8169_rx_csum_default(struct rtl8169_private *tp,
>> +                                        struct sk_buff *skb,
>> +                                        struct RxDesc *desc)
>>  {
>> -     u32 status = opts1 & (RxProtoMask | RxCSFailMask);
>> +     u32 status = le32_to_cpu(desc->opts1) & (RxProtoMask |
>> + RxCSFailMask);
>>
>>       if (status == RxProtoTCP || status == RxProtoUDP)
>>               skb->ip_summed = CHECKSUM_UNNECESSARY; @@ -4846,22
>> +4982,71 @@ static inline void rtl8169_rx_csum(struct sk_buff *skb, u32
>opts1)
>>               skb_checksum_none_assert(skb);  }
>>
>> -static int rtl_rx(struct net_device *dev, struct rtl8169_private *tp,
>> int budget)
>> +static inline void rtl8169_rx_csum(struct rtl8169_private *tp,
>> +                                struct sk_buff *skb,
>> +                                struct RxDesc *desc) {
>> +     rtl8169_rx_csum_default(tp, skb, desc); }
>> +
>> +static u32 rtl8169_rx_desc_opts1(struct rtl8169_private *tp, struct
>> +RxDesc *desc) {
>> +     return READ_ONCE(desc->opts1);
>> +}
>
>I don't see which benefit the helper provides. Instead it may cause side effects
>due to the READ_ONCE().
>

I will move this function to rss patch. When rss enabled, the desc type is different. So we need the helper function.

>> +
>> +static bool rtl8169_check_rx_desc_error(struct net_device *dev,
>> +                                     struct rtl8169_private *tp,
>> +                                     u32 status) {
>> +     if (unlikely(status & RxRES)) {
>> +             if (status & (RxRWT | RxRUNT))
>> +                     dev->stats.rx_length_errors++;
>> +             if (status & RxCRC)
>> +                     dev->stats.rx_crc_errors++;
>> +             return true;
>> +     }
>> +     return false;
>> +}
>> +
>> +static inline void rtl8169_set_desc_dma_addr(struct rtl8169_private *tp,
>> +                                          struct RxDesc *desc,
>> +                                          dma_addr_t mapping) {
>> +     desc->addr = cpu_to_le64(mapping); }
>
>Argument tp isn't used. Why is it there?
>

Interface for rss, I will remove it.

>> +
>> +static int rtl_rx(struct net_device *dev, struct rtl8169_private *tp,
>> +               struct rtl8169_rx_ring *ring, int budget)
>>  {
>>       struct device *d = tp_to_dev(tp);
>>       int count;
>>
>> -     for (count = 0; count < budget; count++, tp->cur_rx++) {
>> -             unsigned int pkt_size, entry = tp->cur_rx % NUM_RX_DESC;
>> -             struct RxDesc *desc = tp->RxDescArray + entry;
>> +     for (count = 0; count < budget; count++, ring->cur_rx++) {
>> +             unsigned int pkt_size, entry = ring->cur_rx % ring->num_rx_desc;
>> +             struct RxDesc *desc = ring->rx_desc_array + entry;
>>               struct sk_buff *skb;
>>               const void *rx_buf;
>>               dma_addr_t addr;
>>               u32 status;
>>
>> -             status = le32_to_cpu(READ_ONCE(desc->opts1));
>> -             if (status & DescOwn)
>> -                     break;
>> +             status = le32_to_cpu(rtl8169_rx_desc_opts1(tp, desc));
>> +
>> +             if (status & DescOwn) {
>> +                     if (!tp->recheck_desc_ownbit)
>> +                             break;
>> +
>> +                     /* Workaround for a hardware issue:
>
>Hardware issue on which chip version(s)?

On RTL8127 series.

>
>> +                      * Hardware might trigger RX interrupt before the DMA
>> +                      * engine fully updates RX desc ownbit in host memory.
>> +                      * So we do a quirk and re-read to avoid missing RX
>> +                      * packets.
>> +                      */
>> +                     tp->recheck_desc_ownbit = false;
>> +                     rtl8169_desc_quirk(tp);
>> +                     status = le32_to_cpu(rtl8169_rx_desc_opts1(tp, desc));
>> +                     if (status & DescOwn)
>> +                             break;
>> +             }
>>
>>               /* This barrier is needed to keep us from reading
>>                * any other fields out of the Rx descriptor until @@
>> -4869,20 +5054,15 @@ static int rtl_rx(struct net_device *dev, struct
>rtl8169_private *tp, int budget
>>                */
>>               dma_rmb();
>>
>> -             if (unlikely(status & RxRES)) {
>> +             if (rtl8169_check_rx_desc_error(dev, tp, status)) {
>>                       if (net_ratelimit())
>>                               netdev_warn(dev, "Rx ERROR. status = %08x\n",
>>                                           status);
>> +
>>                       dev->stats.rx_errors++;
>> -                     if (status & (RxRWT | RxRUNT))
>> -                             dev->stats.rx_length_errors++;
>> -                     if (status & RxCRC)
>> -                             dev->stats.rx_crc_errors++;
>>
>>                       if (!(dev->features & NETIF_F_RXALL))
>>                               goto release_descriptor;
>> -                     else if (status & RxRWT || !(status & (RxRUNT | RxCRC)))
>> -                             goto release_descriptor;
>>               }
>>
>>               pkt_size = status & GENMASK(13, 0); @@ -4898,14 +5078,14
>> @@ static int rtl_rx(struct net_device *dev, struct rtl8169_private *tp, int
>budget
>>                       goto release_descriptor;
>>               }
>>
>> -             skb = napi_alloc_skb(&tp->r8169napi[0].napi, pkt_size);
>> +             skb = napi_alloc_skb(&tp->r8169napi[ring->index].napi,
>> + pkt_size);
>>               if (unlikely(!skb)) {
>>                       dev->stats.rx_dropped++;
>>                       goto release_descriptor;
>>               }
>>
>> -             addr = le64_to_cpu(desc->addr);
>> -             rx_buf = page_address(tp->Rx_databuff[entry]);
>> +             addr = ring->rx_desc_phy_addr[entry];
>> +             rx_buf = page_address(ring->rx_databuff[entry]);
>>
>>               dma_sync_single_for_cpu(d, addr, pkt_size, DMA_FROM_DEVICE);
>>               prefetch(rx_buf);
>> @@ -4914,7 +5094,7 @@ static int rtl_rx(struct net_device *dev, struct
>rtl8169_private *tp, int budget
>>               skb->len = pkt_size;
>>               dma_sync_single_for_device(d, addr, pkt_size,
>> DMA_FROM_DEVICE);
>>
>> -             rtl8169_rx_csum(skb, status);
>> +             rtl8169_rx_csum(tp, skb, desc);
>>               skb->protocol = eth_type_trans(skb, dev);
>>
>>               rtl8169_rx_vlan_tag(desc, skb); @@ -4922,11 +5102,12 @@
>> static int rtl_rx(struct net_device *dev, struct rtl8169_private *tp, int budget
>>               if (skb->pkt_type == PACKET_MULTICAST)
>>                       dev->stats.multicast++;
>>
>> -             napi_gro_receive(&tp->r8169napi[0].napi, skb);
>> +             napi_gro_receive(&tp->r8169napi[ring->index].napi, skb);
>>
>>               dev_sw_netstats_rx_add(dev, pkt_size);
>>  release_descriptor:
>> -             rtl8169_mark_to_asic(desc);
>> +             rtl8169_set_desc_dma_addr(tp, desc, ring-
>>rx_desc_phy_addr[entry]);
>> +             rtl8169_mark_to_asic(tp, desc);
>>       }
>>
>>       return count;
>> @@ -4952,6 +5133,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void
>*dev_instance)
>>               phy_mac_interrupt(tp->phydev);
>>
>>       rtl_irq_disable(tp);
>> +     tp->recheck_desc_ownbit = true;
>>       napi_schedule(&napi->napi);
>>  out:
>>       rtl_ack_events(tp, status);
>> @@ -5040,7 +5222,8 @@ static int rtl8169_poll(struct napi_struct
>> *napi, int budget)
>>
>>       rtl_tx(dev, tp, budget);
>>
>> -     work_done = rtl_rx(dev, tp, budget);
>> +     for (int i = 0; i < tp->num_rx_rings; i++)
>> +             work_done += rtl_rx(dev, tp, &tp->rx_ring[i], budget);
>>
>>       if (work_done < budget && napi_complete_done(napi, work_done))
>>               rtl_irq_enable(tp);
>> @@ -5168,21 +5351,21 @@ static int rtl8169_close(struct net_device *dev)
>>       struct pci_dev *pdev = tp->pci_dev;
>>
>>       pm_runtime_get_sync(&pdev->dev);
>> -
>>       netif_stop_queue(dev);
>> +
>>       rtl8169_down(tp);
>> -     rtl8169_rx_clear(tp);
>> +     for (int i = 0; i < tp->num_rx_rings; i++)
>> +             rtl8169_rx_clear(tp, &tp->rx_ring[i]);
>>
>>       rtl8169_free_irq(tp);
>>
>>       phy_disconnect(tp->phydev);
>>
>> -     dma_free_coherent(&pdev->dev, R8169_RX_RING_BYTES, tp-
>>RxDescArray,
>> -                       tp->RxPhyAddr);
>>       dma_free_coherent(&pdev->dev, R8169_TX_RING_BYTES, tp-
>>TxDescArray,
>>                         tp->TxPhyAddr);
>>       tp->TxDescArray = NULL;
>> -     tp->RxDescArray = NULL;
>> +
>> +     rtl8169_free_rx_desc(tp);
>>
>>       pm_runtime_put_sync(&pdev->dev);
>>
>> @@ -5211,16 +5394,15 @@ static int rtl_open(struct net_device *dev)
>>        * Rx and Tx descriptors needs 256 bytes alignment.
>>        * dma_alloc_coherent provides more.
>>        */
>> +
>
>It's not the only place with unmotivated changes. Please remove these
>changes.
>
>>       tp->TxDescArray = dma_alloc_coherent(&pdev->dev,
>R8169_TX_RING_BYTES,
>>                                            &tp->TxPhyAddr, GFP_KERNEL);
>>       if (!tp->TxDescArray)
>> -             goto out;
>> -
>> -     tp->RxDescArray = dma_alloc_coherent(&pdev->dev,
>R8169_RX_RING_BYTES,
>> -                                          &tp->RxPhyAddr, GFP_KERNEL);
>> -     if (!tp->RxDescArray)
>>               goto err_free_tx_0;
>>
>> +     if (rtl8169_alloc_rx_desc(tp) < 0)
>> +             goto err_free_rx_1;
>> +
>>       retval = rtl8169_init_ring(tp);
>>       if (retval < 0)
>>               goto err_free_rx_1;
>> @@ -5249,11 +5431,10 @@ static int rtl_open(struct net_device *dev)
>>       rtl8169_free_irq(tp);
>>  err_release_fw_2:
>>       rtl_release_firmware(tp);
>> -     rtl8169_rx_clear(tp);
>> +     for (int i = 0; i < tp->num_rx_rings; i++)
>> +             rtl8169_rx_clear(tp, &tp->rx_ring[i]);
>>  err_free_rx_1:
>> -     dma_free_coherent(&pdev->dev, R8169_RX_RING_BYTES, tp-
>>RxDescArray,
>> -                       tp->RxPhyAddr);
>> -     tp->RxDescArray = NULL;
>> +     rtl8169_free_rx_desc(tp);
>>  err_free_tx_0:
>>       dma_free_coherent(&pdev->dev, R8169_TX_RING_BYTES, tp-
>>TxDescArray,
>>                         tp->TxPhyAddr); @@ -5767,7 +5948,10 @@ static
>> int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>>       u32 txconfig;
>>       u32 xid;
>>
>> -     dev = devm_alloc_etherdev(&pdev->dev, sizeof (*tp));
>> +     dev = devm_alloc_etherdev_mqs(&pdev->dev, sizeof(*tp),
>> +                                   1,
>> +                                   R8169_MAX_RX_QUEUES);
>> +
>>       if (!dev)
>>               return -ENOMEM;
>>


^ permalink raw reply

* Re: [PATCH net-next v10 1/4] tun/tap: add ptr_ring consume helper with netdev queue wakeup
From: Simon Schippers @ 2026-05-07  6:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: willemdebruijn.kernel, jasowang, andrew+netdev, davem, edumazet,
	kuba, pabeni, eperezma, leiyang, stephen, jon, tim.gebauer,
	netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20260506181556-mutt-send-email-mst@kernel.org>

On 5/7/26 00:18, Michael S. Tsirkin wrote:
> On Wed, May 06, 2026 at 04:10:30PM +0200, Simon Schippers wrote:
>> Introduce tun_ring_consume() that wraps ptr_ring_consume() and calls
>> __tun_wake_queue(). The latter wakes the stopped netdev subqueue once
>> half of the ring capacity has been consumed, tracked via the new
>> cons_cnt field in tun_file. cons_cnt is updated while holding the ring
>> consumer lock, avoiding races. As a safety net, the queue is also woken
>> when the ring becomes empty. The point is to allow the queue to be
>> stopped when it gets full, which is required for traffic shaping -
>> implemented by the following "avoid ptr_ring tail-drop when a qdisc
>> is present". That patch also explains the pairing of the smp_mb()
>> of __tun_wake_queue().
>>
>> Without the corresponding queue stopping, this patch alone causes no
>> regression for a tap setup sending to a qemu VM: 1.132 Mpps
>> to 1.144 Mpps.
>>
>> Details: AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU
>> threads, pktgen sender; Avg over 50 runs @ 100,000,000 packets;
>> SRSO and spectre v2 mitigations disabled.
>>
>> Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
>> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
>> ---
>>  drivers/net/tun.c | 54 +++++++++++++++++++++++++++++++++++++++++++----
>>  1 file changed, 50 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index b183189f1853..00ecf128fe8e 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -145,6 +145,7 @@ struct tun_file {
>>  	struct list_head next;
>>  	struct tun_struct *detached;
>>  	struct ptr_ring tx_ring;
>> +	int cons_cnt;
>>  	struct xdp_rxq_info xdp_rxq;
>>  };
>>  
>> @@ -557,6 +558,13 @@ void tun_ptr_free(void *ptr)
>>  }
>>  EXPORT_SYMBOL_GPL(tun_ptr_free);
>>  
>> +static void tun_reset_cons_cnt(struct tun_file *tfile)
>> +{
>> +	spin_lock(&tfile->tx_ring.consumer_lock);
>> +	tfile->cons_cnt = 0;
>> +	spin_unlock(&tfile->tx_ring.consumer_lock);
>> +}
>> +
>>  static void tun_queue_purge(struct tun_file *tfile)
>>  {
>>  	void *ptr;
>> @@ -564,6 +572,7 @@ static void tun_queue_purge(struct tun_file *tfile)
>>  	while ((ptr = ptr_ring_consume(&tfile->tx_ring)) != NULL)
>>  		tun_ptr_free(ptr);
>>  
>> +	tun_reset_cons_cnt(tfile);
>>  	skb_queue_purge(&tfile->sk.sk_write_queue);
>>  	skb_queue_purge(&tfile->sk.sk_error_queue);
>>  }
>> @@ -730,6 +739,7 @@ static int tun_attach(struct tun_struct *tun, struct file *file,
>>  		goto out;
>>  	}
>>  
>> +	tun_reset_cons_cnt(tfile);
>>  	tfile->queue_index = tun->numqueues;
>>  	tfile->socket.sk->sk_shutdown &= ~RCV_SHUTDOWN;
>>  
>> @@ -2115,13 +2125,46 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>  	return total;
>>  }
>>  
>> -static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
>> +/* Callers must hold ring.consumer_lock */
>> +static void __tun_wake_queue(struct tun_struct *tun,
>> +			     struct tun_file *tfile, int consumed)
>> +{
>> +	struct netdev_queue *txq = netdev_get_tx_queue(tun->dev,
>> +						tfile->queue_index);
>> +
>> +	/* Paired with smp_mb__after_atomic() in tun_net_xmit() */
>> +	smp_mb();
>> +	if (netif_tx_queue_stopped(txq)) {
>> +		tfile->cons_cnt += consumed;
>> +		if (tfile->cons_cnt >= tfile->tx_ring.size / 2 ||
>> +		    __ptr_ring_empty(&tfile->tx_ring)) {
>> +			netif_tx_wake_queue(txq);
>> +			tfile->cons_cnt = 0;
>> +		}
>> +	}
>> +}
>> +
>> +static void *tun_ring_consume(struct tun_struct *tun, struct tun_file *tfile)
>> +{
>> +	void *ptr;
>> +
>> +	spin_lock(&tfile->tx_ring.consumer_lock);
>> +	ptr = __ptr_ring_consume(&tfile->tx_ring);
>> +	if (ptr)
>> +		__tun_wake_queue(tun, tfile, 1);
>> +
>> +	spin_unlock(&tfile->tx_ring.consumer_lock);
>> +	return ptr;
>> +}
>> +
>> +static void *tun_ring_recv(struct tun_struct *tun, struct tun_file *tfile,
>> +			   int noblock, int *err)
>>  {
>>  	DECLARE_WAITQUEUE(wait, current);
>>  	void *ptr = NULL;
>>  	int error = 0;
>>  
>> -	ptr = ptr_ring_consume(&tfile->tx_ring);
>> +	ptr = tun_ring_consume(tun, tfile);
>>  	if (ptr)
>>  		goto out;
>>  	if (noblock) {
>> @@ -2133,7 +2176,7 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
>>  
>>  	while (1) {
>>  		set_current_state(TASK_INTERRUPTIBLE);
>> -		ptr = ptr_ring_consume(&tfile->tx_ring);
>> +		ptr = tun_ring_consume(tun, tfile);
>>  		if (ptr)
>>  			break;
>>  		if (signal_pending(current)) {
> 
> 
> So based on commit log I expected all calls to ptr_ring_consume to
> be replaced with tun_ring_consume, but it looks like tun_queue_purge
> still calls ptr_ring_consume.
> I suspect that together with patch 4 it can sometimes leave us stuck
> with a stopped queue and an empty ring, forever.
> 

I see. I will replace ptr_ring_consume() with tun_ring_consume().

> 
> 
> 
> 
>> @@ -2170,7 +2213,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
>>  
>>  	if (!ptr) {
>>  		/* Read frames from ring */
>> -		ptr = tun_ring_recv(tfile, noblock, &err);
>> +		ptr = tun_ring_recv(tun, tfile, noblock, &err);
>>  		if (!ptr)
>>  			return err;
>>  	}
>> @@ -3406,6 +3449,8 @@ static int tun_chr_open(struct inode *inode, struct file * file)
>>  		return -ENOMEM;
>>  	}
>>  
>> +	tun_reset_cons_cnt(tfile);
>> +
>>  	mutex_init(&tfile->napi_mutex);
>>  	RCU_INIT_POINTER(tfile->tun, NULL);
>>  	tfile->flags = 0;
>> @@ -3614,6 +3659,7 @@ static int tun_queue_resize(struct tun_struct *tun)
>>  	for (i = 0; i < tun->numqueues; i++) {
>>  		tfile = rtnl_dereference(tun->tfiles[i]);
>>  		rings[i] = &tfile->tx_ring;
>> +		tun_reset_cons_cnt(tfile);
>>  	}
>>  	list_for_each_entry(tfile, &tun->disabled, next)
>>  		rings[i++] = &tfile->tx_ring;
>> -- 
>> 2.43.0
> 

^ permalink raw reply

* Re: [PATCH net-next v3 1/2] dpll: add fractional frequency offset to pin-parent-device
From: Ivan Vecera @ 2026-05-07  6:12 UTC (permalink / raw)
  To: Jakub Kicinski, Jiri Pirko
  Cc: netdev, Andrew Lunn, Arkadiusz Kubalewski, David S. Miller,
	Donald Hunter, Eric Dumazet, Jonathan Corbet, Leon Romanovsky,
	Mark Bloch, Michal Schmidt, Paolo Abeni, Pasi Vaananen, Petr Oros,
	Prathosh Satish, Saeed Mahameed, Shuah Khan, Simon Horman,
	Tariq Toukan, Vadim Fedorenko, linux-doc, linux-kernel,
	linux-rdma
In-Reply-To: <20260506183342.767b5fbc@kernel.org>

On 5/7/26 3:33 AM, Jakub Kicinski wrote:
> On Mon,  4 May 2026 17:53:39 +0200 Ivan Vecera wrote:
>> +          At top level this represents the RX vs TX symbol rate
>> +          offset on the media associated with the pin.
> 
> Isn't this a hacky hack? I'd think that pin is in or out.
> Having a freq offset between two pins or pin and parent's
> ref lock makes sense. This new interpretation sounds like
> we are trying to shove a difference between two pins into one?

The "RX vs TX symbol rate offset" description is not something I
introduced — it is the original documentation of the
fractional-frequency-offset attribute as defined by Jiri. The -ppt
variant was added later purely for higher precision. This patch just
unifies the documentation of both attributes.

I'm not sure I fully understand what the original "RX vs TX" semantics
were meant to capture. Jiri, could you clarify what you had in mind
and whether we should keep or change that description?

>> @@ -299,6 +299,10 @@ zl3073x_dpll_input_pin_ffo_get(const struct dpll_pin *dpll_pin, void *pin_priv,
>>   {
>>   	struct zl3073x_dpll_pin *pin = pin_priv;
>>   
>> +	/* Only rx vs tx symbol rate FFO is supported */
>> +	if (dpll)
>> +		return -ENODATA;
>> +
>>   	*ffo = pin->freq_offset;
> 
> It's easy for driver authors to forget this sort of validation.
> We should fail close, so it's better to have some "capability"
> bits or something for the driver to opt into getting given format
> of the call.

Regarding the fail-close concern — I agree that relying on drivers
to check dpll==NULL is fragile. A capability bit alone wouldn't help
though, since the driver still needs to distinguish which FFO context
is being requested.

I can think of two approaches:
1. An explicit bool parameter (e.g. `bool per_parent`) instead of
    overloading the dpll pointer for context distinction.
2. Separate callbacks for each FFO context (e.g. ffo_get for the
    top-level and ffo_parent_get for the per-parent).

Do you have a preference, or something else in mind?

Thanks,
Ivan


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox