Netdev List


* [PATCH net v3 1/1] net/sched: sch_teql: Introduce slaves_lock to avoid race condition and UAF
From: Jamal Hadi Salim @ 2026-06-28 11:12 UTC
  To: netdev
  Cc: davem, edumazet, kuba, pabeni, horms, victor, jiri, security,
	zdi-disclosures, stable, Jamal Hadi Salim

The teql master->slaves singly linked list is not protected against
concurrent writers. It can be modified concurrently from teql_master_xmit(),
teql_dequeue(), teql_qdisc_init() and teql_destroy() without holding any list
lock or RCU protection.

zdi-disclosures@trendmicro.com has demonstrated that the qdisc is freed
after an RCU grace period, but teql_master_xmit() running on another
CPU can still hold a stale pointer into the list, resulting in a
slab-use-after-free:

BUG: KASAN: slab-use-after-free in teql_master_xmit+0xf0f/0x16b0
Read of size 8 at addr ffff888013fb0440 by task poc/332
Freed 512-byte region [ffff888013fb0400, ffff888013fb0600) (kmalloc-512)

The fix:
Add a per-master slaves_lock spinlock that serializes all mutations of
master->slaves and the NEXT_SLAVE() links in teql_destroy() and
teql_qdisc_init(). teql_master_xmit() and teql_dequeue() take the same
slaves_lock around their updates of master->slaves.
Annotate master->slaves and the per-slave ->next pointer with __rcu and
use the appropriate RCU accessors everywhere they are touched:
rcu_assign_pointer() on the writer side (under slaves_lock),
rcu_dereference_protected() for the writer-side loads (also under
slaves_lock), rcu_dereference_bh() for the loads in teql_master_xmit() and
rtnl_dereference() for the loads in teql_master_open()/teql_master_mtu(),
which run under RTNL.
Pair this with rcu_read_lock_bh()/rcu_read_unlock_bh() around the list
traversal in teql_master_xmit(), so that readers either observe a fully
linked list or are deferred until the in-flight mutation completes. The two
early-return paths in teql_master_xmit() are updated to release the RCU-bh
read-side critical section before returning, since leaving it held would
disable BH on that CPU for good.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: zdi-disclosures@trendmicro.com
Tested-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
v2->v3
1) Thanks to Simon's persistence:
 The writeback in teql_master_xmit() should not blindly write NEXT_SLAVE(q)
 into master->slaves. It should re-read master->slaves under slaves_lock and
 only update it if q is still the current head
2) Appease sashiko by mentioning teql_dequeue() on the commit and ensuring
 consistency on rcu_dereference_bh()/rcu_dereference_protected()
---
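
The v3 change described in item 1 can be illustrated outside the kernel.
What follows is a minimal single-threaded sketch in userspace C, not the
sch_teql code: struct node, struct ring and advance_head_if_current() are
hypothetical names, and the slaves_lock that the patch holds around this
check is deliberately left out so that only the comparison is shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a teql slave; not the kernel's struct Qdisc. */
struct node {
	struct node *next;	/* circular: the last node points back to the first */
};

/* Hypothetical stand-in for struct teql_master. */
struct ring {
	struct node *head;	/* plays the role of master->slaves */
};

/*
 * Advance the head past q only if q is still the head.
 * In the patch the equivalent check runs with slaves_lock held; this
 * model is single-threaded and omits the lock.
 * Returns 1 if the head moved, 0 if it was left alone.
 */
static int advance_head_if_current(struct ring *r, struct node *q)
{
	assert(r != NULL && q != NULL);

	if (r->head != q)
		return 0;	/* someone else already moved the head */

	r->head = q->next;
	return 1;
}
```

With a three-node ring a -> b -> c -> a and the head at a, a first call for
a moves the head to b. A second call for a, which now models a stale q,
leaves the head untouched instead of overwriting it.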

diff --git a/net/sched/sch_teql.c b/net/sched/sch_teql.c
index e7bbc9e5174d..78c74c1182d7 100644
--- a/net/sched/sch_teql.c
+++ b/net/sched/sch_teql.c
@@ -52,7 +52,8 @@
 struct teql_master {
 	struct Qdisc_ops qops;
 	struct net_device *dev;
-	struct Qdisc *slaves;
+	struct Qdisc __rcu	*slaves;
+	spinlock_t		slaves_lock; /* serializes writes to ->slaves */
 	struct list_head master_list;
 	unsigned long	tx_bytes;
 	unsigned long	tx_packets;
@@ -61,7 +62,7 @@ struct teql_master {
 };
 
 struct teql_sched_data {
-	struct Qdisc *next;
+	struct Qdisc __rcu	*next;
 	struct teql_master *m;
 	struct sk_buff_head q;
 };
@@ -101,7 +102,9 @@ teql_dequeue(struct Qdisc *sch)
 	if (skb == NULL) {
 		struct net_device *m = qdisc_dev(q);
 		if (m) {
-			dat->m->slaves = sch;
+			spin_lock_bh(&dat->m->slaves_lock);
+			rcu_assign_pointer(dat->m->slaves, sch);
+			spin_unlock_bh(&dat->m->slaves_lock);
 			netif_wake_queue(m);
 		}
 	} else {
@@ -132,34 +135,49 @@ teql_destroy(struct Qdisc *sch)
 	struct Qdisc *q, *prev;
 	struct teql_sched_data *dat = qdisc_priv(sch);
 	struct teql_master *master = dat->m;
+	struct netdev_queue *txq = NULL;
+	bool reset_master_queue = false;
 
 	if (!master)
 		return;
 
-	prev = master->slaves;
+	spin_lock_bh(&master->slaves_lock);
+	prev = rcu_dereference_protected(master->slaves,
+					 lockdep_is_held(&master->slaves_lock));
 	if (prev) {
 		do {
-			q = NEXT_SLAVE(prev);
-			if (q == sch) {
-				NEXT_SLAVE(prev) = NEXT_SLAVE(q);
-				if (q == master->slaves) {
-					master->slaves = NEXT_SLAVE(q);
-					if (q == master->slaves) {
-						struct netdev_queue *txq;
-
-						txq = netdev_get_tx_queue(master->dev, 0);
-						master->slaves = NULL;
-
-						dev_reset_queue(master->dev,
-								txq, NULL);
-					}
-				}
-				skb_queue_purge(&dat->q);
-				break;
+			struct Qdisc *head, *next;
+
+			q = rcu_dereference_protected(NEXT_SLAVE(prev),
+						      lockdep_is_held(&master->slaves_lock));
+			if (q != sch) {
+				prev = q;
+				continue;
 			}
 
-		} while ((prev = q) != master->slaves);
+			next = rcu_dereference_protected(NEXT_SLAVE(q),
+							 lockdep_is_held(&master->slaves_lock));
+			rcu_assign_pointer(NEXT_SLAVE(prev), next);
+
+			head = rcu_dereference_protected(master->slaves,
+							 lockdep_is_held(&master->slaves_lock));
+			if (q == head) {
+				rcu_assign_pointer(master->slaves, next);
+				if (q == next) {
+					txq = netdev_get_tx_queue(master->dev, 0);
+					rcu_assign_pointer(master->slaves, NULL);
+					reset_master_queue = true;
+				}
+			}
+			skb_queue_purge(&dat->q);
+			break;
+		} while (prev != rcu_dereference_protected(master->slaves,
+							   lockdep_is_held(&master->slaves_lock)));
 	}
+	spin_unlock_bh(&master->slaves_lock);
+
+	if (reset_master_queue)
+		dev_reset_queue(master->dev, txq, NULL);
 }
 
 static int teql_qdisc_init(struct Qdisc *sch, struct nlattr *opt,
@@ -168,6 +186,7 @@ static int teql_qdisc_init(struct Qdisc *sch, struct nlattr *opt,
 	struct net_device *dev = qdisc_dev(sch);
 	struct teql_master *m = (struct teql_master *)sch->ops;
 	struct teql_sched_data *q = qdisc_priv(sch);
+	struct Qdisc *first;
 
 	if (dev->hard_header_len > m->dev->hard_header_len)
 		return -EINVAL;
@@ -184,7 +203,9 @@ static int teql_qdisc_init(struct Qdisc *sch, struct nlattr *opt,
 
 	skb_queue_head_init(&q->q);
 
-	if (m->slaves) {
+	spin_lock_bh(&m->slaves_lock);
+	first = rcu_dereference_protected(m->slaves, lockdep_is_held(&m->slaves_lock));
+	if (first) {
 		if (m->dev->flags & IFF_UP) {
 			if ((m->dev->flags & IFF_POINTOPOINT &&
 			     !(dev->flags & IFF_POINTOPOINT)) ||
@@ -192,8 +213,10 @@ static int teql_qdisc_init(struct Qdisc *sch, struct nlattr *opt,
 			     !(dev->flags & IFF_BROADCAST)) ||
 			    (m->dev->flags & IFF_MULTICAST &&
 			     !(dev->flags & IFF_MULTICAST)) ||
-			    dev->mtu < m->dev->mtu)
+			    dev->mtu < m->dev->mtu) {
+				spin_unlock_bh(&m->slaves_lock);
 				return -EINVAL;
+			}
 		} else {
 			if (!(dev->flags&IFF_POINTOPOINT))
 				m->dev->flags &= ~IFF_POINTOPOINT;
@@ -204,14 +227,17 @@ static int teql_qdisc_init(struct Qdisc *sch, struct nlattr *opt,
 			if (dev->mtu < m->dev->mtu)
 				m->dev->mtu = dev->mtu;
 		}
-		q->next = NEXT_SLAVE(m->slaves);
-		NEXT_SLAVE(m->slaves) = sch;
+		rcu_assign_pointer(q->next,
+				   rcu_dereference_protected(NEXT_SLAVE(first),
+							     lockdep_is_held(&m->slaves_lock)));
+		rcu_assign_pointer(NEXT_SLAVE(first), sch);
 	} else {
-		q->next = sch;
-		m->slaves = sch;
+		rcu_assign_pointer(q->next, sch);
+		rcu_assign_pointer(m->slaves, sch);
 		m->dev->mtu = dev->mtu;
 		m->dev->flags = (m->dev->flags&~FMASK)|(dev->flags&FMASK);
 	}
+	spin_unlock_bh(&m->slaves_lock);
 	return 0;
 }
 
@@ -285,7 +311,9 @@ static netdev_tx_t teql_master_xmit(struct sk_buff *skb, struct net_device *dev)
 	int subq = skb_get_queue_mapping(skb);
 	struct sk_buff *skb_res = NULL;
 
-	start = master->slaves;
+	rcu_read_lock_bh();
+
+	start = rcu_dereference_bh(master->slaves);
 
 restart:
 	nores = 0;
@@ -317,10 +345,17 @@ static netdev_tx_t teql_master_xmit(struct sk_buff *skb, struct net_device *dev)
 				    netdev_start_xmit(skb, slave, slave_txq, false) ==
 				    NETDEV_TX_OK) {
 					__netif_tx_unlock(slave_txq);
-					master->slaves = NEXT_SLAVE(q);
+					spin_lock_bh(&master->slaves_lock);
+					if (rcu_dereference_protected(master->slaves,
+								      lockdep_is_held(&master->slaves_lock)) == q)
+						rcu_assign_pointer(master->slaves,
+								   rcu_dereference_protected(NEXT_SLAVE(q),
+											     lockdep_is_held(&master->slaves_lock)));
+					spin_unlock_bh(&master->slaves_lock);
 					netif_wake_queue(dev);
 					master->tx_packets++;
 					master->tx_bytes += length;
+					rcu_read_unlock_bh();
 					return NETDEV_TX_OK;
 				}
 				__netif_tx_unlock(slave_txq);
@@ -329,14 +364,21 @@ static netdev_tx_t teql_master_xmit(struct sk_buff *skb, struct net_device *dev)
 				busy = 1;
 			break;
 		case 1:
-			master->slaves = NEXT_SLAVE(q);
+			spin_lock_bh(&master->slaves_lock);
+			if (rcu_dereference_protected(master->slaves,
+						      lockdep_is_held(&master->slaves_lock)) == q)
+				rcu_assign_pointer(master->slaves,
+						   rcu_dereference_protected(NEXT_SLAVE(q),
+									     lockdep_is_held(&master->slaves_lock)));
+			spin_unlock_bh(&master->slaves_lock);
+			rcu_read_unlock_bh();
 			return NETDEV_TX_OK;
 		default:
 			nores = 1;
 			break;
 		}
 		__skb_pull(skb, skb_network_offset(skb));
-	} while ((q = NEXT_SLAVE(q)) != start);
+	} while ((q = rcu_dereference_bh(NEXT_SLAVE(q))) != start);
 
 	if (nores && skb_res == NULL) {
 		skb_res = skb;
@@ -345,29 +387,32 @@ static netdev_tx_t teql_master_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	if (busy) {
 		netif_stop_queue(dev);
+		rcu_read_unlock_bh();
 		return NETDEV_TX_BUSY;
 	}
 	master->tx_errors++;
 
 drop:
 	master->tx_dropped++;
+	rcu_read_unlock_bh();
 	dev_kfree_skb(skb);
 	return NETDEV_TX_OK;
 }
 
 static int teql_master_open(struct net_device *dev)
 {
-	struct Qdisc *q;
+	struct Qdisc *q, *first;
 	struct teql_master *m = netdev_priv(dev);
 	int mtu = 0xFFFE;
 	unsigned int flags = IFF_NOARP | IFF_MULTICAST;
 
-	if (m->slaves == NULL)
+	first = rtnl_dereference(m->slaves);
+	if (!first)
 		return -EUNATCH;
 
 	flags = FMASK;
 
-	q = m->slaves;
+	q = first;
 	do {
 		struct net_device *slave = qdisc_dev(q);
 
@@ -389,7 +434,7 @@ static int teql_master_open(struct net_device *dev)
 			flags &= ~IFF_BROADCAST;
 		if (!(slave->flags&IFF_MULTICAST))
 			flags &= ~IFF_MULTICAST;
-	} while ((q = NEXT_SLAVE(q)) != m->slaves);
+	} while ((q = rtnl_dereference(NEXT_SLAVE(q))) != first);
 
 	m->dev->mtu = mtu;
 	m->dev->flags = (m->dev->flags&~FMASK) | flags;
@@ -417,14 +462,15 @@ static void teql_master_stats64(struct net_device *dev,
 static int teql_master_mtu(struct net_device *dev, int new_mtu)
 {
 	struct teql_master *m = netdev_priv(dev);
-	struct Qdisc *q;
+	struct Qdisc *q, *first;
 
-	q = m->slaves;
+	first = rtnl_dereference(m->slaves);
+	q = first;
 	if (q) {
 		do {
 			if (new_mtu > qdisc_dev(q)->mtu)
 				return -EINVAL;
-		} while ((q = NEXT_SLAVE(q)) != m->slaves);
+		} while ((q = rtnl_dereference(NEXT_SLAVE(q))) != first);
 	}
 
 	WRITE_ONCE(dev->mtu, new_mtu);
@@ -444,6 +490,7 @@ static __init void teql_master_setup(struct net_device *dev)
 	struct teql_master *master = netdev_priv(dev);
 	struct Qdisc_ops *ops = &master->qops;
 
+	spin_lock_init(&master->slaves_lock);
 	master->dev	= dev;
 	ops->priv_size  = sizeof(struct teql_sched_data);
 


* [PATCH net 1/1] tcp: Require init_net CAP_NET_ADMIN for tcp_child_ehash_entries
From: Ren Wei @ 2026-06-28 11:37 UTC
  To: netdev
  Cc: davem, edumazet, pabeni, horms, chia-yu.chang, ij, idosch,
	fmancera, bronzed_45_vested, yuuchihsu, kuniyu, yuantan098,
	yifanwucs, tomapufckgml, bird, roxy520tt, n05ec
In-Reply-To: <cover.1782641525.git.roxy520tt@gmail.com>

From: Zhiling Zou <roxy520tt@gmail.com>

tcp_child_ehash_entries controls the size of the private TCP established
hash table allocated for subsequently created child network namespaces.
The value is consumed during child netns creation by tcp_set_hashinfo()
and passed to inet_pernet_hashinfo_alloc(), which can allocate a large
per-netns ehash.

The sysctl is writable in each network namespace, and net sysctl
permissions allow a task with CAP_NET_ADMIN in the namespace's owning
user namespace to write it.  An unprivileged user can therefore create a
user and network namespace, set tcp_child_ehash_entries to its maximum
value, and repeatedly create nested network namespaces to force large
kernel allocations and exhaust host memory.

Require CAP_NET_ADMIN in the initial user namespace before accepting
writes to tcp_child_ehash_entries.  This keeps the tuning knob available
to the host administrator while preventing unprivileged user namespaces
from using it to drive host-wide memory consumption.
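
The difference between the two checks can be sketched with a small
userspace model. This is a simplified illustration, not kernel code:
struct model_user_ns, struct model_cred, model_ns_capable() and
model_capable() are hypothetical names, and the walk below is an
approximation of the kernel's capability check that ignores LSM hooks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a user namespace. */
struct model_user_ns {
	const struct model_user_ns *parent;	/* NULL for the initial namespace */
	unsigned int owner_uid;			/* uid that created the namespace */
};

/* Hypothetical model of a task's credentials. */
struct model_cred {
	const struct model_user_ns *user_ns;	/* namespace the task lives in */
	unsigned int euid;
	bool cap_net_admin;			/* CAP_NET_ADMIN raised in user_ns */
};

/* Does cred hold CAP_NET_ADMIN with respect to the target namespace? */
static bool model_ns_capable(const struct model_cred *cred,
			     const struct model_user_ns *target)
{
	const struct model_user_ns *ns;

	assert(cred != NULL);

	for (ns = target; ns; ns = ns->parent) {
		/* Same namespace: the raised capability bit decides. */
		if (ns == cred->user_ns)
			return cred->cap_net_admin;
		/* The creator of a child namespace holds all caps over it. */
		if (ns->parent == cred->user_ns && ns->owner_uid == cred->euid)
			return true;
	}
	return false;	/* target is not at or below the task's namespace */
}

/* Model of the stricter check: always evaluated against the initial namespace. */
static bool model_capable(const struct model_cred *cred,
			  const struct model_user_ns *init_ns)
{
	return model_ns_capable(cred, init_ns);
}
```

In this model a task that created its own user namespace passes
model_ns_capable() for that namespace but fails model_capable(), which is
the behavior the new handler relies on.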

Fixes: d1e5e6408b30 ("tcp: Introduce optional per-netns ehash.")
Cc: stable@vger.kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Assisted-by: Codex:gpt-5.4
Signed-off-by: Zhiling Zou <roxy520tt@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
---
 net/ipv4/sysctl_net_ipv4.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index ca1180dba1de..1cad1b5cb826 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -9,6 +9,7 @@
 #include <linux/sysctl.h>
 #include <linux/seqlock.h>
 #include <linux/init.h>
+#include <linux/capability.h>
 #include <linux/slab.h>
 #include <net/icmp.h>
 #include <net/ip.h>
@@ -415,6 +416,16 @@ static int proc_tcp_ehash_entries(const struct ctl_table *table, int write,
 	return proc_dointvec(&tbl, write, buffer, lenp, ppos);
 }

+static int proc_tcp_child_ehash_entries(const struct ctl_table *table, int write,
+					void *buffer, size_t *lenp,
+					loff_t *ppos)
+{
+	if (write && !capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	return proc_douintvec_minmax(table, write, buffer, lenp, ppos);
+}
+
 static int proc_udp_hash_entries(const struct ctl_table *table, int write,
 				 void *buffer, size_t *lenp, loff_t *ppos)
 {
@@ -1524,7 +1535,7 @@ static struct ctl_table ipv4_net_table[] = {
 		.data		= &init_net.ipv4.sysctl_tcp_child_ehash_entries,
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
-		.proc_handler	= proc_douintvec_minmax,
+		.proc_handler	= proc_tcp_child_ehash_entries,
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= &tcp_child_ehash_entries_max,
 	},
-- 
2.43.0


* [RFC] xdp: add device context to bpf_xdp_link_attach_failed tracepoint
From: Masashi Honma @ 2026-06-28 11:39 UTC
  To: netdev, bpf, linux-trace-kernel
  Cc: leon.hwang, ast, daniel, kuba, hawk, andrii, rostedt, mhiramat,
	edumazet, pabeni, linux-kernel

Hello, I am re-posting this mail because I forgot to add the [RFC] tag.

The bpf_xdp_link_attach_failed tracepoint (added in commit bf4ea1d0b2cb
"xdp: Add tracepoint for xdp attaching failure") exposes the netlink
extack message produced when attaching an XDP program via BPF_LINK_CREATE
fails. This is useful because, unlike the netlink attach path, the
bpf_link attach path does not return the extack to userspace -- the caller
only gets an errno (e.g. EINVAL/ERANGE).

We would like to use this in Cilium [1][2]: when attaching the XDP
datapath program fails, surface the kernel's reason (e.g. "single-buffer
XDP requires MTU less than ...") in the agent logs instead of an opaque
errno, so operators don't have to inspect dmesg on the host.

The limitation we hit is that the tracepoint only carries the message
string, so a consumer cannot tell which device a failure belongs to.
This matters for two reasons:

  1. Correlation: with only the message, a consumer cannot reliably
      attribute a failure to a specific attach, particularly if multiple
      XDP attaches happen concurrently.
  2. Scoping: a consumer watching this tracepoint sees XDP attach
      failures system-wide and cannot limit them to the devices it
      manages.

At the call site (bpf_xdp_link_attach() in net/core/dev.c) the net_device
is in scope, so exposing it looks straightforward:

  TRACE_EVENT(bpf_xdp_link_attach_failed,
      TP_PROTO(const char *msg, const struct net_device *dev),
      TP_ARGS(msg, dev),
      TP_STRUCT__entry(
          __string(msg, msg)
          __field(int, ifindex)
      ),
      TP_fast_assign(
          __assign_str(msg);
          __entry->ifindex = dev->ifindex;
      ),
      TP_printk("ifindex=%d errmsg=%s", __entry->ifindex, __get_str(msg))
  );

  - trace_bpf_xdp_link_attach_failed(extack._msg);
  + trace_bpf_xdp_link_attach_failed(extack._msg, dev);

Before sending a formal patch I'd appreciate guidance on a few points:

  - Should the tracepoint take const struct net_device *dev (consistent
    with the other tracepoints in this file, and lets TP_printk show the
    device), or just the ifindex as an int (simpler for raw_tp BPF
    consumers, which otherwise read dev->ifindex via CO-RE)?

  - For raw_tp consumers the argument order is effectively ABI: prepending
    dev would shift the existing msg argument. I've appended dev above to
    keep msg at args[0]. Is preserving the existing argument position the
    right call, or is reordering acceptable given how new and rarely
    consumed this tracepoint is?

  - Is extending the existing tracepoint preferred, or would you rather
    keep it as-is and expose the device context some other way?

This would be my first XDP/BPF tracepoint change, so any direction is
welcome. I'm happy to send a proper patch once the shape is agreed.

Regards,
Masashi Honma

[1] https://github.com/cilium/cilium/issues/40777
[2] https://github.com/cilium/cilium/pull/46546


* [PATCH net 1/1] tcp: bound SYN-ACK timers to reqsk timeout range
From: Ren Wei @ 2026-06-28 11:42 UTC
  To: netdev
  Cc: edumazet, ncardwell, kuniyu, davem, pabeni, horms, chia-yu.chang,
	ij, bronzed_45_vested, yuuchihsu, idosch, yuantan098, yifanwucs,
	tomapufckgml, bird, roxy520tt, n05ec
In-Reply-To: <cover.1782643946.git.roxy520tt@gmail.com>

From: Zhiling Zou <roxy520tt@gmail.com>

tcp_synack_retries supplies the SYN-ACK retry limit used by request
socket timers. The same effective limit can also come from TCP_SYNCNT
through icsk_syn_retries, while TCP_DEFER_ACCEPT can keep an ACKed
request alive until rskq_defer_accept is reached.

The request socket timeout counter is incremented before it is used to
compute the next timeout. tcp_reqsk_timeout() and the Fast Open SYN-ACK
timer shift req->timeout by req->num_timeout. Excessive retry or
defer-accept limits can therefore drive these timer paths into invalid
shift counts before the request expires.

Limit tcp_synack_retries to the request socket timer range, clamp the
effective retry and defer-accept limits in the regular request socket
timer path, clamp the Fast Open retry limit, and make the request
socket timeout helper saturate before shifting.
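
The saturating behavior described above can be sketched in userspace C.
This is an illustration of the technique only, not the kernel helper:
shift_timeout_saturating() is a hypothetical name, and the cap parameter
stands in for whatever maximum RTO the caller supplies.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute min(base << shift, cap) without ever executing an invalid or
 * overflowing shift: a shift count of 64 or more, or a result that
 * would not fit in 64 bits, saturates to cap.
 */
static uint64_t shift_timeout_saturating(uint64_t base, unsigned int shift,
					 uint64_t cap)
{
	uint64_t scaled;

	/* Check the shift count first so the right shift below is valid. */
	if (shift >= 64 || base > (UINT64_MAX >> shift))
		return cap;

	scaled = base << shift;
	/* The guard above guarantees that no bits were shifted out. */
	assert(scaled >= base);

	return scaled < cap ? scaled : cap;
}
```

For example, a base of 1000 with shift 3 and cap 120000 gives 8000, while
any shift of 64 or more returns the cap instead of triggering undefined
behavior.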

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Assisted-by: Codex:gpt-5.4
Signed-off-by: Zhiling Zou <roxy520tt@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
---
 include/net/tcp.h               | 19 +++++++++++++++----
 net/ipv4/inet_connection_sock.c |  6 +++++-
 net/ipv4/sysctl_net_ipv4.c      |  2 ++
 net/ipv4/tcp_timer.c            | 11 ++++++++---
 4 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6d376ea4d1c0..656f1bd0fa1a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -183,6 +183,7 @@ static_assert((1 << ATO_BITS) > TCP_DELACK_MAX);
 #define MAX_TCP_KEEPINTVL	32767
 #define MAX_TCP_KEEPCNT		127
 #define MAX_TCP_SYNCNT		127
+#define MAX_TCP_SYNACK_RETRIES	63
 
 /* Ensure that TCP PAWS checks are relaxed after ~2147 seconds
  * to avoid overflows. This assumes a clock smaller than 1 Mhz.
@@ -882,12 +883,22 @@ static inline u32 __tcp_set_rto(const struct tcp_sock *tp)
 	return usecs_to_jiffies((tp->srtt_us >> 3) + tp->rttvar_us);
 }
 
-static inline unsigned long tcp_reqsk_timeout(struct request_sock *req)
+static inline unsigned long tcp_reqsk_timeout_sk(const struct sock *sk,
+						 struct request_sock *req)
 {
-	u64 timeout = (u64)req->timeout << req->num_timeout;
+	u64 timeout = req->timeout;
+	u32 rto_max = tcp_rto_max(sk);
+
+	if (req->num_timeout >= BITS_PER_TYPE(u64) ||
+	    timeout > U64_MAX >> req->num_timeout)
+		return rto_max;
+
+	return (unsigned long)min_t(u64, timeout << req->num_timeout, rto_max);
+}
 
-	return (unsigned long)min_t(u64, timeout,
-				    tcp_rto_max(req->rsk_listener));
+static inline unsigned long tcp_reqsk_timeout(struct request_sock *req)
+{
+	return tcp_reqsk_timeout_sk(req->rsk_listener, req);
 }
 
 u32 tcp_delack_max(const struct sock *sk);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 56902bba5483..b74212bae3dd 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -1056,6 +1056,8 @@ static void reqsk_timer_handler(struct timer_list *t)
 	net = sock_net(sk_listener);
 	max_syn_ack_retries = READ_ONCE(icsk->icsk_syn_retries) ? :
 		READ_ONCE(net->ipv4.sysctl_tcp_synack_retries);
+	max_syn_ack_retries = min_t(int, max_syn_ack_retries,
+				    MAX_TCP_SYNACK_RETRIES);
 	/* Normally all the openreqs are young and become mature
 	 * (i.e. converted to established socket) for first timeout.
 	 * If synack was not acknowledged for 1 second, it means
@@ -1086,7 +1088,9 @@ static void reqsk_timer_handler(struct timer_list *t)
 		}
 	}
 
-	syn_ack_recalc(req, max_syn_ack_retries, READ_ONCE(queue->rskq_defer_accept),
+	syn_ack_recalc(req, max_syn_ack_retries,
+		       min_t(u8, READ_ONCE(queue->rskq_defer_accept),
+			     MAX_TCP_SYNACK_RETRIES),
 		       &expire, &resend);
 	tcp_syn_ack_timeout(req);
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index ca1180dba1de..f9d233b98bbc 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -35,6 +35,7 @@ static int ip_ttl_min = 1;
 static int ip_ttl_max = 255;
 static int tcp_syn_retries_min = 1;
 static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
+static int tcp_synack_retries_max = MAX_TCP_SYNACK_RETRIES;
 static int tcp_syn_linear_timeouts_max = MAX_TCP_SYNCNT;
 static unsigned long ip_ping_group_range_min[] = { 0, 0 };
 static unsigned long ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
@@ -1034,6 +1035,7 @@ static struct ctl_table ipv4_net_table[] = {
 		.maxlen		= sizeof(u8),
 		.mode		= 0644,
 		.proc_handler	= proc_dou8vec_minmax,
+		.extra2		= &tcp_synack_retries_max
 	},
 #ifdef CONFIG_SYN_COOKIES
 	{
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index bf171b5e1eb3..097e5f698c57 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -458,6 +458,7 @@ static void tcp_fastopen_synack_timer(struct sock *sk, struct request_sock *req)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct net *net = sock_net(sk);
 	int max_retries;
 
 	tcp_syn_ack_timeout(req);
@@ -465,8 +466,12 @@ static void tcp_fastopen_synack_timer(struct sock *sk, struct request_sock *req)
 	/* Add one more retry for fastopen.
 	 * Paired with WRITE_ONCE() in tcp_sock_set_syncnt()
 	 */
-	max_retries = READ_ONCE(icsk->icsk_syn_retries) ? :
-		READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_synack_retries) + 1;
+	max_retries = READ_ONCE(icsk->icsk_syn_retries);
+	if (!max_retries) {
+		max_retries = READ_ONCE(net->ipv4.sysctl_tcp_synack_retries);
+		max_retries++;
+	}
+	max_retries = min_t(int, max_retries, MAX_TCP_SYNACK_RETRIES);
 
 	if (req->num_timeout >= max_retries) {
 		tcp_write_err(sk);
@@ -488,7 +493,7 @@ static void tcp_fastopen_synack_timer(struct sock *sk, struct request_sock *req)
 	if (!tp->retrans_stamp)
 		tp->retrans_stamp = tcp_time_stamp_ts(tp);
 	tcp_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
-			  req->timeout << req->num_timeout, false);
+			     tcp_reqsk_timeout_sk(sk, req), false);
 }
 
 static bool tcp_rtx_probe0_timed_out(const struct sock *sk,
-- 
2.43.0



* Re: [PATCH net 2/3] net/sched: Handle TC_ACT_REDIRECT from qdisc filter chains
From: Jamal Hadi Salim @ 2026-06-28 12:28 UTC
  To: sashiko-reviews; +Cc: bpf, Linux Kernel Network Developers, Petr Machata
In-Reply-To: <20260627165220.096B61F00A3A@smtp.kernel.org>

On Sat, Jun 27, 2026 at 12:52 PM <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
> - [High] Removal of TC_ACT_REDIRECT support from qevents (tcf_qevent_handle) breaks existing eBPF redirect functionality.
> - [Low] pr_warn_once() used inside a static inline function in a header file will print multiple times.
> --
>
> commit d96e142c03f9c1a1e9ac2e1d8c8fe26605053b10
> Author: Jamal Hadi Salim <jhs@mojatatu.com>
>
> net/sched: Handle TC_ACT_REDIRECT from qdisc filter chains
>
> When a TC filter attached to a qdisc filter chain returns TC_ACT_REDIRECT,
> the redirect is silently lost. This patch adds a tcf_classify_qdisc()
> wrapper to convert TC_ACT_REDIRECT to TC_ACT_SHOT and drops the packet.
> It also removes the TC_ACT_REDIRECT case from tcf_qevent_handle() since
> it is handled by the wrapper.
>
> > diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
> > index 3bd08d7f39c18..3a542a72e9a55 100644
> > --- a/include/net/pkt_cls.h
> > +++ b/include/net/pkt_cls.h
> [ ... ]
> > @@ -159,6 +159,19 @@ static inline int tcf_classify(struct sk_buff *skb,
> >
> >  #endif
> >
> > +static inline int tcf_classify_qdisc(struct sk_buff *skb,
> > +                                  const struct tcf_proto *tp,
> > +                                  struct tcf_result *res, bool compat_mode)
> > +{
> > +     int ret = tcf_classify(skb, NULL, tp, res, compat_mode);
> > +
> > +     if (unlikely(ret == TC_ACT_REDIRECT)) {
> > +             pr_warn_once("TC_ACT_REDIRECT from qdisc filter chains is not supported; use eBPF with clsact or mirred redirect instead\n");
>
> [Severity: Low]
> Is this going to print multiple times instead of strictly once globally?
>
> Because tcf_classify_qdisc() is a static inline function in a header file,
> each C file including it (like the various qdisc modules) will create its own
> local instance of the static __already_done variable used by pr_warn_once().

This was intentional. I used WARN_ONCE() initially, but NIPA testing
may consider that as something "bad". In any case, this usage is
common, just ignore it.
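
For readers unfamiliar with the effect being discussed, here is a small
userspace C sketch. It is a model under stated assumptions, not the
kernel's pr_warn_once(): MODEL_WARN_ONCE() is a hypothetical macro, and
the two functions stand in for the copies of a static inline helper that
end up in two separate modules. Each expansion site gets its own static
flag, so the message appears once per site rather than once overall.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static int warnings_emitted;	/* counts messages across all sites */

/* Hypothetical once-only warning: the flag belongs to the expansion site. */
#define MODEL_WARN_ONCE(msg)					\
	do {							\
		static bool already_done;			\
		if (!already_done) {				\
			already_done = true;			\
			warnings_emitted++;			\
			fprintf(stderr, "%s\n", (msg));		\
		}						\
	} while (0)

/* Stands in for the inline helper as compiled into one module. */
static void classify_in_module_a(void)
{
	MODEL_WARN_ONCE("redirect from qdisc filter chain not supported");
}

/* Stands in for the same helper as compiled into another module. */
static void classify_in_module_b(void)
{
	MODEL_WARN_ONCE("redirect from qdisc filter chain not supported");
}
```

Calling each function several times leaves warnings_emitted at 2: one
message per site, which is the per-module behavior described above.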

> > +             ret = TC_ACT_SHOT;
> > +     }
> > +     return ret;
> > +}
> > +
> >  static inline unsigned long
> >  __cls_set_class(unsigned long *clp, unsigned long cl)
> >  {
> > diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> > index 3e67600a4a1a1..3ca56d060e28f 100644
> > --- a/net/sched/cls_api.c
> > +++ b/net/sched/cls_api.c
> > @@ -4033,7 +4033,7 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
> >
> >       fl = rcu_dereference_bh(qe->filter_chain);
> >
> > -     switch (tcf_classify(skb, NULL, fl, &cl_res, false)) {
> > +     switch (tcf_classify_qdisc(skb, fl, &cl_res, false)) {
>
> [Severity: High]
> Does this break existing eBPF redirect functionality for qevents?
>

Again, design intent. I will update the commit message to state this
clearly with the following text:
---
Remove the TC_ACT_REDIRECT case from tcf_qevent_handle().
skb_do_redirect() belongs in the BPF plumbing layer (net/core/), not
in net/sched/. The case was never consistent with the rest of the
qdisc classification infrastructure, where no classify function
handles TC_ACT_REDIRECT.
---

IMO, it appears to have been TheLinuxWay(tm), i.e., a cut-and-paste from
existing code on day 1 of the qevent introduction rather than a
deliberate design decision.
Ccing Petr to see if he thinks differently.

cheers,
jamal

> Prior to this patch, tcf_qevent_handle() correctly handled TC_ACT_REDIRECT
> by calling skb_do_redirect(skb). By using the tcf_classify_qdisc() wrapper
> here, TC_ACT_REDIRECT is unconditionally converted to TC_ACT_SHOT. This
> inadvertently drops packets that were legitimately redirecting from qevents.
>
> >       case TC_ACT_SHOT:
> >               qdisc_qstats_drop(sch);
> >               __qdisc_drop(skb, to_free);
> > @@ -4045,10 +4045,6 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
> >               __qdisc_drop(skb, to_free);
> >               *ret = __NET_XMIT_STOLEN;
> >               return NULL;
> > -     case TC_ACT_REDIRECT:
> > -             skb_do_redirect(skb);
> > -             *ret = __NET_XMIT_STOLEN;
> > -             return NULL;
> >       case TC_ACT_CONSUMED:
> >               *ret = __NET_XMIT_STOLEN;
> >               return NULL;
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260626165156.169012-1-jhs@mojatatu.com?part=2

^ permalink raw reply

* Re: [PATCH net 3/3] selftests/tc-testing: Verify bpf redirect on RED block with preceding clsact (egress) classifier
From: Jamal Hadi Salim @ 2026-06-28 12:36 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: bpf, Linux Kernel Network Developers, Victor Nogueira
In-Reply-To: <20260627165220.CA0CD1F000E9@smtp.kernel.org>

On Sat, Jun 27, 2026 at 12:52 PM <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [Medium] Hardcoded BPF object path 'action-ebpf' instead of using '$EBPFDIR/action-ebpf'.
> --
>
> commit f4ff308069725bd7a3b6c6f843ecb0fda5b73239
> Author: Victor Nogueira <victor@mojatatu.com>
>
> selftests/tc-testing: Verify bpf redirect on RED block with preceding clsact (egress) classifier
>
> This commit adds a test case to reproduce a NULL pointer dereference when a
> bpf filter attached to a RED block (which uses shared blocks) returns
> TC_ACT_REDIRECT while a preceding clsact egress classifier is present. The
> issue stems from a stack-allocated bpf_net_context being torn down before
> tcf_qevent_handle() calls skb_do_redirect().
>
> > diff --git a/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json b/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
> > index a1f97a4b606e8..762f86ceab1cd 100644
> > --- a/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
> > +++ b/tools/testing/selftests/tc-testing/tc-tests/infra/qdiscs.json
> [ ... ]
> > @@ -1540,5 +1540,37 @@
> [ ... ]
> > +    {
> > +        "id": "fb8d",
> > +        "name": "Verify bpf redirect on RED block with preceding clsact (egress) classifier",
> > +        "category": [
> > +            "qdisc",
> > +            "red",
> > +            "qevent",
> > +            "clsact"
> > +        ],
> > +        "plugins": {
> > +            "requires": "nsPlugin"
> > +        },
> > +        "setup": [
> > +            "$IP addr add 10.10.10.1/24 dev $DUMMY",
> > +            "$IP neigh add 10.10.10.2 lladdr 02:00:00:00:00:01 dev $DUMMY nud permanent",
> > +            "$TC qdisc add dev $DUMMY handle 1: root tbf rate 1Mbit burst 10K limit 1M",
> > +            "$TC qdisc add dev $DUMMY parent 1:1 handle 11: red limit 1M avpkt 1400 probability 1 burst 38 harddrop min 30000 max 30001 qevent early_drop block 10",
> > +            "$TC qdisc add dev $DUMMY clsact",
> > +            "$TC filter add dev $DUMMY egress protocol ip prio 1 matchall action gact pass",
> > +            "$TC filter add block 10 protocol ip prio 1 matchall action bpf obj action-ebpf sec action-redirect"
>
> [Severity: Medium]
> In the setup block for "Verify bpf redirect on RED block with preceding clsact
> (egress) classifier", will this test fail with a file-not-found error when
> the tdc.py runner is executed in an out-of-tree build environment or with a
> customized $EBPFDIR?
>
> Should this path be updated to use $EBPFDIR/action-ebpf instead of the
> hardcoded action-ebpf so it aligns with other tests in the suite and finds
> the BPF object correctly?

The $EBPFDIR issue seems to be legit. Victor?

cheers,
jamal

> > +        ],
> > +        "cmdUnderTest": "bash -c 'data=$(head -c 1400 /dev/zero | tr \"\\0\" \"x\"); exec 3>/dev/udp/10.10.10.2/12345; for i in $(seq 1 8000); do printf \"%s\" \"$data\" >&3; done; exit 0'",
> [ ... ]
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260626165156.169012-1-jhs@mojatatu.com?part=3

^ permalink raw reply

* [PATCH net] ppp: fix use-after-free reads in the stats ioctls.
From: Norbert Szetei @ 2026-06-28 12:44 UTC (permalink / raw)
  To: netdev
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-ppp, linux-kernel

ppp_get_stats() (SIOCGPPPSTATS) and the SIOCGPPPCSTATS handler, both
reached from ppp_net_siocdevprivate(), dereference state that other
ioctls free under the ppp lock, without taking it:

  - ppp_get_stats() reads ppp->vj; PPPIOCSMAXCID frees it with
    slhc_free() under ppp_lock().
  - SIOCGPPPCSTATS calls ->comp_stat()/->decomp_stat() on
    ppp->xc_state / ppp->rc_state; PPPIOCSCOMPRESS and ppp_ccp_closed()
    free those.

A concurrent stats ioctl can then read freed memory
(slab-use-after-free), and the freed contents are copied back to
userspace. This is reachable by a local user who has CAP_NET_ADMIN
privileges and read/write access to /dev/ppp.

Take the lock the freeing path holds around each access: the receive
lock in ppp_get_stats() (PPPIOCSMAXCID frees ppp->vj under ppp_lock(),
which includes it) and ppp_lock() around the SIOCGPPPCSTATS callbacks.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Norbert Szetei <norbert@doyensec.com>
---
 drivers/net/ppp/ppp_generic.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c
index 57c68efa5ff8..847c5e1793c8 100644
--- a/drivers/net/ppp/ppp_generic.c
+++ b/drivers/net/ppp/ppp_generic.c
@@ -1505,10 +1505,13 @@ ppp_net_siocdevprivate(struct net_device *dev, struct ifreq *ifr,

 	case SIOCGPPPCSTATS:
 		memset(&cstats, 0, sizeof(cstats));
+		/* protect against PPPIOCSCOMPRESS/ppp_ccp_closed() freeing the state */
+		ppp_lock(ppp);
 		if (ppp->xc_state)
 			ppp->xcomp->comp_stat(ppp->xc_state, &cstats.c);
 		if (ppp->rc_state)
 			ppp->rcomp->decomp_stat(ppp->rc_state, &cstats.d);
+		ppp_unlock(ppp);
 		if (copy_to_user(addr, &cstats, sizeof(cstats)))
 			break;
 		err = 0;
@@ -3303,7 +3306,7 @@ find_compressor(int type)
 static void
 ppp_get_stats(struct ppp *ppp, struct ppp_stats *st)
 {
-	struct slcompress *vj = ppp->vj;
+	struct slcompress *vj;
 	int cpu;

 	memset(st, 0, sizeof(*st));
@@ -3323,8 +3326,14 @@ ppp_get_stats(struct ppp *ppp, struct ppp_stats *st)
 	}
 	st->p.ppp_ierrors = ppp->dev->stats.rx_errors;
 	st->p.ppp_oerrors = ppp->dev->stats.tx_errors;
-	if (!vj)
+
+	/* protect against PPPIOCSMAXCID freeing ppp->vj */
+	ppp_recv_lock(ppp);
+	vj = ppp->vj;
+	if (!vj) {
+		ppp_recv_unlock(ppp);
 		return;
+	}
 	st->vj.vjs_packets = vj->sls_o_compressed + vj->sls_o_uncompressed;
 	st->vj.vjs_compressed = vj->sls_o_compressed;
 	st->vj.vjs_searches = vj->sls_o_searches;
@@ -3333,6 +3342,7 @@ ppp_get_stats(struct ppp *ppp, struct ppp_stats *st)
 	st->vj.vjs_tossed = vj->sls_i_tossed;
 	st->vj.vjs_uncompressedin = vj->sls_i_uncompressed;
 	st->vj.vjs_compressedin = vj->sls_i_compressed;
+	ppp_recv_unlock(ppp);
 }

 /*
--
2.54.0

^ permalink raw reply related

* [PATCH v3] net/sched: act_nat: only rewrite IPv4 packets
From: Samuel Moelius @ 2026-06-28 13:20 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Samuel Moelius, Jiri Pirko, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Herbert Xu,
	open list:TC subsystem, open list

act_nat can process packets whose protocol is not IPv4 and then
interpret the payload as an IPv4 header.  Non-IPv4 packets may be
modified based on unrelated bytes at the network header offset.

The action is documented as IPv4 NAT and should leave other protocols
alone.

Check skb->protocol before parsing and rewriting the IPv4 header.  This
keeps accepting hardware-accelerated VLAN IPv4 packets whose network
header already points at the IPv4 header, while still rejecting inline
VLAN packets because act_nat does not adjust the network header offset
before using ip_hdr(skb).

Fixes: b4219952356b ("[PKT_SCHED]: Add stateless NAT")
Assisted-by: Codex:gpt-5.5-cyber-preview
Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com>
---
Changes in v3:
  - Check skb->protocol
Changes in v2:
  - Check skb_protocol(skb, false)

 net/sched/act_nat.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/sched/act_nat.c b/net/sched/act_nat.c
index abb332dee836..1bf4a5853617 100644
--- a/net/sched/act_nat.c
+++ b/net/sched/act_nat.c
@@ -142,6 +142,9 @@ TC_INDIRECT_SCOPE int tcf_nat_act(struct sk_buff *skb,
 	egress = parms->flags & TCA_NAT_FLAG_EGRESS;

 	noff = skb_network_offset(skb);
+	if (skb->protocol != htons(ETH_P_IP))
+		goto out;
+
 	if (!pskb_may_pull(skb, sizeof(*iph) + noff))
 		goto drop;

-- 
2.43.0
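
For readers following along, the gate described in the commit message can
be sketched in userspace roughly as below. This is only an illustration:
sketch_nat_gate() and the SKETCH_* names are hypothetical, and the real
check operates on skb->protocol inside tcf_nat_act(). The only facts
assumed are the standard EtherType values (0x0800 for IPv4).

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

/* Standard EtherType for IPv4; the kernel calls this ETH_P_IP. */
#define SKETCH_ETH_P_IP 0x0800

/* Hypothetical verdicts, standing in for "goto out" vs. "continue to rewrite". */
enum sketch_verdict {
	SKETCH_PASS_UNTOUCHED,	/* not IPv4: leave the packet alone */
	SKETCH_REWRITE,		/* IPv4: header may be parsed and rewritten */
};

/* protocol_be is in network byte order, like skb->protocol. */
static enum sketch_verdict sketch_nat_gate(uint16_t protocol_be)
{
	if (protocol_be != htons(SKETCH_ETH_P_IP))
		return SKETCH_PASS_UNTOUCHED;
	return SKETCH_REWRITE;
}
```

With this shape, a packet carrying an inline VLAN tag (outer EtherType
0x8100) is passed through untouched, which matches the behaviour the
commit message describes for inline VLAN packets.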

^ permalink raw reply related

* Re: [PATCH] net/sched: drr: reseed active class deficit after quantum changes
From: Samuel Moelius @ 2026-06-28 13:40 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jamal Hadi Salim, Jiri Pirko, David S. Miller, Eric Dumazet,
	Paolo Abeni, Simon Horman, open list:TC subsystem, open list
In-Reply-To: <20260609185625.6e4bb757@kernel.org>

On Tue, Jun 9, 2026 at 9:56 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue,  9 Jun 2026 00:36:18 +0000 Samuel Moelius wrote:
> > Subject: [PATCH] net/sched: drr: reseed active class deficit after quantum changes
>
> If the change is not for a serious bug it needs to be generated against
> net-next. This was generated against Linus's tree I guess and doesn't
> apply to -next.
>
> > Changing the quantum of an active DRR class leaves the old deficit in
> > place.  The next scheduling round can therefore use credit accumulated
> > under a different quantum.
> >
> > This can be observed by making a class active, changing its quantum, and
> > then dequeuing with the old deficit still present.
> >
> > When an active class quantum changes, reseed its deficit from the new
> > quantum so the changed class weight is reflected immediately.
>
> TBH the current implementation is how I would expect DRR to work.
> quantum is the "refill" value, it should not reset the state of
> the current round? It wouldn't in an ASIC.

Your point is well taken. Thank you for the feedback.

I withdraw the patch.
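
For the archive, a minimal userspace sketch of the behaviour Jakub
describes, where the quantum is only the per-round refill and changing it
leaves the current deficit alone. The struct and function names are
hypothetical and this is not the sch_drr code, only a model of the
deficit arithmetic.

```c
#include <assert.h>

/* Hypothetical model of one DRR class: refill value plus remaining credit. */
struct drr_class_sketch {
	unsigned int quantum;	/* credit added each time the class is revisited */
	unsigned int deficit;	/* credit left in the current round */
};

/*
 * Try to send a packet of 'len' bytes. Returns 1 if the class had enough
 * credit. Otherwise the class is refilled with the current quantum (its
 * turn ends) and 0 is returned.
 */
static int drr_try_send(struct drr_class_sketch *cl, unsigned int len)
{
	if (len <= cl->deficit) {
		cl->deficit -= len;
		return 1;
	}
	cl->deficit += cl->quantum;
	return 0;
}

/* Changing the quantum touches only the refill value, not the deficit. */
static void drr_change_quantum(struct drr_class_sketch *cl, unsigned int quantum)
{
	cl->quantum = quantum;
}
```

In this model a class that still has 400 bytes of credit keeps them after
the quantum is lowered, and the new quantum only shows up at the next
refill.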

^ permalink raw reply

* Re: [PATCH v2 1/1] xfrm: nat_keepalive: avoid double free on send error
From: Eyal Birger @ 2026-06-28 13:42 UTC (permalink / raw)
  To: Ren Wei
  Cc: netdev, steffen.klassert, herbert, davem, yuantan098, bird,
	qianyuluo3
In-Reply-To: <20260625055513.1841167-1-n05ec@lzu.edu.cn>

On Wed, Jun 24, 2026 at 10:55 PM Ren Wei <n05ec@lzu.edu.cn> wrote:
>
> From: Qianyu Luo <qianyuluo3@gmail.com>
>
> nat_keepalive_send() frees the keepalive skb whenever the IPv4 or IPv6
> send helper reports an error.
>
> That cleanup is only correct before the skb is handed to the output
> path. Once ip_build_and_send_pkt() or ip6_xmit() takes ownership, the
> networking stack may already have consumed the skb before returning an
> error, so freeing it again is unsafe.
>
> Handle the pre-handoff failure cases inside nat_keepalive_send_ipv4()
> and nat_keepalive_send_ipv6(), where the caller still owns the skb, and
> keep nat_keepalive_send() responsible only for family dispatch and the
> unsupported-family cleanup path.
>
> Fixes: f531d13bdfe3 ("xfrm: support sending NAT keepalives in ESP in UDP states")

Thanks for the fix!

Reviewed-by: Eyal Birger <eyal.birger@gmail.com>

^ permalink raw reply

* [PATCH v3] net/sched: dualpi2: clear stale classification on filter miss
From: Samuel Moelius @ 2026-06-28 13:48 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Samuel Moelius, Jiri Pirko, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Olga Albisser,
	Koen De Schepper, Henrik Steen, Olivier Tilmans,
	open list:TC subsystem, open list

DualPI2 leaves previous classification state attached to an skb when
filter classification returns no match.  The enqueue path can then act
on stale state from an earlier classification attempt.

A filter miss should fall back to the default class without reusing old
per-packet classification data.

Initialize the classification result to CLASSIC before running the
classifier.  Explicit L4S, priority, and successful filter
classification can still override that default.

Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc")
Assisted-by: Codex:gpt-5.5-cyber-preview
Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com>
---
Changes in v3:
  - Improve readability
Changes in v2:
  - Add fixes tag

 net/sched/sch_dualpi2.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
index 5434df6ca8ef..27088760eff4 100644
--- a/net/sched/sch_dualpi2.c
+++ b/net/sched/sch_dualpi2.c
@@ -346,6 +346,8 @@ static int dualpi2_skb_classify(struct dualpi2_sched_data *q,
 	struct tcf_proto *fl;
 	int result;
 
+	cb->classified = DUALPI2_C_CLASSIC;
+
 	dualpi2_read_ect(skb);
 	if (cb->ect & q->ecn_mask) {
 		cb->classified = DUALPI2_C_L4S;
@@ -359,10 +361,8 @@ static int dualpi2_skb_classify(struct dualpi2_sched_data *q,
 	}
 
 	fl = rcu_dereference_bh(q->tcf_filters);
-	if (!fl) {
-		cb->classified = DUALPI2_C_CLASSIC;
+	if (!fl)
 		return NET_XMIT_SUCCESS;
-	}
 
 	result = tcf_classify(skb, NULL, fl, &res, false);
 	if (result >= 0) {
-- 
2.43.0
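
The pattern in this patch (set the default first, let later steps
override it) can be sketched outside the kernel as below. All names are
hypothetical, and the model only covers the L4S and filter paths, not
the priority check in the real function.

```c
#include <assert.h>

/* Hypothetical classes, loosely mirroring CLASSIC and L4S plus one filter-chosen class. */
enum sketch_class {
	SKETCH_C_CLASSIC,
	SKETCH_C_L4S,
	SKETCH_C_FILTERED,
};

/* Stand-in for the per-packet control block that survives between calls. */
struct sketch_cb {
	enum sketch_class classified;
};

/*
 * filter_result < 0 models a filter miss. Because the default is written
 * before anything else, no return path can leave a stale value behind.
 */
static void sketch_classify(struct sketch_cb *cb, int is_l4s,
			    int have_filter, int filter_result)
{
	cb->classified = SKETCH_C_CLASSIC;

	if (is_l4s) {
		cb->classified = SKETCH_C_L4S;
		return;
	}
	if (!have_filter)
		return;
	if (filter_result >= 0)
		cb->classified = SKETCH_C_FILTERED;
}
```

A control block that still holds SKETCH_C_L4S from an earlier attempt
ends up as SKETCH_C_CLASSIC after a filter miss, which is the case the
commit message is about.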


^ permalink raw reply related

* Re: [PATCH] net/sched: codel: refresh CAN_BYPASS when limit changes
From: Samuel Moelius @ 2026-06-28 13:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jamal Hadi Salim, Jiri Pirko, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Simon Horman, open list:TC subsystem, open list
In-Reply-To: <CANn89iJdrwPtMkexacercdV+2qDAZ=JYwGjW8xA1aKZTk3muhg@mail.gmail.com>

On Tue, Jun 9, 2026 at 11:28 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Jun 8, 2026 at 5:12 PM Samuel Moelius
> <sam.moelius@trailofbits.com> wrote:
> >
> > sch_codel and sch_fq_codel update their packet limit without refreshing
> > the queue bypass state.  Changing the limit to zero can leave CAN_BYPASS
> > set from the previous configuration.
> >
> > The enqueue path can then bypass limit enforcement even though the new
> > limit should prevent queued packets.
> >
> > Recompute the bypass flag whenever the configured limit changes.
> >
> > Assisted-by: Codex:gpt-5.5-cyber-preview
> > Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com>
> > ---
>
> We can't change sch->flags in a change() method even under qdisc
> spinlock, look at __dev_xmit_skb() which reads q->flags locklessly.

Your point is well taken. Thank you for the feedback.

I withdraw this patch.

^ permalink raw reply

* Re: [PATCH] net: usb: rtl8150: handle link status read failures
From: Petko Manolov @ 2026-06-28 15:18 UTC (permalink / raw)
  To: Yousef Alhouseen
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-usb, netdev, linux-kernel, stable,
	syzbot+9db6c624635564ad813c
In-Reply-To: <20260628093929.44214-1-alhouseenyousef@gmail.com>

On 26-06-28 11:39:29, Yousef Alhouseen wrote:
> set_carrier() ignores the result of the USB control transfer and tests the
> stack variable supplied as its receive buffer. If the device rejects or aborts
> the request, that variable remains uninitialized and the driver chooses an
> arbitrary carrier state.
> 
> Report carrier down when the link status cannot be read.
> 
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Reported-by: syzbot+9db6c624635564ad813c@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=9db6c624635564ad813c
> Cc: stable@vger.kernel.org
> Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
> ---
>  drivers/net/usb/rtl8150.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/usb/rtl8150.c b/drivers/net/usb/rtl8150.c
> index c880c95c41a5..5606490aaea0 100644
> --- a/drivers/net/usb/rtl8150.c
> +++ b/drivers/net/usb/rtl8150.c
> @@ -732,7 +732,11 @@ static void set_carrier(struct net_device *netdev)
>  	rtl8150_t *dev = netdev_priv(netdev);
>  	short tmp;
>  
> -	get_registers(dev, CSCR, 2, &tmp);
> +	if (get_registers(dev, CSCR, 2, &tmp)) {
> +		netif_carrier_off(netdev);
> +		return;
> +	}
> +
>  	if (tmp & CSCR_LINK_STATUS)
>  		netif_carrier_on(netdev);
>  	else

I'd rather do something along these lines:

@@ -732,7 +732,9 @@ static void set_carrier(struct net_device *netdev)
 	rtl8150_t *dev = netdev_priv(netdev);
 	short tmp;
 
-	get_registers(dev, CSCR, 2, &tmp);
+	if (get_registers(dev, CSCR, 2, &tmp))
+		return;
+
 	if (tmp & CSCR_LINK_STATUS)
 		netif_carrier_on(netdev);
 	else

IIRC it is possible for the message to get lost in bus noise while the device is
still operating correctly.  So if my memory isn't failing me, it is better to
not do anything if usb_control_msg_recv() is non-zero and only change the
carrier status if 'tmp' contains a meaningful value.


		Petko
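
A standalone sketch of that suggestion, for illustration only. The
fake_* names are hypothetical mocks rather than the driver's helpers,
and the bit value used for CSCR_LINK_STATUS is an assumption made for
the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit position, for illustration only. */
#define CSCR_LINK_STATUS (1 << 3)

/* Hypothetical mock of the device: last carrier state plus canned read results. */
struct fake_dev {
	bool carrier;	/* carrier state last reported to the stack */
	int read_err;	/* non-zero: the mock register read fails with this error */
	uint16_t cscr;	/* register value delivered on a successful read */
};

/* Fills *val only on success, like a control transfer that may fail. */
static int fake_get_registers(const struct fake_dev *d, uint16_t *val)
{
	if (d->read_err)
		return d->read_err;
	*val = d->cscr;
	return 0;
}

static void fake_set_carrier(struct fake_dev *d)
{
	uint16_t tmp;

	/* On a failed read, keep the previous carrier state and never look at tmp. */
	if (fake_get_registers(d, &tmp))
		return;

	d->carrier = (tmp & CSCR_LINK_STATUS) != 0;
}
```

With this shape, a transient read error leaves the carrier as it was,
and only a successful read can switch it on or off.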

^ permalink raw reply

* Re: [RFC] xdp: add device context to bpf_xdp_link_attach_failed tracepoint
From: Leon Hwang @ 2026-06-28 15:26 UTC (permalink / raw)
  To: Masashi Honma, netdev, bpf, linux-trace-kernel
  Cc: ast, daniel, kuba, hawk, andrii, rostedt, mhiramat, edumazet,
	pabeni, linux-kernel
In-Reply-To: <CAFk-A4mE9Jweo2hfX7y_85xbPyt0FqpMT1EvqX1OcYZ=LTLgRA@mail.gmail.com>

On 2026/6/28 19:39, Masashi Honma wrote:
> Hello, I am re-posting this mail because I forgot to add [RFC].
> 
> The bpf_xdp_link_attach_failed tracepoint (added in commit bf4ea1d0b2cb
> "xdp: Add tracepoint for xdp attaching failure") exposes the netlink
> extack message produced when attaching an XDP program via BPF_LINK_CREATE
> fails. This is useful because, unlike the netlink attach path, the

I really appreciate that the XDP tracepoint helped someone.

> bpf_link attach path does not return the extack to userspace -- the caller
> only gets an errno (e.g. EINVAL/ERANGE).
> 
> We would like to use this in Cilium [1][2]: when attaching the XDP
> datapath program fails, surface the kernel's reason (e.g. "single-buffer
> XDP requires MTU less than ...") in the agent logs instead of an opaque
> errno, so operators don't have to inspect dmesg on the host.
> 
> The limitation we hit is that the tracepoint only carries the message
> string, so a consumer cannot tell which device a failure belongs to.
> This matters for two reasons:
> 
>   1. Correlation: with only the message, a consumer cannot reliably
>       attribute a failure to a specific attach, particularly if multiple
>       XDP attaches happen concurrently.
>   2. Scoping: a consumer watching this tracepoint sees XDP attach
>       failures system-wide and cannot limit them to the devices it
>       manages.
> 
> At the call site (bpf_xdp_link_attach() in net/core/dev.c) the net_device
> is in scope, so exposing it looks straightforward:
> 
>   TRACE_EVENT(bpf_xdp_link_attach_failed,
>       TP_PROTO(const char *msg, const struct net_device *dev),
>       TP_ARGS(msg, dev),
>       TP_STRUCT__entry(
>           __string(msg, msg)
>           __field(int, ifindex)
>       ),
>       TP_fast_assign(
>           __assign_str(msg);
>           __entry->ifindex = dev->ifindex;
>       ),
>       TP_printk("ifindex=%d errmsg=%s", __entry->ifindex, __get_str(msg))
>   );
> 
>   - trace_bpf_xdp_link_attach_failed(extack._msg);
>   + trace_bpf_xdp_link_attach_failed(extack._msg, dev);
> 
> Before sending a formal patch I'd appreciate guidance on a few points:
> 
>   - Should the tracepoint take const struct net_device *dev (consistent
>     with the other tracepoints in this file, and lets TP_printk show the
>     device), or just the ifindex as an int (simpler for raw_tp BPF
>     consumers, which otherwise read dev->ifindex via CO-RE)?
> 
>   - For raw_tp consumers the argument order is effectively ABI: prepending
>     dev would shift the existing msg argument. I've appended dev above to
>     keep msg at args[0]. Is preserving the existing argument position the
>     right call, or is reordering acceptable given how new and rarely
>     consumed this tracepoint is?
> 

Good concerns. I'm not sure about these parts.

>   - Is extending the existing tracepoint preferred, or would you rather
>     keep it as-is and expose the device context some other way?
> 

I'm planning to retire this tracepoint, but I don't think I can do that
if a user space application relies on it.

I'm planning to add BPF syscall common attributes support for
BPF_LINK_CREATE, including XDP links. That way, the kernel will be
able to propagate 'extack._msg' back to user space when it fails to
create an XDP link. The user space library will then be able to get
the error message alongside the errno.

Thanks,
Leon

> This would be my first XDP/BPF tracepoint change, so any direction is
> welcome. I'm happy to send a proper patch once the shape is agreed.
> 
> Regards,
> Masashi Honma
> 
> [1] https://github.com/cilium/cilium/issues/40777
> [2] https://github.com/cilium/cilium/pull/46546


^ permalink raw reply

* Re: [PATCH net 2/2] net: pse-pd: guard against freed PI data on regulator disable
From: Carlo Szelinsky @ 2026-06-28 15:31 UTC (permalink / raw)
  To: Simon Horman
  Cc: Oleksij Rempel, Kory Maincent, Andrew Lunn, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev, linux-kernel,
	Carlo Szelinsky
In-Reply-To: <20260615180048.828053-1-github@szelinsky.de>

Hi Simon,

Gentle ping on this one... I think I'm just waiting on your read before
I send v2, and I'd like to get it unblocked :-)

v2 would take pcdev->lock around the kfree() + pcdev->pi = NULL in
pse_release_pis() so the NULL is authoritative, and add the same
!pcdev->pi guard to pse_pi_is_enabled() and pse_pi_enable().

Two things I'd value your view on before I send the next version:

 - Is the contained fix (lock around the free) ok, or would you prefer
   the regulator unregister reordered ahead of pse_release_pis()?

 - I couldn't find a concrete consumer hitting a regulator op on
   another CPU during unbind, so I'd describe it as a narrow window
   rather than a proven race. Does that sound right to you?

Happy to just send v2 with the lock fix if that works for you.

Thanks,
Carlo

^ permalink raw reply

* [PATCH v2] net: usb: rtl8150: handle link status read failures
From: Yousef Alhouseen @ 2026-06-28 16:25 UTC (permalink / raw)
  To: Petko Manolov, Andrew Lunn, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: linux-usb, netdev, linux-kernel, stable,
	syzbot+9db6c624635564ad813c, Yousef Alhouseen
In-Reply-To: <20260628151835.GC14404@carbon.k.g>

set_carrier() ignores the result of the USB control transfer and tests
the stack variable supplied as its receive buffer. If the device rejects
or aborts the request, that variable remains uninitialized and the driver
chooses an arbitrary carrier state.

Leave the existing carrier state unchanged when the link status cannot be
read. A transient USB error should not be treated as link loss.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot+9db6c624635564ad813c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9db6c624635564ad813c
Cc: stable@vger.kernel.org
Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
---
 drivers/net/usb/rtl8150.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/usb/rtl8150.c b/drivers/net/usb/rtl8150.c
index c880c95c41a5..d51e43170e03 100644
--- a/drivers/net/usb/rtl8150.c
+++ b/drivers/net/usb/rtl8150.c
@@ -732,7 +732,9 @@ static void set_carrier(struct net_device *netdev)
 	rtl8150_t *dev = netdev_priv(netdev);
 	short tmp;
 
-	get_registers(dev, CSCR, 2, &tmp);
+	if (get_registers(dev, CSCR, 2, &tmp))
+		return;
+
 	if (tmp & CSCR_LINK_STATUS)
 		netif_carrier_on(netdev);
 	else
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH v2] net: usb: rtl8150: handle link status read failures
From: Andrew Lunn @ 2026-06-28 16:40 UTC (permalink / raw)
  To: Yousef Alhouseen
  Cc: Petko Manolov, Andrew Lunn, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, linux-usb, netdev, linux-kernel,
	stable, syzbot+9db6c624635564ad813c
In-Reply-To: <20260628162528.8273-1-alhouseenyousef@gmail.com>

On Sun, Jun 28, 2026 at 06:25:28PM +0200, Yousef Alhouseen wrote:
> set_carrier() ignores the result of the USB control transfer and tests
> the stack variable supplied as its receive buffer. If the device rejects
> or aborts the request, that variable remains uninitialized and the driver
> chooses an arbitrary carrier state.
> 
> Leave the existing carrier state unchanged when the link status cannot be
> read. A transient USB error should not be treated as link loss.
> 
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Reported-by: syzbot+9db6c624635564ad813c@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=9db6c624635564ad813c
> Cc: stable@vger.kernel.org

https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html

Does this issue bother people?

I think it would be better to submit to net-next:

https://www.kernel.org/doc/html/latest/process/maintainer-netdev.html

Please don't thread patch versions together.

Also, a Suggested-by: might be appropriate here.

    Andrew

---
pw-bot: cr

^ permalink raw reply

* Re: iproute2: trailing whitespace in man pages
From: Bjarni Ingi Gislason @ 2026-06-28 16:45 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20260626081534.70b9f2ff@phoenix.local>

grep -n -e ' $' -e '\\~$' -e ' \\f.$' -e ' \\"' -e ' "$' <file>
 
Added option -e ' "$' for the last argument of macros.

^ permalink raw reply

* [PATCH net-next v3] net: usb: rtl8150: handle link status read failures
From: Yousef Alhouseen @ 2026-06-28 16:50 UTC (permalink / raw)
  To: Petko Manolov, Andrew Lunn, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: linux-usb, netdev, linux-kernel, syzbot+9db6c624635564ad813c,
	Yousef Alhouseen

set_carrier() ignores the result of the USB control transfer and tests
the stack variable supplied as its receive buffer. If the device rejects
or aborts the request, that variable remains uninitialized and the driver
chooses an arbitrary carrier state.

Leave the existing carrier state unchanged when the link status cannot be
read. A transient USB error should not be treated as link loss.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot+9db6c624635564ad813c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9db6c624635564ad813c
Suggested-by: Petko Manolov <petkan@nucleusys.com>
Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
---
 drivers/net/usb/rtl8150.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/usb/rtl8150.c b/drivers/net/usb/rtl8150.c
index c880c95c41a5..d51e43170e03 100644
--- a/drivers/net/usb/rtl8150.c
+++ b/drivers/net/usb/rtl8150.c
@@ -732,7 +732,9 @@ static void set_carrier(struct net_device *netdev)
 	rtl8150_t *dev = netdev_priv(netdev);
 	short tmp;
 
-	get_registers(dev, CSCR, 2, &tmp);
+	if (get_registers(dev, CSCR, 2, &tmp))
+		return;
+
 	if (tmp & CSCR_LINK_STATUS)
 		netif_carrier_on(netdev);
 	else
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH iproute2-next 0/7] devlink: add per-port resource support
From: David Ahern @ 2026-06-28 17:14 UTC (permalink / raw)
  To: Tariq Toukan, Stephen Hemminger, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andrew Lunn, David S. Miller
  Cc: Donald Hunter, Simon Horman, Jiri Pirko, Jonathan Corbet,
	Shuah Khan, Saeed Mahameed, Leon Romanovsky, Mark Bloch,
	Shuah Khan, Matthieu Baerts (NGI0), Chuck Lever, Or Har-Toov,
	Carolina Jubran, Moshe Shemesh, Shay Drori, Dragos Tatulea,
	Daniel Zahka, Shahar Shitrit, Jacob Keller, Cosmin Ratiu,
	Parav Pandit, Kees Cook, Adithya Jayachandran, Daniel Jurgens,
	netdev, linux-kernel, linux-doc, linux-rdma, linux-kselftest,
	Gal Pressman, Ido Schimmel, Jiri Pirko, Petr Machata
In-Reply-To: <20260609053953.487152-1-tariqt@nvidia.com>

On 6/8/26 11:39 PM, Tariq Toukan wrote:
> Hi,
> 
> Currently, devlink resource show only supports querying a specific
> device and displays device-level resources. However, some resources
> are per-port, such as the maximum number of SFs that can be created
> on a specific PF port.
> 
> This series extends devlink resource show with full support for
> port-level resources, including a dump mode, per-port querying syntax,
> and scope filtering. In preparation for these features, the first two
> patches refactor how dpipe tables are handled to unblock dump support
> and ensure errors in secondary queries are non-fatal.
> 
> The series is organized as follows:
> 
> Patch 1 splits the dpipe tables display into a separate function.
> 
> Patch 2 moves the dpipe tables query into the per-device resource show
> callback, ensuring it behaves correctly during a multi-device dump.
> 
> Patch 3 fixes a pre-existing memory leak in resource_ctx_fini.
> 
> Patch 4 adds dump support to resource show (no device required).
> 
> Patch 5 shows port-level resources returned in a dump reply.
> 
> Patch 6 adds DEV/PORT_INDEX syntax to resource show.
> 
> Patch 7 adds scope filter to resource show.
> 
> With this series, users can query resources at all levels:
> 
> $ devlink resource show
> pci/0000:03:00.0:
>   name local_max_SFs size 508 unit entry
>   name external_max_SFs size 508 unit entry
> pci/0000:03:00.0/196608:
>   name max_SFs size 20 unit entry
> 
> $ devlink resource show scope dev
> pci/0000:03:00.0:
>   name local_max_SFs size 508 unit entry
>   name external_max_SFs size 508 unit entry
> 
> $ devlink resource show scope port
> pci/0000:03:00.0/196608:
>   name max_SFs size 20 unit entry
> 
> $ devlink resource show pci/0000:03:00.0/196608
> pci/0000:03:00.0/196608:
>   name max_SFs size 20 unit entry
> 
> This series is the userspace counterpart to the kernel series:
> https://lore.kernel.org/all/20260407194107.148063-1-tariqt@nvidia.com/
> 
> Ido Schimmel (2):
>   devlink: Split dpipe tables output to a separate function
>   devlink: Move dpipe tables query to resources show callback
> 
> Or Har-Toov (5):
>   devlink: fix memory leak in resource_ctx_fini
>   devlink: add dump support for resource show
>   devlink: show port resources in resource dump
>   devlink: add per-port resource show support
>   devlink: add scope filter to resource show
> 
>  bash-completion/devlink     |   8 ++
>  devlink/devlink.c           | 202 +++++++++++++++++++++++++++---------
>  man/man8/devlink-resource.8 |  34 +++++-
>  3 files changed, 192 insertions(+), 52 deletions(-)
> 
> 
> base-commit: 7340b539841dc739bc0b813e8e86825bc1eb5a4c

applied to iproute2-next with the fixup recommended by Claude and
confirmed by Or

^ permalink raw reply

* Re: [PATCH iproute2-next] devlink: support u32-array values in devlink param show/set
From: David Ahern @ 2026-06-28 17:19 UTC (permalink / raw)
  To: Ratheesh Kannoth, stephen, kuba, linux-kernel, netdev
  Cc: andrew+netdev, davem, edumazet, pabeni, sgoutham
In-Reply-To: <20260615041042.549715-1-rkannoth@marvell.com>

On 6/14/26 10:10 PM, Ratheesh Kannoth wrote:
> @@ -3904,6 +3935,14 @@ static int cmd_dev_param_set(struct dl *dl)
>  		if (!strcmp(dl->opts.param_value, ctx.value.vstr))
>  			return 0;
>  		break;
> +	case 129:

no magic numbers. What does 129 represent? Is there a named macro for
it? If not, why not if this is part of a UAPI?

> +		buf = (char *)dl->opts.param_value;
> +		token = strtok(buf,  delim);
> +		while (token) {
> +			mnl_attr_put_u32(nlh, DEVLINK_ATTR_PARAM_VALUE_DATA, atoi(token));
> +			token = strtok(NULL, delim);
> +		}
> +		break;
>  	default:
>  		printf("Value type not supported\n");
>  		return -ENOTSUP;
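
To illustrate the "no magic numbers" point, one possible shape is
sketched below. Everything here is an assumption: the constant name is
hypothetical (the value 129 is simply taken from the quoted hunk, its
meaning is not known from this thread), the delimiter is passed in by
the caller, and the strtoul() validation is an addition that the quoted
patch does not have.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical name; a real one would have to come from the UAPI header. */
enum { EXAMPLE_PARAM_TYPE_U32_ARRAY = 129 };

/*
 * Split 'buf' on 'delim' and parse each token as a u32 into out[].
 * Returns the number of values parsed, or a negative errno.
 * Note that strtok() modifies 'buf' in place.
 */
static int example_parse_u32_array(char *buf, const char *delim,
				   uint32_t *out, size_t max)
{
	size_t n = 0;
	char *token;

	for (token = strtok(buf, delim); token; token = strtok(NULL, delim)) {
		unsigned long v;
		char *end;

		if (n == max)
			return -E2BIG;
		/* strtoul() accepts a leading '-', so reject it explicitly. */
		if (strchr(token, '-'))
			return -EINVAL;
		errno = 0;
		v = strtoul(token, &end, 0);
		if (errno || end == token || *end != '\0' || v > UINT32_MAX)
			return -EINVAL;
		out[n++] = (uint32_t)v;
	}
	return (int)n;
}
```

A caller could then switch on the named constant and put each parsed
value into the netlink message, rejecting malformed input up front
instead of silently treating it as 0 the way atoi() does.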


^ permalink raw reply

* Re: [PATCH iproute2-next v3] rdma: display resource limits in curr/max format
From: David Ahern @ 2026-06-28 17:22 UTC (permalink / raw)
  To: Tao Cui, leonro; +Cc: linux-rdma, netdev, Tao Cui
In-Reply-To: <20260615005315.169582-1-cui.tao@linux.dev>

On 6/14/26 6:53 PM, Tao Cui wrote:
> diff --git a/rdma/include/uapi/rdma/rdma_netlink.h b/rdma/include/uapi/rdma/rdma_netlink.h
> index 4356ec4a..e5b8b065 100644
> --- a/rdma/include/uapi/rdma/rdma_netlink.h
> +++ b/rdma/include/uapi/rdma/rdma_netlink.h
> @@ -604,6 +604,11 @@ enum rdma_nldev_attr {
>  	RDMA_NLDEV_ATTR_FRMR_POOL_PINNED_HANDLES,	/* u32 */
>  	RDMA_NLDEV_ATTR_FRMR_POOL_KEY_KERNEL_VENDOR_KEY,	/* u64 */
>  
> +	/*
> +	 * Resource summary entry maximum value.
> +	 */
> +	RDMA_NLDEV_ATTR_RES_SUMMARY_ENTRY_MAX,		/* u64 */

I do not see this uapi in Linus' tree. What is the status of the kernel
commit? Put a reference to the kernel patches in the commit message.


^ permalink raw reply

* Re: [PATCH iproute-next v3] ipaddress: add support for showing IPv4 devconf attributes
From: David Ahern @ 2026-06-28 17:30 UTC (permalink / raw)
  To: Fernando Fernandez Mancera, netdev
  Cc: stephen, davem, edumazet, kuba, pabeni, horms
In-Reply-To: <20260614182515.8765-1-fmancera@suse.de>

On 6/14/26 12:25 PM, Fernando Fernandez Mancera wrote:
> This patch introduces support for showing IPv4 devconf attributes on
> detailed output of an interface e.g "ip -d link show dev enp1s0".
> 
> Additionally, this refactors 'print_af_spec()' to sequentially process
> both AF_INET and AF_INET6 attributes rather than returning early if
> AF_INET6 is missing.

Refactors should be a separate patch.

> 
> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
> ---
> v2: changed print_string to print_bool for boolean attributes
> v3: use print_bool for JSON output only
> ---
>  ip/ipaddress.c | 313 ++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 273 insertions(+), 40 deletions(-)
> 
> diff --git a/ip/ipaddress.c b/ip/ipaddress.c
> index 6017bc83..1530b836 100644
> --- a/ip/ipaddress.c
> +++ b/ip/ipaddress.c
> @@ -23,6 +23,7 @@
>  #include <linux/netdevice.h>
>  #include <linux/if_arp.h>
>  #include <linux/if_infiniband.h>
> +#include <linux/ip.h>
>  #include <linux/sockios.h>
>  #include <linux/net_namespace.h>
>  
> @@ -294,53 +295,285 @@ static void print_linktype(FILE *fp, struct rtattr *tb)
>  	close_json_object();
>  }
>  
> +static void print_inet(FILE *fp, struct rtattr *inet_attr)
> +{
> +	struct rtattr *tb[IFLA_INET_MAX + 1];
> +
> +	parse_rtattr_nested(tb, IFLA_INET_MAX, inet_attr);
> +
> +	if (tb[IFLA_INET_CONF]) {
> +		int *conf = RTA_DATA(tb[IFLA_INET_CONF]);
> +		int max_elements = RTA_PAYLOAD(tb[IFLA_INET_CONF]) / sizeof(int);
> +
> +		if (max_elements >= IPV4_DEVCONF_FORWARDING) {
> +			print_bool(PRINT_JSON, "forwarding", NULL,
> +				   conf[IPV4_DEVCONF_FORWARDING - 1]);
> +			print_string(PRINT_FP, "forwarding", "forwarding %s ",
> +				     conf[IPV4_DEVCONF_FORWARDING - 1] ? "on" : "off");
> +		}
> +
> +		if (max_elements >= IPV4_DEVCONF_MC_FORWARDING) {
> +			print_bool(PRINT_JSON, "mc_forwarding", NULL,
> +				   conf[IPV4_DEVCONF_MC_FORWARDING - 1]);
> +			print_string(PRINT_FP, "mc_forwarding", "mc_forwarding %s ",
> +				     conf[IPV4_DEVCONF_MC_FORWARDING - 1] ? "on" : "off");
> +		}
> +
> +		if (max_elements >= IPV4_DEVCONF_PROXY_ARP) {
> +			print_bool(PRINT_JSON, "proxy_arp", NULL,
> +				   conf[IPV4_DEVCONF_PROXY_ARP - 1]);
> +			print_string(PRINT_FP, "proxy_arp", "proxy_arp %s ",
> +				     conf[IPV4_DEVCONF_PROXY_ARP - 1] ? "on" : "off");
> +		}
> +
> +		if (max_elements >= IPV4_DEVCONF_ACCEPT_REDIRECTS) {
> +			print_bool(PRINT_JSON, "accept_redirects", NULL,
> +				   conf[IPV4_DEVCONF_ACCEPT_REDIRECTS - 1]);
> +			print_string(PRINT_FP, "accept_redirects",
> +				     "accept_redirects %s ",
> +				     conf[IPV4_DEVCONF_ACCEPT_REDIRECTS - 1] ? "on" : "off");

As I stated in the last patch for devconf:

"iproute2 follows netdev with coding standards and those need to be
followed as long as humans are in the loop. Please make sure follow on
patches adhere to roughly 80 columns with a little extra if it improves
readability (and of course strings are not broken across lines)."

For example, use print_on_off() for these, or use a temp variable for the
attributes.
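
One way the temp-variable suggestion could look is sketched below. It is a
self-contained user-space model, not iproute2 code: the EX_DEVCONF_* enum,
ex_print_on_off() and ex_print_inet_conf() are hypothetical stand-ins for the
IPV4_DEVCONF_* indices and the library's print helpers, and the sketch writes
into a buffer rather than through the JSON/FP printing layer. A small table
plus a local variable for the value keeps every line under 80 columns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the 1-based IPV4_DEVCONF_* indices. */
enum {
	EX_DEVCONF_FORWARDING = 1,
	EX_DEVCONF_MC_FORWARDING,
	EX_DEVCONF_PROXY_ARP,
	EX_DEVCONF_ACCEPT_REDIRECTS,
};

struct ex_devconf_attr {
	int id;			/* 1-based devconf index */
	const char *name;	/* key printed in the output */
};

static const struct ex_devconf_attr ex_attrs[] = {
	{ EX_DEVCONF_FORWARDING,	"forwarding" },
	{ EX_DEVCONF_MC_FORWARDING,	"mc_forwarding" },
	{ EX_DEVCONF_PROXY_ARP,		"proxy_arp" },
	{ EX_DEVCONF_ACCEPT_REDIRECTS,	"accept_redirects" },
};

/* Local stand-in for an on/off print helper: writes "<name> on|off ". */
static int ex_print_on_off(char *buf, size_t len, const char *name,
			   int value)
{
	return snprintf(buf, len, "%s %s ", name, value ? "on" : "off");
}

/*
 * Format every attribute present in conf[0..nelems-1] into buf.
 * Returns the number of bytes written, excluding the trailing NUL.
 */
static size_t ex_print_inet_conf(char *buf, size_t len, const int *conf,
				 int nelems)
{
	size_t off = 0;
	size_t i;

	assert(buf && conf && len > 0);
	buf[0] = '\0';

	for (i = 0; i < sizeof(ex_attrs) / sizeof(ex_attrs[0]); i++) {
		const struct ex_devconf_attr *a = &ex_attrs[i];
		int val, n;

		if (nelems < a->id)	/* shorter array: attribute absent */
			continue;

		val = conf[a->id - 1];	/* temp variable keeps lines short */
		n = ex_print_on_off(buf + off, len - off, a->name, val);
		if (n < 0 || (size_t)n >= len - off) {
			buf[off] = '\0';	/* drop the truncated entry */
			break;
		}
		off += n;
	}
	return off;
}
```

Adding another attribute then becomes a one-line table entry instead of
another seven-line block.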



^ permalink raw reply

* Re: Question: bridge: clarify MST VLAN list RCU traversal contract
From: Ido Schimmel @ 2026-06-28 17:49 UTC (permalink / raw)
  To: Runyu Xiao
  Cc: Nikolay Aleksandrov, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, bridge, netdev,
	linux-kernel, jianhao.xu
In-Reply-To: <20260627132539.3701630-1-runyu.xiao@seu.edu.cn>

On Sat, Jun 27, 2026 at 09:25:39PM +0800, Runyu Xiao wrote:
> Hi bridge maintainers,
> 
> This question comes from a candidate found by our static analysis tool
> and then manually reviewed against the current tree. The audit used
> CONFIG_PROVE_RCU_LIST as target-matched triage evidence; I am asking
> for maintainer guidance because the source-level review did not prove
> a use-after-free.
> 
> A CONFIG_PROVE_RCU_LIST audit flags the VLAN-list traversal in
> br_mst_info_size():
> 
>   net/bridge/br_mst.c:251 br_mst_info_size()
> 
> The helper walks vg->vlan_list with list_for_each_entry_rcu().  In the
> direct local context, br_get_link_af_size_filtered() first enters an
> RCU read-side section, resolves the bridge port or bridge VLAN group,
> and calls br_get_num_vlan_infos(vg, filter_mask).  That local RCU
> read-side section is then dropped before the later MST sizing call:
> 
>   net/bridge/br_netlink.c:104 rcu_read_lock()
>   net/bridge/br_netlink.c:113 br_get_num_vlan_infos(vg, filter_mask)
>   net/bridge/br_netlink.c:114 rcu_read_unlock()
>   net/bridge/br_netlink.c:123 br_mst_info_size(vg)
> 
> The helper is registered through rtnl_af_ops.get_link_af_size, and
> bridge VLAN updates appear RTNL-centered, so the broader rtnetlink
> sizing path may already provide the intended serialization.  I am not
> claiming a use-after-free here.  The question is only whether the
> RCU-list traversal contract around br_mst_info_size() should be made
> explicit enough for CONFIG_PROVE_RCU_LIST to see it.
> 
> Would you prefer one of these directions?
> 
>   1. keep the MST sizing loop inside an explicit rcu_read_lock() in
>      br_get_link_af_size_filtered();
> 
>   2. pass a confirmed RTNL lockdep condition to the iterator in
>      br_mst_info_size();
> 
>   3. document that the outer rtnetlink sizing path is the required
>      protection and leave the helper unchanged;
> 
>   4. use a different bridge-specific pattern.
> 
> I am intentionally sending this as a maintainer question rather than a
> patch because the right contract seems to depend on the bridge/rtnetlink
> caller semantics.

I don't think anything needs to change. AFAICT, br_mst_info_size() is
only reachable via the get_link_af_size() callback and
rtnl_link_get_af_size() always invokes it from an RCU read-side critical
section.

Did you see a splat with CONFIG_PROVE_RCU_LIST?

^ permalink raw reply
