* Re: [patch 07/38] treewide: Consolidate cycles_t
From: Ojaswin Mujoo @ 2026-04-13 9:15 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Arnd Bergmann, x86, Lu Baolu, iommu, Michael Grzeschik,
netdev, linux-wireless, Herbert Xu, linux-crypto, Vlastimil Babka,
linux-mm, David Woodhouse, Bernie Thompson, linux-fbdev,
Theodore Tso, linux-ext4, Andrew Morton, Uladzislau Rezki,
Marco Elver, Dmitry Vyukov, kasan-dev, Andrey Ryabinin,
Thomas Sailer, linux-hams, Jason A. Donenfeld, Richard Henderson,
linux-alpha, Russell King, linux-arm-kernel, Catalin Marinas,
Huacai Chen, loongarch, Geert Uytterhoeven, linux-m68k,
Dinh Nguyen, Jonas Bonn, linux-openrisc, Helge Deller,
linux-parisc, Michael Ellerman, linuxppc-dev, Paul Walmsley,
linux-riscv, Heiko Carstens, linux-s390, David S. Miller,
sparclinux
In-Reply-To: <20260410120318.045532623@kernel.org>
On Fri, Apr 10, 2026 at 02:19:03PM +0200, Thomas Gleixner wrote:
> Most architectures define cycles_t as unsigned long except:
>
> - x86 requires it to be 64-bit independent of the 32-bit/64-bit build.
>
> - parisc and mips define it as unsigned int
>
> parisc has no real reason to do so as there are only a few usage sites
> which either expand it to a 64-bit value or utilize only the lower
> 32bits.
>
> mips has no real requirement either.
>
> Move the typedef to types.h and provide a config switch to enforce the
> 64-bit type for x86.
>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> ---
> arch/Kconfig | 4 ++++
> arch/alpha/include/asm/timex.h | 3 ---
> arch/arm/include/asm/timex.h | 1 -
> arch/loongarch/include/asm/timex.h | 2 --
> arch/m68k/include/asm/timex.h | 2 --
> arch/mips/include/asm/timex.h | 2 --
> arch/nios2/include/asm/timex.h | 2 --
> arch/parisc/include/asm/timex.h | 2 --
> arch/powerpc/include/asm/timex.h | 4 +---
> arch/riscv/include/asm/timex.h | 2 --
> arch/s390/include/asm/timex.h | 2 --
> arch/sparc/include/asm/timex_64.h | 1 -
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/tsc.h | 2 --
> include/asm-generic/timex.h | 1 -
> include/linux/types.h | 6 ++++++
> 16 files changed, 12 insertions(+), 25 deletions(-)
>
<...>
> --- a/arch/powerpc/include/asm/timex.h
> +++ b/arch/powerpc/include/asm/timex.h
> @@ -11,9 +11,7 @@
> #include <asm/cputable.h>
> #include <asm/vdso/timebase.h>
>
> -typedef unsigned long cycles_t;
> -
> -static inline cycles_t get_cycles(void)
> +ostatic inline cycles_t get_cycles(void)
Hi Thomas, I'm in the middle of testing this series on powerpc. In the meantime I
noticed that there's probably a small typo here (although this is fixed
later).
Regards,
ojaswin
> {
> return mftb();
> }
* Re: [PATCH net-next v5 1/2] net: hsr: require valid EOT supervision TLV
From: Felix Maurer @ 2026-04-13 9:14 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: Luka Gejak, davem, edumazet, pabeni, netdev, horms
In-Reply-To: <20260412133157.3b335e1b@kernel.org>
On Sun, Apr 12, 2026 at 01:31:57PM -0700, Jakub Kicinski wrote:
> On Sun, 12 Apr 2026 22:13:35 +0200 Luka Gejak wrote:
> > Regarding the TLV loop: I actually implemented a TLV walker in v4 [1]
> > for this exact reason, but I moved to strict sequential parsing in v5
> > based on reviewer's feedback to keep the implementation simple. Could
> > you please check if the approach used in v4 is what you had in mind?
> > If so, I will rebase that logic onto the memory safety fixes
> > (pskb_may_pull) from v5 and submit it as v6.
>
> That's not really what I had in mind. I was thinking of a loop which
> just skips the TLVs in order, leaving the parsing of known TLVs as is.
> But I've never used HSR, so maybe this sort of strict validation is somehow
> okay in HSR deployments.
Hi Jakub,
I'm chiming in here as I was one of the reviewers asking for the strict
validation. The HSR supervision frames have this TLV structure that may
appear to support optionals or (unknown) extensions of some kind. But
the standard just has two potential frame formats (for normal and RedBox
supervision frames), both of them completely specified. Also, the
supervision frames have a version field in the beginning. IMHO, this
leaves no room to put other TLVs there. Therefore, I don't think we gain
anything but unexpected behavior if we start accepting frames with
arbitrary additional TLVs/data.
Thanks,
Felix
* Re: [PATCH] rose: Fix rose_find_socket() returning without sock_hold()
From: Eric Dumazet @ 2026-04-13 9:10 UTC (permalink / raw)
To: Dudu Lu; +Cc: netdev, davem, kuba, pabeni
In-Reply-To: <20260413090420.79932-1-phx0fer@gmail.com>
On Mon, Apr 13, 2026 at 2:04 AM Dudu Lu <phx0fer@gmail.com> wrote:
>
> rose_find_socket() returns a raw socket pointer after releasing
> rose_list_lock. The socket can be freed by a concurrent close()
> between the unlock and the caller's use of the pointer, leading
> to a use-after-free.
>
> Add sock_hold() before returning the found socket, and update
> callers to sock_put() when done.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Signed-off-by: Dudu Lu <phx0fer@gmail.com>
> ---
> net/rose/af_rose.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/rose/af_rose.c b/net/rose/af_rose.c
> index ba56213e0a2a..b32b136f80aa 100644
> --- a/net/rose/af_rose.c
> +++ b/net/rose/af_rose.c
> @@ -1,4 +1,5 @@
> -// SPDX-License-Identifier: GPL-2.0-or-later
> + if (s)
> + sock_hold(s);// SPDX-License-Identifier: GPL-2.0-or-later
> /*
> *
> * Copyright (C) Jonathan Naylor G4KLX (g4klx@g4klx.demon.co.uk)
> --
> 2.39.3 (Apple Git-145)
>
I suggest having your patches checked by one human before sending
them to the lists.
* [PATCH net 1/1] net: bridge: use a stable FDB dst snapshot in RCU readers
From: Ren Wei @ 2026-04-13 9:08 UTC (permalink / raw)
To: bridge, netdev
Cc: razor, idosch, davem, edumazet, kuba, pabeni, horms,
makita.toshiaki, vyasevic, yifanwucs, tomapufckgml, yuantan098,
bird, enjou1224z, zcliangcn, n05ec
In-Reply-To: <cover.1776043229.git.zcliangcn@gmail.com>
From: Zhengchuan Liang <zcliangcn@gmail.com>
Local FDB entries can be rewritten in place by `fdb_delete_local()`, which
updates `f->dst` to another port or to `NULL` while keeping the entry
alive. Several bridge RCU readers inspect `f->dst`, including
`br_fdb_fillbuf()` through the `brforward_read()` sysfs path.
These readers currently load `f->dst` multiple times and can therefore
observe inconsistent values across the check and later dereference.
In `br_fdb_fillbuf()`, this means a concurrent local-FDB update can change
`f->dst` after the NULL check and before the `port_no` dereference,
leading to a NULL-ptr-deref.
Fix this by taking a single `READ_ONCE()` snapshot of `f->dst` in each
affected RCU reader and using that snapshot for the rest of the access
sequence. Also publish the in-place `f->dst` updates in `fdb_delete_local()`
with `WRITE_ONCE()` so the readers and writer use matching access patterns.
Fixes: 960b589f86c7 ("bridge: Properly check if local fdb entry can be deleted in br_fdb_change_mac_address")
Cc: stable@kernel.org
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Co-developed-by: Yuan Tan <yuantan098@gmail.com>
Signed-off-by: Yuan Tan <yuantan098@gmail.com>
Suggested-by: Xin Liu <bird@lzu.edu.cn>
Tested-by: Ren Wei <enjou1224z@gmail.com>
Signed-off-by: Zhengchuan Liang <zcliangcn@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
---
net/bridge/br_arp_nd_proxy.c | 8 +++++---
net/bridge/br_fdb.c | 28 ++++++++++++++++++----------
2 files changed, 23 insertions(+), 13 deletions(-)
diff --git a/net/bridge/br_arp_nd_proxy.c b/net/bridge/br_arp_nd_proxy.c
index 6b5595868a39c..7ace0f4941bb6 100644
--- a/net/bridge/br_arp_nd_proxy.c
+++ b/net/bridge/br_arp_nd_proxy.c
@@ -202,11 +202,12 @@ void br_do_proxy_suppress_arp(struct sk_buff *skb, struct net_bridge *br,
f = br_fdb_find_rcu(br, n->ha, vid);
if (f) {
+ const struct net_bridge_port *dst = READ_ONCE(f->dst);
bool replied = false;
if ((p && (p->flags & BR_PROXYARP)) ||
- (f->dst && (f->dst->flags & BR_PROXYARP_WIFI)) ||
- br_is_neigh_suppress_enabled(f->dst, vid)) {
+ (dst && (dst->flags & BR_PROXYARP_WIFI)) ||
+ br_is_neigh_suppress_enabled(dst, vid)) {
if (!vid)
br_arp_send(br, p, skb->dev, sip, tip,
sha, n->ha, sha, 0, 0);
@@ -470,9 +471,10 @@ void br_do_suppress_nd(struct sk_buff *skb, struct net_bridge *br,
f = br_fdb_find_rcu(br, n->ha, vid);
if (f) {
+ const struct net_bridge_port *dst = READ_ONCE(f->dst);
bool replied = false;
- if (br_is_neigh_suppress_enabled(f->dst, vid)) {
+ if (br_is_neigh_suppress_enabled(dst, vid)) {
if (vid != 0)
br_nd_send(br, p, skb, n,
skb->vlan_proto,
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index e2c17f620f009..6eb3ab69a5140 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -236,6 +236,7 @@ struct net_device *br_fdb_find_port(const struct net_device *br_dev,
const unsigned char *addr,
__u16 vid)
{
+ const struct net_bridge_port *dst;
struct net_bridge_fdb_entry *f;
struct net_device *dev = NULL;
struct net_bridge *br;
@@ -248,8 +249,11 @@ struct net_device *br_fdb_find_port(const struct net_device *br_dev,
br = netdev_priv(br_dev);
rcu_read_lock();
f = br_fdb_find_rcu(br, addr, vid);
- if (f && f->dst)
- dev = f->dst->dev;
+ if (f) {
+ dst = READ_ONCE(f->dst);
+ if (dst)
+ dev = dst->dev;
+ }
rcu_read_unlock();
return dev;
@@ -346,7 +350,7 @@ static void fdb_delete_local(struct net_bridge *br,
vg = nbp_vlan_group(op);
if (op != p && ether_addr_equal(op->dev->dev_addr, addr) &&
(!vid || br_vlan_find(vg, vid))) {
- f->dst = op;
+ WRITE_ONCE(f->dst, op);
clear_bit(BR_FDB_ADDED_BY_USER, &f->flags);
return;
}
@@ -357,7 +361,7 @@ static void fdb_delete_local(struct net_bridge *br,
/* Maybe bridge device has same hw addr? */
if (p && ether_addr_equal(br->dev->dev_addr, addr) &&
(!vid || (v && br_vlan_should_use(v)))) {
- f->dst = NULL;
+ WRITE_ONCE(f->dst, NULL);
clear_bit(BR_FDB_ADDED_BY_USER, &f->flags);
return;
}
@@ -928,6 +932,7 @@ int br_fdb_test_addr(struct net_device *dev, unsigned char *addr)
int br_fdb_fillbuf(struct net_bridge *br, void *buf,
unsigned long maxnum, unsigned long skip)
{
+ const struct net_bridge_port *dst;
struct net_bridge_fdb_entry *f;
struct __fdb_entry *fe = buf;
unsigned long delta;
@@ -944,7 +949,8 @@ int br_fdb_fillbuf(struct net_bridge *br, void *buf,
continue;
/* ignore pseudo entry for local MAC address */
- if (!f->dst)
+ dst = READ_ONCE(f->dst);
+ if (!dst)
continue;
if (skip) {
@@ -956,8 +962,8 @@ int br_fdb_fillbuf(struct net_bridge *br, void *buf,
memcpy(fe->mac_addr, f->key.addr.addr, ETH_ALEN);
/* due to ABI compat need to split into hi/lo */
- fe->port_no = f->dst->port_no;
- fe->port_hi = f->dst->port_no >> 8;
+ fe->port_no = dst->port_no;
+ fe->port_hi = dst->port_no >> 8;
fe->is_local = test_bit(BR_FDB_LOCAL, &f->flags);
if (!test_bit(BR_FDB_STATIC, &f->flags)) {
@@ -1083,9 +1089,11 @@ int br_fdb_dump(struct sk_buff *skb,
rcu_read_lock();
hlist_for_each_entry_rcu(f, &br->fdb_list, fdb_node) {
+ const struct net_bridge_port *dst = READ_ONCE(f->dst);
+
if (*idx < ctx->fdb_idx)
goto skip;
- if (filter_dev && (!f->dst || f->dst->dev != filter_dev)) {
+ if (filter_dev && (!dst || dst->dev != filter_dev)) {
if (filter_dev != dev)
goto skip;
/* !f->dst is a special case for bridge
@@ -1093,10 +1101,10 @@ int br_fdb_dump(struct sk_buff *skb,
* Therefore need a little more filtering
* we only want to dump the !f->dst case
*/
- if (f->dst)
+ if (dst)
goto skip;
}
- if (!filter_dev && f->dst)
+ if (!filter_dev && dst)
goto skip;
err = fdb_fill_info(skb, br, f,
--
2.43.0
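
The single-snapshot pattern described in the commit message can be illustrated outside the kernel. What follows is a minimal userspace sketch, not the bridge code: `struct port`, `struct fdb_entry`, `entry_set_dst()` and `entry_port_no()` are hypothetical names, and C11 relaxed atomics stand in for the kernel's `READ_ONCE()`/`WRITE_ONCE()`. It assumes the pointed-to object stays alive by other means (RCU in the kernel case).

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the bridge port and FDB entry. */
struct port {
	unsigned short port_no;
};

struct fdb_entry {
	/* May be rewritten concurrently, possibly to NULL. */
	_Atomic(struct port *) dst;
};

/* Writer side: a single store, similar in spirit to WRITE_ONCE(). */
static void entry_set_dst(struct fdb_entry *f, struct port *p)
{
	atomic_store_explicit(&f->dst, p, memory_order_relaxed);
}

/*
 * Reader side: load the pointer exactly once, then run the NULL check
 * and the dereference against that same local snapshot. Loading
 * f->dst twice would let a writer slip a NULL in between the check
 * and the dereference.
 */
static int entry_port_no(struct fdb_entry *f)
{
	struct port *dst = atomic_load_explicit(&f->dst, memory_order_relaxed);

	if (!dst)
		return -1;
	return dst->port_no;
}
```

The reader returns -1 for the "no destination" case here only to keep the sketch small; the patch itself skips or filters such entries instead.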
* [PATCH net 1/1] 8021q: free cleared egress QoS mappings safely
From: Ren Wei @ 2026-04-13 9:07 UTC (permalink / raw)
To: netdev
Cc: andrew+netdev, davem, edumazet, kuba, pabeni, horms, kees,
yifanwucs, tomapufckgml, yuantan098, bird, ylong030, n05ec
In-Reply-To: <cover.1776039122.git.ylong030@ucr.edu>
From: Longxuan Yu <ylong030@ucr.edu>
vlan_dev_set_egress_priority() leaves cleared egress priority mapping
nodes in the hash until device teardown. Repeated set/clear cycles with
distinct skb priorities therefore allocate an unbounded number of
vlan_priority_tci_mapping objects and leak memory.
Delete mappings when vlan_prio is cleared instead of keeping
tombstones. The TX fast path and reporting paths walk the lists without
RTNL, so convert the egress mapping lists to RCU-protected pointers and
defer freeing removed nodes until after a grace period.
Cc: stable@kernel.org
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Co-developed-by: Yuan Tan <yuantan098@gmail.com>
Signed-off-by: Yuan Tan <yuantan098@gmail.com>
Suggested-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Longxuan Yu <ylong030@ucr.edu>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
---
include/linux/if_vlan.h | 23 +++++++++++--------
net/8021q/vlan_dev.c | 48 +++++++++++++++++++++++-----------------
net/8021q/vlan_netlink.c | 9 +++-----
net/8021q/vlanproc.c | 12 ++++++----
4 files changed, 53 insertions(+), 39 deletions(-)
diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
index e6272f9c5e42..0a27533e70f3 100644
--- a/include/linux/if_vlan.h
+++ b/include/linux/if_vlan.h
@@ -151,7 +151,7 @@ extern __be16 vlan_dev_vlan_proto(const struct net_device *dev);
struct vlan_priority_tci_mapping {
u32 priority;
u16 vlan_qos;
- struct vlan_priority_tci_mapping *next;
+ struct vlan_priority_tci_mapping __rcu *next;
};
struct proc_dir_entry;
@@ -177,7 +177,7 @@ struct vlan_dev_priv {
unsigned int nr_ingress_mappings;
u32 ingress_priority_map[8];
unsigned int nr_egress_mappings;
- struct vlan_priority_tci_mapping *egress_priority_map[16];
+ struct vlan_priority_tci_mapping __rcu *egress_priority_map[16];
__be16 vlan_proto;
u16 vlan_id;
@@ -209,19 +209,24 @@ static inline u16
vlan_dev_get_egress_qos_mask(struct net_device *dev, u32 skprio)
{
struct vlan_priority_tci_mapping *mp;
+ u16 vlan_qos = 0;
- smp_rmb(); /* coupled with smp_wmb() in vlan_dev_set_egress_priority() */
+ rcu_read_lock();
- mp = vlan_dev_priv(dev)->egress_priority_map[(skprio & 0xF)];
+ mp = rcu_dereference(vlan_dev_priv(dev)->egress_priority_map[skprio & 0xF]);
while (mp) {
if (mp->priority == skprio) {
- return mp->vlan_qos; /* This should already be shifted
- * to mask correctly with the
- * VLAN's TCI */
+ vlan_qos = READ_ONCE(mp->vlan_qos);
+ break;
}
- mp = mp->next;
+ mp = rcu_dereference(mp->next);
}
- return 0;
+ rcu_read_unlock();
+
+ /* This should already be shifted to mask correctly with
+ * the VLAN's TCI.
+ */
+ return vlan_qos;
}
extern bool vlan_do_receive(struct sk_buff **skb);
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index c40f7d5c4fca..377616f51697 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -172,41 +172,43 @@ int vlan_dev_set_egress_priority(const struct net_device *dev,
u32 skb_prio, u16 vlan_prio)
{
- struct vlan_dev_priv *vlan = vlan_dev_priv(dev);
- struct vlan_priority_tci_mapping *mp = NULL;
- struct vlan_priority_tci_mapping *np;
- u32 vlan_qos = (vlan_prio << VLAN_PRIO_SHIFT) & VLAN_PRIO_MASK;
+ struct vlan_priority_tci_mapping __rcu **mpp;
+ struct vlan_priority_tci_mapping *mp;
+ struct vlan_priority_tci_mapping *np;
+ struct vlan_dev_priv *vlan = vlan_dev_priv(dev);
+ u32 bucket = skb_prio & 0xF;
+ u32 vlan_qos = (vlan_prio << VLAN_PRIO_SHIFT) & VLAN_PRIO_MASK;
/* See if a priority mapping exists.. */
- mp = vlan->egress_priority_map[skb_prio & 0xF];
+ mpp = &vlan->egress_priority_map[bucket];
+ mp = rtnl_dereference(*mpp);
while (mp) {
if (mp->priority == skb_prio) {
- if (mp->vlan_qos && !vlan_qos)
+ if (!vlan_qos) {
+ rcu_assign_pointer(*mpp, rtnl_dereference(mp->next));
vlan->nr_egress_mappings--;
- else if (!mp->vlan_qos && vlan_qos)
- vlan->nr_egress_mappings++;
- mp->vlan_qos = vlan_qos;
+ kfree_rcu_mightsleep(mp);
+ } else {
+ WRITE_ONCE(mp->vlan_qos, vlan_qos);
+ }
return 0;
}
- mp = mp->next;
+ mpp = &mp->next;
+ mp = rtnl_dereference(*mpp);
}
/* Create a new mapping then. */
- mp = vlan->egress_priority_map[skb_prio & 0xF];
+ if (!vlan_qos)
+ return 0;
+
np = kmalloc_obj(struct vlan_priority_tci_mapping);
if (!np)
return -ENOBUFS;
- np->next = mp;
np->priority = skb_prio;
np->vlan_qos = vlan_qos;
- /* Before inserting this element in hash table, make sure all its fields
- * are committed to memory.
- * coupled with smp_rmb() in vlan_dev_get_egress_qos_mask()
- */
- smp_wmb();
- vlan->egress_priority_map[skb_prio & 0xF] = np;
- if (vlan_qos)
- vlan->nr_egress_mappings++;
+ RCU_INIT_POINTER(np->next, rtnl_dereference(vlan->egress_priority_map[bucket]));
+ rcu_assign_pointer(vlan->egress_priority_map[bucket], np);
+ vlan->nr_egress_mappings++;
return 0;
}
@@ -604,11 +606,17 @@ void vlan_dev_free_egress_priority(const struct net_device *dev)
int i;
for (i = 0; i < ARRAY_SIZE(vlan->egress_priority_map); i++) {
- while ((pm = vlan->egress_priority_map[i]) != NULL) {
- vlan->egress_priority_map[i] = pm->next;
- kfree(pm);
+ pm = rtnl_dereference(vlan->egress_priority_map[i]);
+ RCU_INIT_POINTER(vlan->egress_priority_map[i], NULL);
+ while (pm) {
+ struct vlan_priority_tci_mapping *next;
+
+ next = rtnl_dereference(pm->next);
+ kfree_rcu_mightsleep(pm);
+ pm = next;
}
}
+ vlan->nr_egress_mappings = 0;
}
static void vlan_dev_uninit(struct net_device *dev)
diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
index a000b1ef0520..bbe7cbd97939 100644
--- a/net/8021q/vlan_netlink.c
+++ b/net/8021q/vlan_netlink.c
@@ -260,13 +260,10 @@ static int vlan_fill_info(struct sk_buff *skb, const struct net_device *dev)
goto nla_put_failure;
for (i = 0; i < ARRAY_SIZE(vlan->egress_priority_map); i++) {
- for (pm = vlan->egress_priority_map[i]; pm;
- pm = pm->next) {
- if (!pm->vlan_qos)
- continue;
-
+ for (pm = rtnl_dereference(vlan->egress_priority_map[i]); pm;
+ pm = rtnl_dereference(pm->next)) {
m.from = pm->priority;
- m.to = (pm->vlan_qos >> 13) & 0x7;
+ m.to = (READ_ONCE(pm->vlan_qos) >> 13) & 0x7;
if (nla_put(skb, IFLA_VLAN_QOS_MAPPING,
sizeof(m), &m))
goto nla_put_failure;
diff --git a/net/8021q/vlanproc.c b/net/8021q/vlanproc.c
index fa67374bda49..0e424e0895b7 100644
--- a/net/8021q/vlanproc.c
+++ b/net/8021q/vlanproc.c
@@ -262,15 +262,19 @@ static int vlandev_seq_show(struct seq_file *seq, void *offset)
vlan->ingress_priority_map[7]);
seq_printf(seq, " EGRESS priority mappings: ");
+ rcu_read_lock();
for (i = 0; i < 16; i++) {
- const struct vlan_priority_tci_mapping *mp
- = vlan->egress_priority_map[i];
+ const struct vlan_priority_tci_mapping *mp =
+ rcu_dereference(vlan->egress_priority_map[i]);
while (mp) {
+ u16 vlan_qos = READ_ONCE(mp->vlan_qos);
+
seq_printf(seq, "%u:%d ",
- mp->priority, ((mp->vlan_qos >> 13) & 0x7));
- mp = mp->next;
+ mp->priority, ((vlan_qos >> 13) & 0x7));
+ mp = rcu_dereference(mp->next);
}
}
+ rcu_read_unlock();
seq_puts(seq, "\n");
return 0;
--
2.43.0
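
The "delete instead of keeping tombstones" logic relies on walking the bucket with a pointer to the previous link, so a node can be unlinked in place. A minimal single-threaded sketch of that walk follows, assuming plain pointers and an immediate `free()`; the patch itself uses RCU accessors and defers the free until after a grace period, which this sketch deliberately leaves out. `struct mapping`, `mapping_set()` and `mapping_count()` are hypothetical names.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one priority mapping node. */
struct mapping {
	uint32_t priority;
	uint16_t qos;
	struct mapping *next;
};

/*
 * Set, update, or clear the mapping for `prio` in one bucket.
 * A qos of 0 removes the node rather than leaving it behind.
 * Returns 0 on success, -1 if allocation fails.
 */
static int mapping_set(struct mapping **head, uint32_t prio, uint16_t qos)
{
	struct mapping **mpp = head;	/* address of the link pointing at mp */
	struct mapping *mp;

	for (mp = *mpp; mp; mpp = &mp->next, mp = *mpp) {
		if (mp->priority != prio)
			continue;
		if (!qos) {
			*mpp = mp->next;	/* unlink from the list */
			free(mp);		/* kernel code defers this free */
		} else {
			mp->qos = qos;		/* update in place */
		}
		return 0;
	}

	if (!qos)
		return 0;			/* nothing to clear */

	mp = malloc(sizeof(*mp));
	if (!mp)
		return -1;
	mp->priority = prio;
	mp->qos = qos;
	mp->next = *head;			/* insert at the bucket head */
	*head = mp;
	return 0;
}

/* Count the nodes in a bucket. */
static size_t mapping_count(const struct mapping *head)
{
	size_t n = 0;

	for (; head; head = head->next)
		n++;
	return n;
}
```

With this shape, repeated set/clear cycles leave the list no longer than the number of currently active mappings, which is the property the commit message is after.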
* Re: [patch 14/38] slub: Use prandom instead of get_cycles()
From: Harry Yoo (Oracle) @ 2026-04-13 9:07 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Vlastimil Babka, linux-mm, Arnd Bergmann, x86, Lu Baolu,
iommu, Michael Grzeschik, netdev, linux-wireless, Herbert Xu,
linux-crypto, David Woodhouse, Bernie Thompson, linux-fbdev,
Theodore Tso, linux-ext4, Andrew Morton, Uladzislau Rezki,
Marco Elver, Dmitry Vyukov, kasan-dev, Andrey Ryabinin,
Thomas Sailer, linux-hams, Jason A. Donenfeld, Richard Henderson,
linux-alpha, Russell King, linux-arm-kernel, Catalin Marinas,
Huacai Chen, loongarch, Geert Uytterhoeven, linux-m68k,
Dinh Nguyen, Jonas Bonn, linux-openrisc, Helge Deller,
linux-parisc, Michael Ellerman, linuxppc-dev, Paul Walmsley,
linux-riscv, Heiko Carstens, linux-s390, David S. Miller,
sparclinux, Hao Li, Christoph Lameter, David Rientjes,
Roman Gushchin, Shengming Hu
In-Reply-To: <20260410120318.525653921@kernel.org>
[Resending after fixing broken email headers]
On Fri, Apr 10, 2026 at 02:19:37PM +0200, Thomas Gleixner wrote:
> The decision whether to scan remote nodes is based on a 'random' number
> retrieved via get_cycles(). get_cycles() is about to be removed.
>
> There is already prandom state in the code, so use that instead.
>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: linux-mm@kvack.org
> ---
Acked-by: Harry Yoo (Oracle) <harry@kernel.org>
Is this for this merge window?
This may conflict with upcoming changes on freelist shuffling [1]
(not queued for slab/for-next yet though), but it should be easy to
resolve.
[Cc'ing Shengming and SLAB ALLOCATOR folks]
[1] https://lore.kernel.org/linux-mm/20260409204352095kKWVYKtZImN59ybO6iRNj@zte.com.cn
--
Cheers,
Harry / Hyeonggon
> mm/slub.c | 37 +++++++++++++++++++++++--------------
> 1 file changed, 23 insertions(+), 14 deletions(-)
>
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3302,6 +3302,25 @@ static inline struct slab *alloc_slab_pa
> return slab;
> }
>
> +#if defined(CONFIG_SLAB_FREELIST_RANDOM) || defined(CONFIG_NUMA)
> +static DEFINE_PER_CPU(struct rnd_state, slab_rnd_state);
> +
> +static unsigned int slab_get_prandom_state(unsigned int limit)
> +{
> + struct rnd_state *state;
> + unsigned int res;
> +
> + /*
> + * An interrupt or NMI handler might interrupt and change
> + * the state in the middle, but that's safe.
> + */
> + state = &get_cpu_var(slab_rnd_state);
> + res = prandom_u32_state(state) % limit;
> + put_cpu_var(slab_rnd_state);
> + return res;
> +}
> +#endif
> +
> #ifdef CONFIG_SLAB_FREELIST_RANDOM
> /* Pre-initialize the random sequence cache */
> static int init_cache_random_seq(struct kmem_cache *s)
> @@ -3365,8 +3384,6 @@ static void *next_freelist_entry(struct
> return (char *)start + idx;
> }
>
> -static DEFINE_PER_CPU(struct rnd_state, slab_rnd_state);
> -
> /* Shuffle the single linked freelist based on a random pre-computed sequence */
> static bool shuffle_freelist(struct kmem_cache *s, struct slab *slab,
> bool allow_spin)
> @@ -3383,15 +3400,7 @@ static bool shuffle_freelist(struct kmem
> if (allow_spin) {
> pos = get_random_u32_below(freelist_count);
> } else {
> - struct rnd_state *state;
> -
> - /*
> - * An interrupt or NMI handler might interrupt and change
> - * the state in the middle, but that's safe.
> - */
> - state = &get_cpu_var(slab_rnd_state);
> - pos = prandom_u32_state(state) % freelist_count;
> - put_cpu_var(slab_rnd_state);
> + pos = slab_get_prandom_state(freelist_count);
> }
>
> page_limit = slab->objects * s->size;
> @@ -3882,7 +3891,7 @@ static void *get_from_any_partial(struct
> * with available objects.
> */
> if (!s->remote_node_defrag_ratio ||
> - get_cycles() % 1024 > s->remote_node_defrag_ratio)
> + slab_get_prandom_state(1024) > s->remote_node_defrag_ratio)
> return NULL;
>
> do {
> @@ -7102,7 +7111,7 @@ static unsigned int
>
> /* see get_from_any_partial() for the defrag ratio description */
> if (!s->remote_node_defrag_ratio ||
> - get_cycles() % 1024 > s->remote_node_defrag_ratio)
> + slab_get_prandom_state(1024) > s->remote_node_defrag_ratio)
> return 0;
>
> do {
> @@ -8421,7 +8430,7 @@ void __init kmem_cache_init_late(void)
> flushwq = alloc_workqueue("slub_flushwq", WQ_MEM_RECLAIM | WQ_PERCPU,
> 0);
> WARN_ON(!flushwq);
> -#ifdef CONFIG_SLAB_FREELIST_RANDOM
> +#if defined(CONFIG_SLAB_FREELIST_RANDOM) || defined(CONFIG_NUMA)
> prandom_init_once(&slab_rnd_state);
> #endif
> }
>
>
* [PATCH] rose: Fix rose_find_socket() returning without sock_hold()
From: Dudu Lu @ 2026-04-13 9:04 UTC (permalink / raw)
To: netdev; +Cc: davem, edumazet, kuba, pabeni, Dudu Lu
rose_find_socket() returns a raw socket pointer after releasing
rose_list_lock. The socket can be freed by a concurrent close()
between the unlock and the caller's use of the pointer, leading
to a use-after-free.
Add sock_hold() before returning the found socket, and update
callers to sock_put() when done.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Dudu Lu <phx0fer@gmail.com>
---
net/rose/af_rose.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/rose/af_rose.c b/net/rose/af_rose.c
index ba56213e0a2a..b32b136f80aa 100644
--- a/net/rose/af_rose.c
+++ b/net/rose/af_rose.c
@@ -1,4 +1,5 @@
-// SPDX-License-Identifier: GPL-2.0-or-later
+ if (s)
+ sock_hold(s);// SPDX-License-Identifier: GPL-2.0-or-later
/*
*
* Copyright (C) Jonathan Naylor G4KLX (g4klx@g4klx.demon.co.uk)
--
2.39.3 (Apple Git-145)
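
The idea stated in the commit message (take a reference while the list lock is still held, drop it in the caller) can be shown with a small userspace sketch. This is not the ROSE code and not the diff above: `struct sock_like`, `find_and_hold()` and `put_ref()` are hypothetical names, and a C11 `atomic_flag` spinlock stands in for the list lock.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical refcounted object kept on a locked list. */
struct sock_like {
	atomic_int refcnt;
	int id;
	struct sock_like *next;
};

static atomic_flag list_lock = ATOMIC_FLAG_INIT;
static struct sock_like *list_head;

static void list_lock_acquire(void)
{
	while (atomic_flag_test_and_set_explicit(&list_lock,
						 memory_order_acquire))
		;	/* spin until the flag is clear */
}

static void list_lock_release(void)
{
	atomic_flag_clear_explicit(&list_lock, memory_order_release);
}

/*
 * Look up an object and take a reference BEFORE dropping the lock,
 * so the object cannot be freed between the unlock and the caller's
 * first use. Returns NULL if no object matches.
 */
static struct sock_like *find_and_hold(int id)
{
	struct sock_like *s;

	list_lock_acquire();
	for (s = list_head; s; s = s->next) {
		if (s->id == id) {
			atomic_fetch_add(&s->refcnt, 1);
			break;
		}
	}
	list_lock_release();
	return s;
}

/* Drop a reference; returns 1 when the last reference went away. */
static int put_ref(struct sock_like *s)
{
	return atomic_fetch_sub(&s->refcnt, 1) == 1;
}
```

Every caller of `find_and_hold()` that gets a non-NULL result must pair it with `put_ref()`, which is the part a change of this kind has to cover in all call sites.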
* [PATCH] nfc: llcp: Fix TLV parsing off-by-one in LLCP connect and SNL
From: Dudu Lu @ 2026-04-13 9:02 UTC (permalink / raw)
To: netdev; +Cc: davem, edumazet, kuba, pabeni, Dudu Lu
The TLV parsing loops in nfc_llcp_connect_sn() and
nfc_llcp_recv_snl() check `offset < tlv_array_len` before accessing
both tlv[0] (type) and tlv[1] (length). When exactly one byte remains
(offset == tlv_array_len - 1), the access to tlv[1] reads one byte
beyond the skb data buffer.
Fix both sites by changing the loop condition to
`offset + 1 < tlv_array_len`, ensuring at least 2 bytes are available
before reading the type and length fields.
Signed-off-by: Dudu Lu <phx0fer@gmail.com>
---
net/nfc/llcp_core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/nfc/llcp_core.c b/net/nfc/llcp_core.c
index 366d7566308c..d1a75b9445e1 100644
--- a/net/nfc/llcp_core.c
+++ b/net/nfc/llcp_core.c
@@ -852,7 +852,7 @@ static const u8 *nfc_llcp_connect_sn(const struct sk_buff *skb, size_t *sn_len)
const u8 *tlv = &skb->data[2];
size_t tlv_array_len = skb->len - LLCP_HEADER_SIZE, offset = 0;
- while (offset < tlv_array_len) {
+ while (offset + 1 < tlv_array_len) {
type = tlv[0];
length = tlv[1];
@@ -1297,7 +1297,7 @@ static void nfc_llcp_recv_snl(struct nfc_llcp_local *local,
offset = 0;
sdres_tlvs_len = 0;
- while (offset < tlv_len) {
+ while (offset + 1 < tlv_len) {
type = tlv[0];
length = tlv[1];
--
2.39.3 (Apple Git-145)
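
The loop condition discussed in the commit message can be exercised in isolation. Below is a minimal userspace sketch of a TLV walk that requires two readable bytes before touching the type and length fields; `tlv_find()` is a hypothetical helper, not a kernel function, and the extra check that the value fits in the buffer is an assumption added for the sketch rather than something the patch introduces.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: walk a buffer of type/length/value records and
 * return a pointer to the value of the first record whose type equals
 * `want`. Returns NULL if no such record exists or the buffer is
 * truncated. On success, *val_len holds the value length.
 */
static const uint8_t *tlv_find(const uint8_t *buf, size_t len,
			       uint8_t want, size_t *val_len)
{
	size_t offset = 0;

	/* Two bytes (type, length) must be available before reading. */
	while (offset + 1 < len) {
		uint8_t type = buf[offset];
		uint8_t length = buf[offset + 1];

		/* Assumption for this sketch: the value must fit too. */
		if (length > len - offset - 2)
			return NULL;

		if (type == want) {
			*val_len = length;
			return &buf[offset + 2];
		}
		offset += 2 + (size_t)length;
	}
	return NULL;
}
```

With `offset < len` as the condition, a buffer that ends in a lone type byte would still enter the loop and read the length from one byte past the end; with `offset + 1 < len` the loop stops first.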
* Re: [PATCH v2 2/2] MAINTAINERS: update PTP maintainer entries after directory split
From: Wen Gu @ 2026-04-13 9:00 UTC (permalink / raw)
To: Jakub Kicinski, David Woodhouse
Cc: tglx, richardcochran, andrew+netdev, davem, edumazet, pabeni,
linux-kernel, netdev, jstultz, anna-maria, frederic,
daniel.lezcano, sboyd, vladimir.oltean, wei.fang, xiaoning.wang,
jonathan.lemon, vadim.fedorenko, yangbo.lu, svens, nick.shi,
ajay.kaher, alexey.makhalov, bcm-kernel-feedback-list, linux-fpga,
imx, linux-s390, dust.li, xuanzhuo, mani, imran.shaik, taniya.das
In-Reply-To: <20260412095301.4fe1fe65@kernel.org>
On 2026/4/13 00:53, Jakub Kicinski wrote:
> On Sun, 12 Apr 2026 17:32:22 +0100 David Woodhouse wrote:
>> On 12 April 2026 16:47:04 BST, Jakub Kicinski <kuba@kernel.org> wrote:
>>> On Tue, 7 Apr 2026 18:48:02 +0800 Wen Gu wrote:
>>>> +PTP EMULATED CLOCK SUPPORT
>>>> +M: David Woodhouse <dwmw2@infradead.org>
>>>> +M: Wen Gu <guwen@linux.alibaba.com>
>>>> +M: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>>> +L: linux-kernel@vger.kernel.org
>>>> +S: Maintained
>>>> +T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers/core
>>>
>>> Hi David,
>>>
>>> Do you have a tree to route the patches thru? Or do you really have
>>> access to the tip tree?
>>
>> I do not have access to the tip tree. I can make a shared tree on
>> git.infradead.org if the other two maintainers would like to send me
>> a SSH pubkey and preferred username...
>
> Honestly I'd love for you to be the only M here, and the other two
> to be reviewers. Xuan Zhuo is currently at v40 trying to upstream
> an Ethernet driver. Some growth needed there to become a subsystem
> maintainer IMO.
Hi Jakub, David,
That works for us. We can act as reviewers.
If David sets up a new tree, I will update the MAINTAINERS entry
accordingly in v3.
Thanks.
* Re: [patch 14/38] slub: Use prandom instead of get_cycles()
From: Vlastimil Babka (SUSE) @ 2026-04-13 9:00 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: linux-mm, Arnd Bergmann, x86, Lu Baolu, iommu, Michael Grzeschik,
netdev, linux-wireless, Herbert Xu, linux-crypto, David Woodhouse,
Bernie Thompson, linux-fbdev, Theodore Tso, linux-ext4,
Andrew Morton, Uladzislau Rezki, Marco Elver, Dmitry Vyukov,
kasan-dev, Andrey Ryabinin, Thomas Sailer, linux-hams,
Jason A. Donenfeld, Richard Henderson, linux-alpha, Russell King,
linux-arm-kernel, Catalin Marinas, Huacai Chen, loongarch,
Geert Uytterhoeven, linux-m68k, Dinh Nguyen, Jonas Bonn,
linux-openrisc, Helge Deller, linux-parisc, Michael Ellerman,
linuxppc-dev, Paul Walmsley, linux-riscv, Heiko Carstens,
linux-s390, David S. Miller, sparclinux, Harry Yoo (Oracle),
Hao Li
In-Reply-To: <20260410120318.525653921@kernel.org>
On 4/10/26 14:19, Thomas Gleixner wrote:
> The decision whether to scan remote nodes is based on a 'random' number
> retrieved via get_cycles(). get_cycles() is about to be removed.
>
> There is already prandom state in the code, so use that instead.
>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: linux-mm@kvack.org
LGTM.
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
* [PATCH] nfc: nci: Add skb length validation in nci_core_init_rsp_packet
From: Dudu Lu @ 2026-04-13 9:01 UTC (permalink / raw)
To: netdev; +Cc: davem, edumazet, kuba, pabeni, Dudu Lu
nci_core_init_rsp_packet_v1() and nci_core_init_rsp_packet_v2() cast
skb->data to response structures and dereference fields without first
checking that skb->len is large enough. A malicious or malformed NFCC
can send a short response packet, causing an out-of-bounds read.
Add minimum length checks at the start of both functions. For v1, check
that at least sizeof(nci_core_init_rsp_1) bytes are available before
accessing rsp_1 fields, and validate the dynamic offset before accessing
rsp_2. For v2, check that at least sizeof(nci_core_init_rsp_nci_ver2)
bytes are available.
Signed-off-by: Dudu Lu <phx0fer@gmail.com>
---
net/nfc/nci/rsp.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/net/nfc/nci/rsp.c b/net/nfc/nci/rsp.c
index 9eeb862825c5..01972c806b45 100644
--- a/net/nfc/nci/rsp.c
+++ b/net/nfc/nci/rsp.c
@@ -1,3 +1,14 @@
+ if (skb->len < sizeof(*rsp)) {
+ pr_err("short NCI_CORE_INIT_RSP v2 packet\n");
+ return NCI_STATUS_SYNTAX_ERROR;
+ }
+ if (skb->len < 6 + rsp_1->num_supported_rf_interfaces +
+ sizeof(*rsp_2)) {
+ pr_err("short NCI_CORE_INIT_RSP v1 packet\n");
+ return NCI_STATUS_SYNTAX_ERROR;
+ }
+ if (skb->len < sizeof(*rsp_1))
+ return NCI_STATUS_SYNTAX_ERROR;
// SPDX-License-Identifier: GPL-2.0-only
/*
* The NFC Controller Interface is the communication protocol between an
--
2.39.3 (Apple Git-145)
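
The two-stage check described in the commit message (fixed part first, then the part whose position depends on a count read from the fixed part) can be sketched in userspace as follows. The layouts are hypothetical and much simpler than the NCI response structures; `struct hdr_part`, `struct tail_part` and `parse_rsp()` are illustration-only names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed header: a status byte and an item count. */
struct hdr_part {
	uint8_t status;
	uint8_t num_items;	/* number of one-byte items that follow */
};

/* Hypothetical trailer located after the variable-length items. */
struct tail_part {
	uint8_t a;
	uint8_t b;
};

/*
 * Validate the length in two stages before reading anything:
 *  1. the fixed header must fit, so num_items is safe to read;
 *  2. header + num_items + trailer must fit, so the trailer is safe.
 * Returns 0 on success and fills *out, or -1 if the packet is short.
 */
static int parse_rsp(const uint8_t *data, size_t len, struct tail_part *out)
{
	size_t need = sizeof(struct hdr_part);
	uint8_t num_items;

	if (len < need)
		return -1;
	num_items = data[1];

	need += num_items;
	need += sizeof(struct tail_part);
	if (len < need)
		return -1;

	out->a = data[sizeof(struct hdr_part) + num_items];
	out->b = data[sizeof(struct hdr_part) + num_items + 1];
	return 0;
}
```

The ordering matters: the second bound depends on a field that may only be read once the first bound has been established.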
* [PATCH] tipc: Ensure NUL-termination of remote algorithm name
From: Dudu Lu @ 2026-04-13 8:58 UTC (permalink / raw)
To: netdev; +Cc: jmaloy, Dudu Lu
In tipc_crypto_key_rcv(), the algorithm name is copied from the incoming
message using memcpy with a fixed size of TIPC_AEAD_ALG_NAME (32 bytes).
If the remote peer sends a name that fills all 32 bytes without a NUL
terminator, the alg_name field will not be NUL-terminated. This string
is later passed to crypto_alloc_aead() which expects a NUL-terminated
string, potentially causing an out-of-bounds read when the crypto
subsystem searches for the algorithm by name.
Fix by explicitly NUL-terminating the last byte of alg_name after the
memcpy.
Fixes: 1ef6f7c9390f ("tipc: add automatic session key exchange")
Signed-off-by: Dudu Lu <phx0fer@gmail.com>
---
net/tipc/crypto.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/tipc/crypto.c b/net/tipc/crypto.c
index d3046a39ff72..ac072356bf0c 100644
--- a/net/tipc/crypto.c
+++ b/net/tipc/crypto.c
@@ -2325,6 +2325,7 @@ static bool tipc_crypto_key_rcv(struct tipc_crypto *rx, struct tipc_msg *hdr)
/* Copy key from msg data */
skey->keylen = keylen;
memcpy(skey->alg_name, data, TIPC_AEAD_ALG_NAME);
+ skey->alg_name[TIPC_AEAD_ALG_NAME - 1] = '\0';
memcpy(skey->key, data + TIPC_AEAD_ALG_NAME + sizeof(__be32),
skey->keylen);
--
2.39.3 (Apple Git-145)
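
The effect of the added line is easy to check in userspace. A minimal sketch, assuming a 32-byte name field as stated in the commit message; `ALG_NAME_LEN` and `copy_alg_name()` are hypothetical names standing in for `TIPC_AEAD_ALG_NAME` and the copy in `tipc_crypto_key_rcv()`.

```c
#include <string.h>

/* Assumed to mirror the 32-byte field size from the commit message. */
#define ALG_NAME_LEN 32

/*
 * Copy a fixed-size name field from untrusted input and force the
 * last byte to NUL, so later string functions always find a
 * terminator inside the destination buffer.
 */
static void copy_alg_name(char dst[ALG_NAME_LEN], const unsigned char *src)
{
	memcpy(dst, src, ALG_NAME_LEN);
	dst[ALG_NAME_LEN - 1] = '\0';
}
```

A peer that fills all 32 bytes with non-NUL data ends up with a 31-character name after the copy, which will simply fail to match any algorithm instead of causing a read past the buffer.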
* [PATCH] macvlan: fix macvlan_get_size() not reserving space for IFLA_MACVLAN_BC_CUTOFF
From: Dudu Lu @ 2026-04-13 8:53 UTC (permalink / raw)
To: netdev; +Cc: andrew+netdev, davem, edumazet, kuba, pabeni, Dudu Lu
macvlan_get_size() does not account for IFLA_MACVLAN_BC_CUTOFF, but
macvlan_fill_info() conditionally includes it when port->bc_cutoff != 1.
This causes nla_put_s32() to fail with -EMSGSIZE when the netlink skb
runs out of space, triggering a WARN_ON in rtnetlink and preventing the
interface from being dumped.
The bug can be reproduced with:
ip link add macvlan0 link eth0 type macvlan mode bridge
ip link set macvlan0 type macvlan bc_cutoff 0
ip -d link show macvlan0 # fails with -EMSGSIZE
The bc_cutoff feature was added in commit 954d1fa1ac93 ("macvlan: Add
netlink attribute for broadcast cutoff"), which added the nla_put_s32()
call in macvlan_fill_info() but missed adding the corresponding
nla_total_size(4) in macvlan_get_size(). A follow-up commit
55cef78c244d ("macvlan: add forgotten nla_policy for
IFLA_MACVLAN_BC_CUTOFF") fixed the missing nla_policy entry but still
did not fix the size calculation.
Fixes: 954d1fa1ac93 ("macvlan: Add netlink attribute for broadcast cutoff")
Signed-off-by: Dudu Lu <phx0fer@gmail.com>
---
drivers/net/macvlan.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index a71f058eceef..80f87599a503 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -1681,6 +1681,7 @@ static size_t macvlan_get_size(const struct net_device *dev)
+ macvlan_get_size_mac(vlan) /* IFLA_MACVLAN_MACADDR */
+ nla_total_size(4) /* IFLA_MACVLAN_BC_QUEUE_LEN */
+ nla_total_size(4) /* IFLA_MACVLAN_BC_QUEUE_LEN_USED */
+ + nla_total_size(4) /* IFLA_MACVLAN_BC_CUTOFF */
);
}
--
2.39.3 (Apple Git-145)
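To make the size mismatch in the macvlan patch above concrete, here is a small userspace sketch of how much room one 32-bit netlink attribute needs. The `DEMO_` macros and `demo_` functions are illustrative re-implementations written for this note (assuming the usual 4-byte attribute header and 4-byte alignment), not the kernel's own definitions.

```c
#include <stddef.h>

/* Assumptions: 4-byte attribute header (u16 len + u16 type), 4-byte alignment. */
#define DEMO_NLA_ALIGNTO ((size_t)4)
#define DEMO_NLA_ALIGN(len) \
	(((len) + DEMO_NLA_ALIGNTO - 1) & ~(DEMO_NLA_ALIGNTO - 1))
#define DEMO_NLA_HDRLEN ((size_t)4)

/* Bytes one attribute occupies in the message: header plus padded payload. */
static size_t demo_nla_total_size(size_t payload)
{
	return DEMO_NLA_ALIGN(DEMO_NLA_HDRLEN + payload);
}

/* Space reserved by a get_size() that accounts for n_attrs 32-bit attributes. */
static size_t demo_reserved(unsigned int n_attrs)
{
	return n_attrs * demo_nla_total_size(4);
}
```

Under these assumptions each 32-bit attribute costs 8 bytes, so a fill function that emits one attribute more than the size function reserved can come up 8 bytes short when the buffer is sized exactly.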
* [PATCH] vsock/virtio: fix accept queue count leak on transport mismatch in recv_listen
From: Dudu Lu @ 2026-04-13 8:52 UTC (permalink / raw)
To: netdev; +Cc: stefanha, sgarzare, mst, jasowang, Dudu Lu
virtio_transport_recv_listen() calls sk_acceptq_added(sk) to increment
the listener's accept queue counter before calling
vsock_assign_transport(). When vsock_assign_transport() fails or selects
a different transport than the one that received the packet, the error
path returns without calling sk_acceptq_removed(sk), permanently
incrementing sk_ack_backlog.
A malicious VM peer can exploit this by sending repeated CONNECT
requests that trigger the transport mismatch condition. Each such
request permanently increments sk_ack_backlog. After approximately
backlog+1 such requests (default backlog ~128), sk_acceptq_is_full()
returns true, causing the listener to reject ALL new connections with
-ENOMEM. The only recovery is closing and re-creating the listener
socket.
Compare with vmci_transport.c and hyperv_transport.c which correctly
place sk_acceptq_added() AFTER the transport check, avoiding this
issue entirely.
Fix by moving sk_acceptq_added(sk) to after the transport validation
check, matching the pattern used by the other transports.
Fixes: c0cfa2d8a788 ("vsock: add multi-transports support")
Signed-off-by: Dudu Lu <phx0fer@gmail.com>
---
net/vmw_vsock/virtio_transport_common.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 8a9fb23c6e85..29e1d9833be4 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1,3 +1,4 @@
+ sk_acceptq_added(sk);
// SPDX-License-Identifier: GPL-2.0-only
/*
* common code for virtio vsock
@@ -1560,8 +1561,9 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
return -ENOMEM;
}
- sk_acceptq_added(sk);
+
+ sk_acceptq_added(sk);
lock_sock_nested(child, SINGLE_DEPTH_NESTING);
child->sk_state = TCP_ESTABLISHED;
--
2.39.3 (Apple Git-145)
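The counter leak described in the vsock patch above can be modelled in a few lines of userspace C. Everything here (`struct demo_listener`, the `demo_` functions, the boolean `transport_ok`) is a hypothetical simplification; it only contrasts "count, then validate" with "validate, then count".

```c
#include <stdbool.h>

/* Hypothetical model of the two listener fields involved. */
struct demo_listener {
	unsigned int ack_backlog;     /* models sk_ack_backlog */
	unsigned int max_ack_backlog; /* models sk_max_ack_backlog */
};

static bool demo_acceptq_is_full(const struct demo_listener *l)
{
	return l->ack_backlog > l->max_ack_backlog;
}

/* Buggy ordering: count first, validate afterwards, no undo on failure. */
static int demo_recv_listen_buggy(struct demo_listener *l, bool transport_ok)
{
	if (demo_acceptq_is_full(l))
		return -1;
	l->ack_backlog++;
	if (!transport_ok)
		return -1; /* the increment above is never reverted */
	return 0;
}

/* Fixed ordering: only connections that pass validation are counted. */
static int demo_recv_listen_fixed(struct demo_listener *l, bool transport_ok)
{
	if (demo_acceptq_is_full(l))
		return -1;
	if (!transport_ok)
		return -1;
	l->ack_backlog++;
	return 0;
}
```

In this model a peer that fails validation backlog+1 times makes the buggy listener report a full queue, while the fixed variant leaves the counter untouched.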
* [PATCH] net: dsa: sja1105: fix division by zero in sja1105_tas_set_runtime_params()
From: Alexander.Chesnokov @ 2026-04-13 8:51 UTC (permalink / raw)
To: olteanv
Cc: lvc-project, Oleg.Kazakov, Pavel.Zhigulin, Alexander Chesnokov,
stable, Andrew Lunn, Florian Fainelli, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, linux-kernel, netdev
From: Alexander Chesnokov <Alexander.Chesnokov@kaspersky.com>
If taprio offload is configured such that none of the ports' base_time
is less than S64_MAX (the initial value of earliest_base_time), then
its_cycle_time remains zero and is passed to future_base_time() as
cycle_time, causing division by zero in div_s64().
Add a check for its_cycle_time being zero before calling
future_base_time() and return -EINVAL.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 86db36a347b4 ("net: dsa: sja1105: Implement state machine for TAS with PTP clock source")
Cc: stable@vger.kernel.org
Signed-off-by: Alexander Chesnokov <Alexander.Chesnokov@kaspersky.com>
---
drivers/net/dsa/sja1105/sja1105_tas.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/net/dsa/sja1105/sja1105_tas.c b/drivers/net/dsa/sja1105/sja1105_tas.c
index e6153848a950..ce4b544a2b9c 100644
--- a/drivers/net/dsa/sja1105/sja1105_tas.c
+++ b/drivers/net/dsa/sja1105/sja1105_tas.c
@@ -62,6 +62,9 @@ static int sja1105_tas_set_runtime_params(struct sja1105_private *priv)
if (!tas_data->enabled)
return 0;
+ if (!its_cycle_time)
+ return -EINVAL;
+
/* Roll the earliest base time over until it is in a comparable
* time base with the latest, then compare their deltas.
* We want to enforce that all ports' base times are within
--
2.43.0
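As a rough illustration of the guard added in the sja1105 patch above, the sketch below rounds a base time up to the next cycle boundary at or after `now` and refuses a zero cycle time. `demo_future_base_time()` is a hypothetical userspace approximation of what such a helper might look like; it is not copied from the driver, and the exact rounding used there may differ.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Advance base_time by whole cycles until it is not in the past.
 * Assumption: the caller treats -EINVAL as "schedule not usable".
 */
static int demo_future_base_time(int64_t base_time, int64_t cycle_time,
				 int64_t now, int64_t *out)
{
	int64_t n;

	if (!cycle_time)
		return -EINVAL; /* would otherwise divide by zero below */

	if (base_time >= now) {
		*out = base_time;
		return 0;
	}

	/* Number of cycles needed to reach or pass "now", rounded up. */
	n = (now - base_time + cycle_time - 1) / cycle_time;
	*out = base_time + n * cycle_time;
	return 0;
}
```

For example, with a base time of 100, a cycle of 30 and `now` at 175, three cycles are needed and the result is 190.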
* [PATCH] xfrm: iptfs: fix deadlock in iptfs_destroy_state
From: Dudu Lu @ 2026-04-13 8:51 UTC (permalink / raw)
To: netdev; +Cc: steffen.klassert, herbert, davem, Dudu Lu
iptfs_destroy_state() acquires x->lock (spin_lock_bh) and then calls
hrtimer_cancel(&xtfs->iptfs_timer). The timer callback
iptfs_delay_timer() also acquires x->lock (spin_lock). If the timer
fires on another CPU during destroy, hrtimer_cancel() waits for the
callback to complete, but the callback is blocked trying to acquire
the same lock, so neither side can make progress.
The same pattern exists for drop_timer: destroy holds drop_lock and
calls hrtimer_cancel(&xtfs->drop_timer), while iptfs_drop_timer()
also acquires drop_lock.
Fix by cancelling the timers before acquiring the locks. The timer
callbacks check for state validity, so a late cancel is safe. The
queue splice is still done under the lock for consistency.
Fixes: 4b3faf610cc6 ("xfrm: iptfs: add new iptfs xfrm mode impl")
Signed-off-by: Dudu Lu <phx0fer@gmail.com>
---
net/xfrm/xfrm_iptfs.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/net/xfrm/xfrm_iptfs.c b/net/xfrm/xfrm_iptfs.c
index 97bc979e55ba..11291b87158c 100644
--- a/net/xfrm/xfrm_iptfs.c
+++ b/net/xfrm/xfrm_iptfs.c
@@ -2708,8 +2708,10 @@ static void iptfs_destroy_state(struct xfrm_state *x)
if (!xtfs)
return;
- spin_lock_bh(&xtfs->x->lock);
hrtimer_cancel(&xtfs->iptfs_timer);
+ hrtimer_cancel(&xtfs->drop_timer);
+
+ spin_lock_bh(&xtfs->x->lock);
__skb_queue_head_init(&list);
skb_queue_splice_init(&xtfs->queue, &list);
spin_unlock_bh(&xtfs->x->lock);
@@ -2717,9 +2719,7 @@ static void iptfs_destroy_state(struct xfrm_state *x)
while ((skb = __skb_dequeue(&list)))
kfree_skb(skb);
- spin_lock_bh(&xtfs->drop_lock);
- hrtimer_cancel(&xtfs->drop_timer);
- spin_unlock_bh(&xtfs->drop_lock);
+ /* drop_timer already cancelled above */
if (xtfs->ra_newskb)
kfree_skb(xtfs->ra_newskb);
--
2.39.3 (Apple Git-145)
* [PATCH] net/sched: act_mirred: fix wrong device for mac_header_xmit check in tcf_blockcast_redir
From: Dudu Lu @ 2026-04-13 8:49 UTC (permalink / raw)
To: netdev; +Cc: jhs, jiri, Dudu Lu
In tcf_blockcast_redir(), when iterating block ports to redirect
packets to multiple devices, the mac_header_xmit flag is queried
from the wrong device. The loop sends to dev_prev but queries
dev_is_mac_header_xmit(dev) — which is the NEXT device in the
iteration, not the one being sent to.
This causes tcf_mirred_to_dev() to make incorrect decisions about
whether to push or pull the MAC header. When the block contains
mixed device types (e.g., an ethernet veth and a tunnel device),
intermediate devices get the wrong mac_header_xmit flag, leading to
skb header corruption. In the worst case, skb_push_rcsum with an
incorrect mac_len can exhaust headroom and panic.
The last device in the loop is handled correctly (lines 365-366 use
dev_is_mac_header_xmit(dev_prev)), confirming this is a copy-paste
oversight for the intermediate devices.
Fix by using dev_prev instead of dev for the mac_header_xmit query,
consistent with the device actually being sent to.
Fixes: 42f39036cda8 ("net/sched: act_mirred: Allow mirred to block")
Signed-off-by: Dudu Lu <phx0fer@gmail.com>
---
net/sched/act_mirred.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index 05e0b14b5773..2c5a7a321a94 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -354,7 +354,7 @@ static int tcf_blockcast_redir(struct sk_buff *skb, struct tcf_mirred *m,
goto assign_prev;
tcf_mirred_to_dev(skb, m, dev_prev,
- dev_is_mac_header_xmit(dev),
+ dev_is_mac_header_xmit(dev_prev),
mirred_eaction, retval);
assign_prev:
dev_prev = dev;
--
2.39.3 (Apple Git-145)
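The "previous device" iteration that the act_mirred patch above corrects can be sketched in userspace as follows. `struct demo_dev`, `struct demo_send`, and `demo_blockcast()` are hypothetical names made up for this illustration; the point is only that the flag passed along with each send must describe the device being sent to, not the one the loop has just advanced to.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device with the one property the send path cares about. */
struct demo_dev {
	int id;
	bool mac_header_xmit;
};

/* Record of one send: which device, and which flag value accompanied it. */
struct demo_send {
	int dev_id;
	bool flag_used;
};

/*
 * Deferred-send loop: the send for a device is issued once the next
 * device is found, and the last device is sent after the loop ends.
 * The log array must have room for n entries.
 */
static size_t demo_blockcast(const struct demo_dev *devs, size_t n,
			     struct demo_send *log)
{
	const struct demo_dev *prev = NULL;
	size_t sent = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		const struct demo_dev *dev = &devs[i];

		if (prev) {
			log[sent].dev_id = prev->id;
			/* Query prev here; querying dev is the bug. */
			log[sent].flag_used = prev->mac_header_xmit;
			sent++;
		}
		prev = dev;
	}

	if (prev) {
		log[sent].dev_id = prev->id;
		log[sent].flag_used = prev->mac_header_xmit;
		sent++;
	}

	return sent;
}
```

With a mixed list such as (1, true), (2, false), (3, true), every logged send carries the flag of its own device; using `dev` instead of `prev` inside the loop would have paired device 1 with device 2's flag.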
* Re: [PATCH net-next v4 5/5] selftests: net: bridge: add MRC and QQIC field encoding tests
From: Ido Schimmel @ 2026-04-13 8:47 UTC (permalink / raw)
To: Ujjal Roy
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Nikolay Aleksandrov, David Ahern, Shuah Khan,
Andy Roulin, Yong Wang, Petr Machata, Ujjal Roy, bridge, netdev,
linux-kernel, linux-kselftest
In-Reply-To: <20260412111047.1326-6-royujjal@gmail.com>
See some comments below, but note that net-next is closed:
https://lore.kernel.org/netdev/20260412142250.131bf997@kernel.org/
So you can either wait with v5 until it is open again or post it as RFC
so that we can at least review (but not merge) it while net-next is
closed.
On Sun, Apr 12, 2026 at 11:10:47AM +0000, Ujjal Roy wrote:
> Enhance vlmc_query_intvl_test and vlmc_query_response_intvl_test in
> bridge_vlan_mcast.sh to validate IGMPv3/MLDv2 protocol compliance for
> MRC and QQIC field encoding across both linear and exponential ranges.
>
> TEST: Vlan multicast snooping enable [ OK ]
> TEST: Vlan mcast_query_interval global option default value [ OK ]
> INFO: Vlan 10 mcast_query_interval (QQIC) test cases:
> TEST: Number of tagged IGMPv2 general query [ OK ]
> TEST: IGMPv3 QQIC linear value 60 [ OK ]
> TEST: MLDv2 QQIC linear value 60 [ OK ]
> TEST: IGMPv3 QQIC non linear value 160 [ OK ]
> TEST: MLDv2 QQIC non linear value 160 [ OK ]
> TEST: Vlan mcast_query_response_interval global option default value [ OK ]
> INFO: Vlan 10 mcast_query_response_interval (MRC) test cases:
> TEST: IGMPv3 MRC linear value 60 [ OK ]
> TEST: IGMPv3 MRC non linear value 160 [ OK ]
> TEST: MLDv2 MRC linear value 30000 [ OK ]
> TEST: MLDv2 MRC non linear value 60000 [ OK ]
>
> Signed-off-by: Ujjal Roy <royujjal@gmail.com>
> ---
> .../net/forwarding/bridge_vlan_mcast.sh | 150 +++++++++++++++++-
> 1 file changed, 142 insertions(+), 8 deletions(-)
>
> diff --git a/tools/testing/selftests/net/forwarding/bridge_vlan_mcast.sh b/tools/testing/selftests/net/forwarding/bridge_vlan_mcast.sh
> index e8031f68200a..9f9f33d58286 100755
> --- a/tools/testing/selftests/net/forwarding/bridge_vlan_mcast.sh
> +++ b/tools/testing/selftests/net/forwarding/bridge_vlan_mcast.sh
> @@ -162,14 +162,27 @@ vlmc_query_cnt_setup()
> {
> local type=$1
> local dev=$2
> + local match=$3
>
> if [[ $type == "igmp" ]]; then
> - tc filter add dev $dev egress pref 10 prot 802.1Q \
> + # This matches: IP Protocol 2 (IGMP)
> + tc filter add dev "$dev" egress pref 10 prot 802.1Q \
> flower vlan_id 10 vlan_ethtype ipv4 dst_ip 224.0.0.1 ip_proto 2 \
> + action continue
> + # AND Type 0x11 (Query) at offset 24 after IP
> + # IP (20 byte IP + 4 bytes Option)
Let's make it clearer: 20 bytes IPv4 header + 4 bytes Router Alert option
> + match=(match u8 0x11 0xff at 24 $match)
> + tc filter add dev "$dev" egress pref 20 prot 802.1Q u32 "${match[@]}" \
> action pass
> else
> - tc filter add dev $dev egress pref 10 prot 802.1Q \
> + # This matches: ICMPv6
> + tc filter add dev "$dev" egress pref 10 prot 802.1Q \
> flower vlan_id 10 vlan_ethtype ipv6 dst_ip ff02::1 ip_proto icmpv6 \
> + action continue
> + # AND Type 0x82 (Query) at offset 48 after IPv6
> + # IPv6 (40 bytes IPv6 + 2 bytes next HDR + 4 bytes Option + 2 byte pad)
Same: 40 bytes IPv6 header + 8 bytes Hop-by-hop option
> + match=(match u8 0x82 0xff at 48 $match)
> + tc filter add dev "$dev" egress pref 20 prot 802.1Q u32 "${match[@]}" \
> action pass
> fi
Sashiko has a relevant comment:
"
Does this configuration evaluate all packets against the pref 20 filter,
regardless of the pref 10 result?
In tc, if a packet does not match a filter, classification automatically falls
through to the next priority filter. By using "action continue" on pref 10,
matching packets are also instructed to continue evaluation at the next filter.
Because both matching and non-matching packets proceed to pref 20, pref 10
seems to act as a no-op gate. Could this cause the u32 rules in pref 20 to
inadvertently match unrelated background traffic on the interface?
To implement a logical AND across different classifiers, should pref 10 use
"action goto chain 1" with pref 20 placed inside chain 1?
"
>
> @@ -181,7 +194,53 @@ vlmc_query_cnt_cleanup()
> local dev=$1
>
> ip link set dev br0 type bridge mcast_stats_enabled 0
> - tc filter del dev $dev egress pref 10
> + tc filter del dev "$dev" egress pref 20
> + tc filter del dev "$dev" egress pref 10
> +}
> +
> +vlmc_query_get_intvl_match()
> +{
> + local type=$1
> + local version=$2
> + local test=$3
> + local interval=$4
> +
> + if [ "$test" = "qqic" ]; then
> + # QQIC is 8-bit floating point encoding for IGMPv3 and MLDv2
> + if [ "${type}v${version}" = "igmpv3" ]; then
> + # IP 20 bytes + 4 bytes Option + IGMPv3[9]
> + if [[ $interval -lt 128 ]]; then
> + echo "match u8 0x3c 0xff at 33"
Please pass the expected value as an argument instead of hard coding
"0x3c" here. Same in other places in the function.
> + else
> + echo "match u8 0x84 0xff at 33"
> + fi
> + elif [ "${type}v${version}" = "mldv2" ]; then
> + # IPv6 40 + 2 next HDR + 4 Option + 2 pad + MLDv2[25]
> + if [[ $interval -lt 128 ]]; then
> + echo "match u8 0x3c 0xff at 73"
> + else
> + echo "match u8 0x84 0xff at 73"
> + fi
> + fi
> + elif [ "$test" = "mrc" ]; then
> + if [ "${type}v${version}" = "igmpv3" ]; then
> + # MRC is 8-bit floating point encoding for IGMPv3
> + # IP 20 bytes + 4 bytes Option + IGMPv3[1]
> + if [[ $interval -lt 128 ]]; then
> + echo "match u8 0x3c 0xff at 25"
> + else
> + echo "match u8 0x84 0xff at 25"
> + fi
> + elif [ "${type}v${version}" = "mldv2" ]; then
> + # MRC is 16-bit floating point encoding for MLDv2
> + # IPv6 40 + 2 next HDR + 4 Option + 2 pad + MLDv2[4]
> + if [[ $interval -lt 32768 ]]; then
> + echo "match u16 0x7530 0xffff at 52"
> + else
> + echo "match u16 0x8d4c 0xffff at 52"
> + fi
> + fi
> + fi
> }
>
> vlmc_check_query()
> @@ -191,9 +250,13 @@ vlmc_check_query()
> local dev=$3
> local expect=$4
> local time=$5
> + local test=$6
> + local interval=$7
> + local intvl_match=""
> local ret=0
>
> - vlmc_query_cnt_setup $type $dev
> + intvl_match="$(vlmc_query_get_intvl_match "$type" "$version" "$test" "$interval")"
> + vlmc_query_cnt_setup "$type" "$dev" "$intvl_match"
>
> local pre_tx_xstats=$(vlmc_query_cnt_xstats $type $version $dev)
> bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_querier 1
> @@ -201,7 +264,7 @@ vlmc_check_query()
> if [[ $ret -eq 0 ]]; then
> sleep $time
>
> - local tcstats=$(tc_rule_stats_get $dev 10 egress)
> + local tcstats=$(tc_rule_stats_get "$dev" 20 egress)
> local post_tx_xstats=$(vlmc_query_cnt_xstats $type $version $dev)
>
> if [[ $tcstats != $expect || \
> @@ -441,6 +504,7 @@ vlmc_query_intvl_test()
> check_err $? "Wrong default mcast_query_interval global vlan option value"
> log_test "Vlan mcast_query_interval global option default value"
>
> + log_info "Vlan 10 mcast_query_interval (QQIC) test cases:"
Let's remove this as it makes the output confusing:
INFO: Vlan 10 mcast_query_response_interval (MRC) test cases:
TEST: IGMPv3 MRC linear value 60 [ OK ]
[...]
TEST: Flood unknown vlan multicast packets to router port only [ OK ]
TEST: Disable multicast vlan snooping when vlan filtering is disabled [ OK ]
> RET=0
> bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_startup_query_count 0
> bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_query_interval 200
> @@ -448,8 +512,42 @@ vlmc_query_intvl_test()
> # 1 is sent immediately, then 2 more in the next 5 seconds
> vlmc_check_query igmp 2 $swp1 3 5
> check_err $? "Wrong number of tagged IGMPv2 general queries sent"
> - log_test "Vlan 10 mcast_query_interval option changed to 200"
> + log_test "Number of tagged IGMPv2 general query"
>
> + RET=0
> + bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_igmp_version 3
> + check_err $? "Could not set mcast_igmp_version in vlan 10"
> + bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_mld_version 2
> + check_err $? "Could not set mcast_mld_version in vlan 10"
> + bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_query_interval 6000
> + check_err $? "Could not set mcast_query_interval in vlan 10"
> + # 1 is sent immediately, IGMPv3 QQIC should match with linear value 60s
> + vlmc_check_query igmp 3 $swp1 1 1 qqic 60
> + check_err $? "Wrong QQIC in generated IGMPv3 general queries"
> + log_test "IGMPv3 QQIC linear value 60"
> +
> + RET=0
> + # 1 is sent immediately, MLDv2 QQIC should match with linear value 60s
> + vlmc_check_query mld 2 $swp1 1 1 qqic 60
> + check_err $? "Wrong QQIC in generated MLDv2 general queries"
> + log_test "MLDv2 QQIC linear value 60"
> +
> + RET=0
> + bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_query_interval 16000
> + check_err $? "Could not set mcast_query_interval in vlan 10"
> + # 1 is sent immediately, IGMPv3 QQIC should match with non linear value 160s
> + vlmc_check_query igmp 3 $swp1 1 1 qqic 160
> + check_err $? "Wrong QQIC in generated IGMPv3 general queries"
> + log_test "IGMPv3 QQIC non linear value 160"
> +
> + RET=0
> + # 1 is sent immediately, MLDv2 QQIC should match with non linear value 160s
> + vlmc_check_query mld 2 $swp1 1 1 qqic 160
> + check_err $? "Wrong QQIC in generated MLDv2 general queries"
> + log_test "MLDv2 QQIC non linear value 160"
> +
> + bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_igmp_version 2
> + bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_mld_version 1
> bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_startup_query_count 2
> bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_query_interval 12500
> }
> @@ -468,11 +566,47 @@ vlmc_query_response_intvl_test()
> check_err $? "Wrong default mcast_query_response_interval global vlan option value"
> log_test "Vlan mcast_query_response_interval global option default value"
>
> + log_info "Vlan 10 mcast_query_response_interval (MRC) test cases:"
Same
> + RET=0
> + bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_startup_query_count 0
> + bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_igmp_version 3
> + check_err $? "Could not set mcast_igmp_version in vlan 10"
> + bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_query_response_interval 600
> + check_err $? "Could not set mcast_query_response_interval in vlan 10"
> + # 1 is sent immediately, IGMPv3 MRC should match with linear value 60 units of 1/10s
> + vlmc_check_query igmp 3 $swp1 1 1 mrc 60
> + check_err $? "Wrong MRC in generated IGMPv3 general queries"
> + log_test "IGMPv3 MRC linear value 60"
> +
> + RET=0
> + bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_query_response_interval 1600
> + check_err $? "Could not set mcast_query_response_interval in vlan 10"
> + # 1 is sent immediately, IGMPv3 MRC should match with non linear value 160 unit of 1/10s
> + vlmc_check_query igmp 3 $swp1 1 1 mrc 160
> + check_err $? "Wrong MRC in generated IGMPv3 general queries"
> + log_test "IGMPv3 MRC non linear value 160"
> +
> + RET=0
> + bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_mld_version 2
> + check_err $? "Could not set mcast_mld_version in vlan 10"
> + bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_query_response_interval 3000
> + check_err $? "Could not set mcast_query_response_interval in vlan 10"
> + # 1 is sent immediately, MLDv2 MRC should match with linear value 30000(ms)
> + vlmc_check_query mld 2 $swp1 1 1 mrc 30000
> + check_err $? "Wrong MRC in generated MLDv2 general queries"
> + log_test "MLDv2 MRC linear value 30000"
> +
> RET=0
> - bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_query_response_interval 200
> + bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_query_response_interval 6000
> check_err $? "Could not set mcast_query_response_interval in vlan 10"
> - log_test "Vlan 10 mcast_query_response_interval option changed to 200"
> + # 1 is sent immediately, MLDv2 MRC should match with non linear value 60000(ms)
> + vlmc_check_query mld 2 $swp1 1 1 mrc 60000
> + check_err $? "Wrong MRC in generated MLDv2 general queries"
> + log_test "MLDv2 MRC non linear value 60000"
>
> + bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_igmp_version 2
> + bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_mld_version 1
> + bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_startup_query_count 2
> bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_query_response_interval 1000
> }
>
> --
> 2.43.0
>
* [PATCH] net/sched: sch_cake: fix NAT destination port not being updated in cake_update_flowkeys
From: Dudu Lu @ 2026-04-13 8:47 UTC (permalink / raw)
To: netdev; +Cc: toke, jhs, jiri, Dudu Lu
cake_update_flowkeys() is supposed to update the flow dissector keys
with the NAT-translated addresses and ports from conntrack, so that
CAKE's per-flow fairness correctly identifies post-NAT flows as
belonging to the same connection.
For the source port, this works correctly:
keys->ports.src = port; /* writes conntrack port into keys */
But for the destination port, the assignment is reversed:
port = keys->ports.dst; /* reads FROM keys into local var — no-op */
This means the NAT destination port is never updated in the flow keys.
As a result, when multiple connections are NATed to the same destination
(same IP + same port), CAKE treats them as separate flows because the
original (pre-NAT) destination ports differ. This completely defeats
CAKE's NAT-aware flow isolation when using the "nat" mode.
The bug was introduced in commit b0c19ed6088a ("sch_cake: Take
advantage of skb->hash where appropriate") which refactored the original
direct assignment into a compare-and-conditionally-update pattern, but
wrote the destination port update backwards.
Fix by reversing the assignment direction to match the source port
pattern.
Fixes: b0c19ed6088a ("sch_cake: Take advantage of skb->hash where appropriate")
Signed-off-by: Dudu Lu <phx0fer@gmail.com>
---
net/sched/sch_cake.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 9efe23f8371b..4ac6c36ca6e4 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -619,7 +619,7 @@ static bool cake_update_flowkeys(struct flow_keys *keys,
}
port = rev ? tuple.src.u.all : tuple.dst.u.all;
if (port != keys->ports.dst) {
- port = keys->ports.dst;
+ keys->ports.dst = port;
upd = true;
}
}
--
2.39.3 (Apple Git-145)
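The compare-and-update pattern from the sch_cake patch above, reduced to a standalone sketch. `struct demo_ports` and `demo_update_ports()` are hypothetical and carry none of the conntrack lookup; they only show the assignment direction for both ports.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the port pair inside the flow keys. */
struct demo_ports {
	uint16_t src;
	uint16_t dst;
};

/*
 * Overwrite the keys with the translated ports when they differ.
 * Returns true if anything changed, so the caller knows the hash
 * must be recomputed.
 */
static bool demo_update_ports(struct demo_ports *keys,
			      uint16_t nat_src, uint16_t nat_dst)
{
	bool upd = false;

	if (nat_src != keys->src) {
		keys->src = nat_src;
		upd = true;
	}
	if (nat_dst != keys->dst) {
		keys->dst = nat_dst; /* write into the keys, not into the local */
		upd = true;
	}

	return upd;
}
```

A second call with the same translated ports returns false, which is the property the reversed assignment broke: the keys never converged on the translated destination port.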
* Re: [PATCH net-next v4 3/5] ipv4: igmp: encode multicast exponential fields
From: Ido Schimmel @ 2026-04-13 8:47 UTC (permalink / raw)
To: Ujjal Roy
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Nikolay Aleksandrov, David Ahern, Shuah Khan,
Andy Roulin, Yong Wang, Petr Machata, Ujjal Roy, bridge, netdev,
linux-kernel, linux-kselftest
In-Reply-To: <20260412111047.1326-4-royujjal@gmail.com>
On Sun, Apr 12, 2026 at 11:10:45AM +0000, Ujjal Roy wrote:
> In IGMP, MRC and QQIC fields are not correctly encoded
> when generating query packets. Since the receiver of the
> query interprets these fields using the IGMPv3 floating-
> point decoding logic, any value that exceeds the linear
> threshold is incorrectly parsed as an exponential value,
> leading to an incorrect interval calculation.
>
> Encode and assign the corresponding protocol fields during
> query generation. Introduce the logic to dynamically
> calculate the exponent and mantissa using bit-scan (fls).
> This ensures MRC and QQIC fields (8-bit) are properly
> encoded when transmitting query packets with intervals
> that exceed their respective linear threshold value of
> 128 (for MRT/QQI).
>
> RFC3376: for both MRC and QQIC, values >= 128 represent
> the same floating-point encoding as follows:
> 0 1 2 3 4 5 6 7
> +-+-+-+-+-+-+-+-+
> |1| exp | mant |
> +-+-+-+-+-+-+-+-+
>
> Signed-off-by: Ujjal Roy <royujjal@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
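For readers following the encoding discussion, here is a userspace sketch of the 8-bit floating-point format quoted above (1 flag bit, 3-bit exponent, 4-bit mantissa, decoded as `(mant | 0x10) << (exp + 3)`). The `demo_` functions are written for this note and are not the helpers from the series; in particular the encoder finds the top bit with a plain loop rather than fls(), truncates values that are not exactly representable, and clamps values above the largest representable one, which are assumptions of this sketch.

```c
#include <stdint.h>

/* Decode an 8-bit MRC/QQIC code into its value. */
static unsigned int demo_igmpv3_decode(uint8_t code)
{
	unsigned int exp, mant;

	if (code < 128)
		return code; /* linear range */

	exp = (code >> 4) & 0x7;
	mant = code & 0xf;
	return (mant | 0x10) << (exp + 3);
}

/* Encode a value into an 8-bit code, rounding down where necessary. */
static uint8_t demo_igmpv3_encode(unsigned int value)
{
	unsigned int msb = 0, exp, mant, tmp;

	if (value < 128)
		return (uint8_t)value;
	if (value > 31744) /* (0xf | 0x10) << (7 + 3) */
		return 0xff; /* assumption: clamp to the largest code */

	/* Index of the most significant set bit (7..14 in this range). */
	for (tmp = value >> 1; tmp; tmp >>= 1)
		msb++;

	exp = msb - 7;
	mant = (value >> (exp + 3)) & 0xf;
	return (uint8_t)(0x80 | (exp << 4) | mant);
}
```

With this sketch, 60 stays 0x3c (linear range) and 160 becomes 0x84, which agrees with the values the selftest in patch 5/5 matches on.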
* Re: [PATCH net-next v4 2/5] ipv6: mld: rename mldv2_mrc() and add mldv2_qqi()
From: Ido Schimmel @ 2026-04-13 8:46 UTC (permalink / raw)
To: Ujjal Roy
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Nikolay Aleksandrov, David Ahern, Shuah Khan,
Andy Roulin, Yong Wang, Petr Machata, Ujjal Roy, bridge, netdev,
linux-kernel, linux-kselftest
In-Reply-To: <20260412111047.1326-3-royujjal@gmail.com>
On Sun, Apr 12, 2026 at 11:10:44AM +0000, Ujjal Roy wrote:
> Rename mldv2_mrc() to mldv2_mrd() as it is used to calculate
> the Maximum Response Delay from the Maximum Response Code.
>
> Introduce a new API mldv2_qqi() to define the existing
> calculation logic of QQI from QQIC. This also organizes
> the existing mld_update_qi() API.
>
> Signed-off-by: Ujjal Roy <royujjal@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
* Re: [PATCH net-next v4 1/5] ipv4: igmp: get rid of IGMPV3_{QQIC,MRC} and simplify calculation
From: Ido Schimmel @ 2026-04-13 8:46 UTC (permalink / raw)
To: Ujjal Roy
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Nikolay Aleksandrov, David Ahern, Shuah Khan,
Andy Roulin, Yong Wang, Petr Machata, Ujjal Roy, bridge, netdev,
linux-kernel, linux-kselftest
In-Reply-To: <20260412111047.1326-2-royujjal@gmail.com>
On Sun, Apr 12, 2026 at 11:10:43AM +0000, Ujjal Roy wrote:
> Get rid of the IGMPV3_MRC macro and use the igmpv3_mrt() API to
> calculate the Max Resp Time from the Maximum Response Code.
>
> Similarly, for IGMPV3_QQIC, use the igmpv3_qqi() API to calculate
> the Querier's Query Interval from the QQIC field.
>
> Signed-off-by: Ujjal Roy <royujjal@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
* [PATCH] net/sched: act_ct: fix skb leak on fragment check failure
From: Dudu Lu @ 2026-04-13 8:46 UTC (permalink / raw)
To: netdev; +Cc: jhs, jiri, Dudu Lu
tcf_ct_handle_fragments() returns TC_ACT_CONSUMED when
tcf_ct_ipv4/6_is_fragment() fails. This causes the caller to
believe the skb was consumed, but it was not freed. Each
malformed fragment leaks one skb, leading to OOM DoS under
sustained traffic.
Change the return value to TC_ACT_SHOT so the skb is properly
freed by the caller.
Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct")
Signed-off-by: Dudu Lu <phx0fer@gmail.com>
---
net/sched/act_ct.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
index 7d5e50c921a0..870655f682bd 100644
--- a/net/sched/act_ct.c
+++ b/net/sched/act_ct.c
@@ -1107,8 +1107,10 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a,
return retval;
out_frag:
- if (err != -EINPROGRESS)
+ if (err != -EINPROGRESS) {
tcf_action_inc_drop_qstats(&c->common);
+ return TC_ACT_SHOT;
+ }
return TC_ACT_CONSUMED;
drop:
--
2.39.3 (Apple Git-145)
* Re: [PATCH net-next] pppoe: optimize hash with word access
From: Eric Dumazet @ 2026-04-13 8:42 UTC (permalink / raw)
To: Qingfang Deng
Cc: Andrew Lunn, David S. Miller, Jakub Kicinski, Paolo Abeni,
Guillaume Nault, Kees Cook, Eric Woudstra, netdev, linux-kernel
In-Reply-To: <20260413035212.56566-1-qingfang.deng@linux.dev>
On Sun, Apr 12, 2026 at 8:52 PM Qingfang Deng <qingfang.deng@linux.dev> wrote:
>
> Currently, hash_item() processes the 6-byte Ethernet address and the
> 2-byte session ID byte-wise to compute a hash.
>
> Optimize this by using 16-bit word operations: XOR three 16-bit words
> from the Ethernet address and the 16-bit session ID, then fold the
> result. This reduces the total number of loads and XORs. The Ethernet
> addresses in a skb and struct pppoe_addr are both 2-byte aligned, so the
> u16 pointer cast is safe.
>
> Signed-off-by: Qingfang Deng <qingfang.deng@linux.dev>
net-next is closed.
https://lore.kernel.org/netdev/20260412142250.131bf997@kernel.org/
Also I would suggest using hash32(hash, PPPOE_HASH_BITS)
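A userspace sketch of what the word-wise hash plus the suggested final step might look like. This assumes the suggestion refers to a multiplicative hash in the style of the kernel's hash_32() helper; `demo_hash_32()`, `demo_hash_item()`, the bucket count `DEMO_HASH_BITS`, and the multiplier constant are assumptions of this illustration, not code from the patch.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_HASH_BITS 4 /* assumption: 16 buckets */
#define DEMO_GOLDEN_RATIO_32 0x61C88647u /* assumption: 32-bit multiplier */

/* Multiplicative hash: keep the top "bits" bits of the product. */
static uint32_t demo_hash_32(uint32_t val, unsigned int bits)
{
	return (uint32_t)(val * DEMO_GOLDEN_RATIO_32) >> (32 - bits);
}

/* XOR the three 16-bit words of the address with the session ID, then mix. */
static unsigned int demo_hash_item(uint16_t sid, const unsigned char addr[6])
{
	uint16_t w[3];
	uint32_t hash;

	/* memcpy sidesteps the alignment question in this sketch. */
	memcpy(w, addr, sizeof(w));
	hash = (uint32_t)(w[0] ^ w[1] ^ w[2] ^ sid);

	return demo_hash_32(hash, DEMO_HASH_BITS);
}
```

Compared with folding the XOR result by hand, the multiplicative step lets every input bit influence the bucket index, which is presumably the motivation for the suggestion.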
* Re: [PATCH net-next v2] net: openvswitch: decouple flow_table from ovs_mutex
From: Paolo Abeni @ 2026-04-13 8:39 UTC (permalink / raw)
To: Adrian Moreno, netdev
Cc: Aaron Conole, Eelco Chaudron, Ilya Maximets, David S. Miller,
Eric Dumazet, Jakub Kicinski, Simon Horman, open list:OPENVSWITCH,
open list
In-Reply-To: <20260407120418.356718-1-amorenoz@redhat.com>
On 4/7/26 2:04 PM, Adrian Moreno wrote:
> Currently the entire ovs module is write-protected using the global
> ovs_mutex. While this simple approach works fine for control-plane
> operations (such as vport configurations), requiring the global mutex
> for flow modifications can be problematic.
>
> During periods of high control-plane operations, e.g. netdevs (vports)
> coming and going, RTNL can suffer contention. This contention is easily
> transferred to the ovs_mutex as RTNL nests inside ovs_mutex. Flow
> modifications, however, are done as part of packet processing and having
> them wait for RTNL pressure to go away can lead to packet drops.
>
> This patch decouples flow_table modifications from ovs_mutex by means of
> the following:
>
> 1 - Make flow_table an rcu-protected pointer inside the datapath.
> This allows both objects to be protected independently while reducing the
> amount of changes required in "flow_table.c".
>
> 2 - Create a new mutex inside the flow_table that protects it from
> concurrent modifications.
> Putting the mutex inside flow_table makes it easier to consume for
> functions inside flow_table.c that do not currently take pointers to the
> datapath.
> Some function signatures need to be changed to accept flow_table so that
> lockdep checks can be performed.
>
> 3 - Create a reference count to temporarily extend rcu protection from
> the datapath to the flow_table.
> In order to use the flow_table without locking ovs_mutex, the flow_table
> pointer must be first dereferenced within an rcu-protected region.
> Next, the table->mutex needs to be locked to protect it from
> concurrent writes but mutexes must not be locked inside an rcu-protected
> region, so the rcu-protected region must be left at which point the
> datapath can be concurrently freed.
> To extend the protection beyond the rcu region, a reference count is used.
> One reference is held by the datapath, the other is temporarily
> increased during flow modifications. For example:
>
> Datapath deletion:
>
> ovs_lock();
> table = rcu_dereference_protected(dp->table, ...);
> rcu_assign_pointer(dp->table, NULL);
> ovs_flow_tbl_put(table);
> ovs_unlock();
>
> Flow modification:
>
> rcu_read_lock();
> dp = get_dp(...);
> table = rcu_dereference(dp->table);
> ovs_flow_tbl_get(table);
> rcu_read_unlock();
>
> mutex_lock(&table->lock);
> /* Perform modifications on the flow_table */
> mutex_unlock(&table->lock);
> ovs_flow_tbl_put(table);
>
> Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
> ---
> v2: Fix argument in ovs_flow_tbl_put (sparse)
> Remove rcu checks in ovs_dp_masks_rebalance
> ---
> net/openvswitch/datapath.c | 285 ++++++++++++++++++++++++-----------
> net/openvswitch/datapath.h | 2 +-
> net/openvswitch/flow.c | 13 +-
> net/openvswitch/flow.h | 9 +-
> net/openvswitch/flow_table.c | 180 ++++++++++++++--------
> net/openvswitch/flow_table.h | 51 ++++++-
> 6 files changed, 380 insertions(+), 160 deletions(-)
This is too big for a single patch. The changelog above already suggests
a way of splitting the change. At a minimum, the RCU-ification should be
straightforward to move into a separate patch, which in turn should be
easy to review.
> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> index e209099218b4..9c234993520c 100644
> --- a/net/openvswitch/datapath.c
> +++ b/net/openvswitch/datapath.c
> @@ -88,13 +88,17 @@ static void ovs_notify(struct genl_family *family,
> * DOC: Locking:
> *
> * All writes e.g. Writes to device state (add/remove datapath, port, set
> - * operations on vports, etc.), Writes to other state (flow table
> - * modifications, set miscellaneous datapath parameters, etc.) are protected
> - * by ovs_lock.
> + * operations on vports, etc.) and writes to other datapath parameters
> + * are protected by ovs_lock.
> + *
> + * Writes to the flow table are NOT protected by ovs_lock. Instead, a per-table
> + * mutex and reference count are used (see comment above "struct flow_table"
> + * definition). On some few occasions, the per-flow table mutex is nested
> + * inside ovs_mutex.
> *
> * Reads are protected by RCU.
> *
> - * There are a few special cases (mostly stats) that have their own
> + * There are a few other special cases (mostly stats) that have their own
> * synchronization but they nest under all of above and don't interact with
> * each other.
> *
> @@ -166,7 +170,6 @@ static void destroy_dp_rcu(struct rcu_head *rcu)
> {
> struct datapath *dp = container_of(rcu, struct datapath, rcu);
>
> - ovs_flow_tbl_destroy(&dp->table);
> free_percpu(dp->stats_percpu);
> kfree(dp->ports);
> ovs_meters_exit(dp);
> @@ -247,6 +250,7 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
> struct ovs_pcpu_storage *ovs_pcpu = this_cpu_ptr(ovs_pcpu_storage);
> const struct vport *p = OVS_CB(skb)->input_vport;
> struct datapath *dp = p->dp;
> + struct flow_table *table;
> struct sw_flow *flow;
> struct sw_flow_actions *sf_acts;
> struct dp_stats_percpu *stats;
> @@ -257,9 +261,16 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
> int error;
>
> stats = this_cpu_ptr(dp->stats_percpu);
> + table = rcu_dereference(dp->table);
> + if (!table) {
> + net_dbg_ratelimited("ovs: no flow table on datapath %s\n",
> + ovs_dp_name(dp));
> + kfree_skb(skb);
> + return;
> + }
>
> /* Look up flow. */
> - flow = ovs_flow_tbl_lookup_stats(&dp->table, key, skb_get_hash(skb),
> + flow = ovs_flow_tbl_lookup_stats(table, key, skb_get_hash(skb),
> &n_mask_hit, &n_cache_hit);
> if (unlikely(!flow)) {
> struct dp_upcall_info upcall;
> @@ -752,12 +763,16 @@ static struct genl_family dp_packet_genl_family __ro_after_init = {
> static void get_dp_stats(const struct datapath *dp, struct ovs_dp_stats *stats,
> struct ovs_dp_megaflow_stats *mega_stats)
> {
> + struct flow_table *table = ovsl_dereference(dp->table);
Should this be rcu_dereference_ovs_tbl()?
> int i;
>
> memset(mega_stats, 0, sizeof(*mega_stats));
>
> - stats->n_flows = ovs_flow_tbl_count(&dp->table);
> - mega_stats->n_masks = ovs_flow_tbl_num_masks(&dp->table);
> + if (table) {
> + stats->n_flows = ovs_flow_tbl_count(table);
As noted by Aaron, READ_ONCE() is now needed when reading
table->count, and WRITE_ONCE() when writing it.
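A minimal userspace sketch of the pairing, only to illustrate the idea.
The READ_ONCE()/WRITE_ONCE() definitions are simplified stand-ins for
the kernel macros (scalar types only, GCC/Clang __typeof__), and
struct flow_table_sketch plus the tbl_count_*() helpers are hypothetical
names, not the ones in flow_table.c:

```c
#include <assert.h>

/* Simplified userspace stand-ins for the kernel macros; scalars only. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical, reduced stand-in for struct flow_table. */
struct flow_table_sketch {
	unsigned int count;
};

/* Writer side: the caller is assumed to hold the per-table mutex, so the
 * plain read of t->count on the right-hand side cannot race with another
 * writer. */
static inline void tbl_count_inc(struct flow_table_sketch *t)
{
	WRITE_ONCE(t->count, t->count + 1);
}

static inline void tbl_count_dec(struct flow_table_sketch *t)
{
	assert(t->count > 0);
	WRITE_ONCE(t->count, t->count - 1);
}

/* Lockless reader, e.g. a stats path that does not take the table mutex. */
static inline unsigned int tbl_count(const struct flow_table_sketch *t)
{
	return READ_ONCE(t->count);
}
```

The point is only that every lockless read is marked and every write it
can race with is marked too; the reader may still observe a slightly
stale value.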
> + mega_stats->n_masks = ovs_flow_tbl_num_masks(table);
Sashiko says:
---
get_dp_stats() accesses table->mask_array via ovs_flow_tbl_num_masks()
while holding only ovs_mutex. Since this patch decouples flow table updates
by moving them under table->lock, ovs_flow_cmd_new() can execute
concurrently and trigger a reallocation of the mask array, freeing the old
one via call_rcu().
Because get_dp_stats() does not hold rcu_read_lock(), the thread can be
preempted (as ovs_mutex is sleepable) and the RCU grace period might expire
before the count is read. Can this lead to a use-after-free?
---
Note that it also spotted pre-existing issues; please have a look:
https://sashiko.dev/#/patchset/20260407120418.356718-1-amorenoz%40redhat.com
[...]
> @@ -71,15 +93,40 @@ struct flow_table {
>
> extern struct kmem_cache *flow_stats_cache;
>
> +#ifdef CONFIG_LOCKDEP
> +int lockdep_ovs_tbl_is_held(const struct flow_table *table);
> +#else
> +static inline int lockdep_ovs_tbl_is_held(const struct flow_table *table)
> +{
> + (void)table;
You can use the __always_unused annotation.
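Something along these lines, as a rough sketch. The #define is only
there so the fragment builds outside the kernel tree (in the kernel,
__always_unused comes from <linux/compiler_attributes.h>), and the
struct is left opaque on purpose:

```c
/* Userspace fallback; the kernel already provides this macro. */
#ifndef __always_unused
#define __always_unused __attribute__((__unused__))
#endif

struct flow_table;	/* opaque here; the stub never dereferences it */

/* !CONFIG_LOCKDEP stub: always report "held" so the checks never fire. */
static inline int
lockdep_ovs_tbl_is_held(const struct flow_table *table __always_unused)
{
	return 1;
}
```

That drops the (void)table; statement while keeping the parameter name
in the prototype.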
> + return 1;
> +}
> +#endif
> +
> +#define ASSERT_OVS_TBL(tbl) WARN_ON(!lockdep_ovs_tbl_is_held(tbl))
> +
> +/* Lock-protected update-allowed dereferences.*/
> +#define ovs_tbl_dereference(p, tbl) \
> + rcu_dereference_protected(p, lockdep_ovs_tbl_is_held(tbl))
> +
> +/* Read dereferences can be protected by either RCU, table lock or ovs_mutex. */
Is this access scheme really safe? As I understand it, tables can be
written/deleted only under the table lock. If so, this should ignore the
OVS mutex status.
/P