Netdev List
* [PATCH] net: dsa: sja1105: fix division by zero in sja1105_tas_set_runtime_params()
From: Alexander.Chesnokov @ 2026-04-13  8:51 UTC
  To: olteanv
  Cc: lvc-project, Oleg.Kazakov, Pavel.Zhigulin, Alexander Chesnokov,
	stable, Andrew Lunn, Florian Fainelli, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, linux-kernel, netdev

From: Alexander Chesnokov <Alexander.Chesnokov@kaspersky.com>

If taprio offload is configured such that none of the ports' base_time
is less than S64_MAX (the initial value of earliest_base_time), then
its_cycle_time remains zero and is passed to future_base_time() as
cycle_time, causing division by zero in div_s64().

Add a check for its_cycle_time being zero before calling
future_base_time() and return -EINVAL.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Fixes: 86db36a347b4 ("net: dsa: sja1105: Implement state machine for TAS with PTP clock source")
Cc: stable@vger.kernel.org

Signed-off-by: Alexander Chesnokov <Alexander.Chesnokov@kaspersky.com>
---
 drivers/net/dsa/sja1105/sja1105_tas.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/dsa/sja1105/sja1105_tas.c b/drivers/net/dsa/sja1105/sja1105_tas.c
index e6153848a950..ce4b544a2b9c 100644
--- a/drivers/net/dsa/sja1105/sja1105_tas.c
+++ b/drivers/net/dsa/sja1105/sja1105_tas.c
@@ -62,6 +62,9 @@ static int sja1105_tas_set_runtime_params(struct sja1105_private *priv)
 	if (!tas_data->enabled)
 		return 0;
 
+	if (!its_cycle_time)
+		return -EINVAL;
+
 	/* Roll the earliest base time over until it is in a comparable
 	 * time base with the latest, then compare their deltas.
 	 * We want to enforce that all ports' base times are within
-- 
2.43.0
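
The guard described in the commit message can be illustrated outside the kernel. The following is a minimal userspace sketch, not the driver code: future_base_time_checked() is a hypothetical name, and the rounding formula is an assumption modelled on the description of future_base_time() dividing by cycle_time.

```c
#include <stdint.h>

/* Hypothetical userspace analogue of the guarded computation. Returns 0 and
 * stores the result in *out, or -1 when cycle_time is not positive (the
 * case the patch rejects with -EINVAL before any division happens).
 */
int future_base_time_checked(int64_t base_time, int64_t cycle_time,
			     int64_t now, int64_t *out)
{
	int64_t n;

	if (cycle_time <= 0)
		return -1;	/* dividing by this value would be invalid */

	if (base_time >= now) {
		*out = base_time;	/* already in the future, keep as-is */
		return 0;
	}

	/* Whole cycles to add so that the result lands after "now". */
	n = (now - base_time + cycle_time) / cycle_time;
	*out = base_time + n * cycle_time;
	return 0;
}
```

With base_time 100, cycle_time 30 and now 175, the sketch advances the base time by three cycles; with cycle_time 0 it returns an error instead of dividing.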



* [PATCH] xfrm: iptfs: fix deadlock in iptfs_destroy_state
From: Dudu Lu @ 2026-04-13  8:51 UTC
  To: netdev; +Cc: steffen.klassert, herbert, davem, Dudu Lu

iptfs_destroy_state() acquires x->lock (spin_lock_bh) and then calls
hrtimer_cancel(&xtfs->iptfs_timer). The timer callback
iptfs_delay_timer() also acquires x->lock (spin_lock). If the timer
fires on another CPU during destroy, hrtimer_cancel() waits for the
callback to complete, but the callback is blocked trying to acquire
the lock held by the canceller, so neither side can make progress.

The same pattern exists for drop_timer: destroy holds drop_lock and
calls hrtimer_cancel(&xtfs->drop_timer), while iptfs_drop_timer()
also acquires drop_lock.

Fix by cancelling the timers before acquiring the locks. The timer
callbacks check for state validity, so a late cancel is safe. The
queue splice is still done under the lock for consistency.

Fixes: 4b3faf610cc6 ("xfrm: iptfs: add new iptfs xfrm mode impl")
Signed-off-by: Dudu Lu <phx0fer@gmail.com>
---
 net/xfrm/xfrm_iptfs.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/xfrm/xfrm_iptfs.c b/net/xfrm/xfrm_iptfs.c
index 97bc979e55ba..11291b87158c 100644
--- a/net/xfrm/xfrm_iptfs.c
+++ b/net/xfrm/xfrm_iptfs.c
@@ -2708,8 +2708,10 @@ static void iptfs_destroy_state(struct xfrm_state *x)
 	if (!xtfs)
 		return;
 
-	spin_lock_bh(&xtfs->x->lock);
 	hrtimer_cancel(&xtfs->iptfs_timer);
+	hrtimer_cancel(&xtfs->drop_timer);
+
+	spin_lock_bh(&xtfs->x->lock);
 	__skb_queue_head_init(&list);
 	skb_queue_splice_init(&xtfs->queue, &list);
 	spin_unlock_bh(&xtfs->x->lock);
@@ -2717,9 +2719,7 @@ static void iptfs_destroy_state(struct xfrm_state *x)
 	while ((skb = __skb_dequeue(&list)))
 		kfree_skb(skb);
 
-	spin_lock_bh(&xtfs->drop_lock);
-	hrtimer_cancel(&xtfs->drop_timer);
-	spin_unlock_bh(&xtfs->drop_lock);
+	/* drop_timer already cancelled above */
 
 	if (xtfs->ra_newskb)
 		kfree_skb(xtfs->ra_newskb);
-- 
2.39.3 (Apple Git-145)



* [PATCH] net/sched: act_mirred: fix wrong device for mac_header_xmit check in tcf_blockcast_redir
From: Dudu Lu @ 2026-04-13  8:49 UTC
  To: netdev; +Cc: jhs, jiri, Dudu Lu

In tcf_blockcast_redir(), when iterating block ports to redirect
packets to multiple devices, the mac_header_xmit flag is queried
from the wrong device. The loop sends to dev_prev but queries
dev_is_mac_header_xmit(dev) — which is the NEXT device in the
iteration, not the one being sent to.

This causes tcf_mirred_to_dev() to make incorrect decisions about
whether to push or pull the MAC header. When the block contains
mixed device types (e.g., an ethernet veth and a tunnel device),
intermediate devices get the wrong mac_header_xmit flag, leading to
skb header corruption. In the worst case, skb_push_rcsum with an
incorrect mac_len can exhaust headroom and panic.

The last device in the loop is handled correctly (lines 365-366 use
dev_is_mac_header_xmit(dev_prev)), confirming this is a copy-paste
oversight for the intermediate devices.

Fix by using dev_prev instead of dev for the mac_header_xmit query,
consistent with the device actually being sent to.

Fixes: 42f39036cda8 ("net/sched: act_mirred: Allow mirred to block")
Signed-off-by: Dudu Lu <phx0fer@gmail.com>
---
 net/sched/act_mirred.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index 05e0b14b5773..2c5a7a321a94 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -354,7 +354,7 @@ static int tcf_blockcast_redir(struct sk_buff *skb, struct tcf_mirred *m,
 			goto assign_prev;
 
 		tcf_mirred_to_dev(skb, m, dev_prev,
-				  dev_is_mac_header_xmit(dev),
+				  dev_is_mac_header_xmit(dev_prev),
 				  mirred_eaction, retval);
 assign_prev:
 		dev_prev = dev;
-- 
2.39.3 (Apple Git-145)
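
The "send one iteration late" pattern from the commit message can be modelled in plain C. This is a simplified sketch under stated assumptions: struct fake_dev, struct delivery and blockcast() are hypothetical stand-ins, and the mac_header_xmit field replaces the real dev_is_mac_header_xmit() query.

```c
#include <stdbool.h>
#include <stddef.h>

struct fake_dev {
	int id;
	bool mac_header_xmit;	/* stand-in for dev_is_mac_header_xmit() */
};

struct delivery {
	int dev_id;
	bool mac_header_xmit;	/* flag that was used for this delivery */
};

/* Deliver to every device in devs[], one iteration late: intermediate
 * targets are sent inside the loop, the last one after it. Returns the
 * number of deliveries recorded in out[].
 */
size_t blockcast(const struct fake_dev *devs, size_t n, struct delivery *out)
{
	const struct fake_dev *dev_prev = NULL;
	size_t sent = 0;

	for (size_t i = 0; i < n; i++) {
		const struct fake_dev *dev = &devs[i];

		if (dev_prev) {
			/* Query the flag of the device actually sent to. */
			out[sent].dev_id = dev_prev->id;
			out[sent].mac_header_xmit = dev_prev->mac_header_xmit;
			sent++;
		}
		dev_prev = dev;
	}

	if (dev_prev) {
		out[sent].dev_id = dev_prev->id;
		out[sent].mac_header_xmit = dev_prev->mac_header_xmit;
		sent++;
	}
	return sent;
}
```

Reading the flag from dev instead of dev_prev inside the loop would pair each intermediate delivery with the next device's flag, which is the mismatch the patch describes.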



* Re: [PATCH net-next v4 5/5] selftests: net: bridge: add MRC and QQIC field encoding tests
From: Ido Schimmel @ 2026-04-13  8:47 UTC
  To: Ujjal Roy
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Nikolay Aleksandrov, David Ahern, Shuah Khan,
	Andy Roulin, Yong Wang, Petr Machata, Ujjal Roy, bridge, netdev,
	linux-kernel, linux-kselftest
In-Reply-To: <20260412111047.1326-6-royujjal@gmail.com>

See some comments below, but note that net-next is closed:

https://lore.kernel.org/netdev/20260412142250.131bf997@kernel.org/

So you can either wait with v5 until it is open again or post it as RFC
so that we can at least review (but not merge) it while net-next is
closed.

On Sun, Apr 12, 2026 at 11:10:47AM +0000, Ujjal Roy wrote:
> Enhance vlmc_query_intvl_test and vlmc_query_response_intvl_test in
> bridge_vlan_mcast.sh to validate IGMPv3/MLDv2 protocol compliance for
> MRC and QQIC field encoding across both linear and exponential ranges.
> 
> TEST: Vlan multicast snooping enable                                [ OK ]
> TEST: Vlan mcast_query_interval global option default value         [ OK ]
> INFO: Vlan 10 mcast_query_interval (QQIC) test cases:
> TEST: Number of tagged IGMPv2 general query                         [ OK ]
> TEST: IGMPv3 QQIC linear value 60                                   [ OK ]
> TEST: MLDv2 QQIC linear value 60                                    [ OK ]
> TEST: IGMPv3 QQIC non linear value 160                              [ OK ]
> TEST: MLDv2 QQIC non linear value 160                               [ OK ]
> TEST: Vlan mcast_query_response_interval global option default value   [ OK ]
> INFO: Vlan 10 mcast_query_response_interval (MRC) test cases:
> TEST: IGMPv3 MRC linear value 60                                    [ OK ]
> TEST: IGMPv3 MRC non linear value 160                               [ OK ]
> TEST: MLDv2 MRC linear value 30000                                  [ OK ]
> TEST: MLDv2 MRC non linear value 60000                              [ OK ]
> 
> Signed-off-by: Ujjal Roy <royujjal@gmail.com>
> ---
>  .../net/forwarding/bridge_vlan_mcast.sh       | 150 +++++++++++++++++-
>  1 file changed, 142 insertions(+), 8 deletions(-)
> 
> diff --git a/tools/testing/selftests/net/forwarding/bridge_vlan_mcast.sh b/tools/testing/selftests/net/forwarding/bridge_vlan_mcast.sh
> index e8031f68200a..9f9f33d58286 100755
> --- a/tools/testing/selftests/net/forwarding/bridge_vlan_mcast.sh
> +++ b/tools/testing/selftests/net/forwarding/bridge_vlan_mcast.sh
> @@ -162,14 +162,27 @@ vlmc_query_cnt_setup()
>  {
>  	local type=$1
>  	local dev=$2
> +	local match=$3
>  
>  	if [[ $type == "igmp" ]]; then
> -		tc filter add dev $dev egress pref 10 prot 802.1Q \
> +		# This matches: IP Protocol 2 (IGMP)
> +		tc filter add dev "$dev" egress pref 10 prot 802.1Q \
>  			flower vlan_id 10 vlan_ethtype ipv4 dst_ip 224.0.0.1 ip_proto 2 \
> +			action continue
> +		# AND Type 0x11 (Query) at offset 24 after IP
> +		# IP (20 byte IP + 4 bytes Option)

Let's make it clearer: 20 bytes IPv4 header + 4 bytes Router Alert option

> +		match=(match u8 0x11 0xff at 24 $match)
> +		tc filter add dev "$dev" egress pref 20 prot 802.1Q u32 "${match[@]}" \
>  			action pass
>  	else
> -		tc filter add dev $dev egress pref 10 prot 802.1Q \
> +		# This matches: ICMPv6
> +		tc filter add dev "$dev" egress pref 10 prot 802.1Q \
>  			flower vlan_id 10 vlan_ethtype ipv6 dst_ip ff02::1 ip_proto icmpv6 \
> +			action continue
> +		# AND Type 0x82 (Query) at offset 48 after IPv6
> +		# IPv6 (40 bytes IPv6 + 2 bytes next HDR + 4 bytes Option + 2 byte pad)

Same: 40 bytes IPv6 header + 8 bytes Hop-by-hop option

> +		match=(match u8 0x82 0xff at 48 $match)
> +		tc filter add dev "$dev" egress pref 20 prot 802.1Q u32 "${match[@]}" \
>  			action pass
>  	fi

Sashiko has a relevant comment:

"
Does this configuration evaluate all packets against the pref 20 filter,
regardless of the pref 10 result?

In tc, if a packet does not match a filter, classification automatically falls
through to the next priority filter.  By using "action continue" on pref 10,
matching packets are also instructed to continue evaluation at the next filter.

Because both matching and non-matching packets proceed to pref 20, pref 10
seems to act as a no-op gate.  Could this cause the u32 rules in pref 20 to
inadvertently match unrelated background traffic on the interface?

To implement a logical AND across different classifiers, should pref 10 use
"action goto chain 1" with pref 20 placed inside chain 1?
"

>  
> @@ -181,7 +194,53 @@ vlmc_query_cnt_cleanup()
>  	local dev=$1
>  
>  	ip link set dev br0 type bridge mcast_stats_enabled 0
> -	tc filter del dev $dev egress pref 10
> +	tc filter del dev "$dev" egress pref 20
> +	tc filter del dev "$dev" egress pref 10
> +}
> +
> +vlmc_query_get_intvl_match()
> +{
> +	local type=$1
> +	local version=$2
> +	local test=$3
> +	local interval=$4
> +
> +	if [ "$test" = "qqic" ]; then
> +		# QQIC is 8-bit floating point encoding for IGMPv3 and MLDv2
> +		if [ "${type}v${version}" = "igmpv3" ]; then
> +			# IP 20 bytes + 4 bytes Option + IGMPv3[9]
> +			if [[ $interval -lt 128 ]]; then
> +				echo "match u8 0x3c 0xff at 33"

Please pass the expected value as an argument instead of hard coding
"0x3c" here. Same in other places in the function.

> +			else
> +				echo "match u8 0x84 0xff at 33"
> +			fi
> +		elif [ "${type}v${version}" = "mldv2" ]; then
> +			# IPv6 40 + 2 next HDR + 4 Option + 2 pad + MLDv2[25]
> +			if [[ $interval -lt 128 ]]; then
> +				echo "match u8 0x3c 0xff at 73"
> +			else
> +				echo "match u8 0x84 0xff at 73"
> +			fi
> +		fi
> +	elif [ "$test" = "mrc" ]; then
> +		if [ "${type}v${version}" = "igmpv3" ]; then
> +			# MRC is 8-bit floating point encoding for IGMPv3
> +			# IP 20 bytes + 4 bytes Option + IGMPv3[1]
> +			if [[ $interval -lt 128 ]]; then
> +				echo "match u8 0x3c 0xff at 25"
> +			else
> +				echo "match u8 0x84 0xff at 25"
> +			fi
> +		elif [ "${type}v${version}" = "mldv2" ]; then
> +			# MRC is 16-bit floating point encoding for MLDv2
> +			# IPv6 40 + 2 next HDR + 4 Option + 2 pad + MLDv2[4]
> +			if [[ $interval -lt 32768 ]]; then
> +				echo "match u16 0x7530 0xffff at 52"
> +			else
> +				echo "match u16 0x8d4c 0xffff at 52"
> +			fi
> +		fi
> +	fi
>  }
>  
>  vlmc_check_query()
> @@ -191,9 +250,13 @@ vlmc_check_query()
>  	local dev=$3
>  	local expect=$4
>  	local time=$5
> +	local test=$6
> +	local interval=$7
> +	local intvl_match=""
>  	local ret=0
>  
> -	vlmc_query_cnt_setup $type $dev
> +	intvl_match="$(vlmc_query_get_intvl_match "$type" "$version" "$test" "$interval")"
> +	vlmc_query_cnt_setup "$type" "$dev" "$intvl_match"
>  
>  	local pre_tx_xstats=$(vlmc_query_cnt_xstats $type $version $dev)
>  	bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_querier 1
> @@ -201,7 +264,7 @@ vlmc_check_query()
>  	if [[ $ret -eq 0 ]]; then
>  		sleep $time
>  
> -		local tcstats=$(tc_rule_stats_get $dev 10 egress)
> +		local tcstats=$(tc_rule_stats_get "$dev" 20 egress)
>  		local post_tx_xstats=$(vlmc_query_cnt_xstats $type $version $dev)
>  
>  		if [[ $tcstats != $expect || \
> @@ -441,6 +504,7 @@ vlmc_query_intvl_test()
>  	check_err $? "Wrong default mcast_query_interval global vlan option value"
>  	log_test "Vlan mcast_query_interval global option default value"
>  
> +	log_info "Vlan 10 mcast_query_interval (QQIC) test cases:"

Let's remove this as it makes the output confusing:

INFO: Vlan 10 mcast_query_response_interval (MRC) test cases:
TEST: IGMPv3 MRC linear value 60                                    [ OK ]
[...]
TEST: Flood unknown vlan multicast packets to router port only      [ OK ]
TEST: Disable multicast vlan snooping when vlan filtering is disabled   [ OK ]

>  	RET=0
>  	bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_startup_query_count 0
>  	bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_query_interval 200
> @@ -448,8 +512,42 @@ vlmc_query_intvl_test()
>  	# 1 is sent immediately, then 2 more in the next 5 seconds
>  	vlmc_check_query igmp 2 $swp1 3 5
>  	check_err $? "Wrong number of tagged IGMPv2 general queries sent"
> -	log_test "Vlan 10 mcast_query_interval option changed to 200"
> +	log_test "Number of tagged IGMPv2 general query"
>  
> +	RET=0
> +	bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_igmp_version 3
> +	check_err $? "Could not set mcast_igmp_version in vlan 10"
> +	bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_mld_version 2
> +	check_err $? "Could not set mcast_mld_version in vlan 10"
> +	bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_query_interval 6000
> +	check_err $? "Could not set mcast_query_interval in vlan 10"
> +	# 1 is sent immediately, IGMPv3 QQIC should match with linear value 60s
> +	vlmc_check_query igmp 3 $swp1 1 1 qqic 60
> +	check_err $? "Wrong QQIC in generated IGMPv3 general queries"
> +	log_test "IGMPv3 QQIC linear value 60"
> +
> +	RET=0
> +	# 1 is sent immediately, MLDv2 QQIC should match with linear value 60s
> +	vlmc_check_query mld 2 $swp1 1 1 qqic 60
> +	check_err $? "Wrong QQIC in generated MLDv2 general queries"
> +	log_test "MLDv2 QQIC linear value 60"
> +
> +	RET=0
> +	bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_query_interval 16000
> +	check_err $? "Could not set mcast_query_interval in vlan 10"
> +	# 1 is sent immediately, IGMPv3 QQIC should match with non linear value 160s
> +	vlmc_check_query igmp 3 $swp1 1 1 qqic 160
> +	check_err $? "Wrong QQIC in generated IGMPv3 general queries"
> +	log_test "IGMPv3 QQIC non linear value 160"
> +
> +	RET=0
> +	# 1 is sent immediately, MLDv2 QQIC should match with non linear value 160s
> +	vlmc_check_query mld 2 $swp1 1 1 qqic 160
> +	check_err $? "Wrong QQIC in generated MLDv2 general queries"
> +	log_test "MLDv2 QQIC non linear value 160"
> +
> +	bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_igmp_version 2
> +	bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_mld_version 1
>  	bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_startup_query_count 2
>  	bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_query_interval 12500
>  }
> @@ -468,11 +566,47 @@ vlmc_query_response_intvl_test()
>  	check_err $? "Wrong default mcast_query_response_interval global vlan option value"
>  	log_test "Vlan mcast_query_response_interval global option default value"
>  
> +	log_info "Vlan 10 mcast_query_response_interval (MRC) test cases:"

Same

> +	RET=0
> +	bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_startup_query_count 0
> +	bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_igmp_version 3
> +	check_err $? "Could not set mcast_igmp_version in vlan 10"
> +	bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_query_response_interval 600
> +	check_err $? "Could not set mcast_query_response_interval in vlan 10"
> +	# 1 is sent immediately, IGMPv3 MRC should match with linear value 60 units of 1/10s
> +	vlmc_check_query igmp 3 $swp1 1 1 mrc 60
> +	check_err $? "Wrong MRC in generated IGMPv3 general queries"
> +	log_test "IGMPv3 MRC linear value 60"
> +
> +	RET=0
> +	bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_query_response_interval 1600
> +	check_err $? "Could not set mcast_query_response_interval in vlan 10"
> +	# 1 is sent immediately, IGMPv3 MRC should match with non linear value 160 unit of 1/10s
> +	vlmc_check_query igmp 3 $swp1 1 1 mrc 160
> +	check_err $? "Wrong MRC in generated IGMPv3 general queries"
> +	log_test "IGMPv3 MRC non linear value 160"
> +
> +	RET=0
> +	bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_mld_version 2
> +	check_err $? "Could not set mcast_mld_version in vlan 10"
> +	bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_query_response_interval 3000
> +	check_err $? "Could not set mcast_query_response_interval in vlan 10"
> +	# 1 is sent immediately, MLDv2 MRC should match with linear value 30000(ms)
> +	vlmc_check_query mld 2 $swp1 1 1 mrc 30000
> +	check_err $? "Wrong MRC in generated MLDv2 general queries"
> +	log_test "MLDv2 MRC linear value 30000"
> +
>  	RET=0
> -	bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_query_response_interval 200
> +	bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_query_response_interval 6000
>  	check_err $? "Could not set mcast_query_response_interval in vlan 10"
> -	log_test "Vlan 10 mcast_query_response_interval option changed to 200"
> +	# 1 is sent immediately, MLDv2 MRC should match with non linear value 60000(ms)
> +	vlmc_check_query mld 2 $swp1 1 1 mrc 60000
> +	check_err $? "Wrong MRC in generated MLDv2 general queries"
> +	log_test "MLDv2 MRC non linear value 60000"
>  
> +	bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_igmp_version 2
> +	bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_mld_version 1
> +	bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_startup_query_count 2
>  	bridge vlan global set vid 10 dev br0 mcast_snooping 1 mcast_query_response_interval 1000
>  }
>  
> -- 
> 2.43.0
> 
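
For reference, the 16-bit MLDv2 values matched in the quoted test (0x7530 and 0x8d4c) can be reproduced with a small encoder. This is a sketch only: the function names are hypothetical, the layout follows the RFC 3810 Maximum Response Code format (1 flag bit, 3-bit exponent, 12-bit mantissa), and the truncating loop is an assumption rather than the logic used in the series.

```c
#include <stdint.h>

/* Decode a 16-bit Maximum Response Code into milliseconds. */
uint32_t mld_mrc_decode(uint16_t mrc)
{
	uint32_t exp, mant;

	if (mrc < 0x8000)
		return mrc;		/* linear range */

	exp = (mrc >> 12) & 0x7;
	mant = mrc & 0x0fff;
	return (mant | 0x1000) << (exp + 3);
}

/* Encode milliseconds; truncates the mantissa, saturates at the maximum. */
uint16_t mld_mrc_encode(uint32_t ms)
{
	unsigned int exp = 0;
	uint32_t mant;

	if (ms < 0x8000)
		return (uint16_t)ms;	/* linear range */
	if (ms >= (0x1fffu << 10))
		return 0xffff;		/* largest representable value */

	/* Find the smallest exponent that leaves a 13-bit significand. */
	while ((ms >> (exp + 3)) > 0x1fff)
		exp++;
	mant = (ms >> (exp + 3)) & 0x0fff;
	return (uint16_t)(0x8000 | (exp << 12) | mant);
}
```

An interval of 30000 ms stays in the linear range, while 60000 ms needs the exponential form with exponent 0 and mantissa 0xd4c.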


* [PATCH] net/sched: sch_cake: fix NAT destination port not being updated in cake_update_flowkeys
From: Dudu Lu @ 2026-04-13  8:47 UTC
  To: netdev; +Cc: toke, jhs, jiri, Dudu Lu

cake_update_flowkeys() is supposed to update the flow dissector keys
with the NAT-translated addresses and ports from conntrack, so that
CAKE's per-flow fairness correctly identifies post-NAT flows as
belonging to the same connection.

For the source port, this works correctly:
    keys->ports.src = port;  /* writes conntrack port into keys */

But for the destination port, the assignment is reversed:
    port = keys->ports.dst;  /* reads FROM keys into local var — no-op */

This means the NAT destination port is never updated in the flow keys.
As a result, when multiple connections are NATed to the same destination
(same IP + same port), CAKE treats them as separate flows because the
original (pre-NAT) destination ports differ. This completely defeats
CAKE's NAT-aware flow isolation when using the "nat" mode.

The bug was introduced in commit b0c19ed6088a ("sch_cake: Take
advantage of skb->hash where appropriate"), which refactored the
original direct assignment into a compare-and-conditionally-update
pattern, but wrote the destination port update backwards.

Fix by reversing the assignment direction to match the source port
pattern.

Fixes: b0c19ed6088a ("sch_cake: Take advantage of skb->hash where appropriate")
Signed-off-by: Dudu Lu <phx0fer@gmail.com>
---
 net/sched/sch_cake.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 9efe23f8371b..4ac6c36ca6e4 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -619,7 +619,7 @@ static bool cake_update_flowkeys(struct flow_keys *keys,
 		}
 		port = rev ? tuple.src.u.all : tuple.dst.u.all;
 		if (port != keys->ports.dst) {
-			port = keys->ports.dst;
+			keys->ports.dst = port;
 			upd = true;
 		}
 	}
-- 
2.39.3 (Apple Git-145)
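
The compare-and-conditionally-update pattern is easy to get backwards, so a standalone sketch may help. The names below (struct fake_keys, update_ports()) are hypothetical and only mirror the shape of the code described in the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

struct fake_keys {
	uint16_t src;
	uint16_t dst;
};

/* Copy NAT-translated ports into keys; return true if anything changed. */
bool update_ports(struct fake_keys *keys, uint16_t nat_src, uint16_t nat_dst)
{
	bool upd = false;

	if (nat_src != keys->src) {
		keys->src = nat_src;
		upd = true;
	}
	if (nat_dst != keys->dst) {
		keys->dst = nat_dst;	/* write into keys, not into the local */
		upd = true;
	}
	return upd;
}
```

With the assignment reversed, the function would still report an update while leaving keys->dst untouched, which matches the symptom described above.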



* Re: [PATCH net-next v4 3/5] ipv4: igmp: encode multicast exponential fields
From: Ido Schimmel @ 2026-04-13  8:47 UTC
  To: Ujjal Roy
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Nikolay Aleksandrov, David Ahern, Shuah Khan,
	Andy Roulin, Yong Wang, Petr Machata, Ujjal Roy, bridge, netdev,
	linux-kernel, linux-kselftest
In-Reply-To: <20260412111047.1326-4-royujjal@gmail.com>

On Sun, Apr 12, 2026 at 11:10:45AM +0000, Ujjal Roy wrote:
> In IGMP, MRC and QQIC fields are not correctly encoded
> when generating query packets. Since the receiver of the
> query interprets these fields using the IGMPv3 floating-
> point decoding logic, any value that exceeds the linear
> threshold is incorrectly parsed as an exponential value,
> leading to an incorrect interval calculation.
> 
> Encode and assign the corresponding protocol fields during
> query generation. Introduce the logic to dynamically
> calculate the exponent and mantissa using bit-scan (fls).
> This ensures MRC and QQIC fields (8-bit) are properly
> encoded when transmitting query packets with intervals
> that exceed their respective linear threshold value of
> 128 (for MRT/QQI).
> 
> RFC3376: for both MRC and QQIC, values >= 128 represent
> the same floating-point encoding as follows:
>      0 1 2 3 4 5 6 7
>     +-+-+-+-+-+-+-+-+
>     |1| exp | mant  |
>     +-+-+-+-+-+-+-+-+
> 
> Signed-off-by: Ujjal Roy <royujjal@gmail.com>

Reviewed-by: Ido Schimmel <idosch@nvidia.com>
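
The 8-bit format in the quoted diagram can be exercised with a short userspace sketch. The names fp8_decode() and fp8_encode() are hypothetical, and the encoder uses a simple shift loop with truncation instead of the fls()-based logic the patch mentions, so treat it as an illustration of the field layout only.

```c
#include <stdint.h>

/* Decode an 8-bit MRC/QQIC code (1 flag bit, 3-bit exp, 4-bit mant). */
uint32_t fp8_decode(uint8_t code)
{
	uint32_t exp, mant;

	if (code < 0x80)
		return code;		/* linear range */

	exp = (code >> 4) & 0x7;
	mant = code & 0x0f;
	return (mant | 0x10) << (exp + 3);
}

/* Encode a value; truncates the mantissa, saturates at the maximum. */
uint8_t fp8_encode(uint32_t value)
{
	unsigned int exp = 0;
	uint32_t mant;

	if (value < 0x80)
		return (uint8_t)value;	/* linear range */
	if (value >= (0x1fu << 10))
		return 0xff;		/* largest representable value */

	/* Find the smallest exponent that leaves a 5-bit significand. */
	while ((value >> (exp + 3)) > 0x1f)
		exp++;
	mant = (value >> (exp + 3)) & 0x0f;
	return (uint8_t)(0x80 | (exp << 4) | mant);
}
```

A value of 60 is sent unchanged as 0x3c, while 160 becomes 0x84 (exponent 0, mantissa 4); without encoding, a raw 160 (0xa0) would be decoded by the receiver as 512.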


* Re: [PATCH net-next v4 2/5] ipv6: mld: rename mldv2_mrc() and add mldv2_qqi()
From: Ido Schimmel @ 2026-04-13  8:46 UTC
  To: Ujjal Roy
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Nikolay Aleksandrov, David Ahern, Shuah Khan,
	Andy Roulin, Yong Wang, Petr Machata, Ujjal Roy, bridge, netdev,
	linux-kernel, linux-kselftest
In-Reply-To: <20260412111047.1326-3-royujjal@gmail.com>

On Sun, Apr 12, 2026 at 11:10:44AM +0000, Ujjal Roy wrote:
> Rename mldv2_mrc() to mldv2_mrd() as it is used to calculate
> the Maximum Response Delay from the Maximum Response Code.
> 
> Introduce a new API mldv2_qqi() to define the existing
> calculation logic of QQI from QQIC. This also organizes
> the existing mld_update_qi() API.
> 
> Signed-off-by: Ujjal Roy <royujjal@gmail.com>

Reviewed-by: Ido Schimmel <idosch@nvidia.com>


* Re: [PATCH net-next v4 1/5] ipv4: igmp: get rid of IGMPV3_{QQIC,MRC} and simplify calculation
From: Ido Schimmel @ 2026-04-13  8:46 UTC
  To: Ujjal Roy
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Nikolay Aleksandrov, David Ahern, Shuah Khan,
	Andy Roulin, Yong Wang, Petr Machata, Ujjal Roy, bridge, netdev,
	linux-kernel, linux-kselftest
In-Reply-To: <20260412111047.1326-2-royujjal@gmail.com>

On Sun, Apr 12, 2026 at 11:10:43AM +0000, Ujjal Roy wrote:
> Get rid of the IGMPV3_MRC macro and use the igmpv3_mrt() API to
> calculate the Max Resp Time from the Maximum Response Code.
> 
> Similarly, for IGMPV3_QQIC, use the igmpv3_qqi() API to calculate
> the Querier's Query Interval from the QQIC field.
> 
> Signed-off-by: Ujjal Roy <royujjal@gmail.com>

Reviewed-by: Ido Schimmel <idosch@nvidia.com>


* [PATCH] net/sched: act_ct: fix skb leak on fragment check failure
From: Dudu Lu @ 2026-04-13  8:46 UTC
  To: netdev; +Cc: jhs, jiri, Dudu Lu

tcf_ct_handle_fragments() returns TC_ACT_CONSUMED when
tcf_ct_ipv4/6_is_fragment() fails. This causes the caller to
believe the skb was consumed, but it was not freed. Each
malformed fragment leaks one skb, leading to OOM DoS under
sustained traffic.

Change the return value to TC_ACT_SHOT so the skb is properly
freed by the caller.

Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct")
Signed-off-by: Dudu Lu <phx0fer@gmail.com>
---
 net/sched/act_ct.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
index 7d5e50c921a0..870655f682bd 100644
--- a/net/sched/act_ct.c
+++ b/net/sched/act_ct.c
@@ -1107,8 +1107,10 @@ TC_INDIRECT_SCOPE int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a,
 	return retval;
 
 out_frag:
-	if (err != -EINPROGRESS)
+	if (err != -EINPROGRESS) {
 		tcf_action_inc_drop_qstats(&c->common);
+		return TC_ACT_SHOT;
+	}
 	return TC_ACT_CONSUMED;
 
 drop:
-- 
2.39.3 (Apple Git-145)



* Re: [PATCH net-next] pppoe: optimize hash with word access
From: Eric Dumazet @ 2026-04-13  8:42 UTC
  To: Qingfang Deng
  Cc: Andrew Lunn, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Guillaume Nault, Kees Cook, Eric Woudstra, netdev, linux-kernel
In-Reply-To: <20260413035212.56566-1-qingfang.deng@linux.dev>

On Sun, Apr 12, 2026 at 8:52 PM Qingfang Deng <qingfang.deng@linux.dev> wrote:
>
> Currently, hash_item() processes the 6-byte Ethernet address and the
> 2-byte session ID byte-wise to compute a hash.
>
> Optimize this by using 16-bit word operations: XOR three 16-bit words
> from the Ethernet address and the 16-bit session ID, then fold the
> result. This reduces the total number of loads and XORs. The Ethernet
> addresses in a skb and struct pppoe_addr are both 2-byte aligned, so the
> u16 pointer cast is safe.
>
> Signed-off-by: Qingfang Deng <qingfang.deng@linux.dev>

net-next is closed.

https://lore.kernel.org/netdev/20260412142250.131bf997@kernel.org/

Also I would suggest using hash_32(hash, PPPOE_HASH_BITS).
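
To illustrate the byte-wise versus word-wise idea from the quoted changelog, here is a userspace sketch. It is not the driver code: hash_bytes() and hash_words() are hypothetical names, HASH_BITS is assumed to be 4 as a stand-in for PPPOE_HASH_BITS, and the fold steps are hard-coded for that width.

```c
#include <stdint.h>
#include <string.h>

#define HASH_BITS 4	/* assumption: stands in for PPPOE_HASH_BITS */
#define HASH_MASK ((1u << HASH_BITS) - 1)

/* Byte-wise variant: XOR all eight input bytes, then fold 8 -> 4 bits. */
unsigned int hash_bytes(const uint8_t addr[6], uint16_t sid)
{
	uint8_t hash = 0;

	for (int i = 0; i < 6; i++)
		hash ^= addr[i];
	hash ^= (uint8_t)(sid & 0xff);
	hash ^= (uint8_t)(sid >> 8);
	hash ^= hash >> 4;
	return hash & HASH_MASK;
}

/* Word-wise variant: XOR four 16-bit words, then fold 16 -> 8 -> 4 bits. */
unsigned int hash_words(const uint8_t addr[6], uint16_t sid)
{
	uint16_t w[3];
	uint16_t hash;

	/* memcpy sidesteps the alignment question the changelog discusses. */
	memcpy(w, addr, sizeof(w));
	hash = w[0] ^ w[1] ^ w[2] ^ sid;
	hash ^= hash >> 8;
	hash ^= hash >> 4;
	return hash & HASH_MASK;
}
```

Because XOR is associative and byte order does not affect the XOR of all bytes, both variants produce the same bucket; the word-wise one simply needs fewer loads and XORs.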


* Re: [PATCH net-next v2] net: openvswitch: decouple flow_table from ovs_mutex
From: Paolo Abeni @ 2026-04-13  8:39 UTC
  To: Adrian Moreno, netdev
  Cc: Aaron Conole, Eelco Chaudron, Ilya Maximets, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Simon Horman, open list:OPENVSWITCH,
	open list
In-Reply-To: <20260407120418.356718-1-amorenoz@redhat.com>

On 4/7/26 2:04 PM, Adrian Moreno wrote:
> Currently the entire ovs module is write-protected using the global
> ovs_mutex. While this simple approach works fine for control-plane
> operations (such as vport configurations), requiring the global mutex
> for flow modifications can be problematic.
> 
> During periods of high control-plane operations, e.g: netdevs (vports)
> coming and going, RTNL can suffer contention. This contention is easily
> transferred to the ovs_mutex as RTNL nests inside ovs_mutex. Flow
> modifications, however, are done as part of packet processing and having
> them wait for RTNL pressure to go away can lead to packet drops.
> 
> This patch decouples flow_table modifications from ovs_mutex by means of
> the following:
> 
> 1 - Make flow_table an rcu-protected pointer inside the datapath.
> This allows both objects to be protected independently while reducing the
> amount of changes required in "flow_table.c".
> 
> 2 - Create a new mutex inside the flow_table that protects it from
> concurrent modifications.
> Putting the mutex inside flow_table makes it easier to consume for
> functions inside flow_table.c that do not currently take pointers to the
> datapath.
> Some function signatures need to be changed to accept flow_table so that
> lockdep checks can be performed.
> 
> 3 - Create a reference count to temporarily extend rcu protection from
> the datapath to the flow_table.
> In order to use the flow_table without locking ovs_mutex, the flow_table
> pointer must be first dereferenced within an rcu-protected region.
> Next, the table->mutex needs to be locked to protect it from
> concurrent writes but mutexes must not be locked inside an rcu-protected
> region, so the rcu-protected region must be left at which point the
> datapath can be concurrently freed.
> To extend the protection beyond the rcu region, a reference count is used.
> One reference is held by the datapath, the other is temporarily
> increased during flow modifications. For example:
> 
> Datapath deletion:
> 
>   ovs_lock();
>   table = rcu_dereference_protected(dp->table, ...);
>   rcu_assign_pointer(dp->table, NULL);
>   ovs_flow_tbl_put(table);
>   ovs_unlock();
> 
> Flow modification:
> 
>   rcu_read_lock();
>   dp = get_dp(...);
>   table = rcu_dereference(dp->table);
>   ovs_flow_tbl_get(table);
>   rcu_read_unlock();
> 
>   mutex_lock(&table->lock);
>   /* Perform modifications on the flow_table */
>   mutex_unlock(&table->lock);
>   ovs_flow_tbl_put(table);
> 
> Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
> ---
> v2: Fix argument in ovs_flow_tbl_put (sparse)
>     Remove rcu checks in ovs_dp_masks_rebalance
> ---
>  net/openvswitch/datapath.c   | 285 ++++++++++++++++++++++++-----------
>  net/openvswitch/datapath.h   |   2 +-
>  net/openvswitch/flow.c       |  13 +-
>  net/openvswitch/flow.h       |   9 +-
>  net/openvswitch/flow_table.c | 180 ++++++++++++++--------
>  net/openvswitch/flow_table.h |  51 ++++++-
>  6 files changed, 380 insertions(+), 160 deletions(-)

This is too big for a single patch. The changelog above already suggests
a way of splitting the change. At least the RCU-ification addition
should be straightforward as a separate patch, which in turn should be
easily reviewable.
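
As an aside on point 3 of the quoted changelog, the get/put lifetime can be modelled in userspace C11. This is a sketch under assumptions: struct fake_table and the table_*() helpers are hypothetical, the "get fails once the count is zero" behaviour is assumed rather than taken from the patch, and a flag is set instead of freeing memory so the outcome can be observed.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct fake_table {
	atomic_int refcnt;	/* one reference is owned by the datapath */
	bool released;		/* set instead of freeing, for illustration */
};

void table_init(struct fake_table *t)
{
	atomic_init(&t->refcnt, 1);
	t->released = false;
}

/* Take a reference unless the count has already dropped to zero. */
bool table_get(struct fake_table *t)
{
	int old = atomic_load(&t->refcnt);

	while (old > 0) {
		/* On failure, old is reloaded with the current count. */
		if (atomic_compare_exchange_weak(&t->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; the last put releases the table. */
void table_put(struct fake_table *t)
{
	if (atomic_fetch_sub(&t->refcnt, 1) == 1)
		t->released = true;
}
```

In this model a flow modification that took its reference before datapath deletion keeps the table alive until its own put, which is the property the changelog relies on once the RCU read-side section has been left.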

> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> index e209099218b4..9c234993520c 100644
> --- a/net/openvswitch/datapath.c
> +++ b/net/openvswitch/datapath.c
> @@ -88,13 +88,17 @@ static void ovs_notify(struct genl_family *family,
>   * DOC: Locking:
>   *
>   * All writes e.g. Writes to device state (add/remove datapath, port, set
> - * operations on vports, etc.), Writes to other state (flow table
> - * modifications, set miscellaneous datapath parameters, etc.) are protected
> - * by ovs_lock.
> + * operations on vports, etc.) and writes to other datapath parameters
> + * are protected by ovs_lock.
> + *
> + * Writes to the flow table are NOT protected by ovs_lock. Instead, a per-table
> + * mutex and reference count are used (see comment above "struct flow_table"
> + * definition). On some few occasions, the per-flow table mutex is nested
> + * inside ovs_mutex.
>   *
>   * Reads are protected by RCU.
>   *
> - * There are a few special cases (mostly stats) that have their own
> + * There are a few other special cases (mostly stats) that have their own
>   * synchronization but they nest under all of above and don't interact with
>   * each other.
>   *
> @@ -166,7 +170,6 @@ static void destroy_dp_rcu(struct rcu_head *rcu)
>  {
>  	struct datapath *dp = container_of(rcu, struct datapath, rcu);
>  
> -	ovs_flow_tbl_destroy(&dp->table);
>  	free_percpu(dp->stats_percpu);
>  	kfree(dp->ports);
>  	ovs_meters_exit(dp);
> @@ -247,6 +250,7 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
>  	struct ovs_pcpu_storage *ovs_pcpu = this_cpu_ptr(ovs_pcpu_storage);
>  	const struct vport *p = OVS_CB(skb)->input_vport;
>  	struct datapath *dp = p->dp;
> +	struct flow_table *table;
>  	struct sw_flow *flow;
>  	struct sw_flow_actions *sf_acts;
>  	struct dp_stats_percpu *stats;
> @@ -257,9 +261,16 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
>  	int error;
>  
>  	stats = this_cpu_ptr(dp->stats_percpu);
> +	table = rcu_dereference(dp->table);
> +	if (!table) {
> +		net_dbg_ratelimited("ovs: no flow table on datapath %s\n",
> +				    ovs_dp_name(dp));
> +		kfree_skb(skb);
> +		return;
> +	}
>  
>  	/* Look up flow. */
> -	flow = ovs_flow_tbl_lookup_stats(&dp->table, key, skb_get_hash(skb),
> +	flow = ovs_flow_tbl_lookup_stats(table, key, skb_get_hash(skb),
>  					 &n_mask_hit, &n_cache_hit);
>  	if (unlikely(!flow)) {
>  		struct dp_upcall_info upcall;
> @@ -752,12 +763,16 @@ static struct genl_family dp_packet_genl_family __ro_after_init = {
>  static void get_dp_stats(const struct datapath *dp, struct ovs_dp_stats *stats,
>  			 struct ovs_dp_megaflow_stats *mega_stats)
>  {
> +	struct flow_table *table = ovsl_dereference(dp->table);

Should this be rcu_dereference_ovs_tbl()?

>  	int i;
>  
>  	memset(mega_stats, 0, sizeof(*mega_stats));
>  
> -	stats->n_flows = ovs_flow_tbl_count(&dp->table);
> -	mega_stats->n_masks = ovs_flow_tbl_num_masks(&dp->table);
> +	if (table) {
> +		stats->n_flows = ovs_flow_tbl_count(table);

As noted by Aaron, READ_ONCE() is now needed when reading
table->count, and WRITE_ONCE() when writing it.
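
For illustration only, a minimal user-space sketch of that annotation
pattern. The READ_ONCE()/WRITE_ONCE() macros below are simplified
stand-ins (the kernel's own definitions are more elaborate), and
struct flow_table_sketch and the tbl_*() helpers are hypothetical
names, not the patch's code. Writers are assumed to be serialized by
the table lock; the reader takes no lock:

```c
/* Simplified stand-ins for the kernel macros (GNU C __typeof__). */
#define READ_ONCE(x)	 (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical cut-down table: only the counter matters here. */
struct flow_table_sketch {
	unsigned int count;
};

/* Writers: the caller is assumed to hold the table lock, so the plain
 * read of t->count on the right-hand side cannot race with another
 * writer; the store is marked for the benefit of lockless readers.
 */
static void tbl_flow_added(struct flow_table_sketch *t)
{
	WRITE_ONCE(t->count, t->count + 1);
}

static void tbl_flow_removed(struct flow_table_sketch *t)
{
	WRITE_ONCE(t->count, t->count - 1);
}

/* Lockless reader: the marked load pairs with the marked stores. */
static unsigned int tbl_count(const struct flow_table_sketch *t)
{
	return READ_ONCE(t->count);
}
```

The value read this way can still be stale by the time it is used;
the annotation only documents the race and prevents load/store
tearing.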

> +		mega_stats->n_masks = ovs_flow_tbl_num_masks(table);

Sashiko says:

---
get_dp_stats() accesses table->mask_array via ovs_flow_tbl_num_masks()
while holding only ovs_mutex. Since this patch decouples flow table updates
by moving them under table->lock, ovs_flow_cmd_new() can execute
concurrently and trigger a reallocation of the mask array, freeing the old
one via call_rcu().
Because get_dp_stats() does not hold rcu_read_lock(), the thread can be
preempted (as ovs_mutex is sleepable) and the RCU grace period might expire
before the count is read. Can this lead to a use-after-free?
---

Note that it also spotted pre-existing issues, please have a look:

https://sashiko.dev/#/patchset/20260407120418.356718-1-amorenoz%40redhat.com

[...]
> @@ -71,15 +93,40 @@ struct flow_table {
>  
>  extern struct kmem_cache *flow_stats_cache;
>  
> +#ifdef CONFIG_LOCKDEP
> +int lockdep_ovs_tbl_is_held(const struct flow_table *table);
> +#else
> +static inline int lockdep_ovs_tbl_is_held(const struct flow_table *table)
> +{
> +	(void)table;

You can use the __always_unused annotation.

> +	return 1;
> +}
> +#endif
> +
> +#define ASSERT_OVS_TBL(tbl)   WARN_ON(!lockdep_ovs_tbl_is_held(tbl))
> +
> +/* Lock-protected update-allowed dereferences.*/
> +#define ovs_tbl_dereference(p, tbl)	\
> +	rcu_dereference_protected(p, lockdep_ovs_tbl_is_held(tbl))
> +
> +/* Read dereferences can be protected by either RCU, table lock or ovs_mutex. */

Is this access scheme really safe? As I understand it, tables can be
written/deleted only under the table lock. If so, this should ignore
the OVS mutex status.

/P


^ permalink raw reply

* Re: [PATCH net] net: usb: cdc_ncm: reject negative chained NDP offsets
From: Oliver Neukum @ 2026-04-13  8:36 UTC (permalink / raw)
  To: Greg Kroah-Hartman, linux-usb, netdev
  Cc: linux-kernel, Oliver Neukum, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, stable
In-Reply-To: <2026041137-comfy-eaten-a1ed@gregkh>



On 11.04.26 12:53, Greg Kroah-Hartman wrote:
> cdc_ncm_rx_fixup() reads dwNextNdpIndex from each NDP32 to chain to the
> next one.  The 32-bit value from the device is stored into the signed
> int ndpoffset so that means values with the high bit set become

Well, then isn't the problem rather that you should not store an
unsigned value in a signed variable?
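
To illustrate the difference, a small user-space sketch. It is not the
cdc_ncm code: NDP_HDR_LEN and both helpers are hypothetical, and the
bounds checks are only assumed to resemble an upper-bound test on the
offset. The conversion of an out-of-range value to int is
implementation-defined in C, although it yields a negative number on
the usual two's-complement targets:

```c
#include <stdbool.h>
#include <stdint.h>

#define NDP_HDR_LEN 16	/* hypothetical header size, for illustration */

/* Offset kept in a signed int: a wire value with the high bit set
 * becomes negative and passes an upper-bound-only check.
 */
static bool signed_offset_ok(uint32_t wire, int len)
{
	int off = (int)wire;

	return off <= len - NDP_HDR_LEN;
}

/* Offset kept unsigned: the same wire value is simply too large. */
static bool unsigned_offset_ok(uint32_t wire, uint32_t len)
{
	return len >= NDP_HDR_LEN && wire <= len - NDP_HDR_LEN;
}
```

With the unsigned variant no separate "negative offset" test is
needed, because such a value cannot be represented in the first place.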

	Regards
		Oliver


^ permalink raw reply

* [PATCH net] net: airoha: Fix possible TX queue stall in airoha_qdma_tx_napi_poll()
From: Lorenzo Bianconi @ 2026-04-13  8:29 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Lorenzo Bianconi
  Cc: linux-arm-kernel, linux-mediatek, netdev

Since multiple net_device TX queues can share the same hw QDMA TX queue,
there is no guarantee we have inflight packets queued in hw belonging to a
net_device TX queue stopped in the xmit path because hw QDMA TX queue
can be full. In this corner case the net_device TX queue will never be
re-activated. In order to avoid any potential net_device TX queue stall,
we need to wake all the net_device TX queues feeding the same hw QDMA TX
queue in airoha_qdma_tx_napi_poll routine.

Fixes: 23020f0493270 ("net: airoha: Introduce ethernet support for EN7581 SoC")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 drivers/net/ethernet/airoha/airoha_eth.c | 30 ++++++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
index 9e995094c32a..e7610f36b8e4 100644
--- a/drivers/net/ethernet/airoha/airoha_eth.c
+++ b/drivers/net/ethernet/airoha/airoha_eth.c
@@ -855,6 +855,19 @@ static int airoha_qdma_init_rx(struct airoha_qdma *qdma)
 	return 0;
 }
 
+static void airoha_qdma_wake_tx_queues(struct airoha_qdma *qdma)
+{
+	struct airoha_eth *eth = qdma->eth;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(eth->ports); i++) {
+		struct airoha_gdm_port *port = eth->ports[i];
+
+		if (port && port->qdma == qdma)
+			netif_tx_wake_all_queues(port->dev);
+	}
+}
+
 static int airoha_qdma_tx_napi_poll(struct napi_struct *napi, int budget)
 {
 	struct airoha_tx_irq_queue *irq_q;
@@ -931,12 +944,21 @@ static int airoha_qdma_tx_napi_poll(struct napi_struct *napi, int budget)
 
 			txq = netdev_get_tx_queue(skb->dev, queue);
 			netdev_tx_completed_queue(txq, 1, skb->len);
-			if (netif_tx_queue_stopped(txq) &&
-			    q->ndesc - q->queued >= q->free_thr)
-				netif_tx_wake_queue(txq);
-
 			dev_kfree_skb_any(skb);
 		}
+
+		if (q->ndesc - q->queued == q->free_thr) {
+			/* Since multiple net_device TX queues can share the
+			 * same hw QDMA TX queue, there is no guarantee we have
+			 * inflight packets queued in hw belonging to a
+			 * net_device TX queue stopped in the xmit path.
+			 * In order to avoid any potential net_device TX queue
+			 * stall, we need to wake all the net_device TX queues
+			 * feeding the same hw QDMA TX queue.
+			 */
+			airoha_qdma_wake_tx_queues(qdma);
+		}
+
 unlock:
 		spin_unlock_bh(&q->lock);
 	}

---
base-commit: 2dddb34dd0d07b01fa770eca89480a4da4f13153
change-id: 20260407-airoha-txq-potential-stall-ad52c53094e8

Best regards,
-- 
Lorenzo Bianconi <lorenzo@kernel.org>


^ permalink raw reply related

* [PATCH] net/sched: act_mirred: Fix blockcast recursion bypass leading to stack overflow
From: Kito Xu (veritas501) @ 2026-04-13  8:20 UTC (permalink / raw)
  To: jhs, jiri, davem, edumazet, kuba, pabeni
  Cc: horms, victor, pctammela, netdev, linux-kernel,
	Kito Xu (veritas501)

tcf_mirred_act() checks sched_mirred_nest against MIRRED_NEST_LIMIT (4)
to prevent deep recursion.  However, when the action uses blockcast
(tcfm_blockid != 0), the function returns at the tcf_blockcast() call
BEFORE reaching the counter increment.  As a result, the recursion
counter never advances and the limit check is entirely bypassed.

When two devices share a TC egress block with a mirred blockcast rule,
a packet egressing on device A is mirrored to device B via blockcast;
device B's egress TC re-enters tcf_mirred_act() via blockcast and
mirrors back to A, creating an unbounded recursion loop:

  tcf_mirred_act -> tcf_blockcast -> tcf_mirred_to_dev -> dev_queue_xmit
  -> sch_handle_egress -> tcf_classify -> tcf_mirred_act -> (repeat)

This recursion continues until the kernel stack overflows.

The bug is reachable from an unprivileged user via
unshare(CLONE_NEWUSER | CLONE_NEWNET): user namespaces grant
CAP_NET_ADMIN in the new network namespace, which is sufficient to
create dummy devices, attach clsact qdiscs with shared blocks, and
install mirred blockcast filters.

 BUG: TASK stack guard page was hit at ffffc90000b7fff8
 Oops: stack guard page: 0000 [#1] SMP KASAN NOPTI
 CPU: 2 UID: 1000 PID: 169 Comm: poc Not tainted 7.0.0-rc7-next-20260410
 RIP: 0010:xas_find+0x17/0x480
 Call Trace:
  xa_find+0x17b/0x1d0
  tcf_mirred_act+0x640/0x1060
  tcf_action_exec+0x400/0x530
  basic_classify+0x128/0x1d0
  tcf_classify+0xd83/0x1150
  tc_run+0x328/0x620
  __dev_queue_xmit+0x797/0x3100
  tcf_mirred_to_dev+0x7b1/0xf70
  tcf_mirred_act+0x68a/0x1060
  [repeating ~30+ times until stack overflow]
 Kernel panic - not syncing: Fatal exception in interrupt

Fix this by incrementing sched_mirred_nest before calling
tcf_blockcast() and decrementing it on return, mirroring the
non-blockcast path.  This ensures subsequent recursive entries see the
updated counter and are correctly limited by MIRRED_NEST_LIMIT.
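
As a rough illustration of the fixed control flow, here is a
self-contained user-space model. It is not the kernel code: struct
xmit_sketch, blockcast() and mirred_act() are hypothetical stand-ins,
and the two-device loop is reduced to a direct recursive call. Only
the placement of the counter update around the blockcast call is
meant to match the fix:

```c
#define MIRRED_NEST_LIMIT 4

struct xmit_sketch {
	int nest;	/* stand-in for xmit->sched_mirred_nest */
	int max_nest;	/* deepest nesting level observed */
	int dropped;	/* packets refused by the limit check */
};

static int mirred_act(struct xmit_sketch *x, int dev);

/* Devices 0 and 1 share the block: mirror to the other device, whose
 * egress path re-enters mirred_act().
 */
static int blockcast(struct xmit_sketch *x, int dev)
{
	return mirred_act(x, dev ^ 1);
}

static int mirred_act(struct xmit_sketch *x, int dev)
{
	int ret;

	if (x->nest >= MIRRED_NEST_LIMIT) {
		x->dropped++;
		return -1;
	}

	/* Count this level before recursing, undo it on the way out. */
	x->nest++;
	if (x->nest > x->max_nest)
		x->max_nest = x->nest;
	ret = blockcast(x, dev);
	x->nest--;

	return ret;
}
```

In this model the recursion stops at depth MIRRED_NEST_LIMIT and the
counter is back to zero once the outermost call returns. Without the
increment the limit check would never trigger.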

Fixes: 42f39036cda8 ("net/sched: act_mirred: Allow mirred to block")
Signed-off-by: Kito Xu (veritas501) <hxzene@gmail.com>
---
 net/sched/act_mirred.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index 05e0b14b5773..5928fcf3e651 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -444,8 +444,12 @@ TC_INDIRECT_SCOPE int tcf_mirred_act(struct sk_buff *skb,
 	tcf_action_update_bstats(&m->common, skb);
 
 	blockid = READ_ONCE(m->tcfm_blockid);
-	if (blockid)
-		return tcf_blockcast(skb, m, blockid, res, retval);
+	if (blockid) {
+		xmit->sched_mirred_nest++;
+		retval = tcf_blockcast(skb, m, blockid, res, retval);
+		xmit->sched_mirred_nest--;
+		return retval;
+	}
 
 	dev = rcu_dereference_bh(m->tcfm_dev);
 	if (unlikely(!dev)) {
-- 
2.43.0


^ permalink raw reply related

* RE: [PATCH v5 net-next 0/8] dpll/ice: Add TXC DPLL type and full TX reference clock control for E825
From: Kubalewski, Arkadiusz @ 2026-04-13  8:19 UTC (permalink / raw)
  To: Jakub Kicinski, Nitka, Grzegorz
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	intel-wired-lan@lists.osuosl.org, Oros, Petr,
	richardcochran@gmail.com, andrew+netdev@lunn.ch,
	Kitszel, Przemyslaw, Nguyen, Anthony L,
	Prathosh.Satish@microchip.com, Vecera, Ivan, jiri@resnulli.us,
	vadim.fedorenko@linux.dev, donald.hunter@gmail.com,
	horms@kernel.org, pabeni@redhat.com, davem@davemloft.net,
	edumazet@google.com
In-Reply-To: <20260410133812.4cf9b090@kernel.org>

>From: Jakub Kicinski <kuba@kernel.org>
>Sent: Friday, April 10, 2026 10:38 PM
>
>On Fri, 10 Apr 2026 14:23:58 +0000 Nitka, Grzegorz wrote:
>> Here is the high-level connection diagram for E825 device. I hope you
>>find it helpful:
>> [..]
>
>It does thanks a lot.
>
>> Before this series, we tried different approaches.
>> One of them was to create MUX pin associated with netdev interface.
>> EXT_REF and SYNCE pins were registered with this MUX pin.
>> However I recall there were at least two issues with this solution:
>> - when using DPLL subsystem not all the connections/relations were
>>visible
>>   from DPLL pin-get perspective. RT netlink was required
>> - due to mixing pins from different modules (like fwnode based pin from
>>zl driver
>>   and the pins from ice), we were not able to safely clean the
>>references between
>>   pins and dpll (basicaly .. we observed crashes)
>>
>> Proposed solution just seems to be clean and fully reflects current
>> connection topology.
>
>Do you have the link to the old proposal that was adding stuff to
>rtnetlink? I remember some discussion long-ish ago, maybe I was wrong.
>
>> What's actually your biggest concern?
>> The fact we introduce a new DPLL type? Or multiply DPLL instances? Or
>>both?
>> Do you prefer to see "one big" DPLL with 16 pins in our case (8 ports x
>>2 tx-clk pins)?
>> Each pin with the name like, for example, PF0-SyncE/PF0-eRef etc.?
>
>My concern is that I think this is a pretty run of the mill SyncE
>design. If we need to pretend we have two DPLLs here if we really
>only have one and a mux - then our APIs are mis-designed :(

Well, the truth is that we did not anticipate per-port control of the
TX clock source, as a single DPLL device could drive multiple of them.

It is not true that we pretend there is a second PLL: there is a PLL
on each TX clock. It may not be a full DPLL, but the loop with control
over its sources is there, and it has the same 2 external sources +
default XO.

The mentioned attempt at adding a per-port MUX-type pin, just to give
some control to the user, is where we wanted to simplify things, but in
the end the API would have had to be modified in a significant way
(various paths related to pin registration and keeping correct
references) just to make a working case for pin_on_pin_register and its
internals. We decided that the burden and the impact on the existing
design were too high.

And that is why the TXC approach emerged: the change to DPLL is minimal
and the model is still correct from the user perspective. A SyncE SW
controller shall anticipate the possibility that a per-port TXC dpll is
there.

This particular device and driver don't implement any EEC-type DPLL
device, so one could think that we can just change the type here and use
the EEC type instead of the new TXC one, since we share pins from the
external dpll driver, which is EEC type, and our DPLL device would have a
different clock_id and module. But in further designs, where a single NIC
has control over both an EEC DPLL and the ability to control each source
per port, this would be problematic. At least one NIC port driver would
have to have 2 EEC-type DPLLs, leaving the user with extra confusion.

Thanks,
Arkadiusz



^ permalink raw reply

* Re: [PATCH v2 6/6] bus: mhi: host: mhi_phc: Add support for PHC over MHI
From: Manivannan Sadhasivam @ 2026-04-13  8:06 UTC (permalink / raw)
  To: Krishna Chaitanya Chundru
  Cc: Richard Cochran, mhi, linux-arm-msm, linux-kernel, netdev,
	Imran Shaik, Taniya Das
In-Reply-To: <20260411-tsc_timesync-v2-6-6f25f72987b3@oss.qualcomm.com>

On Sat, Apr 11, 2026 at 01:42:06PM +0530, Krishna Chaitanya Chundru wrote:
> From: Imran Shaik <imran.shaik@oss.qualcomm.com>
> 
> This patch introduces the MHI PHC (PTP Hardware Clock) driver, which
> registers a PTP (Precision Time Protocol) clock and communicates with
> the MHI core to get the device side timestamps. These timestamps are
> then exposed to the PTP subsystem, enabling precise time synchronization
> between the host and the device.
> 
> The following diagram illustrates the architecture and data flow:
> 
>  +-------------+    +--------------------+    +--------------+
>  |Userspace App|<-->|Kernel PTP framework|<-->|MHI PHC Driver|
>  +-------------+    +--------------------+    +--------------+
>                                                      |
>                                                      v
>  +-------------------------------+         +-----------------+
>  | MHI Device (Timestamp source) |<------->| MHI Core Driver |
>  +-------------------------------+         +-----------------+
> 
> - User space applications use the standard Linux PTP interface.
> - The PTP subsystem routes IOCTLs to the MHI PHC driver.
> - The MHI PHC driver communicates with the MHI core to fetch timestamps.
> - The MHI core interacts with the device to retrieve accurate time data.

As mentioned in the cover letter, this is misleading.

> 
> Co-developed-by: Taniya Das <taniya.das@oss.qualcomm.com>
> Signed-off-by: Taniya Das <taniya.das@oss.qualcomm.com>
> Signed-off-by: Imran Shaik <imran.shaik@oss.qualcomm.com>
> ---
>  drivers/bus/mhi/host/Kconfig       |   8 ++
>  drivers/bus/mhi/host/Makefile      |   1 +
>  drivers/bus/mhi/host/mhi_phc.c     | 150 +++++++++++++++++++++++++++++++++++++
>  drivers/bus/mhi/host/mhi_phc.h     |  28 +++++++
>  drivers/bus/mhi/host/pci_generic.c |  23 ++++++
>  5 files changed, 210 insertions(+)
> 
> diff --git a/drivers/bus/mhi/host/Kconfig b/drivers/bus/mhi/host/Kconfig
> index da5cd0c9fc620ab595e742c422f1a22a2a84c7b9..b4eabf3e5c56907de93232f02962040e979c3110 100644
> --- a/drivers/bus/mhi/host/Kconfig
> +++ b/drivers/bus/mhi/host/Kconfig
> @@ -29,3 +29,11 @@ config MHI_BUS_PCI_GENERIC
>  	  This driver provides MHI PCI controller driver for devices such as
>  	  Qualcomm SDX55 based PCIe modems.
>  
> +config MHI_BUS_PHC
> +	bool "MHI PHC driver"

Why not tristate?

> +	depends on MHI_BUS_PCI_GENERIC

No, this generic TSC driver cannot depend on a controller driver. It should
purely act as a library.

> +	help
> +	  This driver provides Precision Time Protocol (PTP) clock and
> +	  communicates with MHI PCI driver to get the device side timestamp,
> +	  which enables precise time synchronization between the host and
> +	  the device.
> diff --git a/drivers/bus/mhi/host/Makefile b/drivers/bus/mhi/host/Makefile
> index 859c2f38451c669b3d3014c374b2b957c99a1cfe..5ba244fe7d596834ea535797efd3428963ba0ed0 100644
> --- a/drivers/bus/mhi/host/Makefile
> +++ b/drivers/bus/mhi/host/Makefile
> @@ -4,3 +4,4 @@ mhi-$(CONFIG_MHI_BUS_DEBUG) += debugfs.o
>  
>  obj-$(CONFIG_MHI_BUS_PCI_GENERIC) += mhi_pci_generic.o
>  mhi_pci_generic-y += pci_generic.o
> +mhi_pci_generic-$(CONFIG_MHI_BUS_PHC) += mhi_phc.o
> diff --git a/drivers/bus/mhi/host/mhi_phc.c b/drivers/bus/mhi/host/mhi_phc.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..fa04eb7f6025fa281d86c0a45b5f7d3e61f5ce12
> --- /dev/null
> +++ b/drivers/bus/mhi/host/mhi_phc.c
> @@ -0,0 +1,150 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2025, Qualcomm Technologies, Inc. and/or its subsidiaries.

2026

> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/mod_devicetable.h>
> +#include <linux/module.h>

Are these headers really needed?

> +#include <linux/mhi.h>
> +#include <linux/ptp_clock_kernel.h>
> +#include "mhi_phc.h"
> +
> +#define NSEC 1000000000ULL

Use existing NSEC_PER_SEC

> +
> +/**
> + * struct mhi_phc_dev - MHI PHC device
> + * @ptp_clock: associated PTP clock
> + * @ptp_clock_info: PTP clock information
> + * @mhi_dev: associated mhi device object
> + * @lock: spinlock

What spinlock? and what is it used for?

> + * @enabled: Flag to track the state of the MHI device
> + */
> +struct mhi_phc_dev {
> +	struct ptp_clock *ptp_clock;
> +	struct ptp_clock_info  ptp_clock_info;

Use single space.

> +	struct mhi_device *mhi_dev;
> +	spinlock_t lock;
> +	bool enabled;
> +};
> +
> +static int qcom_ptp_gettimex64(struct ptp_clock_info *ptp, struct timespec64 *ts,
> +			       struct ptp_system_timestamp *sts)
> +{
> +	struct mhi_phc_dev *phc_dev = container_of(ptp, struct mhi_phc_dev, ptp_clock_info);
> +	struct mhi_timesync_info time;
> +	ktime_t ktime_cur;
> +	unsigned long flags;
> +	int ret;
> +
> +	spin_lock_irqsave(&phc_dev->lock, flags);

Why a spinlock and not a mutex, especially when
mhi_get_remote_tsc_time_sync() performs MMIO reads 4 times with ASPM
enabled?

I also doubt that you really need a lock here.

> +	if (!phc_dev->enabled) {

I don't see the value of this check. If the intention is to prevent the PHC
from being disabled while the gettimex64() callback is executing, then the
kernel POSIX clock layer already provides that guarantee for you. You don't
need to reinvent the wheel.

> +		ret = -ENODEV;
> +		goto err;
> +	}
> +
> +	ret = mhi_get_remote_tsc_time_sync(phc_dev->mhi_dev, &time);
> +	if (ret)
> +		goto err;
> +
> +	ktime_cur = time.t_dev_hi * NSEC + time.t_dev_lo;
> +	*ts = ktime_to_timespec64(ktime_cur);
> +
> +	dev_dbg(&phc_dev->mhi_dev->dev, "TSC time stamps sec:%u nsec:%u current:%lld\n",
> +		time.t_dev_hi, time.t_dev_lo, ktime_cur);

Move to tracepoint.

> +
> +	/* Update pre and post timestamps for PTP_SYS_OFFSET_EXTENDED*/
> +	if (sts != NULL) {

if (sts)

> +		sts->pre_ts = ktime_to_timespec64(time.t_host_pre);
> +		sts->post_ts = ktime_to_timespec64(time.t_host_post);
> +		dev_dbg(&phc_dev->mhi_dev->dev, "pre:%lld post:%lld\n",
> +			time.t_host_pre, time.t_host_post);
> +	}
> +
> +err:
> +	spin_unlock_irqrestore(&phc_dev->lock, flags);
> +
> +	return ret;
> +}
> +
> +int mhi_phc_start(struct mhi_controller *mhi_cntrl)
> +{
> +	struct mhi_phc_dev *phc_dev = dev_get_drvdata(&mhi_cntrl->mhi_dev->dev);
> +	unsigned long flags;
> +
> +	if (!phc_dev) {
> +		dev_err(&mhi_cntrl->mhi_dev->dev, "Driver data is NULL\n");

Can this really happen? Even so, I wouldn't add an error print for this cosmetic
check.

> +		return -ENODEV;
> +	}
> +
> +	spin_lock_irqsave(&phc_dev->lock, flags);
> +	phc_dev->enabled = true;
> +	spin_unlock_irqrestore(&phc_dev->lock, flags);
> +
> +	return 0;
> +}
> +
> +int mhi_phc_stop(struct mhi_controller *mhi_cntrl)
> +{
> +	struct mhi_phc_dev *phc_dev = dev_get_drvdata(&mhi_cntrl->mhi_dev->dev);
> +	unsigned long flags;
> +
> +	if (!phc_dev) {
> +		dev_err(&mhi_cntrl->mhi_dev->dev, "Driver data is NULL\n");
> +		return -ENODEV;
> +	}

Same here.

> +
> +	spin_lock_irqsave(&phc_dev->lock, flags);
> +	phc_dev->enabled = false;
> +	spin_unlock_irqrestore(&phc_dev->lock, flags);
> +
> +	return 0;

With the check and the 'phc_dev->enabled' flag gone, I see no point in the
mhi_phc_{start/stop}() functions.

> +}
> +
> +static struct ptp_clock_info qcom_ptp_clock_info = {
> +	.owner    = THIS_MODULE,
> +	.gettimex64 =  qcom_ptp_gettimex64,
> +};
> +
> +int mhi_phc_init(struct mhi_controller *mhi_cntrl)
> +{
> +	struct mhi_device *mhi_dev = mhi_cntrl->mhi_dev;
> +	struct mhi_phc_dev *phc_dev;
> +	int ret;
> +
> +	phc_dev = devm_kzalloc(&mhi_dev->dev, sizeof(*phc_dev), GFP_KERNEL);
> +	if (!phc_dev)
> +		return -ENOMEM;
> +
> +	phc_dev->mhi_dev = mhi_dev;
> +
> +	phc_dev->ptp_clock_info = qcom_ptp_clock_info;
> +	strscpy(phc_dev->ptp_clock_info.name, mhi_dev->name, PTP_CLOCK_NAME_LEN);
> +
> +	spin_lock_init(&phc_dev->lock);
> +
> +	phc_dev->ptp_clock = ptp_clock_register(&phc_dev->ptp_clock_info, &mhi_dev->dev);
> +	if (IS_ERR(phc_dev->ptp_clock)) {
> +		ret = PTR_ERR(phc_dev->ptp_clock);
> +		dev_err(&mhi_dev->dev, "Failed to register PTP clock\n");
> +		phc_dev->ptp_clock = NULL;
> +		return ret;
> +	}
> +
> +	dev_set_drvdata(&mhi_dev->dev, phc_dev);
> +
> +	dev_dbg(&mhi_dev->dev, "probed MHI PHC dev: %s\n", mhi_dev->name);

Drop this spam.

> +	return 0;
> +};
> +
> +void mhi_phc_exit(struct mhi_controller *mhi_cntrl)
> +{
> +	struct mhi_phc_dev *phc_dev = dev_get_drvdata(&mhi_cntrl->mhi_dev->dev);
> +
> +	if (!phc_dev)
> +		return;
> +
> +	/* disable the node */

Remove this comment.

> +	ptp_clock_unregister(phc_dev->ptp_clock);
> +	phc_dev->enabled = false;
> +}
> diff --git a/drivers/bus/mhi/host/mhi_phc.h b/drivers/bus/mhi/host/mhi_phc.h
> new file mode 100644
> index 0000000000000000000000000000000000000000..e6b0866bc768ba5a8ac3e4c40a99aa2050db1389
> --- /dev/null
> +++ b/drivers/bus/mhi/host/mhi_phc.h
> @@ -0,0 +1,28 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2025, Qualcomm Technologies, Inc. and/or its subsidiaries.
> + */
> +
> +#ifdef CONFIG_MHI_BUS_PHC

#if IS_ENABLED()

> +int mhi_phc_init(struct mhi_controller *mhi_cntrl);
> +int mhi_phc_start(struct mhi_controller *mhi_cntrl);
> +int mhi_phc_stop(struct mhi_controller *mhi_cntrl);
> +void mhi_phc_exit(struct mhi_controller *mhi_cntrl);
> +#else
> +static inline int mhi_phc_init(struct mhi_controller *mhi_cntrl)
> +{
> +	return 0;
> +}
> +
> +static inline int mhi_phc_start(struct mhi_controller *mhi_cntrl)
> +{
> +	return 0;
> +}
> +
> +static inline int mhi_phc_stop(struct mhi_controller *mhi_cntrl)
> +{
> +	return 0;
> +}
> +
> +static inline void mhi_phc_exit(struct mhi_controller *mhi_cntrl) {}
> +#endif
> diff --git a/drivers/bus/mhi/host/pci_generic.c b/drivers/bus/mhi/host/pci_generic.c
> index b1122c7224bdd469406d96af6d3df342040e1002..6cba5cecd1adb40396bba30c9b2a551898dce871 100644
> --- a/drivers/bus/mhi/host/pci_generic.c
> +++ b/drivers/bus/mhi/host/pci_generic.c
> @@ -16,6 +16,7 @@
>  #include <linux/pm_runtime.h>
>  #include <linux/timer.h>
>  #include <linux/workqueue.h>
> +#include "mhi_phc.h"
>  
>  #define MHI_PCI_DEFAULT_BAR_NUM 0
>  
> @@ -1044,6 +1045,7 @@ struct mhi_pci_device {
>  	struct timer_list health_check_timer;
>  	unsigned long status;
>  	bool reset_on_remove;
> +	bool mhi_phc_init_done;
>  };
>  
>  #ifdef readq
> @@ -1084,6 +1086,7 @@ static void mhi_pci_status_cb(struct mhi_controller *mhi_cntrl,
>  			      enum mhi_callback cb)
>  {
>  	struct pci_dev *pdev = to_pci_dev(mhi_cntrl->cntrl_dev);
> +	struct mhi_pci_device *mhi_pdev = pci_get_drvdata(pdev);
>  
>  	/* Nothing to do for now */
>  	switch (cb) {
> @@ -1091,9 +1094,21 @@ static void mhi_pci_status_cb(struct mhi_controller *mhi_cntrl,
>  	case MHI_CB_SYS_ERROR:
>  		dev_warn(&pdev->dev, "firmware crashed (%u)\n", cb);
>  		pm_runtime_forbid(&pdev->dev);
> +		/* Stop PHC */
> +		if (mhi_cntrl->tsc_timesync)
> +			mhi_phc_stop(mhi_cntrl);
>  		break;
>  	case MHI_CB_EE_MISSION_MODE:
>  		pm_runtime_allow(&pdev->dev);
> +		/* Start PHC */
> +		if (mhi_cntrl->tsc_timesync) {
> +			if (!mhi_pdev->mhi_phc_init_done) {
> +				mhi_phc_init(mhi_cntrl);
> +				mhi_pdev->mhi_phc_init_done = true;
> +			}

This looks weird. Since MISSION_MODE can happen multiple times when the device
is attached, shouldn't you be doing mhi_phc_init() multiple times with the
corresponding mhi_phc_exit() during MHI_CB_SYS_ERROR?

Right now, you call mhi_phc_init() during MISSION_MODE and mhi_phc_exit()
during mhi_pci_remove(). What is the point of keeping /dev/ptpX if the firmware
has crashed?

- Mani

-- 
மணிவண்ணன் சதாசிவம்

^ permalink raw reply

* [PATCH net-next v3] net: mctp: don't require received header reserved bits to be zero
From: wit_yuan @ 2026-04-13  8:03 UTC (permalink / raw)
  To: jk
  Cc: yuanzhaoming901030, yuanzm2, matt, davem, edumazet, kuba, pabeni,
	netdev, linux-kernel

From: Yuan Zhaoming <yuanzm2@lenovo.com>

From the MCTP Base specification (DSP0236 v1.2.1), the first byte of
the MCTP header contains a 4-bit reserved field and a 4-bit version.

On our current receive path, we require those 4 reserved bits to be
zero, but the 9500-8i card is non-conformant, and may set these
reserved bits.

DSP0236 states that the reserved bits must be written as zero, and
ignored when read. While the device might not conform to the former,
we should accept these messages to conform to the latter.

Relax our check on the MCTP version byte to allow non-zero bits in the
reserved field.
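
As a stand-alone illustration of the relaxed check, a user-space
sketch follows. mctp_ver_ok() is a hypothetical helper that does not
exist in the kernel, and the mask is spelled as a plain constant here
rather than GENMASK(3, 0):

```c
#include <stdbool.h>
#include <stdint.h>

#define MCTP_VER_MIN		1
#define MCTP_VER_MAX		1
#define MCTP_HDR_VER_MASK	0x0f	/* low nibble, i.e. GENMASK(3, 0) */

/* Validate only the version nibble; the reserved high nibble is
 * ignored on receive.
 */
static bool mctp_ver_ok(uint8_t ver_byte)
{
	uint8_t ver = ver_byte & MCTP_HDR_VER_MASK;

	return ver >= MCTP_VER_MIN && ver <= MCTP_VER_MAX;
}
```

A first byte of 0xf1 (reserved bits set, version 1) is accepted by
this check, while 0x02 and 0x10 (versions 2 and 0) are still rejected.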

Signed-off-by: Yuan Zhaoming <yuanzm2@lenovo.com>

---
v2: https://lore.kernel.org/netdev/20260410144339.0d1b289a@kernel.org/T/#t
v1: https://lore.kernel.org/netdev/ff147a3f0d27ef2aa6026cc86f9113d56a8c61ac.camel@codeconstruct.com.au/T/#t
---
 include/net/mctp.h | 3 +++
 net/mctp/route.c   | 8 ++++++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/net/mctp.h b/include/net/mctp.h
index e1e0a69..d8bf907 100644
--- a/include/net/mctp.h
+++ b/include/net/mctp.h
@@ -26,6 +26,9 @@ struct mctp_hdr {
 #define MCTP_VER_MIN	1
 #define MCTP_VER_MAX	1
 
+/* Definitions for ver field */
+#define MCTP_HDR_VER_MASK	GENMASK(3, 0)
+
 /* Definitions for flags_seq_tag field */
 #define MCTP_HDR_FLAG_SOM	BIT(7)
 #define MCTP_HDR_FLAG_EOM	BIT(6)
diff --git a/net/mctp/route.c b/net/mctp/route.c
index e69c6f7..62517c9 100644
--- a/net/mctp/route.c
+++ b/net/mctp/route.c
@@ -439,6 +439,7 @@ static int mctp_dst_input(struct mctp_dst *dst, struct sk_buff *skb)
 	struct mctp_hdr *mh;
 	unsigned int netid;
 	unsigned long f;
+	u8 ver;
 	u8 tag, flags;
 	int rc;
 
@@ -467,7 +468,8 @@ static int mctp_dst_input(struct mctp_dst *dst, struct sk_buff *skb)
 	netid = mctp_cb(skb)->net;
 	skb_pull(skb, sizeof(struct mctp_hdr));
 
-	if (mh->ver != 1)
+	ver = mh->ver & MCTP_HDR_VER_MASK;
+	if (ver < MCTP_VER_MIN || ver > MCTP_VER_MAX)
 		goto out;
 
 	flags = mh->flags_seq_tag & (MCTP_HDR_FLAG_SOM | MCTP_HDR_FLAG_EOM);
@@ -1316,6 +1318,7 @@ static int mctp_pkttype_receive(struct sk_buff *skb, struct net_device *dev,
 	struct mctp_skb_cb *cb;
 	struct mctp_dst dst;
 	struct mctp_hdr *mh;
+	u8 ver;
 	int rc;
 
 	rcu_read_lock();
@@ -1334,7 +1337,8 @@ static int mctp_pkttype_receive(struct sk_buff *skb, struct net_device *dev,
 
 	/* We have enough for a header; decode and route */
 	mh = mctp_hdr(skb);
-	if (mh->ver < MCTP_VER_MIN || mh->ver > MCTP_VER_MAX)
+	ver = mh->ver & MCTP_HDR_VER_MASK;
+	if (ver < MCTP_VER_MIN || ver > MCTP_VER_MAX)
 		goto err_drop;
 
 	/* source must be valid unicast or null; drop reserved ranges and
-- 
2.43.0


^ permalink raw reply related

* [RFC PATCH 0/2] Decouple ftrace/livepatch from module loader via notifier priority and reverse traversal
From: chensong_2000 @ 2026-04-13  8:01 UTC (permalink / raw)
  To: rafael, lenb, mturquette, sboyd, viresh.kumar, agk, snitzer,
	mpatocka, bmarzins, song, yukuai, linan122, jason.wessel, danielt,
	dianders, horms, davem, edumazet, kuba, pabeni, paulmck, frederic,
	mcgrof, petr.pavlu, da.gomez, samitolvanen, atomlin, jpoimboe,
	jikos, mbenes, pmladek, joe.lawrence, rostedt, mhiramat,
	mark.rutland, mathieu.desnoyers
  Cc: linux-modules, linux-kernel, linux-trace-kernel, linux-acpi,
	linux-clk, linux-pm, live-patching, dm-devel, linux-raid,
	kgdb-bugreport, netdev, Song Chen

From: Song Chen <chensong_2000@189.cn>

This patchset addresses a long-standing tight coupling between the
module loader and two of its key consumers: ftrace and livepatch.

Background:

The module loader currently hard-codes direct calls to
ftrace_module_enable(), klp_module_coming(), klp_module_going() and
ftrace_release_mod() inside prepare_coming_module() and the module
unload path. This hard-coding was necessary because the module notifier
chain could not guarantee the strict call ordering that ftrace and
livepatch require:

  During MODULE_STATE_COMING, ftrace must run before livepatch, so
  that per-module function records are ready before livepatch registers
  its ftrace hooks.

  During MODULE_STATE_GOING, livepatch must run before ftrace, so that
  livepatch removes its hooks before ftrace releases those records.

This symmetric setup/teardown ordering could not be expressed through
the notifier chain because the chain only supported forward (descending
priority) traversal. Without reverse traversal, it was impossible to
guarantee that the GOING order would be the strict inverse of the
COMING order using a single priority value per notifier.
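
The ordering idea can be sketched in plain user-space C. This is only
a model under assumptions: struct notifier, struct chain,
chain_register() and chain_walk() are hypothetical names, not the
kernel notifier API or the code in this series, and the walk records
names instead of invoking callbacks:

```c
#include <stdbool.h>
#include <stddef.h>

struct notifier {
	const char *name;
	int priority;
	struct notifier *prev, *next;
};

struct chain {
	struct notifier *head, *tail;
};

/* Keep the list sorted by descending priority. */
static void chain_register(struct chain *c, struct notifier *n)
{
	struct notifier *pos = c->head;

	while (pos && pos->priority >= n->priority)
		pos = pos->next;

	/* Insert n before pos, or at the tail when pos is NULL. */
	n->next = pos;
	n->prev = pos ? pos->prev : c->tail;
	if (n->prev)
		n->prev->next = n;
	else
		c->head = n;
	if (pos)
		pos->prev = n;
	else
		c->tail = n;
}

/* Record the visiting order: forward for COMING, reverse for GOING. */
static size_t chain_walk(const struct chain *c, bool reverse,
			 const char **order, size_t max)
{
	const struct notifier *n = reverse ? c->tail : c->head;
	size_t i = 0;

	while (n && i < max) {
		order[i++] = n->name;
		n = reverse ? n->prev : n->next;
	}

	return i;
}
```

With a hypothetical "ftrace" entry registered at a higher priority
than a "livepatch" entry, the forward walk visits ftrace first and the
reverse walk visits livepatch first, so one priority value per
notifier yields both orders.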

Patch 1 - notifier: replace single-linked list with double-linked list.
Patch 2 - ftrace/klp: decouple from module loader using notifier
priority.

Heads-up: the SMTP server of my mailbox has not been working well lately.
If I receive a bounce, I will have to resend; sorry in advance.

Song Chen (2):
  kernel/notifier: replace single-linked list with double-linked list
    for reverse traversal
  kernel/module: Decouple klp and ftrace from load_module

 drivers/acpi/sleep.c      |   1 -
 drivers/clk/clk.c         |   2 +-
 drivers/cpufreq/cpufreq.c |   2 +-
 drivers/md/dm-integrity.c |   1 -
 drivers/md/md.c           |   1 -
 include/linux/module.h    |   8 ++
 include/linux/notifier.h  |  26 ++---
 kernel/debug/debug_core.c |   1 -
 kernel/livepatch/core.c   |  29 ++++-
 kernel/module/main.c      |  34 +++---
 kernel/notifier.c         | 219 ++++++++++++++++++++++++++++++++------
 kernel/trace/ftrace.c     |  38 +++++++
 net/ipv4/nexthop.c        |   2 +-
 13 files changed, 290 insertions(+), 74 deletions(-)

-- 
2.43.0


^ permalink raw reply

* (no subject)
From: Harry Yoo (Oracle) @ 2026-04-13  7:58 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Vlastimil Babka, linux-mm, Arnd Bergmann, x86, Lu Baolu,
	iommu, Michael Grzeschik, netdev, linux-wireless, Herbert Xu,
	linux-crypto, David Woodhouse, Bernie Thompson, linux-fbdev,
	Theodore Tso, linux-ext4, Andrew Morton, Uladzislau Rezki,
	Marco Elver, Dmitry Vyukov, kasan-dev, Andrey Ryabinin,
	Thomas Sailer, linux-hams, Jason A. Donenfeld, Richard Henderson,
	linux-alpha, Russell King, linux-arm-kernel, Catalin Marinas,
	Huacai Chen, loongarch, Geert Uytterhoeven, linux-m68k,
	Dinh Nguyen, Jonas Bonn, linux-openrisc, Helge Deller,
	linux-parisc, Michael Ellerman, linuxppc-dev, Paul Walmsley,
	linux-riscv, Heiko Carstens, linux-s390, David S. Miller,
	sparclinux, Hao Li, Christoph Lameter, David Rientjes,
	Roman Gushchin, Shengming Hu

Subject: Re: [patch 14/38] slub: Use prandom instead of get_cycles()
Reply-To: 
In-Reply-To: <20260410120318.525653921@kernel.org>

On Fri, Apr 10, 2026 at 02:19:37PM +0200, Thomas Gleixner wrote:
> The decision whether to scan remote nodes is based on a 'random' number
> retrieved via get_cycles(). get_cycles() is about to be removed.
> 
> There is already prandom state in the code, so use that instead.
> 
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: linux-mm@kvack.org
> ---

Acked-by: Harry Yoo (Oracle) <harry@kernel.org>

Is this for this merge window?

This may conflict with upcoming changes on freelist shuffling [1]
(not queued for slab/for-next yet though), but it should be easy to
resolve.

[Cc'ing Shengming and SLAB ALLOCATOR folks]

[1] https://lore.kernel.org/linux-mm/20260409204352095kKWVYKtZImN59ybO6iRNj@zte.com.cn

-- 
Cheers,
Harry / Hyeonggon

>  mm/slub.c |   37 +++++++++++++++++++++++--------------
>  1 file changed, 23 insertions(+), 14 deletions(-)
> 
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3302,6 +3302,25 @@ static inline struct slab *alloc_slab_pa
>  	return slab;
>  }
>  
> +#if defined(CONFIG_SLAB_FREELIST_RANDOM) || defined(CONFIG_NUMA)
> +static DEFINE_PER_CPU(struct rnd_state, slab_rnd_state);
> +
> +static unsigned int slab_get_prandom_state(unsigned int limit)
> +{
> +	struct rnd_state *state;
> +	unsigned int res;
> +
> +	/*
> +	 * An interrupt or NMI handler might interrupt and change
> +	 * the state in the middle, but that's safe.
> +	 */
> +	state = &get_cpu_var(slab_rnd_state);
> +	res = prandom_u32_state(state) % limit;
> +	put_cpu_var(slab_rnd_state);
> +	return res;
> +}
> +#endif
> +
>  #ifdef CONFIG_SLAB_FREELIST_RANDOM
>  /* Pre-initialize the random sequence cache */
>  static int init_cache_random_seq(struct kmem_cache *s)
> @@ -3365,8 +3384,6 @@ static void *next_freelist_entry(struct
>  	return (char *)start + idx;
>  }
>  
> -static DEFINE_PER_CPU(struct rnd_state, slab_rnd_state);
> -
>  /* Shuffle the single linked freelist based on a random pre-computed sequence */
>  static bool shuffle_freelist(struct kmem_cache *s, struct slab *slab,
>  			     bool allow_spin)
> @@ -3383,15 +3400,7 @@ static bool shuffle_freelist(struct kmem
>  	if (allow_spin) {
>  		pos = get_random_u32_below(freelist_count);
>  	} else {
> -		struct rnd_state *state;
> -
> -		/*
> -		 * An interrupt or NMI handler might interrupt and change
> -		 * the state in the middle, but that's safe.
> -		 */
> -		state = &get_cpu_var(slab_rnd_state);
> -		pos = prandom_u32_state(state) % freelist_count;
> -		put_cpu_var(slab_rnd_state);
> +		pos = slab_get_prandom_state(freelist_count);
>  	}
>  
>  	page_limit = slab->objects * s->size;
> @@ -3882,7 +3891,7 @@ static void *get_from_any_partial(struct
>  	 * with available objects.
>  	 */
>  	if (!s->remote_node_defrag_ratio ||
> -			get_cycles() % 1024 > s->remote_node_defrag_ratio)
> +	    slab_get_prandom_state(1024) > s->remote_node_defrag_ratio)
>  		return NULL;
>  
>  	do {
> @@ -7102,7 +7111,7 @@ static unsigned int
>  
>  	/* see get_from_any_partial() for the defrag ratio description */
>  	if (!s->remote_node_defrag_ratio ||
> -			get_cycles() % 1024 > s->remote_node_defrag_ratio)
> +	    slab_get_prandom_state(1024) > s->remote_node_defrag_ratio)
>  		return 0;
>  
>  	do {
> @@ -8421,7 +8430,7 @@ void __init kmem_cache_init_late(void)
>  	flushwq = alloc_workqueue("slub_flushwq", WQ_MEM_RECLAIM | WQ_PERCPU,
>  				  0);
>  	WARN_ON(!flushwq);
> -#ifdef CONFIG_SLAB_FREELIST_RANDOM
> +#if defined(CONFIG_SLAB_FREELIST_RANDOM) || defined(CONFIG_NUMA)
>  	prandom_init_once(&slab_rnd_state);
>  #endif
>  }
> 
> 

^ permalink raw reply

* [PATCH] net/sched: sch_dualpi2: fix NULL pointer dereference in dualpi2_change()
From: Kito Xu (veritas501) @ 2026-04-13  7:57 UTC (permalink / raw)
  To: Jamal Hadi Salim, Jiri Pirko, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Chia-Yu Chang, netdev, linux-kernel,
	Kito Xu (veritas501)

dualpi2_change() uses a trim loop to enforce the new queue limit after a
configuration change. The loop calls qdisc_dequeue_internal(sch, true)
which only dequeues from the C-queue (sch->q) and the requeue list
(sch->gso_skb). It does not dequeue from the L-queue (q->l_queue).

However, the loop continuation condition checks qdisc_qlen(sch), which
reflects the total packet count across both queues because
dualpi2_enqueue_skb() manually increments sch->q.qlen for L-queue
packets (line 418). Similarly, q->memory_used accounts for memory from
both queues.

When all packets reside in the L-queue and the C-queue is empty, the
loop condition remains true but qdisc_dequeue_internal() returns NULL.
The subsequent skb->truesize dereference causes a NULL pointer oops.
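
The failure mode described above can be reproduced in a small user-space
model. This is only a sketch under simplifying assumptions: every toy_*
name is hypothetical, the memory-limit half of the loop condition is left
out, and none of this is the kernel code itself. It shows a length counter
that covers two queues while the dequeue helper only serves one of them,
and how a NULL check lets the trim loop stop safely.

```c
#include <stddef.h>

/* Toy model, not kernel code: two queues sharing one length counter. */
struct toy_skb {
	unsigned int truesize;
};

struct toy_qdisc {
	struct toy_skb *c_queue[8];	/* stands in for the C-queue (sch->q) */
	unsigned int c_len;		/* packets held in the C-queue */
	unsigned int l_len;		/* packets parked in the L-queue */
	unsigned int qlen;		/* total of both, like qdisc_qlen(sch) */
	unsigned int memory_used;	/* bytes accounted across both queues */
};

/* Like the dequeue helper in the report: only looks at the C-queue. */
static struct toy_skb *toy_dequeue_c(struct toy_qdisc *q)
{
	if (!q->c_len)
		return NULL;
	q->qlen--;
	return q->c_queue[--q->c_len];
}

/* Trim loop with the NULL check; returns how many packets were dropped. */
static unsigned int toy_trim(struct toy_qdisc *q, unsigned int limit)
{
	unsigned int dropped = 0;

	while (q->qlen > limit) {
		struct toy_skb *skb = toy_dequeue_c(q);

		if (!skb)	/* C-queue empty, the rest sits in the L-queue */
			break;

		q->memory_used -= skb->truesize;
		dropped++;
	}
	return dropped;
}
```

Without the `if (!skb) break;` line, the model dereferences NULL as soon
as the C-queue runs dry while qlen is still above the limit, which matches
the oops reported here.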

An unprivileged user can trigger this from a user namespace:

  1. unshare(CLONE_NEWUSER | CLONE_NEWNET)
  2. Create a dummy device and attach dualpi2 qdisc
  3. Send ECT(1)-marked packets to fill the L-queue
  4. Reduce the qdisc limit via RTM_NEWQDISC

[   17.521319] Oops: general protection fault, probably for non-canonical address 0xdffffc000000001a: 0000 [#1] SMP KASAN NOPTI
[   17.525206] KASAN: null-ptr-deref in range [0x00000000000000d0-0x00000000000000d7]
[   17.527710] CPU: 3 UID: 1000 PID: 171 Comm: poc Not tainted 7.0.0-rc7-next-20260410 #10 PREEMPTLAZY
[   17.530795] Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   17.533301] RIP: 0010:dualpi2_change+0xd09/0x1c00
[   17.535472] Code: ef 83 e7 07 83 c7 03 44 38 cf 7c 09 45 84 c9 0f 85 fb 06 00 00 4c 8d 96 d0 00 00 00 44 8b 8b 5c 02 00 00 4c 89 d7 48 c1 ef 03 <0f> b6 3c 2f 40 84 ff 74 0a 40 80 ff 03 0f 8e fc 06 00 00 4c 89 c7
[   17.540294] RSP: 0018:ffffc90000bb7360 EFLAGS: 00000202
[   17.542574] RAX: 0000000000014c3a RBX: ffff88800fe18000 RCX: ffffed1001fc3010
[   17.543461] RDX: ffff88800fe1825c RSI: 0000000000000000 RDI: 000000000000001a
[   17.544145] RBP: dffffc0000000000 R08: 0000000000000028 R09: 0000000000171240
[   17.546982] R10: 00000000000000d0 R11: ffff88800fe18080 R12: ffff88800fe180d0
[   17.549652] R13: ffff88800fe1825c R14: ffff88800fe18014 R15: ffff88800fe180f8
[   17.552566] FS:  000000002b3533c0(0000) GS:ffff8880e260f000(0000) knlGS:0000000000000000
[   17.555942] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   17.556472] CR2: dffffc000000001a CR3: 000000000fb01000 CR4: 00000000003006f0
[   17.559321] Call Trace:
[   17.560392]  <TASK>
[   17.560993]  ? __asan_memset+0x23/0x50
[   17.562609]  ? __pfx_dualpi2_change+0x10/0x10
[   17.564265]  ? mutex_lock+0x7e/0xd0
[   17.565628]  ? __pfx_mutex_lock+0x10/0x10
[   17.566886]  ? nla_strcmp+0x20/0x100
[   17.568354]  tc_modify_qdisc+0x4ee/0x1d60
[   17.570211]  ? __pfx_tc_modify_qdisc+0x10/0x10
[   17.571293]  ? __pfx_stack_trace_save+0x10/0x10
[   17.572894]  ? mutex_lock+0x7e/0xd0
[   17.573548]  ? __pfx_mutex_lock+0x10/0x10
[   17.574583]  ? security_capable+0x80/0x110
[   17.576051]  rtnetlink_rcv_msg+0x548/0xc10
[   17.577902]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[   17.579170]  netlink_rcv_skb+0x12a/0x390
[   17.580902]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[   17.581817]  ? __pfx_netlink_rcv_skb+0x10/0x10
[   17.583064]  ? __kasan_slab_alloc+0x89/0x90
[   17.584753]  netlink_unicast+0x5b8/0x980
[   17.585677]  ? __pfx_netlink_unicast+0x10/0x10
[   17.587312]  ? rpm_suspend+0x492/0xe70
[   17.588612]  ? __pfx___alloc_skb+0x10/0x10
[   17.590473]  ? __check_object_size+0x45e/0x650
[   17.592327]  ? rpm_suspend+0x492/0xe70
[   17.593589]  netlink_sendmsg+0x722/0xbb0
[   17.594452]  ? __pfx_netlink_sendmsg+0x10/0x10
[   17.595192]  ? __import_iovec+0x33d/0x5b0
[   17.596582]  ? __pfx_netlink_sendmsg+0x10/0x10
[   17.598003]  ____sys_sendmsg+0x8cf/0xb30
[   17.599168]  ? __pfx_____sys_sendmsg+0x10/0x10
[   17.599921]  ? __pfx_copy_msghdr_from_user+0x10/0x10
[   17.601616]  ? update_cfs_rq_load_avg+0x5a/0x560
[   17.602924]  ___sys_sendmsg+0x104/0x190
[   17.604318]  ? update_irq_load_avg+0xbd/0x18b0
[   17.604977]  ? __pfx____sys_sendmsg+0x10/0x10
[   17.606505]  __sys_sendmsg+0x124/0x1c0
[   17.607877]  ? __pfx___sys_sendmsg+0x10/0x10
[   17.609867]  ? __pfx_handle_softirqs+0x10/0x10
[   17.611017]  do_syscall_64+0x64/0x680
[   17.612277]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   17.614334] RIP: 0033:0x429d6b
[   17.615451] Code: 48 89 e5 48 83 ec 20 89 55 ec 48 89 75 f0 89 7d f8 e8 99 e5 02 00 8b 55 ec 48 8b 75 f0 41 89 c0 8b 7d f8 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2d 44 89 c7 48 89 45 f8 e8 f1 e5 02 00 48 8b
[   17.619777] RSP: 002b:00007fff81a32560 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
[   17.621686] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 0000000000429d6b
[   17.623171] RDX: 0000000000000000 RSI: 00007fff81a325e0 RDI: 0000000000000003
[   17.625433] RBP: 00007fff81a32580 R08: 0000000000000000 R09: 00007fff81a32797
[   17.627317] R10: 0000000000000000 R11: 0000000000000293 R12: 00007fff81a32a38
[   17.629562] R13: 00007fff81a32a48 R14: 00000000004c4848 R15: 0000000000000001
[   17.630409]  </TASK>
[   17.631205] Modules linked in:
[   17.633251] ---[ end trace 0000000000000000 ]---
[   17.634487] RIP: 0010:dualpi2_change+0xd09/0x1c00
[   17.636511] Code: ef 83 e7 07 83 c7 03 44 38 cf 7c 09 45 84 c9 0f 85 fb 06 00 00 4c 8d 96 d0 00 00 00 44 8b 8b 5c 02 00 00 4c 89 d7 48 c1 ef 03 <0f> b6 3c 2f 40 84 ff 74 0a 40 80 ff 03 0f 8e fc 06 00 00 4c 89 c7
[   17.641850] RSP: 0018:ffffc90000bb7360 EFLAGS: 00000202
[   17.643001] RAX: 0000000000014c3a RBX: ffff88800fe18000 RCX: ffffed1001fc3010
[   17.645928] RDX: ffff88800fe1825c RSI: 0000000000000000 RDI: 000000000000001a
[   17.647603] RBP: dffffc0000000000 R08: 0000000000000028 R09: 0000000000171240
[   17.649912] R10: 00000000000000d0 R11: ffff88800fe18080 R12: ffff88800fe180d0
[   17.652899] R13: ffff88800fe1825c R14: ffff88800fe18014 R15: ffff88800fe180f8
[   17.655310] FS:  000000002b3533c0(0000) GS:ffff8880e260f000(0000) knlGS:0000000000000000
[   17.656751] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   17.659213] CR2: dffffc000000001a CR3: 000000000fb01000 CR4: 00000000003006f0
[   17.661915] Kernel panic - not syncing: Fatal exception in interrupt
[   17.665688] Kernel Offset: disabled
[   17.665980] Rebooting in 1 seconds..

Fix this by adding a NULL check after qdisc_dequeue_internal(). When
the C-queue is exhausted but L-queue packets keep qdisc_qlen(sch) above
the limit, the loop breaks safely. Remaining excess L-queue packets will
be drained by the normal dequeue path.

Fixes: 320d031ad6e4 ("sched: Struct definition and parsing of dualpi2 qdisc")
Signed-off-by: Kito Xu (veritas501) <hxzene@gmail.com>
---
 net/sched/sch_dualpi2.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
index fe6f5e889625..746c0e506024 100644
--- a/net/sched/sch_dualpi2.c
+++ b/net/sched/sch_dualpi2.c
@@ -870,6 +870,9 @@ static int dualpi2_change(struct Qdisc *sch, struct nlattr *opt,
 	       q->memory_used > q->memory_limit) {
 		struct sk_buff *skb = qdisc_dequeue_internal(sch, true);
 
+		if (!skb)
+			break;
+
 		q->memory_used -= skb->truesize;
 		qdisc_qstats_backlog_dec(sch, skb);
 		rtnl_qdisc_drop(skb, sch);
-- 
2.43.0


^ permalink raw reply related

* [PATCH net 2/2] net: enetc: fix NTMP DMA use-after-free issue
From: Wei Fang @ 2026-04-13  7:52 UTC (permalink / raw)
  To: claudiu.manoil, vladimir.oltean, xiaoning.wang, andrew+netdev,
	davem, edumazet, kuba, pabeni, chleroy
  Cc: netdev, linux-kernel, imx, linuxppc-dev, linux-arm-kernel
In-Reply-To: <20260413075250.281653-1-wei.fang@nxp.com>

The AI-generated review reported a potential DMA use-after-free issue
[1]. If netc_xmit_ntmp_cmd() times out and returns an error, the pending
command is not explicitly aborted, while ntmp_free_data_mem()
unconditionally frees the DMA buffer. If the buffer has already been
reallocated elsewhere, this may lead to silent memory corruption, because
the hardware eventually processes the pending command and performs a DMA
write of the response to the physical address of the freed buffer.

To resolve this issue, this patch makes the following changes:

1. Convert cbdr->ring_lock from a spinlock to a mutex

The lock was originally a spinlock in case NTMP operations might be
invoked from atomic context. After downstream support for all NTMP
tables, no such usage has materialized. A mutex is now required because
the driver needs to reclaim used BDs and release the associated DMA
memory while holding the lock, and dma_free_coherent() might sleep.

2. Introduce software command BD (struct netc_swcbd)

The hardware write-back overwrites the addr and len fields of the BD,
so the driver cannot rely on the hardware BD to free the associated DMA
memory. The driver now maintains a software shadow BD storing the DMA
buffer pointer, DMA address, and size, which enables correct DMA memory
release. netc_xmit_ntmp_cmd() reclaims older BDs only when the number of
used BDs reaches NETC_CBDR_CLEAN_WORK (16). With this, struct
ntmp_dma_buf is no longer needed and is removed, and ntmp_free_data_mem()
now takes the software BD instead.

These changes eliminate the DMA use-after-free condition and ensure safe
and consistent BD reclamation and DMA buffer lifecycle management.
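
The ownership rule behind the software BD can be sketched in user space.
This is a simplified model, not the driver: the toy_* names are
hypothetical, malloc/free stand in for the coherent DMA allocator, and a
plain integer stands in for the hardware consumer index. The point it
illustrates is that a buffer is handed to the ring at submit time and is
released only once the consumer index has moved past its slot, so a
timed-out command never has its buffer freed underneath the hardware.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_RING_SIZE	8

/* Software shadow of one ring slot: what has to be freed later. */
struct toy_swbd {
	void *buf;
	size_t size;
};

struct toy_ring {
	struct toy_swbd sw[TOY_RING_SIZE];
	int next_to_use;	/* software producer index */
	int next_to_clean;	/* oldest slot not yet reclaimed */
	int hw_consumer;	/* stands in for the hardware consumer index */
};

/* Queue a command; the buffer now belongs to the ring, not the caller. */
static int toy_submit(struct toy_ring *r, size_t size)
{
	int i = r->next_to_use;
	int next = (i + 1) % TOY_RING_SIZE;
	void *buf;

	if (next == r->next_to_clean)
		return -1;	/* ring full, caller must clean first */

	buf = calloc(1, size);
	if (!buf)
		return -1;

	r->sw[i].buf = buf;
	r->sw[i].size = size;
	r->next_to_use = next;
	return i;
}

/* Free buffers only for slots the consumer index has already passed. */
static int toy_clean(struct toy_ring *r)
{
	int freed = 0;

	while (r->next_to_clean != r->hw_consumer) {
		struct toy_swbd *sw = &r->sw[r->next_to_clean];

		free(sw->buf);
		memset(sw, 0, sizeof(*sw));
		r->next_to_clean = (r->next_to_clean + 1) % TOY_RING_SIZE;
		freed++;
	}
	return freed;
}
```

In this model a caller that gives up on a command simply returns; the slot
keeps its buffer until toy_clean() sees that the consumer index has
advanced.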

Fixes: 4701073c3deb ("net: enetc: add initial netc-lib driver to support NTMP")
Link: https://lore.kernel.org/netdev/20260403011729.1795413-1-kuba@kernel.org/ # [1]
Signed-off-by: Wei Fang <wei.fang@nxp.com>
---
 drivers/net/ethernet/freescale/enetc/ntmp.c   | 158 ++++++++++--------
 .../ethernet/freescale/enetc/ntmp_private.h   |   8 +-
 include/linux/fsl/ntmp.h                      |   9 +-
 3 files changed, 93 insertions(+), 82 deletions(-)

diff --git a/drivers/net/ethernet/freescale/enetc/ntmp.c b/drivers/net/ethernet/freescale/enetc/ntmp.c
index 1b1ff0446d0a..3efc65443113 100644
--- a/drivers/net/ethernet/freescale/enetc/ntmp.c
+++ b/drivers/net/ethernet/freescale/enetc/ntmp.c
@@ -7,6 +7,7 @@
 #include <linux/dma-mapping.h>
 #include <linux/fsl/netc_global.h>
 #include <linux/iopoll.h>
+#include <linux/vmalloc.h>
 
 #include "ntmp_private.h"
 
@@ -42,6 +43,12 @@ int ntmp_init_cbdr(struct netc_cbdr *cbdr, struct device *dev,
 	if (!cbdr->addr_base)
 		return -ENOMEM;
 
+	cbdr->swcbd = vcalloc(cbd_num, sizeof(struct netc_swcbd));
+	if (!cbdr->swcbd) {
+		dma_free_coherent(dev, size, cbdr->addr_base, cbdr->dma_base);
+		return -ENOMEM;
+	}
+
 	cbdr->dma_size = size;
 	cbdr->bd_num = cbd_num;
 	cbdr->regs = *regs;
@@ -52,7 +59,7 @@ int ntmp_init_cbdr(struct netc_cbdr *cbdr, struct device *dev,
 	cbdr->addr_base_align = PTR_ALIGN(cbdr->addr_base,
 					  NTMP_BASE_ADDR_ALIGN);
 
-	spin_lock_init(&cbdr->ring_lock);
+	mutex_init(&cbdr->ring_lock);
 
 	cbdr->next_to_use = netc_read(cbdr->regs.pir);
 	cbdr->next_to_clean = netc_read(cbdr->regs.cir) & NETC_CBDRCIR_INDEX;
@@ -71,10 +78,25 @@ int ntmp_init_cbdr(struct netc_cbdr *cbdr, struct device *dev,
 }
 EXPORT_SYMBOL_GPL(ntmp_init_cbdr);
 
+static void ntmp_free_data_mem(struct device *dev, struct netc_swcbd *swcbd)
+{
+	dma_free_coherent(dev, swcbd->size + NTMP_DATA_ADDR_ALIGN,
+			  swcbd->buf, swcbd->dma);
+}
+
 void ntmp_free_cbdr(struct netc_cbdr *cbdr)
 {
 	/* Disable the Control BD Ring */
 	netc_write(cbdr->regs.mr, 0);
+
+	for (int i = 0; i < cbdr->bd_num; i++) {
+		struct netc_swcbd *swcbd = &cbdr->swcbd[i];
+
+		if (swcbd->dma)
+			ntmp_free_data_mem(cbdr->dev, swcbd);
+	}
+
+	vfree(cbdr->swcbd);
 	dma_free_coherent(cbdr->dev, cbdr->dma_size, cbdr->addr_base,
 			  cbdr->dma_base);
 	memset(cbdr, 0, sizeof(*cbdr));
@@ -94,24 +116,28 @@ static union netc_cbd *ntmp_get_cbd(struct netc_cbdr *cbdr, int index)
 
 static void ntmp_clean_cbdr(struct netc_cbdr *cbdr)
 {
-	union netc_cbd *cbd;
-	int i;
+	int i = cbdr->next_to_clean;
 
-	i = cbdr->next_to_clean;
 	while ((netc_read(cbdr->regs.cir) & NETC_CBDRCIR_INDEX) != i) {
-		cbd = ntmp_get_cbd(cbdr, i);
+		union netc_cbd *cbd = ntmp_get_cbd(cbdr, i);
+		struct netc_swcbd *swcbd = &cbdr->swcbd[i];
+
+		ntmp_free_data_mem(cbdr->dev, swcbd);
+		memset(swcbd, 0, sizeof(*swcbd));
 		memset(cbd, 0, sizeof(*cbd));
 		i = (i + 1) % cbdr->bd_num;
 	}
 
+	dma_wmb();
 	cbdr->next_to_clean = i;
 }
 
-static int netc_xmit_ntmp_cmd(struct ntmp_user *user, union netc_cbd *cbd)
+static int netc_xmit_ntmp_cmd(struct ntmp_user *user, union netc_cbd *cbd,
+			      struct netc_swcbd *swcbd)
 {
 	union netc_cbd *cur_cbd;
 	struct netc_cbdr *cbdr;
-	int i, err;
+	int i, err, used_bds;
 	u16 status;
 	u32 val;
 
@@ -120,14 +146,21 @@ static int netc_xmit_ntmp_cmd(struct ntmp_user *user, union netc_cbd *cbd)
 	 */
 	cbdr = &user->ring[0];
 
-	spin_lock_bh(&cbdr->ring_lock);
+	mutex_lock(&cbdr->ring_lock);
 
-	if (unlikely(!ntmp_get_free_cbd_num(cbdr)))
+	used_bds = cbdr->bd_num - ntmp_get_free_cbd_num(cbdr);
+	if (unlikely(used_bds >= NETC_CBDR_CLEAN_WORK)) {
 		ntmp_clean_cbdr(cbdr);
+		if (unlikely(!ntmp_get_free_cbd_num(cbdr))) {
+			err = -EBUSY;
+			goto cbdr_unlock;
+		}
+	}
 
 	i = cbdr->next_to_use;
 	cur_cbd = ntmp_get_cbd(cbdr, i);
 	*cur_cbd = *cbd;
+	cbdr->swcbd[i] = *swcbd;
 	dma_wmb();
 
 	/* Update producer index of both software and hardware */
@@ -135,10 +168,9 @@ static int netc_xmit_ntmp_cmd(struct ntmp_user *user, union netc_cbd *cbd)
 	cbdr->next_to_use = i;
 	netc_write(cbdr->regs.pir, i);
 
-	err = read_poll_timeout_atomic(netc_read, val,
-				       (val & NETC_CBDRCIR_INDEX) == i,
-				       NETC_CBDR_DELAY_US, NETC_CBDR_TIMEOUT,
-				       true, cbdr->regs.cir);
+	err = read_poll_timeout(netc_read, val, (val & NETC_CBDRCIR_INDEX) == i,
+				NETC_CBDR_DELAY_US, NETC_CBDR_TIMEOUT,
+				true, cbdr->regs.cir);
 	if (unlikely(err))
 		goto cbdr_unlock;
 
@@ -155,36 +187,28 @@ static int netc_xmit_ntmp_cmd(struct ntmp_user *user, union netc_cbd *cbd)
 		dev_err(user->dev, "Command BD error: 0x%04x\n", status);
 	}
 
-	ntmp_clean_cbdr(cbdr);
-	dma_wmb();
-
 cbdr_unlock:
-	spin_unlock_bh(&cbdr->ring_lock);
+	mutex_unlock(&cbdr->ring_lock);
 
 	return err;
 }
 
-static int ntmp_alloc_data_mem(struct ntmp_dma_buf *data, void **buf_align)
+static int ntmp_alloc_data_mem(struct device *dev, struct netc_swcbd *swcbd,
+			       void **buf_align)
 {
 	void *buf;
 
-	buf = dma_alloc_coherent(data->dev, data->size + NTMP_DATA_ADDR_ALIGN,
-				 &data->dma, GFP_KERNEL);
+	buf = dma_alloc_coherent(dev, swcbd->size + NTMP_DATA_ADDR_ALIGN,
+				 &swcbd->dma, GFP_KERNEL);
 	if (!buf)
 		return -ENOMEM;
 
-	data->buf = buf;
+	swcbd->buf = buf;
 	*buf_align = PTR_ALIGN(buf, NTMP_DATA_ADDR_ALIGN);
 
 	return 0;
 }
 
-static void ntmp_free_data_mem(struct ntmp_dma_buf *data)
-{
-	dma_free_coherent(data->dev, data->size + NTMP_DATA_ADDR_ALIGN,
-			  data->buf, data->dma);
-}
-
 static void ntmp_fill_request_hdr(union netc_cbd *cbd, dma_addr_t dma,
 				  int len, int table_id, int cmd,
 				  int access_method)
@@ -235,37 +259,36 @@ static int ntmp_delete_entry_by_id(struct ntmp_user *user, int tbl_id,
 				   u8 tbl_ver, u32 entry_id, u32 req_len,
 				   u32 resp_len)
 {
-	struct ntmp_dma_buf data = {
-		.dev = user->dev,
+	struct netc_swcbd swcbd = {
 		.size = max(req_len, resp_len),
 	};
 	struct ntmp_req_by_eid *req;
 	union netc_cbd cbd;
 	int err;
 
-	err = ntmp_alloc_data_mem(&data, (void **)&req);
+	err = ntmp_alloc_data_mem(user->dev, &swcbd, (void **)&req);
 	if (err)
 		return err;
 
 	ntmp_fill_crd_eid(req, tbl_ver, 0, 0, entry_id);
-	ntmp_fill_request_hdr(&cbd, data.dma, NTMP_LEN(req_len, resp_len),
+	ntmp_fill_request_hdr(&cbd, swcbd.dma, NTMP_LEN(req_len, resp_len),
 			      tbl_id, NTMP_CMD_DELETE, NTMP_AM_ENTRY_ID);
 
-	err = netc_xmit_ntmp_cmd(user, &cbd);
+	err = netc_xmit_ntmp_cmd(user, &cbd, &swcbd);
 	if (err)
 		dev_err(user->dev,
 			"Failed to delete entry 0x%x of %s, err: %pe",
 			entry_id, ntmp_table_name(tbl_id), ERR_PTR(err));
 
-	ntmp_free_data_mem(&data);
-
 	return err;
 }
 
 static int ntmp_query_entry_by_id(struct ntmp_user *user, int tbl_id,
-				  u32 len, struct ntmp_req_by_eid *req,
-				  dma_addr_t dma, bool compare_eid)
+				  struct ntmp_req_by_eid *req,
+				  struct netc_swcbd *swcbd,
+				  bool compare_eid)
 {
+	u32 len = NTMP_LEN(sizeof(*req), swcbd->size);
 	struct ntmp_cmn_resp_query *resp;
 	int cmd = NTMP_CMD_QUERY;
 	union netc_cbd cbd;
@@ -277,8 +300,9 @@ static int ntmp_query_entry_by_id(struct ntmp_user *user, int tbl_id,
 		cmd = NTMP_CMD_QU;
 
 	/* Request header */
-	ntmp_fill_request_hdr(&cbd, dma, len, tbl_id, cmd, NTMP_AM_ENTRY_ID);
-	err = netc_xmit_ntmp_cmd(user, &cbd);
+	ntmp_fill_request_hdr(&cbd, swcbd->dma, len, tbl_id, cmd,
+			      NTMP_AM_ENTRY_ID);
+	err = netc_xmit_ntmp_cmd(user, &cbd, swcbd);
 	if (err) {
 		dev_err(user->dev,
 			"Failed to query entry 0x%x of %s, err: %pe\n",
@@ -306,15 +330,14 @@ static int ntmp_query_entry_by_id(struct ntmp_user *user, int tbl_id,
 int ntmp_maft_add_entry(struct ntmp_user *user, u32 entry_id,
 			struct maft_entry_data *maft)
 {
-	struct ntmp_dma_buf data = {
-		.dev = user->dev,
+	struct netc_swcbd swcbd = {
 		.size = sizeof(struct maft_req_add),
 	};
 	struct maft_req_add *req;
 	union netc_cbd cbd;
 	int err;
 
-	err = ntmp_alloc_data_mem(&data, (void **)&req);
+	err = ntmp_alloc_data_mem(user->dev, &swcbd, (void **)&req);
 	if (err)
 		return err;
 
@@ -323,15 +346,13 @@ int ntmp_maft_add_entry(struct ntmp_user *user, u32 entry_id,
 	req->keye = maft->keye;
 	req->cfge = maft->cfge;
 
-	ntmp_fill_request_hdr(&cbd, data.dma, NTMP_LEN(data.size, 0),
+	ntmp_fill_request_hdr(&cbd, swcbd.dma, NTMP_LEN(swcbd.size, 0),
 			      NTMP_MAFT_ID, NTMP_CMD_ADD, NTMP_AM_ENTRY_ID);
-	err = netc_xmit_ntmp_cmd(user, &cbd);
+	err = netc_xmit_ntmp_cmd(user, &cbd, &swcbd);
 	if (err)
 		dev_err(user->dev, "Failed to add MAFT entry 0x%x, err: %pe\n",
 			entry_id, ERR_PTR(err));
 
-	ntmp_free_data_mem(&data);
-
 	return err;
 }
 EXPORT_SYMBOL_GPL(ntmp_maft_add_entry);
@@ -339,33 +360,27 @@ EXPORT_SYMBOL_GPL(ntmp_maft_add_entry);
 int ntmp_maft_query_entry(struct ntmp_user *user, u32 entry_id,
 			  struct maft_entry_data *maft)
 {
-	struct ntmp_dma_buf data = {
-		.dev = user->dev,
+	struct netc_swcbd swcbd = {
 		.size = sizeof(struct maft_resp_query),
 	};
 	struct maft_resp_query *resp;
 	struct ntmp_req_by_eid *req;
 	int err;
 
-	err = ntmp_alloc_data_mem(&data, (void **)&req);
+	err = ntmp_alloc_data_mem(user->dev, &swcbd, (void **)&req);
 	if (err)
 		return err;
 
 	ntmp_fill_crd_eid(req, user->tbl.maft_ver, 0, 0, entry_id);
-	err = ntmp_query_entry_by_id(user, NTMP_MAFT_ID,
-				     NTMP_LEN(sizeof(*req), data.size),
-				     req, data.dma, true);
+	err = ntmp_query_entry_by_id(user, NTMP_MAFT_ID, req, &swcbd, true);
 	if (err)
-		goto end;
+		return err;
 
 	resp = (struct maft_resp_query *)req;
 	maft->keye = resp->keye;
 	maft->cfge = resp->cfge;
 
-end:
-	ntmp_free_data_mem(&data);
-
-	return err;
+	return 0;
 }
 EXPORT_SYMBOL_GPL(ntmp_maft_query_entry);
 
@@ -379,8 +394,8 @@ EXPORT_SYMBOL_GPL(ntmp_maft_delete_entry);
 int ntmp_rsst_update_entry(struct ntmp_user *user, const u32 *table,
 			   int count)
 {
-	struct ntmp_dma_buf data = {.dev = user->dev};
 	struct rsst_req_update *req;
+	struct netc_swcbd swcbd;
 	union netc_cbd cbd;
 	int err, i;
 
@@ -388,8 +403,8 @@ int ntmp_rsst_update_entry(struct ntmp_user *user, const u32 *table,
 		/* HW only takes in a full 64 entry table */
 		return -EINVAL;
 
-	data.size = struct_size(req, groups, count);
-	err = ntmp_alloc_data_mem(&data, (void **)&req);
+	swcbd.size = struct_size(req, groups, count);
+	err = ntmp_alloc_data_mem(user->dev, &swcbd, (void **)&req);
 	if (err)
 		return err;
 
@@ -399,24 +414,22 @@ int ntmp_rsst_update_entry(struct ntmp_user *user, const u32 *table,
 	for (i = 0; i < count; i++)
 		req->groups[i] = (u8)(table[i]);
 
-	ntmp_fill_request_hdr(&cbd, data.dma, NTMP_LEN(data.size, 0),
+	ntmp_fill_request_hdr(&cbd, swcbd.dma, NTMP_LEN(swcbd.size, 0),
 			      NTMP_RSST_ID, NTMP_CMD_UPDATE, NTMP_AM_ENTRY_ID);
 
-	err = netc_xmit_ntmp_cmd(user, &cbd);
+	err = netc_xmit_ntmp_cmd(user, &cbd, &swcbd);
 	if (err)
 		dev_err(user->dev, "Failed to update RSST entry, err: %pe\n",
 			ERR_PTR(err));
 
-	ntmp_free_data_mem(&data);
-
 	return err;
 }
 EXPORT_SYMBOL_GPL(ntmp_rsst_update_entry);
 
 int ntmp_rsst_query_entry(struct ntmp_user *user, u32 *table, int count)
 {
-	struct ntmp_dma_buf data = {.dev = user->dev};
 	struct ntmp_req_by_eid *req;
+	struct netc_swcbd swcbd;
 	union netc_cbd cbd;
 	int err, i;
 	u8 *group;
@@ -425,21 +438,21 @@ int ntmp_rsst_query_entry(struct ntmp_user *user, u32 *table, int count)
 		/* HW only takes in a full 64 entry table */
 		return -EINVAL;
 
-	data.size = NTMP_ENTRY_ID_SIZE + RSST_STSE_DATA_SIZE(count) +
-		    RSST_CFGE_DATA_SIZE(count);
-	err = ntmp_alloc_data_mem(&data, (void **)&req);
+	swcbd.size = NTMP_ENTRY_ID_SIZE + RSST_STSE_DATA_SIZE(count) +
+		     RSST_CFGE_DATA_SIZE(count);
+	err = ntmp_alloc_data_mem(user->dev, &swcbd, (void **)&req);
 	if (err)
 		return err;
 
 	/* Set the request data buffer */
 	ntmp_fill_crd_eid(req, user->tbl.rsst_ver, 0, 0, 0);
-	ntmp_fill_request_hdr(&cbd, data.dma, NTMP_LEN(sizeof(*req), data.size),
+	ntmp_fill_request_hdr(&cbd, swcbd.dma, NTMP_LEN(sizeof(*req), swcbd.size),
 			      NTMP_RSST_ID, NTMP_CMD_QUERY, NTMP_AM_ENTRY_ID);
-	err = netc_xmit_ntmp_cmd(user, &cbd);
+	err = netc_xmit_ntmp_cmd(user, &cbd, &swcbd);
 	if (err) {
 		dev_err(user->dev, "Failed to query RSST entry, err: %pe\n",
 			ERR_PTR(err));
-		goto end;
+		return err;
 	}
 
 	group = (u8 *)req;
@@ -447,10 +460,7 @@ int ntmp_rsst_query_entry(struct ntmp_user *user, u32 *table, int count)
 	for (i = 0; i < count; i++)
 		table[i] = group[i];
 
-end:
-	ntmp_free_data_mem(&data);
-
-	return err;
+	return 0;
 }
 EXPORT_SYMBOL_GPL(ntmp_rsst_query_entry);
 
diff --git a/drivers/net/ethernet/freescale/enetc/ntmp_private.h b/drivers/net/ethernet/freescale/enetc/ntmp_private.h
index 7a53db8740db..5ae6f8b92700 100644
--- a/drivers/net/ethernet/freescale/enetc/ntmp_private.h
+++ b/drivers/net/ethernet/freescale/enetc/ntmp_private.h
@@ -13,6 +13,7 @@
 #define NTMP_EID_REQ_LEN	8
 #define NETC_CBDR_BD_NUM	256
 #define NETC_CBDRCIR_INDEX	GENMASK(9, 0)
+#define NETC_CBDR_CLEAN_WORK	16
 
 union netc_cbd {
 	struct {
@@ -55,13 +56,6 @@ union netc_cbd {
 	} resp_hdr; /* NTMP Response Message Header Format */
 };
 
-struct ntmp_dma_buf {
-	struct device *dev;
-	size_t size;
-	void *buf;
-	dma_addr_t dma;
-};
-
 struct ntmp_cmn_req_data {
 	__le16 update_act;
 	u8 dbg_opt;
diff --git a/include/linux/fsl/ntmp.h b/include/linux/fsl/ntmp.h
index 916dc4fe7de3..83a449b4d6ec 100644
--- a/include/linux/fsl/ntmp.h
+++ b/include/linux/fsl/ntmp.h
@@ -31,6 +31,12 @@ struct netc_tbl_vers {
 	u8 rsst_ver;
 };
 
+struct netc_swcbd {
+	void *buf;
+	dma_addr_t dma;
+	size_t size;
+};
+
 struct netc_cbdr {
 	struct device *dev;
 	struct netc_cbdr_regs regs;
@@ -44,9 +50,10 @@ struct netc_cbdr {
 	void *addr_base_align;
 	dma_addr_t dma_base;
 	dma_addr_t dma_base_align;
+	struct netc_swcbd *swcbd;
 
 	/* Serialize the order of command BD ring */
-	spinlock_t ring_lock;
+	struct mutex ring_lock;
 };
 
 struct ntmp_user {
-- 
2.34.1


^ permalink raw reply related

* [PATCH net 1/2] net: enetc: correct the command BD ring consumer index
From: Wei Fang @ 2026-04-13  7:52 UTC (permalink / raw)
  To: claudiu.manoil, vladimir.oltean, xiaoning.wang, andrew+netdev,
	davem, edumazet, kuba, pabeni, chleroy
  Cc: netdev, linux-kernel, imx, linuxppc-dev, linux-arm-kernel
In-Reply-To: <20260413075250.281653-1-wei.fang@nxp.com>

The command BD ring consumer index register holds the consumer index in
its lower 10 bits, and bit 31 is SBE, which indicates whether a system
bus error occurred during execution of the CBD command. So if a system
bus error occurs, reading the register returns a value with the SBE bit
set.

However, the current implementation directly uses the register value as
the consumer index without masking it. Therefore, if a system bus error
occurs, an incorrect consumer index will be obtained, causing errors in
the processing of the command BD ring. Thus, we need to mask out the
other bits to obtain the correct consumer index.
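
As a small worked example of the masking, here is a user-space sketch. The
TOY_* macros and helper names are hypothetical; the bit layout (index in
bits 9:0, SBE in bit 31) is taken from the description above, and 0x3ff
is the value GENMASK(9, 0) expands to.

```c
#include <stdint.h>

/* Assumed layout from the description: bits 9:0 index, bit 31 SBE. */
#define TOY_CIR_INDEX_MASK	0x3ffu		/* same value as GENMASK(9, 0) */
#define TOY_CIR_SBE		(1u << 31)

/* Extract the consumer index, ignoring SBE and any other upper bits. */
static uint32_t toy_cir_index(uint32_t cir)
{
	return cir & TOY_CIR_INDEX_MASK;
}

/* Report whether the raw register value has the SBE bit set. */
static int toy_cir_has_sbe(uint32_t cir)
{
	return (cir & TOY_CIR_SBE) != 0;
}
```

With a raw value of 0x80000005, an unmasked comparison against index 5
never matches, while toy_cir_index() yields 5 and toy_cir_has_sbe()
reports the error separately.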

Fixes: 4701073c3deb ("net: enetc: add initial netc-lib driver to support NTMP")
Signed-off-by: Wei Fang <wei.fang@nxp.com>
---
 drivers/net/ethernet/freescale/enetc/ntmp.c         | 7 ++++---
 drivers/net/ethernet/freescale/enetc/ntmp_private.h | 1 +
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/freescale/enetc/ntmp.c b/drivers/net/ethernet/freescale/enetc/ntmp.c
index 0c1d343253bf..1b1ff0446d0a 100644
--- a/drivers/net/ethernet/freescale/enetc/ntmp.c
+++ b/drivers/net/ethernet/freescale/enetc/ntmp.c
@@ -55,7 +55,7 @@ int ntmp_init_cbdr(struct netc_cbdr *cbdr, struct device *dev,
 	spin_lock_init(&cbdr->ring_lock);
 
 	cbdr->next_to_use = netc_read(cbdr->regs.pir);
-	cbdr->next_to_clean = netc_read(cbdr->regs.cir);
+	cbdr->next_to_clean = netc_read(cbdr->regs.cir) & NETC_CBDRCIR_INDEX;
 
 	/* Step 1: Configure the base address of the Control BD Ring */
 	netc_write(cbdr->regs.bar0, lower_32_bits(cbdr->dma_base_align));
@@ -98,7 +98,7 @@ static void ntmp_clean_cbdr(struct netc_cbdr *cbdr)
 	int i;
 
 	i = cbdr->next_to_clean;
-	while (netc_read(cbdr->regs.cir) != i) {
+	while ((netc_read(cbdr->regs.cir) & NETC_CBDRCIR_INDEX) != i) {
 		cbd = ntmp_get_cbd(cbdr, i);
 		memset(cbd, 0, sizeof(*cbd));
 		i = (i + 1) % cbdr->bd_num;
@@ -135,7 +135,8 @@ static int netc_xmit_ntmp_cmd(struct ntmp_user *user, union netc_cbd *cbd)
 	cbdr->next_to_use = i;
 	netc_write(cbdr->regs.pir, i);
 
-	err = read_poll_timeout_atomic(netc_read, val, val == i,
+	err = read_poll_timeout_atomic(netc_read, val,
+				       (val & NETC_CBDRCIR_INDEX) == i,
 				       NETC_CBDR_DELAY_US, NETC_CBDR_TIMEOUT,
 				       true, cbdr->regs.cir);
 	if (unlikely(err))
diff --git a/drivers/net/ethernet/freescale/enetc/ntmp_private.h b/drivers/net/ethernet/freescale/enetc/ntmp_private.h
index 34394e40fddd..7a53db8740db 100644
--- a/drivers/net/ethernet/freescale/enetc/ntmp_private.h
+++ b/drivers/net/ethernet/freescale/enetc/ntmp_private.h
@@ -12,6 +12,7 @@
 
 #define NTMP_EID_REQ_LEN	8
 #define NETC_CBDR_BD_NUM	256
+#define NETC_CBDRCIR_INDEX	GENMASK(9, 0)
 
 union netc_cbd {
 	struct {
-- 
2.34.1


^ permalink raw reply related

* [PATCH net 0/2] net: enetc: fix command BD ring issues
From: Wei Fang @ 2026-04-13  7:52 UTC (permalink / raw)
  To: claudiu.manoil, vladimir.oltean, xiaoning.wang, andrew+netdev,
	davem, edumazet, kuba, pabeni, chleroy
  Cc: netdev, linux-kernel, imx, linuxppc-dev, linux-arm-kernel

Currently, the implementation of the command BD ring has two issues. The
first is that the driver may obtain a wrong consumer index for the ring:
it does not mask out the SBE bit of the CIR value, so a wrong index is
obtained when a system bus error occurs. The second is that the DMA
buffer may be used after free. If netc_xmit_ntmp_cmd() times out and
returns an error, the pending command is not explicitly aborted, while
ntmp_free_data_mem() unconditionally frees the DMA buffer. If the buffer
has already been reallocated elsewhere, this may lead to silent memory
corruption, because the hardware eventually processes the pending command
and performs a DMA write of the response to the physical address of the
freed buffer. This patch set fixes both issues.

Wei Fang (2):
  net: enetc: correct the command BD ring consumer index
  net: enetc: fix NTMP DMA use-after-free issue

 drivers/net/ethernet/freescale/enetc/ntmp.c   | 161 ++++++++++--------
 .../ethernet/freescale/enetc/ntmp_private.h   |   9 +-
 include/linux/fsl/ntmp.h                      |   9 +-
 3 files changed, 96 insertions(+), 83 deletions(-)

-- 
2.34.1


^ permalink raw reply

* [PATCH 6.12.y] netfilter: conntrack: add missing netlink policy validations
From: Li hongliang @ 2026-04-13  7:31 UTC (permalink / raw)
  To: gregkh, stable, fw
  Cc: patches, linux-kernel, pablo, kadlec, davem, edumazet, kuba,
	pabeni, horms, kaber, netfilter-devel, coreteam, netdev, imv4bel

From: Florian Westphal <fw@strlen.de>

[ Upstream commit f900e1d77ee0ef87bfb5ab3fe60f0b3d8ad5ba05 ]

Hyunwoo Kim reports out-of-bounds access in sctp and ctnetlink.

These attributes are used by the kernel without any validation.
Extend the netlink policies accordingly.

Quoting the reporter:
  nlattr_to_sctp() assigns the user-supplied CTA_PROTOINFO_SCTP_STATE
  value directly to ct->proto.sctp.state without checking that it is
  within the valid range. [..]

  and: ... with exp->dir = 100, the access at
  ct->master->tuplehash[100] reads 5600 bytes past the start of a
  320-byte nf_conn object, causing a slab-out-of-bounds read confirmed by
  UBSAN.
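
The effect of an upper-bound policy can be sketched in plain userspace C
as below. This is an illustration under stated assumptions, not the
netlink implementation: the example_* names are hypothetical, and the
-ERANGE return value is this sketch's choice. The point is only that the
user-supplied value is rejected before it is used as an array index.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Mirrors enum ip_conntrack_dir: only two directions are valid indices. */
enum example_ct_dir {
	EXAMPLE_CT_DIR_ORIGINAL = 0,
	EXAMPLE_CT_DIR_REPLY = 1,
	EXAMPLE_CT_DIR_MAX
};

/* Hypothetical helper: accept value only if it does not exceed max. */
static inline int example_validate_max(uint32_t value, uint32_t max)
{
	return value > max ? -ERANGE : 0;
}

/* Hypothetical lookup: index the per-direction array only after validation. */
static inline int example_lookup_dir(const int table[EXAMPLE_CT_DIR_MAX],
				     uint32_t dir, int *out)
{
	int err;

	assert(table && out);

	err = example_validate_max(dir, EXAMPLE_CT_DIR_REPLY);
	if (err)
		return err;

	*out = table[dir];
	return 0;
}
```

With dir = 100, as in the report, the lookup fails before the array is
touched.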

Fixes: 076a0ca02644 ("netfilter: ctnetlink: add NAT support for expectations")
Fixes: a258860e01b8 ("netfilter: ctnetlink: add full support for SCTP to ctnetlink")
Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Li hongliang <1468888505@139.com>
---
 net/netfilter/nf_conntrack_netlink.c    | 2 +-
 net/netfilter/nf_conntrack_proto_sctp.c | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 323e147fe282..f51cdfba68fb 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -3460,7 +3460,7 @@ ctnetlink_change_expect(struct nf_conntrack_expect *x,
 
 #if IS_ENABLED(CONFIG_NF_NAT)
 static const struct nla_policy exp_nat_nla_policy[CTA_EXPECT_NAT_MAX+1] = {
-	[CTA_EXPECT_NAT_DIR]	= { .type = NLA_U32 },
+	[CTA_EXPECT_NAT_DIR]	= NLA_POLICY_MAX(NLA_BE32, IP_CT_DIR_REPLY),
 	[CTA_EXPECT_NAT_TUPLE]	= { .type = NLA_NESTED },
 };
 #endif
diff --git a/net/netfilter/nf_conntrack_proto_sctp.c b/net/netfilter/nf_conntrack_proto_sctp.c
index 4cc97f971264..fabb2c1ca00a 100644
--- a/net/netfilter/nf_conntrack_proto_sctp.c
+++ b/net/netfilter/nf_conntrack_proto_sctp.c
@@ -587,7 +587,8 @@ static int sctp_to_nlattr(struct sk_buff *skb, struct nlattr *nla,
 }
 
 static const struct nla_policy sctp_nla_policy[CTA_PROTOINFO_SCTP_MAX+1] = {
-	[CTA_PROTOINFO_SCTP_STATE]	    = { .type = NLA_U8 },
+	[CTA_PROTOINFO_SCTP_STATE]	    = NLA_POLICY_MAX(NLA_U8,
+							 SCTP_CONNTRACK_HEARTBEAT_SENT),
 	[CTA_PROTOINFO_SCTP_VTAG_ORIGINAL]  = { .type = NLA_U32 },
 	[CTA_PROTOINFO_SCTP_VTAG_REPLY]     = { .type = NLA_U32 },
 };
-- 
2.34.1




* [PATCH iwl-net 5/5] iavf: return 0 when TC flower filter not found after qdisc teardown
From: Aleksandr Loktionov @ 2026-04-13  7:30 UTC (permalink / raw)
  To: intel-wired-lan, anthony.l.nguyen, aleksandr.loktionov
  Cc: netdev, Kiran Patil
In-Reply-To: <20260413073035.4082204-1-aleksandr.loktionov@intel.com>

From: Kiran Patil <kiran.patil@intel.com>

When an egress qdisc is destroyed, the driver proactively deletes all
associated cloud filters to prevent stale hardware state, decrementing
num_cloud_filters to zero in the process.

The kernel netdev layer is unaware of this implicit cleanup and may
still try to delete the same filters individually. If the filter is
not found in the driver's list and num_cloud_filters is already zero,
return 0 instead of -EINVAL to avoid confusing upper layers that
believe the filter is still offloaded in hardware.
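
The intended return-value logic can be sketched as a small standalone
function. This is a simplified model, not the driver code: struct
example_adapter and example_delete_filter() are hypothetical names, and
the filter list lookup is reduced to a boolean.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for the adapter state used in the sketch. */
struct example_adapter {
	unsigned int num_cloud_filters;	/* filters the driver still tracks */
};

/* Result of a delete request, given whether the filter was found. */
static inline int example_delete_filter(const struct example_adapter *adapter,
					bool found)
{
	assert(adapter != NULL);

	if (found)
		return 0;	/* normal path: filter is marked for deletion */
	if (adapter->num_cloud_filters)
		return -EINVAL;	/* other filters exist, so this is a real miss */
	return 0;		/* list already emptied by qdisc teardown */
}
```

Only the middle case, a miss while other filters are still tracked, is
reported as an error.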

Fixes: 0075fa0fadd0 ("i40evf: Add support to apply cloud filters")
Signed-off-by: Kiran Patil <kiran.patil@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
---
 drivers/net/ethernet/intel/iavf/iavf_main.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
index 5e4035b..05aaae9 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_main.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
@@ -4175,7 +4175,16 @@ static int iavf_delete_clsflower(struct iavf_adapter *adapter,
 	if (filter) {
 		filter->del = true;
 		adapter->aq_required |= IAVF_FLAG_AQ_DEL_CLOUD_FILTER;
-	} else {
+	} else if (adapter->num_cloud_filters) {
+		/* When the egress qdisc is detached the driver implicitly
+		 * deletes all associated cloud filters to prevent stale
+		 * hardware entries, reducing num_cloud_filters to zero.
+		 * The netdev layer is unaware of this implicit cleanup and
+		 * may still request deletion of individual filters.  Only
+		 * return -EINVAL when a filter lookup fails and
+		 * num_cloud_filters is non-zero, indicating a genuine
+		 * lookup failure rather than a post-teardown stale delete.
+		 */
 		err = -EINVAL;
 	}
 	spin_unlock_bh(&adapter->cloud_filter_list_lock);
-- 
2.52.0



