* [PATCH bpf v4 2/3] bpf, sockmap: reject a packet-modifying SK_SKB stream parser
From: Sechang Lim @ 2026-06-19 6:29 UTC (permalink / raw)
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Eduard Zingerman, Kumar Kartikeya Dwivedi, John Fastabend,
Jakub Sitnicki, David S . Miller, Jakub Kicinski, Eric Dumazet,
Paolo Abeni, Kuniyuki Iwashima, Willem de Bruijn, Shuah Khan
Cc: Jiri Olsa, Martin KaFai Lau, Song Liu, Yonghong Song,
Simon Horman, Bobby Eshleman, Jiayuan Chen, bpf, netdev,
linux-kernel, linux-kselftest
In-Reply-To: <20260619062959.3277612-1-rhkrqnwk98@gmail.com>
sk_psock_strp_parse() runs the BPF_PROG_TYPE_SK_SKB stream-parser program
to find the length of the next message. strparser assembles a message out
of several received skbs by chaining them onto the head's frag_list and
recording where to append the next one in strp->skb_nextp:
*strp->skb_nextp = skb;
strp->skb_nextp = &skb->next;
and then calls the parser on the head:
len = (*strp->cb.parse_msg)(strp, head);
The parser is only meant to inspect the skb, but the program may call
bpf_skb_change_tail() -- or the sibling bpf_skb_pull_data(),
bpf_skb_change_head(), bpf_skb_adjust_room(), all allowed for SK_SKB.
Once the head carries a frag_list these go
... -> skb_ensure_writable -> pskb_may_pull -> __pskb_pull_tail
and __pskb_pull_tail() frees the frag_list skbs that strparser still
tracks through skb_nextp:
while ((list = skb_shinfo(skb)->frag_list) != insp) {
skb_shinfo(skb)->frag_list = list->next;
consume_skb(list);
}
strp->skb_nextp now points into a freed sk_buff. The next segment of
the same message arrives in __strp_recv(), which links it with
*strp->skb_nextp = skb, an 8-byte write into the freed skb. The free
and the write happen in different __strp_recv() calls, so the message
has to span at least three segments before it triggers.
BUG: KASAN: slab-use-after-free in __strp_recv+0x447/0xda0
Write of size 8 at addr ffff88810db86140 by task repro/349
Call Trace:
<IRQ>
__strp_recv+0x447/0xda0
__tcp_read_sock+0x13d/0x590
tcp_bpf_strp_read_sock+0x195/0x320
strp_data_ready+0x267/0x340
sk_psock_strp_data_ready+0x1ce/0x350
tcp_data_queue+0x1364/0x2fd0
tcp_rcv_established+0xe07/0x1640
[...]
Allocated by task 349:
skb_clone+0x17b/0x210
__strp_recv+0x2c3/0xda0
__tcp_read_sock+0x13d/0x590
[...]
Freed by task 349:
kmem_cache_free+0x150/0x570
__pskb_pull_tail+0x57b/0xc20
skb_ensure_writable+0x236/0x260
__bpf_skb_change_tail+0x1d4/0x590
sk_skb_change_tail+0x2a/0x40
bpf_prog_1b285dcd6c41373e+0x27/0x30
bpf_prog_run_pin_on_cpu+0xf3/0x260
sk_psock_strp_parse+0x118/0x1e0
__strp_recv+0x4f6/0xda0
[...]
The same resize also leaves the head's length inconsistent with its
frags, so a later __pskb_pull_tail() can instead hit the
BUG_ON(skb_copy_bits(...)) in net/core/skbuff.c.
A stream parser is only meant to measure the next message, not to modify
the packet. Reject a parser whose program can change packet data
(prog->aux->changes_pkt_data) at attach time. The check is shared by
sock_map_prog_update() and sock_map_link_update_prog(), which between them
cover prog attach, link create and link update. Verdict programs are
unaffected and may still modify the skb.
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
net/core/sock_map.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 99e3789492a0..c60ba6d292f9 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -1515,6 +1515,17 @@ static int sock_map_prog_link_lookup(struct bpf_map *map, struct bpf_prog ***ppr
return 0;
}
+static int sock_map_prog_attach_check(enum bpf_attach_type attach_type,
+ struct bpf_prog *prog)
+{
+ /* A stream parser must not modify the skb, only measure it. */
+ if (prog && attach_type == BPF_SK_SKB_STREAM_PARSER &&
+ prog->aux->changes_pkt_data)
+ return -EINVAL;
+
+ return 0;
+}
+
/* Handle the following four cases:
* prog_attach: prog != NULL, old == NULL, link == NULL
* prog_detach: prog == NULL, old != NULL, link == NULL
@@ -1533,6 +1544,10 @@ static int sock_map_prog_update(struct bpf_map *map, struct bpf_prog *prog,
if (ret)
return ret;
+ ret = sock_map_prog_attach_check(which, prog);
+ if (ret)
+ return ret;
+
/* for prog_attach/prog_detach/link_attach, return error if a bpf_link
* exists for that prog.
*/
@@ -1776,6 +1791,11 @@ static int sock_map_link_update_prog(struct bpf_link *link,
ret = -EINVAL;
goto out;
}
+
+ ret = sock_map_prog_attach_check(link->attach_type, prog);
+ if (ret)
+ goto out;
+
if (!sockmap_link->map) {
ret = -ENOLINK;
goto out;
--
2.43.0
^ permalink raw reply related
* [PATCH bpf v4 1/3] selftests/bpf: don't modify the skb in the strparser parser prog
From: Sechang Lim @ 2026-06-19 6:29 UTC (permalink / raw)
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Eduard Zingerman, Kumar Kartikeya Dwivedi, John Fastabend,
Jakub Sitnicki, David S . Miller, Jakub Kicinski, Eric Dumazet,
Paolo Abeni, Kuniyuki Iwashima, Willem de Bruijn, Shuah Khan
Cc: Jiri Olsa, Martin KaFai Lau, Song Liu, Yonghong Song,
Simon Horman, Bobby Eshleman, Jiayuan Chen, bpf, netdev,
linux-kernel, linux-kselftest
In-Reply-To: <20260619062959.3277612-1-rhkrqnwk98@gmail.com>
sockmap_parse_prog.c is attached as an SK_SKB stream parser and modifies
the skb. It calls bpf_skb_pull_data() and writes a byte into the packet.
A stream parser runs on strparser's message head and must not modify it.
A resize frees the frag_list segments strparser still tracks, leading to
a use-after-free.
Make the parser read-only. It only needs to return the message length,
which keeps it attaching once packet-modifying parsers are rejected.
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
.../selftests/bpf/progs/sockmap_parse_prog.c | 22 -------------------
1 file changed, 22 deletions(-)
diff --git a/tools/testing/selftests/bpf/progs/sockmap_parse_prog.c b/tools/testing/selftests/bpf/progs/sockmap_parse_prog.c
index c9abfe3a11af..56e9aebf05f2 100644
--- a/tools/testing/selftests/bpf/progs/sockmap_parse_prog.c
+++ b/tools/testing/selftests/bpf/progs/sockmap_parse_prog.c
@@ -5,28 +5,6 @@
SEC("sk_skb1")
int bpf_prog1(struct __sk_buff *skb)
{
- void *data_end = (void *)(long) skb->data_end;
- void *data = (void *)(long) skb->data;
- __u8 *d = data;
- int err;
-
- if (data + 10 > data_end) {
- err = bpf_skb_pull_data(skb, 10);
- if (err)
- return SK_DROP;
-
- data_end = (void *)(long)skb->data_end;
- data = (void *)(long)skb->data;
- if (data + 10 > data_end)
- return SK_DROP;
- }
-
- /* This write/read is a bit pointless but tests the verifier and
- * strparser handler for read/write pkt data and access into sk
- * fields.
- */
- d = data;
- d[7] = 1;
return skb->len;
}
--
2.43.0
^ permalink raw reply related
* [PATCH bpf v4 0/3] bpf, sockmap: reject a packet-modifying SK_SKB stream parser
From: Sechang Lim @ 2026-06-19 6:29 UTC (permalink / raw)
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Eduard Zingerman, Kumar Kartikeya Dwivedi, John Fastabend,
Jakub Sitnicki, David S . Miller, Jakub Kicinski, Eric Dumazet,
Paolo Abeni, Kuniyuki Iwashima, Willem de Bruijn, Shuah Khan
Cc: Jiri Olsa, Martin KaFai Lau, Song Liu, Yonghong Song,
Simon Horman, Bobby Eshleman, Jiayuan Chen, bpf, netdev,
linux-kernel, linux-kselftest
A BPF_PROG_TYPE_SK_SKB stream parser runs on strparser's message head,
which can chain skbs through frag_list. A parser that resizes the skb
frees the frag_list segments that strparser still tracks through
skb_nextp, leading to a use-after-free.
A stream parser is only meant to measure the next message, not to modify
the packet, so reject a packet-modifying parser at attach time.
v4:
- drop the Fixes tag (Jiayuan Chen)
- drop the unsafe skb modification from the test prog (John Fastabend)
v3:
- https://lore.kernel.org/all/20260618102718.2331468-1-rhkrqnwk98@gmail.com/
v2:
- https://lore.kernel.org/all/20260612123553.2724240-1-rhkrqnwk98@gmail.com/
v1:
- https://lore.kernel.org/all/20260609112316.3685738-1-rhkrqnwk98@gmail.com/
Sechang Lim (3):
selftests/bpf: don't modify the skb in the strparser parser prog
bpf, sockmap: reject a packet-modifying SK_SKB stream parser
selftests/bpf: test rejection of a packet-modifying SK_SKB stream
parser
net/core/sock_map.c | 20 ++++++++++++
.../selftests/bpf/prog_tests/sockmap_strp.c | 31 +++++++++++++++++++
.../selftests/bpf/progs/sockmap_parse_prog.c | 22 -------------
.../selftests/bpf/progs/test_sockmap_strp.c | 7 +++++
4 files changed, 58 insertions(+), 22 deletions(-)
--
2.43.0
^ permalink raw reply
* Re: [PATCH net] ipv6: ndisc: fix NULL deref in accept_untracked_na()
From: Jiayuan Chen @ 2026-06-19 6:24 UTC (permalink / raw)
To: Weiming Shi, David S . Miller, David Ahern, Eric Dumazet,
Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, netdev, linux-kernel, Xiang Mei
In-Reply-To: <20260617065512.2529757-2-bestswngs@gmail.com>
On 6/17/26 2:55 PM, Weiming Shi wrote:
> accept_untracked_na() re-fetches the inet6_dev with __in6_dev_get(dev)
> and dereferences idev->cnf.accept_untracked_na without a NULL check,
> even though its only caller ndisc_recv_na() already fetched and
> NULL-checked idev for the same device.
>
> Both reads of dev->ip6_ptr run in the same RCU read-side critical
> section, but a concurrent addrconf_ifdown() can clear dev->ip6_ptr
> between them: lowering the MTU below IPV6_MIN_MTU calls addrconf_ifdown()
> without the synchronize_net() that orders the unregister path, so the
> re-fetch returns NULL and oopses:
>
> BUG: KASAN: null-ptr-deref in ndisc_recv_na (net/ipv6/ndisc.c:974)
> Read of size 4 at addr 0000000000000364
> Call Trace:
> <IRQ>
> ndisc_recv_na (net/ipv6/ndisc.c:974)
> icmpv6_rcv (net/ipv6/icmp.c:1193)
> ip6_protocol_deliver_rcu (net/ipv6/ip6_input.c:479)
> ip6_input_finish (net/ipv6/ip6_input.c:534)
> ip6_input (net/ipv6/ip6_input.c:545)
> ip6_mc_input (net/ipv6/ip6_input.c:635)
> ipv6_rcv (net/ipv6/ip6_input.c:351)
> </IRQ>
>
> It is reachable by an unprivileged user via a network namespace.
>
> Pass the caller's already validated idev instead of re-fetching it; the
> idev stays alive for the whole RCU critical section, so it is safe even
> after dev->ip6_ptr has been cleared.
>
> Fixes: aaa5f515b16b ("net: ipv6: new accept_untracked_na option to accept na only if in-network")
> Assisted-by: Claude:claude-opus-4-8
> Reported-by: Xiang Mei <xmei5@asu.edu>
> Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
^ permalink raw reply
* [PATCH net v6 4/4] ice: skip unnecessary VF reset when setting trust
From: Jose Ignacio Tornos Martinez @ 2026-06-19 6:13 UTC (permalink / raw)
To: netdev
Cc: intel-wired-lan, przemyslaw.kitszel, aleksandr.loktionov,
jacob.e.keller, horms, jesse.brandeburg, anthony.l.nguyen, davem,
edumazet, kuba, pabeni, Jose Ignacio Tornos Martinez
In-Reply-To: <20260619061321.8554-1-jtornosm@redhat.com>
Similar to the i40e fix, ice_set_vf_trust() unconditionally calls
ice_reset_vf() when the trust setting changes. While the delay is smaller
than i40e, this reset is still unnecessary in most cases.
When granting trust, no reset is needed - we can just set the capability
flag to allow privileged operations.
When revoking trust, we only need to reset (conservative approach) if
the VF has actually configured advanced features that require cleanup
(MAC LLDP filters, promiscuous mode). For VFs in a clean state, we can
safely change the trust setting without the disruptive reset.
When we do reset, we maintain the original ice pattern that has been
reliable in production: cleanup LLDP filters first, then set vf->trusted,
then reset. This ensures the privilege capability bit is handled correctly
during reset rebuild.
When we don't reset, we manually handle the capability flag via helper
function, eliminating the delay.
Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>
---
v6: AI review identified issues with v5's reset-before-cleanup approach. Revert
to original reset procedure (cleanup before reset) which has proven reliable,
just adding the conditional check to skip reset when VF has no advanced
features configured.
v5: https://lore.kernel.org/all/20260429102426.210750-5-jtornosm@redhat.com/
drivers/net/ethernet/intel/ice/ice_sriov.c | 33 +++++++++++++++++++---
1 file changed, 29 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_sriov.c b/drivers/net/ethernet/intel/ice/ice_sriov.c
index 7e00e091756d..XXXXXXXXXXXXXXXX 100644
--- a/drivers/net/ethernet/intel/ice/ice_sriov.c
+++ b/drivers/net/ethernet/intel/ice/ice_sriov.c
@@ -1364,6 +1364,23 @@ int ice_set_vf_mac(struct net_device *netdev, int vf_id, u8 *mac)
return __ice_set_vf_mac(ice_netdev_to_pf(netdev), vf_id, mac);
}
+/**
+ * ice_setup_vf_trust - Enable/disable VF trust mode without reset
+ * @vf: VF to configure
+ * @setting: trust setting
+ *
+ * Update VF flags when changing trust without performing a VF reset.
+ * This is only called when it's safe to skip the reset (VF has no advanced
+ * features configured that need cleanup).
+ */
+static void ice_setup_vf_trust(struct ice_vf *vf, bool setting)
+{
+ if (setting)
+ set_bit(ICE_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps);
+ else
+ clear_bit(ICE_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps);
+}
+
/**
* ice_set_vf_trust
* @netdev: network interface device structure
@@ -1399,11 +1416,19 @@ int ice_set_vf_trust(struct net_device *netdev, int vf_id, bool trusted)
mutex_lock(&vf->cfg_lock);
- while (!trusted && vf->num_mac_lldp)
- ice_vf_update_mac_lldp_num(vf, ice_get_vf_vsi(vf), false);
-
+ /* Reset only if revoking trust and VF has advanced features configured */
+ if (!trusted &&
+ (vf->num_mac_lldp > 0 ||
+ test_bit(ICE_VF_STATE_UC_PROMISC, vf->vf_states) ||
+ test_bit(ICE_VF_STATE_MC_PROMISC, vf->vf_states))) {
+ while (vf->num_mac_lldp)
+ ice_vf_update_mac_lldp_num(vf, ice_get_vf_vsi(vf), false);
+ vf->trusted = trusted;
+ ice_reset_vf(vf, ICE_VF_RESET_NOTIFY);
+ } else {
+ vf->trusted = trusted;
+ ice_setup_vf_trust(vf, trusted);
+ }
- vf->trusted = trusted;
- ice_reset_vf(vf, ICE_VF_RESET_NOTIFY);
dev_info(ice_pf_to_dev(pf), "VF %u is now %strusted\n",
vf_id, trusted ? "" : "un");
--
2.43.0
^ permalink raw reply
* [PATCH net v6 3/4] iavf: send MAC change request synchronously
From: Jose Ignacio Tornos Martinez @ 2026-06-19 6:13 UTC (permalink / raw)
To: netdev
Cc: intel-wired-lan, przemyslaw.kitszel, aleksandr.loktionov,
jacob.e.keller, horms, jesse.brandeburg, anthony.l.nguyen, davem,
edumazet, kuba, pabeni, Jose Ignacio Tornos Martinez, stable
In-Reply-To: <20260619061321.8554-1-jtornosm@redhat.com>
After commit ad7c7b2172c3 ("net: hold netdev instance lock during sysfs
operations"), iavf_set_mac() is called with the netdev instance lock
already held.
The function queues a MAC address change request via
iavf_replace_primary_mac() and then waits for completion. However, in
the current flow, the actual virtchnl message is sent by the watchdog
task, which also needs to acquire the netdev lock to run. Additionally,
the adminq_task which processes virtchnl responses also needs the netdev
lock.
This creates a deadlock scenario:
1. iavf_set_mac() holds netdev lock and waits for MAC change
2. Watchdog needs netdev lock to send the request -> blocked
3. Even if request is sent, adminq_task needs netdev lock to process
PF response -> blocked
4. MAC change times out after 2.5 seconds
5. iavf_set_mac() returns -EAGAIN
This particularly affects VFs during bonding setup when multiple VFs are
enslaved in quick succession.
Fix by implementing a synchronous MAC change operation similar to the
approach used in commit fdadbf6e84c4 ("iavf: fix incorrect reset handling
in callbacks").
The solution:
1. Send the virtchnl ADD_ETH_ADDR message directly (not via watchdog)
2. Poll the admin queue hardware directly for responses
3. Process all received messages (including non-MAC messages)
4. Return when MAC change completes or times out
A new generic function iavf_poll_virtchnl_response() is introduced that
can be reused for any future synchronous virtchnl operations. It takes a
callback to check completion, allowing flexible condition checking.
This allows the operation to complete synchronously while holding
netdev_lock, without relying on watchdog or adminq_task. The function
can sleep for up to 2.5 seconds polling hardware, but this is acceptable
since netdev_lock is per-device and only serializes operations on the
same interface.
To support this, change iavf_add_ether_addrs() to return an error code
instead of void, allowing callers to detect failures. Additionally,
export iavf_mac_add_reject() to enable proper rollback on local failures
(timeouts, send errors) - PF rejections are already handled automatically
by iavf_virtchnl_completion().
Remove vc_waitqueue entirely because iavf_set_mac was the only waiter on
this waitqueue and after the changes it is not needed.
Fixes: ad7c7b2172c3 ("net: hold netdev instance lock during sysfs operations")
cc: stable@vger.kernel.org
Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>
---
v6: Address edge cases found by AI review (Jakub Kicinski):
Although unlikely in practice, v6 adds robustness for corner cases:
- Allocation failure after message sent: allocate event buffer BEFORE
sending to PF (theoretical - allocation rarely fails for small buffers)
- Multi-batch scenario: add loop to send all batches when >200 MACs pending
(rare - most configurations have far fewer MACs)
- Timeout rollback: only rollback on send failure (ret != -EAGAIN), not on
timeout where PF response handler will sync state (transient inconsistency
during timeout is acceptable and will be resolved by response)
v5: https://lore.kernel.org/all/20260429102426.210750-4-jtornosm@redhat.com/
drivers/net/ethernet/intel/iavf/iavf.h | 11 ++-
drivers/net/ethernet/intel/iavf/iavf_main.c | 91 +++++++++++++----
.../net/ethernet/intel/iavf/iavf_virtchnl.c | 99 +++++++++++++++++--
3 files changed, 171 insertions(+), 30 deletions(-)
diff --git a/drivers/net/ethernet/intel/iavf/iavf.h b/drivers/net/ethernet/intel/iavf/iavf.h
index e9fb0a0919e3..c154fe7c8ce9 100644
--- a/drivers/net/ethernet/intel/iavf/iavf.h
+++ b/drivers/net/ethernet/intel/iavf/iavf.h
@@ -260,7 +260,6 @@ struct iavf_adapter {
struct work_struct adminq_task;
struct work_struct finish_config;
wait_queue_head_t down_waitqueue;
- wait_queue_head_t vc_waitqueue;
struct iavf_q_vector *q_vectors;
struct list_head vlan_filter_list;
int num_vlan_filters;
@@ -589,8 +588,9 @@ void iavf_configure_queues(struct iavf_adapter *adapter);
void iavf_enable_queues(struct iavf_adapter *adapter);
void iavf_disable_queues(struct iavf_adapter *adapter);
void iavf_map_queues(struct iavf_adapter *adapter);
-void iavf_add_ether_addrs(struct iavf_adapter *adapter);
+int iavf_add_ether_addrs(struct iavf_adapter *adapter);
void iavf_del_ether_addrs(struct iavf_adapter *adapter);
+void iavf_mac_add_reject(struct iavf_adapter *adapter);
void iavf_add_vlans(struct iavf_adapter *adapter);
void iavf_del_vlans(struct iavf_adapter *adapter);
void iavf_set_promiscuous(struct iavf_adapter *adapter);
@@ -607,6 +607,13 @@ void iavf_disable_vlan_stripping(struct iavf_adapter *adapter);
void iavf_virtchnl_completion(struct iavf_adapter *adapter,
enum virtchnl_ops v_opcode,
enum iavf_status v_retval, u8 *msg, u16 msglen);
+int iavf_poll_virtchnl_response(struct iavf_adapter *adapter,
+ struct iavf_arq_event_info *event,
+ bool (*condition)(struct iavf_adapter *adapter,
+ const void *data,
+ enum virtchnl_ops v_op),
+ const void *cond_data,
+ unsigned int timeout_ms);
int iavf_config_rss(struct iavf_adapter *adapter);
void iavf_cfg_queues_bw(struct iavf_adapter *adapter);
void iavf_cfg_queues_quanta_size(struct iavf_adapter *adapter);
diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
index 67aa14350b1b..ce8466ff8f55 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_main.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
@@ -1047,6 +1047,66 @@ static bool iavf_is_mac_set_handled(struct net_device *netdev,
return ret;
}
+/**
+ * iavf_mac_change_done - Check if MAC change completed
+ * @adapter: board private structure
+ * @data: MAC address being checked (as const void *)
+ * @v_op: virtchnl opcode from processed message
+ *
+ * Callback for iavf_poll_virtchnl_response() to check if MAC change completed.
+ *
+ * Return: true if MAC change completed, false otherwise
+ */
+static bool iavf_mac_change_done(struct iavf_adapter *adapter,
+ const void *data, enum virtchnl_ops v_op)
+{
+ const u8 *addr = data;
+
+ return iavf_is_mac_set_handled(adapter->netdev, addr);
+}
+
+/**
+ * iavf_set_mac_sync - Synchronously change MAC address
+ * @adapter: board private structure
+ * @addr: MAC address to set
+ *
+ * Send MAC change request to PF and poll admin queue for response.
+ * Caller must hold netdev_lock. This can sleep for up to 2.5 seconds.
+ * Event buffer is allocated before sending to avoid state mismatch if
+ * allocation fails after message is sent to PF.
+ *
+ * If the number of pending MAC filters exceeds what fits in a single message,
+ * this function sends all batches before polling for response to ensure the
+ * new primary MAC is actually transmitted.
+ *
+ * Return: 0 on success, negative on failure
+ */
+static int iavf_set_mac_sync(struct iavf_adapter *adapter, const u8 *addr)
+{
+ struct iavf_arq_event_info event;
+ int ret;
+
+ netdev_assert_locked(adapter->netdev);
+
+ event.buf_len = IAVF_MAX_AQ_BUF_SIZE;
+ event.msg_buf = kzalloc(event.buf_len, GFP_KERNEL);
+ if (!event.msg_buf)
+ return -ENOMEM;
+
+ while (adapter->aq_required & IAVF_FLAG_AQ_ADD_MAC_FILTER) {
+ ret = iavf_add_ether_addrs(adapter);
+ if (ret)
+ goto out;
+ }
+
+ ret = iavf_poll_virtchnl_response(adapter, &event,
+ iavf_mac_change_done, addr, 2500);
+
+out:
+ kfree(event.msg_buf);
+ return ret;
+}
+
/**
* iavf_set_mac - NDO callback to set port MAC address
* @netdev: network interface device structure
@@ -1067,25 +1127,23 @@ static int iavf_set_mac(struct net_device *netdev, void *p)
return -EADDRNOTAVAIL;
ret = iavf_replace_primary_mac(adapter, addr->sa_data);
-
if (ret)
return ret;
- ret = wait_event_interruptible_timeout(adapter->vc_waitqueue,
- iavf_is_mac_set_handled(netdev, addr->sa_data),
- msecs_to_jiffies(2500));
-
- /* If ret < 0 then it means wait was interrupted.
- * If ret == 0 then it means we got a timeout.
- * else it means we got response for set MAC from PF,
- * check if netdev MAC was updated to requested MAC,
- * if yes then set MAC succeeded otherwise it failed return -EACCES
- */
- if (ret < 0)
+ ret = iavf_set_mac_sync(adapter, addr->sa_data);
+ if (ret) {
+ /* Rollback only if send failed (message never reached PF).
+ * Don't rollback on timeout (-EAGAIN) because the message was
+ * sent and PF will eventually respond. When the response arrives,
+ * iavf_virtchnl_completion() will handle rollback (on PF error)
+ * or acceptance (on PF success) automatically.
+ */
+ if (ret != -EAGAIN) {
+ iavf_mac_add_reject(adapter);
+ ether_addr_copy(adapter->hw.mac.addr, netdev->dev_addr);
+ }
return ret;
-
- if (!ret)
- return -EAGAIN;
+ }
if (!ether_addr_equal(netdev->dev_addr, addr->sa_data))
return -EACCES;
@@ -5415,9 +5473,6 @@ static int iavf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
/* Setup the wait queue for indicating transition to down status */
init_waitqueue_head(&adapter->down_waitqueue);
- /* Setup the wait queue for indicating virtchannel events */
- init_waitqueue_head(&adapter->vc_waitqueue);
-
INIT_LIST_HEAD(&adapter->ptp.aq_cmds);
init_waitqueue_head(&adapter->ptp.phc_time_waitqueue);
mutex_init(&adapter->ptp.aq_cmd_lock);
diff --git a/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c b/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
index a52c100dcbc5..ef5dd3c15a82 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c
@@ -2,6 +2,7 @@
/* Copyright(c) 2013 - 2018 Intel Corporation. */
#include <linux/net/intel/libie/rx.h>
+#include <net/netdev_lock.h>
#include "iavf.h"
#include "iavf_ptp.h"
@@ -555,20 +556,23 @@ iavf_set_mac_addr_type(struct virtchnl_ether_addr *virtchnl_ether_addr,
* @adapter: adapter structure
*
* Request that the PF add one or more addresses to our filters.
- **/
-void iavf_add_ether_addrs(struct iavf_adapter *adapter)
+ *
+ * Return: 0 on success, negative on failure
+ */
+int iavf_add_ether_addrs(struct iavf_adapter *adapter)
{
struct virtchnl_ether_addr_list *veal;
struct iavf_mac_filter *f;
int i = 0, count = 0;
bool more = false;
size_t len;
+ int ret;
if (adapter->current_op != VIRTCHNL_OP_UNKNOWN) {
/* bail because we already have a command pending */
dev_err(&adapter->pdev->dev, "Cannot add filters, command %d pending\n",
adapter->current_op);
- return;
+ return -EBUSY;
}
spin_lock_bh(&adapter->mac_vlan_list_lock);
@@ -580,7 +584,7 @@ void iavf_add_ether_addrs(struct iavf_adapter *adapter)
if (!count) {
adapter->aq_required &= ~IAVF_FLAG_AQ_ADD_MAC_FILTER;
spin_unlock_bh(&adapter->mac_vlan_list_lock);
- return;
+ return 0;
}
adapter->current_op = VIRTCHNL_OP_ADD_ETH_ADDR;
@@ -594,8 +598,9 @@ void iavf_add_ether_addrs(struct iavf_adapter *adapter)
veal = kzalloc(len, GFP_ATOMIC);
if (!veal) {
+ adapter->current_op = VIRTCHNL_OP_UNKNOWN;
spin_unlock_bh(&adapter->mac_vlan_list_lock);
- return;
+ return -ENOMEM;
}
veal->vsi_id = adapter->vsi_res->vsi_id;
@@ -615,8 +620,15 @@ void iavf_add_ether_addrs(struct iavf_adapter *adapter)
spin_unlock_bh(&adapter->mac_vlan_list_lock);
- iavf_send_pf_msg(adapter, VIRTCHNL_OP_ADD_ETH_ADDR, (u8 *)veal, len);
+ ret = iavf_send_pf_msg(adapter, VIRTCHNL_OP_ADD_ETH_ADDR, (u8 *)veal, len);
kfree(veal);
+ if (ret) {
+ dev_err(&adapter->pdev->dev,
+ "Unable to send ADD_ETH_ADDR message to PF, error %d\n", ret);
+ adapter->current_op = VIRTCHNL_OP_UNKNOWN;
+ }
+
+ return ret;
}
/**
@@ -712,8 +724,8 @@ static void iavf_mac_add_ok(struct iavf_adapter *adapter)
* @adapter: adapter structure
*
* Remove filters from list based on PF response.
- **/
-static void iavf_mac_add_reject(struct iavf_adapter *adapter)
+ */
+void iavf_mac_add_reject(struct iavf_adapter *adapter)
{
struct net_device *netdev = adapter->netdev;
struct iavf_mac_filter *f, *ftmp;
@@ -2389,7 +2401,6 @@ void iavf_virtchnl_completion(struct iavf_adapter *adapter,
iavf_mac_add_reject(adapter);
/* restore administratively set MAC address */
ether_addr_copy(adapter->hw.mac.addr, netdev->dev_addr);
- wake_up(&adapter->vc_waitqueue);
break;
case VIRTCHNL_OP_DEL_VLAN:
dev_err(&adapter->pdev->dev, "Failed to delete VLAN filter, error %s\n",
@@ -2586,7 +2597,6 @@ void iavf_virtchnl_completion(struct iavf_adapter *adapter,
eth_hw_addr_set(netdev, adapter->hw.mac.addr);
netif_addr_unlock_bh(netdev);
}
- wake_up(&adapter->vc_waitqueue);
break;
case VIRTCHNL_OP_GET_STATS: {
struct iavf_eth_stats *stats =
@@ -2956,3 +2966,72 @@ void iavf_virtchnl_completion(struct iavf_adapter *adapter,
} /* switch v_opcode */
adapter->current_op = VIRTCHNL_OP_UNKNOWN;
}
+
+/**
+ * iavf_poll_virtchnl_response - Poll admin queue for virtchnl response
+ * @adapter: adapter structure
+ * @event: pre-allocated event buffer to use for polling
+ * @condition: callback to check if desired response received
+ * @cond_data: context data passed to condition callback
+ * @timeout_ms: maximum time to wait in milliseconds
+ *
+ * Polls the admin queue and processes all incoming virtchnl messages.
+ * After processing each valid message, calls the condition callback to check
+ * if the expected response has been received. The callback receives the opcode
+ * of the processed message to identify which response was received. Continues
+ * polling until the callback returns true or timeout expires.
+ *
+ * Caller must allocate event buffer before sending any messages to PF to avoid
+ * state mismatch if allocation fails after message is sent.
+ *
+ * Caller must hold netdev_lock. This can sleep for up to timeout_ms while
+ * polling hardware.
+ *
+ * Return: 0 on success (condition met), -EAGAIN on timeout, or error code
+ */
+int iavf_poll_virtchnl_response(struct iavf_adapter *adapter,
+ struct iavf_arq_event_info *event,
+ bool (*condition)(struct iavf_adapter *adapter,
+ const void *data,
+ enum virtchnl_ops v_op),
+ const void *cond_data,
+ unsigned int timeout_ms)
+{
+ struct iavf_hw *hw = &adapter->hw;
+ enum virtchnl_ops received_op;
+ unsigned long timeout;
+ int ret = -EAGAIN;
+ u16 pending = 0;
+ u32 v_retval;
+
+ netdev_assert_locked(adapter->netdev);
+
+ timeout = jiffies + msecs_to_jiffies(timeout_ms);
+ do {
+ if (!pending)
+ usleep_range(50, 75);
+
+ if (iavf_clean_arq_element(hw, event, &pending) == IAVF_SUCCESS) {
+ received_op = (enum virtchnl_ops)le32_to_cpu(event->desc.cookie_high);
+ if (received_op != VIRTCHNL_OP_UNKNOWN) {
+ v_retval = le32_to_cpu(event->desc.cookie_low);
+
+ iavf_virtchnl_completion(adapter, received_op,
+ (enum iavf_status)v_retval,
+ event->msg_buf, event->msg_len);
+
+ if (condition(adapter, cond_data, received_op)) {
+ ret = 0;
+ break;
+ }
+ }
+
+ memset(event->msg_buf, 0, IAVF_MAX_AQ_BUF_SIZE);
+
+ if (pending)
+ continue;
+ }
+ } while (time_before(jiffies, timeout));
+
+ return ret;
+}
--
2.54.0
^ permalink raw reply related
* [PATCH net v6 2/4] i40e: skip unnecessary VF reset when setting trust
From: Jose Ignacio Tornos Martinez @ 2026-06-19 6:13 UTC (permalink / raw)
To: netdev
Cc: intel-wired-lan, przemyslaw.kitszel, aleksandr.loktionov,
jacob.e.keller, horms, jesse.brandeburg, anthony.l.nguyen, davem,
edumazet, kuba, pabeni, Jose Ignacio Tornos Martinez,
Rafal Romanowski
In-Reply-To: <20260619061321.8554-1-jtornosm@redhat.com>
The current implementation triggers a VF reset when changing the trust
setting, causing a ~10 second delay during bonding setup.
In all the cases, the reset causes a ~10 second delay during which:
- VF must reinitialize completely
- Any in-progress operations (like bonding enslave) fail with timeouts
- VF is unavailable
When granting trust, no reset is needed - we can just set the capability
flag to allow privileged operations.
When revoking trust, we only need to reset (conservative approach) if
the VF has actually configured advanced features that require cleanup
(ADQ/cloud filters, promiscuous mode). For VFs in a clean state, we can
safely change the trust setting without the disruptive reset.
When we don't reset, we manually handle capability flag via helper
function, eliminating the delay.
Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
---
v6: No changes from v5. AI review comments covered design decisions
(over-limit filter handling, synchronization model) and pre-existing
issues, but found no bugs in the new code.
v5: https://lore.kernel.org/all/20260429102426.210750-3-jtornosm@redhat.com/
.../ethernet/intel/i40e/i40e_virtchnl_pf.c | 38 ++++++++++++++-----
1 file changed, 28 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index a26c3d47ec15..0cc434b26eb8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -4943,6 +4943,23 @@ int i40e_ndo_set_vf_spoofchk(struct net_device *netdev, int vf_id, bool enable)
return ret;
}
+/**
+ * i40e_setup_vf_trust - Enable/disable VF trust mode without reset
+ * @vf: VF to configure
+ * @setting: trust setting
+ *
+ * Update VF flags when changing trust without performing a VF reset.
+ * This is only called when it's safe to skip the reset (VF has no advanced
+ * features configured that need cleanup).
+ */
+static void i40e_setup_vf_trust(struct i40e_vf *vf, bool setting)
+{
+ if (setting)
+ set_bit(I40E_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps);
+ else
+ clear_bit(I40E_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps);
+}
+
/**
* i40e_ndo_set_vf_trust
* @netdev: network interface device structure of the pf
@@ -4987,19 +5004,20 @@ int i40e_ndo_set_vf_trust(struct net_device *netdev, int vf_id, bool setting)
set_bit(__I40E_MACVLAN_SYNC_PENDING, pf->state);
pf->vsi[vf->lan_vsi_idx]->flags |= I40E_VSI_FLAG_FILTER_CHANGED;
- i40e_vc_reset_vf(vf, true);
+ /* Reset only if revoking trust and VF has advanced features configured */
+ if (!setting &&
+ (vf->adq_enabled || vf->num_cloud_filters > 0 ||
+ test_bit(I40E_VF_STATE_UC_PROMISC, &vf->vf_states) ||
+ test_bit(I40E_VF_STATE_MC_PROMISC, &vf->vf_states))) {
+ i40e_vc_reset_vf(vf, true);
+ i40e_del_all_cloud_filters(vf);
+ } else {
+ i40e_setup_vf_trust(vf, setting);
+ }
+
dev_info(&pf->pdev->dev, "VF %u is now %strusted\n",
vf_id, setting ? "" : "un");
- if (vf->adq_enabled) {
- if (!vf->trusted) {
- dev_info(&pf->pdev->dev,
- "VF %u no longer Trusted, deleting all cloud filters\n",
- vf_id);
- i40e_del_all_cloud_filters(vf);
- }
- }
-
out:
clear_bit(__I40E_VIRTCHNL_OP_PENDING, pf->state);
return ret;
--
2.53.0
^ permalink raw reply related
* [PATCH net v6 1/4] iavf: return EBUSY if reset in progress or not ready during MAC change
From: Jose Ignacio Tornos Martinez @ 2026-06-19 6:13 UTC (permalink / raw)
To: netdev
Cc: intel-wired-lan, przemyslaw.kitszel, aleksandr.loktionov,
jacob.e.keller, horms, jesse.brandeburg, anthony.l.nguyen, davem,
edumazet, kuba, pabeni, Jose Ignacio Tornos Martinez,
Rafal Romanowski
In-Reply-To: <20260619061321.8554-1-jtornosm@redhat.com>
When a MAC address change is requested while the VF is resetting or still
initializing, return -EBUSY immediately instead of attempting the
operation.
Additionally, during early initialization states (before __IAVF_DOWN),
the PF may be slow to respond to MAC change requests, causing long
delays. Only allow MAC changes once the VF reaches __IAVF_DOWN state or
later, when the watchdog is running and the VF is ready for operations.
After commit ad7c7b2172c3 ("net: hold netdev instance lock
during sysfs operations"), MAC changes are called with the netdev lock
held, so we should not wait with the lock held during reset or
initialization. This allows the caller to retry or handle the busy state
appropriately without blocking other operations.
Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
---
drivers/net/ethernet/intel/iavf/iavf_main.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
index dad001abc908..67aa14350b1b 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_main.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
@@ -1060,6 +1060,9 @@ static int iavf_set_mac(struct net_device *netdev, void *p)
struct sockaddr *addr = p;
int ret;
+ if (iavf_is_reset_in_progress(adapter) || adapter->state < __IAVF_DOWN)
+ return -EBUSY;
+
if (!is_valid_ether_addr(addr->sa_data))
return -EADDRNOTAVAIL;
--
2.53.0
^ permalink raw reply related
* [PATCH net v6 0/4] Fix i40e/ice/iavf VF bonding after netdev lock changes
From: Jose Ignacio Tornos Martinez @ 2026-06-19 6:13 UTC (permalink / raw)
To: netdev
Cc: intel-wired-lan, przemyslaw.kitszel, aleksandr.loktionov,
jacob.e.keller, horms, jesse.brandeburg, anthony.l.nguyen, davem,
edumazet, kuba, pabeni, Jose Ignacio Tornos Martinez
This series fixes VF bonding failures introduced by commit ad7c7b2172c3
("net: hold netdev instance lock during sysfs operations").
When adding VFs to a bond immediately after setting trust mode, MAC
address changes fail with -EAGAIN, preventing bonding setup. This
affects both i40e (700-series) and ice (800-series) Intel NICs.
The core issue is lock contention: iavf_set_mac() is now called with the
netdev lock held and waits for MAC change completion while holding it.
However, both the watchdog task that sends the request and the adminq_task
that processes PF responses also need this lock, creating a deadlock where
neither can run, causing timeouts.
Additionally, setting VF trust triggers an unnecessary ~10 second VF reset
in i40e driver that delays bonding setup, even though filter
synchronization happens naturally during normal VF operation. For ice
driver, the delay is not so big, but in the same way the operation is not
necessary.
This series:
1. Adds safety guard to prevent MAC changes during reset or early
initialization (before VF is ready)
2. Eliminates unnecessary VF reset when setting trust in i40e (reset only
if revoking trust and VF has advanced features configured).
3. Fixes lock contention by polling admin queue synchronously
4. Eliminates unnecessary VF reset when setting trust in ice, (reset only
if revoking trust and VF has advanced features configured).
The key fix (patch 3/4) implements a synchronous MAC change operation
similar to the approach used for ndo_change_mtu deadlock fix:
https://lore.kernel.org/intel-wired-lan/20260211191855.1532226-1-poros@redhat.com/
Instead of scheduling work and waiting, it:
- Sends the virtchnl message directly (not via watchdog)
- Polls the admin queue hardware directly for responses
- Processes all messages inline (including non-MAC messages)
- Returns when complete or times out
This allows the operation to complete synchronously while holding
netdev_lock, without relying on watchdog or adminq_task.
The function can sleep for up to 2.5 seconds polling hardware, but this
is acceptable since netdev_lock is per-device and only serializes
operations on the same interface.
Testing shows VF bonding now works reliably in ~5 seconds vs 15+ seconds
before (i40e), without timeouts or errors (i40e and ice).
Tested on Intel 700-series (i40e) and 800-series (ice) dual-port NICs
with iavf driver.
Thanks to Jan Tluka <jtluka@redhat.com> and Yuying Ma <yuma@redhat.com> for
reporting the issues.
Jose Ignacio Tornos Martinez (4):
iavf: return EBUSY if reset in progress or not ready during MAC change
i40e: skip unnecessary VF reset when setting trust
iavf: send MAC change request synchronously
ice: skip unnecessary VF reset when setting trust
All patches tested successfully with bonding setup.
---
v6:
- Patch 1/4 (iavf EBUSY): No changes from v5
- Patch 2/4 (i40e trust): No code changes from v5. AI review comments covered
design decisions and pre-existing issues, no bugs found in new code.
- Patch 3/4 (iavf sync MAC): Address edge cases from AI review (Jakub Kicinski)
- Allocate event buffer before sending to avoid state mismatch if allocation
fails after message is sent to PF
- Add loop to send all batches before polling (rare multi-batch scenario)
- Conditional rollback: only rollback on send failure (ret != -EAGAIN), not on
timeout where PF will eventually respond
- Patch 4/4 (ice trust): Revert to original reset pattern based on AI review
- AI review identified issues with v5's reset-before-cleanup approach
(privilege bit not cleared, LLDP cleanup loop became dead code)
- Restore proven cleanup → set trust → reset pattern from original code
- Simpler implementation, no save/restore complexity
- Maintains same goal: skip reset when VF has no advanced features
v5: https://lore.kernel.org/all/20260429102426.210750-1-jtornosm@redhat.com/
drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 38 ++++++++++++++++++++++++++++----------
drivers/net/ethernet/intel/iavf/iavf.h | 10 ++++++++--
drivers/net/ethernet/intel/iavf/iavf_main.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------
drivers/net/ethernet/intel/iavf/iavf_virtchnl.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------
drivers/net/ethernet/intel/ice/ice_sriov.c | 33 +++++++++++++++++++++++++++++----
5 files changed, 220 insertions(+), 35 deletions(-)
--
2.43.0
^ permalink raw reply
* Re: [RFC net-next 3/4] net: dsa: motorcomm: Dynamically allocate port structures
From: Andrew Lunn @ 2026-06-19 6:06 UTC (permalink / raw)
To: David Yang
Cc: netdev, Vladimir Oltean, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, linux-kernel
In-Reply-To: <20260618202716.2166450-4-mmyangfl@gmail.com>
On Fri, Jun 19, 2026 at 04:26:31AM +0800, David Yang wrote:
> With support for LED introduced later, struct yt921x_priv will be 17k
> which is not very good for a single kmalloc(). Convert the ports array
> to a array of pointers to stop bloating the priv struct.
>
> Signed-off-by: David Yang <mmyangfl@gmail.com>
> ---
> drivers/net/dsa/motorcomm/chip.c | 95 ++++++++++++++++++++++++--------
> drivers/net/dsa/motorcomm/chip.h | 3 +-
> 2 files changed, 75 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/net/dsa/motorcomm/chip.c b/drivers/net/dsa/motorcomm/chip.c
> index 6dee25b6754a..d44f7749de02 100644
> --- a/drivers/net/dsa/motorcomm/chip.c
> +++ b/drivers/net/dsa/motorcomm/chip.c
> @@ -548,11 +548,14 @@ yt921x_mbus_ext_init(struct yt921x_priv *priv, struct device_node *mnp)
> /* Read and handle overflow of 32bit MIBs. MIB buffer must be zeroed before. */
> static int yt921x_read_mib(struct yt921x_priv *priv, int port)
> {
> - struct yt921x_port *pp = &priv->ports[port];
> + struct yt921x_port *pp = priv->ports[port];
> struct device *dev = to_device(priv);
> struct yt921x_mib *mib = &pp->mib;
> int res = 0;
>
> + if (!pp)
> + return -ENODEV;
> +
Are all these tests actually needed? If you cannot allocate the
memory, i would expect the probe to fail, so you can never get here.
Andrew
^ permalink raw reply
* Re: [PATCH net-next 2/2] net: dsa: mv88e6xxx: add support for credit based shaper
From: Andrew Lunn @ 2026-06-19 5:59 UTC (permalink / raw)
To: Luke Howard
Cc: Cedric Jehasse, cedric.jehasse, Vladimir Oltean, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
Russell King, netdev, linux-kernel, Christoph Mellauner,
Kieran Tyrrell
In-Reply-To: <A3A6C8D7-3B6D-4716-88C7-EC32DB9D087D@padl.com>
> As an aside, the 6341 driver is buggy. It assumes the chip does not
> have a dedicated ATU FID register because it has 256 databases. This
> is incorrect and breaks VLAN-aware bridging. I will post a patch
> when the merge window reopens.
You can post fixes for net any time. No need to wait for the merge
window to close.
Andrew
^ permalink raw reply
* [PATCH net] net: mana: Fall back to standard MTU when PF reports adapter_mtu of 0
From: Erni Sri Satya Vennela @ 2026-06-19 5:53 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, dipayanroy, ssengar, jacob.e.keller,
ernis, horms, gargaditya, kees, linux-hyperv, netdev,
linux-kernel, bpf
Commit d7709812e13d ("net: mana: hardening: Validate adapter_mtu from
MANA_QUERY_DEV_CONFIG") rejected any adapter_mtu value smaller than
ETH_MIN_MTU + ETH_HLEN, including 0, returning -EPROTO and failing
mana_probe().
Some older PF firmware versions still in the field report
adapter_mtu as 0 in the MANA_QUERY_DEV_CONFIG response. With the
hardening check in place, the MANA VF driver now fails to load on
those hosts, breaking networking entirely for guests.
MANA hardware always supports the standard Ethernet MTU. Treat a
reported adapter_mtu of 0 as "the PF did not advertise a value" and
fall back to ETH_FRAME_LEN, the same value used for the pre-V2
message version path. Only jumbo frames remain unavailable until
the PF reports a valid MTU.
Other small-but-nonzero bogus values are still rejected, preserving
the original protection against the unsigned-subtraction wrap that
would otherwise let ndev->max_mtu underflow to a huge value.
Fixes: d7709812e13d ("net: mana: hardening: Validate adapter_mtu from MANA_QUERY_DEV_CONFIG")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
drivers/net/ethernet/microsoft/mana/mana_bpf.c | 3 ++-
drivers/net/ethernet/microsoft/mana/mana_en.c | 16 ++++++++++++++--
2 files changed, 16 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_bpf.c b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
index b5e9bb184a1d..53308e139cbe 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_bpf.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
@@ -237,7 +237,8 @@ static int mana_xdp_set(struct net_device *ndev, struct bpf_prog *prog,
bpf_prog_put(old_prog);
if (prog)
- ndev->max_mtu = MANA_XDP_MTU_MAX;
+ ndev->max_mtu = min_t(unsigned int, MANA_XDP_MTU_MAX,
+ gc->adapter_mtu - ETH_HLEN);
else
ndev->max_mtu = gc->adapter_mtu - ETH_HLEN;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 87862b0434c7..7438ea6b3f26 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1233,12 +1233,24 @@ int mana_gd_query_device_cfg(struct gdma_context *gc, u32 proto_major_ver,
*max_num_vports = resp.max_num_vports;
if (resp.hdr.response.msg_version >= GDMA_MESSAGE_V2) {
- if (resp.adapter_mtu < ETH_MIN_MTU + ETH_HLEN) {
+ if (resp.adapter_mtu == 0) {
+ /*
+ * Some older PF firmware versions report an
+ * adapter_mtu of 0. MANA hardware always supports the
+ * standard Ethernet MTU, so fall back to ETH_FRAME_LEN.
+ * Jumbo frames will not be available in this case.
+ */
+ dev_info(dev,
+ "PF reported adapter_mtu of 0, falling back to %u (jumbo frames disabled)\n",
+ ETH_FRAME_LEN);
+ gc->adapter_mtu = ETH_FRAME_LEN;
+ } else if (resp.adapter_mtu < ETH_MIN_MTU + ETH_HLEN) {
dev_err(dev, "Adapter MTU too small: %u\n",
resp.adapter_mtu);
return -EPROTO;
+ } else {
+ gc->adapter_mtu = resp.adapter_mtu;
}
- gc->adapter_mtu = resp.adapter_mtu;
} else {
gc->adapter_mtu = ETH_FRAME_LEN;
}
--
2.34.1
^ permalink raw reply related
* [PATCH net v2] net/smc: avoid recursive sk_callback_lock in listen data_ready
From: Runyu Xiao @ 2026-06-19 5:48 UTC (permalink / raw)
To: D. Wythe, Dust Li, Sidraya Jayagond, Wenjia Zhang,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Mahanta Jambigi, Tony Lu, Wen Gu, Simon Horman, Karsten Graul,
linux-rdma, linux-s390, netdev, linux-kernel, jianhao.xu,
runyu.xiao, stable
In-Reply-To: <20260617152855.1039151-1-runyu.xiao@seu.edu.cn>
smc_listen() installs smc_clcsock_data_ready() as the underlying TCP
listen socket's sk_data_ready callback. The callback takes
sk_callback_lock before looking up the SMC listener and queuing
smc_tcp_listen_work().
This can recurse when the underlying TCP listen socket is being closed.
The close/flush path may invoke the installed sk_data_ready callback with
sk_callback_lock already held, so smc_clcsock_data_ready() tries to take
the same rwlock again in the same thread.
This issue was found by our static analysis tool and then manually
reviewed against the current tree. The reproducer keeps the SMC listen
callback installation path:
smc_listen()
smc_clcsock_replace_cb()
sk_data_ready = smc_clcsock_data_ready()
It then models the close/flush carrier that invokes the installed
sk_data_ready callback while sk_callback_lock is already held. Lockdep
reports the same-thread recursive acquisition:
WARNING: possible recursive locking detected
kworker/u4:3/39 is trying to acquire lock:
(sk_callback_lock) at smc_clcsock_data_ready+0xa/0x4d
but task is already holding lock:
(sk_callback_lock) at smc_close_flush_work+0xc/0x30
Possible unsafe locking scenario:
CPU0
----
lock(sk_callback_lock);
lock(sk_callback_lock);
*** DEADLOCK ***
Workqueue: smc_close_wq smc_close_flush_work
Call Trace:
dump_stack_lvl
__lock_acquire
lock_acquire
_raw_read_lock_bh
smc_clcsock_data_ready
smc_close_flush_work
process_one_work
worker_thread
kthread
ret_from_fork
The same pattern was fixed for nvmet TCP by checking TCP_LISTEN before
taking sk_callback_lock:
commit 2fa8961d3a6a ("nvmet-tcp: fixup hang in
nvmet_tcp_listen_data_ready()")
Do the same for SMC. smc_clcsock_data_ready() is installed by
smc_listen() on the underlying TCP listen socket and only queues
smc_tcp_listen_work() for the SMC listen/accept path. Once that socket is
no longer in TCP_LISTEN, there is no listen accept work to queue from this
callback, and avoiding sk_callback_lock also avoids the recursive locking
path.
Fixes: 0558226cebee ("net/smc: Fix slab-out-of-bounds issue in fallback")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
---
v2:
- Include the fuller Lockdep stack from the grounded reproducer.
- Add the related nvmet TCP fix reference.
- Explain why the TCP_LISTEN check is valid for the SMC listen callback.
net/smc/af_smc.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 6421c2e1c84d..1af4e3c333ff 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -2631,6 +2631,9 @@ static void smc_clcsock_data_ready(struct sock *listen_clcsock)
{
struct smc_sock *lsmc;
+ if (READ_ONCE(listen_clcsock->sk_state) != TCP_LISTEN)
+ return;
+
read_lock_bh(&listen_clcsock->sk_callback_lock);
lsmc = smc_clcsock_user_data(listen_clcsock);
if (!lsmc)
--
2.34.1
^ permalink raw reply related
* Re: [PATCH v2 5/7] seg6: add End.M.GTP6.D.Di behavior
From: Yuya Kusakabe @ 2026-06-19 5:48 UTC (permalink / raw)
To: andrea
Cc: Yuya Kusakabe, andrea.mayer, davem, edumazet, dsahern, kuba,
pabeni, horms, justin.iurman, shuah, corbet, skhan, linux-kernel,
netdev, linux-kselftest, linux-doc, stefano.salsano, ahabdels
In-Reply-To: <20260607160119.ed2022e8a358d700e1134318@common-net.org>
Hi Andrea,
Thank you for the review.
> The patch 4 review applies here, except for the parts where Section 6.4 is
> implemented instead of Section 6.3 (which is incorrectly implemented in
> patch 4).
Answered in the patch 4 reply: the next version of End.M.GTP6.D will
implement Section 6.3 (Args.Mob.Session stamped into SRH[0], no
preserved D), leaving the original-DA preservation exclusive to this
drop-in variant.
> input_action_end_m_gtp6_d_di() and its finish callback are largely
> identical to the patch 4 functions (input_action_end_m_gtp6_d() and its
> finish): the SRH check, GTP-U dispatch, outer strip, inner protocol
> detection, and NF_HOOK invocation are identical. The duplication should be
> reduced via shared helpers.
Will do. The plan is one decap helper (SRH check, GTP-U dispatch,
outer strip, inner protocol detection) shared between End.M.GTP6.D and
End.M.GTP6.D.Di, and one SRv6-push helper (including the GSO offload
setup) shared with H.M.GTP4.D as well, with the GTP-U parser common to
all of them. The D.Di handler then reduces to the prepended-slot
handling specific to the drop-in variant. (The NF_HOOK invocation goes
away in the initial series per the cover letter thread.)
> D.Di does not use teid or qfi, so these variables and the (void) casts are
> dead code and should be avoided. For example, seg6_mobile_parse_gtpu() could
> accept NULL for teid and qfi so callers that do not need them can pass NULL
> directly.
Will do exactly that: the GTP-U parser (and the decap helper above)
will accept NULL for teid/qfi, and the drop-in variant will pass NULL.
Thanks,
Yuya
^ permalink raw reply
* Re: [PATCH net] net/smc: avoid recursive sk_callback_lock in listen data_ready
From: Mahanta Jambigi @ 2026-06-19 5:36 UTC (permalink / raw)
To: Runyu Xiao, D. Wythe, Dust Li, Sidraya Jayagond, Wenjia Zhang,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Tony Lu, Wen Gu, Simon Horman, Karsten Graul, linux-rdma,
linux-s390, netdev, linux-kernel, jianhao.xu
In-Reply-To: <20260618141629.2904071-1-runyu.xiao@seu.edu.cn>
On 18/06/26 7:46 pm, Runyu Xiao wrote:
> Hi,
>
> Thanks for taking a look.
>
> The exact Lockdep stack I have is from the grounded reproducer, not from
> a production SMC setup. The reproducer keeps the same callback shape:
> the close/flush side holds sk_callback_lock and invokes the installed
> sk_data_ready callback, which re-enters smc_clcsock_data_ready() and tries
> to take sk_callback_lock again.
>
> The relevant Lockdep report is:
>
> WARNING: possible recursive locking detected
> kworker/u4:3/39 is trying to acquire lock:
> (sk_callback_lock) at smc_clcsock_data_ready+0xa/0x4d
>
> but task is already holding lock:
> (sk_callback_lock) at smc_close_flush_work+0xc/0x30
>
> Possible unsafe locking scenario:
>
> CPU0
> ----
> lock(sk_callback_lock);
> lock(sk_callback_lock);
>
> *** DEADLOCK ***
>
> Workqueue: smc_close_wq smc_close_flush_work
>
> Call Trace:
> dump_stack_lvl
> __lock_acquire
> lock_acquire
> _raw_read_lock_bh
> smc_clcsock_data_ready+0xa/0x4d
> smc_close_flush_work+0x1f/0x30
> process_one_work
> worker_thread
> kthread
> ret_from_fork
Thank you for addressing the feedback. My suggestion would be to reply
to the original email thread where the review comments were given, so
that the maintainers can follow the conversation.
https://www.kernel.org/doc/html/latest/process/submitting-patches.html#respond-to-review-comments
Please include above call stack in your next version.
>
> The nvmet change I referred to is:
>
> 2fa8961d3a6a ("nvmet-tcp: fixup hang in nvmet_tcp_listen_data_ready()")
Please include this info in your next version.
>
> The stable/backport patch I originally used as the reference is:
>
> 1c90f930e7b4 ("nvmet-tcp: fixup hang in nvmet_tcp_listen_data_ready()")
>
> Its commit message says that when the socket is closed while in
> TCP_LISTEN, the flush callback can call nvmet_tcp_listen_data_ready()
> with sk_callback_lock already held, so nvmet moved the TCP_LISTEN check
> before taking sk_callback_lock.
>
> For the TCP_LISTEN check: my reasoning was that smc_clcsock_data_ready()
> is installed by smc_listen() on the underlying TCP listen socket and only
> queues smc_tcp_listen_work() for the SMC listen/accept path. Once that
> underlying socket is no longer in TCP_LISTEN, there should be no SMC
> listen accept work to queue from this callback. TCP_SYN_RECV and
> TCP_ESTABLISHED are not listen-socket states for this callback path, so I
> did not intend the callback to queue listen work for those states.
I understand. Please include this info in your next version.
>
> That said, if SMC expects smc_clcsock_data_ready() to handle a non-LISTEN
> state during fallback or another transition, then the proposed check is
> too strict and I should rework the fix.
>
> Thanks,
> Runyu
^ permalink raw reply
* Re: [PATCH v2 4/7] seg6: add End.M.GTP6.D behavior
From: Yuya Kusakabe @ 2026-06-19 5:27 UTC (permalink / raw)
To: andrea
Cc: Yuya Kusakabe, andrea.mayer, davem, edumazet, dsahern, kuba,
pabeni, horms, justin.iurman, shuah, corbet, skhan, linux-kernel,
netdev, linux-kselftest, linux-doc, stefano.salsano, ahabdels
In-Reply-To: <20260607020517.0c6bbb8beba505ac9447545e@common-net.org>
Hi Andrea,
Thank you for the review. The points shared with patches 1-3 will be
addressed as described in those replies; below the
End.M.GTP6.D-specific ones.
> The "src" attribute is used verbatim here as the outer IPv6 source address,
> same as patch 3. The src dual-semantics overload flagged in the patch 3
> reply applies here too.
Covered in the patch 3 reply: with the End.M.GTP4.E template use
gone, verbatim outer IPv6 SA becomes the single meaning of the src
attribute for the IPv6-emitting behaviors.
> Thank you for the follow-up in the cover letter thread. The finish callback
> writes orig_dst into SRH[0] and Args.Mob.Session into SRH[1]. As far as I
> can see, this matches neither Section 6.3 (Args.Mob.Session in SRH[0], no
> D) nor Section 6.4 (D in SRH[0], no Args.Mob).
Confirmed, that is the bug from my May 10 note. The next version of
End.M.GTP6.D will push the configured SR Policy verbatim and stamp
Args.Mob.Session into SRH[0] (at the locator length given by the
explicit sr_prefix_len attribute) per Section 6.3 S08; preserving the
original outer DA in a prepended slot will be exclusive to
End.M.GTP6.D.Di.
> Same reverse Christmas tree as patch 2; same issue in the other functions
> introduced by this patch.
> gtp is only used as a cast intermediary. Could it be inlined?
Will fix both.
> Nit: gtphl and hdrlen are assigned before the GTP1_F_EXTHDR check. On the
> path where the E flag is not set, gtphl is unused. Moving the gtphl
> assignment after the check would make the flow clearer.
Will move the gtphl dereference after the check; the pull has to stay
before it, since the long header is also consumed for S/PN-only
flags.
> Maybe ext could be renamed to ext_hdr? It would be easier to distinguish
> from ext_units and ext_bytes.
> ext_units is only used to derive ext_bytes. A single ext_len would
> remove the intermediate variable.
Will do both.
> If the extension chain contains more than one PDU Session Container, *qfi
> is silently overwritten. Is that intentional, or should the function reject
> a duplicate?
Not intentional; will reject a duplicate PDU Session Container as
malformed, with a selftest case for it.
> ext[ext_bytes - 1] reads the Next Extension Header Type field from the last
> byte of the current extension. Would a short comment help the reader?
Will add one.
> input_action_end_m_gtp6_d() does not change skb_dst(skb) before this call,
> so dst and lwtstate are the same ones the caller already dereferenced. When
> can this NULL check trigger?
It cannot: for a route installed with LWTUNNEL_STATE_INPUT_REDIRECT,
lwtunnel_set_redirect() always populates orig_input before dst.input
is replaced. I will drop the checks and call orig_input directly.
> Same dst/lwtstate issue as patch 2. Not introduced by this patch.
> Same missing iptunnel_handle_offloads() as patch 2.
The NF_HOOK split goes away per the cover letter thread, and the SRv6
push will go through a shared helper that calls
iptunnel_handle_offloads(skb, SKB_GSO_IPXIP6) before
seg6_do_srh_encap().
> Same BAD_INNER misuse as patch 2. seg6_do_srh_encap() can also fail from
> seg6_push_hmac(), which is an HMAC error on the new SRH, not an inner-T-PDU
> problem.
[...]
> segments[0], segments[1], saddr, and daddr are written after
> seg6_do_srh_encap() already called skb_postpush_rcsum(). skb->csum can
> be stale. Same for any later change to the outer header or SRH.
>
> HMAC, if configured, is computed on non-final SRH and saddr, hence invalid.
Thanks, both of these are real issues. My plan for the next version:
- every field stamped after seg6_do_srh_encap() (Args.Mob.Session, the
preserved DA in the drop-in variant, the outer saddr/daddr refresh,
and the dsfield propagation in H.M.GTP4.D) will go through a small
helper that applies the corresponding diff to skb->csum when the skb
is CHECKSUM_COMPLETE;
- the D-side behaviors will reject an HMAC-flagged SRH template at
configuration time: stamping the per-packet fields after
seg6_do_srh_encap() has signed the SRH would always invalidate the
HMAC. Inbound HMAC validation is unaffected. Would you prefer the
stamp-before-sign ordering solved from the start instead?
> The initializer on reason is dead. Every goto drop path sets reason
> explicitly before the jump. The variable can be left uninitialized here.
This goes away with the drop-reason rework: the MUP drop reasons will
be out of the initial series per the prep series plan, so the variable
itself is removed.
> Same SRH validation concerns as patch 1. HMAC is not validated here.
The ingress will use the same three-state SRH helper as the other
behaviors, which validates the HMAC whenever an SRH is present.
> Limitation note for both input_action_end() calls above: correct per RFC
> 9433 Section 6.3 S10-S11, but the SRH is absent or SL == 0 here, so
> input_action_end() will always drop without signaling non-GTP-U traffic.
> Perhaps you meant to drop directly with BAD_GTPU?
Right, the End fallback could only ever drop here. Instead of
dropping, I plan to hand non-UDP, non-GTP-U and non-T-PDU packets to
the route's original input path (the orig_input saved by the lwtunnel
input redirect), so a downstream owner of the GTP-U control plane
still receives e.g. Echo Request; the selftests will cover that
passthrough.
> Nit: inner_first could be an inner_ver with the shift done at assignment.
> The name would say what the variable holds.
[...]
> Same repeated size-selection ternary as patch 2.
Will do both: the inner version, header length and protocol computed
once in the switch.
> The anonymous { } block scopes three variables that should be declared at
> function top. Splitting into smaller helpers would make this easier to
> follow.
Will split the dispatch and outer strip into a decap helper shared
with End.M.GTP6.D.Di, with declarations at function top.
> Same missing frag_off check as patch 2.
Will add.
> The "{,.Di}" shell brace notation is unusual. Emitting the actual
> behavior name (End.M.GTP6.D or End.M.GTP6.D.Di) would be clearer.
> Same applies wherever this notation appears in the patchset.
Will replace it with the concrete behavior name everywhere.
Thanks,
Yuya
^ permalink raw reply
* Re: [PATCH net-next 2/2] net: dsa: mv88e6xxx: add support for credit based shaper
From: Luke Howard @ 2026-06-19 5:00 UTC (permalink / raw)
To: Cedric Jehasse
Cc: cedric.jehasse, Andrew Lunn, Vladimir Oltean, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
Russell King, netdev, linux-kernel, Christoph Mellauner,
Kieran Tyrrell
In-Reply-To: <CABvLYuBZ=U96gVpYqHnnhwBZxp8OZzH-W503MsbNd4JPqLbEpQ@mail.gmail.com>
> On 28 May 2026, at 7:19 pm, Cedric Jehasse <cedric.jehasse@gmail.com> wrote:
>
> On Thu, May 28, 2026 at 1:52 AM Luke Howard <lukeh@padl.com> wrote:
>> Would also be good to add support for the 6341, which as 4 queues but uses the same Qav multipliers as the 6390.
> I'll have a look at the 6341
As an aside, the 6341 driver is buggy. It assumes the chip does not have a dedicated ATU FID register because it has 256 databases. This is incorrect and breaks VLAN-aware bridging. I will post a patch when the merge window reopens.
On the hardware side, the ATU hashing algorithm appears to easily collide. This causes problems with things such as installing the broadcast address into multiple FIDs, not to mention 802.1Q Dynamic Reservation Entries. The only workaround is to use the ‘direct’ algorithm which may have a performance impact (I am presuming it is a linear search; the data sheet warns against using it).
Luke
^ permalink raw reply
* [PATCH net] ipv6: Fix null-ptr-deref in fib6_nh_mtu_change().
From: xmei5 @ 2026-06-19 4:53 UTC (permalink / raw)
To: dsahern, idosch, netdev
Cc: davem, edumazet, pabeni, kuba, horms, bestswngs, Xiang Mei
From: Xiang Mei <xmei5@asu.edu>
fib6_nh_mtu_change() re-fetches idev via __in6_dev_get(arg->dev) and
dereferences idev->cnf.mtu6 without a NULL check. addrconf_ifdown()
clears dev->ip6_ptr with RCU_INIT_POINTER() after rt6_disable_ip() has
released tb6_lock, so the RA-driven MTU walk can observe a NULL idev and
oops. The caller rt6_mtu_change_route() guards its own __in6_dev_get(),
but this re-fetch is unguarded; nexthop-backed routes survive
addrconf_ifdown()'s flush, so the walk still reaches it after ip6_ptr is
nulled.
Return 0 when idev is NULL, matching rt6_mtu_change_route() and the
fib6_mtu() fix in commit 5ad509c1fdad ("ipv6: Fix null-ptr-deref in
fib6_mtu().").
Oops: general protection fault, ... KASAN: null-ptr-deref in range
[0x00000000000002a8-0x00000000000002af]
RIP: 0010:fib6_nh_mtu_change+0x203/0x990
rt6_mtu_change_route+0x141/0x1d0
__fib6_clean_all+0xd0/0x160
rt6_mtu_change+0xb4/0x100
ndisc_router_discovery+0x24b5/0x2cb0
icmpv6_rcv+0x12e9/0x1710
ipv6_rcv+0x39b/0x410
Fixes: c0b220cf7d80 ("ipv6: Refactor exception functions")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Xiang Mei <xmei5@asu.edu>
---
net/ipv6/route.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 6361ad2fcf77..a1301334da48 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -5055,6 +5055,9 @@ static int fib6_nh_mtu_change(struct fib6_nh *nh, void *_arg)
struct inet6_dev *idev = __in6_dev_get(arg->dev);
u32 mtu = f6i->fib6_pmtu;
+ if (!idev)
+ return 0;
+
if (mtu >= arg->mtu ||
(mtu < arg->mtu && mtu == idev->cnf.mtu6))
fib6_metric_set(f6i, RTAX_MTU, arg->mtu);
--
2.43.0
^ permalink raw reply related
* [PATCH net] octeontx2-af: Block VFs from clobbering special CGX PKIND state
From: Ratheesh Kannoth @ 2026-06-19 4:10 UTC (permalink / raw)
To: davem, gakula, linux-kernel, netdev, sgoutham
Cc: andrew+netdev, edumazet, kuba, pabeni, Hariprasad Kelam,
Ratheesh Kannoth
From: Hariprasad Kelam <hkelam@marvell.com>
PF and VF NIX LFs that share a CGX LMAC reuse the same hardware PKIND
programming. When HiGig2 or EDSA parsing is enabled, a VF NIX LF alloc must
not reset the LMAC RX PKIND or default TX parse config over the PF setup.
Add cgx_get_pkind() and rvu_cgx_is_pkind_config_permitted() so VFs skip
cgx_set_pkind(), rvu_npc_set_pkind(), and NIX_AF_LFX_TX_PARSE_CFG updates
when the LMAC is using NPC_RX_HIGIG_PKIND or NPC_RX_EDSA_PKIND.
Fixes: 94d942c5fb97 ("octeontx2-af: Config pkind for CGX mapped PFs")
Cc: Geetha sowjanya <gakula@marvell.com>
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
---
.../net/ethernet/marvell/octeontx2/af/cgx.c | 12 +++++++
.../net/ethernet/marvell/octeontx2/af/cgx.h | 1 +
.../net/ethernet/marvell/octeontx2/af/rvu.h | 1 +
.../ethernet/marvell/octeontx2/af/rvu_cgx.c | 32 +++++++++++++++++++
.../ethernet/marvell/octeontx2/af/rvu_nix.c | 13 +++++---
5 files changed, 55 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/cgx.c b/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
index 2e94d5105016..f5fd6138c352 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
@@ -518,6 +518,18 @@ int cgx_set_pkind(void *cgxd, u8 lmac_id, int pkind)
return 0;
}
+int cgx_get_pkind(void *cgxd, u8 lmac_id, int *pkind)
+{
+ struct cgx *cgx = cgxd;
+
+ if (!is_lmac_valid(cgx, lmac_id))
+ return -ENODEV;
+
+ *pkind = cgx_read(cgx, lmac_id, cgx->mac_ops->rxid_map_offset);
+ *pkind = *pkind & 0x3F;
+ return 0;
+}
+
static u8 cgx_get_lmac_type(void *cgxd, int lmac_id)
{
struct cgx *cgx = cgxd;
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/cgx.h b/drivers/net/ethernet/marvell/octeontx2/af/cgx.h
index 92ccf343dfe0..8411a75dd723 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/cgx.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/cgx.h
@@ -141,6 +141,7 @@ int cgx_get_cgxid(void *cgxd);
int cgx_get_lmac_cnt(void *cgxd);
void *cgx_get_pdata(int cgx_id);
int cgx_set_pkind(void *cgxd, u8 lmac_id, int pkind);
+int cgx_get_pkind(void *cgxd, u8 lmac_id, int *pkind);
int cgx_lmac_evh_register(struct cgx_event_cb *cb, void *cgxd, int lmac_id);
int cgx_lmac_evh_unregister(void *cgxd, int lmac_id);
int cgx_get_tx_stats(void *cgxd, int lmac_id, int idx, u64 *tx_stat);
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
index 7f3505ae6860..bb671e2150aa 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
@@ -1115,6 +1115,7 @@ void npc_read_mcam_entry(struct rvu *rvu, struct npc_mcam *mcam,
u8 *intf, u8 *ena);
int npc_config_cntr_default_entries(struct rvu *rvu, bool enable);
bool is_cgx_config_permitted(struct rvu *rvu, u16 pcifunc);
+bool rvu_cgx_is_pkind_config_permitted(struct rvu *rvu, u16 pcifunc);
bool is_mac_feature_supported(struct rvu *rvu, int pf, int feature);
u32 rvu_cgx_get_fifolen(struct rvu *rvu);
void *rvu_first_cgx_pdata(struct rvu *rvu);
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c
index 4ff3935ed3fe..2be1da3476ac 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c
@@ -1355,3 +1355,35 @@ void rvu_mac_reset(struct rvu *rvu, u16 pcifunc)
if (mac_ops->mac_reset(cgxd, lmac, !is_vf(pcifunc)))
dev_err(rvu->dev, "Failed to reset MAC\n");
}
+
+/* Do not allow CGX-mapped VFs to overwrite PKIND when special parse kinds
+ * (HiGig, EDSA, etc.) are in use on the shared LMAC.
+ */
+bool rvu_cgx_is_pkind_config_permitted(struct rvu *rvu, u16 pcifunc)
+{
+ int pf, err, rxpkind;
+ u8 cgx_id, lmac_id;
+ void *cgxd;
+
+ pf = rvu_get_pf(rvu->pdev, pcifunc);
+
+ if (!(pcifunc & RVU_PFVF_FUNC_MASK))
+ return true;
+
+ if (!is_pf_cgxmapped(rvu, pf))
+ return true;
+
+ rvu_get_cgx_lmac_id(rvu->pf2cgxlmac_map[pf], &cgx_id, &lmac_id);
+ cgxd = rvu_cgx_pdata(cgx_id, rvu);
+ err = cgx_get_pkind(cgxd, lmac_id, &rxpkind);
+ if (err)
+ return false;
+
+ switch (rxpkind) {
+ case NPC_RX_HIGIG_PKIND:
+ case NPC_RX_EDSA_PKIND:
+ return false;
+ default:
+ return true;
+ }
+}
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
index d8989395e875..8c5cadd4465d 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
@@ -363,8 +363,11 @@ static int nix_interface_init(struct rvu *rvu, u16 pcifunc, int type, int nixlf,
pfvf->tx_chan_cnt = 1;
rsp->tx_link = cgx_id * hw->lmac_per_cgx + lmac_id;
- cgx_set_pkind(rvu_cgx_pdata(cgx_id, rvu), lmac_id, pkind);
- rvu_npc_set_pkind(rvu, pkind, pfvf);
+ if (rvu_cgx_is_pkind_config_permitted(rvu, pcifunc)) {
+ cgx_set_pkind(rvu_cgx_pdata(cgx_id, rvu), lmac_id,
+ pkind);
+ rvu_npc_set_pkind(rvu, pkind, pfvf);
+ }
break;
case NIX_INTF_TYPE_LBK:
vf = (pcifunc & RVU_PFVF_FUNC_MASK) - 1;
@@ -1680,8 +1683,10 @@ int rvu_mbox_handler_nix_lf_alloc(struct rvu *rvu,
rvu_write64(rvu, blkaddr, NIX_AF_LFX_RX_CFG(nixlf), req->rx_cfg);
/* Configure pkind for TX parse config */
- cfg = NPC_TX_DEF_PKIND;
- rvu_write64(rvu, blkaddr, NIX_AF_LFX_TX_PARSE_CFG(nixlf), cfg);
+ if (rvu_cgx_is_pkind_config_permitted(rvu, pcifunc)) {
+ cfg = NPC_TX_DEF_PKIND;
+ rvu_write64(rvu, blkaddr, NIX_AF_LFX_TX_PARSE_CFG(nixlf), cfg);
+ }
if (is_rep_dev(rvu, pcifunc)) {
pfvf->tx_chan_base = RVU_SWITCH_LBK_CHAN;
--
2.43.0
^ permalink raw reply related
* Re: [PATCH net] ipv6: ndisc: fix NULL deref in accept_untracked_na()
From: Xiang Mei @ 2026-06-19 4:05 UTC (permalink / raw)
To: Jiayuan Chen
Cc: Weiming Shi, David S . Miller, David Ahern, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, netdev, linux-kernel
In-Reply-To: <efea4e44-e9cd-444d-a648-8831f408acbf@linux.dev>
On Wed, Jun 17, 2026 at 9:09 PM Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
>
>
> On 6/17/26 9:38 PM, Weiming Shi wrote:
> > On Wed Jun 17, 2026 at 4:32 PM CST, Jiayuan Chen wrote:
> >> On 6/17/26 2:55 PM, Weiming Shi wrote:
> >>> accept_untracked_na() re-fetches the inet6_dev with __in6_dev_get(dev)
> >>> and dereferences idev->cnf.accept_untracked_na without a NULL check,
> >>
> >> Does ipv6_rpl_srh_rcv have same problem?
> > Hi,
> >
> > Yes, ipv6_rpl_srh_rcv() has the same missing check. It reads
> > idev->cnf.rpl_seg_enabled right after __in6_dev_get(skb->dev) with no
> > NULL check, while seg6 and ioam6 in the same file both check it.
> >
> > But I tried to trigger it and couldn't. With a guard added as an instrument,
> > idev never came back NULL over tens of millions of RPL packets while
> > flapping the MTU, so I can't say it's actually reachable.
>
>
> Can you need to add mdelay to enlarge the race window to reproduce it?
>
The delay works; we insert one at
@@ -1078,6 +1079,7 @@ static enum skb_drop_reason ndisc_recv_na(struct
sk_buff *skb)
*/
new_state = msg->icmph.icmp6_solicited ? NUD_REACHABLE : NUD_STALE;
if (!neigh && lladdr && idev && READ_ONCE(idev->cnf.forwarding)) {
+ mdelay(100); /* AI-POC: widen race window vs
concurrent addrconf_ifdown() NULLing dev->ip6_ptr */
if (accept_untracked_na(dev, saddr)) {
neigh = neigh_create(&nd_tbl, &msg->target, dev);
new_state = NUD_STALE;
And our poc triggers:
```
[ 0.887292] Oops: general protection fault, probably for
non-canonical address 0xdffffc0000000068: 0000 [#1] SMP KASAN NOPTI
[ 0.887791] KASAN: null-ptr-deref in range
[0x0000000000000340-0x0000000000000347]
[ 0.888101] CPU: 1 UID: 0 PID: 146 Comm: exploit Not tainted 7.1.0+
#9 PREEMPTLAZY
[ 0.888419] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/4
[ 0.888880] RIP: 0010:ndisc_recv_na+0x67b/0xf30
[ 0.889072] Code: 03 80 3c 02 00 0f 85 5f 07 00 00 49 8b af c0 00
00 00 48 b8 00 00 00 00 00 fc ff df 48 8d bd 40 03 00 00 e
[ 0.889829] RSP: 0018:ffff8881029084f8 EFLAGS: 00010202
[ 0.890043] RAX: dffffc0000000000 RBX: 1ffff110205210a8 RCX: 0000000000000001
[ 0.890340] RDX: 0000000000000068 RSI: 0000000000004d9a RDI: 0000000000000340
[ 0.890633] RBP: 0000000000000000 R08: 0000000000000004 R09: ffff888102908570
[ 0.890921] R10: ffff88800ee7f792 R11: ffff888102908580 R12: ffff88800ee7f780
[ 0.891207] R13: ffff88800ee7f778 R14: ffff88800ee7f792 R15: ffff88800f2d6000
[ 0.891511] FS: 00007b8516aec6c0(0000) GS:ffff888177d3d000(0000)
knlGS:0000000000000000
[ 0.891853] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.892094] CR2: 00007b8516aebb30 CR3: 000000000f164006 CR4: 0000000000772ef0
[ 0.892402] PKRU: 55555554
[ 0.892517] Call Trace:
[ 0.892627] <IRQ>
[ 0.892718] ? stack_trace_save+0x93/0xd0
[ 0.892895] ? __pfx_ndisc_recv_na+0x10/0x10
[ 0.893087] ? _raw_spin_unlock_irqrestore+0xe/0x40
[ 0.893295] ? stack_depot_save_flags+0x488/0x810
[ 0.893506] ? fib6_rule_lookup+0x40a/0x5a0
[ 0.893689] ? __napi_poll+0xa1/0x560
[ 0.893852] ? net_rx_action+0x401/0xd80
[ 0.894025] icmpv6_rcv+0x10a4/0x14e0
[ 0.894181] ip6_protocol_deliver_rcu+0xbd4/0x1340
[ 0.894391] ip6_input_finish+0x1d1/0x3a0
[ 0.894559] ip6_input+0x195/0x1d0
[ 0.894709] ? __pfx_ip6_input+0x10/0x10
[ 0.894873] ? __asan_memset+0x23/0x50
[ 0.895034] ip6_mc_input+0x26c/0x3a0
[ 0.895188] ipv6_rcv+0x31b/0x390
[ 0.895340] ? __pfx_ipv6_rcv+0x10/0x10
[ 0.895501] ? get_stack_info+0x2f/0x90
[ 0.895665] ? stack_access_ok+0xde/0x200
[ 0.895833] ? get_stack_info_noinstr+0x18/0x110
[ 0.896036] __netif_receive_skb_core.constprop.0+0xd78/0x2f00
[ 0.896286] ? common_startup_64+0x13e/0x151
[ 0.896475] ? common_startup_64+0x13e/0x151
[ 0.896655] ? __pfx_stack_trace_consume_entry+0x10/0x10
[ 0.896875] ? arch_stack_walk+0x99/0x100
[ 0.897049] ? __pfx___netif_receive_skb_core.constprop.0+0x10/0x10
[ 0.897310] ? common_startup_64+0x13e/0x151
[ 0.897498] ? stack_trace_save+0x93/0xd0
[ 0.897668] ? __pfx_stack_trace_save+0x10/0x10
[ 0.897855] ? stack_depot_save_flags+0x29/0x810
[ 0.898046] ? kasan_save_stack+0x33/0x60
[ 0.898213] ? kasan_save_track+0x14/0x30
[ 0.898389] ? kasan_save_free_info+0x3b/0x60
...
```
Weiming's patch works with a delay. The poc doesn't crash the kernel.
I noticed that you mentioned ipv6_rpl_srh_rcv. We actually have sent
one, but we later found that someone (Andrea Mayer) had sent the issue
earlier, so we just let them fix it:
https://lore.kernel.org/all/20260521200859.816b8923b5f27bba6124461e@uniroma2.it/
Xiang
> I believe we need more precise traffic and timing control, instead of
> aggressively ramping up traffic and load in an attempt to reproduce the
> issue.
>
^ permalink raw reply
* Re: [PATCH stable 6.6.y v3 1/4] bpf: Track equal scalars history on per-instruction level
From: Zhenzhong Wu @ 2026-06-19 4:01 UTC (permalink / raw)
To: Shung-Hsi Yu
Cc: bpf, netdev, linux-kernel, ast, daniel, john.fastabend, andrii,
martin.lau, song, yonghong.song, kpsingh, haoluo, jolsa,
menglong8.dong, eddyz87, stable, mykolal, tamird, Hao Sun
In-Reply-To: <ajDiLjjSYPp5p7KF@u94a>
On second thought, I'll drop the cnt == 1 reset (and the matching cnt > 0)
and follow upstream's cnt > 1 in v4. It gives no benefit over upstream, it
only adds the corner cases the bot flagged. And The other two points the
bot raised (clearing id on overflow, collecting src/dst twice) are upstream
behaviour, so I will keep them as-is.Thanks for the pointer.
On Tue, Jun 16, 2026 at 1:51 PM Shung-Hsi Yu <shung-hsi.yu@suse.com> wrote:
>
> On Mon, Jun 15, 2026 at 12:58:38AM +0800, Zhenzhong Wu wrote:
> [...]
> > +/* For all R being scalar registers or spilled scalar registers
> > + * in verifier state, save R in linked_regs if R->id == id.
> > + * If there are too many Rs sharing same id, reset id for leftover Rs.
> > + */
> > +static void collect_linked_regs(struct bpf_verifier_state *vstate, u32 id,
> > + struct linked_regs *linked_regs)
> > +{
> > + struct bpf_func_state *func;
> > struct bpf_reg_state *reg;
> > + int i, j;
> >
> > - bpf_for_each_reg_in_vstate(vstate, state, reg, ({
> > - if (reg->type == SCALAR_VALUE && reg->id == known_reg->id) {
> > + for (i = vstate->curframe; i >= 0; i--) {
> > + func = vstate->frame[i];
> > + for (j = 0; j < BPF_REG_FP; j++) {
> > + reg = &func->regs[j];
> > + __collect_linked_regs(linked_regs, reg, id, i, j, true);
> > + }
> > + for (j = 0; j < func->allocated_stack / BPF_REG_SIZE; j++) {
> > + if (!is_spilled_reg(&func->stack[j]))
> > + continue;
> > + reg = &func->stack[j].spilled_ptr;
> > + __collect_linked_regs(linked_regs, reg, id, i, j, false);
> > + }
> > + }
> > +
> > + if (linked_regs->cnt == 1)
> > + linked_regs->cnt = 0;
>
> This part seems new, not found on the original commit, and also not in
> bpf-next. Can you add some more explaining (in the notes before your
> signed-off-by) regarding why this is needed?
>
> > +}
> [...]
> > @@ -14704,6 +14899,21 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env,
> > return 0;
> > }
> >
> > + /* Push scalar registers sharing same ID to jump history,
> > + * do this before creating 'other_branch', so that both
> > + * 'this_branch' and 'other_branch' share this history
> > + * if parent state is created.
> > + */
> > + if (BPF_SRC(insn->code) == BPF_X && src_reg->type == SCALAR_VALUE && src_reg->id)
> > + collect_linked_regs(this_branch, src_reg->id, &linked_regs);
> > + if (dst_reg->type == SCALAR_VALUE && dst_reg->id)
> > + collect_linked_regs(this_branch, dst_reg->id, &linked_regs);
> > + if (linked_regs.cnt > 0) {
>
> Same here, the original commit and bpf-next has the '> 1' conditional,
> where as your has '> 0'. Can you also added some explanation on this
> part?
>
> > + err = push_jmp_history(env, this_branch, 0, linked_regs_pack(&linked_regs));
> > + if (err)
> > + return err;
> > + }
> > +
> ...
^ permalink raw reply
* Re: [PATCH stable 6.6.y v3 0/4] bpf: linked scalar precision fixes
From: Zhenzhong Wu @ 2026-06-19 4:01 UTC (permalink / raw)
To: Shung-Hsi Yu
Cc: Sasha Levin, Paul Chaignon, bpf, netdev, linux-kernel, ast,
daniel, john.fastabend, andrii, martin.lau, song, yonghong.song,
kpsingh, haoluo, jolsa, menglong8.dong, eddyz87, stable, mykolal,
tamird
In-Reply-To: <ajC3d44N4s0sdEBB@u94a>
Sorry for the late reply.
For v3, I only ran the targeted test_progs coverage and missed the legacy
test_verifier coverage, so I did not catch the expectation mismatch in
precise.c. Thanks for the detailed analysis. I have rechecked this and will
send v4 shortly with the precise.c fix included.
On Tue, Jun 16, 2026 at 1:22 PM Shung-Hsi Yu <shung-hsi.yu@suse.com> wrote:
>
> On Tue, Jun 16, 2026 at 12:51:34AM +0200, Paul Chaignon wrote:
> > On Mon, Jun 15, 2026 at 12:58:37AM +0800, Zhenzhong Wu wrote:
> > > Hi,
> > >
> > > This v3 targets 6.6.y and changes the backport strategy based on review
> > > feedback on v2.
> >
> > [...]
> >
> > > Relevant QEMU selftest results on 6.6.y with this backport:
> > >
> > > verifier_scalar_ids passed all 18 subtests, including the newly
> > > backported linked-scalar precision tests and the related
> > > check_ids_in_regsafe tests.
> >
> > The first patch in this backport series is actually breaking the
> > "precise: test 1" selftest from test_verifier. You can see the full
> > error at [1]. I haven't yet checked if it's the test or the backport
> > that needs to be adjusted.
>
> I had a quick look, and believe it was that test that needs to be
> adjusted to include r9 into the precise register set.
>
> So unless Sasha have other preference, I suggest Zhenzhong send a v4,
> with changes to tools/testing/selftests/bpf/verifier/precise.c
> (including "r9" the the expected verifier output) merged into "bpf:
> Track equal scalars history on per-instruction level".
>
> ---
>
> The program under test is:
>
> 00: BPF_MOV64_IMM(BPF_REG_0, 1),
> 01: BPF_LD_MAP_FD(BPF_REG_6, 0),
> 03: BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
> 04: BPF_MOV64_REG(BPF_REG_2, BPF_REG_FP),
> 05: BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
> 06: BPF_ST_MEM(BPF_DW, BPF_REG_FP, -8, 0),
> 07: BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
> 08: BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
> 09: BPF_EXIT_INSN(),
>
> 10: BPF_MOV64_REG(BPF_REG_9, BPF_REG_0),
>
> 11: BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
> 12: BPF_MOV64_REG(BPF_REG_2, BPF_REG_FP),
> 13: BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
> 14: BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
> 15: BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
> 16: BPF_EXIT_INSN(),
>
> 17: BPF_MOV64_REG(BPF_REG_8, BPF_REG_0),
>
> 18: BPF_ALU64_REG(BPF_SUB, BPF_REG_9, BPF_REG_8), /* map_value_ptr -= map_value_ptr */
> 19: BPF_MOV64_REG(BPF_REG_2, BPF_REG_9),
> 20: BPF_JMP_IMM(BPF_JLT, BPF_REG_2, 8, 1),
> 21: BPF_EXIT_INSN(),
>
> 22: BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, 1), /* R2=scalar(umin=1, umax=8) */
> 23: BPF_MOV64_REG(BPF_REG_1, BPF_REG_FP),
> 24: BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -8),
> 25: BPF_MOV64_IMM(BPF_REG_3, 0),
> 26: BPF_EMIT_CALL(BPF_FUNC_probe_read_kernel),
> 27: BPF_EXIT_INSN(),
>
> The test was expecting the following line in the verifier log that was
> shown during the backtracking start at instruction 26 (call
> bpf_probe_read_kernel#113)
>
> mark_precise: frame0: regs=r2 stack= before 20: (a5) if r2 < 0x8 goto pc+1
> mark_precise: frame0: parent state regs=r2 stack=: ...
> mark_precise: frame0: last_idx 19 first_idx 10 ...
>
> But after applying the patchset, we now got an additional register r9 in
> the precise set:
>
> mark_precise: frame0: regs=r2 stack= before 20: (a5) if r2 < 0x8 goto pc+1
> mark_precise: frame0: parent state regs=r2,r9 stack=: ....
> mark_precise: frame0: last_idx 19 first_idx 10 ...
>
> The additional r9 in the precise set seems actually correct, this is
> because r2 and r9 share the same scalar ID at instruction 20 (before the
> link got broken in instruction 21), and hence at that point, both
> register should be marked as precise.
>
> ---
>
> In upstream the test already has the expected verifier log to include
> r9, and hence no failure, but it simply comes from the fact that r2 and
> r9 maintain a link even after instruction 22 (r2 += 1).
>
> commit 98d7ca374ba4b39e7535613d40e159f09ca14da2
> Author: Alexei Starovoitov <ast@kernel.org>
> Date: Wed Jun 12 18:38:13 2024 -0700
>
> bpf: Track delta between "linked" registers.
> ...
> --- a/tools/testing/selftests/bpf/verifier/precise.c
> +++ b/tools/testing/selftests/bpf/verifier/precise.c
> @@ -39,12 +39,12 @@
> .result = VERBOSE_ACCEPT,
> .errstr =
> "mark_precise: frame0: last_idx 26 first_idx 20\
> - mark_precise: frame0: regs=r2 stack= before 25\
> - mark_precise: frame0: regs=r2 stack= before 24\
> - mark_precise: frame0: regs=r2 stack= before 23\
> - mark_precise: frame0: regs=r2 stack= before 22\
> - mark_precise: frame0: regs=r2 stack= before 20\
> - mark_precise: frame0: parent state regs=r2 stack=:\
> + mark_precise: frame0: regs=r2,r9 stack= before 25\
> + mark_precise: frame0: regs=r2,r9 stack= before 24\
> + mark_precise: frame0: regs=r2,r9 stack= before 23\
> + mark_precise: frame0: regs=r2,r9 stack= before 22\
> + mark_precise: frame0: regs=r2,r9 stack= before 20\
> + mark_precise: frame0: parent state regs=r2,r9 stack=:\
> mark_precise: frame0: last_idx 19 first_idx 10\
> mark_precise: frame0: regs=r2,r9 stack= before 19\
> mark_precise: frame0: regs=r9 stack= before 18\
> ...
>
> ---
>
> Full test log below
>
> #492/p precise: test 1 FAIL
> Unexpected verifier log!
> EXP: mark_precise: frame0: parent state regs=r2 stack=:
> RES:
> func#0 @0
> 0: R1=ctx(off=0,imm=0) R10=fp0
> 0: (b7) r0 = 1 ; R0_w=1
> 1: (18) r6 = 0xffff9eb644619000 ; R6_w=map_ptr(off=0,ks=4,vs=48,imm=0)
> 3: (bf) r1 = r6 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R6_w=map_ptr(off=0,ks=4,vs=48,imm=0)
> 4: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
> 5: (07) r2 += -8 ; R2_w=fp-8
> 6: (7a) *(u64 *)(r10 -8) = 0 ; R10=fp0 fp-8_w=00000000
> 7: (85) call bpf_map_lookup_elem#1 ; R0_w=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0)
> 8: (55) if r0 != 0x0 goto pc+1 ; R0_w=0
> 9: (95) exit
>
> from 8 to 10: R0=map_value(off=0,ks=4,vs=48,imm=0) R6=map_ptr(off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8=0000mmmm
> 10: R0=map_value(off=0,ks=4,vs=48,imm=0) R6=map_ptr(off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8=0000mmmm
> 10: (bf) r9 = r0 ; R0=map_value(off=0,ks=4,vs=48,imm=0) R9_w=map_value(off=0,ks=4,vs=48,imm=0)
> 11: (bf) r1 = r6 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R6=map_ptr(off=0,ks=4,vs=48,imm=0)
> 12: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
> 13: (07) r2 += -8 ; R2_w=fp-8
> 14: (85) call bpf_map_lookup_elem#1 ; R0_w=map_value_or_null(id=2,off=0,ks=4,vs=48,imm=0)
> 15: (55) if r0 != 0x0 goto pc+1 ; R0_w=0
> 16: (95) exit
>
> from 15 to 17: R0_w=map_value(off=0,ks=4,vs=48,imm=0) R6=map_ptr(off=0,ks=4,vs=48,imm=0) R9_w=map_value(off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8=0000mmmm
> 17: R0_w=map_value(off=0,ks=4,vs=48,imm=0) R6=map_ptr(off=0,ks=4,vs=48,imm=0) R9_w=map_value(off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8=0000mmmm
> 17: (bf) r8 = r0 ; R0_w=map_value(off=0,ks=4,vs=48,imm=0) R8_w=map_value(off=0,ks=4,vs=48,imm=0)
> 18: (1f) r9 -= r8 ; R8_w=map_value(off=0,ks=4,vs=48,imm=0) R9_w=scalar()
> 19: (bf) r2 = r9 ; R2=scalar(id=3) R9=scalar(id=3)
> 20: (a5) if r2 < 0x8 goto pc+1 ; R2=scalar(id=3,umin=8)
> 21: (95) exit
>
> from 20 to 22: R0=map_value(off=0,ks=4,vs=48,imm=0) R2=scalar(id=3,umax=7,var_off=(0x0; 0x7)) R6=map_ptr(off=0,ks=4,vs=48,imm=0) R8=map_value(off=0,ks=4,vs=48,imm=0) R9=scalar(id=3,umax=7,var_off=(0x0; 0x7)) R10=fp0 fp-8=0000mmmm
> 22: R0=map_value(off=0,ks=4,vs=48,imm=0) R2=scalar(id=3,umax=7,var_off=(0x0; 0x7)) R6=map_ptr(off=0,ks=4,vs=48,imm=0) R8=map_value(off=0,ks=4,vs=48,imm=0) R9=scalar(id=3,umax=7,var_off=(0x0; 0x7)) R10=fp0 fp-8=0000mmmm
> 22: (07) r2 += 1 ; R2_w=scalar(umin=1,umax=8,var_off=(0x0; 0xf))
> 23: (bf) r1 = r10 ; R1_w=fp0 R10=fp0
> 24: (07) r1 += -8 ; R1_w=fp-8
> 25: (b7) r3 = 0 ; R3_w=0
> 26: (85) call bpf_probe_read_kernel#113
> mark_precise: frame0: last_idx 26 first_idx 20 subseq_idx -1
> mark_precise: frame0: regs=r2 stack= before 25: (b7) r3 = 0
> mark_precise: frame0: regs=r2 stack= before 24: (07) r1 += -8
> mark_precise: frame0: regs=r2 stack= before 23: (bf) r1 = r10
> mark_precise: frame0: regs=r2 stack= before 22: (07) r2 += 1
> mark_precise: frame0: regs=r2 stack= before 20: (a5) if r2 < 0x8 goto pc+1
> mark_precise: frame0: parent state regs=r2,r9 stack=: R0_rw=map_value(off=0,ks=4,vs=48,imm=0) R2_rw=Pscalar(id=3) R6=map_ptr(off=0,ks=4,vs=48,imm=0) R8_w=map_value(off=0,ks=4,vs=48,imm=0) R9_w=Pscalar(id=3) R10=fp0 fp-8_r=0000mmmm
> mark_precise: frame0: last_idx 19 first_idx 10 subseq_idx 20
> mark_precise: frame0: regs=r2,r9 stack= before 19: (bf) r2 = r9
> mark_precise: frame0: regs=r9 stack= before 18: (1f) r9 -= r8
> mark_precise: frame0: regs=r8,r9 stack= before 17: (bf) r8 = r0
> mark_precise: frame0: regs=r0,r9 stack= before 15: (55) if r0 != 0x0 goto pc+1
> mark_precise: frame0: regs=r0,r9 stack= before 14: (85) call bpf_map_lookup_elem#1
> mark_precise: frame0: regs=r9 stack= before 13: (07) r2 += -8
> mark_precise: frame0: regs=r9 stack= before 12: (bf) r2 = r10
> mark_precise: frame0: regs=r9 stack= before 11: (bf) r1 = r6
> mark_precise: frame0: regs=r9 stack= before 10: (bf) r9 = r0
> mark_precise: frame0: parent state regs= stack=: R0_rw=map_value(off=0,ks=4,vs=48,imm=0) R6_rw=map_ptr(off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8_rw=0000mmmm
> 27: R0_w=scalar()
> 27: (95) exit
> processed 27 insns (limit 1000000) max_states_per_insn 0 total_states 2 peak_states 2 mark_read 1
>
> [...]
^ permalink raw reply
* [PATCH net 2/2] net: dsa: mxl862xx: fix use-after-free of DSA ports in crc_err_work
From: Daniel Golle @ 2026-06-19 3:40 UTC (permalink / raw)
To: Daniel Golle, Andrew Lunn, Vladimir Oltean, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev, linux-kernel
Upon an MDIO CRC error mxl862xx_crc_err_work_fn() walks the DSA ports
and closes the CPU port conduits:
dsa_switch_for_each_cpu_port(dp, priv->ds)
dev_close(dp->conduit);
mxl862xx_remove() unregisters the switch before cancelling this work:
set_bit(MXL862XX_FLAG_WORK_STOPPED, &priv->flags);
cancel_delayed_work_sync(&priv->stats_work);
dsa_unregister_switch(ds);
mxl862xx_host_shutdown(priv);
dsa_unregister_switch() frees the dsa_port objects. If a CRC error
schedules the work during teardown it can run after the ports have been
freed and dereference freed memory.
Guard the port walk with MXL862XX_FLAG_WORK_STOPPED, which is already set
before dsa_unregister_switch(). DSA tears the ports down under
rtnl_lock(), so checking the flag under rtnl_lock() means the work either
runs before teardown and sees valid ports, or runs afterwards, observes
the flag and skips the walk. This mirrors the host_flood_work handler,
which skips torn-down ports under rtnl_lock().
Link: https://sashiko.dev/#/patchset/cover.1780968180.git.daniel%40makrotopia.org?part=2
Fixes: a319d0c8c8ce ("net: dsa: mxl862xx: add CRC for MDIO communication")
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
---
drivers/net/dsa/mxl862xx/mxl862xx-host.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/drivers/net/dsa/mxl862xx/mxl862xx-host.c b/drivers/net/dsa/mxl862xx/mxl862xx-host.c
index 882c5d960941..4acd216f7cc0 100644
--- a/drivers/net/dsa/mxl862xx/mxl862xx-host.c
+++ b/drivers/net/dsa/mxl862xx/mxl862xx-host.c
@@ -41,12 +41,13 @@ static void mxl862xx_crc_err_work_fn(struct work_struct *work)
crc_err_work);
struct dsa_port *dp;
- dev_warn(&priv->mdiodev->dev,
- "MDIO CRC error detected, shutting down all ports\n");
-
rtnl_lock();
- dsa_switch_for_each_cpu_port(dp, priv->ds)
- dev_close(dp->conduit);
+ if (!test_bit(MXL862XX_FLAG_WORK_STOPPED, &priv->flags)) {
+ dev_warn(&priv->mdiodev->dev,
+ "MDIO CRC error detected, shutting down all ports\n");
+ dsa_switch_for_each_cpu_port(dp, priv->ds)
+ dev_close(dp->conduit);
+ }
rtnl_unlock();
clear_bit(MXL862XX_FLAG_CRC_ERR, &priv->flags);
--
2.54.0
^ permalink raw reply related
* [PATCH net 1/2] net: dsa: mxl862xx: avoid unaligned 16-bit access in api_wrap
From: Daniel Golle @ 2026-06-19 3:39 UTC (permalink / raw)
To: Daniel Golle, Andrew Lunn, Vladimir Oltean, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev, linux-kernel
The MXL862XX_API_* macros pass the address of a stack-allocated, __packed
firmware-ABI struct to mxl862xx_api_wrap() as a void *. The struct has an
alignment of 1, so the compiler is free to place it at an odd address.
mxl862xx_api_wrap() reinterprets that buffer as a __le16 * and accesses it
with data[i], for which the compiler assumes the natural 2-byte alignment
of __le16 and emits aligned 16-bit loads/stores (e.g. lhu/sh on MIPS).
When the buffer lands on an odd address these fault on architectures that
do not support unaligned access, such as MIPS32.
-Waddress-of-packed-member does not catch this: the packed origin is
laundered through the void * parameter, so the cast inside api_wrap looks
alignment-safe to the compiler and no warning is emitted.
Use get_unaligned_le16()/put_unaligned_le16() for the three 16-bit word
accesses. The byte accesses (*(u8 *)&data[i], crc16()) are already safe
and are left unchanged.
Link: https://sashiko.dev/#/patchset/cover.1781319534.git.daniel%40makrotopia.org?part=4
Fixes: 23794bec1cb6 ("net: dsa: add basic initial driver for MxL862xx switches")
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
---
drivers/net/dsa/mxl862xx/mxl862xx-host.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/net/dsa/mxl862xx/mxl862xx-host.c b/drivers/net/dsa/mxl862xx/mxl862xx-host.c
index d55f9dff6433..882c5d960941 100644
--- a/drivers/net/dsa/mxl862xx/mxl862xx-host.c
+++ b/drivers/net/dsa/mxl862xx/mxl862xx-host.c
@@ -12,6 +12,7 @@
#include <linux/crc16.h>
#include <linux/iopoll.h>
#include <linux/limits.h>
+#include <linux/unaligned.h>
#include <net/dsa.h>
#include "mxl862xx.h"
#include "mxl862xx-host.h"
@@ -349,7 +350,7 @@ int mxl862xx_api_wrap(struct mxl862xx_priv *priv, u16 cmd, void *_data,
* zero words individually.
*/
for (i = 0, zeros = 0; i < size / 2 && zeros < RST_DATA_THRESHOLD; i++)
- if (!data[i])
+ if (!get_unaligned_le16(&data[i]))
zeros++;
if (zeros < RST_DATA_THRESHOLD && (size & 1) && !*(u8 *)&data[i])
@@ -395,7 +396,7 @@ int mxl862xx_api_wrap(struct mxl862xx_priv *priv, u16 cmd, void *_data,
*/
val = *(u8 *)&data[i] | ((crc & 0xff) << 8);
} else {
- val = le16_to_cpu(data[i]);
+ val = get_unaligned_le16(&data[i]);
}
/* After RST_DATA, skip zero data words as the registers
@@ -453,7 +454,7 @@ int mxl862xx_api_wrap(struct mxl862xx_priv *priv, u16 cmd, void *_data,
*(uint8_t *)&data[i] = ret & 0xff;
crc = (ret >> 8) & 0xff;
} else {
- data[i] = cpu_to_le16((u16)ret);
+ put_unaligned_le16((u16)ret, &data[i]);
}
}
--
2.54.0
^ permalink raw reply related
* Re: [PATCH v3 2/3] net/smc: bound the receive length to the RMB in smc_rx_recvmsg()
From: Dust Li @ 2026-06-19 3:31 UTC (permalink / raw)
To: Bryam Vargas
Cc: Wenjia Zhang, D . Wythe, Sidraya Jayagond, Eric Dumazet,
David S . Miller, Mahanta Jambigi, Wen Gu, Simon Horman,
Ursula Braun, Stefan Raspl, Tony Lu, Paolo Abeni, Jakub Kicinski,
netdev, linux-s390, linux-rdma, linux-kernel
In-Reply-To: <20260618221106.236699-1-hexlabsecurity@proton.me>
On 2026-06-18 22:11:12, Bryam Vargas wrote:
>On Fri, 19 Jun 2026 00:03:17 +0800, Dust Li wrote:
>> Once we validate the CDC message at the input boundary (as in the
>> previous patch), bytes_to_rcv can never exceed rmb_desc->len, so
>> this check becomes unreachable. So I don't think this patch is needed.
>
>This one I'd actually like to keep, and let me walk through why -- I don't think the
>boundary check closes it.
>
>bytes_to_rcv isn't set to a cursor count, it's a running accumulator:
>smc_cdc_msg_recv_action does atomic_add(diff_prod, &bytes_to_rcv), where
>diff_prod = smc_curs_diff(rmb_desc->len, old, new). So bounding each cursor's count at
>the boundary doesn't bound the sum of the deltas.
>
>The differing-wrap branch of smc_curs_diff returns (len - old.count) + new.count,
>which is up to 2*len-1 even when both cursors pass count <= len. With len=16, a prod
>going (0,0) -> (1,15) gives diff=31, so bytes_to_rcv is already 31 > len after one
>message; alternating wrap 0<->1 at count=15 keeps adding ~len and eventually wraps the
>atomic_t negative. I have an A/B for this -- happy to send it along.
Glad to see you A/B test, I think we can decide after we see the real
issue.
>
>So to make this truly unreachable from the boundary check, we'd need to bound
>prod - cons <= len there, not just the absolute count. The consumer-side clamp is two
>lines and race-free against the tasklet, so my preference would be to keep it as a
>backstop -- but if you'd rather fold it into a stronger boundary check instead, I'm
>open to that.
Another thing I'd worry about is if this really happens, should we also
abort the connection like what we did in patch #1 ?
Best regards,
Dust
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox