Netdev List
* Re: [PATCH 1/2] bpf: preserve rx_queue_index across XDP redirects
From: Alexei Starovoitov @ 2026-06-25 16:44 UTC
  To: Jakub Kicinski, Siddharth C
  Cc: ast, hawk, andrii, netdev, bpf, linux-kernel, linux-kselftest
In-Reply-To: <20260624185432.32d90aa8@kernel.org>

On Wed Jun 24, 2026 at 6:54 PM PDT, Jakub Kicinski wrote:
> On Sat, 20 Jun 2026 12:13:13 +0000 Siddharth C wrote:
>> diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
>> index 5e59ab896f05..8f2d7013620f 100644
>> --- a/kernel/bpf/cpumap.c
>> +++ b/kernel/bpf/cpumap.c
>> @@ -197,7 +197,7 @@ static int cpu_map_bpf_prog_run_xdp(struct bpf_cpu_map_entry *rcpu,
>>  
>>  		rxq.dev = xdpf->dev_rx;
>>  		rxq.mem.type = xdpf->mem_type;
>> -		/* TODO: report queue_index to xdp_rxq_info */
>> +		rxq.queue_index = xdpf->rx_queue_index;
>
> Do you actually need this, or are you just trying to address the TODO?

This is the 3rd, if not the 4th, attempt from various "people" to address this TODO.
We should just remove the TODO line instead.



* Re: [PATCH bpf-next v10 1/5] bpf: add bpf_icmp_send kfunc
From: Stanislav Fomichev @ 2026-06-25 16:24 UTC
  To: Mahe Tardy
  Cc: bpf, andrii, ast, daniel, john.fastabend, jordan, martin.lau,
	yonghong.song, emil, netdev, edumazet, kuba, pabeni, davem, horms
In-Reply-To: <20260625110321.28236-2-mahe.tardy@gmail.com>

On 06/25, Mahe Tardy wrote:
> This is needed in the context of Tetragon to provide improved feedback
> (in contrast to just dropping packets) to east-west traffic when blocked
> by policies using cgroup_skb programs.
> 
> This reuses concepts from netfilter reject target codepath with the
> differences that:
> * Packets are cloned since the BPF user can still let the packet pass
>   (SK_PASS from the cgroup_skb progs for example) and the current skb
>   needs to stay untouched (cgroup_skb hooks only allow read-only skb
>   payload).
> * We protect against recursion since the kfunc, by generating an ICMP
>   error message, could retrigger the BPF prog that invoked it.
> 
> Only ICMP_DEST_UNREACH and ICMPV6_DEST_UNREACH are currently supported.
> The interface accepts a type parameter to facilitate future extension to
> other ICMP control message types.
> 
> For normal cgroup_skb paths, the skb dst route should already be set.
> However, bpf_prog_test_run_skb can create synthetic IPv4 skbs without an
> attached route. In that case, icmp_send returns early, and the kfunc
> would otherwise report success despite no ICMP reply being sent. The
> check also rejects metadata dsts, which are not valid struct rtable
> instances. For IPv6, reject metadata dsts only: icmpv6_send can reach
> icmp6_dev, where skb_rt6_info treats any non-NULL skb dst as a struct
> rt6_info, which is not valid for metadata_dst.
> 
> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
> Reviewed-by: Jordan Rife <jordan@jrife.io>
> Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
> ---
>  net/core/filter.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 95 insertions(+)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 2e96b4b847ce..0a0191586b44 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -84,6 +84,9 @@
>  #include <linux/un.h>
>  #include <net/xdp_sock_drv.h>
>  #include <net/inet_dscp.h>
> +#include <linux/icmpv6.h>
> +#include <net/icmp.h>
> +#include <net/ip6_route.h>
> 
>  #include "dev.h"
> 
> @@ -12546,6 +12549,88 @@ __bpf_kfunc int bpf_xdp_pull_data(struct xdp_md *x, u32 len)
>  	return 0;
>  }
> 
> +/**
> + * bpf_icmp_send - Send an ICMP control message
> + * @skb_ctx: Packet that triggered the control message
> + * @type: ICMP type (only ICMP_DEST_UNREACH/ICMPV6_DEST_UNREACH supported)
> + * @code: ICMP code (0-15 except ICMP_FRAG_NEEDED for IPv4, 0-6 for IPv6)
> + *
> + * Sends an ICMP control message in response to the packet. The original packet
> + * is cloned before sending the ICMP message, so the BPF program can still let
> + * the packet pass if desired.
> + *
> + * Currently only ICMP_DEST_UNREACH (IPv4) and ICMPV6_DEST_UNREACH (IPv6) are
> + * supported.
> + *
> + * Return: 0 on success (send attempt), negative error code on failure:
> + *         -EBUSY: Recursion detected
> + *         -EPROTONOSUPPORT: Non-IP protocol
> + *         -EOPNOTSUPP: Unsupported ICMP type
> + *         -EINVAL: Invalid code parameter
> + *         -ENETUNREACH: No usable route/dst for the ICMP reply
> + *         -ENOMEM: Memory allocation failed
> + */
> +__bpf_kfunc int bpf_icmp_send(struct __sk_buff *skb_ctx, int type, int code)
> +{
> +	struct sk_buff *skb = (struct sk_buff *)skb_ctx;
> +	struct sk_buff *nskb;
> +	struct sock *sk;
> +
> +	sk = skb_to_full_sk(skb);
> +	if (sk && sk->sk_kern_sock &&
> +	    (sk->sk_protocol == IPPROTO_ICMP || sk->sk_protocol == IPPROTO_ICMPV6))
> +		return -EBUSY;
> +
> +	switch (skb->protocol) {
> +#if IS_ENABLED(CONFIG_INET)
> +	case htons(ETH_P_IP): {
> +		if (type != ICMP_DEST_UNREACH)
> +			return -EOPNOTSUPP;
> +		if (code < 0 || code > NR_ICMP_UNREACH ||
> +		    code == ICMP_FRAG_NEEDED) /* needs a valid next-hop MTU */
> +			return -EINVAL;
> +
> +		/* icmp_send expects skb_dst to be a real rtable. */
> +		if (!skb_valid_dst(skb))
> +			return -ENETUNREACH;
> +
> +		nskb = skb_clone(skb, GFP_ATOMIC);
> +		if (!nskb)
> +			return -ENOMEM;
> +
> +		memset(IPCB(nskb), 0, sizeof(*IPCB(nskb)));
> +		icmp_send(nskb, type, code, 0);
> +		consume_skb(nskb);
> +		break;
> +	}
> +#endif
> +#if IS_ENABLED(CONFIG_IPV6)
> +	case htons(ETH_P_IPV6):
> +		if (type != ICMPV6_DEST_UNREACH)
> +			return -EOPNOTSUPP;
> +		if (code < 0 || code > ICMPV6_REJECT_ROUTE)
> +			return -EINVAL;

[..]

> +		/* icmpv6_send may treat skb_dst as rt6_info. */
> +		if (skb_metadata_dst(skb))
> +			return -ENETUNREACH;

I'm a bit confused about this. Which part of icmpv6_send treats skb_dst as rt6_info?
(I see the original sashiko report about dst, but icmp6 does not seem to
require it.)
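
The type and code checks in the quoted hunk can be modelled outside the kernel. The following is a minimal sketch, not the kernel code: `check_icmp_args` is a hypothetical helper name, protocol numbers are compared in host byte order (the kernel compares against `htons()` values), and the recursion check, the dst checks and the skb clone are left out. The numeric constants are the values from `<linux/icmp.h>` and `<linux/icmpv6.h>`.

```python
import errno

# Constants as defined in <linux/icmp.h> and <linux/icmpv6.h>.
ICMP_DEST_UNREACH = 3
ICMP_FRAG_NEEDED = 4
NR_ICMP_UNREACH = 15
ICMPV6_DEST_UNREACH = 1
ICMPV6_REJECT_ROUTE = 6

# EtherTypes in host byte order (illustration only).
ETH_P_IP = 0x0800
ETH_P_IPV6 = 0x86DD


def check_icmp_args(protocol, icmp_type, code):
    """Hypothetical model of the argument checks; returns 0 or -errno."""
    if protocol == ETH_P_IP:
        if icmp_type != ICMP_DEST_UNREACH:
            return -errno.EOPNOTSUPP
        # ICMP_FRAG_NEEDED is refused because it needs a valid next-hop MTU.
        if code < 0 or code > NR_ICMP_UNREACH or code == ICMP_FRAG_NEEDED:
            return -errno.EINVAL
        return 0
    if protocol == ETH_P_IPV6:
        if icmp_type != ICMPV6_DEST_UNREACH:
            return -errno.EOPNOTSUPP
        if code < 0 or code > ICMPV6_REJECT_ROUTE:
            return -errno.EINVAL
        return 0
    # The kernel-doc lists -EPROTONOSUPPORT for non-IP protocols.
    return -errno.EPROTONOSUPPORT
```

With this model, an IPv4 port-unreachable request (code 3) passes, while `ICMP_FRAG_NEEDED` or an ARP frame is refused before any packet would be cloned.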


* Re: [PATCH net] net: enetc: fix potential divide-by-zero when num_vsi is zero
From: patchwork-bot+netdevbpf @ 2026-06-25 16:21 UTC
  To: Wei Fang
  Cc: claudiu.manoil, vladimir.oltean, xiaoning.wang, andrew+netdev,
	davem, edumazet, kuba, pabeni, Frank.Li, wei.fang, imx, netdev,
	linux-kernel
In-Reply-To: <20260624072726.1238903-1-wei.fang@oss.nxp.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 24 Jun 2026 15:27:26 +0800 you wrote:
> From: Wei Fang <wei.fang@nxp.com>
> 
> For i.MX94 series, all the standalone ENETCs do not support SR-IOV, so
> pf->caps.num_vsi is zero. This leads to a divide-by-zero in
> enetc4_default_rings_allocation() when distributing rings among PF and
> VFs.
> 
> [...]

Here is the summary with links:
  - [net] net: enetc: fix potential divide-by-zero when num_vsi is zero
    https://git.kernel.org/netdev/net/c/5da65537792b

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH] octeontx2-af: Free BPID bitmap on setup failure
From: patchwork-bot+netdevbpf @ 2026-06-25 16:20 UTC
  To: haoxiang_li2024
  Cc: sgoutham, lcherian, gakula, hkelam, sbhatta, andrew+netdev, davem,
	edumazet, kuba, pabeni, horms, netdev, linux-kernel, stable
In-Reply-To: <20260623114316.2182271-1-haoxiang_li2024@163.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Tue, 23 Jun 2026 19:43:16 +0800 you wrote:
> nix_setup_bpids() allocates bp->bpids with rvu_alloc_bitmap(), which uses
> a plain kcalloc(). If any of the following devm_kcalloc() allocations for
> the BPID mapping arrays fails, the function returns without freeing the
> bitmap. Free the BPID bitmap before returning from those error paths.
> 
> Fixes: d6212d2e41a0 ("octeontx2-af: Create BPIDs free pool")
> Cc: stable@vger.kernel.org
> Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
> 
> [...]

Here is the summary with links:
  - octeontx2-af: Free BPID bitmap on setup failure
    https://git.kernel.org/netdev/net/c/36323f54cd32

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH v2 net 0/3] net: udp_tunnel: fix races and use-after-free
From: patchwork-bot+netdevbpf @ 2026-06-25 16:20 UTC
  To: Eric Dumazet
  Cc: davem, kuba, pabeni, horms, samsun1006219, sdf, netdev,
	eric.dumazet
In-Reply-To: <20260625065938.654652-1-edumazet@google.com>

Hello:

This series was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Thu, 25 Jun 2026 06:59:35 +0000 you wrote:
> Yue Sun reported a use-after-free and debugobjects warning in
> udp_tunnel_nic_device_sync_work() when concurrently creating and
> destroying netdevsim and geneve devices.
> 
> This series resolves the UAF and the underlying data races that
> make the fix vulnerable.
> 
> [...]

Here is the summary with links:
  - [v2,net,1/3] net: udp_tunnel: prevent double queueing in udp_tunnel_nic_device_sync
    https://git.kernel.org/netdev/net/c/ecf69d4b4337
  - [v2,net,2/3] net: udp_tunnel: convert state flags to atomic bitops
    (no matching commit)
  - [v2,net,3/3] net: udp_tunnel: use atomic bitops for missed bitmap
    (no matching commit)

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH] net: sparx5: unregister blocking notifier on init failure
From: patchwork-bot+netdevbpf @ 2026-06-25 16:20 UTC
  To: haoxiang_li2024
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, Steen.Hegelund,
	daniel.machon, UNGLinuxDriver, kees, horms, bjarni.jonasson,
	lars.povlsen, netdev, linux-arm-kernel, linux-kernel, stable
In-Reply-To: <20260623115714.2192074-1-haoxiang_li2024@163.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Tue, 23 Jun 2026 19:57:14 +0800 you wrote:
> sparx5_register_notifier_blocks() registers the switchdev blocking
> notifier before allocating the ordered workqueue. If the workqueue
> allocation fails, the error path unregisters the switchdev and netdevice
> notifiers, but leaves the blocking notifier registered.
> 
> Add a separate error label for the workqueue allocation failure path and
> unregister the switchdev blocking notifier there.
> 
> [...]

Here is the summary with links:
  - net: sparx5: unregister blocking notifier on init failure
    https://git.kernel.org/netdev/net/c/483be61b4a9a

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH net-next v2] selftests: tls: size splice_short pipe by page size
From: patchwork-bot+netdevbpf @ 2026-06-25 16:20 UTC
  To: Nirmoy Das; +Cc: kuba, sd, john.fastabend, horms, netdev, linux-kernel
In-Reply-To: <20260624134416.3235403-1-nirmoyd@nvidia.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 24 Jun 2026 06:44:16 -0700 you wrote:
> splice_short grows its pipe with (MAX_FRAGS + 1) * 0x1000 so it can
> queue one short vmsplice() buffer for each fragment before draining the
> pipe. That assumes 4K pipe buffers.
> 
> On 64K-page kernels the request is rounded to 262144 bytes, which
> provides only four pipe buffers. The fifth one-byte vmsplice() blocks in
> pipe_wait_writable and the test times out before it reaches the TLS path.
> 
> [...]

Here is the summary with links:
  - [net-next,v2] selftests: tls: size splice_short pipe by page size
    https://git.kernel.org/netdev/net/c/3e52f56875c6

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
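
The arithmetic behind the quoted splice_short description can be checked with a short sketch. It rests on two assumptions that are not stated in the message: that the kernel rounds a requested pipe size up to a power of two and derives the slot count by dividing by the page size, and that MAX_FRAGS is 48 (a value chosen here because it is consistent with the 262144-byte figure above). `pipe_slots` is a hypothetical helper, not a kernel or selftest function.

```python
def roundup_pow_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def pipe_slots(request_bytes, page_size):
    """Assumed model of pipe resizing: return (pipe bytes, pipe buffers)."""
    size = roundup_pow_of_two(max(request_bytes, page_size))
    return size, size // page_size


MAX_FRAGS = 48  # assumption, see the paragraph above

# Old sizing: a hard-coded 4K per fragment.
old_request = (MAX_FRAGS + 1) * 0x1000
on_4k = pipe_slots(old_request, 4096)     # plenty of buffers
on_64k = pipe_slots(old_request, 65536)   # same byte size, far fewer buffers

# Sizing by page size instead: one page per fragment.
new_on_64k = pipe_slots((MAX_FRAGS + 1) * 65536, 65536)
```

Under these assumptions the old request rounds to the same byte size on both kernels, but on 64K pages it yields only four buffers, which matches the behaviour the commit message describes.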




* Re: [PATCH v2 net 0/2] tipc: syzbot related fixes
From: patchwork-bot+netdevbpf @ 2026-06-25 16:20 UTC
  To: Eric Dumazet
  Cc: davem, kuba, pabeni, horms, kuniyu, lucien.xin, jmaloy,
	tipc-discussion, netdev, eric.dumazet
In-Reply-To: <20260623173030.2925059-1-edumazet@google.com>

Hello:

This series was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Tue, 23 Jun 2026 17:30:28 +0000 you wrote:
> First patch fixes a recent syzbot report.
> 
> Second patch is inspired by numerous syzbot soft lockup
> reports with RTNL pressure.
> 
> Eric Dumazet (2):
>   tipc: fix UAF in cleanup_bearer() due to premature dst_cache_destroy()
>   tipc: avoid busy looping in tipc_exit_net()
> 
> [...]

Here is the summary with links:
  - [v2,net,1/2] tipc: fix UAF in cleanup_bearer() due to premature dst_cache_destroy()
    https://git.kernel.org/netdev/net/c/7116764ca53f
  - [v2,net,2/2] tipc: avoid busy looping in tipc_exit_net()
    https://git.kernel.org/netdev/net/c/c1481c94e74c

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH net] net: ethernet: qualcomm: ppe: Demote from supported and fix maintainer addresses
From: patchwork-bot+netdevbpf @ 2026-06-25 16:20 UTC
  To: Krzysztof Kozlowski
  Cc: andersson, mturquette, sboyd, bmasney, robh, krzk+dt, conor+dt,
	jie.luo, andrew+netdev, davem, edumazet, kuba, pabeni,
	quic_leiwei, quic_suruchia, quic_pavir, linux-kernel,
	linux-arm-msm, linux-clk, devicetree, netdev
In-Reply-To: <20260623073307.36483-2-krzysztof.kozlowski@oss.qualcomm.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Tue, 23 Jun 2026 09:33:08 +0200 you wrote:
> Emails to the maintainer of Qualcomm PPE Ethernet driver (Luo Jie
> <quic_luoj@quicinc.com>) bounce permanently (full mailbox), because the
> "quicinc.com" addresses were deprecated for public work.  All Qualcomm
> contributors are aware of that and were asked to fix their addresses.
> 
> Driver is not supported - in terms of how netdev understands supported
> commitment - if maintainer does not care to receive the patches for its
> code, so demote it to "maintained" to reflect true status.
> 
> [...]

Here is the summary with links:
  - [net] net: ethernet: qualcomm: ppe: Demote from supported and fix maintainer addresses
    https://git.kernel.org/netdev/net/c/efd7fb21bad8

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH net] dt-bindings: net: renesas,ether: Drop example "ethernet-phy-ieee802.3-c22" fallback
From: patchwork-bot+netdevbpf @ 2026-06-25 16:20 UTC
  To: Rob Herring
  Cc: niklas.soderlund, andrew+netdev, davem, edumazet, kuba, pabeni,
	krzk+dt, conor+dt, geert+renesas, magnus.damm, sergei.shtylyov,
	netdev, linux-renesas-soc, devicetree, linux-kernel
In-Reply-To: <20260624150250.131966-2-robh@kernel.org>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 24 Jun 2026 10:02:50 -0500 you wrote:
> Fix the Micrel PHY in the example which shouldn't have the
> fallback "ethernet-phy-ieee802.3-c22" compatible:
> 
> Documentation/devicetree/bindings/net/renesas,ether.example.dtb: ethernet-phy@1 \
>   (ethernet-phy-id0022.1537): compatible: ['ethernet-phy-id0022.1537', 'ethernet-phy-ieee802.3-c22'] is too long
>         from schema $id: http://devicetree.org/schemas/net/micrel.yaml
> 
> [...]

Here is the summary with links:
  - [net] dt-bindings: net: renesas,ether: Drop example "ethernet-phy-ieee802.3-c22" fallback
    https://git.kernel.org/netdev/net/c/14eb1d2c03b3

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH net-next] openvswitch: conntrack: annotate ct limit hlist traversal
From: patchwork-bot+netdevbpf @ 2026-06-25 16:20 UTC
  To: Runyu Xiao
  Cc: aconole, echaudro, i.maximets, davem, edumazet, kuba, pabeni,
	horms, netdev, dev, linux-kernel, jianhao.xu
In-Reply-To: <20260624150149.3510541-1-runyu.xiao@seu.edu.cn>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 24 Jun 2026 23:01:49 +0800 you wrote:
> ct_limit_set() is documented as being called with ovs_mutex held. It
> walks the ct limit hlist with hlist_for_each_entry_rcu(), but the
> iterator does not currently pass the OVS lockdep condition used
> elsewhere for RCU-protected OVS objects.
> 
> Pass lockdep_ovsl_is_held() to the iterator. This matches the function's
> existing caller contract and lets CONFIG_PROVE_RCU_LIST distinguish the
> ovs_mutex-protected update path from the RCU read-side ct_limit_get()
> path.
> 
> [...]

Here is the summary with links:
  - [net-next] openvswitch: conntrack: annotate ct limit hlist traversal
    https://git.kernel.org/netdev/net/c/0e901ee5c6f9

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH V2 net 0/4] net: hns3: fix configuration deadlocks and refactor link setup
From: patchwork-bot+netdevbpf @ 2026-06-25 16:20 UTC
  To: Jijie Shao
  Cc: davem, edumazet, kuba, pabeni, andrew+netdev, horms, shenjian15,
	liuyonglong, chenhao418, huangdonghua3, yangshuaisong, netdev,
	linux-kernel
In-Reply-To: <20260624141319.271439-1-shaojijie@huawei.com>

Hello:

This series was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 24 Jun 2026 22:13:15 +0800 you wrote:
> This patch series addresses a sequence of link configuration deadlocks
> and parameter contamination issues in the hns3 network driver, which
> typically occur during hardware resets or driver initialization under
> specific user-configured scenarios.
> 
> The bugs stem from asynchronous discrepancies between the MAC state
> machine and cached user requests during sudden hardware resets, leading
> to invalid parameter combos or frozen registers.
> 
> [...]

Here is the summary with links:
  - [V2,net,1/4] net: hns3: unify copper port ksettings configuration path
    https://git.kernel.org/netdev/net/c/d77e98f8b2b3
  - [V2,net,2/4] net: hns3: refactor MAC autoneg and speed configuration
    https://git.kernel.org/netdev/net/c/c01f6e6bdc1c
  - [V2,net,3/4] net: hns3: fix permanent link down deadlock after reset
    https://git.kernel.org/netdev/net/c/c711f6d1cee9
  - [V2,net,4/4] net: hns3: differentiate autoneg default values between copper and fiber
    https://git.kernel.org/netdev/net/c/d9d349c4e8a0

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH v5 net] net: mana: Optimize irq affinity for low vcpu configs
From: patchwork-bot+netdevbpf @ 2026-06-25 16:20 UTC
  To: Shradha Gupta
  Cc: decui, wei.liu, haiyangz, kys, andrew+netdev, davem, edumazet,
	kuba, pabeni, kotaranov, horms, ernis, dipayanroy, shirazsaleem,
	mhklinux, longli, yury.norov, linux-hyperv, linux-kernel, netdev,
	paulros, shradhagupta, ssengar, stable, ynorov
In-Reply-To: <20260624072138.1632849-1-shradhagupta@linux.microsoft.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 24 Jun 2026 00:21:35 -0700 you wrote:
> Before the commit 755391121038 ("net: mana: Allocate MSI-X vectors
> dynamically"), all the MANA IRQs were assigned statically and together
> during early driver load.
> 
> After this commit, the IRQ allocation for MANA was done in two phases.
> HWC IRQ allocated earlier and then, queue IRQs dynamically added at a
> later point. By this time, the IRQ weights on vCPUs can become imbalanced
> and if IRQ count is greater than the vCPU count the topology aware IRQ
> distribution logic in MANA can cause multiple MANA IRQs to land on the
> same vCPUs, while other sibling vCPUs have none (case 1).
> 
> [...]

Here is the summary with links:
  - [v5,net] net: mana: Optimize irq affinity for low vcpu configs
    https://git.kernel.org/netdev/net/c/5316394b1752

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH v3 4/4] vhost/vsock: add VHOST_RESET_OWNER ioctl
From: Pavel Tikhomirov @ 2026-06-25 16:13 UTC
  To: Andrey Drobyshev, linux-kernel
  Cc: kvm, virtualization, netdev, sgarzare, mst, stefanha,
	dongli.zhang, maciej.szmigiero, bchaney, mark.kanda, den
In-Reply-To: <20260625155416.480669-5-andrey.drobyshev@virtuozzo.com>

Reviewed-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

On 6/25/26 17:54, Andrey Drobyshev wrote:
> From: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> 
> This ioctl is needed for QEMU's CPR (checkpoint-restore) migration of
> the guest with vhost-vsock device.  For this to work, we need to reset
> the device ownership on the source side by calling RESET_OWNER, and then
> claim it on the dest side by calling SET_OWNER.  We expect not to lose any
> AF_VSOCK connection while this happens.
> 
> RESET_OWNER keeps the guest CID hashed, so that connections survive. That
> leaves the device reachable by a lockless send/cancel path while the worker
> is being torn down: a concurrent vhost_transport_send_pkt() or
> vhost_transport_cancel_pkt() can call vhost_vq_work_queue() as
> vhost_workers_free() frees the worker.  That might cause a use-after-free
> of vq->worker.  In addition, any work queued onto the dying worker leaves
> VHOST_WORK_QUEUED stuck, stalling send_pkt_queue after resume.
> 
> Fence the send/cancel paths around the teardown: send_pkt()/cancel_pkt()
> only kick the worker while the backend is alive.  And reset_owner() calls
> synchronize_rcu() after drop_backends() so in-flight send/cancel finish
> before the worker is freed.
> 
> Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> ---
>  drivers/vhost/vsock.c | 51 +++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 49 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index 81d4f7209719..f0a0aa7d3200 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -318,7 +318,14 @@ vhost_transport_send_pkt(struct sk_buff *skb, struct net *net)
>  		atomic_inc(&vsock->queued_replies);
>  
>  	virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
> -	vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX], &vsock->send_pkt_work);
> +
> +	/* Skip the kick once the backend is gone (stop/RESET_OWNER); the skb
> +	 * stays queued and vhost_vsock_start() drains it. Pairs with the
> +	 * synchronize_rcu() in vhost_vsock_reset_owner().
> +	 */
> +	if (data_race(vhost_vq_get_backend(&vsock->vqs[VSOCK_VQ_RX])))
> +		vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX],
> +				    &vsock->send_pkt_work);
>  
>  	rcu_read_unlock();
>  	return len;
> @@ -346,7 +353,15 @@ vhost_transport_cancel_pkt(struct vsock_sock *vsk)
>  		int new_cnt;
>  
>  		new_cnt = atomic_sub_return(cnt, &vsock->queued_replies);
> -		if (new_cnt + cnt >= tx_vq->num && new_cnt < tx_vq->num)
> +
> +		/* Skip the kick once the backend is gone (stop/RESET_OWNER):
> +		 * vhost_poll_queue() would touch the worker which is being freed
> +		 * by teardown, e.g. on RESET_OWNER.  Pairs with the
> +		 * synchronize_rcu() in vhost_vsock_reset_owner().  The TX VQ is
> +		 * re-kicked by vhost_vsock_start().
> +		 */
> +		if (data_race(vhost_vq_get_backend(tx_vq)) &&
> +		    new_cnt + cnt >= tx_vq->num && new_cnt < tx_vq->num)
>  			vhost_poll_queue(&tx_vq->poll);
>  	}
>  
> @@ -903,6 +918,36 @@ static int vhost_vsock_set_features(struct vhost_vsock *vsock, u64 features)
>  	return -EFAULT;
>  }
>  
> +static int vhost_vsock_reset_owner(struct vhost_vsock *vsock)
> +{
> +	struct vhost_iotlb *umem;
> +	long err;
> +
> +	mutex_lock(&vsock->dev.mutex);
> +	err = vhost_dev_check_owner(&vsock->dev);
> +	if (err)
> +		goto done;
> +	umem = vhost_dev_reset_owner_prepare();
> +	if (!umem) {
> +		err = -ENOMEM;
> +		goto done;
> +	}
> +	vhost_vsock_drop_backends(vsock);
> +
> +	/* Let in-flight send_pkt() callers stop touching the worker before the
> +	 * flush + free below. Pairs with the backend check in
> +	 * vhost_transport_send_pkt().
> +	 */
> +	synchronize_rcu();
> +
> +	vhost_vsock_flush(vsock);
> +	vhost_dev_stop(&vsock->dev);
> +	vhost_dev_reset_owner(&vsock->dev, umem);
> +done:
> +	mutex_unlock(&vsock->dev.mutex);
> +	return err;
> +}
> +
>  static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,
>  				  unsigned long arg)
>  {
> @@ -946,6 +991,8 @@ static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,
>  			return -EOPNOTSUPP;
>  		vhost_set_backend_features(&vsock->dev, features);
>  		return 0;
> +	case VHOST_RESET_OWNER:
> +		return vhost_vsock_reset_owner(vsock);
>  	default:
>  		mutex_lock(&vsock->dev.mutex);
>  		r = vhost_dev_ioctl(&vsock->dev, ioctl, argp);

-- 
Best regards, Pavel Tikhomirov
Senior Software Developer, Virtuozzo.



* Re: [PATCH net-next] Documentation: networking: Add a test plan for ethtool pause validation
From: Andrew Lunn @ 2026-06-25 16:12 UTC
  To: Maxime Chevallier
  Cc: Jakub Kicinski, davem, Eric Dumazet, Paolo Abeni, Simon Horman,
	Russell King, Heiner Kallweit, Jonathan Corbet, Shuah Khan,
	Oleksij Rempel, Vladimir Oltean, Florian Fainelli,
	thomas.petazzoni, netdev, linux-kernel, linux-doc
In-Reply-To: <38bafe7e-d419-46f7-8fa7-87e9183e578c@bootlin.com>

> This isn't Sphinx, but I've come up with something like this for a
> test definition:
> 
> 
> @ksft_ethtool_needs_supported_anyof([Pause, Asym_Pause])
> def test_ethtool_pause_advertising(cfg, peer) -> None:
>     """Pause advertisement
> 
>     Validate that changing pause params through the ETHTOOL_MSG_PAUSE command
>     translates to a change in the advertised pause params, and that these
>     parameters are correct w.r.t the supported pause params and requested pause
>     params.
>     
>     This exercises the .set_pauseparams() ethtool ops for MAC configuration,
>     as well as the reconfiguration of the PHY's advertising and negotiation.
>     
>     On non-phylink MACs, the MAC should call phy_set_sym_pause() to update the
>     PHY's advertising, and restart a negotiation with phy_start_aneg() if
>     need be. Failure to do so will result in the wrong advertising parameters.
>     
>     Pn phylink-enabled MACs, phylink deals with the PHY reconfiguration provided

On 

>     the MAC driver calls phylink_ethtool_set_pauseparam().
>     
>     Failing this test likely means that the PHY driver is not correctly advertising
>     pause settings, either due to the MAC not triggering a PHY reconfiguration,
>     a misconfiguration of the advertising registers by the PHY, or by
>     mis-handling the phydev->advertising bitfield in the PHY driver directly.
>     
>     The validation is made by looking at the advertised modes locally, as well as
>     what the peer's 'lp_advertising' values report.
> 
>     cfg -- local device's interface configuration
>     peer -- peer device handle

Plain Sphinx can be made to pick up this method documentation and
include it in the generated documentation. You would use something like

.. autofunction:: test_ethtool_pause_advertising

in the .rst file.

I've no idea if the kernel configuration of Sphinx allows this. At the
moment, I would not spend too much time on getting Sphinx to generate
documentation. I would say that is nice to have. The description
itself is more important.

>     """
> 
>     # Initial conditions :
>     # - Local interface is admin UP, and reports lowlayer link UP
>     # - Remote interface is admin UP, and reports lowlayer link UP
>     #
>     # Test 1
>     # - SKIP if supported doesn't contain "Pause"
>     # - run 'ethtool -A ethX rx on tx on autoneg on'
>     # - FAIL if the return isn't 0
>     # - FAIL if ETHTOOL_A_LINKMODES_OURS's advertised values do not contain
>     #   "Pause" or contain "Asym_Pause"
>     # - FAIL if peer's lp_advertising doesn't contain "Pause" or contains
>     #   "Asym_Pause"
>     # - Succeed otherwise
>     #
>     # Test 2
>     # - SKIP if supported doesn't contain both "Pause" and "Asym_Pause"
>     # - run 'ethtool -A ethX rx on tx on autoneg on'
>     # - FAIL if the return isn't 0
>     # - FAIL if ETHTOOL_A_LINKMODES_OURS's advertised values do not contain
>     #   "Pause" or contain "Asym_Pause"
>     # - FAIL if peer's lp_advertising doesn't contain "Pause" or contains
>     #   "Asym_Pause"
>     #
>     # ...
>    
> The annotation defines the pre-requisites in terms of locally supported
> linkmodes, and we have a docstring containing information for developers
> to debug their drivers. What I'm unsure about is the commented-out part
> below: either one big function testing multiple adjacent scenarios,
> or individual functions.

Sphinx follows Python's object-oriented structure. So you could have a
class test_ethtool_pause_advertising, with class documentation, and
then methods within the class which are individual tests. The
commented-out section would then be method documentation.

However, I've no idea if the selftest code allows for classes of test
methods. It looks like ksft_run() takes a list of methods, so you can
probably instantiate the class and then pass it methods from the
class?

I would say you are right about picking one of the simple test cases,
playing with it, defining and implementing it, and seeing what comes out
at the end.

	Andrew
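
The class-based layout suggested above can be sketched in plain Python. This is an illustration under stated assumptions, not selftest code: `PauseAdvertisingTests`, its methods and the dictionary-shaped `cfg`/`peer` are hypothetical, and `run_cases` is a stand-in for a runner that takes a list of callables. Whether the real ksft_run() accepts bound methods is exactly the open question in the message; the sketch only shows that bound methods behave like ordinary callables and keep their name and docstring.

```python
class PauseAdvertisingTests:
    """Pause advertisement tests sharing one cfg/peer pair (hypothetical)."""

    def __init__(self, cfg, peer):
        self.cfg = cfg
        self.peer = peer

    def test_sym_pause(self):
        """Symmetric pause: 'Pause' is advertised, 'Asym_Pause' is not."""
        adv = self.cfg["advertising"]
        return "Pause" in adv and "Asym_Pause" not in adv

    def test_peer_sees_pause(self):
        """The peer's lp_advertising mirrors the local advertisement."""
        return set(self.peer["lp_advertising"]) == set(self.cfg["advertising"])


def run_cases(cases):
    """Hypothetical stand-in for a runner taking a list of callables."""
    return {case.__name__: case() for case in cases}


# Fake data standing in for what the real cfg/peer objects would report.
suite = PauseAdvertisingTests(cfg={"advertising": ["Pause"]},
                              peer={"lp_advertising": ["Pause"]})
results = run_cases([suite.test_sym_pause, suite.test_peer_sees_pause])
```

The class docstring would carry the driver-debugging background, and each method docstring would carry what is currently the commented-out per-test description.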


* Re: [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim
From: Stanislav Fomichev @ 2026-06-25 16:05 UTC
  To: Jason Xing
  Cc: Maciej Fijalkowski, netdev, bpf, magnus.karlsson, stfomichev,
	kuba, pabeni, horms, bjorn
In-Reply-To: <CAL+tcoC789GrOXvvYVOfxhL4iLCdFEAZQ+WCENsMjbLrJnfiYw@mail.gmail.com>

On 06/25, Jason Xing wrote:
> On Thu, Jun 25, 2026 at 12:37 AM Maciej Fijalkowski
> <maciej.fijalkowski@intel.com> wrote:
> >
> > On Wed, Jun 24, 2026 at 08:38:20AM -0700, Stanislav Fomichev wrote:
> > > On 06/23, Maciej Fijalkowski wrote:
> > > > Hi,
> > > >
> > > > This series fixes several AF_XDP multi-buffer Tx paths where descriptors
> > > > consumed from the Tx ring are not consistently returned to userspace
> > > > through the completion ring when the packet is later dropped as invalid.
> > > >
> > > > The affected cases are invalid or oversized multi-buffer Tx packets in
> > > > both the generic and zero-copy paths. In these cases, the kernel can
> > > > consume one or more Tx descriptors while building or validating a
> > > > multi-buffer packet, then drop the packet before it reaches the device.
> > > > Userspace still owns the UMEM buffers only after the corresponding
> > > > addresses are returned through the CQ. Missing completions therefore
> > > > make userspace lose track of those buffers.
> > > >
> > > > The generic path fixes cover three related cases:
> > > > * partially built multi-buffer skbs dropped by xsk_drop_skb();
> > > > * continuation descriptors left in the Tx ring after xsk_build_skb()
> > > >   reports overflow;
> > > > * invalid descriptors encountered in the middle of a multi-buffer
> > > >   packet, including the offending invalid descriptor itself.
> > > >
> > > > The zero-copy path is handled separately. The batched Tx parser now
> > > > distinguishes descriptors that can be passed to the driver from
> > > > descriptors that are consumed only because they belong to an invalid
> > > > multi-buffer packet. Reclaim-only descriptors are written to the CQ
> > > > address area and published in completion order, after any earlier
> > > > driver-visible Tx descriptors.
> > > >
> > > > The ZC batching path can also retain drain state when userspace has not
> > > > yet provided the end of an invalid multi-buffer packet. To keep this
> > > > state local to the singular batched path, the series prevents a second
> > > > Tx socket from joining the same pool while such drain state exists.
> > > > During the singular-to-shared transition, Tx batching is gated,
> > > > pre-existing readers are waited out, and bind fails with -EAGAIN if the
> > > > existing socket still has pending drain state. This avoids adding
> > > > multi-buffer drain handling to the shared-UMEM fallback path.
> > > >
> > > > The last two patches update xskxceiver so the tests account invalid
> > > > multi-buffer Tx packets as descriptors that must be reclaimed, while
> > > > still not expecting those invalid packets on the Rx side.
> > > >
> > > > This is a follow-up to Jason's changes [0] which were addressing generic
> > > > xmit only and this set allows me to pass full xskxceiver test suite run
> > > > against ice driver.
> > >
> > > There is a fair amount of feedback from sashiko already :-( So the meta
> > > question from me is: is it time to scrap our current approach where
> > > we parse descriptor by descriptor? (and maintain half-baked skb and
> > > half-consumed descriptor queues)
> > >
> > > Should we:
> > >
> > > 1. do desc[MAX_SKB_FRAGS] and xskq_cons_peek_desc until we exhaust
> > > PKT_CONT (if the last packet has PKT_CONT, return EOVERFLOW to userspace
> > > and do a full stop here)
> > > 2. now that we really know the number of valid descriptors -> reserve
> > > the cq space (if not -> EAGAIN)
> > > 3. pre-allocate everything here (if at any point we have ENOMEM -> cleanup
> > > locally, don't ever create semi-initialized skb)
> > > 4. construct the skb
> > > 5. xmit
> >
> > Yeah generic xmit became utterly horrible, haven't gone through sashiko
> > reviews yet, but bear in mind this set also aligns zc side to what was
> > previously being addressed by Jason.
> >
> > I believe planned logistics were to get these fixes onto net and then
> > Jason had an implementation of batching on generic xmit, directed towards
> > -next and that's where we could address current flow.
> 
> Agreed. That's what I'm hoping for. There would be much more
> discussion on how to do batch xmit in an elegant way, I believe.

This doesn't have to depend on the batch rewrite. We should be able to rewrite
the non-zc path in net; this is still technically a fix, not feature work.

There were already a couple of revisions with this drain_cont approach
and every time I look at it, it feels like the cure is worse than the
disease :-( Obviously not gonna stop you from going with the current approach,
but these fixes feel like a bit of a wasted effort to me (since the bugs keep
coming and we are piling on more complexity).
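
For illustration only, the peek-then-commit flow from the numbered list
quoted above could look roughly like the userspace sketch below. This is
not kernel code: struct desc, struct ring, peek_packet(), xmit_one() and
MAX_FRAGS are hypothetical stand-ins (for the Tx ring,
xskq_cons_peek_desc() and MAX_SKB_FRAGS), and returning -EAGAIN when the
ring does not yet hold the end of a packet is an assumption, not something
stated in the thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FRAGS 4	/* hypothetical stand-in for MAX_SKB_FRAGS */

struct desc {
	unsigned long addr;
	unsigned int len;
	bool pkt_cont;		/* more descriptors of this packet follow */
};

struct ring {
	const struct desc *descs;
	size_t count;		/* descriptors produced by userspace */
	size_t cons;		/* descriptors already consumed */
};

/*
 * Step 1: copy out one whole packet without consuming anything.
 * Returns the number of descriptors, -EOVERFLOW if MAX_FRAGS descriptors
 * were peeked and the last one still has pkt_cont set, or -EAGAIN
 * (assumed) if the ring is empty or the packet end is not there yet.
 */
static int peek_packet(const struct ring *tx, struct desc out[MAX_FRAGS])
{
	size_t avail = tx->count - tx->cons;
	size_t i;

	for (i = 0; i < MAX_FRAGS && i < avail; i++) {
		out[i] = tx->descs[tx->cons + i];
		if (!out[i].pkt_cont)
			return (int)(i + 1);
	}
	return i == MAX_FRAGS ? -EOVERFLOW : -EAGAIN;
}

/*
 * Steps 2-5, reduced to bookkeeping: nothing is consumed until the whole
 * packet is known and completion-queue space is reserved, so there is no
 * half-built packet to unwind on failure.
 */
static int xmit_one(struct ring *tx, size_t *cq_free)
{
	struct desc frags[MAX_FRAGS];
	int n = peek_packet(tx, frags);

	if (n < 0)
		return n;
	if (*cq_free < (size_t)n)
		return -EAGAIN;		/* step 2: no CQ space, retry later */

	/* Steps 3-5 (allocate, build, transmit) would go here. */
	*cq_free -= (size_t)n;
	tx->cons += (size_t)n;
	assert(tx->cons <= tx->count);
	return n;
}
```

The property the sketch tries to show is that every failure return leaves
tx->cons untouched, so there is no partially consumed packet to drain.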

^ permalink raw reply

* Re: [PATCH net 4/4] selftests: bonding: add a test for VLAN propagation over a bonded real device
From: Stanislav Fomichev @ 2026-06-25 15:49 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, jv, sdf,
	dongchenchen2, idosch, n05ec, yuantan098, kuniyu, nb,
	aleksandr.loktionov, dtatulea
In-Reply-To: <20260624182018.2445732-5-kuba@kernel.org>

On 06/24, Jakub Kicinski wrote:
> Add a regression test for the VLAN notifier handling that the netdev_work
> deferral fixed.
> 
> A VLAN's real device propagates its UP/DOWN, MTU and feature changes onto
> the VLANs stacked on top of it. This used to be done synchronously from the
> real device's notifier and deadlocked when the real device was brought up
> while enslaved to a bond (instance lock held across NETDEV_UP) and the VLAN
> on top was itself a bond member: the synchronous propagation re-entered the
> stack and took the same instance lock again.
> 
> The test covers both halves:
>  - that the deferred UP/DOWN, MTU and feature propagation actually lands on
>    the VLAN (link state and MTU use an ops-locked dummy, i.e. the deferral
>    path; features use veth, which exports vlan_features to inherit), and
>  - that the deadlock-prone topology - a VLAN on a dummy, with the VLAN and
>    the dummy each enslaved to a different bond - can be built without
>    hanging.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Acked-by: Stanislav Fomichev <sdf@fomichev.me>

^ permalink raw reply

* Re: [PATCH net 3/4] vlan: defer real device state propagation to netdev_work
From: Stanislav Fomichev @ 2026-06-25 15:49 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, jv, sdf,
	dongchenchen2, idosch, n05ec, yuantan098, kuniyu, nb,
	aleksandr.loktionov, dtatulea, syzbot+09da62a8b78959ceb8bb,
	syzbot+cb67c392b0b8f0fd0fc1, syzbot+9bb8bd77f3966641f298
In-Reply-To: <20260624182018.2445732-4-kuba@kernel.org>

On 06/24, Jakub Kicinski wrote:
> vlan_device_event() generates nested UP/DOWN, MTU and feature
> change events. It executes an event for the VLAN device directly
> from the notifier - while the locks of the lower device are held.
> 
> This causes deadlocks, for example:
> 
>   bond    (3) bond_update_speed_duplex(vlan)
>     |           ^                v
>   vlan    (2) UP(vlan)    (4) vlan_ethtool_get_link_ksettings()
>     |           ^                v
>   dummy   (1) UP(dummy)   (5) __ethtool_get_link_ksettings()
> 
> The dummy device is ops locked, vlan creates a nested event (2),
> then bond wants to ask vlan for link state (3). bond uses the
> "I'm already holding the instance lock" flavor of API. But in
> this case the lock held refers to vlan itself. We hit vlan's
> link settings trampoline (4) and call __ethtool_get_link_ksettings()
> which tries to lock dummy. Deadlock. There's no clean way for us
> to tell the vlan_ethtool_get_link_ksettings() that the caller
> is already in lower device's critical section.
> 
> Defer the propagation to the per-netdev work facility instead:
> the notifier only schedules netdev_work_sched(vlandev, VLAN_WORK_*),
> and ndo_work (vlan_dev_work) applies the change later. Hopefully
> nobody expects the VLAN state changes to be instantaneous.
> 
> If someone does expect the changes to be instantaneous we will
> have to do the same thing Stan did for rx_mode and "strategically"
> place sync calls, to make sure such delayed works are executed
> after we drop the ops lock but before we drop rtnl_lock.
> 
> Stan suggests that if we need that down the line we may
> consider reshaping the mechanism into "async notifications".
> AFAICT only vlan does this sort of netdev open chaining,
> so as a first try I think that sticking the complexity into
> the vlan code makes sense.
> 
> One corner case is that we need to cancel the event if user
> explicitly changes the state before work could run. Consider
> the following operations with vlan0 on top of dummy0:
> 
>   ip link set dev dummy0 up    # queues work to up vlan0
>   ip link set dev vlan0 down   # user explicitly downs the vlan
>   ndo_work                     # acts on the stale event
> 
> Reported-by: syzbot+09da62a8b78959ceb8bb@syzkaller.appspotmail.com
> Reported-by: syzbot+cb67c392b0b8f0fd0fc1@syzkaller.appspotmail.com
> Reported-by: syzbot+9bb8bd77f3966641f298@syzkaller.appspotmail.com
> Fixes: 9f275c2e9020 ("net: ethtool: make sure __ethtool_get_link_ksettings() is ops-locked")
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
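
As a rough illustration of the stale-event corner case described in the
quoted commit message, here is a userspace sketch of a pending-event mask
where an explicit user change cancels queued link work. It does not
reproduce the patch: struct dev, WORK_UP, WORK_DOWN and all helpers are
hypothetical names, and cancelling by clearing the pending bits is an
assumption about one possible way to implement it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event bits; the patch itself only mentions VLAN_WORK_*. */
enum {
	WORK_UP   = 1u << 0,
	WORK_DOWN = 1u << 1,
};

struct dev {
	bool up;
	unsigned int pending;	/* deferred events not yet applied */
};

/* Notifier side: record the lower device's link change, apply it later. */
static void work_sched_link(struct dev *d, bool up)
{
	d->pending &= ~(WORK_UP | WORK_DOWN);
	d->pending |= up ? WORK_UP : WORK_DOWN;
}

/* User side: an explicit change wins, so drop any queued link event. */
static void user_set_link(struct dev *d, bool up)
{
	d->pending &= ~(WORK_UP | WORK_DOWN);
	d->up = up;
}

/* Deferred side: apply whatever is still pending. */
static void work_run(struct dev *d)
{
	unsigned int ev = d->pending;

	d->pending = 0;
	/* At most one link event can be pending at a time. */
	assert(!((ev & WORK_UP) && (ev & WORK_DOWN)));
	if (ev & WORK_UP)
		d->up = true;
	else if (ev & WORK_DOWN)
		d->up = false;
}
```

With this shape, the sequence "lower device up, user downs the vlan, work
runs" leaves the vlan down, because the work finds nothing pending.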

^ permalink raw reply

* Re: [PATCH net 2/4] net: add the driver-facing netdev_work scheduling API
From: Stanislav Fomichev @ 2026-06-25 15:48 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, jv, sdf,
	dongchenchen2, idosch, n05ec, yuantan098, kuniyu, nb,
	aleksandr.loktionov, dtatulea
In-Reply-To: <20260624182018.2445732-3-kuba@kernel.org>

On 06/24, Jakub Kicinski wrote:
> With an extra event mask we can easily extend the netdev work
> to also service driver-defined events. For advanced drivers
> this is probably not a perfect match, but it makes running
> deferred work easier in simple cases.
> 
> Expose the netdev_work facility to drivers. Add helpers
> to schedule work and a dedicated ndo to perform the driver-
> -scheduled actions.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Acked-by: Stanislav Fomichev <sdf@fomichev.me>

^ permalink raw reply

* Re: [PATCH net 1/4] net: turn the rx_mode work into a generic netdev_work facility
From: Stanislav Fomichev @ 2026-06-25 15:48 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, jv, sdf,
	dongchenchen2, idosch, n05ec, yuantan098, kuniyu, nb,
	aleksandr.loktionov, dtatulea
In-Reply-To: <20260624182018.2445732-2-kuba@kernel.org>

On 06/24, Jakub Kicinski wrote:
> The rx_mode update runs from a workqueue: drivers have their
> ndo_set_rx_mode_async() callback executed by a single global
> work item under RTNL and ops lock. This is a useful pattern.
> 
> Support multiple "events" that need to be serviced and make RX_MODE
> sync the first one. Call the events "core" because later on
> we will let drivers define and schedule their own.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Acked-by: Stanislav Fomichev <sdf@fomichev.me>

^ permalink raw reply

* Re: [PATCH net] net: ethtool: keep rtnl_lock for ops using ethtool_op_get_link()
From: Stanislav Fomichev @ 2026-06-25 15:48 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms,
	Breno Leitao, joshwash, hramamurthy, anthony.l.nguyen,
	przemyslaw.kitszel, saeedm, tariqt, mbloch, leon, alexanderduyck,
	kernel-team, kys, haiyangz, wei.liu, decui, longli, jordanrhee,
	jacob.e.keller, nktgrg, debarghyak, mohsin.bashr, ernis, sdf, gal,
	linux-rdma, linux-hyperv
In-Reply-To: <20260624190439.2521219-1-kuba@kernel.org>

On 06/24, Jakub Kicinski wrote:
> Breno reports following splats on mlx5:
> 
>   RTNL: assertion failed at net/core/dev.c (2241)
>   WARNING: net/core/dev.c:2241 at netif_state_change+0xed/0x130, CPU#5: ethtool/1335
>   RIP: 0010:netif_state_change+0xf9/0x130
>   Call Trace:
>     <TASK>
>      __linkwatch_sync_dev+0xea/0x120
>      ethtool_op_get_link+0xe/0x20
>      __ethtool_get_link+0x26/0x40
>      linkstate_prepare_data+0x51/0x200
>      ethnl_default_doit+0x213/0x470
>      genl_family_rcv_msg_doit+0xdd/0x110
> 
> Looks like I missed ethtool_op_get_link() trying to sync linkwatch,
> which needs rtnl_lock. Not all drivers do this - bnxt doesn't,
> it just returns the link state, so add an opt-in bit.
> 
> Reported-by: Breno Leitao <leitao@debian.org>
> Fixes: 45079e00133e ("net: ethtool: optionally skip rtnl_lock on Netlink path for GET ops")
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Acked-by: Stanislav Fomichev <sdf@fomichev.me>

^ permalink raw reply

* [PATCH net] eth: fbnic: don't cache shinfo across skb realloc
From: Jakub Kicinski @ 2026-06-25 16:05 UTC (permalink / raw)
  To: davem
  Cc: netdev, edumazet, pabeni, andrew+netdev, horms, Jakub Kicinski,
	Alexander Duyck, kernel-team, mohsin.bashr

fbnic_tx_lso() calls skb_cow_head() which may reallocate the skb
including the shared info. We can't use the pointer calculated
before the call.
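
The hazard can be shown outside the kernel with a simplified userspace
model. Everything here is a hypothetical stand-in: struct buf plays the
role of the skb, struct trailer the role of skb_shared_info, and
buf_expand() models only the worst case of skb_cow_head(), where the data
is always moved to a new allocation and the old one is freed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct trailer {		/* stand-in for skb_shared_info */
	unsigned int gso_size;
};

struct buf {			/* stand-in for sk_buff */
	unsigned char *head;
	size_t size;		/* bytes before the trailer */
};

/* The trailer lives at the end of the data area, like skb_shinfo(). */
static struct trailer *buf_trailer(const struct buf *b)
{
	return (struct trailer *)(b->head + b->size);
}

static int buf_init(struct buf *b, size_t size, unsigned int gso_size)
{
	assert(size % _Alignof(struct trailer) == 0);
	b->head = malloc(size + sizeof(struct trailer));
	if (!b->head)
		return -1;
	memset(b->head, 0, size);
	b->size = size;
	buf_trailer(b)->gso_size = gso_size;
	return 0;
}

/* Always reallocates: any trailer pointer taken earlier is now dangling. */
static int buf_expand(struct buf *b, size_t extra)
{
	size_t new_size = b->size + extra;
	unsigned char *n;

	assert(new_size % _Alignof(struct trailer) == 0);
	n = malloc(new_size + sizeof(struct trailer));
	if (!n)
		return -1;
	memset(n, 0, new_size);
	memcpy(n, b->head, b->size);
	memcpy(n + new_size, buf_trailer(b), sizeof(struct trailer));
	free(b->head);
	b->head = n;
	b->size = new_size;
	return 0;
}

/* Safe pattern: look the trailer up only after the possible realloc. */
static unsigned int gso_size_after_expand(struct buf *b, size_t extra)
{
	if (buf_expand(b, extra))
		return 0;
	return buf_trailer(b)->gso_size;
}
```

Taking buf_trailer() before buf_expand() and dereferencing it afterwards
would read freed memory, which is the pattern the KASAN report below
points at.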

    BUG: KASAN: slab-use-after-free in fbnic_tx_lso.isra.0+0x668/0x8e0
    Read of size 4 at addr ff110000262edd98 by task swapper/5/0
    Call Trace:
     fbnic_tx_lso.isra.0+0x668/0x8e0
     fbnic_xmit_frame+0x622/0xba0
     dev_hard_start_xmit+0xf4/0x620

    Allocated by task 8653:
     __alloc_skb+0x11e/0x5f0
     alloc_skb_with_frags+0xcc/0x6c0
     sock_alloc_send_pskb+0x327/0x3f0
     __ip_append_data+0x188b/0x47a0
     ip_make_skb+0x24a/0x300
     udp_sendmsg+0x14d2/0x21e0

    Freed by task 0:
     kfree+0x123/0x5a0
     pskb_expand_head+0x36c/0xfa0
     fbnic_tx_lso.isra.0+0x500/0x8e0
     fbnic_xmit_frame+0x622/0xba0
     dev_hard_start_xmit+0xf4/0x620
     sch_direct_xmit+0x25b/0x1100

    The buggy address belongs to the object at ff110000262edc40
     which belongs to the cache skbuff_small_head of size 640
    The buggy address is located 344 bytes inside of
     freed 640-byte region [ff110000262edc40, ff110000262ede

Link: https://netdev.bots.linux.dev/logs/vmksft/fbnic-qemu-dbg/results/705762/15-uso-py/stderr
Fixes: b0b0f52042ac ("eth: fbnic: support TCP segmentation offload")
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: kernel-team@meta.com
CC: mohsin.bashr@gmail.com
---
 drivers/net/ethernet/meta/fbnic/fbnic_txrx.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_txrx.c b/drivers/net/ethernet/meta/fbnic/fbnic_txrx.c
index 9cd85a0d0c3a..401f8b8ae1ca 100644
--- a/drivers/net/ethernet/meta/fbnic/fbnic_txrx.c
+++ b/drivers/net/ethernet/meta/fbnic/fbnic_txrx.c
@@ -194,16 +194,18 @@ static bool fbnic_tx_tstamp(struct sk_buff *skb)
 
 static bool
 fbnic_tx_lso(struct fbnic_ring *ring, struct sk_buff *skb,
-	     struct skb_shared_info *shinfo, __le64 *meta,
-	     unsigned int *l2len, unsigned int *i3len)
+	     __le64 *meta, unsigned int *l2len, unsigned int *i3len)
 {
 	unsigned int l3_type, l4_type, l4len, hdrlen;
+	struct skb_shared_info *shinfo;
 	unsigned char *l4hdr;
 	__be16 payload_len;
 
 	if (unlikely(skb_cow_head(skb, 0)))
 		return true;
 
+	shinfo = skb_shinfo(skb);
+
 	if (shinfo->gso_type & SKB_GSO_PARTIAL) {
 		l3_type = FBNIC_TWD_L3_TYPE_OTHER;
 	} else if (!skb->encapsulation) {
@@ -258,7 +260,6 @@ fbnic_tx_lso(struct fbnic_ring *ring, struct sk_buff *skb,
 static bool
 fbnic_tx_offloads(struct fbnic_ring *ring, struct sk_buff *skb, __le64 *meta)
 {
-	struct skb_shared_info *shinfo = skb_shinfo(skb);
 	unsigned int l2len, i3len;
 
 	if (fbnic_tx_tstamp(skb))
@@ -273,8 +274,8 @@ fbnic_tx_offloads(struct fbnic_ring *ring, struct sk_buff *skb, __le64 *meta)
 	*meta |= cpu_to_le64(FIELD_PREP(FBNIC_TWD_CSUM_OFFSET_MASK,
 					skb->csum_offset / 2));
 
-	if (shinfo->gso_size) {
-		if (fbnic_tx_lso(ring, skb, shinfo, meta, &l2len, &i3len))
+	if (skb_is_gso(skb)) {
+		if (fbnic_tx_lso(ring, skb, meta, &l2len, &i3len))
 			return true;
 	} else {
 		*meta |= cpu_to_le64(FBNIC_TWD_FLAG_REQ_CSO);
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH net-next] Documentation: networking: Add a test plan for ethtool pause validation
From: Maxime Chevallier @ 2026-06-25 16:03 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Jakub Kicinski, davem, Eric Dumazet, Paolo Abeni, Simon Horman,
	Russell King, Heiner Kallweit, Jonathan Corbet, Shuah Khan,
	Oleksij Rempel, Vladimir Oltean, Florian Fainelli,
	thomas.petazzoni, netdev, linux-kernel, linux-doc
In-Reply-To: <dfee1484-fa2a-4b98-af5a-1e67ac716905@lunn.ch>


> 
> Does it even make sense to advertise this when in HD? But i don't
> think we need to consider this now. I consider HD low priority, i
> doubt it is actually used very often. We should concentrate on FD
> testing.

That's fine by me as well, let's keep it simple, we may revisit that if
we really need to.

> 
>> # ethtool -a eth2
>> Autonegotiate:	on
>> RX:		off
>> TX:		off
>> RX negotiated: on
>> TX negotiated: on
>>
>>
>> Sure, pause and HD don't make sense, however what I find confusing to some
>> extent is that the only place we have information about the *actual* pause
>> settings is the "link is Up" log in dmesg.
> 
> Maybe we should extend ksetting get to return the resolved pause
> parameters? But i'm not sure how much that actually gives us. Anything
> using phylink will just ask phylink to fill in the ksettings
> information, and it seems unlikely phylink gets it wrong. What we are
> really trying to test is drivers which don't use phylink, those are
> the ones which are generally broken, and they are not going to
> implement anything new in ksettings.

Correct yes. If the MAC driver uses phylink and a test fails, it very likely
means that the PHY driver is doing shady stuff (and some are/were for pause)

> So i think the test has to look
> at:
> 
>> 	Advertised pause frame use: Symmetric Receive-only
>> 	Link partner advertised pause frame use: Symmetric Receive-only
> 
> and check these match what we expect.

All good for me :) thanks for your feedback,

Maxime

^ permalink raw reply

* [PATCH v3 4/4] vhost/vsock: add VHOST_RESET_OWNER ioctl
From: Andrey Drobyshev @ 2026-06-25 15:54 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, virtualization, netdev, sgarzare, mst, stefanha,
	dongli.zhang, maciej.szmigiero, bchaney, mark.kanda, ptikhomirov,
	den, andrey.drobyshev
In-Reply-To: <20260625155416.480669-1-andrey.drobyshev@virtuozzo.com>

From: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

This ioctl is needed for QEMU's CPR (checkpoint-restore) migration of
the guest with vhost-vsock device.  For this to work, we need to reset
the device ownership on the source side by calling RESET_OWNER, and then
claim it on the dest side by calling SET_OWNER.  We expect not to lose any
AF_VSOCK connection while this happens.

RESET_OWNER keeps the guest CID hashed, so that connections survive. That
leaves the device reachable by a lockless send/cancel path while the worker
is being torn down: a concurrent vhost_transport_send_pkt() or
vhost_transport_cancel_pkt() can call vhost_vq_work_queue() as
vhost_workers_free() frees the worker.  That might cause a use-after-free
of vq->worker.  In addition, any work queued onto the dying worker leaves
VHOST_WORK_QUEUED stuck, stalling send_pkt_queue after resume.

Fence the send/cancel paths around the teardown: send_pkt()/cancel_pkt()
only kick the worker while the backend is alive.  And reset_owner() calls
synchronize_rcu() after drop_backends() so in-flight send/cancel finish
before the worker is freed.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
---
 drivers/vhost/vsock.c | 51 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 81d4f7209719..f0a0aa7d3200 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -318,7 +318,14 @@ vhost_transport_send_pkt(struct sk_buff *skb, struct net *net)
 		atomic_inc(&vsock->queued_replies);
 
 	virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
-	vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX], &vsock->send_pkt_work);
+
+	/* Skip the kick once the backend is gone (stop/RESET_OWNER); the skb
+	 * stays queued and vhost_vsock_start() drains it. Pairs with the
+	 * synchronize_rcu() in vhost_vsock_reset_owner().
+	 */
+	if (data_race(vhost_vq_get_backend(&vsock->vqs[VSOCK_VQ_RX])))
+		vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX],
+				    &vsock->send_pkt_work);
 
 	rcu_read_unlock();
 	return len;
@@ -346,7 +353,15 @@ vhost_transport_cancel_pkt(struct vsock_sock *vsk)
 		int new_cnt;
 
 		new_cnt = atomic_sub_return(cnt, &vsock->queued_replies);
-		if (new_cnt + cnt >= tx_vq->num && new_cnt < tx_vq->num)
+
+		/* Skip the kick once the backend is gone (stop/RESET_OWNER):
+		 * vhost_poll_queue() would touch the worker which is being freed
+		 * by teardown, e.g. on RESET_OWNER.  Pairs with the
+		 * synchronize_rcu() in vhost_vsock_reset_owner().  The TX VQ is
+		 * re-kicked by vhost_vsock_start().
+		 */
+		if (data_race(vhost_vq_get_backend(tx_vq)) &&
+		    new_cnt + cnt >= tx_vq->num && new_cnt < tx_vq->num)
 			vhost_poll_queue(&tx_vq->poll);
 	}
 
@@ -903,6 +918,36 @@ static int vhost_vsock_set_features(struct vhost_vsock *vsock, u64 features)
 	return -EFAULT;
 }
 
+static int vhost_vsock_reset_owner(struct vhost_vsock *vsock)
+{
+	struct vhost_iotlb *umem;
+	long err;
+
+	mutex_lock(&vsock->dev.mutex);
+	err = vhost_dev_check_owner(&vsock->dev);
+	if (err)
+		goto done;
+	umem = vhost_dev_reset_owner_prepare();
+	if (!umem) {
+		err = -ENOMEM;
+		goto done;
+	}
+	vhost_vsock_drop_backends(vsock);
+
+	/* Let in-flight send_pkt() callers stop touching the worker before the
+	 * flush + free below. Pairs with the backend check in
+	 * vhost_transport_send_pkt().
+	 */
+	synchronize_rcu();
+
+	vhost_vsock_flush(vsock);
+	vhost_dev_stop(&vsock->dev);
+	vhost_dev_reset_owner(&vsock->dev, umem);
+done:
+	mutex_unlock(&vsock->dev.mutex);
+	return err;
+}
+
 static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,
 				  unsigned long arg)
 {
@@ -946,6 +991,8 @@ static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,
 			return -EOPNOTSUPP;
 		vhost_set_backend_features(&vsock->dev, features);
 		return 0;
+	case VHOST_RESET_OWNER:
+		return vhost_vsock_reset_owner(vsock);
 	default:
 		mutex_lock(&vsock->dev.mutex);
 		r = vhost_dev_ioctl(&vsock->dev, ioctl, argp);
-- 
2.47.1


^ permalink raw reply related

* [PATCH v3 3/4] vhost/vsock: re-scan TX virtqueue on device start
From: Andrey Drobyshev @ 2026-06-25 15:54 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, virtualization, netdev, sgarzare, mst, stefanha,
	dongli.zhang, maciej.szmigiero, bchaney, mark.kanda, ptikhomirov,
	den, andrey.drobyshev
In-Reply-To: <20260625155416.480669-1-andrey.drobyshev@virtuozzo.com>

During QEMU CPR live-update (and VHOST_RESET_OWNER in general) the guest
keeps running while the host drops and later re-attaches vhost backends.
If the guest adds a buffer to the TX virtqueue (guest->host) and kicks
while the backend is temporarily NULL (between vhost_vsock_drop_backends()
and the next vhost_vsock_start()), then the kick is delivered to the
vhost worker, handle_tx_kick() sees a NULL backend and returns, and the
kick signal is consumed.  The buffer is then left in the ring.

Then upon device start vhost_vsock_start() only re-kicks the RX send
worker, never the TX VQ, so the buffer is processed only if the guest
happens to kick again.  But if the guest itself is now waiting for data
from the host, it will never kick TX VQ again, and we end up in a
deadlock.

The issue itself is pre-existing, but it only manifests during a brief
pause caused by VHOST_RESET_OWNER.  Namely, the deadlock is reproduced
during active host->guest socat data transfer under multiple consecutive
CPR live-updates.

To fix this, in vhost_vsock_start(), after kicking the RX send worker, also
queue the TX vq poll so any buffers the guest enqueued while we were paused
get scanned.
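
The lost-kick sequence can be modelled with a small userspace sketch. It
is only an illustration under simplifying assumptions: struct vq and the
helpers are hypothetical, the worker is run by hand, and "processing"
means marking every available buffer as used.

```c
#include <assert.h>
#include <stdbool.h>

struct vq {
	bool backend;		/* false while the backend is detached */
	bool kick_pending;	/* a notification is waiting for the worker */
	unsigned int avail;	/* buffers added by the guest */
	unsigned int used;	/* buffers processed by the host */
};

/* Guest adds one buffer and notifies the host. */
static void guest_add_and_kick(struct vq *vq)
{
	vq->avail++;
	vq->kick_pending = true;
}

/* Host worker: the kick is consumed even when there is no backend. */
static void worker_run(struct vq *vq)
{
	if (!vq->kick_pending)
		return;
	vq->kick_pending = false;
	if (!vq->backend)
		return;		/* buffer stays in the ring */
	vq->used = vq->avail;
	assert(vq->used <= vq->avail);
}

/* Device start: optionally re-scan the queue as if it had been kicked. */
static void dev_start(struct vq *vq, bool rescan_tx)
{
	vq->backend = true;
	if (rescan_tx)
		vq->kick_pending = true;
}
```

Without the re-scan, a buffer added while the backend was detached stays
unprocessed until the guest kicks again; with it, the first worker run
after start picks the buffer up.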

Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
Reviewed-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
---
 drivers/vhost/vsock.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index bec6bcfd885f..81d4f7209719 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -646,6 +646,13 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
 	 */
 	vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX], &vsock->send_pkt_work);
 
+	/*
+	 * Some packets might've also been queued in TX VQ.  That is the case
+	 * during the brief device pause caused by VHOST_RESET_OWNER.  Re-scan
+	 * the TX VQ here, mirroring the RX send-worker kick above.
+	 */
+	vhost_poll_queue(&vsock->vqs[VSOCK_VQ_TX].poll);
+
 	mutex_unlock(&vsock->dev.mutex);
 	return 0;
 
-- 
2.47.1


^ permalink raw reply related

