* Re: [PATCH net-next 5/5] selftests: ovpn: align command flow with TAP
From: Jakub Kicinski @ 2026-04-13 23:56 UTC (permalink / raw)
To: antonio
Cc: Jakub Kicinski, netdev, ralf, sd, pabeni, andrew+netdev, davem,
edumazet
In-Reply-To: <20260412221121.410365-6-antonio@openvpn.net>
This is an AI-generated review of your patch. The human sending this
email says: "The second one looks legit, would you prefer to follow
up or respin?"
---
selftests: ovpn: align command flow with TAP
This commit restructures the ovpn selftests to use the TAP infrastructure,
splitting each test into stages with KTAP output, adding command wrappers,
and centralizing behavior control in common.sh.
> diff --git a/tools/testing/selftests/net/ovpn/common.sh b/tools/testing/selftests/net/ovpn/common.sh
> index 2b3f2e5c8cc9d..ec6fea37ceb39 100644
> --- a/tools/testing/selftests/net/ovpn/common.sh
> +++ b/tools/testing/selftests/net/ovpn/common.sh
[ ... ]
> @@ -152,8 +263,8 @@ ovpn_compare_ntfs() {
> received="${OVPN_TMP_JSONS[$1]}"
> diff_file=$(mktemp)
>
> - kill -TERM ${OVPN_LISTENER_PIDS[$1]} || true
> - wait ${OVPN_LISTENER_PIDS[$1]} || true
> + kill -TERM ${OVPN_LISTENER_PIDS[$1]} 2>/dev/null || true
> + wait ${OVPN_LISTENER_PIDS[$1]} 2>/dev/null || true
> printf "Checking notifications for peer ${1}... "
Does this leave the reaped PID in OVPN_LISTENER_PIDS?
Later, ovpn_cleanup() calls ovpn_stop_listener() which iterates over the
array and issues another kill -TERM on the same PID. Since the process was
already reaped by the wait here, could the OS recycle the PID, potentially
causing the second kill to terminate an unrelated system process?
[ ... ]
> +ovpn_cleanup_peer_ns() {
> + local peer="$1"
> +
> + if ! ip netns list | grep -qx "${peer}"; then
> + return 0
> + fi
Can this fail to match and permanently leak network namespaces?
When virtual interfaces are moved into namespaces during setup, ip netns list
often includes an internal ID in its output, such as:
peer0 (id: 0)
Because grep is using the -x flag to enforce an exact whole-line match,
it appears this will fail to match peer0 and return early, skipping the
subsequent namespace deletion.
[ ... ]
> diff --git a/tools/testing/selftests/net/ovpn/test-close-socket.sh b/tools/testing/selftests/net/ovpn/test-close-socket.sh
> index 4f0367c60fdaf..e97affe727383 100755
> --- a/tools/testing/selftests/net/ovpn/test-close-socket.sh
> +++ b/tools/testing/selftests/net/ovpn/test-close-socket.sh
> @@ -9,37 +9,72 @@ set -e
>
> source ./common.sh
Is test-close-socket.sh missing the set -eE flag and the ovpn_stage_err
ERR trap?
The other test scripts were updated to use set -eE and register the trap,
which handles emitting the KTAP failure directive. If a command fails here,
will set -e cause the script to exit immediately and bypass the KTAP
failure reporting entirely?
^ permalink raw reply
* Re: [PATCH net-next 2/5] selftests: ovpn: fail notification check on mismatch
From: Jakub Kicinski @ 2026-04-14 0:00 UTC (permalink / raw)
To: Antonio Quartulli
Cc: netdev, ralf, Sabrina Dubroca, Paolo Abeni, Andrew Lunn,
David S. Miller, Eric Dumazet
In-Reply-To: <20260412221121.410365-3-antonio@openvpn.net>
On Mon, 13 Apr 2026 00:11:18 +0200 Antonio Quartulli wrote:
> compare_ntfs doesn't fail when expected and received notification
> streams diverge.
>
> Fix this bug by trackink the diff exit status explicitly and return it
> to the caller so notification mismatches propagate as test failures.
Hm, this series nicely cleans up test_mark.sh failures
but test_tcp.sh now always fails on debug (slow) kernel
builds with:
# TAP version 13
# 1..12
# ok 1 setup network topology
# ok 2 run baseline data traffic
# ok 3 run LAN traffic behind peer1
# ok 4 run iperf throughput
# ok 5 run key rollout
# ok 6 query peers
# ok 7 query missing peer fails
# ok 8 peer lifecycle and key queries
# ok 9 delete peer while traffic
# ok 10 delete stale keys
# ok 11 check timeout behavior
# Checking notifications for peer 3... failed
# 1,9d0
# < {
# < "name": "peer-del-ntf",
# < "msg": {
# < "peer": {
# < "del-reason": "expired",
# < "id": 12
# < }
# < }
# < }
# validate listener output for peer 3: command failed with rc=1: ovpn_compare_ntfs 3
# not ok 12 validate notification output
# # Totals: pass:11 fail:1 xfail:0 xpass:0 skip:0 error:0
Similar failure in test_symmetric_id_tcp.sh
Only the debug kernels tho, non-debug kernels seem to pass.
So probably some race / slowness.
More runs if you want to compare
https://netdev.bots.linux.dev/contest.html?executor=vmksft-net-extra-dbg&test=test-tcp-sh
This series landed in net-next-2026-04-13--06-00
^ permalink raw reply
* Re: [PATCH net-next 0/3] net: airoha: Preliminary series to support multiple net_devices connected to the same GDM port
From: patchwork-bot+netdevbpf @ 2026-04-14 0:00 UTC (permalink / raw)
To: Lorenzo Bianconi
Cc: andrew+netdev, davem, edumazet, kuba, pabeni, linux-arm-kernel,
linux-mediatek, netdev, xuegang.lu
In-Reply-To: <20260412-airoha-multi-serdes-preliminary-patch-v1-0-08d5b670ca8f@kernel.org>
Hello:
This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Sun, 12 Apr 2026 19:13:11 +0200 you wrote:
> EN7581 or AN7583 SoCs support connecting multiple external SerDes (e.g.
> Ethernet or USB SerDes) to GDM3 or GDM4 ports via a hw arbiter that
> manages the traffic in a TDM manner.
> This series introduces some preliminary changes necessary to introduce
> support for multiple net_devices connected to the same Frame Engine (FE)
> GDM port (GDM3 or GDM4).
>
> [...]
Here is the summary with links:
- [net-next,1/3] net: airoha: Rely on net_device pointer in airoha_dev_setup_tc_block signature
https://git.kernel.org/netdev/net-next/c/360d745a5319
- [net-next,2/3] net: airoha: Rely on net_device pointer in HTB callbacks
https://git.kernel.org/netdev/net-next/c/8baf4bf72ef9
- [net-next,3/3] net: airoha: Rely on net_device pointer in ETS callbacks
https://git.kernel.org/netdev/net-next/c/ae32f80018f0
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net] sctp: disable BH before calling udp_tunnel_xmit_skb()
From: patchwork-bot+netdevbpf @ 2026-04-14 0:10 UTC (permalink / raw)
To: Xin Long
Cc: netdev, linux-sctp, davem, kuba, edumazet, pabeni, horms,
marcelo.leitner, bestswngs
In-Reply-To: <c874a8548221dcd56ff03c65ba75a74e6cf99119.1776017727.git.lucien.xin@gmail.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Sun, 12 Apr 2026 14:15:27 -0400 you wrote:
> udp_tunnel_xmit_skb() / udp_tunnel6_xmit_skb() are expected to run with
> BH disabled. After commit 6f1a9140ecda ("add xmit recursion limit to
> tunnel xmit functions"), on the path:
>
> udp(6)_tunnel_xmit_skb() -> ip(6)tunnel_xmit()
>
> dev_xmit_recursion_inc()/dec() must stay balanced on the same CPU.
>
> [...]
Here is the summary with links:
- [net] sctp: disable BH before calling udp_tunnel_xmit_skb()
https://git.kernel.org/netdev/net/c/2cd7e6971fc2
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net] sctp: fix missing encap_port propagation for GSO fragments
From: patchwork-bot+netdevbpf @ 2026-04-14 0:10 UTC (permalink / raw)
To: Xin Long
Cc: netdev, linux-sctp, davem, kuba, edumazet, pabeni, horms,
marcelo.leitner
In-Reply-To: <ea65ed61b3598d8b4940f0170b9aa1762307e6c3.1776017631.git.lucien.xin@gmail.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Sun, 12 Apr 2026 14:13:51 -0400 you wrote:
> encap_port in SCTP_INPUT_CB(skb) is used by sctp_vtag_verify() for
> SCTP-over-UDP processing. In the GSO case, it is only set on the head
> skb, while fragment skbs leave it 0.
>
> This results in fragment skbs seeing encap_port == 0, breaking
> SCTP-over-UDP connections.
>
> [...]
Here is the summary with links:
- [net] sctp: fix missing encap_port propagation for GSO fragments
https://git.kernel.org/netdev/net/c/bf6f95ae3b8b
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net v3] netfilter: nft_set_rbtree: fix use count leak on transaction abort
From: Pablo Neira Ayuso @ 2026-04-14 0:11 UTC (permalink / raw)
To: Marko Jevtic
Cc: fw, netfilter-devel, phil, coreteam, davem, edumazet, kuba,
pabeni, horms, netdev, linux-kernel
In-Reply-To: <20260412222801.34965-1-marko.jevtic@codereflect.io>
Hi,
On Mon, Apr 13, 2026 at 12:28:01AM +0200, Marko Jevtic wrote:
> nft_rbtree_abort() does not handle elements moved to the expired list
> by inline GC during __nft_rbtree_insert(). When inline GC encounters
> expired elements during overlap detection, it calls
> nft_rbtree_gc_elem_move() which deactivates element data (decrementing
> chain/object use counts), removes the element from the rbtree, and
> queues it for deferred freeing. On commit, these elements are freed
> via nft_rbtree_gc_queue(). On abort, however, the expired list is
> ignored entirely.
>
> This leaves use counts permanently decremented after abort.
Yes, but that is expected?
Expired elements reside in priv->expired, these elements are already
deactivated, ie. removed from the rbtree and chain reference is
decremented.
From abort path, the deactivated element simply remains there until
there is a commit run that gets rid of it.
I can't make any sense of this bug report so far.
Why do you think there is a need to restore an expired element?
^ permalink raw reply
* Re: [PATCH net v2] RDS: Fix memory leak in rds_rdma_extra_size()
From: Allison Henderson @ 2026-04-14 0:15 UTC (permalink / raw)
To: Xiaobo Liu, David S. Miller
Cc: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, netdev,
linux-rdma, rds-devel, linux-kernel
In-Reply-To: <20260413070005.15272-1-cppcoffee@gmail.com>
On Mon, 2026-04-13 at 15:00 +0800, Xiaobo Liu wrote:
> Free iov->iov when copy_from_user() or page count validation fails
> in rds_rdma_extra_size().
>
> This preserves the existing success path and avoids leaking the
> allocated iovec array on error.
>
> Signed-off-by: Xiaobo Liu <cppcoffee@gmail.com>
I think this looks good now. Thanks Xiaobo.
Reviewed-by: Allison Henderson <achender@kernel.org>
> ---
> net/rds/rdma.c | 28 +++++++++++++++++++++-------
> 1 file changed, 21 insertions(+), 7 deletions(-)
>
> diff --git a/net/rds/rdma.c b/net/rds/rdma.c
> index aa6465dc7..91a20c1e2 100644
> --- a/net/rds/rdma.c
> +++ b/net/rds/rdma.c
> @@ -560,6 +560,7 @@ int rds_rdma_extra_size(struct rds_rdma_args *args,
> struct rds_iovec *vec;
> struct rds_iovec __user *local_vec;
> int tot_pages = 0;
> + int ret = 0;
> unsigned int nr_pages;
> unsigned int i;
>
> @@ -578,16 +579,20 @@ int rds_rdma_extra_size(struct rds_rdma_args *args,
> vec = &iov->iov[0];
>
> if (copy_from_user(vec, local_vec, args->nr_local *
> - sizeof(struct rds_iovec)))
> - return -EFAULT;
> + sizeof(struct rds_iovec))) {
> + ret = -EFAULT;
> + goto out;
> + }
> iov->len = args->nr_local;
>
> /* figure out the number of pages in the vector */
> for (i = 0; i < args->nr_local; i++, vec++) {
>
> nr_pages = rds_pages_in_vec(vec);
> - if (nr_pages == 0)
> - return -EINVAL;
> + if (nr_pages == 0) {
> + ret = -EINVAL;
> + goto out;
> + }
>
> tot_pages += nr_pages;
>
> @@ -595,11 +600,20 @@ int rds_rdma_extra_size(struct rds_rdma_args *args,
> * nr_pages for one entry is limited to (UINT_MAX>>PAGE_SHIFT)+1,
> * so tot_pages cannot overflow without first going negative.
> */
> - if (tot_pages < 0)
> - return -EINVAL;
> + if (tot_pages < 0) {
> + ret = -EINVAL;
> + goto out;
> + }
> }
>
> - return tot_pages * sizeof(struct scatterlist);
> + ret = tot_pages * sizeof(struct scatterlist);
> +
> +out:
> + if (ret < 0) {
> + kfree(iov->iov);
> + iov->iov = NULL;
> + }
> + return ret;
> }
>
> /*
^ permalink raw reply
* Re: [PATCH net-next v05 1/6] hinic3: Add ethtool queue ops
From: Jakub Kicinski @ 2026-04-14 0:18 UTC (permalink / raw)
To: Fan Gong
Cc: Zhu Yikai, netdev, David S. Miller, Eric Dumazet, Paolo Abeni,
Simon Horman, Andrew Lunn, Ioana Ciornei, Mohsin Bashir,
linux-kernel, linux-doc, luosifu, Xin Guo, Zhou Shuai, Wu Like,
Shi Jing, Zheng Jiezhen, Maxime Chevallier
In-Reply-To: <157d5cc6e757ffa77eee01dfdc3f2159dc97905f.1775711066.git.zhuyikai1@h-partners.com>
On Sat, 11 Apr 2026 11:36:59 +0800 Fan Gong wrote:
> Implement following ethtool callback function:
> .get_ringparam
> .set_ringparam
>
> These callbacks allow users to utilize ethtool for detailed
> queue depth configuration and monitoring.
> +static int hinic3_check_ringparam_valid(struct net_device *netdev,
> + const struct ethtool_ringparam *ring)
> +{
> + if (ring->rx_jumbo_pending || ring->rx_mini_pending) {
Can driver actually be called with non-zero values if max is not set?
> + netdev_err(netdev, "Unsupported rx_jumbo_pending/rx_mini_pending\n");
> + return -EINVAL;
> + }
> + if (ring->tx_pending > HINIC3_MAX_TX_QUEUE_DEPTH ||
> + ring->tx_pending < HINIC3_MIN_QUEUE_DEPTH ||
> + ring->rx_pending > HINIC3_MAX_RX_QUEUE_DEPTH ||
> + ring->rx_pending < HINIC3_MIN_QUEUE_DEPTH) {
similar question - do you need to check the upper bound?
kernel should check the input against max returned by .get
> + netdev_err(netdev,
please use extack for errors
> + "Queue depth out of range tx[%d-%d] rx[%d-%d]\n",
> + HINIC3_MIN_QUEUE_DEPTH, HINIC3_MAX_TX_QUEUE_DEPTH,
> + HINIC3_MIN_QUEUE_DEPTH, HINIC3_MAX_RX_QUEUE_DEPTH);
> + return -EINVAL;
> + }
> +
> + return 0;
> +}
> +
> +static int hinic3_set_ringparam(struct net_device *netdev,
> + struct ethtool_ringparam *ring,
> + struct kernel_ethtool_ringparam *kernel_ring,
> + struct netlink_ext_ack *extack)
> +{
> + struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
> + struct hinic3_dyna_txrxq_params q_params = {};
> + u32 new_sq_depth, new_rq_depth;
> + int err;
> +
> + err = hinic3_check_ringparam_valid(netdev, ring);
> + if (err)
> + return err;
> +
> + new_sq_depth = 1U << ilog2(ring->tx_pending);
> + new_rq_depth = 1U << ilog2(ring->rx_pending);
> + if (new_sq_depth == nic_dev->q_params.sq_depth &&
> + new_rq_depth == nic_dev->q_params.rq_depth)
> + return 0;
> +
> + if (new_sq_depth != ring->tx_pending)
> + netdev_info(netdev, "Requested Tx depth trimmed to %d\n",
> + new_sq_depth);
please use extack for warnings
> + if (new_rq_depth != ring->rx_pending)
> + netdev_info(netdev, "Requested Rx depth trimmed to %d\n",
> + new_rq_depth);
> +
> + netdev_info(netdev, "Change Tx/Rx ring depth from %u/%u to %u/%u\n",
> + nic_dev->q_params.sq_depth, nic_dev->q_params.rq_depth,
> + new_sq_depth, new_rq_depth);
> +
> + if (!netif_running(netdev)) {
> + hinic3_update_qp_depth(netdev, new_sq_depth, new_rq_depth);
> + } else {
> + q_params = nic_dev->q_params;
> + q_params.sq_depth = new_sq_depth;
> + q_params.rq_depth = new_rq_depth;
> +
> + err = hinic3_change_channel_settings(netdev, &q_params);
> + if (err) {
> + netdev_err(netdev, "Failed to change channel settings\n");
> + return err;
> + }
> + }
> +
> + return 0;
> +}
> +
> static const struct ethtool_ops hinic3_ethtool_ops = {
> .supported_coalesce_params = ETHTOOL_COALESCE_USECS |
> ETHTOOL_COALESCE_PKT_RATE_RX_USECS,
> @@ -417,6 +516,8 @@ static const struct ethtool_ops hinic3_ethtool_ops = {
> .get_msglevel = hinic3_get_msglevel,
> .set_msglevel = hinic3_set_msglevel,
> .get_link = ethtool_op_get_link,
> + .get_ringparam = hinic3_get_ringparam,
> + .set_ringparam = hinic3_set_ringparam,
> };
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_main.c b/drivers/net/ethernet/huawei/hinic3/hinic3_main.c
> index 0a888fe4c975..3b470978714a 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_main.c
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_main.c
> @@ -179,6 +179,8 @@ static int hinic3_sw_init(struct net_device *netdev)
> int err;
>
> mutex_init(&nic_dev->port_state_mutex);
> + mutex_init(&nic_dev->channel_cfg_lock);
Why do you need this mutex?
Aren't all the places you take it under rtnl_lock anyway?
> + spin_lock_init(&nic_dev->channel_res_lock);
>
> nic_dev->q_params.sq_depth = HINIC3_SQ_DEPTH;
> nic_dev->q_params.rq_depth = HINIC3_RQ_DEPTH;
> @@ -314,6 +316,15 @@ static void hinic3_link_status_change(struct net_device *netdev,
> bool link_status_up)
> {
> struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
> + unsigned long flags;
> + bool valid;
> +
> + spin_lock_irqsave(&nic_dev->channel_res_lock, flags);
> + valid = HINIC3_CHANNEL_RES_VALID(nic_dev);
> + spin_unlock_irqrestore(&nic_dev->channel_res_lock, flags);
> +
> + if (!valid)
Why are you checking valid here? What if the state changes immediately
after unlocking?
> + return;
>
> if (link_status_up) {
> if (netif_carrier_ok(netdev))
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_netdev_ops.c b/drivers/net/ethernet/huawei/hinic3/hinic3_netdev_ops.c
> index da73811641a9..d652a5ffdc2c 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_netdev_ops.c
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_netdev_ops.c
> @@ -428,6 +428,85 @@ static void hinic3_vport_down(struct net_device *netdev)
> }
> }
>
> +int
> +hinic3_change_channel_settings(struct net_device *netdev,
> + struct hinic3_dyna_txrxq_params *trxq_params)
> +{
> + struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
> + struct hinic3_dyna_txrxq_params old_qp_params = {};
> + struct hinic3_dyna_qp_params new_qp_params = {};
> + struct hinic3_dyna_qp_params cur_qp_params = {};
> + bool need_teardown = false;
> + unsigned long flags;
> + int err;
> +
> + mutex_lock(&nic_dev->channel_cfg_lock);
> +
> + hinic3_config_num_qps(netdev, trxq_params);
> +
> + err = hinic3_alloc_channel_resources(netdev, &new_qp_params,
> + trxq_params);
> + if (err) {
> + netdev_err(netdev, "Failed to alloc channel resources\n");
> + mutex_unlock(&nic_dev->channel_cfg_lock);
> + return err;
> + }
> +
> + spin_lock_irqsave(&nic_dev->channel_res_lock, flags);
> + if (!test_and_set_bit(HINIC3_CHANGE_RES_INVALID, &nic_dev->flags))
> + need_teardown = true;
> + spin_unlock_irqrestore(&nic_dev->channel_res_lock, flags);
> +
> + if (need_teardown) {
> + hinic3_vport_down(netdev);
> + hinic3_close_channel(netdev);
> + hinic3_uninit_qps(nic_dev, &cur_qp_params);
> + hinic3_free_channel_resources(netdev, &cur_qp_params,
> + &nic_dev->q_params);
> + }
> +
> + if (nic_dev->num_qp_irq > trxq_params->num_qps)
> + hinic3_qp_irq_change(netdev, trxq_params->num_qps);
> +
> + spin_lock_irqsave(&nic_dev->channel_res_lock, flags);
> + old_qp_params = nic_dev->q_params;
> + nic_dev->q_params = *trxq_params;
> + spin_unlock_irqrestore(&nic_dev->channel_res_lock, flags);
> +
> + hinic3_init_qps(nic_dev, &new_qp_params);
> +
> + err = hinic3_open_channel(netdev);
This "open" function allocates Rx buffers, and fails if it couldn't get
even one. That's no good.
> + if (err)
> + goto err_uninit_qps;
> +
> + err = hinic3_vport_up(netdev);
> + if (err)
> + goto err_close_channel;
> +
> + spin_lock_irqsave(&nic_dev->channel_res_lock, flags);
> + clear_bit(HINIC3_CHANGE_RES_INVALID, &nic_dev->flags);
> + spin_unlock_irqrestore(&nic_dev->channel_res_lock, flags);
> +
> + mutex_unlock(&nic_dev->channel_cfg_lock);
> +
> + return 0;
> +
> +err_close_channel:
> + hinic3_close_channel(netdev);
> +err_uninit_qps:
> + spin_lock_irqsave(&nic_dev->channel_res_lock, flags);
> + nic_dev->q_params = old_qp_params;
> + clear_bit(HINIC3_CHANGE_RES_INVALID, &nic_dev->flags);
> + spin_unlock_irqrestore(&nic_dev->channel_res_lock, flags);
> +
> + hinic3_uninit_qps(nic_dev, &new_qp_params);
> + hinic3_free_channel_resources(netdev, &new_qp_params, trxq_params);
> +
> + mutex_unlock(&nic_dev->channel_cfg_lock);
AI says:
Can this error path lead to memory corruption?
If need_teardown was true, the old channel resources were freed earlier in
the function. If hinic3_open_channel() or hinic3_vport_up() fails, the code
jumps to err_uninit_qps and restores nic_dev->q_params = old_qp_params.
However, it doesn't appear to re-allocate those old resources or mark the
interface as down. Could a subsequent administrative teardown or network
traffic dereference these freed pointers?
> +
> static int hinic3_open(struct net_device *netdev)
> {
> struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
> @@ -487,16 +566,33 @@ static int hinic3_close(struct net_device *netdev)
> {
> struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
> struct hinic3_dyna_qp_params qp_params;
> + bool need_teardown = false;
> + unsigned long flags;
>
> if (!test_and_clear_bit(HINIC3_INTF_UP, &nic_dev->flags)) {
> netdev_dbg(netdev, "Netdev already close, do nothing\n");
> return 0;
> }
>
> - hinic3_vport_down(netdev);
> - hinic3_close_channel(netdev);
> - hinic3_uninit_qps(nic_dev, &qp_params);
> - hinic3_free_channel_resources(netdev, &qp_params, &nic_dev->q_params);
> + mutex_lock(&nic_dev->channel_cfg_lock);
> +
> + spin_lock_irqsave(&nic_dev->channel_res_lock, flags);
> + if (!test_and_set_bit(HINIC3_CHANGE_RES_INVALID, &nic_dev->flags))
> + need_teardown = true;
> + spin_unlock_irqrestore(&nic_dev->channel_res_lock, flags);
> +
> + if (need_teardown) {
> + hinic3_vport_down(netdev);
> + hinic3_close_channel(netdev);
> + hinic3_uninit_qps(nic_dev, &qp_params);
> + hinic3_free_channel_resources(netdev, &qp_params,
> + &nic_dev->q_params);
> + }
> +
> + hinic3_free_nicio_res(nic_dev);
> + hinic3_destroy_num_qps(netdev);
> +
> + mutex_unlock(&nic_dev->channel_cfg_lock);
>
> return 0;
> }
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_nic_dev.h b/drivers/net/ethernet/huawei/hinic3/hinic3_nic_dev.h
> index 9502293ff710..55b280888ad8 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_nic_dev.h
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_nic_dev.h
> @@ -10,6 +10,9 @@
> #include "hinic3_hw_cfg.h"
> #include "hinic3_hwdev.h"
> #include "hinic3_mgmt_interface.h"
> +#include "hinic3_nic_io.h"
> +#include "hinic3_tx.h"
> +#include "hinic3_rx.h"
>
> #define HINIC3_VLAN_BITMAP_BYTE_SIZE(nic_dev) (sizeof(*(nic_dev)->vlan_bitmap))
> #define HINIC3_VLAN_BITMAP_SIZE(nic_dev) \
> @@ -20,8 +23,13 @@ enum hinic3_flags {
> HINIC3_MAC_FILTER_CHANGED,
> HINIC3_RSS_ENABLE,
> HINIC3_UPDATE_MAC_FILTER,
> + HINIC3_CHANGE_RES_INVALID,
> };
>
> +#define HINIC3_CHANNEL_RES_VALID(nic_dev) \
> + (test_bit(HINIC3_INTF_UP, &(nic_dev)->flags) && \
> + !test_bit(HINIC3_CHANGE_RES_INVALID, &(nic_dev)->flags))
I don't get why you need to check both of these bits.
Can't there be one bit for "resources valid" ?
And it will only be set while device is up (of course) so no need to
also check UP (this way checking can be atomic without the spin lock).
> enum hinic3_event_work_flags {
> HINIC3_EVENT_WORK_TX_TIMEOUT,
> };
> @@ -129,6 +137,10 @@ struct hinic3_nic_dev {
> struct work_struct rx_mode_work;
> /* lock for enable/disable port */
> struct mutex port_state_mutex;
> + /* lock for channel configuration */
> + struct mutex channel_cfg_lock;
> + /* lock for channel resources */
> + spinlock_t channel_res_lock;
>
> struct list_head uc_filter_list;
> struct list_head mc_filter_list;
> @@ -143,6 +155,10 @@ struct hinic3_nic_dev {
>
> void hinic3_set_netdev_ops(struct net_device *netdev);
> int hinic3_set_hw_features(struct net_device *netdev);
> +int
> +hinic3_change_channel_settings(struct net_device *netdev,
> + struct hinic3_dyna_txrxq_params *trxq_params);
> +
> int hinic3_qps_irq_init(struct net_device *netdev);
> void hinic3_qps_irq_uninit(struct net_device *netdev);
>
> diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_nic_io.h b/drivers/net/ethernet/huawei/hinic3/hinic3_nic_io.h
> index 12eefabcf1db..3791b9bc865b 100644
> --- a/drivers/net/ethernet/huawei/hinic3/hinic3_nic_io.h
> +++ b/drivers/net/ethernet/huawei/hinic3/hinic3_nic_io.h
> @@ -14,6 +14,10 @@ struct hinic3_nic_dev;
> #define HINIC3_RQ_WQEBB_SHIFT 3
> #define HINIC3_SQ_WQEBB_SIZE BIT(HINIC3_SQ_WQEBB_SHIFT)
>
> +#define HINIC3_MAX_TX_QUEUE_DEPTH 65536
> +#define HINIC3_MAX_RX_QUEUE_DEPTH 16384
> +#define HINIC3_MIN_QUEUE_DEPTH 128
> +
> /* ******************** RQ_CTRL ******************** */
> enum hinic3_rq_wqe_type {
> HINIC3_NORMAL_RQ_WQE = 1,
^ permalink raw reply
* Re: [PATCH net-next] net: airoha: Remove PCE_MC_EN_MASK bit in REG_FE_PCE_CFG configuration
From: patchwork-bot+netdevbpf @ 2026-04-14 0:20 UTC (permalink / raw)
To: Lorenzo Bianconi
Cc: andrew+netdev, davem, edumazet, kuba, pabeni, linux-arm-kernel,
linux-mediatek, netdev
In-Reply-To: <20260412-airoha_fe_init_remove_mc_en_bit-v1-1-7b6a5a25a74d@kernel.org>
Hello:
This patch was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Sun, 12 Apr 2026 11:56:25 +0200 you wrote:
> PCE_MC_EN_MASK bit in REG_FE_PCE_CFG configuration performed in
> airoha_fe_init() is used to duplicate multicast packets and send a copy
> to the CPU when the traffic is offloaded. This is necessary just if
> it is requested by the user. Disable multicast packets duplication by
> default.
>
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
>
> [...]
Here is the summary with links:
- [net-next] net: airoha: Remove PCE_MC_EN_MASK bit in REG_FE_PCE_CFG configuration
https://git.kernel.org/netdev/net-next/c/34e1a98ff2a8
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net-next v4 02/10] enic: add admin channel open and close for SR-IOV
From: Jakub Kicinski @ 2026-04-14 0:21 UTC (permalink / raw)
To: devnull+satishkh.cisco.com
Cc: Jakub Kicinski, andrew+netdev, davem, edumazet, pabeni, netdev,
linux-kernel, 20260401-enic-sriov-v2-prep-v4-0-d5834b2ef1b9,
satishkh
In-Reply-To: <20260411-enic-sriov-v2-admin-channel-v2-v4-2-f052326c2a57@cisco.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
enic: add admin channel open and close for SR-IOV
Introduce enic_admin_channel_open() and enic_admin_channel_close() to allocate
and initialize the admin WQ, RQ, two CQs, and one SR-IOV interrupt for the V2
SR-IOV PF-VF mailbox communication.
> diff --git a/drivers/net/ethernet/cisco/enic/enic_admin.c b/drivers/net/ethernet/cisco/enic/enic_admin.c
> new file mode 100644
> index 0000000000000..d1abe6a50095d
> --- /dev/null
> +++ b/drivers/net/ethernet/cisco/enic/enic_admin.c
[ ... ]
> + err = vnic_cq_alloc_with_type(enic->vdev, &enic->admin_cq[0], 0,
> + ENIC_ADMIN_DESC_COUNT,
> + sizeof(struct cq_desc),
> + RES_TYPE_ADMIN_CQ);
> + if (err)
> + goto free_rq;
> +
> + err = vnic_cq_alloc_with_type(enic->vdev, &enic->admin_cq[1], 1,
> + ENIC_ADMIN_DESC_COUNT,
> + 16 << enic->ext_cq,
^^^^^^^^^^^^^^^^^^
Does this code use the wrong descriptor size for the admin RQ's completion
queue?
The admin WQ's completion queue (admin_cq[0]) uses sizeof(struct cq_desc),
which is 16 bytes. If the admin channel handles PF-VF mailbox messages rather
than data packets with extended metadata, will firmware still write standard
16-byte completions here?
If so, won't allocating the ring with an extended size cause the host polling
loop to look for the color bit at the wrong offset (e.g., byte 31 instead of
byte 15), causing it to silently miss incoming admin messages?
> + RES_TYPE_ADMIN_CQ);
> + if (err)
> + goto free_cq0;
--
pw-bot: cr
^ permalink raw reply
* Re: [PATCH net-next v4 03/10] enic: add admin RQ buffer management
From: Jakub Kicinski @ 2026-04-14 0:21 UTC (permalink / raw)
To: devnull+satishkh.cisco.com
Cc: Jakub Kicinski, andrew+netdev, davem, edumazet, pabeni, netdev,
linux-kernel, 20260401-enic-sriov-v2-prep-v4-0-d5834b2ef1b9,
satishkh
In-Reply-To: <20260411-enic-sriov-v2-admin-channel-v2-v4-3-f052326c2a57@cisco.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
enic: add admin RQ buffer management
This commit adds buffer management for the enic driver's admin receive queue,
allocating DMA buffers on open and draining them on close.
> diff --git a/drivers/net/ethernet/cisco/enic/enic_admin.c b/drivers/net/ethernet/cisco/enic/enic_admin.c
> index d1abe6a50095d..a8fcd5f116d17 100644
> --- a/drivers/net/ethernet/cisco/enic/enic_admin.c
> +++ b/drivers/net/ethernet/cisco/enic/enic_admin.c
[ ... ]
> @@ -151,6 +212,7 @@ int enic_admin_channel_open(struct enic *enic)
> vnic_wq_disable(&enic->admin_wq);
> vnic_rq_disable(&enic->admin_rq);
> enic_admin_qp_type_set(enic, 0);
> + enic_admin_rq_drain(enic);
> enic_admin_free_resources(enic);
> return err;
> }
Could this sequence allow the hardware to DMA into freed memory?
Here and in enic_admin_channel_close() below, vnic_rq_disable() is called
before enic_admin_qp_type_set(enic, 0), meaning the firmware might still be
actively routing admin messages to the queue.
Additionally, the return value of vnic_rq_disable() is ignored. If the disable
operation fails or times out, the hardware queue will remain active.
Then enic_admin_rq_drain() unconditionally unmaps the DMA addresses and frees
the buffers. If the queue is still active, incoming messages could be written
directly into the freed memory.
Would it be safer to stop the source of the messages first, and verify the
queue is successfully disabled before freeing the buffers?
[ ... ]
> @@ -166,7 +228,7 @@ void enic_admin_channel_close(struct enic *enic)
> enic_admin_qp_type_set(enic, 0);
>
> vnic_wq_clean(&enic->admin_wq, enic_admin_wq_buf_clean);
> - vnic_rq_clean(&enic->admin_rq, enic_admin_rq_buf_clean);
> + enic_admin_rq_drain(enic);
> vnic_cq_clean(&enic->admin_cq[0]);
> vnic_cq_clean(&enic->admin_cq[1]);
> vnic_intr_clean(&enic->admin_intr);
^ permalink raw reply
* Re: [PATCH net-next v4 04/10] enic: add admin CQ service with MSI-X interrupt and NAPI polling
From: Jakub Kicinski @ 2026-04-14 0:21 UTC (permalink / raw)
To: devnull+satishkh.cisco.com
Cc: Jakub Kicinski, andrew+netdev, davem, edumazet, pabeni, netdev,
linux-kernel, 20260401-enic-sriov-v2-prep-v4-0-d5834b2ef1b9,
satishkh
In-Reply-To: <20260411-enic-sriov-v2-admin-channel-v2-v4-4-f052326c2a57@cisco.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
enic: add admin CQ service with MSI-X interrupt and NAPI polling
Add completion queue service for the admin channel WQ and RQ, driven
by an MSI-X interrupt and NAPI polling.
> diff --git a/drivers/net/ethernet/cisco/enic/enic_admin.c b/drivers/net/ethernet/cisco/enic/enic_admin.c
> index a8fcd5f116d17..345d194c6eebe 100644
> --- a/drivers/net/ethernet/cisco/enic/enic_admin.c
> +++ b/drivers/net/ethernet/cisco/enic/enic_admin.c
[ ... ]
> @@ -83,6 +85,207 @@ static void enic_admin_rq_drain(struct enic *enic)
> vnic_rq_clean(&enic->admin_rq, enic_admin_rq_buf_clean);
> }
>
[ ... ]
> +unsigned int enic_admin_rq_cq_service(struct enic *enic, unsigned int budget)
> +{
> + struct vnic_cq *cq = &enic->admin_cq[1];
> + struct vnic_rq *rq = &enic->admin_rq;
> + struct vnic_rq_buf *buf;
> + unsigned int work = 0;
> + void *desc;
> +
> + desc = vnic_cq_to_clean(cq);
> + while (work < budget &&
> + enic_admin_cq_color(desc, cq->ring.desc_size) !=
> + cq->last_color) {
> + /* Ensure CQ descriptor fields are read after
> + * the color/valid check.
> + */
> + rmb();
> + buf = rq->to_clean;
> +
> + dma_sync_single_for_cpu(&enic->pdev->dev,
> + buf->dma_addr, buf->len,
> + DMA_FROM_DEVICE);
> +
> + enic_admin_msg_enqueue(enic, buf->os_buf, buf->len);
Does this code expose uninitialized kernel heap memory?
The driver reads the completion queue descriptor but does not decode it to
determine the actual number of bytes written by the hardware. Instead, it
passes the full buf->len (2048 bytes) to enic_admin_msg_enqueue().
Since enic_admin_rq_post_one() allocates the receive buffers using kmalloc()
without __GFP_ZERO, passing the full 2048 bytes copies uninitialized kernel
memory following the actual packet payload.
> +
> + enic_admin_rq_buf_clean(rq, rq->to_clean);
> + rq->to_clean = rq->to_clean->next;
> + rq->ring.desc_avail++;
> +
> + vnic_cq_inc_to_clean(cq);
> + work++;
> + desc = vnic_cq_to_clean(cq);
> + }
> +
> + enic_admin_rq_fill(enic, GFP_ATOMIC);
Can this lead to a permanent receive queue stall?
Every processed receive buffer is unconditionally unmapped and freed via
enic_admin_rq_buf_clean() above. If enic_admin_rq_fill() fails due to memory
pressure, it silently returns an error and no replacement buffers are posted.
Because the original buffers were destroyed rather than recycled, and there
is no deferred retry mechanism, the descriptors are permanently lost. Repeated
transient failures will empty the RQ and permanently deadlock the admin channel.
[ ... ]
> @@ -128,23 +331,8 @@ static int enic_admin_alloc_resources(struct enic *enic)
> if (err)
> goto free_cq0;
>
> - /* PFs have dedicated SRIOV_INTR resources for admin channel.
> - * VFs lack SRIOV_INTR; use a regular INTR_CTRL slot instead.
> - */
> - if (vnic_dev_get_res_count(enic->vdev, RES_TYPE_SRIOV_INTR) >= 1)
> - err = vnic_intr_alloc_with_type(enic->vdev,
> - &enic->admin_intr, 0,
> - RES_TYPE_SRIOV_INTR);
> - else
> - err = vnic_intr_alloc(enic->vdev, &enic->admin_intr,
> - enic->intr_count);
> - if (err)
> - goto free_cq1;
> -
> return 0;
Will this removal break the admin channel on Physical Functions (PFs)?
Because PFs frequently have all their standard INTR_CTRL slots consumed by
regular networking queues, they rely on the dedicated SRIOV_INTR slot for
the admin channel.
Removing this fallback means enic_admin_setup_intr() might fail its bounds
check or vnic_intr_alloc() might fail, breaking initialization.
[ ... ]
> @@ -165,10 +353,32 @@ static void enic_admin_free_resources(struct enic *enic)
>
> static void enic_admin_init_resources(struct enic *enic)
> {
> + unsigned int intr_offset = enic->admin_intr_index;
> +
> vnic_wq_init(&enic->admin_wq, 0, 0, 0);
> vnic_rq_init(&enic->admin_rq, 1, 0, 0);
> - vnic_cq_init(&enic->admin_cq[0], 0, 1, 0, 0, 1, 0, 1, 0, 0, 0);
> - vnic_cq_init(&enic->admin_cq[1], 0, 1, 0, 0, 1, 0, 1, 0, 0, 0);
> + vnic_cq_init(&enic->admin_cq[0],
> + 0 /* flow_control_enable */,
> + 1 /* color_enable */,
> + 0 /* cq_head */,
> + 0 /* cq_tail */,
> + 1 /* cq_tail_color */,
> + 1 /* interrupt_enable */,
Could setting interrupt_enable to 1 on the transmit completion queue (admin_cq[0])
cause an interrupt storm?
When a transmit completion occurs, it asserts the admin MSI-X interrupt and
schedules the NAPI poll routine (enic_admin_napi_poll). However, the NAPI
poll routine only services the receive queue (admin_cq[1]) and never clears
the event for admin_cq[0].
Since NAPI unmasks the interrupt upon completion, the uncleared transmit
completion would immediately re-trigger the interrupt.
[ ... ]
> @@ -187,12 +397,24 @@ int enic_admin_channel_open(struct enic *enic)
> return err;
> }
>
> + err = enic_admin_setup_intr(enic);
> + if (err) {
> + netdev_err(enic->netdev,
> + "Admin channel requires MSI-X, SR-IOV unavailable: %d\n",
> + err);
> + goto free_resources;
> + }
> +
> + spin_lock_init(&enic->admin_msg_lock);
> + INIT_LIST_HEAD(&enic->admin_msg_list);
> + INIT_WORK(&enic->admin_msg_work, enic_admin_msg_work_handler);
Can a race condition occur here?
enic_admin_setup_intr() calls request_irq(), which registers the interrupt
handler and enables the IRQ line. If an interrupt fires immediately upon
registration, the ISR will schedule NAPI, which calls enic_admin_rq_cq_service().
NAPI would then attempt to acquire the uninitialized spinlock, modify the
uninitialized list, and schedule the uninitialized workqueue. Should the data
structures be initialized before requesting the interrupt?
[ ... ]
> return 0;
>
> disable_queues:
> + enic_admin_teardown_intr(enic);
> vnic_wq_disable(&enic->admin_wq);
> vnic_rq_disable(&enic->admin_rq);
> enic_admin_qp_type_set(enic, 0);
> enic_admin_rq_drain(enic);
> +free_resources:
> enic_admin_free_resources(enic);
> return err;
> }
Does the disable_queues error path leak queued messages?
If initialization fails after enic_admin_setup_intr() is called (for example,
during enic_admin_rq_fill()), the code branches to disable_queues. By this time,
the MSI-X interrupt and NAPI have been enabled.
If an interrupt fired and scheduled NAPI, messages could be added to
admin_msg_list and admin_msg_work could be scheduled. The error path does not
call cancel_work_sync(&enic->admin_msg_work) or enic_admin_msg_drain(enic),
which are required to stop the deferred work.
^ permalink raw reply
* Re: [PATCH net-next v4 05/10] enic: define MBOX message types and header structures
From: Jakub Kicinski @ 2026-04-14 0:21 UTC (permalink / raw)
To: devnull+satishkh.cisco.com
Cc: Jakub Kicinski, andrew+netdev, davem, edumazet, pabeni, netdev,
linux-kernel, 20260401-enic-sriov-v2-prep-v4-0-d5834b2ef1b9,
satishkh
In-Reply-To: <20260411-enic-sriov-v2-admin-channel-v2-v4-5-f052326c2a57@cisco.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
enic: define MBOX message types and header structures
This commit defines the mailbox protocol used for PF-VF communication over
the admin channel. It introduces the request/reply message types and the
structures representing the message headers and payloads.
> --- /dev/null
> +++ b/drivers/net/ethernet/cisco/enic/enic_mbox.h
> @@ -0,0 +1,75 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/* Copyright 2025 Cisco Systems, Inc. All rights reserved. */
> +
> +#ifndef _ENIC_MBOX_H_
> +#define _ENIC_MBOX_H_
This isn't a bug, but should this header include <linux/types.h> and
<linux/bits.h>? It uses fixed-width types like __le16, __le32, u8 and the
BIT() macro. Without these includes, it might cause compilation errors if
included in an isolated context.
[ ... ]
> +struct enic_mbox_generic_reply {
> + __le16 ret_major;
> + __le16 ret_minor;
> +};
[ ... ]
> +struct enic_mbox_vf_capability_reply_msg {
> + struct enic_mbox_generic_reply reply;
> + __le32 version;
> + __le32 reserved[32];
> +};
Does this structure introduce misaligned memory accesses by violating
structure alignment inheritance rules?
struct enic_mbox_generic_reply consists only of __le16 fields, giving it
a natural 2-byte alignment. struct enic_mbox_vf_capability_reply_msg
embeds this reply struct and subsequently adds __le32 version and
__le32 reserved[32], which require 4-byte alignment.
If the outer structure is cast from a buffer that only guarantees the inner
structure's 2-byte alignment, the 4-byte aligned fields could cause
misaligned memory accesses.
Should the wider 32-bit fields be split into smaller __le16 fields to
strictly preserve the inherited 2-byte alignment?
^ permalink raw reply
* Re: [PATCH net-next v4 06/10] enic: add MBOX core send and receive for admin channel
From: Jakub Kicinski @ 2026-04-14 0:21 UTC (permalink / raw)
To: devnull+satishkh.cisco.com
Cc: Jakub Kicinski, andrew+netdev, davem, edumazet, pabeni, netdev,
linux-kernel, 20260401-enic-sriov-v2-prep-v4-0-d5834b2ef1b9,
satishkh
In-Reply-To: <20260411-enic-sriov-v2-admin-channel-v2-v4-6-f052326c2a57@cisco.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
Subject: enic: add MBOX core send and receive for admin channel
This commit implements the mailbox protocol engine used for PF-VF
communication over the admin channel.
> diff --git a/drivers/net/ethernet/cisco/enic/enic_admin.c b/drivers/net/ethernet/cisco/enic/enic_admin.c
> index 345d194c6eebe..c96268adc173f 100644
> --- a/drivers/net/ethernet/cisco/enic/enic_admin.c
> +++ b/drivers/net/ethernet/cisco/enic/enic_admin.c
> @@ -19,6 +19,7 @@
> #include "cq_enet_desc.h"
> #include "wq_enet_desc.h"
> #include "rq_enet_desc.h"
> +#include "enic_mbox.h"
>
> /* No-op: admin WQ buffers are freed inline after completion polling */
> static void enic_admin_wq_buf_clean(struct vnic_wq *wq,
> struct vnic_wq_buf *buf)
> {
> }
If an MBOX send times out, the timeout path intentionally leaves the buffer
and DMA mapping pending to prevent a hardware use-after-free. Because this
cleanup function is a no-op, are those pending descriptors and mappings
permanently leaked when the admin channel is later closed or the driver is
removed?
[ ... ]
> @@ -156,7 +157,26 @@ unsigned int enic_admin_rq_cq_service(struct enic *enic, unsigned int budget)
> buf->dma_addr, buf->len,
> DMA_FROM_DEVICE);
>
> - enic_admin_msg_enqueue(enic, buf->os_buf, buf->len);
> + if (enic->admin_rq_handler) {
> + struct cq_enet_rq_desc *rq_desc = desc;
> + u16 sender_vlan;
Does this sequence process the CQ completion unconditionally without checking
the hardware descriptor for errors?
Standard data paths typically verify flags like packet_error or fcs_ok before
accepting the packet. If corrupted administrative messages are received here,
will they be processed as valid?
> +
> + /* Firmware sets the CQ VLAN field to identify the
> + * sender: 0 = PF, 1-based = VF index. Overwrite
> + * the untrusted src_vnic_id in the MBOX header with
> + * the hardware-verified value.
> + */
> + sender_vlan = le16_to_cpu(rq_desc->vlan);
> + if (buf->len >= sizeof(struct enic_mbox_hdr)) {
> + struct enic_mbox_hdr *hdr = buf->os_buf;
> +
> + hdr->src_vnic_id = (sender_vlan == 0) ?
> + cpu_to_le16(ENIC_MBOX_DST_PF) :
> + cpu_to_le16(sender_vlan - 1);
> + }
> +
> + enic_admin_msg_enqueue(enic, buf->os_buf, buf->len);
> + }
Is buf->len the original static allocation size (2048 bytes) rather than the
actual number of bytes written by the hardware?
If the hardware wrote fewer bytes than the full buffer capacity, it looks
like passing buf->len will enqueue uninitialized heap memory. This would
also bypass the bounds checks in enic_mbox_recv_handler(), as the length
will appear to always be 2048 regardless of the actual packet size.
[ ... ]
> diff --git a/drivers/net/ethernet/cisco/enic/enic_mbox.c b/drivers/net/ethernet/cisco/enic/enic_mbox.c
> new file mode 100644
> index 0000000000000..d144c86d9ef84
> --- /dev/null
> +++ b/drivers/net/ethernet/cisco/enic/enic_mbox.c
[ ... ]
> + vnic_wq_doorbell(wq);
> +
> + timeout = jiffies + usecs_to_jiffies(ENIC_MBOX_POLL_TIMEOUT_US);
> + err = -ETIMEDOUT;
> + while (time_before(jiffies, timeout)) {
> + if (enic_admin_wq_cq_service(enic)) {
> + err = 0;
> + break;
> + }
> + usleep_range(ENIC_MBOX_POLL_INTERVAL_US,
> + ENIC_MBOX_POLL_INTERVAL_US + 50);
> + }
Could this loop falsely report a timeout if the thread experiences a long
scheduling delay during usleep_range()?
If the thread wakes up after the timeout has passed, it will exit the loop
immediately without checking one last time. If the hardware actually
completed the operation during the sleep window, the completion would be
missed, incorrectly disabling the channel.
[ ... ]
^ permalink raw reply
* Re: [PATCH net-next v4 07/10] enic: add MBOX PF handlers for VF register and capability
From: Jakub Kicinski @ 2026-04-14 0:21 UTC (permalink / raw)
To: devnull+satishkh.cisco.com
Cc: Jakub Kicinski, andrew+netdev, davem, edumazet, pabeni, netdev,
linux-kernel, 20260401-enic-sriov-v2-prep-v4-0-d5834b2ef1b9,
satishkh
In-Reply-To: <20260411-enic-sriov-v2-admin-channel-v2-v4-7-f052326c2a57@cisco.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
enic: add MBOX PF handlers for VF register and capability
Implement PF-side mailbox message processing for SR-IOV V2 admin channel
communication. It processes capability, register, and unregister requests
from VFs.
> diff --git a/drivers/net/ethernet/cisco/enic/enic_mbox.c b/drivers/net/ethernet/cisco/enic/enic_mbox.c
> index d144c86d9ef84..f5784624ebbdf 100644
> --- a/drivers/net/ethernet/cisco/enic/enic_mbox.c
> +++ b/drivers/net/ethernet/cisco/enic/enic_mbox.c
[ ... ]
> @@ -124,10 +124,168 @@ int enic_mbox_send_msg(struct enic *enic, u8 msg_type, u16 dst_vnic_id,
[ ... ]
> +static int enic_mbox_pf_handle_unregister(struct enic *enic, void *msg,
> + u16 vf_id, u64 msg_num)
> +{
> + struct enic_mbox_vf_register_reply_msg reply = {};
> + int err;
> +
> + if (!enic->vf_state || vf_id >= enic->num_vfs) {
> + netdev_warn(enic->netdev,
> + "MBOX: unregister from invalid VF %u\n", vf_id);
> + return -EINVAL;
> + }
> +
> + reply.reply.ret_major = cpu_to_le16(0);
> + err = enic_mbox_send_msg(enic, ENIC_MBOX_VF_UNREGISTER_REPLY, vf_id,
> + &reply, sizeof(reply));
> + if (err)
> + return err;
> +
> + enic->vf_state[vf_id].registered = false;
If enic_mbox_send_msg() fails (for example, due to a full admin WQ or channel
timeout), the function returns early and the PF misses setting the registered
state to false.
Since an unregister request indicates the VF driver is unloading or shutting
down, should the PF unconditionally clear the registration state regardless of
whether the acknowledgment successfully reaches the defunct VF?
If the local state is not cleared, could the PF assume the VF is still active
and continue attempting to send it unnecessary link state notifications?
[ ... ]
> +static void enic_mbox_pf_process_msg(struct enic *enic,
> + struct enic_mbox_hdr *hdr, void *payload)
> +{
> + u16 vf_id = le16_to_cpu(hdr->src_vnic_id);
> + u16 msg_len = le16_to_cpu(hdr->msg_len);
> + int err = 0;
> +
> + if (!enic->vf_state) {
> + netdev_dbg(enic->netdev,
> + "MBOX: PF received msg but SRIOV not active\n");
> + return;
> + }
> +
> + if (vf_id >= enic->num_vfs) {
> + netdev_warn(enic->netdev,
> + "MBOX: PF received msg from invalid VF %u\n",
> + vf_id);
> + return;
> + }
Could a malicious or buggy guest VM use this untrusted VF input to cause a
host denial of service through log spamming?
Because this warning is not rate-limited, an untrusted VF could continuously
flood the PF's admin channel with malformed requests. This might exhaust host
CPU cycles and cause dmesg lock contention.
Should these logging calls in the message parsing path use
netdev_warn_ratelimited() instead?
[ ... ]
> @@ -142,10 +300,22 @@ static void enic_mbox_recv_handler(struct enic *enic, void *buf,
> return;
> }
>
> + msg_len = le16_to_cpu(hdr->msg_len);
> + if (msg_len < sizeof(*hdr) || msg_len > len) {
> + netdev_warn(enic->netdev,
> + "MBOX: invalid msg_len %u (buf len %u)\n",
> + msg_len, len);
> + return;
> + }
Similarly here, does this log also need rate limiting to protect the host
from being spammed by malformed headers sent from a VF?
^ permalink raw reply
* Re: [PATCH net-next v4 08/10] enic: add MBOX VF handlers for capability, register and link state
From: Jakub Kicinski @ 2026-04-14 0:22 UTC (permalink / raw)
To: devnull+satishkh.cisco.com
Cc: Jakub Kicinski, andrew+netdev, davem, edumazet, pabeni, netdev,
linux-kernel, 20260401-enic-sriov-v2-prep-v4-0-d5834b2ef1b9,
satishkh
In-Reply-To: <20260411-enic-sriov-v2-admin-channel-v2-v4-8-f052326c2a57@cisco.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
enic: add MBOX VF handlers for capability, register and link state
This commit implements VF-side mailbox message processing for SR-IOV V2
admin channel communication, including handlers for capability checks,
registration, and link state notifications.
> diff --git a/drivers/net/ethernet/cisco/enic/enic_mbox.c b/drivers/net/ethernet/cisco/enic/enic_mbox.c
> index f5784624ebbdf..b5ed31450ee76 100644
> --- a/drivers/net/ethernet/cisco/enic/enic_mbox.c
> +++ b/drivers/net/ethernet/cisco/enic/enic_mbox.c
[ ... ]
> @@ -280,6 +291,136 @@ static void enic_mbox_pf_process_msg(struct enic *enic,
> hdr->msg_type, vf_id, err);
> }
>
> +static void enic_mbox_vf_handle_capability_reply(struct enic *enic,
> + void *payload)
> +{
> + struct enic_mbox_vf_capability_reply_msg *reply = payload;
> +
> + if (le16_to_cpu(reply->reply.ret_major) == 0)
> + enic->pf_cap_version = le32_to_cpu(reply->version);
> + complete(&enic->mbox_comp);
> +}
Does this code handle delayed replies safely?
If a capability request times out and the VF proceeds to send a register
request, could a delayed capability reply call complete() here and prematurely
wake up the register request?
Since there is only a single shared completion and no validation of the
currently expected reply type, the new request might wake up, evaluate unmatched
state, and fail incorrectly.
[ ... ]
> +static void enic_mbox_vf_handle_unregister_reply(struct enic *enic,
> + void *payload)
> +{
> + struct enic_mbox_vf_register_reply_msg *reply = payload;
This isn't a bug, but is it intentional to use the register_reply_msg structure
here for an unregister reply? While they currently have the same layout, this
could be brittle if the structures diverge in the future.
> +
> + if (le16_to_cpu(reply->reply.ret_major)) {
> + netdev_warn(enic->netdev,
> + "MBOX: VF unregister rejected by PF: %u/%u\n",
> + le16_to_cpu(reply->reply.ret_major),
> + le16_to_cpu(reply->reply.ret_minor));
> + } else {
> + enic->vf_registered = false;
> + }
> + complete(&enic->mbox_comp);
> +}
[ ... ]
> + case ENIC_MBOX_VF_UNREGISTER_REPLY: {
> + size_t exp = sizeof(struct enic_mbox_vf_register_reply_msg);
Similarly here, the register_reply_msg size is being used for the unregister
reply validation check.
> +
> + if (!enic_mbox_vf_payload_ok(enic, hdr->msg_type,
> + payload_len, exp))
> + return;
> + enic_mbox_vf_handle_unregister_reply(enic, payload);
> + break;
> + }
[ ... ]
> +int enic_mbox_vf_unregister(struct enic *enic)
> +{
> + int err;
> +
> + if (!enic->vf_registered)
> + return 0;
> +
> + reinit_completion(&enic->mbox_comp);
> +
> + err = enic_mbox_send_msg(enic, ENIC_MBOX_VF_UNREGISTER_REQUEST,
> + ENIC_MBOX_DST_PF, NULL, 0);
> + if (err)
> + return err;
> +
> + err = enic_mbox_wait_reply(enic, 3000);
> +
> + return enic->vf_registered ? -ETIMEDOUT : 0;
> }
Is the err value from enic_mbox_wait_reply() intentionally ignored here?
If the PF successfully receives the request but explicitly rejects it,
enic_mbox_wait_reply() returns 0, but enic->vf_registered remains true. This
causes a protocol-level rejection to be conflated with a channel timeout by
returning -ETIMEDOUT.
^ permalink raw reply
* Re: [PATCH net-next v4 09/10] enic: wire V2 SR-IOV enable with admin channel and MBOX
From: Jakub Kicinski @ 2026-04-14 0:22 UTC (permalink / raw)
To: devnull+satishkh.cisco.com
Cc: Jakub Kicinski, andrew+netdev, davem, edumazet, pabeni, netdev,
linux-kernel, 20260401-enic-sriov-v2-prep-v4-0-d5834b2ef1b9,
satishkh
In-Reply-To: <20260411-enic-sriov-v2-admin-channel-v2-v4-9-f052326c2a57@cisco.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
enic: wire V2 SR-IOV enable with admin channel and MBOX
This commit extends enic_sriov_configure() to handle V2 SR-IOV VFs,
allocating per-VF MBOX state and opening the admin channel before calling
pci_enable_sriov(). It also reverses the order in the disable path.
Note: The format of this report is altered due to recitation restrictions.
The original patch code is not quoted, and a free-form summary of the
observations is provided instead.
In enic_sriov_v2_enable(), enic->num_vfs is set and ENIC_SRIOV_ENABLED is
asserted, but is the enic->pp array reallocated to accommodate the VFs?
If enic->pp remains a 1-element array, could subsequent network device
operations bypass the vf < enic->num_vfs check and calculate an
out-of-bounds pointer for vf > 0 using the ENIC_PP_BY_INDEX macro,
corrupting the heap?
Is enic_sriov_configure() actually registered as the .sriov_configure
callback in the enic_driver structure?
Because the function is marked __maybe_unused, does this leave the sysfs
interface unable to invoke it, making the V2 VF dynamic enablement
unreachable?
When enic_sriov_configure() is called via sysfs, it mutates enic->pp and
enic->num_vfs. Does this happen without holding the rtnl_lock?
If concurrent network device operations running under rtnl_lock validate
the VF index and resolve enic->pp right before the sysfs path calls
kfree(enic->pp), can this result in a use-after-free when the stale pointer
is dereferenced?
In the SR-IOV disable path inside enic_sriov_configure(), the code frees
the existing port profiles array and reallocates a single element with
kzalloc_obj().
If this allocation fails and returns -ENOMEM, enic->pp is left as NULL.
Could a subsequent NDO call targeting the PF pass the enic_is_valid_pp_vf()
check and crash when dereferencing the NULL enic->pp?
When V2 SR-IOV is enabled, enic_sriov_v2_enable() calls enic_mbox_init(),
which unconditionally executes mutex_init(&enic->mbox_lock) and
init_completion(&enic->mbox_comp).
If SR-IOV is disabled and later re-enabled, will re-initializing these
active synchronization primitives over their existing state violate locking
rules?
The commit message states:
"The existing V1/USNIC SR-IOV paths are unchanged."
However, in enic_remove(), the calls to enic_dev_deinit(enic) and
vnic_dev_close(enic->vdev) were moved to execute after pci_disable_sriov().
Additionally, the enic->priv_flags &= ~ENIC_SRIOV_ENABLED cleanup was
dropped.
Could the commit message be updated to reflect this behavioral change to the
hardware teardown sequence, or was this change unintended?
^ permalink raw reply
* Re: [PATCH net-next v4 10/10] enic: add V2 VF probe with admin channel and PF registration
From: Jakub Kicinski @ 2026-04-14 0:22 UTC (permalink / raw)
To: devnull+satishkh.cisco.com
Cc: Jakub Kicinski, andrew+netdev, davem, edumazet, pabeni, netdev,
linux-kernel, 20260401-enic-sriov-v2-prep-v4-0-d5834b2ef1b9,
satishkh
In-Reply-To: <20260411-enic-sriov-v2-admin-channel-v2-v4-10-f052326c2a57@cisco.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
enic: add V2 VF probe with admin channel and PF registration
When a V2 SR-IOV VF probes, open the admin channel, initialize the
MBOX protocol, perform the capability check with the PF, and register
with the PF. This establishes the PF-VF communication path that the PF
uses to send link state notifications.
> diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
> index 057716ccc2837..bf4417e67b16d 100644
> --- a/drivers/net/ethernet/cisco/enic/enic_main.c
> +++ b/drivers/net/ethernet/cisco/enic/enic_main.c
[ ... ]
> @@ -2992,6 +2997,32 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
> goto err_out_dev_close;
> }
>
> + /* V2 VF: open admin channel and register with PF.
> + * Must happen before register_netdev so the VF is fully
> + * initialized before the interface is visible to userspace.
> + */
> + if (enic_is_sriov_vf_v2(enic)) {
> + err = enic_admin_channel_open(enic);
> + if (err) {
> + dev_err(dev,
> + "Failed to open admin channel: %d\n", err);
> + goto err_out_dev_deinit;
> + }
> + enic_mbox_init(enic);
Could this ordering cause a race condition with an uninitialized completion object?
enic_admin_channel_open() dynamically allocates the MSI-X interrupt, registers
the IRQ handler, and enables the admin hardware queues, making the interrupt
live immediately.
Only afterward is enic_mbox_init() called, which initializes the
enic->mbox_comp completion struct. If a spurious interrupt or a left-over
admin message causes the admin NAPI poll handler to fire between these two
calls, it will attempt to call complete(&enic->mbox_comp) before the
completion struct's internal lock has been initialized.
Also regarding the admin channel resources, are we properly handling the admin
channel across a soft reset?
When the driver recovers from a Tx hang or queue error via
enic_tx_hang_reset() or enic_reset(), it issues a CMD_SOFT_RESET to the
firmware, which resets and disables all hardware queues on the VNIC. The
recovery path then calls enic_init_vnic_resources(), which only re-initializes
and enables the datapath queues (wq, rq, cq).
It completely ignores the admin channel queues (admin_wq, admin_rq, admin_cq).
As a result, the admin queues remain disabled in hardware and desynchronized
from the software descriptors.
> + err = enic_mbox_vf_capability_check(enic);
[ ... ]
> @@ -3156,6 +3199,17 @@ static void enic_remove(struct pci_dev *pdev)
> cancel_work_sync(&enic->reset);
> cancel_work_sync(&enic->change_mtu_work);
> unregister_netdev(netdev);
> + if (enic_is_sriov_vf_v2(enic)) {
> + if (enic->vf_registered) {
> + int unreg_err = enic_mbox_vf_unregister(enic);
> +
> + if (unreg_err)
> + netdev_warn(netdev,
> + "Failed to unregister from PF: %d\n",
> + unreg_err);
> + }
> + enic_admin_channel_close(enic);
> + }
Does this unregistration ordering introduce a use-after-free on the freed
netdev?
Because the admin interrupt is not masked by ndo_stop(), it remains fully
active after the network device is shut down. If the PF asynchronously sends a
link state notification after unregister_netdev(netdev) but before
enic_admin_channel_close(enic):
The admin workqueue will process the notification and call
netif_carrier_on(enic->netdev). This adds the device to the lweventlist and
calls dev_hold(dev), bumping dev->pcpu_refcnt.
However, unregister_netdev() has already completed its wait for pcpu_refcnt to
drop to zero. The driver then proceeds to call free_netdev(netdev), freeing
both the pcpu_refcnt allocation and the netdev memory. When the linkwatch
worker subsequently processes the event, it accesses this freed memory.
Should the admin channel be closed before unregister_netdev()?
> diff --git a/drivers/net/ethernet/cisco/enic/enic_res.c b/drivers/net/ethernet/cisco/enic/enic_res.c
> index 436326ace0499..74cd2ee3af5cb 100644
> --- a/drivers/net/ethernet/cisco/enic/enic_res.c
> +++ b/drivers/net/ethernet/cisco/enic/enic_res.c
> @@ -211,7 +211,8 @@ void enic_get_res_counts(struct enic *enic)
> vnic_dev_get_res_count(enic->vdev, RES_TYPE_ADMIN_RQ) >= 1 &&
> vnic_dev_get_res_count(enic->vdev, RES_TYPE_ADMIN_CQ) >=
> ARRAY_SIZE(enic->admin_cq) &&
> - vnic_dev_get_res_count(enic->vdev, RES_TYPE_SRIOV_INTR) >= 1;
> + (enic_is_sriov_vf_v2(enic) ||
> + vnic_dev_get_res_count(enic->vdev, RES_TYPE_SRIOV_INTR) >= 1);
Does this configuration lead to interrupt resource exhaustion and probe failures?
enic_adjust_resources() allocates all available interrupts (minus
ENIC_MSIX_RESERVED_INTR) to data queues, which sets enic->intr_count equal to
enic->intr_avail when queue counts are maximized.
However, enic_admin_setup_intr() requires an additional interrupt at index
intr_index = enic->intr_count. This triggers the
intr_index >= enic->intr_avail bounds check and returns -ENODEV, causing the
VF probe to abort.
Additionally, on the PF side, vnic_intr_alloc() hardcodes the resource type to
RES_TYPE_INTR_CTRL, effectively stealing a standard datapath interrupt instead
of utilizing the dedicated RES_TYPE_SRIOV_INTR hardware resource provisioned
by the firmware.
^ permalink raw reply
* Re: [Intel-wired-lan] [PATCH net] idpf: fix double free and use-after-free in aux device error paths
From: Jacob Keller @ 2026-04-14 0:46 UTC (permalink / raw)
To: Greg Kroah-Hartman, intel-wired-lan, netdev@vger.kernel.org,
Jakub Kicinski
Cc: netdev, linux-kernel, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
stable
In-Reply-To: <2026041116-retail-bagginess-250f@gregkh>
On 4/11/2026 3:12 AM, Greg Kroah-Hartman wrote:
> When auxiliary_device_add() fails in idpf_plug_vport_aux_dev() or
> idpf_plug_core_aux_dev(), the err_aux_dev_add label calls
> auxiliary_device_uninit() and falls through to err_aux_dev_init. The
> uninit call will trigger put_device(), which invokes the release
> callback (idpf_vport_adev_release / idpf_core_adev_release) that frees
> iadev. The fall-through then reads adev->id from the freed iadev for
> ida_free() and double-frees iadev with kfree().
>
> Free the IDA slot and clear the back-pointer before uninit, while adev
> is still valid, then return immediately.
>
> Commit 65637c3a1811 65637c3a1811 ("idpf: fix UAF in RDMA core aux dev
> deinitialization") fixed the same use-after-free in the matching unplug
> path in this file but missed both probe error paths.
>
> Cc: Tony Nguyen <anthony.l.nguyen@intel.com>
> Cc: Przemek Kitszel <przemyslaw.kitszel@intel.com>
> Cc: Andrew Lunn <andrew+netdev@lunn.ch>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: stable <stable@kernel.org>
> Fixes: be91128c579c ("idpf: implement RDMA vport auxiliary dev create, init, and destroy")
> Fixes: f4312e6bfa2a ("idpf: implement core RDMA auxiliary dev create, init, and destroy")
> Assisted-by: gregkh_clanker_t1000
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> ---
This is targeted at [net]. The fix seems straight forward enough.
@Jakub, I have no objections if you want to pull this directly. I am not
sure our validation team will find anything when testing since this is
an error path that is historically been difficult for us to test.
I'm also fine with taking it through iwl-net if you prefer, but just
want to avoid duplicate work if you're already considering it.
> Note, these cleanup paths are messy, but I couldn't see a simpler way
> without a lot more rework, so I choose the simple way :)
>
Yea, I didn't see a better way either.
Thanks,
Jake
^ permalink raw reply
* Re: [PATCH net-next v2 09/14] net: macb: change caps helpers signatures
From: Jakub Kicinski @ 2026-04-14 0:47 UTC (permalink / raw)
To: Théo Lebrun
Cc: Nicolas Ferre, Claudiu Beznea, Andrew Lunn, David S. Miller,
Eric Dumazet, Paolo Abeni, Richard Cochran, Russell King,
Paolo Valerio, Conor Dooley, Nicolai Buchwitz,
Vladimir Kondratiev, Gregory CLEMENT, Benoît Monin,
Tawfik Bayouk, Thomas Petazzoni, Maxime Chevallier, netdev,
linux-kernel
In-Reply-To: <20260410-macb-context-v2-9-af39f71d40b6@bootlin.com>
On Fri, 10 Apr 2026 21:51:57 +0200 Théo Lebrun wrote:
> For parallel MACB context to start become a reality, many functions will
> soon not have access to `struct macb *bp`. Those will still have access
> to caps through ctx->info->caps.
>
> Change all caps helpers signatures, from taking `struct macb *bp` to
> taking `u32 caps`.
Subjectively I feel like this is a slight loss of type safety.
Someone may pass the wrong u32 and compiler will not help?
^ permalink raw reply
* Re: [PATCH net-next v2 13/14] net: macb: use context swapping in .set_ringparam()
From: Jakub Kicinski @ 2026-04-14 0:50 UTC (permalink / raw)
To: Théo Lebrun
Cc: Nicolas Ferre, Claudiu Beznea, Andrew Lunn, David S. Miller,
Eric Dumazet, Paolo Abeni, Richard Cochran, Russell King,
Paolo Valerio, Conor Dooley, Nicolai Buchwitz,
Vladimir Kondratiev, Gregory CLEMENT, Benoît Monin,
Tawfik Bayouk, Thomas Petazzoni, Maxime Chevallier, netdev,
linux-kernel
In-Reply-To: <20260410-macb-context-v2-13-af39f71d40b6@bootlin.com>
On Fri, 10 Apr 2026 21:52:01 +0200 Théo Lebrun wrote:
> ethtool_ops.set_ringparam() is implemented using the primitive close /
> update ring size / reopen sequence. Under memory pressure this does not
> fly: we free our buffers at close and cannot reallocate new ones at
> open. Also, it triggers a slow PHY reinit.
>
> Instead, exploit the new context mechanism and improve our sequence to:
> - allocate a new context (including buffers) first
> - if it fails, early return without any impact to the interface
> - stop interface
> - update global state (bp, netdev, etc)
> - pass buffer pointers to the hardware
> - start interface
> - free old context.
>
> The HW disable sequence is inspired by macb_reset_hw() but avoids
> (1) setting NCR bit CLRSTAT and (2) clearing register PBUFRXCUT.
>
> The HW re-enable sequence is inspired by macb_mac_link_up(), skipping
> over register writes which would be redundant (because values have not
> changed).
>
> The generic context swapping parts are isolated into helper functions
> macb_context_swap_start|end(), reusable by other operations (change_mtu,
> set_channels, etc).
> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index 81beb67b206a..340ae7d881c6 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -3081,6 +3081,89 @@ static void macb_configure_dma(struct macb *bp)
> }
> }
>
> +static void macb_context_swap_start(struct macb *bp)
> +{
> + struct macb_queue *queue;
> + unsigned long flags;
> + unsigned int q;
> + u32 ctrl;
> +
> + /* Disable software Tx, disable HW Tx/Rx and disable NAPI. */
> +
> + netif_tx_disable(bp->netdev);
AFAIR netif_tx_disable() just stops all the queues, if the NAPIs and
whatever else may wake queues is still running the queues may get
restarted right away.
> + spin_lock_irqsave(&bp->lock, flags);
> +
> + ctrl = macb_readl(bp, NCR);
> + macb_writel(bp, NCR, ctrl & ~(MACB_BIT(RE) | MACB_BIT(TE)));
> +
> + macb_writel(bp, TSR, -1);
> + macb_writel(bp, RSR, -1);
> +
> + for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
> + queue_writel(queue, IDR, -1);
> + queue_readl(queue, ISR);
> + if (bp->caps & MACB_CAPS_ISR_CLEAR_ON_WRITE)
> + queue_writel(queue, ISR, -1);
> + }
> +
^ permalink raw reply
* Re: [Intel-wired-lan] [PATCH iwl-net v2 1/6] ixgbe: fix SWFW semaphore timeout for X550 family
From: Jacob Keller @ 2026-04-14 0:56 UTC (permalink / raw)
To: Aleksandr Loktionov, intel-wired-lan, anthony.l.nguyen; +Cc: netdev
In-Reply-To: <20260408131154.2661818-2-aleksandr.loktionov@intel.com>
On 4/8/2026 6:11 AM, Aleksandr Loktionov wrote:
> According to FW documentation, the most time-consuming FW operation is
> Shadow RAM (SR) dump which takes up to 3.2 seconds. For X550 family
> devices the module-update FW command can take over 4.5 s. The default
> semaphore loop runs 200 iterations with a 5 ms sleep each, giving a
> maximum wait of 1 s -- not "200 ms" as previously stated in error.
> This is insufficient for X550 family FW update operations and causes
> spurious EBUSY failures.
>
> Extend the SW/FW semaphore timeout from 1 s to 5 s (1000 iterations x
> 5 ms) for all three X550 variants: ixgbe_mac_X550, ixgbe_mac_X550EM_x,
> and ixgbe_mac_x550em_a. All three share the same FW and exhibit the
> same worst-case latency. Use three explicit mac.type comparisons rather
> than a range check so future MAC additions are not inadvertently
> captured.
>
> The timeout variable is set immediately before the loop so the intent
> is clear, with an inline comment stating the resulting maximum delay.
>
> Suggested-by: Soumen Karmakar <soumen.karmakar@intel.com>
> Cc: stable@vger.kernel.org
> Suggested-by: Marta Plantykow <marta.a.plantykow@intel.com>
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> ---
> v1 -> v2:
> - Squash with 0015 (X550EM extension); fix commit message ("200ms" was
> wrong, actual default is 1 s); replace >= / <= range check with three
> explicit mac.type == comparisons per Tony Nguyen.
>
> drivers/net/ethernet/intel/ixgbe/ixgbe_x540.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_x540.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_x540.c
> index e67e2fe..a3c8f51 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_x540.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_x540.c
> @@ -577,6 +577,15 @@ int ixgbe_acquire_swfw_sync_X540(struct ixgbe_hw *hw, u32 mask)
>
> swmask |= swi2c_mask;
> fwmask |= swi2c_mask << 2;
> + /* Extend to 5 s (1000 x 5 ms) for X550 family; default is 1 s
> + * (200 x 5 ms). FW SR-dump takes up to 3.2 s; module-update up
> + * to 4.5 s.
> + */
> + if (hw->mac.type == ixgbe_mac_X550 ||
> + hw->mac.type == ixgbe_mac_X550EM_x ||
> + hw->mac.type == ixgbe_mac_x550em_a)
> + timeout = 1000;
> +
Typically, I would request and prefer if we would refactor timeout loops
like this to use read_poll_timeout() instead of open coding the loop
like we do here. The current loop is somewhat complicated so I can
understand it might be tricky to refactor.
The issue with open coded loops like this is that usleep_range has
variable length waiting time, so we sleep for anywhere between 5 and 6
milliseconds in this case. This makes the total amount of time waiting
cap at 6 seconds and not the expected 5, with the actual amount of time
waiting being variable based on when the usleep_range wakes up.
Perhaps this specific loop is a bit more complicated and not worth the
effort to refactor to read_poll_timeout, but its something I've been
trying to get us to cleanup (both inside Intel drivers and in other
places in the kernel) when modifying such open-coded timeout loops.
Since this loop body is a bit more complicated (it has to take and
release the semaphore to check the condition) I can accept it doesn't
make sense to modify it here for net.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Thanks,
Jake
> for (i = 0; i < timeout; i++) {
> /* SW NVM semaphore bit is used for access to all
> * SW_FW_SYNC bits (not just NVM)
^ permalink raw reply
* Re: [PATCH net-next v2 12/14] net: macb: re-read ISR inside IRQ handler locked section
From: Jakub Kicinski @ 2026-04-14 0:52 UTC (permalink / raw)
To: theo.lebrun
Cc: Jakub Kicinski, nicolas.ferre, claudiu.beznea, andrew+netdev,
davem, edumazet, pabeni, richardcochran, linux, pvalerio, conor,
nb, vladimir.kondratiev, gregory.clement, benoit.monin,
tawfik.bayouk, thomas.petazzoni, maxime.chevallier, netdev,
linux-kernel
In-Reply-To: <20260410-macb-context-v2-12-af39f71d40b6@bootlin.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: macb: re-read ISR inside IRQ handler locked section
This commit adds a second read of the ISR register inside the lock in
macb_interrupt() to handle a race condition where the status might have
been updated by concurrent hardware resets.
> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index 2633234012bae..3deab87c565b5 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -2190,6 +2190,13 @@ static irqreturn_t macb_interrupt(int irq, void *dev_id)
>
> spin_lock(&bp->lock);
>
> + /* `status` stack variable might be stalled => re-read it */
> + status = queue_readl(queue, ISR);
Will this second read discard interrupt events on older MACB variants that
use clear-on-read for the ISR?
If the hardware does not have the MACB_CAPS_ISR_CLEAR_ON_WRITE capability,
the first queue_readl call before the lock already cleared the active bits.
Overwriting the local status variable here sets it to 0, which permanently
drops the initially captured Rx/Tx completions and stalls the network
interface.
> + if (unlikely(!status)) {
> + spin_unlock(&bp->lock);
> + return IRQ_NONE;
If the function reaches this point, the initial status read before the lock
was non-zero, meaning the device legitimately asserted the interrupt.
If a concurrent operation cleared the hardware state while this thread
waited for the lock, will returning IRQ_NONE falsely flag this as a
spurious interrupt?
Since the interrupt was actually raised by the device, doing this repeatedly
can cause the core IRQ subsystem to permanently disable the IRQ line.
Could this return IRQ_HANDLED instead?
> + }
> +
> while (status) {
> /* close possible race with dev_close */
> if (unlikely(!netif_running(netdev))) {
--
pw-bot: cr
^ permalink raw reply
* Re: [PATCH net-next v2 13/14] net: macb: use context swapping in .set_ringparam()
From: Jakub Kicinski @ 2026-04-14 0:56 UTC (permalink / raw)
To: theo.lebrun
Cc: Jakub Kicinski, nicolas.ferre, claudiu.beznea, andrew+netdev,
davem, edumazet, pabeni, richardcochran, linux, pvalerio, conor,
nb, vladimir.kondratiev, gregory.clement, benoit.monin,
tawfik.bayouk, thomas.petazzoni, maxime.chevallier, netdev,
linux-kernel
In-Reply-To: <20260410-macb-context-v2-13-af39f71d40b6@bootlin.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: macb: use context swapping in .set_ringparam()
This patch modifies the ethtool set_ringparam implementation to use a
context swapping mechanism. Instead of closing and reopening the interface,
it allocates new buffers, stops the interface temporarily, swaps the active
context, and then frees the old one.
> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index 3deab87c565b5..d1536cb1e40bb 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -3081,6 +3081,89 @@ static void macb_configure_dma(struct macb *bp)
[ ... ]
> +static void macb_context_swap_start(struct macb *bp)
> +{
[ ... ]
> + for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
> + napi_disable(&queue->napi_rx);
> + napi_disable(&queue->napi_tx);
This isn't a regression from this patch, but does calling napi_disable()
unconditionally crash on EMAC devices?
EMAC devices appear to not initialize NAPI, which leaves the NAPI struct
zero-initialized (n->dev is NULL), potentially causing a NULL pointer
dereference in napi_disable(). Additionally, does this path also blindly
call bp->macbgem_ops.mog_init_rings() which is NULL on EMAC?
> + netdev_tx_reset_queue(netdev_get_tx_queue(bp->netdev, q));
> + cancel_work_sync(&queue->tx_error_task);
> + }
[ ... ]
> +static void macb_context_swap_end(struct macb *bp,
> + struct macb_context *new_ctx)
> +{
[ ... ]
> + /* Swap contexts & give buffer pointers to HW. */
> +
> + old_ctx = bp->ctx;
> + bp->ctx = new_ctx;
> + macb_init_buffers(bp);
Since the new context swap doesn't call phylink_stop(), could a concurrent
macb_mac_link_up() execution set the RE and TE bits in NCR during the window
where bp->lock is dropped?
If so, macb_init_buffers() would write new DMA ring base addresses to RBQP
and TBQP while the MAC is actively enabled, which violates hardware
specifications and could cause DMA corruption.
[ ... ]
> + if (!(bp->caps & MACB_CAPS_MACB_IS_EMAC)) {
> + for (q = 0, queue = bp->queues; q < bp->num_queues;
> + ++q, ++queue) {
> + queue_writel(queue, IER,
> + bp->rx_intr_mask |
> + MACB_TX_INT_FLAGS |
> + MACB_BIT(HRESP));
> + }
> + }
Hardware interrupts are disabled unconditionally for all devices in
macb_context_swap_start() by writing -1 to IDR.
Will this explicit check prevent EMAC devices from having their interrupts
re-enabled?
Since EMAC devices do not use NAPI and rely on hardware interrupts, leaving
them masked might silently halt all network traffic after a ring parameter
change.
> +
> + ctrl = macb_readl(bp, NCR);
> + macb_writel(bp, NCR, ctrl | MACB_BIT(RE) | MACB_BIT(TE));
Can this unprotected read-modify-write of the NCR register race with phylink
state changes?
If an asynchronous phylink callback like macb_mac_config() modifies NCR
while holding bp->lock, this unlocked write could overwrite and drop those
concurrent updates.
> +
> + netif_tx_start_all_queues(bp->netdev);
> +
> + /* Free old context. */
> +
> + macb_free_consistent(old_ctx);
This isn't a regression from this patch, but does this context swap leak
pending TX packets and their DMA mappings?
When freeing the old context, macb_free_consistent() appears to only free
the pointer array. If it doesn't iterate from txq->tail to txq->head to
unmap the active DMA mappings and free the sk_buff structs, repeatedly
modifying ring parameters could cause a memory and DMA mapping leak.
^ permalink raw reply
* Re: [PATCH net-next v2 14/14] net: macb: use context swapping in .ndo_change_mtu()
From: Jakub Kicinski @ 2026-04-14 0:56 UTC (permalink / raw)
To: theo.lebrun
Cc: Jakub Kicinski, nicolas.ferre, claudiu.beznea, andrew+netdev,
davem, edumazet, pabeni, richardcochran, linux, pvalerio, conor,
nb, vladimir.kondratiev, gregory.clement, benoit.monin,
tawfik.bayouk, thomas.petazzoni, maxime.chevallier, netdev,
linux-kernel
In-Reply-To: <20260410-macb-context-v2-14-af39f71d40b6@bootlin.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: macb: use context swapping in .ndo_change_mtu()
This commit introduces a mechanism to perform context swapping when changing
the MTU while the interface is running. It allocates a new context,
reconfigures the hardware via a start and end swap process, and then frees
the old context to avoid PHY reinitialization and gracefully handle memory
pressure.
> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index d1536cb1e40bb..79371718f8392 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -3430,11 +3430,25 @@ static int macb_close(struct net_device *netdev)
>
> static int macb_change_mtu(struct net_device *netdev, int new_mtu)
> {
> - if (netif_running(netdev))
> - return -EBUSY;
> + struct macb *bp = netdev_priv(netdev);
> + bool running = netif_running(netdev);
> + struct macb_context *new_ctx;
> +
> + if (running) {
> + new_ctx = macb_context_alloc(bp, new_mtu,
> + bp->configured_rx_ring_size,
> + bp->configured_tx_ring_size);
> + if (IS_ERR(new_ctx))
> + return PTR_ERR(new_ctx);
> +
> + macb_context_swap_start(bp);
Does macb_context_swap_start() gracefully halt the transmitter before the
context is freed?
It appears to attempt disabling the hardware transmitter by clearing the
TE and RE bits in the NCR register, but does not poll the hardware to ensure
it has reached an idle state.
Immediately after, macb_context_swap_end() frees the descriptor rings and data
buffers via macb_free_consistent().
If the MAC DMA engine is still mid-transfer, could it read or write to kernel
memory that has already been freed or reallocated?
Can this code cause an AB-BA deadlock between napi_disable() and
cancel_work_sync()?
In macb_context_swap_start(), the code calls napi_disable() for all queues,
and then immediately calls cancel_work_sync() on the error tasks.
Both of these work items internally invoke napi_disable().
If either error task runs concurrently, it will block forever inside its
napi_disable() waiting for the NAPI state bits (which were already set by the
first call in macb_context_swap_start()) to clear.
Consequently, could cancel_work_sync() hang indefinitely waiting for the
blocked work task to finish?
> }
>
> WRITE_ONCE(netdev->mtu, new_mtu);
>
> + if (running)
> + macb_context_swap_end(bp, new_ctx);
Does macb_context_swap_end() leak pending TX buffers and their DMA mappings?
When changing the MTU while the interface is running, macb_context_swap_end()
calls macb_free_consistent() to free the old DMA context. While RX buffers are
properly freed, for the TX queues it appears to only free the tracking array
pointer without iterating through the ring to unmap the DMA mappings and free
the actual buffers.
Any packets queued in the hardware ring at the moment of the MTU swap might
be permanently leaked, which could lead to memory and IOMMU space exhaustion
under active traffic.
Will MTU changes on EMAC devices permanently disable hardware interrupts?
macb_context_swap_start() disables all interrupts unconditionally by writing
-1 to the IDR register.
However, macb_context_swap_end() restores interrupts via the IER register
inside an explicit check:
if (!(bp->caps & MACB_CAPS_MACB_IS_EMAC)) {
...
}
Since there is no else branch to re-enable interrupts for EMAC devices,
could changing the MTU on these devices leave interrupts permanently masked
and cause the network interface to silently stall?
Is there a missing spinlock when modifying the IER and NCR registers in
macb_context_swap_end()?
Modifications to the NCR and interrupt registers in macb_context_swap_start()
are protected by the spinlock. However, macb_context_swap_end() performs
lockless read-modify-write operations on both IER and NCR:
ctrl = macb_readl(bp, NCR);
macb_writel(bp, NCR, ctrl | MACB_BIT(RE) | MACB_BIT(TE));
If concurrent hardware events or background tasks take the lock and modify
NCR during this window, could their updates be silently overwritten?
> +
> return 0;
> }
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox