* [PATCH net 0/8][pull request] Intel Wired LAN Driver Updates 2026-05-20 (ice, iavf, i40e, ixgbe)
@ 2026-05-20 18:34 Tony Nguyen
2026-05-20 18:34 ` [PATCH net 1/8] ice: fix UAF/NULL deref when VSI rebuild and XDP attach race Tony Nguyen
` (7 more replies)
0 siblings, 8 replies; 21+ messages in thread
From: Tony Nguyen @ 2026-05-20 18:34 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, andrew+netdev, netdev; +Cc: Tony Nguyen
Kohei Enju disallows XDP attach when VSI is rebuilding to prevent
possible NULL dereference on ice.
Michal Schmidt adds call to ice_vsi_realloc_stat_arrays() when
reconfiguring VF VSIs on ice to resolve overwriting bounds when queues
are increased.
Jose Ignacio Tornos Martinez fixes issues with VF bonding that came
about with commit ad7c7b2172c3 ("net: hold netdev instance lock during
sysfs operations").
Further details:
https://lore.kernel.org/all/20260429102426.210750-1-jtornosm@redhat.com/
Przemyslaw Korba sets the proper PTP extts flags for i40e.
Corinna Vinschen moves from VF spinlock to RCU to prevent races in
structure accesses in ixgbe.
The following are changes since commit edc502717be153674b0b3eefb8b40734c747c138:
Merge branch 'mptcp-misc-fixes-for-v7-1-rc4'
and are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue 100GbE
Corinna Vinschen (1):
ixgbe: only access vfinfo and mv_list under RCU lock
Jose Ignacio Tornos Martinez (4):
iavf: return EBUSY if reset in progress or not ready during MAC change
i40e: skip unnecessary VF reset when setting trust
iavf: send MAC change request synchronously
ice: skip unnecessary VF reset when setting trust
Kohei Enju (1):
ice: fix UAF/NULL deref when VSI rebuild and XDP attach race
Michal Schmidt (1):
ice: fix stats array overflow when VF requests more queues
Przemyslaw Korba (1):
i40e: set supported_extts_flags for rising edge
drivers/net/ethernet/intel/i40e/i40e_ptp.c | 2 +
.../ethernet/intel/i40e/i40e_virtchnl_pf.c | 38 +-
drivers/net/ethernet/intel/iavf/iavf.h | 10 +-
drivers/net/ethernet/intel/iavf/iavf_main.c | 74 ++-
.../net/ethernet/intel/iavf/iavf_virtchnl.c | 100 +++-
drivers/net/ethernet/intel/ice/ice_lib.c | 2 +-
drivers/net/ethernet/intel/ice/ice_lib.h | 1 +
drivers/net/ethernet/intel/ice/ice_main.c | 13 +-
drivers/net/ethernet/intel/ice/ice_sriov.c | 33 +-
drivers/net/ethernet/intel/ice/ice_vf_lib.c | 7 +
drivers/net/ethernet/intel/ixgbe/ixgbe.h | 7 +-
.../net/ethernet/intel/ixgbe/ixgbe_dcb_nl.c | 36 +-
.../net/ethernet/intel/ixgbe/ixgbe_ethtool.c | 44 +-
.../net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 17 +-
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 227 +++++---
.../net/ethernet/intel/ixgbe/ixgbe_sriov.c | 547 ++++++++++++------
16 files changed, 825 insertions(+), 333 deletions(-)
--
2.47.1
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH net 1/8] ice: fix UAF/NULL deref when VSI rebuild and XDP attach race
2026-05-20 18:34 [PATCH net 0/8][pull request] Intel Wired LAN Driver Updates 2026-05-20 (ice, iavf, i40e, ixgbe) Tony Nguyen
@ 2026-05-20 18:34 ` Tony Nguyen
2026-05-21 15:37 ` Jakub Kicinski
2026-05-23 0:16 ` Jakub Kicinski
2026-05-20 18:34 ` [PATCH net 2/8] ice: fix stats array overflow when VF requests more queues Tony Nguyen
` (6 subsequent siblings)
7 siblings, 2 replies; 21+ messages in thread
From: Tony Nguyen @ 2026-05-20 18:34 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, andrew+netdev, netdev
Cc: Kohei Enju, Simon Horman, Patryk Holda, Tony Nguyen

From: Kohei Enju <kohei@enjuk.jp>

ice_xdp_setup_prog() unconditionally hot-swaps xdp_prog when
ICE_VSI_REBUILD_PENDING is set. In the attach path, this can publish a
new rx_ring->xdp_prog before rx_ring->xdp_ring becomes valid while the
rebuild is pending. As a result, ice_clean_rx_irq() may dereference
rx_ring->xdp_ring too early.

With high-volume RX packets, running these commands in parallel
triggered a KASAN splat [1].
 # ethtool --reset $DEV irq dma filter offload
 # ip link set dev $DEV xdp {obj $OBJ sec xdp,off}

Fix this by rejecting XDP attach while rebuild is pending. Keep XDP
detach allowed in this window. Detach clears rx_ring->xdp_prog, so the
RX path will not attempt to access rx_ring->xdp_ring.

[1]
 BUG: KASAN: slab-use-after-free in ice_napi_poll+0x3921/0x41a0
 Read of size 2 at addr ffff88812475b880 by task ksoftirqd/1/23
 [...]
 Call Trace:
  <TASK>
  ice_napi_poll+0x3921/0x41a0
  __napi_poll+0x98/0x520
  net_rx_action+0x8f2/0xfa0
  handle_softirqs+0x1cb/0x7f0
  [...]
  </TASK>

 Allocated by task 7246:
  ice_prepare_xdp_rings+0x3de/0x12d0
  ice_xdp+0x61c/0xef0
  dev_xdp_install+0x3c4/0x840
  dev_xdp_attach+0x50a/0x10a0
  dev_change_xdp_fd+0x175/0x210
  [...]

 Freed by task 7251:
  __rcu_free_sheaf_prepare+0x5f/0x230
  rcu_free_sheaf+0x1a/0xf0
  rcu_core+0x567/0x1d80
  handle_softirqs+0x1cb/0x7f0

Fixes: 2504b8405768 ("ice: protect XDP configuration with a mutex")
Signed-off-by: Kohei Enju <kohei@enjuk.jp>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Patryk Holda <patryk.holda@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_main.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index e2fbe111f849..f5aa31886e37 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -2912,12 +2912,21 @@ ice_xdp_setup_prog(struct ice_vsi *vsi, struct bpf_prog *prog,
 	}
 
 	/* hot swap progs and avoid toggling link */
-	if (ice_is_xdp_ena_vsi(vsi) == !!prog ||
-	    test_bit(ICE_VSI_REBUILD_PENDING, vsi->state)) {
+	if (ice_is_xdp_ena_vsi(vsi) == !!prog) {
 		ice_vsi_assign_bpf_prog(vsi, prog);
 		return 0;
 	}
 
+	if (test_bit(ICE_VSI_REBUILD_PENDING, vsi->state)) {
+		if (prog) {
+			NL_SET_ERR_MSG_MOD(extack, "VSI rebuild is pending");
+			return -EAGAIN;
+		}
+
+		ice_vsi_assign_bpf_prog(vsi, NULL);
+		return 0;
+	}
+
 	if_running = netif_running(vsi->netdev) &&
 		     !test_and_set_bit(ICE_VSI_DOWN, vsi->state);
 
-- 
2.47.1

^ permalink raw reply related [flat|nested] 21+ messages in thread
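The decision order that the patch above introduces in ice_xdp_setup_prog() can be sketched as a small userspace model. This is only an illustration under stated assumptions: setup_decision() and enum setup_action are hypothetical names, not driver code, and the three booleans stand in for ice_is_xdp_ena_vsi(vsi), a non-NULL prog, and the ICE_VSI_REBUILD_PENDING bit.

```c
#include <stdbool.h>

/* Hypothetical labels for the four outcomes; the driver itself only
 * returns 0 or -EAGAIN, or continues with the full reconfiguration.
 */
enum setup_action {
	SETUP_HOT_SWAP,		/* same enabled state: just swap the prog */
	SETUP_REJECT_EAGAIN,	/* attach while rebuild pending: refuse */
	SETUP_DETACH_ONLY,	/* detach while rebuild pending: clear prog */
	SETUP_FULL_RECONFIG,	/* no rebuild pending: set up or tear down XDP rings */
};

/* Model of the check order after the patch (hypothetical helper). */
static enum setup_action setup_decision(bool xdp_enabled, bool has_prog,
					bool rebuild_pending)
{
	/* Hot swap is taken first, regardless of the rebuild state. */
	if (xdp_enabled == has_prog)
		return SETUP_HOT_SWAP;

	/* Enabled state would change while a rebuild is pending. */
	if (rebuild_pending)
		return has_prog ? SETUP_REJECT_EAGAIN : SETUP_DETACH_ONLY;

	return SETUP_FULL_RECONFIG;
}
```

In this model, an attach on a VSI without XDP during a pending rebuild maps to SETUP_REJECT_EAGAIN, while a detach in the same window maps to SETUP_DETACH_ONLY, matching the commit message's statement that detach stays allowed.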
* Re: [PATCH net 1/8] ice: fix UAF/NULL deref when VSI rebuild and XDP attach race
2026-05-20 18:34 ` [PATCH net 1/8] ice: fix UAF/NULL deref when VSI rebuild and XDP attach race Tony Nguyen
@ 2026-05-21 15:37 ` Jakub Kicinski
2026-05-23 0:16 ` Jakub Kicinski
1 sibling, 0 replies; 21+ messages in thread
From: Jakub Kicinski @ 2026-05-21 15:37 UTC (permalink / raw)
To: Tony Nguyen
Cc: davem, pabeni, edumazet, andrew+netdev, netdev, Kohei Enju, Simon Horman, Patryk Holda

On Wed, 20 May 2026 11:34:49 -0700 Tony Nguyen wrote:
> +	if (test_bit(ICE_VSI_REBUILD_PENDING, vsi->state)) {
> +		if (prog) {
> +			NL_SET_ERR_MSG_MOD(extack, "VSI rebuild is pending");
> +			return -EAGAIN;
> +		}

I seem to recall complaining about this sort of error handling
to you folks in the past. Chances are user space will just fail
seeing this error, rather than retrying. The correct behavior
would be to wait. Is that not possible?

(I think Sashiko is brewing a lot of comments for this series
so it won't make today's PR either way :()

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net 1/8] ice: fix UAF/NULL deref when VSI rebuild and XDP attach race
2026-05-20 18:34 ` [PATCH net 1/8] ice: fix UAF/NULL deref when VSI rebuild and XDP attach race Tony Nguyen
2026-05-21 15:37 ` Jakub Kicinski
@ 2026-05-23 0:16 ` Jakub Kicinski
1 sibling, 0 replies; 21+ messages in thread
From: Jakub Kicinski @ 2026-05-23 0:16 UTC (permalink / raw)
To: anthony.l.nguyen
Cc: Jakub Kicinski, davem, pabeni, edumazet, andrew+netdev, netdev, vinschen, alexander.nowlin

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
ice: fix UAF/NULL deref when VSI rebuild and XDP attach race

This commit addresses a race condition between a VSI rebuild and an XDP
attach operation. It prevents rx_ring->xdp_prog from being populated
before rx_ring->xdp_ring is valid by rejecting the XDP attach when a VSI
rebuild is pending. XDP detachment is still permitted during this window
to clear the existing program.

Fixes: 2504b8405768 ("ice: protect XDP configuration with a mutex")

> diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> --- a/drivers/net/ethernet/intel/ice/ice_main.c
> +++ b/drivers/net/ethernet/intel/ice/ice_main.c
[ ... ]
> @@ -2912,12 +2912,21 @@ ice_xdp_setup_prog(struct ice_vsi *vsi, struct bpf_prog *prog,
>  	}
>
>  	/* hot swap progs and avoid toggling link */
> -	if (ice_is_xdp_ena_vsi(vsi) == !!prog ||
> -	    test_bit(ICE_VSI_REBUILD_PENDING, vsi->state)) {
> +	if (ice_is_xdp_ena_vsi(vsi) == !!prog) {
>  		ice_vsi_assign_bpf_prog(vsi, prog);
>  		return 0;
>  	}
>
> +	if (test_bit(ICE_VSI_REBUILD_PENDING, vsi->state)) {

[Severity: High]
This is a pre-existing issue, but is the check for ICE_VSI_REBUILD_PENDING
prone to a TOCTOU race because it relies on ad-hoc flag synchronization
instead of proper locking with the device reset path?

The asynchronous reset path (ice_prepare_for_reset()) sets
ICE_VSI_REBUILD_PENDING without holding vsi->xdp_state_lock or rtnl_lock().

If ice_xdp_setup_prog() (which holds both locks) checks the bit before it
is set, it proceeds to configure XDP and sets ICE_VSI_DOWN. If the reset
path then sets the bit and calls ice_pf_dis_all_vsi(), it will skip this
VSI because ICE_VSI_DOWN is already set.

When ice_xdp_setup_prog() finishes, it calls ice_up() which clears
ICE_VSI_DOWN and starts enabling hardware queues while the device is
actively undergoing a hardware reset (ice_do_reset()).

Can writing to hardware registers during an active reset cause hardware
hangs or PCI master aborts?

> +		if (prog) {
> +			NL_SET_ERR_MSG_MOD(extack, "VSI rebuild is pending");
> +			return -EAGAIN;
> +		}
> +
> +		ice_vsi_assign_bpf_prog(vsi, NULL);

[Severity: High]
This is a pre-existing issue, but does detaching an XDP program while a
VSI rebuild has failed lead to a NULL pointer dereference here?

If a VSI rebuild fails, ice_vsi_decfg() frees vsi->rx_rings and sets the
pointer to NULL via ice_vsi_free_arrays(), but leaves vsi->num_rxq
non-zero. The ICE_VSI_REBUILD_PENDING bit remains set because it is only
cleared on the success path in ice_vsi_rebuild().

If a user subsequently detaches XDP, ice_xdp_setup_prog() intercepts it
because ICE_VSI_REBUILD_PENDING is set, and it calls
ice_vsi_assign_bpf_prog(vsi, NULL). This function loops up to
vsi->num_rxq:

ice_vsi_assign_bpf_prog() {
    ...
    ice_for_each_rxq(vsi, i)
        WRITE_ONCE(vsi->rx_rings[i]->xdp_prog, vsi->xdp_prog);
    ...
}

Since vsi->rx_rings is NULL, does this cause a guaranteed NULL pointer
dereference panic?

> +		return 0;
> +	}
> +
>  	if_running = netif_running(vsi->netdev) &&
>  		     !test_and_set_bit(ICE_VSI_DOWN, vsi->state);
--
pw-bot: cr

^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH net 2/8] ice: fix stats array overflow when VF requests more queues
2026-05-20 18:34 [PATCH net 0/8][pull request] Intel Wired LAN Driver Updates 2026-05-20 (ice, iavf, i40e, ixgbe) Tony Nguyen
2026-05-20 18:34 ` [PATCH net 1/8] ice: fix UAF/NULL deref when VSI rebuild and XDP attach race Tony Nguyen
@ 2026-05-20 18:34 ` Tony Nguyen
2026-05-23 0:16 ` Jakub Kicinski
2026-05-20 18:34 ` [PATCH net 3/8] iavf: return EBUSY if reset in progress or not ready during MAC change Tony Nguyen
` (5 subsequent siblings)
7 siblings, 1 reply; 21+ messages in thread
From: Tony Nguyen @ 2026-05-20 18:34 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, andrew+netdev, netdev
Cc: Michal Schmidt, anthony.l.nguyen, przemyslaw.kitszel, jacob.e.keller, poros, Aleksandr Loktionov, Simon Horman, Rafal Romanowski

From: Michal Schmidt <mschmidt@redhat.com>

When a VF increases its queue count via VIRTCHNL_OP_REQUEST_QUEUES,
ice_vc_request_qs_msg() sets vf->num_req_qs and triggers a VF reset.
The reset calls ice_vf_reconfig_vsi(), which does ice_vsi_decfg()
followed by ice_vsi_cfg().

ice_vsi_decfg() does not free the per-ring stats arrays. Inside
ice_vsi_cfg_def(), ice_vsi_set_num_qs() updates alloc_txq/alloc_rxq to
the new larger value, but ice_vsi_alloc_stat_arrays() returns early
because the stats already exist. ice_vsi_alloc_ring_stats() then
iterates using the new larger alloc_txq and writes beyond the bounds of
the old, smaller tx_ring_stats/rx_ring_stats pointer arrays, corrupting
adjacent SLUB metadata.

KASAN detects the bug:

 ==================================================================
 BUG: KASAN: slab-out-of-bounds in ice_vsi_alloc_ring_stats+0x385/0x4a0 [ice]
 Read of size 8 at addr ffff88810affea60 by task kworker/u131:7/221

 CPU: 24 UID: 0 PID: 221 Comm: kworker/u131:7 Not tainted 7.1.0-rc1+ #1 PREEMPT(lazy)
 ...
 Workqueue: ice ice_service_task [ice]
 Call Trace:
  <TASK>
  ...
  kasan_report+0xd7/0x120
  ice_vsi_alloc_ring_stats+0x385/0x4a0 [ice]
  ice_vsi_cfg_def+0x12e2/0x2060 [ice]
  ice_vsi_cfg+0xb5/0x3c0 [ice]
  ice_reset_vf+0x858/0xf80 [ice]
  ice_vc_request_qs_msg+0x1da/0x290 [ice]
  ice_vc_process_vf_msg+0xb15/0x1430 [ice]
  __ice_clean_ctrlq+0x70d/0x9d0 [ice]
  ice_service_task+0x840/0xf20 [ice]
  process_one_work+0x690/0xff0
  worker_thread+0x4d9/0xd20
  kthread+0x322/0x410
  ret_from_fork+0x332/0x660
  ret_from_fork_asm+0x1a/0x30
  </TASK>

 Allocated by task 2439:
  kasan_save_stack+0x1c/0x40
  kasan_save_track+0x10/0x30
  __kasan_kmalloc+0x96/0xb0
  __kmalloc_noprof+0x1d8/0x580
  ice_vsi_cfg_def+0x115c/0x2060 [ice]
  ice_vsi_cfg+0xb5/0x3c0 [ice]
  ice_vsi_setup+0x180/0x320 [ice]
  ice_start_vfs+0x1f3/0x590 [ice]
  ice_ena_vfs+0x66d/0x798 [ice]
  ice_sriov_configure.cold+0xe4/0x121 [ice]
  sriov_numvfs_store+0x279/0x480
  kernfs_fop_write_iter+0x331/0x4f0
  vfs_write+0x4c4/0xe40
  ksys_write+0x10c/0x240
  do_syscall_64+0xd9/0x650
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

 The buggy address belongs to the object at ffff88810affea40
  which belongs to the cache kmalloc-32 of size 32
 The buggy address is located 0 bytes to the right of
  allocated 32-byte region [ffff88810affea40, ffff88810affea60)
 ...
 ==================================================================

ice_vsi_rebuild() handles this correctly by calling
ice_vsi_realloc_stat_arrays() before reconfiguration, but
ice_vf_reconfig_vsi() was missing this call.

Fix by calling ice_vsi_realloc_stat_arrays() in ice_vf_reconfig_vsi()
before ice_vsi_decfg(), mirroring the ice_vsi_rebuild() pattern. Set
vsi->req_txq/req_rxq from vf->num_req_qs so the realloc function knows
the target array size.

See the linked RHEL Jira item for a reproducer.

Fixes: 2a2cb4c6c181 ("ice: replace ice_vf_recreate_vsi() with ice_vf_reconfig_vsi()")
Closes: https://redhat.atlassian.net/browse/RHEL-164321
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Assisted-by: Claude:claude-opus-4-6 semcode
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_lib.c    | 2 +-
 drivers/net/ethernet/intel/ice/ice_lib.h    | 1 +
 drivers/net/ethernet/intel/ice/ice_vf_lib.c | 7 +++++++
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
index 837b71b7b2b7..fc78176a2a8d 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -3015,7 +3015,7 @@ ice_vsi_rebuild_set_coalesce(struct ice_vsi *vsi,
  * ice_vsi_realloc_stat_arrays - Frees unused stat structures or alloc new ones
  * @vsi: VSI pointer
  */
-static int
+int
 ice_vsi_realloc_stat_arrays(struct ice_vsi *vsi)
 {
 	u16 req_txq = vsi->req_txq ? vsi->req_txq : vsi->alloc_txq;
diff --git a/drivers/net/ethernet/intel/ice/ice_lib.h b/drivers/net/ethernet/intel/ice/ice_lib.h
index 49454d98dcfe..6f7da84384e5 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.h
+++ b/drivers/net/ethernet/intel/ice/ice_lib.h
@@ -66,6 +66,7 @@ int ice_ena_vsi(struct ice_vsi *vsi, bool locked);
 void ice_vsi_decfg(struct ice_vsi *vsi);
 void ice_dis_vsi(struct ice_vsi *vsi, bool locked);
 
+int ice_vsi_realloc_stat_arrays(struct ice_vsi *vsi);
 int ice_vsi_rebuild(struct ice_vsi *vsi, u32 vsi_flags);
 int ice_vsi_cfg(struct ice_vsi *vsi);
 struct ice_vsi *ice_vsi_alloc(struct ice_pf *pf);
diff --git a/drivers/net/ethernet/intel/ice/ice_vf_lib.c b/drivers/net/ethernet/intel/ice/ice_vf_lib.c
index b1f46707dcc0..cf4ed034ad8a 100644
--- a/drivers/net/ethernet/intel/ice/ice_vf_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_vf_lib.c
@@ -268,6 +268,13 @@ static int ice_vf_reconfig_vsi(struct ice_vf *vf)
 
 	vsi->flags = ICE_VSI_FLAG_NO_INIT;
 
+	vsi->req_txq = vf->num_req_qs;
+	vsi->req_rxq = vf->num_req_qs;
+
+	err = ice_vsi_realloc_stat_arrays(vsi);
+	if (err)
+		return err;
+
 	ice_vsi_decfg(vsi);
 	ice_fltr_remove_all(vsi);
 
-- 
2.47.1

^ permalink raw reply related [flat|nested] 21+ messages in thread
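The overflow described in the commit message above, a pointer array sized for the old queue count being walked with the new, larger count, can be modelled in userspace. The following is a minimal sketch, not the driver's implementation: struct ring_stats_tbl, tbl_resize() and tbl_alloc_ring_stats() are hypothetical names, the 64-byte per-ring allocation is arbitrary, and the explicit bounds check stands in for what KASAN caught in the kernel.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the per-VSI array of ring-stat pointers. */
struct ring_stats_tbl {
	void **slots;		/* one pointer per allocated queue */
	unsigned int count;	/* number of valid entries in slots */
};

/* Resize the pointer array before the queue count changes, loosely
 * mirroring the role the commit message gives to
 * ice_vsi_realloc_stat_arrays(). Returns 0 on success, -1 on failure.
 */
static int tbl_resize(struct ring_stats_tbl *t, unsigned int new_count)
{
	void **p;
	unsigned int i;

	if (new_count == 0)
		return -1;

	/* Shrinking: release entries that will fall outside the array. */
	for (i = new_count; i < t->count; i++) {
		free(t->slots[i]);
		t->slots[i] = NULL;
	}

	p = realloc(t->slots, new_count * sizeof(*p));
	if (!p)
		return -1;

	/* Growing: new slots must start out NULL. */
	if (new_count > t->count)
		memset(p + t->count, 0, (new_count - t->count) * sizeof(*p));

	t->slots = p;
	t->count = new_count;
	return 0;
}

/* Populate every slot up to alloc_q. The buggy path would index past
 * t->count here; this model refuses instead of writing out of bounds.
 */
static int tbl_alloc_ring_stats(struct ring_stats_tbl *t, unsigned int alloc_q)
{
	unsigned int i;

	if (alloc_q > t->count)
		return -1;

	for (i = 0; i < alloc_q; i++) {
		if (!t->slots[i]) {
			t->slots[i] = calloc(1, 64);
			if (!t->slots[i])
				return -1;
		}
	}
	return 0;
}
```

Calling tbl_alloc_ring_stats() with a count larger than the table corresponds to the pre-patch reset path; calling tbl_resize() first corresponds to the ordering the patch adds to ice_vf_reconfig_vsi().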
* Re: [PATCH net 2/8] ice: fix stats array overflow when VF requests more queues
2026-05-20 18:34 ` [PATCH net 2/8] ice: fix stats array overflow when VF requests more queues Tony Nguyen
@ 2026-05-23 0:16 ` Jakub Kicinski
0 siblings, 0 replies; 21+ messages in thread
From: Jakub Kicinski @ 2026-05-23 0:16 UTC (permalink / raw)
To: anthony.l.nguyen
Cc: Jakub Kicinski, davem, pabeni, edumazet, andrew+netdev, netdev, vinschen, alexander.nowlin

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
ice: fix stats array overflow when VF requests more queues

When a VF increases its queue count via VIRTCHNL_OP_REQUEST_QUEUES,
ice_vc_request_qs_msg() sets vf->num_req_qs and triggers a VF reset.
The reset calls ice_vf_reconfig_vsi(), which does ice_vsi_decfg()
followed by ice_vsi_cfg(). This patch fixes a stats array overflow by
calling ice_vsi_realloc_stat_arrays() in ice_vf_reconfig_vsi() before
ice_vsi_decfg().

> diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
> index 837b71b7b2b7c..fc78176a2a8d0 100644
> --- a/drivers/net/ethernet/intel/ice/ice_lib.c
> +++ b/drivers/net/ethernet/intel/ice/ice_lib.c
> @@ -3015,7 +3015,7 @@ ice_vsi_rebuild_set_coalesce(struct ice_vsi *vsi,
>   * ice_vsi_realloc_stat_arrays - Frees unused stat structures or alloc new ones
>   * @vsi: VSI pointer
>   */
> -static int
> +int
>  ice_vsi_realloc_stat_arrays(struct ice_vsi *vsi)
>  {

[Severity: High]
This isn't a bug introduced by this patch, but making this function
public and using it in more paths exposes a potential NULL pointer
dereference. Does this code safely handle ADQ channel VSIs?

For ADQ channel VSIs (where vsi->type == ICE_VSI_CHNL),
ice_vsi_alloc_stat_arrays() bypasses stats allocation, leaving
pf->vsi_stats[vsi->idx] as NULL.

During a PF reset (ICE_RESET_PFR), ice_rebuild_channels() iterates over
ICE_VSI_CHNL VSIs and calls ice_vsi_rebuild(), which triggers
ice_vsi_realloc_stat_arrays(). This will unconditionally dereference
vsi_stat:

drivers/net/ethernet/intel/ice/ice_lib.c:ice_vsi_realloc_stat_arrays() {
    ...
    vsi_stat = pf->vsi_stats[vsi->idx];
    if (req_txq < prev_txq) {
        for (i = req_txq; i < prev_txq; i++) {
            if (vsi_stat->tx_ring_stats[i]) {
    ...

Can this result in a NULL pointer dereference when rebuilding channel
VSIs?

[Severity: High]
This is another pre-existing issue, but the error handling when
reallocating arrays might lead to out-of-bounds reads and memory
corruption.

When a VF requests fewer queues (req_txq < prev_txq), the function
successfully shrinks vsi_stat->tx_ring_stats using krealloc_array().
However, if the subsequent reallocation for rx_ring_stats fails (e.g.,
due to memory pressure returning -ENOMEM), the function returns an error
and aborts before vsi->alloc_txq is updated to the new smaller size:

drivers/net/ethernet/intel/ice/ice_lib.c:ice_vsi_realloc_stat_arrays() {
    ...
    vsi_stat->tx_ring_stats = krealloc_array(...);
    if (!vsi_stat->tx_ring_stats) {
        ...
    }
    ...
    vsi_stat->rx_ring_stats = krealloc_array(...);
    if (!vsi_stat->rx_ring_stats) {
        vsi_stat->rx_ring_stats = rx_ring_stats;
        return -ENOMEM;
    }
    ...
}

Later, during VSI teardown in ice_vsi_free_stats(),
ice_for_each_alloc_txq(vsi, i) iterates up to the stale, larger
vsi->alloc_txq value.

Is it possible for this to read past the bounds of the newly shrunk
tx_ring_stats array and pass non-zero garbage to kfree_rcu()?

^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH net 3/8] iavf: return EBUSY if reset in progress or not ready during MAC change
2026-05-20 18:34 [PATCH net 0/8][pull request] Intel Wired LAN Driver Updates 2026-05-20 (ice, iavf, i40e, ixgbe) Tony Nguyen
2026-05-20 18:34 ` [PATCH net 1/8] ice: fix UAF/NULL deref when VSI rebuild and XDP attach race Tony Nguyen
2026-05-20 18:34 ` [PATCH net 2/8] ice: fix stats array overflow when VF requests more queues Tony Nguyen
@ 2026-05-20 18:34 ` Tony Nguyen
2026-05-20 18:34 ` [PATCH net 4/8] i40e: skip unnecessary VF reset when setting trust Tony Nguyen
` (4 subsequent siblings)
7 siblings, 0 replies; 21+ messages in thread
From: Tony Nguyen @ 2026-05-20 18:34 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, andrew+netdev, netdev
Cc: Jose Ignacio Tornos Martinez, anthony.l.nguyen, przemyslaw.kitszel, jacob.e.keller, horms, Aleksandr Loktionov, Rafal Romanowski

From: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>

When a MAC address change is requested while the VF is resetting or
still initializing, return -EBUSY immediately instead of attempting the
operation.

Additionally, during early initialization states (before __IAVF_DOWN),
the PF may be slow to respond to MAC change requests, causing long
delays. Only allow MAC changes once the VF reaches __IAVF_DOWN state or
later, when the watchdog is running and the VF is ready for operations.

After commit ad7c7b2172c3 ("net: hold netdev instance lock during sysfs
operations"), MAC changes are called with the netdev lock held, so we
should not wait with the lock held during reset or initialization. This
allows the caller to retry or handle the busy state appropriately
without blocking other operations.

Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
 drivers/net/ethernet/intel/iavf/iavf_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
index d2914c511e1e..78c59a58e0b2 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_main.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
@@ -1042,6 +1042,9 @@ static int iavf_set_mac(struct net_device *netdev, void *p)
 	struct sockaddr *addr = p;
 	int ret;
 
+	if (iavf_is_reset_in_progress(adapter) || adapter->state < __IAVF_DOWN)
+		return -EBUSY;
+
 	if (!is_valid_ether_addr(addr->sa_data))
 		return -EADDRNOTAVAIL;
 
-- 
2.47.1

^ permalink raw reply related [flat|nested] 21+ messages in thread
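The guard added by the patch above relies on the adapter states being ordered, so that a single `<` comparison covers every early-initialization state. A minimal userspace sketch of that idea follows; enum vf_state, its members, and set_mac_precheck() are hypothetical and deliberately simplified, and only the relative position of the "down" state matters.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, shortened state list. The real driver has more states;
 * this model assumes only that every early-init state sorts before
 * VF_DOWN.
 */
enum vf_state {
	VF_STARTUP,
	VF_INIT_CONFIG,
	VF_DOWN,	/* first state in which MAC changes are accepted */
	VF_RUNNING,
};

/* Model of the early return: busy during reset or before VF_DOWN. */
static int set_mac_precheck(enum vf_state state, bool reset_in_progress)
{
	if (reset_in_progress || state < VF_DOWN)
		return -EBUSY;
	return 0;
}
```

Returning early like this means the caller never sleeps while holding the netdev lock, which is the concern the commit message raises.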
* [PATCH net 4/8] i40e: skip unnecessary VF reset when setting trust
2026-05-20 18:34 [PATCH net 0/8][pull request] Intel Wired LAN Driver Updates 2026-05-20 (ice, iavf, i40e, ixgbe) Tony Nguyen
` (2 preceding siblings ...)
2026-05-20 18:34 ` [PATCH net 3/8] iavf: return EBUSY if reset in progress or not ready during MAC change Tony Nguyen
@ 2026-05-20 18:34 ` Tony Nguyen
2026-05-23 0:16 ` Jakub Kicinski
2026-05-23 0:16 ` Jakub Kicinski
2026-05-20 18:34 ` [PATCH net 5/8] iavf: send MAC change request synchronously Tony Nguyen
` (3 subsequent siblings)
7 siblings, 2 replies; 21+ messages in thread
From: Tony Nguyen @ 2026-05-20 18:34 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, andrew+netdev, netdev
Cc: Jose Ignacio Tornos Martinez, anthony.l.nguyen, przemyslaw.kitszel, jacob.e.keller, horms, Rafal Romanowski

From: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>

The current implementation triggers a VF reset when changing the trust
setting, causing a ~10 second delay during bonding setup.

In all the cases, the reset causes a ~10 second delay during which:
- VF must reinitialize completely
- Any in-progress operations (like bonding enslave) fail with timeouts
- VF is unavailable

When granting trust, no reset is needed - we can just set the capability
flag to allow privileged operations.

When revoking trust, we only need to reset (conservative approach) if
the VF has actually configured advanced features that require cleanup
(ADQ/cloud filters, promiscuous mode). For VFs in a clean state, we can
safely change the trust setting without the disruptive reset.

When we don't reset, we manually handle capability flag via helper
function, eliminating the delay.

Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
 .../ethernet/intel/i40e/i40e_virtchnl_pf.c | 38 ++++++++++++++-----
 1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index a26c3d47ec15..0cc434b26eb8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -4943,6 +4943,23 @@ int i40e_ndo_set_vf_spoofchk(struct net_device *netdev, int vf_id, bool enable)
 	return ret;
 }
 
+/**
+ * i40e_setup_vf_trust - Enable/disable VF trust mode without reset
+ * @vf: VF to configure
+ * @setting: trust setting
+ *
+ * Update VF flags when changing trust without performing a VF reset.
+ * This is only called when it's safe to skip the reset (VF has no advanced
+ * features configured that need cleanup).
+ */
+static void i40e_setup_vf_trust(struct i40e_vf *vf, bool setting)
+{
+	if (setting)
+		set_bit(I40E_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps);
+	else
+		clear_bit(I40E_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps);
+}
+
 /**
  * i40e_ndo_set_vf_trust
  * @netdev: network interface device structure of the pf
@@ -4987,19 +5004,20 @@ int i40e_ndo_set_vf_trust(struct net_device *netdev, int vf_id, bool setting)
 	set_bit(__I40E_MACVLAN_SYNC_PENDING, pf->state);
 	pf->vsi[vf->lan_vsi_idx]->flags |= I40E_VSI_FLAG_FILTER_CHANGED;
 
-	i40e_vc_reset_vf(vf, true);
+	/* Reset only if revoking trust and VF has advanced features configured */
+	if (!setting &&
+	    (vf->adq_enabled || vf->num_cloud_filters > 0 ||
+	     test_bit(I40E_VF_STATE_UC_PROMISC, &vf->vf_states) ||
+	     test_bit(I40E_VF_STATE_MC_PROMISC, &vf->vf_states))) {
+		i40e_vc_reset_vf(vf, true);
+		i40e_del_all_cloud_filters(vf);
+	} else {
+		i40e_setup_vf_trust(vf, setting);
+	}
+
 	dev_info(&pf->pdev->dev, "VF %u is now %strusted\n",
 		 vf_id, setting ? "" : "un");
 
-	if (vf->adq_enabled) {
-		if (!vf->trusted) {
-			dev_info(&pf->pdev->dev,
-				 "VF %u no longer Trusted, deleting all cloud filters\n",
-				 vf_id);
-			i40e_del_all_cloud_filters(vf);
-		}
-	}
-
 out:
 	clear_bit(__I40E_VIRTCHNL_OP_PENDING, pf->state);
 	return ret;
-- 
2.47.1

^ permalink raw reply related [flat|nested] 21+ messages in thread
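The condition the patch above uses to decide between "reset" and "flip the capability bit" can be isolated as a pure predicate. The sketch below is a userspace model under stated assumptions: struct vf_model and trust_change_needs_reset() are hypothetical names, and the four fields mirror only the four conditions visible in the diff (adq_enabled, num_cloud_filters, and the UC/MC promiscuous state bits).

```c
#include <stdbool.h>

/* Hypothetical snapshot of the VF fields the patch consults. */
struct vf_model {
	bool adq_enabled;
	unsigned int num_cloud_filters;
	bool uc_promisc;
	bool mc_promisc;
};

/* True when the patched i40e_ndo_set_vf_trust() would still reset. */
static bool trust_change_needs_reset(const struct vf_model *vf, bool setting)
{
	/* Granting trust never resets in the patched code. */
	if (setting)
		return false;

	/* Revoking trust resets only if one of the four conditions holds. */
	return vf->adq_enabled || vf->num_cloud_filters > 0 ||
	       vf->uc_promisc || vf->mc_promisc;
}
```

Writing the check this way makes it easy to see which VF state the predicate does and does not look at; anything outside these four fields has no influence on the outcome.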
* Re: [PATCH net 4/8] i40e: skip unnecessary VF reset when setting trust
2026-05-20 18:34 ` [PATCH net 4/8] i40e: skip unnecessary VF reset when setting trust Tony Nguyen
@ 2026-05-23 0:16 ` Jakub Kicinski
2026-05-23 0:16 ` Jakub Kicinski
1 sibling, 0 replies; 21+ messages in thread
From: Jakub Kicinski @ 2026-05-23 0:16 UTC (permalink / raw)
To: anthony.l.nguyen
Cc: Jakub Kicinski, davem, pabeni, edumazet, andrew+netdev, netdev, vinschen, alexander.nowlin

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
i40e: skip unnecessary VF reset when setting trust

This commit changes i40e_ndo_set_vf_trust() to skip the VF reset when
granting trust, and when revoking trust unless one of a few advanced
features (ADQ/cloud filters, UC/MC promisc) is currently set on the VF.
On the no-reset path the new helper i40e_setup_vf_trust() flips the
I40E_VIRTCHNL_VF_CAP_PRIVILEGE bit directly.

> diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> index a26c3d47ec15..0cc434b26eb8 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> @@ -4943,6 +4943,23 @@ int i40e_ndo_set_vf_spoofchk(struct net_device *netdev, int vf_id, bool enable)
>  	return ret;
>  }
>
> +/**
> + * i40e_setup_vf_trust - Enable/disable VF trust mode without reset
> + * @vf: VF to configure
> + * @setting: trust setting
> + *
> + * Update VF flags when changing trust without performing a VF reset.
> + * This is only called when it's safe to skip the reset (VF has no advanced
> + * features configured that need cleanup).
> + */
> +static void i40e_setup_vf_trust(struct i40e_vf *vf, bool setting)
> +{
> +	if (setting)
> +		set_bit(I40E_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps);
> +	else
> +		clear_bit(I40E_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps);
> +}
> +

[ ... ]

> @@ -4987,19 +5004,20 @@ int i40e_ndo_set_vf_trust(struct net_device *netdev, int vf_id, bool setting)
>  	set_bit(__I40E_MACVLAN_SYNC_PENDING, pf->state);
>  	pf->vsi[vf->lan_vsi_idx]->flags |= I40E_VSI_FLAG_FILTER_CHANGED;
>
> -	i40e_vc_reset_vf(vf, true);
> +	/* Reset only if revoking trust and VF has advanced features configured */
> +	if (!setting &&
> +	    (vf->adq_enabled || vf->num_cloud_filters > 0 ||
> +	     test_bit(I40E_VF_STATE_UC_PROMISC, &vf->vf_states) ||
> +	     test_bit(I40E_VF_STATE_MC_PROMISC, &vf->vf_states))) {
> +		i40e_vc_reset_vf(vf, true);
> +		i40e_del_all_cloud_filters(vf);
> +	} else {
> +		i40e_setup_vf_trust(vf, setting);
> +	}

[High]
Are the four conditions used here a complete description of "VF in a
clean state" for the purposes of trust revocation?

A trusted VF can install several pieces of privileged state that this
predicate does not consult:

- more than I40E_VC_MAX_VLAN_PER_VF (16) VLAN filters, accepted by
  i40e_vc_add_vlan_msg() only because the VF is trusted
- more than I40E_VC_MAX_MAC_ADDR_PER_VF (18) unicast MAC filters,
  permitted by i40e_check_vf_permission() only because the VF is trusted
- a VF-side override of an administratively-set MAC (vf->pf_set_mac),
  where the comment in i40e_check_vf_permission() reads:

    /* ... Unless the VF is privileged and then it can do whatever */

Before this patch, the unconditional i40e_vc_reset_vf() drove
i40e_cleanup_reset_vf() -> i40e_free_vf_res(), which wiped the VF filter
table and re-derived the privilege bit from vf->trusted in
i40e_alloc_vf_resources().

On the new no-reset branch, only the I40E_VIRTCHNL_VF_CAP_PRIVILEGE bit
is cleared, so over-limit VLAN filters, over-limit MAC filters, and the
admin-MAC override remain installed in hardware while the driver prints
"VF %u is now untrusted".

Should this predicate also cover those cases (for example by checking
the VLAN/MAC filter counts on the VF VSI and vf->pf_set_mac), or by
performing the resource cleanup explicitly on the no-reset path?

[High]
Can the no-reset branch race with concurrent virtchnl message handling?

i40e_ndo_set_vf_trust() serializes against other ndo callbacks via
__I40E_VIRTCHNL_OP_PENDING, but i40e_vc_process_vf_msg() (driven from
i40e_clean_adminq_subtask() in the service workqueue) does not test that
bit. It rejects messages only when I40E_VF_STATE_DISABLED is set, which
the no-reset path never sets.

A revoke-trust sequence could then look like:

CPU0 (ndo_set_vf_trust, setting=false)
  reads vf->num_cloud_filters == 0, !vf->adq_enabled,
        !UC_PROMISC, !MC_PROMISC
  selects the else branch

CPU1 (i40e_vc_process_vf_msg, e.g. ADD_CLOUD_FILTER / ADD_VLAN /
      ADD_ETH_ADDR / CONFIG_PROMISCUOUS_MODE)
  tests I40E_VIRTCHNL_VF_CAP_PRIVILEGE -- still set
  installs privileged state (++num_cloud_filters, sets
  UC/MC_PROMISC, adds an over-limit VLAN/MAC, ...)

CPU0
  i40e_setup_vf_trust(vf, false)
    clear_bit(I40E_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps);

End state: trust is "revoked" and the privilege bit is cleared, but
privileged state was installed during the window and is never cleaned
up.

The previous unconditional i40e_vc_reset_vf() path masked this because
i40e_trigger_vf_reset() clears I40E_VF_STATE_ACTIVE (so the
i40e_sync_vf_state(I40E_VF_STATE_ACTIVE) check at the top of most
virtchnl handlers fails) and i40e_cleanup_reset_vf() then wipes any
state that did get installed.

Should the no-reset path take a stronger form of synchronisation against
virtchnl processing, or sample the gating conditions and clear the
privilege bit under that synchronisation?

>
>  	dev_info(&pf->pdev->dev, "VF %u is now %strusted\n",
>  		 vf_id, setting ? "" : "un");
>
> -	if (vf->adq_enabled) {
> -		if (!vf->trusted) {
> -			dev_info(&pf->pdev->dev,
> -				 "VF %u no longer Trusted, deleting all cloud filters\n",
> -				 vf_id);
> -			i40e_del_all_cloud_filters(vf);
> -		}
> -	}
> -

[Low]
The dev_info() "VF %u no longer Trusted, deleting all cloud filters" is
removed here. The new code still calls i40e_del_all_cloud_filters() in
the equivalent scenario (inside the conditional reset branch), but the
message is no longer emitted and the commit message does not mention
this change. Was the removal intentional, and if so should it be noted
in the changelog?

>  out:
>  	clear_bit(__I40E_VIRTCHNL_OP_PENDING, pf->state);
>  	return ret;
>  }
--
pw-bot: cr

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net 4/8] i40e: skip unnecessary VF reset when setting trust
2026-05-20 18:34 ` [PATCH net 4/8] i40e: skip unnecessary VF reset when setting trust Tony Nguyen
2026-05-23 0:16 ` Jakub Kicinski
@ 2026-05-23 0:16 ` Jakub Kicinski
1 sibling, 0 replies; 21+ messages in thread
From: Jakub Kicinski @ 2026-05-23 0:16 UTC (permalink / raw)
To: anthony.l.nguyen
Cc: Jakub Kicinski, davem, pabeni, edumazet, andrew+netdev, netdev, vinschen, alexander.nowlin

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
i40e: skip unnecessary VF reset when setting trust

This commit attempts to skip the ~10 second VF reset delay when changing
the trust setting. It bypasses the reset and only updates the privilege
capability bit when revoking trust if no advanced features are active.

> diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> index a26c3d47ec156..0cc434b26eb88 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
[ ... ]
> @@ -4987,19 +5004,20 @@ int i40e_ndo_set_vf_trust(struct net_device *netdev, int vf_id, bool setting)
>  	set_bit(__I40E_MACVLAN_SYNC_PENDING, pf->state);
>  	pf->vsi[vf->lan_vsi_idx]->flags |= I40E_VSI_FLAG_FILTER_CHANGED;
>
> -	i40e_vc_reset_vf(vf, true);
> +	/* Reset only if revoking trust and VF has advanced features configured */
> +	if (!setting &&
> +	    (vf->adq_enabled || vf->num_cloud_filters > 0 ||
> +	     test_bit(I40E_VF_STATE_UC_PROMISC, &vf->vf_states) ||
> +	     test_bit(I40E_VF_STATE_MC_PROMISC, &vf->vf_states))) {

[Severity: High]
Does skipping the VF reset when revoking trust allow the Virtual Function
to permanently retain privileged resources?

A trusted VF can bypass untrusted limits, allocating a large number of MAC
and VLAN filters and overriding administratively set MAC addresses.

By skipping the reset upon trust revocation if ADq and promiscuous mode
are inactive, the VSI is not destroyed. Since i40e_setup_vf_trust() only
updates the capability bit without removing excess filters, could an
untrusted VF retain these configurations and bypass isolation limits?

[Severity: High]
Can this introduce a race condition with concurrent virtchnl messages?

Virtchnl messages are processed in the ARQ workqueue without holding
rtnl_lock or __I40E_VIRTCHNL_OP_PENDING. A concurrent virtchnl request
could verify the privilege capability bit before it is cleared, and
successfully configure an advanced feature like promiscuous mode after
these checks have bypassed the reset.

Will the VF end up untrusted but with advanced features active?

> +		i40e_vc_reset_vf(vf, true);
> +		i40e_del_all_cloud_filters(vf);

[Severity: Critical]
This is a pre-existing issue, but is vf->cloud_filter_list accessed
concurrently without synchronization?

i40e_del_all_cloud_filters() iterates over and deletes elements from
vf->cloud_filter_list. Concurrently, the ARQ processing thread processes
virtchnl messages and adds elements to the same list.

Since there is no lock protecting this list across the netlink and ARQ
thread contexts, could concurrent modifications corrupt the list pointers
and lead to a use-after-free?

> +	} else {
> +		i40e_setup_vf_trust(vf, setting);
> +	}
> +
>  	dev_info(&pf->pdev->dev, "VF %u is now %strusted\n",
>  		 vf_id, setting ? "" : "un");
>
> -	if (vf->adq_enabled) {
> -		if (!vf->trusted) {
> -			dev_info(&pf->pdev->dev,
> -				 "VF %u no longer Trusted, deleting all cloud filters\n",
> -				 vf_id);
> -			i40e_del_all_cloud_filters(vf);
> -		}
> -	}
> -
>  out:
>  	clear_bit(__I40E_VIRTCHNL_OP_PENDING, pf->state);

[Severity: High]
This isn't a bug introduced by this patch, but does using the
__I40E_VIRTCHNL_OP_PENDING bit flag as an ad-hoc lock violate networking
driver concurrency guidelines?

The guidelines explicitly forbid using bit flags to guard code regions.
This open-coded locking scheme provides no deadlock detection, lockdep
visibility, or strict memory ordering guarantees. Could this be replaced
with a proper mutex or lock?

>  	return ret;
>  }

^ permalink raw reply [flat|nested] 21+ messages in thread
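The race raised in the review above (gating conditions sampled, then the
privilege bit cleared later, with a message handler running in between)
can be pictured with a small user-space model. This is only a sketch
under stated assumptions, not i40e code: struct vf_model, its fields and
the vf_model_*() helpers are hypothetical names, and a pthread mutex
stands in for whatever synchronisation the driver would actually use.
The point it illustrates is that the check and the clear happen under the
same lock the message path takes, so no privileged state can be installed
between them.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a VF; these are not the i40e structures. */
struct vf_model {
	pthread_mutex_t lock;	/* serialises trust changes and messages */
	bool privileged;	/* stands in for the PRIVILEGE capability bit */
	unsigned int priv_objs;	/* privileged objects the VF has installed */
};

void vf_model_init(struct vf_model *vf, bool trusted)
{
	assert(vf != NULL);
	pthread_mutex_init(&vf->lock, NULL);
	vf->privileged = trusted;
	vf->priv_objs = 0;
}

/* Message path: install privileged state only while still privileged. */
bool vf_model_handle_priv_msg(struct vf_model *vf)
{
	bool accepted;

	pthread_mutex_lock(&vf->lock);
	accepted = vf->privileged;
	if (accepted)
		vf->priv_objs++;
	pthread_mutex_unlock(&vf->lock);

	return accepted;
}

/*
 * Revoke trust. The gating condition is sampled and the privilege flag is
 * cleared inside one critical section. Returns true when privileged state
 * existed, i.e. when the caller would need the full reset path.
 */
bool vf_model_revoke_trust(struct vf_model *vf)
{
	bool need_reset;

	pthread_mutex_lock(&vf->lock);
	need_reset = vf->priv_objs > 0;
	vf->privileged = false;
	if (need_reset)
		vf->priv_objs = 0;	/* models the cleanup a reset performs */
	pthread_mutex_unlock(&vf->lock);

	return need_reset;
}
```

In this model a message that arrives after vf_model_revoke_trust() is
always rejected, and one that arrived before it is always seen by the
need_reset check.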
* [PATCH net 5/8] iavf: send MAC change request synchronously
2026-05-20 18:34 [PATCH net 0/8][pull request] Intel Wired LAN Driver Updates 2026-05-20 (ice, iavf, i40e, ixgbe) Tony Nguyen
` (3 preceding siblings ...)
2026-05-20 18:34 ` [PATCH net 4/8] i40e: skip unnecessary VF reset when setting trust Tony Nguyen
@ 2026-05-20 18:34 ` Tony Nguyen
2026-05-23 0:16 ` Jakub Kicinski
2026-05-23 0:16 ` Jakub Kicinski
2026-05-20 18:34 ` [PATCH net 6/8] ice: skip unnecessary VF reset when setting trust Tony Nguyen
` (2 subsequent siblings)
7 siblings, 2 replies; 21+ messages in thread
From: Tony Nguyen @ 2026-05-20 18:34 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, andrew+netdev, netdev
Cc: Jose Ignacio Tornos Martinez, anthony.l.nguyen, przemyslaw.kitszel, jacob.e.keller, horms, stable, Aleksandr Loktionov, Rafal Romanowski

From: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>

After commit ad7c7b2172c3 ("net: hold netdev instance lock during sysfs
operations"), iavf_set_mac() is called with the netdev instance lock
already held.

The function queues a MAC address change request via
iavf_replace_primary_mac() and then waits for completion. However, in
the current flow, the actual virtchnl message is sent by the watchdog
task, which also needs to acquire the netdev lock to run. Additionally,
the adminq_task which processes virtchnl responses also needs the netdev
lock.

This creates a deadlock scenario:
1. iavf_set_mac() holds netdev lock and waits for MAC change
2. Watchdog needs netdev lock to send the request -> blocked
3. Even if request is sent, adminq_task needs netdev lock to process
   PF response -> blocked
4. MAC change times out after 2.5 seconds
5. iavf_set_mac() returns -EAGAIN

This particularly affects VFs during bonding setup when multiple VFs are
enslaved in quick succession.

Fix by implementing a synchronous MAC change operation similar to the
approach used in commit fdadbf6e84c4 ("iavf: fix incorrect reset handling
in callbacks").

The solution:
1. Send the virtchnl ADD_ETH_ADDR message directly (not via watchdog)
2. Poll the admin queue hardware directly for responses
3. Process all received messages (including non-MAC messages)
4. Return when MAC change completes or times out

A new generic function iavf_poll_virtchnl_response() is introduced that
can be reused for any future synchronous virtchnl operations. It takes a
callback to check completion, allowing flexible condition checking.

This allows the operation to complete synchronously while holding
netdev_lock, without relying on watchdog or adminq_task. The function
can sleep for up to 2.5 seconds polling hardware, but this is acceptable
since netdev_lock is per-device and only serializes operations on the
same interface.

To support this, change iavf_add_ether_addrs() to return an error code
instead of void, allowing callers to detect failures. Additionally,
export iavf_mac_add_reject() to enable proper rollback on local failures
(timeouts, send errors) - PF rejections are already handled automatically
by iavf_virtchnl_completion().

Remove vc_waitqueue entirely because iavf_set_mac was the only waiter on
this waitqueue and after the changes it is not needed.
Fixes: ad7c7b2172c3 ("net: hold netdev instance lock during sysfs operations") cc: stable@vger.kernel.org Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> --- drivers/net/ethernet/intel/iavf/iavf.h | 10 +- drivers/net/ethernet/intel/iavf/iavf_main.c | 71 +++++++++---- .../net/ethernet/intel/iavf/iavf_virtchnl.c | 100 ++++++++++++++++-- 3 files changed, 151 insertions(+), 30 deletions(-) diff --git a/drivers/net/ethernet/intel/iavf/iavf.h b/drivers/net/ethernet/intel/iavf/iavf.h index 050f8241ef5e..06eb19b00527 100644 --- a/drivers/net/ethernet/intel/iavf/iavf.h +++ b/drivers/net/ethernet/intel/iavf/iavf.h @@ -259,7 +259,6 @@ struct iavf_adapter { struct work_struct adminq_task; struct work_struct finish_config; wait_queue_head_t down_waitqueue; - wait_queue_head_t vc_waitqueue; struct iavf_q_vector *q_vectors; struct list_head vlan_filter_list; int num_vlan_filters; @@ -588,8 +587,9 @@ void iavf_configure_queues(struct iavf_adapter *adapter); void iavf_enable_queues(struct iavf_adapter *adapter); void iavf_disable_queues(struct iavf_adapter *adapter); void iavf_map_queues(struct iavf_adapter *adapter); -void iavf_add_ether_addrs(struct iavf_adapter *adapter); +int iavf_add_ether_addrs(struct iavf_adapter *adapter); void iavf_del_ether_addrs(struct iavf_adapter *adapter); +void iavf_mac_add_reject(struct iavf_adapter *adapter); void iavf_add_vlans(struct iavf_adapter *adapter); void iavf_del_vlans(struct iavf_adapter *adapter); void iavf_set_promiscuous(struct iavf_adapter *adapter); @@ -606,6 +606,12 @@ void iavf_disable_vlan_stripping(struct iavf_adapter *adapter); void iavf_virtchnl_completion(struct iavf_adapter *adapter, enum virtchnl_ops v_opcode, enum iavf_status v_retval, u8 *msg, u16 msglen); +int 
iavf_poll_virtchnl_response(struct iavf_adapter *adapter, + bool (*condition)(struct iavf_adapter *adapter, + const void *data, + enum virtchnl_ops v_op), + const void *cond_data, + unsigned int timeout_ms); int iavf_config_rss(struct iavf_adapter *adapter); void iavf_cfg_queues_bw(struct iavf_adapter *adapter); void iavf_cfg_queues_quanta_size(struct iavf_adapter *adapter); diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c index 78c59a58e0b2..ed790dc3de6b 100644 --- a/drivers/net/ethernet/intel/iavf/iavf_main.c +++ b/drivers/net/ethernet/intel/iavf/iavf_main.c @@ -1029,6 +1029,48 @@ static bool iavf_is_mac_set_handled(struct net_device *netdev, return ret; } +/** + * iavf_mac_change_done - Check if MAC change completed + * @adapter: board private structure + * @data: MAC address being checked (as const void *) + * @v_op: virtchnl opcode from processed message + * + * Callback for iavf_poll_virtchnl_response() to check if MAC change completed. + * + * Return: true if MAC change completed, false otherwise + */ +static bool iavf_mac_change_done(struct iavf_adapter *adapter, + const void *data, enum virtchnl_ops v_op) +{ + const u8 *addr = data; + + return iavf_is_mac_set_handled(adapter->netdev, addr); +} + +/** + * iavf_set_mac_sync - Synchronously change MAC address + * @adapter: board private structure + * @addr: MAC address to set + * + * Send MAC change request to PF and poll admin queue for response. + * Caller must hold netdev_lock. This can sleep for up to 2.5 seconds. 
+ * + * Return: 0 on success, negative on failure + */ +static int iavf_set_mac_sync(struct iavf_adapter *adapter, const u8 *addr) +{ + int ret; + + netdev_assert_locked(adapter->netdev); + + ret = iavf_add_ether_addrs(adapter); + if (ret) + return ret; + + return iavf_poll_virtchnl_response(adapter, iavf_mac_change_done, + addr, 2500); +} + /** * iavf_set_mac - NDO callback to set port MAC address * @netdev: network interface device structure @@ -1049,25 +1091,21 @@ static int iavf_set_mac(struct net_device *netdev, void *p) return -EADDRNOTAVAIL; ret = iavf_replace_primary_mac(adapter, addr->sa_data); - if (ret) return ret; - ret = wait_event_interruptible_timeout(adapter->vc_waitqueue, - iavf_is_mac_set_handled(netdev, addr->sa_data), - msecs_to_jiffies(2500)); - - /* If ret < 0 then it means wait was interrupted. - * If ret == 0 then it means we got a timeout. - * else it means we got response for set MAC from PF, - * check if netdev MAC was updated to requested MAC, - * if yes then set MAC succeeded otherwise it failed return -EACCES - */ - if (ret < 0) + ret = iavf_set_mac_sync(adapter, addr->sa_data); + if (ret) { + /* Rollback for local failures (timeout, send error, -EBUSY). + * Note: If PF rejects the request (sends error response), + * iavf_virtchnl_completion() automatically calls + * iavf_mac_add_reject(), ret=0, and this is not executed. + * Only local failures (no PF response received) need manual rollback. 
+ */ + iavf_mac_add_reject(adapter); + ether_addr_copy(adapter->hw.mac.addr, netdev->dev_addr); return ret; - - if (!ret) - return -EAGAIN; + } if (!ether_addr_equal(netdev->dev_addr, addr->sa_data)) return -EACCES; @@ -5393,9 +5431,6 @@ static int iavf_probe(struct pci_dev *pdev, const struct pci_device_id *ent) /* Setup the wait queue for indicating transition to down status */ init_waitqueue_head(&adapter->down_waitqueue); - /* Setup the wait queue for indicating virtchannel events */ - init_waitqueue_head(&adapter->vc_waitqueue); - INIT_LIST_HEAD(&adapter->ptp.aq_cmds); init_waitqueue_head(&adapter->ptp.phc_time_waitqueue); mutex_init(&adapter->ptp.aq_cmd_lock); diff --git a/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c b/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c index 4f2defd2331b..cd5211b9a798 100644 --- a/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c +++ b/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c @@ -2,6 +2,7 @@ /* Copyright(c) 2013 - 2018 Intel Corporation. */ #include <linux/net/intel/libie/rx.h> +#include <net/netdev_lock.h> #include "iavf.h" #include "iavf_ptp.h" @@ -555,20 +556,23 @@ iavf_set_mac_addr_type(struct virtchnl_ether_addr *virtchnl_ether_addr, * @adapter: adapter structure * * Request that the PF add one or more addresses to our filters. 
- **/ -void iavf_add_ether_addrs(struct iavf_adapter *adapter) + * + * Return: 0 on success, negative on failure + */ +int iavf_add_ether_addrs(struct iavf_adapter *adapter) { struct virtchnl_ether_addr_list *veal; struct iavf_mac_filter *f; int i = 0, count = 0; bool more = false; size_t len; + int ret; if (adapter->current_op != VIRTCHNL_OP_UNKNOWN) { /* bail because we already have a command pending */ dev_err(&adapter->pdev->dev, "Cannot add filters, command %d pending\n", adapter->current_op); - return; + return -EBUSY; } spin_lock_bh(&adapter->mac_vlan_list_lock); @@ -580,7 +584,7 @@ void iavf_add_ether_addrs(struct iavf_adapter *adapter) if (!count) { adapter->aq_required &= ~IAVF_FLAG_AQ_ADD_MAC_FILTER; spin_unlock_bh(&adapter->mac_vlan_list_lock); - return; + return 0; } adapter->current_op = VIRTCHNL_OP_ADD_ETH_ADDR; @@ -594,8 +598,9 @@ void iavf_add_ether_addrs(struct iavf_adapter *adapter) veal = kzalloc(len, GFP_ATOMIC); if (!veal) { + adapter->current_op = VIRTCHNL_OP_UNKNOWN; spin_unlock_bh(&adapter->mac_vlan_list_lock); - return; + return -ENOMEM; } veal->vsi_id = adapter->vsi_res->vsi_id; @@ -615,8 +620,15 @@ void iavf_add_ether_addrs(struct iavf_adapter *adapter) spin_unlock_bh(&adapter->mac_vlan_list_lock); - iavf_send_pf_msg(adapter, VIRTCHNL_OP_ADD_ETH_ADDR, (u8 *)veal, len); + ret = iavf_send_pf_msg(adapter, VIRTCHNL_OP_ADD_ETH_ADDR, (u8 *)veal, len); kfree(veal); + if (ret) { + dev_err(&adapter->pdev->dev, + "Unable to send ADD_ETH_ADDR message to PF, error %d\n", ret); + adapter->current_op = VIRTCHNL_OP_UNKNOWN; + } + + return ret; } /** @@ -712,8 +724,8 @@ static void iavf_mac_add_ok(struct iavf_adapter *adapter) * @adapter: adapter structure * * Remove filters from list based on PF response. 
- **/ -static void iavf_mac_add_reject(struct iavf_adapter *adapter) + */ +void iavf_mac_add_reject(struct iavf_adapter *adapter) { struct net_device *netdev = adapter->netdev; struct iavf_mac_filter *f, *ftmp; @@ -2364,7 +2376,6 @@ void iavf_virtchnl_completion(struct iavf_adapter *adapter, iavf_mac_add_reject(adapter); /* restore administratively set MAC address */ ether_addr_copy(adapter->hw.mac.addr, netdev->dev_addr); - wake_up(&adapter->vc_waitqueue); break; case VIRTCHNL_OP_DEL_ETH_ADDR: dev_err(&adapter->pdev->dev, "Failed to delete MAC filter, error %s\n", @@ -2557,7 +2568,6 @@ void iavf_virtchnl_completion(struct iavf_adapter *adapter, eth_hw_addr_set(netdev, adapter->hw.mac.addr); netif_addr_unlock_bh(netdev); } - wake_up(&adapter->vc_waitqueue); break; case VIRTCHNL_OP_GET_STATS: { struct iavf_eth_stats *stats = @@ -2952,3 +2962,73 @@ void iavf_virtchnl_completion(struct iavf_adapter *adapter, } /* switch v_opcode */ adapter->current_op = VIRTCHNL_OP_UNKNOWN; } + +/** + * iavf_poll_virtchnl_response - Poll admin queue for virtchnl response + * @adapter: adapter structure + * @condition: callback to check if desired response received + * @cond_data: context data passed to condition callback + * @timeout_ms: maximum time to wait in milliseconds + * + * Polls the admin queue and processes all incoming virtchnl messages. + * After processing each valid message, calls the condition callback to check + * if the expected response has been received. The callback receives the opcode + * of the processed message to identify which response was received. Continues + * polling until the callback returns true or timeout expires. + * Caller must hold netdev_lock. This can sleep for up to timeout_ms while + * polling hardware. 
+ * + * Return: 0 on success (condition met), -EAGAIN on timeout, or error code + */ +int iavf_poll_virtchnl_response(struct iavf_adapter *adapter, + bool (*condition)(struct iavf_adapter *adapter, + const void *data, + enum virtchnl_ops v_op), + const void *cond_data, + unsigned int timeout_ms) +{ + struct iavf_hw *hw = &adapter->hw; + struct iavf_arq_event_info event; + enum virtchnl_ops received_op; + unsigned long timeout; + int ret = -EAGAIN; + u16 pending = 0; + u32 v_retval; + + netdev_assert_locked(adapter->netdev); + + event.buf_len = IAVF_MAX_AQ_BUF_SIZE; + event.msg_buf = kzalloc(event.buf_len, GFP_KERNEL); + if (!event.msg_buf) + return -ENOMEM; + + timeout = jiffies + msecs_to_jiffies(timeout_ms); + do { + if (!pending) + usleep_range(50, 75); + + if (iavf_clean_arq_element(hw, &event, &pending) == IAVF_SUCCESS) { + received_op = (enum virtchnl_ops)le32_to_cpu(event.desc.cookie_high); + if (received_op != VIRTCHNL_OP_UNKNOWN) { + v_retval = le32_to_cpu(event.desc.cookie_low); + + iavf_virtchnl_completion(adapter, received_op, + (enum iavf_status)v_retval, + event.msg_buf, event.msg_len); + + if (condition(adapter, cond_data, received_op)) { + ret = 0; + break; + } + } + + memset(event.msg_buf, 0, IAVF_MAX_AQ_BUF_SIZE); + + if (pending) + continue; + } + } while (time_before(jiffies, timeout)); + + kfree(event.msg_buf); + return ret; +} -- 2.47.1 ^ permalink raw reply related [flat|nested] 21+ messages in thread
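The polling scheme that iavf_poll_virtchnl_response() introduces (drain
messages, ask a caller-supplied callback whether the wanted response has
arrived, sleep briefly only when nothing is pending, give up at a
deadline) can be shown in a user-space sketch. Everything below is an
assumption made for illustration and none of it is driver code:
poll_until(), fetch_fn, done_fn, struct op_queue and the two helpers are
hypothetical names, and an array-backed queue stands in for the admin
receive queue.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical message source: true and *op filled if one was dequeued. */
typedef bool (*fetch_fn)(void *src, int *op);
/* Completion check, called after each processed message. */
typedef bool (*done_fn)(const void *data, int op);

static long long now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Returns 0 once done() reports completion, -EAGAIN on timeout. */
int poll_until(fetch_fn fetch, void *src, done_fn done, const void *data,
	       unsigned int timeout_ms)
{
	const struct timespec nap = { .tv_sec = 0, .tv_nsec = 50 * 1000 };
	long long deadline;
	int op;

	assert(fetch != NULL && done != NULL);
	deadline = now_ms() + timeout_ms;
	do {
		if (fetch(src, &op)) {
			if (done(data, op))
				return 0;
			continue;	/* more may be queued: skip the sleep */
		}
		nanosleep(&nap, NULL);	/* nothing pending, back off briefly */
	} while (now_ms() < deadline);

	return -EAGAIN;
}

/* Array-backed stand-in for a receive queue, used to exercise the loop. */
struct op_queue {
	const int *ops;
	size_t len;
	size_t pos;
};

bool op_queue_fetch(void *src, int *op)
{
	struct op_queue *q = src;

	if (q->pos >= q->len)
		return false;
	*op = q->ops[q->pos++];
	return true;
}

bool op_equals(const void *data, int op)
{
	return *(const int *)data == op;
}
```

As in the patch, unrelated messages are consumed on the way to the one
being waited for, and the deadline is still checked after every message,
so a source that never runs dry cannot stretch the wait indefinitely.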
* Re: [PATCH net 5/8] iavf: send MAC change request synchronously 2026-05-20 18:34 ` [PATCH net 5/8] iavf: send MAC change request synchronously Tony Nguyen @ 2026-05-23 0:16 ` Jakub Kicinski 2026-05-23 0:16 ` Jakub Kicinski 1 sibling, 0 replies; 21+ messages in thread From: Jakub Kicinski @ 2026-05-23 0:16 UTC (permalink / raw) To: anthony.l.nguyen Cc: Jakub Kicinski, davem, pabeni, edumazet, andrew+netdev, netdev, vinschen, alexander.nowlin This is an AI-generated review of your patch. The human sending this email has considered the AI review valid, or at least plausible. --- iavf: send MAC change request synchronously This patch reworks iavf_set_mac() to send the ADD_ETH_ADDR virtchnl message directly and poll the admin queue for the response, instead of relying on the watchdog and adminq tasks (which now block on the netdev instance lock held by iavf_set_mac()). It also exports iavf_mac_add_reject() so the new path can roll back local failures, and removes vc_waitqueue. > diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c > index 78c59a58e0b2..ed790dc3de6b 100644 > --- a/drivers/net/ethernet/intel/iavf/iavf_main.c > +++ b/drivers/net/ethernet/intel/iavf/iavf_main.c > @@ -1029,6 +1029,48 @@ static bool iavf_is_mac_set_handled(struct net_device *netdev, > return ret; > } > [ ... ] > +static int iavf_set_mac_sync(struct iavf_adapter *adapter, const u8 *addr) > +{ > + int ret; > + > + netdev_assert_locked(adapter->netdev); > + > + ret = iavf_add_ether_addrs(adapter); > + if (ret) > + return ret; > + > + return iavf_poll_virtchnl_response(adapter, iavf_mac_change_done, > + addr, 2500); > +} [Medium] Can this path mishandle the multi-batch case in iavf_add_ether_addrs()? 
When the number of filters with f->add=true exceeds what fits in one IAVF_MAX_AQ_BUF_SIZE message, iavf_add_ether_addrs() trims the batch and sets more=true: len = virtchnl_struct_size(veal, list, count); if (len > IAVF_MAX_AQ_BUF_SIZE) { dev_warn(&adapter->pdev->dev, "Too many add MAC changes in one request\n"); while (len > IAVF_MAX_AQ_BUF_SIZE) len = virtchnl_struct_size(veal, list, --count); more = true; } ... if (!more) adapter->aq_required &= ~IAVF_FLAG_AQ_ADD_MAC_FILTER; Only the first batch is sent synchronously here; the remainder is left for the watchdog. iavf_replace_primary_mac() calls iavf_add_filter() which uses list_add_tail(), so the new primary MAC is at the end of mac_filter_list and ends up in the last batch. In that scenario, iavf_poll_virtchnl_response() polls iavf_is_mac_set_handled() for the new MAC, which never completes within 2.5 seconds because the message that would change it has not been sent yet. iavf_set_mac() then returns -EAGAIN and the rollback below removes the new filter via iavf_mac_add_reject() because is_new_mac is true. Pre-patch the wait_event() on vc_waitqueue would eventually be satisfied because the watchdog drained subsequent batches and the PF response for the new MAC arrived. Should iavf_set_mac_sync() loop while IAVF_FLAG_AQ_ADD_MAC_FILTER (or more) is still set, so the new MAC is actually transmitted before the timeout? > @@ -1049,25 +1091,21 @@ static int iavf_set_mac(struct net_device *netdev, void *p) > return -EADDRNOTAVAIL; > > ret = iavf_replace_primary_mac(adapter, addr->sa_data); > - > if (ret) > return ret; > > - ret = wait_event_interruptible_timeout(adapter->vc_waitqueue, > - iavf_is_mac_set_handled(netdev, addr->sa_data), > - msecs_to_jiffies(2500)); > - > - /* If ret < 0 then it means wait was interrupted. > - * If ret == 0 then it means we got a timeout. 
> - * else it means we got response for set MAC from PF, > - * check if netdev MAC was updated to requested MAC, > - * if yes then set MAC succeeded otherwise it failed return -EACCES > - */ > - if (ret < 0) > + ret = iavf_set_mac_sync(adapter, addr->sa_data); > + if (ret) { > + /* Rollback for local failures (timeout, send error, -EBUSY). > + * Note: If PF rejects the request (sends error response), > + * iavf_virtchnl_completion() automatically calls > + * iavf_mac_add_reject(), ret=0, and this is not executed. > + * Only local failures (no PF response received) need manual rollback. > + */ > + iavf_mac_add_reject(adapter); > + ether_addr_copy(adapter->hw.mac.addr, netdev->dev_addr); > return ret; > - > - if (!ret) > - return -EAGAIN; > + } [Medium] Is iavf_mac_add_reject() safe to call when the AQ message never reached the PF? iavf_add_ether_addrs() clears f->add=false on every filter that it packs into the batch, before iavf_send_pf_msg() is invoked: list_for_each_entry(f, &adapter->mac_filter_list, list) { if (f->add) { ether_addr_copy(veal->list[i].addr, f->macaddr); iavf_set_mac_addr_type(&veal->list[i], f); i++; f->add = false; ... } } ... ret = iavf_send_pf_msg(adapter, VIRTCHNL_OP_ADD_ETH_ADDR, (u8 *)veal, len); iavf_mac_add_reject() then walks the entire mac_filter_list and marks every filter with !f->add && !f->add_handled as add_handled=true: list_for_each_entry_safe(f, ftmp, &adapter->mac_filter_list, list) { ... if (!f->add && !f->add_handled) f->add_handled = true; ... } On a local failure (iavf_send_pf_msg() error, IAVF_FLAG_PF_COMMS_FAILED, -EBUSY, -ENOMEM or polling timeout) any unrelated filters that were in the batch (e.g., entries added by a prior ndo_set_rx_mode() that the watchdog had not yet drained) get add_handled=true even though the PF never received them. They are not retried since f->add was cleared. 
Pre-patch iavf_mac_add_reject() was only called from iavf_virtchnl_completion() after the PF actually rejected an ADD_ETH_ADDR, so marking these filters as handled was correct because the PF had seen them. Should the new local-failure rollback be scoped to only the new primary MAC filter, rather than reusing the PF-reject helper unchanged? > diff --git a/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c b/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c > index 4f2defd2331b..cd5211b9a798 100644 > --- a/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c > +++ b/drivers/net/ethernet/intel/iavf/iavf_virtchnl.c > @@ -555,20 +556,23 @@ iavf_set_mac_addr_type(struct virtchnl_ether_addr *virtchnl_ether_addr, > * @adapter: adapter structure > * > * Request that the PF add one or more addresses to our filters. > - **/ > -void iavf_add_ether_addrs(struct iavf_adapter *adapter) > +int iavf_add_ether_addrs(struct iavf_adapter *adapter) > { > struct virtchnl_ether_addr_list *veal; > struct iavf_mac_filter *f; > int i = 0, count = 0; > bool more = false; [ ... ] ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net 5/8] iavf: send MAC change request synchronously
2026-05-20 18:34 ` [PATCH net 5/8] iavf: send MAC change request synchronously Tony Nguyen
2026-05-23 0:16 ` Jakub Kicinski
@ 2026-05-23 0:16 ` Jakub Kicinski
1 sibling, 0 replies; 21+ messages in thread
From: Jakub Kicinski @ 2026-05-23 0:16 UTC (permalink / raw)
To: anthony.l.nguyen
Cc: Jakub Kicinski, davem, pabeni, edumazet, andrew+netdev, netdev, vinschen, alexander.nowlin, stable

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
iavf: send MAC change request synchronously

This commit implements a synchronous MAC address change operation that
sends the ADD_ETH_ADDR message directly and polls for the response. This
avoids deadlocks involving the netdev lock, watchdog task, and adminq
task.

cc: stable@vger.kernel.org

> diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
> index 78c59a58e0b2b..ed790dc3de6b7 100644
> --- a/drivers/net/ethernet/intel/iavf/iavf_main.c
> +++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
[ ... ]
> @@ -1029,6 +1029,48 @@ static bool iavf_is_mac_set_handled(struct net_device *netdev,
[ ... ]
> +static int iavf_set_mac_sync(struct iavf_adapter *adapter, const u8 *addr)
> +{
> +	int ret;
> +
> +	netdev_assert_locked(adapter->netdev);
> +
> +	ret = iavf_add_ether_addrs(adapter);
> +	if (ret)
> +		return ret;

[Severity: Medium]
Will this reliably fail with -EBUSY if a background operation (such as
VIRTCHNL_OP_GET_STATS sent by the watchdog) is already in flight?

If adapter->current_op != VIRTCHNL_OP_UNKNOWN, iavf_add_ether_addrs()
instantly returns -EBUSY. Because iavf_set_mac() holds the netdev_lock,
the adminq_task cannot run to process the pending response and clear
current_op, resulting in a spurious failure rather than waiting for the
background operation to complete.

> +
> +	return iavf_poll_virtchnl_response(adapter, iavf_mac_change_done,
> +					   addr, 2500);
> +}
[ ... ]
> @@ -1049,25 +1091,21 @@ static int iavf_set_mac(struct net_device *netdev, void *p)
>  		return -EADDRNOTAVAIL;
>
>  	ret = iavf_replace_primary_mac(adapter, addr->sa_data);
> -
>  	if (ret)
>  		return ret;
>
> -	ret = wait_event_interruptible_timeout(adapter->vc_waitqueue,
> -					       iavf_is_mac_set_handled(netdev, addr->sa_data),
> -					       msecs_to_jiffies(2500));
> -
> -	/* If ret < 0 then it means wait was interrupted.
> -	 * If ret == 0 then it means we got a timeout.
> -	 * else it means we got response for set MAC from PF,
> -	 * check if netdev MAC was updated to requested MAC,
> -	 * if yes then set MAC succeeded otherwise it failed return -EACCES
> -	 */
> -	if (ret < 0)
> +	ret = iavf_set_mac_sync(adapter, addr->sa_data);
> +	if (ret) {
> +		/* Rollback for local failures (timeout, send error, -EBUSY).
> +		 * Note: If PF rejects the request (sends error response),
> +		 * iavf_virtchnl_completion() automatically calls
> +		 * iavf_mac_add_reject(), ret=0, and this is not executed.
> +		 * Only local failures (no PF response received) need manual rollback.
> +		 */
> +		iavf_mac_add_reject(adapter);
> +		ether_addr_copy(adapter->hw.mac.addr, netdev->dev_addr);
>  		return ret;
> -
> -	if (!ret)
> -		return -EAGAIN;

[Severity: High]
Does rolling back on local failure here cause a permanent state mismatch
between the PF and VF?

If the MAC change request was already sent to the PF by
iavf_add_ether_addrs(), but iavf_poll_virtchnl_response() subsequently
fails (e.g., memory allocation failure for event.msg_buf, or a timeout
occurs), this local rollback deletes the new MAC filter and restores the
old MAC.

However, the PF might still asynchronously process the request and
configure the new MAC. Later, the VF's watchdog will send a pending
deletion for the old MAC, leaving the PF configured with the new MAC
while the VF uses the old MAC, causing a complete loss of network
connectivity.

Should the buffer allocation in iavf_poll_virtchnl_response() be done
before sending the request to the PF, and can timeouts be handled without
rolling back in-flight requests?

^ permalink raw reply [flat|nested] 21+ messages in thread
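The ordering question at the end of the review above (allocate the
response buffer before the request goes out, so that -ENOMEM can only
happen while there is still nothing in flight) is easy to sketch in
isolation. This is a user-space illustration under stated assumptions,
not iavf code: struct fake_link, its fields and request_with_prealloc()
are hypothetical, and a counter stands in for actually sending a message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the channel to the PF. */
struct fake_link {
	int sent;	/* number of requests handed to the peer */
	int fail_alloc;	/* non-zero forces the allocation to fail */
};

/*
 * Allocate first, send second. On -ENOMEM no request has been sent, so
 * the caller can undo its local bookkeeping without the peer ever having
 * seen the change.
 */
int request_with_prealloc(struct fake_link *link, size_t buf_len)
{
	void *buf;

	assert(link != NULL && buf_len > 0);

	buf = link->fail_alloc ? NULL : calloc(1, buf_len);
	if (!buf)
		return -ENOMEM;	/* nothing in flight, rollback is safe */

	link->sent++;		/* models sending the request */
	/* ... the response would be polled into buf here ... */

	free(buf);
	return 0;
}
```

With this ordering, the only failures that can occur after the send are
timeouts, which is the case the review asks to treat separately from a
local rollback.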
* [PATCH net 6/8] ice: skip unnecessary VF reset when setting trust
2026-05-20 18:34 [PATCH net 0/8][pull request] Intel Wired LAN Driver Updates 2026-05-20 (ice, iavf, i40e, ixgbe) Tony Nguyen
` (4 preceding siblings ...)
2026-05-20 18:34 ` [PATCH net 5/8] iavf: send MAC change request synchronously Tony Nguyen
@ 2026-05-20 18:34 ` Tony Nguyen
2026-05-23 0:16 ` Jakub Kicinski
2026-05-23 0:16 ` Jakub Kicinski
2026-05-20 18:34 ` [PATCH net 7/8] i40e: set supported_extts_flags for rising edge Tony Nguyen
2026-05-20 18:34 ` [PATCH net 8/8] ixgbe: only access vfinfo and mv_list under RCU lock Tony Nguyen
7 siblings, 2 replies; 21+ messages in thread
From: Tony Nguyen @ 2026-05-20 18:34 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, andrew+netdev, netdev
Cc: Jose Ignacio Tornos Martinez, anthony.l.nguyen, przemyslaw.kitszel, jacob.e.keller, horms, Rafal Romanowski

From: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>

Similar to the i40e fix, ice_set_vf_trust() unconditionally calls
ice_reset_vf() when the trust setting changes. While the delay is smaller
than i40e this reset is still unnecessary in most cases.

Additionally, the original code has a race condition: it deletes MAC LLDP
filters BEFORE resetting the VF. During this deletion, the VF is still
ACTIVE and can add new MAC LLDP filters concurrently, potentially
corrupting the filter list.

When granting trust, no reset is needed - we can just set the capability
flag to allow privileged operations.

When revoking trust, we only need to reset (conservative approach) if the
VF has actually configured advanced features that require cleanup (MAC
LLDP filters, promiscuous mode). For VFs in a clean state, we can safely
change the trust setting without the disruptive reset.

When we do reset (MAC LLDP case), we fix the race condition by resetting
first to clear VF state (which blocks new MAC LLDP filter additions), then
delete existing filters safely. During cleanup, vf->trusted remains true
so ice_vf_is_lldp_ena() works properly. Only after cleanup do we set
vf->trusted = false.

When we don't reset, we manually handle capability flag via helper
function, eliminating the delay.

Fixes: 2296345416b0 ("ice: receive LLDP on trusted VFs")
Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_sriov.c | 33 +++++++++++++++++++---
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_sriov.c b/drivers/net/ethernet/intel/ice/ice_sriov.c
index 7e00e091756d..3c64ed1b41a8 100644
--- a/drivers/net/ethernet/intel/ice/ice_sriov.c
+++ b/drivers/net/ethernet/intel/ice/ice_sriov.c
@@ -1364,6 +1364,23 @@ int ice_set_vf_mac(struct net_device *netdev, int vf_id, u8 *mac)
 	return __ice_set_vf_mac(ice_netdev_to_pf(netdev), vf_id, mac);
 }
 
+/**
+ * ice_setup_vf_trust - Enable/disable VF trust mode without reset
+ * @vf: VF to configure
+ * @setting: trust setting
+ *
+ * Update VF flags when changing trust without performing a VF reset.
+ * This is only called when it's safe to skip the reset (VF has no advanced
+ * features configured that need cleanup).
+ */
+static void ice_setup_vf_trust(struct ice_vf *vf, bool setting)
+{
+	if (setting)
+		set_bit(ICE_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps);
+	else
+		clear_bit(ICE_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps);
+}
+
 /**
  * ice_set_vf_trust
  * @netdev: network interface device structure
@@ -1399,11 +1416,19 @@ int ice_set_vf_trust(struct net_device *netdev, int vf_id, bool trusted)
 
 	mutex_lock(&vf->cfg_lock);
 
-	while (!trusted && vf->num_mac_lldp)
-		ice_vf_update_mac_lldp_num(vf, ice_get_vf_vsi(vf), false);
-
+	/* Reset only if revoking trust and VF has advanced features configured */
+	if (!trusted &&
+	    (vf->num_mac_lldp > 0 ||
+	     test_bit(ICE_VF_STATE_UC_PROMISC, vf->vf_states) ||
+	     test_bit(ICE_VF_STATE_MC_PROMISC, vf->vf_states))) {
+		ice_reset_vf(vf, ICE_VF_RESET_NOTIFY);
+		while (vf->num_mac_lldp)
+			ice_vf_update_mac_lldp_num(vf, ice_get_vf_vsi(vf), false);
+	} else {
+		ice_setup_vf_trust(vf, trusted);
+	}
 	vf->trusted = trusted;
-	ice_reset_vf(vf, ICE_VF_RESET_NOTIFY);
+
 	dev_info(ice_pf_to_dev(pf), "VF %u is now %strusted\n",
 		 vf_id, trusted ? "" : "un");
 
--
2.47.1

^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH net 6/8] ice: skip unnecessary VF reset when setting trust
2026-05-20 18:34 ` [PATCH net 6/8] ice: skip unnecessary VF reset when setting trust Tony Nguyen
@ 2026-05-23 0:16 ` Jakub Kicinski
2026-05-23 0:16 ` Jakub Kicinski
1 sibling, 0 replies; 21+ messages in thread
From: Jakub Kicinski @ 2026-05-23 0:16 UTC (permalink / raw)
To: anthony.l.nguyen
Cc: Jakub Kicinski, davem, pabeni, edumazet, andrew+netdev, netdev, vinschen, alexander.nowlin

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
Based on my verification, all three findings are confirmed. Let me now
generate the report.

commit 848ee33529de0af7967f298f70c37bfb75ea5498
Author: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>

ice: skip unnecessary VF reset when setting trust

This patch makes ice_set_vf_trust() skip the VF reset for both the
grant-trust path and for revocation when the VF has not configured LLDP
filters or promiscuous mode. When a reset is still needed, it reorders
the work so the reset runs before the LLDP filter teardown.

> diff --git a/drivers/net/ethernet/intel/ice/ice_sriov.c b/drivers/net/ethernet/intel/ice/ice_sriov.c
> index 7e00e091756d..3c64ed1b41a8 100644
> --- a/drivers/net/ethernet/intel/ice/ice_sriov.c
> +++ b/drivers/net/ethernet/intel/ice/ice_sriov.c
> @@ -1399,11 +1416,19 @@ int ice_set_vf_trust(struct net_device *netdev, int vf_id, bool trusted)
>
>  	mutex_lock(&vf->cfg_lock);
>
> -	while (!trusted && vf->num_mac_lldp)
> -		ice_vf_update_mac_lldp_num(vf, ice_get_vf_vsi(vf), false);
> -
> +	/* Reset only if revoking trust and VF has advanced features configured */
> +	if (!trusted &&
> +	    (vf->num_mac_lldp > 0 ||
> +	     test_bit(ICE_VF_STATE_UC_PROMISC, vf->vf_states) ||
> +	     test_bit(ICE_VF_STATE_MC_PROMISC, vf->vf_states))) {
> +		ice_reset_vf(vf, ICE_VF_RESET_NOTIFY);

[Critical]
Can the privilege capability bit get left set after revocation here?

The actual trust check used at runtime is ice_is_vf_trusted(), which
reads the bit, not vf->trusted:

    bool ice_is_vf_trusted(struct ice_vf *vf)
    {
        return test_bit(ICE_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps);
    }

The new code calls ice_reset_vf() while vf->trusted is still true (the
assignment is deferred to after the reset so ice_vf_is_lldp_ena() works
during cleanup). Inside the reset, the rebuild path runs:

    ice_reset_vf()
      -> ice_vf_post_vsi_rebuild()
         -> ice_vf_rebuild_host_cfg()
            -> ice_vf_set_host_trust_cfg()
                 assign_bit(ICE_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps,
                            vf->trusted);

Because vf->trusted is still true at that moment, the bit gets
re-asserted to 1. After the reset returns, vf->trusted is set to false,
but ice_setup_vf_trust(vf, false) is only called in the else branch, so
the privilege bit is never cleared on this path.

After this returns, ice_is_vf_trusted() still returns true, which gates
ice_can_vf_change_mac(), ice_vc_handle_mac_addr_msg(),
ice_vc_cfg_promiscuous_mode_msg() and ice_vf_has_max_vlans(). The kernel
log says "VF X is now untrusted" but the trust-gated capabilities remain
available to the guest.

Would calling ice_setup_vf_trust(vf, false) (or assigning vf->trusted
before the reset, as the previous code did) be necessary to actually
clear the bit on this path?

> +		while (vf->num_mac_lldp)
> +			ice_vf_update_mac_lldp_num(vf, ice_get_vf_vsi(vf), false);

[High]
Is this loop reachable after the reset above?

ice_reset_vf() goes through ice_vf_pre_vsi_rebuild() which calls
ice_vf_clear_counters():

    static void ice_vf_clear_counters(struct ice_vf *vf)
    {
        struct ice_vsi *vsi = ice_get_vf_vsi(vf);

        if (vsi)
            vsi->num_vlan = 0;

        vf->num_mac = 0;
        vf->num_mac_lldp = 0;
        ...
    }

By the time control returns to ice_set_vf_trust(), vf->num_mac_lldp is
already zero, so the while loop never iterates and
ice_vf_update_mac_lldp_num() -> ice_vsi_cfg_sw_lldp(vsi, false, false)
never runs.

ice_vf_reconfig_vsi() inside the reset does invoke ice_fltr_remove_all()
for per-VSI rules, but ice_vsi_cfg_sw_lldp() also drives the AQ-based
ice_lldp_fltr_add_remove() removal, gated by the PF-wide
ICE_FLAG_LLDP_AQ_FLTR flag, which is not torn down by
ice_fltr_remove_all(). On firmware that uses the AQ fallback (set when
the per-VSI rule add fails), the AQ LLDP filter rule installed when trust
was granted stays in place, and the VF can keep receiving LLDP frames
after revocation.

The commit message says:

    When we do reset (MAC LLDP case), we fix the race condition by
    resetting first to clear VF state (which blocks new MAC LLDP filter
    additions), then delete existing filters safely.

Does the deletion step actually run, given that the loop counter is
zeroed by the reset before the loop is reached? Would saving
num_mac_lldp before ice_reset_vf() and using that saved value (or
unconditionally calling ice_vsi_cfg_sw_lldp(vsi, false, false) when an
LLDP rule was present) be needed for the intended teardown to happen?

> +	} else {
> +		ice_setup_vf_trust(vf, trusted);
> +	}
>  	vf->trusted = trusted;
> -	ice_reset_vf(vf, ICE_VF_RESET_NOTIFY);
> +
>  	dev_info(ice_pf_to_dev(pf), "VF %u is now %strusted\n",
>  		 vf_id, trusted ? "" : "un");

[Medium]
Is the predicate that selects the reset path complete?

The condition only covers num_mac_lldp and the UC/MC promisc state bits.
While trusted, however, a VF can also:

  - override the host-set MAC and add unicast MACs beyond
    ICE_MAX_MACADDR_PER_VF (ice_can_vf_change_mac() and
    ice_vc_handle_mac_addr_msg() consult ice_is_vf_trusted())
  - add VLAN filters beyond ICE_MAX_VLAN_PER_VF (ice_vf_has_max_vlans() /
    ice_vc_get_max_vlan_fltrs() grant a higher cap when
    ice_is_vf_trusted() is true)

A VF that used these higher quotas but is not currently in promisc mode
and has no LLDP filter takes the no-reset path, and the over-quota MAC
and VLAN filters carry over past trust revocation.

The previous unconditional ice_reset_vf() rebuilt the VSI through
ice_vf_rebuild_host_mac_cfg(), which restored only the host-set MAC and
dropped VF-added extras.

The commit message says the reset is taken "if the VF has actually
configured advanced features that require cleanup". Should the condition
also account for VF-installed MAC and VLAN filters that exceed the
untrusted limits?

^ permalink raw reply [flat|nested] 21+ messages in thread
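The two ordering questions in the review above (the privilege bit being re-asserted by the reset, and the LLDP teardown loop seeing a counter that is already zero) can be reproduced in a small user-space model. This is only a sketch under stated assumptions: `struct toy_vf`, `toy_reset()` and both revoke helpers are hypothetical stand-ins rather than ice driver code, and the model takes the review's description at face value, namely that a reset zeroes the LLDP counter and rebuilds the privilege bit from the `trusted` field.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the few struct ice_vf fields discussed above. */
struct toy_vf {
	bool trusted;       /* models vf->trusted */
	bool cap_privilege; /* models the ICE_VIRTCHNL_VF_CAP_PRIVILEGE bit */
	int num_mac_lldp;   /* models vf->num_mac_lldp */
	int lldp_teardowns; /* counts explicit LLDP filter removals */
};

/*
 * Assumed reset behaviour, as described in the review: counters are
 * cleared, then the privilege bit is rebuilt from the trusted field.
 */
static void toy_reset(struct toy_vf *vf)
{
	vf->num_mac_lldp = 0;
	vf->cap_privilege = vf->trusted;
}

/* Ordering used on the reset path of the posted patch. */
static void toy_revoke_as_posted(struct toy_vf *vf)
{
	toy_reset(vf);
	while (vf->num_mac_lldp) { /* never true after toy_reset() */
		vf->num_mac_lldp--;
		vf->lldp_teardowns++;
	}
	vf->trusted = false; /* the capability bit is left untouched */
}

/*
 * One possible alternative along the lines the review asks about:
 * remember the count before the reset and clear the bit explicitly.
 */
static void toy_revoke_alternative(struct toy_vf *vf)
{
	int saved = vf->num_mac_lldp; /* read before the reset zeroes it */

	toy_reset(vf);
	for (; saved > 0; saved--)
		vf->lldp_teardowns++;
	vf->trusted = false;
	vf->cap_privilege = false;
}
```

In this model the posted ordering ends with `trusted == false` but `cap_privilege == true` and no teardowns, while the alternative clears both and performs one teardown per saved filter. Whether that alternative is right for the real driver depends on code not shown in this thread.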
* Re: [PATCH net 6/8] ice: skip unnecessary VF reset when setting trust
2026-05-20 18:34 ` [PATCH net 6/8] ice: skip unnecessary VF reset when setting trust Tony Nguyen
2026-05-23 0:16 ` Jakub Kicinski
@ 2026-05-23 0:16 ` Jakub Kicinski
1 sibling, 0 replies; 21+ messages in thread
From: Jakub Kicinski @ 2026-05-23 0:16 UTC (permalink / raw)
To: anthony.l.nguyen
Cc: Jakub Kicinski, davem, pabeni, edumazet, andrew+netdev, netdev, vinschen, alexander.nowlin

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
ice: skip unnecessary VF reset when setting trust

Similar to the i40e fix, ice_set_vf_trust() unconditionally calls
ice_reset_vf() when the trust setting changes. While the delay is smaller
than i40e this reset is still unnecessary in most cases.

> diff --git a/drivers/net/ethernet/intel/ice/ice_sriov.c b/drivers/net/ethernet/intel/ice/ice_sriov.c
> --- a/drivers/net/ethernet/intel/ice/ice_sriov.c
> +++ b/drivers/net/ethernet/intel/ice/ice_sriov.c

[ ... ]

> @@ -1399,11 +1416,19 @@ int ice_set_vf_trust(struct net_device *netdev, int vf_id, bool trusted)
>
>  	mutex_lock(&vf->cfg_lock);
>
> -	while (!trusted && vf->num_mac_lldp)
> -		ice_vf_update_mac_lldp_num(vf, ice_get_vf_vsi(vf), false);
> -
> +	/* Reset only if revoking trust and VF has advanced features configured */
> +	if (!trusted &&
> +	    (vf->num_mac_lldp > 0 ||
> +	     test_bit(ICE_VF_STATE_UC_PROMISC, vf->vf_states) ||
> +	     test_bit(ICE_VF_STATE_MC_PROMISC, vf->vf_states))) {
> +		ice_reset_vf(vf, ICE_VF_RESET_NOTIFY);

[Severity: High]
When revoking trust, ice_reset_vf() is called before updating vf->trusted
to false. During ice_reset_vf(), the VSI is rebuilt and synchronizes the
ICE_VIRTCHNL_VF_CAP_PRIVILEGE bit in vf->vf_caps with the current value
of vf->trusted, which is still true.

The else block below is bypassed, and although vf->trusted is later set
to false, vf_caps will retain the privileged capability bit.

Because ice_is_vf_trusted() strictly checks vf_caps, won't the VF
incorrectly retain privileged capabilities indefinitely after this
operation?

> +		while (vf->num_mac_lldp)
> +			ice_vf_update_mac_lldp_num(vf, ice_get_vf_vsi(vf), false);

[Severity: High]
Is the MAC LLDP filter cleanup loop dead code?

ice_reset_vf() internally calls ice_vf_pre_vsi_rebuild(), which
unconditionally clears vf->num_mac_lldp to 0 via ice_vf_clear_counters().
As a result, the while condition evaluates to false immediately upon
returning from the reset, completely bypassing the intended cleanup
logic.

Furthermore, because vf->num_mac_lldp is abruptly zeroed without going
through ice_vf_update_mac_lldp_num(), does the driver skip calling
ice_vsi_cfg_sw_lldp(), silently leaking the software LLDP switch filters
from its bookkeeping?

> +	} else {
> +		ice_setup_vf_trust(vf, trusted);
> +	}

[Severity: High]
If we skip the VF reset when granting trust, what happens to expected VM
configuration workflows for promiscuous mode?

The PF driver intentionally lies to untrusted VFs that attempt to enable
promiscuous mode: it returns success to the VF without actually
programming the hardware.

Historically, granting trust forced a full VF reset. This reset prompted
the VF driver to re-initialize and replay its configuration (including
promiscuous mode) to the PF, which would then correctly program the
hardware because the VF was now trusted.

By skipping the reset when granting trust, it appears a VF that
previously requested promiscuous mode will remain permanently
desynchronized. The guest OS believes it is in promiscuous mode, but the
hardware is not programmed to reflect this, which could result in dropped
traffic.

[Severity: High]
When skipping the VF reset on trust revocation (because the VF lacks MAC
LLDP or promiscuous mode), are we allowing the VF to persistently retain
resources allocated while it was trusted?

The 'trusted' capability grants a VF the ability to bypass administrative
limits, such as ICE_MAX_MACADDR_PER_VF (which restricts standard VFs to
18 MAC addresses) and the maximum VLAN limit checked in
ice_vf_has_max_vlans().

Prior to this patch, revoking trust unconditionally triggered a VF reset,
which rebuilt the VSI and completely wiped all dynamically allocated
hardware filters, effectively enforcing the demoted limits.

Without the reset, might a VF that allocated an excessive number of MAC
addresses or VLANs while trusted retain them in hardware indefinitely
after trust is revoked, creating a persistent resource limit bypass?

>  	vf->trusted = trusted;
> -	ice_reset_vf(vf, ICE_VF_RESET_NOTIFY);
> +
>  	dev_info(ice_pf_to_dev(pf), "VF %u is now %strusted\n",
>  		 vf_id, trusted ? "" : "un");

^ permalink raw reply [flat|nested] 21+ messages in thread
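The resource-retention question above comes down to which VF state the "skip the reset" predicate inspects. The following is a minimal sketch of that predicate and of a wider variant, assuming hypothetical names throughout: `struct toy_vf_state` and both `toy_needs_reset_*()` helpers are illustrative, `TOY_MAX_MAC_UNTRUSTED` reuses the figure of 18 quoted in the review, and `TOY_MAX_VLAN_UNTRUSTED` is a placeholder value not taken from the driver.

```c
#include <stdbool.h>

#define TOY_MAX_MAC_UNTRUSTED  18 /* figure quoted in the review */
#define TOY_MAX_VLAN_UNTRUSTED 8  /* placeholder, not taken from the driver */

/* Hypothetical snapshot of the VF state relevant to trust revocation. */
struct toy_vf_state {
	int num_mac_lldp;
	bool uc_promisc;
	bool mc_promisc;
	int num_mac;
	int num_vlan;
};

/* Predicate as posted: only LLDP filters and promiscuous state count. */
static bool toy_needs_reset_as_posted(const struct toy_vf_state *s)
{
	return s->num_mac_lldp > 0 || s->uc_promisc || s->mc_promisc;
}

/* Wider variant: over-quota MAC or VLAN filters also require cleanup. */
static bool toy_needs_reset_wider(const struct toy_vf_state *s)
{
	return toy_needs_reset_as_posted(s) ||
	       s->num_mac > TOY_MAX_MAC_UNTRUSTED ||
	       s->num_vlan > TOY_MAX_VLAN_UNTRUSTED;
}
```

A VF modelled with 20 MAC filters and nothing else takes the no-reset path under the first predicate and the reset path under the second, which is the gap the review describes.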
* [PATCH net 7/8] i40e: set supported_extts_flags for rising edge
2026-05-20 18:34 [PATCH net 0/8][pull request] Intel Wired LAN Driver Updates 2026-05-20 (ice, iavf, i40e, ixgbe) Tony Nguyen
` (5 preceding siblings ...)
2026-05-20 18:34 ` [PATCH net 6/8] ice: skip unnecessary VF reset when setting trust Tony Nguyen
@ 2026-05-20 18:34 ` Tony Nguyen
2026-05-23 0:16 ` Jakub Kicinski
2026-05-20 18:34 ` [PATCH net 8/8] ixgbe: only access vfinfo and mv_list under RCU lock Tony Nguyen
7 siblings, 1 reply; 21+ messages in thread
From: Tony Nguyen @ 2026-05-20 18:34 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, andrew+netdev, netdev
Cc: Przemyslaw Korba, anthony.l.nguyen, richardcochran, kory.maincent, Arkadiusz Kubalewski, Aleksandr Loktionov, Jacob Keller, Simon Horman, Sunitha Mekala

From: Przemyslaw Korba <przemyslaw.korba@intel.com>

The i40e driver always supported only rising edge detection, so advertise
PTP_RISING_EDGE, and PTP_STRICT_FLAGS to ensure the PTP core properly
validates user requests.

Fixes: 7c571ac57d9d ("net: ptp: introduce .supported_extts_flags to ptp_clock_info")
Signed-off-by: Przemyslaw Korba <przemyslaw.korba@intel.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_ptp.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ptp.c b/drivers/net/ethernet/intel/i40e/i40e_ptp.c
index 7d07c389bb23..c4525bfab09c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ptp.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ptp.c
@@ -1344,6 +1344,8 @@ static int i40e_init_pin_config(struct i40e_pf *pf)
 	pf->ptp_caps.n_ext_ts = 2;
 	pf->ptp_caps.pps = 1;
 	pf->ptp_caps.n_per_out = 2;
+	pf->ptp_caps.supported_extts_flags = PTP_RISING_EDGE |
+					     PTP_STRICT_FLAGS;
 
 	pf->ptp_caps.pin_config = kzalloc_objs(*pf->ptp_caps.pin_config,
 					       pf->ptp_caps.n_pins);
--
2.47.1

^ permalink raw reply related [flat|nested] 21+ messages in thread
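To illustrate what "the PTP core properly validates user requests" can mean in practice, here is a simplified user-space model of a flag check driven by a driver's advertised mask. It is a sketch under assumptions, not the kernel's code: the `M_*` constants and `model_check_extts_flags()` are hypothetical names, and the model assumes that enabling is always permitted and that edge flags are treated as hints unless the driver advertises strict checking.

```c
#include <errno.h>

/* Model flag bits; the real definitions live in the PTP UAPI header. */
#define M_ENABLE_FEATURE (1u << 0)
#define M_RISING_EDGE    (1u << 1)
#define M_FALLING_EDGE   (1u << 2)
#define M_STRICT_FLAGS   (1u << 3)
#define M_EDGES          (M_RISING_EDGE | M_FALLING_EDGE)

/*
 * Returns 0 when every requested flag is covered by the advertised mask,
 * -EOPNOTSUPP otherwise.
 */
static int model_check_extts_flags(unsigned int supported, unsigned int req)
{
	supported |= M_ENABLE_FEATURE;     /* enabling is always allowed */
	if (!(supported & M_STRICT_FLAGS)) /* non-strict: edges are only hints */
		supported |= M_EDGES;

	return (req & ~supported) ? -EOPNOTSUPP : 0;
}
```

With a mask of rising edge plus strict checking, as the patch advertises, this model accepts a rising-edge request and rejects a falling-edge one. With an empty mask it accepts either edge but rejects a request that asks for strict checking.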
* Re: [PATCH net 7/8] i40e: set supported_extts_flags for rising edge
2026-05-20 18:34 ` [PATCH net 7/8] i40e: set supported_extts_flags for rising edge Tony Nguyen
@ 2026-05-23 0:16 ` Jakub Kicinski
0 siblings, 0 replies; 21+ messages in thread
From: Jakub Kicinski @ 2026-05-23 0:16 UTC (permalink / raw)
To: anthony.l.nguyen
Cc: Jakub Kicinski, davem, pabeni, edumazet, andrew+netdev, netdev, vinschen, alexander.nowlin

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
i40e: set supported_extts_flags for rising edge

The i40e driver always supported only rising edge detection, so advertise
PTP_RISING_EDGE, and PTP_STRICT_FLAGS to ensure the PTP core properly
validates user requests.

> diff --git a/drivers/net/ethernet/intel/i40e/i40e_ptp.c b/drivers/net/ethernet/intel/i40e/i40e_ptp.c
> index 7d07c389bb231..c4525bfab09c0 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_ptp.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_ptp.c
> @@ -1344,6 +1344,8 @@ static int i40e_init_pin_config(struct i40e_pf *pf)
>  	pf->ptp_caps.n_ext_ts = 2;
>  	pf->ptp_caps.pps = 1;
>  	pf->ptp_caps.n_per_out = 2;
> +	pf->ptp_caps.supported_extts_flags = PTP_RISING_EDGE |
> +					     PTP_STRICT_FLAGS;

[Severity: High]
This isn't a bug introduced by this patch, but does the PTP external
timestamp event in i40e_ptp_extts0_work() incorrectly assign the PCI
Physical Function (PF) ID to the event index instead of the PTP hardware
channel index?

Looking at i40e_ptp_extts0_work():

drivers/net/ethernet/intel/i40e/i40e_ptp.c:i40e_ptp_extts0_work() {
    ...
    event.type = PTP_CLOCK_EXTTS;
    event.index = hw->pf_id;

    /* fire event */
    ptp_clock_event(pf->ptp_clock, &event);
}

The PTP core API expects the index field to correspond to the requested
EXTTS channel index (which should be 0 here). Userspace applications rely
on this index to associate the received timestamp event with the
requested channel.

For interfaces where pf_id is greater than 0 (e.g., dual or quad port
NICs), this dispatches the event to userspace with the wrong channel
index, causing applications to ignore the timestamp or misbehave,
completely breaking EXTTS functionality for those ports.

>
>  	pf->ptp_caps.pin_config = kzalloc_objs(*pf->ptp_caps.pin_config,
>  					       pf->ptp_caps.n_pins);

^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH net 8/8] ixgbe: only access vfinfo and mv_list under RCU lock
2026-05-20 18:34 [PATCH net 0/8][pull request] Intel Wired LAN Driver Updates 2026-05-20 (ice, iavf, i40e, ixgbe) Tony Nguyen
` (6 preceding siblings ...)
2026-05-20 18:34 ` [PATCH net 7/8] i40e: set supported_extts_flags for rising edge Tony Nguyen
@ 2026-05-20 18:34 ` Tony Nguyen
2026-05-23 0:16 ` Jakub Kicinski
2026-05-23 0:16 ` Jakub Kicinski
7 siblings, 2 replies; 21+ messages in thread
From: Tony Nguyen @ 2026-05-20 18:34 UTC (permalink / raw)
To: davem, kuba, pabeni, edumazet, andrew+netdev, netdev
Cc: Corinna Vinschen, anthony.l.nguyen, Alexander Nowlin

From: Corinna Vinschen <vinschen@redhat.com>

Commit 1e53834ce541d ("ixgbe: Add locking to prevent panic when setting
sriov_numvfs to zero") added a spinlock to the adapter info. The reason
at the time was an observed crash when ixgbe_disable_sriov() freed the
adapter->vfinfo array while the interrupt driven function
ixgbe_msg_task() was handling VF messages.

Recent stability testing turned up another crash, which is very easily
reproducible:

  while true
  do
    for numvfs in 5 0
    do
      echo $numvfs > /sys/class/net/eth0/device/sriov_numvfs
    done
  done

This crashed almost always within the first two hundred runs with a NULL
pointer deref while running the ixgbe_service_task() workqueue:

[ 5052.036491] BUG: kernel NULL pointer dereference, address: 0000000000000258
[ 5052.043454] #PF: supervisor read access in kernel mode
[ 5052.048594] #PF: error_code(0x0000) - not-present page
[ 5052.053734] PGD 0 P4D 0
[ 5052.056272] Oops: Oops: 0000 #1 SMP NOPTI
[ 5052.060459] CPU: 2 UID: 0 PID: 132253 Comm: kworker/u96:0 Kdump: loaded Not tainted 6.12.0-180.el10.x86_64 #1 PREEMPT(voluntary)
[ 5052.072100] Hardware name: Dell Inc. PowerEdge R740/0DY2X0, BIOS 2.12.2 07/09/2021
[ 5052.079664] Workqueue: ixgbe ixgbe_service_task [ixgbe]
[ 5052.084907] RIP: 0010:ixgbe_update_stats+0x8b1/0xb40 [ixgbe]
[ 5052.090585] Code: 21 56 50 49 8b b6 18 26 00 00 4c 01 fe 48 09 46 50 42 8d 34 a5 00 83 00 00 e8 cb 7a ff ff 49 8b b6 18 26 00 00 89 c0 4c 01 fe <48> 3b 86 88 00 00 00 73 18 48 b9 00 00 00 00 01 00 00 00 48 01 4e
[ 5052.109331] RSP: 0018:ffffd5f1e8a6bd88 EFLAGS: 00010202
[ 5052.114558] RAX: 0000000000000000 RBX: ffff8f49b22b14a0 RCX: 000000000000023c
[ 5052.121689] RDX: ffffffff00000000 RSI: 00000000000001d0 RDI: ffff8f49b22b14a0
[ 5052.128823] RBP: 000000000000109c R08: 0000000000000000 R09: 0000000000000000
[ 5052.135955] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
[ 5052.143086] R13: 0000000000008410 R14: ffff8f49b22b01a0 R15: 00000000000001d0
[ 5052.150221] FS: 0000000000000000(0000) GS:ffff8f58bfc80000(0000) knlGS:0000000000000000
[ 5052.158307] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5052.164054] CR2: 0000000000000258 CR3: 0000000bf2624006 CR4: 00000000007726f0
[ 5052.171187] PKRU: 55555554
[ 5052.173898] Call Trace:
[ 5052.176351] <TASK>
[ 5052.178457] ? show_trace_log_lvl+0x1b0/0x2f0
[ 5052.182816] ? show_trace_log_lvl+0x1b0/0x2f0
[ 5052.187177] ? ixgbe_watchdog_subtask+0x1a1/0x230 [ixgbe]
[ 5052.192591] ? __die_body.cold+0x8/0x12
[ 5052.196433] ? page_fault_oops+0x148/0x160
[ 5052.200532] ? exc_page_fault+0x7f/0x150
[ 5052.204458] ? asm_exc_page_fault+0x26/0x30
[ 5052.208643] ? ixgbe_update_stats+0x8b1/0xb40 [ixgbe]
[ 5052.213714] ? ixgbe_update_stats+0x8a5/0xb40 [ixgbe]
[ 5052.218784] ixgbe_watchdog_subtask+0x1a1/0x230 [ixgbe]
[ 5052.224026] ixgbe_service_task+0x15a/0x3f0 [ixgbe]
[ 5052.228916] process_one_work+0x177/0x330
[ 5052.232928] worker_thread+0x256/0x3a0
[ 5052.236681] ? __pfx_worker_thread+0x10/0x10
[ 5052.240952] kthread+0xfa/0x240
[ 5052.244099] ? __pfx_kthread+0x10/0x10
[ 5052.247852] ret_from_fork+0x34/0x50
[ 5052.251429] ? __pfx_kthread+0x10/0x10
[ 5052.255185] ret_from_fork_asm+0x1a/0x30
[ 5052.259112] </TASK>

The first simple patch, just adding spinlocking to ixgbe_update_stats()
while reading from adapter->vfinfo, did not fix the problem, it just
moved it elsewhere: I could now reproduce the same kind of crash in
ixgbe_restore_vf_multicasts().

But adding more spinlocking doesn't really cut it. One reason is that
ixgbe_restore_vf_multicasts() is called from within ixgbe_msg_task() with
active spinlock, as well as from outside without locking. Additionally,
given that ixgbe_disable_sriov() is the only call changing
adapter->vfinfo, and given ixgbe_disable_sriov() is called very seldom
compared to other actions in the driver, just adding more spinlocks would
unnecessarily occupy the driver with spinning when multiple functions
accessing adapter->vfinfo are running in parallel.

So this patch drops the spinlock in favor of RCU and uses it throughout
the driver.

While changing this, it seems prudent to do the same for the
adapter->mv_list array, which is allocated and freed at the same time as
adapter->vfinfo, albeit there was no crash observed.

Fixes: 1e53834ce541d ("ixgbe: Add locking to prevent panic when setting sriov_numvfs to zero") Signed-off-by: Corinna Vinschen <vinschen@redhat.com> Tested-by: Alexander Nowlin <alexander.nowlin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> --- drivers/net/ethernet/intel/ixgbe/ixgbe.h | 7 +- .../net/ethernet/intel/ixgbe/ixgbe_dcb_nl.c | 36 +- .../net/ethernet/intel/ixgbe/ixgbe_ethtool.c | 44 +- .../net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 17 +- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 227 +++++--- .../net/ethernet/intel/ixgbe/ixgbe_sriov.c | 547 ++++++++++++------ 6 files changed, 592 insertions(+), 286 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h index 9b8217523fd2..8849b9f42bf6 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h @@ -210,6 +210,7 @@ struct vf_stats { }; struct vf_data_storage { + struct rcu_head rcu_head; struct pci_dev *vfdev; unsigned char vf_mac_addresses[ETH_ALEN]; u16 vf_mc_hashes[IXGBE_MAX_VF_MC_ENTRIES]; @@ -240,6 +241,7 @@ enum ixgbevf_xcast_modes { }; struct vf_macvlans { + struct rcu_head rcu_head; struct list_head l; int vf; bool free; @@ -808,10 +810,10 @@ struct ixgbe_adapter { /* SR-IOV */ DECLARE_BITMAP(active_vfs, IXGBE_MAX_VF_FUNCTIONS); unsigned int num_vfs; - struct vf_data_storage *vfinfo; + struct vf_data_storage __rcu *vfinfo; int vf_rate_link_speed; struct vf_macvlans vf_mvs; - struct vf_macvlans *mv_list; + struct vf_macvlans __rcu *mv_list; u32 timer_event_accumulator; u32 vferr_refcount; @@ -844,7 +846,6 @@ struct ixgbe_adapter { #ifdef CONFIG_IXGBE_IPSEC struct ixgbe_ipsec *ipsec; #endif /* CONFIG_IXGBE_IPSEC */ - spinlock_t vfs_lock; }; struct ixgbe_netdevice_priv { diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_nl.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_nl.c index 382d097e4b11..9a84cfc09120 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_nl.c +++ 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_dcb_nl.c @@ -640,17 +640,21 @@ static int ixgbe_dcbnl_ieee_setapp(struct net_device *dev, /* VF devices should use default UP when available */ if (app->selector == IEEE_8021QAZ_APP_SEL_ETHERTYPE && app->protocol == 0) { + struct vf_data_storage *vfinfo; int vf; adapter->default_up = app->priority; - for (vf = 0; vf < adapter->num_vfs; vf++) { - struct vf_data_storage *vfinfo = &adapter->vfinfo[vf]; - - if (!vfinfo->pf_qos) - ixgbe_set_vmvir(adapter, vfinfo->pf_vlan, - app->priority, vf); - } + rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (vfinfo) + for (vf = 0; vf < adapter->num_vfs; vf++) { + if (!vfinfo[vf].pf_qos) + ixgbe_set_vmvir(adapter, + vfinfo[vf].pf_vlan, + app->priority, vf); + } + rcu_read_unlock(); } return 0; @@ -683,19 +687,23 @@ static int ixgbe_dcbnl_ieee_delapp(struct net_device *dev, /* IF default priority is being removed clear VF default UP */ if (app->selector == IEEE_8021QAZ_APP_SEL_ETHERTYPE && app->protocol == 0 && adapter->default_up == app->priority) { + struct vf_data_storage *vfinfo; int vf; long unsigned int app_mask = dcb_ieee_getapp_mask(dev, app); int qos = app_mask ? 
find_first_bit(&app_mask, 8) : 0; adapter->default_up = qos; - for (vf = 0; vf < adapter->num_vfs; vf++) { - struct vf_data_storage *vfinfo = &adapter->vfinfo[vf]; - - if (!vfinfo->pf_qos) - ixgbe_set_vmvir(adapter, vfinfo->pf_vlan, - qos, vf); - } + rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (vfinfo) + for (vf = 0; vf < adapter->num_vfs; vf++) { + if (!vfinfo[vf].pf_qos) + ixgbe_set_vmvir(adapter, + vfinfo[vf].pf_vlan, + qos, vf); + } + rcu_read_unlock(); } return err; diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c index ba049b3a9609..b77317476af4 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c @@ -2265,21 +2265,28 @@ static void ixgbe_diag_test(struct net_device *netdev, struct ixgbe_hw *hw = &adapter->hw; if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED) { + struct vf_data_storage *vfinfo; int i; - for (i = 0; i < adapter->num_vfs; i++) { - if (adapter->vfinfo[i].clear_to_send) { - netdev_warn(netdev, "offline diagnostic is not supported when VFs are present\n"); - data[0] = 1; - data[1] = 1; - data[2] = 1; - data[3] = 1; - data[4] = 1; - eth_test->flags |= ETH_TEST_FL_FAILED; - clear_bit(__IXGBE_TESTING, - &adapter->state); - return; + + rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (vfinfo) + for (i = 0; i < adapter->num_vfs; i++) { + if (vfinfo[i].clear_to_send) { + netdev_warn(netdev, "offline diagnostic is not supported when VFs are present\n"); + data[0] = 1; + data[1] = 1; + data[2] = 1; + data[3] = 1; + data[4] = 1; + eth_test->flags |= ETH_TEST_FL_FAILED; + clear_bit(__IXGBE_TESTING, + &adapter->state); + rcu_read_unlock(); + return; + } } - } + rcu_read_unlock(); } /* Offline tests */ @@ -3700,9 +3707,14 @@ static int ixgbe_set_priv_flags(struct net_device *netdev, u32 priv_flags) if (priv_flags & IXGBE_PRIV_FLAGS_AUTO_DISABLE_VF) { if (adapter->hw.mac.type == ixgbe_mac_82599EB) 
{ /* Reset primary abort counter */ - for (i = 0; i < adapter->num_vfs; i++) - adapter->vfinfo[i].primary_abort_count = 0; - + struct vf_data_storage *vfinfo; + + rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (vfinfo) + for (i = 0; i < adapter->num_vfs; i++) + vfinfo[i].primary_abort_count = 0; + rcu_read_unlock(); flags2 |= IXGBE_FLAG2_AUTO_DISABLE_VF; } else { e_info(probe, diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c index bd397b3d7dea..b524a3a61eb6 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c @@ -874,6 +874,7 @@ void ixgbe_ipsec_vf_clear(struct ixgbe_adapter *adapter, u32 vf) int ixgbe_ipsec_vf_add_sa(struct ixgbe_adapter *adapter, u32 *msgbuf, u32 vf) { struct ixgbe_ipsec *ipsec = adapter->ipsec; + struct vf_data_storage *vfinfo; struct xfrm_algo_desc *algo; struct sa_mbx_msg *sam; struct xfrm_state *xs; @@ -883,7 +884,13 @@ int ixgbe_ipsec_vf_add_sa(struct ixgbe_adapter *adapter, u32 *msgbuf, u32 vf) int err; sam = (struct sa_mbx_msg *)(&msgbuf[1]); - if (!adapter->vfinfo[vf].trusted || + + lockdep_assert_in_rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (!vfinfo) + return 0; + + if (!vfinfo[vf].trusted || !(adapter->flags2 & IXGBE_FLAG2_VF_IPSEC_ENABLED)) { e_warn(drv, "VF %d attempted to add an IPsec SA\n", vf); err = -EACCES; @@ -984,11 +991,17 @@ int ixgbe_ipsec_vf_add_sa(struct ixgbe_adapter *adapter, u32 *msgbuf, u32 vf) int ixgbe_ipsec_vf_del_sa(struct ixgbe_adapter *adapter, u32 *msgbuf, u32 vf) { struct ixgbe_ipsec *ipsec = adapter->ipsec; + struct vf_data_storage *vfinfo; struct xfrm_state *xs; u32 pfsa = msgbuf[1]; u16 sa_idx; - if (!adapter->vfinfo[vf].trusted) { + lockdep_assert_in_rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (!vfinfo) + return 0; + + if (!vfinfo[vf].trusted) { e_err(drv, "vf %d attempted to delete an SA\n", vf); return -EPERM; } diff 
--git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 2646ee6f295f..d82c7dfc6580 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -1240,20 +1240,26 @@ static void ixgbe_pf_handle_tx_hang(struct ixgbe_ring *tx_ring, static void ixgbe_vf_handle_tx_hang(struct ixgbe_adapter *adapter, u16 vf) { struct ixgbe_hw *hw = &adapter->hw; + struct vf_data_storage *vfinfo; if (adapter->hw.mac.type != ixgbe_mac_e610) return; - e_warn(drv, - "Malicious Driver Detection tx hang detected on PF %d VF %d MAC: %pM", - hw->bus.func, vf, adapter->vfinfo[vf].vf_mac_addresses); - - adapter->tx_hang_count[vf]++; - if (adapter->tx_hang_count[vf] == IXGBE_MAX_TX_VF_HANGS) { - ixgbe_set_vf_link_state(adapter, vf, - IFLA_VF_LINK_STATE_DISABLE); - adapter->tx_hang_count[vf] = 0; + rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (vfinfo) { + e_warn(drv, + "Malicious Driver Detection tx hang detected on PF %d VF %d MAC: %pM", + hw->bus.func, vf, vfinfo[vf].vf_mac_addresses); + + adapter->tx_hang_count[vf]++; + if (adapter->tx_hang_count[vf] == IXGBE_MAX_TX_VF_HANGS) { + ixgbe_set_vf_link_state(adapter, vf, + IFLA_VF_LINK_STATE_DISABLE); + adapter->tx_hang_count[vf] = 0; + } } + rcu_read_unlock(); } static u32 ixgbe_poll_tx_icache(struct ixgbe_hw *hw, u16 queue, u16 idx) @@ -4625,6 +4631,7 @@ static void ixgbe_configure_virtualization(struct ixgbe_adapter *adapter) struct ixgbe_hw *hw = &adapter->hw; u16 pool = adapter->num_rx_pools; u32 reg_offset, vf_shift, vmolr; + struct vf_data_storage *vfinfo; u32 gcr_ext, vmdctl; int i; @@ -4680,15 +4687,19 @@ static void ixgbe_configure_virtualization(struct ixgbe_adapter *adapter) IXGBE_WRITE_REG(hw, IXGBE_GCR_EXT, gcr_ext); - for (i = 0; i < adapter->num_vfs; i++) { - /* configure spoof checking */ - ixgbe_ndo_set_vf_spoofchk(adapter->netdev, i, - adapter->vfinfo[i].spoofchk_enabled); + rcu_read_lock(); + vfinfo = 
rcu_dereference(adapter->vfinfo); + if (vfinfo) + for (i = 0; i < adapter->num_vfs; i++) { + /* configure spoof checking */ + ixgbe_ndo_set_vf_spoofchk(adapter->netdev, i, + vfinfo[i].spoofchk_enabled); - /* Enable/Disable RSS query feature */ - ixgbe_ndo_set_vf_rss_query_en(adapter->netdev, i, - adapter->vfinfo[i].rss_query_enabled); - } + /* Enable/Disable RSS query feature */ + ixgbe_ndo_set_vf_rss_query_en(adapter->netdev, i, + vfinfo[i].rss_query_enabled); + } + rcu_read_unlock(); } static void ixgbe_set_rx_buffer_len(struct ixgbe_adapter *adapter) @@ -6093,35 +6104,40 @@ static void ixgbe_check_media_subtask(struct ixgbe_adapter *adapter) static void ixgbe_clear_vf_stats_counters(struct ixgbe_adapter *adapter) { struct ixgbe_hw *hw = &adapter->hw; + struct vf_data_storage *vfinfo; int i; - for (i = 0; i < adapter->num_vfs; i++) { - adapter->vfinfo[i].last_vfstats.gprc = - IXGBE_READ_REG(hw, IXGBE_PVFGPRC(i)); - adapter->vfinfo[i].saved_rst_vfstats.gprc += - adapter->vfinfo[i].vfstats.gprc; - adapter->vfinfo[i].vfstats.gprc = 0; - adapter->vfinfo[i].last_vfstats.gptc = - IXGBE_READ_REG(hw, IXGBE_PVFGPTC(i)); - adapter->vfinfo[i].saved_rst_vfstats.gptc += - adapter->vfinfo[i].vfstats.gptc; - adapter->vfinfo[i].vfstats.gptc = 0; - adapter->vfinfo[i].last_vfstats.gorc = - IXGBE_READ_REG(hw, IXGBE_PVFGORC_LSB(i)); - adapter->vfinfo[i].saved_rst_vfstats.gorc += - adapter->vfinfo[i].vfstats.gorc; - adapter->vfinfo[i].vfstats.gorc = 0; - adapter->vfinfo[i].last_vfstats.gotc = - IXGBE_READ_REG(hw, IXGBE_PVFGOTC_LSB(i)); - adapter->vfinfo[i].saved_rst_vfstats.gotc += - adapter->vfinfo[i].vfstats.gotc; - adapter->vfinfo[i].vfstats.gotc = 0; - adapter->vfinfo[i].last_vfstats.mprc = - IXGBE_READ_REG(hw, IXGBE_PVFMPRC(i)); - adapter->vfinfo[i].saved_rst_vfstats.mprc += - adapter->vfinfo[i].vfstats.mprc; - adapter->vfinfo[i].vfstats.mprc = 0; - } + rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (vfinfo) + for (i = 0; i < adapter->num_vfs; i++) { + 
vfinfo[i].last_vfstats.gprc = + IXGBE_READ_REG(hw, IXGBE_PVFGPRC(i)); + vfinfo[i].saved_rst_vfstats.gprc += + vfinfo[i].vfstats.gprc; + vfinfo[i].vfstats.gprc = 0; + vfinfo[i].last_vfstats.gptc = + IXGBE_READ_REG(hw, IXGBE_PVFGPTC(i)); + vfinfo[i].saved_rst_vfstats.gptc += + vfinfo[i].vfstats.gptc; + vfinfo[i].vfstats.gptc = 0; + vfinfo[i].last_vfstats.gorc = + IXGBE_READ_REG(hw, IXGBE_PVFGORC_LSB(i)); + vfinfo[i].saved_rst_vfstats.gorc += + vfinfo[i].vfstats.gorc; + vfinfo[i].vfstats.gorc = 0; + vfinfo[i].last_vfstats.gotc = + IXGBE_READ_REG(hw, IXGBE_PVFGOTC_LSB(i)); + vfinfo[i].saved_rst_vfstats.gotc += + vfinfo[i].vfstats.gotc; + vfinfo[i].vfstats.gotc = 0; + vfinfo[i].last_vfstats.mprc = + IXGBE_READ_REG(hw, IXGBE_PVFMPRC(i)); + vfinfo[i].saved_rst_vfstats.mprc += + vfinfo[i].vfstats.mprc; + vfinfo[i].vfstats.mprc = 0; + } + rcu_read_unlock(); } static void ixgbe_setup_gpie(struct ixgbe_adapter *adapter) @@ -6729,15 +6745,22 @@ void ixgbe_down(struct ixgbe_adapter *adapter) timer_delete_sync(&adapter->service_timer); if (adapter->num_vfs) { + struct vf_data_storage *vfinfo; + /* Clear EITR Select mapping */ IXGBE_WRITE_REG(&adapter->hw, IXGBE_EITRSEL, 0); + rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); /* Mark all the VFs as inactive */ - for (i = 0 ; i < adapter->num_vfs; i++) - adapter->vfinfo[i].clear_to_send = false; + if (vfinfo) { + for (i = 0 ; i < adapter->num_vfs; i++) + vfinfo[i].clear_to_send = false; - /* update setting rx tx for all active vfs */ - ixgbe_set_all_vfs(adapter); + /* update setting rx tx for all active vfs */ + ixgbe_set_all_vfs(adapter); + } + rcu_read_unlock(); } /* disable transmits in the hardware now that interrupts are off */ @@ -7001,9 +7024,6 @@ static int ixgbe_sw_init(struct ixgbe_adapter *adapter, /* n-tuple support exists, always init our spinlock */ spin_lock_init(&adapter->fdir_perfect_lock); - /* init spinlock to avoid concurrency of VF resources */ - spin_lock_init(&adapter->vfs_lock); - #ifdef 
CONFIG_IXGBE_DCB ixgbe_init_dcb(adapter); #endif @@ -7905,25 +7925,31 @@ void ixgbe_update_stats(struct ixgbe_adapter *adapter) * crazy values. */ if (!test_bit(__IXGBE_RESETTING, &adapter->state)) { - for (i = 0; i < adapter->num_vfs; i++) { - UPDATE_VF_COUNTER_32bit(IXGBE_PVFGPRC(i), - adapter->vfinfo[i].last_vfstats.gprc, - adapter->vfinfo[i].vfstats.gprc); - UPDATE_VF_COUNTER_32bit(IXGBE_PVFGPTC(i), - adapter->vfinfo[i].last_vfstats.gptc, - adapter->vfinfo[i].vfstats.gptc); - UPDATE_VF_COUNTER_36bit(IXGBE_PVFGORC_LSB(i), - IXGBE_PVFGORC_MSB(i), - adapter->vfinfo[i].last_vfstats.gorc, - adapter->vfinfo[i].vfstats.gorc); - UPDATE_VF_COUNTER_36bit(IXGBE_PVFGOTC_LSB(i), - IXGBE_PVFGOTC_MSB(i), - adapter->vfinfo[i].last_vfstats.gotc, - adapter->vfinfo[i].vfstats.gotc); - UPDATE_VF_COUNTER_32bit(IXGBE_PVFMPRC(i), - adapter->vfinfo[i].last_vfstats.mprc, - adapter->vfinfo[i].vfstats.mprc); - } + struct vf_data_storage *vfinfo; + + rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (vfinfo) + for (i = 0; i < adapter->num_vfs; i++) { + UPDATE_VF_COUNTER_32bit(IXGBE_PVFGPRC(i), + vfinfo[i].last_vfstats.gprc, + vfinfo[i].vfstats.gprc); + UPDATE_VF_COUNTER_32bit(IXGBE_PVFGPTC(i), + vfinfo[i].last_vfstats.gptc, + vfinfo[i].vfstats.gptc); + UPDATE_VF_COUNTER_36bit(IXGBE_PVFGORC_LSB(i), + IXGBE_PVFGORC_MSB(i), + vfinfo[i].last_vfstats.gorc, + vfinfo[i].vfstats.gorc); + UPDATE_VF_COUNTER_36bit(IXGBE_PVFGOTC_LSB(i), + IXGBE_PVFGOTC_MSB(i), + vfinfo[i].last_vfstats.gotc, + vfinfo[i].vfstats.gotc); + UPDATE_VF_COUNTER_32bit(IXGBE_PVFMPRC(i), + vfinfo[i].last_vfstats.mprc, + vfinfo[i].vfstats.mprc); + } + rcu_read_unlock(); } } @@ -8267,22 +8293,27 @@ static void ixgbe_watchdog_flush_tx(struct ixgbe_adapter *adapter) static void ixgbe_bad_vf_abort(struct ixgbe_adapter *adapter, u32 vf) { struct ixgbe_hw *hw = &adapter->hw; + struct vf_data_storage *vfinfo; - if (adapter->hw.mac.type == ixgbe_mac_82599EB && + rcu_read_lock(); + vfinfo = 
rcu_dereference(adapter->vfinfo); + if (vfinfo && + adapter->hw.mac.type == ixgbe_mac_82599EB && adapter->flags2 & IXGBE_FLAG2_AUTO_DISABLE_VF) { - adapter->vfinfo[vf].primary_abort_count++; - if (adapter->vfinfo[vf].primary_abort_count == + vfinfo[vf].primary_abort_count++; + if (vfinfo[vf].primary_abort_count == IXGBE_PRIMARY_ABORT_LIMIT) { ixgbe_set_vf_link_state(adapter, vf, IFLA_VF_LINK_STATE_DISABLE); - adapter->vfinfo[vf].primary_abort_count = 0; + vfinfo[vf].primary_abort_count = 0; e_info(drv, "Malicious Driver Detection event detected on PF %d VF %d MAC: %pM mdd-disable-vf=on", hw->bus.func, vf, - adapter->vfinfo[vf].vf_mac_addresses); + vfinfo[vf].vf_mac_addresses); } } + rcu_read_unlock(); } static void ixgbe_check_for_bad_vf(struct ixgbe_adapter *adapter) @@ -8309,9 +8340,15 @@ static void ixgbe_check_for_bad_vf(struct ixgbe_adapter *adapter) /* check status reg for all VFs owned by this PF */ for (vf = 0; vf < adapter->num_vfs; ++vf) { - struct pci_dev *vfdev = adapter->vfinfo[vf].vfdev; + struct vf_data_storage *vfinfo; + struct pci_dev *vfdev = NULL; u16 status_reg; + rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (vfinfo) + vfdev = vfinfo[vf].vfdev; + rcu_read_unlock(); if (!vfdev) continue; pci_read_config_word(vfdev, PCI_STATUS, &status_reg); @@ -9744,15 +9781,21 @@ static int ixgbe_ndo_get_vf_stats(struct net_device *netdev, int vf, struct ifla_vf_stats *vf_stats) { struct ixgbe_adapter *adapter = ixgbe_from_netdev(netdev); + struct vf_data_storage *vfinfo; if (vf < 0 || vf >= adapter->num_vfs) return -EINVAL; - vf_stats->rx_packets = adapter->vfinfo[vf].vfstats.gprc; - vf_stats->rx_bytes = adapter->vfinfo[vf].vfstats.gorc; - vf_stats->tx_packets = adapter->vfinfo[vf].vfstats.gptc; - vf_stats->tx_bytes = adapter->vfinfo[vf].vfstats.gotc; - vf_stats->multicast = adapter->vfinfo[vf].vfstats.mprc; + rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (vfinfo) { + vf_stats->rx_packets = vfinfo[vf].vfstats.gprc; + 
vf_stats->rx_bytes = vfinfo[vf].vfstats.gorc; + vf_stats->tx_packets = vfinfo[vf].vfstats.gptc; + vf_stats->tx_bytes = vfinfo[vf].vfstats.gotc; + vf_stats->multicast = vfinfo[vf].vfstats.mprc; + } + rcu_read_unlock(); return 0; } @@ -10071,20 +10114,26 @@ static int handle_redirect_action(struct ixgbe_adapter *adapter, int ifindex, { struct ixgbe_ring_feature *vmdq = &adapter->ring_feature[RING_F_VMDQ]; unsigned int num_vfs = adapter->num_vfs, vf; + struct vf_data_storage *vfinfo; struct netdev_nested_priv priv; struct upper_walk_data data; struct net_device *upper; /* redirect to a SRIOV VF */ - for (vf = 0; vf < num_vfs; ++vf) { - upper = pci_get_drvdata(adapter->vfinfo[vf].vfdev); - if (upper->ifindex == ifindex) { - *queue = vf * __ALIGN_MASK(1, ~vmdq->mask); - *action = vf + 1; - *action <<= ETHTOOL_RX_FLOW_SPEC_RING_VF_OFF; - return 0; + rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (vfinfo) + for (vf = 0; vf < num_vfs; ++vf) { + upper = pci_get_drvdata(vfinfo[vf].vfdev); + if (upper->ifindex == ifindex) { + *queue = vf * __ALIGN_MASK(1, ~vmdq->mask); + *action = vf + 1; + *action <<= ETHTOOL_RX_FLOW_SPEC_RING_VF_OFF; + rcu_read_unlock(); + return 0; + } } - } + rcu_read_unlock(); /* redirect to a offloaded macvlan netdev */ data.adapter = adapter; diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c index 431d77da15a5..80f22a8e7af4 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c @@ -44,7 +44,7 @@ static inline void ixgbe_alloc_vf_macvlans(struct ixgbe_adapter *adapter, mv_list[i].free = true; list_add(&mv_list[i].l, &adapter->vf_mvs.l); } - adapter->mv_list = mv_list; + rcu_assign_pointer(adapter->mv_list, mv_list); } } @@ -52,6 +52,7 @@ static int __ixgbe_enable_sriov(struct ixgbe_adapter *adapter, unsigned int num_vfs) { struct ixgbe_hw *hw = &adapter->hw; + struct vf_data_storage *vfinfo; int i; if 
(adapter->xdp_prog) { @@ -64,14 +65,11 @@ static int __ixgbe_enable_sriov(struct ixgbe_adapter *adapter, IXGBE_FLAG_VMDQ_ENABLED; /* Allocate memory for per VF control structures */ - adapter->vfinfo = kzalloc_objs(struct vf_data_storage, num_vfs); - if (!adapter->vfinfo) + vfinfo = kzalloc_objs(struct vf_data_storage, num_vfs); + if (!vfinfo) return -ENOMEM; - adapter->num_vfs = num_vfs; - ixgbe_alloc_vf_macvlans(adapter, num_vfs); - adapter->ring_feature[RING_F_VMDQ].offset = num_vfs; /* Initialize default switching mode VEB */ IXGBE_WRITE_REG(hw, IXGBE_PFDTXGSWC, IXGBE_PFDTXGSWC_VT_LBEN); @@ -95,23 +93,27 @@ static int __ixgbe_enable_sriov(struct ixgbe_adapter *adapter, for (i = 0; i < num_vfs; i++) { /* enable spoof checking for all VFs */ - adapter->vfinfo[i].spoofchk_enabled = true; - adapter->vfinfo[i].link_enable = true; + vfinfo[i].spoofchk_enabled = true; + vfinfo[i].link_enable = true; /* We support VF RSS querying only for 82599 and x540 * devices at the moment. These devices share RSS * indirection table and RSS hash key with PF therefore * we want to disable the querying by default. 
*/ - adapter->vfinfo[i].rss_query_enabled = false; + vfinfo[i].rss_query_enabled = false; /* Untrust all VFs */ - adapter->vfinfo[i].trusted = false; + vfinfo[i].trusted = false; /* set the default xcast mode */ - adapter->vfinfo[i].xcast_mode = IXGBEVF_XCAST_MODE_NONE; + vfinfo[i].xcast_mode = IXGBEVF_XCAST_MODE_NONE; } + rcu_assign_pointer(adapter->vfinfo, vfinfo); + adapter->num_vfs = num_vfs; + adapter->ring_feature[RING_F_VMDQ].offset = num_vfs; + e_info(probe, "SR-IOV enabled with %d VFs\n", num_vfs); return 0; } @@ -123,6 +125,7 @@ static int __ixgbe_enable_sriov(struct ixgbe_adapter *adapter, static void ixgbe_get_vfs(struct ixgbe_adapter *adapter) { struct pci_dev *pdev = adapter->pdev; + struct vf_data_storage *vfinfo; u16 vendor = pdev->vendor; struct pci_dev *vfdev; int vf = 0; @@ -134,18 +137,23 @@ static void ixgbe_get_vfs(struct ixgbe_adapter *adapter) return; pci_read_config_word(pdev, pos + PCI_SRIOV_VF_DID, &vf_id); - vfdev = pci_get_device(vendor, vf_id, NULL); - for (; vfdev; vfdev = pci_get_device(vendor, vf_id, vfdev)) { - if (!vfdev->is_virtfn) - continue; - if (vfdev->physfn != pdev) - continue; - if (vf >= adapter->num_vfs) - continue; - pci_dev_get(vfdev); - adapter->vfinfo[vf].vfdev = vfdev; - ++vf; + rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (vfinfo) { + vfdev = pci_get_device(vendor, vf_id, NULL); + for (; vfdev; vfdev = pci_get_device(vendor, vf_id, vfdev)) { + if (!vfdev->is_virtfn) + continue; + if (vfdev->physfn != pdev) + continue; + if (vf >= adapter->num_vfs) + continue; + pci_dev_get(vfdev); + vfinfo[vf].vfdev = vfdev; + ++vf; + } } + rcu_read_unlock(); } /* Note this function is called when the user wants to enable SR-IOV @@ -206,31 +214,28 @@ int ixgbe_disable_sriov(struct ixgbe_adapter *adapter) { unsigned int num_vfs = adapter->num_vfs, vf; struct ixgbe_hw *hw = &adapter->hw; - unsigned long flags; + struct vf_data_storage *vfinfo; + struct vf_macvlans *mv_list; int rss; - 
spin_lock_irqsave(&adapter->vfs_lock, flags); - /* set num VFs to 0 to prevent access to vfinfo */ + /* set num VFs to 0 so readers bail out early */ adapter->num_vfs = 0; - spin_unlock_irqrestore(&adapter->vfs_lock, flags); + + vfinfo = rcu_replace_pointer(adapter->vfinfo, NULL, 1); + mv_list = rcu_replace_pointer(adapter->mv_list, NULL, 1); /* put the reference to all of the vf devices */ for (vf = 0; vf < num_vfs; ++vf) { - struct pci_dev *vfdev = adapter->vfinfo[vf].vfdev; + struct pci_dev *vfdev = vfinfo[vf].vfdev; if (!vfdev) continue; - adapter->vfinfo[vf].vfdev = NULL; + vfinfo[vf].vfdev = NULL; pci_dev_put(vfdev); } - /* free VF control structures */ - kfree(adapter->vfinfo); - adapter->vfinfo = NULL; - - /* free macvlan list */ - kfree(adapter->mv_list); - adapter->mv_list = NULL; + kfree_rcu(vfinfo, rcu_head); + kfree_rcu(mv_list, rcu_head); /* if SR-IOV is already disabled then there is nothing to do */ if (!(adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)) @@ -368,8 +373,8 @@ static int ixgbe_set_vf_multicasts(struct ixgbe_adapter *adapter, { int entries = FIELD_GET(IXGBE_VT_MSGINFO_MASK, msgbuf[0]); u16 *hash_list = (u16 *)&msgbuf[1]; - struct vf_data_storage *vfinfo = &adapter->vfinfo[vf]; struct ixgbe_hw *hw = &adapter->hw; + struct vf_data_storage *vfinfo; int i; u32 vector_bit; u32 vector_reg; @@ -379,28 +384,34 @@ static int ixgbe_set_vf_multicasts(struct ixgbe_adapter *adapter, /* only so many hash values supported */ entries = min(entries, IXGBE_MAX_VF_MC_ENTRIES); + lockdep_assert_in_rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (!vfinfo) + return 0; + /* * salt away the number of multi cast addresses assigned * to this VF for later use to restore when the PF multi cast * list changes */ - vfinfo->num_vf_mc_hashes = entries; + vfinfo[vf].num_vf_mc_hashes = entries; /* * VFs are limited to using the MTA hash table for their multicast * addresses */ for (i = 0; i < entries; i++) { - vfinfo->vf_mc_hashes[i] = hash_list[i]; + 
vfinfo[vf].vf_mc_hashes[i] = hash_list[i]; } - for (i = 0; i < vfinfo->num_vf_mc_hashes; i++) { - vector_reg = (vfinfo->vf_mc_hashes[i] >> 5) & 0x7F; - vector_bit = vfinfo->vf_mc_hashes[i] & 0x1F; + for (i = 0; i < vfinfo[vf].num_vf_mc_hashes; i++) { + vector_reg = (vfinfo[vf].vf_mc_hashes[i] >> 5) & 0x7F; + vector_bit = vfinfo[vf].vf_mc_hashes[i] & 0x1F; mta_reg = IXGBE_READ_REG(hw, IXGBE_MTA(vector_reg)); mta_reg |= BIT(vector_bit); IXGBE_WRITE_REG(hw, IXGBE_MTA(vector_reg), mta_reg); } + vmolr |= IXGBE_VMOLR_ROMPE; IXGBE_WRITE_REG(hw, IXGBE_VMOLR(vf), vmolr); @@ -410,32 +421,39 @@ static int ixgbe_set_vf_multicasts(struct ixgbe_adapter *adapter, #ifdef CONFIG_PCI_IOV void ixgbe_restore_vf_multicasts(struct ixgbe_adapter *adapter) { - struct ixgbe_hw *hw = &adapter->hw; struct vf_data_storage *vfinfo; + struct ixgbe_hw *hw = &adapter->hw; int i, j; u32 vector_bit; u32 vector_reg; u32 mta_reg; + rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (!vfinfo) + goto no_vfs; + for (i = 0; i < adapter->num_vfs; i++) { u32 vmolr = IXGBE_READ_REG(hw, IXGBE_VMOLR(i)); - vfinfo = &adapter->vfinfo[i]; - for (j = 0; j < vfinfo->num_vf_mc_hashes; j++) { + for (j = 0; j < vfinfo[i].num_vf_mc_hashes; j++) { hw->addr_ctrl.mta_in_use++; - vector_reg = (vfinfo->vf_mc_hashes[j] >> 5) & 0x7F; - vector_bit = vfinfo->vf_mc_hashes[j] & 0x1F; + vector_reg = (vfinfo[i].vf_mc_hashes[j] >> 5) & 0x7F; + vector_bit = vfinfo[i].vf_mc_hashes[j] & 0x1F; mta_reg = IXGBE_READ_REG(hw, IXGBE_MTA(vector_reg)); mta_reg |= BIT(vector_bit); IXGBE_WRITE_REG(hw, IXGBE_MTA(vector_reg), mta_reg); } - if (vfinfo->num_vf_mc_hashes) + if (vfinfo[i].num_vf_mc_hashes) vmolr |= IXGBE_VMOLR_ROMPE; else vmolr &= ~IXGBE_VMOLR_ROMPE; IXGBE_WRITE_REG(hw, IXGBE_VMOLR(i), vmolr); } +no_vfs: + rcu_read_unlock(); + /* Restore any VF macvlans */ ixgbe_full_sync_mac_table(adapter); } @@ -493,7 +511,9 @@ static int ixgbe_set_vf_lpe(struct ixgbe_adapter *adapter, u32 max_frame, u32 vf */ if 
(adapter->hw.mac.type == ixgbe_mac_82599EB) { struct net_device *dev = adapter->netdev; + unsigned int vf_api = ixgbe_mbox_api_10; int pf_max_frame = dev->mtu + ETH_HLEN; + struct vf_data_storage *vfinfo; u32 reg_offset, vf_shift, vfre; int err = 0; @@ -503,7 +523,12 @@ static int ixgbe_set_vf_lpe(struct ixgbe_adapter *adapter, u32 max_frame, u32 vf IXGBE_FCOE_JUMBO_FRAME_SIZE); #endif /* CONFIG_FCOE */ - switch (adapter->vfinfo[vf].vf_api) { + lockdep_assert_in_rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (vfinfo) + vf_api = vfinfo[vf].vf_api; + + switch (vf_api) { case ixgbe_mbox_api_11: case ixgbe_mbox_api_12: case ixgbe_mbox_api_13: @@ -643,10 +668,16 @@ static void ixgbe_clear_vf_vlans(struct ixgbe_adapter *adapter, u32 vf) static int ixgbe_set_vf_macvlan(struct ixgbe_adapter *adapter, int vf, int index, unsigned char *mac_addr) { - struct vf_macvlans *entry; + struct vf_macvlans *mv_list, *entry; bool found = false; int retval = 0; + lockdep_assert_in_rcu_read_lock(); + /* vf_mvs entries point into the mv_list array */ + mv_list = rcu_dereference(adapter->mv_list); + if (!mv_list) + return 0; + if (index <= 1) { list_for_each_entry(entry, &adapter->vf_mvs.l, l) { if (entry->vf == vf) { @@ -700,7 +731,7 @@ static inline void ixgbe_vf_reset_event(struct ixgbe_adapter *adapter, u32 vf) { struct ixgbe_hw *hw = &adapter->hw; struct ixgbe_ring_feature *vmdq = &adapter->ring_feature[RING_F_VMDQ]; - struct vf_data_storage *vfinfo = &adapter->vfinfo[vf]; + struct vf_data_storage *vfinfo; u32 q_per_pool = __ALIGN_MASK(1, ~vmdq->mask); u8 num_tcs = adapter->hw_tcs; u32 reg_val; @@ -709,31 +740,36 @@ static inline void ixgbe_vf_reset_event(struct ixgbe_adapter *adapter, u32 vf) /* remove VLAN filters belonging to this VF */ ixgbe_clear_vf_vlans(adapter, vf); + lockdep_assert_in_rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (!vfinfo) + return; + /* add back PF assigned VLAN or VLAN 0 */ - ixgbe_set_vf_vlan(adapter, true, 
vfinfo->pf_vlan, vf); + ixgbe_set_vf_vlan(adapter, true, vfinfo[vf].pf_vlan, vf); /* reset offloads to defaults */ - ixgbe_set_vmolr(hw, vf, !vfinfo->pf_vlan); + ixgbe_set_vmolr(hw, vf, !vfinfo[vf].pf_vlan); /* set outgoing tags for VFs */ - if (!vfinfo->pf_vlan && !vfinfo->pf_qos && !num_tcs) { + if (!vfinfo[vf].pf_vlan && !vfinfo[vf].pf_qos && !num_tcs) { ixgbe_clear_vmvir(adapter, vf); } else { - if (vfinfo->pf_qos || !num_tcs) - ixgbe_set_vmvir(adapter, vfinfo->pf_vlan, - vfinfo->pf_qos, vf); + if (vfinfo[vf].pf_qos || !num_tcs) + ixgbe_set_vmvir(adapter, vfinfo[vf].pf_vlan, + vfinfo[vf].pf_qos, vf); else - ixgbe_set_vmvir(adapter, vfinfo->pf_vlan, + ixgbe_set_vmvir(adapter, vfinfo[vf].pf_vlan, adapter->default_up, vf); - if (vfinfo->spoofchk_enabled) { + if (vfinfo[vf].spoofchk_enabled) { hw->mac.ops.set_vlan_anti_spoofing(hw, true, vf); hw->mac.ops.set_mac_anti_spoofing(hw, true, vf); } } /* reset multicast table array for vf */ - adapter->vfinfo[vf].num_vf_mc_hashes = 0; + vfinfo[vf].num_vf_mc_hashes = 0; /* clear any ipsec table info */ ixgbe_ipsec_vf_clear(adapter, vf); @@ -741,11 +777,11 @@ static inline void ixgbe_vf_reset_event(struct ixgbe_adapter *adapter, u32 vf) /* Flush and reset the mta with the new values */ ixgbe_set_rx_mode(adapter->netdev); - ixgbe_del_mac_filter(adapter, adapter->vfinfo[vf].vf_mac_addresses, vf); + ixgbe_del_mac_filter(adapter, vfinfo[vf].vf_mac_addresses, vf); ixgbe_set_vf_macvlan(adapter, vf, 0, NULL); /* reset VF api back to unknown */ - adapter->vfinfo[vf].vf_api = ixgbe_mbox_api_10; + vfinfo[vf].vf_api = ixgbe_mbox_api_10; /* Restart each queue for given VF */ for (queue = 0; queue < q_per_pool; queue++) { @@ -780,16 +816,25 @@ static void ixgbe_vf_clear_mbx(struct ixgbe_adapter *adapter, u32 vf) static int ixgbe_set_vf_mac(struct ixgbe_adapter *adapter, int vf, unsigned char *mac_addr) { + struct vf_data_storage *vfinfo; int retval; - ixgbe_del_mac_filter(adapter, adapter->vfinfo[vf].vf_mac_addresses, vf); + 
rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (!vfinfo) { + rcu_read_unlock(); + return -EINVAL; + } + + ixgbe_del_mac_filter(adapter, vfinfo[vf].vf_mac_addresses, vf); retval = ixgbe_add_mac_filter(adapter, mac_addr, vf); if (retval >= 0) - memcpy(adapter->vfinfo[vf].vf_mac_addresses, mac_addr, + memcpy(vfinfo[vf].vf_mac_addresses, mac_addr, ETH_ALEN); else - eth_zero_addr(adapter->vfinfo[vf].vf_mac_addresses); + eth_zero_addr(vfinfo[vf].vf_mac_addresses); + rcu_read_unlock(); return retval; } @@ -797,12 +842,17 @@ int ixgbe_vf_configuration(struct pci_dev *pdev, unsigned int event_mask) { struct ixgbe_adapter *adapter = pci_get_drvdata(pdev); unsigned int vfn = (event_mask & 0x3f); + struct vf_data_storage *vfinfo; bool enable = ((event_mask & 0x10000000U) != 0); - if (enable) - eth_zero_addr(adapter->vfinfo[vfn].vf_mac_addresses); - + if (enable) { + rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (vfinfo) + eth_zero_addr(vfinfo[vfn].vf_mac_addresses); + rcu_read_unlock(); + } return 0; } @@ -838,6 +888,7 @@ static void ixgbe_set_vf_rx_tx(struct ixgbe_adapter *adapter, int vf) { u32 reg_cur_tx, reg_cur_rx, reg_req_tx, reg_req_rx; struct ixgbe_hw *hw = &adapter->hw; + struct vf_data_storage *vfinfo; u32 reg_offset, vf_shift; vf_shift = vf % 32; @@ -846,7 +897,9 @@ static void ixgbe_set_vf_rx_tx(struct ixgbe_adapter *adapter, int vf) reg_cur_tx = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset)); reg_cur_rx = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset)); - if (adapter->vfinfo[vf].link_enable) { + lockdep_assert_in_rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (vfinfo && vfinfo[vf].link_enable) { reg_req_tx = reg_cur_tx | 1 << vf_shift; reg_req_rx = reg_cur_rx | 1 << vf_shift; } else { @@ -882,11 +935,12 @@ static int ixgbe_vf_reset_msg(struct ixgbe_adapter *adapter, u32 vf) { struct ixgbe_ring_feature *vmdq = &adapter->ring_feature[RING_F_VMDQ]; struct ixgbe_hw *hw = &adapter->hw; - unsigned char *vf_mac = 
adapter->vfinfo[vf].vf_mac_addresses; + struct vf_data_storage *vfinfo; u32 reg, reg_offset, vf_shift; u32 msgbuf[4] = {0, 0, 0, 0}; u8 *addr = (u8 *)(&msgbuf[1]); u32 q_per_pool = __ALIGN_MASK(1, ~vmdq->mask); + unsigned char *vf_mac; int i; e_info(probe, "VF Reset msg received from vf %d\n", vf); @@ -896,6 +950,13 @@ static int ixgbe_vf_reset_msg(struct ixgbe_adapter *adapter, u32 vf) ixgbe_vf_clear_mbx(adapter, vf); + lockdep_assert_in_rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (!vfinfo) + return 0; + + vf_mac = vfinfo[vf].vf_mac_addresses; + /* set vf mac address */ if (!is_zero_ether_addr(vf_mac)) ixgbe_set_vf_mac(adapter, vf, vf_mac); @@ -905,7 +966,7 @@ static int ixgbe_vf_reset_msg(struct ixgbe_adapter *adapter, u32 vf) /* force drop enable for all VF Rx queues */ reg = IXGBE_QDE_ENABLE; - if (adapter->vfinfo[vf].pf_vlan) + if (vfinfo[vf].pf_vlan) reg |= IXGBE_QDE_HIDE_VLAN; ixgbe_write_qde(adapter, vf, reg); @@ -913,7 +974,7 @@ static int ixgbe_vf_reset_msg(struct ixgbe_adapter *adapter, u32 vf) ixgbe_set_vf_rx_tx(adapter, vf); /* enable VF mailbox for further messages */ - adapter->vfinfo[vf].clear_to_send = true; + vfinfo[vf].clear_to_send = true; /* Enable counting of spoofed packets in the SSVPC register */ reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset)); @@ -931,7 +992,7 @@ static int ixgbe_vf_reset_msg(struct ixgbe_adapter *adapter, u32 vf) /* reply to reset with ack and vf mac address */ msgbuf[0] = IXGBE_VF_RESET; - if (!is_zero_ether_addr(vf_mac) && adapter->vfinfo[vf].pf_set_mac) { + if (!is_zero_ether_addr(vf_mac) && vfinfo[vf].pf_set_mac) { msgbuf[0] |= IXGBE_VT_MSGTYPE_ACK; memcpy(addr, vf_mac, ETH_ALEN); } else { @@ -952,14 +1013,20 @@ static int ixgbe_set_vf_mac_addr(struct ixgbe_adapter *adapter, u32 *msgbuf, u32 vf) { u8 *new_mac = ((u8 *)(&msgbuf[1])); + struct vf_data_storage *vfinfo; if (!is_valid_ether_addr(new_mac)) { e_warn(drv, "VF %d attempted to set invalid mac\n", vf); return -1; } - if 
(adapter->vfinfo[vf].pf_set_mac && !adapter->vfinfo[vf].trusted && - !ether_addr_equal(adapter->vfinfo[vf].vf_mac_addresses, new_mac)) { + lockdep_assert_in_rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (!vfinfo) + return 0; + + if (vfinfo[vf].pf_set_mac && !vfinfo[vf].trusted && + !ether_addr_equal(vfinfo[vf].vf_mac_addresses, new_mac)) { e_warn(drv, "VF %d attempted to override administratively set MAC address\n" "Reload the VF driver to resume operations\n", @@ -975,9 +1042,15 @@ static int ixgbe_set_vf_vlan_msg(struct ixgbe_adapter *adapter, { u32 add = FIELD_GET(IXGBE_VT_MSGINFO_MASK, msgbuf[0]); u32 vid = (msgbuf[1] & IXGBE_VLVF_VLANID_MASK); + struct vf_data_storage *vfinfo; u8 tcs = adapter->hw_tcs; - if (adapter->vfinfo[vf].pf_vlan || tcs) { + lockdep_assert_in_rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (!vfinfo) + return 0; + + if (vfinfo[vf].pf_vlan || tcs) { e_warn(drv, "VF %d attempted to override administratively set VLAN configuration\n" "Reload the VF driver to resume operations\n", @@ -997,9 +1070,15 @@ static int ixgbe_set_vf_macvlan_msg(struct ixgbe_adapter *adapter, { u8 *new_mac = ((u8 *)(&msgbuf[1])); int index = FIELD_GET(IXGBE_VT_MSGINFO_MASK, msgbuf[0]); + struct vf_data_storage *vfinfo; int err; - if (adapter->vfinfo[vf].pf_set_mac && !adapter->vfinfo[vf].trusted && + lockdep_assert_in_rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (!vfinfo) + return 0; + + if (vfinfo[vf].pf_set_mac && !vfinfo[vf].trusted && index > 0) { e_warn(drv, "VF %d requested MACVLAN filter but is administratively denied\n", @@ -1018,7 +1097,7 @@ static int ixgbe_set_vf_macvlan_msg(struct ixgbe_adapter *adapter, * If the VF is allowed to set MAC filters then turn off * anti-spoofing to avoid false positives. 
*/ - if (adapter->vfinfo[vf].spoofchk_enabled) { + if (vfinfo[vf].spoofchk_enabled) { struct ixgbe_hw *hw = &adapter->hw; hw->mac.ops.set_mac_anti_spoofing(hw, false, vf); @@ -1038,6 +1117,7 @@ static int ixgbe_set_vf_macvlan_msg(struct ixgbe_adapter *adapter, static int ixgbe_negotiate_vf_api(struct ixgbe_adapter *adapter, u32 *msgbuf, u32 vf) { + struct vf_data_storage *vfinfo; int api = msgbuf[1]; switch (api) { @@ -1048,7 +1128,10 @@ static int ixgbe_negotiate_vf_api(struct ixgbe_adapter *adapter, case ixgbe_mbox_api_14: case ixgbe_mbox_api_16: case ixgbe_mbox_api_17: - adapter->vfinfo[vf].vf_api = api; + lockdep_assert_in_rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (vfinfo) + vfinfo[vf].vf_api = api; return 0; default: break; @@ -1064,11 +1147,17 @@ static int ixgbe_get_vf_queues(struct ixgbe_adapter *adapter, { struct net_device *dev = adapter->netdev; struct ixgbe_ring_feature *vmdq = &adapter->ring_feature[RING_F_VMDQ]; + struct vf_data_storage *vfinfo; unsigned int default_tc = 0; u8 num_tcs = adapter->hw_tcs; + lockdep_assert_in_rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (!vfinfo) + return 0; + /* verify the PF is supporting the correct APIs */ - switch (adapter->vfinfo[vf].vf_api) { + switch (vfinfo[vf].vf_api) { case ixgbe_mbox_api_20: case ixgbe_mbox_api_11: case ixgbe_mbox_api_12: @@ -1092,7 +1181,7 @@ static int ixgbe_get_vf_queues(struct ixgbe_adapter *adapter, /* notify VF of need for VLAN tag stripping, and correct queue */ if (num_tcs) msgbuf[IXGBE_VF_TRANS_VLAN] = num_tcs; - else if (adapter->vfinfo[vf].pf_vlan || adapter->vfinfo[vf].pf_qos) + else if (vfinfo[vf].pf_vlan || vfinfo[vf].pf_qos) msgbuf[IXGBE_VF_TRANS_VLAN] = 1; else msgbuf[IXGBE_VF_TRANS_VLAN] = 0; @@ -1105,17 +1194,23 @@ static int ixgbe_get_vf_queues(struct ixgbe_adapter *adapter, static int ixgbe_get_vf_reta(struct ixgbe_adapter *adapter, u32 *msgbuf, u32 vf) { - u32 i, j; - u32 *out_buf = &msgbuf[1]; - const u8 *reta = 
adapter->rss_indir_tbl; u32 reta_size = ixgbe_rss_indir_tbl_entries(adapter); + const u8 *reta = adapter->rss_indir_tbl; + struct vf_data_storage *vfinfo; + u32 *out_buf = &msgbuf[1]; + u32 i, j; + + lockdep_assert_in_rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (!vfinfo) + return 0; /* Check if operation is permitted */ - if (!adapter->vfinfo[vf].rss_query_enabled) + if (!vfinfo[vf].rss_query_enabled) return -EPERM; /* verify the PF is supporting the correct API */ - switch (adapter->vfinfo[vf].vf_api) { + switch (vfinfo[vf].vf_api) { case ixgbe_mbox_api_17: case ixgbe_mbox_api_16: case ixgbe_mbox_api_14: @@ -1143,14 +1238,20 @@ static int ixgbe_get_vf_reta(struct ixgbe_adapter *adapter, u32 *msgbuf, u32 vf) static int ixgbe_get_vf_rss_key(struct ixgbe_adapter *adapter, u32 *msgbuf, u32 vf) { + struct vf_data_storage *vfinfo; u32 *rss_key = &msgbuf[1]; + lockdep_assert_in_rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (!vfinfo) + return 0; + /* Check if the operation is permitted */ - if (!adapter->vfinfo[vf].rss_query_enabled) + if (!vfinfo[vf].rss_query_enabled) return -EPERM; /* verify the PF is supporting the correct API */ - switch (adapter->vfinfo[vf].vf_api) { + switch (vfinfo[vf].vf_api) { case ixgbe_mbox_api_17: case ixgbe_mbox_api_16: case ixgbe_mbox_api_14: @@ -1170,11 +1271,17 @@ static int ixgbe_update_vf_xcast_mode(struct ixgbe_adapter *adapter, u32 *msgbuf, u32 vf) { struct ixgbe_hw *hw = &adapter->hw; + struct vf_data_storage *vfinfo; int xcast_mode = msgbuf[1]; u32 vmolr, fctrl, disable, enable; + lockdep_assert_in_rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (!vfinfo) + return 0; + /* verify the PF is supporting the correct APIs */ - switch (adapter->vfinfo[vf].vf_api) { + switch (vfinfo[vf].vf_api) { case ixgbe_mbox_api_12: /* promisc introduced in 1.3 version */ if (xcast_mode == IXGBEVF_XCAST_MODE_PROMISC) @@ -1190,11 +1297,11 @@ static int ixgbe_update_vf_xcast_mode(struct 
ixgbe_adapter *adapter, } if (xcast_mode > IXGBEVF_XCAST_MODE_MULTI && - !adapter->vfinfo[vf].trusted) { + !vfinfo[vf].trusted) { xcast_mode = IXGBEVF_XCAST_MODE_MULTI; } - if (adapter->vfinfo[vf].xcast_mode == xcast_mode) + if (vfinfo[vf].xcast_mode == xcast_mode) goto out; switch (xcast_mode) { @@ -1236,7 +1343,7 @@ static int ixgbe_update_vf_xcast_mode(struct ixgbe_adapter *adapter, vmolr |= enable; IXGBE_WRITE_REG(hw, IXGBE_VMOLR(vf), vmolr); - adapter->vfinfo[vf].xcast_mode = xcast_mode; + vfinfo[vf].xcast_mode = xcast_mode; out: msgbuf[1] = xcast_mode; @@ -1247,10 +1354,16 @@ static int ixgbe_update_vf_xcast_mode(struct ixgbe_adapter *adapter, static int ixgbe_get_vf_link_state(struct ixgbe_adapter *adapter, u32 *msgbuf, u32 vf) { + struct vf_data_storage *vfinfo; u32 *link_state = &msgbuf[1]; + lockdep_assert_in_rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (!vfinfo) + return 0; + /* verify the PF is supporting the correct API */ - switch (adapter->vfinfo[vf].vf_api) { + switch (vfinfo[vf].vf_api) { case ixgbe_mbox_api_12: case ixgbe_mbox_api_13: case ixgbe_mbox_api_14: @@ -1261,7 +1374,7 @@ static int ixgbe_get_vf_link_state(struct ixgbe_adapter *adapter, return -EOPNOTSUPP; } - *link_state = adapter->vfinfo[vf].link_enable; + *link_state = vfinfo[vf].link_enable; return 0; } @@ -1280,8 +1393,14 @@ static int ixgbe_send_vf_link_status(struct ixgbe_adapter *adapter, u32 *msgbuf, u32 vf) { struct ixgbe_hw *hw = &adapter->hw; + struct vf_data_storage *vfinfo; + + lockdep_assert_in_rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (!vfinfo) + return 0; - switch (adapter->vfinfo[vf].vf_api) { + switch (vfinfo[vf].vf_api) { case ixgbe_mbox_api_16: case ixgbe_mbox_api_17: if (hw->mac.type != ixgbe_mac_e610) @@ -1310,9 +1429,15 @@ static int ixgbe_send_vf_link_status(struct ixgbe_adapter *adapter, static int ixgbe_negotiate_vf_features(struct ixgbe_adapter *adapter, u32 *msgbuf, u32 vf) { + struct vf_data_storage *vfinfo; u32 
features = msgbuf[1]; - switch (adapter->vfinfo[vf].vf_api) { + lockdep_assert_in_rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (!vfinfo) + return 0; + + switch (vfinfo[vf].vf_api) { case ixgbe_mbox_api_17: break; default: @@ -1330,6 +1455,7 @@ static int ixgbe_rcv_msg_from_vf(struct ixgbe_adapter *adapter, u32 vf) u32 mbx_size = IXGBE_VFMAILBOX_SIZE; u32 msgbuf[IXGBE_VFMAILBOX_SIZE]; struct ixgbe_hw *hw = &adapter->hw; + struct vf_data_storage *vfinfo; int retval; retval = ixgbe_read_mbx(hw, msgbuf, mbx_size, vf); @@ -1349,11 +1475,16 @@ static int ixgbe_rcv_msg_from_vf(struct ixgbe_adapter *adapter, u32 vf) if (msgbuf[0] == IXGBE_VF_RESET) return ixgbe_vf_reset_msg(adapter, vf); + lockdep_assert_in_rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (!vfinfo) + return 0; + /* * until the vf completes a virtual function reset it should not be * allowed to start any configuration. */ - if (!adapter->vfinfo[vf].clear_to_send) { + if (!vfinfo[vf].clear_to_send) { msgbuf[0] |= IXGBE_VT_MSGTYPE_NACK; ixgbe_write_mbx(hw, msgbuf, 1, vf); return 0; @@ -1426,11 +1557,12 @@ static int ixgbe_rcv_msg_from_vf(struct ixgbe_adapter *adapter, u32 vf) static void ixgbe_rcv_ack_from_vf(struct ixgbe_adapter *adapter, u32 vf) { + struct vf_data_storage *vfinfo = rcu_dereference(adapter->vfinfo); struct ixgbe_hw *hw = &adapter->hw; u32 msg = IXGBE_VT_MSGTYPE_NACK; /* if device isn't clear to send it shouldn't be reading either */ - if (!adapter->vfinfo[vf].clear_to_send) + if (vfinfo && !vfinfo[vf].clear_to_send) ixgbe_write_mbx(hw, &msg, 1, vf); } @@ -1462,15 +1594,21 @@ bool ixgbe_check_mdd_event(struct ixgbe_adapter *adapter) IXGBE_READ_REG(hw, IXGBE_LVMMC_RX)); if (hw->mac.ops.restore_mdd_vf) { + struct vf_data_storage *vfinfo; u32 ping; hw->mac.ops.restore_mdd_vf(hw, i); /* get the VF to rebuild its queues */ - adapter->vfinfo[i].clear_to_send = 0; - ping = IXGBE_PF_CONTROL_MSG | - IXGBE_VT_MSGTYPE_CTS; - ixgbe_write_mbx(hw, &ping, 1, i); + 
rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (vfinfo) { + vfinfo[i].clear_to_send = false; + ping = IXGBE_PF_CONTROL_MSG | + IXGBE_VT_MSGTYPE_CTS; + ixgbe_write_mbx(hw, &ping, 1, i); + } + rcu_read_unlock(); } ret = true; @@ -1482,12 +1620,11 @@ bool ixgbe_check_mdd_event(struct ixgbe_adapter *adapter) void ixgbe_msg_task(struct ixgbe_adapter *adapter) { struct ixgbe_hw *hw = &adapter->hw; - unsigned long flags; u32 vf; ixgbe_check_mdd_event(adapter); - spin_lock_irqsave(&adapter->vfs_lock, flags); + rcu_read_lock(); for (vf = 0; vf < adapter->num_vfs; vf++) { /* process any reset requests */ if (!ixgbe_check_for_rst(hw, vf)) @@ -1501,7 +1638,7 @@ void ixgbe_msg_task(struct ixgbe_adapter *adapter) if (!ixgbe_check_for_ack(hw, vf)) ixgbe_rcv_ack_from_vf(adapter, vf); } - spin_unlock_irqrestore(&adapter->vfs_lock, flags); + rcu_read_unlock(); } static inline void ixgbe_ping_vf(struct ixgbe_adapter *adapter, int vf) @@ -1510,23 +1647,26 @@ static inline void ixgbe_ping_vf(struct ixgbe_adapter *adapter, int vf) u32 ping; ping = IXGBE_PF_CONTROL_MSG; - if (adapter->vfinfo[vf].clear_to_send) - ping |= IXGBE_VT_MSGTYPE_CTS; ixgbe_write_mbx(hw, &ping, 1, vf); } void ixgbe_ping_all_vfs(struct ixgbe_adapter *adapter) { struct ixgbe_hw *hw = &adapter->hw; + struct vf_data_storage *vfinfo; u32 ping; int i; - for (i = 0 ; i < adapter->num_vfs; i++) { - ping = IXGBE_PF_CONTROL_MSG; - if (adapter->vfinfo[i].clear_to_send) - ping |= IXGBE_VT_MSGTYPE_CTS; - ixgbe_write_mbx(hw, &ping, 1, i); - } + rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (vfinfo) + for (i = 0 ; i < adapter->num_vfs; i++) { + ping = IXGBE_PF_CONTROL_MSG; + if (vfinfo[i].clear_to_send) + ping |= IXGBE_VT_MSGTYPE_CTS; + ixgbe_write_mbx(hw, &ping, 1, i); + } + rcu_read_unlock(); } /** @@ -1537,21 +1677,34 @@ void ixgbe_ping_all_vfs(struct ixgbe_adapter *adapter) **/ void ixgbe_set_all_vfs(struct ixgbe_adapter *adapter) { + struct vf_data_storage *vfinfo; int i; - for (i = 
0 ; i < adapter->num_vfs; i++) - ixgbe_set_vf_link_state(adapter, i, - adapter->vfinfo[i].link_state); + rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (vfinfo) + for (i = 0 ; i < adapter->num_vfs; i++) + ixgbe_set_vf_link_state(adapter, i, + vfinfo[i].link_state); + rcu_read_unlock(); } int ixgbe_ndo_set_vf_mac(struct net_device *netdev, int vf, u8 *mac) { struct ixgbe_adapter *adapter = ixgbe_from_netdev(netdev); + struct vf_data_storage *vfinfo; int retval; if (vf >= adapter->num_vfs) return -EINVAL; + rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (!vfinfo) { + rcu_read_unlock(); + return 0; + } + if (is_valid_ether_addr(mac)) { dev_info(&adapter->pdev->dev, "setting MAC %pM on VF %d\n", mac, vf); @@ -1559,7 +1712,7 @@ int ixgbe_ndo_set_vf_mac(struct net_device *netdev, int vf, u8 *mac) retval = ixgbe_set_vf_mac(adapter, vf, mac); if (retval >= 0) { - adapter->vfinfo[vf].pf_set_mac = true; + vfinfo[vf].pf_set_mac = true; if (test_bit(__IXGBE_DOWN, &adapter->state)) { dev_warn(&adapter->pdev->dev, "The VF MAC address has been set, but the PF device is not up.\n"); @@ -1569,18 +1722,19 @@ int ixgbe_ndo_set_vf_mac(struct net_device *netdev, int vf, u8 *mac) dev_warn(&adapter->pdev->dev, "The VF MAC address was NOT set due to invalid or duplicate MAC address.\n"); } } else if (is_zero_ether_addr(mac)) { - unsigned char *vf_mac_addr = - adapter->vfinfo[vf].vf_mac_addresses; + unsigned char *vf_mac_addr = vfinfo[vf].vf_mac_addresses; /* nothing to do */ - if (is_zero_ether_addr(vf_mac_addr)) + if (is_zero_ether_addr(vf_mac_addr)) { + rcu_read_unlock(); return 0; + } dev_info(&adapter->pdev->dev, "removing MAC on VF %d\n", vf); retval = ixgbe_del_mac_filter(adapter, vf_mac_addr, vf); if (retval >= 0) { - adapter->vfinfo[vf].pf_set_mac = false; + vfinfo[vf].pf_set_mac = false; memcpy(vf_mac_addr, mac, ETH_ALEN); } else { dev_warn(&adapter->pdev->dev, "Could NOT remove the VF MAC address.\n"); @@ -1589,10 +1743,12 @@ int 
ixgbe_ndo_set_vf_mac(struct net_device *netdev, int vf, u8 *mac) retval = -EINVAL; } + rcu_read_unlock(); return retval; } static int ixgbe_enable_port_vlan(struct ixgbe_adapter *adapter, int vf, + struct vf_data_storage *vfinfo, u16 vlan, u8 qos) { struct ixgbe_hw *hw = &adapter->hw; @@ -1613,8 +1769,8 @@ static int ixgbe_enable_port_vlan(struct ixgbe_adapter *adapter, int vf, ixgbe_write_qde(adapter, vf, IXGBE_QDE_ENABLE | IXGBE_QDE_HIDE_VLAN); - adapter->vfinfo[vf].pf_vlan = vlan; - adapter->vfinfo[vf].pf_qos = qos; + vfinfo[vf].pf_vlan = vlan; + vfinfo[vf].pf_qos = qos; dev_info(&adapter->pdev->dev, "Setting VLAN %d, QOS 0x%x on VF %d\n", vlan, qos, vf); if (test_bit(__IXGBE_DOWN, &adapter->state)) { @@ -1628,13 +1784,14 @@ static int ixgbe_enable_port_vlan(struct ixgbe_adapter *adapter, int vf, return err; } -static int ixgbe_disable_port_vlan(struct ixgbe_adapter *adapter, int vf) +static int ixgbe_disable_port_vlan(struct ixgbe_adapter *adapter, int vf, + struct vf_data_storage *vfinfo) { struct ixgbe_hw *hw = &adapter->hw; int err; err = ixgbe_set_vf_vlan(adapter, false, - adapter->vfinfo[vf].pf_vlan, vf); + vfinfo[vf].pf_vlan, vf); /* Restore tagless access via VLAN 0 */ ixgbe_set_vf_vlan(adapter, true, 0, vf); ixgbe_clear_vmvir(adapter, vf); @@ -1644,8 +1801,8 @@ static int ixgbe_disable_port_vlan(struct ixgbe_adapter *adapter, int vf) if (hw->mac.type >= ixgbe_mac_X550) ixgbe_write_qde(adapter, vf, IXGBE_QDE_ENABLE); - adapter->vfinfo[vf].pf_vlan = 0; - adapter->vfinfo[vf].pf_qos = 0; + vfinfo[vf].pf_vlan = 0; + vfinfo[vf].pf_qos = 0; return err; } @@ -1653,13 +1810,20 @@ static int ixgbe_disable_port_vlan(struct ixgbe_adapter *adapter, int vf) int ixgbe_ndo_set_vf_vlan(struct net_device *netdev, int vf, u16 vlan, u8 qos, __be16 vlan_proto) { - int err = 0; struct ixgbe_adapter *adapter = ixgbe_from_netdev(netdev); + struct vf_data_storage *vfinfo; + int err = 0; if ((vf >= adapter->num_vfs) || (vlan > 4095) || (qos > 7)) return -EINVAL; if (vlan_proto 
!= htons(ETH_P_8021Q)) return -EPROTONOSUPPORT; + + rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (!vfinfo) + goto out; + if (vlan || qos) { /* Check if there is already a port VLAN set, if so * we have to delete the old one first before we @@ -1668,16 +1832,17 @@ int ixgbe_ndo_set_vf_vlan(struct net_device *netdev, int vf, u16 vlan, * old port VLAN before setting a new one but this * is not necessarily the case. */ - if (adapter->vfinfo[vf].pf_vlan) - err = ixgbe_disable_port_vlan(adapter, vf); + if (vfinfo[vf].pf_vlan) + err = ixgbe_disable_port_vlan(adapter, vf, vfinfo); if (err) goto out; - err = ixgbe_enable_port_vlan(adapter, vf, vlan, qos); + err = ixgbe_enable_port_vlan(adapter, vf, vfinfo, vlan, qos); } else { - err = ixgbe_disable_port_vlan(adapter, vf); + err = ixgbe_disable_port_vlan(adapter, vf, vfinfo); } out: + rcu_read_unlock(); return err; } @@ -1695,13 +1860,13 @@ int ixgbe_link_mbps(struct ixgbe_adapter *adapter) } } -static void ixgbe_set_vf_rate_limit(struct ixgbe_adapter *adapter, int vf) +static void ixgbe_set_vf_rate_limit(struct ixgbe_adapter *adapter, int vf, + u16 tx_rate) { struct ixgbe_ring_feature *vmdq = &adapter->ring_feature[RING_F_VMDQ]; struct ixgbe_hw *hw = &adapter->hw; u32 bcnrc_val = 0; u16 queue, queues_per_pool; - u16 tx_rate = adapter->vfinfo[vf].tx_rate; if (tx_rate) { /* start with base link speed value */ @@ -1749,6 +1914,7 @@ static void ixgbe_set_vf_rate_limit(struct ixgbe_adapter *adapter, int vf) void ixgbe_check_vf_rate_limit(struct ixgbe_adapter *adapter) { + struct vf_data_storage *vfinfo; int i; /* VF Tx rate limit was not set */ @@ -1761,18 +1927,23 @@ void ixgbe_check_vf_rate_limit(struct ixgbe_adapter *adapter) "Link speed has been changed. 
VF Transmit rate is disabled\n"); } - for (i = 0; i < adapter->num_vfs; i++) { - if (!adapter->vf_rate_link_speed) - adapter->vfinfo[i].tx_rate = 0; + rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (vfinfo) + for (i = 0; i < adapter->num_vfs; i++) { + if (!adapter->vf_rate_link_speed) + vfinfo[i].tx_rate = 0; - ixgbe_set_vf_rate_limit(adapter, i); - } + ixgbe_set_vf_rate_limit(adapter, i, vfinfo[i].tx_rate); + } + rcu_read_unlock(); } int ixgbe_ndo_set_vf_bw(struct net_device *netdev, int vf, int min_tx_rate, int max_tx_rate) { struct ixgbe_adapter *adapter = ixgbe_from_netdev(netdev); + struct vf_data_storage *vfinfo; int link_speed; /* verify VF is active */ @@ -1795,12 +1966,17 @@ int ixgbe_ndo_set_vf_bw(struct net_device *netdev, int vf, int min_tx_rate, if (max_tx_rate && ((max_tx_rate <= 10) || (max_tx_rate > link_speed))) return -EINVAL; - /* store values */ - adapter->vf_rate_link_speed = link_speed; - adapter->vfinfo[vf].tx_rate = max_tx_rate; + rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (vfinfo) { + /* store values */ + adapter->vf_rate_link_speed = link_speed; + vfinfo[vf].tx_rate = max_tx_rate; - /* update hardware configuration */ - ixgbe_set_vf_rate_limit(adapter, vf); + /* update hardware configuration */ + ixgbe_set_vf_rate_limit(adapter, vf, vfinfo[vf].tx_rate); + } + rcu_read_unlock(); return 0; } @@ -1809,11 +1985,18 @@ int ixgbe_ndo_set_vf_spoofchk(struct net_device *netdev, int vf, bool setting) { struct ixgbe_adapter *adapter = ixgbe_from_netdev(netdev); struct ixgbe_hw *hw = &adapter->hw; + struct vf_data_storage *vfinfo; if (vf >= adapter->num_vfs) return -EINVAL; - adapter->vfinfo[vf].spoofchk_enabled = setting; + rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (vfinfo) + vfinfo[vf].spoofchk_enabled = setting; + rcu_read_unlock(); + if (!vfinfo) + return 0; /* configure MAC spoofing */ hw->mac.ops.set_mac_anti_spoofing(hw, setting, vf); @@ -1851,28 +2034,37 @@ int 
ixgbe_ndo_set_vf_spoofchk(struct net_device *netdev, int vf, bool setting) **/ void ixgbe_set_vf_link_state(struct ixgbe_adapter *adapter, int vf, int state) { - adapter->vfinfo[vf].link_state = state; + struct vf_data_storage *vfinfo; + + rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (!vfinfo) { + rcu_read_unlock(); + return; + } + vfinfo[vf].link_state = state; switch (state) { case IFLA_VF_LINK_STATE_AUTO: if (test_bit(__IXGBE_DOWN, &adapter->state)) - adapter->vfinfo[vf].link_enable = false; + vfinfo[vf].link_enable = false; else - adapter->vfinfo[vf].link_enable = true; + vfinfo[vf].link_enable = true; break; case IFLA_VF_LINK_STATE_ENABLE: - adapter->vfinfo[vf].link_enable = true; + vfinfo[vf].link_enable = true; break; case IFLA_VF_LINK_STATE_DISABLE: - adapter->vfinfo[vf].link_enable = false; + vfinfo[vf].link_enable = false; break; } ixgbe_set_vf_rx_tx(adapter, vf); /* restart the VF */ - adapter->vfinfo[vf].clear_to_send = false; + vfinfo[vf].clear_to_send = false; ixgbe_ping_vf(adapter, vf); + rcu_read_unlock(); } /** @@ -1923,6 +2115,7 @@ int ixgbe_ndo_set_vf_rss_query_en(struct net_device *netdev, int vf, bool setting) { struct ixgbe_adapter *adapter = ixgbe_from_netdev(netdev); + struct vf_data_storage *vfinfo; /* This operation is currently supported only for 82599 and x540 * devices. 
@@ -1934,7 +2127,11 @@ int ixgbe_ndo_set_vf_rss_query_en(struct net_device *netdev, int vf, if (vf >= adapter->num_vfs) return -EINVAL; - adapter->vfinfo[vf].rss_query_enabled = setting; + rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (vfinfo) + vfinfo[vf].rss_query_enabled = setting; + rcu_read_unlock(); return 0; } @@ -1942,18 +2139,31 @@ int ixgbe_ndo_set_vf_rss_query_en(struct net_device *netdev, int vf, int ixgbe_ndo_set_vf_trust(struct net_device *netdev, int vf, bool setting) { struct ixgbe_adapter *adapter = ixgbe_from_netdev(netdev); + struct vf_data_storage *vfinfo; if (vf >= adapter->num_vfs) return -EINVAL; + rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (!vfinfo) { + rcu_read_unlock(); + return 0; + } + /* nothing to do */ - if (adapter->vfinfo[vf].trusted == setting) + if (vfinfo[vf].trusted == setting) { + rcu_read_unlock(); return 0; + } - adapter->vfinfo[vf].trusted = setting; + vfinfo[vf].trusted = setting; /* reset VF to reconfigure features */ - adapter->vfinfo[vf].clear_to_send = false; + vfinfo[vf].clear_to_send = false; + + rcu_read_unlock(); + ixgbe_ping_vf(adapter, vf); e_info(drv, "VF %u is %strusted\n", vf, setting ? 
"" : "not "); @@ -1965,17 +2175,30 @@ int ixgbe_ndo_get_vf_config(struct net_device *netdev, int vf, struct ifla_vf_info *ivi) { struct ixgbe_adapter *adapter = ixgbe_from_netdev(netdev); + struct vf_data_storage *vfinfo; + if (vf >= adapter->num_vfs) return -EINVAL; ivi->vf = vf; - memcpy(&ivi->mac, adapter->vfinfo[vf].vf_mac_addresses, ETH_ALEN); - ivi->max_tx_rate = adapter->vfinfo[vf].tx_rate; + + rcu_read_lock(); + vfinfo = rcu_dereference(adapter->vfinfo); + if (!vfinfo) { + rcu_read_unlock(); + return -EINVAL; + } + + memcpy(&ivi->mac, vfinfo[vf].vf_mac_addresses, ETH_ALEN); + ivi->max_tx_rate = vfinfo[vf].tx_rate; ivi->min_tx_rate = 0; - ivi->vlan = adapter->vfinfo[vf].pf_vlan; - ivi->qos = adapter->vfinfo[vf].pf_qos; - ivi->spoofchk = adapter->vfinfo[vf].spoofchk_enabled; - ivi->rss_query_en = adapter->vfinfo[vf].rss_query_enabled; - ivi->trusted = adapter->vfinfo[vf].trusted; - ivi->linkstate = adapter->vfinfo[vf].link_state; + ivi->vlan = vfinfo[vf].pf_vlan; + ivi->qos = vfinfo[vf].pf_qos; + ivi->spoofchk = vfinfo[vf].spoofchk_enabled; + ivi->rss_query_en = vfinfo[vf].rss_query_enabled; + ivi->trusted = vfinfo[vf].trusted; + ivi->linkstate = vfinfo[vf].link_state; + + rcu_read_unlock(); + return 0; } -- 2.47.1 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH net 8/8] ixgbe: only access vfinfo and mv_list under RCU lock
2026-05-20 18:34 ` [PATCH net 8/8] ixgbe: only access vfinfo and mv_list under RCU lock Tony Nguyen
@ 2026-05-23 0:16 ` Jakub Kicinski
2026-05-23 0:16 ` Jakub Kicinski
1 sibling, 0 replies; 21+ messages in thread
From: Jakub Kicinski @ 2026-05-23 0:16 UTC (permalink / raw)
To: anthony.l.nguyen
Cc: Jakub Kicinski, davem, pabeni, edumazet, andrew+netdev, netdev, vinschen, alexander.nowlin

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
ixgbe: only access vfinfo and mv_list under RCU lock

This change drops the vfs_lock spinlock in favour of RCU protection for
adapter->vfinfo and adapter->mv_list, marking them __rcu and converting
readers to rcu_dereference()/rcu_read_lock() and writers to
rcu_replace_pointer()/kfree_rcu(). The goal is to address a crash
observed when ixgbe_disable_sriov() races with ixgbe_service_task() /
ixgbe_msg_task() readers.

> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> index 9b8217523fd2..8849b9f42bf6 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> @@ -808,10 +810,10 @@ struct ixgbe_adapter {
>  	/* SR-IOV */
>  	DECLARE_BITMAP(active_vfs, IXGBE_MAX_VF_FUNCTIONS);
>  	unsigned int num_vfs;
> -	struct vf_data_storage *vfinfo;
> +	struct vf_data_storage __rcu *vfinfo;
>  	int vf_rate_link_speed;
>  	struct vf_macvlans vf_mvs;
> -	struct vf_macvlans *mv_list;
> +	struct vf_macvlans __rcu *mv_list;

[Medium]
After this conversion, is ixgbe_handle_mdd_event() in
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c still consistent with the
new annotations?
It contains:

    if (adapter->vfinfo && ixgbe_check_mdd_event(adapter)) {

That bare load of an __rcu pointer is not wrapped in rcu_read_lock() and
does not use rcu_access_pointer() or rcu_dereference(), so sparse will
flag it and the access bypasses the RCU discipline used elsewhere in the
file. The commit message says RCU is used "throughout the driver"; was
this site simply missed during the conversion?

[ ... ]

> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 2646ee6f295f..d82c7dfc6580 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c

[ ... ]

> @@ -8267,22 +8293,27 @@ static void ixgbe_watchdog_flush_tx(struct ixgbe_adapter *adapter)
> @@ -8309,9 +8340,15 @@ static void ixgbe_check_for_bad_vf(struct ixgbe_adapter *adapter)
>
>  	/* check status reg for all VFs owned by this PF */
>  	for (vf = 0; vf < adapter->num_vfs; ++vf) {
> -		struct pci_dev *vfdev = adapter->vfinfo[vf].vfdev;
> +		struct vf_data_storage *vfinfo;
> +		struct pci_dev *vfdev = NULL;
>  		u16 status_reg;
>
> +		rcu_read_lock();
> +		vfinfo = rcu_dereference(adapter->vfinfo);
> +		if (vfinfo)
> +			vfdev = vfinfo[vf].vfdev;
> +		rcu_read_unlock();
>  		if (!vfdev)
>  			continue;
>  		pci_read_config_word(vfdev, PCI_STATUS, &status_reg);

[High]
Can vfdev still be used safely after rcu_read_unlock() here?

The RCU section only protects the lifetime of the vfinfo array, not the
lifetime of the pci_dev whose pointer was stored in it.
ixgbe_disable_sriov() drops the reference taken by ixgbe_get_vfs() with
pci_dev_put(vfdev) and then proceeds to pci_disable_sriov(), either of
which can free the pci_dev concurrently with the subsequent
pci_read_config_word() and the pcie_flr(vfdev) reached through
ixgbe_bad_vf_abort().

Should this path take pci_dev_get(vfdev) before dropping the
rcu_read_unlock(), and pci_dev_put() after the config-space access and
FLR are complete?
This appears to predate this patch (the prior code dereferenced vfdev
outside any lifetime-extending section as well), but the RCU conversion
seems like a natural opportunity to address it.

[ ... ]

> @@ -10071,20 +10114,26 @@ static int handle_redirect_action(struct ixgbe_adapter *adapter, int ifindex,
>  {
>  	struct ixgbe_ring_feature *vmdq = &adapter->ring_feature[RING_F_VMDQ];
>  	unsigned int num_vfs = adapter->num_vfs, vf;
> +	struct vf_data_storage *vfinfo;
>  	struct netdev_nested_priv priv;
>  	struct upper_walk_data data;
>  	struct net_device *upper;
>
>  	/* redirect to a SRIOV VF */
> -	for (vf = 0; vf < num_vfs; ++vf) {
> -		upper = pci_get_drvdata(adapter->vfinfo[vf].vfdev);
> -		if (upper->ifindex == ifindex) {
> -			*queue = vf * __ALIGN_MASK(1, ~vmdq->mask);
> -			*action = vf + 1;
> -			*action <<= ETHTOOL_RX_FLOW_SPEC_RING_VF_OFF;
> -			return 0;
> +	rcu_read_lock();
> +	vfinfo = rcu_dereference(adapter->vfinfo);
> +	if (vfinfo)
> +		for (vf = 0; vf < num_vfs; ++vf) {
> +			upper = pci_get_drvdata(vfinfo[vf].vfdev);

[High]
Can the captured num_vfs and the freshly-dereferenced vfinfo disagree in
size here?

num_vfs is sampled from adapter->num_vfs before rcu_read_lock(), while
vfinfo is observed inside the RCU section. ixgbe_disable_sriov() runs
under PCI device_lock (not rtnl), sets adapter->num_vfs = 0, and
rcu_replace_pointer()s vfinfo. A subsequent __ixgbe_enable_sriov(N2) can
publish a smaller vfinfo array of size N2 < num_vfs before the loop
runs. The access vfinfo[vf] for vf >= N2 is then a heap out-of-bounds
read.

Should num_vfs be sampled inside the rcu_read_lock() and bounded by the
actual size of the published vfinfo, or should adapter->num_vfs be
updated as part of the same rcu_assign_pointer publication so readers
can rely on a consistent pair?

[ ... ]

> @@ -9744,15 +9781,21 @@ static int ixgbe_ndo_get_vf_stats(struct net_device *netdev, int vf,
>  				  struct ifla_vf_stats *vf_stats)
>  {
>  	struct ixgbe_adapter *adapter = ixgbe_from_netdev(netdev);
> +	struct vf_data_storage *vfinfo;
>
>  	if (vf < 0 || vf >= adapter->num_vfs)
>  		return -EINVAL;
>
> -	vf_stats->rx_packets = adapter->vfinfo[vf].vfstats.gprc;
> -	vf_stats->rx_bytes = adapter->vfinfo[vf].vfstats.gorc;
> -	vf_stats->tx_packets = adapter->vfinfo[vf].vfstats.gptc;
> -	vf_stats->tx_bytes = adapter->vfinfo[vf].vfstats.gotc;
> -	vf_stats->multicast = adapter->vfinfo[vf].vfstats.mprc;
> +	rcu_read_lock();
> +	vfinfo = rcu_dereference(adapter->vfinfo);
> +	if (vfinfo) {
> +		vf_stats->rx_packets = vfinfo[vf].vfstats.gprc;
> +		vf_stats->rx_bytes = vfinfo[vf].vfstats.gorc;
> +		vf_stats->tx_packets = vfinfo[vf].vfstats.gptc;
> +		vf_stats->tx_bytes = vfinfo[vf].vfstats.gotc;
> +		vf_stats->multicast = vfinfo[vf].vfstats.mprc;
> +	}
> +	rcu_read_unlock();

[High]
This pattern recurs in many of the rtnl ndo handlers (set_vf_mac,
set_vf_spoofchk, set_vf_rss_query_en, set_vf_trust, set_vf_vlan,
set_vf_bw, get_vf_config, get_vf_stats): the bound check

    if (vf >= adapter->num_vfs)
        return -EINVAL;

is performed before rcu_read_lock(). Between that check and the later
rcu_dereference, ixgbe_disable_sriov() (under PCI device_lock, not rtnl)
can set num_vfs = 0 and __ixgbe_enable_sriov() can republish vfinfo at a
smaller size N2.

For ixgbe_ndo_get_vf_config() the resulting OOB read is then memcpy'd
into struct ifla_vf_info and returned to userspace via netlink, which
would expose adjacent kernel heap. For the setter ndos and the
mailbox-driven setters, this is an out-of-bounds write.

Should the vf range check be repeated against the size of the freshly
observed vfinfo, ideally with num_vfs and vfinfo published together so
the pair is self-consistent?
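
The "published together" idea in the question above can be pictured with
a small userspace model. This is only a sketch under stated assumptions,
not the driver's code: struct vf_table, vf_table_publish() and
vf_get_tx_rate() are hypothetical names, C11 atomics stand in for the
kernel's rcu_assign_pointer()/rcu_dereference(), and reclaiming the old
table after a grace period (kfree_rcu() in the kernel) is left to the
caller.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct vf_data_storage. */
struct vf_entry {
	int tx_rate;
};

/*
 * Hypothetical container: the element count and the array share one
 * allocation, so a reader that loads the pointer once always sees a
 * count that matches the array it is about to index.
 */
struct vf_table {
	unsigned int num_vfs;
	struct vf_entry vf[];
};

/* The single published pointer; NULL means "SR-IOV disabled". */
static _Atomic(struct vf_table *) published_table;

/* Allocate a zeroed table with room for num_vfs entries. */
static struct vf_table *vf_table_alloc(unsigned int num_vfs)
{
	struct vf_table *t = calloc(1, sizeof(*t) + num_vfs * sizeof(t->vf[0]));

	if (t)
		t->num_vfs = num_vfs;
	return t;
}

/*
 * Writer side: swap in a new table (or NULL) and hand back the old one.
 * This model does not wait for readers; the caller decides when the
 * returned table may be freed.
 */
static struct vf_table *vf_table_publish(struct vf_table *new_table)
{
	return atomic_exchange_explicit(&published_table, new_table,
					memory_order_acq_rel);
}

/*
 * Reader side: load the pointer once, then bound-check against the
 * count stored inside that same object. Returns 0 on success, -1 if
 * no table is published or vf is out of range.
 */
static int vf_get_tx_rate(unsigned int vf, int *tx_rate)
{
	struct vf_table *t = atomic_load_explicit(&published_table,
						  memory_order_acquire);

	if (!t || vf >= t->num_vfs)
		return -1;

	*tx_rate = t->vf[vf].tx_rate;
	return 0;
}
```

Because the bound lives in the same object as the array, a reader cannot
pair a stale, larger count with a newer, shorter array, which is the
mismatch the question above describes.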

> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> index 431d77da15a5..80f22a8e7af4 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c

[ ... ]

> @@ -206,31 +214,28 @@ int ixgbe_disable_sriov(struct ixgbe_adapter *adapter)
>  {
>  	unsigned int num_vfs = adapter->num_vfs, vf;
>  	struct ixgbe_hw *hw = &adapter->hw;
> -	unsigned long flags;
> +	struct vf_data_storage *vfinfo;
> +	struct vf_macvlans *mv_list;
>  	int rss;
>
> -	spin_lock_irqsave(&adapter->vfs_lock, flags);
> -	/* set num VFs to 0 to prevent access to vfinfo */
> +	/* set num VFs to 0 so readers bail out early */
>  	adapter->num_vfs = 0;
> -	spin_unlock_irqrestore(&adapter->vfs_lock, flags);
> +
> +	vfinfo = rcu_replace_pointer(adapter->vfinfo, NULL, 1);
> +	mv_list = rcu_replace_pointer(adapter->mv_list, NULL, 1);

[Low]
The lockdep predicate passed to rcu_replace_pointer() is the literal 1,
which makes the writer-side assertion always succeed and provides no
verification. The patch simultaneously removes vfs_lock and shifts to an
implicit (undocumented) reliance on some caller serialization for the
writer.

Could lockdep_is_held() of the actual writer-side lock (for example the
PCI device_lock or rtnl_lock, whichever is intended) be used here so a
future regression that calls ixgbe_disable_sriov() outside that context
is caught by lockdep?

[ ... ]

> @@ -643,10 +668,16 @@ static void ixgbe_clear_vf_vlans(struct ixgbe_adapter *adapter, u32 vf)
>  static int ixgbe_set_vf_macvlan(struct ixgbe_adapter *adapter,
>  				int vf, int index, unsigned char *mac_addr)
>  {
> -	struct vf_macvlans *entry;
> +	struct vf_macvlans *mv_list, *entry;
>  	bool found = false;
>  	int retval = 0;
>
> +	lockdep_assert_in_rcu_read_lock();
> +	/* vf_mvs entries point into the mv_list array */
> +	mv_list = rcu_dereference(adapter->mv_list);
> +	if (!mv_list)
> +		return 0;
> +
>  	if (index <= 1) {
>  		list_for_each_entry(entry, &adapter->vf_mvs.l, l) {

[High]
Is non-RCU list iteration safe here while only rcu_read_lock() is held?

The list head &adapter->vf_mvs.l is mutated by the writer side without
RCU-aware primitives. ixgbe_alloc_vf_macvlans() does:

    INIT_LIST_HEAD(&adapter->vf_mvs.l);
    ...
    list_add(&mv_list[i].l, &adapter->vf_mvs.l);

and ixgbe_disable_sriov() kfree_rcu()s mv_list but never reinitialises
&adapter->vf_mvs.l, so on a subsequent re-enable the writer concurrently
zeros and re-links the head while a reader walks it.

Should list_for_each_entry() be list_for_each_entry_rcu(), and the
writer use INIT_LIST_HEAD_RCU() / list_add_rcu()? Otherwise the reader
can observe torn next/prev loads or follow into freed/re-used entries.

[ ... ]

> @@ -780,16 +816,25 @@ static void ixgbe_vf_clear_mbx(struct ixgbe_adapter *adapter, u32 vf)
>  static int ixgbe_set_vf_mac(struct ixgbe_adapter *adapter,
>  			    int vf, unsigned char *mac_addr)
>  {
> +	struct vf_data_storage *vfinfo;
>  	int retval;
>
> -	ixgbe_del_mac_filter(adapter, adapter->vfinfo[vf].vf_mac_addresses, vf);
> +	rcu_read_lock();
> +	vfinfo = rcu_dereference(adapter->vfinfo);
> +	if (!vfinfo) {
> +		rcu_read_unlock();
> +		return -EINVAL;
> +	}
> +
> +	ixgbe_del_mac_filter(adapter, vfinfo[vf].vf_mac_addresses, vf);
>  	retval = ixgbe_add_mac_filter(adapter, mac_addr, vf);
>  	if (retval >= 0)
> -		memcpy(adapter->vfinfo[vf].vf_mac_addresses, mac_addr,
> +		memcpy(vfinfo[vf].vf_mac_addresses, mac_addr,
>  		       ETH_ALEN);

[ ... ]

> @@ -1537,21 +1677,34 @@ void ixgbe_ping_all_vfs(struct ixgbe_adapter *adapter)
>  int ixgbe_ndo_set_vf_mac(struct net_device *netdev, int vf, u8 *mac)
>  {
>  	struct ixgbe_adapter *adapter = ixgbe_from_netdev(netdev);
> +	struct vf_data_storage *vfinfo;
>  	int retval;
>
>  	if (vf >= adapter->num_vfs)
>  		return -EINVAL;
>
> +	rcu_read_lock();
> +	vfinfo = rcu_dereference(adapter->vfinfo);
> +	if (!vfinfo) {
> +		rcu_read_unlock();
> +		return 0;
> +	}
> +
>  	if (is_valid_ether_addr(mac)) {
>  		dev_info(&adapter->pdev->dev, "setting MAC %pM on VF %d\n",
>  			 mac, vf);
>  		dev_info(&adapter->pdev->dev, "Reload the VF driver to make this change effective.");
>
>  		retval = ixgbe_set_vf_mac(adapter, vf, mac);
>  		if (retval >= 0) {
> -			adapter->vfinfo[vf].pf_set_mac = true;
> +			vfinfo[vf].pf_set_mac = true;

[High]
Can vfinfo here and the vfinfo observed inside ixgbe_set_vf_mac() end up
pointing at different objects?

ixgbe_ndo_set_vf_mac() captures vfinfo via its own rcu_dereference, then
calls ixgbe_set_vf_mac() which independently does:

    rcu_read_lock();
    vfinfo = rcu_dereference(adapter->vfinfo);
    ...
    memcpy(vfinfo[vf].vf_mac_addresses, mac_addr, ETH_ALEN);

Nesting rcu_read_lock() is fine, but each rcu_dereference() observes
whatever pointer is currently published.
If ixgbe_disable_sriov() followed by __ixgbe_enable_sriov() races
between the two dereferences, the inner copy lands in the new array
while the outer pf_set_mac = true write lands in the old array. The two
pieces of state then disagree, which seems to break the "attempted to
override administratively set MAC" enforcement on subsequent VF mailbox
messages. If the new vfinfo is shorter than vf, the inner memcpy is also
a heap out-of-bounds write.

Should ixgbe_ndo_set_vf_mac() pass its already-captured vfinfo into
ixgbe_set_vf_mac() so both writes go to the same object, or otherwise
structure the call so a single rcu_dereference covers all writes for one
operation?

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net 8/8] ixgbe: only access vfinfo and mv_list under RCU lock
2026-05-20 18:34 ` [PATCH net 8/8] ixgbe: only access vfinfo and mv_list under RCU lock Tony Nguyen
2026-05-23 0:16 ` Jakub Kicinski
@ 2026-05-23 0:16 ` Jakub Kicinski
1 sibling, 0 replies; 21+ messages in thread
From: Jakub Kicinski @ 2026-05-23 0:16 UTC (permalink / raw)
To: anthony.l.nguyen
Cc: Jakub Kicinski, davem, pabeni, edumazet, andrew+netdev, netdev, vinschen, alexander.nowlin

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
Subject: ixgbe: only access vfinfo and mv_list under RCU lock

This patch drops the spinlock protecting the adapter info in favor of
RCU to avoid lock contention and crashes when setting sriov_numvfs to
zero. It modifies the accesses to vfinfo and mv_list across the driver
to use RCU primitives.

> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 2646ee6f295f0..d82c7dfc6580a 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c

[ ... ]

> @@ -8309,9 +8340,15 @@ static void ixgbe_check_for_bad_vf(struct ixgbe_adapter *adapter)
>
>  	/* check status reg for all VFs owned by this PF */
>  	for (vf = 0; vf < adapter->num_vfs; ++vf) {
> -		struct pci_dev *vfdev = adapter->vfinfo[vf].vfdev;
> +		struct vf_data_storage *vfinfo;
> +		struct pci_dev *vfdev = NULL;
>  		u16 status_reg;
>
> +		rcu_read_lock();
> +		vfinfo = rcu_dereference(adapter->vfinfo);
> +		if (vfinfo)
> +			vfdev = vfinfo[vf].vfdev;
> +		rcu_read_unlock();
>  		if (!vfdev)
>  			continue;
>  		pci_read_config_word(vfdev, PCI_STATUS, &status_reg);

[Severity: High]
Is it safe to dereference vfdev outside the RCU read-side critical
section?
By dropping the RCU read lock before calling pci_read_config_word(), the
grace period can expire and a concurrent execution of
ixgbe_disable_sriov() could free the pci_dev object, resulting in a
use-after-free.

[ ... ]

> @@ -10071,20 +10114,26 @@ static int handle_redirect_action(struct ixgbe_adapter *adapter, int ifindex,
>  {
>  	struct ixgbe_ring_feature *vmdq = &adapter->ring_feature[RING_F_VMDQ];
>  	unsigned int num_vfs = adapter->num_vfs, vf;
> +	struct vf_data_storage *vfinfo;
>  	struct netdev_nested_priv priv;
>  	struct upper_walk_data data;
>  	struct net_device *upper;
>
>  	/* redirect to a SRIOV VF */
> -	for (vf = 0; vf < num_vfs; ++vf) {
> -		upper = pci_get_drvdata(adapter->vfinfo[vf].vfdev);
> -		if (upper->ifindex == ifindex) {
> -			*queue = vf * __ALIGN_MASK(1, ~vmdq->mask);
> -			*action = vf + 1;
> -			*action <<= ETHTOOL_RX_FLOW_SPEC_RING_VF_OFF;
> -			return 0;
> +	rcu_read_lock();
> +	vfinfo = rcu_dereference(adapter->vfinfo);
> +	if (vfinfo)
> +		for (vf = 0; vf < num_vfs; ++vf) {

[Severity: High]
Does reading adapter->num_vfs outside the RCU lock create a
time-of-check to time-of-use (TOCTOU) race?

If SR-IOV is disabled and rapidly re-enabled, a reader can cache an
older, larger num_vfs value before acquiring the RCU lock, but then
fetch the newer, smaller vfinfo array. Iterating up to the old num_vfs
would then cause an out-of-bounds memory access.

> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> index 431d77da15a5f..80f22a8e7af43 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c

[ ... ]

> @@ -123,6 +125,7 @@ static int __ixgbe_enable_sriov(struct ixgbe_adapter *adapter,
>  static void ixgbe_get_vfs(struct ixgbe_adapter *adapter)
>  {
>  	struct pci_dev *pdev = adapter->pdev;
> +	struct vf_data_storage *vfinfo;
>  	u16 vendor = pdev->vendor;
>  	struct pci_dev *vfdev;
>  	int vf = 0;
> @@ -134,18 +137,23 @@ static void ixgbe_get_vfs(struct ixgbe_adapter *adapter)
>  		return;
>  	pci_read_config_word(pdev, pos + PCI_SRIOV_VF_DID, &vf_id);
>
> -	vfdev = pci_get_device(vendor, vf_id, NULL);
> -	for (; vfdev; vfdev = pci_get_device(vendor, vf_id, vfdev)) {
> -		if (!vfdev->is_virtfn)
> -			continue;
> -		if (vfdev->physfn != pdev)
> -			continue;
> -		if (vf >= adapter->num_vfs)
> -			continue;
> -		pci_dev_get(vfdev);
> -		adapter->vfinfo[vf].vfdev = vfdev;
> -		++vf;
> +	rcu_read_lock();
> +	vfinfo = rcu_dereference(adapter->vfinfo);
> +	if (vfinfo) {
> +		vfdev = pci_get_device(vendor, vf_id, NULL);
> +		for (; vfdev; vfdev = pci_get_device(vendor, vf_id, vfdev)) {

[Severity: High]
Does this code call a sleeping function inside an RCU read-side critical
section?

pci_get_device() internally acquires the pci_bus_sem read-write
semaphore, which can sleep. Sleeping inside an RCU read section will
trigger a 'scheduling while atomic' BUG.

[ ... ]

> @@ -206,31 +214,28 @@ int ixgbe_disable_sriov(struct ixgbe_adapter *adapter)
>  {
>  	unsigned int num_vfs = adapter->num_vfs, vf;
>  	struct ixgbe_hw *hw = &adapter->hw;
> -	unsigned long flags;
> +	struct vf_data_storage *vfinfo;
> +	struct vf_macvlans *mv_list;
>  	int rss;
>
> -	spin_lock_irqsave(&adapter->vfs_lock, flags);
> -	/* set num VFs to 0 to prevent access to vfinfo */
> +	/* set num VFs to 0 so readers bail out early */
>  	adapter->num_vfs = 0;
> -	spin_unlock_irqrestore(&adapter->vfs_lock, flags);
> +
> +	vfinfo = rcu_replace_pointer(adapter->vfinfo, NULL, 1);
> +	mv_list = rcu_replace_pointer(adapter->mv_list, NULL, 1);
>
>  	/* put the reference to all of the vf devices */
>  	for (vf = 0; vf < num_vfs; ++vf) {
> -		struct pci_dev *vfdev = adapter->vfinfo[vf].vfdev;
> +		struct pci_dev *vfdev = vfinfo[vf].vfdev;
>
>  		if (!vfdev)
>  			continue;
> -		adapter->vfinfo[vf].vfdev = NULL;
> +		vfinfo[vf].vfdev = NULL;
>  		pci_dev_put(vfdev);
>  	}

[Severity: High]
Can RCU readers observe these synchronous mutations before the grace
period ends?

The vfinfo array is modified in-place and pci_dev_put() is called on the
devices synchronously. If the final reference is dropped, the pci_dev
could be freed while an RCU reader (like handle_redirect_action()) is
concurrently accessing it, leading to a use-after-free.

[ ... ]

> @@ -643,10 +668,16 @@ static void ixgbe_clear_vf_vlans(struct ixgbe_adapter *adapter, u32 vf)
>  static int ixgbe_set_vf_macvlan(struct ixgbe_adapter *adapter,
>  				int vf, int index, unsigned char *mac_addr)
>  {
> -	struct vf_macvlans *entry;
> +	struct vf_macvlans *mv_list, *entry;
>  	bool found = false;
>  	int retval = 0;
>
> +	lockdep_assert_in_rcu_read_lock();
> +	/* vf_mvs entries point into the mv_list array */
> +	mv_list = rcu_dereference(adapter->mv_list);
> +	if (!mv_list)
> +		return 0;
> +
>  	if (index <= 1) {
>  		list_for_each_entry(entry, &adapter->vf_mvs.l, l) {

[Severity: High]
Are standard list operations safe here under RCU?
The adapter->vf_mvs.l list is traversed using list_for_each_entry() under rcu_read_lock(). Concurrently, ixgbe_alloc_vf_macvlans() can re-initialize the list head and add entries using list_add(). Without the appropriate RCU list primitives, concurrent list modifications can cause data races and corrupted pointers. [ ... ] > @@ -952,14 +1013,20 @@ static int ixgbe_set_vf_mac_addr(struct ixgbe_adapter *adapter, > u32 *msgbuf, u32 vf) > { > u8 *new_mac = ((u8 *)(&msgbuf[1])); > + struct vf_data_storage *vfinfo; > > if (!is_valid_ether_addr(new_mac)) { > e_warn(drv, "VF %d attempted to set invalid mac\n", vf); > return -1; > } > > - if (adapter->vfinfo[vf].pf_set_mac && !adapter->vfinfo[vf].trusted && > - !ether_addr_equal(adapter->vfinfo[vf].vf_mac_addresses, new_mac)) { > + lockdep_assert_in_rcu_read_lock(); > + vfinfo = rcu_dereference(adapter->vfinfo); > + if (!vfinfo) > + return 0; [Severity: Medium] Will this send false ACKs to the VF when vfinfo is NULL? The caller, ixgbe_rcv_msg_from_vf(), interprets a return value of 0 as success and sends an ACK (IXGBE_VT_MSGTYPE_ACK) to the VF. Returning 0 here means the operation is silently ignored but acknowledged as successful. Could this return an error like -EINVAL instead? ^ permalink raw reply [flat|nested] 21+ messages in thread
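The false-ACK question can be illustrated outside the kernel. What follows is a minimal userspace sketch, not driver code: it assumes only what the review states, namely that the caller treats a return value of 0 as success and replies with an ACK. Every sketch_* name, the enum, and the struct layout are hypothetical stand-ins invented for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the ixgbe definitions. */
enum sketch_reply { SKETCH_REPLY_ACK, SKETCH_REPLY_NACK };

struct sketch_vf_data {
	int placeholder;	/* per-VF state elided */
};

struct sketch_adapter {
	struct sketch_vf_data *vfinfo;	/* NULL once SR-IOV is torn down */
};

/* Variant matching the quoted hunk: a missing vfinfo reports success. */
static int sketch_set_mac_returns_zero(const struct sketch_adapter *a)
{
	if (!a->vfinfo)
		return 0;
	return 0;	/* the actual MAC update is elided */
}

/* Variant the review asks about: a missing vfinfo reports an error. */
static int sketch_set_mac_returns_einval(const struct sketch_adapter *a)
{
	if (!a->vfinfo)
		return -EINVAL;
	return 0;	/* the actual MAC update is elided */
}

/* Models the caller as the review describes it: 0 becomes an ACK. */
static enum sketch_reply sketch_reply_for(int retval)
{
	return retval ? SKETCH_REPLY_NACK : SKETCH_REPLY_ACK;
}

/* Self-check: with vfinfo gone, only the -EINVAL variant yields a NACK. */
static void sketch_demo(void)
{
	struct sketch_adapter gone = { .vfinfo = NULL };

	assert(sketch_reply_for(sketch_set_mac_returns_zero(&gone)) ==
	       SKETCH_REPLY_ACK);
	assert(sketch_reply_for(sketch_set_mac_returns_einval(&gone)) ==
	       SKETCH_REPLY_NACK);
}
```

Calling sketch_demo() from any small test program exercises both variants. Whether -EINVAL is the right error code for the driver is left open here, as it is in the review.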
end of thread, other threads:[~2026-05-23  0:16 UTC | newest]

Thread overview: 21+ messages
2026-05-20 18:34 [PATCH net 0/8][pull request] Intel Wired LAN Driver Updates 2026-05-20 (ice, iavf, i40e, ixgbe) Tony Nguyen
2026-05-20 18:34 ` [PATCH net 1/8] ice: fix UAF/NULL deref when VSI rebuild and XDP attach race Tony Nguyen
2026-05-21 15:37   ` Jakub Kicinski
2026-05-23  0:16   ` Jakub Kicinski
2026-05-20 18:34 ` [PATCH net 2/8] ice: fix stats array overflow when VF requests more queues Tony Nguyen
2026-05-23  0:16   ` Jakub Kicinski
2026-05-20 18:34 ` [PATCH net 3/8] iavf: return EBUSY if reset in progress or not ready during MAC change Tony Nguyen
2026-05-20 18:34 ` [PATCH net 4/8] i40e: skip unnecessary VF reset when setting trust Tony Nguyen
2026-05-23  0:16   ` Jakub Kicinski
2026-05-23  0:16   ` Jakub Kicinski
2026-05-20 18:34 ` [PATCH net 5/8] iavf: send MAC change request synchronously Tony Nguyen
2026-05-23  0:16   ` Jakub Kicinski
2026-05-23  0:16   ` Jakub Kicinski
2026-05-20 18:34 ` [PATCH net 6/8] ice: skip unnecessary VF reset when setting trust Tony Nguyen
2026-05-23  0:16   ` Jakub Kicinski
2026-05-23  0:16   ` Jakub Kicinski
2026-05-20 18:34 ` [PATCH net 7/8] i40e: set supported_extts_flags for rising edge Tony Nguyen
2026-05-23  0:16   ` Jakub Kicinski
2026-05-20 18:34 ` [PATCH net 8/8] ixgbe: only access vfinfo and mv_list under RCU lock Tony Nguyen
2026-05-23  0:16   ` Jakub Kicinski
2026-05-23  0:16   ` Jakub Kicinski