From: Jakub Kicinski <kuba@kernel.org>
To: anthony.l.nguyen@intel.com
Cc: Jakub Kicinski <kuba@kernel.org>,
davem@davemloft.net, pabeni@redhat.com, edumazet@google.com,
andrew+netdev@lunn.ch, netdev@vger.kernel.org,
vinschen@redhat.com, alexander.nowlin@intel.com
Subject: Re: [PATCH net 8/8] ixgbe: only access vfinfo and mv_list under RCU lock
Date: Fri, 22 May 2026 17:16:05 -0700 [thread overview]
Message-ID: <20260523001605.1757032-1-kuba@kernel.org> (raw)
In-Reply-To: <20260520183501.3360810-9-anthony.l.nguyen@intel.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
ixgbe: only access vfinfo and mv_list under RCU lock
This change drops the vfs_lock spinlock in favour of RCU protection
for adapter->vfinfo and adapter->mv_list, marking them __rcu and
converting readers to rcu_dereference()/rcu_read_lock() and writers
to rcu_replace_pointer()/kfree_rcu(). The goal is to address a
crash observed when ixgbe_disable_sriov() races with
ixgbe_service_task() / ixgbe_msg_task() readers.
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> index 9b8217523fd2..8849b9f42bf6 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> @@ -808,10 +810,10 @@ struct ixgbe_adapter {
> /* SR-IOV */
> DECLARE_BITMAP(active_vfs, IXGBE_MAX_VF_FUNCTIONS);
> unsigned int num_vfs;
> - struct vf_data_storage *vfinfo;
> + struct vf_data_storage __rcu *vfinfo;
> int vf_rate_link_speed;
> struct vf_macvlans vf_mvs;
> - struct vf_macvlans *mv_list;
> + struct vf_macvlans __rcu *mv_list;
[Medium]
After this conversion, is ixgbe_handle_mdd_event() in
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c still consistent
with the new annotations? It contains:
if (adapter->vfinfo && ixgbe_check_mdd_event(adapter)) {
That bare load of an __rcu pointer is not wrapped in
rcu_read_lock() and does not use rcu_access_pointer() or
rcu_dereference(), so sparse will flag it and the access
bypasses the RCU discipline used elsewhere in the file. The
commit message says RCU is used "throughout the driver" — was
this site simply missed during the conversion?
[ ... ]
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 2646ee6f295f..d82c7dfc6580 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
[ ... ]
> @@ -8267,22 +8293,27 @@ static void ixgbe_watchdog_flush_tx(struct ixgbe_adapter *adapter)
> @@ -8309,9 +8340,15 @@ static void ixgbe_check_for_bad_vf(struct ixgbe_adapter *adapter)
>
> /* check status reg for all VFs owned by this PF */
> for (vf = 0; vf < adapter->num_vfs; ++vf) {
> - struct pci_dev *vfdev = adapter->vfinfo[vf].vfdev;
> + struct vf_data_storage *vfinfo;
> + struct pci_dev *vfdev = NULL;
> u16 status_reg;
>
> + rcu_read_lock();
> + vfinfo = rcu_dereference(adapter->vfinfo);
> + if (vfinfo)
> + vfdev = vfinfo[vf].vfdev;
> + rcu_read_unlock();
> if (!vfdev)
> continue;
> pci_read_config_word(vfdev, PCI_STATUS, &status_reg);
[High]
Can vfdev still be used safely after rcu_read_unlock() here?
The RCU section only protects the lifetime of the vfinfo array,
not the lifetime of the pci_dev whose pointer was stored in it.
ixgbe_disable_sriov() drops the reference taken by
ixgbe_get_vfs() with pci_dev_put(vfdev) and then proceeds to
pci_disable_sriov(), either of which can free the pci_dev
concurrently with the subsequent pci_read_config_word() and the
pcie_flr(vfdev) reached through ixgbe_bad_vf_abort().
Should this path take pci_dev_get(vfdev) before dropping the
rcu_read_unlock(), and pci_dev_put() after the config-space
access and FLR are complete?
This appears to predate this patch (the prior code dereferenced
vfdev outside any lifetime-extending section as well), but the
RCU conversion seems like a natural opportunity to address it.
[ ... ]
> @@ -10071,20 +10114,26 @@ static int handle_redirect_action(struct ixgbe_adapter *adapter, int ifindex,
> {
> struct ixgbe_ring_feature *vmdq = &adapter->ring_feature[RING_F_VMDQ];
> unsigned int num_vfs = adapter->num_vfs, vf;
> + struct vf_data_storage *vfinfo;
> struct netdev_nested_priv priv;
> struct upper_walk_data data;
> struct net_device *upper;
>
> /* redirect to a SRIOV VF */
> - for (vf = 0; vf < num_vfs; ++vf) {
> - upper = pci_get_drvdata(adapter->vfinfo[vf].vfdev);
> - if (upper->ifindex == ifindex) {
> - *queue = vf * __ALIGN_MASK(1, ~vmdq->mask);
> - *action = vf + 1;
> - *action <<= ETHTOOL_RX_FLOW_SPEC_RING_VF_OFF;
> - return 0;
> + rcu_read_lock();
> + vfinfo = rcu_dereference(adapter->vfinfo);
> + if (vfinfo)
> + for (vf = 0; vf < num_vfs; ++vf) {
> + upper = pci_get_drvdata(vfinfo[vf].vfdev);
[High]
Can the captured num_vfs and the freshly-dereferenced vfinfo
disagree in size here?
num_vfs is sampled from adapter->num_vfs before rcu_read_lock(),
while vfinfo is observed inside the RCU section. ixgbe_disable_sriov()
runs under PCI device_lock (not rtnl), sets adapter->num_vfs = 0,
and rcu_replace_pointer()s vfinfo. A subsequent
__ixgbe_enable_sriov(N2) can publish a smaller vfinfo array of
size N2 < num_vfs before the loop runs. The access vfinfo[vf]
for vf >= N2 is then a heap out-of-bounds read.
Should num_vfs be sampled inside the rcu_read_lock() and bounded
by the actual size of the published vfinfo, or should
adapter->num_vfs be updated as part of the same rcu_assign_pointer
publication so readers can rely on a consistent pair?
[ ... ]
> @@ -9744,15 +9781,21 @@ static int ixgbe_ndo_get_vf_stats(struct net_device *netdev, int vf,
> struct ifla_vf_stats *vf_stats)
> {
> struct ixgbe_adapter *adapter = ixgbe_from_netdev(netdev);
> + struct vf_data_storage *vfinfo;
>
> if (vf < 0 || vf >= adapter->num_vfs)
> return -EINVAL;
>
> - vf_stats->rx_packets = adapter->vfinfo[vf].vfstats.gprc;
> - vf_stats->rx_bytes = adapter->vfinfo[vf].vfstats.gorc;
> - vf_stats->tx_packets = adapter->vfinfo[vf].vfstats.gptc;
> - vf_stats->tx_bytes = adapter->vfinfo[vf].vfstats.gotc;
> - vf_stats->multicast = adapter->vfinfo[vf].vfstats.mprc;
> + rcu_read_lock();
> + vfinfo = rcu_dereference(adapter->vfinfo);
> + if (vfinfo) {
> + vf_stats->rx_packets = vfinfo[vf].vfstats.gprc;
> + vf_stats->rx_bytes = vfinfo[vf].vfstats.gorc;
> + vf_stats->tx_packets = vfinfo[vf].vfstats.gptc;
> + vf_stats->tx_bytes = vfinfo[vf].vfstats.gotc;
> + vf_stats->multicast = vfinfo[vf].vfstats.mprc;
> + }
> + rcu_read_unlock();
[High]
This pattern recurs in many of the rtnl ndo handlers (set_vf_mac,
set_vf_spoofchk, set_vf_rss_query_en, set_vf_trust, set_vf_vlan,
set_vf_bw, get_vf_config, get_vf_stats): the bound check
if (vf >= adapter->num_vfs)
return -EINVAL;
is performed before rcu_read_lock(). Between that check and the
later rcu_dereference, ixgbe_disable_sriov() (under PCI device_lock,
not rtnl) can set num_vfs = 0 and __ixgbe_enable_sriov() can
republish vfinfo at a smaller size N2.
For ixgbe_ndo_get_vf_config() the resulting OOB read is then
memcpy'd into struct ifla_vf_info and returned to userspace via
netlink, which would expose adjacent kernel heap. For the setter
ndos and the mailbox-driven setters, this is an out-of-bounds
write.
Should the vf range check be repeated against the size of the
freshly observed vfinfo, ideally with num_vfs and vfinfo
published together so the pair is self-consistent?
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> index 431d77da15a5..80f22a8e7af4 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
[ ... ]
> @@ -206,31 +214,28 @@ int ixgbe_disable_sriov(struct ixgbe_adapter *adapter)
> {
> unsigned int num_vfs = adapter->num_vfs, vf;
> struct ixgbe_hw *hw = &adapter->hw;
> - unsigned long flags;
> + struct vf_data_storage *vfinfo;
> + struct vf_macvlans *mv_list;
> int rss;
>
> - spin_lock_irqsave(&adapter->vfs_lock, flags);
> - /* set num VFs to 0 to prevent access to vfinfo */
> + /* set num VFs to 0 so readers bail out early */
> adapter->num_vfs = 0;
> - spin_unlock_irqrestore(&adapter->vfs_lock, flags);
> +
> + vfinfo = rcu_replace_pointer(adapter->vfinfo, NULL, 1);
> + mv_list = rcu_replace_pointer(adapter->mv_list, NULL, 1);
[Low]
The lockdep predicate passed to rcu_replace_pointer() is the
literal 1, which makes the writer-side assertion always succeed
and provides no verification. The patch simultaneously removes
vfs_lock and shifts to an implicit (undocumented) reliance on
some caller serialization for the writer.
Could lockdep_is_held() of the actual writer-side lock (for
example the PCI device_lock or rtnl_lock, whichever is intended)
be used here so a future regression that calls
ixgbe_disable_sriov() outside that context is caught by lockdep?
[ ... ]
> @@ -643,10 +668,16 @@ static void ixgbe_clear_vf_vlans(struct ixgbe_adapter *adapter, u32 vf)
> static int ixgbe_set_vf_macvlan(struct ixgbe_adapter *adapter,
> int vf, int index, unsigned char *mac_addr)
> {
> - struct vf_macvlans *entry;
> + struct vf_macvlans *mv_list, *entry;
> bool found = false;
> int retval = 0;
>
> + lockdep_assert_in_rcu_read_lock();
> + /* vf_mvs entries point into the mv_list array */
> + mv_list = rcu_dereference(adapter->mv_list);
> + if (!mv_list)
> + return 0;
> +
> if (index <= 1) {
> list_for_each_entry(entry, &adapter->vf_mvs.l, l) {
[High]
Is non-RCU list iteration safe here while only rcu_read_lock()
is held?
The list head &adapter->vf_mvs.l is mutated by the writer side
without RCU-aware primitives. ixgbe_alloc_vf_macvlans() does:
INIT_LIST_HEAD(&adapter->vf_mvs.l);
...
list_add(&mv_list[i].l, &adapter->vf_mvs.l);
and ixgbe_disable_sriov() kfree_rcu()s mv_list but never
reinitialises &adapter->vf_mvs.l, so on a subsequent
re-enable the writer concurrently zeros and re-links the head
while a reader walks it.
Should list_for_each_entry() be list_for_each_entry_rcu(), and
the writer use INIT_LIST_HEAD_RCU() / list_add_rcu()? Otherwise
the reader can observe torn next/prev loads or follow into
freed/re-used entries.
[ ... ]
> @@ -780,16 +816,25 @@ static void ixgbe_vf_clear_mbx(struct ixgbe_adapter *adapter, u32 vf)
> static int ixgbe_set_vf_mac(struct ixgbe_adapter *adapter,
> int vf, unsigned char *mac_addr)
> {
> + struct vf_data_storage *vfinfo;
> int retval;
>
> - ixgbe_del_mac_filter(adapter, adapter->vfinfo[vf].vf_mac_addresses, vf);
> + rcu_read_lock();
> + vfinfo = rcu_dereference(adapter->vfinfo);
> + if (!vfinfo) {
> + rcu_read_unlock();
> + return -EINVAL;
> + }
> +
> + ixgbe_del_mac_filter(adapter, vfinfo[vf].vf_mac_addresses, vf);
> retval = ixgbe_add_mac_filter(adapter, mac_addr, vf);
> if (retval >= 0)
> - memcpy(adapter->vfinfo[vf].vf_mac_addresses, mac_addr,
> + memcpy(vfinfo[vf].vf_mac_addresses, mac_addr,
> ETH_ALEN);
[ ... ]
> @@ -1537,21 +1677,34 @@ void ixgbe_ping_all_vfs(struct ixgbe_adapter *adapter)
> int ixgbe_ndo_set_vf_mac(struct net_device *netdev, int vf, u8 *mac)
> {
> struct ixgbe_adapter *adapter = ixgbe_from_netdev(netdev);
> + struct vf_data_storage *vfinfo;
> int retval;
>
> if (vf >= adapter->num_vfs)
> return -EINVAL;
>
> + rcu_read_lock();
> + vfinfo = rcu_dereference(adapter->vfinfo);
> + if (!vfinfo) {
> + rcu_read_unlock();
> + return 0;
> + }
> +
> if (is_valid_ether_addr(mac)) {
> dev_info(&adapter->pdev->dev, "setting MAC %pM on VF %d\n",
> mac, vf);
> dev_info(&adapter->pdev->dev, "Reload the VF driver to make this change effective.");
>
> retval = ixgbe_set_vf_mac(adapter, vf, mac);
> if (retval >= 0) {
> - adapter->vfinfo[vf].pf_set_mac = true;
> + vfinfo[vf].pf_set_mac = true;
[High]
Can vfinfo here and the vfinfo observed inside ixgbe_set_vf_mac()
end up pointing at different objects?
ixgbe_ndo_set_vf_mac() captures vfinfo via its own rcu_dereference,
then calls ixgbe_set_vf_mac() which independently does:
rcu_read_lock();
vfinfo = rcu_dereference(adapter->vfinfo);
...
memcpy(vfinfo[vf].vf_mac_addresses, mac_addr, ETH_ALEN);
Nesting rcu_read_lock() is fine, but each rcu_dereference()
observes whatever pointer is currently published. If
ixgbe_disable_sriov() followed by __ixgbe_enable_sriov() races
between the two dereferences, the inner copy lands in the new
array while the outer pf_set_mac = true write lands in the old
array. The two pieces of state then disagree, which seems to
break the "attempted to override administratively set MAC"
enforcement on subsequent VF mailbox messages. If the new
vfinfo is shorter than vf, the inner memcpy is also a heap
out-of-bounds write.
Should ixgbe_ndo_set_vf_mac() pass its already-captured vfinfo
into ixgbe_set_vf_mac() so both writes go to the same object,
or otherwise structure the call so a single rcu_dereference
covers all writes for one operation?
next prev parent reply other threads:[~2026-05-23 0:16 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-20 18:34 [PATCH net 0/8][pull request] Intel Wired LAN Driver Updates 2026-05-20 (ice, iavf, i40e, ixgbe) Tony Nguyen
2026-05-20 18:34 ` [PATCH net 1/8] ice: fix UAF/NULL deref when VSI rebuild and XDP attach race Tony Nguyen
2026-05-21 15:37 ` Jakub Kicinski
2026-05-23 0:16 ` Jakub Kicinski
2026-05-20 18:34 ` [PATCH net 2/8] ice: fix stats array overflow when VF requests more queues Tony Nguyen
2026-05-23 0:16 ` Jakub Kicinski
2026-05-20 18:34 ` [PATCH net 3/8] iavf: return EBUSY if reset in progress or not ready during MAC change Tony Nguyen
2026-05-20 18:34 ` [PATCH net 4/8] i40e: skip unnecessary VF reset when setting trust Tony Nguyen
2026-05-23 0:16 ` Jakub Kicinski
2026-05-23 0:16 ` Jakub Kicinski
2026-05-20 18:34 ` [PATCH net 5/8] iavf: send MAC change request synchronously Tony Nguyen
2026-05-23 0:16 ` Jakub Kicinski
2026-05-23 0:16 ` Jakub Kicinski
2026-05-20 18:34 ` [PATCH net 6/8] ice: skip unnecessary VF reset when setting trust Tony Nguyen
2026-05-23 0:16 ` Jakub Kicinski
2026-05-23 0:16 ` Jakub Kicinski
2026-05-20 18:34 ` [PATCH net 7/8] i40e: set supported_extts_flags for rising edge Tony Nguyen
2026-05-23 0:16 ` Jakub Kicinski
2026-05-20 18:34 ` [PATCH net 8/8] ixgbe: only access vfinfo and mv_list under RCU lock Tony Nguyen
2026-05-23 0:16 ` Jakub Kicinski [this message]
2026-05-23 0:16 ` Jakub Kicinski
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260523001605.1757032-1-kuba@kernel.org \
--to=kuba@kernel.org \
--cc=alexander.nowlin@intel.com \
--cc=andrew+netdev@lunn.ch \
--cc=anthony.l.nguyen@intel.com \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=vinschen@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox