From: Corinna Vinschen <vinschen@redhat.com>
To: Jakub Kicinski <kuba@kernel.org>
Cc: anthony.l.nguyen@intel.com, davem@davemloft.net,
pabeni@redhat.com, edumazet@google.com, andrew+netdev@lunn.ch,
netdev@vger.kernel.org, alexander.nowlin@intel.com,
Corinna Vinschen <vinschen@redhat.com>
Subject: Re: [PATCH net 8/8] ixgbe: only access vfinfo and mv_list under RCU lock
Date: Wed, 1 Jul 2026 12:15:04 +0200 [thread overview]
Message-ID: <akToqN09okrqTA73@calimero.vinschen.de> (raw)
In-Reply-To: <20260523001605.1757032-1-kuba@kernel.org>
Hi Jakub, hi others,
sorry for the late reply, I was AFK for two months.
I scanned the review points and most of them raise important questions.
This mail is only to let you know that I started looking into them
and I will follow up with a new patch in the next few weeks.
Thanks,
Corinna
On May 22 17:16, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> ixgbe: only access vfinfo and mv_list under RCU lock
>
> This change drops the vfs_lock spinlock in favour of RCU protection
> for adapter->vfinfo and adapter->mv_list, marking them __rcu and
> converting readers to rcu_dereference()/rcu_read_lock() and writers
> to rcu_replace_pointer()/kfree_rcu(). The goal is to address a
> crash observed when ixgbe_disable_sriov() races with
> ixgbe_service_task() / ixgbe_msg_task() readers.
>
> > diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> > index 9b8217523fd2..8849b9f42bf6 100644
> > --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> > +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> > @@ -808,10 +810,10 @@ struct ixgbe_adapter {
> > /* SR-IOV */
> > DECLARE_BITMAP(active_vfs, IXGBE_MAX_VF_FUNCTIONS);
> > unsigned int num_vfs;
> > - struct vf_data_storage *vfinfo;
> > + struct vf_data_storage __rcu *vfinfo;
> > int vf_rate_link_speed;
> > struct vf_macvlans vf_mvs;
> > - struct vf_macvlans *mv_list;
> > + struct vf_macvlans __rcu *mv_list;
>
> [Medium]
> After this conversion, is ixgbe_handle_mdd_event() in
> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c still consistent
> with the new annotations? It contains:
>
> if (adapter->vfinfo && ixgbe_check_mdd_event(adapter)) {
>
> That bare load of an __rcu pointer is not wrapped in
> rcu_read_lock() and does not use rcu_access_pointer() or
> rcu_dereference(), so sparse will flag it and the access
> bypasses the RCU discipline used elsewhere in the file. The
> commit message says RCU is used "throughout the driver" — was
> this site simply missed during the conversion?
>
> [ ... ]
>
> > diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > index 2646ee6f295f..d82c7dfc6580 100644
> > --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>
> [ ... ]
>
> > @@ -8267,22 +8293,27 @@ static void ixgbe_watchdog_flush_tx(struct ixgbe_adapter *adapter)
> > @@ -8309,9 +8340,15 @@ static void ixgbe_check_for_bad_vf(struct ixgbe_adapter *adapter)
> >
> > /* check status reg for all VFs owned by this PF */
> > for (vf = 0; vf < adapter->num_vfs; ++vf) {
> > - struct pci_dev *vfdev = adapter->vfinfo[vf].vfdev;
> > + struct vf_data_storage *vfinfo;
> > + struct pci_dev *vfdev = NULL;
> > u16 status_reg;
> >
> > + rcu_read_lock();
> > + vfinfo = rcu_dereference(adapter->vfinfo);
> > + if (vfinfo)
> > + vfdev = vfinfo[vf].vfdev;
> > + rcu_read_unlock();
> > if (!vfdev)
> > continue;
> > pci_read_config_word(vfdev, PCI_STATUS, &status_reg);
>
> [High]
> Can vfdev still be used safely after rcu_read_unlock() here?
>
> The RCU section only protects the lifetime of the vfinfo array,
> not the lifetime of the pci_dev whose pointer was stored in it.
> ixgbe_disable_sriov() drops the reference taken by
> ixgbe_get_vfs() with pci_dev_put(vfdev) and then proceeds to
> pci_disable_sriov(), either of which can free the pci_dev
> concurrently with the subsequent pci_read_config_word() and the
> pcie_flr(vfdev) reached through ixgbe_bad_vf_abort().
>
> Should this path take pci_dev_get(vfdev) before dropping the
> rcu_read_unlock(), and pci_dev_put() after the config-space
> access and FLR are complete?
>
> This appears to predate this patch (the prior code dereferenced
> vfdev outside any lifetime-extending section as well), but the
> RCU conversion seems like a natural opportunity to address it.
>
> [ ... ]
>
> > @@ -10071,20 +10114,26 @@ static int handle_redirect_action(struct ixgbe_adapter *adapter, int ifindex,
> > {
> > struct ixgbe_ring_feature *vmdq = &adapter->ring_feature[RING_F_VMDQ];
> > unsigned int num_vfs = adapter->num_vfs, vf;
> > + struct vf_data_storage *vfinfo;
> > struct netdev_nested_priv priv;
> > struct upper_walk_data data;
> > struct net_device *upper;
> >
> > /* redirect to a SRIOV VF */
> > - for (vf = 0; vf < num_vfs; ++vf) {
> > - upper = pci_get_drvdata(adapter->vfinfo[vf].vfdev);
> > - if (upper->ifindex == ifindex) {
> > - *queue = vf * __ALIGN_MASK(1, ~vmdq->mask);
> > - *action = vf + 1;
> > - *action <<= ETHTOOL_RX_FLOW_SPEC_RING_VF_OFF;
> > - return 0;
> > + rcu_read_lock();
> > + vfinfo = rcu_dereference(adapter->vfinfo);
> > + if (vfinfo)
> > + for (vf = 0; vf < num_vfs; ++vf) {
> > + upper = pci_get_drvdata(vfinfo[vf].vfdev);
>
> [High]
> Can the captured num_vfs and the freshly-dereferenced vfinfo
> disagree in size here?
>
> num_vfs is sampled from adapter->num_vfs before rcu_read_lock(),
> while vfinfo is observed inside the RCU section. ixgbe_disable_sriov()
> runs under PCI device_lock (not rtnl), sets adapter->num_vfs = 0,
> and rcu_replace_pointer()s vfinfo. A subsequent
> __ixgbe_enable_sriov(N2) can publish a smaller vfinfo array of
> size N2 < num_vfs before the loop runs. The access vfinfo[vf]
> for vf >= N2 is then a heap out-of-bounds read.
>
> Should num_vfs be sampled inside the rcu_read_lock() and bounded
> by the actual size of the published vfinfo, or should
> adapter->num_vfs be updated as part of the same rcu_assign_pointer
> publication so readers can rely on a consistent pair?
>
> [ ... ]
>
> > @@ -9744,15 +9781,21 @@ static int ixgbe_ndo_get_vf_stats(struct net_device *netdev, int vf,
> > struct ifla_vf_stats *vf_stats)
> > {
> > struct ixgbe_adapter *adapter = ixgbe_from_netdev(netdev);
> > + struct vf_data_storage *vfinfo;
> >
> > if (vf < 0 || vf >= adapter->num_vfs)
> > return -EINVAL;
> >
> > - vf_stats->rx_packets = adapter->vfinfo[vf].vfstats.gprc;
> > - vf_stats->rx_bytes = adapter->vfinfo[vf].vfstats.gorc;
> > - vf_stats->tx_packets = adapter->vfinfo[vf].vfstats.gptc;
> > - vf_stats->tx_bytes = adapter->vfinfo[vf].vfstats.gotc;
> > - vf_stats->multicast = adapter->vfinfo[vf].vfstats.mprc;
> > + rcu_read_lock();
> > + vfinfo = rcu_dereference(adapter->vfinfo);
> > + if (vfinfo) {
> > + vf_stats->rx_packets = vfinfo[vf].vfstats.gprc;
> > + vf_stats->rx_bytes = vfinfo[vf].vfstats.gorc;
> > + vf_stats->tx_packets = vfinfo[vf].vfstats.gptc;
> > + vf_stats->tx_bytes = vfinfo[vf].vfstats.gotc;
> > + vf_stats->multicast = vfinfo[vf].vfstats.mprc;
> > + }
> > + rcu_read_unlock();
>
> [High]
> This pattern recurs in many of the rtnl ndo handlers (set_vf_mac,
> set_vf_spoofchk, set_vf_rss_query_en, set_vf_trust, set_vf_vlan,
> set_vf_bw, get_vf_config, get_vf_stats): the bound check
>
> if (vf >= adapter->num_vfs)
> return -EINVAL;
>
> is performed before rcu_read_lock(). Between that check and the
> later rcu_dereference, ixgbe_disable_sriov() (under PCI device_lock,
> not rtnl) can set num_vfs = 0 and __ixgbe_enable_sriov() can
> republish vfinfo at a smaller size N2.
>
> For ixgbe_ndo_get_vf_config() the resulting OOB read is then
> memcpy'd into struct ifla_vf_info and returned to userspace via
> netlink, which would expose adjacent kernel heap. For the setter
> ndos and the mailbox-driven setters, this is an out-of-bounds
> write.
>
> Should the vf range check be repeated against the size of the
> freshly observed vfinfo, ideally with num_vfs and vfinfo
> published together so the pair is self-consistent?
>
> > diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> > index 431d77da15a5..80f22a8e7af4 100644
> > --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> > +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
>
> [ ... ]
>
> > @@ -206,31 +214,28 @@ int ixgbe_disable_sriov(struct ixgbe_adapter *adapter)
> > {
> > unsigned int num_vfs = adapter->num_vfs, vf;
> > struct ixgbe_hw *hw = &adapter->hw;
> > - unsigned long flags;
> > + struct vf_data_storage *vfinfo;
> > + struct vf_macvlans *mv_list;
> > int rss;
> >
> > - spin_lock_irqsave(&adapter->vfs_lock, flags);
> > - /* set num VFs to 0 to prevent access to vfinfo */
> > + /* set num VFs to 0 so readers bail out early */
> > adapter->num_vfs = 0;
> > - spin_unlock_irqrestore(&adapter->vfs_lock, flags);
> > +
> > + vfinfo = rcu_replace_pointer(adapter->vfinfo, NULL, 1);
> > + mv_list = rcu_replace_pointer(adapter->mv_list, NULL, 1);
>
> [Low]
> The lockdep predicate passed to rcu_replace_pointer() is the
> literal 1, which makes the writer-side assertion always succeed
> and provides no verification. The patch simultaneously removes
> vfs_lock and shifts to an implicit (undocumented) reliance on
> some caller serialization for the writer.
>
> Could lockdep_is_held() of the actual writer-side lock (for
> example the PCI device_lock or rtnl_lock, whichever is intended)
> be used here so a future regression that calls
> ixgbe_disable_sriov() outside that context is caught by lockdep?
>
> [ ... ]
>
> > @@ -643,10 +668,16 @@ static void ixgbe_clear_vf_vlans(struct ixgbe_adapter *adapter, u32 vf)
> > static int ixgbe_set_vf_macvlan(struct ixgbe_adapter *adapter,
> > int vf, int index, unsigned char *mac_addr)
> > {
> > - struct vf_macvlans *entry;
> > + struct vf_macvlans *mv_list, *entry;
> > bool found = false;
> > int retval = 0;
> >
> > + lockdep_assert_in_rcu_read_lock();
> > + /* vf_mvs entries point into the mv_list array */
> > + mv_list = rcu_dereference(adapter->mv_list);
> > + if (!mv_list)
> > + return 0;
> > +
> > if (index <= 1) {
> > list_for_each_entry(entry, &adapter->vf_mvs.l, l) {
>
> [High]
> Is non-RCU list iteration safe here while only rcu_read_lock()
> is held?
>
> The list head &adapter->vf_mvs.l is mutated by the writer side
> without RCU-aware primitives. ixgbe_alloc_vf_macvlans() does:
>
> INIT_LIST_HEAD(&adapter->vf_mvs.l);
> ...
> list_add(&mv_list[i].l, &adapter->vf_mvs.l);
>
> and ixgbe_disable_sriov() kfree_rcu()s mv_list but never
> reinitialises &adapter->vf_mvs.l, so on a subsequent
> re-enable the writer concurrently zeros and re-links the head
> while a reader walks it.
>
> Should list_for_each_entry() be list_for_each_entry_rcu(), and
> the writer use INIT_LIST_HEAD_RCU() / list_add_rcu()? Otherwise
> the reader can observe torn next/prev loads or follow into
> freed/re-used entries.
>
> [ ... ]
>
> > @@ -780,16 +816,25 @@ static void ixgbe_vf_clear_mbx(struct ixgbe_adapter *adapter, u32 vf)
> > static int ixgbe_set_vf_mac(struct ixgbe_adapter *adapter,
> > int vf, unsigned char *mac_addr)
> > {
> > + struct vf_data_storage *vfinfo;
> > int retval;
> >
> > - ixgbe_del_mac_filter(adapter, adapter->vfinfo[vf].vf_mac_addresses, vf);
> > + rcu_read_lock();
> > + vfinfo = rcu_dereference(adapter->vfinfo);
> > + if (!vfinfo) {
> > + rcu_read_unlock();
> > + return -EINVAL;
> > + }
> > +
> > + ixgbe_del_mac_filter(adapter, vfinfo[vf].vf_mac_addresses, vf);
> > retval = ixgbe_add_mac_filter(adapter, mac_addr, vf);
> > if (retval >= 0)
> > - memcpy(adapter->vfinfo[vf].vf_mac_addresses, mac_addr,
> > + memcpy(vfinfo[vf].vf_mac_addresses, mac_addr,
> > ETH_ALEN);
>
> [ ... ]
>
> > @@ -1537,21 +1677,34 @@ void ixgbe_ping_all_vfs(struct ixgbe_adapter *adapter)
> > int ixgbe_ndo_set_vf_mac(struct net_device *netdev, int vf, u8 *mac)
> > {
> > struct ixgbe_adapter *adapter = ixgbe_from_netdev(netdev);
> > + struct vf_data_storage *vfinfo;
> > int retval;
> >
> > if (vf >= adapter->num_vfs)
> > return -EINVAL;
> >
> > + rcu_read_lock();
> > + vfinfo = rcu_dereference(adapter->vfinfo);
> > + if (!vfinfo) {
> > + rcu_read_unlock();
> > + return 0;
> > + }
> > +
> > if (is_valid_ether_addr(mac)) {
> > dev_info(&adapter->pdev->dev, "setting MAC %pM on VF %d\n",
> > mac, vf);
> > dev_info(&adapter->pdev->dev, "Reload the VF driver to make this change effective.");
> >
> > retval = ixgbe_set_vf_mac(adapter, vf, mac);
> > if (retval >= 0) {
> > - adapter->vfinfo[vf].pf_set_mac = true;
> > + vfinfo[vf].pf_set_mac = true;
>
> [High]
> Can vfinfo here and the vfinfo observed inside ixgbe_set_vf_mac()
> end up pointing at different objects?
>
> ixgbe_ndo_set_vf_mac() captures vfinfo via its own rcu_dereference,
> then calls ixgbe_set_vf_mac() which independently does:
>
> rcu_read_lock();
> vfinfo = rcu_dereference(adapter->vfinfo);
> ...
> memcpy(vfinfo[vf].vf_mac_addresses, mac_addr, ETH_ALEN);
>
> Nesting rcu_read_lock() is fine, but each rcu_dereference()
> observes whatever pointer is currently published. If
> ixgbe_disable_sriov() followed by __ixgbe_enable_sriov() races
> between the two dereferences, the inner copy lands in the new
> array while the outer pf_set_mac = true write lands in the old
> array. The two pieces of state then disagree, which seems to
> break the "attempted to override administratively set MAC"
> enforcement on subsequent VF mailbox messages. If the new
> vfinfo is shorter than vf, the inner memcpy is also a heap
> out-of-bounds write.
>
> Should ixgbe_ndo_set_vf_mac() pass its already-captured vfinfo
> into ixgbe_set_vf_mac() so both writes go to the same object,
> or otherwise structure the call so a single rcu_dereference
> covers all writes for one operation?
next prev parent reply other threads:[~2026-07-01 10:15 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-20 18:34 [PATCH net 0/8][pull request] Intel Wired LAN Driver Updates 2026-05-20 (ice, iavf, i40e, ixgbe) Tony Nguyen
2026-05-20 18:34 ` [PATCH net 1/8] ice: fix UAF/NULL deref when VSI rebuild and XDP attach race Tony Nguyen
2026-05-21 15:37 ` Jakub Kicinski
2026-05-23 0:16 ` Jakub Kicinski
2026-05-28 9:14 ` Maciej Fijalkowski
2026-05-29 8:05 ` Larysa Zaremba
2026-06-01 12:21 ` Maciej Fijalkowski
2026-06-01 13:52 ` Larysa Zaremba
2026-05-20 18:34 ` [PATCH net 2/8] ice: fix stats array overflow when VF requests more queues Tony Nguyen
2026-05-23 0:16 ` Jakub Kicinski
2026-05-20 18:34 ` [PATCH net 3/8] iavf: return EBUSY if reset in progress or not ready during MAC change Tony Nguyen
2026-05-20 18:34 ` [PATCH net 4/8] i40e: skip unnecessary VF reset when setting trust Tony Nguyen
2026-05-23 0:16 ` Jakub Kicinski
2026-05-23 0:16 ` Jakub Kicinski
2026-05-20 18:34 ` [PATCH net 5/8] iavf: send MAC change request synchronously Tony Nguyen
2026-05-23 0:16 ` Jakub Kicinski
2026-05-23 0:16 ` Jakub Kicinski
2026-05-20 18:34 ` [PATCH net 6/8] ice: skip unnecessary VF reset when setting trust Tony Nguyen
2026-05-23 0:16 ` Jakub Kicinski
2026-05-23 0:16 ` Jakub Kicinski
2026-05-20 18:34 ` [PATCH net 7/8] i40e: set supported_extts_flags for rising edge Tony Nguyen
2026-05-23 0:16 ` Jakub Kicinski
2026-05-20 18:34 ` [PATCH net 8/8] ixgbe: only access vfinfo and mv_list under RCU lock Tony Nguyen
2026-05-23 0:16 ` Jakub Kicinski
2026-07-01 10:15 ` Corinna Vinschen [this message]
2026-05-23 0:16 ` Jakub Kicinski
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=akToqN09okrqTA73@calimero.vinschen.de \
--to=vinschen@redhat.com \
--cc=alexander.nowlin@intel.com \
--cc=andrew+netdev@lunn.ch \
--cc=anthony.l.nguyen@intel.com \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=kuba@kernel.org \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox