* [PATCH net 09/13] i40e: don't advertise IFF_SUPP_NOFCS
From: Jacob Keller @ 2026-04-15 5:48 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, Jacob Keller, Kohei Enju, Aleksandr Loktionov,
Sunitha Mekala
In-Reply-To: <20260414-iwl-net-submission-2026-04-14-v1-0-852f38e7da39@intel.com>
From: Kohei Enju <kohei@enjuk.jp>
i40e advertises IFF_SUPP_NOFCS, allowing users to use the SO_NOFCS
socket option. However, this option is silently ignored, as the driver
does not check skb->no_fcs, and always enables FCS insertion offload.
Fix this by removing the advertisement of IFF_SUPP_NOFCS.
This behavior can be reproduced with a simple AF_PACKET socket:
import socket
s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW)
s.setsockopt(socket.SOL_SOCKET, 43, 1) # SO_NOFCS
s.bind(("eth0", 0))
s.send(b'\xff' * 64)
Previously, send() succeeds but the driver ignores SO_NOFCS.
With this change, send() fails with -EPROTONOSUPPORT, as expected.
Fixes: 41c445ff0f48 ("i40e: main driver core")
Signed-off-by: Kohei Enju <kohei@enjuk.jp>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
drivers/net/ethernet/intel/i40e/i40e_main.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 926d001b2150..028bd500603a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -13783,7 +13783,6 @@ static int i40e_config_netdev(struct i40e_vsi *vsi)
netdev->neigh_priv_len = sizeof(u32) * 4;
netdev->priv_flags |= IFF_UNICAST_FLT;
- netdev->priv_flags |= IFF_SUPP_NOFCS;
/* Setup netdev TC information */
i40e_vsi_config_netdev_tc(vsi, vsi->tc_config.enabled_tc);
--
2.53.0.1066.g1eceb487f285
* [PATCH net 08/13] ice: fix potential NULL pointer deref in error path of ice_set_ringparam()
From: Jacob Keller @ 2026-04-15 5:48 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, Jacob Keller, Kohei Enju, Paul Greenwalt, Rinitha S
In-Reply-To: <20260414-iwl-net-submission-2026-04-14-v1-0-852f38e7da39@intel.com>
From: Kohei Enju <kohei@enjuk.jp>
ice_set_ringparam() NULLs the tstamp_ring of the temporary tx_rings
without clearing the ICE_TX_RING_FLAGS_TXTIME bit.
When ICE_TX_RING_FLAGS_TXTIME is set and the subsequent
ice_setup_tx_ring() call fails, a NULL pointer dereference could happen
in the unwinding sequence:
ice_clean_tx_ring()
-> ice_is_txtime_cfg() == true (ICE_TX_RING_FLAGS_TXTIME is set)
-> ice_free_tx_tstamp_ring()
-> ice_free_tstamp_ring()
-> tstamp_ring->desc (NULL deref)
Clear ICE_TX_RING_FLAGS_TXTIME bit to avoid the potential issue.
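For illustration only, the invariant being restored can be sketched in
user-space C. All mock_* names below are hypothetical stand-ins, not
driver code; the bool field plays the role of the
ICE_TX_RING_FLAGS_TXTIME bit. The idea is that the pointer and the flag
are dropped together, so a cleanup routine that trusts the flag never
sees a NULL pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's structures. */
struct mock_tstamp_ring { void *desc; };

struct mock_tx_ring {
	struct mock_tstamp_ring *tstamp_ring;
	bool txtime;	/* plays the role of ICE_TX_RING_FLAGS_TXTIME */
};

/* Unwind helper: trusts the flag, like the cleanup chain above.
 * Returns true if it had to touch the timestamp ring.
 */
static bool mock_clean_tx_ring(struct mock_tx_ring *ring)
{
	if (!ring->txtime)
		return false;
	/* With a stale flag this pointer would be NULL here. */
	assert(ring->tstamp_ring != NULL);
	ring->tstamp_ring->desc = NULL;
	return true;
}

/* Prepare a temporary copy: drop the pointer and the flag together
 * so that they cannot disagree.
 */
static struct mock_tx_ring mock_prepare_tmp_ring(const struct mock_tx_ring *src)
{
	struct mock_tx_ring tmp = *src;

	tmp.tstamp_ring = NULL;
	tmp.txtime = false;
	return tmp;
}
```

Calling mock_clean_tx_ring() on a ring returned by
mock_prepare_tmp_ring() is a no-op, which is the behaviour the unwind
path relies on.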
Note that this potential issue was found by manual code review.
Compile tested only, since unfortunately I don't have E830 devices.
Fixes: ccde82e90946 ("ice: add E830 Earliest TxTime First Offload support")
Signed-off-by: Kohei Enju <kohei@enjuk.jp>
Reviewed-by: Paul Greenwalt <paul.greenwalt@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
drivers/net/ethernet/intel/ice/ice_ethtool.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
index e6a20af6f63d..f28416a707d7 100644
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
@@ -3290,6 +3290,7 @@ ice_set_ringparam(struct net_device *netdev, struct ethtool_ringparam *ring,
tx_rings[i].desc = NULL;
tx_rings[i].tx_buf = NULL;
tx_rings[i].tstamp_ring = NULL;
+ clear_bit(ICE_TX_RING_FLAGS_TXTIME, tx_rings[i].flags);
tx_rings[i].tx_tstamps = &pf->ptp.port.tx;
err = ice_setup_tx_ring(&tx_rings[i]);
if (err) {
--
2.53.0.1066.g1eceb487f285
* [PATCH net 07/13] ice: fix race condition in TX timestamp ring cleanup
From: Jacob Keller @ 2026-04-15 5:48 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, Jacob Keller, Keita Morisaki, Aleksandr Loktionov,
Rinitha S
In-Reply-To: <20260414-iwl-net-submission-2026-04-14-v1-0-852f38e7da39@intel.com>
From: Keita Morisaki <kmta1236@gmail.com>
Fix a race condition between ice_free_tx_tstamp_ring() and ice_tx_map()
that can cause a NULL pointer dereference.
ice_free_tx_tstamp_ring() currently clears the ICE_TX_FLAGS_TXTIME flag
after NULLing the tstamp_ring. A concurrent ice_tx_map() call on another
CPU can therefore still see the flag set and dereference the tstamp_ring
pointer after it has already been set to NULL.
CPU A:ice_free_tx_tstamp_ring() | CPU B:ice_tx_map()
--------------------------------|---------------------------------
tx_ring->tstamp_ring = NULL |
| ice_is_txtime_cfg() -> true
| tstamp_ring = tx_ring->tstamp_ring
| tstamp_ring->count // NULL deref!
flags &= ~ICE_TX_FLAGS_TXTIME |
Fix by:
1. Reordering ice_free_tx_tstamp_ring() to clear the flag before
NULLing the pointer, with smp_wmb() to ensure proper ordering.
2. Adding smp_rmb() in ice_tx_map() after the flag check to order the
flag read before the pointer read, using READ_ONCE() for the
pointer, and adding a NULL check as a safety net.
3. Converting tx_ring->flags from u8 to DECLARE_BITMAP() and using
atomic bitops (set_bit(), clear_bit(), test_bit()) for all flag
operations throughout the driver:
- ICE_TX_RING_FLAGS_XDP
- ICE_TX_RING_FLAGS_VLAN_L2TAG1
- ICE_TX_RING_FLAGS_VLAN_L2TAG2
- ICE_TX_RING_FLAGS_TXTIME
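As a rough user-space analogue of points 1 and 2, the ordering can be
sketched with C11 atomics. This is only a sketch under assumptions: the
mock_* names are hypothetical, and the release/acquire fences merely
approximate what smp_wmb()/smp_rmb() provide in the kernel. The writer
clears the flag before publishing the NULL pointer, and the reader
checks the flag, then loads the pointer once and still checks it for
NULL.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mock_tstamp_ring { unsigned int count; };

struct mock_tx_ring {
	atomic_bool txtime;	/* stands in for the TXTIME flag bit */
	_Atomic(struct mock_tstamp_ring *) tstamp_ring;
};

/* Teardown: clear the flag first, then publish the NULL pointer. */
static void mock_free_tstamp_ring(struct mock_tx_ring *ring)
{
	assert(ring != NULL);
	atomic_store_explicit(&ring->txtime, false, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* roughly smp_wmb() */
	atomic_store_explicit(&ring->tstamp_ring, NULL, memory_order_relaxed);
}

/* Transmit side: returns the timestamp ring count, or 0 when the
 * plain tail-bump path should be taken instead.
 */
static unsigned int mock_tx_map(struct mock_tx_ring *ring)
{
	struct mock_tstamp_ring *ts;

	assert(ring != NULL);
	if (!atomic_load_explicit(&ring->txtime, memory_order_relaxed))
		return 0;
	atomic_thread_fence(memory_order_acquire);	/* roughly smp_rmb() */
	ts = atomic_load_explicit(&ring->tstamp_ring, memory_order_relaxed);
	if (!ts)	/* safety net: flag seen set, pointer already gone */
		return 0;
	return ts->count;
}
```

Even if a reader observes a stale flag, the single pointer load plus the
NULL check keeps it on the fallback path rather than dereferencing NULL.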
Fixes: ccde82e909467 ("ice: add E830 Earliest TxTime First Offload support")
Signed-off-by: Keita Morisaki <kmta1236@gmail.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
drivers/net/ethernet/intel/ice/ice.h | 4 ++--
drivers/net/ethernet/intel/ice/ice_txrx.h | 16 ++++++++++------
drivers/net/ethernet/intel/ice/ice_dcb_lib.c | 2 +-
drivers/net/ethernet/intel/ice/ice_lib.c | 4 ++--
drivers/net/ethernet/intel/ice/ice_txrx.c | 23 ++++++++++++++++-------
5 files changed, 31 insertions(+), 18 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index eb3a48330cc1..725b130dd3a2 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -753,7 +753,7 @@ static inline bool ice_is_xdp_ena_vsi(struct ice_vsi *vsi)
static inline void ice_set_ring_xdp(struct ice_tx_ring *ring)
{
- ring->flags |= ICE_TX_FLAGS_RING_XDP;
+ set_bit(ICE_TX_RING_FLAGS_XDP, ring->flags);
}
/**
@@ -778,7 +778,7 @@ static inline bool ice_is_txtime_ena(const struct ice_tx_ring *ring)
*/
static inline bool ice_is_txtime_cfg(const struct ice_tx_ring *ring)
{
- return !!(ring->flags & ICE_TX_FLAGS_TXTIME);
+ return test_bit(ICE_TX_RING_FLAGS_TXTIME, ring->flags);
}
/**
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index b6547e1b7c42..5e517f219379 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -212,6 +212,14 @@ enum ice_rx_dtype {
ICE_RX_DTYPE_SPLIT_ALWAYS = 2,
};
+enum ice_tx_ring_flags {
+ ICE_TX_RING_FLAGS_XDP,
+ ICE_TX_RING_FLAGS_VLAN_L2TAG1,
+ ICE_TX_RING_FLAGS_VLAN_L2TAG2,
+ ICE_TX_RING_FLAGS_TXTIME,
+ ICE_TX_RING_FLAGS_NBITS,
+};
+
struct ice_pkt_ctx {
u64 cached_phctime;
__be16 vlan_proto;
@@ -352,11 +360,7 @@ struct ice_tx_ring {
u16 count; /* Number of descriptors */
u16 q_index; /* Queue number of ring */
- u8 flags;
-#define ICE_TX_FLAGS_RING_XDP BIT(0)
-#define ICE_TX_FLAGS_RING_VLAN_L2TAG1 BIT(1)
-#define ICE_TX_FLAGS_RING_VLAN_L2TAG2 BIT(2)
-#define ICE_TX_FLAGS_TXTIME BIT(3)
+ DECLARE_BITMAP(flags, ICE_TX_RING_FLAGS_NBITS);
struct xsk_buff_pool *xsk_pool;
@@ -398,7 +402,7 @@ static inline bool ice_ring_ch_enabled(struct ice_tx_ring *ring)
static inline bool ice_ring_is_xdp(struct ice_tx_ring *ring)
{
- return !!(ring->flags & ICE_TX_FLAGS_RING_XDP);
+ return test_bit(ICE_TX_RING_FLAGS_XDP, ring->flags);
}
enum ice_container_type {
diff --git a/drivers/net/ethernet/intel/ice/ice_dcb_lib.c b/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
index bd77f1c001ee..16aa25535152 100644
--- a/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
@@ -943,7 +943,7 @@ ice_tx_prepare_vlan_flags_dcb(struct ice_tx_ring *tx_ring,
/* if this is not already set it means a VLAN 0 + priority needs
* to be offloaded
*/
- if (tx_ring->flags & ICE_TX_FLAGS_RING_VLAN_L2TAG2)
+ if (test_bit(ICE_TX_RING_FLAGS_VLAN_L2TAG2, tx_ring->flags))
first->tx_flags |= ICE_TX_FLAGS_HW_OUTER_SINGLE_VLAN;
else
first->tx_flags |= ICE_TX_FLAGS_HW_VLAN;
diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
index 689c6025ea82..837b71b7b2b7 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -1412,9 +1412,9 @@ static int ice_vsi_alloc_rings(struct ice_vsi *vsi)
ring->count = vsi->num_tx_desc;
ring->txq_teid = ICE_INVAL_TEID;
if (dvm_ena)
- ring->flags |= ICE_TX_FLAGS_RING_VLAN_L2TAG2;
+ set_bit(ICE_TX_RING_FLAGS_VLAN_L2TAG2, ring->flags);
else
- ring->flags |= ICE_TX_FLAGS_RING_VLAN_L2TAG1;
+ set_bit(ICE_TX_RING_FLAGS_VLAN_L2TAG1, ring->flags);
WRITE_ONCE(vsi->tx_rings[i], ring);
}
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 7be9c062949b..4ca1a0602307 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -190,9 +190,10 @@ void ice_free_tstamp_ring(struct ice_tx_ring *tx_ring)
void ice_free_tx_tstamp_ring(struct ice_tx_ring *tx_ring)
{
ice_free_tstamp_ring(tx_ring);
+ clear_bit(ICE_TX_RING_FLAGS_TXTIME, tx_ring->flags);
+ smp_wmb(); /* order flag clear before pointer NULL */
kfree_rcu(tx_ring->tstamp_ring, rcu);
- tx_ring->tstamp_ring = NULL;
- tx_ring->flags &= ~ICE_TX_FLAGS_TXTIME;
+ WRITE_ONCE(tx_ring->tstamp_ring, NULL);
}
/**
@@ -405,7 +406,7 @@ static int ice_alloc_tstamp_ring(struct ice_tx_ring *tx_ring)
tx_ring->tstamp_ring = tstamp_ring;
tstamp_ring->desc = NULL;
tstamp_ring->count = ice_calc_ts_ring_count(tx_ring);
- tx_ring->flags |= ICE_TX_FLAGS_TXTIME;
+ set_bit(ICE_TX_RING_FLAGS_TXTIME, tx_ring->flags);
return 0;
}
@@ -1521,13 +1522,20 @@ ice_tx_map(struct ice_tx_ring *tx_ring, struct ice_tx_buf *first,
return;
if (ice_is_txtime_cfg(tx_ring)) {
- struct ice_tstamp_ring *tstamp_ring = tx_ring->tstamp_ring;
- u32 tstamp_count = tstamp_ring->count;
- u32 j = tstamp_ring->next_to_use;
+ struct ice_tstamp_ring *tstamp_ring;
+ u32 tstamp_count, j;
struct ice_ts_desc *ts_desc;
struct timespec64 ts;
u32 tstamp;
+ smp_rmb(); /* order flag read before pointer read */
+ tstamp_ring = READ_ONCE(tx_ring->tstamp_ring);
+ if (unlikely(!tstamp_ring))
+ goto ring_kick;
+
+ tstamp_count = tstamp_ring->count;
+ j = tstamp_ring->next_to_use;
+
ts = ktime_to_timespec64(first->skb->tstamp);
tstamp = ts.tv_nsec >> ICE_TXTIME_CTX_RESOLUTION_128NS;
@@ -1555,6 +1563,7 @@ ice_tx_map(struct ice_tx_ring *tx_ring, struct ice_tx_buf *first,
tstamp_ring->next_to_use = j;
writel_relaxed(j, tstamp_ring->tail);
} else {
+ring_kick:
writel_relaxed(i, tx_ring->tail);
}
return;
@@ -1814,7 +1823,7 @@ ice_tx_prepare_vlan_flags(struct ice_tx_ring *tx_ring, struct ice_tx_buf *first)
*/
if (skb_vlan_tag_present(skb)) {
first->vid = skb_vlan_tag_get(skb);
- if (tx_ring->flags & ICE_TX_FLAGS_RING_VLAN_L2TAG2)
+ if (test_bit(ICE_TX_RING_FLAGS_VLAN_L2TAG2, tx_ring->flags))
first->tx_flags |= ICE_TX_FLAGS_HW_OUTER_SINGLE_VLAN;
else
first->tx_flags |= ICE_TX_FLAGS_HW_VLAN;
--
2.53.0.1066.g1eceb487f285
* [PATCH net 06/13] ice: fix ICE_AQ_LINK_SPEED_M for 200G
From: Jacob Keller @ 2026-04-15 5:48 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, Jacob Keller, Paul Greenwalt, Aleksandr Loktionov,
Simon Horman, Sunitha Mekala
In-Reply-To: <20260414-iwl-net-submission-2026-04-14-v1-0-852f38e7da39@intel.com>
From: Paul Greenwalt <paul.greenwalt@intel.com>
When setting PHY configuration during driver initialization, 200G link
speed is not being advertised even when the PHY is capable. This is
because the get PHY capabilities link speed response is being masked by
ICE_AQ_LINK_SPEED_M, which does not include the 200G link speed bit.
ICE_AQ_LINK_SPEED_200GB is defined as BIT(11), but the mask 0x7FF only
covers bits 0-10. Fix ICE_AQ_LINK_SPEED_M to use GENMASK(11, 0) so
that it covers all defined link speed bits including 200G.
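The arithmetic can be checked with a small user-space sketch. MOCK_BIT()
and MOCK_GENMASK() are hypothetical re-implementations of the kernel
helpers for 32-bit values, written here only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* User-space stand-ins for the kernel's BIT() and GENMASK() helpers. */
#define MOCK_BIT(n)		(1u << (n))
#define MOCK_GENMASK(h, l)	(((~0u) << (l)) & (~0u >> (31 - (h))))

#define MOCK_SPEED_200GB	MOCK_BIT(11)
#define MOCK_SPEED_M_OLD	0x7FFu			/* bits 0-10 only */
#define MOCK_SPEED_M_NEW	MOCK_GENMASK(11, 0)	/* bits 0-11 */

/* Apply a speed mask to a raw 16-bit capability word. */
static uint16_t mock_mask_speed(uint16_t raw, uint16_t mask)
{
	return raw & mask;
}
```

With the old mask, mock_mask_speed(MOCK_SPEED_200GB, MOCK_SPEED_M_OLD)
is zero, so the 200G bit is lost; with the new mask the bit survives.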
Fixes: 24407a01e57c ("ice: Add 200G speed/phy type use")
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
drivers/net/ethernet/intel/ice/ice_adminq_cmd.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
index 859e9c66f3e7..3cbb1b0582e3 100644
--- a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
@@ -1252,7 +1252,7 @@ struct ice_aqc_get_link_status_data {
#define ICE_AQ_LINK_PWR_QSFP_CLASS_3 2
#define ICE_AQ_LINK_PWR_QSFP_CLASS_4 3
__le16 link_speed;
-#define ICE_AQ_LINK_SPEED_M 0x7FF
+#define ICE_AQ_LINK_SPEED_M GENMASK(11, 0)
#define ICE_AQ_LINK_SPEED_10MB BIT(0)
#define ICE_AQ_LINK_SPEED_100MB BIT(1)
#define ICE_AQ_LINK_SPEED_1000MB BIT(2)
--
2.53.0.1066.g1eceb487f285
* [PATCH net 05/13] ice: fix PHY config on media change with link-down-on-close
From: Jacob Keller @ 2026-04-15 5:48 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, Jacob Keller, Paul Greenwalt, Przemek Kitszel,
Aleksandr Loktionov, Sunitha Mekala
In-Reply-To: <20260414-iwl-net-submission-2026-04-14-v1-0-852f38e7da39@intel.com>
From: Paul Greenwalt <paul.greenwalt@intel.com>
Commit 1a3571b5938c ("ice: restore PHY settings on media insertion")
introduced separate flows for setting PHY configuration on media
present: ice_configure_phy() when link-down-on-close is disabled, and
ice_force_phys_link_state() when enabled. The latter incorrectly uses
the previous configuration even after module change, causing link
issues such as wrong speed or no link.
Unify PHY configuration into a single ice_phy_cfg() function with a
link_en parameter, ensuring PHY capabilities are always fetched fresh
from hardware.
Fixes: 1a3571b5938c ("ice: restore PHY settings on media insertion")
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
drivers/net/ethernet/intel/ice/ice_main.c | 121 +++++++-----------------------
1 file changed, 27 insertions(+), 94 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 3c36e3641b9e..ce3a0afe302d 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -1922,82 +1922,6 @@ static void ice_handle_mdd_event(struct ice_pf *pf)
ice_print_vfs_mdd_events(pf);
}
-/**
- * ice_force_phys_link_state - Force the physical link state
- * @vsi: VSI to force the physical link state to up/down
- * @link_up: true/false indicates to set the physical link to up/down
- *
- * Force the physical link state by getting the current PHY capabilities from
- * hardware and setting the PHY config based on the determined capabilities. If
- * link changes a link event will be triggered because both the Enable Automatic
- * Link Update and LESM Enable bits are set when setting the PHY capabilities.
- *
- * Returns 0 on success, negative on failure
- */
-static int ice_force_phys_link_state(struct ice_vsi *vsi, bool link_up)
-{
- struct ice_aqc_get_phy_caps_data *pcaps;
- struct ice_aqc_set_phy_cfg_data *cfg;
- struct ice_port_info *pi;
- struct device *dev;
- int retcode;
-
- if (!vsi || !vsi->port_info || !vsi->back)
- return -EINVAL;
- if (vsi->type != ICE_VSI_PF)
- return 0;
-
- dev = ice_pf_to_dev(vsi->back);
-
- pi = vsi->port_info;
-
- pcaps = kzalloc_obj(*pcaps);
- if (!pcaps)
- return -ENOMEM;
-
- retcode = ice_aq_get_phy_caps(pi, false, ICE_AQC_REPORT_ACTIVE_CFG, pcaps,
- NULL);
- if (retcode) {
- dev_err(dev, "Failed to get phy capabilities, VSI %d error %d\n",
- vsi->vsi_num, retcode);
- retcode = -EIO;
- goto out;
- }
-
- /* No change in link */
- if (link_up == !!(pcaps->caps & ICE_AQC_PHY_EN_LINK) &&
- link_up == !!(pi->phy.link_info.link_info & ICE_AQ_LINK_UP))
- goto out;
-
- /* Use the current user PHY configuration. The current user PHY
- * configuration is initialized during probe from PHY capabilities
- * software mode, and updated on set PHY configuration.
- */
- cfg = kmemdup(&pi->phy.curr_user_phy_cfg, sizeof(*cfg), GFP_KERNEL);
- if (!cfg) {
- retcode = -ENOMEM;
- goto out;
- }
-
- cfg->caps |= ICE_AQ_PHY_ENA_AUTO_LINK_UPDT;
- if (link_up)
- cfg->caps |= ICE_AQ_PHY_ENA_LINK;
- else
- cfg->caps &= ~ICE_AQ_PHY_ENA_LINK;
-
- retcode = ice_aq_set_phy_cfg(&vsi->back->hw, pi, cfg, NULL);
- if (retcode) {
- dev_err(dev, "Failed to set phy config, VSI %d error %d\n",
- vsi->vsi_num, retcode);
- retcode = -EIO;
- }
-
- kfree(cfg);
-out:
- kfree(pcaps);
- return retcode;
-}
-
/**
* ice_init_nvm_phy_type - Initialize the NVM PHY type
* @pi: port info structure
@@ -2066,7 +1990,7 @@ static void ice_init_link_dflt_override(struct ice_port_info *pi)
* first time media is available. The ICE_LINK_DEFAULT_OVERRIDE_PENDING state
* is used to indicate that the user PHY cfg default override is initialized
* and the PHY has not been configured with the default override settings. The
- * state is set here, and cleared in ice_configure_phy the first time the PHY is
+ * state is set here, and cleared in ice_phy_cfg the first time the PHY is
* configured.
*
* This function should be called only if the FW doesn't support default
@@ -2172,14 +2096,18 @@ static int ice_init_phy_user_cfg(struct ice_port_info *pi)
}
/**
- * ice_configure_phy - configure PHY
+ * ice_phy_cfg - configure PHY
* @vsi: VSI of PHY
+ * @link_en: true/false indicates to set link to enable/disable
*
* Set the PHY configuration. If the current PHY configuration is the same as
- * the curr_user_phy_cfg, then do nothing to avoid link flap. Otherwise
- * configure the based get PHY capabilities for topology with media.
+ * the curr_user_phy_cfg and link_en hasn't changed, then do nothing to avoid
+ * link flap. Otherwise configure the PHY based get PHY capabilities for
+ * topology with media and link_en.
+ *
+ * Return: 0 on success, negative on failure
*/
-static int ice_configure_phy(struct ice_vsi *vsi)
+static int ice_phy_cfg(struct ice_vsi *vsi, bool link_en)
{
struct device *dev = ice_pf_to_dev(vsi->back);
struct ice_port_info *pi = vsi->port_info;
@@ -2199,9 +2127,6 @@ static int ice_configure_phy(struct ice_vsi *vsi)
phy->link_info.topo_media_conflict == ICE_AQ_LINK_TOPO_UNSUPP_MEDIA)
return -EPERM;
- if (test_bit(ICE_FLAG_LINK_DOWN_ON_CLOSE_ENA, pf->flags))
- return ice_force_phys_link_state(vsi, true);
-
pcaps = kzalloc_obj(*pcaps);
if (!pcaps)
return -ENOMEM;
@@ -2215,10 +2140,8 @@ static int ice_configure_phy(struct ice_vsi *vsi)
goto done;
}
- /* If PHY enable link is configured and configuration has not changed,
- * there's nothing to do
- */
- if (pcaps->caps & ICE_AQC_PHY_EN_LINK &&
+ /* Configuration has not changed. There's nothing to do. */
+ if (link_en == !!(pcaps->caps & ICE_AQC_PHY_EN_LINK) &&
ice_phy_caps_equals_cfg(pcaps, &phy->curr_user_phy_cfg))
goto done;
@@ -2282,8 +2205,12 @@ static int ice_configure_phy(struct ice_vsi *vsi)
*/
ice_cfg_phy_fc(pi, cfg, phy->curr_user_fc_req);
- /* Enable link and link update */
- cfg->caps |= ICE_AQ_PHY_ENA_AUTO_LINK_UPDT | ICE_AQ_PHY_ENA_LINK;
+ /* Enable/Disable link and link update */
+ cfg->caps |= ICE_AQ_PHY_ENA_AUTO_LINK_UPDT;
+ if (link_en)
+ cfg->caps |= ICE_AQ_PHY_ENA_LINK;
+ else
+ cfg->caps &= ~ICE_AQ_PHY_ENA_LINK;
err = ice_aq_set_phy_cfg(&pf->hw, pi, cfg, NULL);
if (err)
@@ -2336,7 +2263,7 @@ static void ice_check_media_subtask(struct ice_pf *pf)
test_bit(ICE_FLAG_LINK_DOWN_ON_CLOSE_ENA, vsi->back->flags))
return;
- err = ice_configure_phy(vsi);
+ err = ice_phy_cfg(vsi, true);
if (!err)
clear_bit(ICE_FLAG_NO_MEDIA, pf->flags);
@@ -4892,9 +4819,15 @@ static int ice_init_link(struct ice_pf *pf)
if (!test_bit(ICE_FLAG_LINK_DOWN_ON_CLOSE_ENA, pf->flags)) {
struct ice_vsi *vsi = ice_get_main_vsi(pf);
+ struct ice_link_default_override_tlv *ldo;
+ bool link_en;
+
+ ldo = &pf->link_dflt_override;
+ link_en = !(ldo->options &
+ ICE_LINK_OVERRIDE_AUTO_LINK_DIS);
if (vsi)
- ice_configure_phy(vsi);
+ ice_phy_cfg(vsi, link_en);
}
} else {
set_bit(ICE_FLAG_NO_MEDIA, pf->flags);
@@ -9707,7 +9640,7 @@ int ice_open_internal(struct net_device *netdev)
}
}
- err = ice_configure_phy(vsi);
+ err = ice_phy_cfg(vsi, true);
if (err) {
netdev_err(netdev, "Failed to set physical link up, error %d\n",
err);
@@ -9748,7 +9681,7 @@ int ice_stop(struct net_device *netdev)
}
if (test_bit(ICE_FLAG_LINK_DOWN_ON_CLOSE_ENA, vsi->back->flags)) {
- int link_err = ice_force_phys_link_state(vsi, false);
+ int link_err = ice_phy_cfg(vsi, false);
if (link_err) {
if (link_err == -ENOMEDIUM)
--
2.53.0.1066.g1eceb487f285
* [PATCH net 04/13] ice: fix double-free of tx_buf skb
From: Jacob Keller @ 2026-04-15 5:47 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, Jacob Keller, Michal Schmidt
In-Reply-To: <20260414-iwl-net-submission-2026-04-14-v1-0-852f38e7da39@intel.com>
From: Michal Schmidt <mschmidt@redhat.com>
If ice_tso() or ice_tx_csum() fail, the error path in
ice_xmit_frame_ring() frees the skb, but the 'first' tx_buf still points
to it and is marked as valid (ICE_TX_BUF_SKB).
'next_to_use' remains unchanged, so the potential problem will
likely fix itself when the next packet is transmitted and the tx_buf
gets overwritten. But if there is no next packet and the interface is
brought down instead, ice_clean_tx_ring() -> ice_unmap_and_free_tx_buf()
will find the tx_buf and free the skb for the second time.
The fix is to reset the tx_buf type to ICE_TX_BUF_EMPTY in the error
path, so that ice_unmap_and_free_tx_buf() skips the buffer instead of
freeing the skb again.
Move the initialization of 'first' up, to ensure it's already valid in
case we hit the linearization error path.
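A minimal user-space sketch of the pattern, assuming hypothetical mock_*
names and using malloc()/free() in place of skb handling: the drop path
frees the buffer and marks the slot empty, and the cleanup path only
frees slots that are still marked as holding a buffer. A counter makes a
second free visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum mock_buf_type { MOCK_BUF_EMPTY, MOCK_BUF_SKB };

struct mock_tx_buf {
	enum mock_buf_type type;
	void *skb;
};

static int mock_free_count;	/* counts frees so a double free is visible */

static void mock_free_skb(void *skb)
{
	free(skb);
	mock_free_count++;
}

/* Error path of the transmit routine: free the skb and mark the
 * slot empty so that ring cleanup will not free it again.
 */
static void mock_xmit_drop(struct mock_tx_buf *first)
{
	assert(first->type == MOCK_BUF_SKB);
	mock_free_skb(first->skb);
	first->type = MOCK_BUF_EMPTY;
}

/* Ring cleanup: only frees slots still marked as holding an skb. */
static void mock_clean_buf(struct mock_tx_buf *buf)
{
	if (buf->type == MOCK_BUF_SKB)
		mock_free_skb(buf->skb);
	buf->type = MOCK_BUF_EMPTY;
	buf->skb = NULL;
}
```

Running mock_xmit_drop() followed by mock_clean_buf() on the same slot
leaves mock_free_count at 1. Without the type reset in the drop path,
the cleanup would free the same pointer a second time.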
The bug was spotted by AI while I had it looking for something else.
It also proposed an initial version of the patch.
I reproduced the bug and tested the fix by adding code to inject
failures, on a build with KASAN.
I looked for similar bugs in related Intel drivers and did not find any.
Fixes: d76a60ba7afb ("ice: Add support for VLANs and offloads")
Assisted-by: Claude:claude-4.6-opus-high Cursor
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
drivers/net/ethernet/intel/ice/ice_txrx.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index a2cd4cf37734..7be9c062949b 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -2158,6 +2158,9 @@ ice_xmit_frame_ring(struct sk_buff *skb, struct ice_tx_ring *tx_ring)
ice_trace(xmit_frame_ring, tx_ring, skb);
+ /* record the location of the first descriptor for this packet */
+ first = &tx_ring->tx_buf[tx_ring->next_to_use];
+
count = ice_xmit_desc_count(skb);
if (ice_chk_linearize(skb, count)) {
if (__skb_linearize(skb))
@@ -2183,8 +2186,6 @@ ice_xmit_frame_ring(struct sk_buff *skb, struct ice_tx_ring *tx_ring)
offload.tx_ring = tx_ring;
- /* record the location of the first descriptor for this packet */
- first = &tx_ring->tx_buf[tx_ring->next_to_use];
first->skb = skb;
first->type = ICE_TX_BUF_SKB;
first->bytecount = max_t(unsigned int, skb->len, ETH_ZLEN);
@@ -2249,6 +2250,7 @@ ice_xmit_frame_ring(struct sk_buff *skb, struct ice_tx_ring *tx_ring)
out_drop:
ice_trace(xmit_frame_ring_drop, tx_ring, skb);
dev_kfree_skb_any(skb);
+ first->type = ICE_TX_BUF_EMPTY;
return NETDEV_TX_OK;
}
--
2.53.0.1066.g1eceb487f285
* [PATCH net 03/13] ice: fix double free in ice_sf_eth_activate() error path
From: Jacob Keller @ 2026-04-15 5:47 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, Jacob Keller, Guangshuo Li, stable, Aleksandr Loktionov,
Simon Horman
In-Reply-To: <20260414-iwl-net-submission-2026-04-14-v1-0-852f38e7da39@intel.com>
From: Guangshuo Li <lgs201920130244@gmail.com>
When auxiliary_device_add() fails, ice_sf_eth_activate() jumps to
aux_dev_uninit and calls auxiliary_device_uninit(&sf_dev->adev).
The device release callback ice_sf_dev_release() frees sf_dev, but
the current error path falls through to sf_dev_free and calls
kfree(sf_dev) again, causing a double free.
Keep kfree(sf_dev) for the auxiliary_device_init() failure path, but
avoid falling through to sf_dev_free after auxiliary_device_uninit().
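The shape of the corrected unwind can be sketched in user-space C. The
mock_* names are hypothetical, and mock_uninit() only imitates the fact
that auxiliary_device_uninit() ends up invoking the release callback,
which owns the final free. The success path tears the object down
immediately purely so the sketch does not leak.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

struct mock_sf_dev { int id; };

static int mock_free_count;	/* counts frees so a double free is visible */

/* Stand-in for the device release callback, which owns the final free. */
static void mock_release(struct mock_sf_dev *dev)
{
	free(dev);
	mock_free_count++;
}

/* Stand-in for the uninit call: drops the last reference. */
static void mock_uninit(struct mock_sf_dev *dev)
{
	mock_release(dev);
}

static int mock_activate(bool init_fails, bool add_fails)
{
	struct mock_sf_dev *dev = malloc(sizeof(*dev));
	int err;

	if (!dev)
		return -ENOMEM;

	if (init_fails) {
		err = -EINVAL;
		goto dev_free;		/* release callback not armed yet */
	}

	if (add_fails) {
		err = -EIO;
		goto dev_uninit;
	}

	mock_uninit(dev);		/* normal teardown, for the demo only */
	return 0;

dev_uninit:
	mock_uninit(dev);		/* frees dev via mock_release() */
	return err;			/* do not fall through to dev_free */

dev_free:
	free(dev);
	mock_free_count++;
	return err;
}
```

Each call to mock_activate() frees the object exactly once, whichever
failure is injected.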
Fixes: 13acc5c4cdbe ("ice: subfunction activation and base devlink ops")
Cc: stable@vger.kernel.org
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
drivers/net/ethernet/intel/ice/ice_sf_eth.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/net/ethernet/intel/ice/ice_sf_eth.c b/drivers/net/ethernet/intel/ice/ice_sf_eth.c
index 2cf04bc6edce..a730aa368c92 100644
--- a/drivers/net/ethernet/intel/ice/ice_sf_eth.c
+++ b/drivers/net/ethernet/intel/ice/ice_sf_eth.c
@@ -305,6 +305,8 @@ ice_sf_eth_activate(struct ice_dynamic_port *dyn_port,
aux_dev_uninit:
auxiliary_device_uninit(&sf_dev->adev);
+ return err;
+
sf_dev_free:
kfree(sf_dev);
xa_erase:
--
2.53.0.1066.g1eceb487f285
* [PATCH net 02/13] ice: update PCS latency settings for E825 10G/25Gb modes
From: Jacob Keller @ 2026-04-15 5:47 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, Jacob Keller, Grzegorz Nitka, Zoltan Fodor,
Aleksandr Loktionov, Sunitha Mekala
In-Reply-To: <20260414-iwl-net-submission-2026-04-14-v1-0-852f38e7da39@intel.com>
From: Grzegorz Nitka <grzegorz.nitka@intel.com>
Update the MAC Rx/Tx offset register settings (PHY_MAC_[RX|TX]_OFFSET
registers) with data obtained from the latest research. This applies
to the PCS latency settings for the following speeds/modes:
* 10Gb NO-FEC
- TX latency changed from 71.25 ns to 73 ns
- RX latency changed from -25.6 ns to -28 ns
* 25Gb NO-FEC
- TX latency changed from 28.17 ns to 33 ns
- RX latency changed from -12.45 ns to -12 ns
* 25Gb RS-FEC
- TX latency changed from 64.5 ns to 69 ns
- RX latency changed from -3.6 ns to -3 ns
The original data came from simulation and pre-production hardware.
The new data measures the actual delays and as such is more accurate.
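For readers cross-checking the table values, a small decoding helper may
be useful. It rests on an assumption inferred from the existing entries
rather than from the datasheet: the register words look like signed
fixed-point nanoseconds with 9 fractional bits (for example, the old
10Gb NO-FEC TX word 0x8e80 is 36480, and 36480 / 512 = 71.25). The
mock_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 9 fractional bits, i.e. one LSB is 1/512 ns. */
#define MOCK_OFFSET_FRAC_BITS	9

/* Decode a raw offset word into nanoseconds. Negative offsets are
 * stored in two's complement, so reinterpret as signed first (this
 * conversion is implementation-defined in C, but behaves as expected
 * on common compilers).
 */
static double mock_offset_to_ns(uint32_t raw)
{
	return (double)(int32_t)raw / (double)(1 << MOCK_OFFSET_FRAC_BITS);
}
```

Under that assumption, mock_offset_to_ns(0x8a00) gives 69.0, which
matches the new 25Gb RS-FEC TX latency listed above.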
Fixes: 7cab44f1c35f ("ice: Introduce ETH56G PHY model for E825C products")
Co-developed-by: Zoltan Fodor <zoltan.fodor@intel.com>
Signed-off-by: Zoltan Fodor <zoltan.fodor@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
drivers/net/ethernet/intel/ice/ice_ptp_consts.h | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_ptp_consts.h b/drivers/net/ethernet/intel/ice/ice_ptp_consts.h
index 19dddd9b53dd..4d298c27bfb2 100644
--- a/drivers/net/ethernet/intel/ice/ice_ptp_consts.h
+++ b/drivers/net/ethernet/intel/ice/ice_ptp_consts.h
@@ -78,14 +78,14 @@ struct ice_eth56g_mac_reg_cfg eth56g_mac_cfg[NUM_ICE_ETH56G_LNK_SPD] = {
.blktime = 0x666, /* 3.2 */
.tx_offset = {
.serdes = 0x234c, /* 17.6484848 */
- .no_fec = 0x8e80, /* 71.25 */
+ .no_fec = 0x93d9, /* 73 */
.fc = 0xb4a4, /* 90.32 */
.sfd = 0x4a4, /* 2.32 */
.onestep = 0x4ccd /* 38.4 */
},
.rx_offset = {
.serdes = 0xffffeb27, /* -10.42424 */
- .no_fec = 0xffffcccd, /* -25.6 */
+ .no_fec = 0xffffc7b6, /* -28 */
.fc = 0xfffc557b, /* -469.26 */
.sfd = 0x4a4, /* 2.32 */
.bs_ds = 0x32 /* 0.0969697 */
@@ -118,17 +118,17 @@ struct ice_eth56g_mac_reg_cfg eth56g_mac_cfg[NUM_ICE_ETH56G_LNK_SPD] = {
.mktime = 0x147b, /* 10.24, only if RS-FEC enabled */
.tx_offset = {
.serdes = 0xe1e, /* 7.0593939 */
- .no_fec = 0x3857, /* 28.17 */
+ .no_fec = 0x4266, /* 33 */
.fc = 0x48c3, /* 36.38 */
- .rs = 0x8100, /* 64.5 */
+ .rs = 0x8a00, /* 69 */
.sfd = 0x1dc, /* 0.93 */
.onestep = 0x1eb8 /* 15.36 */
},
.rx_offset = {
.serdes = 0xfffff7a9, /* -4.1697 */
- .no_fec = 0xffffe71a, /* -12.45 */
+ .no_fec = 0xffffe700, /* -12 */
.fc = 0xfffe894d, /* -187.35 */
- .rs = 0xfffff8cd, /* -3.6 */
+ .rs = 0xfffff8cc, /* -3 */
.sfd = 0x1dc, /* 0.93 */
.bs_ds = 0x14 /* 0.0387879, RS-FEC 0 */
}
--
2.53.0.1066.g1eceb487f285
* [PATCH net 01/13] ice: fix 'adjust' timer programming for E830 devices
From: Jacob Keller @ 2026-04-15 5:47 UTC
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, Jacob Keller, Grzegorz Nitka, Aleksandr Loktionov,
Simon Horman, Rinitha S
In-Reply-To: <20260414-iwl-net-submission-2026-04-14-v1-0-852f38e7da39@intel.com>
From: Grzegorz Nitka <grzegorz.nitka@intel.com>
Fix the incorrect 'adjust the timer' programming sequence for E830
devices. The current implementation programs only the GLTSYN_SHADJ
shadow registers. According to the specification [1], a write to the
GLTSYN_CMD command register with the CMD field set to the "Adjust the
Time" value is also required for the timer adjustment to take effect.
The flow was broken for adjustments within the S32_MIN..S32_MAX range
(around +/- 2 seconds). Larger adjustments use the non-atomic
programming flow, which involves set timer programming and is
implemented correctly.
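To make the range concrete, here is a hedged sketch of the kind of
check that separates the two flows. mock_adj_fits_s32() is a
hypothetical helper, not a driver function; it only shows that a
nanosecond delta of 2 seconds still fits in a signed 32-bit value while
3 seconds does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when a nanosecond delta fits in a signed 32-bit adjustment,
 * i.e. when the atomic "adjust" command could be used. INT32_MAX
 * nanoseconds is roughly 2.147 seconds.
 */
static bool mock_adj_fits_s32(int64_t delta_ns)
{
	return delta_ns >= INT32_MIN && delta_ns <= INT32_MAX;
}
```

An adjustment of 2 seconds (2000000000 ns) therefore takes the
previously broken atomic path.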
Testing hints:
Run command:
phc_ctl /dev/ptpX get adj 2 get
Expected result:
Returned timestamps differ at least by 2 seconds
[1] Intel® Ethernet Controller E830 Datasheet rev 1.3, chapter 9.7.5.4
https://cdrdv2.intel.com/v1/dl/getContent/787353?explicitVersion=true
Fixes: f00307522786 ("ice: Implement PTP support for E830 devices")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
drivers/net/ethernet/intel/ice/ice_ptp_hw.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c
index 61c0a0d93ea8..5a5c511ccbb6 100644
--- a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c
+++ b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c
@@ -5381,8 +5381,8 @@ int ice_ptp_write_incval_locked(struct ice_hw *hw, u64 incval)
*/
int ice_ptp_adj_clock(struct ice_hw *hw, s32 adj)
{
+ int err = 0;
u8 tmr_idx;
- int err;
tmr_idx = hw->func_caps.ts_func_info.tmr_index_owned;
@@ -5399,8 +5399,8 @@ int ice_ptp_adj_clock(struct ice_hw *hw, s32 adj)
err = ice_ptp_prep_phy_adj_e810(hw, adj);
break;
case ICE_MAC_E830:
- /* E830 sync PHYs automatically after setting GLTSYN_SHADJ */
- return 0;
+ /* E830 sync PHYs automatically after setting cmd register */
+ break;
case ICE_MAC_GENERIC:
err = ice_ptp_prep_phy_adj_e82x(hw, adj);
break;
--
2.53.0.1066.g1eceb487f285
^ permalink raw reply related
* [PATCH net 00/13] Intel Wired LAN Driver Updates 2026-04-14 (ice, i40e, iavf, idpf, e1000e)
From: Jacob Keller @ 2026-04-15 5:47 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: netdev, Jacob Keller, Grzegorz Nitka, Aleksandr Loktionov,
Simon Horman, Rinitha S, Zoltan Fodor, Sunitha Mekala,
Guangshuo Li, stable, Michal Schmidt, Paul Greenwalt,
Przemek Kitszel, Keita Morisaki, Kohei Enju, Petr Oros,
Paul Menzel, Rafal Romanowski, Emil Tantilov, Patryk Holda,
Matt Vollrath, Avigail Dahan
Grzegorz updates the logic for adjusting the PTP hardware clock on E830,
fixing a bug that prevented adjustments below S32_MAX/MIN nanoseconds.
Grzegorz and Zoli update the PCS latency settings for E825 devices at 10GbE
and 25GbE, improving the accuracy of timestamps based on data from
production hardware.
Michal Schmidt fixes a double-free that could happen if a particular error
path is taken in ice_xmit_frame_ring().
Guangshuo fixes a double-free that could happen during error paths in the
ice_sf_eth_activate() function.
Paul Greenwalt fixes the PHY link configuration when the link-down-on-close
driver parameter is enabled and new media is inserted.
Paul Greenwalt fixes the ICE_AQ_LINK_SPEED_M macro for 200G, enabling 200G
link speed advertisement.
Keita Morisaki fixes a race condition in the ice Tx timestamp ring cleanup,
preventing a possible NULL pointer dereference.
Kohei Enju fixes a potential NULL pointer dereference in ice_set_ringparam().
Kohei Enju fixes i40e to stop advertising IFF_SUPP_NOFCS, when the driver
does not actually support the feature.
Aleksandr fixes i40e napi_enable/disable for q_vectors that no longer have
rings.
Petr fixes the VLAN L2TAG2 mask when the iAVF VF and a PF negotiate use of
the legacy Rx descriptor format.
Emil fixes a NULL pointer dereference that can happen in the soft reset if
a particular error path is taken.
Matt fixes the unrolling logic for PTP when the e1000e probe fails after
the PTP clock has been registered.
**A note to stable backports**
The patches [7/13] ("ice: fix race condition in TX timestamp ring
cleanup") and [8/13] ("ice: fix potential NULL pointer deref in error
path of ice_set_ringparam()") must be backported together. Otherwise the
fix in patch 8 will not work properly.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
Aleksandr Loktionov (1):
i40e: fix napi_enable/disable skipping ringless q_vectors
Emil Tantilov (1):
idpf: fix xdp crash in soft reset error path
Grzegorz Nitka (2):
ice: fix 'adjust' timer programming for E830 devices
ice: update PCS latency settings for E825 10G/25Gb modes
Guangshuo Li (1):
ice: fix double free in ice_sf_eth_activate() error path
Keita Morisaki (1):
ice: fix race condition in TX timestamp ring cleanup
Kohei Enju (2):
ice: fix potential NULL pointer deref in error path of ice_set_ringparam()
i40e: don't advertise IFF_SUPP_NOFCS
Matt Vollrath (1):
e1000e: Unroll PTP in probe error handling
Michal Schmidt (1):
ice: fix double-free of tx_buf skb
Paul Greenwalt (2):
ice: fix PHY config on media change with link-down-on-close
ice: fix ICE_AQ_LINK_SPEED_M for 200G
Petr Oros (1):
iavf: fix wrong VLAN mask for legacy Rx descriptors L2TAG2
drivers/net/ethernet/intel/iavf/iavf_type.h | 2 +-
drivers/net/ethernet/intel/ice/ice.h | 4 +-
drivers/net/ethernet/intel/ice/ice_adminq_cmd.h | 2 +-
drivers/net/ethernet/intel/ice/ice_ptp_consts.h | 12 +--
drivers/net/ethernet/intel/ice/ice_txrx.h | 16 ++--
drivers/net/ethernet/intel/e1000e/netdev.c | 1 +
drivers/net/ethernet/intel/i40e/i40e_main.c | 29 +++---
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 10 ++
drivers/net/ethernet/intel/ice/ice_dcb_lib.c | 2 +-
drivers/net/ethernet/intel/ice/ice_ethtool.c | 1 +
drivers/net/ethernet/intel/ice/ice_lib.c | 4 +-
drivers/net/ethernet/intel/ice/ice_main.c | 121 ++++++------------------
drivers/net/ethernet/intel/ice/ice_ptp_hw.c | 6 +-
drivers/net/ethernet/intel/ice/ice_sf_eth.c | 2 +
drivers/net/ethernet/intel/ice/ice_txrx.c | 29 ++++--
drivers/net/ethernet/intel/idpf/xdp.c | 1 +
drivers/net/ethernet/intel/idpf/xsk.c | 4 +-
17 files changed, 107 insertions(+), 139 deletions(-)
---
base-commit: b9d8b856689d2b968495d79fe653d87fcb8ad98c
change-id: 20260414-iwl-net-submission-2026-04-14-6203e1860df3
Best regards,
--
Jacob Keller <jacob.e.keller@intel.com>
^ permalink raw reply
* Re: [PATCH] net: ipv4: fix alignment fault in sysctl_fib_multipath_hash_seed on ARM64 with Clang
From: Eric Dumazet @ 2026-04-15 5:43 UTC (permalink / raw)
To: Juno Choii; +Cc: netdev, davem, kuba, pabeni, horms, linux-kernel
In-Reply-To: <20260415051343.1190626-1-juno.choi@lge.com>
On Tue, Apr 14, 2026 at 10:13 PM Juno Choii <juno.choi@lge.com> wrote:
>
> From: Juno Choi <juno.choi@lge.com>
>
> On ARM64, Clang may generate ldaxr (64-bit exclusive load) for
> READ_ONCE() on 8-byte structs. ldaxr requires 8-byte natural
> alignment, but sysctl_fib_multipath_hash_seed (two u32 members)
> only has 4-byte natural alignment.
>
> When this struct lands at a 4-byte-aligned but not 8-byte-aligned
> offset within struct netns_ipv4, the ldaxr triggers an alignment
> fault in rt6_multipath_hash(), causing a kernel panic in the IPv6
> packet receive path (rtl8168_poll -> ipv6_list_rcv ->
> rt6_multipath_hash).
>
> Add __aligned(8) to the struct definition when building for ARM64
> with Clang to ensure proper alignment for atomic 8-byte loads.
>
> Signed-off-by: Juno Choi <juno.choi@lge.com>
> ---
It seems you missed
commit 4ee7fa6cf78ff26d783d39e2949d14c4c1cd5e7f
Author: Yung Chih Su <yuuchihsu@gmail.com>
Date: Mon Mar 2 14:02:47 2026 +0800
net: ipv4: fix ARM64 alignment fault in multipath hash seed
`struct sysctl_fib_multipath_hash_seed` contains two u32 fields
(user_seed and mp_seed), making it an 8-byte structure with a 4-byte
alignment requirement.
In `fib_multipath_hash_from_keys()`, the code evaluates the entire
struct atomically via `READ_ONCE()`:
mp_seed = READ_ONCE(net->ipv4.sysctl_fib_multipath_hash_seed).mp_seed;
While this silently works on GCC by falling back to unaligned regular
loads which the ARM64 kernel tolerates, it causes a fatal kernel panic
when compiled with Clang and LTO enabled.
Commit e35123d83ee3 ("arm64: lto: Strengthen READ_ONCE() to acquire
when CONFIG_LTO=y") strengthens `READ_ONCE()` to use Load-Acquire
instructions (`ldar` / `ldapr`) to prevent compiler reordering bugs
under Clang LTO. Since the macro evaluates the full 8-byte struct,
Clang emits a 64-bit `ldar` instruction. ARM64 architecture strictly
requires `ldar` to be naturally aligned, thus executing it on a 4-byte
aligned address triggers a strict Alignment Fault (FSC = 0x21).
Fix the read side by moving the `READ_ONCE()` directly to the `u32`
member, which emits a safe 32-bit `ldar Wn`.
Furthermore, Eric Dumazet pointed out that `WRITE_ONCE()` on the entire
struct in `proc_fib_multipath_hash_set_seed()` is also flawed. Analysis
shows that Clang splits this 8-byte write into two separate 32-bit
`str` instructions. While this avoids an alignment fault, it destroys
atomicity and exposes a tear-write vulnerability. Fix this by
explicitly splitting the write into two 32-bit `WRITE_ONCE()`
operations.
Finally, add the missing `READ_ONCE()` when reading `user_seed` in
`proc_fib_multipath_hash_seed()` to ensure proper pairing and
concurrency safety.
Fixes: 4ee2a8cace3f ("net: ipv4: Add a sysctl to set multipath hash seed")
Signed-off-by: Yung Chih Su <yuuchihsu@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260302060247.7066-1-yuuchihsu@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
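The size and alignment facts that the quoted commit relies on can be checked in userspace. The sketch below is an illustration only: `seed_pair` and `container` are hypothetical names that mirror the shape described above (two 32-bit fields, placed after another 32-bit field), and the volatile accesses are a rough stand-in for the kernel's READ_ONCE()/WRITE_ONCE(), not an equivalent.

```c
#include <stddef.h>
#include <stdint.h>

/* Same shape as the struct described above: two 32-bit fields. */
struct seed_pair {
    uint32_t user_seed;
    uint32_t mp_seed;
};

/* Hypothetical container: a preceding u32 pushes seed_pair to offset 4. */
struct container {
    uint32_t other;
    struct seed_pair seed;
};

/* Read side: one 32-bit access to the member, not to the whole struct. */
static uint32_t read_mp_seed(const struct container *c)
{
    return *(const volatile uint32_t *)&c->seed.mp_seed;
}

/* Write side: two separate 32-bit stores, one per member. */
static void write_seeds(struct container *c, uint32_t user_seed,
                        uint32_t mp_seed)
{
    *(volatile uint32_t *)&c->seed.user_seed = user_seed;
    *(volatile uint32_t *)&c->seed.mp_seed = mp_seed;
}
```

On common ABIs the struct is 8 bytes wide but only 4-byte aligned, so it can legally sit at an offset that is not a multiple of 8. A single 64-bit access to the whole struct may therefore be misaligned, while the per-member 32-bit accesses are always naturally aligned.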
So perhaps you only want this followup:
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 7bd87d0547d8af40ca8736e97eac9e3d8a069052..5de5fd9465b8c5ea81d92dc74d7c6e50e3a94c73 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -11404,7 +11404,7 @@ static int mlxsw_sp_mp_hash_init(struct mlxsw_sp *mlxsw_sp)
u32 seed;
int err;
- seed = READ_ONCE(net->ipv4.sysctl_fib_multipath_hash_seed).user_seed;
+ seed = READ_ONCE(net->ipv4.sysctl_fib_multipath_hash_seed.user_seed);
if (!seed)
seed = jhash(mlxsw_sp->base_mac, sizeof(mlxsw_sp->base_mac), 0);
^ permalink raw reply
* Re: [RFC net-next 1/3] net/tls_sw: support randomized zero padding
From: Wilfred Mallawa @ 2026-04-15 5:40 UTC (permalink / raw)
To: Alistair Francis, Wilfred Mallawa, kuba@kernel.org
Cc: corbet@lwn.net, dlemoal@kernel.org, davem@davemloft.net,
linux-kselftest@vger.kernel.org, john.fastabend@gmail.com,
sd@queasysnail.net, linux-kernel@vger.kernel.org,
linux-doc@vger.kernel.org, pabeni@redhat.com,
skhan@linuxfoundation.org, edumazet@google.com, horms@kernel.org,
netdev@vger.kernel.org
In-Reply-To: <49513ee4347536e7c8419e9e65b8c619a8c665bb.camel@wdc.com>
>>> Sorry, I realized when i hit "send" that I phrased my previous
>>> message
>>> poorly. When I say "potential" I mean someone actually presenting a
>>> PoC
>>> and a CVE is issued for it. Have we seen any of those?
> In 2014 a group at UC Berkeley used HTTPS traffic analysis to identify:
>
> "individual pages in the same web-site with 90% accuracy, exposing
> personal details including medical conditions, financial and legal
> affairs and sexual orientation."
>
> They used machine learning to help and that was over 10 years ago. So I
> suspect modern day machine learning would make this even easier to do
> today.
>
> Obviously that is HTTP traffic, which is different to the NVMe-TCP
> traffic this series is targeting, but it does still seem like a real
> concern.
>
> They talk about a range of defences in the paper, with tradeoffs
> between all of them. But the linear defence seems like the one that is
> applicable here:
>
> "linear defense pads all packet sizes up to multiples of 128"
>
> The linear defence seems to reduce the Pan attack from 60% to around
> 25% and the BoG attack from 90% to around 60%.
>
> On top of that the
>
> "Burst defense offers greater protection, operating between the TCP
> layer and application layer to pad contiguous bursts of traffic up to
> predefined thresholds uniquely determined for each website"
>
> Which to me sounds like the random padding proposed in this series
> would provide more protection than the basic linear padding used in the
> paper.
>
> To me analysing TLS traffic does seem like a plausible threat and
> something that randomised padding would help with. Leaving it up to
> userspace to decide based on their threat model seems like a good
> approach as well.
>
> 1: https://secml.cs.berkeley.edu/pets2014/
>
> Alistair
gentle ping. Are there any further thoughts on adding this support?
Wilfred
^ permalink raw reply
* Re: [Intel-wired-lan] [PATCH v2] dpf: fix UAF and double free in idpf_plug_vport_aux_dev() error path
From: Jacob Keller @ 2026-04-15 5:37 UTC (permalink / raw)
To: Guangshuo Li
Cc: Tony Nguyen, Przemek Kitszel, Andrew Lunn, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Joshua Hay,
Tatyana Nikolova, Madhu Chittim, intel-wired-lan, netdev,
linux-kernel, Greg Kroah-Hartman, stable
In-Reply-To: <CANUHTR8uNVWR48xs90s+MtGQ6J-1j5R0+64MKVGin0cf-FjRWA@mail.gmail.com>
On 4/14/2026 6:47 PM, Guangshuo Li wrote:
> Hi Jacob,
>
> Thanks for reviewing.
>
> On Wed, 15 Apr 2026 at 05:03, Jacob Keller <jacob.e.keller@intel.com> wrote:
>>
>>
>> This doesn't look right. The commit message analysis seems to match this
>> fix from Greg KH:
>>
>> https://lore.kernel.org/intel-wired-lan/2026041432-tapestry-condition-22ff@gregkh/
>>
>> But the changes do not make any sense to me. It looks like a poorly done
>> AI-generated "fix" which is not correct. Greg's version does look like
>> it properly resolves this.
>>
>>> v2:
>>> - note that the issue was identified by my static analysis tool
>>> - and confirmed by manual review
>>>
>>
>> What even is this change log?? I see that version was sent and everyone
>> else was sane enough to just silently reject or ignore the v1...
>>
>>> drivers/net/ethernet/intel/idpf/idpf_idc.c | 6 +++++-
>>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/ethernet/intel/idpf/idpf_idc.c b/drivers/net/ethernet/intel/idpf/idpf_idc.c
>>> index 6dad0593f7f2..2a18907643fc 100644
>>> --- a/drivers/net/ethernet/intel/idpf/idpf_idc.c
>>> +++ b/drivers/net/ethernet/intel/idpf/idpf_idc.c
>>> @@ -59,6 +59,7 @@ static int idpf_plug_vport_aux_dev(struct iidc_rdma_core_dev_info *cdev_info,
>>> char name[IDPF_IDC_MAX_ADEV_NAME_LEN];
>>> struct auxiliary_device *adev;
>>> int ret;
>>> + int adev_id;
>>>
>>
>> You create a local variable here...
>>
>>> iadev = kzalloc(sizeof(*iadev), GFP_KERNEL);
>>> if (!iadev)
>>> @@ -74,11 +75,14 @@ static int idpf_plug_vport_aux_dev(struct iidc_rdma_core_dev_info *cdev_info,
>>> goto err_ida_alloc;
>>> }
>>> adev->id = ret;
>>> + adev->id = adev_id;
>>
>> adev_id is never initialized, so you assign a random garbage
>> uninitialized value. This is obviously wrong and will lead to worse
>> errors than the failed cleanup.
>>
>> I'm rejecting this patch in favor of the clearly appropriate fix from Greg.
>>
>>> adev->dev.release = idpf_vport_adev_release;
>>> adev->dev.parent = &cdev_info->pdev->dev;
>>> sprintf(name, "%04x.rdma.vdev", cdev_info->pdev->vendor);
>>> adev->name = name;
>>>
>>> + /* iadev is owned by the auxiliary device */
>>> + iadev = NULL;
>>> ret = auxiliary_device_init(adev);
>>> if (ret)
>>> goto err_aux_dev_init;
>>> @@ -92,7 +96,7 @@ static int idpf_plug_vport_aux_dev(struct iidc_rdma_core_dev_info *cdev_info,
>>> err_aux_dev_add:
>>> auxiliary_device_uninit(adev);
>>> err_aux_dev_init:
>>> - ida_free(&idpf_idc_ida, adev->id);
>>> + ida_free(&idpf_idc_ida, adev_id);
>>> err_ida_alloc:
>>> vdev_info->adev = NULL;
>>> kfree(iadev);
>>
>
> You are right that the v2 patch as sent is incomplete. That was my
> mistake when preparing/sending v2: it accidentally dropped the adev_id
> = ret; assignment, which made that version incorrect.
>
> For reference, the original v1 patch is here:
>
> https://lkml.org/lkml/2026/3/21/421
>
> In v1, adev_id was assigned from ret before use, so I believe that
> particular uninitialized-variable issue was introduced in the v2
> posting.
>
> Sorry for the confusion caused by the broken v2 posting.
No problem. I had missed the other version, which explains my confusion.
Still, to my eyes, the fix looks equivalent to the one
submitted by GregKH:
https://lore.kernel.org/intel-wired-lan/2026041116-retail-bagginess-250f@gregkh/
Do you agree this is effectively a different fix for the same problem?
Or are there really two different double-free issues here that both need
patching? I haven't been able to fully convince myself either way, but
I am leaning toward this being one problem, and I think Greg's solution
feels simpler to understand.
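For readers following along, the general hazard under discussion can be modeled in userspace. This is a sketch under assumptions, not a reproduction of either patch: every `fake_*` name is hypothetical. It assumes that dropping the last reference to an initialized device invokes a release callback that frees the enclosing allocation, so the unwind path must neither read fields of that allocation afterwards nor free it a second time.

```c
#include <stdlib.h>

struct fake_dev {
    int id;
    void (*release)(struct fake_dev *dev);
};

/* Wrapper that embeds the device as its first member. */
struct fake_wrapper {
    struct fake_dev dev;
};

/* Records the last ID handed back to the (modeled) ID allocator. */
static int fake_last_freed_id = -1;

static void fake_id_free(int id)
{
    fake_last_freed_id = id;
}

/* Release callback: frees the whole wrapper that embeds the device. */
static void fake_release(struct fake_dev *dev)
{
    free((struct fake_wrapper *)dev);
}

/* Models dropping the last reference: dev is invalid once this returns. */
static void fake_dev_uninit(struct fake_dev *dev)
{
    dev->release(dev);
}

static struct fake_wrapper *fake_wrapper_new(int id)
{
    struct fake_wrapper *w = calloc(1, sizeof(*w));

    if (!w)
        return NULL;
    w->dev.id = id;
    w->dev.release = fake_release;
    return w;
}

/* Error unwind after the device has been initialized. */
static void fake_unwind(struct fake_wrapper *w)
{
    int saved_id = w->dev.id;  /* copy before the memory can go away */

    fake_dev_uninit(&w->dev);  /* frees w via the release callback */
    fake_id_free(saved_id);    /* safe: does not read w->dev.id */
    /* No free(w) here: the release callback already freed it. */
}
```

Reading `w->dev.id` after `fake_dev_uninit()` would be the use-after-free in this model, and a trailing `free(w)` would be the double free.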
Thanks,
Jake
>
> Thanks,
> Guangshuo
^ permalink raw reply
* Re: [ovs-dev] [PATCH net-next v2] net: openvswitch: decouple flow_table from ovs_mutex
From: Adrián Moreno @ 2026-04-15 5:32 UTC (permalink / raw)
To: Aaron Conole
Cc: Adrian Moreno via dev, netdev, open list:OPENVSWITCH, Paolo Abeni,
open list, Ilya Maximets, Eric Dumazet, Simon Horman,
Jakub Kicinski, David S. Miller
In-Reply-To: <f7twlyeabra.fsf@redhat.com>
On Fri, Apr 10, 2026 at 02:52:41PM -0400, Aaron Conole wrote:
> Hi Adrian,
>
> Thanks for the patch. A few questions inline.
>
> Adrian Moreno via dev <ovs-dev@openvswitch.org> writes:
>
> > Currently the entire ovs module is write-protected using the global
> > ovs_mutex. While this simple approach works fine for control-plane
> > operations (such as vport configurations), requiring the global mutex
> > for flow modifications can be problematic.
> >
> > During periods of high control-plane operations, e.g: netdevs (vports)
> > coming and going, RTNL can suffer contention. This contention is easily
> > transferred to the ovs_mutex as RTNL nests inside ovs_mutex. Flow
> > modifications, however, are done as part of packet processing and having
> > them wait for RTNL pressure to go away can lead to packet drops.
> >
> > This patch decouples flow_table modifications from ovs_mutex by means of
> > the following:
> >
> > 1 - Make flow_table an rcu-protected pointer inside the datapath.
> > This allows both objects to be protected independently while reducing the
> > amount of changes required in "flow_table.c".
> >
> > 2 - Create a new mutex inside the flow_table that protects it from
> > concurrent modifications.
> > Putting the mutex inside flow_table makes it easier to consume for
> > functions inside flow_table.c that do not currently take pointers to the
> > datapath.
> > Some function signatures need to be changed to accept flow_table so that
> > lockdep checks can be performed.
> >
> > 3 - Create a reference count to temporarily extend rcu protection from
> > the datapath to the flow_table.
> > In order to use the flow_table without locking ovs_mutex, the flow_table
> > pointer must be first dereferenced within an rcu-protected region.
> > Next, the table->mutex needs to be locked to protect it from
> > concurrent writes but mutexes must not be locked inside an rcu-protected
> > region, so the rcu-protected region must be left at which point the
> > datapath can be concurrently freed.
> > To extend the protection beyond the rcu region, a reference count is used.
> > One reference is held by the datapath, the other is temporarily
> > increased during flow modifications. For example:
> >
> > Datapath deletion:
> >
> > ovs_lock();
> > table = rcu_dereference_protected(dp->table, ...);
> > rcu_assign_pointer(dp->table, NULL);
> > ovs_flow_tbl_put(table);
> > ovs_unlock();
>
> I guess it's possible now to have flow operations succeed on
> 'removed-but-not-yet-freed' tables. That's probably worth documenting
> somewhere, since it is a slight behavior change. More below
>
You are right. That corner case is kind of weird, as we could be adding a
flow to a table that has been detached from the datapath and will be
freed immediately after. I can add a comment in __dp_destroy about this.
> > Flow modification:
> >
> > rcu_read_lock();
> > dp = get_dp(...);
> > table = rcu_dereference(dp->table);
> > ovs_flow_tbl_get(table);
> > rcu_read_unlock();
> >
> > mutex_lock(&table->lock);
> > /* Perform modifications on the flow_table */
> > mutex_unlock(&table->lock);
> > ovs_flow_tbl_put(table);
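The get/put scheme described above can be modeled in userspace with C11 atomics. This is a minimal sketch, assuming that the "get" operation is an increment-if-not-zero (the patch checks its return value, which suggests this, but the helper bodies are not shown here). All `fake_*` names are hypothetical, and RCU and the per-table mutex are deliberately left out.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct fake_table {
    atomic_int refcnt;
    int n_flows;
    int *freed;  /* set to 1 when the table is really freed */
};

/* The creator (the "datapath" in this model) holds the initial reference. */
static struct fake_table *fake_table_new(int *freed)
{
    struct fake_table *t = calloc(1, sizeof(*t));

    if (!t)
        return NULL;
    atomic_init(&t->refcnt, 1);
    t->freed = freed;
    return t;
}

/* Take a reference unless the count already dropped to zero. */
static bool fake_table_get(struct fake_table *t)
{
    int old = atomic_load(&t->refcnt);

    while (old != 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&t->refcnt, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; the last one frees the table. */
static void fake_table_put(struct fake_table *t)
{
    if (atomic_fetch_sub(&t->refcnt, 1) == 1) {
        *t->freed = 1;
        free(t);
    }
}
```

In this model, a writer that took its reference before the datapath dropped its own can still modify the table afterwards. The table is freed only when the writer calls `fake_table_put()`, which matches the "removed-but-not-yet-freed" window raised in the review.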
> >
> > Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
> > ---
> > v2: Fix argument in ovs_flow_tbl_put (sparse)
> > Remove rcu checks in ovs_dp_masks_rebalance
> > ---
> > net/openvswitch/datapath.c | 285 ++++++++++++++++++++++++-----------
> > net/openvswitch/datapath.h | 2 +-
> > net/openvswitch/flow.c | 13 +-
> > net/openvswitch/flow.h | 9 +-
> > net/openvswitch/flow_table.c | 180 ++++++++++++++--------
> > net/openvswitch/flow_table.h | 51 ++++++-
> > 6 files changed, 380 insertions(+), 160 deletions(-)
> >
> > diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> > index e209099218b4..9c234993520c 100644
> > --- a/net/openvswitch/datapath.c
> > +++ b/net/openvswitch/datapath.c
> > @@ -88,13 +88,17 @@ static void ovs_notify(struct genl_family *family,
> > * DOC: Locking:
> > *
> > * All writes e.g. Writes to device state (add/remove datapath, port, set
> > - * operations on vports, etc.), Writes to other state (flow table
> > - * modifications, set miscellaneous datapath parameters, etc.) are protected
> > - * by ovs_lock.
> > + * operations on vports, etc.) and writes to other datapath parameters
> > + * are protected by ovs_lock.
> > + *
> > + * Writes to the flow table are NOT protected by ovs_lock. Instead, a per-table
> > + * mutex and reference count are used (see comment above "struct flow_table"
> > + * definition). On some few occasions, the per-flow table mutex is nested
> > + * inside ovs_mutex.
> > *
> > * Reads are protected by RCU.
> > *
> > - * There are a few special cases (mostly stats) that have their own
> > + * There are a few other special cases (mostly stats) that have their own
> > * synchronization but they nest under all of above and don't interact with
> > * each other.
> > *
> > @@ -166,7 +170,6 @@ static void destroy_dp_rcu(struct rcu_head *rcu)
> > {
> > struct datapath *dp = container_of(rcu, struct datapath, rcu);
> >
> > - ovs_flow_tbl_destroy(&dp->table);
> > free_percpu(dp->stats_percpu);
> > kfree(dp->ports);
> > ovs_meters_exit(dp);
> > @@ -247,6 +250,7 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
> > struct ovs_pcpu_storage *ovs_pcpu = this_cpu_ptr(ovs_pcpu_storage);
> > const struct vport *p = OVS_CB(skb)->input_vport;
> > struct datapath *dp = p->dp;
> > + struct flow_table *table;
> > struct sw_flow *flow;
> > struct sw_flow_actions *sf_acts;
> > struct dp_stats_percpu *stats;
> > @@ -257,9 +261,16 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
> > int error;
> >
> > stats = this_cpu_ptr(dp->stats_percpu);
> > + table = rcu_dereference(dp->table);
> > + if (!table) {
> > + net_dbg_ratelimited("ovs: no flow table on datapath %s\n",
> > + ovs_dp_name(dp));
> > + kfree_skb(skb);
> > + return;
> > + }
> >
> > /* Look up flow. */
> > - flow = ovs_flow_tbl_lookup_stats(&dp->table, key, skb_get_hash(skb),
> > + flow = ovs_flow_tbl_lookup_stats(table, key, skb_get_hash(skb),
> > &n_mask_hit, &n_cache_hit);
> > if (unlikely(!flow)) {
> > struct dp_upcall_info upcall;
> > @@ -752,12 +763,16 @@ static struct genl_family dp_packet_genl_family __ro_after_init = {
> > static void get_dp_stats(const struct datapath *dp, struct ovs_dp_stats *stats,
> > struct ovs_dp_megaflow_stats *mega_stats)
> > {
> > + struct flow_table *table = ovsl_dereference(dp->table);
> > int i;
> >
> > memset(mega_stats, 0, sizeof(*mega_stats));
> >
> > - stats->n_flows = ovs_flow_tbl_count(&dp->table);
> > - mega_stats->n_masks = ovs_flow_tbl_num_masks(&dp->table);
> > + if (table) {
> > + stats->n_flows = ovs_flow_tbl_count(table);
>
> Previously, when calling this we'd be under the ovs_mutex and the read
> on table->count would be somewhat coherent (for some definition of
> coherent). BUT we are now doing a bare read. I'm not sure if we should
> take the lock here, or at least give some kind of barrier (READ_ONCE and
> update the count setting sites with WRITE_ONCEs)? WDYT?
>
I think you are right; this call can now happen in parallel with
statistics updates on the flow table side.
IIUC, datapath operations such as this one still hold the ovs_mutex;
"ovsl_dereference()" above should splat if that's not true. Taking
"table->lock" here would force us to also hold it while updating the
stats, which undermines the purpose of this patch. So
READ_ONCE/WRITE_ONCE seems like a good solution here.
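As a rough userspace analogue of that idea, relaxed C11 atomics give a lockless reader a single, untorn load of a counter that writers update under their own lock. This is only a sketch: `counted_table` and both helpers are hypothetical names, and relaxed atomics are similar in spirit to READ_ONCE()/WRITE_ONCE() rather than identical.

```c
#include <stdatomic.h>

struct counted_table {
    _Atomic unsigned int count;
};

/*
 * Writer side. Callers are assumed to be serialized by a per-table lock,
 * so a plain load followed by a store is enough; the atomic store only
 * ensures that lockless readers never observe a torn value.
 */
static void table_count_add(struct counted_table *t, int delta)
{
    unsigned int cur = atomic_load_explicit(&t->count,
                                            memory_order_relaxed);

    atomic_store_explicit(&t->count, cur + (unsigned int)delta,
                          memory_order_relaxed);
}

/* Reader side: may run without the writers' lock. */
static unsigned int table_count_read(struct counted_table *t)
{
    return atomic_load_explicit(&t->count, memory_order_relaxed);
}
```

The reader may still see a slightly stale count, which seems acceptable for a statistics dump; what the annotation rules out is the compiler tearing or re-reading the access.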
> > + mega_stats->n_masks = ovs_flow_tbl_num_masks(table);
> > + }
> > +
> >
> > stats->n_hit = stats->n_missed = stats->n_lost = 0;
> >
> > @@ -829,15 +844,16 @@ static size_t ovs_flow_cmd_msg_size(const struct sw_flow_actions *acts,
> > + nla_total_size_64bit(8); /* OVS_FLOW_ATTR_USED */
> > }
> >
> > -/* Called with ovs_mutex or RCU read lock. */
> > +/* Called with table->lock or RCU read lock. */
> > static int ovs_flow_cmd_fill_stats(const struct sw_flow *flow,
> > + const struct flow_table *table,
> > struct sk_buff *skb)
> > {
> > struct ovs_flow_stats stats;
> > __be16 tcp_flags;
> > unsigned long used;
> >
> > - ovs_flow_stats_get(flow, &stats, &used, &tcp_flags);
> > + ovs_flow_stats_get(flow, table, &stats, &used, &tcp_flags);
> >
> > if (used &&
> > nla_put_u64_64bit(skb, OVS_FLOW_ATTR_USED, ovs_flow_used_time(used),
> > @@ -857,8 +873,9 @@ static int ovs_flow_cmd_fill_stats(const struct sw_flow *flow,
> > return 0;
> > }
> >
> > -/* Called with ovs_mutex or RCU read lock. */
> > +/* Called with RCU read lock or table->lock held. */
> > static int ovs_flow_cmd_fill_actions(const struct sw_flow *flow,
> > + const struct flow_table *table,
> > struct sk_buff *skb, int skb_orig_len)
> > {
> > struct nlattr *start;
> > @@ -878,7 +895,7 @@ static int ovs_flow_cmd_fill_actions(const struct sw_flow *flow,
> > if (start) {
> > const struct sw_flow_actions *sf_acts;
> >
> > - sf_acts = rcu_dereference_ovsl(flow->sf_acts);
> > + sf_acts = rcu_dereference_ovs_tbl(flow->sf_acts, table);
> > err = ovs_nla_put_actions(sf_acts->actions,
> > sf_acts->actions_len, skb);
> >
> > @@ -897,8 +914,10 @@ static int ovs_flow_cmd_fill_actions(const struct sw_flow *flow,
> > return 0;
> > }
> >
> > -/* Called with ovs_mutex or RCU read lock. */
> > -static int ovs_flow_cmd_fill_info(const struct sw_flow *flow, int dp_ifindex,
> > +/* Called with table->lock or RCU read lock. */
> > +static int ovs_flow_cmd_fill_info(const struct sw_flow *flow,
> > + const struct flow_table *table,
> > + int dp_ifindex,
> > struct sk_buff *skb, u32 portid,
> > u32 seq, u32 flags, u8 cmd, u32 ufid_flags)
> > {
> > @@ -929,12 +948,12 @@ static int ovs_flow_cmd_fill_info(const struct sw_flow *flow, int dp_ifindex,
> > goto error;
> > }
> >
> > - err = ovs_flow_cmd_fill_stats(flow, skb);
> > + err = ovs_flow_cmd_fill_stats(flow, table, skb);
> > if (err)
> > goto error;
> >
> > if (should_fill_actions(ufid_flags)) {
> > - err = ovs_flow_cmd_fill_actions(flow, skb, skb_orig_len);
> > + err = ovs_flow_cmd_fill_actions(flow, table, skb, skb_orig_len);
> > if (err)
> > goto error;
> > }
> > @@ -968,8 +987,9 @@ static struct sk_buff *ovs_flow_cmd_alloc_info(const struct sw_flow_actions *act
> > return skb;
> > }
> >
> > -/* Called with ovs_mutex. */
> > +/* Called with table->lock. */
> > static struct sk_buff *ovs_flow_cmd_build_info(const struct sw_flow *flow,
> > + const struct flow_table *table,
> > int dp_ifindex,
> > struct genl_info *info, u8 cmd,
> > bool always, u32 ufid_flags)
> > @@ -977,12 +997,12 @@ static struct sk_buff *ovs_flow_cmd_build_info(const struct sw_flow *flow,
> > struct sk_buff *skb;
> > int retval;
> >
> > - skb = ovs_flow_cmd_alloc_info(ovsl_dereference(flow->sf_acts),
> > + skb = ovs_flow_cmd_alloc_info(ovs_tbl_dereference(flow->sf_acts, table),
> > &flow->id, info, always, ufid_flags);
> > if (IS_ERR_OR_NULL(skb))
> > return skb;
> >
> > - retval = ovs_flow_cmd_fill_info(flow, dp_ifindex, skb,
> > + retval = ovs_flow_cmd_fill_info(flow, table, dp_ifindex, skb,
> > info->snd_portid, info->snd_seq, 0,
> > cmd, ufid_flags);
> > if (WARN_ON_ONCE(retval < 0)) {
> > @@ -998,6 +1018,7 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
> > struct nlattr **a = info->attrs;
> > struct ovs_header *ovs_header = genl_info_userhdr(info);
> > struct sw_flow *flow = NULL, *new_flow;
> > + struct flow_table *table;
> > struct sw_flow_mask mask;
> > struct sk_buff *reply;
> > struct datapath *dp;
> > @@ -1064,30 +1085,43 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
> > goto err_kfree_acts;
> > }
> >
>
> I think this can lead to a weird(?) behavior:
>
> thread A (dp_destroy): thread b (ovs_flow_cmd_new):
> rcu_assign_pointer(dp->table, NULL)
> rcu_read_lock();
> table =
> rcu_dereference(dp->table);
> [old table]
> ovs_flow_tbl_get(table)
> //refcnt change
> rcu_read_unlock()
> ovs_flow_tbl_put(table) // refcnt chg
> mutex_lock(table->lock)
> ovs_flow_table_insert(...)
> [success reply]
> mutex_unlock(table->lock)
> ovs_flow_tbl_put(table)
> // table flow flush, etc.
>
> I guess it isn't a huge deal (installing flow while deleting table would
> be weird from a userspace perspective), and I think it is safe, but it
> is worth mentioning that we can have such scenario now.
>
I completely agree. This was not documented (it will be in the next
version of the patch), but it's the inevitable side effect of this design.
> > - ovs_lock();
> > + rcu_read_lock();
> > dp = get_dp(net, ovs_header->dp_ifindex);
> > if (unlikely(!dp)) {
> > error = -ENODEV;
> > - goto err_unlock_ovs;
> > + rcu_read_unlock();
> > + goto err_kfree_reply;
> > }
> > + table = rcu_dereference(dp->table);
> > + if (!table || !ovs_flow_tbl_get(table)) {
> > + error = -ENODEV;
> > + rcu_read_unlock();
> > + goto err_kfree_reply;
> > + }
> > + rcu_read_unlock();
> > +
> > + /* It is safe to dereference "table" after leaving rcu read-protected
> > + * region because it's pinned by refcount.
> > + */
> > + mutex_lock(&table->lock);
> >
> > /* Check if this is a duplicate flow */
> > if (ovs_identifier_is_ufid(&new_flow->id))
> > - flow = ovs_flow_tbl_lookup_ufid(&dp->table, &new_flow->id);
> > + flow = ovs_flow_tbl_lookup_ufid(table, &new_flow->id);
> > if (!flow)
> > - flow = ovs_flow_tbl_lookup(&dp->table, key);
> > + flow = ovs_flow_tbl_lookup(table, key);
> > if (likely(!flow)) {
> > rcu_assign_pointer(new_flow->sf_acts, acts);
> >
> > /* Put flow in bucket. */
> > - error = ovs_flow_tbl_insert(&dp->table, new_flow, &mask);
> > + error = ovs_flow_tbl_insert(table, new_flow, &mask);
> > if (unlikely(error)) {
> > acts = NULL;
> > - goto err_unlock_ovs;
> > + goto err_unlock_tbl;
> > }
> >
> > if (unlikely(reply)) {
> > - error = ovs_flow_cmd_fill_info(new_flow,
> > + error = ovs_flow_cmd_fill_info(new_flow, table,
> > ovs_header->dp_ifindex,
> > reply, info->snd_portid,
> > info->snd_seq, 0,
> > @@ -1095,7 +1129,8 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
> > ufid_flags);
> > BUG_ON(error < 0);
> > }
> > - ovs_unlock();
> > + mutex_unlock(&table->lock);
> > + ovs_flow_tbl_put(table);
> > } else {
> > struct sw_flow_actions *old_acts;
> >
> > @@ -1108,28 +1143,28 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
> > if (unlikely(info->nlhdr->nlmsg_flags & (NLM_F_CREATE
> > | NLM_F_EXCL))) {
> > error = -EEXIST;
> > - goto err_unlock_ovs;
> > + goto err_unlock_tbl;
> > }
> > /* The flow identifier has to be the same for flow updates.
> > * Look for any overlapping flow.
> > */
> > if (unlikely(!ovs_flow_cmp(flow, &match))) {
> > if (ovs_identifier_is_key(&flow->id))
> > - flow = ovs_flow_tbl_lookup_exact(&dp->table,
> > + flow = ovs_flow_tbl_lookup_exact(table,
> > &match);
> > else /* UFID matches but key is different */
> > flow = NULL;
> > if (!flow) {
> > error = -ENOENT;
> > - goto err_unlock_ovs;
> > + goto err_unlock_tbl;
> > }
> > }
> > /* Update actions. */
> > - old_acts = ovsl_dereference(flow->sf_acts);
> > + old_acts = ovs_tbl_dereference(flow->sf_acts, table);
> > rcu_assign_pointer(flow->sf_acts, acts);
> >
> > if (unlikely(reply)) {
> > - error = ovs_flow_cmd_fill_info(flow,
> > + error = ovs_flow_cmd_fill_info(flow, table,
> > ovs_header->dp_ifindex,
> > reply, info->snd_portid,
> > info->snd_seq, 0,
> > @@ -1137,7 +1172,8 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
> > ufid_flags);
> > BUG_ON(error < 0);
> > }
> > - ovs_unlock();
> > + mutex_unlock(&table->lock);
> > + ovs_flow_tbl_put(table);
> >
> > ovs_nla_free_flow_actions_rcu(old_acts);
> > ovs_flow_free(new_flow, false);
> > @@ -1149,8 +1185,10 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
> > kfree(key);
> > return 0;
> >
> > -err_unlock_ovs:
> > - ovs_unlock();
> > +err_unlock_tbl:
> > + mutex_unlock(&table->lock);
> > + ovs_flow_tbl_put(table);
> > +err_kfree_reply:
> > kfree_skb(reply);
> > err_kfree_acts:
> > ovs_nla_free_flow_actions(acts);
> > @@ -1244,6 +1282,7 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
> > struct net *net = sock_net(skb->sk);
> > struct nlattr **a = info->attrs;
> > struct ovs_header *ovs_header = genl_info_userhdr(info);
> > + struct flow_table *table;
> > struct sw_flow_key key;
> > struct sw_flow *flow;
> > struct sk_buff *reply = NULL;
> > @@ -1278,29 +1317,43 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
> > }
> > }
> >
> > - ovs_lock();
> > + rcu_read_lock();
> > dp = get_dp(net, ovs_header->dp_ifindex);
> > if (unlikely(!dp)) {
> > error = -ENODEV;
> > - goto err_unlock_ovs;
> > + rcu_read_unlock();
> > + goto err_free_reply;
> > }
> > + table = rcu_dereference(dp->table);
> > + if (!table || !ovs_flow_tbl_get(table)) {
> > + rcu_read_unlock();
> > + error = -ENODEV;
> > + goto err_free_reply;
> > + }
> > + rcu_read_unlock();
> > +
> > + /* It is safe to dereference "table" after leaving rcu read-protected
> > + * region because it's pinned by refcount.
> > + */
> > + mutex_lock(&table->lock);
> > +
> > /* Check that the flow exists. */
> > if (ufid_present)
> > - flow = ovs_flow_tbl_lookup_ufid(&dp->table, &sfid);
> > + flow = ovs_flow_tbl_lookup_ufid(table, &sfid);
> > else
> > - flow = ovs_flow_tbl_lookup_exact(&dp->table, &match);
> > + flow = ovs_flow_tbl_lookup_exact(table, &match);
> > if (unlikely(!flow)) {
> > error = -ENOENT;
> > - goto err_unlock_ovs;
> > + goto err_unlock_tbl;
> > }
> >
> > /* Update actions, if present. */
> > if (likely(acts)) {
> > - old_acts = ovsl_dereference(flow->sf_acts);
> > + old_acts = ovs_tbl_dereference(flow->sf_acts, table);
> > rcu_assign_pointer(flow->sf_acts, acts);
> >
> > if (unlikely(reply)) {
> > - error = ovs_flow_cmd_fill_info(flow,
> > + error = ovs_flow_cmd_fill_info(flow, table,
> > ovs_header->dp_ifindex,
> > reply, info->snd_portid,
> > info->snd_seq, 0,
> > @@ -1310,20 +1363,22 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
> > }
> > } else {
> > /* Could not alloc without acts before locking. */
> > - reply = ovs_flow_cmd_build_info(flow, ovs_header->dp_ifindex,
> > + reply = ovs_flow_cmd_build_info(flow, table,
> > + ovs_header->dp_ifindex,
> > info, OVS_FLOW_CMD_SET, false,
> > ufid_flags);
> >
> > if (IS_ERR(reply)) {
> > error = PTR_ERR(reply);
> > - goto err_unlock_ovs;
> > + goto err_unlock_tbl;
> > }
> > }
> >
> > /* Clear stats. */
> > if (a[OVS_FLOW_ATTR_CLEAR])
> > - ovs_flow_stats_clear(flow);
> > - ovs_unlock();
> > + ovs_flow_stats_clear(flow, table);
> > + mutex_unlock(&table->lock);
> > + ovs_flow_tbl_put(table);
> >
> > if (reply)
> > ovs_notify(&dp_flow_genl_family, reply, info);
> > @@ -1332,8 +1387,10 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
> >
> > return 0;
> >
> > -err_unlock_ovs:
> > - ovs_unlock();
> > +err_unlock_tbl:
> > + mutex_unlock(&table->lock);
> > + ovs_flow_tbl_put(table);
> > +err_free_reply:
> > kfree_skb(reply);
> > err_kfree_acts:
> > ovs_nla_free_flow_actions(acts);
> > @@ -1346,6 +1403,7 @@ static int ovs_flow_cmd_get(struct sk_buff *skb, struct genl_info *info)
> > struct nlattr **a = info->attrs;
> > struct ovs_header *ovs_header = genl_info_userhdr(info);
> > struct net *net = sock_net(skb->sk);
> > + struct flow_table *table;
> > struct sw_flow_key key;
> > struct sk_buff *reply;
> > struct sw_flow *flow;
> > @@ -1370,33 +1428,48 @@ static int ovs_flow_cmd_get(struct sk_buff *skb, struct genl_info *info)
> > if (err)
> > return err;
> >
> > - ovs_lock();
> > + rcu_read_lock();
> > dp = get_dp(sock_net(skb->sk), ovs_header->dp_ifindex);
> > if (!dp) {
> > - err = -ENODEV;
> > - goto unlock;
> > + rcu_read_unlock();
> > + return -ENODEV;
> > }
> > + table = rcu_dereference(dp->table);
> > + if (!table || !ovs_flow_tbl_get(table)) {
> > + rcu_read_unlock();
> > + return -ENODEV;
> > + }
> > + rcu_read_unlock();
> > +
> > + /* It is safe to dereference "table" after leaving rcu read-protected
> > + * region because it's pinned by refcount.
> > + */
> > + mutex_lock(&table->lock);
> > +
> >
> > if (ufid_present)
> > - flow = ovs_flow_tbl_lookup_ufid(&dp->table, &ufid);
> > + flow = ovs_flow_tbl_lookup_ufid(table, &ufid);
> > else
> > - flow = ovs_flow_tbl_lookup_exact(&dp->table, &match);
> > + flow = ovs_flow_tbl_lookup_exact(table, &match);
> > if (!flow) {
> > err = -ENOENT;
> > goto unlock;
> > }
> >
> > - reply = ovs_flow_cmd_build_info(flow, ovs_header->dp_ifindex, info,
> > - OVS_FLOW_CMD_GET, true, ufid_flags);
> > + reply = ovs_flow_cmd_build_info(flow, table, ovs_header->dp_ifindex,
> > + info, OVS_FLOW_CMD_GET, true,
> > + ufid_flags);
> > if (IS_ERR(reply)) {
> > err = PTR_ERR(reply);
> > goto unlock;
> > }
> >
> > - ovs_unlock();
> > + mutex_unlock(&table->lock);
> > + ovs_flow_tbl_put(table);
> > return genlmsg_reply(reply, info);
> > unlock:
> > - ovs_unlock();
> > + mutex_unlock(&table->lock);
> > + ovs_flow_tbl_put(table);
> > return err;
> > }
> >
> > @@ -1405,6 +1478,7 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info)
> > struct nlattr **a = info->attrs;
> > struct ovs_header *ovs_header = genl_info_userhdr(info);
> > struct net *net = sock_net(skb->sk);
> > + struct flow_table *table;
> > struct sw_flow_key key;
> > struct sk_buff *reply;
> > struct sw_flow *flow = NULL;
> > @@ -1425,36 +1499,49 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info)
> > return err;
> > }
> >
> > - ovs_lock();
> > + rcu_read_lock();
> > dp = get_dp(sock_net(skb->sk), ovs_header->dp_ifindex);
> > if (unlikely(!dp)) {
> > - err = -ENODEV;
> > - goto unlock;
> > + rcu_read_unlock();
> > + return -ENODEV;
> > }
> > + table = rcu_dereference(dp->table);
> > + if (!table || !ovs_flow_tbl_get(table)) {
> > + rcu_read_unlock();
> > + return -ENODEV;
> > + }
> > + rcu_read_unlock();
> > +
> > + /* It is safe to dereference "table" after leaving rcu read-protected
> > + * region because it's pinned by refcount.
> > + */
> > + mutex_lock(&table->lock);
> > +
> >
> > if (unlikely(!a[OVS_FLOW_ATTR_KEY] && !ufid_present)) {
> > - err = ovs_flow_tbl_flush(&dp->table);
> > + err = ovs_flow_tbl_flush(table);
> > goto unlock;
> > }
> >
> > if (ufid_present)
> > - flow = ovs_flow_tbl_lookup_ufid(&dp->table, &ufid);
> > + flow = ovs_flow_tbl_lookup_ufid(table, &ufid);
> > else
> > - flow = ovs_flow_tbl_lookup_exact(&dp->table, &match);
> > + flow = ovs_flow_tbl_lookup_exact(table, &match);
> > if (unlikely(!flow)) {
> > err = -ENOENT;
> > goto unlock;
> > }
> >
> > - ovs_flow_tbl_remove(&dp->table, flow);
> > - ovs_unlock();
> > + ovs_flow_tbl_remove(table, flow);
> > + mutex_unlock(&table->lock);
> >
> > reply = ovs_flow_cmd_alloc_info((const struct sw_flow_actions __force *) flow->sf_acts,
> > &flow->id, info, false, ufid_flags);
> > if (likely(reply)) {
> > if (!IS_ERR(reply)) {
> > rcu_read_lock(); /*To keep RCU checker happy. */
> > - err = ovs_flow_cmd_fill_info(flow, ovs_header->dp_ifindex,
> > + err = ovs_flow_cmd_fill_info(flow, table,
> > + ovs_header->dp_ifindex,
> > reply, info->snd_portid,
> > info->snd_seq, 0,
> > OVS_FLOW_CMD_DEL,
> > @@ -1473,10 +1560,12 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info)
> > }
> >
> > out_free:
> > + ovs_flow_tbl_put(table);
> > ovs_flow_free(flow, true);
> > return 0;
> > unlock:
> > - ovs_unlock();
> > + mutex_unlock(&table->lock);
> > + ovs_flow_tbl_put(table);
> > return err;
> > }
> >
> > @@ -1485,6 +1574,7 @@ static int ovs_flow_cmd_dump(struct sk_buff *skb, struct netlink_callback *cb)
> > struct nlattr *a[__OVS_FLOW_ATTR_MAX];
> > struct ovs_header *ovs_header = genlmsg_data(nlmsg_data(cb->nlh));
> > struct table_instance *ti;
> > + struct flow_table *table;
> > struct datapath *dp;
> > u32 ufid_flags;
> > int err;
> > @@ -1501,8 +1591,13 @@ static int ovs_flow_cmd_dump(struct sk_buff *skb, struct netlink_callback *cb)
> > rcu_read_unlock();
> > return -ENODEV;
> > }
> > + table = rcu_dereference(dp->table);
> > + if (!table) {
> > + rcu_read_unlock();
> > + return -ENODEV;
> > + }
> >
> > - ti = rcu_dereference(dp->table.ti);
> > + ti = rcu_dereference(table->ti);
> > for (;;) {
> > struct sw_flow *flow;
> > u32 bucket, obj;
> > @@ -1513,8 +1608,8 @@ static int ovs_flow_cmd_dump(struct sk_buff *skb, struct netlink_callback *cb)
> > if (!flow)
> > break;
> >
> > - if (ovs_flow_cmd_fill_info(flow, ovs_header->dp_ifindex, skb,
> > - NETLINK_CB(cb->skb).portid,
> > + if (ovs_flow_cmd_fill_info(flow, table, ovs_header->dp_ifindex,
> > + skb, NETLINK_CB(cb->skb).portid,
> > cb->nlh->nlmsg_seq, NLM_F_MULTI,
> > OVS_FLOW_CMD_GET, ufid_flags) < 0)
> > break;
> > @@ -1598,8 +1693,13 @@ static int ovs_dp_cmd_fill_info(struct datapath *dp, struct sk_buff *skb,
> > struct ovs_dp_stats dp_stats;
> > struct ovs_dp_megaflow_stats dp_megaflow_stats;
> > struct dp_nlsk_pids *pids = ovsl_dereference(dp->upcall_portids);
> > + struct flow_table *table;
> > int err, pids_len;
> >
> > + table = ovsl_dereference(dp->table);
> > + if (!table)
> > + return -ENODEV;
> > +
> > ovs_header = genlmsg_put(skb, portid, seq, &dp_datapath_genl_family,
> > flags, cmd);
> > if (!ovs_header)
> > @@ -1625,7 +1725,7 @@ static int ovs_dp_cmd_fill_info(struct datapath *dp, struct sk_buff *skb,
> > goto nla_put_failure;
> >
> > if (nla_put_u32(skb, OVS_DP_ATTR_MASKS_CACHE_SIZE,
> > - ovs_flow_tbl_masks_cache_size(&dp->table)))
> > + ovs_flow_tbl_masks_cache_size(table)))
> > goto nla_put_failure;
> >
> > if (dp->user_features & OVS_DP_F_DISPATCH_UPCALL_PER_CPU && pids) {
> > @@ -1736,6 +1836,7 @@ u32 ovs_dp_get_upcall_portid(const struct datapath *dp, uint32_t cpu_id)
> > static int ovs_dp_change(struct datapath *dp, struct nlattr *a[])
> > {
> > u32 user_features = 0, old_features = dp->user_features;
> > + struct flow_table *table;
> > int err;
> >
> > if (a[OVS_DP_ATTR_USER_FEATURES]) {
> > @@ -1757,8 +1858,12 @@ static int ovs_dp_change(struct datapath *dp, struct nlattr *a[])
> > int err;
> > u32 cache_size;
> >
> > + table = ovsl_dereference(dp->table);
> > + if (!table)
> > + return -ENODEV;
> > +
> > cache_size = nla_get_u32(a[OVS_DP_ATTR_MASKS_CACHE_SIZE]);
> > - err = ovs_flow_tbl_masks_cache_resize(&dp->table, cache_size);
> > + err = ovs_flow_tbl_masks_cache_resize(table, cache_size);
> > if (err)
> > return err;
> > }
> > @@ -1810,6 +1915,7 @@ static int ovs_dp_vport_init(struct datapath *dp)
> > static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info)
> > {
> > struct nlattr **a = info->attrs;
> > + struct flow_table *table;
> > struct vport_parms parms;
> > struct sk_buff *reply;
> > struct datapath *dp;
> > @@ -1833,9 +1939,12 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info)
> > ovs_dp_set_net(dp, sock_net(skb->sk));
> >
> > /* Allocate table. */
> > - err = ovs_flow_tbl_init(&dp->table);
> > - if (err)
> > + table = ovs_flow_tbl_alloc();
> > + if (IS_ERR(table)) {
> > + err = PTR_ERR(table);
> > goto err_destroy_dp;
> > + }
> > + rcu_assign_pointer(dp->table, table);
> >
> > err = ovs_dp_stats_init(dp);
> > if (err)
> > @@ -1905,7 +2014,7 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info)
> > err_destroy_stats:
> > free_percpu(dp->stats_percpu);
> > err_destroy_table:
> > - ovs_flow_tbl_destroy(&dp->table);
> > + ovs_flow_tbl_put(table);
> > err_destroy_dp:
> > kfree(dp);
> > err_destroy_reply:
> > @@ -1917,7 +2026,8 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info)
> > /* Called with ovs_mutex. */
> > static void __dp_destroy(struct datapath *dp)
> > {
> > - struct flow_table *table = &dp->table;
> > + struct flow_table *table = rcu_dereference_protected(dp->table,
> > + lockdep_ovsl_is_held());
> > int i;
> >
> > if (dp->user_features & OVS_DP_F_TC_RECIRC_SHARING)
> > @@ -1939,14 +2049,10 @@ static void __dp_destroy(struct datapath *dp)
> > */
> > ovs_dp_detach_port(ovs_vport_ovsl(dp, OVSP_LOCAL));
> >
> > - /* Flush sw_flow in the tables. RCU cb only releases resource
> > - * such as dp, ports and tables. That may avoid some issues
> > - * such as RCU usage warning.
> > - */
> > - table_instance_flow_flush(table, ovsl_dereference(table->ti),
> > - ovsl_dereference(table->ufid_ti));
> > + rcu_assign_pointer(dp->table, NULL);
> > + ovs_flow_tbl_put(table);
> >
> > - /* RCU destroy the ports, meters and flow tables. */
> > + /* RCU destroy the ports and meters. */
> > call_rcu(&dp->rcu, destroy_dp_rcu);
> > }
> >
> > @@ -2554,13 +2660,18 @@ static void ovs_dp_masks_rebalance(struct work_struct *work)
> > {
> > struct ovs_net *ovs_net = container_of(work, struct ovs_net,
> > masks_rebalance.work);
> > + struct flow_table *table;
> > struct datapath *dp;
> >
> > ovs_lock();
> > -
> > - list_for_each_entry(dp, &ovs_net->dps, list_node)
> > - ovs_flow_masks_rebalance(&dp->table);
> > -
> > + list_for_each_entry(dp, &ovs_net->dps, list_node) {
> > + table = ovsl_dereference(dp->table);
> > + if (!table)
> > + continue;
>
> Should we take a reference on the table here? I guess it's kind of safe
> because of the ovs_lock() above, but if that gets removed it's possible
> someone misses that there isn't a refcnt pin here (whereas everywhere
> else has an ovs_flow_tbl_get() before it).
>
Good point. As you say, it is safe but still we should probably do it.
I'll change this.
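Roughly, the pinning pattern would look like the userspace model below.
This is only a sketch, not the actual change: "struct tbl", tbl_get()
and tbl_put() are hypothetical stand-ins for struct flow_table,
ovs_flow_tbl_get() and ovs_flow_tbl_put(), and pthread/stdatomic
primitives stand in for the kernel mutex and refcount_t.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct flow_table. */
struct tbl {
	atomic_int refcnt;
	pthread_mutex_t lock;
	int rebalanced;		/* counts rebalance passes, for illustration */
};

static struct tbl *tbl_alloc(void)
{
	struct tbl *t = calloc(1, sizeof(*t));

	if (!t)
		return NULL;
	atomic_init(&t->refcnt, 1);	/* creator holds the first reference */
	pthread_mutex_init(&t->lock, NULL);
	return t;
}

/* Take a reference unless the count has already dropped to zero. */
static bool tbl_get(struct tbl *t)
{
	int old = atomic_load(&t->refcnt);

	while (old > 0) {
		/* On failure, "old" is reloaded and the loop retries. */
		if (atomic_compare_exchange_weak(&t->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if it was the last one (table freed). */
static bool tbl_put(struct tbl *t)
{
	int old = atomic_fetch_sub(&t->refcnt, 1);

	assert(old > 0);
	if (old != 1)
		return false;
	pthread_mutex_destroy(&t->lock);
	free(t);
	return true;
}

/* Rebalance loop: pin each table before taking its lock. */
static void rebalance_all(struct tbl **tables, int n)
{
	for (int i = 0; i < n; i++) {
		struct tbl *t = tables[i];

		if (!t || !tbl_get(t))
			continue;
		pthread_mutex_lock(&t->lock);
		t->rebalanced++;	/* stands in for ovs_flow_masks_rebalance() */
		pthread_mutex_unlock(&t->lock);
		tbl_put(t);
	}
}
```

The point is just that the loop no longer relies on the outer lock to
keep the table alive, so it stays correct if that lock's scope changes
later.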
Just for reference:
I actually contemplated removing the lock here, or at least reducing
its scope. We still need it to serialize access to "&ovs_net->dps", but
we could take a reference on the table and then release the lock. The
code would then look bad because we'd be releasing the lock in the
middle of the loop. After some thought, all this complexity didn't feel
necessary for something that happens every 4s and that is not affected
by RTNL contention.
Thanks.
Adrián
> > + mutex_lock(&table->lock);
> > + ovs_flow_masks_rebalance(table);
> > + mutex_unlock(&table->lock);
> > + }
> > ovs_unlock();
> >
> > schedule_delayed_work(&ovs_net->masks_rebalance,
> > diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
> > index db0c3e69d66c..44773bf9f645 100644
> > --- a/net/openvswitch/datapath.h
> > +++ b/net/openvswitch/datapath.h
> > @@ -90,7 +90,7 @@ struct datapath {
> > struct list_head list_node;
> >
> > /* Flow table. */
> > - struct flow_table table;
> > + struct flow_table __rcu *table;
> >
> > /* Switch ports. */
> > struct hlist_head *ports;
> > diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
> > index 66366982f604..0a748cf20f53 100644
> > --- a/net/openvswitch/flow.c
> > +++ b/net/openvswitch/flow.c
> > @@ -124,8 +124,9 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags,
> > spin_unlock(&stats->lock);
> > }
> >
> > -/* Must be called with rcu_read_lock or ovs_mutex. */
> > +/* Must be called with rcu_read_lock or table->lock held. */
> > void ovs_flow_stats_get(const struct sw_flow *flow,
> > + const struct flow_table *table,
> > struct ovs_flow_stats *ovs_stats,
> > unsigned long *used, __be16 *tcp_flags)
> > {
> > @@ -136,7 +137,8 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
> > memset(ovs_stats, 0, sizeof(*ovs_stats));
> >
> > for_each_cpu(cpu, flow->cpu_used_mask) {
> > - struct sw_flow_stats *stats = rcu_dereference_ovsl(flow->stats[cpu]);
> > + struct sw_flow_stats *stats =
> > + rcu_dereference_ovs_tbl(flow->stats[cpu], table);
> >
> > if (stats) {
> > /* Local CPU may write on non-local stats, so we must
> > @@ -153,13 +155,14 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
> > }
> > }
> >
> > -/* Called with ovs_mutex. */
> > -void ovs_flow_stats_clear(struct sw_flow *flow)
> > +/* Called with table->lock held. */
> > +void ovs_flow_stats_clear(struct sw_flow *flow, struct flow_table *table)
> > {
> > unsigned int cpu;
> >
> > for_each_cpu(cpu, flow->cpu_used_mask) {
> > - struct sw_flow_stats *stats = ovsl_dereference(flow->stats[cpu]);
> > + struct sw_flow_stats *stats =
> > + ovs_tbl_dereference(flow->stats[cpu], table);
> >
> > if (stats) {
> > spin_lock_bh(&stats->lock);
> > diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
> > index b5711aff6e76..e05ed6796e4e 100644
> > --- a/net/openvswitch/flow.h
> > +++ b/net/openvswitch/flow.h
> > @@ -23,6 +23,7 @@
> > #include <net/dst_metadata.h>
> > #include <net/nsh.h>
> >
> > +struct flow_table;
> > struct sk_buff;
> >
> > enum sw_flow_mac_proto {
> > @@ -280,9 +281,11 @@ static inline bool ovs_identifier_is_key(const struct sw_flow_id *sfid)
> >
> > void ovs_flow_stats_update(struct sw_flow *, __be16 tcp_flags,
> > const struct sk_buff *);
> > -void ovs_flow_stats_get(const struct sw_flow *, struct ovs_flow_stats *,
> > - unsigned long *used, __be16 *tcp_flags);
> > -void ovs_flow_stats_clear(struct sw_flow *);
> > +void ovs_flow_stats_get(const struct sw_flow *flow,
> > + const struct flow_table *table,
> > + struct ovs_flow_stats *stats, unsigned long *used,
> > + __be16 *tcp_flags);
> > +void ovs_flow_stats_clear(struct sw_flow *flow, struct flow_table *table);
> > u64 ovs_flow_used_time(unsigned long flow_jiffies);
> >
> > int ovs_flow_key_update(struct sk_buff *skb, struct sw_flow_key *key);
> > diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
> > index 61c6a5f77c2e..d9dbe4b4807c 100644
> > --- a/net/openvswitch/flow_table.c
> > +++ b/net/openvswitch/flow_table.c
> > @@ -45,6 +45,16 @@
> > static struct kmem_cache *flow_cache;
> > struct kmem_cache *flow_stats_cache __read_mostly;
> >
> > +#ifdef CONFIG_LOCKDEP
> > +int lockdep_ovs_tbl_is_held(const struct flow_table *table)
> > +{
> > + if (debug_locks)
> > + return lockdep_is_held(&table->lock);
> > + else
> > + return 1;
> > +}
> > +#endif
> > +
> > static u16 range_n_bytes(const struct sw_flow_key_range *range)
> > {
> > return range->end - range->start;
> > @@ -249,12 +259,12 @@ static int tbl_mask_array_realloc(struct flow_table *tbl, int size)
> > if (!new)
> > return -ENOMEM;
> >
> > - old = ovsl_dereference(tbl->mask_array);
> > + old = ovs_tbl_dereference(tbl->mask_array, tbl);
> > if (old) {
> > int i;
> >
> > for (i = 0; i < old->max; i++) {
> > - if (ovsl_dereference(old->masks[i]))
> > + if (ovs_tbl_dereference(old->masks[i], tbl))
> > new->masks[new->count++] = old->masks[i];
> > }
> > call_rcu(&old->rcu, mask_array_rcu_cb);
> > @@ -268,7 +278,7 @@ static int tbl_mask_array_realloc(struct flow_table *tbl, int size)
> > static int tbl_mask_array_add_mask(struct flow_table *tbl,
> > struct sw_flow_mask *new)
> > {
> > - struct mask_array *ma = ovsl_dereference(tbl->mask_array);
> > + struct mask_array *ma = ovs_tbl_dereference(tbl->mask_array, tbl);
> > int err, ma_count = READ_ONCE(ma->count);
> >
> > if (ma_count >= ma->max) {
> > @@ -277,7 +287,7 @@ static int tbl_mask_array_add_mask(struct flow_table *tbl,
> > if (err)
> > return err;
> >
> > - ma = ovsl_dereference(tbl->mask_array);
> > + ma = ovs_tbl_dereference(tbl->mask_array, tbl);
> > } else {
> > /* On every add or delete we need to reset the counters so
> > * every new mask gets a fair chance of being prioritized.
> > @@ -285,7 +295,7 @@ static int tbl_mask_array_add_mask(struct flow_table *tbl,
> > tbl_mask_array_reset_counters(ma);
> > }
> >
> > - BUG_ON(ovsl_dereference(ma->masks[ma_count]));
> > + WARN_ON_ONCE(ovs_tbl_dereference(ma->masks[ma_count], tbl));
> >
> > rcu_assign_pointer(ma->masks[ma_count], new);
> > WRITE_ONCE(ma->count, ma_count + 1);
> > @@ -296,12 +306,12 @@ static int tbl_mask_array_add_mask(struct flow_table *tbl,
> > static void tbl_mask_array_del_mask(struct flow_table *tbl,
> > struct sw_flow_mask *mask)
> > {
> > - struct mask_array *ma = ovsl_dereference(tbl->mask_array);
> > + struct mask_array *ma = ovs_tbl_dereference(tbl->mask_array, tbl);
> > int i, ma_count = READ_ONCE(ma->count);
> >
> > /* Remove the deleted mask pointers from the array */
> > for (i = 0; i < ma_count; i++) {
> > - if (mask == ovsl_dereference(ma->masks[i]))
> > + if (mask == ovs_tbl_dereference(ma->masks[i], tbl))
> > goto found;
> > }
> >
> > @@ -329,10 +339,10 @@ static void tbl_mask_array_del_mask(struct flow_table *tbl,
> > static void flow_mask_remove(struct flow_table *tbl, struct sw_flow_mask *mask)
> > {
> > if (mask) {
> > - /* ovs-lock is required to protect mask-refcount and
> > + /* table lock is required to protect mask-refcount and
> > * mask list.
> > */
> > - ASSERT_OVSL();
> > + ASSERT_OVS_TBL(tbl);
> > BUG_ON(!mask->ref_count);
> > mask->ref_count--;
> >
> > @@ -386,7 +396,8 @@ static struct mask_cache *tbl_mask_cache_alloc(u32 size)
> > }
> > int ovs_flow_tbl_masks_cache_resize(struct flow_table *table, u32 size)
> > {
> > - struct mask_cache *mc = rcu_dereference_ovsl(table->mask_cache);
> > + struct mask_cache *mc = rcu_dereference_ovs_tbl(table->mask_cache,
> > + table);
> > struct mask_cache *new;
> >
> > if (size == mc->cache_size)
> > @@ -406,15 +417,23 @@ int ovs_flow_tbl_masks_cache_resize(struct flow_table *table, u32 size)
> > return 0;
> > }
> >
> > -int ovs_flow_tbl_init(struct flow_table *table)
> > +struct flow_table *ovs_flow_tbl_alloc(void)
> > {
> > struct table_instance *ti, *ufid_ti;
> > + struct flow_table *table;
> > struct mask_cache *mc;
> > struct mask_array *ma;
> >
> > + table = kzalloc_obj(*table, GFP_KERNEL);
> > + if (!table)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + mutex_init(&table->lock);
> > + refcount_set(&table->refcnt, 1);
> > +
> > mc = tbl_mask_cache_alloc(MC_DEFAULT_HASH_ENTRIES);
> > if (!mc)
> > - return -ENOMEM;
> > + goto free_table;
> >
> > ma = tbl_mask_array_alloc(MASK_ARRAY_SIZE_MIN);
> > if (!ma)
> > @@ -435,7 +454,7 @@ int ovs_flow_tbl_init(struct flow_table *table)
> > table->last_rehash = jiffies;
> > table->count = 0;
> > table->ufid_count = 0;
> > - return 0;
> > + return table;
> >
> > free_ti:
> > __table_instance_destroy(ti);
> > @@ -443,7 +462,10 @@ int ovs_flow_tbl_init(struct flow_table *table)
> > __mask_array_destroy(ma);
> > free_mask_cache:
> > __mask_cache_destroy(mc);
> > - return -ENOMEM;
> > +free_table:
> > + mutex_destroy(&table->lock);
> > + kfree(table);
> > + return ERR_PTR(-ENOMEM);
> > }
> >
> > static void flow_tbl_destroy_rcu_cb(struct rcu_head *rcu)
> > @@ -470,7 +492,7 @@ static void table_instance_flow_free(struct flow_table *table,
> > flow_mask_remove(table, flow->mask);
> > }
> >
> > -/* Must be called with OVS mutex held. */
> > +/* Must be called with table mutex held. */
> > void table_instance_flow_flush(struct flow_table *table,
> > struct table_instance *ti,
> > struct table_instance *ufid_ti)
> > @@ -505,11 +527,11 @@ static void table_instance_destroy(struct table_instance *ti,
> > call_rcu(&ufid_ti->rcu, flow_tbl_destroy_rcu_cb);
> > }
> >
> > -/* No need for locking this function is called from RCU callback or
> > - * error path.
> > - */
> > -void ovs_flow_tbl_destroy(struct flow_table *table)
> > +/* No need for locking this function is called from RCU callback. */
> > +static void ovs_flow_tbl_destroy_rcu(struct rcu_head *rcu)
> > {
> > + struct flow_table *table = container_of(rcu, struct flow_table, rcu);
> > +
> > struct table_instance *ti = rcu_dereference_raw(table->ti);
> > struct table_instance *ufid_ti = rcu_dereference_raw(table->ufid_ti);
> > struct mask_cache *mc = rcu_dereference_raw(table->mask_cache);
> > @@ -518,6 +540,20 @@ void ovs_flow_tbl_destroy(struct flow_table *table)
> > call_rcu(&mc->rcu, mask_cache_rcu_cb);
> > call_rcu(&ma->rcu, mask_array_rcu_cb);
> > table_instance_destroy(ti, ufid_ti);
> > + mutex_destroy(&table->lock);
> > + kfree(table);
> > +}
> > +
> > +void ovs_flow_tbl_put(struct flow_table *table)
> > +{
> > + if (refcount_dec_and_test(&table->refcnt)) {
> > + mutex_lock(&table->lock);
> > + table_instance_flow_flush(table,
> > + ovs_tbl_dereference(table->ti, table),
> > + ovs_tbl_dereference(table->ufid_ti, table));
> > + mutex_unlock(&table->lock);
> > + call_rcu(&table->rcu, ovs_flow_tbl_destroy_rcu);
> > + }
> > }
> >
> > struct sw_flow *ovs_flow_tbl_dump_next(struct table_instance *ti,
> > @@ -571,7 +607,8 @@ static void ufid_table_instance_insert(struct table_instance *ti,
> > hlist_add_head_rcu(&flow->ufid_table.node[ti->node_ver], head);
> > }
> >
> > -static void flow_table_copy_flows(struct table_instance *old,
> > +static void flow_table_copy_flows(struct flow_table *table,
> > + struct table_instance *old,
> > struct table_instance *new, bool ufid)
> > {
> > int old_ver;
> > @@ -588,17 +625,18 @@ static void flow_table_copy_flows(struct table_instance *old,
> > if (ufid)
> > hlist_for_each_entry_rcu(flow, head,
> > ufid_table.node[old_ver],
> > - lockdep_ovsl_is_held())
> > + lockdep_ovs_tbl_is_held(table))
> > ufid_table_instance_insert(new, flow);
> > else
> > hlist_for_each_entry_rcu(flow, head,
> > flow_table.node[old_ver],
> > - lockdep_ovsl_is_held())
> > + lockdep_ovs_tbl_is_held(table))
> > table_instance_insert(new, flow);
> > }
> > }
> >
> > -static struct table_instance *table_instance_rehash(struct table_instance *ti,
> > +static struct table_instance *table_instance_rehash(struct flow_table *table,
> > + struct table_instance *ti,
> > int n_buckets, bool ufid)
> > {
> > struct table_instance *new_ti;
> > @@ -607,16 +645,19 @@ static struct table_instance *table_instance_rehash(struct table_instance *ti,
> > if (!new_ti)
> > return NULL;
> >
> > - flow_table_copy_flows(ti, new_ti, ufid);
> > + flow_table_copy_flows(table, ti, new_ti, ufid);
> >
> > return new_ti;
> > }
> >
> > +/* Must be called with flow_table->lock held. */
> > int ovs_flow_tbl_flush(struct flow_table *flow_table)
> > {
> > struct table_instance *old_ti, *new_ti;
> > struct table_instance *old_ufid_ti, *new_ufid_ti;
> >
> > + ASSERT_OVS_TBL(flow_table);
> > +
> > new_ti = table_instance_alloc(TBL_MIN_BUCKETS);
> > if (!new_ti)
> > return -ENOMEM;
> > @@ -624,8 +665,8 @@ int ovs_flow_tbl_flush(struct flow_table *flow_table)
> > if (!new_ufid_ti)
> > goto err_free_ti;
> >
> > - old_ti = ovsl_dereference(flow_table->ti);
> > - old_ufid_ti = ovsl_dereference(flow_table->ufid_ti);
> > + old_ti = ovs_tbl_dereference(flow_table->ti, flow_table);
> > + old_ufid_ti = ovs_tbl_dereference(flow_table->ufid_ti, flow_table);
> >
> > rcu_assign_pointer(flow_table->ti, new_ti);
> > rcu_assign_pointer(flow_table->ufid_ti, new_ufid_ti);
> > @@ -693,7 +734,8 @@ static bool ovs_flow_cmp_unmasked_key(const struct sw_flow *flow,
> > return cmp_key(flow->id.unmasked_key, key, key_start, key_end);
> > }
> >
> > -static struct sw_flow *masked_flow_lookup(struct table_instance *ti,
> > +static struct sw_flow *masked_flow_lookup(struct flow_table *tbl,
> > + struct table_instance *ti,
> > const struct sw_flow_key *unmasked,
> > const struct sw_flow_mask *mask,
> > u32 *n_mask_hit)
> > @@ -709,7 +751,7 @@ static struct sw_flow *masked_flow_lookup(struct table_instance *ti,
> > (*n_mask_hit)++;
> >
> > hlist_for_each_entry_rcu(flow, head, flow_table.node[ti->node_ver],
> > - lockdep_ovsl_is_held()) {
> > + lockdep_ovs_tbl_is_held(tbl)) {
> > if (flow->mask == mask && flow->flow_table.hash == hash &&
> > flow_cmp_masked_key(flow, &masked_key, &mask->range))
> > return flow;
> > @@ -736,9 +778,9 @@ static struct sw_flow *flow_lookup(struct flow_table *tbl,
> > int i;
> >
> > if (likely(*index < ma->max)) {
> > - mask = rcu_dereference_ovsl(ma->masks[*index]);
> > + mask = rcu_dereference_ovs_tbl(ma->masks[*index], tbl);
> > if (mask) {
> > - flow = masked_flow_lookup(ti, key, mask, n_mask_hit);
> > + flow = masked_flow_lookup(tbl, ti, key, mask, n_mask_hit);
> > if (flow) {
> > u64_stats_update_begin(&stats->syncp);
> > stats->usage_cntrs[*index]++;
> > @@ -754,11 +796,11 @@ static struct sw_flow *flow_lookup(struct flow_table *tbl,
> > if (i == *index)
> > continue;
> >
> > - mask = rcu_dereference_ovsl(ma->masks[i]);
> > + mask = rcu_dereference_ovs_tbl(ma->masks[i], tbl);
> > if (unlikely(!mask))
> > break;
> >
> > - flow = masked_flow_lookup(ti, key, mask, n_mask_hit);
> > + flow = masked_flow_lookup(tbl, ti, key, mask, n_mask_hit);
> > if (flow) { /* Found */
> > *index = i;
> > u64_stats_update_begin(&stats->syncp);
> > @@ -845,8 +887,8 @@ struct sw_flow *ovs_flow_tbl_lookup_stats(struct flow_table *tbl,
> > struct sw_flow *ovs_flow_tbl_lookup(struct flow_table *tbl,
> > const struct sw_flow_key *key)
> > {
> > - struct table_instance *ti = rcu_dereference_ovsl(tbl->ti);
> > - struct mask_array *ma = rcu_dereference_ovsl(tbl->mask_array);
> > + struct table_instance *ti = rcu_dereference_ovs_tbl(tbl->ti, tbl);
> > + struct mask_array *ma = rcu_dereference_ovs_tbl(tbl->mask_array, tbl);
> > u32 __always_unused n_mask_hit;
> > u32 __always_unused n_cache_hit;
> > struct sw_flow *flow;
> > @@ -865,21 +907,22 @@ struct sw_flow *ovs_flow_tbl_lookup(struct flow_table *tbl,
> > struct sw_flow *ovs_flow_tbl_lookup_exact(struct flow_table *tbl,
> > const struct sw_flow_match *match)
> > {
> > - struct mask_array *ma = ovsl_dereference(tbl->mask_array);
> > + struct mask_array *ma = ovs_tbl_dereference(tbl->mask_array, tbl);
> > int i;
> >
> > - /* Always called under ovs-mutex. */
> > + /* Always called under tbl->lock. */
> > for (i = 0; i < ma->max; i++) {
> > - struct table_instance *ti = rcu_dereference_ovsl(tbl->ti);
> > + struct table_instance *ti =
> > + rcu_dereference_ovs_tbl(tbl->ti, tbl);
> > u32 __always_unused n_mask_hit;
> > struct sw_flow_mask *mask;
> > struct sw_flow *flow;
> >
> > - mask = ovsl_dereference(ma->masks[i]);
> > + mask = ovs_tbl_dereference(ma->masks[i], tbl);
> > if (!mask)
> > continue;
> >
> > - flow = masked_flow_lookup(ti, match->key, mask, &n_mask_hit);
> > + flow = masked_flow_lookup(tbl, ti, match->key, mask, &n_mask_hit);
> > if (flow && ovs_identifier_is_key(&flow->id) &&
> > ovs_flow_cmp_unmasked_key(flow, match)) {
> > return flow;
> > @@ -915,7 +958,7 @@ bool ovs_flow_cmp(const struct sw_flow *flow,
> > struct sw_flow *ovs_flow_tbl_lookup_ufid(struct flow_table *tbl,
> > const struct sw_flow_id *ufid)
> > {
> > - struct table_instance *ti = rcu_dereference_ovsl(tbl->ufid_ti);
> > + struct table_instance *ti = rcu_dereference_ovs_tbl(tbl->ufid_ti, tbl);
> > struct sw_flow *flow;
> > struct hlist_head *head;
> > u32 hash;
> > @@ -923,7 +966,7 @@ struct sw_flow *ovs_flow_tbl_lookup_ufid(struct flow_table *tbl,
> > hash = ufid_hash(ufid);
> > head = find_bucket(ti, hash);
> > hlist_for_each_entry_rcu(flow, head, ufid_table.node[ti->node_ver],
> > - lockdep_ovsl_is_held()) {
> > + lockdep_ovs_tbl_is_held(tbl)) {
> > if (flow->ufid_table.hash == hash &&
> > ovs_flow_cmp_ufid(flow, ufid))
> > return flow;
> > @@ -933,28 +976,33 @@ struct sw_flow *ovs_flow_tbl_lookup_ufid(struct flow_table *tbl,
> >
> > int ovs_flow_tbl_num_masks(const struct flow_table *table)
> > {
> > - struct mask_array *ma = rcu_dereference_ovsl(table->mask_array);
> > + struct mask_array *ma = rcu_dereference_ovs_tbl(table->mask_array,
> > + table);
> > return READ_ONCE(ma->count);
> > }
> >
> > u32 ovs_flow_tbl_masks_cache_size(const struct flow_table *table)
> > {
> > - struct mask_cache *mc = rcu_dereference_ovsl(table->mask_cache);
> > + struct mask_cache *mc = rcu_dereference_ovs_tbl(table->mask_cache,
> > + table);
> >
> > return READ_ONCE(mc->cache_size);
> > }
> >
> > -static struct table_instance *table_instance_expand(struct table_instance *ti,
> > +static struct table_instance *table_instance_expand(struct flow_table *table,
> > + struct table_instance *ti,
> > bool ufid)
> > {
> > - return table_instance_rehash(ti, ti->n_buckets * 2, ufid);
> > + return table_instance_rehash(table, ti, ti->n_buckets * 2, ufid);
> > }
> >
> > -/* Must be called with OVS mutex held. */
> > +/* Must be called with table mutex held. */
> > void ovs_flow_tbl_remove(struct flow_table *table, struct sw_flow *flow)
> > {
> > - struct table_instance *ti = ovsl_dereference(table->ti);
> > - struct table_instance *ufid_ti = ovsl_dereference(table->ufid_ti);
> > + struct table_instance *ti = ovs_tbl_dereference(table->ti,
> > + table);
> > + struct table_instance *ufid_ti = ovs_tbl_dereference(table->ufid_ti,
> > + table);
> >
> > BUG_ON(table->count == 0);
> > table_instance_flow_free(table, ti, ufid_ti, flow);
> > @@ -988,10 +1036,10 @@ static struct sw_flow_mask *flow_mask_find(const struct flow_table *tbl,
> > struct mask_array *ma;
> > int i;
> >
> > - ma = ovsl_dereference(tbl->mask_array);
> > + ma = ovs_tbl_dereference(tbl->mask_array, tbl);
> > for (i = 0; i < ma->max; i++) {
> > struct sw_flow_mask *t;
> > - t = ovsl_dereference(ma->masks[i]);
> > + t = ovs_tbl_dereference(ma->masks[i], tbl);
> >
> > if (t && mask_equal(mask, t))
> > return t;
> > @@ -1029,22 +1077,25 @@ static int flow_mask_insert(struct flow_table *tbl, struct sw_flow *flow,
> > return 0;
> > }
> >
> > -/* Must be called with OVS mutex held. */
> > +/* Must be called with table mutex held. */
> > static void flow_key_insert(struct flow_table *table, struct sw_flow *flow)
> > {
> > struct table_instance *new_ti = NULL;
> > struct table_instance *ti;
> >
> > + ASSERT_OVS_TBL(table);
> > +
> > flow->flow_table.hash = flow_hash(&flow->key, &flow->mask->range);
> > - ti = ovsl_dereference(table->ti);
> > + ti = ovs_tbl_dereference(table->ti, table);
> > table_instance_insert(ti, flow);
> > table->count++;
> >
> > /* Expand table, if necessary, to make room. */
> > if (table->count > ti->n_buckets)
> > - new_ti = table_instance_expand(ti, false);
> > + new_ti = table_instance_expand(table, ti, false);
> > else if (time_after(jiffies, table->last_rehash + REHASH_INTERVAL))
> > - new_ti = table_instance_rehash(ti, ti->n_buckets, false);
> > + new_ti = table_instance_rehash(table, ti, ti->n_buckets,
> > + false);
> >
> > if (new_ti) {
> > rcu_assign_pointer(table->ti, new_ti);
> > @@ -1053,13 +1104,15 @@ static void flow_key_insert(struct flow_table *table, struct sw_flow *flow)
> > }
> > }
> >
> > -/* Must be called with OVS mutex held. */
> > +/* Must be called with table mutex held. */
> > static void flow_ufid_insert(struct flow_table *table, struct sw_flow *flow)
> > {
> > struct table_instance *ti;
> >
> > + ASSERT_OVS_TBL(table);
> > +
> > flow->ufid_table.hash = ufid_hash(&flow->id);
> > - ti = ovsl_dereference(table->ufid_ti);
> > + ti = ovs_tbl_dereference(table->ufid_ti, table);
> > ufid_table_instance_insert(ti, flow);
> > table->ufid_count++;
> >
> > @@ -1067,7 +1120,7 @@ static void flow_ufid_insert(struct flow_table *table, struct sw_flow *flow)
> > if (table->ufid_count > ti->n_buckets) {
> > struct table_instance *new_ti;
> >
> > - new_ti = table_instance_expand(ti, true);
> > + new_ti = table_instance_expand(table, ti, true);
> > if (new_ti) {
> > rcu_assign_pointer(table->ufid_ti, new_ti);
> > call_rcu(&ti->rcu, flow_tbl_destroy_rcu_cb);
> > @@ -1075,12 +1128,14 @@ static void flow_ufid_insert(struct flow_table *table, struct sw_flow *flow)
> > }
> > }
> >
> > -/* Must be called with OVS mutex held. */
> > +/* Must be called with table mutex held. */
> > int ovs_flow_tbl_insert(struct flow_table *table, struct sw_flow *flow,
> > const struct sw_flow_mask *mask)
> > {
> > int err;
> >
> > + ASSERT_OVS_TBL(table);
> > +
> > err = flow_mask_insert(table, flow, mask);
> > if (err)
> > return err;
> > @@ -1099,10 +1154,11 @@ static int compare_mask_and_count(const void *a, const void *b)
> > return (s64)mc_b->counter - (s64)mc_a->counter;
> > }
> >
> > -/* Must be called with OVS mutex held. */
> > +/* Must be called with table->lock held. */
> > void ovs_flow_masks_rebalance(struct flow_table *table)
> > {
> > - struct mask_array *ma = rcu_dereference_ovsl(table->mask_array);
> > + struct mask_array *ma = rcu_dereference_ovs_tbl(table->mask_array,
> > + table);
> > struct mask_count *masks_and_count;
> > struct mask_array *new;
> > int masks_entries = 0;
> > @@ -1117,7 +1173,7 @@ void ovs_flow_masks_rebalance(struct flow_table *table)
> > struct sw_flow_mask *mask;
> > int cpu;
> >
> > - mask = rcu_dereference_ovsl(ma->masks[i]);
> > + mask = rcu_dereference_ovs_tbl(ma->masks[i], table);
> > if (unlikely(!mask))
> > break;
> >
> > @@ -1171,7 +1227,7 @@ void ovs_flow_masks_rebalance(struct flow_table *table)
> > for (i = 0; i < masks_entries; i++) {
> > int index = masks_and_count[i].index;
> >
> > - if (ovsl_dereference(ma->masks[index]))
> > + if (ovs_tbl_dereference(ma->masks[index], table))
> > new->masks[new->count++] = ma->masks[index];
> > }
> >
> > diff --git a/net/openvswitch/flow_table.h b/net/openvswitch/flow_table.h
> > index f524dc3e4862..cffd412c9045 100644
> > --- a/net/openvswitch/flow_table.h
> > +++ b/net/openvswitch/flow_table.h
> > @@ -59,7 +59,29 @@ struct table_instance {
> > u32 hash_seed;
> > };
> >
> > +/* Locking:
> > + *
> > + * flow_table is _not_ protected by ovs_lock (see comment above ovs_mutex
> > + * in datapath.c).
> > + *
> > + * All writes to flow_table are protected by the embedded "lock".
> > + * In order to ensure datapath destruction does not trigger the destruction
> > + * of the flow_table, "refcnt" is used. Therefore, writers must:
> > + * 1 - Enter rcu read-protected section
> > + * 2 - Increase "table->refcnt"
> > + * 3 - Leave rcu read-protected section (to avoid using mutexes inside rcu)
> > + * 4 - Lock "table->lock"
> > + * 5 - Perform modifications
> > + * 6 - Release "table->lock"
> > + * 7 - Decrease "table->refcnt"
> > + *
> > + * Reads are protected by RCU.
> > + */
> > struct flow_table {
> > + /* Locks flow table writes. */
> > + struct mutex lock;
> > + refcount_t refcnt;
> > + struct rcu_head rcu;
> > struct table_instance __rcu *ti;
> > struct table_instance __rcu *ufid_ti;
> > struct mask_cache __rcu *mask_cache;
> > @@ -71,15 +93,40 @@ struct flow_table {
> >
> > extern struct kmem_cache *flow_stats_cache;
> >
> > +#ifdef CONFIG_LOCKDEP
> > +int lockdep_ovs_tbl_is_held(const struct flow_table *table);
> > +#else
> > +static inline int lockdep_ovs_tbl_is_held(const struct flow_table *table)
> > +{
> > + (void)table;
> > + return 1;
> > +}
> > +#endif
> > +
> > +#define ASSERT_OVS_TBL(tbl) WARN_ON(!lockdep_ovs_tbl_is_held(tbl))
> > +
> > +/* Lock-protected update-allowed dereferences.*/
> > +#define ovs_tbl_dereference(p, tbl) \
> > + rcu_dereference_protected(p, lockdep_ovs_tbl_is_held(tbl))
> > +
> > +/* Read dereferences can be protected by either RCU, table lock or ovs_mutex. */
> > +#define rcu_dereference_ovs_tbl(p, tbl) \
> > + rcu_dereference_check(p, \
> > + lockdep_ovs_tbl_is_held(tbl) || lockdep_ovsl_is_held())
> > +
> > int ovs_flow_init(void);
> > void ovs_flow_exit(void);
> >
> > struct sw_flow *ovs_flow_alloc(void);
> > void ovs_flow_free(struct sw_flow *, bool deferred);
> >
> > -int ovs_flow_tbl_init(struct flow_table *);
> > +struct flow_table *ovs_flow_tbl_alloc(void);
> > +void ovs_flow_tbl_put(struct flow_table *table);
> > +static inline bool ovs_flow_tbl_get(struct flow_table *table)
> > +{
> > + return refcount_inc_not_zero(&table->refcnt);
> > +}
> > int ovs_flow_tbl_count(const struct flow_table *table);
> > -void ovs_flow_tbl_destroy(struct flow_table *table);
> > int ovs_flow_tbl_flush(struct flow_table *flow_table);
> >
> > int ovs_flow_tbl_insert(struct flow_table *table, struct sw_flow *flow,
>
^ permalink raw reply
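The writer sequence in the quoted flow_table locking comment depends on taking a reference only while the count is still non-zero, which is what ovs_flow_tbl_get() does through refcount_inc_not_zero(). A minimal userspace sketch of that idea follows. It assumes C11 atomics as a stand-in for refcount_t; struct tbl, tbl_get(), tbl_put() and tbl_write() are hypothetical names and not part of the patch, and the RCU section and the mutex are only marked in comments.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct flow_table: only the
 * reference count and one field that a writer would modify. */
struct tbl {
	atomic_uint refcnt;	/* 0 means the table is being destroyed */
	unsigned int count;	/* data a writer modifies under the lock */
};

/* Take a reference only if the count is still non-zero, mirroring the
 * semantics of refcount_inc_not_zero(). */
static bool tbl_get(struct tbl *t)
{
	unsigned int old = atomic_load(&t->refcnt);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&t->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference. Returns true when the caller dropped the last one
 * and would therefore be responsible for freeing the table. */
static bool tbl_put(struct tbl *t)
{
	unsigned int prev = atomic_fetch_sub(&t->refcnt, 1);

	assert(prev != 0);	/* catch refcount underflow */
	return prev == 1;
}

/* Writer path following steps 2 to 7 of the locking comment. */
static bool tbl_write(struct tbl *t)
{
	if (!tbl_get(t))	/* step 2, done inside the RCU read section */
		return false;	/* table is already on its way out */
	/* step 4: table->lock would be taken here */
	t->count++;		/* step 5: perform the modification */
	/* step 6: table->lock would be released here */
	tbl_put(t);		/* step 7 */
	return true;
}
```

Once the last reference is gone, tbl_write() refuses to touch the table, which is the property the comment relies on to keep datapath destruction from racing with a writer.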
* [PATCH] net: ipv4: fix alignment fault in sysctl_fib_multipath_hash_seed on ARM64 with Clang
From: Juno Choi @ 2026-04-15 5:13 UTC (permalink / raw)
To: netdev; +Cc: davem, edumazet, kuba, pabeni, horms, linux-kernel, Juno Choi
From: Juno Choi <juno.choi@lge.com>
On ARM64, Clang may generate ldaxr (64-bit exclusive load) for
READ_ONCE() on 8-byte structs. ldaxr requires 8-byte natural
alignment, but sysctl_fib_multipath_hash_seed (two u32 members)
only has 4-byte natural alignment.
When this struct lands at a 4-byte-aligned but not 8-byte-aligned
offset within struct netns_ipv4, the ldaxr triggers an alignment
fault in rt6_multipath_hash(), causing a kernel panic in the IPv6
packet receive path (rtl8168_poll -> ipv6_list_rcv ->
rt6_multipath_hash).
Add __aligned(8) to the struct definition when building for ARM64
with Clang to ensure proper alignment for atomic 8-byte loads.
Signed-off-by: Juno Choi <juno.choi@lge.com>
---
include/net/netns/ipv4.h | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 276f622f3516..4366ab26512d 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -41,11 +41,18 @@ struct inet_timewait_death_row {
struct tcp_fastopen_context;
#ifdef CONFIG_IP_ROUTE_MULTIPATH
+#if defined(CONFIG_ARM64) && defined(CONFIG_CC_IS_CLANG)
+struct sysctl_fib_multipath_hash_seed {
+ u32 user_seed;
+ u32 mp_seed;
+} __aligned(8);
+#else
struct sysctl_fib_multipath_hash_seed {
u32 user_seed;
u32 mp_seed;
};
#endif
+#endif
struct netns_ipv4 {
/* Cacheline organization can be found documented in
--
2.43.0
^ permalink raw reply related
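The alignment argument in the patch above can be checked in userspace. The sketch below assumes a GCC or Clang compiler (for `__attribute__((aligned(8)))`, which is what the kernel's __aligned(8) expands to) and an ABI where uint32_t is 4-byte aligned. The outer_* structs are hypothetical and only model "a 4-byte field in front of the seed"; they are not the real struct netns_ipv4 layout.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Same members as the struct in the patch, with userspace types. */
struct seed_plain {
	uint32_t user_seed;
	uint32_t mp_seed;
};

/* Variant carrying the alignment the patch requests. */
struct seed_aligned8 {
	uint32_t user_seed;
	uint32_t mp_seed;
} __attribute__((aligned(8)));

/* Hypothetical enclosing structs: a leading 4-byte member pushes the
 * seed to offset 4 unless the seed's alignment forces padding. */
struct outer_plain {
	uint32_t pad;
	struct seed_plain seed;
};

struct outer_aligned8 {
	uint32_t pad;
	struct seed_aligned8 seed;
};

/* Both variants are 8 bytes, so an 8-byte load covers either one. */
static_assert(sizeof(struct seed_plain) == 8, "two u32 members");
static_assert(sizeof(struct seed_aligned8) == 8, "no size change");

static size_t seed_offset_plain(void)
{
	return offsetof(struct outer_plain, seed);
}

static size_t seed_offset_aligned8(void)
{
	return offsetof(struct outer_aligned8, seed);
}
```

With the plain struct the seed can sit at an offset that is a multiple of 4 but not of 8, which is the placement the commit message says ldaxr faults on. The aligned variant trades four bytes of padding for a guaranteed 8-byte boundary.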
* Re: [PATCH bpf v4 5/5] bpf, sockmap: Take state lock for af_unix iter
From: Kuniyuki Iwashima @ 2026-04-15 5:02 UTC (permalink / raw)
To: Michal Luczaj
Cc: John Fastabend, Jakub Sitnicki, Eric Dumazet, Paolo Abeni,
Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman,
Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Cong Wang, netdev, bpf, linux-kernel, linux-kselftest
In-Reply-To: <20260414-unix-proto-update-null-ptr-deref-v4-5-2af6fe97918e@rbox.co>
On Tue, Apr 14, 2026 at 7:13 AM Michal Luczaj <mhal@rbox.co> wrote:
>
> When a BPF iterator program updates a sockmap, there is a race condition in
> unix_stream_bpf_update_proto() where the `peer` pointer can become stale[1]
> during a state transition TCP_ESTABLISHED -> TCP_CLOSE.
>
> CPU0 bpf CPU1 close
> -------- ----------
> // unix_stream_bpf_update_proto()
> sk_pair = unix_peer(sk)
> if (unlikely(!sk_pair))
> return -EINVAL;
> // unix_release_sock()
> skpair = unix_peer(sk);
> unix_peer(sk) = NULL;
> sock_put(skpair)
> sock_hold(sk_pair) // UaF
>
> More practically, this fix guarantees that the iterator program is
> consistently provided with a unix socket that remains stable during
> iterator execution.
>
> [1]:
> BUG: KASAN: slab-use-after-free in unix_stream_bpf_update_proto+0x155/0x490
> Write of size 4 at addr ffff8881178c9a00 by task test_progs/2231
> Call Trace:
> dump_stack_lvl+0x5d/0x80
> print_report+0x170/0x4f3
> kasan_report+0xe4/0x1c0
> kasan_check_range+0x125/0x200
> unix_stream_bpf_update_proto+0x155/0x490
> sock_map_link+0x71c/0xec0
> sock_map_update_common+0xbc/0x600
> sock_map_update_elem+0x19a/0x1f0
> bpf_prog_bbbf56096cdd4f01_selective_dump_unix+0x20c/0x217
> bpf_iter_run_prog+0x21e/0xae0
> bpf_iter_unix_seq_show+0x1e0/0x2a0
> bpf_seq_read+0x42c/0x10d0
> vfs_read+0x171/0xb20
> ksys_read+0xff/0x200
> do_syscall_64+0xf7/0x5e0
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Allocated by task 2236:
> kasan_save_stack+0x30/0x50
> kasan_save_track+0x14/0x30
> __kasan_slab_alloc+0x63/0x80
> kmem_cache_alloc_noprof+0x1d5/0x680
> sk_prot_alloc+0x59/0x210
> sk_alloc+0x34/0x470
> unix_create1+0x86/0x8a0
> unix_stream_connect+0x318/0x15b0
> __sys_connect+0xfd/0x130
> __x64_sys_connect+0x72/0xd0
> do_syscall_64+0xf7/0x5e0
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Freed by task 2236:
> kasan_save_stack+0x30/0x50
> kasan_save_track+0x14/0x30
> kasan_save_free_info+0x3b/0x70
> __kasan_slab_free+0x47/0x70
> kmem_cache_free+0x11c/0x590
> __sk_destruct+0x432/0x6e0
> unix_release_sock+0x9b3/0xf60
> unix_release+0x8a/0xf0
> __sock_release+0xb0/0x270
> sock_close+0x18/0x20
> __fput+0x36e/0xac0
> fput_close_sync+0xe5/0x1a0
> __x64_sys_close+0x7d/0xd0
> do_syscall_64+0xf7/0x5e0
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
> Fixes: 2c860a43dd77 ("bpf: af_unix: Implement BPF iterator for UNIX domain socket.")
> Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Thanks for the fixes, Michal!
^ permalink raw reply
* Re: [PATCH bpf v4 3/5] selftests/bpf: Extend bpf_iter_unix to attempt deadlocking
From: Kuniyuki Iwashima @ 2026-04-15 5:01 UTC (permalink / raw)
To: Michal Luczaj
Cc: John Fastabend, Jakub Sitnicki, Eric Dumazet, Paolo Abeni,
Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman,
Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Cong Wang, netdev, bpf, linux-kernel, linux-kselftest,
Jiayuan Chen
In-Reply-To: <20260414-unix-proto-update-null-ptr-deref-v4-3-2af6fe97918e@rbox.co>
On Tue, Apr 14, 2026 at 7:13 AM Michal Luczaj <mhal@rbox.co> wrote:
>
> Updating a sockmap from a unix iterator prog may lead to a deadlock.
> Piggyback on the original selftest.
>
> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
^ permalink raw reply
* Re: [PATCH bpf v4 4/5] bpf, sockmap: Fix af_unix null-ptr-deref in proto update
From: Kuniyuki Iwashima @ 2026-04-15 5:00 UTC (permalink / raw)
To: Michal Luczaj
Cc: John Fastabend, Jakub Sitnicki, Eric Dumazet, Paolo Abeni,
Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman,
Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Cong Wang, netdev, bpf, linux-kernel, linux-kselftest,
钱一铭
In-Reply-To: <20260414-unix-proto-update-null-ptr-deref-v4-4-2af6fe97918e@rbox.co>
On Tue, Apr 14, 2026 at 7:13 AM Michal Luczaj <mhal@rbox.co> wrote:
>
> unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
> TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
> sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
> socket is properly set up, which would include having a defined peer. IOW,
> there's a window when unix_stream_bpf_update_proto() can be called on
> socket which still has unix_peer(sk) == NULL.
>
> CPU0 bpf CPU1 connect
> -------- ------------
>
> WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
> sock_map_sk_state_allowed(sk)
> ...
> sk_pair = unix_peer(sk)
> sock_hold(sk_pair)
> sock_hold(newsk)
> smp_mb__after_atomic()
> unix_peer(sk) = newsk
>
> BUG: kernel NULL pointer dereference, address: 0000000000000080
> RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
> Call Trace:
> sock_map_link+0x564/0x8b0
> sock_map_update_common+0x6e/0x340
> sock_map_update_elem_sys+0x17d/0x240
> __sys_bpf+0x26db/0x3250
> __x64_sys_bpf+0x21/0x30
> do_syscall_64+0x6b/0x3a0
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Initial idea was to move peer assignment _before_ the sk_state update[1],
> but that involved an additional memory barrier, and changing the hot path
> was rejected.
> Then a NULL check during proto update in unix_stream_bpf_update_proto() was
> considered[2], but the follow-up discussion[3] focused on the root cause,
> i.e. sockmap update taking a wrong lock. Or, more specifically, missing
> unix_state_lock()[4].
> In the end it was concluded that teaching sockmap about the af_unix locking
> would be unnecessarily complex[5].
> Complexity aside, since BPF_PROG_TYPE_SCHED_CLS and BPF_PROG_TYPE_SCHED_ACT
> are allowed to update sockmaps, sock_map_update_elem() taking the unix
> lock, as it is currently implemented in unix_state_lock():
> spin_lock(&unix_sk(s)->lock), would be problematic. unix_state_lock() taken
> in a process context, followed by a softirq-context TC BPF program
> attempting to take the same spinlock -- deadlock[6].
> This way we circled back to the peer check idea[2].
>
> [1]: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
> [2]: https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/
> [3]: https://lore.kernel.org/netdev/7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev/
> [4]: https://lore.kernel.org/netdev/CAAVpQUA+8GL_j63CaKb8hbxoL21izD58yr1NvhOhU=j+35+3og@mail.gmail.com/
> [5]: https://lore.kernel.org/bpf/CAAVpQUAHijOMext28Gi10dSLuMzGYh+jK61Ujn+fZ-wvcODR2A@mail.gmail.com/
> [6]: https://lore.kernel.org/bpf/dd043c69-4d03-46fe-8325-8f97101435cf@linux.dev/
>
> Summary of scenarios where af_unix/stream connect() may race a sockmap
> update:
>
> 1. connect() vs. bpf(BPF_MAP_UPDATE_ELEM), i.e. sock_map_update_elem_sys()
>
> Implemented NULL check is sufficient. Once assigned, socket peer won't
> be released until socket fd is released. And that's not an issue because
> sock_map_update_elem_sys() bumps fd refcnt.
>
> 2. connect() vs BPF program doing update
>
> Update restricted per verifier.c:may_update_sockmap() to
>
> BPF_PROG_TYPE_TRACING/BPF_TRACE_ITER
> BPF_PROG_TYPE_SOCK_OPS (bpf_sock_map_update() only)
> BPF_PROG_TYPE_SOCKET_FILTER
> BPF_PROG_TYPE_SCHED_CLS
> BPF_PROG_TYPE_SCHED_ACT
> BPF_PROG_TYPE_XDP
> BPF_PROG_TYPE_SK_REUSEPORT
> BPF_PROG_TYPE_FLOW_DISSECTOR
> BPF_PROG_TYPE_SK_LOOKUP
>
> Plus one more race to consider:
>
> CPU0 bpf CPU1 connect
> -------- ------------
>
> WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
> sock_map_sk_state_allowed(sk)
> sock_hold(newsk)
> smp_mb__after_atomic()
> unix_peer(sk) = newsk
> sk_pair = unix_peer(sk)
> if (unlikely(!sk_pair))
> return -EINVAL;
>
> CPU1 close
> ----------
>
> skpair = unix_peer(sk);
> unix_peer(sk) = NULL;
> sock_put(skpair)
> // use after free?
> sock_hold(sk_pair)
>
> 2.1 BPF program invoking helper function bpf_sock_map_update() ->
> BPF_CALL_4(bpf_sock_map_update(), ...)
>
> Helper limited to BPF_PROG_TYPE_SOCK_OPS. Nevertheless, a unix sock
> might be accessible via bpf_map_lookup_elem(). Which implies sk
> already having psock, which in turn implies sk already having
> sk_pair. Since sk_psock_destroy() is queued as RCU work, sk_pair
> won't go away while BPF executes the update.
>
> 2.2 BPF program invoking helper function bpf_map_update_elem() ->
> sock_map_update_elem()
>
> 2.2.1 Unix sock accessible to BPF prog only via sockmap lookup in
> BPF_PROG_TYPE_SOCKET_FILTER, BPF_PROG_TYPE_SCHED_CLS,
> BPF_PROG_TYPE_SCHED_ACT, BPF_PROG_TYPE_XDP,
> BPF_PROG_TYPE_SK_REUSEPORT, BPF_PROG_TYPE_FLOW_DISSECTOR,
> BPF_PROG_TYPE_SK_LOOKUP.
>
> Pretty much the same as case 2.1.
>
> 2.2.2 Unix sock accessible to BPF program directly:
> BPF_PROG_TYPE_TRACING, narrowed down to BPF_TRACE_ITER.
>
> Sockmap iterator (sock_map_seq_ops) is safe: unix sock
> residing in a sockmap means that the sock already went through
> the proto update step.
>
> Unix sock iterator (bpf_iter_unix_seq_ops), on the other hand,
> gives access to socks that may still be unconnected. Which
> means iterator prog can race sockmap/proto update against
> connect().
>
> BUG: KASAN: null-ptr-deref in unix_stream_bpf_update_proto+0x253/0x4d0
> Write of size 4 at addr 0000000000000080 by task test_progs/3140
> Call Trace:
> dump_stack_lvl+0x5d/0x80
> kasan_report+0xe4/0x1c0
> kasan_check_range+0x125/0x200
> unix_stream_bpf_update_proto+0x253/0x4d0
> sock_map_link+0x71c/0xec0
> sock_map_update_common+0xbc/0x600
> sock_map_update_elem+0x19a/0x1f0
> bpf_prog_bbbf56096cdd4f01_selective_dump_unix+0x20c/0x217
> bpf_iter_run_prog+0x21e/0xae0
> bpf_iter_unix_seq_show+0x1e0/0x2a0
> bpf_seq_read+0x42c/0x10d0
> vfs_read+0x171/0xb20
> ksys_read+0xff/0x200
> do_syscall_64+0xf7/0x5e0
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> While the introduced NULL check prevents null-ptr-deref in the
> BPF program path as well, it is insufficient to guard against
> a poorly timed close() leading to a use-after-free. This will
> be addressed in a subsequent patch.
>
> Reported-by: Michal Luczaj <mhal@rbox.co>
> Closes: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
> Reported-by: 钱一铭 <yimingqian591@gmail.com>
> Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
> Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
> Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
^ permalink raw reply
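The peer check discussed in the quoted commit message can be modelled in a few lines. This is a simplified userspace sketch, not the kernel code: struct sock_model and update_proto_model() are hypothetical names, and C11 atomics stand in for unix_peer() and sock_hold().

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model of a unix socket: a peer pointer that connect()
 * publishes after sk_state is set, and a reference count. */
struct sock_model {
	_Atomic(struct sock_model *) peer;
	atomic_int refcnt;
};

/* Model of the proto update step: read the peer once, bail out if
 * connect() has not assigned it yet, otherwise take a reference. */
static int update_proto_model(struct sock_model *sk)
{
	struct sock_model *pair = atomic_load(&sk->peer);

	if (!pair)
		return -EINVAL;	/* state is ESTABLISHED but no peer yet */

	/* A close() running between the load above and this increment
	 * could drop the last reference to 'pair'. The NULL check alone
	 * does not close that window. */
	atomic_fetch_add(&pair->refcnt, 1);
	return 0;
}
```

The comment between the load and the increment marks the window that the commit message says is left for a subsequent patch in the series.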
* Re: [PATCH bpf-next v2 2/3] bpf: Use kmalloc_nolock() universally in local storage
From: Slava Imameev @ 2026-04-15 4:11 UTC (permalink / raw)
To: alexei.starovoitov
Cc: ameryhung, andrii, ast, bot+bpf-ci, bpf, clm, daniel, eddyz87,
ihor.solodrai, kernel-team, linux-open-source, martin.lau, memxor,
netdev, slava.imameev, yonghong.song
In-Reply-To: <ydbdk5fj3mbjoninyt5lcg5czcrcgghk6ownothijy667zn6h3@eczgskgj43iw>
On Tue, 14 Apr 2026 19:27:00 -0700 Alexei Starovoitov wrote:
> On Mon, Apr 13, 2026 at 01:48:29PM +1000, Slava Imameev wrote:
> > On Fri, 10 Apr 2026 21:39:00 -0700 Alexei Starovoitov wrote:
> > > >
> > > >
> > > > This allows value sizes up to ~65KB. Before this patch, socket and
> > > > inode storage used bpf_map_kzalloc() (backed by regular kmalloc)
> > > > which could handle those large sizes. After this patch, any
> > > > elem_size above KMALLOC_MAX_CACHE_SIZE will silently fail: the map
> > > > creation succeeds via bpf_local_storage_map_alloc_check() but every
> > > > element allocation returns NULL.
> > > >
> > > > Should BPF_LOCAL_STORAGE_MAX_VALUE_SIZE be updated to use
> > > > KMALLOC_MAX_CACHE_SIZE instead of KMALLOC_MAX_SIZE now that all
> > > > storage types go through kmalloc_nolock()?
> > > >
> > > > Slava Imameev raised the same concern for task storage in
> > > > https://lore.kernel.org/bpf/20260410014341.47043-1-slava.imameev@crowdstrike.com/
> > >
> > > Right. Let's update it, but I don't think it's a regression.
> > > On a loaded system kmalloc_large() rarely succeeds for order 2+.
> > > That's why kmalloc_nolock() doesn't attempt to bridge that gap.
> > > One or two contiguous physical pages is the best one can expect.
> > > In early bpf days we picked KMALLOC_MAX_SIZE assuming that
> > > it's a realistic max for kmalloc().
> > > It turned out to be wishful thinking.
> > > kmalloc_large concept should really be removed.
> > > It deceives users into thinking that it's usable.
> >
> > In defense of supporting 8KB-64KB allocations for local
> > storage, we can consider BPF_MAP_TYPE_HASH with BPF_F_NO_PREALLOC
> > as providing similar functionality to replace the missing 8KB-64KB
> > local storage allocation support. However, these map entry
> > allocations can also fail with similar probability since they
> > depend on the same underlying allocator.
>
> I really hope that 64kb task local storage is not your production code.
> Servers easily have 50k threads. Sometimes more.
> 64k * 50k = 3 Gbytes of memory wasted.
> You need to redesign it from ground up.
This was a research project to replace LRU maps with task
storage. We implemented a garbage collector using a BPF task
iterator to release inactive task allocations. While iterating
over tens of thousands of tasks might be questionable, this was a
proof of concept that, when combined with other measures, could
potentially keep memory pressure in the tens of MBs.
8KB would be sufficient for 99.9% of our allocations, but
sometimes we need 12KB or more. The alternative to task storage
could be BPF_MAP_TYPE_HASH with BPF_F_NO_PREALLOC and a garbage
collector, as we want to reduce dependency on preallocated LRU
maps.
^ permalink raw reply
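For reference, the memory estimate in the exchange above is a plain product of value size and task count. The sketch below assumes every task holds exactly one value of the given size and ignores per-element overhead; task_storage_bytes() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Total bytes consumed if each of nr_tasks tasks holds one local
 * storage value of value_size bytes (allocator overhead ignored). */
static uint64_t task_storage_bytes(uint64_t value_size, uint64_t nr_tasks)
{
	/* Guard against overflow of the 64-bit product. */
	assert(nr_tasks == 0 || value_size <= UINT64_MAX / nr_tasks);
	return value_size * nr_tasks;
}
```

With 64 KiB values and 50,000 tasks this gives 3,276,800,000 bytes, roughly the 3 GB quoted above. With the 8 KiB values mentioned in the reply it gives 409,600,000 bytes, before any garbage collection of inactive tasks is taken into account.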
* Re: [GIT PULL] Networking for 7.1
From: patchwork-bot+netdevbpf @ 2026-04-15 3:46 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: torvalds, davem, netdev, linux-kernel, pabeni
In-Reply-To: <20260414193324.1390838-1-kuba@kernel.org>
Hello:
This pull request was applied to bpf/bpf.git (master)
by Linus Torvalds <torvalds@linux-foundation.org>:
On Tue, 14 Apr 2026 12:33:24 -0700 you wrote:
> Hi Linus!
>
> You'll see a conflict with iouring. The resolutions are self-evident.
> Redo what
> 222b5566a02d ("net: Proxy netdev_queue_get_dma_dev for leased queues")
> 1e91c98bc9a8 ("net: Slightly simplify net_mp_{open,close}_rxq")
> did in iouring, vs changes from:
> 06fc3b6d388d ("io_uring/zcrx: extract netdev+area init into a helper")
>
> [...]
Here is the summary with links:
- [GIT,PULL] Networking for 7.1
https://git.kernel.org/bpf/bpf/c/91a4855d6c03
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* [PATCH] netfilter: xt_realm: fix null-ptr-deref in realm_mt()
From: Kito Xu (veritas501) @ 2026-04-15 3:43 UTC (permalink / raw)
To: pablo
Cc: fw, phil, davem, edumazet, kuba, pabeni, horms, jengelh, kaber,
netfilter-devel, coreteam, netdev, linux-kernel,
Kito Xu (veritas501)
realm_mt() unconditionally dereferences skb_dst(skb) without a NULL
check. The xt_realm match registers with .family = NFPROTO_UNSPEC,
making it available to all netfilter protocol families. Through the
nftables compat layer (nft_compat), an unprivileged user inside a
user/net namespace can load this match into a bridge-family chain.
nft_match_validate() explicitly permits NFPROTO_BRIDGE, and the hook
bitmask check cannot distinguish bridge hooks from inet hooks because
NF_BR_FORWARD and NF_INET_FORWARD share the same numeric value (2).
The match also has no .checkentry callback to reject non-IP families.
When a pure L2 bridged packet traverses the chain, it has never gone
through IP routing, so skb_dst() returns NULL. realm_mt() then
dereferences this NULL pointer at dst->tclassid, causing a kernel oops.
Add a NULL check for the dst_entry pointer. When dst is NULL, return
false (no match), which is the correct semantic since a packet without
a routing realm cannot match any realm-based rule.
Oops: general protection fault, probably for non-canonical address 0xdffffc000000000c: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000060-0x0000000000000067]
CPU: 1 UID: 0 PID: 169 Comm: poc Not tainted 7.0.0-rc7-next-20260410+ #15 PREEMPTLAZY
Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:realm_mt+0xa0/0x180
Call Trace:
<IRQ>
nft_match_eval+0x1b7/0x310
nft_do_chain+0x261/0x1740
nft_do_chain_bridge+0x20c/0xe10
nf_hook_slow+0xac/0x1e0
__br_forward+0x33a/0x480
br_handle_frame_finish+0xab8/0x1d10
br_handle_frame+0x80f/0x12c0
__netif_receive_skb_core.constprop.0+0xbd4/0x2c10
__netif_receive_skb_one_core+0xae/0x1b0
process_backlog+0x197/0x590
__napi_poll+0xa1/0x540
net_rx_action+0x401/0xd80
handle_softirqs+0x19f/0x610
do_softirq.part.0+0x3b/0x60
</IRQ>
<TASK>
__local_bh_enable_ip+0x64/0x70
__dev_queue_xmit+0x9f3/0x30e0
packet_sendmsg+0x2126/0x5470
__sys_sendto+0x34e/0x3a0
__x64_sys_sendto+0xe0/0x1c0
do_syscall_64+0x64/0x680
entry_SYSCALL_64_after_hwframe+0x76/0x7e
</TASK>
Kernel panic - not syncing: Fatal exception in interrupt
Fixes: ab4f21e6fb1c ("netfilter: xtables: use NFPROTO_UNSPEC in more extensions")
Signed-off-by: Kito Xu (veritas501) <hxzene@gmail.com>
---
net/netfilter/xt_realm.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/net/netfilter/xt_realm.c b/net/netfilter/xt_realm.c
index 6df485f4403d..6d3a86647cae 100644
--- a/net/netfilter/xt_realm.c
+++ b/net/netfilter/xt_realm.c
@@ -24,6 +24,9 @@ realm_mt(const struct sk_buff *skb, struct xt_action_param *par)
const struct xt_realm_info *info = par->matchinfo;
const struct dst_entry *dst = skb_dst(skb);
+ if (!dst)
+ return false;
+
return (info->id == (dst->tclassid & info->mask)) ^ info->invert;
}
--
2.43.0
^ permalink raw reply related
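The effect of the added check can be illustrated with a userspace model of the match function. The types below are simplified, hypothetical stand-ins for struct dst_entry and struct xt_realm_info that keep only the fields the expression uses; realm_match() mirrors the patched realm_mt() body.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct dst_entry. */
struct dst_model {
	uint32_t tclassid;
};

/* Hypothetical stand-in for struct xt_realm_info. */
struct realm_info_model {
	uint32_t id;
	uint32_t mask;
	uint8_t invert;
};

/* Same logic as the patched realm_mt(): a packet without a dst entry
 * never matches, otherwise compare the masked realm and apply invert. */
static bool realm_match(const struct dst_model *dst,
			const struct realm_info_model *info)
{
	if (!dst)
		return false;

	return (info->id == (dst->tclassid & info->mask)) ^ info->invert;
}
```

One consequence worth noting: because the NULL case returns before the XOR, an inverted rule also reports "no match" for a packet without a dst entry, rather than matching it.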
* [PATCH v12 net-next 10/11] net/nebula-matrix: add common/ctrl dev init/reinit operation
From: illusion.wang @ 2026-04-15 3:36 UTC (permalink / raw)
To: dimon.zhao, illusion.wang, alvin.wang, sam.chen, netdev
Cc: andrew+netdev, corbet, kuba, linux-doc, lorenzo, pabeni, horms,
vadim.fedorenko, lukas.bulwahn, edumazet, enelsonmoore, skhan,
hkallweit1, open list
In-Reply-To: <20260415033608.2438-1-illusion.wang@nebula-matrix.com>
Common device setup: nbl_dev_setup_common_dev() sets up the mailbox
channel queue, registers the mailbox cleanup task, and initializes the
MSI-X interrupt count.
Control device setup (optional): nbl_dev_setup_ctrl_dev() initializes
the chip and configures the qinfo map table for every channel queue.
Signed-off-by: illusion.wang <illusion.wang@nebula-matrix.com>
---
.../nebula-matrix/nbl/nbl_core/nbl_dev.c | 170 ++++++++++++++++++
.../nebula-matrix/nbl/nbl_core/nbl_dev.h | 31 ++++
2 files changed, 201 insertions(+)
diff --git a/drivers/net/ethernet/nebula-matrix/nbl/nbl_core/nbl_dev.c b/drivers/net/ethernet/nebula-matrix/nbl/nbl_core/nbl_dev.c
index 5deb21e35f8e..f10bb9460774 100644
--- a/drivers/net/ethernet/nebula-matrix/nbl/nbl_core/nbl_dev.c
+++ b/drivers/net/ethernet/nebula-matrix/nbl/nbl_core/nbl_dev.c
@@ -6,6 +6,157 @@
#include <linux/pci.h>
#include "nbl_dev.h"
+static void nbl_dev_init_msix_cnt(struct nbl_dev_mgt *dev_mgt)
+{
+ struct nbl_dev_common *dev_common = dev_mgt->common_dev;
+ struct nbl_msix_info *msix_info = &dev_common->msix_info;
+
+ msix_info->serv_info[NBL_MSIX_MAILBOX_TYPE].num = 1;
+}
+
+/* ---------- Channel config ---------- */
+static int nbl_dev_setup_chan_qinfo(struct nbl_dev_mgt *dev_mgt, u8 chan_type)
+{
+ struct nbl_channel_ops *chan_ops = dev_mgt->chan_ops_tbl->ops;
+ struct nbl_channel_mgt *priv = dev_mgt->chan_ops_tbl->priv;
+ struct device *dev = dev_mgt->common->dev;
+ int ret;
+
+ if (!chan_ops->check_queue_exist(priv, chan_type))
+ return 0;
+
+ ret = chan_ops->cfg_chan_qinfo_map_table(priv, chan_type);
+ if (ret)
+ dev_err(dev, "setup chan:%d, qinfo map table failed\n",
+ chan_type);
+
+ return ret;
+}
+
+static int nbl_dev_setup_chan_queue(struct nbl_dev_mgt *dev_mgt, u8 chan_type)
+{
+ struct nbl_channel_ops *chan_ops = dev_mgt->chan_ops_tbl->ops;
+ struct nbl_channel_mgt *priv = dev_mgt->chan_ops_tbl->priv;
+ int ret = 0;
+
+ if (chan_ops->check_queue_exist(priv, chan_type))
+ ret = chan_ops->setup_queue(priv, chan_type);
+
+ return ret;
+}
+
+static int nbl_dev_remove_chan_queue(struct nbl_dev_mgt *dev_mgt, u8 chan_type)
+{
+ struct nbl_channel_ops *chan_ops = dev_mgt->chan_ops_tbl->ops;
+ struct nbl_channel_mgt *priv = dev_mgt->chan_ops_tbl->priv;
+ int ret = 0;
+
+ if (chan_ops->check_queue_exist(priv, chan_type))
+ ret = chan_ops->teardown_queue(priv, chan_type);
+
+ return ret;
+}
+
+static void nbl_dev_register_chan_task(struct nbl_dev_mgt *dev_mgt,
+ u8 chan_type, struct work_struct *task)
+{
+ struct nbl_channel_ops *chan_ops = dev_mgt->chan_ops_tbl->ops;
+
+ if (chan_ops->check_queue_exist(dev_mgt->chan_ops_tbl->priv, chan_type))
+ chan_ops->register_chan_task(dev_mgt->chan_ops_tbl->priv,
+ chan_type, task);
+}
+
+/* ---------- Tasks config ---------- */
+static void nbl_dev_clean_mailbox_task(struct work_struct *work)
+{
+ struct nbl_dev_common *common_dev =
+ container_of(work, struct nbl_dev_common, clean_mbx_task);
+ struct nbl_dev_mgt *dev_mgt = common_dev->dev_mgt;
+ struct nbl_channel_ops *chan_ops = dev_mgt->chan_ops_tbl->ops;
+
+ chan_ops->clean_queue_subtask(dev_mgt->chan_ops_tbl->priv,
+ NBL_CHAN_TYPE_MAILBOX);
+}
+
+/* ---------- Dev init process ---------- */
+static int nbl_dev_setup_common_dev(struct nbl_adapter *adapter)
+{
+ struct nbl_dev_mgt *dev_mgt = adapter->core.dev_mgt;
+ struct nbl_dispatch_ops *disp_ops = dev_mgt->disp_ops_tbl->ops;
+ struct nbl_dispatch_mgt *priv = dev_mgt->disp_ops_tbl->priv;
+ struct nbl_common_info *common = dev_mgt->common;
+ struct nbl_dev_common *common_dev;
+ int ret;
+
+ common_dev = devm_kzalloc(&adapter->pdev->dev, sizeof(*common_dev),
+ GFP_KERNEL);
+ if (!common_dev)
+ return -ENOMEM;
+ common_dev->dev_mgt = dev_mgt;
+
+ ret = nbl_dev_setup_chan_queue(dev_mgt, NBL_CHAN_TYPE_MAILBOX);
+ if (ret)
+ return ret;
+
+ INIT_WORK(&common_dev->clean_mbx_task, nbl_dev_clean_mailbox_task);
+ common->vsi_id = disp_ops->get_vsi_id(priv, NBL_VSI_DATA);
+ disp_ops->get_eth_id(priv, common->vsi_id, &common->eth_mode,
+ &common->eth_id, &common->logic_eth_id);
+
+ nbl_dev_register_chan_task(dev_mgt, NBL_CHAN_TYPE_MAILBOX,
+ &common_dev->clean_mbx_task);
+
+ dev_mgt->common_dev = common_dev;
+ nbl_dev_init_msix_cnt(dev_mgt);
+
+ return 0;
+}
+
+static void nbl_dev_remove_common_dev(struct nbl_adapter *adapter)
+{
+ struct nbl_dev_mgt *dev_mgt = adapter->core.dev_mgt;
+ struct nbl_dev_common *common_dev = dev_mgt->common_dev;
+
+ if (!common_dev)
+ return;
+
+ nbl_dev_register_chan_task(dev_mgt, NBL_CHAN_TYPE_MAILBOX, NULL);
+ cancel_work_sync(&common_dev->clean_mbx_task);
+ nbl_dev_remove_chan_queue(dev_mgt, NBL_CHAN_TYPE_MAILBOX);
+}
+
+static int nbl_dev_setup_ctrl_dev(struct nbl_adapter *adapter)
+{
+ struct nbl_dev_mgt *dev_mgt = adapter->core.dev_mgt;
+ struct nbl_dispatch_ops *disp_ops = dev_mgt->disp_ops_tbl->ops;
+ int i, ret;
+
+ ret = disp_ops->init_chip_module(dev_mgt->disp_ops_tbl->priv);
+ if (ret)
+ goto chip_init_fail;
+
+ for (i = 0; i < NBL_CHAN_TYPE_MAX; i++) {
+ ret = nbl_dev_setup_chan_qinfo(dev_mgt, i);
+ if (ret)
+ goto setup_chan_q_fail;
+ }
+
+ return 0;
+setup_chan_q_fail:
+ disp_ops->deinit_chip_module(dev_mgt->disp_ops_tbl->priv);
+chip_init_fail:
+ return ret;
+}
+
+static void nbl_dev_remove_ctrl_dev(struct nbl_adapter *adapter)
+{
+ struct nbl_dev_mgt *dev_mgt = adapter->core.dev_mgt;
+ struct nbl_dispatch_ops *disp_ops = dev_mgt->disp_ops_tbl->ops;
+
+ disp_ops->deinit_chip_module(dev_mgt->disp_ops_tbl->priv);
+}
+
static struct nbl_dev_mgt *nbl_dev_setup_dev_mgt(struct nbl_common_info *common)
{
struct nbl_dev_mgt *dev_mgt;
@@ -38,11 +189,30 @@ int nbl_dev_init(struct nbl_adapter *adapter)
dev_mgt->chan_ops_tbl = chan_ops_tbl;
adapter->core.dev_mgt = dev_mgt;
+ ret = nbl_dev_setup_common_dev(adapter);
+ if (ret)
+ return ret;
+
+ if (common->has_ctrl) {
+ ret = nbl_dev_setup_ctrl_dev(adapter);
+ if (ret)
+ goto setup_ctrl_dev_fail;
+ }
+
return 0;
+
+setup_ctrl_dev_fail:
+ nbl_dev_remove_common_dev(adapter);
+ return ret;
}
void nbl_dev_remove(struct nbl_adapter *adapter)
{
+ struct nbl_common_info *common = &adapter->common;
+
+ if (common->has_ctrl)
+ nbl_dev_remove_ctrl_dev(adapter);
+ nbl_dev_remove_common_dev(adapter);
}
/* ---------- Dev start process ---------- */
diff --git a/drivers/net/ethernet/nebula-matrix/nbl/nbl_core/nbl_dev.h b/drivers/net/ethernet/nebula-matrix/nbl/nbl_core/nbl_dev.h
index 9b71092b99a0..b51c8a4424c5 100644
--- a/drivers/net/ethernet/nebula-matrix/nbl/nbl_core/nbl_dev.h
+++ b/drivers/net/ethernet/nebula-matrix/nbl/nbl_core/nbl_dev.h
@@ -18,10 +18,41 @@
#include "../nbl_include/nbl_def_common.h"
#include "../nbl_core.h"
+#define NBL_STRING_NAME_LEN 32
+
+enum nbl_msix_serv_type {
+ /* virtio_dev has a config vector, and its vector_id must be 0 */
+ NBL_MSIX_VIRTIO_TYPE = 0,
+ NBL_MSIX_NET_TYPE,
+ NBL_MSIX_MAILBOX_TYPE,
+ NBL_MSIX_TYPE_MAX
+};
+
+struct nbl_msix_serv_info {
+ char irq_name[NBL_STRING_NAME_LEN];
+ u16 num;
+ u16 base_vector_id;
+ /* true: hw reports msix, hw needs to mask actively */
+ bool hw_self_mask_en;
+};
+
+struct nbl_msix_info {
+ struct nbl_msix_serv_info serv_info[NBL_MSIX_TYPE_MAX];
+};
+
+struct nbl_dev_common {
+ struct nbl_dev_mgt *dev_mgt;
+ struct nbl_msix_info msix_info;
+ char mailbox_name[NBL_STRING_NAME_LEN];
+ /* for ctrl-dev/net-dev mailbox recv msg */
+ struct work_struct clean_mbx_task;
+};
+
struct nbl_dev_mgt {
struct nbl_common_info *common;
struct nbl_dispatch_ops_tbl *disp_ops_tbl;
struct nbl_channel_ops_tbl *chan_ops_tbl;
+ struct nbl_dev_common *common_dev;
};
#endif
--
2.47.3
^ permalink raw reply related
* [PATCH v12 net-next 06/11] net/nebula-matrix: add common resource implementation
From: illusion.wang @ 2026-04-15 3:35 UTC (permalink / raw)
To: dimon.zhao, illusion.wang, alvin.wang, sam.chen, netdev
Cc: andrew+netdev, corbet, kuba, linux-doc, lorenzo, pabeni, horms,
vadim.fedorenko, lukas.bulwahn, edumazet, enelsonmoore, skhan,
hkallweit1, open list
In-Reply-To: <20260415033608.2438-1-illusion.wang@nebula-matrix.com>
The resource layer manages the table entries and data of the modules
inside the chip. It describes the capabilities of each chip module and
the data that module manages.
The resource layer comprises the following sub-modules: common,
interrupt, and vsi (txrx and queue are not included at this time).
This patch provides the common part, including the conversions among
vsi_id, func_id, eth_id, and pf_id. These conversions may be used by
the upper layers or within the resource layer itself.
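As a rough illustration of the pf index <-> vsi_id conversion described
above, here is a standalone userspace sketch. The gap values are copied
from nbl_def_common.h in this patch; the helper names vsi_id_gap(),
pf_data_vsi_base() and vsi_id_to_pf_idx() are hypothetical and not part
of the driver, and the sketch assumes mgt_pf is 0 and one data VSI per PF.

```c
#include <stdint.h>

/* Values copied from nbl_def_common.h in this patch. */
#define NBL_TWO_ETHERNET_PORT		2
#define NBL_FOUR_ETHERNET_PORT		4
#define NBL_DEFAULT_VSI_ID_GAP		1024
#define NBL_TWO_ETHERNET_VSI_ID_GAP	512
#define NBL_FOUR_ETHERNET_VSI_ID_GAP	256

/* Hypothetical helper: same selection as the NBL_VSI_ID_GAP() macro. */
static uint16_t vsi_id_gap(uint16_t eth_num)
{
	if (eth_num == NBL_FOUR_ETHERNET_PORT)
		return NBL_FOUR_ETHERNET_VSI_ID_GAP;
	if (eth_num == NBL_TWO_ETHERNET_PORT)
		return NBL_TWO_ETHERNET_VSI_ID_GAP;
	return NBL_DEFAULT_VSI_ID_GAP;
}

/* Hypothetical helper: base id of the PF data VSI for a PF index. */
static uint16_t pf_data_vsi_base(uint16_t pf_idx, uint16_t eth_num)
{
	return (uint16_t)(pf_idx * vsi_id_gap(eth_num));
}

/*
 * Hypothetical helper: reverse lookup. Each PF owns exactly one data
 * VSI here, so only an exact match on the base id is accepted.
 * Returns -1 when vsi_id does not belong to any PF.
 */
static int vsi_id_to_pf_idx(uint16_t vsi_id, uint16_t eth_num)
{
	uint16_t i;

	for (i = 0; i < eth_num; i++)
		if (pf_data_vsi_base(i, eth_num) == vsi_id)
			return i;
	return -1;
}
```

With two ports the data VSI ids come out as 0 and 512, and with four
ports as 0, 256, 512 and 768, matching the comment in
nbl_res_ctrl_dev_vsi_info_init().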
Signed-off-by: illusion.wang <illusion.wang@nebula-matrix.com>
---
.../net/ethernet/nebula-matrix/nbl/Makefile | 1 +
.../nbl/nbl_hw/nbl_hw_leonis/nbl_hw_leonis.c | 55 ++++++
.../nbl_hw_leonis/nbl_resource_leonis.c | 158 ++++++++++++++++++
.../nebula-matrix/nbl/nbl_hw/nbl_resource.c | 118 +++++++++++++
.../nebula-matrix/nbl/nbl_hw/nbl_resource.h | 52 ++++++
.../nbl/nbl_include/nbl_def_common.h | 15 ++
.../nbl/nbl_include/nbl_def_resource.h | 15 ++
.../nbl/nbl_include/nbl_include.h | 8 +
8 files changed, 422 insertions(+)
create mode 100644 drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_resource.c
diff --git a/drivers/net/ethernet/nebula-matrix/nbl/Makefile b/drivers/net/ethernet/nebula-matrix/nbl/Makefile
index c9bc060732e7..b03c20f9988e 100644
--- a/drivers/net/ethernet/nebula-matrix/nbl/Makefile
+++ b/drivers/net/ethernet/nebula-matrix/nbl/Makefile
@@ -8,6 +8,7 @@ nbl-objs += nbl_common/nbl_common.o \
nbl_hw/nbl_hw_leonis/nbl_hw_leonis.o \
nbl_hw/nbl_hw_leonis/nbl_resource_leonis.o \
nbl_hw/nbl_hw_leonis/nbl_hw_leonis_regs.o \
+ nbl_hw/nbl_resource.o \
nbl_core/nbl_dispatch.o \
nbl_core/nbl_dev.o \
nbl_main.o
diff --git a/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_hw_leonis/nbl_hw_leonis.c b/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_hw_leonis/nbl_hw_leonis.c
index 83a4dc584f48..4ef0d5989a76 100644
--- a/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_hw_leonis/nbl_hw_leonis.c
+++ b/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_hw_leonis/nbl_hw_leonis.c
@@ -9,6 +9,18 @@
#include <linux/spinlock.h>
#include "nbl_hw_leonis.h"
+static void nbl_hw_read_mbx_regs(struct nbl_hw_mgt *hw_mgt, u64 reg,
+ u32 *data, u32 len)
+{
+ u32 i;
+
+ if (len % 4)
+ return;
+
+ for (i = 0; i < len / 4; i++)
+ data[i] = nbl_mbx_rd32(hw_mgt, reg + i * sizeof(u32));
+}
+
static void nbl_hw_write_mbx_regs(struct nbl_hw_mgt *hw_mgt, u64 reg,
const u32 *data, u32 len)
{
@@ -130,6 +142,14 @@ static u32 nbl_hw_get_host_pf_mask(struct nbl_hw_mgt *hw_mgt)
return data;
}
+static u32 nbl_hw_get_real_bus(struct nbl_hw_mgt *hw_mgt)
+{
+ u32 data;
+
+ data = nbl_hw_rd32(hw_mgt, NBL_PCIE_HOST_TL_CFG_BUSDEV);
+ return data >> 5;
+}
+
static void nbl_hw_cfg_mailbox_qinfo(struct nbl_hw_mgt *hw_mgt, u16 func_id,
u16 bus, u16 devid, u16 function)
{
@@ -144,6 +164,36 @@ static void nbl_hw_cfg_mailbox_qinfo(struct nbl_hw_mgt *hw_mgt, u16 func_id,
(u32 *)&mb_qinfo_map, sizeof(mb_qinfo_map));
}
+static void nbl_hw_get_board_info(struct nbl_hw_mgt *hw_mgt,
+ struct nbl_board_port_info *board_info)
+{
+ union nbl_fw_board_cfg_dw3 dw3 = { .info = { 0 } };
+
+ nbl_hw_read_mbx_regs(hw_mgt, NBL_FW_BOARD_DW3_OFFSET, (u32 *)&dw3,
+ sizeof(dw3));
+ board_info->eth_num = dw3.info.port_num;
+ board_info->eth_speed = dw3.info.port_speed;
+ board_info->p4_version = dw3.info.p4_version;
+}
+
+static u32 nbl_hw_get_fw_eth_num(struct nbl_hw_mgt *hw_mgt)
+{
+ union nbl_fw_board_cfg_dw3 dw3 = { .info = { 0 } };
+
+ nbl_hw_read_mbx_regs(hw_mgt, NBL_FW_BOARD_DW3_OFFSET, (u32 *)&dw3,
+ sizeof(dw3));
+ return dw3.info.port_num;
+}
+
+static u32 nbl_hw_get_fw_eth_map(struct nbl_hw_mgt *hw_mgt)
+{
+ union nbl_fw_board_cfg_dw6 dw6 = { .info = { 0 } };
+
+ nbl_hw_read_mbx_regs(hw_mgt, NBL_FW_BOARD_DW6_OFFSET, (u32 *)&dw6,
+ sizeof(dw6));
+ return dw6.info.eth_bitmap;
+}
+
static struct nbl_hw_ops hw_ops = {
.update_mailbox_queue_tail_ptr = nbl_hw_update_mailbox_queue_tail_ptr,
.config_mailbox_rxq = nbl_hw_config_mailbox_rxq,
@@ -151,8 +201,13 @@ static struct nbl_hw_ops hw_ops = {
.stop_mailbox_rxq = nbl_hw_stop_mailbox_rxq,
.stop_mailbox_txq = nbl_hw_stop_mailbox_txq,
.get_host_pf_mask = nbl_hw_get_host_pf_mask,
+ .get_real_bus = nbl_hw_get_real_bus,
+
.cfg_mailbox_qinfo = nbl_hw_cfg_mailbox_qinfo,
+ .get_fw_eth_num = nbl_hw_get_fw_eth_num,
+ .get_fw_eth_map = nbl_hw_get_fw_eth_map,
+ .get_board_info = nbl_hw_get_board_info,
};
/* Structure starts here, adding an op should not modify anything below */
diff --git a/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_hw_leonis/nbl_resource_leonis.c b/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_hw_leonis/nbl_resource_leonis.c
index ffac1d59bf32..ff7f4a9a392c 100644
--- a/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_hw_leonis/nbl_resource_leonis.c
+++ b/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_hw_leonis/nbl_resource_leonis.c
@@ -4,9 +4,12 @@
*/
#include <linux/device.h>
#include <linux/pci.h>
+#include <linux/bits.h>
#include "nbl_resource_leonis.h"
static struct nbl_resource_ops res_ops = {
+ .get_vsi_id = nbl_res_func_id_to_vsi_id,
+ .get_eth_id = nbl_res_get_eth_id,
};
static struct nbl_resource_mgt *
@@ -45,8 +48,163 @@ nbl_res_setup_ops(struct device *dev, struct nbl_resource_mgt *res_mgt)
return res_ops_tbl;
}
+static int nbl_res_ctrl_dev_setup_eth_info(struct nbl_resource_mgt *res_mgt)
+{
+ struct nbl_hw_ops *hw_ops = res_mgt->hw_ops_tbl->ops;
+ struct device *dev = res_mgt->common->dev;
+ struct nbl_eth_info *eth_info;
+ u32 eth_bitmap, eth_id;
+ u32 eth_num = 0;
+ int i;
+
+ eth_info = devm_kzalloc(dev, sizeof(*eth_info), GFP_KERNEL);
+ if (!eth_info)
+ return -ENOMEM;
+
+ res_mgt->resource_info->eth_info = eth_info;
+
+ eth_info->eth_num =
+ (u8)hw_ops->get_fw_eth_num(res_mgt->hw_ops_tbl->priv);
+ eth_bitmap = hw_ops->get_fw_eth_map(res_mgt->hw_ops_tbl->priv);
+ /* for 2 eth port board, the eth_id is 0, 2 */
+ for (i = 0; i < NBL_MAX_ETHERNET; i++) {
+ if ((1 << i) & eth_bitmap) {
+ set_bit(i, eth_info->eth_bitmap);
+ eth_info->eth_id[eth_num] = i;
+ eth_info->logic_eth_id[i] = eth_num;
+ eth_num++;
+ }
+ }
+
+ for (i = 0; i < res_mgt->resource_info->max_pf; i++) {
+ /* Map PF index i to eth_id from eth_info->eth_id[i]
+ * if i < eth_num, otherwise map to eth_id 0
+ */
+ if (i < eth_num) {
+ eth_id = eth_info->eth_id[i];
+ eth_info->pf_bitmap[eth_id] |= BIT(i);
+ } else {
+ eth_info->pf_bitmap[0] |= BIT(i);
+ }
+ }
+
+ return 0;
+}
+
+static int nbl_res_ctrl_dev_sriov_info_init(struct nbl_resource_mgt *res_mgt)
+{
+ struct nbl_hw_ops *hw_ops = res_mgt->hw_ops_tbl->ops;
+ struct nbl_hw_mgt *p = res_mgt->hw_ops_tbl->priv;
+ struct nbl_common_info *common = res_mgt->common;
+ struct nbl_sriov_info *sriov_info;
+ struct device *dev = common->dev;
+ u16 func_id, function;
+
+ sriov_info = devm_kcalloc(dev, res_mgt->resource_info->max_pf,
+ sizeof(*sriov_info), GFP_KERNEL);
+ if (!sriov_info)
+ return -ENOMEM;
+
+ res_mgt->resource_info->sriov_info = sriov_info;
+ common->hw_bus = (u8)hw_ops->get_real_bus(p);
+ for (func_id = 0; func_id < res_mgt->resource_info->max_pf; func_id++) {
+ sriov_info = res_mgt->resource_info->sriov_info + func_id;
+ function = common->function + func_id;
+ sriov_info->bdf = PCI_DEVID(common->hw_bus,
+ PCI_DEVFN(common->devid, function));
+ }
+
+ return 0;
+}
+
+static int nbl_res_ctrl_dev_vsi_info_init(struct nbl_resource_mgt *res_mgt)
+{
+ struct nbl_eth_info *eth_info = res_mgt->resource_info->eth_info;
+ struct nbl_common_info *common = res_mgt->common;
+ struct device *dev = common->dev;
+ struct nbl_vsi_info *vsi_info;
+ int i;
+
+ vsi_info = devm_kcalloc(dev, res_mgt->resource_info->max_pf,
+ sizeof(*vsi_info), GFP_KERNEL);
+ if (!vsi_info)
+ return -ENOMEM;
+
+ res_mgt->resource_info->vsi_info = vsi_info;
+ /*
+ * case 1: two ports (2 PFs)
+ * pf0,pf1 (NBL_VSI_SERV_PF_DATA_TYPE) vsi is 0,512
+ *
+ * case 2: four ports (4 PFs)
+ * pf0,pf1,pf2,pf3 (NBL_VSI_SERV_PF_DATA_TYPE) vsi is 0,256,512,768
+ */
+
+ vsi_info->num = eth_info->eth_num;
+ for (i = 0; i < vsi_info->num; i++) {
+ vsi_info->serv_info[i][NBL_VSI_SERV_PF_DATA_TYPE].base_id =
+ i * NBL_VSI_ID_GAP(vsi_info->num);
+ vsi_info->serv_info[i][NBL_VSI_SERV_PF_DATA_TYPE].num = 1;
+ }
+
+ return 0;
+}
+
+static int nbl_res_init_pf_num(struct nbl_resource_mgt *res_mgt)
+{
+ struct nbl_hw_ops *hw_ops = res_mgt->hw_ops_tbl->ops;
+ u32 pf_num = 0;
+ u32 pf_mask;
+ int i;
+
+ pf_mask = hw_ops->get_host_pf_mask(res_mgt->hw_ops_tbl->priv);
+ for (i = 0; i < NBL_MAX_PF; i++) {
+ if (!(pf_mask & (1 << i)))
+ pf_num++;
+ else
+ break;
+ }
+
+ res_mgt->resource_info->max_pf = pf_num;
+
+ if (!pf_num)
+ return -EINVAL;
+
+ return 0;
+}
+
+static void nbl_res_init_board_info(struct nbl_resource_mgt *res_mgt)
+{
+ struct nbl_hw_ops *hw_ops = res_mgt->hw_ops_tbl->ops;
+
+ hw_ops->get_board_info(res_mgt->hw_ops_tbl->priv,
+ &res_mgt->resource_info->board_info);
+}
+
static int nbl_res_start(struct nbl_resource_mgt *res_mgt)
{
+ struct nbl_common_info *common = res_mgt->common;
+ int ret = 0;
+
+ if (common->has_ctrl) {
+ nbl_res_init_board_info(res_mgt);
+
+ ret = nbl_res_init_pf_num(res_mgt);
+ if (ret)
+ return ret;
+
+ ret = nbl_res_ctrl_dev_sriov_info_init(res_mgt);
+ if (ret)
+ return ret;
+
+ ret = nbl_res_ctrl_dev_setup_eth_info(res_mgt);
+ if (ret)
+ return ret;
+
+ ret = nbl_res_ctrl_dev_vsi_info_init(res_mgt);
+ if (ret)
+ return ret;
+ }
return 0;
}
diff --git a/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_resource.c b/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_resource.c
new file mode 100644
index 000000000000..563f00b5bfb4
--- /dev/null
+++ b/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_resource.c
@@ -0,0 +1,118 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2025 Nebula Matrix Limited.
+ */
+
+#include <linux/pci.h>
+#include "nbl_resource.h"
+
+u16 nbl_res_pfid_to_vsi_id(struct nbl_resource_mgt *res_mgt, int pfid, u16 type)
+{
+ struct nbl_vsi_info *vsi_info = res_mgt->resource_info->vsi_info;
+ enum nbl_vsi_serv_type dst_type = NBL_VSI_SERV_PF_DATA_TYPE;
+ struct nbl_common_info *common = res_mgt->common;
+ struct device *dev = res_mgt->common->dev;
+ u16 vsi_id = U16_MAX;
+ u32 diff;
+
+ diff = nbl_common_pf_id_subtraction_mgtpf_id(common, pfid);
+ if (diff == U32_MAX)
+ return vsi_id;
+
+ if (diff < vsi_info->num) {
+ nbl_res_pf_dev_vsi_type_to_hw_vsi_type(type, &dst_type);
+ vsi_id = vsi_info->serv_info[diff][dst_type].base_id;
+ }
+ if (vsi_id == U16_MAX)
+ dev_err(dev, "convert pfid %d to vsi_id failed!\n", pfid);
+ return vsi_id;
+}
+
+u16 nbl_res_func_id_to_vsi_id(struct nbl_resource_mgt *res_mgt, u16 func_id,
+ u16 type)
+{
+ int pfid = func_id;
+
+ return nbl_res_pfid_to_vsi_id(res_mgt, pfid, type);
+}
+
+int nbl_res_vsi_id_to_pf_id(struct nbl_resource_mgt *res_mgt, u16 vsi_id)
+{
+ struct nbl_vsi_info *vsi_info = res_mgt->resource_info->vsi_info;
+ struct nbl_common_info *common = res_mgt->common;
+ bool vsi_find = false;
+ int pf_id = -1; /* -1 indicates not found */
+ int i, j;
+
+ for (i = 0; i < vsi_info->num; i++) {
+ for (j = 0; j < NBL_VSI_SERV_MAX_TYPE; j++)
+ if (vsi_id >= vsi_info->serv_info[i][j].base_id &&
+ (vsi_id < vsi_info->serv_info[i][j].base_id +
+ vsi_info->serv_info[i][j].num)) {
+ vsi_find = true;
+ break;
+ }
+
+ if (vsi_find)
+ break;
+ }
+
+ if (vsi_find) {
+ if (j == NBL_VSI_SERV_PF_DATA_TYPE)
+ pf_id = i + common->mgt_pf;
+ }
+
+ return pf_id;
+}
+
+int nbl_res_func_id_to_bdf(struct nbl_resource_mgt *res_mgt, u16 func_id,
+ u8 *bus, u8 *dev, u8 *function)
+{
+ struct nbl_common_info *common = res_mgt->common;
+ struct nbl_sriov_info *sriov_info;
+ int pfid = func_id;
+ u8 pf_bus, devfn;
+ u32 diff;
+
+ diff = nbl_common_pf_id_subtraction_mgtpf_id(common, pfid);
+ if (diff == U32_MAX)
+ return -EINVAL;
+ sriov_info = res_mgt->resource_info->sriov_info + diff;
+ pf_bus = PCI_BUS_NUM(sriov_info->bdf);
+ devfn = sriov_info->bdf & 0xff;
+ *bus = pf_bus;
+ *dev = PCI_SLOT(devfn);
+ *function = PCI_FUNC(devfn);
+
+ return 0;
+}
+
+void nbl_res_get_eth_id(struct nbl_resource_mgt *res_mgt, u16 vsi_id,
+ u8 *eth_mode, u8 *eth_id, u8 *logic_eth_id)
+{
+ struct nbl_eth_info *eth_info = res_mgt->resource_info->eth_info;
+ int pf_id = nbl_res_vsi_id_to_pf_id(res_mgt, vsi_id);
+ struct device *dev = res_mgt->common->dev;
+
+ *eth_mode = eth_info->eth_num;
+ if (pf_id < eth_info->eth_num && pf_id >= 0) {
+ *eth_id = eth_info->eth_id[pf_id];
+ *logic_eth_id = pf_id;
+ } else {
+ /*
+ * Fallback to eth_id[0] if pf_id is out of range.
+ * This is a safety measure to prevent crashes, but callers
+ * should validate pf_id beforehand if possible.
+ */
+ dev_warn(dev, "pf_id %d invalid\n", pf_id);
+ *eth_id = eth_info->eth_id[0];
+ *logic_eth_id = 0;
+ }
+}
+
+void nbl_res_pf_dev_vsi_type_to_hw_vsi_type(u16 src_type,
+ enum nbl_vsi_serv_type *dst_type)
+{
+ if (src_type == NBL_VSI_DATA)
+ *dst_type = NBL_VSI_SERV_PF_DATA_TYPE;
+}
diff --git a/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_resource.h b/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_resource.h
index e08b6237da32..51b5b958cde8 100644
--- a/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_resource.h
+++ b/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_resource.h
@@ -16,7 +16,48 @@
#include "../nbl_include/nbl_def_common.h"
#include "../nbl_core.h"
+struct nbl_resource_mgt;
+
+/* --------- INFO ---------- */
+struct nbl_sriov_info {
+ unsigned int bdf;
+};
+
+struct nbl_eth_info {
+ DECLARE_BITMAP(eth_bitmap, NBL_MAX_ETHERNET);
+ u8 pf_bitmap[NBL_MAX_ETHERNET];
+ u8 eth_num;
+ u8 resv[3];
+ u8 eth_id[NBL_MAX_PF];
+ u8 logic_eth_id[NBL_MAX_PF];
+};
+
+enum nbl_vsi_serv_type {
+ NBL_VSI_SERV_PF_DATA_TYPE,
+ NBL_VSI_SERV_MAX_TYPE,
+};
+
+struct nbl_vsi_serv_info {
+ u16 base_id;
+ u16 num;
+};
+
+struct nbl_vsi_info {
+ u16 num;
+ struct nbl_vsi_serv_info serv_info[NBL_MAX_ETHERNET]
+ [NBL_VSI_SERV_MAX_TYPE];
+};
+
struct nbl_resource_info {
+ /* ctrl-dev owned pfs */
+ DECLARE_BITMAP(func_bitmap, NBL_MAX_FUNC);
+ struct nbl_sriov_info *sriov_info;
+ struct nbl_eth_info *eth_info;
+ struct nbl_vsi_info *vsi_info;
+ u32 base_qid;
+ u32 max_vf_num;
+ u8 max_pf;
+ struct nbl_board_port_info board_info;
};
struct nbl_resource_mgt {
@@ -27,4 +68,15 @@ struct nbl_resource_mgt {
struct nbl_interrupt_mgt *intr_mgt;
};
+int nbl_res_vsi_id_to_pf_id(struct nbl_resource_mgt *res_mgt, u16 vsi_id);
+u16 nbl_res_pfid_to_vsi_id(struct nbl_resource_mgt *res_mgt, int pfid,
+ u16 type);
+u16 nbl_res_func_id_to_vsi_id(struct nbl_resource_mgt *res_mgt, u16 func_id,
+ u16 type);
+int nbl_res_func_id_to_bdf(struct nbl_resource_mgt *res_mgt, u16 func_id,
+ u8 *bus, u8 *dev, u8 *function);
+void nbl_res_get_eth_id(struct nbl_resource_mgt *res_mgt, u16 vsi_id,
+ u8 *eth_mode, u8 *eth_id, u8 *logic_eth_id);
+void nbl_res_pf_dev_vsi_type_to_hw_vsi_type(u16 src_type,
+ enum nbl_vsi_serv_type *dst_type);
#endif
diff --git a/drivers/net/ethernet/nebula-matrix/nbl/nbl_include/nbl_def_common.h b/drivers/net/ethernet/nebula-matrix/nbl/nbl_include/nbl_def_common.h
index 5c532247c852..04ffc1918a46 100644
--- a/drivers/net/ethernet/nebula-matrix/nbl/nbl_include/nbl_def_common.h
+++ b/drivers/net/ethernet/nebula-matrix/nbl/nbl_include/nbl_def_common.h
@@ -12,6 +12,21 @@
#include "nbl_include.h"
struct nbl_hash_tbl_mgt;
+#define NBL_TWO_ETHERNET_PORT 2
+#define NBL_FOUR_ETHERNET_PORT 4
+#define NBL_DEFAULT_VSI_ID_GAP 1024
+#define NBL_TWO_ETHERNET_VSI_ID_GAP 512
+#define NBL_FOUR_ETHERNET_VSI_ID_GAP 256
+
+#define NBL_VSI_ID_GAP(m) \
+ ({ \
+ typeof(m) _m = (m); \
+ _m == NBL_FOUR_ETHERNET_PORT ? \
+ NBL_FOUR_ETHERNET_VSI_ID_GAP : \
+ (_m == NBL_TWO_ETHERNET_PORT ? \
+ NBL_TWO_ETHERNET_VSI_ID_GAP : \
+ NBL_DEFAULT_VSI_ID_GAP); \
+ })
struct nbl_common_info {
struct pci_dev *pdev;
diff --git a/drivers/net/ethernet/nebula-matrix/nbl/nbl_include/nbl_def_resource.h b/drivers/net/ethernet/nebula-matrix/nbl/nbl_include/nbl_def_resource.h
index d55934af5a9a..15422eb4a218 100644
--- a/drivers/net/ethernet/nebula-matrix/nbl/nbl_include/nbl_def_resource.h
+++ b/drivers/net/ethernet/nebula-matrix/nbl/nbl_include/nbl_def_resource.h
@@ -6,10 +6,25 @@
#ifndef _NBL_DEF_RESOURCE_H_
#define _NBL_DEF_RESOURCE_H_
+#include <linux/types.h>
+
struct nbl_resource_mgt;
struct nbl_adapter;
struct nbl_resource_ops {
+ int (*init_chip_module)(struct nbl_resource_mgt *res_mgt);
+ void (*deinit_chip_module)(struct nbl_resource_mgt *res_mgt);
+
+ int (*configure_msix_map)(struct nbl_resource_mgt *res_mgt, u16 func_id,
+ u16 num_net_msix, u16 num_others_msix,
+ bool net_msix_mask_en);
+ int (*destroy_msix_map)(struct nbl_resource_mgt *res_mgt, u16 func_id);
+ int (*enable_mailbox_irq)(struct nbl_resource_mgt *res_mgt, u16 func_id,
+ u16 vector_id, bool enable_msix);
+ u16 (*get_vsi_id)(struct nbl_resource_mgt *res_mgt, u16 func_id,
+ u16 type);
+ void (*get_eth_id)(struct nbl_resource_mgt *res_mgt, u16 vsi_id,
+ u8 *eth_mode, u8 *eth_id, u8 *logic_eth_id);
};
struct nbl_resource_ops_tbl {
diff --git a/drivers/net/ethernet/nebula-matrix/nbl/nbl_include/nbl_include.h b/drivers/net/ethernet/nebula-matrix/nbl/nbl_include/nbl_include.h
index a01c32f57d84..6a0bf5e8ca32 100644
--- a/drivers/net/ethernet/nebula-matrix/nbl/nbl_include/nbl_include.h
+++ b/drivers/net/ethernet/nebula-matrix/nbl/nbl_include/nbl_include.h
@@ -17,11 +17,19 @@
((_id) == (max) ? 0 : (_id) + 1); \
})
+#define NBL_MAX_FUNC 520
+#define NBL_MAX_ETHERNET 4
+
enum nbl_product_type {
NBL_LEONIS_TYPE,
NBL_PRODUCT_MAX,
};
+enum {
+ NBL_VSI_DATA = 0,
+ NBL_VSI_MAX,
+};
+
struct nbl_func_caps {
u32 has_ctrl:1;
u32 has_net:1;
--
2.47.3
^ permalink raw reply related
* [PATCH v12 net-next 08/11] net/nebula-matrix: add vsi resource implementation
From: illusion.wang @ 2026-04-15 3:36 UTC (permalink / raw)
To: dimon.zhao, illusion.wang, alvin.wang, sam.chen, netdev
Cc: andrew+netdev, corbet, kuba, linux-doc, lorenzo, pabeni, horms,
vadim.fedorenko, lukas.bulwahn, edumazet, enelsonmoore, skhan,
hkallweit1, open list
In-Reply-To: <20260415033608.2438-1-illusion.wang@nebula-matrix.com>
The HW (hardware) layer code is highly chip-specific, so a quick review
of it should be sufficient.
Chip initialization includes the initialization of the DP module, the
intf module, and the P4 registers.
The initialization of the DP module covers the dped (downstream packet
edit), uped (upstream packet edit), dsch (downstream schedule), ustore,
dstore, dvn, uvn, and uqm modules.
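Two small computations in the DP init code may be easier to follow in
isolation: the 16/16 split of the descriptor-request value done in
nbl_dvn_descreq_num_cfg(), and the 200us wait value written in
nbl_uvn_init(). The standalone userspace sketch below is only an
illustration. The helper names split_descreq() and ns_to_ticks() are
hypothetical, and the 1.67 ns tick period is an assumption read from the
"200000/1.67" code comment, not a documented hardware parameter.

```c
#include <stdint.h>

/*
 * Hypothetical helper: split a packed descriptor-request value the way
 * nbl_dvn_descreq_num_cfg() does, low 16 bits for the packed ring and
 * high 16 bits for the split ring. Clamping is left out because the
 * limit macros are not defined in this patch excerpt.
 */
static void split_descreq(uint32_t descreq_num, uint32_t *packed_ring_num,
			  uint32_t *split_ring_num)
{
	*packed_ring_num = descreq_num & 0xffff;
	*split_ring_num = (descreq_num >> 16) & 0xffff;
}

/*
 * Hypothetical helper: convert nanoseconds to ticks, assuming one tick
 * is 1.67 ns. Integer math (ns * 100 / 167) avoids floating point and
 * rounds down.
 */
static uint32_t ns_to_ticks(uint32_t ns)
{
	return (uint32_t)(((uint64_t)ns * 100) / 167);
}
```

Under that assumption, ns_to_ticks(200000) evaluates to 119760, the same
value that nbl_uvn_init() writes to NBL_UVN_DESC_RD_WAIT.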
Signed-off-by: illusion.wang <illusion.wang@nebula-matrix.com>
---
.../net/ethernet/nebula-matrix/nbl/Makefile | 1 +
.../nbl/nbl_hw/nbl_hw_leonis/nbl_hw_leonis.c | 388 ++++++++++++++++++
.../nbl/nbl_hw/nbl_hw_leonis/nbl_hw_leonis.h | 12 +-
.../nbl_hw_leonis/nbl_resource_leonis.c | 4 +
.../nebula-matrix/nbl/nbl_hw/nbl_resource.h | 1 +
.../nebula-matrix/nbl/nbl_hw/nbl_vsi.c | 51 +++
.../nebula-matrix/nbl/nbl_hw/nbl_vsi.h | 11 +
.../nbl/nbl_include/nbl_def_hw.h | 4 +
.../nbl/nbl_include/nbl_include.h | 31 ++
9 files changed, 502 insertions(+), 1 deletion(-)
create mode 100644 drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_vsi.c
create mode 100644 drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_vsi.h
diff --git a/drivers/net/ethernet/nebula-matrix/nbl/Makefile b/drivers/net/ethernet/nebula-matrix/nbl/Makefile
index a56e722a5ac7..241bbb572b5e 100644
--- a/drivers/net/ethernet/nebula-matrix/nbl/Makefile
+++ b/drivers/net/ethernet/nebula-matrix/nbl/Makefile
@@ -10,6 +10,7 @@ nbl-objs += nbl_common/nbl_common.o \
nbl_hw/nbl_hw_leonis/nbl_hw_leonis_regs.o \
nbl_hw/nbl_resource.o \
nbl_hw/nbl_interrupt.o \
+ nbl_hw/nbl_vsi.o \
nbl_core/nbl_dispatch.o \
nbl_core/nbl_dev.o \
nbl_main.o
diff --git a/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_hw_leonis/nbl_hw_leonis.c b/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_hw_leonis/nbl_hw_leonis.c
index aa5e91c2b278..75d67e3ef08b 100644
--- a/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_hw_leonis/nbl_hw_leonis.c
+++ b/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_hw_leonis/nbl_hw_leonis.c
@@ -8,6 +8,7 @@
#include <linux/io.h>
#include <linux/spinlock.h>
#include "nbl_hw_leonis.h"
+#include "nbl_hw_leonis_regs.h"
static void nbl_hw_read_mbx_regs(struct nbl_hw_mgt *hw_mgt, u64 reg,
u32 *data, u32 len)
@@ -63,6 +64,390 @@ static void nbl_hw_wr_regs(struct nbl_hw_mgt *hw_mgt, u64 reg, const u32 *data,
spin_unlock(&hw_mgt->reg_lock);
}
+static u32 nbl_hw_get_quirks(struct nbl_hw_mgt *hw_mgt)
+{
+ u32 quirks;
+
+ nbl_hw_read_mbx_regs(hw_mgt, NBL_LEONIS_QUIRKS_OFFSET, &quirks,
+ sizeof(u32));
+
+ if (quirks == NBL_LEONIS_ILLEGAL_REG_VALUE)
+ return 0;
+
+ return quirks;
+}
+
+static void nbl_configure_dped_checksum(struct nbl_hw_mgt *hw_mgt)
+{
+ union dped_l4_ck_cmd_40_u l4_ck_cmd_40;
+
+ /* DPED dped_l4_ck_cmd_40 for sctp */
+ nbl_hw_rd_regs(hw_mgt, NBL_DPED_L4_CK_CMD_40_ADDR, (u32 *)&l4_ck_cmd_40,
+ sizeof(l4_ck_cmd_40));
+ l4_ck_cmd_40.info.en = 1;
+ nbl_hw_wr_regs(hw_mgt, NBL_DPED_L4_CK_CMD_40_ADDR, (u32 *)&l4_ck_cmd_40,
+ sizeof(l4_ck_cmd_40));
+}
+
+static void nbl_dped_init(struct nbl_hw_mgt *hw_mgt)
+{
+ nbl_hw_wr32(hw_mgt, NBL_DPED_VLAN_OFFSET, 0xC);
+ nbl_hw_wr32(hw_mgt, NBL_DPED_DSCP_OFFSET_0, 0x8);
+ nbl_hw_wr32(hw_mgt, NBL_DPED_DSCP_OFFSET_1, 0x4);
+
+ /* dped checksum offload */
+ nbl_configure_dped_checksum(hw_mgt);
+}
+
+static void nbl_uped_init(struct nbl_hw_mgt *hw_mgt)
+{
+ struct ped_hw_edit_profile hw_edit;
+
+ nbl_hw_rd_regs(hw_mgt, NBL_UPED_HW_EDT_PROF_TABLE(NBL_DPED_V4_TCP_IDX),
+ (u32 *)&hw_edit, sizeof(hw_edit));
+ hw_edit.l3_len = 0;
+ nbl_hw_wr_regs(hw_mgt, NBL_UPED_HW_EDT_PROF_TABLE(NBL_DPED_V4_TCP_IDX),
+ (u32 *)&hw_edit, sizeof(hw_edit));
+
+ nbl_hw_rd_regs(hw_mgt, NBL_UPED_HW_EDT_PROF_TABLE(NBL_DPED_V6_TCP_IDX),
+ (u32 *)&hw_edit, sizeof(hw_edit));
+ hw_edit.l3_len = 1;
+ nbl_hw_wr_regs(hw_mgt, NBL_UPED_HW_EDT_PROF_TABLE(NBL_DPED_V6_TCP_IDX),
+ (u32 *)&hw_edit, sizeof(hw_edit));
+}
+
+static void nbl_shaping_eth_init(struct nbl_hw_mgt *hw_mgt, u8 eth_id, u8 speed)
+{
+ struct nbl_shaping_dvn_dport dvn_dport = { 0 };
+ struct nbl_shaping_dport dport = { 0 };
+ u32 rate, half_rate;
+
+ if (speed == NBL_FW_PORT_SPEED_100G) {
+ rate = NBL_SHAPING_DPORT_100G_RATE;
+ half_rate = NBL_SHAPING_DPORT_HALF_100G_RATE;
+ } else {
+ rate = NBL_SHAPING_DPORT_25G_RATE;
+ half_rate = NBL_SHAPING_DPORT_HALF_25G_RATE;
+ }
+
+ dport.cir = rate;
+ dport.pir = rate;
+ dport.depth = max(dport.cir * 2, NBL_LR_LEONIS_NET_BUCKET_DEPTH);
+ dport.cbs = dport.depth;
+ dport.pbs = dport.depth;
+ dport.valid = 1;
+
+ dvn_dport.cir = half_rate;
+ dvn_dport.pir = rate;
+ dvn_dport.depth = dport.depth;
+ dvn_dport.cbs = dvn_dport.depth;
+ dvn_dport.pbs = dvn_dport.depth;
+ dvn_dport.valid = 1;
+
+ nbl_hw_wr_regs(hw_mgt, NBL_SHAPING_DPORT_REG(eth_id), (u32 *)&dport,
+ sizeof(dport));
+ nbl_hw_wr_regs(hw_mgt, NBL_SHAPING_DVN_DPORT_REG(eth_id),
+ (u32 *)&dvn_dport, sizeof(dvn_dport));
+}
+
+static void nbl_shaping_init(struct nbl_hw_mgt *hw_mgt, u8 speed)
+{
+#define NBL_SHAPING_FLUSH_INTERVAL 128
+ struct nbl_shaping_net net_shaping = { 0 };
+ struct dsch_psha_en psha_en = { 0 };
+ int i;
+
+ for (i = 0; i < NBL_MAX_ETHERNET; i++)
+ nbl_shaping_eth_init(hw_mgt, i, speed);
+
+ psha_en.en = 0xF;
+ nbl_hw_wr_regs(hw_mgt, NBL_DSCH_PSHA_EN_ADDR, (u32 *)&psha_en,
+ sizeof(psha_en));
+
+ for (i = 0; i < NBL_MAX_FUNC; i++) {
+ nbl_hw_wr_regs(hw_mgt, NBL_SHAPING_NET_REG(i),
+ (u32 *)&net_shaping, sizeof(net_shaping));
+ if ((i % NBL_SHAPING_FLUSH_INTERVAL) == 0)
+ nbl_flush_writes(hw_mgt);
+ }
+ nbl_flush_writes(hw_mgt);
+}
+
+static void nbl_dsch_qid_max_init(struct nbl_hw_mgt *hw_mgt)
+{
+ struct dsch_vn_quanta quanta = { 0 };
+
+ quanta.h_qua = NBL_HOST_QUANTA;
+ quanta.e_qua = NBL_ECPU_QUANTA;
+ nbl_hw_wr_regs(hw_mgt, NBL_DSCH_VN_QUANTA_ADDR, (u32 *)&quanta,
+ sizeof(quanta));
+ nbl_hw_wr32(hw_mgt, NBL_DSCH_HOST_QID_MAX, NBL_MAX_QUEUE_ID);
+
+ nbl_hw_wr32(hw_mgt, NBL_DVN_ECPU_QUEUE_NUM, 0);
+ nbl_hw_wr32(hw_mgt, NBL_UVN_ECPU_QUEUE_NUM, 0);
+}
+
+static void nbl_ustore_init(struct nbl_hw_mgt *hw_mgt, u8 eth_num)
+{
+ struct nbl_ustore_port_drop_th drop_th = { 0 };
+ struct ustore_pkt_len pkt_len;
+ int i;
+
+ /* Read the current packet length config to preserve the other
+ * fields while updating 'min'.
+ */
+ nbl_hw_rd_regs(hw_mgt, NBL_USTORE_PKT_LEN_ADDR, (u32 *)&pkt_len,
+ sizeof(pkt_len));
+ /* min arp packet length 42 (14 + 28) */
+ pkt_len.min = 42;
+ nbl_hw_wr_regs(hw_mgt, NBL_USTORE_PKT_LEN_ADDR, (u32 *)&pkt_len,
+ sizeof(pkt_len));
+
+ drop_th.en = 1;
+ if (eth_num == 1)
+ drop_th.disc_th = NBL_USTORE_SINGLE_ETH_DROP_TH;
+ else if (eth_num == 2)
+ drop_th.disc_th = NBL_USTORE_DUAL_ETH_DROP_TH;
+ else
+ drop_th.disc_th = NBL_USTORE_QUAD_ETH_DROP_TH;
+
+ for (i = 0; i < NBL_MAX_ETHERNET; i++)
+ nbl_hw_wr_regs(hw_mgt, NBL_USTORE_PORT_DROP_TH_REG_ARR(i),
+ (u32 *)&drop_th, sizeof(drop_th));
+
+ /* clear registers */
+ for (i = 0; i < NBL_MAX_ETHERNET; i++) {
+ nbl_hw_rd32(hw_mgt, NBL_USTORE_BUF_PORT_DROP_PKT(i));
+ nbl_hw_rd32(hw_mgt, NBL_USTORE_BUF_PORT_TRUN_PKT(i));
+ }
+}
+
+static void nbl_dstore_init(struct nbl_hw_mgt *hw_mgt, u8 speed)
+{
+ struct dstore_port_drop_th drop_th;
+ struct dstore_d_dport_fc_th fc_th;
+ struct dstore_disc_bp_th bp_th;
+ int i;
+
+ for (i = 0; i < NBL_DSTORE_PORT_DROP_TH_DEPTH; i++) {
+ nbl_hw_rd_regs(hw_mgt, NBL_DSTORE_PORT_DROP_TH_REG(i),
+ (u32 *)&drop_th, sizeof(drop_th));
+ drop_th.en = 0;
+ nbl_hw_wr_regs(hw_mgt, NBL_DSTORE_PORT_DROP_TH_REG(i),
+ (u32 *)&drop_th, sizeof(drop_th));
+ }
+
+ nbl_hw_rd_regs(hw_mgt, NBL_DSTORE_DISC_BP_TH, (u32 *)&bp_th,
+ sizeof(bp_th));
+ bp_th.en = 1;
+ nbl_hw_wr_regs(hw_mgt, NBL_DSTORE_DISC_BP_TH, (u32 *)&bp_th,
+ sizeof(bp_th));
+
+ for (i = 0; i < NBL_MAX_ETHERNET; i++) {
+ nbl_hw_rd_regs(hw_mgt, NBL_DSTORE_D_DPORT_FC_TH_REG(i),
+ (u32 *)&fc_th, sizeof(fc_th));
+ if (speed == NBL_FW_PORT_SPEED_100G) {
+ fc_th.xoff_th = NBL_DSTORE_DROP_XOFF_TH_100G;
+ fc_th.xon_th = NBL_DSTORE_DROP_XON_TH_100G;
+ } else {
+ fc_th.xoff_th = NBL_DSTORE_DROP_XOFF_TH;
+ fc_th.xon_th = NBL_DSTORE_DROP_XON_TH;
+ }
+
+ fc_th.fc_en = 1;
+ nbl_hw_wr_regs(hw_mgt, NBL_DSTORE_D_DPORT_FC_TH_REG(i),
+ (u32 *)&fc_th, sizeof(fc_th));
+ }
+}
+
+static void nbl_dvn_descreq_num_cfg(struct nbl_hw_mgt *hw_mgt, u32 descreq_num)
+{
+ u32 split_ring_num = (descreq_num >> 16) & 0xffff;
+ struct nbl_dvn_descreq_num_cfg num_cfg = { 0 };
+ u32 packet_ring_num = descreq_num & 0xffff;
+
+ packet_ring_num =
+ clamp(packet_ring_num, PACKET_RING_MIN, PACKET_RING_MAX);
+ num_cfg.packed_l1_num =
+ (packet_ring_num - PACKET_RING_BASE) / PACKET_RING_DIV;
+
+ split_ring_num = clamp(split_ring_num, SPLIT_RING_MIN,
+ SPLIT_RING_MAX);
+ num_cfg.avring_cfg_num = split_ring_num > SPLIT_RING_MIN ?
+ SPLIT_RING_CFG_16 :
+ SPLIT_RING_CFG_8;
+
+ nbl_hw_wr_regs(hw_mgt, NBL_DVN_DESCREQ_NUM_CFG, (u32 *)&num_cfg,
+ sizeof(num_cfg));
+}
+
+static void nbl_dvn_init(struct nbl_hw_mgt *hw_mgt, u8 speed)
+{
+ struct nbl_dvn_desc_wr_merge_timeout timeout = { 0 };
+ struct nbl_dvn_dif_req_rd_ro_flag ro_flag = { 0 };
+
+ timeout.cfg_cycle = DEFAULT_DVN_DESC_WR_MERGE_TIMEOUT_MAX;
+ nbl_hw_wr_regs(hw_mgt, NBL_DVN_DESC_WR_MERGE_TIMEOUT, (u32 *)&timeout,
+ sizeof(timeout));
+
+ ro_flag.rd_desc_ro_en = 1;
+ ro_flag.rd_data_ro_en = 1;
+ ro_flag.rd_avring_ro_en = 1;
+ nbl_hw_wr_regs(hw_mgt, NBL_DVN_DIF_REQ_RD_RO_FLAG, (u32 *)&ro_flag,
+ sizeof(ro_flag));
+
+ if (speed == NBL_FW_PORT_SPEED_100G)
+ nbl_dvn_descreq_num_cfg(hw_mgt,
+ DEFAULT_DVN_100G_DESCREQ_NUMCFG);
+ else
+ nbl_dvn_descreq_num_cfg(hw_mgt, DEFAULT_DVN_DESCREQ_NUMCFG);
+}
+
+static void nbl_uvn_init(struct nbl_hw_mgt *hw_mgt)
+{
+ struct uvn_desc_prefetch_init prefetch_init = { 0 };
+ struct uvn_desc_wr_timeout desc_wr_timeout = { 0 };
+ struct uvn_dif_req_ro_flag flag = { 0 };
+ struct uvn_queue_err_mask mask = { 0 };
+ u16 wr_timeout = 0x12c;
+ u32 timeout = 119760; /* 200us 200000/1.67 */
+ u32 quirks;
+
+ nbl_hw_wr32(hw_mgt, NBL_UVN_DESC_RD_WAIT, timeout);
+
+ desc_wr_timeout.num = wr_timeout;
+ nbl_hw_wr_regs(hw_mgt, NBL_UVN_DESC_WR_TIMEOUT, (u32 *)&desc_wr_timeout,
+ sizeof(desc_wr_timeout));
+
+ flag.avail_rd = 1;
+ flag.desc_rd = 1;
+ flag.pkt_wr = 1;
+ flag.desc_wr = 0;
+ nbl_hw_wr_regs(hw_mgt, NBL_UVN_DIF_REQ_RO_FLAG, (u32 *)&flag,
+ sizeof(flag));
+
+ nbl_hw_rd_regs(hw_mgt, NBL_UVN_QUEUE_ERR_MASK, (u32 *)&mask,
+ sizeof(mask));
+ mask.dif_err = 1;
+ nbl_hw_wr_regs(hw_mgt, NBL_UVN_QUEUE_ERR_MASK, (u32 *)&mask,
+ sizeof(mask));
+
+ prefetch_init.num = NBL_UVN_DESC_PREFETCH_NUM;
+ prefetch_init.sel = 0;
+ quirks = nbl_hw_get_quirks(hw_mgt);
+ if (!(quirks & BIT(NBL_QUIRKS_UVN_PREFETCH_ALIGN)))
+ prefetch_init.sel = 1;
+ nbl_hw_wr_regs(hw_mgt, NBL_UVN_DESC_PREFETCH_INIT,
+ (u32 *)&prefetch_init, sizeof(prefetch_init));
+}
+
+static void nbl_uqm_init(struct nbl_hw_mgt *hw_mgt)
+{
+ struct nbl_uqm_que_type que_type = { 0 };
+ u32 cnt = 0;
+ int i;
+
+ nbl_hw_wr_regs(hw_mgt, NBL_UQM_FWD_DROP_CNT, &cnt, sizeof(cnt));
+
+ nbl_hw_wr_regs(hw_mgt, NBL_UQM_DROP_PKT_CNT, &cnt, sizeof(cnt));
+ nbl_hw_wr_regs(hw_mgt, NBL_UQM_DROP_PKT_SLICE_CNT, &cnt,
+ sizeof(cnt));
+ nbl_hw_wr_regs(hw_mgt, NBL_UQM_DROP_PKT_LEN_ADD_CNT, &cnt,
+ sizeof(cnt));
+ nbl_hw_wr_regs(hw_mgt, NBL_UQM_DROP_HEAD_PNTR_ADD_CNT, &cnt,
+ sizeof(cnt));
+ nbl_hw_wr_regs(hw_mgt, NBL_UQM_DROP_WEIGHT_ADD_CNT, &cnt,
+ sizeof(cnt));
+
+ for (i = 0; i < NBL_UQM_PORT_DROP_DEPTH; i++) {
+ nbl_hw_wr_regs(hw_mgt,
+ NBL_UQM_PORT_DROP_PKT_CNT + (sizeof(cnt) * i),
+ &cnt, sizeof(cnt));
+ nbl_hw_wr_regs(hw_mgt,
+ NBL_UQM_PORT_DROP_PKT_SLICE_CNT +
+ (sizeof(cnt) * i),
+ &cnt, sizeof(cnt));
+ nbl_hw_wr_regs(hw_mgt,
+ NBL_UQM_PORT_DROP_PKT_LEN_ADD_CNT +
+ (sizeof(cnt) * i),
+ &cnt, sizeof(cnt));
+ nbl_hw_wr_regs(hw_mgt,
+ NBL_UQM_PORT_DROP_HEAD_PNTR_ADD_CNT +
+ (sizeof(cnt) * i),
+ &cnt, sizeof(cnt));
+ nbl_hw_wr_regs(hw_mgt,
+ NBL_UQM_PORT_DROP_WEIGHT_ADD_CNT +
+ (sizeof(cnt) * i),
+ &cnt, sizeof(cnt));
+ }
+
+ for (i = 0; i < NBL_UQM_DPORT_DROP_DEPTH; i++)
+ nbl_hw_wr_regs(hw_mgt,
+ NBL_UQM_DPORT_DROP_CNT + (sizeof(cnt) * i),
+ &cnt, sizeof(cnt));
+
+ que_type.bp_drop = 0;
+ nbl_hw_wr_regs(hw_mgt, NBL_UQM_QUE_TYPE, (u32 *)&que_type,
+ sizeof(que_type));
+}
+
+static void nbl_dp_init(struct nbl_hw_mgt *hw_mgt, u8 speed, u8 eth_num)
+{
+ nbl_dped_init(hw_mgt);
+ nbl_uped_init(hw_mgt);
+ nbl_shaping_init(hw_mgt, speed);
+ nbl_dsch_qid_max_init(hw_mgt);
+ nbl_ustore_init(hw_mgt, eth_num);
+ nbl_dstore_init(hw_mgt, speed);
+ nbl_dvn_init(hw_mgt, speed);
+ nbl_uvn_init(hw_mgt);
+ nbl_uqm_init(hw_mgt);
+}
+
+static void nbl_host_padpt_init(struct nbl_hw_mgt *hw_mgt)
+{
+ /* padpt flow control register */
+ nbl_hw_wr32(hw_mgt, NBL_HOST_PADPT_HOST_CFG_FC_CPLH_UP, 0x10400);
+ nbl_hw_wr32(hw_mgt, NBL_HOST_PADPT_HOST_CFG_FC_PD_DN, 0x10080);
+ nbl_hw_wr32(hw_mgt, NBL_HOST_PADPT_HOST_CFG_FC_PH_DN, 0x10010);
+ nbl_hw_wr32(hw_mgt, NBL_HOST_PADPT_HOST_CFG_FC_NPH_DN, 0x10010);
+}
+
+static void nbl_intf_init(struct nbl_hw_mgt *hw_mgt)
+{
+ nbl_host_padpt_init(hw_mgt);
+}
+
+static void nbl_hw_set_driver_status(struct nbl_hw_mgt *hw_mgt, bool active)
+{
+ u32 status;
+
+ status = nbl_hw_rd32(hw_mgt, NBL_DRIVER_STATUS_REG);
+
+ status = (status & ~(1 << NBL_DRIVER_STATUS_BIT)) |
+ (active << NBL_DRIVER_STATUS_BIT);
+
+ nbl_hw_wr32(hw_mgt, NBL_DRIVER_STATUS_REG, status);
+}
+
+static void nbl_hw_deinit_chip_module(struct nbl_hw_mgt *hw_mgt)
+{
+ nbl_hw_set_driver_status(hw_mgt, false);
+}
+
+static int nbl_hw_init_chip_module(struct nbl_hw_mgt *hw_mgt, u8 eth_speed,
+ u8 eth_num)
+{
+ nbl_dp_init(hw_mgt, eth_speed, eth_num);
+ nbl_intf_init(hw_mgt);
+
+ nbl_write_all_regs(hw_mgt);
+ nbl_hw_set_driver_status(hw_mgt, true);
+ hw_mgt->version = nbl_hw_rd32(hw_mgt, NBL_HW_DUMMY_REG);
+
+ return 0;
+}
+
static void nbl_hw_enable_mailbox_irq(struct nbl_hw_mgt *hw_mgt, u16 func_id,
bool enable_msix, u16 global_vec_id)
{
@@ -262,6 +647,9 @@ static u32 nbl_hw_get_fw_eth_map(struct nbl_hw_mgt *hw_mgt)
}
static struct nbl_hw_ops hw_ops = {
+ .init_chip_module = nbl_hw_init_chip_module,
+ .deinit_chip_module = nbl_hw_deinit_chip_module,
+
.configure_msix_map = nbl_hw_configure_msix_map,
.configure_msix_info = nbl_hw_configure_msix_info,
diff --git a/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_hw_leonis/nbl_hw_leonis.h b/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_hw_leonis/nbl_hw_leonis.h
index 8831394ed11b..7487d0e757e3 100644
--- a/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_hw_leonis/nbl_hw_leonis.h
+++ b/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_hw_leonis/nbl_hw_leonis.h
@@ -128,7 +128,8 @@ struct nbl_function_msix_map {
#define NBL_DPED_VLAN_OFFSET (NBL_DP_DPED_BASE + 0x000003F4)
#define NBL_DPED_DSCP_OFFSET_0 (NBL_DP_DPED_BASE + 0x000003F8)
#define NBL_DPED_DSCP_OFFSET_1 (NBL_DP_DPED_BASE + 0x000003FC)
-
+#define NBL_DPED_V4_TCP_IDX 5
+#define NBL_DPED_V6_TCP_IDX 6
/* DPED hw_edt_prof/ UPED hw_edt_prof */
struct ped_hw_edit_profile {
u32 l4_len:2;
@@ -268,6 +269,15 @@ struct dsch_vn_quanta {
#define DEFAULT_DVN_DESC_WR_MERGE_TIMEOUT_MAX 0x3FF
+#define PACKET_RING_MIN 8U
+#define PACKET_RING_MAX 32U
+#define SPLIT_RING_MIN 8U
+#define SPLIT_RING_MAX 16U
+#define PACKET_RING_BASE 8U
+#define PACKET_RING_DIV 4U
+#define SPLIT_RING_CFG_8 0U
+#define SPLIT_RING_CFG_16 1U
+
struct nbl_dvn_descreq_num_cfg {
u32 avring_cfg_num:1; /* spilit ring descreq_num 0:8,1:16 */
u32 rsv0:3;
diff --git a/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_hw_leonis/nbl_resource_leonis.c b/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_hw_leonis/nbl_resource_leonis.c
index d60d9cf0bb2c..8cf56f19c81c 100644
--- a/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_hw_leonis/nbl_resource_leonis.c
+++ b/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_hw_leonis/nbl_resource_leonis.c
@@ -46,6 +46,10 @@ nbl_res_setup_ops(struct device *dev, struct nbl_resource_mgt *res_mgt)
if (!is_ops_inited) {
ret = nbl_intr_setup_ops(&res_ops);
+ if (ret)
+ return ERR_PTR(-ENOMEM);
+
+ ret = nbl_vsi_setup_ops(&res_ops);
if (ret)
return ERR_PTR(-ENOMEM);
is_ops_inited = true;
diff --git a/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_resource.h b/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_resource.h
index 1b80676cf19a..675649ffb271 100644
--- a/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_resource.h
+++ b/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_resource.h
@@ -110,6 +110,7 @@ void nbl_res_get_eth_id(struct nbl_resource_mgt *res_mgt, u16 vsi_id,
u8 *eth_mode, u8 *eth_id, u8 *logic_eth_id);
int nbl_intr_mgt_start(struct nbl_resource_mgt *res_mgt);
int nbl_intr_setup_ops(struct nbl_resource_ops *resource_ops);
+int nbl_vsi_setup_ops(struct nbl_resource_ops *resource_ops);
void nbl_res_pf_dev_vsi_type_to_hw_vsi_type(u16 src_type,
enum nbl_vsi_serv_type *dst_type);
#endif
diff --git a/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_vsi.c b/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_vsi.c
new file mode 100644
index 000000000000..67b9b23ad012
--- /dev/null
+++ b/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_vsi.c
@@ -0,0 +1,51 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2025 Nebula Matrix Limited.
+ */
+#include <linux/device.h>
+#include "nbl_vsi.h"
+
+static void nbl_res_vsi_deinit_chip_module(struct nbl_resource_mgt *res_mgt)
+{
+ struct nbl_hw_ops *hw_ops = res_mgt->hw_ops_tbl->ops;
+
+ hw_ops->deinit_chip_module(res_mgt->hw_ops_tbl->priv);
+}
+
+static int nbl_res_vsi_init_chip_module(struct nbl_resource_mgt *res_mgt)
+{
+ u8 eth_speed = res_mgt->resource_info->board_info.eth_speed;
+ u8 eth_num = res_mgt->resource_info->board_info.eth_num;
+ struct nbl_hw_ops *hw_ops = res_mgt->hw_ops_tbl->ops;
+ struct nbl_hw_mgt *p = res_mgt->hw_ops_tbl->priv;
+ int ret;
+
+ ret = hw_ops->init_chip_module(p, eth_speed, eth_num);
+
+ return ret;
+}
+
+/* NBL_VSI_SET_OPS(ops_name, func)
+ *
+ * Use X macros to reduce ops setup and removal code.
+ */
+#define NBL_VSI_OPS_TBL \
+do { \
+ NBL_VSI_SET_OPS(init_chip_module, \
+ nbl_res_vsi_init_chip_module); \
+ NBL_VSI_SET_OPS(deinit_chip_module, \
+ nbl_res_vsi_deinit_chip_module); \
+} while (0)
+
+int nbl_vsi_setup_ops(struct nbl_resource_ops *res_ops)
+{
+#define NBL_VSI_SET_OPS(name, func) \
+ do { \
+ res_ops->NBL_NAME(name) = func; \
+ ; \
+ } while (0)
+ NBL_VSI_OPS_TBL;
+#undef NBL_VSI_SET_OPS
+
+ return 0;
+}
diff --git a/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_vsi.h b/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_vsi.h
new file mode 100644
index 000000000000..6fd79add57d6
--- /dev/null
+++ b/drivers/net/ethernet/nebula-matrix/nbl/nbl_hw/nbl_vsi.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2025 Nebula Matrix Limited.
+ */
+
+#ifndef _NBL_VSI_H_
+#define _NBL_VSI_H_
+
+#include "nbl_resource.h"
+
+#endif
diff --git a/drivers/net/ethernet/nebula-matrix/nbl/nbl_include/nbl_def_hw.h b/drivers/net/ethernet/nebula-matrix/nbl/nbl_include/nbl_def_hw.h
index f97f7a810ad0..93dfe74d3643 100644
--- a/drivers/net/ethernet/nebula-matrix/nbl/nbl_include/nbl_def_hw.h
+++ b/drivers/net/ethernet/nebula-matrix/nbl/nbl_include/nbl_def_hw.h
@@ -11,6 +11,10 @@
struct nbl_hw_mgt;
struct nbl_adapter;
struct nbl_hw_ops {
+ int (*init_chip_module)(struct nbl_hw_mgt *hw_mgt, u8 eth_speed,
+ u8 eth_num);
+ void (*deinit_chip_module)(struct nbl_hw_mgt *hw_mgt);
+
void (*configure_msix_map)(struct nbl_hw_mgt *hw_mgt, u16 func_id,
bool valid, dma_addr_t dma_addr, u8 bus,
u8 devid, u8 function);
diff --git a/drivers/net/ethernet/nebula-matrix/nbl/nbl_include/nbl_include.h b/drivers/net/ethernet/nebula-matrix/nbl/nbl_include/nbl_include.h
index e4f11e6ded94..5203bb2a9a5f 100644
--- a/drivers/net/ethernet/nebula-matrix/nbl/nbl_include/nbl_include.h
+++ b/drivers/net/ethernet/nebula-matrix/nbl/nbl_include/nbl_include.h
@@ -45,4 +45,35 @@ struct nbl_init_param {
bool pci_using_dac;
};
+enum nbl_fw_port_speed {
+ NBL_FW_PORT_SPEED_10G,
+ NBL_FW_PORT_SPEED_25G,
+ NBL_FW_PORT_SPEED_50G,
+ NBL_FW_PORT_SPEED_100G,
+};
+
+#define NBL_OPS_CALL(func, para) \
+do { \
+ typeof(func) _func = (func); \
+ if (_func) \
+ _func para; \
+} while (0)
+
+#define NBL_OPS_CALL_RET(func, para) \
+({ \
+ typeof(func) _func = (func); \
+ _func ? _func para : 0; \
+})
+
+#define NBL_OPS_CALL_RET_PTR(func, para) \
+({ \
+ typeof(func) _func = (func); \
+ _func ? _func para : NULL; \
+})
+
+enum nbl_performance_mode {
+ NBL_QUIRKS_NO_TOE,
+ NBL_QUIRKS_UVN_PREFETCH_ALIGN,
+};
+
#endif
--
2.47.3
^ permalink raw reply related
* [PATCH v12 net-next 11/11] net/nebula-matrix: add common dev start/stop operation
From: illusion.wang @ 2026-04-15 3:36 UTC (permalink / raw)
To: dimon.zhao, illusion.wang, alvin.wang, sam.chen, netdev
Cc: andrew+netdev, corbet, kuba, linux-doc, lorenzo, pabeni, horms,
vadim.fedorenko, lukas.bulwahn, edumazet, enelsonmoore, skhan,
hkallweit1, open list
In-Reply-To: <20260415033608.2438-1-illusion.wang@nebula-matrix.com>
Start the common device: configure the MSI-X map table, allocate and
enable the MSI-X vectors, register the mailbox ISR, and enable the
mailbox IRQ.
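
For readers following the ordering, here is a minimal user-space sketch
of the four-step start sequence and its reverse-order unwind on
failure. It is an illustration only, not the driver code: the step
names, demo_start(), demo_stop() and demo_active_steps() are all
hypothetical, and the sketch uses the plain goto-unwind idiom (each
label undoes only the steps that had already succeeded), which is a
simplification of what nbl_dev_start() does.

```c
#include <assert.h>

/* Hypothetical step identifiers standing in for the driver's helpers. */
enum step {
	STEP_MSIX_MAP,		/* configure the MSI-X map table */
	STEP_ALLOC_VECTORS,	/* allocate and enable MSI-X vectors */
	STEP_REQUEST_IRQ,	/* register the mailbox ISR */
	STEP_ENABLE_IRQ,	/* enable the mailbox IRQ */
	STEP_COUNT,
};

static int done[STEP_COUNT];	/* 1 while a step is set up and not undone */

/* Pretend to run one setup step; fail with -1 if it matches fail_at. */
static int step_setup(enum step s, int fail_at)
{
	if ((int)s == fail_at)
		return -1;
	done[s] = 1;
	return 0;
}

/* Undo one step; only steps that were actually set up may be undone. */
static void step_undo(enum step s)
{
	assert(done[s]);
	done[s] = 0;
}

/* Return 0 on success; on failure undo completed steps in reverse order. */
int demo_start(int fail_at)
{
	int ret;

	ret = step_setup(STEP_MSIX_MAP, fail_at);
	if (ret)
		goto map_err;
	ret = step_setup(STEP_ALLOC_VECTORS, fail_at);
	if (ret)
		goto alloc_err;
	ret = step_setup(STEP_REQUEST_IRQ, fail_at);
	if (ret)
		goto request_err;
	ret = step_setup(STEP_ENABLE_IRQ, fail_at);
	if (ret)
		goto enable_err;

	return 0;

enable_err:
	step_undo(STEP_REQUEST_IRQ);
request_err:
	step_undo(STEP_ALLOC_VECTORS);
alloc_err:
	step_undo(STEP_MSIX_MAP);
map_err:
	return ret;
}

/* Tear everything down in the reverse of the start order. */
void demo_stop(void)
{
	step_undo(STEP_ENABLE_IRQ);
	step_undo(STEP_REQUEST_IRQ);
	step_undo(STEP_ALLOC_VECTORS);
	step_undo(STEP_MSIX_MAP);
}

/* Number of steps currently set up. */
int demo_active_steps(void)
{
	int i, n = 0;

	for (i = 0; i < STEP_COUNT; i++)
		n += done[i];
	return n;
}
```

Passing a step index as fail_at simulates a failure at that step;
passing -1 lets every step succeed. After any failed demo_start() no
step should remain set up, which is the property the unwind labels are
meant to provide.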
Signed-off-by: illusion.wang <illusion.wang@nebula-matrix.com>
---
.../nebula-matrix/nbl/nbl_core/nbl_dev.c | 215 ++++++++++++++++++
.../net/ethernet/nebula-matrix/nbl/nbl_main.c | 32 ++-
2 files changed, 246 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/nebula-matrix/nbl/nbl_core/nbl_dev.c b/drivers/net/ethernet/nebula-matrix/nbl/nbl_core/nbl_dev.c
index f10bb9460774..e814ffbb978d 100644
--- a/drivers/net/ethernet/nebula-matrix/nbl/nbl_core/nbl_dev.c
+++ b/drivers/net/ethernet/nebula-matrix/nbl/nbl_core/nbl_dev.c
@@ -6,6 +6,17 @@
#include <linux/pci.h>
#include "nbl_dev.h"
+static int nbl_dev_clean_mailbox_schedule(struct nbl_dev_mgt *dev_mgt);
+
+/* ---------- Interrupt config ---------- */
+static irqreturn_t nbl_dev_clean_mailbox(int __always_unused irq, void *data)
+{
+ struct nbl_dev_mgt *dev_mgt = (struct nbl_dev_mgt *)data;
+
+ nbl_dev_clean_mailbox_schedule(dev_mgt);
+ return IRQ_HANDLED;
+}
+
static void nbl_dev_init_msix_cnt(struct nbl_dev_mgt *dev_mgt)
{
struct nbl_dev_common *dev_common = dev_mgt->common_dev;
@@ -14,6 +25,170 @@ static void nbl_dev_init_msix_cnt(struct nbl_dev_mgt *dev_mgt)
msix_info->serv_info[NBL_MSIX_MAILBOX_TYPE].num = 1;
}
+static int nbl_dev_request_mailbox_irq(struct nbl_dev_mgt *dev_mgt)
+{
+ struct nbl_dev_common *dev_common = dev_mgt->common_dev;
+ struct nbl_msix_info *msix_info = &dev_common->msix_info;
+ struct nbl_common_info *common = dev_mgt->common;
+ u16 local_vec_id;
+ u32 irq_num;
+ int err;
+
+ if (!msix_info->serv_info[NBL_MSIX_MAILBOX_TYPE].num)
+ return 0;
+
+ local_vec_id =
+ msix_info->serv_info[NBL_MSIX_MAILBOX_TYPE].base_vector_id;
+ irq_num = pci_irq_vector(common->pdev, local_vec_id);
+
+ snprintf(dev_common->mailbox_name, sizeof(dev_common->mailbox_name),
+ "nbl_mailbox@pci:%s", pci_name(common->pdev));
+ err = request_irq(irq_num, nbl_dev_clean_mailbox, 0,
+ dev_common->mailbox_name, dev_mgt);
+ if (err)
+ return err;
+
+ return 0;
+}
+
+static void nbl_dev_free_mailbox_irq(struct nbl_dev_mgt *dev_mgt)
+{
+ struct nbl_dev_common *dev_common = dev_mgt->common_dev;
+ struct nbl_msix_info *msix_info = &dev_common->msix_info;
+ struct nbl_common_info *common = dev_mgt->common;
+ u16 local_vec_id;
+ u32 irq_num;
+
+ if (!msix_info->serv_info[NBL_MSIX_MAILBOX_TYPE].num)
+ return;
+
+ local_vec_id =
+ msix_info->serv_info[NBL_MSIX_MAILBOX_TYPE].base_vector_id;
+ irq_num = pci_irq_vector(common->pdev, local_vec_id);
+
+ free_irq(irq_num, dev_mgt);
+}
+
+static int nbl_dev_enable_mailbox_irq(struct nbl_dev_mgt *dev_mgt)
+{
+ struct nbl_dispatch_ops *disp_ops = dev_mgt->disp_ops_tbl->ops;
+ struct nbl_channel_ops *chan_ops = dev_mgt->chan_ops_tbl->ops;
+ struct nbl_dev_common *dev_common = dev_mgt->common_dev;
+ struct nbl_msix_info *msix_info = &dev_common->msix_info;
+ u16 local_vec_id;
+
+ if (!msix_info->serv_info[NBL_MSIX_MAILBOX_TYPE].num)
+ return 0;
+
+ local_vec_id =
+ msix_info->serv_info[NBL_MSIX_MAILBOX_TYPE].base_vector_id;
+ chan_ops->set_queue_state(dev_mgt->chan_ops_tbl->priv,
+ NBL_CHAN_INTERRUPT_READY,
+ NBL_CHAN_TYPE_MAILBOX, true);
+
+ return disp_ops->enable_mailbox_irq(dev_mgt->disp_ops_tbl->priv,
+ local_vec_id, true);
+}
+
+static int nbl_dev_disable_mailbox_irq(struct nbl_dev_mgt *dev_mgt)
+{
+ struct nbl_dispatch_ops *disp_ops = dev_mgt->disp_ops_tbl->ops;
+ struct nbl_channel_ops *chan_ops = dev_mgt->chan_ops_tbl->ops;
+ struct nbl_dev_common *dev_common = dev_mgt->common_dev;
+ struct nbl_msix_info *msix_info = &dev_common->msix_info;
+ u16 local_vec_id;
+
+ if (!msix_info->serv_info[NBL_MSIX_MAILBOX_TYPE].num)
+ return 0;
+
+ flush_work(&dev_common->clean_mbx_task);
+ local_vec_id =
+ msix_info->serv_info[NBL_MSIX_MAILBOX_TYPE].base_vector_id;
+ chan_ops->set_queue_state(dev_mgt->chan_ops_tbl->priv,
+ NBL_CHAN_INTERRUPT_READY,
+ NBL_CHAN_TYPE_MAILBOX, false);
+
+ return disp_ops->enable_mailbox_irq(dev_mgt->disp_ops_tbl->priv,
+ local_vec_id, false);
+}
+
+static int nbl_dev_configure_msix_map(struct nbl_dev_mgt *dev_mgt)
+{
+ struct nbl_dispatch_ops *disp_ops = dev_mgt->disp_ops_tbl->ops;
+ struct nbl_dev_common *dev_common = dev_mgt->common_dev;
+ struct nbl_msix_info *msix_info = &dev_common->msix_info;
+ bool mask_en = msix_info->serv_info[NBL_MSIX_NET_TYPE].hw_self_mask_en;
+ u16 msix_net_num = msix_info->serv_info[NBL_MSIX_NET_TYPE].num;
+ u16 msix_not_net_num = 0;
+ int err, i;
+
+ for (i = NBL_MSIX_NET_TYPE; i < NBL_MSIX_TYPE_MAX; i++)
+ msix_info->serv_info[i].base_vector_id =
+ msix_info->serv_info[i - 1].base_vector_id +
+ msix_info->serv_info[i - 1].num;
+
+ for (i = NBL_MSIX_MAILBOX_TYPE; i < NBL_MSIX_TYPE_MAX; i++)
+ msix_not_net_num += msix_info->serv_info[i].num;
+
+ err = disp_ops->configure_msix_map(dev_mgt->disp_ops_tbl->priv,
+ msix_net_num, msix_not_net_num,
+ mask_en);
+
+ return err;
+}
+
+static int nbl_dev_destroy_msix_map(struct nbl_dev_mgt *dev_mgt)
+{
+ struct nbl_dispatch_ops *disp_ops = dev_mgt->disp_ops_tbl->ops;
+
+ return disp_ops->destroy_msix_map(dev_mgt->disp_ops_tbl->priv);
+}
+
+static int nbl_dev_alloc_msix_intr(struct nbl_dev_mgt *dev_mgt)
+{
+ struct nbl_dev_common *dev_common = dev_mgt->common_dev;
+ struct nbl_msix_info *msix_info = &dev_common->msix_info;
+ struct nbl_common_info *common = dev_mgt->common;
+ int needed = 0;
+ int err;
+ int i;
+
+ for (i = 0; i < NBL_MSIX_TYPE_MAX; i++)
+ needed += msix_info->serv_info[i].num;
+
+ err = pci_alloc_irq_vectors(common->pdev, needed, needed,
+ PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
+ if (err < 0) {
+ pr_err("pci_alloc_irq_vectors failed, err = %d.\n", err);
+ goto enable_msix_failed;
+ }
+
+ return needed;
+
+enable_msix_failed:
+ return err;
+}
+
+static int nbl_dev_init_interrupt_scheme(struct nbl_dev_mgt *dev_mgt)
+{
+ int err;
+
+ err = nbl_dev_alloc_msix_intr(dev_mgt);
+ if (err < 0) {
+ dev_err(dev_mgt->common->dev,
+ "Failed to enable MSI-X vectors\n");
+ return err;
+ }
+ return 0;
+}
+
+static void nbl_dev_clear_interrupt_scheme(struct nbl_dev_mgt *dev_mgt)
+{
+ struct nbl_common_info *common = dev_mgt->common;
+
+ pci_free_irq_vectors(common->pdev);
+}
+
/* ---------- Channel config ---------- */
static int nbl_dev_setup_chan_qinfo(struct nbl_dev_mgt *dev_mgt, u8 chan_type)
{
@@ -79,6 +254,14 @@ static void nbl_dev_clean_mailbox_task(struct work_struct *work)
NBL_CHAN_TYPE_MAILBOX);
}
+static int nbl_dev_clean_mailbox_schedule(struct nbl_dev_mgt *dev_mgt)
+{
+ struct nbl_dev_common *common_dev = dev_mgt->common_dev;
+
+ nbl_common_queue_work(&common_dev->clean_mbx_task);
+ return 0;
+}
+
/* ---------- Dev init process ---------- */
static int nbl_dev_setup_common_dev(struct nbl_adapter *adapter)
{
@@ -218,9 +401,41 @@ void nbl_dev_remove(struct nbl_adapter *adapter)
/* ---------- Dev start process ---------- */
int nbl_dev_start(struct nbl_adapter *adapter)
{
+ struct nbl_dev_mgt *dev_mgt = adapter->core.dev_mgt;
+ int ret;
+
+ ret = nbl_dev_configure_msix_map(dev_mgt);
+ if (ret)
+ goto config_msix_map_err;
+
+ ret = nbl_dev_init_interrupt_scheme(dev_mgt);
+ if (ret)
+ goto init_interrupt_scheme_err;
+ ret = nbl_dev_request_mailbox_irq(dev_mgt);
+ if (ret)
+ goto mailbox_request_irq_err;
+ ret = nbl_dev_enable_mailbox_irq(dev_mgt);
+ if (ret)
+ goto enable_mailbox_irq_err;
+
return 0;
+enable_mailbox_irq_err:
+ nbl_dev_disable_mailbox_irq(dev_mgt);
+ nbl_dev_free_mailbox_irq(dev_mgt);
+mailbox_request_irq_err:
+ nbl_dev_clear_interrupt_scheme(dev_mgt);
+init_interrupt_scheme_err:
+ nbl_dev_destroy_msix_map(dev_mgt);
+config_msix_map_err:
+ return ret;
}
void nbl_dev_stop(struct nbl_adapter *adapter)
{
+ struct nbl_dev_mgt *dev_mgt = adapter->core.dev_mgt;
+
+ nbl_dev_disable_mailbox_irq(dev_mgt);
+ nbl_dev_free_mailbox_irq(dev_mgt);
+ nbl_dev_clear_interrupt_scheme(dev_mgt);
+ nbl_dev_destroy_msix_map(dev_mgt);
}
diff --git a/drivers/net/ethernet/nebula-matrix/nbl/nbl_main.c b/drivers/net/ethernet/nebula-matrix/nbl/nbl_main.c
index 9ffa76000ae3..15732d3175af 100644
--- a/drivers/net/ethernet/nebula-matrix/nbl/nbl_main.c
+++ b/drivers/net/ethernet/nebula-matrix/nbl/nbl_main.c
@@ -182,6 +182,7 @@ static int nbl_probe(struct pci_dev *pdev,
err = nbl_core_start(adapter);
if (err)
goto core_start_err;
+
return 0;
core_start_err:
nbl_core_remove(adapter);
@@ -293,7 +294,36 @@ static struct pci_driver nbl_driver = {
.remove = nbl_remove,
};
-module_pci_driver(nbl_driver);
+static int __init nbl_module_init(void)
+{
+ int status;
+
+ status = nbl_common_create_wq();
+ if (status) {
+ pr_err("Failed to create wq, err = %d\n", status);
+ goto wq_create_failed;
+ }
+ status = pci_register_driver(&nbl_driver);
+ if (status) {
+ pr_err("Failed to register PCI driver, err = %d\n", status);
+ goto pci_register_driver_failed;
+ }
+
+ return 0;
+
+pci_register_driver_failed:
+ nbl_common_destroy_wq();
+wq_create_failed:
+ return status;
+}
+
+static void __exit nbl_module_exit(void)
+{
+ pci_unregister_driver(&nbl_driver);
+ nbl_common_destroy_wq();
+}
+module_init(nbl_module_init);
+module_exit(nbl_module_exit);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Nebula Matrix Network Driver");
--
2.47.3
^ permalink raw reply related