* [Intel-wired-lan] [PATCH iwl-next v5 0/5] ice: add support for devmem/io_uring Rx and Tx
From: Alexander Lobakin @ 2026-05-05 15:29 UTC (permalink / raw)
To: intel-wired-lan
Cc: Alexander Lobakin, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kohei Enju, Jacob Keller, Aleksandr Loktionov,
nxne.cnse.osdt.itp.upstreaming, netdev, linux-kernel
Now that ice uses libeth for managing Rx buffers and supports
configurable header split, it's ready to get support for sending
and receiving packets with unreadable (to the kernel) frags.
Extend libeth just a little bit to allow creating PPs with custom
memory providers and make sure ice works correctly with the netdev
ops locking. Then add the full set of queue_mgmt_ops and don't
unmap unreadable frags on Tx completion.
No perf regressions in the regular flows and no code duplication
introduced.
Credits to the fbnic developers, whose code helped me understand
the memory provider and queue_mgmt_ops logic and served as
a reference.
Alexander Lobakin (5):
libeth: pass Rx queue index to PP when creating a fill queue
libeth: handle creating pools with unreadable buffers
ice: migrate to netdev ops lock
ice: implement Rx queue management ops
ice: add support for transmitting unreadable frags
drivers/net/ethernet/intel/ice/ice_base.h | 2 +
drivers/net/ethernet/intel/ice/ice_lib.h | 18 +-
drivers/net/ethernet/intel/ice/ice_txrx.h | 2 +
drivers/net/ethernet/intel/idpf/idpf_txrx.h | 2 +
include/net/libeth/rx.h | 2 +
include/net/libeth/tx.h | 2 +-
drivers/net/ethernet/intel/iavf/iavf_txrx.c | 1 +
drivers/net/ethernet/intel/ice/ice_base.c | 247 +++++++++++++++----
drivers/net/ethernet/intel/ice/ice_dcb_lib.c | 15 +-
drivers/net/ethernet/intel/ice/ice_eswitch.c | 26 +-
drivers/net/ethernet/intel/ice/ice_lib.c | 227 +++++++++++++----
drivers/net/ethernet/intel/ice/ice_main.c | 79 +++---
drivers/net/ethernet/intel/ice/ice_sf_eth.c | 4 +
drivers/net/ethernet/intel/ice/ice_txrx.c | 43 +++-
drivers/net/ethernet/intel/ice/ice_xsk.c | 4 +-
drivers/net/ethernet/intel/idpf/idpf_txrx.c | 13 +
drivers/net/ethernet/intel/libeth/rx.c | 43 ++++
17 files changed, 566 insertions(+), 164 deletions(-)
---
Note: apply to net-next, not Tony's next-queue (ready to be sent as
a PR).
From v4[0]:
* rebase on top of the latest net-next;
* 3/5: fix the last [hopefully] missing netdev lock (E-Switch code,
Simon), rechecked with Intel's internal Sashiko setup;
* 3/5: pick fixes for .ndo_bpf() and safe mode from Kohei.
From v3[1]:
* rebase on top of recent Larysa's changes;
* 3/5: fix the last locking inconsistencies (Jakub);
* 3/5: pick a kdoc fix from Tony.
From v2[2]:
* rebase on top of net-next-7.0;
* 3/5: fix [hopefully] all inconsistent locking (Jakub, Tony);
* 4/5: pick a hotfix from Kohei.
From v1[3]:
* rebase on top of the latest next-queue;
* fix a typo 'rxq_ixd' -> 'rxq_idx' (Tony).
Testing hints:
* regular Rx and Tx for regressions;
* <kernel root>/tools/testing/selftests/drivers/net/hw/ contains
scripts for testing netmem Rx and Tx, namely devmem.py and
iou-zcrx.py (read the documentation first).
[0] https://lore.kernel.org/intel-wired-lan/20260318163505.31765-1-aleksander.lobakin@intel.com
[1] https://lore.kernel.org/intel-wired-lan/20260224174618.2780516-1-aleksander.lobakin@intel.com
[2] https://lore.kernel.org/intel-wired-lan/20251204155133.2437621-1-aleksander.lobakin@intel.com
[3] https://lore.kernel.org/intel-wired-lan/20251125173603.3834486-1-aleksander.lobakin@intel.com
--
2.54.0
* [Intel-wired-lan] [PATCH iwl-next v5 1/5] libeth: pass Rx queue index to PP when creating a fill queue
From: Alexander Lobakin @ 2026-05-05 15:29 UTC (permalink / raw)
To: intel-wired-lan
Cc: Alexander Lobakin, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kohei Enju, Jacob Keller, Aleksandr Loktionov,
nxne.cnse.osdt.itp.upstreaming, netdev, linux-kernel
page_pool_create() has recently started accepting an optional stack
index of the Rx queue which the pool is created for. It can then be
used on the control path, e.g. for memory providers.
Add the same field to libeth_fq and pass the index from all the
drivers using libeth for managing Rx to simplify implementing MP
support later.
idpf has one libeth_fq per buffer/fill queue and each Rx queue has
two fill queues, but since fill queues are never shared, we can
store the corresponding Rx queue index there during initialization
and pass it to libeth.
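For reference, a minimal sketch of what a libeth consumer does with
the new field. The ring structure and helper names below are made up
for illustration; the real hunks follow:

#include <linux/numa.h>
#include <net/libeth/rx.h>

/* Hypothetical ring; only the fields this sketch needs */
struct demo_rx_ring {
	struct napi_struct	*napi;
	u32			count;
	u32			queue_index;
};

static int demo_rxq_create_fq(struct demo_rx_ring *ring)
{
	struct libeth_fq fq = {
		.count	= ring->count,
		.type	= LIBETH_FQE_MTU,
		.nid	= NUMA_NO_NODE,
		/* new: stack index of the Rx queue this pool backs */
		.idx	= ring->queue_index,
	};

	return libeth_rx_fq_create(&fq, ring->napi);
}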
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
drivers/net/ethernet/intel/idpf/idpf_txrx.h | 2 ++
include/net/libeth/rx.h | 2 ++
drivers/net/ethernet/intel/iavf/iavf_txrx.c | 1 +
drivers/net/ethernet/intel/ice/ice_base.c | 2 ++
drivers/net/ethernet/intel/idpf/idpf_txrx.c | 13 +++++++++++++
drivers/net/ethernet/intel/libeth/rx.c | 1 +
6 files changed, 21 insertions(+)
diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.h b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
index 4be5b3b6d3ed..a0d92adf11c4 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.h
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
@@ -748,6 +748,7 @@ libeth_cacheline_set_assert(struct idpf_tx_queue, 64,
* @size: Length of descriptor ring in bytes
* @dma: Physical address of ring
* @q_vector: Backreference to associated vector
+ * @rxq_idx: stack index of the corresponding Rx queue
* @rx_buffer_low_watermark: RX buffer low watermark
* @rx_hbuf_size: Header buffer size
* @rx_buf_size: Buffer size
@@ -791,6 +792,7 @@ struct idpf_buf_queue {
dma_addr_t dma;
struct idpf_q_vector *q_vector;
+ u16 rxq_idx;
u16 rx_buffer_low_watermark;
u16 rx_hbuf_size;
diff --git a/include/net/libeth/rx.h b/include/net/libeth/rx.h
index 5d991404845e..3b3d7acd13c9 100644
--- a/include/net/libeth/rx.h
+++ b/include/net/libeth/rx.h
@@ -71,6 +71,7 @@ enum libeth_fqe_type {
* @xdp: flag indicating whether XDP is enabled
* @buf_len: HW-writeable length per each buffer
* @nid: ID of the closest NUMA node with memory
+ * @idx: stack index of the corresponding Rx queue
*/
struct libeth_fq {
struct_group_tagged(libeth_fq_fp, fp,
@@ -88,6 +89,7 @@ struct libeth_fq {
u32 buf_len;
int nid;
+ u32 idx;
};
int libeth_rx_fq_create(struct libeth_fq *fq, struct napi_struct *napi);
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
index 363c42bf3dcf..d3c68659162b 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
@@ -771,6 +771,7 @@ int iavf_setup_rx_descriptors(struct iavf_ring *rx_ring)
.count = rx_ring->count,
.buf_len = LIBIE_MAX_RX_BUF_LEN,
.nid = NUMA_NO_NODE,
+ .idx = rx_ring->queue_index,
};
int ret;
diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
index 1667f686ff75..f162cdfc62a7 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.c
+++ b/drivers/net/ethernet/intel/ice/ice_base.c
@@ -609,6 +609,7 @@ static int ice_rxq_pp_create(struct ice_rx_ring *rq)
struct libeth_fq fq = {
.count = rq->count,
.nid = NUMA_NO_NODE,
+ .idx = rq->q_index,
.hsplit = rq->vsi->hsplit,
.xdp = ice_is_xdp_ena_vsi(rq->vsi),
.buf_len = LIBIE_MAX_RX_BUF_LEN,
@@ -631,6 +632,7 @@ static int ice_rxq_pp_create(struct ice_rx_ring *rq)
.count = rq->count,
.type = LIBETH_FQE_HDR,
.nid = NUMA_NO_NODE,
+ .idx = rq->q_index,
.xdp = ice_is_xdp_ena_vsi(rq->vsi),
};
diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
index f6b3b15364ff..95930edc566f 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
@@ -557,6 +557,7 @@ static int idpf_rx_hdr_buf_alloc_all(struct idpf_buf_queue *bufq)
.type = LIBETH_FQE_HDR,
.xdp = idpf_xdp_enabled(bufq->q_vector->vport),
.nid = idpf_q_vector_to_mem(bufq->q_vector),
+ .idx = bufq->rxq_idx,
};
int ret;
@@ -699,6 +700,7 @@ static int idpf_rx_bufs_init_singleq(struct idpf_rx_queue *rxq)
.type = LIBETH_FQE_MTU,
.buf_len = IDPF_RX_MAX_BUF_SZ,
.nid = idpf_q_vector_to_mem(rxq->q_vector),
+ .idx = rxq->idx,
};
int ret;
@@ -759,6 +761,7 @@ static int idpf_rx_bufs_init(struct idpf_buf_queue *bufq,
.hsplit = idpf_queue_has(HSPLIT_EN, bufq),
.xdp = idpf_xdp_enabled(bufq->q_vector->vport),
.nid = idpf_q_vector_to_mem(bufq->q_vector),
+ .idx = bufq->rxq_idx,
};
int ret;
@@ -1913,6 +1916,16 @@ static int idpf_rxq_group_alloc(struct idpf_vport *vport,
LIBETH_RX_LL_LEN;
idpf_rxq_set_descids(rsrc, q);
}
+
+ if (!idpf_is_queue_model_split(rsrc->rxq_model))
+ continue;
+
+ for (u32 j = 0; j < rsrc->num_bufqs_per_qgrp; j++) {
+ struct idpf_buf_queue *bufq;
+
+ bufq = &rx_qgrp->splitq.bufq_sets[j].bufq;
+ bufq->rxq_idx = rx_qgrp->splitq.rxq_sets[0]->rxq.idx;
+ }
}
err_alloc:
diff --git a/drivers/net/ethernet/intel/libeth/rx.c b/drivers/net/ethernet/intel/libeth/rx.c
index 62521a1f4ec9..8874b714cdcc 100644
--- a/drivers/net/ethernet/intel/libeth/rx.c
+++ b/drivers/net/ethernet/intel/libeth/rx.c
@@ -156,6 +156,7 @@ int libeth_rx_fq_create(struct libeth_fq *fq, struct napi_struct *napi)
.order = LIBETH_RX_PAGE_ORDER,
.pool_size = fq->count,
.nid = fq->nid,
+ .queue_idx = fq->idx,
.dev = napi->dev->dev.parent,
.netdev = napi->dev,
.napi = napi,
--
2.54.0
* [Intel-wired-lan] [PATCH iwl-next v5 2/5] libeth: handle creating pools with unreadable buffers
From: Alexander Lobakin @ 2026-05-05 15:29 UTC (permalink / raw)
To: intel-wired-lan
Cc: Alexander Lobakin, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kohei Enju, Jacob Keller, Aleksandr Loktionov,
nxne.cnse.osdt.itp.upstreaming, netdev, linux-kernel
libeth has been using netmems for quite some time already, so in
order to support unreadable frags / memory providers, it only needs
to set PP_FLAG_ALLOW_UNREADABLE_NETMEM when needed.
Also add a couple of sanity checks to make sure the driver didn't
mess up the configuration options and, when an MP is installed,
always return a truesize equal to PAGE_SIZE, so that
libeth_rx_alloc() never tries to allocate frags. Memory providers
manage buffers on their own and expect a 1:1 buffer / HW Rx
descriptor association.
Bonus: mention in the libeth_sqe_type description that
LIBETH_SQE_EMPTY should also be used for netmem Tx SQEs -- they
don't need DMA unmapping.
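To illustrate the completion-side consequence, here is a hedged
sketch of how a driver's Tx completion can honor these semantics.
This is not the verbatim libeth code, and the unmap field names are
assumptions:

#include <linux/dma-mapping.h>
#include <net/libeth/tx.h>

static void demo_tx_complete_one(struct device *dev,
				 struct libeth_sqe *sqe)
{
	switch (sqe->type) {
	case LIBETH_SQE_EMPTY:	/* incl. netmem frags: nothing mapped */
	case LIBETH_SQE_CTX:	/* context descriptor: no buffer */
		break;
	default:
		dma_unmap_page(dev, dma_unmap_addr(sqe, dma),
			       dma_unmap_len(sqe, len), DMA_TO_DEVICE);
		break;
	}

	sqe->type = LIBETH_SQE_EMPTY;
}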
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
include/net/libeth/tx.h | 2 +-
drivers/net/ethernet/intel/libeth/rx.c | 42 ++++++++++++++++++++++++++
2 files changed, 43 insertions(+), 1 deletion(-)
diff --git a/include/net/libeth/tx.h b/include/net/libeth/tx.h
index c3db5c6f1641..a66fc2b3a114 100644
--- a/include/net/libeth/tx.h
+++ b/include/net/libeth/tx.h
@@ -12,7 +12,7 @@
/**
* enum libeth_sqe_type - type of &libeth_sqe to act on Tx completion
- * @LIBETH_SQE_EMPTY: unused/empty OR XDP_TX/XSk frame, no action required
+ * @LIBETH_SQE_EMPTY: empty OR netmem/XDP_TX/XSk frame, no action required
* @LIBETH_SQE_CTX: context descriptor with empty SQE, no action required
* @LIBETH_SQE_SLAB: kmalloc-allocated buffer, unmap and kfree()
* @LIBETH_SQE_FRAG: mapped skb frag, only unmap DMA
diff --git a/drivers/net/ethernet/intel/libeth/rx.c b/drivers/net/ethernet/intel/libeth/rx.c
index 8874b714cdcc..11e6e8f353ef 100644
--- a/drivers/net/ethernet/intel/libeth/rx.c
+++ b/drivers/net/ethernet/intel/libeth/rx.c
@@ -6,6 +6,7 @@
#include <linux/export.h>
#include <net/libeth/rx.h>
+#include <net/netdev_queues.h>
/* Rx buffer management */
@@ -139,9 +140,47 @@ static bool libeth_rx_page_pool_params_zc(struct libeth_fq *fq,
fq->buf_len = clamp(mtu, LIBETH_RX_BUF_STRIDE, max);
fq->truesize = fq->buf_len;
+ /*
+ * Allow frags only for kernel pages. `fq->truesize == pp->max_len`
+ * will always fall back to regular page_pool_alloc_netmems()
+ * regardless of the MTU / FQ buffer size.
+ */
+ if (pp->flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM)
+ fq->truesize = pp->max_len;
+
return true;
}
+/**
+ * libeth_rx_page_pool_check_unread - check input params for unreadable MPs
+ * @fq: buffer queue to check
+ * @pp: &page_pool_params for the queue
+ *
+ * Make sure we don't create an invalid pool with full-frame unreadable
+ * buffers, bidirectional unreadable buffers or so, and configure the
+ * ZC payload pool accordingly.
+ *
+ * Return: true on success, false on invalid input params.
+ */
+static bool libeth_rx_page_pool_check_unread(const struct libeth_fq *fq,
+ struct page_pool_params *pp)
+{
+ if (!netif_rxq_has_unreadable_mp(pp->netdev, pp->queue_idx))
+ return true;
+
+ /* For now, the core stack doesn't allow XDP with unreadable frags */
+ if (fq->xdp)
+ return false;
+
+ /* It should be either a header pool or a ZC payload pool */
+ if (fq->type == LIBETH_FQE_HDR)
+ return !fq->hsplit;
+
+ pp->flags |= PP_FLAG_ALLOW_UNREADABLE_NETMEM;
+
+ return fq->hsplit;
+}
+
/**
* libeth_rx_fq_create - create a PP with the default libeth settings
* @fq: buffer queue struct to fill
@@ -165,6 +204,9 @@ int libeth_rx_fq_create(struct libeth_fq *fq, struct napi_struct *napi)
struct page_pool *pool;
int ret;
+ if (!libeth_rx_page_pool_check_unread(fq, &pp))
+ return -EINVAL;
+
pp.dma_dir = fq->xdp ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
if (!fq->hsplit)
--
2.54.0
* [Intel-wired-lan] [PATCH iwl-next v5 3/5] ice: migrate to netdev ops lock
From: Alexander Lobakin @ 2026-05-05 15:29 UTC (permalink / raw)
To: intel-wired-lan
Cc: Alexander Lobakin, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kohei Enju, Jacob Keller, Aleksandr Loktionov,
nxne.cnse.osdt.itp.upstreaming, netdev, linux-kernel
Queue management ops unconditionally enable netdev locking. The same
lock is taken by default by several NAPI configuration functions,
such as napi_enable() and netif_napi_set_irq().
Request ops locking in advance and make sure we use the _locked
counterparts of those functions to avoid deadlocks, taking the lock
manually where needed (suspend/resume, queue rebuild and resets).
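The recurring pattern, condensed from the diff below: the real logic
lives in a _locked variant, and the plain wrapper takes the netdev
instance lock, which may be absent for netdev-less VSIs:

int ice_ena_vsi(struct ice_vsi *vsi)
{
	struct net_device *dev = vsi->netdev;
	int ret;

	/* e.g. control VSIs have no netdev, thus no lock to take */
	if (dev)
		netdev_lock(dev);

	ret = ice_ena_vsi_locked(vsi);

	if (dev)
		netdev_unlock(dev);

	return ret;
}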
Co-developed-by: Kohei Enju <kohei@enjuk.jp> # ice_xdp_setup_prog(), safe mode
Signed-off-by: Kohei Enju <kohei@enjuk.jp>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
drivers/net/ethernet/intel/ice/ice_base.h | 2 +
drivers/net/ethernet/intel/ice/ice_lib.h | 13 +-
drivers/net/ethernet/intel/ice/ice_base.c | 63 ++++-
drivers/net/ethernet/intel/ice/ice_dcb_lib.c | 15 +-
drivers/net/ethernet/intel/ice/ice_eswitch.c | 26 ++-
drivers/net/ethernet/intel/ice/ice_lib.c | 227 +++++++++++++++----
drivers/net/ethernet/intel/ice/ice_main.c | 78 ++++---
drivers/net/ethernet/intel/ice/ice_sf_eth.c | 3 +
drivers/net/ethernet/intel/ice/ice_xsk.c | 4 +-
9 files changed, 324 insertions(+), 107 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_base.h b/drivers/net/ethernet/intel/ice/ice_base.h
index d28294247599..99b2c7232829 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.h
+++ b/drivers/net/ethernet/intel/ice/ice_base.h
@@ -12,8 +12,10 @@ int __ice_vsi_get_qs(struct ice_qs_cfg *qs_cfg);
int
ice_vsi_ctrl_one_rx_ring(struct ice_vsi *vsi, bool ena, u16 rxq_idx, bool wait);
int ice_vsi_wait_one_rx_ring(struct ice_vsi *vsi, bool ena, u16 rxq_idx);
+int ice_vsi_alloc_q_vectors_locked(struct ice_vsi *vsi);
int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi);
void ice_vsi_map_rings_to_vectors(struct ice_vsi *vsi);
+void ice_vsi_free_q_vectors_locked(struct ice_vsi *vsi);
void ice_vsi_free_q_vectors(struct ice_vsi *vsi);
int ice_vsi_cfg_single_txq(struct ice_vsi *vsi, struct ice_tx_ring **tx_rings,
u16 q_idx);
diff --git a/drivers/net/ethernet/intel/ice/ice_lib.h b/drivers/net/ethernet/intel/ice/ice_lib.h
index 49454d98dcfe..476fa54ec4e8 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.h
+++ b/drivers/net/ethernet/intel/ice/ice_lib.h
@@ -53,19 +53,24 @@ struct ice_vsi *
ice_vsi_setup(struct ice_pf *pf, struct ice_vsi_cfg_params *params);
void ice_vsi_set_napi_queues(struct ice_vsi *vsi);
-void ice_napi_add(struct ice_vsi *vsi);
-
+void ice_vsi_set_napi_queues_locked(struct ice_vsi *vsi);
void ice_vsi_clear_napi_queues(struct ice_vsi *vsi);
+void ice_vsi_clear_napi_queues_locked(struct ice_vsi *vsi);
+
+void ice_napi_add(struct ice_vsi *vsi);
int ice_vsi_release(struct ice_vsi *vsi);
void ice_vsi_close(struct ice_vsi *vsi);
-int ice_ena_vsi(struct ice_vsi *vsi, bool locked);
+int ice_ena_vsi_locked(struct ice_vsi *vsi);
+int ice_ena_vsi(struct ice_vsi *vsi);
void ice_vsi_decfg(struct ice_vsi *vsi);
-void ice_dis_vsi(struct ice_vsi *vsi, bool locked);
+void ice_dis_vsi_locked(struct ice_vsi *vsi);
+void ice_dis_vsi(struct ice_vsi *vsi);
+int ice_vsi_rebuild_locked(struct ice_vsi *vsi, u32 vsi_flags);
int ice_vsi_rebuild(struct ice_vsi *vsi, u32 vsi_flags);
int ice_vsi_cfg(struct ice_vsi *vsi);
struct ice_vsi *ice_vsi_alloc(struct ice_pf *pf);
diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
index f162cdfc62a7..1add82d894bb 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.c
+++ b/drivers/net/ethernet/intel/ice/ice_base.c
@@ -155,8 +155,8 @@ static int ice_vsi_alloc_q_vector(struct ice_vsi *vsi, u16 v_idx)
* handler here (i.e. resume, reset/rebuild, etc.)
*/
if (vsi->netdev)
- netif_napi_add_config(vsi->netdev, &q_vector->napi,
- ice_napi_poll, v_idx);
+ netif_napi_add_config_locked(vsi->netdev, &q_vector->napi,
+ ice_napi_poll, v_idx);
out:
/* tie q_vector and VSI together */
@@ -198,7 +198,7 @@ static void ice_free_q_vector(struct ice_vsi *vsi, int v_idx)
/* only VSI with an associated netdev is set up with NAPI */
if (vsi->netdev)
- netif_napi_del(&q_vector->napi);
+ netif_napi_del_locked(&q_vector->napi);
/* release MSIX interrupt if q_vector had interrupt allocated */
if (q_vector->irq.index < 0)
@@ -886,13 +886,15 @@ int ice_vsi_wait_one_rx_ring(struct ice_vsi *vsi, bool ena, u16 rxq_idx)
}
/**
- * ice_vsi_alloc_q_vectors - Allocate memory for interrupt vectors
+ * ice_vsi_alloc_q_vectors_locked - Allocate memory for interrupt vectors
* @vsi: the VSI being configured
*
- * We allocate one q_vector per queue interrupt. If allocation fails we
- * return -ENOMEM.
+ * Should be called only under the netdev lock.
+ * We allocate one q_vector per queue interrupt.
+ *
+ * Return: 0 on success, -ENOMEM if allocation fails.
*/
-int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
+int ice_vsi_alloc_q_vectors_locked(struct ice_vsi *vsi)
{
struct device *dev = ice_pf_to_dev(vsi->back);
u16 v_idx;
@@ -919,6 +921,30 @@ int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
return v_idx ? 0 : err;
}
+/**
+ * ice_vsi_alloc_q_vectors - Allocate memory for interrupt vectors
+ * @vsi: the VSI being configured
+ *
+ * We allocate one q_vector per queue interrupt.
+ *
+ * Return: 0 on success, -ENOMEM if allocation fails.
+ */
+int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
+{
+ struct net_device *dev = vsi->netdev;
+ int ret;
+
+ if (dev)
+ netdev_lock(dev);
+
+ ret = ice_vsi_alloc_q_vectors_locked(vsi);
+
+ if (dev)
+ netdev_unlock(dev);
+
+ return ret;
+}
+
/**
* ice_vsi_map_rings_to_vectors - Map VSI rings to interrupt vectors
* @vsi: the VSI being configured
@@ -982,10 +1008,12 @@ void ice_vsi_map_rings_to_vectors(struct ice_vsi *vsi)
}
/**
- * ice_vsi_free_q_vectors - Free memory allocated for interrupt vectors
+ * ice_vsi_free_q_vectors_locked - Free memory allocated for interrupt vectors
* @vsi: the VSI having memory freed
+ *
+ * Should be called only under the netdev lock.
*/
-void ice_vsi_free_q_vectors(struct ice_vsi *vsi)
+void ice_vsi_free_q_vectors_locked(struct ice_vsi *vsi)
{
int v_idx;
@@ -995,6 +1023,23 @@ void ice_vsi_free_q_vectors(struct ice_vsi *vsi)
vsi->num_q_vectors = 0;
}
+/**
+ * ice_vsi_free_q_vectors - Free memory allocated for interrupt vectors
+ * @vsi: the VSI having memory freed
+ */
+void ice_vsi_free_q_vectors(struct ice_vsi *vsi)
+{
+ struct net_device *dev = vsi->netdev;
+
+ if (dev)
+ netdev_lock(dev);
+
+ ice_vsi_free_q_vectors_locked(vsi);
+
+ if (dev)
+ netdev_unlock(dev);
+}
+
/**
* ice_cfg_tstamp - Configure Tx time stamp queue
* @tx_ring: Tx ring to be configured with timestamping
diff --git a/drivers/net/ethernet/intel/ice/ice_dcb_lib.c b/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
index 16aa25535152..7d89c0acc5d8 100644
--- a/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
@@ -273,14 +273,13 @@ void ice_vsi_cfg_dcb_rings(struct ice_vsi *vsi)
* ice_dcb_ena_dis_vsi - disable certain VSIs for DCB config/reconfig
* @pf: pointer to the PF instance
* @ena: true to enable VSIs, false to disable
- * @locked: true if caller holds RTNL lock, false otherwise
*
* Before a new DCB configuration can be applied, VSIs of type PF, SWITCHDEV
* and CHNL need to be brought down. Following completion of DCB configuration
* the VSIs that were downed need to be brought up again. This helper function
* does both.
*/
-static void ice_dcb_ena_dis_vsi(struct ice_pf *pf, bool ena, bool locked)
+static void ice_dcb_ena_dis_vsi(struct ice_pf *pf, bool ena)
{
int i;
@@ -294,9 +293,9 @@ static void ice_dcb_ena_dis_vsi(struct ice_pf *pf, bool ena, bool locked)
case ICE_VSI_CHNL:
case ICE_VSI_PF:
if (ena)
- ice_ena_vsi(vsi, locked);
+ ice_ena_vsi(vsi);
else
- ice_dis_vsi(vsi, locked);
+ ice_dis_vsi(vsi);
break;
default:
continue;
@@ -416,7 +415,7 @@ int ice_pf_dcb_cfg(struct ice_pf *pf, struct ice_dcbx_cfg *new_cfg, bool locked)
rtnl_lock();
/* disable VSIs affected by DCB changes */
- ice_dcb_ena_dis_vsi(pf, false, true);
+ ice_dcb_ena_dis_vsi(pf, false);
memcpy(curr_cfg, new_cfg, sizeof(*curr_cfg));
memcpy(&curr_cfg->etsrec, &curr_cfg->etscfg, sizeof(curr_cfg->etsrec));
@@ -445,7 +444,7 @@ int ice_pf_dcb_cfg(struct ice_pf *pf, struct ice_dcbx_cfg *new_cfg, bool locked)
out:
/* enable previously downed VSIs */
- ice_dcb_ena_dis_vsi(pf, true, true);
+ ice_dcb_ena_dis_vsi(pf, true);
if (!locked)
rtnl_unlock();
free_cfg:
@@ -1107,7 +1106,7 @@ ice_dcb_process_lldp_set_mib_change(struct ice_pf *pf,
rtnl_lock();
/* disable VSIs affected by DCB changes */
- ice_dcb_ena_dis_vsi(pf, false, true);
+ ice_dcb_ena_dis_vsi(pf, false);
ret = ice_query_port_ets(pi, &buf, sizeof(buf), NULL);
if (ret) {
@@ -1119,7 +1118,7 @@ ice_dcb_process_lldp_set_mib_change(struct ice_pf *pf,
ice_pf_dcb_recfg(pf, false);
/* enable previously downed VSIs */
- ice_dcb_ena_dis_vsi(pf, true, true);
+ ice_dcb_ena_dis_vsi(pf, true);
unlock_rtnl:
rtnl_unlock();
out:
diff --git a/drivers/net/ethernet/intel/ice/ice_eswitch.c b/drivers/net/ethernet/intel/ice/ice_eswitch.c
index 2e4f0969035f..af0cc77fbf71 100644
--- a/drivers/net/ethernet/intel/ice/ice_eswitch.c
+++ b/drivers/net/ethernet/intel/ice/ice_eswitch.c
@@ -23,10 +23,16 @@ static int ice_eswitch_setup_env(struct ice_pf *pf)
struct net_device *netdev = uplink_vsi->netdev;
bool if_running = netif_running(netdev);
struct ice_vsi_vlan_ops *vlan_ops;
+ int ret;
+
+ if (if_running && !test_and_set_bit(ICE_VSI_DOWN, uplink_vsi->state)) {
+ netdev_lock(netdev);
+ ret = ice_down(uplink_vsi);
+ netdev_unlock(netdev);
- if (if_running && !test_and_set_bit(ICE_VSI_DOWN, uplink_vsi->state))
- if (ice_down(uplink_vsi))
+ if (ret)
return -ENODEV;
+ }
ice_remove_vsi_fltr(&pf->hw, uplink_vsi->idx);
ice_vsi_cfg_sw_lldp(uplink_vsi, true, false);
@@ -53,8 +59,14 @@ static int ice_eswitch_setup_env(struct ice_pf *pf)
if (ice_vsi_update_local_lb(uplink_vsi, true))
goto err_override_local_lb;
- if (if_running && ice_up(uplink_vsi))
- goto err_up;
+ if (if_running) {
+ netdev_lock(netdev);
+ ret = ice_up(uplink_vsi);
+ netdev_unlock(netdev);
+
+ if (ret)
+ goto err_up;
+ }
return 0;
@@ -74,8 +86,12 @@ static int ice_eswitch_setup_env(struct ice_pf *pf)
ice_fltr_add_mac_and_broadcast(uplink_vsi,
uplink_vsi->port_info->mac.perm_addr,
ICE_FWD_TO_VSI);
- if (if_running)
+
+ if (if_running) {
+ netdev_lock(netdev);
ice_up(uplink_vsi);
+ netdev_unlock(netdev);
+ }
return -ENODEV;
}
diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
index 837b71b7b2b7..6760aac609f2 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -2304,10 +2304,14 @@ static int ice_vsi_cfg_tc_lan(struct ice_pf *pf, struct ice_vsi *vsi)
}
/**
- * ice_vsi_cfg_def - configure default VSI based on the type
+ * ice_vsi_cfg_def_locked - configure default VSI based on the type
* @vsi: pointer to VSI
+ *
+ * Should be called only with the netdev lock taken.
+ *
+ * Return: 0 on success, -errno on failure.
*/
-static int ice_vsi_cfg_def(struct ice_vsi *vsi)
+static int ice_vsi_cfg_def_locked(struct ice_vsi *vsi)
{
struct device *dev = ice_pf_to_dev(vsi->back);
struct ice_pf *pf = vsi->back;
@@ -2350,7 +2354,7 @@ static int ice_vsi_cfg_def(struct ice_vsi *vsi)
case ICE_VSI_CTRL:
case ICE_VSI_SF:
case ICE_VSI_PF:
- ret = ice_vsi_alloc_q_vectors(vsi);
+ ret = ice_vsi_alloc_q_vectors_locked(vsi);
if (ret)
goto unroll_vsi_init;
@@ -2400,7 +2404,7 @@ static int ice_vsi_cfg_def(struct ice_vsi *vsi)
* creates a VSI and corresponding structures for bookkeeping
* purpose
*/
- ret = ice_vsi_alloc_q_vectors(vsi);
+ ret = ice_vsi_alloc_q_vectors_locked(vsi);
if (ret)
goto unroll_vsi_init;
@@ -2424,7 +2428,7 @@ static int ice_vsi_cfg_def(struct ice_vsi *vsi)
}
break;
case ICE_VSI_LB:
- ret = ice_vsi_alloc_q_vectors(vsi);
+ ret = ice_vsi_alloc_q_vectors_locked(vsi);
if (ret)
goto unroll_vsi_init;
@@ -2451,7 +2455,7 @@ static int ice_vsi_cfg_def(struct ice_vsi *vsi)
unroll_vector_base:
/* reclaim SW interrupts back to the common pool */
unroll_alloc_q_vector:
- ice_vsi_free_q_vectors(vsi);
+ ice_vsi_free_q_vectors_locked(vsi);
unroll_vsi_init:
ice_vsi_delete_from_hw(vsi);
unroll_get_qs:
@@ -2463,6 +2467,28 @@ static int ice_vsi_cfg_def(struct ice_vsi *vsi)
return ret;
}
+/**
+ * ice_vsi_cfg_def - configure default VSI based on the type
+ * @vsi: pointer to VSI
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+static int ice_vsi_cfg_def(struct ice_vsi *vsi)
+{
+ struct net_device *dev = vsi->netdev;
+ int ret;
+
+ if (dev)
+ netdev_lock(dev);
+
+ ret = ice_vsi_cfg_def_locked(vsi);
+
+ if (dev)
+ netdev_unlock(dev);
+
+ return ret;
+}
+
/**
* ice_vsi_cfg - configure a previously allocated VSI
* @vsi: pointer to VSI
@@ -2497,10 +2523,12 @@ int ice_vsi_cfg(struct ice_vsi *vsi)
}
/**
- * ice_vsi_decfg - remove all VSI configuration
+ * ice_vsi_decfg_locked - remove all VSI configuration
* @vsi: pointer to VSI
+ *
+ * Should be called only under the netdev lock.
*/
-void ice_vsi_decfg(struct ice_vsi *vsi)
+static void ice_vsi_decfg_locked(struct ice_vsi *vsi)
{
struct ice_pf *pf = vsi->back;
int err;
@@ -2518,7 +2546,7 @@ void ice_vsi_decfg(struct ice_vsi *vsi)
ice_destroy_xdp_rings(vsi, ICE_XDP_CFG_PART);
ice_vsi_clear_rings(vsi);
- ice_vsi_free_q_vectors(vsi);
+ ice_vsi_free_q_vectors_locked(vsi);
ice_vsi_put_qs(vsi);
ice_vsi_free_arrays(vsi);
@@ -2533,6 +2561,23 @@ void ice_vsi_decfg(struct ice_vsi *vsi)
vsi->agg_node->num_vsis--;
}
+/**
+ * ice_vsi_decfg - remove all VSI configuration
+ * @vsi: pointer to VSI
+ */
+void ice_vsi_decfg(struct ice_vsi *vsi)
+{
+ struct net_device *dev = vsi->netdev;
+
+ if (dev)
+ netdev_lock(dev);
+
+ ice_vsi_decfg_locked(vsi);
+
+ if (dev)
+ netdev_unlock(dev);
+}
+
/**
* ice_vsi_setup - Set up a VSI by a given type
* @pf: board private structure
@@ -2706,18 +2751,19 @@ void ice_vsi_close(struct ice_vsi *vsi)
if (!test_and_set_bit(ICE_VSI_DOWN, vsi->state))
ice_down(vsi);
- ice_vsi_clear_napi_queues(vsi);
+ ice_vsi_clear_napi_queues_locked(vsi);
ice_vsi_free_irq(vsi);
ice_vsi_free_tx_rings(vsi);
ice_vsi_free_rx_rings(vsi);
}
/**
- * ice_ena_vsi - resume a VSI
- * @vsi: the VSI being resume
- * @locked: is the rtnl_lock already held
+ * ice_ena_vsi_locked - resume a VSI (without taking the netdev lock)
+ * @vsi: VSI to resume
+ *
+ * Return: 0 on success, -errno on failure.
*/
-int ice_ena_vsi(struct ice_vsi *vsi, bool locked)
+int ice_ena_vsi_locked(struct ice_vsi *vsi)
{
int err = 0;
@@ -2728,15 +2774,8 @@ int ice_ena_vsi(struct ice_vsi *vsi, bool locked)
if (vsi->netdev && (vsi->type == ICE_VSI_PF ||
vsi->type == ICE_VSI_SF)) {
- if (netif_running(vsi->netdev)) {
- if (!locked)
- rtnl_lock();
-
+ if (netif_running(vsi->netdev))
err = ice_open_internal(vsi->netdev);
-
- if (!locked)
- rtnl_unlock();
- }
} else if (vsi->type == ICE_VSI_CTRL) {
err = ice_vsi_open_ctrl(vsi);
}
@@ -2745,11 +2784,34 @@ int ice_ena_vsi(struct ice_vsi *vsi, bool locked)
}
/**
- * ice_dis_vsi - pause a VSI
+ * ice_ena_vsi - resume a VSI
+ * @vsi: VSI to resume
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int ice_ena_vsi(struct ice_vsi *vsi)
+{
+ struct net_device *dev = vsi->netdev;
+ int ret;
+
+ if (dev)
+ netdev_lock(dev);
+
+ ret = ice_ena_vsi_locked(vsi);
+
+ if (dev)
+ netdev_unlock(dev);
+
+ return ret;
+}
+
+/**
+ * ice_dis_vsi_locked - pause a VSI
* @vsi: the VSI being paused
- * @locked: is the rtnl_lock already held
+ *
+ * The caller must always hold the netdev lock.
*/
-void ice_dis_vsi(struct ice_vsi *vsi, bool locked)
+void ice_dis_vsi_locked(struct ice_vsi *vsi)
{
bool already_down = test_bit(ICE_VSI_DOWN, vsi->state);
@@ -2758,14 +2820,9 @@ void ice_dis_vsi(struct ice_vsi *vsi, bool locked)
if (vsi->netdev && (vsi->type == ICE_VSI_PF ||
vsi->type == ICE_VSI_SF)) {
if (netif_running(vsi->netdev)) {
- if (!locked)
- rtnl_lock();
already_down = test_bit(ICE_VSI_DOWN, vsi->state);
if (!already_down)
ice_vsi_close(vsi);
-
- if (!locked)
- rtnl_unlock();
} else if (!already_down) {
ice_vsi_close(vsi);
}
@@ -2775,12 +2832,30 @@ void ice_dis_vsi(struct ice_vsi *vsi, bool locked)
}
/**
- * ice_vsi_set_napi_queues - associate netdev queues with napi
+ * ice_dis_vsi - pause a VSI
+ * @vsi: the VSI being paused
+ */
+void ice_dis_vsi(struct ice_vsi *vsi)
+{
+ struct net_device *dev = vsi->netdev;
+
+ if (dev)
+ netdev_lock(dev);
+
+ ice_dis_vsi_locked(vsi);
+
+ if (dev)
+ netdev_unlock(dev);
+}
+
+/**
+ * ice_vsi_set_napi_queues_locked - associate netdev queues with napi
* @vsi: VSI pointer
*
* Associate queue[s] with napi for all vectors.
+ * Must be called only with the netdev_lock taken.
*/
-void ice_vsi_set_napi_queues(struct ice_vsi *vsi)
+void ice_vsi_set_napi_queues_locked(struct ice_vsi *vsi)
{
struct net_device *netdev = vsi->netdev;
int q_idx, v_idx;
@@ -2788,7 +2863,6 @@ void ice_vsi_set_napi_queues(struct ice_vsi *vsi)
if (!netdev)
return;
- ASSERT_RTNL();
ice_for_each_rxq(vsi, q_idx)
if (vsi->rx_rings[q_idx] && vsi->rx_rings[q_idx]->q_vector)
netif_queue_set_napi(netdev, q_idx, NETDEV_QUEUE_TYPE_RX,
@@ -2802,17 +2876,37 @@ void ice_vsi_set_napi_queues(struct ice_vsi *vsi)
ice_for_each_q_vector(vsi, v_idx) {
struct ice_q_vector *q_vector = vsi->q_vectors[v_idx];
- netif_napi_set_irq(&q_vector->napi, q_vector->irq.virq);
+ netif_napi_set_irq_locked(&q_vector->napi, q_vector->irq.virq);
}
}
/**
- * ice_vsi_clear_napi_queues - dissociate netdev queues from napi
+ * ice_vsi_set_napi_queues - associate VSI queues with NAPIs
* @vsi: VSI pointer
*
+ * Version of ice_vsi_set_napi_queues_locked() that takes the netdev_lock,
+ * to use it outside of the net_device_ops context.
+ */
+void ice_vsi_set_napi_queues(struct ice_vsi *vsi)
+{
+ struct net_device *netdev = vsi->netdev;
+
+ if (!netdev)
+ return;
+
+ netdev_lock(netdev);
+ ice_vsi_set_napi_queues_locked(vsi);
+ netdev_unlock(netdev);
+}
+
+/**
+ * ice_vsi_clear_napi_queues_locked - dissociate netdev queues from napi
+ * @vsi: VSI to process
+ *
* Clear the association between all VSI queues queue[s] and napi.
+ * Must be called only with the netdev_lock taken.
*/
-void ice_vsi_clear_napi_queues(struct ice_vsi *vsi)
+void ice_vsi_clear_napi_queues_locked(struct ice_vsi *vsi)
{
struct net_device *netdev = vsi->netdev;
int q_idx, v_idx;
@@ -2820,12 +2914,11 @@ void ice_vsi_clear_napi_queues(struct ice_vsi *vsi)
if (!netdev)
return;
- ASSERT_RTNL();
/* Clear the NAPI's interrupt number */
ice_for_each_q_vector(vsi, v_idx) {
struct ice_q_vector *q_vector = vsi->q_vectors[v_idx];
- netif_napi_set_irq(&q_vector->napi, -1);
+ netif_napi_set_irq_locked(&q_vector->napi, -1);
}
ice_for_each_txq(vsi, q_idx)
@@ -2835,6 +2928,25 @@ void ice_vsi_clear_napi_queues(struct ice_vsi *vsi)
netif_queue_set_napi(netdev, q_idx, NETDEV_QUEUE_TYPE_RX, NULL);
}
+/**
+ * ice_vsi_clear_napi_queues - dissociate VSI queues from NAPIs
+ * @vsi: VSI to process
+ *
+ * Version of ice_vsi_clear_napi_queues_locked() that takes the netdev lock,
+ * to use it outside of the net_device_ops context.
+ */
+void ice_vsi_clear_napi_queues(struct ice_vsi *vsi)
+{
+ struct net_device *netdev = vsi->netdev;
+
+ if (!netdev)
+ return;
+
+ netdev_lock(netdev);
+ ice_vsi_clear_napi_queues_locked(vsi);
+ netdev_unlock(netdev);
+}
+
/**
* ice_napi_add - register NAPI handler for the VSI
* @vsi: VSI for which NAPI handler is to be registered
@@ -3072,16 +3184,17 @@ ice_vsi_realloc_stat_arrays(struct ice_vsi *vsi)
}
/**
- * ice_vsi_rebuild - Rebuild VSI after reset
+ * ice_vsi_rebuild_locked - Rebuild VSI after reset
* @vsi: VSI to be rebuild
* @vsi_flags: flags used for VSI rebuild flow
*
* Set vsi_flags to ICE_VSI_FLAG_INIT to initialize a new VSI, or
* ICE_VSI_FLAG_NO_INIT to rebuild an existing VSI in hardware.
+ * Should be called only under the netdev lock.
*
- * Returns 0 on success and negative value on failure
+ * Return: 0 on success, -errno on failure.
*/
-int ice_vsi_rebuild(struct ice_vsi *vsi, u32 vsi_flags)
+int ice_vsi_rebuild_locked(struct ice_vsi *vsi, u32 vsi_flags)
{
struct ice_coalesce_stored *coalesce;
int prev_num_q_vectors;
@@ -3102,8 +3215,8 @@ int ice_vsi_rebuild(struct ice_vsi *vsi, u32 vsi_flags)
if (ret)
goto unlock;
- ice_vsi_decfg(vsi);
- ret = ice_vsi_cfg_def(vsi);
+ ice_vsi_decfg_locked(vsi);
+ ret = ice_vsi_cfg_def_locked(vsi);
if (ret)
goto unlock;
@@ -3133,12 +3246,38 @@ int ice_vsi_rebuild(struct ice_vsi *vsi, u32 vsi_flags)
kfree(coalesce);
decfg:
if (ret)
- ice_vsi_decfg(vsi);
+ ice_vsi_decfg_locked(vsi);
unlock:
mutex_unlock(&vsi->xdp_state_lock);
return ret;
}
+/**
+ * ice_vsi_rebuild - Rebuild VSI after reset
+ * @vsi: VSI to be rebuild
+ * @vsi_flags: flags used for VSI rebuild flow
+ *
+ * Set vsi_flags to ICE_VSI_FLAG_INIT to initialize a new VSI, or
+ * ICE_VSI_FLAG_NO_INIT to rebuild an existing VSI in hardware.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int ice_vsi_rebuild(struct ice_vsi *vsi, u32 vsi_flags)
+{
+ struct net_device *dev = vsi->netdev;
+ int ret;
+
+ if (dev)
+ netdev_lock(dev);
+
+ ret = ice_vsi_rebuild_locked(vsi, vsi_flags);
+
+ if (dev)
+ netdev_unlock(dev);
+
+ return ret;
+}
+
/**
* ice_is_reset_in_progress - check for a reset in progress
* @state: PF state field
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 1d1947a7fe11..50975fe7cab7 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -507,16 +507,15 @@ static void ice_sync_fltr_subtask(struct ice_pf *pf)
/**
* ice_pf_dis_all_vsi - Pause all VSIs on a PF
* @pf: the PF
- * @locked: is the rtnl_lock already held
*/
-static void ice_pf_dis_all_vsi(struct ice_pf *pf, bool locked)
+static void ice_pf_dis_all_vsi(struct ice_pf *pf)
{
int node;
int v;
ice_for_each_vsi(pf, v)
if (pf->vsi[v])
- ice_dis_vsi(pf->vsi[v], locked);
+ ice_dis_vsi(pf->vsi[v]);
for (node = 0; node < ICE_MAX_PF_AGG_NODES; node++)
pf->pf_agg_node[node].num_vsis = 0;
@@ -605,7 +604,7 @@ ice_prepare_for_reset(struct ice_pf *pf, enum ice_reset_req reset_type)
ice_clear_hw_tbls(hw);
/* disable the VSIs and their queues that are not already DOWN */
set_bit(ICE_VSI_REBUILD_PENDING, ice_get_main_vsi(pf)->state);
- ice_pf_dis_all_vsi(pf, false);
+ ice_pf_dis_all_vsi(pf);
if (test_bit(ICE_FLAG_PTP_SUPPORTED, pf->flags))
ice_ptp_prepare_for_reset(pf, reset_type);
@@ -2943,9 +2942,9 @@ ice_xdp_setup_prog(struct ice_vsi *vsi, struct bpf_prog *prog,
goto resume_if;
}
}
- xdp_features_set_redirect_target(vsi->netdev, true);
+ xdp_features_set_redirect_target_locked(vsi->netdev, true);
} else if (ice_is_xdp_ena_vsi(vsi) && !prog) {
- xdp_features_clear_redirect_target(vsi->netdev);
+ xdp_features_clear_redirect_target_locked(vsi->netdev);
xdp_ring_err = ice_destroy_xdp_rings(vsi, ICE_XDP_CFG_FULL);
if (xdp_ring_err)
NL_SET_ERR_MSG_MOD(extack, "Freeing XDP Tx resources failed");
@@ -3447,11 +3446,13 @@ static void ice_set_ops(struct ice_vsi *vsi)
if (ice_is_safe_mode(pf)) {
netdev->netdev_ops = &ice_netdev_safe_mode_ops;
+ netdev->request_ops_lock = true;
ice_set_ethtool_safe_mode_ops(netdev);
return;
}
netdev->netdev_ops = &ice_netdev_ops;
+ netdev->request_ops_lock = true;
netdev->udp_tunnel_nic_info = &pf->hw.udp_tunnel_nic;
netdev->xdp_metadata_ops = &ice_xdp_md_ops;
ice_set_ethtool_ops(netdev);
@@ -4058,6 +4059,7 @@ bool ice_is_wol_supported(struct ice_hw *hw)
* @locked: is adev device_lock held
*
* Only change the number of queues if new_tx, or new_rx is non-0.
+ * Note that it should be called only with the netdev lock taken.
*
* Returns 0 on success.
*/
@@ -4083,7 +4085,7 @@ int ice_vsi_recfg_qs(struct ice_vsi *vsi, int new_rx, int new_tx, bool locked)
/* set for the next time the netdev is started */
if (!netif_running(vsi->netdev)) {
- err = ice_vsi_rebuild(vsi, ICE_VSI_FLAG_NO_INIT);
+ err = ice_vsi_rebuild_locked(vsi, ICE_VSI_FLAG_NO_INIT);
if (err)
goto rebuild_err;
dev_dbg(ice_pf_to_dev(pf), "Link is down, queue count change happens when link is brought up\n");
@@ -4091,7 +4093,7 @@ int ice_vsi_recfg_qs(struct ice_vsi *vsi, int new_rx, int new_tx, bool locked)
}
ice_vsi_close(vsi);
- err = ice_vsi_rebuild(vsi, ICE_VSI_FLAG_NO_INIT);
+ err = ice_vsi_rebuild_locked(vsi, ICE_VSI_FLAG_NO_INIT);
if (err)
goto rebuild_err;
@@ -5437,7 +5439,7 @@ static void ice_prepare_for_shutdown(struct ice_pf *pf)
dev_dbg(ice_pf_to_dev(pf), "Tearing down internal switch for shutdown\n");
/* disable the VSIs and their queues that are not already DOWN */
- ice_pf_dis_all_vsi(pf, false);
+ ice_pf_dis_all_vsi(pf);
ice_for_each_vsi(pf, v)
if (pf->vsi[v])
@@ -5473,16 +5475,17 @@ static int ice_reinit_interrupt_scheme(struct ice_pf *pf)
/* Remap vectors and rings, after successful re-init interrupts */
ice_for_each_vsi(pf, v) {
- if (!pf->vsi[v])
+ struct ice_vsi *vsi = pf->vsi[v];
+
+ if (!vsi)
continue;
- ret = ice_vsi_alloc_q_vectors(pf->vsi[v]);
+ ret = ice_vsi_alloc_q_vectors(vsi);
if (ret)
goto err_reinit;
- ice_vsi_map_rings_to_vectors(pf->vsi[v]);
- rtnl_lock();
- ice_vsi_set_napi_queues(pf->vsi[v]);
- rtnl_unlock();
+
+ ice_vsi_map_rings_to_vectors(vsi);
+ ice_vsi_set_napi_queues(vsi);
}
ret = ice_req_irq_msix_misc(pf);
@@ -5495,13 +5498,15 @@ static int ice_reinit_interrupt_scheme(struct ice_pf *pf)
return 0;
err_reinit:
- while (v--)
- if (pf->vsi[v]) {
- rtnl_lock();
- ice_vsi_clear_napi_queues(pf->vsi[v]);
- rtnl_unlock();
- ice_vsi_free_q_vectors(pf->vsi[v]);
- }
+ while (v--) {
+ struct ice_vsi *vsi = pf->vsi[v];
+
+ if (!vsi)
+ continue;
+
+ ice_vsi_clear_napi_queues(vsi);
+ ice_vsi_free_q_vectors(vsi);
+ }
return ret;
}
@@ -5564,14 +5569,17 @@ static int ice_suspend(struct device *dev)
* to CPU0.
*/
ice_free_irq_msix_misc(pf);
+
ice_for_each_vsi(pf, v) {
- if (!pf->vsi[v])
+ struct ice_vsi *vsi = pf->vsi[v];
+
+ if (!vsi)
continue;
- rtnl_lock();
- ice_vsi_clear_napi_queues(pf->vsi[v]);
- rtnl_unlock();
- ice_vsi_free_q_vectors(pf->vsi[v]);
+
+ ice_vsi_clear_napi_queues(vsi);
+ ice_vsi_free_q_vectors(vsi);
}
+
ice_clear_interrupt_scheme(pf);
pci_save_state(pdev);
@@ -6699,7 +6707,7 @@ static void ice_napi_enable_all(struct ice_vsi *vsi)
ice_init_moderation(q_vector);
if (q_vector->rx.rx_ring || q_vector->tx.tx_ring)
- napi_enable(&q_vector->napi);
+ napi_enable_locked(&q_vector->napi);
}
}
@@ -7198,7 +7206,7 @@ static void ice_napi_disable_all(struct ice_vsi *vsi)
struct ice_q_vector *q_vector = vsi->q_vectors[q_idx];
if (q_vector->rx.rx_ring || q_vector->tx.tx_ring)
- napi_disable(&q_vector->napi);
+ napi_disable_locked(&q_vector->napi);
cancel_work_sync(&q_vector->tx.dim.work);
cancel_work_sync(&q_vector->rx.dim.work);
@@ -7498,7 +7506,7 @@ int ice_vsi_open(struct ice_vsi *vsi)
if (err)
goto err_set_qs;
- ice_vsi_set_napi_queues(vsi);
+ ice_vsi_set_napi_queues_locked(vsi);
}
err = ice_up_complete(vsi);
@@ -7584,7 +7592,7 @@ static int ice_vsi_rebuild_by_type(struct ice_pf *pf, enum ice_vsi_type type)
vsi->vsi_num = ice_get_hw_vsi_num(&pf->hw, vsi->idx);
/* enable the VSI */
- err = ice_ena_vsi(vsi, false);
+ err = ice_ena_vsi(vsi);
if (err) {
dev_err(dev, "enable VSI failed, err %d, VSI index %d, type %s\n",
err, vsi->idx, ice_vsi_type_str(type));
@@ -9190,7 +9198,7 @@ static int ice_setup_tc_mqprio_qdisc(struct net_device *netdev, void *type_data)
return 0;
/* Pause VSI queues */
- ice_dis_vsi(vsi, true);
+ ice_dis_vsi_locked(vsi);
if (!hw && !test_bit(ICE_FLAG_TC_MQPRIO, pf->flags))
ice_remove_q_channels(vsi, true);
@@ -9229,14 +9237,14 @@ static int ice_setup_tc_mqprio_qdisc(struct net_device *netdev, void *type_data)
cur_rxq = vsi->num_rxq;
/* proceed with rebuild main VSI using correct number of queues */
- ret = ice_vsi_rebuild(vsi, ICE_VSI_FLAG_NO_INIT);
+ ret = ice_vsi_rebuild_locked(vsi, ICE_VSI_FLAG_NO_INIT);
if (ret) {
/* fallback to current number of queues */
dev_info(dev, "Rebuild failed with new queues, try with current number of queues\n");
vsi->req_txq = cur_txq;
vsi->req_rxq = cur_rxq;
clear_bit(ICE_RESET_FAILED, pf->state);
- if (ice_vsi_rebuild(vsi, ICE_VSI_FLAG_NO_INIT)) {
+ if (ice_vsi_rebuild_locked(vsi, ICE_VSI_FLAG_NO_INIT)) {
dev_err(dev, "Rebuild of main VSI failed again\n");
return ret;
}
@@ -9292,7 +9300,7 @@ static int ice_setup_tc_mqprio_qdisc(struct net_device *netdev, void *type_data)
vsi->all_enatc = 0;
}
/* resume VSI */
- ice_ena_vsi(vsi, true);
+ ice_ena_vsi_locked(vsi);
return ret;
}
diff --git a/drivers/net/ethernet/intel/ice/ice_sf_eth.c b/drivers/net/ethernet/intel/ice/ice_sf_eth.c
index a730aa368c92..cd6ba53a873b 100644
--- a/drivers/net/ethernet/intel/ice/ice_sf_eth.c
+++ b/drivers/net/ethernet/intel/ice/ice_sf_eth.c
@@ -58,6 +58,7 @@ static int ice_sf_cfg_netdev(struct ice_dynamic_port *dyn_port,
eth_hw_addr_set(netdev, dyn_port->hw_addr);
ether_addr_copy(netdev->perm_addr, dyn_port->hw_addr);
netdev->netdev_ops = &ice_sf_netdev_ops;
+ netdev->request_ops_lock = true;
SET_NETDEV_DEVLINK_PORT(netdev, devlink_port);
err = register_netdev(netdev);
@@ -183,7 +184,9 @@ static void ice_sf_dev_remove(struct auxiliary_device *adev)
devlink = priv_to_devlink(sf_dev->priv);
devl_lock(devlink);
+ netdev_lock(vsi->netdev);
ice_vsi_close(vsi);
+ netdev_unlock(vsi->netdev);
ice_sf_decfg_netdev(vsi);
ice_devlink_destroy_sf_dev_port(sf_dev);
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 0643017541c3..be0cb548487a 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -33,9 +33,9 @@ ice_qvec_toggle_napi(struct ice_vsi *vsi, struct ice_q_vector *q_vector,
return;
if (enable)
- napi_enable(&q_vector->napi);
+ napi_enable_locked(&q_vector->napi);
else
- napi_disable(&q_vector->napi);
+ napi_disable_locked(&q_vector->napi);
}
/**
--
2.54.0
* [Intel-wired-lan] [PATCH iwl-next v5 4/5] ice: implement Rx queue management ops
From: Alexander Lobakin @ 2026-05-05 15:29 UTC (permalink / raw)
To: intel-wired-lan
Cc: Alexander Lobakin, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kohei Enju, Jacob Keller, Aleksandr Loktionov,
nxne.cnse.osdt.itp.upstreaming, netdev, linux-kernel
Now ice is ready to get queue_mgmt_ops support. It already has an
API to disable/reconfigure/enable one particular queue (for XSk).
Reuse as much of that code as possible to implement the Rx queue
management callbacks and vice versa -- ice_queue_mem_{alloc,free}()
can be reused during ifup/ifdown to avoid code duplication.
With this, ice passes the io_uring zcrx selftests, meaning the Rx
part of netmem/MP support is done.
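For context, a rough sketch of how the core queue API could drive
these callbacks on a single-queue restart. This is a simplification
based only on the callback signatures used here (rollback of a
failed start is omitted); see net/core/netdev_rx_queue.c for the
authoritative flow:

#include <net/netdev_queues.h>

static int demo_rx_queue_restart(struct net_device *dev,
				 struct netdev_queue_config *qcfg,
				 int idx)
{
	const struct netdev_queue_mgmt_ops *ops = dev->queue_mgmt_ops;
	void *new_mem, *old_mem;
	int err = -ENOMEM;

	new_mem = kvzalloc(ops->ndo_queue_mem_size, GFP_KERNEL);
	old_mem = kvzalloc(ops->ndo_queue_mem_size, GFP_KERNEL);
	if (!new_mem || !old_mem)
		goto free;

	/* prepare fresh buffers, then swap them into the live queue */
	err = ops->ndo_queue_mem_alloc(dev, qcfg, new_mem, idx);
	if (err)
		goto free;

	err = ops->ndo_queue_stop(dev, old_mem, idx);
	if (!err)
		err = ops->ndo_queue_start(dev, qcfg, new_mem, idx);

	/* release the resources no longer in use */
	ops->ndo_queue_mem_free(dev, err ? new_mem : old_mem);

free:
	kvfree(old_mem);
	kvfree(new_mem);
	return err;
}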
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
drivers/net/ethernet/intel/ice/ice_lib.h | 5 +
drivers/net/ethernet/intel/ice/ice_txrx.h | 2 +
drivers/net/ethernet/intel/ice/ice_base.c | 182 +++++++++++++++-----
drivers/net/ethernet/intel/ice/ice_main.c | 2 +-
drivers/net/ethernet/intel/ice/ice_sf_eth.c | 2 +-
drivers/net/ethernet/intel/ice/ice_txrx.c | 26 ++-
6 files changed, 165 insertions(+), 54 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_lib.h b/drivers/net/ethernet/intel/ice/ice_lib.h
index 476fa54ec4e8..e1e85976e523 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.h
+++ b/drivers/net/ethernet/intel/ice/ice_lib.h
@@ -4,6 +4,8 @@
#ifndef _ICE_LIB_H_
#define _ICE_LIB_H_
+#include <net/netdev_queues.h>
+
#include "ice.h"
#include "ice_vlan.h"
@@ -135,4 +137,7 @@ void ice_clear_feature_support(struct ice_pf *pf, enum ice_feature f);
void ice_init_feature_support(struct ice_pf *pf);
bool ice_vsi_is_rx_queue_active(struct ice_vsi *vsi);
void ice_vsi_update_l2tsel(struct ice_vsi *vsi, enum ice_l2tsel l2tsel);
+
+extern const struct netdev_queue_mgmt_ops ice_queue_mgmt_ops;
+
#endif /* !_ICE_LIB_H_ */
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index 5e517f219379..7c22b2015067 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -462,6 +462,8 @@ u16
ice_select_queue(struct net_device *dev, struct sk_buff *skb,
struct net_device *sb_dev);
void ice_clean_tx_ring(struct ice_tx_ring *tx_ring);
+void ice_queue_mem_free(struct net_device *dev, void *per_queue_mem);
+void ice_zero_rx_ring(struct ice_rx_ring *rx_ring);
void ice_clean_rx_ring(struct ice_rx_ring *rx_ring);
int ice_setup_tx_ring(struct ice_tx_ring *tx_ring);
int ice_setup_rx_ring(struct ice_rx_ring *rx_ring);
diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
index 1add82d894bb..4e0b8895c303 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.c
+++ b/drivers/net/ethernet/intel/ice/ice_base.c
@@ -653,6 +653,43 @@ static int ice_rxq_pp_create(struct ice_rx_ring *rq)
return err;
}
+static int ice_queue_mem_alloc(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
+ void *per_queue_mem, int idx)
+{
+ const struct ice_netdev_priv *priv = netdev_priv(dev);
+ const struct ice_rx_ring *real = priv->vsi->rx_rings[idx];
+ struct ice_rx_ring *new = per_queue_mem;
+ int ret;
+
+ new->count = real->count;
+ new->netdev = real->netdev;
+ new->q_index = real->q_index;
+ new->q_vector = real->q_vector;
+ new->vsi = real->vsi;
+
+ ret = ice_rxq_pp_create(new);
+ if (ret)
+ return ret;
+
+ if (!netif_running(dev))
+ return 0;
+
+ ret = __xdp_rxq_info_reg(&new->xdp_rxq, new->netdev, new->q_index,
+ new->q_vector->napi.napi_id, new->truesize);
+ if (ret)
+ goto err_destroy_fq;
+
+ xdp_rxq_info_attach_page_pool(&new->xdp_rxq, new->pp);
+
+ return 0;
+
+err_destroy_fq:
+ ice_rxq_pp_destroy(new);
+
+ return ret;
+}
+
/**
* ice_vsi_cfg_rxq - Configure an Rx queue
* @ring: the ring being configured
@@ -691,19 +728,10 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
dev_info(dev, "Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring %d\n",
ring->q_index);
} else {
- err = ice_rxq_pp_create(ring);
+ err = ice_queue_mem_alloc(ring->netdev, NULL, ring,
+ ring->q_index);
if (err)
return err;
-
- err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
- ring->q_index,
- ring->q_vector->napi.napi_id,
- ring->truesize);
- if (err)
- goto err_destroy_fq;
-
- xdp_rxq_info_attach_page_pool(&ring->xdp_rxq,
- ring->pp);
}
}
@@ -712,7 +740,7 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
if (err) {
dev_err(dev, "ice_setup_rx_ctx failed for RxQ %d, err %d\n",
ring->q_index, err);
- goto err_destroy_fq;
+ goto err_clean_rq;
}
if (ring->xsk_pool) {
@@ -743,12 +771,12 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
err = ice_alloc_rx_bufs(ring, num_bufs);
if (err)
- goto err_destroy_fq;
+ goto err_clean_rq;
return 0;
-err_destroy_fq:
- ice_rxq_pp_destroy(ring);
+err_clean_rq:
+ ice_clean_rx_ring(ring);
return err;
}
@@ -1460,27 +1488,7 @@ static void ice_qp_reset_stats(struct ice_vsi *vsi, u16 q_idx)
sizeof(vsi->xdp_rings[q_idx]->ring_stats->stats));
}
-/**
- * ice_qp_clean_rings - Cleans all the rings of a given index
- * @vsi: VSI that contains rings of interest
- * @q_idx: ring index in array
- */
-static void ice_qp_clean_rings(struct ice_vsi *vsi, u16 q_idx)
-{
- ice_clean_tx_ring(vsi->tx_rings[q_idx]);
- if (vsi->xdp_rings)
- ice_clean_tx_ring(vsi->xdp_rings[q_idx]);
- ice_clean_rx_ring(vsi->rx_rings[q_idx]);
-}
-
-/**
- * ice_qp_dis - Disables a queue pair
- * @vsi: VSI of interest
- * @q_idx: ring index in array
- *
- * Returns 0 on success, negative on failure.
- */
-int ice_qp_dis(struct ice_vsi *vsi, u16 q_idx)
+static int __ice_qp_dis(struct ice_vsi *vsi, u16 q_idx)
{
struct ice_txq_meta txq_meta = { };
struct ice_q_vector *q_vector;
@@ -1519,23 +1527,35 @@ int ice_qp_dis(struct ice_vsi *vsi, u16 q_idx)
}
ice_vsi_ctrl_one_rx_ring(vsi, false, q_idx, false);
- ice_qp_clean_rings(vsi, q_idx);
ice_qp_reset_stats(vsi, q_idx);
+ ice_clean_tx_ring(vsi->tx_rings[q_idx]);
+ if (vsi->xdp_rings)
+ ice_clean_tx_ring(vsi->xdp_rings[q_idx]);
+
return fail;
}
/**
- * ice_qp_ena - Enables a queue pair
+ * ice_qp_dis - Disables a queue pair
* @vsi: VSI of interest
* @q_idx: ring index in array
*
* Returns 0 on success, negative on failure.
*/
-int ice_qp_ena(struct ice_vsi *vsi, u16 q_idx)
+int ice_qp_dis(struct ice_vsi *vsi, u16 q_idx)
+{
+ int ret;
+
+ ret = __ice_qp_dis(vsi, q_idx);
+ ice_clean_rx_ring(vsi->rx_rings[q_idx]);
+
+ return ret;
+}
+
+static int __ice_qp_ena(struct ice_vsi *vsi, u16 q_idx, int fail)
{
struct ice_q_vector *q_vector;
- int fail = 0;
bool link_up;
int err;
@@ -1553,10 +1573,6 @@ int ice_qp_ena(struct ice_vsi *vsi, u16 q_idx)
ice_tx_xsk_pool(vsi, q_idx);
}
- err = ice_vsi_cfg_single_rxq(vsi, q_idx);
- if (!fail)
- fail = err;
-
q_vector = vsi->rx_rings[q_idx]->q_vector;
ice_qvec_cfg_msix(vsi, q_vector, q_idx);
@@ -1577,3 +1593,81 @@ int ice_qp_ena(struct ice_vsi *vsi, u16 q_idx)
return fail;
}
+
+/**
+ * ice_qp_ena - Enables a queue pair
+ * @vsi: VSI of interest
+ * @q_idx: ring index in array
+ *
+ * Returns 0 on success, negative on failure.
+ */
+int ice_qp_ena(struct ice_vsi *vsi, u16 q_idx)
+{
+ return __ice_qp_ena(vsi, q_idx, ice_vsi_cfg_single_rxq(vsi, q_idx));
+}
+
+static int ice_queue_start(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
+ void *per_queue_mem, int idx)
+{
+ const struct ice_netdev_priv *priv = netdev_priv(dev);
+ struct ice_rx_ring *real = priv->vsi->rx_rings[idx];
+ struct ice_rx_ring *new = per_queue_mem;
+ struct napi_struct *napi;
+ int ret;
+
+ real->pp = new->pp;
+ real->rx_fqes = new->rx_fqes;
+ real->hdr_fqes = new->hdr_fqes;
+ real->hdr_pp = new->hdr_pp;
+
+ real->hdr_truesize = new->hdr_truesize;
+ real->truesize = new->truesize;
+ real->rx_hdr_len = new->rx_hdr_len;
+ real->rx_buf_len = new->rx_buf_len;
+
+ memcpy(&real->xdp_rxq, &new->xdp_rxq, sizeof(new->xdp_rxq));
+
+ ret = ice_setup_rx_ctx(real);
+ if (ret)
+ return ret;
+
+ napi = &real->q_vector->napi;
+
+ page_pool_enable_direct_recycling(real->pp, napi);
+ if (real->hdr_pp)
+ page_pool_enable_direct_recycling(real->hdr_pp, napi);
+
+ ret = ice_alloc_rx_bufs(real, ICE_DESC_UNUSED(real));
+
+ return __ice_qp_ena(priv->vsi, idx, ret);
+}
+
+static int ice_queue_stop(struct net_device *dev, void *per_queue_mem,
+ int idx)
+{
+ const struct ice_netdev_priv *priv = netdev_priv(dev);
+ struct ice_rx_ring *real = priv->vsi->rx_rings[idx];
+ int ret;
+
+ ret = __ice_qp_dis(priv->vsi, idx);
+ if (ret)
+ return ret;
+
+ page_pool_disable_direct_recycling(real->pp);
+ if (real->hdr_pp)
+ page_pool_disable_direct_recycling(real->hdr_pp);
+
+ ice_zero_rx_ring(real);
+ memcpy(per_queue_mem, real, sizeof(*real));
+
+ return 0;
+}
+
+const struct netdev_queue_mgmt_ops ice_queue_mgmt_ops = {
+ .ndo_queue_mem_alloc = ice_queue_mem_alloc,
+ .ndo_queue_mem_free = ice_queue_mem_free,
+ .ndo_queue_mem_size = sizeof(struct ice_rx_ring),
+ .ndo_queue_start = ice_queue_start,
+ .ndo_queue_stop = ice_queue_stop,
+};
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 50975fe7cab7..a88f9e3c0077 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -3452,7 +3452,7 @@ static void ice_set_ops(struct ice_vsi *vsi)
}
netdev->netdev_ops = &ice_netdev_ops;
- netdev->request_ops_lock = true;
+ netdev->queue_mgmt_ops = &ice_queue_mgmt_ops;
netdev->udp_tunnel_nic_info = &pf->hw.udp_tunnel_nic;
netdev->xdp_metadata_ops = &ice_xdp_md_ops;
ice_set_ethtool_ops(netdev);
diff --git a/drivers/net/ethernet/intel/ice/ice_sf_eth.c b/drivers/net/ethernet/intel/ice/ice_sf_eth.c
index cd6ba53a873b..8d1541712abc 100644
--- a/drivers/net/ethernet/intel/ice/ice_sf_eth.c
+++ b/drivers/net/ethernet/intel/ice/ice_sf_eth.c
@@ -58,7 +58,7 @@ static int ice_sf_cfg_netdev(struct ice_dynamic_port *dyn_port,
eth_hw_addr_set(netdev, dyn_port->hw_addr);
ether_addr_copy(netdev->perm_addr, dyn_port->hw_addr);
netdev->netdev_ops = &ice_sf_netdev_ops;
- netdev->request_ops_lock = true;
+ netdev->queue_mgmt_ops = &ice_queue_mgmt_ops;
SET_NETDEV_DEVLINK_PORT(netdev, devlink_port);
err = register_netdev(netdev);
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 4ca1a0602307..43b467076027 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -531,17 +531,13 @@ void ice_rxq_pp_destroy(struct ice_rx_ring *rq)
rq->hdr_pp = NULL;
}
-/**
- * ice_clean_rx_ring - Free Rx buffers
- * @rx_ring: ring to be cleaned
- */
-void ice_clean_rx_ring(struct ice_rx_ring *rx_ring)
+void ice_queue_mem_free(struct net_device *dev, void *per_queue_mem)
{
- u32 size;
+ struct ice_rx_ring *rx_ring = per_queue_mem;
if (rx_ring->xsk_pool) {
ice_xsk_clean_rx_ring(rx_ring);
- goto rx_skip_free;
+ return;
}
/* ring already cleared, nothing to do */
@@ -570,8 +566,12 @@ void ice_clean_rx_ring(struct ice_rx_ring *rx_ring)
}
ice_rxq_pp_destroy(rx_ring);
+}
+
+void ice_zero_rx_ring(struct ice_rx_ring *rx_ring)
+{
+ size_t size;
-rx_skip_free:
/* Zero out the descriptor ring */
size = ALIGN(rx_ring->count * sizeof(union ice_32byte_rx_desc),
PAGE_SIZE);
@@ -581,6 +581,16 @@ void ice_clean_rx_ring(struct ice_rx_ring *rx_ring)
rx_ring->next_to_use = 0;
}
+/**
+ * ice_clean_rx_ring - Free Rx buffers
+ * @rx_ring: ring to be cleaned
+ */
+void ice_clean_rx_ring(struct ice_rx_ring *rx_ring)
+{
+ ice_queue_mem_free(rx_ring->netdev, rx_ring);
+ ice_zero_rx_ring(rx_ring);
+}
+
/**
* ice_free_rx_ring - Free Rx resources
* @rx_ring: ring to clean the resources from
--
2.54.0
* [Intel-wired-lan] [PATCH iwl-next v5 5/5] ice: add support for transmitting unreadable frags
2026-05-05 15:29 [Intel-wired-lan] [PATCH iwl-next v5 0/5] ice: add support for devmem/io_uring Rx and Tx Alexander Lobakin
` (3 preceding siblings ...)
2026-05-05 15:29 ` [Intel-wired-lan] [PATCH iwl-next v5 4/5] ice: implement Rx queue management ops Alexander Lobakin
@ 2026-05-05 15:29 ` Alexander Lobakin
2026-05-08 15:42 ` Simon Horman
2026-05-08 12:06 ` [Intel-wired-lan] [PATCH iwl-next v5 0/5] ice: add support for devmem/io_uring Rx and Tx Alexander Lobakin
5 siblings, 1 reply; 15+ messages in thread
From: Alexander Lobakin @ 2026-05-05 15:29 UTC (permalink / raw)
To: intel-wired-lan
Cc: Alexander Lobakin, Tony Nguyen, Przemek Kitszel, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kohei Enju, Jacob Keller, Aleksandr Loktionov,
nxne.cnse.osdt.itp.upstreaming, netdev, linux-kernel
Advertise netmem Tx support in ice. The only change needed is to set
ICE_TX_BUF_FRAG conditionally, only when skb_frag_is_net_iov() is
false. Otherwise, the Tx buffer type remains ICE_TX_BUF_EMPTY and
the driver skips the DMA unmap on Tx completion.
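In miniature, the idea is (a hedged sketch of both sides of the change,
not the exact driver code -- see the hunks below):

	/* mapping time: don't mark unreadable (net_iov) frags as mapped */
	if (!skb_frag_is_net_iov(frag))
		tx_buf->type = ICE_TX_BUF_FRAG;

	/* completion time: the buffer type, not the length, gates unmap */
	if (tx_buf->type != ICE_TX_BUF_EMPTY)
		dma_unmap_page(tx_ring->dev,
			       dma_unmap_addr(tx_buf, dma),
			       dma_unmap_len(tx_buf, len),
			       DMA_TO_DEVICE);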
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
drivers/net/ethernet/intel/ice/ice_main.c | 1 +
drivers/net/ethernet/intel/ice/ice_sf_eth.c | 1 +
drivers/net/ethernet/intel/ice/ice_txrx.c | 17 +++++++++++++----
3 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index a88f9e3c0077..0e61b38e53a5 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -3453,6 +3453,7 @@ static void ice_set_ops(struct ice_vsi *vsi)
netdev->netdev_ops = &ice_netdev_ops;
netdev->queue_mgmt_ops = &ice_queue_mgmt_ops;
+ netdev->netmem_tx = true;
netdev->udp_tunnel_nic_info = &pf->hw.udp_tunnel_nic;
netdev->xdp_metadata_ops = &ice_xdp_md_ops;
ice_set_ethtool_ops(netdev);
diff --git a/drivers/net/ethernet/intel/ice/ice_sf_eth.c b/drivers/net/ethernet/intel/ice/ice_sf_eth.c
index 8d1541712abc..fdf98aa431ba 100644
--- a/drivers/net/ethernet/intel/ice/ice_sf_eth.c
+++ b/drivers/net/ethernet/intel/ice/ice_sf_eth.c
@@ -59,6 +59,7 @@ static int ice_sf_cfg_netdev(struct ice_dynamic_port *dyn_port,
ether_addr_copy(netdev->perm_addr, dyn_port->hw_addr);
netdev->netdev_ops = &ice_sf_netdev_ops;
netdev->queue_mgmt_ops = &ice_queue_mgmt_ops;
+ netdev->netmem_tx = true;
SET_NETDEV_DEVLINK_PORT(netdev, devlink_port);
err = register_netdev(netdev);
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 43b467076027..1d97e2cc2ade 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -113,11 +113,17 @@ ice_prgm_fdir_fltr(struct ice_vsi *vsi, struct ice_fltr_desc *fdir_desc,
static void
ice_unmap_and_free_tx_buf(struct ice_tx_ring *ring, struct ice_tx_buf *tx_buf)
{
- if (tx_buf->type != ICE_TX_BUF_XDP_TX && dma_unmap_len(tx_buf, len))
+ switch (tx_buf->type) {
+ case ICE_TX_BUF_DUMMY:
+ case ICE_TX_BUF_FRAG:
+ case ICE_TX_BUF_SKB:
+ case ICE_TX_BUF_XDP_XMIT:
dma_unmap_page(ring->dev,
dma_unmap_addr(tx_buf, dma),
dma_unmap_len(tx_buf, len),
DMA_TO_DEVICE);
+ break;
+ }
switch (tx_buf->type) {
case ICE_TX_BUF_DUMMY:
@@ -338,12 +344,14 @@ static bool ice_clean_tx_irq(struct ice_tx_ring *tx_ring, int napi_budget)
}
/* unmap any remaining paged data */
- if (dma_unmap_len(tx_buf, len)) {
+ if (tx_buf->type != ICE_TX_BUF_EMPTY) {
dma_unmap_page(tx_ring->dev,
dma_unmap_addr(tx_buf, dma),
dma_unmap_len(tx_buf, len),
DMA_TO_DEVICE);
+
dma_unmap_len_set(tx_buf, len, 0);
+ tx_buf->type = ICE_TX_BUF_EMPTY;
}
}
ice_trace(clean_tx_irq_unmap_eop, tx_ring, tx_desc, tx_buf);
@@ -1495,7 +1503,8 @@ ice_tx_map(struct ice_tx_ring *tx_ring, struct ice_tx_buf *first,
DMA_TO_DEVICE);
tx_buf = &tx_ring->tx_buf[i];
- tx_buf->type = ICE_TX_BUF_FRAG;
+ if (!skb_frag_is_net_iov(frag))
+ tx_buf->type = ICE_TX_BUF_FRAG;
}
/* record SW timestamp if HW timestamp is not available */
@@ -2377,7 +2386,7 @@ void ice_clean_ctrl_tx_irq(struct ice_tx_ring *tx_ring)
}
/* unmap the data header */
- if (dma_unmap_len(tx_buf, len))
+ if (tx_buf->type != ICE_TX_BUF_EMPTY)
dma_unmap_single(tx_ring->dev,
dma_unmap_addr(tx_buf, dma),
dma_unmap_len(tx_buf, len),
--
2.54.0
* Re: [Intel-wired-lan] [PATCH iwl-next v5 0/5] ice: add support for devmem/io_uring Rx and Tx
2026-05-05 15:29 [Intel-wired-lan] [PATCH iwl-next v5 0/5] ice: add support for devmem/io_uring Rx and Tx Alexander Lobakin
` (4 preceding siblings ...)
2026-05-05 15:29 ` [Intel-wired-lan] [PATCH iwl-next v5 5/5] ice: add support for transmitting unreadable frags Alexander Lobakin
@ 2026-05-08 12:06 ` Alexander Lobakin
5 siblings, 0 replies; 15+ messages in thread
From: Alexander Lobakin @ 2026-05-08 12:06 UTC (permalink / raw)
To: intel-wired-lan
Cc: Tony Nguyen, Przemek Kitszel, Andrew Lunn, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kohei Enju, Jacob Keller, Aleksandr Loktionov,
nxne.cnse.osdt.itp.upstreaming, netdev, linux-kernel
From: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Tue, 5 May 2026 17:29:18 +0200
> Now that ice uses libeth for managing Rx buffers and supports
> configurable header split, it's ready to get support for sending
> and receiving packets with unreadable (to the kernel) frags.
>
> Extend libeth just a little bit to allow creating PPs with custom
> memory providers and make sure ice works correctly with the netdev
> ops locking. Then add the full set of queue_mgmt_ops and don't
> unmap unreadable frags on Tx completion.
> No perf regressions for the regular flows and no code duplication
> implied.
>
> Credits to the fbnic developers, whose code helped me understand
> the memory providers and queue_mgmt_ops logics and served as
> a reference.
Hi,
A week ago .ndo_set_rx_mode_async() was introduced.
I didn't know it was mandatory to switch to it when converting to
the netdev ops lock.
So I guess I'll need a new respin :c
Thanks,
Olek
* Re: [Intel-wired-lan] [PATCH iwl-next v5 3/5] ice: migrate to netdev ops lock
2026-05-05 15:29 ` [Intel-wired-lan] [PATCH iwl-next v5 3/5] ice: migrate to netdev ops lock Alexander Lobakin
@ 2026-05-08 13:28 ` Simon Horman
2026-05-08 13:57 ` Alexander Lobakin
0 siblings, 1 reply; 15+ messages in thread
From: Simon Horman @ 2026-05-08 13:28 UTC (permalink / raw)
To: aleksander.lobakin
Cc: 'Simon Horman', intel-wired-lan, anthony.l.nguyen,
przemyslaw.kitszel, andrew+netdev, davem, edumazet, kuba, pabeni,
kohei, jacob.e.keller, aleksandr.loktionov,
nxne.cnse.osdt.itp.upstreaming, netdev, linux-kernel
From: 'Simon Horman' <horms@kernel.org>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
ice: migrate to netdev ops lock
This commit migrates queue management operations to use netdev locking.
It updates several NAPI and VSI configuration functions to use their
_locked variants to avoid deadlocks, taking the lock manually where needed.
> diff --git a/drivers/net/ethernet/intel/ice/ice_dcb_lib.c b/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
> index 16aa255351523..7d89c0acc5d8d 100644
> --- a/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
> +++ b/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
[ ... ]
> @@ -1107,7 +1106,7 @@ ice_dcb_process_lldp_set_mib_change(struct ice_pf *pf,
>
> rtnl_lock();
> /* disable VSIs affected by DCB changes */
> - ice_dcb_ena_dis_vsi(pf, false, true);
> + ice_dcb_ena_dis_vsi(pf, false);
>
> ret = ice_query_port_ets(pi, &buf, sizeof(buf), NULL);
> if (ret) {
> @@ -1119,7 +1118,7 @@ ice_dcb_process_lldp_set_mib_change(struct ice_pf *pf,
> ice_pf_dcb_recfg(pf, false);
>
> /* enable previously downed VSIs */
> - ice_dcb_ena_dis_vsi(pf, true, true);
> + ice_dcb_ena_dis_vsi(pf, true);
> unlock_rtnl:
> rtnl_unlock();
> out:
This wasn't introduced by this patch, but does this error path leave VSIs
permanently disabled?
If ice_query_port_ets() fails, the code jumps to unlock_rtnl, bypassing the
call to ice_dcb_ena_dis_vsi(pf, true). This seems to leave the affected VSIs
(including the main PF interface) in a disabled state (ICE_VSI_DOWN) with
ICE_VSI_NEEDS_RESTART set, which could result in a denial of service.
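If so, one way out would be to fall through to the re-enable on the error
path as well (a sketch only; the label name is hypothetical):

	ret = ice_query_port_ets(pi, &buf, sizeof(buf), NULL);
	if (ret)
		goto ena_vsi;	/* hypothetical label */

	ice_pf_dcb_recfg(pf, false);

ena_vsi:
	/* re-enable previously downed VSIs even when the query failed */
	ice_dcb_ena_dis_vsi(pf, true);
	rtnl_unlock();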
> diff --git a/drivers/net/ethernet/intel/ice/ice_eswitch.c b/drivers/net/ethernet/intel/ice/ice_eswitch.c
> index 2e4f0969035f7..af0cc77fbf718 100644
> --- a/drivers/net/ethernet/intel/ice/ice_eswitch.c
> +++ b/drivers/net/ethernet/intel/ice/ice_eswitch.c
> @@ -23,10 +23,16 @@ static int ice_eswitch_setup_env(struct ice_pf *pf)
> struct net_device *netdev = uplink_vsi->netdev;
> bool if_running = netif_running(netdev);
> struct ice_vsi_vlan_ops *vlan_ops;
> + int ret;
> +
> + if (if_running && !test_and_set_bit(ICE_VSI_DOWN, uplink_vsi->state)) {
> + netdev_lock(netdev);
> + ret = ice_down(uplink_vsi);
> + netdev_unlock(netdev);
>
> - if (if_running && !test_and_set_bit(ICE_VSI_DOWN, uplink_vsi->state))
> - if (ice_down(uplink_vsi))
> + if (ret)
> return -ENODEV;
> + }
Simon says: It seems that the topic of locks in this patch has come up
again ¯\_(ツ)_/¯
Does manipulating the state bit outside of netdev_lock create a race
condition with interface teardown?
If the interface is concurrently brought down, ndo_stop (which holds
netdev_lock) would call ice_vsi_close(). Because ICE_VSI_DOWN is already
set by the test_and_set_bit() call above, ice_vsi_close() would skip calling
ice_down() and immediately proceed to free the Tx/Rx rings.
This would leave the hardware queues active and performing DMA operations into
freed memory, potentially causing a use-after-free.
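A minimal sketch of one way to close that window, assuming ice_down() is
safe to call with netdev_lock held (the patch already does exactly that):

	int ret = 0;

	netdev_lock(netdev);
	/* test-and-set the state bit and bring the VSI down under one lock */
	if (if_running && !test_and_set_bit(ICE_VSI_DOWN, uplink_vsi->state))
		ret = ice_down(uplink_vsi);
	netdev_unlock(netdev);

	if (ret)
		return -ENODEV;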
* Re: [Intel-wired-lan] [PATCH iwl-next v5 3/5] ice: migrate to netdev ops lock
2026-05-08 13:28 ` Simon Horman
@ 2026-05-08 13:57 ` Alexander Lobakin
2026-05-10 8:57 ` Simon Horman
0 siblings, 1 reply; 15+ messages in thread
From: Alexander Lobakin @ 2026-05-08 13:57 UTC (permalink / raw)
To: Simon Horman
Cc: intel-wired-lan, anthony.l.nguyen, przemyslaw.kitszel,
andrew+netdev, davem, edumazet, kuba, pabeni, kohei,
jacob.e.keller, aleksandr.loktionov,
nxne.cnse.osdt.itp.upstreaming, netdev, linux-kernel
From: Simon Horman <horms@kernel.org>
Date: Fri, 8 May 2026 14:28:29 +0100
> From: 'Simon Horman' <horms@kernel.org>
>
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://sashiko.dev
> ---
> ice: migrate to netdev ops lock
>
> This commit migrates queue management operations to use netdev locking.
> It updates several NAPI and VSI configuration functions to use their
> _locked variants to avoid deadlocks, taking the lock manually where needed.
>
>> diff --git a/drivers/net/ethernet/intel/ice/ice_dcb_lib.c b/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
>> index 16aa255351523..7d89c0acc5d8d 100644
>> --- a/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
>> +++ b/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
> [ ... ]
>> @@ -1107,7 +1106,7 @@ ice_dcb_process_lldp_set_mib_change(struct ice_pf *pf,
>>
>> rtnl_lock();
>> /* disable VSIs affected by DCB changes */
>> - ice_dcb_ena_dis_vsi(pf, false, true);
>> + ice_dcb_ena_dis_vsi(pf, false);
>>
>> ret = ice_query_port_ets(pi, &buf, sizeof(buf), NULL);
>> if (ret) {
>> @@ -1119,7 +1118,7 @@ ice_dcb_process_lldp_set_mib_change(struct ice_pf *pf,
>> ice_pf_dcb_recfg(pf, false);
>>
>> /* enable previously downed VSIs */
>> - ice_dcb_ena_dis_vsi(pf, true, true);
>> + ice_dcb_ena_dis_vsi(pf, true);
>> unlock_rtnl:
>> rtnl_unlock();
>> out:
>
> This wasn't introduced by this patch, but does this error path leave VSIs
> permanently disabled?
>
> If ice_query_port_ets() fails, the code jumps to unlock_rtnl, bypassing the
> call to ice_dcb_ena_dis_vsi(pf, true). This seems to leave the affected VSIs
> (including the main PF interface) in a disabled state (ICE_VSI_DOWN) with
> ICE_VSI_NEEDS_RESTART set, which could result in a denial of service.
>
>> diff --git a/drivers/net/ethernet/intel/ice/ice_eswitch.c b/drivers/net/ethernet/intel/ice/ice_eswitch.c
>> index 2e4f0969035f7..af0cc77fbf718 100644
>> --- a/drivers/net/ethernet/intel/ice/ice_eswitch.c
>> +++ b/drivers/net/ethernet/intel/ice/ice_eswitch.c
>> @@ -23,10 +23,16 @@ static int ice_eswitch_setup_env(struct ice_pf *pf)
>> struct net_device *netdev = uplink_vsi->netdev;
>> bool if_running = netif_running(netdev);
>> struct ice_vsi_vlan_ops *vlan_ops;
>> + int ret;
>> +
>> + if (if_running && !test_and_set_bit(ICE_VSI_DOWN, uplink_vsi->state)) {
>> + netdev_lock(netdev);
>> + ret = ice_down(uplink_vsi);
>> + netdev_unlock(netdev);
>>
>> - if (if_running && !test_and_set_bit(ICE_VSI_DOWN, uplink_vsi->state))
>> - if (ice_down(uplink_vsi))
>> + if (ret)
>> return -ENODEV;
>> + }
>
> Simon says: It seems that the topic of locks in this patch has come up
> again ¯\_(ツ)_/¯
>
> Does manipulating the state bit outside of netdev_lock create a race
> condition with interface teardown?
>
> If the interface is concurrently brought down, ndo_stop (which holds
> netdev_lock) would call ice_vsi_close(). Because ICE_VSI_DOWN is already
> set by the test_and_set_bit() call above, ice_vsi_close() would skip calling
> ice_down() and immediately proceed to free the Tx/Rx rings.
>
> This would leave the hardware queues active and performing DMA operations into
> freed memory, potentially causing a use-after-free.
I ran Sashiko internally a couple of times and it didn't find these =\
Anyway, now that I need to adjust the series after
.ndo_set_rx_mode_async() landed, I'll take a look at this, too.
(I hate this series already. I feel like if Sashiko had been around when
other vendors were switching to the netdev lock, they'd have had the same
drama)
(I was planning to add devmem/io_uring support to idpf after this lands,
but given that idpf is way more confusing and inconsistent in locking,
I'm not sure I'll have enough patience :D)
Thanks,
Olek
* Re: [Intel-wired-lan] [PATCH iwl-next v5 4/5] ice: implement Rx queue management ops
2026-05-05 15:29 ` [Intel-wired-lan] [PATCH iwl-next v5 4/5] ice: implement Rx queue management ops Alexander Lobakin
@ 2026-05-08 14:08 ` Simon Horman
2026-05-08 14:26 ` Simon Horman
2026-05-08 15:07 ` Alexander Lobakin
2026-05-08 14:25 ` Simon Horman
1 sibling, 2 replies; 15+ messages in thread
From: Simon Horman @ 2026-05-08 14:08 UTC (permalink / raw)
To: aleksander.lobakin
Cc: 'Simon Horman', intel-wired-lan, anthony.l.nguyen,
przemyslaw.kitszel, andrew+netdev, davem, edumazet, kuba, pabeni,
kohei, jacob.e.keller, aleksandr.loktionov,
nxne.cnse.osdt.itp.upstreaming, netdev, linux-kernel
From: 'Simon Horman' <horms@kernel.org>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
ice: implement Rx queue management ops
This commit wires the ice driver into the netdev queue management ops
framework, reusing the existing per-queue disable/reconfig/enable paths
originally added for XSk. With it, ice reportedly passes the io_uring
zcrx selftests.
> diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
> index 1add82d894bb..4e0b8895c303 100644
> --- a/drivers/net/ethernet/intel/ice/ice_base.c
> +++ b/drivers/net/ethernet/intel/ice/ice_base.c
[ ... ]
> @@ -1577,3 +1593,81 @@ int ice_qp_ena(struct ice_vsi *vsi, u16 q_idx)
>
> return fail;
> }
> +
> +/**
> + * ice_qp_ena - Enables a queue pair
> + * @vsi: VSI of interest
> + * @q_idx: ring index in array
> + *
> + * Returns 0 on success, negative on failure.
> + */
> +int ice_qp_ena(struct ice_vsi *vsi, u16 q_idx)
> +{
> + return __ice_qp_ena(vsi, q_idx, ice_vsi_cfg_single_rxq(vsi, q_idx));
> +}
> +
> +static int ice_queue_start(struct net_device *dev,
> + struct netdev_queue_config *qcfg,
> + void *per_queue_mem, int idx)
> +{
> + const struct ice_netdev_priv *priv = netdev_priv(dev);
> + struct ice_rx_ring *real = priv->vsi->rx_rings[idx];
> + struct ice_rx_ring *new = per_queue_mem;
> + struct napi_struct *napi;
> + int ret;
> +
> + real->pp = new->pp;
> + real->rx_fqes = new->rx_fqes;
> + real->hdr_fqes = new->hdr_fqes;
> + real->hdr_pp = new->hdr_pp;
> +
> + real->hdr_truesize = new->hdr_truesize;
> + real->truesize = new->truesize;
> + real->rx_hdr_len = new->rx_hdr_len;
> + real->rx_buf_len = new->rx_buf_len;
> +
> + memcpy(&real->xdp_rxq, &new->xdp_rxq, sizeof(new->xdp_rxq));
> +
> + ret = ice_setup_rx_ctx(real);
> + if (ret)
> + return ret;
Simon says: I believe that the issue flagged immediately below is
a false positive; the cleanup logic unwinds things correctly.
However, the two other issues flagged further down do seem to be
of concern.
ice_queue_start() eagerly copies new->pp, new->rx_fqes, new->hdr_pp,
new->hdr_fqes and the entire new->xdp_rxq into real before ice_setup_rx_ctx()
has succeeded.
If ice_setup_rx_ctx() returns an error here, the function bails out with
real already pointing at new_mem's page pools and fqes arrays.
The core netdev_rx_queue_reconfig() rollback then invokes
ndo_queue_start(dev, qcfg_old, old_mem, rxq_idx), which repeats the same
eager overwrite using old_mem and calls ice_setup_rx_ctx() again. If the
second ice_setup_rx_ctx() also fails, the core proceeds to
ndo_queue_mem_free(dev, old_mem), and ice_queue_mem_free() will tear down
the page pools whose pointers real->pp / real->hdr_pp are still aliasing.
Would it be safer to stash the existing real->pp / real->rx_fqes /
real->hdr_pp / real->hdr_fqes / real->xdp_rxq on entry and restore them if
ice_setup_rx_ctx() fails, so that a failed rollback cannot leave real with
dangling pointers to a pool that ice_queue_mem_free() is about to destroy?
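Something along these lines, perhaps (a sketch only, reusing the fields
visible in the hunk above):

	struct ice_rx_ring old = *real;		/* stash the live state */

	/* ... the eager copies from 'new' as above ... */

	ret = ice_setup_rx_ctx(real);
	if (ret) {
		/* don't leave 'real' aliasing per_queue_mem on failure */
		*real = old;
		return ret;
	}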
> +
> + napi = &real->q_vector->napi;
> +
> + page_pool_enable_direct_recycling(real->pp, napi);
> + if (real->hdr_pp)
> + page_pool_enable_direct_recycling(real->hdr_pp, napi);
> +
> + ret = ice_alloc_rx_bufs(real, ICE_DESC_UNUSED(real));
> +
> + return __ice_qp_ena(priv->vsi, idx, ret);
ice_alloc_rx_bufs() is declared bool and returns !!cleaned_count, so it
yields 0 on full success and 1 on partial/failed allocation:
bool ice_alloc_rx_bufs(struct ice_rx_ring *rx_ring, unsigned int cleaned_count)
{
...
return !!cleaned_count;
}
That bool is then passed straight into __ice_qp_ena()'s fail parameter,
which only overwrites fail via 'if (!fail) fail = err;', so a value of 1
propagates unchanged to ice_queue_start()'s return value.
netdev_rx_queue_reconfig() treats any non-zero return as failure and
propagates it back to userspace (netlink, ethtool, io_uring zcrx install),
where the callers expect a negative errno.
Before this patch, ice_qp_ena() obtained its fail value from
ice_vsi_cfg_single_rxq(), which returns a signed errno.
Should this be converted to a proper negative errno before being handed
to __ice_qp_ena(), for example:
ret = ice_alloc_rx_bufs(real, ICE_DESC_UNUSED(real)) ? -ENOMEM : 0;
> +}
> +
> +static int ice_queue_stop(struct net_device *dev, void *per_queue_mem,
> + int idx)
> +{
> + const struct ice_netdev_priv *priv = netdev_priv(dev);
> + struct ice_rx_ring *real = priv->vsi->rx_rings[idx];
> + int ret;
> +
> + ret = __ice_qp_dis(priv->vsi, idx);
> + if (ret)
> + return ret;
> +
> + page_pool_disable_direct_recycling(real->pp);
> + if (real->hdr_pp)
> + page_pool_disable_direct_recycling(real->hdr_pp);
> +
> + ice_zero_rx_ring(real);
> + memcpy(per_queue_mem, real, sizeof(*real));
> +
> + return 0;
> +}
Can this leak inflight page pool buffers?
ice_zero_rx_ring(real) is called before the memcpy to per_queue_mem, and
ice_zero_rx_ring() resets both indices:
void ice_zero_rx_ring(struct ice_rx_ring *rx_ring)
{
...
rx_ring->next_to_clean = 0;
rx_ring->next_to_use = 0;
}
So per_queue_mem captures a ring where next_to_clean == next_to_use == 0.
The core then invokes ndo_queue_mem_free(dev, old_mem), and the recycle
loop in ice_queue_mem_free() is guarded by exactly those two indices:
void ice_queue_mem_free(struct net_device *dev, void *per_queue_mem)
{
...
for (u32 i = rx_ring->next_to_clean; i != rx_ring->next_to_use; ) {
libeth_rx_recycle_slow(rx_ring->rx_fqes[i].netmem);
if (rx_ring->hdr_pp)
libeth_rx_recycle_slow(rx_ring->hdr_fqes[i].netmem);
if (unlikely(++i == rx_ring->count))
i = 0;
}
...
ice_rxq_pp_destroy(rx_ring);
}
With 0 != 0 false on entry, the loop never runs, so any buffers that were
in rx_fqes[old_ntc..old_ntu) (and hdr_fqes[] when header split is on) are
never returned via libeth_rx_recycle_slow(). ice_rxq_pp_destroy() then
kvfree()s the fqes arrays and calls page_pool_destroy() with outstanding
inflight references that are no longer tracked anywhere.
The new ice_clean_rx_ring() wrapper keeps the original ordering:
void ice_clean_rx_ring(struct ice_rx_ring *rx_ring)
{
ice_queue_mem_free(rx_ring->netdev, rx_ring);
ice_zero_rx_ring(rx_ring);
}
Should ice_queue_stop() follow the same ordering and perform the memcpy
before ice_zero_rx_ring(real), so the recycle loop in ice_queue_mem_free()
sees the real ntc/ntu values?
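That reordering would amount to (sketch):

	page_pool_disable_direct_recycling(real->pp);
	if (real->hdr_pp)
		page_pool_disable_direct_recycling(real->hdr_pp);

	/* snapshot first, so the free path still sees the real ntc/ntu */
	memcpy(per_queue_mem, real, sizeof(*real));
	ice_zero_rx_ring(real);

	return 0;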
> +
> +const struct netdev_queue_mgmt_ops ice_queue_mgmt_ops = {
> + .ndo_queue_mem_alloc = ice_queue_mem_alloc,
> + .ndo_queue_mem_free = ice_queue_mem_free,
> + .ndo_queue_mem_size = sizeof(struct ice_rx_ring),
> + .ndo_queue_start = ice_queue_start,
> + .ndo_queue_stop = ice_queue_stop,
> +};
[ ... ]
* Re: [Intel-wired-lan] [PATCH iwl-next v5 4/5] ice: implement Rx queue management ops
2026-05-05 15:29 ` [Intel-wired-lan] [PATCH iwl-next v5 4/5] ice: implement Rx queue management ops Alexander Lobakin
2026-05-08 14:08 ` Simon Horman
@ 2026-05-08 14:25 ` Simon Horman
1 sibling, 0 replies; 15+ messages in thread
From: Simon Horman @ 2026-05-08 14:25 UTC (permalink / raw)
To: aleksander.lobakin
Cc: 'Simon Horman', intel-wired-lan, anthony.l.nguyen,
przemyslaw.kitszel, andrew+netdev, davem, edumazet, kuba, pabeni,
kohei, jacob.e.keller, aleksandr.loktionov,
nxne.cnse.osdt.itp.upstreaming, netdev, linux-kernel
From: 'Simon Horman' <horms@kernel.org>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
ice: implement Rx queue management ops
Now ice is ready to get queue_mgmt_ops support. It already has API
to disable/reconfig/enable one particular queue (for XSk). Reuse as
much of its code as possible to implement Rx queue management
callbacks and vice versa -- ice_queue_mem_{alloc,free}() can be
reused during ifup/ifdown to elide code duplication.
With this, ice passes the io_uring zcrx selftests, meaning the Rx
part of netmem/MP support is done.
> diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
> index 1add82d894bbe..4e0b8895c303a 100644
> --- a/drivers/net/ethernet/intel/ice/ice_base.c
> +++ b/drivers/net/ethernet/intel/ice/ice_base.c
> @@ -653,6 +653,43 @@ static int ice_rxq_pp_create(struct ice_rx_ring *rq)
> return err;
> }
>
> +static int ice_queue_mem_alloc(struct net_device *dev,
> + struct netdev_queue_config *qcfg,
> + void *per_queue_mem, int idx)
> +{
> + const struct ice_netdev_priv *priv = netdev_priv(dev);
> + const struct ice_rx_ring *real = priv->vsi->rx_rings[idx];
> + struct ice_rx_ring *new = per_queue_mem;
> + int ret;
> +
> + new->count = real->count;
> + new->netdev = real->netdev;
> + new->q_index = real->q_index;
> + new->q_vector = real->q_vector;
> + new->vsi = real->vsi;
Is it safe to assume netdev_priv(dev) returns the correct VSI here? For a
loopback VSI (ICE_VSI_LB), it shares the PF's netdev, so priv->vsi resolves
to the PF VSI.
Could this copy the PF VSI pointers to the loopback ring and cause hardware
context corruption later?
[ ... ]
> -/**
> - * ice_qp_dis - Disables a queue pair
> - * @vsi: VSI of interest
> - * @q_idx: ring index in array
> - *
> - * Returns 0 on success, negative on failure.
> - */
> -int ice_qp_dis(struct ice_vsi *vsi, u16 q_idx)
> +static int __ice_qp_dis(struct ice_vsi *vsi, u16 q_idx)
> {
Could this cause TX data corruption? This function appears to call
synchronize_net() before netif_tx_stop_queue(), meaning the RCU barrier
completes before the queue actually stops.
Concurrent transmissions might then race with ring modifications.
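If the ordering is indeed inverted, the conventional sequence would be to
stop first and synchronize after (sketch; 'txq' stands for the ring's
netdev Tx queue):

	netif_tx_stop_queue(txq);	/* block new transmissions first */
	synchronize_net();		/* then wait out in-flight ones */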
[ ... ]
> +static int ice_queue_start(struct net_device *dev,
> + struct netdev_queue_config *qcfg,
> + void *per_queue_mem, int idx)
> +{
> + const struct ice_netdev_priv *priv = netdev_priv(dev);
> + struct ice_rx_ring *real = priv->vsi->rx_rings[idx];
> + struct ice_rx_ring *new = per_queue_mem;
> + struct napi_struct *napi;
> + int ret;
[ ... ]
> + ret = ice_alloc_rx_bufs(real, ICE_DESC_UNUSED(real));
> +
> + return __ice_qp_ena(priv->vsi, idx, ret);
> +}
Does this cause a use-after-free via hardware DMA?
ice_alloc_rx_bufs() returns a boolean: 0 when the ring was fully filled and
true (1) on a partial or failed allocation. That non-errno value is passed
to __ice_qp_ena() as the fail argument, which returns it back. Since
ice_queue_start() then returns a non-zero value, the core networking stack
assumes failure and frees the queue memory, while the hardware queue remains
active and DMAs into freed memory.
> +
> +static int ice_queue_stop(struct net_device *dev, void *per_queue_mem,
> + int idx)
> +{
> + const struct ice_netdev_priv *priv = netdev_priv(dev);
> + struct ice_rx_ring *real = priv->vsi->rx_rings[idx];
> + int ret;
> +
> + ret = __ice_qp_dis(priv->vsi, idx);
Does calling __ice_qp_dis() from ice_queue_stop() disrupt the global interface
carrier state and TX queues?
__ice_qp_dis() calls netif_carrier_off() and cleans TX rings, but
ice_queue_stop() is intended to be a granular, per-RX-queue operation.
> + if (ret)
> + return ret;
> +
> + page_pool_disable_direct_recycling(real->pp);
Can this dereference a NULL pointer if the queue is configured with an AF_XDP
zero-copy socket?
In that case, a standard page pool is not created and real->pp might be NULL,
causing a panic in page_pool_disable_direct_recycling().
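If real->pp can really be NULL in the XSk case, a guard symmetric to the
hdr_pp one below would avoid the dereference (sketch):

	if (real->pp)
		page_pool_disable_direct_recycling(real->pp);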
> + if (real->hdr_pp)
> + page_pool_disable_direct_recycling(real->hdr_pp);
> +
> + ice_zero_rx_ring(real);
> + memcpy(per_queue_mem, real, sizeof(*real));
Could this lead to a memory leak of RX ring buffers?
ice_zero_rx_ring() clears the next_to_clean and next_to_use indices to 0
before the memcpy(). When the stack calls ice_queue_mem_free(per_queue_mem),
it skips cleaning because next_to_clean == next_to_use, leaking all active
skbs and page pool buffers.
Also, does this memcpy() leave dangling pointers in real that can lead to a
double-free?
The dynamically allocated pointers like rx_fqes and pp are not set to NULL
in real. When the interface is brought down later, ice_clean_rx_ring(real)
may attempt to destroy the already-freed page pools again.
> +
> + return 0;
> +}
* Re: [Intel-wired-lan] [PATCH iwl-next v5 4/5] ice: implement Rx queue management ops
2026-05-08 14:08 ` Simon Horman
@ 2026-05-08 14:26 ` Simon Horman
2026-05-08 15:07 ` Alexander Lobakin
1 sibling, 0 replies; 15+ messages in thread
From: Simon Horman @ 2026-05-08 14:26 UTC (permalink / raw)
To: aleksander.lobakin
Cc: intel-wired-lan, anthony.l.nguyen, przemyslaw.kitszel,
andrew+netdev, davem, edumazet, kuba, pabeni, kohei,
jacob.e.keller, aleksandr.loktionov,
nxne.cnse.osdt.itp.upstreaming, netdev, linux-kernel
On Fri, May 08, 2026 at 03:08:00PM +0100, Simon Horman wrote:
> From: 'Simon Horman' <horms@kernel.org>
>
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://sashiko.dev
Sorry, the line above should have referenced
https://netdev-ai.bots.linux.dev/sashiko/
There is also a review of this patch available on https://sashiko.dev,
which I plan to forward separately.
...
* Re: [Intel-wired-lan] [PATCH iwl-next v5 4/5] ice: implement Rx queue management ops
2026-05-08 14:08 ` Simon Horman
2026-05-08 14:26 ` Simon Horman
@ 2026-05-08 15:07 ` Alexander Lobakin
1 sibling, 0 replies; 15+ messages in thread
From: Alexander Lobakin @ 2026-05-08 15:07 UTC (permalink / raw)
To: Simon Horman
Cc: intel-wired-lan, anthony.l.nguyen, przemyslaw.kitszel,
andrew+netdev, davem, edumazet, kuba, pabeni, kohei,
jacob.e.keller, aleksandr.loktionov,
nxne.cnse.osdt.itp.upstreaming, netdev, linux-kernel
From: Simon Horman <horms@kernel.org>
Date: Fri, 8 May 2026 15:08:00 +0100
> From: 'Simon Horman' <horms@kernel.org>
>
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://sashiko.dev
> ---
> ice: implement Rx queue management ops
>
> This commit wires the ice driver into the netdev queue management ops
> framework, reusing the existing per-queue disable/reconfig/enable paths
> originally added for XSk. With it, ice reportedly passes the io_uring
> zcrx selftests.
>
>> diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
>> index 1add82d894bb..4e0b8895c303 100644
>> --- a/drivers/net/ethernet/intel/ice/ice_base.c
>> +++ b/drivers/net/ethernet/intel/ice/ice_base.c
>
> [ ... ]
>
>> @@ -1577,3 +1593,81 @@ int ice_qp_ena(struct ice_vsi *vsi, u16 q_idx)
>>
>> return fail;
>> }
>> +
>> +/**
>> + * ice_qp_ena - Enables a queue pair
>> + * @vsi: VSI of interest
>> + * @q_idx: ring index in array
>> + *
>> + * Returns 0 on success, negative on failure.
>> + */
>> +int ice_qp_ena(struct ice_vsi *vsi, u16 q_idx)
>> +{
>> + return __ice_qp_ena(vsi, q_idx, ice_vsi_cfg_single_rxq(vsi, q_idx));
>> +}
>> +
>> +static int ice_queue_start(struct net_device *dev,
>> + struct netdev_queue_config *qcfg,
>> + void *per_queue_mem, int idx)
>> +{
>> + const struct ice_netdev_priv *priv = netdev_priv(dev);
>> + struct ice_rx_ring *real = priv->vsi->rx_rings[idx];
>> + struct ice_rx_ring *new = per_queue_mem;
>> + struct napi_struct *napi;
>> + int ret;
>> +
>> + real->pp = new->pp;
>> + real->rx_fqes = new->rx_fqes;
>> + real->hdr_fqes = new->hdr_fqes;
>> + real->hdr_pp = new->hdr_pp;
>> +
>> + real->hdr_truesize = new->hdr_truesize;
>> + real->truesize = new->truesize;
>> + real->rx_hdr_len = new->rx_hdr_len;
>> + real->rx_buf_len = new->rx_buf_len;
>> +
>> + memcpy(&real->xdp_rxq, &new->xdp_rxq, sizeof(new->xdp_rxq));
>> +
>> + ret = ice_setup_rx_ctx(real);
>> + if (ret)
>> + return ret;
>
> Simon says: I believe that the issue flagged immediately below is
> a false positive; the cleanup logic unwinds things correctly.
Yup, our Sashiko also had concerns about this piece, but I rechecked and
it seems like it's a FP.
>
> However, the two other issues flagged further down do seem to be
> of concern.
>
> ice_queue_start() eagerly copies new->pp, new->rx_fqes, new->hdr_pp,
> new->hdr_fqes and the entire new->xdp_rxq into real before ice_setup_rx_ctx()
> has succeeded.
[...]
>
>> +
>> + napi = &real->q_vector->napi;
>> +
>> + page_pool_enable_direct_recycling(real->pp, napi);
>> + if (real->hdr_pp)
>> + page_pool_enable_direct_recycling(real->hdr_pp, napi);
>> +
>> + ret = ice_alloc_rx_bufs(real, ICE_DESC_UNUSED(real));
>> +
>> + return __ice_qp_ena(priv->vsi, idx, ret);
>
> ice_alloc_rx_bufs() is declared bool and returns !!cleaned_count, so it
> yields 0 on full success and 1 on partial/failed allocation:
>
> bool ice_alloc_rx_bufs(struct ice_rx_ring *rx_ring, unsigned int cleaned_count)
> {
> ...
> return !!cleaned_count;
> }
>
> That bool is then passed straight into __ice_qp_ena()'s fail parameter,
> which only overwrites fail via 'if (!fail) fail = err;', so a value of 1
> propagates unchanged to ice_queue_start()'s return value.
>
> netdev_rx_queue_reconfig() treats any non-zero return as failure and
> propagates it back to userspace (netlink, ethtool, io_uring zcrx install),
> where the callers expect a negative errno.
>
> Before this patch, ice_qp_ena() obtained its fail value from
> ice_vsi_cfg_single_rxq(), which returns a signed errno.
>
> Should this be converted to a proper negative errno before being handed
> to __ice_qp_ena(), for example:
>
> ret = ice_alloc_rx_bufs(real, ICE_DESC_UNUSED(real)) ? -ENOMEM : 0;
>
>> +}
>> +
>> +static int ice_queue_stop(struct net_device *dev, void *per_queue_mem,
>> + int idx)
>> +{
>> + const struct ice_netdev_priv *priv = netdev_priv(dev);
>> + struct ice_rx_ring *real = priv->vsi->rx_rings[idx];
>> + int ret;
>> +
>> + ret = __ice_qp_dis(priv->vsi, idx);
>> + if (ret)
>> + return ret;
>> +
>> + page_pool_disable_direct_recycling(real->pp);
>> + if (real->hdr_pp)
>> + page_pool_disable_direct_recycling(real->hdr_pp);
>> +
>> + ice_zero_rx_ring(real);
>> + memcpy(per_queue_mem, real, sizeof(*real));
>> +
>> + return 0;
>> +}
>
> Can this leak inflight page pool buffers?
>
> ice_zero_rx_ring(real) is called before the memcpy to per_queue_mem, and
> ice_zero_rx_ring() resets both indices:
>
> void ice_zero_rx_ring(struct ice_rx_ring *rx_ring)
> {
> ...
> rx_ring->next_to_clean = 0;
> rx_ring->next_to_use = 0;
> }
>
> So per_queue_mem captures a ring where next_to_clean == next_to_use == 0.
>
> The core then invokes ndo_queue_mem_free(dev, old_mem), and the recycle
> loop in ice_queue_mem_free() is guarded by exactly those two indices:
>
> void ice_queue_mem_free(struct net_device *dev, void *per_queue_mem)
> {
> ...
> for (u32 i = rx_ring->next_to_clean; i != rx_ring->next_to_use; ) {
> libeth_rx_recycle_slow(rx_ring->rx_fqes[i].netmem);
>
> if (rx_ring->hdr_pp)
> libeth_rx_recycle_slow(rx_ring->hdr_fqes[i].netmem);
>
> if (unlikely(++i == rx_ring->count))
> i = 0;
> }
> ...
> ice_rxq_pp_destroy(rx_ring);
> }
>
> With 0 != 0 false on entry, the loop never runs, so any buffers that were
> in rx_fqes[old_ntc..old_ntu) (and hdr_fqes[] when header split is on) are
> never returned via libeth_rx_recycle_slow(). ice_rxq_pp_destroy() then
> kvfree()s the fqes arrays and calls page_pool_destroy() with outstanding
> inflight references that are no longer tracked anywhere.
>
> The new ice_clean_rx_ring() wrapper keeps the original ordering:
>
> void ice_clean_rx_ring(struct ice_rx_ring *rx_ring)
> {
> ice_queue_mem_free(rx_ring->netdev, rx_ring);
> ice_zero_rx_ring(rx_ring);
> }
>
> Should ice_queue_stop() follow the same ordering and perform the memcpy
> before ice_zero_rx_ring(real), so the recycle loop in ice_queue_mem_free()
> sees the real ntc/ntu values?
>
>> +
>> +const struct netdev_queue_mgmt_ops ice_queue_mgmt_ops = {
>> + .ndo_queue_mem_alloc = ice_queue_mem_alloc,
>> + .ndo_queue_mem_free = ice_queue_mem_free,
>> + .ndo_queue_mem_size = sizeof(struct ice_rx_ring),
>> + .ndo_queue_start = ice_queue_start,
>> + .ndo_queue_stop = ice_queue_stop,
>> +};
Those two are new to me; I'll double-check.
Thanks,
Olek
* Re: [Intel-wired-lan] [PATCH iwl-next v5 5/5] ice: add support for transmitting unreadable frags
2026-05-05 15:29 ` [Intel-wired-lan] [PATCH iwl-next v5 5/5] ice: add support for transmitting unreadable frags Alexander Lobakin
@ 2026-05-08 15:42 ` Simon Horman
0 siblings, 0 replies; 15+ messages in thread
From: Simon Horman @ 2026-05-08 15:42 UTC (permalink / raw)
To: aleksander.lobakin
Cc: 'Simon Horman', intel-wired-lan, anthony.l.nguyen,
przemyslaw.kitszel, andrew+netdev, davem, edumazet, kuba, pabeni,
kohei, jacob.e.keller, aleksandr.loktionov,
nxne.cnse.osdt.itp.upstreaming, netdev, linux-kernel
From: 'Simon Horman' <horms@kernel.org>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
ice: add support for transmitting unreadable frags
This commit advertises netmem Tx support in the ice driver. It modifies
the transmit path to conditionally set the buffer type to ICE_TX_BUF_FRAG
only when the fragment is not a net_iov, allowing the driver to skip DMA
unmapping operations for unreadable fragments.
> diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> index a88f9e3c00771..0e61b38e53a5e 100644
> --- a/drivers/net/ethernet/intel/ice/ice_main.c
> +++ b/drivers/net/ethernet/intel/ice/ice_main.c
> @@ -3453,6 +3453,7 @@ static void ice_set_ops(struct ice_vsi *vsi)
>
> netdev->netdev_ops = &ice_netdev_ops;
> netdev->queue_mgmt_ops = &ice_queue_mgmt_ops;
> + netdev->netmem_tx = true;
> netdev->udp_tunnel_nic_info = &pf->hw.udp_tunnel_nic;
By setting netmem_tx to true here, the driver accepts packets containing
net_iov fragments. If a net_iov packet exceeds ICE_MAX_BUF_TXD fragments, it
appears it will fall back to linearization in the transmit path.
Since __skb_linearize() uses CPU memory accessors that expect a valid struct
page, will this cause an invalid memory access when skb_frag_page() returns a
net_iov pointer?
Should the driver explicitly drop highly fragmented net_iov packets instead of
attempting to linearize them?
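One possible shape for such a guard (a sketch; 'too_many_frags' and the
'out_drop' label stand for whatever the driver's Tx path already uses to
trigger linearization and drop packets):

	if (unlikely(too_many_frags)) {
		const struct skb_shared_info *shinfo = skb_shinfo(skb);

		/* unreadable frags have no kernel mapping to copy from */
		for (u32 i = 0; i < shinfo->nr_frags; i++)
			if (skb_frag_is_net_iov(&shinfo->frags[i]))
				goto out_drop;

		if (__skb_linearize(skb))
			goto out_drop;
	}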
[ ... ]
> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
> index 43b4670760273..1d97e2cc2adee 100644
> --- a/drivers/net/ethernet/intel/ice/ice_txrx.c
> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
> @@ -113,11 +113,17 @@ ice_prgm_fdir_fltr(struct ice_vsi *vsi, struct ice_fltr_desc *fdir_desc,
> static void
> ice_unmap_and_free_tx_buf(struct ice_tx_ring *ring, struct ice_tx_buf *tx_buf)
> {
> - if (tx_buf->type != ICE_TX_BUF_XDP_TX && dma_unmap_len(tx_buf, len))
> + switch (tx_buf->type) {
> + case ICE_TX_BUF_DUMMY:
> + case ICE_TX_BUF_FRAG:
> + case ICE_TX_BUF_SKB:
> + case ICE_TX_BUF_XDP_XMIT:
> dma_unmap_page(ring->dev,
> dma_unmap_addr(tx_buf, dma),
> dma_unmap_len(tx_buf, len),
> DMA_TO_DEVICE);
> + break;
> + }
Does removing the dma_unmap_len() check alter the safety of the DMA error
recovery path?
If a mapping fails in ice_tx_map(), execution jumps to dma_error and calls
ice_unmap_and_free_tx_buf(). Since the failure happens before
dma_unmap_len_set() updates the descriptor, it seems this might unconditionally
unmap using stale descriptor state.
Could this result in 0-length unmaps or erroneously unmapping persistent
net_iov memory?
[ ... ]
> @@ -338,12 +344,14 @@ static bool ice_clean_tx_irq(struct ice_tx_ring *tx_ring, int napi_budget)
> }
>
> /* unmap any remaining paged data */
> - if (dma_unmap_len(tx_buf, len)) {
> + if (tx_buf->type != ICE_TX_BUF_EMPTY) {
> dma_unmap_page(tx_ring->dev,
> dma_unmap_addr(tx_buf, dma),
> dma_unmap_len(tx_buf, len),
> DMA_TO_DEVICE);
> +
> dma_unmap_len_set(tx_buf, len, 0);
> + tx_buf->type = ICE_TX_BUF_EMPTY;
> }
> }
When mapping a net_iov fragment in ice_tx_map(), the descriptor is populated
with the length and DMA address, but tx_buf->type remains ICE_TX_BUF_EMPTY.
Because dma_unmap_len_set(tx_buf, len, 0) is now placed inside the
tx_buf->type != ICE_TX_BUF_EMPTY check, does this skip clearing the len and
dma fields for net_iov descriptors when they are returned to the free pool?
If these descriptor fields are leaked, could they be incorrectly unmapped later
if the descriptor is reused and a subsequent mapping fails?
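If so, hoisting the state reset out of the type check would keep those
fields from going stale (sketch):

	if (tx_buf->type != ICE_TX_BUF_EMPTY)
		dma_unmap_page(tx_ring->dev,
			       dma_unmap_addr(tx_buf, dma),
			       dma_unmap_len(tx_buf, len),
			       DMA_TO_DEVICE);

	/* always reset the per-buffer unmap state */
	dma_unmap_len_set(tx_buf, len, 0);
	tx_buf->type = ICE_TX_BUF_EMPTY;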
* Re: [Intel-wired-lan] [PATCH iwl-next v5 3/5] ice: migrate to netdev ops lock
2026-05-08 13:57 ` Alexander Lobakin
@ 2026-05-10 8:57 ` Simon Horman
0 siblings, 0 replies; 15+ messages in thread
From: Simon Horman @ 2026-05-10 8:57 UTC (permalink / raw)
To: Alexander Lobakin
Cc: intel-wired-lan, anthony.l.nguyen, przemyslaw.kitszel,
andrew+netdev, davem, edumazet, kuba, pabeni, kohei,
jacob.e.keller, aleksandr.loktionov,
nxne.cnse.osdt.itp.upstreaming, netdev, linux-kernel
On Fri, May 08, 2026 at 03:57:25PM +0200, Alexander Lobakin wrote:
> From: Simon Horman <horms@kernel.org>
> Date: Fri, 8 May 2026 14:28:29 +0100
>
> > From: 'Simon Horman' <horms@kernel.org>
> >
> > This is an AI-generated review of your patch. The human sending this
> > email has considered the AI review valid, or at least plausible.
> > Full review at: https://sashiko.dev
> > ---
> > ice: migrate to netdev ops lock
> >
> > This commit migrates queue management operations to use netdev locking.
> > It updates several NAPI and VSI configuration functions to use their
> > _locked variants to avoid deadlocks, taking the lock manually where needed.
> >
> >> diff --git a/drivers/net/ethernet/intel/ice/ice_dcb_lib.c b/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
> >> index 16aa255351523..7d89c0acc5d8d 100644
> >> --- a/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
> >> +++ b/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
> > [ ... ]
> >> @@ -1107,7 +1106,7 @@ ice_dcb_process_lldp_set_mib_change(struct ice_pf *pf,
> >>
> >> rtnl_lock();
> >> /* disable VSIs affected by DCB changes */
> >> - ice_dcb_ena_dis_vsi(pf, false, true);
> >> + ice_dcb_ena_dis_vsi(pf, false);
> >>
> >> ret = ice_query_port_ets(pi, &buf, sizeof(buf), NULL);
> >> if (ret) {
> >> @@ -1119,7 +1118,7 @@ ice_dcb_process_lldp_set_mib_change(struct ice_pf *pf,
> >> ice_pf_dcb_recfg(pf, false);
> >>
> >> /* enable previously downed VSIs */
> >> - ice_dcb_ena_dis_vsi(pf, true, true);
> >> + ice_dcb_ena_dis_vsi(pf, true);
> >> unlock_rtnl:
> >> rtnl_unlock();
> >> out:
> >
> > This wasn't introduced by this patch, but does this error path leave VSIs
> > permanently disabled?
> >
> > If ice_query_port_ets() fails, the code jumps to unlock_rtnl, bypassing the
> > call to ice_dcb_ena_dis_vsi(pf, true). This seems to leave the affected VSIs
> > (including the main PF interface) in a disabled state (ICE_VSI_DOWN) with
> > ICE_VSI_NEEDS_RESTART set, which could result in a denial of service.
> >
> >> diff --git a/drivers/net/ethernet/intel/ice/ice_eswitch.c b/drivers/net/ethernet/intel/ice/ice_eswitch.c
> >> index 2e4f0969035f7..af0cc77fbf718 100644
> >> --- a/drivers/net/ethernet/intel/ice/ice_eswitch.c
> >> +++ b/drivers/net/ethernet/intel/ice/ice_eswitch.c
> >> @@ -23,10 +23,16 @@ static int ice_eswitch_setup_env(struct ice_pf *pf)
> >> struct net_device *netdev = uplink_vsi->netdev;
> >> bool if_running = netif_running(netdev);
> >> struct ice_vsi_vlan_ops *vlan_ops;
> >> + int ret;
> >> +
> >> + if (if_running && !test_and_set_bit(ICE_VSI_DOWN, uplink_vsi->state)) {
> >> + netdev_lock(netdev);
> >> + ret = ice_down(uplink_vsi);
> >> + netdev_unlock(netdev);
> >>
> >> - if (if_running && !test_and_set_bit(ICE_VSI_DOWN, uplink_vsi->state))
> >> - if (ice_down(uplink_vsi))
> >> + if (ret)
> >> return -ENODEV;
> >> + }
> >
> > Simon says: It seems that the topic of locks in this patch has come up
> > again ¯\_(ツ)_/¯
> >
> > Does manipulating the state bit outside of netdev_lock create a race
> > condition with interface teardown?
> >
> > If the interface is concurrently brought down, ndo_stop (which holds
> > netdev_lock) would call ice_vsi_close(). Because ICE_VSI_DOWN is already
> > set by the test_and_set_bit() call above, ice_vsi_close() would skip calling
> > ice_down() and immediately proceed to free the Tx/Rx rings.
> >
> > This would leave the hardware queues active and performing DMA operations into
> > freed memory, potentially causing a use-after-free.
>
> I ran Sashiko internally a couple times and it hasn't found these =\
>
> Anyway, now that I anyway need to adjust the series after
> .ndo_set_rx_mode_asyn() landed, I'll take a look at this, too.
>
> (I hate this series already. I feel like if Sashiko was alive when other
> vendors were switching to the netdev lock, they'd have the same drama)
Life is certainly different with Sashiko.
No question about it.
>
> (I was planning to add devmem/io_uring support to idpf after this lands,
> but given that idpf is way more confusing and inconsistent in locking,
> I'm not sure I'll have enough patience :D)
>
> Thanks,
> Olek