* [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy
@ 2025-05-22 21:41 Tariq Toukan
  2025-05-22 21:41 ` [PATCH net-next V2 01/11] net: Kconfig NET_DEVMEM selects GENERIC_ALLOCATOR Tariq Toukan
                   ` (13 more replies)
  0 siblings, 14 replies; 50+ messages in thread
From: Tariq Toukan @ 2025-05-22 21:41 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

This series from the team adds support for zero-copy TCP RX with devmem
and io_uring for ConnectX7 NICs and above. For performance and
simplicity, HW-GRO is also turned on when header-data split mode is on.

Find more details below.

Regards,
Tariq

Performance
===========

Test setup:

* CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (single NUMA)
* NIC: ConnectX7
* Benchmarking tool: kperf [1]
* Single TCP flow
* Test duration: 60s

With application thread and interrupts pinned to the *same* core:

|------+-----------+----------|
| MTU  | epoll     | io_uring |
|------+-----------+----------|
| 1500 | 61.6 Gbps | 114 Gbps |
| 4096 | 69.3 Gbps | 151 Gbps |
| 9000 | 67.8 Gbps | 187 Gbps |
|------+-----------+----------|

The CPU usage for io_uring is 95%.

Reproduction steps for io_uring:

server --no-daemon -a 2001:db8::1 --no-memcmp --iou --iou_sendzc \
        --iou_zcrx --iou_dev_name eth2 --iou_zcrx_queue_id 2

server --no-daemon -a 2001:db8::2 --no-memcmp --iou --iou_sendzc

client --src 2001:db8::2 --dst 2001:db8::1 \
        --msg-zerocopy -t 60 --cpu-min=2 --cpu-max=2

Patch overview
==============

First, a netmem variant of skb_can_coalesce() is added to the core, so
that skb fragment coalescing can be done on netmems.

The next patches introduce some cleanups in the internal SHAMPO code
and improve the checks of the HW-GRO capabilities exposed by FW.

A separate page_pool is introduced for headers. Ethtool stats are added
as well.

Then the driver is converted to use the netmem API and to support
unreadable netmem page pools.

The queue management ops are implemented.

Finally, the tcp-data-split ring parameter is exposed.
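
For reference, the setting can then be toggled via the standard ethtool
ring parameter interface (eth2 below is just the device name used in the
reproduction steps above):

ethtool -G eth2 tcp-data-split on
ethtool -g eth2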

Changelog
=========

Changes from v1 [0]:
- Added support for skb_can_coalesce_netmem().
- Avoid netmem_to_page() casts in the driver.
- Fixed code to abide by the 80 char limit, with some exceptions to
  avoid code churn.

References
==========

[0] v1: https://lore.kernel.org/all/20250116215530.158886-1-saeed@kernel.org/
[1] kperf: git://git.kernel.dk/kperf.git


Dragos Tatulea (1):
  net: Add skb_can_coalesce for netmem

Saeed Mahameed (10):
  net: Kconfig NET_DEVMEM selects GENERIC_ALLOCATOR
  net/mlx5e: SHAMPO: Reorganize mlx5_rq_shampo_alloc
  net/mlx5e: SHAMPO: Remove redundant params
  net/mlx5e: SHAMPO: Improve hw gro capability checking
  net/mlx5e: SHAMPO: Separate pool for headers
  net/mlx5e: SHAMPO: Headers page pool stats
  net/mlx5e: Convert over to netmem
  net/mlx5e: Add support for UNREADABLE netmem page pools
  net/mlx5e: Implement queue mgmt ops and single channel swap
  net/mlx5e: Support ethtool tcp-data-split settings

 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  11 +-
 .../ethernet/mellanox/mlx5/core/en/params.c   |  36 ++-
 .../ethernet/mellanox/mlx5/core/en_ethtool.c  |  50 ++++
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 281 +++++++++++++-----
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 136 +++++----
 .../ethernet/mellanox/mlx5/core/en_stats.c    |  53 ++++
 .../ethernet/mellanox/mlx5/core/en_stats.h    |  24 ++
 include/linux/skbuff.h                        |  12 +
 net/Kconfig                                   |   2 +-
 9 files changed, 445 insertions(+), 160 deletions(-)


base-commit: 33e1b1b3991ba8c0d02b2324a582e084272205d6
-- 
2.31.1



* [PATCH net-next V2 01/11] net: Kconfig NET_DEVMEM selects GENERIC_ALLOCATOR
  2025-05-22 21:41 [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy Tariq Toukan
@ 2025-05-22 21:41 ` Tariq Toukan
  2025-05-22 23:07   ` Mina Almasry
  2025-05-22 21:41 ` [PATCH net-next V2 02/11] net: Add skb_can_coalesce for netmem Tariq Toukan
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 50+ messages in thread
From: Tariq Toukan @ 2025-05-22 21:41 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

From: Saeed Mahameed <saeedm@nvidia.com>

GENERIC_ALLOCATOR is a non-prompt kconfig, meaning users can't enable it
directly. All kconfig users of GENERIC_ALLOCATOR select it, except for
NET_DEVMEM, which only depends on it. There is no easy way to turn
GENERIC_ALLOCATOR on unless we select other unnecessary configs that
will select it.

Instead of depending on it, select it when NET_DEVMEM is enabled.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 net/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/Kconfig b/net/Kconfig
index 5b71a52987d3..ebc80a98fc91 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -68,8 +68,8 @@ config SKB_EXTENSIONS
 
 config NET_DEVMEM
 	def_bool y
+	select GENERIC_ALLOCATOR
 	depends on DMA_SHARED_BUFFER
-	depends on GENERIC_ALLOCATOR
 	depends on PAGE_POOL
 
 config NET_SHAPER
-- 
2.31.1



* [PATCH net-next V2 02/11] net: Add skb_can_coalesce for netmem
  2025-05-22 21:41 [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy Tariq Toukan
  2025-05-22 21:41 ` [PATCH net-next V2 01/11] net: Kconfig NET_DEVMEM selects GENERIC_ALLOCATOR Tariq Toukan
@ 2025-05-22 21:41 ` Tariq Toukan
  2025-05-22 23:09   ` Mina Almasry
  2025-05-22 21:41 ` [PATCH net-next V2 03/11] net/mlx5e: SHAMPO: Reorganize mlx5_rq_shampo_alloc Tariq Toukan
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 50+ messages in thread
From: Tariq Toukan @ 2025-05-22 21:41 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

From: Dragos Tatulea <dtatulea@nvidia.com>

Allow drivers that have moved over to netmem to do fragment coalescing.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 include/linux/skbuff.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)
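
For illustration, a minimal sketch of how a driver RX path might use the
new helper together with the existing netmem frag API. The helper below
(rx_add_netmem_frag) is hypothetical and not part of this patch; the
real mlx5e user appears later in the series:

#include <linux/skbuff.h>
#include <net/netmem.h>

/* Illustrative only: coalesce the new chunk into the last frag when it
 * is contiguous with it, otherwise add a new netmem-backed frag.
 */
static void rx_add_netmem_frag(struct sk_buff *skb, netmem_ref netmem,
			       unsigned int off, unsigned int len,
			       unsigned int truesize)
{
	int i = skb_shinfo(skb)->nr_frags;

	if (skb_can_coalesce_netmem(skb, i, netmem, off))
		skb_coalesce_rx_frag(skb, i - 1, len, truesize);
	else
		skb_add_rx_frag_netmem(skb, i, netmem, off, len, truesize);
}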

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 5520524c93bf..e8e2860183b4 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3887,6 +3887,18 @@ static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
 	return false;
 }
 
+static inline bool skb_can_coalesce_netmem(struct sk_buff *skb, int i,
+					   const netmem_ref netmem, int off)
+{
+	if (i) {
+		const skb_frag_t *frag = &skb_shinfo(skb)->frags[i - 1];
+
+		return netmem == skb_frag_netmem(frag) &&
+		       off == skb_frag_off(frag) + skb_frag_size(frag);
+	}
+	return false;
+}
+
 static inline int __skb_linearize(struct sk_buff *skb)
 {
 	return __pskb_pull_tail(skb, skb->data_len) ? 0 : -ENOMEM;
-- 
2.31.1



* [PATCH net-next V2 03/11] net/mlx5e: SHAMPO: Reorganize mlx5_rq_shampo_alloc
  2025-05-22 21:41 [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy Tariq Toukan
  2025-05-22 21:41 ` [PATCH net-next V2 01/11] net: Kconfig NET_DEVMEM selects GENERIC_ALLOCATOR Tariq Toukan
  2025-05-22 21:41 ` [PATCH net-next V2 02/11] net: Add skb_can_coalesce for netmem Tariq Toukan
@ 2025-05-22 21:41 ` Tariq Toukan
  2025-05-22 21:41 ` [PATCH net-next V2 04/11] net/mlx5e: SHAMPO: Remove redundant params Tariq Toukan
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 50+ messages in thread
From: Tariq Toukan @ 2025-05-22 21:41 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

From: Saeed Mahameed <saeedm@nvidia.com>

Drop redundant SHAMPO structure alloc/free functions.

Gather together the function calls pertaining to header split info, and
pass headers per WQE (hd_per_wqe) as a parameter to those functions to
avoid future use-before-initialization mistakes.

Allocate HW GRO related info outside of the header related info scope.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   1 -
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 135 +++++++++---------
 2 files changed, 66 insertions(+), 70 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 5b0d03b3efe8..211ea429ea89 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -638,7 +638,6 @@ struct mlx5e_shampo_hd {
 	struct mlx5e_frag_page *pages;
 	u32 hd_per_wq;
 	u16 hd_per_wqe;
-	u16 pages_per_wq;
 	unsigned long *bitmap;
 	u16 pi;
 	u16 ci;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index ea822c69d137..3d11c9f87171 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -331,47 +331,6 @@ static inline void mlx5e_build_umr_wqe(struct mlx5e_rq *rq,
 	ucseg->mkey_mask     = cpu_to_be64(MLX5_MKEY_MASK_FREE);
 }
 
-static int mlx5e_rq_shampo_hd_alloc(struct mlx5e_rq *rq, int node)
-{
-	rq->mpwqe.shampo = kvzalloc_node(sizeof(*rq->mpwqe.shampo),
-					 GFP_KERNEL, node);
-	if (!rq->mpwqe.shampo)
-		return -ENOMEM;
-	return 0;
-}
-
-static void mlx5e_rq_shampo_hd_free(struct mlx5e_rq *rq)
-{
-	kvfree(rq->mpwqe.shampo);
-}
-
-static int mlx5e_rq_shampo_hd_info_alloc(struct mlx5e_rq *rq, int node)
-{
-	struct mlx5e_shampo_hd *shampo = rq->mpwqe.shampo;
-
-	shampo->bitmap = bitmap_zalloc_node(shampo->hd_per_wq, GFP_KERNEL,
-					    node);
-	shampo->pages = kvzalloc_node(array_size(shampo->hd_per_wq,
-						 sizeof(*shampo->pages)),
-				     GFP_KERNEL, node);
-	if (!shampo->bitmap || !shampo->pages)
-		goto err_nomem;
-
-	return 0;
-
-err_nomem:
-	bitmap_free(shampo->bitmap);
-	kvfree(shampo->pages);
-
-	return -ENOMEM;
-}
-
-static void mlx5e_rq_shampo_hd_info_free(struct mlx5e_rq *rq)
-{
-	bitmap_free(rq->mpwqe.shampo->bitmap);
-	kvfree(rq->mpwqe.shampo->pages);
-}
-
 static int mlx5e_rq_alloc_mpwqe_info(struct mlx5e_rq *rq, int node)
 {
 	int wq_sz = mlx5_wq_ll_get_size(&rq->mpwqe.wq);
@@ -584,19 +543,18 @@ static int mlx5e_create_rq_umr_mkey(struct mlx5_core_dev *mdev, struct mlx5e_rq
 }
 
 static int mlx5e_create_rq_hd_umr_mkey(struct mlx5_core_dev *mdev,
-				       struct mlx5e_rq *rq)
+				       u16 hd_per_wq, u32 *umr_mkey)
 {
 	u32 max_ksm_size = BIT(MLX5_CAP_GEN(mdev, log_max_klm_list_size));
 
-	if (max_ksm_size < rq->mpwqe.shampo->hd_per_wq) {
+	if (max_ksm_size < hd_per_wq) {
 		mlx5_core_err(mdev, "max ksm list size 0x%x is smaller than shampo header buffer list size 0x%x\n",
-			      max_ksm_size, rq->mpwqe.shampo->hd_per_wq);
+			      max_ksm_size, hd_per_wq);
 		return -EINVAL;
 	}
-
-	return mlx5e_create_umr_ksm_mkey(mdev, rq->mpwqe.shampo->hd_per_wq,
+	return mlx5e_create_umr_ksm_mkey(mdev, hd_per_wq,
 					 MLX5E_SHAMPO_LOG_HEADER_ENTRY_SIZE,
-					 &rq->mpwqe.shampo->mkey);
+					 umr_mkey);
 }
 
 static void mlx5e_init_frags_partition(struct mlx5e_rq *rq)
@@ -758,6 +716,35 @@ static int mlx5e_init_rxq_rq(struct mlx5e_channel *c, struct mlx5e_params *param
 				  xdp_frag_size);
 }
 
+static int mlx5e_rq_shampo_hd_info_alloc(struct mlx5e_rq *rq, u16 hd_per_wq,
+					 int node)
+{
+	struct mlx5e_shampo_hd *shampo = rq->mpwqe.shampo;
+
+	shampo->hd_per_wq = hd_per_wq;
+
+	shampo->bitmap = bitmap_zalloc_node(hd_per_wq, GFP_KERNEL, node);
+	shampo->pages = kvzalloc_node(array_size(hd_per_wq,
+						 sizeof(*shampo->pages)),
+				      GFP_KERNEL, node);
+	if (!shampo->bitmap || !shampo->pages)
+		goto err_nomem;
+
+	return 0;
+
+err_nomem:
+	kvfree(shampo->pages);
+	bitmap_free(shampo->bitmap);
+
+	return -ENOMEM;
+}
+
+static void mlx5e_rq_shampo_hd_info_free(struct mlx5e_rq *rq)
+{
+	kvfree(rq->mpwqe.shampo->pages);
+	bitmap_free(rq->mpwqe.shampo->bitmap);
+}
+
 static int mlx5_rq_shampo_alloc(struct mlx5_core_dev *mdev,
 				struct mlx5e_params *params,
 				struct mlx5e_rq_param *rqp,
@@ -765,42 +752,52 @@ static int mlx5_rq_shampo_alloc(struct mlx5_core_dev *mdev,
 				u32 *pool_size,
 				int node)
 {
+	void *wqc = MLX5_ADDR_OF(rqc, rqp->rqc, wq);
+	u16 hd_per_wq;
+	int wq_size;
 	int err;
 
 	if (!test_bit(MLX5E_RQ_STATE_SHAMPO, &rq->state))
 		return 0;
-	err = mlx5e_rq_shampo_hd_alloc(rq, node);
-	if (err)
-		goto out;
-	rq->mpwqe.shampo->hd_per_wq =
-		mlx5e_shampo_hd_per_wq(mdev, params, rqp);
-	err = mlx5e_create_rq_hd_umr_mkey(mdev, rq);
+
+	rq->mpwqe.shampo = kvzalloc_node(sizeof(*rq->mpwqe.shampo),
+					 GFP_KERNEL, node);
+	if (!rq->mpwqe.shampo)
+		return -ENOMEM;
+
+	/* split headers data structures */
+	hd_per_wq = mlx5e_shampo_hd_per_wq(mdev, params, rqp);
+	err = mlx5e_rq_shampo_hd_info_alloc(rq, hd_per_wq, node);
 	if (err)
-		goto err_shampo_hd;
-	err = mlx5e_rq_shampo_hd_info_alloc(rq, node);
+		goto err_shampo_hd_info_alloc;
+
+	err = mlx5e_create_rq_hd_umr_mkey(mdev, hd_per_wq,
+					  &rq->mpwqe.shampo->mkey);
 	if (err)
-		goto err_shampo_info;
+		goto err_umr_mkey;
+
+	rq->mpwqe.shampo->key = cpu_to_be32(rq->mpwqe.shampo->mkey);
+	rq->mpwqe.shampo->hd_per_wqe =
+		mlx5e_shampo_hd_per_wqe(mdev, params, rqp);
+	wq_size = BIT(MLX5_GET(wq, wqc, log_wq_sz));
+	*pool_size += (rq->mpwqe.shampo->hd_per_wqe * wq_size) /
+		     MLX5E_SHAMPO_WQ_HEADER_PER_PAGE;
+
+	/* gro only data structures */
 	rq->hw_gro_data = kvzalloc_node(sizeof(*rq->hw_gro_data), GFP_KERNEL, node);
 	if (!rq->hw_gro_data) {
 		err = -ENOMEM;
 		goto err_hw_gro_data;
 	}
-	rq->mpwqe.shampo->key =
-		cpu_to_be32(rq->mpwqe.shampo->mkey);
-	rq->mpwqe.shampo->hd_per_wqe =
-		mlx5e_shampo_hd_per_wqe(mdev, params, rqp);
-	rq->mpwqe.shampo->pages_per_wq =
-		rq->mpwqe.shampo->hd_per_wq / MLX5E_SHAMPO_WQ_HEADER_PER_PAGE;
-	*pool_size += rq->mpwqe.shampo->pages_per_wq;
+
 	return 0;
 
 err_hw_gro_data:
-	mlx5e_rq_shampo_hd_info_free(rq);
-err_shampo_info:
 	mlx5_core_destroy_mkey(mdev, rq->mpwqe.shampo->mkey);
-err_shampo_hd:
-	mlx5e_rq_shampo_hd_free(rq);
-out:
+err_umr_mkey:
+	mlx5e_rq_shampo_hd_info_free(rq);
+err_shampo_hd_info_alloc:
+	kvfree(rq->mpwqe.shampo);
 	return err;
 }
 
@@ -812,7 +809,7 @@ static void mlx5e_rq_free_shampo(struct mlx5e_rq *rq)
 	kvfree(rq->hw_gro_data);
 	mlx5e_rq_shampo_hd_info_free(rq);
 	mlx5_core_destroy_mkey(rq->mdev, rq->mpwqe.shampo->mkey);
-	mlx5e_rq_shampo_hd_free(rq);
+	kvfree(rq->mpwqe.shampo);
 }
 
 static int mlx5e_alloc_rq(struct mlx5e_params *params,
-- 
2.31.1



* [PATCH net-next V2 04/11] net/mlx5e: SHAMPO: Remove redundant params
  2025-05-22 21:41 [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy Tariq Toukan
                   ` (2 preceding siblings ...)
  2025-05-22 21:41 ` [PATCH net-next V2 03/11] net/mlx5e: SHAMPO: Reorganize mlx5_rq_shampo_alloc Tariq Toukan
@ 2025-05-22 21:41 ` Tariq Toukan
  2025-05-22 21:41 ` [PATCH net-next V2 05/11] net/mlx5e: SHAMPO: Improve hw gro capability checking Tariq Toukan
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 50+ messages in thread
From: Tariq Toukan @ 2025-05-22 21:41 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

From: Saeed Mahameed <saeedm@nvidia.com>

Two SHAMPO params are static and always the same; remove them from the
global mlx5e_params struct.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  4 ---
 .../ethernet/mellanox/mlx5/core/en/params.c   | 36 ++++++++++---------
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  4 ---
 3 files changed, 20 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 211ea429ea89..581eef34f512 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -278,10 +278,6 @@ enum packet_merge {
 struct mlx5e_packet_merge_param {
 	enum packet_merge type;
 	u32 timeout;
-	struct {
-		u8 match_criteria_type;
-		u8 alignment_granularity;
-	} shampo;
 };
 
 struct mlx5e_params {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
index 58ec5e44aa7a..fc945bce933a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
@@ -901,6 +901,7 @@ int mlx5e_build_rq_param(struct mlx5_core_dev *mdev,
 {
 	void *rqc = param->rqc;
 	void *wq = MLX5_ADDR_OF(rqc, rqc, wq);
+	u32 lro_timeout;
 	int ndsegs = 1;
 	int err;
 
@@ -926,22 +927,25 @@ int mlx5e_build_rq_param(struct mlx5_core_dev *mdev,
 		MLX5_SET(wq, wq, log_wqe_stride_size,
 			 log_wqe_stride_size - MLX5_MPWQE_LOG_STRIDE_SZ_BASE);
 		MLX5_SET(wq, wq, log_wq_sz, mlx5e_mpwqe_get_log_rq_size(mdev, params, xsk));
-		if (params->packet_merge.type == MLX5E_PACKET_MERGE_SHAMPO) {
-			MLX5_SET(wq, wq, shampo_enable, true);
-			MLX5_SET(wq, wq, log_reservation_size,
-				 mlx5e_shampo_get_log_rsrv_size(mdev, params));
-			MLX5_SET(wq, wq,
-				 log_max_num_of_packets_per_reservation,
-				 mlx5e_shampo_get_log_pkt_per_rsrv(mdev, params));
-			MLX5_SET(wq, wq, log_headers_entry_size,
-				 mlx5e_shampo_get_log_hd_entry_size(mdev, params));
-			MLX5_SET(rqc, rqc, reservation_timeout,
-				 mlx5e_choose_lro_timeout(mdev, MLX5E_DEFAULT_SHAMPO_TIMEOUT));
-			MLX5_SET(rqc, rqc, shampo_match_criteria_type,
-				 params->packet_merge.shampo.match_criteria_type);
-			MLX5_SET(rqc, rqc, shampo_no_match_alignment_granularity,
-				 params->packet_merge.shampo.alignment_granularity);
-		}
+		if (params->packet_merge.type != MLX5E_PACKET_MERGE_SHAMPO)
+			break;
+
+		MLX5_SET(wq, wq, shampo_enable, true);
+		MLX5_SET(wq, wq, log_reservation_size,
+			 mlx5e_shampo_get_log_rsrv_size(mdev, params));
+		MLX5_SET(wq, wq,
+			 log_max_num_of_packets_per_reservation,
+			 mlx5e_shampo_get_log_pkt_per_rsrv(mdev, params));
+		MLX5_SET(wq, wq, log_headers_entry_size,
+			 mlx5e_shampo_get_log_hd_entry_size(mdev, params));
+		lro_timeout =
+			mlx5e_choose_lro_timeout(mdev,
+						 MLX5E_DEFAULT_SHAMPO_TIMEOUT);
+		MLX5_SET(rqc, rqc, reservation_timeout, lro_timeout);
+		MLX5_SET(rqc, rqc, shampo_match_criteria_type,
+			 MLX5_RQC_SHAMPO_MATCH_CRITERIA_TYPE_EXTENDED);
+		MLX5_SET(rqc, rqc, shampo_no_match_alignment_granularity,
+			 MLX5_RQC_SHAMPO_NO_MATCH_ALIGNMENT_GRANULARITY_STRIDE);
 		break;
 	}
 	default: /* MLX5_WQ_TYPE_CYCLIC */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 3d11c9f87171..e1e44533b744 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4040,10 +4040,6 @@ static int set_feature_hw_gro(struct net_device *netdev, bool enable)
 
 	if (enable) {
 		new_params.packet_merge.type = MLX5E_PACKET_MERGE_SHAMPO;
-		new_params.packet_merge.shampo.match_criteria_type =
-			MLX5_RQC_SHAMPO_MATCH_CRITERIA_TYPE_EXTENDED;
-		new_params.packet_merge.shampo.alignment_granularity =
-			MLX5_RQC_SHAMPO_NO_MATCH_ALIGNMENT_GRANULARITY_STRIDE;
 	} else if (new_params.packet_merge.type == MLX5E_PACKET_MERGE_SHAMPO) {
 		new_params.packet_merge.type = MLX5E_PACKET_MERGE_NONE;
 	} else {
-- 
2.31.1



* [PATCH net-next V2 05/11] net/mlx5e: SHAMPO: Improve hw gro capability checking
  2025-05-22 21:41 [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy Tariq Toukan
                   ` (3 preceding siblings ...)
  2025-05-22 21:41 ` [PATCH net-next V2 04/11] net/mlx5e: SHAMPO: Remove redundant params Tariq Toukan
@ 2025-05-22 21:41 ` Tariq Toukan
  2025-05-22 21:41 ` [PATCH net-next V2 06/11] net/mlx5e: SHAMPO: Separate pool for headers Tariq Toukan
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 50+ messages in thread
From: Tariq Toukan @ 2025-05-22 21:41 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

From: Saeed Mahameed <saeedm@nvidia.com>

Add missing HW capability checks, and declare the feature in
netdev->vlan_features, similar to other features in
mlx5e_build_nic_netdev. No functional change here, as all features that
are disabled by default are explicitly disabled at the bottom of the
function.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index e1e44533b744..a81d354af7c8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -78,7 +78,8 @@
 
 static bool mlx5e_hw_gro_supported(struct mlx5_core_dev *mdev)
 {
-	if (!MLX5_CAP_GEN(mdev, shampo))
+	if (!MLX5_CAP_GEN(mdev, shampo) ||
+	    !MLX5_CAP_SHAMPO(mdev, shampo_header_split_data_merge))
 		return false;
 
 	/* Our HW-GRO implementation relies on "KSM Mkey" for
@@ -5499,17 +5500,17 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
 						   MLX5E_MPWRQ_UMR_MODE_ALIGNED))
 		netdev->vlan_features    |= NETIF_F_LRO;
 
+	if (mlx5e_hw_gro_supported(mdev) &&
+	    mlx5e_check_fragmented_striding_rq_cap(mdev, PAGE_SHIFT,
+						   MLX5E_MPWRQ_UMR_MODE_ALIGNED))
+		netdev->vlan_features |= NETIF_F_GRO_HW;
+
 	netdev->hw_features       = netdev->vlan_features;
 	netdev->hw_features      |= NETIF_F_HW_VLAN_CTAG_TX;
 	netdev->hw_features      |= NETIF_F_HW_VLAN_CTAG_RX;
 	netdev->hw_features      |= NETIF_F_HW_VLAN_CTAG_FILTER;
 	netdev->hw_features      |= NETIF_F_HW_VLAN_STAG_TX;
 
-	if (mlx5e_hw_gro_supported(mdev) &&
-	    mlx5e_check_fragmented_striding_rq_cap(mdev, PAGE_SHIFT,
-						   MLX5E_MPWRQ_UMR_MODE_ALIGNED))
-		netdev->hw_features    |= NETIF_F_GRO_HW;
-
 	if (mlx5e_tunnel_any_tx_proto_supported(mdev)) {
 		netdev->hw_enc_features |= NETIF_F_HW_CSUM;
 		netdev->hw_enc_features |= NETIF_F_TSO;
-- 
2.31.1



* [PATCH net-next V2 06/11] net/mlx5e: SHAMPO: Separate pool for headers
  2025-05-22 21:41 [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy Tariq Toukan
                   ` (4 preceding siblings ...)
  2025-05-22 21:41 ` [PATCH net-next V2 05/11] net/mlx5e: SHAMPO: Improve hw gro capability checking Tariq Toukan
@ 2025-05-22 21:41 ` Tariq Toukan
  2025-05-22 22:30   ` Jakub Kicinski
  2025-05-22 21:41 ` [PATCH net-next V2 07/11] net/mlx5e: SHAMPO: Headers page pool stats Tariq Toukan
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 50+ messages in thread
From: Tariq Toukan @ 2025-05-22 21:41 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

From: Saeed Mahameed <saeedm@nvidia.com>

Allocate a separate page pool for headers when SHAMPO is enabled.
This will be useful for adding support for a ZC page pool, which has to
be different from the headers page pool.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  4 ++
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 37 ++++++++++++++---
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 41 +++++++++++--------
 3 files changed, 59 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 581eef34f512..c329de1d4f0a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -716,7 +716,11 @@ struct mlx5e_rq {
 	struct bpf_prog __rcu *xdp_prog;
 	struct mlx5e_xdpsq    *xdpsq;
 	DECLARE_BITMAP(flags, 8);
+
+	/* page pools */
 	struct page_pool      *page_pool;
+	struct page_pool      *hd_page_pool;
+
 	struct mlx5e_xdp_buff mxbuf;
 
 	/* AF_XDP zero-copy */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index a81d354af7c8..9e2975782a82 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -750,12 +750,10 @@ static int mlx5_rq_shampo_alloc(struct mlx5_core_dev *mdev,
 				struct mlx5e_params *params,
 				struct mlx5e_rq_param *rqp,
 				struct mlx5e_rq *rq,
-				u32 *pool_size,
 				int node)
 {
 	void *wqc = MLX5_ADDR_OF(rqc, rqp->rqc, wq);
 	u16 hd_per_wq;
-	int wq_size;
 	int err;
 
 	if (!test_bit(MLX5E_RQ_STATE_SHAMPO, &rq->state))
@@ -780,9 +778,33 @@ static int mlx5_rq_shampo_alloc(struct mlx5_core_dev *mdev,
 	rq->mpwqe.shampo->key = cpu_to_be32(rq->mpwqe.shampo->mkey);
 	rq->mpwqe.shampo->hd_per_wqe =
 		mlx5e_shampo_hd_per_wqe(mdev, params, rqp);
-	wq_size = BIT(MLX5_GET(wq, wqc, log_wq_sz));
-	*pool_size += (rq->mpwqe.shampo->hd_per_wqe * wq_size) /
-		     MLX5E_SHAMPO_WQ_HEADER_PER_PAGE;
+
+	/* separate page pool for shampo headers */
+	{
+		int wq_size = BIT(MLX5_GET(wq, wqc, log_wq_sz));
+		struct page_pool_params pp_params = { };
+		u32 pool_size;
+
+		pool_size = (rq->mpwqe.shampo->hd_per_wqe * wq_size) /
+				MLX5E_SHAMPO_WQ_HEADER_PER_PAGE;
+
+		pp_params.order     = 0;
+		pp_params.flags     = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
+		pp_params.pool_size = pool_size;
+		pp_params.nid       = node;
+		pp_params.dev       = rq->pdev;
+		pp_params.napi      = rq->cq.napi;
+		pp_params.netdev    = rq->netdev;
+		pp_params.dma_dir   = rq->buff.map_dir;
+		pp_params.max_len   = PAGE_SIZE;
+
+		rq->hd_page_pool = page_pool_create(&pp_params);
+		if (IS_ERR(rq->hd_page_pool)) {
+			err = PTR_ERR(rq->hd_page_pool);
+			rq->hd_page_pool = NULL;
+			goto err_hds_page_pool;
+		}
+	}
 
 	/* gro only data structures */
 	rq->hw_gro_data = kvzalloc_node(sizeof(*rq->hw_gro_data), GFP_KERNEL, node);
@@ -794,6 +816,8 @@ static int mlx5_rq_shampo_alloc(struct mlx5_core_dev *mdev,
 	return 0;
 
 err_hw_gro_data:
+	page_pool_destroy(rq->hd_page_pool);
+err_hds_page_pool:
 	mlx5_core_destroy_mkey(mdev, rq->mpwqe.shampo->mkey);
 err_umr_mkey:
 	mlx5e_rq_shampo_hd_info_free(rq);
@@ -808,6 +832,7 @@ static void mlx5e_rq_free_shampo(struct mlx5e_rq *rq)
 		return;
 
 	kvfree(rq->hw_gro_data);
+	page_pool_destroy(rq->hd_page_pool);
 	mlx5e_rq_shampo_hd_info_free(rq);
 	mlx5_core_destroy_mkey(rq->mdev, rq->mpwqe.shampo->mkey);
 	kvfree(rq->mpwqe.shampo);
@@ -887,7 +912,7 @@ static int mlx5e_alloc_rq(struct mlx5e_params *params,
 		if (err)
 			goto err_rq_mkey;
 
-		err = mlx5_rq_shampo_alloc(mdev, params, rqp, rq, &pool_size, node);
+		err = mlx5_rq_shampo_alloc(mdev, params, rqp, rq, node);
 		if (err)
 			goto err_free_mpwqe_info;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 84b1ab8233b8..e34ef53ebd0e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -273,12 +273,12 @@ static inline u32 mlx5e_decompress_cqes_start(struct mlx5e_rq *rq,
 
 #define MLX5E_PAGECNT_BIAS_MAX (PAGE_SIZE / 64)
 
-static int mlx5e_page_alloc_fragmented(struct mlx5e_rq *rq,
+static int mlx5e_page_alloc_fragmented(struct page_pool *pool,
 				       struct mlx5e_frag_page *frag_page)
 {
 	struct page *page;
 
-	page = page_pool_dev_alloc_pages(rq->page_pool);
+	page = page_pool_dev_alloc_pages(pool);
 	if (unlikely(!page))
 		return -ENOMEM;
 
@@ -292,14 +292,14 @@ static int mlx5e_page_alloc_fragmented(struct mlx5e_rq *rq,
 	return 0;
 }
 
-static void mlx5e_page_release_fragmented(struct mlx5e_rq *rq,
+static void mlx5e_page_release_fragmented(struct page_pool *pool,
 					  struct mlx5e_frag_page *frag_page)
 {
 	u16 drain_count = MLX5E_PAGECNT_BIAS_MAX - frag_page->frags;
 	struct page *page = frag_page->page;
 
 	if (page_pool_unref_page(page, drain_count) == 0)
-		page_pool_put_unrefed_page(rq->page_pool, page, -1, true);
+		page_pool_put_unrefed_page(pool, page, -1, true);
 }
 
 static inline int mlx5e_get_rx_frag(struct mlx5e_rq *rq,
@@ -313,7 +313,8 @@ static inline int mlx5e_get_rx_frag(struct mlx5e_rq *rq,
 		 * offset) should just use the new one without replenishing again
 		 * by themselves.
 		 */
-		err = mlx5e_page_alloc_fragmented(rq, frag->frag_page);
+		err = mlx5e_page_alloc_fragmented(rq->page_pool,
+						  frag->frag_page);
 
 	return err;
 }
@@ -332,7 +333,7 @@ static inline void mlx5e_put_rx_frag(struct mlx5e_rq *rq,
 				     struct mlx5e_wqe_frag_info *frag)
 {
 	if (mlx5e_frag_can_release(frag))
-		mlx5e_page_release_fragmented(rq, frag->frag_page);
+		mlx5e_page_release_fragmented(rq->page_pool, frag->frag_page);
 }
 
 static inline struct mlx5e_wqe_frag_info *get_frag(struct mlx5e_rq *rq, u16 ix)
@@ -584,7 +585,8 @@ mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi)
 				struct mlx5e_frag_page *frag_page;
 
 				frag_page = &wi->alloc_units.frag_pages[i];
-				mlx5e_page_release_fragmented(rq, frag_page);
+				mlx5e_page_release_fragmented(rq->page_pool,
+							      frag_page);
 			}
 		}
 	}
@@ -679,11 +681,10 @@ static int mlx5e_build_shampo_hd_umr(struct mlx5e_rq *rq,
 		struct mlx5e_frag_page *frag_page = mlx5e_shampo_hd_to_frag_page(rq, index);
 		u64 addr;
 
-		err = mlx5e_page_alloc_fragmented(rq, frag_page);
+		err = mlx5e_page_alloc_fragmented(rq->hd_page_pool, frag_page);
 		if (unlikely(err))
 			goto err_unmap;
 
-
 		addr = page_pool_get_dma_addr(frag_page->page);
 
 		for (int j = 0; j < MLX5E_SHAMPO_WQ_HEADER_PER_PAGE; j++) {
@@ -715,7 +716,8 @@ static int mlx5e_build_shampo_hd_umr(struct mlx5e_rq *rq,
 		if (!header_offset) {
 			struct mlx5e_frag_page *frag_page = mlx5e_shampo_hd_to_frag_page(rq, index);
 
-			mlx5e_page_release_fragmented(rq, frag_page);
+			mlx5e_page_release_fragmented(rq->hd_page_pool,
+						      frag_page);
 		}
 	}
 
@@ -791,7 +793,7 @@ static int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 	for (i = 0; i < rq->mpwqe.pages_per_wqe; i++, frag_page++) {
 		dma_addr_t addr;
 
-		err = mlx5e_page_alloc_fragmented(rq, frag_page);
+		err = mlx5e_page_alloc_fragmented(rq->page_pool, frag_page);
 		if (unlikely(err))
 			goto err_unmap;
 		addr = page_pool_get_dma_addr(frag_page->page);
@@ -836,7 +838,7 @@ static int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 err_unmap:
 	while (--i >= 0) {
 		frag_page--;
-		mlx5e_page_release_fragmented(rq, frag_page);
+		mlx5e_page_release_fragmented(rq->page_pool, frag_page);
 	}
 
 	bitmap_fill(wi->skip_release_bitmap, rq->mpwqe.pages_per_wqe);
@@ -855,7 +857,7 @@ mlx5e_free_rx_shampo_hd_entry(struct mlx5e_rq *rq, u16 header_index)
 	if (((header_index + 1) & (MLX5E_SHAMPO_WQ_HEADER_PER_PAGE - 1)) == 0) {
 		struct mlx5e_frag_page *frag_page = mlx5e_shampo_hd_to_frag_page(rq, header_index);
 
-		mlx5e_page_release_fragmented(rq, frag_page);
+		mlx5e_page_release_fragmented(rq->hd_page_pool, frag_page);
 	}
 	clear_bit(header_index, shampo->bitmap);
 }
@@ -1100,6 +1102,8 @@ INDIRECT_CALLABLE_SCOPE bool mlx5e_post_rx_mpwqes(struct mlx5e_rq *rq)
 
 	if (rq->page_pool)
 		page_pool_nid_changed(rq->page_pool, numa_mem_id());
+	if (rq->hd_page_pool)
+		page_pool_nid_changed(rq->hd_page_pool, numa_mem_id());
 
 	head = rq->mpwqe.actual_wq_head;
 	i = missing;
@@ -2004,7 +2008,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 	if (prog) {
 		/* area for bpf_xdp_[store|load]_bytes */
 		net_prefetchw(page_address(frag_page->page) + frag_offset);
-		if (unlikely(mlx5e_page_alloc_fragmented(rq, &wi->linear_page))) {
+		if (unlikely(mlx5e_page_alloc_fragmented(rq->page_pool,
+							 &wi->linear_page))) {
 			rq->stats->buff_alloc_err++;
 			return NULL;
 		}
@@ -2068,7 +2073,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 
 				wi->linear_page.frags++;
 			}
-			mlx5e_page_release_fragmented(rq, &wi->linear_page);
+			mlx5e_page_release_fragmented(rq->page_pool,
+						      &wi->linear_page);
 			return NULL; /* page/packet was consumed by XDP */
 		}
 
@@ -2077,13 +2083,14 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 			mxbuf->xdp.data - mxbuf->xdp.data_hard_start, 0,
 			mxbuf->xdp.data - mxbuf->xdp.data_meta);
 		if (unlikely(!skb)) {
-			mlx5e_page_release_fragmented(rq, &wi->linear_page);
+			mlx5e_page_release_fragmented(rq->page_pool,
+						      &wi->linear_page);
 			return NULL;
 		}
 
 		skb_mark_for_recycle(skb);
 		wi->linear_page.frags++;
-		mlx5e_page_release_fragmented(rq, &wi->linear_page);
+		mlx5e_page_release_fragmented(rq->page_pool, &wi->linear_page);
 
 		if (xdp_buff_has_frags(&mxbuf->xdp)) {
 			struct mlx5e_frag_page *pagep;
-- 
2.31.1



* [PATCH net-next V2 07/11] net/mlx5e: SHAMPO: Headers page pool stats
  2025-05-22 21:41 [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy Tariq Toukan
                   ` (5 preceding siblings ...)
  2025-05-22 21:41 ` [PATCH net-next V2 06/11] net/mlx5e: SHAMPO: Separate pool for headers Tariq Toukan
@ 2025-05-22 21:41 ` Tariq Toukan
  2025-05-22 22:31   ` Jakub Kicinski
  2025-05-22 21:41 ` [PATCH net-next V2 08/11] net/mlx5e: Convert over to netmem Tariq Toukan
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 50+ messages in thread
From: Tariq Toukan @ 2025-05-22 21:41 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

From: Saeed Mahameed <saeedm@nvidia.com>

Expose the stats of the new headers page pool.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../ethernet/mellanox/mlx5/core/en_stats.c    | 53 +++++++++++++++++++
 .../ethernet/mellanox/mlx5/core/en_stats.h    | 24 +++++++++
 2 files changed, 77 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index 19664fa7f217..dcfe86d6dc83 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -205,6 +205,18 @@ static const struct counter_desc sw_stats_desc[] = {
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_pp_recycle_ring) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_pp_recycle_ring_full) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_pp_recycle_released_ref) },
+
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_pp_hd_alloc_fast) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_pp_hd_alloc_slow) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_pp_hd_alloc_slow_high_order) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_pp_hd_alloc_empty) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_pp_hd_alloc_refill) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_pp_hd_alloc_waive) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_pp_hd_recycle_cached) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_pp_hd_recycle_cache_full) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_pp_hd_recycle_ring) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_pp_hd_recycle_ring_full) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_pp_hd_recycle_released_ref) },
 #ifdef CONFIG_MLX5_EN_TLS
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_tls_decrypted_packets) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_tls_decrypted_bytes) },
@@ -384,6 +396,18 @@ static void mlx5e_stats_grp_sw_update_stats_rq_stats(struct mlx5e_sw_stats *s,
 	s->rx_pp_recycle_ring			+= rq_stats->pp_recycle_ring;
 	s->rx_pp_recycle_ring_full		+= rq_stats->pp_recycle_ring_full;
 	s->rx_pp_recycle_released_ref		+= rq_stats->pp_recycle_released_ref;
+
+	s->rx_pp_hd_alloc_fast          += rq_stats->pp_hd_alloc_fast;
+	s->rx_pp_hd_alloc_slow          += rq_stats->pp_hd_alloc_slow;
+	s->rx_pp_hd_alloc_empty         += rq_stats->pp_hd_alloc_empty;
+	s->rx_pp_hd_alloc_refill        += rq_stats->pp_hd_alloc_refill;
+	s->rx_pp_hd_alloc_waive         += rq_stats->pp_hd_alloc_waive;
+	s->rx_pp_hd_alloc_slow_high_order	+= rq_stats->pp_hd_alloc_slow_high_order;
+	s->rx_pp_hd_recycle_cached		+= rq_stats->pp_hd_recycle_cached;
+	s->rx_pp_hd_recycle_cache_full		+= rq_stats->pp_hd_recycle_cache_full;
+	s->rx_pp_hd_recycle_ring		+= rq_stats->pp_hd_recycle_ring;
+	s->rx_pp_hd_recycle_ring_full		+= rq_stats->pp_hd_recycle_ring_full;
+	s->rx_pp_hd_recycle_released_ref	+= rq_stats->pp_hd_recycle_released_ref;
 #ifdef CONFIG_MLX5_EN_TLS
 	s->rx_tls_decrypted_packets   += rq_stats->tls_decrypted_packets;
 	s->rx_tls_decrypted_bytes     += rq_stats->tls_decrypted_bytes;
@@ -511,6 +535,23 @@ static void mlx5e_stats_update_stats_rq_page_pool(struct mlx5e_channel *c)
 	rq_stats->pp_recycle_ring = stats.recycle_stats.ring;
 	rq_stats->pp_recycle_ring_full = stats.recycle_stats.ring_full;
 	rq_stats->pp_recycle_released_ref = stats.recycle_stats.released_refcnt;
+
+	pool = c->rq.hd_page_pool;
+	if (!pool || !page_pool_get_stats(pool, &stats))
+		return;
+
+	rq_stats->pp_hd_alloc_fast = stats.alloc_stats.fast;
+	rq_stats->pp_hd_alloc_slow = stats.alloc_stats.slow;
+	rq_stats->pp_hd_alloc_slow_high_order = stats.alloc_stats.slow_high_order;
+	rq_stats->pp_hd_alloc_empty = stats.alloc_stats.empty;
+	rq_stats->pp_hd_alloc_waive = stats.alloc_stats.waive;
+	rq_stats->pp_hd_alloc_refill = stats.alloc_stats.refill;
+
+	rq_stats->pp_hd_recycle_cached = stats.recycle_stats.cached;
+	rq_stats->pp_hd_recycle_cache_full = stats.recycle_stats.cache_full;
+	rq_stats->pp_hd_recycle_ring = stats.recycle_stats.ring;
+	rq_stats->pp_hd_recycle_ring_full = stats.recycle_stats.ring_full;
+	rq_stats->pp_hd_recycle_released_ref = stats.recycle_stats.released_refcnt;
 }
 
 static MLX5E_DECLARE_STATS_GRP_OP_UPDATE_STATS(sw)
@@ -2130,6 +2171,18 @@ static const struct counter_desc rq_stats_desc[] = {
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, pp_recycle_ring) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, pp_recycle_ring_full) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, pp_recycle_released_ref) },
+
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, pp_hd_alloc_fast) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, pp_hd_alloc_slow) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, pp_hd_alloc_slow_high_order) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, pp_hd_alloc_empty) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, pp_hd_alloc_refill) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, pp_hd_alloc_waive) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, pp_hd_recycle_cached) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, pp_hd_recycle_cache_full) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, pp_hd_recycle_ring) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, pp_hd_recycle_ring_full) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, pp_hd_recycle_released_ref) },
 #ifdef CONFIG_MLX5_EN_TLS
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, tls_decrypted_packets) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, tls_decrypted_bytes) },
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index def5dea1463d..113221dfcdfa 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
@@ -226,6 +226,18 @@ struct mlx5e_sw_stats {
 	u64 rx_pp_recycle_ring;
 	u64 rx_pp_recycle_ring_full;
 	u64 rx_pp_recycle_released_ref;
+
+	u64 rx_pp_hd_alloc_fast;
+	u64 rx_pp_hd_alloc_slow;
+	u64 rx_pp_hd_alloc_slow_high_order;
+	u64 rx_pp_hd_alloc_empty;
+	u64 rx_pp_hd_alloc_refill;
+	u64 rx_pp_hd_alloc_waive;
+	u64 rx_pp_hd_recycle_cached;
+	u64 rx_pp_hd_recycle_cache_full;
+	u64 rx_pp_hd_recycle_ring;
+	u64 rx_pp_hd_recycle_ring_full;
+	u64 rx_pp_hd_recycle_released_ref;
 #ifdef CONFIG_MLX5_EN_TLS
 	u64 tx_tls_encrypted_packets;
 	u64 tx_tls_encrypted_bytes;
@@ -394,6 +406,18 @@ struct mlx5e_rq_stats {
 	u64 pp_recycle_ring;
 	u64 pp_recycle_ring_full;
 	u64 pp_recycle_released_ref;
+
+	u64 pp_hd_alloc_fast;
+	u64 pp_hd_alloc_slow;
+	u64 pp_hd_alloc_slow_high_order;
+	u64 pp_hd_alloc_empty;
+	u64 pp_hd_alloc_refill;
+	u64 pp_hd_alloc_waive;
+	u64 pp_hd_recycle_cached;
+	u64 pp_hd_recycle_cache_full;
+	u64 pp_hd_recycle_ring;
+	u64 pp_hd_recycle_ring_full;
+	u64 pp_hd_recycle_released_ref;
 #ifdef CONFIG_MLX5_EN_TLS
 	u64 tls_decrypted_packets;
 	u64 tls_decrypted_bytes;
-- 
2.31.1



* [PATCH net-next V2 08/11] net/mlx5e: Convert over to netmem
  2025-05-22 21:41 [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy Tariq Toukan
                   ` (6 preceding siblings ...)
  2025-05-22 21:41 ` [PATCH net-next V2 07/11] net/mlx5e: SHAMPO: Headers page pool stats Tariq Toukan
@ 2025-05-22 21:41 ` Tariq Toukan
  2025-05-22 23:18   ` Mina Almasry
  2025-05-22 21:41 ` [PATCH net-next V2 09/11] net/mlx5e: Add support for UNREADABLE netmem page pools Tariq Toukan
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 50+ messages in thread
From: Tariq Toukan @ 2025-05-22 21:41 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

From: Saeed Mahameed <saeedm@nvidia.com>

mlx5e_frag_page holds the physical page itself. To naturally support ZC
page pools, remove the physical page reference from mlx5 and replace it
with a netmem_ref, to avoid internal handling in mlx5 for net_iov backed
pages.

SHAMPO can issue packets that are not split into header and data. These
packets will be dropped if the data part resides in a net_iov as the
driver can't read into this area.

No performance degradation observed.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   2 +-
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 103 ++++++++++--------
 2 files changed, 61 insertions(+), 44 deletions(-)
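
To make the drop rule above concrete, a small sketch (illustrative only,
not the exact driver code) of the guard that any payload-reading path
applies once RX buffers may be net_iov backed:

#include <net/netmem.h>

/* Hypothetical helper: return a CPU-readable pointer into an RX buffer,
 * or NULL when the buffer is an unreadable net_iov (devmem / io_uring
 * zero-copy memory); the caller must then drop or avoid touching the
 * payload.
 */
static void *rx_buf_virt(netmem_ref netmem, unsigned int offset)
{
	if (netmem_is_net_iov(netmem))
		return NULL;
	return netmem_address(netmem) + offset;
}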

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index c329de1d4f0a..65a73913b9a2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -553,7 +553,7 @@ struct mlx5e_icosq {
 } ____cacheline_aligned_in_smp;
 
 struct mlx5e_frag_page {
-	struct page *page;
+	netmem_ref netmem;
 	u16 frags;
 };
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index e34ef53ebd0e..75e753adedef 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -273,33 +273,33 @@ static inline u32 mlx5e_decompress_cqes_start(struct mlx5e_rq *rq,
 
 #define MLX5E_PAGECNT_BIAS_MAX (PAGE_SIZE / 64)
 
-static int mlx5e_page_alloc_fragmented(struct page_pool *pool,
+static int mlx5e_page_alloc_fragmented(struct page_pool *pp,
 				       struct mlx5e_frag_page *frag_page)
 {
-	struct page *page;
+	netmem_ref netmem = page_pool_alloc_netmems(pp,
+						    GFP_ATOMIC | __GFP_NOWARN);
 
-	page = page_pool_dev_alloc_pages(pool);
-	if (unlikely(!page))
+	if (unlikely(!netmem))
 		return -ENOMEM;
 
-	page_pool_fragment_page(page, MLX5E_PAGECNT_BIAS_MAX);
+	page_pool_fragment_netmem(netmem, MLX5E_PAGECNT_BIAS_MAX);
 
 	*frag_page = (struct mlx5e_frag_page) {
-		.page	= page,
+		.netmem	= netmem,
 		.frags	= 0,
 	};
 
 	return 0;
 }
 
-static void mlx5e_page_release_fragmented(struct page_pool *pool,
+static void mlx5e_page_release_fragmented(struct page_pool *pp,
 					  struct mlx5e_frag_page *frag_page)
 {
 	u16 drain_count = MLX5E_PAGECNT_BIAS_MAX - frag_page->frags;
-	struct page *page = frag_page->page;
+	netmem_ref netmem = frag_page->netmem;
 
-	if (page_pool_unref_page(page, drain_count) == 0)
-		page_pool_put_unrefed_page(pool, page, -1, true);
+	if (page_pool_unref_netmem(netmem, drain_count) == 0)
+		page_pool_put_unrefed_netmem(pp, netmem, -1, true);
 }
 
 static inline int mlx5e_get_rx_frag(struct mlx5e_rq *rq,
@@ -359,7 +359,7 @@ static int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe_cyc *wqe,
 		frag->flags &= ~BIT(MLX5E_WQE_FRAG_SKIP_RELEASE);
 
 		headroom = i == 0 ? rq->buff.headroom : 0;
-		addr = page_pool_get_dma_addr(frag->frag_page->page);
+		addr = page_pool_get_dma_addr_netmem(frag->frag_page->netmem);
 		wqe->data[i].addr = cpu_to_be64(addr + frag->offset + headroom);
 	}
 
@@ -500,9 +500,10 @@ mlx5e_add_skb_shared_info_frag(struct mlx5e_rq *rq, struct skb_shared_info *sinf
 			       struct xdp_buff *xdp, struct mlx5e_frag_page *frag_page,
 			       u32 frag_offset, u32 len)
 {
+	netmem_ref netmem = frag_page->netmem;
 	skb_frag_t *frag;
 
-	dma_addr_t addr = page_pool_get_dma_addr(frag_page->page);
+	dma_addr_t addr = page_pool_get_dma_addr_netmem(netmem);
 
 	dma_sync_single_for_cpu(rq->pdev, addr + frag_offset, len, rq->buff.map_dir);
 	if (!xdp_buff_has_frags(xdp)) {
@@ -515,9 +516,9 @@ mlx5e_add_skb_shared_info_frag(struct mlx5e_rq *rq, struct skb_shared_info *sinf
 	}
 
 	frag = &sinfo->frags[sinfo->nr_frags++];
-	skb_frag_fill_page_desc(frag, frag_page->page, frag_offset, len);
+	skb_frag_fill_netmem_desc(frag, netmem, frag_offset, len);
 
-	if (page_is_pfmemalloc(frag_page->page))
+	if (!netmem_is_net_iov(netmem) && netmem_is_pfmemalloc(netmem))
 		xdp_buff_set_frag_pfmemalloc(xdp);
 	sinfo->xdp_frags_size += len;
 }
@@ -528,27 +529,29 @@ mlx5e_add_skb_frag(struct mlx5e_rq *rq, struct sk_buff *skb,
 		   u32 frag_offset, u32 len,
 		   unsigned int truesize)
 {
-	dma_addr_t addr = page_pool_get_dma_addr(frag_page->page);
+	dma_addr_t addr = page_pool_get_dma_addr_netmem(frag_page->netmem);
 	u8 next_frag = skb_shinfo(skb)->nr_frags;
+	netmem_ref netmem = frag_page->netmem;
 
 	dma_sync_single_for_cpu(rq->pdev, addr + frag_offset, len,
 				rq->buff.map_dir);
 
-	if (skb_can_coalesce(skb, next_frag, frag_page->page, frag_offset)) {
+	if (skb_can_coalesce_netmem(skb, next_frag, netmem, frag_offset)) {
 		skb_coalesce_rx_frag(skb, next_frag - 1, len, truesize);
-	} else {
-		frag_page->frags++;
-		skb_add_rx_frag(skb, next_frag, frag_page->page,
-				frag_offset, len, truesize);
+		return;
 	}
+
+	frag_page->frags++;
+	skb_add_rx_frag_netmem(skb, next_frag, netmem,
+			       frag_offset, len, truesize);
 }
 
 static inline void
 mlx5e_copy_skb_header(struct mlx5e_rq *rq, struct sk_buff *skb,
-		      struct page *page, dma_addr_t addr,
+		      netmem_ref netmem, dma_addr_t addr,
 		      int offset_from, int dma_offset, u32 headlen)
 {
-	const void *from = page_address(page) + offset_from;
+	const void *from = netmem_address(netmem) + offset_from;
 	/* Aligning len to sizeof(long) optimizes memcpy performance */
 	unsigned int len = ALIGN(headlen, sizeof(long));
 
@@ -685,7 +688,7 @@ static int mlx5e_build_shampo_hd_umr(struct mlx5e_rq *rq,
 		if (unlikely(err))
 			goto err_unmap;
 
-		addr = page_pool_get_dma_addr(frag_page->page);
+		addr = page_pool_get_dma_addr_netmem(frag_page->netmem);
 
 		for (int j = 0; j < MLX5E_SHAMPO_WQ_HEADER_PER_PAGE; j++) {
 			header_offset = mlx5e_shampo_hd_offset(index++);
@@ -796,7 +799,8 @@ static int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 		err = mlx5e_page_alloc_fragmented(rq->page_pool, frag_page);
 		if (unlikely(err))
 			goto err_unmap;
-		addr = page_pool_get_dma_addr(frag_page->page);
+
+		addr = page_pool_get_dma_addr_netmem(frag_page->netmem);
 		umr_wqe->inline_mtts[i] = (struct mlx5_mtt) {
 			.ptag = cpu_to_be64(addr | MLX5_EN_WR),
 		};
@@ -1216,7 +1220,7 @@ static void *mlx5e_shampo_get_packet_hd(struct mlx5e_rq *rq, u16 header_index)
 	struct mlx5e_frag_page *frag_page = mlx5e_shampo_hd_to_frag_page(rq, header_index);
 	u16 head_offset = mlx5e_shampo_hd_offset(header_index) + rq->buff.headroom;
 
-	return page_address(frag_page->page) + head_offset;
+	return netmem_address(frag_page->netmem) + head_offset;
 }
 
 static void mlx5e_shampo_update_ipv4_udp_hdr(struct mlx5e_rq *rq, struct iphdr *ipv4)
@@ -1677,11 +1681,11 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
 	dma_addr_t addr;
 	u32 frag_size;
 
-	va             = page_address(frag_page->page) + wi->offset;
+	va             = netmem_address(frag_page->netmem) + wi->offset;
 	data           = va + rx_headroom;
 	frag_size      = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
 
-	addr = page_pool_get_dma_addr(frag_page->page);
+	addr = page_pool_get_dma_addr_netmem(frag_page->netmem);
 	dma_sync_single_range_for_cpu(rq->pdev, addr, wi->offset,
 				      frag_size, rq->buff.map_dir);
 	net_prefetch(data);
@@ -1731,10 +1735,10 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 
 	frag_page = wi->frag_page;
 
-	va = page_address(frag_page->page) + wi->offset;
+	va = netmem_address(frag_page->netmem) + wi->offset;
 	frag_consumed_bytes = min_t(u32, frag_info->frag_size, cqe_bcnt);
 
-	addr = page_pool_get_dma_addr(frag_page->page);
+	addr = page_pool_get_dma_addr_netmem(frag_page->netmem);
 	dma_sync_single_range_for_cpu(rq->pdev, addr, wi->offset,
 				      rq->buff.frame0_sz, rq->buff.map_dir);
 	net_prefetchw(va); /* xdp_frame data area */
@@ -2007,13 +2011,14 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 
 	if (prog) {
 		/* area for bpf_xdp_[store|load]_bytes */
-		net_prefetchw(page_address(frag_page->page) + frag_offset);
+		net_prefetchw(netmem_address(frag_page->netmem) + frag_offset);
 		if (unlikely(mlx5e_page_alloc_fragmented(rq->page_pool,
 							 &wi->linear_page))) {
 			rq->stats->buff_alloc_err++;
 			return NULL;
 		}
-		va = page_address(wi->linear_page.page);
+
+		va = netmem_address(wi->linear_page.netmem);
 		net_prefetchw(va); /* xdp_frame data area */
 		linear_hr = XDP_PACKET_HEADROOM;
 		linear_data_len = 0;
@@ -2124,8 +2129,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 			while (++pagep < frag_page);
 		}
 		/* copy header */
-		addr = page_pool_get_dma_addr(head_page->page);
-		mlx5e_copy_skb_header(rq, skb, head_page->page, addr,
+		addr = page_pool_get_dma_addr_netmem(head_page->netmem);
+		mlx5e_copy_skb_header(rq, skb, head_page->netmem, addr,
 				      head_offset, head_offset, headlen);
 		/* skb linear part was allocated with headlen and aligned to long */
 		skb->tail += headlen;
@@ -2155,11 +2160,11 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 		return NULL;
 	}
 
-	va             = page_address(frag_page->page) + head_offset;
+	va             = netmem_address(frag_page->netmem) + head_offset;
 	data           = va + rx_headroom;
 	frag_size      = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
 
-	addr = page_pool_get_dma_addr(frag_page->page);
+	addr = page_pool_get_dma_addr_netmem(frag_page->netmem);
 	dma_sync_single_range_for_cpu(rq->pdev, addr, head_offset,
 				      frag_size, rq->buff.map_dir);
 	net_prefetch(data);
@@ -2198,16 +2203,19 @@ mlx5e_skb_from_cqe_shampo(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 			  struct mlx5_cqe64 *cqe, u16 header_index)
 {
 	struct mlx5e_frag_page *frag_page = mlx5e_shampo_hd_to_frag_page(rq, header_index);
-	dma_addr_t page_dma_addr = page_pool_get_dma_addr(frag_page->page);
 	u16 head_offset = mlx5e_shampo_hd_offset(header_index);
-	dma_addr_t dma_addr = page_dma_addr + head_offset;
 	u16 head_size = cqe->shampo.header_size;
 	u16 rx_headroom = rq->buff.headroom;
 	struct sk_buff *skb = NULL;
+	dma_addr_t page_dma_addr;
+	dma_addr_t dma_addr;
 	void *hdr, *data;
 	u32 frag_size;
 
-	hdr		= page_address(frag_page->page) + head_offset;
+	page_dma_addr = page_pool_get_dma_addr_netmem(frag_page->netmem);
+	dma_addr = page_dma_addr + head_offset;
+
+	hdr		= netmem_address(frag_page->netmem) + head_offset;
 	data		= hdr + rx_headroom;
 	frag_size	= MLX5_SKB_FRAG_SZ(rx_headroom + head_size);
 
@@ -2232,7 +2240,7 @@ mlx5e_skb_from_cqe_shampo(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 		}
 
 		net_prefetchw(skb->data);
-		mlx5e_copy_skb_header(rq, skb, frag_page->page, dma_addr,
+		mlx5e_copy_skb_header(rq, skb, frag_page->netmem, dma_addr,
 				      head_offset + rx_headroom,
 				      rx_headroom, head_size);
 		/* skb linear part was allocated with headlen and aligned to long */
@@ -2326,11 +2334,20 @@ static void mlx5e_handle_rx_cqe_mpwrq_shampo(struct mlx5e_rq *rq, struct mlx5_cq
 	}
 
 	if (!*skb) {
-		if (likely(head_size))
+		if (likely(head_size)) {
 			*skb = mlx5e_skb_from_cqe_shampo(rq, wi, cqe, header_index);
-		else
-			*skb = mlx5e_skb_from_cqe_mpwrq_nonlinear(rq, wi, cqe, cqe_bcnt,
-								  data_offset, page_idx);
+		} else {
+			struct mlx5e_frag_page *frag_page;
+
+			frag_page = &wi->alloc_units.frag_pages[page_idx];
+			if (unlikely(netmem_is_net_iov(frag_page->netmem)))
+				goto free_hd_entry;
+			*skb = mlx5e_skb_from_cqe_mpwrq_nonlinear(rq, wi, cqe,
+								  cqe_bcnt,
+								  data_offset,
+								  page_idx);
+		}
+
 		if (unlikely(!*skb))
 			goto free_hd_entry;
 
-- 
2.31.1



* [PATCH net-next V2 09/11] net/mlx5e: Add support for UNREADABLE netmem page pools
  2025-05-22 21:41 [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy Tariq Toukan
                   ` (7 preceding siblings ...)
  2025-05-22 21:41 ` [PATCH net-next V2 08/11] net/mlx5e: Convert over to netmem Tariq Toukan
@ 2025-05-22 21:41 ` Tariq Toukan
  2025-05-22 23:26   ` Mina Almasry
  2025-05-22 21:41 ` [PATCH net-next V2 10/11] net/mlx5e: Implement queue mgmt ops and single channel swap Tariq Toukan
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 50+ messages in thread
From: Tariq Toukan @ 2025-05-22 21:41 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

From: Saeed Mahameed <saeedm@nvidia.com>

On netdev_rx_queue_restart, a special type of page pool may be expected.

Declare support for UNREADABLE netmem (net_iov) pages in the pool params
only when header-data split SHAMPO RQ mode is enabled, and also set the
queue index in the page pool params struct.

SHAMPO mode requirement: without header split, RX needs to peek at the
data, so we can't do UNREADABLE_NETMEM.
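
To illustrate the requirement, a minimal sketch (not part of the patch;
it only uses netmem accessors already used elsewhere in this series):

	/* sketch: payload netmem may be unreadable, headers never are */
	if (netmem_is_net_iov(netmem)) {
		/* no kernel mapping; can only be attached as an skb frag */
		skb_add_rx_frag_netmem(skb, skb_shinfo(skb)->nr_frags, netmem,
				       offset, len, truesize);
	} else {
		void *va = netmem_address(netmem) + offset;

		net_prefetch(va); /* readable system page, peeking is fine */
	}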

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 9e2975782a82..485b1515ace5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -952,6 +952,11 @@ static int mlx5e_alloc_rq(struct mlx5e_params *params,
 		pp_params.netdev    = rq->netdev;
 		pp_params.dma_dir   = rq->buff.map_dir;
 		pp_params.max_len   = PAGE_SIZE;
+		pp_params.queue_idx = rq->ix;
+
+		/* Shampo header data split allow for unreadable netmem */
+		if (test_bit(MLX5E_RQ_STATE_SHAMPO, &rq->state))
+			pp_params.flags |= PP_FLAG_ALLOW_UNREADABLE_NETMEM;
 
 		/* page_pool can be used even when there is no rq->xdp_prog,
 		 * given page_pool does not handle DMA mapping there is no
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH net-next V2 10/11] net/mlx5e: Implement queue mgmt ops and single channel swap
  2025-05-22 21:41 [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy Tariq Toukan
                   ` (8 preceding siblings ...)
  2025-05-22 21:41 ` [PATCH net-next V2 09/11] net/mlx5e: Add support for UNREADABLE netmem page pools Tariq Toukan
@ 2025-05-22 21:41 ` Tariq Toukan
  2025-05-22 21:41 ` [PATCH net-next V2 11/11] net/mlx5e: Support ethtool tcp-data-split settings Tariq Toukan
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 50+ messages in thread
From: Tariq Toukan @ 2025-05-22 21:41 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

From: Saeed Mahameed <saeedm@nvidia.com>

The bulk of the work is done in mlx5e_queue_mem_alloc, where we allocate
and create the new channel resources, similar to
mlx5e_safe_switch_params, but here we do it for a single channel using
the existing params, essentially cloning the channel.
To swap the old channel with the new one, we deactivate and close the
old channel and then replace it with the new one. Since the swap
procedure doesn't fail in mlx5, we do it all in one place
(mlx5e_queue_start).
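
For context, the core restart sequence these ops plug into looks roughly
like the sketch below (simplified, error handling omitted; see
netdev_rx_queue_restart() for the real flow):

	void *new_mem = kvzalloc(qops->ndo_queue_mem_size, GFP_KERNEL);
	void *old_mem = kvzalloc(qops->ndo_queue_mem_size, GFP_KERNEL);

	qops->ndo_queue_mem_alloc(dev, new_mem, rxq_idx); /* clone the channel */
	qops->ndo_queue_stop(dev, old_mem, rxq_idx);      /* no-op for mlx5e */
	qops->ndo_queue_start(dev, new_mem, rxq_idx);     /* swap happens here */
	qops->ndo_queue_mem_free(dev, old_mem);

	kvfree(old_mem);
	kvfree(new_mem);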

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 95 +++++++++++++++++++
 1 file changed, 95 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 485b1515ace5..3cc316ace0a5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -5478,6 +5478,100 @@ static const struct netdev_stat_ops mlx5e_stat_ops = {
 	.get_base_stats      = mlx5e_get_base_stats,
 };
 
+struct mlx5_qmgmt_data {
+	struct mlx5e_channel *c;
+	struct mlx5e_channel_param cparam;
+};
+
+static int mlx5e_queue_mem_alloc(struct net_device *dev, void *newq,
+				 int queue_index)
+{
+	struct mlx5_qmgmt_data *new = (struct mlx5_qmgmt_data *)newq;
+	struct mlx5e_priv *priv = netdev_priv(dev);
+	struct mlx5e_channels *chs = &priv->channels;
+	struct mlx5e_params params = chs->params;
+	struct mlx5_core_dev *mdev;
+	int err;
+
+	mutex_lock(&priv->state_lock);
+	if (!test_bit(MLX5E_STATE_OPENED, &priv->state)) {
+		err = -ENODEV;
+		goto unlock;
+	}
+
+	if (queue_index >= chs->num) {
+		err = -ERANGE;
+		goto unlock;
+	}
+
+	if (MLX5E_GET_PFLAG(&chs->params, MLX5E_PFLAG_TX_PORT_TS) ||
+	    chs->params.ptp_rx   ||
+	    chs->params.xdp_prog ||
+	    priv->htb) {
+		netdev_err(priv->netdev,
+			   "Cloning channels with Port/rx PTP, XDP or HTB is not supported\n");
+		err = -EOPNOTSUPP;
+		goto unlock;
+	}
+
+	mdev = mlx5_sd_ch_ix_get_dev(priv->mdev, queue_index);
+	err = mlx5e_build_channel_param(mdev, &params, &new->cparam);
+	if (err)
+		goto unlock;
+
+	err = mlx5e_open_channel(priv, queue_index, &params, NULL, &new->c);
+unlock:
+	mutex_unlock(&priv->state_lock);
+	return err;
+}
+
+static void mlx5e_queue_mem_free(struct net_device *dev, void *mem)
+{
+	struct mlx5_qmgmt_data *data = (struct mlx5_qmgmt_data *)mem;
+
+	/* not supposed to happen since mlx5e_queue_start never fails
+	 * but this is how this should be implemented just in case
+	 */
+	if (data->c)
+		mlx5e_close_channel(data->c);
+}
+
+static int mlx5e_queue_stop(struct net_device *dev, void *oldq, int queue_index)
+{
+	/* mlx5e_queue_start does not fail, we stop the old queue there */
+	return 0;
+}
+
+static int mlx5e_queue_start(struct net_device *dev, void *newq,
+			     int queue_index)
+{
+	struct mlx5_qmgmt_data *new = (struct mlx5_qmgmt_data *)newq;
+	struct mlx5e_priv *priv = netdev_priv(dev);
+	struct mlx5e_channel *old;
+
+	mutex_lock(&priv->state_lock);
+
+	/* stop and close the old */
+	old = priv->channels.c[queue_index];
+	mlx5e_deactivate_priv_channels(priv);
+	/* close old before activating new, to avoid napi conflict */
+	mlx5e_close_channel(old);
+
+	/* start the new */
+	priv->channels.c[queue_index] = new->c;
+	mlx5e_activate_priv_channels(priv);
+	mutex_unlock(&priv->state_lock);
+	return 0;
+}
+
+static const struct netdev_queue_mgmt_ops mlx5e_queue_mgmt_ops = {
+	.ndo_queue_mem_size	=	sizeof(struct mlx5_qmgmt_data),
+	.ndo_queue_mem_alloc	=	mlx5e_queue_mem_alloc,
+	.ndo_queue_mem_free	=	mlx5e_queue_mem_free,
+	.ndo_queue_start	=	mlx5e_queue_start,
+	.ndo_queue_stop		=	mlx5e_queue_stop,
+};
+
 static void mlx5e_build_nic_netdev(struct net_device *netdev)
 {
 	struct mlx5e_priv *priv = netdev_priv(netdev);
@@ -5488,6 +5582,7 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
 	SET_NETDEV_DEV(netdev, mdev->device);
 
 	netdev->netdev_ops = &mlx5e_netdev_ops;
+	netdev->queue_mgmt_ops = &mlx5e_queue_mgmt_ops;
 	netdev->xdp_metadata_ops = &mlx5e_xdp_metadata_ops;
 	netdev->xsk_tx_metadata_ops = &mlx5e_xsk_tx_metadata_ops;
 	netdev->request_ops_lock = true;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH net-next V2 11/11] net/mlx5e: Support ethtool tcp-data-split settings
  2025-05-22 21:41 [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy Tariq Toukan
                   ` (9 preceding siblings ...)
  2025-05-22 21:41 ` [PATCH net-next V2 10/11] net/mlx5e: Implement queue mgmt ops and single channel swap Tariq Toukan
@ 2025-05-22 21:41 ` Tariq Toukan
  2025-05-22 22:55   ` Jakub Kicinski
  2025-05-27 16:05 ` [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy Stanislav Fomichev
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 50+ messages in thread
From: Tariq Toukan @ 2025-05-22 21:41 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

From: Saeed Mahameed <saeedm@nvidia.com>

Try enabling HW GRO when requested.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../ethernet/mellanox/mlx5/core/en_ethtool.c  | 50 +++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index ea078c9f5d15..b6c3b6c11f86 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -371,6 +371,14 @@ void mlx5e_ethtool_get_ringparam(struct mlx5e_priv *priv,
 		(priv->channels.params.packet_merge.type == MLX5E_PACKET_MERGE_SHAMPO) ?
 		ETHTOOL_TCP_DATA_SPLIT_ENABLED :
 		ETHTOOL_TCP_DATA_SPLIT_DISABLED;
+
+	/* if HW GRO is not enabled due to external limitations but is wanted,
+	 * report HDS state as unknown so it won't get turned off explicitly.
+	 */
+	if (kernel_param->tcp_data_split == ETHTOOL_TCP_DATA_SPLIT_DISABLED &&
+	    priv->netdev->wanted_features & NETIF_F_GRO_HW)
+		kernel_param->tcp_data_split = ETHTOOL_TCP_DATA_SPLIT_UNKNOWN;
+
 }
 
 static void mlx5e_get_ringparam(struct net_device *dev,
@@ -383,6 +391,43 @@ static void mlx5e_get_ringparam(struct net_device *dev,
 	mlx5e_ethtool_get_ringparam(priv, param, kernel_param);
 }
 
+static bool mlx5e_ethtool_set_tcp_data_split(struct mlx5e_priv *priv,
+					     u8 tcp_data_split)
+{
+	bool enable = (tcp_data_split == ETHTOOL_TCP_DATA_SPLIT_ENABLED);
+	struct net_device *dev = priv->netdev;
+
+	if (tcp_data_split == ETHTOOL_TCP_DATA_SPLIT_UNKNOWN)
+		return true;
+
+	if (enable && !(dev->hw_features & NETIF_F_GRO_HW)) {
+		netdev_warn(dev, "TCP-data-split is not supported when GRO HW is not supported\n");
+		return false; /* GRO HW is not supported */
+	}
+
+	if (enable && (dev->features & NETIF_F_GRO_HW)) {
+		/* Already enabled */
+		dev->wanted_features |= NETIF_F_GRO_HW;
+		return true;
+	}
+
+	if (!enable && !(dev->features & NETIF_F_GRO_HW)) {
+		/* Already disabled */
+		dev->wanted_features &= ~NETIF_F_GRO_HW;
+		return true;
+	}
+
+	/* Try enable or disable GRO HW */
+	if (enable)
+		dev->wanted_features |= NETIF_F_GRO_HW;
+	else
+		dev->wanted_features &= ~NETIF_F_GRO_HW;
+
+	netdev_change_features(dev);
+
+	return enable == !!(dev->features & NETIF_F_GRO_HW);
+}
+
 int mlx5e_ethtool_set_ringparam(struct mlx5e_priv *priv,
 				struct ethtool_ringparam *param,
 				struct netlink_ext_ack *extack)
@@ -441,6 +486,10 @@ static int mlx5e_set_ringparam(struct net_device *dev,
 {
 	struct mlx5e_priv *priv = netdev_priv(dev);
 
+	if (!mlx5e_ethtool_set_tcp_data_split(priv,
+					      kernel_param->tcp_data_split))
+		return -EINVAL;
+
 	return mlx5e_ethtool_set_ringparam(priv, param, extack);
 }
 
@@ -2625,6 +2674,7 @@ const struct ethtool_ops mlx5e_ethtool_ops = {
 				     ETHTOOL_COALESCE_USE_ADAPTIVE |
 				     ETHTOOL_COALESCE_USE_CQE,
 	.supported_input_xfrm = RXH_XFRM_SYM_OR_XOR,
+	.supported_ring_params = ETHTOOL_RING_USE_TCP_DATA_SPLIT,
 	.get_drvinfo       = mlx5e_get_drvinfo,
 	.get_link          = ethtool_op_get_link,
 	.get_link_ext_state  = mlx5e_get_link_ext_state,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 06/11] net/mlx5e: SHAMPO: Separate pool for headers
  2025-05-22 21:41 ` [PATCH net-next V2 06/11] net/mlx5e: SHAMPO: Separate pool for headers Tariq Toukan
@ 2025-05-22 22:30   ` Jakub Kicinski
  2025-05-22 23:08     ` Saeed Mahameed
  0 siblings, 1 reply; 50+ messages in thread
From: Jakub Kicinski @ 2025-05-22 22:30 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: David S. Miller, Paolo Abeni, Eric Dumazet, Andrew Lunn,
	Saeed Mahameed, Leon Romanovsky, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

On Fri, 23 May 2025 00:41:21 +0300 Tariq Toukan wrote:
> Allocate a separate page pool for headers when SHAMPO is enabled.
> This will be useful for adding support to zc page pool, which has to be
> different from the headers page pool.

Could you explain why always allocate a separate pool? 
For bnxt we do it only if ZC is enabled (or system pages are large),
see bnxt_separate_head_pool() and page_pool_is_unreadable().

Not sure if page_pool_is_unreadable() existed when this code
was written.

> -	wq_size = BIT(MLX5_GET(wq, wqc, log_wq_sz));
> -	*pool_size += (rq->mpwqe.shampo->hd_per_wqe * wq_size) /
> -		     MLX5E_SHAMPO_WQ_HEADER_PER_PAGE;
> +
> +	/* separate page pool for shampo headers */
> +	{
> +		int wq_size = BIT(MLX5_GET(wq, wqc, log_wq_sz));
> +		struct page_pool_params pp_params = { };
> +		u32 pool_size;

A free-standing code block? I think it's universally understood
to be very poor coding style.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 07/11] net/mlx5e: SHAMPO: Headers page pool stats
  2025-05-22 21:41 ` [PATCH net-next V2 07/11] net/mlx5e: SHAMPO: Headers page pool stats Tariq Toukan
@ 2025-05-22 22:31   ` Jakub Kicinski
  2025-05-22 22:58     ` Saeed Mahameed
  0 siblings, 1 reply; 50+ messages in thread
From: Jakub Kicinski @ 2025-05-22 22:31 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: David S. Miller, Paolo Abeni, Eric Dumazet, Andrew Lunn,
	Saeed Mahameed, Leon Romanovsky, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

On Fri, 23 May 2025 00:41:22 +0300 Tariq Toukan wrote:
> Expose the stats of the new headers page pool.

Nope. We have a netlink API for page pool stats.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 11/11] net/mlx5e: Support ethtool tcp-data-split settings
  2025-05-22 21:41 ` [PATCH net-next V2 11/11] net/mlx5e: Support ethtool tcp-data-split settings Tariq Toukan
@ 2025-05-22 22:55   ` Jakub Kicinski
  2025-05-22 23:19     ` Saeed Mahameed
  0 siblings, 1 reply; 50+ messages in thread
From: Jakub Kicinski @ 2025-05-22 22:55 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: David S. Miller, Paolo Abeni, Eric Dumazet, Andrew Lunn,
	Saeed Mahameed, Leon Romanovsky, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

On Fri, 23 May 2025 00:41:26 +0300 Tariq Toukan wrote:
> +	/* if HW GRO is not enabled due to external limitations but is wanted,
> +	 * report HDS state as unknown so it won't get turned off explicitly.
> +	 */
> +	if (kernel_param->tcp_data_split == ETHTOOL_TCP_DATA_SPLIT_DISABLED &&
> +	    priv->netdev->wanted_features & NETIF_F_GRO_HW)
> +		kernel_param->tcp_data_split = ETHTOOL_TCP_DATA_SPLIT_UNKNOWN;

The kernel_param->tcp_data_split here is the user config, right?
It would be cleaner to not support setting SPLIT_DISABLED.
Nothing requires that, you can support just setting AUTO and ENABLED.

> +

nit: extra empty line, please run checkpatch

>  }
>  
>  static void mlx5e_get_ringparam(struct net_device *dev,
> @@ -383,6 +391,43 @@ static void mlx5e_get_ringparam(struct net_device *dev,
>  	mlx5e_ethtool_get_ringparam(priv, param, kernel_param);
>  }
>  
> +static bool mlx5e_ethtool_set_tcp_data_split(struct mlx5e_priv *priv,
> +					     u8 tcp_data_split)
> +{
> +	bool enable = (tcp_data_split == ETHTOOL_TCP_DATA_SPLIT_ENABLED);
> +	struct net_device *dev = priv->netdev;
> +
> +	if (tcp_data_split == ETHTOOL_TCP_DATA_SPLIT_UNKNOWN)
> +		return true;
> +
> +	if (enable && !(dev->hw_features & NETIF_F_GRO_HW)) {
> +		netdev_warn(dev, "TCP-data-split is not supported when GRO HW is not supported\n");
> +		return false; /* GRO HW is not supported */
> +	}
> +
> +	if (enable && (dev->features & NETIF_F_GRO_HW)) {
> +		/* Already enabled */
> +		dev->wanted_features |= NETIF_F_GRO_HW;
> +		return true;
> +	}
> +
> +	if (!enable && !(dev->features & NETIF_F_GRO_HW)) {
> +		/* Already disabled */
> +		dev->wanted_features &= ~NETIF_F_GRO_HW;
> +		return true;
> +	}
> +
> +	/* Try enable or disable GRO HW */
> +	if (enable)
> +		dev->wanted_features |= NETIF_F_GRO_HW;
> +	else
> +		dev->wanted_features &= ~NETIF_F_GRO_HW;

Why are you modifying wanted_features? wanted_features is what
*user space* wanted! You should probably operate on hw_features ?
Tho, may be cleaner to return an error and an extack if the user
tries to set HDS and GRO to conflicting values.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 07/11] net/mlx5e: SHAMPO: Headers page pool stats
  2025-05-22 22:31   ` Jakub Kicinski
@ 2025-05-22 22:58     ` Saeed Mahameed
  2025-06-06 10:43       ` Cosmin Ratiu
  0 siblings, 1 reply; 50+ messages in thread
From: Saeed Mahameed @ 2025-05-22 22:58 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Tariq Toukan, David S. Miller, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Saeed Mahameed, Leon Romanovsky, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

On 22 May 15:31, Jakub Kicinski wrote:
>On Fri, 23 May 2025 00:41:22 +0300 Tariq Toukan wrote:
>> Expose the stats of the new headers page pool.
>
>Nope. We have a netlink API for page pool stats.
>

We already expose the stats of the main pool in ethtool, so it would be
an inconvenience to keep exposing only half of the stats. Either we
delete both or keep both. Some of us rely on this for debugging.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 01/11] net: Kconfig NET_DEVMEM selects GENERIC_ALLOCATOR
  2025-05-22 21:41 ` [PATCH net-next V2 01/11] net: Kconfig NET_DEVMEM selects GENERIC_ALLOCATOR Tariq Toukan
@ 2025-05-22 23:07   ` Mina Almasry
  0 siblings, 0 replies; 50+ messages in thread
From: Mina Almasry @ 2025-05-22 23:07 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Saeed Mahameed, Leon Romanovsky, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

On Thu, May 22, 2025 at 2:43 PM Tariq Toukan <tariqt@nvidia.com> wrote:
>
> From: Saeed Mahameed <saeedm@nvidia.com>
>
> GENERIC_ALLOCATOR is a non-prompt kconfig, meaning users can't enable it
> selectively. All kconfig users of GENERIC_ALLOCATOR select it, except of
> NET_DEVMEM which only depends on it, there is no easy way to turn
> GENERIC_ALLOCATOR on unless we select other unnecessary configs that
> will select it.
>
> Instead of depending on it, select it when NET_DEVMEM is enabled.
>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>

Reviewed-by: Mina Almasry <almasrymina@google.com>


-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 06/11] net/mlx5e: SHAMPO: Separate pool for headers
  2025-05-22 22:30   ` Jakub Kicinski
@ 2025-05-22 23:08     ` Saeed Mahameed
  2025-05-22 23:24       ` Mina Almasry
  2025-05-27 15:29       ` Jakub Kicinski
  0 siblings, 2 replies; 50+ messages in thread
From: Saeed Mahameed @ 2025-05-22 23:08 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Tariq Toukan, David S. Miller, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Saeed Mahameed, Leon Romanovsky, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

On 22 May 15:30, Jakub Kicinski wrote:
>On Fri, 23 May 2025 00:41:21 +0300 Tariq Toukan wrote:
>> Allocate a separate page pool for headers when SHAMPO is enabled.
>> This will be useful for adding support to zc page pool, which has to be
>> different from the headers page pool.
>
>Could you explain why always allocate a separate pool?

Better flow management: zero conditional code on the data path to
alloc/return header buffers. Since in mlx5 we already have separate
paths to handle headers, we don't have/need bnxt_separate_head_pool()
and rxr->need_head_pool spread across the code.

Since we alloc and return pages in bulk, it makes more sense to manage
headers and data in separate pools if we are going to do it anyway for
unreadable pools, and when there's no performance impact.

>For bnxt we do it only if ZC is enabled (or system pages are large),
>see bnxt_separate_head_pool() and page_pool_is_unreadable().
>
>Not sure if page_pool_is_unreadable() existed when this code
>was written.
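
For comparison, gating the header pool on readability (the bnxt-style
approach above) would look roughly like the sketch below; illustrative
only, the mlx5e field names here are assumed:

	/* sketch: create a dedicated header pool only when the payload
	 * pool is unreadable, otherwise reuse the payload pool
	 */
	if (page_pool_is_unreadable(rq->page_pool))
		rq->hd_page_pool = page_pool_create(&hd_pp_params);
	else
		rq->hd_page_pool = rq->page_pool;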
>
>> -	wq_size = BIT(MLX5_GET(wq, wqc, log_wq_sz));
>> -	*pool_size += (rq->mpwqe.shampo->hd_per_wqe * wq_size) /
>> -		     MLX5E_SHAMPO_WQ_HEADER_PER_PAGE;
>> +
>> +	/* separate page pool for shampo headers */
>> +	{
>> +		int wq_size = BIT(MLX5_GET(wq, wqc, log_wq_sz));
>> +		struct page_pool_params pp_params = { };
>> +		u32 pool_size;
>
>A free standing code block? I this it's universally understood
>to be very poor coding style..
>

Sure, if used excessively and incorrectly. Here it's only used for temporary
variable scoping. I don't think there is anything wrong with free-standing
blocks when used in the proper situations.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 02/11] net: Add skb_can_coalesce for netmem
  2025-05-22 21:41 ` [PATCH net-next V2 02/11] net: Add skb_can_coalesce for netmem Tariq Toukan
@ 2025-05-22 23:09   ` Mina Almasry
  2025-05-25 13:03     ` Dragos Tatulea
  0 siblings, 1 reply; 50+ messages in thread
From: Mina Almasry @ 2025-05-22 23:09 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Saeed Mahameed, Leon Romanovsky, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

On Thu, May 22, 2025 at 2:43 PM Tariq Toukan <tariqt@nvidia.com> wrote:
>
> From: Dragos Tatulea <dtatulea@nvidia.com>
>
> Allow drivers that have moved over to netmem to do fragment coalescing.
>
> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
>  include/linux/skbuff.h | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 5520524c93bf..e8e2860183b4 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -3887,6 +3887,18 @@ static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
>         return false;
>  }
>
> +static inline bool skb_can_coalesce_netmem(struct sk_buff *skb, int i,
> +                                          const netmem_ref netmem, int off)
> +{
> +       if (i) {
> +               const skb_frag_t *frag = &skb_shinfo(skb)->frags[i - 1];
> +
> +               return netmem == skb_frag_netmem(frag) &&
> +                      off == skb_frag_off(frag) + skb_frag_size(frag);
> +       }
> +       return false;
> +}
> +

Can we limit the code duplication by changing skb_can_coalesce to call
skb_can_coalesce_netmem? Or is that too bad for perf?

static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
				    const struct page *page, int off)
{
	return skb_can_coalesce_netmem(skb, i, page_to_netmem(page), off);
}

It's always safe to cast a page to netmem.

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 08/11] net/mlx5e: Convert over to netmem
  2025-05-22 21:41 ` [PATCH net-next V2 08/11] net/mlx5e: Convert over to netmem Tariq Toukan
@ 2025-05-22 23:18   ` Mina Almasry
  2025-05-22 23:54     ` Saeed Mahameed
  0 siblings, 1 reply; 50+ messages in thread
From: Mina Almasry @ 2025-05-22 23:18 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Saeed Mahameed, Leon Romanovsky, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

On Thu, May 22, 2025 at 2:46 PM Tariq Toukan <tariqt@nvidia.com> wrote:
>
> From: Saeed Mahameed <saeedm@nvidia.com>
>
> mlx5e_page_frag holds the physical page itself, to naturally support
> zc page pools, remove physical page reference from mlx5 and replace it
> with netmem_ref, to avoid internal handling in mlx5 for net_iov backed
> pages.
>
> SHAMPO can issue packets that are not split into header and data. These
> packets will be dropped if the data part resides in a net_iov as the
> driver can't read into this area.
>
> No performance degradation observed.
>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en.h  |   2 +-
>  .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 103 ++++++++++--------
>  2 files changed, 61 insertions(+), 44 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index c329de1d4f0a..65a73913b9a2 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -553,7 +553,7 @@ struct mlx5e_icosq {
>  } ____cacheline_aligned_in_smp;
>
>  struct mlx5e_frag_page {

omega nit: maybe this should be renamed now to mlx5e_frag_netmem. Up to you.

> -       struct page *page;
> +       netmem_ref netmem;
>         u16 frags;
>  };
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index e34ef53ebd0e..75e753adedef 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -273,33 +273,33 @@ static inline u32 mlx5e_decompress_cqes_start(struct mlx5e_rq *rq,
>
>  #define MLX5E_PAGECNT_BIAS_MAX (PAGE_SIZE / 64)
>
> -static int mlx5e_page_alloc_fragmented(struct page_pool *pool,
> +static int mlx5e_page_alloc_fragmented(struct page_pool *pp,
>                                        struct mlx5e_frag_page *frag_page)
>  {
> -       struct page *page;
> +       netmem_ref netmem = page_pool_alloc_netmems(pp,
> +                                                   GFP_ATOMIC | __GFP_NOWARN);
>
> -       page = page_pool_dev_alloc_pages(pool);
> -       if (unlikely(!page))
> +       if (unlikely(!netmem))
>                 return -ENOMEM;
>
> -       page_pool_fragment_page(page, MLX5E_PAGECNT_BIAS_MAX);
> +       page_pool_fragment_netmem(netmem, MLX5E_PAGECNT_BIAS_MAX);
>
>         *frag_page = (struct mlx5e_frag_page) {
> -               .page   = page,
> +               .netmem = netmem,
>                 .frags  = 0,
>         };
>
>         return 0;
>  }
>
> -static void mlx5e_page_release_fragmented(struct page_pool *pool,
> +static void mlx5e_page_release_fragmented(struct page_pool *pp,
>                                           struct mlx5e_frag_page *frag_page)
>  {
>         u16 drain_count = MLX5E_PAGECNT_BIAS_MAX - frag_page->frags;
> -       struct page *page = frag_page->page;
> +       netmem_ref netmem = frag_page->netmem;
>
> -       if (page_pool_unref_page(page, drain_count) == 0)
> -               page_pool_put_unrefed_page(pool, page, -1, true);
> +       if (page_pool_unref_netmem(netmem, drain_count) == 0)
> +               page_pool_put_unrefed_netmem(pp, netmem, -1, true);
>  }
>
>  static inline int mlx5e_get_rx_frag(struct mlx5e_rq *rq,
> @@ -359,7 +359,7 @@ static int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe_cyc *wqe,
>                 frag->flags &= ~BIT(MLX5E_WQE_FRAG_SKIP_RELEASE);
>
>                 headroom = i == 0 ? rq->buff.headroom : 0;
> -               addr = page_pool_get_dma_addr(frag->frag_page->page);
> +               addr = page_pool_get_dma_addr_netmem(frag->frag_page->netmem);
>                 wqe->data[i].addr = cpu_to_be64(addr + frag->offset + headroom);
>         }
>
> @@ -500,9 +500,10 @@ mlx5e_add_skb_shared_info_frag(struct mlx5e_rq *rq, struct skb_shared_info *sinf
>                                struct xdp_buff *xdp, struct mlx5e_frag_page *frag_page,
>                                u32 frag_offset, u32 len)
>  {
> +       netmem_ref netmem = frag_page->netmem;
>         skb_frag_t *frag;
>
> -       dma_addr_t addr = page_pool_get_dma_addr(frag_page->page);
> +       dma_addr_t addr = page_pool_get_dma_addr_netmem(netmem);
>
>         dma_sync_single_for_cpu(rq->pdev, addr + frag_offset, len, rq->buff.map_dir);
>         if (!xdp_buff_has_frags(xdp)) {
> @@ -515,9 +516,9 @@ mlx5e_add_skb_shared_info_frag(struct mlx5e_rq *rq, struct skb_shared_info *sinf
>         }
>
>         frag = &sinfo->frags[sinfo->nr_frags++];
> -       skb_frag_fill_page_desc(frag, frag_page->page, frag_offset, len);
> +       skb_frag_fill_netmem_desc(frag, netmem, frag_offset, len);
>
> -       if (page_is_pfmemalloc(frag_page->page))
> +       if (!netmem_is_net_iov(netmem) && netmem_is_pfmemalloc(netmem))

nit: unnecessary net_iov check AFAICT?

>                 xdp_buff_set_frag_pfmemalloc(xdp);
>         sinfo->xdp_frags_size += len;
>  }
> @@ -528,27 +529,29 @@ mlx5e_add_skb_frag(struct mlx5e_rq *rq, struct sk_buff *skb,
>                    u32 frag_offset, u32 len,
>                    unsigned int truesize)
>  {
> -       dma_addr_t addr = page_pool_get_dma_addr(frag_page->page);
> +       dma_addr_t addr = page_pool_get_dma_addr_netmem(frag_page->netmem);
>         u8 next_frag = skb_shinfo(skb)->nr_frags;
> +       netmem_ref netmem = frag_page->netmem;
>
>         dma_sync_single_for_cpu(rq->pdev, addr + frag_offset, len,
>                                 rq->buff.map_dir);
>
> -       if (skb_can_coalesce(skb, next_frag, frag_page->page, frag_offset)) {
> +       if (skb_can_coalesce_netmem(skb, next_frag, netmem, frag_offset)) {
>                 skb_coalesce_rx_frag(skb, next_frag - 1, len, truesize);
> -       } else {
> -               frag_page->frags++;
> -               skb_add_rx_frag(skb, next_frag, frag_page->page,
> -                               frag_offset, len, truesize);
> +               return;
>         }
> +
> +       frag_page->frags++;
> +       skb_add_rx_frag_netmem(skb, next_frag, netmem,
> +                              frag_offset, len, truesize);
>  }
>
>  static inline void
>  mlx5e_copy_skb_header(struct mlx5e_rq *rq, struct sk_buff *skb,
> -                     struct page *page, dma_addr_t addr,
> +                     netmem_ref netmem, dma_addr_t addr,
>                       int offset_from, int dma_offset, u32 headlen)
>  {
> -       const void *from = page_address(page) + offset_from;
> +       const void *from = netmem_address(netmem) + offset_from;

I think this needs a check that netmem_address() != NULL, and safe error
handling in case it is NULL. If the netmem is unreadable, netmem_address()
will return NULL, and because you add offset_from to it, you can't
NULL-check 'from' either.
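
Something along these lines, for illustration only (untested):

	const void *from;

	if (unlikely(!netmem_address(netmem)))
		return; /* unreadable netmem: no kernel mapping to copy from */

	from = netmem_address(netmem) + offset_from;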

>         /* Aligning len to sizeof(long) optimizes memcpy performance */
>         unsigned int len = ALIGN(headlen, sizeof(long));
>
> @@ -685,7 +688,7 @@ static int mlx5e_build_shampo_hd_umr(struct mlx5e_rq *rq,
>                 if (unlikely(err))
>                         goto err_unmap;
>
> -               addr = page_pool_get_dma_addr(frag_page->page);
> +               addr = page_pool_get_dma_addr_netmem(frag_page->netmem);
>
>                 for (int j = 0; j < MLX5E_SHAMPO_WQ_HEADER_PER_PAGE; j++) {
>                         header_offset = mlx5e_shampo_hd_offset(index++);
> @@ -796,7 +799,8 @@ static int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>                 err = mlx5e_page_alloc_fragmented(rq->page_pool, frag_page);
>                 if (unlikely(err))
>                         goto err_unmap;
> -               addr = page_pool_get_dma_addr(frag_page->page);
> +
> +               addr = page_pool_get_dma_addr_netmem(frag_page->netmem);
>                 umr_wqe->inline_mtts[i] = (struct mlx5_mtt) {
>                         .ptag = cpu_to_be64(addr | MLX5_EN_WR),
>                 };
> @@ -1216,7 +1220,7 @@ static void *mlx5e_shampo_get_packet_hd(struct mlx5e_rq *rq, u16 header_index)
>         struct mlx5e_frag_page *frag_page = mlx5e_shampo_hd_to_frag_page(rq, header_index);
>         u16 head_offset = mlx5e_shampo_hd_offset(header_index) + rq->buff.headroom;
>
> -       return page_address(frag_page->page) + head_offset;
> +       return netmem_address(frag_page->netmem) + head_offset;

Similar concern here. netmem_address may be NULL, especially when you
enable unreadable netmem support in the later patches. There are a
couple of call sites below as well.

>  }
>
>  static void mlx5e_shampo_update_ipv4_udp_hdr(struct mlx5e_rq *rq, struct iphdr *ipv4)
> @@ -1677,11 +1681,11 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
>         dma_addr_t addr;
>         u32 frag_size;
>
> -       va             = page_address(frag_page->page) + wi->offset;
> +       va             = netmem_address(frag_page->netmem) + wi->offset;
>         data           = va + rx_headroom;
>         frag_size      = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
>
> -       addr = page_pool_get_dma_addr(frag_page->page);
> +       addr = page_pool_get_dma_addr_netmem(frag_page->netmem);
>         dma_sync_single_range_for_cpu(rq->pdev, addr, wi->offset,
>                                       frag_size, rq->buff.map_dir);
>         net_prefetch(data);
> @@ -1731,10 +1735,10 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
>
>         frag_page = wi->frag_page;
>
> -       va = page_address(frag_page->page) + wi->offset;
> +       va = netmem_address(frag_page->netmem) + wi->offset;
>         frag_consumed_bytes = min_t(u32, frag_info->frag_size, cqe_bcnt);
>
> -       addr = page_pool_get_dma_addr(frag_page->page);
> +       addr = page_pool_get_dma_addr_netmem(frag_page->netmem);
>         dma_sync_single_range_for_cpu(rq->pdev, addr, wi->offset,
>                                       rq->buff.frame0_sz, rq->buff.map_dir);
>         net_prefetchw(va); /* xdp_frame data area */
> @@ -2007,13 +2011,14 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>
>         if (prog) {
>                 /* area for bpf_xdp_[store|load]_bytes */
> -               net_prefetchw(page_address(frag_page->page) + frag_offset);
> +               net_prefetchw(netmem_address(frag_page->netmem) + frag_offset);
>                 if (unlikely(mlx5e_page_alloc_fragmented(rq->page_pool,
>                                                          &wi->linear_page))) {
>                         rq->stats->buff_alloc_err++;
>                         return NULL;
>                 }
> -               va = page_address(wi->linear_page.page);
> +
> +               va = netmem_address(wi->linear_page.netmem);
>                 net_prefetchw(va); /* xdp_frame data area */
>                 linear_hr = XDP_PACKET_HEADROOM;
>                 linear_data_len = 0;
> @@ -2124,8 +2129,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>                         while (++pagep < frag_page);
>                 }
>                 /* copy header */
> -               addr = page_pool_get_dma_addr(head_page->page);
> -               mlx5e_copy_skb_header(rq, skb, head_page->page, addr,
> +               addr = page_pool_get_dma_addr_netmem(head_page->netmem);
> +               mlx5e_copy_skb_header(rq, skb, head_page->netmem, addr,
>                                       head_offset, head_offset, headlen);
>                 /* skb linear part was allocated with headlen and aligned to long */
>                 skb->tail += headlen;
> @@ -2155,11 +2160,11 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
>                 return NULL;
>         }
>
> -       va             = page_address(frag_page->page) + head_offset;
> +       va             = netmem_address(frag_page->netmem) + head_offset;
>         data           = va + rx_headroom;
>         frag_size      = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
>
> -       addr = page_pool_get_dma_addr(frag_page->page);
> +       addr = page_pool_get_dma_addr_netmem(frag_page->netmem);
>         dma_sync_single_range_for_cpu(rq->pdev, addr, head_offset,
>                                       frag_size, rq->buff.map_dir);
>         net_prefetch(data);
> @@ -2198,16 +2203,19 @@ mlx5e_skb_from_cqe_shampo(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
>                           struct mlx5_cqe64 *cqe, u16 header_index)
>  {
>         struct mlx5e_frag_page *frag_page = mlx5e_shampo_hd_to_frag_page(rq, header_index);
> -       dma_addr_t page_dma_addr = page_pool_get_dma_addr(frag_page->page);
>         u16 head_offset = mlx5e_shampo_hd_offset(header_index);
> -       dma_addr_t dma_addr = page_dma_addr + head_offset;
>         u16 head_size = cqe->shampo.header_size;
>         u16 rx_headroom = rq->buff.headroom;
>         struct sk_buff *skb = NULL;
> +       dma_addr_t page_dma_addr;
> +       dma_addr_t dma_addr;
>         void *hdr, *data;
>         u32 frag_size;
>
> -       hdr             = page_address(frag_page->page) + head_offset;
> +       page_dma_addr = page_pool_get_dma_addr_netmem(frag_page->netmem);
> +       dma_addr = page_dma_addr + head_offset;
> +
> +       hdr             = netmem_address(frag_page->netmem) + head_offset;
>         data            = hdr + rx_headroom;
>         frag_size       = MLX5_SKB_FRAG_SZ(rx_headroom + head_size);
>
> @@ -2232,7 +2240,7 @@ mlx5e_skb_from_cqe_shampo(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
>                 }
>
>                 net_prefetchw(skb->data);
> -               mlx5e_copy_skb_header(rq, skb, frag_page->page, dma_addr,
> +               mlx5e_copy_skb_header(rq, skb, frag_page->netmem, dma_addr,
>                                       head_offset + rx_headroom,
>                                       rx_headroom, head_size);
>                 /* skb linear part was allocated with headlen and aligned to long */
> @@ -2326,11 +2334,20 @@ static void mlx5e_handle_rx_cqe_mpwrq_shampo(struct mlx5e_rq *rq, struct mlx5_cq
>         }
>
>         if (!*skb) {
> -               if (likely(head_size))
> +               if (likely(head_size)) {
>                         *skb = mlx5e_skb_from_cqe_shampo(rq, wi, cqe, header_index);
> -               else
> -                       *skb = mlx5e_skb_from_cqe_mpwrq_nonlinear(rq, wi, cqe, cqe_bcnt,
> -                                                                 data_offset, page_idx);
> +               } else {
> +                       struct mlx5e_frag_page *frag_page;
> +
> +                       frag_page = &wi->alloc_units.frag_pages[page_idx];
> +                       if (unlikely(netmem_is_net_iov(frag_page->netmem)))
> +                               goto free_hd_entry;
> +                       *skb = mlx5e_skb_from_cqe_mpwrq_nonlinear(rq, wi, cqe,
> +                                                                 cqe_bcnt,
> +                                                                 data_offset,
> +                                                                 page_idx);
> +               }
> +
>                 if (unlikely(!*skb))
>                         goto free_hd_entry;
>
> --
> 2.31.1
>
>


-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 11/11] net/mlx5e: Support ethtool tcp-data-split settings
  2025-05-22 22:55   ` Jakub Kicinski
@ 2025-05-22 23:19     ` Saeed Mahameed
  2025-05-23 16:17       ` Cosmin Ratiu
  2025-05-27 16:10       ` Jakub Kicinski
  0 siblings, 2 replies; 50+ messages in thread
From: Saeed Mahameed @ 2025-05-22 23:19 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Tariq Toukan, David S. Miller, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Saeed Mahameed, Leon Romanovsky, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

On 22 May 15:55, Jakub Kicinski wrote:
>On Fri, 23 May 2025 00:41:26 +0300 Tariq Toukan wrote:
>> +	/* if HW GRO is not enabled due to external limitations but is wanted,
>> +	 * report HDS state as unknown so it won't get turned off explicitly.
>> +	 */
>> +	if (kernel_param->tcp_data_split == ETHTOOL_TCP_DATA_SPLIT_DISABLED &&
>> +	    priv->netdev->wanted_features & NETIF_F_GRO_HW)
>> +		kernel_param->tcp_data_split = ETHTOOL_TCP_DATA_SPLIT_UNKNOWN;
>
>The kernel_param->tcp_data_split here is the user config, right?
>It would be cleaner to not support setting SPLIT_DISABLED.
>Nothing requires that, you can support just setting AUTO and ENABLED.
>

I think I agree. AUTO might require some extra work on the driver side to
figure out the current internal mode, but it actually makes more sense than
just doing "UNKNOWN". UNKNOWN here means that HW GRO needs to be enabled
when disabling TCP header split, and we still don't know if that will work.

Cosmin, will you look into this?

>> +
>
>nit: extra empty line, please run checkpatch
>
>>  }
>>
>>  static void mlx5e_get_ringparam(struct net_device *dev,
>> @@ -383,6 +391,43 @@ static void mlx5e_get_ringparam(struct net_device *dev,
>>  	mlx5e_ethtool_get_ringparam(priv, param, kernel_param);
>>  }
>>
>> +static bool mlx5e_ethtool_set_tcp_data_split(struct mlx5e_priv *priv,
>> +					     u8 tcp_data_split)
>> +{
>> +	bool enable = (tcp_data_split == ETHTOOL_TCP_DATA_SPLIT_ENABLED);
>> +	struct net_device *dev = priv->netdev;
>> +
>> +	if (tcp_data_split == ETHTOOL_TCP_DATA_SPLIT_UNKNOWN)
>> +		return true;
>> +
>> +	if (enable && !(dev->hw_features & NETIF_F_GRO_HW)) {
>> +		netdev_warn(dev, "TCP-data-split is not supported when GRO HW is not supported\n");
>> +		return false; /* GRO HW is not supported */
>> +	}
>> +
>> +	if (enable && (dev->features & NETIF_F_GRO_HW)) {
>> +		/* Already enabled */
>> +		dev->wanted_features |= NETIF_F_GRO_HW;
>> +		return true;
>> +	}
>> +
>> +	if (!enable && !(dev->features & NETIF_F_GRO_HW)) {
>> +		/* Already disabled */
>> +		dev->wanted_features &= ~NETIF_F_GRO_HW;
>> +		return true;
>> +	}
>> +
>> +	/* Try enable or disable GRO HW */
>> +	if (enable)
>> +		dev->wanted_features |= NETIF_F_GRO_HW;
>> +	else
>> +		dev->wanted_features &= ~NETIF_F_GRO_HW;
>
>Why are you modifying wanted_features? wanted_features is what
>*user space* wanted! You should probably operate on hw_features ?
>Tho, may be cleaner to return an error and an extack if the user
>tries to set HDS and GRO to conflicting values.
>

hw_features is the hw capabilities; it doesn't mean on/off, so no, we
can't rely on that.

To enable TCP_DATA_SPLIT we tie it to GRO_HW, so we enable GRO_HW when
TCP_DATA_SPLIT is set to on, and vice versa. I agree it's not the
cleanest, but it is good for user visibility, as you would see both ON if
you query from user space, which is the actual state. This is the only
way for the driver to set HW_GRO to on without losing the previous state
when we turn the other bit on/off.

Yes, I guess such logic should be in the stack, although I don't see
anything wrong here in terms of correctness.
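
For reference, the extack-based alternative would look roughly like the
sketch below (illustrative only; it assumes the helper returns an error
code and has an extack available at this point in the set path):

	if (enable && !(dev->features & NETIF_F_GRO_HW)) {
		NL_SET_ERR_MSG_MOD(extack,
				   "Enable HW GRO before enabling tcp-data-split");
		return -EINVAL;
	}

	if (!enable && (dev->features & NETIF_F_GRO_HW)) {
		NL_SET_ERR_MSG_MOD(extack,
				   "Disable HW GRO before disabling tcp-data-split");
		return -EINVAL;
	}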


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 06/11] net/mlx5e: SHAMPO: Separate pool for headers
  2025-05-22 23:08     ` Saeed Mahameed
@ 2025-05-22 23:24       ` Mina Almasry
  2025-05-22 23:43         ` Saeed Mahameed
  2025-05-27 15:29       ` Jakub Kicinski
  1 sibling, 1 reply; 50+ messages in thread
From: Mina Almasry @ 2025-05-22 23:24 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Jakub Kicinski, Tariq Toukan, David S. Miller, Paolo Abeni,
	Eric Dumazet, Andrew Lunn, Saeed Mahameed, Leon Romanovsky,
	Richard Cochran, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, netdev, linux-rdma,
	linux-kernel, bpf, Moshe Shemesh, Mark Bloch, Gal Pressman,
	Cosmin Ratiu, Dragos Tatulea

On Thu, May 22, 2025 at 4:09 PM Saeed Mahameed <saeed@kernel.org> wrote:
>
> On 22 May 15:30, Jakub Kicinski wrote:
> >On Fri, 23 May 2025 00:41:21 +0300 Tariq Toukan wrote:
> >> Allocate a separate page pool for headers when SHAMPO is enabled.
> >> This will be useful for adding support to zc page pool, which has to be
> >> different from the headers page pool.
> >
> >Could you explain why always allocate a separate pool?
>
> Better flow management, 0 conditional code on data path to alloc/return
> header buffers, since in mlx5 we already have separate paths to handle
> header, we don't have/need bnxt_separate_head_pool() and
> rxr->need_head_pool spread across the code..
>
> Since we alloc and return pages in bulks, it makes more sense to manage
> headers and data in separate pools if we are going to do it anyway for
> "undreadable_pools", and when there's no performance impact.
>

Are you allocating a full page for each incoming header (which is much
smaller than a page)? Or are you reusing the same PAGE_SIZE page from the
page_pool to store multiple headers?

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 09/11] net/mlx5e: Add support for UNREADABLE netmem page pools
  2025-05-22 21:41 ` [PATCH net-next V2 09/11] net/mlx5e: Add support for UNREADABLE netmem page pools Tariq Toukan
@ 2025-05-22 23:26   ` Mina Almasry
  2025-05-22 23:56     ` Saeed Mahameed
  0 siblings, 1 reply; 50+ messages in thread
From: Mina Almasry @ 2025-05-22 23:26 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Saeed Mahameed, Leon Romanovsky, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

On Thu, May 22, 2025 at 2:46 PM Tariq Toukan <tariqt@nvidia.com> wrote:
>
> From: Saeed Mahameed <saeedm@nvidia.com>
>
> On netdev_rx_queue_restart, a special type of page pool maybe expected.
>
> In this patch declare support for UNREADABLE netmem iov pages in the
> pool params only when header data split shampo RQ mode is enabled, also
> set the queue index in the page pool params struct.
>
> Shampo mode requirement: Without header split rx needs to peek at the data,
> we can't do UNREADABLE_NETMEM.
>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 9e2975782a82..485b1515ace5 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -952,6 +952,11 @@ static int mlx5e_alloc_rq(struct mlx5e_params *params,
>                 pp_params.netdev    = rq->netdev;
>                 pp_params.dma_dir   = rq->buff.map_dir;
>                 pp_params.max_len   = PAGE_SIZE;
> +               pp_params.queue_idx = rq->ix;
> +
> +               /* Shampo header data split allow for unreadable netmem */
> +               if (test_bit(MLX5E_RQ_STATE_SHAMPO, &rq->state))
> +                       pp_params.flags |= PP_FLAG_ALLOW_UNREADABLE_NETMEM;
>

This patch itself looks good to me, FWIW, but unreadable netmem
will return netmem_address(netmem) == NULL, which, from an initial look,
you didn't seem to be handling in the previous patches. Not sure
if that's an oversight or if you are sure you're not going to have
unreadable netmem in those code paths for some reason.


-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 06/11] net/mlx5e: SHAMPO: Separate pool for headers
  2025-05-22 23:24       ` Mina Almasry
@ 2025-05-22 23:43         ` Saeed Mahameed
  0 siblings, 0 replies; 50+ messages in thread
From: Saeed Mahameed @ 2025-05-22 23:43 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Jakub Kicinski, Tariq Toukan, David S. Miller, Paolo Abeni,
	Eric Dumazet, Andrew Lunn, Saeed Mahameed, Leon Romanovsky,
	Richard Cochran, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, netdev, linux-rdma,
	linux-kernel, bpf, Moshe Shemesh, Mark Bloch, Gal Pressman,
	Cosmin Ratiu, Dragos Tatulea

On 22 May 16:24, Mina Almasry wrote:
>On Thu, May 22, 2025 at 4:09 PM Saeed Mahameed <saeed@kernel.org> wrote:
>>
>> On 22 May 15:30, Jakub Kicinski wrote:
>> >On Fri, 23 May 2025 00:41:21 +0300 Tariq Toukan wrote:
>> >> Allocate a separate page pool for headers when SHAMPO is enabled.
>> >> This will be useful for adding support to zc page pool, which has to be
>> >> different from the headers page pool.
>> >
>> >Could you explain why always allocate a separate pool?
>>
>> Better flow management, 0 conditional code on data path to alloc/return
>> header buffers, since in mlx5 we already have separate paths to handle
>> header, we don't have/need bnxt_separate_head_pool() and
>> rxr->need_head_pool spread across the code..
>>
>> Since we alloc and return pages in bulks, it makes more sense to manage
>> headers and data in separate pools if we are going to do it anyway for
>> "undreadable_pools", and when there's no performance impact.
>>
>
>Are you allocating full pages for each incoming header (which is much
>smaller than a page)? Or are you reusing the same PAGE_SIZE from the
>page_pool to store multiple headers?
>

Multiple headers per page: MLX5E_SHAMPO_WQ_HEADER_PER_PAGE of them.
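
I.e., headers are packed into fixed-size slots within a page, roughly
like this (illustrative only, not the driver's actual helper):

	#define HDRS_PER_PAGE	MLX5E_SHAMPO_WQ_HEADER_PER_PAGE
	#define HDR_SLOT_SIZE	(PAGE_SIZE / HDRS_PER_PAGE)

	static inline u32 shampo_hd_offset_sketch(u32 header_index)
	{
		return (header_index % HDRS_PER_PAGE) * HDR_SLOT_SIZE;
	}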

>-- 
>Thanks,
>Mina

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 08/11] net/mlx5e: Convert over to netmem
  2025-05-22 23:18   ` Mina Almasry
@ 2025-05-22 23:54     ` Saeed Mahameed
  2025-05-23 17:58       ` Mina Almasry
  0 siblings, 1 reply; 50+ messages in thread
From: Saeed Mahameed @ 2025-05-22 23:54 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Tariq Toukan, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet, Andrew Lunn, Saeed Mahameed, Leon Romanovsky,
	Richard Cochran, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, netdev, linux-rdma,
	linux-kernel, bpf, Moshe Shemesh, Mark Bloch, Gal Pressman,
	Cosmin Ratiu, Dragos Tatulea

On 22 May 16:18, Mina Almasry wrote:
>On Thu, May 22, 2025 at 2:46 PM Tariq Toukan <tariqt@nvidia.com> wrote:
>>
>> From: Saeed Mahameed <saeedm@nvidia.com>
>>
>> mlx5e_page_frag holds the physical page itself, to naturally support
>> zc page pools, remove physical page reference from mlx5 and replace it
>> with netmem_ref, to avoid internal handling in mlx5 for net_iov backed
>> pages.
>>
>> SHAMPO can issue packets that are not split into header and data. These
>> packets will be dropped if the data part resides in a net_iov as the
>> driver can't read into this area.
>>
>> No performance degradation observed.
>>
>> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
>> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
>> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
>> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>> ---
>>  drivers/net/ethernet/mellanox/mlx5/core/en.h  |   2 +-
>>  .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 103 ++++++++++--------
>>  2 files changed, 61 insertions(+), 44 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>> index c329de1d4f0a..65a73913b9a2 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>> @@ -553,7 +553,7 @@ struct mlx5e_icosq {
>>  } ____cacheline_aligned_in_smp;
>>
>>  struct mlx5e_frag_page {
>
>omega nit: maybe this should be renamed now to mlx5e_frag_netmem. Up to you.
>
30+ occurrences need changing; Tariq, Cosmin, up to you. I agree with the
comment though.

>> -       struct page *page;
>> +       netmem_ref netmem;
>>         u16 frags;
>>  };
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
>> index e34ef53ebd0e..75e753adedef 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
>> @@ -273,33 +273,33 @@ static inline u32 mlx5e_decompress_cqes_start(struct mlx5e_rq *rq,
>>
>>  #define MLX5E_PAGECNT_BIAS_MAX (PAGE_SIZE / 64)
>>
>> -static int mlx5e_page_alloc_fragmented(struct page_pool *pool,
>> +static int mlx5e_page_alloc_fragmented(struct page_pool *pp,
>>                                        struct mlx5e_frag_page *frag_page)
>>  {
>> -       struct page *page;
>> +       netmem_ref netmem = page_pool_alloc_netmems(pp,
>> +                                                   GFP_ATOMIC | __GFP_NOWARN);
>>
>> -       page = page_pool_dev_alloc_pages(pool);
>> -       if (unlikely(!page))
>> +       if (unlikely(!netmem))
>>                 return -ENOMEM;
>>
>> -       page_pool_fragment_page(page, MLX5E_PAGECNT_BIAS_MAX);
>> +       page_pool_fragment_netmem(netmem, MLX5E_PAGECNT_BIAS_MAX);
>>
>>         *frag_page = (struct mlx5e_frag_page) {
>> -               .page   = page,
>> +               .netmem = netmem,
>>                 .frags  = 0,
>>         };
>>
>>         return 0;
>>  }
>>
>> -static void mlx5e_page_release_fragmented(struct page_pool *pool,
>> +static void mlx5e_page_release_fragmented(struct page_pool *pp,
>>                                           struct mlx5e_frag_page *frag_page)
>>  {
>>         u16 drain_count = MLX5E_PAGECNT_BIAS_MAX - frag_page->frags;
>> -       struct page *page = frag_page->page;
>> +       netmem_ref netmem = frag_page->netmem;
>>
>> -       if (page_pool_unref_page(page, drain_count) == 0)
>> -               page_pool_put_unrefed_page(pool, page, -1, true);
>> +       if (page_pool_unref_netmem(netmem, drain_count) == 0)
>> +               page_pool_put_unrefed_netmem(pp, netmem, -1, true);
>>  }
>>
>>  static inline int mlx5e_get_rx_frag(struct mlx5e_rq *rq,
>> @@ -359,7 +359,7 @@ static int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe_cyc *wqe,
>>                 frag->flags &= ~BIT(MLX5E_WQE_FRAG_SKIP_RELEASE);
>>
>>                 headroom = i == 0 ? rq->buff.headroom : 0;
>> -               addr = page_pool_get_dma_addr(frag->frag_page->page);
>> +               addr = page_pool_get_dma_addr_netmem(frag->frag_page->netmem);
>>                 wqe->data[i].addr = cpu_to_be64(addr + frag->offset + headroom);
>>         }
>>
>> @@ -500,9 +500,10 @@ mlx5e_add_skb_shared_info_frag(struct mlx5e_rq *rq, struct skb_shared_info *sinf
>>                                struct xdp_buff *xdp, struct mlx5e_frag_page *frag_page,
>>                                u32 frag_offset, u32 len)
>>  {
>> +       netmem_ref netmem = frag_page->netmem;
>>         skb_frag_t *frag;
>>
>> -       dma_addr_t addr = page_pool_get_dma_addr(frag_page->page);
>> +       dma_addr_t addr = page_pool_get_dma_addr_netmem(netmem);
>>
>>         dma_sync_single_for_cpu(rq->pdev, addr + frag_offset, len, rq->buff.map_dir);
>>         if (!xdp_buff_has_frags(xdp)) {
>> @@ -515,9 +516,9 @@ mlx5e_add_skb_shared_info_frag(struct mlx5e_rq *rq, struct skb_shared_info *sinf
>>         }
>>
>>         frag = &sinfo->frags[sinfo->nr_frags++];
>> -       skb_frag_fill_page_desc(frag, frag_page->page, frag_offset, len);
>> +       skb_frag_fill_netmem_desc(frag, netmem, frag_offset, len);
>>
>> -       if (page_is_pfmemalloc(frag_page->page))
>> +       if (!netmem_is_net_iov(netmem) && netmem_is_pfmemalloc(netmem))
>
>nit: unnecessary net_iov check AFAICT?
>

yep, redundant.
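
With that dropped, the check would presumably reduce to just (sketch for the
next spin; netmem_is_pfmemalloc() already returns false for net_iov):

		if (netmem_is_pfmemalloc(netmem))
			xdp_buff_set_frag_pfmemalloc(xdp);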

>>                 xdp_buff_set_frag_pfmemalloc(xdp);
>>         sinfo->xdp_frags_size += len;
>>  }
>> @@ -528,27 +529,29 @@ mlx5e_add_skb_frag(struct mlx5e_rq *rq, struct sk_buff *skb,
>>                    u32 frag_offset, u32 len,
>>                    unsigned int truesize)
>>  {
>> -       dma_addr_t addr = page_pool_get_dma_addr(frag_page->page);
>> +       dma_addr_t addr = page_pool_get_dma_addr_netmem(frag_page->netmem);
>>         u8 next_frag = skb_shinfo(skb)->nr_frags;
>> +       netmem_ref netmem = frag_page->netmem;
>>
>>         dma_sync_single_for_cpu(rq->pdev, addr + frag_offset, len,
>>                                 rq->buff.map_dir);
>>
>> -       if (skb_can_coalesce(skb, next_frag, frag_page->page, frag_offset)) {
>> +       if (skb_can_coalesce_netmem(skb, next_frag, netmem, frag_offset)) {
>>                 skb_coalesce_rx_frag(skb, next_frag - 1, len, truesize);
>> -       } else {
>> -               frag_page->frags++;
>> -               skb_add_rx_frag(skb, next_frag, frag_page->page,
>> -                               frag_offset, len, truesize);
>> +               return;
>>         }
>> +
>> +       frag_page->frags++;
>> +       skb_add_rx_frag_netmem(skb, next_frag, netmem,
>> +                              frag_offset, len, truesize);
>>  }
>>
>>  static inline void
>>  mlx5e_copy_skb_header(struct mlx5e_rq *rq, struct sk_buff *skb,
>> -                     struct page *page, dma_addr_t addr,
>> +                     netmem_ref netmem, dma_addr_t addr,
>>                       int offset_from, int dma_offset, u32 headlen)
>>  {
>> -       const void *from = page_address(page) + offset_from;
>> +       const void *from = netmem_address(netmem) + offset_from;
>
>I think this needs a check that netmem_address != NULL and safe error
>handling in case it is? If the netmem is unreadable, netmem_address
>will return NULL, and because you add offset_from to it, you can't
>NULL check from as well.
>

Nope, this code path is not for GRO_HW, it is always safe to assume this is
not iov_netmem.

>>         /* Aligning len to sizeof(long) optimizes memcpy performance */
>>         unsigned int len = ALIGN(headlen, sizeof(long));
>>
>> @@ -685,7 +688,7 @@ static int mlx5e_build_shampo_hd_umr(struct mlx5e_rq *rq,
>>                 if (unlikely(err))
>>                         goto err_unmap;
>>
>> -               addr = page_pool_get_dma_addr(frag_page->page);
>> +               addr = page_pool_get_dma_addr_netmem(frag_page->netmem);
>>
>>                 for (int j = 0; j < MLX5E_SHAMPO_WQ_HEADER_PER_PAGE; j++) {
>>                         header_offset = mlx5e_shampo_hd_offset(index++);
>> @@ -796,7 +799,8 @@ static int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>>                 err = mlx5e_page_alloc_fragmented(rq->page_pool, frag_page);
>>                 if (unlikely(err))
>>                         goto err_unmap;
>> -               addr = page_pool_get_dma_addr(frag_page->page);
>> +
>> +               addr = page_pool_get_dma_addr_netmem(frag_page->netmem);
>>                 umr_wqe->inline_mtts[i] = (struct mlx5_mtt) {
>>                         .ptag = cpu_to_be64(addr | MLX5_EN_WR),
>>                 };
>> @@ -1216,7 +1220,7 @@ static void *mlx5e_shampo_get_packet_hd(struct mlx5e_rq *rq, u16 header_index)
>>         struct mlx5e_frag_page *frag_page = mlx5e_shampo_hd_to_frag_page(rq, header_index);
>>         u16 head_offset = mlx5e_shampo_hd_offset(header_index) + rq->buff.headroom;
>>
>> -       return page_address(frag_page->page) + head_offset;
>> +       return netmem_address(frag_page->netmem) + head_offset;
>
>Similar concern here. netmem_address may be NULL, especially when you
>enable unreadable netmem support in the later patches. There are a
>couple of call sites below as well.

This is the header path, so we are good.

We only need to check one location in mlx5, which is the data payload path of
GRO_HW and it's covered.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 09/11] net/mlx5e: Add support for UNREADABLE netmem page pools
  2025-05-22 23:26   ` Mina Almasry
@ 2025-05-22 23:56     ` Saeed Mahameed
  0 siblings, 0 replies; 50+ messages in thread
From: Saeed Mahameed @ 2025-05-22 23:56 UTC (permalink / raw)
  Cc: Tariq Toukan, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet, Andrew Lunn, Saeed Mahameed, Leon Romanovsky,
	Richard Cochran, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, netdev, linux-rdma,
	linux-kernel, bpf, Moshe Shemesh, Mark Bloch, Gal Pressman,
	Cosmin Ratiu, Dragos Tatulea

On 22 May 16:26, Mina Almasry wrote:
>On Thu, May 22, 2025 at 2:46 PM Tariq Toukan <tariqt@nvidia.com> wrote:
>>
>> From: Saeed Mahameed <saeedm@nvidia.com>
>>
>> On netdev_rx_queue_restart, a special type of page pool may be expected.
>>
>> In this patch, declare support for UNREADABLE netmem iov pages in the
>> pool params only when header-data-split SHAMPO RQ mode is enabled, and
>> also set the queue index in the page pool params struct.
>>
>> Shampo mode requirement: without header split, rx needs to peek at the
>> data, so we can't do UNREADABLE_NETMEM.
>>
>> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
>> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
>> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
>> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>> ---
>>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> index 9e2975782a82..485b1515ace5 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> @@ -952,6 +952,11 @@ static int mlx5e_alloc_rq(struct mlx5e_params *params,
>>                 pp_params.netdev    = rq->netdev;
>>                 pp_params.dma_dir   = rq->buff.map_dir;
>>                 pp_params.max_len   = PAGE_SIZE;
>> +               pp_params.queue_idx = rq->ix;
>> +
>> +               /* Shampo header data split allow for unreadable netmem */
>> +               if (test_bit(MLX5E_RQ_STATE_SHAMPO, &rq->state))
>> +                       pp_params.flags |= PP_FLAG_ALLOW_UNREADABLE_NETMEM;
>>
>
>This patch itself looks good to me for FWIW, but unreadable netmem
>will return netmem_address(netmem) == NULL, which from an initial look
>didn't seem like you were handling in the previous patches. Not sure
>if oversight or you are sure you're not going to have unreadable
>netmem in these code paths for some reason.

I think I explained this in my other reply to the other comment: we only need
to check in one location (HW_GRO payload handling).. other paths cannot carry
iov netmem, so we are good.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 11/11] net/mlx5e: Support ethtool tcp-data-split settings
  2025-05-22 23:19     ` Saeed Mahameed
@ 2025-05-23 16:17       ` Cosmin Ratiu
  2025-05-23 19:35         ` saeed
  2025-05-27 16:10       ` Jakub Kicinski
  1 sibling, 1 reply; 50+ messages in thread
From: Cosmin Ratiu @ 2025-05-23 16:17 UTC (permalink / raw)
  To: kuba@kernel.org, saeed@kernel.org
  Cc: andrew+netdev@lunn.ch, hawk@kernel.org, davem@davemloft.net,
	john.fastabend@gmail.com, leon@kernel.org,
	linux-kernel@vger.kernel.org, edumazet@google.com,
	linux-rdma@vger.kernel.org, ast@kernel.org, pabeni@redhat.com,
	richardcochran@gmail.com, Dragos Tatulea, Mark Bloch,
	bpf@vger.kernel.org, Tariq Toukan, Saeed Mahameed,
	netdev@vger.kernel.org, Gal Pressman, daniel@iogearbox.net,
	Moshe Shemesh

On Thu, 2025-05-22 at 16:19 -0700, Saeed Mahameed wrote:
> > 
> > The kernel_param->tcp_data_split here is the user config, right?
> > It would be cleaner to not support setting SPLIT_DISABLED.
> > Nothing requires that, you can support just setting AUTO and
> > ENABLED.
> > 
> 
> I think I agree, AUTO might require some extra work on the driver side
> to figure out the current internal mode, but it actually makes more sense
> than just doing "UNKNOWN". UNKNOWN here means that HW GRO needs to be
> enabled when disabling TCP HDR split, and we still don't know if that
> will work..
> 
> Cosmin will you look into this ? 

Yes, I will address all comments from this round by the next
submission.

Cosmin.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 08/11] net/mlx5e: Convert over to netmem
  2025-05-22 23:54     ` Saeed Mahameed
@ 2025-05-23 17:58       ` Mina Almasry
  2025-05-23 19:22         ` Saeed Mahameed
  0 siblings, 1 reply; 50+ messages in thread
From: Mina Almasry @ 2025-05-23 17:58 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Tariq Toukan, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet, Andrew Lunn, Saeed Mahameed, Leon Romanovsky,
	Richard Cochran, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, netdev, linux-rdma,
	linux-kernel, bpf, Moshe Shemesh, Mark Bloch, Gal Pressman,
	Cosmin Ratiu, Dragos Tatulea

On Thu, May 22, 2025 at 4:54 PM Saeed Mahameed <saeed@kernel.org> wrote:
> >>  static inline void
> >>  mlx5e_copy_skb_header(struct mlx5e_rq *rq, struct sk_buff *skb,
> >> -                     struct page *page, dma_addr_t addr,
> >> +                     netmem_ref netmem, dma_addr_t addr,
> >>                       int offset_from, int dma_offset, u32 headlen)
> >>  {
> >> -       const void *from = page_address(page) + offset_from;
> >> +       const void *from = netmem_address(netmem) + offset_from;
> >
> >I think this needs a check that netmem_address != NULL and safe error
> >handling in case it is? If the netmem is unreadable, netmem_address
> >will return NULL, and because you add offset_from to it, you can't
> >NULL check from as well.
> >
>
> Nope, this code path is not for GRO_HW, it is always safe to assume this is
> not iov_netmem.
>

OK, thanks for checking. It may be worth it to add
DEBUG_NET_WARN_ON_ONCE(!netmem_address(netmem)); in these places where
you're assuming the netmem is readable and has a valid address. It
would be a very subtle bug later on if someone moves the code or
something and suddenly you have unreadable netmem being funnelled
through these code paths. But up to you.

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 08/11] net/mlx5e: Convert over to netmem
  2025-05-23 17:58       ` Mina Almasry
@ 2025-05-23 19:22         ` Saeed Mahameed
  0 siblings, 0 replies; 50+ messages in thread
From: Saeed Mahameed @ 2025-05-23 19:22 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Tariq Toukan, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet, Andrew Lunn, Saeed Mahameed, Leon Romanovsky,
	Richard Cochran, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, netdev, linux-rdma,
	linux-kernel, bpf, Moshe Shemesh, Mark Bloch, Gal Pressman,
	Cosmin Ratiu, Dragos Tatulea

On 23 May 10:58, Mina Almasry wrote:
>On Thu, May 22, 2025 at 4:54 PM Saeed Mahameed <saeed@kernel.org> wrote:
>> >>  static inline void
>> >>  mlx5e_copy_skb_header(struct mlx5e_rq *rq, struct sk_buff *skb,
>> >> -                     struct page *page, dma_addr_t addr,
>> >> +                     netmem_ref netmem, dma_addr_t addr,
>> >>                       int offset_from, int dma_offset, u32 headlen)
>> >>  {
>> >> -       const void *from = page_address(page) + offset_from;
>> >> +       const void *from = netmem_address(netmem) + offset_from;
>> >
>> >I think this needs a check that netmem_address != NULL and safe error
>> >handling in case it is? If the netmem is unreadable, netmem_address
>> >will return NULL, and because you add offset_from to it, you can't
>> >NULL check from as well.
>> >
>>
>> Nope, this code path is not for GRO_HW, it is always safe to assume this is
>> not iov_netmem.
>>
>
>OK, thanks for checking. It may be worth it to add
>DEBUG_NET_WARN_ON_ONCE(!netmem_address(netmem)); in these places where

Too ugly, and it would only be caught in a DEBUG env with netmem_iov enabled
on a somehow broken driver, so if you are already doing that I am sure
you won't mind a crash :) in your debug env..

Also, I don't expect any of the mlx5 developers to confuse the header data
split paths with the other paths.. but maybe a comment somewhere should
cover this gap.

>you're assuming the netmem is readable and has a valid address. It
>would be a very subtle bug later on if someone moves the code or
>something and suddenly you have unreadable netmem being funnelled
>through these code paths. But up to you.
>

Cosmin, let's add comments on the shampo skb functions and the relevant
lines of code; maybe it will help prevent future mistakes.
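
Something along these lines on mlx5e_copy_skb_header() and the other
header-only helpers would probably do (just a sketch of the wording):

	/* Headers are always allocated from the driver's own (readable)
	 * header page pool, never from an unreadable net_iov provider, so
	 * netmem_address() cannot return NULL here. Do not route unreadable
	 * netmem (GRO_HW payload buffers) through this helper.
	 */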

>-- 
>Thanks,
>Mina
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 11/11] net/mlx5e: Support ethtool tcp-data-split settings
  2025-05-23 16:17       ` Cosmin Ratiu
@ 2025-05-23 19:35         ` saeed
  0 siblings, 0 replies; 50+ messages in thread
From: saeed @ 2025-05-23 19:35 UTC (permalink / raw)
  To: Cosmin Ratiu
  Cc: kuba@kernel.org, andrew+netdev@lunn.ch, hawk@kernel.org,
	davem@davemloft.net, john.fastabend@gmail.com, leon@kernel.org,
	linux-kernel@vger.kernel.org, edumazet@google.com,
	linux-rdma@vger.kernel.org, ast@kernel.org, pabeni@redhat.com,
	richardcochran@gmail.com, Dragos Tatulea, Mark Bloch,
	bpf@vger.kernel.org, Tariq Toukan, Saeed Mahameed,
	netdev@vger.kernel.org, Gal Pressman, daniel@iogearbox.net,
	Moshe Shemesh

On 23 May 16:17, Cosmin Ratiu wrote:
>On Thu, 2025-05-22 at 16:19 -0700, Saeed Mahameed wrote:
>> >
>> > The kernel_param->tcp_data_split here is the user config, right?
>> > It would be cleaner to not support setting SPLIT_DISABLED.
>> > Nothing requires that, you can support just setting AUTO and
>> > ENABLED.
>> >
>>
>> I think I agree, AUTO might require some extra work on the driver side
>> to figure out the current internal mode, but it actually makes more sense
>> than just doing "UNKNOWN". UNKNOWN here means that HW GRO needs to be
>> enabled when disabling TCP HDR split, and we still don't know if that
>> will work..
>>
>> Cosmin will you look into this ?
>
>Yes, I will address all comments from this round by the next
>submission.
>

I thought about it; maybe we should simplify mlx5:
when hw_gro is not enabled, mlx5 should fail the tcp_data_split ethtool
setting, and when hw_gro is enabled, setting tcp_data_split on mlx5 should
be a no-op.

I think being explicit is important here, as mlx5 doesn't support plain
tcp_data_split; it must come with gro_hw ..

Reporting tcp_data_split in mlx5 can remain the same as before the series.

We might need to update some tests and scripts to try hw_gro if enabling
just tcp_data_split didn't work.

As for the AUTO state, I really have no clue what it means, and I think we
should avoid it.
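
In code that would presumably boil down to something like this in the
set_ringparam handler (just a sketch; variable names are illustrative and
the exact shape is up to Cosmin):

	if (kernel_param->tcp_data_split == ETHTOOL_TCP_DATA_SPLIT_ENABLED &&
	    !(netdev->features & NETIF_F_GRO_HW)) {
		NL_SET_ERR_MSG_MOD(extack,
				   "tcp-data-split requires hw-gro to be enabled");
		return -EOPNOTSUPP;
	}
	/* with hw-gro already on, header/data split is implied: no-op */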


>Cosmin.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 02/11] net: Add skb_can_coalesce for netmem
  2025-05-22 23:09   ` Mina Almasry
@ 2025-05-25 13:03     ` Dragos Tatulea
  2025-05-25 17:44       ` Mina Almasry
  0 siblings, 1 reply; 50+ messages in thread
From: Dragos Tatulea @ 2025-05-25 13:03 UTC (permalink / raw)
  To: Mina Almasry, Tariq Toukan
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Saeed Mahameed, Leon Romanovsky, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu

On Thu, May 22, 2025 at 04:09:35PM -0700, Mina Almasry wrote:
> On Thu, May 22, 2025 at 2:43 PM Tariq Toukan <tariqt@nvidia.com> wrote:
> >
> > From: Dragos Tatulea <dtatulea@nvidia.com>
> >
> > Allow drivers that have moved over to netmem to do fragment coalescing.
> >
> > Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
> > Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> > Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> > ---
> >  include/linux/skbuff.h | 12 ++++++++++++
> >  1 file changed, 12 insertions(+)
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 5520524c93bf..e8e2860183b4 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -3887,6 +3887,18 @@ static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
> >         return false;
> >  }
> >
> > +static inline bool skb_can_coalesce_netmem(struct sk_buff *skb, int i,
> > +                                          const netmem_ref netmem, int off)
> > +{
> > +       if (i) {
> > +               const skb_frag_t *frag = &skb_shinfo(skb)->frags[i - 1];
> > +
> > +               return netmem == skb_frag_netmem(frag) &&
> > +                      off == skb_frag_off(frag) + skb_frag_size(frag);
> > +       }
> > +       return false;
> > +}
> > +
> 
> Can we limit the code duplication by changing skb_can_coalesce to call
> skb_can_coalesce_netmem? Or is that too bad for perf?
>
> static inline bool skb_can_coalesce(struct sk_buff *skb, int i, const
> struct page *page, int off) {
>     skb_can_coalesce_netmem(skb, i, page_to_netmem(page), off);
> }
> 
> It's always safe to cast a page to netmem.
>
I think it makes sense and I don't see an issue with perf as everything
stays inline and the cast should be free.

As netmems are used only for rx and skb_zcopy() seems to be used
only for tx (IIUC), maybe it makes sense to keep the skb_zcopy() check
within skb_can_coalesce(). Like below. Any thoughts?

--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3873,15 +3873,22 @@ static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
 {
 	if (skb_zcopy(skb))
 		return false;
-	if (i) {
-		const skb_frag_t *frag = &skb_shinfo(skb)->frags[i - 1];
 
-		return page == skb_frag_page(frag) &&
-		       off == skb_frag_off(frag) + skb_frag_size(frag);
-	}
-	return false;
+	return skb_can_coalesce_netmem(skb, i, page_to_netmem(page), off);
 }
 
+   static inline bool skb_can_coalesce_netmem(struct sk_buff *skb, int i,
+                                              const netmem_ref netmem, int off)
+   {
+          if (i) {
+                  const skb_frag_t *frag = &skb_shinfo(skb)->frags[i - 1];
+
+                  return netmem == skb_frag_netmem(frag) &&
+                         off == skb_frag_off(frag) + skb_frag_size(frag);
+          }
+          return false;
+   }
+

Thanks,
Dragos

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 02/11] net: Add skb_can_coalesce for netmem
  2025-05-25 13:03     ` Dragos Tatulea
@ 2025-05-25 17:44       ` Mina Almasry
  2025-05-28  9:13         ` Dragos Tatulea
  0 siblings, 1 reply; 50+ messages in thread
From: Mina Almasry @ 2025-05-25 17:44 UTC (permalink / raw)
  To: Dragos Tatulea
  Cc: Tariq Toukan, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet, Andrew Lunn, Saeed Mahameed, Leon Romanovsky,
	Richard Cochran, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, netdev, linux-rdma,
	linux-kernel, bpf, Moshe Shemesh, Mark Bloch, Gal Pressman,
	Cosmin Ratiu

On Sun, May 25, 2025 at 6:04 AM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>
> On Thu, May 22, 2025 at 04:09:35PM -0700, Mina Almasry wrote:
> > On Thu, May 22, 2025 at 2:43 PM Tariq Toukan <tariqt@nvidia.com> wrote:
> > >
> > > From: Dragos Tatulea <dtatulea@nvidia.com>
> > >
> > > Allow drivers that have moved over to netmem to do fragment coalescing.
> > >
> > > Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
> > > Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> > > Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> > > ---
> > >  include/linux/skbuff.h | 12 ++++++++++++
> > >  1 file changed, 12 insertions(+)
> > >
> > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > index 5520524c93bf..e8e2860183b4 100644
> > > --- a/include/linux/skbuff.h
> > > +++ b/include/linux/skbuff.h
> > > @@ -3887,6 +3887,18 @@ static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
> > >         return false;
> > >  }
> > >
> > > +static inline bool skb_can_coalesce_netmem(struct sk_buff *skb, int i,
> > > +                                          const netmem_ref netmem, int off)
> > > +{
> > > +       if (i) {
> > > +               const skb_frag_t *frag = &skb_shinfo(skb)->frags[i - 1];
> > > +
> > > +               return netmem == skb_frag_netmem(frag) &&
> > > +                      off == skb_frag_off(frag) + skb_frag_size(frag);
> > > +       }
> > > +       return false;
> > > +}
> > > +
> >
> > Can we limit the code duplication by changing skb_can_coalesce to call
> > skb_can_coalesce_netmem? Or is that too bad for perf?
> >
> > static inline bool skb_can_coalesce(struct sk_buff *skb, int i, const
> > struct page *page, int off) {
> >     skb_can_coalesce_netmem(skb, i, page_to_netmem(page), off);
> > }
> >
> > It's always safe to cast a page to netmem.
> >
> I think it makes sense and I don't see an issue with perf as everything
> stays inline and the cast should be free.
>
> As netmems are used only for rx and skb_zcopy() seems to be used
> only for tx (IIUC), maybe it makes sense to keep the skb_zcopy() check
> within skb_can_coalesce(). Like below. Any thoughts?
>

[net|dev]mems can now be in the TX path too:
https://lore.kernel.org/netdev/20250508004830.4100853-1-almasrymina@google.com/

And even without explicit TX support, IIUC from Kuba RX packets can
always be looped back to the TX path via forwarding or tc and what
not. So let's leave the skb_zcopy check in the common path for now
unless we're sure the move is safe.
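
For clarity, the shape this would converge on is roughly the below (just a
sketch combining the two snippets above, with skb_zcopy() kept in the shared
helper; the cast is needed because page_to_netmem() takes a non-const page):

static inline bool skb_can_coalesce_netmem(struct sk_buff *skb, int i,
					   const netmem_ref netmem, int off)
{
	if (skb_zcopy(skb))
		return false;
	if (i) {
		const skb_frag_t *frag = &skb_shinfo(skb)->frags[i - 1];

		return netmem == skb_frag_netmem(frag) &&
		       off == skb_frag_off(frag) + skb_frag_size(frag);
	}
	return false;
}

static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
				    const struct page *page, int off)
{
	return skb_can_coalesce_netmem(skb, i,
				       page_to_netmem((struct page *)page), off);
}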

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 06/11] net/mlx5e: SHAMPO: Separate pool for headers
  2025-05-22 23:08     ` Saeed Mahameed
  2025-05-22 23:24       ` Mina Almasry
@ 2025-05-27 15:29       ` Jakub Kicinski
  2025-05-27 15:53         ` Dragos Tatulea
  1 sibling, 1 reply; 50+ messages in thread
From: Jakub Kicinski @ 2025-05-27 15:29 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Tariq Toukan, David S. Miller, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Saeed Mahameed, Leon Romanovsky, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

On Thu, 22 May 2025 16:08:48 -0700 Saeed Mahameed wrote:
> On 22 May 15:30, Jakub Kicinski wrote:
> >On Fri, 23 May 2025 00:41:21 +0300 Tariq Toukan wrote:  
> >> Allocate a separate page pool for headers when SHAMPO is enabled.
> >> This will be useful for adding support to zc page pool, which has to be
> >> different from the headers page pool.  
> >
> >Could you explain why always allocate a separate pool?  
> 
> Better flow management, 0 conditional code on the data path to alloc/return
> header buffers; since in mlx5 we already have separate paths to handle
> headers, we don't have/need bnxt_separate_head_pool() and
> rxr->need_head_pool spread across the code.. 
> 
> Since we alloc and return pages in bulk, it makes more sense to manage
> headers and data in separate pools if we are going to do it anyway for 
> "unreadable_pools", and when there's no performance impact.

I think you need to look closer at the bnxt implementation.
There is no conditional on the buffer alloc path. If the head and
payload pools are identical we simply assign the same pointer to 
(using mlx5 naming) page_pool and hd_page_pool.
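
i.e. at pool setup time, roughly (mlx5 naming; need_separate_head_pool() and
hd_page_pool are assumed names for the sketch, and error handling is omitted):

	if (need_separate_head_pool(rq)) {
		/* dedicated pool only when headers can't live in the payload
		 * pool (e.g. the payload pool is unreadable/zero-copy)
		 */
		rq->hd_page_pool = page_pool_create(&hd_pp_params);
	} else {
		/* reuse the payload pool: no extra memory sitting in
		 * recycling rings, no conditionals on the alloc path
		 */
		rq->hd_page_pool = rq->page_pool;
	}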

Your arguments are not very convincing, TBH.
The memory sitting in the recycling rings is very much not free.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 06/11] net/mlx5e: SHAMPO: Separate pool for headers
  2025-05-27 15:29       ` Jakub Kicinski
@ 2025-05-27 15:53         ` Dragos Tatulea
  0 siblings, 0 replies; 50+ messages in thread
From: Dragos Tatulea @ 2025-05-27 15:53 UTC (permalink / raw)
  To: Jakub Kicinski, Saeed Mahameed
  Cc: Tariq Toukan, David S. Miller, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Saeed Mahameed, Leon Romanovsky, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu

On Tue, May 27, 2025 at 08:29:56AM -0700, Jakub Kicinski wrote:
> On Thu, 22 May 2025 16:08:48 -0700 Saeed Mahameed wrote:
> > On 22 May 15:30, Jakub Kicinski wrote:
> > >On Fri, 23 May 2025 00:41:21 +0300 Tariq Toukan wrote:  
> > >> Allocate a separate page pool for headers when SHAMPO is enabled.
> > >> This will be useful for adding support to zc page pool, which has to be
> > >> different from the headers page pool.  
> > >
> > >Could you explain why always allocate a separate pool?  
> > 
> > Better flow management, 0 conditional code on the data path to alloc/return
> > header buffers; since in mlx5 we already have separate paths to handle
> > headers, we don't have/need bnxt_separate_head_pool() and
> > rxr->need_head_pool spread across the code.. 
> > 
> > Since we alloc and return pages in bulk, it makes more sense to manage
> > headers and data in separate pools if we are going to do it anyway for 
> > "unreadable_pools", and when there's no performance impact.
> 
> I think you need to look closer at the bnxt implementation.
> There is no conditional on the buffer alloc path. If the head and
> payload pools are identical we simply assign the same pointer to 
> (using mlx5 naming) page_pool and hd_page_pool.
> 
> Your arguments are not very convincing, TBH.
> The memory sitting in the recycling rings is very much not free.

I can add 2 more small arguments for always using 2 page pools:

- For large ring sizes + high MTU, the page_pool size will go above the
  internal page_pool limit in HW GRO mode.

- Debuggability (already mentioned by Saeed in the counters patch): if
  something goes wrong (page leaks, for example) we can easily pinpoint
  where the issue is.

Thanks,
Dragos

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy
  2025-05-22 21:41 [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy Tariq Toukan
                   ` (10 preceding siblings ...)
  2025-05-22 21:41 ` [PATCH net-next V2 11/11] net/mlx5e: Support ethtool tcp-data-split settings Tariq Toukan
@ 2025-05-27 16:05 ` Stanislav Fomichev
  2025-05-28  9:17   ` Dragos Tatulea
  2025-05-28  0:31 ` Jakub Kicinski
  2025-05-28  1:20 ` patchwork-bot+netdevbpf
  13 siblings, 1 reply; 50+ messages in thread
From: Stanislav Fomichev @ 2025-05-27 16:05 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Saeed Mahameed, Leon Romanovsky, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

On 05/23, Tariq Toukan wrote:
> This series from the team adds support for zerocopy rx TCP with devmem
> and io_uring for ConnectX7 NICs and above. For performance reasons and
> simplicity HW-GRO will also be turned on when header-data split mode is
> on.
> 
> Find more details below.
> 
> Regards,
> Tariq
> 
> Performance
> ===========
> 
> Test setup:
> 
> * CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (single NUMA)
> * NIC: ConnectX7
> * Benchmarking tool: kperf [1]
> * Single TCP flow
> * Test duration: 60s
> 
> With application thread and interrupts pinned to the *same* core:
> 
> |------+-----------+----------|
> | MTU  | epoll     | io_uring |
> |------+-----------+----------|
> | 1500 | 61.6 Gbps | 114 Gbps |
> | 4096 | 69.3 Gbps | 151 Gbps |
> | 9000 | 67.8 Gbps | 187 Gbps |
> |------+-----------+----------|
> 
> The CPU usage for io_uring is 95%.
> 
> Reproduction steps for io_uring:
> 
> server --no-daemon -a 2001:db8::1 --no-memcmp --iou --iou_sendzc \
>         --iou_zcrx --iou_dev_name eth2 --iou_zcrx_queue_id 2
> 
> server --no-daemon -a 2001:db8::2 --no-memcmp --iou --iou_sendzc
> 
> client --src 2001:db8::2 --dst 2001:db8::1 \
>         --msg-zerocopy -t 60 --cpu-min=2 --cpu-max=2
> 
> Patch overview:
> ================
> 
> First, a netmem API for skb_can_coalesce is added to the core to be able
> to do skb fragment coalescing on netmems.
> 
> The next patches introduce some cleanups in the internal SHAMPO code and
> improvements to hw gro capability checks in FW.
> 
> A separate page_pool is introduced for headers. Ethtool stats are added
> as well.
> 
> Then the driver is converted to use the netmem API and to allow support
> for unreadable netmem page pool.
> 
> The queue management ops are implemented.
> 
> Finally, the tcp-data-split ring parameter is exposed.
> 
> Changelog
> =========
> 
> Changes from v1 [0]:
> - Added support for skb_can_coalesce_netmem().
> - Avoid netmem_to_page() casts in the driver.
> - Fixed code to abide 80 char limit with some exceptions to avoid
> code churn.

Since there is gonna be 2-3 weeks of closed net-next, can you
also add a patch for the tx side? It should be trivial (skip dma unmap
for niovs in tx completions plus netdev->netmem_tx=1).
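
i.e. roughly this on the completion path, plus the capability flag at probe
time (a sketch of the idea only; the frag walk and variable names are
placeholders, not mlx5 code):

	/* TX completion: niov frags get their DMA address from the dmabuf
	 * binding, not from a driver mapping, so skip the unmap for them
	 */
	if (!skb_frag_is_net_iov(frag))
		dma_unmap_page(dev, dma_addr, skb_frag_size(frag),
			       DMA_TO_DEVICE);

	/* probe: advertise netmem TX support */
	netdev->netmem_tx = true;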

And, btw, what about the issue that Cosmin raised in [0]? Is it addressed
in this series?

0: https://lore.kernel.org/netdev/9322c3c4826ed1072ddc9a2103cc641060665864.camel@nvidia.com/

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 11/11] net/mlx5e: Support ethtool tcp-data-split settings
  2025-05-22 23:19     ` Saeed Mahameed
  2025-05-23 16:17       ` Cosmin Ratiu
@ 2025-05-27 16:10       ` Jakub Kicinski
  2025-05-28  5:10         ` Gal Pressman
  1 sibling, 1 reply; 50+ messages in thread
From: Jakub Kicinski @ 2025-05-27 16:10 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Tariq Toukan, David S. Miller, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Saeed Mahameed, Leon Romanovsky, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

On Thu, 22 May 2025 16:19:28 -0700 Saeed Mahameed wrote:
> >Why are you modifying wanted_features? wanted_features is what
> >*user space* wanted! You should probably operate on hw_features ?
> >Tho, may be cleaner to return an error and an extack if the user
> >tries to set HDS and GRO to conflicting values.
> >  
> 
> hw_features is hw capabilities, it doesn't mean on/off.. so no we can't
> rely on that.
> 
> To enable TCP_DATA_SPLIT we tie it to GRO_HW, so we enable GRO_HW when
> TCP_DATA_SPLIT is set to on and vice versa. I agree it's not the cleanest.. 
> But it is good for user-visibility as you would see both ON if you query
> from user, which is the actual state. This is the only way to set HW_GRO
> to on by driver and not lose previous state when we turn the other bit
> on/off.

   features = on
hw_features = off

is how we indicate the feature is "on [fixed]"
Tho, I'm not sure how much precedent there is for making things fixed
at runtime.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy
  2025-05-22 21:41 [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy Tariq Toukan
                   ` (11 preceding siblings ...)
  2025-05-27 16:05 ` [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy Stanislav Fomichev
@ 2025-05-28  0:31 ` Jakub Kicinski
  2025-05-28  1:20 ` patchwork-bot+netdevbpf
  13 siblings, 0 replies; 50+ messages in thread
From: Jakub Kicinski @ 2025-05-28  0:31 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: David S. Miller, Paolo Abeni, Eric Dumazet, Andrew Lunn,
	Saeed Mahameed, Leon Romanovsky, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu,
	Dragos Tatulea

On Fri, 23 May 2025 00:41:15 +0300 Tariq Toukan wrote:
>   net: Kconfig NET_DEVMEM selects GENERIC_ALLOCATOR

I'll apply this one already, seems like a good cleanup.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy
  2025-05-22 21:41 [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy Tariq Toukan
                   ` (12 preceding siblings ...)
  2025-05-28  0:31 ` Jakub Kicinski
@ 2025-05-28  1:20 ` patchwork-bot+netdevbpf
  13 siblings, 0 replies; 50+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-05-28  1:20 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: davem, kuba, pabeni, edumazet, andrew+netdev, saeedm, leon,
	richardcochran, ast, daniel, hawk, john.fastabend, netdev,
	linux-rdma, linux-kernel, bpf, moshe, mbloch, gal, cratiu,
	dtatulea

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Fri, 23 May 2025 00:41:15 +0300 you wrote:
> This series from the team adds support for zerocopy rx TCP with devmem
> and io_uring for ConnectX7 NICs and above. For performance reasons and
> simplicity HW-GRO will also be turned on when header-data split mode is
> on.
> 
> Find more details below.
> 
> [...]

Here is the summary with links:
  - [net-next,V2,01/11] net: Kconfig NET_DEVMEM selects GENERIC_ALLOCATOR
    https://git.kernel.org/netdev/net-next/c/cb575e5e9fd1
  - [net-next,V2,02/11] net: Add skb_can_coalesce for netmem
    (no matching commit)
  - [net-next,V2,03/11] net/mlx5e: SHAMPO: Reorganize mlx5_rq_shampo_alloc
    (no matching commit)
  - [net-next,V2,04/11] net/mlx5e: SHAMPO: Remove redundant params
    (no matching commit)
  - [net-next,V2,05/11] net/mlx5e: SHAMPO: Improve hw gro capability checking
    (no matching commit)
  - [net-next,V2,06/11] net/mlx5e: SHAMPO: Separate pool for headers
    (no matching commit)
  - [net-next,V2,07/11] net/mlx5e: SHAMPO: Headers page pool stats
    (no matching commit)
  - [net-next,V2,08/11] net/mlx5e: Convert over to netmem
    (no matching commit)
  - [net-next,V2,09/11] net/mlx5e: Add support for UNREADABLE netmem page pools
    (no matching commit)
  - [net-next,V2,10/11] net/mlx5e: Implement queue mgmt ops and single channel swap
    (no matching commit)
  - [net-next,V2,11/11] net/mlx5e: Support ethtool tcp-data-split settings
    (no matching commit)

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 11/11] net/mlx5e: Support ethtool tcp-data-split settings
  2025-05-27 16:10       ` Jakub Kicinski
@ 2025-05-28  5:10         ` Gal Pressman
  2025-05-29  0:12           ` Jakub Kicinski
  0 siblings, 1 reply; 50+ messages in thread
From: Gal Pressman @ 2025-05-28  5:10 UTC (permalink / raw)
  To: Jakub Kicinski, Saeed Mahameed
  Cc: Tariq Toukan, David S. Miller, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Saeed Mahameed, Leon Romanovsky, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Cosmin Ratiu, Dragos Tatulea

On 27/05/2025 19:10, Jakub Kicinski wrote:
> On Thu, 22 May 2025 16:19:28 -0700 Saeed Mahameed wrote:
>>> Why are you modifying wanted_features? wanted_features is what
>>> *user space* wanted! You should probably operate on hw_features ?
>>> Tho, may be cleaner to return an error and an extack if the user
>>> tries to set HDS and GRO to conflicting values.
>>>  
>>
>> hw_features is hw capabilities, it doesn't mean on/off.. so no we can't
>> rely on that.
>>
>> To enable TCP_DATA_SPLIT we tie it to GRO_HW, so we enable GRO_HW when
>> TCP_DATA_SPLIT is set to on and vice versa. I agree it's not the cleanest.. 
>> But it is good for user-visibility as you would see both ON if you query
>> from user, which is the actual state. This is the only way to set HW_GRO
>> to on by driver and not lose previous state when we turn the other bit
>> on/off.
> 
>    features = on
> hw_features = off
> 
> is how we indicate the feature is "on [fixed]"
> Tho, I'm not sure how much precedent there is for making things fixed
> at runtime.

Isn't this something that should be handled through fix_features?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 02/11] net: Add skb_can_coalesce for netmem
  2025-05-25 17:44       ` Mina Almasry
@ 2025-05-28  9:13         ` Dragos Tatulea
  0 siblings, 0 replies; 50+ messages in thread
From: Dragos Tatulea @ 2025-05-28  9:13 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Tariq Toukan, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet, Andrew Lunn, Saeed Mahameed, Leon Romanovsky,
	Richard Cochran, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, netdev, linux-rdma,
	linux-kernel, bpf, Moshe Shemesh, Mark Bloch, Gal Pressman,
	Cosmin Ratiu

On Sun, May 25, 2025 at 10:44:43AM -0700, Mina Almasry wrote:
> On Sun, May 25, 2025 at 6:04 AM Dragos Tatulea <dtatulea@nvidia.com> wrote:
> >
> > On Thu, May 22, 2025 at 04:09:35PM -0700, Mina Almasry wrote:
> > > On Thu, May 22, 2025 at 2:43 PM Tariq Toukan <tariqt@nvidia.com> wrote:
> > > >
> > > > From: Dragos Tatulea <dtatulea@nvidia.com>
> > > >
> > > > Allow drivers that have moved over to netmem to do fragment coalescing.
> > > >
> > > > Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
> > > > Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> > > > Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> > > > ---
> > > >  include/linux/skbuff.h | 12 ++++++++++++
> > > >  1 file changed, 12 insertions(+)
> > > >
> > > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > index 5520524c93bf..e8e2860183b4 100644
> > > > --- a/include/linux/skbuff.h
> > > > +++ b/include/linux/skbuff.h
> > > > @@ -3887,6 +3887,18 @@ static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
> > > >         return false;
> > > >  }
> > > >
> > > > +static inline bool skb_can_coalesce_netmem(struct sk_buff *skb, int i,
> > > > +                                          const netmem_ref netmem, int off)
> > > > +{
> > > > +       if (i) {
> > > > +               const skb_frag_t *frag = &skb_shinfo(skb)->frags[i - 1];
> > > > +
> > > > +               return netmem == skb_frag_netmem(frag) &&
> > > > +                      off == skb_frag_off(frag) + skb_frag_size(frag);
> > > > +       }
> > > > +       return false;
> > > > +}
> > > > +
> > >
> > > Can we limit the code duplication by changing skb_can_coalesce to call
> > > skb_can_coalesce_netmem? Or is that too bad for perf?
> > >
> > > static inline bool skb_can_coalesce(struct sk_buff *skb, int i, const
> > > struct page *page, int off) {
> > >     skb_can_coalesce_netmem(skb, i, page_to_netmem(page), off);
> > > }
> > >
> > > It's always safe to cast a page to netmem.
> > >
> > I think it makes sense and I don't see an issue with perf as everything
> > stays inline and the cast should be free.
> >
> > As netmems are used only for rx and skb_zcopy() seems to be used
> > only for tx (IIUC), maybe it makes sense to keep the skb_zcopy() check
> > within skb_can_coalesce(). Like below. Any thoughts?
> >
> 
> [net|dev]mems can now be in the TX path too:
> https://lore.kernel.org/netdev/20250508004830.4100853-1-almasrymina@google.com/
> 
> And even without explicit TX support, IIUC from Kuba RX packets can
> always be looped back to the TX path via forwarding or tc and what
> not. So let's leave the skb_zcopy check in the common path for now
> unless we're sure the move is safe.
>
Makes sense. Will be done.

Thanks,
Dragos

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy
  2025-05-27 16:05 ` [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy Stanislav Fomichev
@ 2025-05-28  9:17   ` Dragos Tatulea
  2025-05-28 15:45     ` Stanislav Fomichev
  0 siblings, 1 reply; 50+ messages in thread
From: Dragos Tatulea @ 2025-05-28  9:17 UTC (permalink / raw)
  To: Stanislav Fomichev, Tariq Toukan
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Saeed Mahameed, Leon Romanovsky, Richard Cochran,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, netdev, linux-rdma, linux-kernel, bpf,
	Moshe Shemesh, Mark Bloch, Gal Pressman, Cosmin Ratiu

On Tue, May 27, 2025 at 09:05:49AM -0700, Stanislav Fomichev wrote:
> On 05/23, Tariq Toukan wrote:
> > This series from the team adds support for zerocopy rx TCP with devmem
> > and io_uring for ConnectX7 NICs and above. For performance reasons and
> > simplicity HW-GRO will also be turned on when header-data split mode is
> > on.
> > 
> > Find more details below.
> > 
> > Regards,
> > Tariq
> > 
> > Performance
> > ===========
> > 
> > Test setup:
> > 
> > * CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (single NUMA)
> > * NIC: ConnectX7
> > * Benchmarking tool: kperf [1]
> > * Single TCP flow
> > * Test duration: 60s
> > 
> > With application thread and interrupts pinned to the *same* core:
> > 
> > |------+-----------+----------|
> > | MTU  | epoll     | io_uring |
> > |------+-----------+----------|
> > | 1500 | 61.6 Gbps | 114 Gbps |
> > | 4096 | 69.3 Gbps | 151 Gbps |
> > | 9000 | 67.8 Gbps | 187 Gbps |
> > |------+-----------+----------|
> > 
> > The CPU usage for io_uring is 95%.
> > 
> > Reproduction steps for io_uring:
> > 
> > server --no-daemon -a 2001:db8::1 --no-memcmp --iou --iou_sendzc \
> >         --iou_zcrx --iou_dev_name eth2 --iou_zcrx_queue_id 2
> > 
> > server --no-daemon -a 2001:db8::2 --no-memcmp --iou --iou_sendzc
> > 
> > client --src 2001:db8::2 --dst 2001:db8::1 \
> >         --msg-zerocopy -t 60 --cpu-min=2 --cpu-max=2
> > 
> > Patch overview:
> > ================
> > 
> > First, a netmem API for skb_can_coalesce is added to the core to be able
> > to do skb fragment coalescing on netmems.
> > 
> > The next patches introduce some cleanups in the internal SHAMPO code and
> > improvements to hw gro capability checks in FW.
> > 
> > A separate page_pool is introduced for headers. Ethtool stats are added
> > as well.
> > 
> > Then the driver is converted to use the netmem API and to allow support
> > for unreadable netmem page pool.
> > 
> > The queue management ops are implemented.
> > 
> > Finally, the tcp-data-split ring parameter is exposed.
> > 
> > Changelog
> > =========
> > 
> > Changes from v1 [0]:
> > - Added support for skb_can_coalesce_netmem().
> > - Avoid netmem_to_page() casts in the driver.
> > - Fixed code to abide 80 char limit with some exceptions to avoid
> > code churn.
> 
> Since there is gonna be 2-3 weeks of closed net-next, can you
> also add a patch for the tx side? It should be trivial (skip dma unmap
> for niovs in tx completions plus netdev->netmem_tx=1).
>
Seems indeed trivial. We will add it.

> And, btw, what about the issue that Cosmin raised in [0]? Is it addressed
> in this series?
> 
> 0: https://lore.kernel.org/netdev/9322c3c4826ed1072ddc9a2103cc641060665864.camel@nvidia.com/
We wanted to fix this afterwards, as it requires changing a more subtle
part of the code that replenishes pages. This needs more thinking and
testing.

Thanks,
Dragos

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy
  2025-05-28  9:17   ` Dragos Tatulea
@ 2025-05-28 15:45     ` Stanislav Fomichev
  2025-05-28 22:59       ` Mina Almasry
  0 siblings, 1 reply; 50+ messages in thread
From: Stanislav Fomichev @ 2025-05-28 15:45 UTC (permalink / raw)
  To: Dragos Tatulea
  Cc: Tariq Toukan, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet, Andrew Lunn, Saeed Mahameed, Leon Romanovsky,
	Richard Cochran, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, netdev, linux-rdma,
	linux-kernel, bpf, Moshe Shemesh, Mark Bloch, Gal Pressman,
	Cosmin Ratiu

On 05/28, Dragos Tatulea wrote:
> On Tue, May 27, 2025 at 09:05:49AM -0700, Stanislav Fomichev wrote:
> > On 05/23, Tariq Toukan wrote:
> > > This series from the team adds support for zerocopy rx TCP with devmem
> > > and io_uring for ConnectX7 NICs and above. For performance reasons and
> > > simplicity HW-GRO will also be turned on when header-data split mode is
> > > on.
> > > 
> > > Find more details below.
> > > 
> > > Regards,
> > > Tariq
> > > 
> > > Performance
> > > ===========
> > > 
> > > Test setup:
> > > 
> > > * CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (single NUMA)
> > > * NIC: ConnectX7
> > > * Benchmarking tool: kperf [1]
> > > * Single TCP flow
> > > * Test duration: 60s
> > > 
> > > With application thread and interrupts pinned to the *same* core:
> > > 
> > > |------+-----------+----------|
> > > | MTU  | epoll     | io_uring |
> > > |------+-----------+----------|
> > > | 1500 | 61.6 Gbps | 114 Gbps |
> > > | 4096 | 69.3 Gbps | 151 Gbps |
> > > | 9000 | 67.8 Gbps | 187 Gbps |
> > > |------+-----------+----------|
> > > 
> > > The CPU usage for io_uring is 95%.
> > > 
> > > Reproduction steps for io_uring:
> > > 
> > > server --no-daemon -a 2001:db8::1 --no-memcmp --iou --iou_sendzc \
> > >         --iou_zcrx --iou_dev_name eth2 --iou_zcrx_queue_id 2
> > > 
> > > server --no-daemon -a 2001:db8::2 --no-memcmp --iou --iou_sendzc
> > > 
> > > client --src 2001:db8::2 --dst 2001:db8::1 \
> > >         --msg-zerocopy -t 60 --cpu-min=2 --cpu-max=2
> > > 
> > > Patch overview:
> > > ================
> > > 
> > > First, a netmem API for skb_can_coalesce is added to the core to be able
> > > to do skb fragment coalescing on netmems.
> > > 
> > > The next patches introduce some cleanups in the internal SHAMPO code and
> > > improvements to hw gro capability checks in FW.
> > > 
> > > A separate page_pool is introduced for headers. Ethtool stats are added
> > > as well.
> > > 
> > > Then the driver is converted to use the netmem API and to allow support
> > > for unreadable netmem page pool.
> > > 
> > > The queue management ops are implemented.
> > > 
> > > Finally, the tcp-data-split ring parameter is exposed.
> > > 
> > > Changelog
> > > =========
> > > 
> > > Changes from v1 [0]:
> > > - Added support for skb_can_coalesce_netmem().
> > > - Avoid netmem_to_page() casts in the driver.
> > > - Fixed code to abide 80 char limit with some exceptions to avoid
> > > code churn.
> > 
> > Since there is gonna be 2-3 weeks of closed net-next, can you
> > also add a patch for the tx side? It should be trivial (skip dma unmap
> > for niovs in tx completions plus netdev->netmem_tx=1).
> >
> Seems indeed trivial. We will add it.
> 
> > And, btw, what about the issue that Cosmin raised in [0]? Is it addressed
> > in this series?
> > 
> > 0: https://lore.kernel.org/netdev/9322c3c4826ed1072ddc9a2103cc641060665864.camel@nvidia.com/
> We wanted to fix this afterwards as it needs to change a more subtle
> part in the code that replenishes pages. This needs more thinking and
> testing.

Thanks! For my understanding: does the issue occur only during initial
queue refill? Or the same problem will happen any time there is a burst
of traffic that might exhaust all rx descriptors?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy
  2025-05-28 15:45     ` Stanislav Fomichev
@ 2025-05-28 22:59       ` Mina Almasry
  2025-05-28 23:04         ` Stanislav Fomichev
  0 siblings, 1 reply; 50+ messages in thread
From: Mina Almasry @ 2025-05-28 22:59 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Dragos Tatulea, Tariq Toukan, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Eric Dumazet, Andrew Lunn, Saeed Mahameed,
	Leon Romanovsky, Richard Cochran, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, netdev,
	linux-rdma, linux-kernel, bpf, Moshe Shemesh, Mark Bloch,
	Gal Pressman, Cosmin Ratiu

On Wed, May 28, 2025 at 8:45 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
>
> On 05/28, Dragos Tatulea wrote:
> > On Tue, May 27, 2025 at 09:05:49AM -0700, Stanislav Fomichev wrote:
> > > On 05/23, Tariq Toukan wrote:
> > > > This series from the team adds support for zerocopy rx TCP with devmem
> > > > and io_uring for ConnectX7 NICs and above. For performance reasons and
> > > > simplicity HW-GRO will also be turned on when header-data split mode is
> > > > on.
> > > >
> > > > Find more details below.
> > > >
> > > > Regards,
> > > > Tariq
> > > >
> > > > Performance
> > > > ===========
> > > >
> > > > Test setup:
> > > >
> > > > * CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (single NUMA)
> > > > * NIC: ConnectX7
> > > > * Benchmarking tool: kperf [1]
> > > > * Single TCP flow
> > > > * Test duration: 60s
> > > >
> > > > With application thread and interrupts pinned to the *same* core:
> > > >
> > > > |------+-----------+----------|
> > > > | MTU  | epoll     | io_uring |
> > > > |------+-----------+----------|
> > > > | 1500 | 61.6 Gbps | 114 Gbps |
> > > > | 4096 | 69.3 Gbps | 151 Gbps |
> > > > | 9000 | 67.8 Gbps | 187 Gbps |
> > > > |------+-----------+----------|
> > > >
> > > > The CPU usage for io_uring is 95%.
> > > >
> > > > Reproduction steps for io_uring:
> > > >
> > > > server --no-daemon -a 2001:db8::1 --no-memcmp --iou --iou_sendzc \
> > > >         --iou_zcrx --iou_dev_name eth2 --iou_zcrx_queue_id 2
> > > >
> > > > server --no-daemon -a 2001:db8::2 --no-memcmp --iou --iou_sendzc
> > > >
> > > > client --src 2001:db8::2 --dst 2001:db8::1 \
> > > >         --msg-zerocopy -t 60 --cpu-min=2 --cpu-max=2
> > > >
> > > > Patch overview:
> > > > ================
> > > >
> > > > First, a netmem API for skb_can_coalesce is added to the core to be able
> > > > to do skb fragment coalescing on netmems.
> > > >
> > > > The next patches introduce some cleanups in the internal SHAMPO code and
> > > > improvements to hw gro capability checks in FW.
> > > >
> > > > A separate page_pool is introduced for headers. Ethtool stats are added
> > > > as well.
> > > >
> > > > Then the driver is converted to use the netmem API and to allow support
> > > > for unreadable netmem page pool.
> > > >
> > > > The queue management ops are implemented.
> > > >
> > > > Finally, the tcp-data-split ring parameter is exposed.
> > > >
> > > > Changelog
> > > > =========
> > > >
> > > > Changes from v1 [0]:
> > > > - Added support for skb_can_coalesce_netmem().
> > > > - Avoid netmem_to_page() casts in the driver.
> > > > - Fixed code to abide 80 char limit with some exceptions to avoid
> > > > code churn.
> > >
> > > Since there is gonna be 2-3 weeks of closed net-next, can you
> > > also add a patch for the tx side? It should be trivial (skip dma unmap
> > > for niovs in tx completions plus netdev->netmem_tx=1).
> > >
> > Seems indeed trivial. We will add it.
> >
> > > And, btw, what about the issue that Cosmin raised in [0]? Is it addressed
> > > in this series?
> > >
> > > 0: https://lore.kernel.org/netdev/9322c3c4826ed1072ddc9a2103cc641060665864.camel@nvidia.com/
> > We wanted to fix this afterwards as it needs to change a more subtle
> > part in the code that replenishes pages. This needs more thinking and
> > testing.
>
> Thanks! For my understanding: does the issue occur only during initial
> queue refill? Or the same problem will happen any time there is a burst
> of traffic that might exhaust all rx descriptors?
>

Minor: a burst in traffic likely won't reproduce this case; I'm sure
mlx5 can drive the hardware to line rate consistently. It's more that if
the machine is under extreme memory pressure, page_pool_alloc_pages and
friends may return -ENOMEM, which reproduces the same edge case as the
dma-buf being extremely small, which also makes page_pool_alloc_netmems
return -ENOMEM.

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy
  2025-05-28 22:59       ` Mina Almasry
@ 2025-05-28 23:04         ` Stanislav Fomichev
  2025-05-29 11:11           ` Dragos Tatulea
  0 siblings, 1 reply; 50+ messages in thread
From: Stanislav Fomichev @ 2025-05-28 23:04 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Dragos Tatulea, Tariq Toukan, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Eric Dumazet, Andrew Lunn, Saeed Mahameed,
	Leon Romanovsky, Richard Cochran, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend, netdev,
	linux-rdma, linux-kernel, bpf, Moshe Shemesh, Mark Bloch,
	Gal Pressman, Cosmin Ratiu

On 05/28, Mina Almasry wrote:
> On Wed, May 28, 2025 at 8:45 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> >
> > On 05/28, Dragos Tatulea wrote:
> > > On Tue, May 27, 2025 at 09:05:49AM -0700, Stanislav Fomichev wrote:
> > > > On 05/23, Tariq Toukan wrote:
> > > > > This series from the team adds support for zerocopy rx TCP with devmem
> > > > > and io_uring for ConnectX7 NICs and above. For performance reasons and
> > > > > simplicity HW-GRO will also be turned on when header-data split mode is
> > > > > on.
> > > > >
> > > > > Find more details below.
> > > > >
> > > > > Regards,
> > > > > Tariq
> > > > >
> > > > > Performance
> > > > > ===========
> > > > >
> > > > > Test setup:
> > > > >
> > > > > * CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (single NUMA)
> > > > > * NIC: ConnectX7
> > > > > * Benchmarking tool: kperf [1]
> > > > > * Single TCP flow
> > > > > * Test duration: 60s
> > > > >
> > > > > With application thread and interrupts pinned to the *same* core:
> > > > >
> > > > > |------+-----------+----------|
> > > > > | MTU  | epoll     | io_uring |
> > > > > |------+-----------+----------|
> > > > > | 1500 | 61.6 Gbps | 114 Gbps |
> > > > > | 4096 | 69.3 Gbps | 151 Gbps |
> > > > > | 9000 | 67.8 Gbps | 187 Gbps |
> > > > > |------+-----------+----------|
> > > > >
> > > > > The CPU usage for io_uring is 95%.
> > > > >
> > > > > Reproduction steps for io_uring:
> > > > >
> > > > > server --no-daemon -a 2001:db8::1 --no-memcmp --iou --iou_sendzc \
> > > > >         --iou_zcrx --iou_dev_name eth2 --iou_zcrx_queue_id 2
> > > > >
> > > > > server --no-daemon -a 2001:db8::2 --no-memcmp --iou --iou_sendzc
> > > > >
> > > > > client --src 2001:db8::2 --dst 2001:db8::1 \
> > > > >         --msg-zerocopy -t 60 --cpu-min=2 --cpu-max=2
> > > > >
> > > > > Patch overview:
> > > > > ================
> > > > >
> > > > > First, a netmem API for skb_can_coalesce is added to the core to be able
> > > > > to do skb fragment coalescing on netmems.
> > > > >
> > > > > The next patches introduce some cleanups in the internal SHAMPO code and
> > > > > improvements to hw gro capability checks in FW.
> > > > >
> > > > > A separate page_pool is introduced for headers. Ethtool stats are added
> > > > > as well.
> > > > >
> > > > > Then the driver is converted to use the netmem API and to allow support
> > > > > for unreadable netmem page pool.
> > > > >
> > > > > The queue management ops are implemented.
> > > > >
> > > > > Finally, the tcp-data-split ring parameter is exposed.
> > > > >
> > > > > Changelog
> > > > > =========
> > > > >
> > > > > Changes from v1 [0]:
> > > > > - Added support for skb_can_coalesce_netmem().
> > > > > - Avoid netmem_to_page() casts in the driver.
> > > > > - Fixed code to abide 80 char limit with some exceptions to avoid
> > > > > code churn.
> > > >
> > > > Since there is gonna be 2-3 weeks of closed net-next, can you
> > > > also add a patch for the tx side? It should be trivial (skip dma unmap
> > > > for niovs in tx completions plus netdev->netmem_tx=1).
> > > >
> > > Seems indeed trivial. We will add it.
> > >
> > > > And, btw, what about the issue that Cosmin raised in [0]? Is it addressed
> > > > in this series?
> > > >
> > > > 0: https://lore.kernel.org/netdev/9322c3c4826ed1072ddc9a2103cc641060665864.camel@nvidia.com/
> > > We wanted to fix this afterwards as it needs to change a more subtle
> > > part in the code that replenishes pages. This needs more thinking and
> > > testing.
> >
> > Thanks! For my understanding: does the issue occur only during initial
> > queue refill? Or will the same problem happen any time there is a burst
> > of traffic that might exhaust all rx descriptors?
> >
> 
> Minor: a burst in traffic likely won't reproduce this case; I'm sure
> mlx5 can drive the hardware to line rate consistently. It's more likely
> when the machine is under extreme memory pressure: page_pool_alloc_pages()
> and friends may return -ENOMEM, which reproduces the same edge case as
> the dma-buf being extremely small, which also makes
> page_pool_alloc_netmems() return -ENOMEM.

What I want to understand is whether the kernel/driver will oops when the
dmabuf runs out of buffers after initial setup. Whether it's a traffic burst
or userspace being slow on refill doesn't matter.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 11/11] net/mlx5e: Support ethtool tcp-data-split settings
  2025-05-28  5:10         ` Gal Pressman
@ 2025-05-29  0:12           ` Jakub Kicinski
  0 siblings, 0 replies; 50+ messages in thread
From: Jakub Kicinski @ 2025-05-29  0:12 UTC (permalink / raw)
  To: Gal Pressman
  Cc: Saeed Mahameed, Tariq Toukan, David S. Miller, Paolo Abeni,
	Eric Dumazet, Andrew Lunn, Saeed Mahameed, Leon Romanovsky,
	Richard Cochran, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, netdev, linux-rdma,
	linux-kernel, bpf, Moshe Shemesh, Mark Bloch, Cosmin Ratiu,
	Dragos Tatulea

On Wed, 28 May 2025 08:10:46 +0300 Gal Pressman wrote:
> >    features = on
> > hw_features = off
> > 
> > is how we indicate the feature is "on [fixed]"
> > Tho, I'm not sure how much precedent there is for making things fixed
> > at runtime.  
> 
> Isn't this something that should be handled through fix_features?

Yes, should work.
Off the top of my head, after setting HDS to enabled and then trying
to disable HW-GRO we should see "on [requested off]", right?
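For instance, a minimal sketch of that (hypothetical driver code, not the
mlx5e implementation; example_priv / hds_enabled are made-up names):

#include <linux/netdevice.h>

struct example_priv {
        bool hds_enabled;       /* hypothetical: tracks tcp-data-split state */
};

/* Keep HW-GRO forced on while tcp-data-split (HDS) is enabled, so that
 * "ethtool -K <dev> rx-gro-hw off" gets overridden and later reported
 * as "on [requested off]".
 */
static netdev_features_t example_fix_features(struct net_device *netdev,
                                              netdev_features_t features)
{
        struct example_priv *priv = netdev_priv(netdev);

        if (priv->hds_enabled && !(features & NETIF_F_GRO_HW)) {
                netdev_warn(netdev,
                            "Can't disable HW-GRO while tcp-data-split is enabled\n");
                features |= NETIF_F_GRO_HW;
        }

        return features;
}

Wired up as .ndo_fix_features, with the HDS set_ringparam path calling
netdev_update_features() so the forced bit gets re-evaluated on changes.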

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy
  2025-05-28 23:04         ` Stanislav Fomichev
@ 2025-05-29 11:11           ` Dragos Tatulea
  2025-06-06  9:00             ` Cosmin Ratiu
  0 siblings, 1 reply; 50+ messages in thread
From: Dragos Tatulea @ 2025-05-29 11:11 UTC (permalink / raw)
  To: Stanislav Fomichev, Mina Almasry
  Cc: Tariq Toukan, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet, Andrew Lunn, Saeed Mahameed, Leon Romanovsky,
	Richard Cochran, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, netdev, linux-rdma,
	linux-kernel, bpf, Moshe Shemesh, Mark Bloch, Gal Pressman,
	Cosmin Ratiu

On Wed, May 28, 2025 at 04:04:18PM -0700, Stanislav Fomichev wrote:
> On 05/28, Mina Almasry wrote:
> > On Wed, May 28, 2025 at 8:45 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> > >
> > > On 05/28, Dragos Tatulea wrote:
> > > > On Tue, May 27, 2025 at 09:05:49AM -0700, Stanislav Fomichev wrote:
> > > > > On 05/23, Tariq Toukan wrote:
> > > > > > [cover letter trimmed, identical to the quote earlier in the thread]
> > > > >
> > > > > Since there is gonna be 2-3 weeks of closed net-next, can you
> > > > > also add a patch for the tx side? It should be trivial (skip dma unmap
> > > > > for niovs in tx completions plus netdev->netmem_tx=1).
> > > > >
> > > > Seems indeed trivial. We will add it.
> > > >
> > > > > And, btw, what about the issue that Cosmin raised in [0]? Is it addressed
> > > > > in this series?
> > > > >
> > > > > 0: https://lore.kernel.org/netdev/9322c3c4826ed1072ddc9a2103cc641060665864.camel@nvidia.com/
> > > > We wanted to fix this afterwards as it needs to change a more subtle
> > > > part in the code that replenishes pages. This needs more thinking and
> > > > testing.
> > >
> > > Thanks! For my understanding: does the issue occur only during initial
> > > queue refill? Or will the same problem happen any time there is a burst
> > > of traffic that might exhaust all rx descriptors?
> > >
> > 
> > Minor: a burst in traffic likely won't reproduce this case; I'm sure
> > mlx5 can drive the hardware to line rate consistently. It's more likely
> > when the machine is under extreme memory pressure: page_pool_alloc_pages()
> > and friends may return -ENOMEM, which reproduces the same edge case as
> > the dma-buf being extremely small, which also makes
> > page_pool_alloc_netmems() return -ENOMEM.
> 
> What I want to understand is whether the kernel/driver will oops when the
> dmabuf runs out of buffers after initial setup. Whether it's a traffic burst
> or userspace being slow on refill doesn't matter.
There is no oops, but the queue can't handle more traffic because it
can't allocate new buffers and it can't release old ones either.

AFAIU from Cosmin, the condition happened during the initial queue fill,
when there are no buffers to be released for the current WQE.

Thanks,
Dragos

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy
  2025-05-29 11:11           ` Dragos Tatulea
@ 2025-06-06  9:00             ` Cosmin Ratiu
  0 siblings, 0 replies; 50+ messages in thread
From: Cosmin Ratiu @ 2025-06-06  9:00 UTC (permalink / raw)
  To: Dragos Tatulea, stfomichev@gmail.com, almasrymina@google.com
  Cc: andrew+netdev@lunn.ch, hawk@kernel.org, leon@kernel.org,
	davem@davemloft.net, john.fastabend@gmail.com,
	linux-kernel@vger.kernel.org, edumazet@google.com,
	linux-rdma@vger.kernel.org, richardcochran@gmail.com,
	pabeni@redhat.com, ast@kernel.org, kuba@kernel.org,
	daniel@iogearbox.net, Tariq Toukan, Saeed Mahameed,
	netdev@vger.kernel.org, Mark Bloch, bpf@vger.kernel.org,
	Moshe Shemesh, Gal Pressman

On Thu, 2025-05-29 at 11:11 +0000, Dragos Tatulea wrote:
> 
> AFAIU from Cosmin, the condition happened during the initial queue fill,
> when there are no buffers to be released for the current WQE.

The issue happens when there isn't enough memory in the pool to
completely fill the rx ring with descriptors. Rx then eventually stops
completely once the posted descriptors are exhausted, because the ring
refill logic only releases ring memory back to the pool, starting from
ring_tail, when ring_head == ring_tail (for cache efficiency). This
means that if the ring cannot be completely filled, memory never gets
released, because ring_head != ring_tail.
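In pseudo-code, the shape of the problem is roughly the following (purely
illustrative, not the actual mlx5e refill code; example_rq and the
commented-out helpers are made up, and page_pool_alloc_netmems() is assumed
to take a gfp argument like page_pool_alloc_pages()):

#include <net/netmem.h>
#include <net/page_pool/helpers.h>

struct example_rq {                     /* made-up, for illustration only */
        u16 head, tail, size;
        struct page_pool *pool;
};

static int example_rq_refill(struct example_rq *rq)
{
        int posted = 0;

        /* Old buffers are only released back to the pool, starting from
         * tail, once the ring has fully wrapped around (head == tail).
         */
        if (rq->head == rq->tail) {
                /* example_release_old_buffers(rq); */
        }

        /* Posting stops early when the pool runs dry (tiny dma-buf or
         * memory pressure), leaving head != tail, so the release branch
         * above is never taken and rx eventually stalls once the
         * descriptors that did get posted are consumed.
         */
        while ((u16)(rq->head - rq->tail) < rq->size) {
                netmem_ref netmem = page_pool_alloc_netmems(rq->pool,
                                                            GFP_ATOMIC);

                if (!netmem)
                        break;
                /* example_post_rx_desc(rq, netmem); */
                rq->head++;
                posted++;
        }

        return posted;
}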

The easy workaround is to have a pool with enough memory to let the rx
ring fill up completely. I suspect this is easily the case in real life,
but the contrived ncdevmem test, with 128 MB of udmabuf memory and an
artificially high ring size and MTU, hit exactly this corner case.
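Illustrative numbers only: if each rx buffer ends up consuming 16 KB of
pool memory (e.g. a jumbo MTU rounded up to pages), then a ring of 8192
entries alone already needs 8192 * 16 KB = 128 MB, i.e. the entire
udmabuf used in the test, leaving nothing for buffers still held by the
stack.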

As Dragos said, we will look into this after the code has been posted.

Cosmin.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 07/11] net/mlx5e: SHAMPO: Headers page pool stats
  2025-05-22 22:58     ` Saeed Mahameed
@ 2025-06-06 10:43       ` Cosmin Ratiu
  2025-06-08 10:09         ` Tariq Toukan
  0 siblings, 1 reply; 50+ messages in thread
From: Cosmin Ratiu @ 2025-06-06 10:43 UTC (permalink / raw)
  To: kuba@kernel.org, saeed@kernel.org
  Cc: andrew+netdev@lunn.ch, hawk@kernel.org, davem@davemloft.net,
	john.fastabend@gmail.com, leon@kernel.org,
	linux-kernel@vger.kernel.org, edumazet@google.com,
	linux-rdma@vger.kernel.org, ast@kernel.org, pabeni@redhat.com,
	richardcochran@gmail.com, Dragos Tatulea, Mark Bloch,
	bpf@vger.kernel.org, Tariq Toukan, Saeed Mahameed,
	netdev@vger.kernel.org, Gal Pressman, daniel@iogearbox.net,
	Moshe Shemesh

On Thu, 2025-05-22 at 15:58 -0700, Saeed Mahameed wrote:
> On 22 May 15:31, Jakub Kicinski wrote:
> > On Fri, 23 May 2025 00:41:22 +0300 Tariq Toukan wrote:
> > > Expose the stats of the new headers page pool.
> > 
> > Nope. We have a netlink API for page pool stats.
> > 
> 
> We already expose the stats of the main pool in ethtool.
> So it will be an inconvenience to keep exposing half of the stats.
> So either we delete both or keep both. Some of us rely on this for
> debug
> 

What is the conclusion here?
Do we keep this patch, to have all the stats in the same place?
Or do we remove it, and then half of the stats will be accessible
through both ethtool and netlink, and the other half only via netlink?

Cosmin.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 07/11] net/mlx5e: SHAMPO: Headers page pool stats
  2025-06-06 10:43       ` Cosmin Ratiu
@ 2025-06-08 10:09         ` Tariq Toukan
  2025-06-09 15:20           ` Jakub Kicinski
  0 siblings, 1 reply; 50+ messages in thread
From: Tariq Toukan @ 2025-06-08 10:09 UTC (permalink / raw)
  To: Cosmin Ratiu, kuba@kernel.org, saeed@kernel.org
  Cc: andrew+netdev@lunn.ch, hawk@kernel.org, davem@davemloft.net,
	john.fastabend@gmail.com, leon@kernel.org,
	linux-kernel@vger.kernel.org, edumazet@google.com,
	linux-rdma@vger.kernel.org, ast@kernel.org, pabeni@redhat.com,
	richardcochran@gmail.com, Dragos Tatulea, Mark Bloch,
	bpf@vger.kernel.org, Tariq Toukan, Saeed Mahameed,
	netdev@vger.kernel.org, Gal Pressman, daniel@iogearbox.net,
	Moshe Shemesh



On 06/06/2025 13:43, Cosmin Ratiu wrote:
> On Thu, 2025-05-22 at 15:58 -0700, Saeed Mahameed wrote:
>> On 22 May 15:31, Jakub Kicinski wrote:
>>> On Fri, 23 May 2025 00:41:22 +0300 Tariq Toukan wrote:
>>>> Expose the stats of the new headers page pool.
>>>
>>> Nope. We have a netlink API for page pool stats.
>>>
>>
>> We already expose the stats of the main pool in ethtool.
>> So it will be an inconvenience to keep exposing half of the stats.
>> So either we delete both or keep both. Some of us rely on this for
>> debug
>>
> 
> What is the conclusion here?
> Do we keep this patch, to have all the stats in the same place?
> Or do we remove it, and then half of the stats will be accessible
> through both ethtool and netlink, and the other half only via netlink?
> 
> Cosmin.

IIRC, the netlink API shows only the overall/sum, right?

ethtool stats show you per-ring numbers; this is very helpful for system
monitoring and perf debug.



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH net-next V2 07/11] net/mlx5e: SHAMPO: Headers page pool stats
  2025-06-08 10:09         ` Tariq Toukan
@ 2025-06-09 15:20           ` Jakub Kicinski
  0 siblings, 0 replies; 50+ messages in thread
From: Jakub Kicinski @ 2025-06-09 15:20 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Cosmin Ratiu, saeed@kernel.org, andrew+netdev@lunn.ch,
	hawk@kernel.org, davem@davemloft.net, john.fastabend@gmail.com,
	leon@kernel.org, linux-kernel@vger.kernel.org,
	edumazet@google.com, linux-rdma@vger.kernel.org, ast@kernel.org,
	pabeni@redhat.com, richardcochran@gmail.com, Dragos Tatulea,
	Mark Bloch, bpf@vger.kernel.org, Tariq Toukan, Saeed Mahameed,
	netdev@vger.kernel.org, Gal Pressman, daniel@iogearbox.net,
	Moshe Shemesh

On Sun, 8 Jun 2025 13:09:16 +0300 Tariq Toukan wrote:
> >> We already expose the stats of the main pool in ethtool.
> >> So it will be an inconvenience to keep exposing half of the stats.
> >> So either we delete both or keep both. Some of us rely on this for
> >> debug
> >>  
> > 
> > What is the conclusion here?
> > Do we keep this patch, to have all the stats in the same place?
> > Or do we remove it, and then half of the stats will be accessible
> > through both ethtool and netlink, and the other half only via netlink?

Unfortunately by that logic we would never be able to move away from
deprecated APIs.

> IIRC, the netlink API shows only the overall/sum, right?

Wrong.

> ethtool stats show you per-ring numbers; this is very helpful for system
> monitoring and perf debug.

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread

Thread overview: 50+ messages
2025-05-22 21:41 [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy Tariq Toukan
2025-05-22 21:41 ` [PATCH net-next V2 01/11] net: Kconfig NET_DEVMEM selects GENERIC_ALLOCATOR Tariq Toukan
2025-05-22 23:07   ` Mina Almasry
2025-05-22 21:41 ` [PATCH net-next V2 02/11] net: Add skb_can_coalesce for netmem Tariq Toukan
2025-05-22 23:09   ` Mina Almasry
2025-05-25 13:03     ` Dragos Tatulea
2025-05-25 17:44       ` Mina Almasry
2025-05-28  9:13         ` Dragos Tatulea
2025-05-22 21:41 ` [PATCH net-next V2 03/11] net/mlx5e: SHAMPO: Reorganize mlx5_rq_shampo_alloc Tariq Toukan
2025-05-22 21:41 ` [PATCH net-next V2 04/11] net/mlx5e: SHAMPO: Remove redundant params Tariq Toukan
2025-05-22 21:41 ` [PATCH net-next V2 05/11] net/mlx5e: SHAMPO: Improve hw gro capability checking Tariq Toukan
2025-05-22 21:41 ` [PATCH net-next V2 06/11] net/mlx5e: SHAMPO: Separate pool for headers Tariq Toukan
2025-05-22 22:30   ` Jakub Kicinski
2025-05-22 23:08     ` Saeed Mahameed
2025-05-22 23:24       ` Mina Almasry
2025-05-22 23:43         ` Saeed Mahameed
2025-05-27 15:29       ` Jakub Kicinski
2025-05-27 15:53         ` Dragos Tatulea
2025-05-22 21:41 ` [PATCH net-next V2 07/11] net/mlx5e: SHAMPO: Headers page pool stats Tariq Toukan
2025-05-22 22:31   ` Jakub Kicinski
2025-05-22 22:58     ` Saeed Mahameed
2025-06-06 10:43       ` Cosmin Ratiu
2025-06-08 10:09         ` Tariq Toukan
2025-06-09 15:20           ` Jakub Kicinski
2025-05-22 21:41 ` [PATCH net-next V2 08/11] net/mlx5e: Convert over to netmem Tariq Toukan
2025-05-22 23:18   ` Mina Almasry
2025-05-22 23:54     ` Saeed Mahameed
2025-05-23 17:58       ` Mina Almasry
2025-05-23 19:22         ` Saeed Mahameed
2025-05-22 21:41 ` [PATCH net-next V2 09/11] net/mlx5e: Add support for UNREADABLE netmem page pools Tariq Toukan
2025-05-22 23:26   ` Mina Almasry
2025-05-22 23:56     ` Saeed Mahameed
2025-05-22 21:41 ` [PATCH net-next V2 10/11] net/mlx5e: Implement queue mgmt ops and single channel swap Tariq Toukan
2025-05-22 21:41 ` [PATCH net-next V2 11/11] net/mlx5e: Support ethtool tcp-data-split settings Tariq Toukan
2025-05-22 22:55   ` Jakub Kicinski
2025-05-22 23:19     ` Saeed Mahameed
2025-05-23 16:17       ` Cosmin Ratiu
2025-05-23 19:35         ` saeed
2025-05-27 16:10       ` Jakub Kicinski
2025-05-28  5:10         ` Gal Pressman
2025-05-29  0:12           ` Jakub Kicinski
2025-05-27 16:05 ` [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy Stanislav Fomichev
2025-05-28  9:17   ` Dragos Tatulea
2025-05-28 15:45     ` Stanislav Fomichev
2025-05-28 22:59       ` Mina Almasry
2025-05-28 23:04         ` Stanislav Fomichev
2025-05-29 11:11           ` Dragos Tatulea
2025-06-06  9:00             ` Cosmin Ratiu
2025-05-28  0:31 ` Jakub Kicinski
2025-05-28  1:20 ` patchwork-bot+netdevbpf
