public inbox for linux-rdma@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
@ 2026-04-07 19:59 Dipayaan Roy
  2026-04-07 19:59 ` [PATCH net-next v6 1/2] net: mana: refactor mana_get_strings() and mana_get_sset_count() to use switch Dipayaan Roy
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Dipayaan Roy @ 2026-04-07 19:59 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy

On some ARM64 platforms with 4K PAGE_SIZE, using page_pool
fragments for allocation in the RX refill path (~2kB buffer per fragment)
causes a 15-20% throughput regression under high connection counts
(>16 TCP streams at 180+ Gbps). Using full-page buffers on these
platforms shows no regression and restores line-rate performance.

This behavior is observed on a single platform; other platforms
perform better with page_pool fragments, indicating this is not a
page_pool issue but platform-specific.

This series adds an ethtool private flag "full-page-rx" to let the
user opt in to one RX buffer per page:

  ethtool --set-priv-flags eth0 full-page-rx on

There is no behavioral change by default. The flag can be persisted
via udev rule for affected platforms.
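A udev rule along these lines could persist the setting (the rule file
name, match keys, and ethtool path below are illustrative assumptions,
not part of this series):

```
# /etc/udev/rules.d/99-mana-full-page-rx.rules  (illustrative sketch)
# Enable full-page RX buffers whenever a mana netdev appears.
ACTION=="add", SUBSYSTEM=="net", DRIVERS=="mana", \
  RUN+="/usr/sbin/ethtool --set-priv-flags $name full-page-rx on"
```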

Changes in v6:
  - Added missing maintainers.
Changes in v5:
  - Split the prep refactor into a separate patch (patch 1/2).
Changes in v4:
  - Dropped the SMBIOS string parsing; added an ethtool priv flag
    to reconfigure the queues with full-page RX buffers.
Changes in v3:
  - Changed u8 * to char *.
Changes in v2:
  - Separated reading the string index from reading the string; removed
    inline.

Dipayaan Roy (2):
  net: mana: refactor mana_get_strings() and mana_get_sset_count() to
    use switch
  net: mana: force full-page RX buffers via ethtool private flag

 drivers/net/ethernet/microsoft/mana/mana_en.c |  22 ++-
 .../ethernet/microsoft/mana/mana_ethtool.c    | 164 ++++++++++++++----
 include/net/mana/mana.h                       |   8 +
 3 files changed, 163 insertions(+), 31 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH net-next v6 1/2] net: mana: refactor mana_get_strings() and mana_get_sset_count() to use switch
  2026-04-07 19:59 [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers Dipayaan Roy
@ 2026-04-07 19:59 ` Dipayaan Roy
  2026-04-07 19:59 ` [PATCH net-next v6 2/2] net: mana: force full-page RX buffers via ethtool private flag Dipayaan Roy
  2026-04-10  1:35 ` [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers Jakub Kicinski
  2 siblings, 0 replies; 12+ messages in thread
From: Dipayaan Roy @ 2026-04-07 19:59 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy

Refactor mana_get_strings() and mana_get_sset_count() from if/else to
switch statements in preparation for adding ethtool private flags
support which requires handling ETH_SS_PRIV_FLAGS.

No functional change.

Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
 .../ethernet/microsoft/mana/mana_ethtool.c    | 75 ++++++++++++-------
 1 file changed, 46 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 6a4b42fe0944..a28ca461c135 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -138,53 +138,70 @@ static int mana_get_sset_count(struct net_device *ndev, int stringset)
 	struct mana_port_context *apc = netdev_priv(ndev);
 	unsigned int num_queues = apc->num_queues;
 
-	if (stringset != ETH_SS_STATS)
+	switch (stringset) {
+	case ETH_SS_STATS:
+		return ARRAY_SIZE(mana_eth_stats) +
+		       ARRAY_SIZE(mana_phy_stats) +
+		       ARRAY_SIZE(mana_hc_stats)  +
+		       num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+	default:
 		return -EINVAL;
-
-	return ARRAY_SIZE(mana_eth_stats) + ARRAY_SIZE(mana_phy_stats) + ARRAY_SIZE(mana_hc_stats) +
-			num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+	}
 }
 
-static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
+static void mana_get_strings_stats(struct mana_port_context *apc, u8 **data)
 {
-	struct mana_port_context *apc = netdev_priv(ndev);
 	unsigned int num_queues = apc->num_queues;
 	int i, j;
 
-	if (stringset != ETH_SS_STATS)
-		return;
 	for (i = 0; i < ARRAY_SIZE(mana_eth_stats); i++)
-		ethtool_puts(&data, mana_eth_stats[i].name);
+		ethtool_puts(data, mana_eth_stats[i].name);
 
 	for (i = 0; i < ARRAY_SIZE(mana_hc_stats); i++)
-		ethtool_puts(&data, mana_hc_stats[i].name);
+		ethtool_puts(data, mana_hc_stats[i].name);
 
 	for (i = 0; i < ARRAY_SIZE(mana_phy_stats); i++)
-		ethtool_puts(&data, mana_phy_stats[i].name);
+		ethtool_puts(data, mana_phy_stats[i].name);
 
 	for (i = 0; i < num_queues; i++) {
-		ethtool_sprintf(&data, "rx_%d_packets", i);
-		ethtool_sprintf(&data, "rx_%d_bytes", i);
-		ethtool_sprintf(&data, "rx_%d_xdp_drop", i);
-		ethtool_sprintf(&data, "rx_%d_xdp_tx", i);
-		ethtool_sprintf(&data, "rx_%d_xdp_redirect", i);
-		ethtool_sprintf(&data, "rx_%d_pkt_len0_err", i);
+		ethtool_sprintf(data, "rx_%d_packets", i);
+		ethtool_sprintf(data, "rx_%d_bytes", i);
+		ethtool_sprintf(data, "rx_%d_xdp_drop", i);
+		ethtool_sprintf(data, "rx_%d_xdp_tx", i);
+		ethtool_sprintf(data, "rx_%d_xdp_redirect", i);
+		ethtool_sprintf(data, "rx_%d_pkt_len0_err", i);
 		for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
-			ethtool_sprintf(&data, "rx_%d_coalesced_cqe_%d", i, j + 2);
+			ethtool_sprintf(data,
+					"rx_%d_coalesced_cqe_%d",
+					i,
+					j + 2);
 	}
 
 	for (i = 0; i < num_queues; i++) {
-		ethtool_sprintf(&data, "tx_%d_packets", i);
-		ethtool_sprintf(&data, "tx_%d_bytes", i);
-		ethtool_sprintf(&data, "tx_%d_xdp_xmit", i);
-		ethtool_sprintf(&data, "tx_%d_tso_packets", i);
-		ethtool_sprintf(&data, "tx_%d_tso_bytes", i);
-		ethtool_sprintf(&data, "tx_%d_tso_inner_packets", i);
-		ethtool_sprintf(&data, "tx_%d_tso_inner_bytes", i);
-		ethtool_sprintf(&data, "tx_%d_long_pkt_fmt", i);
-		ethtool_sprintf(&data, "tx_%d_short_pkt_fmt", i);
-		ethtool_sprintf(&data, "tx_%d_csum_partial", i);
-		ethtool_sprintf(&data, "tx_%d_mana_map_err", i);
+		ethtool_sprintf(data, "tx_%d_packets", i);
+		ethtool_sprintf(data, "tx_%d_bytes", i);
+		ethtool_sprintf(data, "tx_%d_xdp_xmit", i);
+		ethtool_sprintf(data, "tx_%d_tso_packets", i);
+		ethtool_sprintf(data, "tx_%d_tso_bytes", i);
+		ethtool_sprintf(data, "tx_%d_tso_inner_packets", i);
+		ethtool_sprintf(data, "tx_%d_tso_inner_bytes", i);
+		ethtool_sprintf(data, "tx_%d_long_pkt_fmt", i);
+		ethtool_sprintf(data, "tx_%d_short_pkt_fmt", i);
+		ethtool_sprintf(data, "tx_%d_csum_partial", i);
+		ethtool_sprintf(data, "tx_%d_mana_map_err", i);
+	}
+}
+
+static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+
+	switch (stringset) {
+	case ETH_SS_STATS:
+		mana_get_strings_stats(apc, &data);
+		break;
+	default:
+		break;
 	}
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH net-next v6 2/2] net: mana: force full-page RX buffers via ethtool private flag
  2026-04-07 19:59 [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers Dipayaan Roy
  2026-04-07 19:59 ` [PATCH net-next v6 1/2] net: mana: refactor mana_get_strings() and mana_get_sset_count() to use switch Dipayaan Roy
@ 2026-04-07 19:59 ` Dipayaan Roy
  2026-04-10  1:35 ` [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers Jakub Kicinski
  2 siblings, 0 replies; 12+ messages in thread
From: Dipayaan Roy @ 2026-04-07 19:59 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy

On some ARM64 platforms with 4K PAGE_SIZE, page_pool fragment
allocation in the RX refill path can cause a 15-20% throughput
regression under high connection counts (>16 TCP streams).

Add an ethtool private flag "full-page-rx" that allows the user to
force one RX buffer per page, bypassing the page_pool fragment path.
This restores line-rate (180+ Gbps) performance on affected platforms.

Usage:
  ethtool --set-priv-flags eth0 full-page-rx on

There is no behavioral change by default. The flag must be explicitly
enabled by the user or udev rule.

The existing single-buffer-per-page logic for XDP and jumbo frames is
consolidated into a new helper mana_use_single_rxbuf_per_page() which
is now the single decision point for both the automatic and
user-controlled paths.
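For illustration, the unknown-bit rejection in the new set_priv_flags
handler amounts to the following bit arithmetic (a Python sketch with
stand-ins for the kernel's BIT()/GENMASK() macros; not the driver code
itself):

```python
# Stand-ins for the kernel's BIT() and GENMASK() macros (illustrative only).
def BIT(n):
    return 1 << n

def GENMASK(high, low):
    return ((1 << (high - low + 1)) - 1) << low

MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF = 0
MANA_PRIV_FLAG_MAX = 1  # one private flag defined so far

def priv_flags_valid(flags):
    # Reject any bit outside [0, MANA_PRIV_FLAG_MAX - 1], as the driver does.
    return not (flags & ~GENMASK(MANA_PRIV_FLAG_MAX - 1, 0))

print(priv_flags_valid(BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF)))  # True
print(priv_flags_valid(BIT(1)))  # False: undefined bit is rejected
```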

Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 22 ++++-
 .../ethernet/microsoft/mana/mana_ethtool.c    | 89 +++++++++++++++++++
 include/net/mana/mana.h                       |  8 ++
 3 files changed, 117 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 49c65cc1697c..59a1626c2be1 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -744,6 +744,25 @@ static void *mana_get_rxbuf_pre(struct mana_rxq *rxq, dma_addr_t *da)
 	return va;
 }
 
+static bool
+mana_use_single_rxbuf_per_page(struct mana_port_context *apc, u32 mtu)
+{
+	/* On some platforms with 4K PAGE_SIZE, page_pool fragment allocation
+	 * in the RX refill path (~2kB buffer) can cause significant throughput
+	 * regression under high connection counts. Allow user to force one RX
+	 * buffer per page via ethtool private flag to bypass the fragment
+	 * path.
+	 */
+	if (apc->priv_flags & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF))
+		return true;
+
+	/* For xdp and jumbo frames make sure only one packet fits per page. */
+	if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc))
+		return true;
+
+	return false;
+}
+
 /* Get RX buffer's data size, alloc size, XDP headroom based on MTU */
 static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
 			       int mtu, u32 *datasize, u32 *alloc_size,
@@ -754,8 +773,7 @@ static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
 	/* Calculate datasize first (consistent across all cases) */
 	*datasize = mtu + ETH_HLEN;
 
-	/* For xdp and jumbo frames make sure only one packet fits per page */
-	if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc)) {
+	if (mana_use_single_rxbuf_per_page(apc, mtu)) {
 		if (mana_xdp_get(apc)) {
 			*headroom = XDP_PACKET_HEADROOM;
 			*alloc_size = PAGE_SIZE;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index a28ca461c135..0547c903f613 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -133,6 +133,10 @@ static const struct mana_stats_desc mana_phy_stats[] = {
 	{ "hc_tc7_tx_pause_phy", offsetof(struct mana_ethtool_phy_stats, tx_pause_tc7_phy) },
 };
 
+static const char mana_priv_flags[MANA_PRIV_FLAG_MAX][ETH_GSTRING_LEN] = {
+	[MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF] = "full-page-rx"
+};
+
 static int mana_get_sset_count(struct net_device *ndev, int stringset)
 {
 	struct mana_port_context *apc = netdev_priv(ndev);
@@ -144,6 +148,10 @@ static int mana_get_sset_count(struct net_device *ndev, int stringset)
 		       ARRAY_SIZE(mana_phy_stats) +
 		       ARRAY_SIZE(mana_hc_stats)  +
 		       num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+
+	case ETH_SS_PRIV_FLAGS:
+		return MANA_PRIV_FLAG_MAX;
+
 	default:
 		return -EINVAL;
 	}
@@ -192,6 +200,14 @@ static void mana_get_strings_stats(struct mana_port_context *apc, u8 **data)
 	}
 }
 
+static void mana_get_strings_priv_flags(u8 **data)
+{
+	int i;
+
+	for (i = 0; i < MANA_PRIV_FLAG_MAX; i++)
+		ethtool_puts(data, mana_priv_flags[i]);
+}
+
 static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
 {
 	struct mana_port_context *apc = netdev_priv(ndev);
@@ -200,6 +216,9 @@ static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
 	case ETH_SS_STATS:
 		mana_get_strings_stats(apc, &data);
 		break;
+	case ETH_SS_PRIV_FLAGS:
+		mana_get_strings_priv_flags(&data);
+		break;
 	default:
 		break;
 	}
@@ -590,6 +609,74 @@ static int mana_get_link_ksettings(struct net_device *ndev,
 	return 0;
 }
 
+static u32 mana_get_priv_flags(struct net_device *ndev)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+
+	return apc->priv_flags;
+}
+
+static int mana_set_priv_flags(struct net_device *ndev, u32 priv_flags)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+	u32 changed = apc->priv_flags ^ priv_flags;
+	u32 old_priv_flags = apc->priv_flags;
+	bool schedule_port_reset = false;
+	int err = 0;
+
+	if (!changed)
+		return 0;
+
+	/* Reject unknown bits */
+	if (priv_flags & ~GENMASK(MANA_PRIV_FLAG_MAX - 1, 0))
+		return -EINVAL;
+
+	if (changed & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF)) {
+		apc->priv_flags = priv_flags;
+
+		if (!apc->port_is_up) {
+			/* Port is down, flag updated to apply on next up
+			 * so just return.
+			 */
+			return 0;
+		}
+
+		/* Pre-allocate buffers to prevent failure in mana_attach
+		 * later
+		 */
+		err = mana_pre_alloc_rxbufs(apc, ndev->mtu, apc->num_queues);
+		if (err) {
+			netdev_err(ndev,
+				   "Insufficient memory for new allocations\n");
+			apc->priv_flags = old_priv_flags;
+			return err;
+		}
+
+		err = mana_detach(ndev, false);
+		if (err) {
+			netdev_err(ndev, "mana_detach failed: %d\n", err);
+			apc->priv_flags = old_priv_flags;
+			goto out;
+		}
+
+		err = mana_attach(ndev);
+		if (err) {
+			netdev_err(ndev, "mana_attach failed: %d\n", err);
+			apc->priv_flags = old_priv_flags;
+			schedule_port_reset = true;
+		}
+	}
+
+out:
+	mana_pre_dealloc_rxbufs(apc);
+
+	if (err && schedule_port_reset)
+		queue_work(apc->ac->per_port_queue_reset_wq,
+			   &apc->queue_reset_work);
+
+	return err;
+}
+
 const struct ethtool_ops mana_ethtool_ops = {
 	.supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES,
 	.get_ethtool_stats	= mana_get_ethtool_stats,
@@ -608,4 +695,6 @@ const struct ethtool_ops mana_ethtool_ops = {
 	.set_ringparam          = mana_set_ringparam,
 	.get_link_ksettings	= mana_get_link_ksettings,
 	.get_link		= ethtool_op_get_link,
+	.get_priv_flags		= mana_get_priv_flags,
+	.set_priv_flags		= mana_set_priv_flags,
 };
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 3336688fed5e..fd87e3d6c1f4 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -30,6 +30,12 @@ enum TRI_STATE {
 	TRI_STATE_TRUE = 1
 };
 
+/* MANA ethtool private flag bit positions */
+enum mana_priv_flag_bits {
+	MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF = 0,
+	MANA_PRIV_FLAG_MAX,
+};
+
 /* Number of entries for hardware indirection table must be in power of 2 */
 #define MANA_INDIRECT_TABLE_MAX_SIZE 512
 #define MANA_INDIRECT_TABLE_DEF_SIZE 64
@@ -531,6 +537,8 @@ struct mana_port_context {
 	u32 rxbpre_headroom;
 	u32 rxbpre_frag_count;
 
+	u32 priv_flags;
+
 	struct bpf_prog *bpf_prog;
 
 	/* Create num_queues EQs, SQs, SQ-CQs, RQs and RQ-CQs, respectively. */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
  2026-04-07 19:59 [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers Dipayaan Roy
  2026-04-07 19:59 ` [PATCH net-next v6 1/2] net: mana: refactor mana_get_strings() and mana_get_sset_count() to use switch Dipayaan Roy
  2026-04-07 19:59 ` [PATCH net-next v6 2/2] net: mana: force full-page RX buffers via ethtool private flag Dipayaan Roy
@ 2026-04-10  1:35 ` Jakub Kicinski
  2026-04-12 19:59   ` Jakub Kicinski
  2 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2026-04-10  1:35 UTC (permalink / raw)
  To: Dipayaan Roy
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy

On Tue,  7 Apr 2026 12:59:17 -0700 Dipayaan Roy wrote:
> This behavior is observed on a single platform; other platforms
> perform better with page_pool fragments, indicating this is not a
> page_pool issue but platform-specific.

Well, someone has to run some experiments and confirm other ARM
platforms are not impacted, with data. I was hoping to do it myself
but doesn't look like that will happen in time for the merge window :(

> Changes in v6:
>  - Added missing maintainers.

STOP REPOSTING PATCHES FOR NO REASON.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
  2026-04-10  1:35 ` [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers Jakub Kicinski
@ 2026-04-12 19:59   ` Jakub Kicinski
  2026-04-14 16:00     ` Dipayaan Roy
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2026-04-12 19:59 UTC (permalink / raw)
  To: Dipayaan Roy
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy

On Thu, 9 Apr 2026 18:35:09 -0700 Jakub Kicinski wrote:
> On Tue,  7 Apr 2026 12:59:17 -0700 Dipayaan Roy wrote:
> > This behavior is observed on a single platform; other platforms
> > perform better with page_pool fragments, indicating this is not a
> > page_pool issue but platform-specific.  
> 
> Well, someone has to run some experiments and confirm other ARM
> platforms are not impacted, with data. I was hoping to do it myself
> but doesn't look like that will happen in time for the merge window :(

Please repost with the perf analysis on other commercially available
ARM platforms. Something like:

  This is a workaround applicable to only some platforms. Modifying
  driver X to use a similar workaround on [Ampere Max|nVidia
  Grace|Amazon Graviton 3|..] the performance for split pages is
  y% higher than when using single pages.
-- 
pw-bot: cr

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
  2026-04-12 19:59   ` Jakub Kicinski
@ 2026-04-14 16:00     ` Dipayaan Roy
  2026-04-16 15:31       ` Jakub Kicinski
  0 siblings, 1 reply; 12+ messages in thread
From: Dipayaan Roy @ 2026-04-14 16:00 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy

On Sun, Apr 12, 2026 at 12:59:17PM -0700, Jakub Kicinski wrote:
> On Thu, 9 Apr 2026 18:35:09 -0700 Jakub Kicinski wrote:
> > On Tue,  7 Apr 2026 12:59:17 -0700 Dipayaan Roy wrote:
> > > This behavior is observed on a single platform; other platforms
> > > perform better with page_pool fragments, indicating this is not a
> > > page_pool issue but platform-specific.  
> > 
> > Well, someone has to run some experiments and confirm other ARM
> > platforms are not impacted, with data. I was hoping to do it myself
> > but doesn't look like that will happen in time for the merge window :(
> 
> Please repost with the perf analysis on other commercially available
> ARM platforms. Something like:
> 
>   This is a workaround applicable to only some platforms. Modifying
>   driver X to use a similar workaround on [Ampere Max|nVidia
>   Grace|Amazon Graviton 3|..] the performance for split pages is
>   y% higher than when using single pages.
> -- 
> pw-bot: cr

Hi Jakub,

I ran the same experiment on an alternate ARM64 platform from a
different vendor, which I was able to access only recently. I still see
roughly a 5% overhead from the atomic refcount operation itself, but on
that platform there is no throughput drop when using page fragments
versus full-page mode. In both cases, the setup reaches line rate. That
suggests the atomic overhead alone does not explain the throughput loss
on the specific hardware we are discussing.

I also received an update from the hardware team. They collected PCIe
traces and observed stalls on this particular ARM64 processor
when running with page fragments, while those stalls are not seen in
full-page mode. The exact root cause is still under investigation, but
their current assessment is that this is likely a microarchitectural
issue in the PCIe root port. Based on that, they are asking for a
software workaround that uses full pages until the issue is fully
understood.

For that reason, I am asking whether this could be accepted as an
ethtool private flag rather than as a generic driver change,
since the problem is still specific to one CPU/platform.
Please let me know whether you think this patch with the private flag
would be acceptable here.

Regards
Dipayaan Roy

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
  2026-04-14 16:00     ` Dipayaan Roy
@ 2026-04-16 15:31       ` Jakub Kicinski
  2026-04-23 12:48         ` Dipayaan Roy
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2026-04-16 15:31 UTC (permalink / raw)
  To: Dipayaan Roy
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy

On Tue, 14 Apr 2026 09:00:56 -0700 Dipayaan Roy wrote:
> I still see roughly a 5% overhead from the atomic refcount operation
> itself, but on that platform there is no throughput drop when using
> page fragments versus full-page mode.

That seems to contradict your claim that it's a problem with a specific
platform.. Since we're in the merge window I asked David Wei to try to
experiment with disabling page fragmentation on the ARM64 platforms we
have at Meta. If it repros we should use the generic rx-buf-len
ringparam because more NICs may want to implement this strategy.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
  2026-04-16 15:31       ` Jakub Kicinski
@ 2026-04-23 12:48         ` Dipayaan Roy
  2026-04-23 16:33           ` Jakub Kicinski
                             ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Dipayaan Roy @ 2026-04-23 12:48 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy

On Thu, Apr 16, 2026 at 08:31:46AM -0700, Jakub Kicinski wrote:
> On Tue, 14 Apr 2026 09:00:56 -0700 Dipayaan Roy wrote:
> > I still see roughly a 5% overhead from the atomic refcount operation
> > itself, but on that platform there is no throughput drop when using
> > page fragments versus full-page mode.
> 
> That seems to contradict your claim that it's a problem with a specific
> platform.. Since we're in the merge window I asked David Wei to try to
> experiment with disabling page fragmentation on the ARM64 platforms we
> have at Meta. If it repros we should use the generic rx-buf-len
> ringparam because more NICs may want to implement this strategy.

Hi Jakub,

Thanks. I think I was not precise enough in my previous reply.

What I meant is that the atomic refcount cost itself does not appear to
be unique to the affected platform. I see a similar ~5% overhead on
another ARM64 platform (different vendor) as well. However, on that
platform there is no throughput delta between fragment mode and
full-page mode; both reach line rate.

On the affected platform, fragment mode shows an additional ~15%
throughput drop versus full-page mode. So the current data suggests that
the atomic overhead is common, but the throughput regression is not
explained by that overhead alone and likely depends on an additional
platform-specific factor.

Separately, the hardware team collected PCIe traces on the affected
platform and reported stalls in the fragment-mode case that are not seen
in full-page mode. They are still investigating the root cause, but
their current hypothesis is that this is related to that platform’s
PCIe/root-port microarchitecture rather than to page_pool refcounting
alone.

That said, I agree the right direction depends on whether this
reproduces on other ARM64 platforms. If David is able to reproduce the
same behavior, then using the generic rx-buf-len ringparam sounds like
the better direction.

Please let me know what David finds, and I can rework the patch
accordingly.


Regards
Dipayaan Roy

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
  2026-04-23 12:48         ` Dipayaan Roy
@ 2026-04-23 16:33           ` Jakub Kicinski
  2026-04-24 16:24           ` David Wei
  2026-04-24 20:05           ` David Wei
  2 siblings, 0 replies; 12+ messages in thread
From: Jakub Kicinski @ 2026-04-23 16:33 UTC (permalink / raw)
  To: Dipayaan Roy
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy

On Thu, 23 Apr 2026 05:48:11 -0700 Dipayaan Roy wrote:
> What I meant is that the atomic refcount cost itself does not appear to
> be unique to the affected platform. I see a similar ~5% overhead on
> another ARM64 platformi (different vendor) as well. However, on that platform
> there is no throughput delta between fragment mode and full-page mode; both reach
> line rate.

I wonder if it wouldn't be more expedient at this stage to just switch
to rx-buf-len rather than investigating in more detail. But we can wait
for more data if you prefer.

> Please let me know what David finds, and I can rework the patch
> accordingly.

Haven't heard back. I pinged him now.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
  2026-04-23 12:48         ` Dipayaan Roy
  2026-04-23 16:33           ` Jakub Kicinski
@ 2026-04-24 16:24           ` David Wei
  2026-04-24 20:05           ` David Wei
  2 siblings, 0 replies; 12+ messages in thread
From: David Wei @ 2026-04-24 16:24 UTC (permalink / raw)
  To: Dipayaan Roy, Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy

On 2026-04-23 05:48, Dipayaan Roy wrote:
> On Thu, Apr 16, 2026 at 08:31:46AM -0700, Jakub Kicinski wrote:
>> On Tue, 14 Apr 2026 09:00:56 -0700 Dipayaan Roy wrote:
>>> I still see roughly a 5% overhead from the atomic refcount operation
>>> itself, but on that platform there is no throughput drop when using
>>> page fragments versus full-page mode.
>>
>> That seems to contradict your claim that it's a problem with a specific
>> platform.. Since we're in the merge window I asked David Wei to try to
>> experiment with disabling page fragmentation on the ARM64 platforms we
>> have at Meta. If it repros we should use the generic rx-buf-len
>> ringparam because more NICs may want to implement this strategy.
> 
> Hi Jakub,
> 
> Thanks. I think I was not precise enough in my previous reply.
> 
> What I meant is that the atomic refcount cost itself does not appear to
> be unique to the affected platform. I see a similar ~5% overhead on
> another ARM64 platform (different vendor) as well. However, on that platform
> there is no throughput delta between fragment mode and full-page mode; both reach
> line rate.
> 
> On the affected platform, fragment mode shows an additional ~15%
> throughput drop versus full-page mode. So the current data suggests that
> the atomic overhead is common, but the throughput regression is not
> explained by that overhead alone and likely depends on an additional
> platform-specific factor.
> 
> Separately, the hardware team collected PCIe traces on the affected
> platform and reported stalls in the fragment-mode case that are not seen
> in full-page mode. They are still investigating the root cause, but
> their current hypothesis is that this is related to that platform’s
> PCIe/root-port microarchitecture rather than to page_pool refcounting
> alone.
> 
> That said, I agree the right direction depends on whether this
> reproduces on other ARM64 platforms. If David is able to reproduce the
> same behavior, then using the generic rx-buf-len ringparam sounds like
> the better direction.
> 
> Please let me know what David finds, and I can rework the patch
> accordingly.

Hi Dipayaan. Can you please share more details on your testing setup?

* What are you using as the test client/server? iperf3 or something
   else?
* What do you mean specifically by "5% overhead from the atomic refcount
   operation"? Some specific function?
* What are you using to measure? perf?
* How many queues, what is the napi softirq affinity?
* How many NUMA nodes? Does the problem only appear when crossing?

Thanks,
David

> 
> 
> Regards
> Dipayaan Roy

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
  2026-04-23 12:48         ` Dipayaan Roy
  2026-04-23 16:33           ` Jakub Kicinski
  2026-04-24 16:24           ` David Wei
@ 2026-04-24 20:05           ` David Wei
  2026-04-25  8:05             ` Dipayaan Roy
  2 siblings, 1 reply; 12+ messages in thread
From: David Wei @ 2026-04-24 20:05 UTC (permalink / raw)
  To: Dipayaan Roy, Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy

On 2026-04-23 05:48, Dipayaan Roy wrote:
> On Thu, Apr 16, 2026 at 08:31:46AM -0700, Jakub Kicinski wrote:
>> On Tue, 14 Apr 2026 09:00:56 -0700 Dipayaan Roy wrote:
>>> I still see roughly a 5% overhead from the atomic refcount operation
>>> itself, but on that platform there is no throughput drop when using
>>> page fragments versus full-page mode.
>>
>> That seems to contradict your claim that it's a problem with a specific
>> platform.. Since we're in the merge window I asked David Wei to try to
>> experiment with disabling page fragmentation on the ARM64 platforms we
>> have at Meta. If it repros we should use the generic rx-buf-len
>> ringparam because more NICs may want to implement this strategy.
> 
> Hi Jakub,
> 
> Thanks. I think I was not precise enough in my previous reply.
> 
> What I meant is that the atomic refcount cost itself does not appear to
> be unique to the affected platform. I see a similar ~5% overhead on
> another ARM64 platform (different vendor) as well. However, on that platform
> there is no throughput delta between fragment mode and full-page mode; both reach
> line rate.
> 
> On the affected platform, fragment mode shows an additional ~15%
> throughput drop versus full-page mode. So the current data suggests that
> the atomic overhead is common, but the throughput regression is not
> explained by that overhead alone and likely depends on an additional
> platform-specific factor.
> 
> Separately, the hardware team collected PCIe traces on the affected
> platform and reported stalls in the fragment-mode case that are not seen
> in full-page mode. They are still investigating the root cause, but
> their current hypothesis is that this is related to that platform’s
> PCIe/root-port microarchitecture rather than to page_pool refcounting
> alone.
> 
> That said, I agree the right direction depends on whether this
> reproduces on other ARM64 platforms. If David is able to reproduce the
> same behavior, then using the generic rx-buf-len ringparam sounds like
> the better direction.
> 
> Please let me know what David finds, and I can rework the patch
> accordingly.

I ran a test on Grace, 4 KB pages, 72 cores, 1 NUMA node.

Broadcom NIC, bnxt driver, 50 Gbps bandwidth. Hacked it up to either
give me 1 or 2 frags per page. No agg ring, no HDS, no HW GRO.

Use 1 combined queue only for the server. Affinitized its net rx softirq
to run on core 4.

Ran iperf3 server, taskset onto cpu cores 32-47. The iperf3 client is
running on a host w/ same hw in the same region. Using 32 queues, no
softirq affinities. The idea is to hammer page->pp_ref_count from
different cores.

* 1 frag/page  -> 32.3 Gbps
* 2 frags/page -> 36.0 Gbps

Comparing perf, for 2 frags/page the cost of skb_release_data() hitting
pp_ref_count goes up, as expected. Is this what you see? When you say
there's a +5% overhead, what function?

Overall tput is higher with multiple frags. That's to be expected w/
page pool.
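For anyone wanting to replicate this kind of setup, the steps above map
roughly onto commands like the following (interface name, IRQ lookup,
core lists, and peer address are all assumptions, not taken from this
thread):

```shell
# Server side: reduce to 1 combined queue and pin its RX interrupt to core 4
ethtool -L eth0 combined 1
IRQ=$(awk -F: '/eth0/ {gsub(/ /,"",$1); print $1; exit}' /proc/interrupts)
echo 4 > "/proc/irq/${IRQ}/smp_affinity_list"

# iperf3 server restricted to cores 32-47
taskset -c 32-47 iperf3 -s

# Client host: many parallel streams so frees of the same page's frags
# land on different cores and contend on page->pp_ref_count
iperf3 -c <server-ip> -P 32 -t 60
```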

There are some 200 Gbps NICs but they're mlx5 so I'd have to redo the
driver hack. Are you going to re-implement this change with rx-buf-len
instead of a private flag? If so, I won't spend more time running this
test.

> 
> 
> Regards
> Dipayaan Roy


* Re: [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
  2026-04-24 20:05           ` David Wei
@ 2026-04-25  8:05             ` Dipayaan Roy
  0 siblings, 0 replies; 12+ messages in thread
From: Dipayaan Roy @ 2026-04-25  8:05 UTC (permalink / raw)
  To: David Wei, kuba
  Cc: Jakub Kicinski, kys, haiyangz, wei.liu, decui, andrew+netdev,
	davem, edumazet, pabeni, leon, longli, kotaranov, horms,
	shradhagupta, ssengar, ernis, shirazsaleem, linux-hyperv, netdev,
	linux-kernel, linux-rdma, stephen, jacob.e.keller, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, dipayanroy

On Fri, Apr 24, 2026 at 01:05:24PM -0700, David Wei wrote:
> On 2026-04-23 05:48, Dipayaan Roy wrote:
> > On Thu, Apr 16, 2026 at 08:31:46AM -0700, Jakub Kicinski wrote:
> > > On Tue, 14 Apr 2026 09:00:56 -0700 Dipayaan Roy wrote:
> > > > I still see roughly a 5% overhead from the atomic refcount operation
> > > > itself, but on that platform there is no throughput drop when using
> > > > page fragments versus full-page mode.
> > > 
> > > That seems to contradict your claim that it's a problem with a specific
> > > platform.. Since we're in the merge window I asked David Wei to try to
> > > experiment with disabling page fragmentation on the ARM64 platforms we
> > > have at Meta. If it repros we should use the generic rx-buf-len
> > > ringparam because more NICs may want to implement this strategy.
> > 
> > Hi Jakub,
> > 
> > Thanks. I think I was not precise enough in my previous reply.
> > 
> > What I meant is that the atomic refcount cost itself does not appear to
> > be unique to the affected platform. I see a similar ~5% overhead on
> > another ARM64 platform (different vendor) as well. However, on that platform
> > there is no throughput delta between fragment mode and full-page mode; both reach
> > line rate.
> > 
> > On the affected platform, fragment mode shows an additional ~15%
> > throughput drop versus full-page mode. So the current data suggests that
> > the atomic overhead is common, but the throughput regression is not
> > explained by that overhead alone and likely depends on an additional
> > platform-specific factor.
> > 
> > Separately, the hardware team collected PCIe traces on the affected
> > platform and reported stalls in the fragment-mode case that are not seen
> > in full-page mode. They are still investigating the root cause, but
> > their current hypothesis is that this is related to that platform’s
> > PCIe/root-port microarchitecture rather than to page_pool refcounting
> > alone.
> > 
> > That said, I agree the right direction depends on whether this
> > reproduces on other ARM64 platforms. If David is able to reproduce the
> > same behavior, then using the generic rx-buf-len ringparam sounds like
> > the better direction.
> > 
> > Please let me know what David finds, and I can rework the patch
> > accordingly.
> 
> I ran a test on Grace, 4 KB pages, 72 cores, 1 NUMA node.
> 
> Broadcom NIC, bnxt driver, 50 Gbps bandwidth. Hacked it up to either
> give me 1 or 2 frags per page. No agg ring, no HDS, no HW GRO.
> 
> Use 1 combined queue only for the server. Affinitized its net rx softirq
> to run on core 4.
> 
> Ran iperf3 server, taskset onto cpu cores 32-47. The iperf3 client is
> running on a host w/ same hw in the same region. Using 32 queues, no
> softirq affinities. The idea is to hammer page->pp_ref_count from
> different cores.
> 
> * 1 frag/page  -> 32.3 Gbps
> * 2 frags/page -> 36.0 Gbps
> 
> Comparing perf, for 2 frags/page the cost of skb_release_data() hitting
> pp_ref_count goes up, as expected. Is this what you see? When you say
> there's a +5% overhead, what function?
> 
> Overall tput is higher with multiple frags. That's to be expected w/
> page pool.

Hi David,

Thanks for running this. Your results are consistent with mine.

I have tested this on two ARM64 platforms from different vendors,
running ntttcp and iperf3 with a 4K base page size. On both platforms
I see a ~5% overhead in napi_pp_put_page (~3.9%) and
page_pool_alloc_frag_netmem (~1.9%) when running in fragment mode,
both stalling on the LSE ldaddal atomic that maintains pp_ref_count.
This matches your observation. However, one of the platforms shows a
15% drop in throughput in fragment mode vs full-page mode, while the
other platform in fact performs slightly better in fragment mode than
in full-page mode (similar to your observation).

So the atomic refcount overhead appears to be common across ARM64
platforms, but by itself it does not cause a throughput regression.
The regression seems specific to the one platform for which we want
the full-page workaround. In addition, the HW team has identified
PCIe stalls in fragment mode that are absent in full-page mode, and
their investigation points to a suspected microarchitectural issue in
the PCIe root port. IMO, there is no issue with page_pool itself.

Given that:
 - Grace shows fragments are faster (your data)
 - A second ARM64 platform shows no regression (my data)
 - Only the affected platform shows a throughput drop
 - The HW team suspects a platform-specific PCIe issue, and our
   experiment data also indicates the throughput drop is specific
   to that platform.

I believe this remains a platform-specific workaround rather than
a generic issue. Would a private flag still be acceptable for this
case?


> 
> There are some 200 Gbps NICs but they're mlx5 so I'd have to redo the
> driver hack. Are you going to re-implement this change with rx-buf-len
> instead of a private flag? If so, I won't spend more time running this
> test.
> 
I can go either way depending on what Jakub prefers.

Hi Jakub,
with this new data from David, is it convincing enough to justify a
mana-driver-specific private flag, which userspace can set via a udev
rule after detecting the underlying platform? If not, I will send the
next version with the rx-buf-len approach instead.
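For reference, the udev persistence mentioned in the cover letter could
look something like this (the DMI product-name match string and tool
paths are placeholders/assumptions; the flag name is from this series):

```
# /etc/udev/rules.d/99-mana-full-page-rx.rules (sketch)
ACTION=="add", SUBSYSTEM=="net", ENV{ID_NET_DRIVER}=="mana", \
  PROGRAM=="/bin/grep -q 'AFFECTED-PLATFORM' /sys/class/dmi/id/product_name", \
  RUN+="/usr/sbin/ethtool --set-priv-flags %k full-page-rx on"
```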
> > 
> > 
> > Regards
> > Dipayaan Roy


Thanks and Regards
Dipayaan Roy


end of thread, other threads:[~2026-04-25  8:05 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-07 19:59 [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers Dipayaan Roy
2026-04-07 19:59 ` [PATCH net-next v6 1/2] net: mana: refactor mana_get_strings() and mana_get_sset_count() to use switch Dipayaan Roy
2026-04-07 19:59 ` [PATCH net-next v6 2/2] net: mana: force full-page RX buffers via ethtool private flag Dipayaan Roy
2026-04-10  1:35 ` [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers Jakub Kicinski
2026-04-12 19:59   ` Jakub Kicinski
2026-04-14 16:00     ` Dipayaan Roy
2026-04-16 15:31       ` Jakub Kicinski
2026-04-23 12:48         ` Dipayaan Roy
2026-04-23 16:33           ` Jakub Kicinski
2026-04-24 16:24           ` David Wei
2026-04-24 20:05           ` David Wei
2026-04-25  8:05             ` Dipayaan Roy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox