All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next v10 0/2] net: mana: add ethtool private flag for full-page RX buffers
@ 2026-06-02 20:24 Dipayaan Roy
  2026-06-02 20:24 ` [PATCH net-next v10 1/2] net: mana: refactor mana_get_strings() and mana_get_sset_count() to use switch Dipayaan Roy
  2026-06-02 20:24 ` [PATCH net-next v10 2/2] net: mana: force full-page RX buffers via ethtool private flag Dipayaan Roy
  0 siblings, 2 replies; 5+ messages in thread
From: Dipayaan Roy @ 2026-06-02 20:24 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov,
	pavan.chebbi

On some ARM64 platforms with 4K PAGE_SIZE, utilizing page_pool
fragments for allocation in the RX refill path (~2kB buffer per fragment)
causes 15-20% throughput regression under high connection counts
(>16 TCP streams at 180+ Gbps). Using full-page buffers on these
platforms shows no regression and restores line-rate performance.

This behavior is observed on a single platform; other platforms
perform better with page_pool fragments, indicating this is not a
page_pool issue but platform-specific.

This series adds an ethtool private flag "full-page-rx" to let the
user opt in to one RX buffer per page:

  ethtool --set-priv-flags eth0 full-page-rx on

There is no behavioral change by default. The flag can be persisted
via udev rule for affected platforms.

This series depends on the following fixes now merged in net-next:
  17bfe0a8c014 ("net: mana: add NULL guards in teardown path to prevent panic")
  5b05aa36ee24 ("net: mana: skip redundant detach on already-detached port")

Changes in v10:
  - Rebased on net-next which now includes the prerequisite fixes.
  - Recovery logic in mana_set_priv_flags() leverages the idempotent
    mana_detach() from the merged fixes.
Changes in v9:
  - Added correct tree.
Changes in v8:
  - Fixed queue_reset_work recovery by restoring port_is_up before
    scheduling reset so the handler can properly re-attach.
  - Simplified "err && schedule_port_reset" to "schedule_port_reset".
Changes in v7:
  - Rebased onto net-next.
  - Retained private flag approach after David Wei's testing on
    Grace (ARM64) confirmed that fragment mode outperforms
    full-page mode on other platforms, validating this is a
    single-platform workaround rather than a generic issue.
Changes in v6:
  - Added missed maintainers.
Changes in v5:
  - Split prep refactor into separate patch (patch 1/2)
Changes in v4:
  - Dropping the smbios string parsing and add ethtool priv flag
    to reconfigure the queues with full page rx buffers.
Changes in v3:
  - changed u8* to char*
Changes in v2:
  - separate reading string index and the string, remove inline.

Dipayaan Roy (2):
  net: mana: refactor mana_get_strings() and mana_get_sset_count() to
    use switch
  net: mana: force full-page RX buffers via ethtool private flag

 drivers/net/ethernet/microsoft/mana/mana_en.c |  22 ++-
 .../ethernet/microsoft/mana/mana_ethtool.c    | 178 +++++++++++++++---
 include/net/mana/mana.h                       |   8 +
 3 files changed, 177 insertions(+), 31 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH net-next v10 1/2] net: mana: refactor mana_get_strings() and mana_get_sset_count() to use switch
  2026-06-02 20:24 [PATCH net-next v10 0/2] net: mana: add ethtool private flag for full-page RX buffers Dipayaan Roy
@ 2026-06-02 20:24 ` Dipayaan Roy
  2026-06-02 20:24 ` [PATCH net-next v10 2/2] net: mana: force full-page RX buffers via ethtool private flag Dipayaan Roy
  1 sibling, 0 replies; 5+ messages in thread
From: Dipayaan Roy @ 2026-06-02 20:24 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov,
	pavan.chebbi

Refactor mana_get_strings() and mana_get_sset_count() from if/else to
switch statements in preparation for adding ethtool private flags
support which requires handling ETH_SS_PRIV_FLAGS.

No functional change.

Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
 .../ethernet/microsoft/mana/mana_ethtool.c    | 75 ++++++++++++-------
 1 file changed, 46 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 04350973e19e..7e79681634db 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -138,53 +138,70 @@ static int mana_get_sset_count(struct net_device *ndev, int stringset)
 	struct mana_port_context *apc = netdev_priv(ndev);
 	unsigned int num_queues = apc->num_queues;
 
-	if (stringset != ETH_SS_STATS)
+	switch (stringset) {
+	case ETH_SS_STATS:
+		return ARRAY_SIZE(mana_eth_stats) +
+		       ARRAY_SIZE(mana_phy_stats) +
+		       ARRAY_SIZE(mana_hc_stats)  +
+		       num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+	default:
 		return -EINVAL;
-
-	return ARRAY_SIZE(mana_eth_stats) + ARRAY_SIZE(mana_phy_stats) + ARRAY_SIZE(mana_hc_stats) +
-			num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+	}
 }
 
-static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
+static void mana_get_strings_stats(struct mana_port_context *apc, u8 **data)
 {
-	struct mana_port_context *apc = netdev_priv(ndev);
 	unsigned int num_queues = apc->num_queues;
 	int i, j;
 
-	if (stringset != ETH_SS_STATS)
-		return;
 	for (i = 0; i < ARRAY_SIZE(mana_eth_stats); i++)
-		ethtool_puts(&data, mana_eth_stats[i].name);
+		ethtool_puts(data, mana_eth_stats[i].name);
 
 	for (i = 0; i < ARRAY_SIZE(mana_hc_stats); i++)
-		ethtool_puts(&data, mana_hc_stats[i].name);
+		ethtool_puts(data, mana_hc_stats[i].name);
 
 	for (i = 0; i < ARRAY_SIZE(mana_phy_stats); i++)
-		ethtool_puts(&data, mana_phy_stats[i].name);
+		ethtool_puts(data, mana_phy_stats[i].name);
 
 	for (i = 0; i < num_queues; i++) {
-		ethtool_sprintf(&data, "rx_%d_packets", i);
-		ethtool_sprintf(&data, "rx_%d_bytes", i);
-		ethtool_sprintf(&data, "rx_%d_xdp_drop", i);
-		ethtool_sprintf(&data, "rx_%d_xdp_tx", i);
-		ethtool_sprintf(&data, "rx_%d_xdp_redirect", i);
-		ethtool_sprintf(&data, "rx_%d_pkt_len0_err", i);
+		ethtool_sprintf(data, "rx_%d_packets", i);
+		ethtool_sprintf(data, "rx_%d_bytes", i);
+		ethtool_sprintf(data, "rx_%d_xdp_drop", i);
+		ethtool_sprintf(data, "rx_%d_xdp_tx", i);
+		ethtool_sprintf(data, "rx_%d_xdp_redirect", i);
+		ethtool_sprintf(data, "rx_%d_pkt_len0_err", i);
 		for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
-			ethtool_sprintf(&data, "rx_%d_coalesced_cqe_%d", i, j + 2);
+			ethtool_sprintf(data,
+					"rx_%d_coalesced_cqe_%d",
+					i,
+					j + 2);
 	}
 
 	for (i = 0; i < num_queues; i++) {
-		ethtool_sprintf(&data, "tx_%d_packets", i);
-		ethtool_sprintf(&data, "tx_%d_bytes", i);
-		ethtool_sprintf(&data, "tx_%d_xdp_xmit", i);
-		ethtool_sprintf(&data, "tx_%d_tso_packets", i);
-		ethtool_sprintf(&data, "tx_%d_tso_bytes", i);
-		ethtool_sprintf(&data, "tx_%d_tso_inner_packets", i);
-		ethtool_sprintf(&data, "tx_%d_tso_inner_bytes", i);
-		ethtool_sprintf(&data, "tx_%d_long_pkt_fmt", i);
-		ethtool_sprintf(&data, "tx_%d_short_pkt_fmt", i);
-		ethtool_sprintf(&data, "tx_%d_csum_partial", i);
-		ethtool_sprintf(&data, "tx_%d_mana_map_err", i);
+		ethtool_sprintf(data, "tx_%d_packets", i);
+		ethtool_sprintf(data, "tx_%d_bytes", i);
+		ethtool_sprintf(data, "tx_%d_xdp_xmit", i);
+		ethtool_sprintf(data, "tx_%d_tso_packets", i);
+		ethtool_sprintf(data, "tx_%d_tso_bytes", i);
+		ethtool_sprintf(data, "tx_%d_tso_inner_packets", i);
+		ethtool_sprintf(data, "tx_%d_tso_inner_bytes", i);
+		ethtool_sprintf(data, "tx_%d_long_pkt_fmt", i);
+		ethtool_sprintf(data, "tx_%d_short_pkt_fmt", i);
+		ethtool_sprintf(data, "tx_%d_csum_partial", i);
+		ethtool_sprintf(data, "tx_%d_mana_map_err", i);
+	}
+}
+
+static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+
+	switch (stringset) {
+	case ETH_SS_STATS:
+		mana_get_strings_stats(apc, &data);
+		break;
+	default:
+		break;
 	}
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH net-next v10 2/2] net: mana: force full-page RX buffers via ethtool private flag
  2026-06-02 20:24 [PATCH net-next v10 0/2] net: mana: add ethtool private flag for full-page RX buffers Dipayaan Roy
  2026-06-02 20:24 ` [PATCH net-next v10 1/2] net: mana: refactor mana_get_strings() and mana_get_sset_count() to use switch Dipayaan Roy
@ 2026-06-02 20:24 ` Dipayaan Roy
  2026-06-03 20:28   ` sashiko-bot
  2026-06-04 18:40   ` Jacob Keller
  1 sibling, 2 replies; 5+ messages in thread
From: Dipayaan Roy @ 2026-06-02 20:24 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov,
	pavan.chebbi

On some ARM64 platforms with 4K PAGE_SIZE, page_pool fragment
allocation in the RX refill path can cause 15-20% throughput
regression under high connection counts (>16 TCP streams).

Add an ethtool private flag "full-page-rx" that allows the user to
force one RX buffer per page, bypassing the page_pool fragment path.
This restores line-rate (180+ Gbps) performance on affected platforms.

Usage:
  ethtool --set-priv-flags eth0 full-page-rx on

There is no behavioral change by default. The flag must be explicitly
enabled by the user or udev rule.

The existing single-buffer-per-page logic for XDP and jumbo frames is
consolidated into a new helper mana_use_single_rxbuf_per_page() which
is now the single decision point for both the automatic and
user-controlled paths.

Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c |  22 +++-
 .../ethernet/microsoft/mana/mana_ethtool.c    | 103 ++++++++++++++++++
 include/net/mana/mana.h                       |   8 ++
 3 files changed, 131 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index db14357d3732..447cecfd3f67 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -744,6 +744,25 @@ static void *mana_get_rxbuf_pre(struct mana_rxq *rxq, dma_addr_t *da)
 	return va;
 }
 
+static bool
+mana_use_single_rxbuf_per_page(struct mana_port_context *apc, u32 mtu)
+{
+	/* On some platforms with 4K PAGE_SIZE, page_pool fragment allocation
+	 * in the RX refill path (~2kB buffer) can cause significant throughput
+	 * regression under high connection counts. Allow user to force one RX
+	 * buffer per page via ethtool private flag to bypass the fragment
+	 * path.
+	 */
+	if (apc->priv_flags & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF))
+		return true;
+
+	/* For xdp and jumbo frames make sure only one packet fits per page. */
+	if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc))
+		return true;
+
+	return false;
+}
+
 /* Get RX buffer's data size, alloc size, XDP headroom based on MTU */
 static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
 			       int mtu, u32 *datasize, u32 *alloc_size,
@@ -754,8 +773,7 @@ static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
 	/* Calculate datasize first (consistent across all cases) */
 	*datasize = mtu + ETH_HLEN;
 
-	/* For xdp and jumbo frames make sure only one packet fits per page */
-	if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc)) {
+	if (mana_use_single_rxbuf_per_page(apc, mtu)) {
 		if (mana_xdp_get(apc)) {
 			*headroom = XDP_PACKET_HEADROOM;
 			*alloc_size = PAGE_SIZE;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 7e79681634db..f22bbb325948 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -133,6 +133,10 @@ static const struct mana_stats_desc mana_phy_stats[] = {
 	{ "hc_tc7_tx_pause_phy", offsetof(struct mana_ethtool_phy_stats, tx_pause_tc7_phy) },
 };
 
+static const char mana_priv_flags[MANA_PRIV_FLAG_MAX][ETH_GSTRING_LEN] = {
+	[MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF] = "full-page-rx"
+};
+
 static int mana_get_sset_count(struct net_device *ndev, int stringset)
 {
 	struct mana_port_context *apc = netdev_priv(ndev);
@@ -144,6 +148,10 @@ static int mana_get_sset_count(struct net_device *ndev, int stringset)
 		       ARRAY_SIZE(mana_phy_stats) +
 		       ARRAY_SIZE(mana_hc_stats)  +
 		       num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+
+	case ETH_SS_PRIV_FLAGS:
+		return MANA_PRIV_FLAG_MAX;
+
 	default:
 		return -EINVAL;
 	}
@@ -192,6 +200,14 @@ static void mana_get_strings_stats(struct mana_port_context *apc, u8 **data)
 	}
 }
 
+static void mana_get_strings_priv_flags(u8 **data)
+{
+	int i;
+
+	for (i = 0; i < MANA_PRIV_FLAG_MAX; i++)
+		ethtool_puts(data, mana_priv_flags[i]);
+}
+
 static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
 {
 	struct mana_port_context *apc = netdev_priv(ndev);
@@ -200,6 +216,9 @@ static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
 	case ETH_SS_STATS:
 		mana_get_strings_stats(apc, &data);
 		break;
+	case ETH_SS_PRIV_FLAGS:
+		mana_get_strings_priv_flags(&data);
+		break;
 	default:
 		break;
 	}
@@ -590,6 +609,88 @@ static int mana_get_link_ksettings(struct net_device *ndev,
 	return 0;
 }
 
+static u32 mana_get_priv_flags(struct net_device *ndev)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+
+	return apc->priv_flags;
+}
+
+static int mana_set_priv_flags(struct net_device *ndev, u32 priv_flags)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+	u32 changed = apc->priv_flags ^ priv_flags;
+	u32 old_priv_flags = apc->priv_flags;
+	bool schedule_port_reset = false;
+	int err = 0;
+
+	if (!changed)
+		return 0;
+
+	/* Reject unknown bits */
+	if (priv_flags & ~GENMASK(MANA_PRIV_FLAG_MAX - 1, 0))
+		return -EINVAL;
+
+	if (changed & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF)) {
+		apc->priv_flags = priv_flags;
+
+		if (!apc->port_is_up) {
+			/* Port is down, flag updated to apply on next up
+			 * so just return.
+			 */
+			return 0;
+		}
+
+		/* Pre-allocate buffers to prevent failure in mana_attach
+		 * later
+		 */
+		err = mana_pre_alloc_rxbufs(apc, ndev->mtu, apc->num_queues);
+		if (err) {
+			netdev_err(ndev,
+				   "Insufficient memory for new allocations\n");
+			apc->priv_flags = old_priv_flags;
+			return err;
+		}
+
+		err = mana_detach(ndev, false);
+		if (err) {
+			netdev_err(ndev, "mana_detach failed: %d\n", err);
+			apc->priv_flags = old_priv_flags;
+
+			/* Port is in an inconsistent state. Restore
+			 * 'port_is_up' so that queue reset work handler
+			 * can properly detach and re-attach.
+			 */
+			apc->port_is_up = true;
+			schedule_port_reset = true;
+			goto out;
+		}
+
+		err = mana_attach(ndev);
+		if (err) {
+			netdev_err(ndev, "mana_attach failed: %d\n", err);
+			apc->priv_flags = old_priv_flags;
+
+			/* Restore 'port_is_up' so the reset work handler
+			 * can properly detach/attach. Without this,
+			 * the handler sees port_is_up=false and skips
+			 * queue allocation, leaving the port dead.
+			 */
+			apc->port_is_up = true;
+			schedule_port_reset = true;
+		}
+	}
+
+out:
+	mana_pre_dealloc_rxbufs(apc);
+
+	if (schedule_port_reset)
+		queue_work(apc->ac->per_port_queue_reset_wq,
+			   &apc->queue_reset_work);
+
+	return err;
+}
+
 const struct ethtool_ops mana_ethtool_ops = {
 	.supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES,
 	.get_ethtool_stats	= mana_get_ethtool_stats,
@@ -608,4 +709,6 @@ const struct ethtool_ops mana_ethtool_ops = {
 	.set_ringparam          = mana_set_ringparam,
 	.get_link_ksettings	= mana_get_link_ksettings,
 	.get_link		= ethtool_op_get_link,
+	.get_priv_flags		= mana_get_priv_flags,
+	.set_priv_flags		= mana_set_priv_flags,
 };
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index d9c27310fd04..26fd5e041a47 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -30,6 +30,12 @@ enum TRI_STATE {
 	TRI_STATE_TRUE = 1
 };
 
+/* MANA ethtool private flag bit positions */
+enum mana_priv_flag_bits {
+	MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF = 0,
+	MANA_PRIV_FLAG_MAX,
+};
+
 /* Number of entries for hardware indirection table must be in power of 2 */
 #define MANA_INDIRECT_TABLE_MAX_SIZE 512
 #define MANA_INDIRECT_TABLE_DEF_SIZE 64
@@ -531,6 +537,8 @@ struct mana_port_context {
 	u32 rxbpre_headroom;
 	u32 rxbpre_frag_count;
 
+	u32 priv_flags;
+
 	struct bpf_prog *bpf_prog;
 
 	/* Create num_queues EQs, SQs, SQ-CQs, RQs and RQ-CQs, respectively. */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH net-next v10 2/2] net: mana: force full-page RX buffers via ethtool private flag
  2026-06-02 20:24 ` [PATCH net-next v10 2/2] net: mana: force full-page RX buffers via ethtool private flag Dipayaan Roy
@ 2026-06-03 20:28   ` sashiko-bot
  2026-06-04 18:40   ` Jacob Keller
  1 sibling, 0 replies; 5+ messages in thread
From: sashiko-bot @ 2026-06-03 20:28 UTC (permalink / raw)
  To: Dipayaan Roy; +Cc: linux-hyperv, bpf

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] Forcing `apc->port_is_up = true` on the `mana_attach` error path causes a double-destroy of the hardware vport when the interface is later brought down.
- [Medium] Latent bug: Changes to future private flags will be silently ignored if MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF is not simultaneously toggled.
--

commit d7983f6668f209a0b7bfb427c64d5517b326c7b1
Author: Dipayaan Roy <dipayanroy@linux.microsoft.com>

net: mana: force full-page RX buffers via ethtool private flag

Adds an ethtool private flag full-page-rx to bypass page_pool fragment
allocation and force one RX buffer per page. Includes logic in
mana_set_priv_flags to dynamically reconfigure the port on flag change.

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> --- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c

[ ... ]

> +static int mana_set_priv_flags(struct net_device *ndev, u32 priv_flags)
> +{
> +	struct mana_port_context *apc = netdev_priv(ndev);
> +	u32 changed = apc->priv_flags ^ priv_flags;
> +	u32 old_priv_flags = apc->priv_flags;
> +	bool schedule_port_reset = false;
> +	int err = 0;
> +
> +	if (!changed)
> +		return 0;
> +
> +	/* Reject unknown bits */
> +	if (priv_flags & ~GENMASK(MANA_PRIV_FLAG_MAX - 1, 0))
> +		return -EINVAL;
> +
> +	if (changed & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF)) {
> +		apc->priv_flags = priv_flags;
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

[Severity: Medium]
When additional private flags are added in the future, will changes to those
new flags be silently ignored if MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF isn't
also toggled?

Since this assignment is nested inside the FULL_PAGE_RXBUF check, changing
only a future new flag will bypass this assignment and the function will
return success while discarding the new flag state.

Should this assignment be moved outside the flag-specific conditional block?

> +
> +		if (!apc->port_is_up) {

[ ... ]

> +		err = mana_attach(ndev);
> +		if (err) {
> +			netdev_err(ndev, "mana_attach failed: %d\n", err);
> +			apc->priv_flags = old_priv_flags;
> +
> +			/* Restore 'port_is_up' so the reset work handler
> +			 * can properly detach/attach. Without this,
> +			 * the handler sees port_is_up=false and skips
> +			 * queue allocation, leaving the port dead.
> +			 */
> +			apc->port_is_up = true;
                        ^^^^^^^^^^^^^^^^^^^^^^

[Severity: High]
Does forcing port_is_up to true here cause a double-destroy of the hardware
vport when the interface is later brought down?

If mana_attach fails, it cleans up any partial state by destroying the vport.

Leaving the port artificially marked as up means a subsequent interface down
event or reset worker will see apc->port_is_up as true and call
mana_detach with from_close set to true.

This bypasses the netif_device_present check and triggers a second call to
mana_destroy_vport, which could drop apc->vport_use_count below zero and
send a duplicate deregister message to the firmware for the already-freed
hw_vport_handle.

> +			schedule_port_reset = true;
> +		}
> +	}
> +
> +out:
> +	mana_pre_dealloc_rxbufs(apc);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260602202801.1873742-1-dipayanroy@linux.microsoft.com?part=2

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH net-next v10 2/2] net: mana: force full-page RX buffers via ethtool private flag
  2026-06-02 20:24 ` [PATCH net-next v10 2/2] net: mana: force full-page RX buffers via ethtool private flag Dipayaan Roy
  2026-06-03 20:28   ` sashiko-bot
@ 2026-06-04 18:40   ` Jacob Keller
  1 sibling, 0 replies; 5+ messages in thread
From: Jacob Keller @ 2026-06-04 18:40 UTC (permalink / raw)
  To: Dipayaan Roy, kys, haiyangz, wei.liu, decui, andrew+netdev, davem,
	edumazet, kuba, pabeni, leon, longli, kotaranov, horms,
	shradhagupta, ssengar, ernis, shirazsaleem, linux-hyperv, netdev,
	linux-kernel, linux-rdma, stephen, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov,
	pavan.chebbi

On 6/2/2026 1:24 PM, Dipayaan Roy wrote:
> On some ARM64 platforms with 4K PAGE_SIZE, page_pool fragment
> allocation in the RX refill path can cause 15-20% throughput
> regression under high connection counts (>16 TCP streams).
> 
> Add an ethtool private flag "full-page-rx" that allows the user to
> force one RX buffer per page, bypassing the page_pool fragment path.
> This restores line-rate (180+ Gbps) performance on affected platforms.
> 
> Usage:
>   ethtool --set-priv-flags eth0 full-page-rx on
> 
> There is no behavioral change by default. The flag must be explicitly
> enabled by the user or udev rule.
> 
> The existing single-buffer-per-page logic for XDP and jumbo frames is
> consolidated into a new helper mana_use_single_rxbuf_per_page() which
> is now the single decision point for both the automatic and
> user-controlled paths.
> 
> Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
> ---

I had one or two minor nits, but nothing that I think really deserves a
v11. The only real comment is a future "gotcha" that could happen if you
ever added a second private flag, which seems unlikely and maybe not
worth dealing with until it matters.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

>  drivers/net/ethernet/microsoft/mana/mana_en.c |  22 +++-
>  .../ethernet/microsoft/mana/mana_ethtool.c    | 103 ++++++++++++++++++
>  include/net/mana/mana.h                       |   8 ++
>  3 files changed, 131 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index db14357d3732..447cecfd3f67 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -744,6 +744,25 @@ static void *mana_get_rxbuf_pre(struct mana_rxq *rxq, dma_addr_t *da)
>  	return va;
>  }
>  
> +static bool
> +mana_use_single_rxbuf_per_page(struct mana_port_context *apc, u32 mtu)
> +{
> +	/* On some platforms with 4K PAGE_SIZE, page_pool fragment allocation
> +	 * in the RX refill path (~2kB buffer) can cause significant throughput
> +	 * regression under high connection counts. Allow user to force one RX
> +	 * buffer per page via ethtool private flag to bypass the fragment
> +	 * path.
> +	 */
> +	if (apc->priv_flags & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF))
> +		return true;
> +
> +	/* For xdp and jumbo frames make sure only one packet fits per page. */
> +	if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc))
> +		return true;

Technically you could combine all three into one if, but I agree that
clarity and space for the comment about why the private flag exists
makes sense.

> +
> +	return false;
> +}
> +
>  /* Get RX buffer's data size, alloc size, XDP headroom based on MTU */
>  static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
>  			       int mtu, u32 *datasize, u32 *alloc_size,
> @@ -754,8 +773,7 @@ static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
>  	/* Calculate datasize first (consistent across all cases) */
>  	*datasize = mtu + ETH_HLEN;
>  
> -	/* For xdp and jumbo frames make sure only one packet fits per page */
> -	if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc)) {
> +	if (mana_use_single_rxbuf_per_page(apc, mtu)) {
>  		if (mana_xdp_get(apc)) {
>  			*headroom = XDP_PACKET_HEADROOM;
>  			*alloc_size = PAGE_SIZE;
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> index 7e79681634db..f22bbb325948 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> @@ -133,6 +133,10 @@ static const struct mana_stats_desc mana_phy_stats[] = {
>  	{ "hc_tc7_tx_pause_phy", offsetof(struct mana_ethtool_phy_stats, tx_pause_tc7_phy) },
>  };
>  
> +static const char mana_priv_flags[MANA_PRIV_FLAG_MAX][ETH_GSTRING_LEN] = {
> +	[MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF] = "full-page-rx"
> +};
> +
>  static int mana_get_sset_count(struct net_device *ndev, int stringset)
>  {
>  	struct mana_port_context *apc = netdev_priv(ndev);
> @@ -144,6 +148,10 @@ static int mana_get_sset_count(struct net_device *ndev, int stringset)
>  		       ARRAY_SIZE(mana_phy_stats) +
>  		       ARRAY_SIZE(mana_hc_stats)  +
>  		       num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
> +
> +	case ETH_SS_PRIV_FLAGS:
> +		return MANA_PRIV_FLAG_MAX;
> +
>  	default:
>  		return -EINVAL;
>  	}
> @@ -192,6 +200,14 @@ static void mana_get_strings_stats(struct mana_port_context *apc, u8 **data)
>  	}
>  }
>  
> +static void mana_get_strings_priv_flags(u8 **data)
> +{
> +	int i;
> +
> +	for (i = 0; i < MANA_PRIV_FLAG_MAX; i++)
> +		ethtool_puts(data, mana_priv_flags[i]);
> +}
> +
>  static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
>  {
>  	struct mana_port_context *apc = netdev_priv(ndev);
> @@ -200,6 +216,9 @@ static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
>  	case ETH_SS_STATS:
>  		mana_get_strings_stats(apc, &data);
>  		break;
> +	case ETH_SS_PRIV_FLAGS:
> +		mana_get_strings_priv_flags(&data);
> +		break;
>  	default:
>  		break;
>  	}
> @@ -590,6 +609,88 @@ static int mana_get_link_ksettings(struct net_device *ndev,
>  	return 0;
>  }
>  
> +static u32 mana_get_priv_flags(struct net_device *ndev)
> +{
> +	struct mana_port_context *apc = netdev_priv(ndev);
> +
> +	return apc->priv_flags;
> +}
> +
> +static int mana_set_priv_flags(struct net_device *ndev, u32 priv_flags)
> +{
> +	struct mana_port_context *apc = netdev_priv(ndev);
> +	u32 changed = apc->priv_flags ^ priv_flags;
> +	u32 old_priv_flags = apc->priv_flags;
> +	bool schedule_port_reset = false;
> +	int err = 0;
> +
> +	if (!changed)
> +		return 0;
> +
> +	/* Reject unknown bits */
> +	if (priv_flags & ~GENMASK(MANA_PRIV_FLAG_MAX - 1, 0))
> +		return -EINVAL;

Good. Explicit rejection ensures that there's no risk of bad value. I
think this is only required for the legacy ioctl interface, and won't be
able to have a bit set that isn't in your accepted list. However the
legacy ioctl interface looks like it doesn't do that double checking, so
its good to have this.

> +
> +	if (changed & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF)) {
> +		apc->priv_flags = priv_flags;
> +

In the (unlikely) event that you need another private flag in the
future, this bit seems like it shouldn't be inside the if block here. It
seems like you'd want to either do this at the end or up front. Of
course it doesn't matter as long as this is the only private flag you have.

> +		if (!apc->port_is_up) {
> +			/* Port is down, flag updated to apply on next up
> +			 * so just return.
> +			 */
> +			return 0;
> +		}
> +
> +		/* Pre-allocate buffers to prevent failure in mana_attach
> +		 * later
> +		 */
> +		err = mana_pre_alloc_rxbufs(apc, ndev->mtu, apc->num_queues);
> +		if (err) {
> +			netdev_err(ndev,
> +				   "Insufficient memory for new allocations\n");
> +			apc->priv_flags = old_priv_flags;
> +			return err;
> +		}
> +
> +		err = mana_detach(ndev, false);
> +		if (err) {
> +			netdev_err(ndev, "mana_detach failed: %d\n", err);
> +			apc->priv_flags = old_priv_flags;
> +
> +			/* Port is in an inconsistent state. Restore
> +			 * 'port_is_up' so that queue reset work handler
> +			 * can properly detach and re-attach.
> +			 */
> +			apc->port_is_up = true;
> +			schedule_port_reset = true;
> +			goto out;
> +		}
> +
> +		err = mana_attach(ndev);
> +		if (err) {
> +			netdev_err(ndev, "mana_attach failed: %d\n", err);
> +			apc->priv_flags = old_priv_flags;
> +
> +			/* Restore 'port_is_up' so the reset work handler
> +			 * can properly detach/attach. Without this,
> +			 * the handler sees port_is_up=false and skips
> +			 * queue allocation, leaving the port dead.
> +			 */
> +			apc->port_is_up = true;
> +			schedule_port_reset = true;
> +		}

I might have made this bit a separate function, but that comes from
history of working with older drivers which accumulated a larger number
of private flags. Given that we frown on adding new ones except in more
rare cases these days, this is probably fine.

> +	}
> +
> +out:
> +	mana_pre_dealloc_rxbufs(apc);
> +
> +	if (schedule_port_reset)
> +		queue_work(apc->ac->per_port_queue_reset_wq,
> +			   &apc->queue_reset_work);
> +
> +	return err;
> +}
> +
>  const struct ethtool_ops mana_ethtool_ops = {
>  	.supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES,
>  	.get_ethtool_stats	= mana_get_ethtool_stats,
> @@ -608,4 +709,6 @@ const struct ethtool_ops mana_ethtool_ops = {
>  	.set_ringparam          = mana_set_ringparam,
>  	.get_link_ksettings	= mana_get_link_ksettings,
>  	.get_link		= ethtool_op_get_link,
> +	.get_priv_flags		= mana_get_priv_flags,
> +	.set_priv_flags		= mana_set_priv_flags,
>  };
> diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
> index d9c27310fd04..26fd5e041a47 100644
> --- a/include/net/mana/mana.h
> +++ b/include/net/mana/mana.h
> @@ -30,6 +30,12 @@ enum TRI_STATE {
>  	TRI_STATE_TRUE = 1
>  };
>  
> +/* MANA ethtool private flag bit positions */
> +enum mana_priv_flag_bits {
> +	MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF = 0,
> +	MANA_PRIV_FLAG_MAX,

For cases like this, I find it helpful to add a comment indicating this
must be the last entry. (and in that case, drop the trailing comma).

> +};
> +
>  /* Number of entries for hardware indirection table must be in power of 2 */
>  #define MANA_INDIRECT_TABLE_MAX_SIZE 512
>  #define MANA_INDIRECT_TABLE_DEF_SIZE 64
> @@ -531,6 +537,8 @@ struct mana_port_context {
>  	u32 rxbpre_headroom;
>  	u32 rxbpre_frag_count;
>  
> +	u32 priv_flags;
> +
>  	struct bpf_prog *bpf_prog;
>  
>  	/* Create num_queues EQs, SQs, SQ-CQs, RQs and RQ-CQs, respectively. */


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-06-04 18:40 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-02 20:24 [PATCH net-next v10 0/2] net: mana: add ethtool private flag for full-page RX buffers Dipayaan Roy
2026-06-02 20:24 ` [PATCH net-next v10 1/2] net: mana: refactor mana_get_strings() and mana_get_sset_count() to use switch Dipayaan Roy
2026-06-02 20:24 ` [PATCH net-next v10 2/2] net: mana: force full-page RX buffers via ethtool private flag Dipayaan Roy
2026-06-03 20:28   ` sashiko-bot
2026-06-04 18:40   ` Jacob Keller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.