Re: [PATCH net v2] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Jacob Keller <jacob.e.keller@intel.com>
To: Mingrui Cui <mingruic@outlook.com>, <dtatulea@nvidia.com>
Cc: <andrew+netdev@lunn.ch>, <davem@davemloft.net>,
	<edumazet@google.com>, <kuba@kernel.org>, <leon@kernel.org>,
	<linux-kernel@vger.kernel.org>, <linux-rdma@vger.kernel.org>,
	<mbloch@nvidia.com>, <netdev@vger.kernel.org>,
	<pabeni@redhat.com>, <saeedm@nvidia.com>, <tariqt@nvidia.com>
Subject: Re: [PATCH net v2] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size
Date: Tue, 30 Sep 2025 16:35:00 -0700	[thread overview]
Message-ID: <324e376c-8f66-4ae4-af2d-eea7e5a8e498@intel.com> (raw)
In-Reply-To: <MN6PR16MB5450C5EC9A1B2E2E78E8B241B71AA@MN6PR16MB5450.namprd16.prod.outlook.com>


[-- Attachment #1.1: Type: text/plain, Size: 5914 bytes --]



On 9/30/2025 4:33 AM, Mingrui Cui wrote:
> When page size is 4K, DEFAULT_FRAG_SIZE of 2048 ensures that with 3
> fragments per WQE, odd-indexed WQEs always share the same page with
> their subsequent WQE, while WQEs consisting of 4 fragments does not.
> However, this relationship does not hold for page sizes larger than 8K.
> In this case, wqe_index_mask cannot guarantee that newly allocated WQEs
> won't share the same page with old WQEs.
> 
> If the last WQE in a bulk processed by mlx5e_post_rx_wqes() shares a
> page with its subsequent WQE, allocating a page for that WQE will
> overwrite mlx5e_frag_page, preventing the original page from being
> recycled. When the next WQE is processed, the newly allocated page will
> be immediately recycled. In the next round, if these two WQEs are
> handled in the same bulk, page_pool_defrag_page() will be called again
> on the page, causing pp_frag_count to become negative[1].
> 
> Moreover, this can also lead to memory corruption, as the page may have
> already been returned to the page pool and re-allocated to another WQE.
> And since skb_shared_info is stored at the end of the first fragment,
> its frags->bv_page pointer can be overwritten, leading to an invalid
> memory access when processing the skb[2].
> 
> For example, on 8K page size systems (e.g. DEC Alpha) with a ConnectX-4
> Lx MT27710 (MCX4121A-ACA_Ax) NIC setting MTU to 7657 or higher, heavy
> network loads (e.g. iperf) will first trigger a series of WARNINGs[1]
> and eventually crash[2].
> 
> Fix this by making DEFAULT_FRAG_SIZE always equal to half of the page
> size.
> 
> [1]
> WARNING: CPU: 9 PID: 0 at include/net/page_pool/helpers.h:130
> mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
> CPU: 9 PID: 0 Comm: swapper/9 Tainted: G        W          6.6.0
>  walk_stackframe+0x0/0x190
>  show_stack+0x70/0x94
>  dump_stack_lvl+0x98/0xd8
>  dump_stack+0x2c/0x48
>  __warn+0x1c8/0x220
>  warn_slowpath_fmt+0x20c/0x230
>  mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
>  mlx5e_free_rx_wqes+0xcc/0x120 [mlx5_core]
>  mlx5e_post_rx_wqes+0x1f4/0x4e0 [mlx5_core]
>  mlx5e_napi_poll+0x1c0/0x8d0 [mlx5_core]
>  __napi_poll+0x58/0x2e0
>  net_rx_action+0x1a8/0x340
>  __do_softirq+0x2b8/0x480
>  [...]
> 
> [2]
> Unable to handle kernel paging request at virtual address 393837363534333a
> Oops [#1]
> CPU: 72 PID: 0 Comm: swapper/72 Tainted: G        W          6.6.0
> Trace:
>  walk_stackframe+0x0/0x190
>  show_stack+0x70/0x94
>  die+0x1d4/0x350
>  do_page_fault+0x630/0x690
>  entMM+0x120/0x130
>  napi_pp_put_page+0x30/0x160
>  skb_release_data+0x164/0x250
>  kfree_skb_list_reason+0xd0/0x2f0
>  skb_release_data+0x1f0/0x250
>  napi_consume_skb+0xa0/0x220
>  net_rx_action+0x158/0x340
>  __do_softirq+0x2b8/0x480
>  irq_exit+0xd4/0x120
>  do_entInt+0x164/0x520
>  entInt+0x114/0x120
>  [...]
> 
> Fixes: 069d11465a80 ("net/mlx5e: RX, Enhance legacy Receive Queue memory scheme")
> Signed-off-by: Mingrui Cui <mingruic@outlook.com>
> ---
> Changes in v2:
>   - Add Fixes tag and more details to commit message.
>   - Target 'net' branch.
>   - Remove the obsolete WARN_ON() and update related comments.
> Link to v1: https://lore.kernel.org/all/MN6PR16MB5450CAF432AE40B2AFA58F61B706A@MN6PR16MB5450.namprd16.prod.outlook.com/
> 
>  .../net/ethernet/mellanox/mlx5/core/en/params.c   | 15 +++++----------
>  1 file changed, 5 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
> index 3cca06a74cf9..00b44da23e00 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
> @@ -666,7 +666,7 @@ static void mlx5e_rx_compute_wqe_bulk_params(struct mlx5e_params *params,
>  	info->refill_unit = DIV_ROUND_UP(info->wqe_bulk, split_factor);
>  }
>  
> -#define DEFAULT_FRAG_SIZE (2048)
> +#define DEFAULT_FRAG_SIZE (PAGE_SIZE / 2)
>  

What else does changing this DEFAULT_FRAG_SIZE affect? Presumably we're
allocating larger fragments now for 8K or larger page sizes.

>  static int mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
>  				     struct mlx5e_params *params,
> @@ -756,18 +756,13 @@ static int mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
>  		/* No WQE can start in the middle of a page. */
>  		info->wqe_index_mask = 0;
>  	} else {
> -		/* PAGE_SIZEs starting from 8192 don't use 2K-sized fragments,
> -		 * because there would be more than MLX5E_MAX_RX_FRAGS of them.
> -		 */
> -		WARN_ON(PAGE_SIZE != 2 * DEFAULT_FRAG_SIZE);
> -

So previously we would warn, but now we just fix the DEFAULT_FRAG_SIZE
so this warning is redundant.. Ok.

>  		/* Odd number of fragments allows to pack the last fragment of
>  		 * the previous WQE and the first fragment of the next WQE into
>  		 * the same page.
> -		 * As long as DEFAULT_FRAG_SIZE is 2048, and MLX5E_MAX_RX_FRAGS
> -		 * is 4, the last fragment can be bigger than the rest only if
> -		 * it's the fourth one, so WQEs consisting of 3 fragments will
> -		 * always share a page.
> +		 * As long as DEFAULT_FRAG_SIZE is (PAGE_SIZE / 2), and
> +		 * MLX5E_MAX_RX_FRAGS is 4, the last fragment can be bigger than
> +		 * the rest only if it's the fourth one, so WQEs consisting of 3
> +		 * fragments will always share a page.
>  		 * When a page is shared, WQE bulk size is 2, otherwise just 1.
>  		 */
>  		info->wqe_index_mask = info->num_frags % 2;

Would it be possible to fix the other logic so that it works for a
DEFAULT_FRAG_SIZE of 2k on 8K pages? I guess if there's no negative to
increasing the frag size then this fix makes sense since it is simple.

I guess the noted fixed commit limits to 4 fragments, but it make some
assumptions that were wrong for 8K pages?

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 236 bytes --]

next prev parent reply	other threads:[~2025-09-30 23:35 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-30 11:33 [PATCH net v2] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size Mingrui Cui
2025-09-30 23:35 ` Jacob Keller [this message]
2025-10-15 12:32   ` Mingrui Cui
2025-10-15 17:36     ` Jacob Keller
2025-10-01 13:02 ` Dragos Tatulea
2025-10-15 13:02   ` Mingrui Cui

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=324e376c-8f66-4ae4-af2d-eea7e5a8e498@intel.com \
    --to=jacob.e.keller@intel.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=davem@davemloft.net \
    --cc=dtatulea@nvidia.com \
    --cc=edumazet@google.com \
    --cc=kuba@kernel.org \
    --cc=leon@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=mbloch@nvidia.com \
    --cc=mingruic@outlook.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=saeedm@nvidia.com \
    --cc=tariqt@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox