* [PATCH net v2] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size
@ 2025-09-30 11:33 Mingrui Cui
2025-09-30 23:35 ` Jacob Keller
2025-10-01 13:02 ` Dragos Tatulea
0 siblings, 2 replies; 6+ messages in thread
From: Mingrui Cui @ 2025-09-30 11:33 UTC (permalink / raw)
To: dtatulea
Cc: andrew+netdev, davem, edumazet, kuba, leon, linux-kernel,
linux-rdma, mbloch, mingruic, netdev, pabeni, saeedm, tariqt
When page size is 4K, DEFAULT_FRAG_SIZE of 2048 ensures that with 3
fragments per WQE, odd-indexed WQEs always share the same page with
their subsequent WQE, while WQEs consisting of 4 fragments do not.
However, this relationship does not hold for page sizes of 8K and larger.
In this case, wqe_index_mask cannot guarantee that newly allocated WQEs
won't share the same page with old WQEs.
If the last WQE in a bulk processed by mlx5e_post_rx_wqes() shares a
page with its subsequent WQE, allocating a page for that WQE will
overwrite mlx5e_frag_page, preventing the original page from being
recycled. When the next WQE is processed, the newly allocated page will
be immediately recycled. In the next round, if these two WQEs are
handled in the same bulk, page_pool_defrag_page() will be called again
on the page, causing pp_frag_count to become negative[1].
Moreover, this can also lead to memory corruption, as the page may have
already been returned to the page pool and re-allocated to another WQE.
And since skb_shared_info is stored at the end of the first fragment,
its frags->bv_page pointer can be overwritten, leading to an invalid
memory access when processing the skb[2].
For example, on 8K page size systems (e.g. DEC Alpha) with a ConnectX-4
Lx MT27710 (MCX4121A-ACA_Ax) NIC and an MTU of 7657 or higher, heavy
network loads (e.g. iperf) will first trigger a series of WARNINGs[1]
and eventually crash[2].
Fix this by making DEFAULT_FRAG_SIZE always equal to half of the page
size.
[1]
WARNING: CPU: 9 PID: 0 at include/net/page_pool/helpers.h:130
mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
CPU: 9 PID: 0 Comm: swapper/9 Tainted: G W 6.6.0
walk_stackframe+0x0/0x190
show_stack+0x70/0x94
dump_stack_lvl+0x98/0xd8
dump_stack+0x2c/0x48
__warn+0x1c8/0x220
warn_slowpath_fmt+0x20c/0x230
mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
mlx5e_free_rx_wqes+0xcc/0x120 [mlx5_core]
mlx5e_post_rx_wqes+0x1f4/0x4e0 [mlx5_core]
mlx5e_napi_poll+0x1c0/0x8d0 [mlx5_core]
__napi_poll+0x58/0x2e0
net_rx_action+0x1a8/0x340
__do_softirq+0x2b8/0x480
[...]
[2]
Unable to handle kernel paging request at virtual address 393837363534333a
Oops [#1]
CPU: 72 PID: 0 Comm: swapper/72 Tainted: G W 6.6.0
Trace:
walk_stackframe+0x0/0x190
show_stack+0x70/0x94
die+0x1d4/0x350
do_page_fault+0x630/0x690
entMM+0x120/0x130
napi_pp_put_page+0x30/0x160
skb_release_data+0x164/0x250
kfree_skb_list_reason+0xd0/0x2f0
skb_release_data+0x1f0/0x250
napi_consume_skb+0xa0/0x220
net_rx_action+0x158/0x340
__do_softirq+0x2b8/0x480
irq_exit+0xd4/0x120
do_entInt+0x164/0x520
entInt+0x114/0x120
[...]
Fixes: 069d11465a80 ("net/mlx5e: RX, Enhance legacy Receive Queue memory scheme")
Signed-off-by: Mingrui Cui <mingruic@outlook.com>
---
Changes in v2:
- Add Fixes tag and more details to commit message.
- Target 'net' branch.
- Remove the obsolete WARN_ON() and update related comments.
Link to v1: https://lore.kernel.org/all/MN6PR16MB5450CAF432AE40B2AFA58F61B706A@MN6PR16MB5450.namprd16.prod.outlook.com/
.../net/ethernet/mellanox/mlx5/core/en/params.c | 15 +++++----------
1 file changed, 5 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
index 3cca06a74cf9..00b44da23e00 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
@@ -666,7 +666,7 @@ static void mlx5e_rx_compute_wqe_bulk_params(struct mlx5e_params *params,
info->refill_unit = DIV_ROUND_UP(info->wqe_bulk, split_factor);
}
-#define DEFAULT_FRAG_SIZE (2048)
+#define DEFAULT_FRAG_SIZE (PAGE_SIZE / 2)
static int mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
struct mlx5e_params *params,
@@ -756,18 +756,13 @@ static int mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
/* No WQE can start in the middle of a page. */
info->wqe_index_mask = 0;
} else {
- /* PAGE_SIZEs starting from 8192 don't use 2K-sized fragments,
- * because there would be more than MLX5E_MAX_RX_FRAGS of them.
- */
- WARN_ON(PAGE_SIZE != 2 * DEFAULT_FRAG_SIZE);
-
/* Odd number of fragments allows to pack the last fragment of
* the previous WQE and the first fragment of the next WQE into
* the same page.
- * As long as DEFAULT_FRAG_SIZE is 2048, and MLX5E_MAX_RX_FRAGS
- * is 4, the last fragment can be bigger than the rest only if
- * it's the fourth one, so WQEs consisting of 3 fragments will
- * always share a page.
+ * As long as DEFAULT_FRAG_SIZE is (PAGE_SIZE / 2), and
+ * MLX5E_MAX_RX_FRAGS is 4, the last fragment can be bigger than
+ * the rest only if it's the fourth one, so WQEs consisting of 3
+ * fragments will always share a page.
* When a page is shared, WQE bulk size is 2, otherwise just 1.
*/
info->wqe_index_mask = info->num_frags % 2;
--
2.43.0
* Re: [PATCH net v2] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size
2025-09-30 11:33 [PATCH net v2] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size Mingrui Cui
@ 2025-09-30 23:35 ` Jacob Keller
2025-10-15 12:32 ` Mingrui Cui
2025-10-01 13:02 ` Dragos Tatulea
1 sibling, 1 reply; 6+ messages in thread
From: Jacob Keller @ 2025-09-30 23:35 UTC (permalink / raw)
To: Mingrui Cui, dtatulea
Cc: andrew+netdev, davem, edumazet, kuba, leon, linux-kernel,
linux-rdma, mbloch, netdev, pabeni, saeedm, tariqt
On 9/30/2025 4:33 AM, Mingrui Cui wrote:
> When page size is 4K, DEFAULT_FRAG_SIZE of 2048 ensures that with 3
> fragments per WQE, odd-indexed WQEs always share the same page with
> their subsequent WQE, while WQEs consisting of 4 fragments do not.
> However, this relationship does not hold for page sizes of 8K and larger.
> In this case, wqe_index_mask cannot guarantee that newly allocated WQEs
> won't share the same page with old WQEs.
>
> If the last WQE in a bulk processed by mlx5e_post_rx_wqes() shares a
> page with its subsequent WQE, allocating a page for that WQE will
> overwrite mlx5e_frag_page, preventing the original page from being
> recycled. When the next WQE is processed, the newly allocated page will
> be immediately recycled. In the next round, if these two WQEs are
> handled in the same bulk, page_pool_defrag_page() will be called again
> on the page, causing pp_frag_count to become negative[1].
>
> Moreover, this can also lead to memory corruption, as the page may have
> already been returned to the page pool and re-allocated to another WQE.
> And since skb_shared_info is stored at the end of the first fragment,
> its frags->bv_page pointer can be overwritten, leading to an invalid
> memory access when processing the skb[2].
>
> For example, on 8K page size systems (e.g. DEC Alpha) with a ConnectX-4
> Lx MT27710 (MCX4121A-ACA_Ax) NIC and an MTU of 7657 or higher, heavy
> network loads (e.g. iperf) will first trigger a series of WARNINGs[1]
> and eventually crash[2].
>
> Fix this by making DEFAULT_FRAG_SIZE always equal to half of the page
> size.
>
> [1]
> WARNING: CPU: 9 PID: 0 at include/net/page_pool/helpers.h:130
> mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
> CPU: 9 PID: 0 Comm: swapper/9 Tainted: G W 6.6.0
> walk_stackframe+0x0/0x190
> show_stack+0x70/0x94
> dump_stack_lvl+0x98/0xd8
> dump_stack+0x2c/0x48
> __warn+0x1c8/0x220
> warn_slowpath_fmt+0x20c/0x230
> mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
> mlx5e_free_rx_wqes+0xcc/0x120 [mlx5_core]
> mlx5e_post_rx_wqes+0x1f4/0x4e0 [mlx5_core]
> mlx5e_napi_poll+0x1c0/0x8d0 [mlx5_core]
> __napi_poll+0x58/0x2e0
> net_rx_action+0x1a8/0x340
> __do_softirq+0x2b8/0x480
> [...]
>
> [2]
> Unable to handle kernel paging request at virtual address 393837363534333a
> Oops [#1]
> CPU: 72 PID: 0 Comm: swapper/72 Tainted: G W 6.6.0
> Trace:
> walk_stackframe+0x0/0x190
> show_stack+0x70/0x94
> die+0x1d4/0x350
> do_page_fault+0x630/0x690
> entMM+0x120/0x130
> napi_pp_put_page+0x30/0x160
> skb_release_data+0x164/0x250
> kfree_skb_list_reason+0xd0/0x2f0
> skb_release_data+0x1f0/0x250
> napi_consume_skb+0xa0/0x220
> net_rx_action+0x158/0x340
> __do_softirq+0x2b8/0x480
> irq_exit+0xd4/0x120
> do_entInt+0x164/0x520
> entInt+0x114/0x120
> [...]
>
> Fixes: 069d11465a80 ("net/mlx5e: RX, Enhance legacy Receive Queue memory scheme")
> Signed-off-by: Mingrui Cui <mingruic@outlook.com>
> ---
> Changes in v2:
> - Add Fixes tag and more details to commit message.
> - Target 'net' branch.
> - Remove the obsolete WARN_ON() and update related comments.
> Link to v1: https://lore.kernel.org/all/MN6PR16MB5450CAF432AE40B2AFA58F61B706A@MN6PR16MB5450.namprd16.prod.outlook.com/
>
> .../net/ethernet/mellanox/mlx5/core/en/params.c | 15 +++++----------
> 1 file changed, 5 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
> index 3cca06a74cf9..00b44da23e00 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
> @@ -666,7 +666,7 @@ static void mlx5e_rx_compute_wqe_bulk_params(struct mlx5e_params *params,
> info->refill_unit = DIV_ROUND_UP(info->wqe_bulk, split_factor);
> }
>
> -#define DEFAULT_FRAG_SIZE (2048)
> +#define DEFAULT_FRAG_SIZE (PAGE_SIZE / 2)
>
What else does changing this DEFAULT_FRAG_SIZE affect? Presumably we're
allocating larger fragments now for 8K or larger page sizes.
> static int mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
> struct mlx5e_params *params,
> @@ -756,18 +756,13 @@ static int mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
> /* No WQE can start in the middle of a page. */
> info->wqe_index_mask = 0;
> } else {
> - /* PAGE_SIZEs starting from 8192 don't use 2K-sized fragments,
> - * because there would be more than MLX5E_MAX_RX_FRAGS of them.
> - */
> - WARN_ON(PAGE_SIZE != 2 * DEFAULT_FRAG_SIZE);
> -
So previously we would warn, but now we just fix the DEFAULT_FRAG_SIZE
so this warning is redundant.. Ok.
> /* Odd number of fragments allows to pack the last fragment of
> * the previous WQE and the first fragment of the next WQE into
> * the same page.
> - * As long as DEFAULT_FRAG_SIZE is 2048, and MLX5E_MAX_RX_FRAGS
> - * is 4, the last fragment can be bigger than the rest only if
> - * it's the fourth one, so WQEs consisting of 3 fragments will
> - * always share a page.
> + * As long as DEFAULT_FRAG_SIZE is (PAGE_SIZE / 2), and
> + * MLX5E_MAX_RX_FRAGS is 4, the last fragment can be bigger than
> + * the rest only if it's the fourth one, so WQEs consisting of 3
> + * fragments will always share a page.
> * When a page is shared, WQE bulk size is 2, otherwise just 1.
> */
> info->wqe_index_mask = info->num_frags % 2;
Would it be possible to fix the other logic so that it works for a
DEFAULT_FRAG_SIZE of 2k on 8K pages? I guess if there's no negative to
increasing the frag size then this fix makes sense since it is simple.
I guess the noted Fixes commit limits to 4 fragments, but it made some
assumptions that were wrong for 8K pages?
* Re: [PATCH net v2] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size
2025-09-30 11:33 [PATCH net v2] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size Mingrui Cui
2025-09-30 23:35 ` Jacob Keller
@ 2025-10-01 13:02 ` Dragos Tatulea
2025-10-15 13:02 ` Mingrui Cui
1 sibling, 1 reply; 6+ messages in thread
From: Dragos Tatulea @ 2025-10-01 13:02 UTC (permalink / raw)
To: Mingrui Cui
Cc: andrew+netdev, davem, edumazet, kuba, leon, linux-kernel,
linux-rdma, mbloch, netdev, pabeni, saeedm, tariqt
On Tue, Sep 30, 2025 at 07:33:11PM +0800, Mingrui Cui wrote:
> When page size is 4K, DEFAULT_FRAG_SIZE of 2048 ensures that with 3
> fragments per WQE, odd-indexed WQEs always share the same page with
> > their subsequent WQE, while WQEs consisting of 4 fragments do not.
> > However, this relationship does not hold for page sizes of 8K and larger.
> In this case, wqe_index_mask cannot guarantee that newly allocated WQEs
> won't share the same page with old WQEs.
>
> If the last WQE in a bulk processed by mlx5e_post_rx_wqes() shares a
> page with its subsequent WQE, allocating a page for that WQE will
> overwrite mlx5e_frag_page, preventing the original page from being
> recycled. When the next WQE is processed, the newly allocated page will
> be immediately recycled. In the next round, if these two WQEs are
> handled in the same bulk, page_pool_defrag_page() will be called again
> on the page, causing pp_frag_count to become negative[1].
>
> Moreover, this can also lead to memory corruption, as the page may have
> already been returned to the page pool and re-allocated to another WQE.
> And since skb_shared_info is stored at the end of the first fragment,
> its frags->bv_page pointer can be overwritten, leading to an invalid
> memory access when processing the skb[2].
>
> For example, on 8K page size systems (e.g. DEC Alpha) with a ConnectX-4
> > Lx MT27710 (MCX4121A-ACA_Ax) NIC and an MTU of 7657 or higher, heavy
> network loads (e.g. iperf) will first trigger a series of WARNINGs[1]
> and eventually crash[2].
>
> Fix this by making DEFAULT_FRAG_SIZE always equal to half of the page
> size.
>
> [1]
> WARNING: CPU: 9 PID: 0 at include/net/page_pool/helpers.h:130
> mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
> CPU: 9 PID: 0 Comm: swapper/9 Tainted: G W 6.6.0
> walk_stackframe+0x0/0x190
> show_stack+0x70/0x94
> dump_stack_lvl+0x98/0xd8
> dump_stack+0x2c/0x48
> __warn+0x1c8/0x220
> warn_slowpath_fmt+0x20c/0x230
> mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
> mlx5e_free_rx_wqes+0xcc/0x120 [mlx5_core]
> mlx5e_post_rx_wqes+0x1f4/0x4e0 [mlx5_core]
> mlx5e_napi_poll+0x1c0/0x8d0 [mlx5_core]
> __napi_poll+0x58/0x2e0
> net_rx_action+0x1a8/0x340
> __do_softirq+0x2b8/0x480
> [...]
>
> [2]
> Unable to handle kernel paging request at virtual address 393837363534333a
> Oops [#1]
> CPU: 72 PID: 0 Comm: swapper/72 Tainted: G W 6.6.0
> Trace:
> walk_stackframe+0x0/0x190
> show_stack+0x70/0x94
> die+0x1d4/0x350
> do_page_fault+0x630/0x690
> entMM+0x120/0x130
> napi_pp_put_page+0x30/0x160
> skb_release_data+0x164/0x250
> kfree_skb_list_reason+0xd0/0x2f0
> skb_release_data+0x1f0/0x250
> napi_consume_skb+0xa0/0x220
> net_rx_action+0x158/0x340
> __do_softirq+0x2b8/0x480
> irq_exit+0xd4/0x120
> do_entInt+0x164/0x520
> entInt+0x114/0x120
> [...]
>
> Fixes: 069d11465a80 ("net/mlx5e: RX, Enhance legacy Receive Queue memory scheme")
> Signed-off-by: Mingrui Cui <mingruic@outlook.com>
> ---
> Changes in v2:
> - Add Fixes tag and more details to commit message.
> - Target 'net' branch.
> - Remove the obsolete WARN_ON() and update related comments.
> Link to v1: https://lore.kernel.org/all/MN6PR16MB5450CAF432AE40B2AFA58F61B706A@MN6PR16MB5450.namprd16.prod.outlook.com/
>
Thanks for your changes Mingrui.
> .../net/ethernet/mellanox/mlx5/core/en/params.c | 15 +++++----------
> 1 file changed, 5 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
> index 3cca06a74cf9..00b44da23e00 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
> @@ -666,7 +666,7 @@ static void mlx5e_rx_compute_wqe_bulk_params(struct mlx5e_params *params,
> info->refill_unit = DIV_ROUND_UP(info->wqe_bulk, split_factor);
> }
>
> -#define DEFAULT_FRAG_SIZE (2048)
> +#define DEFAULT_FRAG_SIZE (PAGE_SIZE / 2)
>
Reasoning my way through this code as I can't test it on 8K page size:
- 4K page size: nothing changes so looks good.
- 8K page size:
- Smaller MTUs will be using linear SKB, so frags are not used.
Looks good.
- Larger MTUs will have a frag size of 4K. The number of frags is
still below MLX5E_MAX_RX_FRAGS. This is what this patch fixes and
it looks good.
- 16K page size or larger: as max HW MTU is somewhere in the 10-12K range
all MTUs will result in linear SKBs. So looks good. But see below
comment about keeping the warning.
For non-linear XDP, frag_size_max will always be PAGE_SIZE. So this
looks safe in all cases.
XSK with smaller chunk sizes is still an open question. For 16K page
size you could still get into non-linear mode.
One thing to note is that mlx5e_max_nonlinear_mtu() will now depend on
PAGE_SIZE as frag_size_max and first_frag_size_max are now MTU
dependent. It wasn't the case previously. I think this is currently not
an issue as this function is used only by mlx5e_build_rq_frags_info().
> static int mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
> struct mlx5e_params *params,
> @@ -756,18 +756,13 @@ static int mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
> /* No WQE can start in the middle of a page. */
> info->wqe_index_mask = 0;
> } else {
> - /* PAGE_SIZEs starting from 8192 don't use 2K-sized fragments,
> - * because there would be more than MLX5E_MAX_RX_FRAGS of them.
> - */
> - WARN_ON(PAGE_SIZE != 2 * DEFAULT_FRAG_SIZE);
> -
I would still keep a warning when reaching this area with a page size
above 8K just because it was not tested.
> /* Odd number of fragments allows to pack the last fragment of
> * the previous WQE and the first fragment of the next WQE into
> * the same page.
> - * As long as DEFAULT_FRAG_SIZE is 2048, and MLX5E_MAX_RX_FRAGS
> - * is 4, the last fragment can be bigger than the rest only if
> - * it's the fourth one, so WQEs consisting of 3 fragments will
> - * always share a page.
> + * As long as DEFAULT_FRAG_SIZE is (PAGE_SIZE / 2), and
> + * MLX5E_MAX_RX_FRAGS is 4, the last fragment can be bigger than
> + * the rest only if it's the fourth one, so WQEs consisting of 3
> + * fragments will always share a page.
> * When a page is shared, WQE bulk size is 2, otherwise just 1.
> */
> info->wqe_index_mask = info->num_frags % 2;
Have you tried testing this with KASAN on? As your platform is not very
common, we want to make sure that there are no other issues.
Thanks,
Dragos
* Re: [PATCH net v2] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size
2025-09-30 23:35 ` Jacob Keller
@ 2025-10-15 12:32 ` Mingrui Cui
2025-10-15 17:36 ` Jacob Keller
0 siblings, 1 reply; 6+ messages in thread
From: Mingrui Cui @ 2025-10-15 12:32 UTC (permalink / raw)
To: jacob.e.keller
Cc: andrew+netdev, davem, dtatulea, edumazet, kuba, leon,
linux-kernel, linux-rdma, mbloch, mingruic, netdev, pabeni,
saeedm, tariqt
That's all the effects of changing DEFAULT_FRAG_SIZE. DEFAULT_FRAG_SIZE is only
used as the initial value for frag_size_max. It is primarily used to calculate
frag_size and frag_stride in arr of mlx5e_rq_frags_info, representing the actual
data size and the size used for page allocation, respectively. In
mlx5e_init_frags_partition(), an mlx5e_wqe_frag_info is allocated for each
fragment according to its frag_stride, which determines the layout of fragments
on a page.
> > static int mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
> > struct mlx5e_params *params,
> > @@ -756,18 +756,13 @@ static int mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
> > /* No WQE can start in the middle of a page. */
> > info->wqe_index_mask = 0;
> > } else {
> > - /* PAGE_SIZEs starting from 8192 don't use 2K-sized fragments,
> > - * because there would be more than MLX5E_MAX_RX_FRAGS of them.
> > - */
> > - WARN_ON(PAGE_SIZE != 2 * DEFAULT_FRAG_SIZE);
> > -
>
> So previously we would warn, but now we just fix the DEFAULT_FRAG_SIZE
> so this warning is redundant.. Ok.
>
> > /* Odd number of fragments allows to pack the last fragment of
> > * the previous WQE and the first fragment of the next WQE into
> > * the same page.
> > - * As long as DEFAULT_FRAG_SIZE is 2048, and MLX5E_MAX_RX_FRAGS
> > - * is 4, the last fragment can be bigger than the rest only if
> > - * it's the fourth one, so WQEs consisting of 3 fragments will
> > - * always share a page.
> > + * As long as DEFAULT_FRAG_SIZE is (PAGE_SIZE / 2), and
> > + * MLX5E_MAX_RX_FRAGS is 4, the last fragment can be bigger than
> > + * the rest only if it's the fourth one, so WQEs consisting of 3
> > + * fragments will always share a page.
> > * When a page is shared, WQE bulk size is 2, otherwise just 1.
> > */
> > info->wqe_index_mask = info->num_frags % 2;
>
> Would it be possible to fix the other logic so that it works for a
> DEFAULT_FRAG_SIZE of 2k on 8K pages? I guess if there's no negative to
> increasing the frag size then this fix makes sense since it is simple.
To maintain a 2K DEFAULT_FRAG_SIZE on 8K pages, one of two alternatives would be
necessary: either find a method to calculate the occurrence period of
page-aligned WQEs for the current MTU to replace wqe_index_mask, or use more
complex logic to manage fragment allocation and release on shared pages to avoid
conflicts. This would make the page allocation logic for 8K pages significantly
different from the 4K page case. Therefore, I believe directly modifying
DEFAULT_FRAG_SIZE is a cleaner solution.
Please note that frag_size_max is not fixed at 2048. If the current MTU exceeds
the maximum size that 2K fragments can store, frag_size_max will be set to
PAGE_SIZE. Therefore, changing DEFAULT_FRAG_SIZE to PAGE_SIZE/2 should
theoretically be safe. The only downside is a potential for slightly more wasted
space when filling a page.
> I guess the noted fixed commit limits to 4 fragments, but it make some
> assumptions that were wrong for 8K pages?
That's right. Specifically, on 4KB pages, once a small fragment and a 2K
fragment are placed, there is no room for another 2K fragment. This results in
the predictable page-sharing pattern for 3-fragment WQEs that breaks on larger
page sizes.
* Re: [PATCH net v2] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size
2025-10-01 13:02 ` Dragos Tatulea
@ 2025-10-15 13:02 ` Mingrui Cui
0 siblings, 0 replies; 6+ messages in thread
From: Mingrui Cui @ 2025-10-15 13:02 UTC (permalink / raw)
To: dtatulea
Cc: andrew+netdev, davem, edumazet, kuba, leon, linux-kernel,
linux-rdma, mbloch, mingruic, netdev, pabeni, saeedm, tariqt
Hi Dragos,
Thanks again for the feedback.
On Wed, 1 Oct 2025 13:02:02 +0000, Dragos Tatulea wrote:
> On Tue, Sep 30, 2025 at 07:33:11PM +0800, Mingrui Cui wrote:
> > When page size is 4K, DEFAULT_FRAG_SIZE of 2048 ensures that with 3
> > fragments per WQE, odd-indexed WQEs always share the same page with
> > their subsequent WQE, while WQEs consisting of 4 fragments do not.
> > However, this relationship does not hold for page sizes of 8K and larger.
> > In this case, wqe_index_mask cannot guarantee that newly allocated WQEs
> > won't share the same page with old WQEs.
> >
> > If the last WQE in a bulk processed by mlx5e_post_rx_wqes() shares a
> > page with its subsequent WQE, allocating a page for that WQE will
> > overwrite mlx5e_frag_page, preventing the original page from being
> > recycled. When the next WQE is processed, the newly allocated page will
> > be immediately recycled. In the next round, if these two WQEs are
> > handled in the same bulk, page_pool_defrag_page() will be called again
> > on the page, causing pp_frag_count to become negative[1].
> >
> > Moreover, this can also lead to memory corruption, as the page may have
> > already been returned to the page pool and re-allocated to another WQE.
> > And since skb_shared_info is stored at the end of the first fragment,
> > its frags->bv_page pointer can be overwritten, leading to an invalid
> > memory access when processing the skb[2].
> >
> > For example, on 8K page size systems (e.g. DEC Alpha) with a ConnectX-4
> > Lx MT27710 (MCX4121A-ACA_Ax) NIC and an MTU of 7657 or higher, heavy
> > network loads (e.g. iperf) will first trigger a series of WARNINGs[1]
> > and eventually crash[2].
> >
> > Fix this by making DEFAULT_FRAG_SIZE always equal to half of the page
> > size.
> >
> > [1]
> > WARNING: CPU: 9 PID: 0 at include/net/page_pool/helpers.h:130
> > mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
> > CPU: 9 PID: 0 Comm: swapper/9 Tainted: G W 6.6.0
> > walk_stackframe+0x0/0x190
> > show_stack+0x70/0x94
> > dump_stack_lvl+0x98/0xd8
> > dump_stack+0x2c/0x48
> > __warn+0x1c8/0x220
> > warn_slowpath_fmt+0x20c/0x230
> > mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
> > mlx5e_free_rx_wqes+0xcc/0x120 [mlx5_core]
> > mlx5e_post_rx_wqes+0x1f4/0x4e0 [mlx5_core]
> > mlx5e_napi_poll+0x1c0/0x8d0 [mlx5_core]
> > __napi_poll+0x58/0x2e0
> > net_rx_action+0x1a8/0x340
> > __do_softirq+0x2b8/0x480
> > [...]
> >
> > [2]
> > Unable to handle kernel paging request at virtual address 393837363534333a
> > Oops [#1]
> > CPU: 72 PID: 0 Comm: swapper/72 Tainted: G W 6.6.0
> > Trace:
> > walk_stackframe+0x0/0x190
> > show_stack+0x70/0x94
> > die+0x1d4/0x350
> > do_page_fault+0x630/0x690
> > entMM+0x120/0x130
> > napi_pp_put_page+0x30/0x160
> > skb_release_data+0x164/0x250
> > kfree_skb_list_reason+0xd0/0x2f0
> > skb_release_data+0x1f0/0x250
> > napi_consume_skb+0xa0/0x220
> > net_rx_action+0x158/0x340
> > __do_softirq+0x2b8/0x480
> > irq_exit+0xd4/0x120
> > do_entInt+0x164/0x520
> > entInt+0x114/0x120
> > [...]
> >
> > Fixes: 069d11465a80 ("net/mlx5e: RX, Enhance legacy Receive Queue memory scheme")
> > Signed-off-by: Mingrui Cui <mingruic@outlook.com>
> > ---
> > Changes in v2:
> > - Add Fixes tag and more details to commit message.
> > - Target 'net' branch.
> > - Remove the obsolete WARN_ON() and update related comments.
> > Link to v1: https://lore.kernel.org/all/MN6PR16MB5450CAF432AE40B2AFA58F61B706A@MN6PR16MB5450.namprd16.prod.outlook.com/
> >
> Thanks for your changes Mingrui.
>
> > .../net/ethernet/mellanox/mlx5/core/en/params.c | 15 +++++----------
> > 1 file changed, 5 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
> > index 3cca06a74cf9..00b44da23e00 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
> > @@ -666,7 +666,7 @@ static void mlx5e_rx_compute_wqe_bulk_params(struct mlx5e_params *params,
> > info->refill_unit = DIV_ROUND_UP(info->wqe_bulk, split_factor);
> > }
> >
> > -#define DEFAULT_FRAG_SIZE (2048)
> > +#define DEFAULT_FRAG_SIZE (PAGE_SIZE / 2)
> >
> Reasoning my way through this code as I can't test it on 8K page size:
> - 4K page size: nothing changes so looks good.
> - 8K page size:
> - Smaller MTUs will be using linear SKB, so frags are not used.
> Looks good.
> - Larger MTUs will have a frag size of 4K. The number of frags is
> still below MLX5E_MAX_RX_FRAGS. This is what this patch fixes and
> it looks good.
> - 16K page size or larger: as max HW MTU is somewhere in the 10-12K range
> all MTUs will result in linear SKBs. So looks good. But see below
> comment about keeping the warning.
>
> For non-linear XDP, frag_size_max will always be PAGE_SIZE. So this
> looks safe in all cases.
>
> XSK with smaller chunk sizes is still an open question. For 16K page
> size you could still get into non-linear mode.
>
> One thing to note is that mlx5e_max_nonlinear_mtu() will now depend on
> PAGE_SIZE as frag_size_max and first_frag_size_max are now MTU
> dependent. It wasn't the case previously. I think this is currently not
> an issue as this function is used only by mlx5e_build_rq_frags_info().
>
> > static int mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
> > struct mlx5e_params *params,
> > @@ -756,18 +756,13 @@ static int mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
> > /* No WQE can start in the middle of a page. */
> > info->wqe_index_mask = 0;
> > } else {
> > - /* PAGE_SIZEs starting from 8192 don't use 2K-sized fragments,
> > - * because there would be more than MLX5E_MAX_RX_FRAGS of them.
> > - */
> > - WARN_ON(PAGE_SIZE != 2 * DEFAULT_FRAG_SIZE);
> > -
> I would still keep a warning when reaching this area with a page size
> above 8K just because it was not tested.
Understood. I will add WARN_ON(PAGE_SIZE > 8192) in the next patch as you suggested.
> > /* Odd number of fragments allows to pack the last fragment of
> > * the previous WQE and the first fragment of the next WQE into
> > * the same page.
> > - * As long as DEFAULT_FRAG_SIZE is 2048, and MLX5E_MAX_RX_FRAGS
> > - * is 4, the last fragment can be bigger than the rest only if
> > - * it's the fourth one, so WQEs consisting of 3 fragments will
> > - * always share a page.
> > + * As long as DEFAULT_FRAG_SIZE is (PAGE_SIZE / 2), and
> > + * MLX5E_MAX_RX_FRAGS is 4, the last fragment can be bigger than
> > + * the rest only if it's the fourth one, so WQEs consisting of 3
> > + * fragments will always share a page.
> > * When a page is shared, WQE bulk size is 2, otherwise just 1.
> > */
> > info->wqe_index_mask = info->num_frags % 2;
>
> Have you tried testing this with KASAN on? As your platform is not very
> common, we want to make sure that there are no other issues.
Unfortunately, the CPU architecture of my test platform does not support KASAN,
so I'm unable to perform a KASAN test.
However, I ran a continuous iperf test on the patched kernel for one week.
During this long-duration test, no crashes, new warnings, or other instabilities
were observed. Do I need other tests to confirm further?
* Re: [PATCH net v2] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size
2025-10-15 12:32 ` Mingrui Cui
@ 2025-10-15 17:36 ` Jacob Keller
0 siblings, 0 replies; 6+ messages in thread
From: Jacob Keller @ 2025-10-15 17:36 UTC (permalink / raw)
To: Mingrui Cui
Cc: andrew+netdev, davem, dtatulea, edumazet, kuba, leon,
linux-kernel, linux-rdma, mbloch, netdev, pabeni, saeedm, tariqt
On 10/15/2025 5:32 AM, Mingrui Cui wrote:
> That's all the effects of changing DEFAULT_FRAG_SIZE. DEFAULT_FRAG_SIZE is only
> used as the initial value for frag_size_max. It is primarily used to calculate
> frag_size and frag_stride in arr of mlx5e_rq_frags_info, representing the actual
> data size and the size used for page allocation, respectively. In
> mlx5e_init_frags_partition(), an mlx5e_wqe_frag_info is allocated for each
> fragment according to its frag_stride, which determines the layout of fragments
> on a page.
Right.
>
>>> static int mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
>>> struct mlx5e_params *params,
>>> @@ -756,18 +756,13 @@ static int mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
>>> /* No WQE can start in the middle of a page. */
>>> info->wqe_index_mask = 0;
>>> } else {
>>> - /* PAGE_SIZEs starting from 8192 don't use 2K-sized fragments,
>>> - * because there would be more than MLX5E_MAX_RX_FRAGS of them.
>>> - */
>>> - WARN_ON(PAGE_SIZE != 2 * DEFAULT_FRAG_SIZE);
>>> -
>>
>> So previously we would warn, but now we just fix the DEFAULT_FRAG_SIZE
>> so this warning is redundant.. Ok.
>>
>>> /* Odd number of fragments allows to pack the last fragment of
>>> * the previous WQE and the first fragment of the next WQE into
>>> * the same page.
>>> - * As long as DEFAULT_FRAG_SIZE is 2048, and MLX5E_MAX_RX_FRAGS
>>> - * is 4, the last fragment can be bigger than the rest only if
>>> - * it's the fourth one, so WQEs consisting of 3 fragments will
>>> - * always share a page.
>>> + * As long as DEFAULT_FRAG_SIZE is (PAGE_SIZE / 2), and
>>> + * MLX5E_MAX_RX_FRAGS is 4, the last fragment can be bigger than
>>> + * the rest only if it's the fourth one, so WQEs consisting of 3
>>> + * fragments will always share a page.
>>> * When a page is shared, WQE bulk size is 2, otherwise just 1.
>>> */
>>> info->wqe_index_mask = info->num_frags % 2;
>>
>> Would it be possible to fix the other logic so that it works for a
>> DEFAULT_FRAG_SIZE of 2k on 8K pages? I guess if there's no negative to
>> increasing the frag size then this fix makes sense since it is simple.
>
> To maintain 2K DEFAULT_FRAG_SIZE on 8K pages, one of two alternatives would be
> necessary: either find a method to calculate the occurrence period of
> page-aligned WQEs for the current MTU to replace wqe_index_mask, or use a more
> complex logic to manage fragment allocation and release on shared pages to avoid
> conflicts. This would make the page allocation logic for 8K pages significantly
> different from the 4K page case. Therefore, I believe directly modifying
> DEFAULT_FRAG_SIZE is a cleaner solution.
>
Makes sense. The cost of the more complex logic doesn't seem worth it.
> Please note that frag_size_max is not fixed at 2048. If the current MTU exceeds
> the maximum size that 2K fragments can store, frag_size_max will be set to
> PAGE_SIZE. Therefore, changing DEFAULT_FRAG_SIZE to PAGE_SIZE/2 should
> theoretically be safe. The only downside is a potential for slightly more wasted
> space when filling a page.
Ok.
>
>> I guess the noted Fixes commit limits to 4 fragments, but it made some
>> assumptions that were wrong for 8K pages?
>
> That's right. Specifically, on 4KB pages, once a small fragment and a 2K
> fragment are placed, there is no room for another 2K fragment. This results in
> the predictable page-sharing pattern for 3-fragment WQEs that breaks on larger
> page sizes.
Right.
Thanks for the explanation and answers; they help in understanding the logic and
solution.
For the patch:
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Thread overview: 6+ messages
2025-09-30 11:33 [PATCH net v2] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size Mingrui Cui
2025-09-30 23:35 ` Jacob Keller
2025-10-15 12:32 ` Mingrui Cui
2025-10-15 17:36 ` Jacob Keller
2025-10-01 13:02 ` Dragos Tatulea
2025-10-15 13:02 ` Mingrui Cui