* [PATCH] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size @ 2025-09-02 13:00 Mingrui Cui 2025-09-02 18:56 ` Saeed Mahameed 2025-09-04 16:31 ` Dragos Tatulea 0 siblings, 2 replies; 8+ messages in thread From: Mingrui Cui @ 2025-09-02 13:00 UTC (permalink / raw) To: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch, Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni Cc: Mingrui Cui, netdev, linux-rdma, linux-kernel When page size is 4K, DEFAULT_FRAG_SIZE of 2048 ensures that with 3 fragments per WQE, odd-indexed WQEs always share the same page with their subsequent WQE. However, this relationship does not hold for page sizes larger than 8K. In this case, wqe_index_mask cannot guarantee that newly allocated WQEs won't share the same page with old WQEs. If the last WQE in a bulk processed by mlx5e_post_rx_wqes() shares a page with its subsequent WQE, allocating a page for that WQE will overwrite mlx5e_frag_page, preventing the original page from being recycled. When the next WQE is processed, the newly allocated page will be immediately recycled. In the next round, if these two WQEs are handled in the same bulk, page_pool_defrag_page() will be called again on the page, causing pp_frag_count to become negative. Fix this by making DEFAULT_FRAG_SIZE always equal to half of the page size. Signed-off-by: Mingrui Cui <mingruic@outlook.com> --- drivers/net/ethernet/mellanox/mlx5/core/en/params.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c index 3cca06a74cf9..d96a3cbea23c 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c @@ -666,7 +666,7 @@ static void mlx5e_rx_compute_wqe_bulk_params(struct mlx5e_params *params, info->refill_unit = DIV_ROUND_UP(info->wqe_bulk, split_factor); } -#define DEFAULT_FRAG_SIZE (2048) +#define DEFAULT_FRAG_SIZE (PAGE_SIZE / 2) static int mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev, struct mlx5e_params *params, -- 2.43.0 ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size 2025-09-02 13:00 [PATCH] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size Mingrui Cui @ 2025-09-02 18:56 ` Saeed Mahameed 2025-09-03 14:59 ` Mingrui Cui 2025-09-04 16:31 ` Dragos Tatulea 1 sibling, 1 reply; 8+ messages in thread From: Saeed Mahameed @ 2025-09-02 18:56 UTC (permalink / raw) To: Mingrui Cui Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch, Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev, linux-rdma, linux-kernel On 02 Sep 21:00, Mingrui Cui wrote: >When page size is 4K, DEFAULT_FRAG_SIZE of 2048 ensures that with 3 >fragments per WQE, odd-indexed WQEs always share the same page with >their subsequent WQE. However, this relationship does not hold for page >sizes larger than 8K. In this case, wqe_index_mask cannot guarantee that >newly allocated WQEs won't share the same page with old WQEs. > >If the last WQE in a bulk processed by mlx5e_post_rx_wqes() shares a >page with its subsequent WQE, allocating a page for that WQE will >overwrite mlx5e_frag_page, preventing the original page from being >recycled. When the next WQE is processed, the newly allocated page will >be immediately recycled. > >In the next round, if these two WQEs are handled in the same bulk, >page_pool_defrag_page() will be called again on the page, causing >pp_frag_count to become negative. > >Fix this by making DEFAULT_FRAG_SIZE always equal to half of the page >size. > >Signed-off-by: Mingrui Cui <mingruic@outlook.com> CC: Dragos Tatulea <dtatulea@nvidia.com> Dragos is on a mission to improve page_size support in mlx5. Dragos, please look into this, I am not sure making DEFAULT_FRAG_SIZE dependant on PAGE_SIZE is the correct way to go, see mlx5e_build_rq_frags_info() I believe we don't do page flipping for > 4k pages, but I might be wrong, anyway the code also should throw a warn_on: /* The last fragment of WQE with index 2*N may share the page with the * first fragment of WQE with index 2*N+1 in certain cases. If WQE * 2*N+1 * is not completed yet, WQE 2*N must not be allocated, as it's * responsible for allocating a new page. */ if (frag_size_max == PAGE_SIZE) { /* No WQE can start in the middle of a page. */ info->wqe_index_mask = 0; } else { /* PAGE_SIZEs starting from 8192 don't use 2K-sized fragments, * because there would be more than MLX5E_MAX_RX_FRAGS of * them. */ WARN_ON(PAGE_SIZE != 2 * DEFAULT_FRAG_SIZE); /* Odd number of fragments allows to pack the last fragment of * the previous WQE and the first fragment of the next WQE * into * the same page. * As long as DEFAULT_FRAG_SIZE is 2048, and * MLX5E_MAX_RX_FRAGS * is 4, the last fragment can be bigger than the rest only * if * it's the fourth one, so WQEs consisting of 3 fragments * will * always share a page. * When a page is shared, WQE bulk size is 2, otherwise * just 1. */ info->wqe_index_mask = info->num_frags % 2; } Looking at the above makes me think that this patch is correct, but a more careful look is needed to be taken, a Fixes tag is also required and target 'net' branch. Thanks, Saeed. >--- > drivers/net/ethernet/mellanox/mlx5/core/en/params.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > >diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c >index 3cca06a74cf9..d96a3cbea23c 100644 >--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c >+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c >@@ -666,7 +666,7 @@ static void mlx5e_rx_compute_wqe_bulk_params(struct mlx5e_params *params, > info->refill_unit = DIV_ROUND_UP(info->wqe_bulk, split_factor); > } > >-#define DEFAULT_FRAG_SIZE (2048) >+#define DEFAULT_FRAG_SIZE (PAGE_SIZE / 2) > > static int mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev, > struct mlx5e_params *params, >-- >2.43.0 > > ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size 2025-09-02 18:56 ` Saeed Mahameed @ 2025-09-03 14:59 ` Mingrui Cui 0 siblings, 0 replies; 8+ messages in thread From: Mingrui Cui @ 2025-09-03 14:59 UTC (permalink / raw) To: Saeed Mahameed, Dragos Tatulea Cc: Mingrui Cui, Leon Romanovsky, Tariq Toukan, Mark Bloch, Saeed Mahameed, Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev, linux-rdma, linux-kernel > On 02 Sep 21:00, Mingrui Cui wrote: > >When page size is 4K, DEFAULT_FRAG_SIZE of 2048 ensures that with 3 > >fragments per WQE, odd-indexed WQEs always share the same page with > >their subsequent WQE. However, this relationship does not hold for page > >sizes larger than 8K. In this case, wqe_index_mask cannot guarantee that > >newly allocated WQEs won't share the same page with old WQEs. > > > >If the last WQE in a bulk processed by mlx5e_post_rx_wqes() shares a > >page with its subsequent WQE, allocating a page for that WQE will > >overwrite mlx5e_frag_page, preventing the original page from being > >recycled. When the next WQE is processed, the newly allocated page will > >be immediately recycled. > > > >In the next round, if these two WQEs are handled in the same bulk, > >page_pool_defrag_page() will be called again on the page, causing > >pp_frag_count to become negative. > > > >Fix this by making DEFAULT_FRAG_SIZE always equal to half of the page > >size. > > > >Signed-off-by: Mingrui Cui <mingruic@outlook.com> > CC: Dragos Tatulea <dtatulea@nvidia.com> > > Dragos is on a mission to improve page_size support in mlx5. > > Dragos, please look into this, I am not sure making DEFAULT_FRAG_SIZE > dependant on PAGE_SIZE is the correct way to go, > see mlx5e_build_rq_frags_info() > > I believe we don't do page flipping for > 4k pages, but I might be wrong, > anyway the code also should throw a warn_on: From what I saw, the handling for > 4K pages should be no different from 4K pages. > > /* The last fragment of WQE with index 2*N may share the page with the > * first fragment of WQE with index 2*N+1 in certain cases. If WQE > * 2*N+1 > * is not completed yet, WQE 2*N must not be allocated, as it's > * responsible for allocating a new page. > */ > if (frag_size_max == PAGE_SIZE) { > /* No WQE can start in the middle of a page. */ > info->wqe_index_mask = 0; > } else { > /* PAGE_SIZEs starting from 8192 don't use 2K-sized fragments, > * because there would be more than MLX5E_MAX_RX_FRAGS of > * them. > */ > WARN_ON(PAGE_SIZE != 2 * DEFAULT_FRAG_SIZE); > /* Odd number of fragments allows to pack the last fragment of > * the previous WQE and the first fragment of the next WQE > * into > * the same page. > * As long as DEFAULT_FRAG_SIZE is 2048, and > * MLX5E_MAX_RX_FRAGS > * is 4, the last fragment can be bigger than the rest only > * if > * it's the fourth one, so WQEs consisting of 3 fragments > * will > * always share a page. > * When a page is shared, WQE bulk size is 2, otherwise > * just 1. > */ > info->wqe_index_mask = info->num_frags % 2; > } > > Looking at the above makes me think that this patch is correct, but a more > careful look is needed to be taken, a Fixes tag is also required and target > 'net' branch. > > Thanks, > Saeed. Okay, I will prepare and send a v2 patch next. I think WARN_ON() should also be removed and surrounding comments updated accordingly. Thanks. > > >--- > > drivers/net/ethernet/mellanox/mlx5/core/en/params.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > >diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c > >index 3cca06a74cf9..d96a3cbea23c 100644 > >--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c > >+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c > >@@ -666,7 +666,7 @@ static void mlx5e_rx_compute_wqe_bulk_params(struct mlx5e_params *params, > > info->refill_unit = DIV_ROUND_UP(info->wqe_bulk, split_factor); > > } > > > >-#define DEFAULT_FRAG_SIZE (2048) > >+#define DEFAULT_FRAG_SIZE (PAGE_SIZE / 2) > > > > static int mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev, > > struct mlx5e_params *params, > >-- > >2.43.0 > > > > ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size 2025-09-02 13:00 [PATCH] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size Mingrui Cui 2025-09-02 18:56 ` Saeed Mahameed @ 2025-09-04 16:31 ` Dragos Tatulea 2025-09-08 13:35 ` Mingrui Cui 1 sibling, 1 reply; 8+ messages in thread From: Dragos Tatulea @ 2025-09-04 16:31 UTC (permalink / raw) To: Mingrui Cui, Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch, Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni Cc: netdev, linux-rdma, linux-kernel On Tue, Sep 02, 2025 at 09:00:16PM +0800, Mingrui Cui wrote: > When page size is 4K, DEFAULT_FRAG_SIZE of 2048 ensures that with 3 > fragments per WQE, odd-indexed WQEs always share the same page with > their subsequent WQE. However, this relationship does not hold for page > sizes larger than 8K. In this case, wqe_index_mask cannot guarantee that > newly allocated WQEs won't share the same page with old WQEs. > > If the last WQE in a bulk processed by mlx5e_post_rx_wqes() shares a > page with its subsequent WQE, allocating a page for that WQE will > overwrite mlx5e_frag_page, preventing the original page from being > recycled. When the next WQE is processed, the newly allocated page will > be immediately recycled. > > In the next round, if these two WQEs are handled in the same bulk, > page_pool_defrag_page() will be called again on the page, causing > pp_frag_count to become negative. > > Fix this by making DEFAULT_FRAG_SIZE always equal to half of the page > size. > Was there an actual encountered issue or is this a code clarity fix? For 64K page size, linear mode will be used so the constant will not be used for calculating the frag size. Thanks, Dragos ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size 2025-09-04 16:31 ` Dragos Tatulea @ 2025-09-08 13:35 ` Mingrui Cui 2025-09-08 14:25 ` Dragos Tatulea 0 siblings, 1 reply; 8+ messages in thread From: Mingrui Cui @ 2025-09-08 13:35 UTC (permalink / raw) To: dtatulea Cc: andrew+netdev, davem, edumazet, kuba, leon, linux-kernel, linux-rdma, mbloch, mingruic, netdev, pabeni, saeedm, tariqt > On Tue, Sep 02, 2025 at 09:00:16PM +0800, Mingrui Cui wrote: > > When page size is 4K, DEFAULT_FRAG_SIZE of 2048 ensures that with 3 > > fragments per WQE, odd-indexed WQEs always share the same page with > > their subsequent WQE. However, this relationship does not hold for page > > sizes larger than 8K. In this case, wqe_index_mask cannot guarantee that > > newly allocated WQEs won't share the same page with old WQEs. > > > > If the last WQE in a bulk processed by mlx5e_post_rx_wqes() shares a > > page with its subsequent WQE, allocating a page for that WQE will > > overwrite mlx5e_frag_page, preventing the original page from being > > recycled. When the next WQE is processed, the newly allocated page will > > be immediately recycled. > > > > In the next round, if these two WQEs are handled in the same bulk, > > page_pool_defrag_page() will be called again on the page, causing > > pp_frag_count to become negative. > > > > Fix this by making DEFAULT_FRAG_SIZE always equal to half of the page > > size. > > > Was there an actual encountered issue or is this a code clarity fix? > > For 64K page size, linear mode will be used so the constant will not be > used for calculating the frag size. > > Thanks, > Dragos Yes, this was an actual issue we encountered that caused a kernel crash. We found it on a server with a DEC-Alpha like processor, which uses 8KB page size and runs a custom-built kernel. When using a ConnectX-4 Lx MT27710 (MCX4121A-ACA_Ax) NIC with the MTU set to 7657 or higher, the kernel would crash during heavy traffic (e.g., iperf test). Here's the kernel log: WARNING: CPU: 9 PID: 0 at include/net/page_pool/helpers.h:130 mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core] Modules linked in: ib_umad ib_ipoib ib_cm mlx5_ib ib_uverbs ib_core ipv6 mlx5_core tls CPU: 9 PID: 0 Comm: swapper/9 Tainted: G W 6.6.0 #23 walk_stackframe+0x0/0x190 show_stack+0x70/0x94 dump_stack_lvl+0x98/0xd8 dump_stack+0x2c/0x48 __warn+0x1c8/0x220 warn_slowpath_fmt+0x20c/0x230 mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core] mlx5e_free_rx_wqes+0xcc/0x120 [mlx5_core] mlx5e_post_rx_wqes+0x1f4/0x4e0 [mlx5_core] mlx5e_napi_poll+0x1c0/0x8d0 [mlx5_core] __napi_poll+0x58/0x2e0 net_rx_action+0x1a8/0x340 __do_softirq+0x2b8/0x480 irq_exit+0xd4/0x120 do_entInt+0x164/0x520 entInt+0x114/0x120 __idle_end+0x0/0x50 default_idle_call+0x64/0x150 do_idle+0x10c/0x240 cpu_startup_entry+0x70/0x80 smp_callin+0x354/0x410 __smp_callin+0x3c/0x40 Although this was on a custom kernel and processor, I believe this issue is generic to any system using an 8KB page size. Unfortunately, I don't have an Alpha server running a mainline kernel to verify this directly, and most mainstream architectures don't support 8KB page size. I also tried to modify some conditions in the driver to force it to fall back into non-linear mode on an ARMv8 server configured with a 16KB page size, and was then able to trigger the same warning and crash. So I suspect this issue would also occur on 16KB page size if the NIC can be configured with a larger MTU. Best regards, Mingrui Cui ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size 2025-09-08 13:35 ` Mingrui Cui @ 2025-09-08 14:25 ` Dragos Tatulea 2025-09-15 13:28 ` Dragos Tatulea 0 siblings, 1 reply; 8+ messages in thread From: Dragos Tatulea @ 2025-09-08 14:25 UTC (permalink / raw) To: Mingrui Cui Cc: andrew+netdev, davem, edumazet, kuba, leon, linux-kernel, linux-rdma, mbloch, netdev, pabeni, saeedm, tariqt On Mon, Sep 08, 2025 at 09:35:32PM +0800, Mingrui Cui wrote: > > On Tue, Sep 02, 2025 at 09:00:16PM +0800, Mingrui Cui wrote: > > > When page size is 4K, DEFAULT_FRAG_SIZE of 2048 ensures that with 3 > > > fragments per WQE, odd-indexed WQEs always share the same page with > > > their subsequent WQE. However, this relationship does not hold for page > > > sizes larger than 8K. In this case, wqe_index_mask cannot guarantee that > > > newly allocated WQEs won't share the same page with old WQEs. > > > > > > If the last WQE in a bulk processed by mlx5e_post_rx_wqes() shares a > > > page with its subsequent WQE, allocating a page for that WQE will > > > overwrite mlx5e_frag_page, preventing the original page from being > > > recycled. When the next WQE is processed, the newly allocated page will > > > be immediately recycled. > > > > > > In the next round, if these two WQEs are handled in the same bulk, > > > page_pool_defrag_page() will be called again on the page, causing > > > pp_frag_count to become negative. > > > > > > Fix this by making DEFAULT_FRAG_SIZE always equal to half of the page > > > size. > > > > > Was there an actual encountered issue or is this a code clarity fix? > > > > For 64K page size, linear mode will be used so the constant will not be > > used for calculating the frag size. > > > > Thanks, > > Dragos > > Yes, this was an actual issue we encountered that caused a kernel crash. > > We found it on a server with a DEC-Alpha like processor, which uses 8KB page > size and runs a custom-built kernel. When using a ConnectX-4 Lx MT27710 > (MCX4121A-ACA_Ax) NIC with the MTU set to 7657 or higher, the kernel would crash > during heavy traffic (e.g., iperf test). Here's the kernel log: > > WARNING: CPU: 9 PID: 0 at include/net/page_pool/helpers.h:130 > mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core] > Modules linked in: ib_umad ib_ipoib ib_cm mlx5_ib ib_uverbs ib_core ipv6 > mlx5_core tls > CPU: 9 PID: 0 Comm: swapper/9 Tainted: G W 6.6.0 #23 > walk_stackframe+0x0/0x190 > show_stack+0x70/0x94 > dump_stack_lvl+0x98/0xd8 > dump_stack+0x2c/0x48 > __warn+0x1c8/0x220 > warn_slowpath_fmt+0x20c/0x230 > mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core] > mlx5e_free_rx_wqes+0xcc/0x120 [mlx5_core] > mlx5e_post_rx_wqes+0x1f4/0x4e0 [mlx5_core] > mlx5e_napi_poll+0x1c0/0x8d0 [mlx5_core] > __napi_poll+0x58/0x2e0 > net_rx_action+0x1a8/0x340 > __do_softirq+0x2b8/0x480 > irq_exit+0xd4/0x120 > do_entInt+0x164/0x520 > entInt+0x114/0x120 > __idle_end+0x0/0x50 > default_idle_call+0x64/0x150 > do_idle+0x10c/0x240 > cpu_startup_entry+0x70/0x80 > smp_callin+0x354/0x410 > __smp_callin+0x3c/0x40 > > Although this was on a custom kernel and processor, I believe this issue is > generic to any system using an 8KB page size. Unfortunately, I don't have an > Alpha server running a mainline kernel to verify this directly, and most > mainstream architectures don't support 8KB page size. > Oh, I see. Thanks for the note. I had issues finding any arch that supports 8K page size. The information above would be useful in the commit message as well. Also, you need a fixes tag for net patches. Probably this one: Fixes: 069d11465a80 ("net/mlx5e: RX, Enhance legacy Receive Queue memory scheme") Thanks, Dragos ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size 2025-09-08 14:25 ` Dragos Tatulea @ 2025-09-15 13:28 ` Dragos Tatulea 2025-09-16 9:01 ` Mingrui Cui 0 siblings, 1 reply; 8+ messages in thread From: Dragos Tatulea @ 2025-09-15 13:28 UTC (permalink / raw) To: Mingrui Cui Cc: andrew+netdev, davem, edumazet, kuba, leon, linux-kernel, linux-rdma, mbloch, netdev, pabeni, saeedm, tariqt On Mon, Sep 08, 2025 at 02:25:48PM +0000, Dragos Tatulea wrote: > On Mon, Sep 08, 2025 at 09:35:32PM +0800, Mingrui Cui wrote: > > > On Tue, Sep 02, 2025 at 09:00:16PM +0800, Mingrui Cui wrote: > > > > When page size is 4K, DEFAULT_FRAG_SIZE of 2048 ensures that with 3 > > > > fragments per WQE, odd-indexed WQEs always share the same page with > > > > their subsequent WQE. However, this relationship does not hold for page > > > > sizes larger than 8K. In this case, wqe_index_mask cannot guarantee that > > > > newly allocated WQEs won't share the same page with old WQEs. > > > > > > > > If the last WQE in a bulk processed by mlx5e_post_rx_wqes() shares a > > > > page with its subsequent WQE, allocating a page for that WQE will > > > > overwrite mlx5e_frag_page, preventing the original page from being > > > > recycled. When the next WQE is processed, the newly allocated page will > > > > be immediately recycled. > > > > > > > > In the next round, if these two WQEs are handled in the same bulk, > > > > page_pool_defrag_page() will be called again on the page, causing > > > > pp_frag_count to become negative. > > > > > > > > Fix this by making DEFAULT_FRAG_SIZE always equal to half of the page > > > > size. > > > > > > > Was there an actual encountered issue or is this a code clarity fix? > > > > > > For 64K page size, linear mode will be used so the constant will not be > > > used for calculating the frag size. > > > > > > Thanks, > > > Dragos > > > > Yes, this was an actual issue we encountered that caused a kernel crash. > > > > We found it on a server with a DEC-Alpha like processor, which uses 8KB page > > size and runs a custom-built kernel. When using a ConnectX-4 Lx MT27710 > > (MCX4121A-ACA_Ax) NIC with the MTU set to 7657 or higher, the kernel would crash > > during heavy traffic (e.g., iperf test). Here's the kernel log: > > Tariq and I had a closer look at mlx5e_build_rq_frags_info() and noticed that for the given MTU (7657) you should have seen the WARN_ON() from [1]. Unless you are running XDP or a higher MTU in which case frag_size_max was reset to PAGE_SIZE [2]. Did you observe this warning? [1] https://elixir.bootlin.com/linux/v6.17-rc5/source/drivers/net/ethernet/mellanox/mlx5/core/en/params.c#L762 [2] https://elixir.bootlin.com/linux/v6.17-rc5/source/drivers/net/ethernet/mellanox/mlx5/core/en/params.c#L710 Thanks, Dragos ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size 2025-09-15 13:28 ` Dragos Tatulea @ 2025-09-16 9:01 ` Mingrui Cui 0 siblings, 0 replies; 8+ messages in thread From: Mingrui Cui @ 2025-09-16 9:01 UTC (permalink / raw) To: dtatulea Cc: andrew+netdev, davem, edumazet, kuba, leon, linux-kernel, linux-rdma, mbloch, mingruic, netdev, pabeni, saeedm, tariqt On Mon, Sep 15, 2025 at 01:28:11PM +0000, Dragos Tatulea wrote: > On Mon, Sep 08, 2025 at 02:25:48PM +0000, Dragos Tatulea wrote: > > On Mon, Sep 08, 2025 at 09:35:32PM +0800, Mingrui Cui wrote: > > > > On Tue, Sep 02, 2025 at 09:00:16PM +0800, Mingrui Cui wrote: > > > > > When page size is 4K, DEFAULT_FRAG_SIZE of 2048 ensures that with 3 > > > > > fragments per WQE, odd-indexed WQEs always share the same page with > > > > > their subsequent WQE. However, this relationship does not hold for page > > > > > sizes larger than 8K. In this case, wqe_index_mask cannot guarantee that > > > > > newly allocated WQEs won't share the same page with old WQEs. > > > > > > > > > > If the last WQE in a bulk processed by mlx5e_post_rx_wqes() shares a > > > > > page with its subsequent WQE, allocating a page for that WQE will > > > > > overwrite mlx5e_frag_page, preventing the original page from being > > > > > recycled. When the next WQE is processed, the newly allocated page will > > > > > be immediately recycled. > > > > > > > > > > In the next round, if these two WQEs are handled in the same bulk, > > > > > page_pool_defrag_page() will be called again on the page, causing > > > > > pp_frag_count to become negative. > > > > > > > > > > Fix this by making DEFAULT_FRAG_SIZE always equal to half of the page > > > > > size. > > > > > > > > > Was there an actual encountered issue or is this a code clarity fix? > > > > > > > > For 64K page size, linear mode will be used so the constant will not be > > > > used for calculating the frag size. > > > > > > > > Thanks, > > > > Dragos > > > > > > Yes, this was an actual issue we encountered that caused a kernel crash. > > > > > > We found it on a server with a DEC-Alpha like processor, which uses 8KB page > > > size and runs a custom-built kernel. When using a ConnectX-4 Lx MT27710 > > > (MCX4121A-ACA_Ax) NIC with the MTU set to 7657 or higher, the kernel would crash > > > during heavy traffic (e.g., iperf test). Here's the kernel log: > > > > Tariq and I had a closer look at mlx5e_build_rq_frags_info() and noticed > that for the given MTU (7657) you should have seen the WARN_ON() from > [1]. Unless you are running XDP or a higher MTU in which case > frag_size_max was reset to PAGE_SIZE [2]. Did you observe this warning? > > [1] https://elixir.bootlin.com/linux/v6.17-rc5/source/drivers/net/ethernet/mellanox/mlx5/core/en/params.c#L762 > [2] https://elixir.bootlin.com/linux/v6.17-rc5/source/drivers/net/ethernet/mellanox/mlx5/core/en/params.c#L710 Yes, that WARN_ON() is triggered when setting MTU to 7657 above. Here is the log: WARNING: CPU: 129 PID: 4368 at drivers/net/ethernet/mellanox/mlx5/core/en/params.c:824 mlx5e_build_rq_param+0x25c/0x1050 [mlx5_core] Modules linked in: ib_umad ib_ipoib ib_cm mlx5_ib ib_uverbs ib_core uio_pdrv_genirq dm_mod sch_fq_codel mlx5_core tls efivarfs ipv6 CPU: 129 PID: 4368 Comm: ifconfig Not tainted 6.6.0 #23 Trace: walk_stackframe+0x0/0x190 show_stack+0x70/0x94 dump_stack_lvl+0x98/0xd8 dump_stack+0x2c/0x48 __warn+0x1c8/0x220 warn_slowpath_fmt+0x20c/0x230 mlx5e_build_rq_param+0x25c/0x1050 [mlx5_core] mlx5e_build_channel_param+0x60/0x6d0 [mlx5_core] mlx5e_open_channels+0xc8/0x1400 [mlx5_core] mlx5e_safe_switch_params+0xe0/0x1c0 [mlx5_core] mlx5e_change_mtu+0x13c/0x390 [mlx5_core] mlx5e_change_nic_mtu+0x38/0x60 [mlx5_core] dev_set_mtu_ext+0x12c/0x270 dev_set_mtu+0x6c/0xf0 dev_ifsioc+0x6d0/0x740 dev_ioctl+0x54c/0x770 sock_ioctl+0x368/0x4e0 sys_ioctl+0x610/0xec0 do_entSys+0xbc/0x1d0 entSys+0x12c/0x130 I plan to remove this WARN_ON in v2 patch, as it becomes obsolete after changing DEFAULT_FRAG_SIZE. Thanks, Mingrui Cui ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2025-09-16 9:02 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-09-02 13:00 [PATCH] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size Mingrui Cui 2025-09-02 18:56 ` Saeed Mahameed 2025-09-03 14:59 ` Mingrui Cui 2025-09-04 16:31 ` Dragos Tatulea 2025-09-08 13:35 ` Mingrui Cui 2025-09-08 14:25 ` Dragos Tatulea 2025-09-15 13:28 ` Dragos Tatulea 2025-09-16 9:01 ` Mingrui Cui
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox