* Re: [PATCH net-next v4 2/2] net/mlx5: Avoid copying payload to the skb's linear part
2025-08-29 3:36 ` [PATCH net-next v4 2/2] net/mlx5: Avoid copying payload to the skb's linear part Christoph Paasch via B4 Relay
@ 2025-08-29 16:34 ` Eric Dumazet
2025-08-29 22:39 ` Saeed Mahameed
2025-09-03 23:38 ` Amery Hung
2 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2025-08-29 16:34 UTC (permalink / raw)
To: cpaasch
Cc: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan,
Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller,
Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
netdev, linux-rdma, bpf
On Thu, Aug 28, 2025 at 8:36 PM Christoph Paasch via B4 Relay
<devnull+cpaasch.openai.com@kernel.org> wrote:
>
> From: Christoph Paasch <cpaasch@openai.com>
>
> mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
> bytes from the page-pool to the skb's linear part. Those 256 bytes
> include part of the payload.
>
> When attempting to do GRO in skb_gro_receive, if headlen > data_offset
> (and skb->head_frag is not set), we end up aggregating packets in the
> frag_list.
>
> This is of course not good when we are CPU-limited. It also causes a
> worse skb->len/truesize ratio.
>
> So, let's avoid copying parts of the payload to the linear part. We use
> eth_get_headlen() to parse the headers and compute the length of the
> protocol headers, which will be used to copy the relevant bits to the
> skb's linear part.
>
> We still allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking
> stack needs to call pskb_may_pull() later on, we don't need to reallocate
> memory.
>
> This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
> LRO enabled):
>
> BEFORE:
> =======
> (netserver pinned to core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> 87380 16384 262144 60.01 32547.82
>
> (netserver pinned to adjacent core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> 87380 16384 262144 60.00 52531.67
>
> AFTER:
> ======
> (netserver pinned to core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> 87380 16384 262144 60.00 52896.06
>
> (netserver pinned to adjacent core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> 87380 16384 262144 60.00 85094.90
>
> Additional tests across a larger range of parameters w/ and w/o LRO, w/
> and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different
> TCP read/write-sizes as well as UDP benchmarks, all have shown equal or
> better performance with this patch.
>
> Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> ---
Reviewed-by: Eric Dumazet <edumazet@google.com>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH net-next v4 2/2] net/mlx5: Avoid copying payload to the skb's linear part
2025-08-29 3:36 ` [PATCH net-next v4 2/2] net/mlx5: Avoid copying payload to the skb's linear part Christoph Paasch via B4 Relay
2025-08-29 16:34 ` Eric Dumazet
@ 2025-08-29 22:39 ` Saeed Mahameed
2025-09-03 23:38 ` Amery Hung
2 siblings, 0 replies; 16+ messages in thread
From: Saeed Mahameed @ 2025-08-29 22:39 UTC (permalink / raw)
To: cpaasch
Cc: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan,
Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, netdev, linux-rdma, bpf
On 28 Aug 20:36, Christoph Paasch via B4 Relay wrote:
>From: Christoph Paasch <cpaasch@openai.com>
>
>[...]
>
>Signed-off-by: Christoph Paasch <cpaasch@openai.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
* Re: [PATCH net-next v4 2/2] net/mlx5: Avoid copying payload to the skb's linear part
2025-08-29 3:36 ` [PATCH net-next v4 2/2] net/mlx5: Avoid copying payload to the skb's linear part Christoph Paasch via B4 Relay
2025-08-29 16:34 ` Eric Dumazet
2025-08-29 22:39 ` Saeed Mahameed
@ 2025-09-03 23:38 ` Amery Hung
2025-09-03 23:57 ` Christoph Paasch
2 siblings, 1 reply; 16+ messages in thread
From: Amery Hung @ 2025-09-03 23:38 UTC (permalink / raw)
To: cpaasch, Gal Pressman, Dragos Tatulea, Saeed Mahameed,
Tariq Toukan, Mark Bloch, Leon Romanovsky, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev
Cc: netdev, linux-rdma, bpf
On 8/28/25 8:36 PM, Christoph Paasch via B4 Relay wrote:
> From: Christoph Paasch <cpaasch@openai.com>
>
> [...]
>
> Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> ---
> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index 8bedbda522808cbabc8e62ae91a8c25d66725ebb..792bb647ba28668ad7789c328456e3609440455d 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -2047,6 +2047,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> dma_sync_single_for_cpu(rq->pdev, addr + head_offset, headlen,
> rq->buff.map_dir);
>
> + headlen = eth_get_headlen(skb->dev, head_addr, headlen);
> +
Hi,
I am building on top of this patchset and got a kernel crash. It was
triggered by attaching an xdp program.
I think the problem is skb->dev is still NULL here. It will be set later by:
mlx5e_complete_rx_cqe() -> mlx5e_build_rx_skb() -> eth_type_trans()
> frag_offset += headlen;
> byte_cnt -= headlen;
> linear_hr = skb_headroom(skb);
> @@ -2123,6 +2125,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> pagep->frags++;
> while (++pagep < frag_page);
> }
> +
> + headlen = eth_get_headlen(skb->dev, mxbuf->xdp.data, headlen);
> +
> __pskb_pull_tail(skb, headlen);
> } else {
> if (xdp_buff_has_frags(&mxbuf->xdp)) {
>
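[Editorial note: the fix itself is not part of this thread. Going by the analysis above, a plausible sketch is to use the net_device hanging off the receive queue, which is already valid at this point in the RX path. Treat the field name as an assumption — rq->netdev is how the device pointer appears elsewhere in this driver, but the actual follow-up commit may differ:]

```
-	headlen = eth_get_headlen(skb->dev, head_addr, headlen);
+	headlen = eth_get_headlen(rq->netdev, head_addr, headlen);
```

The same substitution would apply to the second eth_get_headlen() call in the XDP path quoted above.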
* Re: [PATCH net-next v4 2/2] net/mlx5: Avoid copying payload to the skb's linear part
2025-09-03 23:38 ` Amery Hung
@ 2025-09-03 23:57 ` Christoph Paasch
2025-09-04 0:11 ` Amery Hung
0 siblings, 1 reply; 16+ messages in thread
From: Christoph Paasch @ 2025-09-03 23:57 UTC (permalink / raw)
To: Amery Hung
Cc: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan,
Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, netdev, linux-rdma, bpf
On Wed, Sep 3, 2025 at 4:39 PM Amery Hung <ameryhung@gmail.com> wrote:
>
>
>
> On 8/28/25 8:36 PM, Christoph Paasch via B4 Relay wrote:
> > From: Christoph Paasch <cpaasch@openai.com>
> >
> > [...]
> >
> > Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> > ---
> > drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > index 8bedbda522808cbabc8e62ae91a8c25d66725ebb..792bb647ba28668ad7789c328456e3609440455d 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > @@ -2047,6 +2047,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> > dma_sync_single_for_cpu(rq->pdev, addr + head_offset, headlen,
> > rq->buff.map_dir);
> >
> > + headlen = eth_get_headlen(skb->dev, head_addr, headlen);
> > +
>
> Hi,
>
> I am building on top of this patchset and got a kernel crash. It was
> triggered by attaching an xdp program.
>
> I think the problem is skb->dev is still NULL here. It will be set later by:
> mlx5e_complete_rx_cqe() -> mlx5e_build_rx_skb() -> eth_type_trans()
Hmmm... Not sure what happened here...
I'm almost certain I tested with xdp as well...
I will try again later/tomorrow.
Thanks!
Christoph
* Re: [PATCH net-next v4 2/2] net/mlx5: Avoid copying payload to the skb's linear part
2025-09-03 23:57 ` Christoph Paasch
@ 2025-09-04 0:11 ` Amery Hung
2025-09-04 3:58 ` Christoph Paasch
0 siblings, 1 reply; 16+ messages in thread
From: Amery Hung @ 2025-09-04 0:11 UTC (permalink / raw)
To: Christoph Paasch
Cc: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan,
Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, netdev, linux-rdma, bpf
On Wed, Sep 3, 2025 at 4:57 PM Christoph Paasch <cpaasch@openai.com> wrote:
>
> On Wed, Sep 3, 2025 at 4:39 PM Amery Hung <ameryhung@gmail.com> wrote:
> >
> >
> >
> > On 8/28/25 8:36 PM, Christoph Paasch via B4 Relay wrote:
> > > From: Christoph Paasch <cpaasch@openai.com>
> > >
> > > [...]
> > >
> > > Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> > > ---
> > > drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 5 +++++
> > > 1 file changed, 5 insertions(+)
> > >
> > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > index 8bedbda522808cbabc8e62ae91a8c25d66725ebb..792bb647ba28668ad7789c328456e3609440455d 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > @@ -2047,6 +2047,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> > > dma_sync_single_for_cpu(rq->pdev, addr + head_offset, headlen,
> > > rq->buff.map_dir);
> > >
> > > + headlen = eth_get_headlen(skb->dev, head_addr, headlen);
> > > +
> >
> > Hi,
> >
> > I am building on top of this patchset and got a kernel crash. It was
> > triggered by attaching an xdp program.
> >
> > I think the problem is skb->dev is still NULL here. It will be set later by:
> > mlx5e_complete_rx_cqe() -> mlx5e_build_rx_skb() -> eth_type_trans()
>
> Hmmm... Not sure what happened here...
> I'm almost certain I tested with xdp as well...
>
> I will try again later/tomorrow.
>
Here is the command that triggers the panic:
ip link set dev eth0 mtu 8000 xdp obj
/root/ksft-net-drv/net/lib/xdp_native.bpf.o sec xdp.frags
and I should have attached the log:
[ 2851.287387] BUG: kernel NULL pointer dereference, address: 0000000000000100
[ 2851.301329] #PF: supervisor read access in kernel mode
[ 2851.311602] #PF: error_code(0x0000) - not-present page
[ 2851.321879] PGD 0 P4D 0
[ 2851.326944] Oops: Oops: 0000 [#1] SMP
[ 2851.334272] CPU: 11 UID: 0 PID: 0 Comm: swapper/11 Kdump: loaded
Tainted: G S E 6.17.0-rc1-gcf50ef415525 #305 NONE
[ 2851.357759] Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE
[ 2851.369252] Hardware name: Wiwynn Delta Lake MP/Delta Lake-Class1,
BIOS Y3DL401 09/04/2024
[ 2851.385787] RIP: 0010:eth_get_headlen+0x16/0x90
[ 2851.394850] Code: 5e 41 5f 5d c3 b8 f2 ff ff ff eb f0 cc cc cc cc
cc cc cc cc 0f 1f 44 00 00 41 56 53 48 83 ec 10 89 d3 83 fa 0e 72 68
49 89 f6 <48> 8b bf 00 01 00 00 44 0f b7 4e 0c c7 44 24 08 00 00 00 00
48 c7
[ 2851.432413] RSP: 0018:ffffc90000720cc8 EFLAGS: 00010212
[ 2851.442864] RAX: 0000000000000000 RBX: 000000000000008a RCX: 00000000000000a0
[ 2851.457141] RDX: 000000000000008a RSI: ffff8885a5aee100 RDI: 0000000000000000
[ 2851.471417] RBP: ffff8883d01f3900 R08: ffff888204c7c000 R09: 0000000000000000
[ 2851.485696] R10: ffff8883d01f3900 R11: ffff8885a5aee340 R12: ffff8885add00030
[ 2851.499969] R13: ffff8885add00030 R14: ffff8885a5aee100 R15: 0000000000000000
[ 2851.514245] FS: 0000000000000000(0000) GS:ffff8890b4427000(0000)
knlGS:0000000000000000
[ 2851.530433] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2851.541931] CR2: 0000000000000100 CR3: 000000107d412003 CR4: 00000000007726f0
[ 2851.556208] PKRU: 55555554
[ 2851.561623] Call Trace:
[ 2851.566514] <IRQ>
[ 2851.570540] mlx5e_skb_from_cqe_mpwrq_nonlinear+0x7af/0x8d0
[ 2851.581689] mlx5e_handle_rx_cqe_mpwrq+0xbc/0x180
[ 2851.591096] mlx5e_poll_rx_cq+0x2ef/0x780
[ 2851.599114] mlx5e_napi_poll+0x10c/0x710
[ 2851.606959] __napi_poll+0x28/0x160
[ 2851.613934] net_rx_action+0x1c0/0x350
[ 2851.621434] ? mlx5_eq_comp_int+0xdf/0x190
[ 2851.629628] ? sched_clock+0x5/0x10
[ 2851.636603] ? sched_clock_cpu+0xc/0x170
[ 2851.644450] handle_softirqs+0xd8/0x280
[ 2851.652121] __irq_exit_rcu.llvm.7416059615185659459+0x44/0xd0
[ 2851.663788] common_interrupt+0x85/0x90
[ 2851.671457] </IRQ>
[ 2851.675653] <TASK>
[ 2851.679850] asm_common_interrupt+0x22/0x40
Thanks for taking a look!
Amery
* Re: [PATCH net-next v4 2/2] net/mlx5: Avoid copying payload to the skb's linear part
2025-09-04 0:11 ` Amery Hung
@ 2025-09-04 3:58 ` Christoph Paasch
0 siblings, 0 replies; 16+ messages in thread
From: Christoph Paasch @ 2025-09-04 3:58 UTC (permalink / raw)
To: Amery Hung
Cc: Gal Pressman, Dragos Tatulea, Saeed Mahameed, Tariq Toukan,
Mark Bloch, Leon Romanovsky, Andrew Lunn, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, netdev, linux-rdma, bpf
On Wed, Sep 3, 2025 at 5:12 PM Amery Hung <ameryhung@gmail.com> wrote:
>
> On Wed, Sep 3, 2025 at 4:57 PM Christoph Paasch <cpaasch@openai.com> wrote:
> >
> > On Wed, Sep 3, 2025 at 4:39 PM Amery Hung <ameryhung@gmail.com> wrote:
> > >
> > >
> > >
> > > On 8/28/25 8:36 PM, Christoph Paasch via B4 Relay wrote:
> > > > From: Christoph Paasch <cpaasch@openai.com>
> > > >
> > > > [...]
> > > >
> > > > Signed-off-by: Christoph Paasch <cpaasch@openai.com>
> > > > ---
> > > > drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 5 +++++
> > > > 1 file changed, 5 insertions(+)
> > > >
> > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > > index 8bedbda522808cbabc8e62ae91a8c25d66725ebb..792bb647ba28668ad7789c328456e3609440455d 100644
> > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > > @@ -2047,6 +2047,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> > > > dma_sync_single_for_cpu(rq->pdev, addr + head_offset, headlen,
> > > > rq->buff.map_dir);
> > > >
> > > > + headlen = eth_get_headlen(skb->dev, head_addr, headlen);
> > > > +
> > >
> > > Hi,
> > >
> > > I am building on top of this patchset and got a kernel crash. It was
> > > triggered by attaching an xdp program.
> > >
> > > I think the problem is skb->dev is still NULL here. It will be set later by:
> > > mlx5e_complete_rx_cqe() -> mlx5e_build_rx_skb() -> eth_type_trans()
> >
> > Hmmm... Not sure what happened here...
> > I'm almost certain I tested with xdp as well...
> >
> > I will try again later/tomorrow.
> >
>
> Here is the command that triggers the panic:
>
> ip link set dev eth0 mtu 8000 xdp obj
> /root/ksft-net-drv/net/lib/xdp_native.bpf.o sec xdp.frags
>
> and I should have attached the log:
>
> [...]
Oh, I see why I didn't hit the bug when testing with xdp... I wasn't
using a multi-buffer xdp prog and thus had to reduce the MTU and so
ended up not using the mlx5e_skb_from_cqe_mpwrq_nonlinear()
code-path...
I can reproduce the panic and will fix it.
Christoph