* [PATCH bpf v3 1/9] xdp: use modulo operation to calculate XDP frag tailroom
2026-02-17 13:24 [PATCH bpf v3 0/9] Address XDP frags having negative tailroom Larysa Zaremba
@ 2026-02-17 13:24 ` Larysa Zaremba
2026-02-17 15:10 ` Loktionov, Aleksandr
2026-02-17 13:24 ` [PATCH bpf v3 2/9] xsk: introduce helper to determine rxq->frag_size Larysa Zaremba
` (8 subsequent siblings)
9 siblings, 1 reply; 22+ messages in thread
From: Larysa Zaremba @ 2026-02-17 13:24 UTC (permalink / raw)
To: bpf
Cc: Larysa Zaremba, Claudiu Manoil, Vladimir Oltean, Wei Fang,
Clark Wang, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Tony Nguyen, Przemek Kitszel,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
KP Singh, Hao Luo, Jiri Olsa, Simon Horman, Shuah Khan,
Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov,
Dragos Tatulea
The current formula for calculating XDP tailroom in mbuf packets works only
if each frag has its own page (that is, if rxq->frag_size is PAGE_SIZE). This
defeats the purpose of the parameter and, when shared pages are used,
silently yields a negative calculated tailroom on at least half of the
frags.
There are not many drivers that set rxq->frag_size. Among them:
* i40e and enetc always split page uniformly between frags, use shared
pages
* ice uses page_pool frags via libeth, those are power-of-2 and uniformly
distributed across page
* idpf has variable frag_size with XDP on, so current API is not applicable
* mlx5, mtk and mvneta use PAGE_SIZE or 0 as frag_size for page_pool
As for AF_XDP ZC, only ice, i40e and idpf declare frag_size for it. The
modulo operation yields correct results for aligned chunks, which are all
power-of-2 sizes between 2K and PAGE_SIZE. The formula without modulo fails
when chunk_size is 2K. Buffers in unaligned mode are not distributed
uniformly, so the modulo operation would not work for them.
To accommodate unaligned buffers, we could define frag_size as
data + tailroom, and hence not subtract the offset when calculating
tailroom, but this would necessitate more changes in the drivers.
Define rxq->frag_size as an even portion of a page that fully belongs to a
single frag. When calculating tailroom, locate the data start within such
portion by performing a modulo operation on page offset.
Fixes: bf25146a5595 ("bpf: add frags support to the bpf_xdp_adjust_tail() API")
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
net/core/filter.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/core/filter.c b/net/core/filter.c
index ba019ded773d..5f5489665c58 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4156,7 +4156,8 @@ static int bpf_xdp_frags_increase_tail(struct xdp_buff *xdp, int offset)
if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz)
return -EOPNOTSUPP;
- tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
+ tailroom = rxq->frag_size - skb_frag_size(frag) -
+ skb_frag_off(frag) % rxq->frag_size;
if (unlikely(offset > tailroom))
return -EINVAL;
--
2.52.0
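For readers following along, a minimal standalone sketch of the arithmetic
this patch changes. The function names and plain integer parameters are
illustrative stand-ins (assumptions, not kernel API); in the kernel the values
come from rxq->frag_size, skb_frag_size() and skb_frag_off().

```c
#include <assert.h>

/*
 * Old formula: subtracts the raw page offset, so it is only correct when
 * each frag starts in its own page. (Hypothetical helper name.)
 */
int tailroom_old(unsigned int frag_size, unsigned int len,
		 unsigned int page_off)
{
	return (int)frag_size - (int)len - (int)page_off;
}

/*
 * Patched formula: reduce the page offset to an offset within the frag's
 * own frag_size-sized portion of the page before subtracting it.
 * (Hypothetical helper name.)
 */
int tailroom_new(unsigned int frag_size, unsigned int len,
		 unsigned int page_off)
{
	/* The kernel rejects frag_size == 0 with -EOPNOTSUPP before this. */
	assert(frag_size != 0);
	return (int)frag_size - (int)len - (int)(page_off % frag_size);
}
```

Assuming a 4K page split into two 2K frags and a 1500-byte frag,
tailroom_old() gives 548 for the first half (offset 0) but -1500 for the
second half (offset 2048), while tailroom_new() gives 548 for both.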
* RE: [PATCH bpf v3 1/9] xdp: use modulo operation to calculate XDP frag tailroom
2026-02-17 13:24 ` [PATCH bpf v3 1/9] xdp: use modulo operation to calculate XDP frag tailroom Larysa Zaremba
@ 2026-02-17 15:10 ` Loktionov, Aleksandr
0 siblings, 0 replies; 22+ messages in thread
From: Loktionov, Aleksandr @ 2026-02-17 15:10 UTC (permalink / raw)
To: Zaremba, Larysa, bpf@vger.kernel.org
Cc: Claudiu Manoil, Vladimir Oltean, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Nguyen, Anthony L, Kitszel, Przemyslaw,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
KP Singh, Hao Luo, Jiri Olsa, Simon Horman, Shuah Khan,
Lobakin, Aleksander, Fijalkowski, Maciej,
Bastien Curutchet (eBPF Foundation), Vyavahare, Tushar,
Jason Xing, Ricardo B. Marlière, Eelco Chaudron,
Lorenzo Bianconi, Toke Hoiland-Jorgensen, imx@lists.linux.dev,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
intel-wired-lan@lists.osuosl.org, linux-kselftest@vger.kernel.org,
Dragos Tatulea
> -----Original Message-----
> From: Zaremba, Larysa <larysa.zaremba@intel.com>
> Sent: Tuesday, February 17, 2026 2:25 PM
> To: bpf@vger.kernel.org
> Cc: Zaremba, Larysa <larysa.zaremba@intel.com>; Claudiu Manoil
> <claudiu.manoil@nxp.com>; Vladimir Oltean <vladimir.oltean@nxp.com>;
> Wei Fang <wei.fang@nxp.com>; Clark Wang <xiaoning.wang@nxp.com>;
> Andrew Lunn <andrew+netdev@lunn.ch>; David S. Miller
> <davem@davemloft.net>; Eric Dumazet <edumazet@google.com>; Jakub
> Kicinski <kuba@kernel.org>; Paolo Abeni <pabeni@redhat.com>; Nguyen,
> Anthony L <anthony.l.nguyen@intel.com>; Kitszel, Przemyslaw
> <przemyslaw.kitszel@intel.com>; Alexei Starovoitov <ast@kernel.org>;
> Daniel Borkmann <daniel@iogearbox.net>; Jesper Dangaard Brouer
> <hawk@kernel.org>; John Fastabend <john.fastabend@gmail.com>;
> Stanislav Fomichev <sdf@fomichev.me>; Andrii Nakryiko
> <andrii@kernel.org>; Martin KaFai Lau <martin.lau@linux.dev>; Eduard
> Zingerman <eddyz87@gmail.com>; Song Liu <song@kernel.org>; Yonghong
> Song <yonghong.song@linux.dev>; KP Singh <kpsingh@kernel.org>; Hao Luo
> <haoluo@google.com>; Jiri Olsa <jolsa@kernel.org>; Simon Horman
> <horms@kernel.org>; Shuah Khan <shuah@kernel.org>; Lobakin, Aleksander
> <aleksander.lobakin@intel.com>; Fijalkowski, Maciej
> <maciej.fijalkowski@intel.com>; Bastien Curutchet (eBPF Foundation)
> <bastien.curutchet@bootlin.com>; Vyavahare, Tushar
> <tushar.vyavahare@intel.com>; Jason Xing <kernelxing@tencent.com>;
> Ricardo B. Marlière <rbm@suse.com>; Eelco Chaudron
> <echaudro@redhat.com>; Lorenzo Bianconi <lorenzo@kernel.org>; Toke
> Hoiland-Jorgensen <toke@redhat.com>; imx@lists.linux.dev;
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org; intel-wired-
> lan@lists.osuosl.org; linux-kselftest@vger.kernel.org; Loktionov,
> Aleksandr <aleksandr.loktionov@intel.com>; Dragos Tatulea
> <dtatulea@nvidia.com>
> Subject: [PATCH bpf v3 1/9] xdp: use modulo operation to calculate XDP
> frag tailroom
>
> The current formula for calculating XDP tailroom in mbuf packets works
> only if each frag has its own page (if rxq->frag_size is PAGE_SIZE),
> this defeats the purpose of the parameter overall and without any
> indication leads to negative calculated tailroom on at least half of
> frags, if shared pages are used.
>
> There are not many drivers that set rxq->frag_size. Among them:
> * i40e and enetc always split page uniformly between frags, use shared
> pages
> * ice uses page_pool frags via libeth, those are power-of-2 and
> uniformly
> distributed across page
> * idpf has variable frag_size with XDP on, so current API is not
> applicable
> * mlx5, mtk and mvneta use PAGE_SIZE or 0 as frag_size for page_pool
>
> As for AF_XDP ZC, only ice, i40e and idpf declare frag_size for it.
> Modulo operation yields good results for aligned chunks, they are all
> power-of-2, between 2K and PAGE_SIZE. Formula without modulo fails
> when chunk_size is 2K. Buffers in unaligned mode are not distributed
> uniformly, so modulo operation would not work.
>
> To accommodate unaligned buffers, we could define frag_size as data +
> tailroom, and hence do not subtract offset when calculating tailroom,
> but this would necessitate more changes in the drivers.
>
> Define rxq->frag_size as an even portion of a page that fully belongs
> to a single frag. When calculating tailroom, locate the data start
> within such portion by performing a modulo operation on page offset.
>
> Fixes: bf25146a5595 ("bpf: add frags support to the
> bpf_xdp_adjust_tail() API")
> Acked-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> ---
> net/core/filter.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index ba019ded773d..5f5489665c58 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4156,7 +4156,8 @@ static int bpf_xdp_frags_increase_tail(struct xdp_buff *xdp, int offset)
> if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz)
> return -EOPNOTSUPP;
>
> - tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
> + tailroom = rxq->frag_size - skb_frag_size(frag) -
> + skb_frag_off(frag) % rxq->frag_size;
> if (unlikely(offset > tailroom))
> return -EINVAL;
>
> --
> 2.52.0
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
* [PATCH bpf v3 2/9] xsk: introduce helper to determine rxq->frag_size
2026-02-17 13:24 [PATCH bpf v3 0/9] Address XDP frags having negative tailroom Larysa Zaremba
2026-02-17 13:24 ` [PATCH bpf v3 1/9] xdp: use modulo operation to calculate XDP frag tailroom Larysa Zaremba
@ 2026-02-17 13:24 ` Larysa Zaremba
2026-02-17 15:11 ` Loktionov, Aleksandr
2026-02-17 13:24 ` [PATCH bpf v3 3/9] ice: fix rxq info registering in mbuf packets Larysa Zaremba
` (7 subsequent siblings)
9 siblings, 1 reply; 22+ messages in thread
From: Larysa Zaremba @ 2026-02-17 13:24 UTC (permalink / raw)
To: bpf
Cc: Larysa Zaremba, Claudiu Manoil, Vladimir Oltean, Wei Fang,
Clark Wang, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Tony Nguyen, Przemek Kitszel,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
KP Singh, Hao Luo, Jiri Olsa, Simon Horman, Shuah Khan,
Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov,
Dragos Tatulea
rxq->frag_size is basically a step between consecutive strictly aligned
frames. In ZC mode, chunk size fits exactly, but if chunks are unaligned,
there is no safe way to determine the accessible space to grow tailroom.
Report frag_size as zero if chunks are unaligned, and as chunk_size otherwise.
Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
include/net/xdp_sock_drv.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 242e34f771cc..09d972f4bd60 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -51,6 +51,11 @@ static inline u32 xsk_pool_get_rx_frame_size(struct xsk_buff_pool *pool)
return xsk_pool_get_chunk_size(pool) - xsk_pool_get_headroom(pool);
}
+static inline u32 xsk_pool_get_rx_frag_step(struct xsk_buff_pool *pool)
+{
+ return pool->unaligned ? 0 : xsk_pool_get_chunk_size(pool);
+}
+
static inline void xsk_pool_set_rxq_info(struct xsk_buff_pool *pool,
struct xdp_rxq_info *rxq)
{
@@ -337,6 +342,11 @@ static inline u32 xsk_pool_get_rx_frame_size(struct xsk_buff_pool *pool)
return 0;
}
+static inline u32 xsk_pool_get_rx_frag_step(struct xsk_buff_pool *pool)
+{
+ return 0;
+}
+
static inline void xsk_pool_set_rxq_info(struct xsk_buff_pool *pool,
struct xdp_rxq_info *rxq)
{
--
2.52.0
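As a rough illustration of how the reported step interacts with the check at
the top of bpf_xdp_frags_increase_tail(), here is a standalone sketch. The
mock_pool struct and both function names are hypothetical simplifications,
not the kernel's struct xsk_buff_pool or its helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down model of an XSK buffer pool. */
struct mock_pool {
	uint32_t chunk_size;	/* size of one UMEM chunk */
	bool unaligned;		/* true if the UMEM uses unaligned chunk mode */
};

/* Same decision as the new helper: no usable step for unaligned chunks. */
uint32_t mock_rx_frag_step(const struct mock_pool *pool)
{
	assert(pool);
	return pool->unaligned ? 0 : pool->chunk_size;
}

/*
 * Mirrors the guard quoted in patch 1/9: a zero frag_size, or one larger
 * than frame_sz, means tail growing is reported as unsupported.
 */
bool mock_tail_grow_supported(uint32_t frag_size, uint32_t frame_sz)
{
	return frag_size != 0 && frag_size <= frame_sz;
}
```

Under these assumptions, an aligned pool with 2K chunks reports a step of
2048 and tail growing stays available, while an unaligned pool reports 0 and
growing a frag is refused rather than computed from a meaningless offset.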
* RE: [PATCH bpf v3 2/9] xsk: introduce helper to determine rxq->frag_size
2026-02-17 13:24 ` [PATCH bpf v3 2/9] xsk: introduce helper to determine rxq->frag_size Larysa Zaremba
@ 2026-02-17 15:11 ` Loktionov, Aleksandr
0 siblings, 0 replies; 22+ messages in thread
From: Loktionov, Aleksandr @ 2026-02-17 15:11 UTC (permalink / raw)
To: Zaremba, Larysa, bpf@vger.kernel.org
Cc: Claudiu Manoil, Vladimir Oltean, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Nguyen, Anthony L, Kitszel, Przemyslaw,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
KP Singh, Hao Luo, Jiri Olsa, Simon Horman, Shuah Khan,
Lobakin, Aleksander, Fijalkowski, Maciej,
Bastien Curutchet (eBPF Foundation), Vyavahare, Tushar,
Jason Xing, Ricardo B. Marlière, Eelco Chaudron,
Lorenzo Bianconi, Toke Hoiland-Jorgensen, imx@lists.linux.dev,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
intel-wired-lan@lists.osuosl.org, linux-kselftest@vger.kernel.org,
Dragos Tatulea
> -----Original Message-----
> From: Zaremba, Larysa <larysa.zaremba@intel.com>
> Sent: Tuesday, February 17, 2026 2:25 PM
> To: bpf@vger.kernel.org
> Cc: Zaremba, Larysa <larysa.zaremba@intel.com>; Claudiu Manoil
> <claudiu.manoil@nxp.com>; Vladimir Oltean <vladimir.oltean@nxp.com>;
> Wei Fang <wei.fang@nxp.com>; Clark Wang <xiaoning.wang@nxp.com>;
> Andrew Lunn <andrew+netdev@lunn.ch>; David S. Miller
> <davem@davemloft.net>; Eric Dumazet <edumazet@google.com>; Jakub
> Kicinski <kuba@kernel.org>; Paolo Abeni <pabeni@redhat.com>; Nguyen,
> Anthony L <anthony.l.nguyen@intel.com>; Kitszel, Przemyslaw
> <przemyslaw.kitszel@intel.com>; Alexei Starovoitov <ast@kernel.org>;
> Daniel Borkmann <daniel@iogearbox.net>; Jesper Dangaard Brouer
> <hawk@kernel.org>; John Fastabend <john.fastabend@gmail.com>;
> Stanislav Fomichev <sdf@fomichev.me>; Andrii Nakryiko
> <andrii@kernel.org>; Martin KaFai Lau <martin.lau@linux.dev>; Eduard
> Zingerman <eddyz87@gmail.com>; Song Liu <song@kernel.org>; Yonghong
> Song <yonghong.song@linux.dev>; KP Singh <kpsingh@kernel.org>; Hao Luo
> <haoluo@google.com>; Jiri Olsa <jolsa@kernel.org>; Simon Horman
> <horms@kernel.org>; Shuah Khan <shuah@kernel.org>; Lobakin, Aleksander
> <aleksander.lobakin@intel.com>; Fijalkowski, Maciej
> <maciej.fijalkowski@intel.com>; Bastien Curutchet (eBPF Foundation)
> <bastien.curutchet@bootlin.com>; Vyavahare, Tushar
> <tushar.vyavahare@intel.com>; Jason Xing <kernelxing@tencent.com>;
> Ricardo B. Marlière <rbm@suse.com>; Eelco Chaudron
> <echaudro@redhat.com>; Lorenzo Bianconi <lorenzo@kernel.org>; Toke
> Hoiland-Jorgensen <toke@redhat.com>; imx@lists.linux.dev;
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org; intel-wired-
> lan@lists.osuosl.org; linux-kselftest@vger.kernel.org; Loktionov,
> Aleksandr <aleksandr.loktionov@intel.com>; Dragos Tatulea
> <dtatulea@nvidia.com>
> Subject: [PATCH bpf v3 2/9] xsk: introduce helper to determine rxq->frag_size
>
> rxq->frag_size is basically a step between consecutive strictly aligned
> frames. In ZC mode, chunk size fits exactly, but if chunks are unaligned,
> there is no safe way to determine accessible space to grow tailroom.
>
> Report frag_size to be zero, if chunks are unaligned, chunk_size otherwise.
>
> Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> ---
> include/net/xdp_sock_drv.h | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
> index 242e34f771cc..09d972f4bd60 100644
> --- a/include/net/xdp_sock_drv.h
> +++ b/include/net/xdp_sock_drv.h
> @@ -51,6 +51,11 @@ static inline u32 xsk_pool_get_rx_frame_size(struct xsk_buff_pool *pool)
> return xsk_pool_get_chunk_size(pool) - xsk_pool_get_headroom(pool);
> }
>
> +static inline u32 xsk_pool_get_rx_frag_step(struct xsk_buff_pool *pool)
> +{
> + return pool->unaligned ? 0 : xsk_pool_get_chunk_size(pool);
> +}
> +
> static inline void xsk_pool_set_rxq_info(struct xsk_buff_pool *pool,
> struct xdp_rxq_info *rxq)
> {
> @@ -337,6 +342,11 @@ static inline u32 xsk_pool_get_rx_frame_size(struct xsk_buff_pool *pool)
> return 0;
> }
>
> +static inline u32 xsk_pool_get_rx_frag_step(struct xsk_buff_pool *pool)
> +{
> + return 0;
> +}
> +
> static inline void xsk_pool_set_rxq_info(struct xsk_buff_pool *pool,
> struct xdp_rxq_info *rxq)
> {
> --
> 2.52.0
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
* [PATCH bpf v3 3/9] ice: fix rxq info registering in mbuf packets
2026-02-17 13:24 [PATCH bpf v3 0/9] Address XDP frags having negative tailroom Larysa Zaremba
2026-02-17 13:24 ` [PATCH bpf v3 1/9] xdp: use modulo operation to calculate XDP frag tailroom Larysa Zaremba
2026-02-17 13:24 ` [PATCH bpf v3 2/9] xsk: introduce helper to determine rxq->frag_size Larysa Zaremba
@ 2026-02-17 13:24 ` Larysa Zaremba
2026-02-17 13:24 ` [PATCH bpf v3 4/9] ice: change XDP RxQ frag_size from DMA write length to xdp.frame_sz Larysa Zaremba
` (6 subsequent siblings)
9 siblings, 0 replies; 22+ messages in thread
From: Larysa Zaremba @ 2026-02-17 13:24 UTC (permalink / raw)
To: bpf
Cc: Larysa Zaremba, Claudiu Manoil, Vladimir Oltean, Wei Fang,
Clark Wang, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Tony Nguyen, Przemek Kitszel,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
KP Singh, Hao Luo, Jiri Olsa, Simon Horman, Shuah Khan,
Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov,
Dragos Tatulea
XDP RxQ info contains frag_size, which depends on the MTU. This makes the
old way of registering RxQ info before calculating new buffer sizes
invalid. Currently, it leads to frag_size being outdated, making it
sometimes impossible to grow tailroom in an mbuf packet. For example,
fragments are actually 3K+, but frag_size is still set as if the MTU were 1500.
Always register new XDP RxQ info after reconfiguring memory pools.
Fixes: 93f53db9f9dc ("ice: switch to Page Pool")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
drivers/net/ethernet/intel/ice/ice_base.c | 26 ++++++-----------------
drivers/net/ethernet/intel/ice/ice_txrx.c | 3 ++-
drivers/net/ethernet/intel/ice/ice_xsk.c | 3 +++
3 files changed, 12 insertions(+), 20 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
index eadb1e3d12b3..511d803cf0a4 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.c
+++ b/drivers/net/ethernet/intel/ice/ice_base.c
@@ -663,23 +663,12 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
int err;
if (ring->vsi->type == ICE_VSI_PF || ring->vsi->type == ICE_VSI_SF) {
- if (!xdp_rxq_info_is_reg(&ring->xdp_rxq)) {
- err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
- ring->q_index,
- ring->q_vector->napi.napi_id,
- ring->rx_buf_len);
- if (err)
- return err;
- }
-
ice_rx_xsk_pool(ring);
err = ice_realloc_rx_xdp_bufs(ring, ring->xsk_pool);
if (err)
return err;
if (ring->xsk_pool) {
- xdp_rxq_info_unreg(&ring->xdp_rxq);
-
rx_buf_len =
xsk_pool_get_rx_frame_size(ring->xsk_pool);
err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
@@ -702,14 +691,13 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
if (err)
return err;
- if (!xdp_rxq_info_is_reg(&ring->xdp_rxq)) {
- err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
- ring->q_index,
- ring->q_vector->napi.napi_id,
- ring->rx_buf_len);
- if (err)
- goto err_destroy_fq;
- }
+ err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
+ ring->q_index,
+ ring->q_vector->napi.napi_id,
+ ring->rx_buf_len);
+ if (err)
+ goto err_destroy_fq;
+
xdp_rxq_info_attach_page_pool(&ring->xdp_rxq,
ring->pp);
}
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index ad76768a4232..4c294ab7df30 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -560,7 +560,8 @@ void ice_clean_rx_ring(struct ice_rx_ring *rx_ring)
i = 0;
}
- if (rx_ring->vsi->type == ICE_VSI_PF &&
+ if ((rx_ring->vsi->type == ICE_VSI_PF ||
+ rx_ring->vsi->type == ICE_VSI_SF) &&
xdp_rxq_info_is_reg(&rx_ring->xdp_rxq)) {
xdp_rxq_info_detach_mem_model(&rx_ring->xdp_rxq);
xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 989ff1fd9110..102631398af3 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -900,6 +900,9 @@ void ice_xsk_clean_rx_ring(struct ice_rx_ring *rx_ring)
u16 ntc = rx_ring->next_to_clean;
u16 ntu = rx_ring->next_to_use;
+ if (xdp_rxq_info_is_reg(&rx_ring->xdp_rxq))
+ xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
+
while (ntc != ntu) {
struct xdp_buff *xdp = *ice_xdp_buf(rx_ring, ntc);
--
2.52.0
* [PATCH bpf v3 4/9] ice: change XDP RxQ frag_size from DMA write length to xdp.frame_sz
2026-02-17 13:24 [PATCH bpf v3 0/9] Address XDP frags having negative tailroom Larysa Zaremba
` (2 preceding siblings ...)
2026-02-17 13:24 ` [PATCH bpf v3 3/9] ice: fix rxq info registering in mbuf packets Larysa Zaremba
@ 2026-02-17 13:24 ` Larysa Zaremba
2026-02-27 11:22 ` Maciej Fijalkowski
2026-02-17 13:24 ` [PATCH bpf v3 5/9] i40e: fix registering XDP RxQ info Larysa Zaremba
` (5 subsequent siblings)
9 siblings, 1 reply; 22+ messages in thread
From: Larysa Zaremba @ 2026-02-17 13:24 UTC (permalink / raw)
To: bpf
Cc: Larysa Zaremba, Claudiu Manoil, Vladimir Oltean, Wei Fang,
Clark Wang, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Tony Nguyen, Przemek Kitszel,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
KP Singh, Hao Luo, Jiri Olsa, Simon Horman, Shuah Khan,
Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov,
Dragos Tatulea
The only user of frag_size field in XDP RxQ info is
bpf_xdp_frags_increase_tail(). It clearly expects whole buff size instead
of DMA write size. Different assumptions in ice driver configuration lead
to negative tailroom.
This makes it possible to trigger a kernel panic by running the
XDP_ADJUST_TAIL_GROW_MULTI_BUFF xskxceiver test with the packet size changed
to 6912 and the requested offset set to a huge value, e.g.
XSK_UMEM__MAX_FRAME_SIZE * 100.
Due to other quirks of the ZC configuration in ice, the panic is not observed
in ZC mode, but tailroom growing still fails when it should not.
Use fill queue buffer truesize instead of DMA write size in XDP RxQ info.
Fix ZC mode too by using the new helper.
Fixes: 2fba7dc5157b ("ice: Add support for XDP multi-buffer on Rx side")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
drivers/net/ethernet/intel/ice/ice_base.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
index 511d803cf0a4..27ab899a4052 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.c
+++ b/drivers/net/ethernet/intel/ice/ice_base.c
@@ -659,7 +659,6 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
{
struct device *dev = ice_pf_to_dev(ring->vsi->back);
u32 num_bufs = ICE_DESC_UNUSED(ring);
- u32 rx_buf_len;
int err;
if (ring->vsi->type == ICE_VSI_PF || ring->vsi->type == ICE_VSI_SF) {
@@ -669,12 +668,12 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
return err;
if (ring->xsk_pool) {
- rx_buf_len =
- xsk_pool_get_rx_frame_size(ring->xsk_pool);
+ u32 frag_size =
+ xsk_pool_get_rx_frag_step(ring->xsk_pool);
err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
ring->q_index,
ring->q_vector->napi.napi_id,
- rx_buf_len);
+ frag_size);
if (err)
return err;
err = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
@@ -694,7 +693,7 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
ring->q_index,
ring->q_vector->napi.napi_id,
- ring->rx_buf_len);
+ ring->truesize);
if (err)
goto err_destroy_fq;
--
2.52.0
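To make the difference between the two registrations concrete, a small
worked example follows. The buffer layout numbers (EX_TRUESIZE, EX_HEADROOM,
EX_DMA_LEN) are invented for illustration and are not taken from the ice
driver; frag_tailroom() is a hypothetical standalone copy of the formula
from patch 1/9.

```c
#include <assert.h>

/*
 * Tailroom as computed by the patched formula from 1/9, on plain integers.
 * In the kernel the inputs come from rxq->frag_size and the last frag.
 */
int frag_tailroom(unsigned int frag_size, unsigned int len,
		  unsigned int page_off)
{
	assert(frag_size != 0);
	return (int)frag_size - (int)len - (int)(page_off % frag_size);
}

/* Hypothetical buffer layout, chosen only for illustration. */
enum {
	EX_TRUESIZE = 4096,	/* whole buffer, headroom included */
	EX_HEADROOM = 256,	/* offset at which the NIC starts writing */
	EX_DMA_LEN  = 3072,	/* maximum length the NIC may write */
};
```

With a completely filled frag (length EX_DMA_LEN at offset EX_HEADROOM),
registering the DMA write length as frag_size gives
frag_tailroom(EX_DMA_LEN, EX_DMA_LEN, EX_HEADROOM) = -256, whereas
registering the truesize gives
frag_tailroom(EX_TRUESIZE, EX_DMA_LEN, EX_HEADROOM) = 768.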
* Re: [PATCH bpf v3 4/9] ice: change XDP RxQ frag_size from DMA write length to xdp.frame_sz
2026-02-17 13:24 ` [PATCH bpf v3 4/9] ice: change XDP RxQ frag_size from DMA write length to xdp.frame_sz Larysa Zaremba
@ 2026-02-27 11:22 ` Maciej Fijalkowski
2026-03-02 14:32 ` Larysa Zaremba
0 siblings, 1 reply; 22+ messages in thread
From: Maciej Fijalkowski @ 2026-02-27 11:22 UTC (permalink / raw)
To: Larysa Zaremba
Cc: bpf, Claudiu Manoil, Vladimir Oltean, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Tony Nguyen, Przemek Kitszel, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Andrii Nakryiko, Martin KaFai Lau,
Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Hao Luo,
Jiri Olsa, Simon Horman, Shuah Khan, Alexander Lobakin,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov,
Dragos Tatulea
On Tue, Feb 17, 2026 at 02:24:42PM +0100, Larysa Zaremba wrote:
> The only user of frag_size field in XDP RxQ info is
> bpf_xdp_frags_increase_tail(). It clearly expects whole buff size instead
> of DMA write size. Different assumptions in ice driver configuration lead
> to negative tailroom.
>
> This allows to trigger kernel panic, when using
> XDP_ADJUST_TAIL_GROW_MULTI_BUFF xskxceiver test and changing packet size to
> 6912 and the requested offset to a huge value, e.g.
> XSK_UMEM__MAX_FRAME_SIZE * 100.
>
> Due to other quirks of the ZC configuration in ice, panic is not observed
> in ZC mode, but tailroom growing still fails when it should not.
>
> Use fill queue buffer truesize instead of DMA write size in XDP RxQ info.
> Fix ZC mode too by using the new helper.
>
> Fixes: 2fba7dc5157b ("ice: Add support for XDP multi-buffer on Rx side")
> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> ---
> drivers/net/ethernet/intel/ice/ice_base.c | 9 ++++-----
> 1 file changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
> index 511d803cf0a4..27ab899a4052 100644
> --- a/drivers/net/ethernet/intel/ice/ice_base.c
> +++ b/drivers/net/ethernet/intel/ice/ice_base.c
> @@ -659,7 +659,6 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
> {
> struct device *dev = ice_pf_to_dev(ring->vsi->back);
> u32 num_bufs = ICE_DESC_UNUSED(ring);
> - u32 rx_buf_len;
> int err;
>
> if (ring->vsi->type == ICE_VSI_PF || ring->vsi->type == ICE_VSI_SF) {
> @@ -669,12 +668,12 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
> return err;
>
> if (ring->xsk_pool) {
> - rx_buf_len =
> - xsk_pool_get_rx_frame_size(ring->xsk_pool);
ice_setup_rx_ctx() consumes ring->rx_buf_len. This can't come from
page_pool when you have configured xsk_pool on a given rxq. I believe we
need a setting:
ring->rx_buf_len =
xsk_pool_get_rx_frame_size(ring->xsk_pool);
> + u32 frag_size =
> + xsk_pool_get_rx_frag_step(ring->xsk_pool);
> err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
> ring->q_index,
> ring->q_vector->napi.napi_id,
> - rx_buf_len);
> + frag_size);
> if (err)
> return err;
> err = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
> @@ -694,7 +693,7 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
> err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
> ring->q_index,
> ring->q_vector->napi.napi_id,
> - ring->rx_buf_len);
> + ring->truesize);
> if (err)
> goto err_destroy_fq;
>
> --
> 2.52.0
>
* Re: [PATCH bpf v3 4/9] ice: change XDP RxQ frag_size from DMA write length to xdp.frame_sz
2026-02-27 11:22 ` Maciej Fijalkowski
@ 2026-03-02 14:32 ` Larysa Zaremba
0 siblings, 0 replies; 22+ messages in thread
From: Larysa Zaremba @ 2026-03-02 14:32 UTC (permalink / raw)
To: Maciej Fijalkowski
Cc: bpf, Claudiu Manoil, Vladimir Oltean, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Tony Nguyen, Przemek Kitszel, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Andrii Nakryiko, Martin KaFai Lau,
Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Hao Luo,
Jiri Olsa, Simon Horman, Shuah Khan, Alexander Lobakin,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov,
Dragos Tatulea
On Fri, Feb 27, 2026 at 12:22:41PM +0100, Maciej Fijalkowski wrote:
> On Tue, Feb 17, 2026 at 02:24:42PM +0100, Larysa Zaremba wrote:
> > The only user of frag_size field in XDP RxQ info is
> > bpf_xdp_frags_increase_tail(). It clearly expects whole buff size instead
> > of DMA write size. Different assumptions in ice driver configuration lead
> > to negative tailroom.
> >
> > This allows to trigger kernel panic, when using
> > XDP_ADJUST_TAIL_GROW_MULTI_BUFF xskxceiver test and changing packet size to
> > 6912 and the requested offset to a huge value, e.g.
> > XSK_UMEM__MAX_FRAME_SIZE * 100.
> >
> > Due to other quirks of the ZC configuration in ice, panic is not observed
> > in ZC mode, but tailroom growing still fails when it should not.
> >
> > Use fill queue buffer truesize instead of DMA write size in XDP RxQ info.
> > Fix ZC mode too by using the new helper.
> >
> > Fixes: 2fba7dc5157b ("ice: Add support for XDP multi-buffer on Rx side")
> > Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > ---
> > drivers/net/ethernet/intel/ice/ice_base.c | 9 ++++-----
> > 1 file changed, 4 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
> > index 511d803cf0a4..27ab899a4052 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_base.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_base.c
> > @@ -659,7 +659,6 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
> > {
> > struct device *dev = ice_pf_to_dev(ring->vsi->back);
> > u32 num_bufs = ICE_DESC_UNUSED(ring);
> > - u32 rx_buf_len;
> > int err;
> >
> > if (ring->vsi->type == ICE_VSI_PF || ring->vsi->type == ICE_VSI_SF) {
> > @@ -669,12 +668,12 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
> > return err;
> >
> > if (ring->xsk_pool) {
> > - rx_buf_len =
> > - xsk_pool_get_rx_frame_size(ring->xsk_pool);
>
> ice_setup_rx_ctx() consumes ring->rx_buf_len. This can't come from
> page_pool when you have configured xsk_pool on a given rxq. I believe we
> need a setting:
>
> ring->rx_buf_len =
> xsk_pool_get_rx_frame_size(ring->xsk_pool);
>
Yes, but doing this via xsk_pool_get_rx_frame_size() as it is now would
introduce a regression due to lack of tailroom, so I decided not to touch this
logic for now, as you intend to improve xsk_pool_get_rx_frame_size() for mbuf
soon.
> > + u32 frag_size =
> > + xsk_pool_get_rx_frag_step(ring->xsk_pool);
> > err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
> > ring->q_index,
> > ring->q_vector->napi.napi_id,
> > - rx_buf_len);
> > + frag_size);
> > if (err)
> > return err;
> > err = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
> > @@ -694,7 +693,7 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
> > err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
> > ring->q_index,
> > ring->q_vector->napi.napi_id,
> > - ring->rx_buf_len);
> > + ring->truesize);
> > if (err)
> > goto err_destroy_fq;
> >
> > --
> > 2.52.0
> >
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH bpf v3 5/9] i40e: fix registering XDP RxQ info
2026-02-17 13:24 [PATCH bpf v3 0/9] Address XDP frags having negative tailroom Larysa Zaremba
` (3 preceding siblings ...)
2026-02-17 13:24 ` [PATCH bpf v3 4/9] ice: change XDP RxQ frag_size from DMA write length to xdp.frame_sz Larysa Zaremba
@ 2026-02-17 13:24 ` Larysa Zaremba
2026-02-17 15:12 ` Loktionov, Aleksandr
2026-02-19 12:00 ` Maciej Fijalkowski
2026-02-17 13:24 ` [PATCH bpf v3 6/9] i40e: use xdp.frame_sz as XDP RxQ info frag_size Larysa Zaremba
` (4 subsequent siblings)
9 siblings, 2 replies; 22+ messages in thread
From: Larysa Zaremba @ 2026-02-17 13:24 UTC (permalink / raw)
To: bpf
Cc: Larysa Zaremba, Claudiu Manoil, Vladimir Oltean, Wei Fang,
Clark Wang, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Tony Nguyen, Przemek Kitszel,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
KP Singh, Hao Luo, Jiri Olsa, Simon Horman, Shuah Khan,
Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov,
Dragos Tatulea
Current way of handling XDP RxQ info in i40e has following problems:
* when xsk_buff_pool is detached, memory model is not unregistered before
registering a new one, this leads to a dangling xsk_buff_pool in the
memory model table
* frag_size is not updated when xsk_buff_pool is detached or when MTU is
changed, this leads to growing tail always failing for multi-buffer
packets.
Couple XDP RxQ info registering with buffer allocations and unregistering
with cleaning the ring.
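The pairing described above (register on configure, unregister on clean or on a failed configure) can be sketched in isolation. This is a minimal userspace model, not the driver code: struct mock_rxq_info and all mock_* functions are hypothetical stand-ins for the XDP RxQ info object and its reg/unreg calls, and the error values are arbitrary.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct xdp_rxq_info: tracks registration only. */
struct mock_rxq_info {
	bool registered;
	unsigned int frag_size;
};

static int mock_rxq_reg(struct mock_rxq_info *rxq, unsigned int frag_size)
{
	if (rxq->registered)
		return -1;	/* double registration is treated as a bug here */
	rxq->registered = true;
	rxq->frag_size = frag_size;
	return 0;
}

static void mock_rxq_unreg(struct mock_rxq_info *rxq)
{
	rxq->registered = false;
	rxq->frag_size = 0;
}

/* Configure: register first, unwind through one label if a later step fails. */
static int mock_configure_ring(struct mock_rxq_info *rxq,
			       unsigned int frag_size, bool later_step_fails)
{
	int err;

	err = mock_rxq_reg(rxq, frag_size);
	if (err)
		return err;	/* nothing to unwind yet */

	if (later_step_fails) {
		err = -2;	/* arbitrary error code for the model */
		goto unreg_xdp;
	}

	return 0;

unreg_xdp:
	mock_rxq_unreg(rxq);
	return err;
}

/* Clean: drop the registration so the next configure picks up a new size. */
static void mock_clean_ring(struct mock_rxq_info *rxq)
{
	if (rxq->registered)
		mock_rxq_unreg(rxq);
}
```

In this model, a clean followed by a configure always re-registers with the current frag_size, which is the property the MTU-change case relies on.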
Fixes: a045d2f2d03d ("i40e: set xdp_rxq_info::frag_size")
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
drivers/net/ethernet/intel/i40e/i40e_main.c | 34 ++++++++++++---------
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 5 +--
2 files changed, 22 insertions(+), 17 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index d3bc3207054f..eaa5b65e6daf 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3577,18 +3577,8 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
if (ring->vsi->type != I40E_VSI_MAIN)
goto skip;
- if (!xdp_rxq_info_is_reg(&ring->xdp_rxq)) {
- err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
- ring->queue_index,
- ring->q_vector->napi.napi_id,
- ring->rx_buf_len);
- if (err)
- return err;
- }
-
ring->xsk_pool = i40e_xsk_pool(ring);
if (ring->xsk_pool) {
- xdp_rxq_info_unreg(&ring->xdp_rxq);
ring->rx_buf_len = xsk_pool_get_rx_frame_size(ring->xsk_pool);
err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
ring->queue_index,
@@ -3600,17 +3590,23 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
MEM_TYPE_XSK_BUFF_POOL,
NULL);
if (err)
- return err;
+ goto unreg_xdp;
dev_info(&vsi->back->pdev->dev,
"Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring %d\n",
ring->queue_index);
} else {
+ err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
+ ring->queue_index,
+ ring->q_vector->napi.napi_id,
+ ring->rx_buf_len);
+ if (err)
+ return err;
err = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
MEM_TYPE_PAGE_SHARED,
NULL);
if (err)
- return err;
+ goto unreg_xdp;
}
skip:
@@ -3648,7 +3644,8 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
dev_info(&vsi->back->pdev->dev,
"Failed to clear LAN Rx queue context on Rx ring %d (pf_q %d), error: %d\n",
ring->queue_index, pf_q, err);
- return -ENOMEM;
+ err = -ENOMEM;
+ goto unreg_xdp;
}
/* set the context in the HMC */
@@ -3657,7 +3654,8 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
dev_info(&vsi->back->pdev->dev,
"Failed to set LAN Rx queue context on Rx ring %d (pf_q %d), error: %d\n",
ring->queue_index, pf_q, err);
- return -ENOMEM;
+ err = -ENOMEM;
+ goto unreg_xdp;
}
/* configure Rx buffer alignment */
@@ -3665,7 +3663,8 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
if (I40E_2K_TOO_SMALL_WITH_PADDING) {
dev_info(&vsi->back->pdev->dev,
"2k Rx buffer is too small to fit standard MTU and skb_shared_info\n");
- return -EOPNOTSUPP;
+ err = -EOPNOTSUPP;
+ goto unreg_xdp;
}
clear_ring_build_skb_enabled(ring);
} else {
@@ -3695,6 +3694,11 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
}
return 0;
+unreg_xdp:
+ if (ring->vsi->type == I40E_VSI_MAIN)
+ xdp_rxq_info_unreg(&ring->xdp_rxq);
+
+ return err;
}
/**
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index cc0b9efc2637..816179c7e271 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1470,6 +1470,9 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
if (!rx_ring->rx_bi)
return;
+ if (xdp_rxq_info_is_reg(&rx_ring->xdp_rxq))
+ xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
+
if (rx_ring->xsk_pool) {
i40e_xsk_clean_rx_ring(rx_ring);
goto skip_free;
@@ -1527,8 +1530,6 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
void i40e_free_rx_resources(struct i40e_ring *rx_ring)
{
i40e_clean_rx_ring(rx_ring);
- if (rx_ring->vsi->type == I40E_VSI_MAIN)
- xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
rx_ring->xdp_prog = NULL;
kfree(rx_ring->rx_bi);
rx_ring->rx_bi = NULL;
--
2.52.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* RE: [PATCH bpf v3 5/9] i40e: fix registering XDP RxQ info
2026-02-17 13:24 ` [PATCH bpf v3 5/9] i40e: fix registering XDP RxQ info Larysa Zaremba
@ 2026-02-17 15:12 ` Loktionov, Aleksandr
2026-02-19 12:00 ` Maciej Fijalkowski
1 sibling, 0 replies; 22+ messages in thread
From: Loktionov, Aleksandr @ 2026-02-17 15:12 UTC (permalink / raw)
To: Zaremba, Larysa, bpf@vger.kernel.org
Cc: Claudiu Manoil, Vladimir Oltean, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Nguyen, Anthony L, Kitszel, Przemyslaw,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
KP Singh, Hao Luo, Jiri Olsa, Simon Horman, Shuah Khan,
Lobakin, Aleksander, Fijalkowski, Maciej,
Bastien Curutchet (eBPF Foundation), Vyavahare, Tushar,
Jason Xing, Ricardo B. Marlière, Eelco Chaudron,
Lorenzo Bianconi, Toke Hoiland-Jorgensen, imx@lists.linux.dev,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
intel-wired-lan@lists.osuosl.org, linux-kselftest@vger.kernel.org,
Dragos Tatulea
> -----Original Message-----
> From: Zaremba, Larysa <larysa.zaremba@intel.com>
> Sent: Tuesday, February 17, 2026 2:25 PM
> To: bpf@vger.kernel.org
> Cc: Zaremba, Larysa <larysa.zaremba@intel.com>; Claudiu Manoil
> <claudiu.manoil@nxp.com>; Vladimir Oltean <vladimir.oltean@nxp.com>;
> Wei Fang <wei.fang@nxp.com>; Clark Wang <xiaoning.wang@nxp.com>;
> Andrew Lunn <andrew+netdev@lunn.ch>; David S. Miller
> <davem@davemloft.net>; Eric Dumazet <edumazet@google.com>; Jakub
> Kicinski <kuba@kernel.org>; Paolo Abeni <pabeni@redhat.com>; Nguyen,
> Anthony L <anthony.l.nguyen@intel.com>; Kitszel, Przemyslaw
> <przemyslaw.kitszel@intel.com>; Alexei Starovoitov <ast@kernel.org>;
> Daniel Borkmann <daniel@iogearbox.net>; Jesper Dangaard Brouer
> <hawk@kernel.org>; John Fastabend <john.fastabend@gmail.com>;
> Stanislav Fomichev <sdf@fomichev.me>; Andrii Nakryiko
> <andrii@kernel.org>; Martin KaFai Lau <martin.lau@linux.dev>; Eduard
> Zingerman <eddyz87@gmail.com>; Song Liu <song@kernel.org>; Yonghong
> Song <yonghong.song@linux.dev>; KP Singh <kpsingh@kernel.org>; Hao Luo
> <haoluo@google.com>; Jiri Olsa <jolsa@kernel.org>; Simon Horman
> <horms@kernel.org>; Shuah Khan <shuah@kernel.org>; Lobakin, Aleksander
> <aleksander.lobakin@intel.com>; Fijalkowski, Maciej
> <maciej.fijalkowski@intel.com>; Bastien Curutchet (eBPF Foundation)
> <bastien.curutchet@bootlin.com>; Vyavahare, Tushar
> <tushar.vyavahare@intel.com>; Jason Xing <kernelxing@tencent.com>;
> Ricardo B. Marlière <rbm@suse.com>; Eelco Chaudron
> <echaudro@redhat.com>; Lorenzo Bianconi <lorenzo@kernel.org>; Toke
> Hoiland-Jorgensen <toke@redhat.com>; imx@lists.linux.dev;
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org; intel-wired-
> lan@lists.osuosl.org; linux-kselftest@vger.kernel.org; Loktionov,
> Aleksandr <aleksandr.loktionov@intel.com>; Dragos Tatulea
> <dtatulea@nvidia.com>
> Subject: [PATCH bpf v3 5/9] i40e: fix registering XDP RxQ info
>
> Current way of handling XDP RxQ info in i40e has following problems:
> * when xsk_buff_pool is detached, memory model is not unregistered before
> registering a new one, this leads to a dangling xsk_buff_pool in the
> memory model table
> * frag_size is not updated when xsk_buff_pool is detached or when MTU is
> changed, this leads to growing tail always failing for multi-buffer
> packets.
>
> Couple XDP RxQ info registering with buffer allocations and
> unregistering with cleaning the ring.
>
> Fixes: a045d2f2d03d ("i40e: set xdp_rxq_info::frag_size")
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> ---
> drivers/net/ethernet/intel/i40e/i40e_main.c | 34 ++++++++++++---------
> drivers/net/ethernet/intel/i40e/i40e_txrx.c | 5 +--
> 2 files changed, 22 insertions(+), 17 deletions(-)
>
...
> --
> 2.52.0
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH bpf v3 5/9] i40e: fix registering XDP RxQ info
2026-02-17 13:24 ` [PATCH bpf v3 5/9] i40e: fix registering XDP RxQ info Larysa Zaremba
2026-02-17 15:12 ` Loktionov, Aleksandr
@ 2026-02-19 12:00 ` Maciej Fijalkowski
2026-03-02 14:23 ` Larysa Zaremba
1 sibling, 1 reply; 22+ messages in thread
From: Maciej Fijalkowski @ 2026-02-19 12:00 UTC (permalink / raw)
To: Larysa Zaremba
Cc: bpf, Claudiu Manoil, Vladimir Oltean, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Tony Nguyen, Przemek Kitszel, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Andrii Nakryiko, Martin KaFai Lau,
Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Hao Luo,
Jiri Olsa, Simon Horman, Shuah Khan, Alexander Lobakin,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov,
Dragos Tatulea
On Tue, Feb 17, 2026 at 02:24:43PM +0100, Larysa Zaremba wrote:
> Current way of handling XDP RxQ info in i40e has following problems:
> * when xsk_buff_pool is detached, memory model is not unregistered before
> registering a new one, this leads to a dangling xsk_buff_pool in the
> memory model table
What is 'memory model table' in this context?
I believe you are referring to a case where XDP prog is kept alive on
interface but you close one socket and then bind the other one?
> * frag_size is not updated when xsk_buff_pool is detached or when MTU is
> changed, this leads to growing tail always failing for multi-buffer
> packets.
Good catch, I now see that i40e_change_mtu() only does the link flap and
i40e_free_rx_resources() is not called in this path.
>
> Couple XDP RxQ info registering with buffer allocations and unregistering
> with cleaning the ring.
>
> Fixes: a045d2f2d03d ("i40e: set xdp_rxq_info::frag_size")
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> ---
> drivers/net/ethernet/intel/i40e/i40e_main.c | 34 ++++++++++++---------
> drivers/net/ethernet/intel/i40e/i40e_txrx.c | 5 +--
> 2 files changed, 22 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index d3bc3207054f..eaa5b65e6daf 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -3577,18 +3577,8 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
> if (ring->vsi->type != I40E_VSI_MAIN)
> goto skip;
>
> - if (!xdp_rxq_info_is_reg(&ring->xdp_rxq)) {
> - err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
> - ring->queue_index,
> - ring->q_vector->napi.napi_id,
> - ring->rx_buf_len);
> - if (err)
> - return err;
> - }
> -
> ring->xsk_pool = i40e_xsk_pool(ring);
> if (ring->xsk_pool) {
> - xdp_rxq_info_unreg(&ring->xdp_rxq);
> ring->rx_buf_len = xsk_pool_get_rx_frame_size(ring->xsk_pool);
> err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
> ring->queue_index,
> @@ -3600,17 +3590,23 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
> MEM_TYPE_XSK_BUFF_POOL,
> NULL);
> if (err)
> - return err;
> + goto unreg_xdp;
> dev_info(&vsi->back->pdev->dev,
> "Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring %d\n",
> ring->queue_index);
>
> } else {
> + err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
> + ring->queue_index,
> + ring->q_vector->napi.napi_id,
> + ring->rx_buf_len);
> + if (err)
> + return err;
> err = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
> MEM_TYPE_PAGE_SHARED,
> NULL);
> if (err)
> - return err;
> + goto unreg_xdp;
> }
>
> skip:
> @@ -3648,7 +3644,8 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
> dev_info(&vsi->back->pdev->dev,
> "Failed to clear LAN Rx queue context on Rx ring %d (pf_q %d), error: %d\n",
> ring->queue_index, pf_q, err);
> - return -ENOMEM;
> + err = -ENOMEM;
> + goto unreg_xdp;
> }
>
> /* set the context in the HMC */
> @@ -3657,7 +3654,8 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
> dev_info(&vsi->back->pdev->dev,
> "Failed to set LAN Rx queue context on Rx ring %d (pf_q %d), error: %d\n",
> ring->queue_index, pf_q, err);
> - return -ENOMEM;
> + err = -ENOMEM;
> + goto unreg_xdp;
> }
>
> /* configure Rx buffer alignment */
> @@ -3665,7 +3663,8 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
> if (I40E_2K_TOO_SMALL_WITH_PADDING) {
> dev_info(&vsi->back->pdev->dev,
> "2k Rx buffer is too small to fit standard MTU and skb_shared_info\n");
> - return -EOPNOTSUPP;
> + err = -EOPNOTSUPP;
> + goto unreg_xdp;
> }
> clear_ring_build_skb_enabled(ring);
> } else {
> @@ -3695,6 +3694,11 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
> }
>
> return 0;
> +unreg_xdp:
> + if (ring->vsi->type == I40E_VSI_MAIN)
> + xdp_rxq_info_unreg(&ring->xdp_rxq);
> +
> + return err;
> }
>
> /**
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index cc0b9efc2637..816179c7e271 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -1470,6 +1470,9 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
> if (!rx_ring->rx_bi)
> return;
>
> + if (xdp_rxq_info_is_reg(&rx_ring->xdp_rxq))
> + xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
> +
> if (rx_ring->xsk_pool) {
> i40e_xsk_clean_rx_ring(rx_ring);
> goto skip_free;
> @@ -1527,8 +1530,6 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
> void i40e_free_rx_resources(struct i40e_ring *rx_ring)
> {
> i40e_clean_rx_ring(rx_ring);
> - if (rx_ring->vsi->type == I40E_VSI_MAIN)
> - xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
> rx_ring->xdp_prog = NULL;
> kfree(rx_ring->rx_bi);
> rx_ring->rx_bi = NULL;
> --
> 2.52.0
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH bpf v3 5/9] i40e: fix registering XDP RxQ info
2026-02-19 12:00 ` Maciej Fijalkowski
@ 2026-03-02 14:23 ` Larysa Zaremba
0 siblings, 0 replies; 22+ messages in thread
From: Larysa Zaremba @ 2026-03-02 14:23 UTC (permalink / raw)
To: Maciej Fijalkowski
Cc: bpf, Claudiu Manoil, Vladimir Oltean, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Tony Nguyen, Przemek Kitszel, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Andrii Nakryiko, Martin KaFai Lau,
Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Hao Luo,
Jiri Olsa, Simon Horman, Shuah Khan, Alexander Lobakin,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov,
Dragos Tatulea
On Thu, Feb 19, 2026 at 01:00:05PM +0100, Maciej Fijalkowski wrote:
> On Tue, Feb 17, 2026 at 02:24:43PM +0100, Larysa Zaremba wrote:
> > Current way of handling XDP RxQ info in i40e has following problems:
> > * when xsk_buff_pool is detached, memory model is not unregistered before
> > registering a new one, this leads to a dangling xsk_buff_pool in the
> > memory model table
>
> What is 'memory model table' in this context?
>
I mean the hash table where we register the allocator when registering XDP RxQ
info.
The paragraph is wrong, we do not have such a problem currently, as we do not
pass any non-NULL values as allocator. So I will correct the commit
message.
> I believe you are referring to a case where XDP prog is kept alive on
> interface but you close one socket and then bind the other one?
>
> > * frag_size is not updated when xsk_buff_pool is detached or when MTU is
> > changed, this leads to growing tail always failing for multi-buffer
> > packets.
>
> Good catch, I now see that i40e_change_mtu() only does the link flap and
> i40e_free_rx_resources() is not called in this path.
>
> >
> > Couple XDP RxQ info registering with buffer allocations and unregistering
> > with cleaning the ring.
> >
> > Fixes: a045d2f2d03d ("i40e: set xdp_rxq_info::frag_size")
> > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > ---
> > drivers/net/ethernet/intel/i40e/i40e_main.c | 34 ++++++++++++---------
> > drivers/net/ethernet/intel/i40e/i40e_txrx.c | 5 +--
> > 2 files changed, 22 insertions(+), 17 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> > index d3bc3207054f..eaa5b65e6daf 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> > @@ -3577,18 +3577,8 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
> > if (ring->vsi->type != I40E_VSI_MAIN)
> > goto skip;
> >
> > - if (!xdp_rxq_info_is_reg(&ring->xdp_rxq)) {
> > - err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
> > - ring->queue_index,
> > - ring->q_vector->napi.napi_id,
> > - ring->rx_buf_len);
> > - if (err)
> > - return err;
> > - }
> > -
> > ring->xsk_pool = i40e_xsk_pool(ring);
> > if (ring->xsk_pool) {
> > - xdp_rxq_info_unreg(&ring->xdp_rxq);
> > ring->rx_buf_len = xsk_pool_get_rx_frame_size(ring->xsk_pool);
> > err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
> > ring->queue_index,
> > @@ -3600,17 +3590,23 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
> > MEM_TYPE_XSK_BUFF_POOL,
> > NULL);
> > if (err)
> > - return err;
> > + goto unreg_xdp;
> > dev_info(&vsi->back->pdev->dev,
> > "Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring %d\n",
> > ring->queue_index);
> >
> > } else {
> > + err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
> > + ring->queue_index,
> > + ring->q_vector->napi.napi_id,
> > + ring->rx_buf_len);
> > + if (err)
> > + return err;
> > err = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
> > MEM_TYPE_PAGE_SHARED,
> > NULL);
> > if (err)
> > - return err;
> > + goto unreg_xdp;
> > }
> >
> > skip:
> > @@ -3648,7 +3644,8 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
> > dev_info(&vsi->back->pdev->dev,
> > "Failed to clear LAN Rx queue context on Rx ring %d (pf_q %d), error: %d\n",
> > ring->queue_index, pf_q, err);
> > - return -ENOMEM;
> > + err = -ENOMEM;
> > + goto unreg_xdp;
> > }
> >
> > /* set the context in the HMC */
> > @@ -3657,7 +3654,8 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
> > dev_info(&vsi->back->pdev->dev,
> > "Failed to set LAN Rx queue context on Rx ring %d (pf_q %d), error: %d\n",
> > ring->queue_index, pf_q, err);
> > - return -ENOMEM;
> > + err = -ENOMEM;
> > + goto unreg_xdp;
> > }
> >
> > /* configure Rx buffer alignment */
> > @@ -3665,7 +3663,8 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
> > if (I40E_2K_TOO_SMALL_WITH_PADDING) {
> > dev_info(&vsi->back->pdev->dev,
> > "2k Rx buffer is too small to fit standard MTU and skb_shared_info\n");
> > - return -EOPNOTSUPP;
> > + err = -EOPNOTSUPP;
> > + goto unreg_xdp;
> > }
> > clear_ring_build_skb_enabled(ring);
> > } else {
> > @@ -3695,6 +3694,11 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
> > }
> >
> > return 0;
> > +unreg_xdp:
> > + if (ring->vsi->type == I40E_VSI_MAIN)
> > + xdp_rxq_info_unreg(&ring->xdp_rxq);
> > +
> > + return err;
> > }
> >
> > /**
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > index cc0b9efc2637..816179c7e271 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > @@ -1470,6 +1470,9 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
> > if (!rx_ring->rx_bi)
> > return;
> >
> > + if (xdp_rxq_info_is_reg(&rx_ring->xdp_rxq))
> > + xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
> > +
> > if (rx_ring->xsk_pool) {
> > i40e_xsk_clean_rx_ring(rx_ring);
> > goto skip_free;
> > @@ -1527,8 +1530,6 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
> > void i40e_free_rx_resources(struct i40e_ring *rx_ring)
> > {
> > i40e_clean_rx_ring(rx_ring);
> > - if (rx_ring->vsi->type == I40E_VSI_MAIN)
> > - xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
> > rx_ring->xdp_prog = NULL;
> > kfree(rx_ring->rx_bi);
> > rx_ring->rx_bi = NULL;
> > --
> > 2.52.0
> >
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH bpf v3 6/9] i40e: use xdp.frame_sz as XDP RxQ info frag_size
2026-02-17 13:24 [PATCH bpf v3 0/9] Address XDP frags having negative tailroom Larysa Zaremba
` (4 preceding siblings ...)
2026-02-17 13:24 ` [PATCH bpf v3 5/9] i40e: fix registering XDP RxQ info Larysa Zaremba
@ 2026-02-17 13:24 ` Larysa Zaremba
2026-02-17 15:13 ` Loktionov, Aleksandr
2026-02-17 13:24 ` [PATCH bpf v3 7/9] libeth, idpf: use truesize " Larysa Zaremba
` (3 subsequent siblings)
9 siblings, 1 reply; 22+ messages in thread
From: Larysa Zaremba @ 2026-02-17 13:24 UTC (permalink / raw)
To: bpf
Cc: Larysa Zaremba, Claudiu Manoil, Vladimir Oltean, Wei Fang,
Clark Wang, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Tony Nguyen, Przemek Kitszel,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
KP Singh, Hao Luo, Jiri Olsa, Simon Horman, Shuah Khan,
Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov,
Dragos Tatulea
The only user of frag_size field in XDP RxQ info is
bpf_xdp_frags_increase_tail(). It clearly expects whole buffer size instead
of DMA write size. Different assumptions in i40e driver configuration lead
to negative tailroom.
Set frag_size to the same value as frame_sz in shared pages mode, use new
helper to set frag_size when AF_XDP ZC is active.
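To see why the registered value matters, here is a small arithmetic sketch. It is an illustrative model under stated assumptions, not the kernel implementation: frag_tailroom() is a hypothetical helper whose modulo formula is modeled on the title of patch 1/9, and the sizes used afterwards (4096-byte page, 2048-byte buffers, 1536-byte DMA write length) are made-up example values.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative model only, not the kernel code: tailroom of one frag when a
 * page is assumed to be split into equal chunks of frag_size bytes.
 * page_off is the offset of the frag data from the start of the page,
 * len is the number of bytes already used by the frag.
 */
static int32_t frag_tailroom(uint32_t frag_size, uint32_t page_off,
			     uint32_t len)
{
	assert(frag_size > 0);	/* a zero frag_size is not handled here */
	return (int32_t)frag_size - (int32_t)len -
	       (int32_t)(page_off % frag_size);
}
```

For a frag holding 1500 bytes at the start of the second half of the page, frag_tailroom(2048, 2048, 1500) yields 548, while registering the smaller DMA write length gives frag_tailroom(1536, 2048, 1500) = -476, because 2048 % 1536 leaves a bogus 512-byte offset.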
Fixes: a045d2f2d03d ("i40e: set xdp_rxq_info::frag_size")
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
drivers/net/ethernet/intel/i40e/i40e_main.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index eaa5b65e6daf..e012a50a0448 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3561,6 +3561,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
struct i40e_vsi *vsi = ring->vsi;
u32 chain_len = vsi->back->hw.func_caps.rx_buf_chain_len;
u16 pf_q = vsi->base_queue + ring->queue_index;
+ u32 xdp_frame_sz = i40e_rx_pg_size(ring) / 2;
struct i40e_hw *hw = &vsi->back->hw;
struct i40e_hmc_obj_rxq rx_ctx;
int err = 0;
@@ -3579,11 +3580,12 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
ring->xsk_pool = i40e_xsk_pool(ring);
if (ring->xsk_pool) {
+ xdp_frame_sz = xsk_pool_get_rx_frag_step(ring->xsk_pool);
ring->rx_buf_len = xsk_pool_get_rx_frame_size(ring->xsk_pool);
err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
ring->queue_index,
ring->q_vector->napi.napi_id,
- ring->rx_buf_len);
+ xdp_frame_sz);
if (err)
return err;
err = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
@@ -3599,7 +3601,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
ring->queue_index,
ring->q_vector->napi.napi_id,
- ring->rx_buf_len);
+ xdp_frame_sz);
if (err)
return err;
err = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
@@ -3610,7 +3612,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
}
skip:
- xdp_init_buff(&ring->xdp, i40e_rx_pg_size(ring) / 2, &ring->xdp_rxq);
+ xdp_init_buff(&ring->xdp, xdp_frame_sz, &ring->xdp_rxq);
rx_ctx.dbuff = DIV_ROUND_UP(ring->rx_buf_len,
BIT_ULL(I40E_RXQ_CTX_DBUFF_SHIFT));
--
2.52.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* RE: [PATCH bpf v3 6/9] i40e: use xdp.frame_sz as XDP RxQ info frag_size
2026-02-17 13:24 ` [PATCH bpf v3 6/9] i40e: use xdp.frame_sz as XDP RxQ info frag_size Larysa Zaremba
@ 2026-02-17 15:13 ` Loktionov, Aleksandr
0 siblings, 0 replies; 22+ messages in thread
From: Loktionov, Aleksandr @ 2026-02-17 15:13 UTC (permalink / raw)
To: Zaremba, Larysa, bpf@vger.kernel.org
Cc: Claudiu Manoil, Vladimir Oltean, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Nguyen, Anthony L, Kitszel, Przemyslaw,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
KP Singh, Hao Luo, Jiri Olsa, Simon Horman, Shuah Khan,
Lobakin, Aleksander, Fijalkowski, Maciej,
Bastien Curutchet (eBPF Foundation), Vyavahare, Tushar,
Jason Xing, Ricardo B. Marlière, Eelco Chaudron,
Lorenzo Bianconi, Toke Hoiland-Jorgensen, imx@lists.linux.dev,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
intel-wired-lan@lists.osuosl.org, linux-kselftest@vger.kernel.org,
Dragos Tatulea
> -----Original Message-----
> From: Zaremba, Larysa <larysa.zaremba@intel.com>
> Sent: Tuesday, February 17, 2026 2:25 PM
> To: bpf@vger.kernel.org
> Cc: Zaremba, Larysa <larysa.zaremba@intel.com>; Claudiu Manoil
> <claudiu.manoil@nxp.com>; Vladimir Oltean <vladimir.oltean@nxp.com>;
> Wei Fang <wei.fang@nxp.com>; Clark Wang <xiaoning.wang@nxp.com>;
> Andrew Lunn <andrew+netdev@lunn.ch>; David S. Miller
> <davem@davemloft.net>; Eric Dumazet <edumazet@google.com>; Jakub
> Kicinski <kuba@kernel.org>; Paolo Abeni <pabeni@redhat.com>; Nguyen,
> Anthony L <anthony.l.nguyen@intel.com>; Kitszel, Przemyslaw
> <przemyslaw.kitszel@intel.com>; Alexei Starovoitov <ast@kernel.org>;
> Daniel Borkmann <daniel@iogearbox.net>; Jesper Dangaard Brouer
> <hawk@kernel.org>; John Fastabend <john.fastabend@gmail.com>;
> Stanislav Fomichev <sdf@fomichev.me>; Andrii Nakryiko
> <andrii@kernel.org>; Martin KaFai Lau <martin.lau@linux.dev>; Eduard
> Zingerman <eddyz87@gmail.com>; Song Liu <song@kernel.org>; Yonghong
> Song <yonghong.song@linux.dev>; KP Singh <kpsingh@kernel.org>; Hao Luo
> <haoluo@google.com>; Jiri Olsa <jolsa@kernel.org>; Simon Horman
> <horms@kernel.org>; Shuah Khan <shuah@kernel.org>; Lobakin, Aleksander
> <aleksander.lobakin@intel.com>; Fijalkowski, Maciej
> <maciej.fijalkowski@intel.com>; Bastien Curutchet (eBPF Foundation)
> <bastien.curutchet@bootlin.com>; Vyavahare, Tushar
> <tushar.vyavahare@intel.com>; Jason Xing <kernelxing@tencent.com>;
> Ricardo B. Marlière <rbm@suse.com>; Eelco Chaudron
> <echaudro@redhat.com>; Lorenzo Bianconi <lorenzo@kernel.org>; Toke
> Hoiland-Jorgensen <toke@redhat.com>; imx@lists.linux.dev;
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org; intel-wired-
> lan@lists.osuosl.org; linux-kselftest@vger.kernel.org; Loktionov,
> Aleksandr <aleksandr.loktionov@intel.com>; Dragos Tatulea
> <dtatulea@nvidia.com>
> Subject: [PATCH bpf v3 6/9] i40e: use xdp.frame_sz as XDP RxQ info
> frag_size
>
> The only user of frag_size field in XDP RxQ info is
> bpf_xdp_frags_increase_tail(). It clearly expects whole buffer size
> instead of DMA write size. Different assumptions in i40e driver
> configuration lead to negative tailroom.
>
> Set frag_size to the same value as frame_sz in shared pages mode, use
> new helper to set frag_size when AF_XDP ZC is active.
>
> Fixes: a045d2f2d03d ("i40e: set xdp_rxq_info::frag_size")
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> ---
> drivers/net/ethernet/intel/i40e/i40e_main.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index eaa5b65e6daf..e012a50a0448 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -3561,6 +3561,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
> struct i40e_vsi *vsi = ring->vsi;
> u32 chain_len = vsi->back->hw.func_caps.rx_buf_chain_len;
> u16 pf_q = vsi->base_queue + ring->queue_index;
> + u32 xdp_frame_sz = i40e_rx_pg_size(ring) / 2;
> struct i40e_hw *hw = &vsi->back->hw;
> struct i40e_hmc_obj_rxq rx_ctx;
> int err = 0;
> @@ -3579,11 +3580,12 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
>
> ring->xsk_pool = i40e_xsk_pool(ring);
> if (ring->xsk_pool) {
> + xdp_frame_sz = xsk_pool_get_rx_frag_step(ring->xsk_pool);
> ring->rx_buf_len = xsk_pool_get_rx_frame_size(ring->xsk_pool);
> err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
> ring->queue_index,
> ring->q_vector->napi.napi_id,
> - ring->rx_buf_len);
> + xdp_frame_sz);
> if (err)
> return err;
> err = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
> @@ -3599,7 +3601,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
> err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
> ring->queue_index,
> ring->q_vector->napi.napi_id,
> - ring->rx_buf_len);
> + xdp_frame_sz);
> if (err)
> return err;
> err = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
> @@ -3610,7 +3612,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
> }
>
> skip:
> - xdp_init_buff(&ring->xdp, i40e_rx_pg_size(ring) / 2, &ring->xdp_rxq);
> + xdp_init_buff(&ring->xdp, xdp_frame_sz, &ring->xdp_rxq);
>
> rx_ctx.dbuff = DIV_ROUND_UP(ring->rx_buf_len,
> BIT_ULL(I40E_RXQ_CTX_DBUFF_SHIFT));
> --
> 2.52.0
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH bpf v3 7/9] libeth, idpf: use truesize as XDP RxQ info frag_size
2026-02-17 13:24 [PATCH bpf v3 0/9] Address XDP frags having negative tailroom Larysa Zaremba
` (5 preceding siblings ...)
2026-02-17 13:24 ` [PATCH bpf v3 6/9] i40e: use xdp.frame_sz as XDP RxQ info frag_size Larysa Zaremba
@ 2026-02-17 13:24 ` Larysa Zaremba
2026-02-17 15:06 ` Loktionov, Aleksandr
2026-02-17 16:19 ` Alexander Lobakin
2026-02-17 13:24 ` [PATCH bpf v3 8/9] net: enetc: " Larysa Zaremba
` (2 subsequent siblings)
9 siblings, 2 replies; 22+ messages in thread
From: Larysa Zaremba @ 2026-02-17 13:24 UTC (permalink / raw)
To: bpf
Cc: Larysa Zaremba, Claudiu Manoil, Vladimir Oltean, Wei Fang,
Clark Wang, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Tony Nguyen, Przemek Kitszel,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
KP Singh, Hao Luo, Jiri Olsa, Simon Horman, Shuah Khan,
Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov,
Dragos Tatulea
The only user of frag_size field in XDP RxQ info is
bpf_xdp_frags_increase_tail(). It clearly expects whole buffer size instead
of DMA write size. Different assumptions in idpf driver configuration lead
to negative tailroom.
To make it worse, buffer sizes are not actually uniform in idpf when
splitq is enabled, as there are several buffer queues, so rxq->rx_buf_size
is meaningless in this case.
Use truesize of the first bufq in AF_XDP ZC, as there is only one. Disable
growing tail for regular splitq.
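As a rough illustration of why registering a frag_size of 0 disables tail
growing, here is a minimal user-space sketch of the guard that
bpf_xdp_frags_increase_tail() applies before computing tailroom. The
struct and function names (rxq_model, buff_model, grow_tail_supported) are
hypothetical stand-ins, not kernel API; only the condition itself is taken
from net/core/filter.c.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for xdp_rxq_info and xdp_buff. */
struct rxq_model {
	uint32_t frag_size;	/* 0 means "no uniform frag size known" */
};

struct buff_model {
	uint32_t frame_sz;
};

/*
 * Mirrors the guard in bpf_xdp_frags_increase_tail(): growing the tail of
 * a fragment is refused when frag_size is 0 or larger than frame_sz.
 */
int grow_tail_supported(const struct rxq_model *rxq,
			const struct buff_model *xdp)
{
	assert(rxq && xdp);

	if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz)
		return -EOPNOTSUPP;

	return 0;
}
```

With this model, a regular splitq queue (frag_size left at 0) is rejected
with -EOPNOTSUPP, while an AF_XDP ZC queue with a non-zero truesize that
does not exceed frame_sz goes on to the tailroom calculation.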
Fixes: ac8a861f632e ("idpf: prepare structures to support XDP")
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
drivers/net/ethernet/intel/idpf/xdp.c | 6 +++++-
drivers/net/ethernet/intel/idpf/xsk.c | 1 +
drivers/net/ethernet/intel/libeth/xsk.c | 1 +
include/net/libeth/xsk.h | 3 +++
4 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/idpf/xdp.c b/drivers/net/ethernet/intel/idpf/xdp.c
index 958d16f87424..7d91f21174de 100644
--- a/drivers/net/ethernet/intel/idpf/xdp.c
+++ b/drivers/net/ethernet/intel/idpf/xdp.c
@@ -46,11 +46,15 @@ static int __idpf_xdp_rxq_info_init(struct idpf_rx_queue *rxq, void *arg)
{
const struct idpf_vport *vport = rxq->q_vector->vport;
bool split = idpf_is_queue_model_split(vport->rxq_model);
+ u32 frag_size = 0;
int err;
+ if (idpf_queue_has(XSK, rxq))
+ frag_size = rxq->bufq_sets[0].bufq.truesize;
+
err = __xdp_rxq_info_reg(&rxq->xdp_rxq, vport->netdev, rxq->idx,
rxq->q_vector->napi.napi_id,
- rxq->rx_buf_size);
+ frag_size);
if (err)
return err;
diff --git a/drivers/net/ethernet/intel/idpf/xsk.c b/drivers/net/ethernet/intel/idpf/xsk.c
index fd2cc43ab43c..95a665cb2f33 100644
--- a/drivers/net/ethernet/intel/idpf/xsk.c
+++ b/drivers/net/ethernet/intel/idpf/xsk.c
@@ -401,6 +401,7 @@ int idpf_xskfq_init(struct idpf_buf_queue *bufq)
bufq->pending = fq.pending;
bufq->thresh = fq.thresh;
bufq->rx_buf_size = fq.buf_len;
+ bufq->truesize = fq.truesize;
if (!idpf_xskfq_refill(bufq))
netdev_err(bufq->pool->netdev,
diff --git a/drivers/net/ethernet/intel/libeth/xsk.c b/drivers/net/ethernet/intel/libeth/xsk.c
index 846e902e31b6..4882951d5c9c 100644
--- a/drivers/net/ethernet/intel/libeth/xsk.c
+++ b/drivers/net/ethernet/intel/libeth/xsk.c
@@ -167,6 +167,7 @@ int libeth_xskfq_create(struct libeth_xskfq *fq)
fq->pending = fq->count;
fq->thresh = libeth_xdp_queue_threshold(fq->count);
fq->buf_len = xsk_pool_get_rx_frame_size(fq->pool);
+ fq->truesize = xsk_pool_get_rx_frag_step(fq->pool);
return 0;
}
diff --git a/include/net/libeth/xsk.h b/include/net/libeth/xsk.h
index 481a7b28e6f2..82b5d21aae87 100644
--- a/include/net/libeth/xsk.h
+++ b/include/net/libeth/xsk.h
@@ -597,6 +597,7 @@ __libeth_xsk_run_pass(struct libeth_xdp_buff *xdp,
* @pending: current number of XSkFQEs to refill
* @thresh: threshold below which the queue is refilled
* @buf_len: HW-writeable length per each buffer
+ * @truesize: step between consecutive buffers, 0 if none exists
* @nid: ID of the closest NUMA node with memory
*/
struct libeth_xskfq {
@@ -614,6 +615,8 @@ struct libeth_xskfq {
u32 thresh;
u32 buf_len;
+ u32 truesize;
+
int nid;
};
--
2.52.0
* RE: [PATCH bpf v3 7/9] libeth, idpf: use truesize as XDP RxQ info frag_size
2026-02-17 13:24 ` [PATCH bpf v3 7/9] libeth, idpf: use truesize " Larysa Zaremba
@ 2026-02-17 15:06 ` Loktionov, Aleksandr
2026-02-17 16:19 ` Alexander Lobakin
1 sibling, 0 replies; 22+ messages in thread
From: Loktionov, Aleksandr @ 2026-02-17 15:06 UTC (permalink / raw)
To: Zaremba, Larysa, bpf@vger.kernel.org
Cc: Claudiu Manoil, Vladimir Oltean, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Nguyen, Anthony L, Kitszel, Przemyslaw,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
KP Singh, Hao Luo, Jiri Olsa, Simon Horman, Shuah Khan,
Lobakin, Aleksander, Fijalkowski, Maciej,
Bastien Curutchet (eBPF Foundation), Vyavahare, Tushar,
Jason Xing, Ricardo B. Marlière, Eelco Chaudron,
Lorenzo Bianconi, Toke Hoiland-Jorgensen, imx@lists.linux.dev,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
intel-wired-lan@lists.osuosl.org, linux-kselftest@vger.kernel.org,
Dragos Tatulea
> -----Original Message-----
> From: Zaremba, Larysa <larysa.zaremba@intel.com>
> Sent: Tuesday, February 17, 2026 2:25 PM
> Subject: [PATCH bpf v3 7/9] libeth, idpf: use truesize as XDP RxQ info
> frag_size
>
> The only user of frag_size field in XDP RxQ info is
> bpf_xdp_frags_increase_tail(). It clearly expects whole buffer size
> instead of DMA write size. Different assumptions in idpf driver
> configuration lead to negative tailroom.
>
> To make it worse, buffer sizes are not actually uniform in idpf when
> splitq is enabled, as there are several buffer queues, so rxq-
> >rx_buf_size is meaningless in this case.
>
> Use truesize of the first bufq in AF_XDP ZC, as there is only one.
> Disable growinf tail for regular splitq.
"growinf" -> "growing"?
Otherwise fine.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
* Re: [PATCH bpf v3 7/9] libeth, idpf: use truesize as XDP RxQ info frag_size
2026-02-17 13:24 ` [PATCH bpf v3 7/9] libeth, idpf: use truesize " Larysa Zaremba
2026-02-17 15:06 ` Loktionov, Aleksandr
@ 2026-02-17 16:19 ` Alexander Lobakin
1 sibling, 0 replies; 22+ messages in thread
From: Alexander Lobakin @ 2026-02-17 16:19 UTC (permalink / raw)
To: Larysa Zaremba
Cc: bpf, Claudiu Manoil, Vladimir Oltean, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Tony Nguyen, Przemek Kitszel, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Andrii Nakryiko, Martin KaFai Lau,
Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Hao Luo,
Jiri Olsa, Simon Horman, Shuah Khan, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov,
Dragos Tatulea
From: Zaremba, Larysa <larysa.zaremba@intel.com>
Date: Tue, 17 Feb 2026 14:24:45 +0100
> The only user of frag_size field in XDP RxQ info is
> bpf_xdp_frags_increase_tail(). It clearly expects whole buffer size instead
> of DMA write size. Different assumptions in idpf driver configuration lead
> to negative tailroom.
>
> To make it worse, buffer sizes are not actually uniform in idpf when
> splitq is enabled, as there are several buffer queues, so rxq->rx_buf_size
> is meaningless in this case.
>
> Use truesize of the first bufq in AF_XDP ZC, as there is only one. Disable
> growinf tail for regular splitq.
>
> Fixes: ac8a861f632e ("idpf: prepare structures to support XDP")
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Acked-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Thanks for handling this!
Olek
* [PATCH bpf v3 8/9] net: enetc: use truesize as XDP RxQ info frag_size
2026-02-17 13:24 [PATCH bpf v3 0/9] Address XDP frags having negative tailroom Larysa Zaremba
` (6 preceding siblings ...)
2026-02-17 13:24 ` [PATCH bpf v3 7/9] libeth, idpf: use truesize " Larysa Zaremba
@ 2026-02-17 13:24 ` Larysa Zaremba
2026-02-17 13:24 ` [PATCH bpf v3 9/9] xdp: produce a warning when calculated tailroom is negative Larysa Zaremba
2026-02-27 15:31 ` [PATCH bpf v3 0/9] Address XDP frags having negative tailroom Maciej Fijalkowski
9 siblings, 0 replies; 22+ messages in thread
From: Larysa Zaremba @ 2026-02-17 13:24 UTC (permalink / raw)
To: bpf
Cc: Larysa Zaremba, Claudiu Manoil, Vladimir Oltean, Wei Fang,
Clark Wang, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Tony Nguyen, Przemek Kitszel,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
KP Singh, Hao Luo, Jiri Olsa, Simon Horman, Shuah Khan,
Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov,
Dragos Tatulea
The only user of frag_size field in XDP RxQ info is
bpf_xdp_frags_increase_tail(). It clearly expects truesize instead of DMA
write size. Different assumptions in enetc driver configuration lead to
negative tailroom.
Set frag_size to the same value as frame_sz.
Fixes: 2768b2e2f7d2 ("net: enetc: register XDP RX queues with frag_size")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
drivers/net/ethernet/freescale/enetc/enetc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/freescale/enetc/enetc.c b/drivers/net/ethernet/freescale/enetc/enetc.c
index e380a4f39855..9fdd448e602f 100644
--- a/drivers/net/ethernet/freescale/enetc/enetc.c
+++ b/drivers/net/ethernet/freescale/enetc/enetc.c
@@ -3468,7 +3468,7 @@ static int enetc_int_vector_init(struct enetc_ndev_priv *priv, int i,
priv->rx_ring[i] = bdr;
err = __xdp_rxq_info_reg(&bdr->xdp.rxq, priv->ndev, i, 0,
- ENETC_RXB_DMA_SIZE_XDP);
+ ENETC_RXB_TRUESIZE);
if (err)
goto free_vector;
--
2.52.0
* [PATCH bpf v3 9/9] xdp: produce a warning when calculated tailroom is negative
2026-02-17 13:24 [PATCH bpf v3 0/9] Address XDP frags having negative tailroom Larysa Zaremba
` (7 preceding siblings ...)
2026-02-17 13:24 ` [PATCH bpf v3 8/9] net: enetc: " Larysa Zaremba
@ 2026-02-17 13:24 ` Larysa Zaremba
2026-02-27 15:31 ` [PATCH bpf v3 0/9] Address XDP frags having negative tailroom Maciej Fijalkowski
9 siblings, 0 replies; 22+ messages in thread
From: Larysa Zaremba @ 2026-02-17 13:24 UTC (permalink / raw)
To: bpf
Cc: Larysa Zaremba, Claudiu Manoil, Vladimir Oltean, Wei Fang,
Clark Wang, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Tony Nguyen, Przemek Kitszel,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
KP Singh, Hao Luo, Jiri Olsa, Simon Horman, Shuah Khan,
Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov,
Dragos Tatulea, Martin KaFai Lau
Many Ethernet drivers report the XDP Rx queue frag size as being the same as
the DMA write size. However, the only user of this field, namely
bpf_xdp_frags_increase_tail(), clearly expects a truesize.
Such a difference leads to nonspecific memory corruption under certain
circumstances. For example, in ixgbevf the maximum DMA write size is 3 KB,
so when running xskxceiver's XDP_ADJUST_TAIL_GROW_MULTI_BUFF, a 6K packet
fully uses all DMA-writable space in 2 buffers. This would be fine if
rxq->frag_size were properly set to 4K, but a value of 3K results in a
negative tailroom, because there is a non-zero page offset.
We are supposed to return -EINVAL and be done with it in such a case, but
because tailroom is stored as an unsigned int, it is reported to be somewhere
near UINT_MAX, so the tail is grown even if the requested offset is too
large (it is around 2K in the abovementioned test). This later leads to all
kinds of nonspecific call traces.
[ 7340.337579] xskxceiver[1440]: segfault at 1da718 ip 00007f4161aeac9d sp 00007f41615a6a00 error 6
[ 7340.338040] xskxceiver[1441]: segfault at 7f410000000b ip 00000000004042b5 sp 00007f415bffecf0 error 4
[ 7340.338179] in libc.so.6[61c9d,7f4161aaf000+160000]
[ 7340.339230] in xskxceiver[42b5,400000+69000]
[ 7340.340300] likely on CPU 6 (core 0, socket 6)
[ 7340.340302] Code: ff ff 01 e9 f4 fe ff ff 0f 1f 44 00 00 4c 39 f0 74 73 31 c0 ba 01 00 00 00 f0 0f b1 17 0f 85 ba 00 00 00 49 8b 87 88 00 00 00 <4c> 89 70 08 eb cc 0f 1f 44 00 00 48 8d bd f0 fe ff ff 89 85 ec fe
[ 7340.340888] likely on CPU 3 (core 0, socket 3)
[ 7340.345088] Code: 00 00 00 ba 00 00 00 00 be 00 00 00 00 89 c7 e8 31 ca ff ff 89 45 ec 8b 45 ec 85 c0 78 07 b8 00 00 00 00 eb 46 e8 0b c8 ff ff <8b> 00 83 f8 69 74 24 e8 ff c7 ff ff 8b 00 83 f8 0b 74 18 e8 f3 c7
[ 7340.404334] Oops: general protection fault, probably for non-canonical address 0x6d255010bdffc: 0000 [#1] SMP NOPTI
[ 7340.405972] CPU: 7 UID: 0 PID: 1439 Comm: xskxceiver Not tainted 6.19.0-rc1+ #21 PREEMPT(lazy)
[ 7340.408006] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-5.fc42 04/01/2014
[ 7340.409716] RIP: 0010:lookup_swap_cgroup_id+0x44/0x80
[ 7340.410455] Code: 83 f8 1c 73 39 48 ba ff ff ff ff ff ff ff 03 48 8b 04 c5 20 55 fa bd 48 21 d1 48 89 ca 83 e1 01 48 d1 ea c1 e1 04 48 8d 04 90 <8b> 00 48 83 c4 10 d3 e8 c3 cc cc cc cc 31 c0 e9 98 b7 dd 00 48 89
[ 7340.412787] RSP: 0018:ffffcc5c04f7f6d0 EFLAGS: 00010202
[ 7340.413494] RAX: 0006d255010bdffc RBX: ffff891f477895a8 RCX: 0000000000000010
[ 7340.414431] RDX: 0001c17e3fffffff RSI: 00fa070000000000 RDI: 000382fc7fffffff
[ 7340.415354] RBP: 00fa070000000000 R08: ffffcc5c04f7f8f8 R09: ffffcc5c04f7f7d0
[ 7340.416283] R10: ffff891f4c1a7000 R11: ffffcc5c04f7f9c8 R12: ffffcc5c04f7f7d0
[ 7340.417218] R13: 03ffffffffffffff R14: 00fa06fffffffe00 R15: ffff891f47789500
[ 7340.418229] FS: 0000000000000000(0000) GS:ffff891ffdfaa000(0000) knlGS:0000000000000000
[ 7340.419489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7340.420286] CR2: 00007f415bfffd58 CR3: 0000000103f03002 CR4: 0000000000772ef0
[ 7340.421237] PKRU: 55555554
[ 7340.421623] Call Trace:
[ 7340.421987] <TASK>
[ 7340.422309] ? softleaf_from_pte+0x77/0xa0
[ 7340.422855] swap_pte_batch+0xa7/0x290
[ 7340.423363] zap_nonpresent_ptes.constprop.0.isra.0+0xd1/0x270
[ 7340.424102] zap_pte_range+0x281/0x580
[ 7340.424607] zap_pmd_range.isra.0+0xc9/0x240
[ 7340.425177] unmap_page_range+0x24d/0x420
[ 7340.425714] unmap_vmas+0xa1/0x180
[ 7340.426185] exit_mmap+0xe1/0x3b0
[ 7340.426644] __mmput+0x41/0x150
[ 7340.427098] exit_mm+0xb1/0x110
[ 7340.427539] do_exit+0x1b2/0x460
[ 7340.427992] do_group_exit+0x2d/0xc0
[ 7340.428477] get_signal+0x79d/0x7e0
[ 7340.428957] arch_do_signal_or_restart+0x34/0x100
[ 7340.429571] exit_to_user_mode_loop+0x8e/0x4c0
[ 7340.430159] do_syscall_64+0x188/0x6b0
[ 7340.430672] ? __do_sys_clone3+0xd9/0x120
[ 7340.431212] ? switch_fpu_return+0x4e/0xd0
[ 7340.431761] ? arch_exit_to_user_mode_prepare.isra.0+0xa1/0xc0
[ 7340.432498] ? do_syscall_64+0xbb/0x6b0
[ 7340.433015] ? __handle_mm_fault+0x445/0x690
[ 7340.433582] ? count_memcg_events+0xd6/0x210
[ 7340.434151] ? handle_mm_fault+0x212/0x340
[ 7340.434697] ? do_user_addr_fault+0x2b4/0x7b0
[ 7340.435271] ? clear_bhb_loop+0x30/0x80
[ 7340.435788] ? clear_bhb_loop+0x30/0x80
[ 7340.436299] ? clear_bhb_loop+0x30/0x80
[ 7340.436812] ? clear_bhb_loop+0x30/0x80
[ 7340.437323] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 7340.437973] RIP: 0033:0x7f4161b14169
[ 7340.438468] Code: Unable to access opcode bytes at 0x7f4161b1413f.
[ 7340.439242] RSP: 002b:00007ffc6ebfa770 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[ 7340.440173] RAX: fffffffffffffe00 RBX: 00000000000005a1 RCX: 00007f4161b14169
[ 7340.441061] RDX: 00000000000005a1 RSI: 0000000000000109 RDI: 00007f415bfff990
[ 7340.441943] RBP: 00007ffc6ebfa7a0 R08: 0000000000000000 R09: 00000000ffffffff
[ 7340.442824] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 7340.443707] R13: 0000000000000000 R14: 00007f415bfff990 R15: 00007f415bfff6c0
[ 7340.444586] </TASK>
[ 7340.444922] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit libnvdimm kvm_intel vfat fat kvm snd_pcm irqbypass rapl iTCO_wdt snd_timer intel_pmc_bxt iTCO_vendor_support snd ixgbevf virtio_net soundcore i2c_i801 pcspkr libeth_xdp net_failover i2c_smbus lpc_ich failover libeth virtio_balloon joydev 9p fuse loop zram lz4hc_compress lz4_compress 9pnet_virtio 9pnet netfs ghash_clmulni_intel serio_raw qemu_fw_cfg
[ 7340.449650] ---[ end trace 0000000000000000 ]---
The issue can be fixed in all in-tree drivers, but we cannot just trust OOT
drivers to not do this. Therefore, make tailroom a signed int and produce a
warning when it is negative to prevent such mistakes in the future.
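The wraparound described above can be reproduced in a stand-alone
user-space sketch. It is only a model under stated assumptions: the
function names are hypothetical, the numbers (3K frag_size, a full 3K
fragment, a 256-byte page offset, a 2K grow request) are illustrative
values in the spirit of the ixgbevf case, and unsigned int is assumed to be
32 bits wide. The formula itself follows the one used in
bpf_xdp_frags_increase_tail().

```c
#include <assert.h>
#include <errno.h>

/* Old behaviour: a negative difference wraps to a value near UINT_MAX. */
unsigned int tailroom_unsigned(unsigned int frag_size,
			       unsigned int frag_len,
			       unsigned int frag_off)
{
	assert(frag_size != 0);	/* modulo by zero is undefined */
	return frag_size - frag_len - frag_off % frag_size;
}

/* New behaviour: the same arithmetic, but the sign is preserved. */
int tailroom_signed(unsigned int frag_size, unsigned int frag_len,
		    unsigned int frag_off)
{
	assert(frag_size != 0);
	return (int)frag_size - (int)frag_len - (int)(frag_off % frag_size);
}

/* Returns 0 if growing by @offset is allowed, -EINVAL otherwise. */
int grow_check_unsigned(int offset, unsigned int frag_size,
			unsigned int frag_len, unsigned int frag_off)
{
	unsigned int tailroom = tailroom_unsigned(frag_size, frag_len,
						  frag_off);

	/* offset is converted to unsigned for this comparison */
	if ((unsigned int)offset > tailroom)
		return -EINVAL;
	return 0;
}

int grow_check_signed(int offset, unsigned int frag_size,
		      unsigned int frag_len, unsigned int frag_off)
{
	int tailroom = tailroom_signed(frag_size, frag_len, frag_off);

	if (offset > tailroom)
		return -EINVAL;
	return 0;
}
```

For frag_size = 3072, frag_len = 3072 and frag_off = 256, the unsigned
variant yields 4294967040 and accepts a 2048-byte grow request, while the
signed variant yields -256 and rejects the same request with -EINVAL.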
Fixes: bf25146a5595 ("bpf: add frags support to the bpf_xdp_adjust_tail() API")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
net/core/filter.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/core/filter.c b/net/core/filter.c
index 5f5489665c58..e93d9dc0471a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4151,13 +4151,14 @@ static int bpf_xdp_frags_increase_tail(struct xdp_buff *xdp, int offset)
struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
skb_frag_t *frag = &sinfo->frags[sinfo->nr_frags - 1];
struct xdp_rxq_info *rxq = xdp->rxq;
- unsigned int tailroom;
+ int tailroom;
if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz)
return -EOPNOTSUPP;
tailroom = rxq->frag_size - skb_frag_size(frag) -
skb_frag_off(frag) % rxq->frag_size;
+ WARN_ON_ONCE(tailroom < 0);
if (unlikely(offset > tailroom))
return -EINVAL;
--
2.52.0
* Re: [PATCH bpf v3 0/9] Address XDP frags having negative tailroom
2026-02-17 13:24 [PATCH bpf v3 0/9] Address XDP frags having negative tailroom Larysa Zaremba
` (8 preceding siblings ...)
2026-02-17 13:24 ` [PATCH bpf v3 9/9] xdp: produce a warning when calculated tailroom is negative Larysa Zaremba
@ 2026-02-27 15:31 ` Maciej Fijalkowski
2026-03-02 9:18 ` Larysa Zaremba
9 siblings, 1 reply; 22+ messages in thread
From: Maciej Fijalkowski @ 2026-02-27 15:31 UTC (permalink / raw)
To: Larysa Zaremba
Cc: bpf, Claudiu Manoil, Vladimir Oltean, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Tony Nguyen, Przemek Kitszel, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Andrii Nakryiko, Martin KaFai Lau,
Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Hao Luo,
Jiri Olsa, Simon Horman, Shuah Khan, Alexander Lobakin,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov,
Dragos Tatulea
On Tue, Feb 17, 2026 at 02:24:38PM +0100, Larysa Zaremba wrote:
> Aside from the issue described below, tailroom calculation does not account
> for pages being split between frags, e.g. in i40e, enetc and
> AF_XDP ZC with smaller chunks. These series address the problem by
> calculating modulo (skb_frag_off() % rxq->frag_size) in order to get
> data offset within a smaller block of memory. Please note, xskxceiver
> tail grow test passes without modulo e.g. in xdpdrv mode on i40e,
> because there is not enough descriptors to get to flipped buffers.
>
> Many ethernet drivers report xdp Rx queue frag size as being the same as
> DMA write size. However, the only user of this field, namely
> bpf_xdp_frags_increase_tail(), clearly expects a truesize.
>
> Such difference leads to unspecific memory corruption issues under certain
> circumstances, e.g. in ixgbevf maximum DMA write size is 3 KB, so when
> running xskxceiver's XDP_ADJUST_TAIL_GROW_MULTI_BUFF, 6K packet fully uses
> all DMA-writable space in 2 buffers. This would be fine, if only
> rxq->frag_size was properly set to 4K, but value of 3K results in a
> negative tailroom, because there is a non-zero page offset.
>
> We are supposed to return -EINVAL and be done with it in such case,
> but due to tailroom being stored as an unsigned int, it is reported to be
> somewhere near UINT_MAX, resulting in a tail being grown, even if the
> requested offset is too much(it is around 2K in the abovementioned test).
> This later leads to all kinds of unspecific calltraces.
>
> The issue can be fixed in all in-tree drivers, but we cannot just trust OOT
> drivers to not do this. Therefore, make tailroom a signed int and produce a
> warning when it is negative to prevent such mistakes in the future.
>
> The issue can also be easily reproduced with ice driver, by applying
> the following diff to xskxceiver and enjoying a kernel panic in xdpdrv mode:
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> index 5af28f359cfd..042d587fa7ef 100644
> --- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> +++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> @@ -2541,8 +2541,8 @@ int testapp_adjust_tail_grow_mb(struct test_spec *test)
> {
> test->mtu = MAX_ETH_JUMBO_SIZE;
> /* Grow by (frag_size - last_frag_Size) - 1 to stay inside the last fragment */
> - return testapp_adjust_tail(test, (XSK_UMEM__MAX_FRAME_SIZE / 2) - 1,
> - XSK_UMEM__LARGE_FRAME_SIZE * 2);
> + return testapp_adjust_tail(test, XSK_UMEM__MAX_FRAME_SIZE * 100,
> + 6912);
> }
>
> int testapp_tx_queue_consumer(struct test_spec *test)
>
> If we print out the values involved in the tailroom calculation:
>
> tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
>
> 4294967040 = 3456 - 3456 - 256
>
> I personally reproduced and verified the issue in ice and i40e,
> aside from WiP ixgbevf implementation.
May I ask what your testing approach against ice was? When I run
test_xsk.sh against a tree with your series applied, I get the panic shown
below [1]. It comes from a test that modifies the descriptor count on the
rings; the catch is that the test may pass when run standalone, but it
causes problems when run as part of the test suite. The cause is that we
copy xdp_rxq between the old and new ice_rx_ring: the core sees the xdp_rxq
as already registered and unregisters it by itself, but then crashes on an
invalid page_pool pointer (both xdp_rxqs pointed to the same pp, and it got
destroyed). The small diff below [0] lets me get through the xskxceiver
test suite executed from test_xsk.sh.
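The failure mode described above can be modelled in userspace. This is only a sketch under stated assumptions: every type and function below is a hypothetical stand-in (none of them are kernel APIs), and the "unreg on the copy" step is reduced to forgetting the copied registration state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
struct mock_pool {
	bool alive;
};

struct mock_rxq_info {
	bool registered;
	struct mock_pool *pool;
};

struct mock_ring {
	struct mock_rxq_info xdp_rxq;
};

/* Models the extra unreg on the copied ring: drop the copied state. */
static void mock_forget_registration(struct mock_rxq_info *q)
{
	q->registered = false;
	q->pool = NULL;
}

/*
 * Register queue info. If it still looks registered, the stale state is
 * torn down first; return -1 where real code would touch a dead pool.
 */
static int mock_rxq_reg(struct mock_rxq_info *q, struct mock_pool *pool)
{
	if (q->registered) {
		if (!q->pool || !q->pool->alive)
			return -1;
		mock_forget_registration(q);
	}
	q->registered = true;
	q->pool = pool;
	return 0;
}

/* Replay the ring resize; returns the result of registering the new ring. */
static int mock_resize(bool unreg_copy)
{
	struct mock_pool old_pool = { .alive = true };
	struct mock_pool new_pool = { .alive = true };
	struct mock_ring old_ring = { { false, NULL } };
	struct mock_ring new_ring;
	int ret;

	ret = mock_rxq_reg(&old_ring.xdp_rxq, &old_pool);
	assert(ret == 0);
	(void)ret;

	new_ring = old_ring;		/* struct copy duplicates xdp_rxq state */
	if (unreg_copy)
		mock_forget_registration(&new_ring.xdp_rxq);

	old_pool.alive = false;		/* old ring torn down, pool destroyed */

	return mock_rxq_reg(&new_ring.xdp_rxq, &new_pool);
}
```

In this model mock_resize(false) returns -1, because the copied ring still claims to be registered against the destroyed pool, and mock_resize(true) returns 0.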
Thanks,
MF
[0]:
diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
index 969d4f8f9c02..06986adb2005 100644
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
@@ -3328,6 +3328,7 @@ ice_set_ringparam(struct net_device *netdev, struct ethtool_ringparam *ring,
rx_rings[i].cached_phctime = pf->ptp.cached_phc_time;
rx_rings[i].desc = NULL;
rx_rings[i].xdp_buf = NULL;
+ xdp_rxq_info_unreg(&rx_rings[i].xdp_rxq);
/* this is to allow wr32 to have something to write to
* during early allocation of Rx buffers
[1]:
[ 2596.560462] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 2596.568466] #PF: supervisor read access in kernel mode
[ 2596.574686] #PF: error_code(0x0000) - not-present page
[ 2596.580942] PGD 118694067 P4D 0
[ 2596.585322] Oops: Oops: 0000 [#1] SMP NOPTI
[ 2596.590694] CPU: 2 UID: 0 PID: 5117 Comm: xskxceiver Tainted: G B W O 6.19.0+ #198 PREEMPT(full)
[ 2596.602065] Tainted: [B]=BAD_PAGE, [W]=WARN, [O]=OOT_MODULE
[ 2596.609049] Hardware name: Intel Corporation M50CYP2SBSTD/M50CYP2SBSTD, BIOS SE5C620.86B.01.01.0004.2110190142 10/19/2021
[ 2596.621632] RIP: 0010:xdp_unreg_mem_model+0x86/0xc0
[ 2596.628195] Code: 0f 44 d7 f6 c2 01 75 37 41 0f b7 4c 24 16 48 89 ce 48 f7 de 3b 5c 32 04 75 1d 48 89 d3 48 29 cb 48 85 d2 74 2f e8 9a 9e 4c ff <48> 8b 7b 08 5b 5d 41 5c e9 6d eb 00 00 48 8b 12 f6 c2 01 74 d5 48
[ 2596.650847] RSP: 0018:ffa000001ffe3a90 EFLAGS: 00010246
[ 2596.658128] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ff1100808e308ea1
[ 2596.667403] RDX: ff1100808e308ea1 RSI: 00000000000001cc RDI: ff11000130150000
[ 2596.676719] RBP: 0000000000000000 R08: 0000000000001000 R09: 0000000000000001
[ 2596.686060] R10: ff1100011084a2c0 R11: 0000000000000000 R12: ff1100011541ce40
[ 2596.695445] R13: 0000000000001000 R14: 0000000000000000 R15: 0000000000000000
[ 2596.704866] FS: 00007f6044013c40(0000) GS:ff11007efbb1b000(0000) knlGS:0000000000000000
[ 2596.715336] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2596.723510] CR2: 0000000000000008 CR3: 00000001e9052004 CR4: 0000000000773ef0
[ 2596.733162] PKRU: 55555554
[ 2596.738407] Call Trace:
[ 2596.743398] <TASK>
[ 2596.748045] __xdp_rxq_info_reg+0xb7/0xf0
[ 2596.755108] ice_vsi_cfg_rxq+0x668/0x6b0 [ice]
[ 2596.762499] ice_vsi_cfg_rxqs+0x29/0x80 [ice]
[ 2596.769555] ice_up+0xe/0x30 [ice]
[ 2596.775673] ice_set_ringparam+0x662/0x7e0 [ice]
[ 2596.783066] ethtool_set_ringparam+0xb3/0x110
[ 2596.790189] __dev_ethtool+0x1200/0x2d90
[ 2596.796916] ? update_se+0xc1/0x120
[ 2596.803224] ? update_load_avg+0x73/0x220
[ 2596.810079] ? xas_load+0x9/0xc0
[ 2596.816172] ? xa_load+0x71/0xb0
[ 2596.822273] ? avc_has_extended_perms+0xcf/0x4a0
[ 2596.829822] ? __kmalloc_cache_noprof+0x11a/0x400
[ 2596.837493] dev_ethtool+0xa6/0x170
[ 2596.843976] dev_ioctl+0x2d9/0x510
[ 2596.850388] sock_do_ioctl+0xa8/0x110
[ 2596.857078] sock_ioctl+0x234/0x320
[ 2596.863614] __x64_sys_ioctl+0x92/0xe0
[ 2596.870444] do_syscall_64+0xa4/0xc80
[ 2596.877212] entry_SYSCALL_64_after_hwframe+0x71/0x79
[ 2596.885426] RIP: 0033:0x7f6043f24e1d
[ 2596.892186] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
[ 2596.917864] RSP: 002b:00007ffd329f5e50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 2596.929028] RAX: ffffffffffffffda RBX: 00007ffd329f6208 RCX: 00007f6043f24e1d
[ 2596.939757] RDX: 00007ffd329f5ed0 RSI: 0000000000008946 RDI: 0000000000000013
[ 2596.950460] RBP: 00007ffd329f5ea0 R08: 0000000000000000 R09: 0000000000000007
[ 2596.961597] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
[ 2596.972256] R13: 0000000000000000 R14: 000055a88e016338 R15: 00007f6044100000
[ 2596.982917] </TASK>
[ 2596.988534] Modules linked in: ice(O) ipmi_ssif 8021q garp stp mrp llc intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp nls_iso8859_1 kvm_intel kvm irqbypass mei_me ioatdma mei wmi dca ipmi_si ipmi_msghandler acpi_power_meter acpi_pad input_leds hid_generic ghash_clmulni_intel idpf i40e libeth_xdp libeth ahci libie libie_fwlog libie_adminq libahci aesni_intel gf128mul [last unloaded: irdma]
[ 2597.040161] CR2: 0000000000000008
[ 2597.046911] ---[ end trace 0000000000000000 ]---
[ 2597.117161] RIP: 0010:xdp_unreg_mem_model+0x86/0xc0
[ 2597.125432] Code: 0f 44 d7 f6 c2 01 75 37 41 0f b7 4c 24 16 48 89 ce 48 f7 de 3b 5c 32 04 75 1d 48 89 d3 48 29 cb 48 85 d2 74 2f e8 9a 9e 4c ff <48> 8b 7b 08 5b 5d 41 5c e9 6d eb 00 00 48 8b 12 f6 c2 01 74 d5 48
[ 2597.151379] RSP: 0018:ffa000001ffe3a90 EFLAGS: 00010246
[ 2597.160243] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ff1100808e308ea1
[ 2597.171798] RDX: ff1100808e308ea1 RSI: 00000000000001cc RDI: ff11000130150000
[ 2597.182587] RBP: 0000000000000000 R08: 0000000000001000 R09: 0000000000000001
[ 2597.193333] R10: ff1100011084a2c0 R11: 0000000000000000 R12: ff1100011541ce40
[ 2597.204055] R13: 0000000000001000 R14: 0000000000000000 R15: 0000000000000000
[ 2597.214732] FS: 00007f6044013c40(0000) GS:ff11007efbb1b000(0000) knlGS:0000000000000000
[ 2597.226440] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2597.235842] CR2: 0000000000000008 CR3: 00000001e9052004 CR4: 0000000000773ef0
[ 2597.246692] PKRU: 55555554
[ 2597.253088] note: xskxceiver[5117] exited with irqs disabled
>
> v2->v3:
> * unregister XDP RxQ info for subfunction in ice
> * remove rx_buf_len variable in ice
> * add missing ifdefed empty definition xsk_pool_get_rx_frag_step()
> * move xsk_pool_get_rx_frag_step() call from idpf to libeth
> * simplify conditions when determining frag_size in idpf
> * correctly init xdp_frame_sz for non-main VSI in i40e
>
> v1->v2:
> * add modulo to calculate offset within chunk
> * add helper for AF_XDP ZC queues
> * fix the problem in ZC mode in i40e, ice and idpf
> * verify solution in i40e
> * fix RxQ info registering in i40e
> * fix splitq handling in idpf
> * do not use word truesize unless the value used is named truesize
>
> Larysa Zaremba (9):
> xdp: use modulo operation to calculate XDP frag tailroom
> xsk: introduce helper to determine rxq->frag_size
> ice: fix rxq info registering in mbuf packets
> ice: change XDP RxQ frag_size from DMA write length to xdp.frame_sz
> i40e: fix registering XDP RxQ info
> i40e: use xdp.frame_sz as XDP RxQ info frag_size
> libeth, idpf: use truesize as XDP RxQ info frag_size
> net: enetc: use truesize as XDP RxQ info frag_size
> xdp: produce a warning when calculated tailroom is negative
>
> drivers/net/ethernet/freescale/enetc/enetc.c | 2 +-
> drivers/net/ethernet/intel/i40e/i40e_main.c | 40 +++++++++++---------
> drivers/net/ethernet/intel/i40e/i40e_txrx.c | 5 ++-
> drivers/net/ethernet/intel/ice/ice_base.c | 33 +++++-----------
> drivers/net/ethernet/intel/ice/ice_txrx.c | 3 +-
> drivers/net/ethernet/intel/ice/ice_xsk.c | 3 ++
> drivers/net/ethernet/intel/idpf/xdp.c | 6 ++-
> drivers/net/ethernet/intel/idpf/xsk.c | 1 +
> drivers/net/ethernet/intel/libeth/xsk.c | 1 +
> include/net/libeth/xsk.h | 3 ++
> include/net/xdp_sock_drv.h | 10 +++++
> net/core/filter.c | 6 ++-
> 12 files changed, 66 insertions(+), 47 deletions(-)
>
> --
> 2.52.0
>
* Re: [PATCH bpf v3 0/9] Address XDP frags having negative tailroom
2026-02-27 15:31 ` [PATCH bpf v3 0/9] Address XDP frags having negative tailroom Maciej Fijalkowski
@ 2026-03-02 9:18 ` Larysa Zaremba
0 siblings, 0 replies; 22+ messages in thread
From: Larysa Zaremba @ 2026-03-02 9:18 UTC (permalink / raw)
To: Maciej Fijalkowski
Cc: bpf, Claudiu Manoil, Vladimir Oltean, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Tony Nguyen, Przemek Kitszel, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Andrii Nakryiko, Martin KaFai Lau,
Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Hao Luo,
Jiri Olsa, Simon Horman, Shuah Khan, Alexander Lobakin,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov,
Dragos Tatulea
On Fri, Feb 27, 2026 at 04:31:27PM +0100, Maciej Fijalkowski wrote:
> On Tue, Feb 17, 2026 at 02:24:38PM +0100, Larysa Zaremba wrote:
> > Aside from the issue described below, tailroom calculation does not account
> > for pages being split between frags, e.g. in i40e, enetc and
> > AF_XDP ZC with smaller chunks. This series addresses the problem by
> > calculating modulo (skb_frag_off() % rxq->frag_size) in order to get
> > data offset within a smaller block of memory. Please note, xskxceiver
> > tail grow test passes without modulo e.g. in xdpdrv mode on i40e,
> > because there are not enough descriptors to get to flipped buffers.
> >
> > Many ethernet drivers report xdp Rx queue frag size as being the same as
> > DMA write size. However, the only user of this field, namely
> > bpf_xdp_frags_increase_tail(), clearly expects a truesize.
> >
> > Such difference leads to unspecific memory corruption issues under certain
> > circumstances, e.g. in ixgbevf maximum DMA write size is 3 KB, so when
> > running xskxceiver's XDP_ADJUST_TAIL_GROW_MULTI_BUFF, 6K packet fully uses
> > all DMA-writable space in 2 buffers. This would be fine, if only
> > rxq->frag_size was properly set to 4K, but value of 3K results in a
> > negative tailroom, because there is a non-zero page offset.
> >
> > We are supposed to return -EINVAL and be done with it in such a case,
> > but due to tailroom being stored as an unsigned int, it is reported to be
> > somewhere near UINT_MAX, resulting in a tail being grown, even if the
> > requested offset is too much (it is around 2K in the above-mentioned test).
> > This later leads to all kinds of unspecific calltraces.
> >
> > [ 7340.337579] xskxceiver[1440]: segfault at 1da718 ip 00007f4161aeac9d sp 00007f41615a6a00 error 6
> > [ 7340.338040] xskxceiver[1441]: segfault at 7f410000000b ip 00000000004042b5 sp 00007f415bffecf0 error 4
> > [ 7340.338179] in libc.so.6[61c9d,7f4161aaf000+160000]
> > [ 7340.339230] in xskxceiver[42b5,400000+69000]
> > [ 7340.340300] likely on CPU 6 (core 0, socket 6)
> > [ 7340.340302] Code: ff ff 01 e9 f4 fe ff ff 0f 1f 44 00 00 4c 39 f0 74 73 31 c0 ba 01 00 00 00 f0 0f b1 17 0f 85 ba 00 00 00 49 8b 87 88 00 00 00 <4c> 89 70 08 eb cc 0f 1f 44 00 00 48 8d bd f0 fe ff ff 89 85 ec fe
> > [ 7340.340888] likely on CPU 3 (core 0, socket 3)
> > [ 7340.345088] Code: 00 00 00 ba 00 00 00 00 be 00 00 00 00 89 c7 e8 31 ca ff ff 89 45 ec 8b 45 ec 85 c0 78 07 b8 00 00 00 00 eb 46 e8 0b c8 ff ff <8b> 00 83 f8 69 74 24 e8 ff c7 ff ff 8b 00 83 f8 0b 74 18 e8 f3 c7
> > [ 7340.404334] Oops: general protection fault, probably for non-canonical address 0x6d255010bdffc: 0000 [#1] SMP NOPTI
> > [ 7340.405972] CPU: 7 UID: 0 PID: 1439 Comm: xskxceiver Not tainted 6.19.0-rc1+ #21 PREEMPT(lazy)
> > [ 7340.408006] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-5.fc42 04/01/2014
> > [ 7340.409716] RIP: 0010:lookup_swap_cgroup_id+0x44/0x80
> > [ 7340.410455] Code: 83 f8 1c 73 39 48 ba ff ff ff ff ff ff ff 03 48 8b 04 c5 20 55 fa bd 48 21 d1 48 89 ca 83 e1 01 48 d1 ea c1 e1 04 48 8d 04 90 <8b> 00 48 83 c4 10 d3 e8 c3 cc cc cc cc 31 c0 e9 98 b7 dd 00 48 89
> > [ 7340.412787] RSP: 0018:ffffcc5c04f7f6d0 EFLAGS: 00010202
> > [ 7340.413494] RAX: 0006d255010bdffc RBX: ffff891f477895a8 RCX: 0000000000000010
> > [ 7340.414431] RDX: 0001c17e3fffffff RSI: 00fa070000000000 RDI: 000382fc7fffffff
> > [ 7340.415354] RBP: 00fa070000000000 R08: ffffcc5c04f7f8f8 R09: ffffcc5c04f7f7d0
> > [ 7340.416283] R10: ffff891f4c1a7000 R11: ffffcc5c04f7f9c8 R12: ffffcc5c04f7f7d0
> > [ 7340.417218] R13: 03ffffffffffffff R14: 00fa06fffffffe00 R15: ffff891f47789500
> > [ 7340.418229] FS: 0000000000000000(0000) GS:ffff891ffdfaa000(0000) knlGS:0000000000000000
> > [ 7340.419489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 7340.420286] CR2: 00007f415bfffd58 CR3: 0000000103f03002 CR4: 0000000000772ef0
> > [ 7340.421237] PKRU: 55555554
> > [ 7340.421623] Call Trace:
> > [ 7340.421987] <TASK>
> > [ 7340.422309] ? softleaf_from_pte+0x77/0xa0
> > [ 7340.422855] swap_pte_batch+0xa7/0x290
> > [ 7340.423363] zap_nonpresent_ptes.constprop.0.isra.0+0xd1/0x270
> > [ 7340.424102] zap_pte_range+0x281/0x580
> > [ 7340.424607] zap_pmd_range.isra.0+0xc9/0x240
> > [ 7340.425177] unmap_page_range+0x24d/0x420
> > [ 7340.425714] unmap_vmas+0xa1/0x180
> > [ 7340.426185] exit_mmap+0xe1/0x3b0
> > [ 7340.426644] __mmput+0x41/0x150
> > [ 7340.427098] exit_mm+0xb1/0x110
> > [ 7340.427539] do_exit+0x1b2/0x460
> > [ 7340.427992] do_group_exit+0x2d/0xc0
> > [ 7340.428477] get_signal+0x79d/0x7e0
> > [ 7340.428957] arch_do_signal_or_restart+0x34/0x100
> > [ 7340.429571] exit_to_user_mode_loop+0x8e/0x4c0
> > [ 7340.430159] do_syscall_64+0x188/0x6b0
> > [ 7340.430672] ? __do_sys_clone3+0xd9/0x120
> > [ 7340.431212] ? switch_fpu_return+0x4e/0xd0
> > [ 7340.431761] ? arch_exit_to_user_mode_prepare.isra.0+0xa1/0xc0
> > [ 7340.432498] ? do_syscall_64+0xbb/0x6b0
> > [ 7340.433015] ? __handle_mm_fault+0x445/0x690
> > [ 7340.433582] ? count_memcg_events+0xd6/0x210
> > [ 7340.434151] ? handle_mm_fault+0x212/0x340
> > [ 7340.434697] ? do_user_addr_fault+0x2b4/0x7b0
> > [ 7340.435271] ? clear_bhb_loop+0x30/0x80
> > [ 7340.435788] ? clear_bhb_loop+0x30/0x80
> > [ 7340.436299] ? clear_bhb_loop+0x30/0x80
> > [ 7340.436812] ? clear_bhb_loop+0x30/0x80
> > [ 7340.437323] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 7340.437973] RIP: 0033:0x7f4161b14169
> > [ 7340.438468] Code: Unable to access opcode bytes at 0x7f4161b1413f.
> > [ 7340.439242] RSP: 002b:00007ffc6ebfa770 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
> > [ 7340.440173] RAX: fffffffffffffe00 RBX: 00000000000005a1 RCX: 00007f4161b14169
> > [ 7340.441061] RDX: 00000000000005a1 RSI: 0000000000000109 RDI: 00007f415bfff990
> > [ 7340.441943] RBP: 00007ffc6ebfa7a0 R08: 0000000000000000 R09: 00000000ffffffff
> > [ 7340.442824] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> > [ 7340.443707] R13: 0000000000000000 R14: 00007f415bfff990 R15: 00007f415bfff6c0
> > [ 7340.444586] </TASK>
> > [ 7340.444922] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit libnvdimm kvm_intel vfat fat kvm snd_pcm irqbypass rapl iTCO_wdt snd_timer intel_pmc_bxt iTCO_vendor_support snd ixgbevf virtio_net soundcore i2c_i801 pcspkr libeth_xdp net_failover i2c_smbus lpc_ich failover libeth virtio_balloon joydev 9p fuse loop zram lz4hc_compress lz4_compress 9pnet_virtio 9pnet netfs ghash_clmulni_intel serio_raw qemu_fw_cfg
> > [ 7340.449650] ---[ end trace 0000000000000000 ]---
> >
> > The issue can be fixed in all in-tree drivers, but we cannot just trust OOT
> > drivers to not do this. Therefore, make tailroom a signed int and produce a
> > warning when it is negative to prevent such mistakes in the future.
> >
> > The issue can also be easily reproduced with ice driver, by applying
> > the following diff to xskxceiver and enjoying a kernel panic in xdpdrv mode:
> >
> > diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> > index 5af28f359cfd..042d587fa7ef 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> > @@ -2541,8 +2541,8 @@ int testapp_adjust_tail_grow_mb(struct test_spec *test)
> > {
> > test->mtu = MAX_ETH_JUMBO_SIZE;
> > /* Grow by (frag_size - last_frag_Size) - 1 to stay inside the last fragment */
> > - return testapp_adjust_tail(test, (XSK_UMEM__MAX_FRAME_SIZE / 2) - 1,
> > - XSK_UMEM__LARGE_FRAME_SIZE * 2);
> > + return testapp_adjust_tail(test, XSK_UMEM__MAX_FRAME_SIZE * 100,
> > + 6912);
> > }
> >
> > int testapp_tx_queue_consumer(struct test_spec *test)
> >
> > If we print out the values involved in the tailroom calculation:
> >
> > tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
> >
> > 4294967040 = 3456 - 3456 - 256
> >
> > I personally reproduced and verified the issue in ice and i40e,
> > aside from WiP ixgbevf implementation.
>
> May I ask what your testing approach against ice was? When I run
> test_xsk.sh against a tree with your series applied, I get the panic shown
> below [1]. It comes from a test that modifies the descriptor count on the
> rings; the catch is that the test may pass when run standalone, but it
> causes problems when run as part of the test suite. The cause is that we
> copy xdp_rxq between the old and new ice_rx_ring: the core sees the xdp_rxq
> as already registered and unregisters it by itself, but then crashes on an
> invalid page_pool pointer (both xdp_rxqs pointed to the same pp, and it got
> destroyed). The small diff below [0] lets me get through the xskxceiver
> test suite executed from test_xsk.sh.
>
Thanks for looking into this. I usually skip non-CI tests (considering how
skb mode is now), and additionally run 9K + tail growing tests, so I did
(perhaps wrongly) skip the ring size tests.
Your fix seems like the best option for now, though I would add
xdp_rxq_info_detach_mem_model() before the unreg too, to minimize potential side
effects from ring duplication.
Will add this to v4, and run the full xskxceiver suite.
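The reasoning for detaching before unregistering can be sketched in userspace. This is a model under stated assumptions, not the kernel implementation: all names are hypothetical, and it assumes that detaching only clears the queue info's reference to the memory model, while unregistering releases a memory model that is still attached.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a shared memory model and per-ring queue info. */
struct mock_mem_model {
	int users;	/* how many queue infos legitimately hold it */
};

struct mock_rxq_info {
	struct mock_mem_model *mem;
};

/* Models detaching: forget the memory model without releasing it. */
static void mock_detach_mem_model(struct mock_rxq_info *q)
{
	q->mem = NULL;
}

/* Models unregistering: release the memory model if one is still attached. */
static void mock_unreg(struct mock_rxq_info *q)
{
	if (q->mem) {
		q->mem->users--;
		q->mem = NULL;
	}
}

/* Returns how many users the original ring's memory model has left. */
static int mock_users_after_copy_cleanup(int detach_first)
{
	struct mock_mem_model mem = { .users = 1 };
	struct mock_rxq_info orig = { .mem = &mem };
	struct mock_rxq_info copy = orig;	/* copy takes no extra reference */

	if (detach_first)
		mock_detach_mem_model(&copy);
	mock_unreg(&copy);

	return mem.users;
}
```

In this model mock_users_after_copy_cleanup(1) returns 1, so the original ring keeps its memory model, while mock_users_after_copy_cleanup(0) returns 0, meaning the cleanup of the copy released something the original ring still relies on.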
> Thanks,
> MF
>
> [0]:
> diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
> index 969d4f8f9c02..06986adb2005 100644
> --- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
> +++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
> @@ -3328,6 +3328,7 @@ ice_set_ringparam(struct net_device *netdev, struct ethtool_ringparam *ring,
> rx_rings[i].cached_phctime = pf->ptp.cached_phc_time;
> rx_rings[i].desc = NULL;
> rx_rings[i].xdp_buf = NULL;
> + xdp_rxq_info_unreg(&rx_rings[i].xdp_rxq);
>
> /* this is to allow wr32 to have something to write to
> * during early allocation of Rx buffers
>
> [1]:
> [ 2596.560462] BUG: kernel NULL pointer dereference, address: 0000000000000008
> [ 2596.568466] #PF: supervisor read access in kernel mode
> [ 2596.574686] #PF: error_code(0x0000) - not-present page
> [ 2596.580942] PGD 118694067 P4D 0
> [ 2596.585322] Oops: Oops: 0000 [#1] SMP NOPTI
> [ 2596.590694] CPU: 2 UID: 0 PID: 5117 Comm: xskxceiver Tainted: G B W O 6.19.0+ #198 PREEMPT(full)
> [ 2596.602065] Tainted: [B]=BAD_PAGE, [W]=WARN, [O]=OOT_MODULE
> [ 2596.609049] Hardware name: Intel Corporation M50CYP2SBSTD/M50CYP2SBSTD, BIOS SE5C620.86B.01.01.0004.2110190142 10/19/2021
> [ 2596.621632] RIP: 0010:xdp_unreg_mem_model+0x86/0xc0
> [ 2596.628195] Code: 0f 44 d7 f6 c2 01 75 37 41 0f b7 4c 24 16 48 89 ce 48 f7 de 3b 5c 32 04 75 1d 48 89 d3 48 29 cb 48 85 d2 74 2f e8 9a 9e 4c ff <48> 8b 7b 08 5b 5d 41 5c e9 6d eb 00 00 48 8b 12 f6 c2 01 74 d5 48
> [ 2596.650847] RSP: 0018:ffa000001ffe3a90 EFLAGS: 00010246
> [ 2596.658128] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ff1100808e308ea1
> [ 2596.667403] RDX: ff1100808e308ea1 RSI: 00000000000001cc RDI: ff11000130150000
> [ 2596.676719] RBP: 0000000000000000 R08: 0000000000001000 R09: 0000000000000001
> [ 2596.686060] R10: ff1100011084a2c0 R11: 0000000000000000 R12: ff1100011541ce40
> [ 2596.695445] R13: 0000000000001000 R14: 0000000000000000 R15: 0000000000000000
> [ 2596.704866] FS: 00007f6044013c40(0000) GS:ff11007efbb1b000(0000) knlGS:0000000000000000
> [ 2596.715336] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2596.723510] CR2: 0000000000000008 CR3: 00000001e9052004 CR4: 0000000000773ef0
> [ 2596.733162] PKRU: 55555554
> [ 2596.738407] Call Trace:
> [ 2596.743398] <TASK>
> [ 2596.748045] __xdp_rxq_info_reg+0xb7/0xf0
> [ 2596.755108] ice_vsi_cfg_rxq+0x668/0x6b0 [ice]
> [ 2596.762499] ice_vsi_cfg_rxqs+0x29/0x80 [ice]
> [ 2596.769555] ice_up+0xe/0x30 [ice]
> [ 2596.775673] ice_set_ringparam+0x662/0x7e0 [ice]
> [ 2596.783066] ethtool_set_ringparam+0xb3/0x110
> [ 2596.790189] __dev_ethtool+0x1200/0x2d90
> [ 2596.796916] ? update_se+0xc1/0x120
> [ 2596.803224] ? update_load_avg+0x73/0x220
> [ 2596.810079] ? xas_load+0x9/0xc0
> [ 2596.816172] ? xa_load+0x71/0xb0
> [ 2596.822273] ? avc_has_extended_perms+0xcf/0x4a0
> [ 2596.829822] ? __kmalloc_cache_noprof+0x11a/0x400
> [ 2596.837493] dev_ethtool+0xa6/0x170
> [ 2596.843976] dev_ioctl+0x2d9/0x510
> [ 2596.850388] sock_do_ioctl+0xa8/0x110
> [ 2596.857078] sock_ioctl+0x234/0x320
> [ 2596.863614] __x64_sys_ioctl+0x92/0xe0
> [ 2596.870444] do_syscall_64+0xa4/0xc80
> [ 2596.877212] entry_SYSCALL_64_after_hwframe+0x71/0x79
> [ 2596.885426] RIP: 0033:0x7f6043f24e1d
> [ 2596.892186] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
> [ 2596.917864] RSP: 002b:00007ffd329f5e50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [ 2596.929028] RAX: ffffffffffffffda RBX: 00007ffd329f6208 RCX: 00007f6043f24e1d
> [ 2596.939757] RDX: 00007ffd329f5ed0 RSI: 0000000000008946 RDI: 0000000000000013
> [ 2596.950460] RBP: 00007ffd329f5ea0 R08: 0000000000000000 R09: 0000000000000007
> [ 2596.961597] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
> [ 2596.972256] R13: 0000000000000000 R14: 000055a88e016338 R15: 00007f6044100000
> [ 2596.982917] </TASK>
> [ 2596.988534] Modules linked in: ice(O) ipmi_ssif 8021q garp stp mrp llc intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp nls_iso8859_1 kvm_intel kvm irqbypass mei_me ioatdma mei wmi dca ipmi_si ipmi_msghandler acpi_power_meter acpi_pad input_leds hid_generic ghash_clmulni_intel idpf i40e libeth_xdp libeth ahci libie libie_fwlog libie_adminq libahci aesni_intel gf128mul [last unloaded: irdma]
> [ 2597.040161] CR2: 0000000000000008
> [ 2597.046911] ---[ end trace 0000000000000000 ]---
> [ 2597.117161] RIP: 0010:xdp_unreg_mem_model+0x86/0xc0
> [ 2597.125432] Code: 0f 44 d7 f6 c2 01 75 37 41 0f b7 4c 24 16 48 89 ce 48 f7 de 3b 5c 32 04 75 1d 48 89 d3 48 29 cb 48 85 d2 74 2f e8 9a 9e 4c ff <48> 8b 7b 08 5b 5d 41 5c e9 6d eb 00 00 48 8b 12 f6 c2 01 74 d5 48
> [ 2597.151379] RSP: 0018:ffa000001ffe3a90 EFLAGS: 00010246
> [ 2597.160243] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ff1100808e308ea1
> [ 2597.171798] RDX: ff1100808e308ea1 RSI: 00000000000001cc RDI: ff11000130150000
> [ 2597.182587] RBP: 0000000000000000 R08: 0000000000001000 R09: 0000000000000001
> [ 2597.193333] R10: ff1100011084a2c0 R11: 0000000000000000 R12: ff1100011541ce40
> [ 2597.204055] R13: 0000000000001000 R14: 0000000000000000 R15: 0000000000000000
> [ 2597.214732] FS: 00007f6044013c40(0000) GS:ff11007efbb1b000(0000) knlGS:0000000000000000
> [ 2597.226440] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2597.235842] CR2: 0000000000000008 CR3: 00000001e9052004 CR4: 0000000000773ef0
> [ 2597.246692] PKRU: 55555554
> [ 2597.253088] note: xskxceiver[5117] exited with irqs disabled
>
> >
> > v2->v3:
> > * unregister XDP RxQ info for subfunction in ice
> > * remove rx_buf_len variable in ice
> > * add missing ifdefed empty definition xsk_pool_get_rx_frag_step()
> > * move xsk_pool_get_rx_frag_step() call from idpf to libeth
> > * simplify conditions when determining frag_size in idpf
> > * correctly init xdp_frame_sz for non-main VSI in i40e
> >
> > v1->v2:
> > * add modulo to calculate offset within chunk
> > * add helper for AF_XDP ZC queues
> > * fix the problem in ZC mode in i40e, ice and idpf
> > * verify solution in i40e
> > * fix RxQ info registering in i40e
> > * fix splitq handling in idpf
> > * do not use word truesize unless the value used is named truesize
> >
> > Larysa Zaremba (9):
> > xdp: use modulo operation to calculate XDP frag tailroom
> > xsk: introduce helper to determine rxq->frag_size
> > ice: fix rxq info registering in mbuf packets
> > ice: change XDP RxQ frag_size from DMA write length to xdp.frame_sz
> > i40e: fix registering XDP RxQ info
> > i40e: use xdp.frame_sz as XDP RxQ info frag_size
> > libeth, idpf: use truesize as XDP RxQ info frag_size
> > net: enetc: use truesize as XDP RxQ info frag_size
> > xdp: produce a warning when calculated tailroom is negative
> >
> > drivers/net/ethernet/freescale/enetc/enetc.c | 2 +-
> > drivers/net/ethernet/intel/i40e/i40e_main.c | 40 +++++++++++---------
> > drivers/net/ethernet/intel/i40e/i40e_txrx.c | 5 ++-
> > drivers/net/ethernet/intel/ice/ice_base.c | 33 +++++-----------
> > drivers/net/ethernet/intel/ice/ice_txrx.c | 3 +-
> > drivers/net/ethernet/intel/ice/ice_xsk.c | 3 ++
> > drivers/net/ethernet/intel/idpf/xdp.c | 6 ++-
> > drivers/net/ethernet/intel/idpf/xsk.c | 1 +
> > drivers/net/ethernet/intel/libeth/xsk.c | 1 +
> > include/net/libeth/xsk.h | 3 ++
> > include/net/xdp_sock_drv.h | 10 +++++
> > net/core/filter.c | 6 ++-
> > 12 files changed, 66 insertions(+), 47 deletions(-)
> >
> > --
> > 2.52.0
> >