* [PATCH bpf 1/6] xdp: produce a warning when calculated tailroom is negative
2026-02-03 10:53 [PATCH bpf 0/6] Address XDP frags having negative tailroom Larysa Zaremba
@ 2026-02-03 10:53 ` Larysa Zaremba
2026-02-03 12:26 ` Toke Høiland-Jørgensen
2026-02-04 22:52 ` Martin KaFai Lau
2026-02-03 10:53 ` [PATCH bpf 2/6] ice: fix rxq info registering in mbuf packets Larysa Zaremba
` (6 subsequent siblings)
7 siblings, 2 replies; 29+ messages in thread
From: Larysa Zaremba @ 2026-02-03 10:53 UTC (permalink / raw)
To: bpf
Cc: Claudiu Manoil, Vladimir Oltean, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Tony Nguyen, Przemek Kitszel, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Andrii Nakryiko, Martin KaFai Lau,
Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Hao Luo,
Jiri Olsa, Simon Horman, Shuah Khan, Alexander Lobakin,
Maciej Fijalkowski, Bastien Curutchet (eBPF Foundation),
Larysa Zaremba, Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
Many ethernet drivers report xdp Rx queue frag size as being the same as
DMA write size. However, the only user of this field, namely
bpf_xdp_frags_increase_tail(), clearly expects a truesize.
Such difference leads to unspecific memory corruption issues under certain
circumstances, e.g. in ixgbevf maximum DMA write size is 3 KB, so when
running xskxceiver's XDP_ADJUST_TAIL_GROW_MULTI_BUFF, 6K packet fully uses
all DMA-writable space in 2 buffers. This would be fine, if only
rxq->frag_size was properly set to 4K, but value of 3K results in a
negative tailroom, because there is a non-zero page offset.
We could return -EINVAL and be done with it in such case, but due to
tailroom being stored as an unsigned int, it is reported to be somewhere
near UINT_MAX, resulting in a tail being grown, even if the requested
offset is too much (it is around 2K in the abovementioned test). This later
leads to all kinds of unspecific calltraces.
[ 7340.337579] xskxceiver[1440]: segfault at 1da718 ip 00007f4161aeac9d sp 00007f41615a6a00 error 6
[ 7340.338040] xskxceiver[1441]: segfault at 7f410000000b ip 00000000004042b5 sp 00007f415bffecf0 error 4
[ 7340.338179] in libc.so.6[61c9d,7f4161aaf000+160000]
[ 7340.339230] in xskxceiver[42b5,400000+69000]
[ 7340.340300] likely on CPU 6 (core 0, socket 6)
[ 7340.340302] Code: ff ff 01 e9 f4 fe ff ff 0f 1f 44 00 00 4c 39 f0 74 73 31 c0 ba 01 00 00 00 f0 0f b1 17 0f 85 ba 00 00 00 49 8b 87 88 00 00 00 <4c> 89 70 08 eb cc 0f 1f 44 00 00 48 8d bd f0 fe ff ff 89 85 ec fe
[ 7340.340888] likely on CPU 3 (core 0, socket 3)
[ 7340.345088] Code: 00 00 00 ba 00 00 00 00 be 00 00 00 00 89 c7 e8 31 ca ff ff 89 45 ec 8b 45 ec 85 c0 78 07 b8 00 00 00 00 eb 46 e8 0b c8 ff ff <8b> 00 83 f8 69 74 24 e8 ff c7 ff ff 8b 00 83 f8 0b 74 18 e8 f3 c7
[ 7340.404334] Oops: general protection fault, probably for non-canonical address 0x6d255010bdffc: 0000 [#1] SMP NOPTI
[ 7340.405972] CPU: 7 UID: 0 PID: 1439 Comm: xskxceiver Not tainted 6.19.0-rc1+ #21 PREEMPT(lazy)
[ 7340.408006] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-5.fc42 04/01/2014
[ 7340.409716] RIP: 0010:lookup_swap_cgroup_id+0x44/0x80
[ 7340.410455] Code: 83 f8 1c 73 39 48 ba ff ff ff ff ff ff ff 03 48 8b 04 c5 20 55 fa bd 48 21 d1 48 89 ca 83 e1 01 48 d1 ea c1 e1 04 48 8d 04 90 <8b> 00 48 83 c4 10 d3 e8 c3 cc cc cc cc 31 c0 e9 98 b7 dd 00 48 89
[ 7340.412787] RSP: 0018:ffffcc5c04f7f6d0 EFLAGS: 00010202
[ 7340.413494] RAX: 0006d255010bdffc RBX: ffff891f477895a8 RCX: 0000000000000010
[ 7340.414431] RDX: 0001c17e3fffffff RSI: 00fa070000000000 RDI: 000382fc7fffffff
[ 7340.415354] RBP: 00fa070000000000 R08: ffffcc5c04f7f8f8 R09: ffffcc5c04f7f7d0
[ 7340.416283] R10: ffff891f4c1a7000 R11: ffffcc5c04f7f9c8 R12: ffffcc5c04f7f7d0
[ 7340.417218] R13: 03ffffffffffffff R14: 00fa06fffffffe00 R15: ffff891f47789500
[ 7340.418229] FS: 0000000000000000(0000) GS:ffff891ffdfaa000(0000) knlGS:0000000000000000
[ 7340.419489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7340.420286] CR2: 00007f415bfffd58 CR3: 0000000103f03002 CR4: 0000000000772ef0
[ 7340.421237] PKRU: 55555554
[ 7340.421623] Call Trace:
[ 7340.421987] <TASK>
[ 7340.422309] ? softleaf_from_pte+0x77/0xa0
[ 7340.422855] swap_pte_batch+0xa7/0x290
[ 7340.423363] zap_nonpresent_ptes.constprop.0.isra.0+0xd1/0x270
[ 7340.424102] zap_pte_range+0x281/0x580
[ 7340.424607] zap_pmd_range.isra.0+0xc9/0x240
[ 7340.425177] unmap_page_range+0x24d/0x420
[ 7340.425714] unmap_vmas+0xa1/0x180
[ 7340.426185] exit_mmap+0xe1/0x3b0
[ 7340.426644] __mmput+0x41/0x150
[ 7340.427098] exit_mm+0xb1/0x110
[ 7340.427539] do_exit+0x1b2/0x460
[ 7340.427992] do_group_exit+0x2d/0xc0
[ 7340.428477] get_signal+0x79d/0x7e0
[ 7340.428957] arch_do_signal_or_restart+0x34/0x100
[ 7340.429571] exit_to_user_mode_loop+0x8e/0x4c0
[ 7340.430159] do_syscall_64+0x188/0x6b0
[ 7340.430672] ? __do_sys_clone3+0xd9/0x120
[ 7340.431212] ? switch_fpu_return+0x4e/0xd0
[ 7340.431761] ? arch_exit_to_user_mode_prepare.isra.0+0xa1/0xc0
[ 7340.432498] ? do_syscall_64+0xbb/0x6b0
[ 7340.433015] ? __handle_mm_fault+0x445/0x690
[ 7340.433582] ? count_memcg_events+0xd6/0x210
[ 7340.434151] ? handle_mm_fault+0x212/0x340
[ 7340.434697] ? do_user_addr_fault+0x2b4/0x7b0
[ 7340.435271] ? clear_bhb_loop+0x30/0x80
[ 7340.435788] ? clear_bhb_loop+0x30/0x80
[ 7340.436299] ? clear_bhb_loop+0x30/0x80
[ 7340.436812] ? clear_bhb_loop+0x30/0x80
[ 7340.437323] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 7340.437973] RIP: 0033:0x7f4161b14169
[ 7340.438468] Code: Unable to access opcode bytes at 0x7f4161b1413f.
[ 7340.439242] RSP: 002b:00007ffc6ebfa770 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[ 7340.440173] RAX: fffffffffffffe00 RBX: 00000000000005a1 RCX: 00007f4161b14169
[ 7340.441061] RDX: 00000000000005a1 RSI: 0000000000000109 RDI: 00007f415bfff990
[ 7340.441943] RBP: 00007ffc6ebfa7a0 R08: 0000000000000000 R09: 00000000ffffffff
[ 7340.442824] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 7340.443707] R13: 0000000000000000 R14: 00007f415bfff990 R15: 00007f415bfff6c0
[ 7340.444586] </TASK>
[ 7340.444922] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit libnvdimm kvm_intel vfat fat kvm snd_pcm irqbypass rapl iTCO_wdt snd_timer intel_pmc_bxt iTCO_vendor_support snd ixgbevf virtio_net soundcore i2c_i801 pcspkr libeth_xdp net_failover i2c_smbus lpc_ich failover libeth virtio_balloon joydev 9p fuse loop zram lz4hc_compress lz4_compress 9pnet_virtio 9pnet netfs ghash_clmulni_intel serio_raw qemu_fw_cfg
[ 7340.449650] ---[ end trace 0000000000000000 ]---
The issue can be fixed in all in-tree drivers, but we cannot just trust OOT
drivers to not do this. Therefore, make tailroom a signed int and produce a
warning when it is negative to prevent such mistakes in the future.
Fixes: bf25146a5595 ("bpf: add frags support to the bpf_xdp_adjust_tail() API")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
net/core/filter.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/core/filter.c b/net/core/filter.c
index 616e0520a0bb..9715d957e3c5 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4149,12 +4149,13 @@ static int bpf_xdp_frags_increase_tail(struct xdp_buff *xdp, int offset)
struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
skb_frag_t *frag = &sinfo->frags[sinfo->nr_frags - 1];
struct xdp_rxq_info *rxq = xdp->rxq;
- unsigned int tailroom;
+ int tailroom;
if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz)
return -EOPNOTSUPP;
tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
+ WARN_ON_ONCE(tailroom < 0);
if (unlikely(offset > tailroom))
return -EINVAL;
--
2.52.0
^ permalink raw reply related [flat|nested] 29+ messages in thread* Re: [PATCH bpf 1/6] xdp: produce a warning when calculated tailroom is negative
2026-02-03 10:53 ` [PATCH bpf 1/6] xdp: produce a warning when calculated tailroom is negative Larysa Zaremba
@ 2026-02-03 12:26 ` Toke Høiland-Jørgensen
2026-02-03 12:31 ` Larysa Zaremba
2026-02-04 22:52 ` Martin KaFai Lau
1 sibling, 1 reply; 29+ messages in thread
From: Toke Høiland-Jørgensen @ 2026-02-03 12:26 UTC (permalink / raw)
To: Larysa Zaremba, bpf
Cc: Claudiu Manoil, Vladimir Oltean, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Tony Nguyen, Przemek Kitszel, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Andrii Nakryiko, Martin KaFai Lau,
Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Hao Luo,
Jiri Olsa, Simon Horman, Shuah Khan, Alexander Lobakin,
Maciej Fijalkowski, Bastien Curutchet (eBPF Foundation),
Larysa Zaremba, Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi, imx,
netdev, linux-kernel, intel-wired-lan, linux-kselftest,
Aleksandr Loktionov
Larysa Zaremba <larysa.zaremba@intel.com> writes:
> Many ethernet drivers report xdp Rx queue frag size as being the same as
> DMA write size. However, the only user of this field, namely
> bpf_xdp_frags_increase_tail(), clearly expects a truesize.
>
> Such difference leads to unspecific memory corruption issues under certain
> circumstances, e.g. in ixgbevf maximum DMA write size is 3 KB, so when
> running xskxceiver's XDP_ADJUST_TAIL_GROW_MULTI_BUFF, 6K packet fully uses
> all DMA-writable space in 2 buffers. This would be fine, if only
> rxq->frag_size was properly set to 4K, but value of 3K results in a
> negative tailroom, because there is a non-zero page offset.
>
> We could return -EINVAL and be done with it in such case, but due to
> tailroom being stored as an unsigned int, it is reported to be somewhere
> near UINT_MAX, resulting in a tail being grown, even if the requested
> offset is too much (it is around 2K in the abovementioned test). This later
> leads to all kinds of unspecific calltraces.
>
> [ 7340.337579] xskxceiver[1440]: segfault at 1da718 ip 00007f4161aeac9d sp 00007f41615a6a00 error 6
> [ 7340.338040] xskxceiver[1441]: segfault at 7f410000000b ip 00000000004042b5 sp 00007f415bffecf0 error 4
> [ 7340.338179] in libc.so.6[61c9d,7f4161aaf000+160000]
> [ 7340.339230] in xskxceiver[42b5,400000+69000]
> [ 7340.340300] likely on CPU 6 (core 0, socket 6)
> [ 7340.340302] Code: ff ff 01 e9 f4 fe ff ff 0f 1f 44 00 00 4c 39 f0 74 73 31 c0 ba 01 00 00 00 f0 0f b1 17 0f 85 ba 00 00 00 49 8b 87 88 00 00 00 <4c> 89 70 08 eb cc 0f 1f 44 00 00 48 8d bd f0 fe ff ff 89 85 ec fe
> [ 7340.340888] likely on CPU 3 (core 0, socket 3)
> [ 7340.345088] Code: 00 00 00 ba 00 00 00 00 be 00 00 00 00 89 c7 e8 31 ca ff ff 89 45 ec 8b 45 ec 85 c0 78 07 b8 00 00 00 00 eb 46 e8 0b c8 ff ff <8b> 00 83 f8 69 74 24 e8 ff c7 ff ff 8b 00 83 f8 0b 74 18 e8 f3 c7
> [ 7340.404334] Oops: general protection fault, probably for non-canonical address 0x6d255010bdffc: 0000 [#1] SMP NOPTI
> [ 7340.405972] CPU: 7 UID: 0 PID: 1439 Comm: xskxceiver Not tainted 6.19.0-rc1+ #21 PREEMPT(lazy)
> [ 7340.408006] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-5.fc42 04/01/2014
> [ 7340.409716] RIP: 0010:lookup_swap_cgroup_id+0x44/0x80
> [ 7340.410455] Code: 83 f8 1c 73 39 48 ba ff ff ff ff ff ff ff 03 48 8b 04 c5 20 55 fa bd 48 21 d1 48 89 ca 83 e1 01 48 d1 ea c1 e1 04 48 8d 04 90 <8b> 00 48 83 c4 10 d3 e8 c3 cc cc cc cc 31 c0 e9 98 b7 dd 00 48 89
> [ 7340.412787] RSP: 0018:ffffcc5c04f7f6d0 EFLAGS: 00010202
> [ 7340.413494] RAX: 0006d255010bdffc RBX: ffff891f477895a8 RCX: 0000000000000010
> [ 7340.414431] RDX: 0001c17e3fffffff RSI: 00fa070000000000 RDI: 000382fc7fffffff
> [ 7340.415354] RBP: 00fa070000000000 R08: ffffcc5c04f7f8f8 R09: ffffcc5c04f7f7d0
> [ 7340.416283] R10: ffff891f4c1a7000 R11: ffffcc5c04f7f9c8 R12: ffffcc5c04f7f7d0
> [ 7340.417218] R13: 03ffffffffffffff R14: 00fa06fffffffe00 R15: ffff891f47789500
> [ 7340.418229] FS: 0000000000000000(0000) GS:ffff891ffdfaa000(0000) knlGS:0000000000000000
> [ 7340.419489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 7340.420286] CR2: 00007f415bfffd58 CR3: 0000000103f03002 CR4: 0000000000772ef0
> [ 7340.421237] PKRU: 55555554
> [ 7340.421623] Call Trace:
> [ 7340.421987] <TASK>
> [ 7340.422309] ? softleaf_from_pte+0x77/0xa0
> [ 7340.422855] swap_pte_batch+0xa7/0x290
> [ 7340.423363] zap_nonpresent_ptes.constprop.0.isra.0+0xd1/0x270
> [ 7340.424102] zap_pte_range+0x281/0x580
> [ 7340.424607] zap_pmd_range.isra.0+0xc9/0x240
> [ 7340.425177] unmap_page_range+0x24d/0x420
> [ 7340.425714] unmap_vmas+0xa1/0x180
> [ 7340.426185] exit_mmap+0xe1/0x3b0
> [ 7340.426644] __mmput+0x41/0x150
> [ 7340.427098] exit_mm+0xb1/0x110
> [ 7340.427539] do_exit+0x1b2/0x460
> [ 7340.427992] do_group_exit+0x2d/0xc0
> [ 7340.428477] get_signal+0x79d/0x7e0
> [ 7340.428957] arch_do_signal_or_restart+0x34/0x100
> [ 7340.429571] exit_to_user_mode_loop+0x8e/0x4c0
> [ 7340.430159] do_syscall_64+0x188/0x6b0
> [ 7340.430672] ? __do_sys_clone3+0xd9/0x120
> [ 7340.431212] ? switch_fpu_return+0x4e/0xd0
> [ 7340.431761] ? arch_exit_to_user_mode_prepare.isra.0+0xa1/0xc0
> [ 7340.432498] ? do_syscall_64+0xbb/0x6b0
> [ 7340.433015] ? __handle_mm_fault+0x445/0x690
> [ 7340.433582] ? count_memcg_events+0xd6/0x210
> [ 7340.434151] ? handle_mm_fault+0x212/0x340
> [ 7340.434697] ? do_user_addr_fault+0x2b4/0x7b0
> [ 7340.435271] ? clear_bhb_loop+0x30/0x80
> [ 7340.435788] ? clear_bhb_loop+0x30/0x80
> [ 7340.436299] ? clear_bhb_loop+0x30/0x80
> [ 7340.436812] ? clear_bhb_loop+0x30/0x80
> [ 7340.437323] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 7340.437973] RIP: 0033:0x7f4161b14169
> [ 7340.438468] Code: Unable to access opcode bytes at 0x7f4161b1413f.
> [ 7340.439242] RSP: 002b:00007ffc6ebfa770 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
> [ 7340.440173] RAX: fffffffffffffe00 RBX: 00000000000005a1 RCX: 00007f4161b14169
> [ 7340.441061] RDX: 00000000000005a1 RSI: 0000000000000109 RDI: 00007f415bfff990
> [ 7340.441943] RBP: 00007ffc6ebfa7a0 R08: 0000000000000000 R09: 00000000ffffffff
> [ 7340.442824] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> [ 7340.443707] R13: 0000000000000000 R14: 00007f415bfff990 R15: 00007f415bfff6c0
> [ 7340.444586] </TASK>
> [ 7340.444922] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit libnvdimm kvm_intel vfat fat kvm snd_pcm irqbypass rapl iTCO_wdt snd_timer intel_pmc_bxt iTCO_vendor_support snd ixgbevf virtio_net soundcore i2c_i801 pcspkr libeth_xdp net_failover i2c_smbus lpc_ich failover libeth virtio_balloon joydev 9p fuse loop zram lz4hc_compress lz4_compress 9pnet_virtio 9pnet netfs ghash_clmulni_intel serio_raw qemu_fw_cfg
> [ 7340.449650] ---[ end trace 0000000000000000 ]---
>
> The issue can be fixed in all in-tree drivers, but we cannot just trust OOT
> drivers to not do this. Therefore, make tailroom a signed int and produce a
> warning when it is negative to prevent such mistakes in the future.
>
> Fixes: bf25146a5595 ("bpf: add frags support to the bpf_xdp_adjust_tail() API")
> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> ---
> net/core/filter.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 616e0520a0bb..9715d957e3c5 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4149,12 +4149,13 @@ static int bpf_xdp_frags_increase_tail(struct xdp_buff *xdp, int offset)
> struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
> skb_frag_t *frag = &sinfo->frags[sinfo->nr_frags - 1];
> struct xdp_rxq_info *rxq = xdp->rxq;
> - unsigned int tailroom;
> + int tailroom;
>
> if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz)
> return -EOPNOTSUPP;
>
> tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
> + WARN_ON_ONCE(tailroom < 0);
> if (unlikely(offset > tailroom))
> return -EINVAL;
>
Why can't we do both? I.e., WARN_ON_ONCE() *and* return -EINVAL?
-Toke
^ permalink raw reply [flat|nested] 29+ messages in thread* Re: [PATCH bpf 1/6] xdp: produce a warning when calculated tailroom is negative
2026-02-03 12:26 ` Toke Høiland-Jørgensen
@ 2026-02-03 12:31 ` Larysa Zaremba
2026-02-03 12:38 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 29+ messages in thread
From: Larysa Zaremba @ 2026-02-03 12:31 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: bpf, Claudiu Manoil, Vladimir Oltean, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Tony Nguyen, Przemek Kitszel, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Andrii Nakryiko, Martin KaFai Lau,
Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Hao Luo,
Jiri Olsa, Simon Horman, Shuah Khan, Alexander Lobakin,
Maciej Fijalkowski, Bastien Curutchet (eBPF Foundation),
Tushar Vyavahare, Jason Xing, Ricardo B. Marlière,
Eelco Chaudron, Lorenzo Bianconi, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
On Tue, Feb 03, 2026 at 01:26:21PM +0100, Toke Høiland-Jørgensen wrote:
> Larysa Zaremba <larysa.zaremba@intel.com> writes:
>
> > Many ethernet drivers report xdp Rx queue frag size as being the same as
> > DMA write size. However, the only user of this field, namely
> > bpf_xdp_frags_increase_tail(), clearly expects a truesize.
> >
> > Such difference leads to unspecific memory corruption issues under certain
> > circumstances, e.g. in ixgbevf maximum DMA write size is 3 KB, so when
> > running xskxceiver's XDP_ADJUST_TAIL_GROW_MULTI_BUFF, 6K packet fully uses
> > all DMA-writable space in 2 buffers. This would be fine, if only
> > rxq->frag_size was properly set to 4K, but value of 3K results in a
> > negative tailroom, because there is a non-zero page offset.
> >
> > We could return -EINVAL and be done with it in such case, but due to
> > tailroom being stored as an unsigned int, it is reported to be somewhere
> > near UINT_MAX, resulting in a tail being grown, even if the requested
> > offset is too much (it is around 2K in the abovementioned test). This later
> > leads to all kinds of unspecific calltraces.
> >
> > [ 7340.337579] xskxceiver[1440]: segfault at 1da718 ip 00007f4161aeac9d sp 00007f41615a6a00 error 6
> > [ 7340.338040] xskxceiver[1441]: segfault at 7f410000000b ip 00000000004042b5 sp 00007f415bffecf0 error 4
> > [ 7340.338179] in libc.so.6[61c9d,7f4161aaf000+160000]
> > [ 7340.339230] in xskxceiver[42b5,400000+69000]
> > [ 7340.340300] likely on CPU 6 (core 0, socket 6)
> > [ 7340.340302] Code: ff ff 01 e9 f4 fe ff ff 0f 1f 44 00 00 4c 39 f0 74 73 31 c0 ba 01 00 00 00 f0 0f b1 17 0f 85 ba 00 00 00 49 8b 87 88 00 00 00 <4c> 89 70 08 eb cc 0f 1f 44 00 00 48 8d bd f0 fe ff ff 89 85 ec fe
> > [ 7340.340888] likely on CPU 3 (core 0, socket 3)
> > [ 7340.345088] Code: 00 00 00 ba 00 00 00 00 be 00 00 00 00 89 c7 e8 31 ca ff ff 89 45 ec 8b 45 ec 85 c0 78 07 b8 00 00 00 00 eb 46 e8 0b c8 ff ff <8b> 00 83 f8 69 74 24 e8 ff c7 ff ff 8b 00 83 f8 0b 74 18 e8 f3 c7
> > [ 7340.404334] Oops: general protection fault, probably for non-canonical address 0x6d255010bdffc: 0000 [#1] SMP NOPTI
> > [ 7340.405972] CPU: 7 UID: 0 PID: 1439 Comm: xskxceiver Not tainted 6.19.0-rc1+ #21 PREEMPT(lazy)
> > [ 7340.408006] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-5.fc42 04/01/2014
> > [ 7340.409716] RIP: 0010:lookup_swap_cgroup_id+0x44/0x80
> > [ 7340.410455] Code: 83 f8 1c 73 39 48 ba ff ff ff ff ff ff ff 03 48 8b 04 c5 20 55 fa bd 48 21 d1 48 89 ca 83 e1 01 48 d1 ea c1 e1 04 48 8d 04 90 <8b> 00 48 83 c4 10 d3 e8 c3 cc cc cc cc 31 c0 e9 98 b7 dd 00 48 89
> > [ 7340.412787] RSP: 0018:ffffcc5c04f7f6d0 EFLAGS: 00010202
> > [ 7340.413494] RAX: 0006d255010bdffc RBX: ffff891f477895a8 RCX: 0000000000000010
> > [ 7340.414431] RDX: 0001c17e3fffffff RSI: 00fa070000000000 RDI: 000382fc7fffffff
> > [ 7340.415354] RBP: 00fa070000000000 R08: ffffcc5c04f7f8f8 R09: ffffcc5c04f7f7d0
> > [ 7340.416283] R10: ffff891f4c1a7000 R11: ffffcc5c04f7f9c8 R12: ffffcc5c04f7f7d0
> > [ 7340.417218] R13: 03ffffffffffffff R14: 00fa06fffffffe00 R15: ffff891f47789500
> > [ 7340.418229] FS: 0000000000000000(0000) GS:ffff891ffdfaa000(0000) knlGS:0000000000000000
> > [ 7340.419489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 7340.420286] CR2: 00007f415bfffd58 CR3: 0000000103f03002 CR4: 0000000000772ef0
> > [ 7340.421237] PKRU: 55555554
> > [ 7340.421623] Call Trace:
> > [ 7340.421987] <TASK>
> > [ 7340.422309] ? softleaf_from_pte+0x77/0xa0
> > [ 7340.422855] swap_pte_batch+0xa7/0x290
> > [ 7340.423363] zap_nonpresent_ptes.constprop.0.isra.0+0xd1/0x270
> > [ 7340.424102] zap_pte_range+0x281/0x580
> > [ 7340.424607] zap_pmd_range.isra.0+0xc9/0x240
> > [ 7340.425177] unmap_page_range+0x24d/0x420
> > [ 7340.425714] unmap_vmas+0xa1/0x180
> > [ 7340.426185] exit_mmap+0xe1/0x3b0
> > [ 7340.426644] __mmput+0x41/0x150
> > [ 7340.427098] exit_mm+0xb1/0x110
> > [ 7340.427539] do_exit+0x1b2/0x460
> > [ 7340.427992] do_group_exit+0x2d/0xc0
> > [ 7340.428477] get_signal+0x79d/0x7e0
> > [ 7340.428957] arch_do_signal_or_restart+0x34/0x100
> > [ 7340.429571] exit_to_user_mode_loop+0x8e/0x4c0
> > [ 7340.430159] do_syscall_64+0x188/0x6b0
> > [ 7340.430672] ? __do_sys_clone3+0xd9/0x120
> > [ 7340.431212] ? switch_fpu_return+0x4e/0xd0
> > [ 7340.431761] ? arch_exit_to_user_mode_prepare.isra.0+0xa1/0xc0
> > [ 7340.432498] ? do_syscall_64+0xbb/0x6b0
> > [ 7340.433015] ? __handle_mm_fault+0x445/0x690
> > [ 7340.433582] ? count_memcg_events+0xd6/0x210
> > [ 7340.434151] ? handle_mm_fault+0x212/0x340
> > [ 7340.434697] ? do_user_addr_fault+0x2b4/0x7b0
> > [ 7340.435271] ? clear_bhb_loop+0x30/0x80
> > [ 7340.435788] ? clear_bhb_loop+0x30/0x80
> > [ 7340.436299] ? clear_bhb_loop+0x30/0x80
> > [ 7340.436812] ? clear_bhb_loop+0x30/0x80
> > [ 7340.437323] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 7340.437973] RIP: 0033:0x7f4161b14169
> > [ 7340.438468] Code: Unable to access opcode bytes at 0x7f4161b1413f.
> > [ 7340.439242] RSP: 002b:00007ffc6ebfa770 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
> > [ 7340.440173] RAX: fffffffffffffe00 RBX: 00000000000005a1 RCX: 00007f4161b14169
> > [ 7340.441061] RDX: 00000000000005a1 RSI: 0000000000000109 RDI: 00007f415bfff990
> > [ 7340.441943] RBP: 00007ffc6ebfa7a0 R08: 0000000000000000 R09: 00000000ffffffff
> > [ 7340.442824] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> > [ 7340.443707] R13: 0000000000000000 R14: 00007f415bfff990 R15: 00007f415bfff6c0
> > [ 7340.444586] </TASK>
> > [ 7340.444922] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit libnvdimm kvm_intel vfat fat kvm snd_pcm irqbypass rapl iTCO_wdt snd_timer intel_pmc_bxt iTCO_vendor_support snd ixgbevf virtio_net soundcore i2c_i801 pcspkr libeth_xdp net_failover i2c_smbus lpc_ich failover libeth virtio_balloon joydev 9p fuse loop zram lz4hc_compress lz4_compress 9pnet_virtio 9pnet netfs ghash_clmulni_intel serio_raw qemu_fw_cfg
> > [ 7340.449650] ---[ end trace 0000000000000000 ]---
> >
> > The issue can be fixed in all in-tree drivers, but we cannot just trust OOT
> > drivers to not do this. Therefore, make tailroom a signed int and produce a
> > warning when it is negative to prevent such mistakes in the future.
> >
> > Fixes: bf25146a5595 ("bpf: add frags support to the bpf_xdp_adjust_tail() API")
> > Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > ---
> > net/core/filter.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 616e0520a0bb..9715d957e3c5 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -4149,12 +4149,13 @@ static int bpf_xdp_frags_increase_tail(struct xdp_buff *xdp, int offset)
> > struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
> > skb_frag_t *frag = &sinfo->frags[sinfo->nr_frags - 1];
> > struct xdp_rxq_info *rxq = xdp->rxq;
> > - unsigned int tailroom;
> > + int tailroom;
> >
> > if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz)
> > return -EOPNOTSUPP;
> >
> > tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
> > + WARN_ON_ONCE(tailroom < 0);
> > if (unlikely(offset > tailroom))
> > return -EINVAL;
> >
>
> Why can't we do both? I.e., WARN_ON_ONCE() *and* return -EINVAL?
>
> -Toke
>
It would be redundant, offset is always >= 0 here, so with tailroom now being a
signed int, offset is always bigger and -EINVAL is returned.
^ permalink raw reply [flat|nested] 29+ messages in thread* Re: [PATCH bpf 1/6] xdp: produce a warning when calculated tailroom is negative
2026-02-03 12:31 ` Larysa Zaremba
@ 2026-02-03 12:38 ` Toke Høiland-Jørgensen
2026-02-03 12:54 ` Larysa Zaremba
0 siblings, 1 reply; 29+ messages in thread
From: Toke Høiland-Jørgensen @ 2026-02-03 12:38 UTC (permalink / raw)
To: Larysa Zaremba
Cc: bpf, Claudiu Manoil, Vladimir Oltean, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Tony Nguyen, Przemek Kitszel, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Andrii Nakryiko, Martin KaFai Lau,
Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Hao Luo,
Jiri Olsa, Simon Horman, Shuah Khan, Alexander Lobakin,
Maciej Fijalkowski, Bastien Curutchet (eBPF Foundation),
Tushar Vyavahare, Jason Xing, Ricardo B. Marlière,
Eelco Chaudron, Lorenzo Bianconi, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
Larysa Zaremba <larysa.zaremba@intel.com> writes:
> On Tue, Feb 03, 2026 at 01:26:21PM +0100, Toke Høiland-Jørgensen wrote:
>> Larysa Zaremba <larysa.zaremba@intel.com> writes:
>>
>> > Many ethernet drivers report xdp Rx queue frag size as being the same as
>> > DMA write size. However, the only user of this field, namely
>> > bpf_xdp_frags_increase_tail(), clearly expects a truesize.
>> >
>> > Such difference leads to unspecific memory corruption issues under certain
>> > circumstances, e.g. in ixgbevf maximum DMA write size is 3 KB, so when
>> > running xskxceiver's XDP_ADJUST_TAIL_GROW_MULTI_BUFF, 6K packet fully uses
>> > all DMA-writable space in 2 buffers. This would be fine, if only
>> > rxq->frag_size was properly set to 4K, but value of 3K results in a
>> > negative tailroom, because there is a non-zero page offset.
>> >
>> > We could return -EINVAL and be done with it in such case, but due to
>> > tailroom being stored as an unsigned int, it is reported to be somewhere
>> > near UINT_MAX, resulting in a tail being grown, even if the requested
>> > offset is too much (it is around 2K in the abovementioned test). This later
>> > leads to all kinds of unspecific calltraces.
>> >
>> > [ 7340.337579] xskxceiver[1440]: segfault at 1da718 ip 00007f4161aeac9d sp 00007f41615a6a00 error 6
>> > [ 7340.338040] xskxceiver[1441]: segfault at 7f410000000b ip 00000000004042b5 sp 00007f415bffecf0 error 4
>> > [ 7340.338179] in libc.so.6[61c9d,7f4161aaf000+160000]
>> > [ 7340.339230] in xskxceiver[42b5,400000+69000]
>> > [ 7340.340300] likely on CPU 6 (core 0, socket 6)
>> > [ 7340.340302] Code: ff ff 01 e9 f4 fe ff ff 0f 1f 44 00 00 4c 39 f0 74 73 31 c0 ba 01 00 00 00 f0 0f b1 17 0f 85 ba 00 00 00 49 8b 87 88 00 00 00 <4c> 89 70 08 eb cc 0f 1f 44 00 00 48 8d bd f0 fe ff ff 89 85 ec fe
>> > [ 7340.340888] likely on CPU 3 (core 0, socket 3)
>> > [ 7340.345088] Code: 00 00 00 ba 00 00 00 00 be 00 00 00 00 89 c7 e8 31 ca ff ff 89 45 ec 8b 45 ec 85 c0 78 07 b8 00 00 00 00 eb 46 e8 0b c8 ff ff <8b> 00 83 f8 69 74 24 e8 ff c7 ff ff 8b 00 83 f8 0b 74 18 e8 f3 c7
>> > [ 7340.404334] Oops: general protection fault, probably for non-canonical address 0x6d255010bdffc: 0000 [#1] SMP NOPTI
>> > [ 7340.405972] CPU: 7 UID: 0 PID: 1439 Comm: xskxceiver Not tainted 6.19.0-rc1+ #21 PREEMPT(lazy)
>> > [ 7340.408006] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-5.fc42 04/01/2014
>> > [ 7340.409716] RIP: 0010:lookup_swap_cgroup_id+0x44/0x80
>> > [ 7340.410455] Code: 83 f8 1c 73 39 48 ba ff ff ff ff ff ff ff 03 48 8b 04 c5 20 55 fa bd 48 21 d1 48 89 ca 83 e1 01 48 d1 ea c1 e1 04 48 8d 04 90 <8b> 00 48 83 c4 10 d3 e8 c3 cc cc cc cc 31 c0 e9 98 b7 dd 00 48 89
>> > [ 7340.412787] RSP: 0018:ffffcc5c04f7f6d0 EFLAGS: 00010202
>> > [ 7340.413494] RAX: 0006d255010bdffc RBX: ffff891f477895a8 RCX: 0000000000000010
>> > [ 7340.414431] RDX: 0001c17e3fffffff RSI: 00fa070000000000 RDI: 000382fc7fffffff
>> > [ 7340.415354] RBP: 00fa070000000000 R08: ffffcc5c04f7f8f8 R09: ffffcc5c04f7f7d0
>> > [ 7340.416283] R10: ffff891f4c1a7000 R11: ffffcc5c04f7f9c8 R12: ffffcc5c04f7f7d0
>> > [ 7340.417218] R13: 03ffffffffffffff R14: 00fa06fffffffe00 R15: ffff891f47789500
>> > [ 7340.418229] FS: 0000000000000000(0000) GS:ffff891ffdfaa000(0000) knlGS:0000000000000000
>> > [ 7340.419489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > [ 7340.420286] CR2: 00007f415bfffd58 CR3: 0000000103f03002 CR4: 0000000000772ef0
>> > [ 7340.421237] PKRU: 55555554
>> > [ 7340.421623] Call Trace:
>> > [ 7340.421987] <TASK>
>> > [ 7340.422309] ? softleaf_from_pte+0x77/0xa0
>> > [ 7340.422855] swap_pte_batch+0xa7/0x290
>> > [ 7340.423363] zap_nonpresent_ptes.constprop.0.isra.0+0xd1/0x270
>> > [ 7340.424102] zap_pte_range+0x281/0x580
>> > [ 7340.424607] zap_pmd_range.isra.0+0xc9/0x240
>> > [ 7340.425177] unmap_page_range+0x24d/0x420
>> > [ 7340.425714] unmap_vmas+0xa1/0x180
>> > [ 7340.426185] exit_mmap+0xe1/0x3b0
>> > [ 7340.426644] __mmput+0x41/0x150
>> > [ 7340.427098] exit_mm+0xb1/0x110
>> > [ 7340.427539] do_exit+0x1b2/0x460
>> > [ 7340.427992] do_group_exit+0x2d/0xc0
>> > [ 7340.428477] get_signal+0x79d/0x7e0
>> > [ 7340.428957] arch_do_signal_or_restart+0x34/0x100
>> > [ 7340.429571] exit_to_user_mode_loop+0x8e/0x4c0
>> > [ 7340.430159] do_syscall_64+0x188/0x6b0
>> > [ 7340.430672] ? __do_sys_clone3+0xd9/0x120
>> > [ 7340.431212] ? switch_fpu_return+0x4e/0xd0
>> > [ 7340.431761] ? arch_exit_to_user_mode_prepare.isra.0+0xa1/0xc0
>> > [ 7340.432498] ? do_syscall_64+0xbb/0x6b0
>> > [ 7340.433015] ? __handle_mm_fault+0x445/0x690
>> > [ 7340.433582] ? count_memcg_events+0xd6/0x210
>> > [ 7340.434151] ? handle_mm_fault+0x212/0x340
>> > [ 7340.434697] ? do_user_addr_fault+0x2b4/0x7b0
>> > [ 7340.435271] ? clear_bhb_loop+0x30/0x80
>> > [ 7340.435788] ? clear_bhb_loop+0x30/0x80
>> > [ 7340.436299] ? clear_bhb_loop+0x30/0x80
>> > [ 7340.436812] ? clear_bhb_loop+0x30/0x80
>> > [ 7340.437323] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> > [ 7340.437973] RIP: 0033:0x7f4161b14169
>> > [ 7340.438468] Code: Unable to access opcode bytes at 0x7f4161b1413f.
>> > [ 7340.439242] RSP: 002b:00007ffc6ebfa770 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
>> > [ 7340.440173] RAX: fffffffffffffe00 RBX: 00000000000005a1 RCX: 00007f4161b14169
>> > [ 7340.441061] RDX: 00000000000005a1 RSI: 0000000000000109 RDI: 00007f415bfff990
>> > [ 7340.441943] RBP: 00007ffc6ebfa7a0 R08: 0000000000000000 R09: 00000000ffffffff
>> > [ 7340.442824] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
>> > [ 7340.443707] R13: 0000000000000000 R14: 00007f415bfff990 R15: 00007f415bfff6c0
>> > [ 7340.444586] </TASK>
>> > [ 7340.444922] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit libnvdimm kvm_intel vfat fat kvm snd_pcm irqbypass rapl iTCO_wdt snd_timer intel_pmc_bxt iTCO_vendor_support snd ixgbevf virtio_net soundcore i2c_i801 pcspkr libeth_xdp net_failover i2c_smbus lpc_ich failover libeth virtio_balloon joydev 9p fuse loop zram lz4hc_compress lz4_compress 9pnet_virtio 9pnet netfs ghash_clmulni_intel serio_raw qemu_fw_cfg
>> > [ 7340.449650] ---[ end trace 0000000000000000 ]---
>> >
>> > The issue can be fixed in all in-tree drivers, but we cannot just trust OOT
>> > drivers to not do this. Therefore, make tailroom a signed int and produce a
>> > warning when it is negative to prevent such mistakes in the future.
>> >
>> > Fixes: bf25146a5595 ("bpf: add frags support to the bpf_xdp_adjust_tail() API")
>> > Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
>> > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
>> > ---
>> > net/core/filter.c | 3 ++-
>> > 1 file changed, 2 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/net/core/filter.c b/net/core/filter.c
>> > index 616e0520a0bb..9715d957e3c5 100644
>> > --- a/net/core/filter.c
>> > +++ b/net/core/filter.c
>> > @@ -4149,12 +4149,13 @@ static int bpf_xdp_frags_increase_tail(struct xdp_buff *xdp, int offset)
>> > struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
>> > skb_frag_t *frag = &sinfo->frags[sinfo->nr_frags - 1];
>> > struct xdp_rxq_info *rxq = xdp->rxq;
>> > - unsigned int tailroom;
>> > + int tailroom;
>> >
>> > if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz)
>> > return -EOPNOTSUPP;
>> >
>> > tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
>> > + WARN_ON_ONCE(tailroom < 0);
>> > if (unlikely(offset > tailroom))
>> > return -EINVAL;
>> >
>>
>> Why can't we do both? I.e., WARN_ON_ONCE() *and* return -EINVAL?
>>
>> -Toke
>>
>
> It would be redundant, offset is always >= 0 here, so with tailroom now being a
> signed int, offset is always bigger and -EINVAL is returned.
Oh, I see. OK, may be worth calling out; I read this paragraph in your
commit message to mean "we don't bother returning EINVAL in this case,
we just warn":
> > We could return -EINVAL and be done with it in such case, but due to
> > tailroom being stored as an unsigned int, it is reported to be somewhere
> > near UINT_MAX, resulting in a tail being grown, even if the requested
> > offset is too much (it is around 2K in the abovementioned test). This later
> > leads to all kinds of unspecific calltraces.
-Toke
^ permalink raw reply [flat|nested] 29+ messages in thread* Re: [PATCH bpf 1/6] xdp: produce a warning when calculated tailroom is negative
2026-02-03 12:38 ` Toke Høiland-Jørgensen
@ 2026-02-03 12:54 ` Larysa Zaremba
2026-02-03 13:37 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 29+ messages in thread
From: Larysa Zaremba @ 2026-02-03 12:54 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: bpf, Claudiu Manoil, Vladimir Oltean, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Tony Nguyen, Przemek Kitszel, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Andrii Nakryiko, Martin KaFai Lau,
Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Hao Luo,
Jiri Olsa, Simon Horman, Shuah Khan, Alexander Lobakin,
Maciej Fijalkowski, Bastien Curutchet (eBPF Foundation),
Tushar Vyavahare, Jason Xing, Ricardo B. Marlière,
Eelco Chaudron, Lorenzo Bianconi, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
On Tue, Feb 03, 2026 at 01:38:47PM +0100, Toke Høiland-Jørgensen wrote:
> Larysa Zaremba <larysa.zaremba@intel.com> writes:
>
> > On Tue, Feb 03, 2026 at 01:26:21PM +0100, Toke Høiland-Jørgensen wrote:
> >> Larysa Zaremba <larysa.zaremba@intel.com> writes:
> >>
> >> > Many ethernet drivers report xdp Rx queue frag size as being the same as
> >> > DMA write size. However, the only user of this field, namely
> >> > bpf_xdp_frags_increase_tail(), clearly expects a truesize.
> >> >
> >> > Such difference leads to unspecific memory corruption issues under certain
> >> > circumstances, e.g. in ixgbevf maximum DMA write size is 3 KB, so when
> >> > running xskxceiver's XDP_ADJUST_TAIL_GROW_MULTI_BUFF, 6K packet fully uses
> >> > all DMA-writable space in 2 buffers. This would be fine, if only
> >> > rxq->frag_size was properly set to 4K, but value of 3K results in a
> >> > negative tailroom, because there is a non-zero page offset.
> >> >
> >> > We could return -EINVAL and be done with it in such case, but due to
> >> > tailroom being stored as an unsigned int, it is reported to be somewhere
> >> > near UINT_MAX, resulting in a tail being grown, even if the requested
> >> > offset is too much (it is around 2K in the abovementioned test). This later
> >> > leads to all kinds of unspecific calltraces.
> >> >
> >> > [ 7340.337579] xskxceiver[1440]: segfault at 1da718 ip 00007f4161aeac9d sp 00007f41615a6a00 error 6
> >> > [ 7340.338040] xskxceiver[1441]: segfault at 7f410000000b ip 00000000004042b5 sp 00007f415bffecf0 error 4
> >> > [ 7340.338179] in libc.so.6[61c9d,7f4161aaf000+160000]
> >> > [ 7340.339230] in xskxceiver[42b5,400000+69000]
> >> > [ 7340.340300] likely on CPU 6 (core 0, socket 6)
> >> > [ 7340.340302] Code: ff ff 01 e9 f4 fe ff ff 0f 1f 44 00 00 4c 39 f0 74 73 31 c0 ba 01 00 00 00 f0 0f b1 17 0f 85 ba 00 00 00 49 8b 87 88 00 00 00 <4c> 89 70 08 eb cc 0f 1f 44 00 00 48 8d bd f0 fe ff ff 89 85 ec fe
> >> > [ 7340.340888] likely on CPU 3 (core 0, socket 3)
> >> > [ 7340.345088] Code: 00 00 00 ba 00 00 00 00 be 00 00 00 00 89 c7 e8 31 ca ff ff 89 45 ec 8b 45 ec 85 c0 78 07 b8 00 00 00 00 eb 46 e8 0b c8 ff ff <8b> 00 83 f8 69 74 24 e8 ff c7 ff ff 8b 00 83 f8 0b 74 18 e8 f3 c7
> >> > [ 7340.404334] Oops: general protection fault, probably for non-canonical address 0x6d255010bdffc: 0000 [#1] SMP NOPTI
> >> > [ 7340.405972] CPU: 7 UID: 0 PID: 1439 Comm: xskxceiver Not tainted 6.19.0-rc1+ #21 PREEMPT(lazy)
> >> > [ 7340.408006] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-5.fc42 04/01/2014
> >> > [ 7340.409716] RIP: 0010:lookup_swap_cgroup_id+0x44/0x80
> >> > [ 7340.410455] Code: 83 f8 1c 73 39 48 ba ff ff ff ff ff ff ff 03 48 8b 04 c5 20 55 fa bd 48 21 d1 48 89 ca 83 e1 01 48 d1 ea c1 e1 04 48 8d 04 90 <8b> 00 48 83 c4 10 d3 e8 c3 cc cc cc cc 31 c0 e9 98 b7 dd 00 48 89
> >> > [ 7340.412787] RSP: 0018:ffffcc5c04f7f6d0 EFLAGS: 00010202
> >> > [ 7340.413494] RAX: 0006d255010bdffc RBX: ffff891f477895a8 RCX: 0000000000000010
> >> > [ 7340.414431] RDX: 0001c17e3fffffff RSI: 00fa070000000000 RDI: 000382fc7fffffff
> >> > [ 7340.415354] RBP: 00fa070000000000 R08: ffffcc5c04f7f8f8 R09: ffffcc5c04f7f7d0
> >> > [ 7340.416283] R10: ffff891f4c1a7000 R11: ffffcc5c04f7f9c8 R12: ffffcc5c04f7f7d0
> >> > [ 7340.417218] R13: 03ffffffffffffff R14: 00fa06fffffffe00 R15: ffff891f47789500
> >> > [ 7340.418229] FS: 0000000000000000(0000) GS:ffff891ffdfaa000(0000) knlGS:0000000000000000
> >> > [ 7340.419489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> > [ 7340.420286] CR2: 00007f415bfffd58 CR3: 0000000103f03002 CR4: 0000000000772ef0
> >> > [ 7340.421237] PKRU: 55555554
> >> > [ 7340.421623] Call Trace:
> >> > [ 7340.421987] <TASK>
> >> > [ 7340.422309] ? softleaf_from_pte+0x77/0xa0
> >> > [ 7340.422855] swap_pte_batch+0xa7/0x290
> >> > [ 7340.423363] zap_nonpresent_ptes.constprop.0.isra.0+0xd1/0x270
> >> > [ 7340.424102] zap_pte_range+0x281/0x580
> >> > [ 7340.424607] zap_pmd_range.isra.0+0xc9/0x240
> >> > [ 7340.425177] unmap_page_range+0x24d/0x420
> >> > [ 7340.425714] unmap_vmas+0xa1/0x180
> >> > [ 7340.426185] exit_mmap+0xe1/0x3b0
> >> > [ 7340.426644] __mmput+0x41/0x150
> >> > [ 7340.427098] exit_mm+0xb1/0x110
> >> > [ 7340.427539] do_exit+0x1b2/0x460
> >> > [ 7340.427992] do_group_exit+0x2d/0xc0
> >> > [ 7340.428477] get_signal+0x79d/0x7e0
> >> > [ 7340.428957] arch_do_signal_or_restart+0x34/0x100
> >> > [ 7340.429571] exit_to_user_mode_loop+0x8e/0x4c0
> >> > [ 7340.430159] do_syscall_64+0x188/0x6b0
> >> > [ 7340.430672] ? __do_sys_clone3+0xd9/0x120
> >> > [ 7340.431212] ? switch_fpu_return+0x4e/0xd0
> >> > [ 7340.431761] ? arch_exit_to_user_mode_prepare.isra.0+0xa1/0xc0
> >> > [ 7340.432498] ? do_syscall_64+0xbb/0x6b0
> >> > [ 7340.433015] ? __handle_mm_fault+0x445/0x690
> >> > [ 7340.433582] ? count_memcg_events+0xd6/0x210
> >> > [ 7340.434151] ? handle_mm_fault+0x212/0x340
> >> > [ 7340.434697] ? do_user_addr_fault+0x2b4/0x7b0
> >> > [ 7340.435271] ? clear_bhb_loop+0x30/0x80
> >> > [ 7340.435788] ? clear_bhb_loop+0x30/0x80
> >> > [ 7340.436299] ? clear_bhb_loop+0x30/0x80
> >> > [ 7340.436812] ? clear_bhb_loop+0x30/0x80
> >> > [ 7340.437323] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >> > [ 7340.437973] RIP: 0033:0x7f4161b14169
> >> > [ 7340.438468] Code: Unable to access opcode bytes at 0x7f4161b1413f.
> >> > [ 7340.439242] RSP: 002b:00007ffc6ebfa770 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
> >> > [ 7340.440173] RAX: fffffffffffffe00 RBX: 00000000000005a1 RCX: 00007f4161b14169
> >> > [ 7340.441061] RDX: 00000000000005a1 RSI: 0000000000000109 RDI: 00007f415bfff990
> >> > [ 7340.441943] RBP: 00007ffc6ebfa7a0 R08: 0000000000000000 R09: 00000000ffffffff
> >> > [ 7340.442824] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> >> > [ 7340.443707] R13: 0000000000000000 R14: 00007f415bfff990 R15: 00007f415bfff6c0
> >> > [ 7340.444586] </TASK>
> >> > [ 7340.444922] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit libnvdimm kvm_intel vfat fat kvm snd_pcm irqbypass rapl iTCO_wdt snd_timer intel_pmc_bxt iTCO_vendor_support snd ixgbevf virtio_net soundcore i2c_i801 pcspkr libeth_xdp net_failover i2c_smbus lpc_ich failover libeth virtio_balloon joydev 9p fuse loop zram lz4hc_compress lz4_compress 9pnet_virtio 9pnet netfs ghash_clmulni_intel serio_raw qemu_fw_cfg
> >> > [ 7340.449650] ---[ end trace 0000000000000000 ]---
> >> >
> >> > The issue can be fixed in all in-tree drivers, but we cannot just trust OOT
> >> > drivers to not do this. Therefore, make tailroom a signed int and produce a
> >> > warning when it is negative to prevent such mistakes in the future.
> >> >
> >> > Fixes: bf25146a5595 ("bpf: add frags support to the bpf_xdp_adjust_tail() API")
> >> > Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> >> > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> >> > ---
> >> > net/core/filter.c | 3 ++-
> >> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/net/core/filter.c b/net/core/filter.c
> >> > index 616e0520a0bb..9715d957e3c5 100644
> >> > --- a/net/core/filter.c
> >> > +++ b/net/core/filter.c
> >> > @@ -4149,12 +4149,13 @@ static int bpf_xdp_frags_increase_tail(struct xdp_buff *xdp, int offset)
> >> > struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
> >> > skb_frag_t *frag = &sinfo->frags[sinfo->nr_frags - 1];
> >> > struct xdp_rxq_info *rxq = xdp->rxq;
> >> > - unsigned int tailroom;
> >> > + int tailroom;
> >> >
> >> > if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz)
> >> > return -EOPNOTSUPP;
> >> >
> >> > tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
> >> > + WARN_ON_ONCE(tailroom < 0);
> >> > if (unlikely(offset > tailroom))
> >> > return -EINVAL;
> >> >
> >>
> >> Why can't we do both? I.e., WARN_ON_ONCE() *and* return -EINVAL?
> >>
> >> -Toke
> >>
> >
> > It would be redundant, offset is always >= 0 here, so with tailroom now being a
> > signed int, offset is always bigger and -EINVAL is returned.
>
> Oh, I see. OK, may be worth calling out; I read this paragraph in your
> commit message to mean "we don't bother returning EINVAL in this case,
> we just warn":
Worth changing 'could' to 'are supposed to', if there will be v2.
>
> > > We could return -EINVAL and be done with it in such case, but due to
> > > tailroom being stored as an unsigned int, it is reported to be somewhere
> > > near UINT_MAX, resulting in a tail being grown, even if the requested
> > > offset is too much (it is around 2K in the abovementioned test). This later
> > > leads to all kinds of unspecific calltraces.
>
> -Toke
>
^ permalink raw reply [flat|nested] 29+ messages in thread* Re: [PATCH bpf 1/6] xdp: produce a warning when calculated tailroom is negative
2026-02-03 12:54 ` Larysa Zaremba
@ 2026-02-03 13:37 ` Toke Høiland-Jørgensen
0 siblings, 0 replies; 29+ messages in thread
From: Toke Høiland-Jørgensen @ 2026-02-03 13:37 UTC (permalink / raw)
To: Larysa Zaremba
Cc: bpf, Claudiu Manoil, Vladimir Oltean, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Tony Nguyen, Przemek Kitszel, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Andrii Nakryiko, Martin KaFai Lau,
Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Hao Luo,
Jiri Olsa, Simon Horman, Shuah Khan, Alexander Lobakin,
Maciej Fijalkowski, Bastien Curutchet (eBPF Foundation),
Tushar Vyavahare, Jason Xing, Ricardo B. Marlière,
Eelco Chaudron, Lorenzo Bianconi, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
Larysa Zaremba <larysa.zaremba@intel.com> writes:
> On Tue, Feb 03, 2026 at 01:38:47PM +0100, Toke Høiland-Jørgensen wrote:
>> Larysa Zaremba <larysa.zaremba@intel.com> writes:
>>
>> > On Tue, Feb 03, 2026 at 01:26:21PM +0100, Toke Høiland-Jørgensen wrote:
>> >> Larysa Zaremba <larysa.zaremba@intel.com> writes:
>> >>
>> >> > Many ethernet drivers report xdp Rx queue frag size as being the same as
>> >> > DMA write size. However, the only user of this field, namely
>> >> > bpf_xdp_frags_increase_tail(), clearly expects a truesize.
>> >> >
>> >> > Such difference leads to unspecific memory corruption issues under certain
>> >> > circumstances, e.g. in ixgbevf maximum DMA write size is 3 KB, so when
>> >> > running xskxceiver's XDP_ADJUST_TAIL_GROW_MULTI_BUFF, 6K packet fully uses
>> >> > all DMA-writable space in 2 buffers. This would be fine, if only
>> >> > rxq->frag_size was properly set to 4K, but value of 3K results in a
>> >> > negative tailroom, because there is a non-zero page offset.
>> >> >
>> >> > We could return -EINVAL and be done with it in such case, but due to
>> >> > tailroom being stored as an unsigned int, it is reported to be somewhere
>> >> > near UINT_MAX, resulting in a tail being grown, even if the requested
>> >> > offset is too much (it is around 2K in the abovementioned test). This later
>> >> > leads to all kinds of unspecific calltraces.
>> >> >
>> >> > [ 7340.337579] xskxceiver[1440]: segfault at 1da718 ip 00007f4161aeac9d sp 00007f41615a6a00 error 6
>> >> > [ 7340.338040] xskxceiver[1441]: segfault at 7f410000000b ip 00000000004042b5 sp 00007f415bffecf0 error 4
>> >> > [ 7340.338179] in libc.so.6[61c9d,7f4161aaf000+160000]
>> >> > [ 7340.339230] in xskxceiver[42b5,400000+69000]
>> >> > [ 7340.340300] likely on CPU 6 (core 0, socket 6)
>> >> > [ 7340.340302] Code: ff ff 01 e9 f4 fe ff ff 0f 1f 44 00 00 4c 39 f0 74 73 31 c0 ba 01 00 00 00 f0 0f b1 17 0f 85 ba 00 00 00 49 8b 87 88 00 00 00 <4c> 89 70 08 eb cc 0f 1f 44 00 00 48 8d bd f0 fe ff ff 89 85 ec fe
>> >> > [ 7340.340888] likely on CPU 3 (core 0, socket 3)
>> >> > [ 7340.345088] Code: 00 00 00 ba 00 00 00 00 be 00 00 00 00 89 c7 e8 31 ca ff ff 89 45 ec 8b 45 ec 85 c0 78 07 b8 00 00 00 00 eb 46 e8 0b c8 ff ff <8b> 00 83 f8 69 74 24 e8 ff c7 ff ff 8b 00 83 f8 0b 74 18 e8 f3 c7
>> >> > [ 7340.404334] Oops: general protection fault, probably for non-canonical address 0x6d255010bdffc: 0000 [#1] SMP NOPTI
>> >> > [ 7340.405972] CPU: 7 UID: 0 PID: 1439 Comm: xskxceiver Not tainted 6.19.0-rc1+ #21 PREEMPT(lazy)
>> >> > [ 7340.408006] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-5.fc42 04/01/2014
>> >> > [ 7340.409716] RIP: 0010:lookup_swap_cgroup_id+0x44/0x80
>> >> > [ 7340.410455] Code: 83 f8 1c 73 39 48 ba ff ff ff ff ff ff ff 03 48 8b 04 c5 20 55 fa bd 48 21 d1 48 89 ca 83 e1 01 48 d1 ea c1 e1 04 48 8d 04 90 <8b> 00 48 83 c4 10 d3 e8 c3 cc cc cc cc 31 c0 e9 98 b7 dd 00 48 89
>> >> > [ 7340.412787] RSP: 0018:ffffcc5c04f7f6d0 EFLAGS: 00010202
>> >> > [ 7340.413494] RAX: 0006d255010bdffc RBX: ffff891f477895a8 RCX: 0000000000000010
>> >> > [ 7340.414431] RDX: 0001c17e3fffffff RSI: 00fa070000000000 RDI: 000382fc7fffffff
>> >> > [ 7340.415354] RBP: 00fa070000000000 R08: ffffcc5c04f7f8f8 R09: ffffcc5c04f7f7d0
>> >> > [ 7340.416283] R10: ffff891f4c1a7000 R11: ffffcc5c04f7f9c8 R12: ffffcc5c04f7f7d0
>> >> > [ 7340.417218] R13: 03ffffffffffffff R14: 00fa06fffffffe00 R15: ffff891f47789500
>> >> > [ 7340.418229] FS: 0000000000000000(0000) GS:ffff891ffdfaa000(0000) knlGS:0000000000000000
>> >> > [ 7340.419489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> >> > [ 7340.420286] CR2: 00007f415bfffd58 CR3: 0000000103f03002 CR4: 0000000000772ef0
>> >> > [ 7340.421237] PKRU: 55555554
>> >> > [ 7340.421623] Call Trace:
>> >> > [ 7340.421987] <TASK>
>> >> > [ 7340.422309] ? softleaf_from_pte+0x77/0xa0
>> >> > [ 7340.422855] swap_pte_batch+0xa7/0x290
>> >> > [ 7340.423363] zap_nonpresent_ptes.constprop.0.isra.0+0xd1/0x270
>> >> > [ 7340.424102] zap_pte_range+0x281/0x580
>> >> > [ 7340.424607] zap_pmd_range.isra.0+0xc9/0x240
>> >> > [ 7340.425177] unmap_page_range+0x24d/0x420
>> >> > [ 7340.425714] unmap_vmas+0xa1/0x180
>> >> > [ 7340.426185] exit_mmap+0xe1/0x3b0
>> >> > [ 7340.426644] __mmput+0x41/0x150
>> >> > [ 7340.427098] exit_mm+0xb1/0x110
>> >> > [ 7340.427539] do_exit+0x1b2/0x460
>> >> > [ 7340.427992] do_group_exit+0x2d/0xc0
>> >> > [ 7340.428477] get_signal+0x79d/0x7e0
>> >> > [ 7340.428957] arch_do_signal_or_restart+0x34/0x100
>> >> > [ 7340.429571] exit_to_user_mode_loop+0x8e/0x4c0
>> >> > [ 7340.430159] do_syscall_64+0x188/0x6b0
>> >> > [ 7340.430672] ? __do_sys_clone3+0xd9/0x120
>> >> > [ 7340.431212] ? switch_fpu_return+0x4e/0xd0
>> >> > [ 7340.431761] ? arch_exit_to_user_mode_prepare.isra.0+0xa1/0xc0
>> >> > [ 7340.432498] ? do_syscall_64+0xbb/0x6b0
>> >> > [ 7340.433015] ? __handle_mm_fault+0x445/0x690
>> >> > [ 7340.433582] ? count_memcg_events+0xd6/0x210
>> >> > [ 7340.434151] ? handle_mm_fault+0x212/0x340
>> >> > [ 7340.434697] ? do_user_addr_fault+0x2b4/0x7b0
>> >> > [ 7340.435271] ? clear_bhb_loop+0x30/0x80
>> >> > [ 7340.435788] ? clear_bhb_loop+0x30/0x80
>> >> > [ 7340.436299] ? clear_bhb_loop+0x30/0x80
>> >> > [ 7340.436812] ? clear_bhb_loop+0x30/0x80
>> >> > [ 7340.437323] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> >> > [ 7340.437973] RIP: 0033:0x7f4161b14169
>> >> > [ 7340.438468] Code: Unable to access opcode bytes at 0x7f4161b1413f.
>> >> > [ 7340.439242] RSP: 002b:00007ffc6ebfa770 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
>> >> > [ 7340.440173] RAX: fffffffffffffe00 RBX: 00000000000005a1 RCX: 00007f4161b14169
>> >> > [ 7340.441061] RDX: 00000000000005a1 RSI: 0000000000000109 RDI: 00007f415bfff990
>> >> > [ 7340.441943] RBP: 00007ffc6ebfa7a0 R08: 0000000000000000 R09: 00000000ffffffff
>> >> > [ 7340.442824] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
>> >> > [ 7340.443707] R13: 0000000000000000 R14: 00007f415bfff990 R15: 00007f415bfff6c0
>> >> > [ 7340.444586] </TASK>
>> >> > [ 7340.444922] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit libnvdimm kvm_intel vfat fat kvm snd_pcm irqbypass rapl iTCO_wdt snd_timer intel_pmc_bxt iTCO_vendor_support snd ixgbevf virtio_net soundcore i2c_i801 pcspkr libeth_xdp net_failover i2c_smbus lpc_ich failover libeth virtio_balloon joydev 9p fuse loop zram lz4hc_compress lz4_compress 9pnet_virtio 9pnet netfs ghash_clmulni_intel serio_raw qemu_fw_cfg
>> >> > [ 7340.449650] ---[ end trace 0000000000000000 ]---
>> >> >
>> >> > The issue can be fixed in all in-tree drivers, but we cannot just trust OOT
>> >> > drivers to not do this. Therefore, make tailroom a signed int and produce a
>> >> > warning when it is negative to prevent such mistakes in the future.
>> >> >
>> >> > Fixes: bf25146a5595 ("bpf: add frags support to the bpf_xdp_adjust_tail() API")
>> >> > Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
>> >> > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
>> >> > ---
>> >> > net/core/filter.c | 3 ++-
>> >> > 1 file changed, 2 insertions(+), 1 deletion(-)
>> >> >
>> >> > diff --git a/net/core/filter.c b/net/core/filter.c
>> >> > index 616e0520a0bb..9715d957e3c5 100644
>> >> > --- a/net/core/filter.c
>> >> > +++ b/net/core/filter.c
>> >> > @@ -4149,12 +4149,13 @@ static int bpf_xdp_frags_increase_tail(struct xdp_buff *xdp, int offset)
>> >> > struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
>> >> > skb_frag_t *frag = &sinfo->frags[sinfo->nr_frags - 1];
>> >> > struct xdp_rxq_info *rxq = xdp->rxq;
>> >> > - unsigned int tailroom;
>> >> > + int tailroom;
>> >> >
>> >> > if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz)
>> >> > return -EOPNOTSUPP;
>> >> >
>> >> > tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
>> >> > + WARN_ON_ONCE(tailroom < 0);
>> >> > if (unlikely(offset > tailroom))
>> >> > return -EINVAL;
>> >> >
>> >>
>> >> Why can't we do both? I.e., WARN_ON_ONCE() *and* return -EINVAL?
>> >>
>> >> -Toke
>> >>
>> >
>> > It would be redundant, offset is always >= 0 here, so with tailroom now being a
>> > signed int, offset is always bigger and -EINVAL is returned.
>>
>> Oh, I see. OK, may be worth calling out; I read this paragraph in your
>> commit message to mean "we don't bother returning EINVAL in this case,
>> we just warn":
>
> Worth changing 'could' to 'are supposed to', if there will be v2.
Yeah, that would be better, thanks!
Other than that,
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH bpf 1/6] xdp: produce a warning when calculated tailroom is negative
2026-02-03 10:53 ` [PATCH bpf 1/6] xdp: produce a warning when calculated tailroom is negative Larysa Zaremba
2026-02-03 12:26 ` Toke Høiland-Jørgensen
@ 2026-02-04 22:52 ` Martin KaFai Lau
1 sibling, 0 replies; 29+ messages in thread
From: Martin KaFai Lau @ 2026-02-04 22:52 UTC (permalink / raw)
To: Larysa Zaremba
Cc: Claudiu Manoil, Vladimir Oltean, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Tony Nguyen, Przemek Kitszel, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Andrii Nakryiko, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, Simon Horman,
Shuah Khan, Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, bpf, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
On 2/3/26 2:53 AM, Larysa Zaremba wrote:
> The issue can be fixed in all in-tree drivers, but we cannot just trust OOT
> drivers to not do this. Therefore, make tailroom a signed int and produce a
> warning when it is negative to prevent such mistakes in the future.
>
> Fixes: bf25146a5595 ("bpf: add frags support to the bpf_xdp_adjust_tail() API")
> Reviewed-by: Aleksandr Loktionov<aleksandr.loktionov@intel.com>
> Signed-off-by: Larysa Zaremba<larysa.zaremba@intel.com>
> ---
> net/core/filter.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 616e0520a0bb..9715d957e3c5 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4149,12 +4149,13 @@ static int bpf_xdp_frags_increase_tail(struct xdp_buff *xdp, int offset)
> struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
> skb_frag_t *frag = &sinfo->frags[sinfo->nr_frags - 1];
> struct xdp_rxq_info *rxq = xdp->rxq;
> - unsigned int tailroom;
> + int tailroom;
>
> if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz)
> return -EOPNOTSUPP;
>
> tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
> + WARN_ON_ONCE(tailroom < 0);
> if (unlikely(offset > tailroom))
> return -EINVAL;
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH bpf 2/6] ice: fix rxq info registering in mbuf packets
2026-02-03 10:53 [PATCH bpf 0/6] Address XDP frags having negative tailroom Larysa Zaremba
2026-02-03 10:53 ` [PATCH bpf 1/6] xdp: produce a warning when calculated tailroom is negative Larysa Zaremba
@ 2026-02-03 10:53 ` Larysa Zaremba
2026-02-03 10:53 ` [PATCH bpf 3/6] ice: change XDP RxQ frag_size from DMA write length to truesize Larysa Zaremba
` (5 subsequent siblings)
7 siblings, 0 replies; 29+ messages in thread
From: Larysa Zaremba @ 2026-02-03 10:53 UTC (permalink / raw)
To: bpf
Cc: Claudiu Manoil, Vladimir Oltean, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Tony Nguyen, Przemek Kitszel, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Andrii Nakryiko, Martin KaFai Lau,
Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Hao Luo,
Jiri Olsa, Simon Horman, Shuah Khan, Alexander Lobakin,
Maciej Fijalkowski, Bastien Curutchet (eBPF Foundation),
Larysa Zaremba, Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
XDP RxQ info contains frag_size, which depends on the MTU. This makes the
old way of registering RxQ info before calculating new buffer sizes
invalid. Currently, it leads to frag_size being outdated, making it
sometimes impossible to grow tailroom in a mbuf packet. E.g. fragments are
actually 3K+, but frag size is still as if MTU was 1500.
Always register new XDP RxQ info after reconfiguring memory pools.
Fixes: 2fba7dc5157b ("ice: Add support for XDP multi-buffer on Rx side")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
drivers/net/ethernet/intel/ice/ice_base.c | 26 ++++++-----------------
drivers/net/ethernet/intel/ice/ice_xsk.c | 3 +++
2 files changed, 10 insertions(+), 19 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
index eadb1e3d12b3..511d803cf0a4 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.c
+++ b/drivers/net/ethernet/intel/ice/ice_base.c
@@ -663,23 +663,12 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
int err;
if (ring->vsi->type == ICE_VSI_PF || ring->vsi->type == ICE_VSI_SF) {
- if (!xdp_rxq_info_is_reg(&ring->xdp_rxq)) {
- err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
- ring->q_index,
- ring->q_vector->napi.napi_id,
- ring->rx_buf_len);
- if (err)
- return err;
- }
-
ice_rx_xsk_pool(ring);
err = ice_realloc_rx_xdp_bufs(ring, ring->xsk_pool);
if (err)
return err;
if (ring->xsk_pool) {
- xdp_rxq_info_unreg(&ring->xdp_rxq);
-
rx_buf_len =
xsk_pool_get_rx_frame_size(ring->xsk_pool);
err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
@@ -702,14 +691,13 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
if (err)
return err;
- if (!xdp_rxq_info_is_reg(&ring->xdp_rxq)) {
- err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
- ring->q_index,
- ring->q_vector->napi.napi_id,
- ring->rx_buf_len);
- if (err)
- goto err_destroy_fq;
- }
+ err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
+ ring->q_index,
+ ring->q_vector->napi.napi_id,
+ ring->rx_buf_len);
+ if (err)
+ goto err_destroy_fq;
+
xdp_rxq_info_attach_page_pool(&ring->xdp_rxq,
ring->pp);
}
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 989ff1fd9110..102631398af3 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -900,6 +900,9 @@ void ice_xsk_clean_rx_ring(struct ice_rx_ring *rx_ring)
u16 ntc = rx_ring->next_to_clean;
u16 ntu = rx_ring->next_to_use;
+ if (xdp_rxq_info_is_reg(&rx_ring->xdp_rxq))
+ xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
+
while (ntc != ntu) {
struct xdp_buff *xdp = *ice_xdp_buf(rx_ring, ntc);
--
2.52.0
^ permalink raw reply related [flat|nested] 29+ messages in thread* [PATCH bpf 3/6] ice: change XDP RxQ frag_size from DMA write length to truesize
2026-02-03 10:53 [PATCH bpf 0/6] Address XDP frags having negative tailroom Larysa Zaremba
2026-02-03 10:53 ` [PATCH bpf 1/6] xdp: produce a warning when calculated tailroom is negative Larysa Zaremba
2026-02-03 10:53 ` [PATCH bpf 2/6] ice: fix rxq info registering in mbuf packets Larysa Zaremba
@ 2026-02-03 10:53 ` Larysa Zaremba
2026-02-03 10:53 ` [PATCH bpf 4/6] i40e: use truesize as XDP RxQ info frag_size Larysa Zaremba
` (4 subsequent siblings)
7 siblings, 0 replies; 29+ messages in thread
From: Larysa Zaremba @ 2026-02-03 10:53 UTC (permalink / raw)
To: bpf
Cc: Claudiu Manoil, Vladimir Oltean, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Tony Nguyen, Przemek Kitszel, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Andrii Nakryiko, Martin KaFai Lau,
Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Hao Luo,
Jiri Olsa, Simon Horman, Shuah Khan, Alexander Lobakin,
Maciej Fijalkowski, Bastien Curutchet (eBPF Foundation),
Larysa Zaremba, Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
The only user of frag_size field in XDP RxQ info is
bpf_xdp_frags_increase_tail(). It clearly expects truesize instead of DMA
write size. Different assumptions in ice driver configuration lead to
negative tailroom.
Before changing the tailroom in the abovementioned function to a signed
int, this allows to trigger kernel panic, when using
XDP_ADJUST_TAIL_GROW_MULTI_BUFF xskxceiver test and changing packet size to
6912 and the requisted offset to a huge value, e.g.
XSK_UMEM__MAX_FRAME_SIZE * 100.
Due to other quirks of the ZC configuration in ice, issue is not observed
in ZC mode.
Use fill queue buffer truesize instead of DMA write size in XDP RxQ info.
Fixes: 2fba7dc5157b ("ice: Add support for XDP multi-buffer on Rx side")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
drivers/net/ethernet/intel/ice/ice_base.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
index 511d803cf0a4..6a15b66e340e 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.c
+++ b/drivers/net/ethernet/intel/ice/ice_base.c
@@ -694,7 +694,7 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
ring->q_index,
ring->q_vector->napi.napi_id,
- ring->rx_buf_len);
+ ring->truesize);
if (err)
goto err_destroy_fq;
--
2.52.0
^ permalink raw reply related [flat|nested] 29+ messages in thread* [PATCH bpf 4/6] i40e: use truesize as XDP RxQ info frag_size
2026-02-03 10:53 [PATCH bpf 0/6] Address XDP frags having negative tailroom Larysa Zaremba
` (2 preceding siblings ...)
2026-02-03 10:53 ` [PATCH bpf 3/6] ice: change XDP RxQ frag_size from DMA write length to truesize Larysa Zaremba
@ 2026-02-03 10:53 ` Larysa Zaremba
2026-02-03 10:53 ` [PATCH bpf 5/6] idpf: " Larysa Zaremba
` (3 subsequent siblings)
7 siblings, 0 replies; 29+ messages in thread
From: Larysa Zaremba @ 2026-02-03 10:53 UTC (permalink / raw)
To: bpf
Cc: Claudiu Manoil, Vladimir Oltean, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Tony Nguyen, Przemek Kitszel, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Andrii Nakryiko, Martin KaFai Lau,
Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Hao Luo,
Jiri Olsa, Simon Horman, Shuah Khan, Alexander Lobakin,
Maciej Fijalkowski, Bastien Curutchet (eBPF Foundation),
Larysa Zaremba, Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
The only user of frag_size field in XDP RxQ info is
bpf_xdp_frags_increase_tail(). It clearly expects truesize instead of DMA
write size. Different assumptions in i40e driver configuration lead to
negative tailroom.
Set frag_size to the same value as frame_sz.
Fixes: a045d2f2d03d ("i40e: set xdp_rxq_info::frag_size")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
drivers/net/ethernet/intel/i40e/i40e_main.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 0b1cc0481027..72a33aa3e1f3 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3561,6 +3561,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
struct i40e_vsi *vsi = ring->vsi;
u32 chain_len = vsi->back->hw.func_caps.rx_buf_chain_len;
u16 pf_q = vsi->base_queue + ring->queue_index;
+ u32 truesize = i40e_rx_pg_size(ring) / 2;
struct i40e_hw *hw = &vsi->back->hw;
struct i40e_hmc_obj_rxq rx_ctx;
int err = 0;
@@ -3581,7 +3582,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
ring->queue_index,
ring->q_vector->napi.napi_id,
- ring->rx_buf_len);
+ truesize);
if (err)
return err;
}
@@ -3614,7 +3615,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
}
skip:
- xdp_init_buff(&ring->xdp, i40e_rx_pg_size(ring) / 2, &ring->xdp_rxq);
+ xdp_init_buff(&ring->xdp, truesize, &ring->xdp_rxq);
rx_ctx.dbuff = DIV_ROUND_UP(ring->rx_buf_len,
BIT_ULL(I40E_RXQ_CTX_DBUFF_SHIFT));
--
2.52.0
^ permalink raw reply related [flat|nested] 29+ messages in thread* [PATCH bpf 5/6] idpf: use truesize as XDP RxQ info frag_size
2026-02-03 10:53 [PATCH bpf 0/6] Address XDP frags having negative tailroom Larysa Zaremba
` (3 preceding siblings ...)
2026-02-03 10:53 ` [PATCH bpf 4/6] i40e: use truesize as XDP RxQ info frag_size Larysa Zaremba
@ 2026-02-03 10:53 ` Larysa Zaremba
2026-02-03 10:53 ` [PATCH bpf 6/6] net: enetc: " Larysa Zaremba
` (2 subsequent siblings)
7 siblings, 0 replies; 29+ messages in thread
From: Larysa Zaremba @ 2026-02-03 10:53 UTC (permalink / raw)
To: bpf
Cc: Claudiu Manoil, Vladimir Oltean, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Tony Nguyen, Przemek Kitszel, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Andrii Nakryiko, Martin KaFai Lau,
Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Hao Luo,
Jiri Olsa, Simon Horman, Shuah Khan, Alexander Lobakin,
Maciej Fijalkowski, Bastien Curutchet (eBPF Foundation),
Larysa Zaremba, Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
The only user of frag_size field in XDP RxQ info is
bpf_xdp_frags_increase_tail(). It clearly expects truesize instead of DMA
write size. Different assumptions in idpf driver configuration lead to
negative tailroom.
Set frag_size to the same value as frame_sz.
Fixes: ac8a861f632e ("idpf: prepare structures to support XDP")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
drivers/net/ethernet/intel/idpf/xdp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/idpf/xdp.c b/drivers/net/ethernet/intel/idpf/xdp.c
index 958d16f87424..4966376a9785 100644
--- a/drivers/net/ethernet/intel/idpf/xdp.c
+++ b/drivers/net/ethernet/intel/idpf/xdp.c
@@ -50,7 +50,7 @@ static int __idpf_xdp_rxq_info_init(struct idpf_rx_queue *rxq, void *arg)
err = __xdp_rxq_info_reg(&rxq->xdp_rxq, vport->netdev, rxq->idx,
rxq->q_vector->napi.napi_id,
- rxq->rx_buf_size);
+ rxq->truesize);
if (err)
return err;
--
2.52.0
^ permalink raw reply related [flat|nested] 29+ messages in thread* [PATCH bpf 6/6] net: enetc: use truesize as XDP RxQ info frag_size
2026-02-03 10:53 [PATCH bpf 0/6] Address XDP frags having negative tailroom Larysa Zaremba
` (4 preceding siblings ...)
2026-02-03 10:53 ` [PATCH bpf 5/6] idpf: " Larysa Zaremba
@ 2026-02-03 10:53 ` Larysa Zaremba
2026-02-05 0:59 ` Vladimir Oltean
2026-02-04 22:57 ` [PATCH bpf 0/6] Address XDP frags having negative tailroom Martin KaFai Lau
2026-02-05 1:26 ` Jakub Kicinski
7 siblings, 1 reply; 29+ messages in thread
From: Larysa Zaremba @ 2026-02-03 10:53 UTC (permalink / raw)
To: bpf
Cc: Claudiu Manoil, Vladimir Oltean, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Tony Nguyen, Przemek Kitszel, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Andrii Nakryiko, Martin KaFai Lau,
Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Hao Luo,
Jiri Olsa, Simon Horman, Shuah Khan, Alexander Lobakin,
Maciej Fijalkowski, Bastien Curutchet (eBPF Foundation),
Larysa Zaremba, Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
The only user of frag_size field in XDP RxQ info is
bpf_xdp_frags_increase_tail(). It clearly expects truesize instead of DMA
write size. Different assumptions in enetc driver configuration lead to
negative tailroom.
Set frag_size to the same value as frame_sz.
Fixes: 2768b2e2f7d2 ("net: enetc: register XDP RX queues with frag_size")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
drivers/net/ethernet/freescale/enetc/enetc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/freescale/enetc/enetc.c b/drivers/net/ethernet/freescale/enetc/enetc.c
index 53b26cece16a..389331571d9e 100644
--- a/drivers/net/ethernet/freescale/enetc/enetc.c
+++ b/drivers/net/ethernet/freescale/enetc/enetc.c
@@ -3465,7 +3465,7 @@ static int enetc_int_vector_init(struct enetc_ndev_priv *priv, int i,
priv->rx_ring[i] = bdr;
err = __xdp_rxq_info_reg(&bdr->xdp.rxq, priv->ndev, i, 0,
- ENETC_RXB_DMA_SIZE_XDP);
+ ENETC_RXB_TRUESIZE);
if (err)
goto free_vector;
--
2.52.0
^ permalink raw reply related [flat|nested] 29+ messages in thread* Re: [PATCH bpf 6/6] net: enetc: use truesize as XDP RxQ info frag_size
2026-02-03 10:53 ` [PATCH bpf 6/6] net: enetc: " Larysa Zaremba
@ 2026-02-05 0:59 ` Vladimir Oltean
2026-02-05 1:34 ` Jakub Kicinski
0 siblings, 1 reply; 29+ messages in thread
From: Vladimir Oltean @ 2026-02-05 0:59 UTC (permalink / raw)
To: Larysa Zaremba
Cc: bpf, Claudiu Manoil, Wei Fang, Clark Wang, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Tony Nguyen, Przemek Kitszel, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, Simon Horman,
Shuah Khan, Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
On Tue, Feb 03, 2026 at 11:53:34AM +0100, Larysa Zaremba wrote:
> The only user of frag_size field in XDP RxQ info is
> bpf_xdp_frags_increase_tail(). It clearly expects truesize instead of DMA
> write size. Different assumptions in enetc driver configuration lead to
> negative tailroom.
>
> Set frag_size to the same value as frame_sz.
>
> Fixes: 2768b2e2f7d2 ("net: enetc: register XDP RX queues with frag_size")
> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> ---
> drivers/net/ethernet/freescale/enetc/enetc.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/freescale/enetc/enetc.c b/drivers/net/ethernet/freescale/enetc/enetc.c
> index 53b26cece16a..389331571d9e 100644
> --- a/drivers/net/ethernet/freescale/enetc/enetc.c
> +++ b/drivers/net/ethernet/freescale/enetc/enetc.c
> @@ -3465,7 +3465,7 @@ static int enetc_int_vector_init(struct enetc_ndev_priv *priv, int i,
> priv->rx_ring[i] = bdr;
>
> err = __xdp_rxq_info_reg(&bdr->xdp.rxq, priv->ndev, i, 0,
> - ENETC_RXB_DMA_SIZE_XDP);
> + ENETC_RXB_TRUESIZE);
> if (err)
> goto free_vector;
>
> --
> 2.52.0
>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Thanks! This is an extremely subtle corner case. I appreciate the patch
and explanation.
I did run tests on the blamed commit (which I still have), but to catch
a real issue in a meaningful way it would have been required to have a
program which calls bpf_xdp_adjust_tail() with a very large offset.
I'm noting that I'm seeing the WARN_ON() much easier after your fix, but
before, it was mostly inconsequential for practical cases.
Namely, the ENETC truesize is 2048, and XDP_PACKET_HEADROOM is 256.
First buffers also contain the skb_shared_info (320 bytes), while
subsequent buffers don't.
Maybe the situation for Intel NICs is different, but we don't have the
ability to tell ENETC "you have this buffer size at your disposal for
initial buffers, and this other size for non-initial buffers". So we
just tell the NIC to DMA a maximum of 1472 bytes per buffer, as if all
buffers had skb_shared_info which shouldn't be overwritten.
That means that in non-initial buffers, there is plenty of tailroom due
to the space unused for skb_shared_info, space which bpf_xdp_adjust_tail()
can safely grow into.
So as long as the tail doesn't grow by more than 320 bytes, this growth
operation is actually safe even with the wrong rxq->frag_size (tailroom
pessimistically overestimated until it becomes that unrepresentable
negative value).
My tests were with bpf_xdp_adjust_tail(4)...
I have just one nitpick about this series: your patch order *introduces*
a bug for me which didn't really exist before for practical scenarios
(as explained above), since it fixes the offset > tailroom test to
return -EINVAL in patch 1/6, and then it fixes the driver in patch 6/6.
Normally, you'd fix the driver first, and then you'd fix the bad
condition in the core that was sidestepped, so that you don't introduce
regressions visible in "git bisect". I don't think that's major in this
case, so it's up to maintainers really if they request you to resend for
this or not.
^ permalink raw reply [flat|nested] 29+ messages in thread* Re: [PATCH bpf 6/6] net: enetc: use truesize as XDP RxQ info frag_size
2026-02-05 0:59 ` Vladimir Oltean
@ 2026-02-05 1:34 ` Jakub Kicinski
2026-02-05 12:29 ` Vladimir Oltean
0 siblings, 1 reply; 29+ messages in thread
From: Jakub Kicinski @ 2026-02-05 1:34 UTC (permalink / raw)
To: Vladimir Oltean
Cc: Larysa Zaremba, bpf, Claudiu Manoil, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
Tony Nguyen, Przemek Kitszel, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, Simon Horman,
Shuah Khan, Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
On Thu, 5 Feb 2026 02:59:01 +0200 Vladimir Oltean wrote:
> Thanks! This is an extremely subtle corner case. I appreciate the patch
> and explanation.
>
> I did run tests on the blamed commit (which I still have), but to catch
> a real issue in a meaningful way it would have been required to have a
> program which calls bpf_xdp_adjust_tail() with a very large offset.
> I'm noting that I'm seeing the WARN_ON() much easier after your fix, but
> before, it was mostly inconsequential for practical cases.
>
> Namely, the ENETC truesize is 2048, and XDP_PACKET_HEADROOM is 256.
> First buffers also contain the skb_shared_info (320 bytes), while
> subsequent buffers don't.
I can't wrap my head around this series, hope you can tell me where I'm
going wrong. AFAICT enetc splits the page into two halves for small MTU.
So we have
| 2k | 2k |
----------------------------- -----------------------------
| hroom | data | troom/shinfo | hroom | data | troom/shinfo |
----------------------------- -----------------------------
If we attach the second chunk as frag well have:
offset = 2k + hroom
size = data.len
But we use
truesize / frag_size = 2k
so
tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
tailroom = 2k - data.len - 2k
tailroom = -data.len
WARN(tailroom < 0) -> yes
The frag_size thing is unusable for any driver that doesn't hand out
full pages to frags?
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH bpf 6/6] net: enetc: use truesize as XDP RxQ info frag_size
2026-02-05 1:34 ` Jakub Kicinski
@ 2026-02-05 12:29 ` Vladimir Oltean
2026-02-05 12:41 ` Larysa Zaremba
0 siblings, 1 reply; 29+ messages in thread
From: Vladimir Oltean @ 2026-02-05 12:29 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Larysa Zaremba, bpf, Claudiu Manoil, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
Tony Nguyen, Przemek Kitszel, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, Simon Horman,
Shuah Khan, Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
On Wed, Feb 04, 2026 at 05:34:01PM -0800, Jakub Kicinski wrote:
> On Thu, 5 Feb 2026 02:59:01 +0200 Vladimir Oltean wrote:
> > Thanks! This is an extremely subtle corner case. I appreciate the patch
> > and explanation.
> >
> > I did run tests on the blamed commit (which I still have), but to catch
> > a real issue in a meaningful way it would have been required to have a
> > program which calls bpf_xdp_adjust_tail() with a very large offset.
> > I'm noting that I'm seeing the WARN_ON() much easier after your fix, but
> > before, it was mostly inconsequential for practical cases.
> >
> > Namely, the ENETC truesize is 2048, and XDP_PACKET_HEADROOM is 256.
> > First buffers also contain the skb_shared_info (320 bytes), while
> > subsequent buffers don't.
>
> I can't wrap my head around this series, hope you can tell me where I'm
> going wrong. AFAICT enetc splits the page into two halves for small MTU.
>
> So we have
>
> | 2k | 2k |
> ----------------------------- -----------------------------
> | hroom | data | troom/shinfo | hroom | data | troom/shinfo |
> ----------------------------- -----------------------------
>
> If we attach the second chunk as frag well have:
> offset = 2k + hroom
> size = data.len
> But we use
> truesize / frag_size = 2k
> so
> tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
> tailroom = 2k - data.len - 2k
> tailroom = -data.len
> WARN(tailroom < 0) -> yes
>
> The frag_size thing is unusable for any driver that doesn't hand out
> full pages to frags?
This is an excellent question.
Yes, you're right, bpf_xdp_frags_increase_tail() only has a 50% chance
of working - the paged data has to be in the first half of the page,
otherwise the tailroom calculations are not correct due to rxq->frag_size,
and the WARN_ON() will trigger.
The reason why I didn't notice this during my testing is stupid. I was
attaching the BPF program to the interface and then detaching it after
each test, and each test was sending less than the RX ring size (2048)
worth of packets. So all multi-buffer frames were using buffers which
were fresh out of enetc_setup_rxbdr() -> ... -> enetc_new_page() (first
halves) and never out of flipped pages (enetc_bulk_flip_buff()).
This seems to be a good reason to convert this driver to use page pool,
which I can look into. I'm not sure that there's anything that can be
done to make the rxq->frag_size mechanism compatible with the current
buffer allocation scheme.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH bpf 6/6] net: enetc: use truesize as XDP RxQ info frag_size
2026-02-05 12:29 ` Vladimir Oltean
@ 2026-02-05 12:41 ` Larysa Zaremba
2026-02-05 12:46 ` Vladimir Oltean
0 siblings, 1 reply; 29+ messages in thread
From: Larysa Zaremba @ 2026-02-05 12:41 UTC (permalink / raw)
To: Vladimir Oltean, Jakub Kicinski
Cc: bpf, Claudiu Manoil, Wei Fang, Clark Wang, Andrew Lunn,
David S. Miller, Eric Dumazet, Paolo Abeni, Tony Nguyen,
Przemek Kitszel, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, Simon Horman,
Shuah Khan, Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
On Thu, Feb 05, 2026 at 02:29:53PM +0200, Vladimir Oltean wrote:
> On Wed, Feb 04, 2026 at 05:34:01PM -0800, Jakub Kicinski wrote:
> > On Thu, 5 Feb 2026 02:59:01 +0200 Vladimir Oltean wrote:
> > > Thanks! This is an extremely subtle corner case. I appreciate the patch
> > > and explanation.
> > >
> > > I did run tests on the blamed commit (which I still have), but to catch
> > > a real issue in a meaningful way it would have been required to have a
> > > program which calls bpf_xdp_adjust_tail() with a very large offset.
> > > I'm noting that I'm seeing the WARN_ON() much easier after your fix, but
> > > before, it was mostly inconsequential for practical cases.
> > >
> > > Namely, the ENETC truesize is 2048, and XDP_PACKET_HEADROOM is 256.
> > > First buffers also contain the skb_shared_info (320 bytes), while
> > > subsequent buffers don't.
> >
> > I can't wrap my head around this series, hope you can tell me where I'm
> > going wrong. AFAICT enetc splits the page into two halves for small MTU.
> >
> > So we have
> >
> > | 2k | 2k |
> > ----------------------------- -----------------------------
> > | hroom | data | troom/shinfo | hroom | data | troom/shinfo |
> > ----------------------------- -----------------------------
> >
> > If we attach the second chunk as frag well have:
> > offset = 2k + hroom
> > size = data.len
> > But we use
> > truesize / frag_size = 2k
> > so
> > tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
> > tailroom = 2k - data.len - 2k
> > tailroom = -data.len
> > WARN(tailroom < 0) -> yes
> >
> > The frag_size thing is unusable for any driver that doesn't hand out
> > full pages to frags?
>
> This is an excellent question.
>
> Yes, you're right, bpf_xdp_frags_increase_tail() only has a 50% chance
> of working - the paged data has to be in the first half of the page,
> otherwise the tailroom calculations are not correct due to rxq->frag_size,
> and the WARN_ON() will trigger.
>
> The reason why I didn't notice this during my testing is stupid. I was
> attaching the BPF program to the interface and then detaching it after
> each test, and each test was sending less than the RX ring size (2048)
> worth of packets. So all multi-buffer frames were using buffers which
> were fresh out of enetc_setup_rxbdr() -> ... -> enetc_new_page() (first
> halves) and never out of flipped pages (enetc_bulk_flip_buff()).
>
> This seems to be a good reason to convert this driver to use page pool,
> which I can look into. I'm not sure that there's anything that can be
> done to make the rxq->frag_size mechanism compatible with the current
> buffer allocation scheme.
I was just about to send an answer.
Seems like my mistake here. I actually think adjusting the tail should work, if
we set rxq->frag_size to PAGE_SIZE in enetc and i40e_rx_pg_size() in i40e, and
not to (PAGE_SIZE / 2), as I did at first, but in such case naming this
frag_size is just utterly wrong. Glad Jakub has pointed this out.
ice and idpf are fine, since they use libeth for Rx buffers, so mbuf packets
always reside in 3K+ buffers. But for xsk_buff_pool seems like all drivers
should have PAGE_SIZE as frag_size? I will let the discussion go on for at least
a day and then will send v2 with patches reordered and those sizes corrected,
maybe add ZC fixes on top.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH bpf 6/6] net: enetc: use truesize as XDP RxQ info frag_size
2026-02-05 12:41 ` Larysa Zaremba
@ 2026-02-05 12:46 ` Vladimir Oltean
2026-02-05 13:23 ` Larysa Zaremba
0 siblings, 1 reply; 29+ messages in thread
From: Vladimir Oltean @ 2026-02-05 12:46 UTC (permalink / raw)
To: Larysa Zaremba
Cc: Jakub Kicinski, bpf, Claudiu Manoil, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
Tony Nguyen, Przemek Kitszel, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, Simon Horman,
Shuah Khan, Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
On Thu, Feb 05, 2026 at 01:41:03PM +0100, Larysa Zaremba wrote:
> On Thu, Feb 05, 2026 at 02:29:53PM +0200, Vladimir Oltean wrote:
> > On Wed, Feb 04, 2026 at 05:34:01PM -0800, Jakub Kicinski wrote:
> > > On Thu, 5 Feb 2026 02:59:01 +0200 Vladimir Oltean wrote:
> > > > Thanks! This is an extremely subtle corner case. I appreciate the patch
> > > > and explanation.
> > > >
> > > > I did run tests on the blamed commit (which I still have), but to catch
> > > > a real issue in a meaningful way it would have been required to have a
> > > > program which calls bpf_xdp_adjust_tail() with a very large offset.
> > > > I'm noting that I'm seeing the WARN_ON() much easier after your fix, but
> > > > before, it was mostly inconsequential for practical cases.
> > > >
> > > > Namely, the ENETC truesize is 2048, and XDP_PACKET_HEADROOM is 256.
> > > > First buffers also contain the skb_shared_info (320 bytes), while
> > > > subsequent buffers don't.
> > >
> > > I can't wrap my head around this series, hope you can tell me where I'm
> > > going wrong. AFAICT enetc splits the page into two halves for small MTU.
> > >
> > > So we have
> > >
> > > | 2k | 2k |
> > > ----------------------------- -----------------------------
> > > | hroom | data | troom/shinfo | hroom | data | troom/shinfo |
> > > ----------------------------- -----------------------------
> > >
> > > If we attach the second chunk as frag well have:
> > > offset = 2k + hroom
> > > size = data.len
> > > But we use
> > > truesize / frag_size = 2k
> > > so
> > > tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
> > > tailroom = 2k - data.len - 2k
> > > tailroom = -data.len
> > > WARN(tailroom < 0) -> yes
> > >
> > > The frag_size thing is unusable for any driver that doesn't hand out
> > > full pages to frags?
> >
> > This is an excellent question.
> >
> > Yes, you're right, bpf_xdp_frags_increase_tail() only has a 50% chance
> > of working - the paged data has to be in the first half of the page,
> > otherwise the tailroom calculations are not correct due to rxq->frag_size,
> > and the WARN_ON() will trigger.
> >
> > The reason why I didn't notice this during my testing is stupid. I was
> > attaching the BPF program to the interface and then detaching it after
> > each test, and each test was sending less than the RX ring size (2048)
> > worth of packets. So all multi-buffer frames were using buffers which
> > were fresh out of enetc_setup_rxbdr() -> ... -> enetc_new_page() (first
> > halves) and never out of flipped pages (enetc_bulk_flip_buff()).
> >
> > This seems to be a good reason to convert this driver to use page pool,
> > which I can look into. I'm not sure that there's anything that can be
> > done to make the rxq->frag_size mechanism compatible with the current
> > buffer allocation scheme.
>
> I was just about to send an answer.
>
> Seems like my mistake here. I actually think adjusting the tail should work, if
> we set rxq->frag_size to PAGE_SIZE in enetc and i40e_rx_pg_size() in i40e, and
> not to (PAGE_SIZE / 2), as I did at first, but in such case naming this
> frag_size is just utterly wrong. Glad Jakub has pointed this out.
I mean, it should "work" given the caveat that calling bpf_xdp_adjust_tail()
on a first-half page buffer with a large offset risks leaking into the
second half, which may also be in use, and this will go undetected, right?
Although the practical chances of that happening are low, the requested
offset needs to be in the order of hundreds still.
> ice and idpf are fine, since they use libeth for Rx buffers, so mbuf packets
> always reside in 3K+ buffers. But for xsk_buff_pool seems like all drivers
> should have PAGE_SIZE as frag_size? I will let the discussion go on for at least
> a day and then will send v2 with patches reordered and those sizes corrected,
> maybe add ZC fixes on top.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH bpf 6/6] net: enetc: use truesize as XDP RxQ info frag_size
2026-02-05 12:46 ` Vladimir Oltean
@ 2026-02-05 13:23 ` Larysa Zaremba
2026-02-05 13:40 ` Vladimir Oltean
0 siblings, 1 reply; 29+ messages in thread
From: Larysa Zaremba @ 2026-02-05 13:23 UTC (permalink / raw)
To: Vladimir Oltean, Jakub Kicinski
Cc: bpf, Claudiu Manoil, Wei Fang, Clark Wang, Andrew Lunn,
David S. Miller, Eric Dumazet, Paolo Abeni, Tony Nguyen,
Przemek Kitszel, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, Simon Horman,
Shuah Khan, Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
On Thu, Feb 05, 2026 at 02:46:38PM +0200, Vladimir Oltean wrote:
> On Thu, Feb 05, 2026 at 01:41:03PM +0100, Larysa Zaremba wrote:
> > On Thu, Feb 05, 2026 at 02:29:53PM +0200, Vladimir Oltean wrote:
> > > On Wed, Feb 04, 2026 at 05:34:01PM -0800, Jakub Kicinski wrote:
> > > > On Thu, 5 Feb 2026 02:59:01 +0200 Vladimir Oltean wrote:
> > > > > Thanks! This is an extremely subtle corner case. I appreciate the patch
> > > > > and explanation.
> > > > >
> > > > > I did run tests on the blamed commit (which I still have), but to catch
> > > > > a real issue in a meaningful way it would have been required to have a
> > > > > program which calls bpf_xdp_adjust_tail() with a very large offset.
> > > > > I'm noting that I'm seeing the WARN_ON() much easier after your fix, but
> > > > > before, it was mostly inconsequential for practical cases.
> > > > >
> > > > > Namely, the ENETC truesize is 2048, and XDP_PACKET_HEADROOM is 256.
> > > > > First buffers also contain the skb_shared_info (320 bytes), while
> > > > > subsequent buffers don't.
> > > >
> > > > I can't wrap my head around this series, hope you can tell me where I'm
> > > > going wrong. AFAICT enetc splits the page into two halves for small MTU.
> > > >
> > > > So we have
> > > >
> > > > | 2k | 2k |
> > > > ----------------------------- -----------------------------
> > > > | hroom | data | troom/shinfo | hroom | data | troom/shinfo |
> > > > ----------------------------- -----------------------------
> > > >
> > > > If we attach the second chunk as frag well have:
> > > > offset = 2k + hroom
> > > > size = data.len
> > > > But we use
> > > > truesize / frag_size = 2k
> > > > so
> > > > tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
> > > > tailroom = 2k - data.len - 2k
> > > > tailroom = -data.len
> > > > WARN(tailroom < 0) -> yes
> > > >
> > > > The frag_size thing is unusable for any driver that doesn't hand out
> > > > full pages to frags?
> > >
> > > This is an excellent question.
> > >
> > > Yes, you're right, bpf_xdp_frags_increase_tail() only has a 50% chance
> > > of working - the paged data has to be in the first half of the page,
> > > otherwise the tailroom calculations are not correct due to rxq->frag_size,
> > > and the WARN_ON() will trigger.
> > >
> > > The reason why I didn't notice this during my testing is stupid. I was
> > > attaching the BPF program to the interface and then detaching it after
> > > each test, and each test was sending less than the RX ring size (2048)
> > > worth of packets. So all multi-buffer frames were using buffers which
> > > were fresh out of enetc_setup_rxbdr() -> ... -> enetc_new_page() (first
> > > halves) and never out of flipped pages (enetc_bulk_flip_buff()).
> > >
> > > This seems to be a good reason to convert this driver to use page pool,
> > > which I can look into. I'm not sure that there's anything that can be
> > > done to make the rxq->frag_size mechanism compatible with the current
> > > buffer allocation scheme.
> >
> > I was just about to send an answer.
> >
> > Seems like my mistake here. I actually think adjusting the tail should work, if
> > we set rxq->frag_size to PAGE_SIZE in enetc and i40e_rx_pg_size() in i40e, and
> > not to (PAGE_SIZE / 2), as I did at first, but in such case naming this
> > frag_size is just utterly wrong. Glad Jakub has pointed this out.
>
> I mean, it should "work" given the caveat that calling bpf_xdp_adjust_tail()
> on a first-half page buffer with a large offset risks leaking into the
> second half, which may also be in use, and this will go undetected, right?
> Although the practical chances of that happening are low, the requested
> offset needs to be in the order of hundreds still.
Oh, I did get carried away there...
Well, one thing is shared page memory model in enetc and i40e, another thing is
xsk_buff_pool, where chunk size can be between 2K and PAGE_SIZE. What about
tailroom = rxq->frag_size - skb_frag_size(frag) -
(skb_frag_off(frag) % rxq->frag_size);
When frag_size is set to 2K, headroom is let's say 256, so aligned DMA write
size is 1420.
last frag at the start of the page: offset=256, size<=1420
tailroom >= 2K - 1420 - 256 = 372
last frag in the middle of the page: offset=256+2K, size<=1420
tailroom >= 2K - 1420 - ((256 + 2K) % 2K) = 372
And for drivers that do not fragment pages for multi-buffer packets, nothing
changes, since offset is always less than rxq->frag_size.
This brings us back to rxq->frag_size being half of a page for enetc and i40e,
and seems like in ZC mode it should be pool->chunk_size to work properly.
>
> > ice and idpf are fine, since they use libeth for Rx buffers, so mbuf packets
> > always reside in 3K+ buffers. But for xsk_buff_pool seems like all drivers
> > should have PAGE_SIZE as frag_size? I will let the discussion go on for at least
> > a day and then will send v2 with patches reordered and those sizes corrected,
> > maybe add ZC fixes on top.
^ permalink raw reply [flat|nested] 29+ messages in thread* Re: [PATCH bpf 6/6] net: enetc: use truesize as XDP RxQ info frag_size
2026-02-05 13:23 ` Larysa Zaremba
@ 2026-02-05 13:40 ` Vladimir Oltean
2026-02-06 1:54 ` Jakub Kicinski
0 siblings, 1 reply; 29+ messages in thread
From: Vladimir Oltean @ 2026-02-05 13:40 UTC (permalink / raw)
To: Larysa Zaremba
Cc: Jakub Kicinski, bpf, Claudiu Manoil, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
Tony Nguyen, Przemek Kitszel, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, Simon Horman,
Shuah Khan, Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
On Thu, Feb 05, 2026 at 02:23:15PM +0100, Larysa Zaremba wrote:
> On Thu, Feb 05, 2026 at 02:46:38PM +0200, Vladimir Oltean wrote:
> > On Thu, Feb 05, 2026 at 01:41:03PM +0100, Larysa Zaremba wrote:
> > > On Thu, Feb 05, 2026 at 02:29:53PM +0200, Vladimir Oltean wrote:
> > > > On Wed, Feb 04, 2026 at 05:34:01PM -0800, Jakub Kicinski wrote:
> > > > > On Thu, 5 Feb 2026 02:59:01 +0200 Vladimir Oltean wrote:
> > > > > > Thanks! This is an extremely subtle corner case. I appreciate the patch
> > > > > > and explanation.
> > > > > >
> > > > > > I did run tests on the blamed commit (which I still have), but to catch
> > > > > > a real issue in a meaningful way it would have been required to have a
> > > > > > program which calls bpf_xdp_adjust_tail() with a very large offset.
> > > > > > I'm noting that I'm seeing the WARN_ON() much easier after your fix, but
> > > > > > before, it was mostly inconsequential for practical cases.
> > > > > >
> > > > > > Namely, the ENETC truesize is 2048, and XDP_PACKET_HEADROOM is 256.
> > > > > > First buffers also contain the skb_shared_info (320 bytes), while
> > > > > > subsequent buffers don't.
> > > > >
> > > > > I can't wrap my head around this series, hope you can tell me where I'm
> > > > > going wrong. AFAICT enetc splits the page into two halves for small MTU.
> > > > >
> > > > > So we have
> > > > >
> > > > > | 2k | 2k |
> > > > > ----------------------------- -----------------------------
> > > > > | hroom | data | troom/shinfo | hroom | data | troom/shinfo |
> > > > > ----------------------------- -----------------------------
> > > > >
> > > > > If we attach the second chunk as frag well have:
> > > > > offset = 2k + hroom
> > > > > size = data.len
> > > > > But we use
> > > > > truesize / frag_size = 2k
> > > > > so
> > > > > tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
> > > > > tailroom = 2k - data.len - 2k
> > > > > tailroom = -data.len
> > > > > WARN(tailroom < 0) -> yes
> > > > >
> > > > > The frag_size thing is unusable for any driver that doesn't hand out
> > > > > full pages to frags?
> > > >
> > > > This is an excellent question.
> > > >
> > > > Yes, you're right, bpf_xdp_frags_increase_tail() only has a 50% chance
> > > > of working - the paged data has to be in the first half of the page,
> > > > otherwise the tailroom calculations are not correct due to rxq->frag_size,
> > > > and the WARN_ON() will trigger.
> > > >
> > > > The reason why I didn't notice this during my testing is stupid. I was
> > > > attaching the BPF program to the interface and then detaching it after
> > > > each test, and each test was sending less than the RX ring size (2048)
> > > > worth of packets. So all multi-buffer frames were using buffers which
> > > > were fresh out of enetc_setup_rxbdr() -> ... -> enetc_new_page() (first
> > > > halves) and never out of flipped pages (enetc_bulk_flip_buff()).
> > > >
> > > > This seems to be a good reason to convert this driver to use page pool,
> > > > which I can look into. I'm not sure that there's anything that can be
> > > > done to make the rxq->frag_size mechanism compatible with the current
> > > > buffer allocation scheme.
> > >
> > > I was just about to send an answer.
> > >
> > > Seems like my mistake here. I actually think adjusting the tail should work, if
> > > we set rxq->frag_size to PAGE_SIZE in enetc and i40e_rx_pg_size() in i40e, and
> > > not to (PAGE_SIZE / 2), as I did at first, but in such case naming this
> > > frag_size is just utterly wrong. Glad Jakub has pointed this out.
> >
> > I mean, it should "work" given the caveat that calling bpf_xdp_adjust_tail()
> > on a first-half page buffer with a large offset risks leaking into the
> > second half, which may also be in use, and this will go undetected, right?
> > Although the practical chances of that happening are low, the requested
> > offset needs to be in the order of hundreds still.
>
> Oh, I did get carried away there...
> Well, one thing is shared page memory model in enetc and i40e, another thing is
> xsk_buff_pool, where chunk size can be between 2K and PAGE_SIZE. What about
>
> tailroom = rxq->frag_size - skb_frag_size(frag) -
> (skb_frag_off(frag) % rxq->frag_size);
>
> When frag_size is set to 2K, headroom is let's say 256, so aligned DMA write
> size is 1420.
> last frag at the start of the page: offset=256, size<=1420
> tailroom >= 2K - 1420 - 256 = 372
> last frag in the middle of the page: offset=256+2K, size<=1420
> tailroom >= 2K - 1420 - ((256 + 2K) % 2K) = 372
>
> And for drivers that do not fragment pages for multi-buffer packets, nothing
> changes, since offset is always less than rxq->frag_size.
>
> This brings us back to rxq->frag_size being half of a page for enetc and i40e,
> and seems like in ZC mode it should be pool->chunk_size to work properly.
With skb_frag_off() taken into account modulo 2K for the tailroom
calculation, I can confirm bpf_xdp_frags_increase_tail() works well for
ENETC. I haven't fully considered the side effects, though.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH bpf 6/6] net: enetc: use truesize as XDP RxQ info frag_size
2026-02-05 13:40 ` Vladimir Oltean
@ 2026-02-06 1:54 ` Jakub Kicinski
2026-02-06 8:36 ` Larysa Zaremba
2026-02-08 12:59 ` Vladimir Oltean
0 siblings, 2 replies; 29+ messages in thread
From: Jakub Kicinski @ 2026-02-06 1:54 UTC (permalink / raw)
To: Vladimir Oltean, Larysa Zaremba
Cc: bpf, Claudiu Manoil, Wei Fang, Clark Wang, Andrew Lunn,
David S. Miller, Eric Dumazet, Paolo Abeni, Tony Nguyen,
Przemek Kitszel, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, Simon Horman,
Shuah Khan, Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
On Thu, 5 Feb 2026 15:40:46 +0200 Vladimir Oltean wrote:
> > > I mean, it should "work" given the caveat that calling bpf_xdp_adjust_tail()
> > > on a first-half page buffer with a large offset risks leaking into the
> > > second half, which may also be in use, and this will go undetected, right?
> > > Although the practical chances of that happening are low, the requested
> > > offset needs to be in the order of hundreds still.
> >
> > Oh, I did get carried away there...
> > Well, one thing is shared page memory model in enetc and i40e, another thing is
> > xsk_buff_pool, where chunk size can be between 2K and PAGE_SIZE. What about
> >
> > tailroom = rxq->frag_size - skb_frag_size(frag) -
> > (skb_frag_off(frag) % rxq->frag_size);
> >
> > When frag_size is set to 2K, headroom is let's say 256, so aligned DMA write
> > size is 1420.
> > last frag at the start of the page: offset=256, size<=1420
> > tailroom >= 2K - 1420 - 256 = 372
> > last frag in the middle of the page: offset=256+2K, size<=1420
> > tailroom >= 2K - 1420 - ((256 + 2K) % 2K) = 372
> >
> > And for drivers that do not fragment pages for multi-buffer packets, nothing
> > changes, since offset is always less than rxq->frag_size.
> >
> > This brings us back to rxq->frag_size being half of a page for enetc and i40e,
> > and seems like in ZC mode it should be pool->chunk_size to work properly.
>
> With skb_frag_off() taken into account modulo 2K for the tailroom
> calculation, I can confirm bpf_xdp_frags_increase_tail() works well for
> ENETC. I haven't fully considered the side effects, though.
+1, also seems to me like it would work tho I haven't thought thru all
the cases. We do need to document and name things well, tho, 'cause
subtleties are piling up ;) Maybe it's time for an ASCII art
for xdp layout?
FWIW my feeling is that instead of nickel and diming leftover space
in the frags if someone actually cared about growing mbufs we should
have the helper allocate a new page from the PP and append it to the
shinfo. Much simpler, "infinite space", and works regardless of the
driver. I don't mean that to suggest you implement it, purely to point
out that I think nobody really uses positive offsets.. So we can as
well switch more complicated drivers back to xdp_rxq_info_reg().
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH bpf 6/6] net: enetc: use truesize as XDP RxQ info frag_size
2026-02-06 1:54 ` Jakub Kicinski
@ 2026-02-06 8:36 ` Larysa Zaremba
2026-02-07 2:57 ` Jakub Kicinski
2026-02-08 12:59 ` Vladimir Oltean
1 sibling, 1 reply; 29+ messages in thread
From: Larysa Zaremba @ 2026-02-06 8:36 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Vladimir Oltean, bpf, Claudiu Manoil, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
Tony Nguyen, Przemek Kitszel, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, Simon Horman,
Shuah Khan, Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
On Thu, Feb 05, 2026 at 05:54:08PM -0800, Jakub Kicinski wrote:
> On Thu, 5 Feb 2026 15:40:46 +0200 Vladimir Oltean wrote:
> > > > I mean, it should "work" given the caveat that calling bpf_xdp_adjust_tail()
> > > > on a first-half page buffer with a large offset risks leaking into the
> > > > second half, which may also be in use, and this will go undetected, right?
> > > > Although the practical chances of that happening are low, the requested
> > > > offset needs to be in the order of hundreds still.
> > >
> > > Oh, I did get carried away there...
> > > Well, one thing is shared page memory model in enetc and i40e, another thing is
> > > xsk_buff_pool, where chunk size can be between 2K and PAGE_SIZE. What about
> > >
> > > tailroom = rxq->frag_size - skb_frag_size(frag) -
> > > (skb_frag_off(frag) % rxq->frag_size);
> > >
> > > When frag_size is set to 2K, headroom is let's say 256, so aligned DMA write
> > > size is 1420.
> > > last frag at the start of the page: offset=256, size<=1420
> > > tailroom >= 2K - 1420 - 256 = 372
> > > last frag in the middle of the page: offset=256+2K, size<=1420
> > > tailroom >= 2K - 1420 - ((256 + 2K) % 2K) = 372
> > >
> > > And for drivers that do not fragment pages for multi-buffer packets, nothing
> > > changes, since offset is always less than rxq->frag_size.
> > >
> > > This brings us back to rxq->frag_size being half of a page for enetc and i40e,
> > > and seems like in ZC mode it should be pool->chunk_size to work properly.
> >
> > With skb_frag_off() taken into account modulo 2K for the tailroom
> > calculation, I can confirm bpf_xdp_frags_increase_tail() works well for
> > ENETC. I haven't fully considered the side effects, though.
>
> +1, also seems to me like it would work tho I haven't thought thru all
> the cases. We do need to document and name things well, tho, 'cause
> subtleties are piling up ;) Maybe it's time for an ASCII art
> for xdp layout?
>
Yeah, for AF_XDP mbuf in i40e we actually recently discovered another
buffer-size-calculation-related issue, so some visual aid would be useful. I
will think about how it should look.
> FWIW my feeling is that instead of nickel and diming leftover space
> in the frags if someone actually cared about growing mbufs we should
> have the helper allocate a new page from the PP and append it to the
> shinfo. Much simpler, "infinite space", and works regardless of the
> driver. I don't mean that to suggest you implement it, purely to point
> out that I think nobody really uses positive offsets.. So we can as
> well switch more complicated drivers back to xdp_rxq_info_reg().
>
As Vladimir has mentioned, if the driver does not use header split, frags will
have a tailroom of a size of skb_shared_info, so tail growing does work in
practice.
Allocating a page_pool buffer (given XDP queue has one attached) is certainly an
option, although I am not sure if anyone needs it. Furthermore, growing tail
would still fail for a single-buf case.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH bpf 6/6] net: enetc: use truesize as XDP RxQ info frag_size
2026-02-06 8:36 ` Larysa Zaremba
@ 2026-02-07 2:57 ` Jakub Kicinski
2026-02-09 9:46 ` Larysa Zaremba
0 siblings, 1 reply; 29+ messages in thread
From: Jakub Kicinski @ 2026-02-07 2:57 UTC (permalink / raw)
To: Larysa Zaremba
Cc: Vladimir Oltean, bpf, Claudiu Manoil, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
Tony Nguyen, Przemek Kitszel, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, Simon Horman,
Shuah Khan, Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
On Fri, 6 Feb 2026 09:36:21 +0100 Larysa Zaremba wrote:
> > FWIW my feeling is that instead of nickel and diming leftover space
> > in the frags if someone actually cared about growing mbufs we should
> > have the helper allocate a new page from the PP and append it to the
> > shinfo. Much simpler, "infinite space", and works regardless of the
> > driver. I don't mean that to suggest you implement it, purely to point
> > out that I think nobody really uses positive offsets.. So we can as
> > well switch more complicated drivers back to xdp_rxq_info_reg().
>
> As Vladimir has mentioned, if the driver does not use header split, frags will
> have a tailroom of a size of skb_shared_info, so tail growing does work in
> practice.
>
> Allocating a page_pool buffer (given XDP queue has one attached) is certainly an
> option, although I am not sure if anyone needs it. Furthermore, growing tail
> would still fail for a single-buf case.
sbuf is a different code path, sbuf has precise frame_sz per frame,
not a single value in rxq, no? Don't mean to argue, just making sure
my mental model is correct ;)
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH bpf 6/6] net: enetc: use truesize as XDP RxQ info frag_size
2026-02-07 2:57 ` Jakub Kicinski
@ 2026-02-09 9:46 ` Larysa Zaremba
0 siblings, 0 replies; 29+ messages in thread
From: Larysa Zaremba @ 2026-02-09 9:46 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Vladimir Oltean, bpf, Claudiu Manoil, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
Tony Nguyen, Przemek Kitszel, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, Simon Horman,
Shuah Khan, Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
On Fri, Feb 06, 2026 at 06:57:35PM -0800, Jakub Kicinski wrote:
> On Fri, 6 Feb 2026 09:36:21 +0100 Larysa Zaremba wrote:
> > > FWIW my feeling is that instead of nickel and diming leftover space
> > > in the frags if someone actually cared about growing mbufs we should
> > > have the helper allocate a new page from the PP and append it to the
> > > shinfo. Much simpler, "infinite space", and works regardless of the
> > > driver. I don't mean that to suggest you implement it, purely to point
> > > out that I think nobody really uses positive offsets.. So we can as
> > > well switch more complicated drivers back to xdp_rxq_info_reg().
> >
> > As Vladimir has mentioned, if the driver does not use header split, frags will
> > have a tailroom of a size of skb_shared_info, so tail growing does work in
> > practice.
> >
> > Allocating a page_pool buffer (given XDP queue has one attached) is certainly an
> > option, although I am not sure if anyone needs it. Furthermore, growing tail
> > would still fail for a single-buf case.
>
> sbuf is a different code path, sbuf has precise frame_sz per frame,
> not a single value in rxq, no? Don't mean to argue, just making sure
> my mental model is correct ;)
>
You are right, xdp_buff in sbuf case is self-sufficient inside of bpf ops. sbuf
mostly requires memory info from XDP RxQ for proper xdp_frame handling.
What I meant is that intended conditions of failure to grow tail are the same
for sbuf and mbuf case, the last (which can be the same as the first) buffer
needs to have enough leftover space.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH bpf 6/6] net: enetc: use truesize as XDP RxQ info frag_size
2026-02-06 1:54 ` Jakub Kicinski
2026-02-06 8:36 ` Larysa Zaremba
@ 2026-02-08 12:59 ` Vladimir Oltean
2026-02-10 17:27 ` Dragos Tatulea
1 sibling, 1 reply; 29+ messages in thread
From: Vladimir Oltean @ 2026-02-08 12:59 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Larysa Zaremba, bpf, Claudiu Manoil, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
Tony Nguyen, Przemek Kitszel, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, Simon Horman,
Shuah Khan, Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
On Thu, Feb 05, 2026 at 05:54:08PM -0800, Jakub Kicinski wrote:
> FWIW my feeling is that instead of nickel and diming leftover space
> in the frags if someone actually cared about growing mbufs we should
> have the helper allocate a new page from the PP and append it to the
> shinfo. Much simpler, "infinite space", and works regardless of the
> driver. I don't mean that to suggest you implement it, purely to point
> out that I think nobody really uses positive offsets.. So we can as
> well switch more complicated drivers back to xdp_rxq_info_reg().
FWIW, I do have a use case at least in the theoretical sense for
bpf_xdp_adjust_tail() with positive offsets, although it's still under
development.
I'm working on a DSA data path library for XDP, and one of the features
it supports is redirecting from one user port to another, with in-place
tag modification.
If the path to the egress port goes through a tail-tagging switch but
the path from the ingress port didn't, bpf_xdp_adjust_tail() with a
positive offset will be called to make space for the tail tags.
I'm not sure about the "regardless of the driver" part of your comment.
Is it possible to mix and match allocation models and still keep track
of how each individual page needs to be freed? AFAICS in xdp_return_frame(),
the mem_type is assumed to be the same for the entire xdp_frame.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH bpf 6/6] net: enetc: use truesize as XDP RxQ info frag_size
2026-02-08 12:59 ` Vladimir Oltean
@ 2026-02-10 17:27 ` Dragos Tatulea
0 siblings, 0 replies; 29+ messages in thread
From: Dragos Tatulea @ 2026-02-10 17:27 UTC (permalink / raw)
To: Vladimir Oltean, Jakub Kicinski
Cc: Larysa Zaremba, bpf, Claudiu Manoil, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
Tony Nguyen, Przemek Kitszel, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, Simon Horman,
Shuah Khan, Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov,
Tariq Toukan, Nimrod Oren
On 08.02.26 13:59, Vladimir Oltean wrote:
> On Thu, Feb 05, 2026 at 05:54:08PM -0800, Jakub Kicinski wrote:
>> FWIW my feeling is that instead of nickel and diming leftover space
>> in the frags if someone actually cared about growing mbufs we should
>> have the helper allocate a new page from the PP and append it to the
>> shinfo. Much simpler, "infinite space", and works regardless of the
>> driver. I don't mean that to suggest you implement it, purely to point
>> out that I think nobody really uses positive offsets.. So we can as
>> well switch more complicated drivers back to xdp_rxq_info_reg().
>
> FWIW, I do have a use case at least in the theoretical sense for
> bpf_xdp_adjust_tail() with positive offsets, although it's still under
> development.
>
> I'm working on a DSA data path library for XDP, and one of the features
> it supports is redirecting from one user port to another, with in-place
> tag modification.
>
> If the path to the egress port goes through a tail-tagging switch but
> the path from the ingress port didn't, bpf_xdp_adjust_tail() with a
> positive offset will be called to make space for the tail tags.
>
Jumping a bit late in the conversation...
We were recently discussing this limitation when trying to add growable
tail support for multi-buf XDP buffers in the mlx5 driver (Striding RQ
mode) after lifting the page_size stride limitation for XDP [1].
It turns out that it is quite complicated to do in this mode with the
with existing frag_size configuration...
The issue is that the HW can write a packet in multiple smaller strides
(256B for example) and setting rxq->frag_size to it would not work for
the following reasons:
1) The tailroom formula would yield a negative value, as pointed out by
this series.
2) Even if this formula would not be the issue, frag_size currently
means that the underlying storage of each fragment has a size of
frag_size. So the number of fragments would explode in the driver.
That's a no go.
3) And even if we would change the semantics of frag_size to mean
something else so that the XDP code would use frag_size as the available
growth size in the fragment
(tailroom = skb_frag_off() + frag_size - skb_frag_size())
we'd still be very much in the "nickel and diming" space with less than
256B to spare.
4) The only sane way to do it would be to use a large stride but this
would kill small packet optimization
So +1 for the direction of having a helper allocating an extra page from
the page_pool instead.
> I'm not sure about the "regardless of the driver" part of your comment.
> Is it possible to mix and match allocation models and still keep track
> of how each individual page needs to be freed? AFAICS in xdp_return_frame(),
> the mem_type is assumed to be the same for the entire xdp_frame.
>
Wouldn't the allocations happen from the page_pool of the rx queue
so the mem_type would be homogenous. I was initially worried about the
cpumap case but seems to not allow tail growth (frag_size is initialized
to 0 in cpu_map_bpf_prog_run_xdp()).
[1] - Disclaimer: XDP multi-buf fragment growth is still supported for the
non Striding RQ mode.
Thanks,
Dragos
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH bpf 0/6] Address XDP frags having negative tailroom
2026-02-03 10:53 [PATCH bpf 0/6] Address XDP frags having negative tailroom Larysa Zaremba
` (5 preceding siblings ...)
2026-02-03 10:53 ` [PATCH bpf 6/6] net: enetc: " Larysa Zaremba
@ 2026-02-04 22:57 ` Martin KaFai Lau
2026-02-05 1:23 ` Jakub Kicinski
2026-02-05 1:26 ` Jakub Kicinski
7 siblings, 1 reply; 29+ messages in thread
From: Martin KaFai Lau @ 2026-02-04 22:57 UTC (permalink / raw)
To: Jakub Kicinski
Cc: bpf, Larysa Zaremba, Claudiu Manoil, Vladimir Oltean, Wei Fang,
Clark Wang, Andrew Lunn, David S. Miller, Eric Dumazet,
Paolo Abeni, Tony Nguyen, Przemek Kitszel, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Andrii Nakryiko, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, Simon Horman,
Shuah Khan, Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
On 2/3/26 2:53 AM, Larysa Zaremba wrote:
> Larysa Zaremba (6):
> xdp: produce a warning when calculated tailroom is negative
> ice: fix rxq info registering in mbuf packets
> ice: change XDP RxQ frag_size from DMA write length to truesize
> i40e: use truesize as XDP RxQ info frag_size
> idpf: use truesize as XDP RxQ info frag_size
> net: enetc: use truesize as XDP RxQ info frag_size
>
> drivers/net/ethernet/freescale/enetc/enetc.c | 2 +-
> drivers/net/ethernet/intel/i40e/i40e_main.c | 5 ++--
> drivers/net/ethernet/intel/ice/ice_base.c | 26 ++++++--------------
> drivers/net/ethernet/intel/ice/ice_xsk.c | 3 +++
> drivers/net/ethernet/intel/idpf/xdp.c | 2 +-
> net/core/filter.c | 3 ++-
Jakub, can you help to take it to the net tree? Thanks.
^ permalink raw reply [flat|nested] 29+ messages in thread* Re: [PATCH bpf 0/6] Address XDP frags having negative tailroom
2026-02-04 22:57 ` [PATCH bpf 0/6] Address XDP frags having negative tailroom Martin KaFai Lau
@ 2026-02-05 1:23 ` Jakub Kicinski
0 siblings, 0 replies; 29+ messages in thread
From: Jakub Kicinski @ 2026-02-05 1:23 UTC (permalink / raw)
To: Larysa Zaremba
Cc: Martin KaFai Lau, bpf, Claudiu Manoil, Vladimir Oltean, Wei Fang,
Clark Wang, Andrew Lunn, David S. Miller, Eric Dumazet,
Paolo Abeni, Tony Nguyen, Przemek Kitszel, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Andrii Nakryiko, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, Simon Horman,
Shuah Khan, Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
On Wed, 4 Feb 2026 14:57:13 -0800 Martin KaFai Lau wrote:
> On 2/3/26 2:53 AM, Larysa Zaremba wrote:
> > Larysa Zaremba (6):
> > xdp: produce a warning when calculated tailroom is negative
> > ice: fix rxq info registering in mbuf packets
> > ice: change XDP RxQ frag_size from DMA write length to truesize
> > i40e: use truesize as XDP RxQ info frag_size
> > idpf: use truesize as XDP RxQ info frag_size
> > net: enetc: use truesize as XDP RxQ info frag_size
>
> Jakub, can you help to take it to the net tree? Thanks.
Will do. I think I also have some questions, first, tho :)
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH bpf 0/6] Address XDP frags having negative tailroom
2026-02-03 10:53 [PATCH bpf 0/6] Address XDP frags having negative tailroom Larysa Zaremba
` (6 preceding siblings ...)
2026-02-04 22:57 ` [PATCH bpf 0/6] Address XDP frags having negative tailroom Martin KaFai Lau
@ 2026-02-05 1:26 ` Jakub Kicinski
7 siblings, 0 replies; 29+ messages in thread
From: Jakub Kicinski @ 2026-02-05 1:26 UTC (permalink / raw)
To: Larysa Zaremba
Cc: bpf, Claudiu Manoil, Vladimir Oltean, Wei Fang, Clark Wang,
Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
Tony Nguyen, Przemek Kitszel, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Hao Luo, Jiri Olsa, Simon Horman,
Shuah Khan, Alexander Lobakin, Maciej Fijalkowski,
Bastien Curutchet (eBPF Foundation), Tushar Vyavahare, Jason Xing,
Ricardo B. Marlière, Eelco Chaudron, Lorenzo Bianconi,
Toke Hoiland-Jorgensen, imx, netdev, linux-kernel,
intel-wired-lan, linux-kselftest, Aleksandr Loktionov
On Tue, 3 Feb 2026 11:53:28 +0100 Larysa Zaremba wrote:
> Many ethernet drivers report xdp Rx queue frag size as being the same as
> DMA write size. However, the only user of this field, namely
> bpf_xdp_frags_increase_tail(), clearly expects a truesize.
truesize may not be the most fortunate term.
truesize is how much memory we charge to the skb (for memory management)
which may relate to frag size in strange ways for strange HW :S
I don't have a great term, but I have a feeling truesize already
confuses driver developers.
Note two our xdp.py tests covers tail adjust, I wonder if it would have
caught this for you. Looking at Intel's results the XDP tests don't run
due to missing BTF. Could you work with Adrian to address that?
^ permalink raw reply [flat|nested] 29+ messages in thread