netdev.vger.kernel.org archive mirror
* [PATCH v2 bpf 0/2] xdp: fix page_pool leaks
@ 2025-10-17 14:31 Maciej Fijalkowski
  2025-10-17 14:31 ` [PATCH v2 bpf 1/2] xdp: update xdp_rxq_info's mem type in XDP generic hook Maciej Fijalkowski
  2025-10-17 14:31 ` [PATCH v2 bpf 2/2] veth: update mem type in xdp_buff Maciej Fijalkowski
  0 siblings, 2 replies; 16+ messages in thread
From: Maciej Fijalkowski @ 2025-10-17 14:31 UTC (permalink / raw)
  To: bpf, ast, daniel, hawk, ilias.apalodimas, toke, lorenzo, kuba
  Cc: netdev, magnus.karlsson, andrii, stfomichev, aleksander.lobakin,
	Maciej Fijalkowski

v1:
https://lore.kernel.org/bpf/20251003140243.2534865-1-maciej.fijalkowski@intel.com/
v1->v2:
- add skb to xdp_buff converter (Jakub)
- postpone restoration of veth's xdp_rxq_info mem model (AI)

Hi,

here we fix page_pool leaks that happen when a fragment is released
within an XDP program while the memory type is set incorrectly. The
reports come from syzbot and the AF_XDP test suite.

Most of the changes consist of pulling out the common code that
initializes and prepares an xdp_buff based on an existing skb, and
supplying it with the correct xdp_rxq_info mem_type that is assigned to
the xdp_buff. Some additional page_pool-related work had to be done on
the veth side.

Originally, this work was started by Octavian Purdila.

Thanks!


Maciej Fijalkowski (2):
  xdp: update xdp_rxq_info's mem type in XDP generic hook
  veth: update mem type in xdp_buff

 drivers/net/veth.c | 43 +++++++++++++++++++++++++++----------------
 include/net/xdp.h  | 31 +++++++++++++++++++++++++++++++
 net/core/dev.c     | 25 ++++---------------------
 3 files changed, 62 insertions(+), 37 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v2 bpf 1/2] xdp: update xdp_rxq_info's mem type in XDP generic hook
  2025-10-17 14:31 [PATCH v2 bpf 0/2] xdp: fix page_pool leaks Maciej Fijalkowski
@ 2025-10-17 14:31 ` Maciej Fijalkowski
  2025-10-17 16:45   ` Toke Høiland-Jørgensen
  2025-10-20 11:20   ` Jesper Dangaard Brouer
  2025-10-17 14:31 ` [PATCH v2 bpf 2/2] veth: update mem type in xdp_buff Maciej Fijalkowski
  1 sibling, 2 replies; 16+ messages in thread
From: Maciej Fijalkowski @ 2025-10-17 14:31 UTC (permalink / raw)
  To: bpf, ast, daniel, hawk, ilias.apalodimas, toke, lorenzo, kuba
  Cc: netdev, magnus.karlsson, andrii, stfomichev, aleksander.lobakin,
	Maciej Fijalkowski, syzbot+ff145014d6b0ce64a173, Ihor Solodrai,
	Octavian Purdila

Currently, the generic XDP hook uses xdp_rxq_info from netstack Rx
queues, which do not have their XDP memory model registered. There is a
case where an XDP program calls the bpf_xdp_adjust_tail() BPF helper,
which in turn releases the underlying memory. This happens when it
consumes a large enough number of bytes and the XDP buffer has
fragments. For this action, knowledge of the memory model passed to the
XDP program is crucial so that the core can call the suitable function
for freeing/recycling the page.

For netstack queues it defaults to MEM_TYPE_PAGE_SHARED (0) due to the
lack of mem model registration. The problem we're fixing here arises
when the kernel has copied the skb to a new buffer backed by the
system's page_pool and the XDP buffer is built around it. Then, when
bpf_xdp_adjust_tail() calls __xdp_return(), it acts incorrectly because
the mem type is not set to MEM_TYPE_PAGE_POOL, causing a page leak.

Pull the existing code that initializes/prepares the xdp_buff out of
bpf_prog_run_generic_xdp() into a new helper, xdp_convert_skb_to_buff(),
and embed there the initialization of the rxq mem_type that is assigned
to the xdp_buff.

This problem was triggered by syzbot as well as by the AF_XDP test
suite, which is about to be integrated into BPF CI.

Reported-by: syzbot+ff145014d6b0ce64a173@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6756c37b.050a0220.a30f1.019a.GAE@google.com/
Fixes: e6d5dbdd20aa ("xdp: add multi-buff support for xdp running in generic mode")
Tested-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Co-developed-by: Octavian Purdila <tavip@google.com>
Signed-off-by: Octavian Purdila <tavip@google.com> # whole analysis, testing, initiating a fix
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # commit msg and proposed more robust fix
---
 include/net/xdp.h | 31 +++++++++++++++++++++++++++++++
 net/core/dev.c    | 25 ++++---------------------
 2 files changed, 35 insertions(+), 21 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index aa742f413c35..51f3321e4f94 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -384,6 +384,37 @@ struct sk_buff *xdp_build_skb_from_frame(struct xdp_frame *xdpf,
 					 struct net_device *dev);
 struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf);
 
+static inline
+void xdp_convert_skb_to_buff(struct sk_buff *skb, struct xdp_buff *xdp,
+			     struct xdp_rxq_info *xdp_rxq)
+{
+	u32 frame_sz, mac_len;
+	void *hard_start;
+
+	/* The XDP program wants to see the packet starting at the MAC
+	 * header.
+	 */
+	mac_len = skb->data - skb_mac_header(skb);
+	hard_start = skb->data - skb_headroom(skb);
+
+	/* SKB "head" area always have tailroom for skb_shared_info */
+	frame_sz = skb_end_pointer(skb) - skb->head;
+	frame_sz += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	xdp_init_buff(xdp, frame_sz, xdp_rxq);
+	xdp_prepare_buff(xdp, hard_start, skb_headroom(skb) - mac_len,
+			 skb_headlen(skb) + mac_len, true);
+
+	if (skb_is_nonlinear(skb)) {
+		skb_shinfo(skb)->xdp_frags_size = skb->data_len;
+		xdp_buff_set_frags_flag(xdp);
+	} else {
+		xdp_buff_clear_frags_flag(xdp);
+	}
+
+	xdp->rxq->mem.type = page_pool_page_is_pp(virt_to_page(xdp->data)) ?
+				MEM_TYPE_PAGE_POOL : MEM_TYPE_PAGE_SHARED;
+}
+
 static inline
 void xdp_convert_frame_to_buff(const struct xdp_frame *frame,
 			       struct xdp_buff *xdp)
diff --git a/net/core/dev.c b/net/core/dev.c
index a64cef2c537e..3d607dce292b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5320,35 +5320,18 @@ static struct netdev_rx_queue *netif_get_rxqueue(struct sk_buff *skb)
 u32 bpf_prog_run_generic_xdp(struct sk_buff *skb, struct xdp_buff *xdp,
 			     const struct bpf_prog *xdp_prog)
 {
-	void *orig_data, *orig_data_end, *hard_start;
+	void *orig_data, *orig_data_end;
 	struct netdev_rx_queue *rxqueue;
 	bool orig_bcast, orig_host;
-	u32 mac_len, frame_sz;
+	u32 metalen, act, mac_len;
 	__be16 orig_eth_type;
 	struct ethhdr *eth;
-	u32 metalen, act;
 	int off;
 
-	/* The XDP program wants to see the packet starting at the MAC
-	 * header.
-	 */
+	rxqueue = netif_get_rxqueue(skb);
 	mac_len = skb->data - skb_mac_header(skb);
-	hard_start = skb->data - skb_headroom(skb);
-
-	/* SKB "head" area always have tailroom for skb_shared_info */
-	frame_sz = (void *)skb_end_pointer(skb) - hard_start;
-	frame_sz += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
-	rxqueue = netif_get_rxqueue(skb);
-	xdp_init_buff(xdp, frame_sz, &rxqueue->xdp_rxq);
-	xdp_prepare_buff(xdp, hard_start, skb_headroom(skb) - mac_len,
-			 skb_headlen(skb) + mac_len, true);
-	if (skb_is_nonlinear(skb)) {
-		skb_shinfo(skb)->xdp_frags_size = skb->data_len;
-		xdp_buff_set_frags_flag(xdp);
-	} else {
-		xdp_buff_clear_frags_flag(xdp);
-	}
+	xdp_convert_skb_to_buff(skb, xdp, &rxqueue->xdp_rxq);
 
 	orig_data_end = xdp->data_end;
 	orig_data = xdp->data;
-- 
2.43.0



* [PATCH v2 bpf 2/2] veth: update mem type in xdp_buff
  2025-10-17 14:31 [PATCH v2 bpf 0/2] xdp: fix page_pool leaks Maciej Fijalkowski
  2025-10-17 14:31 ` [PATCH v2 bpf 1/2] xdp: update xdp_rxq_info's mem type in XDP generic hook Maciej Fijalkowski
@ 2025-10-17 14:31 ` Maciej Fijalkowski
  2025-10-17 16:33   ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 16+ messages in thread
From: Maciej Fijalkowski @ 2025-10-17 14:31 UTC (permalink / raw)
  To: bpf, ast, daniel, hawk, ilias.apalodimas, toke, lorenzo, kuba
  Cc: netdev, magnus.karlsson, andrii, stfomichev, aleksander.lobakin,
	Maciej Fijalkowski

Veth calls skb_pp_cow_data(), which makes the underlying memory
originate from the system page_pool. With CONFIG_DEBUG_VM=y and an XDP
program that uses bpf_xdp_adjust_tail(), the following splat was
observed:

[   32.204881] BUG: Bad page state in process test_progs  pfn:11c98b
[   32.207167] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11c98b
[   32.210084] flags: 0x1fffe0000000000(node=0|zone=1|lastcpupid=0x7fff)
[   32.212493] raw: 01fffe0000000000 dead000000000040 ff11000123c9b000 0000000000000000
[   32.218056] raw: 0000000000000000 0000000000000001 00000000ffffffff 0000000000000000
[   32.220900] page dumped because: page_pool leak
[   32.222636] Modules linked in: bpf_testmod(O) bpf_preload
[   32.224632] CPU: 6 UID: 0 PID: 3612 Comm: test_progs Tainted: G O        6.17.0-rc5-gfec474d29325 #6969 PREEMPT
[   32.224638] Tainted: [O]=OOT_MODULE
[   32.224639] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[   32.224641] Call Trace:
[   32.224644]  <IRQ>
[   32.224646]  dump_stack_lvl+0x4b/0x70
[   32.224653]  bad_page.cold+0xbd/0xe0
[   32.224657]  __free_frozen_pages+0x838/0x10b0
[   32.224660]  ? skb_pp_cow_data+0x782/0xc30
[   32.224665]  bpf_xdp_shrink_data+0x221/0x530
[   32.224668]  ? skb_pp_cow_data+0x6d1/0xc30
[   32.224671]  bpf_xdp_adjust_tail+0x598/0x810
[   32.224673]  ? xsk_destruct_skb+0x321/0x800
[   32.224678]  bpf_prog_004ac6bb21de57a7_xsk_xdp_adjust_tail+0x52/0xd6
[   32.224681]  veth_xdp_rcv_skb+0x45d/0x15a0
[   32.224684]  ? get_stack_info_noinstr+0x16/0xe0
[   32.224688]  ? veth_set_channels+0x920/0x920
[   32.224691]  ? get_stack_info+0x2f/0x80
[   32.224693]  ? unwind_next_frame+0x3af/0x1df0
[   32.224697]  veth_xdp_rcv.constprop.0+0x38a/0xbe0
[   32.224700]  ? common_startup_64+0x13e/0x148
[   32.224703]  ? veth_xdp_rcv_one+0xcd0/0xcd0
[   32.224706]  ? stack_trace_save+0x84/0xa0
[   32.224709]  ? stack_depot_save_flags+0x28/0x820
[   32.224713]  ? __resched_curr.constprop.0+0x332/0x3b0
[   32.224716]  ? timerqueue_add+0x217/0x320
[   32.224719]  veth_poll+0x115/0x5e0
[   32.224722]  ? veth_xdp_rcv.constprop.0+0xbe0/0xbe0
[   32.224726]  ? update_load_avg+0x1cb/0x12d0
[   32.224730]  ? update_cfs_group+0x121/0x2c0
[   32.224733]  __napi_poll+0xa0/0x420
[   32.224736]  net_rx_action+0x901/0xe90
[   32.224740]  ? run_backlog_napi+0x50/0x50
[   32.224743]  ? clockevents_program_event+0x1cc/0x280
[   32.224746]  ? hrtimer_interrupt+0x31e/0x7c0
[   32.224749]  handle_softirqs+0x151/0x430
[   32.224752]  do_softirq+0x3f/0x60
[   32.224755]  </IRQ>

This is because an xdp_rxq with its mem model set to
MEM_TYPE_PAGE_SHARED was used when initializing the xdp_buff.

Fix this by using the new helper xdp_convert_skb_to_buff() which,
besides initializing/preparing the xdp_buff, checks whether the page
used for the linear part of the xdp_buff comes from a page_pool. We
assume that the linear data and the frags have the same memory provider,
as the current XDP API does not provide a way to distinguish them (the
mem model is registered for the *whole* Rx queue, while here we are
dealing with single-buffer granularity).

In order to meet the skb layout expected by the new helper, pull the mac
header before the conversion from skb to xdp_buff.

However, that is not enough: before the xdp_buff is released out of veth
via XDP_{TX,REDIRECT}, the mem type on the xdp_rxq associated with the
xdp_buff is restored to its original model. We need to respect the
previous setting at least until the buff is converted to a frame, as the
frame carries the mem_type. Also add a page_pool variant of
veth_xdp_get() so that we avoid a refcount underflow when draining a
page frag.

Fixes: 0ebab78cbcbf ("net: veth: add page_pool for page recycling")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Closes: https://lore.kernel.org/bpf/CAADnVQ+bBofJDfieyOYzSmSujSfJwDTQhiz3aJw7hE+4E2_iPA@mail.gmail.com/
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/veth.c | 43 +++++++++++++++++++++++++++----------------
 1 file changed, 27 insertions(+), 16 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index a3046142cb8e..eeeee7bba685 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -733,7 +733,7 @@ static void veth_xdp_rcv_bulk_skb(struct veth_rq *rq, void **frames,
 	}
 }
 
-static void veth_xdp_get(struct xdp_buff *xdp)
+static void veth_xdp_get_shared(struct xdp_buff *xdp)
 {
 	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
 	int i;
@@ -746,12 +746,33 @@ static void veth_xdp_get(struct xdp_buff *xdp)
 		__skb_frag_ref(&sinfo->frags[i]);
 }
 
+static void veth_xdp_get_pp(struct xdp_buff *xdp)
+{
+	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
+	int i;
+
+	page_pool_ref_page(virt_to_page(xdp->data));
+	if (likely(!xdp_buff_has_frags(xdp)))
+		return;
+
+	for (i = 0; i < sinfo->nr_frags; i++) {
+		skb_frag_t *frag = &sinfo->frags[i];
+
+		page_pool_ref_page(netmem_to_page(frag->netmem));
+	}
+}
+
+static void veth_xdp_get(struct xdp_buff *xdp)
+{
+	xdp->rxq->mem.type == MEM_TYPE_PAGE_POOL ?
+		veth_xdp_get_pp(xdp) : veth_xdp_get_shared(xdp);
+}
+
 static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
 					struct xdp_buff *xdp,
 					struct sk_buff **pskb)
 {
 	struct sk_buff *skb = *pskb;
-	u32 frame_sz;
 
 	if (skb_shared(skb) || skb_head_is_locked(skb) ||
 	    skb_shinfo(skb)->nr_frags ||
@@ -762,19 +783,9 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
 		skb = *pskb;
 	}
 
-	/* SKB "head" area always have tailroom for skb_shared_info */
-	frame_sz = skb_end_pointer(skb) - skb->head;
-	frame_sz += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
-	xdp_init_buff(xdp, frame_sz, &rq->xdp_rxq);
-	xdp_prepare_buff(xdp, skb->head, skb_headroom(skb),
-			 skb_headlen(skb), true);
+	__skb_pull(*pskb, skb->data - skb_mac_header(skb));
 
-	if (skb_is_nonlinear(skb)) {
-		skb_shinfo(skb)->xdp_frags_size = skb->data_len;
-		xdp_buff_set_frags_flag(xdp);
-	} else {
-		xdp_buff_clear_frags_flag(xdp);
-	}
+	xdp_convert_skb_to_buff(skb, xdp, &rq->xdp_rxq);
 	*pskb = skb;
 
 	return 0;
@@ -822,24 +833,24 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	case XDP_TX:
 		veth_xdp_get(xdp);
 		consume_skb(skb);
-		xdp->rxq->mem = rq->xdp_mem;
 		if (unlikely(veth_xdp_tx(rq, xdp, bq) < 0)) {
 			trace_xdp_exception(rq->dev, xdp_prog, act);
 			stats->rx_drops++;
 			goto err_xdp;
 		}
 		stats->xdp_tx++;
+		rq->xdp_rxq.mem = rq->xdp_mem;
 		rcu_read_unlock();
 		goto xdp_xmit;
 	case XDP_REDIRECT:
 		veth_xdp_get(xdp);
 		consume_skb(skb);
-		xdp->rxq->mem = rq->xdp_mem;
 		if (xdp_do_redirect(rq->dev, xdp, xdp_prog)) {
 			stats->rx_drops++;
 			goto err_xdp;
 		}
 		stats->xdp_redirect++;
+		rq->xdp_rxq.mem = rq->xdp_mem;
 		rcu_read_unlock();
 		goto xdp_xmit;
 	default:
-- 
2.43.0



* Re: [PATCH v2 bpf 2/2] veth: update mem type in xdp_buff
  2025-10-17 14:31 ` [PATCH v2 bpf 2/2] veth: update mem type in xdp_buff Maciej Fijalkowski
@ 2025-10-17 16:33   ` Toke Høiland-Jørgensen
  2025-10-17 16:52     ` Maciej Fijalkowski
  0 siblings, 1 reply; 16+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-10-17 16:33 UTC (permalink / raw)
  To: Maciej Fijalkowski, bpf, ast, daniel, hawk, ilias.apalodimas,
	lorenzo, kuba
  Cc: netdev, magnus.karlsson, andrii, stfomichev, aleksander.lobakin,
	Maciej Fijalkowski

Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:

> Veth calls skb_pp_cow_data() which makes the underlying memory to
> originate from system page_pool. For CONFIG_DEBUG_VM=y and XDP program
> that uses bpf_xdp_adjust_tail(), following splat was observed:
>
> [   32.204881] BUG: Bad page state in process test_progs  pfn:11c98b
> [   32.207167] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11c98b
> [   32.210084] flags: 0x1fffe0000000000(node=0|zone=1|lastcpupid=0x7fff)
> [   32.212493] raw: 01fffe0000000000 dead000000000040 ff11000123c9b000 0000000000000000
> [   32.218056] raw: 0000000000000000 0000000000000001 00000000ffffffff 0000000000000000
> [   32.220900] page dumped because: page_pool leak
> [   32.222636] Modules linked in: bpf_testmod(O) bpf_preload
> [   32.224632] CPU: 6 UID: 0 PID: 3612 Comm: test_progs Tainted: G O        6.17.0-rc5-gfec474d29325 #6969 PREEMPT
> [   32.224638] Tainted: [O]=OOT_MODULE
> [   32.224639] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> [   32.224641] Call Trace:
> [   32.224644]  <IRQ>
> [   32.224646]  dump_stack_lvl+0x4b/0x70
> [   32.224653]  bad_page.cold+0xbd/0xe0
> [   32.224657]  __free_frozen_pages+0x838/0x10b0
> [   32.224660]  ? skb_pp_cow_data+0x782/0xc30
> [   32.224665]  bpf_xdp_shrink_data+0x221/0x530
> [   32.224668]  ? skb_pp_cow_data+0x6d1/0xc30
> [   32.224671]  bpf_xdp_adjust_tail+0x598/0x810
> [   32.224673]  ? xsk_destruct_skb+0x321/0x800
> [   32.224678]  bpf_prog_004ac6bb21de57a7_xsk_xdp_adjust_tail+0x52/0xd6
> [   32.224681]  veth_xdp_rcv_skb+0x45d/0x15a0
> [   32.224684]  ? get_stack_info_noinstr+0x16/0xe0
> [   32.224688]  ? veth_set_channels+0x920/0x920
> [   32.224691]  ? get_stack_info+0x2f/0x80
> [   32.224693]  ? unwind_next_frame+0x3af/0x1df0
> [   32.224697]  veth_xdp_rcv.constprop.0+0x38a/0xbe0
> [   32.224700]  ? common_startup_64+0x13e/0x148
> [   32.224703]  ? veth_xdp_rcv_one+0xcd0/0xcd0
> [   32.224706]  ? stack_trace_save+0x84/0xa0
> [   32.224709]  ? stack_depot_save_flags+0x28/0x820
> [   32.224713]  ? __resched_curr.constprop.0+0x332/0x3b0
> [   32.224716]  ? timerqueue_add+0x217/0x320
> [   32.224719]  veth_poll+0x115/0x5e0
> [   32.224722]  ? veth_xdp_rcv.constprop.0+0xbe0/0xbe0
> [   32.224726]  ? update_load_avg+0x1cb/0x12d0
> [   32.224730]  ? update_cfs_group+0x121/0x2c0
> [   32.224733]  __napi_poll+0xa0/0x420
> [   32.224736]  net_rx_action+0x901/0xe90
> [   32.224740]  ? run_backlog_napi+0x50/0x50
> [   32.224743]  ? clockevents_program_event+0x1cc/0x280
> [   32.224746]  ? hrtimer_interrupt+0x31e/0x7c0
> [   32.224749]  handle_softirqs+0x151/0x430
> [   32.224752]  do_softirq+0x3f/0x60
> [   32.224755]  </IRQ>
>
> It's because xdp_rxq with mem model set to MEM_TYPE_PAGE_SHARED was used
> when initializing xdp_buff.
>
> Fix this by using new helper xdp_convert_skb_to_buff() that, besides
> init/prepare xdp_buff, will check if page used for linear part of
> xdp_buff comes from page_pool. We assume that linear data and frags will
> have same memory provider as currently XDP API does not provide us a way
> to distinguish it (the mem model is registered for *whole* Rx queue and
> here we speak about single buffer granularity).
>
> In order to meet expected skb layout by new helper, pull the mac header
> before conversion from skb to xdp_buff.
>
> However, that is not enough as before releasing xdp_buff out of veth via
> XDP_{TX,REDIRECT}, mem type on xdp_rxq associated with xdp_buff is
> restored to its original model. We need to respect previous setting at
> least until buff is converted to frame, as frame carries the mem_type.
> Add a page_pool variant of veth_xdp_get() so that we avoid refcount
> underflow when draining page frag.
>
> Fixes: 0ebab78cbcbf ("net: veth: add page_pool for page recycling")
> Reported-by: Alexei Starovoitov <ast@kernel.org>
> Closes: https://lore.kernel.org/bpf/CAADnVQ+bBofJDfieyOYzSmSujSfJwDTQhiz3aJw7hE+4E2_iPA@mail.gmail.com/
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>  drivers/net/veth.c | 43 +++++++++++++++++++++++++++----------------
>  1 file changed, 27 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index a3046142cb8e..eeeee7bba685 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -733,7 +733,7 @@ static void veth_xdp_rcv_bulk_skb(struct veth_rq *rq, void **frames,
>  	}
>  }
>  
> -static void veth_xdp_get(struct xdp_buff *xdp)
> +static void veth_xdp_get_shared(struct xdp_buff *xdp)
>  {
>  	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
>  	int i;
> @@ -746,12 +746,33 @@ static void veth_xdp_get(struct xdp_buff *xdp)
>  		__skb_frag_ref(&sinfo->frags[i]);
>  }
>  
> +static void veth_xdp_get_pp(struct xdp_buff *xdp)
> +{
> +	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
> +	int i;
> +
> +	page_pool_ref_page(virt_to_page(xdp->data));
> +	if (likely(!xdp_buff_has_frags(xdp)))
> +		return;
> +
> +	for (i = 0; i < sinfo->nr_frags; i++) {
> +		skb_frag_t *frag = &sinfo->frags[i];
> +
> +		page_pool_ref_page(netmem_to_page(frag->netmem));
> +	}
> +}
> +
> +static void veth_xdp_get(struct xdp_buff *xdp)
> +{
> +	xdp->rxq->mem.type == MEM_TYPE_PAGE_POOL ?
> +		veth_xdp_get_pp(xdp) : veth_xdp_get_shared(xdp);
> +}
> +
>  static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>  					struct xdp_buff *xdp,
>  					struct sk_buff **pskb)
>  {
>  	struct sk_buff *skb = *pskb;
> -	u32 frame_sz;
>  
>  	if (skb_shared(skb) || skb_head_is_locked(skb) ||
>  	    skb_shinfo(skb)->nr_frags ||
> @@ -762,19 +783,9 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>  		skb = *pskb;
>  	}
>  
> -	/* SKB "head" area always have tailroom for skb_shared_info */
> -	frame_sz = skb_end_pointer(skb) - skb->head;
> -	frame_sz += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> -	xdp_init_buff(xdp, frame_sz, &rq->xdp_rxq);
> -	xdp_prepare_buff(xdp, skb->head, skb_headroom(skb),
> -			 skb_headlen(skb), true);
> +	__skb_pull(*pskb, skb->data - skb_mac_header(skb));

veth_xdp_rcv_skb() does:

	__skb_push(skb, skb->data - skb_mac_header(skb));
	if (veth_convert_skb_to_xdp_buff(rq, xdp, &skb))

so how about just getting rid of that push instead of doing the opposite
pull straight after? :)

-Toke



* Re: [PATCH v2 bpf 1/2] xdp: update xdp_rxq_info's mem type in XDP generic hook
  2025-10-17 14:31 ` [PATCH v2 bpf 1/2] xdp: update xdp_rxq_info's mem type in XDP generic hook Maciej Fijalkowski
@ 2025-10-17 16:45   ` Toke Høiland-Jørgensen
  2025-10-20 11:20   ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 16+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-10-17 16:45 UTC (permalink / raw)
  To: Maciej Fijalkowski, bpf, ast, daniel, hawk, ilias.apalodimas,
	lorenzo, kuba
  Cc: netdev, magnus.karlsson, andrii, stfomichev, aleksander.lobakin,
	Maciej Fijalkowski, syzbot+ff145014d6b0ce64a173, Ihor Solodrai,
	Octavian Purdila

Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:

> Currently, generic XDP hook uses xdp_rxq_info from netstack Rx queues
> which do not have its XDP memory model registered. There is a case when
> XDP program calls bpf_xdp_adjust_tail() BPF helper, which in turn
> releases underlying memory. This happens when it consumes enough amount
> of bytes and when XDP buffer has fragments. For this action the memory
> model knowledge passed to XDP program is crucial so that core can call
> suitable function for freeing/recycling the page.
>
> For netstack queues it defaults to MEM_TYPE_PAGE_SHARED (0) due to lack
> of mem model registration. The problem we're fixing here is when kernel
> copied the skb to new buffer backed by system's page_pool and XDP buffer
> is built around it. Then when bpf_xdp_adjust_tail() calls
> __xdp_return(), it acts incorrectly due to mem type not being set to
> MEM_TYPE_PAGE_POOL and causes a page leak.
>
> Pull out the existing code from bpf_prog_run_generic_xdp() that
> init/prepares xdp_buff onto new helper xdp_convert_skb_to_buff() and
> embed there rxq's mem_type initialization that is assigned to xdp_buff.
>
> This problem was triggered by syzbot as well as AF_XDP test suite which
> is about to be integrated to BPF CI.
>
> Reported-by: syzbot+ff145014d6b0ce64a173@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/netdev/6756c37b.050a0220.a30f1.019a.GAE@google.com/
> Fixes: e6d5dbdd20aa ("xdp: add multi-buff support for xdp running in generic mode")
> Tested-by: Ihor Solodrai <ihor.solodrai@linux.dev>
> Co-developed-by: Octavian Purdila <tavip@google.com>
> Signed-off-by: Octavian Purdila <tavip@google.com> # whole analysis, testing, initiating a fix
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # commit msg and proposed more robust fix

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>



* Re: [PATCH v2 bpf 2/2] veth: update mem type in xdp_buff
  2025-10-17 16:33   ` Toke Høiland-Jørgensen
@ 2025-10-17 16:52     ` Maciej Fijalkowski
  2025-10-17 17:51       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 16+ messages in thread
From: Maciej Fijalkowski @ 2025-10-17 16:52 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, ast, daniel, hawk, ilias.apalodimas, lorenzo, kuba, netdev,
	magnus.karlsson, andrii, stfomichev, aleksander.lobakin

On Fri, Oct 17, 2025 at 06:33:54PM +0200, Toke Høiland-Jørgensen wrote:
> Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:
> 
> > Veth calls skb_pp_cow_data() which makes the underlying memory to
> > originate from system page_pool. For CONFIG_DEBUG_VM=y and XDP program
> > that uses bpf_xdp_adjust_tail(), following splat was observed:
> >
> > [   32.204881] BUG: Bad page state in process test_progs  pfn:11c98b
> > [   32.207167] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11c98b
> > [   32.210084] flags: 0x1fffe0000000000(node=0|zone=1|lastcpupid=0x7fff)
> > [   32.212493] raw: 01fffe0000000000 dead000000000040 ff11000123c9b000 0000000000000000
> > [   32.218056] raw: 0000000000000000 0000000000000001 00000000ffffffff 0000000000000000
> > [   32.220900] page dumped because: page_pool leak
> > [   32.222636] Modules linked in: bpf_testmod(O) bpf_preload
> > [   32.224632] CPU: 6 UID: 0 PID: 3612 Comm: test_progs Tainted: G O        6.17.0-rc5-gfec474d29325 #6969 PREEMPT
> > [   32.224638] Tainted: [O]=OOT_MODULE
> > [   32.224639] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> > [   32.224641] Call Trace:
> > [   32.224644]  <IRQ>
> > [   32.224646]  dump_stack_lvl+0x4b/0x70
> > [   32.224653]  bad_page.cold+0xbd/0xe0
> > [   32.224657]  __free_frozen_pages+0x838/0x10b0
> > [   32.224660]  ? skb_pp_cow_data+0x782/0xc30
> > [   32.224665]  bpf_xdp_shrink_data+0x221/0x530
> > [   32.224668]  ? skb_pp_cow_data+0x6d1/0xc30
> > [   32.224671]  bpf_xdp_adjust_tail+0x598/0x810
> > [   32.224673]  ? xsk_destruct_skb+0x321/0x800
> > [   32.224678]  bpf_prog_004ac6bb21de57a7_xsk_xdp_adjust_tail+0x52/0xd6
> > [   32.224681]  veth_xdp_rcv_skb+0x45d/0x15a0
> > [   32.224684]  ? get_stack_info_noinstr+0x16/0xe0
> > [   32.224688]  ? veth_set_channels+0x920/0x920
> > [   32.224691]  ? get_stack_info+0x2f/0x80
> > [   32.224693]  ? unwind_next_frame+0x3af/0x1df0
> > [   32.224697]  veth_xdp_rcv.constprop.0+0x38a/0xbe0
> > [   32.224700]  ? common_startup_64+0x13e/0x148
> > [   32.224703]  ? veth_xdp_rcv_one+0xcd0/0xcd0
> > [   32.224706]  ? stack_trace_save+0x84/0xa0
> > [   32.224709]  ? stack_depot_save_flags+0x28/0x820
> > [   32.224713]  ? __resched_curr.constprop.0+0x332/0x3b0
> > [   32.224716]  ? timerqueue_add+0x217/0x320
> > [   32.224719]  veth_poll+0x115/0x5e0
> > [   32.224722]  ? veth_xdp_rcv.constprop.0+0xbe0/0xbe0
> > [   32.224726]  ? update_load_avg+0x1cb/0x12d0
> > [   32.224730]  ? update_cfs_group+0x121/0x2c0
> > [   32.224733]  __napi_poll+0xa0/0x420
> > [   32.224736]  net_rx_action+0x901/0xe90
> > [   32.224740]  ? run_backlog_napi+0x50/0x50
> > [   32.224743]  ? clockevents_program_event+0x1cc/0x280
> > [   32.224746]  ? hrtimer_interrupt+0x31e/0x7c0
> > [   32.224749]  handle_softirqs+0x151/0x430
> > [   32.224752]  do_softirq+0x3f/0x60
> > [   32.224755]  </IRQ>
> >
> > It's because xdp_rxq with mem model set to MEM_TYPE_PAGE_SHARED was used
> > when initializing xdp_buff.
> >
> > Fix this by using new helper xdp_convert_skb_to_buff() that, besides
> > init/prepare xdp_buff, will check if page used for linear part of
> > xdp_buff comes from page_pool. We assume that linear data and frags will
> > have same memory provider as currently XDP API does not provide us a way
> > to distinguish it (the mem model is registered for *whole* Rx queue and
> > here we speak about single buffer granularity).
> >
> > In order to meet expected skb layout by new helper, pull the mac header
> > before conversion from skb to xdp_buff.
> >
> > However, that is not enough as before releasing xdp_buff out of veth via
> > XDP_{TX,REDIRECT}, mem type on xdp_rxq associated with xdp_buff is
> > restored to its original model. We need to respect previous setting at
> > least until buff is converted to frame, as frame carries the mem_type.
> > Add a page_pool variant of veth_xdp_get() so that we avoid refcount
> > underflow when draining page frag.
> >
> > Fixes: 0ebab78cbcbf ("net: veth: add page_pool for page recycling")
> > Reported-by: Alexei Starovoitov <ast@kernel.org>
> > Closes: https://lore.kernel.org/bpf/CAADnVQ+bBofJDfieyOYzSmSujSfJwDTQhiz3aJw7hE+4E2_iPA@mail.gmail.com/
> > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > ---
> >  drivers/net/veth.c | 43 +++++++++++++++++++++++++++----------------
> >  1 file changed, 27 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> > index a3046142cb8e..eeeee7bba685 100644
> > --- a/drivers/net/veth.c
> > +++ b/drivers/net/veth.c
> > @@ -733,7 +733,7 @@ static void veth_xdp_rcv_bulk_skb(struct veth_rq *rq, void **frames,
> >  	}
> >  }
> >  
> > -static void veth_xdp_get(struct xdp_buff *xdp)
> > +static void veth_xdp_get_shared(struct xdp_buff *xdp)
> >  {
> >  	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
> >  	int i;
> > @@ -746,12 +746,33 @@ static void veth_xdp_get(struct xdp_buff *xdp)
> >  		__skb_frag_ref(&sinfo->frags[i]);
> >  }
> >  
> > +static void veth_xdp_get_pp(struct xdp_buff *xdp)
> > +{
> > +	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
> > +	int i;
> > +
> > +	page_pool_ref_page(virt_to_page(xdp->data));
> > +	if (likely(!xdp_buff_has_frags(xdp)))
> > +		return;
> > +
> > +	for (i = 0; i < sinfo->nr_frags; i++) {
> > +		skb_frag_t *frag = &sinfo->frags[i];
> > +
> > +		page_pool_ref_page(netmem_to_page(frag->netmem));
> > +	}
> > +}
> > +
> > +static void veth_xdp_get(struct xdp_buff *xdp)
> > +{
> > +	xdp->rxq->mem.type == MEM_TYPE_PAGE_POOL ?
> > +		veth_xdp_get_pp(xdp) : veth_xdp_get_shared(xdp);
> > +}
> > +
> >  static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> >  					struct xdp_buff *xdp,
> >  					struct sk_buff **pskb)
> >  {
> >  	struct sk_buff *skb = *pskb;
> > -	u32 frame_sz;
> >  
> >  	if (skb_shared(skb) || skb_head_is_locked(skb) ||
> >  	    skb_shinfo(skb)->nr_frags ||
> > @@ -762,19 +783,9 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> >  		skb = *pskb;
> >  	}
> >  
> > -	/* SKB "head" area always have tailroom for skb_shared_info */
> > -	frame_sz = skb_end_pointer(skb) - skb->head;
> > -	frame_sz += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > -	xdp_init_buff(xdp, frame_sz, &rq->xdp_rxq);
> > -	xdp_prepare_buff(xdp, skb->head, skb_headroom(skb),
> > -			 skb_headlen(skb), true);
> > +	__skb_pull(*pskb, skb->data - skb_mac_header(skb));
> 
> veth_xdp_rcv_skb() does:
> 
> 	__skb_push(skb, skb->data - skb_mac_header(skb));
> 	if (veth_convert_skb_to_xdp_buff(rq, xdp, &skb))
> 
> so how about just getting rid of that push instead of doing the opposite
> pull straight after? :)

Hi Toke,

I believe this is done so that we get a proper headroom representation,
which is needed for the XDP_PACKET_HEADROOM comparison. Maybe we could
be smarter here and, for example, subtract the mac header length?
However, I wanted to preserve the old behavior.

Thanks for review!
Maciej

> 
> -Toke
> 


* Re: [PATCH v2 bpf 2/2] veth: update mem type in xdp_buff
  2025-10-17 16:52     ` Maciej Fijalkowski
@ 2025-10-17 17:51       ` Toke Høiland-Jørgensen
  2025-10-17 18:12         ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 16+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-10-17 17:51 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, hawk, ilias.apalodimas, lorenzo, kuba, netdev,
	magnus.karlsson, andrii, stfomichev, aleksander.lobakin

Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:

> On Fri, Oct 17, 2025 at 06:33:54PM +0200, Toke Høiland-Jørgensen wrote:
>> Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:
>> 
>> > Veth calls skb_pp_cow_data() which makes the underlying memory to
>> > originate from system page_pool. For CONFIG_DEBUG_VM=y and XDP program
>> > that uses bpf_xdp_adjust_tail(), following splat was observed:
>> >
>> > [   32.204881] BUG: Bad page state in process test_progs  pfn:11c98b
>> > [   32.207167] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11c98b
>> > [   32.210084] flags: 0x1fffe0000000000(node=0|zone=1|lastcpupid=0x7fff)
>> > [   32.212493] raw: 01fffe0000000000 dead000000000040 ff11000123c9b000 0000000000000000
>> > [   32.218056] raw: 0000000000000000 0000000000000001 00000000ffffffff 0000000000000000
>> > [   32.220900] page dumped because: page_pool leak
>> > [   32.222636] Modules linked in: bpf_testmod(O) bpf_preload
>> > [   32.224632] CPU: 6 UID: 0 PID: 3612 Comm: test_progs Tainted: G O        6.17.0-rc5-gfec474d29325 #6969 PREEMPT
>> > [   32.224638] Tainted: [O]=OOT_MODULE
>> > [   32.224639] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>> > [   32.224641] Call Trace:
>> > [   32.224644]  <IRQ>
>> > [   32.224646]  dump_stack_lvl+0x4b/0x70
>> > [   32.224653]  bad_page.cold+0xbd/0xe0
>> > [   32.224657]  __free_frozen_pages+0x838/0x10b0
>> > [   32.224660]  ? skb_pp_cow_data+0x782/0xc30
>> > [   32.224665]  bpf_xdp_shrink_data+0x221/0x530
>> > [   32.224668]  ? skb_pp_cow_data+0x6d1/0xc30
>> > [   32.224671]  bpf_xdp_adjust_tail+0x598/0x810
>> > [   32.224673]  ? xsk_destruct_skb+0x321/0x800
>> > [   32.224678]  bpf_prog_004ac6bb21de57a7_xsk_xdp_adjust_tail+0x52/0xd6
>> > [   32.224681]  veth_xdp_rcv_skb+0x45d/0x15a0
>> > [   32.224684]  ? get_stack_info_noinstr+0x16/0xe0
>> > [   32.224688]  ? veth_set_channels+0x920/0x920
>> > [   32.224691]  ? get_stack_info+0x2f/0x80
>> > [   32.224693]  ? unwind_next_frame+0x3af/0x1df0
>> > [   32.224697]  veth_xdp_rcv.constprop.0+0x38a/0xbe0
>> > [   32.224700]  ? common_startup_64+0x13e/0x148
>> > [   32.224703]  ? veth_xdp_rcv_one+0xcd0/0xcd0
>> > [   32.224706]  ? stack_trace_save+0x84/0xa0
>> > [   32.224709]  ? stack_depot_save_flags+0x28/0x820
>> > [   32.224713]  ? __resched_curr.constprop.0+0x332/0x3b0
>> > [   32.224716]  ? timerqueue_add+0x217/0x320
>> > [   32.224719]  veth_poll+0x115/0x5e0
>> > [   32.224722]  ? veth_xdp_rcv.constprop.0+0xbe0/0xbe0
>> > [   32.224726]  ? update_load_avg+0x1cb/0x12d0
>> > [   32.224730]  ? update_cfs_group+0x121/0x2c0
>> > [   32.224733]  __napi_poll+0xa0/0x420
>> > [   32.224736]  net_rx_action+0x901/0xe90
>> > [   32.224740]  ? run_backlog_napi+0x50/0x50
>> > [   32.224743]  ? clockevents_program_event+0x1cc/0x280
>> > [   32.224746]  ? hrtimer_interrupt+0x31e/0x7c0
>> > [   32.224749]  handle_softirqs+0x151/0x430
>> > [   32.224752]  do_softirq+0x3f/0x60
>> > [   32.224755]  </IRQ>
>> >
>> > It's because xdp_rxq with mem model set to MEM_TYPE_PAGE_SHARED was used
>> > when initializing xdp_buff.
>> >
>> > Fix this by using new helper xdp_convert_skb_to_buff() that, besides
>> > init/prepare xdp_buff, will check if page used for linear part of
>> > xdp_buff comes from page_pool. We assume that linear data and frags will
>> > have same memory provider as currently XDP API does not provide us a way
>> > to distinguish it (the mem model is registered for *whole* Rx queue and
>> > here we speak about single buffer granularity).
>> >
>> > In order to meet expected skb layout by new helper, pull the mac header
>> > before conversion from skb to xdp_buff.
>> >
>> > However, that is not enough as before releasing xdp_buff out of veth via
>> > XDP_{TX,REDIRECT}, mem type on xdp_rxq associated with xdp_buff is
>> > restored to its original model. We need to respect previous setting at
>> > least until buff is converted to frame, as frame carries the mem_type.
>> > Add a page_pool variant of veth_xdp_get() so that we avoid refcount
>> > underflow when draining page frag.
>> >
>> > Fixes: 0ebab78cbcbf ("net: veth: add page_pool for page recycling")
>> > Reported-by: Alexei Starovoitov <ast@kernel.org>
>> > Closes: https://lore.kernel.org/bpf/CAADnVQ+bBofJDfieyOYzSmSujSfJwDTQhiz3aJw7hE+4E2_iPA@mail.gmail.com/
>> > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
>> > ---
>> >  drivers/net/veth.c | 43 +++++++++++++++++++++++++++----------------
>> >  1 file changed, 27 insertions(+), 16 deletions(-)
>> >
>> > diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> > index a3046142cb8e..eeeee7bba685 100644
>> > --- a/drivers/net/veth.c
>> > +++ b/drivers/net/veth.c
>> > @@ -733,7 +733,7 @@ static void veth_xdp_rcv_bulk_skb(struct veth_rq *rq, void **frames,
>> >  	}
>> >  }
>> >  
>> > -static void veth_xdp_get(struct xdp_buff *xdp)
>> > +static void veth_xdp_get_shared(struct xdp_buff *xdp)
>> >  {
>> >  	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
>> >  	int i;
>> > @@ -746,12 +746,33 @@ static void veth_xdp_get(struct xdp_buff *xdp)
>> >  		__skb_frag_ref(&sinfo->frags[i]);
>> >  }
>> >  
>> > +static void veth_xdp_get_pp(struct xdp_buff *xdp)
>> > +{
>> > +	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
>> > +	int i;
>> > +
>> > +	page_pool_ref_page(virt_to_page(xdp->data));
>> > +	if (likely(!xdp_buff_has_frags(xdp)))
>> > +		return;
>> > +
>> > +	for (i = 0; i < sinfo->nr_frags; i++) {
>> > +		skb_frag_t *frag = &sinfo->frags[i];
>> > +
>> > +		page_pool_ref_page(netmem_to_page(frag->netmem));
>> > +	}
>> > +}
>> > +
>> > +static void veth_xdp_get(struct xdp_buff *xdp)
>> > +{
>> > +	xdp->rxq->mem.type == MEM_TYPE_PAGE_POOL ?
>> > +		veth_xdp_get_pp(xdp) : veth_xdp_get_shared(xdp);
>> > +}
>> > +
>> >  static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>> >  					struct xdp_buff *xdp,
>> >  					struct sk_buff **pskb)
>> >  {
>> >  	struct sk_buff *skb = *pskb;
>> > -	u32 frame_sz;
>> >  
>> >  	if (skb_shared(skb) || skb_head_is_locked(skb) ||
>> >  	    skb_shinfo(skb)->nr_frags ||
>> > @@ -762,19 +783,9 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>> >  		skb = *pskb;
>> >  	}
>> >  
>> > -	/* SKB "head" area always have tailroom for skb_shared_info */
>> > -	frame_sz = skb_end_pointer(skb) - skb->head;
>> > -	frame_sz += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> > -	xdp_init_buff(xdp, frame_sz, &rq->xdp_rxq);
>> > -	xdp_prepare_buff(xdp, skb->head, skb_headroom(skb),
>> > -			 skb_headlen(skb), true);
>> > +	__skb_pull(*pskb, skb->data - skb_mac_header(skb));
>> 
>> veth_xdp_rcv_skb() does:
>> 
>> 	__skb_push(skb, skb->data - skb_mac_header(skb));
>> 	if (veth_convert_skb_to_xdp_buff(rq, xdp, &skb))
>> 
>> so how about just getting rid of that push instead of doing the opposite
>> pull straight after? :)
>
> Hi Toke,
>
> I believe this is done so we get a proper headroom representation which is
> needed for XDP_PACKET_HEADROOM comparison. Maybe we could be smarter here
> and for example subtract mac header length? However I wanted to preserve
> old behavior.

Yeah, basically what we want is to check if the mac_header offset is
larger than the headroom. So the check could just be:

    skb->mac_header < XDP_PACKET_HEADROOM

however, it may be better to use the helper? Since that makes sure we
keep hitting the DEBUG_NET_WARN_ON_ONCE inside the helper... So:

    skb_mac_header(skb) - skb->head < XDP_PACKET_HEADROOM

or, equivalently:

    skb_headroom(skb) + skb_mac_offset(skb) < XDP_PACKET_HEADROOM

I think the first one is probably more readable, since skb_mac_offset()
is negative here, so the calculation looks off...

-Toke


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 bpf 2/2] veth: update mem type in xdp_buff
  2025-10-17 17:51       ` Toke Høiland-Jørgensen
@ 2025-10-17 18:12         ` Toke Høiland-Jørgensen
  2025-10-17 20:08           ` Maciej Fijalkowski
  0 siblings, 1 reply; 16+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-10-17 18:12 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, hawk, ilias.apalodimas, lorenzo, kuba, netdev,
	magnus.karlsson, andrii, stfomichev, aleksander.lobakin

Toke Høiland-Jørgensen <toke@redhat.com> writes:

> Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:
>
>> On Fri, Oct 17, 2025 at 06:33:54PM +0200, Toke Høiland-Jørgensen wrote:
>>> Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:
>>> 
>>> > Veth calls skb_pp_cow_data() which makes the underlying memory to
>>> > originate from system page_pool. For CONFIG_DEBUG_VM=y and XDP program
>>> > that uses bpf_xdp_adjust_tail(), following splat was observed:
>>> >
>>> > [   32.204881] BUG: Bad page state in process test_progs  pfn:11c98b
>>> > [   32.207167] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11c98b
>>> > [   32.210084] flags: 0x1fffe0000000000(node=0|zone=1|lastcpupid=0x7fff)
>>> > [   32.212493] raw: 01fffe0000000000 dead000000000040 ff11000123c9b000 0000000000000000
>>> > [   32.218056] raw: 0000000000000000 0000000000000001 00000000ffffffff 0000000000000000
>>> > [   32.220900] page dumped because: page_pool leak
>>> > [   32.222636] Modules linked in: bpf_testmod(O) bpf_preload
>>> > [   32.224632] CPU: 6 UID: 0 PID: 3612 Comm: test_progs Tainted: G O        6.17.0-rc5-gfec474d29325 #6969 PREEMPT
>>> > [   32.224638] Tainted: [O]=OOT_MODULE
>>> > [   32.224639] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>>> > [   32.224641] Call Trace:
>>> > [   32.224644]  <IRQ>
>>> > [   32.224646]  dump_stack_lvl+0x4b/0x70
>>> > [   32.224653]  bad_page.cold+0xbd/0xe0
>>> > [   32.224657]  __free_frozen_pages+0x838/0x10b0
>>> > [   32.224660]  ? skb_pp_cow_data+0x782/0xc30
>>> > [   32.224665]  bpf_xdp_shrink_data+0x221/0x530
>>> > [   32.224668]  ? skb_pp_cow_data+0x6d1/0xc30
>>> > [   32.224671]  bpf_xdp_adjust_tail+0x598/0x810
>>> > [   32.224673]  ? xsk_destruct_skb+0x321/0x800
>>> > [   32.224678]  bpf_prog_004ac6bb21de57a7_xsk_xdp_adjust_tail+0x52/0xd6
>>> > [   32.224681]  veth_xdp_rcv_skb+0x45d/0x15a0
>>> > [   32.224684]  ? get_stack_info_noinstr+0x16/0xe0
>>> > [   32.224688]  ? veth_set_channels+0x920/0x920
>>> > [   32.224691]  ? get_stack_info+0x2f/0x80
>>> > [   32.224693]  ? unwind_next_frame+0x3af/0x1df0
>>> > [   32.224697]  veth_xdp_rcv.constprop.0+0x38a/0xbe0
>>> > [   32.224700]  ? common_startup_64+0x13e/0x148
>>> > [   32.224703]  ? veth_xdp_rcv_one+0xcd0/0xcd0
>>> > [   32.224706]  ? stack_trace_save+0x84/0xa0
>>> > [   32.224709]  ? stack_depot_save_flags+0x28/0x820
>>> > [   32.224713]  ? __resched_curr.constprop.0+0x332/0x3b0
>>> > [   32.224716]  ? timerqueue_add+0x217/0x320
>>> > [   32.224719]  veth_poll+0x115/0x5e0
>>> > [   32.224722]  ? veth_xdp_rcv.constprop.0+0xbe0/0xbe0
>>> > [   32.224726]  ? update_load_avg+0x1cb/0x12d0
>>> > [   32.224730]  ? update_cfs_group+0x121/0x2c0
>>> > [   32.224733]  __napi_poll+0xa0/0x420
>>> > [   32.224736]  net_rx_action+0x901/0xe90
>>> > [   32.224740]  ? run_backlog_napi+0x50/0x50
>>> > [   32.224743]  ? clockevents_program_event+0x1cc/0x280
>>> > [   32.224746]  ? hrtimer_interrupt+0x31e/0x7c0
>>> > [   32.224749]  handle_softirqs+0x151/0x430
>>> > [   32.224752]  do_softirq+0x3f/0x60
>>> > [   32.224755]  </IRQ>
>>> >
>>> > It's because xdp_rxq with mem model set to MEM_TYPE_PAGE_SHARED was used
>>> > when initializing xdp_buff.
>>> >
>>> > Fix this by using new helper xdp_convert_skb_to_buff() that, besides
>>> > init/prepare xdp_buff, will check if page used for linear part of
>>> > xdp_buff comes from page_pool. We assume that linear data and frags will
>>> > have same memory provider as currently XDP API does not provide us a way
>>> > to distinguish it (the mem model is registered for *whole* Rx queue and
>>> > here we speak about single buffer granularity).
>>> >
>>> > In order to meet expected skb layout by new helper, pull the mac header
>>> > before conversion from skb to xdp_buff.
>>> >
>>> > However, that is not enough as before releasing xdp_buff out of veth via
>>> > XDP_{TX,REDIRECT}, mem type on xdp_rxq associated with xdp_buff is
>>> > restored to its original model. We need to respect previous setting at
>>> > least until buff is converted to frame, as frame carries the mem_type.
>>> > Add a page_pool variant of veth_xdp_get() so that we avoid refcount
>>> > underflow when draining page frag.
>>> >
>>> > Fixes: 0ebab78cbcbf ("net: veth: add page_pool for page recycling")
>>> > Reported-by: Alexei Starovoitov <ast@kernel.org>
>>> > Closes: https://lore.kernel.org/bpf/CAADnVQ+bBofJDfieyOYzSmSujSfJwDTQhiz3aJw7hE+4E2_iPA@mail.gmail.com/
>>> > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
>>> > ---
>>> >  drivers/net/veth.c | 43 +++++++++++++++++++++++++++----------------
>>> >  1 file changed, 27 insertions(+), 16 deletions(-)
>>> >
>>> > diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>> > index a3046142cb8e..eeeee7bba685 100644
>>> > --- a/drivers/net/veth.c
>>> > +++ b/drivers/net/veth.c
>>> > @@ -733,7 +733,7 @@ static void veth_xdp_rcv_bulk_skb(struct veth_rq *rq, void **frames,
>>> >  	}
>>> >  }
>>> >  
>>> > -static void veth_xdp_get(struct xdp_buff *xdp)
>>> > +static void veth_xdp_get_shared(struct xdp_buff *xdp)
>>> >  {
>>> >  	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
>>> >  	int i;
>>> > @@ -746,12 +746,33 @@ static void veth_xdp_get(struct xdp_buff *xdp)
>>> >  		__skb_frag_ref(&sinfo->frags[i]);
>>> >  }
>>> >  
>>> > +static void veth_xdp_get_pp(struct xdp_buff *xdp)
>>> > +{
>>> > +	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
>>> > +	int i;
>>> > +
>>> > +	page_pool_ref_page(virt_to_page(xdp->data));
>>> > +	if (likely(!xdp_buff_has_frags(xdp)))
>>> > +		return;
>>> > +
>>> > +	for (i = 0; i < sinfo->nr_frags; i++) {
>>> > +		skb_frag_t *frag = &sinfo->frags[i];
>>> > +
>>> > +		page_pool_ref_page(netmem_to_page(frag->netmem));
>>> > +	}
>>> > +}
>>> > +
>>> > +static void veth_xdp_get(struct xdp_buff *xdp)
>>> > +{
>>> > +	xdp->rxq->mem.type == MEM_TYPE_PAGE_POOL ?
>>> > +		veth_xdp_get_pp(xdp) : veth_xdp_get_shared(xdp);
>>> > +}
>>> > +
>>> >  static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>>> >  					struct xdp_buff *xdp,
>>> >  					struct sk_buff **pskb)
>>> >  {
>>> >  	struct sk_buff *skb = *pskb;
>>> > -	u32 frame_sz;
>>> >  
>>> >  	if (skb_shared(skb) || skb_head_is_locked(skb) ||
>>> >  	    skb_shinfo(skb)->nr_frags ||
>>> > @@ -762,19 +783,9 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>>> >  		skb = *pskb;
>>> >  	}
>>> >  
>>> > -	/* SKB "head" area always have tailroom for skb_shared_info */
>>> > -	frame_sz = skb_end_pointer(skb) - skb->head;
>>> > -	frame_sz += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>>> > -	xdp_init_buff(xdp, frame_sz, &rq->xdp_rxq);
>>> > -	xdp_prepare_buff(xdp, skb->head, skb_headroom(skb),
>>> > -			 skb_headlen(skb), true);
>>> > +	__skb_pull(*pskb, skb->data - skb_mac_header(skb));
>>> 
>>> veth_xdp_rcv_skb() does:
>>> 
>>> 	__skb_push(skb, skb->data - skb_mac_header(skb));
>>> 	if (veth_convert_skb_to_xdp_buff(rq, xdp, &skb))
>>> 
>>> so how about just getting rid of that push instead of doing the opposite
>>> pull straight after? :)
>>
>> Hi Toke,
>>
>> I believe this is done so we get a proper headroom representation which is
>> needed for XDP_PACKET_HEADROOM comparison. Maybe we could be smarter here
>> and for example subtract mac header length? However I wanted to preserve
>> old behavior.
>
> Yeah, basically what we want is to check if the mac_header offset is
> larger than the headroom. So the check could just be:
>
>     skb->mac_header < XDP_PACKET_HEADROOM
>
> however, it may be better to use the helper? Since that makes sure we
> keep hitting the DEBUG_NET_WARN_ON_ONCE inside the helper... So:
>
>     skb_mac_header(skb) - skb->head < XDP_PACKET_HEADROOM
>
> or, equivalently:
>
>     skb_headroom(skb) + skb_mac_offset(skb) < XDP_PACKET_HEADROOM
>
> I think the first one is probably more readable, since skb_mac_offset()
> is negative here, so the calculation looks off...

Wait, veth_xdp_rcv_skb() calls skb_reset_mac_header() further down, so
it expects skb->data to point to the mac header. So getting rid of the
__skb_push() is not a good idea; but neither is doing the __skb_pull() as
your patch does currently.

How about just making xdp_convert_skb_to_buff() agnostic to where
skb->data is?

	headroom = skb_mac_header(skb) - skb->head;
	data_len = skb->data + skb->len - skb_mac_header(skb);
	xdp_prepare_buff(xdp, skb->head, headroom, data_len, true);

should work in both cases, no?

-Toke


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 bpf 2/2] veth: update mem type in xdp_buff
  2025-10-17 18:12         ` Toke Høiland-Jørgensen
@ 2025-10-17 20:08           ` Maciej Fijalkowski
  2025-10-20 10:25             ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 16+ messages in thread
From: Maciej Fijalkowski @ 2025-10-17 20:08 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, ast, daniel, hawk, ilias.apalodimas, lorenzo, kuba, netdev,
	magnus.karlsson, andrii, stfomichev, aleksander.lobakin

On Fri, Oct 17, 2025 at 08:12:57PM +0200, Toke Høiland-Jørgensen wrote:
> Toke Høiland-Jørgensen <toke@redhat.com> writes:
> 
> > Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:
> >
> >> On Fri, Oct 17, 2025 at 06:33:54PM +0200, Toke Høiland-Jørgensen wrote:
> >>> Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:
> >>> 
> >>> > Veth calls skb_pp_cow_data() which makes the underlying memory to
> >>> > originate from system page_pool. For CONFIG_DEBUG_VM=y and XDP program
> >>> > that uses bpf_xdp_adjust_tail(), following splat was observed:
> >>> >
> >>> > [   32.204881] BUG: Bad page state in process test_progs  pfn:11c98b
> >>> > [   32.207167] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11c98b
> >>> > [   32.210084] flags: 0x1fffe0000000000(node=0|zone=1|lastcpupid=0x7fff)
> >>> > [   32.212493] raw: 01fffe0000000000 dead000000000040 ff11000123c9b000 0000000000000000
> >>> > [   32.218056] raw: 0000000000000000 0000000000000001 00000000ffffffff 0000000000000000
> >>> > [   32.220900] page dumped because: page_pool leak
> >>> > [   32.222636] Modules linked in: bpf_testmod(O) bpf_preload
> >>> > [   32.224632] CPU: 6 UID: 0 PID: 3612 Comm: test_progs Tainted: G O        6.17.0-rc5-gfec474d29325 #6969 PREEMPT
> >>> > [   32.224638] Tainted: [O]=OOT_MODULE
> >>> > [   32.224639] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> >>> > [   32.224641] Call Trace:
> >>> > [   32.224644]  <IRQ>
> >>> > [   32.224646]  dump_stack_lvl+0x4b/0x70
> >>> > [   32.224653]  bad_page.cold+0xbd/0xe0
> >>> > [   32.224657]  __free_frozen_pages+0x838/0x10b0
> >>> > [   32.224660]  ? skb_pp_cow_data+0x782/0xc30
> >>> > [   32.224665]  bpf_xdp_shrink_data+0x221/0x530
> >>> > [   32.224668]  ? skb_pp_cow_data+0x6d1/0xc30
> >>> > [   32.224671]  bpf_xdp_adjust_tail+0x598/0x810
> >>> > [   32.224673]  ? xsk_destruct_skb+0x321/0x800
> >>> > [   32.224678]  bpf_prog_004ac6bb21de57a7_xsk_xdp_adjust_tail+0x52/0xd6
> >>> > [   32.224681]  veth_xdp_rcv_skb+0x45d/0x15a0
> >>> > [   32.224684]  ? get_stack_info_noinstr+0x16/0xe0
> >>> > [   32.224688]  ? veth_set_channels+0x920/0x920
> >>> > [   32.224691]  ? get_stack_info+0x2f/0x80
> >>> > [   32.224693]  ? unwind_next_frame+0x3af/0x1df0
> >>> > [   32.224697]  veth_xdp_rcv.constprop.0+0x38a/0xbe0
> >>> > [   32.224700]  ? common_startup_64+0x13e/0x148
> >>> > [   32.224703]  ? veth_xdp_rcv_one+0xcd0/0xcd0
> >>> > [   32.224706]  ? stack_trace_save+0x84/0xa0
> >>> > [   32.224709]  ? stack_depot_save_flags+0x28/0x820
> >>> > [   32.224713]  ? __resched_curr.constprop.0+0x332/0x3b0
> >>> > [   32.224716]  ? timerqueue_add+0x217/0x320
> >>> > [   32.224719]  veth_poll+0x115/0x5e0
> >>> > [   32.224722]  ? veth_xdp_rcv.constprop.0+0xbe0/0xbe0
> >>> > [   32.224726]  ? update_load_avg+0x1cb/0x12d0
> >>> > [   32.224730]  ? update_cfs_group+0x121/0x2c0
> >>> > [   32.224733]  __napi_poll+0xa0/0x420
> >>> > [   32.224736]  net_rx_action+0x901/0xe90
> >>> > [   32.224740]  ? run_backlog_napi+0x50/0x50
> >>> > [   32.224743]  ? clockevents_program_event+0x1cc/0x280
> >>> > [   32.224746]  ? hrtimer_interrupt+0x31e/0x7c0
> >>> > [   32.224749]  handle_softirqs+0x151/0x430
> >>> > [   32.224752]  do_softirq+0x3f/0x60
> >>> > [   32.224755]  </IRQ>
> >>> >
> >>> > It's because xdp_rxq with mem model set to MEM_TYPE_PAGE_SHARED was used
> >>> > when initializing xdp_buff.
> >>> >
> >>> > Fix this by using new helper xdp_convert_skb_to_buff() that, besides
> >>> > init/prepare xdp_buff, will check if page used for linear part of
> >>> > xdp_buff comes from page_pool. We assume that linear data and frags will
> >>> > have same memory provider as currently XDP API does not provide us a way
> >>> > to distinguish it (the mem model is registered for *whole* Rx queue and
> >>> > here we speak about single buffer granularity).
> >>> >
> >>> > In order to meet expected skb layout by new helper, pull the mac header
> >>> > before conversion from skb to xdp_buff.
> >>> >
> >>> > However, that is not enough as before releasing xdp_buff out of veth via
> >>> > XDP_{TX,REDIRECT}, mem type on xdp_rxq associated with xdp_buff is
> >>> > restored to its original model. We need to respect previous setting at
> >>> > least until buff is converted to frame, as frame carries the mem_type.
> >>> > Add a page_pool variant of veth_xdp_get() so that we avoid refcount
> >>> > underflow when draining page frag.
> >>> >
> >>> > Fixes: 0ebab78cbcbf ("net: veth: add page_pool for page recycling")
> >>> > Reported-by: Alexei Starovoitov <ast@kernel.org>
> >>> > Closes: https://lore.kernel.org/bpf/CAADnVQ+bBofJDfieyOYzSmSujSfJwDTQhiz3aJw7hE+4E2_iPA@mail.gmail.com/
> >>> > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> >>> > ---
> >>> >  drivers/net/veth.c | 43 +++++++++++++++++++++++++++----------------
> >>> >  1 file changed, 27 insertions(+), 16 deletions(-)
> >>> >
> >>> > diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> >>> > index a3046142cb8e..eeeee7bba685 100644
> >>> > --- a/drivers/net/veth.c
> >>> > +++ b/drivers/net/veth.c
> >>> > @@ -733,7 +733,7 @@ static void veth_xdp_rcv_bulk_skb(struct veth_rq *rq, void **frames,
> >>> >  	}
> >>> >  }
> >>> >  
> >>> > -static void veth_xdp_get(struct xdp_buff *xdp)
> >>> > +static void veth_xdp_get_shared(struct xdp_buff *xdp)
> >>> >  {
> >>> >  	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
> >>> >  	int i;
> >>> > @@ -746,12 +746,33 @@ static void veth_xdp_get(struct xdp_buff *xdp)
> >>> >  		__skb_frag_ref(&sinfo->frags[i]);
> >>> >  }
> >>> >  
> >>> > +static void veth_xdp_get_pp(struct xdp_buff *xdp)
> >>> > +{
> >>> > +	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
> >>> > +	int i;
> >>> > +
> >>> > +	page_pool_ref_page(virt_to_page(xdp->data));
> >>> > +	if (likely(!xdp_buff_has_frags(xdp)))
> >>> > +		return;
> >>> > +
> >>> > +	for (i = 0; i < sinfo->nr_frags; i++) {
> >>> > +		skb_frag_t *frag = &sinfo->frags[i];
> >>> > +
> >>> > +		page_pool_ref_page(netmem_to_page(frag->netmem));
> >>> > +	}
> >>> > +}
> >>> > +
> >>> > +static void veth_xdp_get(struct xdp_buff *xdp)
> >>> > +{
> >>> > +	xdp->rxq->mem.type == MEM_TYPE_PAGE_POOL ?
> >>> > +		veth_xdp_get_pp(xdp) : veth_xdp_get_shared(xdp);
> >>> > +}
> >>> > +
> >>> >  static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> >>> >  					struct xdp_buff *xdp,
> >>> >  					struct sk_buff **pskb)
> >>> >  {
> >>> >  	struct sk_buff *skb = *pskb;
> >>> > -	u32 frame_sz;
> >>> >  
> >>> >  	if (skb_shared(skb) || skb_head_is_locked(skb) ||
> >>> >  	    skb_shinfo(skb)->nr_frags ||
> >>> > @@ -762,19 +783,9 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> >>> >  		skb = *pskb;
> >>> >  	}
> >>> >  
> >>> > -	/* SKB "head" area always have tailroom for skb_shared_info */
> >>> > -	frame_sz = skb_end_pointer(skb) - skb->head;
> >>> > -	frame_sz += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> >>> > -	xdp_init_buff(xdp, frame_sz, &rq->xdp_rxq);
> >>> > -	xdp_prepare_buff(xdp, skb->head, skb_headroom(skb),
> >>> > -			 skb_headlen(skb), true);
> >>> > +	__skb_pull(*pskb, skb->data - skb_mac_header(skb));
> >>> 
> >>> veth_xdp_rcv_skb() does:
> >>> 
> >>> 	__skb_push(skb, skb->data - skb_mac_header(skb));
> >>> 	if (veth_convert_skb_to_xdp_buff(rq, xdp, &skb))
> >>> 
> >>> so how about just getting rid of that push instead of doing the opposite
> >>> pull straight after? :)
> >>
> >> Hi Toke,
> >>
> >> I believe this is done so we get a proper headroom representation which is
> >> needed for XDP_PACKET_HEADROOM comparison. Maybe we could be smarter here
> >> and for example subtract mac header length? However I wanted to preserve
> >> old behavior.
> >
> > Yeah, basically what we want is to check if the mac_header offset is
> > larger than the headroom. So the check could just be:
> >
> >     skb->mac_header < XDP_PACKET_HEADROOM
> >
> > however, it may be better to use the helper? Since that makes sure we
> > keep hitting the DEBUG_NET_WARN_ON_ONCE inside the helper... So:
> >
> >     skb_mac_header(skb) - skb->head < XDP_PACKET_HEADROOM
> >
> > or, equivalently:
> >
> >     skb_headroom(skb) + skb_mac_offset(skb) < XDP_PACKET_HEADROOM
> >
> > I think the first one is probably more readable, since skb_mac_offset()
> > is negative here, so the calculation looks off...
> 
> Wait, veth_xdp_rcv_skb() calls skb_reset_mac_header() further down, so
> it expects skb->data to point to the mac header. So getting rid of the

Oof. Correct.

> __skb_push() is not a good idea; but neither is doing the __skb_pull() as
> your patch does currently.
> 
> How about just making xdp_convert_skb_to_buff() agnostic to where
> skb->data is?
> 
> 	headroom = skb_mac_header(skb) - skb->head;
>         data_len = skb->data + skb->len - skb_mac_header(skb);

Could we just use skb->tail - skb_mac_header(skb) here?

Anyway, I'm going to try out your suggestion over the weekend or on Monday
and will send a v3. Maybe input from someone else in different time zones
will land by tomorrow. Thanks again :)

> 	xdp_prepare_buff(xdp, skb->head, headroom, data_len, true);
> 
> should work in both cases, no?
> 
> -Toke
> 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 bpf 2/2] veth: update mem type in xdp_buff
  2025-10-17 20:08           ` Maciej Fijalkowski
@ 2025-10-20 10:25             ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 16+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-10-20 10:25 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, hawk, ilias.apalodimas, lorenzo, kuba, netdev,
	magnus.karlsson, andrii, stfomichev, aleksander.lobakin

Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:

> On Fri, Oct 17, 2025 at 08:12:57PM +0200, Toke Høiland-Jørgensen wrote:
>> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>> 
>> > Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:
>> >
>> >> On Fri, Oct 17, 2025 at 06:33:54PM +0200, Toke Høiland-Jørgensen wrote:
>> >>> Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:
>> >>> 
>> >>> > Veth calls skb_pp_cow_data() which makes the underlying memory to
>> >>> > originate from system page_pool. For CONFIG_DEBUG_VM=y and XDP program
>> >>> > that uses bpf_xdp_adjust_tail(), following splat was observed:
>> >>> >
>> >>> > [   32.204881] BUG: Bad page state in process test_progs  pfn:11c98b
>> >>> > [   32.207167] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11c98b
>> >>> > [   32.210084] flags: 0x1fffe0000000000(node=0|zone=1|lastcpupid=0x7fff)
>> >>> > [   32.212493] raw: 01fffe0000000000 dead000000000040 ff11000123c9b000 0000000000000000
>> >>> > [   32.218056] raw: 0000000000000000 0000000000000001 00000000ffffffff 0000000000000000
>> >>> > [   32.220900] page dumped because: page_pool leak
>> >>> > [   32.222636] Modules linked in: bpf_testmod(O) bpf_preload
>> >>> > [   32.224632] CPU: 6 UID: 0 PID: 3612 Comm: test_progs Tainted: G O        6.17.0-rc5-gfec474d29325 #6969 PREEMPT
>> >>> > [   32.224638] Tainted: [O]=OOT_MODULE
>> >>> > [   32.224639] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>> >>> > [   32.224641] Call Trace:
>> >>> > [   32.224644]  <IRQ>
>> >>> > [   32.224646]  dump_stack_lvl+0x4b/0x70
>> >>> > [   32.224653]  bad_page.cold+0xbd/0xe0
>> >>> > [   32.224657]  __free_frozen_pages+0x838/0x10b0
>> >>> > [   32.224660]  ? skb_pp_cow_data+0x782/0xc30
>> >>> > [   32.224665]  bpf_xdp_shrink_data+0x221/0x530
>> >>> > [   32.224668]  ? skb_pp_cow_data+0x6d1/0xc30
>> >>> > [   32.224671]  bpf_xdp_adjust_tail+0x598/0x810
>> >>> > [   32.224673]  ? xsk_destruct_skb+0x321/0x800
>> >>> > [   32.224678]  bpf_prog_004ac6bb21de57a7_xsk_xdp_adjust_tail+0x52/0xd6
>> >>> > [   32.224681]  veth_xdp_rcv_skb+0x45d/0x15a0
>> >>> > [   32.224684]  ? get_stack_info_noinstr+0x16/0xe0
>> >>> > [   32.224688]  ? veth_set_channels+0x920/0x920
>> >>> > [   32.224691]  ? get_stack_info+0x2f/0x80
>> >>> > [   32.224693]  ? unwind_next_frame+0x3af/0x1df0
>> >>> > [   32.224697]  veth_xdp_rcv.constprop.0+0x38a/0xbe0
>> >>> > [   32.224700]  ? common_startup_64+0x13e/0x148
>> >>> > [   32.224703]  ? veth_xdp_rcv_one+0xcd0/0xcd0
>> >>> > [   32.224706]  ? stack_trace_save+0x84/0xa0
>> >>> > [   32.224709]  ? stack_depot_save_flags+0x28/0x820
>> >>> > [   32.224713]  ? __resched_curr.constprop.0+0x332/0x3b0
>> >>> > [   32.224716]  ? timerqueue_add+0x217/0x320
>> >>> > [   32.224719]  veth_poll+0x115/0x5e0
>> >>> > [   32.224722]  ? veth_xdp_rcv.constprop.0+0xbe0/0xbe0
>> >>> > [   32.224726]  ? update_load_avg+0x1cb/0x12d0
>> >>> > [   32.224730]  ? update_cfs_group+0x121/0x2c0
>> >>> > [   32.224733]  __napi_poll+0xa0/0x420
>> >>> > [   32.224736]  net_rx_action+0x901/0xe90
>> >>> > [   32.224740]  ? run_backlog_napi+0x50/0x50
>> >>> > [   32.224743]  ? clockevents_program_event+0x1cc/0x280
>> >>> > [   32.224746]  ? hrtimer_interrupt+0x31e/0x7c0
>> >>> > [   32.224749]  handle_softirqs+0x151/0x430
>> >>> > [   32.224752]  do_softirq+0x3f/0x60
>> >>> > [   32.224755]  </IRQ>
>> >>> >
>> >>> > It's because xdp_rxq with mem model set to MEM_TYPE_PAGE_SHARED was used
>> >>> > when initializing xdp_buff.
>> >>> >
>> >>> > Fix this by using new helper xdp_convert_skb_to_buff() that, besides
>> >>> > init/prepare xdp_buff, will check if page used for linear part of
>> >>> > xdp_buff comes from page_pool. We assume that linear data and frags will
>> >>> > have same memory provider as currently XDP API does not provide us a way
>> >>> > to distinguish it (the mem model is registered for *whole* Rx queue and
>> >>> > here we speak about single buffer granularity).
>> >>> >
>> >>> > In order to meet expected skb layout by new helper, pull the mac header
>> >>> > before conversion from skb to xdp_buff.
>> >>> >
>> >>> > However, that is not enough: before releasing the xdp_buff out of veth
>> >>> > via XDP_{TX,REDIRECT}, the mem type on the xdp_rxq associated with the
>> >>> > xdp_buff is restored to its original model. We need to respect the
>> >>> > previous setting at least until the buff is converted to a frame, as the
>> >>> > frame carries the mem_type. Add a page_pool variant of veth_xdp_get() so
>> >>> > that we avoid a refcount underflow when draining a page frag.
>> >>> >
>> >>> > Fixes: 0ebab78cbcbf ("net: veth: add page_pool for page recycling")
>> >>> > Reported-by: Alexei Starovoitov <ast@kernel.org>
>> >>> > Closes: https://lore.kernel.org/bpf/CAADnVQ+bBofJDfieyOYzSmSujSfJwDTQhiz3aJw7hE+4E2_iPA@mail.gmail.com/
>> >>> > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
>> >>> > ---
>> >>> >  drivers/net/veth.c | 43 +++++++++++++++++++++++++++----------------
>> >>> >  1 file changed, 27 insertions(+), 16 deletions(-)
>> >>> >
>> >>> > diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> >>> > index a3046142cb8e..eeeee7bba685 100644
>> >>> > --- a/drivers/net/veth.c
>> >>> > +++ b/drivers/net/veth.c
>> >>> > @@ -733,7 +733,7 @@ static void veth_xdp_rcv_bulk_skb(struct veth_rq *rq, void **frames,
>> >>> >  	}
>> >>> >  }
>> >>> >  
>> >>> > -static void veth_xdp_get(struct xdp_buff *xdp)
>> >>> > +static void veth_xdp_get_shared(struct xdp_buff *xdp)
>> >>> >  {
>> >>> >  	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
>> >>> >  	int i;
>> >>> > @@ -746,12 +746,33 @@ static void veth_xdp_get(struct xdp_buff *xdp)
>> >>> >  		__skb_frag_ref(&sinfo->frags[i]);
>> >>> >  }
>> >>> >  
>> >>> > +static void veth_xdp_get_pp(struct xdp_buff *xdp)
>> >>> > +{
>> >>> > +	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
>> >>> > +	int i;
>> >>> > +
>> >>> > +	page_pool_ref_page(virt_to_page(xdp->data));
>> >>> > +	if (likely(!xdp_buff_has_frags(xdp)))
>> >>> > +		return;
>> >>> > +
>> >>> > +	for (i = 0; i < sinfo->nr_frags; i++) {
>> >>> > +		skb_frag_t *frag = &sinfo->frags[i];
>> >>> > +
>> >>> > +		page_pool_ref_page(netmem_to_page(frag->netmem));
>> >>> > +	}
>> >>> > +}
>> >>> > +
>> >>> > +static void veth_xdp_get(struct xdp_buff *xdp)
>> >>> > +{
>> >>> > +	xdp->rxq->mem.type == MEM_TYPE_PAGE_POOL ?
>> >>> > +		veth_xdp_get_pp(xdp) : veth_xdp_get_shared(xdp);
>> >>> > +}
>> >>> > +
>> >>> >  static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>> >>> >  					struct xdp_buff *xdp,
>> >>> >  					struct sk_buff **pskb)
>> >>> >  {
>> >>> >  	struct sk_buff *skb = *pskb;
>> >>> > -	u32 frame_sz;
>> >>> >  
>> >>> >  	if (skb_shared(skb) || skb_head_is_locked(skb) ||
>> >>> >  	    skb_shinfo(skb)->nr_frags ||
>> >>> > @@ -762,19 +783,9 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>> >>> >  		skb = *pskb;
>> >>> >  	}
>> >>> >  
>> >>> > -	/* SKB "head" area always have tailroom for skb_shared_info */
>> >>> > -	frame_sz = skb_end_pointer(skb) - skb->head;
>> >>> > -	frame_sz += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> >>> > -	xdp_init_buff(xdp, frame_sz, &rq->xdp_rxq);
>> >>> > -	xdp_prepare_buff(xdp, skb->head, skb_headroom(skb),
>> >>> > -			 skb_headlen(skb), true);
>> >>> > +	__skb_pull(*pskb, skb->data - skb_mac_header(skb));
>> >>> 
>> >>> veth_xdp_rcv_skb() does:
>> >>> 
>> >>> 	__skb_push(skb, skb->data - skb_mac_header(skb));
>> >>> 	if (veth_convert_skb_to_xdp_buff(rq, xdp, &skb))
>> >>> 
>> >>> so how about just getting rid of that push instead of doing the opposite
>> >>> pull straight after? :)
>> >>
>> >> Hi Toke,
>> >>
>> >> I believe this is done so that we get a proper headroom representation,
>> >> which is needed for the XDP_PACKET_HEADROOM comparison. Maybe we could
>> >> be smarter here and, for example, subtract the mac header length?
>> >> However, I wanted to preserve the old behavior.
>> >
>> > Yeah, basically what we want is to check if the mac_header offset is
>> > larger than the headroom. So the check could just be:
>> >
>> >     skb->mac_header < XDP_PACKET_HEADROOM
>> >
>> > however, it may be better to use the helper? Since that makes sure we
>> > keep hitting the DEBUG_NET_WARN_ON_ONCE inside the helper... So:
>> >
>> >     skb_mac_header(skb) - skb->head < XDP_PACKET_HEADROOM
>> >
>> > or, equivalently:
>> >
>> >     skb_headroom(skb) - skb_mac_offset(skb) < XDP_PACKET_HEADROOM
>> >
>> > I think the first one is probably more readable, since skb_mac_offset()
>> > is negative here, so the calculation looks off...
>> 
>> Wait, veth_xdp_rcv_skb() calls skb_reset_mac_header() further down, so
>> it expects skb->data to point to the mac header. So getting rid of the
>
> Oof. Correct.
>
>> __skb_push() is not a good idea; but neither is doing the __skb_pull() as
>> your patch does currently.
>> 
>> How about just making xdp_convert_skb_to_buff() agnostic to where
>> skb->data is?
>> 
>> 	headroom = skb_mac_header(skb) - skb->head;
>>         data_len = skb->data + skb->len - skb_mac_header(skb);
>
> could we just use skb->tail - skb_mac_header(skb) here?

Yeah, guess so. skb->tail is an offset (not a pointer), though, so that would be
skb_tail_pointer(skb) - skb_mac_header(skb)

> anyways, i'm gonna try out your suggestion on weekend or on monday and
> will send a v3. maybe input from someone else in different time zones will
> land in by tomorrow. thanks again:)

Cool!

-Toke


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 bpf 1/2] xdp: update xdp_rxq_info's mem type in XDP generic hook
  2025-10-17 14:31 ` [PATCH v2 bpf 1/2] xdp: update xdp_rxq_info's mem type in XDP generic hook Maciej Fijalkowski
  2025-10-17 16:45   ` Toke Høiland-Jørgensen
@ 2025-10-20 11:20   ` Jesper Dangaard Brouer
  2025-10-20 15:36     ` Alexander Lobakin
  1 sibling, 1 reply; 16+ messages in thread
From: Jesper Dangaard Brouer @ 2025-10-20 11:20 UTC (permalink / raw)
  To: Maciej Fijalkowski, bpf, ast, daniel, ilias.apalodimas, toke,
	lorenzo, kuba
  Cc: netdev, magnus.karlsson, andrii, stfomichev, aleksander.lobakin,
	syzbot+ff145014d6b0ce64a173, Ihor Solodrai, Octavian Purdila



On 17/10/2025 16.31, Maciej Fijalkowski wrote:
> Currently, generic XDP hook uses xdp_rxq_info from netstack Rx queues
> which do not have its XDP memory model registered. There is a case when
> XDP program calls bpf_xdp_adjust_tail() BPF helper, which in turn
> releases underlying memory. This happens when it consumes enough amount
> of bytes and when XDP buffer has fragments. For this action the memory
> model knowledge passed to XDP program is crucial so that core can call
> suitable function for freeing/recycling the page.
> 
> For netstack queues it defaults to MEM_TYPE_PAGE_SHARED (0) due to lack
> of mem model registration. The problem we're fixing here is when kernel
> copied the skb to new buffer backed by system's page_pool and XDP buffer
> is built around it. Then when bpf_xdp_adjust_tail() calls
> __xdp_return(), it acts incorrectly due to mem type not being set to
> MEM_TYPE_PAGE_POOL and causes a page leak.
> 

Does the code not set the skb->pp_recycle ?

> Pull out the existing code from bpf_prog_run_generic_xdp() that
> init/prepares xdp_buff onto new helper xdp_convert_skb_to_buff() and
> embed there rxq's mem_type initialization that is assigned to xdp_buff.
> 
> This problem was triggered by syzbot as well as AF_XDP test suite which
> is about to be integrated to BPF CI.
> 
> Reported-by: syzbot+ff145014d6b0ce64a173@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/netdev/6756c37b.050a0220.a30f1.019a.GAE@google.com/
> Fixes: e6d5dbdd20aa ("xdp: add multi-buff support for xdp running in generic mode")
> Tested-by: Ihor Solodrai <ihor.solodrai@linux.dev>
> Co-developed-by: Octavian Purdila <tavip@google.com>
> Signed-off-by: Octavian Purdila <tavip@google.com> # whole analysis, testing, initiating a fix
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # commit msg and proposed more robust fix
> ---
>   include/net/xdp.h | 31 +++++++++++++++++++++++++++++++
>   net/core/dev.c    | 25 ++++---------------------
>   2 files changed, 35 insertions(+), 21 deletions(-)
> 
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index aa742f413c35..51f3321e4f94 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -384,6 +384,37 @@ struct sk_buff *xdp_build_skb_from_frame(struct xdp_frame *xdpf,
>   					 struct net_device *dev);
>   struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf);
>   
> +static inline
> +void xdp_convert_skb_to_buff(struct sk_buff *skb, struct xdp_buff *xdp,
> +			     struct xdp_rxq_info *xdp_rxq)
> +{

I really like that this is getting put into a helper function :-)

> +	u32 frame_sz, mac_len;
> +	void *hard_start;
> +
> +	/* The XDP program wants to see the packet starting at the MAC
> +	 * header.
> +	 */
> +	mac_len = skb->data - skb_mac_header(skb);
> +	hard_start = skb->data - skb_headroom(skb);
> +
> +	/* SKB "head" area always have tailroom for skb_shared_info */
> +	frame_sz = skb_end_pointer(skb) - skb->head;
> +	frame_sz += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +	xdp_init_buff(xdp, frame_sz, xdp_rxq);
> +	xdp_prepare_buff(xdp, hard_start, skb_headroom(skb) - mac_len,
> +			 skb_headlen(skb) + mac_len, true);
> +
> +	if (skb_is_nonlinear(skb)) {
> +		skb_shinfo(skb)->xdp_frags_size = skb->data_len;
> +		xdp_buff_set_frags_flag(xdp);
> +	} else {
> +		xdp_buff_clear_frags_flag(xdp);
> +	}
> +

The SKB should be marked via skb->pp_recycle, but I guess you are trying
to catch code that doesn't set this correctly?
(Slightly worried this will "paper-over" some other buggy code?)

> +	xdp->rxq->mem.type = page_pool_page_is_pp(virt_to_page(xdp->data)) ?
> +				MEM_TYPE_PAGE_POOL : MEM_TYPE_PAGE_SHARED;
> +}

In the beginning PP_MAGIC / PP_SIGNATURE was primarily used as a
debugging feature to catch faulty code that released pp pages to the
real page allocator.  It seems to have evolved into something more
critical.  Someone also tried to elevate this into a page flag, which
would make this more reliable.

> +
>   static inline
>   void xdp_convert_frame_to_buff(const struct xdp_frame *frame,
>   			       struct xdp_buff *xdp)
> diff --git a/net/core/dev.c b/net/core/dev.c
> index a64cef2c537e..3d607dce292b 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5320,35 +5320,18 @@ static struct netdev_rx_queue *netif_get_rxqueue(struct sk_buff *skb)
>   u32 bpf_prog_run_generic_xdp(struct sk_buff *skb, struct xdp_buff *xdp,
>   			     const struct bpf_prog *xdp_prog)
>   {
> -	void *orig_data, *orig_data_end, *hard_start;
> +	void *orig_data, *orig_data_end;
>   	struct netdev_rx_queue *rxqueue;
>   	bool orig_bcast, orig_host;
> -	u32 mac_len, frame_sz;
> +	u32 metalen, act, mac_len;
>   	__be16 orig_eth_type;
>   	struct ethhdr *eth;
> -	u32 metalen, act;
>   	int off;
>   
> -	/* The XDP program wants to see the packet starting at the MAC
> -	 * header.
> -	 */
> +	rxqueue = netif_get_rxqueue(skb);
>   	mac_len = skb->data - skb_mac_header(skb);
> -	hard_start = skb->data - skb_headroom(skb);
> -
> -	/* SKB "head" area always have tailroom for skb_shared_info */
> -	frame_sz = (void *)skb_end_pointer(skb) - hard_start;
> -	frame_sz += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>   
> -	rxqueue = netif_get_rxqueue(skb);
> -	xdp_init_buff(xdp, frame_sz, &rxqueue->xdp_rxq);
> -	xdp_prepare_buff(xdp, hard_start, skb_headroom(skb) - mac_len,
> -			 skb_headlen(skb) + mac_len, true);
> -	if (skb_is_nonlinear(skb)) {
> -		skb_shinfo(skb)->xdp_frags_size = skb->data_len;
> -		xdp_buff_set_frags_flag(xdp);
> -	} else {
> -		xdp_buff_clear_frags_flag(xdp);
> -	}
> +	xdp_convert_skb_to_buff(skb, xdp, &rxqueue->xdp_rxq);
>   
>   	orig_data_end = xdp->data_end;
>   	orig_data = xdp->data;



* Re: [PATCH v2 bpf 1/2] xdp: update xdp_rxq_info's mem type in XDP generic hook
  2025-10-20 11:20   ` Jesper Dangaard Brouer
@ 2025-10-20 15:36     ` Alexander Lobakin
  2025-10-20 17:53       ` Maciej Fijalkowski
  0 siblings, 1 reply; 16+ messages in thread
From: Alexander Lobakin @ 2025-10-20 15:36 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Maciej Fijalkowski
  Cc: bpf, ast, daniel, ilias.apalodimas, toke, lorenzo, kuba, netdev,
	magnus.karlsson, andrii, stfomichev, syzbot+ff145014d6b0ce64a173,
	Ihor Solodrai, Octavian Purdila

From: Jesper Dangaard Brouer <hawk@kernel.org>
Date: Mon, 20 Oct 2025 13:20:57 +0200

> 
> 
> On 17/10/2025 16.31, Maciej Fijalkowski wrote:
>> Currently, generic XDP hook uses xdp_rxq_info from netstack Rx queues
>> which do not have its XDP memory model registered. There is a case when
>> XDP program calls bpf_xdp_adjust_tail() BPF helper, which in turn
>> releases underlying memory. This happens when it consumes enough amount
>> of bytes and when XDP buffer has fragments. For this action the memory
>> model knowledge passed to XDP program is crucial so that core can call
>> suitable function for freeing/recycling the page.
>>
>> For netstack queues it defaults to MEM_TYPE_PAGE_SHARED (0) due to lack
>> of mem model registration. The problem we're fixing here is when kernel
>> copied the skb to new buffer backed by system's page_pool and XDP buffer
>> is built around it. Then when bpf_xdp_adjust_tail() calls
>> __xdp_return(), it acts incorrectly due to mem type not being set to
>> MEM_TYPE_PAGE_POOL and causes a page leak.
>>
> 
> Does the code not set the skb->pp_recycle ?

You mean this CoW code which replaces the buffers in the skb with system
PP-backed ones?
Maybe that's the problem (I don't remember the details of the function)?

> 
>> Pull out the existing code from bpf_prog_run_generic_xdp() that
>> init/prepares xdp_buff onto new helper xdp_convert_skb_to_buff() and
>> embed there rxq's mem_type initialization that is assigned to xdp_buff.

[...]

>> +    if (skb_is_nonlinear(skb)) {
>> +        skb_shinfo(skb)->xdp_frags_size = skb->data_len;
>> +        xdp_buff_set_frags_flag(xdp);
>> +    } else {
>> +        xdp_buff_clear_frags_flag(xdp);
>> +    }
>> +
> 
> The SKB should be marked via skb->pp_recycle, but I guess you are trying
> to catch code that doesn't set this correctly?
> (Slightly worried this will "paper-over" some other buggy code?)
> 
>> +    xdp->rxq->mem.type = page_pool_page_is_pp(virt_to_page(xdp->data)) ?

BTW this may return incorrect results if the page is not order-0.
IIRC system PPs always return order-0 pages, what about veth code etc?

>> +                MEM_TYPE_PAGE_POOL : MEM_TYPE_PAGE_SHARED;
>> +}
> 
> In the beginning PP_MAGIC / PP_SIGNATURE was primarily used as a
> debugging feature to catch faulty code that released pp pages to the
> real page allocator.  It seems to have evolved into something more
> critical.  Someone also tried to elevate this into a page flag, which
> would make this more reliable.
Thanks,
Olek


* Re: [PATCH v2 bpf 1/2] xdp: update xdp_rxq_info's mem type in XDP generic hook
  2025-10-20 15:36     ` Alexander Lobakin
@ 2025-10-20 17:53       ` Maciej Fijalkowski
  2025-10-22  1:01         ` Jakub Kicinski
  0 siblings, 1 reply; 16+ messages in thread
From: Maciej Fijalkowski @ 2025-10-20 17:53 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Jesper Dangaard Brouer, bpf, ast, daniel, ilias.apalodimas, toke,
	lorenzo, kuba, netdev, magnus.karlsson, andrii, stfomichev,
	syzbot+ff145014d6b0ce64a173, Ihor Solodrai, Octavian Purdila

On Mon, Oct 20, 2025 at 05:36:06PM +0200, Alexander Lobakin wrote:
> From: Jesper Dangaard Brouer <hawk@kernel.org>
> Date: Mon, 20 Oct 2025 13:20:57 +0200
> 
> > 
> > 
> > On 17/10/2025 16.31, Maciej Fijalkowski wrote:
> >> Currently, generic XDP hook uses xdp_rxq_info from netstack Rx queues
> >> which do not have its XDP memory model registered. There is a case when
> >> XDP program calls bpf_xdp_adjust_tail() BPF helper, which in turn
> >> releases underlying memory. This happens when it consumes enough amount
> >> of bytes and when XDP buffer has fragments. For this action the memory
> >> model knowledge passed to XDP program is crucial so that core can call
> >> suitable function for freeing/recycling the page.
> >>
> >> For netstack queues it defaults to MEM_TYPE_PAGE_SHARED (0) due to lack
> >> of mem model registration. The problem we're fixing here is when kernel
> >> copied the skb to new buffer backed by system's page_pool and XDP buffer
> >> is built around it. Then when bpf_xdp_adjust_tail() calls
> >> __xdp_return(), it acts incorrectly due to mem type not being set to
> >> MEM_TYPE_PAGE_POOL and causes a page leak.
> >>
> > 
> > Does the code not set the skb->pp_recycle ?

Hi Jesper,

yes it does and I based on this my initial solution, however Jakub had
concerns:

https://lore.kernel.org/bpf/20251001082737.23f5037f@kernel.org/

> 
> You mean this CoW code which replaces the buffers in the skb with system
> PP-backed ones?

skb_pp_cow_data() takes an arbitrary page_pool as input; it does not imply
we're gonna be dealing with the system pp only. veth provides its own and
generic XDP uses the system pp (the two existing skb_pp_cow_data() callsites).

> Maybe that's the problem (I don't remember the details of the function)?
> 
> > 
> >> Pull out the existing code from bpf_prog_run_generic_xdp() that
> >> init/prepares xdp_buff onto new helper xdp_convert_skb_to_buff() and
> >> embed there rxq's mem_type initialization that is assigned to xdp_buff.
> 
> [...]
> 
> >> +    if (skb_is_nonlinear(skb)) {
> >> +        skb_shinfo(skb)->xdp_frags_size = skb->data_len;
> >> +        xdp_buff_set_frags_flag(xdp);
> >> +    } else {
> >> +        xdp_buff_clear_frags_flag(xdp);
> >> +    }
> >> +
> > 
> > The SKB should be marked via skb->pp_recycle, but I guess you are trying
> > to catch code that doesn't set this correctly?
> > (Slightly worried this will "paper-over" some other buggy code?)
> > 
> >> +    xdp->rxq->mem.type = page_pool_page_is_pp(virt_to_page(xdp->data)) ?
> 
> BTW this may return incorrect results if the page is not order-0.
> IIRC system PPs always return order-0 pages, what about veth code etc?

veth's pp works with order-0 pages as well, however I agree it would be
better to use virt_to_head_page() here.

> 
> >> +                MEM_TYPE_PAGE_POOL : MEM_TYPE_PAGE_SHARED;
> >> +}
> > 
> > In the beginning PP_MAGIC / PP_SIGNATURE was primarily used as a
> > debugging feature to catch faulty code that released pp pages to the
> > real page allocator.  It seems to have evolved into something more
> > critical.  Someone also tried to elevate this into a page flag, which
> > would make this more reliable.

Exactly: here we have the very same issue, and we need to correctly return
the pages back to their originating page_pool.

> Thanks,
> Olek


* Re: [PATCH v2 bpf 1/2] xdp: update xdp_rxq_info's mem type in XDP generic hook
  2025-10-20 17:53       ` Maciej Fijalkowski
@ 2025-10-22  1:01         ` Jakub Kicinski
  2025-10-22 11:22           ` Maciej Fijalkowski
  0 siblings, 1 reply; 16+ messages in thread
From: Jakub Kicinski @ 2025-10-22  1:01 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Alexander Lobakin, Jesper Dangaard Brouer, bpf, ast, daniel,
	ilias.apalodimas, toke, lorenzo, netdev, magnus.karlsson, andrii,
	stfomichev, syzbot+ff145014d6b0ce64a173, Ihor Solodrai,
	Octavian Purdila

On Mon, 20 Oct 2025 19:53:26 +0200 Maciej Fijalkowski wrote:
> > > The SKB should be marked via skb->pp_recycle, but I guess you are trying
> > > to catch code that doesn't set this correctly?
> > > (Slightly worried this will "paper-over" some other buggy code?)
> > >   
> > >> +    xdp->rxq->mem.type = page_pool_page_is_pp(virt_to_page(xdp->data)) ?  
> > 
> > BTW this may return incorrect results if the page is not order-0.
> > IIRC system PPs always return order-0 pages, what about veth code etc?  
> 
> veth's pp works on order-0 pages well, however I agree it would be better
> to use virt_to_head_page() here.

In this case the mem.type update is for consuming frags only, right?
We can't free the head itself since the skb is attached to it.
So running the predicates on xdp->data is probably wrong.

Is it possible to get to bpf_prog_run_generic_xdp() (with frags)
and without going thru netif_skb_check_for_xdp() ? If no then
frags must have come from skb_pp_cow(). 
And the type is always MEM_TYPE_PAGE_POOL ?


* Re: [PATCH v2 bpf 1/2] xdp: update xdp_rxq_info's mem type in XDP generic hook
  2025-10-22  1:01         ` Jakub Kicinski
@ 2025-10-22 11:22           ` Maciej Fijalkowski
  2025-10-22 14:04             ` Jakub Kicinski
  0 siblings, 1 reply; 16+ messages in thread
From: Maciej Fijalkowski @ 2025-10-22 11:22 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexander Lobakin, Jesper Dangaard Brouer, bpf, ast, daniel,
	ilias.apalodimas, toke, lorenzo, netdev, magnus.karlsson, andrii,
	stfomichev, syzbot+ff145014d6b0ce64a173, Ihor Solodrai,
	Octavian Purdila

On Tue, Oct 21, 2025 at 06:01:36PM -0700, Jakub Kicinski wrote:
> On Mon, 20 Oct 2025 19:53:26 +0200 Maciej Fijalkowski wrote:
> > > > The SKB should be marked via skb->pp_recycle, but I guess you are trying
> > > > to catch code that doesn't set this correctly?
> > > > (Slightly worried this will "paper-over" some other buggy code?)
> > > >   
> > > >> +    xdp->rxq->mem.type = page_pool_page_is_pp(virt_to_page(xdp->data)) ?  
> > > 
> > > BTW this may return incorrect results if the page is not order-0.
> > > IIRC system PPs always return order-0 pages, what about veth code etc?  
> > 
> > veth's pp works on order-0 pages well, however I agree it would be better
> > to use virt_to_head_page() here.
> 
> In this case the mem.type update is for consuming frags only, right?
> We can't free the head itself since the skb is attached to it.
> So running the predicates on xdp->data is probably wrong.

See the veth patch, where we bump the refcount of the related pages in order
to keep the data alive as the skb is consumed.

> 
> Is it possible to get to bpf_prog_run_generic_xdp() (with frags)
> and without going thru netif_skb_check_for_xdp() ? If no then
> frags must have come from skb_pp_cow(). 
> And the type is always MEM_TYPE_PAGE_POOL ?

We have a fallback path, netif_skb_check_for_xdp(), that linearizes the skb
in case the COW code failed. But bear in mind that bpf_prog_run_generic_xdp()
has other callsites besides generic XDP itself (cpumap and devmap). That is
why I wouldn't like to base this helper on assumptions such as "frags
present -> mem_type is pp".


* Re: [PATCH v2 bpf 1/2] xdp: update xdp_rxq_info's mem type in XDP generic hook
  2025-10-22 11:22           ` Maciej Fijalkowski
@ 2025-10-22 14:04             ` Jakub Kicinski
  0 siblings, 0 replies; 16+ messages in thread
From: Jakub Kicinski @ 2025-10-22 14:04 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Alexander Lobakin, Jesper Dangaard Brouer, bpf, ast, daniel,
	ilias.apalodimas, toke, lorenzo, netdev, magnus.karlsson, andrii,
	stfomichev, syzbot+ff145014d6b0ce64a173, Ihor Solodrai,
	Octavian Purdila

On Wed, 22 Oct 2025 13:22:25 +0200 Maciej Fijalkowski wrote:
> > > veth's pp works on order-0 pages well, however I agree it would be better
> > > to use virt_to_head_page() here.  
> > 
> > In this case the mem.type update is for consuming frags only, right?
> > We can't free the head itself since the skb is attached to it.
> > So running the predicates on xdp->data is probably wrong.  
> 
> See the veth patch where we bump refcount of related pages in order to
> keep the data as skb is consumed.

That one is wrong too. In veth's case since we care about refs on
individual frags you _should_ actually use skb->pp_recycle for the
condition. And you need to test whether each individual fragment
is from the PP. Basically call skb_pp_frag_ref() like
skb_try_coalesce() does?

BTW it is _not_ legal to call get_page() on slab pages any more
so all of this code should probably fail when head_frag=0.
But let's leave that problem to someone else.

> > Is it possible to get to bpf_prog_run_generic_xdp() (with frags)
> > and without going thru netif_skb_check_for_xdp() ? If no then
> > frags must have come from skb_pp_cow(). 
> > And the type is always MEM_TYPE_PAGE_POOL ?  
> 
> We have a fallback path netif_skb_check_for_xdp() that linearizes skb, in
> case COW code failed. But bare in mind that bpf_prog_run_generic_xdp() has
> other callsites, besides generic XDP itself (cpumap and devmap). That is
> why I wouldn't like to base this helper on assumptions such as frags
> presence -> mem_type is pp.

I assumed cpumap and devmap must be hit from a generic XDP handler.
Ergo we also must have gone thru the COW.



end of thread [~2025-10-22 14:04 UTC | newest]

Thread overview: 16+ messages
2025-10-17 14:31 [PATCH v2 bpf 0/2] xdp: fix page_pool leaks Maciej Fijalkowski
2025-10-17 14:31 ` [PATCH v2 bpf 1/2] xdp: update xdp_rxq_info's mem type in XDP generic hook Maciej Fijalkowski
2025-10-17 16:45   ` Toke Høiland-Jørgensen
2025-10-20 11:20   ` Jesper Dangaard Brouer
2025-10-20 15:36     ` Alexander Lobakin
2025-10-20 17:53       ` Maciej Fijalkowski
2025-10-22  1:01         ` Jakub Kicinski
2025-10-22 11:22           ` Maciej Fijalkowski
2025-10-22 14:04             ` Jakub Kicinski
2025-10-17 14:31 ` [PATCH v2 bpf 2/2] veth: update mem type in xdp_buff Maciej Fijalkowski
2025-10-17 16:33   ` Toke Høiland-Jørgensen
2025-10-17 16:52     ` Maciej Fijalkowski
2025-10-17 17:51       ` Toke Høiland-Jørgensen
2025-10-17 18:12         ` Toke Høiland-Jørgensen
2025-10-17 20:08           ` Maciej Fijalkowski
2025-10-20 10:25             ` Toke Høiland-Jørgensen
