netdev.vger.kernel.org archive mirror
* [RFC 0/2] xdp: Delegate fast path return decision to page_pool
@ 2025-11-07 10:28 Dragos Tatulea
  2025-11-07 10:28 ` [RFC 1/2] page_pool: add benchmarking for napi-based recycling Dragos Tatulea
  2025-11-07 10:28 ` [RFC 2/2] xdp: Delegate fast path return decision to page_pool Dragos Tatulea
  0 siblings, 2 replies; 11+ messages in thread
From: Dragos Tatulea @ 2025-11-07 10:28 UTC (permalink / raw)
  To: Jakub Kicinski, Jesper Dangaard Brouer, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Shuah Khan, Andrew Lunn,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa
  Cc: Dragos Tatulea, Tariq Toukan, netdev, linux-kselftest,
	linux-kernel, bpf

This small series proposes the removal of the
BPF_RI_F_RF_NO_DIRECT XDP flag in favour of page_pool's
internal page_pool_napi_local() check, which can override
a non-direct recycle into a direct one if the right
conditions are met.
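
For reference, the decision made by page_pool_napi_local() is roughly
the following (simplified sketch of the net/core/page_pool.c helper,
not the exact upstream code):

  static bool page_pool_napi_local(const struct page_pool *pool)
  {
          u32 cpuid;

          /* Softirq can be preempted on PREEMPT_RT, so never override */
          if (IS_ENABLED(CONFIG_PREEMPT_RT))
                  return false;

          if (unlikely(!in_softirq()))
                  return false;

          /* Direct recycling is only safe when the current CPU is the
           * one that owns the pool or runs the pool's NAPI instance.
           */
          cpuid = smp_processor_id();
          if (READ_ONCE(pool->cpuid) == cpuid)
                  return true;

          return pool->p.napi &&
                 READ_ONCE(pool->p.napi->list_owner) == cpuid;
  }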

This was discussed on the mailing list on several occasions
[1][2].

The first patch adds additional benchmarking code to the page_pool
benchmark.

The second patch contains the actual change, along with a detailed
explanation and measurements. It remains to be debated whether the
whole BPF_RI_F_RF_NO_DIRECT mechanism should be deleted or only
its use in xdp_return_frame_rx_napi().

There is still the unresolved issue of drivers that don't support
page_pool NAPI recycling. This series could be extended to add
that support; otherwise those drivers would end up with slow-path
recycling for XDP.
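
For reference, the driver-side support essentially boils down to
binding the pool to the RX queue's NAPI instance, along these lines
(illustrative sketch only; 'rxq' and 'ring_size' are made-up driver
names):

  struct page_pool_params pp_params = {
          .order          = 0,
          .pool_size      = ring_size,
          .nid            = NUMA_NO_NODE,
          .dev            = rxq->dev,
          .dma_dir        = DMA_FROM_DEVICE,
          .napi           = &rxq->napi,   /* lets page_pool detect NAPI */
  };
  struct page_pool *pp = page_pool_create(&pp_params);

  /* Or, for a pool created without .napi set, opt in later: */
  page_pool_enable_direct_recycling(pp, &rxq->napi);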

[1] https://lore.kernel.org/all/8d165026-1477-46cb-94d4-a01e1da40833@kernel.org/
[2] https://lore.kernel.org/all/20250918084823.372000-1-dtatulea@nvidia.com/

Dragos Tatulea (2):
  page_pool: add benchmarking for napi-based recycling
  xdp: Delegate fast path return decision to page_pool

 drivers/net/veth.c                            |  2 -
 include/linux/filter.h                        | 22 -----
 include/net/xdp.h                             |  2 +-
 kernel/bpf/cpumap.c                           |  2 -
 net/bpf/test_run.c                            |  2 -
 net/core/filter.c                             |  2 +-
 net/core/xdp.c                                | 24 ++---
 .../bench/page_pool/bench_page_pool_simple.c  | 92 ++++++++++++++++++-
 8 files changed, 104 insertions(+), 44 deletions(-)

-- 
2.50.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC 1/2] page_pool: add benchmarking for napi-based recycling
  2025-11-07 10:28 [RFC 0/2] xdp: Delegate fast path return decision to page_pool Dragos Tatulea
@ 2025-11-07 10:28 ` Dragos Tatulea
  2025-11-07 11:04   ` bot+bpf-ci
  2025-11-07 10:28 ` [RFC 2/2] xdp: Delegate fast path return decision to page_pool Dragos Tatulea
  1 sibling, 1 reply; 11+ messages in thread
From: Dragos Tatulea @ 2025-11-07 10:28 UTC (permalink / raw)
  To: Jakub Kicinski, Jesper Dangaard Brouer, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Shuah Khan
  Cc: Dragos Tatulea, Tariq Toukan, netdev, linux-kselftest,
	linux-kernel

Bring back the tasklet-based code so that the benchmarks can
run in softirq context.

One additional test is added which benchmarks the
impact of page_pool_napi_local().

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
---
 .../bench/page_pool/bench_page_pool_simple.c  | 92 ++++++++++++++++++-
 1 file changed, 90 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/net/bench/page_pool/bench_page_pool_simple.c b/tools/testing/selftests/net/bench/page_pool/bench_page_pool_simple.c
index cb6468adbda4..84683c547814 100644
--- a/tools/testing/selftests/net/bench/page_pool/bench_page_pool_simple.c
+++ b/tools/testing/selftests/net/bench/page_pool/bench_page_pool_simple.c
@@ -9,6 +9,7 @@
 #include <linux/limits.h>
 #include <linux/module.h>
 #include <linux/mutex.h>
+#include <linux/netdevice.h>
 #include <net/page_pool/helpers.h>
 
 #include "time_bench.h"
@@ -16,6 +17,8 @@
 static int verbose = 1;
 #define MY_POOL_SIZE 1024
 
+DEFINE_MUTEX(wait_for_tasklet);
+
 /* Makes tests selectable. Useful for perf-record to analyze a single test.
  * Hint: Bash shells support writing binary number like: $((2#101010)
  *
@@ -31,6 +34,10 @@ enum benchmark_bit {
 	bit_run_bench_no_softirq01,
 	bit_run_bench_no_softirq02,
 	bit_run_bench_no_softirq03,
+	bit_run_bench_tasklet01,
+	bit_run_bench_tasklet02,
+	bit_run_bench_tasklet03,
+	bit_run_bench_tasklet04,
 };
 
 #define bit(b)		(1 << (b))
@@ -120,7 +127,12 @@ static void pp_fill_ptr_ring(struct page_pool *pp, int elems)
 	kfree(array);
 }
 
-enum test_type { type_fast_path, type_ptr_ring, type_page_allocator };
+enum test_type {
+	type_fast_path,
+	type_napi_aware,
+	type_ptr_ring,
+	type_page_allocator,
+};
 
 /* Depends on compile optimizing this function */
 static int time_bench_page_pool(struct time_bench_record *rec, void *data,
@@ -132,6 +144,7 @@ static int time_bench_page_pool(struct time_bench_record *rec, void *data,
 
 	struct page_pool *pp;
 	struct page *page;
+	struct napi_struct napi = {0};
 
 	struct page_pool_params pp_params = {
 		.order = 0,
@@ -141,6 +154,7 @@ static int time_bench_page_pool(struct time_bench_record *rec, void *data,
 		.dev = NULL, /* Only use for DMA mapping */
 		.dma_dir = DMA_BIDIRECTIONAL,
 	};
+	struct page_pool_stats stats = {0};
 
 	pp = page_pool_create(&pp_params);
 	if (IS_ERR(pp)) {
@@ -155,6 +169,11 @@ static int time_bench_page_pool(struct time_bench_record *rec, void *data,
 	else
 		pr_warn("%s(): Cannot use page_pool fast-path\n", func);
 
+	if (type == type_napi_aware) {
+		napi.list_owner = smp_processor_id();
+		page_pool_enable_direct_recycling(pp, &napi);
+	}
+
 	time_bench_start(rec);
 	/** Loop to measure **/
 	for (i = 0; i < rec->loops; i++) {
@@ -173,7 +192,13 @@ static int time_bench_page_pool(struct time_bench_record *rec, void *data,
 			page_pool_recycle_direct(pp, page);
 
 		} else if (type == type_ptr_ring) {
-			/* Normal return path */
+			/* Normal return path, either direct or via ptr_ring */
+			page_pool_put_page(pp, page, -1, false);
+
+		} else if (type == type_napi_aware) {
+			/* NAPI-aware recycling: uses fast-path recycling if
+			 * possible.
+			 */
 			page_pool_put_page(pp, page, -1, false);
 
 		} else if (type == type_page_allocator) {
@@ -188,6 +213,14 @@ static int time_bench_page_pool(struct time_bench_record *rec, void *data,
 		}
 	}
 	time_bench_stop(rec, loops_cnt);
+
+	if (type == type_napi_aware) {
+		page_pool_get_stats(pp, &stats);
+		if (stats.recycle_stats.cached < rec->loops)
+			pr_warn("%s(): NAPI-aware recycling wasn't used\n",
+				func);
+	}
+
 out:
 	page_pool_destroy(pp);
 	return loops_cnt;
@@ -211,6 +244,54 @@ static int time_bench_page_pool03_slow(struct time_bench_record *rec,
 	return time_bench_page_pool(rec, data, type_page_allocator, __func__);
 }
 
+static int time_bench_page_pool04_napi_aware(struct time_bench_record *rec,
+					     void *data)
+{
+	return time_bench_page_pool(rec, data, type_napi_aware, __func__);
+}
+
+/* Testing page_pool requires running under softirq.
+ *
+ * Running under a tasklet satisfies this, as tasklets are built on top of
+ * softirq.
+ */
+static void pp_tasklet_handler(struct tasklet_struct *t)
+{
+	uint32_t nr_loops = loops;
+
+	if (in_serving_softirq())
+		pr_warn("%s(): in_serving_softirq fast-path\n",
+			__func__); // True
+	else
+		pr_warn("%s(): Cannot use page_pool fast-path\n", __func__);
+
+	if (enabled(bit_run_bench_tasklet01))
+		time_bench_loop(nr_loops, 0, "tasklet_page_pool01_fast_path",
+				NULL, time_bench_page_pool01_fast_path);
+
+	if (enabled(bit_run_bench_tasklet02))
+		time_bench_loop(nr_loops, 0, "tasklet_page_pool02_ptr_ring",
+				NULL, time_bench_page_pool02_ptr_ring);
+
+	if (enabled(bit_run_bench_tasklet03))
+		time_bench_loop(nr_loops, 0, "tasklet_page_pool03_slow", NULL,
+				time_bench_page_pool03_slow);
+
+	if (enabled(bit_run_bench_tasklet04))
+		time_bench_loop(nr_loops, 0, "tasklet_page_pool04_napi_aware",
+				NULL, time_bench_page_pool04_napi_aware);
+
+	mutex_unlock(&wait_for_tasklet); /* Module __init waiting on unlock */
+}
+DECLARE_TASKLET_DISABLED(pp_tasklet, pp_tasklet_handler);
+
+static void run_tasklet_tests(void)
+{
+	tasklet_enable(&pp_tasklet);
+	/* "Async" schedule tasklet, which runs on the CPU that schedule it */
+	tasklet_schedule(&pp_tasklet);
+}
+
 static int run_benchmark_tests(void)
 {
 	uint32_t nr_loops = loops;
@@ -251,12 +332,19 @@ static int __init bench_page_pool_simple_module_init(void)
 
 	run_benchmark_tests();
 
+	mutex_lock(&wait_for_tasklet);
+	run_tasklet_tests();
+	/* Sleep on mutex, waiting for tasklet to release */
+	mutex_lock(&wait_for_tasklet);
+
 	return 0;
 }
 module_init(bench_page_pool_simple_module_init);
 
 static void __exit bench_page_pool_simple_module_exit(void)
 {
+	tasklet_kill(&pp_tasklet);
+
 	if (verbose)
 		pr_info("Unloaded\n");
 }
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC 2/2] xdp: Delegate fast path return decision to page_pool
  2025-11-07 10:28 [RFC 0/2] xdp: Delegate fast path return decision to page_pool Dragos Tatulea
  2025-11-07 10:28 ` [RFC 1/2] page_pool: add benchmarking for napi-based recycling Dragos Tatulea
@ 2025-11-07 10:28 ` Dragos Tatulea
  2025-11-10 11:06   ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 11+ messages in thread
From: Dragos Tatulea @ 2025-11-07 10:28 UTC (permalink / raw)
  To: Jakub Kicinski, Jesper Dangaard Brouer, Andrew Lunn,
	David S. Miller, Eric Dumazet, Paolo Abeni, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Simon Horman
  Cc: Dragos Tatulea, Tariq Toukan, netdev, linux-kernel, bpf

XDP uses the BPF_RI_F_RF_NO_DIRECT flag to mark contexts where it is not
allowed to do direct recycling, even though the direct flag was set by
the caller. This is confusing and can lead to races which are hard to
detect [1].

Furthermore, the page_pool already contains an internal
mechanism which checks if it is safe to switch the direct
flag from off to on.

This patch drops the use of the BPF_RI_F_RF_NO_DIRECT flag and always
calls the page_pool release with the direct flag set to false. The
page_pool will decide if it is safe to do direct recycling. This
is not free, but it is worth it to make the XDP code safer. The
next paragraphs discuss the performance impact.

Performance wise, there are 3 cases to consider. Looking from
__xdp_return() for MEM_TYPE_PAGE_POOL case:

1) napi_direct == false:
  - Before: 1 comparison in __xdp_return() + call of
    page_pool_napi_local() from page_pool_put_unrefed_netmem().
  - After: Only one call to page_pool_napi_local().

2) napi_direct == true && BPF_RI_F_RF_NO_DIRECT
  - Before: 2 comparisons in __xdp_return() + call of
    page_pool_napi_local() from page_pool_put_unrefed_netmem().
  - After: Only one call to page_pool_napi_local().

3) napi_direct == true && !BPF_RI_F_RF_NO_DIRECT
  - Before: 2 comparisons in __xdp_return().
  - After: One call to page_pool_napi_local()

Cases 1 & 2 are the slower paths and can only gain.
But they are slow anyway, so the gain is small.

Case 3 is the fast path and is the one that has to be considered more
closely. The 2 comparisons from __xdp_return() are swapped for the more
expensive page_pool_napi_local() call.

Using the page_pool benchmark between the fast-path and the
newly-added NAPI aware mode to measure [2] how expensive
page_pool_napi_local() is:

  bench_page_pool: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
  bench_page_pool: Type:tasklet_page_pool01_fast_path Per elem: 15 cycles(tsc) 7.537 ns (step:0)

  bench_page_pool: time_bench_page_pool04_napi_aware(): in_serving_softirq fast-path
  bench_page_pool: Type:tasklet_page_pool04_napi_aware Per elem: 20 cycles(tsc) 10.490 ns (step:0)

... and the slow path for reference:

  bench_page_pool: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
  bench_page_pool: Type:tasklet_page_pool02_ptr_ring Per elem: 30 cycles(tsc) 15.395 ns (step:0)

So the impact is small in the fast-path, but not negligible. One thing
to consider is that the napi_direct comparisons are dropped, which
means that the real impact will be smaller than the
measurements from the benchmark.

[1] Commit 2b986b9e917b ("bpf, cpumap: Disable page_pool direct xdp_return need larger scope")
[2] Intel Xeon Platinum 8580

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
---
 drivers/net/veth.c     |  2 --
 include/linux/filter.h | 22 ----------------------
 include/net/xdp.h      |  2 +-
 kernel/bpf/cpumap.c    |  2 --
 net/bpf/test_run.c     |  2 --
 net/core/filter.c      |  2 +-
 net/core/xdp.c         | 24 ++++++++++++------------
 7 files changed, 14 insertions(+), 42 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index a3046142cb8e..6d5c1e0b05a7 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -975,7 +975,6 @@ static int veth_poll(struct napi_struct *napi, int budget)
 
 	bq.count = 0;
 
-	xdp_set_return_frame_no_direct();
 	done = veth_xdp_rcv(rq, budget, &bq, &stats);
 
 	if (stats.xdp_redirect > 0)
@@ -994,7 +993,6 @@ static int veth_poll(struct napi_struct *napi, int budget)
 
 	if (stats.xdp_tx > 0)
 		veth_xdp_flush(rq, &bq);
-	xdp_clear_return_frame_no_direct();
 
 	return done;
 }
diff --git a/include/linux/filter.h b/include/linux/filter.h
index f5c859b8131a..877e40d81a4c 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -764,7 +764,6 @@ struct bpf_nh_params {
 };
 
 /* flags for bpf_redirect_info kern_flags */
-#define BPF_RI_F_RF_NO_DIRECT	BIT(0)	/* no napi_direct on return_frame */
 #define BPF_RI_F_RI_INIT	BIT(1)
 #define BPF_RI_F_CPU_MAP_INIT	BIT(2)
 #define BPF_RI_F_DEV_MAP_INIT	BIT(3)
@@ -1163,27 +1162,6 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 				       const struct bpf_insn *patch, u32 len);
 int bpf_remove_insns(struct bpf_prog *prog, u32 off, u32 cnt);
 
-static inline bool xdp_return_frame_no_direct(void)
-{
-	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
-
-	return ri->kern_flags & BPF_RI_F_RF_NO_DIRECT;
-}
-
-static inline void xdp_set_return_frame_no_direct(void)
-{
-	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
-
-	ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
-}
-
-static inline void xdp_clear_return_frame_no_direct(void)
-{
-	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
-
-	ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
-}
-
 static inline int xdp_ok_fwd_dev(const struct net_device *fwd,
 				 unsigned int pktlen)
 {
diff --git a/include/net/xdp.h b/include/net/xdp.h
index aa742f413c35..2a44d84a7611 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -446,7 +446,7 @@ struct xdp_frame *xdp_convert_buff_to_frame(struct xdp_buff *xdp)
 }
 
 void __xdp_return(netmem_ref netmem, enum xdp_mem_type mem_type,
-		  bool napi_direct, struct xdp_buff *xdp);
+		  struct xdp_buff *xdp);
 void xdp_return_frame(struct xdp_frame *xdpf);
 void xdp_return_frame_rx_napi(struct xdp_frame *xdpf);
 void xdp_return_buff(struct xdp_buff *xdp);
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 703e5df1f4ef..3ece03dc36bd 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -253,7 +253,6 @@ static void cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 
 	rcu_read_lock();
 	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
-	xdp_set_return_frame_no_direct();
 
 	ret->xdp_n = cpu_map_bpf_prog_run_xdp(rcpu, frames, ret->xdp_n, stats);
 	if (unlikely(ret->skb_n))
@@ -263,7 +262,6 @@ static void cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 	if (stats->redirect)
 		xdp_do_flush();
 
-	xdp_clear_return_frame_no_direct();
 	bpf_net_ctx_clear(bpf_net_ctx);
 	rcu_read_unlock();
 
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 8b7d0b90fea7..a0fe03e9e527 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -289,7 +289,6 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 	local_bh_disable();
 	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 	ri = bpf_net_ctx_get_ri();
-	xdp_set_return_frame_no_direct();
 
 	for (i = 0; i < batch_sz; i++) {
 		page = page_pool_dev_alloc_pages(xdp->pp);
@@ -352,7 +351,6 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 			err = ret;
 	}
 
-	xdp_clear_return_frame_no_direct();
 	bpf_net_ctx_clear(bpf_net_ctx);
 	local_bh_enable();
 	return err;
diff --git a/net/core/filter.c b/net/core/filter.c
index 16105f52927d..5622ec5ac19c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4187,7 +4187,7 @@ static bool bpf_xdp_shrink_data(struct xdp_buff *xdp, skb_frag_t *frag,
 	}
 
 	if (release) {
-		__xdp_return(netmem, mem_type, false, zc_frag);
+		__xdp_return(netmem, mem_type, zc_frag);
 	} else {
 		if (!tail)
 			skb_frag_off_add(frag, shrink);
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 9100e160113a..cf8eab699d9a 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -431,18 +431,18 @@ EXPORT_SYMBOL_GPL(xdp_rxq_info_attach_page_pool);
  * of xdp_frames/pages in those cases.
  */
 void __xdp_return(netmem_ref netmem, enum xdp_mem_type mem_type,
-		  bool napi_direct, struct xdp_buff *xdp)
+		  struct xdp_buff *xdp)
 {
 	switch (mem_type) {
 	case MEM_TYPE_PAGE_POOL:
 		netmem = netmem_compound_head(netmem);
-		if (napi_direct && xdp_return_frame_no_direct())
-			napi_direct = false;
+
 		/* No need to check netmem_is_pp() as mem->type knows this a
 		 * page_pool page
+		 *
+		 * page_pool can detect direct recycle.
 		 */
-		page_pool_put_full_netmem(netmem_get_pp(netmem), netmem,
-					  napi_direct);
+		page_pool_put_full_netmem(netmem_get_pp(netmem), netmem, false);
 		break;
 	case MEM_TYPE_PAGE_SHARED:
 		page_frag_free(__netmem_address(netmem));
@@ -471,10 +471,10 @@ void xdp_return_frame(struct xdp_frame *xdpf)
 	sinfo = xdp_get_shared_info_from_frame(xdpf);
 	for (u32 i = 0; i < sinfo->nr_frags; i++)
 		__xdp_return(skb_frag_netmem(&sinfo->frags[i]), xdpf->mem_type,
-			     false, NULL);
+			     NULL);
 
 out:
-	__xdp_return(virt_to_netmem(xdpf->data), xdpf->mem_type, false, NULL);
+	__xdp_return(virt_to_netmem(xdpf->data), xdpf->mem_type, NULL);
 }
 EXPORT_SYMBOL_GPL(xdp_return_frame);
 
@@ -488,10 +488,10 @@ void xdp_return_frame_rx_napi(struct xdp_frame *xdpf)
 	sinfo = xdp_get_shared_info_from_frame(xdpf);
 	for (u32 i = 0; i < sinfo->nr_frags; i++)
 		__xdp_return(skb_frag_netmem(&sinfo->frags[i]), xdpf->mem_type,
-			     true, NULL);
+			     NULL);
 
 out:
-	__xdp_return(virt_to_netmem(xdpf->data), xdpf->mem_type, true, NULL);
+	__xdp_return(virt_to_netmem(xdpf->data), xdpf->mem_type, NULL);
 }
 EXPORT_SYMBOL_GPL(xdp_return_frame_rx_napi);
 
@@ -542,7 +542,7 @@ EXPORT_SYMBOL_GPL(xdp_return_frame_bulk);
  */
 void xdp_return_frag(netmem_ref netmem, const struct xdp_buff *xdp)
 {
-	__xdp_return(netmem, xdp->rxq->mem.type, true, NULL);
+	__xdp_return(netmem, xdp->rxq->mem.type, NULL);
 }
 EXPORT_SYMBOL_GPL(xdp_return_frag);
 
@@ -556,10 +556,10 @@ void xdp_return_buff(struct xdp_buff *xdp)
 	sinfo = xdp_get_shared_info_from_buff(xdp);
 	for (u32 i = 0; i < sinfo->nr_frags; i++)
 		__xdp_return(skb_frag_netmem(&sinfo->frags[i]),
-			     xdp->rxq->mem.type, true, xdp);
+			     xdp->rxq->mem.type, xdp);
 
 out:
-	__xdp_return(virt_to_netmem(xdp->data), xdp->rxq->mem.type, true, xdp);
+	__xdp_return(virt_to_netmem(xdp->data), xdp->rxq->mem.type, xdp);
 }
 EXPORT_SYMBOL_GPL(xdp_return_buff);
 
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC 1/2] page_pool: add benchmarking for napi-based recycling
  2025-11-07 10:28 ` [RFC 1/2] page_pool: add benchmarking for napi-based recycling Dragos Tatulea
@ 2025-11-07 11:04   ` bot+bpf-ci
  0 siblings, 0 replies; 11+ messages in thread
From: bot+bpf-ci @ 2025-11-07 11:04 UTC (permalink / raw)
  To: dtatulea, kuba, hawk, davem, edumazet, pabeni, horms, shuah
  Cc: dtatulea, tariqt, netdev, linux-kselftest, linux-kernel, ast,
	andrii, daniel, martin.lau, eddyz87, yonghong.song, clm,
	ihor.solodrai


> diff --git a/tools/testing/selftests/net/bench/page_pool/bench_page_pool_simple.c b/tools/testing/selftests/net/bench/page_pool/bench_page_pool_simple.c
> index cb6468adb..84683c547 100644
> --- a/tools/testing/selftests/net/bench/page_pool/bench_page_pool_simple.c
> +++ b/tools/testing/selftests/net/bench/page_pool/bench_page_pool_simple.c
> @@ -16,6 +17,8 @@
>  static int verbose = 1;
>  #define MY_POOL_SIZE 1024
>
> +DEFINE_MUTEX(wait_for_tasklet);
> +

[ ... ]

> +/* Testing page_pool requires running under softirq.
> + *
> + * Running under a tasklet satisfies this, as tasklets are built on top of
> + * softirq.
> + */
> +static void pp_tasklet_handler(struct tasklet_struct *t)
> +{
> +	uint32_t nr_loops = loops;
> +
> +	if (in_serving_softirq())
> +		pr_warn("%s(): in_serving_softirq fast-path\n",
> +			__func__); // True
> +	else
> +		pr_warn("%s(): Cannot use page_pool fast-path\n", __func__);
> +
> +	if (enabled(bit_run_bench_tasklet01))
> +		time_bench_loop(nr_loops, 0, "tasklet_page_pool01_fast_path",
> +				NULL, time_bench_page_pool01_fast_path);
> +
> +	if (enabled(bit_run_bench_tasklet02))
> +		time_bench_loop(nr_loops, 0, "tasklet_page_pool02_ptr_ring",
> +				NULL, time_bench_page_pool02_ptr_ring);
> +
> +	if (enabled(bit_run_bench_tasklet03))
> +		time_bench_loop(nr_loops, 0, "tasklet_page_pool03_slow", NULL,
> +				time_bench_page_pool03_slow);
> +
> +	if (enabled(bit_run_bench_tasklet04))
> +		time_bench_loop(nr_loops, 0, "tasklet_page_pool04_napi_aware",
> +				NULL, time_bench_page_pool04_napi_aware);
> +
> +	mutex_unlock(&wait_for_tasklet); /* Module __init waiting on unlock */
        ^^^^^^^^^^^^^

Can mutex_unlock() be called from softirq context? The pp_tasklet_handler()
function runs as a tasklet handler in softirq context, but mutexes are
sleeping locks that require process context. Would a completion or spinlock
be more appropriate here?

The synchronization pattern appears to be:
  bench_page_pool_simple_module_init()->mutex_lock(&wait_for_tasklet)
  bench_page_pool_simple_module_init()->tasklet_schedule(&pp_tasklet)
  bench_page_pool_simple_module_init()->mutex_lock(&wait_for_tasklet) [blocks]
  pp_tasklet_handler()->mutex_unlock(&wait_for_tasklet) [softirq context]

> +}
> +DECLARE_TASKLET_DISABLED(pp_tasklet, pp_tasklet_handler);
> +
> +static void run_tasklet_tests(void)
> +{
> +	tasklet_enable(&pp_tasklet);
> +	/* "Async" schedule tasklet, which runs on the CPU that schedule it */
> +	tasklet_schedule(&pp_tasklet);
> +}

[ ... ]

> @@ -251,12 +332,19 @@ static int __init bench_page_pool_simple_module_init(void)
>
>  	run_benchmark_tests();
>
> +	mutex_lock(&wait_for_tasklet);
> +	run_tasklet_tests();
> +	/* Sleep on mutex, waiting for tasklet to release */
> +	mutex_lock(&wait_for_tasklet);
> +
>  	return 0;
>  }

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/19165940352

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 2/2] xdp: Delegate fast path return decision to page_pool
  2025-11-07 10:28 ` [RFC 2/2] xdp: Delegate fast path return decision to page_pool Dragos Tatulea
@ 2025-11-10 11:06   ` Jesper Dangaard Brouer
  2025-11-10 18:51     ` Dragos Tatulea
  0 siblings, 1 reply; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2025-11-10 11:06 UTC (permalink / raw)
  To: Dragos Tatulea, Jakub Kicinski, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Simon Horman, Toshiaki Makita, David Ahern
  Cc: Tariq Toukan, netdev, linux-kernel, bpf, Martin KaFai Lau,
	KP Singh



On 07/11/2025 11.28, Dragos Tatulea wrote:
> XDP uses the BPF_RI_F_RF_NO_DIRECT flag to mark contexts where it is not
> allowed to do direct recycling, even though the direct flag was set by
> the caller. This is confusing and can lead to races which are hard to
> detect [1].
> 
> Furthermore, the page_pool already contains an internal
> mechanism which checks if it is safe to switch the direct
> flag from off to on.
> 
> This patch drops the use of the BPF_RI_F_RF_NO_DIRECT flag and always
> calls the page_pool release with the direct flag set to false. The
> page_pool will decide if it is safe to do direct recycling. This
> is not free but it is worth it to make the XDP code safer. The
> next paragrapsh are discussing the performance impact.
> 
> Performance wise, there are 3 cases to consider. Looking from
> __xdp_return() for MEM_TYPE_PAGE_POOL case:
> 
> 1) napi_direct == false:
>    - Before: 1 comparison in __xdp_return() + call of
>      page_pool_napi_local() from page_pool_put_unrefed_netmem().
>    - After: Only one call to page_pool_napi_local().
> 
> 2) napi_direct == true && BPF_RI_F_RF_NO_DIRECT
>    - Before: 2 comparisons in __xdp_return() + call of
>      page_pool_napi_local() from page_pool_put_unrefed_netmem().
>    - After: Only one call to page_pool_napi_local().
> 
> 3) napi_direct == true && !BPF_RI_F_RF_NO_DIRECT
>    - Before: 2 comparisons in __xdp_return().
>    - After: One call to page_pool_napi_local()
> 
> Case 1 & 2 are the slower paths and they only have to gain.
> But they are slow anyway so the gain is small.
> 
> Case 3 is the fast path and is the one that has to be considered more
> closely. The 2 comparisons from __xdp_return() are swapped for the more
> expensive page_pool_napi_local() call.
> 
> Using the page_pool benchmark between the fast-path and the
> newly-added NAPI aware mode to measure [2] how expensive
> page_pool_napi_local() is:
> 
>    bench_page_pool: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
>    bench_page_pool: Type:tasklet_page_pool01_fast_path Per elem: 15 cycles(tsc) 7.537 ns (step:0)
> 
>    bench_page_pool: time_bench_page_pool04_napi_aware(): in_serving_softirq fast-path
>    bench_page_pool: Type:tasklet_page_pool04_napi_aware Per elem: 20 cycles(tsc) 10.490 ns (step:0)
> 

IMHO fast-path slowdown is significant.  This fast-path is used for the
XDP_DROP use-case in drivers.  The fast-path is competing with the speed
of updating a (per-cpu) array and a function-call overhead. The
performance target for XDP_DROP is NIC *wirespeed*, which at 100Gbit/s is
148Mpps (or 6.72ns between packets).

I still want to seriously entertain this idea, because (1) the bug[1]
was hard to find, and (2) this is mostly an XDP API optimization
that isn't used by drivers (they call page_pool APIs directly for the
XDP_DROP case).
Drivers can do this because they have access to the page_pool instance.

Thus, this isn't an XDP_DROP use-case.
  - This is either an XDP_REDIRECT or an XDP_TX use-case.

The primary change in this patch is effectively turning the XDP API call
xdp_return_frame_rx_napi() into xdp_return_frame().

Looking at code users of this call:
  (A) Seeing a number of drivers using this to speed up XDP_TX when 
*completing* packets from TX-ring.
  (B) drivers/net/xen-netfront.c use looks incorrect.
  (C) drivers/net/virtio_net.c use can easily be removed.
  (D) cpumap.c and drivers/net/tun.c should not be using this call.
  (E) devmap.c is the main user (with multiple calls)

The (A) user will see a performance drop for XDP_TX, but these drivers
should be able to instead call the page_pool APIs directly as they
should have access to the page_pool instance.
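
Roughly, such a driver-local return on XDP_TX completion could look
like this (illustrative sketch only; the 'mydrv' names are made up and
it assumes the completed frame came from the driver's own pool):

  static void mydrv_xdp_tx_complete(struct mydrv_txq *txq,
                                    struct xdp_frame *xdpf)
  {
          struct page *page = virt_to_head_page(xdpf->data);

          /* Completion runs in the same NAPI context that owns the
           * page_pool, so direct recycling is safe here.
           */
          page_pool_put_full_page(txq->page_pool, page, true);
  }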

Users (B)+(C)+(D) simply need cleanup.

User (E): devmap is the most important+problematic user (IIRC this was
the cause of bug[1]).  XDP redirecting into devmap and running a new
XDP-prog (per target device) was a prime user of this call
xdp_return_frame_rx_napi() as it gave us excellent (e.g. XDP_DROP)
performance.

Perhaps we should simply measure the impact on devmap + 2nd XDP-prog
doing XDP_DROP.  Then we can see if the overhead is acceptable...?

> ... and the slow path for reference:
> 
>    bench_page_pool: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
>    bench_page_pool: Type:tasklet_page_pool02_ptr_ring Per elem: 30 cycles(tsc) 15.395 ns (step:0)

The devmap user will basically fall back to using this code path.

> So the impact is small in the fast-path, but not negligible. One thing
> to consider is the fact that the comparisons from napi_direct are
> dropped. That means that the impact will be smaller than the
> measurements from the benchmark.
> 
> [1] Commit 2b986b9e917b ("bpf, cpumap: Disable page_pool direct xdp_return need larger scope")
> [2] Intel Xeon Platinum 8580
> 
> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
> ---
>   drivers/net/veth.c     |  2 --
>   include/linux/filter.h | 22 ----------------------
>   include/net/xdp.h      |  2 +-
>   kernel/bpf/cpumap.c    |  2 --
>   net/bpf/test_run.c     |  2 --
>   net/core/filter.c      |  2 +-
>   net/core/xdp.c         | 24 ++++++++++++------------
>   7 files changed, 14 insertions(+), 42 deletions(-)
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index a3046142cb8e..6d5c1e0b05a7 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -975,7 +975,6 @@ static int veth_poll(struct napi_struct *napi, int budget)
>   
>   	bq.count = 0;
>   
> -	xdp_set_return_frame_no_direct();
>   	done = veth_xdp_rcv(rq, budget, &bq, &stats);
>   
>   	if (stats.xdp_redirect > 0)
> @@ -994,7 +993,6 @@ static int veth_poll(struct napi_struct *napi, int budget)
>   
>   	if (stats.xdp_tx > 0)
>   		veth_xdp_flush(rq, &bq);
> -	xdp_clear_return_frame_no_direct();
>   
>   	return done;
>   }
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index f5c859b8131a..877e40d81a4c 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -764,7 +764,6 @@ struct bpf_nh_params {
>   };
>   
>   /* flags for bpf_redirect_info kern_flags */
> -#define BPF_RI_F_RF_NO_DIRECT	BIT(0)	/* no napi_direct on return_frame */
>   #define BPF_RI_F_RI_INIT	BIT(1)
>   #define BPF_RI_F_CPU_MAP_INIT	BIT(2)
>   #define BPF_RI_F_DEV_MAP_INIT	BIT(3)
> @@ -1163,27 +1162,6 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
>   				       const struct bpf_insn *patch, u32 len);
>   int bpf_remove_insns(struct bpf_prog *prog, u32 off, u32 cnt);
>   
> -static inline bool xdp_return_frame_no_direct(void)
> -{
> -	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
> -
> -	return ri->kern_flags & BPF_RI_F_RF_NO_DIRECT;
> -}
> -
> -static inline void xdp_set_return_frame_no_direct(void)
> -{
> -	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
> -
> -	ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
> -}
> -
> -static inline void xdp_clear_return_frame_no_direct(void)
> -{
> -	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
> -
> -	ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
> -}
> -
>   static inline int xdp_ok_fwd_dev(const struct net_device *fwd,
>   				 unsigned int pktlen)
>   {
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index aa742f413c35..2a44d84a7611 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -446,7 +446,7 @@ struct xdp_frame *xdp_convert_buff_to_frame(struct xdp_buff *xdp)
>   }
>   
>   void __xdp_return(netmem_ref netmem, enum xdp_mem_type mem_type,
> -		  bool napi_direct, struct xdp_buff *xdp);
> +		  struct xdp_buff *xdp);
>   void xdp_return_frame(struct xdp_frame *xdpf);
>   void xdp_return_frame_rx_napi(struct xdp_frame *xdpf);
>   void xdp_return_buff(struct xdp_buff *xdp);
> diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> index 703e5df1f4ef..3ece03dc36bd 100644
> --- a/kernel/bpf/cpumap.c
> +++ b/kernel/bpf/cpumap.c
> @@ -253,7 +253,6 @@ static void cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
>   
>   	rcu_read_lock();
>   	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
> -	xdp_set_return_frame_no_direct();
>   
>   	ret->xdp_n = cpu_map_bpf_prog_run_xdp(rcpu, frames, ret->xdp_n, stats);
>   	if (unlikely(ret->skb_n))
> @@ -263,7 +262,6 @@ static void cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
>   	if (stats->redirect)
>   		xdp_do_flush();
>   
> -	xdp_clear_return_frame_no_direct();
>   	bpf_net_ctx_clear(bpf_net_ctx);
>   	rcu_read_unlock();
>   
> diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
> index 8b7d0b90fea7..a0fe03e9e527 100644
> --- a/net/bpf/test_run.c
> +++ b/net/bpf/test_run.c
> @@ -289,7 +289,6 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
>   	local_bh_disable();
>   	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
>   	ri = bpf_net_ctx_get_ri();
> -	xdp_set_return_frame_no_direct();
>   
>   	for (i = 0; i < batch_sz; i++) {
>   		page = page_pool_dev_alloc_pages(xdp->pp);
> @@ -352,7 +351,6 @@ static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
>   			err = ret;
>   	}
>   
> -	xdp_clear_return_frame_no_direct();
>   	bpf_net_ctx_clear(bpf_net_ctx);
>   	local_bh_enable();
>   	return err;
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 16105f52927d..5622ec5ac19c 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4187,7 +4187,7 @@ static bool bpf_xdp_shrink_data(struct xdp_buff *xdp, skb_frag_t *frag,
>   	}
>   
>   	if (release) {
> -		__xdp_return(netmem, mem_type, false, zc_frag);
> +		__xdp_return(netmem, mem_type, zc_frag);
>   	} else {
>   		if (!tail)
>   			skb_frag_off_add(frag, shrink);
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 9100e160113a..cf8eab699d9a 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -431,18 +431,18 @@ EXPORT_SYMBOL_GPL(xdp_rxq_info_attach_page_pool);
>    * of xdp_frames/pages in those cases.
>    */
>   void __xdp_return(netmem_ref netmem, enum xdp_mem_type mem_type,
> -		  bool napi_direct, struct xdp_buff *xdp)
> +		  struct xdp_buff *xdp)
>   {
>   	switch (mem_type) {
>   	case MEM_TYPE_PAGE_POOL:
>   		netmem = netmem_compound_head(netmem);
> -		if (napi_direct && xdp_return_frame_no_direct())
> -			napi_direct = false;
> +
>   		/* No need to check netmem_is_pp() as mem->type knows this a
>   		 * page_pool page
> +		 *
> +		 * page_pool can detect direct recycle.
>   		 */
> -		page_pool_put_full_netmem(netmem_get_pp(netmem), netmem,
> -					  napi_direct);
> +		page_pool_put_full_netmem(netmem_get_pp(netmem), netmem, false);
>   		break;
>   	case MEM_TYPE_PAGE_SHARED:
>   		page_frag_free(__netmem_address(netmem));
> @@ -471,10 +471,10 @@ void xdp_return_frame(struct xdp_frame *xdpf)
>   	sinfo = xdp_get_shared_info_from_frame(xdpf);
>   	for (u32 i = 0; i < sinfo->nr_frags; i++)
>   		__xdp_return(skb_frag_netmem(&sinfo->frags[i]), xdpf->mem_type,
> -			     false, NULL);
> +			     NULL);
>   
>   out:
> -	__xdp_return(virt_to_netmem(xdpf->data), xdpf->mem_type, false, NULL);
> +	__xdp_return(virt_to_netmem(xdpf->data), xdpf->mem_type, NULL);
>   }
>   EXPORT_SYMBOL_GPL(xdp_return_frame);
>   
> @@ -488,10 +488,10 @@ void xdp_return_frame_rx_napi(struct xdp_frame *xdpf)
>   	sinfo = xdp_get_shared_info_from_frame(xdpf);
>   	for (u32 i = 0; i < sinfo->nr_frags; i++)
>   		__xdp_return(skb_frag_netmem(&sinfo->frags[i]), xdpf->mem_type,
> -			     true, NULL);
> +			     NULL);
>   
>   out:
> -	__xdp_return(virt_to_netmem(xdpf->data), xdpf->mem_type, true, NULL);
> +	__xdp_return(virt_to_netmem(xdpf->data), xdpf->mem_type, NULL);
>   }
>   EXPORT_SYMBOL_GPL(xdp_return_frame_rx_napi);

The change to xdp_return_frame_rx_napi() makes it equivalent to
xdp_return_frame().

>   
> @@ -542,7 +542,7 @@ EXPORT_SYMBOL_GPL(xdp_return_frame_bulk);
>    */
>   void xdp_return_frag(netmem_ref netmem, const struct xdp_buff *xdp)
>   {
> -	__xdp_return(netmem, xdp->rxq->mem.type, true, NULL);
> +	__xdp_return(netmem, xdp->rxq->mem.type, NULL);
>   }
>   EXPORT_SYMBOL_GPL(xdp_return_frag);
>   
> @@ -556,10 +556,10 @@ void xdp_return_buff(struct xdp_buff *xdp)
>   	sinfo = xdp_get_shared_info_from_buff(xdp);
>   	for (u32 i = 0; i < sinfo->nr_frags; i++)
>   		__xdp_return(skb_frag_netmem(&sinfo->frags[i]),
> -			     xdp->rxq->mem.type, true, xdp);
> +			     xdp->rxq->mem.type, xdp);
>   
>   out:
> -	__xdp_return(virt_to_netmem(xdp->data), xdp->rxq->mem.type, true, xdp);
> +	__xdp_return(virt_to_netmem(xdp->data), xdp->rxq->mem.type, xdp);
>   }
>   EXPORT_SYMBOL_GPL(xdp_return_buff);
>   


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 2/2] xdp: Delegate fast path return decision to page_pool
  2025-11-10 11:06   ` Jesper Dangaard Brouer
@ 2025-11-10 18:51     ` Dragos Tatulea
  2025-11-11  7:54       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 11+ messages in thread
From: Dragos Tatulea @ 2025-11-10 18:51 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Jakub Kicinski, Andrew Lunn,
	David S. Miller, Eric Dumazet, Paolo Abeni, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, Stanislav Fomichev, Hao Luo,
	Jiri Olsa, Simon Horman, Toshiaki Makita, David Ahern
  Cc: Tariq Toukan, netdev, linux-kernel, bpf, Martin KaFai Lau,
	KP Singh

On Mon, Nov 10, 2025 at 12:06:08PM +0100, Jesper Dangaard Brouer wrote:
> 
> 
> On 07/11/2025 11.28, Dragos Tatulea wrote:
> > XDP uses the BPF_RI_F_RF_NO_DIRECT flag to mark contexts where it is not
> > allowed to do direct recycling, even though the direct flag was set by
> > the caller. This is confusing and can lead to races which are hard to
> > detect [1].
> > 
> > Furthermore, the page_pool already contains an internal
> > mechanism which checks if it is safe to switch the direct
> > flag from off to on.
> > 
> > This patch drops the use of the BPF_RI_F_RF_NO_DIRECT flag and always
> > calls the page_pool release with the direct flag set to false. The
> > page_pool will decide if it is safe to do direct recycling. This
> > is not free but it is worth it to make the XDP code safer. The
> > next paragrapsh are discussing the performance impact.
> > 
> > Performance wise, there are 3 cases to consider. Looking from
> > __xdp_return() for MEM_TYPE_PAGE_POOL case:
> > 
> > 1) napi_direct == false:
> >    - Before: 1 comparison in __xdp_return() + call of
> >      page_pool_napi_local() from page_pool_put_unrefed_netmem().
> >    - After: Only one call to page_pool_napi_local().
> > 
> > 2) napi_direct == true && BPF_RI_F_RF_NO_DIRECT
> >    - Before: 2 comparisons in __xdp_return() + call of
> >      page_pool_napi_local() from page_pool_put_unrefed_netmem().
> >    - After: Only one call to page_pool_napi_local().
> > 
> > 3) napi_direct == true && !BPF_RI_F_RF_NO_DIRECT
> >    - Before: 2 comparisons in __xdp_return().
> >    - After: One call to page_pool_napi_local()
> > 
> > Case 1 & 2 are the slower paths and they only have to gain.
> > But they are slow anyway so the gain is small.
> > 
> > Case 3 is the fast path and is the one that has to be considered more
> > closely. The 2 comparisons from __xdp_return() are swapped for the more
> > expensive page_pool_napi_local() call.
> > 
> > Using the page_pool benchmark between the fast-path and the
> > newly-added NAPI aware mode to measure [2] how expensive
> > page_pool_napi_local() is:
> > 
> >    bench_page_pool: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
> >    bench_page_pool: Type:tasklet_page_pool01_fast_path Per elem: 15 cycles(tsc) 7.537 ns (step:0)
> > 
> >    bench_page_pool: time_bench_page_pool04_napi_aware(): in_serving_softirq fast-path
> >    bench_page_pool: Type:tasklet_page_pool04_napi_aware Per elem: 20 cycles(tsc) 10.490 ns (step:0)
> > 
> 
> IMHO fast-path slowdown is significant.  This fast-path is used for the
> XDP_DROP use-case in drivers.  The fast-path is competing with the speed
> of updating an (per-cpu) array and a function-call overhead. The
> performance target for XDP_DROP is NIC *wirespeed* which at 100Gbit/s is
> 148Mpps (or 6.72ns between packets).
>
> I still want to seriously entertain this idea, because (1) because the
> bug[1] was hard to find, and (2) this is mostly an XDP API optimization
> that isn't used by drivers (they call page_pool APIs directly for
> XDP_DROP case).
> Drivers can do this because they have access to the page_pool instance.
> 
> Thus, this isn't a XDP_DROP use-case.
>  - This is either XDP_REDIRECT or XDP_TX use-case.
> 
> The primary change in this patch is, changing the XDP API call
> xdp_return_frame_rx_napi() effectively to xdp_return_frame().
> 
> Looking at code users of this call:
>  (A) Seeing a number of drivers using this to speed up XDP_TX when
> *completing* packets from TX-ring.
>  (B) drivers/net/xen-netfront.c use looks incorrect.
>  (C) drivers/net/virtio_net.c use can easily be removed.
>  (D) cpumap.c and drivers/net/tun.c should not be using this call.
>  (E) devmap.c is the main user (with multiple calls)
> 
> The (A) user will see a performance drop for XDP_TX, but these driver
> should be able to instead call the page_pool APIs directly as they
> should have access to the page_pool instance.
> 
> Users (B)+(C)+(D) simply needs cleanup.
> 
> User (E): devmap is the most important+problematic user (IIRC this was
> the cause of bug[1]).  XDP redirecting into devmap and running a new
> XDP-prog (per target device) was a prime user of this call
> xdp_return_frame_rx_napi() as it gave us excellent (e.g. XDP_DROP)
> performance.
>
Thanks for the analysis Jesper.

> Perhaps we should simply measure the impact on devmap + 2nd XDP-prog
> doing XDP_DROP.  Then, we can see if overhead is acceptable... ?
>
Will try. Just to make sure we are on the same page, AFAIU the setup
would be:
XDP_REDIRECT NIC1 -> veth ingress side and XDP_DROP veth egress side?

> > ... and the slow path for reference:
> > 
> >    bench_page_pool: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
> >    bench_page_pool: Type:tasklet_page_pool02_ptr_ring Per elem: 30 cycles(tsc) 15.395 ns (step:0)
> 
> The devmap user will basically fallback to using this code path.
>
Yes, if the page_pool is not NAPI aware.

Thanks,
Dragos

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 2/2] xdp: Delegate fast path return decision to page_pool
  2025-11-10 18:51     ` Dragos Tatulea
@ 2025-11-11  7:54       ` Jesper Dangaard Brouer
  2025-11-11 18:25         ` Dragos Tatulea
  0 siblings, 1 reply; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2025-11-11  7:54 UTC (permalink / raw)
  To: Dragos Tatulea, Jakub Kicinski, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Simon Horman, Toshiaki Makita, David Ahern,
	Toke Hoiland Jorgensen
  Cc: Tariq Toukan, netdev, linux-kernel, bpf, Martin KaFai Lau,
	KP Singh



On 10/11/2025 19.51, Dragos Tatulea wrote:
> On Mon, Nov 10, 2025 at 12:06:08PM +0100, Jesper Dangaard Brouer wrote:
>>
>>
>> On 07/11/2025 11.28, Dragos Tatulea wrote:
>>> XDP uses the BPF_RI_F_RF_NO_DIRECT flag to mark contexts where it is not
>>> allowed to do direct recycling, even though the direct flag was set by
>>> the caller. This is confusing and can lead to races which are hard to
>>> detect [1].
>>>
>>> Furthermore, the page_pool already contains an internal
>>> mechanism which checks if it is safe to switch the direct
>>> flag from off to on.
>>>
>>> This patch drops the use of the BPF_RI_F_RF_NO_DIRECT flag and always
>>> calls the page_pool release with the direct flag set to false. The
>>> page_pool will decide if it is safe to do direct recycling. This
>>> is not free but it is worth it to make the XDP code safer. The
>>> next paragrapsh are discussing the performance impact.
>>>
>>> Performance wise, there are 3 cases to consider. Looking from
>>> __xdp_return() for MEM_TYPE_PAGE_POOL case:
>>>
>>> 1) napi_direct == false:
>>>     - Before: 1 comparison in __xdp_return() + call of
>>>       page_pool_napi_local() from page_pool_put_unrefed_netmem().
>>>     - After: Only one call to page_pool_napi_local().
>>>
>>> 2) napi_direct == true && BPF_RI_F_RF_NO_DIRECT
>>>     - Before: 2 comparisons in __xdp_return() + call of
>>>       page_pool_napi_local() from page_pool_put_unrefed_netmem().
>>>     - After: Only one call to page_pool_napi_local().
>>>
>>> 3) napi_direct == true && !BPF_RI_F_RF_NO_DIRECT
>>>     - Before: 2 comparisons in __xdp_return().
>>>     - After: One call to page_pool_napi_local()
>>>
>>> Case 1 & 2 are the slower paths and they only have to gain.
>>> But they are slow anyway so the gain is small.
>>>
>>> Case 3 is the fast path and is the one that has to be considered more
>>> closely. The 2 comparisons from __xdp_return() are swapped for the more
>>> expensive page_pool_napi_local() call.
>>>
>>> Using the page_pool benchmark between the fast-path and the
>>> newly-added NAPI aware mode to measure [2] how expensive
>>> page_pool_napi_local() is:
>>>
>>>     bench_page_pool: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
>>>     bench_page_pool: Type:tasklet_page_pool01_fast_path Per elem: 15 cycles(tsc) 7.537 ns (step:0)
>>>
>>>     bench_page_pool: time_bench_page_pool04_napi_aware(): in_serving_softirq fast-path
>>>     bench_page_pool: Type:tasklet_page_pool04_napi_aware Per elem: 20 cycles(tsc) 10.490 ns (step:0)
>>>
>>
>> IMHO fast-path slowdown is significant.  This fast-path is used for the
>> XDP_DROP use-case in drivers.  The fast-path is competing with the speed
>> of updating an (per-cpu) array and a function-call overhead. The
>> performance target for XDP_DROP is NIC *wirespeed* which at 100Gbit/s is
>> 148Mpps (or 6.72ns between packets).
>>
>> I still want to seriously entertain this idea, because (1) because the
>> bug[1] was hard to find, and (2) this is mostly an XDP API optimization
>> that isn't used by drivers (they call page_pool APIs directly for
>> XDP_DROP case).
>> Drivers can do this because they have access to the page_pool instance.
>>
>> Thus, this isn't a XDP_DROP use-case.
>>   - This is either XDP_REDIRECT or XDP_TX use-case.
>>
>> The primary change in this patch is, changing the XDP API call
>> xdp_return_frame_rx_napi() effectively to xdp_return_frame().
>>
>> Looking at code users of this call:
>>   (A) Seeing a number of drivers using this to speed up XDP_TX when
>> *completing* packets from TX-ring.
>>   (B) drivers/net/xen-netfront.c use looks incorrect.
>>   (C) drivers/net/virtio_net.c use can easily be removed.
>>   (D) cpumap.c and drivers/net/tun.c should not be using this call.
>>   (E) devmap.c is the main user (with multiple calls)
>>
>> The (A) user will see a performance drop for XDP_TX, but these driver
>> should be able to instead call the page_pool APIs directly as they
>> should have access to the page_pool instance.
>>
>> Users (B)+(C)+(D) simply needs cleanup.
>>
>> User (E): devmap is the most important+problematic user (IIRC this was
>> the cause of bug[1]).  XDP redirecting into devmap and running a new
>> XDP-prog (per target device) was a prime user of this call
>> xdp_return_frame_rx_napi() as it gave us excellent (e.g. XDP_DROP)
>> performance.
>>
> Thanks for the analysis Jesper.

Thanks for working on this! It is long overdue that we clean this up.
I think I spotted another bug in veth related to
xdp_clear_return_frame_no_direct() and when NAPI exits.

>> Perhaps we should simply measure the impact on devmap + 2nd XDP-prog
>> doing XDP_DROP.  Then, we can see if overhead is acceptable... ?
>>
> Will try. Just to make sure we are on the same page, AFAIU the setup
> would be:
> XDP_REDIRECT NIC1 -> veth ingress side and XDP_DROP veth egress side?

No, this isn't exactly what I meant. But the people that wrote this
blogpost ([1] https://loopholelabs.io/blog/xdp-for-egress-traffic ) are
dependent on the performance of that scenario with veth pairs.

When doing redirect-map, you can attach a 2nd XDP-prog per map
target "egress" device.  That 2nd XDP-prog should do an XDP_DROP as that
will allow us to measure the code path we are talking about. I want the
test to hit this code line [2].
[2] https://elixir.bootlin.com/linux/v6.17.7/source/kernel/bpf/
devmap.c#L368.
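
For context, the frame-return path there looks roughly like this
(paraphrased from dev_map_bpf_prog_run(), not an exact copy):

  switch (act) {
  case XDP_PASS:
          /* frame survives and gets bulk-queued for transmit */
          break;
  default:
          bpf_warn_invalid_xdp_action(dev, xdp_prog, act);
          fallthrough;
  case XDP_ABORTED:
          trace_xdp_exception(dev, xdp_prog, act);
          fallthrough;
  case XDP_DROP:
          xdp_return_frame_rx_napi(xdpf); /* the call this series changes */
          break;
  }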

The xdp-bench[3] tool unfortunately doesn't support a program-mode for
the 2nd XDP-prog, so I did this code change:

diff --git a/xdp-bench/xdp_redirect_devmap.bpf.c 
b/xdp-bench/xdp_redirect_devmap.bpf.c
index 0212e824e2fa..39a24f8834e8 100644
--- a/xdp-bench/xdp_redirect_devmap.bpf.c
+++ b/xdp-bench/xdp_redirect_devmap.bpf.c
@@ -76,6 +76,8 @@ int xdp_redirect_devmap_egress(struct xdp_md *ctx)
         struct ethhdr *eth = data;
         __u64 nh_off;

+       return XDP_DROP;
+
         nh_off = sizeof(*eth);
         if (data + nh_off > data_end)
                 return XDP_DROP;

[3] https://github.com/xdp-project/xdp-tools/tree/main/xdp-bench

And then you can run this command:
  sudo ./xdp-bench redirect-map --load-egress mlx5p1 mlx5p1

Toke (and I) would appreciate it if you added code for this to xdp-bench,
supporting a --program-mode like 'redirect-cpu' does.


> 
>>> ... and the slow path for reference:
>>>
>>>     bench_page_pool: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
>>>     bench_page_pool: Type:tasklet_page_pool02_ptr_ring Per elem: 30 cycles(tsc) 15.395 ns (step:0)
>>
>> The devmap user will basically fallback to using this code path.
>>
> Yes, if the page_pool is not NAPI aware.
> 
> Thanks,
> Dragos


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC 2/2] xdp: Delegate fast path return decision to page_pool
  2025-11-11  7:54       ` Jesper Dangaard Brouer
@ 2025-11-11 18:25         ` Dragos Tatulea
  2025-12-01 10:12           ` Dragos Tatulea
  0 siblings, 1 reply; 11+ messages in thread
From: Dragos Tatulea @ 2025-11-11 18:25 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Jakub Kicinski, Andrew Lunn,
	David S. Miller, Eric Dumazet, Paolo Abeni, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, Stanislav Fomichev, Hao Luo,
	Jiri Olsa, Simon Horman, Toshiaki Makita, David Ahern,
	Toke Hoiland Jorgensen
  Cc: Tariq Toukan, netdev, linux-kernel, bpf, Martin KaFai Lau,
	KP Singh

On Tue, Nov 11, 2025 at 08:54:37AM +0100, Jesper Dangaard Brouer wrote:
> 
> 
> On 10/11/2025 19.51, Dragos Tatulea wrote:
> > On Mon, Nov 10, 2025 at 12:06:08PM +0100, Jesper Dangaard Brouer wrote:
> > > 
> > > 
> > > On 07/11/2025 11.28, Dragos Tatulea wrote:
> > > > XDP uses the BPF_RI_F_RF_NO_DIRECT flag to mark contexts where it is not
> > > > allowed to do direct recycling, even though the direct flag was set by
> > > > the caller. This is confusing and can lead to races which are hard to
> > > > detect [1].
> > > > 
> > > > Furthermore, the page_pool already contains an internal
> > > > mechanism which checks if it is safe to switch the direct
> > > > flag from off to on.
> > > > 
> > > > This patch drops the use of the BPF_RI_F_RF_NO_DIRECT flag and always
> > > > calls the page_pool release with the direct flag set to false. The
> > > > page_pool will decide if it is safe to do direct recycling. This
> > > > is not free but it is worth it to make the XDP code safer. The
> > > > next paragrapsh are discussing the performance impact.
> > > > 
> > > > Performance wise, there are 3 cases to consider. Looking from
> > > > __xdp_return() for MEM_TYPE_PAGE_POOL case:
> > > > 
> > > > 1) napi_direct == false:
> > > >     - Before: 1 comparison in __xdp_return() + call of
> > > >       page_pool_napi_local() from page_pool_put_unrefed_netmem().
> > > >     - After: Only one call to page_pool_napi_local().
> > > > 
> > > > 2) napi_direct == true && BPF_RI_F_RF_NO_DIRECT
> > > >     - Before: 2 comparisons in __xdp_return() + call of
> > > >       page_pool_napi_local() from page_pool_put_unrefed_netmem().
> > > >     - After: Only one call to page_pool_napi_local().
> > > > 
> > > > 3) napi_direct == true && !BPF_RI_F_RF_NO_DIRECT
> > > >     - Before: 2 comparisons in __xdp_return().
> > > >     - After: One call to page_pool_napi_local()
> > > > 
> > > > Case 1 & 2 are the slower paths and they only have to gain.
> > > > But they are slow anyway so the gain is small.
> > > > 
> > > > Case 3 is the fast path and is the one that has to be considered more
> > > > closely. The 2 comparisons from __xdp_return() are swapped for the more
> > > > expensive page_pool_napi_local() call.
> > > > 
> > > > Using the page_pool benchmark between the fast-path and the
> > > > newly-added NAPI aware mode to measure [2] how expensive
> > > > page_pool_napi_local() is:
> > > > 
> > > >     bench_page_pool: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
> > > >     bench_page_pool: Type:tasklet_page_pool01_fast_path Per elem: 15 cycles(tsc) 7.537 ns (step:0)
> > > > 
> > > >     bench_page_pool: time_bench_page_pool04_napi_aware(): in_serving_softirq fast-path
> > > >     bench_page_pool: Type:tasklet_page_pool04_napi_aware Per elem: 20 cycles(tsc) 10.490 ns (step:0)
> > > > 
> > > 
> > > IMHO fast-path slowdown is significant.  This fast-path is used for the
> > > XDP_DROP use-case in drivers.  The fast-path is competing with the speed
> > > of updating an (per-cpu) array and a function-call overhead. The
> > > performance target for XDP_DROP is NIC *wirespeed* which at 100Gbit/s is
> > > 148Mpps (or 6.72ns between packets).
> > > 
> > > I still want to seriously entertain this idea, because (1) because the
> > > bug[1] was hard to find, and (2) this is mostly an XDP API optimization
> > > that isn't used by drivers (they call page_pool APIs directly for
> > > XDP_DROP case).
> > > Drivers can do this because they have access to the page_pool instance.
> > > 
> > > Thus, this isn't a XDP_DROP use-case.
> > >   - This is either XDP_REDIRECT or XDP_TX use-case.
> > > 
> > > The primary change in this patch is, changing the XDP API call
> > > xdp_return_frame_rx_napi() effectively to xdp_return_frame().
> > > 
> > > Looking at code users of this call:
> > >   (A) Seeing a number of drivers using this to speed up XDP_TX when
> > > *completing* packets from TX-ring.
> > >   (B) drivers/net/xen-netfront.c use looks incorrect.
> > >   (C) drivers/net/virtio_net.c use can easily be removed.
> > >   (D) cpumap.c and drivers/net/tun.c should not be using this call.
> > >   (E) devmap.c is the main user (with multiple calls)
> > > 
> > > The (A) user will see a performance drop for XDP_TX, but these driver
> > > should be able to instead call the page_pool APIs directly as they
> > > should have access to the page_pool instance.
> > > 
> > > Users (B)+(C)+(D) simply needs cleanup.
> > > 
> > > User (E): devmap is the most important+problematic user (IIRC this was
> > > the cause of bug[1]).  XDP redirecting into devmap and running a new
> > > XDP-prog (per target device) was a prime user of this call
> > > xdp_return_frame_rx_napi() as it gave us excellent (e.g. XDP_DROP)
> > > performance.
> > > 
> > Thanks for the analysis Jesper.
> 
> Thanks for working on this! It is long over due, that we clean this up.
> I think I spotted another bug in veth related to
> xdp_clear_return_frame_no_direct() and when NAPI exits.
>
What is the issue? Besides the fact that the code is using
xdp_return_frame() which doesn't require
xdp_clear_return_frame_no_direct() anyway.

> > > Perhaps we should simply measure the impact on devmap + 2nd XDP-prog
> > > doing XDP_DROP.  Then, we can see if overhead is acceptable... ?
> > > 
> > Will try. Just to make sure we are on the same page, AFAIU the setup
> > would be:
> > XDP_REDIRECT NIC1 -> veth ingress side and XDP_DROP veth egress side?
> 
> No, this isn't exactly what I meant. But the people that wrote this
> blogpost ([1] https://loopholelabs.io/blog/xdp-for-egress-traffic ) is
> dependent on the performance for that scenario with veth pairs.
> 
> When doing redirect-map, then you can attach a 2nd XDP-prog per map
> target "egress" device.  That 2nd XDP-prog should do a XDP_DROP as that
> will allow us to measure the code path we are talking about. I want test
> to hit this code line [2].
> [2] https://elixir.bootlin.com/linux/v6.17.7/source/kernel/bpf/
> devmap.c#L368.
> 
> The xdp-bench[3] tool unfortunately support program-mode for 2nd XDP-
> prog, so I did this code change:
> 
> diff --git a/xdp-bench/xdp_redirect_devmap.bpf.c
> b/xdp-bench/xdp_redirect_devmap.bpf.c
> index 0212e824e2fa..39a24f8834e8 100644
> --- a/xdp-bench/xdp_redirect_devmap.bpf.c
> +++ b/xdp-bench/xdp_redirect_devmap.bpf.c
> @@ -76,6 +76,8 @@ int xdp_redirect_devmap_egress(struct xdp_md *ctx)
>         struct ethhdr *eth = data;
>         __u64 nh_off;
> 
> +       return XDP_DROP;
> +
>         nh_off = sizeof(*eth);
>         if (data + nh_off > data_end)
>                 return XDP_DROP;
> 
> [3] https://github.com/xdp-project/xdp-tools/tree/main/xdp-bench
> 
> And then you can run this command:
>  sudo ./xdp-bench redirect-map --load-egress mlx5p1 mlx5p1
>
Ah, yes! I was ignorant about the egress part of the program.
That did the trick. The drop happens before reaching the tx
queue of the second netdev and the mentioned code in devmap.c
is reached.
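
For reference, filling the devmap slot with the 2nd ("egress") program boils
down to something like the libbpf sketch below. Names and error handling are
illustrative, and I assume this is roughly what xdp-bench's --load-egress
does under the hood; the program itself has to be loaded with
expected_attach_type BPF_XDP_DEVMAP:

#include <bpf/bpf.h>
#include <linux/bpf.h>

/* Point a devmap slot at a target netdev and attach the egress XDP program
 * that dev_map_bpf_prog_run() will execute for frames redirected there. */
static int devmap_set_egress_prog(int devmap_fd, __u32 slot,
				  __u32 target_ifindex, int egress_prog_fd)
{
	struct bpf_devmap_val val = {
		.ifindex = target_ifindex,
		.bpf_prog.fd = egress_prog_fd,
	};

	return bpf_map_update_elem(devmap_fd, &slot, &val, 0);
}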

Sender is xdp-trafficgen with 3 threads pushing enough on one RX queue
to saturate the CPU.

Here's what I got:

* before:

eth2->eth3             16,153,328 rx/s         16,153,329 err,drop/s            0 xmit/s       
  xmit eth2->eth3               0 xmit/s       16,153,329 drop/s                0 drv_err/s         16.00 bulk-avg     
eth2->eth3             16,152,538 rx/s         16,152,546 err,drop/s            0 xmit/s       
  xmit eth2->eth3               0 xmit/s       16,152,546 drop/s                0 drv_err/s         16.00 bulk-avg     
eth2->eth3             16,156,331 rx/s         16,156,337 err,drop/s            0 xmit/s       
  xmit eth2->eth3               0 xmit/s       16,156,337 drop/s                0 drv_err/s         16.00 bulk-avg

* after:

eth2->eth3             16,105,461 rx/s         16,105,469 err,drop/s            0 xmit/s        
  xmit eth2->eth3               0 xmit/s       16,105,469 drop/s                0 drv_err/s         16.00 bulk-avg     
eth2->eth3             16,119,550 rx/s         16,119,541 err,drop/s            0 xmit/s        
  xmit eth2->eth3               0 xmit/s       16,119,541 drop/s                0 drv_err/s         16.00 bulk-avg     
eth2->eth3             16,092,145 rx/s         16,092,154 err,drop/s            0 xmit/s        
  xmit eth2->eth3               0 xmit/s       16,092,154 drop/s                0 drv_err/s         16.00 bulk-avg

So slightly worse... I don't fully trust the measurements though as I
saw the inverse situation in other tests as well: higher rate after the
patch.

Perf top:

* before:
  13.15%  [kernel]                                              [k] __xdp_return
  11.36%  bpf_prog_3f68498fa592198e_redir_devmap_native         [k] bpf_prog_3f68498fa592198e_redir_devmap_native
   9.60%  [mlx5_core]                                           [k] mlx5e_skb_from_cqe_mpwrq_linear
   8.19%  [mlx5_core]                                           [k] mlx5e_handle_rx_cqe_mpwrq
   7.54%  [mlx5_core]                                           [k] mlx5e_poll_rx_cq
   6.23%  [kernel]                                              [k] xdp_do_redirect
   5.10%  [kernel]                                              [k] page_pool_put_unrefed_netmem
   4.86%  [mlx5_core]                                           [k] mlx5e_post_rx_mpwqes
   4.78%  [mlx5_core]                                           [k] mlx5e_xdp_handle
   3.87%  [kernel]                                              [k] dev_map_bpf_prog_run
   2.74%  [mlx5_core]                                           [k] mlx5e_page_release_fragmented.isra.0
   2.51%  [kernel]                                              [k] dev_map_enqueue
   2.33%  [kernel]                                              [k] dev_map_redirect
   2.19%  [kernel]                                              [k] page_pool_alloc_netmems
   2.18%  [kernel]                                              [k] xdp_return_frame_rx_napi
   1.75%  [kernel]                                              [k] bq_enqueue
   1.64%  [kernel]                                              [k] bpf_dispatcher_xdp_func
   1.37%  [kernel]                                              [k] bq_xmit_all
   1.34%  [kernel]                                              [k] htab_map_update_elem_in_place
   1.20%  [mlx5_core]                                           [k] mlx5e_poll_ico_cq
   1.10%  [mlx5_core]                                           [k] mlx5e_free_rx_mpwqe
   0.66%  bpf_prog_07d302889c674206_tp_xdp_devmap_xmit_multi    [k] bpf_prog_07d302889c674206_tp_xdp_devmap_xmit_multi
   0.55%  bpf_prog_b30cf65b7e0fa9c7_xdp_redirect_devmap_egress  [k] bpf_prog_b30cf65b7e0fa9c7_xdp_redirect_devmap_egress
   0.40%  [kernel]                                              [k] htab_map_hash
   0.35%  [kernel]                                              [k] __dev_flush

* after:
  12.42%  [kernel]                                              [k] __xdp_return
  10.22%  bpf_prog_3f68498fa592198e_redir_devmap_native         [k] bpf_prog_3f68498fa592198e_redir_devmap_native
   9.04%  [mlx5_core]                                           [k] mlx5e_skb_from_cqe_mpwrq_linear
   8.34%  [mlx5_core]                                           [k] mlx5e_handle_rx_cqe_mpwrq
   7.93%  [mlx5_core]                                           [k] mlx5e_poll_rx_cq
   6.51%  [kernel]                                              [k] xdp_do_redirect
   5.24%  [mlx5_core]                                           [k] mlx5e_post_rx_mpwqes
   5.01%  [kernel]                                              [k] page_pool_put_unrefed_netmem
   5.01%  [mlx5_core]                                           [k] mlx5e_xdp_handle
   3.76%  [kernel]                                              [k] dev_map_bpf_prog_run
   2.92%  [mlx5_core]                                           [k] mlx5e_page_release_fragmented.isra.0
   2.56%  [kernel]                                              [k] dev_map_enqueue
   2.38%  [kernel]                                              [k] dev_map_redirect
   2.09%  [kernel]                                              [k] page_pool_alloc_netmems
   1.70%  [kernel]                                              [k] xdp_return_frame
   1.67%  [kernel]                                              [k] bq_xmit_all
   1.66%  [kernel]                                              [k] bq_enqueue
   1.63%  [kernel]                                              [k] bpf_dispatcher_xdp_func
   1.27%  [kernel]                                              [k] htab_map_update_elem_in_place
   1.20%  [mlx5_core]                                           [k] mlx5e_free_rx_mpwqe
   1.08%  [mlx5_core]                                           [k] mlx5e_poll_ico_cq
   0.67%  bpf_prog_07d302889c674206_tp_xdp_devmap_xmit_multi    [k] bpf_prog_07d302889c674206_tp_xdp_devmap_xmit_multi
   0.59%  [kernel]                                              [k] xdp_return_frame_rx_napi
   0.54%  bpf_prog_b30cf65b7e0fa9c7_xdp_redirect_devmap_egress  [k] bpf_prog_b30cf65b7e0fa9c7_xdp_redirect_devmap_egress
   0.46%  [kernel]                                              [k] htab_map_hash
   0.38%  [kernel]                                              [k] __dev_flush
   0.35%  [kernel]                                              [k] net_rx_action

In both cases pp_alloc_fast == pp_recycle_cached.

> Toke (and I) would appreciate it if you added code for this to xdp-bench,
> supporting a --program-mode like 'redirect-cpu' does.
> 
> 
Ok. I will add it.

Thanks,
Dragos


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 2/2] xdp: Delegate fast path return decision to page_pool
  2025-11-11 18:25         ` Dragos Tatulea
@ 2025-12-01 10:12           ` Dragos Tatulea
  2025-12-02 14:00             ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 11+ messages in thread
From: Dragos Tatulea @ 2025-12-01 10:12 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Jakub Kicinski, Andrew Lunn,
	David S. Miller, Eric Dumazet, Paolo Abeni, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, Stanislav Fomichev, Hao Luo,
	Jiri Olsa, Simon Horman, Toshiaki Makita, David Ahern,
	Toke Hoiland Jorgensen
  Cc: Tariq Toukan, netdev, linux-kernel, bpf, Martin KaFai Lau,
	KP Singh



[...]
>> And then you can run this command:
>>  sudo ./xdp-bench redirect-map --load-egress mlx5p1 mlx5p1
>>
> Ah, yes! I was ignorant about the egress part of the program.
> That did the trick. The drop happens before reaching the tx
> queue of the second netdev and the mentioned code in devmap.c
> is reached.
> 
> Sender is xdp-trafficgen with 3 threads pushing enough on one RX queue
> to saturate the CPU.
> 
> Here's what I got:
> 
> * before:
> 
> eth2->eth3             16,153,328 rx/s         16,153,329 err,drop/s            0 xmit/s       
>   xmit eth2->eth3               0 xmit/s       16,153,329 drop/s                0 drv_err/s         16.00 bulk-avg     
> eth2->eth3             16,152,538 rx/s         16,152,546 err,drop/s            0 xmit/s       
>   xmit eth2->eth3               0 xmit/s       16,152,546 drop/s                0 drv_err/s         16.00 bulk-avg     
> eth2->eth3             16,156,331 rx/s         16,156,337 err,drop/s            0 xmit/s       
>   xmit eth2->eth3               0 xmit/s       16,156,337 drop/s                0 drv_err/s         16.00 bulk-avg
> 
> * after:
> 
> eth2->eth3             16,105,461 rx/s         16,105,469 err,drop/s            0 xmit/s        
>   xmit eth2->eth3               0 xmit/s       16,105,469 drop/s                0 drv_err/s         16.00 bulk-avg     
> eth2->eth3             16,119,550 rx/s         16,119,541 err,drop/s            0 xmit/s        
>   xmit eth2->eth3               0 xmit/s       16,119,541 drop/s                0 drv_err/s         16.00 bulk-avg     
> eth2->eth3             16,092,145 rx/s         16,092,154 err,drop/s            0 xmit/s        
>   xmit eth2->eth3               0 xmit/s       16,092,154 drop/s                0 drv_err/s         16.00 bulk-avg
> 
> So slightly worse... I don't fully trust the measurements though as I
> saw the inverse situation in other tests as well: higher rate after the
> patch.
I had a chance to re-run this on a more stable system and the conclusion
is the same. Performance is ~2 % worse:

* before:
eth2->eth3        13,746,431 rx/s   13,746,471 err,drop/s 0 xmit/s    
  xmit eth2->eth3          0 xmit/s 13,746,471 drop/s     0 drv_err/s 16.00 bulk-avg 

* after:
eth2->eth3        13,437,277 rx/s   13,437,259 err,drop/s 0 xmit/s    
  xmit eth2->eth3          0 xmit/s 13,437,259 drop/s     0 drv_err/s 16.00 bulk-avg 

After this experiment it doesn't seem like this direction is worth
proceeding with... I was more optimistic at the start.

>>> Toke (and I) would appreciate it if you added code for this to xdp-bench,
>> supporting a --program-mode like 'redirect-cpu' does.
>>
>>
> Ok. I will add it.
> 
Added it here:
https://github.com/xdp-project/xdp-tools/pull/532

Thanks,
Dragos

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 2/2] xdp: Delegate fast path return decision to page_pool
  2025-12-01 10:12           ` Dragos Tatulea
@ 2025-12-02 14:00             ` Jesper Dangaard Brouer
  2025-12-02 16:29               ` Dragos Tatulea
  0 siblings, 1 reply; 11+ messages in thread
From: Jesper Dangaard Brouer @ 2025-12-02 14:00 UTC (permalink / raw)
  To: Dragos Tatulea, Jakub Kicinski, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Simon Horman, Toshiaki Makita, David Ahern,
	Toke Hoiland Jorgensen
  Cc: Tariq Toukan, netdev, linux-kernel, bpf, Martin KaFai Lau,
	KP Singh, kernel-team




On 01/12/2025 11.12, Dragos Tatulea wrote:
> 
> 
> [...]
>>> And then you can run this command:
>>>   sudo ./xdp-bench redirect-map --load-egress mlx5p1 mlx5p1
>>>
>> Ah, yes! I was ignorant about the egress part of the program.
>> That did the trick. The drop happens before reaching the tx
>> queue of the second netdev and the mentioned code in devmap.c
>> is reached.
>>
>> Sender is xdp-trafficgen with 3 threads pushing enough on one RX queue
>> to saturate the CPU.
>>
>> Here's what I got:
>>
>> * before:
>>
>> eth2->eth3             16,153,328 rx/s         16,153,329 err,drop/s            0 xmit/s
>>    xmit eth2->eth3               0 xmit/s       16,153,329 drop/s                0 drv_err/s         16.00 bulk-avg
>> eth2->eth3             16,152,538 rx/s         16,152,546 err,drop/s            0 xmit/s
>>    xmit eth2->eth3               0 xmit/s       16,152,546 drop/s                0 drv_err/s         16.00 bulk-avg
>> eth2->eth3             16,156,331 rx/s         16,156,337 err,drop/s            0 xmit/s
>>    xmit eth2->eth3               0 xmit/s       16,156,337 drop/s                0 drv_err/s         16.00 bulk-avg
>>
>> * after:
>>
>> eth2->eth3             16,105,461 rx/s         16,105,469 err,drop/s            0 xmit/s
>>    xmit eth2->eth3               0 xmit/s       16,105,469 drop/s                0 drv_err/s         16.00 bulk-avg
>> eth2->eth3             16,119,550 rx/s         16,119,541 err,drop/s            0 xmit/s
>>    xmit eth2->eth3               0 xmit/s       16,119,541 drop/s                0 drv_err/s         16.00 bulk-avg
>> eth2->eth3             16,092,145 rx/s         16,092,154 err,drop/s            0 xmit/s
>>    xmit eth2->eth3               0 xmit/s       16,092,154 drop/s                0 drv_err/s         16.00 bulk-avg
>>
>> So slightly worse... I don't fully trust the measurements though as I
>> saw the inverse situation in other tests as well: higher rate after the
>> patch.

Remember that you are also removing some code (the
xdp_set_return_frame_no_direct and xdp_clear_return_frame_no_direct).
Thus, I was actually hoping we would see a higher rate after the patch.
This is why I wanted to see this XDP-redirect test, instead of the
page_pool micro-benchmark.


> I had a chance to re-run this on a more stable system and the conclusion
> is the same. Performance is ~2 % worse:
> 
> * before:
> eth2->eth3        13,746,431 rx/s   13,746,471 err,drop/s 0 xmit/s
>    xmit eth2->eth3          0 xmit/s 13,746,471 drop/s     0 drv_err/s 16.00 bulk-avg
> 
> * after:
> eth2->eth3        13,437,277 rx/s   13,437,259 err,drop/s 0 xmit/s
>    xmit eth2->eth3          0 xmit/s 13,437,259 drop/s     0 drv_err/s 16.00 bulk-avg
> 
> After this experiment it doesn't seem like this direction is worth
> proceeding with... I was more optimistic at the start.

I do think it is worth proceeding.  I will claim that your PPS results
are basically the same. Converting PPS number to nanosec per packet:

  13,746,471 = (1/13746471*10^9) = 72.75 nanosec
  13,437,259 = (1/13437259*10^9) = 74.42 nanosec
  Difference is  = (74.42-72.75) =  1.67 nanosec

In my experience it is very hard to find a system stable enough to
measure a 2 nanosec difference. As you also note you had to spend effort
finding a stable system.  Thus, I claim your results show no noticeable
performance impact.

My only concern (based on your perf symbols) is that you might not be
testing the right/expected code path.  If mlx5 is running with a
page_pool memory mode that has an elevated refcnt on the page, then we
will not be exercising the slower page_pool ptr_ring return path as much
as expected.  I guess I have to do this experiment in my own testlab on
other NIC drivers that don't use an elevated refcnt by default.


>>>> Toke (and I) would appreciate it if you added code for this to xdp-bench,
>>> supporting a --program-mode like 'redirect-cpu' does.
>>>
>>>
>> Ok. I will add it.
>>
> Added it here:
> https://github.com/xdp-project/xdp-tools/pull/532
>

Thanks, I'll take a look, and I'm sure Toke have opinions on the cmdline
options and the missing man-page update.

--Jesper

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 2/2] xdp: Delegate fast path return decision to page_pool
  2025-12-02 14:00             ` Jesper Dangaard Brouer
@ 2025-12-02 16:29               ` Dragos Tatulea
  0 siblings, 0 replies; 11+ messages in thread
From: Dragos Tatulea @ 2025-12-02 16:29 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Jakub Kicinski, Andrew Lunn,
	David S. Miller, Eric Dumazet, Paolo Abeni, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, Stanislav Fomichev, Hao Luo,
	Jiri Olsa, Simon Horman, Toshiaki Makita, David Ahern,
	Toke Hoiland Jorgensen
  Cc: Tariq Toukan, netdev, linux-kernel, bpf, Martin KaFai Lau,
	KP Singh, kernel-team



On 02.12.25 15:00, Jesper Dangaard Brouer wrote:
> On 01/12/2025 11.12, Dragos Tatulea wrote:
>>
>>
>> [...]
>>>> And then you can run this command:
>>>>   sudo ./xdp-bench redirect-map --load-egress mlx5p1 mlx5p1
>>>>
>>> Ah, yes! I was ignorant about the egress part of the program.
>>> That did the trick. The drop happens before reaching the tx
>>> queue of the second netdev and the mentioned code in devmap.c
>>> is reached.
>>>
>>> Sender is xdp-trafficgen with 3 threads pushing enough on one RX queue
>>> to saturate the CPU.
>>>
>>> Here's what I got:
>>>
>>> * before:
>>>
>>> eth2->eth3             16,153,328 rx/s         16,153,329 err,drop/s            0 xmit/s
>>>    xmit eth2->eth3               0 xmit/s       16,153,329 drop/s                0 drv_err/s         16.00 bulk-avg
>>> eth2->eth3             16,152,538 rx/s         16,152,546 err,drop/s            0 xmit/s
>>>    xmit eth2->eth3               0 xmit/s       16,152,546 drop/s                0 drv_err/s         16.00 bulk-avg
>>> eth2->eth3             16,156,331 rx/s         16,156,337 err,drop/s            0 xmit/s
>>>    xmit eth2->eth3               0 xmit/s       16,156,337 drop/s                0 drv_err/s         16.00 bulk-avg
>>>
>>> * after:
>>>
>>> eth2->eth3             16,105,461 rx/s         16,105,469 err,drop/s            0 xmit/s
>>>    xmit eth2->eth3               0 xmit/s       16,105,469 drop/s                0 drv_err/s         16.00 bulk-avg
>>> eth2->eth3             16,119,550 rx/s         16,119,541 err,drop/s            0 xmit/s
>>>    xmit eth2->eth3               0 xmit/s       16,119,541 drop/s                0 drv_err/s         16.00 bulk-avg
>>> eth2->eth3             16,092,145 rx/s         16,092,154 err,drop/s            0 xmit/s
>>>    xmit eth2->eth3               0 xmit/s       16,092,154 drop/s                0 drv_err/s         16.00 bulk-avg
>>>
>>> So slightly worse... I don't fully trust the measurements though as I
>>> saw the inverse situation in other tests as well: higher rate after the
>>> patch.
> 
> Remember that you are also removing some code (the
> xdp_set_return_frame_no_direct and xdp_clear_return_frame_no_direct).
> Thus, I was actually hoping we would see a higher rate after the patch.
> This is why I wanted to see this XDP-redirect test, instead of the
> page_pool micro-benchmark.
> 
Right. This was mentioned in the initial message as well. I was also
hoping to see an improvement...

> 
>> I had a chance to re-run this on a more stable system and the conclusion
>> is the same. Performance is ~2 % worse:
>>
>> * before:
>> eth2->eth3        13,746,431 rx/s   13,746,471 err,drop/s 0 xmit/s
>>    xmit eth2->eth3          0 xmit/s 13,746,471 drop/s     0 drv_err/s 16.00 bulk-avg
>>
>> * after:
>> eth2->eth3        13,437,277 rx/s   13,437,259 err,drop/s 0 xmit/s
>>    xmit eth2->eth3          0 xmit/s 13,437,259 drop/s     0 drv_err/s 16.00 bulk-avg
>>
>> After this experiment it doesn't seem like this direction is worth
>> proceeding with... I was more optimistic at the start.
> 
> I do think it is worth proceeding.  I will claim that your PPS results
> are basically the same. Converting PPS number to nanosec per packet:
> 
>  13,746,471 = (1/13746471*10^9) = 72.75 nanosec
>  13,437,259 = (1/13437259*10^9) = 74.42 nanosec
>  Difference is  = (74.42-72.75) =  1.67 nanosec
> 
> In my experience it is very hard to find a system stable enough to
> measure a 2 nanosec difference. As you also note you had to spend effort
> finding a stable system.  Thus, I claim your results show no noticeable
> performance impact.
> 
Oh yes, converting to ns does bring a different perspective...

> My only concern (based on your perf symbols) is that you might not be
> testing the right/expected code path.  If mlx5 is running with a
> page_pool memory mode that has an elevated refcnt on the page, then we
> will not be exercising the slower page_pool ptr_ring return path as much
> as expected.  I guess I have to do this experiment in my own testlab on
> other NIC drivers that don't use an elevated refcnt by default.
> 
This part I don't get. I thought that the point was to measure the impact of
the change on the fastest path: recycling to the cache.

Are you saying that you would like to see the impact on the slowest path
as well? Or would you like to see the impact for a mix of the two? Maybe
mlx5 can be hacked into this mode for benchmarking. But I'm not sure I
understand your use case.

> 
>>>>> Toke (and I) would appreciate it if you added code for this to xdp-bench,
>>>> supporting a --program-mode like 'redirect-cpu' does.
>>>>
>>>>
>>> Ok. I will add it.
>>>
>> Added it here:
>> https://github.com/xdp-project/xdp-tools/pull/532
>>
> 
> Thanks, I'll take a look, and I'm sure Toke have opinions on the cmdline
> options and the missing man-page update.
> 
Oh, I forgot about the man-page. Will wait for his PR comments.

Thanks,
Dragos

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2025-12-02 16:29 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-07 10:28 [RFC 0/2] xdp: Delegate fast path return decision to page_pool Dragos Tatulea
2025-11-07 10:28 ` [RFC 1/2] page_pool: add benchmarking for napi-based recycling Dragos Tatulea
2025-11-07 11:04   ` bot+bpf-ci
2025-11-07 10:28 ` [RFC 2/2] xdp: Delegate fast path return decision to page_pool Dragos Tatulea
2025-11-10 11:06   ` Jesper Dangaard Brouer
2025-11-10 18:51     ` Dragos Tatulea
2025-11-11  7:54       ` Jesper Dangaard Brouer
2025-11-11 18:25         ` Dragos Tatulea
2025-12-01 10:12           ` Dragos Tatulea
2025-12-02 14:00             ` Jesper Dangaard Brouer
2025-12-02 16:29               ` Dragos Tatulea

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).