* [PATCH net-next v7 0/8] fix two bugs related to page_pool
@ 2025-01-10 13:06 Yunsheng Lin
  2025-01-10 13:06 ` [PATCH net-next v7 1/8] page_pool: introduce page_pool_get_pp() API Yunsheng Lin
                   ` (8 more replies)
  0 siblings, 9 replies; 31+ messages in thread
From: Yunsheng Lin @ 2025-01-10 13:06 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Yunsheng Lin,
	Alexander Lobakin, Robin Murphy, Alexander Duyck, Andrew Morton,
	IOMMU, MM, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Matthias Brugger,
	AngeloGioacchino Del Regno, netdev, intel-wired-lan, bpf,
	linux-kernel, linux-arm-kernel, linux-mediatek

This patchset fixes a possible time-window problem for page_pool and
the DMA API misuse problem mentioned in [1], and tries to avoid the
overhead of those fixes with some optimizations.

From the performance data below, the overhead is not obvious for
time_bench_page_pool01_fast_path() and time_bench_page_pool02_ptr_ring()
due to run-to-run variation, and there is about 20ns of overhead for
time_bench_page_pool03_slow() from fixing the bug.

Before this patchset:
root@(none)$ insmod bench_page_pool_simple.ko
[  323.367627] bench_page_pool_simple: Loaded
[  323.448747] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.076997150 sec time_interval:76997150) - (invoke count:100000000 tsc_interval:7699707)
[  324.812884] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 13.468 ns (step:0) - (measurement period time:1.346855130 sec time_interval:1346855130) - (invoke count:100000000 tsc_interval:134685507)
[  324.980875] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.010 ns (step:0) - (measurement period time:0.150101270 sec time_interval:150101270) - (invoke count:10000000 tsc_interval:15010120)
[  325.652195] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.542 ns (step:0) - (measurement period time:0.654213000 sec time_interval:654213000) - (invoke count:100000000 tsc_interval:65421294)
[  325.669215] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[  325.974848] time_bench: Type:no-softirq-page_pool01 Per elem: 2 cycles(tsc) 29.633 ns (step:0) - (measurement period time:0.296338200 sec time_interval:296338200) - (invoke count:10000000 tsc_interval:29633814)
[  325.993517] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[  326.576636] time_bench: Type:no-softirq-page_pool02 Per elem: 5 cycles(tsc) 57.391 ns (step:0) - (measurement period time:0.573911820 sec time_interval:573911820) - (invoke count:10000000 tsc_interval:57391174)
[  326.595307] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[  328.422661] time_bench: Type:no-softirq-page_pool03 Per elem: 18 cycles(tsc) 181.849 ns (step:0) - (measurement period time:1.818495880 sec time_interval:1818495880) - (invoke count:10000000 tsc_interval:181849581)
[  328.441681] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[  328.449584] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[  328.755031] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 2 cycles(tsc) 29.632 ns (step:0) - (measurement period time:0.296327910 sec time_interval:296327910) - (invoke count:10000000 tsc_interval:29632785)
[  328.774308] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[  329.578579] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 7 cycles(tsc) 79.523 ns (step:0) - (measurement period time:0.795236560 sec time_interval:795236560) - (invoke count:10000000 tsc_interval:79523650)
[  329.597769] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[  331.507501] time_bench: Type:tasklet_page_pool03_slow Per elem: 19 cycles(tsc) 190.104 ns (step:0) - (measurement period time:1.901047510 sec time_interval:1901047510) - (invoke count:10000000 tsc_interval:190104743)

After this patchset:
root@(none)$ insmod bench_page_pool_simple.ko
[  138.634758] bench_page_pool_simple: Loaded
[  138.715879] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.076972720 sec time_interval:76972720) - (invoke count:100000000 tsc_interval:7697265)
[  140.079897] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 13.467 ns (step:0) - (measurement period time:1.346735370 sec time_interval:1346735370) - (invoke count:100000000 tsc_interval:134673531)
[  140.247841] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.005 ns (step:0) - (measurement period time:0.150055080 sec time_interval:150055080) - (invoke count:10000000 tsc_interval:15005497)
[  140.919072] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:0.654125000 sec time_interval:654125000) - (invoke count:100000000 tsc_interval:65412493)
[  140.936091] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[  141.246985] time_bench: Type:no-softirq-page_pool01 Per elem: 3 cycles(tsc) 30.159 ns (step:0) - (measurement period time:0.301598160 sec time_interval:301598160) - (invoke count:10000000 tsc_interval:30159812)
[  141.265654] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[  141.976265] time_bench: Type:no-softirq-page_pool02 Per elem: 7 cycles(tsc) 70.140 ns (step:0) - (measurement period time:0.701405780 sec time_interval:701405780) - (invoke count:10000000 tsc_interval:70140573)
[  141.994933] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[  144.018945] time_bench: Type:no-softirq-page_pool03 Per elem: 20 cycles(tsc) 201.514 ns (step:0) - (measurement period time:2.015141210 sec time_interval:2015141210) - (invoke count:10000000 tsc_interval:201514113)
[  144.037966] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[  144.045870] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[  144.205045] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 1 cycles(tsc) 15.005 ns (step:0) - (measurement period time:0.150056510 sec time_interval:150056510) - (invoke count:10000000 tsc_interval:15005645)
[  144.224320] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[  144.916044] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 6 cycles(tsc) 68.269 ns (step:0) - (measurement period time:0.682693070 sec time_interval:682693070) - (invoke count:10000000 tsc_interval:68269300)
[  144.935234] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[  146.997684] time_bench: Type:tasklet_page_pool03_slow Per elem: 20 cycles(tsc) 205.376 ns (step:0) - (measurement period time:2.053766310 sec time_interval:2053766310) - (invoke count:10000000 tsc_interval:205376624)

1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/

CC: Alexander Lobakin <aleksander.lobakin@intel.com>
CC: Robin Murphy <robin.murphy@arm.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: IOMMU <iommu@lists.linux.dev>
CC: MM <linux-mm@kvack.org>

Change log:
V7:
  1. Fix a use-after-free bug reported by KASAN, as mentioned by Jakub.
  2. Fix the bug of the 'netmem' variable not being set up correctly, as
     mentioned by Simon.

V6:
  1. Repost based on latest net-next.
  2. Rename page_pool_to_pp() to page_pool_get_pp().

V5:
  1. Support an unlimited number of inflight pages.
  2. Add some optimizations to avoid the overhead of the bug fix.

V4:
  1. Use scanning to do the unmapping.
  2. Split the DMA sync skipping into a separate patch.

V3:
  1. Target net-next tree instead of net tree.
  2. Narrow the RCU critical section as discussed in v2.
  3. Check the unmapping cnt against the inflight cnt.

V2:
  1. Add an item_full stat.
  2. Use container_of() for page_pool_to_pp().

Yunsheng Lin (8):
  page_pool: introduce page_pool_get_pp() API
  page_pool: fix timing for checking and disabling napi_local
  page_pool: fix IOMMU crash when driver has already unbound
  page_pool: support unlimited number of inflight pages
  page_pool: skip dma sync operation for inflight pages
  page_pool: use list instead of ptr_ring for ring cache
  page_pool: batch refilling pages to reduce atomic operation
  page_pool: use list instead of array for alloc cache

 drivers/net/ethernet/freescale/fec_main.c     |   8 +-
 .../ethernet/google/gve/gve_buffer_mgmt_dqo.c |   2 +-
 drivers/net/ethernet/intel/iavf/iavf_txrx.c   |   6 +-
 drivers/net/ethernet/intel/idpf/idpf_txrx.c   |  14 +-
 drivers/net/ethernet/intel/libeth/rx.c        |   2 +-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |   3 +-
 drivers/net/netdevsim/netdev.c                |   6 +-
 drivers/net/wireless/mediatek/mt76/mt76.h     |   2 +-
 include/linux/mm_types.h                      |   2 +-
 include/linux/skbuff.h                        |   1 +
 include/net/libeth/rx.h                       |   3 +-
 include/net/netmem.h                          |  24 +-
 include/net/page_pool/helpers.h               |  11 +
 include/net/page_pool/types.h                 |  64 +-
 net/core/devmem.c                             |   4 +-
 net/core/netmem_priv.h                        |   5 +-
 net/core/page_pool.c                          | 664 ++++++++++++++----
 net/core/page_pool_priv.h                     |  12 +-
 18 files changed, 675 insertions(+), 158 deletions(-)

-- 
2.33.0


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH net-next v7 1/8] page_pool: introduce page_pool_get_pp() API
  2025-01-10 13:06 [PATCH net-next v7 0/8] fix two bugs related to page_pool Yunsheng Lin
@ 2025-01-10 13:06 ` Yunsheng Lin
  2025-01-10 13:06 ` [PATCH net-next v7 2/8] page_pool: fix timing for checking and disabling napi_local Yunsheng Lin
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 31+ messages in thread
From: Yunsheng Lin @ 2025-01-10 13:06 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Yunsheng Lin, Wei Fang,
	Shenwei Wang, Clark Wang, Andrew Lunn, Eric Dumazet,
	Jeroen de Borst, Praveen Kaligineedi, Shailend Chand, Tony Nguyen,
	Przemek Kitszel, Alexander Lobakin, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Felix Fietkau,
	Lorenzo Bianconi, Ryder Lee, Shayne Chen, Sean Wang, Kalle Valo,
	Matthias Brugger, AngeloGioacchino Del Regno, Simon Horman,
	Ilias Apalodimas, imx, netdev, linux-kernel, intel-wired-lan, bpf,
	linux-rdma, linux-wireless, linux-arm-kernel, linux-mediatek

Introduce the page_pool_get_pp() API to avoid callers accessing
page->pp directly.
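
For illustration only, a minimal sketch of the driver-side conversion
this patch performs; the function name example_recycle() is hypothetical
and not part of this series:

    static void example_recycle(struct page *page)
    {
            /* before: page_pool_put_full_page(page->pp, page, false); */
            page_pool_put_full_page(page_pool_get_pp(page), page, false);
    }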

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 drivers/net/ethernet/freescale/fec_main.c          |  8 +++++---
 .../net/ethernet/google/gve/gve_buffer_mgmt_dqo.c  |  2 +-
 drivers/net/ethernet/intel/iavf/iavf_txrx.c        |  6 ++++--
 drivers/net/ethernet/intel/idpf/idpf_txrx.c        | 14 +++++++++-----
 drivers/net/ethernet/intel/libeth/rx.c             |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c   |  3 ++-
 drivers/net/netdevsim/netdev.c                     |  6 ++++--
 drivers/net/wireless/mediatek/mt76/mt76.h          |  2 +-
 include/net/libeth/rx.h                            |  3 ++-
 include/net/page_pool/helpers.h                    |  5 +++++
 10 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index b2daed55bf6c..18d2119dbec1 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1009,7 +1009,8 @@ static void fec_enet_bd_init(struct net_device *dev)
 				struct page *page = txq->tx_buf[i].buf_p;
 
 				if (page)
-					page_pool_put_page(page->pp, page, 0, false);
+					page_pool_put_page(page_pool_get_pp(page),
+							   page, 0, false);
 			}
 
 			txq->tx_buf[i].buf_p = NULL;
@@ -1549,7 +1550,7 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget)
 			xdp_return_frame_rx_napi(xdpf);
 		} else { /* recycle pages of XDP_TX frames */
 			/* The dma_sync_size = 0 as XDP_TX has already synced DMA for_device */
-			page_pool_put_page(page->pp, page, 0, true);
+			page_pool_put_page(page_pool_get_pp(page), page, 0, true);
 		}
 
 		txq->tx_buf[index].buf_p = NULL;
@@ -3307,7 +3308,8 @@ static void fec_enet_free_buffers(struct net_device *ndev)
 			} else {
 				struct page *page = txq->tx_buf[i].buf_p;
 
-				page_pool_put_page(page->pp, page, 0, false);
+				page_pool_put_page(page_pool_get_pp(page),
+						   page, 0, false);
 			}
 
 			txq->tx_buf[i].buf_p = NULL;
diff --git a/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c b/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
index 403f0f335ba6..87422b8828ff 100644
--- a/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
+++ b/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
@@ -210,7 +210,7 @@ void gve_free_to_page_pool(struct gve_rx_ring *rx,
 	if (!page)
 		return;
 
-	page_pool_put_full_page(page->pp, page, allow_direct);
+	page_pool_put_full_page(page_pool_get_pp(page), page, allow_direct);
 	buf_state->page_info.page = NULL;
 }
 
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
index 26b424fd6718..e1bf5554f6e3 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
@@ -1050,7 +1050,8 @@ static void iavf_add_rx_frag(struct sk_buff *skb,
 			     const struct libeth_fqe *rx_buffer,
 			     unsigned int size)
 {
-	u32 hr = rx_buffer->page->pp->p.offset;
+	struct page_pool *pool = page_pool_get_pp(rx_buffer->page);
+	u32 hr = pool->p.offset;
 
 	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buffer->page,
 			rx_buffer->offset + hr, size, rx_buffer->truesize);
@@ -1067,7 +1068,8 @@ static void iavf_add_rx_frag(struct sk_buff *skb,
 static struct sk_buff *iavf_build_skb(const struct libeth_fqe *rx_buffer,
 				      unsigned int size)
 {
-	u32 hr = rx_buffer->page->pp->p.offset;
+	struct page_pool *pool = page_pool_get_pp(rx_buffer->page);
+	u32 hr = pool->p.offset;
 	struct sk_buff *skb;
 	void *va;
 
diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
index 2fa9c36e33c9..04f2347716ca 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
@@ -385,7 +385,8 @@ static void idpf_rx_page_rel(struct libeth_fqe *rx_buf)
 	if (unlikely(!rx_buf->page))
 		return;
 
-	page_pool_put_full_page(rx_buf->page->pp, rx_buf->page, false);
+	page_pool_put_full_page(page_pool_get_pp(rx_buf->page), rx_buf->page,
+				false);
 
 	rx_buf->page = NULL;
 	rx_buf->offset = 0;
@@ -3098,7 +3099,8 @@ idpf_rx_process_skb_fields(struct idpf_rx_queue *rxq, struct sk_buff *skb,
 void idpf_rx_add_frag(struct idpf_rx_buf *rx_buf, struct sk_buff *skb,
 		      unsigned int size)
 {
-	u32 hr = rx_buf->page->pp->p.offset;
+	struct page_pool *pool = page_pool_get_pp(rx_buf->page);
+	u32 hr = pool->p.offset;
 
 	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buf->page,
 			rx_buf->offset + hr, size, rx_buf->truesize);
@@ -3130,8 +3132,10 @@ static u32 idpf_rx_hsplit_wa(const struct libeth_fqe *hdr,
 	if (!libeth_rx_sync_for_cpu(buf, copy))
 		return 0;
 
-	dst = page_address(hdr->page) + hdr->offset + hdr->page->pp->p.offset;
-	src = page_address(buf->page) + buf->offset + buf->page->pp->p.offset;
+	dst = page_address(hdr->page) + hdr->offset +
+		page_pool_get_pp(hdr->page)->p.offset;
+	src = page_address(buf->page) + buf->offset +
+		page_pool_get_pp(buf->page)->p.offset;
 	memcpy(dst, src, LARGEST_ALIGN(copy));
 
 	buf->offset += copy;
@@ -3149,7 +3153,7 @@ static u32 idpf_rx_hsplit_wa(const struct libeth_fqe *hdr,
  */
 struct sk_buff *idpf_rx_build_skb(const struct libeth_fqe *buf, u32 size)
 {
-	u32 hr = buf->page->pp->p.offset;
+	u32 hr = page_pool_get_pp(buf->page)->p.offset;
 	struct sk_buff *skb;
 	void *va;
 
diff --git a/drivers/net/ethernet/intel/libeth/rx.c b/drivers/net/ethernet/intel/libeth/rx.c
index 66d1d23b8ad2..8de0c3a3b146 100644
--- a/drivers/net/ethernet/intel/libeth/rx.c
+++ b/drivers/net/ethernet/intel/libeth/rx.c
@@ -207,7 +207,7 @@ EXPORT_SYMBOL_NS_GPL(libeth_rx_fq_destroy, "LIBETH");
  */
 void libeth_rx_recycle_slow(struct page *page)
 {
-	page_pool_recycle_direct(page->pp, page);
+	page_pool_recycle_direct(page_pool_get_pp(page), page);
 }
 EXPORT_SYMBOL_NS_GPL(libeth_rx_recycle_slow, "LIBETH");
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 94b291662087..30baca49c71e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -716,7 +716,8 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
 				/* No need to check ((page->pp_magic & ~0x3UL) == PP_SIGNATURE)
 				 * as we know this is a page_pool page.
 				 */
-				page_pool_recycle_direct(page->pp, page);
+				page_pool_recycle_direct(page_pool_get_pp(page),
+							 page);
 			} while (++n < num);
 
 			break;
diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index d013b6498539..05a04f4b51d7 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -810,7 +810,8 @@ nsim_pp_hold_write(struct file *file, const char __user *data,
 		if (!ns->page)
 			ret = -ENOMEM;
 	} else {
-		page_pool_put_full_page(ns->page->pp, ns->page, false);
+		page_pool_put_full_page(page_pool_get_pp(ns->page), ns->page,
+					false);
 		ns->page = NULL;
 	}
 
@@ -1022,7 +1023,8 @@ void nsim_destroy(struct netdevsim *ns)
 
 	/* Put this intentionally late to exercise the orphaning path */
 	if (ns->page) {
-		page_pool_put_full_page(ns->page->pp, ns->page, false);
+		page_pool_put_full_page(page_pool_get_pp(ns->page), ns->page,
+					false);
 		ns->page = NULL;
 	}
 
diff --git a/drivers/net/wireless/mediatek/mt76/mt76.h b/drivers/net/wireless/mediatek/mt76/mt76.h
index ca2dba3ac65d..4d0e41a7bf4a 100644
--- a/drivers/net/wireless/mediatek/mt76/mt76.h
+++ b/drivers/net/wireless/mediatek/mt76/mt76.h
@@ -1688,7 +1688,7 @@ static inline void mt76_put_page_pool_buf(void *buf, bool allow_direct)
 {
 	struct page *page = virt_to_head_page(buf);
 
-	page_pool_put_full_page(page->pp, page, allow_direct);
+	page_pool_put_full_page(page_pool_get_pp(page), page, allow_direct);
 }
 
 static inline void *
diff --git a/include/net/libeth/rx.h b/include/net/libeth/rx.h
index 43574bd6612f..f4ae75f9cc1b 100644
--- a/include/net/libeth/rx.h
+++ b/include/net/libeth/rx.h
@@ -137,7 +137,8 @@ static inline bool libeth_rx_sync_for_cpu(const struct libeth_fqe *fqe,
 		return false;
 	}
 
-	page_pool_dma_sync_for_cpu(page->pp, page, fqe->offset, len);
+	page_pool_dma_sync_for_cpu(page_pool_get_pp(page), page, fqe->offset,
+				   len);
 
 	return true;
 }
diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index 543f54fa3020..9c4dbd2289b1 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -83,6 +83,11 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, const void *stats)
 }
 #endif
 
+static inline struct page_pool *page_pool_get_pp(struct page *page)
+{
+	return page->pp;
+}
+
 /**
  * page_pool_dev_alloc_pages() - allocate a page.
  * @pool:	pool from which to allocate
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH net-next v7 2/8] page_pool: fix timing for checking and disabling napi_local
  2025-01-10 13:06 [PATCH net-next v7 0/8] fix two bugs related to page_pool Yunsheng Lin
  2025-01-10 13:06 ` [PATCH net-next v7 1/8] page_pool: introduce page_pool_get_pp() API Yunsheng Lin
@ 2025-01-10 13:06 ` Yunsheng Lin
  2025-01-10 15:40   ` Toke Høiland-Jørgensen
  2025-01-10 13:06 ` [PATCH net-next v7 3/8] page_pool: fix IOMMU crash when driver has already unbound Yunsheng Lin
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 31+ messages in thread
From: Yunsheng Lin @ 2025-01-10 13:06 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Yunsheng Lin,
	Alexander Lobakin, Xuan Zhuo, Jesper Dangaard Brouer,
	Ilias Apalodimas, Eric Dumazet, Simon Horman, netdev,
	linux-kernel

A page_pool page may be freed from skb_defer_free_flush() in
softirq context without being bound to any specific napi; this
may cause a use-after-free problem due to the time window shown
below, where CPU1 may still access napi->list_owner after CPU0
has freed the napi memory:

            CPU 0                           CPU1
      page_pool_destroy()          skb_defer_free_flush()
             .                               .
             .                napi = READ_ONCE(pool->p.napi);
             .                               .
page_pool_disable_direct_recycling()         .
   driver free napi memory                   .
             .                               .
             .       napi && READ_ONCE(napi->list_owner) == cpuid
             .                               .

Use the RCU mechanism to avoid the above problem.
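
A minimal sketch of the pattern, simplified from the hunks below
(reader side in page_pool_napi_local(), writer side in the pool/driver
teardown path around page_pool_destroy()):

    /* reader side, softirq context (as in page_pool_napi_local()) */
    rcu_read_lock();
    napi = READ_ONCE(pool->p.napi);
    napi_local = napi && READ_ONCE(napi->list_owner) == cpuid;
    rcu_read_unlock();

    /* writer side, teardown */
    page_pool_disable_direct_recycling(pool);  /* clears pool->p.napi */
    synchronize_rcu();  /* in page_pool_destroy(): wait out readers */
    /* only now may the driver free the napi instance */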

Note: the above was found during code review of how to fix the
problem in [1].

As the following IOMMU fix patch depends on the synchronize_rcu()
added in this patch, and the time window is so small that this does
not seem to be an urgent fix, target net-next as the IOMMU fix
patch does.

1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/

Fixes: dd64b232deb8 ("page_pool: unlink from napi during destroy")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 net/core/page_pool.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 9733206d6406..1aa7b93bdcc8 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -799,6 +799,7 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
 static bool page_pool_napi_local(const struct page_pool *pool)
 {
 	const struct napi_struct *napi;
+	bool napi_local;
 	u32 cpuid;
 
 	if (unlikely(!in_softirq()))
@@ -814,9 +815,15 @@ static bool page_pool_napi_local(const struct page_pool *pool)
 	if (READ_ONCE(pool->cpuid) == cpuid)
 		return true;
 
+	/* Synchronized with page_pool_destroy() to avoid use-after-free
+	 * of 'napi'.
+	 */
+	rcu_read_lock();
 	napi = READ_ONCE(pool->p.napi);
+	napi_local = napi && READ_ONCE(napi->list_owner) == cpuid;
+	rcu_read_unlock();
 
-	return napi && READ_ONCE(napi->list_owner) == cpuid;
+	return napi_local;
 }
 
 void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
@@ -1165,6 +1172,12 @@ void page_pool_destroy(struct page_pool *pool)
 	if (!page_pool_release(pool))
 		return;
 
+	/* Paired with the RCU lock in page_pool_napi_local() to ensure the
+	 * clearing of pool->p.napi in page_pool_disable_direct_recycling()
+	 * is seen before returning to the driver to free the napi instance.
+	 */
+	synchronize_rcu();
+
 	page_pool_detached(pool);
 	pool->defer_start = jiffies;
 	pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH net-next v7 3/8] page_pool: fix IOMMU crash when driver has already unbound
  2025-01-10 13:06 [PATCH net-next v7 0/8] fix two bugs related to page_pool Yunsheng Lin
  2025-01-10 13:06 ` [PATCH net-next v7 1/8] page_pool: introduce page_pool_get_pp() API Yunsheng Lin
  2025-01-10 13:06 ` [PATCH net-next v7 2/8] page_pool: fix timing for checking and disabling napi_local Yunsheng Lin
@ 2025-01-10 13:06 ` Yunsheng Lin
  2025-01-15 16:29   ` Jesper Dangaard Brouer
  2025-01-10 13:06 ` [PATCH net-next v7 4/8] page_pool: support unlimited number of inflight pages Yunsheng Lin
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 31+ messages in thread
From: Yunsheng Lin @ 2025-01-10 13:06 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Yunsheng Lin, Robin Murphy,
	Alexander Duyck, IOMMU, Andrew Morton, Eric Dumazet, Simon Horman,
	Jesper Dangaard Brouer, Ilias Apalodimas, linux-mm, linux-kernel,
	netdev

A networking driver with page_pool support may hand over a page
that still has a DMA mapping to the network stack and reuse that
page after the network stack is done with it and passes it back
to the page_pool, in order to avoid the penalty of DMA
mapping/unmapping. With all the caching in the network stack, some
pages may be held in the network stack without being returned to
the page_pool soon enough, and with a VF disable causing the driver
to be unbound, the page_pool does not stop the driver from doing
its unbinding work; instead, the page_pool uses a workqueue to
periodically check whether some pages have come back from the
network stack, and if so, it does the DMA-unmapping-related
cleanup work.

As mentioned in [1], attempting DMA unmaps after the driver
has already unbound may leak resources or at worst corrupt
memory. Fundamentally, the page pool code cannot allow DMA
mappings to outlive the driver they belong to.

Currently there seem to be at least two cases where the page is
not released fast enough, causing the DMA unmapping to be done
after the driver has already been unbound:
1. IPv4 packet defragmentation timeout: this seems to cause a
   delay of up to 30 secs.
2. skb_defer_free_flush(): this may cause an infinite delay if
   there is nothing to trigger net_rx_action().

In order not to call DMA APIs to do DMA unmapping after the driver
has already been unbound, and not to stall the unloading of the
networking driver, use pre-allocated item blocks to record inflight
pages, including the ones handed over to the network stack, so that
the page_pool can do the DMA unmapping for those pages when
page_pool_destroy() is called. As the pre-allocated item blocks need
to be large enough to avoid performance degradation, add an
'item_fast_empty' stat to indicate when the pre-allocated item
blocks run out.
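
The tracking relies on each item block occupying exactly one page, so
the owning block (and hence the page_pool) can be recovered from an
item pointer with a simple mask. A simplified sketch of that
relationship, mirroring the structures and helpers added below:

    /* every item lives inside a page-sized page_pool_item_block, so: */
    struct page_pool_item_block *block;

    block = (struct page_pool_item_block *)((unsigned long)item & PAGE_MASK);
    pool = block->pp;	/* how page_pool_get_pp() finds the pool now */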

The overhead of tracking inflight pages is about 10ns~20ns, which
causes about 10% performance degradation for the test case of
time_bench_page_pool03_slow() in [2].

Note: the devmem patchset seems to make the bug harder to fix, and
may make backporting harder too. As there is no actual user of
devmem and the fix for the devmem case is unclear for now, this
patch does not consider fixing the devmem case yet.

1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
2. https://github.com/netoptimizer/prototype-kernel
CC: Robin Murphy <robin.murphy@arm.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: IOMMU <iommu@lists.linux.dev>
Fixes: f71fec47c2df ("page_pool: make sure struct device is stable")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Tested-by: Yonglong Liu <liuyonglong@huawei.com>
---
 include/linux/mm_types.h        |   2 +-
 include/linux/skbuff.h          |   1 +
 include/net/netmem.h            |  24 ++-
 include/net/page_pool/helpers.h |   8 +-
 include/net/page_pool/types.h   |  36 +++-
 net/core/devmem.c               |   4 +-
 net/core/netmem_priv.h          |   5 +-
 net/core/page_pool.c            | 309 +++++++++++++++++++++++++++-----
 net/core/page_pool_priv.h       |  10 +-
 9 files changed, 343 insertions(+), 56 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 332cee285662..97d32a2a3b77 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -120,7 +120,7 @@ struct page {
 			 * page_pool allocated pages.
 			 */
 			unsigned long pp_magic;
-			struct page_pool *pp;
+			struct page_pool_item *pp_item;
 			unsigned long _pp_mapping_pad;
 			unsigned long dma_addr;
 			atomic_long_t pp_ref_count;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bb2b751d274a..a47a23527724 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -39,6 +39,7 @@
 #include <net/net_debug.h>
 #include <net/dropreason-core.h>
 #include <net/netmem.h>
+#include <net/page_pool/types.h>
 
 /**
  * DOC: skb checksums
diff --git a/include/net/netmem.h b/include/net/netmem.h
index 1b58faa4f20f..c848a48b8e96 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -23,7 +23,7 @@ DECLARE_STATIC_KEY_FALSE(page_pool_mem_providers);
 struct net_iov {
 	unsigned long __unused_padding;
 	unsigned long pp_magic;
-	struct page_pool *pp;
+	struct page_pool_item *pp_item;
 	struct dmabuf_genpool_chunk_owner *owner;
 	unsigned long dma_addr;
 	atomic_long_t pp_ref_count;
@@ -33,7 +33,7 @@ struct net_iov {
  *
  *        struct {
  *                unsigned long pp_magic;
- *                struct page_pool *pp;
+ *                struct page_pool_item *pp_item;
  *                unsigned long _pp_mapping_pad;
  *                unsigned long dma_addr;
  *                atomic_long_t pp_ref_count;
@@ -49,7 +49,7 @@ struct net_iov {
 	static_assert(offsetof(struct page, pg) == \
 		      offsetof(struct net_iov, iov))
 NET_IOV_ASSERT_OFFSET(pp_magic, pp_magic);
-NET_IOV_ASSERT_OFFSET(pp, pp);
+NET_IOV_ASSERT_OFFSET(pp_item, pp_item);
 NET_IOV_ASSERT_OFFSET(dma_addr, dma_addr);
 NET_IOV_ASSERT_OFFSET(pp_ref_count, pp_ref_count);
 #undef NET_IOV_ASSERT_OFFSET
@@ -67,6 +67,11 @@ NET_IOV_ASSERT_OFFSET(pp_ref_count, pp_ref_count);
  */
 typedef unsigned long __bitwise netmem_ref;
 
+/* Mirror page_pool_item_block, see include/net/page_pool/types.h */
+struct netmem_item_block {
+	struct page_pool *pp;
+};
+
 static inline bool netmem_is_net_iov(const netmem_ref netmem)
 {
 	return (__force unsigned long)netmem & NET_IOV;
@@ -154,6 +159,11 @@ static inline struct net_iov *__netmem_clear_lsb(netmem_ref netmem)
 	return (struct net_iov *)((__force unsigned long)netmem & ~NET_IOV);
 }
 
+static inline struct page_pool_item *netmem_get_pp_item(netmem_ref netmem)
+{
+	return __netmem_clear_lsb(netmem)->pp_item;
+}
+
 /**
  * __netmem_get_pp - unsafely get pointer to the &page_pool backing @netmem
  * @netmem: netmem reference to get the pointer from
@@ -167,12 +177,16 @@ static inline struct net_iov *__netmem_clear_lsb(netmem_ref netmem)
  */
 static inline struct page_pool *__netmem_get_pp(netmem_ref netmem)
 {
-	return __netmem_to_page(netmem)->pp;
+	struct page_pool_item *item = __netmem_to_page(netmem)->pp_item;
+	struct netmem_item_block *block;
+
+	block = (struct netmem_item_block *)((unsigned long)item & PAGE_MASK);
+	return block->pp;
 }
 
 static inline struct page_pool *netmem_get_pp(netmem_ref netmem)
 {
-	return __netmem_clear_lsb(netmem)->pp;
+	return __netmem_get_pp((__force netmem_ref)__netmem_clear_lsb(netmem));
 }
 
 static inline atomic_long_t *netmem_get_pp_ref_count_ref(netmem_ref netmem)
diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index 9c4dbd2289b1..d4f2ec0898a5 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -83,9 +83,15 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, const void *stats)
 }
 #endif
 
+static inline struct page_pool_item_block *
+page_pool_item_to_block(struct page_pool_item *item)
+{
+	return (struct page_pool_item_block *)((unsigned long)item & PAGE_MASK);
+}
+
 static inline struct page_pool *page_pool_get_pp(struct page *page)
 {
-	return page->pp;
+	return page_pool_item_to_block(page->pp_item)->pp;
 }
 
 /**
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index ed4cd114180a..2011fa43ad0f 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -102,6 +102,7 @@ struct page_pool_params {
  * @refill:	an allocation which triggered a refill of the cache
  * @waive:	pages obtained from the ptr ring that cannot be added to
  *		the cache due to a NUMA mismatch
+ * @item_fast_empty: pre-allocated item cache is empty
  */
 struct page_pool_alloc_stats {
 	u64 fast;
@@ -110,6 +111,7 @@ struct page_pool_alloc_stats {
 	u64 empty;
 	u64 refill;
 	u64 waive;
+	u64 item_fast_empty;
 };
 
 /**
@@ -142,6 +144,30 @@ struct page_pool_stats {
 };
 #endif
 
+struct page_pool_item {
+	unsigned long state;
+
+	union {
+		netmem_ref pp_netmem;
+		struct llist_node lentry;
+	};
+};
+
+/* The size of item_block is always PAGE_SIZE, so that the address of item_block
+ * for a specific item can be calculated using 'item & PAGE_MASK'
+ */
+struct page_pool_item_block {
+	struct page_pool *pp;
+	struct list_head list;
+	struct page_pool_item items[];
+};
+
+/* Ensure the offset of the 'pp' field is the same for both
+ * 'page_pool_item_block' and 'netmem_item_block'.
+ */
+static_assert(offsetof(struct page_pool_item_block, pp) == \
+	      offsetof(struct netmem_item_block, pp));
+
 /* The whole frag API block must stay within one cacheline. On 32-bit systems,
  * sizeof(long) == sizeof(int), so that the block size is ``3 * sizeof(long)``.
  * On 64-bit systems, the actual size is ``2 * sizeof(long) + sizeof(int)``.
@@ -161,6 +187,7 @@ struct page_pool {
 
 	int cpuid;
 	u32 pages_state_hold_cnt;
+	struct llist_head hold_items;
 
 	bool has_init_callback:1;	/* slow::init_callback is set */
 	bool dma_map:1;			/* Perform DMA mapping */
@@ -223,13 +250,20 @@ struct page_pool {
 #endif
 	atomic_t pages_state_release_cnt;
 
+	/* Synchronize DMA unmapping operation in page_pool_return_page() with
+	 * page_pool_destroy() when destroy_cnt is non-zero.
+	 */
+	spinlock_t item_lock;
+	struct list_head item_blocks;
+	struct llist_head release_items;
+
 	/* A page_pool is strictly tied to a single RX-queue being
 	 * protected by NAPI, due to above pp_alloc_cache. This
 	 * refcnt serves purpose is to simplify drivers error handling.
 	 */
 	refcount_t user_cnt;
 
-	u64 destroy_cnt;
+	unsigned long destroy_cnt;
 
 	/* Slow/Control-path information follows */
 	struct page_pool_params_slow slow;
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 0b6ed7525b22..cc7093f00af1 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -85,7 +85,7 @@ net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding)
 	niov = &owner->niovs[index];
 
 	niov->pp_magic = 0;
-	niov->pp = NULL;
+	niov->pp_item = NULL;
 	atomic_long_set(&niov->pp_ref_count, 0);
 
 	return niov;
@@ -380,7 +380,7 @@ bool mp_dmabuf_devmem_release_page(struct page_pool *pool, netmem_ref netmem)
 	if (WARN_ON_ONCE(refcount != 1))
 		return false;
 
-	page_pool_clear_pp_info(netmem);
+	page_pool_clear_pp_info(pool, netmem);
 
 	net_devmem_free_dmabuf(netmem_to_net_iov(netmem));
 
diff --git a/net/core/netmem_priv.h b/net/core/netmem_priv.h
index 7eadb8393e00..3173f6070cf7 100644
--- a/net/core/netmem_priv.h
+++ b/net/core/netmem_priv.h
@@ -18,9 +18,10 @@ static inline void netmem_clear_pp_magic(netmem_ref netmem)
 	__netmem_clear_lsb(netmem)->pp_magic = 0;
 }
 
-static inline void netmem_set_pp(netmem_ref netmem, struct page_pool *pool)
+static inline void netmem_set_pp_item(netmem_ref netmem,
+				      struct page_pool_item *item)
 {
-	__netmem_clear_lsb(netmem)->pp = pool;
+	__netmem_clear_lsb(netmem)->pp_item = item;
 }
 
 static inline void netmem_set_dma_addr(netmem_ref netmem,
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 1aa7b93bdcc8..fa7629c3ec94 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -61,6 +61,7 @@ static const char pp_stats[][ETH_GSTRING_LEN] = {
 	"rx_pp_alloc_empty",
 	"rx_pp_alloc_refill",
 	"rx_pp_alloc_waive",
+	"rx_pp_alloc_item_fast_empty",
 	"rx_pp_recycle_cached",
 	"rx_pp_recycle_cache_full",
 	"rx_pp_recycle_ring",
@@ -94,6 +95,7 @@ bool page_pool_get_stats(const struct page_pool *pool,
 	stats->alloc_stats.empty += pool->alloc_stats.empty;
 	stats->alloc_stats.refill += pool->alloc_stats.refill;
 	stats->alloc_stats.waive += pool->alloc_stats.waive;
+	stats->alloc_stats.item_fast_empty += pool->alloc_stats.item_fast_empty;
 
 	for_each_possible_cpu(cpu) {
 		const struct page_pool_recycle_stats *pcpu =
@@ -139,6 +141,7 @@ u64 *page_pool_ethtool_stats_get(u64 *data, const void *stats)
 	*data++ = pool_stats->alloc_stats.empty;
 	*data++ = pool_stats->alloc_stats.refill;
 	*data++ = pool_stats->alloc_stats.waive;
+	*data++ = pool_stats->alloc_stats.item_fast_empty;
 	*data++ = pool_stats->recycle_stats.cached;
 	*data++ = pool_stats->recycle_stats.cache_full;
 	*data++ = pool_stats->recycle_stats.ring;
@@ -268,6 +271,7 @@ static int page_pool_init(struct page_pool *pool,
 		return -ENOMEM;
 	}
 
+	spin_lock_init(&pool->item_lock);
 	atomic_set(&pool->pages_state_release_cnt, 0);
 
 	/* Driver calling page_pool_create() also call page_pool_destroy() */
@@ -325,6 +329,200 @@ static void page_pool_uninit(struct page_pool *pool)
 #endif
 }
 
+#define PAGE_POOL_ITEM_USED			0
+#define PAGE_POOL_ITEM_MAPPED			1
+
+#define ITEMS_PER_PAGE	((PAGE_SIZE -						\
+			  offsetof(struct page_pool_item_block, items)) /	\
+			 sizeof(struct page_pool_item))
+
+#define page_pool_item_init_state(item)					\
+({									\
+	(item)->state = 0;						\
+})
+
+#if defined(CONFIG_DEBUG_NET)
+#define page_pool_item_set_used(item)					\
+	__set_bit(PAGE_POOL_ITEM_USED, &(item)->state)
+
+#define page_pool_item_clear_used(item)					\
+	__clear_bit(PAGE_POOL_ITEM_USED, &(item)->state)
+
+#define page_pool_item_is_used(item)					\
+	test_bit(PAGE_POOL_ITEM_USED, &(item)->state)
+#else
+#define page_pool_item_set_used(item)
+#define page_pool_item_clear_used(item)
+#define page_pool_item_is_used(item)		false
+#endif
+
+#define page_pool_item_set_mapped(item)					\
+	__set_bit(PAGE_POOL_ITEM_MAPPED, &(item)->state)
+
+/* Only clear_mapped and is_mapped need to be atomic as they can be
+ * called concurrently.
+ */
+#define page_pool_item_clear_mapped(item)				\
+	clear_bit(PAGE_POOL_ITEM_MAPPED, &(item)->state)
+
+#define page_pool_item_is_mapped(item)					\
+	test_bit(PAGE_POOL_ITEM_MAPPED, &(item)->state)
+
+static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
+							 netmem_ref netmem,
+							 bool destroyed)
+{
+	struct page_pool_item *item;
+	dma_addr_t dma;
+
+	if (!pool->dma_map)
+		/* Always account for inflight pages, even if we didn't
+		 * map them
+		 */
+		return;
+
+	dma = page_pool_get_dma_addr_netmem(netmem);
+	item = netmem_get_pp_item(netmem);
+
+	/* DMA unmapping is always needed when page_pool_destroy() is not
+	 * called yet.
+	 */
+	DEBUG_NET_WARN_ON_ONCE(!destroyed && !page_pool_item_is_mapped(item));
+	if (unlikely(destroyed && !page_pool_item_is_mapped(item)))
+		return;
+
+	/* When page is unmapped, it cannot be returned to our pool */
+	dma_unmap_page_attrs(pool->p.dev, dma,
+			     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
+			     DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
+	page_pool_set_dma_addr_netmem(netmem, 0);
+	page_pool_item_clear_mapped(item);
+}
+
+static void __page_pool_item_init(struct page_pool *pool, struct page *page)
+{
+	struct page_pool_item_block *block = page_address(page);
+	struct page_pool_item *items = block->items;
+	unsigned int i;
+
+	list_add(&block->list, &pool->item_blocks);
+	block->pp = pool;
+
+	for (i = 0; i < ITEMS_PER_PAGE; i++) {
+		page_pool_item_init_state(&items[i]);
+		__llist_add(&items[i].lentry, &pool->hold_items);
+	}
+}
+
+static int page_pool_item_init(struct page_pool *pool)
+{
+#define PAGE_POOL_MIN_INFLIGHT_ITEMS		512
+	struct page_pool_item_block *block;
+	int item_cnt;
+
+	INIT_LIST_HEAD(&pool->item_blocks);
+	init_llist_head(&pool->hold_items);
+	init_llist_head(&pool->release_items);
+
+	item_cnt = pool->p.pool_size * 2 + PP_ALLOC_CACHE_SIZE +
+		PAGE_POOL_MIN_INFLIGHT_ITEMS;
+	while (item_cnt > 0) {
+		struct page *page;
+
+		page = alloc_pages_node(pool->p.nid, GFP_KERNEL, 0);
+		if (!page)
+			goto err;
+
+		__page_pool_item_init(pool, page);
+		item_cnt -= ITEMS_PER_PAGE;
+	}
+
+	return 0;
+err:
+	list_for_each_entry(block, &pool->item_blocks, list)
+		put_page(virt_to_page(block));
+
+	return -ENOMEM;
+}
+
+static void page_pool_item_unmap(struct page_pool *pool,
+				 struct page_pool_item *item)
+{
+	spin_lock_bh(&pool->item_lock);
+	__page_pool_release_page_dma(pool, item->pp_netmem, true);
+	spin_unlock_bh(&pool->item_lock);
+}
+
+static void page_pool_items_unmap(struct page_pool *pool)
+{
+	struct page_pool_item_block *block;
+
+	if (!pool->dma_map || pool->mp_priv)
+		return;
+
+	list_for_each_entry(block, &pool->item_blocks, list) {
+		struct page_pool_item *items = block->items;
+		int i;
+
+		for (i = 0; i < ITEMS_PER_PAGE; i++) {
+			struct page_pool_item *item = &items[i];
+
+			if (!page_pool_item_is_mapped(item))
+				continue;
+
+			page_pool_item_unmap(pool, item);
+		}
+	}
+}
+
+static void page_pool_item_uninit(struct page_pool *pool)
+{
+	while (!list_empty(&pool->item_blocks)) {
+		struct page_pool_item_block *block;
+
+		block = list_first_entry(&pool->item_blocks,
+					 struct page_pool_item_block,
+					 list);
+		list_del(&block->list);
+		put_page(virt_to_page(block));
+	}
+}
+
+static bool page_pool_item_add(struct page_pool *pool, netmem_ref netmem)
+{
+	struct page_pool_item *item;
+	struct llist_node *node;
+
+	if (unlikely(llist_empty(&pool->hold_items))) {
+		pool->hold_items.first = llist_del_all(&pool->release_items);
+
+		if (unlikely(llist_empty(&pool->hold_items))) {
+			alloc_stat_inc(pool, item_fast_empty);
+			return false;
+		}
+	}
+
+	node = pool->hold_items.first;
+	pool->hold_items.first = node->next;
+	item = llist_entry(node, struct page_pool_item, lentry);
+	item->pp_netmem = netmem;
+	page_pool_item_set_used(item);
+	netmem_set_pp_item(netmem, item);
+	return true;
+}
+
+static void page_pool_item_del(struct page_pool *pool, netmem_ref netmem)
+{
+	struct page_pool_item *item = netmem_get_pp_item(netmem);
+
+	DEBUG_NET_WARN_ON_ONCE(item->pp_netmem != netmem);
+	DEBUG_NET_WARN_ON_ONCE(page_pool_item_is_mapped(item));
+	DEBUG_NET_WARN_ON_ONCE(!page_pool_item_is_used(item));
+	page_pool_item_clear_used(item);
+	netmem_set_pp_item(netmem, NULL);
+	llist_add(&item->lentry, &pool->release_items);
+}
+
 /**
  * page_pool_create_percpu() - create a page pool for a given cpu.
  * @params: parameters, see struct page_pool_params
@@ -344,12 +542,18 @@ page_pool_create_percpu(const struct page_pool_params *params, int cpuid)
 	if (err < 0)
 		goto err_free;
 
-	err = page_pool_list(pool);
+	err = page_pool_item_init(pool);
 	if (err)
 		goto err_uninit;
 
+	err = page_pool_list(pool);
+	if (err)
+		goto err_item_uninit;
+
 	return pool;
 
+err_item_uninit:
+	page_pool_item_uninit(pool);
 err_uninit:
 	page_pool_uninit(pool);
 err_free:
@@ -369,7 +573,8 @@ struct page_pool *page_pool_create(const struct page_pool_params *params)
 }
 EXPORT_SYMBOL(page_pool_create);
 
-static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem);
+static void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem,
+				    bool destroyed);
 
 static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
 {
@@ -407,7 +612,7 @@ static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
 			 * (2) break out to fallthrough to alloc_pages_node.
 			 * This limit stress on page buddy alloactor.
 			 */
-			page_pool_return_page(pool, netmem);
+			__page_pool_return_page(pool, netmem, false);
 			alloc_stat_inc(pool, waive);
 			netmem = 0;
 			break;
@@ -464,6 +669,7 @@ page_pool_dma_sync_for_device(const struct page_pool *pool,
 
 static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
 {
+	struct page_pool_item *item;
 	dma_addr_t dma;
 
 	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
@@ -481,6 +687,9 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
 	if (page_pool_set_dma_addr_netmem(netmem, dma))
 		goto unmap_failed;
 
+	item = netmem_get_pp_item(netmem);
+	DEBUG_NET_WARN_ON_ONCE(page_pool_item_is_mapped(item));
+	page_pool_item_set_mapped(item);
 	page_pool_dma_sync_for_device(pool, netmem, pool->p.max_len);
 
 	return true;
@@ -503,19 +712,24 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
 	if (unlikely(!page))
 		return NULL;
 
-	if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page)))) {
-		put_page(page);
-		return NULL;
-	}
+	if (unlikely(!page_pool_set_pp_info(pool, page_to_netmem(page))))
+		goto err_alloc;
+
+	if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page))))
+		goto err_set_info;
 
 	alloc_stat_inc(pool, slow_high_order);
-	page_pool_set_pp_info(pool, page_to_netmem(page));
 
 	/* Track how many pages are held 'in-flight' */
 	pool->pages_state_hold_cnt++;
 	trace_page_pool_state_hold(pool, page_to_netmem(page),
 				   pool->pages_state_hold_cnt);
 	return page;
+err_set_info:
+	page_pool_clear_pp_info(pool, page_to_netmem(page));
+err_alloc:
+	put_page(page);
+	return NULL;
 }
 
 /* slow path */
@@ -550,12 +764,18 @@ static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool,
 	 */
 	for (i = 0; i < nr_pages; i++) {
 		netmem = pool->alloc.cache[i];
+
+		if (unlikely(!page_pool_set_pp_info(pool, netmem))) {
+			put_page(netmem_to_page(netmem));
+			continue;
+		}
+
 		if (dma_map && unlikely(!page_pool_dma_map(pool, netmem))) {
+			page_pool_clear_pp_info(pool, netmem);
 			put_page(netmem_to_page(netmem));
 			continue;
 		}
 
-		page_pool_set_pp_info(pool, netmem);
 		pool->alloc.cache[pool->alloc.count++] = netmem;
 		/* Track how many pages are held 'in-flight' */
 		pool->pages_state_hold_cnt++;
@@ -627,9 +847,11 @@ s32 page_pool_inflight(const struct page_pool *pool, bool strict)
 	return inflight;
 }
 
-void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
+bool page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
 {
-	netmem_set_pp(netmem, pool);
+	if (unlikely(!page_pool_item_add(pool, netmem)))
+		return false;
+
 	netmem_or_pp_magic(netmem, PP_SIGNATURE);
 
 	/* Ensuring all pages have been split into one fragment initially:
@@ -641,32 +863,14 @@ void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
 	page_pool_fragment_netmem(netmem, 1);
 	if (pool->has_init_callback)
 		pool->slow.init_callback(netmem, pool->slow.init_arg);
-}
 
-void page_pool_clear_pp_info(netmem_ref netmem)
-{
-	netmem_clear_pp_magic(netmem);
-	netmem_set_pp(netmem, NULL);
+	return true;
 }
 
-static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
-							 netmem_ref netmem)
+void page_pool_clear_pp_info(struct page_pool *pool, netmem_ref netmem)
 {
-	dma_addr_t dma;
-
-	if (!pool->dma_map)
-		/* Always account for inflight pages, even if we didn't
-		 * map them
-		 */
-		return;
-
-	dma = page_pool_get_dma_addr_netmem(netmem);
-
-	/* When page is unmapped, it cannot be returned to our pool */
-	dma_unmap_page_attrs(pool->p.dev, dma,
-			     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
-			     DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
-	page_pool_set_dma_addr_netmem(netmem, 0);
+	netmem_clear_pp_magic(netmem);
+	page_pool_item_del(pool, netmem);
 }
 
 /* Disconnects a page (from a page_pool).  API users can have a need
@@ -674,7 +878,8 @@ static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
  * a regular page (that will eventually be returned to the normal
  * page-allocator via put_page).
  */
-void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
+void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem,
+			     bool destroyed)
 {
 	int count;
 	bool put;
@@ -683,7 +888,7 @@ void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
 	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_priv)
 		put = mp_dmabuf_devmem_release_page(pool, netmem);
 	else
-		__page_pool_release_page_dma(pool, netmem);
+		__page_pool_release_page_dma(pool, netmem, destroyed);
 
 	/* This may be the last page returned, releasing the pool, so
 	 * it is not safe to reference pool afterwards.
@@ -692,7 +897,7 @@ void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
 	trace_page_pool_state_release(pool, netmem, count);
 
 	if (put) {
-		page_pool_clear_pp_info(netmem);
+		page_pool_clear_pp_info(pool, netmem);
 		put_page(netmem_to_page(netmem));
 	}
 	/* An optimization would be to call __free_pages(page, pool->p.order)
@@ -701,6 +906,27 @@ void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
 	 */
 }
 
+/* Called from the page_pool_put_*() path; needs to be synchronized with
+ * the page_pool_destroy() path.
+ */
+static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
+{
+	unsigned int destroy_cnt;
+
+	rcu_read_lock();
+
+	destroy_cnt = READ_ONCE(pool->destroy_cnt);
+	if (unlikely(destroy_cnt)) {
+		spin_lock_bh(&pool->item_lock);
+		__page_pool_return_page(pool, netmem, true);
+		spin_unlock_bh(&pool->item_lock);
+	} else {
+		__page_pool_return_page(pool, netmem, false);
+	}
+
+	rcu_read_unlock();
+}
+
 static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem)
 {
 	int ret;
@@ -963,7 +1189,7 @@ static netmem_ref page_pool_drain_frag(struct page_pool *pool,
 		return netmem;
 	}
 
-	page_pool_return_page(pool, netmem);
+	__page_pool_return_page(pool, netmem, false);
 	return 0;
 }
 
@@ -977,7 +1203,7 @@ static void page_pool_free_frag(struct page_pool *pool)
 	if (!netmem || page_pool_unref_netmem(netmem, drain_count))
 		return;
 
-	page_pool_return_page(pool, netmem);
+	__page_pool_return_page(pool, netmem, false);
 }
 
 netmem_ref page_pool_alloc_frag_netmem(struct page_pool *pool,
@@ -1053,6 +1279,7 @@ static void __page_pool_destroy(struct page_pool *pool)
 	if (pool->disconnect)
 		pool->disconnect(pool);
 
+	page_pool_item_uninit(pool);
 	page_pool_unlist(pool);
 	page_pool_uninit(pool);
 
@@ -1084,7 +1311,7 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
 static void page_pool_scrub(struct page_pool *pool)
 {
 	page_pool_empty_alloc_cache_once(pool);
-	pool->destroy_cnt++;
+	WRITE_ONCE(pool->destroy_cnt, pool->destroy_cnt + 1);
 
 	/* No more consumers should exist, but producers could still
 	 * be in-flight.
@@ -1178,6 +1405,8 @@ void page_pool_destroy(struct page_pool *pool)
 	 */
 	synchronize_rcu();
 
+	page_pool_items_unmap(pool);
+
 	page_pool_detached(pool);
 	pool->defer_start = jiffies;
 	pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
@@ -1198,7 +1427,7 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
 	/* Flush pool alloc cache, as refill will check NUMA node */
 	while (pool->alloc.count) {
 		netmem = pool->alloc.cache[--pool->alloc.count];
-		page_pool_return_page(pool, netmem);
+		__page_pool_return_page(pool, netmem, false);
 	}
 }
 EXPORT_SYMBOL(page_pool_update_nid);
diff --git a/net/core/page_pool_priv.h b/net/core/page_pool_priv.h
index 57439787b9c2..5d85f862a30a 100644
--- a/net/core/page_pool_priv.h
+++ b/net/core/page_pool_priv.h
@@ -36,16 +36,18 @@ static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
 }
 
 #if defined(CONFIG_PAGE_POOL)
-void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem);
-void page_pool_clear_pp_info(netmem_ref netmem);
+bool page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem);
+void page_pool_clear_pp_info(struct page_pool *pool, netmem_ref netmem);
 int page_pool_check_memory_provider(struct net_device *dev,
 				    struct netdev_rx_queue *rxq);
 #else
-static inline void page_pool_set_pp_info(struct page_pool *pool,
+static inline bool page_pool_set_pp_info(struct page_pool *pool,
 					 netmem_ref netmem)
 {
+	return true;
 }
-static inline void page_pool_clear_pp_info(netmem_ref netmem)
+static inline void page_pool_clear_pp_info(struct page_pool *pool,
+					   netmem_ref netmem)
 {
 }
 static inline int page_pool_check_memory_provider(struct net_device *dev,
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH net-next v7 4/8] page_pool: support unlimited number of inflight pages
  2025-01-10 13:06 [PATCH net-next v7 0/8] fix two bugs related to page_pool Yunsheng Lin
                   ` (2 preceding siblings ...)
  2025-01-10 13:06 ` [PATCH net-next v7 3/8] page_pool: fix IOMMU crash when driver has already unbound Yunsheng Lin
@ 2025-01-10 13:06 ` Yunsheng Lin
  2025-01-10 13:06 ` [PATCH net-next v7 5/8] page_pool: skip dma sync operation for " Yunsheng Lin
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 31+ messages in thread
From: Yunsheng Lin @ 2025-01-10 13:06 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Yunsheng Lin, Robin Murphy,
	Alexander Duyck, IOMMU, Jesper Dangaard Brouer, Ilias Apalodimas,
	Eric Dumazet, Simon Horman, netdev, linux-kernel

Currently a fixed amount of pre-allocated memory is used to
keep track of the inflight pages, in order to use the DMA
API correctly.

As mentioned in [1], the number of inflight pages can be up to
73203 depending on the use case. Allocate memory dynamically
to keep track of the inflight pages when the pre-allocated
memory runs out.
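
Roughly, the fallback works as sketched below, simplified from the
page_pool_item_blk_add() added in this patch (list manipulation and
locking trimmed):

    /* pre-allocated hold_items exhausted: get a new page-sized block */
    page = alloc_pages_node(pool->p.nid,
                            GFP_ATOMIC | __GFP_NOWARN | __GFP_ZERO, 0);
    if (!page) {
            alloc_stat_inc(pool, item_slow_failed);
            return false;	/* caller releases the page normally */
    }

    block = page_address(page);
    block->pp = pool;
    block->flags |= PAGE_POOL_ITEM_BLK_DYNAMIC_BIT;
    refcount_set(&block->ref, ITEMS_PER_PAGE);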

The overhead of using dynamic memory allocation is about 10ns~
20ns, which causes 5%~10% performance degradation for the test
case of time_bench_page_pool03_slow() in [2].

1. https://lore.kernel.org/all/b8b7818a-e44b-45f5-91c2-d5eceaa5dd5b@kernel.org/
2. https://github.com/netoptimizer/prototype-kernel
CC: Robin Murphy <robin.murphy@arm.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: IOMMU <iommu@lists.linux.dev>
Fixes: f71fec47c2df ("page_pool: make sure struct device is stable")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 include/net/page_pool/types.h |  12 +++-
 net/core/devmem.c             |   2 +-
 net/core/page_pool.c          | 106 +++++++++++++++++++++++++++++++---
 net/core/page_pool_priv.h     |   6 +-
 4 files changed, 113 insertions(+), 13 deletions(-)

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 2011fa43ad0f..844a7f5ba87a 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -103,6 +103,7 @@ struct page_pool_params {
  * @waive:	pages obtained from the ptr ring that cannot be added to
  *		the cache due to a NUMA mismatch
  * @item_fast_empty: pre-allocated item cache is empty
+ * @item_slow_failed: failed to allocate memory for item_block
  */
 struct page_pool_alloc_stats {
 	u64 fast;
@@ -112,6 +113,7 @@ struct page_pool_alloc_stats {
 	u64 refill;
 	u64 waive;
 	u64 item_fast_empty;
+	u64 item_slow_failed;
 };
 
 /**
@@ -159,6 +161,8 @@ struct page_pool_item {
 struct page_pool_item_block {
 	struct page_pool *pp;
 	struct list_head list;
+	unsigned int flags;
+	refcount_t ref;
 	struct page_pool_item items[];
 };
 
@@ -188,6 +192,8 @@ struct page_pool {
 	int cpuid;
 	u32 pages_state_hold_cnt;
 	struct llist_head hold_items;
+	struct page_pool_item_block *item_blk;
+	unsigned int item_blk_idx;
 
 	bool has_init_callback:1;	/* slow::init_callback is set */
 	bool dma_map:1;			/* Perform DMA mapping */
@@ -250,8 +256,10 @@ struct page_pool {
 #endif
 	atomic_t pages_state_release_cnt;
 
-	/* Synchronize DMA unmapping operation in page_pool_return_page() with
-	 * page_pool_destroy() when destroy_cnt is non-zero.
+	/* 1. Synchronize DMA unmapping operation in page_pool_return_page()
+	 *    with page_pool_destroy() when destroy_cnt is non-zero.
+	 * 2. Protect the item_blocks list when allocating and freeing
+	 *    item_block memory dynamically when destroy_cnt is zero.
 	 */
 	spinlock_t item_lock;
 	struct list_head item_blocks;
diff --git a/net/core/devmem.c b/net/core/devmem.c
index cc7093f00af1..4d8b751d6f9c 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -380,7 +380,7 @@ bool mp_dmabuf_devmem_release_page(struct page_pool *pool, netmem_ref netmem)
 	if (WARN_ON_ONCE(refcount != 1))
 		return false;
 
-	page_pool_clear_pp_info(pool, netmem);
+	page_pool_clear_pp_info(pool, netmem, false);
 
 	net_devmem_free_dmabuf(netmem_to_net_iov(netmem));
 
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index fa7629c3ec94..f65d946e964b 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -62,6 +62,7 @@ static const char pp_stats[][ETH_GSTRING_LEN] = {
 	"rx_pp_alloc_refill",
 	"rx_pp_alloc_waive",
 	"rx_pp_alloc_item_fast_empty",
+	"rx_pp_alloc_item_slow_failed",
 	"rx_pp_recycle_cached",
 	"rx_pp_recycle_cache_full",
 	"rx_pp_recycle_ring",
@@ -96,6 +97,7 @@ bool page_pool_get_stats(const struct page_pool *pool,
 	stats->alloc_stats.refill += pool->alloc_stats.refill;
 	stats->alloc_stats.waive += pool->alloc_stats.waive;
 	stats->alloc_stats.item_fast_empty += pool->alloc_stats.item_fast_empty;
+	stats->alloc_stats.item_slow_failed += pool->alloc_stats.item_slow_failed;
 
 	for_each_possible_cpu(cpu) {
 		const struct page_pool_recycle_stats *pcpu =
@@ -142,6 +144,7 @@ u64 *page_pool_ethtool_stats_get(u64 *data, const void *stats)
 	*data++ = pool_stats->alloc_stats.refill;
 	*data++ = pool_stats->alloc_stats.waive;
 	*data++ = pool_stats->alloc_stats.item_fast_empty;
+	*data++ = pool_stats->alloc_stats.item_slow_failed;
 	*data++ = pool_stats->recycle_stats.cached;
 	*data++ = pool_stats->recycle_stats.cache_full;
 	*data++ = pool_stats->recycle_stats.ring;
@@ -407,6 +410,8 @@ static void __page_pool_item_init(struct page_pool *pool, struct page *page)
 
 	list_add(&block->list, &pool->item_blocks);
 	block->pp = pool;
+	block->flags = 0;
+	refcount_set(&block->ref, 0);
 
 	for (i = 0; i < ITEMS_PER_PAGE; i++) {
 		page_pool_item_init_state(&items[i]);
@@ -484,10 +489,83 @@ static void page_pool_item_uninit(struct page_pool *pool)
 					 struct page_pool_item_block,
 					 list);
 		list_del(&block->list);
+		WARN_ON(refcount_read(&block->ref));
 		put_page(virt_to_page(block));
 	}
 }
 
+#define PAGE_POOL_ITEM_BLK_DYNAMIC_BIT			BIT(0)
+
+static bool page_pool_item_blk_add(struct page_pool *pool, netmem_ref netmem)
+{
+	struct page_pool_item *item;
+
+	if (unlikely(!pool->item_blk || pool->item_blk_idx >= ITEMS_PER_PAGE)) {
+		struct page_pool_item_block *block;
+		struct page *page;
+
+		page = alloc_pages_node(pool->p.nid, GFP_ATOMIC | __GFP_NOWARN |
+					__GFP_ZERO, 0);
+		if (!page) {
+			alloc_stat_inc(pool, item_slow_failed);
+			return false;
+		}
+
+		block = page_address(page);
+		spin_lock_bh(&pool->item_lock);
+		list_add(&block->list, &pool->item_blocks);
+		spin_unlock_bh(&pool->item_lock);
+
+		block->pp = pool;
+		block->flags |= PAGE_POOL_ITEM_BLK_DYNAMIC_BIT;
+		refcount_set(&block->ref, ITEMS_PER_PAGE);
+		pool->item_blk = block;
+		pool->item_blk_idx = 0;
+	}
+
+	item = &pool->item_blk->items[pool->item_blk_idx++];
+	item->pp_netmem = netmem;
+	page_pool_item_set_used(item);
+	netmem_set_pp_item(netmem, item);
+	return true;
+}
+
+static void __page_pool_item_blk_del(struct page_pool *pool,
+				     struct page_pool_item_block *block)
+{
+	spin_lock_bh(&pool->item_lock);
+	list_del(&block->list);
+	spin_unlock_bh(&pool->item_lock);
+
+	put_page(virt_to_page(block));
+}
+
+static void page_pool_item_blk_free(struct page_pool *pool)
+{
+	struct page_pool_item_block *block = pool->item_blk;
+
+	if (!block || pool->item_blk_idx >= ITEMS_PER_PAGE)
+		return;
+
+	if (refcount_sub_and_test(ITEMS_PER_PAGE - pool->item_blk_idx,
+				  &block->ref))
+		__page_pool_item_blk_del(pool, block);
+}
+
+static void page_pool_item_blk_del(struct page_pool *pool,
+				   struct page_pool_item_block *block,
+				   bool destroyed)
+{
+	/* Only call __page_pool_item_blk_del() when page_pool_destroy()
+	 * has not been called yet: the alloc API is not allowed at this
+	 * point, and pool->item_lock is reused to avoid concurrent dma
+	 * unmapping once page_pool_destroy() is called, so taking the
+	 * lock in __page_pool_item_blk_del() there would deadlock.
+	 */
+	if (refcount_dec_and_test(&block->ref) && !destroyed)
+		__page_pool_item_blk_del(pool, block);
+}
+
 static bool page_pool_item_add(struct page_pool *pool, netmem_ref netmem)
 {
 	struct page_pool_item *item;
@@ -498,7 +576,7 @@ static bool page_pool_item_add(struct page_pool *pool, netmem_ref netmem)
 
 		if (unlikely(llist_empty(&pool->hold_items))) {
 			alloc_stat_inc(pool, item_fast_empty);
-			return false;
+			return page_pool_item_blk_add(pool, netmem);
 		}
 	}
 
@@ -511,16 +589,26 @@ static bool page_pool_item_add(struct page_pool *pool, netmem_ref netmem)
 	return true;
 }
 
-static void page_pool_item_del(struct page_pool *pool, netmem_ref netmem)
+static void page_pool_item_del(struct page_pool *pool, netmem_ref netmem,
+			       bool destroyed)
 {
 	struct page_pool_item *item = netmem_get_pp_item(netmem);
+	struct page_pool_item_block *block;
 
 	DEBUG_NET_WARN_ON_ONCE(item->pp_netmem != netmem);
 	DEBUG_NET_WARN_ON_ONCE(page_pool_item_is_mapped(item));
 	DEBUG_NET_WARN_ON_ONCE(!page_pool_item_is_used(item));
 	page_pool_item_clear_used(item);
 	netmem_set_pp_item(netmem, NULL);
-	llist_add(&item->lentry, &pool->release_items);
+
+	block = page_pool_item_to_block(item);
+	if (likely(!(block->flags & PAGE_POOL_ITEM_BLK_DYNAMIC_BIT))) {
+		DEBUG_NET_WARN_ON_ONCE(refcount_read(&block->ref));
+		llist_add(&item->lentry, &pool->release_items);
+		return;
+	}
+
+	page_pool_item_blk_del(pool, block, destroyed);
 }
 
 /**
@@ -726,7 +814,7 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
 				   pool->pages_state_hold_cnt);
 	return page;
 err_set_info:
-	page_pool_clear_pp_info(pool, page_to_netmem(page));
+	page_pool_clear_pp_info(pool, page_to_netmem(page), false);
 err_alloc:
 	put_page(page);
 	return NULL;
@@ -771,7 +859,7 @@ static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool,
 		}
 
 		if (dma_map && unlikely(!page_pool_dma_map(pool, netmem))) {
-			page_pool_clear_pp_info(pool, netmem);
+			page_pool_clear_pp_info(pool, netmem, false);
 			put_page(netmem_to_page(netmem));
 			continue;
 		}
@@ -867,10 +955,11 @@ bool page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
 	return true;
 }
 
-void page_pool_clear_pp_info(struct page_pool *pool, netmem_ref netmem)
+void page_pool_clear_pp_info(struct page_pool *pool, netmem_ref netmem,
+			     bool destroyed)
 {
 	netmem_clear_pp_magic(netmem);
-	page_pool_item_del(pool, netmem);
+	page_pool_item_del(pool, netmem, destroyed);
 }
 
 /* Disconnects a page (from a page_pool).  API users can have a need
@@ -897,7 +986,7 @@ void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem,
 	trace_page_pool_state_release(pool, netmem, count);
 
 	if (put) {
-		page_pool_clear_pp_info(pool, netmem);
+		page_pool_clear_pp_info(pool, netmem, destroyed);
 		put_page(netmem_to_page(netmem));
 	}
 	/* An optimization would be to call __free_pages(page, pool->p.order)
@@ -1395,6 +1484,7 @@ void page_pool_destroy(struct page_pool *pool)
 
 	page_pool_disable_direct_recycling(pool);
 	page_pool_free_frag(pool);
+	page_pool_item_blk_free(pool);
 
 	if (!page_pool_release(pool))
 		return;
diff --git a/net/core/page_pool_priv.h b/net/core/page_pool_priv.h
index 5d85f862a30a..643f707838e8 100644
--- a/net/core/page_pool_priv.h
+++ b/net/core/page_pool_priv.h
@@ -37,7 +37,8 @@ static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
 
 #if defined(CONFIG_PAGE_POOL)
 bool page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem);
-void page_pool_clear_pp_info(struct page_pool *pool, netmem_ref netmem);
+void page_pool_clear_pp_info(struct page_pool *pool, netmem_ref netmem,
+			     bool destroyed);
 int page_pool_check_memory_provider(struct net_device *dev,
 				    struct netdev_rx_queue *rxq);
 #else
@@ -47,7 +48,8 @@ static inline bool page_pool_set_pp_info(struct page_pool *pool,
 	return true;
 }
 static inline void page_pool_clear_pp_info(struct page_pool *pool,
-					   netmem_ref netmem)
+					   netmem_ref netmem,
+					   bool destroyed)
 {
 }
 static inline int page_pool_check_memory_provider(struct net_device *dev,
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH net-next v7 5/8] page_pool: skip dma sync operation for inflight pages
  2025-01-10 13:06 [PATCH net-next v7 0/8] fix two bugs related to page_pool Yunsheng Lin
                   ` (3 preceding siblings ...)
  2025-01-10 13:06 ` [PATCH net-next v7 4/8] page_pool: support unlimited number of inflight pages Yunsheng Lin
@ 2025-01-10 13:06 ` Yunsheng Lin
  2025-01-10 13:07 ` [PATCH net-next v7 6/8] page_pool: use list instead of ptr_ring for ring cache Yunsheng Lin
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 31+ messages in thread
From: Yunsheng Lin @ 2025-01-10 13:06 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Yunsheng Lin, Robin Murphy,
	Alexander Duyck, IOMMU, Jesper Dangaard Brouer, Ilias Apalodimas,
	Eric Dumazet, Simon Horman, netdev, linux-kernel

Skip the dma sync operation for inflight pages before
page_pool_destroy() returns to the driver, as the DMA API
expects to be called with a valid device bound to a driver,
as mentioned in [1].

After page_pool_destroy() is called, a page is not expected
to be recycled back to the pool->alloc cache, and the dma
sync operation is not needed when the page is not recyclable
or pool->ring is full. So only skip the dma sync operation
for the inflight pages by clearing pool->dma_sync, and rely
on the synchronize_rcu() in page_pool_destroy() being paired
with the rcu lock in page_pool_recycle_in_ring() to ensure
that no dma sync operation is called after page_pool_destroy()
returns.

1. https://lore.kernel.org/all/caf31b5e-0e8f-4844-b7ba-ef59ed13b74e@arm.com/
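
In other words, the pairing follows the usual RCU publish/wait
pattern; a minimal sketch of the two sides (simplified from the
hunks below, not a new API) is:

/* Writer side (page_pool_destroy() path): publish the state change,
 * then wait out any reader that may still be syncing a page based on
 * the old state.
 */
pool->dma_sync = false;
synchronize_rcu();

/* Reader side (recycle path): recheck under the rcu read lock so the
 * writer's synchronize_rcu() cannot return while this dma sync is
 * still in progress.
 */
rcu_read_lock();
if (pool->dma_sync && dma_dev_need_sync(pool->p.dev))
	__page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
rcu_read_unlock();
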
CC: Robin Murphy <robin.murphy@arm.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: IOMMU <iommu@lists.linux.dev>
Fixes: f71fec47c2df ("page_pool: make sure struct device is stable")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 net/core/page_pool.c | 57 ++++++++++++++++++++++++++++++++------------
 1 file changed, 42 insertions(+), 15 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index f65d946e964b..232ab56f7fac 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -280,9 +280,6 @@ static int page_pool_init(struct page_pool *pool,
 	/* Driver calling page_pool_create() also call page_pool_destroy() */
 	refcount_set(&pool->user_cnt, 1);
 
-	if (pool->dma_map)
-		get_device(pool->p.dev);
-
 	if (pool->slow.flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM) {
 		/* We rely on rtnl_lock()ing to make sure netdev_rx_queue
 		 * configuration doesn't change while we're initializing
@@ -323,9 +320,6 @@ static void page_pool_uninit(struct page_pool *pool)
 {
 	ptr_ring_cleanup(&pool->ring, NULL);
 
-	if (pool->dma_map)
-		put_device(pool->p.dev);
-
 #ifdef CONFIG_PAGE_POOL_STATS
 	if (!pool->system)
 		free_percpu(pool->recycle_stats);
@@ -755,6 +749,25 @@ page_pool_dma_sync_for_device(const struct page_pool *pool,
 		__page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
 }
 
+static __always_inline void
+page_pool_dma_sync_for_device_rcu(const struct page_pool *pool,
+				  netmem_ref netmem,
+				  u32 dma_sync_size)
+{
+	if (!pool->dma_sync)
+		return;
+
+	rcu_read_lock();
+
+	/* Recheck the dma_sync under rcu lock to pair with synchronize_rcu() in
+	 * page_pool_destroy().
+	 */
+	if (pool->dma_sync && dma_dev_need_sync(pool->p.dev))
+		__page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
+
+	rcu_read_unlock();
+}
+
 static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
 {
 	struct page_pool_item *item;
@@ -1016,7 +1029,8 @@ static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
 	rcu_read_unlock();
 }
 
-static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem)
+static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem,
+				      unsigned int dma_sync_size)
 {
 	int ret;
 	/* BH protection not needed if current is softirq */
@@ -1025,12 +1039,12 @@ static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem)
 	else
 		ret = ptr_ring_produce_bh(&pool->ring, (__force void *)netmem);
 
-	if (!ret) {
+	if (likely(!ret)) {
+		page_pool_dma_sync_for_device_rcu(pool, netmem, dma_sync_size);
 		recycle_stat_inc(pool, ring);
-		return true;
 	}
 
-	return false;
+	return !ret;
 }
 
 /* Only allow direct recycling in special circumstances, into the
@@ -1083,10 +1097,10 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
 	if (likely(__page_pool_page_can_be_recycled(netmem))) {
 		/* Read barrier done in page_ref_count / READ_ONCE */
 
-		page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
-
-		if (allow_direct && page_pool_recycle_in_cache(netmem, pool))
+		if (allow_direct && page_pool_recycle_in_cache(netmem, pool)) {
+			page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
 			return 0;
+		}
 
 		/* Page found as candidate for recycling */
 		return netmem;
@@ -1149,7 +1163,7 @@ void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
 
 	netmem =
 		__page_pool_put_page(pool, netmem, dma_sync_size, allow_direct);
-	if (netmem && !page_pool_recycle_in_ring(pool, netmem)) {
+	if (netmem && !page_pool_recycle_in_ring(pool, netmem, dma_sync_size)) {
 		/* Cache full, fallback to free pages */
 		recycle_stat_inc(pool, ring_full);
 		page_pool_return_page(pool, netmem);
@@ -1175,14 +1189,17 @@ static void page_pool_recycle_ring_bulk(struct page_pool *pool,
 	/* Bulk produce into ptr_ring page_pool cache */
 	in_softirq = page_pool_producer_lock(pool);
 
+	rcu_read_lock();
 	for (i = 0; i < bulk_len; i++) {
 		if (__ptr_ring_produce(&pool->ring, (__force void *)bulk[i])) {
 			/* ring full */
 			recycle_stat_inc(pool, ring_full);
 			break;
 		}
+		page_pool_dma_sync_for_device(pool, (__force netmem_ref)bulk[i],
+					      -1);
 	}
-
+	rcu_read_unlock();
 	page_pool_producer_unlock(pool, in_softirq);
 	recycle_stat_add(pool, ring, i);
 
@@ -1489,6 +1506,16 @@ void page_pool_destroy(struct page_pool *pool)
 	if (!page_pool_release(pool))
 		return;
 
+	/* After page_pool_destroy() is called, the page is not expected to be
+	 * recycled back to pool->alloc cache and dma sync operation is not
+	 * needed when the page is not recyclable or pool->ring is full, so only
+	 * skip the dma sync operation for the inflight pages by clearing the
+	 * pool->dma_sync, and the synchronize_rcu() is paired with rcu lock in
+	 * page_pool_recycle_in_ring() to ensure that there is no dma sync
+	 * operation called after page_pool_destroy() returns.
+	 */
+	pool->dma_sync = false;
+
 	/* Paired with rcu lock in page_pool_napi_local() to enable clearing
 	 * of pool->p.napi in page_pool_disable_direct_recycling() is seen
 	 * before returning to driver to free the napi instance.
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH net-next v7 6/8] page_pool: use list instead of ptr_ring for ring cache
  2025-01-10 13:06 [PATCH net-next v7 0/8] fix two bugs related to page_pool Yunsheng Lin
                   ` (4 preceding siblings ...)
  2025-01-10 13:06 ` [PATCH net-next v7 5/8] page_pool: skip dma sync operation for " Yunsheng Lin
@ 2025-01-10 13:07 ` Yunsheng Lin
  2025-01-10 13:07 ` [PATCH net-next v7 7/8] page_pool: batch refilling pages to reduce atomic operation Yunsheng Lin
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 31+ messages in thread
From: Yunsheng Lin @ 2025-01-10 13:07 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Yunsheng Lin, Robin Murphy,
	Alexander Duyck, IOMMU, Jesper Dangaard Brouer, Ilias Apalodimas,
	Eric Dumazet, Simon Horman, netdev, linux-kernel

The 'struct page_pool_item' added to fix the DMA API misuse
problem introduces some performance and memory overhead.

To avoid part of that overhead, reuse the 'state' field of
'struct page_pool_item' as a pointer to the next item, since
only the lower 2 bits of 'state' are actually used. As there
is only one producer due to the NAPI context protection,
multiple consumers can be allowed using a lockless operation
similar to llist in llist.h.

Testing shows about a 10ns~20ns improvement in the
'time_bench_page_pool02_ptr_ring' test case.
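
To illustrate the encoding, a minimal sketch is shown below; the
helper names are only for the example, the real accessors are the
page_pool_item_set_next()/page_pool_item_get_next() macros added in
this patch:

/* The lower 2 bits of 'encoded_next' carry the item state
 * (USED/MAPPED); the remaining bits carry the pointer to the next
 * item.  This works because items are aligned to more than 4 bytes,
 * so a valid item pointer never has its lower 2 bits set.
 */
#define ITEM_STATE_MASK		0x3UL

static inline void item_set_next(struct page_pool_item *item,
				 struct page_pool_item *next)
{
	item->encoded_next = (item->encoded_next & ITEM_STATE_MASK) |
			     (unsigned long)next;
}

static inline struct page_pool_item *
item_get_next(const struct page_pool_item *item)
{
	return (struct page_pool_item *)
		(item->encoded_next & ~ITEM_STATE_MASK);
}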

CC: Robin Murphy <robin.murphy@arm.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: IOMMU <iommu@lists.linux.dev>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 include/net/page_pool/types.h |  16 ++--
 net/core/page_pool.c          | 133 ++++++++++++++++++----------------
 2 files changed, 82 insertions(+), 67 deletions(-)

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 844a7f5ba87a..f4903aa3c7c2 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -4,7 +4,6 @@
 #define _NET_PAGE_POOL_TYPES_H
 
 #include <linux/dma-direction.h>
-#include <linux/ptr_ring.h>
 #include <linux/types.h>
 #include <net/netmem.h>
 
@@ -147,7 +146,10 @@ struct page_pool_stats {
 #endif
 
 struct page_pool_item {
-	unsigned long state;
+	/* An 'encoded_next' is a pointer to the next item; the lower 2 bits are
+	 * used to indicate the state of the current item.
+	 */
+	unsigned long encoded_next;
 
 	union {
 		netmem_ref pp_netmem;
@@ -155,6 +157,11 @@ struct page_pool_item {
 	};
 };
 
+struct pp_ring_cache {
+	struct page_pool_item *list;
+	atomic_t count;
+};
+
 /* The size of item_block is always PAGE_SIZE, so that the address of item_block
  * for a specific item can be calculated using 'item & PAGE_MASK'
  */
@@ -241,12 +248,9 @@ struct page_pool {
 	 * wise, because free's can happen on remote CPUs, with no
 	 * association with allocation resource.
 	 *
-	 * Use ptr_ring, as it separates consumer and producer
-	 * efficiently, it a way that doesn't bounce cache-lines.
-	 *
 	 * TODO: Implement bulk return pages into this structure.
 	 */
-	struct ptr_ring ring;
+	struct pp_ring_cache ring ____cacheline_aligned_in_smp;
 
 	void *mp_priv;
 
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 232ab56f7fac..5b0b841352f2 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -161,29 +161,6 @@ EXPORT_SYMBOL(page_pool_ethtool_stats_get);
 #define recycle_stat_add(pool, __stat, val)
 #endif
 
-static bool page_pool_producer_lock(struct page_pool *pool)
-	__acquires(&pool->ring.producer_lock)
-{
-	bool in_softirq = in_softirq();
-
-	if (in_softirq)
-		spin_lock(&pool->ring.producer_lock);
-	else
-		spin_lock_bh(&pool->ring.producer_lock);
-
-	return in_softirq;
-}
-
-static void page_pool_producer_unlock(struct page_pool *pool,
-				      bool in_softirq)
-	__releases(&pool->ring.producer_lock)
-{
-	if (in_softirq)
-		spin_unlock(&pool->ring.producer_lock);
-	else
-		spin_unlock_bh(&pool->ring.producer_lock);
-}
-
 static void page_pool_struct_check(void)
 {
 	CACHELINE_ASSERT_GROUP_MEMBER(struct page_pool, frag, frag_users);
@@ -266,14 +243,6 @@ static int page_pool_init(struct page_pool *pool,
 	}
 #endif
 
-	if (ptr_ring_init(&pool->ring, ring_qsize, GFP_KERNEL) < 0) {
-#ifdef CONFIG_PAGE_POOL_STATS
-		if (!pool->system)
-			free_percpu(pool->recycle_stats);
-#endif
-		return -ENOMEM;
-	}
-
 	spin_lock_init(&pool->item_lock);
 	atomic_set(&pool->pages_state_release_cnt, 0);
 
@@ -299,7 +268,7 @@ static int page_pool_init(struct page_pool *pool,
 		if (err) {
 			pr_warn("%s() mem-provider init failed %d\n", __func__,
 				err);
-			goto free_ptr_ring;
+			goto free_stats;
 		}
 
 		static_branch_inc(&page_pool_mem_providers);
@@ -307,8 +276,7 @@ static int page_pool_init(struct page_pool *pool,
 
 	return 0;
 
-free_ptr_ring:
-	ptr_ring_cleanup(&pool->ring, NULL);
+free_stats:
 #ifdef CONFIG_PAGE_POOL_STATS
 	if (!pool->system)
 		free_percpu(pool->recycle_stats);
@@ -318,8 +286,6 @@ static int page_pool_init(struct page_pool *pool,
 
 static void page_pool_uninit(struct page_pool *pool)
 {
-	ptr_ring_cleanup(&pool->ring, NULL);
-
 #ifdef CONFIG_PAGE_POOL_STATS
 	if (!pool->system)
 		free_percpu(pool->recycle_stats);
@@ -328,6 +294,8 @@ static void page_pool_uninit(struct page_pool *pool)
 
 #define PAGE_POOL_ITEM_USED			0
 #define PAGE_POOL_ITEM_MAPPED			1
+#define PAGE_POOL_ITEM_STATE_MASK		(BIT(PAGE_POOL_ITEM_USED) |\
+						 BIT(PAGE_POOL_ITEM_MAPPED))
 
 #define ITEMS_PER_PAGE	((PAGE_SIZE -						\
 			  offsetof(struct page_pool_item_block, items)) /	\
@@ -335,18 +303,18 @@ static void page_pool_uninit(struct page_pool *pool)
 
 #define page_pool_item_init_state(item)					\
 ({									\
-	(item)->state = 0;						\
+	(item)->encoded_next &= ~PAGE_POOL_ITEM_STATE_MASK;		\
 })
 
 #if defined(CONFIG_DEBUG_NET)
 #define page_pool_item_set_used(item)					\
-	__set_bit(PAGE_POOL_ITEM_USED, &(item)->state)
+	__set_bit(PAGE_POOL_ITEM_USED, &(item)->encoded_next)
 
 #define page_pool_item_clear_used(item)					\
-	__clear_bit(PAGE_POOL_ITEM_USED, &(item)->state)
+	__clear_bit(PAGE_POOL_ITEM_USED, &(item)->encoded_next)
 
 #define page_pool_item_is_used(item)					\
-	test_bit(PAGE_POOL_ITEM_USED, &(item)->state)
+	test_bit(PAGE_POOL_ITEM_USED, &(item)->encoded_next)
 #else
 #define page_pool_item_set_used(item)
 #define page_pool_item_clear_used(item)
@@ -354,16 +322,69 @@ static void page_pool_uninit(struct page_pool *pool)
 #endif
 
 #define page_pool_item_set_mapped(item)					\
-	__set_bit(PAGE_POOL_ITEM_MAPPED, &(item)->state)
+	__set_bit(PAGE_POOL_ITEM_MAPPED, &(item)->encoded_next)
 
 /* Only clear_mapped and is_mapped need to be atomic as they can be
  * called concurrently.
  */
 #define page_pool_item_clear_mapped(item)				\
-	clear_bit(PAGE_POOL_ITEM_MAPPED, &(item)->state)
+	clear_bit(PAGE_POOL_ITEM_MAPPED, &(item)->encoded_next)
 
 #define page_pool_item_is_mapped(item)					\
-	test_bit(PAGE_POOL_ITEM_MAPPED, &(item)->state)
+	test_bit(PAGE_POOL_ITEM_MAPPED, &(item)->encoded_next)
+
+#define page_pool_item_set_next(item, next)				\
+({									\
+	struct page_pool_item *__item = item;				\
+									\
+	__item->encoded_next &= PAGE_POOL_ITEM_STATE_MASK;		\
+	__item->encoded_next |= (unsigned long)(next);			\
+})
+
+#define page_pool_item_get_next(item)					\
+({									\
+	struct page_pool_item *__next;					\
+									\
+	__next = (struct page_pool_item *)				\
+		((item)->encoded_next & ~PAGE_POOL_ITEM_STATE_MASK);	\
+	__next;								\
+})
+
+static bool __page_pool_recycle_in_ring(struct page_pool *pool,
+					netmem_ref netmem)
+{
+	struct page_pool_item *item, *list;
+
+	if (unlikely(atomic_read(&pool->ring.count) > pool->p.pool_size))
+		return false;
+
+	item = netmem_get_pp_item(netmem);
+	list = READ_ONCE(pool->ring.list);
+
+	do {
+		page_pool_item_set_next(item, list);
+	} while (!try_cmpxchg(&pool->ring.list, &list, item));
+
+	atomic_inc(&pool->ring.count);
+	return true;
+}
+
+static netmem_ref page_pool_consume_ring(struct page_pool *pool)
+{
+	struct page_pool_item *next, *list;
+
+	list = READ_ONCE(pool->ring.list);
+
+	do {
+		if (unlikely(!list))
+			return (__force netmem_ref)0;
+
+		next = page_pool_item_get_next(list);
+	} while (!try_cmpxchg(&pool->ring.list, &list, next));
+
+	atomic_dec(&pool->ring.count);
+	return list->pp_netmem;
+}
 
 static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
 							 netmem_ref netmem,
@@ -660,12 +681,11 @@ static void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem,
 
 static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
 {
-	struct ptr_ring *r = &pool->ring;
 	netmem_ref netmem;
 	int pref_nid; /* preferred NUMA node */
 
 	/* Quicker fallback, avoid locks when ring is empty */
-	if (__ptr_ring_empty(r)) {
+	if (unlikely(!READ_ONCE(pool->ring.list))) {
 		alloc_stat_inc(pool, empty);
 		return 0;
 	}
@@ -682,7 +702,7 @@ static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
 
 	/* Refill alloc array, but only if NUMA match */
 	do {
-		netmem = (__force netmem_ref)__ptr_ring_consume(r);
+		netmem = page_pool_consume_ring(pool);
 		if (unlikely(!netmem))
 			break;
 
@@ -1032,19 +1052,14 @@ static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
 static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem,
 				      unsigned int dma_sync_size)
 {
-	int ret;
-	/* BH protection not needed if current is softirq */
-	if (in_softirq())
-		ret = ptr_ring_produce(&pool->ring, (__force void *)netmem);
-	else
-		ret = ptr_ring_produce_bh(&pool->ring, (__force void *)netmem);
+	bool ret = __page_pool_recycle_in_ring(pool, netmem);
 
-	if (likely(!ret)) {
+	if (likely(ret)) {
 		page_pool_dma_sync_for_device_rcu(pool, netmem, dma_sync_size);
 		recycle_stat_inc(pool, ring);
 	}
 
-	return !ret;
+	return ret;
 }
 
 /* Only allow direct recycling in special circumstances, into the
@@ -1183,15 +1198,12 @@ static void page_pool_recycle_ring_bulk(struct page_pool *pool,
 					netmem_ref *bulk,
 					u32 bulk_len)
 {
-	bool in_softirq;
 	u32 i;
 
-	/* Bulk produce into ptr_ring page_pool cache */
-	in_softirq = page_pool_producer_lock(pool);
-
 	rcu_read_lock();
 	for (i = 0; i < bulk_len; i++) {
-		if (__ptr_ring_produce(&pool->ring, (__force void *)bulk[i])) {
+		if (!__page_pool_recycle_in_ring(pool,
+						 (__force netmem_ref)bulk[i])) {
 			/* ring full */
 			recycle_stat_inc(pool, ring_full);
 			break;
@@ -1200,7 +1212,6 @@ static void page_pool_recycle_ring_bulk(struct page_pool *pool,
 					      -1);
 	}
 	rcu_read_unlock();
-	page_pool_producer_unlock(pool, in_softirq);
 	recycle_stat_add(pool, ring, i);
 
 	/* Hopefully all pages were returned into ptr_ring */
@@ -1370,7 +1381,7 @@ static void page_pool_empty_ring(struct page_pool *pool)
 	netmem_ref netmem;
 
 	/* Empty recycle ring */
-	while ((netmem = (__force netmem_ref)ptr_ring_consume_bh(&pool->ring))) {
+	while ((netmem = page_pool_consume_ring(pool))) {
 		/* Verify the refcnt invariant of cached pages */
 		if (!(netmem_ref_count(netmem) == 1))
 			pr_crit("%s() page_pool refcnt %d violation\n",
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH net-next v7 7/8] page_pool: batch refilling pages to reduce atomic operation
  2025-01-10 13:06 [PATCH net-next v7 0/8] fix two bugs related to page_pool Yunsheng Lin
                   ` (5 preceding siblings ...)
  2025-01-10 13:07 ` [PATCH net-next v7 6/8] page_pool: use list instead of ptr_ring for ring cache Yunsheng Lin
@ 2025-01-10 13:07 ` Yunsheng Lin
  2025-01-10 13:07 ` [PATCH net-next v7 8/8] page_pool: use list instead of array for alloc cache Yunsheng Lin
  2025-01-14 14:31 ` [PATCH net-next v7 0/8] fix two bugs related to page_pool Jesper Dangaard Brouer
  8 siblings, 0 replies; 31+ messages in thread
From: Yunsheng Lin @ 2025-01-10 13:07 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Yunsheng Lin, Robin Murphy,
	Alexander Duyck, IOMMU, Jesper Dangaard Brouer, Ilias Apalodimas,
	Eric Dumazet, Simon Horman, netdev, linux-kernel

Add a refill variable to the alloc cache to keep a batch of
refilled pages, avoiding an atomic operation for each page.

Testing shows about a 10ns improvement in the
'time_bench_page_pool02_ptr_ring' test case.
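
A rough sketch of the batching idea, simplified from the refill path
changed below ('taken' is just a local name for the example and the
NUMA check is omitted):

struct page_pool_item *refill = pool->alloc.refill;
int taken = 0;

/* Detach the whole ring list with a single xchg() when the local
 * batch runs out, then walk it without further atomics.
 */
if (!refill)
	refill = xchg(&pool->ring.list, NULL);

while (refill && taken < PP_ALLOC_CACHE_REFILL) {
	pool->alloc.cache[pool->alloc.count++] = refill->pp_netmem;
	refill = page_pool_item_get_next(refill);
	taken++;
}

/* Keep the remainder for the next refill and adjust the shared
 * counter once per batch instead of once per page.
 */
pool->alloc.refill = refill;
atomic_sub(taken, &pool->ring.count);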

CC: Robin Murphy <robin.murphy@arm.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: IOMMU <iommu@lists.linux.dev>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 include/net/page_pool/types.h |  5 +++++
 net/core/page_pool.c          | 25 +++++++++++++++++++++----
 2 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index f4903aa3c7c2..d01c5d26cd56 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -51,6 +51,11 @@
 struct pp_alloc_cache {
 	u32 count;
 	netmem_ref cache[PP_ALLOC_CACHE_SIZE];
+
+	/* Keep batched refilled pages here to avoid doing the atomic operation
+	 * for each page.
+	 */
+	struct page_pool_item *refill;
 };
 
 /**
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 5b0b841352f2..eb18ab0999e6 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -681,11 +681,13 @@ static void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem,
 
 static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
 {
+	struct page_pool_item *refill;
 	netmem_ref netmem;
 	int pref_nid; /* preferred NUMA node */
 
 	/* Quicker fallback, avoid locks when ring is empty */
-	if (unlikely(!READ_ONCE(pool->ring.list))) {
+	refill = pool->alloc.refill;
+	if (unlikely(!refill && !READ_ONCE(pool->ring.list))) {
 		alloc_stat_inc(pool, empty);
 		return 0;
 	}
@@ -702,10 +704,14 @@ static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
 
 	/* Refill alloc array, but only if NUMA match */
 	do {
-		netmem = page_pool_consume_ring(pool);
-		if (unlikely(!netmem))
-			break;
+		if (unlikely(!refill)) {
+			refill = xchg(&pool->ring.list, NULL);
+			if (!refill)
+				break;
+		}
 
+		netmem = refill->pp_netmem;
+		refill = page_pool_item_get_next(refill);
 		if (likely(netmem_is_pref_nid(netmem, pref_nid))) {
 			pool->alloc.cache[pool->alloc.count++] = netmem;
 		} else {
@@ -715,14 +721,18 @@ static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
 			 * This limit stress on page buddy alloactor.
 			 */
 			__page_pool_return_page(pool, netmem, false);
+			atomic_dec(&pool->ring.count);
 			alloc_stat_inc(pool, waive);
 			netmem = 0;
 			break;
 		}
 	} while (pool->alloc.count < PP_ALLOC_CACHE_REFILL);
 
+	pool->alloc.refill = refill;
+
 	/* Return last page */
 	if (likely(pool->alloc.count > 0)) {
+		atomic_sub(pool->alloc.count, &pool->ring.count);
 		netmem = pool->alloc.cache[--pool->alloc.count];
 		alloc_stat_inc(pool, refill);
 	}
@@ -1410,6 +1420,7 @@ static void __page_pool_destroy(struct page_pool *pool)
 
 static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
 {
+	struct page_pool_item *refill;
 	netmem_ref netmem;
 
 	if (pool->destroy_cnt)
@@ -1423,6 +1434,12 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
 		netmem = pool->alloc.cache[--pool->alloc.count];
 		page_pool_return_page(pool, netmem);
 	}
+
+	while ((refill = pool->alloc.refill)) {
+		pool->alloc.refill = page_pool_item_get_next(refill);
+		page_pool_return_page(pool, refill->pp_netmem);
+		atomic_dec(&pool->ring.count);
+	}
 }
 
 static void page_pool_scrub(struct page_pool *pool)
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH net-next v7 8/8] page_pool: use list instead of array for alloc cache
  2025-01-10 13:06 [PATCH net-next v7 0/8] fix two bugs related to page_pool Yunsheng Lin
                   ` (6 preceding siblings ...)
  2025-01-10 13:07 ` [PATCH net-next v7 7/8] page_pool: batch refilling pages to reduce atomic operation Yunsheng Lin
@ 2025-01-10 13:07 ` Yunsheng Lin
  2025-01-14 14:31 ` [PATCH net-next v7 0/8] fix two bugs related to page_pool Jesper Dangaard Brouer
  8 siblings, 0 replies; 31+ messages in thread
From: Yunsheng Lin @ 2025-01-10 13:07 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Yunsheng Lin, Robin Murphy,
	Alexander Duyck, IOMMU, Jesper Dangaard Brouer, Ilias Apalodimas,
	Eric Dumazet, Simon Horman, netdev, linux-kernel

As the alloc cache is always protected by the NAPI context,
use encoded_next as a pointer to the next item to avoid
using the array.

Testing shows about a 3ns improvement in the
'time_bench_page_pool01_fast_path' test case.
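
A simplified sketch of the alloc cache as a list threaded through
'encoded_next'; the helper names are only for the example, the real
helpers are __page_pool_recycle_in_alloc() and
__page_pool_consume_alloc() below. No atomics are needed as the cache
is only touched from the NAPI context owning the pool:

static void alloc_cache_push(struct page_pool *pool, netmem_ref netmem)
{
	struct page_pool_item *item = netmem_get_pp_item(netmem);

	page_pool_item_set_next(item, pool->alloc.list);
	pool->alloc.list = item;
	pool->alloc.count++;
}

static netmem_ref alloc_cache_pop(struct page_pool *pool)
{
	struct page_pool_item *item = pool->alloc.list;

	pool->alloc.list = page_pool_item_get_next(item);
	pool->alloc.count--;
	return item->pp_netmem;
}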

CC: Robin Murphy <robin.murphy@arm.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: IOMMU <iommu@lists.linux.dev>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 include/net/page_pool/types.h |  1 +
 net/core/page_pool.c          | 59 ++++++++++++++++++++++++++++-------
 2 files changed, 48 insertions(+), 12 deletions(-)

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index d01c5d26cd56..ff6ebacaa57a 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -50,6 +50,7 @@
 #define PP_ALLOC_CACHE_REFILL	64
 struct pp_alloc_cache {
 	u32 count;
+	struct page_pool_item *list;
 	netmem_ref cache[PP_ALLOC_CACHE_SIZE];
 
 	/* Keep batched refilled pages here to avoid doing the atomic operation
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index eb18ab0999e6..459f783a354a 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -386,6 +386,27 @@ static netmem_ref page_pool_consume_ring(struct page_pool *pool)
 	return list->pp_netmem;
 }
 
+static netmem_ref __page_pool_consume_alloc(struct page_pool *pool)
+{
+	struct page_pool_item *item = pool->alloc.list;
+
+	pool->alloc.list = page_pool_item_get_next(item);
+	pool->alloc.count--;
+
+	return item->pp_netmem;
+}
+
+static void __page_pool_recycle_in_alloc(struct page_pool *pool,
+					 netmem_ref netmem)
+{
+	struct page_pool_item *item;
+
+	item = netmem_get_pp_item(netmem);
+	page_pool_item_set_next(item, pool->alloc.list);
+	pool->alloc.list = item;
+	pool->alloc.count++;
+}
+
 static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
 							 netmem_ref netmem,
 							 bool destroyed)
@@ -681,10 +702,12 @@ static void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem,
 
 static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
 {
-	struct page_pool_item *refill;
+	struct page_pool_item *refill, *alloc, *curr;
 	netmem_ref netmem;
 	int pref_nid; /* preferred NUMA node */
 
+	DEBUG_NET_WARN_ON_ONCE(pool->alloc.count || pool->alloc.list);
+
 	/* Quicker fallback, avoid locks when ring is empty */
 	refill = pool->alloc.refill;
 	if (unlikely(!refill && !READ_ONCE(pool->ring.list))) {
@@ -702,6 +725,7 @@ static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
 	pref_nid = numa_mem_id(); /* will be zero like page_to_nid() */
 #endif
 
+	alloc = NULL;
 	/* Refill alloc array, but only if NUMA match */
 	do {
 		if (unlikely(!refill)) {
@@ -710,10 +734,13 @@ static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
 				break;
 		}
 
+		curr = refill;
 		netmem = refill->pp_netmem;
 		refill = page_pool_item_get_next(refill);
 		if (likely(netmem_is_pref_nid(netmem, pref_nid))) {
-			pool->alloc.cache[pool->alloc.count++] = netmem;
+			page_pool_item_set_next(curr, alloc);
+			pool->alloc.count++;
+			alloc = curr;
 		} else {
 			/* NUMA mismatch;
 			 * (1) release 1 page to page-allocator and
@@ -733,7 +760,9 @@ static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
 	/* Return last page */
 	if (likely(pool->alloc.count > 0)) {
 		atomic_sub(pool->alloc.count, &pool->ring.count);
-		netmem = pool->alloc.cache[--pool->alloc.count];
+		netmem = alloc->pp_netmem;
+		pool->alloc.list = page_pool_item_get_next(alloc);
+		pool->alloc.count--;
 		alloc_stat_inc(pool, refill);
 	}
 
@@ -748,7 +777,7 @@ static netmem_ref __page_pool_get_cached(struct page_pool *pool)
 	/* Caller MUST guarantee safe non-concurrent access, e.g. softirq */
 	if (likely(pool->alloc.count)) {
 		/* Fast-path */
-		netmem = pool->alloc.cache[--pool->alloc.count];
+		netmem = __page_pool_consume_alloc(pool);
 		alloc_stat_inc(pool, fast);
 	} else {
 		netmem = page_pool_refill_alloc_cache(pool);
@@ -867,6 +896,7 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
 static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool,
 							gfp_t gfp)
 {
+	struct page_pool_item *curr, *alloc = NULL;
 	const int bulk = PP_ALLOC_CACHE_REFILL;
 	unsigned int pp_order = pool->p.order;
 	bool dma_map = pool->dma_map;
@@ -877,9 +907,8 @@ static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool,
 	if (unlikely(pp_order))
 		return page_to_netmem(__page_pool_alloc_page_order(pool, gfp));
 
-	/* Unnecessary as alloc cache is empty, but guarantees zero count */
-	if (unlikely(pool->alloc.count > 0))
-		return pool->alloc.cache[--pool->alloc.count];
+	/* alloc cache should be empty */
+	DEBUG_NET_WARN_ON_ONCE(pool->alloc.count || pool->alloc.list);
 
 	/* Mark empty alloc.cache slots "empty" for alloc_pages_bulk_array */
 	memset(&pool->alloc.cache, 0, sizeof(void *) * bulk);
@@ -907,7 +936,11 @@ static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool,
 			continue;
 		}
 
-		pool->alloc.cache[pool->alloc.count++] = netmem;
+		curr = netmem_get_pp_item(netmem);
+		page_pool_item_set_next(curr, alloc);
+		pool->alloc.count++;
+		alloc = curr;
+
 		/* Track how many pages are held 'in-flight' */
 		pool->pages_state_hold_cnt++;
 		trace_page_pool_state_hold(pool, netmem,
@@ -916,7 +949,9 @@ static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool,
 
 	/* Return last page */
 	if (likely(pool->alloc.count > 0)) {
-		netmem = pool->alloc.cache[--pool->alloc.count];
+		netmem = alloc->pp_netmem;
+		pool->alloc.list = page_pool_item_get_next(alloc);
+		pool->alloc.count--;
 		alloc_stat_inc(pool, slow);
 	} else {
 		netmem = 0;
@@ -1086,7 +1121,7 @@ static bool page_pool_recycle_in_cache(netmem_ref netmem,
 	}
 
 	/* Caller MUST have verified/know (page_ref_count(page) == 1) */
-	pool->alloc.cache[pool->alloc.count++] = netmem;
+	__page_pool_recycle_in_alloc(pool, netmem);
 	recycle_stat_inc(pool, cached);
 	return true;
 }
@@ -1431,7 +1466,7 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
 	 * call concurrently.
 	 */
 	while (pool->alloc.count) {
-		netmem = pool->alloc.cache[--pool->alloc.count];
+		netmem = __page_pool_consume_alloc(pool);
 		page_pool_return_page(pool, netmem);
 	}
 
@@ -1571,7 +1606,7 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
 
 	/* Flush pool alloc cache, as refill will check NUMA node */
 	while (pool->alloc.count) {
-		netmem = pool->alloc.cache[--pool->alloc.count];
+		netmem = __page_pool_consume_alloc(pool);
 		__page_pool_return_page(pool, netmem, false);
 	}
 }
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v7 2/8] page_pool: fix timing for checking and disabling napi_local
  2025-01-10 13:06 ` [PATCH net-next v7 2/8] page_pool: fix timing for checking and disabling napi_local Yunsheng Lin
@ 2025-01-10 15:40   ` Toke Høiland-Jørgensen
  2025-01-11  5:24     ` Yunsheng Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-01-10 15:40 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Yunsheng Lin,
	Alexander Lobakin, Xuan Zhuo, Jesper Dangaard Brouer,
	Ilias Apalodimas, Eric Dumazet, Simon Horman, netdev,
	linux-kernel

Yunsheng Lin <linyunsheng@huawei.com> writes:

> page_pool page may be freed from skb_defer_free_flush() in
> softirq context without binding to any specific napi, it
> may cause use-after-free problem due to the below time window,
> as below, CPU1 may still access napi->list_owner after CPU0
> free the napi memory:
>
>             CPU 0                           CPU1
>       page_pool_destroy()          skb_defer_free_flush()
>              .                               .
>              .                napi = READ_ONCE(pool->p.napi);
>              .                               .
> page_pool_disable_direct_recycling()         .
>    driver free napi memory                   .
>              .                               .
>              .       napi && READ_ONCE(napi->list_owner) == cpuid
>              .                               .

Have you actually observed this happen, or are you just speculating?
Because I don't think it can; deleting a NAPI instance already requires
observing an RCU grace period, cf netdevice.h:

/**
 *  __netif_napi_del - remove a NAPI context
 *  @napi: NAPI context
 *
 * Warning: caller must observe RCU grace period before freeing memory
 * containing @napi. Drivers might want to call this helper to combine
 * all the needed RCU grace periods into a single one.
 */
void __netif_napi_del(struct napi_struct *napi);

/**
 *  netif_napi_del - remove a NAPI context
 *  @napi: NAPI context
 *
 *  netif_napi_del() removes a NAPI context from the network device NAPI list
 */
static inline void netif_napi_del(struct napi_struct *napi)
{
	__netif_napi_del(napi);
	synchronize_net();
}


> Use rcu mechanism to avoid the above problem.
>
> Note, the above was found during code reviewing on how to fix
> the problem in [1].
>
> As the following IOMMU fix patch depends on synchronize_rcu()
> added in this patch and the time window is so small that it
> doesn't seem to be an urgent fix, so target the net-next as
> the IOMMU fix patch does.
>
> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
>
> Fixes: dd64b232deb8 ("page_pool: unlink from napi during destroy")
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> CC: Alexander Lobakin <aleksander.lobakin@intel.com>
> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
>  net/core/page_pool.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 9733206d6406..1aa7b93bdcc8 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -799,6 +799,7 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
>  static bool page_pool_napi_local(const struct page_pool *pool)
>  {
>  	const struct napi_struct *napi;
> +	bool napi_local;
>  	u32 cpuid;
>  
>  	if (unlikely(!in_softirq()))
> @@ -814,9 +815,15 @@ static bool page_pool_napi_local(const struct page_pool *pool)
>  	if (READ_ONCE(pool->cpuid) == cpuid)
>  		return true;
>  
> +	/* Synchronizated with page_pool_destory() to avoid use-after-free
> +	 * for 'napi'.
> +	 */
> +	rcu_read_lock();
>  	napi = READ_ONCE(pool->p.napi);
> +	napi_local = napi && READ_ONCE(napi->list_owner) == cpuid;
> +	rcu_read_unlock();

This rcu_read_lock/unlock() pair is redundant in the context you mention
above, since skb_defer_free_flush() is only ever called from softirq
context (within local_bh_disable()), which already functions as an RCU
read lock.

> -	return napi && READ_ONCE(napi->list_owner) == cpuid;
> +	return napi_local;
>  }
>  
>  void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
> @@ -1165,6 +1172,12 @@ void page_pool_destroy(struct page_pool *pool)
>  	if (!page_pool_release(pool))
>  		return;
>  
> +	/* Paired with rcu lock in page_pool_napi_local() to enable clearing
> +	 * of pool->p.napi in page_pool_disable_direct_recycling() is seen
> +	 * before returning to driver to free the napi instance.
> +	 */
> +	synchronize_rcu();

Most drivers call page_pool_destroy() in a loop for each RX queue, so
now you're introducing a full synchronize_rcu() wait for each queue.
That can delay tearing down the device significantly, so I don't think
this is a good idea.

-Toke


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v7 2/8] page_pool: fix timing for checking and disabling napi_local
  2025-01-10 15:40   ` Toke Høiland-Jørgensen
@ 2025-01-11  5:24     ` Yunsheng Lin
  2025-01-14 13:03       ` Yunsheng Lin
  2025-01-20 11:24       ` Toke Høiland-Jørgensen
  0 siblings, 2 replies; 31+ messages in thread
From: Yunsheng Lin @ 2025-01-11  5:24 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Yunsheng Lin, davem, kuba,
	pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Alexander Lobakin,
	Xuan Zhuo, Jesper Dangaard Brouer, Ilias Apalodimas, Eric Dumazet,
	Simon Horman, netdev, linux-kernel

On 1/10/2025 11:40 PM, Toke Høiland-Jørgensen wrote:
> Yunsheng Lin <linyunsheng@huawei.com> writes:
> 
>> page_pool page may be freed from skb_defer_free_flush() in
>> softirq context without binding to any specific napi, it
>> may cause use-after-free problem due to the below time window,
>> as below, CPU1 may still access napi->list_owner after CPU0
>> free the napi memory:
>>
>>              CPU 0                           CPU1
>>        page_pool_destroy()          skb_defer_free_flush()
>>               .                               .
>>               .                napi = READ_ONCE(pool->p.napi);
>>               .                               .
>> page_pool_disable_direct_recycling()         .
>>     driver free napi memory                   .
>>               .                               .
>>               .       napi && READ_ONCE(napi->list_owner) == cpuid
>>               .                               .
> 
> Have you actually observed this happen, or are you just speculating?

I did not actually observe it happen, but I added some delay and
pr_err() debugging code in page_pool_napi_local()/page_pool_destroy()
and modified the page_pool test module in [1], which shows it is
indeed possible to trigger if the delay between reading napi and
checking napi->list_owner is long enough.

1. https://patchwork.kernel.org/project/netdevbpf/patch/20240909091913.987826-1-linyunsheng@huawei.com/

> Because I don't think it can; deleting a NAPI instance already requires
> observing an RCU grace period, cf netdevice.h:
> 
> /**
>   *  __netif_napi_del - remove a NAPI context
>   *  @napi: NAPI context
>   *
>   * Warning: caller must observe RCU grace period before freeing memory
>   * containing @napi. Drivers might want to call this helper to combine
>   * all the needed RCU grace periods into a single one.
>   */
> void __netif_napi_del(struct napi_struct *napi);
> 
> /**
>   *  netif_napi_del - remove a NAPI context
>   *  @napi: NAPI context
>   *
>   *  netif_napi_del() removes a NAPI context from the network device NAPI list
>   */
> static inline void netif_napi_del(struct napi_struct *napi)
> {
> 	__netif_napi_del(napi);
> 	synchronize_net();
> }

I am not sure we can reliably depend on the implicit synchronize_net()
above, as netif_napi_del() might not be called before page_pool_destroy()
at all, for example when changing rx_desc_num for a queue, which seems
to be what hns3_set_ringparam() does in the hns3 driver.

> 
> 
>> Use rcu mechanism to avoid the above problem.
>>
>> Note, the above was found during code reviewing on how to fix
>> the problem in [1].
>>
>> As the following IOMMU fix patch depends on synchronize_rcu()
>> added in this patch and the time window is so small that it
>> doesn't seem to be an urgent fix, so target the net-next as
>> the IOMMU fix patch does.
>>
>> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
>>
>> Fixes: dd64b232deb8 ("page_pool: unlink from napi during destroy")
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> CC: Alexander Lobakin <aleksander.lobakin@intel.com>
>> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>> ---
>>   net/core/page_pool.c | 15 ++++++++++++++-
>>   1 file changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>> index 9733206d6406..1aa7b93bdcc8 100644
>> --- a/net/core/page_pool.c
>> +++ b/net/core/page_pool.c
>> @@ -799,6 +799,7 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
>>   static bool page_pool_napi_local(const struct page_pool *pool)
>>   {
>>   	const struct napi_struct *napi;
>> +	bool napi_local;
>>   	u32 cpuid;
>>   
>>   	if (unlikely(!in_softirq()))
>> @@ -814,9 +815,15 @@ static bool page_pool_napi_local(const struct page_pool *pool)
>>   	if (READ_ONCE(pool->cpuid) == cpuid)
>>   		return true;
>>   
>> +	/* Synchronizated with page_pool_destory() to avoid use-after-free
>> +	 * for 'napi'.
>> +	 */
>> +	rcu_read_lock();
>>   	napi = READ_ONCE(pool->p.napi);
>> +	napi_local = napi && READ_ONCE(napi->list_owner) == cpuid;
>> +	rcu_read_unlock();
> 
> This rcu_read_lock/unlock() pair is redundant in the context you mention
> above, since skb_defer_free_flush() is only ever called from softirq
> context (within local_bh_disable()), which already function as an RCU
> read lock.

I thought about it, but I am not sure whether an explicit rcu lock is
needed for the different kernel PREEMPT and RCU configs.
Perhaps use rcu_read_lock_bh_held() to ensure that we are in the
correct context?
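
Something like the below is what I have in mind, only a rough sketch:

/* Rely on the softirq/BH context already acting as an rcu read-side
 * section and only assert it, instead of taking rcu_read_lock() in
 * the hot path.
 */
DEBUG_NET_WARN_ON_ONCE(!rcu_read_lock_bh_held());
napi = READ_ONCE(pool->p.napi);
return napi && READ_ONCE(napi->list_owner) == cpuid;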

> 
>> -	return napi && READ_ONCE(napi->list_owner) == cpuid;
>> +	return napi_local;
>>   }
>>   
>>   void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
>> @@ -1165,6 +1172,12 @@ void page_pool_destroy(struct page_pool *pool)
>>   	if (!page_pool_release(pool))
>>   		return;
>>   
>> +	/* Paired with rcu lock in page_pool_napi_local() to enable clearing
>> +	 * of pool->p.napi in page_pool_disable_direct_recycling() is seen
>> +	 * before returning to driver to free the napi instance.
>> +	 */
>> +	synchronize_rcu();
> 
> Most drivers call page_pool_destroy() in a loop for each RX queue, so
> now you're introducing a full synchronize_rcu() wait for each queue.
> That can delay tearing down the device significantly, so I don't think
> this is a good idea.

synchronize_rcu() is called after page_pool_release(pool), which means
it is only called when there are some inflight pages, so there is not
necessarily a full synchronize_rcu() wait for each queue.

Anyway, it seems there are some cases that need an explicit
synchronize_rcu() and some cases that depend on another API providing
synchronize_rcu() semantics. Maybe we could provide two different APIs
for the two cases, like the netif_napi_del()/__netif_napi_del() APIs do?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v7 2/8] page_pool: fix timing for checking and disabling napi_local
  2025-01-11  5:24     ` Yunsheng Lin
@ 2025-01-14 13:03       ` Yunsheng Lin
  2025-01-20 11:24       ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 31+ messages in thread
From: Yunsheng Lin @ 2025-01-14 13:03 UTC (permalink / raw)
  To: Yunsheng Lin, Toke Høiland-Jørgensen, davem, kuba,
	pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Alexander Lobakin,
	Xuan Zhuo, Jesper Dangaard Brouer, Ilias Apalodimas, Eric Dumazet,
	Simon Horman, netdev, linux-kernel

On 2025/1/11 13:24, Yunsheng Lin wrote:

...

>>>   }
>>>     void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
>>> @@ -1165,6 +1172,12 @@ void page_pool_destroy(struct page_pool *pool)
>>>       if (!page_pool_release(pool))
>>>           return;
>>>   +    /* Paired with rcu lock in page_pool_napi_local() to enable clearing
>>> +     * of pool->p.napi in page_pool_disable_direct_recycling() is seen
>>> +     * before returning to driver to free the napi instance.
>>> +     */
>>> +    synchronize_rcu();
>>
>> Most drivers call page_pool_destroy() in a loop for each RX queue, so
>> now you're introducing a full synchronize_rcu() wait for each queue.
>> That can delay tearing down the device significantly, so I don't think
>> this is a good idea.
> 
> synchronize_rcu() is called after page_pool_release(pool), which means
> it is only called when there are some inflight pages, so there is not
> necessarily a full synchronize_rcu() wait for each queue.
> 
> Anyway, it seems that there are some cases that need explicit
> synchronize_rcu() and some cases depending on the other API providing
> synchronize_rcu() semantics, maybe we provide two diffferent API for
> both cases like the netif_napi_del()/__netif_napi_del() APIs do?

As the synchronize_rcu() is also needed to fix the DMA API misuse problem,
we can not really handle it the way the netif_napi_del()/__netif_napi_del()
APIs do; the best I can think of is something like below:

bool need_sync = false;

for (each queue)
	need_sync |= page_pool_destroy_prepare(queue->pool);

if (need_sync)
	synchronize_rcu();

for (each queue)
	page_pool_destroy_commit(queue->pool);

But I am not sure whether the above is worth the effort for now, as the
synchronize_rcu() is only called for the inflight page case.
Any better idea? If not, maybe we can optimize the above later if
the synchronize_rcu() does turn out to be a problem.

> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v7 0/8] fix two bugs related to page_pool
  2025-01-10 13:06 [PATCH net-next v7 0/8] fix two bugs related to page_pool Yunsheng Lin
                   ` (7 preceding siblings ...)
  2025-01-10 13:07 ` [PATCH net-next v7 8/8] page_pool: use list instead of array for alloc cache Yunsheng Lin
@ 2025-01-14 14:31 ` Jesper Dangaard Brouer
  2025-01-15 11:33   ` Yunsheng Lin
  8 siblings, 1 reply; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2025-01-14 14:31 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Alexander Lobakin,
	Robin Murphy, Alexander Duyck, Andrew Morton, IOMMU, MM,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Matthias Brugger, AngeloGioacchino Del Regno, netdev,
	intel-wired-lan, bpf, linux-kernel, linux-arm-kernel,
	linux-mediatek



On 10/01/2025 14.06, Yunsheng Lin wrote:
> This patchset fix a possible time window problem for page_pool and
> the dma API misuse problem as mentioned in [1], and try to avoid the
> overhead of the fixing using some optimization.
> 
>  From the below performance data, the overhead is not so obvious
> due to performance variations for time_bench_page_pool01_fast_path()
> and time_bench_page_pool02_ptr_ring, and there is about 20ns overhead
> for time_bench_page_pool03_slow() for fixing the bug.
> 

My benchmarking on x86_64 CPUs looks significantly different.
  - CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz

Benchmark (bench_page_pool_simple) results from before and after patchset:

| Test name  | Cycles |       |    |Nanosec |        |       |      % |
| (tasklet_*)| Before | After |diff| Before |  After |  diff | change |
|------------+--------+-------+----+--------+--------+-------+--------|
| fast_path  |     19 |    24 |   5|  5.399 |  6.928 | 1.529 |   28.3 |
| ptr_ring   |     54 |    79 |  25| 15.090 | 21.976 | 6.886 |   45.6 |
| slow       |    238 |   299 |  61| 66.134 | 83.298 |17.164 |   26.0 |
#+TBLFM: $4=$3-$2::$7=$6-$5::$8=(($7/$5)*100);%.1f

My testing above shows clear performance regressions across three
different page_pool operating modes.


Data also available in:
  - https://github.com/xdp-project/xdp-project/blob/main/areas/mem/page_pool07_bench_DMA_fix.org

Raw data below

Before this patchset:

[  157.186644] bench_page_pool_simple: Loaded
[  157.475084] time_bench: Type:for_loop Per elem: 1 cycles(tsc) 0.284 
ns (step:0) - (measurement period time:0.284327440 sec 
time_interval:284327440) - (invoke count:1000000000 tsc_interval:1023590451)
[  162.262752] time_bench: Type:atomic_inc Per elem: 17 cycles(tsc) 
4.769 ns (step:0) - (measurement period time:4.769757001 sec 
time_interval:4769757001) - (invoke count:1000000000 
tsc_interval:17171776113)
[  163.324091] time_bench: Type:lock Per elem: 37 cycles(tsc) 10.431 ns 
(step:0) - (measurement period time:1.043182161 sec 
time_interval:1043182161) - (invoke count:100000000 tsc_interval:3755514465)
[  163.341702] bench_page_pool_simple: 
time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[  163.922466] time_bench: Type:no-softirq-page_pool01 Per elem: 20 
cycles(tsc) 5.713 ns (step:0) - (measurement period time:0.571357387 sec 
time_interval:571357387) - (invoke count:100000000 tsc_interval:2056911063)
[  163.941429] bench_page_pool_simple: 
time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[  165.506796] time_bench: Type:no-softirq-page_pool02 Per elem: 56 
cycles(tsc) 15.560 ns (step:0) - (measurement period time:1.556080558 
sec time_interval:1556080558) - (invoke count:100000000 
tsc_interval:5601960921)
[  165.525978] bench_page_pool_simple: time_bench_page_pool03_slow(): 
Cannot use page_pool fast-path
[  171.811289] time_bench: Type:no-softirq-page_pool03 Per elem: 225 
cycles(tsc) 62.763 ns (step:0) - (measurement period time:6.276301531 
sec time_interval:6276301531) - (invoke count:100000000 
tsc_interval:22594974468)
[  171.830646] bench_page_pool_simple: pp_tasklet_handler(): 
in_serving_softirq fast-path
[  171.838561] bench_page_pool_simple: 
time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[  172.387597] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 
19 cycles(tsc) 5.399 ns (step:0) - (measurement period time:0.539904228 
sec time_interval:539904228) - (invoke count:100000000 
tsc_interval:1943679246)
[  172.407130] bench_page_pool_simple: 
time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[  173.925266] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 
54 cycles(tsc) 15.090 ns (step:0) - (measurement period time:1.509075496 
sec time_interval:1509075496) - (invoke count:100000000 
tsc_interval:5432740575)
[  173.944878] bench_page_pool_simple: time_bench_page_pool03_slow(): 
in_serving_softirq fast-path
[  180.567094] time_bench: Type:tasklet_page_pool03_slow Per elem: 238 
cycles(tsc) 66.134 ns (step:0) - (measurement period time:6.613430605 
sec time_interval:6613430605) - (invoke count:100000000 
tsc_interval:23808654870)



After this patchset:
[  860.519918] bench_page_pool_simple: Loaded
[  860.781605] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.257 ns (step:0) - (measurement period time:0.257573336 sec time_interval:257573336) - (invoke count:1000000000 tsc_interval:927275355)
[  865.613893] time_bench: Type:atomic_inc Per elem: 17 cycles(tsc) 4.814 ns (step:0) - (measurement period time:4.814593429 sec time_interval:4814593429) - (invoke count:1000000000 tsc_interval:17332768494)
[  866.708420] time_bench: Type:lock Per elem: 38 cycles(tsc) 10.763 ns (step:0) - (measurement period time:1.076362960 sec time_interval:1076362960) - (invoke count:100000000 tsc_interval:3874955595)
[  866.726118] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[  867.423572] time_bench: Type:no-softirq-page_pool01 Per elem: 24 cycles(tsc) 6.880 ns (step:0) - (measurement period time:0.688069107 sec time_interval:688069107) - (invoke count:100000000 tsc_interval:2477080260)
[  867.442517] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[  869.436286] time_bench: Type:no-softirq-page_pool02 Per elem: 71 cycles(tsc) 19.844 ns (step:0) - (measurement period time:1.984451929 sec time_interval:1984451929) - (invoke count:100000000 tsc_interval:7144120329)
[  869.455492] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[  877.071437] time_bench: Type:no-softirq-page_pool03 Per elem: 273 cycles(tsc) 76.069 ns (step:0) - (measurement period time:7.606911291 sec time_interval:7606911291) - (invoke count:100000000 tsc_interval:27385252251)
[  877.090762] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[  877.098683] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[  877.800696] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 24 cycles(tsc) 6.928 ns (step:0) - (measurement period time:0.692852876 sec time_interval:692852876) - (invoke count:100000000 tsc_interval:2494303293)
[  877.820224] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[  880.026911] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 79 cycles(tsc) 21.976 ns (step:0) - (measurement period time:2.197615122 sec time_interval:2197615122) - (invoke count:100000000 tsc_interval:7911521190)
[  880.046528] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[  888.385235] time_bench: Type:tasklet_page_pool03_slow Per elem: 299 cycles(tsc) 83.298 ns (step:0) - (measurement period time:8.329893717 sec time_interval:8329893717) - (invoke count:100000000 tsc_interval:29988024696)




> Before this patchset:
> root@(none)$ insmod bench_page_pool_simple.ko
> [  323.367627] bench_page_pool_simple: Loaded
> [  323.448747] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.076997150 sec time_interval:76997150) - (invoke count:100000000 tsc_interval:7699707)
> [  324.812884] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 13.468 ns (step:0) - (measurement period time:1.346855130 sec time_interval:1346855130) - (invoke count:100000000 tsc_interval:134685507)
> [  324.980875] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.010 ns (step:0) - (measurement period time:0.150101270 sec time_interval:150101270) - (invoke count:10000000 tsc_interval:15010120)
> [  325.652195] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.542 ns (step:0) - (measurement period time:0.654213000 sec time_interval:654213000) - (invoke count:100000000 tsc_interval:65421294)
> [  325.669215] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
> [  325.974848] time_bench: Type:no-softirq-page_pool01 Per elem: 2 cycles(tsc) 29.633 ns (step:0) - (measurement period time:0.296338200 sec time_interval:296338200) - (invoke count:10000000 tsc_interval:29633814)
> [  325.993517] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
> [  326.576636] time_bench: Type:no-softirq-page_pool02 Per elem: 5 cycles(tsc) 57.391 ns (step:0) - (measurement period time:0.573911820 sec time_interval:573911820) - (invoke count:10000000 tsc_interval:57391174)
> [  326.595307] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
> [  328.422661] time_bench: Type:no-softirq-page_pool03 Per elem: 18 cycles(tsc) 181.849 ns (step:0) - (measurement period time:1.818495880 sec time_interval:1818495880) - (invoke count:10000000 tsc_interval:181849581)
> [  328.441681] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
> [  328.449584] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
> [  328.755031] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 2 cycles(tsc) 29.632 ns (step:0) - (measurement period time:0.296327910 sec time_interval:296327910) - (invoke count:10000000 tsc_interval:29632785)
> [  328.774308] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
> [  329.578579] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 7 cycles(tsc) 79.523 ns (step:0) - (measurement period time:0.795236560 sec time_interval:795236560) - (invoke count:10000000 tsc_interval:79523650)
> [  329.597769] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
> [  331.507501] time_bench: Type:tasklet_page_pool03_slow Per elem: 19 cycles(tsc) 190.104 ns (step:0) - (measurement period time:1.901047510 sec time_interval:1901047510) - (invoke count:10000000 tsc_interval:190104743)
> 
> After this patchset:
> root@(none)$ insmod bench_page_pool_simple.ko
> [  138.634758] bench_page_pool_simple: Loaded
> [  138.715879] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.076972720 sec time_interval:76972720) - (invoke count:100000000 tsc_interval:7697265)
> [  140.079897] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 13.467 ns (step:0) - (measurement period time:1.346735370 sec time_interval:1346735370) - (invoke count:100000000 tsc_interval:134673531)
> [  140.247841] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.005 ns (step:0) - (measurement period time:0.150055080 sec time_interval:150055080) - (invoke count:10000000 tsc_interval:15005497)
> [  140.919072] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:0.654125000 sec time_interval:654125000) - (invoke count:100000000 tsc_interval:65412493)
> [  140.936091] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
> [  141.246985] time_bench: Type:no-softirq-page_pool01 Per elem: 3 cycles(tsc) 30.159 ns (step:0) - (measurement period time:0.301598160 sec time_interval:301598160) - (invoke count:10000000 tsc_interval:30159812)
> [  141.265654] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
> [  141.976265] time_bench: Type:no-softirq-page_pool02 Per elem: 7 cycles(tsc) 70.140 ns (step:0) - (measurement period time:0.701405780 sec time_interval:701405780) - (invoke count:10000000 tsc_interval:70140573)
> [  141.994933] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
> [  144.018945] time_bench: Type:no-softirq-page_pool03 Per elem: 20 cycles(tsc) 201.514 ns (step:0) - (measurement period time:2.015141210 sec time_interval:2015141210) - (invoke count:10000000 tsc_interval:201514113)
> [  144.037966] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
> [  144.045870] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
> [  144.205045] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 1 cycles(tsc) 15.005 ns (step:0) - (measurement period time:0.150056510 sec time_interval:150056510) - (invoke count:10000000 tsc_interval:15005645)
> [  144.224320] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
> [  144.916044] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 6 cycles(tsc) 68.269 ns (step:0) - (measurement period time:0.682693070 sec time_interval:682693070) - (invoke count:10000000 tsc_interval:68269300)
> [  144.935234] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
> [  146.997684] time_bench: Type:tasklet_page_pool03_slow Per elem: 20 cycles(tsc) 205.376 ns (step:0) - (measurement period time:2.053766310 sec time_interval:2053766310) - (invoke count:10000000 tsc_interval:205376624)
> 
> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
> 
> CC: Alexander Lobakin <aleksander.lobakin@intel.com>
> CC: Robin Murphy <robin.murphy@arm.com>
> CC: Alexander Duyck <alexander.duyck@gmail.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: IOMMU <iommu@lists.linux.dev>
> CC: MM <linux-mm@kvack.org>
> 
> Change log:
> V7:
>    1. Fix a use-after-free bug reported by KASAN, as mentioned by Jakub.
>    2. Fix the bug where the 'netmem' variable was not set up correctly, as
>       mentioned by Simon.
> 
> V6:
>    1. Repost based on latest net-next.
>    2. Rename page_pool_to_pp() to page_pool_get_pp().
> 
> V5:
>    1. Support an unlimited number of inflight pages.
>    2. Add some optimizations to avoid the overhead of the bug fix.
> 
> V4:
>    1. Use scanning to do the unmapping.
>    2. Split dma sync skipping into a separate patch.
> 
> V3:
>    1. Target net-next tree instead of net tree.
>    2. Narrow the rcu lock as the discussion in v2.
>    3. Check the unmapping cnt against the inflight cnt.
> 
> V2:
>    1. Add an item_full stat.
>    2. Use container_of() for page_pool_to_pp().
> 
> Yunsheng Lin (8):
>    page_pool: introduce page_pool_get_pp() API
>    page_pool: fix timing for checking and disabling napi_local
>    page_pool: fix IOMMU crash when driver has already unbound
>    page_pool: support unlimited number of inflight pages
>    page_pool: skip dma sync operation for inflight pages
>    page_pool: use list instead of ptr_ring for ring cache
>    page_pool: batch refilling pages to reduce atomic operation
>    page_pool: use list instead of array for alloc cache
> 
>   drivers/net/ethernet/freescale/fec_main.c     |   8 +-
>   .../ethernet/google/gve/gve_buffer_mgmt_dqo.c |   2 +-
>   drivers/net/ethernet/intel/iavf/iavf_txrx.c   |   6 +-
>   drivers/net/ethernet/intel/idpf/idpf_txrx.c   |  14 +-
>   drivers/net/ethernet/intel/libeth/rx.c        |   2 +-
>   .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |   3 +-
>   drivers/net/netdevsim/netdev.c                |   6 +-
>   drivers/net/wireless/mediatek/mt76/mt76.h     |   2 +-
>   include/linux/mm_types.h                      |   2 +-
>   include/linux/skbuff.h                        |   1 +
>   include/net/libeth/rx.h                       |   3 +-
>   include/net/netmem.h                          |  24 +-
>   include/net/page_pool/helpers.h               |  11 +
>   include/net/page_pool/types.h                 |  64 +-
>   net/core/devmem.c                             |   4 +-
>   net/core/netmem_priv.h                        |   5 +-
>   net/core/page_pool.c                          | 664 ++++++++++++++----
>   net/core/page_pool_priv.h                     |  12 +-
>   18 files changed, 675 insertions(+), 158 deletions(-)
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v7 0/8] fix two bugs related to page_pool
  2025-01-14 14:31 ` [PATCH net-next v7 0/8] fix two bugs related to page_pool Jesper Dangaard Brouer
@ 2025-01-15 11:33   ` Yunsheng Lin
  2025-01-15 17:40     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 31+ messages in thread
From: Yunsheng Lin @ 2025-01-15 11:33 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Alexander Lobakin,
	Robin Murphy, Alexander Duyck, Andrew Morton, IOMMU, MM,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Matthias Brugger, AngeloGioacchino Del Regno, netdev,
	intel-wired-lan, bpf, linux-kernel, linux-arm-kernel,
	linux-mediatek

[-- Attachment #1: Type: text/plain, Size: 8662 bytes --]

On 2025/1/14 22:31, Jesper Dangaard Brouer wrote:
> 
> 
> On 10/01/2025 14.06, Yunsheng Lin wrote:
>> This patchset fix a possible time window problem for page_pool and
>> the dma API misuse problem as mentioned in [1], and try to avoid the
>> overhead of the fixing using some optimization.
>>
>>  From the below performance data, the overhead is not so obvious
>> due to performance variations for time_bench_page_pool01_fast_path()
>> and time_bench_page_pool02_ptr_ring, and there is about 20ns overhead
>> for time_bench_page_pool03_slow() for fixing the bug.
>>
> 
> My benchmarking on x86_64 CPUs looks significantly different.
>  - CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
> 
> Benchmark (bench_page_pool_simple) results from before and after patchset:
> 
> | Test name  | Cycles |       |    |Nanosec |        |       |      % |
> | (tasklet_*)| Before | After |diff| Before |  After |  diff | change |
> |------------+--------+-------+----+--------+--------+-------+--------|
> | fast_path  |     19 |    24 |   5|  5.399 |  6.928 | 1.529 |   28.3 |
> | ptr_ring   |     54 |    79 |  25| 15.090 | 21.976 | 6.886 |   45.6 |
> | slow       |    238 |   299 |  61| 66.134 | 83.298 |17.164 |   26.0 |
> #+TBLFM: $4=$3-$2::$7=$6-$5::$8=(($7/$5)*100);%.1f
> 
> My testing above shows a clear performance regression across three
> different page_pool operating modes.

I retested it on an arm64 server patch by patch; the raw performance
data is in the attachment, and the results seem similar to before.

Before this patchset:
            fast_path              ptr_ring            slow
1.         31.171 ns               60.980 ns          164.917 ns
2.         28.824 ns               60.891 ns          170.241 ns
3.         14.236 ns               60.583 ns          164.355 ns

With patch 1-4:
4.         31.443 ns               53.242 ns          210.148 ns
5.         31.406 ns               53.270 ns          210.189 ns

With patch 1-5:
6.         26.163 ns               53.781 ns          189.450 ns
7.         26.189 ns               53.798 ns          189.466 ns

With patch 1-8:
8.         28.108 ns               68.199 ns          202.516 ns
9.         16.128 ns               55.904 ns          202.711 ns

I am not able to get hold of an x86 server yet; I might be able
to get one during the weekend.

Theoretically, patches 1-4 or 1-5 should not have much performance
impact on fast_path and ptr_ring except for the rcu_lock mentioned
in page_pool_napi_local(), so it would be good if patches 1-5 were also
tested in your testlab with the rcu_lock removed from
page_pool_napi_local().

> 
> 
> Data also available in:
>  - https://github.com/xdp-project/xdp-project/blob/main/areas/mem/page_pool07_bench_DMA_fix.org
> 
> Raw data below
> 
> Before this patchset:
> 
> [  157.186644] bench_page_pool_simple: Loaded
> [  157.475084] time_bench: Type:for_loop Per elem: 1 cycles(tsc) 0.284 ns (step:0) - (measurement period time:0.284327440 sec time_interval:284327440) - (invoke count:1000000000 tsc_interval:1023590451)
> [  162.262752] time_bench: Type:atomic_inc Per elem: 17 cycles(tsc) 4.769 ns (step:0) - (measurement period time:4.769757001 sec time_interval:4769757001) - (invoke count:1000000000 tsc_interval:17171776113)
> [  163.324091] time_bench: Type:lock Per elem: 37 cycles(tsc) 10.431 ns (step:0) - (measurement period time:1.043182161 sec time_interval:1043182161) - (invoke count:100000000 tsc_interval:3755514465)
> [  163.341702] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
> [  163.922466] time_bench: Type:no-softirq-page_pool01 Per elem: 20 cycles(tsc) 5.713 ns (step:0) - (measurement period time:0.571357387 sec time_interval:571357387) - (invoke count:100000000 tsc_interval:2056911063)
> [  163.941429] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
> [  165.506796] time_bench: Type:no-softirq-page_pool02 Per elem: 56 cycles(tsc) 15.560 ns (step:0) - (measurement period time:1.556080558 sec time_interval:1556080558) - (invoke count:100000000 tsc_interval:5601960921)
> [  165.525978] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
> [  171.811289] time_bench: Type:no-softirq-page_pool03 Per elem: 225 cycles(tsc) 62.763 ns (step:0) - (measurement period time:6.276301531 sec time_interval:6276301531) - (invoke count:100000000 tsc_interval:22594974468)
> [  171.830646] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
> [  171.838561] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
> [  172.387597] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 19 cycles(tsc) 5.399 ns (step:0) - (measurement period time:0.539904228 sec time_interval:539904228) - (invoke count:100000000 tsc_interval:1943679246)
> [  172.407130] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
> [  173.925266] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 54 cycles(tsc) 15.090 ns (step:0) - (measurement period time:1.509075496 sec time_interval:1509075496) - (invoke count:100000000 tsc_interval:5432740575)
> [  173.944878] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
> [  180.567094] time_bench: Type:tasklet_page_pool03_slow Per elem: 238 cycles(tsc) 66.134 ns (step:0) - (measurement period time:6.613430605 sec time_interval:6613430605) - (invoke count:100000000 tsc_interval:23808654870)
> 
> 
> 
> After this patchset:
> [  860.519918] bench_page_pool_simple: Loaded
> [  860.781605] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.257 ns (step:0) - (measurement period time:0.257573336 sec time_interval:257573336) - (invoke count:1000000000 tsc_interval:927275355)
> [  865.613893] time_bench: Type:atomic_inc Per elem: 17 cycles(tsc) 4.814 ns (step:0) - (measurement period time:4.814593429 sec time_interval:4814593429) - (invoke count:1000000000 tsc_interval:17332768494)
> [  866.708420] time_bench: Type:lock Per elem: 38 cycles(tsc) 10.763 ns (step:0) - (measurement period time:1.076362960 sec time_interval:1076362960) - (invoke count:100000000 tsc_interval:3874955595)
> [  866.726118] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
> [  867.423572] time_bench: Type:no-softirq-page_pool01 Per elem: 24 cycles(tsc) 6.880 ns (step:0) - (measurement period time:0.688069107 sec time_interval:688069107) - (invoke count:100000000 tsc_interval:2477080260)
> [  867.442517] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
> [  869.436286] time_bench: Type:no-softirq-page_pool02 Per elem: 71 cycles(tsc) 19.844 ns (step:0) - (measurement period time:1.984451929 sec time_interval:1984451929) - (invoke count:100000000 tsc_interval:7144120329)
> [  869.455492] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
> [  877.071437] time_bench: Type:no-softirq-page_pool03 Per elem: 273 cycles(tsc) 76.069 ns (step:0) - (measurement period time:7.606911291 sec time_interval:7606911291) - (invoke count:100000000 tsc_interval:27385252251)
> [  877.090762] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
> [  877.098683] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
> [  877.800696] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 24 cycles(tsc) 6.928 ns (step:0) - (measurement period time:0.692852876 sec time_interval:692852876) - (invoke count:100000000 tsc_interval:2494303293)
> [  877.820224] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
> [  880.026911] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 79 cycles(tsc) 21.976 ns (step:0) - (measurement period time:2.197615122 sec time_interval:2197615122) - (invoke count:100000000 tsc_interval:7911521190)
> [  880.046528] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
> [  888.385235] time_bench: Type:tasklet_page_pool03_slow Per elem: 299 cycles(tsc) 83.298 ns (step:0) - (measurement period time:8.329893717 sec time_interval:8329893717) - (invoke count:100000000 tsc_interval:29988024696)

As mentioned by Toke, we may be able to reduce the performance difference
between the tasklet and non-tasklet testcases by removing the rcu_lock in
page_pool_napi_local() for patch 1, as the in_softirq() check in
page_pool_napi_local() should already ensure an RCU-bh read-side critical
section.
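
To make the idea concrete, below is a rough sketch (not the actual patch; the
helper body is only an approximation of what is in net/core/page_pool.c) of
how the explicit rcu_read_lock()/rcu_read_unlock() pair around the NAPI check
could be dropped when the caller is known to run in softirq context:

/*
 * Sketch only: softirq context (in_softirq() == true here) already acts as
 * an RCU-bh read-side critical section, so under that assumption the napi
 * pointer can be dereferenced without taking a separate rcu_read_lock().
 */
static bool page_pool_napi_local(const struct page_pool *pool)
{
	const struct napi_struct *napi;
	u32 cpuid;

	/* Only the softirq path may use the NAPI-local recycling fast path. */
	if (!in_softirq())
		return false;

	cpuid = smp_processor_id();
	if (READ_ONCE(pool->cpuid) == cpuid)
		return true;

	napi = READ_ONCE(pool->p.napi);
	return napi && READ_ONCE(napi->list_owner) == cpuid;
}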

[-- Attachment #2: pp_inflight_fix_v7_perf_data.txt --]
[-- Type: text/plain, Size: 69039 bytes --]


07ea810753bd Revert "page_pool: introduce page_pool_get_pp() API"
root@(none)$ insmod bench_page_pool_simple.ko loops=100000000
[  118.835127] bench_page_pool_simple: Loaded
[  119.608858] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.769587320 sec time_interval:769587320) - (invoke count:1000000000 tsc_interval:76958720)
[  136.559273] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 16.932 ns (step:0) - (measurement period time:16.932925510 sec time_interval:16932925510) - (invoke count:1000000000 tsc_interval:1693292543)
[  138.078107] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.006 ns (step:0) - (measurement period time:1.500666520 sec time_interval:1500666520) - (invoke count:100000000 tsc_interval:150066646)
[  144.636732] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:6.541323980 sec time_interval:6541323980) - (invoke count:1000000000 tsc_interval:654132391)
[  144.653948] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[  147.780571] time_bench: Type:no-softirq-page_pool01 Per elem: 3 cycles(tsc) 31.173 ns (step:0) - (measurement period time:3.117359810 sec time_interval:3117359810) - (invoke count:100000000 tsc_interval:311735974)
[  147.799427] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[  153.566322] time_bench: Type:no-softirq-page_pool02 Per elem: 5 cycles(tsc) 57.577 ns (step:0) - (measurement period time:5.757708010 sec time_interval:5757708010) - (invoke count:100000000 tsc_interval:575770795)
[  153.585178] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[  171.732446] time_bench: Type:no-softirq-page_pool03 Per elem: 18 cycles(tsc) 181.384 ns (step:0) - (measurement period time:18.138436700 sec time_interval:18138436700) - (invoke count:100000000 tsc_interval:1813843661)
[  171.751744] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[  171.759626] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[  174.885885] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 3 cycles(tsc) 31.171 ns (step:0) - (measurement period time:3.117169710 sec time_interval:3117169710) - (invoke count:100000000 tsc_interval:311716965)
[  174.905345] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[  181.012397] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 6 cycles(tsc) 60.980 ns (step:0) - (measurement period time:6.098047810 sec time_interval:6098047810) - (invoke count:100000000 tsc_interval:609804775)
[  181.031770] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path

[  197.532151] time_bench: Type:tasklet_page_pool03_slow Per elem: 16 cycles(tsc) 164.917 ns (step:0) - (measurement period time:16.491723510 sec time_interval:16491723510) - (invoke count:100000000 tsc_interval:1649172345)
root@(none)$
root@(none)$
root@(none)$ rmmod bench_page_pool_simple.ko
[  209.510186] bench_page_pool_simple: Unloaded
root@(none)$ insmod bench_page_pool_simple.ko loops=100000000
[  210.659129] bench_page_pool_simple: Loaded
[  211.432882] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.769610810 sec time_interval:769610810) - (invoke count:1000000000 tsc_interval:76961072)
[  224.917831] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 13.467 ns (step:0) - (measurement period time:13.467473740 sec time_interval:13467473740) - (invoke count:1000000000 tsc_interval:1346747368)
[  226.436667] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.006 ns (step:0) - (measurement period time:1.500671210 sec time_interval:1500671210) - (invoke count:100000000 tsc_interval:150067117)
[  232.995372] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:6.541405330 sec time_interval:6541405330) - (invoke count:1000000000 tsc_interval:654140528)
[  233.012586] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[  236.139341] time_bench: Type:no-softirq-page_pool01 Per elem: 3 cycles(tsc) 31.174 ns (step:0) - (measurement period time:3.117491630 sec time_interval:3117491630) - (invoke count:100000000 tsc_interval:311749159)
[  236.158197] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[  241.926861] time_bench: Type:no-softirq-page_pool02 Per elem: 5 cycles(tsc) 57.594 ns (step:0) - (measurement period time:5.759481900 sec time_interval:5759481900) - (invoke count:100000000 tsc_interval:575948185)
[  241.945717] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[  259.747779] time_bench: Type:no-softirq-page_pool03 Per elem: 17 cycles(tsc) 177.932 ns (step:0) - (measurement period time:17.793230520 sec time_interval:17793230520) - (invoke count:100000000 tsc_interval:1779323045)
[  259.767070] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[  259.774951] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[  262.901276] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 3 cycles(tsc) 31.172 ns (step:0) - (measurement period time:3.117235450 sec time_interval:3117235450) - (invoke count:100000000 tsc_interval:311723540)
[  262.920737] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[  269.016589] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 6 cycles(tsc) 60.868 ns (step:0) - (measurement period time:6.086848810 sec time_interval:6086848810) - (invoke count:100000000 tsc_interval:608684876)
[  269.035963] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[  285.540301] time_bench: Type:tasklet_page_pool03_slow Per elem: 16 cycles(tsc) 164.956 ns (step:0) - (measurement period time:16.495681400 sec time_interval:16495681400) - (invoke count:100000000 tsc_interval:1649568134)
root@(none)$ cat /proc/version
Linux version 6.13.0-rc6-00905-g07ea810753bd (linyunsheng@localhost.localdomain) (gcc (GCC) 10.3.1, GNU ld (GNU Binutils) 2.37) #295 SMP PREEMPT Wed Jan 15 11:22:27 CST 2025

root@(none)$ insmod bench_page_pool_simple.ko loops=100000000
[  102.478309] bench_page_pool_simple: Loaded
[  103.252061] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.769609840 sec time_interval:769609840) - (invoke count:1000000000 tsc_interval:76960976)
[  116.737122] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 13.467 ns (step:0) - (measurement period time:13.467584160 sec time_interval:13467584160) - (invoke count:1000000000 tsc_interval:1346758411)
[  118.255948] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.006 ns (step:0) - (measurement period time:1.500661720 sec time_interval:1500661720) - (invoke count:100000000 tsc_interval:150066166)
[  124.814672] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:6.541425600 sec time_interval:6541425600) - (invoke count:1000000000 tsc_interval:654142555)
[  124.831887] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[  126.355730] time_bench: Type:no-softirq-page_pool01 Per elem: 1 cycles(tsc) 15.145 ns (step:0) - (measurement period time:1.514579980 sec time_interval:1514579980) - (invoke count:100000000 tsc_interval:151457991)
[  126.374588] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[  132.139818] time_bench: Type:no-softirq-page_pool02 Per elem: 5 cycles(tsc) 57.560 ns (step:0) - (measurement period time:5.756052820 sec time_interval:5756052820) - (invoke count:100000000 tsc_interval:575605276)
[  132.158674] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[  149.943233] time_bench: Type:no-softirq-page_pool03 Per elem: 17 cycles(tsc) 177.757 ns (step:0) - (measurement period time:17.775726280 sec time_interval:17775726280) - (invoke count:100000000 tsc_interval:1777572621)
[  149.962525] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[  149.970407] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[  152.861903] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 2 cycles(tsc) 28.824 ns (step:0) - (measurement period time:2.882405020 sec time_interval:2882405020) - (invoke count:100000000 tsc_interval:288240495)
[  152.881364] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[  158.979512] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 6 cycles(tsc) 60.891 ns (step:0) - (measurement period time:6.089144870 sec time_interval:6089144870) - (invoke count:100000000 tsc_interval:608914482)
[  158.998884] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[  176.031659] time_bench: Type:tasklet_page_pool03_slow Per elem: 17 cycles(tsc) 170.241 ns (step:0) - (measurement period time:17.024117960 sec time_interval:17024117960) - (invoke count:100000000 tsc_interval:1702411789)

root@(none)$ insmod bench_page_pool_simple.ko loops=100000000
[  442.818325] bench_page_pool_simple: Loaded
[  443.592055] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.769610330 sec time_interval:769610330) - (invoke count:1000000000 tsc_interval:76961025)
[  458.439817] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 14.830 ns (step:0) - (measurement period time:14.830285600 sec time_interval:14830285600) - (invoke count:1000000000 tsc_interval:1483028556)
[  459.958698] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.007 ns (step:0) - (measurement period time:1.500714240 sec time_interval:1500714240) - (invoke count:100000000 tsc_interval:150071418)
[  466.517515] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:6.541516880 sec time_interval:6541516880) - (invoke count:1000000000 tsc_interval:654151682)
[  466.534728] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[  468.047027] time_bench: Type:no-softirq-page_pool01 Per elem: 1 cycles(tsc) 15.030 ns (step:0) - (measurement period time:1.503035130 sec time_interval:1503035130) - (invoke count:100000000 tsc_interval:150303507)
[  468.065883] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[  473.829596] time_bench: Type:no-softirq-page_pool02 Per elem: 5 cycles(tsc) 57.545 ns (step:0) - (measurement period time:5.754537290 sec time_interval:5754537290) - (invoke count:100000000 tsc_interval:575453724)
[  473.848452] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[  491.124253] time_bench: Type:no-softirq-page_pool03 Per elem: 17 cycles(tsc) 172.669 ns (step:0) - (measurement period time:17.266968680 sec time_interval:17266968680) - (invoke count:100000000 tsc_interval:1726696861)
[  491.143550] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[  491.151434] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[  493.118656] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 1 cycles(tsc) 19.581 ns (step:0) - (measurement period time:1.958131510 sec time_interval:1958131510) - (invoke count:100000000 tsc_interval:195813143)
[  493.138115] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[  499.227968] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 6 cycles(tsc) 60.808 ns (step:0) - (measurement period time:6.080847450 sec time_interval:6080847450) - (invoke count:100000000 tsc_interval:608084740)
[  499.247339] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[  515.691157] time_bench: Type:tasklet_page_pool03_slow Per elem: 16 cycles(tsc) 164.351 ns (step:0) - (measurement period time:16.435160550 sec time_interval:16435160550) - (invoke count:100000000 tsc_interval:1643516048)
root@(none)$ rmmod bench_page_pool_simple.ko
[  683.197394] bench_page_pool_simple: Unloaded
root@(none)$ insmod bench_page_pool_simple.ko loops=100000000
[  684.374311] bench_page_pool_simple: Loaded
[  685.148035] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.769604180 sec time_interval:769604180) - (invoke count:1000000000 tsc_interval:76960410)
[  698.632947] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 13.467 ns (step:0) - (measurement period time:13.467434190 sec time_interval:13467434190) - (invoke count:1000000000 tsc_interval:1346743412)
[  700.151767] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.006 ns (step:0) - (measurement period time:1.500657020 sec time_interval:1500657020) - (invoke count:100000000 tsc_interval:150065696)
[  706.710339] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:6.541272330 sec time_interval:6541272330) - (invoke count:1000000000 tsc_interval:654127227)
[  706.727553] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[  709.619400] time_bench: Type:no-softirq-page_pool01 Per elem: 2 cycles(tsc) 28.825 ns (step:0) - (measurement period time:2.882584100 sec time_interval:2882584100) - (invoke count:100000000 tsc_interval:288258403)
[  709.638256] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[  715.411633] time_bench: Type:no-softirq-page_pool02 Per elem: 5 cycles(tsc) 57.642 ns (step:0) - (measurement period time:5.764201050 sec time_interval:5764201050) - (invoke count:100000000 tsc_interval:576420099)
[  715.430493] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[  732.168906] time_bench: Type:no-softirq-page_pool03 Per elem: 16 cycles(tsc) 167.295 ns (step:0) - (measurement period time:16.729578200 sec time_interval:16729578200) - (invoke count:100000000 tsc_interval:1672957815)
[  732.188197] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[  732.196078] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[  733.628852] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 1 cycles(tsc) 14.236 ns (step:0) - (measurement period time:1.423682990 sec time_interval:1423682990) - (invoke count:100000000 tsc_interval:142368292)
[  733.648311] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[  739.715700] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 6 cycles(tsc) 60.583 ns (step:0) - (measurement period time:6.058384260 sec time_interval:6058384260) - (invoke count:100000000 tsc_interval:605838420)
[  739.735073] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[  756.179270] time_bench: Type:tasklet_page_pool03_slow Per elem: 16 cycles(tsc) 164.355 ns (step:0) - (measurement period time:16.435539700 sec time_interval:16435539700) - (invoke count:100000000 tsc_interval:1643553963)
root@(none)$ cat /proc/version
Linux version 6.13.0-rc6-00905-g07ea810753bd (linyunsheng@localhost.localdomain) (gcc (GCC) 10.3.1, GNU ld (GNU Binutils) 2.37) #295 SMP PREEMPT Wed Jan 15 11:22:27 CST 2025


c8cd65aea46f (HEAD -> pp-inflight-fix_v6_test) Revert "page_pool: fix IOMMU crash when driver has already unbound"
root@(none)$ insmod bench_page_pool_simple.ko loops=100000000
[  112.284533] bench_page_pool_simple: Loaded
[  113.058250] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.769595440 sec time_interval:769595440) - (invoke count:1000000000 tsc_interval:76959536)
[  126.543325] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 13.467 ns (step:0) - (measurement period time:13.467599580 sec time_interval:13467599580) - (invoke count:1000000000 tsc_interval:1346759954)
[  128.062178] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.006 ns (step:0) - (measurement period time:1.500688700 sec time_interval:1500688700) - (invoke count:100000000 tsc_interval:150068863)
[  134.620885] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:6.541407810 sec time_interval:6541407810) - (invoke count:1000000000 tsc_interval:654140776)
[  134.638100] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[  137.764295] time_bench: Type:no-softirq-page_pool01 Per elem: 3 cycles(tsc) 31.169 ns (step:0) - (measurement period time:3.116932100 sec time_interval:3116932100) - (invoke count:100000000 tsc_interval:311693204)
[  137.783151] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[  143.556498] time_bench: Type:no-softirq-page_pool02 Per elem: 5 cycles(tsc) 57.641 ns (step:0) - (measurement period time:5.764165830 sec time_interval:5764165830) - (invoke count:100000000 tsc_interval:576416578)
[  143.575354] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[  160.391936] time_bench: Type:no-softirq-page_pool03 Per elem: 16 cycles(tsc) 168.077 ns (step:0) - (measurement period time:16.807748380 sec time_interval:16807748380) - (invoke count:100000000 tsc_interval:1680774833)
[  160.411228] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[  160.419110] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[  163.025216] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 2 cycles(tsc) 25.970 ns (step:0) - (measurement period time:2.597014370 sec time_interval:2597014370) - (invoke count:100000000 tsc_interval:259701433)
[  163.044675] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[  169.169341] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 6 cycles(tsc) 61.156 ns (step:0) - (measurement period time:6.115661410 sec time_interval:6115661410) - (invoke count:100000000 tsc_interval:611566136)
[  169.188712] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[  185.721921] time_bench: Type:tasklet_page_pool03_slow Per elem: 16 cycles(tsc) 165.245 ns (step:0) - (measurement period time:16.524552130 sec time_interval:16524552130) - (invoke count:100000000 tsc_interval:1652455208)
root@(none)$ rmmod bench_page_pool_simple.ko
[  228.647567] bench_page_pool_simple: Unloaded
root@(none)$ insmod bench_page_pool_simple.ko loops=100000000
[  229.756515] bench_page_pool_simple: Loaded
[  230.530211] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.769571820 sec time_interval:769571820) - (invoke count:1000000000 tsc_interval:76957172)
[  244.015118] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 13.467 ns (step:0) - (measurement period time:13.467427880 sec time_interval:13467427880) - (invoke count:1000000000 tsc_interval:1346742782)
[  245.533931] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.006 ns (step:0) - (measurement period time:1.500649840 sec time_interval:1500649840) - (invoke count:100000000 tsc_interval:150064979)
[  252.092555] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:6.541318290 sec time_interval:6541318290) - (invoke count:1000000000 tsc_interval:654131824)
[  252.109769] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[  253.543110] time_bench: Type:no-softirq-page_pool01 Per elem: 1 cycles(tsc) 14.240 ns (step:0) - (measurement period time:1.424077550 sec time_interval:1424077550) - (invoke count:100000000 tsc_interval:142407750)
[  253.561963] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[  259.320132] time_bench: Type:no-softirq-page_pool02 Per elem: 5 cycles(tsc) 57.489 ns (step:0) - (measurement period time:5.748989970 sec time_interval:5748989970) - (invoke count:100000000 tsc_interval:574898993)
[  259.338990] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[  276.124086] time_bench: Type:no-softirq-page_pool03 Per elem: 16 cycles(tsc) 167.762 ns (step:0) - (measurement period time:16.776264180 sec time_interval:16776264180) - (invoke count:100000000 tsc_interval:1677626413)
[  276.143377] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[  276.151259] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[  277.584309] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 1 cycles(tsc) 14.239 ns (step:0) - (measurement period time:1.423960790 sec time_interval:1423960790) - (invoke count:100000000 tsc_interval:142396074)
[  277.603769] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[  283.675754] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 6 cycles(tsc) 60.629 ns (step:0) - (measurement period time:6.062981570 sec time_interval:6062981570) - (invoke count:100000000 tsc_interval:606298151)
[  283.695128] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[  300.180187] time_bench: Type:tasklet_page_pool03_slow Per elem: 16 cycles(tsc) 164.764 ns (step:0) - (measurement period time:16.476401670 sec time_interval:16476401670) - (invoke count:100000000 tsc_interval:1647640163)
root@(none)$ cat /proc/version
Linux version 6.13.0-rc6-00903-gc8cd65aea46f (linyunsheng@localhost.localdomain) (gcc (GCC) 10.3.1, GNU ld (GNU Binutils) 2.37) #296 SMP PREEMPT Wed Jan 15 11:29:54 CST 2025



d8de0484ad23 page_pool: fix IOMMU crash when driver has already unbound
root@(none)$ insmod bench_page_pool_simple.ko loops=100000000
[  352.981066] bench_page_pool_simple: Loaded
[  353.754833] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.769612830 sec time_interval:769612830) - (invoke count:1000000000 tsc_interval:76961275)
[  367.239820] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 13.467 ns (step:0) - (measurement period time:13.467509700 sec time_interval:13467509700) - (invoke count:1000000000 tsc_interval:1346750932)
[  368.758688] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.007 ns (step:0) - (measurement period time:1.500703810 sec time_interval:1500703810) - (invoke count:100000000 tsc_interval:150070375)
[  375.317433] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:6.541446010 sec time_interval:6541446010) - (invoke count:1000000000 tsc_interval:654144595)
[  375.334647] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[  378.470719] time_bench: Type:no-softirq-page_pool01 Per elem: 3 cycles(tsc) 31.268 ns (step:0) - (measurement period time:3.126808010 sec time_interval:3126808010) - (invoke count:100000000 tsc_interval:312680796)
[  378.489580] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[  384.237992] time_bench: Type:no-softirq-page_pool02 Per elem: 5 cycles(tsc) 57.392 ns (step:0) - (measurement period time:5.739235000 sec time_interval:5739235000) - (invoke count:100000000 tsc_interval:573923493)
[  384.256846] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[  404.284227] time_bench: Type:no-softirq-page_pool03 Per elem: 20 cycles(tsc) 200.185 ns (step:0) - (measurement period time:20.018549500 sec time_interval:20018549500) - (invoke count:100000000 tsc_interval:2001854942)
[  404.303523] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[  404.311405] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[  407.450798] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 3 cycles(tsc) 31.303 ns (step:0) - (measurement period time:3.130301150 sec time_interval:3130301150) - (invoke count:100000000 tsc_interval:313030109)
[  407.470257] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[  413.117820] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 5 cycles(tsc) 56.385 ns (step:0) - (measurement period time:5.638558540 sec time_interval:5638558540) - (invoke count:100000000 tsc_interval:563855847)
[  413.137192] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[  433.250575] time_bench: Type:tasklet_page_pool03_slow Per elem: 20 cycles(tsc) 201.047 ns (step:0) - (measurement period time:20.104725790 sec time_interval:20104725790) - (invoke count:100000000 tsc_interval:2010472573)
root@(none)$ rmmod bench_page_pool_simple.ko
[  481.612067] bench_page_pool_simple: Unloaded
root@(none)$ insmod bench_page_pool_simple.ko loops=100000000
[  482.525041] bench_page_pool_simple: Loaded
[  483.298777] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.769612290 sec time_interval:769612290) - (invoke count:1000000000 tsc_interval:76961221)
[  496.783660] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 13.467 ns (step:0) - (measurement period time:13.467404470 sec time_interval:13467404470) - (invoke count:1000000000 tsc_interval:1346740441)
[  498.302476] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.006 ns (step:0) - (measurement period time:1.500651360 sec time_interval:1500651360) - (invoke count:100000000 tsc_interval:150065132)
[  504.861015] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:6.541237000 sec time_interval:6541237000) - (invoke count:1000000000 tsc_interval:654123694)
[  504.878228] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[  508.017855] time_bench: Type:no-softirq-page_pool01 Per elem: 3 cycles(tsc) 31.303 ns (step:0) - (measurement period time:3.130363490 sec time_interval:3130363490) - (invoke count:100000000 tsc_interval:313036345)
[  508.036725] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[  513.777554] time_bench: Type:no-softirq-page_pool02 Per elem: 5 cycles(tsc) 57.316 ns (step:0) - (measurement period time:5.731647070 sec time_interval:5731647070) - (invoke count:100000000 tsc_interval:573164701)
[  513.796408] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[  533.821092] time_bench: Type:no-softirq-page_pool03 Per elem: 20 cycles(tsc) 200.158 ns (step:0) - (measurement period time:20.015853910 sec time_interval:20015853910) - (invoke count:100000000 tsc_interval:2001585384)
[  533.840385] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[  533.848266] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[  536.987413] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 3 cycles(tsc) 31.300 ns (step:0) - (measurement period time:3.130056990 sec time_interval:3130056990) - (invoke count:100000000 tsc_interval:313005695)
[  537.006870] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[  542.553443] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 5 cycles(tsc) 55.375 ns (step:0) - (measurement period time:5.537567730 sec time_interval:5537567730) - (invoke count:100000000 tsc_interval:553756767)
[  542.572814] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[  562.622903] time_bench: Type:tasklet_page_pool03_slow Per elem: 20 cycles(tsc) 200.414 ns (step:0) - (measurement period time:20.041430960 sec time_interval:20041430960) - (invoke count:100000000 tsc_interval:2004143090)
root@(none)$


b53806ee8b03 (HEAD -> pp-inflight-fix_v6_test) page_pool: support unlimited number of inflight pages
root@(none)$ insmod time_bench.ko
[   57.826902] time_bench: loading out-of-tree module taints kernel.
[   57.833978] time_bench: Loaded
root@(none)$  insmod bench_page_pool_simple.ko loops=100000000
[   66.015795] bench_page_pool_simple: Loaded
[   66.789504] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.769581100 sec time_interval:769581100) - (invoke count:1000000000 tsc_interval:76958101)
[   85.985445] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 19.178 ns (step:0) - (measurement period time:19.178464890 sec time_interval:19178464890) - (invoke count:1000000000 tsc_interval:1917846484)
[   87.504318] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.007 ns (step:0) - (measurement period time:1.500707820 sec time_interval:1500707820) - (invoke count:100000000 tsc_interval:150070776)
[   94.062989] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:6.541369880 sec time_interval:6541369880) - (invoke count:1000000000 tsc_interval:654136982)
[   94.080203] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[   97.229937] time_bench: Type:no-softirq-page_pool01 Per elem: 3 cycles(tsc) 31.404 ns (step:0) - (measurement period time:3.140470140 sec time_interval:3140470140) - (invoke count:100000000 tsc_interval:314047009)
[   97.248793] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[  102.967699] time_bench: Type:no-softirq-page_pool02 Per elem: 5 cycles(tsc) 57.097 ns (step:0) - (measurement period time:5.709729700 sec time_interval:5709729700) - (invoke count:100000000 tsc_interval:570972963)
[  102.986554] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[  123.332228] time_bench: Type:no-softirq-page_pool03 Per elem: 20 cycles(tsc) 203.368 ns (step:0) - (measurement period time:20.336842600 sec time_interval:20336842600) - (invoke count:100000000 tsc_interval:2033684253)
[  123.351522] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[  123.359404] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[  126.512828] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 3 cycles(tsc) 31.443 ns (step:0) - (measurement period time:3.144333160 sec time_interval:3144333160) - (invoke count:100000000 tsc_interval:314433311)
[  126.532286] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[  131.865545] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 5 cycles(tsc) 53.242 ns (step:0) - (measurement period time:5.324254260 sec time_interval:5324254260) - (invoke count:100000000 tsc_interval:532425421)
[  131.884917] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[  152.908467] time_bench: Type:tasklet_page_pool03_slow Per elem: 21 cycles(tsc) 210.148 ns (step:0) - (measurement period time:21.014892650 sec time_interval:21014892650) - (invoke count:100000000 tsc_interval:2101489259)
root@(none)$ rmmod bench_page_pool_simple.ko
[  163.826865] bench_page_pool_simple: Unloaded
root@(none)$  insmod bench_page_pool_simple.ko loops=100000000
[  164.867796] bench_page_pool_simple: Loaded
[  165.641522] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.769607400 sec time_interval:769607400) - (invoke count:1000000000 tsc_interval:76960732)
[  179.126540] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 13.467 ns (step:0) - (measurement period time:13.467542660 sec time_interval:13467542660) - (invoke count:1000000000 tsc_interval:1346754260)
[  180.645378] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.006 ns (step:0) - (measurement period time:1.500671580 sec time_interval:1500671580) - (invoke count:100000000 tsc_interval:150067152)
[  187.204029] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:6.541350520 sec time_interval:6541350520) - (invoke count:1000000000 tsc_interval:654135046)
[  187.221243] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[  188.577413] time_bench: Type:no-softirq-page_pool01 Per elem: 1 cycles(tsc) 13.468 ns (step:0) - (measurement period time:1.346892420 sec time_interval:1346892420) - (invoke count:100000000 tsc_interval:134689236)
[  188.596268] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[  194.314705] time_bench: Type:no-softirq-page_pool02 Per elem: 5 cycles(tsc) 57.092 ns (step:0) - (measurement period time:5.709260290 sec time_interval:5709260290) - (invoke count:100000000 tsc_interval:570926024)
[  194.333561] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[  214.660328] time_bench: Type:no-softirq-page_pool03 Per elem: 20 cycles(tsc) 203.179 ns (step:0) - (measurement period time:20.317934940 sec time_interval:20317934940) - (invoke count:100000000 tsc_interval:2031793485)
[  214.679620] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[  214.687501] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[  217.837259] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 3 cycles(tsc) 31.406 ns (step:0) - (measurement period time:3.140666230 sec time_interval:3140666230) - (invoke count:100000000 tsc_interval:314066616)
[  217.856720] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[  223.192797] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 5 cycles(tsc) 53.270 ns (step:0) - (measurement period time:5.327072820 sec time_interval:5327072820) - (invoke count:100000000 tsc_interval:532707276)
[  223.212169] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[  244.239728] time_bench: Type:tasklet_page_pool03_slow Per elem: 21 cycles(tsc) 210.189 ns (step:0) - (measurement period time:21.018901830 sec time_interval:21018901830) - (invoke count:100000000 tsc_interval:2101890177)
root@(none)$ cat /proc/version
Linux version 6.13.0-rc6-00903-gb53806ee8b03 (linyunsheng@localhost.localdomain) (gcc (GCC) 10.3.1, GNU ld (GNU Binutils) 2.37) #297 SMP PREEMPT Wed Jan 15 11:43:41 CST 2025




249fa431270c (HEAD -> pp-inflight-fix_v6_test) page_pool: skip dma sync operation for inflight pages
root@(none)$ cat /proc/version
Linux version 6.13.0-rc6-00904-g249fa431270c (linyunsheng@localhost.localdomain) (gcc (GCC) 10.3.1, GNU ld (GNU Binutils) 2.37) #300 SMP PREEMPT Wed Jan 15 14:21:51 CST 2025
root@(none)$ rmmod bench_page_pool_simple.ko
[  459.241973] bench_page_pool_simple: Unloaded
root@(none)$
root@(none)$
root@(none)$ insmod bench_page_pool_simple.ko loops=100000000
[  462.674971] bench_page_pool_simple: Loaded
[  463.448730] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.769614430 sec time_interval:769614430) - (invoke count:1000000000 tsc_interval:76961435)
[  476.933835] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 13.467 ns (step:0) - (measurement period time:13.467629020 sec time_interval:13467629020) - (invoke count:1000000000 tsc_interval:1346762898)
[  478.452709] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.007 ns (step:0) - (measurement period time:1.500710750 sec time_interval:1500710750) - (invoke count:100000000 tsc_interval:150071069)
[  485.011458] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:6.541449970 sec time_interval:6541449970) - (invoke count:1000000000 tsc_interval:654144991)
[  485.028671] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[  486.500170] time_bench: Type:no-softirq-page_pool01 Per elem: 1 cycles(tsc) 14.622 ns (step:0) - (measurement period time:1.462234950 sec time_interval:1462234950) - (invoke count:100000000 tsc_interval:146223489)
[  486.519026] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[  491.827181] time_bench: Type:no-softirq-page_pool02 Per elem: 5 cycles(tsc) 52.989 ns (step:0) - (measurement period time:5.298974920 sec time_interval:5298974920) - (invoke count:100000000 tsc_interval:529897484)
[  491.846039] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[  509.968937] time_bench: Type:no-softirq-page_pool03 Per elem: 18 cycles(tsc) 181.140 ns (step:0) - (measurement period time:18.114063050 sec time_interval:18114063050) - (invoke count:100000000 tsc_interval:1811406296)
[  509.988228] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[  509.996109] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[  512.621549] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 2 cycles(tsc) 26.163 ns (step:0) - (measurement period time:2.616350750 sec time_interval:2616350750) - (invoke count:100000000 tsc_interval:261635069)
[  512.641009] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[  518.028167] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 5 cycles(tsc) 53.781 ns (step:0) - (measurement period time:5.378154590 sec time_interval:5378154590) - (invoke count:100000000 tsc_interval:537815454)
[  518.047541] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[  537.001263] time_bench: Type:tasklet_page_pool03_slow Per elem: 18 cycles(tsc) 189.450 ns (step:0) - (measurement period time:18.945065660 sec time_interval:18945065660) - (invoke count:100000000 tsc_interval:1894506561)
root@(none)$ rmmod bench_page_pool_simple.ko
[  554.270004] bench_page_pool_simple: Unloaded
root@(none)$ insmod bench_page_pool_simple.ko loops=100000000
[  555.334974] bench_page_pool_simple: Loaded
[  556.108716] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.769622900 sec time_interval:769622900) - (invoke count:1000000000 tsc_interval:76962277)
[  569.593570] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 13.467 ns (step:0) - (measurement period time:13.467378920 sec time_interval:13467378920) - (invoke count:1000000000 tsc_interval:1346737886)
[  571.112408] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.006 ns (step:0) - (measurement period time:1.500672390 sec time_interval:1500672390) - (invoke count:100000000 tsc_interval:150067233)
[  577.671068] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:6.541360400 sec time_interval:6541360400) - (invoke count:1000000000 tsc_interval:654136033)
[  577.688281] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[  579.159760] time_bench: Type:no-softirq-page_pool01 Per elem: 1 cycles(tsc) 14.622 ns (step:0) - (measurement period time:1.462214680 sec time_interval:1462214680) - (invoke count:100000000 tsc_interval:146221461)
[  579.178615] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[  584.387107] time_bench: Type:no-softirq-page_pool02 Per elem: 5 cycles(tsc) 51.993 ns (step:0) - (measurement period time:5.199315890 sec time_interval:5199315890) - (invoke count:100000000 tsc_interval:519931583)
[  584.405963] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[  601.992462] time_bench: Type:no-softirq-page_pool03 Per elem: 17 cycles(tsc) 175.776 ns (step:0) - (measurement period time:17.577663130 sec time_interval:17577663130) - (invoke count:100000000 tsc_interval:1757766306)
[  602.011753] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[  602.019634] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[  604.647682] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 2 cycles(tsc) 26.189 ns (step:0) - (measurement period time:2.618955910 sec time_interval:2618955910) - (invoke count:100000000 tsc_interval:261895585)
[  604.667141] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[  610.055961] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 5 cycles(tsc) 53.798 ns (step:0) - (measurement period time:5.379816080 sec time_interval:5379816080) - (invoke count:100000000 tsc_interval:537981602)
[  610.075334] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[  629.030597] time_bench: Type:tasklet_page_pool03_slow Per elem: 18 cycles(tsc) 189.466 ns (step:0) - (measurement period time:18.946606280 sec time_interval:18946606280) - (invoke count:100000000 tsc_interval:1894660622)



bd05af7e28d2 (HEAD -> pp-inflight-fix_v6_test) page_pool: use list instead of ptr_ring for ring cache
root@(none)$ insmod bench_page_pool_simple.ko loops=100000000
[  324.256893] bench_page_pool_simple: Loaded
[  325.030626] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.769608510 sec time_interval:769608510) - (invoke count:1000000000 tsc_interval:76960843)
[  338.515544] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 13.467 ns (step:0) - (measurement period time:13.467442220 sec time_interval:13467442220) - (invoke count:1000000000 tsc_interval:1346744216)
[  340.034383] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.006 ns (step:0) - (measurement period time:1.500673080 sec time_interval:1500673080) - (invoke count:100000000 tsc_interval:150067302)
[  346.593168] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:6.541486300 sec time_interval:6541486300) - (invoke count:1000000000 tsc_interval:654148625)
[  346.610383] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[  349.198132] time_bench: Type:no-softirq-page_pool01 Per elem: 2 cycles(tsc) 25.784 ns (step:0) - (measurement period time:2.578484390 sec time_interval:2578484390) - (invoke count:100000000 tsc_interval:257848433)
[  349.216987] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[  358.266543] time_bench: Type:no-softirq-page_pool02 Per elem: 9 cycles(tsc) 90.403 ns (step:0) - (measurement period time:9.040378740 sec time_interval:9040378740) - (invoke count:100000000 tsc_interval:904037869)
[  358.285398] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[  378.581275] time_bench: Type:no-softirq-page_pool03 Per elem: 20 cycles(tsc) 202.870 ns (step:0) - (measurement period time:20.287047800 sec time_interval:20287047800) - (invoke count:100000000 tsc_interval:2028704772)
[  378.600567] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[  378.608449] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[  381.195830] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 2 cycles(tsc) 25.782 ns (step:0) - (measurement period time:2.578291220 sec time_interval:2578291220) - (invoke count:100000000 tsc_interval:257829118)
[  381.215288] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[  390.262793] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 9 cycles(tsc) 90.385 ns (step:0) - (measurement period time:9.038500040 sec time_interval:9038500040) - (invoke count:100000000 tsc_interval:903849999)
[  390.282165] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[  410.602531] time_bench: Type:tasklet_page_pool03_slow Per elem: 20 cycles(tsc) 203.117 ns (step:0) - (measurement period time:20.311708230 sec time_interval:20311708230) - (invoke count:100000000 tsc_interval:2031170817)
root@(none)$ rmmod bench_page_pool_simple.ko
[  452.799939] bench_page_pool_simple: Unloaded
root@(none)$ insmod bench_page_pool_simple.ko loops=100000000
[  454.932877] bench_page_pool_simple: Loaded
[  455.706590] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.769596200 sec time_interval:769596200) - (invoke count:1000000000 tsc_interval:76959611)
[  469.191300] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 13.467 ns (step:0) - (measurement period time:13.467234550 sec time_interval:13467234550) - (invoke count:1000000000 tsc_interval:1346723449)
[  470.710117] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.006 ns (step:0) - (measurement period time:1.500652740 sec time_interval:1500652740) - (invoke count:100000000 tsc_interval:150065267)
[  477.268702] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:6.541285540 sec time_interval:6541285540) - (invoke count:1000000000 tsc_interval:654128549)
[  477.285914] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[  479.873572] time_bench: Type:no-softirq-page_pool01 Per elem: 2 cycles(tsc) 25.783 ns (step:0) - (measurement period time:2.578394320 sec time_interval:2578394320) - (invoke count:100000000 tsc_interval:257839426)
[  479.892426] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[  488.941591] time_bench: Type:no-softirq-page_pool02 Per elem: 9 cycles(tsc) 90.399 ns (step:0) - (measurement period time:9.039988700 sec time_interval:9039988700) - (invoke count:100000000 tsc_interval:903998864)
[  488.960458] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[  509.252999] time_bench: Type:no-softirq-page_pool03 Per elem: 20 cycles(tsc) 202.837 ns (step:0) - (measurement period time:20.283709920 sec time_interval:20283709920) - (invoke count:100000000 tsc_interval:2028370986)
[  509.275188] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[  509.283069] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[  511.870501] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 2 cycles(tsc) 25.783 ns (step:0) - (measurement period time:2.578339900 sec time_interval:2578339900) - (invoke count:100000000 tsc_interval:257833985)
[  511.889959] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[  520.937881] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 9 cycles(tsc) 90.389 ns (step:0) - (measurement period time:9.038917580 sec time_interval:9038917580) - (invoke count:100000000 tsc_interval:903891752)
[  520.957253] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[  541.278328] time_bench: Type:tasklet_page_pool03_slow Per elem: 20 cycles(tsc) 203.124 ns (step:0) - (measurement period time:20.312417960 sec time_interval:20312417960) - (invoke count:100000000 tsc_interval:2031241790)
root@(none)$ cat /proc/version
Linux version 6.13.0-rc6-00905-gbd05af7e28d2 (linyunsheng@localhost.localdomain) (gcc (GCC) 10.3.1, GNU ld (GNU Binutils) 2.37) #301 SMP PREEMPT Wed Jan 15 14:57:40 CST 2025



e8e4ef65fd4b (HEAD -> pp-inflight-fix_v6_test) page_pool: batch refilling pages to reduce atomic operation
root@(none)$ insmod bench_page_pool_simple.ko loops=100000000
[   81.660612] bench_page_pool_simple: Loaded
[   82.434335] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.769577370 sec time_interval:769577370) - (invoke count:1000000000 tsc_interval:76957728)
[   95.919455] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 13.467 ns (step:0) - (measurement period time:13.467643010 sec time_interval:13467643010) - (invoke count:1000000000 tsc_interval:1346764295)
[   97.438295] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.006 ns (step:0) - (measurement period time:1.500675620 sec time_interval:1500675620) - (invoke count:100000000 tsc_interval:150067556)
[  103.997112] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:6.541514490 sec time_interval:6541514490) - (invoke count:1000000000 tsc_interval:654151443)
[  104.014327] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[  105.524295] time_bench: Type:no-softirq-page_pool01 Per elem: 1 cycles(tsc) 15.007 ns (step:0) - (measurement period time:1.500704660 sec time_interval:1500704660) - (invoke count:100000000 tsc_interval:150070459)
[  105.543183] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[  111.935637] time_bench: Type:no-softirq-page_pool02 Per elem: 6 cycles(tsc) 63.832 ns (step:0) - (measurement period time:6.383276590 sec time_interval:6383276590) - (invoke count:100000000 tsc_interval:638327653)
[  111.954492] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[  131.007329] time_bench: Type:no-softirq-page_pool03 Per elem: 19 cycles(tsc) 190.440 ns (step:0) - (measurement period time:19.044004630 sec time_interval:19044004630) - (invoke count:100000000 tsc_interval:1904400455)
[  131.026621] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[  131.034503] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[  132.544154] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 1 cycles(tsc) 15.005 ns (step:0) - (measurement period time:1.500558810 sec time_interval:1500558810) - (invoke count:100000000 tsc_interval:150055876)
[  132.563614] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[  139.007314] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 6 cycles(tsc) 64.346 ns (step:0) - (measurement period time:6.434695610 sec time_interval:6434695610) - (invoke count:100000000 tsc_interval:643469557)
[  139.026687] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[  158.093560] time_bench: Type:tasklet_page_pool03_slow Per elem: 19 cycles(tsc) 190.582 ns (step:0) - (measurement period time:19.058215140 sec time_interval:19058215140) - (invoke count:100000000 tsc_interval:1905821508)
root@(none)$ rmmod bench_page_pool_simple.ko
[  172.671534] bench_page_pool_simple: Unloaded
root@(none)$ insmod bench_page_pool_simple.ko loops=100000000
[  174.012461] bench_page_pool_simple: Loaded
[  174.786162] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.769579310 sec time_interval:769579310) - (invoke count:1000000000 tsc_interval:76957922)
[  188.270731] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 13.467 ns (step:0) - (measurement period time:13.467093170 sec time_interval:13467093170) - (invoke count:1000000000 tsc_interval:1346709310)
[  189.789532] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.006 ns (step:0) - (measurement period time:1.500638040 sec time_interval:1500638040) - (invoke count:100000000 tsc_interval:150063795)
[  196.348065] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:6.541234660 sec time_interval:6541234660) - (invoke count:1000000000 tsc_interval:654123460)
[  196.365281] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[  197.875195] time_bench: Type:no-softirq-page_pool01 Per elem: 1 cycles(tsc) 15.006 ns (step:0) - (measurement period time:1.500650210 sec time_interval:1500650210) - (invoke count:100000000 tsc_interval:150065016)
[  197.894050] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[  203.394345] time_bench: Type:no-softirq-page_pool02 Per elem: 5 cycles(tsc) 54.911 ns (step:0) - (measurement period time:5.491119700 sec time_interval:5491119700) - (invoke count:100000000 tsc_interval:549111964)
[  203.413201] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[  222.522015] time_bench: Type:no-softirq-page_pool03 Per elem: 19 cycles(tsc) 190.999 ns (step:0) - (measurement period time:19.099982300 sec time_interval:19099982300) - (invoke count:100000000 tsc_interval:1909998222)
[  222.541306] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[  222.549187] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[  224.058807] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 1 cycles(tsc) 15.005 ns (step:0) - (measurement period time:1.500531720 sec time_interval:1500531720) - (invoke count:100000000 tsc_interval:150053166)
[  224.078267] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[  229.638432] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 5 cycles(tsc) 55.511 ns (step:0) - (measurement period time:5.551160500 sec time_interval:5551160500) - (invoke count:100000000 tsc_interval:555116045)
[  229.657805] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[  248.720382] time_bench: Type:tasklet_page_pool03_slow Per elem: 19 cycles(tsc) 190.539 ns (step:0) - (measurement period time:19.053918960 sec time_interval:19053918960) - (invoke count:100000000 tsc_interval:1905391890)
root@(none)$ cat /proc/version
Linux version 6.13.0-rc6-00906-ge8e4ef65fd4b (linyunsheng@localhost.localdomain) (gcc (GCC) 10.3.1, GNU ld (GNU Binutils) 2.37) #302 SMP PREEMPT Wed Jan 15 15:11:10 CST 2025
root@(none)$ insmod bench_page_pool_simple.ko loops=100000000
[  493.008461] bench_page_pool_simple: Loaded
[  493.782195] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.769607870 sec time_interval:769607870) - (invoke count:1000000000 tsc_interval:76960778)
[  507.266860] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 13.467 ns (step:0) - (measurement period time:13.467190060 sec time_interval:13467190060) - (invoke count:1000000000 tsc_interval:1346718999)
[  508.785667] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.006 ns (step:0) - (measurement period time:1.500643840 sec time_interval:1500643840) - (invoke count:100000000 tsc_interval:150064378)
[  515.344224] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:6.541258530 sec time_interval:6541258530) - (invoke count:1000000000 tsc_interval:654125847)
[  515.361440] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[  518.102903] time_bench: Type:no-softirq-page_pool01 Per elem: 2 cycles(tsc) 27.321 ns (step:0) - (measurement period time:2.732199220 sec time_interval:2732199220) - (invoke count:100000000 tsc_interval:273219917)
[  518.121759] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[  524.874604] time_bench: Type:no-softirq-page_pool02 Per elem: 6 cycles(tsc) 67.436 ns (step:0) - (measurement period time:6.743668740 sec time_interval:6743668740) - (invoke count:100000000 tsc_interval:674366869)
[  524.893460] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[  543.980580] time_bench: Type:no-softirq-page_pool03 Per elem: 19 cycles(tsc) 190.782 ns (step:0) - (measurement period time:19.078288770 sec time_interval:19078288770) - (invoke count:100000000 tsc_interval:1907828868)
[  543.999871] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[  544.007753] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[  546.748829] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 2 cycles(tsc) 27.319 ns (step:0) - (measurement period time:2.731985080 sec time_interval:2731985080) - (invoke count:100000000 tsc_interval:273198499)
[  546.768288] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[  553.505522] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 6 cycles(tsc) 67.282 ns (step:0) - (measurement period time:6.728229430 sec time_interval:6728229430) - (invoke count:100000000 tsc_interval:672822938)
[  553.524893] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[  572.731687] time_bench: Type:tasklet_page_pool03_slow Per elem: 19 cycles(tsc) 191.981 ns (step:0) - (measurement period time:19.198137710 sec time_interval:19198137710) - (invoke count:100000000 tsc_inter

root@(none)$ insmod bench_page_pool_simple.ko loops=100000000
[  624.624453] bench_page_pool_simple: Loaded
[  625.398155] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.769580100 sec time_interval:769580100) - (invoke count:1000000000 tsc_interval:76958003)
[  638.882758] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 13.467 ns (step:0) - (measurement period time:13.467127790 sec time_interval:13467127790) - (invoke count:1000000000 tsc_interval:1346712774)
[  640.401554] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.006 ns (step:0) - (measurement period time:1.500633000 sec time_interval:1500633000) - (invoke count:100000000 tsc_interval:150063294)
[  646.960100] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:6.541244270 sec time_interval:6541244270) - (invoke count:1000000000 tsc_interval:654124421)
[  646.977313] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[  649.718817] time_bench: Type:no-softirq-page_pool01 Per elem: 2 cycles(tsc) 27.322 ns (step:0) - (measurement period time:2.732241230 sec time_interval:2732241230) - (invoke count:100000000 tsc_interval:273224117)
[  649.737673] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[  656.485353] time_bench: Type:no-softirq-page_pool02 Per elem: 6 cycles(tsc) 67.385 ns (step:0) - (measurement period time:6.738504450 sec time_interval:6738504450) - (invoke count:100000000 tsc_interval:673850439)
[  656.504211] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[  675.730226] time_bench: Type:no-softirq-page_pool03 Per elem: 19 cycles(tsc) 192.171 ns (step:0) - (measurement period time:19.217181040 sec time_interval:19217181040) - (invoke count:100000000 tsc_interval:1921718097)
[  675.749517] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[  675.757399] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[  678.498457] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 2 cycles(tsc) 27.319 ns (step:0) - (measurement period time:2.731969810 sec time_interval:2731969810) - (invoke count:100000000 tsc_interval:273196975)
[  678.517917] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[  685.272622] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 6 cycles(tsc) 67.457 ns (step:0) - (measurement period time:6.745701080 sec time_interval:6745701080) - (invoke count:100000000 tsc_interval:674570103)
[  685.291993] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[  704.535410] time_bench: Type:tasklet_page_pool03_slow Per elem: 19 cycles(tsc) 192.347 ns (step:0) - (measurement period time:19.234760880 sec time_interval:19234760880) - (invoke count:100000000 tsc_interval:1923476080)


5760bcdd3fef (HEAD -> pp-inflight-fix_v6_test) page_pool: use list instead of array for alloc cache
root@(none)$ insmod bench_page_pool_simple.ko loops=100000000
[ 1378.118009] bench_page_pool_simple: Loaded
[ 1378.891760] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.769629870 sec time_interval:769629870) - (invoke count:1000000000 tsc_interval:76962977)
[ 1392.376430] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 13.467 ns (step:0) - (measurement period time:13.467196340 sec time_interval:13467196340) - (invoke count:1000000000 tsc_interval:1346719628)
[ 1393.895253] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.006 ns (step:0) - (measurement period time:1.500659490 sec time_interval:1500659490) - (invoke count:100000000 tsc_interval:150065942)
[ 1400.453791] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:6.541237910 sec time_interval:6541237910) - (invoke count:1000000000 tsc_interval:654123784)
[ 1400.471006] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[ 1402.135620] time_bench: Type:no-softirq-page_pool01 Per elem: 1 cycles(tsc) 16.553 ns (step:0) - (measurement period time:1.655350930 sec time_interval:1655350930) - (invoke count:100000000 tsc_interval:165535087)
[ 1402.154474] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[ 1407.685584] time_bench: Type:no-softirq-page_pool02 Per elem: 5 cycles(tsc) 55.219 ns (step:0) - (measurement period time:5.521934590 sec time_interval:5521934590) - (invoke count:100000000 tsc_interval:552193452)
[ 1407.704438] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[ 1427.906125] time_bench: Type:no-softirq-page_pool03 Per elem: 20 cycles(tsc) 201.928 ns (step:0) - (measurement period time:20.192856910 sec time_interval:20192856910) - (invoke count:100000000 tsc_interval:2019285683)
[ 1427.925416] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[ 1427.933297] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[ 1429.519900] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 1 cycles(tsc) 15.775 ns (step:0) - (measurement period time:1.577513290 sec time_interval:1577513290) - (invoke count:100000000 tsc_interval:157751323)
[ 1429.539358] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[ 1435.138765] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 5 cycles(tsc) 55.904 ns (step:0) - (measurement period time:5.590404140 sec time_interval:5590404140) - (invoke count:100000000 tsc_interval:559040410)
[ 1435.158136] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[ 1455.411856] time_bench: Type:tasklet_page_pool03_slow Per elem: 20 cycles(tsc) 202.450 ns (step:0) - (measurement period time:20.245062650 sec time_interval:20245062650) - (invoke count:100000000 tsc_interval:2024506258)
root@(none)$ rmmod bench_page_pool_simple.ko
[ 1624.116972] bench_page_pool_simple: Unloaded
root@(none)$ insmod bench_page_pool_simple.ko loops=100000000
[ 1625.254057] bench_page_pool_simple: Loaded
[ 1626.027804] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.769627010 sec time_interval:769627010) - (invoke count:1000000000 tsc_interval:76962694)
[ 1639.512664] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 13.467 ns (step:0) - (measurement period time:13.467385750 sec time_interval:13467385750) - (invoke count:1000000000 tsc_interval:1346738568)
[ 1641.031493] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.006 ns (step:0) - (measurement period time:1.500664980 sec time_interval:1500664980) - (invoke count:100000000 tsc_interval:150066492)
[ 1647.590116] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:6.541324190 sec time_interval:6541324190) - (invoke count:1000000000 tsc_interval:654132413)
[ 1647.607328] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[ 1649.211118] time_bench: Type:no-softirq-page_pool01 Per elem: 1 cycles(tsc) 15.945 ns (step:0) - (measurement period time:1.594526020 sec time_interval:1594526020) - (invoke count:100000000 tsc_interval:159452596)
[ 1649.229971] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[ 1654.761083] time_bench: Type:no-softirq-page_pool02 Per elem: 5 cycles(tsc) 55.219 ns (step:0) - (measurement period time:5.521934830 sec time_interval:5521934830) - (invoke count:100000000 tsc_interval:552193476)
[ 1654.779937] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[ 1674.973459] time_bench: Type:no-softirq-page_pool03 Per elem: 20 cycles(tsc) 201.846 ns (step:0) - (measurement period time:20.184690600 sec time_interval:20184690600) - (invoke count:100000000 tsc_interval:2018469053)
[ 1674.992751] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[ 1675.000632] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[ 1676.622598] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 1 cycles(tsc) 16.128 ns (step:0) - (measurement period time:1.612877140 sec time_interval:1612877140) - (invoke count:100000000 tsc_interval:161287709)
[ 1676.642056] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[ 1682.241489] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 5 cycles(tsc) 55.904 ns (step:0) - (measurement period time:5.590428410 sec time_interval:5590428410) - (invoke count:100000000 tsc_interval:559042835)
[ 1682.260860] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[ 1702.540682] time_bench: Type:tasklet_page_pool03_slow Per elem: 20 cycles(tsc) 202.711 ns (step:0) - (measurement period time:20.271164760 sec time_interval:20271164760) - (invoke count:100000000 tsc_interval:2027116470)
root@(none)$ rmmod bench_page_pool_simple.ko
[ 3945.224975] bench_page_pool_simple: Unloaded
root@(none)$ insmod bench_page_pool_simple.ko loops=100000000
[ 3946.318072] bench_page_pool_simple: Loaded
[ 3947.091825] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.769631280 sec time_interval:769631280) - (invoke count:1000000000 tsc_interval:76963115)
[ 3960.576784] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 13.467 ns (step:0) - (measurement period time:13.467483140 sec time_interval:13467483140) - (invoke count:1000000000 tsc_interval:1346748308)
[ 3962.095607] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.006 ns (step:0) - (measurement period time:1.500658780 sec time_interval:1500658780) - (invoke count:100000000 tsc_interval:150065872)
[ 3968.654285] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:6.541378830 sec time_interval:6541378830) - (invoke count:1000000000 tsc_interval:654137877)
[ 3968.671520] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[ 3971.491845] time_bench: Type:no-softirq-page_pool01 Per elem: 2 cycles(tsc) 28.110 ns (step:0) - (measurement period time:2.811058810 sec time_interval:2811058810) - (invoke count:100000000 tsc_interval:281105875)
[ 3971.510703] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[ 3978.348581] time_bench: Type:no-softirq-page_pool02 Per elem: 6 cycles(tsc) 68.287 ns (step:0) - (measurement period time:6.828701400 sec time_interval:6828701400) - (invoke count:100000000 tsc_interval:682870134)
[ 3978.367435] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[ 3998.595188] time_bench: Type:no-softirq-page_pool03 Per elem: 20 cycles(tsc) 202.189 ns (step:0) - (measurement period time:20.218922630 sec time_interval:20218922630) - (invoke count:100000000 tsc_interval:2021892255)
[ 3998.614480] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[ 3998.622362] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[ 4001.442253] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 2 cycles(tsc) 28.108 ns (step:0) - (measurement period time:2.810802040 sec time_interval:2810802040) - (invoke count:100000000 tsc_interval:281080197)
[ 4001.461713] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[ 4008.290654] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 6 cycles(tsc) 68.199 ns (step:0) - (measurement period time:6.819937430 sec time_interval:6819937430) - (invoke count:100000000 tsc_interval:681993738)
[ 4008.310026] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[ 4028.570377] time_bench: Type:tasklet_page_pool03_slow Per elem: 20 cycles(tsc) 202.516 ns (step:0) - (measurement period time:20.251693920 sec time_interval:20251693920) - (invoke count:100000000 tsc_interval:2025169387)
root@(none)$ cat /proc/version
Linux version 6.13.0-rc6-00907-g5760bcdd3fef (linyunsheng@localhost.localdomain) (gcc (GCC) 10.3.1, GNU ld (GNU Binutils) 2.37) #303 SMP PREEMPT Wed Jan 15 15:27:07 CST 2025

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v7 3/8] page_pool: fix IOMMU crash when driver has already unbound
  2025-01-10 13:06 ` [PATCH net-next v7 3/8] page_pool: fix IOMMU crash when driver has already unbound Yunsheng Lin
@ 2025-01-15 16:29   ` Jesper Dangaard Brouer
  2025-01-16 12:52     ` Yunsheng Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2025-01-15 16:29 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Robin Murphy,
	Alexander Duyck, IOMMU, Andrew Morton, Eric Dumazet, Simon Horman,
	Ilias Apalodimas, linux-mm, linux-kernel, netdev, kernel-team



On 10/01/2025 14.06, Yunsheng Lin wrote:
[...]
> In order not to call DMA APIs to do DMA unmapping after the driver
> has already been unbound, and not to stall the unloading of the
> networking driver, use some pre-allocated item blocks to record
> inflight pages, including the ones which are handed over to the
> network stack, so that the page_pool can do the DMA unmapping for
> those pages when page_pool_destroy() is called. As the pre-allocated
> item blocks need to be large enough to avoid performance degradation,
> add an 'item_fast_empty' stat to indicate the unavailability of the
> pre-allocated item blocks.
> 

[...]
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 1aa7b93bdcc8..fa7629c3ec94 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
[...]
> @@ -268,6 +271,7 @@ static int page_pool_init(struct page_pool *pool,
>   		return -ENOMEM;
>   	}
>   
> +	spin_lock_init(&pool->item_lock);
>   	atomic_set(&pool->pages_state_release_cnt, 0);
>   
>   	/* Driver calling page_pool_create() also call page_pool_destroy() */
> @@ -325,6 +329,200 @@ static void page_pool_uninit(struct page_pool *pool)
>   #endif
>   }
>   
> +#define PAGE_POOL_ITEM_USED			0
> +#define PAGE_POOL_ITEM_MAPPED			1
> +
> +#define ITEMS_PER_PAGE	((PAGE_SIZE -						\
> +			  offsetof(struct page_pool_item_block, items)) /	\
> +			 sizeof(struct page_pool_item))
> +
> +#define page_pool_item_init_state(item)					\
> +({									\
> +	(item)->state = 0;						\
> +})
> +
> +#if defined(CONFIG_DEBUG_NET)
> +#define page_pool_item_set_used(item)					\
> +	__set_bit(PAGE_POOL_ITEM_USED, &(item)->state)
> +
> +#define page_pool_item_clear_used(item)					\
> +	__clear_bit(PAGE_POOL_ITEM_USED, &(item)->state)
> +
> +#define page_pool_item_is_used(item)					\
> +	test_bit(PAGE_POOL_ITEM_USED, &(item)->state)
> +#else
> +#define page_pool_item_set_used(item)
> +#define page_pool_item_clear_used(item)
> +#define page_pool_item_is_used(item)		false
> +#endif
> +
> +#define page_pool_item_set_mapped(item)					\
> +	__set_bit(PAGE_POOL_ITEM_MAPPED, &(item)->state)
> +
> +/* Only clear_mapped and is_mapped need to be atomic as they can be
> + * called concurrently.
> + */
> +#define page_pool_item_clear_mapped(item)				\
> +	clear_bit(PAGE_POOL_ITEM_MAPPED, &(item)->state)
> +
> +#define page_pool_item_is_mapped(item)					\
> +	test_bit(PAGE_POOL_ITEM_MAPPED, &(item)->state)
> +
> +static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
> +							 netmem_ref netmem,
> +							 bool destroyed)
> +{
> +	struct page_pool_item *item;
> +	dma_addr_t dma;
> +
> +	if (!pool->dma_map)
> +		/* Always account for inflight pages, even if we didn't
> +		 * map them
> +		 */
> +		return;
> +
> +	dma = page_pool_get_dma_addr_netmem(netmem);
> +	item = netmem_get_pp_item(netmem);
> +
> +	/* DMA unmapping is always needed when page_pool_destroy() is not called
> +	 * yet.
> +	 */
> +	DEBUG_NET_WARN_ON_ONCE(!destroyed && !page_pool_item_is_mapped(item));
> +	if (unlikely(destroyed && !page_pool_item_is_mapped(item)))
> +		return;
> +
> +	/* When page is unmapped, it cannot be returned to our pool */
> +	dma_unmap_page_attrs(pool->p.dev, dma,
> +			     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
> +			     DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
> +	page_pool_set_dma_addr_netmem(netmem, 0);
> +	page_pool_item_clear_mapped(item);
> +}
> +

I have a hard time reading/reviewing/maintaining the code below without
some design description.  This code needs more comments on the *intent*
and design it is trying to achieve.

 From the patch description, the only hint I have is:
  "use some pre-allocated item blocks to record inflight pages"

E.g. Why is it needed/smart to hijack the page->pp pointer?
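
For the record, my best-guess reading of the intent is something like the
sketch below. This is simplified and illustrative only: the exact field
layout may differ from what the patch actually uses, and sketch_get_pp() is
a hypothetical stand-in for the page_pool_get_pp() helper added earlier in
the series. Having this kind of description in comments would already help:

/* Sketch: netmem->pp no longer points at the page_pool itself but at a
 * page_pool_item.  Items live inside order-0 "item block" pages whose
 * header carries a back-pointer to the owning pool, so the pool can be
 * recovered by masking the item address down to the start of its page
 * (ITEMS_PER_PAGE guarantees items never cross that boundary).
 */
struct page_pool_item {
	unsigned long state;			/* USED/MAPPED bits */
	union {
		netmem_ref pp_netmem;		/* inflight page tracked by this item */
		struct llist_node lentry;	/* free-list linkage while unused */
	};
};

struct page_pool_item_block {
	struct page_pool *pp;			/* back-pointer to the owning pool */
	struct list_head list;			/* linked on pool->item_blocks */
	struct page_pool_item items[];		/* ITEMS_PER_PAGE entries */
};

static inline struct page_pool *sketch_get_pp(netmem_ref netmem)
{
	struct page_pool_item *item = netmem_get_pp_item(netmem);
	struct page_pool_item_block *block;

	block = (struct page_pool_item_block *)((unsigned long)item & PAGE_MASK);
	return block->pp;
}

If that reading is roughly right, spelling it out near the struct
definitions would answer most of my questions.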

> +static void __page_pool_item_init(struct page_pool *pool, struct page *page)
> +{

Function name is confusing.  First I thought this was init'ing a single
item, but looking at the code it is iterating over ITEMS_PER_PAGE.

Maybe it should be called page_pool_item_block_init?

> +	struct page_pool_item_block *block = page_address(page);
> +	struct page_pool_item *items = block->items;
> +	unsigned int i;
> +
> +	list_add(&block->list, &pool->item_blocks);
> +	block->pp = pool;
> +
> +	for (i = 0; i < ITEMS_PER_PAGE; i++) {
> +		page_pool_item_init_state(&items[i]);
> +		__llist_add(&items[i].lentry, &pool->hold_items);
> +	}
> +}
> +
> +static int page_pool_item_init(struct page_pool *pool)
> +{
> +#define PAGE_POOL_MIN_INFLIGHT_ITEMS		512
> +	struct page_pool_item_block *block;
> +	int item_cnt;
> +
> +	INIT_LIST_HEAD(&pool->item_blocks);
> +	init_llist_head(&pool->hold_items);
> +	init_llist_head(&pool->release_items);
> +
> +	item_cnt = pool->p.pool_size * 2 + PP_ALLOC_CACHE_SIZE +
> +		PAGE_POOL_MIN_INFLIGHT_ITEMS;
> +	while (item_cnt > 0) {
> +		struct page *page;
> +
> +		page = alloc_pages_node(pool->p.nid, GFP_KERNEL, 0);
> +		if (!page)
> +			goto err;
> +
> +		__page_pool_item_init(pool, page);
> +		item_cnt -= ITEMS_PER_PAGE;
> +	}
> +
> +	return 0;
> +err:
> +	list_for_each_entry(block, &pool->item_blocks, list)
> +		put_page(virt_to_page(block));
> +
> +	return -ENOMEM;
> +}
> +
> +static void page_pool_item_unmap(struct page_pool *pool,
> +				 struct page_pool_item *item)
> +{
> +	spin_lock_bh(&pool->item_lock);
> +	__page_pool_release_page_dma(pool, item->pp_netmem, true);
> +	spin_unlock_bh(&pool->item_lock);
> +}
> +
> +static void page_pool_items_unmap(struct page_pool *pool)
> +{
> +	struct page_pool_item_block *block;
> +
> +	if (!pool->dma_map || pool->mp_priv)
> +		return;
> +
> +	list_for_each_entry(block, &pool->item_blocks, list) {
> +		struct page_pool_item *items = block->items;
> +		int i;
> +
> +		for (i = 0; i < ITEMS_PER_PAGE; i++) {
> +			struct page_pool_item *item = &items[i];
> +
> +			if (!page_pool_item_is_mapped(item))
> +				continue;
> +
> +			page_pool_item_unmap(pool, item);
> +		}
> +	}
> +}
> +
> +static void page_pool_item_uninit(struct page_pool *pool)
> +{
> +	while (!list_empty(&pool->item_blocks)) {
> +		struct page_pool_item_block *block;
> +
> +		block = list_first_entry(&pool->item_blocks,
> +					 struct page_pool_item_block,
> +					 list);
> +		list_del(&block->list);
> +		put_page(virt_to_page(block));
> +	}
> +}
> +
> +static bool page_pool_item_add(struct page_pool *pool, netmem_ref netmem)
> +{
> +	struct page_pool_item *item;
> +	struct llist_node *node;
> +
> +	if (unlikely(llist_empty(&pool->hold_items))) {
> +		pool->hold_items.first = llist_del_all(&pool->release_items);
> +
> +		if (unlikely(llist_empty(&pool->hold_items))) {
> +			alloc_stat_inc(pool, item_fast_empty);
> +			return false;
> +		}
> +	}
> +
> +	node = pool->hold_items.first;
> +	pool->hold_items.first = node->next;
> +	item = llist_entry(node, struct page_pool_item, lentry);
> +	item->pp_netmem = netmem;
> +	page_pool_item_set_used(item);
> +	netmem_set_pp_item(netmem, item);
> +	return true;
> +}
> +
> +static void page_pool_item_del(struct page_pool *pool, netmem_ref netmem)
> +{
> +	struct page_pool_item *item = netmem_get_pp_item(netmem);
> +
> +	DEBUG_NET_WARN_ON_ONCE(item->pp_netmem != netmem);
> +	DEBUG_NET_WARN_ON_ONCE(page_pool_item_is_mapped(item));
> +	DEBUG_NET_WARN_ON_ONCE(!page_pool_item_is_used(item));
> +	page_pool_item_clear_used(item);
> +	netmem_set_pp_item(netmem, NULL);
> +	llist_add(&item->lentry, &pool->release_items);
> +}
> +
>   /**
>    * page_pool_create_percpu() - create a page pool for a given cpu.
>    * @params: parameters, see struct page_pool_params
> @@ -344,12 +542,18 @@ page_pool_create_percpu(const struct page_pool_params *params, int cpuid)
>   	if (err < 0)
>   		goto err_free;
>   
> -	err = page_pool_list(pool);
> +	err = page_pool_item_init(pool);
>   	if (err)
>   		goto err_uninit;
>   
> +	err = page_pool_list(pool);
> +	if (err)
> +		goto err_item_uninit;
> +
>   	return pool;
>   
> +err_item_uninit:
> +	page_pool_item_uninit(pool);
>   err_uninit:
>   	page_pool_uninit(pool);
>   err_free:
> @@ -369,7 +573,8 @@ struct page_pool *page_pool_create(const struct page_pool_params *params)
>   }
>   EXPORT_SYMBOL(page_pool_create);
>   
> -static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem);
> +static void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem,
> +				    bool destroyed);
>   
>   static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
>   {
> @@ -407,7 +612,7 @@ static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
>   			 * (2) break out to fallthrough to alloc_pages_node.
>   			 * This limit stress on page buddy alloactor.
>   			 */
> -			page_pool_return_page(pool, netmem);
> +			__page_pool_return_page(pool, netmem, false);
>   			alloc_stat_inc(pool, waive);
>   			netmem = 0;
>   			break;
> @@ -464,6 +669,7 @@ page_pool_dma_sync_for_device(const struct page_pool *pool,
>   
>   static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
>   {
> +	struct page_pool_item *item;
>   	dma_addr_t dma;
>   
>   	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
> @@ -481,6 +687,9 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
>   	if (page_pool_set_dma_addr_netmem(netmem, dma))
>   		goto unmap_failed;
>   
> +	item = netmem_get_pp_item(netmem);
> +	DEBUG_NET_WARN_ON_ONCE(page_pool_item_is_mapped(item));
> +	page_pool_item_set_mapped(item);
>   	page_pool_dma_sync_for_device(pool, netmem, pool->p.max_len);
>   
>   	return true;
> @@ -503,19 +712,24 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
>   	if (unlikely(!page))
>   		return NULL;
>   
> -	if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page)))) {
> -		put_page(page);
> -		return NULL;
> -	}
> +	if (unlikely(!page_pool_set_pp_info(pool, page_to_netmem(page))))
> +		goto err_alloc;
> +
> +	if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page))))
> +		goto err_set_info;
>   
>   	alloc_stat_inc(pool, slow_high_order);
> -	page_pool_set_pp_info(pool, page_to_netmem(page));
>   
>   	/* Track how many pages are held 'in-flight' */
>   	pool->pages_state_hold_cnt++;
>   	trace_page_pool_state_hold(pool, page_to_netmem(page),
>   				   pool->pages_state_hold_cnt);
>   	return page;
> +err_set_info:
> +	page_pool_clear_pp_info(pool, page_to_netmem(page));
> +err_alloc:
> +	put_page(page);
> +	return NULL;
>   }
>   
>   /* slow path */
> @@ -550,12 +764,18 @@ static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool,
>   	 */
>   	for (i = 0; i < nr_pages; i++) {
>   		netmem = pool->alloc.cache[i];
> +
> +		if (unlikely(!page_pool_set_pp_info(pool, netmem))) {
> +			put_page(netmem_to_page(netmem));
> +			continue;
> +		}
> +
>   		if (dma_map && unlikely(!page_pool_dma_map(pool, netmem))) {
> +			page_pool_clear_pp_info(pool, netmem);
>   			put_page(netmem_to_page(netmem));
>   			continue;
>   		}
>   
> -		page_pool_set_pp_info(pool, netmem);
>   		pool->alloc.cache[pool->alloc.count++] = netmem;
>   		/* Track how many pages are held 'in-flight' */
>   		pool->pages_state_hold_cnt++;
> @@ -627,9 +847,11 @@ s32 page_pool_inflight(const struct page_pool *pool, bool strict)
>   	return inflight;
>   }
>   
> -void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
> +bool page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
>   {
> -	netmem_set_pp(netmem, pool);
> +	if (unlikely(!page_pool_item_add(pool, netmem)))
> +		return false;
> +
>   	netmem_or_pp_magic(netmem, PP_SIGNATURE);
>   
>   	/* Ensuring all pages have been split into one fragment initially:
> @@ -641,32 +863,14 @@ void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
>   	page_pool_fragment_netmem(netmem, 1);
>   	if (pool->has_init_callback)
>   		pool->slow.init_callback(netmem, pool->slow.init_arg);
> -}
>   
> -void page_pool_clear_pp_info(netmem_ref netmem)
> -{
> -	netmem_clear_pp_magic(netmem);
> -	netmem_set_pp(netmem, NULL);
> +	return true;
>   }
>   
> -static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
> -							 netmem_ref netmem)
> +void page_pool_clear_pp_info(struct page_pool *pool, netmem_ref netmem)
>   {
> -	dma_addr_t dma;
> -
> -	if (!pool->dma_map)
> -		/* Always account for inflight pages, even if we didn't
> -		 * map them
> -		 */
> -		return;
> -
> -	dma = page_pool_get_dma_addr_netmem(netmem);
> -
> -	/* When page is unmapped, it cannot be returned to our pool */
> -	dma_unmap_page_attrs(pool->p.dev, dma,
> -			     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
> -			     DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
> -	page_pool_set_dma_addr_netmem(netmem, 0);
> +	netmem_clear_pp_magic(netmem);
> +	page_pool_item_del(pool, netmem);
>   }
>   
>   /* Disconnects a page (from a page_pool).  API users can have a need
> @@ -674,7 +878,8 @@ static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
>    * a regular page (that will eventually be returned to the normal
>    * page-allocator via put_page).
>    */
> -void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
> +void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem,
> +			     bool destroyed)
>   {
>   	int count;
>   	bool put;
> @@ -683,7 +888,7 @@ void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
>   	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_priv)
>   		put = mp_dmabuf_devmem_release_page(pool, netmem);
>   	else
> -		__page_pool_release_page_dma(pool, netmem);
> +		__page_pool_release_page_dma(pool, netmem, destroyed);
>   
>   	/* This may be the last page returned, releasing the pool, so
>   	 * it is not safe to reference pool afterwards.
> @@ -692,7 +897,7 @@ void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
>   	trace_page_pool_state_release(pool, netmem, count);
>   
>   	if (put) {
> -		page_pool_clear_pp_info(netmem);
> +		page_pool_clear_pp_info(pool, netmem);
>   		put_page(netmem_to_page(netmem));
>   	}
>   	/* An optimization would be to call __free_pages(page, pool->p.order)
> @@ -701,6 +906,27 @@ void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
>   	 */
>   }
>   
> +/* Called from page_pool_put_*() path, needs to be synchronized with
> + * page_pool_destroy() path.
> + */
> +static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
> +{
> +	unsigned int destroy_cnt;
> +
> +	rcu_read_lock();
> +
> +	destroy_cnt = READ_ONCE(pool->destroy_cnt);
> +	if (unlikely(destroy_cnt)) {
> +		spin_lock_bh(&pool->item_lock);
> +		__page_pool_return_page(pool, netmem, true);
> +		spin_unlock_bh(&pool->item_lock);
> +	} else {
> +		__page_pool_return_page(pool, netmem, false);
> +	}
> +
> +	rcu_read_unlock();
> +}
> +
>   static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem)
>   {
>   	int ret;
> @@ -963,7 +1189,7 @@ static netmem_ref page_pool_drain_frag(struct page_pool *pool,
>   		return netmem;
>   	}
>   
> -	page_pool_return_page(pool, netmem);
> +	__page_pool_return_page(pool, netmem, false);
>   	return 0;
>   }
>   
> @@ -977,7 +1203,7 @@ static void page_pool_free_frag(struct page_pool *pool)
>   	if (!netmem || page_pool_unref_netmem(netmem, drain_count))
>   		return;
>   
> -	page_pool_return_page(pool, netmem);
> +	__page_pool_return_page(pool, netmem, false);
>   }
>   
>   netmem_ref page_pool_alloc_frag_netmem(struct page_pool *pool,
> @@ -1053,6 +1279,7 @@ static void __page_pool_destroy(struct page_pool *pool)
>   	if (pool->disconnect)
>   		pool->disconnect(pool);
>   
> +	page_pool_item_uninit(pool);
>   	page_pool_unlist(pool);
>   	page_pool_uninit(pool);
>   
> @@ -1084,7 +1311,7 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
>   static void page_pool_scrub(struct page_pool *pool)
>   {
>   	page_pool_empty_alloc_cache_once(pool);
> -	pool->destroy_cnt++;
> +	WRITE_ONCE(pool->destroy_cnt, pool->destroy_cnt + 1);
>   
>   	/* No more consumers should exist, but producers could still
>   	 * be in-flight.
> @@ -1178,6 +1405,8 @@ void page_pool_destroy(struct page_pool *pool)
>   	 */
>   	synchronize_rcu();
>   
> +	page_pool_items_unmap(pool);
> +
>   	page_pool_detached(pool);
>   	pool->defer_start = jiffies;
>   	pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
> @@ -1198,7 +1427,7 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
>   	/* Flush pool alloc cache, as refill will check NUMA node */
>   	while (pool->alloc.count) {
>   		netmem = pool->alloc.cache[--pool->alloc.count];
> -		page_pool_return_page(pool, netmem);
> +		__page_pool_return_page(pool, netmem, false);
>   	}
>   }
>   EXPORT_SYMBOL(page_pool_update_nid);
> diff --git a/net/core/page_pool_priv.h b/net/core/page_pool_priv.h
> index 57439787b9c2..5d85f862a30a 100644
> --- a/net/core/page_pool_priv.h
> +++ b/net/core/page_pool_priv.h
> @@ -36,16 +36,18 @@ static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
>   }
>   
>   #if defined(CONFIG_PAGE_POOL)
> -void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem);
> -void page_pool_clear_pp_info(netmem_ref netmem);
> +bool page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem);
> +void page_pool_clear_pp_info(struct page_pool *pool, netmem_ref netmem);
>   int page_pool_check_memory_provider(struct net_device *dev,
>   				    struct netdev_rx_queue *rxq);
>   #else
> -static inline void page_pool_set_pp_info(struct page_pool *pool,
> +static inline bool page_pool_set_pp_info(struct page_pool *pool,
>   					 netmem_ref netmem)
>   {
> +	return true;
>   }
> -static inline void page_pool_clear_pp_info(netmem_ref netmem)
> +static inline void page_pool_clear_pp_info(struct page_pool *pool,
> +					   netmem_ref netmem)
>   {
>   }
>   static inline int page_pool_check_memory_provider(struct net_device *dev,

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v7 0/8] fix two bugs related to page_pool
  2025-01-15 11:33   ` Yunsheng Lin
@ 2025-01-15 17:40     ` Jesper Dangaard Brouer
  2025-01-16 12:52       ` Yunsheng Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2025-01-15 17:40 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Alexander Lobakin,
	Robin Murphy, Alexander Duyck, Andrew Morton, IOMMU, MM,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Matthias Brugger, AngeloGioacchino Del Regno, netdev,
	intel-wired-lan, bpf, linux-kernel, linux-arm-kernel,
	linux-mediatek



On 15/01/2025 12.33, Yunsheng Lin wrote:
> On 2025/1/14 22:31, Jesper Dangaard Brouer wrote:
>>
>>
>> On 10/01/2025 14.06, Yunsheng Lin wrote:
>>> This patchset fix a possible time window problem for page_pool and
>>> the dma API misuse problem as mentioned in [1], and try to avoid the
>>> overhead of the fixing using some optimization.
>>>
>>>   From the below performance data, the overhead is not so obvious
>>> due to performance variations for time_bench_page_pool01_fast_path()
>>> and time_bench_page_pool02_ptr_ring, and there is about 20ns overhead
>>> for time_bench_page_pool03_slow() for fixing the bug.
>>>
>>
>> My benchmarking on x86_64 CPUs looks significantly different.
>>   - CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
>>
>> Benchmark (bench_page_pool_simple) results from before and after patchset:
>>
>> | Test name  | Cycles |       |    |Nanosec |        |       |      % |
>> | (tasklet_*)| Before | After |diff| Before |  After |  diff | change |
>> |------------+--------+-------+----+--------+--------+-------+--------|
>> | fast_path  |     19 |    24 |   5|  5.399 |  6.928 | 1.529 |   28.3 |
>> | ptr_ring   |     54 |    79 |  25| 15.090 | 21.976 | 6.886 |   45.6 |
>> | slow       |    238 |   299 |  61| 66.134 | 83.298 |17.164 |   26.0 |
>> #+TBLFM: $4=$3-$2::$7=$6-$5::$8=(($7/$5)*100);%.1f
>>
>> My above testing shows clear performance regressions across three
>> different page_pool operating modes.
> 
> I retested it on an arm64 server patch by patch, with the raw performance
> data in the attachment; the results seem similar to before.
> 
> Before this patchset:
>              fast_path              ptr_ring            slow
> 1.         31.171 ns               60.980 ns          164.917 ns
> 2.         28.824 ns               60.891 ns          170.241 ns
> 3.         14.236 ns               60.583 ns          164.355 ns
> 
> With patch 1-4:
> 4.         31.443 ns               53.242 ns          210.148 ns
> 5.         31.406 ns               53.270 ns          210.189 ns
> 
> With patch 1-5:
> 6.         26.163 ns               53.781 ns          189.450 ns
> 7.         26.189 ns               53.798 ns          189.466 ns
> 
> With patch 1-8:
> 8.         28.108 ns               68.199 ns          202.516 ns
> 9.         16.128 ns               55.904 ns          202.711 ns
> 
> I am not able to get hold of an x86 server yet; I might be able
> to get one during the weekend.
> 
> Theoretically, patch 1-4 or 1-5 should not have much performance
> impact for fast_path and ptr_ring except for the rcu_lock mentioned
> in page_pool_napi_local(), so it would be good if patch 1-5 is also
> tested in your testlab with the rcu_lock removed in
> page_pool_napi_local().
> 

What are you saying?
  - (1) test patch 1-5
  - or (2) test patch 1-5 but revert patch 2 with page_pool_napi_local()

--Jesper

>>
>>
>> Data also available in:
>>   - https://github.com/xdp-project/xdp-project/blob/main/areas/mem/page_pool07_bench_DMA_fix.org
>>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v7 0/8] fix two bugs related to page_pool
  2025-01-15 17:40     ` Jesper Dangaard Brouer
@ 2025-01-16 12:52       ` Yunsheng Lin
  2025-01-16 18:02         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 31+ messages in thread
From: Yunsheng Lin @ 2025-01-16 12:52 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Alexander Lobakin,
	Robin Murphy, Alexander Duyck, Andrew Morton, IOMMU, MM,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Matthias Brugger, AngeloGioacchino Del Regno, netdev,
	intel-wired-lan, bpf, linux-kernel, linux-arm-kernel,
	linux-mediatek

On 2025/1/16 1:40, Jesper Dangaard Brouer wrote:
> 
> 
> On 15/01/2025 12.33, Yunsheng Lin wrote:
>> On 2025/1/14 22:31, Jesper Dangaard Brouer wrote:
>>>
>>>
>>> On 10/01/2025 14.06, Yunsheng Lin wrote:
>>>> This patchset fix a possible time window problem for page_pool and
>>>> the dma API misuse problem as mentioned in [1], and try to avoid the
>>>> overhead of the fixing using some optimization.
>>>>
>>>>   From the below performance data, the overhead is not so obvious
>>>> due to performance variations for time_bench_page_pool01_fast_path()
>>>> and time_bench_page_pool02_ptr_ring, and there is about 20ns overhead
>>>> for time_bench_page_pool03_slow() for fixing the bug.
>>>>
>>>
>>> My benchmarking on x86_64 CPUs looks significantly different.
>>>   - CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
>>>
>>> Benchmark (bench_page_pool_simple) results from before and after patchset:
>>>
>>> | Test name  | Cycles |       |    |Nanosec |        |       |      % |
>>> | (tasklet_*)| Before | After |diff| Before |  After |  diff | change |
>>> |------------+--------+-------+----+--------+--------+-------+--------|
>>> | fast_path  |     19 |    24 |   5|  5.399 |  6.928 | 1.529 |   28.3 |
>>> | ptr_ring   |     54 |    79 |  25| 15.090 | 21.976 | 6.886 |   45.6 |
>>> | slow       |    238 |   299 |  61| 66.134 | 83.298 |17.164 |   26.0 |
>>> #+TBLFM: $4=$3-$2::$7=$6-$5::$8=(($7/$5)*100);%.1f
>>>
>>> My above testing shows clear performance regressions across three
>>> different page_pool operating modes.
>>
>> I retested it on an arm64 server patch by patch, with the raw performance
>> data in the attachment; the results seem similar to before.
>>
>> Before this patchset:
>>              fast_path              ptr_ring            slow
>> 1.         31.171 ns               60.980 ns          164.917 ns
>> 2.         28.824 ns               60.891 ns          170.241 ns
>> 3.         14.236 ns               60.583 ns          164.355 ns
>>
>> With patch 1-4:
>> 4.         31.443 ns               53.242 ns          210.148 ns
>> 5.         31.406 ns               53.270 ns          210.189 ns
>>
>> With patch 1-5:
>> 6.         26.163 ns               53.781 ns          189.450 ns
>> 7.         26.189 ns               53.798 ns          189.466 ns
>>
>> With patch 1-8:
>> 8.         28.108 ns               68.199 ns          202.516 ns
>> 9.         16.128 ns               55.904 ns          202.711 ns
>>
>> I am not able to get hold of a x86 server yet, I might be able
>> to get one during weekend.
>>
>> Theoretically, patch 1-4 or 1-5 should not have much performance
>> impact for fast_path and ptr_ring except for the rcu_lock mentioned
>> in page_pool_napi_local(), so it would be good if patch 1-5 is also
>> tested in your testlab with the rcu_lock removing in
>> page_pool_napi_local().
>>
> 
> What are you saying?
>  - (1) test patch 1-5
>  - or (2) test patch 1-5 but revert patch 2 with page_pool_napi_local()

Patches 1-5 with the diff below applied.

--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -1207,10 +1207,8 @@ static bool page_pool_napi_local(const struct page_pool *pool)
        /* Synchronizated with page_pool_destory() to avoid use-after-free
         * for 'napi'.
         */
-       rcu_read_lock();
        napi = READ_ONCE(pool->p.napi);
        napi_local = napi && READ_ONCE(napi->list_owner) == cpuid;
-       rcu_read_unlock();

        return napi_local;
 }


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v7 3/8] page_pool: fix IOMMU crash when driver has already unbound
  2025-01-15 16:29   ` Jesper Dangaard Brouer
@ 2025-01-16 12:52     ` Yunsheng Lin
  2025-01-16 16:09       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 31+ messages in thread
From: Yunsheng Lin @ 2025-01-16 12:52 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Robin Murphy,
	Alexander Duyck, IOMMU, Andrew Morton, Eric Dumazet, Simon Horman,
	Ilias Apalodimas, linux-mm, linux-kernel, netdev, kernel-team

On 2025/1/16 0:29, Jesper Dangaard Brouer wrote:
> 
> 
> On 10/01/2025 14.06, Yunsheng Lin wrote:
> [...]
>> In order not to call DMA APIs to do DMA unmmapping after driver
>> has already unbound and stall the unloading of the networking
>> driver, use some pre-allocated item blocks to record inflight
>> pages including the ones which are handed over to network stack,
>> so the page_pool can do the DMA unmmapping for those pages when
>> page_pool_destroy() is called. As the pre-allocated item blocks
>> need to be large enough to avoid performance degradation, add a
>> 'item_fast_empty' stat to indicate the unavailability of the
>> pre-allocated item blocks.
>>
> 

...

>> +
>> +static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
>> +                             netmem_ref netmem,
>> +                             bool destroyed)
>> +{
>> +    struct page_pool_item *item;
>> +    dma_addr_t dma;
>> +
>> +    if (!pool->dma_map)
>> +        /* Always account for inflight pages, even if we didn't
>> +         * map them
>> +         */
>> +        return;
>> +
>> +    dma = page_pool_get_dma_addr_netmem(netmem);
>> +    item = netmem_get_pp_item(netmem);
>> +
>> +    /* dma unmapping is always needed when page_pool_destory() is not called
>> +     * yet.
>> +     */
>> +    DEBUG_NET_WARN_ON_ONCE(!destroyed && !page_pool_item_is_mapped(item));
>> +    if (unlikely(destroyed && !page_pool_item_is_mapped(item)))
>> +        return;
>> +
>> +    /* When page is unmapped, it cannot be returned to our pool */
>> +    dma_unmap_page_attrs(pool->p.dev, dma,
>> +                 PAGE_SIZE << pool->p.order, pool->p.dma_dir,
>> +                 DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
>> +    page_pool_set_dma_addr_netmem(netmem, 0);
>> +    page_pool_item_clear_mapped(item);
>> +}
>> +
> 
> I have a hard time reading/reviewing/maintaining below code, without
> some design description.  This code needs more comments on what is the
> *intend* and design it's trying to achieve.
> 
> From patch description the only hint I have is:
>  "use some pre-allocated item blocks to record inflight pages"
> 
> E.g. Why is it needed/smart to hijack the page->pp pointer?

Mainly because there is no space available for keeping track of inflight
pages; using page->pp can only find the page_pool owning the page, but the
page_pool is not able to keep track of an inflight page while the page is
being handled by the networking stack.

By using the page_pool_item below, 'state' tells whether a specific item is
in use/DMA mapped, which can be checked by scanning all the item blocks in
pool->item_blocks. If a specific item is used by a page, then 'pp_netmem'
points to that page so that DMA unmapping can be done for it when
page_pool_destroy() is called; otherwise the free item sits in
pool->hold_items or pool->release_items via 'lentry':

struct page_pool_item {
	unsigned long state;
	
	union {
		netmem_ref pp_netmem;
		struct llist_node lentry;
	};
};

When a page is added to the page_pool, an item is taken from pool->hold_items
or pool->release_items, its 'pp_netmem' is set to point to that page, and
'state' is set accordingly in order to keep track of that page.

When a page is removed from the page_pool, the below function tells which
page_pool the page belongs to, and after clearing the 'state' the item is
added back to pool->release_items so that it can be reused for new pages.

static inline struct page_pool_item_block *
page_pool_item_to_block(struct page_pool_item *item)
{
	return (struct page_pool_item_block *)((unsigned long)item & PAGE_MASK);
}

 static inline struct page_pool *page_pool_get_pp(struct page *page)
 {
      return page_pool_item_to_block(page->pp_item)->pp;
 }
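
As a minimal user-space sketch of the masking trick, with a page-aligned
allocation standing in for alloc_pages_node() and illustrative names only:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SIZE 4096UL                 /* stand-in for PAGE_SIZE */
#define BLOCK_MASK (~(BLOCK_SIZE - 1))    /* stand-in for PAGE_MASK */

struct item { void *payload; };

struct block {
	void *owner;                      /* stand-in for block->pp */
	struct item items[8];
};

int main(void)
{
	/* The block header sits at the start of an aligned block, so any
	 * item address inside the block can be masked back to the header.
	 */
	struct block *blk = aligned_alloc(BLOCK_SIZE, BLOCK_SIZE);
	struct item *it;
	struct block *found;

	if (!blk)
		return 1;

	it = &blk->items[3];
	found = (struct block *)((uintptr_t)it & BLOCK_MASK);
	printf("recovered block: %s\n", found == blk ? "yes" : "no");

	free(blk);
	return 0;
}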


> 
>> +static void __page_pool_item_init(struct page_pool *pool, struct page *page)
>> +{
> 
> Function name is confusing.  First I though this was init'ing a single
> item, but looking at the code it is iterating over ITEMS_PER_PAGE.
> 
> Maybe it should be called page_pool_item_block_init ?

The __page_pool_item_init() is added to make the below
page_pool_item_init() function more readable and maintainable; changing
it to page_pool_item_block_init() doesn't seem consistent?

> 
>> +    struct page_pool_item_block *block = page_address(page);
>> +    struct page_pool_item *items = block->items;
>> +    unsigned int i;
>> +
>> +    list_add(&block->list, &pool->item_blocks);
>> +    block->pp = pool;
>> +
>> +    for (i = 0; i < ITEMS_PER_PAGE; i++) {
>> +        page_pool_item_init_state(&items[i]);
>> +        __llist_add(&items[i].lentry, &pool->hold_items);
>> +    }
>> +}
>> +
>> +static int page_pool_item_init(struct page_pool *pool)
>> +{
>> +#define PAGE_POOL_MIN_INFLIGHT_ITEMS        512
>> +    struct page_pool_item_block *block;
>> +    int item_cnt;
>> +
>> +    INIT_LIST_HEAD(&pool->item_blocks);
>> +    init_llist_head(&pool->hold_items);
>> +    init_llist_head(&pool->release_items);
>> +
>> +    item_cnt = pool->p.pool_size * 2 + PP_ALLOC_CACHE_SIZE +
>> +        PAGE_POOL_MIN_INFLIGHT_ITEMS;
>> +    while (item_cnt > 0) {
>> +        struct page *page;
>> +
>> +        page = alloc_pages_node(pool->p.nid, GFP_KERNEL, 0);
>> +        if (!page)
>> +            goto err;
>> +
>> +        __page_pool_item_init(pool, page);
>> +        item_cnt -= ITEMS_PER_PAGE;
>> +    }
>> +
>> +    return 0;
>> +err:
>> +    list_for_each_entry(block, &pool->item_blocks, list)
>> +        put_page(virt_to_page(block));

This one also has the same use-after-free problem as page_pool_item_uninit()
had in the previous version (see the sketch after the quoted code below).

>> +
>> +    return -ENOMEM;
>> +}
>> +
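
One way to avoid it is to drop each block from the list before freeing its
page, e.g. with a second cursor (a sketch only, assuming the same surrounding
definitions, where 'tmp' is an extra struct page_pool_item_block * local):

err:
	list_for_each_entry_safe(block, tmp, &pool->item_blocks, list) {
		list_del(&block->list);
		put_page(virt_to_page(block));
	}

	return -ENOMEM;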


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v7 3/8] page_pool: fix IOMMU crash when driver has already unbound
  2025-01-16 12:52     ` Yunsheng Lin
@ 2025-01-16 16:09       ` Jesper Dangaard Brouer
  2025-01-17 11:56         ` Yunsheng Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2025-01-16 16:09 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Robin Murphy,
	Alexander Duyck, IOMMU, Andrew Morton, Eric Dumazet, Simon Horman,
	Ilias Apalodimas, linux-mm, linux-kernel, netdev, kernel-team




On 16/01/2025 13.52, Yunsheng Lin wrote:
> On 2025/1/16 0:29, Jesper Dangaard Brouer wrote:
>>
>>
>> On 10/01/2025 14.06, Yunsheng Lin wrote:
>> [...]
>>> In order not to call DMA APIs to do DMA unmmapping after driver
>>> has already unbound and stall the unloading of the networking
>>> driver, use some pre-allocated item blocks to record inflight
>>> pages including the ones which are handed over to network stack,
>>> so the page_pool can do the DMA unmmapping for those pages when
>>> page_pool_destroy() is called. As the pre-allocated item blocks
>>> need to be large enough to avoid performance degradation, add a
>>> 'item_fast_empty' stat to indicate the unavailability of the
>>> pre-allocated item blocks.
>>>
>>
> 
> ...
> 
>>> +
>>> +static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
>>> +                             netmem_ref netmem,
>>> +                             bool destroyed)
>>> +{
>>> +    struct page_pool_item *item;
>>> +    dma_addr_t dma;
>>> +
>>> +    if (!pool->dma_map)
>>> +        /* Always account for inflight pages, even if we didn't
>>> +         * map them
>>> +         */
>>> +        return;
>>> +
>>> +    dma = page_pool_get_dma_addr_netmem(netmem);
>>> +    item = netmem_get_pp_item(netmem);
>>> +
>>> +    /* dma unmapping is always needed when page_pool_destory() is not called
>>> +     * yet.
>>> +     */
>>> +    DEBUG_NET_WARN_ON_ONCE(!destroyed && !page_pool_item_is_mapped(item));
>>> +    if (unlikely(destroyed && !page_pool_item_is_mapped(item)))
>>> +        return;
>>> +
>>> +    /* When page is unmapped, it cannot be returned to our pool */
>>> +    dma_unmap_page_attrs(pool->p.dev, dma,
>>> +                 PAGE_SIZE << pool->p.order, pool->p.dma_dir,
>>> +                 DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
>>> +    page_pool_set_dma_addr_netmem(netmem, 0);
>>> +    page_pool_item_clear_mapped(item);
>>> +}
>>> +
>>
>> I have a hard time reading/reviewing/maintaining below code, without
>> some design description.  This code needs more comments on what is the
>> *intend* and design it's trying to achieve.
>>
>>  From patch description the only hint I have is:
>>   "use some pre-allocated item blocks to record inflight pages"
>>
>> E.g. Why is it needed/smart to hijack the page->pp pointer?
> 
> Mainly because there is no space available for keeping tracking of inflight
> pages, using page->pp can only find the page_pool owning the page, but page_pool
> is not able to keep track of the inflight page when the page is handled by
> networking stack.
> 
> By using page_pool_item as below, the state is used to tell if a specific
> item is being used/dma mapped or not by scanning all the item blocks in
> pool->item_blocks. If a specific item is used by a page, then 'pp_netmem'
> will point to that page so that dma unmapping can be done for that page
> when page_pool_destroy() is called, otherwise free items sit in the
> pool->hold_items or pool->release_items by using 'lentry':
> 
> struct page_pool_item {
> 	unsigned long state;
> 	
> 	union {
> 		netmem_ref pp_netmem;
> 		struct llist_node lentry;
> 	};
> };

pahole  -C page_pool_item vmlinux
struct page_pool_item {
	/* An 'encoded_next' is a pointer to next item, lower 2 bits is used to
	 * indicate the state of current item.
	 */	
	long unsigned int          encoded_next;     /*     0     8 */
	union {
		netmem_ref         pp_netmem;        /*     8     8 */
		struct llist_node  lentry;           /*     8     8 */
	};                                           /*     8     8 */

	/* size: 16, cachelines: 1, members: 2 */
	/* last cacheline: 16 bytes */
};
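
(Aside for reviewers: the low bits are free to carry state because the items
are 16-byte, naturally aligned objects, so the bottom two bits of a pointer to
the next item are always zero.  A minimal sketch of this style of encoding,
with made-up flag names rather than the ones from the patch:)

#include <assert.h>
#include <stdint.h>

#define ITEM_STATE_MASK	0x3UL	/* low two bits of the encoded pointer */

/* illustrative state values, not taken from the patch */
#define ITEM_IN_USE	0x1UL
#define ITEM_DMA_MAPPED	0x2UL

static inline unsigned long item_encode(void *next, unsigned long state)
{
	/* alignment guarantees the low bits of 'next' are zero */
	assert(((uintptr_t)next & ITEM_STATE_MASK) == 0);
	return (uintptr_t)next | (state & ITEM_STATE_MASK);
}

static inline void *item_next(unsigned long encoded)
{
	return (void *)(encoded & ~ITEM_STATE_MASK);
}

static inline unsigned long item_state(unsigned long encoded)
{
	return encoded & ITEM_STATE_MASK;
}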


> When a page is added to the page_pool, a item is deleted from pool->hold_items
> or pool->release_items and set the 'pp_netmem' pointing to that page and set
> 'state' accordingly in order to keep track of that page.
> 
> When a page is deleted from the page_pool, it is able to tell which page_pool
> this page belong to by using the below function, and after clearing the 'state',
> the item is added back to pool->release_items so that the item is reused for new
> pages.
> 

To understand below, I'm listing struct page_pool_item_block for other
reviewers:

pahole  -C page_pool_item_block vmlinux
struct page_pool_item_block {
	struct page_pool *         pp;               /*     0     8 */
	struct list_head           list;             /*     8    16 */
	unsigned int               flags;            /*    24     4 */
	refcount_t                 ref;              /*    28     4 */
	struct page_pool_item      items[];          /*    32     0 */

	/* size: 32, cachelines: 1, members: 5 */
	/* last cacheline: 32 bytes */
};

> static inline struct page_pool_item_block *
> page_pool_item_to_block(struct page_pool_item *item)
> {
> 	return (struct page_pool_item_block *)((unsigned long)item & PAGE_MASK);

This trick requires some comments explaining what is going on!
Please correct me if I'm wrong: here you are masking off the lower bits of
the pointer to page_pool_item *item, as you know that a struct
page_pool_item_block is stored at the start of a page.  This trick
is like a "container_of" for going from page_pool_item to
page_pool_item_block, right?

I do notice that you have a comment above struct page_pool_item_block
(that says "item_block is always PAGE_SIZE"), which is nice, but to be
more explicit/clear:
  I want a big comment block (placed above the main code here) that
explains the design and intention behind this newly invented
"item-block" scheme, like e.g. the connection between
page_pool_item_block and page_pool_item. Like the advantage/trick that
allows page->pp pointer to be an "item" and be mapped back to a "block"
to find the page_pool object it belongs to.  Don't write *what* the code
does, but write about the intended purpose and design reasons behind the
code.


> }
> 
>   static inline struct page_pool *page_pool_get_pp(struct page *page)
>   {
>        return page_pool_item_to_block(page->pp_item)->pp;
>   }
> 
> 
>>
>>> +static void __page_pool_item_init(struct page_pool *pool, struct page *page)
>>> +{
>>
>> Function name is confusing.  First I though this was init'ing a single
>> item, but looking at the code it is iterating over ITEMS_PER_PAGE.
>>
>> Maybe it should be called page_pool_item_block_init ?
> 
> The __page_pool_item_init() is added to make the below
> page_pool_item_init() function more readable or maintainable, changing
> it to page_pool_item_block_init doesn't seems consistent?

You (of course) also have to rename the other function; I thought that was
implicitly understood.

BUT does my suggested rename make sense?  What I'm seeing is that all
the *items* in the "block" are getting inited. But we are also setting up
the "block" (e.g. "block->pp=pool").

>>
>>> +    struct page_pool_item_block *block = page_address(page);
>>> +    struct page_pool_item *items = block->items;
>>> +    unsigned int i;
>>> +
>>> +    list_add(&block->list, &pool->item_blocks);
>>> +    block->pp = pool;
>>> +
>>> +    for (i = 0; i < ITEMS_PER_PAGE; i++) {
>>> +        page_pool_item_init_state(&items[i]);
>>> +        __llist_add(&items[i].lentry, &pool->hold_items);
>>> +    }
>>> +}
>>> +
>>> +static int page_pool_item_init(struct page_pool *pool)
>>> +{
>>> +#define PAGE_POOL_MIN_INFLIGHT_ITEMS        512
>>> +    struct page_pool_item_block *block;
>>> +    int item_cnt;
>>> +
>>> +    INIT_LIST_HEAD(&pool->item_blocks);
>>> +    init_llist_head(&pool->hold_items);
>>> +    init_llist_head(&pool->release_items);
>>> +
>>> +    item_cnt = pool->p.pool_size * 2 + PP_ALLOC_CACHE_SIZE +
>>> +        PAGE_POOL_MIN_INFLIGHT_ITEMS;
>>> +    while (item_cnt > 0) {
>>> +        struct page *page;
>>> +
>>> +        page = alloc_pages_node(pool->p.nid, GFP_KERNEL, 0);
>>> +        if (!page)
>>> +            goto err;
>>> +
>>> +        __page_pool_item_init(pool, page);
>>> +        item_cnt -= ITEMS_PER_PAGE;
>>> +    }
>>> +
>>> +    return 0;
>>> +err:
>>> +    list_for_each_entry(block, &pool->item_blocks, list)
>>> +        put_page(virt_to_page(block));
> 
> This one also have used-after-free problem as the page_pool_item_uninit
> in the previous version.
> 
>>> +
>>> +    return -ENOMEM;
>>> +}
>>> +
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v7 0/8] fix two bugs related to page_pool
  2025-01-16 12:52       ` Yunsheng Lin
@ 2025-01-16 18:02         ` Jesper Dangaard Brouer
  2025-01-17 11:35           ` Yunsheng Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2025-01-16 18:02 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Alexander Lobakin,
	Robin Murphy, Alexander Duyck, Andrew Morton, IOMMU, MM,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Matthias Brugger, AngeloGioacchino Del Regno, netdev,
	intel-wired-lan, bpf, linux-kernel, linux-arm-kernel,
	linux-mediatek



On 16/01/2025 13.52, Yunsheng Lin wrote:
> On 2025/1/16 1:40, Jesper Dangaard Brouer wrote:
>>
>>
>> On 15/01/2025 12.33, Yunsheng Lin wrote:
>>> On 2025/1/14 22:31, Jesper Dangaard Brouer wrote:
>>>>
>>>>
>>>> On 10/01/2025 14.06, Yunsheng Lin wrote:
>>>>> This patchset fix a possible time window problem for page_pool and
>>>>> the dma API misuse problem as mentioned in [1], and try to avoid the
>>>>> overhead of the fixing using some optimization.
>>>>>
>>>>>    From the below performance data, the overhead is not so obvious
>>>>> due to performance variations for time_bench_page_pool01_fast_path()
>>>>> and time_bench_page_pool02_ptr_ring, and there is about 20ns overhead
>>>>> for time_bench_page_pool03_slow() for fixing the bug.
>>>>>
>>>>
>>>> My benchmarking on x86_64 CPUs looks significantly different.
>>>>    - CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
>>>>
>>>> Benchmark (bench_page_pool_simple) results from before and after patchset:
>>>>
>>>> | Test name  | Cycles |       |    |Nanosec |        |       |      % |
>>>> | (tasklet_*)| Before | After |diff| Before |  After |  diff | change |
>>>> |------------+--------+-------+----+--------+--------+-------+--------|
>>>> | fast_path  |     19 |    24 |   5|  5.399 |  6.928 | 1.529 |   28.3 |
>>>> | ptr_ring   |     54 |    79 |  25| 15.090 | 21.976 | 6.886 |   45.6 |
>>>> | slow       |    238 |   299 |  61| 66.134 | 83.298 |17.164 |   26.0 |
>>>> #+TBLFM: $4=$3-$2::$7=$6-$5::$8=(($7/$5)*100);%.1f
>>>>
>>>> My above testing show a clear performance regressions across three
>>>> different page_pool operating modes.
>>>
>>> I retested it on arm64 server patch by patch as the raw performance
>>> data in the attachment, it seems the result seemed similar as before.
>>>
>>> Before this patchset:
>>>               fast_path              ptr_ring            slow
>>> 1.         31.171 ns               60.980 ns          164.917 ns
>>> 2.         28.824 ns               60.891 ns          170.241 ns
>>> 3.         14.236 ns               60.583 ns          164.355 ns
>>>
>>> With patch 1-4:
>>> 4.         31.443 ns               53.242 ns          210.148 ns
>>> 5.         31.406 ns               53.270 ns          210.189 ns
>>>
>>> With patch 1-5:
>>> 6.         26.163 ns               53.781 ns          189.450 ns
>>> 7.         26.189 ns               53.798 ns          189.466 ns
>>>
>>> With patch 1-8:
>>> 8.         28.108 ns               68.199 ns          202.516 ns
>>> 9.         16.128 ns               55.904 ns          202.711 ns
>>>
>>> I am not able to get hold of a x86 server yet, I might be able
>>> to get one during weekend.
>>>
>>> Theoretically, patch 1-4 or 1-5 should not have much performance
>>> impact for fast_path and ptr_ring except for the rcu_lock mentioned
>>> in page_pool_napi_local(), so it would be good if patch 1-5 is also
>>> tested in your testlab with the rcu_lock removing in
>>> page_pool_napi_local().
>>>
>>
>> What are you saying?
>>   - (1) test patch 1-5
>>   - or (2) test patch 1-5 but revert patch 2 with page_pool_napi_local()
> 
> patch 1-5 with below applied.
> 
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -1207,10 +1207,8 @@ static bool page_pool_napi_local(const struct page_pool *pool)
>          /* Synchronizated with page_pool_destory() to avoid use-after-free
>           * for 'napi'.
>           */
> -       rcu_read_lock();
>          napi = READ_ONCE(pool->p.napi);
>          napi_local = napi && READ_ONCE(napi->list_owner) == cpuid;
> -       rcu_read_unlock();
> 
>          return napi_local;
>   }
> 

Benchmark (bench_page_pool_simple) results from before and after the
patchset, with patches 1-5 applied and the rcu lock removal, as requested.

| Test name  |Cycles |   1-5 |    | Nanosec |    1-5 |        |      % |
| (tasklet_*)|Before | After |diff|  Before |  After |   diff | change |
|------------+-------+-------+----+---------+--------+--------+--------|
| fast_path  |    19 |    19 |   0|   5.399 |  5.492 |  0.093 |    1.7 |
| ptr_ring   |    54 |    57 |   3|  15.090 | 15.849 |  0.759 |    5.0 |
| slow       |   238 |   284 |  46|  66.134 | 78.909 | 12.775 |   19.3 |
#+TBLFM: $4=$3-$2::$7=$6-$5::$8=(($7/$5)*100);%.1f
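
(The diff and % columns are derived from the two before them, e.g. for the
slow test: 78.909 - 66.134 = 12.775 ns and 12.775 / 66.134 * 100 ≈ 19.3%.)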

This test with patches 1-5 looks much better regarding performance.

--Jesper

https://github.com/xdp-project/xdp-project/blob/main/areas/mem/page_pool07_bench_DMA_fix.org#e5-1650-pp01-dma-fix-v7-p1-5

Kernel:
  - 6.13.0-rc6-pp01-DMA-fix-v7-p1-5+ #5 SMP PREEMPT_DYNAMIC Thu Jan 16 18:06:53 CET 2025 x86_64 GNU/Linux

Machine: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz

modprobe bench_page_pool_simple loops=100000000

Raw data:
[  187.309423] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[  187.872849] time_bench: Type:no-softirq-page_pool01 Per elem: 19 cycles(tsc) 5.539 ns (step:0) - (measurement period time:0.553906443 sec time_interval:553906443) - (invoke count:100000000 tsc_interval:1994123064)
[  187.892023] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[  189.611070] time_bench: Type:no-softirq-page_pool02 Per elem: 61 cycles(tsc) 17.095 ns (step:0) - (measurement period time:1.709580367 sec time_interval:1709580367) - (invoke count:100000000 tsc_interval:6154679394)
[  189.630414] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[  197.222387] time_bench: Type:no-softirq-page_pool03 Per elem: 272 cycles(tsc) 75.826 ns (step:0) - (measurement period time:7.582681388 sec time_interval:7582681388) - (invoke count:100000000 tsc_interval:27298499214)
[  197.241926] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[  197.249968] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[  197.808470] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 19 cycles(tsc) 5.492 ns (step:0) - (measurement period time:0.549225541 sec time_interval:549225541) - (invoke count:100000000 tsc_interval:1977272238)
[  197.828174] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[  199.422305] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 57 cycles(tsc) 15.849 ns (step:0) - (measurement period time:1.584920736 sec time_interval:1584920736) - (invoke count:100000000 tsc_interval:5705890830)
[  199.442087] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[  207.342120] time_bench: Type:tasklet_page_pool03_slow Per elem: 284 cycles(tsc) 78.909 ns (step:0) - (measurement period time:7.890955151 sec time_interval:7890955151) - (invoke count:100000000 tsc_interval:28408319289)


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v7 0/8] fix two bugs related to page_pool
  2025-01-16 18:02         ` Jesper Dangaard Brouer
@ 2025-01-17 11:35           ` Yunsheng Lin
  2025-01-18  8:04             ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 31+ messages in thread
From: Yunsheng Lin @ 2025-01-17 11:35 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Alexander Lobakin,
	Robin Murphy, Alexander Duyck, Andrew Morton, IOMMU, MM,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Matthias Brugger, AngeloGioacchino Del Regno, netdev,
	intel-wired-lan, bpf, linux-kernel, linux-arm-kernel,
	linux-mediatek

On 2025/1/17 2:02, Jesper Dangaard Brouer wrote:

> 
> Benchmark (bench_page_pool_simple) results from before and after
> patchset with patches 1-5m and rcu lock removal as requested.
> 
> | Test name  |Cycles |   1-5 |    | Nanosec |    1-5 |        |      % |
> | (tasklet_*)|Before | After |diff|  Before |  After |   diff | change |
> |------------+-------+-------+----+---------+--------+--------+--------|
> | fast_path  |    19 |    19 |   0|   5.399 |  5.492 |  0.093 |    1.7 |
> | ptr_ring   |    54 |    57 |   3|  15.090 | 15.849 |  0.759 |    5.0 |
> | slow       |   238 |   284 |  46|  66.134 | 78.909 | 12.775 |   19.3 |
> #+TBLFM: $4=$3-$2::$7=$6-$5::$8=(($7/$5)*100);%.1f
> 
> This test with patches 1-5 looks much better regarding performance.

Thanks for the testing.

Is there any noticeable performance variation between different test runs
for the same built kernel on your machine?

> 
> --Jesper
> 
> https://github.com/xdp-project/xdp-project/blob/main/areas/mem/page_pool07_bench_DMA_fix.org#e5-1650-pp01-dma-fix-v7-p1-5
> 
> Kernel:
>  - 6.13.0-rc6-pp01-DMA-fix-v7-p1-5+ #5 SMP PREEMPT_DYNAMIC Thu Jan 16 18:06:53 CET 2025 x86_64 GNU/Linux
> 
> Machine: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
> 
> modprobe bench_page_pool_simple loops=100000000
> 
> Raw data:
> [  187.309423] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
> [  187.872849] time_bench: Type:no-softirq-page_pool01 Per elem: 19 cycles(tsc) 5.539 ns (step:0) - (measurement period time:0.553906443 sec time_interval:553906443) - (invoke count:100000000 tsc_interval:1994123064)
> [  187.892023] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
> [  189.611070] time_bench: Type:no-softirq-page_pool02 Per elem: 61 cycles(tsc) 17.095 ns (step:0) - (measurement period time:1.709580367 sec time_interval:1709580367) - (invoke count:100000000 tsc_interval:6154679394)
> [  189.630414] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
> [  197.222387] time_bench: Type:no-softirq-page_pool03 Per elem: 272 cycles(tsc) 75.826 ns (step:0) - (measurement period time:7.582681388 sec time_interval:7582681388) - (invoke count:100000000 tsc_interval:27298499214)
> [  197.241926] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
> [  197.249968] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
> [  197.808470] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 19 cycles(tsc) 5.492 ns (step:0) - (measurement period time:0.549225541 sec time_interval:549225541) - (invoke count:100000000 tsc_interval:1977272238)
> [  197.828174] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
> [  199.422305] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 57 cycles(tsc) 15.849 ns (step:0) - (measurement period time:1.584920736 sec time_interval:1584920736) - (invoke count:100000000 tsc_interval:5705890830)
> [  199.442087] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
> [  207.342120] time_bench: Type:tasklet_page_pool03_slow Per elem: 284 cycles(tsc) 78.909 ns (step:0) - (measurement period time:7.890955151 sec time_interval:7890955151) - (invoke count:100000000 tsc_interval:28408319289)
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v7 3/8] page_pool: fix IOMMU crash when driver has already unbound
  2025-01-16 16:09       ` Jesper Dangaard Brouer
@ 2025-01-17 11:56         ` Yunsheng Lin
  2025-01-17 16:56           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 31+ messages in thread
From: Yunsheng Lin @ 2025-01-17 11:56 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Robin Murphy,
	Alexander Duyck, IOMMU, Andrew Morton, Eric Dumazet, Simon Horman,
	Ilias Apalodimas, linux-mm, linux-kernel, netdev, kernel-team

On 2025/1/17 0:09, Jesper Dangaard Brouer wrote:

...

>> Mainly because there is no space available for keeping tracking of inflight
>> pages, using page->pp can only find the page_pool owning the page, but page_pool
>> is not able to keep track of the inflight page when the page is handled by
>> networking stack.
>>
>> By using page_pool_item as below, the state is used to tell if a specific
>> item is being used/dma mapped or not by scanning all the item blocks in
>> pool->item_blocks. If a specific item is used by a page, then 'pp_netmem'
>> will point to that page so that dma unmapping can be done for that page
>> when page_pool_destroy() is called, otherwise free items sit in the
>> pool->hold_items or pool->release_items by using 'lentry':
>>
>> struct page_pool_item {
>>     unsigned long state;
>>     
>>     union {
>>         netmem_ref pp_netmem;
>>         struct llist_node lentry;
>>     };
>> };
> 
> pahole  -C page_pool_item vmlinux
> struct page_pool_item {
>     /* An 'encoded_next' is a pointer to next item, lower 2 bits is used to
>      * indicate the state of current item.
>      */   
>     long unsigned int          encoded_next;     /*     0     8 */
>     union {
>         netmem_ref         pp_netmem;        /*     8     8 */
>         struct llist_node  lentry;           /*     8     8 */
>     };                                           /*     8     8 */
> 
>     /* size: 16, cachelines: 1, members: 2 */
>     /* last cacheline: 16 bytes */
> };
> 
> 
>> When a page is added to the page_pool, a item is deleted from pool->hold_items
>> or pool->release_items and set the 'pp_netmem' pointing to that page and set
>> 'state' accordingly in order to keep track of that page.
>>
>> When a page is deleted from the page_pool, it is able to tell which page_pool
>> this page belong to by using the below function, and after clearing the 'state',
>> the item is added back to pool->release_items so that the item is reused for new
>> pages.
>>
> 
> To understand below, I'm listing struct page_pool_item_block for other
> reviewers:
> 
> pahole  -C page_pool_item_block vmlinux
> struct page_pool_item_block {
>     struct page_pool *         pp;               /*     0     8 */
>     struct list_head           list;             /*     8    16 */
>     unsigned int               flags;            /*    24     4 */
>     refcount_t                 ref;              /*    28     4 */
>     struct page_pool_item      items[];          /*    32     0 */
> 
>     /* size: 32, cachelines: 1, members: 5 */
>     /* last cacheline: 32 bytes */
> };
> 
>> static inline struct page_pool_item_block *
>> page_pool_item_to_block(struct page_pool_item *item)
>> {
>>     return (struct page_pool_item_block *)((unsigned long)item & PAGE_MASK);
> 
> This trick requires some comments explaining what is going on!
> Please correct me if I'm wrong: Here you a masking off the lower bits of
> the pointer to page_pool_item *item, as you know that a struct
> page_pool_item_block is stored in the top of a struct page.  This trick
> is like a "container_of" for going from page_pool_item to
> page_pool_item_block, right?

Yes, you are right.

> 
> I do notice that you have a comment above struct page_pool_item_block
> (that says "item_block is always PAGE_SIZE"), which is nice, but to be
> more explicit/clear:
>  I want a big comment block (placed above the main code here) that
> explains the design and intention behind this newly invented
> "item-block" scheme, like e.g. the connection between
> page_pool_item_block and page_pool_item. Like the advantage/trick that
> allows page->pp pointer to be an "item" and be mapped back to a "block"
> to find the page_pool object it belongs to.  Don't write *what* the code
> does, but write about the intended purpose and design reasons behind the
> code.

The comment for page_pool_item_block is below; it seems I also wrote about
the intended purpose and design reasons there.

/* The size of item_block is always PAGE_SIZE, so that the address of item_block
 * for a specific item can be calculated using 'item & PAGE_MASK'
 */

Anyway, if putting something like the above on page_pool_item_to_block()
makes it clearer, I will add a comment for page_pool_item_to_block() too.
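
For example, something along these lines (wording is only a suggestion):

/* Each item block occupies exactly one PAGE_SIZE-aligned page, with the
 * page_pool_item_block header at the start and the items following it, so
 * the owning block (and, through block->pp, the owning page_pool) can be
 * recovered from an item pointer by masking off the in-page offset.
 */
static inline struct page_pool_item_block *
page_pool_item_to_block(struct page_pool_item *item)
{
	return (struct page_pool_item_block *)((unsigned long)item & PAGE_MASK);
}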

> 
> 
>> }
>>
>>   static inline struct page_pool *page_pool_get_pp(struct page *page)
>>   {
>>        return page_pool_item_to_block(page->pp_item)->pp;
>>   }
>>
>>
>>>
>>>> +static void __page_pool_item_init(struct page_pool *pool, struct page *page)
>>>> +{
>>>
>>> Function name is confusing.  First I though this was init'ing a single
>>> item, but looking at the code it is iterating over ITEMS_PER_PAGE.
>>>
>>> Maybe it should be called page_pool_item_block_init ?
>>
>> The __page_pool_item_init() is added to make the below
>> page_pool_item_init() function more readable or maintainable, changing
>> it to page_pool_item_block_init doesn't seems consistent?
> 
> You (of-cause) also have to rename the other function, I though that was
> implicitly understood.
> 
> BUT does my suggested rename make sense?  What I'm seeing is that all
> the *items* in the "block" is getting inited. But we are also setting up
> the "block" (e.g.  "block->pp=pool").

I am not really sure about that: using a PAGE_SIZE block to hold the items
seems like an implementation detail which might change in the future, so
renaming the other function to something like that doesn't seem right to me IMHO.

Also, the next patch will add page_pool_item_blk_add() to support unlimited
inflight pages; it seems a better name is needed for that too, perhaps renaming
page_pool_item_blk_add() to page_pool_dynamic_item_add()?

For __page_pool_item_init(), perhaps just inline it back into
page_pool_item_init(), as __page_pool_item_init() is only used by
page_pool_item_init(), and neither of them is a really large function.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v7 3/8] page_pool: fix IOMMU crash when driver has already unbound
  2025-01-17 11:56         ` Yunsheng Lin
@ 2025-01-17 16:56           ` Jesper Dangaard Brouer
  2025-01-18 13:36             ` Yunsheng Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2025-01-17 16:56 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Robin Murphy,
	Alexander Duyck, IOMMU, Andrew Morton, Eric Dumazet, Simon Horman,
	Ilias Apalodimas, linux-mm, linux-kernel, netdev, kernel-team



On 17/01/2025 12.56, Yunsheng Lin wrote:
> On 2025/1/17 0:09, Jesper Dangaard Brouer wrote:
> 
> ...
> 
>>> Mainly because there is no space available for keeping tracking of inflight
>>> pages, using page->pp can only find the page_pool owning the page, but page_pool
>>> is not able to keep track of the inflight page when the page is handled by
>>> networking stack.
>>>
>>> By using page_pool_item as below, the state is used to tell if a specific
>>> item is being used/dma mapped or not by scanning all the item blocks in
>>> pool->item_blocks. If a specific item is used by a page, then 'pp_netmem'
>>> will point to that page so that dma unmapping can be done for that page
>>> when page_pool_destroy() is called, otherwise free items sit in the
>>> pool->hold_items or pool->release_items by using 'lentry':
>>>
>>> struct page_pool_item {
>>>      unsigned long state;
>>>      
>>>      union {
>>>          netmem_ref pp_netmem;
>>>          struct llist_node lentry;
>>>      };
>>> };
>>
>> pahole  -C page_pool_item vmlinux
>> struct page_pool_item {
>>      /* An 'encoded_next' is a pointer to next item, lower 2 bits is used to
>>       * indicate the state of current item.
>>       */
>>      long unsigned int          encoded_next;     /*     0     8 */
>>      union {
>>          netmem_ref         pp_netmem;        /*     8     8 */
>>          struct llist_node  lentry;           /*     8     8 */
>>      };                                           /*     8     8 */
>>
>>      /* size: 16, cachelines: 1, members: 2 */
>>      /* last cacheline: 16 bytes */
>> };
>>
>>
>>> When a page is added to the page_pool, a item is deleted from pool->hold_items
>>> or pool->release_items and set the 'pp_netmem' pointing to that page and set
>>> 'state' accordingly in order to keep track of that page.
>>>
>>> When a page is deleted from the page_pool, it is able to tell which page_pool
>>> this page belong to by using the below function, and after clearing the 'state',
>>> the item is added back to pool->release_items so that the item is reused for new
>>> pages.
>>>
>>
>> To understand below, I'm listing struct page_pool_item_block for other
>> reviewers:
>>
>> pahole  -C page_pool_item_block vmlinux
>> struct page_pool_item_block {
>>      struct page_pool *         pp;               /*     0     8 */
>>      struct list_head           list;             /*     8    16 */
>>      unsigned int               flags;            /*    24     4 */
>>      refcount_t                 ref;              /*    28     4 */
>>      struct page_pool_item      items[];          /*    32     0 */
>>
>>      /* size: 32, cachelines: 1, members: 5 */
>>      /* last cacheline: 32 bytes */
>> };
>>
>>> static inline struct page_pool_item_block *
>>> page_pool_item_to_block(struct page_pool_item *item)
>>> {
>>>      return (struct page_pool_item_block *)((unsigned long)item & PAGE_MASK);
>>
>> This trick requires some comments explaining what is going on!
>> Please correct me if I'm wrong: Here you a masking off the lower bits of
>> the pointer to page_pool_item *item, as you know that a struct
>> page_pool_item_block is stored in the top of a struct page.  This trick
>> is like a "container_of" for going from page_pool_item to
>> page_pool_item_block, right?
> 
> Yes, you are right.
> 
>>
>> I do notice that you have a comment above struct page_pool_item_block
>> (that says "item_block is always PAGE_SIZE"), which is nice, but to be
>> more explicit/clear:
>>   I want a big comment block (placed above the main code here) that
>> explains the design and intention behind this newly invented
>> "item-block" scheme, like e.g. the connection between
>> page_pool_item_block and page_pool_item. Like the advantage/trick that
>> allows page->pp pointer to be an "item" and be mapped back to a "block"
>> to find the page_pool object it belongs to.  Don't write *what* the code
>> does, but write about the intended purpose and design reasons behind the
>> code.
> 
> The comment for page_pool_item_block is below, it seems I also wrote about
> intended purpose and design reasons here.
> 
> /* The size of item_block is always PAGE_SIZE, so that the address of item_block
>   * for a specific item can be calculated using 'item & PAGE_MASK'
>   */
> 
> Anyway, If putting something like above for page_pool_item_to_block() does
> make it clearer, will add some comment for page_pool_item_to_block() too.
> 
>>
>>
>>> }
>>>
>>>    static inline struct page_pool *page_pool_get_pp(struct page *page)
>>>    {
>>>         return page_pool_item_to_block(page->pp_item)->pp;
>>>    }
>>>
>>>
>>>>
>>>>> +static void __page_pool_item_init(struct page_pool *pool, struct page *page)
>>>>> +{
>>>>
>>>> Function name is confusing.  First I though this was init'ing a single
>>>> item, but looking at the code it is iterating over ITEMS_PER_PAGE.
>>>>
>>>> Maybe it should be called page_pool_item_block_init ?
>>>
>>> The __page_pool_item_init() is added to make the below
>>> page_pool_item_init() function more readable or maintainable, changing
>>> it to page_pool_item_block_init doesn't seems consistent?
>>
>> You (of-cause) also have to rename the other function, I though that was
>> implicitly understood.
>>
>> BUT does my suggested rename make sense?  What I'm seeing is that all
>> the *items* in the "block" is getting inited. But we are also setting up
>> the "block" (e.g.  "block->pp=pool").
> 
> I am not really sure about that, as using the PAGE_SIZE block to hold the
> item seems like a implementation detail, which might change in the future,
> renaming other function to something like that doesn't seem right to me IMHO.
> 
> Also the next patch will add page_pool_item_blk_add() to support unlimited
> inflight pages, it seems a better name is needed for that too, perheps rename
> page_pool_item_blk_add() to page_pool_dynamic_item_add()?
> 

Hmmm... not sure about this.
I think I prefer page_pool_item_blk_add() over page_pool_dynamic_item_add().

> For __page_pool_item_init(), perhaps just inline it back to page_pool_item_init()
> as __page_pool_item_init() is only used by page_pool_item_init(), and both of them
> are not really large function.

I like that you had a helper function. So, don't merge 
__page_pool_item_init() into page_pool_item_init() just to avoid naming 
it differently.

Let me be more explicit what I'm asking for:

IMHO you should rename:
  - __page_pool_item_init() to __page_pool_item_block_init()
and rename:
  - page_pool_item_init() to page_pool_item_block_init()

I hope this makes it more clear what I'm saying.

--Jesper

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v7 0/8] fix two bugs related to page_pool
  2025-01-17 11:35           ` Yunsheng Lin
@ 2025-01-18  8:04             ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2025-01-18  8:04 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Alexander Lobakin,
	Robin Murphy, Alexander Duyck, Andrew Morton, IOMMU, MM,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Matthias Brugger, AngeloGioacchino Del Regno, netdev,
	intel-wired-lan, bpf, linux-kernel, linux-arm-kernel,
	linux-mediatek



On 17/01/2025 12.35, Yunsheng Lin wrote:
> On 2025/1/17 2:02, Jesper Dangaard Brouer wrote:
> 
>>
>> Benchmark (bench_page_pool_simple) results from before and after
>> patchset with patches 1-5m and rcu lock removal as requested.
>>
>> | Test name  |Cycles |   1-5 |    | Nanosec |    1-5 |        |      % |
>> | (tasklet_*)|Before | After |diff|  Before |  After |   diff | change |
>> |------------+-------+-------+----+---------+--------+--------+--------|
>> | fast_path  |    19 |    19 |   0|   5.399 |  5.492 |  0.093 |    1.7 |
>> | ptr_ring   |    54 |    57 |   3|  15.090 | 15.849 |  0.759 |    5.0 |
>> | slow       |   238 |   284 |  46|  66.134 | 78.909 | 12.775 |   19.3 |
>> #+TBLFM: $4=$3-$2::$7=$6-$5::$8=(($7/$5)*100);%.1f
>>
>> This test with patches 1-5 looks much better regarding performance.
> 
> Thanks for the testing.
> 
> Is there any notiable performance variation during different test running
> for the same built kernel in your machine?
> 

My machine have quite stable performance for this benchmark.


>> https://github.com/xdp-project/xdp-project/blob/main/areas/mem/page_pool07_bench_DMA_fix.org#e5-1650-pp01-dma-fix-v7-p1-5

Like documented in above link. I have also increased the loops count for
the test to get it more stable, given this will be measured over a
longer period.

  modprobe bench_page_pool_simple loops=100000000


>> Kernel:
>>   - 6.13.0-rc6-pp01-DMA-fix-v7-p1-5+ #5 SMP PREEMPT_DYNAMIC Thu Jan 16 18:06:53 CET 2025 x86_64 GNU/Linux
>>
>> Machine: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
>>
>> modprobe bench_page_pool_simple loops=100000000
>>
>> Raw data:
>> [  187.309423] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
>> [  187.872849] time_bench: Type:no-softirq-page_pool01 Per elem: 19 cycles(tsc) 5.539 ns (step:0) - (measurement period time:0.553906443 sec time_interval:553906443) - (invoke count:100000000 tsc_interval:1994123064)
>> [  187.892023] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
>> [  189.611070] time_bench: Type:no-softirq-page_pool02 Per elem: 61 cycles(tsc) 17.095 ns (step:0) - (measurement period time:1.709580367 sec time_interval:1709580367) - (invoke count:100000000 tsc_interval:6154679394)
>> [  189.630414] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
>> [  197.222387] time_bench: Type:no-softirq-page_pool03 Per elem: 272 cycles(tsc) 75.826 ns (step:0) - (measurement period time:7.582681388 sec time_interval:7582681388) - (invoke count:100000000 tsc_interval:27298499214)
>> [  197.241926] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
>> [  197.249968] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
>> [  197.808470] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 19 cycles(tsc) 5.492 ns (step:0) - (measurement period time:0.549225541 sec time_interval:549225541) - (invoke count:100000000 tsc_interval:1977272238)
>> [  197.828174] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
>> [  199.422305] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 57 cycles(tsc) 15.849 ns (step:0) - (measurement period time:1.584920736 sec time_interval:1584920736) - (invoke count:100000000 tsc_interval:5705890830)
>> [  199.442087] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
>> [  207.342120] time_bench: Type:tasklet_page_pool03_slow Per elem: 284 cycles(tsc) 78.909 ns (step:0) - (measurement period time:7.890955151 sec time_interval:7890955151) - (invoke count:100000000 tsc_interval:28408319289)
>>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v7 3/8] page_pool: fix IOMMU crash when driver has already unbound
  2025-01-17 16:56           ` Jesper Dangaard Brouer
@ 2025-01-18 13:36             ` Yunsheng Lin
  0 siblings, 0 replies; 31+ messages in thread
From: Yunsheng Lin @ 2025-01-18 13:36 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Yunsheng Lin, davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Robin Murphy,
	Alexander Duyck, IOMMU, Andrew Morton, Eric Dumazet, Simon Horman,
	Ilias Apalodimas, linux-mm, linux-kernel, netdev, kernel-team

On 1/18/2025 12:56 AM, Jesper Dangaard Brouer wrote:

...

>> I am not really sure about that, as using the PAGE_SIZE block to hold the
>> item seems like a implementation detail, which might change in the 
>> future,
>> renaming other function to something like that doesn't seem right to 
>> me IMHO.
>>
>> Also the next patch will add page_pool_item_blk_add() to support 
>> unlimited
>> inflight pages, it seems a better name is needed for that too, perheps 
>> rename
>> page_pool_item_blk_add() to page_pool_dynamic_item_add()?
>>
> 
> Hmmm... not sure about this.
> I think I prefer page_pool_item_blk_add() over 
> page_pool_dynamic_item_add().
> 
>> For __page_pool_item_init(), perhaps just inline it back to 
>> page_pool_item_init()
>> as __page_pool_item_init() is only used by page_pool_item_init(), and 
>> both of them
>> are not really large function.
> 
> I like that you had a helper function. So, don't merge 
> __page_pool_item_init() into page_pool_item_init() just to avoid naming 
> it differently.

Any particular reason for the above suggestion?

After reusing page_pool_item_uninit() to fix the similar
use-after-free problem, it seems reasonable not to expose the
item_block any more than necessary, as item_block is really an
implementation detail that should be hidden as much as possible
IMHO.

If it could be reused for supporting the unlimited item case,
then I agree that it might be better to refactor it out,
but it is not really reusable.

static int page_pool_item_init(struct page_pool *pool)
{
#define PAGE_POOL_MIN_INFLIGHT_ITEMS            512
         struct page_pool_item_block *block;
         int item_cnt;

         INIT_LIST_HEAD(&pool->item_blocks);
         init_llist_head(&pool->hold_items);
         init_llist_head(&pool->release_items);

         item_cnt = pool->p.pool_size * 2 + PP_ALLOC_CACHE_SIZE +
                 PAGE_POOL_MIN_INFLIGHT_ITEMS;
         for (; item_cnt > 0; item_cnt -= ITEMS_PER_PAGE) {
                 struct page *page;
                 unsigned int i;

                 page = alloc_pages_node(pool->p.nid, GFP_KERNEL, 0);
                 if (!page) {
                         page_pool_item_uninit(pool);
                         return -ENOMEM;
                 }

                 block = page_address(page);
                 block->pp = pool;
                 list_add(&block->list, &pool->item_blocks);

                 for (i = 0; i < ITEMS_PER_PAGE; i++) {
                         page_pool_item_init_state(&block->items[i]);
                        __llist_add(&block->items[i].lentry, &pool->hold_items);
                 }
         }

         return 0;
}

> 
> Let me be more explicit what I'm asking for:
> 
> IMHO you should rename:
>   - __page_pool_item_init() to __page_pool_item_block_init()
> and rename:
>   - page_pool_item_init() to page_pool_item_block_init()
> 
> I hope this makes it more clear what I'm saying.
>
> --Jesper


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v7 2/8] page_pool: fix timing for checking and disabling napi_local
  2025-01-11  5:24     ` Yunsheng Lin
  2025-01-14 13:03       ` Yunsheng Lin
@ 2025-01-20 11:24       ` Toke Høiland-Jørgensen
  2025-01-22 11:02         ` Yunsheng Lin
  1 sibling, 1 reply; 31+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-01-20 11:24 UTC (permalink / raw)
  To: Yunsheng Lin, Yunsheng Lin, davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Alexander Lobakin,
	Xuan Zhuo, Jesper Dangaard Brouer, Ilias Apalodimas, Eric Dumazet,
	Simon Horman, netdev, linux-kernel

Yunsheng Lin <yunshenglin0825@gmail.com> writes:

> On 1/10/2025 11:40 PM, Toke Høiland-Jørgensen wrote:
>> Yunsheng Lin <linyunsheng@huawei.com> writes:
>> 
>>> page_pool page may be freed from skb_defer_free_flush() in
>>> softirq context without binding to any specific napi, it
>>> may cause use-after-free problem due to the below time window,
>>> as below, CPU1 may still access napi->list_owner after CPU0
>>> free the napi memory:
>>>
>>>              CPU 0                           CPU1
>>>        page_pool_destroy()          skb_defer_free_flush()
>>>               .                               .
>>>               .                napi = READ_ONCE(pool->p.napi);
>>>               .                               .
>>> page_pool_disable_direct_recycling()         .
>>>     driver free napi memory                   .
>>>               .                               .
>>>               .       napi && READ_ONCE(napi->list_owner) == cpuid
>>>               .                               .
>> 
>> Have you actually observed this happen, or are you just speculating?
>
> I did not actually observe this happen, but I added some delaying and
> pr_err() debugging code in page_pool_napi_local()/page_pool_destroy(),
> and modified the test module for page_pool in [1] to show that it is
> indeed possible if the delay between reading napi and checking
> napi->list_owner is long enough.
>
> 1. 
> https://patchwork.kernel.org/project/netdevbpf/patch/20240909091913.987826-1-linyunsheng@huawei.com/

Right, I wasn't contesting whether it's possible to trigger this race by
calling those two functions directly in some fashion. I was asking
whether there are any drivers that use the API in a way that this race
can happen; because I would consider any such driver buggy, and we
should fix this rather than adding more cruft to the page_pool API. See
below.

>> Because I don't think it can; deleting a NAPI instance already requires
>> observing an RCU grace period, cf netdevice.h:
>> 
>> /**
>>   *  __netif_napi_del - remove a NAPI context
>>   *  @napi: NAPI context
>>   *
>>   * Warning: caller must observe RCU grace period before freeing memory
>>   * containing @napi. Drivers might want to call this helper to combine
>>   * all the needed RCU grace periods into a single one.
>>   */
>> void __netif_napi_del(struct napi_struct *napi);
>> 
>> /**
>>   *  netif_napi_del - remove a NAPI context
>>   *  @napi: NAPI context
>>   *
>>   *  netif_napi_del() removes a NAPI context from the network device NAPI list
>>   */
>> static inline void netif_napi_del(struct napi_struct *napi)
>> {
>> 	__netif_napi_del(napi);
>> 	synchronize_net();
>> }
>
> I am not sure we can reliably depend on the implicit synchronize_net()
> above, as netif_napi_del() might not be called before page_pool_destroy()
> for the case of changing rx_desc_num for a queue, which seems to be the
> case for hns3_set_ringparam() in the hns3 driver.

The hns3 driver doesn't use pp->napi at all AFAICT, so that's hardly
relevant.

>> 
>> 
>>> Use rcu mechanism to avoid the above problem.
>>>
>>> Note, the above was found during code reviewing on how to fix
>>> the problem in [1].
>>>
>>> As the following IOMMU fix patch depends on synchronize_rcu()
>>> added in this patch and the time window is so small that it
>>> doesn't seem to be an urgent fix, so target the net-next as
>>> the IOMMU fix patch does.
>>>
>>> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
>>>
>>> Fixes: dd64b232deb8 ("page_pool: unlink from napi during destroy")
>>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>>> CC: Alexander Lobakin <aleksander.lobakin@intel.com>
>>> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>> ---
>>>   net/core/page_pool.c | 15 ++++++++++++++-
>>>   1 file changed, 14 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>>> index 9733206d6406..1aa7b93bdcc8 100644
>>> --- a/net/core/page_pool.c
>>> +++ b/net/core/page_pool.c
>>> @@ -799,6 +799,7 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
>>>   static bool page_pool_napi_local(const struct page_pool *pool)
>>>   {
>>>   	const struct napi_struct *napi;
>>> +	bool napi_local;
>>>   	u32 cpuid;
>>>   
>>>   	if (unlikely(!in_softirq()))
>>> @@ -814,9 +815,15 @@ static bool page_pool_napi_local(const struct page_pool *pool)
>>>   	if (READ_ONCE(pool->cpuid) == cpuid)
>>>   		return true;
>>>   
>>> +	/* Synchronizated with page_pool_destory() to avoid use-after-free
>>> +	 * for 'napi'.
>>> +	 */
>>> +	rcu_read_lock();
>>>   	napi = READ_ONCE(pool->p.napi);
>>> +	napi_local = napi && READ_ONCE(napi->list_owner) == cpuid;
>>> +	rcu_read_unlock();
>> 
>> This rcu_read_lock/unlock() pair is redundant in the context you mention
>> above, since skb_defer_free_flush() is only ever called from softirq
>> context (within local_bh_disable()), which already functions as an RCU
>> read lock.
>
> I thought about it, but I am not sure if we need an explicit rcu lock
> for different kernel PREEMPT and RCU configs.
> Perhaps use rcu_read_lock_bh_held() to ensure that we are in the
> correct context?

page_pool_napi_local() returns immediately if in_softirq() returns
false. So the rcu_read_lock() is definitely not needed.

>> 
>>> -	return napi && READ_ONCE(napi->list_owner) == cpuid;
>>> +	return napi_local;
>>>   }
>>>   
>>>   void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
>>> @@ -1165,6 +1172,12 @@ void page_pool_destroy(struct page_pool *pool)
>>>   	if (!page_pool_release(pool))
>>>   		return;
>>>   
>>> +	/* Paired with rcu lock in page_pool_napi_local() to enable clearing
>>> +	 * of pool->p.napi in page_pool_disable_direct_recycling() is seen
>>> +	 * before returning to driver to free the napi instance.
>>> +	 */
>>> +	synchronize_rcu();
>> 
>> Most drivers call page_pool_destroy() in a loop for each RX queue, so
>> now you're introducing a full synchronize_rcu() wait for each queue.
>> That can delay tearing down the device significantly, so I don't think
>> this is a good idea.
>
> synchronize_rcu() is called after page_pool_release(pool), which means
> it is only called when there are some inflight pages, so there is not
> necessarily a full synchronize_rcu() wait for each queue.
>
> Anyway, it seems that there are some cases that need an explicit
> synchronize_rcu() and some cases depending on another API providing
> synchronize_rcu() semantics; maybe we should provide two different APIs for
> both cases, like the netif_napi_del()/__netif_napi_del() APIs do?

I don't think so. This race can only be triggered if:

- An skb is allocated from a page_pool with a napi instance attached

- That skb is freed *in softirq context* while the memory backing the
  NAPI instance is being freed.

It's only valid to free a napi instance after calling netif_napi_del(),
which does a full synchronize_rcu(). This means that any running
softirqs will have exited at this point, and all packets will have been
flushed from the deferred freeing queues. And since the NAPI has been
stopped at this point, no new packets can enter the deferred freeing
queue from that NAPI instance.

So I really don't see a way for this race to happen with correct usage
of the page_pool and NAPI APIs, which means there's no reason to make
the change you are proposing here.

-Toke


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v7 2/8] page_pool: fix timing for checking and disabling napi_local
  2025-01-20 11:24       ` Toke Høiland-Jørgensen
@ 2025-01-22 11:02         ` Yunsheng Lin
  2025-01-24 17:13           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 31+ messages in thread
From: Yunsheng Lin @ 2025-01-22 11:02 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Yunsheng Lin, davem, kuba,
	pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Alexander Lobakin,
	Xuan Zhuo, Jesper Dangaard Brouer, Ilias Apalodimas, Eric Dumazet,
	Simon Horman, netdev, linux-kernel

On 2025/1/20 19:24, Toke Høiland-Jørgensen wrote:

...

>>>>   
>>>>   void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
>>>> @@ -1165,6 +1172,12 @@ void page_pool_destroy(struct page_pool *pool)
>>>>   	if (!page_pool_release(pool))
>>>>   		return;
>>>>   
>>>> +	/* Paired with the rcu lock in page_pool_napi_local() to ensure the
>>>> +	 * clearing of pool->p.napi in page_pool_disable_direct_recycling()
>>>> +	 * is seen before returning to the driver to free the napi instance.
>>>> +	 */
>>>> +	synchronize_rcu();
>>>
>>> Most drivers call page_pool_destroy() in a loop for each RX queue, so
>>> now you're introducing a full synchronize_rcu() wait for each queue.
>>> That can delay tearing down the device significantly, so I don't think
>>> this is a good idea.
>>
>> synchronize_rcu() is called after page_pool_release(pool), which means
>> it is only called when there are some inflight pages, so there is not
>> necessarily a full synchronize_rcu() wait for each queue.
>>
>> Anyway, it seems that there are some cases that need an explicit
>> synchronize_rcu() and some cases depending on another API providing
>> synchronize_rcu() semantics; maybe we should provide two different APIs for
>> both cases, like the netif_napi_del()/__netif_napi_del() APIs do?
> 
> I don't think so. This race can only be triggered if:
> 
> - An skb is allocated from a page_pool with a napi instance attached
> 
> - That skb is freed *in softirq context* while the memory backing the
>   NAPI instance is being freed.
> 
> It's only valid to free a napi instance after calling netif_napi_del(),
> which does a full synchronize_rcu(). This means that any running
> softirqs will have exited at this point, and all packets will have been
> flushed from the deferred freeing queues. And since the NAPI has been
> stopped at this point, no new packets can enter the deferred freeing
> queue from that NAPI instance.

Note that skb_defer_free_flush() can be called without being bound to
any NAPI instance; see the skb_defer_free_flush() call in net_rx_action().
This means packets from that NAPI instance can still be freed in softirq
context even after the NAPI has been stopped.
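
For illustration, a trimmed sketch of that path (roughly what
net_rx_action() does; simplified, not the exact upstream source):

	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
	LIST_HEAD(list);
	LIST_HEAD(repoll);

	local_irq_disable();
	list_splice_init(&sd->poll_list, &list);
	local_irq_enable();

	for (;;) {
		struct napi_struct *n;

		/* The deferred-free queue lives in softnet_data, not in any
		 * NAPI instance: skbs queued via skb_attempt_defer_free()
		 * are freed here even if the NAPI that allocated them has
		 * already been deleted.
		 */
		skb_defer_free_flush(sd);

		if (list_empty(&list))
			break;

		n = list_first_entry(&list, struct napi_struct, poll_list);
		napi_poll(n, &repoll);	/* budget/time-limit handling omitted */
	}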

> 
> So I really don't see a way for this race to happen with correct usage
> of the page_pool and NAPI APIs, which means there's no reason to make
> the change you are proposing here.

I looked at one driver setting pp->napi; the bnxt driver doesn't seem
to call page_pool_disable_direct_recycling() when unloading, see
bnxt_half_close_nic(). page_pool_disable_direct_recycling() seems to be
only called for the new queue_mgmt API:

/* rtnl_lock held, this call can only be made after a previous successful
 * call to bnxt_half_open_nic().
 */
void bnxt_half_close_nic(struct bnxt *bp)
{
	bnxt_hwrm_resource_free(bp, false, true);
	bnxt_del_napi(bp);       *----call napi del and rcu sync----*
	bnxt_free_skbs(bp);
	bnxt_free_mem(bp, true); *------call page_pool_destroy()----*
	clear_bit(BNXT_STATE_HALF_OPEN, &bp->state);
}

Even if there were a page_pool_disable_direct_recycling() called between
bnxt_del_napi() and bnxt_free_mem(), the timing window would still exist,
as the rcu sync needs to be called after page_pool_disable_direct_recycling();
it seems some refactoring is needed in the bnxt driver to reuse the rcu sync
from the NAPI API, in order to avoid calling the rcu sync for
page_pool_destroy().
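
Concretely, the refactor would have to move the disable call ahead of the
NAPI deletion so that the rcu sync inside netif_napi_del() also covers it,
roughly like the sketch below (not actual bnxt code; the loop over the RX
rings and the field names are illustrative):

void bnxt_half_close_nic(struct bnxt *bp)
{
	int i;

	bnxt_hwrm_resource_free(bp, false, true);

	/* Clear pool->p.napi for every RX ring *before* the NAPI
	 * instances are deleted, so the rcu sync done by
	 * netif_napi_del() also publishes the cleared pointer.
	 */
	for (i = 0; i < bp->rx_nr_rings; i++)
		page_pool_disable_direct_recycling(bp->rx_ring[i].page_pool);

	bnxt_del_napi(bp);		/* napi del + rcu sync */
	bnxt_free_skbs(bp);
	bnxt_free_mem(bp, true);	/* page_pool_destroy() */
	clear_bit(BNXT_STATE_HALF_OPEN, &bp->state);
}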


> 
> -Toke
> 
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v7 2/8] page_pool: fix timing for checking and disabling napi_local
  2025-01-22 11:02         ` Yunsheng Lin
@ 2025-01-24 17:13           ` Toke Høiland-Jørgensen
  2025-01-25 14:21             ` Yunsheng Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-01-24 17:13 UTC (permalink / raw)
  To: Yunsheng Lin, Yunsheng Lin, davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Alexander Lobakin,
	Xuan Zhuo, Jesper Dangaard Brouer, Ilias Apalodimas, Eric Dumazet,
	Simon Horman, netdev, linux-kernel

Yunsheng Lin <linyunsheng@huawei.com> writes:

>> So I really don't see a way for this race to happen with correct usage
>> of the page_pool and NAPI APIs, which means there's no reason to make
>> the change you are proposing here.
>
> I looked at one driver setting pp->napi; the bnxt driver doesn't seem
> to call page_pool_disable_direct_recycling() when unloading, see
> bnxt_half_close_nic(). page_pool_disable_direct_recycling() seems to be
> only called for the new queue_mgmt API:
>
> /* rtnl_lock held, this call can only be made after a previous successful
>  * call to bnxt_half_open_nic().
>  */
> void bnxt_half_close_nic(struct bnxt *bp)
> {
> 	bnxt_hwrm_resource_free(bp, false, true);
> 	bnxt_del_napi(bp);       *----call napi del and rcu sync----*
> 	bnxt_free_skbs(bp);
> 	bnxt_free_mem(bp, true); *------call page_pool_destroy()----*
> 	clear_bit(BNXT_STATE_HALF_OPEN, &bp->state);
> }
>
> Even if there were a page_pool_disable_direct_recycling() called between
> bnxt_del_napi() and bnxt_free_mem(), the timing window would still exist,
> as the rcu sync needs to be called after page_pool_disable_direct_recycling();
> it seems some refactoring is needed in the bnxt driver to reuse the rcu sync
> from the NAPI API, in order to avoid calling the rcu sync for
> page_pool_destroy().

Well, I would consider that usage buggy. A page pool object is created
with a reference to the napi struct; so the page pool should also be
destroyed (clearing its reference) before the napi memory is freed. I
guess this is not really documented anywhere, but it's pretty standard
practice to free objects in the opposite order of their creation.
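
In concrete terms, a per-queue teardown would look something like the
sketch below (example_rxq and its fields are made up for illustration):

/* Destroy in the opposite order of creation: the pool was created with
 * pp_params.napi pointing at rxq->napi, so it is destroyed first, which
 * also clears pool->p.napi via page_pool_disable_direct_recycling().
 */
static void example_rxq_teardown(struct example_rxq *rxq)
{
	page_pool_destroy(rxq->page_pool);

	/* netif_napi_del() does its own RCU synchronisation, so any
	 * concurrent softirq users are done before rxq is freed.
	 */
	netif_napi_del(&rxq->napi);
	kfree(rxq);
}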

So no, I don't think this is something that should be fixed on the page
pool side (and certainly not by adding another synchronize_rcu() call
per queue!); rather, we should fix the drivers that get this wrong (and
probably document the requirement a bit better).

-Toke


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v7 2/8] page_pool: fix timing for checking and disabling napi_local
  2025-01-24 17:13           ` Toke Høiland-Jørgensen
@ 2025-01-25 14:21             ` Yunsheng Lin
  2025-01-27 13:47               ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 31+ messages in thread
From: Yunsheng Lin @ 2025-01-25 14:21 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Yunsheng Lin, davem, kuba,
	pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Alexander Lobakin,
	Xuan Zhuo, Jesper Dangaard Brouer, Ilias Apalodimas, Eric Dumazet,
	Simon Horman, netdev, linux-kernel

On 1/25/2025 1:13 AM, Toke Høiland-Jørgensen wrote:
> Yunsheng Lin <linyunsheng@huawei.com> writes:
> 
>>> So I really don't see a way for this race to happen with correct usage
>>> of the page_pool and NAPI APIs, which means there's no reason to make
>>> the change you are proposing here.
>>
>> I looked at one driver setting pp->napi; the bnxt driver doesn't seem
>> to call page_pool_disable_direct_recycling() when unloading, see
>> bnxt_half_close_nic(). page_pool_disable_direct_recycling() seems to be
>> only called for the new queue_mgmt API:
>>
>> /* rtnl_lock held, this call can only be made after a previous successful
>>   * call to bnxt_half_open_nic().
>>   */
>> void bnxt_half_close_nic(struct bnxt *bp)
>> {
>> 	bnxt_hwrm_resource_free(bp, false, true);
>> 	bnxt_del_napi(bp);       *----call napi del and rcu sync----*
>> 	bnxt_free_skbs(bp);
>> 	bnxt_free_mem(bp, true); *------call page_pool_destroy()----*
>> 	clear_bit(BNXT_STATE_HALF_OPEN, &bp->state);
>> }
>>
>> Even if there were a page_pool_disable_direct_recycling() called between
>> bnxt_del_napi() and bnxt_free_mem(), the timing window would still exist,
>> as the rcu sync needs to be called after page_pool_disable_direct_recycling();
>> it seems some refactoring is needed in the bnxt driver to reuse the rcu sync
>> from the NAPI API, in order to avoid calling the rcu sync for
>> page_pool_destroy().
> 
> Well, I would consider that usage buggy. A page pool object is created
> with a reference to the napi struct; so the page pool should also be
> destroyed (clearing its reference) before the napi memory is freed. I
> guess this is not really documented anywhere, but it's pretty standard
> practice to free objects in the opposite order of their creation.

I am not so familiar with the rules around the NAPI creation API, but the
bnxt driver implementation can hold a reference to the napi struct before
calling netif_napi_add(), see below:

static int __bnxt_open_nic(struct bnxt *bp, bool irq_re_init, bool link_re_init)
{
	.......
	rc = bnxt_alloc_mem(bp, irq_re_init);     *create page_pool*
	if (rc) {
		netdev_err(bp->dev, "bnxt_alloc_mem err: %x\n", rc);
		goto open_err_free_mem;
	}

	if (irq_re_init) {
		bnxt_init_napi(bp);                *netif_napi_add*
		rc = bnxt_request_irq(bp);
		if (rc) {
			netdev_err(bp->dev, "bnxt_request_irq err: %x\n", rc);
			goto open_err_irq;
		}
	}

	.....
}

> 
> So no, I don't think this is something that should be fixed on the page
> pool side (and certainly not by adding another synchronize_rcu() call
> per queue!); rather, we should fix the drivers that get this wrong (and
> probably document the requirement a bit better).

Even if the timing problem of checking and disabling napi_local should not
be fixed on the page_pool side, do we have some common understanding
about fixing the DMA API misuse problem on the page_pool side?
If yes, do we have some common understanding that some mechanism
like synchronize_rcu() might still be needed on the page_pool side?

If no, I am not sure there is any better idea about how to fix
the DMA API misuse problem after all the previous discussion.

If yes, it may be better to focus on discussing how to avoid calling the rcu
sync for each queue, as mentioned in [1].

1. https://lore.kernel.org/all/22de6033-744e-486e-bbd9-8950249cd018@huawei.com/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v7 2/8] page_pool: fix timing for checking and disabling napi_local
  2025-01-25 14:21             ` Yunsheng Lin
@ 2025-01-27 13:47               ` Toke Høiland-Jørgensen
  2025-02-04 13:51                 ` Yunsheng Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-01-27 13:47 UTC (permalink / raw)
  To: Yunsheng Lin, Yunsheng Lin, davem, kuba, pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Alexander Lobakin,
	Xuan Zhuo, Jesper Dangaard Brouer, Ilias Apalodimas, Eric Dumazet,
	Simon Horman, netdev, linux-kernel

Yunsheng Lin <yunshenglin0825@gmail.com> writes:

> On 1/25/2025 1:13 AM, Toke Høiland-Jørgensen wrote:
>> Yunsheng Lin <linyunsheng@huawei.com> writes:
>> 
>>>> So I really don't see a way for this race to happen with correct usage
>>>> of the page_pool and NAPI APIs, which means there's no reason to make
>>>> the change you are proposing here.
>>>
>>> I looked at one driver setting pp->napi; the bnxt driver doesn't seem
>>> to call page_pool_disable_direct_recycling() when unloading, see
>>> bnxt_half_close_nic(). page_pool_disable_direct_recycling() seems to be
>>> only called for the new queue_mgmt API:
>>>
>>> /* rtnl_lock held, this call can only be made after a previous successful
>>>   * call to bnxt_half_open_nic().
>>>   */
>>> void bnxt_half_close_nic(struct bnxt *bp)
>>> {
>>> 	bnxt_hwrm_resource_free(bp, false, true);
>>> 	bnxt_del_napi(bp);       *----call napi del and rcu sync----*
>>> 	bnxt_free_skbs(bp);
>>> 	bnxt_free_mem(bp, true); *------call page_pool_destroy()----*
>>> 	clear_bit(BNXT_STATE_HALF_OPEN, &bp->state);
>>> }
>>>
>>> Even if there were a page_pool_disable_direct_recycling() called between
>>> bnxt_del_napi() and bnxt_free_mem(), the timing window would still exist,
>>> as the rcu sync needs to be called after page_pool_disable_direct_recycling();
>>> it seems some refactoring is needed in the bnxt driver to reuse the rcu sync
>>> from the NAPI API, in order to avoid calling the rcu sync for
>>> page_pool_destroy().
>> 
>> Well, I would consider that usage buggy. A page pool object is created
>> with a reference to the napi struct; so the page pool should also be
>> destroyed (clearing its reference) before the napi memory is freed. I
>> guess this is not really documented anywhere, but it's pretty standard
>> practice to free objects in the opposite order of their creation.
>
> I am not so familiar with the rules around the NAPI creation API, but the
> bnxt driver implementation can hold a reference to the napi struct before
> calling netif_napi_add(), see below:
>
> static int __bnxt_open_nic(struct bnxt *bp, bool irq_re_init, bool link_re_init)
> {
> 	.......
> 	rc = bnxt_alloc_mem(bp, irq_re_init);     *create page_pool*
> 	if (rc) {
> 		netdev_err(bp->dev, "bnxt_alloc_mem err: %x\n", rc);
> 		goto open_err_free_mem;
> 	}
>
> 	if (irq_re_init) {
> 		bnxt_init_napi(bp);                *netif_napi_add*
> 		rc = bnxt_request_irq(bp);
> 		if (rc) {
> 			netdev_err(bp->dev, "bnxt_request_irq err: %x\n", rc);
> 			goto open_err_irq;
> 		}
> 	}
>
> 	.....
> }

Regardless of the initialisation error, the fact that bnxt frees the
NAPI memory before calling page_pool_destroy() is a driver bug. Mina has
a suggestion for a warning to catch such bugs over in this thread:

https://lore.kernel.org/r/CAHS8izOv=tUiuzha6NFq1-ZurLGz9Jdi78jb3ey4ExVJirMprA@mail.gmail.com

>> So no, I don't think this is something that should be fixed on the page
>> pool side (and certainly not by adding another synchronize_rcu() call
>> per queue!); rather, we should fix the drivers that get this wrong (and
>> probably document the requirement a bit better).
>
> Even if the timing problem of checking and disabling napi_local should not
> be fixed on the page_pool side, do we have some common understanding
> about fixing the DMA API misuse problem on the page_pool side?
> If yes, do we have some common understanding that some mechanism
> like synchronize_rcu() might still be needed on the page_pool side?

I have not reviewed the rest of your patch set, I only looked at this
patch. I see you posted v8 without addressing Jesper's ask for a
conceptual description of your design. I am not going to review a
600-something line patch series without such a description to go by, so
please address that first.

> If yes, it may be better to focus on discussing how to avoid calling rcu
> sync for each queue mentioned in [1].

Regardless of whether a synchronize_rcu() is needed in the final design
(and again, note that I don't have an opinion on this before reviewing
the whole series), this patch should be dropped from the series. The bug
it is purporting to fix is a driver API misuse and should be fixed in
the drivers, cf the above.

-Toke


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v7 2/8] page_pool: fix timing for checking and disabling napi_local
  2025-01-27 13:47               ` Toke Høiland-Jørgensen
@ 2025-02-04 13:51                 ` Yunsheng Lin
  0 siblings, 0 replies; 31+ messages in thread
From: Yunsheng Lin @ 2025-02-04 13:51 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Yunsheng Lin, davem, kuba,
	pabeni
  Cc: zhangkun09, liuyonglong, fanghaiqing, Alexander Lobakin,
	Xuan Zhuo, Jesper Dangaard Brouer, Ilias Apalodimas, Eric Dumazet,
	Simon Horman, netdev, linux-kernel

On 1/27/2025 9:47 PM, Toke Høiland-Jørgensen wrote:
> Yunsheng Lin <yunshenglin0825@gmail.com> writes:
> 
>> On 1/25/2025 1:13 AM, Toke Høiland-Jørgensen wrote:
>>> Yunsheng Lin <linyunsheng@huawei.com> writes:
>>>
>>>>> So I really don't see a way for this race to happen with correct usage
>>>>> of the page_pool and NAPI APIs, which means there's no reason to make
>>>>> the change you are proposing here.
>>>>
>>>> I looked at one driver setting pp->napi; the bnxt driver doesn't seem
>>>> to call page_pool_disable_direct_recycling() when unloading, see
>>>> bnxt_half_close_nic(). page_pool_disable_direct_recycling() seems to be
>>>> only called for the new queue_mgmt API:
>>>>
>>>> /* rtnl_lock held, this call can only be made after a previous successful
>>>>    * call to bnxt_half_open_nic().
>>>>    */
>>>> void bnxt_half_close_nic(struct bnxt *bp)
>>>> {
>>>> 	bnxt_hwrm_resource_free(bp, false, true);
>>>> 	bnxt_del_napi(bp);       *----call napi del and rcu sync----*
>>>> 	bnxt_free_skbs(bp);
>>>> 	bnxt_free_mem(bp, true); *------call page_pool_destroy()----*
>>>> 	clear_bit(BNXT_STATE_HALF_OPEN, &bp->state);
>>>> }
>>>>
>>>> Even if there were a page_pool_disable_direct_recycling() called between
>>>> bnxt_del_napi() and bnxt_free_mem(), the timing window would still exist,
>>>> as the rcu sync needs to be called after page_pool_disable_direct_recycling();
>>>> it seems some refactoring is needed in the bnxt driver to reuse the rcu sync
>>>> from the NAPI API, in order to avoid calling the rcu sync for
>>>> page_pool_destroy().
>>>
>>> Well, I would consider that usage buggy. A page pool object is created
>>> with a reference to the napi struct; so the page pool should also be
>>> destroyed (clearing its reference) before the napi memory is freed. I
>>> guess this is not really documented anywhere, but it's pretty standard
>>> practice to free objects in the opposite order of their creation.
>>
>> I am not so familiar with the rules around the NAPI creation API, but the
>> bnxt driver implementation can hold a reference to the napi struct before
>> calling netif_napi_add(), see below:
>>
>> static int __bnxt_open_nic(struct bnxt *bp, bool irq_re_init, bool link_re_init)
>> {
>> 	.......
>> 	rc = bnxt_alloc_mem(bp, irq_re_init);     *create page_pool*
>> 	if (rc) {
>> 		netdev_err(bp->dev, "bnxt_alloc_mem err: %x\n", rc);
>> 		goto open_err_free_mem;
>> 	}
>>
>> 	if (irq_re_init) {
>> 		bnxt_init_napi(bp);                *netif_napi_add*
>> 		rc = bnxt_request_irq(bp);
>> 		if (rc) {
>> 			netdev_err(bp->dev, "bnxt_request_irq err: %x\n", rc);
>> 			goto open_err_irq;
>> 		}
>> 	}
>>
>> 	.....
>> }
> 
> Regardless of the initialisation error, the fact that bnxt frees the
> NAPI memory before calling page_pool_destroy() is a driver bug. Mina has
> a suggestion for a warning to catch such bugs over in this thread:
> 
> https://lore.kernel.org/r/CAHS8izOv=tUiuzha6NFq1-ZurLGz9Jdi78jb3ey4ExVJirMprA@mail.gmail.com

Thanks for the reminder.
As the main problem is about adding an rcu sync between
page_pool_disable_direct_recycling() and page_pool_destroy(), I am
really doubtful that a warning can be added to catch such bugs if
page_pool_destroy() does not use an explicit rcu sync and instead relies
on the rcu sync from the napi del API.

> 
>>> So no, I don't think this is something that should be fixed on the page
>>> pool side (and certainly not by adding another synchronize_rcu() call
>>> per queue!); rather, we should fix the drivers that get this wrong (and
>>> probably document the requirement a bit better).
>>
>> Even if the timing problem of checking and disabling napi_local should not
>> be fixed on the page_pool side, do we have some common understanding
>> about fixing the DMA API misuse problem on the page_pool side?
>> If yes, do we have some common understanding that some mechanism
>> like synchronize_rcu() might still be needed on the page_pool side?
> 
> I have not reviewed the rest of your patch set, I only looked at this
> patch. I see you posted v8 without addressing Jesper's ask for a
> conceptual description of your design. I am not going to review a
> 600-something line patch series without such a description to go by, so
> please address that first.

I thought Jesper's ask was mainly about why the page->pp pointer is
hijacked.
I summarized the discussion in [1] below; please let me know if that
addresses your concern too.

"By using the 'struct page_pool_item' referenced by page->pp_item,
page_pool is not only able to keep track of the inflight page to do dma
unmmaping when page_pool_destroy() is called if some pages are still
handled in networking stack, and networking stack is also able to find
the page_pool owning the page when returning pages back into page_pool.

struct page_pool_item {
	unsigned long state;
	
	union {
		netmem_ref pp_netmem;
		struct llist_node lentry;
	};
};

When a page is added to the page_pool, an item is taken from
pool->hold_items, its 'pp_netmem' is set to point to that page, and its
'state' is set accordingly in order to keep track of that page;
pool->hold_items is refilled from pool->release_items when it is empty,
or an item is taken from pool->slow_items when the fast items run out.

When a page is released from the page_pool, the page_pool this page
belongs to can be found by using the below functions:

static inline struct page_pool_item_block *
page_pool_item_to_block(struct page_pool_item *item)
{
	return (struct page_pool_item_block *)((unsigned long)item & PAGE_MASK);
}

static inline struct page_pool *page_pool_get_pp(struct page *page)
{
	/* The size of item_block is always PAGE_SIZE, the address of item_block
	 * for a specific item can be calculated using 'item & PAGE_MASK', so
	 * that we can find the page_pool object it belongs to.
	 */
	return page_pool_item_to_block(page->pp_item)->pp;
}

and after clearing pp_item->state, the item for the released page
is added back to pool->release_items so that it can be reused for new
pages, or it is simply freed when it comes from pool->slow_items.

When page_pool_destroy() is called, pp_item->state is used to tell whether
a specific item is in use/dma mapped or not by scanning all the item
blocks in pool->item_blocks; then pp_item->pp_netmem can be used to do the
dma unmapping if the corresponding inflight page is dma mapped."

1. https://lore.kernel.org/all/2b5a58f3-d67a-4bf7-921a-033326958ac6@huawei.com/
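
To illustrate the free-path lookup described in the summary, a minimal
usage sketch (the wrapper and its name are hypothetical; page_pool_get_pp()
is the helper quoted above):

/* Recover the owning pool purely from page->pp_item (via its item
 * block) and hand the page back to that pool.
 */
static void example_return_page(struct page *page)
{
	struct page_pool *pool = page_pool_get_pp(page);

	page_pool_put_full_page(pool, page, false);
}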

> 
>> If yes, it may be better to focus on discussing how to avoid calling the rcu
>> sync for each queue, as mentioned in [1].
> 
> Regardless of whether a synchronize_rcu() is needed in the final design
> (and again, note that I don't have an opinion on this before reviewing
> the whole series), this patch should be dropped from the series. The bug
> it is purporting to fix is a driver API misuse and should be fixed in
> the drivers, cf the above.

I am still a little doubtful that it is a driver API misuse problem, as
I am not sure if page_pool_destroy() can depend on the rcu sync from the
napi del API for all cases. Even if it is, this driver API misuse
problem seems to only exist after the page_pool NAPI recycling feature/API
was added, which might mean some refactoring is needed on the driver side
to support page_pool NAPI recycling.

Anyway, it seems to make sense to drop this patch from the series for
better forward progress on the dma misuse problem, as they are not
really related.

> 
> -Toke
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2025-02-04 13:51 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2025-01-10 13:06 [PATCH net-next v7 0/8] fix two bugs related to page_pool Yunsheng Lin
2025-01-10 13:06 ` [PATCH net-next v7 1/8] page_pool: introduce page_pool_get_pp() API Yunsheng Lin
2025-01-10 13:06 ` [PATCH net-next v7 2/8] page_pool: fix timing for checking and disabling napi_local Yunsheng Lin
2025-01-10 15:40   ` Toke Høiland-Jørgensen
2025-01-11  5:24     ` Yunsheng Lin
2025-01-14 13:03       ` Yunsheng Lin
2025-01-20 11:24       ` Toke Høiland-Jørgensen
2025-01-22 11:02         ` Yunsheng Lin
2025-01-24 17:13           ` Toke Høiland-Jørgensen
2025-01-25 14:21             ` Yunsheng Lin
2025-01-27 13:47               ` Toke Høiland-Jørgensen
2025-02-04 13:51                 ` Yunsheng Lin
2025-01-10 13:06 ` [PATCH net-next v7 3/8] page_pool: fix IOMMU crash when driver has already unbound Yunsheng Lin
2025-01-15 16:29   ` Jesper Dangaard Brouer
2025-01-16 12:52     ` Yunsheng Lin
2025-01-16 16:09       ` Jesper Dangaard Brouer
2025-01-17 11:56         ` Yunsheng Lin
2025-01-17 16:56           ` Jesper Dangaard Brouer
2025-01-18 13:36             ` Yunsheng Lin
2025-01-10 13:06 ` [PATCH net-next v7 4/8] page_pool: support unlimited number of inflight pages Yunsheng Lin
2025-01-10 13:06 ` [PATCH net-next v7 5/8] page_pool: skip dma sync operation for " Yunsheng Lin
2025-01-10 13:07 ` [PATCH net-next v7 6/8] page_pool: use list instead of ptr_ring for ring cache Yunsheng Lin
2025-01-10 13:07 ` [PATCH net-next v7 7/8] page_pool: batch refilling pages to reduce atomic operation Yunsheng Lin
2025-01-10 13:07 ` [PATCH net-next v7 8/8] page_pool: use list instead of array for alloc cache Yunsheng Lin
2025-01-14 14:31 ` [PATCH net-next v7 0/8] fix two bugs related to page_pool Jesper Dangaard Brouer
2025-01-15 11:33   ` Yunsheng Lin
2025-01-15 17:40     ` Jesper Dangaard Brouer
2025-01-16 12:52       ` Yunsheng Lin
2025-01-16 18:02         ` Jesper Dangaard Brouer
2025-01-17 11:35           ` Yunsheng Lin
2025-01-18  8:04             ` Jesper Dangaard Brouer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).