[PATCH RFC v4 1/3] page_pool: fix timing for checking and disabling napi

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH RFC v4 1/3] page_pool: fix timing for checking and disabling napi_local
       [not found] <20241120103456.396577-1-linyunsheng@huawei.com>
@ 2024-11-20 10:34 ` Yunsheng Lin
  2024-12-03  2:49   ` Jakub Kicinski
  2024-11-20 10:34 ` [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has already unbound Yunsheng Lin
  2024-11-20 10:34 ` [PATCH RFC v4 3/3] page_pool: skip dma sync operation for inflight pages Yunsheng Lin
  2 siblings, 1 reply; 28+ messages in thread
From: Yunsheng Lin @ 2024-11-20 10:34 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: liuyonglong, fanghaiqing, zhangkun09, Yunsheng Lin,
	Alexander Lobakin, Xuan Zhuo, Jesper Dangaard Brouer,
	Ilias Apalodimas, Eric Dumazet, Simon Horman, netdev,
	linux-kernel

page_pool page may be freed from skb_defer_free_flush() in
softirq context without binding to any specific napi, it
may cause use-after-free problem due to the below time window,
as below, CPU1 may still access napi->list_owner after CPU0
free the napi memory:

            CPU 0                           CPU1
      page_pool_destroy()          skb_defer_free_flush()
             .                               .
             .                napi = READ_ONCE(pool->p.napi);
             .                               .
page_pool_disable_direct_recycling()         .
   driver free napi memory                   .
             .                               .
             .       napi && READ_ONCE(napi->list_owner) == cpuid
             .                               .

Use rcu mechanism to avoid the above problem.

Note, the above was found during code reviewing on how to fix
the problem in [1].

1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/

Fixes: dd64b232deb8 ("page_pool: unlink from napi during destroy")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 net/core/page_pool.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index f89cf93f6eb4..b3dae671eb26 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -795,6 +795,7 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
 static bool page_pool_napi_local(const struct page_pool *pool)
 {
 	const struct napi_struct *napi;
+	bool napi_local;
 	u32 cpuid;
 
 	if (unlikely(!in_softirq()))
@@ -810,9 +811,15 @@ static bool page_pool_napi_local(const struct page_pool *pool)
 	if (READ_ONCE(pool->cpuid) == cpuid)
 		return true;
 
+	/* Synchronizated with page_pool_destory() to avoid use-after-free
+	 * for 'napi'.
+	 */
+	rcu_read_lock();
 	napi = READ_ONCE(pool->p.napi);
+	napi_local = napi && READ_ONCE(napi->list_owner) == cpuid;
+	rcu_read_unlock();
 
-	return napi && READ_ONCE(napi->list_owner) == cpuid;
+	return napi_local;
 }
 
 void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
@@ -1126,6 +1133,12 @@ void page_pool_destroy(struct page_pool *pool)
 	if (!page_pool_release(pool))
 		return;
 
+	/* Paired with rcu lock in page_pool_napi_local() to enable clearing
+	 * of pool->p.napi in page_pool_disable_direct_recycling() is seen
+	 * before returning to driver to free the napi instance.
+	 */
+	synchronize_rcu();
+
 	page_pool_detached(pool);
 	pool->defer_start = jiffies;
 	pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has already unbound
       [not found] <20241120103456.396577-1-linyunsheng@huawei.com>
  2024-11-20 10:34 ` [PATCH RFC v4 1/3] page_pool: fix timing for checking and disabling napi_local Yunsheng Lin
@ 2024-11-20 10:34 ` Yunsheng Lin
  2024-11-20 15:10   ` Jesper Dangaard Brouer
  2024-11-20 10:34 ` [PATCH RFC v4 3/3] page_pool: skip dma sync operation for inflight pages Yunsheng Lin
  2 siblings, 1 reply; 28+ messages in thread
From: Yunsheng Lin @ 2024-11-20 10:34 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: liuyonglong, fanghaiqing, zhangkun09, Yunsheng Lin, Robin Murphy,
	Alexander Duyck, IOMMU, Jesper Dangaard Brouer, Ilias Apalodimas,
	Eric Dumazet, Simon Horman, netdev, linux-kernel

Networking driver with page_pool support may hand over page
still with dma mapping to network stack and try to reuse that
page after network stack is done with it and passes it back
to page_pool to avoid the penalty of dma mapping/unmapping.
With all the caching in the network stack, some pages may be
held in the network stack without returning to the page_pool
soon enough, and with VF disable causing the driver unbound,
the page_pool does not stop the driver from doing it's
unbounding work, instead page_pool uses workqueue to check
if there is some pages coming back from the network stack
periodically, if there is any, it will do the dma unmmapping
related cleanup work.

As mentioned in [1], attempting DMA unmaps after the driver
has already unbound may leak resources or at worst corrupt
memory. Fundamentally, the page pool code cannot allow DMA
mappings to outlive the driver they belong to.

Currently it seems there are at least two cases that the page
is not released fast enough causing dma unmmapping done after
driver has already unbound:
1. ipv4 packet defragmentation timeout: this seems to cause
   delay up to 30 secs.
2. skb_defer_free_flush(): this may cause infinite delay if
   there is no triggering for net_rx_action().

In order not to call DMA APIs to do DMA unmmapping after driver
has already unbound and stall the unloading of the networking
driver, scan the inflight pages using some MM API to do the DMA
unmmapping for those pages when page_pool_destroy() is called.

The max time of scanning inflight pages is about 1.3 sec for
system with over 300GB memory as mentioned in [3], which seems
acceptable as the scanning is only done when there are indeed
some inflight pages and is done in the slow path.

Note, the devmem patchset seems to make the bug harder to fix,
and may make backporting harder too. As there is no actual user
for the devmem and the fixing for devmem is unclear for now,
this patch does not consider fixing the case for devmem yet.

1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
2. https://github.com/netoptimizer/prototype-kernel
3. https://lore.kernel.org/all/17a24d69-7bf0-412c-a32a-b25d82bb4159@kernel.org/
CC: Robin Murphy <robin.murphy@arm.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: IOMMU <iommu@lists.linux.dev>
Fixes: f71fec47c2df ("page_pool: make sure struct device is stable")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Tested-by: Yonglong Liu <liuyonglong@huawei.com>
---
 include/net/page_pool/types.h |  6 ++-
 net/core/page_pool.c          | 95 ++++++++++++++++++++++++++++++-----
 2 files changed, 87 insertions(+), 14 deletions(-)

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index c022c410abe3..7393fd45bc47 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -228,7 +228,11 @@ struct page_pool {
 	 */
 	refcount_t user_cnt;
 
-	u64 destroy_cnt;
+	/* Lock to avoid doing dma unmapping concurrently when
+	 * destroy_cnt > 0.
+	 */
+	spinlock_t destroy_lock;
+	unsigned int destroy_cnt;
 
 	/* Slow/Control-path information follows */
 	struct page_pool_params_slow slow;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index b3dae671eb26..33a314abbba4 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -272,9 +272,6 @@ static int page_pool_init(struct page_pool *pool,
 	/* Driver calling page_pool_create() also call page_pool_destroy() */
 	refcount_set(&pool->user_cnt, 1);
 
-	if (pool->dma_map)
-		get_device(pool->p.dev);
-
 	if (pool->slow.flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM) {
 		/* We rely on rtnl_lock()ing to make sure netdev_rx_queue
 		 * configuration doesn't change while we're initializing
@@ -312,9 +309,6 @@ static void page_pool_uninit(struct page_pool *pool)
 {
 	ptr_ring_cleanup(&pool->ring, NULL);
 
-	if (pool->dma_map)
-		put_device(pool->p.dev);
-
 #ifdef CONFIG_PAGE_POOL_STATS
 	if (!pool->system)
 		free_percpu(pool->recycle_stats);
@@ -365,7 +359,7 @@ struct page_pool *page_pool_create(const struct page_pool_params *params)
 }
 EXPORT_SYMBOL(page_pool_create);
 
-static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem);
+static void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem);
 
 static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
 {
@@ -403,7 +397,7 @@ static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
 			 * (2) break out to fallthrough to alloc_pages_node.
 			 * This limit stress on page buddy alloactor.
 			 */
-			page_pool_return_page(pool, netmem);
+			__page_pool_return_page(pool, netmem);
 			alloc_stat_inc(pool, waive);
 			netmem = 0;
 			break;
@@ -670,7 +664,7 @@ static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
  * a regular page (that will eventually be returned to the normal
  * page-allocator via put_page).
  */
-void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
+void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
 {
 	int count;
 	bool put;
@@ -697,6 +691,27 @@ void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
 	 */
 }
 
+/* Called from page_pool_put_*() path, need to synchronizated with
+ * page_pool_destory() path.
+ */
+static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
+{
+	unsigned int destroy_cnt;
+
+	rcu_read_lock();
+
+	destroy_cnt = READ_ONCE(pool->destroy_cnt);
+	if (unlikely(destroy_cnt)) {
+		spin_lock_bh(&pool->destroy_lock);
+		__page_pool_return_page(pool, netmem);
+		spin_unlock_bh(&pool->destroy_lock);
+	} else {
+		__page_pool_return_page(pool, netmem);
+	}
+
+	rcu_read_unlock();
+}
+
 static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem)
 {
 	int ret;
@@ -924,7 +939,7 @@ static netmem_ref page_pool_drain_frag(struct page_pool *pool,
 		return netmem;
 	}
 
-	page_pool_return_page(pool, netmem);
+	__page_pool_return_page(pool, netmem);
 	return 0;
 }
 
@@ -938,7 +953,7 @@ static void page_pool_free_frag(struct page_pool *pool)
 	if (!netmem || page_pool_unref_netmem(netmem, drain_count))
 		return;
 
-	page_pool_return_page(pool, netmem);
+	__page_pool_return_page(pool, netmem);
 }
 
 netmem_ref page_pool_alloc_frag_netmem(struct page_pool *pool,
@@ -1045,7 +1060,7 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
 static void page_pool_scrub(struct page_pool *pool)
 {
 	page_pool_empty_alloc_cache_once(pool);
-	pool->destroy_cnt++;
+	WRITE_ONCE(pool->destroy_cnt, pool->destroy_cnt + 1);
 
 	/* No more consumers should exist, but producers could still
 	 * be in-flight.
@@ -1119,6 +1134,58 @@ void page_pool_disable_direct_recycling(struct page_pool *pool)
 }
 EXPORT_SYMBOL(page_pool_disable_direct_recycling);
 
+static void page_pool_inflight_unmap(struct page_pool *pool)
+{
+	unsigned int unmapped = 0;
+	struct zone *zone;
+	int inflight;
+
+	if (!pool->dma_map || pool->mp_priv)
+		return;
+
+	get_online_mems();
+	spin_lock_bh(&pool->destroy_lock);
+
+	inflight = page_pool_inflight(pool, false);
+	for_each_populated_zone(zone) {
+		unsigned long end_pfn = zone_end_pfn(zone);
+		unsigned long pfn;
+
+		for (pfn = zone->zone_start_pfn; pfn < end_pfn; pfn++) {
+			struct page *page = pfn_to_online_page(pfn);
+
+			if (!page || !page_count(page) ||
+			    (page->pp_magic & ~0x3UL) != PP_SIGNATURE ||
+			    page->pp != pool)
+				continue;
+
+			dma_unmap_page_attrs(pool->p.dev,
+					     page_pool_get_dma_addr(page),
+					     PAGE_SIZE << pool->p.order,
+					     pool->p.dma_dir,
+					     DMA_ATTR_SKIP_CPU_SYNC |
+					     DMA_ATTR_WEAK_ORDERING);
+			page_pool_set_dma_addr(page, 0);
+
+			unmapped++;
+
+			/* Skip scanning all pages when debug is disabled */
+			if (!IS_ENABLED(CONFIG_DEBUG_NET) &&
+			    inflight == unmapped)
+				goto out;
+		}
+	}
+
+out:
+	WARN_ONCE(page_pool_inflight(pool, false) != unmapped,
+		  "page_pool(%u): unmapped(%u) != inflight pages(%d)\n",
+		  pool->user.id, unmapped, inflight);
+
+	pool->dma_map = false;
+	spin_unlock_bh(&pool->destroy_lock);
+	put_online_mems();
+}
+
 void page_pool_destroy(struct page_pool *pool)
 {
 	if (!pool)
@@ -1139,6 +1206,8 @@ void page_pool_destroy(struct page_pool *pool)
 	 */
 	synchronize_rcu();
 
+	page_pool_inflight_unmap(pool);
+
 	page_pool_detached(pool);
 	pool->defer_start = jiffies;
 	pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
@@ -1159,7 +1228,7 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
 	/* Flush pool alloc cache, as refill will check NUMA node */
 	while (pool->alloc.count) {
 		netmem = pool->alloc.cache[--pool->alloc.count];
-		page_pool_return_page(pool, netmem);
+		__page_pool_return_page(pool, netmem);
 	}
 }
 EXPORT_SYMBOL(page_pool_update_nid);
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH RFC v4 3/3] page_pool: skip dma sync operation for inflight pages
       [not found] <20241120103456.396577-1-linyunsheng@huawei.com>
  2024-11-20 10:34 ` [PATCH RFC v4 1/3] page_pool: fix timing for checking and disabling napi_local Yunsheng Lin
  2024-11-20 10:34 ` [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has already unbound Yunsheng Lin
@ 2024-11-20 10:34 ` Yunsheng Lin
  2024-11-20 16:17   ` Robin Murphy
  2 siblings, 1 reply; 28+ messages in thread
From: Yunsheng Lin @ 2024-11-20 10:34 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: liuyonglong, fanghaiqing, zhangkun09, Yunsheng Lin, Robin Murphy,
	Alexander Duyck, Andrew Morton, IOMMU, MM, Jesper Dangaard Brouer,
	Ilias Apalodimas, Eric Dumazet, Simon Horman, netdev,
	linux-kernel

Skip dma sync operation for inflight pages before the
page_pool_destroy() returns to the driver as DMA API
expects to be called with a valid device bound to a
driver as mentioned in [1].

After page_pool_destroy() is called, the page is not
expected to be recycled back to pool->alloc cache and
dma sync operation is not needed when the page is not
recyclable or pool->ring is full, so only skip the dma
sync operation for the infilght pages by clearing the
pool->dma_sync under protection of rcu lock when page
is recycled to pool->ring to ensure that there is no
dma sync operation called after page_pool_destroy() is
returned.

1. https://lore.kernel.org/all/caf31b5e-0e8f-4844-b7ba-ef59ed13b74e@arm.com/
CC: Robin Murphy <robin.murphy@arm.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: IOMMU <iommu@lists.linux.dev>
CC: MM <linux-mm@kvack.org>
Fixes: f71fec47c2df ("page_pool: make sure struct device is stable")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 net/core/page_pool.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 33a314abbba4..0bde7c6c781a 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -712,7 +712,8 @@ static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
 	rcu_read_unlock();
 }
 
-static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem)
+static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem,
+				      unsigned int dma_sync_size)
 {
 	int ret;
 	/* BH protection not needed if current is softirq */
@@ -723,10 +724,13 @@ static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem)
 
 	if (!ret) {
 		recycle_stat_inc(pool, ring);
-		return true;
+
+		rcu_read_lock();
+		page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
+		rcu_read_unlock();
 	}
 
-	return false;
+	return !ret;
 }
 
 /* Only allow direct recycling in special circumstances, into the
@@ -779,10 +783,11 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
 	if (likely(__page_pool_page_can_be_recycled(netmem))) {
 		/* Read barrier done in page_ref_count / READ_ONCE */
 
-		page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
-
-		if (allow_direct && page_pool_recycle_in_cache(netmem, pool))
+		if (allow_direct && page_pool_recycle_in_cache(netmem, pool)) {
+			page_pool_dma_sync_for_device(pool, netmem,
+						      dma_sync_size);
 			return 0;
+		}
 
 		/* Page found as candidate for recycling */
 		return netmem;
@@ -845,7 +850,7 @@ void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
 
 	netmem =
 		__page_pool_put_page(pool, netmem, dma_sync_size, allow_direct);
-	if (netmem && !page_pool_recycle_in_ring(pool, netmem)) {
+	if (netmem && !page_pool_recycle_in_ring(pool, netmem, dma_sync_size)) {
 		/* Cache full, fallback to free pages */
 		recycle_stat_inc(pool, ring_full);
 		page_pool_return_page(pool, netmem);
@@ -903,14 +908,18 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data,
 
 	/* Bulk producer into ptr_ring page_pool cache */
 	in_softirq = page_pool_producer_lock(pool);
+	rcu_read_lock();
 	for (i = 0; i < bulk_len; i++) {
 		if (__ptr_ring_produce(&pool->ring, data[i])) {
 			/* ring full */
 			recycle_stat_inc(pool, ring_full);
 			break;
 		}
+		page_pool_dma_sync_for_device(pool, (__force netmem_ref)data[i],
+					      -1);
 	}
 	recycle_stat_add(pool, ring, i);
+	rcu_read_unlock();
 	page_pool_producer_unlock(pool, in_softirq);
 
 	/* Hopefully all pages was return into ptr_ring */
@@ -1200,6 +1209,8 @@ void page_pool_destroy(struct page_pool *pool)
 	if (!page_pool_release(pool))
 		return;
 
+	pool->dma_sync = false;
+
 	/* Paired with rcu lock in page_pool_napi_local() to enable clearing
 	 * of pool->p.napi in page_pool_disable_direct_recycling() is seen
 	 * before returning to driver to free the napi instance.
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has already unbound
  2024-11-20 10:34 ` [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has already unbound Yunsheng Lin
@ 2024-11-20 15:10   ` Jesper Dangaard Brouer
  2024-11-21  8:03     ` Yunsheng Lin
  0 siblings, 1 reply; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2024-11-20 15:10 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba, pabeni
  Cc: liuyonglong, fanghaiqing, zhangkun09, Robin Murphy,
	Alexander Duyck, IOMMU, Ilias Apalodimas, Eric Dumazet,
	Simon Horman, netdev, linux-kernel



On 20/11/2024 11.34, Yunsheng Lin wrote:
> Networking driver with page_pool support may hand over page
> still with dma mapping to network stack and try to reuse that
> page after network stack is done with it and passes it back
> to page_pool to avoid the penalty of dma mapping/unmapping.
> With all the caching in the network stack, some pages may be
> held in the network stack without returning to the page_pool
> soon enough, and with VF disable causing the driver unbound,
> the page_pool does not stop the driver from doing it's
> unbounding work, instead page_pool uses workqueue to check
> if there is some pages coming back from the network stack
> periodically, if there is any, it will do the dma unmmapping
> related cleanup work.
> 
> As mentioned in [1], attempting DMA unmaps after the driver
> has already unbound may leak resources or at worst corrupt
> memory. Fundamentally, the page pool code cannot allow DMA
> mappings to outlive the driver they belong to.
> 
> Currently it seems there are at least two cases that the page
> is not released fast enough causing dma unmmapping done after
> driver has already unbound:
> 1. ipv4 packet defragmentation timeout: this seems to cause
>     delay up to 30 secs.
> 2. skb_defer_free_flush(): this may cause infinite delay if
>     there is no triggering for net_rx_action().
> 
> In order not to call DMA APIs to do DMA unmmapping after driver
> has already unbound and stall the unloading of the networking
> driver, scan the inflight pages using some MM API to do the DMA
> unmmapping for those pages when page_pool_destroy() is called.
> 
> The max time of scanning inflight pages is about 1.3 sec for
> system with over 300GB memory as mentioned in [3], which seems
> acceptable as the scanning is only done when there are indeed
> some inflight pages and is done in the slow path.
> 
> Note, the devmem patchset seems to make the bug harder to fix,
> and may make backporting harder too. As there is no actual user
> for the devmem and the fixing for devmem is unclear for now,
> this patch does not consider fixing the case for devmem yet.
> 
> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
> 2. https://github.com/netoptimizer/prototype-kernel
> 3. https://lore.kernel.org/all/17a24d69-7bf0-412c-a32a-b25d82bb4159@kernel.org/
> CC: Robin Murphy <robin.murphy@arm.com>
> CC: Alexander Duyck <alexander.duyck@gmail.com>
> CC: IOMMU <iommu@lists.linux.dev>
> Fixes: f71fec47c2df ("page_pool: make sure struct device is stable")
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> Tested-by: Yonglong Liu <liuyonglong@huawei.com>
> ---
>   include/net/page_pool/types.h |  6 ++-
>   net/core/page_pool.c          | 95 ++++++++++++++++++++++++++++++-----
>   2 files changed, 87 insertions(+), 14 deletions(-)
> 
> diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
> index c022c410abe3..7393fd45bc47 100644
> --- a/include/net/page_pool/types.h
> +++ b/include/net/page_pool/types.h
> @@ -228,7 +228,11 @@ struct page_pool {
>   	 */
>   	refcount_t user_cnt;
>   
> -	u64 destroy_cnt;
> +	/* Lock to avoid doing dma unmapping concurrently when
> +	 * destroy_cnt > 0.
> +	 */
> +	spinlock_t destroy_lock;
> +	unsigned int destroy_cnt;
>   
>   	/* Slow/Control-path information follows */
>   	struct page_pool_params_slow slow;
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index b3dae671eb26..33a314abbba4 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -272,9 +272,6 @@ static int page_pool_init(struct page_pool *pool,
>   	/* Driver calling page_pool_create() also call page_pool_destroy() */
>   	refcount_set(&pool->user_cnt, 1);
>   
> -	if (pool->dma_map)
> -		get_device(pool->p.dev);
> -
>   	if (pool->slow.flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM) {
>   		/* We rely on rtnl_lock()ing to make sure netdev_rx_queue
>   		 * configuration doesn't change while we're initializing
> @@ -312,9 +309,6 @@ static void page_pool_uninit(struct page_pool *pool)
>   {
>   	ptr_ring_cleanup(&pool->ring, NULL);
>   
> -	if (pool->dma_map)
> -		put_device(pool->p.dev);
> -
>   #ifdef CONFIG_PAGE_POOL_STATS
>   	if (!pool->system)
>   		free_percpu(pool->recycle_stats);
> @@ -365,7 +359,7 @@ struct page_pool *page_pool_create(const struct page_pool_params *params)
>   }
>   EXPORT_SYMBOL(page_pool_create);
>   
> -static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem);
> +static void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem);
>   
>   static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
>   {
> @@ -403,7 +397,7 @@ static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
>   			 * (2) break out to fallthrough to alloc_pages_node.
>   			 * This limit stress on page buddy alloactor.
>   			 */
> -			page_pool_return_page(pool, netmem);
> +			__page_pool_return_page(pool, netmem);
>   			alloc_stat_inc(pool, waive);
>   			netmem = 0;
>   			break;
> @@ -670,7 +664,7 @@ static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
>    * a regular page (that will eventually be returned to the normal
>    * page-allocator via put_page).
>    */
> -void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
> +void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
>   {
>   	int count;
>   	bool put;
> @@ -697,6 +691,27 @@ void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
>   	 */
>   }
>   
> +/* Called from page_pool_put_*() path, need to synchronizated with
> + * page_pool_destory() path.
> + */
> +static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
> +{
> +	unsigned int destroy_cnt;
> +
> +	rcu_read_lock();
> +
> +	destroy_cnt = READ_ONCE(pool->destroy_cnt);
> +	if (unlikely(destroy_cnt)) {
> +		spin_lock_bh(&pool->destroy_lock);
> +		__page_pool_return_page(pool, netmem);
> +		spin_unlock_bh(&pool->destroy_lock);
> +	} else {
> +		__page_pool_return_page(pool, netmem);
> +	}
> +
> +	rcu_read_unlock();
> +}
> +
>   static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem)
>   {
>   	int ret;
> @@ -924,7 +939,7 @@ static netmem_ref page_pool_drain_frag(struct page_pool *pool,
>   		return netmem;
>   	}
>   
> -	page_pool_return_page(pool, netmem);
> +	__page_pool_return_page(pool, netmem);
>   	return 0;
>   }
>   
> @@ -938,7 +953,7 @@ static void page_pool_free_frag(struct page_pool *pool)
>   	if (!netmem || page_pool_unref_netmem(netmem, drain_count))
>   		return;
>   
> -	page_pool_return_page(pool, netmem);
> +	__page_pool_return_page(pool, netmem);
>   }
>   
>   netmem_ref page_pool_alloc_frag_netmem(struct page_pool *pool,
> @@ -1045,7 +1060,7 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
>   static void page_pool_scrub(struct page_pool *pool)
>   {
>   	page_pool_empty_alloc_cache_once(pool);
> -	pool->destroy_cnt++;
> +	WRITE_ONCE(pool->destroy_cnt, pool->destroy_cnt + 1);
>   
>   	/* No more consumers should exist, but producers could still
>   	 * be in-flight.
> @@ -1119,6 +1134,58 @@ void page_pool_disable_direct_recycling(struct page_pool *pool)
>   }
>   EXPORT_SYMBOL(page_pool_disable_direct_recycling);
>   
> +static void page_pool_inflight_unmap(struct page_pool *pool)
> +{
> +	unsigned int unmapped = 0;
> +	struct zone *zone;
> +	int inflight;
> +
> +	if (!pool->dma_map || pool->mp_priv)
> +		return;
> +
> +	get_online_mems();
> +	spin_lock_bh(&pool->destroy_lock);
> +
> +	inflight = page_pool_inflight(pool, false);
> +	for_each_populated_zone(zone) {
> +		unsigned long end_pfn = zone_end_pfn(zone);
> +		unsigned long pfn;
> +
> +		for (pfn = zone->zone_start_pfn; pfn < end_pfn; pfn++) {
> +			struct page *page = pfn_to_online_page(pfn);
> +
> +			if (!page || !page_count(page) ||
> +			    (page->pp_magic & ~0x3UL) != PP_SIGNATURE ||
> +			    page->pp != pool)
> +				continue;
> +
> +			dma_unmap_page_attrs(pool->p.dev,
> +					     page_pool_get_dma_addr(page),
> +					     PAGE_SIZE << pool->p.order,
> +					     pool->p.dma_dir,
> +					     DMA_ATTR_SKIP_CPU_SYNC |
> +					     DMA_ATTR_WEAK_ORDERING);
> +			page_pool_set_dma_addr(page, 0);
> +

I feel this belongs in a helper function call.

Previously we had a function called: page_pool_release_page().

This was used to convert the page into a normal page again, releasing
page_pool dependencies.  It was used when packet transitioned into the
netstack, but it was removed when we decided it was safe to let netstack
return pp pages.

Removed in commits:
  - 535b9c61bdef ("net: page_pool: hide page_pool_release_page()")
  - 07e0c7d3179d ("net: page_pool: merge page_pool_release_page() with 
page_pool_return_page()")


> +			unmapped++;
> +
> +			/* Skip scanning all pages when debug is disabled */
> +			if (!IS_ENABLED(CONFIG_DEBUG_NET) &&
> +			    inflight == unmapped)
> +				goto out;
> +		}
> +	}
> +
> +out:
> +	WARN_ONCE(page_pool_inflight(pool, false) != unmapped,
> +		  "page_pool(%u): unmapped(%u) != inflight pages(%d)\n",
> +		  pool->user.id, unmapped, inflight);
> +
> +	pool->dma_map = false;
> +	spin_unlock_bh(&pool->destroy_lock);
> +	put_online_mems();
> +}
> +
>   void page_pool_destroy(struct page_pool *pool)
>   {
>   	if (!pool)
> @@ -1139,6 +1206,8 @@ void page_pool_destroy(struct page_pool *pool)
>   	 */
>   	synchronize_rcu();
>   
> +	page_pool_inflight_unmap(pool);
> +

Reaching here means we have detected in-flight packets/pages.

In "page_pool_inflight_unmap" we scan and find those in-flight pages to
DMA unmap them. Then below we wait for these in-flight pages again.
Why don't we just "release" (page_pool_release_page) those in-flight
pages from belonging to the page_pool, when we found them during scanning?

If doing so, we can hopefully remove the periodic checking code below.

>   	page_pool_detached(pool);
>   	pool->defer_start = jiffies;
>   	pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
> @@ -1159,7 +1228,7 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
>   	/* Flush pool alloc cache, as refill will check NUMA node */
>   	while (pool->alloc.count) {
>   		netmem = pool->alloc.cache[--pool->alloc.count];
> -		page_pool_return_page(pool, netmem);
> +		__page_pool_return_page(pool, netmem);
>   	}
>   }
>   EXPORT_SYMBOL(page_pool_update_nid);

Thanks for continuing to work on this :-)

--Jesper

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v4 3/3] page_pool: skip dma sync operation for inflight pages
  2024-11-20 10:34 ` [PATCH RFC v4 3/3] page_pool: skip dma sync operation for inflight pages Yunsheng Lin
@ 2024-11-20 16:17   ` Robin Murphy
  2024-11-21  8:04     ` Yunsheng Lin
  0 siblings, 1 reply; 28+ messages in thread
From: Robin Murphy @ 2024-11-20 16:17 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba, pabeni
  Cc: liuyonglong, fanghaiqing, zhangkun09, Alexander Duyck,
	Andrew Morton, IOMMU, MM, Jesper Dangaard Brouer,
	Ilias Apalodimas, Eric Dumazet, Simon Horman, netdev,
	linux-kernel

On 20/11/2024 10:34 am, Yunsheng Lin wrote:
> Skip dma sync operation for inflight pages before the
> page_pool_destroy() returns to the driver as DMA API
> expects to be called with a valid device bound to a
> driver as mentioned in [1].
> 
> After page_pool_destroy() is called, the page is not
> expected to be recycled back to pool->alloc cache and
> dma sync operation is not needed when the page is not
> recyclable or pool->ring is full, so only skip the dma
> sync operation for the infilght pages by clearing the
> pool->dma_sync under protection of rcu lock when page
> is recycled to pool->ring to ensure that there is no
> dma sync operation called after page_pool_destroy() is
> returned.

Something feels off here - either this is a micro-optimisation which I 
wouldn't really expect to be meaningful, or it means patch #2 doesn't 
actually do what it claims. If it really is possible to attempt to 
dma_sync a page *after* page_pool_inflight_unmap() has already reclaimed 
and unmapped it, that represents yet another DMA API lifecycle issue, 
which as well as being even more obviously incorrect usage-wise, could 
also still lead to the same crash (if the device is non-coherent).

Otherwise, I don't imagine it's really worth worrying about optimising 
out syncs for any pages which happen to get naturally returned after 
page_pool_destroy() starts but before they're explicitly reclaimed. 
Realistically, the kinds of big server systems where reclaim takes an 
appreciable amount of time are going to be coherent and skipping syncs 
anyway.

Thanks,
Robin.

> 1. https://lore.kernel.org/all/caf31b5e-0e8f-4844-b7ba-ef59ed13b74e@arm.com/
> CC: Robin Murphy <robin.murphy@arm.com>
> CC: Alexander Duyck <alexander.duyck@gmail.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: IOMMU <iommu@lists.linux.dev>
> CC: MM <linux-mm@kvack.org>
> Fixes: f71fec47c2df ("page_pool: make sure struct device is stable")
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
>   net/core/page_pool.c | 25 ++++++++++++++++++-------
>   1 file changed, 18 insertions(+), 7 deletions(-)
> 
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 33a314abbba4..0bde7c6c781a 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -712,7 +712,8 @@ static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
>   	rcu_read_unlock();
>   }
>   
> -static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem)
> +static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem,
> +				      unsigned int dma_sync_size)
>   {
>   	int ret;
>   	/* BH protection not needed if current is softirq */
> @@ -723,10 +724,13 @@ static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem)
>   
>   	if (!ret) {
>   		recycle_stat_inc(pool, ring);
> -		return true;
> +
> +		rcu_read_lock();
> +		page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
> +		rcu_read_unlock();
>   	}
>   
> -	return false;
> +	return !ret;
>   }
>   
>   /* Only allow direct recycling in special circumstances, into the
> @@ -779,10 +783,11 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
>   	if (likely(__page_pool_page_can_be_recycled(netmem))) {
>   		/* Read barrier done in page_ref_count / READ_ONCE */
>   
> -		page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
> -
> -		if (allow_direct && page_pool_recycle_in_cache(netmem, pool))
> +		if (allow_direct && page_pool_recycle_in_cache(netmem, pool)) {
> +			page_pool_dma_sync_for_device(pool, netmem,
> +						      dma_sync_size);
>   			return 0;
> +		}
>   
>   		/* Page found as candidate for recycling */
>   		return netmem;
> @@ -845,7 +850,7 @@ void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
>   
>   	netmem =
>   		__page_pool_put_page(pool, netmem, dma_sync_size, allow_direct);
> -	if (netmem && !page_pool_recycle_in_ring(pool, netmem)) {
> +	if (netmem && !page_pool_recycle_in_ring(pool, netmem, dma_sync_size)) {
>   		/* Cache full, fallback to free pages */
>   		recycle_stat_inc(pool, ring_full);
>   		page_pool_return_page(pool, netmem);
> @@ -903,14 +908,18 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data,
>   
>   	/* Bulk producer into ptr_ring page_pool cache */
>   	in_softirq = page_pool_producer_lock(pool);
> +	rcu_read_lock();
>   	for (i = 0; i < bulk_len; i++) {
>   		if (__ptr_ring_produce(&pool->ring, data[i])) {
>   			/* ring full */
>   			recycle_stat_inc(pool, ring_full);
>   			break;
>   		}
> +		page_pool_dma_sync_for_device(pool, (__force netmem_ref)data[i],
> +					      -1);
>   	}
>   	recycle_stat_add(pool, ring, i);
> +	rcu_read_unlock();
>   	page_pool_producer_unlock(pool, in_softirq);
>   
>   	/* Hopefully all pages was return into ptr_ring */
> @@ -1200,6 +1209,8 @@ void page_pool_destroy(struct page_pool *pool)
>   	if (!page_pool_release(pool))
>   		return;
>   
> +	pool->dma_sync = false;
> +
>   	/* Paired with rcu lock in page_pool_napi_local() to enable clearing
>   	 * of pool->p.napi in page_pool_disable_direct_recycling() is seen
>   	 * before returning to driver to free the napi instance.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has already unbound
  2024-11-20 15:10   ` Jesper Dangaard Brouer
@ 2024-11-21  8:03     ` Yunsheng Lin
  2024-11-25 15:25       ` Jesper Dangaard Brouer
  2024-11-26 21:51       ` Mina Almasry
  0 siblings, 2 replies; 28+ messages in thread
From: Yunsheng Lin @ 2024-11-21  8:03 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, davem, kuba, pabeni
  Cc: liuyonglong, fanghaiqing, zhangkun09, Robin Murphy,
	Alexander Duyck, IOMMU, Ilias Apalodimas, Eric Dumazet,
	Simon Horman, netdev, linux-kernel

On 2024/11/20 23:10, Jesper Dangaard Brouer wrote:

...

>>         /* No more consumers should exist, but producers could still
>>        * be in-flight.
>> @@ -1119,6 +1134,58 @@ void page_pool_disable_direct_recycling(struct page_pool *pool)
>>   }
>>   EXPORT_SYMBOL(page_pool_disable_direct_recycling);
>>   +static void page_pool_inflight_unmap(struct page_pool *pool)
>> +{
>> +    unsigned int unmapped = 0;
>> +    struct zone *zone;
>> +    int inflight;
>> +
>> +    if (!pool->dma_map || pool->mp_priv)
>> +        return;
>> +
>> +    get_online_mems();
>> +    spin_lock_bh(&pool->destroy_lock);
>> +
>> +    inflight = page_pool_inflight(pool, false);
>> +    for_each_populated_zone(zone) {
>> +        unsigned long end_pfn = zone_end_pfn(zone);
>> +        unsigned long pfn;
>> +
>> +        for (pfn = zone->zone_start_pfn; pfn < end_pfn; pfn++) {
>> +            struct page *page = pfn_to_online_page(pfn);
>> +
>> +            if (!page || !page_count(page) ||
>> +                (page->pp_magic & ~0x3UL) != PP_SIGNATURE ||
>> +                page->pp != pool)
>> +                continue;
>> +
>> +            dma_unmap_page_attrs(pool->p.dev,
>> +                         page_pool_get_dma_addr(page),
>> +                         PAGE_SIZE << pool->p.order,
>> +                         pool->p.dma_dir,
>> +                         DMA_ATTR_SKIP_CPU_SYNC |
>> +                         DMA_ATTR_WEAK_ORDERING);
>> +            page_pool_set_dma_addr(page, 0);
>> +
> 
> I feel this belongs in a helper function call.
> 
> Previously we had a function called: page_pool_release_page().

Previously I was going to reuse __page_pool_release_page_dma() for the above,
but it seems __page_pool_release_page_dma() has the duplicated pool->dma_map
checking for each page as page_pool_inflight_unmap() already check that before
calling __page_pool_release_page_dma().

> 
> This was used to convert the page into a normal page again, releasing
> page_pool dependencies.  It was used when packet transitioned into the
> netstack, but it was removed when we decided it was safe to let netstack
> return pp pages.
> 
> Removed in commits:
>  - 535b9c61bdef ("net: page_pool: hide page_pool_release_page()")
>  - 07e0c7d3179d ("net: page_pool: merge page_pool_release_page() with page_pool_return_page()")
> 
> 
>> +            unmapped++;
>> +
>> +            /* Skip scanning all pages when debug is disabled */
>> +            if (!IS_ENABLED(CONFIG_DEBUG_NET) &&
>> +                inflight == unmapped)
>> +                goto out;
>> +        }
>> +    }
>> +
>> +out:
>> +    WARN_ONCE(page_pool_inflight(pool, false) != unmapped,
>> +          "page_pool(%u): unmapped(%u) != inflight pages(%d)\n",
>> +          pool->user.id, unmapped, inflight);
>> +
>> +    pool->dma_map = false;
>> +    spin_unlock_bh(&pool->destroy_lock);
>> +    put_online_mems();
>> +}
>> +
>>   void page_pool_destroy(struct page_pool *pool)
>>   {
>>       if (!pool)
>> @@ -1139,6 +1206,8 @@ void page_pool_destroy(struct page_pool *pool)
>>        */
>>       synchronize_rcu();
>>   +    page_pool_inflight_unmap(pool);
>> +
> 
> Reaching here means we have detected in-flight packets/pages.
> 
> In "page_pool_inflight_unmap" we scan and find those in-flight pages to
> DMA unmap them. Then below we wait for these in-flight pages again.
> Why don't we just "release" (page_pool_release_page) those in-flight
> pages from belonging to the page_pool, when we found them during scanning?
> 
> If doing so, we can hopefully remove the periodic checking code below.

I thought about that too, but it means more complicated work than just
calling the page_pool_release_page() as page->pp_ref_count need to be
converted into page->_refcount for the above to work, it seems hard to
do that with least performance degradation as the racing against
page_pool_put_page() being called concurrently.

> 
>>       page_pool_detached(pool);
>>       pool->defer_start = jiffies;
>>       pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
>> @@ -1159,7 +1228,7 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
>>       /* Flush pool alloc cache, as refill will check NUMA node */
>>       while (pool->alloc.count) {
>>           netmem = pool->alloc.cache[--pool->alloc.count];
>> -        page_pool_return_page(pool, netmem);
>> +        __page_pool_return_page(pool, netmem);
>>       }
>>   }
>>   EXPORT_SYMBOL(page_pool_update_nid);
> 
> Thanks for continuing to work on this :-)

I am not sure how scalable the scanning is going to be if the memory size became
bigger, which is one of the reason I was posting it as RFC for this version.

For some quick searching here, it seems there might be server with max ram capacity
of 12.3TB, which means the scanning might take up to about 10 secs for those systems.
The spin_lock is used to avoid concurrency as the page_pool_put_page() API might be
called from the softirq context, which might mean there might be spinning of 12 secs
in the softirq context.

And it seems hard to call cond_resched() when the scanning and unmapping takes a lot
of time as page_pool_put_page() might be called concurrently when pool->destroy_lock
is released, which might means page_pool_get_dma_addr() need to be checked to decide
if the mapping is already done or not for each page.

Also, I am not sure it is appropriate to stall the driver unbound up to 10 secs here
for those large memory systems.

https://www.broadberry.com/12tb-ram-supermicro-servers?srsltid=AfmBOorCPCZQBSv91mOGH3WTg9Cq0MhksnVYL_eXxOHtHJyuYzjyvwgH

> 
> --Jesper

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v4 3/3] page_pool: skip dma sync operation for inflight pages
  2024-11-20 16:17   ` Robin Murphy
@ 2024-11-21  8:04     ` Yunsheng Lin
  2024-11-21 13:44       ` Robin Murphy
  0 siblings, 1 reply; 28+ messages in thread
From: Yunsheng Lin @ 2024-11-21  8:04 UTC (permalink / raw)
  To: Robin Murphy, davem, kuba, pabeni
  Cc: liuyonglong, fanghaiqing, zhangkun09, Alexander Duyck,
	Andrew Morton, IOMMU, MM, Jesper Dangaard Brouer,
	Ilias Apalodimas, Eric Dumazet, Simon Horman, netdev,
	linux-kernel

On 2024/11/21 0:17, Robin Murphy wrote:
> On 20/11/2024 10:34 am, Yunsheng Lin wrote:
>> Skip dma sync operation for inflight pages before the
>> page_pool_destroy() returns to the driver as DMA API
>> expects to be called with a valid device bound to a
>> driver as mentioned in [1].
>>
>> After page_pool_destroy() is called, the page is not
>> expected to be recycled back to pool->alloc cache and
>> dma sync operation is not needed when the page is not
>> recyclable or pool->ring is full, so only skip the dma
>> sync operation for the infilght pages by clearing the
>> pool->dma_sync under protection of rcu lock when page
>> is recycled to pool->ring to ensure that there is no
>> dma sync operation called after page_pool_destroy() is
>> returned.
> 
> Something feels off here - either this is a micro-optimisation which I wouldn't really expect to be meaningful, or it means patch #2 doesn't actually do what it claims. If it really is possible to attempt to dma_sync a page *after* page_pool_inflight_unmap() has already reclaimed and unmapped it, that represents yet another DMA API lifecycle issue, which as well as being even more obviously incorrect usage-wise, could also still lead to the same crash (if the device is non-coherent).

For a page_pool owned page, it mostly goes through the below steps:
1. page_pool calls buddy allocator API to allocate a page, call DMA mapping
   and sync_for_device API for it if its pool is empty. Or reuse the page in
   pool.

2. Driver calls the page_pool API to allocate the page, and pass the page
   to network stack after packet is dma'ed into the page and the sync_for_cpu
   API is called.

3. Network stack is done with page and called page_pool API to free the page.

4. page_pool releases the page back to buddy allocator if the page is not
   recyclable before doing the dma unmaping. Or do the sync_for_device
   and put the page in the its pool, the page might go through step 1
   again if the driver calls the page_pool allocate API.

The calling of dma mapping and dma sync API is controlled by pool->dma_map
and pool->dma_sync respectively, the previous patch only clear pool->dma_map
after doing the dma unmapping. This patch ensures that there is no dma_sync
for recycle case of step 4 by clearing pool->dma_sync.

The dma_sync skipping should also happen before page_pool_inflight_unmap()
is called too because all the caller will see the clearing of pool->dma_sync
after synchronize_rcu() and page_pool_inflight_unmap() is called after
the same synchronize_rcu() in page_pool_destroy().

> 
> Otherwise, I don't imagine it's really worth worrying about optimising out syncs for any pages which happen to get naturally returned after page_pool_destroy() starts but before they're explicitly reclaimed. Realistically, the kinds of big server systems where reclaim takes an appreciable amount of time are going to be coherent and skipping syncs anyway.

The skipping is about skipping the dma sync for those inflight pages,
I should make it clearer that the skipping happens before the calling
of page_pool_inflight_unmap() instead of page_pool_destroy() in the
commit log.

> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v4 3/3] page_pool: skip dma sync operation for inflight pages
  2024-11-21  8:04     ` Yunsheng Lin
@ 2024-11-21 13:44       ` Robin Murphy
  2024-11-22  7:20         ` Yunsheng Lin
  0 siblings, 1 reply; 28+ messages in thread
From: Robin Murphy @ 2024-11-21 13:44 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba, pabeni
  Cc: liuyonglong, fanghaiqing, zhangkun09, Alexander Duyck,
	Andrew Morton, IOMMU, MM, Jesper Dangaard Brouer,
	Ilias Apalodimas, Eric Dumazet, Simon Horman, netdev,
	linux-kernel

On 21/11/2024 8:04 am, Yunsheng Lin wrote:
> On 2024/11/21 0:17, Robin Murphy wrote:
>> On 20/11/2024 10:34 am, Yunsheng Lin wrote:
>>> Skip dma sync operation for inflight pages before the
>>> page_pool_destroy() returns to the driver as DMA API
>>> expects to be called with a valid device bound to a
>>> driver as mentioned in [1].
>>>
>>> After page_pool_destroy() is called, the page is not
>>> expected to be recycled back to pool->alloc cache and
>>> dma sync operation is not needed when the page is not
>>> recyclable or pool->ring is full, so only skip the dma
>>> sync operation for the infilght pages by clearing the
>>> pool->dma_sync under protection of rcu lock when page
>>> is recycled to pool->ring to ensure that there is no
>>> dma sync operation called after page_pool_destroy() is
>>> returned.
>>
>> Something feels off here - either this is a micro-optimisation which I wouldn't really expect to be meaningful, or it means patch #2 doesn't actually do what it claims. If it really is possible to attempt to dma_sync a page *after* page_pool_inflight_unmap() has already reclaimed and unmapped it, that represents yet another DMA API lifecycle issue, which as well as being even more obviously incorrect usage-wise, could also still lead to the same crash (if the device is non-coherent).
> 
> For a page_pool owned page, it mostly goes through the below steps:
> 1. page_pool calls buddy allocator API to allocate a page, call DMA mapping
>     and sync_for_device API for it if its pool is empty. Or reuse the page in
>     pool.
> 
> 2. Driver calls the page_pool API to allocate the page, and pass the page
>     to network stack after packet is dma'ed into the page and the sync_for_cpu
>     API is called.
> 
> 3. Network stack is done with page and called page_pool API to free the page.
> 
> 4. page_pool releases the page back to buddy allocator if the page is not
>     recyclable before doing the dma unmaping. Or do the sync_for_device
>     and put the page in the its pool, the page might go through step 1
>     again if the driver calls the page_pool allocate API.
> 
> The calling of dma mapping and dma sync API is controlled by pool->dma_map
> and pool->dma_sync respectively, the previous patch only clear pool->dma_map
> after doing the dma unmapping. This patch ensures that there is no dma_sync
> for recycle case of step 4 by clearing pool->dma_sync.

But *why* does it want to ensure that? Is there some possible race where 
one thread can attempt to sync and recycle a page while another thread 
is attempting to unmap and free it, such that you can't guarantee the 
correctness of dma_sync calls after page_pool_inflight_unmap() has 
started, and skipping them is a workaround for that? If so, then frankly 
I think that would want solving properly, but at the very least this 
change would need to come before patch #2.

If not, and this is just some attempt at performance micro-optimisation, 
then I'd be keen to see the numbers to justify it, since I struggle to 
imagine it being worth the bother while already in the process of 
spending whole seconds scanning memory...

Thanks,
Robin.

> The dma_sync skipping should also happen before page_pool_inflight_unmap()
> is called too because all the caller will see the clearing of pool->dma_sync
> after synchronize_rcu() and page_pool_inflight_unmap() is called after
> the same synchronize_rcu() in page_pool_destroy().
> 
>>
>> Otherwise, I don't imagine it's really worth worrying about optimising out syncs for any pages which happen to get naturally returned after page_pool_destroy() starts but before they're explicitly reclaimed. Realistically, the kinds of big server systems where reclaim takes an appreciable amount of time are going to be coherent and skipping syncs anyway.
> 
> The skipping is about skipping the dma sync for those inflight pages,
> I should make it clearer that the skipping happens before the calling
> of page_pool_inflight_unmap() instead of page_pool_destroy() in the
> commit log.
> 
>>
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v4 3/3] page_pool: skip dma sync operation for inflight pages
  2024-11-21 13:44       ` Robin Murphy
@ 2024-11-22  7:20         ` Yunsheng Lin
  0 siblings, 0 replies; 28+ messages in thread
From: Yunsheng Lin @ 2024-11-22  7:20 UTC (permalink / raw)
  To: Robin Murphy, davem, kuba, pabeni
  Cc: liuyonglong, fanghaiqing, zhangkun09, Alexander Duyck,
	Andrew Morton, IOMMU, MM, Jesper Dangaard Brouer,
	Ilias Apalodimas, Eric Dumazet, Simon Horman, netdev,
	linux-kernel

On 2024/11/21 21:44, Robin Murphy wrote:
> On 21/11/2024 8:04 am, Yunsheng Lin wrote:
>> On 2024/11/21 0:17, Robin Murphy wrote:
>>> On 20/11/2024 10:34 am, Yunsheng Lin wrote:
>>>> Skip dma sync operation for inflight pages before the
>>>> page_pool_destroy() returns to the driver as DMA API
>>>> expects to be called with a valid device bound to a
>>>> driver as mentioned in [1].
>>>>
>>>> After page_pool_destroy() is called, the page is not
>>>> expected to be recycled back to pool->alloc cache and
>>>> dma sync operation is not needed when the page is not
>>>> recyclable or pool->ring is full, so only skip the dma
>>>> sync operation for the infilght pages by clearing the
>>>> pool->dma_sync under protection of rcu lock when page
>>>> is recycled to pool->ring to ensure that there is no
>>>> dma sync operation called after page_pool_destroy() is
>>>> returned.
>>>
>>> Something feels off here - either this is a micro-optimisation which I wouldn't really expect to be meaningful, or it means patch #2 doesn't actually do what it claims. If it really is possible to attempt to dma_sync a page *after* page_pool_inflight_unmap() has already reclaimed and unmapped it, that represents yet another DMA API lifecycle issue, which as well as being even more obviously incorrect usage-wise, could also still lead to the same crash (if the device is non-coherent).
>>
>> For a page_pool owned page, it mostly goes through the below steps:
>> 1. page_pool calls buddy allocator API to allocate a page, call DMA mapping
>>     and sync_for_device API for it if its pool is empty. Or reuse the page in
>>     pool.
>>
>> 2. Driver calls the page_pool API to allocate the page, and pass the page
>>     to network stack after packet is dma'ed into the page and the sync_for_cpu
>>     API is called.
>>
>> 3. Network stack is done with page and called page_pool API to free the page.
>>
>> 4. page_pool releases the page back to buddy allocator if the page is not
>>     recyclable before doing the dma unmaping. Or do the sync_for_device
>>     and put the page in the its pool, the page might go through step 1
>>     again if the driver calls the page_pool allocate API.
>>
>> The calling of dma mapping and dma sync API is controlled by pool->dma_map
>> and pool->dma_sync respectively, the previous patch only clear pool->dma_map
>> after doing the dma unmapping. This patch ensures that there is no dma_sync
>> for recycle case of step 4 by clearing pool->dma_sync.
> 
> But *why* does it want to ensure that? Is there some possible race where one thread can attempt to sync and recycle a page while another thread is attempting to unmap and free it, such that you can't guarantee the correctness of dma_sync calls after page_pool_inflight_unmap() has started, and skipping them is a workaround for that? If so, then frankly I think that would want solving properly, but at the very least this change would need to come before patch #2.

The racing window is something like below. page_pool_destroy() and
page_pool_put_page() can be called concurrently, patch 2 only use
a spinlock to synchronise page_pool_inflight_unmap() with
page_pool_return_page() called by page_pool_put_page() to avoid
concurrent dma unmapping, there is no synchronization between
page_pool_destroy() and page_pool_dma_sync_for_device() called
by page_pool_put_page():
            CPU0                           CPU1
             .                               .
     page_pool_destroy()          page_pool_put_page()
             .                               .
     synchronize_rcu()                       .
             .                               .
   page_pool_inflight_unmap()                .
             .                               .
             .                    __page_pool_put_page()
             .                               .
             .               page_pool_dma_sync_for_device()
             .                               .

After this patch, page_pool_dma_sync_for_device() is protected by
rcu lock and pool->dma_sync is cleared before synchronize_rcu and
page_pool_inflight_unmap() is called after synchronize_rcu to ensure
page_pool_dma_sync_for_device() will not call dma sync API after
synchronize_rcu():

            CPU0                           CPU1
             .                               .
     page_pool_destroy() CPU       page_pool_put_page() CPU
             .                               .
    pool->dma_sync = false                   .
             .                               .
     synchronize_rcu()                       .
             .                               .
   page_pool_inflight_unmap()                .
             .                               .
             .                  page_pool_recycle_in_ring()
             .                               .
             .                        rcu_read_lock()
             .               page_pool_dma_sync_for_device()
             .                        rcu_read_unlock()

Previously patch 2&3 was combined as one patch, this version splits
it out to make it more reviewable.
I am not sure if it matters that much about the patch order as the
fix doesn't seem to be completed unless both patches are included.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has already unbound
  2024-11-21  8:03     ` Yunsheng Lin
@ 2024-11-25 15:25       ` Jesper Dangaard Brouer
  2024-11-26  8:22         ` Yunsheng Lin
  2024-11-26 21:51       ` Mina Almasry
  1 sibling, 1 reply; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2024-11-25 15:25 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba, pabeni
  Cc: liuyonglong, fanghaiqing, zhangkun09, Robin Murphy,
	Alexander Duyck, IOMMU, Ilias Apalodimas, Eric Dumazet,
	Simon Horman, netdev, linux-kernel



On 21/11/2024 09.03, Yunsheng Lin wrote:
> On 2024/11/20 23:10, Jesper Dangaard Brouer wrote:
> 
> ...
> 
>>>          /* No more consumers should exist, but producers could still
>>>         * be in-flight.
>>> @@ -1119,6 +1134,58 @@ void page_pool_disable_direct_recycling(struct page_pool *pool)
>>>    }
>>>    EXPORT_SYMBOL(page_pool_disable_direct_recycling);
>>>    +static void page_pool_inflight_unmap(struct page_pool *pool)
>>> +{
>>> +    unsigned int unmapped = 0;
>>> +    struct zone *zone;
>>> +    int inflight;
>>> +
>>> +    if (!pool->dma_map || pool->mp_priv)
>>> +        return;
>>> +
>>> +    get_online_mems();
>>> +    spin_lock_bh(&pool->destroy_lock);
>>> +
>>> +    inflight = page_pool_inflight(pool, false);
>>> +    for_each_populated_zone(zone) {
>>> +        unsigned long end_pfn = zone_end_pfn(zone);
>>> +        unsigned long pfn;
>>> +
>>> +        for (pfn = zone->zone_start_pfn; pfn < end_pfn; pfn++) {
>>> +            struct page *page = pfn_to_online_page(pfn);
>>> +
>>> +            if (!page || !page_count(page) ||
>>> +                (page->pp_magic & ~0x3UL) != PP_SIGNATURE ||
>>> +                page->pp != pool)
>>> +                continue;
>>> +
>>> +            dma_unmap_page_attrs(pool->p.dev,
>>> +                         page_pool_get_dma_addr(page),
>>> +                         PAGE_SIZE << pool->p.order,
>>> +                         pool->p.dma_dir,
>>> +                         DMA_ATTR_SKIP_CPU_SYNC |
>>> +                         DMA_ATTR_WEAK_ORDERING);
>>> +            page_pool_set_dma_addr(page, 0);
>>> +
>>
>> I feel this belongs in a helper function call.
>>
>> Previously we had a function called: page_pool_release_page().
> 
> Previously I was going to reuse __page_pool_release_page_dma() for the above,
> but it seems __page_pool_release_page_dma() has the duplicated pool->dma_map
> checking for each page as page_pool_inflight_unmap() already check that before
> calling __page_pool_release_page_dma().
> 
>>
>> This was used to convert the page into a normal page again, releasing
>> page_pool dependencies.  It was used when packet transitioned into the
>> netstack, but it was removed when we decided it was safe to let netstack
>> return pp pages.
>>
>> Removed in commits:
>>   - 535b9c61bdef ("net: page_pool: hide page_pool_release_page()")
>>   - 07e0c7d3179d ("net: page_pool: merge page_pool_release_page() with page_pool_return_page()")
>>
>>
>>> +            unmapped++;
>>> +
>>> +            /* Skip scanning all pages when debug is disabled */
>>> +            if (!IS_ENABLED(CONFIG_DEBUG_NET) &&
>>> +                inflight == unmapped)
>>> +                goto out;
>>> +        }
>>> +    }
>>> +
>>> +out:
>>> +    WARN_ONCE(page_pool_inflight(pool, false) != unmapped,
>>> +          "page_pool(%u): unmapped(%u) != inflight pages(%d)\n",
>>> +          pool->user.id, unmapped, inflight);
>>> +
>>> +    pool->dma_map = false;
>>> +    spin_unlock_bh(&pool->destroy_lock);
>>> +    put_online_mems();
>>> +}
>>> +
>>>    void page_pool_destroy(struct page_pool *pool)
>>>    {
>>>        if (!pool)
>>> @@ -1139,6 +1206,8 @@ void page_pool_destroy(struct page_pool *pool)
>>>         */
>>>        synchronize_rcu();
>>>    +    page_pool_inflight_unmap(pool);
>>> +
>>
>> Reaching here means we have detected in-flight packets/pages.
>>
>> In "page_pool_inflight_unmap" we scan and find those in-flight pages to
>> DMA unmap them. Then below we wait for these in-flight pages again.
>> Why don't we just "release" (page_pool_release_page) those in-flight
>> pages from belonging to the page_pool, when we found them during scanning?
>>
>> If doing so, we can hopefully remove the periodic checking code below.
> 
> I thought about that too, but it means more complicated work than just
> calling the page_pool_release_page() as page->pp_ref_count need to be
> converted into page->_refcount for the above to work, it seems hard to
> do that with least performance degradation as the racing against
> page_pool_put_page() being called concurrently.
> 

Maybe we can have a design that avoid/reduce concurrency.  Can we
convert the suggested pool->destroy_lock into an atomic?
(Doing an *atomic* READ in page_pool_return_page, should be fast if we
keep this cache in in (cache coherence) Shared state).

In your new/proposed page_pool_return_page() when we see the
"destroy_cnt" (now atomic READ) bigger than zero, then we can do nothing
(or maybe we need decrement page-refcnt?), as we know the destroy code
will be taking care of "releasing" the pages from the page pool.

Once the a page is release from a page pool it becomes a normal page,
that adhere to normal page refcnt'ing. That is how it worked before with
page_pool_release_page().
The later extensions with page fragment support and devmem might have
complicated this code path.


>>
>>>        page_pool_detached(pool);
>>>        pool->defer_start = jiffies;
>>>        pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
>>> @@ -1159,7 +1228,7 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
>>>        /* Flush pool alloc cache, as refill will check NUMA node */
>>>        while (pool->alloc.count) {
>>>            netmem = pool->alloc.cache[--pool->alloc.count];
>>> -        page_pool_return_page(pool, netmem);
>>> +        __page_pool_return_page(pool, netmem);
>>>        }
>>>    }
>>>    EXPORT_SYMBOL(page_pool_update_nid);
>>
>> Thanks for continuing to work on this :-)
> 
> I am not sure how scalable the scanning is going to be if the memory size became
> bigger, which is one of the reason I was posting it as RFC for this version.
> 
> For some quick searching here, it seems there might be server with max ram capacity
> of 12.3TB, which means the scanning might take up to about 10 secs for those systems.
> The spin_lock is used to avoid concurrency as the page_pool_put_page() API might be
> called from the softirq context, which might mean there might be spinning of 12 secs
> in the softirq context.
> 
> And it seems hard to call cond_resched() when the scanning and unmapping takes a lot
> of time as page_pool_put_page() might be called concurrently when pool->destroy_lock
> is released, which might means page_pool_get_dma_addr() need to be checked to decide
> if the mapping is already done or not for each page.
> 

As explained above, once the "destroy" phase have started, I hope we can
avoid concurrently returned pages to wait on a lock.   As IMHO it will
be problematic to stall the page_pool_return_page() call for this long.
The take down "destroy" of a page pool is always initiated from
userspace, so is possible to call cond_resched() from this context.

--Jesper

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has already unbound
  2024-11-25 15:25       ` Jesper Dangaard Brouer
@ 2024-11-26  8:22         ` Yunsheng Lin
  2024-11-26 10:22           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 28+ messages in thread
From: Yunsheng Lin @ 2024-11-26  8:22 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, davem, kuba, pabeni
  Cc: liuyonglong, fanghaiqing, zhangkun09, Robin Murphy,
	Alexander Duyck, IOMMU, Ilias Apalodimas, Eric Dumazet,
	Simon Horman, netdev, linux-kernel

On 2024/11/25 23:25, Jesper Dangaard Brouer wrote:

...

>>>> +
>>>>    void page_pool_destroy(struct page_pool *pool)
>>>>    {
>>>>        if (!pool)
>>>> @@ -1139,6 +1206,8 @@ void page_pool_destroy(struct page_pool *pool)
>>>>         */
>>>>        synchronize_rcu();
>>>>    +    page_pool_inflight_unmap(pool);
>>>> +
>>>
>>> Reaching here means we have detected in-flight packets/pages.
>>>
>>> In "page_pool_inflight_unmap" we scan and find those in-flight pages to
>>> DMA unmap them. Then below we wait for these in-flight pages again.
>>> Why don't we just "release" (page_pool_release_page) those in-flight
>>> pages from belonging to the page_pool, when we found them during scanning?
>>>
>>> If doing so, we can hopefully remove the periodic checking code below.
>>
>> I thought about that too, but it means more complicated work than just
>> calling the page_pool_release_page() as page->pp_ref_count need to be
>> converted into page->_refcount for the above to work, it seems hard to
>> do that with least performance degradation as the racing against
>> page_pool_put_page() being called concurrently.
>>
> 
> Maybe we can have a design that avoid/reduce concurrency.  Can we
> convert the suggested pool->destroy_lock into an atomic?
> (Doing an *atomic* READ in page_pool_return_page, should be fast if we
> keep this cache in in (cache coherence) Shared state).
> 
> In your new/proposed page_pool_return_page() when we see the
> "destroy_cnt" (now atomic READ) bigger than zero, then we can do nothing
> (or maybe we need decrement page-refcnt?), as we know the destroy code

Is it valid to have a page->_refcount of zero when page_pool still own
the page if we only decrement page->_refcount and not clear page->pp_magic?
What happens if put_page() is called from other subsystem for a page_pool
owned page, isn't that mean the page might be returned to buddy page
allocator, causing use-after-free problem?

> will be taking care of "releasing" the pages from the page pool.

If page->_refcount is not decremented in page_pool_return_page(), how
does page_pool_destroy() know if a specific page have been called with
page_pool_return_page()? Does an extra state is needed to indicate that?

And there might still be concurrency between checking/handling of the extra
state in page_pool_destroy() and the setting of extra state in
page_pool_return_page(), something like lock might still be needed to avoid
the above concurrency.

> 
> Once the a page is release from a page pool it becomes a normal page,
> that adhere to normal page refcnt'ing. That is how it worked before with
> page_pool_release_page().
> The later extensions with page fragment support and devmem might have
> complicated this code path.

As page_pool_return_page() and page_pool_destroy() both try to "release"
the page concurrently for a specific page, I am not sure how using some
simple *atomic* can avoid this kind of concurrency even before page
fragment and devmem are supported, it would be good to be more specific
about that by using some pseudocode.

I looked at it more closely, previously page_pool_put_page() seemed to
not be allowed to be called after page_pool_release_page() had been
called for a specific page mainly because of concurrently checking/handlig
and clearing of page->pp_magic if I understand it correctly:
https://elixir.bootlin.com/linux/v5.16.20/source/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c#L5316

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has already unbound
  2024-11-26  8:22         ` Yunsheng Lin
@ 2024-11-26 10:22           ` Jesper Dangaard Brouer
  2024-11-26 11:46             ` Yunsheng Lin
  0 siblings, 1 reply; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2024-11-26 10:22 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba, pabeni
  Cc: liuyonglong, fanghaiqing, zhangkun09, Robin Murphy,
	Alexander Duyck, IOMMU, Ilias Apalodimas, Eric Dumazet,
	Simon Horman, netdev, linux-kernel



On 26/11/2024 09.22, Yunsheng Lin wrote:
> On 2024/11/25 23:25, Jesper Dangaard Brouer wrote:
> 
> ...
> 
>>>>> +
>>>>>     void page_pool_destroy(struct page_pool *pool)
>>>>>     {
>>>>>         if (!pool)
>>>>> @@ -1139,6 +1206,8 @@ void page_pool_destroy(struct page_pool *pool)
>>>>>          */
>>>>>         synchronize_rcu();
>>>>>     +    page_pool_inflight_unmap(pool);
>>>>> +
>>>>
>>>> Reaching here means we have detected in-flight packets/pages.
>>>>
>>>> In "page_pool_inflight_unmap" we scan and find those in-flight pages to
>>>> DMA unmap them. Then below we wait for these in-flight pages again.
>>>> Why don't we just "release" (page_pool_release_page) those in-flight
>>>> pages from belonging to the page_pool, when we found them during scanning?
>>>>
>>>> If doing so, we can hopefully remove the periodic checking code below.
>>>
>>> I thought about that too, but it means more complicated work than just
>>> calling the page_pool_release_page() as page->pp_ref_count need to be
>>> converted into page->_refcount for the above to work, it seems hard to
>>> do that with least performance degradation as the racing against
>>> page_pool_put_page() being called concurrently.
>>>
>>
>> Maybe we can have a design that avoid/reduce concurrency.  Can we
>> convert the suggested pool->destroy_lock into an atomic?
>> (Doing an *atomic* READ in page_pool_return_page, should be fast if we
>> keep this cache in in (cache coherence) Shared state).
>>
>> In your new/proposed page_pool_return_page() when we see the
>> "destroy_cnt" (now atomic READ) bigger than zero, then we can do nothing
>> (or maybe we need decrement page-refcnt?), as we know the destroy code
> 
> Is it valid to have a page->_refcount of zero when page_pool still own
> the page if we only decrement page->_refcount and not clear page->pp_magic?

No, page_pool keeps page->_refcount equal 1 (for "release") and also 
clears page->pp_magic.

> What happens if put_page() is called from other subsystem for a page_pool
> owned page, isn't that mean the page might be returned to buddy page
> allocator, causing use-after-free problem?
> 

Notice that page_pool_release_page() didn't decrement page refcnt.
It disconnects a page (from a page_pool). To allow it to be used as
a regular page (that will eventually be returned to the normal
page-allocator via put_page).


>> will be taking care of "releasing" the pages from the page pool.
> 
> If page->_refcount is not decremented in page_pool_return_page(), how
> does page_pool_destroy() know if a specific page have been called with
> page_pool_return_page()? Does an extra state is needed to indicate that?
> 

Good point. In page_pool_return_page(), we will need to handle the two 
cases. (1) page still belongs to a page_pool, (2) page have been 
released to look like a normal page. For (2) we do need to call 
put_page().  For (1) yes we would either need some extra state, such 
that page_pool_destroy() knows, or a lock like this patchset.


> And there might still be concurrency between checking/handling of the extra
> state in page_pool_destroy() and the setting of extra state in
> page_pool_return_page(), something like lock might still be needed to avoid
> the above concurrency.
> 

I agree, we (likely) cannot avoid this lock.

>>
>> Once the a page is release from a page pool it becomes a normal page,
>> that adhere to normal page refcnt'ing. That is how it worked before with
>> page_pool_release_page().
>> The later extensions with page fragment support and devmem might have
>> complicated this code path.
> 
> As page_pool_return_page() and page_pool_destroy() both try to "release"
> the page concurrently for a specific page, I am not sure how using some
> simple *atomic* can avoid this kind of concurrency even before page
> fragment and devmem are supported, it would be good to be more specific
> about that by using some pseudocode.
> 

Okay, some my simple atomic idea will not work.

NEW IDEA:

So, the my concern in this patchset is that BH-disabling spin_lock 
pool->destroy_lock is held in the outer loop of 
page_pool_inflight_unmap() that scans all pages.  Disabling BH for this 
long have nasty side-effects.

Will it be enough to grab the pool->destroy_lock only when we detect a 
page that belongs to our page pool?  Of-cause after obtaining the lock. 
the code need to recheck if the page still belongs to the pool.


> I looked at it more closely, previously page_pool_put_page() seemed to
> not be allowed to be called after page_pool_release_page() had been
> called for a specific page mainly because of concurrently checking/handlig
> and clearing of page->pp_magic if I understand it correctly:
> https://elixir.bootlin.com/linux/v5.16.20/source/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c#L5316

As I said above
The page_pool_release_page() didn't decrement page refcnt.
It disconnects a page (from a page_pool). To allow it to be used as
a regular page (that will eventually be returned to the normal
page-allocator via put_page).

--Jesper


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has already unbound
  2024-11-26 10:22           ` Jesper Dangaard Brouer
@ 2024-11-26 11:46             ` Yunsheng Lin
  0 siblings, 0 replies; 28+ messages in thread
From: Yunsheng Lin @ 2024-11-26 11:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, davem, kuba, pabeni
  Cc: liuyonglong, fanghaiqing, zhangkun09, Robin Murphy,
	Alexander Duyck, IOMMU, Ilias Apalodimas, Eric Dumazet,
	Simon Horman, netdev, linux-kernel

On 2024/11/26 18:22, Jesper Dangaard Brouer wrote:

...

>>>
>>> Once the a page is release from a page pool it becomes a normal page,
>>> that adhere to normal page refcnt'ing. That is how it worked before with
>>> page_pool_release_page().
>>> The later extensions with page fragment support and devmem might have
>>> complicated this code path.
>>
>> As page_pool_return_page() and page_pool_destroy() both try to "release"
>> the page concurrently for a specific page, I am not sure how using some
>> simple *atomic* can avoid this kind of concurrency even before page
>> fragment and devmem are supported, it would be good to be more specific
>> about that by using some pseudocode.
>>
> 
> Okay, some my simple atomic idea will not work.
> 
> NEW IDEA:
> 
> So, the my concern in this patchset is that BH-disabling spin_lock pool->destroy_lock is held in the outer loop of page_pool_inflight_unmap() that scans all pages.  Disabling BH for this long have nasty side-effects.
> 
> Will it be enough to grab the pool->destroy_lock only when we detect a page that belongs to our page pool?  Of-cause after obtaining the lock. the code need to recheck if the page still belongs to the pool.
> 

That means there will be page_pool_return_page() called between the scanning,
it seems like a lot like the idea of 'page_pool_get_dma_addr() need to be
checked to decide if the mapping is already done or not for each page.' as
there are two cases when page_pool_return_page() is called during scanning:
1. page_pool_get_dma_addr() returns non-zero dma address, which means the dma
   unmapping is not done by scanning yet, page_pool_return_page() need to do
   the dma unmapping before calling put_page()
2. page_pool_get_dma_addr() returns zero dma address, which means the dma
   unmapping is done by scanning, page_pool_return_page() just skip the dma
   unmapping and only call put_page().

It seems there is only one case for scanning:
1. page_pool_get_dma_addr() for a page_pool owned page returns non-zero dma
   address, which means page_pool_return_page() is not called for that page yet,
   scanning will the do the mapping for page_pool_return_page() and reset the
   dma address of the page to indicate the dma unmapping is done for that page.

It seems there is no case of page_pool owned page having zero dma address during
scanning, as both page->pp_magic is cleared and dma unmapping is already done in
page_pool_return_page().

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has already unbound
  2024-11-21  8:03     ` Yunsheng Lin
  2024-11-25 15:25       ` Jesper Dangaard Brouer
@ 2024-11-26 21:51       ` Mina Almasry
  2024-11-26 23:53         ` Alexander Duyck
  1 sibling, 1 reply; 28+ messages in thread
From: Mina Almasry @ 2024-11-26 21:51 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: Jesper Dangaard Brouer, davem, kuba, pabeni, liuyonglong,
	fanghaiqing, zhangkun09, Robin Murphy, Alexander Duyck, IOMMU,
	Ilias Apalodimas, Eric Dumazet, Simon Horman, netdev,
	linux-kernel

On Thu, Nov 21, 2024 at 12:03 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/11/20 23:10, Jesper Dangaard Brouer wrote:
> >
> >>       page_pool_detached(pool);
> >>       pool->defer_start = jiffies;
> >>       pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
> >> @@ -1159,7 +1228,7 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
> >>       /* Flush pool alloc cache, as refill will check NUMA node */
> >>       while (pool->alloc.count) {
> >>           netmem = pool->alloc.cache[--pool->alloc.count];
> >> -        page_pool_return_page(pool, netmem);
> >> +        __page_pool_return_page(pool, netmem);
> >>       }
> >>   }
> >>   EXPORT_SYMBOL(page_pool_update_nid);
> >
> > Thanks for continuing to work on this :-)
>
> I am not sure how scalable the scanning is going to be if the memory size became
> bigger, which is one of the reason I was posting it as RFC for this version.
>
> For some quick searching here, it seems there might be server with max ram capacity
> of 12.3TB, which means the scanning might take up to about 10 secs for those systems.
> The spin_lock is used to avoid concurrency as the page_pool_put_page() API might be
> called from the softirq context, which might mean there might be spinning of 12 secs
> in the softirq context.
>
> And it seems hard to call cond_resched() when the scanning and unmapping takes a lot
> of time as page_pool_put_page() might be called concurrently when pool->destroy_lock
> is released, which might means page_pool_get_dma_addr() need to be checked to decide
> if the mapping is already done or not for each page.
>
> Also, I am not sure it is appropriate to stall the driver unbound up to 10 secs here
> for those large memory systems.
>
> https://www.broadberry.com/12tb-ram-supermicro-servers?srsltid=AfmBOorCPCZQBSv91mOGH3WTg9Cq0MhksnVYL_eXxOHtHJyuYzjyvwgH
>

FWIW I'm also concerned about the looping of all memory on the system.
In addition to the performance, I think (but not sure), that
CONFIG_MEMORY_HOTPLUG may mess such a loop as memory may appear or
disappear concurrently. Even if not, the CPU cost of this may be
significant. I'm imagining the possibility of having many page_pools
allocated on the system for many hardware queues, (and maybe multiple
pp's per queue for applications like devmem TCP), and each pp looping
over the entire xTB memory on page_pool_destroy()...

My 2 cents here is that a more reasonable approach is to have the pp
track all pages it has dma-mapped, without the problems in the
previous iterations of this patch:

1. When we dma-map a page, we add it to some pp->dma_mapped data
structure (maybe xarray or rculist).
2. When we dma-unmap a page, we remove it from pp->dma_mapped.
3 When we destroy the pp, we traverse pp->dma_mapped and unmap all the
pages there.

I haven't looked deeply, but with the right data structure we may be
able to synchronize 1, 2, and 3 without any additional locks. From a
quick skim it seems maybe rculist and xarray can do this without
additional locks, maybe.

Like stated in the previous iterations of this approach, we should not
be putting any hard limit on the amount of memory the pp can allocate,
and we should not have to mess with the page->pp entry in struct page.

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has already unbound
  2024-11-26 21:51       ` Mina Almasry
@ 2024-11-26 23:53         ` Alexander Duyck
  2024-11-27  9:35           ` Yunsheng Lin
  0 siblings, 1 reply; 28+ messages in thread
From: Alexander Duyck @ 2024-11-26 23:53 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Yunsheng Lin, Jesper Dangaard Brouer, davem, kuba, pabeni,
	liuyonglong, fanghaiqing, zhangkun09, Robin Murphy, IOMMU,
	Ilias Apalodimas, Eric Dumazet, Simon Horman, netdev,
	linux-kernel

On Tue, Nov 26, 2024 at 1:51 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Thu, Nov 21, 2024 at 12:03 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >
> > On 2024/11/20 23:10, Jesper Dangaard Brouer wrote:
> > >
> > >>       page_pool_detached(pool);
> > >>       pool->defer_start = jiffies;
> > >>       pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
> > >> @@ -1159,7 +1228,7 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
> > >>       /* Flush pool alloc cache, as refill will check NUMA node */
> > >>       while (pool->alloc.count) {
> > >>           netmem = pool->alloc.cache[--pool->alloc.count];
> > >> -        page_pool_return_page(pool, netmem);
> > >> +        __page_pool_return_page(pool, netmem);
> > >>       }
> > >>   }
> > >>   EXPORT_SYMBOL(page_pool_update_nid);
> > >
> > > Thanks for continuing to work on this :-)
> >
> > I am not sure how scalable the scanning is going to be if the memory size became
> > bigger, which is one of the reason I was posting it as RFC for this version.
> >
> > For some quick searching here, it seems there might be server with max ram capacity
> > of 12.3TB, which means the scanning might take up to about 10 secs for those systems.
> > The spin_lock is used to avoid concurrency as the page_pool_put_page() API might be
> > called from the softirq context, which might mean there might be spinning of 12 secs
> > in the softirq context.
> >
> > And it seems hard to call cond_resched() when the scanning and unmapping takes a lot
> > of time as page_pool_put_page() might be called concurrently when pool->destroy_lock
> > is released, which might means page_pool_get_dma_addr() need to be checked to decide
> > if the mapping is already done or not for each page.
> >
> > Also, I am not sure it is appropriate to stall the driver unbound up to 10 secs here
> > for those large memory systems.
> >
> > https://www.broadberry.com/12tb-ram-supermicro-servers?srsltid=AfmBOorCPCZQBSv91mOGH3WTg9Cq0MhksnVYL_eXxOHtHJyuYzjyvwgH
> >
>
> FWIW I'm also concerned about the looping of all memory on the system.
> In addition to the performance, I think (but not sure), that
> CONFIG_MEMORY_HOTPLUG may mess such a loop as memory may appear or
> disappear concurrently. Even if not, the CPU cost of this may be
> significant. I'm imagining the possibility of having many page_pools
> allocated on the system for many hardware queues, (and maybe multiple
> pp's per queue for applications like devmem TCP), and each pp looping
> over the entire xTB memory on page_pool_destroy()...
>
> My 2 cents here is that a more reasonable approach is to have the pp
> track all pages it has dma-mapped, without the problems in the
> previous iterations of this patch:
>
> 1. When we dma-map a page, we add it to some pp->dma_mapped data
> structure (maybe xarray or rculist).
> 2. When we dma-unmap a page, we remove it from pp->dma_mapped.
> 3 When we destroy the pp, we traverse pp->dma_mapped and unmap all the
> pages there.

The thing is this should be a very rare event as it should apply only
when a device is removed and still has pages outstanding shouldn't it?
The problem is that maintaining a list of in-flight DMA pages will be
very costly and will make the use of page pool expensive enough that I
would worry it might be considered less than useful. Once we add too
much overhead the caching of the DMA address doesn't gain us much on
most systems in that case.

> I haven't looked deeply, but with the right data structure we may be
> able to synchronize 1, 2, and 3 without any additional locks. From a
> quick skim it seems maybe rculist and xarray can do this without
> additional locks, maybe.
>
> Like stated in the previous iterations of this approach, we should not
> be putting any hard limit on the amount of memory the pp can allocate,
> and we should not have to mess with the page->pp entry in struct page.

I agree with you on the fact that we shouldn't be setting any sort of
limit. The current approach to doing the unmapping is more-or-less the
brute force way of doing it working around the DMA api. I wonder if we
couldn't look at working with it instead and see if there wouldn't be
some way for us to reduce the overhead instead of having to do the
full scan of the page table.

One thought in that regard though would be to see if there were a way
to have the DMA API itself provide some of that info. I know the DMA
API should be storing some of that data for the mapping as we have to
go through and invalidate it if it is stored.

Another alternative would be to see if we have the option to just
invalidate the DMA side of things entirely for the device. Essentially
unregister the device from the IOMMU instead of the mappings. If that
is an option then we could look at leaving the page pool in a state
where it would essentially claim it no longer has to do the DMA unmap
operations and is just freeing the remaining lingering pages.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has already unbound
  2024-11-26 23:53         ` Alexander Duyck
@ 2024-11-27  9:35           ` Yunsheng Lin
  2024-11-27 15:31             ` Robin Murphy
  2024-11-27 19:39             ` Mina Almasry
  0 siblings, 2 replies; 28+ messages in thread
From: Yunsheng Lin @ 2024-11-27  9:35 UTC (permalink / raw)
  To: Alexander Duyck, Mina Almasry
  Cc: Jesper Dangaard Brouer, davem, kuba, pabeni, liuyonglong,
	fanghaiqing, zhangkun09, Robin Murphy, IOMMU, Ilias Apalodimas,
	Eric Dumazet, Simon Horman, netdev, linux-kernel

On 2024/11/27 7:53, Alexander Duyck wrote:
> On Tue, Nov 26, 2024 at 1:51 PM Mina Almasry <almasrymina@google.com> wrote:
>>
>> On Thu, Nov 21, 2024 at 12:03 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>>
>>> On 2024/11/20 23:10, Jesper Dangaard Brouer wrote:
>>>>
>>>>>       page_pool_detached(pool);
>>>>>       pool->defer_start = jiffies;
>>>>>       pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
>>>>> @@ -1159,7 +1228,7 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
>>>>>       /* Flush pool alloc cache, as refill will check NUMA node */
>>>>>       while (pool->alloc.count) {
>>>>>           netmem = pool->alloc.cache[--pool->alloc.count];
>>>>> -        page_pool_return_page(pool, netmem);
>>>>> +        __page_pool_return_page(pool, netmem);
>>>>>       }
>>>>>   }
>>>>>   EXPORT_SYMBOL(page_pool_update_nid);
>>>>
>>>> Thanks for continuing to work on this :-)
>>>
>>> I am not sure how scalable the scanning is going to be if the memory size became
>>> bigger, which is one of the reason I was posting it as RFC for this version.
>>>
>>> For some quick searching here, it seems there might be server with max ram capacity
>>> of 12.3TB, which means the scanning might take up to about 10 secs for those systems.
>>> The spin_lock is used to avoid concurrency as the page_pool_put_page() API might be
>>> called from the softirq context, which might mean there might be spinning of 12 secs
>>> in the softirq context.
>>>
>>> And it seems hard to call cond_resched() when the scanning and unmapping takes a lot
>>> of time as page_pool_put_page() might be called concurrently when pool->destroy_lock
>>> is released, which might means page_pool_get_dma_addr() need to be checked to decide
>>> if the mapping is already done or not for each page.
>>>
>>> Also, I am not sure it is appropriate to stall the driver unbound up to 10 secs here
>>> for those large memory systems.
>>>
>>> https://www.broadberry.com/12tb-ram-supermicro-servers?srsltid=AfmBOorCPCZQBSv91mOGH3WTg9Cq0MhksnVYL_eXxOHtHJyuYzjyvwgH
>>>
>>
>> FWIW I'm also concerned about the looping of all memory on the system.
>> In addition to the performance, I think (but not sure), that
>> CONFIG_MEMORY_HOTPLUG may mess such a loop as memory may appear or
>> disappear concurrently. Even if not, the CPU cost of this may be
>> significant. I'm imagining the possibility of having many page_pools
>> allocated on the system for many hardware queues, (and maybe multiple
>> pp's per queue for applications like devmem TCP), and each pp looping
>> over the entire xTB memory on page_pool_destroy()...
>>
>> My 2 cents here is that a more reasonable approach is to have the pp
>> track all pages it has dma-mapped, without the problems in the
>> previous iterations of this patch:
>>
>> 1. When we dma-map a page, we add it to some pp->dma_mapped data
>> structure (maybe xarray or rculist).
>> 2. When we dma-unmap a page, we remove it from pp->dma_mapped.
>> 3 When we destroy the pp, we traverse pp->dma_mapped and unmap all the
>> pages there.
> 
> The thing is this should be a very rare event as it should apply only
> when a device is removed and still has pages outstanding shouldn't it?
> The problem is that maintaining a list of in-flight DMA pages will be
> very costly and will make the use of page pool expensive enough that I
> would worry it might be considered less than useful. Once we add too
> much overhead the caching of the DMA address doesn't gain us much on
> most systems in that case.
> 
>> I haven't looked deeply, but with the right data structure we may be
>> able to synchronize 1, 2, and 3 without any additional locks. From a
>> quick skim it seems maybe rculist and xarray can do this without
>> additional locks, maybe.

I am not sure how the above right data structure without any additional
locks will work, but my feeling is that the issues mentioned in [1] will
likely apply to the above right data structure too.

1. https://lore.kernel.org/all/6233e2c3-3fea-4ed0-bdcc-9a625270da37@huawei.com/

>>
>> Like stated in the previous iterations of this approach, we should not
>> be putting any hard limit on the amount of memory the pp can allocate,
>> and we should not have to mess with the page->pp entry in struct page.

It would be good to be more specific about how it is done without 'messing'
with the page->pp entry in struct page using some pseudocode or RFC if you
call the renaming as messing.

> 
> I agree with you on the fact that we shouldn't be setting any sort of
> limit. The current approach to doing the unmapping is more-or-less the
> brute force way of doing it working around the DMA api. I wonder if we
> couldn't look at working with it instead and see if there wouldn't be
> some way for us to reduce the overhead instead of having to do the
> full scan of the page table.
> 
> One thought in that regard though would be to see if there were a way
> to have the DMA API itself provide some of that info. I know the DMA
> API should be storing some of that data for the mapping as we have to
> go through and invalidate it if it is stored.
> 
> Another alternative would be to see if we have the option to just
> invalidate the DMA side of things entirely for the device. Essentially
> unregister the device from the IOMMU instead of the mappings. If that
> is an option then we could look at leaving the page pool in a state
> where it would essentially claim it no longer has to do the DMA unmap
> operations and is just freeing the remaining lingering pages.

If we are going to 'invalidate the DMA side of things entirely for the
device', synchronization from page_pool might just go to the DMA core as
concurrent calling for dma unmapping API and 'invalidating' operation still
exist. If the invalidating is a common feature, perhaps it makes sense to
do that in the DMA core, otherwise it might just add unnecessary overhead
for other callers of DMA API.

As mentioned by Robin in [2], the DMA core seems to make a great deal of
effort to catch DMA API misuse in kernel/dma/debug.c, it seems doing the
above might invalidate some of the dma debug checking.

2. https://lore.kernel.org/all/caf31b5e-0e8f-4844-b7ba-ef59ed13b74e@arm.com/

> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has already unbound
  2024-11-27  9:35           ` Yunsheng Lin
@ 2024-11-27 15:31             ` Robin Murphy
  2024-11-27 16:27               ` Alexander Duyck
  2024-11-27 19:39             ` Mina Almasry
  1 sibling, 1 reply; 28+ messages in thread
From: Robin Murphy @ 2024-11-27 15:31 UTC (permalink / raw)
  To: Yunsheng Lin, Alexander Duyck, Mina Almasry
  Cc: Jesper Dangaard Brouer, davem, kuba, pabeni, liuyonglong,
	fanghaiqing, zhangkun09, IOMMU, Ilias Apalodimas, Eric Dumazet,
	Simon Horman, netdev, linux-kernel

On 27/11/2024 9:35 am, Yunsheng Lin wrote:
> On 2024/11/27 7:53, Alexander Duyck wrote:
>> On Tue, Nov 26, 2024 at 1:51 PM Mina Almasry <almasrymina@google.com> wrote:
>>>
>>> On Thu, Nov 21, 2024 at 12:03 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>>>
>>>> On 2024/11/20 23:10, Jesper Dangaard Brouer wrote:
>>>>>
>>>>>>        page_pool_detached(pool);
>>>>>>        pool->defer_start = jiffies;
>>>>>>        pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
>>>>>> @@ -1159,7 +1228,7 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
>>>>>>        /* Flush pool alloc cache, as refill will check NUMA node */
>>>>>>        while (pool->alloc.count) {
>>>>>>            netmem = pool->alloc.cache[--pool->alloc.count];
>>>>>> -        page_pool_return_page(pool, netmem);
>>>>>> +        __page_pool_return_page(pool, netmem);
>>>>>>        }
>>>>>>    }
>>>>>>    EXPORT_SYMBOL(page_pool_update_nid);
>>>>>
>>>>> Thanks for continuing to work on this :-)
>>>>
>>>> I am not sure how scalable the scanning is going to be if the memory size became
>>>> bigger, which is one of the reason I was posting it as RFC for this version.
>>>>
>>>> For some quick searching here, it seems there might be server with max ram capacity
>>>> of 12.3TB, which means the scanning might take up to about 10 secs for those systems.
>>>> The spin_lock is used to avoid concurrency as the page_pool_put_page() API might be
>>>> called from the softirq context, which might mean there might be spinning of 12 secs
>>>> in the softirq context.
>>>>
>>>> And it seems hard to call cond_resched() when the scanning and unmapping takes a lot
>>>> of time as page_pool_put_page() might be called concurrently when pool->destroy_lock
>>>> is released, which might means page_pool_get_dma_addr() need to be checked to decide
>>>> if the mapping is already done or not for each page.
>>>>
>>>> Also, I am not sure it is appropriate to stall the driver unbound up to 10 secs here
>>>> for those large memory systems.
>>>>
>>>> https://www.broadberry.com/12tb-ram-supermicro-servers?srsltid=AfmBOorCPCZQBSv91mOGH3WTg9Cq0MhksnVYL_eXxOHtHJyuYzjyvwgH
>>>>
>>>
>>> FWIW I'm also concerned about the looping of all memory on the system.
>>> In addition to the performance, I think (but not sure), that
>>> CONFIG_MEMORY_HOTPLUG may mess such a loop as memory may appear or
>>> disappear concurrently. Even if not, the CPU cost of this may be
>>> significant. I'm imagining the possibility of having many page_pools
>>> allocated on the system for many hardware queues, (and maybe multiple
>>> pp's per queue for applications like devmem TCP), and each pp looping
>>> over the entire xTB memory on page_pool_destroy()...
>>>
>>> My 2 cents here is that a more reasonable approach is to have the pp
>>> track all pages it has dma-mapped, without the problems in the
>>> previous iterations of this patch:
>>>
>>> 1. When we dma-map a page, we add it to some pp->dma_mapped data
>>> structure (maybe xarray or rculist).
>>> 2. When we dma-unmap a page, we remove it from pp->dma_mapped.
>>> 3 When we destroy the pp, we traverse pp->dma_mapped and unmap all the
>>> pages there.
>>
>> The thing is this should be a very rare event as it should apply only
>> when a device is removed and still has pages outstanding shouldn't it?
>> The problem is that maintaining a list of in-flight DMA pages will be
>> very costly and will make the use of page pool expensive enough that I
>> would worry it might be considered less than useful. Once we add too
>> much overhead the caching of the DMA address doesn't gain us much on
>> most systems in that case.
>>
>>> I haven't looked deeply, but with the right data structure we may be
>>> able to synchronize 1, 2, and 3 without any additional locks. From a
>>> quick skim it seems maybe rculist and xarray can do this without
>>> additional locks, maybe.
> 
> I am not sure how the above right data structure without any additional
> locks will work, but my feeling is that the issues mentioned in [1] will
> likely apply to the above right data structure too.
> 
> 1. https://lore.kernel.org/all/6233e2c3-3fea-4ed0-bdcc-9a625270da37@huawei.com/
> 
>>>
>>> Like stated in the previous iterations of this approach, we should not
>>> be putting any hard limit on the amount of memory the pp can allocate,
>>> and we should not have to mess with the page->pp entry in struct page.
> 
> It would be good to be more specific about how it is done without 'messing'
> with the page->pp entry in struct page using some pseudocode or RFC if you
> call the renaming as messing.
> 
>>
>> I agree with you on the fact that we shouldn't be setting any sort of
>> limit. The current approach to doing the unmapping is more-or-less the
>> brute force way of doing it working around the DMA api. I wonder if we
>> couldn't look at working with it instead and see if there wouldn't be
>> some way for us to reduce the overhead instead of having to do the
>> full scan of the page table.
>>
>> One thought in that regard though would be to see if there were a way
>> to have the DMA API itself provide some of that info. I know the DMA
>> API should be storing some of that data for the mapping as we have to
>> go through and invalidate it if it is stored.
>>
>> Another alternative would be to see if we have the option to just
>> invalidate the DMA side of things entirely for the device. Essentially
>> unregister the device from the IOMMU instead of the mappings. If that
>> is an option then we could look at leaving the page pool in a state
>> where it would essentially claim it no longer has to do the DMA unmap
>> operations and is just freeing the remaining lingering pages.
> 
> If we are going to 'invalidate the DMA side of things entirely for the
> device', synchronization from page_pool might just go to the DMA core as
> concurrent calling for dma unmapping API and 'invalidating' operation still
> exist. If the invalidating is a common feature, perhaps it makes sense to
> do that in the DMA core, otherwise it might just add unnecessary overhead
> for other callers of DMA API.
> 
> As mentioned by Robin in [2], the DMA core seems to make a great deal of
> effort to catch DMA API misuse in kernel/dma/debug.c, it seems doing the
> above might invalidate some of the dma debug checking.

Has nobody paused to consider *why* dma-debug is an optional feature 
with an explicit performance warning? If you're concerned about the 
impact of keeping track of DMA mappings within the confines of the 
page_pool usage model, do you really imagine it could somehow be cheaper 
to keep track of them at the generic DMA API level without the benefit 
of any assumptions at all?

Yes, in principle we could add a set of "robust" DMA APIs which make 
sure racy sync calls are safe and offer a "clean up all my outstanding 
mappings" op, and they would offer approximately as terrible performance 
as the current streaming APIs with dma-debug enabled, because it would 
be little different from what dma-debug already does. The checks 
themselves aren't complicated; the generally-prohibitive cost lies in 
keeping track of mappings and allocations so that they *can* be checked 
internally.

Whatever you think is hard to do in the page_pool code to fix that 
code's own behaviour, it's even harder to do from elsewhere with less 
information.

Thanks,
Robin.

> 
> 2. https://lore.kernel.org/all/caf31b5e-0e8f-4844-b7ba-ef59ed13b74e@arm.com/
> 
>>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has already unbound
  2024-11-27 15:31             ` Robin Murphy
@ 2024-11-27 16:27               ` Alexander Duyck
  2024-12-04 11:16                 ` Yunsheng Lin
  0 siblings, 1 reply; 28+ messages in thread
From: Alexander Duyck @ 2024-11-27 16:27 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Yunsheng Lin, Mina Almasry, Jesper Dangaard Brouer, davem, kuba,
	pabeni, liuyonglong, fanghaiqing, zhangkun09, IOMMU,
	Ilias Apalodimas, Eric Dumazet, Simon Horman, netdev,
	linux-kernel

On Wed, Nov 27, 2024 at 7:31 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 27/11/2024 9:35 am, Yunsheng Lin wrote:
> > On 2024/11/27 7:53, Alexander Duyck wrote:
> >> On Tue, Nov 26, 2024 at 1:51 PM Mina Almasry <almasrymina@google.com> wrote:
> >>>
> >>> On Thu, Nov 21, 2024 at 12:03 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>>>
> >>>> On 2024/11/20 23:10, Jesper Dangaard Brouer wrote:
> >>>>>
> >>>>>>        page_pool_detached(pool);
> >>>>>>        pool->defer_start = jiffies;
> >>>>>>        pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
> >>>>>> @@ -1159,7 +1228,7 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
> >>>>>>        /* Flush pool alloc cache, as refill will check NUMA node */
> >>>>>>        while (pool->alloc.count) {
> >>>>>>            netmem = pool->alloc.cache[--pool->alloc.count];
> >>>>>> -        page_pool_return_page(pool, netmem);
> >>>>>> +        __page_pool_return_page(pool, netmem);
> >>>>>>        }
> >>>>>>    }
> >>>>>>    EXPORT_SYMBOL(page_pool_update_nid);
> >>>>>
> >>>>> Thanks for continuing to work on this :-)
> >>>>
> >>>> I am not sure how scalable the scanning is going to be if the memory size became
> >>>> bigger, which is one of the reason I was posting it as RFC for this version.
> >>>>
> >>>> For some quick searching here, it seems there might be server with max ram capacity
> >>>> of 12.3TB, which means the scanning might take up to about 10 secs for those systems.
> >>>> The spin_lock is used to avoid concurrency as the page_pool_put_page() API might be
> >>>> called from the softirq context, which might mean there might be spinning of 12 secs
> >>>> in the softirq context.
> >>>>
> >>>> And it seems hard to call cond_resched() when the scanning and unmapping takes a lot
> >>>> of time as page_pool_put_page() might be called concurrently when pool->destroy_lock
> >>>> is released, which might means page_pool_get_dma_addr() need to be checked to decide
> >>>> if the mapping is already done or not for each page.
> >>>>
> >>>> Also, I am not sure it is appropriate to stall the driver unbound up to 10 secs here
> >>>> for those large memory systems.
> >>>>
> >>>> https://www.broadberry.com/12tb-ram-supermicro-servers?srsltid=AfmBOorCPCZQBSv91mOGH3WTg9Cq0MhksnVYL_eXxOHtHJyuYzjyvwgH
> >>>>
> >>>
> >>> FWIW I'm also concerned about the looping of all memory on the system.
> >>> In addition to the performance, I think (but not sure), that
> >>> CONFIG_MEMORY_HOTPLUG may mess such a loop as memory may appear or
> >>> disappear concurrently. Even if not, the CPU cost of this may be
> >>> significant. I'm imagining the possibility of having many page_pools
> >>> allocated on the system for many hardware queues, (and maybe multiple
> >>> pp's per queue for applications like devmem TCP), and each pp looping
> >>> over the entire xTB memory on page_pool_destroy()...
> >>>
> >>> My 2 cents here is that a more reasonable approach is to have the pp
> >>> track all pages it has dma-mapped, without the problems in the
> >>> previous iterations of this patch:
> >>>
> >>> 1. When we dma-map a page, we add it to some pp->dma_mapped data
> >>> structure (maybe xarray or rculist).
> >>> 2. When we dma-unmap a page, we remove it from pp->dma_mapped.
> >>> 3 When we destroy the pp, we traverse pp->dma_mapped and unmap all the
> >>> pages there.
> >>
> >> The thing is this should be a very rare event as it should apply only
> >> when a device is removed and still has pages outstanding shouldn't it?
> >> The problem is that maintaining a list of in-flight DMA pages will be
> >> very costly and will make the use of page pool expensive enough that I
> >> would worry it might be considered less than useful. Once we add too
> >> much overhead the caching of the DMA address doesn't gain us much on
> >> most systems in that case.
> >>
> >>> I haven't looked deeply, but with the right data structure we may be
> >>> able to synchronize 1, 2, and 3 without any additional locks. From a
> >>> quick skim it seems maybe rculist and xarray can do this without
> >>> additional locks, maybe.
> >
> > I am not sure how the above right data structure without any additional
> > locks will work, but my feeling is that the issues mentioned in [1] will
> > likely apply to the above right data structure too.
> >
> > 1. https://lore.kernel.org/all/6233e2c3-3fea-4ed0-bdcc-9a625270da37@huawei.com/
> >
> >>>
> >>> Like stated in the previous iterations of this approach, we should not
> >>> be putting any hard limit on the amount of memory the pp can allocate,
> >>> and we should not have to mess with the page->pp entry in struct page.
> >
> > It would be good to be more specific about how it is done without 'messing'
> > with the page->pp entry in struct page using some pseudocode or RFC if you
> > call the renaming as messing.
> >
> >>
> >> I agree with you on the fact that we shouldn't be setting any sort of
> >> limit. The current approach to doing the unmapping is more-or-less the
> >> brute force way of doing it working around the DMA api. I wonder if we
> >> couldn't look at working with it instead and see if there wouldn't be
> >> some way for us to reduce the overhead instead of having to do the
> >> full scan of the page table.
> >>
> >> One thought in that regard though would be to see if there were a way
> >> to have the DMA API itself provide some of that info. I know the DMA
> >> API should be storing some of that data for the mapping as we have to
> >> go through and invalidate it if it is stored.
> >>
> >> Another alternative would be to see if we have the option to just
> >> invalidate the DMA side of things entirely for the device. Essentially
> >> unregister the device from the IOMMU instead of the mappings. If that
> >> is an option then we could look at leaving the page pool in a state
> >> where it would essentially claim it no longer has to do the DMA unmap
> >> operations and is just freeing the remaining lingering pages.
> >
> > If we are going to 'invalidate the DMA side of things entirely for the
> > device', synchronization from page_pool might just go to the DMA core as
> > concurrent calling for dma unmapping API and 'invalidating' operation still
> > exist. If the invalidating is a common feature, perhaps it makes sense to
> > do that in the DMA core, otherwise it might just add unnecessary overhead
> > for other callers of DMA API.
> >
> > As mentioned by Robin in [2], the DMA core seems to make a great deal of
> > effort to catch DMA API misuse in kernel/dma/debug.c, it seems doing the
> > above might invalidate some of the dma debug checking.
>
> Has nobody paused to consider *why* dma-debug is an optional feature
> with an explicit performance warning? If you're concerned about the
> impact of keeping track of DMA mappings within the confines of the
> page_pool usage model, do you really imagine it could somehow be cheaper
> to keep track of them at the generic DMA API level without the benefit
> of any assumptions at all?

I get what you are saying, but there are things about the internal
implementations of the individual DMA APIs that might make them much
easier to destroy mappings and essentially invalidate them. For
example if the system doesn't have an IOMMU there isn't much that
actually has to be retained. At most it might be a bitmap for the
SWIOTLB that would have to be retained per device and that could be
used to invalidate the mappings assuming the device has been wiped out
and is somehow actually using the SWIOTLB.

In the case of an IOMMU there are many who run with iommu=pt which
will identity map the entire system memory and then just hand out
individual physical addresses from that range. In reality that should
be almost as easy to handle as the non-iommu case so why shouldn't we
take advantage of that to clean up this use case?

> Yes, in principle we could add a set of "robust" DMA APIs which make
> sure racy sync calls are safe and offer a "clean up all my outstanding
> mappings" op, and they would offer approximately as terrible performance
> as the current streaming APIs with dma-debug enabled, because it would
> be little different from what dma-debug already does. The checks
> themselves aren't complicated; the generally-prohibitive cost lies in
> keeping track of mappings and allocations so that they *can* be checked
> internally.
>
> Whatever you think is hard to do in the page_pool code to fix that
> code's own behaviour, it's even harder to do from elsewhere with less
> information.

My general thought would be to see if there is anything we could
explore within the DMA API itself to optimize the handling for this
sort of bulk unmap request. If not we could fall back to an approach
that requires more overhead and invalidation of individual pages.

You could think of it like the approach that has been taken with
DEFINED_DMA_UNMAP_ADDR/LEN. Basically there are cases where this can
be done much more quickly and it is likely we can clean up large
swaths in one go. So why not expose a function that might be able to
take advantage of that for exception cases like this surprise device
removal.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has already unbound
  2024-11-27  9:35           ` Yunsheng Lin
  2024-11-27 15:31             ` Robin Murphy
@ 2024-11-27 19:39             ` Mina Almasry
  1 sibling, 0 replies; 28+ messages in thread
From: Mina Almasry @ 2024-11-27 19:39 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: Alexander Duyck, Jesper Dangaard Brouer, davem, kuba, pabeni,
	liuyonglong, fanghaiqing, zhangkun09, Robin Murphy, IOMMU,
	Ilias Apalodimas, Eric Dumazet, Simon Horman, netdev,
	linux-kernel

On Wed, Nov 27, 2024 at 1:35 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/11/27 7:53, Alexander Duyck wrote:
> > On Tue, Nov 26, 2024 at 1:51 PM Mina Almasry <almasrymina@google.com> wrote:
> >>
> >> On Thu, Nov 21, 2024 at 12:03 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>>
> >>> On 2024/11/20 23:10, Jesper Dangaard Brouer wrote:
> >>>>
> >>>>>       page_pool_detached(pool);
> >>>>>       pool->defer_start = jiffies;
> >>>>>       pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
> >>>>> @@ -1159,7 +1228,7 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
> >>>>>       /* Flush pool alloc cache, as refill will check NUMA node */
> >>>>>       while (pool->alloc.count) {
> >>>>>           netmem = pool->alloc.cache[--pool->alloc.count];
> >>>>> -        page_pool_return_page(pool, netmem);
> >>>>> +        __page_pool_return_page(pool, netmem);
> >>>>>       }
> >>>>>   }
> >>>>>   EXPORT_SYMBOL(page_pool_update_nid);
> >>>>
> >>>> Thanks for continuing to work on this :-)
> >>>
> >>> I am not sure how scalable the scanning is going to be if the memory size became
> >>> bigger, which is one of the reason I was posting it as RFC for this version.
> >>>
> >>> For some quick searching here, it seems there might be server with max ram capacity
> >>> of 12.3TB, which means the scanning might take up to about 10 secs for those systems.
> >>> The spin_lock is used to avoid concurrency as the page_pool_put_page() API might be
> >>> called from the softirq context, which might mean there might be spinning of 12 secs
> >>> in the softirq context.
> >>>
> >>> And it seems hard to call cond_resched() when the scanning and unmapping takes a lot
> >>> of time as page_pool_put_page() might be called concurrently when pool->destroy_lock
> >>> is released, which might means page_pool_get_dma_addr() need to be checked to decide
> >>> if the mapping is already done or not for each page.
> >>>
> >>> Also, I am not sure it is appropriate to stall the driver unbound up to 10 secs here
> >>> for those large memory systems.
> >>>
> >>> https://www.broadberry.com/12tb-ram-supermicro-servers?srsltid=AfmBOorCPCZQBSv91mOGH3WTg9Cq0MhksnVYL_eXxOHtHJyuYzjyvwgH
> >>>
> >>
> >> FWIW I'm also concerned about the looping of all memory on the system.
> >> In addition to the performance, I think (but not sure), that
> >> CONFIG_MEMORY_HOTPLUG may mess such a loop as memory may appear or
> >> disappear concurrently. Even if not, the CPU cost of this may be
> >> significant. I'm imagining the possibility of having many page_pools
> >> allocated on the system for many hardware queues, (and maybe multiple
> >> pp's per queue for applications like devmem TCP), and each pp looping
> >> over the entire xTB memory on page_pool_destroy()...
> >>
> >> My 2 cents here is that a more reasonable approach is to have the pp
> >> track all pages it has dma-mapped, without the problems in the
> >> previous iterations of this patch:
> >>
> >> 1. When we dma-map a page, we add it to some pp->dma_mapped data
> >> structure (maybe xarray or rculist).
> >> 2. When we dma-unmap a page, we remove it from pp->dma_mapped.
> >> 3 When we destroy the pp, we traverse pp->dma_mapped and unmap all the
> >> pages there.
> >
> > The thing is this should be a very rare event as it should apply only
> > when a device is removed and still has pages outstanding shouldn't it?
> > The problem is that maintaining a list of in-flight DMA pages will be
> > very costly and will make the use of page pool expensive enough that I
> > would worry it might be considered less than useful. Once we add too
> > much overhead the caching of the DMA address doesn't gain us much on
> > most systems in that case.
> >
> >> I haven't looked deeply, but with the right data structure we may be
> >> able to synchronize 1, 2, and 3 without any additional locks. From a
> >> quick skim it seems maybe rculist and xarray can do this without
> >> additional locks, maybe.
>
> I am not sure how the above right data structure without any additional
> locks will work, but my feeling is that the issues mentioned in [1] will
> likely apply to the above right data structure too.
>
> 1. https://lore.kernel.org/all/6233e2c3-3fea-4ed0-bdcc-9a625270da37@huawei.com/
>

I don't see the issues called out in the above thread conflict with
what I'm proposing. In fact, I think Jesper's suggestion works
perfectly with what I'm proposing. Maybe I'm missing something.

We can use an atomic pool->destroy_count to synchronize steps 2 and 3.
I.e. If destroy_count > 0, we don't unmap the page in
__page_pool_release_page(), and instead count on the page_pool_destroy
unmapping all the pages in the pp->dma_mapped list.

> >>
> >> Like stated in the previous iterations of this approach, we should not
> >> be putting any hard limit on the amount of memory the pp can allocate,
> >> and we should not have to mess with the page->pp entry in struct page.
>
> It would be good to be more specific about how it is done without 'messing'
> with the page->pp entry in struct page using some pseudocode or RFC if you
> call the renaming as messing.
>

Yeah, I don't think touching the page->pp entry is needed if you use
an xarray or rculist which hangs off the pp.

I'm currently working on a few bug fixes already and the devmem TCP TX
which is waiting on by a few folks; I don't think I can look into this
right now, but I'll try. If this issue hasn't been resolved by the
time I get some bandwidth, sure, I'll take a stab at it.

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v4 1/3] page_pool: fix timing for checking and disabling napi_local
  2024-11-20 10:34 ` [PATCH RFC v4 1/3] page_pool: fix timing for checking and disabling napi_local Yunsheng Lin
@ 2024-12-03  2:49   ` Jakub Kicinski
  2024-12-04 11:01     ` Yunsheng Lin
  0 siblings, 1 reply; 28+ messages in thread
From: Jakub Kicinski @ 2024-12-03  2:49 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: davem, pabeni, liuyonglong, fanghaiqing, zhangkun09,
	Alexander Lobakin, Xuan Zhuo, Jesper Dangaard Brouer,
	Ilias Apalodimas, Eric Dumazet, Simon Horman, netdev,
	linux-kernel

On Wed, 20 Nov 2024 18:34:53 +0800 Yunsheng Lin wrote:
> page_pool page may be freed from skb_defer_free_flush() in
> softirq context without binding to any specific napi, it
> may cause use-after-free problem due to the below time window,
> as below, CPU1 may still access napi->list_owner after CPU0
> free the napi memory:
> 
>             CPU 0                           CPU1
>       page_pool_destroy()          skb_defer_free_flush()
>              .                               .
>              .                napi = READ_ONCE(pool->p.napi);
>              .                               .
> page_pool_disable_direct_recycling()         .
>    driver free napi memory                   .
>              .                               .
>              .       napi && READ_ONCE(napi->list_owner) == cpuid
>              .                               .
> 
> Use rcu mechanism to avoid the above problem.
> 
> Note, the above was found during code reviewing on how to fix
> the problem in [1].
> 
> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/

Please split it from the rest of the series, it's related but not
really logically connected (dma mappings and NAPI recycling are 
different features of page pool).

> @@ -1126,6 +1133,12 @@ void page_pool_destroy(struct page_pool *pool)
>  	if (!page_pool_release(pool))
>  		return;
>  
> +	/* Paired with rcu lock in page_pool_napi_local() to enable clearing
> +	 * of pool->p.napi in page_pool_disable_direct_recycling() is seen
> +	 * before returning to driver to free the napi instance.
> +	 */
> +	synchronize_rcu();

I don't think this is in the right place.
Why not inside page_pool_disable_direct_recycling() ?

Hopefully this doesn't cause long stalls during reconfig, normally
NAPIs and page pools are freed together, and NAPI removal implies
synchronize_rcu(). That's why it's not really a problem for majority
of drivers. You should perhaps make a note of this in the commit
message.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v4 1/3] page_pool: fix timing for checking and disabling napi_local
  2024-12-03  2:49   ` Jakub Kicinski
@ 2024-12-04 11:01     ` Yunsheng Lin
  2024-12-05  1:28       ` Jakub Kicinski
  0 siblings, 1 reply; 28+ messages in thread
From: Yunsheng Lin @ 2024-12-04 11:01 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, pabeni, liuyonglong, fanghaiqing, zhangkun09,
	Alexander Lobakin, Xuan Zhuo, Jesper Dangaard Brouer,
	Ilias Apalodimas, Eric Dumazet, Simon Horman, netdev,
	linux-kernel

On 2024/12/3 10:49, Jakub Kicinski wrote:
> On Wed, 20 Nov 2024 18:34:53 +0800 Yunsheng Lin wrote:
>> page_pool page may be freed from skb_defer_free_flush() in
>> softirq context without binding to any specific napi, it
>> may cause use-after-free problem due to the below time window,
>> as below, CPU1 may still access napi->list_owner after CPU0
>> free the napi memory:
>>
>>             CPU 0                           CPU1
>>       page_pool_destroy()          skb_defer_free_flush()
>>              .                               .
>>              .                napi = READ_ONCE(pool->p.napi);
>>              .                               .
>> page_pool_disable_direct_recycling()         .
>>    driver free napi memory                   .
>>              .                               .
>>              .       napi && READ_ONCE(napi->list_owner) == cpuid
>>              .                               .
>>
>> Use rcu mechanism to avoid the above problem.
>>
>> Note, the above was found during code reviewing on how to fix
>> the problem in [1].
>>
>> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
> 
> Please split it from the rest of the series, it's related but not
> really logically connected (dma mappings and NAPI recycling are 
> different features of page pool).
> 
>> @@ -1126,6 +1133,12 @@ void page_pool_destroy(struct page_pool *pool)
>>  	if (!page_pool_release(pool))
>>  		return;
>>  
>> +	/* Paired with rcu lock in page_pool_napi_local() to enable clearing
>> +	 * of pool->p.napi in page_pool_disable_direct_recycling() is seen
>> +	 * before returning to driver to free the napi instance.
>> +	 */
>> +	synchronize_rcu();
> 
> I don't think this is in the right place.
> Why not inside page_pool_disable_direct_recycling() ?

It is in page_pool_destroy() mostly because:
1. Only call synchronize_rcu() when there is inflight pages, which should
   be an unlikely case, and  synchronize_rcu() might need to be called at
   least for the case of pool->p.napi not being NULL if it is called inside
   page_pool_disable_direct_recycling().
2. the above synchronize_rcu() in page_pool_destroy() is also reused
   for the fixing of dma unmappings.

> 
> Hopefully this doesn't cause long stalls during reconfig, normally
> NAPIs and page pools are freed together, and NAPI removal implies
> synchronize_rcu(). That's why it's not really a problem for majority
> of drivers. You should perhaps make a note of this in the commit
> message.
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has already unbound
  2024-11-27 16:27               ` Alexander Duyck
@ 2024-12-04 11:16                 ` Yunsheng Lin
  0 siblings, 0 replies; 28+ messages in thread
From: Yunsheng Lin @ 2024-12-04 11:16 UTC (permalink / raw)
  To: Alexander Duyck, Robin Murphy
  Cc: Mina Almasry, Jesper Dangaard Brouer, davem, kuba, pabeni,
	liuyonglong, fanghaiqing, zhangkun09, IOMMU, Ilias Apalodimas,
	Eric Dumazet, Simon Horman, netdev, linux-kernel

On 2024/11/28 0:27, Alexander Duyck wrote:

...

> 
> My general thought would be to see if there is anything we could
> explore within the DMA API itself to optimize the handling for this
> sort of bulk unmap request. If not we could fall back to an approach
> that requires more overhead and invalidation of individual pages.
> 
> You could think of it like the approach that has been taken with
> DEFINED_DMA_UNMAP_ADDR/LEN. Basically there are cases where this can
> be done much more quickly and it is likely we can clean up large
> swaths in one go. So why not expose a function that might be able to
> take advantage of that for exception cases like this surprise device
> removal.

I am not sure if I understand the 'surprise device removal' part, it
seems to be about calling the DMA API after the driver has already
unbound, which includes the normal driver unloading too as my
understanding.

For the dma sync API, it seems there is already an existing API to
check if the dma sync API is needed for a specific device:
dma_dev_need_sync(). And it seems that the API is not really reliable
as it might return different value during the lifetime of a driver
instance, see dma_reset_need_sync() called in swiotlb_tbl_map_single().

For the dma unmap API, the below patch implemented something similar to
check if the dma unmap API is needed for a specific device, it seems
to be unreliable too as the dma_dev_need_sync() does as they both depend
on the dev->dma_skip_sync.

Even if there is a reliable way to do the checking, it seems the
complexity‌ might be still needed for the case of not being able to skip
the DMA API.
As the main concerns seems to be about supporting unlimting inflight
pages and performance overhead, if there is no other better idea of
not tracking the inflight pages, perhaps it is better to go back to
the tracking the inflight pages way by supporting unlimting inflight
page and avoiding performance overhead as much as possible.

1. https://lore.kernel.org/linux-pci/b912495d307d92ac7071553db99b3badc477fb12.1731244445.git.leon@kernel.org/

> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v4 1/3] page_pool: fix timing for checking and disabling napi_local
  2024-12-04 11:01     ` Yunsheng Lin
@ 2024-12-05  1:28       ` Jakub Kicinski
  2024-12-05 11:43         ` Yunsheng Lin
  0 siblings, 1 reply; 28+ messages in thread
From: Jakub Kicinski @ 2024-12-05  1:28 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: davem, pabeni, liuyonglong, fanghaiqing, zhangkun09,
	Alexander Lobakin, Xuan Zhuo, Jesper Dangaard Brouer,
	Ilias Apalodimas, Eric Dumazet, Simon Horman, netdev,
	linux-kernel

On Wed, 4 Dec 2024 19:01:14 +0800 Yunsheng Lin wrote:
> > I don't think this is in the right place.
> > Why not inside page_pool_disable_direct_recycling() ?  
> 
> It is in page_pool_destroy() mostly because:
> 1. Only call synchronize_rcu() when there is inflight pages, which should
>    be an unlikely case, and  synchronize_rcu() might need to be called at
>    least for the case of pool->p.napi not being NULL if it is called inside
>    page_pool_disable_direct_recycling().

Right, my point was that page_pool_disable_direct_recycling() 
is an exported function, its callers also need to be protected.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v4 1/3] page_pool: fix timing for checking and disabling napi_local
  2024-12-05  1:28       ` Jakub Kicinski
@ 2024-12-05 11:43         ` Yunsheng Lin
  2024-12-06  0:42           ` Jakub Kicinski
  0 siblings, 1 reply; 28+ messages in thread
From: Yunsheng Lin @ 2024-12-05 11:43 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, pabeni, liuyonglong, fanghaiqing, zhangkun09,
	Alexander Lobakin, Xuan Zhuo, Jesper Dangaard Brouer,
	Ilias Apalodimas, Eric Dumazet, Simon Horman, netdev,
	linux-kernel

On 2024/12/5 9:28, Jakub Kicinski wrote:
> On Wed, 4 Dec 2024 19:01:14 +0800 Yunsheng Lin wrote:
>>> I don't think this is in the right place.
>>> Why not inside page_pool_disable_direct_recycling() ?  
>>
>> It is in page_pool_destroy() mostly because:
>> 1. Only call synchronize_rcu() when there is inflight pages, which should
>>    be an unlikely case, and  synchronize_rcu() might need to be called at
>>    least for the case of pool->p.napi not being NULL if it is called inside
>>    page_pool_disable_direct_recycling().
> 
> Right, my point was that page_pool_disable_direct_recycling() 
> is an exported function, its callers also need to be protected.

It depends on what is the callers is trying to protect by calling
page_pool_disable_direct_recycling().

It seems the use case for the only user of the API in bnxt driver
is about reuseing the same NAPI for different page_pool instances.

According to the steps in netdev_rx_queue.c:
1. allocate new queue memory & create page_pool
2. stop old rx queue.
3. start new rx queue with new page_pool
4. free old queue memory + destroy page_pool.

The page_pool_disable_direct_recycling() is called in step 2, I am
not sure how napi_enable() & napi_disable() are called in the above
flow, but it seems there is no use-after-free problem this patch is
trying to fix for the above flow.

It doesn't seems to have any concurrent access problem if napi->list_owner
is set to -1 before napi_disable() returns and the napi_enable() for the
new queue is called after page_pool_disable_direct_recycling() is called
in step 2.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v4 1/3] page_pool: fix timing for checking and disabling napi_local
  2024-12-05 11:43         ` Yunsheng Lin
@ 2024-12-06  0:42           ` Jakub Kicinski
  2024-12-06 12:29             ` Yunsheng Lin
  0 siblings, 1 reply; 28+ messages in thread
From: Jakub Kicinski @ 2024-12-06  0:42 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: davem, pabeni, liuyonglong, fanghaiqing, zhangkun09,
	Alexander Lobakin, Xuan Zhuo, Jesper Dangaard Brouer,
	Ilias Apalodimas, Eric Dumazet, Simon Horman, netdev,
	linux-kernel

On Thu, 5 Dec 2024 19:43:25 +0800 Yunsheng Lin wrote:
> It depends on what is the callers is trying to protect by calling
> page_pool_disable_direct_recycling().
> 
> It seems the use case for the only user of the API in bnxt driver
> is about reuseing the same NAPI for different page_pool instances.
> 
> According to the steps in netdev_rx_queue.c:
> 1. allocate new queue memory & create page_pool
> 2. stop old rx queue.
> 3. start new rx queue with new page_pool
> 4. free old queue memory + destroy page_pool.
> 
> The page_pool_disable_direct_recycling() is called in step 2, I am
> not sure how napi_enable() & napi_disable() are called in the above
> flow, but it seems there is no use-after-free problem this patch is
> trying to fix for the above flow.
> 
> It doesn't seems to have any concurrent access problem if napi->list_owner
> is set to -1 before napi_disable() returns and the napi_enable() for the
> new queue is called after page_pool_disable_direct_recycling() is called
> in step 2.

The fix is presupposing there is long delay between fetching of
the NAPI pointer and its access. The concern is that NAPI gets
restarted in step 3 after we already READ_ONCE()'ed the pointer,
then we access it and judge it to be running on the same core.
Then we put the page into the fast cache which will never get
flushed.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v4 1/3] page_pool: fix timing for checking and disabling napi_local
  2024-12-06  0:42           ` Jakub Kicinski
@ 2024-12-06 12:29             ` Yunsheng Lin
  2024-12-06 16:09               ` Jakub Kicinski
  0 siblings, 1 reply; 28+ messages in thread
From: Yunsheng Lin @ 2024-12-06 12:29 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, pabeni, liuyonglong, fanghaiqing, zhangkun09,
	Alexander Lobakin, Xuan Zhuo, Jesper Dangaard Brouer,
	Ilias Apalodimas, Eric Dumazet, Simon Horman, netdev,
	linux-kernel

On 2024/12/6 8:42, Jakub Kicinski wrote:
> On Thu, 5 Dec 2024 19:43:25 +0800 Yunsheng Lin wrote:
>> It depends on what is the callers is trying to protect by calling
>> page_pool_disable_direct_recycling().
>>
>> It seems the use case for the only user of the API in bnxt driver
>> is about reuseing the same NAPI for different page_pool instances.
>>
>> According to the steps in netdev_rx_queue.c:
>> 1. allocate new queue memory & create page_pool
>> 2. stop old rx queue.
>> 3. start new rx queue with new page_pool
>> 4. free old queue memory + destroy page_pool.
>>
>> The page_pool_disable_direct_recycling() is called in step 2, I am
>> not sure how napi_enable() & napi_disable() are called in the above
>> flow, but it seems there is no use-after-free problem this patch is
>> trying to fix for the above flow.
>>
>> It doesn't seems to have any concurrent access problem if napi->list_owner
>> is set to -1 before napi_disable() returns and the napi_enable() for the
>> new queue is called after page_pool_disable_direct_recycling() is called
>> in step 2.
> 
> The fix is presupposing there is long delay between fetching of
> the NAPI pointer and its access. The concern is that NAPI gets
> restarted in step 3 after we already READ_ONCE()'ed the pointer,
> then we access it and judge it to be running on the same core.
> Then we put the page into the fast cache which will never get
> flushed.

It seems the napi_disable() is called before netdev_rx_queue_restart()
and napi_enable() and ____napi_schedule() are called after
netdev_rx_queue_restart() as there is no napi API called in the
implementation of 'netdev_queue_mgmt_ops' for bnxt driver?

If yes, napi->list_owner is set to -1 before step 1 and only set to
a valid cpu in step 6 as below:
1. napi_disable()
2. allocate new queue memory & create new page_pool.
3. stop old rx queue.
4. start new rx queue with new page_pool.
5. free old queue memory + destroy old page_pool.
6. napi_enable() & ____napi_schedule()

And there are at least three flows involved here:
flow 1: calling napi_complete_done() and set napi->list_owner to -1.
flow 2: calling netdev_rx_queue_restart().
flow 3: calling skb_defer_free_flush() with the page belonging to the old
       page_pool.

The only case of page_pool_napi_local() returning true in flow 3 I can
think of is that flow 1 and flow 3 might need to be called in the softirq
of the same CPU and flow 3 might need to be called before flow 1.

It seems impossible that page_pool_napi_local() will return true between
step 1 and step 6 as updated napi->list_owner is always seen by flow 3
when they are both called in the softirq context of the same CPU or
napi->list_owner != CPU that calling flow 3, which seems like an implicit
assumption for the case of napi scheduling between different cpus too.

And old page_pool is destroyed in step 5, I am not sure if it is necessary
to call page_pool_disable_direct_recycling() in step 3 if page_pool_destroy()
already have the synchronize_rcu() in step 5 before enabling napi.

If not, maybe I am missing something here. It would be good to be more specific
about the timing window that page_pool_napi_local() returning true for the old
page_pool.

> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v4 1/3] page_pool: fix timing for checking and disabling napi_local
  2024-12-06 12:29             ` Yunsheng Lin
@ 2024-12-06 16:09               ` Jakub Kicinski
  2024-12-07  5:52                 ` Yunsheng Lin
  0 siblings, 1 reply; 28+ messages in thread
From: Jakub Kicinski @ 2024-12-06 16:09 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: davem, pabeni, liuyonglong, fanghaiqing, zhangkun09,
	Alexander Lobakin, Xuan Zhuo, Jesper Dangaard Brouer,
	Ilias Apalodimas, Eric Dumazet, Simon Horman, netdev,
	linux-kernel

On Fri, 6 Dec 2024 20:29:40 +0800 Yunsheng Lin wrote:
> On 2024/12/6 8:42, Jakub Kicinski wrote:
> > On Thu, 5 Dec 2024 19:43:25 +0800 Yunsheng Lin wrote:  
> >> It depends on what is the callers is trying to protect by calling
> >> page_pool_disable_direct_recycling().
> >>
> >> It seems the use case for the only user of the API in bnxt driver
> >> is about reuseing the same NAPI for different page_pool instances.
> >>
> >> According to the steps in netdev_rx_queue.c:
> >> 1. allocate new queue memory & create page_pool
> >> 2. stop old rx queue.
> >> 3. start new rx queue with new page_pool
> >> 4. free old queue memory + destroy page_pool.
> >>
> >> The page_pool_disable_direct_recycling() is called in step 2, I am
> >> not sure how napi_enable() & napi_disable() are called in the above
> >> flow, but it seems there is no use-after-free problem this patch is
> >> trying to fix for the above flow.
> >>
> >> It doesn't seems to have any concurrent access problem if napi->list_owner
> >> is set to -1 before napi_disable() returns and the napi_enable() for the
> >> new queue is called after page_pool_disable_direct_recycling() is called
> >> in step 2.  
> > 
> > The fix is presupposing there is long delay between fetching of
> > the NAPI pointer and its access. The concern is that NAPI gets
> > restarted in step 3 after we already READ_ONCE()'ed the pointer,
> > then we access it and judge it to be running on the same core.
> > Then we put the page into the fast cache which will never get
> > flushed.  
> 
> It seems the napi_disable() is called before netdev_rx_queue_restart()
> and napi_enable() and ____napi_schedule() are called after
> netdev_rx_queue_restart() as there is no napi API called in the
> implementation of 'netdev_queue_mgmt_ops' for bnxt driver?
> 
> If yes, napi->list_owner is set to -1 before step 1 and only set to
> a valid cpu in step 6 as below:
> 1. napi_disable()
> 2. allocate new queue memory & create new page_pool.
> 3. stop old rx queue.
> 4. start new rx queue with new page_pool.
> 5. free old queue memory + destroy old page_pool.
> 6. napi_enable() & ____napi_schedule()
> 
> And there are at least three flows involved here:
> flow 1: calling napi_complete_done() and set napi->list_owner to -1.
> flow 2: calling netdev_rx_queue_restart().
> flow 3: calling skb_defer_free_flush() with the page belonging to the old
>        page_pool.
> 
> The only case of page_pool_napi_local() returning true in flow 3 I can
> think of is that flow 1 and flow 3 might need to be called in the softirq
> of the same CPU and flow 3 might need to be called before flow 1.
> 
> It seems impossible that page_pool_napi_local() will return true between
> step 1 and step 6 as updated napi->list_owner is always seen by flow 3
> when they are both called in the softirq context of the same CPU or
> napi->list_owner != CPU that calling flow 3, which seems like an implicit
> assumption for the case of napi scheduling between different cpus too.
> 
> And old page_pool is destroyed in step 5, I am not sure if it is necessary
> to call page_pool_disable_direct_recycling() in step 3 if page_pool_destroy()
> already have the synchronize_rcu() in step 5 before enabling napi.
> 
> If not, maybe I am missing something here.

Yes, I believe you got the steps 5 and 6 backwards.

> It would be good to be more specific
> about the timing window that page_pool_napi_local() returning true for the old
> page_pool.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v4 1/3] page_pool: fix timing for checking and disabling napi_local
  2024-12-06 16:09               ` Jakub Kicinski
@ 2024-12-07  5:52                 ` Yunsheng Lin
  0 siblings, 0 replies; 28+ messages in thread
From: Yunsheng Lin @ 2024-12-07  5:52 UTC (permalink / raw)
  To: Jakub Kicinski, Yunsheng Lin
  Cc: davem, pabeni, liuyonglong, fanghaiqing, zhangkun09,
	Alexander Lobakin, Xuan Zhuo, Jesper Dangaard Brouer,
	Ilias Apalodimas, Eric Dumazet, Simon Horman, netdev,
	linux-kernel, David Wei, Shailend Chand, Michael Chan

On 12/7/2024 12:09 AM, Jakub Kicinski wrote:

...

>>
>> It seems the napi_disable() is called before netdev_rx_queue_restart()
>> and napi_enable() and ____napi_schedule() are called after
>> netdev_rx_queue_restart() as there is no napi API called in the
>> implementation of 'netdev_queue_mgmt_ops' for bnxt driver?
>>
>> If yes, napi->list_owner is set to -1 before step 1 and only set to
>> a valid cpu in step 6 as below:
>> 1. napi_disable()
>> 2. allocate new queue memory & create new page_pool.
>> 3. stop old rx queue.
>> 4. start new rx queue with new page_pool.
>> 5. free old queue memory + destroy old page_pool.
>> 6. napi_enable() & ____napi_schedule()
>>
>> And there are at least three flows involved here:
>> flow 1: calling napi_complete_done() and set napi->list_owner to -1.
>> flow 2: calling netdev_rx_queue_restart().
>> flow 3: calling skb_defer_free_flush() with the page belonging to the old
>>         page_pool.
>>
>> The only case of page_pool_napi_local() returning true in flow 3 I can
>> think of is that flow 1 and flow 3 might need to be called in the softirq
>> of the same CPU and flow 3 might need to be called before flow 1.
>>
>> It seems impossible that page_pool_napi_local() will return true between
>> step 1 and step 6 as updated napi->list_owner is always seen by flow 3
>> when they are both called in the softirq context of the same CPU or
>> napi->list_owner != CPU that calling flow 3, which seems like an implicit
>> assumption for the case of napi scheduling between different cpus too.
>>
>> And old page_pool is destroyed in step 5, I am not sure if it is necessary
>> to call page_pool_disable_direct_recycling() in step 3 if page_pool_destroy()
>> already have the synchronize_rcu() in step 5 before enabling napi.
>>
>> If not, maybe I am missing something here.
> 
> Yes, I believe you got the steps 5 and 6 backwards.

Maybe, but I am not sure how is it possible that step 6 is called before
step 5 yet.
As it seems two drivers implement 'netdev_queue_mgmt_ops' now and
only bnxt calls page_pool_disable_direct_recycling(), and its
implementation doesn't call napi related API, see bnxt_queue_mgmt_ops:
https://elixir.bootlin.com/linux/v6.13-rc1/source/drivers/net/ethernet/broadcom/bnxt/bnxt.c#L15539

And netdev_rx_queue_restart() seems to call the above ops without
calling any napi related API:
https://elixir.bootlin.com/linux/v6.12.3/source/net/core/netdev_rx_queue.c#L9

The napi related API seems to be only called in bnxt_open_nic() and
bnxt_close_nic() in bnxt driver, and they don't seems to be related
directly to the queue_mgmt_ops.

+cc relevant author and maintainer to see if there is some clarifying
from them as I am not really similar with queue mgmt related sequence.

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2024-12-07  5:52 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20241120103456.396577-1-linyunsheng@huawei.com>
2024-11-20 10:34 ` [PATCH RFC v4 1/3] page_pool: fix timing for checking and disabling napi_local Yunsheng Lin
2024-12-03  2:49   ` Jakub Kicinski
2024-12-04 11:01     ` Yunsheng Lin
2024-12-05  1:28       ` Jakub Kicinski
2024-12-05 11:43         ` Yunsheng Lin
2024-12-06  0:42           ` Jakub Kicinski
2024-12-06 12:29             ` Yunsheng Lin
2024-12-06 16:09               ` Jakub Kicinski
2024-12-07  5:52                 ` Yunsheng Lin
2024-11-20 10:34 ` [PATCH RFC v4 2/3] page_pool: fix IOMMU crash when driver has already unbound Yunsheng Lin
2024-11-20 15:10   ` Jesper Dangaard Brouer
2024-11-21  8:03     ` Yunsheng Lin
2024-11-25 15:25       ` Jesper Dangaard Brouer
2024-11-26  8:22         ` Yunsheng Lin
2024-11-26 10:22           ` Jesper Dangaard Brouer
2024-11-26 11:46             ` Yunsheng Lin
2024-11-26 21:51       ` Mina Almasry
2024-11-26 23:53         ` Alexander Duyck
2024-11-27  9:35           ` Yunsheng Lin
2024-11-27 15:31             ` Robin Murphy
2024-11-27 16:27               ` Alexander Duyck
2024-12-04 11:16                 ` Yunsheng Lin
2024-11-27 19:39             ` Mina Almasry
2024-11-20 10:34 ` [PATCH RFC v4 3/3] page_pool: skip dma sync operation for inflight pages Yunsheng Lin
2024-11-20 16:17   ` Robin Murphy
2024-11-21  8:04     ` Yunsheng Lin
2024-11-21 13:44       ` Robin Murphy
2024-11-22  7:20         ` Yunsheng Lin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).