* [PATCH net-next v9 0/4] fix the DMA API misuse problem for page_pool
@ 2025-02-12 9:25 Yunsheng Lin
2025-02-12 9:25 ` [PATCH net-next v9 1/4] page_pool: introduce page_pool_get_pp() API Yunsheng Lin
2025-02-12 18:53 ` [PATCH net-next v9 0/4] fix the DMA API misuse problem for page_pool Matthew Wilcox
0 siblings, 2 replies; 4+ messages in thread
From: Yunsheng Lin @ 2025-02-12 9:25 UTC (permalink / raw)
To: davem, kuba, pabeni
Cc: zhangkun09, liuyonglong, fanghaiqing, Yunsheng Lin,
Alexander Lobakin, Robin Murphy, Alexander Duyck, Andrew Morton,
IOMMU, MM, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Matthias Brugger,
AngeloGioacchino Del Regno, netdev, intel-wired-lan, bpf,
linux-kernel, linux-arm-kernel, linux-mediatek
This patchset fixes the DMA API misuse problem mentioned in [1].
From the performance data below, the overhead is not obvious
due to performance variations on the arm64 server, and is less than
1 ns on the x86 server for time_bench_page_pool01_fast_path() and
time_bench_page_pool02_ptr_ring(); there is about 10~20 ns of
overhead for time_bench_page_pool03_slow(). See more detail in [2].
arm64 server:
Before this patchset:
fast_path ptr_ring slow
1. 31.171 ns 60.980 ns 164.917 ns
2. 28.824 ns 60.891 ns 170.241 ns
3. 14.236 ns 60.583 ns 164.355 ns
With patchset:
6. 26.163 ns 53.781 ns 189.450 ns
7. 26.189 ns 53.798 ns 189.466 ns
X86 server:
| Test name |Cycles | 1-5 | | Nanosec | 1-5 | | % |
| (tasklet_*)|Before | After |diff| Before | After | diff | change |
|------------+-------+-------+----+---------+--------+--------+--------|
| fast_path | 19 | 19 | 0| 5.399 | 5.492 | 0.093 | 1.7 |
| ptr_ring | 54 | 57 | 3| 15.090 | 15.849 | 0.759 | 5.0 |
| slow | 238 | 284 | 46| 66.134 | 78.909 | 12.775 | 19.3 |
About 16 bytes of additional memory is also needed for each
page_pool-owned page to fix the DMA API misuse problem.
1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
2. https://lore.kernel.org/all/f558df7a-d983-4fc5-8358-faf251994d23@kernel.org/
CC: Alexander Lobakin <aleksander.lobakin@intel.com>
CC: Robin Murphy <robin.murphy@arm.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: IOMMU <iommu@lists.linux.dev>
CC: MM <linux-mm@kvack.org>
Change log:
V9:
1. Drop the fix for a possible time window problem in NAPI recycling.
2. Add a design description for the fix in patch 2.
V8:
1. Drop the last 3 patches as they cause observable performance
degradation on x86 systems.
2. Remove the RCU read lock in page_pool_napi_local().
3. Rename the item functions more consistently.
V7:
1. Fix a use-after-free bug reported by KASAN, as mentioned by Jakub.
2. Fix a bug where the 'netmem' variable was not set up correctly,
as mentioned by Simon.
V6:
1. Repost based on latest net-next.
2. Rename page_pool_to_pp() to page_pool_get_pp().
V5:
1. Support an unlimited number of inflight pages.
2. Add some optimizations to avoid the overhead of the bug fix.
V4:
1. Use scanning to do the unmapping.
2. Split DMA sync skipping into a separate patch.
V3:
1. Target the net-next tree instead of the net tree.
2. Narrow the RCU lock as discussed in v2.
3. Check the unmapping count against the inflight count.
V2:
1. Add an item_full stat.
2. Use container_of() for page_pool_to_pp().
Yunsheng Lin (4):
page_pool: introduce page_pool_get_pp() API
page_pool: fix IOMMU crash when driver has already unbound
page_pool: support unlimited number of inflight pages
page_pool: skip dma sync operation for inflight pages
drivers/net/ethernet/freescale/fec_main.c | 8 +-
.../ethernet/google/gve/gve_buffer_mgmt_dqo.c | 2 +-
drivers/net/ethernet/intel/iavf/iavf_txrx.c | 6 +-
drivers/net/ethernet/intel/idpf/idpf_txrx.c | 14 +-
drivers/net/ethernet/intel/libeth/rx.c | 2 +-
.../net/ethernet/mellanox/mlx5/core/en/xdp.c | 3 +-
drivers/net/netdevsim/netdev.c | 6 +-
drivers/net/wireless/mediatek/mt76/mt76.h | 2 +-
include/linux/mm_types.h | 2 +-
include/linux/skbuff.h | 1 +
include/net/libeth/rx.h | 3 +-
include/net/netmem.h | 31 +-
include/net/page_pool/helpers.h | 15 +
include/net/page_pool/memory_provider.h | 2 +-
include/net/page_pool/types.h | 46 +-
net/core/devmem.c | 6 +-
net/core/netmem_priv.h | 5 +-
net/core/page_pool.c | 423 ++++++++++++++++--
net/core/page_pool_priv.h | 10 +-
19 files changed, 504 insertions(+), 83 deletions(-)
--
2.33.0
^ permalink raw reply [flat|nested] 4+ messages in thread
* [PATCH net-next v9 1/4] page_pool: introduce page_pool_get_pp() API
2025-02-12 9:25 [PATCH net-next v9 0/4] fix the DMA API misuse problem for page_pool Yunsheng Lin
@ 2025-02-12 9:25 ` Yunsheng Lin
2025-02-12 18:53 ` [PATCH net-next v9 0/4] fix the DMA API misuse problem for page_pool Matthew Wilcox
1 sibling, 0 replies; 4+ messages in thread
From: Yunsheng Lin @ 2025-02-12 9:25 UTC (permalink / raw)
To: davem, kuba, pabeni
Cc: zhangkun09, liuyonglong, fanghaiqing, Yunsheng Lin, Wei Fang,
Shenwei Wang, Clark Wang, Andrew Lunn, Eric Dumazet,
Jeroen de Borst, Praveen Kaligineedi, Shailend Chand, Tony Nguyen,
Przemek Kitszel, Alexander Lobakin, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Felix Fietkau,
Lorenzo Bianconi, Ryder Lee, Shayne Chen, Sean Wang, Kalle Valo,
Matthias Brugger, AngeloGioacchino Del Regno, Simon Horman,
Ilias Apalodimas, imx, netdev, linux-kernel, intel-wired-lan, bpf,
linux-rdma, linux-wireless, linux-arm-kernel, linux-mediatek
Introduce the page_pool_get_pp() API to avoid callers accessing
page->pp directly, in order to make the following patch more
reviewable, as that patch will change page->pp to page->pp_item
to fix the DMA API misuse problem.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
drivers/net/ethernet/freescale/fec_main.c | 8 +++++---
.../net/ethernet/google/gve/gve_buffer_mgmt_dqo.c | 2 +-
drivers/net/ethernet/intel/iavf/iavf_txrx.c | 6 ++++--
drivers/net/ethernet/intel/idpf/idpf_txrx.c | 14 +++++++++-----
drivers/net/ethernet/intel/libeth/rx.c | 2 +-
drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c | 3 ++-
drivers/net/netdevsim/netdev.c | 6 ++++--
drivers/net/wireless/mediatek/mt76/mt76.h | 2 +-
include/net/libeth/rx.h | 3 ++-
include/net/page_pool/helpers.h | 5 +++++
10 files changed, 34 insertions(+), 17 deletions(-)
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index a86cfebedaa8..4ade1553557a 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1038,7 +1038,8 @@ static void fec_enet_bd_init(struct net_device *dev)
struct page *page = txq->tx_buf[i].buf_p;
if (page)
- page_pool_put_page(page->pp, page, 0, false);
+ page_pool_put_page(page_pool_get_pp(page),
+ page, 0, false);
}
txq->tx_buf[i].buf_p = NULL;
@@ -1576,7 +1577,7 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget)
xdp_return_frame_rx_napi(xdpf);
} else { /* recycle pages of XDP_TX frames */
/* The dma_sync_size = 0 as XDP_TX has already synced DMA for_device */
- page_pool_put_page(page->pp, page, 0, true);
+ page_pool_put_page(page_pool_get_pp(page), page, 0, true);
}
txq->tx_buf[index].buf_p = NULL;
@@ -3343,7 +3344,8 @@ static void fec_enet_free_buffers(struct net_device *ndev)
} else {
struct page *page = txq->tx_buf[i].buf_p;
- page_pool_put_page(page->pp, page, 0, false);
+ page_pool_put_page(page_pool_get_pp(page),
+ page, 0, false);
}
txq->tx_buf[i].buf_p = NULL;
diff --git a/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c b/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
index 403f0f335ba6..87422b8828ff 100644
--- a/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
+++ b/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
@@ -210,7 +210,7 @@ void gve_free_to_page_pool(struct gve_rx_ring *rx,
if (!page)
return;
- page_pool_put_full_page(page->pp, page, allow_direct);
+ page_pool_put_full_page(page_pool_get_pp(page), page, allow_direct);
buf_state->page_info.page = NULL;
}
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
index 26b424fd6718..e1bf5554f6e3 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
@@ -1050,7 +1050,8 @@ static void iavf_add_rx_frag(struct sk_buff *skb,
const struct libeth_fqe *rx_buffer,
unsigned int size)
{
- u32 hr = rx_buffer->page->pp->p.offset;
+ struct page_pool *pool = page_pool_get_pp(rx_buffer->page);
+ u32 hr = pool->p.offset;
skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buffer->page,
rx_buffer->offset + hr, size, rx_buffer->truesize);
@@ -1067,7 +1068,8 @@ static void iavf_add_rx_frag(struct sk_buff *skb,
static struct sk_buff *iavf_build_skb(const struct libeth_fqe *rx_buffer,
unsigned int size)
{
- u32 hr = rx_buffer->page->pp->p.offset;
+ struct page_pool *pool = page_pool_get_pp(rx_buffer->page);
+ u32 hr = pool->p.offset;
struct sk_buff *skb;
void *va;
diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
index 2fa9c36e33c9..04f2347716ca 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
@@ -385,7 +385,8 @@ static void idpf_rx_page_rel(struct libeth_fqe *rx_buf)
if (unlikely(!rx_buf->page))
return;
- page_pool_put_full_page(rx_buf->page->pp, rx_buf->page, false);
+ page_pool_put_full_page(page_pool_get_pp(rx_buf->page), rx_buf->page,
+ false);
rx_buf->page = NULL;
rx_buf->offset = 0;
@@ -3098,7 +3099,8 @@ idpf_rx_process_skb_fields(struct idpf_rx_queue *rxq, struct sk_buff *skb,
void idpf_rx_add_frag(struct idpf_rx_buf *rx_buf, struct sk_buff *skb,
unsigned int size)
{
- u32 hr = rx_buf->page->pp->p.offset;
+ struct page_pool *pool = page_pool_get_pp(rx_buf->page);
+ u32 hr = pool->p.offset;
skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buf->page,
rx_buf->offset + hr, size, rx_buf->truesize);
@@ -3130,8 +3132,10 @@ static u32 idpf_rx_hsplit_wa(const struct libeth_fqe *hdr,
if (!libeth_rx_sync_for_cpu(buf, copy))
return 0;
- dst = page_address(hdr->page) + hdr->offset + hdr->page->pp->p.offset;
- src = page_address(buf->page) + buf->offset + buf->page->pp->p.offset;
+ dst = page_address(hdr->page) + hdr->offset +
+ page_pool_get_pp(hdr->page)->p.offset;
+ src = page_address(buf->page) + buf->offset +
+ page_pool_get_pp(buf->page)->p.offset;
memcpy(dst, src, LARGEST_ALIGN(copy));
buf->offset += copy;
@@ -3149,7 +3153,7 @@ static u32 idpf_rx_hsplit_wa(const struct libeth_fqe *hdr,
*/
struct sk_buff *idpf_rx_build_skb(const struct libeth_fqe *buf, u32 size)
{
- u32 hr = buf->page->pp->p.offset;
+ u32 hr = page_pool_get_pp(buf->page)->p.offset;
struct sk_buff *skb;
void *va;
diff --git a/drivers/net/ethernet/intel/libeth/rx.c b/drivers/net/ethernet/intel/libeth/rx.c
index 66d1d23b8ad2..8de0c3a3b146 100644
--- a/drivers/net/ethernet/intel/libeth/rx.c
+++ b/drivers/net/ethernet/intel/libeth/rx.c
@@ -207,7 +207,7 @@ EXPORT_SYMBOL_NS_GPL(libeth_rx_fq_destroy, "LIBETH");
*/
void libeth_rx_recycle_slow(struct page *page)
{
- page_pool_recycle_direct(page->pp, page);
+ page_pool_recycle_direct(page_pool_get_pp(page), page);
}
EXPORT_SYMBOL_NS_GPL(libeth_rx_recycle_slow, "LIBETH");
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 3cc4d55613bf..e5cc5af1b848 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -716,7 +716,8 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
/* No need to check ((page->pp_magic & ~0x3UL) == PP_SIGNATURE)
* as we know this is a page_pool page.
*/
- page_pool_recycle_direct(page->pp, page);
+ page_pool_recycle_direct(page_pool_get_pp(page),
+ page);
} while (++n < num);
break;
diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index 9b394ddc5206..917f987e89fa 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -817,7 +817,8 @@ nsim_pp_hold_write(struct file *file, const char __user *data,
if (!ns->page)
ret = -ENOMEM;
} else {
- page_pool_put_full_page(ns->page->pp, ns->page, false);
+ page_pool_put_full_page(page_pool_get_pp(ns->page), ns->page,
+ false);
ns->page = NULL;
}
@@ -1029,7 +1030,8 @@ void nsim_destroy(struct netdevsim *ns)
/* Put this intentionally late to exercise the orphaning path */
if (ns->page) {
- page_pool_put_full_page(ns->page->pp, ns->page, false);
+ page_pool_put_full_page(page_pool_get_pp(ns->page), ns->page,
+ false);
ns->page = NULL;
}
diff --git a/drivers/net/wireless/mediatek/mt76/mt76.h b/drivers/net/wireless/mediatek/mt76/mt76.h
index 132148f7b107..11a88ecf8533 100644
--- a/drivers/net/wireless/mediatek/mt76/mt76.h
+++ b/drivers/net/wireless/mediatek/mt76/mt76.h
@@ -1777,7 +1777,7 @@ static inline void mt76_put_page_pool_buf(void *buf, bool allow_direct)
{
struct page *page = virt_to_head_page(buf);
- page_pool_put_full_page(page->pp, page, allow_direct);
+ page_pool_put_full_page(page_pool_get_pp(page), page, allow_direct);
}
static inline void *
diff --git a/include/net/libeth/rx.h b/include/net/libeth/rx.h
index 43574bd6612f..f4ae75f9cc1b 100644
--- a/include/net/libeth/rx.h
+++ b/include/net/libeth/rx.h
@@ -137,7 +137,8 @@ static inline bool libeth_rx_sync_for_cpu(const struct libeth_fqe *fqe,
return false;
}
- page_pool_dma_sync_for_cpu(page->pp, page, fqe->offset, len);
+ page_pool_dma_sync_for_cpu(page_pool_get_pp(page), page, fqe->offset,
+ len);
return true;
}
diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index 582a3d00cbe2..ab91911af215 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -83,6 +83,11 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, const void *stats)
}
#endif
+static inline struct page_pool *page_pool_get_pp(struct page *page)
+{
+ return page->pp;
+}
+
/**
* page_pool_dev_alloc_pages() - allocate a page.
* @pool: pool from which to allocate
--
2.33.0
* Re: [PATCH net-next v9 0/4] fix the DMA API misuse problem for page_pool
2025-02-12 9:25 [PATCH net-next v9 0/4] fix the DMA API misuse problem for page_pool Yunsheng Lin
2025-02-12 9:25 ` [PATCH net-next v9 1/4] page_pool: introduce page_pool_get_pp() API Yunsheng Lin
@ 2025-02-12 18:53 ` Matthew Wilcox
2025-02-13 11:13 ` Yunsheng Lin
1 sibling, 1 reply; 4+ messages in thread
From: Matthew Wilcox @ 2025-02-12 18:53 UTC (permalink / raw)
To: Yunsheng Lin
Cc: davem, kuba, pabeni, zhangkun09, liuyonglong, fanghaiqing,
Alexander Lobakin, Robin Murphy, Alexander Duyck, Andrew Morton,
IOMMU, MM, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Matthias Brugger,
AngeloGioacchino Del Regno, netdev, intel-wired-lan, bpf,
linux-kernel, linux-arm-kernel, linux-mediatek
On Wed, Feb 12, 2025 at 05:25:47PM +0800, Yunsheng Lin wrote:
> This patchset fix the dma API misuse problem as mentioned in [1].
>
> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
That's a very long and complicated thread. I gave up. You need to
provide a proper description of the problem.
* Re: [PATCH net-next v9 0/4] fix the DMA API misuse problem for page_pool
2025-02-12 18:53 ` [PATCH net-next v9 0/4] fix the DMA API misuse problem for page_pool Matthew Wilcox
@ 2025-02-13 11:13 ` Yunsheng Lin
0 siblings, 0 replies; 4+ messages in thread
From: Yunsheng Lin @ 2025-02-13 11:13 UTC (permalink / raw)
To: Matthew Wilcox
Cc: davem, kuba, pabeni, zhangkun09, liuyonglong, fanghaiqing,
Alexander Lobakin, Robin Murphy, Alexander Duyck, Andrew Morton,
IOMMU, MM, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Matthias Brugger,
AngeloGioacchino Del Regno, netdev, intel-wired-lan, bpf,
linux-kernel, linux-arm-kernel, linux-mediatek
On 2025/2/13 2:53, Matthew Wilcox wrote:
> On Wed, Feb 12, 2025 at 05:25:47PM +0800, Yunsheng Lin wrote:
>> This patchset fix the dma API misuse problem as mentioned in [1].
>>
>> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
>
> That's a very long and complicated thread. I gave up. You need to
> provide a proper description of the problem.
The description of the problem is in the commit log of patch 2,
as quoted below:
"A networking driver with page_pool support may hand over a page,
still with its DMA mapping, to the network stack and try to reuse
that page after the network stack is done with it and passes it
back to the page_pool, to avoid the penalty of DMA
mapping/unmapping. With all the caching in the network stack,
some pages may be held in the network stack without returning to
the page_pool soon enough, and with a VF disable causing the
driver to unbind, the page_pool does not stop the driver from
doing its unbinding work; instead, page_pool uses a workqueue to
periodically check whether some pages have come back from the
network stack, and if there are any, it does the DMA
unmapping-related cleanup work.
As mentioned in [1], attempting DMA unmaps after the driver
has already unbound may leak resources or at worst corrupt
memory. Fundamentally, the page pool code cannot allow DMA
mappings to outlive the driver they belong to."
The description of the fix is also in the commit log of patch 2,
as below:
"By using the 'struct page_pool_item' referenced by page->pp_item,
page_pool is not only able to keep track of inflight pages in
order to do DMA unmapping if some pages are still held in the
networking stack when page_pool_destroy() is called, but the
networking stack is also able to find the page_pool owning a page
when returning pages back to the page_pool:
1. When a page is added to the page_pool, an item is deleted from
pool->hold_items, its 'pp_netmem' is set to point to that page,
and item->state and item->pp_netmem are set accordingly in order
to keep track of that page; refill from pool->release_items when
pool->hold_items is empty, or use an item from pool->slow_items
when the fast items run out.
2. When a page is released from the page_pool, it is possible to
tell which page_pool the page belongs to by masking off the lower
bits of the pointer to page_pool_item *item, as the 'struct
page_pool_item_block' is stored at the top of a struct page.
After clearing pp_item->state, the item for the released page is
added back to pool->release_items so that it can be reused for
new pages, or simply freed if it came from pool->slow_items.
3. When page_pool_destroy() is called, item->state is used to tell
whether a specific item is in use/DMA mapped by scanning all the
item blocks in pool->item_blocks; item->netmem can then be used
to do the DMA unmapping if the corresponding inflight page is DMA
mapped."
It is worth mentioning that the change from page->pp to
page->pp_item for the above fix may also enable decoupling
page_pool from the metadata of 'struct page', if folios only
provide a memdesc pointer to the page_pool subsystem in the
future, as pp_item may serve as the metadata replacement for the
existing 'struct page'.
>