* [PATCH net-next v10 0/4] fix the DMA API misuse problem for page_pool
@ 2025-02-26 11:03 Yunsheng Lin
2025-02-26 11:03 ` [PATCH net-next v10 1/4] page_pool: introduce page_pool_get_pp() API Yunsheng Lin
` (4 more replies)
0 siblings, 5 replies; 12+ messages in thread
From: Yunsheng Lin @ 2025-02-26 11:03 UTC (permalink / raw)
To: davem, kuba, pabeni
Cc: zhangkun09, liuyonglong, fanghaiqing, Yunsheng Lin,
Alexander Lobakin, Robin Murphy, Alexander Duyck, Andrew Morton,
Gaurav Batra, Matthew Rosato, IOMMU, MM, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Matthias Brugger, AngeloGioacchino Del Regno, netdev,
intel-wired-lan, bpf, linux-kernel, linux-arm-kernel,
linux-mediatek
This patchset fix the dma API misuse problem as below:
Networking driver with page_pool support may hand over page
still with dma mapping to network stack and try to reuse that
page after network stack is done with it and passes it back
to page_pool to avoid the penalty of dma mapping/unmapping.
With all the caching in the network stack, some pages may be
held in the network stack without returning to the page_pool
soon enough, and with VF disable causing the driver unbound,
the page_pool does not stop the driver from doing it's
unbounding work, instead page_pool uses workqueue to check
if there is some pages coming back from the network stack
periodically, if there is any, it will do the dma unmmapping
related cleanup work.
As mentioned in [1], attempting DMA unmaps after the driver
has already unbound may leak resources or at worst corrupt
memory. Fundamentally, the page pool code cannot allow DMA
mappings to outlive the driver they belong to.
By using the 'struct page_pool_item' referenced by page->pp_item,
page_pool is not only able to keep track of the inflight page to
do dma unmmaping if some pages are still handled in networking
stack when page_pool_destroy() is called, and networking stack is
also able to find the page_pool owning the page when returning
pages back into page_pool:
1. When a page is added to the page_pool, an item is deleted from
pool->hold_items and set the 'pp_netmem' pointing to that page
and set item->state and item->pp_netmem accordingly in order to
keep track of that page, refill from pool->release_items when
pool->hold_items is empty or use the item from pool->slow_items
when fast items run out.
2. When a page is released from the page_pool, it is able to tell
which page_pool this page belongs to by masking off the lower
bits of the pointer to page_pool_item *item, as the 'struct
page_pool_item_block' is stored in the top of a struct page.
And after clearing the pp_item->state', the item for the
released page is added back to pool->release_items so that it
can be reused for new pages or just free it when it is from the
pool->slow_items.
3. When page_pool_destroy() is called, item->state is used to tell
if a specific item is being used/dma mapped or not by scanning
all the item blocks in pool->item_blocks, then item->netmem can
be used to do the dma unmmaping if the corresponding inflight
page is dma mapped.
From the below performance data, the overhead is not so obvious
due to performance variations in arm64 server and less than 1
ns in x86 server for time_bench_page_pool01_fast_path() and
time_bench_page_pool02_ptr_ring, and there is about 10~20ns
overhead for time_bench_page_pool03_slow(), see more detail in
[2].
arm64 server:
Before this patchset:
fast_path ptr_ring slow
1. 31.171 ns 60.980 ns 164.917 ns
2. 28.824 ns 60.891 ns 170.241 ns
3. 14.236 ns 60.583 ns 164.355 ns
With patchset:
6. 26.163 ns 53.781 ns 189.450 ns
7. 26.189 ns 53.798 ns 189.466 ns
X86 server:
| Test name |Cycles | 1-5 | | Nanosec | 1-5 | | % |
| (tasklet_*)|Before | After |diff| Before | After | diff | change |
|------------+-------+-------+----+---------+--------+--------+--------|
| fast_path | 19 | 19 | 0| 5.399 | 5.492 | 0.093 | 1.7 |
| ptr_ring | 54 | 57 | 3| 15.090 | 15.849 | 0.759 | 5.0 |
| slow | 238 | 284 | 46| 66.134 | 78.909 | 12.775 | 19.3 |
And about 16 bytes of memory is also needed for each page_pool owned
page to fix the dma API misuse problem
1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
2. https://lore.kernel.org/all/f558df7a-d983-4fc5-8358-faf251994d23@kernel.org/
CC: Alexander Lobakin <aleksander.lobakin@intel.com>
CC: Robin Murphy <robin.murphy@arm.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Gaurav Batra <gbatra@linux.ibm.com>
CC: Matthew Rosato <mjrosato@linux.ibm.com>
CC: IOMMU <iommu@lists.linux.dev>
CC: MM <linux-mm@kvack.org>
Change log:
V10:
1. Add nl API to dump item memory usage.
2. Use __acquires() and __releases() to avoid 'context imbalance'
warning.
V9.
1. Drop the fix of a possible time window problem for NPAI recycling.
2. Add design description for the fix in patch 2.
V8:
1. Drop last 3 patch as it causes observable performance degradation
for x86 system.
2. Remove rcu read lock in page_pool_napi_local().
3. Renaming item function more consistently.
V7:
1. Fix a used-after-free bug reported by KASAN as mentioned by Jakub.
2. Fix the 'netmem' variable not setting up correctly bug as mentioned
by Simon.
V6:
1. Repost based on latest net-next.
2. Rename page_pool_to_pp() to page_pool_get_pp().
V5:
1. Support unlimit inflight pages.
2. Add some optimization to avoid the overhead of fixing bug.
V4:
1. use scanning to do the unmapping
2. spilt dma sync skipping into separate patch
V3:
1. Target net-next tree instead of net tree.
2. Narrow the rcu lock as the discussion in v2.
3. Check the ummapping cnt against the inflight cnt.
V2:
1. Add a item_full stat.
2. Use container_of() for page_pool_to_pp().
Yunsheng Lin (4):
page_pool: introduce page_pool_get_pp() API
page_pool: fix IOMMU crash when driver has already unbound
page_pool: support unlimited number of inflight pages
page_pool: skip dma sync operation for inflight pages
Documentation/netlink/specs/netdev.yaml | 16 +
drivers/net/ethernet/freescale/fec_main.c | 8 +-
.../ethernet/google/gve/gve_buffer_mgmt_dqo.c | 2 +-
drivers/net/ethernet/intel/iavf/iavf_txrx.c | 6 +-
drivers/net/ethernet/intel/idpf/idpf_txrx.c | 14 +-
drivers/net/ethernet/intel/libeth/rx.c | 2 +-
.../net/ethernet/mellanox/mlx5/core/en/xdp.c | 3 +-
drivers/net/netdevsim/netdev.c | 6 +-
drivers/net/wireless/mediatek/mt76/mt76.h | 2 +-
include/linux/mm_types.h | 2 +-
include/linux/skbuff.h | 1 +
include/net/libeth/rx.h | 3 +-
include/net/netmem.h | 31 +-
include/net/page_pool/helpers.h | 15 +
include/net/page_pool/memory_provider.h | 2 +-
include/net/page_pool/types.h | 46 +-
include/uapi/linux/netdev.h | 2 +
net/core/devmem.c | 6 +-
net/core/netmem_priv.h | 5 +-
net/core/page_pool.c | 423 ++++++++++++++++--
net/core/page_pool_priv.h | 12 +-
net/core/page_pool_user.c | 39 +-
tools/net/ynl/samples/page-pool.c | 11 +
23 files changed, 570 insertions(+), 87 deletions(-)
--
2.33.0
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH net-next v10 1/4] page_pool: introduce page_pool_get_pp() API
2025-02-26 11:03 [PATCH net-next v10 0/4] fix the DMA API misuse problem for page_pool Yunsheng Lin
@ 2025-02-26 11:03 ` Yunsheng Lin
2025-02-26 11:03 ` [PATCH net-next v10 2/4] page_pool: fix IOMMU crash when driver has already unbound Yunsheng Lin
` (3 subsequent siblings)
4 siblings, 0 replies; 12+ messages in thread
From: Yunsheng Lin @ 2025-02-26 11:03 UTC (permalink / raw)
To: davem, kuba, pabeni
Cc: zhangkun09, liuyonglong, fanghaiqing, Yunsheng Lin, Wei Fang,
Shenwei Wang, Clark Wang, Andrew Lunn, Eric Dumazet,
Jeroen de Borst, Harshitha Ramamurthy, Tony Nguyen,
Przemek Kitszel, Alexander Lobakin, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Felix Fietkau,
Lorenzo Bianconi, Ryder Lee, Shayne Chen, Sean Wang,
Johannes Berg, Matthias Brugger, AngeloGioacchino Del Regno,
Simon Horman, Ilias Apalodimas, imx, netdev, linux-kernel,
intel-wired-lan, bpf, linux-rdma, linux-wireless,
linux-arm-kernel, linux-mediatek
Introduce page_pool_get_pp() API to avoid caller accessing
page->pp directly, in order to make the following patch more
reviewable as the following patch will change page->pp to
page->pp_item to fix the DMA API misuse problem.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
drivers/net/ethernet/freescale/fec_main.c | 8 +++++---
.../net/ethernet/google/gve/gve_buffer_mgmt_dqo.c | 2 +-
drivers/net/ethernet/intel/iavf/iavf_txrx.c | 6 ++++--
drivers/net/ethernet/intel/idpf/idpf_txrx.c | 14 +++++++++-----
drivers/net/ethernet/intel/libeth/rx.c | 2 +-
drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c | 3 ++-
drivers/net/netdevsim/netdev.c | 6 ++++--
drivers/net/wireless/mediatek/mt76/mt76.h | 2 +-
include/net/libeth/rx.h | 3 ++-
include/net/page_pool/helpers.h | 5 +++++
10 files changed, 34 insertions(+), 17 deletions(-)
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index a86cfebedaa8..4ade1553557a 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1038,7 +1038,8 @@ static void fec_enet_bd_init(struct net_device *dev)
struct page *page = txq->tx_buf[i].buf_p;
if (page)
- page_pool_put_page(page->pp, page, 0, false);
+ page_pool_put_page(page_pool_get_pp(page),
+ page, 0, false);
}
txq->tx_buf[i].buf_p = NULL;
@@ -1576,7 +1577,7 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget)
xdp_return_frame_rx_napi(xdpf);
} else { /* recycle pages of XDP_TX frames */
/* The dma_sync_size = 0 as XDP_TX has already synced DMA for_device */
- page_pool_put_page(page->pp, page, 0, true);
+ page_pool_put_page(page_pool_get_pp(page), page, 0, true);
}
txq->tx_buf[index].buf_p = NULL;
@@ -3343,7 +3344,8 @@ static void fec_enet_free_buffers(struct net_device *ndev)
} else {
struct page *page = txq->tx_buf[i].buf_p;
- page_pool_put_page(page->pp, page, 0, false);
+ page_pool_put_page(page_pool_get_pp(page),
+ page, 0, false);
}
txq->tx_buf[i].buf_p = NULL;
diff --git a/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c b/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
index 403f0f335ba6..87422b8828ff 100644
--- a/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
+++ b/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
@@ -210,7 +210,7 @@ void gve_free_to_page_pool(struct gve_rx_ring *rx,
if (!page)
return;
- page_pool_put_full_page(page->pp, page, allow_direct);
+ page_pool_put_full_page(page_pool_get_pp(page), page, allow_direct);
buf_state->page_info.page = NULL;
}
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
index 422312b8b54a..72f17eaac277 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
@@ -1197,7 +1197,8 @@ static void iavf_add_rx_frag(struct sk_buff *skb,
const struct libeth_fqe *rx_buffer,
unsigned int size)
{
- u32 hr = rx_buffer->page->pp->p.offset;
+ struct page_pool *pool = page_pool_get_pp(rx_buffer->page);
+ u32 hr = pool->p.offset;
skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buffer->page,
rx_buffer->offset + hr, size, rx_buffer->truesize);
@@ -1214,7 +1215,8 @@ static void iavf_add_rx_frag(struct sk_buff *skb,
static struct sk_buff *iavf_build_skb(const struct libeth_fqe *rx_buffer,
unsigned int size)
{
- u32 hr = rx_buffer->page->pp->p.offset;
+ struct page_pool *pool = page_pool_get_pp(rx_buffer->page);
+ u32 hr = pool->p.offset;
struct sk_buff *skb;
void *va;
diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
index 2747dc69999a..248495f587d0 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
@@ -385,7 +385,8 @@ static void idpf_rx_page_rel(struct libeth_fqe *rx_buf)
if (unlikely(!rx_buf->page))
return;
- page_pool_put_full_page(rx_buf->page->pp, rx_buf->page, false);
+ page_pool_put_full_page(page_pool_get_pp(rx_buf->page), rx_buf->page,
+ false);
rx_buf->page = NULL;
rx_buf->offset = 0;
@@ -3095,7 +3096,8 @@ idpf_rx_process_skb_fields(struct idpf_rx_queue *rxq, struct sk_buff *skb,
void idpf_rx_add_frag(struct idpf_rx_buf *rx_buf, struct sk_buff *skb,
unsigned int size)
{
- u32 hr = rx_buf->page->pp->p.offset;
+ struct page_pool *pool = page_pool_get_pp(rx_buf->page);
+ u32 hr = pool->p.offset;
skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buf->page,
rx_buf->offset + hr, size, rx_buf->truesize);
@@ -3127,8 +3129,10 @@ static u32 idpf_rx_hsplit_wa(const struct libeth_fqe *hdr,
if (!libeth_rx_sync_for_cpu(buf, copy))
return 0;
- dst = page_address(hdr->page) + hdr->offset + hdr->page->pp->p.offset;
- src = page_address(buf->page) + buf->offset + buf->page->pp->p.offset;
+ dst = page_address(hdr->page) + hdr->offset +
+ page_pool_get_pp(hdr->page)->p.offset;
+ src = page_address(buf->page) + buf->offset +
+ page_pool_get_pp(buf->page)->p.offset;
memcpy(dst, src, LARGEST_ALIGN(copy));
buf->offset += copy;
@@ -3146,7 +3150,7 @@ static u32 idpf_rx_hsplit_wa(const struct libeth_fqe *hdr,
*/
struct sk_buff *idpf_rx_build_skb(const struct libeth_fqe *buf, u32 size)
{
- u32 hr = buf->page->pp->p.offset;
+ u32 hr = page_pool_get_pp(buf->page)->p.offset;
struct sk_buff *skb;
void *va;
diff --git a/drivers/net/ethernet/intel/libeth/rx.c b/drivers/net/ethernet/intel/libeth/rx.c
index 66d1d23b8ad2..8de0c3a3b146 100644
--- a/drivers/net/ethernet/intel/libeth/rx.c
+++ b/drivers/net/ethernet/intel/libeth/rx.c
@@ -207,7 +207,7 @@ EXPORT_SYMBOL_NS_GPL(libeth_rx_fq_destroy, "LIBETH");
*/
void libeth_rx_recycle_slow(struct page *page)
{
- page_pool_recycle_direct(page->pp, page);
+ page_pool_recycle_direct(page_pool_get_pp(page), page);
}
EXPORT_SYMBOL_NS_GPL(libeth_rx_recycle_slow, "LIBETH");
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 6f3094a479e1..b6bee95db994 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -709,7 +709,8 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
/* No need to check ((page->pp_magic & ~0x3UL) == PP_SIGNATURE)
* as we know this is a page_pool page.
*/
- page_pool_recycle_direct(page->pp, page);
+ page_pool_recycle_direct(page_pool_get_pp(page),
+ page);
} while (++n < num);
break;
diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index a41dc79e9c2e..ffb7bcc67eba 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -836,7 +836,8 @@ nsim_pp_hold_write(struct file *file, const char __user *data,
if (!ns->page)
ret = -ENOMEM;
} else {
- page_pool_put_full_page(ns->page->pp, ns->page, false);
+ page_pool_put_full_page(page_pool_get_pp(ns->page), ns->page,
+ false);
ns->page = NULL;
}
@@ -1048,7 +1049,8 @@ void nsim_destroy(struct netdevsim *ns)
/* Put this intentionally late to exercise the orphaning path */
if (ns->page) {
- page_pool_put_full_page(ns->page->pp, ns->page, false);
+ page_pool_put_full_page(page_pool_get_pp(ns->page), ns->page,
+ false);
ns->page = NULL;
}
diff --git a/drivers/net/wireless/mediatek/mt76/mt76.h b/drivers/net/wireless/mediatek/mt76/mt76.h
index 132148f7b107..11a88ecf8533 100644
--- a/drivers/net/wireless/mediatek/mt76/mt76.h
+++ b/drivers/net/wireless/mediatek/mt76/mt76.h
@@ -1777,7 +1777,7 @@ static inline void mt76_put_page_pool_buf(void *buf, bool allow_direct)
{
struct page *page = virt_to_head_page(buf);
- page_pool_put_full_page(page->pp, page, allow_direct);
+ page_pool_put_full_page(page_pool_get_pp(page), page, allow_direct);
}
static inline void *
diff --git a/include/net/libeth/rx.h b/include/net/libeth/rx.h
index ab05024be518..2a3991d5b7c0 100644
--- a/include/net/libeth/rx.h
+++ b/include/net/libeth/rx.h
@@ -137,7 +137,8 @@ static inline bool libeth_rx_sync_for_cpu(const struct libeth_fqe *fqe,
return false;
}
- page_pool_dma_sync_for_cpu(page->pp, page, fqe->offset, len);
+ page_pool_dma_sync_for_cpu(page_pool_get_pp(page), page, fqe->offset,
+ len);
return true;
}
diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index 582a3d00cbe2..ab91911af215 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -83,6 +83,11 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, const void *stats)
}
#endif
+static inline struct page_pool *page_pool_get_pp(struct page *page)
+{
+ return page->pp;
+}
+
/**
* page_pool_dev_alloc_pages() - allocate a page.
* @pool: pool from which to allocate
--
2.33.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH net-next v10 2/4] page_pool: fix IOMMU crash when driver has already unbound
2025-02-26 11:03 [PATCH net-next v10 0/4] fix the DMA API misuse problem for page_pool Yunsheng Lin
2025-02-26 11:03 ` [PATCH net-next v10 1/4] page_pool: introduce page_pool_get_pp() API Yunsheng Lin
@ 2025-02-26 11:03 ` Yunsheng Lin
2025-02-26 11:03 ` [PATCH net-next v10 3/4] page_pool: support unlimited number of inflight pages Yunsheng Lin
` (2 subsequent siblings)
4 siblings, 0 replies; 12+ messages in thread
From: Yunsheng Lin @ 2025-02-26 11:03 UTC (permalink / raw)
To: davem, kuba, pabeni
Cc: zhangkun09, liuyonglong, fanghaiqing, Yunsheng Lin, Robin Murphy,
Alexander Duyck, IOMMU, Andrew Morton, Eric Dumazet, Simon Horman,
Jesper Dangaard Brouer, Ilias Apalodimas, linux-mm, linux-kernel,
netdev
Networking driver with page_pool support may hand over page
still with dma mapping to network stack and try to reuse that
page after network stack is done with it and passes it back
to page_pool to avoid the penalty of dma mapping/unmapping.
With all the caching in the network stack, some pages may be
held in the network stack without returning to the page_pool
soon enough, and with VF disable causing the driver unbound,
the page_pool does not stop the driver from doing it's
unbounding work, instead page_pool uses workqueue to check
if there is some pages coming back from the network stack
periodically, if there is any, it will do the dma unmmapping
related cleanup work.
As mentioned in [1], attempting DMA unmaps after the driver
has already unbound may leak resources or at worst corrupt
memory. Fundamentally, the page pool code cannot allow DMA
mappings to outlive the driver they belong to.
Currently it seems there are at least two cases that the page
is not released fast enough causing dma unmmapping done after
driver has already unbound:
1. ipv4 packet defragmentation timeout: this seems to cause
delay up to 30 secs.
2. skb_defer_free_flush(): this may cause infinite delay if
there is no triggering for net_rx_action().
In order not to call DMA APIs to do DMA unmmapping after driver
has already unbound and stall the unloading of the networking
driver, use some pre-allocated item blocks to record inflight
pages including the ones which are handed over to network stack,
so the page_pool can do the DMA unmmapping for those pages when
page_pool_destroy() is called. As the pre-allocated item blocks
need to be large enough to avoid performance degradation, add a
'item_fast_empty' stat to indicate the unavailability of the
pre-allocated item blocks.
By using the 'struct page_pool_item' referenced by page->pp_item,
page_pool is not only able to keep track of the inflight page to
do dma unmmaping if some pages are still handled in networking
stack when page_pool_destroy() is called, and networking stack is
also able to find the page_pool owning the page when returning
pages back into page_pool:
1. When a page is added to the page_pool, an item is deleted from
pool->hold_items and set the 'pp_netmem' pointing to that page
and set item->state and item->pp_netmem accordingly in order to
keep track of that page, refill from pool->release_items when
pool->hold_items is empty or use the item from pool->slow_items
when fast items run out.
2. When a page is released from the page_pool, it is able to tell
which page_pool this page belongs to by masking off the lower
bits of the pointer to page_pool_item *item, as the 'struct
page_pool_item_block' is stored in the top of a struct page. And
after clearing the pp_item->state', the item for the released page
is added back to pool->release_items so that it can be reused for
new pages or just free it when it is from the pool->slow_items.
3. When page_pool_destroy() is called, item->state is used to tell if
a specific item is being used/dma mapped or not by scanning all the
item blocks in pool->item_blocks, then item->netmem can be used to
do the dma unmmaping if the corresponding inflight page is dma
mapped.
The overhead of tracking of inflight pages is about 10ns~20ns,
which causes about 10% performance degradation for the test case
of time_bench_page_pool03_slow() in [2].
Note, the devmem patchset seems to make the bug harder to fix,
and may make backporting harder too. As there is no actual user
for the devmem and the fixing for devmem is unclear for now,
this patch does not consider fixing the case for devmem yet.
1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
2. https://github.com/netoptimizer/prototype-kernel
CC: Robin Murphy <robin.murphy@arm.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: IOMMU <iommu@lists.linux.dev>
Fixes: f71fec47c2df ("page_pool: make sure struct device is stable")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Tested-by: Yonglong Liu <liuyonglong@huawei.com>
---
include/linux/mm_types.h | 2 +-
include/linux/skbuff.h | 1 +
include/net/netmem.h | 31 ++-
include/net/page_pool/helpers.h | 12 +-
include/net/page_pool/memory_provider.h | 2 +-
include/net/page_pool/types.h | 36 ++-
net/core/devmem.c | 6 +-
net/core/netmem_priv.h | 5 +-
net/core/page_pool.c | 290 +++++++++++++++++++++---
net/core/page_pool_priv.h | 10 +-
10 files changed, 342 insertions(+), 53 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6b27db7f9496..c0353866a329 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -120,7 +120,7 @@ struct page {
* page_pool allocated pages.
*/
unsigned long pp_magic;
- struct page_pool *pp;
+ struct page_pool_item *pp_item;
unsigned long _pp_mapping_pad;
unsigned long dma_addr;
atomic_long_t pp_ref_count;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f2bb8473d99a..eb4d0389e10d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -39,6 +39,7 @@
#include <net/net_debug.h>
#include <net/dropreason-core.h>
#include <net/netmem.h>
+#include <net/page_pool/types.h>
/**
* DOC: skb checksums
diff --git a/include/net/netmem.h b/include/net/netmem.h
index c61d5b21e7b4..a75fdb60ac3d 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -23,7 +23,7 @@ DECLARE_STATIC_KEY_FALSE(page_pool_mem_providers);
struct net_iov {
unsigned long __unused_padding;
unsigned long pp_magic;
- struct page_pool *pp;
+ struct page_pool_item *pp_item;
struct net_iov_area *owner;
unsigned long dma_addr;
atomic_long_t pp_ref_count;
@@ -42,7 +42,7 @@ struct net_iov_area {
*
* struct {
* unsigned long pp_magic;
- * struct page_pool *pp;
+ * struct page_pool_item *pp_item;
* unsigned long _pp_mapping_pad;
* unsigned long dma_addr;
* atomic_long_t pp_ref_count;
@@ -58,7 +58,7 @@ struct net_iov_area {
static_assert(offsetof(struct page, pg) == \
offsetof(struct net_iov, iov))
NET_IOV_ASSERT_OFFSET(pp_magic, pp_magic);
-NET_IOV_ASSERT_OFFSET(pp, pp);
+NET_IOV_ASSERT_OFFSET(pp_item, pp_item);
NET_IOV_ASSERT_OFFSET(dma_addr, dma_addr);
NET_IOV_ASSERT_OFFSET(pp_ref_count, pp_ref_count);
#undef NET_IOV_ASSERT_OFFSET
@@ -86,6 +86,11 @@ static inline unsigned int net_iov_idx(const struct net_iov *niov)
*/
typedef unsigned long __bitwise netmem_ref;
+/* Mirror page_pool_item_block, see include/net/page_pool/types.h */
+struct netmem_item_block {
+ struct page_pool *pp;
+};
+
static inline bool netmem_is_net_iov(const netmem_ref netmem)
{
return (__force unsigned long)netmem & NET_IOV;
@@ -173,6 +178,15 @@ static inline struct net_iov *__netmem_clear_lsb(netmem_ref netmem)
return (struct net_iov *)((__force unsigned long)netmem & ~NET_IOV);
}
+static inline struct page_pool *
+netmem_pp_item_to_pp(struct page_pool_item *item)
+{
+ struct netmem_item_block *block;
+
+ block = (struct netmem_item_block *)((unsigned long)item & PAGE_MASK);
+ return block->pp;
+}
+
/**
* __netmem_get_pp - unsafely get pointer to the &page_pool backing @netmem
* @netmem: netmem reference to get the pointer from
@@ -186,12 +200,19 @@ static inline struct net_iov *__netmem_clear_lsb(netmem_ref netmem)
*/
static inline struct page_pool *__netmem_get_pp(netmem_ref netmem)
{
- return __netmem_to_page(netmem)->pp;
+ return netmem_pp_item_to_pp(__netmem_to_page(netmem)->pp_item);
+}
+
+static inline struct page_pool_item *netmem_get_pp_item(netmem_ref netmem)
+{
+ return __netmem_clear_lsb(netmem)->pp_item;
}
static inline struct page_pool *netmem_get_pp(netmem_ref netmem)
{
- return __netmem_clear_lsb(netmem)->pp;
+ struct page_pool_item *item = netmem_get_pp_item(netmem);
+
+ return netmem_pp_item_to_pp(item);
}
static inline atomic_long_t *netmem_get_pp_ref_count_ref(netmem_ref netmem)
diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index ab91911af215..ac8c71841be5 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -83,9 +83,19 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, const void *stats)
}
#endif
+static inline struct page_pool_item_block *
+page_pool_item_to_block(struct page_pool_item *item)
+{
+ return (struct page_pool_item_block *)((unsigned long)item & PAGE_MASK);
+}
+
static inline struct page_pool *page_pool_get_pp(struct page *page)
{
- return page->pp;
+ /* The size of item_block is always PAGE_SIZE, the address of item_block
+ * for a specific item can be calculated using 'item & PAGE_MASK', so
+ * that we can find the page_pool object it belongs to.
+ */
+ return page_pool_item_to_block(page->pp_item)->pp;
}
/**
diff --git a/include/net/page_pool/memory_provider.h b/include/net/page_pool/memory_provider.h
index b3e665897767..bcc72f340db9 100644
--- a/include/net/page_pool/memory_provider.h
+++ b/include/net/page_pool/memory_provider.h
@@ -20,7 +20,7 @@ struct memory_provider_ops {
bool net_mp_niov_set_dma_addr(struct net_iov *niov, dma_addr_t addr);
void net_mp_niov_set_page_pool(struct page_pool *pool, struct net_iov *niov);
-void net_mp_niov_clear_page_pool(struct net_iov *niov);
+void net_mp_niov_clear_page_pool(struct page_pool *pool, struct net_iov *niov);
int net_mp_open_rxq(struct net_device *dev, unsigned ifq_idx,
struct pp_memory_provider_params *p);
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 36eb57d73abc..c131e2725e9a 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -102,6 +102,7 @@ struct page_pool_params {
* @refill: an allocation which triggered a refill of the cache
* @waive: pages obtained from the ptr ring that cannot be added to
* the cache due to a NUMA mismatch
+ * @item_fast_empty: pre-allocated item cache is empty
*/
struct page_pool_alloc_stats {
u64 fast;
@@ -110,6 +111,7 @@ struct page_pool_alloc_stats {
u64 empty;
u64 refill;
u64 waive;
+ u64 item_fast_empty;
};
/**
@@ -142,6 +144,30 @@ struct page_pool_stats {
};
#endif
+struct page_pool_item {
+ unsigned long state;
+
+ union {
+ netmem_ref pp_netmem;
+ struct llist_node lentry;
+ };
+};
+
+/* The size of item_block is always PAGE_SIZE, so that the address of item_block
+ * for a specific item can be calculated using 'item & PAGE_MASK'
+ */
+struct page_pool_item_block {
+ struct page_pool *pp;
+ struct list_head list;
+ struct page_pool_item items[];
+};
+
+/* Ensure the offset of 'pp' field for both 'page_pool_item_block' and
+ * 'netmem_item_block' are the same.
+ */
+static_assert(offsetof(struct page_pool_item_block, pp) == \
+ offsetof(struct netmem_item_block, pp));
+
/* The whole frag API block must stay within one cacheline. On 32-bit systems,
* sizeof(long) == sizeof(int), so that the block size is ``3 * sizeof(long)``.
* On 64-bit systems, the actual size is ``2 * sizeof(long) + sizeof(int)``.
@@ -164,6 +190,7 @@ struct page_pool {
int cpuid;
u32 pages_state_hold_cnt;
+ struct llist_head hold_items;
bool has_init_callback:1; /* slow::init_callback is set */
bool dma_map:1; /* Perform DMA mapping */
@@ -227,13 +254,20 @@ struct page_pool {
#endif
atomic_t pages_state_release_cnt;
+ /* Synchronizate dma unmapping operation in page_pool_return_page() with
+ * page_pool_destory() when destroy_cnt is non-zero.
+ */
+ spinlock_t item_lock;
+ struct list_head item_blocks;
+ struct llist_head release_items;
+
/* A page_pool is strictly tied to a single RX-queue being
* protected by NAPI, due to above pp_alloc_cache. This
* refcnt serves purpose is to simplify drivers error handling.
*/
refcount_t user_cnt;
- u64 destroy_cnt;
+ unsigned long destroy_cnt;
/* Slow/Control-path information follows */
struct page_pool_params_slow slow;
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 7c6e0b5b6acb..d0ddbf2f7a6a 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -32,7 +32,7 @@ static const struct memory_provider_ops dmabuf_devmem_ops;
bool net_is_devmem_iov(struct net_iov *niov)
{
- return niov->pp->mp_ops == &dmabuf_devmem_ops;
+ return netmem_pp_item_to_pp(niov->pp_item)->mp_ops == &dmabuf_devmem_ops;
}
static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
@@ -95,7 +95,7 @@ net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding)
niov = &owner->area.niovs[index];
niov->pp_magic = 0;
- niov->pp = NULL;
+ niov->pp_item = NULL;
atomic_long_set(&niov->pp_ref_count, 0);
return niov;
@@ -383,7 +383,7 @@ bool mp_dmabuf_devmem_release_page(struct page_pool *pool, netmem_ref netmem)
if (WARN_ON_ONCE(refcount != 1))
return false;
- page_pool_clear_pp_info(netmem);
+ page_pool_clear_pp_info(pool, netmem);
net_devmem_free_dmabuf(netmem_to_net_iov(netmem));
diff --git a/net/core/netmem_priv.h b/net/core/netmem_priv.h
index 7eadb8393e00..3173f6070cf7 100644
--- a/net/core/netmem_priv.h
+++ b/net/core/netmem_priv.h
@@ -18,9 +18,10 @@ static inline void netmem_clear_pp_magic(netmem_ref netmem)
__netmem_clear_lsb(netmem)->pp_magic = 0;
}
-static inline void netmem_set_pp(netmem_ref netmem, struct page_pool *pool)
+static inline void netmem_set_pp_item(netmem_ref netmem,
+ struct page_pool_item *item)
{
- __netmem_clear_lsb(netmem)->pp = pool;
+ __netmem_clear_lsb(netmem)->pp_item = item;
}
static inline void netmem_set_dma_addr(netmem_ref netmem,
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index acef1fcd8ddc..d927c468bc1b 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -63,6 +63,7 @@ static const char pp_stats[][ETH_GSTRING_LEN] = {
"rx_pp_alloc_empty",
"rx_pp_alloc_refill",
"rx_pp_alloc_waive",
+ "rx_pp_alloc_item_fast_empty",
"rx_pp_recycle_cached",
"rx_pp_recycle_cache_full",
"rx_pp_recycle_ring",
@@ -96,6 +97,7 @@ bool page_pool_get_stats(const struct page_pool *pool,
stats->alloc_stats.empty += pool->alloc_stats.empty;
stats->alloc_stats.refill += pool->alloc_stats.refill;
stats->alloc_stats.waive += pool->alloc_stats.waive;
+ stats->alloc_stats.item_fast_empty += pool->alloc_stats.item_fast_empty;
for_each_possible_cpu(cpu) {
const struct page_pool_recycle_stats *pcpu =
@@ -141,6 +143,7 @@ u64 *page_pool_ethtool_stats_get(u64 *data, const void *stats)
*data++ = pool_stats->alloc_stats.empty;
*data++ = pool_stats->alloc_stats.refill;
*data++ = pool_stats->alloc_stats.waive;
+ *data++ = pool_stats->alloc_stats.item_fast_empty;
*data++ = pool_stats->recycle_stats.cached;
*data++ = pool_stats->recycle_stats.cache_full;
*data++ = pool_stats->recycle_stats.ring;
@@ -333,6 +336,215 @@ static void page_pool_uninit(struct page_pool *pool)
#endif
}
+#define PAGE_POOL_ITEM_USED 0
+#define PAGE_POOL_ITEM_MAPPED 1
+
+#define ITEMS_PER_PAGE ((PAGE_SIZE - \
+ offsetof(struct page_pool_item_block, items)) / \
+ sizeof(struct page_pool_item))
+
+#if defined(CONFIG_DEBUG_NET)
+#define page_pool_item_set_used(item) \
+ __set_bit(PAGE_POOL_ITEM_USED, &(item)->state)
+
+#define page_pool_item_clear_used(item) \
+ __clear_bit(PAGE_POOL_ITEM_USED, &(item)->state)
+
+#define page_pool_item_is_used(item) \
+ test_bit(PAGE_POOL_ITEM_USED, &(item)->state)
+#else
+#define page_pool_item_set_used(item)
+#define page_pool_item_clear_used(item)
+#define page_pool_item_is_used(item) false
+#endif
+
+#define page_pool_item_set_mapped(item) \
+ __set_bit(PAGE_POOL_ITEM_MAPPED, &(item)->state)
+
+/* Only clear_mapped and is_mapped need to be atomic as they can be
+ * called concurrently.
+ */
+#define page_pool_item_clear_mapped(item) \
+ clear_bit(PAGE_POOL_ITEM_MAPPED, &(item)->state)
+
+#define page_pool_item_is_mapped(item) \
+ test_bit(PAGE_POOL_ITEM_MAPPED, &(item)->state)
+
+static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
+ netmem_ref netmem)
+ __acquires(&pool->item_lock) __releases(&pool->item_lock)
+{
+ struct page_pool_item *item;
+ bool destroyed;
+ dma_addr_t dma;
+
+ if (!pool->dma_map)
+ /* Always account for inflight pages, even if we didn't
+ * map them
+ */
+ return;
+
+ /* Paired with the rcu synchronization in page_pool_destroy() to ensure
+ * synchronize dma unmapping operation between page_pool_destroy() and
+ * page being released to page_pool from networking by using a spinlock
+ * when pool->destroy_cnt is non-zero.
+ */
+ rcu_read_lock();
+ destroyed = !!READ_ONCE(pool->destroy_cnt);
+ item = netmem_get_pp_item(netmem);
+
+ /* To catch the case of item state not setting up correctly as dma
+ * unmapping is always needed when page_pool_destory() is not called
+ * yet.
+ */
+ DEBUG_NET_WARN_ON_ONCE(!destroyed &&
+ !page_pool_item_is_mapped(item));
+ if (unlikely(destroyed)) {
+ spin_lock_bh(&pool->item_lock);
+
+ if (!page_pool_item_is_mapped(item))
+ goto out_unlock;
+ }
+
+ dma = page_pool_get_dma_addr_netmem(netmem);
+
+ /* When page is unmapped, it cannot be returned to our pool */
+ dma_unmap_page_attrs(pool->p.dev, dma,
+ PAGE_SIZE << pool->p.order, pool->p.dma_dir,
+ DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
+ page_pool_set_dma_addr_netmem(netmem, 0);
+ page_pool_item_clear_mapped(item);
+
+out_unlock:
+ if (unlikely(destroyed))
+ spin_unlock_bh(&pool->item_lock);
+
+ rcu_read_unlock();
+}
+
+static void page_pool_item_uninit(struct page_pool *pool)
+{
+ while (!list_empty(&pool->item_blocks)) {
+ struct page_pool_item_block *block;
+
+ block = list_first_entry(&pool->item_blocks,
+ struct page_pool_item_block,
+ list);
+ list_del(&block->list);
+ put_page(virt_to_page(block));
+ }
+}
+
+static int page_pool_item_init(struct page_pool *pool)
+{
+#define PAGE_POOL_MIN_INFLIGHT_ITEMS 512
+ struct page_pool_item_block *block;
+ int item_cnt;
+
+ INIT_LIST_HEAD(&pool->item_blocks);
+ spin_lock_init(&pool->item_lock);
+ init_llist_head(&pool->hold_items);
+ init_llist_head(&pool->release_items);
+
+ item_cnt = pool->p.pool_size * 2 + PP_ALLOC_CACHE_SIZE +
+ PAGE_POOL_MIN_INFLIGHT_ITEMS;
+ for (; item_cnt > 0; item_cnt -= ITEMS_PER_PAGE) {
+ struct page *page;
+ unsigned int i;
+
+ page = alloc_pages_node(pool->p.nid, GFP_KERNEL | __GFP_ZERO,
+ 0);
+ if (!page) {
+ page_pool_item_uninit(pool);
+ return -ENOMEM;
+ }
+
+ block = page_address(page);
+ block->pp = pool;
+ list_add(&block->list, &pool->item_blocks);
+
+ for (i = 0; i < ITEMS_PER_PAGE; i++)
+ __llist_add(&block->items[i].lentry, &pool->hold_items);
+ }
+
+ return 0;
+}
+
+static void page_pool_item_unmap(struct page_pool *pool)
+{
+ struct page_pool_item_block *block;
+
+ if (!pool->dma_map || pool->mp_priv)
+ return;
+
+ /* Paired with rcu read lock in __page_pool_release_page_dma() to
+ * synchronize dma unmapping operations.
+ */
+ synchronize_net();
+
+ list_for_each_entry(block, &pool->item_blocks, list) {
+ struct page_pool_item *items = block->items;
+ int i;
+
+ for (i = 0; i < ITEMS_PER_PAGE; i++) {
+ struct page_pool_item *item = &items[i];
+
+ if (!page_pool_item_is_mapped(item))
+ continue;
+
+ __page_pool_release_page_dma(pool, item->pp_netmem);
+ }
+ }
+}
+
+static struct page_pool_item *page_pool_fast_item_alloc(struct page_pool *pool)
+{
+ struct llist_node *first = pool->hold_items.first;
+
+ if (unlikely(!first)) {
+ first = llist_del_all(&pool->release_items);
+
+ if (unlikely(!first)) {
+ alloc_stat_inc(pool, item_fast_empty);
+ return NULL;
+ }
+ }
+
+ pool->hold_items.first = first->next;
+ return llist_entry(first, struct page_pool_item, lentry);
+}
+
+static bool page_pool_set_item_info(struct page_pool *pool, netmem_ref netmem)
+{
+ struct page_pool_item *item = page_pool_fast_item_alloc(pool);
+
+ if (likely(item)) {
+ item->pp_netmem = netmem;
+ page_pool_item_set_used(item);
+ netmem_set_pp_item(netmem, item);
+ }
+
+ return !!item;
+}
+
+static void page_pool_fast_item_free(struct page_pool *pool,
+ struct page_pool_item *item)
+{
+ llist_add(&item->lentry, &pool->release_items);
+}
+
+static void page_pool_clear_item_info(struct page_pool *pool, netmem_ref netmem)
+{
+ struct page_pool_item *item = netmem_get_pp_item(netmem);
+
+ DEBUG_NET_WARN_ON_ONCE(item->pp_netmem != netmem);
+ DEBUG_NET_WARN_ON_ONCE(page_pool_item_is_mapped(item));
+ DEBUG_NET_WARN_ON_ONCE(!page_pool_item_is_used(item));
+ page_pool_item_clear_used(item);
+ netmem_set_pp_item(netmem, NULL);
+ page_pool_fast_item_free(pool, item);
+}
+
/**
* page_pool_create_percpu() - create a page pool for a given cpu.
* @params: parameters, see struct page_pool_params
@@ -352,12 +564,18 @@ page_pool_create_percpu(const struct page_pool_params *params, int cpuid)
if (err < 0)
goto err_free;
- err = page_pool_list(pool);
+ err = page_pool_item_init(pool);
if (err)
goto err_uninit;
+ err = page_pool_list(pool);
+ if (err)
+ goto err_item_uninit;
+
return pool;
+err_item_uninit:
+ page_pool_item_uninit(pool);
err_uninit:
page_pool_uninit(pool);
err_free:
@@ -472,6 +690,7 @@ page_pool_dma_sync_for_device(const struct page_pool *pool,
static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
{
+ struct page_pool_item *item;
dma_addr_t dma;
/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
@@ -489,6 +708,9 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
if (page_pool_set_dma_addr_netmem(netmem, dma))
goto unmap_failed;
+ item = netmem_get_pp_item(netmem);
+ DEBUG_NET_WARN_ON_ONCE(page_pool_item_is_mapped(item));
+ page_pool_item_set_mapped(item);
page_pool_dma_sync_for_device(pool, netmem, pool->p.max_len);
return true;
@@ -511,19 +733,24 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
if (unlikely(!page))
return NULL;
- if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page)))) {
- put_page(page);
- return NULL;
- }
+ if (unlikely(!page_pool_set_pp_info(pool, page_to_netmem(page))))
+ goto err_alloc;
+
+ if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page))))
+ goto err_set_info;
alloc_stat_inc(pool, slow_high_order);
- page_pool_set_pp_info(pool, page_to_netmem(page));
/* Track how many pages are held 'in-flight' */
pool->pages_state_hold_cnt++;
trace_page_pool_state_hold(pool, page_to_netmem(page),
pool->pages_state_hold_cnt);
return page;
+err_set_info:
+ page_pool_clear_pp_info(pool, page_to_netmem(page));
+err_alloc:
+ put_page(page);
+ return NULL;
}
/* slow path */
@@ -557,12 +784,18 @@ static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool,
*/
for (i = 0; i < nr_pages; i++) {
netmem = pool->alloc.cache[i];
+
+ if (unlikely(!page_pool_set_pp_info(pool, netmem))) {
+ put_page(netmem_to_page(netmem));
+ continue;
+ }
+
if (dma_map && unlikely(!page_pool_dma_map(pool, netmem))) {
+ page_pool_clear_pp_info(pool, netmem);
put_page(netmem_to_page(netmem));
continue;
}
- page_pool_set_pp_info(pool, netmem);
pool->alloc.cache[pool->alloc.count++] = netmem;
/* Track how many pages are held 'in-flight' */
pool->pages_state_hold_cnt++;
@@ -634,9 +867,11 @@ s32 page_pool_inflight(const struct page_pool *pool, bool strict)
return inflight;
}
-void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
+bool page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
{
- netmem_set_pp(netmem, pool);
+ if (unlikely(!page_pool_set_item_info(pool, netmem)))
+ return false;
+
netmem_or_pp_magic(netmem, PP_SIGNATURE);
/* Ensuring all pages have been split into one fragment initially:
@@ -648,32 +883,14 @@ void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
page_pool_fragment_netmem(netmem, 1);
if (pool->has_init_callback)
pool->slow.init_callback(netmem, pool->slow.init_arg);
-}
-void page_pool_clear_pp_info(netmem_ref netmem)
-{
- netmem_clear_pp_magic(netmem);
- netmem_set_pp(netmem, NULL);
+ return true;
}
-static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
- netmem_ref netmem)
+void page_pool_clear_pp_info(struct page_pool *pool, netmem_ref netmem)
{
- dma_addr_t dma;
-
- if (!pool->dma_map)
- /* Always account for inflight pages, even if we didn't
- * map them
- */
- return;
-
- dma = page_pool_get_dma_addr_netmem(netmem);
-
- /* When page is unmapped, it cannot be returned to our pool */
- dma_unmap_page_attrs(pool->p.dev, dma,
- PAGE_SIZE << pool->p.order, pool->p.dma_dir,
- DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
- page_pool_set_dma_addr_netmem(netmem, 0);
+ netmem_clear_pp_magic(netmem);
+ page_pool_clear_item_info(pool, netmem);
}
/* Disconnects a page (from a page_pool). API users can have a need
@@ -699,7 +916,7 @@ void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
trace_page_pool_state_release(pool, netmem, count);
if (put) {
- page_pool_clear_pp_info(netmem);
+ page_pool_clear_pp_info(pool, netmem);
put_page(netmem_to_page(netmem));
}
/* An optimization would be to call __free_pages(page, pool->p.order)
@@ -1053,6 +1270,7 @@ static void __page_pool_destroy(struct page_pool *pool)
if (pool->disconnect)
pool->disconnect(pool);
+ page_pool_item_uninit(pool);
page_pool_unlist(pool);
page_pool_uninit(pool);
@@ -1084,7 +1302,7 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
static void page_pool_scrub(struct page_pool *pool)
{
page_pool_empty_alloc_cache_once(pool);
- pool->destroy_cnt++;
+ WRITE_ONCE(pool->destroy_cnt, pool->destroy_cnt + 1);
/* No more consumers should exist, but producers could still
* be in-flight.
@@ -1176,6 +1394,8 @@ void page_pool_destroy(struct page_pool *pool)
if (!page_pool_release(pool))
return;
+ page_pool_item_unmap(pool);
+
page_pool_detached(pool);
pool->defer_start = jiffies;
pool->defer_warn = jiffies + DEFER_WARN_INTERVAL;
@@ -1222,9 +1442,9 @@ void net_mp_niov_set_page_pool(struct page_pool *pool, struct net_iov *niov)
/* Disassociate a niov from a page pool. Should only be used in the
* ->release_netmem() path.
*/
-void net_mp_niov_clear_page_pool(struct net_iov *niov)
+void net_mp_niov_clear_page_pool(struct page_pool *pool, struct net_iov *niov)
{
netmem_ref netmem = net_iov_to_netmem(niov);
- page_pool_clear_pp_info(netmem);
+ page_pool_clear_pp_info(pool, netmem);
}
diff --git a/net/core/page_pool_priv.h b/net/core/page_pool_priv.h
index 2fb06d5f6d55..a5df5ab14ead 100644
--- a/net/core/page_pool_priv.h
+++ b/net/core/page_pool_priv.h
@@ -38,16 +38,18 @@ static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
}
#if defined(CONFIG_PAGE_POOL)
-void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem);
-void page_pool_clear_pp_info(netmem_ref netmem);
+bool page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem);
+void page_pool_clear_pp_info(struct page_pool *pool, netmem_ref netmem);
int page_pool_check_memory_provider(struct net_device *dev,
struct netdev_rx_queue *rxq);
#else
-static inline void page_pool_set_pp_info(struct page_pool *pool,
+static inline bool page_pool_set_pp_info(struct page_pool *pool,
netmem_ref netmem)
{
+ return true;
}
-static inline void page_pool_clear_pp_info(netmem_ref netmem)
+static inline void page_pool_clear_pp_info(struct page_pool *pool,
+ netmem_ref netmem)
{
}
static inline int page_pool_check_memory_provider(struct net_device *dev,
--
2.33.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH net-next v10 3/4] page_pool: support unlimited number of inflight pages
2025-02-26 11:03 [PATCH net-next v10 0/4] fix the DMA API misuse problem for page_pool Yunsheng Lin
2025-02-26 11:03 ` [PATCH net-next v10 1/4] page_pool: introduce page_pool_get_pp() API Yunsheng Lin
2025-02-26 11:03 ` [PATCH net-next v10 2/4] page_pool: fix IOMMU crash when driver has already unbound Yunsheng Lin
@ 2025-02-26 11:03 ` Yunsheng Lin
2025-03-03 17:59 ` Simon Horman
2025-03-04 9:25 ` Jesper Dangaard Brouer
2025-02-26 11:03 ` [PATCH net-next v10 4/4] page_pool: skip dma sync operation for " Yunsheng Lin
2025-02-28 2:15 ` [PATCH net-next v10 0/4] fix the DMA API misuse problem for page_pool Jakub Kicinski
4 siblings, 2 replies; 12+ messages in thread
From: Yunsheng Lin @ 2025-02-26 11:03 UTC (permalink / raw)
To: davem, kuba, pabeni
Cc: zhangkun09, liuyonglong, fanghaiqing, Yunsheng Lin, Robin Murphy,
Alexander Duyck, IOMMU, Eric Dumazet, Simon Horman, Donald Hunter,
Jesper Dangaard Brouer, Ilias Apalodimas, Andrew Lunn, netdev,
linux-kernel
Currently a fixed size of pre-allocated memory is used to
keep track of the inflight pages, in order to use the DMA
API correctly.
As mentioned [1], the number of inflight pages can be up to
73203 depending on the use cases. Allocate memory dynamically
to keep track of the inflight pages when pre-allocated memory
runs out.
The overhead of using dynamic memory allocation is about 10ns~
20ns, which causes 5%~10% performance degradation for the test
case of time_bench_page_pool03_slow() in [2].
1. https://lore.kernel.org/all/b8b7818a-e44b-45f5-91c2-d5eceaa5dd5b@kernel.org/
2. https://github.com/netoptimizer/prototype-kernel
CC: Robin Murphy <robin.murphy@arm.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: IOMMU <iommu@lists.linux.dev>
Fixes: f71fec47c2df ("page_pool: make sure struct device is stable")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
Documentation/netlink/specs/netdev.yaml | 16 +++++
include/net/page_pool/types.h | 10 ++++
include/uapi/linux/netdev.h | 2 +
net/core/page_pool.c | 79 ++++++++++++++++++++++++-
net/core/page_pool_priv.h | 2 +
net/core/page_pool_user.c | 39 ++++++++++--
tools/net/ynl/samples/page-pool.c | 11 ++++
7 files changed, 154 insertions(+), 5 deletions(-)
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 36f1152bfac3..7c121d0a5d15 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -162,6 +162,20 @@ attribute-sets:
type: uint
doc: |
Amount of memory held by inflight pages.
+ -
+ name: item_mem_resident
+ type: uint
+ doc: |
+ Amount of actual memory allocated to track inflight pages.
+ memory fragmentation ratio for item memory can be calculated
+ using item_mem_resident / item_mem_used.
+ -
+ name: item_mem_used
+ type: uint
+ doc: |
+ Amount of actual memory used to track inflight pages.
+ memory fragmentation ratio for item memory can be calculated
+ using item_mem_resident / item_mem_used.
-
name: detach-time
type: uint
@@ -602,6 +616,8 @@ operations:
- detach-time
- dmabuf
- io-uring
+ - item_mem_resident
+ - item_mem_used
dump:
reply: *pp-reply
config-cond: page-pool
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index c131e2725e9a..c8c47ca67f4b 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -103,6 +103,7 @@ struct page_pool_params {
* @waive: pages obtained from the ptr ring that cannot be added to
* the cache due to a NUMA mismatch
* @item_fast_empty: pre-allocated item cache is empty
+ * @item_slow_failed: failed to allocate memory for item_block
*/
struct page_pool_alloc_stats {
u64 fast;
@@ -112,6 +113,7 @@ struct page_pool_alloc_stats {
u64 refill;
u64 waive;
u64 item_fast_empty;
+ u64 item_slow_failed;
};
/**
@@ -159,9 +161,16 @@ struct page_pool_item {
struct page_pool_item_block {
struct page_pool *pp;
struct list_head list;
+ unsigned int flags;
+ refcount_t ref;
struct page_pool_item items[];
};
+struct page_pool_slow_item {
+ struct page_pool_item_block *block;
+ unsigned int next_to_use;
+};
+
/* Ensure the offset of 'pp' field for both 'page_pool_item_block' and
* 'netmem_item_block' are the same.
*/
@@ -191,6 +200,7 @@ struct page_pool {
int cpuid;
u32 pages_state_hold_cnt;
struct llist_head hold_items;
+ struct page_pool_slow_item slow_items;
bool has_init_callback:1; /* slow::init_callback is set */
bool dma_map:1; /* Perform DMA mapping */
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 7600bf62dbdf..9309cbfeb8d2 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -103,6 +103,8 @@ enum {
NETDEV_A_PAGE_POOL_DETACH_TIME,
NETDEV_A_PAGE_POOL_DMABUF,
NETDEV_A_PAGE_POOL_IO_URING,
+ NETDEV_A_PAGE_POOL_ITEM_MEM_RESIDENT,
+ NETDEV_A_PAGE_POOL_ITEM_MEM_USED,
__NETDEV_A_PAGE_POOL_MAX,
NETDEV_A_PAGE_POOL_MAX = (__NETDEV_A_PAGE_POOL_MAX - 1)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index d927c468bc1b..dc9574bd129b 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -64,6 +64,7 @@ static const char pp_stats[][ETH_GSTRING_LEN] = {
"rx_pp_alloc_refill",
"rx_pp_alloc_waive",
"rx_pp_alloc_item_fast_empty",
+ "rx_pp_alloc_item_slow_failed",
"rx_pp_recycle_cached",
"rx_pp_recycle_cache_full",
"rx_pp_recycle_ring",
@@ -98,6 +99,7 @@ bool page_pool_get_stats(const struct page_pool *pool,
stats->alloc_stats.refill += pool->alloc_stats.refill;
stats->alloc_stats.waive += pool->alloc_stats.waive;
stats->alloc_stats.item_fast_empty += pool->alloc_stats.item_fast_empty;
+ stats->alloc_stats.item_slow_failed += pool->alloc_stats.item_slow_failed;
for_each_possible_cpu(cpu) {
const struct page_pool_recycle_stats *pcpu =
@@ -144,6 +146,7 @@ u64 *page_pool_ethtool_stats_get(u64 *data, const void *stats)
*data++ = pool_stats->alloc_stats.refill;
*data++ = pool_stats->alloc_stats.waive;
*data++ = pool_stats->alloc_stats.item_fast_empty;
+ *data++ = pool_stats->alloc_stats.item_slow_failed;
*data++ = pool_stats->recycle_stats.cached;
*data++ = pool_stats->recycle_stats.cache_full;
*data++ = pool_stats->recycle_stats.ring;
@@ -431,6 +434,7 @@ static void page_pool_item_uninit(struct page_pool *pool)
struct page_pool_item_block,
list);
list_del(&block->list);
+ WARN_ON(refcount_read(&block->ref));
put_page(virt_to_page(block));
}
}
@@ -514,10 +518,42 @@ static struct page_pool_item *page_pool_fast_item_alloc(struct page_pool *pool)
return llist_entry(first, struct page_pool_item, lentry);
}
+static struct page_pool_item *page_pool_slow_item_alloc(struct page_pool *pool)
+{
+ if (unlikely(!pool->slow_items.block ||
+ pool->slow_items.next_to_use >= ITEMS_PER_PAGE)) {
+ struct page_pool_item_block *block;
+ struct page *page;
+
+ page = alloc_pages_node(pool->p.nid, GFP_ATOMIC | __GFP_NOWARN |
+ __GFP_ZERO, 0);
+ if (!page) {
+ alloc_stat_inc(pool, item_slow_failed);
+ return NULL;
+ }
+
+ block = page_address(page);
+ block->pp = pool;
+ block->flags |= PAGE_POOL_SLOW_ITEM_BLOCK_BIT;
+ refcount_set(&block->ref, ITEMS_PER_PAGE);
+ pool->slow_items.block = block;
+ pool->slow_items.next_to_use = 0;
+
+ spin_lock_bh(&pool->item_lock);
+ list_add(&block->list, &pool->item_blocks);
+ spin_unlock_bh(&pool->item_lock);
+ }
+
+ return &pool->slow_items.block->items[pool->slow_items.next_to_use++];
+}
+
static bool page_pool_set_item_info(struct page_pool *pool, netmem_ref netmem)
{
struct page_pool_item *item = page_pool_fast_item_alloc(pool);
+ if (unlikely(!item))
+ item = page_pool_slow_item_alloc(pool);
+
if (likely(item)) {
item->pp_netmem = netmem;
page_pool_item_set_used(item);
@@ -527,6 +563,37 @@ static bool page_pool_set_item_info(struct page_pool *pool, netmem_ref netmem)
return !!item;
}
+static void __page_pool_slow_item_free(struct page_pool *pool,
+ struct page_pool_item_block *block)
+{
+ spin_lock_bh(&pool->item_lock);
+ list_del(&block->list);
+ spin_unlock_bh(&pool->item_lock);
+
+ put_page(virt_to_page(block));
+}
+
+static void page_pool_slow_item_drain(struct page_pool *pool)
+{
+ struct page_pool_item_block *block = pool->slow_items.block;
+
+ if (!block || pool->slow_items.next_to_use >= ITEMS_PER_PAGE)
+ return;
+
+ if (refcount_sub_and_test(ITEMS_PER_PAGE - pool->slow_items.next_to_use,
+ &block->ref))
+ __page_pool_slow_item_free(pool, block);
+}
+
+static void page_pool_slow_item_free(struct page_pool *pool,
+ struct page_pool_item_block *block)
+{
+ if (likely(!refcount_dec_and_test(&block->ref)))
+ return;
+
+ __page_pool_slow_item_free(pool, block);
+}
+
static void page_pool_fast_item_free(struct page_pool *pool,
struct page_pool_item *item)
{
@@ -536,13 +603,22 @@ static void page_pool_fast_item_free(struct page_pool *pool,
static void page_pool_clear_item_info(struct page_pool *pool, netmem_ref netmem)
{
struct page_pool_item *item = netmem_get_pp_item(netmem);
+ struct page_pool_item_block *block;
DEBUG_NET_WARN_ON_ONCE(item->pp_netmem != netmem);
DEBUG_NET_WARN_ON_ONCE(page_pool_item_is_mapped(item));
DEBUG_NET_WARN_ON_ONCE(!page_pool_item_is_used(item));
page_pool_item_clear_used(item);
netmem_set_pp_item(netmem, NULL);
- page_pool_fast_item_free(pool, item);
+
+ block = page_pool_item_to_block(item);
+ if (likely(!(block->flags & PAGE_POOL_SLOW_ITEM_BLOCK_BIT))) {
+ DEBUG_NET_WARN_ON_ONCE(refcount_read(&block->ref));
+ page_pool_fast_item_free(pool, item);
+ return;
+ }
+
+ page_pool_slow_item_free(pool, block);
}
/**
@@ -1390,6 +1466,7 @@ void page_pool_destroy(struct page_pool *pool)
page_pool_disable_direct_recycling(pool);
page_pool_free_frag(pool);
+ page_pool_slow_item_drain(pool);
if (!page_pool_release(pool))
return;
diff --git a/net/core/page_pool_priv.h b/net/core/page_pool_priv.h
index a5df5ab14ead..37adfc766c12 100644
--- a/net/core/page_pool_priv.h
+++ b/net/core/page_pool_priv.h
@@ -7,6 +7,8 @@
#include "netmem_priv.h"
+#define PAGE_POOL_SLOW_ITEM_BLOCK_BIT BIT(0)
+
extern struct mutex page_pools_lock;
s32 page_pool_inflight(const struct page_pool *pool, bool strict);
diff --git a/net/core/page_pool_user.c b/net/core/page_pool_user.c
index c82a95beceff..0dc0090257ae 100644
--- a/net/core/page_pool_user.c
+++ b/net/core/page_pool_user.c
@@ -33,7 +33,7 @@ DEFINE_MUTEX(page_pools_lock);
* - user.list: unhashed, netdev: unknown
*/
-typedef int (*pp_nl_fill_cb)(struct sk_buff *rsp, const struct page_pool *pool,
+typedef int (*pp_nl_fill_cb)(struct sk_buff *rsp, struct page_pool *pool,
const struct genl_info *info);
static int
@@ -111,7 +111,7 @@ netdev_nl_page_pool_get_dump(struct sk_buff *skb, struct netlink_callback *cb,
}
static int
-page_pool_nl_stats_fill(struct sk_buff *rsp, const struct page_pool *pool,
+page_pool_nl_stats_fill(struct sk_buff *rsp, struct page_pool *pool,
const struct genl_info *info)
{
#ifdef CONFIG_PAGE_POOL_STATS
@@ -212,8 +212,36 @@ int netdev_nl_page_pool_stats_get_dumpit(struct sk_buff *skb,
return netdev_nl_page_pool_get_dump(skb, cb, page_pool_nl_stats_fill);
}
+static int page_pool_nl_fill_item_mem_info(struct page_pool *pool,
+ struct sk_buff *rsp)
+{
+ struct page_pool_item_block *block;
+ size_t resident = 0, used = 0;
+ int err;
+
+ spin_lock_bh(&pool->item_lock);
+
+ list_for_each_entry(block, &pool->item_blocks, list) {
+ resident += PAGE_SIZE;
+
+ if (block->flags & PAGE_POOL_SLOW_ITEM_BLOCK_BIT)
+ used += (PAGE_SIZE - sizeof(struct page_pool_item) *
+ refcount_read(&block->ref));
+ else
+ used += PAGE_SIZE;
+ }
+
+ spin_unlock_bh(&pool->item_lock);
+
+ err = nla_put_uint(rsp, NETDEV_A_PAGE_POOL_ITEM_MEM_RESIDENT, resident);
+ if (err)
+ return err;
+
+ return nla_put_uint(rsp, NETDEV_A_PAGE_POOL_ITEM_MEM_USED, used);
+}
+
static int
-page_pool_nl_fill(struct sk_buff *rsp, const struct page_pool *pool,
+page_pool_nl_fill(struct sk_buff *rsp, struct page_pool *pool,
const struct genl_info *info)
{
size_t inflight, refsz;
@@ -251,6 +279,9 @@ page_pool_nl_fill(struct sk_buff *rsp, const struct page_pool *pool,
if (pool->mp_ops && pool->mp_ops->nl_fill(pool->mp_priv, rsp, NULL))
goto err_cancel;
+ if (page_pool_nl_fill_item_mem_info(pool, rsp))
+ goto err_cancel;
+
genlmsg_end(rsp, hdr);
return 0;
@@ -259,7 +290,7 @@ page_pool_nl_fill(struct sk_buff *rsp, const struct page_pool *pool,
return -EMSGSIZE;
}
-static void netdev_nl_page_pool_event(const struct page_pool *pool, u32 cmd)
+static void netdev_nl_page_pool_event(struct page_pool *pool, u32 cmd)
{
struct genl_info info;
struct sk_buff *ntf;
diff --git a/tools/net/ynl/samples/page-pool.c b/tools/net/ynl/samples/page-pool.c
index e5d521320fbf..16c48b6d3c2c 100644
--- a/tools/net/ynl/samples/page-pool.c
+++ b/tools/net/ynl/samples/page-pool.c
@@ -16,6 +16,7 @@ struct stat {
struct {
unsigned int cnt;
size_t refs, bytes;
+ size_t item_mem_resident, item_mem_used;
} live[2];
size_t alloc_slow, alloc_fast, recycle_ring, recycle_cache;
@@ -52,6 +53,12 @@ static void count(struct stat *s, unsigned int l,
s->live[l].refs += pp->inflight;
if (pp->_present.inflight_mem)
s->live[l].bytes += pp->inflight_mem;
+
+ if (pp->_present.item_mem_resident)
+ s->live[l].item_mem_resident += pp->item_mem_resident;
+
+ if (pp->_present.item_mem_used)
+ s->live[l].item_mem_used += pp->item_mem_used;
}
int main(int argc, char **argv)
@@ -127,6 +134,10 @@ int main(int argc, char **argv)
s->live[1].refs, s->live[1].bytes,
s->live[0].refs, s->live[0].bytes);
+ printf("\t\titem_mem_resident: %zu item_mem_used: %zu (item_mem_resident: %zu item_mem_used: %zu)\n",
+ s->live[1].item_mem_resident, s->live[1].item_mem_used,
+ s->live[0].item_mem_resident, s->live[0].item_mem_used);
+
/* We don't know how many pages are sitting in cache and ring
* so we will under-count the recycling rate a bit.
*/
--
2.33.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH net-next v10 4/4] page_pool: skip dma sync operation for inflight pages
2025-02-26 11:03 [PATCH net-next v10 0/4] fix the DMA API misuse problem for page_pool Yunsheng Lin
` (2 preceding siblings ...)
2025-02-26 11:03 ` [PATCH net-next v10 3/4] page_pool: support unlimited number of inflight pages Yunsheng Lin
@ 2025-02-26 11:03 ` Yunsheng Lin
2025-03-05 17:29 ` Jason Gunthorpe
2025-02-28 2:15 ` [PATCH net-next v10 0/4] fix the DMA API misuse problem for page_pool Jakub Kicinski
4 siblings, 1 reply; 12+ messages in thread
From: Yunsheng Lin @ 2025-02-26 11:03 UTC (permalink / raw)
To: davem, kuba, pabeni
Cc: zhangkun09, liuyonglong, fanghaiqing, Yunsheng Lin, Robin Murphy,
Alexander Duyck, IOMMU, Jesper Dangaard Brouer, Ilias Apalodimas,
Eric Dumazet, Simon Horman, netdev, linux-kernel
Skip dma sync operation for inflight pages before the
sync operation in page_pool_item_unmap() as DMA API
expects to be called with a valid device bound to a
driver as mentioned in [1].
After page_pool_destroy() is called, the page is not
expected to be recycled back to pool->alloc cache and
dma sync operation is not needed when the page is not
recyclable or pool->ring is full, so only skip the dma
sync operation for the infilght pages by clearing the
pool->dma_sync, as rcu sync operation in
page_pool_destroy() is paired with rcu lock in
page_pool_recycle_in_ring() to ensure that there is no
dma sync operation called after rcu sync operation.
1. https://lore.kernel.org/all/caf31b5e-0e8f-4844-b7ba-ef59ed13b74e@arm.com/
CC: Robin Murphy <robin.murphy@arm.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: IOMMU <iommu@lists.linux.dev>
Fixes: f71fec47c2df ("page_pool: make sure struct device is stable")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
net/core/page_pool.c | 56 +++++++++++++++++++++++++++++++++-----------
1 file changed, 42 insertions(+), 14 deletions(-)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index dc9574bd129b..0f9abd0bf664 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -281,9 +281,6 @@ static int page_pool_init(struct page_pool *pool,
/* Driver calling page_pool_create() also call page_pool_destroy() */
refcount_set(&pool->user_cnt, 1);
- if (pool->dma_map)
- get_device(pool->p.dev);
-
if (pool->slow.flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM) {
/* We rely on rtnl_lock()ing to make sure netdev_rx_queue
* configuration doesn't change while we're initializing
@@ -330,9 +327,6 @@ static void page_pool_uninit(struct page_pool *pool)
{
ptr_ring_cleanup(&pool->ring, NULL);
- if (pool->dma_map)
- put_device(pool->p.dev);
-
#ifdef CONFIG_PAGE_POOL_STATS
if (!pool->system)
free_percpu(pool->recycle_stats);
@@ -481,6 +475,16 @@ static void page_pool_item_unmap(struct page_pool *pool)
if (!pool->dma_map || pool->mp_priv)
return;
+ /* After page_pool_destroy() is called, the page is not expected to be
+ * recycled back to pool->alloc cache and dma sync operation is not
+ * needed when the page is not recyclable or pool->ring is full, skip
+ * the dma sync operation for the infilght pages by clearing the
+ * pool->dma_sync, and the below synchronize_net() is paired with rcu
+ * lock when page is recycled back into ptr_ring to ensure that there is
+ * no dma sync operation called after rcu sync operation.
+ */
+ pool->dma_sync = false;
+
/* Paired with rcu read lock in __page_pool_release_page_dma() to
* synchronize dma unmapping operations.
*/
@@ -764,6 +768,25 @@ page_pool_dma_sync_for_device(const struct page_pool *pool,
__page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
}
+static __always_inline void
+page_pool_dma_sync_for_device_rcu(const struct page_pool *pool,
+ netmem_ref netmem,
+ u32 dma_sync_size)
+{
+ if (!pool->dma_sync || !dma_dev_need_sync(pool->p.dev))
+ return;
+
+ rcu_read_lock();
+
+ /* Recheck the dma_sync under rcu lock to pair with rcu sync operation
+ * in page_pool_destroy().
+ */
+ if (likely(pool->dma_sync))
+ __page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
+
+ rcu_read_unlock();
+}
+
static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
{
struct page_pool_item *item;
@@ -1001,7 +1024,8 @@ void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
*/
}
-static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem)
+static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem,
+ unsigned int dma_sync_size)
{
int ret;
/* BH protection not needed if current is softirq */
@@ -1010,12 +1034,12 @@ static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem)
else
ret = ptr_ring_produce_bh(&pool->ring, (__force void *)netmem);
- if (!ret) {
+ if (likely(!ret)) {
+ page_pool_dma_sync_for_device_rcu(pool, netmem, dma_sync_size);
recycle_stat_inc(pool, ring);
- return true;
}
- return false;
+ return !ret;
}
/* Only allow direct recycling in special circumstances, into the
@@ -1068,10 +1092,10 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
if (likely(__page_pool_page_can_be_recycled(netmem))) {
/* Read barrier done in page_ref_count / READ_ONCE */
- page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
-
- if (allow_direct && page_pool_recycle_in_cache(netmem, pool))
+ if (allow_direct && page_pool_recycle_in_cache(netmem, pool)) {
+ page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
return 0;
+ }
/* Page found as candidate for recycling */
return netmem;
@@ -1127,7 +1151,7 @@ void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
netmem =
__page_pool_put_page(pool, netmem, dma_sync_size, allow_direct);
- if (netmem && !page_pool_recycle_in_ring(pool, netmem)) {
+ if (netmem && !page_pool_recycle_in_ring(pool, netmem, dma_sync_size)) {
/* Cache full, fallback to free pages */
recycle_stat_inc(pool, ring_full);
page_pool_return_page(pool, netmem);
@@ -1153,14 +1177,18 @@ static void page_pool_recycle_ring_bulk(struct page_pool *pool,
/* Bulk produce into ptr_ring page_pool cache */
in_softirq = page_pool_producer_lock(pool);
+ rcu_read_lock();
for (i = 0; i < bulk_len; i++) {
if (__ptr_ring_produce(&pool->ring, (__force void *)bulk[i])) {
/* ring full */
recycle_stat_inc(pool, ring_full);
break;
}
+
+ page_pool_dma_sync_for_device(pool, bulk[i], -1);
}
+ rcu_read_unlock();
page_pool_producer_unlock(pool, in_softirq);
recycle_stat_add(pool, ring, i);
--
2.33.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v10 0/4] fix the DMA API misuse problem for page_pool
2025-02-26 11:03 [PATCH net-next v10 0/4] fix the DMA API misuse problem for page_pool Yunsheng Lin
` (3 preceding siblings ...)
2025-02-26 11:03 ` [PATCH net-next v10 4/4] page_pool: skip dma sync operation for " Yunsheng Lin
@ 2025-02-28 2:15 ` Jakub Kicinski
4 siblings, 0 replies; 12+ messages in thread
From: Jakub Kicinski @ 2025-02-28 2:15 UTC (permalink / raw)
To: Yunsheng Lin
Cc: davem, pabeni, zhangkun09, liuyonglong, fanghaiqing,
Alexander Lobakin, Robin Murphy, Alexander Duyck, Andrew Morton,
Gaurav Batra, Matthew Rosato, IOMMU, MM, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Matthias Brugger, AngeloGioacchino Del Regno, netdev,
intel-wired-lan, bpf, linux-kernel, linux-arm-kernel,
linux-mediatek
On Wed, 26 Feb 2025 19:03:35 +0800 Yunsheng Lin wrote:
> This patchset fix the dma API misuse problem as below:
> Networking driver with page_pool support may hand over page
> still with dma mapping to network stack and try to reuse that
> page after network stack is done with it and passes it back
> to page_pool to avoid the penalty of dma mapping/unmapping.
> With all the caching in the network stack, some pages may be
> held in the network stack without returning to the page_pool
> soon enough, and with VF disable causing the driver unbound,
> the page_pool does not stop the driver from doing it's
> unbounding work, instead page_pool uses workqueue to check
> if there is some pages coming back from the network stack
> periodically, if there is any, it will do the dma unmmapping
> related cleanup work.
Does not build :( Always do an allmodconfig build when working
on subsystem-wide interfaces..
--
pw-bot: cr
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v10 3/4] page_pool: support unlimited number of inflight pages
2025-02-26 11:03 ` [PATCH net-next v10 3/4] page_pool: support unlimited number of inflight pages Yunsheng Lin
@ 2025-03-03 17:59 ` Simon Horman
2025-03-04 12:25 ` Yunsheng Lin
2025-03-04 9:25 ` Jesper Dangaard Brouer
1 sibling, 1 reply; 12+ messages in thread
From: Simon Horman @ 2025-03-03 17:59 UTC (permalink / raw)
To: Yunsheng Lin
Cc: davem, kuba, pabeni, zhangkun09, liuyonglong, fanghaiqing,
Robin Murphy, Alexander Duyck, IOMMU, Eric Dumazet, Donald Hunter,
Jesper Dangaard Brouer, Ilias Apalodimas, Andrew Lunn, netdev,
linux-kernel
On Wed, Feb 26, 2025 at 07:03:38PM +0800, Yunsheng Lin wrote:
> Currently a fixed size of pre-allocated memory is used to
> keep track of the inflight pages, in order to use the DMA
> API correctly.
>
> As mentioned [1], the number of inflight pages can be up to
> 73203 depending on the use cases. Allocate memory dynamically
> to keep track of the inflight pages when pre-allocated memory
> runs out.
>
> The overhead of using dynamic memory allocation is about 10ns~
> 20ns, which causes 5%~10% performance degradation for the test
> case of time_bench_page_pool03_slow() in [2].
>
> 1. https://lore.kernel.org/all/b8b7818a-e44b-45f5-91c2-d5eceaa5dd5b@kernel.org/
> 2. https://github.com/netoptimizer/prototype-kernel
> CC: Robin Murphy <robin.murphy@arm.com>
> CC: Alexander Duyck <alexander.duyck@gmail.com>
> CC: IOMMU <iommu@lists.linux.dev>
> Fixes: f71fec47c2df ("page_pool: make sure struct device is stable")
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
> Documentation/netlink/specs/netdev.yaml | 16 +++++
> include/net/page_pool/types.h | 10 ++++
> include/uapi/linux/netdev.h | 2 +
> net/core/page_pool.c | 79 ++++++++++++++++++++++++-
> net/core/page_pool_priv.h | 2 +
> net/core/page_pool_user.c | 39 ++++++++++--
> tools/net/ynl/samples/page-pool.c | 11 ++++
> 7 files changed, 154 insertions(+), 5 deletions(-)
Hi,
It looks like the header changes in this patch don't quite
correspond to the spec changes.
But if so, perhaps the spec update needs to change,
because adding values to an enum, other than at the end,
feels like UAPI breakage to me.
I see this:
$ ./tools/net/ynl/ynl-regen.sh -f
$ git diff
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 9309cbfeb8d2..9e02f6190b07 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -100,11 +100,11 @@ enum {
NETDEV_A_PAGE_POOL_NAPI_ID,
NETDEV_A_PAGE_POOL_INFLIGHT,
NETDEV_A_PAGE_POOL_INFLIGHT_MEM,
+ NETDEV_A_PAGE_POOL_ITEM_MEM_RESIDENT,
+ NETDEV_A_PAGE_POOL_ITEM_MEM_USED,
NETDEV_A_PAGE_POOL_DETACH_TIME,
NETDEV_A_PAGE_POOL_DMABUF,
NETDEV_A_PAGE_POOL_IO_URING,
- NETDEV_A_PAGE_POOL_ITEM_MEM_RESIDENT,
- NETDEV_A_PAGE_POOL_ITEM_MEM_USED,
__NETDEV_A_PAGE_POOL_MAX,
NETDEV_A_PAGE_POOL_MAX = (__NETDEV_A_PAGE_POOL_MAX - 1)
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 7600bf62dbdf..9e02f6190b07 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -100,6 +100,8 @@ enum {
NETDEV_A_PAGE_POOL_NAPI_ID,
NETDEV_A_PAGE_POOL_INFLIGHT,
NETDEV_A_PAGE_POOL_INFLIGHT_MEM,
+ NETDEV_A_PAGE_POOL_ITEM_MEM_RESIDENT,
+ NETDEV_A_PAGE_POOL_ITEM_MEM_USED,
NETDEV_A_PAGE_POOL_DETACH_TIME,
NETDEV_A_PAGE_POOL_DMABUF,
NETDEV_A_PAGE_POOL_IO_URING,
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v10 3/4] page_pool: support unlimited number of inflight pages
2025-02-26 11:03 ` [PATCH net-next v10 3/4] page_pool: support unlimited number of inflight pages Yunsheng Lin
2025-03-03 17:59 ` Simon Horman
@ 2025-03-04 9:25 ` Jesper Dangaard Brouer
2025-03-05 12:19 ` Yunsheng Lin
1 sibling, 1 reply; 12+ messages in thread
From: Jesper Dangaard Brouer @ 2025-03-04 9:25 UTC (permalink / raw)
To: Yunsheng Lin, davem, kuba, pabeni
Cc: zhangkun09, liuyonglong, fanghaiqing, Robin Murphy,
Alexander Duyck, IOMMU, Eric Dumazet, Simon Horman, Donald Hunter,
Ilias Apalodimas, Andrew Lunn, netdev, linux-kernel, kernel-team,
Yan Zhai
(comments requesting changes inlined below)
On 26/02/2025 12.03, Yunsheng Lin wrote:
> Currently a fixed size of pre-allocated memory is used to
> keep track of the inflight pages, in order to use the DMA
> API correctly.
>
> As mentioned [1], the number of inflight pages can be up to
> 73203 depending on the use cases. Allocate memory dynamically
> to keep track of the inflight pages when pre-allocated memory
> runs out.
>
> The overhead of using dynamic memory allocation is about 10ns~
> 20ns, which causes 5%~10% performance degradation for the test
> case of time_bench_page_pool03_slow() in [2].
>
> 1. https://lore.kernel.org/all/b8b7818a-e44b-45f5-91c2-d5eceaa5dd5b@kernel.org/
> 2. https://github.com/netoptimizer/prototype-kernel
> CC: Robin Murphy <robin.murphy@arm.com>
> CC: Alexander Duyck <alexander.duyck@gmail.com>
> CC: IOMMU <iommu@lists.linux.dev>
> Fixes: f71fec47c2df ("page_pool: make sure struct device is stable")
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
> Documentation/netlink/specs/netdev.yaml | 16 +++++
> include/net/page_pool/types.h | 10 ++++
> include/uapi/linux/netdev.h | 2 +
> net/core/page_pool.c | 79 ++++++++++++++++++++++++-
> net/core/page_pool_priv.h | 2 +
> net/core/page_pool_user.c | 39 ++++++++++--
> tools/net/ynl/samples/page-pool.c | 11 ++++
> 7 files changed, 154 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index 36f1152bfac3..7c121d0a5d15 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -162,6 +162,20 @@ attribute-sets:
> type: uint
> doc: |
> Amount of memory held by inflight pages.
> + -
> + name: item_mem_resident
> + type: uint
> + doc: |
> + Amount of actual memory allocated to track inflight pages.
> + memory fragmentation ratio for item memory can be calculated
> + using item_mem_resident / item_mem_used.
> + -
> + name: item_mem_used
> + type: uint
> + doc: |
> + Amount of actual memory used to track inflight pages.
> + memory fragmentation ratio for item memory can be calculated
> + using item_mem_resident / item_mem_used.
> -
> name: detach-time
> type: uint
> @@ -602,6 +616,8 @@ operations:
> - detach-time
> - dmabuf
> - io-uring
> + - item_mem_resident
> + - item_mem_used
> dump:
> reply: *pp-reply
> config-cond: page-pool
> diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
> index c131e2725e9a..c8c47ca67f4b 100644
> --- a/include/net/page_pool/types.h
> +++ b/include/net/page_pool/types.h
> @@ -103,6 +103,7 @@ struct page_pool_params {
> * @waive: pages obtained from the ptr ring that cannot be added to
> * the cache due to a NUMA mismatch
> * @item_fast_empty: pre-allocated item cache is empty
> + * @item_slow_failed: failed to allocate memory for item_block
> */
> struct page_pool_alloc_stats {
> u64 fast;
> @@ -112,6 +113,7 @@ struct page_pool_alloc_stats {
> u64 refill;
> u64 waive;
> u64 item_fast_empty;
> + u64 item_slow_failed;
> };
>
> /**
> @@ -159,9 +161,16 @@ struct page_pool_item {
> struct page_pool_item_block {
> struct page_pool *pp;
> struct list_head list;
> + unsigned int flags;
> + refcount_t ref;
> struct page_pool_item items[];
> };
>
> +struct page_pool_slow_item {
> + struct page_pool_item_block *block;
> + unsigned int next_to_use;
> +};
> +
> /* Ensure the offset of 'pp' field for both 'page_pool_item_block' and
> * 'netmem_item_block' are the same.
> */
> @@ -191,6 +200,7 @@ struct page_pool {
> int cpuid;
> u32 pages_state_hold_cnt;
> struct llist_head hold_items;
> + struct page_pool_slow_item slow_items;
>
> bool has_init_callback:1; /* slow::init_callback is set */
> bool dma_map:1; /* Perform DMA mapping */
> diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
> index 7600bf62dbdf..9309cbfeb8d2 100644
> --- a/include/uapi/linux/netdev.h
> +++ b/include/uapi/linux/netdev.h
> @@ -103,6 +103,8 @@ enum {
> NETDEV_A_PAGE_POOL_DETACH_TIME,
> NETDEV_A_PAGE_POOL_DMABUF,
> NETDEV_A_PAGE_POOL_IO_URING,
> + NETDEV_A_PAGE_POOL_ITEM_MEM_RESIDENT,
> + NETDEV_A_PAGE_POOL_ITEM_MEM_USED,
>
> __NETDEV_A_PAGE_POOL_MAX,
> NETDEV_A_PAGE_POOL_MAX = (__NETDEV_A_PAGE_POOL_MAX - 1)
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index d927c468bc1b..dc9574bd129b 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -64,6 +64,7 @@ static const char pp_stats[][ETH_GSTRING_LEN] = {
> "rx_pp_alloc_refill",
> "rx_pp_alloc_waive",
> "rx_pp_alloc_item_fast_empty",
> + "rx_pp_alloc_item_slow_failed",
> "rx_pp_recycle_cached",
> "rx_pp_recycle_cache_full",
> "rx_pp_recycle_ring",
> @@ -98,6 +99,7 @@ bool page_pool_get_stats(const struct page_pool *pool,
> stats->alloc_stats.refill += pool->alloc_stats.refill;
> stats->alloc_stats.waive += pool->alloc_stats.waive;
> stats->alloc_stats.item_fast_empty += pool->alloc_stats.item_fast_empty;
> + stats->alloc_stats.item_slow_failed += pool->alloc_stats.item_slow_failed;
>
> for_each_possible_cpu(cpu) {
> const struct page_pool_recycle_stats *pcpu =
> @@ -144,6 +146,7 @@ u64 *page_pool_ethtool_stats_get(u64 *data, const void *stats)
> *data++ = pool_stats->alloc_stats.refill;
> *data++ = pool_stats->alloc_stats.waive;
> *data++ = pool_stats->alloc_stats.item_fast_empty;
> + *data++ = pool_stats->alloc_stats.item_slow_failed;
> *data++ = pool_stats->recycle_stats.cached;
> *data++ = pool_stats->recycle_stats.cache_full;
> *data++ = pool_stats->recycle_stats.ring;
> @@ -431,6 +434,7 @@ static void page_pool_item_uninit(struct page_pool *pool)
> struct page_pool_item_block,
> list);
> list_del(&block->list);
> + WARN_ON(refcount_read(&block->ref));
> put_page(virt_to_page(block));
> }
> }
> @@ -514,10 +518,42 @@ static struct page_pool_item *page_pool_fast_item_alloc(struct page_pool *pool)
> return llist_entry(first, struct page_pool_item, lentry);
> }
>
> +static struct page_pool_item *page_pool_slow_item_alloc(struct page_pool *pool)
> +{
> + if (unlikely(!pool->slow_items.block ||
> + pool->slow_items.next_to_use >= ITEMS_PER_PAGE)) {
> + struct page_pool_item_block *block;
> + struct page *page;
> +
> + page = alloc_pages_node(pool->p.nid, GFP_ATOMIC | __GFP_NOWARN |
> + __GFP_ZERO, 0);
> + if (!page) {
> + alloc_stat_inc(pool, item_slow_failed);
> + return NULL;
> + }
I'm missing a counter that I can use to monitor the rate of page
allocations for these "item" block's.
In production want to have a metric that shows me a sudden influx of
that cause code to hit this "item_slow_alloc" case (inflight_slow_alloc)
BTW should this be called "inflight_block" instead of "item_block"?
> +
> + block = page_address(page);
> + block->pp = pool;
> + block->flags |= PAGE_POOL_SLOW_ITEM_BLOCK_BIT;
> + refcount_set(&block->ref, ITEMS_PER_PAGE);
> + pool->slow_items.block = block;
> + pool->slow_items.next_to_use = 0;
> +
> + spin_lock_bh(&pool->item_lock);
> + list_add(&block->list, &pool->item_blocks);
> + spin_unlock_bh(&pool->item_lock);
> + }
> +
> + return &pool->slow_items.block->items[pool->slow_items.next_to_use++];
> +}
> +
> static bool page_pool_set_item_info(struct page_pool *pool, netmem_ref netmem)
> {
> struct page_pool_item *item = page_pool_fast_item_alloc(pool);
>
> + if (unlikely(!item))
> + item = page_pool_slow_item_alloc(pool);
> +
> if (likely(item)) {
> item->pp_netmem = netmem;
> page_pool_item_set_used(item);
> @@ -527,6 +563,37 @@ static bool page_pool_set_item_info(struct page_pool *pool, netmem_ref netmem)
> return !!item;
> }
>
> +static void __page_pool_slow_item_free(struct page_pool *pool,
> + struct page_pool_item_block *block)
> +{
> + spin_lock_bh(&pool->item_lock);
> + list_del(&block->list);
> + spin_unlock_bh(&pool->item_lock);
> +
> + put_page(virt_to_page(block));
Here again I'm missing a counter that I can use to monitor the rate of
page free events.
In production I want a metric (e.g inflight_slow_free_block) that
together with "item_slow_alloc" (perhaps named
inflight_slow_alloc_block), show me if this code path is creating churn,
that I can correlate/explain some other influx event on the system.
BTW subtracting these (alloc - free) counters gives us the memory used.
> +}
> +
> +static void page_pool_slow_item_drain(struct page_pool *pool)
> +{
> + struct page_pool_item_block *block = pool->slow_items.block;
> +
> + if (!block || pool->slow_items.next_to_use >= ITEMS_PER_PAGE)
> + return;
> +
> + if (refcount_sub_and_test(ITEMS_PER_PAGE - pool->slow_items.next_to_use,
> + &block->ref))
> + __page_pool_slow_item_free(pool, block);
> +}
> +
> +static void page_pool_slow_item_free(struct page_pool *pool,
> + struct page_pool_item_block *block)
> +{
> + if (likely(!refcount_dec_and_test(&block->ref)))
> + return;
> +
> + __page_pool_slow_item_free(pool, block);
> +}
> +
> static void page_pool_fast_item_free(struct page_pool *pool,
> struct page_pool_item *item)
> {
> @@ -536,13 +603,22 @@ static void page_pool_fast_item_free(struct page_pool *pool,
> static void page_pool_clear_item_info(struct page_pool *pool, netmem_ref netmem)
> {
> struct page_pool_item *item = netmem_get_pp_item(netmem);
> + struct page_pool_item_block *block;
>
> DEBUG_NET_WARN_ON_ONCE(item->pp_netmem != netmem);
> DEBUG_NET_WARN_ON_ONCE(page_pool_item_is_mapped(item));
> DEBUG_NET_WARN_ON_ONCE(!page_pool_item_is_used(item));
> page_pool_item_clear_used(item);
> netmem_set_pp_item(netmem, NULL);
> - page_pool_fast_item_free(pool, item);
> +
> + block = page_pool_item_to_block(item);
> + if (likely(!(block->flags & PAGE_POOL_SLOW_ITEM_BLOCK_BIT))) {
> + DEBUG_NET_WARN_ON_ONCE(refcount_read(&block->ref));
> + page_pool_fast_item_free(pool, item);
> + return;
> + }
> +
> + page_pool_slow_item_free(pool, block);
> }
>
> /**
> @@ -1390,6 +1466,7 @@ void page_pool_destroy(struct page_pool *pool)
>
> page_pool_disable_direct_recycling(pool);
> page_pool_free_frag(pool);
> + page_pool_slow_item_drain(pool);
>
> if (!page_pool_release(pool))
> return;
> diff --git a/net/core/page_pool_priv.h b/net/core/page_pool_priv.h
> index a5df5ab14ead..37adfc766c12 100644
> --- a/net/core/page_pool_priv.h
> +++ b/net/core/page_pool_priv.h
> @@ -7,6 +7,8 @@
>
> #include "netmem_priv.h"
>
> +#define PAGE_POOL_SLOW_ITEM_BLOCK_BIT BIT(0)
> +
> extern struct mutex page_pools_lock;
>
> s32 page_pool_inflight(const struct page_pool *pool, bool strict);
> diff --git a/net/core/page_pool_user.c b/net/core/page_pool_user.c
> index c82a95beceff..0dc0090257ae 100644
> --- a/net/core/page_pool_user.c
> +++ b/net/core/page_pool_user.c
> @@ -33,7 +33,7 @@ DEFINE_MUTEX(page_pools_lock);
> * - user.list: unhashed, netdev: unknown
> */
>
> -typedef int (*pp_nl_fill_cb)(struct sk_buff *rsp, const struct page_pool *pool,
> +typedef int (*pp_nl_fill_cb)(struct sk_buff *rsp, struct page_pool *pool,
> const struct genl_info *info);
>
> static int
> @@ -111,7 +111,7 @@ netdev_nl_page_pool_get_dump(struct sk_buff *skb, struct netlink_callback *cb,
> }
>
> static int
> -page_pool_nl_stats_fill(struct sk_buff *rsp, const struct page_pool *pool,
> +page_pool_nl_stats_fill(struct sk_buff *rsp, struct page_pool *pool,
> const struct genl_info *info)
> {
> #ifdef CONFIG_PAGE_POOL_STATS
> @@ -212,8 +212,36 @@ int netdev_nl_page_pool_stats_get_dumpit(struct sk_buff *skb,
> return netdev_nl_page_pool_get_dump(skb, cb, page_pool_nl_stats_fill);
> }
>
> +static int page_pool_nl_fill_item_mem_info(struct page_pool *pool,
> + struct sk_buff *rsp)
> +{
> + struct page_pool_item_block *block;
> + size_t resident = 0, used = 0;
> + int err;
> +
> + spin_lock_bh(&pool->item_lock);
> +
> + list_for_each_entry(block, &pool->item_blocks, list) {
> + resident += PAGE_SIZE;
> +
> + if (block->flags & PAGE_POOL_SLOW_ITEM_BLOCK_BIT)
> + used += (PAGE_SIZE - sizeof(struct page_pool_item) *
> + refcount_read(&block->ref));
> + else
> + used += PAGE_SIZE;
> + }
> +
> + spin_unlock_bh(&pool->item_lock);
Holding a BH spin_lock can easily create production issues.
I worry how long time it will take to traverse these lists.
We (Cc Yan) are currently hunting down a number of real production issue
due to different cases of control-path code querying the kernel that
takes a _bh lock to read data, hurting the data-path processing.
If we had the stats counters, then this would be less work, right?
> +
> + err = nla_put_uint(rsp, NETDEV_A_PAGE_POOL_ITEM_MEM_RESIDENT, resident);
> + if (err)
> + return err;
> +
> + return nla_put_uint(rsp, NETDEV_A_PAGE_POOL_ITEM_MEM_USED, used);
> +}
> +
> static int
> -page_pool_nl_fill(struct sk_buff *rsp, const struct page_pool *pool,
> +page_pool_nl_fill(struct sk_buff *rsp, struct page_pool *pool,
> const struct genl_info *info)
> {
> size_t inflight, refsz;
> @@ -251,6 +279,9 @@ page_pool_nl_fill(struct sk_buff *rsp, const struct page_pool *pool,
> if (pool->mp_ops && pool->mp_ops->nl_fill(pool->mp_priv, rsp, NULL))
> goto err_cancel;
>
> + if (page_pool_nl_fill_item_mem_info(pool, rsp))
> + goto err_cancel;
> +
> genlmsg_end(rsp, hdr);
>
> return 0;
> @@ -259,7 +290,7 @@ page_pool_nl_fill(struct sk_buff *rsp, const struct page_pool *pool,
> return -EMSGSIZE;
> }
>
> -static void netdev_nl_page_pool_event(const struct page_pool *pool, u32 cmd)
> +static void netdev_nl_page_pool_event(struct page_pool *pool, u32 cmd)
> {
> struct genl_info info;
> struct sk_buff *ntf;
> diff --git a/tools/net/ynl/samples/page-pool.c b/tools/net/ynl/samples/page-pool.c
> index e5d521320fbf..16c48b6d3c2c 100644
> --- a/tools/net/ynl/samples/page-pool.c
> +++ b/tools/net/ynl/samples/page-pool.c
> @@ -16,6 +16,7 @@ struct stat {
> struct {
> unsigned int cnt;
> size_t refs, bytes;
> + size_t item_mem_resident, item_mem_used;
> } live[2];
>
> size_t alloc_slow, alloc_fast, recycle_ring, recycle_cache;
> @@ -52,6 +53,12 @@ static void count(struct stat *s, unsigned int l,
> s->live[l].refs += pp->inflight;
> if (pp->_present.inflight_mem)
> s->live[l].bytes += pp->inflight_mem;
> +
> + if (pp->_present.item_mem_resident)
> + s->live[l].item_mem_resident += pp->item_mem_resident;
> +
> + if (pp->_present.item_mem_used)
> + s->live[l].item_mem_used += pp->item_mem_used;
> }
>
> int main(int argc, char **argv)
> @@ -127,6 +134,10 @@ int main(int argc, char **argv)
> s->live[1].refs, s->live[1].bytes,
> s->live[0].refs, s->live[0].bytes);
>
> + printf("\t\titem_mem_resident: %zu item_mem_used: %zu (item_mem_resident: %zu item_mem_used: %zu)\n",
> + s->live[1].item_mem_resident, s->live[1].item_mem_used,
> + s->live[0].item_mem_resident, s->live[0].item_mem_used);
> +
> /* We don't know how many pages are sitting in cache and ring
> * so we will under-count the recycling rate a bit.
> */
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v10 3/4] page_pool: support unlimited number of inflight pages
2025-03-03 17:59 ` Simon Horman
@ 2025-03-04 12:25 ` Yunsheng Lin
0 siblings, 0 replies; 12+ messages in thread
From: Yunsheng Lin @ 2025-03-04 12:25 UTC (permalink / raw)
To: Simon Horman
Cc: davem, kuba, pabeni, zhangkun09, liuyonglong, fanghaiqing,
Robin Murphy, Alexander Duyck, IOMMU, Eric Dumazet, Donald Hunter,
Jesper Dangaard Brouer, Ilias Apalodimas, Andrew Lunn, netdev,
linux-kernel
On 2025/3/4 1:59, Simon Horman wrote:
> On Wed, Feb 26, 2025 at 07:03:38PM +0800, Yunsheng Lin wrote:
>> Currently a fixed size of pre-allocated memory is used to
>> keep track of the inflight pages, in order to use the DMA
>> API correctly.
>>
>> As mentioned [1], the number of inflight pages can be up to
>> 73203 depending on the use cases. Allocate memory dynamically
>> to keep track of the inflight pages when pre-allocated memory
>> runs out.
>>
>> The overhead of using dynamic memory allocation is about 10ns~
>> 20ns, which causes 5%~10% performance degradation for the test
>> case of time_bench_page_pool03_slow() in [2].
>>
>> 1. https://lore.kernel.org/all/b8b7818a-e44b-45f5-91c2-d5eceaa5dd5b@kernel.org/
>> 2. https://github.com/netoptimizer/prototype-kernel
>> CC: Robin Murphy <robin.murphy@arm.com>
>> CC: Alexander Duyck <alexander.duyck@gmail.com>
>> CC: IOMMU <iommu@lists.linux.dev>
>> Fixes: f71fec47c2df ("page_pool: make sure struct device is stable")
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> ---
>> Documentation/netlink/specs/netdev.yaml | 16 +++++
>> include/net/page_pool/types.h | 10 ++++
>> include/uapi/linux/netdev.h | 2 +
>> net/core/page_pool.c | 79 ++++++++++++++++++++++++-
>> net/core/page_pool_priv.h | 2 +
>> net/core/page_pool_user.c | 39 ++++++++++--
>> tools/net/ynl/samples/page-pool.c | 11 ++++
>> 7 files changed, 154 insertions(+), 5 deletions(-)
>
> Hi,
>
> It looks like the header changes in this patch don't quite
> correspond to the spec changes.
>
> But if so, perhaps the spec update needs to change,
> because adding values to an enum, other than at the end,
> feels like UAPI breakage to me.
>
> I see this:
>
> $ ./tools/net/ynl/ynl-regen.sh -f
Yes, It seems I only tested the tools/net/ynl/samples/page-pool, which
doesn't seems to catch the above problem.
Will update the spec changes to the the header changes and update
tools/include/uapi/linux/netdev.h accordingly too.
Thanks for the reporting.
> $ git diff
> diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
> index 9309cbfeb8d2..9e02f6190b07 100644
> --- a/include/uapi/linux/netdev.h
> +++ b/include/uapi/linux/netdev.h
> @@ -100,11 +100,11 @@ enum {
> NETDEV_A_PAGE_POOL_NAPI_ID,
> NETDEV_A_PAGE_POOL_INFLIGHT,
> NETDEV_A_PAGE_POOL_INFLIGHT_MEM,
> + NETDEV_A_PAGE_POOL_ITEM_MEM_RESIDENT,
> + NETDEV_A_PAGE_POOL_ITEM_MEM_USED,
> NETDEV_A_PAGE_POOL_DETACH_TIME,
> NETDEV_A_PAGE_POOL_DMABUF,
> NETDEV_A_PAGE_POOL_IO_URING,
> - NETDEV_A_PAGE_POOL_ITEM_MEM_RESIDENT,
> - NETDEV_A_PAGE_POOL_ITEM_MEM_USED,
>
> __NETDEV_A_PAGE_POOL_MAX,
> NETDEV_A_PAGE_POOL_MAX = (__NETDEV_A_PAGE_POOL_MAX - 1)
> diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
> index 7600bf62dbdf..9e02f6190b07 100644
> --- a/tools/include/uapi/linux/netdev.h
> +++ b/tools/include/uapi/linux/netdev.h
> @@ -100,6 +100,8 @@ enum {
> NETDEV_A_PAGE_POOL_NAPI_ID,
> NETDEV_A_PAGE_POOL_INFLIGHT,
> NETDEV_A_PAGE_POOL_INFLIGHT_MEM,
> + NETDEV_A_PAGE_POOL_ITEM_MEM_RESIDENT,
> + NETDEV_A_PAGE_POOL_ITEM_MEM_USED,
> NETDEV_A_PAGE_POOL_DETACH_TIME,
> NETDEV_A_PAGE_POOL_DMABUF,
> NETDEV_A_PAGE_POOL_IO_URING,
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v10 3/4] page_pool: support unlimited number of inflight pages
2025-03-04 9:25 ` Jesper Dangaard Brouer
@ 2025-03-05 12:19 ` Yunsheng Lin
0 siblings, 0 replies; 12+ messages in thread
From: Yunsheng Lin @ 2025-03-05 12:19 UTC (permalink / raw)
To: Jesper Dangaard Brouer, davem, kuba, pabeni
Cc: zhangkun09, liuyonglong, fanghaiqing, Robin Murphy,
Alexander Duyck, IOMMU, Eric Dumazet, Simon Horman, Donald Hunter,
Ilias Apalodimas, Andrew Lunn, netdev, linux-kernel, kernel-team,
Yan Zhai
On 2025/3/4 17:25, Jesper Dangaard Brouer wrote:
>> }
>> @@ -514,10 +518,42 @@ static struct page_pool_item *page_pool_fast_item_alloc(struct page_pool *pool)
>> return llist_entry(first, struct page_pool_item, lentry);
>> }
>> +static struct page_pool_item *page_pool_slow_item_alloc(struct page_pool *pool)
>> +{
>> + if (unlikely(!pool->slow_items.block ||
>> + pool->slow_items.next_to_use >= ITEMS_PER_PAGE)) {
>> + struct page_pool_item_block *block;
>> + struct page *page;
>> +
>> + page = alloc_pages_node(pool->p.nid, GFP_ATOMIC | __GFP_NOWARN |
>> + __GFP_ZERO, 0);
>> + if (!page) {
>> + alloc_stat_inc(pool, item_slow_failed);
>> + return NULL;
>> + }
>
> I'm missing a counter that I can use to monitor the rate of page
> allocations for these "item" block's.
> In production want to have a metric that shows me a sudden influx of
> that cause code to hit this "item_slow_alloc" case (inflight_slow_alloc)
It seems the 'item_fast_empty' stat added in patch 2 is the metric you
mention above? as we use those slow_items sequentially, the pages allocated
for slow_items can be calculated by 'item_fast_empty' / ITEMS_PER_PAGE.
>
> BTW should this be called "inflight_block" instead of "item_block"?
>
>
>> +
>> + block = page_address(page);
>> + block->pp = pool;
>> + block->flags |= PAGE_POOL_SLOW_ITEM_BLOCK_BIT;
>> + refcount_set(&block->ref, ITEMS_PER_PAGE);
>> + pool->slow_items.block = block;
>> + pool->slow_items.next_to_use = 0;
>> +
>> + spin_lock_bh(&pool->item_lock);
>> + list_add(&block->list, &pool->item_blocks);
>> + spin_unlock_bh(&pool->item_lock);
>> + }
>> +
>> + return &pool->slow_items.block->items[pool->slow_items.next_to_use++];
>> +}
...
>> +static void __page_pool_slow_item_free(struct page_pool *pool,
>> + struct page_pool_item_block *block)
>> +{
>> + spin_lock_bh(&pool->item_lock);
>> + list_del(&block->list);
>> + spin_unlock_bh(&pool->item_lock);
>> +
>> + put_page(virt_to_page(block));
>
> Here again I'm missing a counter that I can use to monitor the rate of
> page free events.
>
> In production I want a metric (e.g inflight_slow_free_block) that
> together with "item_slow_alloc" (perhaps named
> inflight_slow_alloc_block), show me if this code path is creating churn,
> that I can correlate/explain some other influx event on the system.
>
> BTW subtracting these (alloc - free) counters gives us the memory used.
If I understand it correctly, the 'item_fast_empty' is something like
the 'alloc' mentioned above, let's discuss the 'free' below.
>
>> +}
>> +
...
>> }
>> +static int page_pool_nl_fill_item_mem_info(struct page_pool *pool,
>> + struct sk_buff *rsp)
>> +{
>> + struct page_pool_item_block *block;
>> + size_t resident = 0, used = 0;
>> + int err;
>> +
>> + spin_lock_bh(&pool->item_lock);
>> +
>> + list_for_each_entry(block, &pool->item_blocks, list) {
>> + resident += PAGE_SIZE;
>> +
>> + if (block->flags & PAGE_POOL_SLOW_ITEM_BLOCK_BIT)
>> + used += (PAGE_SIZE - sizeof(struct page_pool_item) *
>> + refcount_read(&block->ref));
>> + else
>> + used += PAGE_SIZE;
>> + }
>> +
>> + spin_unlock_bh(&pool->item_lock);
>
> Holding a BH spin_lock can easily create production issues.
The above is not only give us the total pages used for page_pool_item,
but also give us the fragmentation info for those pages too.
So it seems the BH spin_lock is needed if we want the fragmentation info?
And the 'free' memtioned above can be calculated by 'memory used' - 'alloc'.
> I worry how long time it will take to traverse these lists.
I wouldn't worry about that as it is not supposed to be a lot of pages
in those list, if it is, it seems it is something we should be fixing
by increasing the size of fast_item or by defragmenting the slow_item
pages.
>
> We (Cc Yan) are currently hunting down a number of real production issue
> due to different cases of control-path code querying the kernel that
> takes a _bh lock to read data, hurting the data-path processing.
I am not sure if the above is the control-path here, I would rather
treat it as the debug-path?
>
> If we had the stats counters, then this would be less work, right?
It depends on if we want the fragmentation info or not as mentioned
above.
>
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v10 4/4] page_pool: skip dma sync operation for inflight pages
2025-02-26 11:03 ` [PATCH net-next v10 4/4] page_pool: skip dma sync operation for " Yunsheng Lin
@ 2025-03-05 17:29 ` Jason Gunthorpe
2025-03-06 11:32 ` Yunsheng Lin
0 siblings, 1 reply; 12+ messages in thread
From: Jason Gunthorpe @ 2025-03-05 17:29 UTC (permalink / raw)
To: Yunsheng Lin
Cc: davem, kuba, pabeni, zhangkun09, liuyonglong, fanghaiqing,
Robin Murphy, Alexander Duyck, IOMMU, Jesper Dangaard Brouer,
Ilias Apalodimas, Eric Dumazet, Simon Horman, netdev,
linux-kernel
On Wed, Feb 26, 2025 at 07:03:39PM +0800, Yunsheng Lin wrote:
> Skip dma sync operation for inflight pages before the
> sync operation in page_pool_item_unmap() as DMA API
> expects to be called with a valid device bound to a
> driver as mentioned in [1].
>
> After page_pool_destroy() is called, the page is not
> expected to be recycled back to pool->alloc cache and
> dma sync operation is not needed when the page is not
> recyclable or pool->ring is full, so only skip the dma
> sync operation for the infilght pages by clearing the
> pool->dma_sync, as rcu sync operation in
> page_pool_destroy() is paired with rcu lock in
> page_pool_recycle_in_ring() to ensure that there is no
> dma sync operation called after rcu sync operation.
Are you guaranteeing that the cache is made consistent before freeing
the page back to the mm? That is required..
Jason
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next v10 4/4] page_pool: skip dma sync operation for inflight pages
2025-03-05 17:29 ` Jason Gunthorpe
@ 2025-03-06 11:32 ` Yunsheng Lin
0 siblings, 0 replies; 12+ messages in thread
From: Yunsheng Lin @ 2025-03-06 11:32 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: davem, kuba, pabeni, zhangkun09, liuyonglong, fanghaiqing,
Robin Murphy, Alexander Duyck, IOMMU, Jesper Dangaard Brouer,
Ilias Apalodimas, Eric Dumazet, Simon Horman, netdev,
linux-kernel
On 2025/3/6 1:29, Jason Gunthorpe wrote:
> On Wed, Feb 26, 2025 at 07:03:39PM +0800, Yunsheng Lin wrote:
>> Skip dma sync operation for inflight pages before the
>> sync operation in page_pool_item_unmap() as DMA API
>> expects to be called with a valid device bound to a
>> driver as mentioned in [1].
>>
>> After page_pool_destroy() is called, the page is not
>> expected to be recycled back to pool->alloc cache and
>> dma sync operation is not needed when the page is not
>> recyclable or pool->ring is full, so only skip the dma
>> sync operation for the infilght pages by clearing the
>> pool->dma_sync, as rcu sync operation in
>> page_pool_destroy() is paired with rcu lock in
>> page_pool_recycle_in_ring() to ensure that there is no
>> dma sync operation called after rcu sync operation.
>
> Are you guaranteeing that the cache is made consistent before freeing
> the page back to the mm? That is required..
As page_pool is only calling the sync_for_device API before the
device triggers the DMA, and the above skip is only done for the
sync_for_device API after page_pool_destroy() is called, which
means there is no DMA messing with the page before freeing the
page back to the mm if I understand the question correctly.
And the driver is supposed to call sync_for_cpu API before
passing the page to networking stack, which passes the page
back to page_pool eventually.
>
> Jason
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2025-03-06 11:32 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-26 11:03 [PATCH net-next v10 0/4] fix the DMA API misuse problem for page_pool Yunsheng Lin
2025-02-26 11:03 ` [PATCH net-next v10 1/4] page_pool: introduce page_pool_get_pp() API Yunsheng Lin
2025-02-26 11:03 ` [PATCH net-next v10 2/4] page_pool: fix IOMMU crash when driver has already unbound Yunsheng Lin
2025-02-26 11:03 ` [PATCH net-next v10 3/4] page_pool: support unlimited number of inflight pages Yunsheng Lin
2025-03-03 17:59 ` Simon Horman
2025-03-04 12:25 ` Yunsheng Lin
2025-03-04 9:25 ` Jesper Dangaard Brouer
2025-03-05 12:19 ` Yunsheng Lin
2025-02-26 11:03 ` [PATCH net-next v10 4/4] page_pool: skip dma sync operation for " Yunsheng Lin
2025-03-05 17:29 ` Jason Gunthorpe
2025-03-06 11:32 ` Yunsheng Lin
2025-02-28 2:15 ` [PATCH net-next v10 0/4] fix the DMA API misuse problem for page_pool Jakub Kicinski
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).