Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Yunsheng Lin <yunshenglin0825@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	Ilias Apalodimas <ilias.apalodimas@linaro.org>,
	"David S. Miller" <davem@davemloft.net>
Cc: Yunsheng Lin <linyunsheng@huawei.com>,
	Yonglong Liu <liuyonglong@huawei.com>,
	Mina Almasry <almasrymina@google.com>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Simon Horman <horms@kernel.org>,
	linux-mm@kvack.org, netdev@vger.kernel.org
Subject: Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
Date: Mon, 10 Mar 2025 10:13:32 +0100	[thread overview]
Message-ID: <87cyepxn7n.fsf@toke.dk> (raw)
In-Reply-To: <d84e19c9-be0c-4d23-908b-f5e5ab6f3f3f@gmail.com>

Yunsheng Lin <yunshenglin0825@gmail.com> writes:

> On 3/8/2025 10:54 PM, Toke Høiland-Jørgensen wrote:
>> When enabling DMA mapping in page_pool, pages are kept DMA mapped until
>> they are released from the pool, to avoid the overhead of re-mapping the
>> pages every time they are used. This causes problems when a device is
>> torn down, because the page pool can't unmap the pages until they are
>> returned to the pool. This causes resource leaks and/or crashes when
>> there are pages still outstanding while the device is torn down, because
>> page_pool will attempt an unmap of a non-existent DMA device on the
>> subsequent page return.
>> 
>> To fix this, implement a simple tracking of outstanding dma-mapped pages
>> in page pool using an xarray. This was first suggested by Mina[0], and
>> turns out to be fairly straight forward: We simply store pointers to
>> pages directly in the xarray with xa_alloc() when they are first DMA
>> mapped, and remove them from the array on unmap. Then, when a page pool
>> is torn down, it can simply walk the xarray and unmap all pages still
>> present there before returning, which also allows us to get rid of the
>> get/put_device() calls in page_pool. Using xa_cmpxchg(), no additional
>> synchronisation is needed, as a page will only ever be unmapped once.
>
> The implementation of xa_cmpxchg() seems to take the xa_lock, which
> seems to be a per-Xarray spin_lock.
> Yes, if if we were to take a per-Xarray lock unconditionaly, additional
> synchronisation like rcu doesn't seems to be needed. But it seems an
> excessive overhead for normal packet processing when page_pool_destroy()
> is not called yet?

I dunno, maybe? It's only done on DMA map/unmap, so it may be
acceptable? We should get some benchmark results to assess :)

> Also, we might need a similar locking or synchronisation for the dma
> sync API in order to skip the dma sync API when page_pool_destroy() is
> called too.

Good point, but that seems a separate issue? And simpler to solve (just
set pool->dma_sync to false when destroying?).

>> To avoid having to walk the entire xarray on unmap to find the page
>> reference, we stash the ID assigned by xa_alloc() into the page
>> structure itself, in the field previously called '_pp_mapping_pad' in
>> the page_pool struct inside struct page. This field overlaps with the
>> page->mapping pointer, which may turn out to be problematic, so an
>> alternative is probably needed. Sticking the ID into some of the upper
>> bits of page->pp_magic may work as an alternative, but that requires
>> further investigation. Using the 'mapping' field works well enough as
>> a demonstration for this RFC, though.
> page->pp_magic seems to only have 16 bits space available at most when
> trying to reuse some unused bits in page->pp_magic, as BPF_PTR_POISON
> and STACK_DEPOT_POISON seems to already use 16 bits, so:
> 1. For 32 bits system, it seems there is only 16 bits left even if the
>     ILLEGAL_POINTER_VALUE is defined as 0x0.
> 2. For 64 bits system, it seems there is only 12 bits left for powerpc
>     as ILLEGAL_POINTER_VALUE is defined as 0x5deadbeef0000000, see
>     arch/powerpc/Kconfig.
>
> So it seems we might need to limit the dma mapping count to 4096 or
> 65536?
>
> And to be honest, I am not sure if those 'unused' 12/16 bits can really 
> be reused or not, I guess we might need suggestion from mm and arch
> experts here.

Why do we need to care about BPF_PTR_POISON and STACK_DEPOT_POISON?
AFAICT, we only need to make sure we preserve the PP_SIGNATURE value.
See v2 of the RFC patch, the bit arithmetic there gives me:

- 24 bits on 32-bit architectures
- 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
- 32 bits on other 64-bit architectures

Which seems to be plenty?

>> Since all the tracking is performed on DMA map/unmap, no additional code
>> is needed in the fast path, meaning the performance overhead of this
>> tracking is negligible. The extra memory needed to track the pages is
>> neatly encapsulated inside xarray, which uses the 'struct xa_node'
>> structure to track items. This structure is 576 bytes long, with slots
>> for 64 items, meaning that a full node occurs only 9 bytes of overhead
>> per slot it tracks (in practice, it probably won't be this efficient,
>> but in any case it should be an acceptable overhead).
>
> Even if items is stored sequentially in xa_node at first, is it possible
> that there may be fragmentation in those xa_node when pages are released
> and allocated many times during packet processing? If yes, is there any
> fragmentation info about those xa_node?

Some (that's what I mean with "not as efficient"). AFAICT, xa_array does
do some rebalancing of the underlying radix tree, freeing nodes when
they are no longer used. However, I am not too familiar with the xarray
code, so I don't know exactly how efficient this is in practice.

>> [0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/
>> 
>> Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code")
>> Reported-by: Yonglong Liu <liuyonglong@huawei.com>
>> Suggested-by: Mina Almasry <almasrymina@google.com>
>> Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> Tested-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>> ---
>> This is an alternative to Yunsheng's series. Yunsheng requested I send
>> this as an RFC to better be able to discuss the different approaches; see
>> some initial discussion in[1], also regarding where to store the ID as
>> alluded to above.
>
> As mentioned before, I am not really convinced there is still any
> space left in 'struct page' yet, otherwise we might already use that
> space to fix the DMA address > 32 bits problem in 32 bits system, see
> page_pool_set_dma_addr_netmem().

See above.

> Also, Using the more space in 'struct page' for the page_pool seems to
> make page_pool more coupled to the mm subsystem, which seems to not
> align with the folios work that is trying to decouple non-mm subsystem
> from the mm subsystem by avoid other subsystem using more of the 'struct
> page' as metadata from the long term point of view.

This seems a bit theoretical; any future changes of struct page would
have to shuffle things around so we still have the ID available,
obviously :)

-Toke

next prev parent reply	other threads:[~2025-03-10  9:13 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-08 14:54 [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool Toke Høiland-Jørgensen
2025-03-08 19:22 ` Mina Almasry
2025-03-09 12:42   ` Toke Høiland-Jørgensen
2025-03-11 13:25     ` Pavel Begunkov
2025-03-11 13:44       ` Toke Høiland-Jørgensen
2025-03-11 13:56         ` Pavel Begunkov
2025-03-11 15:49           ` Toke Høiland-Jørgensen
2025-03-11 16:46         ` Matthew Wilcox
2025-03-12 10:56           ` Toke Høiland-Jørgensen
2025-03-12 12:55           ` Pavel Begunkov
2025-03-10  7:17   ` Pavel Begunkov
2025-03-09 13:27 ` Yunsheng Lin
2025-03-10  9:13   ` Toke Høiland-Jørgensen [this message]
2025-03-10 12:35     ` Yunsheng Lin
2025-03-10 15:24       ` Toke Høiland-Jørgensen
2025-03-11 12:19         ` Yunsheng Lin
2025-03-11 13:26           ` Toke Høiland-Jørgensen
2025-03-12 12:04             ` Yunsheng Lin
2025-03-12 12:27               ` Toke Høiland-Jørgensen
2025-03-12 12:53                 ` Pavel Begunkov
2025-03-10 15:42     ` Matthew Wilcox
2025-03-10 17:26       ` Toke Høiland-Jørgensen
2025-03-11 15:03         ` Matthew Wilcox
2025-03-11 15:32           ` Toke Høiland-Jørgensen
2025-03-11 12:25       ` Yunsheng Lin
2025-03-11 15:11         ` Matthew Wilcox
2025-03-12 12:05           ` Yunsheng Lin
2025-03-12 18:35             ` Shuah
2025-03-12 18:48               ` shuah
2025-03-12 18:56                 ` Matthew Wilcox
2025-03-12 22:25                   ` Shuah Khan
2025-03-14 18:09                 ` Matthew Wilcox

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87cyepxn7n.fsf@toke.dk \
    --to=toke@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=almasrymina@google.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=hawk@kernel.org \
    --cc=horms@kernel.org \
    --cc=ilias.apalodimas@linaro.org \
    --cc=kuba@kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linyunsheng@huawei.com \
    --cc=liuyonglong@huawei.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=yunshenglin0825@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.