All of lore.kernel.org
 help / color / mirror / Atom feed
From: Yunsheng Lin <yunshenglin0825@gmail.com>
To: Mina Almasry <almasrymina@google.com>,
	Yunsheng Lin <linyunsheng@huawei.com>
Cc: davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com,
	zhangkun09@huawei.com, liuyonglong@huawei.com,
	fanghaiqing@huawei.com, Robin Murphy <robin.murphy@arm.com>,
	Alexander Duyck <alexander.duyck@gmail.com>,
	IOMMU <iommu@lists.linux.dev>,
	Andrew Morton <akpm@linux-foundation.org>,
	Eric Dumazet <edumazet@google.com>,
	Simon Horman <horms@kernel.org>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	Ilias Apalodimas <ilias.apalodimas@linaro.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org
Subject: Re: [PATCH net-next v9 2/4] page_pool: fix IOMMU crash when driver has already unbound
Date: Sat, 15 Feb 2025 20:13:42 +0800	[thread overview]
Message-ID: <e65c19d6-4ce0-4dd2-a607-e08d47b6fe64@gmail.com> (raw)
In-Reply-To: <CAHS8izPZe0UHn8P38EvzX0ei_jGJnsXg99B5ra9Ldu09aWBU-Q@mail.gmail.com>

On 2/15/2025 4:58 AM, Mina Almasry wrote:
> On Wed, Feb 12, 2025 at 1:34 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> Networking driver with page_pool support may hand over page
>> still with dma mapping to network stack and try to reuse that
>> page after network stack is done with it and passes it back
>> to page_pool to avoid the penalty of dma mapping/unmapping.
>> With all the caching in the network stack, some pages may be
>> held in the network stack without returning to the page_pool
>> soon enough, and with VF disable causing the driver unbound,
>> the page_pool does not stop the driver from doing it's
>> unbounding work, instead page_pool uses workqueue to check
>> if there is some pages coming back from the network stack
>> periodically, if there is any, it will do the dma unmmapping
>> related cleanup work.
>>
>> As mentioned in [1], attempting DMA unmaps after the driver
>> has already unbound may leak resources or at worst corrupt
>> memory. Fundamentally, the page pool code cannot allow DMA
>> mappings to outlive the driver they belong to.
>>
>> Currently it seems there are at least two cases that the page
>> is not released fast enough causing dma unmmapping done after
>> driver has already unbound:
>> 1. ipv4 packet defragmentation timeout: this seems to cause
>>     delay up to 30 secs.
>> 2. skb_defer_free_flush(): this may cause infinite delay if
>>     there is no triggering for net_rx_action().
>>
>> In order not to call DMA APIs to do DMA unmmapping after driver
>> has already unbound and stall the unloading of the networking
>> driver, use some pre-allocated item blocks to record inflight
>> pages including the ones which are handed over to network stack,
>> so the page_pool can do the DMA unmmapping for those pages when
>> page_pool_destroy() is called. As the pre-allocated item blocks
>> need to be large enough to avoid performance degradation, add a
>> 'item_fast_empty' stat to indicate the unavailability of the
>> pre-allocated item blocks.
>>
>> By using the 'struct page_pool_item' referenced by page->pp_item,
>> page_pool is not only able to keep track of the inflight page to
>> do dma unmmaping if some pages are still handled in networking
>> stack when page_pool_destroy() is called, and networking stack is
>> also able to find the page_pool owning the page when returning
>> pages back into page_pool:
>> 1. When a page is added to the page_pool, an item is deleted from
>>     pool->hold_items and set the 'pp_netmem' pointing to that page
>>     and set item->state and item->pp_netmem accordingly in order to
>>     keep track of that page, refill from pool->release_items when
>>     pool->hold_items is empty or use the item from pool->slow_items
>>     when fast items run out.
>> 2. When a page is released from the page_pool, it is able to tell
>>     which page_pool this page belongs to by masking off the lower
>>     bits of the pointer to page_pool_item *item, as the 'struct
>>     page_pool_item_block' is stored in the top of a struct page. And
>>     after clearing the pp_item->state', the item for the released page
>>     is added back to pool->release_items so that it can be reused for
>>     new pages or just free it when it is from the pool->slow_items.
>> 3. When page_pool_destroy() is called, item->state is used to tell if
>>     a specific item is being used/dma mapped or not by scanning all the
>>     item blocks in pool->item_blocks, then item->netmem can be used to
>>     do the dma unmmaping if the corresponding inflight page is dma
>>     mapped.
>>
>> The overhead of tracking of inflight pages is about 10ns~20ns,
>> which causes about 10% performance degradation for the test case
>> of time_bench_page_pool03_slow() in [2].
>>
>> Note, the devmem patchset seems to make the bug harder to fix,
>> and may make backporting harder too. As there is no actual user
>> for the devmem and the fixing for devmem is unclear for now,
>> this patch does not consider fixing the case for devmem yet.
>>
>> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
>> 2. https://github.com/netoptimizer/prototype-kernel
>> CC: Robin Murphy <robin.murphy@arm.com>
>> CC: Alexander Duyck <alexander.duyck@gmail.com>
>> CC: IOMMU <iommu@lists.linux.dev>
>> Fixes: f71fec47c2df ("page_pool: make sure struct device is stable")
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> Tested-by: Yonglong Liu <liuyonglong@huawei.com>
> [...]
>> +
>> +/* The size of item_block is always PAGE_SIZE, so that the address of item_block
>> + * for a specific item can be calculated using 'item & PAGE_MASK'
>> + */
>> +struct page_pool_item_block {
>> +       struct page_pool *pp;
>> +       struct list_head list;
>> +       struct page_pool_item items[];
>> +};
>> +
> 
> I think this feedback was mentioned in earlier iterations of the series:
> Can we not hold a struct list_head in the page_pool that keeps track
> of inflight netmems that we need to dma-unmap on page_pool_destroy?
> Why do we have to modify the pp entry in the struct page and struct
> net_iov?

Jesper asked a similar question in [1], see below:

1. 
https://lore.kernel.org/all/95f258b2-52f5-4a80-a670-b9a182caec7c@kernel.org/

> 
> The decision to modify pp entry in struct page and struct net_iov is
> making this patchset bigger and harder to review IMO.

I am open to any better suggestion of simple and easy way to fix this
with little performance impact as much as possible, but I really
doubtful that using some advanced struct like 'list_head" as suggested
above will not cause significant performance degradation.


  reply	other threads:[~2025-02-15 12:13 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-12  9:25 [PATCH net-next v9 0/4] fix the DMA API misuse problem for page_pool Yunsheng Lin
2025-02-12  9:25 ` [Intel-wired-lan] " Yunsheng Lin
2025-02-12  9:25 ` [PATCH net-next v9 1/4] page_pool: introduce page_pool_get_pp() API Yunsheng Lin
2025-02-12  9:25   ` [Intel-wired-lan] " Yunsheng Lin
2025-02-12  9:25 ` [PATCH net-next v9 2/4] page_pool: fix IOMMU crash when driver has already unbound Yunsheng Lin
2025-02-14 20:58   ` Mina Almasry
2025-02-15 12:13     ` Yunsheng Lin [this message]
2025-02-12  9:25 ` [PATCH net-next v9 3/4] page_pool: support unlimited number of inflight pages Yunsheng Lin
2025-02-21 10:12   ` Jesper Dangaard Brouer
2025-02-22  8:11     ` Yunsheng Lin
2025-02-12  9:25 ` [PATCH net-next v9 4/4] page_pool: skip dma sync operation for " Yunsheng Lin
2025-02-12 18:53 ` [PATCH net-next v9 0/4] fix the DMA API misuse problem for page_pool Matthew Wilcox
2025-02-12 18:53   ` [Intel-wired-lan] " Matthew Wilcox
2025-02-13 11:13   ` Yunsheng Lin
2025-02-13 11:13     ` [Intel-wired-lan] " Yunsheng Lin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e65c19d6-4ce0-4dd2-a607-e08d47b6fe64@gmail.com \
    --to=yunshenglin0825@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=alexander.duyck@gmail.com \
    --cc=almasrymina@google.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=fanghaiqing@huawei.com \
    --cc=hawk@kernel.org \
    --cc=horms@kernel.org \
    --cc=ilias.apalodimas@linaro.org \
    --cc=iommu@lists.linux.dev \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linyunsheng@huawei.com \
    --cc=liuyonglong@huawei.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=robin.murphy@arm.com \
    --cc=zhangkun09@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.