From: Yunsheng Lin <linyunsheng@huawei.com>
To: "Toke Høiland-Jørgensen" <toke@redhat.com>,
"Jesper Dangaard Brouer" <hawk@kernel.org>,
davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com
Cc: <zhangkun09@huawei.com>, <fanghaiqing@huawei.com>,
<liuyonglong@huawei.com>, Robin Murphy <robin.murphy@arm.com>,
Alexander Duyck <alexander.duyck@gmail.com>,
IOMMU <iommu@lists.linux.dev>,
Andrew Morton <akpm@linux-foundation.org>,
Eric Dumazet <edumazet@google.com>,
Ilias Apalodimas <ilias.apalodimas@linaro.org>,
<linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
<netdev@vger.kernel.org>,
kernel-team <kernel-team@cloudflare.com>
Subject: Re: [PATCH net-next v3 3/3] page_pool: fix IOMMU crash when driver has already unbound
Date: Wed, 30 Oct 2024 19:30:59 +0800 [thread overview]
Message-ID: <1eac33ae-e8e1-4437-9403-57291ba4ced6@huawei.com> (raw)
In-Reply-To: <878qu7c8om.fsf@toke.dk>
On 2024/10/29 21:58, Toke Høiland-Jørgensen wrote:
> Yunsheng Lin <linyunsheng@huawei.com> writes:
>
>>>> I would prefer the waiting too, if simple waiting fixed the test cases that
>>>> Yonglong and Haiqing were reporting and I had not looked into the rabbit hole
>>>> of possible caching in networking.
>>>>
>>>> As mentioned in commit log and [1]:
>>>> 1. ipv4 packet defragmentation timeout: this seems to cause delay up to 30
>>>> secs, which was reported by Haiqing.
>>>> 2. skb_defer_free_flush(): this may cause infinite delay if there is no
>>>> triggering for net_rx_action(), which was reported by Yonglong.
>>>>
>>>> For case 1, is it really ok to stall the driver unbound up to 30 secs for the
>>>> default setting of defragmentation timeout?
>>>>
>>>> For case 2, it is possible to add timeout for those kind of caching like the
>>>> defragmentation timeout too, but as mentioned in [2], it seems to be a normal
>>>> thing for this kind of caching in networking:
>>>
>>> Both 1 and 2 seem to be cases where the netdev teardown code can just
>>> make sure to kick the respective queues and make sure there's nothing
>>> outstanding (for (1), walk the defrag cache and clear out anything
>>> related to the netdev going away, for (2) make sure to kick
>>> net_rx_action() as part of the teardown).
>>
>> It would be good to be more specific about the 'kick' here, does it mean
>> taking the lock and doing one of below action for each cache instance?
>> 1. flush all the cache of each cache instance.
>> 2. scan for the page_pool owned pages and do the fine-grained flushing.
>
> Depends on the context. The page pool is attached to a device, so it
> should be possible to walk the skb frags queue and just remove any skbs
> that refer to that netdevice, or something like that.
I am not sure the netdevice is still the same after the skb has passed through
all sorts of software netdevices; checking whether the page is a page_pool owned
page seems safer?
The scanning/flushing also seems complicated and hard to get right if it depends
on the internal details of another subsystem's cache implementation.
>
> As for the lack of net_rx_action(), this is related to the deferred
> freeing of skbs, so it seems like just calling skb_defer_free_flush() on
> teardown could be an option.
That was my initial thinking for the above case too, if we knew which per-CPU
softnet_data to pass to skb_defer_free_flush(), or which CPU to trigger
net_rx_action() on.
But it seems hard to tell which CPU the napi was running on before the napi is
disabled, which means skb_defer_free_flush() might need to be called for every
CPU with softirqs disabled, as skb_defer_free_flush() calls napi_consume_skb()
with a budget of 1, or kick_defer_list_purge() would need to be called for each
CPU.
>
>>>> "Eric pointed out/predicted there's no guarantee that applications will
>>>> read / close their sockets so a page pool page may be stuck in a socket
>>>> (but not leaked) forever."
>>>
>>> As for this one, I would put that in the "well, let's see if this
>>> becomes a problem in practice" bucket.
>>
>> As per the commit log in [2], it seems it is already happening.
>>
>> Those caches are mostly per-CPU and per-socket, and there may be hundreds of
>> CPUs and thousands of sockets in one system. Are you really sure we need to
>> take the lock of each cache instance, which may be thousands of them, and do
>> the flushing/scanning of memory used in networking, which may be as large as
>> the '24 GiB' mentioned by Jesper?
>
> Well, as above, the two issues you mentioned are per-netns (or possibly
> per-CPU), so those seem to be manageable to do on device teardown if the
> wait is really a problem.
As above, I am not sure it is still the same netns after the skb has passed
through all sorts of software netdevices.
>
> But, well, I'm not sure it is? You seem to be taking it as axiomatic
> that the wait in itself is bad. Why? It's just a bit of memory being held
> on to while it is still in use, and so what?
Actually, I thought about adding some sort of timeout or kicking on top of
Jakub's waiting patch too.
But after looking at more of the caching in networking, waiting plus
kicking/flushing seems harder than recording the inflight pages, mainly because
kicking/flushing requires every subsystem using page_pool owned pages to provide
a kicking/flushing mechanism for it to work, not to mention how much time all
that kicking/flushing would take.

It seems the rdma subsystem uses a similar mechanism:
https://lwn.net/Articles/989087/
>
> -Toke
>
>
>