From: Lorenzo Bianconi <lorenzo@kernel.org>
To: Jesper Dangaard Brouer <jbrouer@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>,
	brouer@redhat.com, Jakub Kicinski <kuba@kernel.org>,
	netdev@vger.kernel.org, hawk@kernel.org,
	ilias.apalodimas@linaro.org, davem@davemloft.net,
	pabeni@redhat.com, bpf@vger.kernel.org,
	lorenzo.bianconi@redhat.com, nbd@nbd.name,
	Toke Hoiland Jorgensen <toke@redhat.com>
Subject: Re: issue with inflight pages from page_pool
Date: Wed, 19 Apr 2023 16:21:50 +0200	[thread overview]
Message-ID: <ZD/4/npAIvS1Co6e@lore-desk> (raw)
In-Reply-To: <ea762132-a6ff-379a-2cc2-6057754425f7@redhat.com>


> 
> On 19/04/2023 14.09, Eric Dumazet wrote:
> > On Wed, Apr 19, 2023 at 1:08 PM Jesper Dangaard Brouer wrote:
> > > 
> > > 
> > > On 18/04/2023 09.36, Lorenzo Bianconi wrote:
> > > > > On Mon, 17 Apr 2023 23:31:01 +0200 Lorenzo Bianconi wrote:
> > > > > > > If it's that then I'm with Eric. There are many ways to keep the pages
> > > > > > > in use, no point working around one of them and not the rest :(
> > > > > > 
> > > > > > I was not clear here, my fault. What I mean is that I can see the
> > > > > > returned-pages counter increasing from time to time, but in most of the
> > > > > > tests, even 2h after the TCP traffic has stopped, page_pool_release_retry()
> > > > > > still complains that not all the pages have been returned to the pool, so
> > > > > > the pool has not been deallocated yet.
> > > > > > The chunk of code in my first email is just there to demonstrate the issue
> > > > > > and I am completely fine with a better solution :)
> > > > > 
> > > > > Your problem is perhaps made worse by threaded NAPI, you have
> > > > > defer-free skbs sprayed across all cores and no NAPI there to
> > > > > flush them :(
> > > > 
> > > > yes, exactly :)
> > > > 
> > > > > 
> > > > > > I guess we just need a way to free the pool in a reasonable amount
> > > > > > of time. Agree?
> > > > > 
> > > > > Whether we need to guarantee the release is the real question.
> > > > 
> > > > yes, this is the main point of my email. The defer-free skb behaviour seems to
> > > > conflict with the page_pool pending-pages monitoring mechanism, or at least
> > > > the two do not work well together.
> > > > 
> > > > @Jesper, Ilias: any input on it?
> > > > 
> > > > > Maybe it's more of a false-positive warning.
> > > > > 
> > > > > Flushing the defer list is probably fine as a hack, but it's not
> > > > > a full fix as Eric explained. False positive can still happen.
> > > > 
> > > > agree, it was just a way to give an idea of the issue, not a proper solution.
> > > > 
> > > > Regards,
> > > > Lorenzo
> > > > 
> > > > > 
> > > > > I'm ambivalent. My only real request would be to make the flushing
> > > > > a helper in net/core/dev.c rather than open-coded in page_pool.c.
> > > 
> > > I agree. We need a central defer_list flushing helper
> > > 
> > > It is too easy to say this is a false-positive warning.
> > > IMHO this exposes an issue with the sd->defer_list system.
> > > 
> > > Lorenzo's test adds and removes veth devices, which creates and runs
> > > NAPI processing on random CPUs.  After the veth netdevices (+NAPI) are
> > > removed, nothing will naturally invoke net_rx_softirq on those CPUs.
> > > Thus, we have SKBs waiting on those CPUs' sd->defer_list.  Furthermore,
> > > no new SKBs will be created with that skb->alloc_cpu to trigger the RX
> > > softirq IPI call (trigger_rx_softirq), even if that CPU processes and
> > > frees SKBs.
> > > 
> > > I see two solutions:
> > > 
> > >    (1) When netdevice/NAPI unregister happens, call the defer_list
> > > flushing helper.
> > > 
> > >    (2) Use napi_watchdog to detect if the defer_list is (many jiffies)
> > > old, and then call the defer_list flushing helper.
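
Just to make the flushing helper concrete, I guess something like the
following completely untested sketch could work for both (1) and (2); the
skb_defer_free_flush_all() name is made up, and it simply applies the logic
of the existing per-CPU skb_defer_free_flush() in net/core/dev.c to every
possible CPU:

void skb_defer_free_flush_all(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		struct softnet_data *sd = &per_cpu(softnet_data, cpu);
		struct sk_buff *skb, *next;

		if (!READ_ONCE(sd->defer_count))
			continue;

		/* Detach the whole deferred list for this CPU. */
		spin_lock_irq(&sd->defer_lock);
		skb = sd->defer_list;
		sd->defer_list = NULL;
		sd->defer_count = 0;
		spin_unlock_irq(&sd->defer_lock);

		/* kfree_skb() instead of napi_consume_skb() since this may
		 * run outside of softirq context.
		 */
		while (skb) {
			next = skb->next;
			kfree_skb(skb);
			skb = next;
		}
	}
}
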
> > > 
> > > 
> > > > > 
> > > > > Somewhat related - Eric, do we need to handle defer_list in dev_cpu_dead()?
> > > 
> > > Looks to me like dev_cpu_dead() also needs this flushing helper for
> > > sd->defer_list, or at least to move the sd->defer_list to an sd that
> > > will eventually run.
> > 
> > I think I just considered that having a few skbs in a per-cpu list would
> > not be an issue, especially considering skbs can sit for hours in TCP
> > receive queues.
> > 
> 
> It was the first thing I said to Lorenzo when he first reported the
> problem to me (over chat): it is likely packets sitting in a TCP queue.
> Then I instructed him to look at the netstat output to see the queues and
> look for TIME-WAIT, FIN-WAIT, etc.
> 
> 
> > Do we expect having some kind of callback/shrinker to instruct TCP or
> > pipes to release all the pages that prevent
> > a page_pool from being freed?
> > 
> 
> This is *not* what I'm asking for.
> 
> With TCP sockets (pipes etc) we can take care of closing the sockets
> (and programs etc) to free up the SKBs (and perhaps wait for timeouts)
> to make sure the page_pool shutdown doesn't hang.
> 
> The problem arises for all the selftests that use veth and bpf_test_run
> (using bpf_test_run_xdp_live / xdp_test_run_setup).  For the selftests
> we obviously take care of closing sockets and removing the veth interfaces
> again.  Problem: the defer_list corner case isn't under our control.
> 
> 
> > Here, we are talking about hundreds of thousands of skbs, compared to at
> > most 32 skbs per cpu.
> > 
> 
> It is not a memory usage concern.
> 
> > Perhaps set sysctl_skb_defer_max to zero by default, so that admins
> > can opt in.
> > 
> 
> I really like the sd->defer_list system and I think it should be enabled
> by default.  Even if it were disabled by default, we would still need to
> handle these corner cases, as the selftests shouldn't start to cause
> issues when this gets enabled.
> 
> The simple solution is: (1) when netdevice/NAPI unregister happens, call
> the defer_list flushing helper.  And perhaps we also need to call it in
> xdp_test_run_teardown().  How do you feel about that?
> 
> --Jesper
> 

Today I was discussing this issue with Toke, and we were wondering: if we just
consider the page_pool use-case, what about moving the real pool-destroy steps
to the point where a page is returned to the pool in page_pool_put_full_page(),
running them once the pool has been marked for destruction and there are no
inflight pages left, instead of assuming all the pages are already back in the
pool when we run page_pool_destroy()?
Maybe this just means getting rid of the warning in page_pool_release_retry() :)
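
Something along these lines, completely untested and ignoring the obvious
races, just to show the idea (the "destroy_pending" flag and the
page_pool_free_if_idle() name are made up):

/* Called at the end of the page return path, e.g. from
 * page_pool_put_full_page(), once the page has been given back.
 */
static void page_pool_free_if_idle(struct page_pool *pool)
{
	/* page_pool_destroy() would only set this marker instead of
	 * retrying/warning in page_pool_release_retry().
	 */
	if (!READ_ONCE(pool->destroy_pending))
		return;

	/* Reuse the existing hold_cnt/release_cnt accounting. */
	if (page_pool_inflight(pool) > 0)
		return;

	page_pool_free(pool);
}

so the last page returned to the pool would perform the actual destroy, and
page_pool_release_retry() (and its warning) would go away.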

Regards,
Lorenzo

