netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jakub Kicinski <kuba@kernel.org>
To: Mina Almasry <almasrymina@google.com>
Cc: netdev@vger.kernel.org, hawk@kernel.org,
	ilias.apalodimas@linaro.org, edumazet@google.com,
	dsahern@gmail.com, michael.chan@broadcom.com, willemb@google.com
Subject: Re: [RFC 00/12] net: huge page backed page_pool
Date: Fri, 7 Jul 2023 15:45:03 -0700	[thread overview]
Message-ID: <20230707154503.57cc834e@kernel.org> (raw)
In-Reply-To: <CAHS8izPmQRuBfBB2ddna-RHvorvJs7VtVUqCW80MaxPLUtDHGA@mail.gmail.com>

On Fri, 7 Jul 2023 12:45:26 -0700 Mina Almasry wrote:
> > This is an "early PoC" at best. It seems to work for a basic
> > traffic test but there's no uAPI and a lot more general polish
> > is needed.
> >
> > The problem we're seeing is that performance of some older NICs
> > degrades quite a bit when IOMMU is used (in non-passthru mode).
> > There is a long tail of old NICs deployed, especially in PoPs/
> > /on edge. From a conversation I had with Eric a few months
> > ago it sounded like others may have similar issues. So I thought
> > I'd take a swing at getting page pool to feed drivers huge pages.
> > 1G pages require hooking into early init via CMA but it works
> > just fine.
> >
> > I haven't tested this with a real workload, because I'm still
> > waiting to get my hands on the right machine. But the experiment
> > with bnxt shows a ~90% reduction in IOTLB misses (670k -> 70k).
> 
> Thanks for CCing me Jakub. I'm working on a proposal for device memory
> TCP, and I recently migrated it to be on top of your pp-provider idea
> and I think I can share my test results as well. I had my code working
> on top of your slightly older API I found here a few days ago:
> https://github.com/kuba-moo/linux/tree/pp-providers
> 
> On top of the old API I had something with all my functionality tests
> passing and performance benchmarking hitting ~96.5% line rate (with
> all data going straight to the device - GPU - memory, which is the
> point of the proposal). Of course, when you look at the code you may
> not like the approach and I may need to try something else, which is
> perfectly fine, but my current implementation is pp-provider based.
> 
> I'll look into rebasing my changes on top of this RFC and retesting,
> but I should share my RFC either way sometime next week maybe. I took
> a quick look at the changes you made here, and I don't think you
> changed anything that would break my use case.

Oh, sorry I didn't realize you were working on top of my changes
already. Yes, the memory provider API should not have changed much.
I mostly reshuffled the MEP code to have both a coherent and
non-coherent buddy allocator since then.

> > In terms of the missing parts - uAPI is definitely needed.
> > The rough plan would be to add memory config via the netdev
> > genl family. Should fit nicely there. Have the config stored
> > in struct netdevice. When page pool is created get to the netdev
> > and automatically select the provider without the driver even
> > knowing.  
> 
> I guess I misunderstood the intent behind the original patches. I
> thought you wanted the user to tell the driver what memory provider to
> use, and the driver to recreate the page pool with that provider. What
> you're saying here sounds much better, and less changes to the driver.
> 
> >  Two problems with that are - 1) if the driver follows
> > the recommended flow of allocating new queues before freeing
> > old ones we will have page pools created before the old ones
> > are gone, which means we'd need to reserve 2x the number of
> > 1G pages; 2) there's no callback to the driver to say "I did
> > something behind your back, don't worry about it, but recreate
> > your queues, please" so the change will not take effect until
> > some unrelated change like installing XDP. Which may be fine
> > in practice but is a bit odd.
> 
> I have the same problem with device memory TCP. I solved it in a
> similar way, doing something else in the driver that triggers
> gve_close() & gve_open(). I wonder if the cleanest way to do this is
> calling ethtool_ops->reset() or something like that? That was my idea
> at least. I haven't tested it, but from reading the code it should do
> a gve_close() + gve_open() like I want.

The prevailing wisdom so far was that close() + open() is not a good
idea. Some NICs will require large contiguous allocations for rings 
and context memory and there's no guarantee that open() will succeed
in prod when memory is fragmented. So you may end up with a close()d
NIC and a failure to open(), and the machine dropping off the net.

But if we don't close() before we open() and the memory provider is
single consumer we'll have problem #1 :(

BTW are you planning to use individual queues in prod? I anticipated
that for ZC we'll need to tie multiple queues into an RSS context, 
and then configure at the level of an RSS context.

  reply	other threads:[~2023-07-07 22:45 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-07 18:39 [RFC 00/12] net: huge page backed page_pool Jakub Kicinski
2023-07-07 18:39 ` [RFC 01/12] net: hack together some page sharing Jakub Kicinski
2023-07-07 18:39 ` [RFC 02/12] net: create a 1G-huge-page-backed allocator Jakub Kicinski
2023-07-07 18:39 ` [RFC 03/12] net: page_pool: hide page_pool_release_page() Jakub Kicinski
2023-07-07 18:39 ` [RFC 04/12] net: page_pool: merge page_pool_release_page() with page_pool_return_page() Jakub Kicinski
2023-07-10 16:07   ` Jesper Dangaard Brouer
2023-07-07 18:39 ` [RFC 05/12] net: page_pool: factor out releasing DMA from releasing the page Jakub Kicinski
2023-07-07 18:39 ` [RFC 06/12] net: page_pool: create hooks for custom page providers Jakub Kicinski
2023-07-07 19:50   ` Mina Almasry
2023-07-07 22:28     ` Jakub Kicinski
2023-07-07 18:39 ` [RFC 07/12] net: page_pool: add huge page backed memory providers Jakub Kicinski
2023-07-07 18:39 ` [RFC 08/12] eth: bnxt: let the page pool manage the DMA mapping Jakub Kicinski
2023-07-10 10:12   ` Jesper Dangaard Brouer
2023-07-26  6:56     ` Ilias Apalodimas
2023-07-07 18:39 ` [RFC 09/12] eth: bnxt: use the page pool for data pages Jakub Kicinski
2023-07-10  4:22   ` Michael Chan
2023-07-10 17:04     ` Jakub Kicinski
2023-07-07 18:39 ` [RFC 10/12] eth: bnxt: make sure we make for recycle skbs before freeing them Jakub Kicinski
2023-07-07 18:39 ` [RFC 11/12] eth: bnxt: wrap coherent allocations into helpers Jakub Kicinski
2023-07-07 18:39 ` [RFC 12/12] eth: bnxt: hack in the use of MEP Jakub Kicinski
2023-07-07 19:45 ` [RFC 00/12] net: huge page backed page_pool Mina Almasry
2023-07-07 22:45   ` Jakub Kicinski [this message]
2023-07-10 17:31     ` Mina Almasry
2023-07-11 15:49 ` Jesper Dangaard Brouer
2023-07-12  0:08   ` Jakub Kicinski
2023-07-12 11:47     ` Yunsheng Lin
2023-07-12 12:43       ` Jesper Dangaard Brouer
2023-07-12 17:01         ` Jakub Kicinski
2023-07-14 13:05           ` Yunsheng Lin
2023-07-12 14:00     ` Jesper Dangaard Brouer
2023-07-12 17:19       ` Jakub Kicinski
2023-07-13 10:07         ` Jesper Dangaard Brouer
2023-07-13 16:27           ` Jakub Kicinski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230707154503.57cc834e@kernel.org \
    --to=kuba@kernel.org \
    --cc=almasrymina@google.com \
    --cc=dsahern@gmail.com \
    --cc=edumazet@google.com \
    --cc=hawk@kernel.org \
    --cc=ilias.apalodimas@linaro.org \
    --cc=michael.chan@broadcom.com \
    --cc=netdev@vger.kernel.org \
    --cc=willemb@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).