netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mina Almasry <almasrymina@google.com>
To: Jakub Kicinski <kuba@kernel.org>
Cc: netdev@vger.kernel.org, hawk@kernel.org,
	ilias.apalodimas@linaro.org,  edumazet@google.com,
	dsahern@gmail.com, michael.chan@broadcom.com,
	 willemb@google.com
Subject: Re: [RFC 00/12] net: huge page backed page_pool
Date: Fri, 7 Jul 2023 12:45:26 -0700	[thread overview]
Message-ID: <CAHS8izPmQRuBfBB2ddna-RHvorvJs7VtVUqCW80MaxPLUtDHGA@mail.gmail.com> (raw)
In-Reply-To: <20230707183935.997267-1-kuba@kernel.org>

On Fri, Jul 7, 2023 at 11:39 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> Hi!
>
> This is an "early PoC" at best. It seems to work for a basic
> traffic test but there's no uAPI and a lot more general polish
> is needed.
>
> The problem we're seeing is that performance of some older NICs
> degrades quite a bit when IOMMU is used (in non-passthru mode).
> There is a long tail of old NICs deployed, especially in PoPs/
> /on edge. From a conversation I had with Eric a few months
> ago it sounded like others may have similar issues. So I thought
> I'd take a swing at getting page pool to feed drivers huge pages.
> 1G pages require hooking into early init via CMA but it works
> just fine.
>
> I haven't tested this with a real workload, because I'm still
> waiting to get my hands on the right machine. But the experiment
> with bnxt shows a ~90% reduction in IOTLB misses (670k -> 70k).
>

Thanks for CCing me Jakub. I'm working on a proposal for device memory
TCP, and I recently migrated it to be on top of your pp-provider idea
and I think I can share my test results as well. I had my code working
on top of your slightly older API I found here a few days ago:
https://github.com/kuba-moo/linux/tree/pp-providers

On top of the old API I had something with all my functionality tests
passing and performance benchmarking hitting ~96.5% line rate (with
all data going straight to the device - GPU - memory, which is the
point of the proposal). Of course, when you look at the code you may
not like the approach and I may need to try something else, which is
perfectly fine, but my current implementation is pp-provider based.

I'll look into rebasing my changes on top of this RFC and retesting,
but I should share my RFC either way sometime next week maybe. I took
a quick look at the changes you made here, and I don't think you
changed anything that would break my use case.

> In terms of the missing parts - uAPI is definitely needed.
> The rough plan would be to add memory config via the netdev
> genl family. Should fit nicely there. Have the config stored
> in struct netdevice. When page pool is created get to the netdev
> and automatically select the provider without the driver even
> knowing.

I guess I misunderstood the intent behind the original patches. I
thought you wanted the user to tell the driver what memory provider to
use, and the driver to recreate the page pool with that provider. What
you're saying here sounds much better, and less changes to the driver.

>  Two problems with that are - 1) if the driver follows
> the recommended flow of allocating new queues before freeing
> old ones we will have page pools created before the old ones
> are gone, which means we'd need to reserve 2x the number of
> 1G pages; 2) there's no callback to the driver to say "I did
> something behind your back, don't worry about it, but recreate
> your queues, please" so the change will not take effect until
> some unrelated change like installing XDP. Which may be fine
> in practice but is a bit odd.
>

I have the same problem with device memory TCP. I solved it in a
similar way, doing something else in the driver that triggers
gve_close() & gve_open(). I wonder if the cleanest way to do this is
calling ethtool_ops->reset() or something like that? That was my idea
at least. I haven't tested it, but from reading the code it should do
a gve_close() + gve_open() like I want.

> Then we get into hand-wavy stuff like - if we can link page
> pools to netdevs, we should also be able to export the page pool
> stats via the netdev family instead doing it the ethtool -S.. ekhm..
> "way". And if we start storing configs behind driver's back why
> don't we also store other params, like ring size and queue count...
> A lot of potential improvements as we iron out a new API...
>
> Live tree: https://github.com/kuba-moo/linux/tree/pp-providers
>
> Jakub Kicinski (12):
>   net: hack together some page sharing
>   net: create a 1G-huge-page-backed allocator
>   net: page_pool: hide page_pool_release_page()
>   net: page_pool: merge page_pool_release_page() with
>     page_pool_return_page()
>   net: page_pool: factor out releasing DMA from releasing the page
>   net: page_pool: create hooks for custom page providers
>   net: page_pool: add huge page backed memory providers
>   eth: bnxt: let the page pool manage the DMA mapping
>   eth: bnxt: use the page pool for data pages
>   eth: bnxt: make sure we make for recycle skbs before freeing them
>   eth: bnxt: wrap coherent allocations into helpers
>   eth: bnxt: hack in the use of MEP
>
>  Documentation/networking/page_pool.rst        |  10 +-
>  arch/x86/kernel/setup.c                       |   6 +-
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c     | 154 +++--
>  drivers/net/ethernet/broadcom/bnxt/bnxt.h     |   5 +
>  drivers/net/ethernet/engleder/tsnep_main.c    |   2 +-
>  .../net/ethernet/stmicro/stmmac/stmmac_main.c |   4 +-
>  include/net/dcalloc.h                         |  28 +
>  include/net/page_pool.h                       |  36 +-
>  net/core/Makefile                             |   2 +-
>  net/core/dcalloc.c                            | 615 +++++++++++++++++
>  net/core/dcalloc.h                            |  96 +++
>  net/core/page_pool.c                          | 625 +++++++++++++++++-
>  12 files changed, 1478 insertions(+), 105 deletions(-)
>  create mode 100644 include/net/dcalloc.h
>  create mode 100644 net/core/dcalloc.c
>  create mode 100644 net/core/dcalloc.h
>
> --
> 2.41.0
>


-- 
Thanks,
Mina

  parent reply	other threads:[~2023-07-07 19:45 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-07 18:39 [RFC 00/12] net: huge page backed page_pool Jakub Kicinski
2023-07-07 18:39 ` [RFC 01/12] net: hack together some page sharing Jakub Kicinski
2023-07-07 18:39 ` [RFC 02/12] net: create a 1G-huge-page-backed allocator Jakub Kicinski
2023-07-07 18:39 ` [RFC 03/12] net: page_pool: hide page_pool_release_page() Jakub Kicinski
2023-07-07 18:39 ` [RFC 04/12] net: page_pool: merge page_pool_release_page() with page_pool_return_page() Jakub Kicinski
2023-07-10 16:07   ` Jesper Dangaard Brouer
2023-07-07 18:39 ` [RFC 05/12] net: page_pool: factor out releasing DMA from releasing the page Jakub Kicinski
2023-07-07 18:39 ` [RFC 06/12] net: page_pool: create hooks for custom page providers Jakub Kicinski
2023-07-07 19:50   ` Mina Almasry
2023-07-07 22:28     ` Jakub Kicinski
2023-07-07 18:39 ` [RFC 07/12] net: page_pool: add huge page backed memory providers Jakub Kicinski
2023-07-07 18:39 ` [RFC 08/12] eth: bnxt: let the page pool manage the DMA mapping Jakub Kicinski
2023-07-10 10:12   ` Jesper Dangaard Brouer
2023-07-26  6:56     ` Ilias Apalodimas
2023-07-07 18:39 ` [RFC 09/12] eth: bnxt: use the page pool for data pages Jakub Kicinski
2023-07-10  4:22   ` Michael Chan
2023-07-10 17:04     ` Jakub Kicinski
2023-07-07 18:39 ` [RFC 10/12] eth: bnxt: make sure we make for recycle skbs before freeing them Jakub Kicinski
2023-07-07 18:39 ` [RFC 11/12] eth: bnxt: wrap coherent allocations into helpers Jakub Kicinski
2023-07-07 18:39 ` [RFC 12/12] eth: bnxt: hack in the use of MEP Jakub Kicinski
2023-07-07 19:45 ` Mina Almasry [this message]
2023-07-07 22:45   ` [RFC 00/12] net: huge page backed page_pool Jakub Kicinski
2023-07-10 17:31     ` Mina Almasry
2023-07-11 15:49 ` Jesper Dangaard Brouer
2023-07-12  0:08   ` Jakub Kicinski
2023-07-12 11:47     ` Yunsheng Lin
2023-07-12 12:43       ` Jesper Dangaard Brouer
2023-07-12 17:01         ` Jakub Kicinski
2023-07-14 13:05           ` Yunsheng Lin
2023-07-12 14:00     ` Jesper Dangaard Brouer
2023-07-12 17:19       ` Jakub Kicinski
2023-07-13 10:07         ` Jesper Dangaard Brouer
2023-07-13 16:27           ` Jakub Kicinski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAHS8izPmQRuBfBB2ddna-RHvorvJs7VtVUqCW80MaxPLUtDHGA@mail.gmail.com \
    --to=almasrymina@google.com \
    --cc=dsahern@gmail.com \
    --cc=edumazet@google.com \
    --cc=hawk@kernel.org \
    --cc=ilias.apalodimas@linaro.org \
    --cc=kuba@kernel.org \
    --cc=michael.chan@broadcom.com \
    --cc=netdev@vger.kernel.org \
    --cc=willemb@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).