From: Joe Damato <jdamato@fastly.com>
To: Jakub Kicinski <kuba@kernel.org>
Cc: netdev@vger.kernel.org, edumazet@google.com,
amritha.nambiar@intel.com, sridhar.samudrala@intel.com,
sdf@fomichev.me, bjorn@rivosinc.com, hch@infradead.org,
willy@infradead.org, willemdebruijn.kernel@gmail.com,
skhawaja@google.com, Martin Karsten <mkarsten@uwaterloo.ca>,
Donald Hunter <donald.hunter@gmail.com>,
"David S. Miller" <davem@davemloft.net>,
Paolo Abeni <pabeni@redhat.com>,
Jesper Dangaard Brouer <hawk@kernel.org>,
Xuan Zhuo <xuanzhuo@linux.alibaba.com>,
Daniel Jurgens <danielj@nvidia.com>,
open list <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values
Date: Sat, 31 Aug 2024 18:27:45 +0100
Message-ID: <ZtNSkWa1G40jRX5N@LQ3V64L9R2>
In-Reply-To: <20240830142235.352dbad5@kernel.org>

On Fri, Aug 30, 2024 at 02:22:35PM -0700, Jakub Kicinski wrote:
> On Fri, 30 Aug 2024 11:43:00 +0100 Joe Damato wrote:
> > On Thu, Aug 29, 2024 at 03:31:05PM -0700, Jakub Kicinski wrote:
> > > On Thu, 29 Aug 2024 13:12:01 +0000 Joe Damato wrote:
> > > > +    napi = napi_by_id(napi_id);
> > > > +    if (napi)
> > > > +        err = netdev_nl_napi_set_config(napi, info);
> > > > +    else
> > > > +        err = -EINVAL;
> > >
> > > > if (napi) {
> > > >     ...
> > > > } else {
> > > >     NL_SET_BAD_ATTR(info->extack, info->attrs[NETDEV_A_NAPI_ID]);
> > > >     err = -ENOENT;
> > > > }
> >
> > Thanks, I'll make that change in the v2.
> >
> > Should I send a Fixes for the same pattern in
> > netdev_nl_napi_get_doit ?
>
> SG, standalone patch is good, FWIW, no need to add to the series.

Done. TBH, I couldn't tell whether it was a fix for net or a net-next
thing.

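
For reference, the error-handling pattern suggested above can be
sketched in userspace C roughly as follows. This is only an
illustration, not the v2 code: struct demo_napi, struct demo_extack
and the demo_* helpers are hypothetical stand-ins, and the bad_attr
string merely models what NL_SET_BAD_ATTR() records in the kernel.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for a NAPI instance; not struct napi_struct. */
struct demo_napi {
	unsigned int id;
	unsigned int defer_hard_irqs;
};

/* Stand-in for the netlink extended ack: remembers the rejected attribute. */
struct demo_extack {
	const char *bad_attr;
};

/* Linear lookup by ID; returns NULL when no instance has that ID. */
static struct demo_napi *demo_napi_by_id(struct demo_napi *tbl, size_t n,
					 unsigned int id)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].id == id)
			return &tbl[i];
	return NULL;
}

/* Set one value on the NAPI with the given ID. */
static int demo_napi_set(struct demo_napi *tbl, size_t n, unsigned int id,
			 unsigned int defer, struct demo_extack *extack)
{
	struct demo_napi *napi = demo_napi_by_id(tbl, n, id);

	if (!napi) {
		/* Point the caller at the offending attribute ... */
		extack->bad_attr = "NETDEV_A_NAPI_ID";
		/* ... and report "no such object" rather than -EINVAL. */
		return -ENOENT;
	}

	napi->defer_hard_irqs = defer;
	return 0;
}
```

The point of the change is that an unknown ID is reported as a missing
object and the caller is told which attribute was at fault, instead of
getting a generic -EINVAL.
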
> > > > + doc: Set configurable NAPI instance settings.
> > >
> > > We should pause and think here how configuring NAPI params should
> > > behave. NAPI instances are ephemeral, if you close and open the
> > > device (or for some drivers change any BPF or ethtool setting)
> > > the NAPIs may get wiped and recreated, discarding all configuration.
> > >
> > > This is not how the sysfs API behaves, the sysfs settings on the device
> > > survive close. It's (weirdly?) also not how queues behave, because we
> > > have struct netdev{_rx,}_queue to store stuff persistently. Even tho
> > > you'd think queues are as ephemeral as NAPIs if not more.
> > >
> > > I guess we can either document this, and move on (which may be fine,
> > > you have more practical experience than me). Or we can add an internal
> > > concept of a "channel" (which perhaps maybe if you squint is what
> > > ethtool -l calls NAPIs?) or just "napi_storage" as an array inside
> > > net_device and store such config there. For simplicity of matching
> > > config to NAPIs we can assume drivers add NAPI instances in order.
> > > If driver wants to do something more fancy we can add a variant of
> > > netif_napi_add() which specifies the channel/storage to use.
> > >
> > > Thoughts? I may be overly sensitive to the ephemeral thing, maybe
> > > I work with unfortunate drivers...
> >
> > Thanks for pointing this out. I think this is an important case to
> > consider. Here's how I'm thinking about it.
> >
> > There are two cases:
> >
> > 1) sysfs setting is used by existing/legacy apps: If the NAPIs are
> > discarded and recreated, the code I added to netif_napi_add_weight
> > in patch 1 and 3 should take care of that case preserving how sysfs
> > works today, I believe. I think we are good on this case ?
>
> Agreed.
>
> > 2) apps using netlink to set various custom settings. This seems
> > like a case where a future extension can be made to add a notifier
> > for NAPI changes (like the netdevice notifier?).
>
> Yes, the notifier may help, but it's a bit of a stop gap / fallback.
>
> > If you think this is a good idea, then we'd do something like:
> >   1. Document that the NAPI settings are wiped when NAPIs are wiped
> >   2. In the future (not part of this series) a NAPI notifier is
> >      added
> >   3. User apps can then listen for NAPI create/delete events
> >      and update settings when a NAPI is created. It would be
> >      helpful, I think, for user apps to know about NAPI
> >      create/delete events in general because it means NAPI IDs are
> >      changing.
> >
> > One could argue:
> >
> > When wiping/recreating a NAPI for an existing HW queue, that HW
> > queue gets a new NAPI ID associated with it. User apps operating
> > at this level probably care about NAPI IDs changing (as it affects
> > epoll busy poll). Since the settings in this series are per-NAPI
> > (and not per HW queue), the argument could be that user apps need
> > to setup NAPIs when they are created and settings do not persist
> > between NAPIs with different IDs even if associated with the same
> > HW queue.
>
> IDK if the fact that NAPI ID gets replaced was intentional in the first
> place. I would venture a guess that the person who added the IDs was
> working with NICs which have stable NAPI instances once the device is
> opened. This is, unfortunately, not universally the case.
>
> I just poked at bnxt, mlx5 and fbnic and all of them reallocate NAPIs
> on an open device. Closer we get to queue API the more dynamic the whole
> setup will become (read: the more often reconfigurations will happen).
>
> > Admittedly, from the perspective of a user it would be nice if a new
> > NAPI created for an existing HW queue retained the previous
> > settings so that I, as the user, can do less work.
> >
> > But, what happens if a HW queue is destroyed and recreated? Will any
> > HW settings be retained? And does that have any influence on what we
> > do in software? See below.
>
> Up to the driver, today. But settings we store in queue structs in
> the core are not wiped.
>
> > This part of your message:
> >
> > > we can assume drivers add NAPI instances in order. If driver wants
> > > to do something more fancy we can add a variant of
> > > netif_napi_add() which specifies the channel/storage to use.
> >
> > assuming drivers will "do a thing", so to speak, makes me uneasy.
>
> Yeah.. :(
>
> > I started to wonder: how do drivers handle per-queue HW IRQ coalesce
> > settings when queue counts increase? It's a different, but adjacent
> > problem, I think.
> >
> > I tried a couple experiments on mlx5 and got very strange results
> > suitable for their own thread and I didn't want to get this thread
> > too far off track.
>
> Yes, but ethtool is an old shallow API from the times when semantics
> were simpler. It's precisely this mess which we try to avoid by storing
> more of the config in the core, in a consistent fashion.
>
> > I think you have much more practical experience when it comes to
> > dealing with drivers, so I am happy to follow your lead on this one,
> > but assuming drivers will "do a thing" seems mildly scary to me with
> > limited driver experience.
> >
> > My two goals with this series are:
> >   1. Make it possible to set these values per NAPI
> >   2. Unblock the IRQ suspension series by threading the suspend
> >      parameter through the code path carved in this series
> >
> > So, I'm happy to proceed with this series as you prefer whether
> > that's documentation or "napi_storage"; I think you are probably the
> > best person to answer this question :)
>
> How do you feel about making this configuration opt-in / require driver
> changes? What I'm thinking is that having the new "netif_napi_add()"
> variant (or perhaps extending netif_napi_set_irq()) to take an extra
> "index" parameter would make the whole thing much simpler.

I think if we are going to go this way, then opt-in is probably the
way to go. In that case, should this series include the necessary
changes for mlx5 (because that's what I have access to) so that the
new variant has a user?

> Index would basically be an integer 0..n, where n is the number of
> IRQs configured for the driver. The index of a NAPI instance would
> likely match the queue ID of the queue the NAPI serves.
>
> We can then allocate an array of "napi_configs" in net_device -
> like we allocate queues, the array size would be max(num_rx_queue,
> num_tx_queues). We just need to store a couple of ints so it will
> be tiny compared to queue structs, anyway.

I assume napi_storage exists for both combined RX/TX NAPIs (i.e.
drivers that multiplex RX/TX on a single NAPI like mlx5) as well
as drivers which create NAPIs that are RX or TX-only, right?

If so, it seems like we'd either need to:
  - Do something more complicated when computing how much NAPI
    storage to make, or
  - Provide a different path for drivers which don't multiplex and
    create some number of (for example) TX-only NAPIs?

I guess I'm just imagining a weird case where a driver has 8 RX
queues but 64 TX queues, each with its own NAPI. That is 72 NAPIs,
but the max is 64, so we'd be missing 8 napi_storage entries?

Sorry, I'm probably just missing something about the implementation
details you summarized above.

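
To make the arithmetic in that example concrete, here is a small
userspace sketch in C (not kernel code). The struct and helper names
are hypothetical, and it assumes a non-multiplexing driver creates
exactly one NAPI per RX queue and one per TX queue:

```c
/* Hypothetical queue counts for a device; not a kernel structure. */
struct queue_counts {
	unsigned int rx;
	unsigned int tx;
};

/* Storage sized as proposed above: max(num_rx_queues, num_tx_queues). */
static unsigned int storage_slots(struct queue_counts q)
{
	return q.rx > q.tx ? q.rx : q.tx;
}

/* NAPIs created when RX and TX are multiplexed on a single NAPI. */
static unsigned int napis_combined(struct queue_counts q)
{
	return q.rx > q.tx ? q.rx : q.tx;
}

/* NAPIs created when every RX and TX queue gets its own NAPI (assumption). */
static unsigned int napis_split(struct queue_counts q)
{
	return q.rx + q.tx;
}

/* Number of NAPIs that would not have a storage slot. */
static unsigned int napi_shortfall(unsigned int napis, unsigned int slots)
{
	return napis > slots ? napis - slots : 0;
}
```

With rx = 8 and tx = 64, storage_slots() gives 64, napis_combined()
gives 64 and napis_split() gives 72, so napi_shortfall() reports 8
missing entries for the split case and 0 for the combined case.
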
> The NAPI_SET netlink op can then work based on NAPI index rather
> than the ephemeral NAPI ID. It can apply the config to all live
> NAPI instances with that index (of which there really should only
> be one, unless driver is mid-reconfiguration somehow but even that
> won't cause issues, we can give multiple instances the same settings)
> and also store the user config in the array in net_device.

I understand what you are proposing. I suppose napi-get could be
extended to include the NAPI index, too?

Then users could map queues to NAPI indexes (via NAPI ID)?

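
As a way of checking my understanding of the index-based scheme, here
is a rough userspace model in C. It is a sketch under assumptions, not
a proposed implementation: struct fake_netdev, struct fake_napi,
struct napi_cfg and the fake_* helpers are all hypothetical, the array
sizes are arbitrary, and nothing here handles locking or queue count
changes.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_CFG  64	/* hypothetical: max(num_rx_queues, num_tx_queues) */
#define MAX_LIVE 128	/* arbitrary bound on live NAPI instances */

/* Per-index config; the fields mirror the two settings in this series. */
struct napi_cfg {
	unsigned int defer_hard_irqs;
	unsigned long gro_flush_timeout;
};

/* Simplified stand-in for a NAPI instance; not struct napi_struct. */
struct fake_napi {
	unsigned int id;	/* ephemeral: changes on every re-add */
	int index;		/* stable: chosen by the driver */
	int live;
	struct napi_cfg cfg;
};

struct fake_netdev {
	struct napi_cfg cfgs[MAX_CFG];	/* survives NAPI delete/re-add */
	struct fake_napi napis[MAX_LIVE];
	unsigned int next_id;
};

/* Opt-in add: the new instance gets a fresh ID and the stored config. */
static struct fake_napi *fake_napi_add(struct fake_netdev *dev, int index)
{
	if (index < 0 || index >= MAX_CFG)
		return NULL;

	for (size_t i = 0; i < MAX_LIVE; i++) {
		struct fake_napi *n = &dev->napis[i];

		if (n->live)
			continue;
		n->live = 1;
		n->id = ++dev->next_id;
		n->index = index;
		n->cfg = dev->cfgs[index];
		return n;
	}
	return NULL;
}

static void fake_napi_del(struct fake_napi *n)
{
	n->live = 0;
}

/* Set by index: store in the device, then apply to every live match. */
static int fake_napi_set(struct fake_netdev *dev, int index,
			 struct napi_cfg cfg)
{
	if (index < 0 || index >= MAX_CFG)
		return -ENOENT;

	dev->cfgs[index] = cfg;
	for (size_t i = 0; i < MAX_LIVE; i++) {
		struct fake_napi *n = &dev->napis[i];

		if (n->live && n->index == index)
			n->cfg = cfg;
	}
	return 0;
}
```

In this model, adding a NAPI at index 3, setting its config, deleting
it and adding it again yields a new ID but the same settings, which is
the behavior I understand you to be describing. Reporting the index
alongside the ID in napi-get would let user apps find the stable
handle.
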
> When new NAPI instance is associated with a NAPI index it should get
> all the config associated with that index applied.
>
> Thoughts? Does that make sense, and if so do you think it's an
> over-complication?

It feels a bit tricky, to me, as it seems there are some edge cases
to be careful with (queue count change). I could probably give the
implementation a try and see where I end up.

Having these settings per-NAPI would be really useful and being able
to support IRQ suspension would be useful, too.

I think being thoughtful about how we get there is important; I'm a
little wary of getting sidetracked, but I trust your judgement and
if you think this is worth exploring I'll think on it some more.

- Joe