From: Stanislav Fomichev <sdf@fomichev.me>
To: Joe Damato <jdamato@fastly.com>, Jakub Kicinski <kuba@kernel.org>,
	netdev@vger.kernel.org, edumazet@google.com,
	amritha.nambiar@intel.com, sridhar.samudrala@intel.com,
	bjorn@rivosinc.com, hch@infradead.org, willy@infradead.org,
	willemdebruijn.kernel@gmail.com, skhawaja@google.com,
	Martin Karsten <mkarsten@uwaterloo.ca>,
	Donald Hunter <donald.hunter@gmail.com>,
	"David S. Miller" <davem@davemloft.net>,
	Paolo Abeni <pabeni@redhat.com>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	Xuan Zhuo <xuanzhuo@linux.alibaba.com>,
	Daniel Jurgens <danielj@nvidia.com>,
	open list <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH net-next 5/5] netdev-genl: Support setting per-NAPI config values
Date: Wed, 4 Sep 2024 16:40:41 -0700	[thread overview]
Message-ID: <Ztjv-dgNFwFBnXwd@mini-arch> (raw)
In-Reply-To: <ZtXuJ3TMp9cN5e9h@LQ3V64L9R2.station>

On 09/02, Joe Damato wrote:
> On Fri, Aug 30, 2024 at 02:22:35PM -0700, Jakub Kicinski wrote:
> > On Fri, 30 Aug 2024 11:43:00 +0100 Joe Damato wrote:
> > > On Thu, Aug 29, 2024 at 03:31:05PM -0700, Jakub Kicinski wrote:
> > > > On Thu, 29 Aug 2024 13:12:01 +0000 Joe Damato wrote:  
> > > > > +      doc: Set configurable NAPI instance settings.  
> > > > 
> > > > We should pause and think here how configuring NAPI params should
> > > > behave. NAPI instances are ephemeral, if you close and open the
> > > > device (or for some drivers change any BPF or ethtool setting)
> > > > the NAPIs may get wiped and recreated, discarding all configuration.
> > > > 
> > > > This is not how the sysfs API behaves: the sysfs settings on the device
> > > > survive close. It's (weirdly?) also not how queues behave, because we
> > > > have struct netdev{_rx,}_queue to store stuff persistently, even though
> > > > you'd think queues are as ephemeral as NAPIs, if not more.
> > > > 
> > > > I guess we can either document this, and move on (which may be fine,
> > > > you have more practical experience than me). Or we can add an internal
> > > > concept of a "channel" (which, perhaps, if you squint, is what
> > > > ethtool -l calls NAPIs?) or just "napi_storage" as an array inside
> > > > net_device and store such config there. For simplicity of matching
> > > > config to NAPIs we can assume drivers add NAPI instances in order. 
> > > > If driver wants to do something more fancy we can add a variant of
> > > > netif_napi_add() which specifies the channel/storage to use.
> > > > 
> > > > Thoughts? I may be overly sensitive to the ephemeral thing, maybe
> > > > I work with unfortunate drivers...  
> > > 
> > > Thanks for pointing this out. I think this is an important case to
> > > consider. Here's how I'm thinking about it.
> > > 
> > > There are two cases:
> > > 
> > > 1) sysfs setting is used by existing/legacy apps: If the NAPIs are
> > > discarded and recreated, the code I added to netif_napi_add_weight
> > > in patches 1 and 3 should take care of that case, preserving how sysfs
> > > works today, I believe. I think we are good on this case?
> > 
> > Agreed.
> > 
> > > 2) apps using netlink to set various custom settings. This seems
> > > like a case where a future extension can be made to add a notifier
> > > for NAPI changes (like the netdevice notifier?).
> > 
> > Yes, the notifier may help, but it's a bit of a stop gap / fallback.
> > 
> > > If you think this is a good idea, then we'd do something like:
> > >   1. Document that the NAPI settings are wiped when NAPIs are wiped
> > >   2. In the future (not part of this series) a NAPI notifier is
> > >      added
> > >   3. User apps can then listen for NAPI create/delete events
> > >      and update settings when a NAPI is created. It would be
> > >      helpful, I think, for user apps to know about NAPI
> > >      create/delete events in general because it means NAPI IDs are
> > >      changing.
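
For what it's worth, the in-kernel side of such a notifier could mirror the
netdevice notifier chain. Everything in this sketch (event names,
registration helper) is made up, and user apps would presumably see the
equivalent as netdev-genl notifications:

        static int my_napi_event(struct notifier_block *nb,
                                 unsigned long event, void *ptr)
        {
                struct napi_struct *napi = ptr;

                switch (event) {
                case NAPI_CREATED:      /* hypothetical event */
                        /* (re)apply per-NAPI settings for napi->napi_id */
                        break;
                case NAPI_DELETED:      /* hypothetical event */
                        break;
                }
                return NOTIFY_DONE;
        }

        static struct notifier_block napi_nb = {
                .notifier_call = my_napi_event,
        };

        /* register_napi_notifier(&napi_nb);  hypothetical registration */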
> > > 
> > > One could argue:
> > > 
> > >   When wiping/recreating a NAPI for an existing HW queue, that HW
> > >   queue gets a new NAPI ID associated with it. User apps operating
> > >   at this level probably care about NAPI IDs changing (as it affects
> > >   epoll busy poll). Since the settings in this series are per-NAPI
> > >   (and not per HW queue), the argument could be that user apps need
> > >   to set up NAPIs when they are created and settings do not persist
> > >   between NAPIs with different IDs even if associated with the same
> > >   HW queue.
> > 
> > IDK if the fact that NAPI ID gets replaced was intentional in the first
> > place. I would venture a guess that the person who added the IDs was
> > working with NICs which have stable NAPI instances once the device is
> > opened. This is, unfortunately, not universally the case.
> > 
> > I just poked at bnxt, mlx5 and fbnic and all of them reallocate NAPIs
> > on an open device. The closer we get to the queue API, the more dynamic
> > the whole setup will become (read: the more often reconfigurations will
> > happen).
> >
> 
> [...]
> 
> > > I think you have much more practical experience when it comes to
> > > dealing with drivers, so I am happy to follow your lead on this one,
> > > but assuming drivers will "do a thing" seems mildly scary to me with
> > > limited driver experience.
> > > 
> > > My two goals with this series are:
> > >   1. Make it possible to set these values per NAPI
> > >   2. Unblock the IRQ suspension series by threading the suspend
> > >      parameter through the code path carved in this series
> > > 
> > > So, I'm happy to proceed with this series as you prefer whether
> > > that's documentation or "napi_storage"; I think you are probably the
> > > best person to answer this question :)
> > 
> > How do you feel about making this configuration opt-in / require driver
> > changes? What I'm thinking is that having the new "netif_napi_add()"
> > variant (or perhaps an extended netif_napi_set_irq()) take an extra
> > "index" parameter would make the whole thing much simpler.
> 
> What about extending netif_queue_set_napi instead? That function
> takes a napi and a queue index.
> 
> Locally I kind of hacked up something simple that:
>   - Allocates napi_storage in net_device in alloc_netdev_mqs
>   - Modifies netif_queue_set_napi to:
>      if (napi)
>        napi->storage = &dev->napi_storage[queue_index];
> 
> I think I'm still missing the bit about the
> max(rx_queues,tx_queues), though :(
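
Spelled out a bit more, using the existing netif_queue_set_napi() signature
(napi->storage and dev->napi_storage are still the hypothetical bits):

        void netif_queue_set_napi(struct net_device *dev,
                                  unsigned int queue_index,
                                  enum netdev_queue_type type,
                                  struct napi_struct *napi)
        {
                /* ... existing queue<->napi association ... */

                if (napi)
                        napi->storage = &dev->napi_storage[queue_index];
        }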
> 
> > Index would basically be an integer 0..n, where n is the number of
> > IRQs configured for the driver. The index of a NAPI instance would
> > likely match the queue ID of the queue the NAPI serves.
> 
> Hmmm. I'm hesitant about the "number of IRQs" part. What if there
> are NAPIs for which no IRQ is allocated ~someday~ ?
> 
> It seems like (I could totally be wrong) netif_queue_set_napi can be
> called to create the association even without an IRQ allocated.
> 
> I guess the issue is mostly the queue index question above: combined
> rx/tx vs drivers having different numbers of rx and tx queues.
> 
> > We can then allocate an array of "napi_configs" in net_device -
> > like we allocate queues, the array size would be max(num_rx_queues,
> > num_tx_queues). We just need to store a couple of ints so it will
> > be tiny compared to queue structs, anyway.
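
Presumably something like this in alloc_netdev_mqs(), reusing its
txqs/rxqs arguments (napi_storage is the hypothetical struct from above):

        dev->napi_storage = kvcalloc(max(txqs, rxqs),
                                     sizeof(*dev->napi_storage),
                                     GFP_KERNEL);
        if (!dev->napi_storage)
                goto free_all;  /* or whatever error path fits there */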
> > 
> > The NAPI_SET netlink op can then work based on NAPI index rather
> > than the ephemeral NAPI ID. It can apply the config to all live
> > NAPI instances with that index (of which there really should only
> > be one, unless the driver is mid-reconfiguration somehow, but even that
> > won't cause issues; we can give multiple instances the same settings)
> > and also store the user config in the array in net_device.
> > 
> > When a new NAPI instance is associated with a NAPI index, it should get
> > all the config associated with that index applied.
> > 
> > Thoughts? Does that make sense, and if so do you think it's an
> > over-complication?
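
As a sketch, the NAPI_SET handler could then look roughly like this
(napi_apply_storage() is made up, and the genl plumbing is elided):

        struct napi_storage *storage = &dev->napi_storage[index];
        struct napi_struct *napi;

        /* store the config persistently in net_device ... */
        storage->defer_hard_irqs = defer;
        storage->gro_flush_timeout = timeout;

        /* ... and push it to any live NAPI bound to that index */
        list_for_each_entry(napi, &dev->napi_list, dev_list)
                if (napi->storage == storage)
                        napi_apply_storage(napi);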
> 
> I think what you are proposing seems fine; I'm just working out the
> implementation details and making sure I understand before sending
> another revision.

What if instead of an extra storage index in UAPI, we make napi_id persistent?
Then we can keep using napi_id as a user-facing number for the configuration.

Having a stable napi_id would also be super useful for the epoll setup so you
don't have to match old/invalid ids to the new ones on device reset.

In the code, we can keep the same idea with napi_storage in netdev and
ask drivers to provide storage id, but keep that id internal.

The only complication with that is the napi_hash_add/napi_hash_del calls
that happen in netif_napi_add_weight. So for the devices that allocate
the new napi before removing the old one (most devices?), we'd have to add
some new netif_napi_takeover(old_napi, new_napi) to remove the
old napi_id from the hash and reuse it in the new one.
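
Roughly like this; napi_hash_add_with_id() is made up, since today
napi_hash_add() always picks a fresh id:

        static void netif_napi_takeover(struct napi_struct *old,
                                        struct napi_struct *new)
        {
                unsigned int napi_id = old->napi_id;

                napi_hash_del(old);     /* the old instance gives up its id */
                napi_hash_del(new);     /* drop the ephemeral id ... */
                napi_hash_add_with_id(new, napi_id);    /* ... and reuse the old one */
        }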

So for mlx5, the flow would look like the following:

- mlx5e_safe_switch_params
  - mlx5e_open_channels
    - netif_napi_add(new_napi)
      - adds napi with 'ephemeral' napi id
  - mlx5e_switch_priv_channels
    - mlx5e_deactivate_priv_channels
      - napi_disable(old_napi)
      - netif_napi_del(old_napi) - this frees the old napi_id
  - mlx5e_activate_priv_channels
    - mlx5e_activate_channels
      - mlx5e_activate_channel
        - netif_napi_takeover(old_napi is gone, so probably take id from napi_storage?)
          - if napi is not hashed - safe to reuse?
        - napi_enable

This is a bit ugly because we still have random napi ids during reset, but
it's not super complicated implementation-wise. We can eventually improve
the above by splitting netif_napi_add_weight into two steps, allocate and
activate (with the napi_id allocation & hashing done in the second step);
a rough sketch is below. Thoughts?
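
Both helpers in this sketch are hypothetical:

        netif_napi_alloc(dev, napi, poll);  /* allocate only: no napi_id, not hashed */
        /* ... tear down old channels, releasing the old napi_id ... */
        netif_napi_activate(napi);          /* take the stable id from napi_storage and hash it */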
