linux-api.vger.kernel.org archive mirror
From: Joe Damato <jdamato@fastly.com>
To: Jakub Kicinski <kuba@kernel.org>
Cc: "Samudrala, Sridhar" <sridhar.samudrala@intel.com>,
	Eric Dumazet <edumazet@google.com>,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	chuck.lever@oracle.com, jlayton@kernel.org,
	linux-api@vger.kernel.org, brauner@kernel.org,
	davem@davemloft.net, alexander.duyck@gmail.com,
	Wei Wang <weiwan@google.com>,
	Amritha Nambiar <amritha.nambiar@intel.com>
Subject: Re: [net-next 0/3] Per epoll context busy poll support
Date: Fri, 2 Feb 2024 11:33:33 -0800	[thread overview]
Message-ID: <20240202193332.GA8932@fastly.com> (raw)
In-Reply-To: <20240202102239.274ca9bb@kernel.org>

On Fri, Feb 02, 2024 at 10:22:39AM -0800, Jakub Kicinski wrote:
> On Fri, 2 Feb 2024 11:23:28 -0600 Samudrala, Sridhar wrote:
> > > I know I am replying to a stale thread on the patches I've submitted (there is
> > > a v5 now [1]), but I just looked at your message - sorry I didn't reply
> > > sooner.
> > > 
> > > The per-queue and per-napi netlink APIs look extremely useful, thanks for
> > > pointing this out.
> > > 
> > > In my development tree, I had added SIOCGIFNAME_BY_NAPI_ID which works
> > > similar to SIOCGIFNAME: it takes a NAPI ID and returns the IF name. This is
> > > useful on machines with multiple NICs where each NIC could be located in
> > > one of many different NUMA zones.
> > > 
> > > The idea was that apps would use SO_INCOMING_NAPI_ID, distribute the NAPI
> > > ID to a worker thread which could then use SIOCGIFNAME_BY_NAPI_ID to
> > > compute which NIC the connection came in on. The app would then (via
> > > configuration) know where to pin that worker thread; ideally somewhere NUMA
> > > local to the NIC.
> > > 
> > > I had assumed that such a change would be rejected, but I figured I'd send
> > > an RFC for it after the per epoll context stuff was done and see if anyone
> > > thought SIOCGIFNAME_BY_NAPI_ID would be useful for them, as well.  
> > 
> > I think you should be able to get this functionality via the netdev-genl 
> > API to get napi parameters. It returns ifindex as one of the parameters 
> > and you should be able to get the name from ifindex.
> > 
> > $ ./cli.py --spec netdev.yaml --do napi-get --json='{"id": 593}'
> > {'id': 593, 'ifindex': 12, 'irq': 291, 'pid': 3727}
> 
> FWIW we also have a C library to access those. Out of curiosity what's
> the programming language you'd use in user space, Joe?

I am using C from user space. Curious what you think about
SIOCGIFNAME_BY_NAPI_ID, Jakub? I think it would be very useful, but not
sure if such an extension would be accepted. I can send an RFC, if you'd
like to take a look and consider it. I know you are busy and I don't want
to add too much noise to the list if I can help it :)
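
Roughly, the usage I have in mind looks like the sketch below. The ioctl
only exists in my development tree, and the reuse of struct ifreq (NAPI ID
in, interface name out) is just one possible shape for the interface, not
anything settled or upstream:

  #include <stdint.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <net/if.h>

  /* SIOCGIFNAME_BY_NAPI_ID is only defined by my local patch; this sketch
   * models it on SIOCGIFNAME: NAPI ID in, interface name out.
   */
  static int ifname_by_napi_id(int fd, uint32_t napi_id,
                               char name[IFNAMSIZ])
  {
          struct ifreq ifr = {0};

          ifr.ifr_ifindex = (int)napi_id; /* NAPI ID passed in place of ifindex */

          if (ioctl(fd, SIOCGIFNAME_BY_NAPI_ID, &ifr) < 0)
                  return -1;

          memcpy(name, ifr.ifr_name, IFNAMSIZ);
          return 0;
  }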

Here's a brief description of what I'm doing, which others might find
helpful:

1. Machine has multiple NICs. Each NIC has 1 queue per busy poll app
thread, plus a few extra queues for other, non-busy-poll usage.

2. A custom RSS context is created on each NIC to distribute flows to the
busy poll queues. The default context directs flows to the non-busy-poll
queues.

3. Each NIC has n-tuple filters inserted to direct incoming connections
with certain destination ports (e.g. 80, 443) to the custom RSS context.
All other incoming connections will land in the default context and go to
the other queues.

4. IRQs for the busy poll queues are pinned to specific CPUs which are NUMA
local to the NIC.

5. IRQ coalescing values are set up with busy poll in mind, so IRQs are
deferred as much as possible with the assumption that userland will drive
NAPI via epoll_wait. This is done per queue (using ethtool --per-queue and
a queue mask). This is where napi_defer_hard_irqs and gro_flush_timeout
could help even more. IRQ deferral is only needed for the busy poll queues.

6. The userspace app config lists each NIC with its NUMA-local CPUs, for
example:

   - eth0: 0,1,2,3
   - eth1: 4,5,6,7

The app reads that configuration in when it starts. Ideally, these are the
same CPUs the IRQs are pinned to in step 4, but hopefully the coalesce
settings let IRQs be deferred quite a bit so busy poll can take over.

7. App threads are created and listen sockets are opened with SO_REUSEPORT.
Notably, when the sockets are created, SO_BINDTODEVICE is used* (see below
for a longer explanation about this).

8. A cBPF reuseport program is inserted to distribute incoming connections
to threads based on skb->queue_mapping (a rough sketch follows this list).
skb->queue_mapping values are not unique across NICs (e.g. each NIC will
have a queue with queue_mapping == 0), which is why SO_BINDTODEVICE is
needed. Again, see below.

9. Worker thread epoll contexts are set to busy poll via the ioctl I've
submitted in my patches (also sketched after this list).
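
To make steps 8 and 9 a bit more concrete, here's roughly what the setup
code looks like. The cBPF program is the usual "return skb->queue_mapping"
reuseport filter; the epoll part uses EPIOCSPARAMS and struct epoll_params,
the names used in the series (exact fields may shift between revisions),
and of course none of that is in a released kernel. Treat both as sketches
with error handling trimmed; depending on your libc/kernel headers you may
need to define SO_ATTACH_REUSEPORT_CBPF yourself:

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <sys/socket.h>
  #include <sys/epoll.h>
  #include <linux/filter.h>

  /* Step 8: classic BPF program that returns skb->queue_mapping, so the
   * listen socket at index queue_mapping in the reuseport group is
   * selected. Attached to one socket per reuseport group.
   */
  static int attach_queue_mapping_filter(int listen_fd)
  {
          struct sock_filter code[] = {
                  /* A = skb->queue_mapping */
                  { BPF_LD | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_QUEUE },
                  /* return A */
                  { BPF_RET | BPF_A, 0, 0, 0 },
          };
          struct sock_fprog prog = {
                  .len = sizeof(code) / sizeof(code[0]),
                  .filter = code,
          };

          return setsockopt(listen_fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
                            &prog, sizeof(prog));
  }

  /* Step 9: enable busy poll on a worker's epoll fd. EPIOCSPARAMS and
   * struct epoll_params come from my patch series (not upstream); the
   * values below are just examples.
   */
  static int enable_epoll_busy_poll(int epoll_fd)
  {
          struct epoll_params params = {
                  .busy_poll_usecs = 200,
                  .busy_poll_budget = 64,
          };

          return ioctl(epoll_fd, EPIOCSPARAMS, &params);
  }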

The first time a worker thread receives a connection, it:

1. Calls getsockopt() with SO_INCOMING_NAPI_ID to get the NAPI ID
associated with the connection it received.

2. Takes that NAPI ID and calls SIOCGIFNAME_BY_NAPI_ID to figure out which
NIC the connection came in on.

3. Looks for an unused CPU associated with that NIC from the list it read
in at configuration time and then pins itself to that CPU. That CPU is
removed from the list so other threads can't take it.

All future incoming connections with the same NAPI ID will be distributed
to app threads which are pinned in the appropriate place and are doing busy
polling.
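
Putting those three steps together, the worker side looks roughly like this
(error handling trimmed; ifname_by_napi_id() is the SIOCGIFNAME_BY_NAPI_ID
wrapper sketched above and pick_cpu_for_nic() is a stand-in for the app's
config lookup, both hypothetical names):

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdint.h>
  #include <net/if.h>
  #include <sys/socket.h>

  int ifname_by_napi_id(int fd, uint32_t napi_id, char name[IFNAMSIZ]);
  int pick_cpu_for_nic(const char *ifname); /* pops an unused NUMA-local CPU */

  static int pin_worker_for_first_connection(int conn_fd)
  {
          unsigned int napi_id = 0;
          socklen_t len = sizeof(napi_id);
          char ifname[IFNAMSIZ];
          cpu_set_t set;
          int cpu;

          /* 1. NAPI ID of the queue this connection arrived on */
          if (getsockopt(conn_fd, SOL_SOCKET, SO_INCOMING_NAPI_ID,
                         &napi_id, &len) < 0)
                  return -1;

          /* 2. Which NIC does that NAPI ID belong to? */
          if (ifname_by_napi_id(conn_fd, napi_id, ifname) < 0)
                  return -1;

          /* 3. Pin this thread to an unused CPU NUMA-local to that NIC */
          cpu = pick_cpu_for_nic(ifname);
          if (cpu < 0)
                  return -1;

          CPU_ZERO(&set);
          CPU_SET(cpu, &set);
          return sched_setaffinity(0, &set, sizeof(set)); /* 0 == calling thread */
  }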

So, as you can see, SIOCGIFNAME_BY_NAPI_ID makes this implementation very
simple.

I plan to eventually add some information to the kernel networking
documentation to capture some more details of the above, which I think
might be helpful for others.

Thanks,
Joe

* Longer explanation about SO_BINDTODEVICE (only relevant if you have
multiple NICs):

It turns out that reuseport groups in the kernel are keyed on a few
attributes, port being one of them but also ifindex. Since multiple NICs
can have a queue with queue_mapping == 0, reuseport groups need to be
constructed in userland with care if there are multiple NICs. This is
required because each epoll context can only busy poll a single NAPI ID.
So, even if multiple NICs have queues with queue_mapping == 0, those queues
will have different NAPI IDs, and incoming connections must be distributed
to threads uniquely based on NAPI ID.

I am doing this by creating listen sockets for each NIC, one NIC at a time.
When the listen socket is created, SO_BINDTODEVICE is used on the socket
before calling listen.
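
Concretely, the per-NIC listen socket creation looks roughly like this
(a sketch with most error handling trimmed; make_listen_socket() is just an
illustrative name, and SO_BINDTODEVICE needs CAP_NET_RAW):

  #include <string.h>
  #include <unistd.h>
  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <sys/socket.h>

  /* One listen socket per (NIC, port, worker) slot. Because of
   * SO_BINDTODEVICE, the sockets for a given NIC form their own
   * reuseport group.
   */
  static int make_listen_socket(const char *ifname, uint16_t port)
  {
          struct sockaddr_in addr = {
                  .sin_family = AF_INET,
                  .sin_port = htons(port),
                  .sin_addr.s_addr = htonl(INADDR_ANY),
          };
          int one = 1;
          int fd;

          fd = socket(AF_INET, SOCK_STREAM, 0);
          if (fd < 0)
                  return -1;

          /* same port across all workers: reuseport group membership */
          setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

          /* scope this socket (and its group) to a single NIC */
          setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, ifname,
                     strlen(ifname) + 1);

          if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
              listen(fd, SOMAXCONN) < 0) {
                  close(fd);
                  return -1;
          }

          return fd;
  }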

In the kernel, this results in all listen sockets with the same ifindex
forming a reuseport group. So, if I have 2 NICs and 1 listen port (say port
80), I get 2 reuseport groups -- one for nic0 port 80 and one for nic1 port
80 -- because of SO_BINDTODEVICE.

The reuseport cBPF filter is inserted for each reuseport group, and then
the skb->queue_mapping based listen socket selection works as expected,
distributing NAPI IDs to app threads without breaking epoll busy poll.

Without the above, you can run into an issue where two connections with the
same queue_mapping (but from different NICs) can land in the same epoll
context, which breaks busy poll.

Another potential solution to avoid the above might be to use a more
complicated eBPF reuseport program with a hash that maps NAPI IDs to thread
IDs and distribute connections that way. This seemed cool, but involved a
lot more work, so I went with the SO_BINDTODEVICE + SIOCGIFNAME_BY_NAPI_ID
method instead, which was pretty simple (C code wise) and easy to
implement.

Thread overview: 24+ messages
2024-01-24  2:53 [net-next 0/3] Per epoll context busy poll support Joe Damato
2024-01-24  2:53 ` [net-next 1/3] eventpoll: support busy poll per epoll instance Joe Damato
2024-01-24  2:53 ` [net-next 2/3] eventpoll: Add per-epoll busy poll packet budget Joe Damato
2024-01-24  2:53 ` [net-next 3/3] eventpoll: Add epoll ioctl for epoll_params Joe Damato
2024-01-24 15:37   ` Joe Damato
2024-01-24  8:20 ` [net-next 0/3] Per epoll context busy poll support Eric Dumazet
2024-01-24 14:20   ` Joe Damato
2024-01-24 14:38     ` Eric Dumazet
2024-01-24 15:19       ` Joe Damato
2024-01-30 18:54   ` Samudrala, Sridhar
2024-02-02  3:28     ` Joe Damato
2024-02-02 17:23       ` Samudrala, Sridhar
2024-02-02 18:22         ` Jakub Kicinski
2024-02-02 19:33           ` Joe Damato [this message]
2024-02-02 19:58             ` Jakub Kicinski
2024-02-02 20:23               ` Joe Damato
2024-02-02 20:50                 ` Samudrala, Sridhar
2024-02-02 20:55                   ` Joe Damato
2024-02-03  1:15                 ` Jakub Kicinski
2024-02-05 18:17                   ` Stanislav Fomichev
2024-02-05 18:52                     ` Joe Damato
2024-02-05 19:48                       ` Jakub Kicinski
2024-02-05 18:51                   ` Joe Damato
2024-02-05 19:59                     ` Jakub Kicinski
