From: Stanislav Fomichev <stfomichev@gmail.com>
To: Samiullah Khawaja <skhawaja@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>,
	"David S . Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Paolo Abeni <pabeni@redhat.com>,
	almasrymina@google.com, willemb@google.com,
	mkarsten@uwaterloo.ca, Joe Damato <joe@dama.to>,
	netdev@vger.kernel.org
Subject: Re: [PATCH net-next v7 0/2] Add support to do threaded napi busy poll
Date: Mon, 25 Aug 2025 15:42:51 -0700
Message-ID: <aKzm660pL_JvcsdW@mini-arch>
In-Reply-To: <CAAywjhQVVbK=CthODv=sVez1Q6yhYp+seuxHv5gstGHZz20vEg@mail.gmail.com>

On 08/25, Samiullah Khawaja wrote:
> On Mon, Aug 25, 2025 at 12:37 PM Stanislav Fomichev
> <stfomichev@gmail.com> wrote:
> >
> > On 08/24, Samiullah Khawaja wrote:
> > > Extend the already existing support of threaded napi poll to do continuous
> > > busy polling.
> > >
> > > This is used for doing continuous polling of napi to fetch descriptors
> > > from backing RX/TX queues for low latency applications. Allow enabling
> > > of threaded busypoll using netlink so this can be enabled on a set of
> > > dedicated napis for low latency applications.
> > >
> > > Once enabled user can fetch the PID of the kthread doing NAPI polling
> > > and set affinity, priority and scheduler for it depending on the
> > > low-latency requirements.
> > >
> > > Currently threaded napi is only enabled at device level using sysfs. Add
> > > support to enable/disable threaded mode for a napi individually. This
> > > can be done using the netlink interface. Extend `napi-set` op in netlink
> > > spec that allows setting the `threaded` attribute of a napi.
> > >
> > > Extend the threaded attribute in napi struct to add an option to enable
> > > continuous busy polling. Extend the netlink and sysfs interface to allow
> > > enabling/disabling threaded busypolling at device or individual napi
> > > level.
> > >
> > > We use this for our AF_XDP based hard low-latency usecase with usecs
> > > level latency requirement. For our usecase we want low jitter and stable
> > > latency at P99.
> > >
> > > Following is an analysis and comparison of available (and compatible)
> > > busy poll interfaces for a low latency usecase with stable P99. Please
> > > note that the throughput and cpu efficiency is a non-goal.
> > >
> > > For analysis we use an AF_XDP based benchmarking tool `xdp_rr`. The
> > > description of the tool and how it tries to simulate the real workload
> > > is following,
> > >
> > > - It sends UDP packets between 2 machines.
> > > - The client machine sends packets at a fixed frequency. To maintain the
> > >   frequency of the packet being sent, we use open-loop sampling. That is
> > >   the packets are sent in a separate thread.
> > > - The server replies to the packet inline by reading the pkt from the
> > >   recv ring and replies using the tx ring.
> > > - To simulate the application processing time, we use a configurable
> > >   delay in usecs on the client side after a reply is received from the
> > >   server.
> > >
> > > The xdp_rr tool is posted separately as an RFC for tools/testing/selftest.
> > >
> > > We use this tool with following napi polling configurations,
> > >
> > > - Interrupts only
> > > - SO_BUSYPOLL (inline in the same thread where the client receives the
> > >   packet).
> > > - SO_BUSYPOLL (separate thread and separate core)
> > > - Threaded NAPI busypoll
> > >
> > > System is configured using following script in all 4 cases,
> > >
> > > ```
> > > echo 0 | sudo tee /sys/class/net/eth0/threaded
> > > echo 0 | sudo tee /proc/sys/kernel/timer_migration
> > > echo off | sudo tee  /sys/devices/system/cpu/smt/control
> > >
> > > sudo ethtool -L eth0 rx 1 tx 1
> > > sudo ethtool -G eth0 rx 1024
> > >
> > > echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
> > > echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
> > >
> > >  # pin IRQs on CPU 2
> > > IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
> > >                               print arr[0]}' < /proc/interrupts)"
> > > for irq in ${IRQS}; \
> > >       do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done
> > >
> > > echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
> > >
> > > for i in /sys/devices/virtual/workqueue/*/cpumask; \
> > >                       do echo $i; echo 1,2,3,4,5,6 > $i; done
> > >
> > > if [[ -z "$1" ]]; then
> > >   echo 400 | sudo tee /proc/sys/net/core/busy_read
> > >   echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> > >   echo 15000   | sudo tee /sys/class/net/eth0/gro_flush_timeout
> > > fi
> > >
> > > sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0
> > >
> > > if [[ "$1" == "enable_threaded" ]]; then
> > >   echo 0 | sudo tee /proc/sys/net/core/busy_poll
> > >   echo 0 | sudo tee /proc/sys/net/core/busy_read
> > >   echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> > >   echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> > >   echo 2 | sudo tee /sys/class/net/eth0/threaded
> > >   NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
> > >   sudo chrt -f  -p 50 $NAPI_T
> > >
> > >   # pin threaded poll thread to CPU 2
> > >   sudo taskset -pc 2 $NAPI_T
> > > fi
> > >
> > > if [[ "$1" == "enable_interrupt" ]]; then
> > >   echo 0 | sudo tee /proc/sys/net/core/busy_read
> > >   echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> > >   echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> > > fi
> > > ```
> > >
> > > To enable various configurations, script can be run as following,
> > >
> > > - Interrupt Only
> > >   ```
> > >   <script> enable_interrupt
> > >   ```
> > >
> > > - SO_BUSYPOLL (no arguments to script)
> > >   ```
> > >   <script>
> > >   ```
> > >
> > > - NAPI threaded busypoll
> > >   ```
> > >   <script> enable_threaded
> > >   ```
> > >
> > > If using idpf, the script needs to be run again after launching the
> > > workload just to make sure that the configurations are not reverted. As
> > > idpf reverts some configurations on software reset when AF_XDP program
> > > is attached.
> > >
> > > Once configured, the workload is run with various configurations using
> > > following commands. Set period (1/frequency) and delay in usecs to
> > > produce results for packet frequency and application processing delay.
> > >
> > >  ## Interrupt Only and SO_BUSY_POLL (inline)
> > >
> > > - Server
> > > ```
> > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> > >       -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v
> > > ```
> > >
> > > - Client
> > > ```
> > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> > >       -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> > >       -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v
> > > ```
> > >
> > >  ## SO_BUSY_POLL(done in separate core using recvfrom)
> > >
> > > Argument -t spawns a separate thread and continuously calls recvfrom.
> > >
> > > - Server
> > > ```
> > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> > >       -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> > >       -h -v -t
> > > ```
> > >
> > > - Client
> > > ```
> > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> > >       -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> > >       -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -t
> > > ```
> > >
> > >  ## NAPI Threaded Busy Poll
> > >
> > > Argument -n skips the recvfrom call as there is no recv kick needed.
> > >
> > > - Server
> > > ```
> > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> > >       -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> > >       -h -v -n
> > > ```
> > >
> > > - Client
> > > ```
> > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> > >       -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> > >       -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -n
> > > ```
> > >
> > > | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
> > > |---|---|---|---|---|
> > > | 12 Kpkt/s + 0us delay | | | | |
> > > |  | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
> > > |  | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
> > > |  | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
> > > |  | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
> > > | 32 Kpkt/s + 30us delay | | | | |
> > > |  | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
> > > |  | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
> > > |  | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
> > > |  | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
> > > | 125 Kpkt/s + 6us delay | | | | |
> > > |  | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
> > > |  | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
> > > |  | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
> > > |  | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
> > > | 12 Kpkt/s + 78us delay | | | | |
> > > |  | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
> > > |  | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
> > > |  | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
> > > |  | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
> > > | 25 Kpkt/s + 38us delay | | | | |
> > > |  | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
> > > |  | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
> > > |  | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
> > > |  | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
> > >
> > >  ## Observations
> > >
> > > - Here without application processing all the approaches give the same
> > >   latency within 1usecs range and NAPI threaded gives minimum latency.
> > > - With application processing the latency increases by 3-4usecs when
> > >   doing inline polling.
> > > - Using a dedicated core to drive napi polling keeps the latency same
> > >   even with application processing. This is observed both in userspace
> > >   and threaded napi (in kernel).
> > > - Using napi threaded polling in kernel gives lower latency by
> > >   1-1.5usecs as compared to userspace driven polling in separate core.
> > > - With application processing userspace will get the packet from recv
> > >   ring and spend some time doing application processing and then do napi
> > >   polling. While application processing is happening a dedicated core
> > >   doing napi polling can pull the packet of the NAPI RX queue and
> > >   populate the AF_XDP recv ring. This means that when the application
> > >   thread is done with application processing it has new packets ready to
> > >   recv and process in recv ring.
> > > - Napi threaded busy polling in the kernel with a dedicated core gives
> > >   the consistent P5-P99 latency.
> >
> > The real take away for me is ~1us difference between SO_BUSYPOLL in a
> > thread and NAPI threaded. Presumably mostly because of the non-blocking calls
> > to sk_busy_loop in the former? So it takes 1us extra to enter/leave the kernel
> > and setup/teardown the busy polling?
> >
> > And you haven't tried epoll based busy polling? I'd expect to see
> > results similar to your NAPI threaded (if it works correctly).
> I haven't attempted epoll-based NAPI polling because my understanding
> is that it only polls NAPI when no events are present. Let me check.

I was under the impression that xsk won't actually add any (socket) events
to the epoll instance, so epoll would keep busy polling until the timeout.
But I might be wrong; it's still worth double-checking.

> > (have nothing against the busy polling thread, mostly trying to
> > understand what we are missing from the existing setup)
> The missing piece is a mechanism to busy poll a NAPI instance in a
> dedicated thread while ignoring available events or packets,
> regardless of the userspace API. Most existing mechanisms are designed
> to work in a pattern where you poll until new packets or events are
> received, after which userspace is expected to handle them.
> 
> As a result, one has to hack together a solution using a mechanism
> intended to receive packets or events, not to simply NAPI poll. NAPI
> threaded, on the other hand, provides this capability natively,
> independent of any userspace API.

Agreed, yes. Would be nice to document it in the commit description. Explain
how SO_BUSY_POLL in a thread is still not enough (polls only once,
doesn't busy-poll until the events are ready -> 1-2us of extra latency).
And the same for epoll, depending on how it goes. If it ends up working,
a kthread might still be more convenient to set up and manage.
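
As a rough sanity check of the extra-latency figure, here is a small sketch
that differences the p99 columns for "SO_BUSYPOLL(separate)" and "NAPI
threaded" from the table quoted above. The values are copied from that table;
treating them as nanoseconds is my assumption (the table does not state a
unit), and the `p99_data` and `p99_delta` names are made up for illustration.

```shell
#!/usr/bin/env bash
# p99 values copied from the quoted results table.
# Assumption: the unit is nanoseconds (the table does not say).
# Columns: experiment, SO_BUSYPOLL (separate thread), NAPI threaded.
p99_data='12Kpkt/s+0us 14400 13000
32Kpkt/s+30us 14000 13000
125Kpkt/s+6us 14000 13100
12Kpkt/s+78us 14200 12800
25Kpkt/s+38us 14100 12900'

# Print "<experiment> <delta>" where delta = separate-thread p99 minus
# NAPI threaded p99, one line per experiment read from stdin.
p99_delta() {
  awk '{ printf "%s %d\n", $1, $2 - $3 }'
}

printf '%s\n' "$p99_data" | p99_delta
```

With those inputs the deltas come out between 900 and 1400, i.e. roughly
0.9-1.4us if the unit assumption holds, which is in line with the ~1us
difference discussed earlier in the thread.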
