From: Martin Karsten <mkarsten@uwaterloo.ca>
To: Samiullah Khawaja <skhawaja@google.com>,
Jakub Kicinski <kuba@kernel.org>,
"David S . Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Paolo Abeni <pabeni@redhat.com>,
almasrymina@google.com, willemb@google.com
Cc: Joe Damato <joe@dama.to>, netdev@vger.kernel.org
Subject: Re: [PATCH net-next v7 0/2] Add support to do threaded napi busy poll
Date: Sun, 24 Aug 2025 20:03:36 -0400
Message-ID: <8407a1e5-c6ad-47da-9b41-978730cd5420@uwaterloo.ca>
In-Reply-To: <20250824215418.257588-1-skhawaja@google.com>

On 2025-08-24 17:54, Samiullah Khawaja wrote:
> Extend the already existing support of threaded napi poll to do continuous
> busy polling.
>
> This is used for doing continuous polling of napi to fetch descriptors
> from backing RX/TX queues for low latency applications. Allow enabling
> of threaded busypoll using netlink so this can be enabled on a set of
> dedicated napis for low latency applications.
>
> Once enabled, the user can fetch the PID of the kthread doing NAPI polling
> and set affinity, priority and scheduler for it depending on the
> low-latency requirements.
>
> Currently threaded napi is only enabled at device level using sysfs. Add
> support to enable/disable threaded mode for a napi individually. This
> can be done using the netlink interface. Extend `napi-set` op in netlink
> spec that allows setting the `threaded` attribute of a napi.
>
> Extend the threaded attribute in napi struct to add an option to enable
> continuous busy polling. Extend the netlink and sysfs interface to allow
> enabling/disabling threaded busypolling at device or individual napi
> level.
>
> We use this for our AF_XDP based hard low-latency usecase with usecs
> level latency requirement. For our usecase we want low jitter and stable
> latency at P99.
>
> Following is an analysis and comparison of available (and compatible)
> busy poll interfaces for a low latency usecase with stable P99. Please
> note that throughput and CPU efficiency are non-goals.
>
> For analysis we use an AF_XDP based benchmarking tool `xsk_rr`. The
> description of the tool and how it tries to simulate the real workload
> is as follows:
>
> - It sends UDP packets between 2 machines.
> - The client machine sends packets at a fixed frequency. To maintain the
> frequency of the packet being sent, we use open-loop sampling. That is
> the packets are sent in a separate thread.
> - The server replies to the packet inline by reading the pkt from the
> recv ring and replies using the tx ring.
> - To simulate the application processing time, we use a configurable
> delay in usecs on the client side after a reply is received from the
> server.
>
> The xsk_rr tool is posted separately as an RFC for tools/testing/selftest.
>
> We use this tool with following napi polling configurations,
>
> - Interrupts only
> - SO_BUSYPOLL (inline in the same thread where the client receives the
> packet).
> - SO_BUSYPOLL (separate thread and separate core)
> - Threaded NAPI busypoll
>
> System is configured using following script in all 4 cases,
>
> ```
> echo 0 | sudo tee /sys/class/net/eth0/threaded
> echo 0 | sudo tee /proc/sys/kernel/timer_migration
> echo off | sudo tee /sys/devices/system/cpu/smt/control
>
> sudo ethtool -L eth0 rx 1 tx 1
> sudo ethtool -G eth0 rx 1024
>
> echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
> echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
>
> # pin IRQs on CPU 2
> IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
> print arr[0]}' < /proc/interrupts)"
> for irq in ${IRQS}; \
> do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done
>
> echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
>
> for i in /sys/devices/virtual/workqueue/*/cpumask; \
> do echo $i; echo 1,2,3,4,5,6 > $i; done
>
> if [[ -z "$1" ]]; then
> echo 400 | sudo tee /proc/sys/net/core/busy_read
> echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> fi
>
> sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0
>
> if [[ "$1" == "enable_threaded" ]]; then
> echo 0 | sudo tee /proc/sys/net/core/busy_poll
> echo 0 | sudo tee /proc/sys/net/core/busy_read
> echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> echo 2 | sudo tee /sys/class/net/eth0/threaded
> NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
> sudo chrt -f -p 50 $NAPI_T
>
> # pin threaded poll thread to CPU 2
> sudo taskset -pc 2 $NAPI_T
> fi
>
> if [[ "$1" == "enable_interrupt" ]]; then
> echo 0 | sudo tee /proc/sys/net/core/busy_read
> echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> fi
> ```
>
> To enable the various configurations, the script can be run as follows:
>
> - Interrupt Only
> ```
> <script> enable_interrupt
> ```
>
> - SO_BUSYPOLL (no arguments to script)
> ```
> <script>
> ```
>
> - NAPI threaded busypoll
> ```
> <script> enable_threaded
> ```
>
> If using idpf, the script needs to be run again after launching the
> workload to make sure that the configurations are not reverted, as
> idpf reverts some configurations on software reset when an AF_XDP
> program is attached.
>
> Once configured, the workload is run with various configurations using
> following commands. Set period (1/frequency) and delay in usecs to
> produce results for packet frequency and application processing delay.
>
> ## Interrupt Only and SO_BUSY_POLL (inline)
>
> - Server
> ```
> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v
> ```
>
> - Client
> ```
> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v
> ```
>
> ## SO_BUSY_POLL (done in separate core using recvfrom)
>
> Argument -t spawns a separate thread and continuously calls recvfrom.
>
> - Server
> ```
> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> -h -v -t
> ```
>
> - Client
> ```
> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -t
> ```
>
> ## NAPI Threaded Busy Poll
>
> Argument -n skips the recvfrom call as there is no recv kick needed.
>
> - Server
> ```
> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> -h -v -n
> ```
>
> - Client
> ```
> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -n
> ```
>
> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
> |---|---|---|---|---|
> | 12 Kpkt/s + 0us delay | | | | |
> | | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
> | | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
> | | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
> | | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
> | 32 Kpkt/s + 30us delay | | | | |
> | | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
> | | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
> | | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
> | | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
> | 125 Kpkt/s + 6us delay | | | | |
> | | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
> | | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
> | | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
> | | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
> | 12 Kpkt/s + 78us delay | | | | |
> | | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
> | | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
> | | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
> | | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
> | 25 Kpkt/s + 38us delay | | | | |
> | | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
> | | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
> | | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
> | | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
>
> ## Observations

Hi Samiullah,

I believe you are comparing apples and oranges with these experiments.
Because threaded busy poll uses two cores at each end (at 100%), you
should compare with 2 pairs of xsk_rr processes using interrupt mode,
but each running at half the rate. I am quite certain you would then see
the same latency as in the baseline experiment, at much reduced CPU
utilization.
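
As a rough sketch of what "half the rate" would mean for the -P (period in
usecs) argument, the rates from the quoted table can be converted as below.
The helper name `period_us` is made up for illustration, and the sketch
assumes the period is simply 1/frequency, as the cover letter states.

```shell
#!/bin/sh
# Hypothetical helper: period in usecs for a packet rate given in Kpkt/s.
period_us() {
    awk -v kpps="$1" 'BEGIN { printf "%.2f\n", 1000 / kpps }'
}

# Rates taken from the quoted table. With two interrupt-mode pairs,
# each pair would run at half the rate, i.e. twice the period.
for kpps in 12 32 125 25; do
    half=$(awk -v k="$kpps" 'BEGIN { print k / 2 }')
    echo "${kpps} Kpkt/s: period $(period_us "$kpps") us," \
         "per pair at half rate: $(period_us "$half") us"
done
```

For example, the 125 Kpkt/s case corresponds to an 8 usec period, so each of
the two interrupt-mode pairs would send with a 16 usec period.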

Threaded busy poll reduces p99 latency by just 100 nsec while
busy-spinning two cores at each end, no more and no less. I continue to
believe that this trade-off and these limited benefits need to be
clearly and explicitly spelled out in the cover letter.
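
One way to read the quoted table under this argument is to take the
interrupt-mode p99 at 12 Kpkt/s + 0us delay as the baseline and compare it
against the NAPI threaded p99 of every experiment. The sketch below does
that with values copied from the table; it assumes the table values are in
nsec, and the helper names `p99_table` and `baseline_gap` are made up for
illustration.

```shell
#!/bin/sh
# p99 values copied from the quoted table (assumed to be nsec).
# Columns: experiment, interrupts p99, NAPI threaded p99.
p99_table() {
    cat <<'EOF'
12Kpps+0us 13200 13000
32Kpps+30us 21200 13000
125Kpps+6us 15600 13100
12Kpps+78us 14300 12800
25Kpps+38us 21100 12900
EOF
}

# Hypothetical helper: gap between the interrupt baseline (first row,
# second column) and the threaded p99 of each experiment.
baseline_gap() {
    p99_table | awk 'NR == 1 { base = $2 } { printf "%s %d\n", $1, base - $3 }'
}

baseline_gap
```

Under these assumptions the gap comes out between 100 and 400 per
experiment, which is the order of magnitude at stake against two
busy-spinning cores at each end.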

Best,
Martin