* [PATCH net-next v7 0/2] Add support to do threaded napi busy poll
@ 2025-08-24 21:54 Samiullah Khawaja
  2025-08-24 21:54 ` [PATCH net-next v7 1/2] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
  ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Samiullah Khawaja @ 2025-08-24 21:54 UTC (permalink / raw)
  To: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, mkarsten
  Cc: Joe Damato, netdev, skhawaja

Extend the already existing support of threaded napi poll to do continuous
busy polling.

This is used for doing continuous polling of a napi to fetch descriptors
from the backing RX/TX queues for low latency applications. Allow enabling
of threaded busy poll using netlink so this can be enabled on a set of
dedicated napis for low latency applications.

Once enabled, the user can fetch the PID of the kthread doing NAPI polling
and set its affinity, priority and scheduler depending on the low-latency
requirements.

Currently threaded napi is only enabled at device level using sysfs. Add
support to enable/disable threaded mode for a napi individually. This
can be done using the netlink interface. Extend the `napi-set` op in the
netlink spec to allow setting the `threaded` attribute of a napi.

Extend the threaded attribute in the napi struct to add an option to enable
continuous busy polling. Extend the netlink and sysfs interfaces to allow
enabling/disabling threaded busy polling at device or individual napi
level.

We use this for our AF_XDP based hard low-latency usecase with usecs
level latency requirements. For our usecase we want low jitter and stable
latency at P99.

Following is an analysis and comparison of the available (and compatible)
busy poll interfaces for a low latency usecase with stable P99. Please
note that throughput and cpu efficiency are non-goals.

For the analysis we use an AF_XDP based benchmarking tool `xdp_rr`. The
tool, and how it tries to simulate the real workload, is described below:

- It sends UDP packets between 2 machines.
- The client machine sends packets at a fixed frequency. To maintain the
  frequency of the packets being sent, we use open-loop sampling. That is,
  the packets are sent in a separate thread.
- The server replies to the packet inline by reading the pkt from the
  recv ring and replies using the tx ring.
- To simulate the application processing time, we use a configurable
  delay in usecs on the client side after a reply is received from the
  server.

The xdp_rr tool is posted separately as an RFC for tools/testing/selftest.

We use this tool with the following napi polling configurations:

- Interrupts only
- SO_BUSYPOLL (inline in the same thread where the client receives the
  packet).
- SO_BUSYPOLL (separate thread and separate core)
- Threaded NAPI busypoll

The system is configured using the following script in all 4 cases:

```
echo 0 | sudo tee /sys/class/net/eth0/threaded
echo 0 | sudo tee /proc/sys/kernel/timer_migration
echo off | sudo tee /sys/devices/system/cpu/smt/control

sudo ethtool -L eth0 rx 1 tx 1
sudo ethtool -G eth0 rx 1024

echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus

# pin IRQs on CPU 2
IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
	print arr[0]}' < /proc/interrupts)"
for irq in "${IRQS}"; \
	do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done

echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us

for i in /sys/devices/virtual/workqueue/*/cpumask; \
	do echo $i; echo 1,2,3,4,5,6 > $i; done

if [[ -z "$1" ]]; then
	echo 400 | sudo tee /proc/sys/net/core/busy_read
	echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
	echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
fi

sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0

if [[ "$1" == "enable_threaded" ]]; then
	echo 0 | sudo tee /proc/sys/net/core/busy_poll
	echo 0 | sudo tee /proc/sys/net/core/busy_read
	echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
	echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
	echo 2 | sudo tee /sys/class/net/eth0/threaded
	NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
	sudo chrt -f -p 50 $NAPI_T

	# pin threaded poll thread to CPU 2
	sudo taskset -pc 2 $NAPI_T
fi

if [[ "$1" == "enable_interrupt" ]]; then
	echo 0 | sudo tee /proc/sys/net/core/busy_read
	echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
	echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
fi
```

To enable the various configurations, the script can be run as follows:

- Interrupt Only
```
<script> enable_interrupt
```

- SO_BUSYPOLL (no arguments to script)
```
<script>
```

- NAPI threaded busypoll
```
<script> enable_threaded
```

If using idpf, the script needs to be run again after launching the
workload just to make sure that the configurations are not reverted, as
idpf reverts some configurations on software reset when an AF_XDP program
is attached.

Once configured, the workload is run with various configurations using the
following commands. Set the period (1/frequency) and the delay in usecs to
produce results for a given packet frequency and application processing
delay.

## Interrupt Only and SO_BUSY_POLL (inline)

- Server
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
	-D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v
```

- Client
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
	-S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
	-P <Period-usecs> -d <Delay-usecs> -T -l 1 -v
```

## SO_BUSY_POLL (done on a separate core using recvfrom)

Argument -t spawns a separate thread that continuously calls recvfrom.

- Server
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
	-D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
	-h -v -t
```

- Client
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
	-S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
	-P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -t
```

## NAPI Threaded Busy Poll

Argument -n skips the recvfrom call as there is no recv kick needed.
- Server
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
	-D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
	-h -v -n
```

- Client
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
	-S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
	-P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -n
```

| Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
|---|---|---|---|---|
| 12 Kpkt/s + 0us delay | | | | |
| | p5: 12700  | p5: 12900  | p5: 13300  | p5: 12800  |
| | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
| | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
| | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
| 32 Kpkt/s + 30us delay | | | | |
| | p5: 19900  | p5: 16600  | p5: 13100  | p5: 12800  |
| | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
| | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
| | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
| 125 Kpkt/s + 6us delay | | | | |
| | p5: 14600  | p5: 17100  | p5: 13300  | p5: 12900  |
| | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
| | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
| | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
| 12 Kpkt/s + 78us delay | | | | |
| | p5: 14100  | p5: 16700  | p5: 13200  | p5: 12600  |
| | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
| | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
| | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
| 25 Kpkt/s + 38us delay | | | | |
| | p5: 19900  | p5: 16600  | p5: 13000  | p5: 12700  |
| | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
| | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
| | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |

## Observations

- Without application processing, all the approaches give the same latency
  within a 1 usec range, and NAPI threaded gives the minimum latency.
- With application processing, the latency increases by 3-4 usecs when doing
  inline polling.
- Using a dedicated core to drive napi polling keeps the latency the same
  even with application processing. This is observed both with userspace
  polling and with threaded napi (in kernel).
- Napi threaded polling in the kernel gives 1-1.5 usecs lower latency than
  userspace-driven polling on a separate core.
- With application processing, userspace gets the packet from the recv ring,
  spends some time doing application processing and only then does napi
  polling. While application processing is happening, a dedicated core doing
  napi polling can pull packets off the NAPI RX queue and populate the
  AF_XDP recv ring. This means that when the application thread is done with
  application processing it already has new packets ready to recv and
  process in the recv ring.
- Napi threaded busy polling in the kernel with a dedicated core gives
  consistent P5-P99 latency.

The following histogram measures the time spent in recvfrom when busy
polling inline with SO_BUSYPOLL. It is generated using the bpftrace command
below. In this experiment there are 32K packets per second and the
application processing delay is 30 usecs. The goal is to check whether the
time spent pulling packets from the descriptor queue is significant enough
to affect the overall latency when done inline.
```
bpftrace -e '
kprobe:xsk_recvmsg { @start[tid] = nsecs; }
kretprobe:xsk_recvmsg {
	if (@start[tid]) {
		$sample = (nsecs - @start[tid]);
		@xsk_recvfrom_hist = hist($sample);
		delete(@start[tid]);
	}
}
END { clear(@start); }'
```

Here, in the case of inline busy polling, around 35 percent of the calls
take 1-2 usecs and around 50 percent take 0.5-2 usecs.

@xsk_recvfrom_hist:
[128, 256)   24073 |@@@@@@@@@@@@@@@@@@@@@@                              |
[256, 512)   55633 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K)    20974 |@@@@@@@@@@@@@@@@@@@                                 |
[1K, 2K)     34234 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     |
[2K, 4K)      3266 |@@@                                                 |
[4K, 8K)        19 |                                                    |

v7:
- Rebased.

v6:
- Moved threaded in struct netdevice up to fill the cacheline hole.
- Changed dev_set_threaded to dev_set_threaded_hint and removed the second
  argument that was always set to true by all the drivers. Exported only
  dev_set_threaded_hint and made dev_set_threaded a core-only function.
  This change is done in a separate commit.
- Updated documentation comment for threaded in struct netdevice.
- gro_flush_helper renamed to gro_flush_normal and moved to gro.h. Also
  used it in kernel/bpf/cpumap.c.
- Updated documentation to explicitly state that NAPI threaded busy polling
  would keep the CPU core busy at 100% usage.
- Updated documentation and commit messages.

v5:
- Updated experiment data with 'SO_PREFER_BUSY_POLL' usage as suggested.
- Sent 'Add support to set napi threaded for individual napi' separately.
  This series depends on top of that patch.
  https://lore.kernel.org/netdev/20250423201413.1564527-1-skhawaja@google.com/
- Added a separate patch to use enum for napi threaded state. Updated the
  nl_netdev python test.
- Using "write all" semantics when napi settings are set at device level.
  This aligns with already existing behaviour for other settings.
- Fix comments to make them kdoc compatible.
- Updated Documentation/networking/net_cachelines/net_device.rst
- Updated the missed gro_flush modification in napi_complete_done

v4:
- Using AF_XDP based benchmark for experiments.
- Re-enable dev level napi threaded busypoll after soft reset.

v3:
- Fixed calls to dev_set_threaded in drivers

v2:
- Add documentation in napi.rst.
- Provide experiment data and usecase details.
- Update busy_poller selftest to include napi threaded poll testcase.
- Define threaded mode enum in netlink interface.
- Included NAPI threaded state in napi config to save/restore.

Samiullah Khawaja (2):
  Extend napi threaded polling to allow kthread based busy polling
  selftests: Add napi threaded busy poll test in `busy_poller`

 Documentation/ABI/testing/sysfs-class-net     |  3 +-
 Documentation/netlink/specs/netdev.yaml       |  5 +-
 Documentation/networking/napi.rst             | 63 +++++++++++++++++-
 include/linux/netdevice.h                     | 11 +++-
 include/uapi/linux/netdev.h                   |  1 +
 net/core/dev.c                                | 66 ++++++++++++++++---
 net/core/dev.h                                |  3 +
 net/core/net-sysfs.c                          |  2 +-
 net/core/netdev-genl-gen.c                    |  2 +-
 tools/include/uapi/linux/netdev.h             |  1 +
 tools/testing/selftests/net/busy_poll_test.sh | 25 ++++++-
 tools/testing/selftests/net/busy_poller.c     | 14 +++-
 12 files changed, 177 insertions(+), 19 deletions(-)

base-commit: c3199adbe4ffffc7b6536715e0290d1919a45cd9
-- 
2.51.0.rc1.193.gad69d77794-goog

^ permalink raw reply	[flat|nested] 18+ messages in thread
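For reference, the per-NAPI workflow described in the cover letter above can
be scripted roughly as follows. This is only a sketch: the NAPI id (66),
CPU 2 and priority 50 are placeholder values reused from the examples in
this series, and the sed-based pid extraction assumes the dict-style output
of the ynl CLI shown in the napi.rst change of patch 1.

```
# Enable threaded busy polling for one NAPI only (id 66 here).
ynl --family netdev --do napi-set \
	--json='{"id": 66, "threaded": "busy-poll-enabled"}'

# Fetch the PID of the kthread that now busy polls this NAPI.
NAPI_PID=$(ynl --family netdev --do napi-get --json='{"id": 66}' | \
	sed -n "s/.*'pid': \([0-9]\+\).*/\1/p")

# Give the poller thread an RT priority and dedicate CPU 2 to it.
sudo chrt -f -p 50 "$NAPI_PID"
sudo taskset -pc 2 "$NAPI_PID"
```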
* [PATCH net-next v7 1/2] Extend napi threaded polling to allow kthread based busy polling 2025-08-24 21:54 [PATCH net-next v7 0/2] Add support to do threaded napi busy poll Samiullah Khawaja @ 2025-08-24 21:54 ` Samiullah Khawaja 2025-08-25 19:47 ` Stanislav Fomichev 2025-08-24 21:54 ` [PATCH net-next v7 2/2] selftests: Add napi threaded busy poll test in `busy_poller` Samiullah Khawaja ` (2 subsequent siblings) 3 siblings, 1 reply; 18+ messages in thread From: Samiullah Khawaja @ 2025-08-24 21:54 UTC (permalink / raw) To: Jakub Kicinski, David S . Miller , Eric Dumazet, Paolo Abeni, almasrymina, willemb, mkarsten Cc: Joe Damato, netdev, skhawaja Add a new state to napi state enum: - NAPI_STATE_THREADED_BUSY_POLL Threaded busy poll is enabled/running for this napi. Following changes are introduced in the napi scheduling and state logic: - When threaded busy poll is enabled through sysfs or netlink it also enables NAPI_STATE_THREADED so a kthread is created per napi. It also sets NAPI_STATE_THREADED_BUSY_POLL bit on each napi to indicate that it is going to busy poll the napi. - When napi is scheduled with NAPI_STATE_SCHED_THREADED and associated kthread is woken up, the kthread owns the context. If NAPI_STATE_THREADED_BUSY_POLL and NAPI_STATE_SCHED_THREADED both are set then it means that kthread can busy poll. - To keep busy polling and to avoid scheduling of the interrupts, the napi_complete_done returns false when both NAPI_STATE_SCHED_THREADED and NAPI_STATE_THREADED_BUSY_POLL flags are set. Also napi_complete_done returns early to avoid the NAPI_STATE_SCHED_THREADED being unset. - If at any point NAPI_STATE_THREADED_BUSY_POLL is unset, the napi_complete_done will run and unset the NAPI_STATE_SCHED_THREADED bit also. This will make the associated kthread go to sleep as per existing logic. Signed-off-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> --- Documentation/ABI/testing/sysfs-class-net | 3 +- Documentation/netlink/specs/netdev.yaml | 5 +- Documentation/networking/napi.rst | 63 +++++++++++++++++++++- include/linux/netdevice.h | 11 +++- include/uapi/linux/netdev.h | 1 + net/core/dev.c | 66 +++++++++++++++++++---- net/core/dev.h | 3 ++ net/core/net-sysfs.c | 2 +- net/core/netdev-genl-gen.c | 2 +- tools/include/uapi/linux/netdev.h | 1 + 10 files changed, 142 insertions(+), 15 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-class-net b/Documentation/ABI/testing/sysfs-class-net index ebf21beba846..15d7d36a8294 100644 --- a/Documentation/ABI/testing/sysfs-class-net +++ b/Documentation/ABI/testing/sysfs-class-net @@ -343,7 +343,7 @@ Date: Jan 2021 KernelVersion: 5.12 Contact: netdev@vger.kernel.org Description: - Boolean value to control the threaded mode per device. User could + Integer value to control the threaded mode per device. User could set this value to enable/disable threaded mode for all napi belonging to this device, without the need to do device up/down. @@ -351,4 +351,5 @@ Description: == ================================== 0 threaded mode disabled for this dev 1 threaded mode enabled for this dev + 2 threaded mode enabled, and busy polling enabled. 
== ================================== diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml index c035dc0f64fd..ee2cfb121dbb 100644 --- a/Documentation/netlink/specs/netdev.yaml +++ b/Documentation/netlink/specs/netdev.yaml @@ -88,7 +88,7 @@ definitions: - name: napi-threaded type: enum - entries: [disabled, enabled] + entries: [disabled, enabled, busy-poll-enabled] attribute-sets: - @@ -292,6 +292,9 @@ attribute-sets: doc: Whether the NAPI is configured to operate in threaded polling mode. If this is set to enabled then the NAPI context operates in threaded polling mode. + mode. If this is set to enabled then the NAPI context operates + in threaded polling mode. If this is set to busy-poll-enabled + then the NAPI kthread also does busypolling. type: u32 enum: napi-threaded - diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst index a15754adb041..a1e76341a99a 100644 --- a/Documentation/networking/napi.rst +++ b/Documentation/networking/napi.rst @@ -263,7 +263,9 @@ are not well known). Busy polling is enabled by either setting ``SO_BUSY_POLL`` on selected sockets or using the global ``net.core.busy_poll`` and ``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling -also exists. +also exists. Threaded polling of NAPI also has a mode to busy poll for +packets (:ref:`threaded busy polling<threaded_busy_poll>`) using the same +thread that is used for NAPI processing. epoll-based busy polling ------------------------ @@ -426,6 +428,65 @@ Therefore, setting ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` is the recommended usage, because otherwise setting ``irq-suspend-timeout`` might not have any discernible effect. +.. _threaded_busy_poll: + +Threaded NAPI busy polling +-------------------------- + +Threaded NAPI allows processing of packets from each NAPI in a kthread in +kernel. Threaded NAPI busy polling extends this and adds support to do +continuous busy polling of this NAPI. This can be used to enable busy polling +independent of userspace application or the API (epoll, io_uring, raw sockets) +being used in userspace to process the packets. + +It can be enabled for each NAPI using netlink interface or at device level using +the threaded NAPI sysctl. + +For example, using following script: + +.. code-block:: bash + + $ ynl --family netdev --do napi-set \ + --json='{"id": 66, "threaded": "busy-poll-enabled"}' + + +Enabling it for each NAPI allows finer control to enable busy pollling for +only a set of NIC queues which will get traffic with low latency requirements. + +Depending on application requirement, user might want to set affinity of the +kthread that is busy polling each NAPI. User might also want to set priority +and the scheduler of the thread depending on the latency requirements. + +For a hard low-latency application, user might want to dedicate the full core +for the NAPI polling so the NIC queue descriptors are picked up from the queue +as soon as they appear. Once enabled, the NAPI thread will poll the NIC queues +continuously without sleeping. This will keep the CPU core busy with 100% +usage. For more relaxed low-latency requirement, user might want to share the +core with other threads by setting thread affinity and priority. + +Once threaded busy polling is enabled for a NAPI, PID of the kthread can be +fetched using netlink interface so the affinity, priority and scheduler +configuration can be done. + +For example, following script can be used to fetch the pid: + +.. 
code-block:: bash + + $ ynl --family netdev --do napi-get --json='{"id": 66}' + +This will output something like following, the pid `258` is the PID of the +kthread that is polling this NAPI. + +.. code-block:: bash + + $ {'defer-hard-irqs': 0, + 'gro-flush-timeout': 0, + 'id': 66, + 'ifindex': 2, + 'irq-suspend-timeout': 0, + 'pid': 258, + 'threaded': 'enable'} + .. _threaded: Threaded NAPI diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index f3a3b761abfb..a88f6596aef7 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -427,6 +427,8 @@ enum { NAPI_STATE_THREADED, /* The poll is performed inside its own thread*/ NAPI_STATE_SCHED_THREADED, /* Napi is currently scheduled in threaded mode */ NAPI_STATE_HAS_NOTIFIER, /* Napi has an IRQ notifier */ + NAPI_STATE_THREADED_BUSY_POLL, /* The threaded napi poller will busy poll */ + NAPI_STATE_SCHED_THREADED_BUSY_POLL, /* The threaded napi poller is busy polling */ }; enum { @@ -441,8 +443,14 @@ enum { NAPIF_STATE_THREADED = BIT(NAPI_STATE_THREADED), NAPIF_STATE_SCHED_THREADED = BIT(NAPI_STATE_SCHED_THREADED), NAPIF_STATE_HAS_NOTIFIER = BIT(NAPI_STATE_HAS_NOTIFIER), + NAPIF_STATE_THREADED_BUSY_POLL = BIT(NAPI_STATE_THREADED_BUSY_POLL), + NAPIF_STATE_SCHED_THREADED_BUSY_POLL = + BIT(NAPI_STATE_SCHED_THREADED_BUSY_POLL), }; +#define NAPIF_STATE_THREADED_BUSY_POLL_MASK \ + (NAPIF_STATE_THREADED | NAPIF_STATE_THREADED_BUSY_POLL) + enum gro_result { GRO_MERGED, GRO_MERGED_FREE, @@ -1873,7 +1881,8 @@ enum netdev_reg_state { * @addr_len: Hardware address length * @upper_level: Maximum depth level of upper devices. * @lower_level: Maximum depth level of lower devices. - * @threaded: napi threaded state. + * @threaded: napi threaded mode is disabled, enabled or + * enabled with busy polling. * @neigh_priv_len: Used in neigh_alloc() * @dev_id: Used to differentiate devices that share * the same link layer address diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h index 48eb49aa03d4..8163afb15377 100644 --- a/include/uapi/linux/netdev.h +++ b/include/uapi/linux/netdev.h @@ -80,6 +80,7 @@ enum netdev_qstats_scope { enum netdev_napi_threaded { NETDEV_NAPI_THREADED_DISABLED, NETDEV_NAPI_THREADED_ENABLED, + NETDEV_NAPI_THREADED_BUSY_POLL_ENABLED, }; enum { diff --git a/net/core/dev.c b/net/core/dev.c index 5a3c0f40a93f..07ef77fed447 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -78,6 +78,7 @@ #include <linux/slab.h> #include <linux/sched.h> #include <linux/sched/isolation.h> +#include <linux/sched/types.h> #include <linux/sched/mm.h> #include <linux/smpboot.h> #include <linux/mutex.h> @@ -6558,7 +6559,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done) * the guarantee we will be called later. 
*/ if (unlikely(n->state & (NAPIF_STATE_NPSVC | - NAPIF_STATE_IN_BUSY_POLL))) + NAPIF_STATE_IN_BUSY_POLL | + NAPIF_STATE_SCHED_THREADED_BUSY_POLL))) return false; if (work_done) { @@ -6963,6 +6965,19 @@ static void napi_stop_kthread(struct napi_struct *napi) napi->thread = NULL; } +static void napi_set_threaded_state(struct napi_struct *napi, + enum netdev_napi_threaded threaded) +{ + unsigned long val; + + val = 0; + if (threaded == NETDEV_NAPI_THREADED_BUSY_POLL_ENABLED) + val |= NAPIF_STATE_THREADED_BUSY_POLL; + if (threaded) + val |= NAPIF_STATE_THREADED; + set_mask_bits(&napi->state, NAPIF_STATE_THREADED_BUSY_POLL_MASK, val); +} + int napi_set_threaded(struct napi_struct *napi, enum netdev_napi_threaded threaded) { @@ -6989,7 +7004,7 @@ int napi_set_threaded(struct napi_struct *napi, } else { /* Make sure kthread is created before THREADED bit is set. */ smp_mb__before_atomic(); - assign_bit(NAPI_STATE_THREADED, &napi->state, threaded); + napi_set_threaded_state(napi, threaded); } return 0; @@ -7381,7 +7396,9 @@ void napi_disable_locked(struct napi_struct *n) } new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC; - new &= ~(NAPIF_STATE_THREADED | NAPIF_STATE_PREFER_BUSY_POLL); + new &= ~(NAPIF_STATE_THREADED + | NAPIF_STATE_THREADED_BUSY_POLL + | NAPIF_STATE_PREFER_BUSY_POLL); } while (!try_cmpxchg(&n->state, &val, new)); hrtimer_cancel(&n->timer); @@ -7425,7 +7442,7 @@ void napi_enable_locked(struct napi_struct *n) new = val & ~(NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC); if (n->dev->threaded && n->thread) - new |= NAPIF_STATE_THREADED; + napi_set_threaded_state(n, n->dev->threaded); } while (!try_cmpxchg(&n->state, &val, new)); } EXPORT_SYMBOL(napi_enable_locked); @@ -7593,7 +7610,7 @@ static int napi_thread_wait(struct napi_struct *napi) return -1; } -static void napi_threaded_poll_loop(struct napi_struct *napi) +static void napi_threaded_poll_loop(struct napi_struct *napi, bool busy_poll) { struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx; struct softnet_data *sd; @@ -7622,22 +7639,53 @@ static void napi_threaded_poll_loop(struct napi_struct *napi) } skb_defer_free_flush(sd); bpf_net_ctx_clear(bpf_net_ctx); + + /* Flush too old packets. If HZ < 1000, flush all packets */ + if (busy_poll) + gro_flush_normal(&napi->gro, HZ >= 1000); local_bh_enable(); - if (!repoll) + /* If busy polling then do not break here because we need to + * call cond_resched and rcu_softirq_qs_periodic to prevent + * watchdog warnings. + */ + if (!repoll && !busy_poll) break; rcu_softirq_qs_periodic(last_qs); cond_resched(); + + if (!repoll) + break; } } static int napi_threaded_poll(void *data) { struct napi_struct *napi = data; + bool busy_poll_sched; + unsigned long val; + bool busy_poll; + + while (!napi_thread_wait(napi)) { + /* Once woken up, this means that we are scheduled as threaded + * napi and this thread owns the napi context, if busy poll + * state is set then busy poll this napi. + */ + val = READ_ONCE(napi->state); + busy_poll = val & NAPIF_STATE_THREADED_BUSY_POLL; + busy_poll_sched = val & NAPIF_STATE_SCHED_THREADED_BUSY_POLL; - while (!napi_thread_wait(napi)) - napi_threaded_poll_loop(napi); + /* Do not busy poll if napi is disabled. 
*/ + if (unlikely(val & NAPIF_STATE_DISABLE)) + busy_poll = false; + + if (busy_poll != busy_poll_sched) + assign_bit(NAPI_STATE_SCHED_THREADED_BUSY_POLL, + &napi->state, busy_poll); + + napi_threaded_poll_loop(napi, busy_poll); + } return 0; } @@ -12829,7 +12877,7 @@ static void run_backlog_napi(unsigned int cpu) { struct softnet_data *sd = per_cpu_ptr(&softnet_data, cpu); - napi_threaded_poll_loop(&sd->backlog); + napi_threaded_poll_loop(&sd->backlog, false); } static void backlog_napi_setup(unsigned int cpu) diff --git a/net/core/dev.h b/net/core/dev.h index d6b08d435479..d6cfe7105903 100644 --- a/net/core/dev.h +++ b/net/core/dev.h @@ -317,6 +317,9 @@ static inline void napi_set_irq_suspend_timeout(struct napi_struct *n, static inline enum netdev_napi_threaded napi_get_threaded(struct napi_struct *n) { + if (test_bit(NAPI_STATE_THREADED_BUSY_POLL, &n->state)) + return NETDEV_NAPI_THREADED_BUSY_POLL_ENABLED; + if (test_bit(NAPI_STATE_THREADED, &n->state)) return NETDEV_NAPI_THREADED_ENABLED; diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index c28cd6665444..d2cccca5b444 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -754,7 +754,7 @@ static int modify_napi_threaded(struct net_device *dev, unsigned long val) if (list_empty(&dev->napi_list)) return -EOPNOTSUPP; - if (val != 0 && val != 1) + if (val > NETDEV_NAPI_THREADED_BUSY_POLL_ENABLED) return -EOPNOTSUPP; ret = netif_set_threaded(dev, val); diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c index e9a2a6f26cb7..ff20435c45d2 100644 --- a/net/core/netdev-genl-gen.c +++ b/net/core/netdev-genl-gen.c @@ -97,7 +97,7 @@ static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_THREADED [NETDEV_A_NAPI_DEFER_HARD_IRQS] = NLA_POLICY_FULL_RANGE(NLA_U32, &netdev_a_napi_defer_hard_irqs_range), [NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT] = { .type = NLA_UINT, }, [NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT] = { .type = NLA_UINT, }, - [NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_U32, 1), + [NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_U32, 2), }; /* NETDEV_CMD_BIND_TX - do */ diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h index 48eb49aa03d4..8163afb15377 100644 --- a/tools/include/uapi/linux/netdev.h +++ b/tools/include/uapi/linux/netdev.h @@ -80,6 +80,7 @@ enum netdev_qstats_scope { enum netdev_napi_threaded { NETDEV_NAPI_THREADED_DISABLED, NETDEV_NAPI_THREADED_ENABLED, + NETDEV_NAPI_THREADED_BUSY_POLL_ENABLED, }; enum { -- 2.51.0.rc1.193.gad69d77794-goog ^ permalink raw reply related [flat|nested] 18+ messages in thread
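To illustrate the two control knobs touched by this patch, a small sketch
(eth0 and NAPI id 66 are placeholders; the sysfs values 0/1/2 correspond to
the disabled/enabled/busy-poll-enabled entries of the napi-threaded enum,
and the ynl commands mirror the ones added to napi.rst):

```
# Device level (sysfs): 0 = threaded off, 1 = threaded, 2 = threaded + busy poll.
echo 2 | sudo tee /sys/class/net/eth0/threaded

# Per-NAPI (netlink): revert a single NAPI back to plain threaded mode.
ynl --family netdev --do napi-set --json='{"id": 66, "threaded": "enabled"}'

# Inspect the resulting state; 'threaded' reports the per-NAPI mode.
ynl --family netdev --do napi-get --json='{"id": 66}'
```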
* Re: [PATCH net-next v7 1/2] Extend napi threaded polling to allow kthread based busy polling 2025-08-24 21:54 ` [PATCH net-next v7 1/2] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja @ 2025-08-25 19:47 ` Stanislav Fomichev 2025-08-25 23:11 ` Samiullah Khawaja 0 siblings, 1 reply; 18+ messages in thread From: Stanislav Fomichev @ 2025-08-25 19:47 UTC (permalink / raw) To: Samiullah Khawaja Cc: Jakub Kicinski, David S . Miller , Eric Dumazet, Paolo Abeni, almasrymina, willemb, mkarsten, Joe Damato, netdev On 08/24, Samiullah Khawaja wrote: > Add a new state to napi state enum: > > - NAPI_STATE_THREADED_BUSY_POLL > Threaded busy poll is enabled/running for this napi. > > Following changes are introduced in the napi scheduling and state logic: > > - When threaded busy poll is enabled through sysfs or netlink it also > enables NAPI_STATE_THREADED so a kthread is created per napi. It also > sets NAPI_STATE_THREADED_BUSY_POLL bit on each napi to indicate that > it is going to busy poll the napi. > > - When napi is scheduled with NAPI_STATE_SCHED_THREADED and associated > kthread is woken up, the kthread owns the context. If > NAPI_STATE_THREADED_BUSY_POLL and NAPI_STATE_SCHED_THREADED both are > set then it means that kthread can busy poll. > > - To keep busy polling and to avoid scheduling of the interrupts, the > napi_complete_done returns false when both NAPI_STATE_SCHED_THREADED > and NAPI_STATE_THREADED_BUSY_POLL flags are set. Also > napi_complete_done returns early to avoid the > NAPI_STATE_SCHED_THREADED being unset. > > - If at any point NAPI_STATE_THREADED_BUSY_POLL is unset, the > napi_complete_done will run and unset the NAPI_STATE_SCHED_THREADED > bit also. This will make the associated kthread go to sleep as per > existing logic. > > Signed-off-by: Samiullah Khawaja <skhawaja@google.com> > Reviewed-by: Willem de Bruijn <willemb@google.com> > > --- > Documentation/ABI/testing/sysfs-class-net | 3 +- > Documentation/netlink/specs/netdev.yaml | 5 +- > Documentation/networking/napi.rst | 63 +++++++++++++++++++++- > include/linux/netdevice.h | 11 +++- > include/uapi/linux/netdev.h | 1 + > net/core/dev.c | 66 +++++++++++++++++++---- > net/core/dev.h | 3 ++ > net/core/net-sysfs.c | 2 +- > net/core/netdev-genl-gen.c | 2 +- > tools/include/uapi/linux/netdev.h | 1 + > 10 files changed, 142 insertions(+), 15 deletions(-) > > diff --git a/Documentation/ABI/testing/sysfs-class-net b/Documentation/ABI/testing/sysfs-class-net > index ebf21beba846..15d7d36a8294 100644 > --- a/Documentation/ABI/testing/sysfs-class-net > +++ b/Documentation/ABI/testing/sysfs-class-net > @@ -343,7 +343,7 @@ Date: Jan 2021 > KernelVersion: 5.12 > Contact: netdev@vger.kernel.org > Description: > - Boolean value to control the threaded mode per device. User could > + Integer value to control the threaded mode per device. User could > set this value to enable/disable threaded mode for all napi > belonging to this device, without the need to do device up/down. > > @@ -351,4 +351,5 @@ Description: > == ================================== > 0 threaded mode disabled for this dev > 1 threaded mode enabled for this dev > + 2 threaded mode enabled, and busy polling enabled. I might have asked already but forgot the answer: any reason we keep extending sysfs? With a proper ynl control over per-queue settings, why do we want an option to enable busy-polling threaded mode for the whole device? ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next v7 1/2] Extend napi threaded polling to allow kthread based busy polling 2025-08-25 19:47 ` Stanislav Fomichev @ 2025-08-25 23:11 ` Samiullah Khawaja 0 siblings, 0 replies; 18+ messages in thread From: Samiullah Khawaja @ 2025-08-25 23:11 UTC (permalink / raw) To: Stanislav Fomichev Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni, almasrymina, willemb, mkarsten, Joe Damato, netdev On Mon, Aug 25, 2025 at 12:47 PM Stanislav Fomichev <stfomichev@gmail.com> wrote: > > On 08/24, Samiullah Khawaja wrote: > > Add a new state to napi state enum: > > > > - NAPI_STATE_THREADED_BUSY_POLL > > Threaded busy poll is enabled/running for this napi. > > > > Following changes are introduced in the napi scheduling and state logic: > > > > - When threaded busy poll is enabled through sysfs or netlink it also > > enables NAPI_STATE_THREADED so a kthread is created per napi. It also > > sets NAPI_STATE_THREADED_BUSY_POLL bit on each napi to indicate that > > it is going to busy poll the napi. > > > > - When napi is scheduled with NAPI_STATE_SCHED_THREADED and associated > > kthread is woken up, the kthread owns the context. If > > NAPI_STATE_THREADED_BUSY_POLL and NAPI_STATE_SCHED_THREADED both are > > set then it means that kthread can busy poll. > > > > - To keep busy polling and to avoid scheduling of the interrupts, the > > napi_complete_done returns false when both NAPI_STATE_SCHED_THREADED > > and NAPI_STATE_THREADED_BUSY_POLL flags are set. Also > > napi_complete_done returns early to avoid the > > NAPI_STATE_SCHED_THREADED being unset. > > > > - If at any point NAPI_STATE_THREADED_BUSY_POLL is unset, the > > napi_complete_done will run and unset the NAPI_STATE_SCHED_THREADED > > bit also. This will make the associated kthread go to sleep as per > > existing logic. > > > > Signed-off-by: Samiullah Khawaja <skhawaja@google.com> > > Reviewed-by: Willem de Bruijn <willemb@google.com> > > > > --- > > Documentation/ABI/testing/sysfs-class-net | 3 +- > > Documentation/netlink/specs/netdev.yaml | 5 +- > > Documentation/networking/napi.rst | 63 +++++++++++++++++++++- > > include/linux/netdevice.h | 11 +++- > > include/uapi/linux/netdev.h | 1 + > > net/core/dev.c | 66 +++++++++++++++++++---- > > net/core/dev.h | 3 ++ > > net/core/net-sysfs.c | 2 +- > > net/core/netdev-genl-gen.c | 2 +- > > tools/include/uapi/linux/netdev.h | 1 + > > 10 files changed, 142 insertions(+), 15 deletions(-) > > > > diff --git a/Documentation/ABI/testing/sysfs-class-net b/Documentation/ABI/testing/sysfs-class-net > > index ebf21beba846..15d7d36a8294 100644 > > --- a/Documentation/ABI/testing/sysfs-class-net > > +++ b/Documentation/ABI/testing/sysfs-class-net > > @@ -343,7 +343,7 @@ Date: Jan 2021 > > KernelVersion: 5.12 > > Contact: netdev@vger.kernel.org > > Description: > > - Boolean value to control the threaded mode per device. User could > > + Integer value to control the threaded mode per device. User could > > set this value to enable/disable threaded mode for all napi > > belonging to this device, without the need to do device up/down. > > > > @@ -351,4 +351,5 @@ Description: > > == ================================== > > 0 threaded mode disabled for this dev > > 1 threaded mode enabled for this dev > > + 2 threaded mode enabled, and busy polling enabled. > > I might have asked already but forgot the answer: any reason we keep > extending sysfs? With a proper ynl control over per-queue settings, > why do we want an option to enable busy-polling threaded mode for the > whole device? 
That's great. One would enable threaded napi busy poll only for certain napis and not for the whole device, so this makes perfect sense. I am open to doing it. ^ permalink raw reply [flat|nested] 18+ messages in thread
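For completeness, the per-NAPI alternative discussed here would look roughly
like the following sketch (ifindex 2 and NAPI ids 66/67 are placeholders,
and it assumes the ynl CLI from tools/net/ynl):

```
# List the NAPIs of the device to find the ids backing the low-latency queues.
ynl --family netdev --dump napi-get --json='{"ifindex": 2}'

# Enable threaded busy polling only on the chosen NAPIs, not the whole device.
for id in 66 67; do
	ynl --family netdev --do napi-set \
		--json="{\"id\": $id, \"threaded\": \"busy-poll-enabled\"}"
done
```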
* [PATCH net-next v7 2/2] selftests: Add napi threaded busy poll test in `busy_poller` 2025-08-24 21:54 [PATCH net-next v7 0/2] Add support to do threaded napi busy poll Samiullah Khawaja 2025-08-24 21:54 ` [PATCH net-next v7 1/2] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja @ 2025-08-24 21:54 ` Samiullah Khawaja 2025-08-25 16:30 ` Jakub Kicinski 2025-08-25 0:03 ` [PATCH net-next v7 0/2] Add support to do threaded napi busy poll Martin Karsten 2025-08-25 19:37 ` Stanislav Fomichev 3 siblings, 1 reply; 18+ messages in thread From: Samiullah Khawaja @ 2025-08-24 21:54 UTC (permalink / raw) To: Jakub Kicinski, David S . Miller , Eric Dumazet, Paolo Abeni, almasrymina, willemb, mkarsten Cc: Joe Damato, netdev, skhawaja Add testcase to run busy poll test with threaded napi busy poll enabled. Signed-off-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> --- tools/testing/selftests/net/busy_poll_test.sh | 25 ++++++++++++++++++- tools/testing/selftests/net/busy_poller.c | 14 ++++++++--- 2 files changed, 35 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/net/busy_poll_test.sh b/tools/testing/selftests/net/busy_poll_test.sh index 7d2d40812074..ab230df1057e 100755 --- a/tools/testing/selftests/net/busy_poll_test.sh +++ b/tools/testing/selftests/net/busy_poll_test.sh @@ -27,6 +27,9 @@ NAPI_DEFER_HARD_IRQS=100 GRO_FLUSH_TIMEOUT=50000 SUSPEND_TIMEOUT=20000000 +# NAPI threaded busy poll config +NAPI_THREADED_POLL=2 + setup_ns() { set -e @@ -62,6 +65,9 @@ cleanup_ns() test_busypoll() { suspend_value=${1:-0} + napi_threaded_value=${2:-0} + prefer_busy_poll_value=${3:-$PREFER_BUSY_POLL} + tmp_file=$(mktemp) out_file=$(mktemp) @@ -73,10 +79,11 @@ test_busypoll() -b${SERVER_IP} \ -m${MAX_EVENTS} \ -u${BUSY_POLL_USECS} \ - -P${PREFER_BUSY_POLL} \ + -P${prefer_busy_poll_value} \ -g${BUSY_POLL_BUDGET} \ -i${NSIM_SV_IFIDX} \ -s${suspend_value} \ + -t${napi_threaded_value} \ -o${out_file}& wait_local_port_listen nssv ${SERVER_PORT} tcp @@ -109,6 +116,15 @@ test_busypoll_with_suspend() return $? } +test_busypoll_with_napi_threaded() +{ + # Only enable napi threaded poll. Set suspend timeout and prefer busy + # poll to 0. + test_busypoll 0 ${NAPI_THREADED_POLL} 0 + + return $? +} + ### ### Code start ### @@ -154,6 +170,13 @@ if [ $? -ne 0 ]; then exit 1 fi +test_busypoll_with_napi_threaded +if [ $? 
-ne 0 ]; then + echo "test_busypoll_with_napi_threaded failed" + cleanup_ns + exit 1 +fi + echo "$NSIM_SV_FD:$NSIM_SV_IFIDX" > $NSIM_DEV_SYS_UNLINK echo $NSIM_CL_ID > $NSIM_DEV_SYS_DEL diff --git a/tools/testing/selftests/net/busy_poller.c b/tools/testing/selftests/net/busy_poller.c index 04c7ff577bb8..f7407f09f635 100644 --- a/tools/testing/selftests/net/busy_poller.c +++ b/tools/testing/selftests/net/busy_poller.c @@ -65,15 +65,16 @@ static uint32_t cfg_busy_poll_usecs; static uint16_t cfg_busy_poll_budget; static uint8_t cfg_prefer_busy_poll; -/* IRQ params */ +/* NAPI params */ static uint32_t cfg_defer_hard_irqs; static uint64_t cfg_gro_flush_timeout; static uint64_t cfg_irq_suspend_timeout; +static enum netdev_napi_threaded cfg_napi_threaded_poll = NETDEV_NAPI_THREADED_DISABLE; static void usage(const char *filepath) { error(1, 0, - "Usage: %s -p<port> -b<addr> -m<max_events> -u<busy_poll_usecs> -P<prefer_busy_poll> -g<busy_poll_budget> -o<outfile> -d<defer_hard_irqs> -r<gro_flush_timeout> -s<irq_suspend_timeout> -i<ifindex>", + "Usage: %s -p<port> -b<addr> -m<max_events> -u<busy_poll_usecs> -P<prefer_busy_poll> -g<busy_poll_budget> -o<outfile> -d<defer_hard_irqs> -r<gro_flush_timeout> -s<irq_suspend_timeout> -t<napi_threaded_poll> -i<ifindex>", filepath); } @@ -86,7 +87,7 @@ static void parse_opts(int argc, char **argv) if (argc <= 1) usage(argv[0]); - while ((c = getopt(argc, argv, "p:m:b:u:P:g:o:d:r:s:i:")) != -1) { + while ((c = getopt(argc, argv, "p:m:b:u:P:g:o:d:r:s:i:t:")) != -1) { /* most options take integer values, except o and b, so reduce * code duplication a bit for the common case by calling * strtoull here and leave bounds checking and casting per @@ -168,6 +169,12 @@ static void parse_opts(int argc, char **argv) cfg_ifindex = (int)tmp; break; + case 't': + if (tmp == ULLONG_MAX || tmp > 2) + error(1, ERANGE, "napi threaded poll value must be 0-2"); + + cfg_napi_threaded_poll = (enum netdev_napi_threaded)tmp; + break; } } @@ -246,6 +253,7 @@ static void setup_queue(void) cfg_gro_flush_timeout); netdev_napi_set_req_set_irq_suspend_timeout(set_req, cfg_irq_suspend_timeout); + netdev_napi_set_req_set_threaded(set_req, cfg_napi_threaded_poll); if (netdev_napi_set(ys, set_req)) error(1, 0, "can't set NAPI params: %s\n", yerr.msg); -- 2.51.0.rc1.193.gad69d77794-goog ^ permalink raw reply related [flat|nested] 18+ messages in thread
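For anyone wanting to exercise the new test case locally, roughly the
following should do it (a sketch; it assumes a kernel tree with this series
applied and netdevsim support, since busy_poll_test.sh drives netdevsim
devices):

```
# Build the net selftests (includes the busy_poller binary).
make -C tools/testing/selftests TARGETS=net

# Run the busy poll script, which now also covers threaded busy poll.
sudo modprobe netdevsim
cd tools/testing/selftests/net && sudo ./busy_poll_test.sh
```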
* Re: [PATCH net-next v7 2/2] selftests: Add napi threaded busy poll test in `busy_poller` 2025-08-24 21:54 ` [PATCH net-next v7 2/2] selftests: Add napi threaded busy poll test in `busy_poller` Samiullah Khawaja @ 2025-08-25 16:30 ` Jakub Kicinski 2025-08-25 17:20 ` Samiullah Khawaja 0 siblings, 1 reply; 18+ messages in thread From: Jakub Kicinski @ 2025-08-25 16:30 UTC (permalink / raw) To: Samiullah Khawaja Cc: David S . Miller , Eric Dumazet, Paolo Abeni, almasrymina, willemb, mkarsten, Joe Damato, netdev On Sun, 24 Aug 2025 21:54:18 +0000 Samiullah Khawaja wrote: > +static enum netdev_napi_threaded cfg_napi_threaded_poll = NETDEV_NAPI_THREADED_DISABLE; This doesn't build: DISABLE*D* ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next v7 2/2] selftests: Add napi threaded busy poll test in `busy_poller` 2025-08-25 16:30 ` Jakub Kicinski @ 2025-08-25 17:20 ` Samiullah Khawaja 0 siblings, 0 replies; 18+ messages in thread From: Samiullah Khawaja @ 2025-08-25 17:20 UTC (permalink / raw) To: Jakub Kicinski Cc: David S . Miller, Eric Dumazet, Paolo Abeni, almasrymina, willemb, mkarsten, Joe Damato, netdev On Mon, Aug 25, 2025 at 9:30 AM Jakub Kicinski <kuba@kernel.org> wrote: > > On Sun, 24 Aug 2025 21:54:18 +0000 Samiullah Khawaja wrote: > > +static enum netdev_napi_threaded cfg_napi_threaded_poll = NETDEV_NAPI_THREADED_DISABLE; > > This doesn't build: > > DISABLE*D* Yikes. Sorry let me fix that. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next v7 0/2] Add support to do threaded napi busy poll 2025-08-24 21:54 [PATCH net-next v7 0/2] Add support to do threaded napi busy poll Samiullah Khawaja 2025-08-24 21:54 ` [PATCH net-next v7 1/2] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja 2025-08-24 21:54 ` [PATCH net-next v7 2/2] selftests: Add napi threaded busy poll test in `busy_poller` Samiullah Khawaja @ 2025-08-25 0:03 ` Martin Karsten 2025-08-25 17:20 ` Samiullah Khawaja 2025-08-25 19:37 ` Stanislav Fomichev 3 siblings, 1 reply; 18+ messages in thread From: Martin Karsten @ 2025-08-25 0:03 UTC (permalink / raw) To: Samiullah Khawaja, Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni, almasrymina, willemb Cc: Joe Damato, netdev On 2025-08-24 17:54, Samiullah Khawaja wrote: > Extend the already existing support of threaded napi poll to do continuous > busy polling. > > This is used for doing continuous polling of napi to fetch descriptors > from backing RX/TX queues for low latency applications. Allow enabling > of threaded busypoll using netlink so this can be enabled on a set of > dedicated napis for low latency applications. > > Once enabled user can fetch the PID of the kthread doing NAPI polling > and set affinity, priority and scheduler for it depending on the > low-latency requirements. > > Currently threaded napi is only enabled at device level using sysfs. Add > support to enable/disable threaded mode for a napi individually. This > can be done using the netlink interface. Extend `napi-set` op in netlink > spec that allows setting the `threaded` attribute of a napi. > > Extend the threaded attribute in napi struct to add an option to enable > continuous busy polling. Extend the netlink and sysfs interface to allow > enabling/disabling threaded busypolling at device or individual napi > level. > > We use this for our AF_XDP based hard low-latency usecase with usecs > level latency requirement. For our usecase we want low jitter and stable > latency at P99. > > Following is an analysis and comparison of available (and compatible) > busy poll interfaces for a low latency usecase with stable P99. Please > note that the throughput and cpu efficiency is a non-goal. > > For analysis we use an AF_XDP based benchmarking tool `xdp_rr`. The > description of the tool and how it tries to simulate the real workload > is following, > > - It sends UDP packets between 2 machines. > - The client machine sends packets at a fixed frequency. To maintain the > frequency of the packet being sent, we use open-loop sampling. That is > the packets are sent in a separate thread. > - The server replies to the packet inline by reading the pkt from the > recv ring and replies using the tx ring. > - To simulate the application processing time, we use a configurable > delay in usecs on the client side after a reply is received from the > server. > > The xdp_rr tool is posted separately as an RFC for tools/testing/selftest. > > We use this tool with following napi polling configurations, > > - Interrupts only > - SO_BUSYPOLL (inline in the same thread where the client receives the > packet). 
> - SO_BUSYPOLL (separate thread and separate core) > - Threaded NAPI busypoll > > System is configured using following script in all 4 cases, > > ``` > echo 0 | sudo tee /sys/class/net/eth0/threaded > echo 0 | sudo tee /proc/sys/kernel/timer_migration > echo off | sudo tee /sys/devices/system/cpu/smt/control > > sudo ethtool -L eth0 rx 1 tx 1 > sudo ethtool -G eth0 rx 1024 > > echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries > echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus > > # pin IRQs on CPU 2 > IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \ > print arr[0]}' < /proc/interrupts)" > for irq in "${IRQS}"; \ > do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done > > echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us > > for i in /sys/devices/virtual/workqueue/*/cpumask; \ > do echo $i; echo 1,2,3,4,5,6 > $i; done > > if [[ -z "$1" ]]; then > echo 400 | sudo tee /proc/sys/net/core/busy_read > echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout > fi > > sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0 > > if [[ "$1" == "enable_threaded" ]]; then > echo 0 | sudo tee /proc/sys/net/core/busy_poll > echo 0 | sudo tee /proc/sys/net/core/busy_read > echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout > echo 2 | sudo tee /sys/class/net/eth0/threaded > NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }') > sudo chrt -f -p 50 $NAPI_T > > # pin threaded poll thread to CPU 2 > sudo taskset -pc 2 $NAPI_T > fi > > if [[ "$1" == "enable_interrupt" ]]; then > echo 0 | sudo tee /proc/sys/net/core/busy_read > echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout > fi > ``` > > To enable various configurations, script can be run as following, > > - Interrupt Only > ``` > <script> enable_interrupt > ``` > > - SO_BUSYPOLL (no arguments to script) > ``` > <script> > ``` > > - NAPI threaded busypoll > ``` > <script> enable_threaded > ``` > > If using idpf, the script needs to be run again after launching the > workload just to make sure that the configurations are not reverted. As > idpf reverts some configurations on software reset when AF_XDP program > is attached. > > Once configured, the workload is run with various configurations using > following commands. Set period (1/frequency) and delay in usecs to > produce results for packet frequency and application processing delay. > > ## Interrupt Only and SO_BUSY_POLL (inline) > > - Server > ``` > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v > ``` > > - Client > ``` > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \ > -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v > ``` > > ## SO_BUSY_POLL(done in separate core using recvfrom) > > Argument -t spawns a seprate thread and continuously calls recvfrom. 
> > - Server > ``` > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \ > -h -v -t > ``` > > - Client > ``` > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \ > -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -t > ``` > > ## NAPI Threaded Busy Poll > > Argument -n skips the recvfrom call as there is no recv kick needed. > > - Server > ``` > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \ > -h -v -n > ``` > > - Client > ``` > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \ > -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -n > ``` > > | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded | > |---|---|---|---|---| > | 12 Kpkt/s + 0us delay | | | | | > | | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 | > | | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 | > | | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 | > | | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 | > | 32 Kpkt/s + 30us delay | | | | | > | | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 | > | | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 | > | | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 | > | | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 | > | 125 Kpkt/s + 6us delay | | | | | > | | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 | > | | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 | > | | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 | > | | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 | > | 12 Kpkt/s + 78us delay | | | | | > | | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 | > | | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 | > | | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 | > | | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 | > | 25 Kpkt/s + 38us delay | | | | | > | | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 | > | | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 | > | | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 | > | | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 | > > ## Observations Hi Samiullah, I believe you are comparing apples and oranges with these experiments. Because threaded busy poll uses two cores at each end (at 100%), you should compare with 2 pairs of xsk_rr processes using interrupt mode, but each running at half the rate. I am quite certain you would then see the same latency as in the baseline experiment - at much reduced cpu utilization. Threaded busy poll reduces p99 latency by just 100 nsec, while busy-spinning two cores, at each end - not more not less. I continue to believe that this trade-off and these limited benefits need to be clearly and explicitly spelled out in the cover letter. Best, Martin ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next v7 0/2] Add support to do threaded napi busy poll 2025-08-25 0:03 ` [PATCH net-next v7 0/2] Add support to do threaded napi busy poll Martin Karsten @ 2025-08-25 17:20 ` Samiullah Khawaja 2025-08-25 17:40 ` Martin Karsten 0 siblings, 1 reply; 18+ messages in thread From: Samiullah Khawaja @ 2025-08-25 17:20 UTC (permalink / raw) To: Martin Karsten Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni, almasrymina, willemb, Joe Damato, netdev On Sun, Aug 24, 2025 at 5:03 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote: > > On 2025-08-24 17:54, Samiullah Khawaja wrote: > > Extend the already existing support of threaded napi poll to do continuous > > busy polling. > > > > This is used for doing continuous polling of napi to fetch descriptors > > from backing RX/TX queues for low latency applications. Allow enabling > > of threaded busypoll using netlink so this can be enabled on a set of > > dedicated napis for low latency applications. > > > > Once enabled user can fetch the PID of the kthread doing NAPI polling > > and set affinity, priority and scheduler for it depending on the > > low-latency requirements. > > > > Currently threaded napi is only enabled at device level using sysfs. Add > > support to enable/disable threaded mode for a napi individually. This > > can be done using the netlink interface. Extend `napi-set` op in netlink > > spec that allows setting the `threaded` attribute of a napi. > > > > Extend the threaded attribute in napi struct to add an option to enable > > continuous busy polling. Extend the netlink and sysfs interface to allow > > enabling/disabling threaded busypolling at device or individual napi > > level. > > > > We use this for our AF_XDP based hard low-latency usecase with usecs > > level latency requirement. For our usecase we want low jitter and stable > > latency at P99. > > > > Following is an analysis and comparison of available (and compatible) > > busy poll interfaces for a low latency usecase with stable P99. Please > > note that the throughput and cpu efficiency is a non-goal. > > > > For analysis we use an AF_XDP based benchmarking tool `xdp_rr`. The > > description of the tool and how it tries to simulate the real workload > > is following, > > > > - It sends UDP packets between 2 machines. > > - The client machine sends packets at a fixed frequency. To maintain the > > frequency of the packet being sent, we use open-loop sampling. That is > > the packets are sent in a separate thread. > > - The server replies to the packet inline by reading the pkt from the > > recv ring and replies using the tx ring. > > - To simulate the application processing time, we use a configurable > > delay in usecs on the client side after a reply is received from the > > server. > > > > The xdp_rr tool is posted separately as an RFC for tools/testing/selftest. > > > > We use this tool with following napi polling configurations, > > > > - Interrupts only > > - SO_BUSYPOLL (inline in the same thread where the client receives the > > packet). > > - SO_BUSYPOLL (separate thread and separate core) This one uses separate thread and core for polling the napi. 
> > - Threaded NAPI busypoll > > > > System is configured using following script in all 4 cases, > > > > ``` > > echo 0 | sudo tee /sys/class/net/eth0/threaded > > echo 0 | sudo tee /proc/sys/kernel/timer_migration > > echo off | sudo tee /sys/devices/system/cpu/smt/control > > > > sudo ethtool -L eth0 rx 1 tx 1 > > sudo ethtool -G eth0 rx 1024 > > > > echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries > > echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus > > > > # pin IRQs on CPU 2 > > IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \ > > print arr[0]}' < /proc/interrupts)" > > for irq in "${IRQS}"; \ > > do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done > > > > echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us > > > > for i in /sys/devices/virtual/workqueue/*/cpumask; \ > > do echo $i; echo 1,2,3,4,5,6 > $i; done > > > > if [[ -z "$1" ]]; then > > echo 400 | sudo tee /proc/sys/net/core/busy_read > > echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs > > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout > > fi > > > > sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0 > > > > if [[ "$1" == "enable_threaded" ]]; then > > echo 0 | sudo tee /proc/sys/net/core/busy_poll > > echo 0 | sudo tee /proc/sys/net/core/busy_read > > echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs > > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout > > echo 2 | sudo tee /sys/class/net/eth0/threaded > > NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }') > > sudo chrt -f -p 50 $NAPI_T > > > > # pin threaded poll thread to CPU 2 > > sudo taskset -pc 2 $NAPI_T > > fi > > > > if [[ "$1" == "enable_interrupt" ]]; then > > echo 0 | sudo tee /proc/sys/net/core/busy_read > > echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs > > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout > > fi > > ``` > > > > To enable various configurations, script can be run as following, > > > > - Interrupt Only > > ``` > > <script> enable_interrupt > > ``` > > > > - SO_BUSYPOLL (no arguments to script) > > ``` > > <script> > > ``` > > > > - NAPI threaded busypoll > > ``` > > <script> enable_threaded > > ``` > > > > If using idpf, the script needs to be run again after launching the > > workload just to make sure that the configurations are not reverted. As > > idpf reverts some configurations on software reset when AF_XDP program > > is attached. > > > > Once configured, the workload is run with various configurations using > > following commands. Set period (1/frequency) and delay in usecs to > > produce results for packet frequency and application processing delay. > > > > ## Interrupt Only and SO_BUSY_POLL (inline) > > > > - Server > > ``` > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > > -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v > > ``` > > > > - Client > > ``` > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > > -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \ > > -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v > > ``` > > > > ## SO_BUSY_POLL(done in separate core using recvfrom) > > > > Argument -t spawns a seprate thread and continuously calls recvfrom. 
> > > > - Server > > ``` > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > > -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \ > > -h -v -t > > ``` > > > > - Client > > ``` > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > > -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \ > > -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -t > > ``` > > > > ## NAPI Threaded Busy Poll > > > > Argument -n skips the recvfrom call as there is no recv kick needed. > > > > - Server > > ``` > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > > -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \ > > -h -v -n > > ``` > > > > - Client > > ``` > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > > -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \ > > -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -n > > ``` > > > > | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded | > > |---|---|---|---|---| > > | 12 Kpkt/s + 0us delay | | | | | > > | | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 | > > | | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 | > > | | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 | > > | | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 | > > | 32 Kpkt/s + 30us delay | | | | | > > | | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 | > > | | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 | > > | | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 | > > | | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 | > > | 125 Kpkt/s + 6us delay | | | | | > > | | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 | > > | | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 | > > | | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 | > > | | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 | > > | 12 Kpkt/s + 78us delay | | | | | > > | | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 | > > | | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 | > > | | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 | > > | | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 | > > | 25 Kpkt/s + 38us delay | | | | | > > | | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 | > > | | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 | > > | | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 | > > | | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 | > > > > ## Observations > > Hi Samiullah, > Thanks for the review > I believe you are comparing apples and oranges with these experiments. > Because threaded busy poll uses two cores at each end (at 100%), you The SO_BUSYPOLL(separate) column is actually running in a separate thread and using two cores. So this is actually comparing apples to apples. > should compare with 2 pairs of xsk_rr processes using interrupt mode, > but each running at half the rate. I am quite certain you would then see > the same latency as in the baseline experiment - at much reduced cpu > utilization. > > Threaded busy poll reduces p99 latency by just 100 nsec, while The table in the experiments show much larger differences in latency. > busy-spinning two cores, at each end - not more not less. I continue to > believe that this trade-off and these limited benefits need to be > clearly and explicitly spelled out in the cover letter. Yes, if you just look at the first row of the table then there is virtually no difference. > > Best, > Martin > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next v7 0/2] Add support to do threaded napi busy poll 2025-08-25 17:20 ` Samiullah Khawaja @ 2025-08-25 17:40 ` Martin Karsten 2025-08-25 18:53 ` Samiullah Khawaja 0 siblings, 1 reply; 18+ messages in thread From: Martin Karsten @ 2025-08-25 17:40 UTC (permalink / raw) To: Samiullah Khawaja Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni, almasrymina, willemb, Joe Damato, netdev On 2025-08-25 13:20, Samiullah Khawaja wrote: > On Sun, Aug 24, 2025 at 5:03 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote: >> >> On 2025-08-24 17:54, Samiullah Khawaja wrote: >>> Extend the already existing support of threaded napi poll to do continuous >>> busy polling. >>> >>> This is used for doing continuous polling of napi to fetch descriptors >>> from backing RX/TX queues for low latency applications. Allow enabling >>> of threaded busypoll using netlink so this can be enabled on a set of >>> dedicated napis for low latency applications. >>> >>> Once enabled user can fetch the PID of the kthread doing NAPI polling >>> and set affinity, priority and scheduler for it depending on the >>> low-latency requirements. >>> >>> Currently threaded napi is only enabled at device level using sysfs. Add >>> support to enable/disable threaded mode for a napi individually. This >>> can be done using the netlink interface. Extend `napi-set` op in netlink >>> spec that allows setting the `threaded` attribute of a napi. >>> >>> Extend the threaded attribute in napi struct to add an option to enable >>> continuous busy polling. Extend the netlink and sysfs interface to allow >>> enabling/disabling threaded busypolling at device or individual napi >>> level. >>> >>> We use this for our AF_XDP based hard low-latency usecase with usecs >>> level latency requirement. For our usecase we want low jitter and stable >>> latency at P99. >>> >>> Following is an analysis and comparison of available (and compatible) >>> busy poll interfaces for a low latency usecase with stable P99. Please >>> note that the throughput and cpu efficiency is a non-goal. >>> >>> For analysis we use an AF_XDP based benchmarking tool `xdp_rr`. The >>> description of the tool and how it tries to simulate the real workload >>> is following, >>> >>> - It sends UDP packets between 2 machines. >>> - The client machine sends packets at a fixed frequency. To maintain the >>> frequency of the packet being sent, we use open-loop sampling. That is >>> the packets are sent in a separate thread. >>> - The server replies to the packet inline by reading the pkt from the >>> recv ring and replies using the tx ring. >>> - To simulate the application processing time, we use a configurable >>> delay in usecs on the client side after a reply is received from the >>> server. >>> >>> The xdp_rr tool is posted separately as an RFC for tools/testing/selftest. >>> >>> We use this tool with following napi polling configurations, >>> >>> - Interrupts only >>> - SO_BUSYPOLL (inline in the same thread where the client receives the >>> packet). >>> - SO_BUSYPOLL (separate thread and separate core) > This one uses separate thread and core for polling the napi. That's not what I am referring to below. 
[snip] >>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded | >>> |---|---|---|---|---| >>> | 12 Kpkt/s + 0us delay | | | | | >>> | | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 | >>> | | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 | >>> | | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 | >>> | | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 | >>> | 32 Kpkt/s + 30us delay | | | | | >>> | | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 | >>> | | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 | >>> | | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 | >>> | | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 | >>> | 125 Kpkt/s + 6us delay | | | | | >>> | | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 | >>> | | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 | >>> | | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 | >>> | | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 | >>> | 12 Kpkt/s + 78us delay | | | | | >>> | | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 | >>> | | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 | >>> | | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 | >>> | | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 | >>> | 25 Kpkt/s + 38us delay | | | | | >>> | | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 | >>> | | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 | >>> | | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 | >>> | | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 | >>> >>> ## Observations >> >> Hi Samiullah, >> > Thanks for the review >> I believe you are comparing apples and oranges with these experiments. >> Because threaded busy poll uses two cores at each end (at 100%), you > The SO_BUSYPOLL(separate) column is actually running in a separate > thread and using two cores. So this is actually comparing apples to > apples. I am not referring to SO_BUSYPOLL, but to the column labelled "interrupts". This is single-core, yes? >> should compare with 2 pairs of xsk_rr processes using interrupt mode, >> but each running at half the rate. I am quite certain you would then see >> the same latency as in the baseline experiment - at much reduced cpu >> utilization. >> >> Threaded busy poll reduces p99 latency by just 100 nsec, while > The table in the experiments show much larger differences in latency. Yes, because all but the first experiment add processing delay to simulate 100% load and thus most likely show queuing effects. Since "interrupts" uses just one core and "NAPI threaded" uses two, a fair comparison would be for "interrupts" to run two pairs of xsk_rr at half the rate each. Then the load would be well below 100%, no queueing, and latency would probably go back to the values measured in the "0us delay" experiments. At least that's what I would expect. Reproduction is getting a bit difficult, because you haven't updated the xsk_rr RFC and judging from the compilation error, maybe not built/run these experiments for a while now? It would be nice to have a working reproducible setup. >> busy-spinning two cores, at each end - not more not less. I continue to >> believe that this trade-off and these limited benefits need to be >> clearly and explicitly spelled out in the cover letter. > Yes, if you just look at the first row of the table then there is > virtually no difference. I'm not sure what you mean by this. I compare "interrupts" with "NAPI threaded" for the case "12 Kpkt/s + 0us delay" and I have explained why I believe the other experiments are not meaningful. 
Thanks, Martin ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next v7 0/2] Add support to do threaded napi busy poll 2025-08-25 17:40 ` Martin Karsten @ 2025-08-25 18:53 ` Samiullah Khawaja 2025-08-25 19:45 ` Martin Karsten 0 siblings, 1 reply; 18+ messages in thread From: Samiullah Khawaja @ 2025-08-25 18:53 UTC (permalink / raw) To: Martin Karsten Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni, almasrymina, willemb, Joe Damato, netdev On Mon, Aug 25, 2025 at 10:41 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote: > > On 2025-08-25 13:20, Samiullah Khawaja wrote: > > On Sun, Aug 24, 2025 at 5:03 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote: > >> > >> On 2025-08-24 17:54, Samiullah Khawaja wrote: > >>> Extend the already existing support of threaded napi poll to do continuous > >>> busy polling. > >>> > >>> This is used for doing continuous polling of napi to fetch descriptors > >>> from backing RX/TX queues for low latency applications. Allow enabling > >>> of threaded busypoll using netlink so this can be enabled on a set of > >>> dedicated napis for low latency applications. > >>> > >>> Once enabled user can fetch the PID of the kthread doing NAPI polling > >>> and set affinity, priority and scheduler for it depending on the > >>> low-latency requirements. > >>> > >>> Currently threaded napi is only enabled at device level using sysfs. Add > >>> support to enable/disable threaded mode for a napi individually. This > >>> can be done using the netlink interface. Extend `napi-set` op in netlink > >>> spec that allows setting the `threaded` attribute of a napi. > >>> > >>> Extend the threaded attribute in napi struct to add an option to enable > >>> continuous busy polling. Extend the netlink and sysfs interface to allow > >>> enabling/disabling threaded busypolling at device or individual napi > >>> level. > >>> > >>> We use this for our AF_XDP based hard low-latency usecase with usecs > >>> level latency requirement. For our usecase we want low jitter and stable > >>> latency at P99. > >>> > >>> Following is an analysis and comparison of available (and compatible) > >>> busy poll interfaces for a low latency usecase with stable P99. Please > >>> note that the throughput and cpu efficiency is a non-goal. > >>> > >>> For analysis we use an AF_XDP based benchmarking tool `xdp_rr`. The > >>> description of the tool and how it tries to simulate the real workload > >>> is following, > >>> > >>> - It sends UDP packets between 2 machines. > >>> - The client machine sends packets at a fixed frequency. To maintain the > >>> frequency of the packet being sent, we use open-loop sampling. That is > >>> the packets are sent in a separate thread. > >>> - The server replies to the packet inline by reading the pkt from the > >>> recv ring and replies using the tx ring. > >>> - To simulate the application processing time, we use a configurable > >>> delay in usecs on the client side after a reply is received from the > >>> server. > >>> > >>> The xdp_rr tool is posted separately as an RFC for tools/testing/selftest. > >>> > >>> We use this tool with following napi polling configurations, > >>> > >>> - Interrupts only > >>> - SO_BUSYPOLL (inline in the same thread where the client receives the > >>> packet). > >>> - SO_BUSYPOLL (separate thread and separate core) > > This one uses separate thread and core for polling the napi. > > That's not what I am referring to below. 
> > [snip] > > >>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded | > >>> |---|---|---|---|---| > >>> | 12 Kpkt/s + 0us delay | | | | | > >>> | | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 | > >>> | | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 | > >>> | | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 | > >>> | | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 | > >>> | 32 Kpkt/s + 30us delay | | | | | > >>> | | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 | > >>> | | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 | > >>> | | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 | > >>> | | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 | > >>> | 125 Kpkt/s + 6us delay | | | | | > >>> | | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 | > >>> | | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 | > >>> | | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 | > >>> | | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 | > >>> | 12 Kpkt/s + 78us delay | | | | | > >>> | | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 | > >>> | | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 | > >>> | | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 | > >>> | | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 | > >>> | 25 Kpkt/s + 38us delay | | | | | > >>> | | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 | > >>> | | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 | > >>> | | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 | > >>> | | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 | > >>> > >>> ## Observations > >> > >> Hi Samiullah, > >> > > Thanks for the review > >> I believe you are comparing apples and oranges with these experiments. > >> Because threaded busy poll uses two cores at each end (at 100%), you > > The SO_BUSYPOLL(separate) column is actually running in a separate > > thread and using two cores. So this is actually comparing apples to > > apples. > > I am not referring to SO_BUSYPOLL, but to the column labelled > "interrupts". This is single-core, yes? > > >> should compare with 2 pairs of xsk_rr processes using interrupt mode, > >> but each running at half the rate. I am quite certain you would then see > >> the same latency as in the baseline experiment - at much reduced cpu > >> utilization. > >> > >> Threaded busy poll reduces p99 latency by just 100 nsec, while > > The table in the experiments show much larger differences in latency. > > Yes, because all but the first experiment add processing delay to > simulate 100% load and thus most likely show queuing effects. > > Since "interrupts" uses just one core and "NAPI threaded" uses two, a > fair comparison would be for "interrupts" to run two pairs of xsk_rr at > half the rate each. Then the load would be well below 100%, no queueing, > and latency would probably go back to the values measured in the "0us > delay" experiments. At least that's what I would expect. Two set of xsk_rr will go to two different NIC queues with two different interrupts (I think). That would be comparing apples to oranges, as all the other columns use a single NIC queue. Having (Forcing user to have) two xsk sockets to deliver packets at a certain rate is a completely different use case. > > Reproduction is getting a bit difficult, because you haven't updated the > xsk_rr RFC and judging from the compilation error, maybe not built/run > these experiments for a while now? It would be nice to have a working > reproducible setup. Oh. Let me check the xsk_rr and see whether it is outdated. 
I will send out another RFC for it if it's outdated. > > >> busy-spinning two cores, at each end - not more not less. I continue to > >> believe that this trade-off and these limited benefits need to be > >> clearly and explicitly spelled out in the cover letter. > > Yes, if you just look at the first row of the table then there is > > virtually no difference. > I'm not sure what you mean by this. I compare "interrupts" with "NAPI > threaded" for the case "12 Kpkt/s + 0us delay" and I have explained why > I believe the other experiments are not meaningful. Yes that is exactly what I am disagreeing with. I don't think other rows are "not meaningful". The xsk_rr is trying to "simulate the application processing" by adding a cpu delay and the table clearly shows the comparison between various mechanisms and how they perform under load. > > Thanks, > Martin > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next v7 0/2] Add support to do threaded napi busy poll 2025-08-25 18:53 ` Samiullah Khawaja @ 2025-08-25 19:45 ` Martin Karsten 2025-08-25 20:21 ` Martin Karsten 2025-08-28 22:23 ` Samiullah Khawaja 0 siblings, 2 replies; 18+ messages in thread From: Martin Karsten @ 2025-08-25 19:45 UTC (permalink / raw) To: Samiullah Khawaja Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni, almasrymina, willemb, Joe Damato, netdev On 2025-08-25 14:53, Samiullah Khawaja wrote: > On Mon, Aug 25, 2025 at 10:41 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote: >> >> On 2025-08-25 13:20, Samiullah Khawaja wrote: >>> On Sun, Aug 24, 2025 at 5:03 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote: >>>> >>>> On 2025-08-24 17:54, Samiullah Khawaja wrote: >>>>> Extend the already existing support of threaded napi poll to do continuous >>>>> busy polling. >>>>> >>>>> This is used for doing continuous polling of napi to fetch descriptors >>>>> from backing RX/TX queues for low latency applications. Allow enabling >>>>> of threaded busypoll using netlink so this can be enabled on a set of >>>>> dedicated napis for low latency applications. >>>>> >>>>> Once enabled user can fetch the PID of the kthread doing NAPI polling >>>>> and set affinity, priority and scheduler for it depending on the >>>>> low-latency requirements. >>>>> >>>>> Currently threaded napi is only enabled at device level using sysfs. Add >>>>> support to enable/disable threaded mode for a napi individually. This >>>>> can be done using the netlink interface. Extend `napi-set` op in netlink >>>>> spec that allows setting the `threaded` attribute of a napi. >>>>> >>>>> Extend the threaded attribute in napi struct to add an option to enable >>>>> continuous busy polling. Extend the netlink and sysfs interface to allow >>>>> enabling/disabling threaded busypolling at device or individual napi >>>>> level. >>>>> >>>>> We use this for our AF_XDP based hard low-latency usecase with usecs >>>>> level latency requirement. For our usecase we want low jitter and stable >>>>> latency at P99. >>>>> >>>>> Following is an analysis and comparison of available (and compatible) >>>>> busy poll interfaces for a low latency usecase with stable P99. Please >>>>> note that the throughput and cpu efficiency is a non-goal. >>>>> >>>>> For analysis we use an AF_XDP based benchmarking tool `xdp_rr`. The >>>>> description of the tool and how it tries to simulate the real workload >>>>> is following, >>>>> >>>>> - It sends UDP packets between 2 machines. >>>>> - The client machine sends packets at a fixed frequency. To maintain the >>>>> frequency of the packet being sent, we use open-loop sampling. That is >>>>> the packets are sent in a separate thread. >>>>> - The server replies to the packet inline by reading the pkt from the >>>>> recv ring and replies using the tx ring. >>>>> - To simulate the application processing time, we use a configurable >>>>> delay in usecs on the client side after a reply is received from the >>>>> server. >>>>> >>>>> The xdp_rr tool is posted separately as an RFC for tools/testing/selftest. >>>>> >>>>> We use this tool with following napi polling configurations, >>>>> >>>>> - Interrupts only >>>>> - SO_BUSYPOLL (inline in the same thread where the client receives the >>>>> packet). >>>>> - SO_BUSYPOLL (separate thread and separate core) >>> This one uses separate thread and core for polling the napi. >> >> That's not what I am referring to below. 
>> >> [snip] >> >>>>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded | >>>>> |---|---|---|---|---| >>>>> | 12 Kpkt/s + 0us delay | | | | | >>>>> | | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 | >>>>> | | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 | >>>>> | | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 | >>>>> | | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 | >>>>> | 32 Kpkt/s + 30us delay | | | | | >>>>> | | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 | >>>>> | | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 | >>>>> | | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 | >>>>> | | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 | >>>>> | 125 Kpkt/s + 6us delay | | | | | >>>>> | | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 | >>>>> | | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 | >>>>> | | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 | >>>>> | | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 | >>>>> | 12 Kpkt/s + 78us delay | | | | | >>>>> | | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 | >>>>> | | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 | >>>>> | | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 | >>>>> | | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 | >>>>> | 25 Kpkt/s + 38us delay | | | | | >>>>> | | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 | >>>>> | | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 | >>>>> | | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 | >>>>> | | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 | >>>>> >>>>> ## Observations >>>> >>>> Hi Samiullah, >>>> >>> Thanks for the review >>>> I believe you are comparing apples and oranges with these experiments. >>>> Because threaded busy poll uses two cores at each end (at 100%), you >>> The SO_BUSYPOLL(separate) column is actually running in a separate >>> thread and using two cores. So this is actually comparing apples to >>> apples. >> >> I am not referring to SO_BUSYPOLL, but to the column labelled >> "interrupts". This is single-core, yes? >> >>>> should compare with 2 pairs of xsk_rr processes using interrupt mode, >>>> but each running at half the rate. I am quite certain you would then see >>>> the same latency as in the baseline experiment - at much reduced cpu >>>> utilization. >>>> >>>> Threaded busy poll reduces p99 latency by just 100 nsec, while >>> The table in the experiments show much larger differences in latency. >> >> Yes, because all but the first experiment add processing delay to >> simulate 100% load and thus most likely show queuing effects. >> >> Since "interrupts" uses just one core and "NAPI threaded" uses two, a >> fair comparison would be for "interrupts" to run two pairs of xsk_rr at >> half the rate each. Then the load would be well below 100%, no queueing, >> and latency would probably go back to the values measured in the "0us >> delay" experiments. At least that's what I would expect. > Two set of xsk_rr will go to two different NIC queues with two > different interrupts (I think). That would be comparing apples to > oranges, as all the other columns use a single NIC queue. Having > (Forcing user to have) two xsk sockets to deliver packets at a certain > rate is a completely different use case. I don't think a NIC queue is a more critical resource than a CPU core? And the rest depends on the actual application that would be using the service. The restriction to xsk_rr and its particulars is because that's the benchmark you provided. 
>> Reproduction is getting a bit difficult, because you haven't updated the >> xsk_rr RFC and judging from the compilation error, maybe not built/run >> these experiments for a while now? It would be nice to have a working >> reproducible setup. > Oh. Let me check the xsk_rr and see whether it is outdated. I will > send out another RFC for it if it's outdated. >> >>>> busy-spinning two cores, at each end - not more not less. I continue to >>>> believe that this trade-off and these limited benefits need to be >>>> clearly and explicitly spelled out in the cover letter. >>> Yes, if you just look at the first row of the table then there is >>> virtually no difference. >> I'm not sure what you mean by this. I compare "interrupts" with "NAPI >> threaded" for the case "12 Kpkt/s + 0us delay" and I have explained why >> I believe the other experiments are not meaningful. > Yes that is exactly what I am disagreeing with. I don't think other > rows are "not meaningful". The xsk_rr is trying to "simulate the > application processing" by adding a cpu delay and the table clearly > shows the comparison between various mechanisms and how they perform > with in load. But these experiments only look at cases with almost exactly 100% load. As I mentioned in a previous round, this is highly unlikely for a latency-critical service and thus it seems contrived. Once you go to 100% load and see queueing effects, you also need to look left and right to investigate other load and system settings. Maybe this means the xsk_rr tool is not a good enough benchmark? Thanks, Martin ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next v7 0/2] Add support to do threaded napi busy poll 2025-08-25 19:45 ` Martin Karsten @ 2025-08-25 20:21 ` Martin Karsten 2025-08-28 22:23 ` Samiullah Khawaja 1 sibling, 0 replies; 18+ messages in thread From: Martin Karsten @ 2025-08-25 20:21 UTC (permalink / raw) To: Samiullah Khawaja Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni, almasrymina, willemb, Joe Damato, netdev [snip] >>>>>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | >>>>>> NAPI threaded | >>>>>> |---|---|---|---|---| >>>>>> | 12 Kpkt/s + 0us delay | | | | | >>>>>> | | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 | >>>>>> | | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 | >>>>>> | | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 | >>>>>> | | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 | >>>>>> | 32 Kpkt/s + 30us delay | | | | | >>>>>> | | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 | >>>>>> | | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 | >>>>>> | | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 | >>>>>> | | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 | >>>>>> | 125 Kpkt/s + 6us delay | | | | | >>>>>> | | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 | >>>>>> | | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 | >>>>>> | | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 | >>>>>> | | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 | >>>>>> | 12 Kpkt/s + 78us delay | | | | | >>>>>> | | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 | >>>>>> | | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 | >>>>>> | | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 | >>>>>> | | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 | >>>>>> | 25 Kpkt/s + 38us delay | | | | | >>>>>> | | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 | >>>>>> | | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 | >>>>>> | | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 | >>>>>> | | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 | >>>>>> >>>>>> ## Observations >>>>> >>>>> Hi Samiullah, >>>>> >>>> Thanks for the review >>>>> I believe you are comparing apples and oranges with these experiments. >>>>> Because threaded busy poll uses two cores at each end (at 100%), you >>>> The SO_BUSYPOLL(separate) column is actually running in a separate >>>> thread and using two cores. So this is actually comparing apples to >>>> apples. >>> >>> I am not referring to SO_BUSYPOLL, but to the column labelled >>> "interrupts". This is single-core, yes? >>> >>>>> should compare with 2 pairs of xsk_rr processes using interrupt mode, >>>>> but each running at half the rate. I am quite certain you would >>>>> then see >>>>> the same latency as in the baseline experiment - at much reduced cpu >>>>> utilization. >>>>> >>>>> Threaded busy poll reduces p99 latency by just 100 nsec, while >>>> The table in the experiments show much larger differences in latency. >>> >>> Yes, because all but the first experiment add processing delay to >>> simulate 100% load and thus most likely show queuing effects. >>> >>> Since "interrupts" uses just one core and "NAPI threaded" uses two, a >>> fair comparison would be for "interrupts" to run two pairs of xsk_rr at >>> half the rate each. Then the load would be well below 100%, no queueing, >>> and latency would probably go back to the values measured in the "0us >>> delay" experiments. At least that's what I would expect. >> Two set of xsk_rr will go to two different NIC queues with two >> different interrupts (I think). 
That would be comparing apples to >> oranges, as all the other columns use a single NIC queue. Having >> (Forcing user to have) two xsk sockets to deliver packets at a certain >> rate is a completely different use case. > > I don't think a NIC queue is a more critical resource than a CPU core? > > And the rest depends on the actual application that would be using the > service. The restriction to xsk_rr and its particulars is because that's > the benchmark you provided. >>> Reproduction is getting a bit difficult, because you haven't updated the >>> xsk_rr RFC and judging from the compilation error, maybe not built/run >>> these experiments for a while now? It would be nice to have a working >>> reproducible setup. >> Oh. Let me check the xsk_rr and see whether it is outdated. I will >> send out another RFC for it if it's outdated. >>> >>>>> busy-spinning two cores, at each end - not more not less. I >>>>> continue to >>>>> believe that this trade-off and these limited benefits need to be >>>>> clearly and explicitly spelled out in the cover letter. >>>> Yes, if you just look at the first row of the table then there is >>>> virtually no difference. >>> I'm not sure what you mean by this. I compare "interrupts" with "NAPI >>> threaded" for the case "12 Kpkt/s + 0us delay" and I have explained why >>> I believe the other experiments are not meaningful. >> Yes that is exactly what I am disagreeing with. I don't think other >> rows are "not meaningful". The xsk_rr is trying to "simulate the >> application processing" by adding a cpu delay and the table clearly >> shows the comparison between various mechanisms and how they perform >> with in load. > > But these experiments only look at cases with almost exactly 100% load. > As I mentioned in a previous round, this is highly unlikely for a > latency-critical service and thus it seems contrived. Once you go to > 100% load and see queueing effects, you also need to look left and right > to investigate other load and system settings. Let me try another way: The delay and rate parameters create a two-dimensional configuration space, but you only cherry-pick setups that result in near-100% load, which make "NAPI threaded" look particularly good. It would be easy to provide a more comprehensive evaluation. And if there's a good reason to avoid using multiple NIC queues, it would be good to know that as well. As I mentioned before, I am not debating that "NAPI threaded" provides some performance improvements. I am just asking to present the full picture. Thanks, Martin ^ permalink raw reply [flat|nested] 18+ messages in thread
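To put a number on the load argument above, here is a rough back-of-the-envelope calculation (my own, not from the posting) of how much of one client core the simulated processing delay alone consumes in the non-zero-delay rows; per-packet AF_XDP rx/tx work comes on top of this, which is what pushes these configurations toward the saturation regime being debated:

```
# Fraction of one core consumed by the simulated delay alone: rate * delay.
for cfg in "32000 30" "125000 6" "12000 78" "25000 38"; do
  set -- $cfg   # $1 = packets per second, $2 = delay in microseconds
  echo "$1 pkt/s x $2 us => $(echo "scale=2; $1 * $2 / 1000000" | bc) of one core"
done
```

Whether that reading makes these rows contrived or representative is exactly the disagreement in this subthread.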
* Re: [PATCH net-next v7 0/2] Add support to do threaded napi busy poll 2025-08-25 19:45 ` Martin Karsten 2025-08-25 20:21 ` Martin Karsten @ 2025-08-28 22:23 ` Samiullah Khawaja 1 sibling, 0 replies; 18+ messages in thread From: Samiullah Khawaja @ 2025-08-28 22:23 UTC (permalink / raw) To: Martin Karsten Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni, almasrymina, willemb, Joe Damato, netdev On Mon, Aug 25, 2025 at 12:45 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote: > > On 2025-08-25 14:53, Samiullah Khawaja wrote: > > On Mon, Aug 25, 2025 at 10:41 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote: > >> > >> On 2025-08-25 13:20, Samiullah Khawaja wrote: > >>> On Sun, Aug 24, 2025 at 5:03 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote: > >>>> > >>>> On 2025-08-24 17:54, Samiullah Khawaja wrote: > >>>>> Extend the already existing support of threaded napi poll to do continuous > >>>>> busy polling. > >>>>> > >>>>> This is used for doing continuous polling of napi to fetch descriptors > >>>>> from backing RX/TX queues for low latency applications. Allow enabling > >>>>> of threaded busypoll using netlink so this can be enabled on a set of > >>>>> dedicated napis for low latency applications. > >>>>> > >>>>> Once enabled user can fetch the PID of the kthread doing NAPI polling > >>>>> and set affinity, priority and scheduler for it depending on the > >>>>> low-latency requirements. > >>>>> > >>>>> Currently threaded napi is only enabled at device level using sysfs. Add > >>>>> support to enable/disable threaded mode for a napi individually. This > >>>>> can be done using the netlink interface. Extend `napi-set` op in netlink > >>>>> spec that allows setting the `threaded` attribute of a napi. > >>>>> > >>>>> Extend the threaded attribute in napi struct to add an option to enable > >>>>> continuous busy polling. Extend the netlink and sysfs interface to allow > >>>>> enabling/disabling threaded busypolling at device or individual napi > >>>>> level. > >>>>> > >>>>> We use this for our AF_XDP based hard low-latency usecase with usecs > >>>>> level latency requirement. For our usecase we want low jitter and stable > >>>>> latency at P99. > >>>>> > >>>>> Following is an analysis and comparison of available (and compatible) > >>>>> busy poll interfaces for a low latency usecase with stable P99. Please > >>>>> note that the throughput and cpu efficiency is a non-goal. > >>>>> > >>>>> For analysis we use an AF_XDP based benchmarking tool `xdp_rr`. The > >>>>> description of the tool and how it tries to simulate the real workload > >>>>> is following, > >>>>> > >>>>> - It sends UDP packets between 2 machines. > >>>>> - The client machine sends packets at a fixed frequency. To maintain the > >>>>> frequency of the packet being sent, we use open-loop sampling. That is > >>>>> the packets are sent in a separate thread. > >>>>> - The server replies to the packet inline by reading the pkt from the > >>>>> recv ring and replies using the tx ring. > >>>>> - To simulate the application processing time, we use a configurable > >>>>> delay in usecs on the client side after a reply is received from the > >>>>> server. > >>>>> > >>>>> The xdp_rr tool is posted separately as an RFC for tools/testing/selftest. > >>>>> > >>>>> We use this tool with following napi polling configurations, > >>>>> > >>>>> - Interrupts only > >>>>> - SO_BUSYPOLL (inline in the same thread where the client receives the > >>>>> packet). 
> >>>>> - SO_BUSYPOLL (separate thread and separate core) > >>> This one uses separate thread and core for polling the napi. > >> > >> That's not what I am referring to below. > >> > >> [snip] > >> > >>>>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded | > >>>>> |---|---|---|---|---| > >>>>> | 12 Kpkt/s + 0us delay | | | | | > >>>>> | | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 | > >>>>> | | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 | > >>>>> | | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 | > >>>>> | | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 | > >>>>> | 32 Kpkt/s + 30us delay | | | | | > >>>>> | | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 | > >>>>> | | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 | > >>>>> | | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 | > >>>>> | | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 | > >>>>> | 125 Kpkt/s + 6us delay | | | | | > >>>>> | | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 | > >>>>> | | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 | > >>>>> | | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 | > >>>>> | | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 | > >>>>> | 12 Kpkt/s + 78us delay | | | | | > >>>>> | | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 | > >>>>> | | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 | > >>>>> | | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 | > >>>>> | | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 | > >>>>> | 25 Kpkt/s + 38us delay | | | | | > >>>>> | | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 | > >>>>> | | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 | > >>>>> | | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 | > >>>>> | | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 | > >>>>> > >>>>> ## Observations > >>>> > >>>> Hi Samiullah, > >>>> > >>> Thanks for the review > >>>> I believe you are comparing apples and oranges with these experiments. > >>>> Because threaded busy poll uses two cores at each end (at 100%), you > >>> The SO_BUSYPOLL(separate) column is actually running in a separate > >>> thread and using two cores. So this is actually comparing apples to > >>> apples. > >> > >> I am not referring to SO_BUSYPOLL, but to the column labelled > >> "interrupts". This is single-core, yes? Not really. The interrupts are pinned to a different CPU and the process (and it's threads) run a different CPU. Please check the cover letter for interrupt and process affinity configurations.. > >> > >>>> should compare with 2 pairs of xsk_rr processes using interrupt mode, > >>>> but each running at half the rate. I am quite certain you would then see > >>>> the same latency as in the baseline experiment - at much reduced cpu > >>>> utilization. > >>>> > >>>> Threaded busy poll reduces p99 latency by just 100 nsec, while > >>> The table in the experiments show much larger differences in latency. > >> > >> Yes, because all but the first experiment add processing delay to > >> simulate 100% load and thus most likely show queuing effects. > >> > >> Since "interrupts" uses just one core and "NAPI threaded" uses two, a > >> fair comparison would be for "interrupts" to run two pairs of xsk_rr at > >> half the rate each. Then the load would be well below 100%, no queueing, > >> and latency would probably go back to the values measured in the "0us > >> delay" experiments. At least that's what I would expect. > > Two set of xsk_rr will go to two different NIC queues with two > > different interrupts (I think). 
That would be comparing apples to > > oranges, as all the other columns use a single NIC queue. Having > > (Forcing user to have) two xsk sockets to deliver packets at a certain > > rate is a completely different use case. > > I don't think a NIC queue is a more critical resource than a CPU core? > > And the rest depends on the actual application that would be using the > service. The restriction to xsk_rr and its particulars is because that's > the benchmark you provided. > >> Reproduction is getting a bit difficult, because you haven't updated the > >> xsk_rr RFC and judging from the compilation error, maybe not built/run > >> these experiments for a while now? It would be nice to have a working > >> reproducible setup. > > Oh. Let me check the xsk_rr and see whether it is outdated. I will > > send out another RFC for it if it's outdated. I checked this, it seems the last xsk_rr needs to be rebased. Will be sending it out shortly. > >> > >>>> busy-spinning two cores, at each end - not more not less. I continue to > >>>> believe that this trade-off and these limited benefits need to be > >>>> clearly and explicitly spelled out in the cover letter. > >>> Yes, if you just look at the first row of the table then there is > >>> virtually no difference. > >> I'm not sure what you mean by this. I compare "interrupts" with "NAPI > >> threaded" for the case "12 Kpkt/s + 0us delay" and I have explained why > >> I believe the other experiments are not meaningful. > > Yes that is exactly what I am disagreeing with. I don't think other > > rows are "not meaningful". The xsk_rr is trying to "simulate the > > application processing" by adding a cpu delay and the table clearly > > shows the comparison between various mechanisms and how they perform > > with in load. > > But these experiments only look at cases with almost exactly 100% load. > As I mentioned in a previous round, this is highly unlikely for a > latency-critical service and thus it seems contrived. Once you go to > 100% load and see queueing effects, you also need to look left and right > to investigate other load and system settings. > > Maybe this means the xsk_rr tool is not a good enough benchmark? > > Thanks, > Martin > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next v7 0/2] Add support to do threaded napi busy poll 2025-08-24 21:54 [PATCH net-next v7 0/2] Add support to do threaded napi busy poll Samiullah Khawaja ` (2 preceding siblings ...) 2025-08-25 0:03 ` [PATCH net-next v7 0/2] Add support to do threaded napi busy poll Martin Karsten @ 2025-08-25 19:37 ` Stanislav Fomichev 2025-08-25 22:12 ` Samiullah Khawaja 3 siblings, 1 reply; 18+ messages in thread From: Stanislav Fomichev @ 2025-08-25 19:37 UTC (permalink / raw) To: Samiullah Khawaja Cc: Jakub Kicinski, David S . Miller , Eric Dumazet, Paolo Abeni, almasrymina, willemb, mkarsten, Joe Damato, netdev On 08/24, Samiullah Khawaja wrote: > Extend the already existing support of threaded napi poll to do continuous > busy polling. > > This is used for doing continuous polling of napi to fetch descriptors > from backing RX/TX queues for low latency applications. Allow enabling > of threaded busypoll using netlink so this can be enabled on a set of > dedicated napis for low latency applications. > > Once enabled user can fetch the PID of the kthread doing NAPI polling > and set affinity, priority and scheduler for it depending on the > low-latency requirements. > > Currently threaded napi is only enabled at device level using sysfs. Add > support to enable/disable threaded mode for a napi individually. This > can be done using the netlink interface. Extend `napi-set` op in netlink > spec that allows setting the `threaded` attribute of a napi. > > Extend the threaded attribute in napi struct to add an option to enable > continuous busy polling. Extend the netlink and sysfs interface to allow > enabling/disabling threaded busypolling at device or individual napi > level. > > We use this for our AF_XDP based hard low-latency usecase with usecs > level latency requirement. For our usecase we want low jitter and stable > latency at P99. > > Following is an analysis and comparison of available (and compatible) > busy poll interfaces for a low latency usecase with stable P99. Please > note that the throughput and cpu efficiency is a non-goal. > > For analysis we use an AF_XDP based benchmarking tool `xdp_rr`. The > description of the tool and how it tries to simulate the real workload > is following, > > - It sends UDP packets between 2 machines. > - The client machine sends packets at a fixed frequency. To maintain the > frequency of the packet being sent, we use open-loop sampling. That is > the packets are sent in a separate thread. > - The server replies to the packet inline by reading the pkt from the > recv ring and replies using the tx ring. > - To simulate the application processing time, we use a configurable > delay in usecs on the client side after a reply is received from the > server. > > The xdp_rr tool is posted separately as an RFC for tools/testing/selftest. > > We use this tool with following napi polling configurations, > > - Interrupts only > - SO_BUSYPOLL (inline in the same thread where the client receives the > packet). 
> - SO_BUSYPOLL (separate thread and separate core) > - Threaded NAPI busypoll > > System is configured using following script in all 4 cases, > > ``` > echo 0 | sudo tee /sys/class/net/eth0/threaded > echo 0 | sudo tee /proc/sys/kernel/timer_migration > echo off | sudo tee /sys/devices/system/cpu/smt/control > > sudo ethtool -L eth0 rx 1 tx 1 > sudo ethtool -G eth0 rx 1024 > > echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries > echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus > > # pin IRQs on CPU 2 > IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \ > print arr[0]}' < /proc/interrupts)" > for irq in "${IRQS}"; \ > do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done > > echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us > > for i in /sys/devices/virtual/workqueue/*/cpumask; \ > do echo $i; echo 1,2,3,4,5,6 > $i; done > > if [[ -z "$1" ]]; then > echo 400 | sudo tee /proc/sys/net/core/busy_read > echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout > fi > > sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0 > > if [[ "$1" == "enable_threaded" ]]; then > echo 0 | sudo tee /proc/sys/net/core/busy_poll > echo 0 | sudo tee /proc/sys/net/core/busy_read > echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout > echo 2 | sudo tee /sys/class/net/eth0/threaded > NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }') > sudo chrt -f -p 50 $NAPI_T > > # pin threaded poll thread to CPU 2 > sudo taskset -pc 2 $NAPI_T > fi > > if [[ "$1" == "enable_interrupt" ]]; then > echo 0 | sudo tee /proc/sys/net/core/busy_read > echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout > fi > ``` > > To enable various configurations, script can be run as following, > > - Interrupt Only > ``` > <script> enable_interrupt > ``` > > - SO_BUSYPOLL (no arguments to script) > ``` > <script> > ``` > > - NAPI threaded busypoll > ``` > <script> enable_threaded > ``` > > If using idpf, the script needs to be run again after launching the > workload just to make sure that the configurations are not reverted. As > idpf reverts some configurations on software reset when AF_XDP program > is attached. > > Once configured, the workload is run with various configurations using > following commands. Set period (1/frequency) and delay in usecs to > produce results for packet frequency and application processing delay. > > ## Interrupt Only and SO_BUSY_POLL (inline) > > - Server > ``` > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v > ``` > > - Client > ``` > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \ > -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v > ``` > > ## SO_BUSY_POLL(done in separate core using recvfrom) > > Argument -t spawns a seprate thread and continuously calls recvfrom. 
> > - Server > ``` > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \ > -h -v -t > ``` > > - Client > ``` > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \ > -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -t > ``` > > ## NAPI Threaded Busy Poll > > Argument -n skips the recvfrom call as there is no recv kick needed. > > - Server > ``` > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \ > -h -v -n > ``` > > - Client > ``` > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \ > -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -n > ``` > > | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded | > |---|---|---|---|---| > | 12 Kpkt/s + 0us delay | | | | | > | | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 | > | | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 | > | | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 | > | | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 | > | 32 Kpkt/s + 30us delay | | | | | > | | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 | > | | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 | > | | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 | > | | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 | > | 125 Kpkt/s + 6us delay | | | | | > | | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 | > | | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 | > | | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 | > | | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 | > | 12 Kpkt/s + 78us delay | | | | | > | | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 | > | | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 | > | | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 | > | | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 | > | 25 Kpkt/s + 38us delay | | | | | > | | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 | > | | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 | > | | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 | > | | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 | > > ## Observations > > - Here without application processing all the approaches give the same > latency within 1usecs range and NAPI threaded gives minimum latency. > - With application processing the latency increases by 3-4usecs when > doing inline polling. > - Using a dedicated core to drive napi polling keeps the latency same > even with application processing. This is observed both in userspace > and threaded napi (in kernel). > - Using napi threaded polling in kernel gives lower latency by > 1-1.5usecs as compared to userspace driven polling in separate core. > - With application processing userspace will get the packet from recv > ring and spend some time doing application processing and then do napi > polling. While application processing is happening a dedicated core > doing napi polling can pull the packet of the NAPI RX queue and > populate the AF_XDP recv ring. This means that when the application > thread is done with application processing it has new packets ready to > recv and process in recv ring. > - Napi threaded busy polling in the kernel with a dedicated core gives > the consistent P5-P99 latency. The real take away for me is ~1us difference between SO_BUSYPOLL in a thread and NAPI threaded. 
Presumably mostly because of the non-blocking calls to sk_busy_loop in the former? So it takes 1us extra to enter/leave the kernel and set up/tear down the busy polling? And you haven't tried epoll-based busy polling? I'd expect to see results similar to your NAPI threaded (if it works correctly). (I have nothing against the busy polling thread, mostly trying to understand what we are missing from the existing setup) ^ permalink raw reply [flat|nested] 18+ messages in thread
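For context on the epoll question, a minimal sketch of how syscall-driven busy polling is conventionally switched on with the existing knobs; the timeout values are illustrative only and are not the ones used in the benchmark script:

```
# Blocking reads (recvfrom etc.) busy poll the socket's NAPI for up to
# busy_read microseconds; poll()/select()/epoll_wait() honor busy_poll,
# as long as the sockets involved carry a valid NAPI ID.
echo 50 | sudo tee /proc/sys/net/core/busy_read
echo 50 | sudo tee /proc/sys/net/core/busy_poll
```

Per-socket options such as SO_BUSY_POLL and SO_PREFER_BUSY_POLL (and, on newer kernels, per-epoll busy poll parameters) can scope this more narrowly, but in all of these variants the polling still runs in the application's own syscall context rather than in a dedicated kernel thread.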
* Re: [PATCH net-next v7 0/2] Add support to do threaded napi busy poll 2025-08-25 19:37 ` Stanislav Fomichev @ 2025-08-25 22:12 ` Samiullah Khawaja 2025-08-25 22:42 ` Stanislav Fomichev 0 siblings, 1 reply; 18+ messages in thread From: Samiullah Khawaja @ 2025-08-25 22:12 UTC (permalink / raw) To: Stanislav Fomichev Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni, almasrymina, willemb, mkarsten, Joe Damato, netdev On Mon, Aug 25, 2025 at 12:37 PM Stanislav Fomichev <stfomichev@gmail.com> wrote: > > On 08/24, Samiullah Khawaja wrote: > > Extend the already existing support of threaded napi poll to do continuous > > busy polling. > > > > This is used for doing continuous polling of napi to fetch descriptors > > from backing RX/TX queues for low latency applications. Allow enabling > > of threaded busypoll using netlink so this can be enabled on a set of > > dedicated napis for low latency applications. > > > > Once enabled user can fetch the PID of the kthread doing NAPI polling > > and set affinity, priority and scheduler for it depending on the > > low-latency requirements. > > > > Currently threaded napi is only enabled at device level using sysfs. Add > > support to enable/disable threaded mode for a napi individually. This > > can be done using the netlink interface. Extend `napi-set` op in netlink > > spec that allows setting the `threaded` attribute of a napi. > > > > Extend the threaded attribute in napi struct to add an option to enable > > continuous busy polling. Extend the netlink and sysfs interface to allow > > enabling/disabling threaded busypolling at device or individual napi > > level. > > > > We use this for our AF_XDP based hard low-latency usecase with usecs > > level latency requirement. For our usecase we want low jitter and stable > > latency at P99. > > > > Following is an analysis and comparison of available (and compatible) > > busy poll interfaces for a low latency usecase with stable P99. Please > > note that the throughput and cpu efficiency is a non-goal. > > > > For analysis we use an AF_XDP based benchmarking tool `xdp_rr`. The > > description of the tool and how it tries to simulate the real workload > > is following, > > > > - It sends UDP packets between 2 machines. > > - The client machine sends packets at a fixed frequency. To maintain the > > frequency of the packet being sent, we use open-loop sampling. That is > > the packets are sent in a separate thread. > > - The server replies to the packet inline by reading the pkt from the > > recv ring and replies using the tx ring. > > - To simulate the application processing time, we use a configurable > > delay in usecs on the client side after a reply is received from the > > server. > > > > The xdp_rr tool is posted separately as an RFC for tools/testing/selftest. > > > > We use this tool with following napi polling configurations, > > > > - Interrupts only > > - SO_BUSYPOLL (inline in the same thread where the client receives the > > packet). 
> > - SO_BUSYPOLL (separate thread and separate core) > > - Threaded NAPI busypoll > > > > System is configured using following script in all 4 cases, > > > > ``` > > echo 0 | sudo tee /sys/class/net/eth0/threaded > > echo 0 | sudo tee /proc/sys/kernel/timer_migration > > echo off | sudo tee /sys/devices/system/cpu/smt/control > > > > sudo ethtool -L eth0 rx 1 tx 1 > > sudo ethtool -G eth0 rx 1024 > > > > echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries > > echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus > > > > # pin IRQs on CPU 2 > > IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \ > > print arr[0]}' < /proc/interrupts)" > > for irq in "${IRQS}"; \ > > do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done > > > > echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us > > > > for i in /sys/devices/virtual/workqueue/*/cpumask; \ > > do echo $i; echo 1,2,3,4,5,6 > $i; done > > > > if [[ -z "$1" ]]; then > > echo 400 | sudo tee /proc/sys/net/core/busy_read > > echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs > > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout > > fi > > > > sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0 > > > > if [[ "$1" == "enable_threaded" ]]; then > > echo 0 | sudo tee /proc/sys/net/core/busy_poll > > echo 0 | sudo tee /proc/sys/net/core/busy_read > > echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs > > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout > > echo 2 | sudo tee /sys/class/net/eth0/threaded > > NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }') > > sudo chrt -f -p 50 $NAPI_T > > > > # pin threaded poll thread to CPU 2 > > sudo taskset -pc 2 $NAPI_T > > fi > > > > if [[ "$1" == "enable_interrupt" ]]; then > > echo 0 | sudo tee /proc/sys/net/core/busy_read > > echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs > > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout > > fi > > ``` > > > > To enable various configurations, script can be run as following, > > > > - Interrupt Only > > ``` > > <script> enable_interrupt > > ``` > > > > - SO_BUSYPOLL (no arguments to script) > > ``` > > <script> > > ``` > > > > - NAPI threaded busypoll > > ``` > > <script> enable_threaded > > ``` > > > > If using idpf, the script needs to be run again after launching the > > workload just to make sure that the configurations are not reverted. As > > idpf reverts some configurations on software reset when AF_XDP program > > is attached. > > > > Once configured, the workload is run with various configurations using > > following commands. Set period (1/frequency) and delay in usecs to > > produce results for packet frequency and application processing delay. > > > > ## Interrupt Only and SO_BUSY_POLL (inline) > > > > - Server > > ``` > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > > -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v > > ``` > > > > - Client > > ``` > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > > -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \ > > -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v > > ``` > > > > ## SO_BUSY_POLL(done in separate core using recvfrom) > > > > Argument -t spawns a seprate thread and continuously calls recvfrom. 
> > > > - Server > > ``` > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > > -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \ > > -h -v -t > > ``` > > > > - Client > > ``` > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > > -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \ > > -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -t > > ``` > > > > ## NAPI Threaded Busy Poll > > > > Argument -n skips the recvfrom call as there is no recv kick needed. > > > > - Server > > ``` > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > > -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \ > > -h -v -n > > ``` > > > > - Client > > ``` > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > > -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \ > > -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -n > > ``` > > > > | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded | > > |---|---|---|---|---| > > | 12 Kpkt/s + 0us delay | | | | | > > | | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 | > > | | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 | > > | | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 | > > | | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 | > > | 32 Kpkt/s + 30us delay | | | | | > > | | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 | > > | | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 | > > | | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 | > > | | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 | > > | 125 Kpkt/s + 6us delay | | | | | > > | | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 | > > | | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 | > > | | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 | > > | | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 | > > | 12 Kpkt/s + 78us delay | | | | | > > | | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 | > > | | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 | > > | | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 | > > | | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 | > > | 25 Kpkt/s + 38us delay | | | | | > > | | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 | > > | | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 | > > | | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 | > > | | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 | > > > > ## Observations > > > > - Here without application processing all the approaches give the same > > latency within 1usecs range and NAPI threaded gives minimum latency. > > - With application processing the latency increases by 3-4usecs when > > doing inline polling. > > - Using a dedicated core to drive napi polling keeps the latency same > > even with application processing. This is observed both in userspace > > and threaded napi (in kernel). > > - Using napi threaded polling in kernel gives lower latency by > > 1-1.5usecs as compared to userspace driven polling in separate core. > > - With application processing userspace will get the packet from recv > > ring and spend some time doing application processing and then do napi > > polling. While application processing is happening a dedicated core > > doing napi polling can pull the packet of the NAPI RX queue and > > populate the AF_XDP recv ring. This means that when the application > > thread is done with application processing it has new packets ready to > > recv and process in recv ring. 
> > - Napi threaded busy polling in the kernel with a dedicated core gives > > the consistent P5-P99 latency. > > The real take away for me is ~1us difference between SO_BUSYPOLL in a > thread and NAPI threaded. Presumably mostly because of the non-blocking calls > to sk_busy_loop in the former? So it takes 1us extra to enter/leave the kernel > and setup/teardown the busy polling? > > And you haven't tried epoll based busy polling? I'd expect to see > results similar to your NAPI threaded (if it works correctly). I haven't attempted epoll-based NAPI polling because my understanding is that it only polls NAPI when no events are present. Let me check. > > (have nothing against the busy polling thread, mostly trying to > understand what we are missing from the existing setup) The missing piece is a mechanism to busy poll a NAPI instance in a dedicated thread while ignoring available events or packets, regardless of the userspace API. Most existing mechanisms are designed to work in a pattern where you poll until new packets or events are received, after which userspace is expected to handle them. As a result, one has to hack together a solution using a mechanism intended to receive packets or events, not to simply NAPI poll. NAPI threaded, on the other hand, provides this capability natively, independent of any userspace API. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next v7 0/2] Add support to do threaded napi busy poll 2025-08-25 22:12 ` Samiullah Khawaja @ 2025-08-25 22:42 ` Stanislav Fomichev 2025-08-25 23:22 ` Samiullah Khawaja 0 siblings, 1 reply; 18+ messages in thread From: Stanislav Fomichev @ 2025-08-25 22:42 UTC (permalink / raw) To: Samiullah Khawaja Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni, almasrymina, willemb, mkarsten, Joe Damato, netdev On 08/25, Samiullah Khawaja wrote: > On Mon, Aug 25, 2025 at 12:37 PM Stanislav Fomichev > <stfomichev@gmail.com> wrote: > > > > On 08/24, Samiullah Khawaja wrote: > > > Extend the already existing support of threaded napi poll to do continuous > > > busy polling. > > > > > > This is used for doing continuous polling of napi to fetch descriptors > > > from backing RX/TX queues for low latency applications. Allow enabling > > > of threaded busypoll using netlink so this can be enabled on a set of > > > dedicated napis for low latency applications. > > > > > > Once enabled user can fetch the PID of the kthread doing NAPI polling > > > and set affinity, priority and scheduler for it depending on the > > > low-latency requirements. > > > > > > Currently threaded napi is only enabled at device level using sysfs. Add > > > support to enable/disable threaded mode for a napi individually. This > > > can be done using the netlink interface. Extend `napi-set` op in netlink > > > spec that allows setting the `threaded` attribute of a napi. > > > > > > Extend the threaded attribute in napi struct to add an option to enable > > > continuous busy polling. Extend the netlink and sysfs interface to allow > > > enabling/disabling threaded busypolling at device or individual napi > > > level. > > > > > > We use this for our AF_XDP based hard low-latency usecase with usecs > > > level latency requirement. For our usecase we want low jitter and stable > > > latency at P99. > > > > > > Following is an analysis and comparison of available (and compatible) > > > busy poll interfaces for a low latency usecase with stable P99. Please > > > note that the throughput and cpu efficiency is a non-goal. > > > > > > For analysis we use an AF_XDP based benchmarking tool `xdp_rr`. The > > > description of the tool and how it tries to simulate the real workload > > > is following, > > > > > > - It sends UDP packets between 2 machines. > > > - The client machine sends packets at a fixed frequency. To maintain the > > > frequency of the packet being sent, we use open-loop sampling. That is > > > the packets are sent in a separate thread. > > > - The server replies to the packet inline by reading the pkt from the > > > recv ring and replies using the tx ring. > > > - To simulate the application processing time, we use a configurable > > > delay in usecs on the client side after a reply is received from the > > > server. > > > > > > The xdp_rr tool is posted separately as an RFC for tools/testing/selftest. > > > > > > We use this tool with following napi polling configurations, > > > > > > - Interrupts only > > > - SO_BUSYPOLL (inline in the same thread where the client receives the > > > packet). 
> > > - SO_BUSYPOLL (separate thread and separate core) > > > - Threaded NAPI busypoll > > > > > > System is configured using following script in all 4 cases, > > > > > > ``` > > > echo 0 | sudo tee /sys/class/net/eth0/threaded > > > echo 0 | sudo tee /proc/sys/kernel/timer_migration > > > echo off | sudo tee /sys/devices/system/cpu/smt/control > > > > > > sudo ethtool -L eth0 rx 1 tx 1 > > > sudo ethtool -G eth0 rx 1024 > > > > > > echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries > > > echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus > > > > > > # pin IRQs on CPU 2 > > > IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \ > > > print arr[0]}' < /proc/interrupts)" > > > for irq in "${IRQS}"; \ > > > do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done > > > > > > echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us > > > > > > for i in /sys/devices/virtual/workqueue/*/cpumask; \ > > > do echo $i; echo 1,2,3,4,5,6 > $i; done > > > > > > if [[ -z "$1" ]]; then > > > echo 400 | sudo tee /proc/sys/net/core/busy_read > > > echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs > > > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout > > > fi > > > > > > sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0 > > > > > > if [[ "$1" == "enable_threaded" ]]; then > > > echo 0 | sudo tee /proc/sys/net/core/busy_poll > > > echo 0 | sudo tee /proc/sys/net/core/busy_read > > > echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs > > > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout > > > echo 2 | sudo tee /sys/class/net/eth0/threaded > > > NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }') > > > sudo chrt -f -p 50 $NAPI_T > > > > > > # pin threaded poll thread to CPU 2 > > > sudo taskset -pc 2 $NAPI_T > > > fi > > > > > > if [[ "$1" == "enable_interrupt" ]]; then > > > echo 0 | sudo tee /proc/sys/net/core/busy_read > > > echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs > > > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout > > > fi > > > ``` > > > > > > To enable various configurations, script can be run as following, > > > > > > - Interrupt Only > > > ``` > > > <script> enable_interrupt > > > ``` > > > > > > - SO_BUSYPOLL (no arguments to script) > > > ``` > > > <script> > > > ``` > > > > > > - NAPI threaded busypoll > > > ``` > > > <script> enable_threaded > > > ``` > > > > > > If using idpf, the script needs to be run again after launching the > > > workload just to make sure that the configurations are not reverted. As > > > idpf reverts some configurations on software reset when AF_XDP program > > > is attached. > > > > > > Once configured, the workload is run with various configurations using > > > following commands. Set period (1/frequency) and delay in usecs to > > > produce results for packet frequency and application processing delay. > > > > > > ## Interrupt Only and SO_BUSY_POLL (inline) > > > > > > - Server > > > ``` > > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > > > -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v > > > ``` > > > > > > - Client > > > ``` > > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > > > -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \ > > > -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v > > > ``` > > > > > > ## SO_BUSY_POLL(done in separate core using recvfrom) > > > > > > Argument -t spawns a seprate thread and continuously calls recvfrom. 
> > > > > > - Server > > > ``` > > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > > > -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \ > > > -h -v -t > > > ``` > > > > > > - Client > > > ``` > > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > > > -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \ > > > -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -t > > > ``` > > > > > > ## NAPI Threaded Busy Poll > > > > > > Argument -n skips the recvfrom call as there is no recv kick needed. > > > > > > - Server > > > ``` > > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > > > -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \ > > > -h -v -n > > > ``` > > > > > > - Client > > > ``` > > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > > > -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \ > > > -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -n > > > ``` > > > > > > | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded | > > > |---|---|---|---|---| > > > | 12 Kpkt/s + 0us delay | | | | | > > > | | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 | > > > | | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 | > > > | | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 | > > > | | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 | > > > | 32 Kpkt/s + 30us delay | | | | | > > > | | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 | > > > | | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 | > > > | | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 | > > > | | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 | > > > | 125 Kpkt/s + 6us delay | | | | | > > > | | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 | > > > | | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 | > > > | | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 | > > > | | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 | > > > | 12 Kpkt/s + 78us delay | | | | | > > > | | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 | > > > | | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 | > > > | | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 | > > > | | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 | > > > | 25 Kpkt/s + 38us delay | | | | | > > > | | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 | > > > | | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 | > > > | | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 | > > > | | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 | > > > > > > ## Observations > > > > > > - Here without application processing all the approaches give the same > > > latency within 1usecs range and NAPI threaded gives minimum latency. > > > - With application processing the latency increases by 3-4usecs when > > > doing inline polling. > > > - Using a dedicated core to drive napi polling keeps the latency same > > > even with application processing. This is observed both in userspace > > > and threaded napi (in kernel). > > > - Using napi threaded polling in kernel gives lower latency by > > > 1-1.5usecs as compared to userspace driven polling in separate core. > > > - With application processing userspace will get the packet from recv > > > ring and spend some time doing application processing and then do napi > > > polling. While application processing is happening a dedicated core > > > doing napi polling can pull the packet of the NAPI RX queue and > > > populate the AF_XDP recv ring. 
This means that when the application > > > thread is done with application processing it has new packets ready to > > > recv and process in recv ring. > > > - Napi threaded busy polling in the kernel with a dedicated core gives > > > the consistent P5-P99 latency. > > > > The real take away for me is ~1us difference between SO_BUSYPOLL in a > > thread and NAPI threaded. Presumably mostly because of the non-blocking calls > > to sk_busy_loop in the former? So it takes 1us extra to enter/leave the kernel > > and setup/teardown the busy polling? > > > > And you haven't tried epoll based busy polling? I'd expect to see > > results similar to your NAPI threaded (if it works correctly). > I haven't attempted epoll-based NAPI polling because my understanding > is that it only polls NAPI when no events are present. Let me check. I was under the impression that xsk won't actually add any (socket) events to ep making it busy poll until timeout. But I might be wrong, still worth it to double check. > > (have nothing against the busy polling thread, mostly trying to > > understand what we are missing from the existing setup) > The missing piece is a mechanism to busy poll a NAPI instance in a > dedicated thread while ignoring available events or packets, > regardless of the userspace API. Most existing mechanisms are designed > to work in a pattern where you poll until new packets or events are > received, after which userspace is expected to handle them. > > As a result, one has to hack together a solution using a mechanism > intended to receive packets or events, not to simply NAPI poll. NAPI > threaded, on the other hand, provides this capability natively, > independent of any userspace API. Agreed, yes. Would be nice to document it in the commit description. Explain how SO_BUSY_POLL in a thread is still not enough (polls only once, doesn't busy-poll until the events are ready -> 1-2us of extra latency). And the same for epoll depending on how it goes. If it ends up working, kthread might still be more convenient to setup/manage. ^ permalink raw reply [flat|nested] 18+ messages in thread
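The epoll-based variant raised here would be wired up roughly as in the following sketch. This is an illustration under stated assumptions, not something from the patch set: it presumes an already-bound AF_XDP socket xsk_fd and a kernel plus libc that expose struct epoll_params and the EPIOCSPARAMS ioctl through <sys/epoll.h> (the global net.core.busy_poll sysctl is the older knob for the same effect), and the numeric values are placeholders rather than recommendations.

```
/*
 * Hedged sketch of epoll-based busy polling for an already-bound
 * AF_XDP socket xsk_fd. Assumes struct epoll_params and EPIOCSPARAMS
 * are available via <sys/epoll.h>.
 */
#include <sys/epoll.h>
#include <sys/ioctl.h>
#include <unistd.h>

static int setup_epoll_busy_poll(int xsk_fd)
{
	struct epoll_params params = {
		.busy_poll_usecs  = 400,	/* mirrors busy_read in the script */
		.busy_poll_budget = 64,
		.prefer_busy_poll = 1,		/* keep IRQs deferred while polling */
	};
	struct epoll_event ev = {
		.events  = EPOLLIN,
		.data.fd = xsk_fd,
	};
	int ep = epoll_create1(0);

	if (ep < 0)
		return -1;

	/* Per-epoll busy-poll parameters; net.core.busy_poll is the
	 * sysctl-based alternative. */
	if (ioctl(ep, EPIOCSPARAMS, &params) < 0 ||
	    epoll_ctl(ep, EPOLL_CTL_ADD, xsk_fd, &ev) < 0) {
		close(ep);
		return -1;
	}
	return ep;	/* caller busy-polls via epoll_wait(ep, ...) */
}
```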
* Re: [PATCH net-next v7 0/2] Add support to do threaded napi busy poll 2025-08-25 22:42 ` Stanislav Fomichev @ 2025-08-25 23:22 ` Samiullah Khawaja 0 siblings, 0 replies; 18+ messages in thread From: Samiullah Khawaja @ 2025-08-25 23:22 UTC (permalink / raw) To: Stanislav Fomichev Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni, almasrymina, willemb, mkarsten, Joe Damato, netdev On Mon, Aug 25, 2025 at 3:42 PM Stanislav Fomichev <stfomichev@gmail.com> wrote: > > On 08/25, Samiullah Khawaja wrote: > > On Mon, Aug 25, 2025 at 12:37 PM Stanislav Fomichev > > <stfomichev@gmail.com> wrote: > > > > > > On 08/24, Samiullah Khawaja wrote: > > > > Extend the already existing support of threaded napi poll to do continuous > > > > busy polling. > > > > > > > > This is used for doing continuous polling of napi to fetch descriptors > > > > from backing RX/TX queues for low latency applications. Allow enabling > > > > of threaded busypoll using netlink so this can be enabled on a set of > > > > dedicated napis for low latency applications. > > > > > > > > Once enabled user can fetch the PID of the kthread doing NAPI polling > > > > and set affinity, priority and scheduler for it depending on the > > > > low-latency requirements. > > > > > > > > Currently threaded napi is only enabled at device level using sysfs. Add > > > > support to enable/disable threaded mode for a napi individually. This > > > > can be done using the netlink interface. Extend `napi-set` op in netlink > > > > spec that allows setting the `threaded` attribute of a napi. > > > > > > > > Extend the threaded attribute in napi struct to add an option to enable > > > > continuous busy polling. Extend the netlink and sysfs interface to allow > > > > enabling/disabling threaded busypolling at device or individual napi > > > > level. > > > > > > > > We use this for our AF_XDP based hard low-latency usecase with usecs > > > > level latency requirement. For our usecase we want low jitter and stable > > > > latency at P99. > > > > > > > > Following is an analysis and comparison of available (and compatible) > > > > busy poll interfaces for a low latency usecase with stable P99. Please > > > > note that the throughput and cpu efficiency is a non-goal. > > > > > > > > For analysis we use an AF_XDP based benchmarking tool `xdp_rr`. The > > > > description of the tool and how it tries to simulate the real workload > > > > is following, > > > > > > > > - It sends UDP packets between 2 machines. > > > > - The client machine sends packets at a fixed frequency. To maintain the > > > > frequency of the packet being sent, we use open-loop sampling. That is > > > > the packets are sent in a separate thread. > > > > - The server replies to the packet inline by reading the pkt from the > > > > recv ring and replies using the tx ring. > > > > - To simulate the application processing time, we use a configurable > > > > delay in usecs on the client side after a reply is received from the > > > > server. > > > > > > > > The xdp_rr tool is posted separately as an RFC for tools/testing/selftest. > > > > > > > > We use this tool with following napi polling configurations, > > > > > > > > - Interrupts only > > > > - SO_BUSYPOLL (inline in the same thread where the client receives the > > > > packet). 
> > > > - SO_BUSYPOLL (separate thread and separate core) > > > > - Threaded NAPI busypoll > > > > > > > > System is configured using following script in all 4 cases, > > > > > > > > ``` > > > > echo 0 | sudo tee /sys/class/net/eth0/threaded > > > > echo 0 | sudo tee /proc/sys/kernel/timer_migration > > > > echo off | sudo tee /sys/devices/system/cpu/smt/control > > > > > > > > sudo ethtool -L eth0 rx 1 tx 1 > > > > sudo ethtool -G eth0 rx 1024 > > > > > > > > echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries > > > > echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus > > > > > > > > # pin IRQs on CPU 2 > > > > IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \ > > > > print arr[0]}' < /proc/interrupts)" > > > > for irq in "${IRQS}"; \ > > > > do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done > > > > > > > > echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us > > > > > > > > for i in /sys/devices/virtual/workqueue/*/cpumask; \ > > > > do echo $i; echo 1,2,3,4,5,6 > $i; done > > > > > > > > if [[ -z "$1" ]]; then > > > > echo 400 | sudo tee /proc/sys/net/core/busy_read > > > > echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs > > > > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout > > > > fi > > > > > > > > sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0 > > > > > > > > if [[ "$1" == "enable_threaded" ]]; then > > > > echo 0 | sudo tee /proc/sys/net/core/busy_poll > > > > echo 0 | sudo tee /proc/sys/net/core/busy_read > > > > echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs > > > > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout > > > > echo 2 | sudo tee /sys/class/net/eth0/threaded > > > > NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }') > > > > sudo chrt -f -p 50 $NAPI_T > > > > > > > > # pin threaded poll thread to CPU 2 > > > > sudo taskset -pc 2 $NAPI_T > > > > fi > > > > > > > > if [[ "$1" == "enable_interrupt" ]]; then > > > > echo 0 | sudo tee /proc/sys/net/core/busy_read > > > > echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs > > > > echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout > > > > fi > > > > ``` > > > > > > > > To enable various configurations, script can be run as following, > > > > > > > > - Interrupt Only > > > > ``` > > > > <script> enable_interrupt > > > > ``` > > > > > > > > - SO_BUSYPOLL (no arguments to script) > > > > ``` > > > > <script> > > > > ``` > > > > > > > > - NAPI threaded busypoll > > > > ``` > > > > <script> enable_threaded > > > > ``` > > > > > > > > If using idpf, the script needs to be run again after launching the > > > > workload just to make sure that the configurations are not reverted. As > > > > idpf reverts some configurations on software reset when AF_XDP program > > > > is attached. > > > > > > > > Once configured, the workload is run with various configurations using > > > > following commands. Set period (1/frequency) and delay in usecs to > > > > produce results for packet frequency and application processing delay. 
> > > > > > > > ## Interrupt Only and SO_BUSY_POLL (inline) > > > > > > > > - Server > > > > ``` > > > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > > > > -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v > > > > ``` > > > > > > > > - Client > > > > ``` > > > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > > > > -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \ > > > > -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v > > > > ``` > > > > > > > > ## SO_BUSY_POLL(done in separate core using recvfrom) > > > > > > > > Argument -t spawns a seprate thread and continuously calls recvfrom. > > > > > > > > - Server > > > > ``` > > > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > > > > -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \ > > > > -h -v -t > > > > ``` > > > > > > > > - Client > > > > ``` > > > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > > > > -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \ > > > > -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -t > > > > ``` > > > > > > > > ## NAPI Threaded Busy Poll > > > > > > > > Argument -n skips the recvfrom call as there is no recv kick needed. > > > > > > > > - Server > > > > ``` > > > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > > > > -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \ > > > > -h -v -n > > > > ``` > > > > > > > > - Client > > > > ``` > > > > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \ > > > > -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \ > > > > -P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -n > > > > ``` > > > > > > > > | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded | > > > > |---|---|---|---|---| > > > > | 12 Kpkt/s + 0us delay | | | | | > > > > | | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 | > > > > | | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 | > > > > | | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 | > > > > | | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 | > > > > | 32 Kpkt/s + 30us delay | | | | | > > > > | | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 | > > > > | | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 | > > > > | | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 | > > > > | | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 | > > > > | 125 Kpkt/s + 6us delay | | | | | > > > > | | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 | > > > > | | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 | > > > > | | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 | > > > > | | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 | > > > > | 12 Kpkt/s + 78us delay | | | | | > > > > | | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 | > > > > | | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 | > > > > | | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 | > > > > | | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 | > > > > | 25 Kpkt/s + 38us delay | | | | | > > > > | | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 | > > > > | | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 | > > > > | | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 | > > > > | | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 | > > > > > > > > ## Observations > > > > > > > > - Here without application processing all the approaches give the same > > > > latency within 1usecs range and NAPI threaded gives minimum latency. 
> > > > - With application processing the latency increases by 3-4usecs when > > > > doing inline polling. > > > > - Using a dedicated core to drive napi polling keeps the latency same > > > > even with application processing. This is observed both in userspace > > > > and threaded napi (in kernel). > > > > - Using napi threaded polling in kernel gives lower latency by > > > > 1-1.5usecs as compared to userspace driven polling in separate core. > > > > - With application processing userspace will get the packet from recv > > > > ring and spend some time doing application processing and then do napi > > > > polling. While application processing is happening a dedicated core > > > > doing napi polling can pull the packet of the NAPI RX queue and > > > > populate the AF_XDP recv ring. This means that when the application > > > > thread is done with application processing it has new packets ready to > > > > recv and process in recv ring. > > > > - Napi threaded busy polling in the kernel with a dedicated core gives > > > > the consistent P5-P99 latency. > > > > > > The real take away for me is ~1us difference between SO_BUSYPOLL in a > > > thread and NAPI threaded. Presumably mostly because of the non-blocking calls > > > to sk_busy_loop in the former? So it takes 1us extra to enter/leave the kernel > > > and setup/teardown the busy polling? > > > > > > And you haven't tried epoll based busy polling? I'd expect to see > > > results similar to your NAPI threaded (if it works correctly). > > I haven't attempted epoll-based NAPI polling because my understanding > > is that it only polls NAPI when no events are present. Let me check. > > I was under the impression that xsk won't actually add any (socket) events > to ep making it busy poll until timeout. But I might be wrong, still > worth it to double check. xsk.c has an xsk_poll implementation that is doing following (apart from other things): if (xs->rx && !xskq_prod_is_empty(xs->rx)) mask |= EPOLLIN | EPOLLRDNORM; if (xs->tx && xsk_tx_writeable(xs)) mask |= EPOLLOUT | EPOLLWRNORM; This means it should be providing events checking whether RX queue is not empty and whether it is tx writeable. > > > > (have nothing against the busy polling thread, mostly trying to > > > understand what we are missing from the existing setup) > > The missing piece is a mechanism to busy poll a NAPI instance in a > > dedicated thread while ignoring available events or packets, > > regardless of the userspace API. Most existing mechanisms are designed > > to work in a pattern where you poll until new packets or events are > > received, after which userspace is expected to handle them. > > > > As a result, one has to hack together a solution using a mechanism > > intended to receive packets or events, not to simply NAPI poll. NAPI > > threaded, on the other hand, provides this capability natively, > > independent of any userspace API. > > Agreed, yes. Would be nice to document it in the commit description. Explain > how SO_BUSY_POLL in a thread is still not enough (polls only once, > doesn't busy-poll until the events are ready -> 1-2us of extra latency). > And the same for epoll depending on how it goes. If it ends up working, > kthread might still be more convenient to setup/manage. Agreed. I can add it to the cover letter. ^ permalink raw reply [flat|nested] 18+ messages in thread
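Given the xsk_poll() behavior quoted in this message, an epoll-driven consumer would follow a loop like the sketch below; it only illustrates the timing gap being discussed. process_descriptors() is a hypothetical stand-in for the AF_XDP ring handling, and ep/xsk_fd are assumed to come from a setup like the earlier sketch.

```
/*
 * Hedged sketch continuing the epoll discussion: because xsk_poll()
 * raises EPOLLIN as soon as the rx ring is non-empty, epoll_wait()
 * stops busy polling and returns the moment a descriptor is available.
 */
#include <sys/epoll.h>

void process_descriptors(int xsk_fd);	/* hypothetical application handler */

static void consumer_loop(int ep, int xsk_fd)
{
	struct epoll_event ev;

	for (;;) {
		/* Busy-polls the socket's NAPI in the kernel until an
		 * event is reported or the budget/timeout runs out. */
		if (epoll_wait(ep, &ev, 1, -1) <= 0)
			continue;

		/*
		 * While the application drains the ring and does its
		 * processing here, nothing is polling the NAPI. This is
		 * the window a dedicated busy-poll kthread keeps covered.
		 */
		process_descriptors(xsk_fd);
	}
}
```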
end of thread, other threads:[~2025-08-28 22:23 UTC | newest]

Thread overview: 18+ messages

2025-08-24 21:54 [PATCH net-next v7 0/2] Add support to do threaded napi busy poll Samiullah Khawaja
2025-08-24 21:54 ` [PATCH net-next v7 1/2] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
2025-08-25 19:47 ` Stanislav Fomichev
2025-08-25 23:11 ` Samiullah Khawaja
2025-08-24 21:54 ` [PATCH net-next v7 2/2] selftests: Add napi threaded busy poll test in `busy_poller` Samiullah Khawaja
2025-08-25 16:30 ` Jakub Kicinski
2025-08-25 17:20 ` Samiullah Khawaja
2025-08-25 0:03 ` [PATCH net-next v7 0/2] Add support to do threaded napi busy poll Martin Karsten
2025-08-25 17:20 ` Samiullah Khawaja
2025-08-25 17:40 ` Martin Karsten
2025-08-25 18:53 ` Samiullah Khawaja
2025-08-25 19:45 ` Martin Karsten
2025-08-25 20:21 ` Martin Karsten
2025-08-28 22:23 ` Samiullah Khawaja
2025-08-25 19:37 ` Stanislav Fomichev
2025-08-25 22:12 ` Samiullah Khawaja
2025-08-25 22:42 ` Stanislav Fomichev
2025-08-25 23:22 ` Samiullah Khawaja