[PATCH net-next v8 0/2] Add support to do threaded napi busy poll

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH net-next v8 0/2] Add support to do threaded napi busy poll
@ 2025-08-29  1:16 Samiullah Khawaja
  2025-08-29  1:16 ` [PATCH net-next v8 1/2] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Samiullah Khawaja @ 2025-08-29  1:16 UTC (permalink / raw)
  To: Jakub Kicinski, David S . Miller , Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, mkarsten
  Cc: Joe Damato, netdev, skhawaja

Extend the already existing support of threaded napi poll to do continuous
busy polling.

This is used for doing continuous polling of napi to fetch descriptors
from backing RX/TX queues for low latency applications. Allow enabling
of threaded busypoll using netlink so this can be enabled on a set of
dedicated napis for low latency applications.

Once enabled user can fetch the PID of the kthread doing NAPI polling
and set affinity, priority and scheduler for it depending on the
low-latency requirements.

Extend the netlink interface to allow enabling/disabling threaded
busypolling at individual napi level.

We use this for our AF_XDP based hard low-latency usecase with usecs
level latency requirement. For our usecase we want low jitter and stable
latency at P99.

Following is an analysis and comparison of available (and compatible)
busy poll interfaces for a low latency usecase with stable P99. This can
be suitable for applications that want very low latency at the expense
of cpu usage and efficiency.

Already existing APIs (SO_BUSYPOLL and epoll) allow busy polling a NAPI
backing a socket, but the missing piece is a mechanism to busy poll a
NAPI instance in a dedicated thread while ignoring available events or
packets, regardless of the userspace API. Most existing mechanisms are
designed to work in a pattern where you poll until new packets or events
are received, after which userspace is expected to handle them.

As a result, one has to hack together a solution using a mechanism
intended to receive packets or events, not to simply NAPI poll. NAPI
threaded busy polling, on the other hand, provides this capability
natively, independent of any userspace API. This makes it really easy to
setup and manage.

For analysis we use an AF_XDP based benchmarking tool `xsk_rr`. The
description of the tool and how it tries to simulate the real workload
is following,

- It sends UDP packets between 2 machines.
- The client machine sends packets at a fixed frequency. To maintain the
  frequency of the packet being sent, we use open-loop sampling. That is
  the packets are sent in a separate thread.
- The server replies to the packet inline by reading the pkt from the
  recv ring and replies using the tx ring.
- To simulate the application processing time, we use a configurable
  delay in usecs on the client side after a reply is received from the
  server.

The xsk_rr tool is posted separately as an RFC for tools/testing/selftest.

We use this tool with following napi polling configurations,

- Interrupts only
- SO_BUSYPOLL (inline in the same thread where the client receives the
  packet).
- SO_BUSYPOLL (separate thread and separate core)
- Threaded NAPI busypoll

System is configured using following script in all 4 cases,

```
echo 0 | sudo tee /sys/class/net/eth0/threaded
echo 0 | sudo tee /proc/sys/kernel/timer_migration
echo off | sudo tee  /sys/devices/system/cpu/smt/control

sudo ethtool -L eth0 rx 1 tx 1
sudo ethtool -G eth0 rx 1024

echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus

 # pin IRQs on CPU 2
IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
				print arr[0]}' < /proc/interrupts)"
for irq in "${IRQS}"; \
	do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done

echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us

for i in /sys/devices/virtual/workqueue/*/cpumask; \
			do echo $i; echo 1,2,3,4,5,6 > $i; done

if [[ -z "$1" ]]; then
  echo 400 | sudo tee /proc/sys/net/core/busy_read
  echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
  echo 15000   | sudo tee /sys/class/net/eth0/gro_flush_timeout
fi

sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0

if [[ "$1" == "enable_threaded" ]]; then
  echo 0 | sudo tee /proc/sys/net/core/busy_poll
  echo 0 | sudo tee /proc/sys/net/core/busy_read
  echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
  echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
  echo 2 | sudo tee /sys/class/net/eth0/threaded
  NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
  sudo chrt -f  -p 50 $NAPI_T

  # pin threaded poll thread to CPU 2
  sudo taskset -pc 2 $NAPI_T
fi

if [[ "$1" == "enable_interrupt" ]]; then
  echo 0 | sudo tee /proc/sys/net/core/busy_read
  echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
  echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
fi
```

To enable various configurations, script can be run as following,

- Interrupt Only
  ```
  <script> enable_interrupt
  ```

- SO_BUSYPOLL (no arguments to script)
  ```
  <script>
  ```

- NAPI threaded busypoll
  ```
  <script> enable_threaded
  ```

If using idpf, the script needs to be run again after launching the
workload just to make sure that the configurations are not reverted. As
idpf reverts some configurations on software reset when AF_XDP program
is attached.

Once configured, the workload is run with various configurations using
following commands. Set period (1/frequency) and delay in usecs to
produce results for packet frequency and application processing delay.

 ## Interrupt Only and SO_BUSYPOLL (inline)

- Server
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
	-D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v
```

- Client
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
	-S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
	-P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v
```

 ## SO_BUSYPOLL(done in separate core using recvfrom)

Argument -t spawns a seprate thread and continuously calls recvfrom.

- Server
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
	-D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
	-h -v -t
```

- Client
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
	-S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
	-P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -t
```

 ## NAPI Threaded Busy Poll

Argument -n skips the recvfrom call as there is no recv kick needed.

- Server
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
	-D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
	-h -v -n
```

- Client
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
	-S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
	-P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -n
```

| Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
|---|---|---|---|---|
| 12 Kpkt/s + 0us delay | | | | |
|  | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
|  | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
|  | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
|  | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
| 32 Kpkt/s + 30us delay | | | | |
|  | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
|  | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
|  | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
|  | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
| 125 Kpkt/s + 6us delay | | | | |
|  | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
|  | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
|  | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
|  | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
| 12 Kpkt/s + 78us delay | | | | |
|  | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
|  | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
|  | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
|  | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
| 25 Kpkt/s + 38us delay | | | | |
|  | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
|  | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
|  | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
|  | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |

 ## Observations

- Here without application processing all the approaches give the same
  latency within 1usecs range and NAPI threaded gives minimum latency.
- With application processing the latency increases by 3-4usecs when
  doing inline polling.
- Using a dedicated core to drive napi polling keeps the latency same
  even with application processing. This is observed both in userspace
  and threaded napi (in kernel).
- Using napi threaded polling in kernel gives lower latency by
  1-2usecs as compared to userspace driven polling in separate core.
- Even on a dedicated core, SO_BUSYPOLL adds around 1-2usecs of latency.
  This is because it doesn't continuously busy poll until events are
  ready. Instead, it returns after polling only once, requiring the
  process to re-invoke the syscall for each poll, which requires a new
  enter/leave kernel cycle and the setup/teardown of the busy poll for
  every single poll attempt.
- With application processing userspace will get the packet from recv
  ring and spend some time doing application processing and then do napi
  polling. While application processing is happening a dedicated core
  doing napi polling can pull the packet of the NAPI RX queue and
  populate the AF_XDP recv ring. This means that when the application
  thread is done with application processing it has new packets ready to
  recv and process in recv ring.
- Napi threaded busy polling in the kernel with a dedicated core gives
  the consistent P5-P99 latency.

Following histogram is generated to measure the time spent in recvfrom
while using inline thread with SO_BUSYPOLL. The histogram is generated
using the following bpftrace command. In this experiment there are 32K
packets per second and the application processing delay is 30usecs. This
is to measure whether there is significant time spent pulling packets
from the descriptor queue that it will affect the overall latency if
done inline.

```
bpftrace -e '
        kprobe:xsk_recvmsg {
                @start[tid] = nsecs;
        }
        kretprobe:xsk_recvmsg {
                if (@start[tid]) {
                        $sample = (nsecs - @start[tid]);
                        @xsk_recvfrom_hist = hist($sample);
                        delete(@start[tid]);
                }
        }
        END { clear(@start);}'
```

Here in case of inline busypolling around 35 percent of calls are taking
1-2usecs and around 50 percent are taking 0.5-2usecs.

@xsk_recvfrom_hist:
[128, 256)         24073 |@@@@@@@@@@@@@@@@@@@@@@                              |
[256, 512)         55633 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K)          20974 |@@@@@@@@@@@@@@@@@@@                                 |
[1K, 2K)           34234 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     |
[2K, 4K)            3266 |@@@                                                 |
[4K, 8K)              19 |                                                    |

v8:
 - Fixed selftest build error.
 - Removed support of enabling napi busy poll at device level.
 - Updated documentation to reflect removal of busy poll at device
   level.
 - Added paragraph in the cover letter mentioning that napi threaded
   busy polling allows busy polling a NAPI natively independent of API.
 - Added paragraph in the cover letter explaining why SO_BUSYPOLL is
   still not enough when run in a dedicated core.

v7:
 - Rebased.

v6:
 - Moved threaded in struct netdevice up to fill the cacheline hole.
 - Changed dev_set_threaded to dev_set_threaded_hint and removed the
   second argument that was always set to true by all the drivers.
   Exported only dev_set_threaded_hint and made dev_set_threaded core
   only function. This change is done in a separate commit.
 - Updated documentation comment for threaded in struct netdevice.
 - gro_flush_helper renamed to gro_flush_normal and moved to gro.h. Also
   used it in kernel/bpf/cpumap.c
 - Updated documentation to explicitly state that the NAPI threaded busy
   polling would keep the CPU core busy at 100% usage.
 - Updated documentation and commit messages.

v5:
 - Updated experiment data with 'SO_PREFER_BUSY_POLL' usage as
   suggested.
 - Sent 'Add support to set napi threaded for individual napi'
   separately. This series depends on top of that patch.
   https://lore.kernel.org/netdev/20250423201413.1564527-1-skhawaja@google.com/
 - Added a separate patch to use enum for napi threaded state. Updated
   the nl_netdev python test.
 - Using "write all" semantics when napi settings set at device level.
   This aligns with already existing behaviour for other settings.
 - Fix comments to make them kdoc compatible.
 - Updated Documentation/networking/net_cachelines/net_device.rst
 - Updated the missed gro_flush modification in napi_complete_done

v4:
 - Using AF_XDP based benchmark for experiments.
 - Re-enable dev level napi threaded busypoll after soft reset.

v3:
 - Fixed calls to dev_set_threaded in drivers

v2:
 - Add documentation in napi.rst.
 - Provide experiment data and usecase details.
 - Update busy_poller selftest to include napi threaded poll testcase.
 - Define threaded mode enum in netlink interface.
 - Included NAPI threaded state in napi config to save/restore.

Samiullah Khawaja (2):
  Extend napi threaded polling to allow kthread based busy polling
  selftests: Add napi threaded busy poll test in `busy_poller`

 Documentation/netlink/specs/netdev.yaml       |  5 +-
 Documentation/networking/napi.rst             | 62 ++++++++++++++++-
 include/linux/netdevice.h                     | 11 +++-
 include/uapi/linux/netdev.h                   |  1 +
 net/core/dev.c                                | 66 ++++++++++++++++---
 net/core/dev.h                                |  3 +
 net/core/netdev-genl-gen.c                    |  2 +-
 tools/include/uapi/linux/netdev.h             |  1 +
 tools/testing/selftests/net/busy_poll_test.sh | 25 ++++++-
 tools/testing/selftests/net/busy_poller.c     | 14 +++-
 10 files changed, 173 insertions(+), 17 deletions(-)


base-commit: c3199adbe4ffffc7b6536715e0290d1919a45cd9
-- 
2.51.0.338.gd7d06c2dae-goog


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH net-next v8 1/2] Extend napi threaded polling to allow kthread based busy polling
  2025-08-29  1:16 [PATCH net-next v8 0/2] Add support to do threaded napi busy poll Samiullah Khawaja
@ 2025-08-29  1:16 ` Samiullah Khawaja
  2025-08-29  1:16 ` [PATCH net-next v8 2/2] selftests: Add napi threaded busy poll test in `busy_poller` Samiullah Khawaja
  2025-08-29  3:15 ` [PATCH net-next v8 0/2] Add support to do threaded napi busy poll Martin Karsten
  2 siblings, 0 replies; 17+ messages in thread
From: Samiullah Khawaja @ 2025-08-29  1:16 UTC (permalink / raw)
  To: Jakub Kicinski, David S . Miller , Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, mkarsten
  Cc: Joe Damato, netdev, skhawaja

Add a new state to napi state enum:

- NAPI_STATE_THREADED_BUSY_POLL
  Threaded busy poll is enabled/running for this napi.

Following changes are introduced in the napi scheduling and state logic:

- When threaded busy poll is enabled through netlink it also enables
  NAPI_STATE_THREADED so a kthread is created per napi. It also sets
  NAPI_STATE_THREADED_BUSY_POLL bit on each napi to indicate that it is
  going to busy poll the napi.

- When napi is scheduled with NAPI_STATE_SCHED_THREADED and associated
  kthread is woken up, the kthread owns the context. If
  NAPI_STATE_THREADED_BUSY_POLL and NAPI_STATE_SCHED_THREADED both are
  set then it means that kthread can busy poll.

- To keep busy polling and to avoid scheduling of the interrupts, the
  napi_complete_done returns false when both NAPI_STATE_SCHED_THREADED
  and NAPI_STATE_THREADED_BUSY_POLL flags are set. Also
  napi_complete_done returns early to avoid the
  NAPI_STATE_SCHED_THREADED being unset.

- If at any point NAPI_STATE_THREADED_BUSY_POLL is unset, the
  napi_complete_done will run and unset the NAPI_STATE_SCHED_THREADED
  bit also. This will make the associated kthread go to sleep as per
  existing logic.

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>

---
 Documentation/netlink/specs/netdev.yaml |  5 +-
 Documentation/networking/napi.rst       | 62 ++++++++++++++++++++++-
 include/linux/netdevice.h               | 11 ++++-
 include/uapi/linux/netdev.h             |  1 +
 net/core/dev.c                          | 66 +++++++++++++++++++++----
 net/core/dev.h                          |  3 ++
 net/core/netdev-genl-gen.c              |  2 +-
 tools/include/uapi/linux/netdev.h       |  1 +
 8 files changed, 138 insertions(+), 13 deletions(-)

diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index c035dc0f64fd..ee2cfb121dbb 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -88,7 +88,7 @@ definitions:
   -
     name: napi-threaded
     type: enum
-    entries: [disabled, enabled]
+    entries: [disabled, enabled, busy-poll-enabled]
 
 attribute-sets:
   -
@@ -292,6 +292,9 @@ attribute-sets:
         doc: Whether the NAPI is configured to operate in threaded polling
              mode. If this is set to enabled then the NAPI context operates
              in threaded polling mode.
+             mode. If this is set to enabled then the NAPI context operates
+             in threaded polling mode. If this is set to busy-poll-enabled
+             then the NAPI kthread also does busypolling.
         type: u32
         enum: napi-threaded
   -
diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst
index a15754adb041..b252b6e16262 100644
--- a/Documentation/networking/napi.rst
+++ b/Documentation/networking/napi.rst
@@ -263,7 +263,9 @@ are not well known).
 Busy polling is enabled by either setting ``SO_BUSY_POLL`` on
 selected sockets or using the global ``net.core.busy_poll`` and
 ``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling
-also exists.
+also exists. Threaded polling of NAPI also has a mode to busy poll for
+packets (:ref:`threaded busy polling<threaded_busy_poll>`) using the same
+thread that is used for NAPI processing.
 
 epoll-based busy polling
 ------------------------
@@ -426,6 +428,64 @@ Therefore, setting ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` is
 the recommended usage, because otherwise setting ``irq-suspend-timeout``
 might not have any discernible effect.
 
+.. _threaded_busy_poll:
+
+Threaded NAPI busy polling
+--------------------------
+
+Threaded NAPI allows processing of packets from each NAPI in a kthread in
+kernel. Threaded NAPI busy polling extends this and adds support to do
+continuous busy polling of this NAPI. This can be used to enable busy polling
+independent of userspace application or the API (epoll, io_uring, raw sockets)
+being used in userspace to process the packets.
+
+It can be enabled for each NAPI using netlink interface.
+
+For example, using following script:
+
+.. code-block:: bash
+
+  $ ynl --family netdev --do napi-set \
+            --json='{"id": 66, "threaded": "busy-poll-enabled"}'
+
+
+Enabling it for each NAPI allows finer control to enable busy pollling for
+only a set of NIC queues which will get traffic with low latency requirements.
+
+Depending on application requirement, user might want to set affinity of the
+kthread that is busy polling each NAPI. User might also want to set priority
+and the scheduler of the thread depending on the latency requirements.
+
+For a hard low-latency application, user might want to dedicate the full core
+for the NAPI polling so the NIC queue descriptors are picked up from the queue
+as soon as they appear. Once enabled, the NAPI thread will poll the NIC queues
+continuously without sleeping. This will keep the CPU core busy with 100%
+usage.  For more relaxed low-latency requirement, user might want to share the
+core with other threads by setting thread affinity and priority.
+
+Once threaded busy polling is enabled for a NAPI, PID of the kthread can be
+fetched using netlink interface so the affinity, priority and scheduler
+configuration can be done.
+
+For example, following script can be used to fetch the pid:
+
+.. code-block:: bash
+
+  $ ynl --family netdev --do napi-get --json='{"id": 66}'
+
+This will output something like following, the pid `258` is the PID of the
+kthread that is polling this NAPI.
+
+.. code-block:: bash
+
+  $ {'defer-hard-irqs': 0,
+     'gro-flush-timeout': 0,
+     'id': 66,
+     'ifindex': 2,
+     'irq-suspend-timeout': 0,
+     'pid': 258,
+     'threaded': 'busy-poll-enabled'}
+
 .. _threaded:
 
 Threaded NAPI
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f3a3b761abfb..a88f6596aef7 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -427,6 +427,8 @@ enum {
 	NAPI_STATE_THREADED,		/* The poll is performed inside its own thread*/
 	NAPI_STATE_SCHED_THREADED,	/* Napi is currently scheduled in threaded mode */
 	NAPI_STATE_HAS_NOTIFIER,	/* Napi has an IRQ notifier */
+	NAPI_STATE_THREADED_BUSY_POLL,	/* The threaded napi poller will busy poll */
+	NAPI_STATE_SCHED_THREADED_BUSY_POLL,  /* The threaded napi poller is busy polling */
 };
 
 enum {
@@ -441,8 +443,14 @@ enum {
 	NAPIF_STATE_THREADED		= BIT(NAPI_STATE_THREADED),
 	NAPIF_STATE_SCHED_THREADED	= BIT(NAPI_STATE_SCHED_THREADED),
 	NAPIF_STATE_HAS_NOTIFIER	= BIT(NAPI_STATE_HAS_NOTIFIER),
+	NAPIF_STATE_THREADED_BUSY_POLL	= BIT(NAPI_STATE_THREADED_BUSY_POLL),
+	NAPIF_STATE_SCHED_THREADED_BUSY_POLL  =
+			BIT(NAPI_STATE_SCHED_THREADED_BUSY_POLL),
 };
 
+#define NAPIF_STATE_THREADED_BUSY_POLL_MASK \
+	(NAPIF_STATE_THREADED | NAPIF_STATE_THREADED_BUSY_POLL)
+
 enum gro_result {
 	GRO_MERGED,
 	GRO_MERGED_FREE,
@@ -1873,7 +1881,8 @@ enum netdev_reg_state {
  * 	@addr_len:		Hardware address length
  *	@upper_level:		Maximum depth level of upper devices.
  *	@lower_level:		Maximum depth level of lower devices.
- *	@threaded:		napi threaded state.
+ *	@threaded:		napi threaded mode is disabled, enabled or
+ *				enabled with busy polling.
  *	@neigh_priv_len:	Used in neigh_alloc()
  * 	@dev_id:		Used to differentiate devices that share
  * 				the same link layer address
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 48eb49aa03d4..8163afb15377 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -80,6 +80,7 @@ enum netdev_qstats_scope {
 enum netdev_napi_threaded {
 	NETDEV_NAPI_THREADED_DISABLED,
 	NETDEV_NAPI_THREADED_ENABLED,
+	NETDEV_NAPI_THREADED_BUSY_POLL_ENABLED,
 };
 
 enum {
diff --git a/net/core/dev.c b/net/core/dev.c
index 5a3c0f40a93f..07ef77fed447 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -78,6 +78,7 @@
 #include <linux/slab.h>
 #include <linux/sched.h>
 #include <linux/sched/isolation.h>
+#include <linux/sched/types.h>
 #include <linux/sched/mm.h>
 #include <linux/smpboot.h>
 #include <linux/mutex.h>
@@ -6558,7 +6559,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 	 *    the guarantee we will be called later.
 	 */
 	if (unlikely(n->state & (NAPIF_STATE_NPSVC |
-				 NAPIF_STATE_IN_BUSY_POLL)))
+				 NAPIF_STATE_IN_BUSY_POLL |
+				 NAPIF_STATE_SCHED_THREADED_BUSY_POLL)))
 		return false;
 
 	if (work_done) {
@@ -6963,6 +6965,19 @@ static void napi_stop_kthread(struct napi_struct *napi)
 	napi->thread = NULL;
 }
 
+static void napi_set_threaded_state(struct napi_struct *napi,
+				    enum netdev_napi_threaded threaded)
+{
+	unsigned long val;
+
+	val = 0;
+	if (threaded == NETDEV_NAPI_THREADED_BUSY_POLL_ENABLED)
+		val |= NAPIF_STATE_THREADED_BUSY_POLL;
+	if (threaded)
+		val |= NAPIF_STATE_THREADED;
+	set_mask_bits(&napi->state, NAPIF_STATE_THREADED_BUSY_POLL_MASK, val);
+}
+
 int napi_set_threaded(struct napi_struct *napi,
 		      enum netdev_napi_threaded threaded)
 {
@@ -6989,7 +7004,7 @@ int napi_set_threaded(struct napi_struct *napi,
 	} else {
 		/* Make sure kthread is created before THREADED bit is set. */
 		smp_mb__before_atomic();
-		assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
+		napi_set_threaded_state(napi, threaded);
 	}
 
 	return 0;
@@ -7381,7 +7396,9 @@ void napi_disable_locked(struct napi_struct *n)
 		}
 
 		new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;
-		new &= ~(NAPIF_STATE_THREADED | NAPIF_STATE_PREFER_BUSY_POLL);
+		new &= ~(NAPIF_STATE_THREADED
+			 | NAPIF_STATE_THREADED_BUSY_POLL
+			 | NAPIF_STATE_PREFER_BUSY_POLL);
 	} while (!try_cmpxchg(&n->state, &val, new));
 
 	hrtimer_cancel(&n->timer);
@@ -7425,7 +7442,7 @@ void napi_enable_locked(struct napi_struct *n)
 
 		new = val & ~(NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC);
 		if (n->dev->threaded && n->thread)
-			new |= NAPIF_STATE_THREADED;
+			napi_set_threaded_state(n, n->dev->threaded);
 	} while (!try_cmpxchg(&n->state, &val, new));
 }
 EXPORT_SYMBOL(napi_enable_locked);
@@ -7593,7 +7610,7 @@ static int napi_thread_wait(struct napi_struct *napi)
 	return -1;
 }
 
-static void napi_threaded_poll_loop(struct napi_struct *napi)
+static void napi_threaded_poll_loop(struct napi_struct *napi, bool busy_poll)
 {
 	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	struct softnet_data *sd;
@@ -7622,22 +7639,53 @@ static void napi_threaded_poll_loop(struct napi_struct *napi)
 		}
 		skb_defer_free_flush(sd);
 		bpf_net_ctx_clear(bpf_net_ctx);
+
+		/* Flush too old packets. If HZ < 1000, flush all packets */
+		if (busy_poll)
+			gro_flush_normal(&napi->gro, HZ >= 1000);
 		local_bh_enable();
 
-		if (!repoll)
+		/* If busy polling then do not break here because we need to
+		 * call cond_resched and rcu_softirq_qs_periodic to prevent
+		 * watchdog warnings.
+		 */
+		if (!repoll && !busy_poll)
 			break;
 
 		rcu_softirq_qs_periodic(last_qs);
 		cond_resched();
+
+		if (!repoll)
+			break;
 	}
 }
 
 static int napi_threaded_poll(void *data)
 {
 	struct napi_struct *napi = data;
+	bool busy_poll_sched;
+	unsigned long val;
+	bool busy_poll;
+
+	while (!napi_thread_wait(napi)) {
+		/* Once woken up, this means that we are scheduled as threaded
+		 * napi and this thread owns the napi context, if busy poll
+		 * state is set then busy poll this napi.
+		 */
+		val = READ_ONCE(napi->state);
+		busy_poll = val & NAPIF_STATE_THREADED_BUSY_POLL;
+		busy_poll_sched = val & NAPIF_STATE_SCHED_THREADED_BUSY_POLL;
 
-	while (!napi_thread_wait(napi))
-		napi_threaded_poll_loop(napi);
+		/* Do not busy poll if napi is disabled. */
+		if (unlikely(val & NAPIF_STATE_DISABLE))
+			busy_poll = false;
+
+		if (busy_poll != busy_poll_sched)
+			assign_bit(NAPI_STATE_SCHED_THREADED_BUSY_POLL,
+				   &napi->state, busy_poll);
+
+		napi_threaded_poll_loop(napi, busy_poll);
+	}
 
 	return 0;
 }
@@ -12829,7 +12877,7 @@ static void run_backlog_napi(unsigned int cpu)
 {
 	struct softnet_data *sd = per_cpu_ptr(&softnet_data, cpu);
 
-	napi_threaded_poll_loop(&sd->backlog);
+	napi_threaded_poll_loop(&sd->backlog, false);
 }
 
 static void backlog_napi_setup(unsigned int cpu)
diff --git a/net/core/dev.h b/net/core/dev.h
index d6b08d435479..d6cfe7105903 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -317,6 +317,9 @@ static inline void napi_set_irq_suspend_timeout(struct napi_struct *n,
 
 static inline enum netdev_napi_threaded napi_get_threaded(struct napi_struct *n)
 {
+	if (test_bit(NAPI_STATE_THREADED_BUSY_POLL, &n->state))
+		return NETDEV_NAPI_THREADED_BUSY_POLL_ENABLED;
+
 	if (test_bit(NAPI_STATE_THREADED, &n->state))
 		return NETDEV_NAPI_THREADED_ENABLED;
 
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index e9a2a6f26cb7..ff20435c45d2 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -97,7 +97,7 @@ static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_THREADED
 	[NETDEV_A_NAPI_DEFER_HARD_IRQS] = NLA_POLICY_FULL_RANGE(NLA_U32, &netdev_a_napi_defer_hard_irqs_range),
 	[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT] = { .type = NLA_UINT, },
 	[NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT] = { .type = NLA_UINT, },
-	[NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_U32, 1),
+	[NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_U32, 2),
 };
 
 /* NETDEV_CMD_BIND_TX - do */
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 48eb49aa03d4..8163afb15377 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -80,6 +80,7 @@ enum netdev_qstats_scope {
 enum netdev_napi_threaded {
 	NETDEV_NAPI_THREADED_DISABLED,
 	NETDEV_NAPI_THREADED_ENABLED,
+	NETDEV_NAPI_THREADED_BUSY_POLL_ENABLED,
 };
 
 enum {
-- 
2.51.0.338.gd7d06c2dae-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH net-next v8 2/2] selftests: Add napi threaded busy poll test in `busy_poller`
  2025-08-29  1:16 [PATCH net-next v8 0/2] Add support to do threaded napi busy poll Samiullah Khawaja
  2025-08-29  1:16 ` [PATCH net-next v8 1/2] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
@ 2025-08-29  1:16 ` Samiullah Khawaja
  2025-08-29  3:15 ` [PATCH net-next v8 0/2] Add support to do threaded napi busy poll Martin Karsten
  2 siblings, 0 replies; 17+ messages in thread
From: Samiullah Khawaja @ 2025-08-29  1:16 UTC (permalink / raw)
  To: Jakub Kicinski, David S . Miller , Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, mkarsten
  Cc: Joe Damato, netdev, skhawaja

Add testcase to run busy poll test with threaded napi busy poll enabled.

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>

---
 tools/testing/selftests/net/busy_poll_test.sh | 25 ++++++++++++++++++-
 tools/testing/selftests/net/busy_poller.c     | 14 ++++++++---
 2 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/net/busy_poll_test.sh b/tools/testing/selftests/net/busy_poll_test.sh
index 7d2d40812074..ab230df1057e 100755
--- a/tools/testing/selftests/net/busy_poll_test.sh
+++ b/tools/testing/selftests/net/busy_poll_test.sh
@@ -27,6 +27,9 @@ NAPI_DEFER_HARD_IRQS=100
 GRO_FLUSH_TIMEOUT=50000
 SUSPEND_TIMEOUT=20000000
 
+# NAPI threaded busy poll config
+NAPI_THREADED_POLL=2
+
 setup_ns()
 {
 	set -e
@@ -62,6 +65,9 @@ cleanup_ns()
 test_busypoll()
 {
 	suspend_value=${1:-0}
+	napi_threaded_value=${2:-0}
+	prefer_busy_poll_value=${3:-$PREFER_BUSY_POLL}
+
 	tmp_file=$(mktemp)
 	out_file=$(mktemp)
 
@@ -73,10 +79,11 @@ test_busypoll()
 					     -b${SERVER_IP}        \
 					     -m${MAX_EVENTS}       \
 					     -u${BUSY_POLL_USECS}  \
-					     -P${PREFER_BUSY_POLL} \
+					     -P${prefer_busy_poll_value} \
 					     -g${BUSY_POLL_BUDGET} \
 					     -i${NSIM_SV_IFIDX}    \
 					     -s${suspend_value}    \
+					     -t${napi_threaded_value} \
 					     -o${out_file}&
 
 	wait_local_port_listen nssv ${SERVER_PORT} tcp
@@ -109,6 +116,15 @@ test_busypoll_with_suspend()
 	return $?
 }
 
+test_busypoll_with_napi_threaded()
+{
+	# Only enable napi threaded poll. Set suspend timeout and prefer busy
+	# poll to 0.
+	test_busypoll 0 ${NAPI_THREADED_POLL} 0
+
+	return $?
+}
+
 ###
 ### Code start
 ###
@@ -154,6 +170,13 @@ if [ $? -ne 0 ]; then
 	exit 1
 fi
 
+test_busypoll_with_napi_threaded
+if [ $? -ne 0 ]; then
+	echo "test_busypoll_with_napi_threaded failed"
+	cleanup_ns
+	exit 1
+fi
+
 echo "$NSIM_SV_FD:$NSIM_SV_IFIDX" > $NSIM_DEV_SYS_UNLINK
 
 echo $NSIM_CL_ID > $NSIM_DEV_SYS_DEL
diff --git a/tools/testing/selftests/net/busy_poller.c b/tools/testing/selftests/net/busy_poller.c
index 04c7ff577bb8..46401d5e01be 100644
--- a/tools/testing/selftests/net/busy_poller.c
+++ b/tools/testing/selftests/net/busy_poller.c
@@ -65,15 +65,16 @@ static uint32_t cfg_busy_poll_usecs;
 static uint16_t cfg_busy_poll_budget;
 static uint8_t cfg_prefer_busy_poll;
 
-/* IRQ params */
+/* NAPI params */
 static uint32_t cfg_defer_hard_irqs;
 static uint64_t cfg_gro_flush_timeout;
 static uint64_t cfg_irq_suspend_timeout;
+static enum netdev_napi_threaded cfg_napi_threaded_poll = NETDEV_NAPI_THREADED_DISABLED;
 
 static void usage(const char *filepath)
 {
 	error(1, 0,
-	      "Usage: %s -p<port> -b<addr> -m<max_events> -u<busy_poll_usecs> -P<prefer_busy_poll> -g<busy_poll_budget> -o<outfile> -d<defer_hard_irqs> -r<gro_flush_timeout> -s<irq_suspend_timeout> -i<ifindex>",
+	      "Usage: %s -p<port> -b<addr> -m<max_events> -u<busy_poll_usecs> -P<prefer_busy_poll> -g<busy_poll_budget> -o<outfile> -d<defer_hard_irqs> -r<gro_flush_timeout> -s<irq_suspend_timeout> -t<napi_threaded_poll> -i<ifindex>",
 	      filepath);
 }
 
@@ -86,7 +87,7 @@ static void parse_opts(int argc, char **argv)
 	if (argc <= 1)
 		usage(argv[0]);
 
-	while ((c = getopt(argc, argv, "p:m:b:u:P:g:o:d:r:s:i:")) != -1) {
+	while ((c = getopt(argc, argv, "p:m:b:u:P:g:o:d:r:s:i:t:")) != -1) {
 		/* most options take integer values, except o and b, so reduce
 		 * code duplication a bit for the common case by calling
 		 * strtoull here and leave bounds checking and casting per
@@ -168,6 +169,12 @@ static void parse_opts(int argc, char **argv)
 
 			cfg_ifindex = (int)tmp;
 			break;
+		case 't':
+			if (tmp == ULLONG_MAX || tmp > 2)
+				error(1, ERANGE, "napi threaded poll value must be 0-2");
+
+			cfg_napi_threaded_poll = (enum netdev_napi_threaded)tmp;
+			break;
 		}
 	}
 
@@ -246,6 +253,7 @@ static void setup_queue(void)
 						  cfg_gro_flush_timeout);
 	netdev_napi_set_req_set_irq_suspend_timeout(set_req,
 						    cfg_irq_suspend_timeout);
+	netdev_napi_set_req_set_threaded(set_req, cfg_napi_threaded_poll);
 
 	if (netdev_napi_set(ys, set_req))
 		error(1, 0, "can't set NAPI params: %s\n", yerr.msg);
-- 
2.51.0.338.gd7d06c2dae-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v8 0/2] Add support to do threaded napi busy poll
  2025-08-29  1:16 [PATCH net-next v8 0/2] Add support to do threaded napi busy poll Samiullah Khawaja
  2025-08-29  1:16 ` [PATCH net-next v8 1/2] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
  2025-08-29  1:16 ` [PATCH net-next v8 2/2] selftests: Add napi threaded busy poll test in `busy_poller` Samiullah Khawaja
@ 2025-08-29  3:15 ` Martin Karsten
  2025-08-29 17:50   ` Samiullah Khawaja
  2 siblings, 1 reply; 17+ messages in thread
From: Martin Karsten @ 2025-08-29  3:15 UTC (permalink / raw)
  To: Samiullah Khawaja, Jakub Kicinski, David S . Miller, Eric Dumazet,
	Paolo Abeni, almasrymina, willemb
  Cc: Joe Damato, netdev

On 2025-08-28 21:16, Samiullah Khawaja wrote:
> Extend the already existing support of threaded napi poll to do continuous
> busy polling.
> 
> This is used for doing continuous polling of napi to fetch descriptors
> from backing RX/TX queues for low latency applications. Allow enabling
> of threaded busypoll using netlink so this can be enabled on a set of
> dedicated napis for low latency applications.
> 
> Once enabled user can fetch the PID of the kthread doing NAPI polling
> and set affinity, priority and scheduler for it depending on the
> low-latency requirements.
> 
> Extend the netlink interface to allow enabling/disabling threaded
> busypolling at individual napi level.
> 
> We use this for our AF_XDP based hard low-latency usecase with usecs
> level latency requirement. For our usecase we want low jitter and stable
> latency at P99.
> 
> Following is an analysis and comparison of available (and compatible)
> busy poll interfaces for a low latency usecase with stable P99. This can
> be suitable for applications that want very low latency at the expense
> of cpu usage and efficiency.
> 
> Already existing APIs (SO_BUSYPOLL and epoll) allow busy polling a NAPI
> backing a socket, but the missing piece is a mechanism to busy poll a
> NAPI instance in a dedicated thread while ignoring available events or
> packets, regardless of the userspace API. Most existing mechanisms are
> designed to work in a pattern where you poll until new packets or events
> are received, after which userspace is expected to handle them.
> 
> As a result, one has to hack together a solution using a mechanism
> intended to receive packets or events, not to simply NAPI poll. NAPI
> threaded busy polling, on the other hand, provides this capability
> natively, independent of any userspace API. This makes it really easy to
> setup and manage.
> 
> For analysis we use an AF_XDP based benchmarking tool `xsk_rr`. The
> description of the tool and how it tries to simulate the real workload
> is following,
> 
> - It sends UDP packets between 2 machines.
> - The client machine sends packets at a fixed frequency. To maintain the
>    frequency of the packet being sent, we use open-loop sampling. That is
>    the packets are sent in a separate thread.
> - The server replies to the packet inline by reading the pkt from the
>    recv ring and replies using the tx ring.
> - To simulate the application processing time, we use a configurable
>    delay in usecs on the client side after a reply is received from the
>    server.
> 
> The xsk_rr tool is posted separately as an RFC for tools/testing/selftest.
> 
> We use this tool with following napi polling configurations,
> 
> - Interrupts only
> - SO_BUSYPOLL (inline in the same thread where the client receives the
>    packet).
> - SO_BUSYPOLL (separate thread and separate core)
> - Threaded NAPI busypoll
> 
> System is configured using following script in all 4 cases,
> 
> ```
> echo 0 | sudo tee /sys/class/net/eth0/threaded
> echo 0 | sudo tee /proc/sys/kernel/timer_migration
> echo off | sudo tee  /sys/devices/system/cpu/smt/control
> 
> sudo ethtool -L eth0 rx 1 tx 1
> sudo ethtool -G eth0 rx 1024
> 
> echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
> echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
> 
>   # pin IRQs on CPU 2
> IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
> 				print arr[0]}' < /proc/interrupts)"
> for irq in "${IRQS}"; \
> 	do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done
> 
> echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
> 
> for i in /sys/devices/virtual/workqueue/*/cpumask; \
> 			do echo $i; echo 1,2,3,4,5,6 > $i; done
> 
> if [[ -z "$1" ]]; then
>    echo 400 | sudo tee /proc/sys/net/core/busy_read
>    echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>    echo 15000   | sudo tee /sys/class/net/eth0/gro_flush_timeout
> fi
> 
> sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0
> 
> if [[ "$1" == "enable_threaded" ]]; then
>    echo 0 | sudo tee /proc/sys/net/core/busy_poll
>    echo 0 | sudo tee /proc/sys/net/core/busy_read
>    echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>    echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
>    echo 2 | sudo tee /sys/class/net/eth0/threaded
>    NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
>    sudo chrt -f  -p 50 $NAPI_T
> 
>    # pin threaded poll thread to CPU 2
>    sudo taskset -pc 2 $NAPI_T
> fi
> 
> if [[ "$1" == "enable_interrupt" ]]; then
>    echo 0 | sudo tee /proc/sys/net/core/busy_read
>    echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>    echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> fi
> ```

The experiment script above does not work, because the sysfs parameter 
does not exist anymore in this version.

> To enable various configurations, script can be run as following,
> 
> - Interrupt Only
>    ```
>    <script> enable_interrupt
>    ```
> 
> - SO_BUSYPOLL (no arguments to script)
>    ```
>    <script>
>    ```
> 
> - NAPI threaded busypoll
>    ```
>    <script> enable_threaded
>    ```
> 
> If using idpf, the script needs to be run again after launching the
> workload just to make sure that the configurations are not reverted. As
> idpf reverts some configurations on software reset when AF_XDP program
> is attached.
> 
> Once configured, the workload is run with various configurations using
> following commands. Set period (1/frequency) and delay in usecs to
> produce results for packet frequency and application processing delay.
> 
>   ## Interrupt Only and SO_BUSYPOLL (inline)
> 
> - Server
> ```
> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> 	-D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v
> ```
> 
> - Client
> ```
> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> 	-S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> 	-P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v
> ```
> 
>   ## SO_BUSYPOLL(done in separate core using recvfrom)
> 
> Argument -t spawns a seprate thread and continuously calls recvfrom.
> 
> - Server
> ```
> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> 	-D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> 	-h -v -t
> ```
> 
> - Client
> ```
> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> 	-S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> 	-P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -t
> ```
> 
>   ## NAPI Threaded Busy Poll
> 
> Argument -n skips the recvfrom call as there is no recv kick needed.
> 
> - Server
> ```
> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> 	-D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> 	-h -v -n
> ```
> 
> - Client
> ```
> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> 	-S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> 	-P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -n
> ```

I believe there's a bug when disabling busy-polled napi threading after 
an experiment. My system hangs and needs a hard reset.

> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
> |---|---|---|---|---|
> | 12 Kpkt/s + 0us delay | | | | |
> |  | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
> |  | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
> |  | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
> |  | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
> | 32 Kpkt/s + 30us delay | | | | |
> |  | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
> |  | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
> |  | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
> |  | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
> | 125 Kpkt/s + 6us delay | | | | |
> |  | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
> |  | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
> |  | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
> |  | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
> | 12 Kpkt/s + 78us delay | | | | |
> |  | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
> |  | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
> |  | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
> |  | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
> | 25 Kpkt/s + 38us delay | | | | |
> |  | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
> |  | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
> |  | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
> |  | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |

On my system, routing the irq to same core where xsk_rr runs results in 
lower latency than routing the irq to a different core. To me that makes 
sense in a low-rate latency-sensitive scenario where interrupts are not 
causing much trouble, but the resulting locality might be beneficial. I 
think you should test this as well.

The experiments reported above (except for the first one) are 
cherry-picking parameter combinations that result in a near-100% load 
and ignore anything else. Near-100% load is a highly unlikely scenario 
for a latency-sensitive workload.

When combining the above two paragraphs, I believe other interesting 
setups are missing from the experiments, such as comparing to two pairs 
of xsk_rr under high load (as mentioned in my previous emails).

Thanks,
Martin


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v8 0/2] Add support to do threaded napi busy poll
  2025-08-29  3:15 ` [PATCH net-next v8 0/2] Add support to do threaded napi busy poll Martin Karsten
@ 2025-08-29 17:50   ` Samiullah Khawaja
  2025-08-29 18:08     ` Martin Karsten
  0 siblings, 1 reply; 17+ messages in thread
From: Samiullah Khawaja @ 2025-08-29 17:50 UTC (permalink / raw)
  To: Martin Karsten
  Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, Joe Damato, netdev

On Thu, Aug 28, 2025 at 8:15 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>
> On 2025-08-28 21:16, Samiullah Khawaja wrote:
> > Extend the already existing support of threaded napi poll to do continuous
> > busy polling.
> >
> > This is used for doing continuous polling of napi to fetch descriptors
> > from backing RX/TX queues for low latency applications. Allow enabling
> > of threaded busypoll using netlink so this can be enabled on a set of
> > dedicated napis for low latency applications.
> >
> > Once enabled user can fetch the PID of the kthread doing NAPI polling
> > and set affinity, priority and scheduler for it depending on the
> > low-latency requirements.
> >
> > Extend the netlink interface to allow enabling/disabling threaded
> > busypolling at individual napi level.
> >
> > We use this for our AF_XDP based hard low-latency usecase with usecs
> > level latency requirement. For our usecase we want low jitter and stable
> > latency at P99.
> >
> > Following is an analysis and comparison of available (and compatible)
> > busy poll interfaces for a low latency usecase with stable P99. This can
> > be suitable for applications that want very low latency at the expense
> > of cpu usage and efficiency.
> >
> > Already existing APIs (SO_BUSYPOLL and epoll) allow busy polling a NAPI
> > backing a socket, but the missing piece is a mechanism to busy poll a
> > NAPI instance in a dedicated thread while ignoring available events or
> > packets, regardless of the userspace API. Most existing mechanisms are
> > designed to work in a pattern where you poll until new packets or events
> > are received, after which userspace is expected to handle them.
> >
> > As a result, one has to hack together a solution using a mechanism
> > intended to receive packets or events, not to simply NAPI poll. NAPI
> > threaded busy polling, on the other hand, provides this capability
> > natively, independent of any userspace API. This makes it really easy to
> > setup and manage.
> >
> > For analysis we use an AF_XDP based benchmarking tool `xsk_rr`. The
> > description of the tool and how it tries to simulate the real workload
> > is following,
> >
> > - It sends UDP packets between 2 machines.
> > - The client machine sends packets at a fixed frequency. To maintain the
> >    frequency of the packet being sent, we use open-loop sampling. That is
> >    the packets are sent in a separate thread.
> > - The server replies to the packet inline by reading the pkt from the
> >    recv ring and replies using the tx ring.
> > - To simulate the application processing time, we use a configurable
> >    delay in usecs on the client side after a reply is received from the
> >    server.
> >
> > The xsk_rr tool is posted separately as an RFC for tools/testing/selftest.
> >
> > We use this tool with following napi polling configurations,
> >
> > - Interrupts only
> > - SO_BUSYPOLL (inline in the same thread where the client receives the
> >    packet).
> > - SO_BUSYPOLL (separate thread and separate core)
> > - Threaded NAPI busypoll
> >
> > System is configured using following script in all 4 cases,
> >
> > ```
> > echo 0 | sudo tee /sys/class/net/eth0/threaded
> > echo 0 | sudo tee /proc/sys/kernel/timer_migration
> > echo off | sudo tee  /sys/devices/system/cpu/smt/control
> >
> > sudo ethtool -L eth0 rx 1 tx 1
> > sudo ethtool -G eth0 rx 1024
> >
> > echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
> > echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
> >
> >   # pin IRQs on CPU 2
> > IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
> >                               print arr[0]}' < /proc/interrupts)"
> > for irq in "${IRQS}"; \
> >       do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done
> >
> > echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
> >
> > for i in /sys/devices/virtual/workqueue/*/cpumask; \
> >                       do echo $i; echo 1,2,3,4,5,6 > $i; done
> >
> > if [[ -z "$1" ]]; then
> >    echo 400 | sudo tee /proc/sys/net/core/busy_read
> >    echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> >    echo 15000   | sudo tee /sys/class/net/eth0/gro_flush_timeout
> > fi
> >
> > sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0
> >
> > if [[ "$1" == "enable_threaded" ]]; then
> >    echo 0 | sudo tee /proc/sys/net/core/busy_poll
> >    echo 0 | sudo tee /proc/sys/net/core/busy_read
> >    echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> >    echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> >    echo 2 | sudo tee /sys/class/net/eth0/threaded
> >    NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
> >    sudo chrt -f  -p 50 $NAPI_T
> >
> >    # pin threaded poll thread to CPU 2
> >    sudo taskset -pc 2 $NAPI_T
> > fi
> >
> > if [[ "$1" == "enable_interrupt" ]]; then
> >    echo 0 | sudo tee /proc/sys/net/core/busy_read
> >    echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> >    echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> > fi
> > ```
>
> The experiment script above does not work, because the sysfs parameter
> does not exist anymore in this version.
>
> > To enable various configurations, script can be run as following,
> >
> > - Interrupt Only
> >    ```
> >    <script> enable_interrupt
> >    ```
> >
> > - SO_BUSYPOLL (no arguments to script)
> >    ```
> >    <script>
> >    ```
> >
> > - NAPI threaded busypoll
> >    ```
> >    <script> enable_threaded
> >    ```
> >
> > If using idpf, the script needs to be run again after launching the
> > workload just to make sure that the configurations are not reverted. As
> > idpf reverts some configurations on software reset when AF_XDP program
> > is attached.
> >
> > Once configured, the workload is run with various configurations using
> > following commands. Set period (1/frequency) and delay in usecs to
> > produce results for packet frequency and application processing delay.
> >
> >   ## Interrupt Only and SO_BUSYPOLL (inline)
> >
> > - Server
> > ```
> > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >       -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v
> > ```
> >
> > - Client
> > ```
> > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >       -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> >       -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v
> > ```
> >
> >   ## SO_BUSYPOLL(done in separate core using recvfrom)
> >
> > Argument -t spawns a seprate thread and continuously calls recvfrom.
> >
> > - Server
> > ```
> > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >       -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> >       -h -v -t
> > ```
> >
> > - Client
> > ```
> > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >       -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> >       -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -t
> > ```
> >
> >   ## NAPI Threaded Busy Poll
> >
> > Argument -n skips the recvfrom call as there is no recv kick needed.
> >
> > - Server
> > ```
> > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >       -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> >       -h -v -n
> > ```
> >
> > - Client
> > ```
> > sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >       -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> >       -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -n
> > ```
>
> I believe there's a bug when disabling busy-polled napi threading after
> an experiment. My system hangs and needs a hard reset.
>
> > | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
> > |---|---|---|---|---|
> > | 12 Kpkt/s + 0us delay | | | | |
> > |  | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
> > |  | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
> > |  | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
> > |  | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
> > | 32 Kpkt/s + 30us delay | | | | |
> > |  | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
> > |  | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
> > |  | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
> > |  | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
> > | 125 Kpkt/s + 6us delay | | | | |
> > |  | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
> > |  | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
> > |  | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
> > |  | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
> > | 12 Kpkt/s + 78us delay | | | | |
> > |  | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
> > |  | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
> > |  | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
> > |  | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
> > | 25 Kpkt/s + 38us delay | | | | |
> > |  | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
> > |  | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
> > |  | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
> > |  | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
>
> On my system, routing the irq to same core where xsk_rr runs results in
> lower latency than routing the irq to a different core. To me that makes
> sense in a low-rate latency-sensitive scenario where interrupts are not
> causing much trouble, but the resulting locality might be beneficial. I
> think you should test this as well.
>
> The experiments reported above (except for the first one) are
> cherry-picking parameter combinations that result in a near-100% load
> and ignore anything else. Near-100% load is a highly unlikely scenario
> for a latency-sensitive workload.
>
> When combining the above two paragraphs, I believe other interesting
> setups are missing from the experiments, such as comparing to two pairs
> of xsk_rr under high load (as mentioned in my previous emails).
This is to support an existing real workload. We cannot easily modify
its threading model. The two xsk_rr model would be a different
workload.
>
> Thanks,
> Martin
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v8 0/2] Add support to do threaded napi busy poll
  2025-08-29 17:50   ` Samiullah Khawaja
@ 2025-08-29 18:08     ` Martin Karsten
  2025-08-29 18:42       ` Willem de Bruijn
                         ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Martin Karsten @ 2025-08-29 18:08 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, Joe Damato, netdev

On 2025-08-29 13:50, Samiullah Khawaja wrote:
> On Thu, Aug 28, 2025 at 8:15 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>>
>> On 2025-08-28 21:16, Samiullah Khawaja wrote:
>>> Extend the already existing support of threaded napi poll to do continuous
>>> busy polling.
>>>
>>> This is used for doing continuous polling of napi to fetch descriptors
>>> from backing RX/TX queues for low latency applications. Allow enabling
>>> of threaded busypoll using netlink so this can be enabled on a set of
>>> dedicated napis for low latency applications.
>>>
>>> Once enabled user can fetch the PID of the kthread doing NAPI polling
>>> and set affinity, priority and scheduler for it depending on the
>>> low-latency requirements.
>>>
>>> Extend the netlink interface to allow enabling/disabling threaded
>>> busypolling at individual napi level.
>>>
>>> We use this for our AF_XDP based hard low-latency usecase with usecs
>>> level latency requirement. For our usecase we want low jitter and stable
>>> latency at P99.
>>>
>>> Following is an analysis and comparison of available (and compatible)
>>> busy poll interfaces for a low latency usecase with stable P99. This can
>>> be suitable for applications that want very low latency at the expense
>>> of cpu usage and efficiency.
>>>
>>> Already existing APIs (SO_BUSYPOLL and epoll) allow busy polling a NAPI
>>> backing a socket, but the missing piece is a mechanism to busy poll a
>>> NAPI instance in a dedicated thread while ignoring available events or
>>> packets, regardless of the userspace API. Most existing mechanisms are
>>> designed to work in a pattern where you poll until new packets or events
>>> are received, after which userspace is expected to handle them.
>>>
>>> As a result, one has to hack together a solution using a mechanism
>>> intended to receive packets or events, not to simply NAPI poll. NAPI
>>> threaded busy polling, on the other hand, provides this capability
>>> natively, independent of any userspace API. This makes it really easy to
>>> setup and manage.
>>>
>>> For analysis we use an AF_XDP based benchmarking tool `xsk_rr`. The
>>> description of the tool and how it tries to simulate the real workload
>>> is following,
>>>
>>> - It sends UDP packets between 2 machines.
>>> - The client machine sends packets at a fixed frequency. To maintain the
>>>     frequency of the packet being sent, we use open-loop sampling. That is
>>>     the packets are sent in a separate thread.
>>> - The server replies to the packet inline by reading the pkt from the
>>>     recv ring and replies using the tx ring.
>>> - To simulate the application processing time, we use a configurable
>>>     delay in usecs on the client side after a reply is received from the
>>>     server.
>>>
>>> The xsk_rr tool is posted separately as an RFC for tools/testing/selftest.
>>>
>>> We use this tool with following napi polling configurations,
>>>
>>> - Interrupts only
>>> - SO_BUSYPOLL (inline in the same thread where the client receives the
>>>     packet).
>>> - SO_BUSYPOLL (separate thread and separate core)
>>> - Threaded NAPI busypoll
>>>
>>> System is configured using following script in all 4 cases,
>>>
>>> ```
>>> echo 0 | sudo tee /sys/class/net/eth0/threaded
>>> echo 0 | sudo tee /proc/sys/kernel/timer_migration
>>> echo off | sudo tee  /sys/devices/system/cpu/smt/control
>>>
>>> sudo ethtool -L eth0 rx 1 tx 1
>>> sudo ethtool -G eth0 rx 1024
>>>
>>> echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
>>> echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
>>>
>>>    # pin IRQs on CPU 2
>>> IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
>>>                                print arr[0]}' < /proc/interrupts)"
>>> for irq in "${IRQS}"; \
>>>        do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done
>>>
>>> echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
>>>
>>> for i in /sys/devices/virtual/workqueue/*/cpumask; \
>>>                        do echo $i; echo 1,2,3,4,5,6 > $i; done
>>>
>>> if [[ -z "$1" ]]; then
>>>     echo 400 | sudo tee /proc/sys/net/core/busy_read
>>>     echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>     echo 15000   | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>> fi
>>>
>>> sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0
>>>
>>> if [[ "$1" == "enable_threaded" ]]; then
>>>     echo 0 | sudo tee /proc/sys/net/core/busy_poll
>>>     echo 0 | sudo tee /proc/sys/net/core/busy_read
>>>     echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>     echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>>     echo 2 | sudo tee /sys/class/net/eth0/threaded
>>>     NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
>>>     sudo chrt -f  -p 50 $NAPI_T
>>>
>>>     # pin threaded poll thread to CPU 2
>>>     sudo taskset -pc 2 $NAPI_T
>>> fi
>>>
>>> if [[ "$1" == "enable_interrupt" ]]; then
>>>     echo 0 | sudo tee /proc/sys/net/core/busy_read
>>>     echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>     echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>> fi
>>> ```
>>
>> The experiment script above does not work, because the sysfs parameter
>> does not exist anymore in this version.
>>
>>> To enable various configurations, script can be run as following,
>>>
>>> - Interrupt Only
>>>     ```
>>>     <script> enable_interrupt
>>>     ```
>>>
>>> - SO_BUSYPOLL (no arguments to script)
>>>     ```
>>>     <script>
>>>     ```
>>>
>>> - NAPI threaded busypoll
>>>     ```
>>>     <script> enable_threaded
>>>     ```
>>>
>>> If using idpf, the script needs to be run again after launching the
>>> workload just to make sure that the configurations are not reverted. As
>>> idpf reverts some configurations on software reset when AF_XDP program
>>> is attached.
>>>
>>> Once configured, the workload is run with various configurations using
>>> following commands. Set period (1/frequency) and delay in usecs to
>>> produce results for packet frequency and application processing delay.
>>>
>>>    ## Interrupt Only and SO_BUSYPOLL (inline)
>>>
>>> - Server
>>> ```
>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>        -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v
>>> ```
>>>
>>> - Client
>>> ```
>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>        -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>        -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v
>>> ```
>>>
>>>    ## SO_BUSYPOLL(done in separate core using recvfrom)
>>>
>>> Argument -t spawns a seprate thread and continuously calls recvfrom.
>>>
>>> - Server
>>> ```
>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>        -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
>>>        -h -v -t
>>> ```
>>>
>>> - Client
>>> ```
>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>        -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>        -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -t
>>> ```
>>>
>>>    ## NAPI Threaded Busy Poll
>>>
>>> Argument -n skips the recvfrom call as there is no recv kick needed.
>>>
>>> - Server
>>> ```
>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>        -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
>>>        -h -v -n
>>> ```
>>>
>>> - Client
>>> ```
>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>        -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>        -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -n
>>> ```
>>
>> I believe there's a bug when disabling busy-polled napi threading after
>> an experiment. My system hangs and needs a hard reset.
>>
>>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
>>> |---|---|---|---|---|
>>> | 12 Kpkt/s + 0us delay | | | | |
>>> |  | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
>>> |  | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
>>> |  | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
>>> |  | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
>>> | 32 Kpkt/s + 30us delay | | | | |
>>> |  | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
>>> |  | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
>>> |  | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
>>> |  | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
>>> | 125 Kpkt/s + 6us delay | | | | |
>>> |  | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
>>> |  | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
>>> |  | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
>>> |  | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
>>> | 12 Kpkt/s + 78us delay | | | | |
>>> |  | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
>>> |  | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
>>> |  | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
>>> |  | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
>>> | 25 Kpkt/s + 38us delay | | | | |
>>> |  | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
>>> |  | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
>>> |  | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
>>> |  | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
>>
>> On my system, routing the irq to same core where xsk_rr runs results in
>> lower latency than routing the irq to a different core. To me that makes
>> sense in a low-rate latency-sensitive scenario where interrupts are not
>> causing much trouble, but the resulting locality might be beneficial. I
>> think you should test this as well.
>>
>> The experiments reported above (except for the first one) are
>> cherry-picking parameter combinations that result in a near-100% load
>> and ignore anything else. Near-100% load is a highly unlikely scenario
>> for a latency-sensitive workload.
>>
>> When combining the above two paragraphs, I believe other interesting
>> setups are missing from the experiments, such as comparing to two pairs
>> of xsk_rr under high load (as mentioned in my previous emails).
> This is to support an existing real workload. We cannot easily modify
> its threading model. The two xsk_rr model would be a different
> workload.

That's fine, but:

- In principle I don't think it's a good justification for a kernel 
change that an application cannot be rewritten.

- I believe it is your responsibility to more comprehensively document 
the impact of your proposed changes beyond your one particular workload.

Also, I do believe there's a bug as mentioned before. I can't quite pin 
it down, but every time after running a "NAPI threaded" experiment, my 
servers enters a funny state and eventually becomes largely unresponsive 
without much useful output and needs a hard reset. For example:

1) Run "NAPI threaded" experiment
2) Disabled "threaded" parameter in NAPI config
3) Run IRQ experiment -> xsk_rr hangs and apparently holds a lock, 
because other services stop working successively.

Do you not have this problem?

Thanks,
Martin


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v8 0/2] Add support to do threaded napi busy poll
  2025-08-29 18:08     ` Martin Karsten
@ 2025-08-29 18:42       ` Willem de Bruijn
  2025-08-29 20:49       ` Samiullah Khawaja
  2025-08-29 22:19       ` Martin Karsten
  2 siblings, 0 replies; 17+ messages in thread
From: Willem de Bruijn @ 2025-08-29 18:42 UTC (permalink / raw)
  To: Martin Karsten, Samiullah Khawaja
  Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, Joe Damato, netdev

Martin Karsten wrote:
> On 2025-08-29 13:50, Samiullah Khawaja wrote:
> > On Thu, Aug 28, 2025 at 8:15 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> >>
> >> On 2025-08-28 21:16, Samiullah Khawaja wrote:
> >>> Extend the already existing support of threaded napi poll to do continuous
> >>> busy polling.
> >>>
> >>> This is used for doing continuous polling of napi to fetch descriptors
> >>> from backing RX/TX queues for low latency applications. Allow enabling
> >>> of threaded busypoll using netlink so this can be enabled on a set of
> >>> dedicated napis for low latency applications.
> >>>
> >>> Once enabled user can fetch the PID of the kthread doing NAPI polling
> >>> and set affinity, priority and scheduler for it depending on the
> >>> low-latency requirements.
> >>>
> >>> Extend the netlink interface to allow enabling/disabling threaded
> >>> busypolling at individual napi level.
> >>>
> >>> We use this for our AF_XDP based hard low-latency usecase with usecs
> >>> level latency requirement. For our usecase we want low jitter and stable
> >>> latency at P99.
> >>>
> >>> Following is an analysis and comparison of available (and compatible)
> >>> busy poll interfaces for a low latency usecase with stable P99. This can
> >>> be suitable for applications that want very low latency at the expense
> >>> of cpu usage and efficiency.
> >>>
> >>> Already existing APIs (SO_BUSYPOLL and epoll) allow busy polling a NAPI
> >>> backing a socket, but the missing piece is a mechanism to busy poll a
> >>> NAPI instance in a dedicated thread while ignoring available events or
> >>> packets, regardless of the userspace API. Most existing mechanisms are
> >>> designed to work in a pattern where you poll until new packets or events
> >>> are received, after which userspace is expected to handle them.
> >>>
> >>> As a result, one has to hack together a solution using a mechanism
> >>> intended to receive packets or events, not to simply NAPI poll. NAPI
> >>> threaded busy polling, on the other hand, provides this capability
> >>> natively, independent of any userspace API. This makes it really easy to
> >>> setup and manage.
> >>>
> >>> For analysis we use an AF_XDP based benchmarking tool `xsk_rr`. The
> >>> description of the tool and how it tries to simulate the real workload
> >>> is following,
> >>>
> >>> - It sends UDP packets between 2 machines.
> >>> - The client machine sends packets at a fixed frequency. To maintain the
> >>>     frequency of the packet being sent, we use open-loop sampling. That is
> >>>     the packets are sent in a separate thread.
> >>> - The server replies to the packet inline by reading the pkt from the
> >>>     recv ring and replies using the tx ring.
> >>> - To simulate the application processing time, we use a configurable
> >>>     delay in usecs on the client side after a reply is received from the
> >>>     server.
> >>>
> >>> The xsk_rr tool is posted separately as an RFC for tools/testing/selftest.
> >>>
> >>> We use this tool with following napi polling configurations,
> >>>
> >>> - Interrupts only
> >>> - SO_BUSYPOLL (inline in the same thread where the client receives the
> >>>     packet).
> >>> - SO_BUSYPOLL (separate thread and separate core)
> >>> - Threaded NAPI busypoll
> >>>
> >>> System is configured using following script in all 4 cases,
> >>>
> >>> ```
> >>> echo 0 | sudo tee /sys/class/net/eth0/threaded
> >>> echo 0 | sudo tee /proc/sys/kernel/timer_migration
> >>> echo off | sudo tee  /sys/devices/system/cpu/smt/control
> >>>
> >>> sudo ethtool -L eth0 rx 1 tx 1
> >>> sudo ethtool -G eth0 rx 1024
> >>>
> >>> echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
> >>> echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
> >>>
> >>>    # pin IRQs on CPU 2
> >>> IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
> >>>                                print arr[0]}' < /proc/interrupts)"
> >>> for irq in "${IRQS}"; \
> >>>        do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done
> >>>
> >>> echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
> >>>
> >>> for i in /sys/devices/virtual/workqueue/*/cpumask; \
> >>>                        do echo $i; echo 1,2,3,4,5,6 > $i; done
> >>>
> >>> if [[ -z "$1" ]]; then
> >>>     echo 400 | sudo tee /proc/sys/net/core/busy_read
> >>>     echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> >>>     echo 15000   | sudo tee /sys/class/net/eth0/gro_flush_timeout
> >>> fi
> >>>
> >>> sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0
> >>>
> >>> if [[ "$1" == "enable_threaded" ]]; then
> >>>     echo 0 | sudo tee /proc/sys/net/core/busy_poll
> >>>     echo 0 | sudo tee /proc/sys/net/core/busy_read
> >>>     echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> >>>     echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> >>>     echo 2 | sudo tee /sys/class/net/eth0/threaded
> >>>     NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
> >>>     sudo chrt -f  -p 50 $NAPI_T
> >>>
> >>>     # pin threaded poll thread to CPU 2
> >>>     sudo taskset -pc 2 $NAPI_T
> >>> fi
> >>>
> >>> if [[ "$1" == "enable_interrupt" ]]; then
> >>>     echo 0 | sudo tee /proc/sys/net/core/busy_read
> >>>     echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> >>>     echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> >>> fi
> >>> ```
> >>
> >> The experiment script above does not work, because the sysfs parameter
> >> does not exist anymore in this version.
> >>
> >>> To enable various configurations, script can be run as following,
> >>>
> >>> - Interrupt Only
> >>>     ```
> >>>     <script> enable_interrupt
> >>>     ```
> >>>
> >>> - SO_BUSYPOLL (no arguments to script)
> >>>     ```
> >>>     <script>
> >>>     ```
> >>>
> >>> - NAPI threaded busypoll
> >>>     ```
> >>>     <script> enable_threaded
> >>>     ```
> >>>
> >>> If using idpf, the script needs to be run again after launching the
> >>> workload just to make sure that the configurations are not reverted. As
> >>> idpf reverts some configurations on software reset when AF_XDP program
> >>> is attached.
> >>>
> >>> Once configured, the workload is run with various configurations using
> >>> following commands. Set period (1/frequency) and delay in usecs to
> >>> produce results for packet frequency and application processing delay.
> >>>
> >>>    ## Interrupt Only and SO_BUSYPOLL (inline)
> >>>
> >>> - Server
> >>> ```
> >>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>        -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v
> >>> ```
> >>>
> >>> - Client
> >>> ```
> >>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>        -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> >>>        -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v
> >>> ```
> >>>
> >>>    ## SO_BUSYPOLL(done in separate core using recvfrom)
> >>>
> >>> Argument -t spawns a seprate thread and continuously calls recvfrom.
> >>>
> >>> - Server
> >>> ```
> >>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>        -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> >>>        -h -v -t
> >>> ```
> >>>
> >>> - Client
> >>> ```
> >>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>        -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> >>>        -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -t
> >>> ```
> >>>
> >>>    ## NAPI Threaded Busy Poll
> >>>
> >>> Argument -n skips the recvfrom call as there is no recv kick needed.
> >>>
> >>> - Server
> >>> ```
> >>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>        -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> >>>        -h -v -n
> >>> ```
> >>>
> >>> - Client
> >>> ```
> >>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>        -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> >>>        -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -n
> >>> ```
> >>
> >> I believe there's a bug when disabling busy-polled napi threading after
> >> an experiment. My system hangs and needs a hard reset.
> >>
> >>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
> >>> |---|---|---|---|---|
> >>> | 12 Kpkt/s + 0us delay | | | | |
> >>> |  | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
> >>> |  | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
> >>> |  | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
> >>> |  | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
> >>> | 32 Kpkt/s + 30us delay | | | | |
> >>> |  | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
> >>> |  | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
> >>> |  | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
> >>> |  | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
> >>> | 125 Kpkt/s + 6us delay | | | | |
> >>> |  | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
> >>> |  | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
> >>> |  | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
> >>> |  | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
> >>> | 12 Kpkt/s + 78us delay | | | | |
> >>> |  | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
> >>> |  | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
> >>> |  | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
> >>> |  | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
> >>> | 25 Kpkt/s + 38us delay | | | | |
> >>> |  | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
> >>> |  | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
> >>> |  | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
> >>> |  | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
> >>
> >> On my system, routing the irq to same core where xsk_rr runs results in
> >> lower latency than routing the irq to a different core. To me that makes
> >> sense in a low-rate latency-sensitive scenario where interrupts are not
> >> causing much trouble, but the resulting locality might be beneficial. I
> >> think you should test this as well.
> >>
> >> The experiments reported above (except for the first one) are
> >> cherry-picking parameter combinations that result in a near-100% load
> >> and ignore anything else. Near-100% load is a highly unlikely scenario
> >> for a latency-sensitive workload.
> >>
> >> When combining the above two paragraphs, I believe other interesting
> >> setups are missing from the experiments, such as comparing to two pairs
> >> of xsk_rr under high load (as mentioned in my previous emails).
> > This is to support an existing real workload. We cannot easily modify
> > its threading model. The two xsk_rr model would be a different
> > workload.
> 
> That's fine, but:
> 
> - In principle I don't think it's a good justification for a kernel 
> change that an application cannot be rewritten.

It's not as narrow as one application. It's a way to scale processing
using pipelining instead of sharding. Both are reasonable approaches.

Especially for XDP, doing this first stage in the kernel makes sense
to me, as it makes XDP closer to hardware descriptor queue based
polling architectures such as DPDK or Google's SPIN. The OS abstracts
away the hardware format and the format translation entirely.

> - I believe it is your responsibility to more comprehensively document 
> the impact of your proposed changes beyond your one particular workload.
>
> Also, I do believe there's a bug as mentioned before. I can't quite pin 
> it down, but every time after running a "NAPI threaded" experiment, my 
> servers enters a funny state and eventually becomes largely unresponsive 
> without much useful output and needs a hard reset. For example:
> 
> 1) Run "NAPI threaded" experiment
> 2) Disabled "threaded" parameter in NAPI config
> 3) Run IRQ experiment -> xsk_rr hangs and apparently holds a lock, 
> because other services stop working successively.
> 
> Do you not have this problem?
> 
> Thanks,
> Martin
> 



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v8 0/2] Add support to do threaded napi busy poll
  2025-08-29 18:08     ` Martin Karsten
  2025-08-29 18:42       ` Willem de Bruijn
@ 2025-08-29 20:49       ` Samiullah Khawaja
  2025-08-29 21:27         ` Martin Karsten
  2025-08-29 22:19       ` Martin Karsten
  2 siblings, 1 reply; 17+ messages in thread
From: Samiullah Khawaja @ 2025-08-29 20:49 UTC (permalink / raw)
  To: Martin Karsten
  Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, Joe Damato, netdev

On Fri, Aug 29, 2025 at 11:08 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>
> On 2025-08-29 13:50, Samiullah Khawaja wrote:
> > On Thu, Aug 28, 2025 at 8:15 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> >>
> >> On 2025-08-28 21:16, Samiullah Khawaja wrote:
> >>> Extend the already existing support of threaded napi poll to do continuous
> >>> busy polling.
> >>>
> >>> This is used for doing continuous polling of napi to fetch descriptors
> >>> from backing RX/TX queues for low latency applications. Allow enabling
> >>> of threaded busypoll using netlink so this can be enabled on a set of
> >>> dedicated napis for low latency applications.
> >>>
> >>> Once enabled user can fetch the PID of the kthread doing NAPI polling
> >>> and set affinity, priority and scheduler for it depending on the
> >>> low-latency requirements.
> >>>
> >>> Extend the netlink interface to allow enabling/disabling threaded
> >>> busypolling at individual napi level.
> >>>
> >>> We use this for our AF_XDP based hard low-latency usecase with usecs
> >>> level latency requirement. For our usecase we want low jitter and stable
> >>> latency at P99.
> >>>
> >>> Following is an analysis and comparison of available (and compatible)
> >>> busy poll interfaces for a low latency usecase with stable P99. This can
> >>> be suitable for applications that want very low latency at the expense
> >>> of cpu usage and efficiency.
> >>>
> >>> Already existing APIs (SO_BUSYPOLL and epoll) allow busy polling a NAPI
> >>> backing a socket, but the missing piece is a mechanism to busy poll a
> >>> NAPI instance in a dedicated thread while ignoring available events or
> >>> packets, regardless of the userspace API. Most existing mechanisms are
> >>> designed to work in a pattern where you poll until new packets or events
> >>> are received, after which userspace is expected to handle them.
> >>>
> >>> As a result, one has to hack together a solution using a mechanism
> >>> intended to receive packets or events, not to simply NAPI poll. NAPI
> >>> threaded busy polling, on the other hand, provides this capability
> >>> natively, independent of any userspace API. This makes it really easy to
> >>> setup and manage.
> >>>
> >>> For analysis we use an AF_XDP based benchmarking tool `xsk_rr`. The
> >>> description of the tool and how it tries to simulate the real workload
> >>> is following,
> >>>
> >>> - It sends UDP packets between 2 machines.
> >>> - The client machine sends packets at a fixed frequency. To maintain the
> >>>     frequency of the packet being sent, we use open-loop sampling. That is
> >>>     the packets are sent in a separate thread.
> >>> - The server replies to the packet inline by reading the pkt from the
> >>>     recv ring and replies using the tx ring.
> >>> - To simulate the application processing time, we use a configurable
> >>>     delay in usecs on the client side after a reply is received from the
> >>>     server.
> >>>
> >>> The xsk_rr tool is posted separately as an RFC for tools/testing/selftest.
> >>>
> >>> We use this tool with following napi polling configurations,
> >>>
> >>> - Interrupts only
> >>> - SO_BUSYPOLL (inline in the same thread where the client receives the
> >>>     packet).
> >>> - SO_BUSYPOLL (separate thread and separate core)
> >>> - Threaded NAPI busypoll
> >>>
> >>> System is configured using following script in all 4 cases,
> >>>
> >>> ```
> >>> echo 0 | sudo tee /sys/class/net/eth0/threaded
> >>> echo 0 | sudo tee /proc/sys/kernel/timer_migration
> >>> echo off | sudo tee  /sys/devices/system/cpu/smt/control
> >>>
> >>> sudo ethtool -L eth0 rx 1 tx 1
> >>> sudo ethtool -G eth0 rx 1024
> >>>
> >>> echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
> >>> echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
> >>>
> >>>    # pin IRQs on CPU 2
> >>> IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
> >>>                                print arr[0]}' < /proc/interrupts)"
> >>> for irq in "${IRQS}"; \
> >>>        do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done
> >>>
> >>> echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
> >>>
> >>> for i in /sys/devices/virtual/workqueue/*/cpumask; \
> >>>                        do echo $i; echo 1,2,3,4,5,6 > $i; done
> >>>
> >>> if [[ -z "$1" ]]; then
> >>>     echo 400 | sudo tee /proc/sys/net/core/busy_read
> >>>     echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> >>>     echo 15000   | sudo tee /sys/class/net/eth0/gro_flush_timeout
> >>> fi
> >>>
> >>> sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0
> >>>
> >>> if [[ "$1" == "enable_threaded" ]]; then
> >>>     echo 0 | sudo tee /proc/sys/net/core/busy_poll
> >>>     echo 0 | sudo tee /proc/sys/net/core/busy_read
> >>>     echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> >>>     echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> >>>     echo 2 | sudo tee /sys/class/net/eth0/threaded
> >>>     NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
> >>>     sudo chrt -f  -p 50 $NAPI_T
> >>>
> >>>     # pin threaded poll thread to CPU 2
> >>>     sudo taskset -pc 2 $NAPI_T
> >>> fi
> >>>
> >>> if [[ "$1" == "enable_interrupt" ]]; then
> >>>     echo 0 | sudo tee /proc/sys/net/core/busy_read
> >>>     echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> >>>     echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> >>> fi
> >>> ```
> >>
> >> The experiment script above does not work, because the sysfs parameter
> >> does not exist anymore in this version.
> >>
> >>> To enable various configurations, script can be run as following,
> >>>
> >>> - Interrupt Only
> >>>     ```
> >>>     <script> enable_interrupt
> >>>     ```
> >>>
> >>> - SO_BUSYPOLL (no arguments to script)
> >>>     ```
> >>>     <script>
> >>>     ```
> >>>
> >>> - NAPI threaded busypoll
> >>>     ```
> >>>     <script> enable_threaded
> >>>     ```
> >>>
> >>> If using idpf, the script needs to be run again after launching the
> >>> workload just to make sure that the configurations are not reverted. As
> >>> idpf reverts some configurations on software reset when AF_XDP program
> >>> is attached.
> >>>
> >>> Once configured, the workload is run with various configurations using
> >>> following commands. Set period (1/frequency) and delay in usecs to
> >>> produce results for packet frequency and application processing delay.
> >>>
> >>>    ## Interrupt Only and SO_BUSYPOLL (inline)
> >>>
> >>> - Server
> >>> ```
> >>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>        -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v
> >>> ```
> >>>
> >>> - Client
> >>> ```
> >>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>        -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> >>>        -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v
> >>> ```
> >>>
> >>>    ## SO_BUSYPOLL(done in separate core using recvfrom)
> >>>
> >>> Argument -t spawns a seprate thread and continuously calls recvfrom.
> >>>
> >>> - Server
> >>> ```
> >>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>        -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> >>>        -h -v -t
> >>> ```
> >>>
> >>> - Client
> >>> ```
> >>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>        -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> >>>        -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -t
> >>> ```
> >>>
> >>>    ## NAPI Threaded Busy Poll
> >>>
> >>> Argument -n skips the recvfrom call as there is no recv kick needed.
> >>>
> >>> - Server
> >>> ```
> >>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>        -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> >>>        -h -v -n
> >>> ```
> >>>
> >>> - Client
> >>> ```
> >>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>        -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> >>>        -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -n
> >>> ```
> >>
> >> I believe there's a bug when disabling busy-polled napi threading after
> >> an experiment. My system hangs and needs a hard reset.
> >>
> >>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
> >>> |---|---|---|---|---|
> >>> | 12 Kpkt/s + 0us delay | | | | |
> >>> |  | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
> >>> |  | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
> >>> |  | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
> >>> |  | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
> >>> | 32 Kpkt/s + 30us delay | | | | |
> >>> |  | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
> >>> |  | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
> >>> |  | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
> >>> |  | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
> >>> | 125 Kpkt/s + 6us delay | | | | |
> >>> |  | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
> >>> |  | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
> >>> |  | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
> >>> |  | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
> >>> | 12 Kpkt/s + 78us delay | | | | |
> >>> |  | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
> >>> |  | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
> >>> |  | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
> >>> |  | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
> >>> | 25 Kpkt/s + 38us delay | | | | |
> >>> |  | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
> >>> |  | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
> >>> |  | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
> >>> |  | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
> >>
> >> On my system, routing the irq to same core where xsk_rr runs results in
> >> lower latency than routing the irq to a different core. To me that makes
> >> sense in a low-rate latency-sensitive scenario where interrupts are not
> >> causing much trouble, but the resulting locality might be beneficial. I
> >> think you should test this as well.
> >>
> >> The experiments reported above (except for the first one) are
> >> cherry-picking parameter combinations that result in a near-100% load
> >> and ignore anything else. Near-100% load is a highly unlikely scenario
> >> for a latency-sensitive workload.
> >>
> >> When combining the above two paragraphs, I believe other interesting
> >> setups are missing from the experiments, such as comparing to two pairs
> >> of xsk_rr under high load (as mentioned in my previous emails).
> > This is to support an existing real workload. We cannot easily modify
> > its threading model. The two xsk_rr model would be a different
> > workload.
>
> That's fine, but:
>
> - In principle I don't think it's a good justification for a kernel
> change that an application cannot be rewritten.
>
> - I believe it is your responsibility to more comprehensively document
> the impact of your proposed changes beyond your one particular workload.
>
> Also, I do believe there's a bug as mentioned before. I can't quite pin
> it down, but every time after running a "NAPI threaded" experiment, my
> servers enters a funny state and eventually becomes largely unresponsive
> without much useful output and needs a hard reset. For example:
>
> 1) Run "NAPI threaded" experiment
> 2) Disabled "threaded" parameter in NAPI config
> 3) Run IRQ experiment -> xsk_rr hangs and apparently holds a lock,
> because other services stop working successively.
I just tried with this scenario and it seems to work fine.
>
> Do you not have this problem?
Not Really. Jakub actually fixed a deadlock in napi threaded recently.
Maybe you are hitting that? Are you using the latest base-commit that
I have in this patch series?
>
> Thanks,
> Martin
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v8 0/2] Add support to do threaded napi busy poll
  2025-08-29 20:49       ` Samiullah Khawaja
@ 2025-08-29 21:27         ` Martin Karsten
  2025-08-29 22:27           ` Samiullah Khawaja
  0 siblings, 1 reply; 17+ messages in thread
From: Martin Karsten @ 2025-08-29 21:27 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, Joe Damato, netdev

On 2025-08-29 16:49, Samiullah Khawaja wrote:
> On Fri, Aug 29, 2025 at 11:08 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>>
>> On 2025-08-29 13:50, Samiullah Khawaja wrote:
>>> On Thu, Aug 28, 2025 at 8:15 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>>>>
>>>> On 2025-08-28 21:16, Samiullah Khawaja wrote:
>>>>> Extend the already existing support of threaded napi poll to do continuous
>>>>> busy polling.
>>>>>
>>>>> This is used for doing continuous polling of napi to fetch descriptors
>>>>> from backing RX/TX queues for low latency applications. Allow enabling
>>>>> of threaded busypoll using netlink so this can be enabled on a set of
>>>>> dedicated napis for low latency applications.
>>>>>
>>>>> Once enabled user can fetch the PID of the kthread doing NAPI polling
>>>>> and set affinity, priority and scheduler for it depending on the
>>>>> low-latency requirements.
>>>>>
>>>>> Extend the netlink interface to allow enabling/disabling threaded
>>>>> busypolling at individual napi level.
>>>>>
>>>>> We use this for our AF_XDP based hard low-latency usecase with usecs
>>>>> level latency requirement. For our usecase we want low jitter and stable
>>>>> latency at P99.
>>>>>
>>>>> Following is an analysis and comparison of available (and compatible)
>>>>> busy poll interfaces for a low latency usecase with stable P99. This can
>>>>> be suitable for applications that want very low latency at the expense
>>>>> of cpu usage and efficiency.
>>>>>
>>>>> Already existing APIs (SO_BUSYPOLL and epoll) allow busy polling a NAPI
>>>>> backing a socket, but the missing piece is a mechanism to busy poll a
>>>>> NAPI instance in a dedicated thread while ignoring available events or
>>>>> packets, regardless of the userspace API. Most existing mechanisms are
>>>>> designed to work in a pattern where you poll until new packets or events
>>>>> are received, after which userspace is expected to handle them.
>>>>>
>>>>> As a result, one has to hack together a solution using a mechanism
>>>>> intended to receive packets or events, not to simply NAPI poll. NAPI
>>>>> threaded busy polling, on the other hand, provides this capability
>>>>> natively, independent of any userspace API. This makes it really easy to
>>>>> setup and manage.
>>>>>
>>>>> For analysis we use an AF_XDP based benchmarking tool `xsk_rr`. The
>>>>> description of the tool and how it tries to simulate the real workload
>>>>> is following,
>>>>>
>>>>> - It sends UDP packets between 2 machines.
>>>>> - The client machine sends packets at a fixed frequency. To maintain the
>>>>>      frequency of the packet being sent, we use open-loop sampling. That is
>>>>>      the packets are sent in a separate thread.
>>>>> - The server replies to the packet inline by reading the pkt from the
>>>>>      recv ring and replies using the tx ring.
>>>>> - To simulate the application processing time, we use a configurable
>>>>>      delay in usecs on the client side after a reply is received from the
>>>>>      server.
>>>>>
>>>>> The xsk_rr tool is posted separately as an RFC for tools/testing/selftest.
>>>>>
>>>>> We use this tool with following napi polling configurations,
>>>>>
>>>>> - Interrupts only
>>>>> - SO_BUSYPOLL (inline in the same thread where the client receives the
>>>>>      packet).
>>>>> - SO_BUSYPOLL (separate thread and separate core)
>>>>> - Threaded NAPI busypoll
>>>>>
>>>>> System is configured using following script in all 4 cases,
>>>>>
>>>>> ```
>>>>> echo 0 | sudo tee /sys/class/net/eth0/threaded
>>>>> echo 0 | sudo tee /proc/sys/kernel/timer_migration
>>>>> echo off | sudo tee  /sys/devices/system/cpu/smt/control
>>>>>
>>>>> sudo ethtool -L eth0 rx 1 tx 1
>>>>> sudo ethtool -G eth0 rx 1024
>>>>>
>>>>> echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
>>>>> echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
>>>>>
>>>>>     # pin IRQs on CPU 2
>>>>> IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
>>>>>                                 print arr[0]}' < /proc/interrupts)"
>>>>> for irq in "${IRQS}"; \
>>>>>         do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done
>>>>>
>>>>> echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
>>>>>
>>>>> for i in /sys/devices/virtual/workqueue/*/cpumask; \
>>>>>                         do echo $i; echo 1,2,3,4,5,6 > $i; done
>>>>>
>>>>> if [[ -z "$1" ]]; then
>>>>>      echo 400 | sudo tee /proc/sys/net/core/busy_read
>>>>>      echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>>>      echo 15000   | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>>>> fi
>>>>>
>>>>> sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0
>>>>>
>>>>> if [[ "$1" == "enable_threaded" ]]; then
>>>>>      echo 0 | sudo tee /proc/sys/net/core/busy_poll
>>>>>      echo 0 | sudo tee /proc/sys/net/core/busy_read
>>>>>      echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>>>      echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>>>>      echo 2 | sudo tee /sys/class/net/eth0/threaded
>>>>>      NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
>>>>>      sudo chrt -f  -p 50 $NAPI_T
>>>>>
>>>>>      # pin threaded poll thread to CPU 2
>>>>>      sudo taskset -pc 2 $NAPI_T
>>>>> fi
>>>>>
>>>>> if [[ "$1" == "enable_interrupt" ]]; then
>>>>>      echo 0 | sudo tee /proc/sys/net/core/busy_read
>>>>>      echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>>>      echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>>>> fi
>>>>> ```
>>>>
>>>> The experiment script above does not work, because the sysfs parameter
>>>> does not exist anymore in this version.
>>>>
>>>>> To enable various configurations, script can be run as following,
>>>>>
>>>>> - Interrupt Only
>>>>>      ```
>>>>>      <script> enable_interrupt
>>>>>      ```
>>>>>
>>>>> - SO_BUSYPOLL (no arguments to script)
>>>>>      ```
>>>>>      <script>
>>>>>      ```
>>>>>
>>>>> - NAPI threaded busypoll
>>>>>      ```
>>>>>      <script> enable_threaded
>>>>>      ```
>>>>>
>>>>> If using idpf, the script needs to be run again after launching the
>>>>> workload just to make sure that the configurations are not reverted. As
>>>>> idpf reverts some configurations on software reset when AF_XDP program
>>>>> is attached.
>>>>>
>>>>> Once configured, the workload is run with various configurations using
>>>>> following commands. Set period (1/frequency) and delay in usecs to
>>>>> produce results for packet frequency and application processing delay.
>>>>>
>>>>>     ## Interrupt Only and SO_BUSYPOLL (inline)
>>>>>
>>>>> - Server
>>>>> ```
>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>         -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v
>>>>> ```
>>>>>
>>>>> - Client
>>>>> ```
>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>         -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>>>         -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v
>>>>> ```
>>>>>
>>>>>     ## SO_BUSYPOLL(done in separate core using recvfrom)
>>>>>
>>>>> Argument -t spawns a seprate thread and continuously calls recvfrom.
>>>>>
>>>>> - Server
>>>>> ```
>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>         -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
>>>>>         -h -v -t
>>>>> ```
>>>>>
>>>>> - Client
>>>>> ```
>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>         -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>>>         -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -t
>>>>> ```
>>>>>
>>>>>     ## NAPI Threaded Busy Poll
>>>>>
>>>>> Argument -n skips the recvfrom call as there is no recv kick needed.
>>>>>
>>>>> - Server
>>>>> ```
>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>         -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
>>>>>         -h -v -n
>>>>> ```
>>>>>
>>>>> - Client
>>>>> ```
>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>         -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>>>         -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -n
>>>>> ```
>>>>
>>>> I believe there's a bug when disabling busy-polled napi threading after
>>>> an experiment. My system hangs and needs a hard reset.
>>>>
>>>>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
>>>>> |---|---|---|---|---|
>>>>> | 12 Kpkt/s + 0us delay | | | | |
>>>>> |  | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
>>>>> |  | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
>>>>> |  | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
>>>>> |  | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
>>>>> | 32 Kpkt/s + 30us delay | | | | |
>>>>> |  | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
>>>>> |  | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
>>>>> |  | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
>>>>> |  | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
>>>>> | 125 Kpkt/s + 6us delay | | | | |
>>>>> |  | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
>>>>> |  | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
>>>>> |  | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
>>>>> |  | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
>>>>> | 12 Kpkt/s + 78us delay | | | | |
>>>>> |  | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
>>>>> |  | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
>>>>> |  | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
>>>>> |  | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
>>>>> | 25 Kpkt/s + 38us delay | | | | |
>>>>> |  | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
>>>>> |  | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
>>>>> |  | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
>>>>> |  | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
>>>>
>>>> On my system, routing the irq to same core where xsk_rr runs results in
>>>> lower latency than routing the irq to a different core. To me that makes
>>>> sense in a low-rate latency-sensitive scenario where interrupts are not
>>>> causing much trouble, but the resulting locality might be beneficial. I
>>>> think you should test this as well.
>>>>
>>>> The experiments reported above (except for the first one) are
>>>> cherry-picking parameter combinations that result in a near-100% load
>>>> and ignore anything else. Near-100% load is a highly unlikely scenario
>>>> for a latency-sensitive workload.
>>>>
>>>> When combining the above two paragraphs, I believe other interesting
>>>> setups are missing from the experiments, such as comparing to two pairs
>>>> of xsk_rr under high load (as mentioned in my previous emails).
>>> This is to support an existing real workload. We cannot easily modify
>>> its threading model. The two xsk_rr model would be a different
>>> workload.
>>
>> That's fine, but:
>>
>> - In principle I don't think it's a good justification for a kernel
>> change that an application cannot be rewritten.
>>
>> - I believe it is your responsibility to more comprehensively document
>> the impact of your proposed changes beyond your one particular workload.
>>
>> Also, I do believe there's a bug as mentioned before. I can't quite pin
>> it down, but every time after running a "NAPI threaded" experiment, my
>> servers enters a funny state and eventually becomes largely unresponsive
>> without much useful output and needs a hard reset. For example:
>>
>> 1) Run "NAPI threaded" experiment
>> 2) Disabled "threaded" parameter in NAPI config
>> 3) Run IRQ experiment -> xsk_rr hangs and apparently holds a lock,
>> because other services stop working successively.
> I just tried with this scenario and it seems to work fine.

Ok. I've reproduced it more concisely. This is after a fresh reboot:

sudo ethtool -L ens15f1np1 combined 1

sudo net-next/tools/net/ynl/pyynl/cli.py --no-schema --output-json\
  --spec net-next/Documentation/netlink/specs/netdev.yaml --do napi-set\
  --json='{"id": 8209, "threaded": "busy-poll-enabled"}'

# ping from another machine to this NIC works
# napi thread busy at 100%

sudo net-next/tools/net/ynl/pyynl/cli.py --no-schema --output-json\
  --spec net-next/Documentation/netlink/specs/netdev.yaml --do napi-set\
  --json='{"id": 8209, "threaded": "disabled"}'

# napi thread gone
# ping from another machine does not work
# tcpdump does not show incoming icmp packets
# but machine still responsive on other NIC

sudo ethtool -L ens15f1np1 combined 12

# networking hangs on all NICs
# sudo reboot on console hangs
# hard reset needed, no useful output
>> Do you not have this problem?
> Not Really. Jakub actually fixed a deadlock in napi threaded recently.
> Maybe you are hitting that? Are you using the latest base-commit that
> I have in this patch series?

Yep:
- Ubuntu 24.04.3 LTS system
- base commit before patches is c3199adbe4ffffc7b6536715e0290d1919a45cd9
- NIC driver is ice, PCI id 8086:159b.

Let me know, if you need any other information?

Best,
Martin

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v8 0/2] Add support to do threaded napi busy poll
  2025-08-29 18:08     ` Martin Karsten
  2025-08-29 18:42       ` Willem de Bruijn
  2025-08-29 20:49       ` Samiullah Khawaja
@ 2025-08-29 22:19       ` Martin Karsten
  2025-08-29 22:25         ` Samiullah Khawaja
  2 siblings, 1 reply; 17+ messages in thread
From: Martin Karsten @ 2025-08-29 22:19 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, Joe Damato, netdev

On 2025-08-29 14:08, Martin Karsten wrote:
> On 2025-08-29 13:50, Samiullah Khawaja wrote:
>> On Thu, Aug 28, 2025 at 8:15 PM Martin Karsten <mkarsten@uwaterloo.ca> 
>> wrote:
>>>
>>> On 2025-08-28 21:16, Samiullah Khawaja wrote:
>>>> Extend the already existing support of threaded napi poll to do 
>>>> continuous
>>>> busy polling.
>>>>
>>>> This is used for doing continuous polling of napi to fetch descriptors
>>>> from backing RX/TX queues for low latency applications. Allow enabling
>>>> of threaded busypoll using netlink so this can be enabled on a set of
>>>> dedicated napis for low latency applications.
>>>>
>>>> Once enabled user can fetch the PID of the kthread doing NAPI polling
>>>> and set affinity, priority and scheduler for it depending on the
>>>> low-latency requirements.
>>>>
>>>> Extend the netlink interface to allow enabling/disabling threaded
>>>> busypolling at individual napi level.
>>>>
>>>> We use this for our AF_XDP based hard low-latency usecase with usecs
>>>> level latency requirement. For our usecase we want low jitter and 
>>>> stable
>>>> latency at P99.
>>>>
>>>> Following is an analysis and comparison of available (and compatible)
>>>> busy poll interfaces for a low latency usecase with stable P99. This 
>>>> can
>>>> be suitable for applications that want very low latency at the expense
>>>> of cpu usage and efficiency.
>>>>
>>>> Already existing APIs (SO_BUSYPOLL and epoll) allow busy polling a NAPI
>>>> backing a socket, but the missing piece is a mechanism to busy poll a
>>>> NAPI instance in a dedicated thread while ignoring available events or
>>>> packets, regardless of the userspace API. Most existing mechanisms are
>>>> designed to work in a pattern where you poll until new packets or 
>>>> events
>>>> are received, after which userspace is expected to handle them.
>>>>
>>>> As a result, one has to hack together a solution using a mechanism
>>>> intended to receive packets or events, not to simply NAPI poll. NAPI
>>>> threaded busy polling, on the other hand, provides this capability
>>>> natively, independent of any userspace API. This makes it really 
>>>> easy to
>>>> setup and manage.
>>>>
>>>> For analysis we use an AF_XDP based benchmarking tool `xsk_rr`. The
>>>> description of the tool and how it tries to simulate the real workload
>>>> is following,
>>>>
>>>> - It sends UDP packets between 2 machines.
>>>> - The client machine sends packets at a fixed frequency. To maintain 
>>>> the
>>>>     frequency of the packet being sent, we use open-loop sampling. 
>>>> That is
>>>>     the packets are sent in a separate thread.
>>>> - The server replies to the packet inline by reading the pkt from the
>>>>     recv ring and replies using the tx ring.
>>>> - To simulate the application processing time, we use a configurable
>>>>     delay in usecs on the client side after a reply is received from 
>>>> the
>>>>     server.
>>>>
>>>> The xsk_rr tool is posted separately as an RFC for tools/testing/ 
>>>> selftest.
>>>>
>>>> We use this tool with following napi polling configurations,
>>>>
>>>> - Interrupts only
>>>> - SO_BUSYPOLL (inline in the same thread where the client receives the
>>>>     packet).
>>>> - SO_BUSYPOLL (separate thread and separate core)
>>>> - Threaded NAPI busypoll
>>>>
>>>> System is configured using following script in all 4 cases,
>>>>
>>>> ```
>>>> echo 0 | sudo tee /sys/class/net/eth0/threaded
>>>> echo 0 | sudo tee /proc/sys/kernel/timer_migration
>>>> echo off | sudo tee  /sys/devices/system/cpu/smt/control
>>>>
>>>> sudo ethtool -L eth0 rx 1 tx 1
>>>> sudo ethtool -G eth0 rx 1024
>>>>
>>>> echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
>>>> echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
>>>>
>>>>    # pin IRQs on CPU 2
>>>> IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
>>>>                                print arr[0]}' < /proc/interrupts)"
>>>> for irq in "${IRQS}"; \
>>>>        do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done
>>>>
>>>> echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
>>>>
>>>> for i in /sys/devices/virtual/workqueue/*/cpumask; \
>>>>                        do echo $i; echo 1,2,3,4,5,6 > $i; done
>>>>
>>>> if [[ -z "$1" ]]; then
>>>>     echo 400 | sudo tee /proc/sys/net/core/busy_read
>>>>     echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>>     echo 15000   | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>>> fi
>>>>
>>>> sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx- 
>>>> usecs 0
>>>>
>>>> if [[ "$1" == "enable_threaded" ]]; then
>>>>     echo 0 | sudo tee /proc/sys/net/core/busy_poll
>>>>     echo 0 | sudo tee /proc/sys/net/core/busy_read
>>>>     echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>>     echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>>>     echo 2 | sudo tee /sys/class/net/eth0/threaded
>>>>     NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
>>>>     sudo chrt -f  -p 50 $NAPI_T
>>>>
>>>>     # pin threaded poll thread to CPU 2
>>>>     sudo taskset -pc 2 $NAPI_T
>>>> fi
>>>>
>>>> if [[ "$1" == "enable_interrupt" ]]; then
>>>>     echo 0 | sudo tee /proc/sys/net/core/busy_read
>>>>     echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>>     echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>>> fi
>>>> ```
>>>
>>> The experiment script above does not work, because the sysfs parameter
>>> does not exist anymore in this version.
>>>
>>>> To enable various configurations, script can be run as following,
>>>>
>>>> - Interrupt Only
>>>>     ```
>>>>     <script> enable_interrupt
>>>>     ```
>>>>
>>>> - SO_BUSYPOLL (no arguments to script)
>>>>     ```
>>>>     <script>
>>>>     ```
>>>>
>>>> - NAPI threaded busypoll
>>>>     ```
>>>>     <script> enable_threaded
>>>>     ```
>>>>
>>>> If using idpf, the script needs to be run again after launching the
>>>> workload just to make sure that the configurations are not reverted. As
>>>> idpf reverts some configurations on software reset when AF_XDP program
>>>> is attached.
>>>>
>>>> Once configured, the workload is run with various configurations using
>>>> following commands. Set period (1/frequency) and delay in usecs to
>>>> produce results for packet frequency and application processing delay.
>>>>
>>>>    ## Interrupt Only and SO_BUSYPOLL (inline)
>>>>
>>>> - Server
>>>> ```
>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>        -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 - 
>>>> h -v
>>>> ```
>>>>
>>>> - Client
>>>> ```
>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>        -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>>        -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v
>>>> ```
>>>>
>>>>    ## SO_BUSYPOLL(done in separate core using recvfrom)
>>>>
>>>> Argument -t spawns a seprate thread and continuously calls recvfrom.
>>>>
>>>> - Server
>>>> ```
>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>        -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
>>>>        -h -v -t
>>>> ```
>>>>
>>>> - Client
>>>> ```
>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>        -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>>        -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -t
>>>> ```
>>>>
>>>>    ## NAPI Threaded Busy Poll
>>>>
>>>> Argument -n skips the recvfrom call as there is no recv kick needed.
>>>>
>>>> - Server
>>>> ```
>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>        -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
>>>>        -h -v -n
>>>> ```
>>>>
>>>> - Client
>>>> ```
>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>        -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>>        -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -n
>>>> ```
>>>
>>> I believe there's a bug when disabling busy-polled napi threading after
>>> an experiment. My system hangs and needs a hard reset.
>>>
>>>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | 
>>>> NAPI threaded |
>>>> |---|---|---|---|---|
>>>> | 12 Kpkt/s + 0us delay | | | | |
>>>> |  | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
>>>> |  | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
>>>> |  | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
>>>> |  | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
>>>> | 32 Kpkt/s + 30us delay | | | | |
>>>> |  | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
>>>> |  | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
>>>> |  | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
>>>> |  | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
>>>> | 125 Kpkt/s + 6us delay | | | | |
>>>> |  | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
>>>> |  | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
>>>> |  | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
>>>> |  | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
>>>> | 12 Kpkt/s + 78us delay | | | | |
>>>> |  | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
>>>> |  | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
>>>> |  | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
>>>> |  | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
>>>> | 25 Kpkt/s + 38us delay | | | | |
>>>> |  | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
>>>> |  | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
>>>> |  | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
>>>> |  | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
>>>
>>> On my system, routing the irq to same core where xsk_rr runs results in
>>> lower latency than routing the irq to a different core. To me that makes
>>> sense in a low-rate latency-sensitive scenario where interrupts are not
>>> causing much trouble, but the resulting locality might be beneficial. I
>>> think you should test this as well.
>>>
>>> The experiments reported above (except for the first one) are
>>> cherry-picking parameter combinations that result in a near-100% load
>>> and ignore anything else. Near-100% load is a highly unlikely scenario
>>> for a latency-sensitive workload.
>>>
>>> When combining the above two paragraphs, I believe other interesting
>>> setups are missing from the experiments, such as comparing to two pairs
>>> of xsk_rr under high load (as mentioned in my previous emails).
>> This is to support an existing real workload. We cannot easily modify
>> its threading model. The two xsk_rr model would be a different
>> workload.
> 
> That's fine, but:
> 
> - In principle I don't think it's a good justification for a kernel 
> change that an application cannot be rewritten.
> 
> - I believe it is your responsibility to more comprehensively document 
> the impact of your proposed changes beyond your one particular workload.> 
A few more observations from my tests for the "SO_BUSYPOLL(separate)" case:

- Using -t for the client reduces latency compared to -T.

- Using poll instead of recvfrom in xsk_rr in rx_polling_run() also 
reduces latency.

Best,
Martin


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v8 0/2] Add support to do threaded napi busy poll
  2025-08-29 22:19       ` Martin Karsten
@ 2025-08-29 22:25         ` Samiullah Khawaja
  2025-08-29 22:56           ` Martin Karsten
  0 siblings, 1 reply; 17+ messages in thread
From: Samiullah Khawaja @ 2025-08-29 22:25 UTC (permalink / raw)
  To: Martin Karsten
  Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, Joe Damato, netdev

On Fri, Aug 29, 2025 at 3:19 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>
> On 2025-08-29 14:08, Martin Karsten wrote:
> > On 2025-08-29 13:50, Samiullah Khawaja wrote:
> >> On Thu, Aug 28, 2025 at 8:15 PM Martin Karsten <mkarsten@uwaterloo.ca>
> >> wrote:
> >>>
> >>> On 2025-08-28 21:16, Samiullah Khawaja wrote:
> >>>> Extend the already existing support of threaded napi poll to do
> >>>> continuous
> >>>> busy polling.
> >>>>
> >>>> This is used for doing continuous polling of napi to fetch descriptors
> >>>> from backing RX/TX queues for low latency applications. Allow enabling
> >>>> of threaded busypoll using netlink so this can be enabled on a set of
> >>>> dedicated napis for low latency applications.
> >>>>
> >>>> Once enabled user can fetch the PID of the kthread doing NAPI polling
> >>>> and set affinity, priority and scheduler for it depending on the
> >>>> low-latency requirements.
> >>>>
> >>>> Extend the netlink interface to allow enabling/disabling threaded
> >>>> busypolling at individual napi level.
> >>>>
> >>>> We use this for our AF_XDP based hard low-latency usecase with usecs
> >>>> level latency requirement. For our usecase we want low jitter and
> >>>> stable
> >>>> latency at P99.
> >>>>
> >>>> Following is an analysis and comparison of available (and compatible)
> >>>> busy poll interfaces for a low latency usecase with stable P99. This
> >>>> can
> >>>> be suitable for applications that want very low latency at the expense
> >>>> of cpu usage and efficiency.
> >>>>
> >>>> Already existing APIs (SO_BUSYPOLL and epoll) allow busy polling a NAPI
> >>>> backing a socket, but the missing piece is a mechanism to busy poll a
> >>>> NAPI instance in a dedicated thread while ignoring available events or
> >>>> packets, regardless of the userspace API. Most existing mechanisms are
> >>>> designed to work in a pattern where you poll until new packets or
> >>>> events
> >>>> are received, after which userspace is expected to handle them.
> >>>>
> >>>> As a result, one has to hack together a solution using a mechanism
> >>>> intended to receive packets or events, not to simply NAPI poll. NAPI
> >>>> threaded busy polling, on the other hand, provides this capability
> >>>> natively, independent of any userspace API. This makes it really
> >>>> easy to
> >>>> setup and manage.
> >>>>
> >>>> For analysis we use an AF_XDP based benchmarking tool `xsk_rr`. The
> >>>> description of the tool and how it tries to simulate the real workload
> >>>> is following,
> >>>>
> >>>> - It sends UDP packets between 2 machines.
> >>>> - The client machine sends packets at a fixed frequency. To maintain
> >>>> the
> >>>>     frequency of the packet being sent, we use open-loop sampling.
> >>>> That is
> >>>>     the packets are sent in a separate thread.
> >>>> - The server replies to the packet inline by reading the pkt from the
> >>>>     recv ring and replies using the tx ring.
> >>>> - To simulate the application processing time, we use a configurable
> >>>>     delay in usecs on the client side after a reply is received from
> >>>> the
> >>>>     server.
> >>>>
> >>>> The xsk_rr tool is posted separately as an RFC for tools/testing/
> >>>> selftest.
> >>>>
> >>>> We use this tool with following napi polling configurations,
> >>>>
> >>>> - Interrupts only
> >>>> - SO_BUSYPOLL (inline in the same thread where the client receives the
> >>>>     packet).
> >>>> - SO_BUSYPOLL (separate thread and separate core)
> >>>> - Threaded NAPI busypoll
> >>>>
> >>>> System is configured using following script in all 4 cases,
> >>>>
> >>>> ```
> >>>> echo 0 | sudo tee /sys/class/net/eth0/threaded
> >>>> echo 0 | sudo tee /proc/sys/kernel/timer_migration
> >>>> echo off | sudo tee  /sys/devices/system/cpu/smt/control
> >>>>
> >>>> sudo ethtool -L eth0 rx 1 tx 1
> >>>> sudo ethtool -G eth0 rx 1024
> >>>>
> >>>> echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
> >>>> echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
> >>>>
> >>>>    # pin IRQs on CPU 2
> >>>> IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
> >>>>                                print arr[0]}' < /proc/interrupts)"
> >>>> for irq in "${IRQS}"; \
> >>>>        do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done
> >>>>
> >>>> echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
> >>>>
> >>>> for i in /sys/devices/virtual/workqueue/*/cpumask; \
> >>>>                        do echo $i; echo 1,2,3,4,5,6 > $i; done
> >>>>
> >>>> if [[ -z "$1" ]]; then
> >>>>     echo 400 | sudo tee /proc/sys/net/core/busy_read
> >>>>     echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> >>>>     echo 15000   | sudo tee /sys/class/net/eth0/gro_flush_timeout
> >>>> fi
> >>>>
> >>>> sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-
> >>>> usecs 0
> >>>>
> >>>> if [[ "$1" == "enable_threaded" ]]; then
> >>>>     echo 0 | sudo tee /proc/sys/net/core/busy_poll
> >>>>     echo 0 | sudo tee /proc/sys/net/core/busy_read
> >>>>     echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> >>>>     echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> >>>>     echo 2 | sudo tee /sys/class/net/eth0/threaded
> >>>>     NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
> >>>>     sudo chrt -f  -p 50 $NAPI_T
> >>>>
> >>>>     # pin threaded poll thread to CPU 2
> >>>>     sudo taskset -pc 2 $NAPI_T
> >>>> fi
> >>>>
> >>>> if [[ "$1" == "enable_interrupt" ]]; then
> >>>>     echo 0 | sudo tee /proc/sys/net/core/busy_read
> >>>>     echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> >>>>     echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> >>>> fi
> >>>> ```
> >>>
> >>> The experiment script above does not work, because the sysfs parameter
> >>> does not exist anymore in this version.
> >>>
> >>>> To enable various configurations, script can be run as following,
> >>>>
> >>>> - Interrupt Only
> >>>>     ```
> >>>>     <script> enable_interrupt
> >>>>     ```
> >>>>
> >>>> - SO_BUSYPOLL (no arguments to script)
> >>>>     ```
> >>>>     <script>
> >>>>     ```
> >>>>
> >>>> - NAPI threaded busypoll
> >>>>     ```
> >>>>     <script> enable_threaded
> >>>>     ```
> >>>>
> >>>> If using idpf, the script needs to be run again after launching the
> >>>> workload just to make sure that the configurations are not reverted. As
> >>>> idpf reverts some configurations on software reset when AF_XDP program
> >>>> is attached.
> >>>>
> >>>> Once configured, the workload is run with various configurations using
> >>>> following commands. Set period (1/frequency) and delay in usecs to
> >>>> produce results for packet frequency and application processing delay.
> >>>>
> >>>>    ## Interrupt Only and SO_BUSYPOLL (inline)
> >>>>
> >>>> - Server
> >>>> ```
> >>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>>        -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -
> >>>> h -v
> >>>> ```
> >>>>
> >>>> - Client
> >>>> ```
> >>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>>        -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> >>>>        -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v
> >>>> ```
> >>>>
> >>>>    ## SO_BUSYPOLL(done in separate core using recvfrom)
> >>>>
> >>>> Argument -t spawns a seprate thread and continuously calls recvfrom.
> >>>>
> >>>> - Server
> >>>> ```
> >>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>>        -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> >>>>        -h -v -t
> >>>> ```
> >>>>
> >>>> - Client
> >>>> ```
> >>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>>        -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> >>>>        -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -t
> >>>> ```
> >>>>
> >>>>    ## NAPI Threaded Busy Poll
> >>>>
> >>>> Argument -n skips the recvfrom call as there is no recv kick needed.
> >>>>
> >>>> - Server
> >>>> ```
> >>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>>        -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> >>>>        -h -v -n
> >>>> ```
> >>>>
> >>>> - Client
> >>>> ```
> >>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>>        -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> >>>>        -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -n
> >>>> ```
> >>>
> >>> I believe there's a bug when disabling busy-polled napi threading after
> >>> an experiment. My system hangs and needs a hard reset.
> >>>
> >>>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) |
> >>>> NAPI threaded |
> >>>> |---|---|---|---|---|
> >>>> | 12 Kpkt/s + 0us delay | | | | |
> >>>> |  | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
> >>>> |  | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
> >>>> |  | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
> >>>> |  | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
> >>>> | 32 Kpkt/s + 30us delay | | | | |
> >>>> |  | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
> >>>> |  | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
> >>>> |  | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
> >>>> |  | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
> >>>> | 125 Kpkt/s + 6us delay | | | | |
> >>>> |  | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
> >>>> |  | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
> >>>> |  | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
> >>>> |  | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
> >>>> | 12 Kpkt/s + 78us delay | | | | |
> >>>> |  | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
> >>>> |  | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
> >>>> |  | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
> >>>> |  | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
> >>>> | 25 Kpkt/s + 38us delay | | | | |
> >>>> |  | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
> >>>> |  | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
> >>>> |  | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
> >>>> |  | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
> >>>
> >>> On my system, routing the irq to same core where xsk_rr runs results in
> >>> lower latency than routing the irq to a different core. To me that makes
> >>> sense in a low-rate latency-sensitive scenario where interrupts are not
> >>> causing much trouble, but the resulting locality might be beneficial. I
> >>> think you should test this as well.
> >>>
> >>> The experiments reported above (except for the first one) are
> >>> cherry-picking parameter combinations that result in a near-100% load
> >>> and ignore anything else. Near-100% load is a highly unlikely scenario
> >>> for a latency-sensitive workload.
> >>>
> >>> When combining the above two paragraphs, I believe other interesting
> >>> setups are missing from the experiments, such as comparing to two pairs
> >>> of xsk_rr under high load (as mentioned in my previous emails).
> >> This is to support an existing real workload. We cannot easily modify
> >> its threading model. The two xsk_rr model would be a different
> >> workload.
> >
> > That's fine, but:
> >
> > - In principle I don't think it's a good justification for a kernel
> > change that an application cannot be rewritten.
> >
> > - I believe it is your responsibility to more comprehensively document
> > the impact of your proposed changes beyond your one particular workload.>
> A few more observations from my tests for the "SO_BUSYPOLL(separate)" case:
>
> - Using -t for the client reduces latency compared to -T.
That is understandable and also it is part of the data I presented. -t
means running the SO_BUSY_POLL in a separate thread. Removing -T would
invalidate the workload by making the rate unpredictable.
>
> - Using poll instead of recvfrom in xsk_rr in rx_polling_run() also
> reduces latency.
>
> Best,
> Martin
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v8 0/2] Add support to do threaded napi busy poll
  2025-08-29 21:27         ` Martin Karsten
@ 2025-08-29 22:27           ` Samiullah Khawaja
  0 siblings, 0 replies; 17+ messages in thread
From: Samiullah Khawaja @ 2025-08-29 22:27 UTC (permalink / raw)
  To: Martin Karsten
  Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, Joe Damato, netdev

On Fri, Aug 29, 2025 at 2:27 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>
> On 2025-08-29 16:49, Samiullah Khawaja wrote:
> > On Fri, Aug 29, 2025 at 11:08 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> >>
> >> On 2025-08-29 13:50, Samiullah Khawaja wrote:
> >>> On Thu, Aug 28, 2025 at 8:15 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> >>>>
> >>>> On 2025-08-28 21:16, Samiullah Khawaja wrote:
> >>>>> Extend the already existing support of threaded napi poll to do continuous
> >>>>> busy polling.
> >>>>>
> >>>>> This is used for doing continuous polling of napi to fetch descriptors
> >>>>> from backing RX/TX queues for low latency applications. Allow enabling
> >>>>> of threaded busypoll using netlink so this can be enabled on a set of
> >>>>> dedicated napis for low latency applications.
> >>>>>
> >>>>> Once enabled user can fetch the PID of the kthread doing NAPI polling
> >>>>> and set affinity, priority and scheduler for it depending on the
> >>>>> low-latency requirements.
> >>>>>
> >>>>> Extend the netlink interface to allow enabling/disabling threaded
> >>>>> busypolling at individual napi level.
> >>>>>
> >>>>> We use this for our AF_XDP based hard low-latency usecase with usecs
> >>>>> level latency requirement. For our usecase we want low jitter and stable
> >>>>> latency at P99.
> >>>>>
> >>>>> Following is an analysis and comparison of available (and compatible)
> >>>>> busy poll interfaces for a low latency usecase with stable P99. This can
> >>>>> be suitable for applications that want very low latency at the expense
> >>>>> of cpu usage and efficiency.
> >>>>>
> >>>>> Already existing APIs (SO_BUSYPOLL and epoll) allow busy polling a NAPI
> >>>>> backing a socket, but the missing piece is a mechanism to busy poll a
> >>>>> NAPI instance in a dedicated thread while ignoring available events or
> >>>>> packets, regardless of the userspace API. Most existing mechanisms are
> >>>>> designed to work in a pattern where you poll until new packets or events
> >>>>> are received, after which userspace is expected to handle them.
> >>>>>
> >>>>> As a result, one has to hack together a solution using a mechanism
> >>>>> intended to receive packets or events, not to simply NAPI poll. NAPI
> >>>>> threaded busy polling, on the other hand, provides this capability
> >>>>> natively, independent of any userspace API. This makes it really easy to
> >>>>> setup and manage.
> >>>>>
> >>>>> For analysis we use an AF_XDP based benchmarking tool `xsk_rr`. The
> >>>>> description of the tool and how it tries to simulate the real workload
> >>>>> is following,
> >>>>>
> >>>>> - It sends UDP packets between 2 machines.
> >>>>> - The client machine sends packets at a fixed frequency. To maintain the
> >>>>>      frequency of the packet being sent, we use open-loop sampling. That is
> >>>>>      the packets are sent in a separate thread.
> >>>>> - The server replies to the packet inline by reading the pkt from the
> >>>>>      recv ring and replies using the tx ring.
> >>>>> - To simulate the application processing time, we use a configurable
> >>>>>      delay in usecs on the client side after a reply is received from the
> >>>>>      server.
> >>>>>
> >>>>> The xsk_rr tool is posted separately as an RFC for tools/testing/selftest.
> >>>>>
> >>>>> We use this tool with following napi polling configurations,
> >>>>>
> >>>>> - Interrupts only
> >>>>> - SO_BUSYPOLL (inline in the same thread where the client receives the
> >>>>>      packet).
> >>>>> - SO_BUSYPOLL (separate thread and separate core)
> >>>>> - Threaded NAPI busypoll
> >>>>>
> >>>>> System is configured using following script in all 4 cases,
> >>>>>
> >>>>> ```
> >>>>> echo 0 | sudo tee /sys/class/net/eth0/threaded
> >>>>> echo 0 | sudo tee /proc/sys/kernel/timer_migration
> >>>>> echo off | sudo tee  /sys/devices/system/cpu/smt/control
> >>>>>
> >>>>> sudo ethtool -L eth0 rx 1 tx 1
> >>>>> sudo ethtool -G eth0 rx 1024
> >>>>>
> >>>>> echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
> >>>>> echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
> >>>>>
> >>>>>     # pin IRQs on CPU 2
> >>>>> IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
> >>>>>                                 print arr[0]}' < /proc/interrupts)"
> >>>>> for irq in "${IRQS}"; \
> >>>>>         do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done
> >>>>>
> >>>>> echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
> >>>>>
> >>>>> for i in /sys/devices/virtual/workqueue/*/cpumask; \
> >>>>>                         do echo $i; echo 1,2,3,4,5,6 > $i; done
> >>>>>
> >>>>> if [[ -z "$1" ]]; then
> >>>>>      echo 400 | sudo tee /proc/sys/net/core/busy_read
> >>>>>      echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> >>>>>      echo 15000   | sudo tee /sys/class/net/eth0/gro_flush_timeout
> >>>>> fi
> >>>>>
> >>>>> sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0
> >>>>>
> >>>>> if [[ "$1" == "enable_threaded" ]]; then
> >>>>>      echo 0 | sudo tee /proc/sys/net/core/busy_poll
> >>>>>      echo 0 | sudo tee /proc/sys/net/core/busy_read
> >>>>>      echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> >>>>>      echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> >>>>>      echo 2 | sudo tee /sys/class/net/eth0/threaded
> >>>>>      NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
> >>>>>      sudo chrt -f  -p 50 $NAPI_T
> >>>>>
> >>>>>      # pin threaded poll thread to CPU 2
> >>>>>      sudo taskset -pc 2 $NAPI_T
> >>>>> fi
> >>>>>
> >>>>> if [[ "$1" == "enable_interrupt" ]]; then
> >>>>>      echo 0 | sudo tee /proc/sys/net/core/busy_read
> >>>>>      echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> >>>>>      echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> >>>>> fi
> >>>>> ```
> >>>>
> >>>> The experiment script above does not work, because the sysfs parameter
> >>>> does not exist anymore in this version.
> >>>>
> >>>>> To enable various configurations, script can be run as following,
> >>>>>
> >>>>> - Interrupt Only
> >>>>>      ```
> >>>>>      <script> enable_interrupt
> >>>>>      ```
> >>>>>
> >>>>> - SO_BUSYPOLL (no arguments to script)
> >>>>>      ```
> >>>>>      <script>
> >>>>>      ```
> >>>>>
> >>>>> - NAPI threaded busypoll
> >>>>>      ```
> >>>>>      <script> enable_threaded
> >>>>>      ```
> >>>>>
> >>>>> If using idpf, the script needs to be run again after launching the
> >>>>> workload just to make sure that the configurations are not reverted. As
> >>>>> idpf reverts some configurations on software reset when AF_XDP program
> >>>>> is attached.
> >>>>>
> >>>>> Once configured, the workload is run with various configurations using
> >>>>> following commands. Set period (1/frequency) and delay in usecs to
> >>>>> produce results for packet frequency and application processing delay.
> >>>>>
> >>>>>     ## Interrupt Only and SO_BUSYPOLL (inline)
> >>>>>
> >>>>> - Server
> >>>>> ```
> >>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>>>         -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v
> >>>>> ```
> >>>>>
> >>>>> - Client
> >>>>> ```
> >>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>>>         -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> >>>>>         -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v
> >>>>> ```
> >>>>>
> >>>>>     ## SO_BUSYPOLL(done in separate core using recvfrom)
> >>>>>
> >>>>> Argument -t spawns a seprate thread and continuously calls recvfrom.
> >>>>>
> >>>>> - Server
> >>>>> ```
> >>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>>>         -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> >>>>>         -h -v -t
> >>>>> ```
> >>>>>
> >>>>> - Client
> >>>>> ```
> >>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>>>         -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> >>>>>         -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -t
> >>>>> ```
> >>>>>
> >>>>>     ## NAPI Threaded Busy Poll
> >>>>>
> >>>>> Argument -n skips the recvfrom call as there is no recv kick needed.
> >>>>>
> >>>>> - Server
> >>>>> ```
> >>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>>>         -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> >>>>>         -h -v -n
> >>>>> ```
> >>>>>
> >>>>> - Client
> >>>>> ```
> >>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>>>         -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> >>>>>         -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -n
> >>>>> ```
> >>>>
> >>>> I believe there's a bug when disabling busy-polled napi threading after
> >>>> an experiment. My system hangs and needs a hard reset.
> >>>>
> >>>>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
> >>>>> |---|---|---|---|---|
> >>>>> | 12 Kpkt/s + 0us delay | | | | |
> >>>>> |  | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
> >>>>> |  | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
> >>>>> |  | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
> >>>>> |  | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
> >>>>> | 32 Kpkt/s + 30us delay | | | | |
> >>>>> |  | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
> >>>>> |  | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
> >>>>> |  | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
> >>>>> |  | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
> >>>>> | 125 Kpkt/s + 6us delay | | | | |
> >>>>> |  | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
> >>>>> |  | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
> >>>>> |  | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
> >>>>> |  | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
> >>>>> | 12 Kpkt/s + 78us delay | | | | |
> >>>>> |  | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
> >>>>> |  | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
> >>>>> |  | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
> >>>>> |  | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
> >>>>> | 25 Kpkt/s + 38us delay | | | | |
> >>>>> |  | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
> >>>>> |  | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
> >>>>> |  | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
> >>>>> |  | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
> >>>>
> >>>> On my system, routing the irq to same core where xsk_rr runs results in
> >>>> lower latency than routing the irq to a different core. To me that makes
> >>>> sense in a low-rate latency-sensitive scenario where interrupts are not
> >>>> causing much trouble, but the resulting locality might be beneficial. I
> >>>> think you should test this as well.
> >>>>
> >>>> The experiments reported above (except for the first one) are
> >>>> cherry-picking parameter combinations that result in a near-100% load
> >>>> and ignore anything else. Near-100% load is a highly unlikely scenario
> >>>> for a latency-sensitive workload.
> >>>>
> >>>> When combining the above two paragraphs, I believe other interesting
> >>>> setups are missing from the experiments, such as comparing to two pairs
> >>>> of xsk_rr under high load (as mentioned in my previous emails).
> >>> This is to support an existing real workload. We cannot easily modify
> >>> its threading model. The two xsk_rr model would be a different
> >>> workload.
> >>
> >> That's fine, but:
> >>
> >> - In principle I don't think it's a good justification for a kernel
> >> change that an application cannot be rewritten.
> >>
> >> - I believe it is your responsibility to more comprehensively document
> >> the impact of your proposed changes beyond your one particular workload.
> >>
> >> Also, I do believe there's a bug as mentioned before. I can't quite pin
> >> it down, but every time after running a "NAPI threaded" experiment, my
> >> servers enters a funny state and eventually becomes largely unresponsive
> >> without much useful output and needs a hard reset. For example:
> >>
> >> 1) Run "NAPI threaded" experiment
> >> 2) Disabled "threaded" parameter in NAPI config
> >> 3) Run IRQ experiment -> xsk_rr hangs and apparently holds a lock,
> >> because other services stop working successively.
> > I just tried with this scenario and it seems to work fine.
>
> Ok. I've reproduced it more concisely. This is after a fresh reboot:
>
> sudo ethtool -L ens15f1np1 combined 1
>
> sudo net-next/tools/net/ynl/pyynl/cli.py --no-schema --output-json\
>   --spec net-next/Documentation/netlink/specs/netdev.yaml --do napi-set\
>   --json='{"id": 8209, "threaded": "busy-poll-enabled"}'
>
> # ping from another machine to this NIC works
> # napi thread busy at 100%
>
> sudo net-next/tools/net/ynl/pyynl/cli.py --no-schema --output-json\
>   --spec net-next/Documentation/netlink/specs/netdev.yaml --do napi-set\
>   --json='{"id": 8209, "threaded": "disabled"}'
>
> # napi thread gone
> # ping from another machine does not work
> # tcpdump does not show incoming icmp packets
> # but machine still responsive on other NIC
>
> sudo ethtool -L ens15f1np1 combined 12
Ok I have found it. It's related to stopping the kthreads. Will send a
revision out.
>
> # networking hangs on all NICs
> # sudo reboot on console hangs
> # hard reset needed, no useful output
> >> Do you not have this problem?
> > Not Really. Jakub actually fixed a deadlock in napi threaded recently.
> > Maybe you are hitting that? Are you using the latest base-commit that
> > I have in this patch series?
>
> Yep:
> - Ubuntu 24.04.3 LTS system
> - base commit before patches is c3199adbe4ffffc7b6536715e0290d1919a45cd9
> - NIC driver is ice, PCI id 8086:159b.
>
> Let me know, if you need any other information?
>
> Best,
> Martin

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v8 0/2] Add support to do threaded napi busy poll
  2025-08-29 22:25         ` Samiullah Khawaja
@ 2025-08-29 22:56           ` Martin Karsten
  2025-08-29 23:31             ` Samiullah Khawaja
  0 siblings, 1 reply; 17+ messages in thread
From: Martin Karsten @ 2025-08-29 22:56 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, Joe Damato, netdev

On 2025-08-29 18:25, Samiullah Khawaja wrote:
> On Fri, Aug 29, 2025 at 3:19 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>>
>> On 2025-08-29 14:08, Martin Karsten wrote:
>>> On 2025-08-29 13:50, Samiullah Khawaja wrote:
>>>> On Thu, Aug 28, 2025 at 8:15 PM Martin Karsten <mkarsten@uwaterloo.ca>
>>>> wrote:
>>>>>
>>>>> On 2025-08-28 21:16, Samiullah Khawaja wrote:
>>>>>> Extend the already existing support of threaded napi poll to do
>>>>>> continuous
>>>>>> busy polling.
>>>>>>
>>>>>> This is used for doing continuous polling of napi to fetch descriptors
>>>>>> from backing RX/TX queues for low latency applications. Allow enabling
>>>>>> of threaded busypoll using netlink so this can be enabled on a set of
>>>>>> dedicated napis for low latency applications.
>>>>>>
>>>>>> Once enabled user can fetch the PID of the kthread doing NAPI polling
>>>>>> and set affinity, priority and scheduler for it depending on the
>>>>>> low-latency requirements.
>>>>>>
>>>>>> Extend the netlink interface to allow enabling/disabling threaded
>>>>>> busypolling at individual napi level.
>>>>>>
>>>>>> We use this for our AF_XDP based hard low-latency usecase with usecs
>>>>>> level latency requirement. For our usecase we want low jitter and
>>>>>> stable
>>>>>> latency at P99.
>>>>>>
>>>>>> Following is an analysis and comparison of available (and compatible)
>>>>>> busy poll interfaces for a low latency usecase with stable P99. This
>>>>>> can
>>>>>> be suitable for applications that want very low latency at the expense
>>>>>> of cpu usage and efficiency.
>>>>>>
>>>>>> Already existing APIs (SO_BUSYPOLL and epoll) allow busy polling a NAPI
>>>>>> backing a socket, but the missing piece is a mechanism to busy poll a
>>>>>> NAPI instance in a dedicated thread while ignoring available events or
>>>>>> packets, regardless of the userspace API. Most existing mechanisms are
>>>>>> designed to work in a pattern where you poll until new packets or
>>>>>> events
>>>>>> are received, after which userspace is expected to handle them.
>>>>>>
>>>>>> As a result, one has to hack together a solution using a mechanism
>>>>>> intended to receive packets or events, not to simply NAPI poll. NAPI
>>>>>> threaded busy polling, on the other hand, provides this capability
>>>>>> natively, independent of any userspace API. This makes it really
>>>>>> easy to
>>>>>> setup and manage.
>>>>>>
>>>>>> For analysis we use an AF_XDP based benchmarking tool `xsk_rr`. The
>>>>>> description of the tool and how it tries to simulate the real workload
>>>>>> is following,
>>>>>>
>>>>>> - It sends UDP packets between 2 machines.
>>>>>> - The client machine sends packets at a fixed frequency. To maintain
>>>>>> the
>>>>>>      frequency of the packet being sent, we use open-loop sampling.
>>>>>> That is
>>>>>>      the packets are sent in a separate thread.
>>>>>> - The server replies to the packet inline by reading the pkt from the
>>>>>>      recv ring and replies using the tx ring.
>>>>>> - To simulate the application processing time, we use a configurable
>>>>>>      delay in usecs on the client side after a reply is received from
>>>>>> the
>>>>>>      server.
>>>>>>
>>>>>> The xsk_rr tool is posted separately as an RFC for tools/testing/
>>>>>> selftest.
>>>>>>
>>>>>> We use this tool with following napi polling configurations,
>>>>>>
>>>>>> - Interrupts only
>>>>>> - SO_BUSYPOLL (inline in the same thread where the client receives the
>>>>>>      packet).
>>>>>> - SO_BUSYPOLL (separate thread and separate core)
>>>>>> - Threaded NAPI busypoll
>>>>>>
>>>>>> System is configured using following script in all 4 cases,
>>>>>>
>>>>>> ```
>>>>>> echo 0 | sudo tee /sys/class/net/eth0/threaded
>>>>>> echo 0 | sudo tee /proc/sys/kernel/timer_migration
>>>>>> echo off | sudo tee  /sys/devices/system/cpu/smt/control
>>>>>>
>>>>>> sudo ethtool -L eth0 rx 1 tx 1
>>>>>> sudo ethtool -G eth0 rx 1024
>>>>>>
>>>>>> echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
>>>>>> echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
>>>>>>
>>>>>>     # pin IRQs on CPU 2
>>>>>> IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
>>>>>>                                 print arr[0]}' < /proc/interrupts)"
>>>>>> for irq in "${IRQS}"; \
>>>>>>         do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done
>>>>>>
>>>>>> echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
>>>>>>
>>>>>> for i in /sys/devices/virtual/workqueue/*/cpumask; \
>>>>>>                         do echo $i; echo 1,2,3,4,5,6 > $i; done
>>>>>>
>>>>>> if [[ -z "$1" ]]; then
>>>>>>      echo 400 | sudo tee /proc/sys/net/core/busy_read
>>>>>>      echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>>>>      echo 15000   | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>>>>> fi
>>>>>>
>>>>>> sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-
>>>>>> usecs 0
>>>>>>
>>>>>> if [[ "$1" == "enable_threaded" ]]; then
>>>>>>      echo 0 | sudo tee /proc/sys/net/core/busy_poll
>>>>>>      echo 0 | sudo tee /proc/sys/net/core/busy_read
>>>>>>      echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>>>>      echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>>>>>      echo 2 | sudo tee /sys/class/net/eth0/threaded
>>>>>>      NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
>>>>>>      sudo chrt -f  -p 50 $NAPI_T
>>>>>>
>>>>>>      # pin threaded poll thread to CPU 2
>>>>>>      sudo taskset -pc 2 $NAPI_T
>>>>>> fi
>>>>>>
>>>>>> if [[ "$1" == "enable_interrupt" ]]; then
>>>>>>      echo 0 | sudo tee /proc/sys/net/core/busy_read
>>>>>>      echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>>>>      echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>>>>> fi
>>>>>> ```
>>>>>
>>>>> The experiment script above does not work, because the sysfs parameter
>>>>> does not exist anymore in this version.
>>>>>
>>>>>> To enable various configurations, script can be run as following,
>>>>>>
>>>>>> - Interrupt Only
>>>>>>      ```
>>>>>>      <script> enable_interrupt
>>>>>>      ```
>>>>>>
>>>>>> - SO_BUSYPOLL (no arguments to script)
>>>>>>      ```
>>>>>>      <script>
>>>>>>      ```
>>>>>>
>>>>>> - NAPI threaded busypoll
>>>>>>      ```
>>>>>>      <script> enable_threaded
>>>>>>      ```
>>>>>>
>>>>>> If using idpf, the script needs to be run again after launching the
>>>>>> workload just to make sure that the configurations are not reverted. As
>>>>>> idpf reverts some configurations on software reset when AF_XDP program
>>>>>> is attached.
>>>>>>
>>>>>> Once configured, the workload is run with various configurations using
>>>>>> following commands. Set period (1/frequency) and delay in usecs to
>>>>>> produce results for packet frequency and application processing delay.
>>>>>>
>>>>>>     ## Interrupt Only and SO_BUSYPOLL (inline)
>>>>>>
>>>>>> - Server
>>>>>> ```
>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>>         -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -
>>>>>> h -v
>>>>>> ```
>>>>>>
>>>>>> - Client
>>>>>> ```
>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>>         -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>>>>         -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v
>>>>>> ```
>>>>>>
>>>>>>     ## SO_BUSYPOLL(done in separate core using recvfrom)
>>>>>>
>>>>>> Argument -t spawns a seprate thread and continuously calls recvfrom.
>>>>>>
>>>>>> - Server
>>>>>> ```
>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>>         -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
>>>>>>         -h -v -t
>>>>>> ```
>>>>>>
>>>>>> - Client
>>>>>> ```
>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>>         -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>>>>         -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -t
>>>>>> ```

see below
>>>>>>     ## NAPI Threaded Busy Poll
>>>>>>
>>>>>> Argument -n skips the recvfrom call as there is no recv kick needed.
>>>>>>
>>>>>> - Server
>>>>>> ```
>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>>         -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
>>>>>>         -h -v -n
>>>>>> ```
>>>>>>
>>>>>> - Client
>>>>>> ```
>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>>         -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>>>>         -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -n
>>>>>> ```

see below
>>>>> I believe there's a bug when disabling busy-polled napi threading after
>>>>> an experiment. My system hangs and needs a hard reset.
>>>>>
>>>>>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) |
>>>>>> NAPI threaded |
>>>>>> |---|---|---|---|---|
>>>>>> | 12 Kpkt/s + 0us delay | | | | |
>>>>>> |  | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
>>>>>> |  | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
>>>>>> |  | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
>>>>>> |  | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
>>>>>> | 32 Kpkt/s + 30us delay | | | | |
>>>>>> |  | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
>>>>>> |  | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
>>>>>> |  | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
>>>>>> |  | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
>>>>>> | 125 Kpkt/s + 6us delay | | | | |
>>>>>> |  | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
>>>>>> |  | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
>>>>>> |  | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
>>>>>> |  | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
>>>>>> | 12 Kpkt/s + 78us delay | | | | |
>>>>>> |  | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
>>>>>> |  | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
>>>>>> |  | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
>>>>>> |  | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
>>>>>> | 25 Kpkt/s + 38us delay | | | | |
>>>>>> |  | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
>>>>>> |  | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
>>>>>> |  | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
>>>>>> |  | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
>>>>>
>>>>> On my system, routing the irq to same core where xsk_rr runs results in
>>>>> lower latency than routing the irq to a different core. To me that makes
>>>>> sense in a low-rate latency-sensitive scenario where interrupts are not
>>>>> causing much trouble, but the resulting locality might be beneficial. I
>>>>> think you should test this as well.
>>>>>
>>>>> The experiments reported above (except for the first one) are
>>>>> cherry-picking parameter combinations that result in a near-100% load
>>>>> and ignore anything else. Near-100% load is a highly unlikely scenario
>>>>> for a latency-sensitive workload.
>>>>>
>>>>> When combining the above two paragraphs, I believe other interesting
>>>>> setups are missing from the experiments, such as comparing to two pairs
>>>>> of xsk_rr under high load (as mentioned in my previous emails).
>>>> This is to support an existing real workload. We cannot easily modify
>>>> its threading model. The two xsk_rr model would be a different
>>>> workload.
>>>
>>> That's fine, but:
>>>
>>> - In principle I don't think it's a good justification for a kernel
>>> change that an application cannot be rewritten.
>>>
>>> - I believe it is your responsibility to more comprehensively document
>>> the impact of your proposed changes beyond your one particular workload.>
>> A few more observations from my tests for the "SO_BUSYPOLL(separate)" case:
>>
>> - Using -t for the client reduces latency compared to -T.
> That is understandable and also it is part of the data I presented. -t
> means running the SO_BUSY_POLL in a separate thread. Removing -T would
> invalidate the workload by making the rate unpredictable.

That's another problem with your cover letter then. The experiment as 
described should match the data presented. See above.

>> - Using poll instead of recvfrom in xsk_rr in rx_polling_run() also
>> reduces latency.

Any thoughts on this one?

Best,
Martin


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v8 0/2] Add support to do threaded napi busy poll
  2025-08-29 22:56           ` Martin Karsten
@ 2025-08-29 23:31             ` Samiullah Khawaja
  2025-08-29 23:37               ` Martin Karsten
  0 siblings, 1 reply; 17+ messages in thread
From: Samiullah Khawaja @ 2025-08-29 23:31 UTC (permalink / raw)
  To: Martin Karsten
  Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, Joe Damato, netdev

On Fri, Aug 29, 2025 at 3:56 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>
> On 2025-08-29 18:25, Samiullah Khawaja wrote:
> > On Fri, Aug 29, 2025 at 3:19 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> >>
> >> On 2025-08-29 14:08, Martin Karsten wrote:
> >>> On 2025-08-29 13:50, Samiullah Khawaja wrote:
> >>>> On Thu, Aug 28, 2025 at 8:15 PM Martin Karsten <mkarsten@uwaterloo.ca>
> >>>> wrote:
> >>>>>
> >>>>> On 2025-08-28 21:16, Samiullah Khawaja wrote:
> >>>>>> Extend the already existing support of threaded napi poll to do
> >>>>>> continuous
> >>>>>> busy polling.
> >>>>>>
> >>>>>> This is used for doing continuous polling of napi to fetch descriptors
> >>>>>> from backing RX/TX queues for low latency applications. Allow enabling
> >>>>>> of threaded busypoll using netlink so this can be enabled on a set of
> >>>>>> dedicated napis for low latency applications.
> >>>>>>
> >>>>>> Once enabled user can fetch the PID of the kthread doing NAPI polling
> >>>>>> and set affinity, priority and scheduler for it depending on the
> >>>>>> low-latency requirements.
> >>>>>>
> >>>>>> Extend the netlink interface to allow enabling/disabling threaded
> >>>>>> busypolling at individual napi level.
> >>>>>>
> >>>>>> We use this for our AF_XDP based hard low-latency usecase with usecs
> >>>>>> level latency requirement. For our usecase we want low jitter and
> >>>>>> stable
> >>>>>> latency at P99.
> >>>>>>
> >>>>>> Following is an analysis and comparison of available (and compatible)
> >>>>>> busy poll interfaces for a low latency usecase with stable P99. This
> >>>>>> can
> >>>>>> be suitable for applications that want very low latency at the expense
> >>>>>> of cpu usage and efficiency.
> >>>>>>
> >>>>>> Already existing APIs (SO_BUSYPOLL and epoll) allow busy polling a NAPI
> >>>>>> backing a socket, but the missing piece is a mechanism to busy poll a
> >>>>>> NAPI instance in a dedicated thread while ignoring available events or
> >>>>>> packets, regardless of the userspace API. Most existing mechanisms are
> >>>>>> designed to work in a pattern where you poll until new packets or
> >>>>>> events
> >>>>>> are received, after which userspace is expected to handle them.
> >>>>>>
> >>>>>> As a result, one has to hack together a solution using a mechanism
> >>>>>> intended to receive packets or events, not to simply NAPI poll. NAPI
> >>>>>> threaded busy polling, on the other hand, provides this capability
> >>>>>> natively, independent of any userspace API. This makes it really
> >>>>>> easy to
> >>>>>> setup and manage.
> >>>>>>
> >>>>>> For analysis we use an AF_XDP based benchmarking tool `xsk_rr`. The
> >>>>>> description of the tool and how it tries to simulate the real workload
> >>>>>> is following,
> >>>>>>
> >>>>>> - It sends UDP packets between 2 machines.
> >>>>>> - The client machine sends packets at a fixed frequency. To maintain
> >>>>>> the
> >>>>>>      frequency of the packet being sent, we use open-loop sampling.
> >>>>>> That is
> >>>>>>      the packets are sent in a separate thread.
> >>>>>> - The server replies to the packet inline by reading the pkt from the
> >>>>>>      recv ring and replies using the tx ring.
> >>>>>> - To simulate the application processing time, we use a configurable
> >>>>>>      delay in usecs on the client side after a reply is received from
> >>>>>> the
> >>>>>>      server.
> >>>>>>
> >>>>>> The xsk_rr tool is posted separately as an RFC for tools/testing/
> >>>>>> selftest.
> >>>>>>
> >>>>>> We use this tool with following napi polling configurations,
> >>>>>>
> >>>>>> - Interrupts only
> >>>>>> - SO_BUSYPOLL (inline in the same thread where the client receives the
> >>>>>>      packet).
> >>>>>> - SO_BUSYPOLL (separate thread and separate core)
> >>>>>> - Threaded NAPI busypoll
> >>>>>>
> >>>>>> System is configured using following script in all 4 cases,
> >>>>>>
> >>>>>> ```
> >>>>>> echo 0 | sudo tee /sys/class/net/eth0/threaded
> >>>>>> echo 0 | sudo tee /proc/sys/kernel/timer_migration
> >>>>>> echo off | sudo tee  /sys/devices/system/cpu/smt/control
> >>>>>>
> >>>>>> sudo ethtool -L eth0 rx 1 tx 1
> >>>>>> sudo ethtool -G eth0 rx 1024
> >>>>>>
> >>>>>> echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
> >>>>>> echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
> >>>>>>
> >>>>>>     # pin IRQs on CPU 2
> >>>>>> IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
> >>>>>>                                 print arr[0]}' < /proc/interrupts)"
> >>>>>> for irq in "${IRQS}"; \
> >>>>>>         do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done
> >>>>>>
> >>>>>> echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
> >>>>>>
> >>>>>> for i in /sys/devices/virtual/workqueue/*/cpumask; \
> >>>>>>                         do echo $i; echo 1,2,3,4,5,6 > $i; done
> >>>>>>
> >>>>>> if [[ -z "$1" ]]; then
> >>>>>>      echo 400 | sudo tee /proc/sys/net/core/busy_read
> >>>>>>      echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> >>>>>>      echo 15000   | sudo tee /sys/class/net/eth0/gro_flush_timeout
> >>>>>> fi
> >>>>>>
> >>>>>> sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-
> >>>>>> usecs 0
> >>>>>>
> >>>>>> if [[ "$1" == "enable_threaded" ]]; then
> >>>>>>      echo 0 | sudo tee /proc/sys/net/core/busy_poll
> >>>>>>      echo 0 | sudo tee /proc/sys/net/core/busy_read
> >>>>>>      echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> >>>>>>      echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> >>>>>>      echo 2 | sudo tee /sys/class/net/eth0/threaded
> >>>>>>      NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
> >>>>>>      sudo chrt -f  -p 50 $NAPI_T
> >>>>>>
> >>>>>>      # pin threaded poll thread to CPU 2
> >>>>>>      sudo taskset -pc 2 $NAPI_T
> >>>>>> fi
> >>>>>>
> >>>>>> if [[ "$1" == "enable_interrupt" ]]; then
> >>>>>>      echo 0 | sudo tee /proc/sys/net/core/busy_read
> >>>>>>      echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> >>>>>>      echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> >>>>>> fi
> >>>>>> ```
> >>>>>
> >>>>> The experiment script above does not work, because the sysfs parameter
> >>>>> does not exist anymore in this version.
> >>>>>
> >>>>>> To enable various configurations, script can be run as following,
> >>>>>>
> >>>>>> - Interrupt Only
> >>>>>>      ```
> >>>>>>      <script> enable_interrupt
> >>>>>>      ```
> >>>>>>
> >>>>>> - SO_BUSYPOLL (no arguments to script)
> >>>>>>      ```
> >>>>>>      <script>
> >>>>>>      ```
> >>>>>>
> >>>>>> - NAPI threaded busypoll
> >>>>>>      ```
> >>>>>>      <script> enable_threaded
> >>>>>>      ```
> >>>>>>
> >>>>>> If using idpf, the script needs to be run again after launching the
> >>>>>> workload just to make sure that the configurations are not reverted. As
> >>>>>> idpf reverts some configurations on software reset when AF_XDP program
> >>>>>> is attached.
> >>>>>>
> >>>>>> Once configured, the workload is run with various configurations using
> >>>>>> following commands. Set period (1/frequency) and delay in usecs to
> >>>>>> produce results for packet frequency and application processing delay.
> >>>>>>
> >>>>>>     ## Interrupt Only and SO_BUSYPOLL (inline)
> >>>>>>
> >>>>>> - Server
> >>>>>> ```
> >>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>>>>         -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -
> >>>>>> h -v
> >>>>>> ```
> >>>>>>
> >>>>>> - Client
> >>>>>> ```
> >>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>>>>         -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> >>>>>>         -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v
> >>>>>> ```
> >>>>>>
> >>>>>>     ## SO_BUSYPOLL(done in separate core using recvfrom)
Defines this test case clearly here.
> >>>>>>
> >>>>>> Argument -t spawns a seprate thread and continuously calls recvfrom.
This defines the -t argument and clearly states that it spawns the
separate thread.
> >>>>>>
> >>>>>> - Server
> >>>>>> ```
> >>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>>>>         -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> >>>>>>         -h -v -t
> >>>>>> ```
> >>>>>>
> >>>>>> - Client
> >>>>>> ```
> >>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>>>>         -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> >>>>>>         -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -t
> >>>>>> ```
>
> see below
> >>>>>>     ## NAPI Threaded Busy Poll
Section for NAPI Threaded Busy Poll scenario
> >>>>>>
> >>>>>> Argument -n skips the recvfrom call as there is no recv kick needed.
States -n argument and defines it.
> >>>>>>
> >>>>>> - Server
> >>>>>> ```
> >>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>>>>         -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> >>>>>>         -h -v -n
> >>>>>> ```
> >>>>>>
> >>>>>> - Client
> >>>>>> ```
> >>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>>>>         -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> >>>>>>         -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -n
> >>>>>> ```
>
> see below
> >>>>> I believe there's a bug when disabling busy-polled napi threading after
> >>>>> an experiment. My system hangs and needs a hard reset.
> >>>>>
> >>>>>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) |
> >>>>>> NAPI threaded |
> >>>>>> |---|---|---|---|---|
> >>>>>> | 12 Kpkt/s + 0us delay | | | | |
> >>>>>> |  | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
> >>>>>> |  | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
> >>>>>> |  | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
> >>>>>> |  | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
> >>>>>> | 32 Kpkt/s + 30us delay | | | | |
> >>>>>> |  | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
> >>>>>> |  | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
> >>>>>> |  | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
> >>>>>> |  | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
> >>>>>> | 125 Kpkt/s + 6us delay | | | | |
> >>>>>> |  | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
> >>>>>> |  | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
> >>>>>> |  | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
> >>>>>> |  | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
> >>>>>> | 12 Kpkt/s + 78us delay | | | | |
> >>>>>> |  | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
> >>>>>> |  | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
> >>>>>> |  | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
> >>>>>> |  | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
> >>>>>> | 25 Kpkt/s + 38us delay | | | | |
> >>>>>> |  | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
> >>>>>> |  | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
> >>>>>> |  | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
> >>>>>> |  | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
> >>>>>
> >>>>> On my system, routing the irq to same core where xsk_rr runs results in
> >>>>> lower latency than routing the irq to a different core. To me that makes
> >>>>> sense in a low-rate latency-sensitive scenario where interrupts are not
> >>>>> causing much trouble, but the resulting locality might be beneficial. I
> >>>>> think you should test this as well.
> >>>>>
> >>>>> The experiments reported above (except for the first one) are
> >>>>> cherry-picking parameter combinations that result in a near-100% load
> >>>>> and ignore anything else. Near-100% load is a highly unlikely scenario
> >>>>> for a latency-sensitive workload.
> >>>>>
> >>>>> When combining the above two paragraphs, I believe other interesting
> >>>>> setups are missing from the experiments, such as comparing to two pairs
> >>>>> of xsk_rr under high load (as mentioned in my previous emails).
> >>>> This is to support an existing real workload. We cannot easily modify
> >>>> its threading model. The two xsk_rr model would be a different
> >>>> workload.
> >>>
> >>> That's fine, but:
> >>>
> >>> - In principle I don't think it's a good justification for a kernel
> >>> change that an application cannot be rewritten.
> >>>
> >>> - I believe it is your responsibility to more comprehensively document
> >>> the impact of your proposed changes beyond your one particular workload.>
> >> A few more observations from my tests for the "SO_BUSYPOLL(separate)" case:
> >>
> >> - Using -t for the client reduces latency compared to -T.
> > That is understandable and also it is part of the data I presented. -t
> > means running the SO_BUSY_POLL in a separate thread. Removing -T would
> > invalidate the workload by making the rate unpredictable.
>
> That's another problem with your cover letter then. The experiment as
> described should match the data presented. See above.
The experiments are described clearly. I have pointed out the areas in
the cover letter where these are documented. Where is the mismatch?
>
> >> - Using poll instead of recvfrom in xsk_rr in rx_polling_run() also
> >> reduces latency.
>
> Any thoughts on this one?
I think we discussed this already in the previous iteration, with
Stanislav, and how it will suffer the same way SO_BUSYPOLL suffers. As
I have already stated, for my workload every microsecond matters and
the CPU efficiency is not an issue.
>
> Best,
> Martin
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v8 0/2] Add support to do threaded napi busy poll
  2025-08-29 23:31             ` Samiullah Khawaja
@ 2025-08-29 23:37               ` Martin Karsten
  2025-08-30  0:21                 ` Samiullah Khawaja
  0 siblings, 1 reply; 17+ messages in thread
From: Martin Karsten @ 2025-08-29 23:37 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, Joe Damato, netdev

On 2025-08-29 19:31, Samiullah Khawaja wrote:
> On Fri, Aug 29, 2025 at 3:56 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>>
>> On 2025-08-29 18:25, Samiullah Khawaja wrote:
>>> On Fri, Aug 29, 2025 at 3:19 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>>>>
>>>> On 2025-08-29 14:08, Martin Karsten wrote:
>>>>> On 2025-08-29 13:50, Samiullah Khawaja wrote:
>>>>>> On Thu, Aug 28, 2025 at 8:15 PM Martin Karsten <mkarsten@uwaterloo.ca>
>>>>>> wrote:
>>>>>>>
>>>>>>> On 2025-08-28 21:16, Samiullah Khawaja wrote:
>>>>>>>> Extend the already existing support of threaded napi poll to do
>>>>>>>> continuous
>>>>>>>> busy polling.
>>>>>>>>
>>>>>>>> This is used for doing continuous polling of napi to fetch descriptors
>>>>>>>> from backing RX/TX queues for low latency applications. Allow enabling
>>>>>>>> of threaded busypoll using netlink so this can be enabled on a set of
>>>>>>>> dedicated napis for low latency applications.
>>>>>>>>
>>>>>>>> Once enabled user can fetch the PID of the kthread doing NAPI polling
>>>>>>>> and set affinity, priority and scheduler for it depending on the
>>>>>>>> low-latency requirements.
>>>>>>>>
>>>>>>>> Extend the netlink interface to allow enabling/disabling threaded
>>>>>>>> busypolling at individual napi level.
>>>>>>>>
>>>>>>>> We use this for our AF_XDP based hard low-latency usecase with usecs
>>>>>>>> level latency requirement. For our usecase we want low jitter and
>>>>>>>> stable
>>>>>>>> latency at P99.
>>>>>>>>
>>>>>>>> Following is an analysis and comparison of available (and compatible)
>>>>>>>> busy poll interfaces for a low latency usecase with stable P99. This
>>>>>>>> can
>>>>>>>> be suitable for applications that want very low latency at the expense
>>>>>>>> of cpu usage and efficiency.
>>>>>>>>
>>>>>>>> Already existing APIs (SO_BUSYPOLL and epoll) allow busy polling a NAPI
>>>>>>>> backing a socket, but the missing piece is a mechanism to busy poll a
>>>>>>>> NAPI instance in a dedicated thread while ignoring available events or
>>>>>>>> packets, regardless of the userspace API. Most existing mechanisms are
>>>>>>>> designed to work in a pattern where you poll until new packets or
>>>>>>>> events
>>>>>>>> are received, after which userspace is expected to handle them.
>>>>>>>>
>>>>>>>> As a result, one has to hack together a solution using a mechanism
>>>>>>>> intended to receive packets or events, not to simply NAPI poll. NAPI
>>>>>>>> threaded busy polling, on the other hand, provides this capability
>>>>>>>> natively, independent of any userspace API. This makes it really
>>>>>>>> easy to
>>>>>>>> setup and manage.
>>>>>>>>
>>>>>>>> For analysis we use an AF_XDP based benchmarking tool `xsk_rr`. The
>>>>>>>> description of the tool and how it tries to simulate the real workload
>>>>>>>> is following,
>>>>>>>>
>>>>>>>> - It sends UDP packets between 2 machines.
>>>>>>>> - The client machine sends packets at a fixed frequency. To maintain
>>>>>>>> the
>>>>>>>>       frequency of the packet being sent, we use open-loop sampling.
>>>>>>>> That is
>>>>>>>>       the packets are sent in a separate thread.
>>>>>>>> - The server replies to the packet inline by reading the pkt from the
>>>>>>>>       recv ring and replies using the tx ring.
>>>>>>>> - To simulate the application processing time, we use a configurable
>>>>>>>>       delay in usecs on the client side after a reply is received from
>>>>>>>> the
>>>>>>>>       server.
>>>>>>>>
>>>>>>>> The xsk_rr tool is posted separately as an RFC for tools/testing/
>>>>>>>> selftest.
>>>>>>>>
>>>>>>>> We use this tool with following napi polling configurations,
>>>>>>>>
>>>>>>>> - Interrupts only
>>>>>>>> - SO_BUSYPOLL (inline in the same thread where the client receives the
>>>>>>>>       packet).
>>>>>>>> - SO_BUSYPOLL (separate thread and separate core)
>>>>>>>> - Threaded NAPI busypoll
>>>>>>>>
>>>>>>>> System is configured using following script in all 4 cases,
>>>>>>>>
>>>>>>>> ```
>>>>>>>> echo 0 | sudo tee /sys/class/net/eth0/threaded
>>>>>>>> echo 0 | sudo tee /proc/sys/kernel/timer_migration
>>>>>>>> echo off | sudo tee  /sys/devices/system/cpu/smt/control
>>>>>>>>
>>>>>>>> sudo ethtool -L eth0 rx 1 tx 1
>>>>>>>> sudo ethtool -G eth0 rx 1024
>>>>>>>>
>>>>>>>> echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
>>>>>>>> echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
>>>>>>>>
>>>>>>>>      # pin IRQs on CPU 2
>>>>>>>> IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
>>>>>>>>                                  print arr[0]}' < /proc/interrupts)"
>>>>>>>> for irq in "${IRQS}"; \
>>>>>>>>          do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done
>>>>>>>>
>>>>>>>> echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
>>>>>>>>
>>>>>>>> for i in /sys/devices/virtual/workqueue/*/cpumask; \
>>>>>>>>                          do echo $i; echo 1,2,3,4,5,6 > $i; done
>>>>>>>>
>>>>>>>> if [[ -z "$1" ]]; then
>>>>>>>>       echo 400 | sudo tee /proc/sys/net/core/busy_read
>>>>>>>>       echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>>>>>>       echo 15000   | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>>>>>>> fi
>>>>>>>>
>>>>>>>> sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-
>>>>>>>> usecs 0
>>>>>>>>
>>>>>>>> if [[ "$1" == "enable_threaded" ]]; then
>>>>>>>>       echo 0 | sudo tee /proc/sys/net/core/busy_poll
>>>>>>>>       echo 0 | sudo tee /proc/sys/net/core/busy_read
>>>>>>>>       echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>>>>>>       echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>>>>>>>       echo 2 | sudo tee /sys/class/net/eth0/threaded
>>>>>>>>       NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
>>>>>>>>       sudo chrt -f  -p 50 $NAPI_T
>>>>>>>>
>>>>>>>>       # pin threaded poll thread to CPU 2
>>>>>>>>       sudo taskset -pc 2 $NAPI_T
>>>>>>>> fi
>>>>>>>>
>>>>>>>> if [[ "$1" == "enable_interrupt" ]]; then
>>>>>>>>       echo 0 | sudo tee /proc/sys/net/core/busy_read
>>>>>>>>       echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>>>>>>       echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>>>>>>> fi
>>>>>>>> ```
>>>>>>>
>>>>>>> The experiment script above does not work, because the sysfs parameter
>>>>>>> does not exist anymore in this version.
>>>>>>>
>>>>>>>> To enable various configurations, script can be run as following,
>>>>>>>>
>>>>>>>> - Interrupt Only
>>>>>>>>       ```
>>>>>>>>       <script> enable_interrupt
>>>>>>>>       ```
>>>>>>>>
>>>>>>>> - SO_BUSYPOLL (no arguments to script)
>>>>>>>>       ```
>>>>>>>>       <script>
>>>>>>>>       ```
>>>>>>>>
>>>>>>>> - NAPI threaded busypoll
>>>>>>>>       ```
>>>>>>>>       <script> enable_threaded
>>>>>>>>       ```
>>>>>>>>
>>>>>>>> If using idpf, the script needs to be run again after launching the
>>>>>>>> workload just to make sure that the configurations are not reverted. As
>>>>>>>> idpf reverts some configurations on software reset when AF_XDP program
>>>>>>>> is attached.
>>>>>>>>
>>>>>>>> Once configured, the workload is run with various configurations using
>>>>>>>> following commands. Set period (1/frequency) and delay in usecs to
>>>>>>>> produce results for packet frequency and application processing delay.
>>>>>>>>
>>>>>>>>      ## Interrupt Only and SO_BUSYPOLL (inline)
>>>>>>>>
>>>>>>>> - Server
>>>>>>>> ```
>>>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>>>>          -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -
>>>>>>>> h -v
>>>>>>>> ```
>>>>>>>>
>>>>>>>> - Client
>>>>>>>> ```
>>>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>>>>          -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>>>>>>          -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v
>>>>>>>> ```
>>>>>>>>
>>>>>>>>      ## SO_BUSYPOLL(done in separate core using recvfrom)
> Defines this test case clearly here.
>>>>>>>>
>>>>>>>> Argument -t spawns a seprate thread and continuously calls recvfrom.
> This defines the -t argument and clearly states that it spawns the
> separate thread.
>>>>>>>>
>>>>>>>> - Server
>>>>>>>> ```
>>>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>>>>          -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
>>>>>>>>          -h -v -t
>>>>>>>> ```
>>>>>>>>
>>>>>>>> - Client
>>>>>>>> ```
>>>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>>>>          -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>>>>>>          -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -t
>>>>>>>> ```
>>
>> see below
>>>>>>>>      ## NAPI Threaded Busy Poll
> Section for NAPI Threaded Busy Poll scenario
>>>>>>>>
>>>>>>>> Argument -n skips the recvfrom call as there is no recv kick needed.
> States -n argument and defines it.
>>>>>>>>
>>>>>>>> - Server
>>>>>>>> ```
>>>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>>>>          -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
>>>>>>>>          -h -v -n
>>>>>>>> ```
>>>>>>>>
>>>>>>>> - Client
>>>>>>>> ```
>>>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>>>>          -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>>>>>>          -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -n
>>>>>>>> ```
>>
>> see below
>>>>>>> I believe there's a bug when disabling busy-polled napi threading after
>>>>>>> an experiment. My system hangs and needs a hard reset.
>>>>>>>
>>>>>>>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) |
>>>>>>>> NAPI threaded |
>>>>>>>> |---|---|---|---|---|
>>>>>>>> | 12 Kpkt/s + 0us delay | | | | |
>>>>>>>> |  | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
>>>>>>>> |  | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
>>>>>>>> |  | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
>>>>>>>> |  | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
>>>>>>>> | 32 Kpkt/s + 30us delay | | | | |
>>>>>>>> |  | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
>>>>>>>> |  | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
>>>>>>>> |  | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
>>>>>>>> |  | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
>>>>>>>> | 125 Kpkt/s + 6us delay | | | | |
>>>>>>>> |  | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
>>>>>>>> |  | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
>>>>>>>> |  | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
>>>>>>>> |  | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
>>>>>>>> | 12 Kpkt/s + 78us delay | | | | |
>>>>>>>> |  | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
>>>>>>>> |  | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
>>>>>>>> |  | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
>>>>>>>> |  | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
>>>>>>>> | 25 Kpkt/s + 38us delay | | | | |
>>>>>>>> |  | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
>>>>>>>> |  | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
>>>>>>>> |  | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
>>>>>>>> |  | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
>>>>>>>
>>>>>>> On my system, routing the irq to same core where xsk_rr runs results in
>>>>>>> lower latency than routing the irq to a different core. To me that makes
>>>>>>> sense in a low-rate latency-sensitive scenario where interrupts are not
>>>>>>> causing much trouble, but the resulting locality might be beneficial. I
>>>>>>> think you should test this as well.
>>>>>>>
>>>>>>> The experiments reported above (except for the first one) are
>>>>>>> cherry-picking parameter combinations that result in a near-100% load
>>>>>>> and ignore anything else. Near-100% load is a highly unlikely scenario
>>>>>>> for a latency-sensitive workload.
>>>>>>>
>>>>>>> When combining the above two paragraphs, I believe other interesting
>>>>>>> setups are missing from the experiments, such as comparing to two pairs
>>>>>>> of xsk_rr under high load (as mentioned in my previous emails).
>>>>>> This is to support an existing real workload. We cannot easily modify
>>>>>> its threading model. The two xsk_rr model would be a different
>>>>>> workload.
>>>>>
>>>>> That's fine, but:
>>>>>
>>>>> - In principle I don't think it's a good justification for a kernel
>>>>> change that an application cannot be rewritten.
>>>>>
>>>>> - I believe it is your responsibility to more comprehensively document
>>>>> the impact of your proposed changes beyond your one particular workload.>
>>>> A few more observations from my tests for the "SO_BUSYPOLL(separate)" case:
>>>>
>>>> - Using -t for the client reduces latency compared to -T.
>>> That is understandable and also it is part of the data I presented. -t
>>> means running the SO_BUSY_POLL in a separate thread. Removing -T would
>>> invalidate the workload by making the rate unpredictable.
>>
>> That's another problem with your cover letter then. The experiment as
>> described should match the data presented. See above.
> The experiments are described clearly. I have pointed out the areas in
> the cover letter where these are documented. Where is the mismatch?

Ah, I missed the -t at the end, sorry, my bad.

>>>> - Using poll instead of recvfrom in xsk_rr in rx_polling_run() also
>>>> reduces latency.
>>
>> Any thoughts on this one?
> I think we discussed this already in the previous iteration, with
> Stanislav, and how it will suffer the same way SO_BUSYPOLL suffers. As
> I have already stated, for my workload every microsecond matters and
> the CPU efficiency is not an issue.

Discussing is one thing. Testing is another. In my setup I observe a 
noticeable difference between using recvfrom and poll.

Thanks,
Martin


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v8 0/2] Add support to do threaded napi busy poll
  2025-08-29 23:37               ` Martin Karsten
@ 2025-08-30  0:21                 ` Samiullah Khawaja
  2025-08-30  0:40                   ` Martin Karsten
  0 siblings, 1 reply; 17+ messages in thread
From: Samiullah Khawaja @ 2025-08-30  0:21 UTC (permalink / raw)
  To: Martin Karsten
  Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, Joe Damato, netdev

On Fri, Aug 29, 2025 at 4:37 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>
> On 2025-08-29 19:31, Samiullah Khawaja wrote:
> > On Fri, Aug 29, 2025 at 3:56 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> >>
> >> On 2025-08-29 18:25, Samiullah Khawaja wrote:
> >>> On Fri, Aug 29, 2025 at 3:19 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> >>>>
> >>>> On 2025-08-29 14:08, Martin Karsten wrote:
> >>>>> On 2025-08-29 13:50, Samiullah Khawaja wrote:
> >>>>>> On Thu, Aug 28, 2025 at 8:15 PM Martin Karsten <mkarsten@uwaterloo.ca>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> On 2025-08-28 21:16, Samiullah Khawaja wrote:
> >>>>>>>> Extend the already existing support of threaded napi poll to do
> >>>>>>>> continuous
> >>>>>>>> busy polling.
> >>>>>>>>
> >>>>>>>> This is used for doing continuous polling of napi to fetch descriptors
> >>>>>>>> from backing RX/TX queues for low latency applications. Allow enabling
> >>>>>>>> of threaded busypoll using netlink so this can be enabled on a set of
> >>>>>>>> dedicated napis for low latency applications.
> >>>>>>>>
> >>>>>>>> Once enabled user can fetch the PID of the kthread doing NAPI polling
> >>>>>>>> and set affinity, priority and scheduler for it depending on the
> >>>>>>>> low-latency requirements.
> >>>>>>>>
> >>>>>>>> Extend the netlink interface to allow enabling/disabling threaded
> >>>>>>>> busypolling at individual napi level.
> >>>>>>>>
> >>>>>>>> We use this for our AF_XDP based hard low-latency usecase with usecs
> >>>>>>>> level latency requirement. For our usecase we want low jitter and
> >>>>>>>> stable
> >>>>>>>> latency at P99.
> >>>>>>>>
> >>>>>>>> Following is an analysis and comparison of available (and compatible)
> >>>>>>>> busy poll interfaces for a low latency usecase with stable P99. This
> >>>>>>>> can
> >>>>>>>> be suitable for applications that want very low latency at the expense
> >>>>>>>> of cpu usage and efficiency.
> >>>>>>>>
> >>>>>>>> Already existing APIs (SO_BUSYPOLL and epoll) allow busy polling a NAPI
> >>>>>>>> backing a socket, but the missing piece is a mechanism to busy poll a
> >>>>>>>> NAPI instance in a dedicated thread while ignoring available events or
> >>>>>>>> packets, regardless of the userspace API. Most existing mechanisms are
> >>>>>>>> designed to work in a pattern where you poll until new packets or
> >>>>>>>> events
> >>>>>>>> are received, after which userspace is expected to handle them.
> >>>>>>>>
> >>>>>>>> As a result, one has to hack together a solution using a mechanism
> >>>>>>>> intended to receive packets or events, not to simply NAPI poll. NAPI
> >>>>>>>> threaded busy polling, on the other hand, provides this capability
> >>>>>>>> natively, independent of any userspace API. This makes it really
> >>>>>>>> easy to
> >>>>>>>> setup and manage.
> >>>>>>>>
> >>>>>>>> For analysis we use an AF_XDP based benchmarking tool `xsk_rr`. The
> >>>>>>>> description of the tool and how it tries to simulate the real workload
> >>>>>>>> is following,
> >>>>>>>>
> >>>>>>>> - It sends UDP packets between 2 machines.
> >>>>>>>> - The client machine sends packets at a fixed frequency. To maintain
> >>>>>>>> the
> >>>>>>>>       frequency of the packet being sent, we use open-loop sampling.
> >>>>>>>> That is
> >>>>>>>>       the packets are sent in a separate thread.
> >>>>>>>> - The server replies to the packet inline by reading the pkt from the
> >>>>>>>>       recv ring and replies using the tx ring.
> >>>>>>>> - To simulate the application processing time, we use a configurable
> >>>>>>>>       delay in usecs on the client side after a reply is received from
> >>>>>>>> the
> >>>>>>>>       server.
> >>>>>>>>
> >>>>>>>> The xsk_rr tool is posted separately as an RFC for tools/testing/
> >>>>>>>> selftest.
> >>>>>>>>
> >>>>>>>> We use this tool with following napi polling configurations,
> >>>>>>>>
> >>>>>>>> - Interrupts only
> >>>>>>>> - SO_BUSYPOLL (inline in the same thread where the client receives the
> >>>>>>>>       packet).
> >>>>>>>> - SO_BUSYPOLL (separate thread and separate core)
> >>>>>>>> - Threaded NAPI busypoll
> >>>>>>>>
> >>>>>>>> System is configured using following script in all 4 cases,
> >>>>>>>>
> >>>>>>>> ```
> >>>>>>>> echo 0 | sudo tee /sys/class/net/eth0/threaded
> >>>>>>>> echo 0 | sudo tee /proc/sys/kernel/timer_migration
> >>>>>>>> echo off | sudo tee  /sys/devices/system/cpu/smt/control
> >>>>>>>>
> >>>>>>>> sudo ethtool -L eth0 rx 1 tx 1
> >>>>>>>> sudo ethtool -G eth0 rx 1024
> >>>>>>>>
> >>>>>>>> echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
> >>>>>>>> echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
> >>>>>>>>
> >>>>>>>>      # pin IRQs on CPU 2
> >>>>>>>> IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
> >>>>>>>>                                  print arr[0]}' < /proc/interrupts)"
> >>>>>>>> for irq in "${IRQS}"; \
> >>>>>>>>          do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done
> >>>>>>>>
> >>>>>>>> echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
> >>>>>>>>
> >>>>>>>> for i in /sys/devices/virtual/workqueue/*/cpumask; \
> >>>>>>>>                          do echo $i; echo 1,2,3,4,5,6 > $i; done
> >>>>>>>>
> >>>>>>>> if [[ -z "$1" ]]; then
> >>>>>>>>       echo 400 | sudo tee /proc/sys/net/core/busy_read
> >>>>>>>>       echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> >>>>>>>>       echo 15000   | sudo tee /sys/class/net/eth0/gro_flush_timeout
> >>>>>>>> fi
> >>>>>>>>
> >>>>>>>> sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-
> >>>>>>>> usecs 0
> >>>>>>>>
> >>>>>>>> if [[ "$1" == "enable_threaded" ]]; then
> >>>>>>>>       echo 0 | sudo tee /proc/sys/net/core/busy_poll
> >>>>>>>>       echo 0 | sudo tee /proc/sys/net/core/busy_read
> >>>>>>>>       echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> >>>>>>>>       echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> >>>>>>>>       echo 2 | sudo tee /sys/class/net/eth0/threaded
> >>>>>>>>       NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
> >>>>>>>>       sudo chrt -f  -p 50 $NAPI_T
> >>>>>>>>
> >>>>>>>>       # pin threaded poll thread to CPU 2
> >>>>>>>>       sudo taskset -pc 2 $NAPI_T
> >>>>>>>> fi
> >>>>>>>>
> >>>>>>>> if [[ "$1" == "enable_interrupt" ]]; then
> >>>>>>>>       echo 0 | sudo tee /proc/sys/net/core/busy_read
> >>>>>>>>       echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> >>>>>>>>       echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> >>>>>>>> fi
> >>>>>>>> ```
> >>>>>>>
> >>>>>>> The experiment script above does not work, because the sysfs parameter
> >>>>>>> does not exist anymore in this version.
> >>>>>>>
> >>>>>>>> To enable various configurations, script can be run as following,
> >>>>>>>>
> >>>>>>>> - Interrupt Only
> >>>>>>>>       ```
> >>>>>>>>       <script> enable_interrupt
> >>>>>>>>       ```
> >>>>>>>>
> >>>>>>>> - SO_BUSYPOLL (no arguments to script)
> >>>>>>>>       ```
> >>>>>>>>       <script>
> >>>>>>>>       ```
> >>>>>>>>
> >>>>>>>> - NAPI threaded busypoll
> >>>>>>>>       ```
> >>>>>>>>       <script> enable_threaded
> >>>>>>>>       ```
> >>>>>>>>
> >>>>>>>> If using idpf, the script needs to be run again after launching the
> >>>>>>>> workload just to make sure that the configurations are not reverted. As
> >>>>>>>> idpf reverts some configurations on software reset when AF_XDP program
> >>>>>>>> is attached.
> >>>>>>>>
> >>>>>>>> Once configured, the workload is run with various configurations using
> >>>>>>>> following commands. Set period (1/frequency) and delay in usecs to
> >>>>>>>> produce results for packet frequency and application processing delay.
> >>>>>>>>
> >>>>>>>>      ## Interrupt Only and SO_BUSYPOLL (inline)
> >>>>>>>>
> >>>>>>>> - Server
> >>>>>>>> ```
> >>>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>>>>>>          -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -
> >>>>>>>> h -v
> >>>>>>>> ```
> >>>>>>>>
> >>>>>>>> - Client
> >>>>>>>> ```
> >>>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>>>>>>          -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> >>>>>>>>          -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v
> >>>>>>>> ```
> >>>>>>>>
> >>>>>>>>      ## SO_BUSYPOLL(done in separate core using recvfrom)
> > Defines this test case clearly here.
> >>>>>>>>
> >>>>>>>> Argument -t spawns a seprate thread and continuously calls recvfrom.
> > This defines the -t argument and clearly states that it spawns the
> > separate thread.
> >>>>>>>>
> >>>>>>>> - Server
> >>>>>>>> ```
> >>>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>>>>>>          -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> >>>>>>>>          -h -v -t
> >>>>>>>> ```
> >>>>>>>>
> >>>>>>>> - Client
> >>>>>>>> ```
> >>>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>>>>>>          -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> >>>>>>>>          -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -t
> >>>>>>>> ```
> >>
> >> see below
> >>>>>>>>      ## NAPI Threaded Busy Poll
> > Section for NAPI Threaded Busy Poll scenario
> >>>>>>>>
> >>>>>>>> Argument -n skips the recvfrom call as there is no recv kick needed.
> > States -n argument and defines it.
> >>>>>>>>
> >>>>>>>> - Server
> >>>>>>>> ```
> >>>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>>>>>>          -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
> >>>>>>>>          -h -v -n
> >>>>>>>> ```
> >>>>>>>>
> >>>>>>>> - Client
> >>>>>>>> ```
> >>>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
> >>>>>>>>          -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
> >>>>>>>>          -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -n
> >>>>>>>> ```
> >>
> >> see below
> >>>>>>> I believe there's a bug when disabling busy-polled napi threading after
> >>>>>>> an experiment. My system hangs and needs a hard reset.
> >>>>>>>
> >>>>>>>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) |
> >>>>>>>> NAPI threaded |
> >>>>>>>> |---|---|---|---|---|
> >>>>>>>> | 12 Kpkt/s + 0us delay | | | | |
> >>>>>>>> |  | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
> >>>>>>>> |  | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
> >>>>>>>> |  | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
> >>>>>>>> |  | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
> >>>>>>>> | 32 Kpkt/s + 30us delay | | | | |
> >>>>>>>> |  | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
> >>>>>>>> |  | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
> >>>>>>>> |  | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
> >>>>>>>> |  | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
> >>>>>>>> | 125 Kpkt/s + 6us delay | | | | |
> >>>>>>>> |  | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
> >>>>>>>> |  | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
> >>>>>>>> |  | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
> >>>>>>>> |  | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
> >>>>>>>> | 12 Kpkt/s + 78us delay | | | | |
> >>>>>>>> |  | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
> >>>>>>>> |  | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
> >>>>>>>> |  | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
> >>>>>>>> |  | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
> >>>>>>>> | 25 Kpkt/s + 38us delay | | | | |
> >>>>>>>> |  | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
> >>>>>>>> |  | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
> >>>>>>>> |  | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
> >>>>>>>> |  | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
> >>>>>>>
> >>>>>>> On my system, routing the irq to same core where xsk_rr runs results in
> >>>>>>> lower latency than routing the irq to a different core. To me that makes
> >>>>>>> sense in a low-rate latency-sensitive scenario where interrupts are not
> >>>>>>> causing much trouble, but the resulting locality might be beneficial. I
> >>>>>>> think you should test this as well.
> >>>>>>>
> >>>>>>> The experiments reported above (except for the first one) are
> >>>>>>> cherry-picking parameter combinations that result in a near-100% load
> >>>>>>> and ignore anything else. Near-100% load is a highly unlikely scenario
> >>>>>>> for a latency-sensitive workload.
> >>>>>>>
> >>>>>>> When combining the above two paragraphs, I believe other interesting
> >>>>>>> setups are missing from the experiments, such as comparing to two pairs
> >>>>>>> of xsk_rr under high load (as mentioned in my previous emails).
> >>>>>> This is to support an existing real workload. We cannot easily modify
> >>>>>> its threading model. The two xsk_rr model would be a different
> >>>>>> workload.
> >>>>>
> >>>>> That's fine, but:
> >>>>>
> >>>>> - In principle I don't think it's a good justification for a kernel
> >>>>> change that an application cannot be rewritten.
> >>>>>
> >>>>> - I believe it is your responsibility to more comprehensively document
> >>>>> the impact of your proposed changes beyond your one particular workload.>
> >>>> A few more observations from my tests for the "SO_BUSYPOLL(separate)" case:
> >>>>
> >>>> - Using -t for the client reduces latency compared to -T.
> >>> That is understandable and also it is part of the data I presented. -t
> >>> means running the SO_BUSY_POLL in a separate thread. Removing -T would
> >>> invalidate the workload by making the rate unpredictable.
> >>
> >> That's another problem with your cover letter then. The experiment as
> >> described should match the data presented. See above.
> > The experiments are described clearly. I have pointed out the areas in
> > the cover letter where these are documented. Where is the mismatch?
>
> Ah, I missed the -t at the end, sorry, my bad.
>
> >>>> - Using poll instead of recvfrom in xsk_rr in rx_polling_run() also
> >>>> reduces latency.
> >>
> >> Any thoughts on this one?
> > I think we discussed this already in the previous iteration, with
> > Stanislav, and how it will suffer the same way SO_BUSYPOLL suffers. As
> > I have already stated, for my workload every microsecond matters and
> > the CPU efficiency is not an issue.
>
> Discussing is one thing. Testing is another. In my setup I observe a
> noticeable difference between using recvfrom and poll.
I experimented with it and it seems to improve a little bit in some
cases (maybe 200nsecs) but performs really badly with a low packet
rate as expected. As discussed in the last iteration and also in the
cover letter, this is because it only polls when there are no events.

count: 1249 p5: 17200
count: 12436 p50: 21100
count: 21106 p95: 21700
count: 21994 p99: 21700
rate: 24995
outstanding packets: 5

Like I stated in the cover letter and documentation, one can try to
hack together something using APIs designed to recv packets or events
but it's better to have a native mechanism supported by OS that is
designed to poll underlying NAPIs if that is what user wants.
>
> Thanks,
> Martin
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v8 0/2] Add support to do threaded napi busy poll
  2025-08-30  0:21                 ` Samiullah Khawaja
@ 2025-08-30  0:40                   ` Martin Karsten
  0 siblings, 0 replies; 17+ messages in thread
From: Martin Karsten @ 2025-08-30  0:40 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, Joe Damato, netdev

On 2025-08-29 20:21, Samiullah Khawaja wrote:
> On Fri, Aug 29, 2025 at 4:37 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>>
>> On 2025-08-29 19:31, Samiullah Khawaja wrote:
>>> On Fri, Aug 29, 2025 at 3:56 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>>>>
>>>> On 2025-08-29 18:25, Samiullah Khawaja wrote:
>>>>> On Fri, Aug 29, 2025 at 3:19 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>>>>>>
>>>>>> On 2025-08-29 14:08, Martin Karsten wrote:
>>>>>>> On 2025-08-29 13:50, Samiullah Khawaja wrote:
>>>>>>>> On Thu, Aug 28, 2025 at 8:15 PM Martin Karsten <mkarsten@uwaterloo.ca>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> On 2025-08-28 21:16, Samiullah Khawaja wrote:
>>>>>>>>>> Extend the already existing support of threaded napi poll to do
>>>>>>>>>> continuous
>>>>>>>>>> busy polling.
>>>>>>>>>>
>>>>>>>>>> This is used for doing continuous polling of napi to fetch descriptors
>>>>>>>>>> from backing RX/TX queues for low latency applications. Allow enabling
>>>>>>>>>> of threaded busypoll using netlink so this can be enabled on a set of
>>>>>>>>>> dedicated napis for low latency applications.
>>>>>>>>>>
>>>>>>>>>> Once enabled user can fetch the PID of the kthread doing NAPI polling
>>>>>>>>>> and set affinity, priority and scheduler for it depending on the
>>>>>>>>>> low-latency requirements.
>>>>>>>>>>
>>>>>>>>>> Extend the netlink interface to allow enabling/disabling threaded
>>>>>>>>>> busypolling at individual napi level.
>>>>>>>>>>
>>>>>>>>>> We use this for our AF_XDP based hard low-latency usecase with usecs
>>>>>>>>>> level latency requirement. For our usecase we want low jitter and
>>>>>>>>>> stable
>>>>>>>>>> latency at P99.
>>>>>>>>>>
>>>>>>>>>> Following is an analysis and comparison of available (and compatible)
>>>>>>>>>> busy poll interfaces for a low latency usecase with stable P99. This
>>>>>>>>>> can
>>>>>>>>>> be suitable for applications that want very low latency at the expense
>>>>>>>>>> of cpu usage and efficiency.
>>>>>>>>>>
>>>>>>>>>> Already existing APIs (SO_BUSYPOLL and epoll) allow busy polling a NAPI
>>>>>>>>>> backing a socket, but the missing piece is a mechanism to busy poll a
>>>>>>>>>> NAPI instance in a dedicated thread while ignoring available events or
>>>>>>>>>> packets, regardless of the userspace API. Most existing mechanisms are
>>>>>>>>>> designed to work in a pattern where you poll until new packets or
>>>>>>>>>> events
>>>>>>>>>> are received, after which userspace is expected to handle them.
>>>>>>>>>>
>>>>>>>>>> As a result, one has to hack together a solution using a mechanism
>>>>>>>>>> intended to receive packets or events, not to simply NAPI poll. NAPI
>>>>>>>>>> threaded busy polling, on the other hand, provides this capability
>>>>>>>>>> natively, independent of any userspace API. This makes it really
>>>>>>>>>> easy to
>>>>>>>>>> setup and manage.
>>>>>>>>>>
>>>>>>>>>> For analysis we use an AF_XDP based benchmarking tool `xsk_rr`. The
>>>>>>>>>> description of the tool and how it tries to simulate the real workload
>>>>>>>>>> is following,
>>>>>>>>>>
>>>>>>>>>> - It sends UDP packets between 2 machines.
>>>>>>>>>> - The client machine sends packets at a fixed frequency. To maintain
>>>>>>>>>> the
>>>>>>>>>>        frequency of the packet being sent, we use open-loop sampling.
>>>>>>>>>> That is
>>>>>>>>>>        the packets are sent in a separate thread.
>>>>>>>>>> - The server replies to the packet inline by reading the pkt from the
>>>>>>>>>>        recv ring and replies using the tx ring.
>>>>>>>>>> - To simulate the application processing time, we use a configurable
>>>>>>>>>>        delay in usecs on the client side after a reply is received from
>>>>>>>>>> the
>>>>>>>>>>        server.
>>>>>>>>>>
>>>>>>>>>> The xsk_rr tool is posted separately as an RFC for tools/testing/
>>>>>>>>>> selftest.
>>>>>>>>>>
>>>>>>>>>> We use this tool with following napi polling configurations,
>>>>>>>>>>
>>>>>>>>>> - Interrupts only
>>>>>>>>>> - SO_BUSYPOLL (inline in the same thread where the client receives the
>>>>>>>>>>        packet).
>>>>>>>>>> - SO_BUSYPOLL (separate thread and separate core)
>>>>>>>>>> - Threaded NAPI busypoll
>>>>>>>>>>
>>>>>>>>>> System is configured using following script in all 4 cases,
>>>>>>>>>>
>>>>>>>>>> ```
>>>>>>>>>> echo 0 | sudo tee /sys/class/net/eth0/threaded
>>>>>>>>>> echo 0 | sudo tee /proc/sys/kernel/timer_migration
>>>>>>>>>> echo off | sudo tee  /sys/devices/system/cpu/smt/control
>>>>>>>>>>
>>>>>>>>>> sudo ethtool -L eth0 rx 1 tx 1
>>>>>>>>>> sudo ethtool -G eth0 rx 1024
>>>>>>>>>>
>>>>>>>>>> echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
>>>>>>>>>> echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
>>>>>>>>>>
>>>>>>>>>>       # pin IRQs on CPU 2
>>>>>>>>>> IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
>>>>>>>>>>                                   print arr[0]}' < /proc/interrupts)"
>>>>>>>>>> for irq in "${IRQS}"; \
>>>>>>>>>>           do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done
>>>>>>>>>>
>>>>>>>>>> echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
>>>>>>>>>>
>>>>>>>>>> for i in /sys/devices/virtual/workqueue/*/cpumask; \
>>>>>>>>>>                           do echo $i; echo 1,2,3,4,5,6 > $i; done
>>>>>>>>>>
>>>>>>>>>> if [[ -z "$1" ]]; then
>>>>>>>>>>        echo 400 | sudo tee /proc/sys/net/core/busy_read
>>>>>>>>>>        echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>>>>>>>>        echo 15000   | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>>>>>>>>> fi
>>>>>>>>>>
>>>>>>>>>> sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-
>>>>>>>>>> usecs 0
>>>>>>>>>>
>>>>>>>>>> if [[ "$1" == "enable_threaded" ]]; then
>>>>>>>>>>        echo 0 | sudo tee /proc/sys/net/core/busy_poll
>>>>>>>>>>        echo 0 | sudo tee /proc/sys/net/core/busy_read
>>>>>>>>>>        echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>>>>>>>>        echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>>>>>>>>>        echo 2 | sudo tee /sys/class/net/eth0/threaded
>>>>>>>>>>        NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
>>>>>>>>>>        sudo chrt -f  -p 50 $NAPI_T
>>>>>>>>>>
>>>>>>>>>>        # pin threaded poll thread to CPU 2
>>>>>>>>>>        sudo taskset -pc 2 $NAPI_T
>>>>>>>>>> fi
>>>>>>>>>>
>>>>>>>>>> if [[ "$1" == "enable_interrupt" ]]; then
>>>>>>>>>>        echo 0 | sudo tee /proc/sys/net/core/busy_read
>>>>>>>>>>        echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
>>>>>>>>>>        echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
>>>>>>>>>> fi
>>>>>>>>>> ```
>>>>>>>>>
>>>>>>>>> The experiment script above does not work, because the sysfs parameter
>>>>>>>>> does not exist anymore in this version.
>>>>>>>>>
>>>>>>>>>> To enable various configurations, script can be run as following,
>>>>>>>>>>
>>>>>>>>>> - Interrupt Only
>>>>>>>>>>        ```
>>>>>>>>>>        <script> enable_interrupt
>>>>>>>>>>        ```
>>>>>>>>>>
>>>>>>>>>> - SO_BUSYPOLL (no arguments to script)
>>>>>>>>>>        ```
>>>>>>>>>>        <script>
>>>>>>>>>>        ```
>>>>>>>>>>
>>>>>>>>>> - NAPI threaded busypoll
>>>>>>>>>>        ```
>>>>>>>>>>        <script> enable_threaded
>>>>>>>>>>        ```
>>>>>>>>>>
>>>>>>>>>> If using idpf, the script needs to be run again after launching the
>>>>>>>>>> workload just to make sure that the configurations are not reverted. As
>>>>>>>>>> idpf reverts some configurations on software reset when AF_XDP program
>>>>>>>>>> is attached.
>>>>>>>>>>
>>>>>>>>>> Once configured, the workload is run with various configurations using
>>>>>>>>>> following commands. Set period (1/frequency) and delay in usecs to
>>>>>>>>>> produce results for packet frequency and application processing delay.
>>>>>>>>>>
>>>>>>>>>>       ## Interrupt Only and SO_BUSYPOLL (inline)
>>>>>>>>>>
>>>>>>>>>> - Server
>>>>>>>>>> ```
>>>>>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>>>>>>           -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -
>>>>>>>>>> h -v
>>>>>>>>>> ```
>>>>>>>>>>
>>>>>>>>>> - Client
>>>>>>>>>> ```
>>>>>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>>>>>>           -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>>>>>>>>           -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v
>>>>>>>>>> ```
>>>>>>>>>>
>>>>>>>>>>       ## SO_BUSYPOLL(done in separate core using recvfrom)
>>> Defines this test case clearly here.
>>>>>>>>>>
>>>>>>>>>> Argument -t spawns a seprate thread and continuously calls recvfrom.
>>> This defines the -t argument and clearly states that it spawns the
>>> separate thread.
>>>>>>>>>>
>>>>>>>>>> - Server
>>>>>>>>>> ```
>>>>>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>>>>>>           -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
>>>>>>>>>>           -h -v -t
>>>>>>>>>> ```
>>>>>>>>>>
>>>>>>>>>> - Client
>>>>>>>>>> ```
>>>>>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>>>>>>           -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>>>>>>>>           -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -t
>>>>>>>>>> ```
>>>>
>>>> see below
>>>>>>>>>>       ## NAPI Threaded Busy Poll
>>> Section for NAPI Threaded Busy Poll scenario
>>>>>>>>>>
>>>>>>>>>> Argument -n skips the recvfrom call as there is no recv kick needed.
>>> States -n argument and defines it.
>>>>>>>>>>
>>>>>>>>>> - Server
>>>>>>>>>> ```
>>>>>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>>>>>>           -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
>>>>>>>>>>           -h -v -n
>>>>>>>>>> ```
>>>>>>>>>>
>>>>>>>>>> - Client
>>>>>>>>>> ```
>>>>>>>>>> sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
>>>>>>>>>>           -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
>>>>>>>>>>           -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -n
>>>>>>>>>> ```
>>>>
>>>> see below
>>>>>>>>> I believe there's a bug when disabling busy-polled napi threading after
>>>>>>>>> an experiment. My system hangs and needs a hard reset.
>>>>>>>>>
>>>>>>>>>> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) |
>>>>>>>>>> NAPI threaded |
>>>>>>>>>> |---|---|---|---|---|
>>>>>>>>>> | 12 Kpkt/s + 0us delay | | | | |
>>>>>>>>>> |  | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
>>>>>>>>>> |  | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
>>>>>>>>>> |  | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
>>>>>>>>>> |  | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
>>>>>>>>>> | 32 Kpkt/s + 30us delay | | | | |
>>>>>>>>>> |  | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
>>>>>>>>>> |  | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
>>>>>>>>>> |  | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
>>>>>>>>>> |  | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
>>>>>>>>>> | 125 Kpkt/s + 6us delay | | | | |
>>>>>>>>>> |  | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
>>>>>>>>>> |  | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
>>>>>>>>>> |  | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
>>>>>>>>>> |  | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
>>>>>>>>>> | 12 Kpkt/s + 78us delay | | | | |
>>>>>>>>>> |  | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
>>>>>>>>>> |  | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
>>>>>>>>>> |  | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
>>>>>>>>>> |  | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
>>>>>>>>>> | 25 Kpkt/s + 38us delay | | | | |
>>>>>>>>>> |  | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
>>>>>>>>>> |  | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
>>>>>>>>>> |  | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
>>>>>>>>>> |  | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
>>>>>>>>>
>>>>>>>>> On my system, routing the irq to same core where xsk_rr runs results in
>>>>>>>>> lower latency than routing the irq to a different core. To me that makes
>>>>>>>>> sense in a low-rate latency-sensitive scenario where interrupts are not
>>>>>>>>> causing much trouble, but the resulting locality might be beneficial. I
>>>>>>>>> think you should test this as well.
>>>>>>>>>
>>>>>>>>> The experiments reported above (except for the first one) are
>>>>>>>>> cherry-picking parameter combinations that result in a near-100% load
>>>>>>>>> and ignore anything else. Near-100% load is a highly unlikely scenario
>>>>>>>>> for a latency-sensitive workload.
>>>>>>>>>
>>>>>>>>> When combining the above two paragraphs, I believe other interesting
>>>>>>>>> setups are missing from the experiments, such as comparing to two pairs
>>>>>>>>> of xsk_rr under high load (as mentioned in my previous emails).
>>>>>>>> This is to support an existing real workload. We cannot easily modify
>>>>>>>> its threading model. The two xsk_rr model would be a different
>>>>>>>> workload.
>>>>>>>
>>>>>>> That's fine, but:
>>>>>>>
>>>>>>> - In principle I don't think it's a good justification for a kernel
>>>>>>> change that an application cannot be rewritten.
>>>>>>>
>>>>>>> - I believe it is your responsibility to more comprehensively document
>>>>>>> the impact of your proposed changes beyond your one particular workload.>
>>>>>> A few more observations from my tests for the "SO_BUSYPOLL(separate)" case:
>>>>>>
>>>>>> - Using -t for the client reduces latency compared to -T.
>>>>> That is understandable and also it is part of the data I presented. -t
>>>>> means running the SO_BUSY_POLL in a separate thread. Removing -T would
>>>>> invalidate the workload by making the rate unpredictable.
>>>>
>>>> That's another problem with your cover letter then. The experiment as
>>>> described should match the data presented. See above.
>>> The experiments are described clearly. I have pointed out the areas in
>>> the cover letter where these are documented. Where is the mismatch?
>>
>> Ah, I missed the -t at the end, sorry, my bad.
>>
>>>>>> - Using poll instead of recvfrom in xsk_rr in rx_polling_run() also
>>>>>> reduces latency.
>>>>
>>>> Any thoughts on this one?
>>> I think we discussed this already in the previous iteration, with
>>> Stanislav, and how it will suffer the same way SO_BUSYPOLL suffers. As
>>> I have already stated, for my workload every microsecond matters and
>>> the CPU efficiency is not an issue.
>>
>> Discussing is one thing. Testing is another. In my setup I observe a
>> noticeable difference between using recvfrom and poll.
> I experimented with it and it seems to improve a little bit in some
> cases (maybe 200nsecs) but performs really badly with a low packet
> rate as expected. As discussed in the last iteration and also in the
> cover letter, this is because it only polls when there are no events.
> 
> count: 1249 p5: 17200
> count: 12436 p50: 21100
> count: 21106 p95: 21700
> count: 21994 p99: 21700
> rate: 24995
> outstanding packets: 5
> 
> Like I stated in the cover letter and documentation, one can try to
> hack together something using APIs designed to recv packets or events
> but it's better to have a native mechanism supported by OS that is
> designed to poll underlying NAPIs if that is what user wants.
I see, thanks for checking. I got sidetracked and was looking at yet 
another setup (0 period, 0 delay). The timing of xsk_rr with "work" and 
I/O being interleaved seems like a special case (not to mention the 100% 
load). Anyway, I am sure you will post again and I will make my 
statement about a comprehensive evaluation again. ;-)

Best,
Martin


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2025-08-30  0:41 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-29  1:16 [PATCH net-next v8 0/2] Add support to do threaded napi busy poll Samiullah Khawaja
2025-08-29  1:16 ` [PATCH net-next v8 1/2] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
2025-08-29  1:16 ` [PATCH net-next v8 2/2] selftests: Add napi threaded busy poll test in `busy_poller` Samiullah Khawaja
2025-08-29  3:15 ` [PATCH net-next v8 0/2] Add support to do threaded napi busy poll Martin Karsten
2025-08-29 17:50   ` Samiullah Khawaja
2025-08-29 18:08     ` Martin Karsten
2025-08-29 18:42       ` Willem de Bruijn
2025-08-29 20:49       ` Samiullah Khawaja
2025-08-29 21:27         ` Martin Karsten
2025-08-29 22:27           ` Samiullah Khawaja
2025-08-29 22:19       ` Martin Karsten
2025-08-29 22:25         ` Samiullah Khawaja
2025-08-29 22:56           ` Martin Karsten
2025-08-29 23:31             ` Samiullah Khawaja
2025-08-29 23:37               ` Martin Karsten
2025-08-30  0:21                 ` Samiullah Khawaja
2025-08-30  0:40                   ` Martin Karsten

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).