netdev.vger.kernel.org archive mirror
* [PATCH net-next v4 0/4] Add support to do threaded napi busy poll
@ 2025-03-21  2:15 Samiullah Khawaja
  2025-03-21  2:15 ` [PATCH net-next v4 1/4] Add support to set napi threaded for individual napi Samiullah Khawaja
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Samiullah Khawaja @ 2025-03-21  2:15 UTC (permalink / raw)
  To: Jakub Kicinski, David S . Miller , Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, jdamato, mkarsten
  Cc: netdev, skhawaja

Extend the existing threaded napi poll support to do continuous busy
polling.

This is used to continuously poll a napi and fetch descriptors from the
backing RX/TX queues for low latency applications. Allow enabling
threaded busy poll through netlink so it can be enabled on a set of
dedicated napis for low latency applications.

Once enabled, the user can fetch the PID of the kthread doing the NAPI
polling and set its affinity, priority and scheduler depending on the
low-latency requirements.
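For example, the kthread PID can be read back over netlink and then
pinned and prioritized (a sketch using the ynl CLI from the kernel
tree; the NAPI id 66, CPU 2 and priority 50 are illustrative, and the
`sed` parsing of the CLI's dict-style output is an assumption):

```shell
# Fetch the NAPI's polling kthread PID via the napi-get netlink op.
NAPI_T=$(./tools/net/ynl/pyynl/cli.py \
    --spec Documentation/netlink/specs/netdev.yaml \
    --do napi-get --json '{"id": 66}' \
    | sed -n "s/.*'pid': \([0-9]*\).*/\1/p")

# Pin the kthread to a dedicated CPU and give it SCHED_FIFO priority.
sudo taskset -pc 2 "$NAPI_T"
sudo chrt -f -p 50 "$NAPI_T"
```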

Currently threaded napi can only be enabled at the device level using
sysfs. Add support to enable/disable threaded mode for each napi
individually through the netlink interface, by extending the `napi-set`
op in the netlink spec with a `threaded` attribute.
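For instance, with the ynl CLI in the kernel tree (the NAPI id 66 is
illustrative):

```shell
# Enable threaded polling mode on a single NAPI instance.
./tools/net/ynl/pyynl/cli.py \
    --spec Documentation/netlink/specs/netdev.yaml \
    --do napi-set --json '{"id": 66, "threaded": 1}'
```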

Extend the threaded attribute in the napi struct with an option to
enable continuous busy polling, and extend the netlink and sysfs
interfaces to allow enabling/disabling threaded busy polling at the
device or individual napi level.

We use this for our AF_XDP based hard low-latency use case with
microsecond-level latency requirements. For this use case we want low
jitter and stable latency at P99.

Following is an analysis and comparison of the available (and
compatible) busy poll interfaces for a low latency use case with stable
P99. Note that throughput and CPU efficiency are non-goals.

For the analysis we use an AF_XDP based benchmarking tool, `xsk_rr`
(invoked in the commands below). The tool, and how it tries to simulate
a real workload, is described below:

- It sends UDP packets between 2 machines.
- The client machine sends packets at a fixed frequency. To maintain
  the packet send frequency we use open-loop sampling, i.e. the packets
  are sent from a separate thread regardless of when replies arrive.
- The server replies to each packet inline, reading it from the recv
  ring and replying using the TX ring.
- To simulate the application processing time, we use a configurable
  delay in usecs on the client side after a reply is received from the
  server.

The xsk_rr tool is posted separately as an RFC for tools/testing/selftests.

We use this tool with the following napi polling configurations:

- Interrupts only
- SO_BUSYPOLL (inline in the same thread where the client receives the
  packet).
- SO_BUSYPOLL (separate thread and separate core)
- Threaded NAPI busypoll

The system is configured using the following script in all four cases:

```
echo 0 | sudo tee /sys/class/net/eth0/threaded
echo 0 | sudo tee /proc/sys/kernel/timer_migration
echo off | sudo tee  /sys/devices/system/cpu/smt/control

sudo ethtool -L eth0 rx 1 tx 1
sudo ethtool -G eth0 rx 1024

echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus

 # pin IRQs on CPU 2
IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
				print arr[0]}' < /proc/interrupts)"
for irq in $IRQS; \
	do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done

echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us

for i in /sys/devices/virtual/workqueue/*/cpumask; \
			do echo $i; echo 1,2,3,4,5,6 | sudo tee $i; done

if [[ -z "$1" ]]; then
  echo 400 | sudo tee /proc/sys/net/core/busy_read
  echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
  echo 15000   | sudo tee /sys/class/net/eth0/gro_flush_timeout
fi

sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0

if [[ "$1" == "enable_threaded" ]]; then
  echo 0 | sudo tee /proc/sys/net/core/busy_poll
  echo 0 | sudo tee /proc/sys/net/core/busy_read
  echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
  echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
  echo 2 | sudo tee /sys/class/net/eth0/threaded
  NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
  sudo chrt -f  -p 50 $NAPI_T

  # pin threaded poll thread to CPU 2
  sudo taskset -pc 2 $NAPI_T
fi

if [[ "$1" == "enable_interrupt" ]]; then
  echo 0 | sudo tee /proc/sys/net/core/busy_read
  echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
  echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
fi
```

To select a configuration, the script is run as follows:

- Interrupt Only
  ```
  <script> enable_interrupt
  ```

- SO_BUSYPOLL (no arguments to script)
  ```
  <script>
  ```

- NAPI threaded busypoll
  ```
  <script> enable_threaded
  ```

If using idpf, the script needs to be run again after launching the
workload to make sure the configuration is not reverted, since idpf
reverts some settings on the software reset that happens when an AF_XDP
program is attached.
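A sketch of that ordering (`setup.sh` is a stand-in name for the
configuration script above; the xsk_rr arguments are elided):

```shell
./setup.sh enable_threaded   # initial configuration
./xsk_rr ... &               # start workload; attaching AF_XDP triggers a soft reset
sleep 1                      # give idpf time to finish the reset
./setup.sh enable_threaded   # re-apply settings the driver reverted
```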

Once configured, the workload is run with the various configurations
using the following commands. Set the period (1/frequency) and delay in
usecs to produce results for each packet frequency and application
processing delay.

 ## Interrupt Only and SO_BUSY_POLL (inline)

- Server
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
	-D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v
```

- Client
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
	-S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
	-P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v
```

 ## SO_BUSY_POLL(done in separate core using recvfrom)

Argument -t spawns a separate thread that continuously calls recvfrom.

- Server
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
	-D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
	-h -v -t
```

- Client
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
	-S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
	-P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -t
```

 ## NAPI Threaded Busy Poll

Argument -n skips the recvfrom call as there is no recv kick needed.

- Server
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
	-D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
	-h -v -n
```

- Client
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
	-S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
	-P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -n
```

| Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI Threaded |
|---|---|---|---|---|
| 12K pkt/s + 0us delay | | | | |
| | p5: 14000 | p5: 13300 | p5: 13000 | p5: 12800 |
| | p50: 14200 | p50: 13700 | p50: 14000 | p50: 13100 |
| | p95: 14200 | p95: 13800 | p95: 14200 | p95: 13100 |
| | p99: 14200 | p99: 13800 | p99: 14200 | p99: 13100 |
| 125K pkt/s + 6us delay | | | | |
| | p5: 14600 | p5: 17000 | p5: 14200 | p5: 13100 |
| | p50: 15300 | p50: 17200 | p50: 14900 | p50: 13200 |
| | p95: 15500 | p95: 17500 | p95: 14900 | p95: 13200 |
| | p99: 15500 | p99: 17500 | p99: 14900 | p99: 13200 |
| 25K pkt/s + 38us delay | | | | |
| | p5: 22300 | p5: 16800 | p5: 13700 | p5: 12700 |
| | p50: 22800 | p50: 16900 | p50: 14100 | p50: 12900 |
| | p95: 22800 | p95: 17000 | p95: 14200 | p95: 12900 |
| | p99: 22800 | p99: 17100 | p99: 14200 | p99: 12900 |
| 32K pkt/s + 30us delay | | | | |
| | p5: 20000 | p5: 16600 | p5: 13900 | p5: 12900 |
| | p50: 21000 | p50: 17000 | p50: 14500 | p50: 13100 |
| | p95: 21000 | p95: 17000 | p95: 14600 | p95: 13100 |
| | p99: 21300 | p99: 17000 | p99: 14600 | p99: 13100 |
| 12K pkt/s + 78us delay | | | | |
| | p5: 13800 | p5: 16700 | p5: 13800 | p5: 12800 |
| | p50: 14100 | p50: 17100 | p50: 14300 | p50: 13000 |
| | p95: 14100 | p95: 17100 | p95: 14500 | p95: 13000 |
| | p99: 14100 | p99: 17100 | p99: 14500 | p99: 13000 |

 ## Observations

- Without application processing, all the approaches give roughly the
  same latency, within a 1 usec range, with NAPI threaded giving the
  minimum latency.
- With application processing, latency increases by 3-4 usecs when
  polling inline.
- Using a dedicated core to drive napi polling keeps latency flat even
  with application processing. This is observed both for userspace
  polling and for threaded napi (in kernel).
- Napi threaded polling in the kernel gives 1-1.5 usecs lower latency
  than userspace-driven polling on a separate core.
- With application processing, the userspace thread takes a packet from
  the recv ring, spends time on application processing, and only then
  does napi polling. While application processing is happening, a
  dedicated core doing napi polling can pull packets off the NAPI RX
  queue and populate the AF_XDP recv ring, so when the application
  thread is done with its processing it has new packets ready to recv
  and process.
- Napi threaded busy polling in the kernel with a dedicated core gives
  consistent P5-P99 latency.

The following histogram measures the time spent in recvfrom while
polling inline with SO_BUSYPOLL; it was generated using the bpftrace
command below. In this experiment there are 32K packets per second and
the application processing delay is 30 usecs. The goal is to check
whether enough time is spent pulling packets from the descriptor queue
to affect the overall latency when done inline.

```
bpftrace -e '
        kprobe:xsk_recvmsg {
                @start[tid] = nsecs;
        }
        kretprobe:xsk_recvmsg {
                if (@start[tid]) {
                        $sample = (nsecs - @start[tid]);
                        @xsk_recvfrom_hist = hist($sample);
                        delete(@start[tid]);
                }
        }
        END { clear(@start);}'
```

Here, in the case of inline busy polling, around 25 percent of calls
take 1-2 usecs and around 40 percent take 0.5-2 usecs.

@xsk_recvfrom_hist:
[128, 256)         24073 |@@@@@@@@@@@@@@@@@@@@@@                              |
[256, 512)         55633 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K)          20974 |@@@@@@@@@@@@@@@@@@@                                 |
[1K, 2K)           34234 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     |
[2K, 4K)            3266 |@@@                                                 |
[4K, 8K)              19 |                                                    |

v4:
 - Using AF_XDP based benchmark for experiments.
 - Re-enable dev level napi threaded busypoll after soft reset.

v3:
 - Fixed calls to dev_set_threaded in drivers

v2:
 - Add documentation in napi.rst.
 - Provide experiment data and usecase details.
 - Update busy_poller selftest to include napi threaded poll testcase.
 - Define threaded mode enum in netlink interface.
 - Included NAPI threaded state in napi config to save/restore.

Samiullah Khawaja (4):
  Add support to set napi threaded for individual napi
  net: Create separate gro_flush helper function
  Extend napi threaded polling to allow kthread based busy polling
  selftests: Add napi threaded busy poll test in `busy_poller`

 Documentation/ABI/testing/sysfs-class-net     |   3 +-
 Documentation/netlink/specs/netdev.yaml       |  14 ++
 Documentation/networking/napi.rst             |  80 ++++++++++-
 .../net/ethernet/atheros/atl1c/atl1c_main.c   |   2 +-
 drivers/net/ethernet/mellanox/mlxsw/pci.c     |   2 +-
 drivers/net/ethernet/renesas/ravb_main.c      |   2 +-
 drivers/net/wireless/ath/ath10k/snoc.c        |   2 +-
 include/linux/netdevice.h                     |  24 +++-
 include/uapi/linux/netdev.h                   |   7 +
 net/core/dev.c                                | 135 +++++++++++++++---
 net/core/net-sysfs.c                          |   2 +-
 net/core/netdev-genl-gen.c                    |   5 +-
 net/core/netdev-genl.c                        |   9 ++
 tools/include/uapi/linux/netdev.h             |   7 +
 tools/testing/selftests/net/busy_poll_test.sh |  25 +++-
 tools/testing/selftests/net/busy_poller.c     |  14 +-
 16 files changed, 298 insertions(+), 35 deletions(-)

-- 
2.49.0.395.g12beb8f557-goog



* [PATCH net-next v4 1/4] Add support to set napi threaded for individual napi
  2025-03-21  2:15 [PATCH net-next v4 0/4] Add support to do threaded napi busy poll Samiullah Khawaja
@ 2025-03-21  2:15 ` Samiullah Khawaja
  2025-03-21 17:10   ` Joe Damato
  2025-03-25 14:52   ` Jakub Kicinski
  2025-03-21  2:15 ` [PATCH net-next v4 2/4] net: Create separate gro_flush helper function Samiullah Khawaja
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 22+ messages in thread
From: Samiullah Khawaja @ 2025-03-21  2:15 UTC (permalink / raw)
  To: Jakub Kicinski, David S . Miller , Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, jdamato, mkarsten
  Cc: netdev, skhawaja

A net device has a `threaded` sysfs attribute that can be used to
enable threaded napi polling on all of the NAPI contexts under that
device. Allow enabling threaded napi polling at the individual napi
level using netlink.

Extend the netlink operation `napi-set` to allow setting the threaded
attribute of a NAPI. This enables threaded polling on that napi
context.

Tested using following command in qemu/virtio-net:
./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
  --do napi-set       --json '{"id": 66, "threaded": 1}'

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
---
 Documentation/netlink/specs/netdev.yaml | 10 ++++++++
 Documentation/networking/napi.rst       | 13 ++++++++++-
 include/linux/netdevice.h               | 10 ++++++++
 include/uapi/linux/netdev.h             |  1 +
 net/core/dev.c                          | 31 +++++++++++++++++++++++++
 net/core/netdev-genl-gen.c              |  5 ++--
 net/core/netdev-genl.c                  |  9 +++++++
 tools/include/uapi/linux/netdev.h       |  1 +
 8 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index f5e0750ab71d..92f98f2a6bd7 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -280,6 +280,14 @@ attribute-sets:
         doc: The timeout, in nanoseconds, of how long to suspend irq
              processing, if event polling finds events
         type: uint
+      -
+        name: threaded
+        doc: Whether the napi is configured to operate in threaded polling
+             mode. If this is set to `1` then the NAPI context operates
+             in threaded polling mode.
+        type: u32
+        checks:
+          max: 1
   -
     name: xsk-info
     attributes: []
@@ -691,6 +699,7 @@ operations:
             - defer-hard-irqs
             - gro-flush-timeout
             - irq-suspend-timeout
+            - threaded
       dump:
         request:
           attributes:
@@ -743,6 +752,7 @@ operations:
             - defer-hard-irqs
             - gro-flush-timeout
             - irq-suspend-timeout
+            - threaded
 
 kernel-family:
   headers: [ "net/netdev_netlink.h"]
diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst
index d0e3953cae6a..63f98c05860f 100644
--- a/Documentation/networking/napi.rst
+++ b/Documentation/networking/napi.rst
@@ -444,7 +444,18 @@ dependent). The NAPI instance IDs will be assigned in the opposite
 order than the process IDs of the kernel threads.
 
 Threaded NAPI is controlled by writing 0/1 to the ``threaded`` file in
-netdev's sysfs directory.
+netdev's sysfs directory. It can also be enabled for a specific napi using
+the netlink interface.
+
+For example, using the script:
+
+.. code-block:: bash
+
+  $ kernel-source/tools/net/ynl/pyynl/cli.py \
+            --spec Documentation/netlink/specs/netdev.yaml \
+            --do napi-set \
+            --json='{"id": 66,
+                     "threaded": 1}'
 
 .. rubric:: Footnotes
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0c5b1f7f8f3a..3c244fd9ae6d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -369,6 +369,7 @@ struct napi_config {
 	u64 irq_suspend_timeout;
 	u32 defer_hard_irqs;
 	cpumask_t affinity_mask;
+	bool threaded;
 	unsigned int napi_id;
 };
 
@@ -590,6 +591,15 @@ static inline bool napi_complete(struct napi_struct *n)
 
 int dev_set_threaded(struct net_device *dev, bool threaded);
 
+/*
+ * napi_set_threaded - set napi threaded state
+ * @napi: NAPI context
+ * @threaded: whether this napi does threaded polling
+ *
+ * Return 0 on success and negative errno on failure.
+ */
+int napi_set_threaded(struct napi_struct *napi, bool threaded);
+
 void napi_disable(struct napi_struct *n);
 void napi_disable_locked(struct napi_struct *n);
 
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 7600bf62dbdf..fac1b8ffeb55 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -134,6 +134,7 @@ enum {
 	NETDEV_A_NAPI_DEFER_HARD_IRQS,
 	NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
 	NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT,
+	NETDEV_A_NAPI_THREADED,
 
 	__NETDEV_A_NAPI_MAX,
 	NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
diff --git a/net/core/dev.c b/net/core/dev.c
index 235560341765..b92e4e8890d1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6806,6 +6806,30 @@ static enum hrtimer_restart napi_watchdog(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
+int napi_set_threaded(struct napi_struct *napi, bool threaded)
+{
+	if (napi->dev->threaded)
+		return -EINVAL;
+
+	if (threaded) {
+		if (!napi->thread) {
+			int err = napi_kthread_create(napi);
+
+			if (err)
+				return err;
+		}
+	}
+
+	if (napi->config)
+		napi->config->threaded = threaded;
+
+	/* Make sure kthread is created before THREADED bit is set. */
+	smp_mb__before_atomic();
+	assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
+
+	return 0;
+}
+
 int dev_set_threaded(struct net_device *dev, bool threaded)
 {
 	struct napi_struct *napi;
@@ -6817,6 +6841,11 @@ int dev_set_threaded(struct net_device *dev, bool threaded)
 		return 0;
 
 	if (threaded) {
+		/* Check if threaded is set at napi level already */
+		list_for_each_entry(napi, &dev->napi_list, dev_list)
+			if (test_bit(NAPI_STATE_THREADED, &napi->state))
+				return -EINVAL;
+
 		list_for_each_entry(napi, &dev->napi_list, dev_list) {
 			if (!napi->thread) {
 				err = napi_kthread_create(napi);
@@ -7063,6 +7092,8 @@ static void napi_restore_config(struct napi_struct *n)
 		napi_hash_add(n);
 		n->config->napi_id = n->napi_id;
 	}
+
+	napi_set_threaded(n, n->config->threaded);
 }
 
 static void napi_save_config(struct napi_struct *n)
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index 739f7b6506a6..c2e5cee857d2 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -92,11 +92,12 @@ static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_DMABUF_FD + 1]
 };
 
 /* NETDEV_CMD_NAPI_SET - do */
-static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT + 1] = {
+static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_THREADED + 1] = {
 	[NETDEV_A_NAPI_ID] = { .type = NLA_U32, },
 	[NETDEV_A_NAPI_DEFER_HARD_IRQS] = NLA_POLICY_FULL_RANGE(NLA_U32, &netdev_a_napi_defer_hard_irqs_range),
 	[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT] = { .type = NLA_UINT, },
 	[NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT] = { .type = NLA_UINT, },
+	[NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_U32, 1),
 };
 
 /* Ops table for netdev */
@@ -187,7 +188,7 @@ static const struct genl_split_ops netdev_nl_ops[] = {
 		.cmd		= NETDEV_CMD_NAPI_SET,
 		.doit		= netdev_nl_napi_set_doit,
 		.policy		= netdev_napi_set_nl_policy,
-		.maxattr	= NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT,
+		.maxattr	= NETDEV_A_NAPI_THREADED,
 		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
 	},
 };
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index a186fea63c09..057001c3bbba 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -186,6 +186,9 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
 	if (napi->irq >= 0 && nla_put_u32(rsp, NETDEV_A_NAPI_IRQ, napi->irq))
 		goto nla_put_failure;
 
+	if (nla_put_u32(rsp, NETDEV_A_NAPI_THREADED, !!napi->thread))
+		goto nla_put_failure;
+
 	if (napi->thread) {
 		pid = task_pid_nr(napi->thread);
 		if (nla_put_u32(rsp, NETDEV_A_NAPI_PID, pid))
@@ -324,8 +327,14 @@ netdev_nl_napi_set_config(struct napi_struct *napi, struct genl_info *info)
 {
 	u64 irq_suspend_timeout = 0;
 	u64 gro_flush_timeout = 0;
+	u32 threaded = 0;
 	u32 defer = 0;
 
+	if (info->attrs[NETDEV_A_NAPI_THREADED]) {
+		threaded = nla_get_u32(info->attrs[NETDEV_A_NAPI_THREADED]);
+		napi_set_threaded(napi, !!threaded);
+	}
+
 	if (info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]) {
 		defer = nla_get_u32(info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]);
 		napi_set_defer_hard_irqs(napi, defer);
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 7600bf62dbdf..fac1b8ffeb55 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -134,6 +134,7 @@ enum {
 	NETDEV_A_NAPI_DEFER_HARD_IRQS,
 	NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
 	NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT,
+	NETDEV_A_NAPI_THREADED,
 
 	__NETDEV_A_NAPI_MAX,
 	NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
-- 
2.49.0.395.g12beb8f557-goog



* [PATCH net-next v4 2/4] net: Create separate gro_flush helper function
  2025-03-21  2:15 [PATCH net-next v4 0/4] Add support to do threaded napi busy poll Samiullah Khawaja
  2025-03-21  2:15 ` [PATCH net-next v4 1/4] Add support to set napi threaded for individual napi Samiullah Khawaja
@ 2025-03-21  2:15 ` Samiullah Khawaja
  2025-03-21 17:16   ` Joe Damato
  2025-03-21  2:15 ` [PATCH net-next v4 3/4] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Samiullah Khawaja @ 2025-03-21  2:15 UTC (permalink / raw)
  To: Jakub Kicinski, David S . Miller , Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, jdamato, mkarsten
  Cc: netdev, skhawaja

Move multiple copies of the same code snippet doing `gro_flush` and
`gro_normal_list` into a separate helper function.

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
---
 net/core/dev.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index b92e4e8890d1..cc746f223554 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6516,6 +6516,13 @@ static void skb_defer_free_flush(struct softnet_data *sd)
 	}
 }
 
+static void __napi_gro_flush_helper(struct napi_struct *napi)
+{
+	/* Flush too old packets. If HZ < 1000, flush all packets */
+	gro_flush(&napi->gro, HZ >= 1000);
+	gro_normal_list(&napi->gro);
+}
+
 #if defined(CONFIG_NET_RX_BUSY_POLL)
 
 static void __busy_poll_stop(struct napi_struct *napi, bool skip_schedule)
@@ -6526,9 +6533,7 @@ static void __busy_poll_stop(struct napi_struct *napi, bool skip_schedule)
 		return;
 	}
 
-	/* Flush too old packets. If HZ < 1000, flush all packets */
-	gro_flush(&napi->gro, HZ >= 1000);
-	gro_normal_list(&napi->gro);
+	__napi_gro_flush_helper(napi);
 
 	clear_bit(NAPI_STATE_SCHED, &napi->state);
 }
@@ -7360,9 +7365,7 @@ static int __napi_poll(struct napi_struct *n, bool *repoll)
 		return work;
 	}
 
-	/* Flush too old packets. If HZ < 1000, flush all packets */
-	gro_flush(&n->gro, HZ >= 1000);
-	gro_normal_list(&n->gro);
+	__napi_gro_flush_helper(n);
 
 	/* Some drivers may have called napi_schedule
 	 * prior to exhausting their budget.
-- 
2.49.0.395.g12beb8f557-goog



* [PATCH net-next v4 3/4] Extend napi threaded polling to allow kthread based busy polling
  2025-03-21  2:15 [PATCH net-next v4 0/4] Add support to do threaded napi busy poll Samiullah Khawaja
  2025-03-21  2:15 ` [PATCH net-next v4 1/4] Add support to set napi threaded for individual napi Samiullah Khawaja
  2025-03-21  2:15 ` [PATCH net-next v4 2/4] net: Create separate gro_flush helper function Samiullah Khawaja
@ 2025-03-21  2:15 ` Samiullah Khawaja
  2025-03-21 17:39   ` Joe Damato
  2025-03-21  2:15 ` [PATCH net-next v4 4/4] selftests: Add napi threaded busy poll test in `busy_poller` Samiullah Khawaja
  2025-03-24  2:38 ` [PATCH net-next v4 0/4] Add support to do threaded napi busy poll Martin Karsten
  4 siblings, 1 reply; 22+ messages in thread
From: Samiullah Khawaja @ 2025-03-21  2:15 UTC (permalink / raw)
  To: Jakub Kicinski, David S . Miller , Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, jdamato, mkarsten
  Cc: netdev, skhawaja

Add a new state to napi state enum:

- STATE_THREADED_BUSY_POLL
  Threaded busy poll is enabled/running for this napi.

The following changes are introduced in the napi scheduling and state
logic:

- When threaded busy poll is enabled through sysfs, it also enables
  NAPI_STATE_THREADED so a kthread is created per napi. It also sets
  the NAPI_STATE_THREADED_BUSY_POLL bit on each napi to indicate that
  it should be busy polled.

- When a napi is scheduled with STATE_SCHED_THREADED and the associated
  kthread is woken up, the kthread owns the context. If both
  NAPI_STATE_THREADED_BUSY_POLL and NAPI_SCHED_THREADED are set, it
  means that we can busy poll.

- To keep busy polling going and avoid rescheduling interrupts,
  napi_complete_done returns false when both the SCHED_THREADED and
  THREADED_BUSY_POLL flags are set. napi_complete_done also returns
  early to avoid clearing the STATE_SCHED_THREADED bit.

- If at any point STATE_THREADED_BUSY_POLL is cleared,
  napi_complete_done runs fully and also clears the SCHED_THREADED bit,
  making the associated kthread go to sleep as per the existing logic.

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
---
 Documentation/ABI/testing/sysfs-class-net     |  3 +-
 Documentation/netlink/specs/netdev.yaml       | 12 ++-
 Documentation/networking/napi.rst             | 67 ++++++++++++-
 .../net/ethernet/atheros/atl1c/atl1c_main.c   |  2 +-
 drivers/net/ethernet/mellanox/mlxsw/pci.c     |  2 +-
 drivers/net/ethernet/renesas/ravb_main.c      |  2 +-
 drivers/net/wireless/ath/ath10k/snoc.c        |  2 +-
 include/linux/netdevice.h                     | 20 +++-
 include/uapi/linux/netdev.h                   |  6 ++
 net/core/dev.c                                | 93 ++++++++++++++++---
 net/core/net-sysfs.c                          |  2 +-
 net/core/netdev-genl-gen.c                    |  2 +-
 net/core/netdev-genl.c                        |  2 +-
 tools/include/uapi/linux/netdev.h             |  6 ++
 14 files changed, 188 insertions(+), 33 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-class-net b/Documentation/ABI/testing/sysfs-class-net
index ebf21beba846..15d7d36a8294 100644
--- a/Documentation/ABI/testing/sysfs-class-net
+++ b/Documentation/ABI/testing/sysfs-class-net
@@ -343,7 +343,7 @@ Date:		Jan 2021
 KernelVersion:	5.12
 Contact:	netdev@vger.kernel.org
 Description:
-		Boolean value to control the threaded mode per device. User could
+		Integer value to control the threaded mode per device. User could
 		set this value to enable/disable threaded mode for all napi
 		belonging to this device, without the need to do device up/down.
 
@@ -351,4 +351,5 @@ Description:
 		== ==================================
 		0  threaded mode disabled for this dev
 		1  threaded mode enabled for this dev
+		2  threaded mode enabled, and busy polling enabled.
 		== ==================================
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 92f98f2a6bd7..650179559558 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -82,6 +82,10 @@ definitions:
     name: qstats-scope
     type: flags
     entries: [ queue ]
+  -
+    name: napi-threaded
+    type: enum
+    entries: [ disable, enable, busy-poll-enable ]
 
 attribute-sets:
   -
@@ -283,11 +287,11 @@ attribute-sets:
       -
         name: threaded
         doc: Whether the napi is configured to operate in threaded polling
-             mode. If this is set to `1` then the NAPI context operates
-             in threaded polling mode.
+             mode. If this is set to `enable` then the NAPI context operates
+             in threaded polling mode. If this is set to `busy-poll-enable`
+             then the NAPI kthread also does busypolling.
         type: u32
-        checks:
-          max: 1
+        enum: napi-threaded
   -
     name: xsk-info
     attributes: []
diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst
index 63f98c05860f..0f83142c624d 100644
--- a/Documentation/networking/napi.rst
+++ b/Documentation/networking/napi.rst
@@ -263,7 +263,9 @@ are not well known).
 Busy polling is enabled by either setting ``SO_BUSY_POLL`` on
 selected sockets or using the global ``net.core.busy_poll`` and
 ``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling
-also exists.
+also exists. Threaded polling of NAPI also has a mode to busy poll for
+packets (:ref:`threaded busy polling<threaded_busy_poll>`) using the same
+thread that is used for NAPI processing.
 
 epoll-based busy polling
 ------------------------
@@ -426,6 +428,69 @@ Therefore, setting ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` is
 the recommended usage, because otherwise setting ``irq-suspend-timeout``
 might not have any discernible effect.
 
+.. _threaded_busy_poll:
+
+Threaded NAPI busy polling
+--------------------------
+
+Threaded napi allows processing of packets from each NAPI in a kthread in
+kernel. Threaded napi busy polling extends this and adds support to do
+continuous busy polling of this napi. This can be used to enable busy polling
+independent of userspace application or the API (epoll, io_uring, raw sockets)
+being used in userspace to process the packets.
+
+It can be enabled for each NAPI using netlink interface or at device level using
+the threaded NAPI sysctl.
+
+For example, using following script:
+
+.. code-block:: bash
+
+  $ kernel-source/tools/net/ynl/pyynl/cli.py \
+            --spec Documentation/netlink/specs/netdev.yaml \
+            --do napi-set \
+            --json='{"id": 66,
+                     "threaded": "busy-poll-enable"}'
+
+
+Enabling it for each NAPI allows finer control to enable busy polling for
+only a set of NIC queues which will get traffic with low latency requirements.
+
+Depending on application requirement, user might want to set affinity of the
+kthread that is busy polling each NAPI. User might also want to set priority
+and the scheduler of the thread depending on the latency requirements.
+
+For a hard low-latency application, user might want to dedicate the full core
+for the NAPI polling so the NIC queue descriptors are picked up from the queue
+as soon as they appear. For more relaxed low-latency requirement, user might
+want to share the core with other threads.
+
+Once threaded busy polling is enabled for a NAPI, PID of the kthread can be
+fetched using netlink interface so the affinity, priority and scheduler
+configuration can be done.
+
+For example, following script can be used to fetch the pid:
+
+.. code-block:: bash
+
+  $ kernel-source/tools/net/ynl/pyynl/cli.py \
+            --spec Documentation/netlink/specs/netdev.yaml \
+            --do napi-get \
+            --json='{"id": 66}'
+
+This will output something like the following; the pid `258` is the PID of the
+kthread that is polling this NAPI.
+
+.. code-block:: bash
+
+  {'defer-hard-irqs': 0,
+   'gro-flush-timeout': 0,
+   'id': 66,
+   'ifindex': 2,
+   'irq-suspend-timeout': 0,
+   'pid': 258,
+   'threaded': 'busy-poll-enable'}
+
 .. _threaded:
 
 Threaded NAPI
diff --git a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
index c571614b1d50..513328476770 100644
--- a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
+++ b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
@@ -2688,7 +2688,7 @@ static int atl1c_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	adapter->mii.mdio_write = atl1c_mdio_write;
 	adapter->mii.phy_id_mask = 0x1f;
 	adapter->mii.reg_num_mask = MDIO_CTRL_REG_MASK;
-	dev_set_threaded(netdev, true);
+	dev_set_threaded(netdev, NETDEV_NAPI_THREADED_ENABLE);
 	for (i = 0; i < adapter->rx_queue_count; ++i)
 		netif_napi_add(netdev, &adapter->rrd_ring[i].napi,
 			       atl1c_clean_rx);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c
index 058dcabfaa2e..2ed3b9263be2 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/pci.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c
@@ -156,7 +156,7 @@ static int mlxsw_pci_napi_devs_init(struct mlxsw_pci *mlxsw_pci)
 	}
 	strscpy(mlxsw_pci->napi_dev_rx->name, "mlxsw_rx",
 		sizeof(mlxsw_pci->napi_dev_rx->name));
-	dev_set_threaded(mlxsw_pci->napi_dev_rx, true);
+	dev_set_threaded(mlxsw_pci->napi_dev_rx, NETDEV_NAPI_THREADED_ENABLE);
 
 	return 0;
 
diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index c9f4976a3527..12e4f68c0c8f 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -3075,7 +3075,7 @@ static int ravb_probe(struct platform_device *pdev)
 	if (info->coalesce_irqs) {
 		netdev_sw_irq_coalesce_default_on(ndev);
 		if (num_present_cpus() == 1)
-			dev_set_threaded(ndev, true);
+			dev_set_threaded(ndev, NETDEV_NAPI_THREADED_ENABLE);
 	}
 
 	/* Network device register */
diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c
index d436a874cd5a..52e0de8c3069 100644
--- a/drivers/net/wireless/ath/ath10k/snoc.c
+++ b/drivers/net/wireless/ath/ath10k/snoc.c
@@ -935,7 +935,7 @@ static int ath10k_snoc_hif_start(struct ath10k *ar)
 
 	bitmap_clear(ar_snoc->pending_ce_irqs, 0, CE_COUNT_MAX);
 
-	dev_set_threaded(ar->napi_dev, true);
+	dev_set_threaded(ar->napi_dev, NETDEV_NAPI_THREADED_ENABLE);
 	ath10k_core_napi_enable(ar);
 	ath10k_snoc_irq_enable(ar);
 	ath10k_snoc_rx_post(ar);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3c244fd9ae6d..b990cbe76f86 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -369,7 +369,7 @@ struct napi_config {
 	u64 irq_suspend_timeout;
 	u32 defer_hard_irqs;
 	cpumask_t affinity_mask;
-	bool threaded;
+	u8 threaded;
 	unsigned int napi_id;
 };
 
@@ -427,6 +427,8 @@ enum {
 	NAPI_STATE_THREADED,		/* The poll is performed inside its own thread*/
 	NAPI_STATE_SCHED_THREADED,	/* Napi is currently scheduled in threaded mode */
 	NAPI_STATE_HAS_NOTIFIER,	/* Napi has an IRQ notifier */
+	NAPI_STATE_THREADED_BUSY_POLL,	/* The threaded napi poller will busy poll */
+	NAPI_STATE_SCHED_THREADED_BUSY_POLL,  /* The threaded napi poller is busy polling */
 };
 
 enum {
@@ -441,8 +443,14 @@ enum {
 	NAPIF_STATE_THREADED		= BIT(NAPI_STATE_THREADED),
 	NAPIF_STATE_SCHED_THREADED	= BIT(NAPI_STATE_SCHED_THREADED),
 	NAPIF_STATE_HAS_NOTIFIER	= BIT(NAPI_STATE_HAS_NOTIFIER),
+	NAPIF_STATE_THREADED_BUSY_POLL	= BIT(NAPI_STATE_THREADED_BUSY_POLL),
+	NAPIF_STATE_SCHED_THREADED_BUSY_POLL  =
+			BIT(NAPI_STATE_SCHED_THREADED_BUSY_POLL),
 };
 
+#define NAPIF_STATE_THREADED_BUSY_POLL_MASK \
+	(NAPIF_STATE_THREADED | NAPIF_STATE_THREADED_BUSY_POLL)
+
 enum gro_result {
 	GRO_MERGED,
 	GRO_MERGED_FREE,
@@ -589,16 +597,18 @@ static inline bool napi_complete(struct napi_struct *n)
 	return napi_complete_done(n, 0);
 }
 
-int dev_set_threaded(struct net_device *dev, bool threaded);
+int dev_set_threaded(struct net_device *dev,
+		     enum netdev_napi_threaded threaded);
 
 /*
  * napi_set_threaded - set napi threaded state
  * @napi: NAPI context
- * @threaded: whether this napi does threaded polling
+ * @threaded: threading mode
  *
  * Return 0 on success and negative errno on failure.
  */
-int napi_set_threaded(struct napi_struct *napi, bool threaded);
+int napi_set_threaded(struct napi_struct *napi,
+		      enum netdev_napi_threaded threaded);
 
 void napi_disable(struct napi_struct *n);
 void napi_disable_locked(struct napi_struct *n);
@@ -2432,7 +2442,7 @@ struct net_device {
 	struct sfp_bus		*sfp_bus;
 	struct lock_class_key	*qdisc_tx_busylock;
 	bool			proto_down;
-	bool			threaded;
+	u8			threaded;
 	bool			irq_affinity_auto;
 	bool			rx_cpu_rmap_auto;
 
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index fac1b8ffeb55..b9b59d60957f 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -77,6 +77,12 @@ enum netdev_qstats_scope {
 	NETDEV_QSTATS_SCOPE_QUEUE = 1,
 };
 
+enum netdev_napi_threaded {
+	NETDEV_NAPI_THREADED_DISABLE,
+	NETDEV_NAPI_THREADED_ENABLE,
+	NETDEV_NAPI_THREADED_BUSY_POLL_ENABLE,
+};
+
 enum {
 	NETDEV_A_DEV_IFINDEX = 1,
 	NETDEV_A_DEV_PAD,
diff --git a/net/core/dev.c b/net/core/dev.c
index cc746f223554..8323782541fa 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -78,6 +78,7 @@
 #include <linux/slab.h>
 #include <linux/sched.h>
 #include <linux/sched/isolation.h>
+#include <linux/sched/types.h>
 #include <linux/sched/mm.h>
 #include <linux/smpboot.h>
 #include <linux/mutex.h>
@@ -6436,7 +6437,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 	 *    the guarantee we will be called later.
 	 */
 	if (unlikely(n->state & (NAPIF_STATE_NPSVC |
-				 NAPIF_STATE_IN_BUSY_POLL)))
+				 NAPIF_STATE_IN_BUSY_POLL |
+				 NAPIF_STATE_SCHED_THREADED_BUSY_POLL)))
 		return false;
 
 	if (work_done) {
@@ -6811,7 +6813,21 @@ static enum hrtimer_restart napi_watchdog(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
-int napi_set_threaded(struct napi_struct *napi, bool threaded)
+static void napi_set_threaded_state(struct napi_struct *napi,
+				    enum netdev_napi_threaded threaded)
+{
+	unsigned long val;
+
+	val = 0;
+	if (threaded == NETDEV_NAPI_THREADED_BUSY_POLL_ENABLE)
+		val |= NAPIF_STATE_THREADED_BUSY_POLL;
+	if (threaded)
+		val |= NAPIF_STATE_THREADED;
+	set_mask_bits(&napi->state, NAPIF_STATE_THREADED_BUSY_POLL_MASK, val);
+}
+
+int napi_set_threaded(struct napi_struct *napi,
+		      enum netdev_napi_threaded threaded)
 {
 	if (napi->dev->threaded)
 		return -EINVAL;
@@ -6830,14 +6846,15 @@ int napi_set_threaded(struct napi_struct *napi, bool threaded)
 
 	/* Make sure kthread is created before THREADED bit is set. */
 	smp_mb__before_atomic();
-	assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
+	napi_set_threaded_state(napi, threaded);
 
 	return 0;
 }
 
-int dev_set_threaded(struct net_device *dev, bool threaded)
+int dev_set_threaded(struct net_device *dev, enum netdev_napi_threaded threaded)
 {
 	struct napi_struct *napi;
+	unsigned long val;
 	int err = 0;
 
 	netdev_assert_locked_or_invisible(dev);
@@ -6845,17 +6862,22 @@ int dev_set_threaded(struct net_device *dev, bool threaded)
 	if (dev->threaded == threaded)
 		return 0;
 
+	val = 0;
 	if (threaded) {
 		/* Check if threaded is set at napi level already */
 		list_for_each_entry(napi, &dev->napi_list, dev_list)
 			if (test_bit(NAPI_STATE_THREADED, &napi->state))
 				return -EINVAL;
 
+		val |= NAPIF_STATE_THREADED;
+		if (threaded == NETDEV_NAPI_THREADED_BUSY_POLL_ENABLE)
+			val |= NAPIF_STATE_THREADED_BUSY_POLL;
+
 		list_for_each_entry(napi, &dev->napi_list, dev_list) {
 			if (!napi->thread) {
 				err = napi_kthread_create(napi);
 				if (err) {
-					threaded = false;
+					threaded = NETDEV_NAPI_THREADED_DISABLE;
 					break;
 				}
 			}
@@ -6874,9 +6896,13 @@ int dev_set_threaded(struct net_device *dev, bool threaded)
 	 * polled. In this case, the switch between threaded mode and
 	 * softirq mode will happen in the next round of napi_schedule().
 	 * This should not cause hiccups/stalls to the live traffic.
+	 *
+	 * The switch to busy-poll threaded napi will occur the next time
+	 * the threaded napi is scheduled.
 	 */
 	list_for_each_entry(napi, &dev->napi_list, dev_list)
-		assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
+		set_mask_bits(&napi->state,
+			      NAPIF_STATE_THREADED_BUSY_POLL_MASK, val);
 
 	return err;
 }
@@ -7196,8 +7222,12 @@ void netif_napi_add_weight_locked(struct net_device *dev,
 	 * Clear dev->threaded if kthread creation failed so that
 	 * threaded mode will not be enabled in napi_enable().
 	 */
-	if (dev->threaded && napi_kthread_create(napi))
-		dev->threaded = false;
+	if (dev->threaded) {
+		if (napi_kthread_create(napi))
+			dev->threaded = false;
+		else
+			napi_set_threaded_state(napi, dev->threaded);
+	}
 	netif_napi_set_irq_locked(napi, -1);
 }
 EXPORT_SYMBOL(netif_napi_add_weight_locked);
@@ -7219,7 +7249,9 @@ void napi_disable_locked(struct napi_struct *n)
 		}
 
 		new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;
-		new &= ~(NAPIF_STATE_THREADED | NAPIF_STATE_PREFER_BUSY_POLL);
+		new &= ~(NAPIF_STATE_THREADED
+			 | NAPIF_STATE_THREADED_BUSY_POLL
+			 | NAPIF_STATE_PREFER_BUSY_POLL);
 	} while (!try_cmpxchg(&n->state, &val, new));
 
 	hrtimer_cancel(&n->timer);
@@ -7263,7 +7295,7 @@ void napi_enable_locked(struct napi_struct *n)
 
 		new = val & ~(NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC);
 		if (n->dev->threaded && n->thread)
-			new |= NAPIF_STATE_THREADED;
+			napi_set_threaded_state(n, n->dev->threaded);
 	} while (!try_cmpxchg(&n->state, &val, new));
 }
 EXPORT_SYMBOL(napi_enable_locked);
@@ -7425,7 +7457,7 @@ static int napi_thread_wait(struct napi_struct *napi)
 	return -1;
 }
 
-static void napi_threaded_poll_loop(struct napi_struct *napi)
+static void napi_threaded_poll_loop(struct napi_struct *napi, bool busy_poll)
 {
 	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	struct softnet_data *sd;
@@ -7454,22 +7486,53 @@ static void napi_threaded_poll_loop(struct napi_struct *napi)
 		}
 		skb_defer_free_flush(sd);
 		bpf_net_ctx_clear(bpf_net_ctx);
+
+		/* Push the skbs up the stack if busy polling. */
+		if (busy_poll)
+			__napi_gro_flush_helper(napi);
 		local_bh_enable();
 
-		if (!repoll)
+		/* If busy polling then do not break here because we need to
+		 * call cond_resched and rcu_softirq_qs_periodic to prevent
+		 * watchdog warnings.
+		 */
+		if (!repoll && !busy_poll)
 			break;
 
 		rcu_softirq_qs_periodic(last_qs);
 		cond_resched();
+
+		if (!repoll)
+			break;
 	}
 }
 
 static int napi_threaded_poll(void *data)
 {
 	struct napi_struct *napi = data;
+	bool busy_poll_sched;
+	unsigned long val;
+	bool busy_poll;
+
+	while (!napi_thread_wait(napi)) {
+		/* Once woken up, we are scheduled as threaded napi and this
+		 * thread owns the napi context. If the busy poll state is
+		 * set, busy poll this napi.
+		 */
+		val = READ_ONCE(napi->state);
+		busy_poll = val & NAPIF_STATE_THREADED_BUSY_POLL;
+		busy_poll_sched = val & NAPIF_STATE_SCHED_THREADED_BUSY_POLL;
 
-	while (!napi_thread_wait(napi))
-		napi_threaded_poll_loop(napi);
+		/* Do not busy poll if napi is disabled. */
+		if (unlikely(val & NAPIF_STATE_DISABLE))
+			busy_poll = false;
+
+		if (busy_poll != busy_poll_sched)
+			assign_bit(NAPI_STATE_SCHED_THREADED_BUSY_POLL,
+				   &napi->state, busy_poll);
+
+		napi_threaded_poll_loop(napi, busy_poll);
+	}
 
 	return 0;
 }
@@ -12637,7 +12700,7 @@ static void run_backlog_napi(unsigned int cpu)
 {
 	struct softnet_data *sd = per_cpu_ptr(&softnet_data, cpu);
 
-	napi_threaded_poll_loop(&sd->backlog);
+	napi_threaded_poll_loop(&sd->backlog, false);
 }
 
 static void backlog_napi_setup(unsigned int cpu)
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index abaa1c919b98..d3ccd04960fb 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -741,7 +741,7 @@ static int modify_napi_threaded(struct net_device *dev, unsigned long val)
 	if (list_empty(&dev->napi_list))
 		return -EOPNOTSUPP;
 
-	if (val != 0 && val != 1)
+	if (val > NETDEV_NAPI_THREADED_BUSY_POLL_ENABLE)
 		return -EOPNOTSUPP;
 
 	ret = dev_set_threaded(dev, val);
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index c2e5cee857d2..1dbe5f19a192 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -97,7 +97,7 @@ static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_THREADED
 	[NETDEV_A_NAPI_DEFER_HARD_IRQS] = NLA_POLICY_FULL_RANGE(NLA_U32, &netdev_a_napi_defer_hard_irqs_range),
 	[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT] = { .type = NLA_UINT, },
 	[NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT] = { .type = NLA_UINT, },
-	[NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_U32, 1),
+	[NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_U32, 2),
 };
 
 /* Ops table for netdev */
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 057001c3bbba..e540475290ca 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -332,7 +332,7 @@ netdev_nl_napi_set_config(struct napi_struct *napi, struct genl_info *info)
 
 	if (info->attrs[NETDEV_A_NAPI_THREADED]) {
 		threaded = nla_get_u32(info->attrs[NETDEV_A_NAPI_THREADED]);
-		napi_set_threaded(napi, !!threaded);
+		napi_set_threaded(napi, threaded);
 	}
 
 	if (info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]) {
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index fac1b8ffeb55..b9b59d60957f 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -77,6 +77,12 @@ enum netdev_qstats_scope {
 	NETDEV_QSTATS_SCOPE_QUEUE = 1,
 };
 
+enum netdev_napi_threaded {
+	NETDEV_NAPI_THREADED_DISABLE,
+	NETDEV_NAPI_THREADED_ENABLE,
+	NETDEV_NAPI_THREADED_BUSY_POLL_ENABLE,
+};
+
 enum {
 	NETDEV_A_DEV_IFINDEX = 1,
 	NETDEV_A_DEV_PAD,
-- 
2.49.0.395.g12beb8f557-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH net-next v4 4/4] selftests: Add napi threaded busy poll test in `busy_poller`
  2025-03-21  2:15 [PATCH net-next v4 0/4] Add support to do threaded napi busy poll Samiullah Khawaja
                   ` (2 preceding siblings ...)
  2025-03-21  2:15 ` [PATCH net-next v4 3/4] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
@ 2025-03-21  2:15 ` Samiullah Khawaja
  2025-03-24  2:38 ` [PATCH net-next v4 0/4] Add support to do threaded napi busy poll Martin Karsten
  4 siblings, 0 replies; 22+ messages in thread
From: Samiullah Khawaja @ 2025-03-21  2:15 UTC (permalink / raw)
  To: Jakub Kicinski, David S . Miller , Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, jdamato, mkarsten
  Cc: netdev, skhawaja

Add testcase to run busy poll test with threaded napi busy poll enabled.

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
---
 tools/testing/selftests/net/busy_poll_test.sh | 25 ++++++++++++++++++-
 tools/testing/selftests/net/busy_poller.c     | 14 ++++++++---
 2 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/net/busy_poll_test.sh b/tools/testing/selftests/net/busy_poll_test.sh
index 7db292ec4884..aeca610dc989 100755
--- a/tools/testing/selftests/net/busy_poll_test.sh
+++ b/tools/testing/selftests/net/busy_poll_test.sh
@@ -27,6 +27,9 @@ NAPI_DEFER_HARD_IRQS=100
 GRO_FLUSH_TIMEOUT=50000
 SUSPEND_TIMEOUT=20000000
 
+# NAPI threaded busy poll config
+NAPI_THREADED_POLL=2
+
 setup_ns()
 {
 	set -e
@@ -62,6 +65,9 @@ cleanup_ns()
 test_busypoll()
 {
 	suspend_value=${1:-0}
+	napi_threaded_value=${2:-0}
+	prefer_busy_poll_value=${3:-$PREFER_BUSY_POLL}
+
 	tmp_file=$(mktemp)
 	out_file=$(mktemp)
 
@@ -73,10 +79,11 @@ test_busypoll()
 					     -b${SERVER_IP}        \
 					     -m${MAX_EVENTS}       \
 					     -u${BUSY_POLL_USECS}  \
-					     -P${PREFER_BUSY_POLL} \
+					     -P${prefer_busy_poll_value} \
 					     -g${BUSY_POLL_BUDGET} \
 					     -i${NSIM_SV_IFIDX}    \
 					     -s${suspend_value}    \
+					     -t${napi_threaded_value} \
 					     -o${out_file}&
 
 	wait_local_port_listen nssv ${SERVER_PORT} tcp
@@ -109,6 +116,15 @@ test_busypoll_with_suspend()
 	return $?
 }
 
+test_busypoll_with_napi_threaded()
+{
+	# Only enable napi threaded poll. Set suspend timeout and prefer busy
+	# poll to 0.
+	test_busypoll 0 ${NAPI_THREADED_POLL} 0
+
+	return $?
+}
+
 ###
 ### Code start
 ###
@@ -154,6 +170,13 @@ if [ $? -ne 0 ]; then
 	exit 1
 fi
 
+test_busypoll_with_napi_threaded
+if [ $? -ne 0 ]; then
+	echo "test_busypoll_with_napi_threaded failed"
+	cleanup_ns
+	exit 1
+fi
+
 echo "$NSIM_SV_FD:$NSIM_SV_IFIDX" > $NSIM_DEV_SYS_UNLINK
 
 echo $NSIM_CL_ID > $NSIM_DEV_SYS_DEL
diff --git a/tools/testing/selftests/net/busy_poller.c b/tools/testing/selftests/net/busy_poller.c
index 04c7ff577bb8..f7407f09f635 100644
--- a/tools/testing/selftests/net/busy_poller.c
+++ b/tools/testing/selftests/net/busy_poller.c
@@ -65,15 +65,16 @@ static uint32_t cfg_busy_poll_usecs;
 static uint16_t cfg_busy_poll_budget;
 static uint8_t cfg_prefer_busy_poll;
 
-/* IRQ params */
+/* NAPI params */
 static uint32_t cfg_defer_hard_irqs;
 static uint64_t cfg_gro_flush_timeout;
 static uint64_t cfg_irq_suspend_timeout;
+static enum netdev_napi_threaded cfg_napi_threaded_poll = NETDEV_NAPI_THREADED_DISABLE;
 
 static void usage(const char *filepath)
 {
 	error(1, 0,
-	      "Usage: %s -p<port> -b<addr> -m<max_events> -u<busy_poll_usecs> -P<prefer_busy_poll> -g<busy_poll_budget> -o<outfile> -d<defer_hard_irqs> -r<gro_flush_timeout> -s<irq_suspend_timeout> -i<ifindex>",
+	      "Usage: %s -p<port> -b<addr> -m<max_events> -u<busy_poll_usecs> -P<prefer_busy_poll> -g<busy_poll_budget> -o<outfile> -d<defer_hard_irqs> -r<gro_flush_timeout> -s<irq_suspend_timeout> -t<napi_threaded_poll> -i<ifindex>",
 	      filepath);
 }
 
@@ -86,7 +87,7 @@ static void parse_opts(int argc, char **argv)
 	if (argc <= 1)
 		usage(argv[0]);
 
-	while ((c = getopt(argc, argv, "p:m:b:u:P:g:o:d:r:s:i:")) != -1) {
+	while ((c = getopt(argc, argv, "p:m:b:u:P:g:o:d:r:s:i:t:")) != -1) {
 		/* most options take integer values, except o and b, so reduce
 		 * code duplication a bit for the common case by calling
 		 * strtoull here and leave bounds checking and casting per
@@ -168,6 +169,12 @@ static void parse_opts(int argc, char **argv)
 
 			cfg_ifindex = (int)tmp;
 			break;
+		case 't':
+			if (tmp == ULLONG_MAX || tmp > 2)
+				error(1, ERANGE, "napi threaded poll value must be 0-2");
+
+			cfg_napi_threaded_poll = (enum netdev_napi_threaded)tmp;
+			break;
 		}
 	}
 
@@ -246,6 +253,7 @@ static void setup_queue(void)
 						  cfg_gro_flush_timeout);
 	netdev_napi_set_req_set_irq_suspend_timeout(set_req,
 						    cfg_irq_suspend_timeout);
+	netdev_napi_set_req_set_threaded(set_req, cfg_napi_threaded_poll);
 
 	if (netdev_napi_set(ys, set_req))
 		error(1, 0, "can't set NAPI params: %s\n", yerr.msg);
-- 
2.49.0.395.g12beb8f557-goog



* Re: [PATCH net-next v4 1/4] Add support to set napi threaded for individual napi
  2025-03-21  2:15 ` [PATCH net-next v4 1/4] Add support to set napi threaded for individual napi Samiullah Khawaja
@ 2025-03-21 17:10   ` Joe Damato
  2025-03-25 14:51     ` Jakub Kicinski
  2025-03-25 14:52   ` Jakub Kicinski
  1 sibling, 1 reply; 22+ messages in thread
From: Joe Damato @ 2025-03-21 17:10 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Jakub Kicinski, David S . Miller , Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, mkarsten, netdev

On Fri, Mar 21, 2025 at 02:15:18AM +0000, Samiullah Khawaja wrote:
> A net device has a threaded sysctl that can be used to enable threaded
> napi polling on all of the NAPI contexts under that device. Allow
> enabling threaded napi polling at individual napi level using netlink.
> 
> Extend the netlink operation `napi-set` and allow setting the threaded
> attribute of a NAPI. This will enable the threaded polling on a napi
> context.
> 
> Tested using following command in qemu/virtio-net:
> ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
>   --do napi-set       --json '{"id": 66, "threaded": 1}'
> 
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> ---
>  Documentation/netlink/specs/netdev.yaml | 10 ++++++++
>  Documentation/networking/napi.rst       | 13 ++++++++++-
>  include/linux/netdevice.h               | 10 ++++++++
>  include/uapi/linux/netdev.h             |  1 +
>  net/core/dev.c                          | 31 +++++++++++++++++++++++++
>  net/core/netdev-genl-gen.c              |  5 ++--
>  net/core/netdev-genl.c                  |  9 +++++++
>  tools/include/uapi/linux/netdev.h       |  1 +
>  8 files changed, 77 insertions(+), 3 deletions(-)

I still think that this patch could be submitted on its own, separate
from this series, with its own selftest.

It seems generally useful to be able to enable threaded NAPI on a
per-NAPI basis, unrelated to the rest of the work in the series.

> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index f5e0750ab71d..92f98f2a6bd7 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -280,6 +280,14 @@ attribute-sets:
>          doc: The timeout, in nanoseconds, of how long to suspend irq
>               processing, if event polling finds events
>          type: uint
> +      -
> +        name: threaded
> +        doc: Whether the napi is configured to operate in threaded polling
> +             mode. If this is set to `1` then the NAPI context operates
> +             in threaded polling mode.
> +        type: u32

I think the type should be uint instead of u32?

> +        checks:
> +          max: 1
>    -
>      name: xsk-info
>      attributes: []
> @@ -691,6 +699,7 @@ operations:
>              - defer-hard-irqs
>              - gro-flush-timeout
>              - irq-suspend-timeout
> +            - threaded
>        dump:
>          request:
>            attributes:
> @@ -743,6 +752,7 @@ operations:
>              - defer-hard-irqs
>              - gro-flush-timeout
>              - irq-suspend-timeout
> +            - threaded
>  
>  kernel-family:
>    headers: [ "net/netdev_netlink.h"]
> diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst
> index d0e3953cae6a..63f98c05860f 100644
> --- a/Documentation/networking/napi.rst
> +++ b/Documentation/networking/napi.rst
> @@ -444,7 +444,18 @@ dependent). The NAPI instance IDs will be assigned in the opposite
>  order than the process IDs of the kernel threads.
>  
>  Threaded NAPI is controlled by writing 0/1 to the ``threaded`` file in
> -netdev's sysfs directory.
> +netdev's sysfs directory. It can also be enabled for a specific napi
> +using the netlink interface.
> +
> +For example, using the script:
> +
> +.. code-block:: bash
> +
> +  $ kernel-source/tools/net/ynl/pyynl/cli.py \
> +            --spec Documentation/netlink/specs/netdev.yaml \
> +            --do napi-set \
> +            --json='{"id": 66,
> +                     "threaded": 1}'
>  
>  .. rubric:: Footnotes
>  
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 0c5b1f7f8f3a..3c244fd9ae6d 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -369,6 +369,7 @@ struct napi_config {
>  	u64 irq_suspend_timeout;
>  	u32 defer_hard_irqs;
>  	cpumask_t affinity_mask;
> +	bool threaded;
>  	unsigned int napi_id;
>  };
>  
> @@ -590,6 +591,15 @@ static inline bool napi_complete(struct napi_struct *n)
>  
>  int dev_set_threaded(struct net_device *dev, bool threaded);
>  
> +/*
> + * napi_set_threaded - set napi threaded state
> + * @napi: NAPI context
> + * @threaded: whether this napi does threaded polling
> + *
> + * Return 0 on success and negative errno on failure.
> + */
> +int napi_set_threaded(struct napi_struct *napi, bool threaded);
> +
>  void napi_disable(struct napi_struct *n);
>  void napi_disable_locked(struct napi_struct *n);
>  
> diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
> index 7600bf62dbdf..fac1b8ffeb55 100644
> --- a/include/uapi/linux/netdev.h
> +++ b/include/uapi/linux/netdev.h
> @@ -134,6 +134,7 @@ enum {
>  	NETDEV_A_NAPI_DEFER_HARD_IRQS,
>  	NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
>  	NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT,
> +	NETDEV_A_NAPI_THREADED,
>  
>  	__NETDEV_A_NAPI_MAX,
>  	NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 235560341765..b92e4e8890d1 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6806,6 +6806,30 @@ static enum hrtimer_restart napi_watchdog(struct hrtimer *timer)
>  	return HRTIMER_NORESTART;
>  }
>  
> +int napi_set_threaded(struct napi_struct *napi, bool threaded)
> +{
> +	if (napi->dev->threaded)
> +		return -EINVAL;

This works differently than the existing per-NAPI defer_hard_irqs /
gro_flush_timeout which are also interface wide.

In that implementation: 
  - the per-NAPI value is set when requested by the user
  - when the sysfs value is written, all NAPIs have their values
    overwritten to the sysfs value

I think either:
  - This implementation should work like the existing ones, or
  - The existing ones should be changed to work like this

I am opposed to have two different behaviors when setting per-NAPI
vs system/nic-wide sysfs values.

I don't have a preference on which behavior is chosen, but the
behavior should be the same for all of the things that are
system/nic-wide and moving to per-NAPI.

> +	if (threaded) {
> +		if (!napi->thread) {
> +			int err = napi_kthread_create(napi);
> +
> +			if (err)
> +				return err;
> +		}
> +	}
> +
> +	if (napi->config)
> +		napi->config->threaded = threaded;
> +
> +	/* Make sure kthread is created before THREADED bit is set. */
> +	smp_mb__before_atomic();
> +	assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
> +
> +	return 0;
> +}
> +
>  int dev_set_threaded(struct net_device *dev, bool threaded)
>  {
>  	struct napi_struct *napi;
> @@ -6817,6 +6841,11 @@ int dev_set_threaded(struct net_device *dev, bool threaded)
>  		return 0;
>  
>  	if (threaded) {
> +		/* Check if threaded is set at napi level already */
> +		list_for_each_entry(napi, &dev->napi_list, dev_list)
> +			if (test_bit(NAPI_STATE_THREADED, &napi->state))
> +				return -EINVAL;
> +

Same comment as above.

>  		list_for_each_entry(napi, &dev->napi_list, dev_list) {
>  			if (!napi->thread) {
>  				err = napi_kthread_create(napi);
> @@ -7063,6 +7092,8 @@ static void napi_restore_config(struct napi_struct *n)
>  		napi_hash_add(n);
>  		n->config->napi_id = n->napi_id;
>  	}
> +
> +	napi_set_threaded(n, n->config->threaded);

I feel like if this fails, it should only be because
napi_kthread_create fails (as described above), which I think should
probably generate a warning, perhaps via WARN_ON_ONCE ?

>  }
>  
>  static void napi_save_config(struct napi_struct *n)
> diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
> index 739f7b6506a6..c2e5cee857d2 100644
> --- a/net/core/netdev-genl-gen.c
> +++ b/net/core/netdev-genl-gen.c
> @@ -92,11 +92,12 @@ static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_DMABUF_FD + 1]
>  };
>  
>  /* NETDEV_CMD_NAPI_SET - do */
> -static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT + 1] = {
> +static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_THREADED + 1] = {
>  	[NETDEV_A_NAPI_ID] = { .type = NLA_U32, },
>  	[NETDEV_A_NAPI_DEFER_HARD_IRQS] = NLA_POLICY_FULL_RANGE(NLA_U32, &netdev_a_napi_defer_hard_irqs_range),
>  	[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT] = { .type = NLA_UINT, },
>  	[NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT] = { .type = NLA_UINT, },
> +	[NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_U32, 1),
>  };
>  
>  /* Ops table for netdev */
> @@ -187,7 +188,7 @@ static const struct genl_split_ops netdev_nl_ops[] = {
>  		.cmd		= NETDEV_CMD_NAPI_SET,
>  		.doit		= netdev_nl_napi_set_doit,
>  		.policy		= netdev_napi_set_nl_policy,
> -		.maxattr	= NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT,
> +		.maxattr	= NETDEV_A_NAPI_THREADED,
>  		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
>  	},
>  };
> diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> index a186fea63c09..057001c3bbba 100644
> --- a/net/core/netdev-genl.c
> +++ b/net/core/netdev-genl.c
> @@ -186,6 +186,9 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
>  	if (napi->irq >= 0 && nla_put_u32(rsp, NETDEV_A_NAPI_IRQ, napi->irq))
>  		goto nla_put_failure;
>  
> +	if (nla_put_u32(rsp, NETDEV_A_NAPI_THREADED, !!napi->thread))
> +		goto nla_put_failure;
> +
>  	if (napi->thread) {
>  		pid = task_pid_nr(napi->thread);
>  		if (nla_put_u32(rsp, NETDEV_A_NAPI_PID, pid))
> @@ -324,8 +327,14 @@ netdev_nl_napi_set_config(struct napi_struct *napi, struct genl_info *info)
>  {
>  	u64 irq_suspend_timeout = 0;
>  	u64 gro_flush_timeout = 0;
> +	u32 threaded = 0;
>  	u32 defer = 0;
>  
> +	if (info->attrs[NETDEV_A_NAPI_THREADED]) {
> +		threaded = nla_get_u32(info->attrs[NETDEV_A_NAPI_THREADED]);
> +		napi_set_threaded(napi, !!threaded);
> +	}
> +
>  	if (info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]) {
>  		defer = nla_get_u32(info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]);
>  		napi_set_defer_hard_irqs(napi, defer);
> diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
> index 7600bf62dbdf..fac1b8ffeb55 100644
> --- a/tools/include/uapi/linux/netdev.h
> +++ b/tools/include/uapi/linux/netdev.h
> @@ -134,6 +134,7 @@ enum {
>  	NETDEV_A_NAPI_DEFER_HARD_IRQS,
>  	NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
>  	NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT,
> +	NETDEV_A_NAPI_THREADED,
>  
>  	__NETDEV_A_NAPI_MAX,
>  	NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
> -- 
> 2.49.0.395.g12beb8f557-goog
> 


* Re: [PATCH net-next v4 2/4] net: Create separate gro_flush helper function
  2025-03-21  2:15 ` [PATCH net-next v4 2/4] net: Create separate gro_flush helper function Samiullah Khawaja
@ 2025-03-21 17:16   ` Joe Damato
  2025-03-27 16:42     ` Samiullah Khawaja
  0 siblings, 1 reply; 22+ messages in thread
From: Joe Damato @ 2025-03-21 17:16 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Jakub Kicinski, David S . Miller , Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, mkarsten, netdev

On Fri, Mar 21, 2025 at 02:15:19AM +0000, Samiullah Khawaja wrote:
> Move multiple copies of same code snippet doing `gro_flush` and
> `gro_normal_list` into a separate helper function.
> 
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>

As mentioned in the previous review, I think this is missing a spot.
the instance in napi_complete_done was not addressed as suggested
previously.

Is there any particular reason why that feedback was not addressed
in this revision?


* Re: [PATCH net-next v4 3/4] Extend napi threaded polling to allow kthread based busy polling
  2025-03-21  2:15 ` [PATCH net-next v4 3/4] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
@ 2025-03-21 17:39   ` Joe Damato
  2025-03-27 16:39     ` Samiullah Khawaja
  0 siblings, 1 reply; 22+ messages in thread
From: Joe Damato @ 2025-03-21 17:39 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Jakub Kicinski, David S . Miller , Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, mkarsten, netdev

On Fri, Mar 21, 2025 at 02:15:20AM +0000, Samiullah Khawaja wrote:
> Add a new state to napi state enum:
> 
> - STATE_THREADED_BUSY_POLL
>   Threaded busy poll is enabled/running for this napi.
> 
> Following changes are introduced in the napi scheduling and state logic:
> 
> - When threaded busy poll is enabled through sysfs it also enables
>   NAPI_STATE_THREADED so a kthread is created per napi. It also sets
>   NAPI_STATE_THREADED_BUSY_POLL bit on each napi to indicate that we are
>   supposed to busy poll for each napi.
> 
> - When napi is scheduled with STATE_SCHED_THREADED and associated
>   kthread is woken up, the kthread owns the context. If
>   NAPI_STATE_THREADED_BUSY_POLL and NAPI_SCHED_THREADED both are set
>   then it means that we can busy poll.
> 
> - To keep busy polling and to avoid scheduling of the interrupts, the
>   napi_complete_done returns false when both SCHED_THREADED and
>   THREADED_BUSY_POLL flags are set. Also napi_complete_done returns
>   early to avoid the STATE_SCHED_THREADED being unset.
> 
> - If at any point STATE_THREADED_BUSY_POLL is unset, the
>   napi_complete_done will run and unset the SCHED_THREADED bit also.
>   This will make the associated kthread go to sleep as per existing
>   logic.
> 
> Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> ---
>  Documentation/ABI/testing/sysfs-class-net     |  3 +-
>  Documentation/netlink/specs/netdev.yaml       | 12 ++-
>  Documentation/networking/napi.rst             | 67 ++++++++++++-
>  .../net/ethernet/atheros/atl1c/atl1c_main.c   |  2 +-
>  drivers/net/ethernet/mellanox/mlxsw/pci.c     |  2 +-
>  drivers/net/ethernet/renesas/ravb_main.c      |  2 +-
>  drivers/net/wireless/ath/ath10k/snoc.c        |  2 +-
>  include/linux/netdevice.h                     | 20 +++-
>  include/uapi/linux/netdev.h                   |  6 ++
>  net/core/dev.c                                | 93 ++++++++++++++++---
>  net/core/net-sysfs.c                          |  2 +-
>  net/core/netdev-genl-gen.c                    |  2 +-
>  net/core/netdev-genl.c                        |  2 +-
>  tools/include/uapi/linux/netdev.h             |  6 ++
>  14 files changed, 188 insertions(+), 33 deletions(-)
  
I think this should be split into two patches which would ease
review and bisection:

  - First patch: introduce enum netdev_napi_threaded and
    NETDEV_NAPI_THREADED_ENABLE and the associated driver changes.

  - Second patch: introduce NETDEV_NAPI_THREADED_BUSY_POLL_ENABLE

I'll have to take a closer look at all the changes here after I've
read the cover letter and have reproduced the results, but one issue
stands out:

> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 3c244fd9ae6d..b990cbe76f86 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h

[...]

> @@ -2432,7 +2442,7 @@ struct net_device {
>  	struct sfp_bus		*sfp_bus;
>  	struct lock_class_key	*qdisc_tx_busylock;
>  	bool			proto_down;
> -	bool			threaded;
> +	u8			threaded;

Doesn't

Documentation/networking/net_cachelines/net_device.rst

also need to be updated if you are changing the width of this field
from bool to u8?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v4 0/4] Add support to do threaded napi busy poll
  2025-03-21  2:15 [PATCH net-next v4 0/4] Add support to do threaded napi busy poll Samiullah Khawaja
                   ` (3 preceding siblings ...)
  2025-03-21  2:15 ` [PATCH net-next v4 4/4] selftests: Add napi threaded busy poll test in `busy_poller` Samiullah Khawaja
@ 2025-03-24  2:38 ` Martin Karsten
  2025-03-25 16:40   ` Samiullah Khawaja
  4 siblings, 1 reply; 22+ messages in thread
From: Martin Karsten @ 2025-03-24  2:38 UTC (permalink / raw)
  To: Samiullah Khawaja, Jakub Kicinski, David S . Miller, Eric Dumazet,
	Paolo Abeni, almasrymina, willemb, jdamato
  Cc: netdev

On 2025-03-20 22:15, Samiullah Khawaja wrote:
> Extend the already existing support of threaded napi poll to do continuous
> busy polling.
> 
> This is used for doing continuous polling of napi to fetch descriptors
> from backing RX/TX queues for low latency applications. Allow enabling
> of threaded busypoll using netlink so this can be enabled on a set of
> dedicated napis for low latency applications.
> 
> Once enabled user can fetch the PID of the kthread doing NAPI polling
> and set affinity, priority and scheduler for it depending on the
> low-latency requirements.
> 
> Currently threaded napi is only enabled at device level using sysfs. Add
> support to enable/disable threaded mode for a napi individually. This
> can be done using the netlink interface. Extend `napi-set` op in netlink
> spec that allows setting the `threaded` attribute of a napi.
> 
> Extend the threaded attribute in napi struct to add an option to enable
> continuous busy polling. Extend the netlink and sysfs interface to allow
> enabling/disabling threaded busypolling at device or individual napi
> level.
> 
> We use this for our AF_XDP based hard low-latency usecase with usecs
> level latency requirement. For our usecase we want low jitter and stable
> latency at P99.
> 
> Following is an analysis and comparison of available (and compatible)
> busy poll interfaces for a low latency usecase with stable P99. Please
> note that the throughput and cpu efficiency is a non-goal.
> 
> For analysis we use an AF_XDP based benchmarking tool `xdp_rr`. The
> description of the tool and how it tries to simulate the real workload
> is following,
> 
> - It sends UDP packets between 2 machines.
> - The client machine sends packets at a fixed frequency. To maintain the
>    frequency of the packet being sent, we use open-loop sampling. That is
>    the packets are sent in a separate thread.
> - The server replies to the packet inline by reading the pkt from the
>    recv ring and replies using the tx ring.
> - To simulate the application processing time, we use a configurable
>    delay in usecs on the client side after a reply is received from the
>    server.
> 
> The xdp_rr tool is posted separately as an RFC for tools/testing/selftest.

Thanks very much for sending the benchmark program and these specific 
experiments. I am able to build the tool and run the experiments in 
principle. While I don't have a complete picture yet, one observation 
seems already clear, so I want to report back on it.

> We use this tool with following napi polling configurations,
> 
> - Interrupts only
> - SO_BUSYPOLL (inline in the same thread where the client receives the
>    packet).
> - SO_BUSYPOLL (separate thread and separate core)
> - Threaded NAPI busypoll

The configurations that you describe as SO_BUSYPOLL here are not using 
the best busy-polling configuration. The best busy-polling strictly 
alternates between application processing and network polling. No 
asynchronous processing due to hardware irq delivery or softirq 
processing should happen.

A high-level check is making sure that no softirq processing is reported 
for the relevant cores (see, e.g., "%soft" in sar -P <cores> -u ALL 1). 
In addition, interrupts can be counted in /proc/stat or /proc/interrupts.

Unfortunately it is not always straightforward to enter this pattern. In 
this particular case, it seems that two pieces are missing:

1) Because the XDP socket is created with XDP_COPY, it is never marked 
with its corresponding napi_id. Without the socket being marked with a 
valid napi_id, sk_busy_loop (called from __xsk_recvmsg) never invokes 
napi_busy_loop. Instead the gro_flush_timeout/napi_defer_hard_irqs 
softirq loop controls packet delivery.

I found code at the end of xsk_bind in xsk.c that is conditional on xs->zc:

	if (xs->zc && qid < dev->real_num_rx_queues) {
		struct netdev_rx_queue *rxq;

		rxq = __netif_get_rx_queue(dev, qid);
		if (rxq->napi)
			__sk_mark_napi_id_once(sk, rxq->napi->napi_id);
	}

I am not an expert on XDP sockets, so I don't know why that is or what 
would be an acceptable workaround/fix, but when I simply remove the 
check for xs->zc, the socket is being marked and napi_busy_loop is being 
called. But maybe there's a better way to accomplish this.
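[Editorial note: the workaround described above would amount to a change along these lines. This is an untested sketch against net/xdp/xsk.c, not the fix that was actually posted; a proper fix may need to gate copy mode differently.]

```diff
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ ... @@ xsk_bind
-	if (xs->zc && qid < dev->real_num_rx_queues) {
+	/* also mark copy-mode sockets so sk_busy_loop() can find the napi */
+	if (qid < dev->real_num_rx_queues) {
 		struct netdev_rx_queue *rxq;
 
 		rxq = __netif_get_rx_queue(dev, qid);
```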

2) SO_PREFER_BUSY_POLL needs to be set on the XDP socket to make sure 
that busy polling stays in control after napi_busy_loop, regardless of 
how many packets were found. Without this setting, the gro_flush_timeout 
timer is not extended in busy_poll_stop.

With these two changes, both SO_BUSYPOLL alternatives perform noticeably 
better in my experiments and come closer to Threaded NAPI busypoll, so I 
was wondering if you could try that in your environment? While this 
might not change the big picture, I think it's important to fully 
understand and document the trade-offs.

Thanks,
Martin


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v4 1/4] Add support to set napi threaded for individual napi
  2025-03-21 17:10   ` Joe Damato
@ 2025-03-25 14:51     ` Jakub Kicinski
  2025-04-01 18:27       ` Jakub Kicinski
  0 siblings, 1 reply; 22+ messages in thread
From: Jakub Kicinski @ 2025-03-25 14:51 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Joe Damato, David S . Miller , Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, mkarsten, netdev

On Fri, 21 Mar 2025 10:10:09 -0700 Joe Damato wrote:
> > +int napi_set_threaded(struct napi_struct *napi, bool threaded)
> > +{
> > +	if (napi->dev->threaded)
> > +		return -EINVAL;  
> 
> This works differently than the existing per-NAPI defer_hard_irqs /
> gro_flush_timeout which are also interface wide.
> 
> In that implementation: 
>   - the per-NAPI value is set when requested by the user
>   - when the sysfs value is written, all NAPIs have their values
>     overwritten to the sysfs value
> 
> I think either:
>   - This implementation should work like the existing ones, or
>   - The existing ones should be changed to work like this
> 
> I am opposed to have two different behaviors when setting per-NAPI
> vs system/nic-wide sysfs values.
> 
> I don't have a preference on which behavior is chosen, but the
> behavior should be the same for all of the things that are
> system/nic-wide and moving to per-NAPI.

And we should probably have a test that verifies the consistency
for all the relevant attrs.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v4 1/4] Add support to set napi threaded for individual napi
  2025-03-21  2:15 ` [PATCH net-next v4 1/4] Add support to set napi threaded for individual napi Samiullah Khawaja
  2025-03-21 17:10   ` Joe Damato
@ 2025-03-25 14:52   ` Jakub Kicinski
  1 sibling, 0 replies; 22+ messages in thread
From: Jakub Kicinski @ 2025-03-25 14:52 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: David S . Miller , Eric Dumazet, Paolo Abeni, almasrymina,
	willemb, jdamato, mkarsten, netdev

On Fri, 21 Mar 2025 02:15:18 +0000 Samiullah Khawaja wrote:
> +/*
> + * napi_set_threaded - set napi threaded state
> + * @napi: NAPI context
> + * @threaded: whether this napi does threaded polling
> + *
> + * Return 0 on success and negative errno on failure.

Please make this a real kdoc (starting with /**) or delete it

> + */
> +int napi_set_threaded(struct napi_struct *napi, bool threaded);
-- 
pw-bot: cr

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v4 0/4] Add support to do threaded napi busy poll
  2025-03-24  2:38 ` [PATCH net-next v4 0/4] Add support to do threaded napi busy poll Martin Karsten
@ 2025-03-25 16:40   ` Samiullah Khawaja
  2025-03-25 17:47     ` Martin Karsten
  0 siblings, 1 reply; 22+ messages in thread
From: Samiullah Khawaja @ 2025-03-25 16:40 UTC (permalink / raw)
  To: Martin Karsten
  Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, jdamato, netdev

On Sun, Mar 23, 2025 at 7:38 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>
> On 2025-03-20 22:15, Samiullah Khawaja wrote:
> > Extend the already existing support of threaded napi poll to do continuous
> > busy polling.
> >
> > This is used for doing continuous polling of napi to fetch descriptors
> > from backing RX/TX queues for low latency applications. Allow enabling
> > of threaded busypoll using netlink so this can be enabled on a set of
> > dedicated napis for low latency applications.
> >
> > Once enabled user can fetch the PID of the kthread doing NAPI polling
> > and set affinity, priority and scheduler for it depending on the
> > low-latency requirements.
> >
> > Currently threaded napi is only enabled at device level using sysfs. Add
> > support to enable/disable threaded mode for a napi individually. This
> > can be done using the netlink interface. Extend `napi-set` op in netlink
> > spec that allows setting the `threaded` attribute of a napi.
> >
> > Extend the threaded attribute in napi struct to add an option to enable
> > continuous busy polling. Extend the netlink and sysfs interface to allow
> > enabling/disabling threaded busypolling at device or individual napi
> > level.
> >
> > We use this for our AF_XDP based hard low-latency usecase with usecs
> > level latency requirement. For our usecase we want low jitter and stable
> > latency at P99.
> >
> > Following is an analysis and comparison of available (and compatible)
> > busy poll interfaces for a low latency usecase with stable P99. Please
> > note that the throughput and cpu efficiency is a non-goal.
> >
> > For analysis we use an AF_XDP based benchmarking tool `xdp_rr`. The
> > description of the tool and how it tries to simulate the real workload
> > is following,
> >
> > - It sends UDP packets between 2 machines.
> > - The client machine sends packets at a fixed frequency. To maintain the
> >    frequency of the packet being sent, we use open-loop sampling. That is
> >    the packets are sent in a separate thread.
> > - The server replies to the packet inline by reading the pkt from the
> >    recv ring and replies using the tx ring.
> > - To simulate the application processing time, we use a configurable
> >    delay in usecs on the client side after a reply is received from the
> >    server.
> >
> > The xdp_rr tool is posted separately as an RFC for tools/testing/selftest.
>
> Thanks very much for sending the benchmark program and these specific
> experiments. I am able to build the tool and run the experiments in
> principle. While I don't have a complete picture yet, one observation
> seems already clear, so I want to report back on it.
Thanks for reproducing this Martin. Really appreciate you reviewing
this and your interest in this.
>
> > We use this tool with following napi polling configurations,
> >
> > - Interrupts only
> > - SO_BUSYPOLL (inline in the same thread where the client receives the
> >    packet).
> > - SO_BUSYPOLL (separate thread and separate core)
> > - Threaded NAPI busypoll
>
> The configurations that you describe as SO_BUSYPOLL here are not using
> the best busy-polling configuration. The best busy-polling strictly
> alternates between application processing and network polling. No
> asynchronous processing due to hardware irq delivery or softirq
> processing should happen.
>
> A high-level check is making sure that no softirq processing is reported
> for the relevant cores (see, e.g., "%soft" in sar -P <cores> -u ALL 1).
> In addition, interrupts can be counted in /proc/stat or /proc/interrupts.
>
> Unfortunately it is not always straightforward to enter this pattern. In
> this particular case, it seems that two pieces are missing:
>
> 1) Because the XDP socket is created with XDP_COPY, it is never marked
> with its corresponding napi_id. Without the socket being marked with a
> valid napi_id, sk_busy_loop (called from __xsk_recvmsg) never invokes
> napi_busy_loop. Instead the gro_flush_timeout/napi_defer_hard_irqs
> softirq loop controls packet delivery.
Nice catch. It seems a recent change broke busy polling for AF_XDP;
there was a fix for XDP_ZEROCOPY, but XDP_COPY remained broken, and it
seems in my experiments I didn't pick that up. During my
experimentation I confirmed that all experiment modes are invoking
busypoll and not going through softirqs. I confirmed this through perf
traces. I sent out a fix for XDP_COPY busy polling in the link below.
I will resend this for net since the original commit has already
landed in 6.13.
https://lore.kernel.org/netdev/CAAywjhSEjaSgt7fCoiqJiMufGOi=oxa164_vTfk+3P43H60qwQ@mail.gmail.com/T/#t
>
> I found code at the end of xsk_bind in xsk.c that is conditional on xs->zc:
>
>         if (xs->zc && qid < dev->real_num_rx_queues) {
>                 struct netdev_rx_queue *rxq;
>
>                 rxq = __netif_get_rx_queue(dev, qid);
>                 if (rxq->napi)
>                         __sk_mark_napi_id_once(sk, rxq->napi->napi_id);
>         }
>
> I am not an expert on XDP sockets, so I don't know why that is or what
> would be an acceptable workaround/fix, but when I simply remove the
> check for xs->zc, the socket is being marked and napi_busy_loop is being
> called. But maybe there's a better way to accomplish this.
+1
>
> 2) SO_PREFER_BUSY_POLL needs to be set on the XDP socket to make sure
> that busy polling stays in control after napi_busy_loop, regardless of
> how many packets were found. Without this setting, the gro_flush_timeout
> timer is not extended in busy_poll_stop.
>
> With these two changes, both SO_BUSYPOLL alternatives perform noticeably
> better in my experiments and come closer to Threaded NAPI busypoll, so I
> was wondering if you could try that in your environment? While this
> might not change the big picture, I think it's important to fully
> understand and document the trade-offs.
I agree. In my experiments the SO_BUSYPOLL works properly, please see
the commit I mentioned above. But I will experiment with
SO_PREFER_BUSY_POLL to see whether it makes any significant change.
>
> Thanks,
> Martin
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v4 0/4] Add support to do threaded napi busy poll
  2025-03-25 16:40   ` Samiullah Khawaja
@ 2025-03-25 17:47     ` Martin Karsten
  2025-03-26 20:34       ` Samiullah Khawaja
  0 siblings, 1 reply; 22+ messages in thread
From: Martin Karsten @ 2025-03-25 17:47 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, jdamato, netdev

On 2025-03-25 12:40, Samiullah Khawaja wrote:
> On Sun, Mar 23, 2025 at 7:38 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>>
>> On 2025-03-20 22:15, Samiullah Khawaja wrote:
>>> Extend the already existing support of threaded napi poll to do continuous
>>> busy polling.
>>>
>>> This is used for doing continuous polling of napi to fetch descriptors
>>> from backing RX/TX queues for low latency applications. Allow enabling
>>> of threaded busypoll using netlink so this can be enabled on a set of
>>> dedicated napis for low latency applications.
>>>
>>> Once enabled user can fetch the PID of the kthread doing NAPI polling
>>> and set affinity, priority and scheduler for it depending on the
>>> low-latency requirements.
>>>
>>> Currently threaded napi is only enabled at device level using sysfs. Add
>>> support to enable/disable threaded mode for a napi individually. This
>>> can be done using the netlink interface. Extend `napi-set` op in netlink
>>> spec that allows setting the `threaded` attribute of a napi.
>>>
>>> Extend the threaded attribute in napi struct to add an option to enable
>>> continuous busy polling. Extend the netlink and sysfs interface to allow
>>> enabling/disabling threaded busypolling at device or individual napi
>>> level.
>>>
>>> We use this for our AF_XDP based hard low-latency usecase with usecs
>>> level latency requirement. For our usecase we want low jitter and stable
>>> latency at P99.
>>>
>>> Following is an analysis and comparison of available (and compatible)
>>> busy poll interfaces for a low latency usecase with stable P99. Please
>>> note that the throughput and cpu efficiency is a non-goal.
>>>
>>> For analysis we use an AF_XDP based benchmarking tool `xdp_rr`. The
>>> description of the tool and how it tries to simulate the real workload
>>> is following,
>>>
>>> - It sends UDP packets between 2 machines.
>>> - The client machine sends packets at a fixed frequency. To maintain the
>>>     frequency of the packet being sent, we use open-loop sampling. That is
>>>     the packets are sent in a separate thread.
>>> - The server replies to the packet inline by reading the pkt from the
>>>     recv ring and replies using the tx ring.
>>> - To simulate the application processing time, we use a configurable
>>>     delay in usecs on the client side after a reply is received from the
>>>     server.
>>>
>>> The xdp_rr tool is posted separately as an RFC for tools/testing/selftest.
>>
>> Thanks very much for sending the benchmark program and these specific
>> experiments. I am able to build the tool and run the experiments in
>> principle. While I don't have a complete picture yet, one observation
>> seems already clear, so I want to report back on it.
> Thanks for reproducing this Martin. Really appreciate you reviewing
> this and your interest in this.
>>
>>> We use this tool with following napi polling configurations,
>>>
>>> - Interrupts only
>>> - SO_BUSYPOLL (inline in the same thread where the client receives the
>>>     packet).
>>> - SO_BUSYPOLL (separate thread and separate core)
>>> - Threaded NAPI busypoll
>>
>> The configurations that you describe as SO_BUSYPOLL here are not using
>> the best busy-polling configuration. The best busy-polling strictly
>> alternates between application processing and network polling. No
>> asynchronous processing due to hardware irq delivery or softirq
>> processing should happen.
>>
>> A high-level check is making sure that no softirq processing is reported
>> for the relevant cores (see, e.g., "%soft" in sar -P <cores> -u ALL 1).
>> In addition, interrupts can be counted in /proc/stat or /proc/interrupts.
>>
>> Unfortunately it is not always straightforward to enter this pattern. In
>> this particular case, it seems that two pieces are missing:
>>
>> 1) Because the XDP socket is created with XDP_COPY, it is never marked
>> with its corresponding napi_id. Without the socket being marked with a
>> valid napi_id, sk_busy_loop (called from __xsk_recvmsg) never invokes
>> napi_busy_loop. Instead the gro_flush_timeout/napi_defer_hard_irqs
>> softirq loop controls packet delivery.
> Nice catch. It seems a recent change broke busy polling for AF_XDP;
> there was a fix for XDP_ZEROCOPY, but XDP_COPY remained broken, and it
> seems in my experiments I didn't pick that up. During my
> experimentation I confirmed that all experiment modes are invoking
> busypoll and not going through softirqs. I confirmed this through perf
> traces. I sent out a fix for XDP_COPY busy polling in the link below.
> I will resend this for net since the original commit has already
> landed in 6.13.
> https://lore.kernel.org/netdev/CAAywjhSEjaSgt7fCoiqJiMufGOi=oxa164_vTfk+3P43H60qwQ@mail.gmail.com/T/#t
>>
>> I found code at the end of xsk_bind in xsk.c that is conditional on xs->zc:
>>
>>          if (xs->zc && qid < dev->real_num_rx_queues) {
>>                  struct netdev_rx_queue *rxq;
>>
>>                  rxq = __netif_get_rx_queue(dev, qid);
>>                  if (rxq->napi)
>>                          __sk_mark_napi_id_once(sk, rxq->napi->napi_id);
>>          }
>>
>> I am not an expert on XDP sockets, so I don't know why that is or what
>> would be an acceptable workaround/fix, but when I simply remove the
>> check for xs->zc, the socket is being marked and napi_busy_loop is being
>> called. But maybe there's a better way to accomplish this.
> +1
>>
>> 2) SO_PREFER_BUSY_POLL needs to be set on the XDP socket to make sure
>> that busy polling stays in control after napi_busy_loop, regardless of
>> how many packets were found. Without this setting, the gro_flush_timeout
>> timer is not extended in busy_poll_stop.
>>
>> With these two changes, both SO_BUSYPOLL alternatives perform noticeably
>> better in my experiments and come closer to Threaded NAPI busypoll, so I
>> was wondering if you could try that in your environment? While this
>> might not change the big picture, I think it's important to fully
>> understand and document the trade-offs.
> I agree. In my experiments the SO_BUSYPOLL works properly, please see
> the commit I mentioned above. But I will experiment with
> SO_PREFER_BUSY_POLL to see whether it makes any significant change.

I'd like to clarify: Your original experiments cannot have used 
busypoll, because it was broken for XDP_COPY. Did you rerun the 
experiments with the XDP_COPY fix but without SO_PREFER_BUSY_POLL and 
see the same latency numbers as before? Also, can you provide more 
details about the perf tracing that you used to see that busypoll is 
invoked, but softirq is not?

Thanks,
Martin


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v4 0/4] Add support to do threaded napi busy poll
  2025-03-25 17:47     ` Martin Karsten
@ 2025-03-26 20:34       ` Samiullah Khawaja
  2025-03-26 21:22         ` Martin Karsten
  2025-03-26 21:57         ` Stanislav Fomichev
  0 siblings, 2 replies; 22+ messages in thread
From: Samiullah Khawaja @ 2025-03-26 20:34 UTC (permalink / raw)
  To: Martin Karsten
  Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, jdamato, netdev

On Tue, Mar 25, 2025 at 10:47 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>
> On 2025-03-25 12:40, Samiullah Khawaja wrote:
> > On Sun, Mar 23, 2025 at 7:38 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> >>
> >> On 2025-03-20 22:15, Samiullah Khawaja wrote:
> >>> Extend the already existing support of threaded napi poll to do continuous
> >>> busy polling.
> >>>
> >>> This is used for doing continuous polling of napi to fetch descriptors
> >>> from backing RX/TX queues for low latency applications. Allow enabling
> >>> of threaded busypoll using netlink so this can be enabled on a set of
> >>> dedicated napis for low latency applications.
> >>>
> >>> Once enabled user can fetch the PID of the kthread doing NAPI polling
> >>> and set affinity, priority and scheduler for it depending on the
> >>> low-latency requirements.
> >>>
> >>> Currently threaded napi is only enabled at device level using sysfs. Add
> >>> support to enable/disable threaded mode for a napi individually. This
> >>> can be done using the netlink interface. Extend `napi-set` op in netlink
> >>> spec that allows setting the `threaded` attribute of a napi.
> >>>
> >>> Extend the threaded attribute in napi struct to add an option to enable
> >>> continuous busy polling. Extend the netlink and sysfs interface to allow
> >>> enabling/disabling threaded busypolling at device or individual napi
> >>> level.
> >>>
> >>> We use this for our AF_XDP based hard low-latency usecase with usecs
> >>> level latency requirement. For our usecase we want low jitter and stable
> >>> latency at P99.
> >>>
> >>> Following is an analysis and comparison of available (and compatible)
> >>> busy poll interfaces for a low latency usecase with stable P99. Please
> >>> note that the throughput and cpu efficiency is a non-goal.
> >>>
> >>> For analysis we use an AF_XDP based benchmarking tool `xdp_rr`. The
> >>> description of the tool and how it tries to simulate the real workload
> >>> is following,
> >>>
> >>> - It sends UDP packets between 2 machines.
> >>> - The client machine sends packets at a fixed frequency. To maintain the
> >>>     frequency of the packet being sent, we use open-loop sampling. That is
> >>>     the packets are sent in a separate thread.
> >>> - The server replies to the packet inline by reading the pkt from the
> >>>     recv ring and replies using the tx ring.
> >>> - To simulate the application processing time, we use a configurable
> >>>     delay in usecs on the client side after a reply is received from the
> >>>     server.
> >>>
> >>> The xdp_rr tool is posted separately as an RFC for tools/testing/selftest.
> >>
> >> Thanks very much for sending the benchmark program and these specific
> >> experiments. I am able to build the tool and run the experiments in
> >> principle. While I don't have a complete picture yet, one observation
> >> seems already clear, so I want to report back on it.
> > Thanks for reproducing this Martin. Really appreciate you reviewing
> > this and your interest in this.
> >>
> >>> We use this tool with following napi polling configurations,
> >>>
> >>> - Interrupts only
> >>> - SO_BUSYPOLL (inline in the same thread where the client receives the
> >>>     packet).
> >>> - SO_BUSYPOLL (separate thread and separate core)
> >>> - Threaded NAPI busypoll
> >>
> >> The configurations that you describe as SO_BUSYPOLL here are not using
> >> the best busy-polling configuration. The best busy-polling strictly
> >> alternates between application processing and network polling. No
> >> asynchronous processing due to hardware irq delivery or softirq
> >> processing should happen.
> >>
> >> A high-level check is making sure that no softirq processing is reported
> >> for the relevant cores (see, e.g., "%soft" in sar -P <cores> -u ALL 1).
> >> In addition, interrupts can be counted in /proc/stat or /proc/interrupts.
> >>
> >> Unfortunately it is not always straightforward to enter this pattern. In
> >> this particular case, it seems that two pieces are missing:
> >>
> >> 1) Because the XDP socket is created with XDP_COPY, it is never marked
> >> with its corresponding napi_id. Without the socket being marked with a
> >> valid napi_id, sk_busy_loop (called from __xsk_recvmsg) never invokes
> >> napi_busy_loop. Instead the gro_flush_timeout/napi_defer_hard_irqs
> >> softirq loop controls packet delivery.
> > Nice catch. It seems a recent change broke busy polling for AF_XDP;
> > there was a fix for XDP_ZEROCOPY, but XDP_COPY remained broken, and it
> > seems in my experiments I didn't pick that up. During my
> > experimentation I confirmed that all experiment modes are invoking
> > busypoll and not going through softirqs. I confirmed this through perf
> > traces. I sent out a fix for XDP_COPY busy polling in the link below.
> > I will resend this for net since the original commit has already
> > landed in 6.13.
> > https://lore.kernel.org/netdev/CAAywjhSEjaSgt7fCoiqJiMufGOi=oxa164_vTfk+3P43H60qwQ@mail.gmail.com/T/#t
> >>
> >> I found code at the end of xsk_bind in xsk.c that is conditional on xs->zc:
> >>
> >>          if (xs->zc && qid < dev->real_num_rx_queues) {
> >>                  struct netdev_rx_queue *rxq;
> >>
> >>                  rxq = __netif_get_rx_queue(dev, qid);
> >>                  if (rxq->napi)
> >>                          __sk_mark_napi_id_once(sk, rxq->napi->napi_id);
> >>          }
> >>
> >> I am not an expert on XDP sockets, so I don't know why that is or what
> >> would be an acceptable workaround/fix, but when I simply remove the
> >> check for xs->zc, the socket is being marked and napi_busy_loop is being
> >> called. But maybe there's a better way to accomplish this.
> > +1
> >>
> >> 2) SO_PREFER_BUSY_POLL needs to be set on the XDP socket to make sure
> >> that busy polling stays in control after napi_busy_loop, regardless of
> >> how many packets were found. Without this setting, the gro_flush_timeout
> >> timer is not extended in busy_poll_stop.
> >>
> >> With these two changes, both SO_BUSYPOLL alternatives perform noticeably
> >> better in my experiments and come closer to Threaded NAPI busypoll, so I
> >> was wondering if you could try that in your environment? While this
> >> might not change the big picture, I think it's important to fully
> >> understand and document the trade-offs.
> > I agree. In my experiments the SO_BUSYPOLL works properly, please see
> > the commit I mentioned above. But I will experiment with
> > SO_PREFER_BUSY_POLL to see whether it makes any significant change.
>
> I'd like to clarify: Your original experiments cannot have used
> busypoll, because it was broken for XDP_COPY. Did you rerun the
On my idpf test platform the AF_XDP support is broken with the latest
kernel, so I didn't have the original commit that broke AF_XDP
busypoll for zerocopy and copy mode. So in the experiments I shared,
XDP_COPY busy poll was working. Please see the traces below. Sorry
for the confusion.
> experiments with the XDP_COPY fix but without SO_PREFER_BUSY_POLL and
I tried with SO_PREFER_BUSY_POLL as you suggested, and I see results
matching the previous observation:

12Kpkts/sec with 78usecs delay:

INLINE:
p5: 16700
p50: 17100
p95: 17200
p99: 17200

> see the same latency numbers as before? Also, can you provide more
> details about the perf tracing that you used to see that busypoll is
> invoked, but softirq is not?
I used the following command to record the call graph and could see
the calls to napi_busy_loop coming from xsk_recvmsg. This is confirmed
with SO_PREFER_BUSY_POLL as well, see below:
```
perf record -o prefer.perf -a -e cycles -g sleep 10
perf report --stdio -i prefer.perf
```

```
 --1.35%--entry_SYSCALL_64
            |
             --1.31%--do_syscall_64
                       __x64_sys_recvfrom
                       __sys_recvfrom
                       sock_recvmsg
                       xsk_recvmsg
                       __xsk_recvmsg.constprop.0.isra.0
                       napi_busy_loop
                       __napi_busy_loop
```

I do see softirqs getting triggered occasionally in the inline case,
when the busy_poll thread is not able to pick up the packets. I used the
following command to find the number of samples for each in the trace:

```
perf report -g -n -i prefer.perf
```

I filtered the results to include only the interesting symbols:
```
Children      Self       Samples  Command          Shared Object
          Symbol
+    1.48%     0.06%            46  xsk_rr           [idpf]
            [k] idpf_vport_splitq_napi_poll

+    1.28%     0.11%            86  xsk_rr           [kernel.kallsyms]
            [k] __napi_busy_loop

+    0.71%     0.02%            17  xsk_rr           [kernel.kallsyms]
            [k] net_rx_action

+    0.69%     0.01%             6  xsk_rr           [kernel.kallsyms]
            [k] __napi_poll
```
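
As a side note for anyone reproducing this: the two-line layout of
`perf report --stdio -n` output above can be tallied mechanically. A rough
sketch (the parsing assumptions are mine and may need adjusting for other
perf versions):

```python
import re

def tally_samples(report_text):
    """Sum sample counts per symbol from wrapped `perf report --stdio -n`
    output, assuming the two-line layout seen above: a data line with
    Children%, Self%, Samples, Command, Shared Object, followed by a
    continuation line holding `[k] symbol_name`."""
    counts = {}
    pending = None  # samples value waiting for its symbol line
    for line in report_text.splitlines():
        m = re.match(r"\+\s+[\d.]+%\s+[\d.]+%\s+(\d+)\s+\S+", line.strip())
        if m:
            pending = int(m.group(1))
            continue
        m = re.match(r"\[k\]\s+(\S+)", line.strip())
        if m and pending is not None:
            counts[m.group(1)] = counts.get(m.group(1), 0) + pending
            pending = None
    return counts

sample = """
+    1.28%     0.11%            86  xsk_rr           [kernel.kallsyms]
            [k] __napi_busy_loop

+    0.71%     0.02%            17  xsk_rr           [kernel.kallsyms]
            [k] net_rx_action
"""
print(tally_samples(sample))
# → {'__napi_busy_loop': 86, 'net_rx_action': 17}
```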

>
> Thanks,
> Martin
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v4 0/4] Add support to do threaded napi busy poll
  2025-03-26 20:34       ` Samiullah Khawaja
@ 2025-03-26 21:22         ` Martin Karsten
  2025-03-26 21:57         ` Stanislav Fomichev
  1 sibling, 0 replies; 22+ messages in thread
From: Martin Karsten @ 2025-03-26 21:22 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, jdamato, netdev

On 2025-03-26 16:34, Samiullah Khawaja wrote:
> On Tue, Mar 25, 2025 at 10:47 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>>
>> On 2025-03-25 12:40, Samiullah Khawaja wrote:
>>> On Sun, Mar 23, 2025 at 7:38 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>>>>
>>>> On 2025-03-20 22:15, Samiullah Khawaja wrote:
>>>>> Extend the already existing support of threaded napi poll to do continuous
>>>>> busy polling.
>>>>>
>>>>> This is used for doing continuous polling of napi to fetch descriptors
>>>>> from backing RX/TX queues for low latency applications. Allow enabling
>>>>> of threaded busypoll using netlink so this can be enabled on a set of
>>>>> dedicated napis for low latency applications.
>>>>>
>>>>> Once enabled user can fetch the PID of the kthread doing NAPI polling
>>>>> and set affinity, priority and scheduler for it depending on the
>>>>> low-latency requirements.
>>>>>
>>>>> Currently threaded napi is only enabled at device level using sysfs. Add
>>>>> support to enable/disable threaded mode for a napi individually. This
>>>>> can be done using the netlink interface. Extend `napi-set` op in netlink
>>>>> spec that allows setting the `threaded` attribute of a napi.
>>>>>
>>>>> Extend the threaded attribute in napi struct to add an option to enable
>>>>> continuous busy polling. Extend the netlink and sysfs interface to allow
>>>>> enabling/disabling threaded busypolling at device or individual napi
>>>>> level.
>>>>>
>>>>> We use this for our AF_XDP based hard low-latency usecase with usecs
>>>>> level latency requirement. For our usecase we want low jitter and stable
>>>>> latency at P99.
>>>>>
>>>>> Following is an analysis and comparison of available (and compatible)
>>>>> busy poll interfaces for a low latency usecase with stable P99. Please
>>>>> note that the throughput and cpu efficiency is a non-goal.
>>>>>
>>>>> For analysis we use an AF_XDP based benchmarking tool `xdp_rr`. The
>>>>> description of the tool and how it tries to simulate the real workload
>>>>> is following,
>>>>>
>>>>> - It sends UDP packets between 2 machines.
>>>>> - The client machine sends packets at a fixed frequency. To maintain the
>>>>>      frequency of the packet being sent, we use open-loop sampling. That is
>>>>>      the packets are sent in a separate thread.
>>>>> - The server replies to the packet inline by reading the pkt from the
>>>>>      recv ring and replies using the tx ring.
>>>>> - To simulate the application processing time, we use a configurable
>>>>>      delay in usecs on the client side after a reply is received from the
>>>>>      server.
>>>>>
>>>>> The xdp_rr tool is posted separately as an RFC for tools/testing/selftest.
>>>>
>>>> Thanks very much for sending the benchmark program and these specific
>>>> experiments. I am able to build the tool and run the experiments in
>>>> principle. While I don't have a complete picture yet, one observation
>>>> seems already clear, so I want to report back on it.
>>> Thanks for reproducing this Martin. Really appreciate you reviewing
>>> this and your interest in this.
>>>>
>>>>> We use this tool with following napi polling configurations,
>>>>>
>>>>> - Interrupts only
>>>>> - SO_BUSYPOLL (inline in the same thread where the client receives the
>>>>>      packet).
>>>>> - SO_BUSYPOLL (separate thread and separate core)
>>>>> - Threaded NAPI busypoll
>>>>
>>>> The configurations that you describe as SO_BUSYPOLL here are not using
>>>> the best busy-polling configuration. The best busy-polling strictly
>>>> alternates between application processing and network polling. No
>>>> asynchronous processing due to hardware irq delivery or softirq
>>>> processing should happen.
>>>>
>>>> A high-level check is making sure that no softirq processing is reported
>>>> for the relevant cores (see, e.g., "%soft" in sar -P <cores> -u ALL 1).
>>>> In addition, interrupts can be counted in /proc/stat or /proc/interrupts.
>>>>
>>>> Unfortunately it is not always straightforward to enter this pattern. In
>>>> this particular case, it seems that two pieces are missing:
>>>>
>>>> 1) Because the XDP socket is created with XDP_COPY, it is never marked
>>>> with its corresponding napi_id. Without the socket being marked with a
>>>> valid napi_id, sk_busy_loop (called from __xsk_recvmsg) never invokes
>>>> napi_busy_loop. Instead the gro_flush_timeout/napi_defer_hard_irqs
>>>> softirq loop controls packet delivery.
>>> Nice catch. It seems a recent change broke the busy polling for AF_XDP
>>> and there was a fix for the XDP_ZEROCOPY but the XDP_COPY remained
>>> broken and seems in my experiments I didn't pick that up. During my
>>> experimentation I confirmed that all experiment modes are invoking the
>>> busypoll and not going through softirqs. I confirmed this through perf
>>> traces. I sent out a fix for XDP_COPY busy polling here in the link
>>> below. I will resent this for the net since the original commit has
>>> already landed in 6.13.
>>> https://lore.kernel.org/netdev/CAAywjhSEjaSgt7fCoiqJiMufGOi=oxa164_vTfk+3P43H60qwQ@mail.gmail.com/T/#t
>>>>
>>>> I found code at the end of xsk_bind in xsk.c that is conditional on xs->zc:
>>>>
>>>>           if (xs->zc && qid < dev->real_num_rx_queues) {
>>>>                   struct netdev_rx_queue *rxq;
>>>>
>>>>                   rxq = __netif_get_rx_queue(dev, qid);
>>>>                   if (rxq->napi)
>>>>                           __sk_mark_napi_id_once(sk, rxq->napi->napi_id);
>>>>           }
>>>>
>>>> I am not an expert on XDP sockets, so I don't know why that is or what
>>>> would be an acceptable workaround/fix, but when I simply remove the
>>>> check for xs->zc, the socket is being marked and napi_busy_loop is being
>>>> called. But maybe there's a better way to accomplish this.
>>> +1
>>>>
>>>> 2) SO_PREFER_BUSY_POLL needs to be set on the XDP socket to make sure
>>>> that busy polling stays in control after napi_busy_loop, regardless of
>>>> how many packets were found. Without this setting, the gro_flush_timeout
>>>> timer is not extended in busy_poll_stop.
>>>>
>>>> With these two changes, both SO_BUSYPOLL alternatives perform noticeably
>>>> better in my experiments and come closer to Threaded NAPI busypoll, so I
>>>> was wondering if you could try that in your environment? While this
>>>> might not change the big picture, I think it's important to fully
>>>> understand and document the trade-offs.
>>> I agree. In my experiments the SO_BUSYPOLL works properly, please see
>>> the commit I mentioned above. But I will experiment with
>>> SO_PREFER_BUSY_POLL to see whether it makes any significant change.
>>
>> I'd like to clarify: Your original experiments cannot have used
>> busypoll, because it was broken for XDP_COPY. Did you rerun the
> On my idpf test platform the AF_XDP support is broken with the latest
> kernel, so I didn't have the original commit that broke AF_XDP
> busypoll for zerocopy and copy mode. So in the experiments that I
> shared XDP_COPY busy poll has been working. Please see the traces
> below. Sorry for the confusion.

Ok, that explains it.

>> experiments with the XDP_COPY fix but without SO_PREFER_BUSY_POLL and
> I tried with SO_PREFER_BUSY_POLL as you suggested, I see results
> matching the previous observation:
> 
> 12Kpkts/sec with 78usecs delay:
> 
> INLINE:
> p5: 16700
> p50: 17100
> p95: 17200
> p99: 17200

This comment applies to the experiments overall: I believe these 
carefully crafted period/delay configurations that just straddle the 
capacity limit do not show any additional benefits over and above what 
the basic experiments (without application delay) already show.

If you want to illustrate the fact that the slightly faster mechanism 
reaches capacity a little later, I would find experiments with a fixed 
period and varying the delay from 0 to overload more illustrative.
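
To sketch why straddling the capacity limit dominates the numbers, here is
a toy open-loop queueing model (my own illustration, not part of xdp_rr):
with a fixed period, latency stays flat at the per-request service time
until the delay pushes service past the period, and then grows without
bound.

```python
def open_loop_latencies(period, service_time, n):
    """Open-loop arrivals: requests land on a fixed time grid regardless
    of completions, so queueing delay accumulates once service_time
    exceeds the period (past the capacity limit)."""
    lat, free_at = [], 0.0
    for i in range(n):
        arrive = i * period
        start = max(arrive, free_at)   # wait if the server is still busy
        free_at = start + service_time
        lat.append(free_at - arrive)
    return lat

# Below capacity: latency is flat at the service time.
print(open_loop_latencies(period=78.0, service_time=17.0, n=5))
# → [17.0, 17.0, 17.0, 17.0, 17.0]
# Past capacity: latency grows linearly, with no steady state.
print(open_loop_latencies(period=10.0, service_time=17.0, n=5))
# → [17.0, 24.0, 31.0, 38.0, 45.0]
```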

>> see the same latency numbers as before? Also, can you provide more
>> details about the perf tracing that you used to see that busypoll is
>> invoked, but softirq is not?
> I used the following command to record the call graph and could see
> the calls to napi_busy_loop going from xsk_rcvmsg. Confirmed with
> SO_PREFER_BUSY_POLL also below,
> ```
> perf record -o prefer.perf -a -e cycles -g sleep 10
> perf report --stdio -i prefer.perf
> ```
> 
> ```
>   --1.35%--entry_SYSCALL_64
>              |
>               --1.31%--do_syscall_64
>                         __x64_sys_recvfrom
>                         __sys_recvfrom
>                         sock_recvmsg
>                         xsk_recvmsg
>                         __xsk_recvmsg.constprop.0.isra.0
>                         napi_busy_loop
>                         __napi_busy_loop
> ```
> 
> I do see softirq getting triggered occasionally, when inline the
> busy_poll thread is not able to pick up the packets. I used following
> command to find number of samples for each in the trace,
> 
> ```
> perf report -g -n -i prefer.perf
> ```
> 
> Filtered the results to include only the interesting symbols
> ```
> <
> Children      Self       Samples  Command          Shared Object
>            Symbol
> +    1.48%     0.06%            46  xsk_rr           [idpf]
>              [k] idpf_vport_splitq_napi_poll
> 
> +    1.28%     0.11%            86  xsk_rr           [kernel.kallsyms]
>              [k] __napi_busy_loop
> 
> +    0.71%     0.02%            17  xsk_rr           [kernel.kallsyms]
>              [k] net_rx_action
> 
> +    0.69%     0.01%             6  xsk_rr           [kernel.kallsyms]
>              [k] __napi_poll
> ```

Thanks, this makes me realize that I forgot to mention something as well:

SO_PREFER_BUSY_POLL should eliminate the remaining softirq invocations, 
but only if gro_flush_timeout is big enough. In fact, in a full busypoll 
configuration, the value of gro_flush_timeout should not matter at all 
as long as it's sufficiently higher than the application period. I have 
set it to 1000000 for these experiments as another litmus test that 
busypoll is actually working.

Last but not least, I found that co-locating the single-threaded 
busy-polling application with the irq core improved the outcome. I.e., 
in your experiment setup you would taskset the application to Core 2. 
Not sure I have a rock-solid explanation, but it did make a difference.

Since net-next patch submission is closed, I thought I'd provide this 
feedback now, so you can decide whether to take it into account for the 
next go-around.

Best,
Martin



* Re: [PATCH net-next v4 0/4] Add support to do threaded napi busy poll
  2025-03-26 20:34       ` Samiullah Khawaja
  2025-03-26 21:22         ` Martin Karsten
@ 2025-03-26 21:57         ` Stanislav Fomichev
  2025-03-27 18:40           ` Joe Damato
  1 sibling, 1 reply; 22+ messages in thread
From: Stanislav Fomichev @ 2025-03-26 21:57 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Martin Karsten, Jakub Kicinski, David S . Miller, Eric Dumazet,
	Paolo Abeni, almasrymina, willemb, jdamato, netdev

On 03/26, Samiullah Khawaja wrote:
> On Tue, Mar 25, 2025 at 10:47 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> >
> > On 2025-03-25 12:40, Samiullah Khawaja wrote:
> > > On Sun, Mar 23, 2025 at 7:38 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> > >>
> > >> On 2025-03-20 22:15, Samiullah Khawaja wrote:
> > >>> Extend the already existing support of threaded napi poll to do continuous
> > >>> busy polling.
> > >>>
> > >>> This is used for doing continuous polling of napi to fetch descriptors
> > >>> from backing RX/TX queues for low latency applications. Allow enabling
> > >>> of threaded busypoll using netlink so this can be enabled on a set of
> > >>> dedicated napis for low latency applications.
> > >>>
> > >>> Once enabled user can fetch the PID of the kthread doing NAPI polling
> > >>> and set affinity, priority and scheduler for it depending on the
> > >>> low-latency requirements.
> > >>>
> > >>> Currently threaded napi is only enabled at device level using sysfs. Add
> > >>> support to enable/disable threaded mode for a napi individually. This
> > >>> can be done using the netlink interface. Extend `napi-set` op in netlink
> > >>> spec that allows setting the `threaded` attribute of a napi.
> > >>>
> > >>> Extend the threaded attribute in napi struct to add an option to enable
> > >>> continuous busy polling. Extend the netlink and sysfs interface to allow
> > >>> enabling/disabling threaded busypolling at device or individual napi
> > >>> level.
> > >>>
> > >>> We use this for our AF_XDP based hard low-latency usecase with usecs
> > >>> level latency requirement. For our usecase we want low jitter and stable
> > >>> latency at P99.
> > >>>
> > >>> Following is an analysis and comparison of available (and compatible)
> > >>> busy poll interfaces for a low latency usecase with stable P99. Please
> > >>> note that the throughput and cpu efficiency is a non-goal.
> > >>>
> > >>> For analysis we use an AF_XDP based benchmarking tool `xdp_rr`. The
> > >>> description of the tool and how it tries to simulate the real workload
> > >>> is following,
> > >>>
> > >>> - It sends UDP packets between 2 machines.
> > >>> - The client machine sends packets at a fixed frequency. To maintain the
> > >>>     frequency of the packet being sent, we use open-loop sampling. That is
> > >>>     the packets are sent in a separate thread.
> > >>> - The server replies to the packet inline by reading the pkt from the
> > >>>     recv ring and replies using the tx ring.
> > >>> - To simulate the application processing time, we use a configurable
> > >>>     delay in usecs on the client side after a reply is received from the
> > >>>     server.
> > >>>
> > >>> The xdp_rr tool is posted separately as an RFC for tools/testing/selftest.
> > >>
> > >> Thanks very much for sending the benchmark program and these specific
> > >> experiments. I am able to build the tool and run the experiments in
> > >> principle. While I don't have a complete picture yet, one observation
> > >> seems already clear, so I want to report back on it.
> > > Thanks for reproducing this Martin. Really appreciate you reviewing
> > > this and your interest in this.
> > >>
> > >>> We use this tool with following napi polling configurations,
> > >>>
> > >>> - Interrupts only
> > >>> - SO_BUSYPOLL (inline in the same thread where the client receives the
> > >>>     packet).
> > >>> - SO_BUSYPOLL (separate thread and separate core)
> > >>> - Threaded NAPI busypoll
> > >>
> > >> The configurations that you describe as SO_BUSYPOLL here are not using
> > >> the best busy-polling configuration. The best busy-polling strictly
> > >> alternates between application processing and network polling. No
> > >> asynchronous processing due to hardware irq delivery or softirq
> > >> processing should happen.
> > >>
> > >> A high-level check is making sure that no softirq processing is reported
> > >> for the relevant cores (see, e.g., "%soft" in sar -P <cores> -u ALL 1).
> > >> In addition, interrupts can be counted in /proc/stat or /proc/interrupts.
> > >>
> > >> Unfortunately it is not always straightforward to enter this pattern. In
> > >> this particular case, it seems that two pieces are missing:
> > >>
> > >> 1) Because the XDP socket is created with XDP_COPY, it is never marked
> > >> with its corresponding napi_id. Without the socket being marked with a
> > >> valid napi_id, sk_busy_loop (called from __xsk_recvmsg) never invokes
> > >> napi_busy_loop. Instead the gro_flush_timeout/napi_defer_hard_irqs
> > >> softirq loop controls packet delivery.
> > > Nice catch. It seems a recent change broke the busy polling for AF_XDP
> > > and there was a fix for the XDP_ZEROCOPY but the XDP_COPY remained
> > > broken and seems in my experiments I didn't pick that up. During my
> > > experimentation I confirmed that all experiment modes are invoking the
> > > busypoll and not going through softirqs. I confirmed this through perf
> > > traces. I sent out a fix for XDP_COPY busy polling here in the link
> > > below. I will resent this for the net since the original commit has
> > > already landed in 6.13.
> > > https://lore.kernel.org/netdev/CAAywjhSEjaSgt7fCoiqJiMufGOi=oxa164_vTfk+3P43H60qwQ@mail.gmail.com/T/#t

In general, when sending the patches and numbers, try running everything
against the latest net-next. Otherwise, it is very confusing to reason
about.


* Re: [PATCH net-next v4 3/4] Extend napi threaded polling to allow kthread based busy polling
  2025-03-21 17:39   ` Joe Damato
@ 2025-03-27 16:39     ` Samiullah Khawaja
  0 siblings, 0 replies; 22+ messages in thread
From: Samiullah Khawaja @ 2025-03-27 16:39 UTC (permalink / raw)
  To: Joe Damato, Samiullah Khawaja, Jakub Kicinski, David S . Miller,
	Eric Dumazet, Paolo Abeni, almasrymina, willemb, mkarsten, netdev

On Fri, Mar 21, 2025 at 10:39 AM Joe Damato <jdamato@fastly.com> wrote:
>
> On Fri, Mar 21, 2025 at 02:15:20AM +0000, Samiullah Khawaja wrote:
> > Add a new state to napi state enum:
> >
> > - STATE_THREADED_BUSY_POLL
> >   Threaded busy poll is enabled/running for this napi.
> >
> > Following changes are introduced in the napi scheduling and state logic:
> >
> > - When threaded busy poll is enabled through sysfs it also enables
> >   NAPI_STATE_THREADED so a kthread is created per napi. It also sets
> >   NAPI_STATE_THREADED_BUSY_POLL bit on each napi to indicate that we are
> >   supposed to busy poll for each napi.
> >
> > - When napi is scheduled with STATE_SCHED_THREADED and associated
> >   kthread is woken up, the kthread owns the context. If
> >   NAPI_STATE_THREADED_BUSY_POLL and NAPI_SCHED_THREADED both are set
> >   then it means that we can busy poll.
> >
> > - To keep busy polling and to avoid scheduling of the interrupts, the
> >   napi_complete_done returns false when both SCHED_THREADED and
> >   THREADED_BUSY_POLL flags are set. Also napi_complete_done returns
> >   early to avoid the STATE_SCHED_THREADED being unset.
> >
> > - If at any point STATE_THREADED_BUSY_POLL is unset, the
> >   napi_complete_done will run and unset the SCHED_THREADED bit also.
> >   This will make the associated kthread go to sleep as per existing
> >   logic.
> >
> > Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
> > ---
> >  Documentation/ABI/testing/sysfs-class-net     |  3 +-
> >  Documentation/netlink/specs/netdev.yaml       | 12 ++-
> >  Documentation/networking/napi.rst             | 67 ++++++++++++-
> >  .../net/ethernet/atheros/atl1c/atl1c_main.c   |  2 +-
> >  drivers/net/ethernet/mellanox/mlxsw/pci.c     |  2 +-
> >  drivers/net/ethernet/renesas/ravb_main.c      |  2 +-
> >  drivers/net/wireless/ath/ath10k/snoc.c        |  2 +-
> >  include/linux/netdevice.h                     | 20 +++-
> >  include/uapi/linux/netdev.h                   |  6 ++
> >  net/core/dev.c                                | 93 ++++++++++++++++---
> >  net/core/net-sysfs.c                          |  2 +-
> >  net/core/netdev-genl-gen.c                    |  2 +-
> >  net/core/netdev-genl.c                        |  2 +-
> >  tools/include/uapi/linux/netdev.h             |  6 ++
> >  14 files changed, 188 insertions(+), 33 deletions(-)
>
> I think this should be split into two patches which would ease
> review and bisection:
>
>   - First patch: introduce enum netdev_napi_threaded and
>     NETDEV_NAPI_THREADED_ENABLE and the associated driver changes.
>
>   - Second patch: introduce NETDEV_NAPI_THREADED_BUSY_POLL_ENABLE
Agreed. I will split this up and send again.
>
> I'll have to take a closer look at all the changes here after I've
> read the cover letter and have reproduced the results, but one issue
> stands out:
>
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 3c244fd9ae6d..b990cbe76f86 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
>
> [...]
>
> > @@ -2432,7 +2442,7 @@ struct net_device {
> >       struct sfp_bus          *sfp_bus;
> >       struct lock_class_key   *qdisc_tx_busylock;
> >       bool                    proto_down;
> > -     bool                    threaded;
> > +     u8                      threaded;
>
> Doesn't
>
> Documentation/networking/net_cachelines/net_device.rst
>
> Also need to be updated if you are changing the width of this field
> from bool to u8?


* Re: [PATCH net-next v4 2/4] net: Create separate gro_flush helper function
  2025-03-21 17:16   ` Joe Damato
@ 2025-03-27 16:42     ` Samiullah Khawaja
  0 siblings, 0 replies; 22+ messages in thread
From: Samiullah Khawaja @ 2025-03-27 16:42 UTC (permalink / raw)
  To: Joe Damato, Samiullah Khawaja, Jakub Kicinski, David S . Miller,
	Eric Dumazet, Paolo Abeni, almasrymina, willemb, mkarsten, netdev

On Fri, Mar 21, 2025 at 10:16 AM Joe Damato <jdamato@fastly.com> wrote:
>
> On Fri, Mar 21, 2025 at 02:15:19AM +0000, Samiullah Khawaja wrote:
> > Move multiple copies of same code snippet doing `gro_flush` and
> > `gro_normal_list` into a separate helper function.
> >
> > Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
>
> As mentioned in the previous review, I think this is missing a spot.
> the instance in napi_complete_done was not addressed as suggested
> previously.
Oops... I did that internally, but it seems I missed that spot during
rebase. Will send again.
>
> Is there any particular reason why that feedback was not addressed
> in this revision?


* Re: [PATCH net-next v4 0/4] Add support to do threaded napi busy poll
  2025-03-26 21:57         ` Stanislav Fomichev
@ 2025-03-27 18:40           ` Joe Damato
  2025-03-27 19:35             ` Samiullah Khawaja
  0 siblings, 1 reply; 22+ messages in thread
From: Joe Damato @ 2025-03-27 18:40 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Samiullah Khawaja, Martin Karsten, Jakub Kicinski,
	David S . Miller, Eric Dumazet, Paolo Abeni, almasrymina, willemb,
	netdev

On Wed, Mar 26, 2025 at 02:57:37PM -0700, Stanislav Fomichev wrote:
> On 03/26, Samiullah Khawaja wrote:
> > On Tue, Mar 25, 2025 at 10:47 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> > >
> > > On 2025-03-25 12:40, Samiullah Khawaja wrote:
> > > > On Sun, Mar 23, 2025 at 7:38 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> > > >>
> > > >> On 2025-03-20 22:15, Samiullah Khawaja wrote:

> > > > Nice catch. It seems a recent change broke the busy polling for AF_XDP
> > > > and there was a fix for the XDP_ZEROCOPY but the XDP_COPY remained
> > > > broken and seems in my experiments I didn't pick that up. During my
> > > > experimentation I confirmed that all experiment modes are invoking the
> > > > busypoll and not going through softirqs. I confirmed this through perf
> > > > traces. I sent out a fix for XDP_COPY busy polling here in the link
> > > > below. I will resent this for the net since the original commit has
> > > > already landed in 6.13.
> > > > https://lore.kernel.org/netdev/CAAywjhSEjaSgt7fCoiqJiMufGOi=oxa164_vTfk+3P43H60qwQ@mail.gmail.com/T/#t
> 
> In general, when sending the patches and numbers, try running everything
> against the latest net-next. Otherwise, it is very confusing to reason
> about..

I had mentioned in my previous review, but worth mentioning again...
using --base=auto when using format-patch to generate the base
commit SHA would be really useful to potentially help avoid this
problem.


* Re: [PATCH net-next v4 0/4] Add support to do threaded napi busy poll
  2025-03-27 18:40           ` Joe Damato
@ 2025-03-27 19:35             ` Samiullah Khawaja
  0 siblings, 0 replies; 22+ messages in thread
From: Samiullah Khawaja @ 2025-03-27 19:35 UTC (permalink / raw)
  To: Joe Damato, Stanislav Fomichev, Samiullah Khawaja, Martin Karsten,
	Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, netdev

On Thu, Mar 27, 2025 at 11:40 AM Joe Damato <jdamato@fastly.com> wrote:
>
> On Wed, Mar 26, 2025 at 02:57:37PM -0700, Stanislav Fomichev wrote:
> > On 03/26, Samiullah Khawaja wrote:
> > > On Tue, Mar 25, 2025 at 10:47 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> > > >
> > > > On 2025-03-25 12:40, Samiullah Khawaja wrote:
> > > > > On Sun, Mar 23, 2025 at 7:38 PM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> > > > >>
> > > > >> On 2025-03-20 22:15, Samiullah Khawaja wrote:
>
> > > > > Nice catch. It seems a recent change broke the busy polling for AF_XDP
> > > > > and there was a fix for the XDP_ZEROCOPY but the XDP_COPY remained
> > > > > broken and seems in my experiments I didn't pick that up. During my
> > > > > experimentation I confirmed that all experiment modes are invoking the
> > > > > busypoll and not going through softirqs. I confirmed this through perf
> > > > > traces. I sent out a fix for XDP_COPY busy polling here in the link
> > > > > below. I will resent this for the net since the original commit has
> > > > > already landed in 6.13.
> > > > > https://lore.kernel.org/netdev/CAAywjhSEjaSgt7fCoiqJiMufGOi=oxa164_vTfk+3P43H60qwQ@mail.gmail.com/T/#t
> >
> > In general, when sending the patches and numbers, try running everything
> > against the latest net-next. Otherwise, it is very confusing to reason
> > about..
>
> I had mentioned in my previous review, but worth mentioning again...
> using --base=auto when using format-patch to generate the base
> commit SHA would be really useful to potentially help avoid this
> problem.
I completely agree. But I have 2 platforms that I am working with,
virtio-net and idpf. I use idpf to generate experiment data, but I
cannot use the latest kernel with it since its AF_XDP support is
broken. I generate patches using format-patch after rebasing on top of
net-next and doing functional testing with qemu, to make sure that the
patches don't have any conflicts with the latest net-next.


* Re: [PATCH net-next v4 1/4] Add support to set napi threaded for individual napi
  2025-03-25 14:51     ` Jakub Kicinski
@ 2025-04-01 18:27       ` Jakub Kicinski
  2025-04-14 17:16         ` Samiullah Khawaja
  0 siblings, 1 reply; 22+ messages in thread
From: Jakub Kicinski @ 2025-04-01 18:27 UTC (permalink / raw)
  To: Joe Damato
  Cc: Samiullah Khawaja, David S . Miller , Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, mkarsten, netdev

On Tue, 25 Mar 2025 07:51:00 -0700 Jakub Kicinski wrote:
> On Fri, 21 Mar 2025 10:10:09 -0700 Joe Damato wrote:
> > > +int napi_set_threaded(struct napi_struct *napi, bool threaded)
> > > +{
> > > +	if (napi->dev->threaded)
> > > +		return -EINVAL;    
> > 
> > This works differently than the existing per-NAPI defer_hard_irqs /
> > gro_flush_timeout which are also interface wide.
> > 
> > In that implementation: 
> >   - the per-NAPI value is set when requested by the user
> >   - when the sysfs value is written, all NAPIs have their values
> >     overwritten to the sysfs value
> > 
> > I think either:
> >   - This implementation should work like the existing ones, or
> >   - The existing ones should be changed to work like this
> > 
> > I am opposed to have two different behaviors when setting per-NAPI
> > vs system/nic-wide sysfs values.
> > 
> > I don't have a preference on which behavior is chosen, but the
> > behavior should be the same for all of the things that are
> > system/nic-wide and moving to per-NAPI.  
> 
> And we should probably have a test that verifies the consistency
> for all the relevant attrs.

I was thinking about it some more in another context, and I decided 
to write down what came to mind. Does this make sense as part of
our docs?

===================================
Adding new configuration interfaces
===================================

Best practices for implementing new configuration interfaces in networking.

Multi-level configuration
=========================

In certain cases the same configuration option can be specified with different
levels of granularity, e.g. global configuration, and device-level
configuration. Finer-grained rules always take precedence. A trickier
problem is what effect changing the coarser settings should have on already
present finer settings. Setting coarser configuration can either reset
all finer-grained rules ("write all" semantics), or affect only objects
for which finer-grained rules have not been specified ("default" semantics).

The "default" semantics are recommended unless a clear and documented reason
exists for the interface to behave otherwise.
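
To make the two semantics concrete, a minimal sketch (illustrative Python,
not kernel code; the field names are made up):

```python
class Napi:
    def __init__(self):
        self.threaded = False   # effective value
        self.user_set = False   # finer-grained rule present?

def set_device_threaded(napis, value, semantics="default"):
    """Coarse (device-level) write.

    "write all": overwrite every per-NAPI value and drop per-NAPI rules.
    "default":   only touch NAPIs without an explicit per-NAPI rule.
    """
    for n in napis:
        if semantics == "write all" or not n.user_set:
            n.threaded = value
            if semantics == "write all":
                n.user_set = False

def set_napi_threaded(napi, value):
    """Fine-grained (per-NAPI) write; always takes precedence."""
    napi.threaded = value
    napi.user_set = True

napis = [Napi(), Napi()]
set_napi_threaded(napis[0], True)   # finer-grained rule on NAPI 0
set_device_threaded(napis, False)   # "default" semantics
print([n.threaded for n in napis])  # NAPI 0 keeps its rule
# → [True, False]
```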

Configuration persistence
=========================

User configuration should always be preserved, as long as related objects
exist.

No loss on close
----------------

Closing and opening a net_device should not result in loss of configuration.
Dynamically allocated objects should be re-instantiated when the device
is opened.

No partial loss
---------------

Loss of configuration is only acceptable due to asynchronous device errors,
and in response to explicit reset requests from the user (``devlink reload``,
``ethtool --reset``, etc.). The implementation should not attempt to preserve
the objects affected by configuration loss (e.g. if some of net_device
configuration is lost, the net_device should be unregistered and re-registered
as part of the reset procedure).

Explicit default tracking
-------------------------

Network configuration is often performed in multiple steps, so it is important
that conflicting user requests cause an explicit error, rather than silent
reset of previously requested settings to defaults. For example, if the user
first requests an RSS indirection table directing to queues 0, 1, and 2,
and then sets the queue count to 2, the queue count change should be rejected.

This implies that network configuration often needs to include an indication
whether a given setting has been requested by a user, or is a default value
populated by the core or the driver. What's more, the user configuration API
may need to provide an ability to not only set a value but also to reset
it back to the default.
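The RSS example above could look roughly like this in a userspace model. The
names are hypothetical, not an actual driver API; the point is that a
user-installed table blocks a conflicting queue-count change, while a core
default is silently re-spread:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct dev_cfg {
	unsigned int queue_count;
	unsigned int indir[8];		/* RSS indirection table */
	bool indir_user_set;		/* user requested vs. core default */
};

static int set_queue_count(struct dev_cfg *cfg, unsigned int count)
{
	if (cfg->indir_user_set) {
		/* Reject instead of silently rewriting the user's table. */
		for (size_t i = 0; i < 8; i++)
			if (cfg->indir[i] >= count)
				return -EINVAL;
	}
	cfg->queue_count = count;
	if (!cfg->indir_user_set)
		for (size_t i = 0; i < 8; i++)	/* re-spread the default */
			cfg->indir[i] = i % count;
	return 0;
}

/* The reset-to-default operation mentioned above: forget the user's
 * request and repopulate the core default. */
static void reset_indir_to_default(struct dev_cfg *cfg)
{
	cfg->indir_user_set = false;
	for (size_t i = 0; i < 8; i++)
		cfg->indir[i] = i % cfg->queue_count;
}
```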

Indexed objects
---------------

Configuration related to indexed objects (queues, NAPI instances, etc.)
should not be reset when the device is closed, but should be reset when the
object is explicitly "freed" by the user. For example, reducing the queue
count should discard the configuration of now-disabled queues.
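A minimal sketch of this lifetime rule, again with hypothetical names:
closing the device keeps stored per-object configuration, while shrinking
the object count discards configuration for the objects that no longer
exist:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_QUEUES 8

struct queue_cfg { int weight; };

struct dev_state {
	size_t queue_count;
	struct queue_cfg cfg[MAX_QUEUES];  /* persists across close/open */
};

static void dev_close(struct dev_state *d)
{
	/* No-op for configuration: stored values must survive a close. */
	(void)d;
}

static void set_queue_count(struct dev_state *d, size_t count)
{
	/* Queues above the new count are "freed": drop their config. */
	for (size_t i = count; i < d->queue_count; i++)
		d->cfg[i] = (struct queue_cfg){ .weight = 0 };
	d->queue_count = count;
}
```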

Core validation
===============

For driver-facing APIs the networking stack should do its best to validate
the request (using maintained state and potentially requesting other config
from the driver via GET methods), before passing the configuration to
the driver.

Testing
=======

All new configuration APIs are required to be accompanied by tests,
including tests validating the configuration persistence, and (if applicable)
the interactions of multi-level configuration.

Tests validating the API should support execution against netdevsim,
and real drivers (netdev Python tests have this support built-in).


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v4 1/4] Add support to set napi threaded for individual napi
  2025-04-01 18:27       ` Jakub Kicinski
@ 2025-04-14 17:16         ` Samiullah Khawaja
  0 siblings, 0 replies; 22+ messages in thread
From: Samiullah Khawaja @ 2025-04-14 17:16 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Joe Damato, David S . Miller, Eric Dumazet, Paolo Abeni,
	almasrymina, willemb, mkarsten, netdev

On Tue, Apr 1, 2025 at 11:27 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 25 Mar 2025 07:51:00 -0700 Jakub Kicinski wrote:
> > On Fri, 21 Mar 2025 10:10:09 -0700 Joe Damato wrote:
> > > > +int napi_set_threaded(struct napi_struct *napi, bool threaded)
> > > > +{
> > > > + if (napi->dev->threaded)
> > > > +         return -EINVAL;
> > >
> > > This works differently than the existing per-NAPI defer_hard_irqs /
> > > gro_flush_timeout which are also interface wide.
> > >
> > > In that implementation:
> > >   - the per-NAPI value is set when requested by the user
> > >   - when the sysfs value is written, all NAPIs have their values
> > >     overwritten to the sysfs value
> > >
> > > I think either:
> > >   - This implementation should work like the existing ones, or
> > >   - The existing ones should be changed to work like this
> > >
> > > I am opposed to have two different behaviors when setting per-NAPI
> > > vs system/nic-wide sysfs values.
> > >
> > > I don't have a preference on which behavior is chosen, but the
> > > behavior should be the same for all of the things that are
> > > system/nic-wide and moving to per-NAPI.
> >
> > And we should probably have a test that verifies the consistency
> > for all the relevant attrs.
>
> I was thinking about it some more in another context, and I decided
> to write down what came to mind. Does this make sense as part of
> our docs?
>
> ===================================
> Adding new configuration interfaces
> ===================================
>
> Best practices for implementing new configuration interfaces in networking.
>
> Multi-level configuration
> =========================
>
> In certain cases the same configuration option can be specified with different
> levels of granularity, e.g. global configuration, and device-level
> configuration. Finer-grained rules always take precedence. A trickier
> problem is what effect changing the coarser settings should have on already
> present finer settings. Setting coarser configuration can either reset
> all finer grained rules ("write all" semantics), or affect only objects
> for which finer grained rules have not been specified ("default" semantics).
Our current approach for napi configuration uses "write all" semantics,
as Joe mentioned above, and I am going to change "napi threaded" to do
the same for consistency. To implement the "default" semantics, we
would need something that tracks the "dirty" configuration state in
each napi. We would also need to decide on a mechanism for the user to
signal that the fine-grained configuration is no longer needed. I can
send another patch later to switch to the "default" semantics if that
is decided.

>
> The "default" semantics are recommended unless a clear and documented reason
> exists for the interface to behave otherwise.
>
> Configuration persistence
> =========================
>
> User configuration should always be preserved, as long as related objects
> exist.
>
> No loss on close
> ----------------
>
> Closing and opening a net_device should not result in loss of configuration.
> Dynamically allocated objects should be re-instantiated when the device
> is opened.
>
> No partial loss
> ---------------
>
> Loss of configuration is only acceptable due to asynchronous device errors,
> and in response to explicit reset requests from the user (``devlink reload``,
> ``ethtool --reset``, etc.). The implementation should not attempt to preserve
> the objects affected by configuration loss (e.g. if some of net_device
> configuration is lost, the net_device should be unregistered and re-registered
> as part of the reset procedure).
>
> Explicit default tracking
> -------------------------
>
> Network configuration is often performed in multiple steps, so it is important
> that conflicting user requests cause an explicit error, rather than silent
> reset of previously requested settings to defaults. For example, if the user
> first requests an RSS indirection table directing to queues 0, 1, and 2,
> and then sets the queue count to 2, the queue count change should be rejected.
>
> This implies that network configuration often needs to include an indication
> whether a given setting has been requested by a user, or is a default value
> populated by the core or the driver. What's more, the user configuration API
> may need to provide an ability to not only set a value but also to reset
> it back to the default.
>
> Indexed objects
> ---------------
>
> Configuration related to indexed objects (queues, NAPI instances, etc.)
> should not be reset when the device is closed, but should be reset when the
> object is explicitly "freed" by the user. For example, reducing the queue
> count should discard the configuration of now-disabled queues.
>
> Core validation
> ===============
>
> For driver-facing APIs the networking stack should do its best to validate
> the request (using maintained state and potentially requesting other config
> from the driver via GET methods), before passing the configuration to
> the driver.
>
> Testing
> =======
>
> All new configuration APIs are required to be accompanied by tests,
> including tests validating the configuration persistence, and (if applicable)
> the interactions of multi-level configuration.
>
> Tests validating the API should support execution against netdevsim,
> and real drivers (netdev Python tests have this support built-in).
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2025-04-14 17:16 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-03-21  2:15 [PATCH net-next v4 0/4] Add support to do threaded napi busy poll Samiullah Khawaja
2025-03-21  2:15 ` [PATCH net-next v4 1/4] Add support to set napi threaded for individual napi Samiullah Khawaja
2025-03-21 17:10   ` Joe Damato
2025-03-25 14:51     ` Jakub Kicinski
2025-04-01 18:27       ` Jakub Kicinski
2025-04-14 17:16         ` Samiullah Khawaja
2025-03-25 14:52   ` Jakub Kicinski
2025-03-21  2:15 ` [PATCH net-next v4 2/4] net: Create separate gro_flush helper function Samiullah Khawaja
2025-03-21 17:16   ` Joe Damato
2025-03-27 16:42     ` Samiullah Khawaja
2025-03-21  2:15 ` [PATCH net-next v4 3/4] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
2025-03-21 17:39   ` Joe Damato
2025-03-27 16:39     ` Samiullah Khawaja
2025-03-21  2:15 ` [PATCH net-next v4 4/4] selftests: Add napi threaded busy poll test in `busy_poller` Samiullah Khawaja
2025-03-24  2:38 ` [PATCH net-next v4 0/4] Add support to do threaded napi busy poll Martin Karsten
2025-03-25 16:40   ` Samiullah Khawaja
2025-03-25 17:47     ` Martin Karsten
2025-03-26 20:34       ` Samiullah Khawaja
2025-03-26 21:22         ` Martin Karsten
2025-03-26 21:57         ` Stanislav Fomichev
2025-03-27 18:40           ` Joe Damato
2025-03-27 19:35             ` Samiullah Khawaja

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).