* [PATCH net-next v6 0/5] Add support to do threaded napi busy poll
@ 2025-07-18 23:20 Samiullah Khawaja
2025-07-18 23:20 ` [PATCH net-next v6 1/5] net: Create separate gro_flush_normal function Samiullah Khawaja
` (5 more replies)
0 siblings, 6 replies; 8+ messages in thread
From: Samiullah Khawaja @ 2025-07-18 23:20 UTC
To: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, jdamato, mkarsten
Cc: netdev, skhawaja
Extend the existing threaded NAPI poll support to do continuous busy
polling.

This continuously polls a NAPI to fetch descriptors from the backing
RX/TX queues for low-latency applications. Allow enabling threaded busy
polling through netlink so that it can be enabled on a set of dedicated
NAPIs for low-latency applications.

Once it is enabled, the user can fetch the PID of the kthread doing NAPI
polling and set its affinity, priority and scheduler depending on the
low-latency requirements.

Currently threaded NAPI can only be enabled at device level using sysfs.
Add support to enable/disable threaded mode for an individual NAPI
through the netlink interface. Extend the `napi-set` op in the netlink
spec to allow setting the `threaded` attribute of a NAPI.

Extend the threaded attribute in the napi struct with an option to
enable continuous busy polling. Extend the netlink and sysfs interfaces
to allow enabling/disabling threaded busy polling at device or
individual NAPI level.
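As a rough illustration of the values the `threaded` setting can take with this series, the sketch below mirrors them in Python. The names `NapiThreaded` and `sysfs_threaded_write` are hypothetical and not part of the series. The values 0 and 1 correspond to the `disabled` and `enabled` states, and the value 2 for busy polling is an assumption taken from the `echo 2 | sudo tee /sys/class/net/eth0/threaded` line in the setup script later in this letter.

```python
from enum import IntEnum
from pathlib import Path


class NapiThreaded(IntEnum):
    """Hypothetical mirror of the NAPI threaded states discussed here."""
    DISABLED = 0
    ENABLED = 1
    BUSY_POLL_ENABLED = 2  # assumed value, taken from the setup script


def sysfs_threaded_write(ifname: str, state: NapiThreaded,
                         root: str = "/sys/class/net") -> Path:
    """Write the device-level threaded state (needs root on a real system)."""
    path = Path(root) / ifname / "threaded"
    path.write_text(f"{int(state)}\n")
    return path
```

On a real system this would be called as `sysfs_threaded_write("eth0", NapiThreaded.BUSY_POLL_ENABLED)`. The `root` parameter exists only so the helper can be tried against a scratch directory.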
We use this for our AF_XDP based hard low-latency use case with
usecs-level latency requirements. For this use case we want low jitter
and stable latency at P99.

The following is an analysis and comparison of the available (and
compatible) busy poll interfaces for a low-latency use case with stable
P99. Please note that throughput and CPU efficiency are non-goals.

For the analysis we use an AF_XDP based benchmarking tool, `xsk_rr`. The
tool, and how it tries to simulate the real workload, is described
below:
- It sends UDP packets between 2 machines.
- The client machine sends packets at a fixed frequency. To maintain the
  frequency of the packets being sent, we use open-loop sampling. That
  is, the packets are sent in a separate thread.
- The server replies to the packet inline by reading the packet from the
  recv ring and replying through the tx ring.
- To simulate the application processing time, we use a configurable
  delay in usecs on the client side after a reply is received from the
  server.
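The open-loop behaviour described in the list above can be pictured with a toy model. This is not code from the benchmarking tool. The function names are hypothetical, and the closed-loop variant is included only for contrast.

```python
def open_loop_send_times(period_us: float, count: int) -> list[float]:
    """Send times for a sender that fires every period, regardless of replies."""
    return [k * period_us for k in range(count)]


def closed_loop_send_times(rtt_us: float, delay_us: float,
                           count: int) -> list[float]:
    """Send times when each send waits for the previous reply and processing."""
    times = []
    now = 0.0
    for _ in range(count):
        times.append(now)
        # The next packet leaves only after the reply arrived and was processed.
        now += rtt_us + delay_us
    return times
```

In the open-loop case the send times depend only on the period, so the offered packet rate stays fixed even when the reply path or the application delay gets slower.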
The xsk_rr tool is posted separately as an RFC for
tools/testing/selftests.

We use this tool with the following NAPI polling configurations:
- Interrupts only
- SO_BUSY_POLL (inline in the same thread where the client receives the
  packet)
- SO_BUSY_POLL (separate thread and separate core)
- Threaded NAPI busy poll

The system is configured using the following script in all 4 cases:
```
echo 0 | sudo tee /sys/class/net/eth0/threaded
echo 0 | sudo tee /proc/sys/kernel/timer_migration
echo off | sudo tee /sys/devices/system/cpu/smt/control
sudo ethtool -L eth0 rx 1 tx 1
sudo ethtool -G eth0 rx 1024
echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
# pin IRQs on CPU 2
IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
print arr[0]}' < /proc/interrupts)"
# unquoted on purpose so that each matching IRQ number is visited
for irq in ${IRQS}; \
do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done
echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
for i in /sys/devices/virtual/workqueue/*/cpumask; \
do echo $i; echo 1,2,3,4,5,6 > $i; done
if [[ -z "$1" ]]; then
echo 400 | sudo tee /proc/sys/net/core/busy_read
echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
fi
sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0
if [[ "$1" == "enable_threaded" ]]; then
echo 0 | sudo tee /proc/sys/net/core/busy_poll
echo 0 | sudo tee /proc/sys/net/core/busy_read
echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
echo 2 | sudo tee /sys/class/net/eth0/threaded
NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
sudo chrt -f -p 50 $NAPI_T
# pin threaded poll thread to CPU 2
sudo taskset -pc 2 $NAPI_T
fi
if [[ "$1" == "enable_interrupt" ]]; then
echo 0 | sudo tee /proc/sys/net/core/busy_read
echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
fi
```
To enable the various configurations, run the script as follows:
- Interrupt Only
```
<script> enable_interrupt
```
- SO_BUSY_POLL (no arguments to script)
```
<script>
```
- NAPI threaded busypoll
```
<script> enable_threaded
```
If using idpf, the script needs to be run again after launching the
workload to make sure that the configuration is not reverted, because
idpf reverts some settings on a software reset when the AF_XDP program
is attached.
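One way to notice such a revert is to read the settings back after the workload has started. The sketch below is only an illustration. The helper name `reverted_settings` is hypothetical, and the expected values are the ones the script above writes in the `enable_threaded` case.

```python
from pathlib import Path

# Expected values for the "enable_threaded" case, copied from the script above.
EXPECTED_THREADED = {
    "threaded": "2",
    "napi_defer_hard_irqs": "100",
    "gro_flush_timeout": "15000",
}


def reverted_settings(ifname, expected, root="/sys/class/net"):
    """Return {name: (expected, actual)} for each setting that does not match."""
    mismatches = {}
    for name, want in expected.items():
        actual = (Path(root) / ifname / name).read_text().strip()
        if actual != want:
            mismatches[name] = (want, actual)
    return mismatches
```

An empty result from `reverted_settings("eth0", EXPECTED_THREADED)` would suggest that the script does not need to be run again.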
Once configured, the workload is run with the various configurations
using the following commands. Set the period (1/frequency) and the delay
in usecs to produce results for a given packet frequency and application
processing delay.
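For reference, the period passed with `-P` follows directly from the packet rate. The helper below is a small sketch of that conversion, and the name `period_usecs` is hypothetical.

```python
def period_usecs(rate_kpps: float) -> float:
    """Inter-packet period in usecs for a packet rate given in Kpkt/s."""
    return 1000.0 / rate_kpps


if __name__ == "__main__":
    # Packet rates used in the experiments reported in this letter.
    for rate in (12, 25, 32, 125):
        print(f"{rate} Kpkt/s -> {period_usecs(rate):.2f} usecs")
```

For example, 32 Kpkt/s corresponds to a period of 31.25 usecs and 125 Kpkt/s to 8 usecs.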
## Interrupt Only and SO_BUSY_POLL (inline)
- Server
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
-D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v
```
- Client
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
-S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
-P <Period-usecs> -d <Delay-usecs> -T -l 1 -v
```
## SO_BUSY_POLL (done in separate core using recvfrom)
Argument -t spawns a separate thread and continuously calls recvfrom.
- Server
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
-D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
-h -v -t
```
- Client
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
-S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
-P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -t
```
## NAPI Threaded Busy Poll
Argument -n skips the recvfrom call as there is no recv kick needed.
- Server
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
-D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
-h -v -n
```
- Client
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
-S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
-P <Period-usecs> -d <Delay-usecs> -T -l 1 -v -n
```
| Experiment (latency in nsecs) | interrupts | SO_BUSY_POLL | SO_BUSY_POLL (separate) | NAPI threaded |
|---|---|---|---|---|
| 12 Kpkt/s + 0us delay | | | | |
| | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
| | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
| | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
| | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
| 32 Kpkt/s + 30us delay | | | | |
| | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
| | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
| | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
| | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
| 125 Kpkt/s + 6us delay | | | | |
| | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
| | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
| | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
| | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
| 12 Kpkt/s + 78us delay | | | | |
| | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
| | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
| | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
| | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
| 25 Kpkt/s + 38us delay | | | | |
| | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
| | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
| | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
| | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
## Observations
- Without application processing, all the approaches give the same
  latency within a 1 usec range, and NAPI threaded gives the minimum
  latency.
- With application processing, the latency increases by 3-4 usecs when
  doing inline polling.
- Using a dedicated core to drive NAPI polling keeps the latency the
  same even with application processing. This is observed both with
  userspace-driven polling and with threaded NAPI (in kernel).
- Using NAPI threaded polling in the kernel gives about 1-1.5 usecs
  lower latency than userspace-driven polling on a separate core.
- With application processing, userspace gets the packet from the recv
  ring, spends some time doing application processing and only then
  does NAPI polling. While application processing is happening, a
  dedicated core doing NAPI polling can pull the packet off the NAPI RX
  queue and populate the AF_XDP recv ring. This means that when the
  application thread is done with application processing, it has new
  packets ready to receive and process in the recv ring.
- NAPI threaded busy polling in the kernel with a dedicated core gives
  the most consistent P5-P99 latency.
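The differences quoted in these observations can be recomputed from the table. The sketch below copies the p50 column values and assumes they are in nsecs, which is an inference from the usec figures in the observations. The helper names are hypothetical.

```python
# p50 latency per experiment, copied from the table above (assumed nsecs).
# Tuple order: interrupts, SO_BUSY_POLL inline, SO_BUSY_POLL separate core,
# NAPI threaded.
P50 = {
    "12 Kpkt/s + 0us delay": (13100, 13600, 14100, 13000),
    "32 Kpkt/s + 30us delay": (21100, 17000, 13700, 13000),
    "125 Kpkt/s + 6us delay": (15400, 17400, 13800, 13100),
    "12 Kpkt/s + 78us delay": (14300, 17100, 13900, 12800),
    "25 Kpkt/s + 38us delay": (21000, 17100, 13800, 12900),
}
BASELINE = "12 Kpkt/s + 0us delay"


def threaded_gain(experiment: str) -> int:
    """p50 of separate-core SO_BUSY_POLL minus p50 of NAPI threaded."""
    _, _, separate, threaded = P50[experiment]
    return separate - threaded


def inline_penalty(experiment: str) -> int:
    """Extra inline SO_BUSY_POLL p50 compared with the no-delay experiment."""
    return P50[experiment][1] - P50[BASELINE][1]


if __name__ == "__main__":
    for name in P50:
        print(name, threaded_gain(name), inline_penalty(name))
```

Only the p50 rows are used here. The same helpers could be pointed at the p99 rows to compare tail latency instead.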
The following histogram measures the time spent in recvfrom when polling
inline in the application thread with SO_BUSY_POLL. It is generated
using the bpftrace command below. In this experiment there are 32K
packets per second and the application processing delay is 30 usecs.
The goal is to check whether pulling packets from the descriptor queue
takes enough time to affect the overall latency when done inline.
```
bpftrace -e '
kprobe:xsk_recvmsg {
@start[tid] = nsecs;
}
kretprobe:xsk_recvmsg {
if (@start[tid]) {
$sample = (nsecs - @start[tid]);
@xsk_recvfrom_hist = hist($sample);
delete(@start[tid]);
}
}
END { clear(@start);}'
```
Here, in the inline busy-polling case, around 25 percent of the calls in
the histogram below take 1-2 usecs and around 40 percent take 0.5-2
usecs.
@xsk_recvfrom_hist:
[128, 256) 24073 |@@@@@@@@@@@@@@@@@@@@@@ |
[256, 512) 55633 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K) 20974 |@@@@@@@@@@@@@@@@@@@ |
[1K, 2K) 34234 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[2K, 4K) 3266 |@@@ |
[4K, 8K) 19 | |
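The share of calls in a given latency range can be recomputed from the bucket counts. The sketch below copies the counts from the histogram above and treats the bucket bounds as nsecs, since the bpftrace script records `nsecs` differences. The helper name `share_percent` is hypothetical.

```python
# (lower bound, upper bound, count) per bucket, copied from the histogram above.
BUCKETS = [
    (128, 256, 24073),
    (256, 512, 55633),
    (512, 1024, 20974),
    (1024, 2048, 34234),
    (2048, 4096, 3266),
    (4096, 8192, 19),
]


def share_percent(lo_nsecs: int, hi_nsecs: int) -> float:
    """Percentage of calls whose bucket lies inside [lo_nsecs, hi_nsecs)."""
    total = sum(count for _, _, count in BUCKETS)
    inside = sum(count for lo, hi, count in BUCKETS
                 if lo >= lo_nsecs and hi <= hi_nsecs)
    return 100.0 * inside / total


if __name__ == "__main__":
    print(f"1-2 usecs:   {share_percent(1024, 2048):.1f}%")
    print(f"0.5-2 usecs: {share_percent(512, 2048):.1f}%")
```

Because the buckets are powers of two, ranges such as 1-2 usecs map onto the [1K, 2K) bucket only approximately.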
v6:
- Moved threaded in struct net_device up to fill the cacheline hole.
- Changed dev_set_threaded to dev_set_threaded_hint and removed the
  second argument that was always set to true by all the drivers.
  Exported only dev_set_threaded_hint and made dev_set_threaded a
  core-only function. This change is done in a separate commit.
- Updated documentation comment for threaded in struct net_device.
- gro_flush_helper renamed to gro_flush_normal and moved to gro.h. Also
used it in kernel/bpf/cpumap.c
- Updated documentation to explicitly state that the NAPI threaded busy
polling would keep the CPU core busy at 100% usage.
- Updated documentation and commit messages.
v5:
- Updated experiment data with 'SO_PREFER_BUSY_POLL' usage as
suggested.
- Sent 'Add support to set napi threaded for individual napi'
  separately. This series depends on that patch.
  https://lore.kernel.org/netdev/20250423201413.1564527-1-skhawaja@google.com/
- Added a separate patch to use enum for napi threaded state. Updated
  the nl_netdev python test.
- Using "write all" semantics when napi settings are set at device
  level. This aligns with already existing behaviour for other settings.
- Fix comments to make them kdoc compatible.
- Updated Documentation/networking/net_cachelines/net_device.rst
- Updated the missed gro_flush modification in napi_complete_done
v4:
- Using AF_XDP based benchmark for experiments.
- Re-enable dev level napi threaded busypoll after soft reset.
v3:
- Fixed calls to dev_set_threaded in drivers
v2:
- Add documentation in napi.rst.
- Provide experiment data and usecase details.
- Update busy_poller selftest to include napi threaded poll testcase.
- Define threaded mode enum in netlink interface.
- Included NAPI threaded state in napi config to save/restore.
Samiullah Khawaja (5):
net: Create separate gro_flush_normal function
net: Use dev_set_threaded_hint instead of dev_set_threaded in drivers
net: define an enum for the napi threaded state
Extend napi threaded polling to allow kthread based busy polling
selftests: Add napi threaded busy poll test in `busy_poller`
Documentation/ABI/testing/sysfs-class-net | 3 +-
Documentation/netlink/specs/netdev.yaml | 14 ++-
Documentation/networking/napi.rst | 63 +++++++++++-
.../networking/net_cachelines/net_device.rst | 2 +-
.../net/ethernet/atheros/atl1c/atl1c_main.c | 2 +-
drivers/net/ethernet/mellanox/mlxsw/pci.c | 2 +-
drivers/net/ethernet/renesas/ravb_main.c | 2 +-
drivers/net/wireguard/device.c | 2 +-
drivers/net/wireless/ath/ath10k/snoc.c | 2 +-
drivers/net/wireless/mediatek/mt76/debugfs.c | 2 +-
include/linux/netdevice.h | 18 +++-
include/net/gro.h | 6 ++
include/uapi/linux/netdev.h | 6 ++
kernel/bpf/cpumap.c | 3 +-
net/core/dev.c | 97 +++++++++++++++----
net/core/dev.h | 16 ++-
net/core/net-sysfs.c | 2 +-
net/core/netdev-genl-gen.c | 2 +-
net/core/netdev-genl.c | 2 +-
tools/include/uapi/linux/netdev.h | 6 ++
tools/testing/selftests/net/busy_poll_test.sh | 25 ++++-
tools/testing/selftests/net/busy_poller.c | 14 ++-
tools/testing/selftests/net/nl_netdev.py | 36 +++----
23 files changed, 257 insertions(+), 70 deletions(-)
base-commit: c3886ccaadf8fdc2c91bfbdcdca36ccdc6ef8f70
--
2.50.0.727.gbf7dc18ff4-goog
* [PATCH net-next v6 1/5] net: Create separate gro_flush_normal function
2025-07-18 23:20 [PATCH net-next v6 0/5] Add support to do threaded napi busy poll Samiullah Khawaja
@ 2025-07-18 23:20 ` Samiullah Khawaja
2025-07-18 23:20 ` [PATCH net-next v6 2/5] net: Use dev_set_threaded_hint instead of dev_set_threaded in drivers Samiullah Khawaja
` (4 subsequent siblings)
5 siblings, 0 replies; 8+ messages in thread
From: Samiullah Khawaja @ 2025-07-18 23:20 UTC
To: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, jdamato, mkarsten
Cc: netdev, skhawaja
Move the multiple copies of the same code snippet that calls `gro_flush`
and `gro_normal_list` into a separate helper function.
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
---
v6:
- gro_flush_helper renamed to gro_flush_normal and moved to gro.h. Also
used it in kernel/bpf/cpumap.c
---
include/net/gro.h | 6 ++++++
kernel/bpf/cpumap.c | 3 +--
net/core/dev.c | 9 +++------
3 files changed, 10 insertions(+), 8 deletions(-)
diff --git a/include/net/gro.h b/include/net/gro.h
index 22d3a69e4404..a0fca7ac6e7e 100644
--- a/include/net/gro.h
+++ b/include/net/gro.h
@@ -534,6 +534,12 @@ static inline void gro_normal_list(struct gro_node *gro)
gro->rx_count = 0;
}
+static inline void gro_flush_normal(struct gro_node *gro, bool flush_old)
+{
+ gro_flush(gro, flush_old);
+ gro_normal_list(gro);
+}
+
/* Queue one GRO_NORMAL SKB up for list processing. If batch size exceeded,
* pass the whole batch up to the stack.
*/
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 67e8a2fc1a99..b2b7b8ec2c2a 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -282,8 +282,7 @@ static void cpu_map_gro_flush(struct bpf_cpu_map_entry *rcpu, bool empty)
* This is equivalent to how NAPI decides whether to perform a full
* flush.
*/
- gro_flush(&rcpu->gro, !empty && HZ >= 1000);
- gro_normal_list(&rcpu->gro);
+ gro_flush_normal(&rcpu->gro, !empty && HZ >= 1000);
}
static int cpu_map_kthread_run(void *data)
diff --git a/net/core/dev.c b/net/core/dev.c
index 621a639aeba1..cc216a461743 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6574,8 +6574,7 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
* it, we need to bound somehow the time packets are kept in
* the GRO layer.
*/
- gro_flush(&n->gro, !!timeout);
- gro_normal_list(&n->gro);
+ gro_flush_normal(&n->gro, !!timeout);
if (unlikely(!list_empty(&n->poll_list))) {
/* If n->poll_list is not empty, we need to mask irqs */
@@ -6645,8 +6644,7 @@ static void __busy_poll_stop(struct napi_struct *napi, bool skip_schedule)
}
/* Flush too old packets. If HZ < 1000, flush all packets */
- gro_flush(&napi->gro, HZ >= 1000);
- gro_normal_list(&napi->gro);
+ gro_flush_normal(&napi->gro, HZ >= 1000);
clear_bit(NAPI_STATE_SCHED, &napi->state);
}
@@ -7511,8 +7509,7 @@ static int __napi_poll(struct napi_struct *n, bool *repoll)
}
/* Flush too old packets. If HZ < 1000, flush all packets */
- gro_flush(&n->gro, HZ >= 1000);
- gro_normal_list(&n->gro);
+ gro_flush_normal(&n->gro, HZ >= 1000);
/* Some drivers may have called napi_schedule
* prior to exhausting their budget.
--
2.50.0.727.gbf7dc18ff4-goog
* [PATCH net-next v6 2/5] net: Use dev_set_threaded_hint instead of dev_set_threaded in drivers
2025-07-18 23:20 [PATCH net-next v6 0/5] Add support to do threaded napi busy poll Samiullah Khawaja
2025-07-18 23:20 ` [PATCH net-next v6 1/5] net: Create separate gro_flush_normal function Samiullah Khawaja
@ 2025-07-18 23:20 ` Samiullah Khawaja
2025-07-18 23:20 ` [PATCH net-next v6 3/5] net: define an enum for the napi threaded state Samiullah Khawaja
` (3 subsequent siblings)
5 siblings, 0 replies; 8+ messages in thread
From: Samiullah Khawaja @ 2025-07-18 23:20 UTC
To: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, jdamato, mkarsten
Cc: netdev, skhawaja
Prepare for adding an enum type for NAPI threaded states by adding
dev_set_threaded_hint API. De-export the existing dev_set_threaded API
and only use it internally. Update existing drivers to use
dev_set_threaded_hint instead of the de-exported dev_set_threaded.
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
---
drivers/net/ethernet/atheros/atl1c/atl1c_main.c | 2 +-
drivers/net/ethernet/mellanox/mlxsw/pci.c | 2 +-
drivers/net/ethernet/renesas/ravb_main.c | 2 +-
drivers/net/wireguard/device.c | 2 +-
drivers/net/wireless/ath/ath10k/snoc.c | 2 +-
drivers/net/wireless/mediatek/mt76/debugfs.c | 2 +-
include/linux/netdevice.h | 2 +-
net/core/dev.c | 7 ++++++-
net/core/dev.h | 2 ++
9 files changed, 15 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
index ef1a51347351..4519379d284c 100644
--- a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
+++ b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
@@ -2688,7 +2688,7 @@ static int atl1c_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
adapter->mii.mdio_write = atl1c_mdio_write;
adapter->mii.phy_id_mask = 0x1f;
adapter->mii.reg_num_mask = MDIO_CTRL_REG_MASK;
- dev_set_threaded(netdev, true);
+ dev_set_threaded_hint(netdev);
for (i = 0; i < adapter->rx_queue_count; ++i)
netif_napi_add(netdev, &adapter->rrd_ring[i].napi,
atl1c_clean_rx);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c
index 058dcabfaa2e..268b830ce17e 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/pci.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c
@@ -156,7 +156,7 @@ static int mlxsw_pci_napi_devs_init(struct mlxsw_pci *mlxsw_pci)
}
strscpy(mlxsw_pci->napi_dev_rx->name, "mlxsw_rx",
sizeof(mlxsw_pci->napi_dev_rx->name));
- dev_set_threaded(mlxsw_pci->napi_dev_rx, true);
+ dev_set_threaded_hint(mlxsw_pci->napi_dev_rx);
return 0;
diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index c9f4976a3527..31b2cb11764d 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -3075,7 +3075,7 @@ static int ravb_probe(struct platform_device *pdev)
if (info->coalesce_irqs) {
netdev_sw_irq_coalesce_default_on(ndev);
if (num_present_cpus() == 1)
- dev_set_threaded(ndev, true);
+ dev_set_threaded_hint(ndev);
}
/* Network device register */
diff --git a/drivers/net/wireguard/device.c b/drivers/net/wireguard/device.c
index 4a529f1f9bea..1f3e4e7cc90a 100644
--- a/drivers/net/wireguard/device.c
+++ b/drivers/net/wireguard/device.c
@@ -366,7 +366,7 @@ static int wg_newlink(struct net_device *dev,
if (ret < 0)
goto err_free_handshake_queue;
- dev_set_threaded(dev, true);
+ dev_set_threaded_hint(dev);
ret = register_netdevice(dev);
if (ret < 0)
goto err_uninit_ratelimiter;
diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c
index d51f2e5a79a4..d6412330d8ef 100644
--- a/drivers/net/wireless/ath/ath10k/snoc.c
+++ b/drivers/net/wireless/ath/ath10k/snoc.c
@@ -936,7 +936,7 @@ static int ath10k_snoc_hif_start(struct ath10k *ar)
bitmap_clear(ar_snoc->pending_ce_irqs, 0, CE_COUNT_MAX);
- dev_set_threaded(ar->napi_dev, true);
+ dev_set_threaded_hint(ar->napi_dev);
ath10k_core_napi_enable(ar);
/* IRQs are left enabled when we restart due to a firmware crash */
if (!test_bit(ATH10K_SNOC_FLAG_RECOVERY, &ar_snoc->flags))
diff --git a/drivers/net/wireless/mediatek/mt76/debugfs.c b/drivers/net/wireless/mediatek/mt76/debugfs.c
index b6a2746c187d..bd62a87aabfe 100644
--- a/drivers/net/wireless/mediatek/mt76/debugfs.c
+++ b/drivers/net/wireless/mediatek/mt76/debugfs.c
@@ -34,7 +34,7 @@ mt76_napi_threaded_set(void *data, u64 val)
return -EOPNOTSUPP;
if (dev->napi_dev->threaded != val)
- return dev_set_threaded(dev->napi_dev, val);
+ return dev_set_threaded_hint(dev->napi_dev);
return 0;
}
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e49d8c98d284..87591448a008 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -589,7 +589,7 @@ static inline bool napi_complete(struct napi_struct *n)
return napi_complete_done(n, 0);
}
-int dev_set_threaded(struct net_device *dev, bool threaded);
+int dev_set_threaded_hint(struct net_device *dev);
void napi_disable(struct napi_struct *n);
void napi_disable_locked(struct napi_struct *n);
diff --git a/net/core/dev.c b/net/core/dev.c
index cc216a461743..d3f72e5f4904 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7025,7 +7025,12 @@ int dev_set_threaded(struct net_device *dev, bool threaded)
return err;
}
-EXPORT_SYMBOL(dev_set_threaded);
+
+int dev_set_threaded_hint(struct net_device *dev)
+{
+ return dev_set_threaded(dev, true);
+}
+EXPORT_SYMBOL(dev_set_threaded_hint);
/**
* netif_queue_set_napi - Associate queue with the napi
diff --git a/net/core/dev.h b/net/core/dev.h
index a603387fb566..23cbeaad8ca2 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -322,6 +322,8 @@ static inline bool napi_get_threaded(struct napi_struct *n)
int napi_set_threaded(struct napi_struct *n, bool threaded);
+int dev_set_threaded(struct net_device *dev, bool threaded);
+
int rps_cpumask_housekeeping(struct cpumask *mask);
#if defined(CONFIG_DEBUG_NET) && defined(CONFIG_BPF_SYSCALL)
--
2.50.0.727.gbf7dc18ff4-goog
* [PATCH net-next v6 3/5] net: define an enum for the napi threaded state
2025-07-18 23:20 [PATCH net-next v6 0/5] Add support to do threaded napi busy poll Samiullah Khawaja
2025-07-18 23:20 ` [PATCH net-next v6 1/5] net: Create separate gro_flush_normal function Samiullah Khawaja
2025-07-18 23:20 ` [PATCH net-next v6 2/5] net: Use dev_set_threaded_hint instead of dev_set_threaded in drivers Samiullah Khawaja
@ 2025-07-18 23:20 ` Samiullah Khawaja
2025-07-21 23:48 ` Jakub Kicinski
2025-07-18 23:20 ` [PATCH net-next v6 4/5] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
` (2 subsequent siblings)
5 siblings, 1 reply; 8+ messages in thread
From: Samiullah Khawaja @ 2025-07-18 23:20 UTC
To: Jakub Kicinski, David S. Miller, Eric Dumazet, Paolo Abeni,
almasrymina, willemb, jdamato, mkarsten
Cc: netdev, skhawaja
Instead of using '0' and '1' for the napi threaded state, use an enum
with 'disabled' and 'enabled' states. This prepares for the next patch,
which adds a new 'busy-poll-enabled' state. Also move the 'threaded'
field in struct net_device and change it from bool to u8.
Tested:
./tools/testing/selftests/net/nl_netdev.py
TAP version 13
1..7
ok 1 nl_netdev.empty_check
ok 2 nl_netdev.lo_check
ok 3 nl_netdev.page_pool_check
ok 4 nl_netdev.napi_list_check
ok 5 nl_netdev.dev_set_threaded
ok 6 nl_netdev.napi_set_threaded
ok 7 nl_netdev.nsim_rxq_reset_down
# Totals: pass:7 fail:0 xfail:0 xpass:0 skip:0 error:0
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
v6:
- Moved threaded in struct net_device up to fill the cacheline hole.
- Changed dev_set_threaded to dev_set_threaded_hint and removed the
  second argument that was always set to true by all the drivers.
  Exported only dev_set_threaded_hint and made dev_set_threaded a
  core-only function. This change is done in a separate commit.
- Updated documentation comment for threaded in struct net_device.
---
Documentation/netlink/specs/netdev.yaml | 13 ++++---
.../networking/net_cachelines/net_device.rst | 2 +-
include/linux/netdevice.h | 7 ++--
include/uapi/linux/netdev.h | 5 +++
net/core/dev.c | 12 ++++---
net/core/dev.h | 13 ++++---
net/core/netdev-genl-gen.c | 2 +-
net/core/netdev-genl.c | 2 +-
tools/include/uapi/linux/netdev.h | 5 +++
tools/testing/selftests/net/nl_netdev.py | 36 +++++++++----------
10 files changed, 58 insertions(+), 39 deletions(-)
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 85d0ea6ac426..11edbf9c5727 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -85,6 +85,10 @@ definitions:
name: qstats-scope
type: flags
entries: [queue]
+ -
+ name: napi-threaded
+ type: enum
+ entries: [ disabled, enabled ]
attribute-sets:
-
@@ -286,11 +290,10 @@ attribute-sets:
-
name: threaded
doc: Whether the NAPI is configured to operate in threaded polling
- mode. If this is set to 1 then the NAPI context operates in
- threaded polling mode.
- type: uint
- checks:
- max: 1
+ mode. If this is set to `enabled` then the NAPI context operates
+ in threaded polling mode.
+ type: u32
+ enum: napi-threaded
-
name: xsk-info
attributes: []
diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index c69cc89c958e..cb6daccac0b6 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -68,6 +68,7 @@ unsigned_char addr_assign_type
unsigned_char addr_len
unsigned_char upper_level
unsigned_char lower_level
+u8 threaded napi_poll(napi_enable,dev_set_threaded)
unsigned_short neigh_priv_len
unsigned_short padded
unsigned_short dev_id
@@ -165,7 +166,6 @@ struct sfp_bus* sfp_bus
struct lock_class_key* qdisc_tx_busylock
bool proto_down
unsigned:1 wol_enabled
-unsigned:1 threaded napi_poll(napi_enable,dev_set_threaded)
unsigned_long:1 see_all_hwtstamp_requests
unsigned_long:1 change_proto_down
unsigned_long:1 netns_immutable
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 87591448a008..97cf14a9b469 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -369,7 +369,7 @@ struct napi_config {
u64 irq_suspend_timeout;
u32 defer_hard_irqs;
cpumask_t affinity_mask;
- bool threaded;
+ u8 threaded;
unsigned int napi_id;
};
@@ -1871,6 +1871,7 @@ enum netdev_reg_state {
* @addr_len: Hardware address length
* @upper_level: Maximum depth level of upper devices.
* @lower_level: Maximum depth level of lower devices.
+ * @threaded: napi threaded state.
* @neigh_priv_len: Used in neigh_alloc()
* @dev_id: Used to differentiate devices that share
* the same link layer address
@@ -2010,8 +2011,6 @@ enum netdev_reg_state {
* switch driver and used to set the phys state of the
* switch port.
*
- * @threaded: napi threaded mode is enabled
- *
* @irq_affinity_auto: driver wants the core to store and re-assign the IRQ
* affinity. Set by netif_enable_irq_affinity(), then
* the driver must create a persistent napi by
@@ -2247,6 +2246,7 @@ struct net_device {
unsigned char addr_len;
unsigned char upper_level;
unsigned char lower_level;
+ u8 threaded;
unsigned short neigh_priv_len;
unsigned short dev_id;
@@ -2428,7 +2428,6 @@ struct net_device {
struct sfp_bus *sfp_bus;
struct lock_class_key *qdisc_tx_busylock;
bool proto_down;
- bool threaded;
bool irq_affinity_auto;
bool rx_cpu_rmap_auto;
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 1f3719a9a0eb..48eb49aa03d4 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -77,6 +77,11 @@ enum netdev_qstats_scope {
NETDEV_QSTATS_SCOPE_QUEUE = 1,
};
+enum netdev_napi_threaded {
+ NETDEV_NAPI_THREADED_DISABLED,
+ NETDEV_NAPI_THREADED_ENABLED,
+};
+
enum {
NETDEV_A_DEV_IFINDEX = 1,
NETDEV_A_DEV_PAD,
diff --git a/net/core/dev.c b/net/core/dev.c
index d3f72e5f4904..ec65b03492b1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6959,7 +6959,8 @@ static void napi_stop_kthread(struct napi_struct *napi)
napi->thread = NULL;
}
-int napi_set_threaded(struct napi_struct *napi, bool threaded)
+int napi_set_threaded(struct napi_struct *napi,
+ enum netdev_napi_threaded threaded)
{
if (threaded) {
if (!napi->thread) {
@@ -6984,7 +6985,8 @@ int napi_set_threaded(struct napi_struct *napi, bool threaded)
return 0;
}
-int dev_set_threaded(struct net_device *dev, bool threaded)
+int dev_set_threaded(struct net_device *dev,
+ enum netdev_napi_threaded threaded)
{
struct napi_struct *napi;
int err = 0;
@@ -6996,7 +6998,7 @@ int dev_set_threaded(struct net_device *dev, bool threaded)
if (!napi->thread) {
err = napi_kthread_create(napi);
if (err) {
- threaded = false;
+ threaded = NETDEV_NAPI_THREADED_DISABLED;
break;
}
}
@@ -7028,7 +7030,7 @@ int dev_set_threaded(struct net_device *dev, bool threaded)
int dev_set_threaded_hint(struct net_device *dev)
{
- return dev_set_threaded(dev, true);
+ return dev_set_threaded(dev, NETDEV_NAPI_THREADED_ENABLED);
}
EXPORT_SYMBOL(dev_set_threaded_hint);
@@ -7345,7 +7347,7 @@ void netif_napi_add_weight_locked(struct net_device *dev,
* threaded mode will not be enabled in napi_enable().
*/
if (dev->threaded && napi_kthread_create(napi))
- dev->threaded = false;
+ dev->threaded = NETDEV_NAPI_THREADED_DISABLED;
netif_napi_set_irq_locked(napi, -1);
}
EXPORT_SYMBOL(netif_napi_add_weight_locked);
diff --git a/net/core/dev.h b/net/core/dev.h
index 23cbeaad8ca2..ab6fac65ec24 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -315,14 +315,19 @@ static inline void napi_set_irq_suspend_timeout(struct napi_struct *n,
WRITE_ONCE(n->irq_suspend_timeout, timeout);
}
-static inline bool napi_get_threaded(struct napi_struct *n)
+static inline enum netdev_napi_threaded napi_get_threaded(struct napi_struct *n)
{
- return test_bit(NAPI_STATE_THREADED, &n->state);
+ if (test_bit(NAPI_STATE_THREADED, &n->state))
+ return NETDEV_NAPI_THREADED_ENABLED;
+
+ return NETDEV_NAPI_THREADED_DISABLED;
}
-int napi_set_threaded(struct napi_struct *n, bool threaded);
+int napi_set_threaded(struct napi_struct *n,
+ enum netdev_napi_threaded threaded);
-int dev_set_threaded(struct net_device *dev, bool threaded);
+int dev_set_threaded(struct net_device *dev,
+ enum netdev_napi_threaded threaded);
int rps_cpumask_housekeeping(struct cpumask *mask);
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index 0994bd68a7e6..e9a2a6f26cb7 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -97,7 +97,7 @@ static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_THREADED
[NETDEV_A_NAPI_DEFER_HARD_IRQS] = NLA_POLICY_FULL_RANGE(NLA_U32, &netdev_a_napi_defer_hard_irqs_range),
[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT] = { .type = NLA_UINT, },
[NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT] = { .type = NLA_UINT, },
- [NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_UINT, 1),
+ [NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_U32, 1),
};
/* NETDEV_CMD_BIND_TX - do */
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 5875df372415..6314eb7bdf69 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -333,7 +333,7 @@ netdev_nl_napi_set_config(struct napi_struct *napi, struct genl_info *info)
int ret;
threaded = nla_get_uint(info->attrs[NETDEV_A_NAPI_THREADED]);
- ret = napi_set_threaded(napi, !!threaded);
+ ret = napi_set_threaded(napi, threaded);
if (ret)
return ret;
}
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 1f3719a9a0eb..48eb49aa03d4 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -77,6 +77,11 @@ enum netdev_qstats_scope {
NETDEV_QSTATS_SCOPE_QUEUE = 1,
};
+enum netdev_napi_threaded {
+ NETDEV_NAPI_THREADED_DISABLED,
+ NETDEV_NAPI_THREADED_ENABLED,
+};
+
enum {
NETDEV_A_DEV_IFINDEX = 1,
NETDEV_A_DEV_PAD,
diff --git a/tools/testing/selftests/net/nl_netdev.py b/tools/testing/selftests/net/nl_netdev.py
index c8ffade79a52..5c66421ab8aa 100755
--- a/tools/testing/selftests/net/nl_netdev.py
+++ b/tools/testing/selftests/net/nl_netdev.py
@@ -52,14 +52,14 @@ def napi_set_threaded(nf) -> None:
napi1_id = napis[1]['id']
# set napi threaded and verify
- nf.napi_set({'id': napi0_id, 'threaded': 1})
+ nf.napi_set({'id': napi0_id, 'threaded': "enabled"})
napi0 = nf.napi_get({'id': napi0_id})
- ksft_eq(napi0['threaded'], 1)
+ ksft_eq(napi0['threaded'], "enabled")
ksft_ne(napi0.get('pid'), None)
# check it is not set for napi1
napi1 = nf.napi_get({'id': napi1_id})
- ksft_eq(napi1['threaded'], 0)
+ ksft_eq(napi1['threaded'], "disabled")
ksft_eq(napi1.get('pid'), None)
ip(f"link set dev {nsim.ifname} down")
@@ -67,18 +67,18 @@ def napi_set_threaded(nf) -> None:
# verify if napi threaded is still set
napi0 = nf.napi_get({'id': napi0_id})
- ksft_eq(napi0['threaded'], 1)
+ ksft_eq(napi0['threaded'], "enabled")
ksft_ne(napi0.get('pid'), None)
# check it is still not set for napi1
napi1 = nf.napi_get({'id': napi1_id})
- ksft_eq(napi1['threaded'], 0)
+ ksft_eq(napi1['threaded'], "disabled")
ksft_eq(napi1.get('pid'), None)
# unset napi threaded and verify
- nf.napi_set({'id': napi0_id, 'threaded': 0})
+ nf.napi_set({'id': napi0_id, 'threaded': "disabled"})
napi0 = nf.napi_get({'id': napi0_id})
- ksft_eq(napi0['threaded'], 0)
+ ksft_eq(napi0['threaded'], "disabled")
ksft_eq(napi0.get('pid'), None)
# set threaded at device level
@@ -86,10 +86,10 @@ def napi_set_threaded(nf) -> None:
# check napi threaded is set for both napis
napi0 = nf.napi_get({'id': napi0_id})
- ksft_eq(napi0['threaded'], 1)
+ ksft_eq(napi0['threaded'], "enabled")
ksft_ne(napi0.get('pid'), None)
napi1 = nf.napi_get({'id': napi1_id})
- ksft_eq(napi1['threaded'], 1)
+ ksft_eq(napi1['threaded'], "enabled")
ksft_ne(napi1.get('pid'), None)
# unset threaded at device level
@@ -97,16 +97,16 @@ def napi_set_threaded(nf) -> None:
# check napi threaded is unset for both napis
napi0 = nf.napi_get({'id': napi0_id})
- ksft_eq(napi0['threaded'], 0)
+ ksft_eq(napi0['threaded'], "disabled")
ksft_eq(napi0.get('pid'), None)
napi1 = nf.napi_get({'id': napi1_id})
- ksft_eq(napi1['threaded'], 0)
+ ksft_eq(napi1['threaded'], "disabled")
ksft_eq(napi1.get('pid'), None)
# set napi threaded for napi0
nf.napi_set({'id': napi0_id, 'threaded': 1})
napi0 = nf.napi_get({'id': napi0_id})
- ksft_eq(napi0['threaded'], 1)
+ ksft_eq(napi0['threaded'], "enabled")
ksft_ne(napi0.get('pid'), None)
# unset threaded at device level
@@ -114,10 +114,10 @@ def napi_set_threaded(nf) -> None:
# check napi threaded is unset for both napis
napi0 = nf.napi_get({'id': napi0_id})
- ksft_eq(napi0['threaded'], 0)
+ ksft_eq(napi0['threaded'], "disabled")
ksft_eq(napi0.get('pid'), None)
napi1 = nf.napi_get({'id': napi1_id})
- ksft_eq(napi1['threaded'], 0)
+ ksft_eq(napi1['threaded'], "disabled")
ksft_eq(napi1.get('pid'), None)
def dev_set_threaded(nf) -> None:
@@ -141,10 +141,10 @@ def dev_set_threaded(nf) -> None:
# check napi threaded is set for both napis
napi0 = nf.napi_get({'id': napi0_id})
- ksft_eq(napi0['threaded'], 1)
+ ksft_eq(napi0['threaded'], "enabled")
ksft_ne(napi0.get('pid'), None)
napi1 = nf.napi_get({'id': napi1_id})
- ksft_eq(napi1['threaded'], 1)
+ ksft_eq(napi1['threaded'], "enabled")
ksft_ne(napi1.get('pid'), None)
# unset threaded
@@ -152,10 +152,10 @@ def dev_set_threaded(nf) -> None:
# check napi threaded is unset for both napis
napi0 = nf.napi_get({'id': napi0_id})
- ksft_eq(napi0['threaded'], 0)
+ ksft_eq(napi0['threaded'], "disabled")
ksft_eq(napi0.get('pid'), None)
napi1 = nf.napi_get({'id': napi1_id})
- ksft_eq(napi1['threaded'], 0)
+ ksft_eq(napi1['threaded'], "disabled")
ksft_eq(napi1.get('pid'), None)
def nsim_rxq_reset_down(nf) -> None:
--
2.50.0.727.gbf7dc18ff4-goog
* [PATCH net-next v6 4/5] Extend napi threaded polling to allow kthread based busy polling
2025-07-18 23:20 [PATCH net-next v6 0/5] Add support to do threaded napi busy poll Samiullah Khawaja
` (2 preceding siblings ...)
2025-07-18 23:20 ` [PATCH net-next v6 3/5] net: define an enum for the napi threaded state Samiullah Khawaja
@ 2025-07-18 23:20 ` Samiullah Khawaja
2025-07-18 23:20 ` [PATCH net-next v6 5/5] selftests: Add napi threaded busy poll test in `busy_poller` Samiullah Khawaja
2025-07-19 21:27 ` [PATCH net-next v6 0/5] Add support to do threaded napi busy poll Martin Karsten
5 siblings, 0 replies; 8+ messages in thread
From: Samiullah Khawaja @ 2025-07-18 23:20 UTC (permalink / raw)
To: Jakub Kicinski, David S . Miller , Eric Dumazet, Paolo Abeni,
almasrymina, willemb, jdamato, mkarsten
Cc: netdev, skhawaja
Add two new states to the napi state enum:
- NAPI_STATE_THREADED_BUSY_POLL
Threaded busy poll is enabled for this napi.
- NAPI_STATE_SCHED_THREADED_BUSY_POLL
The napi kthread is currently busy polling this napi.
The following changes are introduced in the napi scheduling and state logic:
- When threaded busy poll is enabled through sysfs or netlink, it also
enables NAPI_STATE_THREADED so that a kthread is created per napi. It also
sets the NAPI_STATE_THREADED_BUSY_POLL bit on each napi to indicate that
the kthread is going to busy poll the napi.
- When a napi is scheduled with NAPI_STATE_SCHED_THREADED and the associated
kthread is woken up, the kthread owns the context. If
NAPI_STATE_THREADED_BUSY_POLL and NAPI_STATE_SCHED_THREADED are both
set, the kthread can busy poll.
- To keep busy polling and to avoid re-enabling of interrupts,
napi_complete_done returns false when both NAPI_STATE_SCHED_THREADED
and NAPI_STATE_THREADED_BUSY_POLL are set. napi_complete_done also
returns early so that NAPI_STATE_SCHED_THREADED is not cleared.
- If at any point NAPI_STATE_THREADED_BUSY_POLL is cleared,
napi_complete_done runs and also clears the NAPI_STATE_SCHED_THREADED
bit. This makes the associated kthread go to sleep as per the
existing logic.
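As a rough illustration of the state handling described above, here is a
minimal userspace sketch in C. It is a model only, not kernel code: the
model_* function names, the MODEL_* names and the bit positions are
hypothetical, and the handling of the "scheduled for busy poll" bit follows
the napi_threaded_poll() and napi_complete_done() hunks of this patch as I
read them, under the assumption that nothing else touches the state word.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model; mode names mirror enum netdev_napi_threaded. */
enum model_threaded {
	MODEL_THREADED_DISABLED,
	MODEL_THREADED_ENABLED,
	MODEL_THREADED_BUSY_POLL_ENABLED,
};

/* Hypothetical bit positions, chosen for this sketch only. */
#define MODEL_F_THREADED                 (1UL << 0)
#define MODEL_F_THREADED_BUSY_POLL       (1UL << 1)
#define MODEL_F_SCHED_THREADED_BUSY_POLL (1UL << 2)
#define MODEL_F_DISABLE                  (1UL << 3)
#define MODEL_F_THREADED_MASK \
	(MODEL_F_THREADED | MODEL_F_THREADED_BUSY_POLL)

/* Like napi_set_threaded_state(): replace both mode bits at once. */
static unsigned long model_set_threaded(unsigned long state,
					enum model_threaded mode)
{
	unsigned long val = 0;

	if (mode == MODEL_THREADED_BUSY_POLL_ENABLED)
		val |= MODEL_F_THREADED_BUSY_POLL;
	if (mode != MODEL_THREADED_DISABLED)
		val |= MODEL_F_THREADED;

	return (state & ~MODEL_F_THREADED_MASK) | val;
}

/* Like napi_get_threaded(): the busy poll bit takes precedence. */
static enum model_threaded model_get_threaded(unsigned long state)
{
	if (state & MODEL_F_THREADED_BUSY_POLL)
		return MODEL_THREADED_BUSY_POLL_ENABLED;
	if (state & MODEL_F_THREADED)
		return MODEL_THREADED_ENABLED;
	return MODEL_THREADED_DISABLED;
}

/* Like the wake-up step in napi_threaded_poll(): latch the busy poll
 * decision into the SCHED bit, but never for a disabled napi.
 */
static unsigned long model_kthread_wakeup(unsigned long state)
{
	bool busy_poll = (state & MODEL_F_THREADED_BUSY_POLL) &&
			 !(state & MODEL_F_DISABLE);

	if (busy_poll)
		return state | MODEL_F_SCHED_THREADED_BUSY_POLL;
	return state & ~MODEL_F_SCHED_THREADED_BUSY_POLL;
}

/* Like the early "return false" in napi_complete_done(). */
static bool model_complete_done_bails(unsigned long state)
{
	return (state & MODEL_F_SCHED_THREADED_BUSY_POLL) != 0;
}

/* Walk one napi through enable, wake-up, and downgrade. */
static void model_self_check(void)
{
	unsigned long s = model_set_threaded(0, MODEL_THREADED_BUSY_POLL_ENABLED);

	assert(model_get_threaded(s) == MODEL_THREADED_BUSY_POLL_ENABLED);
	assert(!model_complete_done_bails(s));	/* kthread not woken yet */
	s = model_kthread_wakeup(s);
	assert(model_complete_done_bails(s));	/* keeps polling */
	s = model_kthread_wakeup(model_set_threaded(s, MODEL_THREADED_ENABLED));
	assert(!model_complete_done_bails(s));	/* completes normally */
}
```

Calling model_self_check() from a small main() walks a single napi through
the sequence above. The point of the sketch is that the mode change only
takes effect for completion handling after the kthread has been woken and
has latched the decision.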
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
---
Documentation/ABI/testing/sysfs-class-net | 3 +-
Documentation/netlink/specs/netdev.yaml | 5 +-
Documentation/networking/napi.rst | 63 +++++++++++++++++++-
include/linux/netdevice.h | 11 +++-
include/uapi/linux/netdev.h | 1 +
net/core/dev.c | 71 +++++++++++++++++++----
net/core/dev.h | 3 +
net/core/net-sysfs.c | 2 +-
net/core/netdev-genl-gen.c | 2 +-
tools/include/uapi/linux/netdev.h | 1 +
10 files changed, 145 insertions(+), 17 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-class-net b/Documentation/ABI/testing/sysfs-class-net
index ebf21beba846..15d7d36a8294 100644
--- a/Documentation/ABI/testing/sysfs-class-net
+++ b/Documentation/ABI/testing/sysfs-class-net
@@ -343,7 +343,7 @@ Date: Jan 2021
KernelVersion: 5.12
Contact: netdev@vger.kernel.org
Description:
- Boolean value to control the threaded mode per device. User could
+ Integer value to control the threaded mode per device. User could
set this value to enable/disable threaded mode for all napi
belonging to this device, without the need to do device up/down.
@@ -351,4 +351,5 @@ Description:
== ==================================
0 threaded mode disabled for this dev
1 threaded mode enabled for this dev
+ 2 threaded mode enabled, and busy polling enabled.
== ==================================
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 11edbf9c5727..70a4a9c8afef 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -88,7 +88,7 @@ definitions:
-
name: napi-threaded
type: enum
- entries: [ disabled, enabled ]
+ entries: [ disabled, enabled, busy-poll-enabled ]
attribute-sets:
-
@@ -291,7 +291,8 @@ attribute-sets:
name: threaded
doc: Whether the NAPI is configured to operate in threaded polling
mode. If this is set to `enabled` then the NAPI context operates
- in threaded polling mode.
+ in threaded polling mode. If this is set to `busy-poll-enabled`
+ then the NAPI kthread also busy polls.
type: u32
enum: napi-threaded
-
diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst
index a15754adb041..a1e76341a99a 100644
--- a/Documentation/networking/napi.rst
+++ b/Documentation/networking/napi.rst
@@ -263,7 +263,9 @@ are not well known).
Busy polling is enabled by either setting ``SO_BUSY_POLL`` on
selected sockets or using the global ``net.core.busy_poll`` and
``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling
-also exists.
+also exists. Threaded polling of NAPI also has a mode to busy poll for
+packets (:ref:`threaded busy polling<threaded_busy_poll>`) using the same
+thread that is used for NAPI processing.
epoll-based busy polling
------------------------
@@ -426,6 +428,65 @@ Therefore, setting ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` is
the recommended usage, because otherwise setting ``irq-suspend-timeout``
might not have any discernible effect.
+.. _threaded_busy_poll:
+
+Threaded NAPI busy polling
+--------------------------
+
+Threaded NAPI processes the packets of each NAPI in a dedicated kernel
+thread (kthread). Threaded NAPI busy polling extends this so that the kthread
+polls the NAPI continuously. This enables busy polling independently of the
+userspace application and of the API (epoll, io_uring, raw sockets) that it
+uses to process the packets.
+
+It can be enabled for each NAPI using the netlink interface or at device level
+using the threaded NAPI sysfs attribute.
+
+For example, using the following command:
+
+.. code-block:: bash
+
+ $ ynl --family netdev --do napi-set \
+ --json='{"id": 66, "threaded": "busy-poll-enabled"}'
+
+
+Enabling it per NAPI gives finer control: busy polling can be limited to the
+set of NIC queues that carry traffic with low-latency requirements.
+
+Depending on application requirements, the user might want to set the affinity
+of the kthread that busy polls each NAPI, and also the priority and scheduling
+policy of that thread, depending on the latency requirements.
+
+For a hard low-latency application, the user might want to dedicate a full core
+to NAPI polling so that NIC queue descriptors are picked up from the queue
+as soon as they appear. Once enabled, the NAPI thread polls the NIC queues
+continuously without sleeping, which keeps the CPU core at 100% usage. For a
+more relaxed low-latency requirement, the user might want to share the core
+with other threads by setting thread affinity and priority.
+
+Once threaded busy polling is enabled for a NAPI, the PID of the kthread can be
+fetched using the netlink interface so that its affinity, priority and
+scheduler can be configured.
+
+For example, the following command can be used to fetch the PID:
+
+.. code-block:: bash
+
+ $ ynl --family netdev --do napi-get --json='{"id": 66}'
+
+This will output something like the following, where ``258`` is the PID of the
+kthread that is polling this NAPI.
+
+.. code-block:: bash
+
+ {'defer-hard-irqs': 0,
+ 'gro-flush-timeout': 0,
+ 'id': 66,
+ 'ifindex': 2,
+ 'irq-suspend-timeout': 0,
+ 'pid': 258,
+ 'threaded': 'busy-poll-enabled'}
+
.. _threaded:
Threaded NAPI
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 97cf14a9b469..6682b975febd 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -427,6 +427,8 @@ enum {
NAPI_STATE_THREADED, /* The poll is performed inside its own thread*/
NAPI_STATE_SCHED_THREADED, /* Napi is currently scheduled in threaded mode */
NAPI_STATE_HAS_NOTIFIER, /* Napi has an IRQ notifier */
+ NAPI_STATE_THREADED_BUSY_POLL, /* The threaded napi poller will busy poll */
+ NAPI_STATE_SCHED_THREADED_BUSY_POLL, /* The threaded napi poller is busy polling */
};
enum {
@@ -441,8 +443,14 @@ enum {
NAPIF_STATE_THREADED = BIT(NAPI_STATE_THREADED),
NAPIF_STATE_SCHED_THREADED = BIT(NAPI_STATE_SCHED_THREADED),
NAPIF_STATE_HAS_NOTIFIER = BIT(NAPI_STATE_HAS_NOTIFIER),
+ NAPIF_STATE_THREADED_BUSY_POLL = BIT(NAPI_STATE_THREADED_BUSY_POLL),
+ NAPIF_STATE_SCHED_THREADED_BUSY_POLL =
+ BIT(NAPI_STATE_SCHED_THREADED_BUSY_POLL),
};
+#define NAPIF_STATE_THREADED_BUSY_POLL_MASK \
+ (NAPIF_STATE_THREADED | NAPIF_STATE_THREADED_BUSY_POLL)
+
enum gro_result {
GRO_MERGED,
GRO_MERGED_FREE,
@@ -1871,7 +1879,8 @@ enum netdev_reg_state {
* @addr_len: Hardware address length
* @upper_level: Maximum depth level of upper devices.
* @lower_level: Maximum depth level of lower devices.
- * @threaded: napi threaded state.
+ * @threaded: napi threaded mode is disabled, enabled or
+ * enabled with busy polling.
* @neigh_priv_len: Used in neigh_alloc()
* @dev_id: Used to differentiate devices that share
* the same link layer address
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 48eb49aa03d4..8163afb15377 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -80,6 +80,7 @@ enum netdev_qstats_scope {
enum netdev_napi_threaded {
NETDEV_NAPI_THREADED_DISABLED,
NETDEV_NAPI_THREADED_ENABLED,
+ NETDEV_NAPI_THREADED_BUSY_POLL_ENABLED,
};
enum {
diff --git a/net/core/dev.c b/net/core/dev.c
index ec65b03492b1..9511c69dc8e8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -78,6 +78,7 @@
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/sched/isolation.h>
+#include <linux/sched/types.h>
#include <linux/sched/mm.h>
#include <linux/smpboot.h>
#include <linux/mutex.h>
@@ -6554,7 +6555,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
* the guarantee we will be called later.
*/
if (unlikely(n->state & (NAPIF_STATE_NPSVC |
- NAPIF_STATE_IN_BUSY_POLL)))
+ NAPIF_STATE_IN_BUSY_POLL |
+ NAPIF_STATE_SCHED_THREADED_BUSY_POLL)))
return false;
if (work_done) {
@@ -6959,6 +6961,19 @@ static void napi_stop_kthread(struct napi_struct *napi)
napi->thread = NULL;
}
+static void napi_set_threaded_state(struct napi_struct *napi,
+ enum netdev_napi_threaded threaded)
+{
+ unsigned long val;
+
+ val = 0;
+ if (threaded == NETDEV_NAPI_THREADED_BUSY_POLL_ENABLED)
+ val |= NAPIF_STATE_THREADED_BUSY_POLL;
+ if (threaded)
+ val |= NAPIF_STATE_THREADED;
+ set_mask_bits(&napi->state, NAPIF_STATE_THREADED_BUSY_POLL_MASK, val);
+}
+
int napi_set_threaded(struct napi_struct *napi,
enum netdev_napi_threaded threaded)
{
@@ -6979,7 +6994,7 @@ int napi_set_threaded(struct napi_struct *napi,
} else {
/* Make sure kthread is created before THREADED bit is set. */
smp_mb__before_atomic();
- assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
+ napi_set_threaded_state(napi, threaded);
}
return 0;
@@ -7017,12 +7032,15 @@ int dev_set_threaded(struct net_device *dev,
* polled. In this case, the switch between threaded mode and
* softirq mode will happen in the next round of napi_schedule().
* This should not cause hiccups/stalls to the live traffic.
+ *
+ * Switch to busy_poll threaded napi will occur after the threaded
+ * napi is scheduled.
*/
list_for_each_entry(napi, &dev->napi_list, dev_list) {
if (!threaded && napi->thread)
napi_stop_kthread(napi);
else
- assign_bit(NAPI_STATE_THREADED, &napi->state, threaded);
+ napi_set_threaded_state(napi, threaded);
}
return err;
@@ -7369,7 +7387,9 @@ void napi_disable_locked(struct napi_struct *n)
}
new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;
- new &= ~(NAPIF_STATE_THREADED | NAPIF_STATE_PREFER_BUSY_POLL);
+ new &= ~(NAPIF_STATE_THREADED
+ | NAPIF_STATE_THREADED_BUSY_POLL
+ | NAPIF_STATE_PREFER_BUSY_POLL);
} while (!try_cmpxchg(&n->state, &val, new));
hrtimer_cancel(&n->timer);
@@ -7413,7 +7433,7 @@ void napi_enable_locked(struct napi_struct *n)
new = val & ~(NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC);
if (n->dev->threaded && n->thread)
- new |= NAPIF_STATE_THREADED;
+ napi_set_threaded_state(n, n->dev->threaded);
} while (!try_cmpxchg(&n->state, &val, new));
}
EXPORT_SYMBOL(napi_enable_locked);
@@ -7581,7 +7601,7 @@ static int napi_thread_wait(struct napi_struct *napi)
return -1;
}
-static void napi_threaded_poll_loop(struct napi_struct *napi)
+static void napi_threaded_poll_loop(struct napi_struct *napi, bool busy_poll)
{
struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
struct softnet_data *sd;
@@ -7610,22 +7630,53 @@ static void napi_threaded_poll_loop(struct napi_struct *napi)
}
skb_defer_free_flush(sd);
bpf_net_ctx_clear(bpf_net_ctx);
+
+ /* Flush too old packets. If HZ < 1000, flush all packets */
+ if (busy_poll)
+ gro_flush_normal(&napi->gro, HZ >= 1000);
local_bh_enable();
- if (!repoll)
+ /* If busy polling then do not break here because we need to
+ * call cond_resched and rcu_softirq_qs_periodic to prevent
+ * watchdog warnings.
+ */
+ if (!repoll && !busy_poll)
break;
rcu_softirq_qs_periodic(last_qs);
cond_resched();
+
+ if (!repoll)
+ break;
}
}
static int napi_threaded_poll(void *data)
{
struct napi_struct *napi = data;
+ bool busy_poll_sched;
+ unsigned long val;
+ bool busy_poll;
+
+ while (!napi_thread_wait(napi)) {
+ /* Once woken up, we are scheduled as threaded napi and this
+ * thread owns the napi context. If the busy poll state is set,
+ * busy poll this napi.
+ */
+ val = READ_ONCE(napi->state);
+ busy_poll = val & NAPIF_STATE_THREADED_BUSY_POLL;
+ busy_poll_sched = val & NAPIF_STATE_SCHED_THREADED_BUSY_POLL;
- while (!napi_thread_wait(napi))
- napi_threaded_poll_loop(napi);
+ /* Do not busy poll if napi is disabled. */
+ if (unlikely(val & NAPIF_STATE_DISABLE))
+ busy_poll = false;
+
+ if (busy_poll != busy_poll_sched)
+ assign_bit(NAPI_STATE_SCHED_THREADED_BUSY_POLL,
+ &napi->state, busy_poll);
+
+ napi_threaded_poll_loop(napi, busy_poll);
+ }
return 0;
}
@@ -12808,7 +12859,7 @@ static void run_backlog_napi(unsigned int cpu)
{
struct softnet_data *sd = per_cpu_ptr(&softnet_data, cpu);
- napi_threaded_poll_loop(&sd->backlog);
+ napi_threaded_poll_loop(&sd->backlog, false);
}
static void backlog_napi_setup(unsigned int cpu)
diff --git a/net/core/dev.h b/net/core/dev.h
index ab6fac65ec24..082270ed5b92 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -317,6 +317,9 @@ static inline void napi_set_irq_suspend_timeout(struct napi_struct *n,
static inline enum netdev_napi_threaded napi_get_threaded(struct napi_struct *n)
{
+ if (test_bit(NAPI_STATE_THREADED_BUSY_POLL, &n->state))
+ return NETDEV_NAPI_THREADED_BUSY_POLL_ENABLED;
+
if (test_bit(NAPI_STATE_THREADED, &n->state))
return NETDEV_NAPI_THREADED_ENABLED;
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 8f897e2c8b4f..3ebf8153666b 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -754,7 +754,7 @@ static int modify_napi_threaded(struct net_device *dev, unsigned long val)
if (list_empty(&dev->napi_list))
return -EOPNOTSUPP;
- if (val != 0 && val != 1)
+ if (val > NETDEV_NAPI_THREADED_BUSY_POLL_ENABLED)
return -EOPNOTSUPP;
ret = dev_set_threaded(dev, val);
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index e9a2a6f26cb7..ff20435c45d2 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -97,7 +97,7 @@ static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_THREADED
[NETDEV_A_NAPI_DEFER_HARD_IRQS] = NLA_POLICY_FULL_RANGE(NLA_U32, &netdev_a_napi_defer_hard_irqs_range),
[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT] = { .type = NLA_UINT, },
[NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT] = { .type = NLA_UINT, },
- [NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_U32, 1),
+ [NETDEV_A_NAPI_THREADED] = NLA_POLICY_MAX(NLA_U32, 2),
};
/* NETDEV_CMD_BIND_TX - do */
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 48eb49aa03d4..8163afb15377 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -80,6 +80,7 @@ enum netdev_qstats_scope {
enum netdev_napi_threaded {
NETDEV_NAPI_THREADED_DISABLED,
NETDEV_NAPI_THREADED_ENABLED,
+ NETDEV_NAPI_THREADED_BUSY_POLL_ENABLED,
};
enum {
--
2.50.0.727.gbf7dc18ff4-goog
* [PATCH net-next v6 5/5] selftests: Add napi threaded busy poll test in `busy_poller`
2025-07-18 23:20 [PATCH net-next v6 0/5] Add support to do threaded napi busy poll Samiullah Khawaja
` (3 preceding siblings ...)
2025-07-18 23:20 ` [PATCH net-next v6 4/5] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
@ 2025-07-18 23:20 ` Samiullah Khawaja
2025-07-19 21:27 ` [PATCH net-next v6 0/5] Add support to do threaded napi busy poll Martin Karsten
5 siblings, 0 replies; 8+ messages in thread
From: Samiullah Khawaja @ 2025-07-18 23:20 UTC (permalink / raw)
To: Jakub Kicinski, David S . Miller , Eric Dumazet, Paolo Abeni,
almasrymina, willemb, jdamato, mkarsten
Cc: netdev, skhawaja
Add a test case that runs the busy poll test with threaded napi busy poll
enabled.
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
---
tools/testing/selftests/net/busy_poll_test.sh | 25 ++++++++++++++++++-
tools/testing/selftests/net/busy_poller.c | 14 ++++++++---
2 files changed, 35 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/net/busy_poll_test.sh b/tools/testing/selftests/net/busy_poll_test.sh
index 7d2d40812074..ab230df1057e 100755
--- a/tools/testing/selftests/net/busy_poll_test.sh
+++ b/tools/testing/selftests/net/busy_poll_test.sh
@@ -27,6 +27,9 @@ NAPI_DEFER_HARD_IRQS=100
GRO_FLUSH_TIMEOUT=50000
SUSPEND_TIMEOUT=20000000
+# NAPI threaded busy poll config
+NAPI_THREADED_POLL=2
+
setup_ns()
{
set -e
@@ -62,6 +65,9 @@ cleanup_ns()
test_busypoll()
{
suspend_value=${1:-0}
+ napi_threaded_value=${2:-0}
+ prefer_busy_poll_value=${3:-$PREFER_BUSY_POLL}
+
tmp_file=$(mktemp)
out_file=$(mktemp)
@@ -73,10 +79,11 @@ test_busypoll()
-b${SERVER_IP} \
-m${MAX_EVENTS} \
-u${BUSY_POLL_USECS} \
- -P${PREFER_BUSY_POLL} \
+ -P${prefer_busy_poll_value} \
-g${BUSY_POLL_BUDGET} \
-i${NSIM_SV_IFIDX} \
-s${suspend_value} \
+ -t${napi_threaded_value} \
-o${out_file}&
wait_local_port_listen nssv ${SERVER_PORT} tcp
@@ -109,6 +116,15 @@ test_busypoll_with_suspend()
return $?
}
+test_busypoll_with_napi_threaded()
+{
+ # Only enable napi threaded poll. Set suspend timeout and prefer busy
+ # poll to 0.
+ test_busypoll 0 ${NAPI_THREADED_POLL} 0
+
+ return $?
+}
+
###
### Code start
###
@@ -154,6 +170,13 @@ if [ $? -ne 0 ]; then
exit 1
fi
+test_busypoll_with_napi_threaded
+if [ $? -ne 0 ]; then
+ echo "test_busypoll_with_napi_threaded failed"
+ cleanup_ns
+ exit 1
+fi
+
echo "$NSIM_SV_FD:$NSIM_SV_IFIDX" > $NSIM_DEV_SYS_UNLINK
echo $NSIM_CL_ID > $NSIM_DEV_SYS_DEL
diff --git a/tools/testing/selftests/net/busy_poller.c b/tools/testing/selftests/net/busy_poller.c
index 04c7ff577bb8..f7407f09f635 100644
--- a/tools/testing/selftests/net/busy_poller.c
+++ b/tools/testing/selftests/net/busy_poller.c
@@ -65,15 +65,16 @@ static uint32_t cfg_busy_poll_usecs;
static uint16_t cfg_busy_poll_budget;
static uint8_t cfg_prefer_busy_poll;
-/* IRQ params */
+/* NAPI params */
static uint32_t cfg_defer_hard_irqs;
static uint64_t cfg_gro_flush_timeout;
static uint64_t cfg_irq_suspend_timeout;
+static enum netdev_napi_threaded cfg_napi_threaded_poll = NETDEV_NAPI_THREADED_DISABLED;
static void usage(const char *filepath)
{
error(1, 0,
- "Usage: %s -p<port> -b<addr> -m<max_events> -u<busy_poll_usecs> -P<prefer_busy_poll> -g<busy_poll_budget> -o<outfile> -d<defer_hard_irqs> -r<gro_flush_timeout> -s<irq_suspend_timeout> -i<ifindex>",
+ "Usage: %s -p<port> -b<addr> -m<max_events> -u<busy_poll_usecs> -P<prefer_busy_poll> -g<busy_poll_budget> -o<outfile> -d<defer_hard_irqs> -r<gro_flush_timeout> -s<irq_suspend_timeout> -t<napi_threaded_poll> -i<ifindex>",
filepath);
}
@@ -86,7 +87,7 @@ static void parse_opts(int argc, char **argv)
if (argc <= 1)
usage(argv[0]);
- while ((c = getopt(argc, argv, "p:m:b:u:P:g:o:d:r:s:i:")) != -1) {
+ while ((c = getopt(argc, argv, "p:m:b:u:P:g:o:d:r:s:i:t:")) != -1) {
/* most options take integer values, except o and b, so reduce
* code duplication a bit for the common case by calling
* strtoull here and leave bounds checking and casting per
@@ -168,6 +169,12 @@ static void parse_opts(int argc, char **argv)
cfg_ifindex = (int)tmp;
break;
+ case 't':
+ if (tmp == ULLONG_MAX || tmp > 2)
+ error(1, ERANGE, "napi threaded poll value must be 0-2");
+
+ cfg_napi_threaded_poll = (enum netdev_napi_threaded)tmp;
+ break;
}
}
@@ -246,6 +253,7 @@ static void setup_queue(void)
cfg_gro_flush_timeout);
netdev_napi_set_req_set_irq_suspend_timeout(set_req,
cfg_irq_suspend_timeout);
+ netdev_napi_set_req_set_threaded(set_req, cfg_napi_threaded_poll);
if (netdev_napi_set(ys, set_req))
error(1, 0, "can't set NAPI params: %s\n", yerr.msg);
--
2.50.0.727.gbf7dc18ff4-goog
* Re: [PATCH net-next v6 0/5] Add support to do threaded napi busy poll
2025-07-18 23:20 [PATCH net-next v6 0/5] Add support to do threaded napi busy poll Samiullah Khawaja
` (4 preceding siblings ...)
2025-07-18 23:20 ` [PATCH net-next v6 5/5] selftests: Add napi threaded busy poll test in `busy_poller` Samiullah Khawaja
@ 2025-07-19 21:27 ` Martin Karsten
5 siblings, 0 replies; 8+ messages in thread
From: Martin Karsten @ 2025-07-19 21:27 UTC (permalink / raw)
To: Samiullah Khawaja, Jakub Kicinski, David S . Miller, Eric Dumazet,
Paolo Abeni, almasrymina, willemb, jdamato, joe
Cc: netdev
On 2025-07-18 19:20, Samiullah Khawaja wrote:
[snip]
>
> | Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
> |---|---|---|---|---|
> | 12 Kpkt/s + 0us delay | | | | |
> | | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
> | | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
> | | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
> | | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
> | 32 Kpkt/s + 30us delay | | | | |
> | | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
> | | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
> | | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
> | | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
> | 125 Kpkt/s + 6us delay | | | | |
> | | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
> | | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
> | | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
> | | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
> | 12 Kpkt/s + 78us delay | | | | |
> | | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
> | | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
> | | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
> | | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
> | 25 Kpkt/s + 38us delay | | | | |
> | | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
> | | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
> | | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
> | | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |
>
> ## Observations
>
> - Here without application processing all the approaches give the same
> latency within 1usecs range and NAPI threaded gives minimum latency.
> - With application processing the latency increases by 3-4usecs when
> doing inline polling.
> - Using a dedicated core to drive napi polling keeps the latency same
> even with application processing. This is observed both in userspace
> and threaded napi (in kernel).
> - Using napi threaded polling in kernel gives lower latency by
> 1-1.5usecs as compared to userspace driven polling in separate core.
> - With application processing userspace will get the packet from recv
> ring and spend some time doing application processing and then do napi
> polling. While application processing is happening a dedicated core
> doing napi polling can pull the packet of the NAPI RX queue and
> populate the AF_XDP recv ring. This means that when the application
> thread is done with application processing it has new packets ready to
> recv and process in recv ring.
> - Napi threaded busy polling in the kernel with a dedicated core gives
> the consistent P5-P99 latency.
Hi Samiullah.
I notice that you still present the experiments with application delay.
I previously asked what these experiments represent, since it's highly
unlikely that a latency-critical service would run at 100% load?
I also notice that you have not added any warning to the cover letter
that explicitly spells out the trade-off between performance and efficiency.
However, most importantly, I am trying to rerun the experiments, but
when running xsk_rr with threaded napi busy poll, networking locks up
and the machine needs a hard reset to reboot. This is after applying
your patches to commit c3886ccaadf8fdc2c91bfbdcdca36ccdc6ef8f70. I have
tested with Intel E810-XXV-2 using the ice driver and Mellanox
ConnectX-4 Lx using the mlx5 driver.
I am enclosing the various stack backtraces that I find in the logs.
Best,
Martin
**** ice ****
Jul 19 16:51:31 husky07 kernel: INFO: task systemd-network:542 blocked for more than 122 seconds.
Jul 19 16:51:31 husky07 kernel: Tainted: G I E 6.16.0-rc5-test #1
Jul 19 16:51:31 husky07 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 19 16:51:31 husky07 kernel: task:systemd-network state:D stack:0 pid:542 tgid:542 ppid:1 task_flags:0x400100 flags:0x00004006
Jul 19 16:51:31 husky07 kernel: Call Trace:
Jul 19 16:51:31 husky07 kernel: <TASK>
Jul 19 16:51:31 husky07 kernel: __schedule+0x49b/0x1530
Jul 19 16:51:31 husky07 kernel: schedule+0x27/0xf0
Jul 19 16:51:31 husky07 kernel: schedule_preempt_disabled+0x15/0x30
Jul 19 16:51:31 husky07 kernel: __mutex_lock.constprop.0+0x4c9/0x870
Jul 19 16:51:31 husky07 kernel: ? __nla_validate_parse+0x5a/0xe30
Jul 19 16:51:31 husky07 kernel: __mutex_lock_slowpath+0x13/0x20
Jul 19 16:51:31 husky07 kernel: mutex_lock+0x3b/0x50
Jul 19 16:51:31 husky07 kernel: rtnl_lock+0x15/0x20
Jul 19 16:51:31 husky07 kernel: inet_rtm_newaddr+0x101/0x540
Jul 19 16:51:31 husky07 kernel: ? __pfx_inet_rtm_newaddr+0x10/0x10
Jul 19 16:51:31 husky07 kernel: rtnetlink_rcv_msg+0x37e/0x450
Jul 19 16:51:31 husky07 kernel: ? shmem_undo_range+0x283/0x850
Jul 19 16:51:31 husky07 kernel: ? __pfx_rtnetlink_rcv_msg+0x10/0x10
Jul 19 16:51:31 husky07 kernel: netlink_rcv_skb+0x5c/0x110
Jul 19 16:51:31 husky07 kernel: rtnetlink_rcv+0x15/0x30
Jul 19 16:51:31 husky07 kernel: netlink_unicast+0x282/0x3d0
Jul 19 16:51:31 husky07 kernel: netlink_sendmsg+0x214/0x470
Jul 19 16:51:31 husky07 kernel: __sys_sendto+0x23d/0x250
Jul 19 16:51:31 husky07 kernel: __x64_sys_sendto+0x24/0x40
Jul 19 16:51:31 husky07 kernel: x64_sys_call+0x1c32/0x2660
Jul 19 16:51:31 husky07 kernel: do_syscall_64+0x80/0x990
Jul 19 16:51:31 husky07 kernel: ? ct_kernel_exit.isra.0+0x84/0xb0
Jul 19 16:51:31 husky07 kernel: ? __ct_user_enter+0x72/0x100
Jul 19 16:51:31 husky07 kernel: ? do_syscall_64+0x1be/0x990
Jul 19 16:51:31 husky07 kernel: ? kmem_cache_free+0x43a/0x470
Jul 19 16:51:31 husky07 kernel: ? sched_clock_noinstr+0x9/0x10
Jul 19 16:51:31 husky07 kernel: ? sched_clock+0x10/0x30
Jul 19 16:51:31 husky07 kernel: ? get_vtime_delta+0x14/0xc0
Jul 19 16:51:31 husky07 kernel: ? ct_kernel_exit.isra.0+0x84/0xb0
Jul 19 16:51:31 husky07 kernel: ? __ct_user_enter+0x72/0x100
Jul 19 16:51:31 husky07 kernel: ? do_syscall_64+0x1be/0x990
Jul 19 16:51:31 husky07 kernel: ? ct_kernel_exit.isra.0+0x84/0xb0
Jul 19 16:51:31 husky07 kernel: ? __ct_user_enter+0x72/0x100
Jul 19 16:51:31 husky07 kernel: ? do_syscall_64+0x1be/0x990
Jul 19 16:51:31 husky07 kernel: ? sched_clock_noinstr+0x9/0x10
Jul 19 16:51:31 husky07 kernel: ? sched_clock+0x10/0x30
Jul 19 16:51:31 husky07 kernel: ? get_vtime_delta+0x14/0xc0
Jul 19 16:51:31 husky07 kernel: ? ct_kernel_exit.isra.0+0x84/0xb0
Jul 19 16:51:31 husky07 kernel: ? __ct_user_enter+0x72/0x100
Jul 19 16:51:31 husky07 kernel: ? do_syscall_64+0x1be/0x990
Jul 19 16:51:31 husky07 kernel: ? do_syscall_64+0x1be/0x990
Jul 19 16:51:31 husky07 kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
Jul 19 16:51:31 husky07 kernel: RIP: 0033:0x7018b7b2c0a7
Jul 19 16:51:31 husky07 kernel: RSP: 002b:00007ffd470844a8 EFLAGS:
00000202 ORIG_RAX: 000000000000002c
Jul 19 16:51:31 husky07 kernel: RAX: ffffffffffffffda RBX:
000062ff08206d00 RCX: 00007018b7b2c0a7
Jul 19 16:51:31 husky07 kernel: RDX: 0000000000000044 RSI:
000062ff08235ae0 RDI: 0000000000000003
Jul 19 16:51:31 husky07 kernel: RBP: 00007ffd47084540 R08:
00007ffd470844b0 R09: 0000000000000080
Jul 19 16:51:31 husky07 kernel: R10: 0000000000000000 R11:
0000000000000202 R12: 000062ff0823a0a0
Jul 19 16:51:31 husky07 kernel: R13: 000062ff082105b8 R14:
0000000000000000 R15: 000062ff08210570
Jul 19 16:51:31 husky07 kernel: </TASK>
Jul 19 16:51:31 husky07 kernel: INFO: task systemd-network:542 is
blocked on a mutex likely owned by task xsk_rr:5912.
Jul 19 16:51:31 husky07 kernel: task:xsk_rr state:D stack:0
pid:5912 tgid:5912 ppid:5911 task_flags:0x400100 flags:0x00004006
Jul 19 16:51:31 husky07 kernel: Call Trace:
Jul 19 16:51:31 husky07 kernel: <TASK>
Jul 19 16:51:31 husky07 kernel: __schedule+0x49b/0x1530
Jul 19 16:51:31 husky07 kernel: ? _raw_spin_unlock_irqrestore+0x21/0x60
Jul 19 16:51:31 husky07 kernel: schedule+0x27/0xf0
Jul 19 16:51:31 husky07 kernel: schedule_timeout+0x85/0x110
Jul 19 16:51:31 husky07 kernel: ? __pfx_process_timeout+0x10/0x10
Jul 19 16:51:31 husky07 kernel: msleep+0x34/0x60
Jul 19 16:51:31 husky07 kernel: napi_stop_kthread+0x78/0x80
Jul 19 16:51:31 husky07 kernel: napi_set_threaded+0x33/0xc0
Jul 19 16:51:31 husky07 kernel: napi_enable_locked+0xb5/0x250
Jul 19 16:51:31 husky07 kernel: napi_enable+0x25/0x50
Jul 19 16:51:31 husky07 kernel: ice_up_complete+0x91/0x260 [ice]
Jul 19 16:51:31 husky07 kernel: ice_xdp+0x388/0x5d0 [ice]
Jul 19 16:51:31 husky07 kernel: ? __pfx_ice_xdp+0x10/0x10 [ice]
Jul 19 16:51:31 husky07 kernel: dev_xdp_install+0x157/0x320
Jul 19 16:51:31 husky07 kernel: dev_xdp_attach+0x23f/0x9d0
Jul 19 16:51:31 husky07 kernel: ? __bpf_prog_get+0x1f/0xf0
Jul 19 16:51:31 husky07 kernel: dev_change_xdp_fd+0x164/0x210
Jul 19 16:51:31 husky07 kernel: do_setlink.isra.0+0x110a/0x12c0
Jul 19 16:51:31 husky07 kernel: ? get_page_from_freelist+0x167f/0x1bd0
Jul 19 16:51:31 husky07 kernel: ? __nla_validate_parse+0x5a/0xe30
Jul 19 16:51:31 husky07 kernel: ? ns_capable+0x2a/0x60
Jul 19 16:51:31 husky07 kernel: rtnl_setlink+0x289/0x600
Jul 19 16:51:31 husky07 kernel: ? security_capable+0x7c/0x1e0
Jul 19 16:51:31 husky07 kernel: ? __pfx_rtnl_setlink+0x10/0x10
Jul 19 16:51:31 husky07 kernel: rtnetlink_rcv_msg+0x37e/0x450
Jul 19 16:51:31 husky07 kernel: ? kvfree+0x31/0x40
Jul 19 16:51:31 husky07 kernel: ? map_update_elem+0x203/0x330
Jul 19 16:51:31 husky07 kernel: ? ct_kernel_exit.isra.0+0x84/0xb0
Jul 19 16:51:31 husky07 kernel: ? __pfx_rtnetlink_rcv_msg+0x10/0x10
Jul 19 16:51:31 husky07 kernel: netlink_rcv_skb+0x5c/0x110
Jul 19 16:51:31 husky07 kernel: rtnetlink_rcv+0x15/0x30
Jul 19 16:51:31 husky07 kernel: netlink_unicast+0x282/0x3d0
Jul 19 16:51:31 husky07 kernel: netlink_sendmsg+0x214/0x470
Jul 19 16:51:31 husky07 kernel: __sys_sendto+0x23d/0x250
Jul 19 16:51:31 husky07 kernel: __x64_sys_sendto+0x24/0x40
Jul 19 16:51:31 husky07 kernel: x64_sys_call+0x1c32/0x2660
Jul 19 16:51:31 husky07 kernel: do_syscall_64+0x80/0x990
Jul 19 16:51:31 husky07 kernel: ? sched_clock_noinstr+0x9/0x10
Jul 19 16:51:31 husky07 kernel: ? sched_clock+0x10/0x30
Jul 19 16:51:31 husky07 kernel: ? get_vtime_delta+0x14/0xc0
Jul 19 16:51:31 husky07 kernel: ? ct_kernel_exit.isra.0+0x84/0xb0
Jul 19 16:51:31 husky07 kernel: ? __ct_user_enter+0x72/0x100
Jul 19 16:51:31 husky07 kernel: ? irqentry_exit_to_user_mode+0x167/0x270
Jul 19 16:51:31 husky07 kernel: ? irqentry_exit+0x43/0x50
Jul 19 16:51:31 husky07 kernel: ? exc_page_fault+0x90/0x1b0
Jul 19 16:51:31 husky07 kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
Jul 19 16:51:31 husky07 kernel: RIP: 0033:0x73175152bead
Jul 19 16:51:31 husky07 kernel: RSP: 002b:00007ffd54363fa8 EFLAGS:
00000246 ORIG_RAX: 000000000000002c
Jul 19 16:51:31 husky07 kernel: RAX: ffffffffffffffda RBX:
0000000000000004 RCX: 000073175152bead
Jul 19 16:51:31 husky07 kernel: RDX: 0000000000000034 RSI:
00007ffd54364030 RDI: 0000000000000008
Jul 19 16:51:31 husky07 kernel: RBP: 00007ffd54364000 R08:
0000000000000000 R09: 0000000000000000
Jul 19 16:51:31 husky07 kernel: R10: 0000000000000000 R11:
0000000000000246 R12: 0000000000000019
Jul 19 16:51:31 husky07 kernel: R13: 0000000000000000 R14:
000063c753f8cd78 R15: 000073175187c000
Jul 19 16:51:31 husky07 kernel: </TASK>
Jul 19 16:51:31 husky07 kernel: INFO: task kworker/u50:3:3472 blocked
for more than 122 seconds.
Jul 19 16:51:31 husky07 kernel: Tainted: G I E
6.16.0-rc5-test #1
Jul 19 16:51:31 husky07 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 19 16:51:31 husky07 kernel: task:kworker/u50:3 state:D stack:0
pid:3472 tgid:3472 ppid:2 task_flags:0x4208060 flags:0x00004000
Jul 19 16:51:31 husky07 kernel: Workqueue: events_unbound linkwatch_event
Jul 19 16:51:31 husky07 kernel: Call Trace:
Jul 19 16:51:31 husky07 kernel: <TASK>
Jul 19 16:51:31 husky07 kernel: __schedule+0x49b/0x1530
Jul 19 16:51:31 husky07 kernel: ? sched_clock_noinstr+0x9/0x10
Jul 19 16:51:31 husky07 kernel: ? sched_clock+0x10/0x30
Jul 19 16:51:31 husky07 kernel: ? sched_clock_cpu+0x10/0x1e0
Jul 19 16:51:31 husky07 kernel: schedule+0x27/0xf0
Jul 19 16:51:31 husky07 kernel: schedule_preempt_disabled+0x15/0x30
Jul 19 16:51:31 husky07 kernel: __mutex_lock.constprop.0+0x4c9/0x870
Jul 19 16:51:31 husky07 kernel: __mutex_lock_slowpath+0x13/0x20
Jul 19 16:51:31 husky07 kernel: mutex_lock+0x3b/0x50
Jul 19 16:51:31 husky07 kernel: rtnl_lock+0x15/0x20
Jul 19 16:51:31 husky07 kernel: linkwatch_event+0x12/0x40
Jul 19 16:51:31 husky07 kernel: process_one_work+0x191/0x3e0
Jul 19 16:51:31 husky07 kernel: worker_thread+0x2e3/0x420
Jul 19 16:51:31 husky07 kernel: ? __pfx_worker_thread+0x10/0x10
Jul 19 16:51:31 husky07 kernel: kthread+0x10d/0x230
Jul 19 16:51:31 husky07 kernel: ? __pfx_kthread+0x10/0x10
Jul 19 16:51:31 husky07 kernel: ret_from_fork+0x1d7/0x210
Jul 19 16:51:31 husky07 kernel: ? __pfx_kthread+0x10/0x10
Jul 19 16:51:31 husky07 kernel: ret_from_fork_asm+0x1a/0x30
Jul 19 16:51:31 husky07 kernel: </TASK>
Jul 19 16:51:31 husky07 kernel: INFO: task kworker/u50:3:3472 is blocked
on a mutex likely owned by task xsk_rr:5912.
Jul 19 16:51:31 husky07 kernel: task:xsk_rr state:D stack:0
pid:5912 tgid:5912 ppid:5911 task_flags:0x400100 flags:0x00004006
Jul 19 16:51:31 husky07 kernel: Call Trace:
Jul 19 16:51:31 husky07 kernel: <TASK>
Jul 19 16:51:31 husky07 kernel: __schedule+0x49b/0x1530
Jul 19 16:51:31 husky07 kernel: ? _raw_spin_unlock_irqrestore+0x21/0x60
Jul 19 16:51:31 husky07 kernel: schedule+0x27/0xf0
Jul 19 16:51:31 husky07 kernel: schedule_timeout+0x85/0x110
Jul 19 16:51:31 husky07 kernel: ? __pfx_process_timeout+0x10/0x10
Jul 19 16:51:31 husky07 kernel: msleep+0x34/0x60
Jul 19 16:51:31 husky07 kernel: napi_stop_kthread+0x78/0x80
Jul 19 16:51:31 husky07 kernel: napi_set_threaded+0x33/0xc0
Jul 19 16:51:31 husky07 kernel: napi_enable_locked+0xb5/0x250
Jul 19 16:51:31 husky07 kernel: napi_enable+0x25/0x50
Jul 19 16:51:31 husky07 kernel: ice_up_complete+0x91/0x260 [ice]
Jul 19 16:51:31 husky07 kernel: ice_xdp+0x388/0x5d0 [ice]
Jul 19 16:51:31 husky07 kernel: ? __pfx_ice_xdp+0x10/0x10 [ice]
Jul 19 16:51:31 husky07 kernel: dev_xdp_install+0x157/0x320
Jul 19 16:51:31 husky07 kernel: dev_xdp_attach+0x23f/0x9d0
Jul 19 16:51:31 husky07 kernel: ? __bpf_prog_get+0x1f/0xf0
Jul 19 16:51:31 husky07 kernel: dev_change_xdp_fd+0x164/0x210
Jul 19 16:51:31 husky07 kernel: do_setlink.isra.0+0x110a/0x12c0
Jul 19 16:51:31 husky07 kernel: ? get_page_from_freelist+0x167f/0x1bd0
Jul 19 16:51:31 husky07 kernel: ? __nla_validate_parse+0x5a/0xe30
Jul 19 16:51:31 husky07 kernel: ? ns_capable+0x2a/0x60
Jul 19 16:51:31 husky07 kernel: rtnl_setlink+0x289/0x600
Jul 19 16:51:31 husky07 kernel: ? security_capable+0x7c/0x1e0
Jul 19 16:51:31 husky07 kernel: ? __pfx_rtnl_setlink+0x10/0x10
Jul 19 16:51:31 husky07 kernel: rtnetlink_rcv_msg+0x37e/0x450
Jul 19 16:51:31 husky07 kernel: ? kvfree+0x31/0x40
Jul 19 16:51:31 husky07 kernel: ? map_update_elem+0x203/0x330
Jul 19 16:51:31 husky07 kernel: ? ct_kernel_exit.isra.0+0x84/0xb0
Jul 19 16:51:31 husky07 kernel: ? __pfx_rtnetlink_rcv_msg+0x10/0x10
Jul 19 16:51:31 husky07 kernel: netlink_rcv_skb+0x5c/0x110
Jul 19 16:51:31 husky07 kernel: rtnetlink_rcv+0x15/0x30
Jul 19 16:51:31 husky07 kernel: netlink_unicast+0x282/0x3d0
Jul 19 16:51:31 husky07 kernel: netlink_sendmsg+0x214/0x470
Jul 19 16:51:31 husky07 kernel: __sys_sendto+0x23d/0x250
Jul 19 16:51:31 husky07 kernel: __x64_sys_sendto+0x24/0x40
Jul 19 16:51:31 husky07 kernel: x64_sys_call+0x1c32/0x2660
Jul 19 16:51:31 husky07 kernel: do_syscall_64+0x80/0x990
Jul 19 16:51:31 husky07 kernel: ? sched_clock_noinstr+0x9/0x10
Jul 19 16:51:31 husky07 kernel: ? sched_clock+0x10/0x30
Jul 19 16:51:31 husky07 kernel: ? get_vtime_delta+0x14/0xc0
Jul 19 16:51:31 husky07 kernel: ? ct_kernel_exit.isra.0+0x84/0xb0
Jul 19 16:51:31 husky07 kernel: ? __ct_user_enter+0x72/0x100
Jul 19 16:51:31 husky07 kernel: ? irqentry_exit_to_user_mode+0x167/0x270
Jul 19 16:51:31 husky07 kernel: ? irqentry_exit+0x43/0x50
Jul 19 16:51:31 husky07 kernel: ? exc_page_fault+0x90/0x1b0
Jul 19 16:51:31 husky07 kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
Jul 19 16:51:31 husky07 kernel: RIP: 0033:0x73175152bead
Jul 19 16:51:31 husky07 kernel: RSP: 002b:00007ffd54363fa8 EFLAGS:
00000246 ORIG_RAX: 000000000000002c
Jul 19 16:51:31 husky07 kernel: RAX: ffffffffffffffda RBX:
0000000000000004 RCX: 000073175152bead
Jul 19 16:51:31 husky07 kernel: RDX: 0000000000000034 RSI:
00007ffd54364030 RDI: 0000000000000008
Jul 19 16:51:31 husky07 kernel: RBP: 00007ffd54364000 R08:
0000000000000000 R09: 0000000000000000
Jul 19 16:51:31 husky07 kernel: R10: 0000000000000000 R11:
0000000000000246 R12: 0000000000000019
Jul 19 16:51:31 husky07 kernel: R13: 0000000000000000 R14:
000063c753f8cd78 R15: 000073175187c000
Jul 19 16:51:31 husky07 kernel: </TASK>
Jul 19 16:51:31 husky07 kernel: INFO: task sudo:5918 blocked for more
than 122 seconds.
Jul 19 16:51:31 husky07 kernel: Tainted: G I E
6.16.0-rc5-test #1
Jul 19 16:51:31 husky07 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 19 16:51:31 husky07 kernel: task:sudo state:D stack:0
pid:5918 tgid:5918 ppid:5856 task_flags:0x400100 flags:0x00004006
Jul 19 16:51:31 husky07 kernel: Call Trace:
Jul 19 16:51:31 husky07 kernel: <TASK>
Jul 19 16:51:31 husky07 kernel: __schedule+0x49b/0x1530
Jul 19 16:51:31 husky07 kernel: ? xa_load+0x6d/0xa0
Jul 19 16:51:31 husky07 kernel: schedule+0x27/0xf0
Jul 19 16:51:31 husky07 kernel: schedule_preempt_disabled+0x15/0x30
Jul 19 16:51:31 husky07 kernel: __mutex_lock.constprop.0+0x4c9/0x870
Jul 19 16:51:31 husky07 kernel: ? __pfx_rtnl_dump_ifinfo+0x10/0x10
Jul 19 16:51:31 husky07 kernel: __mutex_lock_slowpath+0x13/0x20
Jul 19 16:51:31 husky07 kernel: mutex_lock+0x3b/0x50
Jul 19 16:51:31 husky07 kernel: rtnl_dumpit+0x83/0xc0
Jul 19 16:51:31 husky07 kernel: netlink_dump+0x197/0x3c0
Jul 19 16:51:31 husky07 kernel: ? obj_cgroup_charge_account+0x139/0x370
Jul 19 16:51:31 husky07 kernel: __netlink_dump_start+0x204/0x340
Jul 19 16:51:31 husky07 kernel: ? __pfx_rtnl_dump_ifinfo+0x10/0x10
Jul 19 16:51:31 husky07 kernel: rtnetlink_rcv_msg+0x2d6/0x450
Jul 19 16:51:31 husky07 kernel: ? __pfx_rtnl_dumpit+0x10/0x10
Jul 19 16:51:31 husky07 kernel: ? __pfx_rtnl_dump_ifinfo+0x10/0x10
Jul 19 16:51:31 husky07 kernel: ? __pfx_rtnetlink_rcv_msg+0x10/0x10
Jul 19 16:51:31 husky07 kernel: netlink_rcv_skb+0x5c/0x110
Jul 19 16:51:31 husky07 kernel: rtnetlink_rcv+0x15/0x30
Jul 19 16:51:31 husky07 kernel: netlink_unicast+0x282/0x3d0
Jul 19 16:51:31 husky07 kernel: netlink_sendmsg+0x214/0x470
Jul 19 16:51:31 husky07 kernel: __sys_sendto+0x23d/0x250
Jul 19 16:51:31 husky07 kernel: __x64_sys_sendto+0x24/0x40
Jul 19 16:51:31 husky07 kernel: x64_sys_call+0x1c32/0x2660
Jul 19 16:51:31 husky07 kernel: do_syscall_64+0x80/0x990
Jul 19 16:51:31 husky07 kernel: ? walk_system_ram_range+0xa8/0x110
Jul 19 16:51:31 husky07 kernel: ? __pfx_pagerange_is_ram_callback+0x10/0x10
Jul 19 16:51:31 husky07 kernel: ? ___pte_offset_map+0x1c/0x1b0
Jul 19 16:51:31 husky07 kernel: ? __pte_offset_map_lock+0xa2/0x120
Jul 19 16:51:31 husky07 kernel: ? __get_locked_pte+0x3f/0x90
Jul 19 16:51:31 husky07 kernel: ? insert_pfn+0xbb/0x220
Jul 19 16:51:31 husky07 kernel: ? vmf_insert_pfn_prot+0x99/0x100
Jul 19 16:51:31 husky07 kernel: ? vmf_insert_pfn+0x12/0x20
Jul 19 16:51:31 husky07 kernel: ? vvar_fault+0xa1/0x110
Jul 19 16:51:31 husky07 kernel: ? special_mapping_fault+0x21/0xd0
Jul 19 16:51:31 husky07 kernel: ? __do_fault+0x3d/0x190
Jul 19 16:51:31 husky07 kernel: ? do_fault+0x2d5/0x570
Jul 19 16:51:31 husky07 kernel: ? __handle_mm_fault+0x838/0x1070
Jul 19 16:51:31 husky07 kernel: ? security_task_setrlimit+0xa3/0x1b0
Jul 19 16:51:31 husky07 kernel: ? do_prlimit+0x144/0x230
Jul 19 16:51:31 husky07 kernel: ? count_memcg_events+0x180/0x200
Jul 19 16:51:31 husky07 kernel: ? sched_clock_noinstr+0x9/0x10
Jul 19 16:51:31 husky07 kernel: ? sched_clock+0x10/0x30
Jul 19 16:51:31 husky07 kernel: ? get_vtime_delta+0x14/0xc0
Jul 19 16:51:31 husky07 kernel: ? ct_kernel_exit.isra.0+0x84/0xb0
Jul 19 16:51:31 husky07 kernel: ? __ct_user_enter+0x72/0x100
Jul 19 16:51:31 husky07 kernel: ? irqentry_exit_to_user_mode+0x167/0x270
Jul 19 16:51:31 husky07 kernel: ? irqentry_exit+0x43/0x50
Jul 19 16:51:31 husky07 kernel: ? exc_page_fault+0x90/0x1b0
Jul 19 16:51:31 husky07 kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
Jul 19 16:51:31 husky07 kernel: RIP: 0033:0x7cb59032c0a7
Jul 19 16:51:31 husky07 kernel: RSP: 002b:00007ffcf07c2808 EFLAGS:
00000202 ORIG_RAX: 000000000000002c
Jul 19 16:51:31 husky07 kernel: RAX: ffffffffffffffda RBX:
00007ffcf07c2850 RCX: 00007cb59032c0a7
Jul 19 16:51:31 husky07 kernel: RDX: 0000000000000014 RSI:
00007ffcf07c2890 RDI: 0000000000000003
Jul 19 16:51:31 husky07 kernel: RBP: 00007ffcf07c28e0 R08:
00007ffcf07c2850 R09: 000000000000000c
Jul 19 16:51:31 husky07 kernel: R10: 0000000000000000 R11:
0000000000000202 R12: 00007ffcf07c2980
Jul 19 16:51:31 husky07 kernel: R13: 00007ffcf07c2890 R14:
00007ffcf07c29b0 R15: 00007ffcf07c2e58
Jul 19 16:51:31 husky07 kernel: </TASK>
Jul 19 16:51:31 husky07 kernel: INFO: task sudo:5918 is blocked on a
mutex likely owned by task xsk_rr:5912.
Jul 19 16:51:31 husky07 kernel: task:xsk_rr state:D stack:0
pid:5912 tgid:5912 ppid:5911 task_flags:0x400100 flags:0x00004006
Jul 19 16:51:31 husky07 kernel: Call Trace:
Jul 19 16:51:31 husky07 kernel: <TASK>
Jul 19 16:51:31 husky07 kernel: __schedule+0x49b/0x1530
Jul 19 16:51:31 husky07 kernel: ? _raw_spin_unlock_irqrestore+0x21/0x60
Jul 19 16:51:31 husky07 kernel: schedule+0x27/0xf0
Jul 19 16:51:31 husky07 kernel: schedule_timeout+0x85/0x110
Jul 19 16:51:31 husky07 kernel: ? __pfx_process_timeout+0x10/0x10
Jul 19 16:51:31 husky07 kernel: msleep+0x34/0x60
Jul 19 16:51:31 husky07 kernel: napi_stop_kthread+0x78/0x80
Jul 19 16:51:31 husky07 kernel: napi_set_threaded+0x33/0xc0
Jul 19 16:51:31 husky07 kernel: napi_enable_locked+0xb5/0x250
Jul 19 16:51:31 husky07 kernel: napi_enable+0x25/0x50
Jul 19 16:51:31 husky07 kernel: ice_up_complete+0x91/0x260 [ice]
Jul 19 16:51:31 husky07 kernel: ice_xdp+0x388/0x5d0 [ice]
Jul 19 16:51:31 husky07 kernel: ? __pfx_ice_xdp+0x10/0x10 [ice]
Jul 19 16:51:31 husky07 kernel: dev_xdp_install+0x157/0x320
Jul 19 16:51:31 husky07 kernel: dev_xdp_attach+0x23f/0x9d0
Jul 19 16:51:31 husky07 kernel: ? __bpf_prog_get+0x1f/0xf0
Jul 19 16:51:31 husky07 kernel: dev_change_xdp_fd+0x164/0x210
Jul 19 16:51:31 husky07 kernel: do_setlink.isra.0+0x110a/0x12c0
Jul 19 16:51:31 husky07 kernel: ? get_page_from_freelist+0x167f/0x1bd0
Jul 19 16:51:31 husky07 kernel: ? __nla_validate_parse+0x5a/0xe30
Jul 19 16:51:31 husky07 kernel: ? ns_capable+0x2a/0x60
Jul 19 16:51:31 husky07 kernel: rtnl_setlink+0x289/0x600
Jul 19 16:51:31 husky07 kernel: ? security_capable+0x7c/0x1e0
Jul 19 16:51:31 husky07 kernel: ? __pfx_rtnl_setlink+0x10/0x10
Jul 19 16:51:31 husky07 kernel: rtnetlink_rcv_msg+0x37e/0x450
Jul 19 16:51:31 husky07 kernel: ? kvfree+0x31/0x40
Jul 19 16:51:31 husky07 kernel: ? map_update_elem+0x203/0x330
Jul 19 16:51:31 husky07 kernel: ? ct_kernel_exit.isra.0+0x84/0xb0
Jul 19 16:51:31 husky07 kernel: ? __pfx_rtnetlink_rcv_msg+0x10/0x10
Jul 19 16:51:31 husky07 kernel: netlink_rcv_skb+0x5c/0x110
Jul 19 16:51:31 husky07 kernel: rtnetlink_rcv+0x15/0x30
Jul 19 16:51:31 husky07 kernel: netlink_unicast+0x282/0x3d0
Jul 19 16:51:31 husky07 kernel: netlink_sendmsg+0x214/0x470
Jul 19 16:51:31 husky07 kernel: __sys_sendto+0x23d/0x250
Jul 19 16:51:31 husky07 kernel: __x64_sys_sendto+0x24/0x40
Jul 19 16:51:31 husky07 kernel: x64_sys_call+0x1c32/0x2660
Jul 19 16:51:31 husky07 kernel: do_syscall_64+0x80/0x990
Jul 19 16:51:31 husky07 kernel: ? sched_clock_noinstr+0x9/0x10
Jul 19 16:51:31 husky07 kernel: ? sched_clock+0x10/0x30
Jul 19 16:51:31 husky07 kernel: ? get_vtime_delta+0x14/0xc0
Jul 19 16:51:31 husky07 kernel: ? ct_kernel_exit.isra.0+0x84/0xb0
Jul 19 16:51:31 husky07 kernel: ? __ct_user_enter+0x72/0x100
Jul 19 16:51:31 husky07 kernel: ? irqentry_exit_to_user_mode+0x167/0x270
Jul 19 16:51:31 husky07 kernel: ? irqentry_exit+0x43/0x50
Jul 19 16:51:31 husky07 kernel: ? exc_page_fault+0x90/0x1b0
Jul 19 16:51:31 husky07 kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
Jul 19 16:51:31 husky07 kernel: RIP: 0033:0x73175152bead
Jul 19 16:51:31 husky07 kernel: RSP: 002b:00007ffd54363fa8 EFLAGS:
00000246 ORIG_RAX: 000000000000002c
Jul 19 16:51:31 husky07 kernel: RAX: ffffffffffffffda RBX:
0000000000000004 RCX: 000073175152bead
Jul 19 16:51:31 husky07 kernel: RDX: 0000000000000034 RSI:
00007ffd54364030 RDI: 0000000000000008
Jul 19 16:51:31 husky07 kernel: RBP: 00007ffd54364000 R08:
0000000000000000 R09: 0000000000000000
Jul 19 16:51:31 husky07 kernel: R10: 0000000000000000 R11:
0000000000000246 R12: 0000000000000019
Jul 19 16:51:31 husky07 kernel: R13: 0000000000000000 R14:
000063c753f8cd78 R15: 000073175187c000
Jul 19 16:51:31 husky07 kernel: </TASK>
**** mlx5 ****
Jul 19 16:52:28 tilly02 kernel: INFO: task kworker/u129:1:255 blocked
for more than 122 seconds.
Jul 19 16:52:28 tilly02 kernel: Tainted: G I E
6.16.0-rc5-test #1
Jul 19 16:52:28 tilly02 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 19 16:52:28 tilly02 kernel: task:kworker/u129:1 state:D stack:0
pid:255 tgid:255 ppid:2 task_flags:0x4208060 flags:0x00004000
Jul 19 16:52:28 tilly02 kernel: Workqueue: events_unbound linkwatch_event
Jul 19 16:52:28 tilly02 kernel: Call Trace:
Jul 19 16:52:28 tilly02 kernel: <TASK>
Jul 19 16:52:28 tilly02 kernel: __schedule+0x493/0x1630
Jul 19 16:52:28 tilly02 kernel: ? sched_clock+0x10/0x30
Jul 19 16:52:28 tilly02 kernel: ? sched_clock_cpu+0x10/0x1e0
Jul 19 16:52:28 tilly02 kernel: schedule+0x27/0xf0
Jul 19 16:52:28 tilly02 kernel: schedule_preempt_disabled+0x15/0x30
Jul 19 16:52:28 tilly02 kernel: __mutex_lock.constprop.0+0x4c9/0x870
Jul 19 16:52:28 tilly02 kernel: __mutex_lock_slowpath+0x13/0x20
Jul 19 16:52:28 tilly02 kernel: mutex_lock+0x3b/0x50
Jul 19 16:52:28 tilly02 kernel: rtnl_lock+0x15/0x20
Jul 19 16:52:28 tilly02 kernel: linkwatch_event+0x12/0x40
Jul 19 16:52:28 tilly02 kernel: process_one_work+0x18e/0x3e0
Jul 19 16:52:28 tilly02 kernel: worker_thread+0x2e3/0x420
Jul 19 16:52:28 tilly02 kernel: ? __pfx_worker_thread+0x10/0x10
Jul 19 16:52:28 tilly02 kernel: kthread+0x10a/0x230
Jul 19 16:52:28 tilly02 kernel: ? __pfx_kthread+0x10/0x10
Jul 19 16:52:28 tilly02 kernel: ret_from_fork+0x1d4/0x210
Jul 19 16:52:28 tilly02 kernel: ? __pfx_kthread+0x10/0x10
Jul 19 16:52:28 tilly02 kernel: ret_from_fork_asm+0x1a/0x30
Jul 19 16:52:28 tilly02 kernel: </TASK>
Jul 19 16:52:28 tilly02 kernel: INFO: task kworker/u129:1:255 is blocked
on a mutex likely owned by task xsk_rr:1612.
Jul 19 16:52:28 tilly02 kernel: task:xsk_rr state:D stack:0
pid:1612 tgid:1612 ppid:1611 task_flags:0x400100 flags:0x00004002
Jul 19 16:52:28 tilly02 kernel: Call Trace:
Jul 19 16:52:28 tilly02 kernel: <TASK>
Jul 19 16:52:28 tilly02 kernel: __schedule+0x493/0x1630
Jul 19 16:52:28 tilly02 kernel: schedule+0x27/0xf0
Jul 19 16:52:28 tilly02 kernel: schedule_timeout+0x85/0x110
Jul 19 16:52:28 tilly02 kernel: ? __pfx_process_timeout+0x10/0x10
Jul 19 16:52:28 tilly02 kernel: msleep+0x34/0x60
Jul 19 16:52:28 tilly02 kernel: napi_stop_kthread+0x78/0x80
Jul 19 16:52:28 tilly02 kernel: napi_set_threaded+0x33/0xc0
Jul 19 16:52:28 tilly02 kernel: napi_enable_locked+0xb5/0x250
Jul 19 16:52:28 tilly02 kernel:
mlx5e_activate_priv_channels+0x1bc/0x490 [mlx5_core]
Jul 19 16:52:28 tilly02 kernel: mlx5e_switch_priv_channels+0xeb/0x150
[mlx5_core]
Jul 19 16:52:28 tilly02 kernel: mlx5e_safe_switch_params+0xef/0x140
[mlx5_core]
Jul 19 16:52:28 tilly02 kernel: mlx5e_xdp_set+0xd0/0x220 [mlx5_core]
Jul 19 16:52:28 tilly02 kernel: ? __pfx_mlx5e_xdp+0x10/0x10 [mlx5_core]
Jul 19 16:52:28 tilly02 kernel: mlx5e_xdp+0x47/0x60 [mlx5_core]
Jul 19 16:52:28 tilly02 kernel: dev_xdp_install+0x154/0x320
Jul 19 16:52:28 tilly02 kernel: dev_xdp_attach+0x23f/0x9d0
Jul 19 16:52:28 tilly02 kernel: ? __bpf_prog_get+0x1f/0xf0
Jul 19 16:52:28 tilly02 kernel: dev_change_xdp_fd+0x164/0x210
Jul 19 16:52:28 tilly02 kernel: do_setlink.isra.0+0x110a/0x12c0
Jul 19 16:52:28 tilly02 kernel: ? __call_rcu_common+0x233/0x730
Jul 19 16:52:28 tilly02 kernel: ? __rmqueue_pcplist+0x86e/0xed0
Jul 19 16:52:28 tilly02 kernel: ? __nla_validate_parse+0x5a/0xe30
Jul 19 16:52:28 tilly02 kernel: ? ns_capable+0x2a/0x60
Jul 19 16:52:28 tilly02 kernel: rtnl_setlink+0x289/0x600
Jul 19 16:52:28 tilly02 kernel: ? __memcg_slab_post_alloc_hook+0x1b0/0x3e0
Jul 19 16:52:28 tilly02 kernel: ? security_capable+0x77/0x1c0
Jul 19 16:52:28 tilly02 kernel: ? __pfx_rtnl_setlink+0x10/0x10
Jul 19 16:52:28 tilly02 kernel: rtnetlink_rcv_msg+0x37b/0x450
Jul 19 16:52:28 tilly02 kernel: ? bpf_map_kzalloc+0xd1/0x110
Jul 19 16:52:28 tilly02 kernel: ? __pfx_rtnetlink_rcv_msg+0x10/0x10
Jul 19 16:52:28 tilly02 kernel: netlink_rcv_skb+0x59/0x110
Jul 19 16:52:28 tilly02 kernel: rtnetlink_rcv+0x15/0x30
Jul 19 16:52:28 tilly02 kernel: netlink_unicast+0x27f/0x3d0
Jul 19 16:52:28 tilly02 kernel: netlink_sendmsg+0x214/0x470
Jul 19 16:52:28 tilly02 kernel: __sys_sendto+0x23a/0x250
Jul 19 16:52:28 tilly02 kernel: __x64_sys_sendto+0x24/0x40
Jul 19 16:52:28 tilly02 kernel: x64_sys_call+0x1c32/0x2660
Jul 19 16:52:28 tilly02 kernel: do_syscall_64+0x80/0x9a0
Jul 19 16:52:28 tilly02 kernel: ? vmf_insert_pfn_prot+0x99/0x100
Jul 19 16:52:28 tilly02 kernel: ? vmf_insert_pfn+0x12/0x20
Jul 19 16:52:28 tilly02 kernel: ? vvar_fault+0xa1/0x110
Jul 19 16:52:28 tilly02 kernel: ? special_mapping_fault+0x1e/0xd0
Jul 19 16:52:28 tilly02 kernel: ? __do_fault+0x3a/0x190
Jul 19 16:52:28 tilly02 kernel: ? do_fault+0x2d5/0x570
Jul 19 16:52:28 tilly02 kernel: ? __handle_mm_fault+0x838/0x1070
Jul 19 16:52:28 tilly02 kernel: ? count_memcg_events+0x180/0x200
Jul 19 16:52:28 tilly02 kernel: ? sched_clock_noinstr+0x9/0x10
Jul 19 16:52:28 tilly02 kernel: ? sched_clock+0x10/0x30
Jul 19 16:52:28 tilly02 kernel: ? get_vtime_delta+0x14/0xc0
Jul 19 16:52:28 tilly02 kernel: ? ct_kernel_exit.isra.0+0x84/0xb0
Jul 19 16:52:28 tilly02 kernel: ? __ct_user_enter+0x72/0x100
Jul 19 16:52:28 tilly02 kernel: ? irqentry_exit_to_user_mode+0x167/0x270
Jul 19 16:52:28 tilly02 kernel: ? irqentry_exit+0x43/0x50
Jul 19 16:52:28 tilly02 kernel: ? exc_page_fault+0x90/0x1b0
Jul 19 16:52:28 tilly02 kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
Jul 19 16:52:28 tilly02 kernel: RIP: 0033:0x7e1395f2bead
Jul 19 16:52:28 tilly02 kernel: RSP: 002b:00007ffd2ac39398 EFLAGS:
00000246 ORIG_RAX: 000000000000002c
Jul 19 16:52:28 tilly02 kernel: RAX: ffffffffffffffda RBX:
0000000000000004 RCX: 00007e1395f2bead
Jul 19 16:52:28 tilly02 kernel: RDX: 0000000000000034 RSI:
00007ffd2ac39420 RDI: 0000000000000008
Jul 19 16:52:28 tilly02 kernel: RBP: 00007ffd2ac393f0 R08:
0000000000000000 R09: 0000000000000000
Jul 19 16:52:28 tilly02 kernel: R10: 0000000000000000 R11:
0000000000000246 R12: 0000000000000019
Jul 19 16:52:28 tilly02 kernel: R13: 0000000000000000 R14:
00006305503a1d78 R15: 00007e1396267000
Jul 19 16:52:28 tilly02 kernel: </TASK>
Jul 19 16:52:28 tilly02 kernel: INFO: task sudo:1619 blocked for more
than 122 seconds.
Jul 19 16:52:28 tilly02 kernel: Tainted: G I E
6.16.0-rc5-test #1
Jul 19 16:52:28 tilly02 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 19 16:52:28 tilly02 kernel: task:sudo state:D stack:0
pid:1619 tgid:1619 ppid:1544 task_flags:0x400100 flags:0x00004002
Jul 19 16:52:28 tilly02 kernel: Call Trace:
Jul 19 16:52:28 tilly02 kernel: <TASK>
Jul 19 16:52:28 tilly02 kernel: __schedule+0x493/0x1630
Jul 19 16:52:28 tilly02 kernel: ? obj_cgroup_charge_account+0x139/0x370
Jul 19 16:52:28 tilly02 kernel: schedule+0x27/0xf0
Jul 19 16:52:28 tilly02 kernel: schedule_preempt_disabled+0x15/0x30
Jul 19 16:52:28 tilly02 kernel: __mutex_lock.constprop.0+0x4c9/0x870
Jul 19 16:52:28 tilly02 kernel: ? __pfx_rtnl_dump_ifinfo+0x10/0x10
Jul 19 16:52:28 tilly02 kernel: __mutex_lock_slowpath+0x13/0x20
Jul 19 16:52:28 tilly02 kernel: mutex_lock+0x3b/0x50
Jul 19 16:52:28 tilly02 kernel: rtnl_dumpit+0x83/0xc0
Jul 19 16:52:28 tilly02 kernel: netlink_dump+0x194/0x3c0
Jul 19 16:52:28 tilly02 kernel: __netlink_dump_start+0x204/0x340
Jul 19 16:52:28 tilly02 kernel: ? __pfx_rtnl_dump_ifinfo+0x10/0x10
Jul 19 16:52:28 tilly02 kernel: rtnetlink_rcv_msg+0x2d6/0x450
Jul 19 16:52:28 tilly02 kernel: ? __pfx_rtnl_dumpit+0x10/0x10
Jul 19 16:52:28 tilly02 kernel: ? __pfx_rtnl_dump_ifinfo+0x10/0x10
Jul 19 16:52:28 tilly02 kernel: ? __pfx_rtnetlink_rcv_msg+0x10/0x10
Jul 19 16:52:28 tilly02 kernel: netlink_rcv_skb+0x59/0x110
Jul 19 16:52:28 tilly02 kernel: rtnetlink_rcv+0x15/0x30
Jul 19 16:52:28 tilly02 kernel: netlink_unicast+0x27f/0x3d0
Jul 19 16:52:28 tilly02 kernel: netlink_sendmsg+0x214/0x470
Jul 19 16:52:28 tilly02 kernel: __sys_sendto+0x23a/0x250
Jul 19 16:52:28 tilly02 kernel: __x64_sys_sendto+0x24/0x40
Jul 19 16:52:28 tilly02 kernel: x64_sys_call+0x1c32/0x2660
Jul 19 16:52:28 tilly02 kernel: do_syscall_64+0x80/0x9a0
Jul 19 16:52:28 tilly02 kernel: ? __pte_offset_map_lock+0xa2/0x120
Jul 19 16:52:28 tilly02 kernel: ? __get_locked_pte+0x3f/0x90
Jul 19 16:52:28 tilly02 kernel: ? insert_pfn+0xbb/0x220
Jul 19 16:52:28 tilly02 kernel: ? vmf_insert_pfn_prot+0x99/0x100
Jul 19 16:52:28 tilly02 kernel: ? vmf_insert_pfn+0x12/0x20
Jul 19 16:52:28 tilly02 kernel: ? vvar_fault+0xa1/0x110
Jul 19 16:52:28 tilly02 kernel: ? special_mapping_fault+0x1e/0xd0
Jul 19 16:52:28 tilly02 kernel: ? __do_fault+0x3a/0x190
Jul 19 16:52:28 tilly02 kernel: ? do_fault+0x2d5/0x570
Jul 19 16:52:28 tilly02 kernel: ? __handle_mm_fault+0x838/0x1070
Jul 19 16:52:28 tilly02 kernel: ? __do_sys_prlimit64+0x244/0x2e0
Jul 19 16:52:28 tilly02 kernel: ? count_memcg_events+0x180/0x200
Jul 19 16:52:28 tilly02 kernel: ? handle_mm_fault+0xbc/0x300
Jul 19 16:52:28 tilly02 kernel: ? __ct_user_enter+0x2d/0x100
Jul 19 16:52:28 tilly02 kernel: ? irqentry_exit_to_user_mode+0x167/0x270
Jul 19 16:52:28 tilly02 kernel: ? irqentry_exit+0x43/0x50
Jul 19 16:52:28 tilly02 kernel: ? exc_page_fault+0x90/0x1b0
Jul 19 16:52:28 tilly02 kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
Jul 19 16:52:28 tilly02 kernel: RIP: 0033:0x75041712c0a7
Jul 19 16:52:28 tilly02 kernel: RSP: 002b:00007ffc92399938 EFLAGS:
00000202 ORIG_RAX: 000000000000002c
Jul 19 16:52:28 tilly02 kernel: RAX: ffffffffffffffda RBX:
00007ffc92399980 RCX: 000075041712c0a7
Jul 19 16:52:28 tilly02 kernel: RDX: 0000000000000014 RSI:
00007ffc923999c0 RDI: 0000000000000003
Jul 19 16:52:28 tilly02 kernel: RBP: 00007ffc92399a10 R08:
00007ffc92399980 R09: 000000000000000c
Jul 19 16:52:28 tilly02 kernel: R10: 0000000000000000 R11:
0000000000000202 R12: 00007ffc92399ab0
Jul 19 16:52:28 tilly02 kernel: R13: 00007ffc923999c0 R14:
00007ffc92399ae0 R15: 00007ffc92399f88
Jul 19 16:52:28 tilly02 kernel: </TASK>
Jul 19 16:52:28 tilly02 kernel: INFO: task sudo:1619 is blocked on a
mutex likely owned by task xsk_rr:1612.
Jul 19 16:52:28 tilly02 kernel: task:xsk_rr state:D stack:0
pid:1612 tgid:1612 ppid:1611 task_flags:0x400100 flags:0x00004002
Jul 19 16:52:28 tilly02 kernel: Call Trace:
Jul 19 16:52:28 tilly02 kernel: <TASK>
Jul 19 16:52:28 tilly02 kernel: __schedule+0x493/0x1630
Jul 19 16:52:28 tilly02 kernel: schedule+0x27/0xf0
Jul 19 16:52:28 tilly02 kernel: schedule_timeout+0x85/0x110
Jul 19 16:52:28 tilly02 kernel: ? __pfx_process_timeout+0x10/0x10
Jul 19 16:52:28 tilly02 kernel: msleep+0x34/0x60
Jul 19 16:52:28 tilly02 kernel: napi_stop_kthread+0x78/0x80
Jul 19 16:52:28 tilly02 kernel: napi_set_threaded+0x33/0xc0
Jul 19 16:52:28 tilly02 kernel: napi_enable_locked+0xb5/0x250
Jul 19 16:52:28 tilly02 kernel:
mlx5e_activate_priv_channels+0x1bc/0x490 [mlx5_core]
Jul 19 16:52:28 tilly02 kernel: mlx5e_switch_priv_channels+0xeb/0x150
[mlx5_core]
Jul 19 16:52:28 tilly02 kernel: mlx5e_safe_switch_params+0xef/0x140
[mlx5_core]
Jul 19 16:52:28 tilly02 kernel: mlx5e_xdp_set+0xd0/0x220 [mlx5_core]
Jul 19 16:52:28 tilly02 kernel: ? __pfx_mlx5e_xdp+0x10/0x10 [mlx5_core]
Jul 19 16:52:28 tilly02 kernel: mlx5e_xdp+0x47/0x60 [mlx5_core]
Jul 19 16:52:28 tilly02 kernel: dev_xdp_install+0x154/0x320
Jul 19 16:52:28 tilly02 kernel: dev_xdp_attach+0x23f/0x9d0
Jul 19 16:52:28 tilly02 kernel: ? __bpf_prog_get+0x1f/0xf0
Jul 19 16:52:28 tilly02 kernel: dev_change_xdp_fd+0x164/0x210
Jul 19 16:52:28 tilly02 kernel: do_setlink.isra.0+0x110a/0x12c0
Jul 19 16:52:28 tilly02 kernel: ? __call_rcu_common+0x233/0x730
Jul 19 16:52:28 tilly02 kernel: ? __rmqueue_pcplist+0x86e/0xed0
Jul 19 16:52:28 tilly02 kernel: ? __nla_validate_parse+0x5a/0xe30
Jul 19 16:52:28 tilly02 kernel: ? ns_capable+0x2a/0x60
Jul 19 16:52:28 tilly02 kernel: rtnl_setlink+0x289/0x600
Jul 19 16:52:28 tilly02 kernel: ? __memcg_slab_post_alloc_hook+0x1b0/0x3e0
Jul 19 16:52:28 tilly02 kernel: ? security_capable+0x77/0x1c0
Jul 19 16:52:28 tilly02 kernel: ? __pfx_rtnl_setlink+0x10/0x10
Jul 19 16:52:28 tilly02 kernel: rtnetlink_rcv_msg+0x37b/0x450
Jul 19 16:52:28 tilly02 kernel: ? bpf_map_kzalloc+0xd1/0x110
Jul 19 16:52:28 tilly02 kernel: ? __pfx_rtnetlink_rcv_msg+0x10/0x10
Jul 19 16:52:28 tilly02 kernel: netlink_rcv_skb+0x59/0x110
Jul 19 16:52:28 tilly02 kernel: rtnetlink_rcv+0x15/0x30
Jul 19 16:52:28 tilly02 kernel: netlink_unicast+0x27f/0x3d0
Jul 19 16:52:28 tilly02 kernel: netlink_sendmsg+0x214/0x470
Jul 19 16:52:28 tilly02 kernel: __sys_sendto+0x23a/0x250
Jul 19 16:52:28 tilly02 kernel: __x64_sys_sendto+0x24/0x40
Jul 19 16:52:28 tilly02 kernel: x64_sys_call+0x1c32/0x2660
Jul 19 16:52:28 tilly02 kernel: do_syscall_64+0x80/0x9a0
Jul 19 16:52:28 tilly02 kernel: ? vmf_insert_pfn_prot+0x99/0x100
Jul 19 16:52:28 tilly02 kernel: ? vmf_insert_pfn+0x12/0x20
Jul 19 16:52:28 tilly02 kernel: ? vvar_fault+0xa1/0x110
Jul 19 16:52:28 tilly02 kernel: ? special_mapping_fault+0x1e/0xd0
Jul 19 16:52:28 tilly02 kernel: ? __do_fault+0x3a/0x190
Jul 19 16:52:28 tilly02 kernel: ? do_fault+0x2d5/0x570
Jul 19 16:52:28 tilly02 kernel: ? __handle_mm_fault+0x838/0x1070
Jul 19 16:52:28 tilly02 kernel: ? count_memcg_events+0x180/0x200
Jul 19 16:52:28 tilly02 kernel: ? sched_clock_noinstr+0x9/0x10
Jul 19 16:52:28 tilly02 kernel: ? sched_clock+0x10/0x30
Jul 19 16:52:28 tilly02 kernel: ? get_vtime_delta+0x14/0xc0
Jul 19 16:52:28 tilly02 kernel: ? ct_kernel_exit.isra.0+0x84/0xb0
Jul 19 16:52:28 tilly02 kernel: ? __ct_user_enter+0x72/0x100
Jul 19 16:52:28 tilly02 kernel: ? irqentry_exit_to_user_mode+0x167/0x270
Jul 19 16:52:28 tilly02 kernel: ? irqentry_exit+0x43/0x50
Jul 19 16:52:28 tilly02 kernel: ? exc_page_fault+0x90/0x1b0
Jul 19 16:52:28 tilly02 kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
Jul 19 16:52:28 tilly02 kernel: RIP: 0033:0x7e1395f2bead
Jul 19 16:52:28 tilly02 kernel: RSP: 002b:00007ffd2ac39398 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
Jul 19 16:52:28 tilly02 kernel: RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007e1395f2bead
Jul 19 16:52:28 tilly02 kernel: RDX: 0000000000000034 RSI: 00007ffd2ac39420 RDI: 0000000000000008
Jul 19 16:52:28 tilly02 kernel: RBP: 00007ffd2ac393f0 R08: 0000000000000000 R09: 0000000000000000
Jul 19 16:52:28 tilly02 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000019
Jul 19 16:52:28 tilly02 kernel: R13: 0000000000000000 R14: 00006305503a1d78 R15: 00007e1396267000
Jul 19 16:52:28 tilly02 kernel: </TASK>
> The following histogram measures the time spent in recvfrom when using
> an inline thread with SO_BUSY_POLL. It was generated with the bpftrace
> command below. In this experiment there are 32K packets per second and
> the application processing delay is 30usecs. The goal is to determine
> whether pulling packets from the descriptor queue takes long enough to
> affect the overall latency when done inline.
>
> ```
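> bpftrace -e '
> // Record the entry timestamp of each xsk_recvmsg call, keyed by thread.
> kprobe:xsk_recvmsg {
>         @start[tid] = nsecs;
> }
> // On return, add the elapsed nanoseconds to the histogram.
> kretprobe:xsk_recvmsg {
>         if (@start[tid]) {
>                 $sample = (nsecs - @start[tid]);
>                 @xsk_recvfrom_hist = hist($sample);
>                 delete(@start[tid]);
>         }
> }
> END { clear(@start); }'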
> ```
>
> Here, with inline busy polling, around 35 percent of calls take
> 1-2usecs and around 50 percent take 0.5-2usecs.
>
> @xsk_recvfrom_hist:
> [128, 256)  24073 |@@@@@@@@@@@@@@@@@@@@@@                              |
> [256, 512)  55633 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [512, 1K)   20974 |@@@@@@@@@@@@@@@@@@@                                 |
> [1K, 2K)    34234 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     |
> [2K, 4K)     3266 |@@@                                                 |
> [4K, 8K)       19 |                                                    |
>
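For reference, the share of calls falling in a given latency range can be read off the bucket counts listed above. The following is a minimal Python sketch; the helper name `share_between` is hypothetical, and the counts are simply copied from the histogram output shown here (bucket bounds in nanoseconds, with 1K taken as 1024).

```python
# Bucket counts copied from the @xsk_recvfrom_hist output above.
# Each entry is (lower bound in ns, upper bound in ns, number of calls).
BUCKETS = [
    (128, 256, 24073),
    (256, 512, 55633),
    (512, 1024, 20974),
    (1024, 2048, 34234),
    (2048, 4096, 3266),
    (4096, 8192, 19),
]


def share_between(buckets, low_ns, high_ns):
    """Return the fraction of calls whose bucket lies within [low_ns, high_ns)."""
    total = sum(count for _, _, count in buckets)
    selected = sum(
        count for lower, upper, count in buckets
        if lower >= low_ns and upper <= high_ns
    )
    return selected / total


if __name__ == "__main__":
    print(f"1-2 usecs:   {share_between(BUCKETS, 1024, 2048):.1%}")
    print(f"0.5-2 usecs: {share_between(BUCKETS, 512, 2048):.1%}")
```

Because bpftrace histograms use power-of-two buckets, only ranges aligned to bucket boundaries can be computed exactly this way.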
> v6:
> - Moved threaded in struct net_device up to fill the cacheline hole.
> - Changed dev_set_threaded to dev_set_threaded_hint and removed the
> second argument that was always set to true by all the drivers.
> Exported only dev_set_threaded_hint and made dev_set_threaded a
> core-only function. This change is done in a separate commit.
> - Updated documentation comment for threaded in struct net_device.
> - gro_flush_helper renamed to gro_flush_normal and moved to gro.h. Also
> used it in kernel/bpf/cpumap.c
> - Updated documentation to explicitly state that the NAPI threaded busy
> polling would keep the CPU core busy at 100% usage.
> - Updated documentation and commit messages.
>
> v5:
> - Updated experiment data with 'SO_PREFER_BUSY_POLL' usage as
> suggested.
> - Sent 'Add support to set napi threaded for individual napi'
> separately. This series depends on that patch.
> https://lore.kernel.org/netdev/20250423201413.1564527-1-skhawaja@google.com/
> - Added a separate patch to use enum for napi threaded state. Updated
> the nl_netdev python test.
> - Using "write all" semantics when napi settings are set at device level.
> This aligns with already existing behaviour for other settings.
> - Fix comments to make them kdoc compatible.
> - Updated Documentation/networking/net_cachelines/net_device.rst
> - Updated the missed gro_flush modification in napi_complete_done
>
> v4:
> - Using AF_XDP based benchmark for experiments.
> - Re-enable dev level napi threaded busypoll after soft reset.
>
> v3:
> - Fixed calls to dev_set_threaded in drivers
>
> v2:
> - Add documentation in napi.rst.
> - Provide experiment data and usecase details.
> - Update busy_poller selftest to include napi threaded poll testcase.
> - Define threaded mode enum in netlink interface.
> - Included NAPI threaded state in napi config to save/restore.
>
> Samiullah Khawaja (5):
> net: Create separate gro_flush_normal function
> net: Use dev_set_threaded_hint instead of dev_set_threaded in drivers
> net: define an enum for the napi threaded state
> Extend napi threaded polling to allow kthread based busy polling
> selftests: Add napi threaded busy poll test in `busy_poller`
>
> Documentation/ABI/testing/sysfs-class-net | 3 +-
> Documentation/netlink/specs/netdev.yaml | 14 ++-
> Documentation/networking/napi.rst | 63 +++++++++++-
> .../networking/net_cachelines/net_device.rst | 2 +-
> .../net/ethernet/atheros/atl1c/atl1c_main.c | 2 +-
> drivers/net/ethernet/mellanox/mlxsw/pci.c | 2 +-
> drivers/net/ethernet/renesas/ravb_main.c | 2 +-
> drivers/net/wireguard/device.c | 2 +-
> drivers/net/wireless/ath/ath10k/snoc.c | 2 +-
> drivers/net/wireless/mediatek/mt76/debugfs.c | 2 +-
> include/linux/netdevice.h | 18 +++-
> include/net/gro.h | 6 ++
> include/uapi/linux/netdev.h | 6 ++
> kernel/bpf/cpumap.c | 3 +-
> net/core/dev.c | 97 +++++++++++++++----
> net/core/dev.h | 16 ++-
> net/core/net-sysfs.c | 2 +-
> net/core/netdev-genl-gen.c | 2 +-
> net/core/netdev-genl.c | 2 +-
> tools/include/uapi/linux/netdev.h | 6 ++
> tools/testing/selftests/net/busy_poll_test.sh | 25 ++++-
> tools/testing/selftests/net/busy_poller.c | 14 ++-
> tools/testing/selftests/net/nl_netdev.py | 36 +++----
> 23 files changed, 257 insertions(+), 70 deletions(-)
>
>
> base-commit: c3886ccaadf8fdc2c91bfbdcdca36ccdc6ef8f70
* Re: [PATCH net-next v6 3/5] net: define an enum for the napi threaded state
2025-07-18 23:20 ` [PATCH net-next v6 3/5] net: define an enum for the napi threaded state Samiullah Khawaja
@ 2025-07-21 23:48 ` Jakub Kicinski
0 siblings, 0 replies; 8+ messages in thread
From: Jakub Kicinski @ 2025-07-21 23:48 UTC (permalink / raw)
To: Samiullah Khawaja
Cc: David S . Miller , Eric Dumazet, Paolo Abeni, almasrymina,
willemb, jdamato, mkarsten, netdev
On Fri, 18 Jul 2025 23:20:49 +0000 Samiullah Khawaja wrote:
> Documentation/netlink/specs/netdev.yaml | 13 ++++---
yamllint says:
91:15 error too many spaces inside brackets (brackets)
91:33 error too many spaces inside brackets (brackets)
Please fix, rebase (the series does not apply any more), and repost
the first 3 patches ASAP. We really need to merge the decoding of
threaded as enum before the merge window. We don't want it to be
a bare int in 6.17 and enum in 6.18. It may break some Python scripts.
--
pw-bot: cr
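
To illustrate the compatibility concern raised above, here is a minimal Python sketch of how a script that compares the decoded `threaded` attribute against an integer could stop working once the attribute is decoded as an enum name. The reply dictionaries, the `id` value, and the string names `"enabled"` and `"disabled"` are assumptions made for illustration and are not taken from the patch.

```python
# Hypothetical decoded napi-get replies. The field names and values are
# illustrative assumptions, not output captured from a real kernel.
reply_bare_int = {"id": 8193, "threaded": 1}       # attribute decoded as int
reply_enum = {"id": 8193, "threaded": "enabled"}   # attribute decoded as enum


def is_threaded_int_only(napi):
    """Check written against the bare-int decoding only."""
    return napi.get("threaded") == 1


def is_threaded(napi):
    """Check that tolerates both the int and the enum-name decoding."""
    value = napi.get("threaded")
    if isinstance(value, int):
        return value != 0
    return value not in (None, "disabled")


if __name__ == "__main__":
    print(is_threaded_int_only(reply_bare_int))  # True
    print(is_threaded_int_only(reply_enum))      # False: the old check misses it
    print(is_threaded(reply_bare_int), is_threaded(reply_enum))
```

Settling on a single decoding before a release avoids scripts needing the tolerant variant at all.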
Thread overview: 8+ messages
2025-07-18 23:20 [PATCH net-next v6 0/5] Add support to do threaded napi busy poll Samiullah Khawaja
2025-07-18 23:20 ` [PATCH net-next v6 1/5] net: Create separate gro_flush_normal function Samiullah Khawaja
2025-07-18 23:20 ` [PATCH net-next v6 2/5] net: Use dev_set_threaded_hint instead of dev_set_threaded in drivers Samiullah Khawaja
2025-07-18 23:20 ` [PATCH net-next v6 3/5] net: define an enum for the napi threaded state Samiullah Khawaja
2025-07-21 23:48 ` Jakub Kicinski
2025-07-18 23:20 ` [PATCH net-next v6 4/5] Extend napi threaded polling to allow kthread based busy polling Samiullah Khawaja
2025-07-18 23:20 ` [PATCH net-next v6 5/5] selftests: Add napi threaded busy poll test in `busy_poller` Samiullah Khawaja
2025-07-19 21:27 ` [PATCH net-next v6 0/5] Add support to do threaded napi busy poll Martin Karsten