* [PATCH iwl-net v1] i40e: fix napi_enable/disable skipping ringless q_vectors
@ 2026-03-24 13:09 Aleksandr Loktionov
2026-04-14 10:57 ` Maciej Fijalkowski
2026-04-14 17:58 ` [Intel-wired-lan] " Mekala, SunithaX D
0 siblings, 2 replies; 3+ messages in thread
From: Aleksandr Loktionov @ 2026-03-24 13:09 UTC (permalink / raw)
To: intel-wired-lan, anthony.l.nguyen, aleksandr.loktionov
Cc: netdev, Jakub Kicinski
After ethtool -L reduces the queue count, i40e_napi_disable_all() sets
NAPI_STATE_SCHED on all q_vectors, then i40e_vsi_map_rings_to_vectors()
clears ring pointers on the excess ones. i40e_napi_enable_all() skips
those with:
	if (q_vector->rx.ring || q_vector->tx.ring)
		napi_enable(&q_vector->napi);
leaving them on dev->napi_list with NAPI_STATE_SCHED permanently set.
Writing to /sys/class/net/<iface>/threaded calls napi_stop_kthread()
on every entry in dev->napi_list. The function loops on msleep(20)
waiting for NAPI_STATE_SCHED to clear -- which never happens for the
stale q_vectors. The task hangs in D state forever; a concurrent write
deadlocks on dev->lock held by the first.
Commit 13a8cd191a2b added the guard to prevent a divide-by-zero in
i40e_napi_poll() when epoll busy-poll iterated all device NAPIs (4.x
era). Since 7adc3d57fe2b ("net: Introduce preferred busy-polling",
v5.11) napi_busy_loop() polls by napi_id keyed to the socket, so
ringless q_vectors are never selected. i40e_msix_clean_rings() also
independently avoids scheduling NAPI for them. The guard is safe to
remove.
Add an early return in i40e_napi_poll() for num_ringpairs == 0 so the
function is self-defending against a NULL tx.ring dereference at the
WB_ON_ITR check, should the NAPI ever fire through an unexpected path.
Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/intel-wired-lan/20260316133100.6054a11f@kernel.org/
Fixes: 13a8cd191a2b ("i40e: Do not enable NAPI on q_vectors that have no rings")
Cc: stable@vger.kernel.org
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
---
Test configuration:
Kernel : Linux 6.19.0-rc8+
NIC : Intel Ethernet Controller XXV710 for 25GbE SFP28 [8086:158b]
Driver : i40e (in-tree)
Firmware : 9.40 0x8000ed12 1.3429.0
CPU : 2 x Intel Xeon Gold 6238M (88 logical CPUs, x86_64)
RAM : 64 GiB
Reproduction steps (FAIL before fix):
# 1. Reduce queues so excess q_vectors lose their ring pointers
ethtool -L <iface> combined 1
# 2. Enable threaded NAPI (completes fast in 6.19, no hang on enable path)
echo 1 > /sys/class/net/<iface>/threaded
# 3. Two concurrent writes to disable -- fires the msleep deadlock
echo 0 > /sys/class/net/<iface>/threaded &
echo 0 > /sys/class/net/<iface>/threaded &
Both background tasks enter uninterruptible sleep (D state) immediately
and never return.
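For reviewers unfamiliar with the NAPI state machine, the hang can be
modeled in userspace. This is an illustrative Python sketch only -- the
names mirror the kernel's but none of this is kernel code: napi_disable()
leaves NAPI_STATE_SCHED set, napi_stop_kthread() polls for the bit to
clear, so a q_vector that is skipped by napi_enable() wedges any waiter
forever.

```python
# Userspace model (NOT kernel code) of the stuck-bit scenario.
NAPI_STATE_SCHED = 1 << 0

class Napi:
    def __init__(self):
        self.state = 0

    def disable(self):
        # napi_disable() leaves NAPI_STATE_SCHED set while disabled
        self.state |= NAPI_STATE_SCHED

    def enable(self):
        # napi_enable() clears the bit, letting waiters make progress
        self.state &= ~NAPI_STATE_SCHED

def stop_kthread_would_hang(napi, max_polls=3):
    """Mimic napi_stop_kthread(): poll until SCHED clears or give up.

    The real kernel loops on msleep(20) with no bound, so returning
    True here corresponds to an unbounded D-state hang.
    """
    for _ in range(max_polls):
        if not (napi.state & NAPI_STATE_SCHED):
            return False  # bit cleared, waiter proceeds
    return True  # bit never clears -> kernel would msleep() forever

ringless = Napi()
ringless.disable()                        # i40e_napi_disable_all()
# old guard skips napi_enable() for ringless q_vectors:
print(stop_kthread_would_hang(ringless))  # True -> hang
ringless.enable()                         # fixed driver: unconditional enable
print(stop_kthread_would_hang(ringless))  # False -> no hang
```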
Observed kernel stack (W1, holds dev->lock):
msleep+0x2d/0x50
napi_set_threaded+0x10b/0x110
netif_set_threaded+0xe1/0x140
threaded_store+0xd2/0x100
kernfs_fop_write_iter+0x138/0x1d0
Kernel hung_task message (~120 s after trigger):
INFO: task bash blocked for more than 122 seconds.
INFO: task bash is blocked on a mutex likely owned by task bash.
Validation (PASS with fix):
Both background tasks exit within 1 second.
D-state process count: 0.
Busy-poll (net.core.busy_poll=50) + 50000-packet UDP flood with
1 active queue: no NULL dereference, no crash.
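The D-state count above was taken with a helper along these lines
(hypothetical validation script, not part of the patch; note that field 3
of /proc/<pid>/stat can be misparsed for comm names containing spaces):

```shell
# Count tasks in uninterruptible sleep (D state) via /proc.
# With the fix applied this stays at 0 while toggling
# /sys/class/net/<iface>/threaded from two shells.
count_d_state() {
    # field 3 of /proc/<pid>/stat is the single-letter task state
    awk '{ if ($3 == "D") n++ } END { print n + 0 }' \
        /proc/[0-9]*/stat 2>/dev/null
}
count_d_state
```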
drivers/net/ethernet/intel/i40e/i40e_main.c | 28 ++++++++++++---------
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 10 ++++++++
2 files changed, 26 insertions(+), 12 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 926d001..5042f8c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5182,6 +5182,14 @@ static void i40e_clear_interrupt_scheme(struct i40e_pf *pf)
/**
* i40e_napi_enable_all - Enable NAPI for all q_vectors in the VSI
* @vsi: the VSI being configured
+ *
+ * Enable NAPI on every q_vector that is registered with the netdev,
+ * regardless of whether it currently has rings assigned. After a queue-
+ * count reduction (e.g. ethtool -L combined 1) the excess q_vectors lose
+ * their ring pointers inside i40e_vsi_map_rings_to_vectors but remain on
+ * dev->napi_list. Leaving them in the napi_disable()-ed state
+ * (NAPI_STATE_SCHED set) causes napi_set_threaded() to spin forever on
+ * msleep(20) waiting for that bit to clear.
**/
static void i40e_napi_enable_all(struct i40e_vsi *vsi)
{
@@ -5190,17 +5198,17 @@ static void i40e_napi_enable_all(struct i40e_vsi *vsi)
if (!vsi->netdev)
return;
- for (q_idx = 0; q_idx < vsi->num_q_vectors; q_idx++) {
- struct i40e_q_vector *q_vector = vsi->q_vectors[q_idx];
-
- if (q_vector->rx.ring || q_vector->tx.ring)
- napi_enable(&q_vector->napi);
- }
+ for (q_idx = 0; q_idx < vsi->num_q_vectors; q_idx++)
+ napi_enable(&vsi->q_vectors[q_idx]->napi);
}
/**
* i40e_napi_disable_all - Disable NAPI for all q_vectors in the VSI
* @vsi: the VSI being configured
+ *
+ * Mirror of i40e_napi_enable_all: operate on every registered q_vector so
+ * enable/disable calls are always balanced, even when some q_vectors carry
+ * no rings (as happens after a queue-count reduction).
**/
static void i40e_napi_disable_all(struct i40e_vsi *vsi)
{
@@ -5209,12 +5217,8 @@ static void i40e_napi_disable_all(struct i40e_vsi *vsi)
if (!vsi->netdev)
return;
- for (q_idx = 0; q_idx < vsi->num_q_vectors; q_idx++) {
- struct i40e_q_vector *q_vector = vsi->q_vectors[q_idx];
-
- if (q_vector->rx.ring || q_vector->tx.ring)
- napi_disable(&q_vector->napi);
- }
+ for (q_idx = 0; q_idx < vsi->num_q_vectors; q_idx++)
+ napi_disable(&vsi->q_vectors[q_idx]->napi);
}
/**
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 894f2d0..3123459 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2760,6 +2760,16 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
return 0;
}
+ /* A q_vector can have its ring pointers cleared after a queue-count
+ * reduction (ethtool -L combined N) while napi_enable() was already
+ * called on it. Complete immediately so the poll loop exits cleanly
+ * and we never dereference the NULL ring pointer below.
+ */
+ if (unlikely(!q_vector->num_ringpairs)) {
+ napi_complete_done(napi, 0);
+ return 0;
+ }
+
/* Since the actual Tx work is minimal, we can give the Tx a larger
* budget and be more aggressive about cleaning up the Tx descriptors.
*/
--
2.52.0
* Re: [PATCH iwl-net v1] i40e: fix napi_enable/disable skipping ringless q_vectors
2026-03-24 13:09 [PATCH iwl-net v1] i40e: fix napi_enable/disable skipping ringless q_vectors Aleksandr Loktionov
@ 2026-04-14 10:57 ` Maciej Fijalkowski
2026-04-14 17:58 ` [Intel-wired-lan] " Mekala, SunithaX D
1 sibling, 0 replies; 3+ messages in thread
From: Maciej Fijalkowski @ 2026-04-14 10:57 UTC (permalink / raw)
To: Aleksandr Loktionov
Cc: intel-wired-lan, anthony.l.nguyen, netdev, Jakub Kicinski
On Tue, Mar 24, 2026 at 02:09:22PM +0100, Aleksandr Loktionov wrote:
> After ethtool -L reduces the queue count, i40e_napi_disable_all() sets
> NAPI_STATE_SCHED on all q_vectors, then i40e_vsi_map_rings_to_vectors()
> clears ring pointers on the excess ones. i40e_napi_enable_all() skips
> those with:
>
> if (q_vector->rx.ring || q_vector->tx.ring)
> napi_enable(&q_vector->napi);
>
> leaving them on dev->napi_list with NAPI_STATE_SCHED permanently set.
>
> Writing to /sys/class/net/<iface>/threaded calls napi_stop_kthread()
> on every entry in dev->napi_list. The function loops on msleep(20)
> waiting for NAPI_STATE_SCHED to clear -- which never happens for the
> stale q_vectors. The task hangs in D state forever; a concurrent write
> deadlocks on dev->lock held by the first.
>
> Commit 13a8cd191a2b added the guard to prevent a divide-by-zero in
> i40e_napi_poll() when epoll busy-poll iterated all device NAPIs (4.x
> era). Since 7adc3d57fe2b ("net: Introduce preferred busy-polling",
> v5.11) napi_busy_loop() polls by napi_id keyed to the socket, so
> ringless q_vectors are never selected. i40e_msix_clean_rings() also
> independently avoids scheduling NAPI for them. The guard is safe to
> remove.
>
> Add an early return in i40e_napi_poll() for num_ringpairs == 0 so the
> function is self-defending against a NULL tx.ring dereference at the
> WB_ON_ITR check, should the NAPI ever fire through an unexpected path.
>
> Reported-by: Jakub Kicinski <kuba@kernel.org>
> Closes: https://lore.kernel.org/intel-wired-lan/20260316133100.6054a11f@kernel.org/
> Fixes: 13a8cd191a2b ("i40e: Do not enable NAPI on q_vectors that have no rings")
> Cc: stable@vger.kernel.org
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
The problem comes from the fact that napi instances are kept after
rebuilding the VSI with a lower queue count. Instead of duct-taping the
driver and adding conditions in the hot path (!) we should fix the issue
at its core.
I'm going to send a fix, please drop this one.
pw-bot: cr
> ---
> Test configuration:
> Kernel : Linux 6.19.0-rc8+
> NIC : Intel Ethernet Controller XXV710 for 25GbE SFP28 [8086:158b]
> Driver : i40e (in-tree)
> Firmware : 9.40 0x8000ed12 1.3429.0
> CPU : 2 x Intel Xeon Gold 6238M (88 logical CPUs, x86_64)
> RAM : 64 GiB
>
> Reproduction steps (FAIL before fix):
> # 1. Reduce queues so excess q_vectors lose their ring pointers
> ethtool -L <iface> combined 1
>
> # 2. Enable threaded NAPI (completes fast in 6.19, no hang on enable path)
> echo 1 > /sys/class/net/<iface>/threaded
>
> # 3. Two concurrent writes to disable -- fires the msleep deadlock
> echo 0 > /sys/class/net/<iface>/threaded &
> echo 0 > /sys/class/net/<iface>/threaded &
>
> Both background tasks enter uninterruptible sleep (D state) immediately
> and never return.
>
> Observed kernel stack (W1, holds dev->lock):
> msleep+0x2d/0x50
> napi_set_threaded+0x10b/0x110
> netif_set_threaded+0xe1/0x140
> threaded_store+0xd2/0x100
> kernfs_fop_write_iter+0x138/0x1d0
>
> Kernel hung_task message (~120 s after trigger):
> INFO: task bash blocked for more than 122 seconds.
> INFO: task bash is blocked on a mutex likely owned by task bash.
>
> Validation (PASS with fix):
> Both background tasks exit within 1 second.
> D-state process count: 0.
> Busy-poll (net.core.busy_poll=50) + 50000-packet UDP flood with
> 1 active queue: no NULL dereference, no crash.
>
> drivers/net/ethernet/intel/i40e/i40e_main.c | 28 ++++++++++++---------
> drivers/net/ethernet/intel/i40e/i40e_txrx.c | 10 ++++++++
> 2 files changed, 26 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index 926d001..5042f8c 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -5182,6 +5182,14 @@ static void i40e_clear_interrupt_scheme(struct i40e_pf *pf)
> /**
> * i40e_napi_enable_all - Enable NAPI for all q_vectors in the VSI
> * @vsi: the VSI being configured
> + *
> + * Enable NAPI on every q_vector that is registered with the netdev,
> + * regardless of whether it currently has rings assigned. After a queue-
> + * count reduction (e.g. ethtool -L combined 1) the excess q_vectors lose
> + * their ring pointers inside i40e_vsi_map_rings_to_vectors but remain on
> + * dev->napi_list. Leaving them in the napi_disable()-ed state
> + * (NAPI_STATE_SCHED set) causes napi_set_threaded() to spin forever on
> + * msleep(20) waiting for that bit to clear.
> **/
> static void i40e_napi_enable_all(struct i40e_vsi *vsi)
> {
> @@ -5190,17 +5198,17 @@ static void i40e_napi_enable_all(struct i40e_vsi *vsi)
> if (!vsi->netdev)
> return;
>
> - for (q_idx = 0; q_idx < vsi->num_q_vectors; q_idx++) {
> - struct i40e_q_vector *q_vector = vsi->q_vectors[q_idx];
> -
> - if (q_vector->rx.ring || q_vector->tx.ring)
> - napi_enable(&q_vector->napi);
> - }
> + for (q_idx = 0; q_idx < vsi->num_q_vectors; q_idx++)
> + napi_enable(&vsi->q_vectors[q_idx]->napi);
> }
>
> /**
> * i40e_napi_disable_all - Disable NAPI for all q_vectors in the VSI
> * @vsi: the VSI being configured
> + *
> + * Mirror of i40e_napi_enable_all: operate on every registered q_vector so
> + * enable/disable calls are always balanced, even when some q_vectors carry
> + * no rings (as happens after a queue-count reduction).
> **/
> static void i40e_napi_disable_all(struct i40e_vsi *vsi)
> {
> @@ -5209,12 +5217,8 @@ static void i40e_napi_disable_all(struct i40e_vsi *vsi)
> if (!vsi->netdev)
> return;
>
> - for (q_idx = 0; q_idx < vsi->num_q_vectors; q_idx++) {
> - struct i40e_q_vector *q_vector = vsi->q_vectors[q_idx];
> -
> - if (q_vector->rx.ring || q_vector->tx.ring)
> - napi_disable(&q_vector->napi);
> - }
> + for (q_idx = 0; q_idx < vsi->num_q_vectors; q_idx++)
> + napi_disable(&vsi->q_vectors[q_idx]->napi);
> }
>
> /**
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index 894f2d0..3123459 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -2760,6 +2760,16 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
> return 0;
> }
>
> + /* A q_vector can have its ring pointers cleared after a queue-count
> + * reduction (ethtool -L combined N) while napi_enable() was already
> + * called on it. Complete immediately so the poll loop exits cleanly
> + * and we never dereference the NULL ring pointer below.
> + */
> + if (unlikely(!q_vector->num_ringpairs)) {
> + napi_complete_done(napi, 0);
> + return 0;
> + }
> +
> /* Since the actual Tx work is minimal, we can give the Tx a larger
> * budget and be more aggressive about cleaning up the Tx descriptors.
> */
> --
> 2.52.0
>
>
* RE: [Intel-wired-lan] [PATCH iwl-net v1] i40e: fix napi_enable/disable skipping ringless q_vectors
2026-03-24 13:09 [PATCH iwl-net v1] i40e: fix napi_enable/disable skipping ringless q_vectors Aleksandr Loktionov
2026-04-14 10:57 ` Maciej Fijalkowski
@ 2026-04-14 17:58 ` Mekala, SunithaX D
1 sibling, 0 replies; 3+ messages in thread
From: Mekala, SunithaX D @ 2026-04-14 17:58 UTC (permalink / raw)
To: Loktionov, Aleksandr, intel-wired-lan@lists.osuosl.org,
Nguyen, Anthony L, Loktionov, Aleksandr
Cc: netdev@vger.kernel.org, Jakub Kicinski
> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf Of Aleksandr Loktionov
> Sent: Tuesday, March 24, 2026 6:09 AM
> To: intel-wired-lan@lists.osuosl.org; Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Loktionov, Aleksandr <aleksandr.loktionov@intel.com>
> Cc: netdev@vger.kernel.org; Jakub Kicinski <kuba@kernel.org>
> Subject: [Intel-wired-lan] [PATCH iwl-net v1] i40e: fix napi_enable/disable skipping ringless q_vectors
>
> After ethtool -L reduces the queue count, i40e_napi_disable_all() sets
> NAPI_STATE_SCHED on all q_vectors, then i40e_vsi_map_rings_to_vectors()
> clears ring pointers on the excess ones. i40e_napi_enable_all() skips
> those with:
>
> if (q_vector->rx.ring || q_vector->tx.ring)
> napi_enable(&q_vector->napi);
>
> leaving them on dev->napi_list with NAPI_STATE_SCHED permanently set.
>
> Writing to /sys/class/net/<iface>/threaded calls napi_stop_kthread()
> on every entry in dev->napi_list. The function loops on msleep(20)
> waiting for NAPI_STATE_SCHED to clear -- which never happens for the
> stale q_vectors. The task hangs in D state forever; a concurrent write
> deadlocks on dev->lock held by the first.
>
> Commit 13a8cd191a2b added the guard to prevent a divide-by-zero in
> i40e_napi_poll() when epoll busy-poll iterated all device NAPIs (4.x
> era). Since 7adc3d57fe2b ("net: Introduce preferred busy-polling",
> v5.11) napi_busy_loop() polls by napi_id keyed to the socket, so
> ringless q_vectors are never selected. i40e_msix_clean_rings() also
> independently avoids scheduling NAPI for them. The guard is safe to
> remove.
>
> Add an early return in i40e_napi_poll() for num_ringpairs == 0 so the
> function is self-defending against a NULL tx.ring dereference at the
> WB_ON_ITR check, should the NAPI ever fire through an unexpected path.
>
> Reported-by: Jakub Kicinski <kuba@kernel.org>
> Closes: https://lore.kernel.org/intel-wired-lan/20260316133100.6054a11f@kernel.org/
> Fixes: 13a8cd191a2b ("i40e: Do not enable NAPI on q_vectors that have no rings")
> Cc: stable@vger.kernel.org
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> ---
> Test configuration:
> Kernel : Linux 6.19.0-rc8+
> NIC : Intel Ethernet Controller XXV710 for 25GbE SFP28 [8086:158b]
> Driver : i40e (in-tree)
> Firmware : 9.40 0x8000ed12 1.3429.0
> CPU : 2 x Intel Xeon Gold 6238M (88 logical CPUs, x86_64)
> RAM : 64 GiB
>
> Reproduction steps (FAIL before fix):
> # 1. Reduce queues so excess q_vectors lose their ring pointers
> ethtool -L <iface> combined 1
>
> # 2. Enable threaded NAPI (completes fast in 6.19, no hang on enable path)
> echo 1 > /sys/class/net/<iface>/threaded
>
> # 3. Two concurrent writes to disable -- fires the msleep deadlock
> echo 0 > /sys/class/net/<iface>/threaded &
> echo 0 > /sys/class/net/<iface>/threaded &
>
> Both background tasks enter uninterruptible sleep (D state) immediately
> and never return.
>
> Observed kernel stack (W1, holds dev->lock):
> msleep+0x2d/0x50
> napi_set_threaded+0x10b/0x110
> netif_set_threaded+0xe1/0x140
> threaded_store+0xd2/0x100
> kernfs_fop_write_iter+0x138/0x1d0
>
> Kernel hung_task message (~120 s after trigger):
> INFO: task bash blocked for more than 122 seconds.
> INFO: task bash is blocked on a mutex likely owned by task bash.
>
> Validation (PASS with fix):
> Both background tasks exit within 1 second.
> D-state process count: 0.
> Busy-poll (net.core.busy_poll=50) + 50000-packet UDP flood with
> 1 active queue: no NULL dereference, no crash.
>
> drivers/net/ethernet/intel/i40e/i40e_main.c | 28 ++++++++++++---------
> drivers/net/ethernet/intel/i40e/i40e_txrx.c | 10 ++++++++
> 2 files changed, 26 insertions(+), 12 deletions(-)
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel)