public inbox for netdev@vger.kernel.org
From: Tariq Toukan <ttoukan.linux@gmail.com>
To: Breno Leitao <leitao@debian.org>,
	Saeed Mahameed <saeedm@nvidia.com>,
	Tariq Toukan <tariqt@nvidia.com>, Mark Bloch <mbloch@nvidia.com>,
	Leon Romanovsky <leon@kernel.org>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Amir Vadai <amirv@mellanox.com>
Cc: netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, dcostantino@meta.com,
	rneu@meta.com, kernel-team@meta.com
Subject: Re: [PATCH net] net/mlx5e: Skip NAPI polling when PCI channel is offline
Date: Wed, 11 Feb 2026 13:26:35 +0200	[thread overview]
Message-ID: <09a77964-37bf-4b3c-bfa9-8939eb7761ab@gmail.com> (raw)
In-Reply-To: <20260209-mlx5_iommu-v1-1-b17ae501aeb2@debian.org>



On 09/02/2026 20:01, Breno Leitao wrote:
> When a PCI error (e.g. AER error or DPC containment) marks the PCI
> channel as frozen or permanently failed, the IOMMU mappings for the
> device may already be torn down. If mlx5e_napi_poll() continues
> processing CQEs in this state, every call to dma_unmap_page() triggers
> a WARN_ON in iommu_dma_unmap_phys().
> 
> In a real-world crash scenario on an NVIDIA Grace (ARM64) platform,
> a DPC event froze the PCI channel and the mlx5 NAPI poll continued
> processing error CQEs, calling dma_unmap for each pending WQE. Here is
> an example:
> 
> The DPC event on port 0007:00:00.0 fires and eth1 (on 0017:01:00.0) starts
> seeing error CQEs almost immediately:
> 
>      pcieport 0007:00:00.0: DPC: containment event, status:0x2009
>      mlx5_core 0017:01:00.0 eth1: Error cqe on cqn 0x54e, ci 0xb06, ...
> 
> The WARN_ON storm begins ~0.4s later and repeats for every pending WQE:
> 
>      WARNING: CPU: 32 PID: 0 at drivers/iommu/dma-iommu.c:1237 iommu_dma_unmap_phys
>      Call trace:
>       iommu_dma_unmap_phys+0xd4/0xe0
>       mlx5e_tx_wi_dma_unmap+0xb4/0xf0
>       mlx5e_poll_tx_cq+0x14c/0x438
>       mlx5e_napi_poll+0x6c/0x5e0
>       net_rx_action+0x160/0x5c0
>       handle_softirqs+0xe8/0x320
>       run_ksoftirqd+0x30/0x58
> 
> After 23 seconds of WARN_ON() storm, the watchdog fires:
> 
>      watchdog: BUG: soft lockup - CPU#32 stuck for 23s! [ksoftirqd/32:179]
>      Kernel panic - not syncing: softlockup: hung tasks
> 
> Each unmap hit the WARN_ON in the IOMMU layer, printing a full stack
> trace. With dozens of pending WQEs, this created a storm of WARN_ON
> dumps in softirq context that monopolized the CPU for over 23 seconds,
> triggering a soft lockup panic.
> 
> Fix this by checking pci_channel_offline() at the top of
> mlx5e_napi_poll() and bailing out immediately when the channel is
> offline. napi_complete_done() is called before returning to clear the
> NAPI_STATE_SCHED bit, ensuring that napi_disable() in the teardown path
> does not spin forever waiting for it. No CQ interrupts are re-armed
> since the explicit mlx5e_cq_arm() calls are skipped, so the NAPI
> instance will not be re-scheduled. The pending DMA buffers are left for
> device removal to clean up.
> 
> Fixes: e586b3b0baee ("net/mlx5: Ethernet Datapath files")
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>   drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
> index 76108299ea57d..934ad7fafa801 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
> @@ -138,6 +138,19 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
>   	bool xsk_open;
>   	int i;
>   
> +	/*
> +	 * When the PCI channel is offline, IOMMU mappings may already be torn
> +	 * down.  Processing CQEs would call dma_unmap for every pending WQE,
> +	 * each hitting a WARN_ON in the IOMMU layer.  The resulting storm of
> +	 * warnings in softirq context can monopolise the CPU long enough to
> +	 * trigger a soft lockup and prevent any RCU grace period from
> +	 * completing.
> +	 */
> +	if (unlikely(pci_channel_offline(c->mdev->pdev))) {
> +		napi_complete_done(napi, 0);
> +		return 0;
> +	}
> +
>   	rcu_read_lock();
>   
>   	qos_sqs = rcu_dereference(c->qos_sqs);
> 
> ---
> base-commit: a956792a1543c2bf4a2266cb818dc7c4135006f0
> change-id: 20260209-mlx5_iommu-c8b238b1bb14
> 
> Best regards,
> --
> Breno Leitao <leitao@debian.org>
> 
> 

Hi,

Thanks for your patch.

You're raising an interesting problem, but I am not convinced by the 
proposed solution.

Why would the driver perform this check if it doesn't guarantee 
prevention of invalid access? It only skips one napi cycle, which 
happens to be good enough to prevent the soft lockup in your case.

What if a napi cycle is configured with larger budget?

If the problem is that the WARN_ON is being called at a high rate, then 
it should be rate-limited.



Thread overview: 8+ messages
2026-02-09 18:01 [PATCH net] net/mlx5e: Skip NAPI polling when PCI channel is offline Breno Leitao
2026-02-10  2:19 ` Jijie Shao
2026-02-10 15:18   ` Breno Leitao
2026-02-11  1:42     ` Jijie Shao
2026-02-11 11:26 ` Tariq Toukan [this message]
2026-02-11 13:44   ` Breno Leitao
2026-02-11 15:17     ` Breno Leitao
2026-02-11 15:27       ` Breno Leitao
