From: Simon Horman <horms@kernel.org>
To: Jiawen Wu <jiawenwu@trustnetic.com>
Cc: netdev@vger.kernel.org,
"Mengyuan Lou" <mengyuanlou@net-swift.com>,
"Andrew Lunn" <andrew+netdev@lunn.ch>,
"David S. Miller" <davem@davemloft.net>,
"Eric Dumazet" <edumazet@google.com>,
"Jakub Kicinski" <kuba@kernel.org>,
"Paolo Abeni" <pabeni@redhat.com>,
"Richard Cochran" <richardcochran@gmail.com>,
"Russell King" <linux@armlinux.org.uk>,
"Jacob Keller" <jacob.e.keller@intel.com>,
"Michal Swiatkowski" <michal.swiatkowski@linux.intel.com>,
"Kees Cook" <kees@kernel.org>,
"Larysa Zaremba" <larysa.zaremba@intel.com>,
"Joe Damato" <joe@dama.to>, "Breno Leitao" <leitao@debian.org>,
"Aleksandr Loktionov" <aleksandr.loktionov@intel.com>,
"Uwe Kleine-König (The Capable Hub)"
<u.kleine-koenig@baylibre.com>,
"Fabio Baltieri" <fabio.baltieri@gmail.com>,
"Thomas Gleixner" <tglx@kernel.org>,
"Greg Kroah-Hartman" <gregkh@linuxfoundation.org>
Subject: Re: [PATCH net-next v6 5/5] net: wangxun: add pcie error handler
Date: Fri, 12 Jun 2026 16:41:17 +0100 [thread overview]
Message-ID: <20260612154117.GC671640@horms.kernel.org> (raw)
In-Reply-To: <20260610060917.23980-6-jiawenwu@trustnetic.com>
On Wed, Jun 10, 2026 at 02:09:17PM +0800, Jiawen Wu wrote:
> Support AER driver to handle the PCIe errors. Sometimes netdev watchdog
> Tx timeout happens before the AER error report when a PCIe error occurs,
> CPU blocking would be caused by MMIO during the reset process. To
> prevent it, check PCIe error status in .ndo_tx_timeout. The current
> function of ngbe is not yet fully developed, it will be completed in the
> future.
>
> Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Hi Jiawen,
There is AI-generated review of this patch-set available on both
https://sashiko.dev and https://netdev-ai.bots.linux.dev/sashiko/
I've tried to filter out only those that seem strictly relevant
to the progress of this patch and have included them inline.
I'd appreciate it if you could take a look over this.
> diff --git a/drivers/net/ethernet/wangxun/libwx/wx_err.c b/drivers/net/ethernet/wangxun/libwx/wx_err.c
...
> +/**
> + * wx_io_slot_reset - called after the pci bus has been reset.
> + * @pdev: Pointer to PCI device
> + *
> + * Return: pci_ers_result_t.
> + *
> + * Restart the card from scratch, as if from a cold-boot.
> + */
> +static pci_ers_result_t wx_io_slot_reset(struct pci_dev *pdev)
> +{
> + struct wx *wx = pci_get_drvdata(pdev);
> + pci_ers_result_t result;
> +
> + if (pci_enable_device_mem(pdev)) {
> + wx_err(wx, "Cannot re-enable PCI device after reset.\n");
> + result = PCI_ERS_RESULT_DISCONNECT;
> + } else {
> + /* make all memory operations done before clearing the flag */
> + smp_mb__before_atomic();
> + clear_bit(WX_STATE_DISABLED, wx->state);
> + clear_bit(WX_FLAG_NEED_PCIE_RECOVERY, wx->flags);
> + pci_set_master(pdev);
> + pci_restore_state(pdev);
> + pci_wake_from_d3(pdev, false);
> +
> + rtnl_lock();
> + if (netif_running(wx->netdev) && wx->down_suspend)
> + wx->down_suspend(wx);
> + if (wx->do_reset)
> + wx->do_reset(wx->netdev, false);
The AI-generated review on sashiko.dev flags the following.
"Does this properly initialize the hardware MAC during PCIe AER
slot reset recovery for the ngbe driver?
"For txgbe, this correctly maps to txgbe_reset(), which calls
txgbe_reset_hw(). However, for ngbe, wx->do_reset maps to
ngbe_do_reset() which calls ngbe_reset().
"Looking at ngbe_reset():
"drivers/net/ethernet/wangxun/ngbe/ngbe_main.c:ngbe_reset() {
if (test_bit(WX_FLAG_NEED_PCIE_RECOVERY, wx->flags))
return;
wx_flush_sw_mac_table(wx);
wx_mac_set_default_filter(wx, wx->mac.addr);
if (test_bit(WX_STATE_PTP_RUNNING, wx->state))
wx_ptp_reset(wx);
"This skips ngbe_reset_hw() and fails to fully reset the hardware.
Since a PCIe slot reset clears the device's internal state, when
wx_io_resume() subsequently calls ndo_open(), could the device fail
to function because its hardware registers and PHY state were never
reset?
> + rtnl_unlock();
> + result = PCI_ERS_RESULT_RECOVERED;
> + }
> +
> + pci_aer_clear_nonfatal_status(pdev);
> +
> + return result;
> +}
...
> +static void wx_tx_timeout_recovery(struct wx *wx)
> +{
> + /*
> + * When a PCIe hardware error occurs, the driver should initiate a PCIe
> + * recovery mechanism. However, this recovery flow relies on the AER
> + * driver for current kernel policy. Therefore, a self-contained
> + * recovery mechanism is not implemented yet.
> + */
> + set_bit(WX_FLAG_NEED_PCIE_RECOVERY, wx->flags);
> + wx_err(wx, "PCIe error detected during tx timeout\n");
> + queue_work(wx->reset_wq, &wx->reset_task);
> +}
The AI-generated review on https://netdev-ai.bots.linux.dev/sashiko/ flags:
"What clears WX_FLAG_NEED_PCIE_RECOVERY if AER never follows up?
wx_tx_timeout_recovery() sets the flag and queues reset_task.
reset_task only calls wx_soft_quiesce() and does not clear the
flag:
if (test_bit(WX_FLAG_NEED_PCIE_RECOVERY, wx->flags))
wx_soft_quiesce(wx);
"The flag is cleared only on the success branch of
wx_io_slot_reset(). If the PCIe error never escalates to AER (AER
disabled in firmware or kernel, error not reportable, slot_reset
returns PCI_ERS_RESULT_DISCONNECT), the flag remains set
indefinitely.
"Both ngbe_reset() and txgbe_reset() now early-return on this flag:
static void ngbe_reset(struct wx *wx)
{
if (test_bit(WX_FLAG_NEED_PCIE_RECOVERY, wx->flags))
return;
...
}
"so any subsequent reset path (ethtool reset, ndo_set_features
triggered reset, watchdog timeout reset, FDIR reinit,
ngbe_reinit_locked / txgbe_reinit_locked) is silently a no-op. The
hardware reset capability is effectively disabled until the module
is reloaded.
"Should wx_tx_timeout_recovery() (or reset_task after the quiesce)
clear the flag once the soft-quiesce completes, so the driver's
reset machinery remains functional when AER does not follow up?
...
prev parent reply other threads:[~2026-06-12 15:41 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-10 6:09 [PATCH net-next v6 0/5] net: wangxun: timeout and error Jiawen Wu
2026-06-10 6:09 ` [PATCH net-next v6 1/5] net: ngbe: implement libwx reset ops Jiawen Wu
2026-06-10 6:09 ` [PATCH net-next v6 2/5] net: wangxun: add Tx timeout process Jiawen Wu
2026-06-10 6:09 ` [PATCH net-next v6 3/5] net: wangxun: add reinit parameter to wx->do_reset callback Jiawen Wu
2026-06-10 6:09 ` [PATCH net-next v6 4/5] net: wangxun: implement soft quiesce for PCIe error recovery Jiawen Wu
2026-06-12 15:49 ` Simon Horman
2026-06-10 6:09 ` [PATCH net-next v6 5/5] net: wangxun: add pcie error handler Jiawen Wu
2026-06-12 15:41 ` Simon Horman [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260612154117.GC671640@horms.kernel.org \
--to=horms@kernel.org \
--cc=aleksandr.loktionov@intel.com \
--cc=andrew+netdev@lunn.ch \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=fabio.baltieri@gmail.com \
--cc=gregkh@linuxfoundation.org \
--cc=jacob.e.keller@intel.com \
--cc=jiawenwu@trustnetic.com \
--cc=joe@dama.to \
--cc=kees@kernel.org \
--cc=kuba@kernel.org \
--cc=larysa.zaremba@intel.com \
--cc=leitao@debian.org \
--cc=linux@armlinux.org.uk \
--cc=mengyuanlou@net-swift.com \
--cc=michal.swiatkowski@linux.intel.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=richardcochran@gmail.com \
--cc=tglx@kernel.org \
--cc=u.kleine-koenig@baylibre.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.