netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH iwl-net V3,0/2] Fix repeated EEH reports in MSI domain
@ 2024-05-14 20:21 Thinh Tran
  2024-05-14 20:21 ` [PATCH iwl-net V3,1/2] i40e: fractoring out i40e_suspend/i40e_resume Thinh Tran
  2024-05-14 20:21 ` [PATCH iwl-net V3,2/2] i40e: Fully suspend and resume IO operations in EEH case Thinh Tran
  0 siblings, 2 replies; 6+ messages in thread
From: Thinh Tran @ 2024-05-14 20:21 UTC (permalink / raw)
  To: netdev, kuba, anthony.l.nguyen, aleksandr.loktionov,
	przemyslaw.kitszel, horms
  Cc: jesse.brandeburg, davem, edumazet, pabeni, intel-wired-lan,
	rob.thomas, Thinh Tran

The patch fixes an issue where repeated EEH reports with a single error
on the bus of Intel X710 4-port 10G Base-T adapter in the MSI domain
causes the device to be permanently disabled.  It fully resets and
restarts the device when handling the PCI EEH error.

v3: moved text commit messages from the cover letter to appropriate
    patches.
v2: fixed typos and split into two commits

Thinh Tran (2):
  i40e: fractoring out i40e_suspend/i40e_resume
  i40e: Fully suspend and resume IO operations in EEH case

 drivers/net/ethernet/intel/i40e/i40e_main.c | 257 +++++++++++---------
 1 file changed, 140 insertions(+), 117 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH iwl-net V3,1/2] i40e: fractoring out i40e_suspend/i40e_resume
  2024-05-14 20:21 [PATCH iwl-net V3,0/2] Fix repeated EEH reports in MSI domain Thinh Tran
@ 2024-05-14 20:21 ` Thinh Tran
  2024-05-15  5:08   ` [Intel-wired-lan] [PATCH iwl-net V3, 1/2] " Paul Menzel
  2024-05-14 20:21 ` [PATCH iwl-net V3,2/2] i40e: Fully suspend and resume IO operations in EEH case Thinh Tran
  1 sibling, 1 reply; 6+ messages in thread
From: Thinh Tran @ 2024-05-14 20:21 UTC (permalink / raw)
  To: netdev, kuba, anthony.l.nguyen, aleksandr.loktionov,
	przemyslaw.kitszel, horms
  Cc: jesse.brandeburg, davem, edumazet, pabeni, intel-wired-lan,
	rob.thomas, Thinh Tran

Two new functions, i40e_io_suspend() and i40e_io_resume(), have been
introduced.  These functions were factored out from the existing
i40e_suspend() and i40e_resume() respectively.  This factoring was
done due to concerns about the logic of the I40E_SUSPENSED state, which
caused the device not able to recover.  The functions are now used in the
EEH handling for device suspend/resume callbacks.

Reordered the function, i40e_enable_mc_magic_wake() has been moved 
ahead of i40e_io_suspend() to ensure it is declared before being used.

Tested-by: Robert Thomas <rob.thomas@ibm.com>
Signed-off-by: Thinh Tran <thinhtr@linux.ibm.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 248 +++++++++++---------
 1 file changed, 134 insertions(+), 114 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index ffb9f9f15c52..281c8ec27af2 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -16303,6 +16303,138 @@ static void i40e_remove(struct pci_dev *pdev)
 	pci_disable_device(pdev);
 }
 
+/**
+ * i40e_enable_mc_magic_wake - enable multicast magic packet wake up
+ * using the mac_address_write admin q function
+ * @pf: pointer to i40e_pf struct
+ **/
+static void i40e_enable_mc_magic_wake(struct i40e_pf *pf)
+{
+	struct i40e_hw *hw = &pf->hw;
+	u8 mac_addr[6];
+	u16 flags = 0;
+	int ret;
+
+	/* Get current MAC address in case it's an LAA */
+	if (pf->vsi[pf->lan_vsi] && pf->vsi[pf->lan_vsi]->netdev) {
+		ether_addr_copy(mac_addr,
+				pf->vsi[pf->lan_vsi]->netdev->dev_addr);
+	} else {
+		dev_err(&pf->pdev->dev,
+			"Failed to retrieve MAC address; using default\n");
+		ether_addr_copy(mac_addr, hw->mac.addr);
+	}
+
+	/* The FW expects the mac address write cmd to first be called with
+	 * one of these flags before calling it again with the multicast
+	 * enable flags.
+	 */
+	flags = I40E_AQC_WRITE_TYPE_LAA_WOL;
+
+	if (hw->func_caps.flex10_enable && hw->partition_id != 1)
+		flags = I40E_AQC_WRITE_TYPE_LAA_ONLY;
+
+	ret = i40e_aq_mac_address_write(hw, flags, mac_addr, NULL);
+	if (ret) {
+		dev_err(&pf->pdev->dev,
+			"Failed to update MAC address registers; cannot enable Multicast Magic packet wake up");
+		return;
+	}
+
+	flags = I40E_AQC_MC_MAG_EN
+			| I40E_AQC_WOL_PRESERVE_ON_PFR
+			| I40E_AQC_WRITE_TYPE_UPDATE_MC_MAG;
+	ret = i40e_aq_mac_address_write(hw, flags, mac_addr, NULL);
+	if (ret)
+		dev_err(&pf->pdev->dev,
+			"Failed to enable Multicast Magic Packet wake up\n");
+}
+
+/**
+ * i40e_io_suspend - suspend all IO operations
+ * @pf: pointer to i40e_pf struct
+ *
+ **/
+static int i40e_io_suspend(struct i40e_pf *pf)
+{
+	struct i40e_hw *hw = &pf->hw;
+
+	set_bit(__I40E_DOWN, pf->state);
+
+	/* Ensure service task will not be running */
+	del_timer_sync(&pf->service_timer);
+	cancel_work_sync(&pf->service_task);
+
+	/* Client close must be called explicitly here because the timer
+	 * has been stopped.
+	 */
+	i40e_notify_client_of_netdev_close(pf->vsi[pf->lan_vsi], false);
+
+	if (test_bit(I40E_HW_CAP_WOL_MC_MAGIC_PKT_WAKE, pf->hw.caps) &&
+	    pf->wol_en)
+		i40e_enable_mc_magic_wake(pf);
+
+	/* Since we're going to destroy queues during the
+	 * i40e_clear_interrupt_scheme() we should hold the RTNL lock for this
+	 * whole section
+	 */
+	rtnl_lock();
+
+	i40e_prep_for_reset(pf);
+
+	wr32(hw, I40E_PFPM_APM, (pf->wol_en ? I40E_PFPM_APM_APME_MASK : 0));
+	wr32(hw, I40E_PFPM_WUFC, (pf->wol_en ? I40E_PFPM_WUFC_MAG_MASK : 0));
+
+	/* Clear the interrupt scheme and release our IRQs so that the system
+	 * can safely hibernate even when there are a large number of CPUs.
+	 * Otherwise hibernation might fail when mapping all the vectors back
+	 * to CPU0.
+	 */
+	i40e_clear_interrupt_scheme(pf);
+
+	rtnl_unlock();
+
+	return 0;
+}
+
+/**
+ * i40e_io_resume - resume IO operations
+ * @pf: pointer to i40e_pf struct
+ *
+ **/
+static int i40e_io_resume(struct i40e_pf *pf)
+{
+	int err;
+
+	/* We need to hold the RTNL lock prior to restoring interrupt schemes,
+	 * since we're going to be restoring queues
+	 */
+	rtnl_lock();
+
+	/* We cleared the interrupt scheme when we suspended, so we need to
+	 * restore it now to resume device functionality.
+	 */
+	err = i40e_restore_interrupt_scheme(pf);
+	if (err) {
+		dev_err(&pf->pdev->dev, "Cannot restore interrupt scheme: %d\n",
+			err);
+	}
+
+	clear_bit(__I40E_DOWN, pf->state);
+	i40e_reset_and_rebuild(pf, false, true);
+
+	rtnl_unlock();
+
+	/* Clear suspended state last after everything is recovered */
+	clear_bit(__I40E_SUSPENDED, pf->state);
+
+	/* Restart the service task */
+	mod_timer(&pf->service_timer,
+		  round_jiffies(jiffies + pf->service_timer_period));
+
+	return 0;
+}
+
 /**
  * i40e_pci_error_detected - warning that something funky happened in PCI land
  * @pdev: PCI device information struct
@@ -16415,53 +16547,6 @@ static void i40e_pci_error_resume(struct pci_dev *pdev)
 	i40e_handle_reset_warning(pf, false);
 }
 
-/**
- * i40e_enable_mc_magic_wake - enable multicast magic packet wake up
- * using the mac_address_write admin q function
- * @pf: pointer to i40e_pf struct
- **/
-static void i40e_enable_mc_magic_wake(struct i40e_pf *pf)
-{
-	struct i40e_hw *hw = &pf->hw;
-	u8 mac_addr[6];
-	u16 flags = 0;
-	int ret;
-
-	/* Get current MAC address in case it's an LAA */
-	if (pf->vsi[pf->lan_vsi] && pf->vsi[pf->lan_vsi]->netdev) {
-		ether_addr_copy(mac_addr,
-				pf->vsi[pf->lan_vsi]->netdev->dev_addr);
-	} else {
-		dev_err(&pf->pdev->dev,
-			"Failed to retrieve MAC address; using default\n");
-		ether_addr_copy(mac_addr, hw->mac.addr);
-	}
-
-	/* The FW expects the mac address write cmd to first be called with
-	 * one of these flags before calling it again with the multicast
-	 * enable flags.
-	 */
-	flags = I40E_AQC_WRITE_TYPE_LAA_WOL;
-
-	if (hw->func_caps.flex10_enable && hw->partition_id != 1)
-		flags = I40E_AQC_WRITE_TYPE_LAA_ONLY;
-
-	ret = i40e_aq_mac_address_write(hw, flags, mac_addr, NULL);
-	if (ret) {
-		dev_err(&pf->pdev->dev,
-			"Failed to update MAC address registers; cannot enable Multicast Magic packet wake up");
-		return;
-	}
-
-	flags = I40E_AQC_MC_MAG_EN
-			| I40E_AQC_WOL_PRESERVE_ON_PFR
-			| I40E_AQC_WRITE_TYPE_UPDATE_MC_MAG;
-	ret = i40e_aq_mac_address_write(hw, flags, mac_addr, NULL);
-	if (ret)
-		dev_err(&pf->pdev->dev,
-			"Failed to enable Multicast Magic Packet wake up\n");
-}
-
 /**
  * i40e_shutdown - PCI callback for shutting down
  * @pdev: PCI device information struct
@@ -16521,48 +16606,11 @@ static void i40e_shutdown(struct pci_dev *pdev)
 static int __maybe_unused i40e_suspend(struct device *dev)
 {
 	struct i40e_pf *pf = dev_get_drvdata(dev);
-	struct i40e_hw *hw = &pf->hw;
 
 	/* If we're already suspended, then there is nothing to do */
 	if (test_and_set_bit(__I40E_SUSPENDED, pf->state))
 		return 0;
-
-	set_bit(__I40E_DOWN, pf->state);
-
-	/* Ensure service task will not be running */
-	del_timer_sync(&pf->service_timer);
-	cancel_work_sync(&pf->service_task);
-
-	/* Client close must be called explicitly here because the timer
-	 * has been stopped.
-	 */
-	i40e_notify_client_of_netdev_close(pf->vsi[pf->lan_vsi], false);
-
-	if (test_bit(I40E_HW_CAP_WOL_MC_MAGIC_PKT_WAKE, pf->hw.caps) &&
-	    pf->wol_en)
-		i40e_enable_mc_magic_wake(pf);
-
-	/* Since we're going to destroy queues during the
-	 * i40e_clear_interrupt_scheme() we should hold the RTNL lock for this
-	 * whole section
-	 */
-	rtnl_lock();
-
-	i40e_prep_for_reset(pf);
-
-	wr32(hw, I40E_PFPM_APM, (pf->wol_en ? I40E_PFPM_APM_APME_MASK : 0));
-	wr32(hw, I40E_PFPM_WUFC, (pf->wol_en ? I40E_PFPM_WUFC_MAG_MASK : 0));
-
-	/* Clear the interrupt scheme and release our IRQs so that the system
-	 * can safely hibernate even when there are a large number of CPUs.
-	 * Otherwise hibernation might fail when mapping all the vectors back
-	 * to CPU0.
-	 */
-	i40e_clear_interrupt_scheme(pf);
-
-	rtnl_unlock();
-
-	return 0;
+	return i40e_io_suspend(pf);
 }
 
 /**
@@ -16572,39 +16620,11 @@ static int __maybe_unused i40e_suspend(struct device *dev)
 static int __maybe_unused i40e_resume(struct device *dev)
 {
 	struct i40e_pf *pf = dev_get_drvdata(dev);
-	int err;
 
 	/* If we're not suspended, then there is nothing to do */
 	if (!test_bit(__I40E_SUSPENDED, pf->state))
 		return 0;
-
-	/* We need to hold the RTNL lock prior to restoring interrupt schemes,
-	 * since we're going to be restoring queues
-	 */
-	rtnl_lock();
-
-	/* We cleared the interrupt scheme when we suspended, so we need to
-	 * restore it now to resume device functionality.
-	 */
-	err = i40e_restore_interrupt_scheme(pf);
-	if (err) {
-		dev_err(dev, "Cannot restore interrupt scheme: %d\n",
-			err);
-	}
-
-	clear_bit(__I40E_DOWN, pf->state);
-	i40e_reset_and_rebuild(pf, false, true);
-
-	rtnl_unlock();
-
-	/* Clear suspended state last after everything is recovered */
-	clear_bit(__I40E_SUSPENDED, pf->state);
-
-	/* Restart the service task */
-	mod_timer(&pf->service_timer,
-		  round_jiffies(jiffies + pf->service_timer_period));
-
-	return 0;
+	return i40e_io_resume(pf);
 }
 
 static const struct pci_error_handlers i40e_err_handler = {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH iwl-net V3,2/2] i40e: Fully suspend and resume IO operations in EEH case
  2024-05-14 20:21 [PATCH iwl-net V3,0/2] Fix repeated EEH reports in MSI domain Thinh Tran
  2024-05-14 20:21 ` [PATCH iwl-net V3,1/2] i40e: fractoring out i40e_suspend/i40e_resume Thinh Tran
@ 2024-05-14 20:21 ` Thinh Tran
  2024-05-14 21:30   ` [Intel-wired-lan] [PATCH iwl-net V3, 2/2] " Jacob Keller
  1 sibling, 1 reply; 6+ messages in thread
From: Thinh Tran @ 2024-05-14 20:21 UTC (permalink / raw)
  To: netdev, kuba, anthony.l.nguyen, aleksandr.loktionov,
	przemyslaw.kitszel, horms
  Cc: jesse.brandeburg, davem, edumazet, pabeni, intel-wired-lan,
	rob.thomas, Thinh Tran

When EEH events occurs, the callback functions in the i40e, which are
managed by the EEH driver, will completely suspend and resume all IO
operations.

- In the PCI error detected callback, replaced i40e_prep_for_reset()
  with i40e_io_suspend(). The change is to fully suspend all I/O
  operations
- In the PCI error slot reset callback, replaced pci_enable_device_mem()
  with pci_enable_device(). This change enables both I/O and memory of
  the device.
- In the PCI error resume callback, replaced i40e_handle_reset_warning()
  with i40e_io_resume(). This change allows the system to resume I/O
  operations

Fixes: a5f3d2c17b07 ("powerpc/pseries/pci: Add MSI domains")
Tested-by: Robert Thomas <rob.thomas@ibm.com>
Signed-off-by: Thinh Tran <thinhtr@linux.ibm.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 281c8ec27af2..9f71a61e0c52 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -11138,6 +11138,8 @@ static void i40e_reset_and_rebuild(struct i40e_pf *pf, bool reinit,
 	ret = i40e_reset(pf);
 	if (!ret)
 		i40e_rebuild(pf, reinit, lock_acquired);
+	else
+		dev_err(&pf->pdev->dev, "%s: i40e_reset() FAILED", __func__);
 }
 
 /**
@@ -16459,7 +16461,7 @@ static pci_ers_result_t i40e_pci_error_detected(struct pci_dev *pdev,
 
 	/* shutdown all operations */
 	if (!test_bit(__I40E_SUSPENDED, pf->state))
-		i40e_prep_for_reset(pf);
+		i40e_io_suspend(pf);
 
 	/* Request a slot reset */
 	return PCI_ERS_RESULT_NEED_RESET;
@@ -16481,7 +16483,8 @@ static pci_ers_result_t i40e_pci_error_slot_reset(struct pci_dev *pdev)
 	u32 reg;
 
 	dev_dbg(&pdev->dev, "%s\n", __func__);
-	if (pci_enable_device_mem(pdev)) {
+	/* enable I/O and memory of the device  */
+	if (pci_enable_device(pdev)) {
 		dev_info(&pdev->dev,
 			 "Cannot re-enable PCI device after reset.\n");
 		result = PCI_ERS_RESULT_DISCONNECT;
@@ -16544,7 +16547,7 @@ static void i40e_pci_error_resume(struct pci_dev *pdev)
 	if (test_bit(__I40E_SUSPENDED, pf->state))
 		return;
 
-	i40e_handle_reset_warning(pf, false);
+	i40e_io_resume(pf);
 }
 
 /**
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [Intel-wired-lan] [PATCH iwl-net V3, 2/2] i40e: Fully suspend and resume IO operations in EEH case
  2024-05-14 20:21 ` [PATCH iwl-net V3,2/2] i40e: Fully suspend and resume IO operations in EEH case Thinh Tran
@ 2024-05-14 21:30   ` Jacob Keller
  0 siblings, 0 replies; 6+ messages in thread
From: Jacob Keller @ 2024-05-14 21:30 UTC (permalink / raw)
  To: Thinh Tran, netdev, kuba, anthony.l.nguyen, aleksandr.loktionov,
	przemyslaw.kitszel, horms
  Cc: edumazet, rob.thomas, intel-wired-lan, pabeni, davem



On 5/14/2024 1:21 PM, Thinh Tran wrote:
> When EEH events occurs, the callback functions in the i40e, which are
> managed by the EEH driver, will completely suspend and resume all IO
> operations.
> 
> - In the PCI error detected callback, replaced i40e_prep_for_reset()
>   with i40e_io_suspend(). The change is to fully suspend all I/O
>   operations
> - In the PCI error slot reset callback, replaced pci_enable_device_mem()
>   with pci_enable_device(). This change enables both I/O and memory of
>   the device.
> - In the PCI error resume callback, replaced i40e_handle_reset_warning()
>   with i40e_io_resume(). This change allows the system to resume I/O
>   operations
> 
> Fixes: a5f3d2c17b07 ("powerpc/pseries/pci: Add MSI domains")
> Tested-by: Robert Thomas <rob.thomas@ibm.com>
> Signed-off-by: Thinh Tran <thinhtr@linux.ibm.com>
> ---

Thanks for moving this info to the commit message.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Intel-wired-lan] [PATCH iwl-net V3, 1/2] i40e: fractoring out i40e_suspend/i40e_resume
  2024-05-14 20:21 ` [PATCH iwl-net V3,1/2] i40e: fractoring out i40e_suspend/i40e_resume Thinh Tran
@ 2024-05-15  5:08   ` Paul Menzel
  2024-05-15 13:41     ` Thinh Tran
  0 siblings, 1 reply; 6+ messages in thread
From: Paul Menzel @ 2024-05-15  5:08 UTC (permalink / raw)
  To: Thinh Tran
  Cc: netdev, kuba, anthony.l.nguyen, aleksandr.loktionov,
	przemyslaw.kitszel, horms, edumazet, rob.thomas, intel-wired-lan,
	pabeni, davem

Dear Thinh,


Thank you for your patch. Two minor comments for the title: Please use 
imperative mood and fix a typo:

Factor out i40e_suspend/i40e_resume


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Intel-wired-lan] [PATCH iwl-net V3, 1/2] i40e: fractoring out i40e_suspend/i40e_resume
  2024-05-15  5:08   ` [Intel-wired-lan] [PATCH iwl-net V3, 1/2] " Paul Menzel
@ 2024-05-15 13:41     ` Thinh Tran
  0 siblings, 0 replies; 6+ messages in thread
From: Thinh Tran @ 2024-05-15 13:41 UTC (permalink / raw)
  To: Paul Menzel
  Cc: netdev, kuba, anthony.l.nguyen, aleksandr.loktionov,
	przemyslaw.kitszel, horms, edumazet, rob.thomas, intel-wired-lan,
	pabeni, davem



On 5/15/2024 12:08 AM, Paul Menzel wrote:
> Please use imperative mood and fix a typo
I apologize for the oversight. Thank you for your feedback. I'll correct 
that.

Thinh Tran

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2024-05-15 13:41 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-05-14 20:21 [PATCH iwl-net V3,0/2] Fix repeated EEH reports in MSI domain Thinh Tran
2024-05-14 20:21 ` [PATCH iwl-net V3,1/2] i40e: fractoring out i40e_suspend/i40e_resume Thinh Tran
2024-05-15  5:08   ` [Intel-wired-lan] [PATCH iwl-net V3, 1/2] " Paul Menzel
2024-05-15 13:41     ` Thinh Tran
2024-05-14 20:21 ` [PATCH iwl-net V3,2/2] i40e: Fully suspend and resume IO operations in EEH case Thinh Tran
2024-05-14 21:30   ` [Intel-wired-lan] [PATCH iwl-net V3, 2/2] " Jacob Keller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).