From: Saeed Mahameed <saeed@kernel.org>
To: "David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>
Cc: netdev@vger.kernel.org, Roi Dayan <roid@nvidia.com>,
	Moshe Shemesh <moshe@nvidia.com>,
	Saeed Mahameed <saeedm@nvidia.com>
Subject: [v2 net 05/12] net/mlx5: Fix SF health recovery flow
Date: Thu, 23 Dec 2021 11:04:34 -0800
Message-ID: <20211223190441.153012-6-saeed@kernel.org>
In-Reply-To: <20211223190441.153012-1-saeed@kernel.org>

From: Moshe Shemesh <moshe@nvidia.com>

An SF does not directly control the PCI device. During the recovery flow,
an SF must not be allowed to do PCI disable or PCI reset; its PF will do
it.

This fixes the following kernel trace:
mlx5_core.sf mlx5_core.sf.25: mlx5_health_try_recover:387:(pid 40948): starting health recovery flow
mlx5_core 0000:03:00.0: mlx5_pci_slot_reset was called
mlx5_core 0000:03:00.0: wait vital counter value 0xab175 after 1 iterations
mlx5_core.sf mlx5_core.sf.25: firmware version: 24.32.532
mlx5_core.sf mlx5_core.sf.23: mlx5_health_try_recover:387:(pid 40946): starting health recovery flow
mlx5_core 0000:03:00.0: mlx5_pci_slot_reset was called
mlx5_core 0000:03:00.0: wait vital counter value 0xab193 after 1 iterations
mlx5_core.sf mlx5_core.sf.23: firmware version: 24.32.532
mlx5_core.sf mlx5_core.sf.25: mlx5_cmd_check:813:(pid 40948): ENABLE_HCA(0x104) op_mod(0x0) failed,
status bad resource state(0x9), syndrome (0x658908)
mlx5_core.sf mlx5_core.sf.25: mlx5_function_setup:1292:(pid 40948): enable hca failed
mlx5_core.sf mlx5_core.sf.25: mlx5_health_try_recover:389:(pid 40948): health recovery failed

Fixes: 1958fc2f0712 ("net/mlx5: SF, Add auxiliary device driver")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/main.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 7df9c7f8d9c8..65083496f913 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1809,12 +1809,13 @@ void mlx5_disable_device(struct mlx5_core_dev *dev)
 
 int mlx5_recover_device(struct mlx5_core_dev *dev)
 {
-	int ret = -EIO;
+	if (!mlx5_core_is_sf(dev)) {
+		mlx5_pci_disable_device(dev);
+		if (mlx5_pci_slot_reset(dev->pdev) != PCI_ERS_RESULT_RECOVERED)
+			return -EIO;
+	}
 
-	mlx5_pci_disable_device(dev);
-	if (mlx5_pci_slot_reset(dev->pdev) == PCI_ERS_RESULT_RECOVERED)
-		ret = mlx5_load_one(dev);
-	return ret;
+	return mlx5_load_one(dev);
 }
 
 static struct pci_driver mlx5_core_driver = {
-- 
2.33.1
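
For readers who want to follow the control flow outside the kernel tree,
here is a minimal userspace sketch of the patched logic. It is not the
driver code: struct model_dev and every model_* function are hypothetical
stand-ins for struct mlx5_core_dev, mlx5_core_is_sf(),
mlx5_pci_disable_device(), mlx5_pci_slot_reset() and mlx5_load_one(), and
the counters exist only so the behavior can be observed. It assumes the
load step always succeeds, which the real mlx5_load_one() does not
guarantee.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mlx5_core_dev; only what the sketch needs. */
struct model_dev {
	bool is_sf;        /* true for a sub-function, false otherwise */
	bool reset_ok;     /* whether the simulated slot reset succeeds */
	int pci_disables;  /* number of simulated PCI disable calls */
	int pci_resets;    /* number of simulated slot reset calls */
	bool loaded;       /* set once the simulated load step has run */
};

/* Stand-in for mlx5_pci_disable_device(): just counts the call. */
static void model_pci_disable(struct model_dev *dev)
{
	dev->pci_disables++;
}

/* Stand-in for mlx5_pci_slot_reset(): counts the call, reports success or failure. */
static bool model_pci_slot_reset(struct model_dev *dev)
{
	dev->pci_resets++;
	return dev->reset_ok;
}

/* Stand-in for mlx5_load_one(): assumed to always succeed in this sketch. */
static int model_load_one(struct model_dev *dev)
{
	dev->loaded = true;
	return 0;
}

/* Mirrors the shape of the patched mlx5_recover_device(). */
int model_recover_device(struct model_dev *dev)
{
	assert(dev != NULL);

	/* Only a non-SF function touches PCI state; an SF skips straight to load. */
	if (!dev->is_sf) {
		model_pci_disable(dev);
		if (!model_pci_slot_reset(dev))
			return -EIO;
	}

	return model_load_one(dev);
}
```

With these stand-ins, recovering an SF reaches the load step without any
PCI disable or slot reset, while a non-SF device still returns -EIO when
the slot reset does not report recovery.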

