From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Gavin Li <gavinl@nvidia.com>, Moshe Shemesh <moshe@nvidia.com>,
Shay Drory <shayd@nvidia.com>, Saeed Mahameed <saeedm@nvidia.com>,
Sasha Levin <sashal@kernel.org>,
davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
pabeni@redhat.com, amirtz@nvidia.com, netdev@vger.kernel.org,
linux-rdma@vger.kernel.org
Subject: [PATCH AUTOSEL 5.18 090/159] net/mlx5: Increase FW pre-init timeout for health recovery
Date: Mon, 30 May 2022 09:23:15 -0400
Message-ID: <20220530132425.1929512-90-sashal@kernel.org>
In-Reply-To: <20220530132425.1929512-1-sashal@kernel.org>
From: Gavin Li <gavinl@nvidia.com>
[ Upstream commit 37ca95e62ee23fa6d2c2c64e3dc40b4a0c0146dc ]
Currently, health recovery reloads the driver to recover it from fatal
errors. During the driver's load process, it waits up to 120 seconds for
FW to set the pre-init bit; beyond this threshold it aborts the load
process. In some cases, such as a FW upgrade on the DPU, this timeout
period is insufficient, and the user has no way to recover the host
device.
To solve this issue, introduce a new FW pre-init timeout for health
recovery, which is set to 2 hours.
Devlink reload and probe keep the original timeout because they are
user-triggered flows and therefore should not have a significantly long
timeout, during which the user command would hang.
Signed-off-by: Gavin Li <gavinl@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
.../net/ethernet/mellanox/mlx5/core/devlink.c | 4 ++--
.../ethernet/mellanox/mlx5/core/fw_reset.c | 2 +-
.../ethernet/mellanox/mlx5/core/lib/tout.c | 1 +
.../ethernet/mellanox/mlx5/core/lib/tout.h | 1 +
.../net/ethernet/mellanox/mlx5/core/main.c | 23 +++++++++++--------
.../ethernet/mellanox/mlx5/core/mlx5_core.h | 2 +-
6 files changed, 20 insertions(+), 13 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
index 057dde6f4417..9401127fb0ec 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
@@ -178,13 +178,13 @@ static int mlx5_devlink_reload_up(struct devlink *devlink, enum devlink_reload_a
*actions_performed = BIT(action);
switch (action) {
case DEVLINK_RELOAD_ACTION_DRIVER_REINIT:
- return mlx5_load_one(dev);
+ return mlx5_load_one(dev, false);
case DEVLINK_RELOAD_ACTION_FW_ACTIVATE:
if (limit == DEVLINK_RELOAD_LIMIT_NO_RESET)
break;
/* On fw_activate action, also driver is reloaded and reinit performed */
*actions_performed |= BIT(DEVLINK_RELOAD_ACTION_DRIVER_REINIT);
- return mlx5_load_one(dev);
+ return mlx5_load_one(dev, false);
default:
/* Unsupported action should not get to this function */
WARN_ON(1);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c b/drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c
index 81eb67fb95b0..052af4901c0b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c
@@ -149,7 +149,7 @@ static void mlx5_fw_reset_complete_reload(struct mlx5_core_dev *dev)
if (test_bit(MLX5_FW_RESET_FLAGS_PENDING_COMP, &fw_reset->reset_flags)) {
complete(&fw_reset->done);
} else {
- mlx5_load_one(dev);
+ mlx5_load_one(dev, false);
devlink_remote_reload_actions_performed(priv_to_devlink(dev), 0,
BIT(DEVLINK_RELOAD_ACTION_DRIVER_REINIT) |
BIT(DEVLINK_RELOAD_ACTION_FW_ACTIVATE));
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/tout.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/tout.c
index c1df0d3595d8..d758848d34d0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/tout.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/tout.c
@@ -10,6 +10,7 @@ struct mlx5_timeouts {
static const u32 tout_def_sw_val[MAX_TIMEOUT_TYPES] = {
[MLX5_TO_FW_PRE_INIT_TIMEOUT_MS] = 120000,
+ [MLX5_TO_FW_PRE_INIT_ON_RECOVERY_TIMEOUT_MS] = 7200000,
[MLX5_TO_FW_PRE_INIT_WARN_MESSAGE_INTERVAL_MS] = 20000,
[MLX5_TO_FW_PRE_INIT_WAIT_MS] = 2,
[MLX5_TO_FW_INIT_MS] = 2000,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/tout.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/tout.h
index 1c42ead782fa..257c03eeab36 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/tout.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/tout.h
@@ -7,6 +7,7 @@
enum mlx5_timeouts_types {
/* pre init timeouts (not read from FW) */
MLX5_TO_FW_PRE_INIT_TIMEOUT_MS,
+ MLX5_TO_FW_PRE_INIT_ON_RECOVERY_TIMEOUT_MS,
MLX5_TO_FW_PRE_INIT_WARN_MESSAGE_INTERVAL_MS,
MLX5_TO_FW_PRE_INIT_WAIT_MS,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index ef196cb764e2..8b5263699994 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1014,7 +1014,7 @@ static void mlx5_cleanup_once(struct mlx5_core_dev *dev)
mlx5_devcom_unregister_device(dev->priv.devcom);
}
-static int mlx5_function_setup(struct mlx5_core_dev *dev, bool boot)
+static int mlx5_function_setup(struct mlx5_core_dev *dev, u64 timeout)
{
int err;
@@ -1029,11 +1029,11 @@ static int mlx5_function_setup(struct mlx5_core_dev *dev, bool boot)
/* wait for firmware to accept initialization segments configurations
*/
- err = wait_fw_init(dev, mlx5_tout_ms(dev, FW_PRE_INIT_TIMEOUT),
+ err = wait_fw_init(dev, timeout,
mlx5_tout_ms(dev, FW_PRE_INIT_WARN_MESSAGE_INTERVAL));
if (err) {
mlx5_core_err(dev, "Firmware over %llu MS in pre-initializing state, aborting\n",
- mlx5_tout_ms(dev, FW_PRE_INIT_TIMEOUT));
+ timeout);
return err;
}
@@ -1296,7 +1296,7 @@ int mlx5_init_one(struct mlx5_core_dev *dev)
mutex_lock(&dev->intf_state_mutex);
dev->state = MLX5_DEVICE_STATE_UP;
- err = mlx5_function_setup(dev, true);
+ err = mlx5_function_setup(dev, mlx5_tout_ms(dev, FW_PRE_INIT_TIMEOUT));
if (err)
goto err_function;
@@ -1360,9 +1360,10 @@ void mlx5_uninit_one(struct mlx5_core_dev *dev)
mutex_unlock(&dev->intf_state_mutex);
}
-int mlx5_load_one(struct mlx5_core_dev *dev)
+int mlx5_load_one(struct mlx5_core_dev *dev, bool recovery)
{
int err = 0;
+ u64 timeout;
mutex_lock(&dev->intf_state_mutex);
if (test_bit(MLX5_INTERFACE_STATE_UP, &dev->intf_state)) {
@@ -1372,7 +1373,11 @@ int mlx5_load_one(struct mlx5_core_dev *dev)
/* remove any previous indication of internal error */
dev->state = MLX5_DEVICE_STATE_UP;
- err = mlx5_function_setup(dev, false);
+ if (recovery)
+ timeout = mlx5_tout_ms(dev, FW_PRE_INIT_ON_RECOVERY_TIMEOUT);
+ else
+ timeout = mlx5_tout_ms(dev, FW_PRE_INIT_TIMEOUT);
+ err = mlx5_function_setup(dev, timeout);
if (err)
goto err_function;
@@ -1746,7 +1751,7 @@ static void mlx5_pci_resume(struct pci_dev *pdev)
mlx5_pci_trace(dev, "Enter, loading driver..\n");
- err = mlx5_load_one(dev);
+ err = mlx5_load_one(dev, false);
mlx5_pci_trace(dev, "Done, err = %d, device %s\n", err,
!err ? "recovered" : "Failed");
@@ -1833,7 +1838,7 @@ static int mlx5_resume(struct pci_dev *pdev)
{
struct mlx5_core_dev *dev = pci_get_drvdata(pdev);
- return mlx5_load_one(dev);
+ return mlx5_load_one(dev, false);
}
static const struct pci_device_id mlx5_core_pci_table[] = {
@@ -1878,7 +1883,7 @@ int mlx5_recover_device(struct mlx5_core_dev *dev)
return -EIO;
}
- return mlx5_load_one(dev);
+ return mlx5_load_one(dev, true);
}
static struct pci_driver mlx5_core_driver = {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index a9b2d6ead542..9026be1d6223 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -290,7 +290,7 @@ void mlx5_mdev_uninit(struct mlx5_core_dev *dev);
int mlx5_init_one(struct mlx5_core_dev *dev);
void mlx5_uninit_one(struct mlx5_core_dev *dev);
void mlx5_unload_one(struct mlx5_core_dev *dev);
-int mlx5_load_one(struct mlx5_core_dev *dev);
+int mlx5_load_one(struct mlx5_core_dev *dev, bool recovery);
int mlx5_vport_get_other_func_cap(struct mlx5_core_dev *dev, u16 function_id, void *out);
--
2.35.1
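
For readers skimming the diff, the timeout selection added to
mlx5_load_one() can be sketched in standalone C. This is only an
illustration, not driver code: pick_pre_init_timeout_ms() and
simulated_wait_fw_pre_init() are hypothetical names, and the firmware
wait is modeled as a plain comparison rather than the polling loop the
driver uses. The two millisecond values are the defaults the patch lists
in tout_def_sw_val[].

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Defaults copied from tout_def_sw_val[] in the patch above. */
#define PRE_INIT_TIMEOUT_MS             120000ULL  /* 2 minutes */
#define PRE_INIT_ON_RECOVERY_TIMEOUT_MS 7200000ULL /* 2 hours */

/*
 * Hypothetical helper mirroring the if/else added to mlx5_load_one():
 * only the health recovery path gets the long timeout.
 */
uint64_t pick_pre_init_timeout_ms(bool recovery)
{
	return recovery ? PRE_INIT_ON_RECOVERY_TIMEOUT_MS
			: PRE_INIT_TIMEOUT_MS;
}

/*
 * Hypothetical stand-in for the firmware wait. Assumes the firmware
 * sets the pre-init bit ready_after_ms after the wait starts.
 * Returns 0 if that happens within timeout_ms, -1 otherwise.
 */
int simulated_wait_fw_pre_init(uint64_t ready_after_ms, uint64_t timeout_ms)
{
	assert(timeout_ms > 0);
	return ready_after_ms <= timeout_ms ? 0 : -1;
}
```

With these assumptions, a firmware upgrade that takes ten minutes
(600000 ms) fails the wait on the devlink reload and probe paths but
succeeds on the recovery path, which is the behavior the commit message
describes.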