* [PATCH net-next 0/5] Expose grace period delay for devlink health reporter
@ 2025-07-17 16:07 Tariq Toukan
  2025-07-17 16:07 ` [PATCH net-next 1/5] devlink: Move graceful period parameter to reporter ops Tariq Toukan
  ` (5 more replies)

0 siblings, 6 replies; 15+ messages in thread
From: Tariq Toukan @ 2025-07-17 16:07 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller, Jiri Pirko, Jiri Pirko
  Cc: Saeed Mahameed, Gal Pressman, Leon Romanovsky, Shahar Shitrit,
	Donald Hunter, Jonathan Corbet, Brett Creeley, Michael Chan,
	Pavan Chebbi, Cai Huoqing, Tony Nguyen, Przemek Kitszel,
	Sunil Goutham, Linu Cherian, Geetha sowjanya, Jerin Jacob,
	hariprasad, Subbaraya Sundeep, Saeed Mahameed, Tariq Toukan,
	Mark Bloch, Ido Schimmel, Petr Machata, Manish Chopra, netdev,
	linux-kernel, linux-doc, intel-wired-lan, linux-rdma

Hi,

This series by Shahar implements a graceful period delay in the devlink
health reporter, and uses it in the mlx5e driver.

See the detailed feature description by Shahar below [1].

Regards,
Tariq

[1]
Currently, the devlink health reporter initiates the grace period
immediately after recovering an error, which blocks further recovery
attempts until the grace period concludes. Since additional errors are
not generally expected during this short interval, any new error
reported during the grace period is not only rejected, but also causes
the reporter to enter an error state that requires manual intervention.

This approach poses a problem in scenarios where a single root cause
triggers multiple related errors in quick succession - for example, a
PCI issue affecting multiple hardware queues. Because these errors are
closely related and occur rapidly, it is more effective to handle them
together, rather than handling only the first one reported and blocking
any subsequent recovery attempts.
Furthermore, setting the reporter to an error state in this context can
be misleading, as these multiple errors are manifestations of a single
underlying issue, making it unlike the general case where additional
errors are not expected during the grace period.

To resolve this, introduce a configurable grace period delay attribute
in the devlink health reporter. This delay starts when the first error
is recovered and lasts for a user-defined duration. Once this grace
period delay expires, the actual grace period begins. After the grace
period ends, a newly reported error starts the same flow again.

Timeline summary:

----|--------|------------------------------/----------------------/--
 error is  error is    grace period delay        grace period
 reported  recovered  (recoveries allowed)   (recoveries blocked)

The grace period delay thus creates a time window during which recovery
attempts are permitted, allowing all reported errors to be handled
sequentially before the grace period starts. Once the grace period
begins, it prevents any further error recoveries until it ends.

When the grace period delay is set to 0, the current behavior is
preserved.
Shahar Shitrit (5):
  devlink: Move graceful period parameter to reporter ops
  devlink: Move health reporter recovery abort logic to a separate
    function
  devlink: Introduce grace period delay for health reporter
  devlink: Make health reporter grace period delay configurable
  net/mlx5e: Set default grace period delay for TX and RX reporters

 Documentation/netlink/specs/devlink.yaml      |   7 ++
 .../networking/devlink/devlink-health.rst     |   2 +-
 drivers/net/ethernet/amd/pds_core/main.c      |   2 +-
 .../net/ethernet/broadcom/bnxt/bnxt_devlink.c |   2 +-
 .../net/ethernet/huawei/hinic/hinic_devlink.c |  10 +-
 .../net/ethernet/intel/ice/devlink/health.c   |   3 +-
 .../marvell/octeontx2/af/rvu_devlink.c        |  32 +++--
 .../mellanox/mlx5/core/diag/reporter_vnic.c   |   2 +-
 .../mellanox/mlx5/core/en/reporter_rx.c       |  13 ++-
 .../mellanox/mlx5/core/en/reporter_tx.c       |  13 ++-
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  |   2 +-
 .../net/ethernet/mellanox/mlx5/core/health.c  |  41 ++++---
 drivers/net/ethernet/mellanox/mlxsw/core.c    |   2 +-
 drivers/net/ethernet/qlogic/qed/qed_devlink.c |  10 +-
 drivers/net/netdevsim/health.c                |   4 +-
 include/net/devlink.h                         |  15 ++-
 include/uapi/linux/devlink.h                  |   2 +
 net/devlink/health.c                          | 109 +++++++++++++-----
 net/devlink/netlink_gen.c                     |   5 +-
 19 files changed, 191 insertions(+), 85 deletions(-)

base-commit: a96cee9b369ee47b5309311d0d71cb6663b123fc
-- 
2.31.1

^ permalink raw reply	[flat|nested] 15+ messages in thread
* [PATCH net-next 1/5] devlink: Move graceful period parameter to reporter ops
  2025-07-17 16:07 [PATCH net-next 0/5] Expose grace period delay for devlink health reporter Tariq Toukan
@ 2025-07-17 16:07 ` Tariq Toukan
  2025-07-17 16:07 ` [PATCH net-next 2/5] devlink: Move health reporter recovery abort logic to a separate function Tariq Toukan
  ` (4 subsequent siblings)

5 siblings, 0 replies; 15+ messages in thread
From: Tariq Toukan @ 2025-07-17 16:07 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller, Jiri Pirko, Jiri Pirko
  Cc: Saeed Mahameed, Gal Pressman, Leon Romanovsky, Shahar Shitrit,
	Donald Hunter, Jonathan Corbet, Brett Creeley, Michael Chan,
	Pavan Chebbi, Cai Huoqing, Tony Nguyen, Przemek Kitszel,
	Sunil Goutham, Linu Cherian, Geetha sowjanya, Jerin Jacob,
	hariprasad, Subbaraya Sundeep, Saeed Mahameed, Tariq Toukan,
	Mark Bloch, Ido Schimmel, Petr Machata, Manish Chopra, netdev,
	linux-kernel, linux-doc, intel-wired-lan, linux-rdma

From: Shahar Shitrit <shshitrit@nvidia.com>

Move the default graceful period from a parameter of
devlink_health_reporter_create() to a field in the
devlink_health_reporter_ops structure.

This change improves consistency, as the graceful period is inherently
tied to the reporter's behavior and recovery policy. It simplifies the
signature of devlink_health_reporter_create() and its internal helper
functions.

It also centralizes the reporter configuration in the ops structure,
preparing the groundwork for a downstream patch that will introduce a
devlink health reporter grace period delay attribute, whose default
value will similarly be provided by the driver via the ops structure.
Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/net/ethernet/amd/pds_core/main.c      |  2 +-
 .../net/ethernet/broadcom/bnxt/bnxt_devlink.c |  2 +-
 .../net/ethernet/huawei/hinic/hinic_devlink.c | 10 +++--
 .../net/ethernet/intel/ice/devlink/health.c   |  3 +-
 .../marvell/octeontx2/af/rvu_devlink.c        | 32 +++++++++++----
 .../mellanox/mlx5/core/diag/reporter_vnic.c   |  2 +-
 .../mellanox/mlx5/core/en/reporter_rx.c       | 10 +++--
 .../mellanox/mlx5/core/en/reporter_tx.c       | 10 +++--
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  |  2 +-
 .../net/ethernet/mellanox/mlx5/core/health.c  | 41 +++++++++++--------
 drivers/net/ethernet/mellanox/mlxsw/core.c    |  2 +-
 drivers/net/ethernet/qlogic/qed/qed_devlink.c | 10 +++--
 drivers/net/netdevsim/health.c                |  4 +-
 include/net/devlink.h                         | 11 +++--
 net/devlink/health.c                          | 28 +++++-------
 15 files changed, 98 insertions(+), 71 deletions(-)

diff --git a/drivers/net/ethernet/amd/pds_core/main.c b/drivers/net/ethernet/amd/pds_core/main.c
index 9b81e1c260c2..c7a2eff57632 100644
--- a/drivers/net/ethernet/amd/pds_core/main.c
+++ b/drivers/net/ethernet/amd/pds_core/main.c
@@ -280,7 +280,7 @@ static int pdsc_init_pf(struct pdsc *pdsc)
 		goto err_out_del_dev;
 	}
 
-	hr = devl_health_reporter_create(dl, &pdsc_fw_reporter_ops, 0, pdsc);
+	hr = devl_health_reporter_create(dl, &pdsc_fw_reporter_ops, pdsc);
 	if (IS_ERR(hr)) {
 		devl_unlock(dl);
 		dev_warn(pdsc->dev, "Failed to create fw reporter: %pe\n", hr);
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c
index 4c4581b0342e..43fb75806cd6 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c
@@ -220,7 +220,7 @@ __bnxt_dl_reporter_create(struct bnxt *bp,
 {
 	struct devlink_health_reporter *reporter;
 
-	reporter = devlink_health_reporter_create(bp->dl, ops, 0, bp);
+	reporter = devlink_health_reporter_create(bp->dl, ops, bp);
 	if (IS_ERR(reporter)) {
 		netdev_warn(bp->dev, "Failed to create %s health reporter, rc = %ld\n",
 			    ops->name, PTR_ERR(reporter));
diff --git a/drivers/net/ethernet/huawei/hinic/hinic_devlink.c b/drivers/net/ethernet/huawei/hinic/hinic_devlink.c
index 03e42512a2d5..300bc267a259 100644
--- a/drivers/net/ethernet/huawei/hinic/hinic_devlink.c
+++ b/drivers/net/ethernet/huawei/hinic/hinic_devlink.c
@@ -443,8 +443,9 @@ int hinic_health_reporters_create(struct hinic_devlink_priv *priv)
 	struct devlink *devlink = priv_to_devlink(priv);
 
 	priv->hw_fault_reporter =
-		devlink_health_reporter_create(devlink, &hinic_hw_fault_reporter_ops,
-					       0, priv);
+		devlink_health_reporter_create(devlink,
+					       &hinic_hw_fault_reporter_ops,
+					       priv);
 	if (IS_ERR(priv->hw_fault_reporter)) {
 		dev_warn(&priv->hwdev->hwif->pdev->dev, "Failed to create hw fault reporter, err: %ld\n",
 			 PTR_ERR(priv->hw_fault_reporter));
@@ -452,8 +453,9 @@ int hinic_health_reporters_create(struct hinic_devlink_priv *priv)
 	}
 
 	priv->fw_fault_reporter =
-		devlink_health_reporter_create(devlink, &hinic_fw_fault_reporter_ops,
-					       0, priv);
+		devlink_health_reporter_create(devlink,
					       &hinic_fw_fault_reporter_ops,
+					       priv);
 	if (IS_ERR(priv->fw_fault_reporter)) {
 		dev_warn(&priv->hwdev->hwif->pdev->dev, "Failed to create fw fault reporter, err: %ld\n",
 			 PTR_ERR(priv->fw_fault_reporter));
diff --git a/drivers/net/ethernet/intel/ice/devlink/health.c b/drivers/net/ethernet/intel/ice/devlink/health.c
index 19c3d37aa768..3177496e7828 100644
--- a/drivers/net/ethernet/intel/ice/devlink/health.c
+++ b/drivers/net/ethernet/intel/ice/devlink/health.c
@@ -448,9 +448,8 @@ ice_init_devlink_rep(struct ice_pf *pf,
 {
 	struct devlink *devlink = priv_to_devlink(pf);
 	struct devlink_health_reporter *rep;
-	const u64 graceful_period = 0;
 
-	rep = devl_health_reporter_create(devlink, ops, graceful_period, pf);
+	rep = devl_health_reporter_create(devlink, ops, pf);
 	if (IS_ERR(rep)) {
 		struct device *dev = ice_pf_to_dev(pf);
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
index 27c3a2daaaa9..3735372539bd 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
@@ -505,7 +505,9 @@ static int rvu_nix_register_reporters(struct rvu_devlink *rvu_dl)
 	rvu_reporters->nix_event_ctx = nix_event_context;
 
 	rvu_reporters->rvu_hw_nix_intr_reporter =
-		devlink_health_reporter_create(rvu_dl->dl, &rvu_hw_nix_intr_reporter_ops, 0, rvu);
+		devlink_health_reporter_create(rvu_dl->dl,
+					       &rvu_hw_nix_intr_reporter_ops,
+					       rvu);
 	if (IS_ERR(rvu_reporters->rvu_hw_nix_intr_reporter)) {
 		dev_warn(rvu->dev, "Failed to create hw_nix_intr reporter, err=%ld\n",
 			 PTR_ERR(rvu_reporters->rvu_hw_nix_intr_reporter));
@@ -513,7 +515,9 @@ static int rvu_nix_register_reporters(struct rvu_devlink *rvu_dl)
 	}
 
 	rvu_reporters->rvu_hw_nix_gen_reporter =
-		devlink_health_reporter_create(rvu_dl->dl, &rvu_hw_nix_gen_reporter_ops, 0, rvu);
+		devlink_health_reporter_create(rvu_dl->dl,
+					       &rvu_hw_nix_gen_reporter_ops,
+					       rvu);
 	if (IS_ERR(rvu_reporters->rvu_hw_nix_gen_reporter)) {
 		dev_warn(rvu->dev, "Failed to create hw_nix_gen reporter, err=%ld\n",
 			 PTR_ERR(rvu_reporters->rvu_hw_nix_gen_reporter));
@@ -521,7 +525,9 @@ static int rvu_nix_register_reporters(struct rvu_devlink *rvu_dl)
 	}
 
 	rvu_reporters->rvu_hw_nix_err_reporter =
-		devlink_health_reporter_create(rvu_dl->dl, &rvu_hw_nix_err_reporter_ops, 0, rvu);
+		devlink_health_reporter_create(rvu_dl->dl,
+					       &rvu_hw_nix_err_reporter_ops,
+					       rvu);
 	if (IS_ERR(rvu_reporters->rvu_hw_nix_err_reporter)) {
 		dev_warn(rvu->dev, "Failed to create hw_nix_err reporter, err=%ld\n",
 			 PTR_ERR(rvu_reporters->rvu_hw_nix_err_reporter));
@@ -529,7 +535,9 @@ static int rvu_nix_register_reporters(struct rvu_devlink *rvu_dl)
 	}
 
 	rvu_reporters->rvu_hw_nix_ras_reporter =
-		devlink_health_reporter_create(rvu_dl->dl, &rvu_hw_nix_ras_reporter_ops, 0, rvu);
+		devlink_health_reporter_create(rvu_dl->dl,
+					       &rvu_hw_nix_ras_reporter_ops,
+					       rvu);
 	if (IS_ERR(rvu_reporters->rvu_hw_nix_ras_reporter)) {
 		dev_warn(rvu->dev, "Failed to create hw_nix_ras reporter, err=%ld\n",
 			 PTR_ERR(rvu_reporters->rvu_hw_nix_ras_reporter));
@@ -1051,7 +1059,9 @@ static int rvu_npa_register_reporters(struct rvu_devlink *rvu_dl)
 	rvu_reporters->npa_event_ctx = npa_event_context;
 
 	rvu_reporters->rvu_hw_npa_intr_reporter =
-		devlink_health_reporter_create(rvu_dl->dl, &rvu_hw_npa_intr_reporter_ops, 0, rvu);
+		devlink_health_reporter_create(rvu_dl->dl,
+					       &rvu_hw_npa_intr_reporter_ops,
+					       rvu);
 	if (IS_ERR(rvu_reporters->rvu_hw_npa_intr_reporter)) {
 		dev_warn(rvu->dev, "Failed to create hw_npa_intr reporter, err=%ld\n",
 			 PTR_ERR(rvu_reporters->rvu_hw_npa_intr_reporter));
@@ -1059,7 +1069,9 @@ static int rvu_npa_register_reporters(struct rvu_devlink *rvu_dl)
 	}
 
 	rvu_reporters->rvu_hw_npa_gen_reporter =
-		devlink_health_reporter_create(rvu_dl->dl, &rvu_hw_npa_gen_reporter_ops, 0, rvu);
+		devlink_health_reporter_create(rvu_dl->dl,
+					       &rvu_hw_npa_gen_reporter_ops,
+					       rvu);
 	if (IS_ERR(rvu_reporters->rvu_hw_npa_gen_reporter)) {
 		dev_warn(rvu->dev, "Failed to create hw_npa_gen reporter, err=%ld\n",
 			 PTR_ERR(rvu_reporters->rvu_hw_npa_gen_reporter));
@@ -1067,7 +1079,9 @@ static int rvu_npa_register_reporters(struct rvu_devlink *rvu_dl)
 	}
 
 	rvu_reporters->rvu_hw_npa_err_reporter =
-		devlink_health_reporter_create(rvu_dl->dl, &rvu_hw_npa_err_reporter_ops, 0, rvu);
+		devlink_health_reporter_create(rvu_dl->dl,
+					       &rvu_hw_npa_err_reporter_ops,
+					       rvu);
 	if (IS_ERR(rvu_reporters->rvu_hw_npa_err_reporter)) {
 		dev_warn(rvu->dev, "Failed to create hw_npa_err reporter, err=%ld\n",
 			 PTR_ERR(rvu_reporters->rvu_hw_npa_err_reporter));
@@ -1075,7 +1089,9 @@ static int rvu_npa_register_reporters(struct rvu_devlink *rvu_dl)
 	}
 
 	rvu_reporters->rvu_hw_npa_ras_reporter =
-		devlink_health_reporter_create(rvu_dl->dl, &rvu_hw_npa_ras_reporter_ops, 0, rvu);
+		devlink_health_reporter_create(rvu_dl->dl,
+					       &rvu_hw_npa_ras_reporter_ops,
+					       rvu);
 	if (IS_ERR(rvu_reporters->rvu_hw_npa_ras_reporter)) {
 		dev_warn(rvu->dev, "Failed to create hw_npa_ras reporter, err=%ld\n",
 			 PTR_ERR(rvu_reporters->rvu_hw_npa_ras_reporter));
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/reporter_vnic.c b/drivers/net/ethernet/mellanox/mlx5/core/diag/reporter_vnic.c
index 86253a89c24c..878f9b46bf18 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/reporter_vnic.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/reporter_vnic.c
@@ -133,7 +133,7 @@ void mlx5_reporter_vnic_create(struct mlx5_core_dev *dev)
 	health->vnic_reporter =
 		devlink_health_reporter_create(devlink,
 					       &mlx5_reporter_vnic_ops,
-					       0, dev);
+					       dev);
 	if (IS_ERR(health->vnic_reporter))
 		mlx5_core_warn(dev,
 			       "Failed to create vnic reporter, err = %ld\n",
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c
index e75759533ae0..e106f0696486 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c
@@ -644,22 +644,24 @@ void mlx5e_reporter_icosq_resume_recovery(struct mlx5e_channel *c)
 	mutex_unlock(&c->icosq_recovery_lock);
 }
 
+#define MLX5E_REPORTER_RX_GRACEFUL_PERIOD 500
+
 static const struct devlink_health_reporter_ops mlx5_rx_reporter_ops = {
 	.name = "rx",
 	.recover = mlx5e_rx_reporter_recover,
 	.diagnose = mlx5e_rx_reporter_diagnose,
 	.dump = mlx5e_rx_reporter_dump,
+	.default_graceful_period = MLX5E_REPORTER_RX_GRACEFUL_PERIOD,
 };
 
-#define MLX5E_REPORTER_RX_GRACEFUL_PERIOD 500
-
 void mlx5e_reporter_rx_create(struct mlx5e_priv *priv)
 {
+	struct devlink_port *port = priv->netdev->devlink_port;
 	struct devlink_health_reporter *reporter;
 
-	reporter = devlink_port_health_reporter_create(priv->netdev->devlink_port,
+	reporter = devlink_port_health_reporter_create(port,
 						       &mlx5_rx_reporter_ops,
-						       MLX5E_REPORTER_RX_GRACEFUL_PERIOD, priv);
+						       priv);
 	if (IS_ERR(reporter)) {
 		netdev_warn(priv->netdev, "Failed to create rx reporter, err = %ld\n",
 			    PTR_ERR(reporter));
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
index bd96988e102c..6fb0d143ad1b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
@@ -514,22 +514,24 @@ void mlx5e_reporter_tx_ptpsq_unhealthy(struct mlx5e_ptpsq *ptpsq)
 	mlx5e_health_report(priv, priv->tx_reporter, err_str, &err_ctx);
 }
 
+#define MLX5_REPORTER_TX_GRACEFUL_PERIOD 500
+
 static const struct devlink_health_reporter_ops mlx5_tx_reporter_ops = {
 	.name = "tx",
 	.recover = mlx5e_tx_reporter_recover,
 	.diagnose = mlx5e_tx_reporter_diagnose,
 	.dump = mlx5e_tx_reporter_dump,
+	.default_graceful_period = MLX5_REPORTER_TX_GRACEFUL_PERIOD,
 };
 
-#define MLX5_REPORTER_TX_GRACEFUL_PERIOD 500
-
 void mlx5e_reporter_tx_create(struct mlx5e_priv *priv)
 {
+	struct devlink_port *port = priv->netdev->devlink_port;
 	struct devlink_health_reporter *reporter;
 
-	reporter = devlink_port_health_reporter_create(priv->netdev->devlink_port,
+	reporter = devlink_port_health_reporter_create(port,
 						       &mlx5_tx_reporter_ops,
-						       MLX5_REPORTER_TX_GRACEFUL_PERIOD, priv);
+						       priv);
 	if (IS_ERR(reporter)) {
 		netdev_warn(priv->netdev,
 			    "Failed to create tx reporter, err = %ld\n",
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 63a7a788fb0d..b231e7855bca 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -1447,7 +1447,7 @@ static void mlx5e_rep_vnic_reporter_create(struct mlx5e_priv *priv,
 
 	reporter = devl_port_health_reporter_create(dl_port,
 						    &mlx5_rep_vnic_reporter_ops,
-						    0, rpriv);
+						    rpriv);
 	if (IS_ERR(reporter)) {
 		mlx5_core_err(priv->mdev,
 			      "Failed to create representor vnic reporter, err = %ld\n",
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/health.c b/drivers/net/ethernet/mellanox/mlx5/core/health.c
index cf7a1edd0530..6959fea03443 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/health.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/health.c
@@ -669,54 +669,61 @@ static void mlx5_fw_fatal_reporter_err_work(struct work_struct *work)
 	}
 }
 
+#define MLX5_FW_REPORTER_ECPF_GRACEFUL_PERIOD 180000
+#define MLX5_FW_REPORTER_PF_GRACEFUL_PERIOD 60000
+#define MLX5_FW_REPORTER_VF_GRACEFUL_PERIOD 30000
+#define MLX5_FW_REPORTER_DEFAULT_GRACEFUL_PERIOD \
+	MLX5_FW_REPORTER_VF_GRACEFUL_PERIOD
+
+static const
+struct devlink_health_reporter_ops mlx5_fw_fatal_reporter_ecpf_ops = {
+	.name = "fw_fatal",
+	.recover = mlx5_fw_fatal_reporter_recover,
+	.dump = mlx5_fw_fatal_reporter_dump,
+	.default_graceful_period =
+		MLX5_FW_REPORTER_ECPF_GRACEFUL_PERIOD,
+};
+
 static const
 struct devlink_health_reporter_ops mlx5_fw_fatal_reporter_pf_ops = {
 	.name = "fw_fatal",
 	.recover = mlx5_fw_fatal_reporter_recover,
 	.dump = mlx5_fw_fatal_reporter_dump,
+	.default_graceful_period = MLX5_FW_REPORTER_PF_GRACEFUL_PERIOD,
 };
 
 static const struct devlink_health_reporter_ops mlx5_fw_fatal_reporter_ops = {
 	.name = "fw_fatal",
 	.recover = mlx5_fw_fatal_reporter_recover,
+	.default_graceful_period =
+		MLX5_FW_REPORTER_DEFAULT_GRACEFUL_PERIOD,
 };
 
-#define MLX5_FW_REPORTER_ECPF_GRACEFUL_PERIOD 180000
-#define MLX5_FW_REPORTER_PF_GRACEFUL_PERIOD 60000
-#define MLX5_FW_REPORTER_VF_GRACEFUL_PERIOD 30000
-#define MLX5_FW_REPORTER_DEFAULT_GRACEFUL_PERIOD MLX5_FW_REPORTER_VF_GRACEFUL_PERIOD
-
 void mlx5_fw_reporters_create(struct mlx5_core_dev *dev)
 {
 	const struct devlink_health_reporter_ops *fw_fatal_ops;
 	struct mlx5_core_health *health = &dev->priv.health;
 	const struct devlink_health_reporter_ops *fw_ops;
 	struct devlink *devlink = priv_to_devlink(dev);
-	u64 grace_period;
 
-	fw_fatal_ops = &mlx5_fw_fatal_reporter_pf_ops;
 	fw_ops = &mlx5_fw_reporter_pf_ops;
 	if (mlx5_core_is_ecpf(dev)) {
-		grace_period = MLX5_FW_REPORTER_ECPF_GRACEFUL_PERIOD;
+		fw_fatal_ops = &mlx5_fw_fatal_reporter_ecpf_ops;
 	} else if (mlx5_core_is_pf(dev)) {
-		grace_period = MLX5_FW_REPORTER_PF_GRACEFUL_PERIOD;
+		fw_fatal_ops = &mlx5_fw_fatal_reporter_pf_ops;
 	} else {
 		/* VF or SF */
-		grace_period = MLX5_FW_REPORTER_DEFAULT_GRACEFUL_PERIOD;
 		fw_fatal_ops = &mlx5_fw_fatal_reporter_ops;
 		fw_ops = &mlx5_fw_reporter_ops;
 	}
 
-	health->fw_reporter =
-		devl_health_reporter_create(devlink, fw_ops, 0, dev);
+	health->fw_reporter = devl_health_reporter_create(devlink, fw_ops, dev);
 	if (IS_ERR(health->fw_reporter))
 		mlx5_core_warn(dev, "Failed to create fw reporter, err = %ld\n",
 			       PTR_ERR(health->fw_reporter));
 
-	health->fw_fatal_reporter =
-		devl_health_reporter_create(devlink,
-					    fw_fatal_ops,
-					    grace_period,
-					    dev);
+	health->fw_fatal_reporter = devl_health_reporter_create(devlink,
								fw_fatal_ops,
+								dev);
 	if (IS_ERR(health->fw_fatal_reporter))
 		mlx5_core_warn(dev, "Failed to create fw fatal reporter, err = %ld\n",
 			       PTR_ERR(health->fw_fatal_reporter));
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c
index 2bb2b77351bd..980f3223f124 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -2043,7 +2043,7 @@ static int mlxsw_core_health_init(struct mlxsw_core *mlxsw_core)
 		return 0;
 
 	fw_fatal = devl_health_reporter_create(devlink, &mlxsw_core_health_fw_fatal_ops,
-					       0, mlxsw_core);
+					       mlxsw_core);
 	if (IS_ERR(fw_fatal)) {
 		dev_err(mlxsw_core->bus_info->dev, "Failed to create fw fatal reporter");
 		return PTR_ERR(fw_fatal);
diff --git a/drivers/net/ethernet/qlogic/qed/qed_devlink.c b/drivers/net/ethernet/qlogic/qed/qed_devlink.c
index 1adc7fbb3f2f..d000ed734c7c 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_devlink.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_devlink.c
@@ -87,20 +87,22 @@ qed_fw_fatal_reporter_recover(struct devlink_health_reporter *reporter,
 	return 0;
 }
 
+#define QED_REPORTER_FW_GRACEFUL_PERIOD 0
+
 static const struct devlink_health_reporter_ops qed_fw_fatal_reporter_ops = {
 	.name = "fw_fatal",
 	.recover = qed_fw_fatal_reporter_recover,
 	.dump = qed_fw_fatal_reporter_dump,
+	.default_graceful_period = QED_REPORTER_FW_GRACEFUL_PERIOD,
 };
 
-#define QED_REPORTER_FW_GRACEFUL_PERIOD 0
-
 void qed_fw_reporters_create(struct devlink *devlink)
 {
 	struct qed_devlink *dl = devlink_priv(devlink);
 
-	dl->fw_reporter = devlink_health_reporter_create(devlink, &qed_fw_fatal_reporter_ops,
-							 QED_REPORTER_FW_GRACEFUL_PERIOD, dl);
+	dl->fw_reporter =
+		devlink_health_reporter_create(devlink,
+					       &qed_fw_fatal_reporter_ops, dl);
 	if (IS_ERR(dl->fw_reporter)) {
 		DP_NOTICE(dl->cdev, "Failed to create fw reporter, err = %ld\n",
 			  PTR_ERR(dl->fw_reporter));
diff --git a/drivers/net/netdevsim/health.c b/drivers/net/netdevsim/health.c
index 688f05316b5e..3bd0e7a489c3 100644
--- a/drivers/net/netdevsim/health.c
+++ b/drivers/net/netdevsim/health.c
@@ -183,14 +183,14 @@ int nsim_dev_health_init(struct nsim_dev *nsim_dev, struct devlink *devlink)
 	health->empty_reporter =
 		devl_health_reporter_create(devlink,
 					    &nsim_dev_empty_reporter_ops,
-					    0, health);
+					    health);
 	if (IS_ERR(health->empty_reporter))
 		return PTR_ERR(health->empty_reporter);
 
 	health->dummy_reporter =
 		devl_health_reporter_create(devlink,
 					    &nsim_dev_dummy_reporter_ops,
-					    0, health);
+					    health);
 	if (IS_ERR(health->dummy_reporter)) {
 		err = PTR_ERR(health->dummy_reporter);
 		goto err_empty_reporter_destroy;
diff --git a/include/net/devlink.h b/include/net/devlink.h
index 93640a29427c..a65aa24e8df4 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -742,6 +742,8 @@ enum devlink_health_reporter_state {
  *            if priv_ctx is NULL, run a full dump
  * @diagnose: callback to diagnose the current status
  * @test: callback to trigger a test event
+ * @default_graceful_period: default min time (in msec)
+ *			     between recovery attempts
  */
 struct devlink_health_reporter_ops {
@@ -756,6 +758,7 @@ struct devlink_health_reporter_ops {
 		       struct netlink_ext_ack *extack);
 	int (*test)(struct devlink_health_reporter *reporter,
 		    struct netlink_ext_ack *extack);
+	u64 default_graceful_period;
 };
 
 /**
@@ -1924,22 +1927,22 @@ void devlink_fmsg_binary_pair_put(struct devlink_fmsg *fmsg, const char *name,
 struct devlink_health_reporter *
 devl_port_health_reporter_create(struct devlink_port *port,
 				 const struct devlink_health_reporter_ops *ops,
-				 u64 graceful_period, void *priv);
+				 void *priv);
 
 struct devlink_health_reporter *
 devlink_port_health_reporter_create(struct devlink_port *port,
 				    const struct devlink_health_reporter_ops *ops,
-				    u64 graceful_period, void *priv);
+				    void *priv);
 
 struct devlink_health_reporter *
 devl_health_reporter_create(struct devlink *devlink,
 			    const struct devlink_health_reporter_ops *ops,
-			    u64 graceful_period, void *priv);
+			    void *priv);
 
 struct devlink_health_reporter *
 devlink_health_reporter_create(struct devlink *devlink,
 			       const struct devlink_health_reporter_ops *ops,
-			       u64 graceful_period, void *priv);
+			       void *priv);
 
 void
 devl_health_reporter_destroy(struct devlink_health_reporter *reporter);
diff --git a/net/devlink/health.c b/net/devlink/health.c
index b3ce8ecbb7fb..ba144b7426fa 100644
--- a/net/devlink/health.c
+++ b/net/devlink/health.c
@@ -108,11 +108,11 @@ devlink_port_health_reporter_find_by_name(struct devlink_port *devlink_port,
 static struct devlink_health_reporter *
 __devlink_health_reporter_create(struct devlink *devlink,
 				 const struct devlink_health_reporter_ops *ops,
-				 u64 graceful_period, void *priv)
+				 void *priv)
 {
 	struct devlink_health_reporter *reporter;
 
-	if (WARN_ON(graceful_period && !ops->recover))
+	if (WARN_ON(ops->default_graceful_period && !ops->recover))
 		return ERR_PTR(-EINVAL);
 
 	reporter = kzalloc(sizeof(*reporter), GFP_KERNEL);
@@ -122,7 +122,7 @@ __devlink_health_reporter_create(struct devlink *devlink,
 	reporter->priv = priv;
 	reporter->ops = ops;
 	reporter->devlink = devlink;
-	reporter->graceful_period = graceful_period;
+	reporter->graceful_period = ops->default_graceful_period;
 	reporter->auto_recover = !!ops->recover;
 	reporter->auto_dump = !!ops->dump;
 	return reporter;
@@ -134,13 +134,12 @@ __devlink_health_reporter_create(struct devlink *devlink,
  *
  * @port: devlink_port to which health reports will relate
  * @ops: devlink health reporter ops
- * @graceful_period: min time (in msec) between recovery attempts
 * @priv: driver priv pointer
 */
 struct devlink_health_reporter *
 devl_port_health_reporter_create(struct devlink_port *port,
 				 const struct devlink_health_reporter_ops *ops,
-				 u64 graceful_period, void *priv)
+				 void *priv)
 {
 	struct devlink_health_reporter *reporter;
 
@@ -150,8 +149,7 @@ devl_port_health_reporter_create(struct devlink_port *port,
 					    ops->name))
 		return ERR_PTR(-EEXIST);
 
-	reporter = __devlink_health_reporter_create(port->devlink, ops,
-						    graceful_period, priv);
+	reporter = __devlink_health_reporter_create(port->devlink, ops, priv);
 	if (IS_ERR(reporter))
 		return reporter;
 
@@ -164,14 +162,13 @@ EXPORT_SYMBOL_GPL(devl_port_health_reporter_create);
 struct devlink_health_reporter *
 devlink_port_health_reporter_create(struct devlink_port *port,
 				    const struct devlink_health_reporter_ops *ops,
-				    u64 graceful_period, void *priv)
+				    void *priv)
 {
 	struct devlink_health_reporter *reporter;
 	struct devlink *devlink = port->devlink;
 
 	devl_lock(devlink);
-	reporter = devl_port_health_reporter_create(port, ops,
-						    graceful_period, priv);
+	reporter = devl_port_health_reporter_create(port, ops, priv);
 	devl_unlock(devlink);
 	return reporter;
 }
@@ -182,13 +179,12 @@ EXPORT_SYMBOL_GPL(devlink_port_health_reporter_create);
 *
 * @devlink: devlink instance which the health reports will relate
 * @ops: devlink health reporter ops
- * @graceful_period: min time (in msec) between recovery attempts
 * @priv: driver priv pointer
 */
 struct devlink_health_reporter *
 devl_health_reporter_create(struct devlink *devlink,
 			    const struct devlink_health_reporter_ops *ops,
-			    u64 graceful_period, void *priv)
+			    void *priv)
 {
 	struct devlink_health_reporter *reporter;
 
@@ -197,8 +193,7 @@ devl_health_reporter_create(struct devlink *devlink,
 	if (devlink_health_reporter_find_by_name(devlink, ops->name))
 		return ERR_PTR(-EEXIST);
 
-	reporter = __devlink_health_reporter_create(devlink, ops,
-						    graceful_period, priv);
+	reporter = __devlink_health_reporter_create(devlink, ops, priv);
 	if (IS_ERR(reporter))
 		return reporter;
 
@@ -210,13 +205,12 @@ EXPORT_SYMBOL_GPL(devl_health_reporter_create);
 struct devlink_health_reporter *
 devlink_health_reporter_create(struct devlink *devlink,
 			       const struct devlink_health_reporter_ops *ops,
-			       u64 graceful_period, void *priv)
+			       void *priv)
 {
 	struct devlink_health_reporter *reporter;
 
 	devl_lock(devlink);
-	reporter = devl_health_reporter_create(devlink, ops,
-					       graceful_period, priv);
+	reporter = devl_health_reporter_create(devlink, ops, priv);
 	devl_unlock(devlink);
 	return reporter;
 }
-- 
2.31.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread
* [PATCH net-next 2/5] devlink: Move health reporter recovery abort logic to a separate function
  2025-07-17 16:07 [PATCH net-next 0/5] Expose grace period delay for devlink health reporter Tariq Toukan
  2025-07-17 16:07 ` [PATCH net-next 1/5] devlink: Move graceful period parameter to reporter ops Tariq Toukan
@ 2025-07-17 16:07 ` Tariq Toukan
  2025-07-17 16:07 ` [PATCH net-next 3/5] devlink: Introduce grace period delay for health reporter Tariq Toukan
  ` (3 subsequent siblings)

5 siblings, 0 replies; 15+ messages in thread
From: Tariq Toukan @ 2025-07-17 16:07 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller, Jiri Pirko, Jiri Pirko
  Cc: Saeed Mahameed, Gal Pressman, Leon Romanovsky, Shahar Shitrit,
	Donald Hunter, Jonathan Corbet, Brett Creeley, Michael Chan,
	Pavan Chebbi, Cai Huoqing, Tony Nguyen, Przemek Kitszel,
	Sunil Goutham, Linu Cherian, Geetha sowjanya, Jerin Jacob,
	hariprasad, Subbaraya Sundeep, Saeed Mahameed, Tariq Toukan,
	Mark Bloch, Ido Schimmel, Petr Machata, Manish Chopra, netdev,
	linux-kernel, linux-doc, intel-wired-lan, linux-rdma

From: Shahar Shitrit <shshitrit@nvidia.com>

Extract the health reporter recovery abort logic into a separate
function, devlink_health_recover_abort().
The function encapsulates the conditions for aborting recovery:
- When auto-recovery is disabled
- When the previous error wasn't recovered
- When within the grace period after the last recovery

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 net/devlink/health.c | 31 +++++++++++++++++++++++--------
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/net/devlink/health.c b/net/devlink/health.c
index ba144b7426fa..9d0d4a9face7 100644
--- a/net/devlink/health.c
+++ b/net/devlink/health.c
@@ -586,12 +586,33 @@ static int devlink_health_do_dump(struct devlink_health_reporter *reporter,
 	return err;
 }
 
+static bool
+devlink_health_recover_abort(struct devlink_health_reporter *reporter,
+			     enum devlink_health_reporter_state prev_state)
+{
+	unsigned long recover_ts_threshold;
+
+	if (!reporter->auto_recover)
+		return false;
+
+	/* abort if the previous error wasn't recovered */
+	if (prev_state != DEVLINK_HEALTH_REPORTER_STATE_HEALTHY)
+		return true;
+
+	recover_ts_threshold = reporter->last_recovery_ts +
+			       msecs_to_jiffies(reporter->graceful_period);
+	if (reporter->last_recovery_ts && reporter->recovery_count &&
+	    time_is_after_jiffies(recover_ts_threshold))
+		return true;
+
+	return false;
+}
+
 int devlink_health_report(struct devlink_health_reporter *reporter,
 			  const char *msg, void *priv_ctx)
 {
 	enum devlink_health_reporter_state prev_health_state;
 	struct devlink *devlink = reporter->devlink;
-	unsigned long recover_ts_threshold;
 	int ret;
 
 	/* write a log message of the current error */
@@ -602,13 +623,7 @@ int devlink_health_report(struct devlink_health_reporter *reporter,
 	reporter->health_state = DEVLINK_HEALTH_REPORTER_STATE_ERROR;
 	devlink_recover_notify(reporter, DEVLINK_CMD_HEALTH_REPORTER_RECOVER);
 
-	/* abort if the previous error wasn't recovered */
-	recover_ts_threshold = reporter->last_recovery_ts +
-			       msecs_to_jiffies(reporter->graceful_period);
-	if (reporter->auto_recover &&
-	    (prev_health_state != DEVLINK_HEALTH_REPORTER_STATE_HEALTHY ||
-	     (reporter->last_recovery_ts && reporter->recovery_count &&
-	      time_is_after_jiffies(recover_ts_threshold)))) {
+	if (devlink_health_recover_abort(reporter, prev_health_state)) {
 		trace_devlink_health_recover_aborted(devlink,
 						     reporter->ops->name,
 						     reporter->health_state,
-- 
2.31.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread
* [PATCH net-next 3/5] devlink: Introduce grace period delay for health reporter 2025-07-17 16:07 [PATCH net-next 0/5] Expose grace period delay for devlink health reporter Tariq Toukan 2025-07-17 16:07 ` [PATCH net-next 1/5] devlink: Move graceful period parameter to reporter ops Tariq Toukan 2025-07-17 16:07 ` [PATCH net-next 2/5] devlink: Move health reporter recovery abort logic to a separate function Tariq Toukan @ 2025-07-17 16:07 ` Tariq Toukan 2025-07-17 16:07 ` [PATCH net-next 4/5] devlink: Make health reporter grace period delay configurable Tariq Toukan ` (2 subsequent siblings) 5 siblings, 0 replies; 15+ messages in thread From: Tariq Toukan @ 2025-07-17 16:07 UTC (permalink / raw) To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn, David S. Miller, Jiri Pirko, Jiri Pirko Cc: Saeed Mahameed, Gal Pressman, Leon Romanovsky, Shahar Shitrit, Donald Hunter, Jonathan Corbet, Brett Creeley, Michael Chan, Pavan Chebbi, Cai Huoqing, Tony Nguyen, Przemek Kitszel, Sunil Goutham, Linu Cherian, Geetha sowjanya, Jerin Jacob, hariprasad, Subbaraya Sundeep, Saeed Mahameed, Tariq Toukan, Mark Bloch, Ido Schimmel, Petr Machata, Manish Chopra, netdev, linux-kernel, linux-doc, intel-wired-lan, linux-rdma From: Shahar Shitrit <shshitrit@nvidia.com> Currently, the devlink health reporter starts the grace period immediately after handling an error, blocking any further recoveries until it finishes. However, when a single root cause triggers multiple errors in a short time frame, it is desirable to treat them as a bulk of errors and to allow their recoveries, avoiding premature blocking of subsequent related errors, and reducing the risk of inconsistent or incomplete error handling. To address this, introduce a configurable grace period delay for the devlink health reporter. Start this delay when the first error is handled, and allow recovery attempts for reported errors during this window. 
Once the delay expires, begin the grace period to block further recoveries until it concludes. Timeline summary: ----|--------|------------------------------/----------------------/-- error is error is grace period delay grace period reported recovered (recoveries allowed) (recoveries blocked) For calculating the grace period delay duration, use the same last_recovery_ts as the grace period. Update it on recovery only when the delay is inactive (either disabled or at the first error). This patch implements the framework for the grace period delay and effectively sets its value to 0 at reporter creation, so the current behavior remains unchanged, which ensures backward compatibility. A downstream patch will make the grace period delay configurable. Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> --- include/net/devlink.h | 4 ++++ net/devlink/health.c | 22 +++++++++++++++++++++- 2 files changed, 25 insertions(+), 1 deletion(-) diff --git a/include/net/devlink.h b/include/net/devlink.h index a65aa24e8df4..3ab85de9c862 100644 --- a/include/net/devlink.h +++ b/include/net/devlink.h @@ -744,6 +744,9 @@ enum devlink_health_reporter_state { * @test: callback to trigger a test event * @default_graceful_period: default min time (in msec) between recovery attempts + * @default_graceful_period_delay: default time (in msec) for + * error recoveries before + * starting the grace period */ struct devlink_health_reporter_ops { @@ -759,6 +762,7 @@ struct devlink_health_reporter_ops { int (*test)(struct devlink_health_reporter *reporter, struct netlink_ext_ack *extack); u64 default_graceful_period; + u64 default_graceful_period_delay; }; /** diff --git a/net/devlink/health.c b/net/devlink/health.c index 9d0d4a9face7..a0269975f592 100644 --- a/net/devlink/health.c +++ b/net/devlink/health.c @@ -60,6 +60,7 @@ struct devlink_health_reporter { 
struct devlink_port *devlink_port; struct devlink_fmsg *dump_fmsg; u64 graceful_period; + u64 graceful_period_delay; bool auto_recover; bool auto_dump; u8 health_state; @@ -123,6 +124,7 @@ __devlink_health_reporter_create(struct devlink *devlink, reporter->ops = ops; reporter->devlink = devlink; reporter->graceful_period = ops->default_graceful_period; + reporter->graceful_period_delay = ops->default_graceful_period_delay; reporter->auto_recover = !!ops->recover; reporter->auto_dump = !!ops->dump; return reporter; @@ -508,11 +510,25 @@ static void devlink_recover_notify(struct devlink_health_reporter *reporter, devlink_nl_notify_send_desc(devlink, msg, &desc); } +static bool +devlink_health_reporter_delay_active(struct devlink_health_reporter *reporter) +{ + unsigned long delay_threshold = reporter->last_recovery_ts + + msecs_to_jiffies(reporter->graceful_period_delay); + + return time_is_after_jiffies(delay_threshold); +} + void devlink_health_reporter_recovery_done(struct devlink_health_reporter *reporter) { reporter->recovery_count++; - reporter->last_recovery_ts = jiffies; + if (!devlink_health_reporter_delay_active(reporter)) + /* When grace period delay is set, last_recovery_ts marks + * the first recovery within the delay, not necessarily the + * last one. + */ + reporter->last_recovery_ts = jiffies; } EXPORT_SYMBOL_GPL(devlink_health_reporter_recovery_done); @@ -599,7 +615,11 @@ devlink_health_recover_abort(struct devlink_health_reporter *reporter, if (prev_state != DEVLINK_HEALTH_REPORTER_STATE_HEALTHY) return true; + if (devlink_health_reporter_delay_active(reporter)) + return false; + recover_ts_threshold = reporter->last_recovery_ts + + msecs_to_jiffies(reporter->graceful_period_delay) + msecs_to_jiffies(reporter->graceful_period); if (reporter->last_recovery_ts && reporter->recovery_count && time_is_after_jiffies(recover_ts_threshold)) -- 2.31.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH net-next 4/5] devlink: Make health reporter grace period delay configurable 2025-07-17 16:07 [PATCH net-next 0/5] Expose grace period delay for devlink health reporter Tariq Toukan ` (2 preceding siblings ...) 2025-07-17 16:07 ` [PATCH net-next 3/5] devlink: Introduce grace period delay for health reporter Tariq Toukan @ 2025-07-17 16:07 ` Tariq Toukan 2025-07-19 0:48 ` Jakub Kicinski 2025-07-19 0:51 ` Jakub Kicinski 2025-07-17 16:07 ` [PATCH net-next 5/5] net/mlx5e: Set default grace period delay for TX and RX reporters Tariq Toukan 2025-07-19 0:47 ` [PATCH net-next 0/5] Expose grace period delay for devlink health reporter Jakub Kicinski 5 siblings, 2 replies; 15+ messages in thread From: Tariq Toukan @ 2025-07-17 16:07 UTC (permalink / raw) To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn, David S. Miller, Jiri Pirko, Jiri Pirko Cc: Saeed Mahameed, Gal Pressman, Leon Romanovsky, Shahar Shitrit, Donald Hunter, Jonathan Corbet, Brett Creeley, Michael Chan, Pavan Chebbi, Cai Huoqing, Tony Nguyen, Przemek Kitszel, Sunil Goutham, Linu Cherian, Geetha sowjanya, Jerin Jacob, hariprasad, Subbaraya Sundeep, Saeed Mahameed, Tariq Toukan, Mark Bloch, Ido Schimmel, Petr Machata, Manish Chopra, netdev, linux-kernel, linux-doc, intel-wired-lan, linux-rdma From: Shahar Shitrit <shshitrit@nvidia.com> Enable configuration of the grace period delay — a time window starting from the first error recovery, during which the reporter allows recovery attempts for each reported error. This feature is helpful when a single underlying issue causes multiple errors, as it delays the start of the grace period to allow sufficient time for recovering all related errors. For example, if multiple TX queues time out simultaneously, a sufficient grace period delay could allow all affected TX queues to be recovered within that window. 
Without this delay, only the first TX queue that reports a timeout will undergo recovery, while the remaining TX queues will be blocked once the grace period begins. Configuration example: $ devlink health set pci/0000:00:09.0 reporter tx grace_period_delay 500 Configuration example with ynl: ./tools/net/ynl/pyynl/cli.py \ --spec Documentation/netlink/specs/devlink.yaml \ --do health-reporter-set --json '{ "bus-name": "auxiliary", "dev-name": "mlx5_core.eth.0", "port-index": 65535, "health-reporter-name": "tx", "health-reporter-graceful-period-delay": 500 }' Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> --- Documentation/netlink/specs/devlink.yaml | 7 +++++ .../networking/devlink/devlink-health.rst | 2 +- include/uapi/linux/devlink.h | 2 ++ net/devlink/health.c | 30 +++++++++++++++++-- net/devlink/netlink_gen.c | 5 ++-- 5 files changed, 40 insertions(+), 6 deletions(-) diff --git a/Documentation/netlink/specs/devlink.yaml b/Documentation/netlink/specs/devlink.yaml index 1c4bb0cbe5f0..c6cc0ce18685 100644 --- a/Documentation/netlink/specs/devlink.yaml +++ b/Documentation/netlink/specs/devlink.yaml @@ -848,6 +848,10 @@ attribute-sets: - name: region-direct type: flag + - + name: health-reporter-graceful-period-delay + type: u64 + - name: rate-tc-bws type: nest @@ -1228,6 +1232,8 @@ attribute-sets: name: health-reporter-dump-ts-ns - name: health-reporter-auto-dump + - + name: health-reporter-graceful-period-delay - name: dl-attr-stats @@ -1965,6 +1971,7 @@ operations: - health-reporter-graceful-period - health-reporter-auto-recover - health-reporter-auto-dump + - health-reporter-graceful-period-delay - name: health-reporter-recover diff --git a/Documentation/networking/devlink/devlink-health.rst b/Documentation/networking/devlink/devlink-health.rst index e0b8cfed610a..07602f678282 100644 --- 
a/Documentation/networking/devlink/devlink-health.rst +++ b/Documentation/networking/devlink/devlink-health.rst @@ -50,7 +50,7 @@ Once an error is reported, devlink health will perform the following actions: * Auto recovery attempt is being done. Depends on: - Auto-recovery configuration - - Grace period vs. time passed since last recover + - Grace period (and grace period delay) vs. time passed since last recover Devlink formatted message ========================= diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h index e72bcc239afd..42a11b7e4a70 100644 --- a/include/uapi/linux/devlink.h +++ b/include/uapi/linux/devlink.h @@ -634,6 +634,8 @@ enum devlink_attr { DEVLINK_ATTR_REGION_DIRECT, /* flag */ + DEVLINK_ATTR_HEALTH_REPORTER_GRACEFUL_PERIOD_DELAY, /* u64 */ + DEVLINK_ATTR_RATE_TC_BWS, /* nested */ DEVLINK_ATTR_RATE_TC_INDEX, /* u8 */ DEVLINK_ATTR_RATE_TC_BW, /* u32 */ diff --git a/net/devlink/health.c b/net/devlink/health.c index a0269975f592..5699779fce77 100644 --- a/net/devlink/health.c +++ b/net/devlink/health.c @@ -113,7 +113,9 @@ __devlink_health_reporter_create(struct devlink *devlink, { struct devlink_health_reporter *reporter; - if (WARN_ON(ops->default_graceful_period && !ops->recover)) + if (WARN_ON(ops->default_graceful_period_delay && + !ops->default_graceful_period) || + WARN_ON(ops->default_graceful_period && !ops->recover)) return ERR_PTR(-EINVAL); reporter = kzalloc(sizeof(*reporter), GFP_KERNEL); @@ -293,6 +295,11 @@ devlink_nl_health_reporter_fill(struct sk_buff *msg, devlink_nl_put_u64(msg, DEVLINK_ATTR_HEALTH_REPORTER_GRACEFUL_PERIOD, reporter->graceful_period)) goto reporter_nest_cancel; + if (reporter->ops->recover && + devlink_nl_put_u64(msg, + DEVLINK_ATTR_HEALTH_REPORTER_GRACEFUL_PERIOD_DELAY, + reporter->graceful_period_delay)) + goto reporter_nest_cancel; if (reporter->ops->recover && nla_put_u8(msg, DEVLINK_ATTR_HEALTH_REPORTER_AUTO_RECOVER, reporter->auto_recover)) @@ -458,16 +465,33 @@ int 
devlink_nl_health_reporter_set_doit(struct sk_buff *skb, if (!reporter->ops->recover && (info->attrs[DEVLINK_ATTR_HEALTH_REPORTER_GRACEFUL_PERIOD] || - info->attrs[DEVLINK_ATTR_HEALTH_REPORTER_AUTO_RECOVER])) + info->attrs[DEVLINK_ATTR_HEALTH_REPORTER_AUTO_RECOVER] || + info->attrs[DEVLINK_ATTR_HEALTH_REPORTER_GRACEFUL_PERIOD_DELAY])) return -EOPNOTSUPP; if (!reporter->ops->dump && info->attrs[DEVLINK_ATTR_HEALTH_REPORTER_AUTO_DUMP]) return -EOPNOTSUPP; - if (info->attrs[DEVLINK_ATTR_HEALTH_REPORTER_GRACEFUL_PERIOD]) + if (info->attrs[DEVLINK_ATTR_HEALTH_REPORTER_GRACEFUL_PERIOD]) { reporter->graceful_period = nla_get_u64(info->attrs[DEVLINK_ATTR_HEALTH_REPORTER_GRACEFUL_PERIOD]); + if (!reporter->graceful_period) + reporter->graceful_period_delay = 0; + } + + if (info->attrs[DEVLINK_ATTR_HEALTH_REPORTER_GRACEFUL_PERIOD_DELAY]) { + u64 configured_delay = + nla_get_u64(info->attrs[DEVLINK_ATTR_HEALTH_REPORTER_GRACEFUL_PERIOD_DELAY]); + + if (!reporter->graceful_period && configured_delay) { + NL_SET_ERR_MSG_MOD(info->extack, + "Cannot set grace period delay without a grace period."); + return -EINVAL; + } + + reporter->graceful_period_delay = configured_delay; + } if (info->attrs[DEVLINK_ATTR_HEALTH_REPORTER_AUTO_RECOVER]) reporter->auto_recover = diff --git a/net/devlink/netlink_gen.c b/net/devlink/netlink_gen.c index c50436433c18..b0f38253d163 100644 --- a/net/devlink/netlink_gen.c +++ b/net/devlink/netlink_gen.c @@ -389,7 +389,7 @@ static const struct nla_policy devlink_health_reporter_get_dump_nl_policy[DEVLIN }; /* DEVLINK_CMD_HEALTH_REPORTER_SET - do */ -static const struct nla_policy devlink_health_reporter_set_nl_policy[DEVLINK_ATTR_HEALTH_REPORTER_AUTO_DUMP + 1] = { +static const struct nla_policy devlink_health_reporter_set_nl_policy[DEVLINK_ATTR_HEALTH_REPORTER_GRACEFUL_PERIOD_DELAY + 1] = { [DEVLINK_ATTR_BUS_NAME] = { .type = NLA_NUL_STRING, }, [DEVLINK_ATTR_DEV_NAME] = { .type = NLA_NUL_STRING, }, [DEVLINK_ATTR_PORT_INDEX] = { .type = NLA_U32, }, @@ 
-397,6 +397,7 @@ static const struct nla_policy devlink_health_reporter_set_nl_policy[DEVLINK_ATT [DEVLINK_ATTR_HEALTH_REPORTER_GRACEFUL_PERIOD] = { .type = NLA_U64, }, [DEVLINK_ATTR_HEALTH_REPORTER_AUTO_RECOVER] = { .type = NLA_U8, }, [DEVLINK_ATTR_HEALTH_REPORTER_AUTO_DUMP] = { .type = NLA_U8, }, + [DEVLINK_ATTR_HEALTH_REPORTER_GRACEFUL_PERIOD_DELAY] = { .type = NLA_U64, }, }; /* DEVLINK_CMD_HEALTH_REPORTER_RECOVER - do */ @@ -1032,7 +1033,7 @@ const struct genl_split_ops devlink_nl_ops[74] = { .doit = devlink_nl_health_reporter_set_doit, .post_doit = devlink_nl_post_doit, .policy = devlink_health_reporter_set_nl_policy, - .maxattr = DEVLINK_ATTR_HEALTH_REPORTER_AUTO_DUMP, + .maxattr = DEVLINK_ATTR_HEALTH_REPORTER_GRACEFUL_PERIOD_DELAY, .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO, }, { -- 2.31.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH net-next 4/5] devlink: Make health reporter grace period delay configurable 2025-07-17 16:07 ` [PATCH net-next 4/5] devlink: Make health reporter grace period delay configurable Tariq Toukan @ 2025-07-19 0:48 ` Jakub Kicinski 2025-07-20 10:11 ` Tariq Toukan 2025-07-19 0:51 ` Jakub Kicinski 1 sibling, 1 reply; 15+ messages in thread From: Jakub Kicinski @ 2025-07-19 0:48 UTC (permalink / raw) To: Tariq Toukan Cc: Eric Dumazet, Paolo Abeni, Andrew Lunn, David S. Miller, Jiri Pirko, Jiri Pirko, Saeed Mahameed, Gal Pressman, Leon Romanovsky, Shahar Shitrit, Donald Hunter, Jonathan Corbet, Brett Creeley, Michael Chan, Pavan Chebbi, Cai Huoqing, Tony Nguyen, Przemek Kitszel, Sunil Goutham, Linu Cherian, Geetha sowjanya, Jerin Jacob, hariprasad, Subbaraya Sundeep, Saeed Mahameed, Mark Bloch, Ido Schimmel, Petr Machata, Manish Chopra, netdev, linux-kernel, linux-doc, intel-wired-lan, linux-rdma On Thu, 17 Jul 2025 19:07:21 +0300 Tariq Toukan wrote: > + DEVLINK_ATTR_HEALTH_REPORTER_GRACEFUL_PERIOD_DELAY, /* u64 */ /me pulls out a ruler 50 characters, -ENAMETOOLONG -- pw-bot: cr ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next 4/5] devlink: Make health reporter grace period delay configurable 2025-07-19 0:48 ` Jakub Kicinski @ 2025-07-20 10:11 ` Tariq Toukan 0 siblings, 0 replies; 15+ messages in thread From: Tariq Toukan @ 2025-07-20 10:11 UTC (permalink / raw) To: Jakub Kicinski, Tariq Toukan Cc: Eric Dumazet, Paolo Abeni, Andrew Lunn, David S. Miller, Jiri Pirko, Jiri Pirko, Saeed Mahameed, Gal Pressman, Leon Romanovsky, Shahar Shitrit, Donald Hunter, Jonathan Corbet, Brett Creeley, Michael Chan, Pavan Chebbi, Cai Huoqing, Tony Nguyen, Przemek Kitszel, Sunil Goutham, Linu Cherian, Geetha sowjanya, Jerin Jacob, hariprasad, Subbaraya Sundeep, Saeed Mahameed, Mark Bloch, Ido Schimmel, Petr Machata, Manish Chopra, netdev, linux-kernel, linux-doc, intel-wired-lan, linux-rdma On 19/07/2025 3:48, Jakub Kicinski wrote: > On Thu, 17 Jul 2025 19:07:21 +0300 Tariq Toukan wrote: >> + DEVLINK_ATTR_HEALTH_REPORTER_GRACEFUL_PERIOD_DELAY, /* u64 */ > > /me pulls out a ruler > > 50 characters, -ENAMETOOLONG We'll address. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next 4/5] devlink: Make health reporter grace period delay configurable 2025-07-17 16:07 ` [PATCH net-next 4/5] devlink: Make health reporter grace period delay configurable Tariq Toukan 2025-07-19 0:48 ` Jakub Kicinski @ 2025-07-19 0:51 ` Jakub Kicinski 2025-07-20 10:47 ` Tariq Toukan 1 sibling, 1 reply; 15+ messages in thread From: Jakub Kicinski @ 2025-07-19 0:51 UTC (permalink / raw) To: Tariq Toukan Cc: Eric Dumazet, Paolo Abeni, Andrew Lunn, David S. Miller, Jiri Pirko, Jiri Pirko, Saeed Mahameed, Gal Pressman, Leon Romanovsky, Shahar Shitrit, Donald Hunter, Jonathan Corbet, Brett Creeley, Michael Chan, Pavan Chebbi, Cai Huoqing, Tony Nguyen, Przemek Kitszel, Sunil Goutham, Linu Cherian, Geetha sowjanya, Jerin Jacob, hariprasad, Subbaraya Sundeep, Saeed Mahameed, Mark Bloch, Ido Schimmel, Petr Machata, Manish Chopra, netdev, linux-kernel, linux-doc, intel-wired-lan, linux-rdma On Thu, 17 Jul 2025 19:07:21 +0300 Tariq Toukan wrote: > diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h > index e72bcc239afd..42a11b7e4a70 100644 > --- a/include/uapi/linux/devlink.h > +++ b/include/uapi/linux/devlink.h > @@ -634,6 +634,8 @@ enum devlink_attr { > > DEVLINK_ATTR_REGION_DIRECT, /* flag */ > > + DEVLINK_ATTR_HEALTH_REPORTER_GRACEFUL_PERIOD_DELAY, /* u64 */ > + > DEVLINK_ATTR_RATE_TC_BWS, /* nested */ > DEVLINK_ATTR_RATE_TC_INDEX, /* u8 */ > DEVLINK_ATTR_RATE_TC_BW, /* u32 */ BTW the patch from Carolina to cut the TC attributes from the main enum is higher prio than this. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next 4/5] devlink: Make health reporter grace period delay configurable 2025-07-19 0:51 ` Jakub Kicinski @ 2025-07-20 10:47 ` Tariq Toukan 0 siblings, 0 replies; 15+ messages in thread From: Tariq Toukan @ 2025-07-20 10:47 UTC (permalink / raw) To: Jakub Kicinski, Tariq Toukan Cc: Eric Dumazet, Paolo Abeni, Andrew Lunn, David S. Miller, Jiri Pirko, Jiri Pirko, Saeed Mahameed, Gal Pressman, Leon Romanovsky, Shahar Shitrit, Donald Hunter, Jonathan Corbet, Brett Creeley, Michael Chan, Pavan Chebbi, Cai Huoqing, Tony Nguyen, Przemek Kitszel, Sunil Goutham, Linu Cherian, Geetha sowjanya, Jerin Jacob, hariprasad, Subbaraya Sundeep, Saeed Mahameed, Mark Bloch, Ido Schimmel, Petr Machata, Manish Chopra, netdev, linux-kernel, linux-doc, intel-wired-lan, linux-rdma On 19/07/2025 3:51, Jakub Kicinski wrote: > On Thu, 17 Jul 2025 19:07:21 +0300 Tariq Toukan wrote: >> diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h >> index e72bcc239afd..42a11b7e4a70 100644 >> --- a/include/uapi/linux/devlink.h >> +++ b/include/uapi/linux/devlink.h >> @@ -634,6 +634,8 @@ enum devlink_attr { >> >> DEVLINK_ATTR_REGION_DIRECT, /* flag */ >> >> + DEVLINK_ATTR_HEALTH_REPORTER_GRACEFUL_PERIOD_DELAY, /* u64 */ >> + >> DEVLINK_ATTR_RATE_TC_BWS, /* nested */ >> DEVLINK_ATTR_RATE_TC_INDEX, /* u8 */ >> DEVLINK_ATTR_RATE_TC_BW, /* u32 */ > > BTW the patch from Carolina to cut the TC attributes from the main enum > is higher prio that this. > WIP. We'll send it soon. ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH net-next 5/5] net/mlx5e: Set default grace period delay for TX and RX reporters 2025-07-17 16:07 [PATCH net-next 0/5] Expose grace period delay for devlink health reporter Tariq Toukan ` (3 preceding siblings ...) 2025-07-17 16:07 ` [PATCH net-next 4/5] devlink: Make health reporter grace period delay configurable Tariq Toukan @ 2025-07-17 16:07 ` Tariq Toukan 2025-07-19 0:47 ` [PATCH net-next 0/5] Expose grace period delay for devlink health reporter Jakub Kicinski 5 siblings, 0 replies; 15+ messages in thread From: Tariq Toukan @ 2025-07-17 16:07 UTC (permalink / raw) To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn, David S. Miller, Jiri Pirko, Jiri Pirko Cc: Saeed Mahameed, Gal Pressman, Leon Romanovsky, Shahar Shitrit, Donald Hunter, Jonathan Corbet, Brett Creeley, Michael Chan, Pavan Chebbi, Cai Huoqing, Tony Nguyen, Przemek Kitszel, Sunil Goutham, Linu Cherian, Geetha sowjanya, Jerin Jacob, hariprasad, Subbaraya Sundeep, Saeed Mahameed, Tariq Toukan, Mark Bloch, Ido Schimmel, Petr Machata, Manish Chopra, netdev, linux-kernel, linux-doc, intel-wired-lan, linux-rdma From: Shahar Shitrit <shshitrit@nvidia.com> System errors can sometimes cause multiple errors to be reported to the TX reporter at the same time. For instance, lost interrupts may cause several SQs to time out simultaneously. When dev_watchdog notifies the driver of this, it iterates over all SQs to trigger recovery for the timed-out ones, via the TX health reporter. However, the grace period allows only one recovery at a time, so only the first SQ recovers while others remain blocked. Since no further recoveries are allowed during the grace period, subsequent errors cause the reporter to enter an ERROR state, requiring manual intervention. To address this, set the TX reporter's default grace period delay to 0.5 seconds. This allows the reporter to detect and handle all timed-out SQs within this delay window before initiating the grace period. 
To account for the possibility of a similar issue in the RX reporter, its default grace period delay is also configured. Additionally, while here, align the TX definition prefix with the RX, as these are used only in EN driver. Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> --- drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c | 3 +++ drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c | 7 +++++-- 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c index e106f0696486..feb3f2bce830 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c @@ -645,6 +645,7 @@ void mlx5e_reporter_icosq_resume_recovery(struct mlx5e_channel *c) } #define MLX5E_REPORTER_RX_GRACEFUL_PERIOD 500 +#define MLX5E_REPORTER_RX_GRACEFUL_PERIOD_DELAY 500 static const struct devlink_health_reporter_ops mlx5_rx_reporter_ops = { .name = "rx", @@ -652,6 +653,8 @@ static const struct devlink_health_reporter_ops mlx5_rx_reporter_ops = { .diagnose = mlx5e_rx_reporter_diagnose, .dump = mlx5e_rx_reporter_dump, .default_graceful_period = MLX5E_REPORTER_RX_GRACEFUL_PERIOD, + .default_graceful_period_delay = + MLX5E_REPORTER_RX_GRACEFUL_PERIOD_DELAY, }; void mlx5e_reporter_rx_create(struct mlx5e_priv *priv) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c index 6fb0d143ad1b..515b77585926 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c @@ -514,14 +514,17 @@ void mlx5e_reporter_tx_ptpsq_unhealthy(struct mlx5e_ptpsq *ptpsq) mlx5e_health_report(priv, priv->tx_reporter, err_str, &err_ctx); } -#define 
MLX5_REPORTER_TX_GRACEFUL_PERIOD 500 +#define MLX5E_REPORTER_TX_GRACEFUL_PERIOD 500 +#define MLX5E_REPORTER_TX_GRACEFUL_PERIOD_DELAY 500 static const struct devlink_health_reporter_ops mlx5_tx_reporter_ops = { .name = "tx", .recover = mlx5e_tx_reporter_recover, .diagnose = mlx5e_tx_reporter_diagnose, .dump = mlx5e_tx_reporter_dump, - .default_graceful_period = MLX5_REPORTER_TX_GRACEFUL_PERIOD, + .default_graceful_period = MLX5E_REPORTER_TX_GRACEFUL_PERIOD, + .default_graceful_period_delay = + MLX5E_REPORTER_TX_GRACEFUL_PERIOD_DELAY, }; void mlx5e_reporter_tx_create(struct mlx5e_priv *priv) -- 2.31.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH net-next 0/5] Expose grace period delay for devlink health reporter 2025-07-17 16:07 [PATCH net-next 0/5] Expose grace period delay for devlink health reporter Tariq Toukan ` (4 preceding siblings ...) 2025-07-17 16:07 ` [PATCH net-next 5/5] net/mlx5e: Set default grace period delay for TX and RX reporters Tariq Toukan @ 2025-07-19 0:47 ` Jakub Kicinski 2025-07-24 10:46 ` Tariq Toukan 5 siblings, 1 reply; 15+ messages in thread From: Jakub Kicinski @ 2025-07-19 0:47 UTC (permalink / raw) To: Tariq Toukan Cc: Eric Dumazet, Paolo Abeni, Andrew Lunn, David S. Miller, Jiri Pirko, Jiri Pirko, Saeed Mahameed, Gal Pressman, Leon Romanovsky, Shahar Shitrit, Donald Hunter, Jonathan Corbet, Brett Creeley, Michael Chan, Pavan Chebbi, Cai Huoqing, Tony Nguyen, Przemek Kitszel, Sunil Goutham, Linu Cherian, Geetha sowjanya, Jerin Jacob, hariprasad, Subbaraya Sundeep, Saeed Mahameed, Mark Bloch, Ido Schimmel, Petr Machata, Manish Chopra, netdev, linux-kernel, linux-doc, intel-wired-lan, linux-rdma On Thu, 17 Jul 2025 19:07:17 +0300 Tariq Toukan wrote: > Currently, the devlink health reporter initiates the grace period > immediately after recovering an error, which blocks further recovery > attempts until the grace period concludes. Since additional errors > are not generally expected during this short interval, any new error > reported during the grace period is not only rejected but also causes > the reporter to enter an error state that requires manual intervention. > > This approach poses a problem in scenarios where a single root cause > triggers multiple related errors in quick succession - for example, > a PCI issue affecting multiple hardware queues. Because these errors > are closely related and occur rapidly, it is more effective to handle > them together rather than handling only the first one reported and > blocking any subsequent recovery attempts. 
Furthermore, setting the > reporter to an error state in this context can be misleading, as these > multiple errors are manifestations of a single underlying issue, making > it unlike the general case where additional errors are not expected > during the grace period. > > To resolve this, introduce a configurable grace period delay attribute > to the devlink health reporter. This delay starts when the first error > is recovered and lasts for a user-defined duration. Once this grace > period delay expires, the actual grace period begins. After the grace > period ends, a new reported error will start the same flow again. > > Timeline summary: > > ----|--------|------------------------------/----------------------/-- > error is error is grace period delay grace period > reported recovered (recoveries allowed) (recoveries blocked) > > With grace period delay, create a time window during which recovery > attempts are permitted, allowing all reported errors to be handled > sequentially before the grace period starts. Once the grace period > begins, it prevents any further error recoveries until it ends. We are rate limiting recoveries, the "networking solution" to the problem you're describing would be to introduce a burst size. Some kind of poor man's token bucket filter. Could you say more about what designs were considered and why this one was chosen? ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next 0/5] Expose grace period delay for devlink health reporter 2025-07-19 0:47 ` [PATCH net-next 0/5] Expose grace period delay for devlink health reporter Jakub Kicinski @ 2025-07-24 10:46 ` Tariq Toukan 2025-07-25 0:10 ` Jakub Kicinski 0 siblings, 1 reply; 15+ messages in thread From: Tariq Toukan @ 2025-07-24 10:46 UTC (permalink / raw) To: Jakub Kicinski, Tariq Toukan Cc: Eric Dumazet, Paolo Abeni, Andrew Lunn, David S. Miller, Jiri Pirko, Jiri Pirko, Saeed Mahameed, Gal Pressman, Leon Romanovsky, Shahar Shitrit, Donald Hunter, Jonathan Corbet, Brett Creeley, Michael Chan, Pavan Chebbi, Cai Huoqing, Tony Nguyen, Przemek Kitszel, Sunil Goutham, Linu Cherian, Geetha sowjanya, Jerin Jacob, hariprasad, Subbaraya Sundeep, Saeed Mahameed, Mark Bloch, Ido Schimmel, Petr Machata, Manish Chopra, netdev, linux-kernel, linux-doc, intel-wired-lan, linux-rdma On 19/07/2025 3:47, Jakub Kicinski wrote: > On Thu, 17 Jul 2025 19:07:17 +0300 Tariq Toukan wrote: >> Currently, the devlink health reporter initiates the grace period >> immediately after recovering an error, which blocks further recovery >> attempts until the grace period concludes. Since additional errors >> are not generally expected during this short interval, any new error >> reported during the grace period is not only rejected but also causes >> the reporter to enter an error state that requires manual intervention. >> >> This approach poses a problem in scenarios where a single root cause >> triggers multiple related errors in quick succession - for example, >> a PCI issue affecting multiple hardware queues. Because these errors >> are closely related and occur rapidly, it is more effective to handle >> them together rather than handling only the first one reported and >> blocking any subsequent recovery attempts. 
Furthermore, setting the >> reporter to an error state in this context can be misleading, as these >> multiple errors are manifestations of a single underlying issue, making >> it unlike the general case where additional errors are not expected >> during the grace period. >> >> To resolve this, introduce a configurable grace period delay attribute >> to the devlink health reporter. This delay starts when the first error >> is recovered and lasts for a user-defined duration. Once this grace >> period delay expires, the actual grace period begins. After the grace >> period ends, a new reported error will start the same flow again. >> >> Timeline summary: >> >> ----|--------|------------------------------/----------------------/-- >> error is error is grace period delay grace period >> reported recovered (recoveries allowed) (recoveries blocked) >> >> With grace period delay, create a time window during which recovery >> attempts are permitted, allowing all reported errors to be handled >> sequentially before the grace period starts. Once the grace period >> begins, it prevents any further error recoveries until it ends. > > We are rate limiting recoveries, the "networking solution" to the > problem you're describing would be to introduce a burst size. > Some kind of poor man's token bucket filter. > > Could you say more about what designs were considered and why this > one was chosen? > Please see below. If no more comments, I'll add the below to the cover letter and re-spin. Regards, Tariq Design alternatives considered: 1. Recover all queues upon any error: A brute-force approach that recovers all queues on any error. While simple, it is overly aggressive and disrupts unaffected queues unnecessarily. Also, because this is handled entirely within the driver, it leads to a driver-specific implementation rather than a generic one. 2. Per-queue reporter: This design would isolate recovery handling per SQ or RQ, effectively removing interdependencies between queues. 
While conceptually clean, it introduces significant scalability challenges as the number of queues grows, as well as synchronization challenges across multiple reporters.

3. Error aggregation with delayed handling:
Errors arriving during the grace period are saved and processed after it ends. While addressing the issue of related errors whose recovery is aborted as the grace period started, this adds complexity due to synchronization needs and contradicts the assumption that no errors should occur during a healthy system’s grace period. Also, this breaks the important role of the grace period in preventing an infinite loop of immediate error detection following recovery. In such cases we want to stop.

4. Allowing a fixed burst of errors before starting grace period:
Allows a set number of recoveries before the grace period begins. However, it also requires limiting the error reporting window. To keep the design simple, the burst threshold becomes redundant.

The grace period delay design was chosen for its simplicity and precision in addressing the problem at hand. It effectively captures the temporal correlation of related errors and aligns with the original intent of the grace period as a stabilization window where further errors are unexpected, and if they do occur, they indicate an abnormal system state.

^ permalink raw reply [flat|nested] 15+ messages in thread
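The grace-period-delay flow chosen above can be modeled as a small state machine. The sketch below is purely illustrative: the struct, field names, and millisecond clock are hypothetical simplifications, not the devlink implementation (which additionally moves the reporter into an error state when a recovery is rejected).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the grace-period-delay flow, not devlink code. */
enum phase { IDLE, DELAY, GRACE };

struct reporter {
	enum phase phase;
	long long phase_start;	/* ms, start of the current phase */
	long long delay_ms;	/* grace period delay; 0 = current behavior */
	long long grace_ms;	/* grace period length */
};

/* Returns true if the error reported at time 'now' may be recovered. */
static bool report_error(struct reporter *r, long long now)
{
	switch (r->phase) {
	case IDLE:
		/* First error: recover; the delay starts on recovery. */
		r->phase = r->delay_ms ? DELAY : GRACE;
		r->phase_start = now;
		return true;
	case DELAY:
		if (now - r->phase_start < r->delay_ms)
			return true;	/* inside the delay: recoveries allowed */
		/* Delay expired: the grace period begins where the delay ended. */
		r->phase = GRACE;
		r->phase_start += r->delay_ms;
		return false;
	case GRACE:
		if (now - r->phase_start < r->grace_ms)
			return false;	/* blocked until the grace period ends */
		/* Grace period over: treat as a fresh first error. */
		r->phase = r->delay_ms ? DELAY : GRACE;
		r->phase_start = now;
		return true;
	}
	return false;
}
```

With delay_ms set to 0 the model degenerates to the current behavior: the grace period starts immediately at the first recovery.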
* Re: [PATCH net-next 0/5] Expose grace period delay for devlink health reporter
  2025-07-24 10:46 ` Tariq Toukan
@ 2025-07-25  0:10 ` Jakub Kicinski
  2025-07-27 11:00 ` Tariq Toukan
  0 siblings, 1 reply; 15+ messages in thread

From: Jakub Kicinski @ 2025-07-25 0:10 UTC (permalink / raw)
To: Tariq Toukan
Cc: Tariq Toukan, Eric Dumazet, Paolo Abeni, Andrew Lunn, David S. Miller, Jiri Pirko, Jiri Pirko, Saeed Mahameed, Gal Pressman, Leon Romanovsky, Shahar Shitrit, Donald Hunter, Jonathan Corbet, Brett Creeley, Michael Chan, Pavan Chebbi, Cai Huoqing, Tony Nguyen, Przemek Kitszel, Sunil Goutham, Linu Cherian, Geetha sowjanya, Jerin Jacob, hariprasad, Subbaraya Sundeep, Saeed Mahameed, Mark Bloch, Ido Schimmel, Petr Machata, Manish Chopra, netdev, linux-kernel, linux-doc, intel-wired-lan, linux-rdma

On Thu, 24 Jul 2025 13:46:08 +0300 Tariq Toukan wrote:
> Design alternatives considered:
>
> 1. Recover all queues upon any error:
> A brute-force approach that recovers all queues on any error. While simple, it is overly aggressive and disrupts unaffected queues unnecessarily. Also, because this is handled entirely within the driver, it leads to a driver-specific implementation rather than a generic one.
>
> 2. Per-queue reporter:
> This design would isolate recovery handling per SQ or RQ, effectively removing interdependencies between queues. While conceptually clean, it introduces significant scalability challenges as the number of queues grows, as well as synchronization challenges across multiple reporters.
>
> 3. Error aggregation with delayed handling:
> Errors arriving during the grace period are saved and processed after it ends. While addressing the issue of related errors whose recovery is aborted as the grace period started, this adds complexity due to synchronization needs and contradicts the assumption that no errors should occur during a healthy system’s grace period.
> Also, this breaks the important role of the grace period in preventing an infinite loop of immediate error detection following recovery. In such cases we want to stop.
>
> 4. Allowing a fixed burst of errors before starting grace period:
> Allows a set number of recoveries before the grace period begins. However, it also requires limiting the error reporting window. To keep the design simple, the burst threshold becomes redundant.

We're talking about burst on order of 100s, right? The implementation is quite simple, store an array the size of burst in which you can save recovery timestamps (in a circular fashion). On error, count how many entries are in the past N msecs.

It's a clear generalization of the current scheme, which can be thought of as having an array of size 1 (only one most recent recovery time is saved).

> The grace period delay design was chosen for its simplicity and precision in addressing the problem at hand. It effectively captures the temporal correlation of related errors and aligns with the original intent of the grace period as a stabilization window where further errors are unexpected, and if they do occur, they indicate an abnormal system state.

Admittedly part of what I find extremely confusing when thinking about this API is that the period when recovery is **not** allowed is called "grace period". Now we add something called "grace period delay", in some places in the code referred to as "reporter_delay"..

It may be more palatable if we named the first period "error burst period" and, well, the latter I suppose it's too late to rename..

^ permalink raw reply [flat|nested] 15+ messages in thread
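The windowed-count scheme sketched above can be written in a few lines. This is a hypothetical illustration, not devlink code; BURST and WINDOW_MS are made-up values standing in for the burst size and the rate-limiting window.

```c
#include <assert.h>
#include <stdbool.h>

#define BURST 4		/* illustrative burst size */
#define WINDOW_MS 5000	/* illustrative window length, ms */

/* Circular array of the last BURST recovery timestamps. */
struct burst_limiter {
	long long stamps[BURST];	/* recovery timestamps, ms */
	int next;			/* circular write index */
	int count;			/* valid entries, capped at BURST */
};

/* Allow a recovery at time 'now_ms' unless BURST recoveries already
 * happened within the past WINDOW_MS; record it if allowed. */
static bool recovery_allowed(struct burst_limiter *bl, long long now_ms)
{
	int i, recent = 0;

	for (i = 0; i < bl->count; i++)
		if (now_ms - bl->stamps[i] < WINDOW_MS)
			recent++;
	if (recent >= BURST)
		return false;
	bl->stamps[bl->next] = now_ms;
	bl->next = (bl->next + 1) % BURST;
	if (bl->count < BURST)
		bl->count++;
	return true;
}
```

Setting BURST to 1 reproduces the current scheme, where a single recent recovery timestamp gates the next recovery.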
* Re: [PATCH net-next 0/5] Expose grace period delay for devlink health reporter
  2025-07-25  0:10 ` Jakub Kicinski
@ 2025-07-27 11:00 ` Tariq Toukan
  2025-07-28 15:17 ` Jakub Kicinski
  0 siblings, 1 reply; 15+ messages in thread

From: Tariq Toukan @ 2025-07-27 11:00 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Tariq Toukan, Eric Dumazet, Paolo Abeni, Andrew Lunn, David S. Miller, Jiri Pirko, Jiri Pirko, Saeed Mahameed, Gal Pressman, Leon Romanovsky, Shahar Shitrit, Donald Hunter, Jonathan Corbet, Brett Creeley, Michael Chan, Pavan Chebbi, Cai Huoqing, Tony Nguyen, Przemek Kitszel, Sunil Goutham, Linu Cherian, Geetha sowjanya, Jerin Jacob, hariprasad, Subbaraya Sundeep, Saeed Mahameed, Mark Bloch, Ido Schimmel, Petr Machata, Manish Chopra, netdev, linux-kernel, linux-doc, intel-wired-lan, linux-rdma

On 25/07/2025 3:10, Jakub Kicinski wrote:
> On Thu, 24 Jul 2025 13:46:08 +0300 Tariq Toukan wrote:
>> Design alternatives considered:
>>
>> 1. Recover all queues upon any error:
>> A brute-force approach that recovers all queues on any error. While simple, it is overly aggressive and disrupts unaffected queues unnecessarily. Also, because this is handled entirely within the driver, it leads to a driver-specific implementation rather than a generic one.
>>
>> 2. Per-queue reporter:
>> This design would isolate recovery handling per SQ or RQ, effectively removing interdependencies between queues. While conceptually clean, it introduces significant scalability challenges as the number of queues grows, as well as synchronization challenges across multiple reporters.
>>
>> 3. Error aggregation with delayed handling:
>> Errors arriving during the grace period are saved and processed after it ends. While addressing the issue of related errors whose recovery is aborted as the grace period started, this adds complexity due to synchronization needs and contradicts the assumption that no errors should occur during a healthy system’s grace period.
>> Also, this breaks the important role of the grace period in preventing an infinite loop of immediate error detection following recovery. In such cases we want to stop.
>>
>> 4. Allowing a fixed burst of errors before starting grace period:
>> Allows a set number of recoveries before the grace period begins. However, it also requires limiting the error reporting window. To keep the design simple, the burst threshold becomes redundant.
>
> We're talking about burst on order of 100s, right?

It can be, typically up to O(num_cpus).

> The implementation is quite simple, store an array the size of burst in which you can save recovery timestamps (in a circular fashion). On error, count how many entries are in the past N msecs.

I get your suggestion. I agree that it's also pretty simple to implement, and that it tolerates bursts.

However, I think it softens the grace period role too much. It has an important disadvantage, as it tolerates non-bursts as well. It lacks "burstiness" distinguishability.

IMO current grace_period has multiple goals, among them:

a. let the auto-recovery mechanism handle errors as long as they are followed by some long-enough "healthy" intervals.

b. break infinite loop of auto-recoveries, when the "healthy" interval is not long enough. Raise a flag to mark the need for admin intervention.

In your proposal, the above doesn't hold. It won't prevent the infinite auto-recovery loop for a buggy system that has a constant rate of up to X failures in N msecs.

One can argue that this can be addressed by increasing the grace_period, i.e. a current system with grace_period=N is intuitively moved to burst_size=X and grace_period=X*N.

But increasing the grace_period by such a large factor leads to over-enforcement and hurts legitimate auto-recoveries.

Again, the main point is, it lacks the ability to properly distinguish between 1. a "burst" followed by a healthy interval, and 2. a buggy system with a rate of repeated errors.
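The distinction drawn above can be checked against a model of the windowed-count limiter: a buggy system erroring at a steady rate below BURST per WINDOW_MS is never throttled, so the auto-recovery loop never stops. The code below is an illustrative sketch with made-up constants, not devlink code.

```c
#include <assert.h>
#include <stdbool.h>

#define BURST 4		/* illustrative burst size */
#define WINDOW_MS 5000	/* illustrative window length, ms */

/* Windowed-count check: block only when BURST of the recorded
 * recovery timestamps fall within the past WINDOW_MS. */
static bool window_allows(const long long *stamps, int count, long long now)
{
	int i, recent = 0;

	for (i = 0; i < count; i++)
		if (now - stamps[i] < WINDOW_MS)
			recent++;
	return recent < BURST;
}

/* Simulate n_errors errors arriving every period_ms; return how many
 * recoveries the limiter blocked. */
static int blocked_recoveries(int n_errors, long long period_ms)
{
	long long stamps[BURST] = {0};
	int next = 0, count = 0, blocked = 0, i;

	for (i = 0; i < n_errors; i++) {
		long long now = (long long)i * period_ms;

		if (!window_allows(stamps, count, now)) {
			blocked++;
			continue;
		}
		stamps[next] = now;	/* record the allowed recovery */
		next = (next + 1) % BURST;
		if (count < BURST)
			count++;
	}
	return blocked;
}
```

A steady error every 2000 ms stays under 4 errors per 5000 ms window forever, so every recovery is allowed; a tight burst every 100 ms trips the limiter after BURST recoveries.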
> It's a clear generalization of the current scheme, which can be thought of as having an array of size 1 (only one most recent recovery time is saved).

It is a simple generalization indeed. But I don't agree it's a better filter.

>> The grace period delay design was chosen for its simplicity and precision in addressing the problem at hand. It effectively captures the temporal correlation of related errors and aligns with the original intent of the grace period as a stabilization window where further errors are unexpected, and if they do occur, they indicate an abnormal system state.
>
> Admittedly part of what I find extremely confusing when thinking about this API is that the period when recovery is **not** allowed is called "grace period".

Absolutely. We discussed this exact same insight internally. The existing name is confusing, but we won't propose modifying it at this point.

> Now we add something called "grace period delay", in some places in the code referred to as "reporter_delay"..
>
> It may be more palatable if we named the first period "error burst period" and, well, the latter I suppose it's too late to rename..

It can be named after what it achieves (allows handling of more errors) or what it is (a shift of the grace_period). I'm fine with both, don't have a strong preference.

I'd call it grace_period in case we didn't have one already :)

Please let me know what name you prefer.

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net-next 0/5] Expose grace period delay for devlink health reporter
  2025-07-27 11:00 ` Tariq Toukan
@ 2025-07-28 15:17 ` Jakub Kicinski
  0 siblings, 0 replies; 15+ messages in thread

From: Jakub Kicinski @ 2025-07-28 15:17 UTC (permalink / raw)
To: Tariq Toukan
Cc: Tariq Toukan, Eric Dumazet, Paolo Abeni, Andrew Lunn, David S. Miller, Jiri Pirko, Jiri Pirko, Saeed Mahameed, Gal Pressman, Leon Romanovsky, Shahar Shitrit, Donald Hunter, Jonathan Corbet, Brett Creeley, Michael Chan, Pavan Chebbi, Cai Huoqing, Tony Nguyen, Przemek Kitszel, Sunil Goutham, Linu Cherian, Geetha sowjanya, Jerin Jacob, hariprasad, Subbaraya Sundeep, Saeed Mahameed, Mark Bloch, Ido Schimmel, Petr Machata, Manish Chopra, netdev, linux-kernel, linux-doc, intel-wired-lan, linux-rdma

On Sun, 27 Jul 2025 14:00:11 +0300 Tariq Toukan wrote:
> I get your suggestion. I agree that it's also pretty simple to implement, and that it tolerates bursts.
>
> However, I think it softens the grace period role too much. It has an important disadvantage, as it tolerates non-bursts as well. It lacks "burstiness" distinguishability.
>
> IMO current grace_period has multiple goals, among them:
>
> a. let the auto-recovery mechanism handle errors as long as they are followed by some long-enough "healthy" intervals.
>
> b. break infinite loop of auto-recoveries, when the "healthy" interval is not long enough. Raise a flag to mark the need for admin intervention.
>
> In your proposal, the above doesn't hold. It won't prevent the infinite auto-recovery loop for a buggy system that has a constant rate of up to X failures in N msecs.
>
> One can argue that this can be addressed by increasing the grace_period, i.e. a current system with grace_period=N is intuitively moved to burst_size=X and grace_period=X*N.
>
> But increasing the grace_period by such a large factor leads to over-enforcement and hurts legitimate auto-recoveries.
> Again, the main point is, it lacks the ability to properly distinguish between 1. a "burst" followed by a healthy interval, and 2. a buggy system with a rate of repeated errors.

I suspect this is catching some very mlx5-specific recovery loop, so I defer to your judgment on what's better.

As a user I do not know how to configure this health recovery stuff. My intuition would be that we just need to lower the recovery rate to prevent filling up logs etc., and the action of taking the machine out is really the responsibility of some fleet health monitoring daemon. I can't think of any other error reporting facility in the kernel where we'd shut down the recovery completely if the rate is high..

> > Now we add something called "grace period delay", in some places in the code referred to as "reporter_delay"..
> >
> > It may be more palatable if we named the first period "error burst period" and, well, the latter I suppose it's too late to rename..

> It can be named after what it achieves (allows handling of more errors) or what it is (a shift of the grace_period). I'm fine with both, don't have a strong preference.

Let's rename to "error burst period", reporter_in_error_burst etc.

> I'd call it grace_period in case we didn't have one already :)

Exactly :)

^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2025-07-28 15:17 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page:
2025-07-17 16:07 [PATCH net-next 0/5] Expose grace period delay for devlink health reporter Tariq Toukan
2025-07-17 16:07 ` [PATCH net-next 1/5] devlink: Move graceful period parameter to reporter ops Tariq Toukan
2025-07-17 16:07 ` [PATCH net-next 2/5] devlink: Move health reporter recovery abort logic to a separate function Tariq Toukan
2025-07-17 16:07 ` [PATCH net-next 3/5] devlink: Introduce grace period delay for health reporter Tariq Toukan
2025-07-17 16:07 ` [PATCH net-next 4/5] devlink: Make health reporter grace period delay configurable Tariq Toukan
2025-07-19  0:48   ` Jakub Kicinski
2025-07-20 10:11     ` Tariq Toukan
2025-07-19  0:51   ` Jakub Kicinski
2025-07-20 10:47     ` Tariq Toukan
2025-07-17 16:07 ` [PATCH net-next 5/5] net/mlx5e: Set default grace period delay for TX and RX reporters Tariq Toukan
2025-07-19  0:47 ` [PATCH net-next 0/5] Expose grace period delay for devlink health reporter Jakub Kicinski
2025-07-24 10:46   ` Tariq Toukan
2025-07-25  0:10     ` Jakub Kicinski
2025-07-27 11:00       ` Tariq Toukan
2025-07-28 15:17         ` Jakub Kicinski