From: Tariq Toukan <tariqt@nvidia.com>
To: Jiri Pirko <jiri@nvidia.com>, Jiri Pirko <jiri@resnulli.us>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Andrew Lunn <andrew+netdev@lunn.ch>,
"David S. Miller" <davem@davemloft.net>
Cc: Donald Hunter <donald.hunter@gmail.com>,
Jonathan Corbet <corbet@lwn.net>,
Brett Creeley <brett.creeley@amd.com>,
Michael Chan <michael.chan@broadcom.com>,
Pavan Chebbi <pavan.chebbi@broadcom.com>,
"Cai Huoqing" <cai.huoqing@linux.dev>,
Tony Nguyen <anthony.l.nguyen@intel.com>,
Przemek Kitszel <przemyslaw.kitszel@intel.com>,
Sunil Goutham <sgoutham@marvell.com>,
Linu Cherian <lcherian@marvell.com>,
Geetha sowjanya <gakula@marvell.com>,
Jerin Jacob <jerinj@marvell.com>, hariprasad <hkelam@marvell.com>,
Subbaraya Sundeep <sbhatta@marvell.com>,
Saeed Mahameed <saeedm@nvidia.com>,
Leon Romanovsky <leon@kernel.org>,
Tariq Toukan <tariqt@nvidia.com>, Mark Bloch <mbloch@nvidia.com>,
Ido Schimmel <idosch@nvidia.com>, Petr Machata <petrm@nvidia.com>,
Manish Chopra <manishc@marvell.com>, <netdev@vger.kernel.org>,
<linux-kernel@vger.kernel.org>, <linux-doc@vger.kernel.org>,
<intel-wired-lan@lists.osuosl.org>, <linux-rdma@vger.kernel.org>,
"Gal Pressman" <gal@nvidia.com>,
Dragos Tatulea <dtatulea@nvidia.com>,
"Shahar Shitrit" <shshitrit@nvidia.com>
Subject: [Intel-wired-lan] [PATCH net-next V3 5/5] net/mlx5e: Set default error burst period for TX and RX reporters
Date: Wed, 13 Aug 2025 21:55:49 +0300
Message-ID: <1755111349-416632-6-git-send-email-tariqt@nvidia.com>
In-Reply-To: <1755111349-416632-1-git-send-email-tariqt@nvidia.com>
From: Shahar Shitrit <shshitrit@nvidia.com>
System errors can cause multiple errors to be reported to the TX
reporter at the same time. For instance, lost interrupts may cause
several SQs to time out simultaneously. When dev_watchdog notifies
the driver of the timeout, the driver iterates over all SQs and
triggers recovery for the timed-out ones via the TX health reporter.

However, the grace period allows only one recovery at a time, so only
the first SQ recovers while the others remain blocked. Since no
further recoveries are allowed during the grace period, subsequent
errors cause the reporter to enter an ERROR state, requiring manual
intervention.

To address this, set the TX reporter's default error burst period
to 0.5 seconds (500 ms). This allows the reporter to detect and
handle all timed-out SQs within this window before initiating the
grace period.

To account for the possibility of a similar issue in the RX reporter,
set its default error burst period as well.

While here, align the TX definition prefix with the RX one (MLX5E_),
as these definitions are used only in the EN driver.

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
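Illustration only, not part of the patch: the interaction between the
grace period and the error burst period described above can be
sketched with a small userspace model. The struct and function names
below (model_reporter, model_report_error, model_count_recovered) are
hypothetical and are not devlink or mlx5 symbols. The model assumes
that the first error opens a burst window, that every error inside
the window is allowed to recover, and that the grace period starts
once the window ends. The real devlink core also tracks reporter
health state, which is left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model of one health reporter (times in ms). */
struct model_reporter {
	uint64_t graceful_period_ms;	/* recoveries refused during this span */
	uint64_t burst_period_ms;	/* recoveries allowed during this span */
	bool window_open;		/* true once a first error was seen */
	uint64_t window_start_ms;	/* time of the error that opened the window */
};

/*
 * Report one error at time now_ms (timestamps must be non-decreasing).
 * Returns true if the model would allow a recovery, false if the error
 * lands in the grace period and recovery is refused.
 */
static bool model_report_error(struct model_reporter *r, uint64_t now_ms)
{
	assert(r);

	if (r->window_open) {
		uint64_t elapsed = now_ms - r->window_start_ms;

		/* Still inside the burst window: recover this error too. */
		if (elapsed < r->burst_period_ms)
			return true;

		/* Burst window over, grace period running: refuse. */
		if (elapsed < r->burst_period_ms + r->graceful_period_ms)
			return false;
	}

	/* First error, or grace period expired: open a new burst window. */
	r->window_open = true;
	r->window_start_ms = now_ms;
	return true;
}

/* Count how many of the n errors at times_ms[] the model recovers. */
static size_t model_count_recovered(struct model_reporter *r,
				    const uint64_t *times_ms, size_t n)
{
	size_t i, recovered = 0;

	assert(times_ms);
	for (i = 0; i < n; i++)
		if (model_report_error(r, times_ms[i]))
			recovered++;

	return recovered;
}
```

Feeding the model four SQ timeouts at 0, 1, 2 and 3 ms with a 500 ms
grace period, it recovers only the first one when burst_period_ms is
0 and all four when burst_period_ms is 500, which matches the
behaviour change the commit message describes.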
drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c | 2 ++
drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c | 7 +++++--
2 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c
index 1b9ea72abc5a..0e861ae362bc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c
@@ -652,6 +652,7 @@ void mlx5e_reporter_icosq_resume_recovery(struct mlx5e_channel *c)
 }
 
 #define MLX5E_REPORTER_RX_GRACEFUL_PERIOD 500
+#define MLX5E_REPORTER_RX_ERROR_BURST_PERIOD 500
 
 static const struct devlink_health_reporter_ops mlx5_rx_reporter_ops = {
 	.name = "rx",
@@ -659,6 +660,7 @@ static const struct devlink_health_reporter_ops mlx5_rx_reporter_ops = {
 	.diagnose = mlx5e_rx_reporter_diagnose,
 	.dump = mlx5e_rx_reporter_dump,
 	.default_graceful_period = MLX5E_REPORTER_RX_GRACEFUL_PERIOD,
+	.default_error_burst_period = MLX5E_REPORTER_RX_ERROR_BURST_PERIOD,
 };
 
 void mlx5e_reporter_rx_create(struct mlx5e_priv *priv)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
index 7a4a77f6fe6a..7813f18e7dfe 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
@@ -539,14 +539,17 @@ void mlx5e_reporter_tx_ptpsq_unhealthy(struct mlx5e_ptpsq *ptpsq)
 	mlx5e_health_report(priv, priv->tx_reporter, err_str, &err_ctx);
 }
 
-#define MLX5_REPORTER_TX_GRACEFUL_PERIOD 500
+#define MLX5E_REPORTER_TX_GRACEFUL_PERIOD 500
+#define MLX5E_REPORTER_TX_ERROR_BURST_PERIOD 500
 
 static const struct devlink_health_reporter_ops mlx5_tx_reporter_ops = {
 	.name = "tx",
 	.recover = mlx5e_tx_reporter_recover,
 	.diagnose = mlx5e_tx_reporter_diagnose,
 	.dump = mlx5e_tx_reporter_dump,
-	.default_graceful_period = MLX5_REPORTER_TX_GRACEFUL_PERIOD,
+	.default_graceful_period = MLX5E_REPORTER_TX_GRACEFUL_PERIOD,
+	.default_error_burst_period =
+		MLX5E_REPORTER_TX_ERROR_BURST_PERIOD,
 };
 
 void mlx5e_reporter_tx_create(struct mlx5e_priv *priv)
--
2.31.1