From: Jakub Kicinski <kuba@kernel.org>
To: Tariq Toukan <tariqt@nvidia.com>
Cc: Eric Dumazet <edumazet@google.com>,
	Paolo Abeni <pabeni@redhat.com>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>,
	Jiri Pirko <jiri@resnulli.us>, Jiri Pirko <jiri@nvidia.com>,
	Saeed Mahameed <saeed@kernel.org>, Gal Pressman <gal@nvidia.com>,
	"Leon Romanovsky" <leon@kernel.org>,
	Shahar Shitrit <shshitrit@nvidia.com>,
	"Donald Hunter" <donald.hunter@gmail.com>,
	Jonathan Corbet <corbet@lwn.net>,
	"Brett Creeley" <brett.creeley@amd.com>,
	Michael Chan <michael.chan@broadcom.com>,
	Pavan Chebbi <pavan.chebbi@broadcom.com>,
	Cai Huoqing <cai.huoqing@linux.dev>,
	Tony Nguyen <anthony.l.nguyen@intel.com>,
	"Przemek Kitszel" <przemyslaw.kitszel@intel.com>,
	Sunil Goutham <sgoutham@marvell.com>,
	Linu Cherian <lcherian@marvell.com>,
	Geetha sowjanya <gakula@marvell.com>,
	Jerin Jacob <jerinj@marvell.com>, hariprasad <hkelam@marvell.com>,
	"Subbaraya Sundeep" <sbhatta@marvell.com>,
	Saeed Mahameed <saeedm@nvidia.com>,
	Mark Bloch <mbloch@nvidia.com>, Ido Schimmel <idosch@nvidia.com>,
	Petr Machata <petrm@nvidia.com>,
	Manish Chopra <manishc@marvell.com>, <netdev@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, <linux-doc@vger.kernel.org>,
	<intel-wired-lan@lists.osuosl.org>, <linux-rdma@vger.kernel.org>
Subject: Re: [PATCH net-next 0/5] Expose grace period delay for devlink health reporter
Date: Fri, 18 Jul 2025 17:47:37 -0700
Message-ID: <20250718174737.1d1177cd@kernel.org>
In-Reply-To: <1752768442-264413-1-git-send-email-tariqt@nvidia.com>

On Thu, 17 Jul 2025 19:07:17 +0300 Tariq Toukan wrote:
> Currently, the devlink health reporter initiates the grace period
> immediately after recovering an error, which blocks further recovery
> attempts until the grace period concludes. Since additional errors
> are not generally expected during this short interval, any new error
> reported during the grace period is not only rejected but also causes
> the reporter to enter an error state that requires manual intervention.
> 
> This approach poses a problem in scenarios where a single root cause
> triggers multiple related errors in quick succession - for example,
> a PCI issue affecting multiple hardware queues. Because these errors
> are closely related and occur rapidly, it is more effective to handle
> them together rather than handling only the first one reported and
> blocking any subsequent recovery attempts. Furthermore, setting the
> reporter to an error state in this context can be misleading, as these
> multiple errors are manifestations of a single underlying issue, making
> it unlike the general case where additional errors are not expected
> during the grace period.
> 
> To resolve this, introduce a configurable grace period delay attribute
> to the devlink health reporter. This delay starts when the first error
> is recovered and lasts for a user-defined duration. Once this grace
> period delay expires, the actual grace period begins. After the grace
> period ends, a new reported error will start the same flow again.
> 
> Timeline summary:
> 
> ----|--------|------------------------------/----------------------/--
> error is  error is    grace period delay          grace period
> reported  recovered  (recoveries allowed)     (recoveries blocked)
> 
> With grace period delay, create a time window during which recovery
> attempts are permitted, allowing all reported errors to be handled
> sequentially before the grace period starts. Once the grace period
> begins, it prevents any further error recoveries until it ends.

We are rate limiting recoveries. The "networking solution" to the
problem you're describing would be to introduce a burst size,
some kind of poor man's token bucket filter.

Could you say more about what designs were considered and why this
one was chosen?
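
To make the burst-size idea concrete, here is a minimal user-space sketch
of a token bucket applied to recoveries. It is only an illustration under
stated assumptions: the names (recovery_bucket, recovery_bucket_init,
recovery_bucket_allow), the millisecond timestamps, and the one-token-per-
interval refill policy are all hypothetical and are not taken from devlink
or from this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical token bucket for rate limiting recoveries (not devlink API). */
struct recovery_bucket {
	uint64_t burst;     /* max recoveries allowed back to back */
	uint64_t refill_ms; /* time needed to earn one token back; 0 = never */
	uint64_t tokens;    /* tokens currently available */
	uint64_t last_ms;   /* time from which refill is accounted */
};

static void recovery_bucket_init(struct recovery_bucket *b, uint64_t burst,
				 uint64_t refill_ms, uint64_t now_ms)
{
	b->burst = burst;
	b->refill_ms = refill_ms;
	b->tokens = burst; /* start full so the first burst is allowed */
	b->last_ms = now_ms;
}

/* Return true and consume a token if a recovery may run at now_ms. */
static bool recovery_bucket_allow(struct recovery_bucket *b, uint64_t now_ms)
{
	uint64_t earned = 0;

	/* A full bucket has no refill pending; restart accounting from now. */
	if (b->tokens == b->burst)
		b->last_ms = now_ms;

	if (b->refill_ms && now_ms > b->last_ms)
		earned = (now_ms - b->last_ms) / b->refill_ms;

	if (earned >= b->burst - b->tokens) {
		/* Enough time passed to fill up; never exceed the burst size. */
		b->tokens = b->burst;
		b->last_ms = now_ms;
	} else if (earned) {
		/* Partial refill; keep the remainder of the interval. */
		b->tokens += earned;
		b->last_ms += earned * b->refill_ms;
	}

	if (!b->tokens)
		return false; /* burst exhausted: recovery is blocked */

	b->tokens--;
	return true;
}
```

With burst = 3 and refill_ms = 1000, three errors reported in quick
succession would each be recovered, a fourth would be rejected, and one
more recovery would become available a second after the refill accounting
started. A kernel version would presumably use jiffies rather than
caller-supplied milliseconds; that detail is left out here.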


Thread overview: 15+ messages
2025-07-17 16:07 [PATCH net-next 0/5] Expose grace period delay for devlink health reporter Tariq Toukan
2025-07-17 16:07 ` [PATCH net-next 1/5] devlink: Move graceful period parameter to reporter ops Tariq Toukan
2025-07-17 16:07 ` [PATCH net-next 2/5] devlink: Move health reporter recovery abort logic to a separate function Tariq Toukan
2025-07-17 16:07 ` [PATCH net-next 3/5] devlink: Introduce grace period delay for health reporter Tariq Toukan
2025-07-17 16:07 ` [PATCH net-next 4/5] devlink: Make health reporter grace period delay configurable Tariq Toukan
2025-07-19  0:48   ` Jakub Kicinski
2025-07-20 10:11     ` Tariq Toukan
2025-07-19  0:51   ` Jakub Kicinski
2025-07-20 10:47     ` Tariq Toukan
2025-07-17 16:07 ` [PATCH net-next 5/5] net/mlx5e: Set default grace period delay for TX and RX reporters Tariq Toukan
2025-07-19  0:47 ` Jakub Kicinski [this message]
2025-07-24 10:46   ` [PATCH net-next 0/5] Expose grace period delay for devlink health reporter Tariq Toukan
2025-07-25  0:10     ` Jakub Kicinski
2025-07-27 11:00       ` Tariq Toukan
2025-07-28 15:17         ` Jakub Kicinski
