Re: [PATCH net-next 0/5] Expose grace period delay for devlink health reporter

From: Tariq Toukan <ttoukan.linux@gmail.com>
To: Jakub Kicinski <kuba@kernel.org>
Cc: Tariq Toukan <tariqt@nvidia.com>,
	Eric Dumazet <edumazet@google.com>,
	Paolo Abeni <pabeni@redhat.com>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>,
	Jiri Pirko <jiri@resnulli.us>, Jiri Pirko <jiri@nvidia.com>,
	Saeed Mahameed <saeed@kernel.org>, Gal Pressman <gal@nvidia.com>,
	Leon Romanovsky <leon@kernel.org>,
	Shahar Shitrit <shshitrit@nvidia.com>,
	Donald Hunter <donald.hunter@gmail.com>,
	Jonathan Corbet <corbet@lwn.net>,
	Brett Creeley <brett.creeley@amd.com>,
	Michael Chan <michael.chan@broadcom.com>,
	Pavan Chebbi <pavan.chebbi@broadcom.com>,
	Cai Huoqing <cai.huoqing@linux.dev>,
	Tony Nguyen <anthony.l.nguyen@intel.com>,
	Przemek Kitszel <przemyslaw.kitszel@intel.com>,
	Sunil Goutham <sgoutham@marvell.com>,
	Linu Cherian <lcherian@marvell.com>,
	Geetha sowjanya <gakula@marvell.com>,
	Jerin Jacob <jerinj@marvell.com>, hariprasad <hkelam@marvell.com>,
	Subbaraya Sundeep <sbhatta@marvell.com>,
	Saeed Mahameed <saeedm@nvidia.com>,
	Mark Bloch <mbloch@nvidia.com>, Ido Schimmel <idosch@nvidia.com>,
	Petr Machata <petrm@nvidia.com>,
	Manish Chopra <manishc@marvell.com>,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, intel-wired-lan@lists.osuosl.org,
	linux-rdma@vger.kernel.org
Subject: Re: [PATCH net-next 0/5] Expose grace period delay for devlink health reporter
Date: Sun, 27 Jul 2025 14:00:11 +0300
Message-ID: <3bf6714b-46d7-45ad-9d15-f5ce9d4b74e4@gmail.com>
In-Reply-To: <20250724171011.2e8ebca4@kernel.org>



On 25/07/2025 3:10, Jakub Kicinski wrote:
> On Thu, 24 Jul 2025 13:46:08 +0300 Tariq Toukan wrote:
>> Design alternatives considered:
>>
>> 1. Recover all queues upon any error:
>>      A brute-force approach that recovers all queues on any error.
>>      While simple, it is overly aggressive and disrupts unaffected queues
>>      unnecessarily. Also, because this is handled entirely within the
>>      driver, it leads to a driver-specific implementation rather than a
>>      generic one.
>>
>> 2. Per-queue reporter:
>>      This design would isolate recovery handling per SQ or RQ, effectively
>>      removing interdependencies between queues. While conceptually clean,
>>      it introduces significant scalability challenges as the number of
>>      queues grows, as well as synchronization challenges across multiple
>>      reporters.
>>
>> 3. Error aggregation with delayed handling:
>>      Errors arriving during the grace period are saved and processed after
>>      it ends. While addressing the issue of related errors whose recovery
>>      is aborted as grace period started, this adds complexity due to
>>      synchronization needs and contradicts the assumption that no errors
>>      should occur during a healthy system’s grace period. Also, this
>>      breaks the important role of grace period in preventing an infinite
>>      loop of immediate error detection following recovery. In such cases
>>      we want to stop.
>>
>> 4. Allowing a fixed burst of errors before starting grace period:
>>      Allows a set number of recoveries before the grace period begins.
>>      However, it also requires limiting the error reporting window.
>>      To keep the design simple, the burst threshold becomes redundant.
> 
> We're talking about burst on order of 100s, right?

It can be, typically up to O(num_cpus).

> The implementation
> is quite simple, store an array the size of burst in which you can
> save recovery timestamps (in a circular fashion). On error, count
> how many entries are in the past N msecs.
> 

I get your suggestion. I agree that it's also pretty simple to 
implement, and that it tolerates bursts.
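
For reference, a minimal sketch of the suggested scheme as I understand 
it. This is a Python model rather than devlink code; the class name, 
the method name, and the burst_size/window_ms parameters are all 
hypothetical. It stores recovery timestamps in a circular array and, 
on each error, counts how many fall within the past window_ms.

```python
class BurstWindowFilter:
    """Hypothetical model of the suggested scheme; not devlink code."""

    def __init__(self, burst_size, window_ms):
        self.window_ms = window_ms
        self.stamps = [None] * burst_size  # circular array of recovery times
        self.next_slot = 0

    def on_error(self, now_ms):
        """Return True if auto-recovery would be allowed at now_ms."""
        recent = sum(
            1 for t in self.stamps
            if t is not None and now_ms - t < self.window_ms
        )
        if recent >= len(self.stamps):
            return False  # burst budget used up: recovery is aborted
        # Record this recovery, overwriting the oldest slot.
        self.stamps[self.next_slot] = now_ms
        self.next_slot = (self.next_slot + 1) % len(self.stamps)
        return True
```

With burst_size=1 this degenerates to a single saved recovery 
timestamp, which is the "array of size 1" view of the current scheme.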

However, I think it softens the role of the grace period too much. It 
has an important disadvantage: it tolerates non-bursts as well, so it 
cannot tell a real burst apart from a steady stream of errors.

IMO the current grace_period has multiple goals, among them:

a. Let the auto-recovery mechanism handle errors as long as they are 
followed by long-enough "healthy" intervals.

b. Break an infinite loop of auto-recoveries when the "healthy" interval 
is not long enough, and raise a flag to mark the need for admin 
intervention.

In your proposal, the above doesn't hold.
It won't prevent the infinite auto-recovery loop for a buggy system that 
has a constant rate of up to X failures in N msecs.
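
To illustrate, here is a small simulation of that case. It is a sketch 
under assumptions: simulate(), burst_size and window_ms are hypothetical 
names, the numbers (X=4, N=1000 msecs, one error every 300 msecs) are 
made up, and the model ignores the reporter health state that the real 
devlink code also tracks.

```python
from collections import deque


def simulate(error_times_ms, burst_size, window_ms):
    """Return one bool per error: True if recovery would be allowed.

    Hypothetical model of the timestamp-window filter; not devlink code.
    """
    recoveries = deque(maxlen=burst_size)  # most recent recovery times
    allowed = []
    for now in error_times_ms:
        recent = sum(1 for t in recoveries if now - t < window_ms)
        ok = recent < burst_size
        if ok:
            recoveries.append(now)
        allowed.append(ok)
    return allowed


# A buggy system that fails every 300 msecs, forever (1000 samples here).
steady = [i * 300 for i in range(1000)]

# Window filter with X=4 recoveries per N=1000 msecs.
windowed = simulate(steady, burst_size=4, window_ms=1000)

# Single-timestamp filter (array of size 1) with the same N.
single = simulate(steady, burst_size=1, window_ms=1000)
```

In this model every entry of windowed is True, i.e. the steady failure 
rate is recovered indefinitely and nothing flags the system as 
unhealthy, while the single-timestamp filter already rejects the second 
error at 300 msecs.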

One can argue that this can be addressed by increasing the 
grace_period, i.e. a current system with grace_period=N is intuitively 
moved to burst_size=X and grace_period=X*N.

But increasing the grace_period by such a large factor over-enforces 
and hurts legitimate auto-recoveries.

Again, the main point is, it lacks the ability to properly distinguish 
between 1. a "burst" followed by a healthy interval, and 2. a buggy 
system with a rate of repeated errors.
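
For comparison, a rough model of the delay-then-grace-period behavior. 
The semantics here are my assumptions for illustration, not a copy of 
the patches: the first recovered error opens a delay window in which 
further errors are still recovered, the grace period starts when the 
delay ends, and an error inside the grace period aborts recovery and 
latches a "needs admin" flag. All names are hypothetical.

```python
class DelayedGraceReporter:
    """Hypothetical model of grace period delay; not devlink code."""

    def __init__(self, delay_ms, grace_ms):
        self.delay_ms = delay_ms
        self.grace_ms = grace_ms
        self.cycle_start = None   # first recovered error of this cycle
        self.needs_admin = False  # assumed latch after an aborted recovery

    def on_error(self, now_ms):
        """Return True if auto-recovery would be allowed at now_ms."""
        if self.needs_admin:
            return False
        cycle_over = (
            self.cycle_start is None
            or now_ms >= self.cycle_start + self.delay_ms + self.grace_ms
        )
        if cycle_over:
            self.cycle_start = now_ms  # a new cycle starts with this error
            return True
        if now_ms < self.cycle_start + self.delay_ms:
            return True  # still inside the delay window: part of the burst
        # Inside the grace period: errors here are unexpected, so stop.
        self.needs_admin = True
        return False
```

With delay_ms=100 and grace_ms=1000, a burst at 0, 10, 20 and 50 msecs 
followed by a quiet interval is fully recovered, whereas a steady error 
every 300 msecs is stopped at its second occurrence.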

> It's a clear generalization of current scheme which can be thought of
> as having an array of size 1 (only one most recent recovery time is
> saved).
> 

It is a simple generalization indeed.
But I don't agree it's a better filter.

>> The grace period delay design was chosen for its simplicity and
>> precision in addressing the problem at hand. It effectively captures
>> the temporal correlation of related errors and aligns with the original
>> intent of the grace period as a stabilization window where further
>> errors are unexpected, and if they do occur, they indicate an abnormal
>> system state.
> 
> Admittedly part of what I find extremely confusing when thinking about
> this API is that the period when recovery is **not** allowed is called
> "grace period".

Absolutely.
We discussed this exact same insight internally. The existing name is 
confusing, but we won't propose modifying it at this point.

> Now we add something called "grace period delay" in
> some places in the code referred to as "reporter_delay"..
> 
> It may be more palatable if we named the first period "error burst
> period" and, well, the later I suppose it's too late to rename..

It can be named after what it achieves (allows handling of more errors) 
or what it is (a shift of the grace_period). I'm fine with both and 
don't have a strong preference.

I'd call it grace_period if we didn't already have one :)

Please let me know what name you prefer.
