From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 18 Jul 2025 17:47:37 -0700
From: Jakub Kicinski
To: Tariq Toukan
Cc: Eric Dumazet, Paolo Abeni, Andrew Lunn, "David S. Miller",
 Jiri Pirko, Jiri Pirko, Saeed Mahameed, Gal Pressman,
 "Leon Romanovsky", Shahar Shitrit, "Donald Hunter", Jonathan Corbet,
 "Brett Creeley", Michael Chan, Pavan Chebbi, Cai Huoqing, Tony Nguyen,
 "Przemek Kitszel", Sunil Goutham, Linu Cherian, Geetha sowjanya,
 Jerin Jacob, hariprasad, "Subbaraya Sundeep", Saeed Mahameed,
 Mark Bloch, Ido Schimmel, Petr Machata, Manish Chopra
Message-ID: <20250718174737.1d1177cd@kernel.org>
In-Reply-To: <1752768442-264413-1-git-send-email-tariqt@nvidia.com>
References: <1752768442-264413-1-git-send-email-tariqt@nvidia.com>
Subject: Re: [Intel-wired-lan] [PATCH net-next 0/5] Expose grace period delay for devlink health reporter
List-Id: Intel Wired Ethernet Linux Kernel Driver Development

On Thu, 17 Jul 2025 19:07:17 +0300 Tariq Toukan wrote:
> Currently, the devlink health reporter initiates the grace period
> immediately after recovering an error, which blocks further recovery
> attempts until the grace period concludes. Since additional errors
> are not generally expected during this short interval, any new error
> reported during the grace period is not only rejected but also causes
> the reporter to enter an error state that requires manual intervention.
>
> This approach poses a problem in scenarios where a single root cause
> triggers multiple related errors in quick succession - for example,
> a PCI issue affecting multiple hardware queues. Because these errors
> are closely related and occur rapidly, it is more effective to handle
> them together rather than handling only the first one reported and
> blocking any subsequent recovery attempts. Furthermore, setting the
> reporter to an error state in this context can be misleading, as these
> multiple errors are manifestations of a single underlying issue, making
> it unlike the general case where additional errors are not expected
> during the grace period.
>
> To resolve this, introduce a configurable grace period delay attribute
> to the devlink health reporter. This delay starts when the first error
> is recovered and lasts for a user-defined duration. Once this grace
> period delay expires, the actual grace period begins. After the grace
> period ends, a new reported error will start the same flow again.
>
> Timeline summary:
>
> ----|--------|------------------------------/----------------------/--
> error is  error is    grace period delay        grace period
> reported  recovered  (recoveries allowed)    (recoveries blocked)
>
> With grace period delay, create a time window during which recovery
> attempts are permitted, allowing all reported errors to be handled
> sequentially before the grace period starts. Once the grace period
> begins, it prevents any further error recoveries until it ends.

We are rate limiting recoveries, the "networking solution" to the
problem you're describing would be to introduce a burst size.
Some kind of poor man's token bucket filter.

Could you say more about what designs were considered and why this
one was chosen?
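
For readers unfamiliar with the term, the burst-size idea mentioned above can be sketched roughly as follows. This is only an illustration of a generic token bucket applied to recovery attempts, not code from the patch series or from devlink. Every name here (struct recovery_bucket, bucket_init, bucket_try_recover, burst, refill_ms) is hypothetical, and timestamps are assumed to come from a monotonic millisecond clock.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical illustration only; none of these names exist in devlink. */
struct recovery_bucket {
	uint32_t burst;     /* max recoveries allowed back to back */
	uint64_t refill_ms; /* time needed to earn one token back (must be > 0) */
	uint32_t tokens;    /* recoveries currently available */
	uint64_t last_ms;   /* time up to which refills have been accounted */
};

static void bucket_init(struct recovery_bucket *b, uint32_t burst,
			uint64_t refill_ms, uint64_t now_ms)
{
	b->burst = burst;
	b->refill_ms = refill_ms;
	b->tokens = burst; /* start full: the first burst is free */
	b->last_ms = now_ms;
}

/*
 * Returns true and consumes a token if a recovery may run at now_ms,
 * false if the recovery should be refused. now_ms must not go backwards.
 */
static bool bucket_try_recover(struct recovery_bucket *b, uint64_t now_ms)
{
	if (b->tokens >= b->burst) {
		/* A full bucket accrues nothing; restart the refill clock. */
		b->last_ms = now_ms;
	} else {
		uint64_t earned = (now_ms - b->last_ms) / b->refill_ms;
		uint64_t total = b->tokens + earned;

		if (total >= b->burst) {
			b->tokens = b->burst;
			b->last_ms = now_ms;
		} else {
			b->tokens = (uint32_t)total;
			/* Keep the fractional interval for the next call. */
			b->last_ms += earned * b->refill_ms;
		}
	}

	if (b->tokens == 0)
		return false;

	b->tokens--;
	return true;
}
```

With burst = 3 and refill_ms = 1000, errors reported at 0, 10 and 20 ms would each be recovered, one at 30 ms would be refused, and one at 1000 ms would be allowed again. Whether a refused recovery should also move the reporter into an error state is a policy question this sketch leaves open.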