Date: Wed, 31 Dec 2025 16:27:23 -0800
From: Mohamed Khalfella
To: Sagi Grimberg
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
	linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 10/14] nvme-tcp: Use CCR to recover controller that hits an error
Message-ID: <20260101002723.GS3864520-mkhalfella@purestorage.com>
References: <20251126021250.2583630-1-mkhalfella@purestorage.com>
	<20251126021250.2583630-11-mkhalfella@purestorage.com>
	<5befc95c-b66a-4dd8-bb72-7cc6839c7c4b@grimberg.me>
In-Reply-To: <5befc95c-b66a-4dd8-bb72-7cc6839c7c4b@grimberg.me>

On Sat 2025-12-27 12:35:23 +0200, Sagi Grimberg wrote:
>
> On 26/11/2025 4:11, Mohamed Khalfella wrote:
> > An alive nvme controller that hits an error now will move to RECOVERING
> > state instead of RESETTING state.
> > In RECOVERING state ctrl->err_work
> > will attempt to use cross-controller recovery to terminate inflight IOs
> > on the controller. If CCR succeeds, then switch to RESETTING state and
> > continue error recovery as usual by tearing down the controller and
> > attempting to reconnect to the target. If CCR fails, then the behavior
> > of recovery depends on whether CQT is supported or not. If CQT is
> > supported, switch to time-based recovery by holding inflight IOs until
> > it is safe for them to be retried. If CQT is not supported, proceed to
> > retry requests immediately, as the code currently does.
> >
> > To support implementing time-based recovery turn ctrl->err_work into
> > delayed work. Update nvme_tcp_timeout() to not complete inflight IOs
> > while the controller is in RECOVERING state.
> >
> > Signed-off-by: Mohamed Khalfella
> > ---
> >  drivers/nvme/host/tcp.c | 52 +++++++++++++++++++++++++++++++++++------
> >  1 file changed, 45 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > index 9a96df1a511c..ec9a713490a9 100644
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -193,7 +193,7 @@ struct nvme_tcp_ctrl {
> >  	struct sockaddr_storage src_addr;
> >  	struct nvme_ctrl	ctrl;
> >
> > -	struct work_struct	err_work;
> > +	struct delayed_work	err_work;
> >  	struct delayed_work	connect_work;
> >  	struct nvme_tcp_request	async_req;
> >  	u32			io_queues[HCTX_MAX_TYPES];
> > @@ -611,11 +611,12 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
> >
> >  static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
> >  {
> > -	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> > +	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RECOVERING) &&
> > +	    !nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
>
> This warrants an explanation. It is not clear at all why we should
> allow two different transitions to allow error recovery to start...
The behavior of ctrl->err_work depends on the controller state. We go to
RECOVERING only if the controller is LIVE. Otherwise, we attempt to go to
RESETTING.

> >  		return;
> >
> >  	dev_warn(ctrl->device, "starting error recovery\n");
> > -	queue_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work);
> > +	queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, 0);
> >  }
> >
> >  static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
> > @@ -2470,12 +2471,48 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
> >  	nvme_tcp_reconnect_or_remove(ctrl, ret);
> >  }
> >
> > +static int nvme_tcp_recover_ctrl(struct nvme_ctrl *ctrl)
> > +{
> > +	unsigned long rem;
> > +
> > +	if (test_and_clear_bit(NVME_CTRL_RECOVERED, &ctrl->flags)) {
> > +		dev_info(ctrl->device, "completed time-based recovery\n");
> > +		goto done;
> > +	}
>
> This is also not clear, why should we get here when NVME_CTRL_RECOVERED
> is set?

The NVME_CTRL_RECOVERED flag is set before scheduling ctrl->err_work as
delayed work. This is how time-based recovery is implemented. We get here
when ctrl->err_work runs for the second time, and at this point we know
that it is safe to just reset the controller and cancel inflight requests.

> > +
> > +	rem = nvme_recover_ctrl(ctrl);
> > +	if (!rem)
> > +		goto done;
> > +
> > +	if (!ctrl->cqt) {
> > +		dev_info(ctrl->device,
> > +			 "CCR failed, CQT not supported, skip time-based recovery\n");
> > +		goto done;
> > +	}
> > +
> > +	dev_info(ctrl->device,
> > +		 "CCR failed, switch to time-based recovery, timeout = %ums\n",
> > +		 jiffies_to_msecs(rem));
> > +	set_bit(NVME_CTRL_RECOVERED, &ctrl->flags);
> > +	queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, rem);
> > +	return -EAGAIN;
>
> I don't think that reusing the same work to handle two completely
> different things is the right approach here.
>
> How about splitting to fence_work and err_work?
> That should eliminate some of the ctrl state inspections and simplify
> error recovery.

> > +
> > +done:
> > +	nvme_end_ctrl_recovery(ctrl);
> > +	return 0;
> > +}
> > +
> >  static void nvme_tcp_error_recovery_work(struct work_struct *work)
> >  {
> > -	struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
> > +	struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
> >  			struct nvme_tcp_ctrl, err_work);
> >  	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
> >
> > +	if (nvme_ctrl_state(ctrl) == NVME_CTRL_RECOVERING) {
> > +		if (nvme_tcp_recover_ctrl(ctrl))
> > +			return;
> > +	}
> > +
>
> Yea, I think we want to rework the current design.

Good point. Splitting out ctrl->fence_work simplifies things. The if
condition above will be moved to fence_work. However, we will still need
to reschedule ctrl->fence_work from within itself to implement time-based
recovery. Is this a good option? If not, and we prefer to drop the
NVME_CTRL_RECOVERED flag above and not reschedule ctrl->fence_work from
within itself, then we can add another ctrl->fenced_work. How about that?