From: Sasha Levin
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Nilay Shroff, Christoph Hellwig, Keith Busch, Sasha Levin,
	sagi@grimberg.me, linux-nvme@lists.infradead.org
Subject: [PATCH AUTOSEL 6.1 17/17] nvme: make keep-alive synchronous operation
Date: Wed, 23 Oct 2024 10:31:56 -0400
Message-ID: <20241023143202.2981992-17-sashal@kernel.org>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20241023143202.2981992-1-sashal@kernel.org>
References: <20241023143202.2981992-1-sashal@kernel.org>
X-stable: review
X-stable-base: Linux 6.1.114
MIME-Version: 1.0
From: Nilay Shroff

[ Upstream commit d06923670b5a5f609603d4a9fee4dec02d38de9c ]

The nvme keep-alive operation, which executes at a periodic interval,
could potentially sneak in while a fabric controller is being shut
down. This may lead to a race between the fabric controller admin
queue destroy code path (invoked while shutting down the controller)
and the hw/hctx queue dispatcher called from the nvme keep-alive async
request queuing operation. This race could lead to the kernel crash
shown below:

Call Trace:
 autoremove_wake_function+0x0/0xbc (unreliable)
 __blk_mq_sched_dispatch_requests+0x114/0x24c
 blk_mq_sched_dispatch_requests+0x44/0x84
 blk_mq_run_hw_queue+0x140/0x220
 nvme_keep_alive_work+0xc8/0x19c [nvme_core]
 process_one_work+0x200/0x4e0
 worker_thread+0x340/0x504
 kthread+0x138/0x140
 start_kernel_thread+0x14/0x18

If an nvme keep-alive request sneaks in while the fabric controller is
shutting down, it is flushed off. The nvme_keep_alive_end_io function
is then invoked to handle the end of the keep-alive operation; it
decrements admin->q_usage_counter and, assuming this is the last/only
request in the admin queue, admin->q_usage_counter drops to zero. If
that happens, the blk-mq destroy queue operation
(blk_mq_destroy_queue()), which could be running simultaneously on
another cpu (as this is the controller shutdown code path), makes
forward progress and deletes the admin queue. From that point onward
the admin queue resources must not be accessed. However, the nvme
keep-alive thread running the hw/hctx queue dispatch operation has not
yet finished its work, so it could still access the admin queue
resources after the admin queue has already been deleted, which causes
the above crash.

This fix avoids the observed crash by implementing keep-alive as a
synchronous operation, so that admin->q_usage_counter is decremented
only after the keep-alive command has finished execution and its
status has been returned to the caller (blk_execute_rq()). This
ensures that the fabric shutdown code path does not destroy the fabric
admin queue until the keep-alive request has finished execution and
the keep-alive thread is no longer running the hw/hctx queue dispatch
operation.
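For context, a condensed sketch of the two submission models follows. It
is illustrative only and not part of the patch below: the helper names
keep_alive_async() and keep_alive_sync() are hypothetical, surrounding
setup and error handling are omitted, and it assumes the usual kernel
blk-mq context. The point it shows is when the request (and with it the
admin queue's q_usage_counter reference) is released relative to the
hw/hctx dispatch work.

/*
 * Condensed, hypothetical sketch contrasting asynchronous and
 * synchronous keep-alive submission; simplified from
 * drivers/nvme/host/core.c, not the actual patch.
 */

/*
 * Before: async submission.  The end_io callback can free the request
 * (dropping the admin queue's q_usage_counter reference) while the
 * dispatcher that queued it is still running, so a concurrent
 * blk_mq_destroy_queue() on the shutdown path may tear down the admin
 * queue under the dispatcher's feet.
 */
static void keep_alive_async(struct nvme_ctrl *ctrl, struct request *rq)
{
	rq->end_io = nvme_keep_alive_end_io;	/* frees rq on completion */
	rq->end_io_data = ctrl;
	blk_execute_rq_nowait(rq, false);	/* returns before completion */
}

/*
 * After: sync submission.  blk_execute_rq() only returns once the
 * command has completed, and the request (and its q_usage_counter
 * reference) is freed after the dispatch work is done, so the shutdown
 * path cannot destroy the admin queue while dispatch is still in flight.
 */
static void keep_alive_sync(struct nvme_ctrl *ctrl, struct request *rq)
{
	blk_status_t status;

	status = blk_execute_rq(rq, false);	/* blocks until completion */
	nvme_keep_alive_finish(rq, status, ctrl);
	blk_mq_free_request(rq);		/* reference dropped here */
}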
Reviewed-by: Christoph Hellwig
Signed-off-by: Nilay Shroff
Signed-off-by: Keith Busch
Signed-off-by: Sasha Levin
---
 drivers/nvme/host/core.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index dc25d91891327..92ffeb6605618 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1231,10 +1231,9 @@ static void nvme_queue_keep_alive_work(struct nvme_ctrl *ctrl)
 			   nvme_keep_alive_work_period(ctrl));
 }
 
-static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
-						 blk_status_t status)
+static void nvme_keep_alive_finish(struct request *rq,
+		blk_status_t status, struct nvme_ctrl *ctrl)
 {
-	struct nvme_ctrl *ctrl = rq->end_io_data;
 	unsigned long flags;
 	bool startka = false;
 	unsigned long rtt = jiffies - (rq->deadline - rq->timeout);
@@ -1252,13 +1251,11 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
 		delay = 0;
 	}
 
-	blk_mq_free_request(rq);
-
 	if (status) {
 		dev_err(ctrl->device,
 			"failed nvme_keep_alive_end_io error=%d\n",
 			status);
-		return RQ_END_IO_NONE;
+		return;
 	}
 
 	ctrl->ka_last_check_time = jiffies;
@@ -1270,7 +1267,6 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
 	spin_unlock_irqrestore(&ctrl->lock, flags);
 	if (startka)
 		queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
-	return RQ_END_IO_NONE;
 }
 
 static void nvme_keep_alive_work(struct work_struct *work)
@@ -1279,6 +1275,7 @@ static void nvme_keep_alive_work(struct work_struct *work)
 					    struct nvme_ctrl, ka_work);
 	bool comp_seen = ctrl->comp_seen;
 	struct request *rq;
+	blk_status_t status;
 
 	ctrl->ka_last_check_time = jiffies;
 
@@ -1301,9 +1298,9 @@ static void nvme_keep_alive_work(struct work_struct *work)
 
 	nvme_init_request(rq, &ctrl->ka_cmd);
 	rq->timeout = ctrl->kato * HZ;
-	rq->end_io = nvme_keep_alive_end_io;
-	rq->end_io_data = ctrl;
-	blk_execute_rq_nowait(rq, false);
+	status = blk_execute_rq(rq, false);
+	nvme_keep_alive_finish(rq, status, ctrl);
+	blk_mq_free_request(rq);
 }
 
 static void nvme_start_keep_alive(struct nvme_ctrl *ctrl)
-- 
2.43.0