From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 91FBDEB64DA for ; Sun, 25 Jun 2023 00:27:53 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229493AbjFYA1v (ORCPT ); Sat, 24 Jun 2023 20:27:51 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57498 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229484AbjFYA1v (ORCPT ); Sat, 24 Jun 2023 20:27:51 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E853D10FC for ; Sat, 24 Jun 2023 17:27:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1687652821; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=V8s4QvpXwSYPKRFomDkujYHgHMc8jEkRSvTYXcWmxyY=; b=UhrzeXKBk5B0CX3tjdkh2eqBO5wMhw8invopDjDlIHwU/qqKIseSp2sJUJMgFWH7A7jv/p v4Pp2MDQ2AkSrYzC1NKrz9ACKeRR3ziwAfAmeFZezT3Dm9XlknFxJgR02UUhkHxmEOhJNm 4Ges59JIhFQyGmmmld5zOfFmIPimNOM= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-587-EjYExB4SO7a-6MQS9Xci3A-1; Sat, 24 Jun 2023 20:26:58 -0400 X-MC-Unique: EjYExB4SO7a-6MQS9Xci3A-1 Received: from smtp.corp.redhat.com (int-mx09.intmail.prod.int.rdu2.redhat.com [10.11.54.9]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 104BC8CC203; Sun, 25 Jun 2023 00:26:58 +0000 (UTC) Received: from ovpn-8-16.pek2.redhat.com (ovpn-8-16.pek2.redhat.com [10.72.8.16]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 2376B492B01; Sun, 25 Jun 2023 00:26:52 +0000 (UTC) Date: Sun, 25 Jun 2023 08:26:48 +0800 From: Ming Lei To: Keith Busch Cc: Sagi Grimberg , Jens Axboe , Christoph Hellwig , linux-nvme@lists.infradead.org, Yi Zhang , linux-block@vger.kernel.org, Chunguang Xu Subject: Re: [PATCH V2 0/4] nvme: fix two kinds of IO hang from removing NSs Message-ID: References: <27ce75fc-f6c5-7bf3-8448-242ee3e65067@grimberg.me> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Scanned-By: MIMEDefang 3.1 on 10.11.54.9 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org On Thu, Jun 22, 2023 at 09:19:19AM -0600, Keith Busch wrote: > On Thu, Jun 22, 2023 at 10:53:05PM +0800, Ming Lei wrote: > > On Thu, Jun 22, 2023 at 08:35:49AM -0600, Keith Busch wrote: > > > On Thu, Jun 22, 2023 at 09:51:12PM +0800, Ming Lei wrote: > > > > On Wed, Jun 21, 2023 at 09:48:49AM -0600, Keith Busch wrote: > > > > > The point was to contain requests from entering while the hctx's are > > > > > being reconfigured. If you're going to pair up the freezes as you've > > > > > suggested, we might as well just not call freeze at all. > > > > > > > > blk_mq_update_nr_hw_queues() requires queue to be frozen. > > > > > > It's too late at that point. Let's work through a real example. You'll > > > need a system that has more CPU's than your nvme has IO queues. > > > > > > Boot without any special nvme parameters. Every possible nvme IO queue > > > will be assigned "default" hctx type. Now start IO to every queue, then > > > run: > > > > > > # echo 8 > /sys/modules/nvme/parameters/poll_queues && echo 1 > /sys/class/nvme/nvme0/reset_controller > > > > > > Today, we freeze prior to tearing down the "default" IO queues, so > > > there's nothing entered into them while the driver reconfigures the > > > queues. > > > > nvme_start_freeze() just prevents new IO from being queued, and old ones > > may still be entering block layer queue, and what matters here is > > actually quiesce, which prevents new IO from being queued to > > driver/hardware. > > > > > > > > What you're suggesting will allow IO to queue up in a queisced "default" > > > queue, which will become "polled" without an interrupt hanlder on the > > > other side of the reset. The application doesn't know that, so the IO > > > you're allowing to queue up will time out. > > > > time out only happens after the request is queued to driver/hardware, or after > > blk_mq_start_request() is called in nvme_queue_rq(), but quiesce actually > > prevents new IOs from being dispatched to driver or be queued via .queue_rq(), > > meantime old requests have been canceled, so no any request can be > > timed out. > > Quiesce doesn't prevent requests from entering an hctx, and you can't > back it out to put on another hctx later. It doesn't matter that you > haven't dispatched it to hardware yet. The request's queue was set the > moment it was allocated, so after you unquiesce and freeze for the new > queue mapping, the requests previously blocked on quiesce will time out > in the scenario I've described. > > There are certainly gaps in the existing code where error'ed requests > can be requeued or stuck elsewhere and hit the exact same problem, but > the current way at least tries to contain it. Yeah, but you can't remove the gap at all with start_freeze, that said the current code has to live with the situation of new mapping change and old request with old mapping. Actually I considered to handle this kind of situation before, one approach is to reuse the bio steal logic taken in nvme mpath: 1) for FS IO, re-submit bios, meantime free request 2) for PT request, simply fail it It could be a bit violent for 2) even though REQ_FAILFAST_DRIVER is always set for PT request, but not see any better approach for handling PT request. Thanks, Ming