From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <643f1f34-e88b-0e79-3834-62884b614008@grimberg.me>
Date: Sun, 25 Jun 2023 11:09:45 +0300
Subject: Re: [PATCH V2 0/4] nvme: fix two kinds of IO hang from removing NSs
From: Sagi Grimberg
To: Ming Lei, Keith Busch
Cc: Jens Axboe, Christoph Hellwig, linux-nvme@lists.infradead.org, Yi Zhang, linux-block@vger.kernel.org, Chunguang Xu
References: <27ce75fc-f6c5-7bf3-8448-242ee3e65067@grimberg.me>
>>>>>> The point was to contain requests from entering while the hctx's are
>>>>>> being reconfigured. If you're going to pair up the freezes as you've
>>>>>> suggested, we might as well just not call freeze at all.
>>>>>
>>>>> blk_mq_update_nr_hw_queues() requires the queue to be frozen.
>>>>
>>>> It's too late at that point. Let's work through a real example. You'll
>>>> need a system that has more CPUs than your nvme has IO queues.
>>>>
>>>> Boot without any special nvme parameters. Every possible nvme IO queue
>>>> will be assigned the "default" hctx type. Now start IO to every queue,
>>>> then run:
>>>>
>>>> # echo 8 > /sys/module/nvme/parameters/poll_queues && echo 1 > /sys/class/nvme/nvme0/reset_controller
>>>>
>>>> Today, we freeze prior to tearing down the "default" IO queues, so
>>>> nothing enters them while the driver reconfigures the queues.
>>>
>>> nvme_start_freeze() just prevents new IO from being queued; old IO may
>>> still be entering the block layer queue. What matters here is actually
>>> quiesce, which prevents new IO from being queued to the driver/hardware.
>>>
>>>> What you're suggesting will allow IO to queue up in a quiesced "default"
>>>> queue, which will become "polled" without an interrupt handler on the
>>>> other side of the reset. The application doesn't know that, so the IO
>>>> you're allowing to queue up will time out.
>>>
>>> A timeout only happens after the request is queued to the driver/hardware,
>>> i.e. after blk_mq_start_request() is called in nvme_queue_rq(). Quiesce
>>> actually prevents new IOs from being dispatched to the driver or queued
>>> via .queue_rq(); meanwhile old requests have been canceled, so no
>>> request can time out.
>>
>> Quiesce doesn't prevent requests from entering an hctx, and you can't
>> back them out to put on another hctx later. It doesn't matter that you
>> haven't dispatched them to hardware yet. The request's queue was set the
>> moment it was allocated, so after you unquiesce and freeze for the new
>> queue mapping, the requests previously blocked on quiesce will time out
>> in the scenario I've described.
>>
>> There are certainly gaps in the existing code where errored requests
>> can be requeued or stuck elsewhere and hit the exact same problem, but
>> the current way at least tries to contain it.
>
> Yeah, but you can't remove the gap at all with start_freeze; that said,
> the current code has to live with a new mapping change coexisting with
> old requests carrying the old mapping.
>
> Actually I considered handling this kind of situation before; one
> approach is to reuse the bio-steal logic taken in nvme mpath:
>
> 1) for FS IO, re-submit the bios, meanwhile freeing the request
>
> 2) for PT requests, simply fail them
>
> 2) could be a bit harsh even though REQ_FAILFAST_DRIVER is always set
> for PT requests, but I don't see any better approach for handling PT
> requests.

Ming, I suggest submitting patches for tcp/rdma and continuing the
discussion on the pci driver.