Subject: Re: [PATCH v5 1/2] blk-mq: add tagset quiesce interface
From: Jens Axboe
To: Ming Lei
Cc: Sagi Grimberg, linux-nvme@lists.infradead.org,
 linux-block@vger.kernel.org, Chao Leng, Keith Busch, Ming Lin,
 Christoph Hellwig
Date: Mon, 27 Jul 2020 20:32:53 -0600
In-Reply-To: <20200728022802.GC1305646@T590>

On 7/27/20 8:28 PM, Ming Lei wrote:
> On Mon, Jul 27, 2020 at 08:23:15PM -0600, Jens Axboe wrote:
>> On 7/27/20 8:17 PM, Ming Lei wrote:
>>> On Mon, Jul 27, 2020 at 07:51:16PM -0600, Jens Axboe wrote:
>>>> On 7/27/20 7:40 PM, Ming Lei wrote:
>>>>> On Mon, Jul 27, 2020 at 04:10:21PM -0700, Sagi Grimberg wrote:
>>>>>> Drivers that have shared tagsets may need to quiesce potentially a lot
>>>>>> of request queues that all share a single tagset (e.g. nvme). Add an interface
>>>>>> to quiesce all the queues on a given tagset. This interface is useful because
>>>>>> it can speed up the quiesce by doing it in parallel.
>>>>>>
>>>>>> For tagsets that have BLK_MQ_F_BLOCKING set, we use call_srcu on all hctxs
>>>>>> in parallel such that all of them wait for the same rcu elapsed period with
>>>>>> a per-hctx heap allocated rcu_synchronize. For tagsets that don't have
>>>>>> BLK_MQ_F_BLOCKING set, we simply call a single synchronize_rcu as this is
>>>>>> sufficient.
>>>>>>
>>>>>> Signed-off-by: Sagi Grimberg
>>>>>> ---
>>>>>>  block/blk-mq.c         | 66 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>  include/linux/blk-mq.h |  4 +++
>>>>>>  2 files changed, 70 insertions(+)
>>>>>>
>>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>>>> index abcf590f6238..c37e37354330 100644
>>>>>> --- a/block/blk-mq.c
>>>>>> +++ b/block/blk-mq.c
>>>>>> @@ -209,6 +209,42 @@ void blk_mq_quiesce_queue_nowait(struct request_queue *q)
>>>>>>  }
>>>>>>  EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
>>>>>>
>>>>>> +static void blk_mq_quiesce_blocking_queue_async(struct request_queue *q)
>>>>>> +{
>>>>>> +	struct blk_mq_hw_ctx *hctx;
>>>>>> +	unsigned int i;
>>>>>> +
>>>>>> +	blk_mq_quiesce_queue_nowait(q);
>>>>>> +
>>>>>> +	queue_for_each_hw_ctx(q, hctx, i) {
>>>>>> +		WARN_ON_ONCE(!(hctx->flags & BLK_MQ_F_BLOCKING));
>>>>>> +		hctx->rcu_sync = kmalloc(sizeof(*hctx->rcu_sync), GFP_KERNEL);
>>>>>> +		if (!hctx->rcu_sync)
>>>>>> +			continue;
>>>>>
>>>>> This approach of quiesce/unquiesce tagset is a good abstraction.
>>>>>
>>>>> Just one more thing, please allocate a rcu_sync array because hctx is
>>>>> supposed to not store scratch stuff.
>>>>
>>>> I'd be all for not stuffing this in the hctx, but how would that work?
>>>> The only thing I can think of that would work reliably is batching the
>>>> queue+wait into units of N. We could potentially have many thousands of
>>>> queues, and it could get iffy (and/or unreliable) in terms of allocation
>>>> size. Looks like rcu_synchronize is 48 bytes on my local install, and it
>>>> doesn't take a lot of devices at current CPU counts to make an alloc
>>>> covering all of it huge. Let's say 64 threads, and 32 devices, then
>>>> we're already at 64*32*48 bytes which is an order 5 allocation. Not
>>>> friendly, and not going to be reliable when you need it. And if we start
>>>> batching in reasonable counts, then we're _almost_ back to doing a queue
>>>> or two at a time...
>>>> 32 * 48 is 1536 bytes, so we could only do two at
>>>> a time for single page allocations.
>>>
>>> We can convert to order 0 allocation by one extra indirect array.
>>
>> I guess that could work, and would just be one extra alloc + free if we
>> still retain the batch. That'd take it to 16 devices (at 32 CPUs) per
>> round, potentially way less of course if we have more CPUs. So still
>> somewhat limiting, rather than do all at once.
>
> With the approach in blk_mq_alloc_rqs(), each allocated page can be
> added to one list, so the indirect array can be saved. Then it is
> possible to allocate for any number of queues/devices since every
> allocation is just for a single page, made only when it is needed, and
> no pre-calculation is required.

As long as we watch the complexity, I don't think we need to go overboard
here at the risk of adding issues for the failure path. But yes, we
could use the same trick I did in blk_mq_alloc_rqs() and just alloc
pages as we go.

-- 
Jens Axboe
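
To make the sizes discussed above concrete, here is a minimal user-space sketch of the "alloc pages as we go" idea. It is not the kernel code and not what blk_mq_alloc_rqs() does line for line. It assumes a 4096-byte page and the 48-byte rcu_synchronize size quoted in the thread; struct sync_slot, struct slot_page, struct slot_pool, alloc_order(), slot_pool_get() and slot_pool_free() are hypothetical names invented for illustration, and malloc() stands in for a single-page kernel allocation.

```c
#include <stddef.h>
#include <stdlib.h>

#define PAGE_SIZE 4096u /* assumption: 4 KiB pages */
#define SYNC_SIZE 48u   /* rcu_synchronize size quoted in the thread */

/* Opaque stand-in for struct rcu_synchronize; only its size matters here. */
struct sync_slot {
	unsigned char opaque[SYNC_SIZE];
};

/* How many slots fit in one page next to the list bookkeeping. */
#define SLOTS_PER_PAGE \
	((PAGE_SIZE - sizeof(void *) - sizeof(unsigned int)) / SYNC_SIZE)

/* One page worth of slots, chained into a singly linked list. */
struct slot_page {
	struct slot_page *next;
	unsigned int used;
	struct sync_slot slots[SLOTS_PER_PAGE];
};

_Static_assert(sizeof(struct slot_page) <= PAGE_SIZE,
	       "slot_page must fit in a single page");

struct slot_pool {
	struct slot_page *head; /* most recently allocated page */
	size_t pages;           /* number of pages allocated so far */
};

/* Smallest order such that (PAGE_SIZE << order) covers 'bytes'. */
unsigned int alloc_order(size_t bytes)
{
	unsigned int order = 0;

	while (((size_t)PAGE_SIZE << order) < bytes)
		order++;
	return order;
}

/*
 * Hand out one slot, allocating a new page-sized chunk only when the
 * current one is full. Every allocation is a single page, so no
 * pre-calculation of the total count is needed.
 */
struct sync_slot *slot_pool_get(struct slot_pool *pool)
{
	struct slot_page *pg = pool->head;

	if (!pg || pg->used == SLOTS_PER_PAGE) {
		pg = malloc(sizeof(*pg));
		if (!pg)
			return NULL; /* caller decides how to handle failure */
		pg->used = 0;
		pg->next = pool->head;
		pool->head = pg;
		pool->pages++;
	}
	return &pg->slots[pg->used++];
}

/* Failure and teardown path: walk the list and free every page. */
void slot_pool_free(struct slot_pool *pool)
{
	while (pool->head) {
		struct slot_page *pg = pool->head;

		pool->head = pg->next;
		free(pg);
	}
	pool->pages = 0;
}
```

With these assumptions, alloc_order(64 * 32 * 48) returns 5, matching the order 5 figure above, while handing out the same 64 * 32 slots through slot_pool_get() never asks for more than one page at a time. The cost is the list walk on teardown, which is also the only cleanup the failure path needs.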