Subject: Re: [PATCH RFC 0/3] blk-mq/nvme: use blk_mq_alloc_request() for
 NVMe's connect request
To: Ming Lei
Cc: Jens Axboe, linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
 Keith Busch, Christoph Hellwig, James Smart
References: <20191115104238.15107-1-ming.lei@redhat.com>
 <8f4402a0-967d-f12d-2f1a-949e1dda017c@grimberg.me>
 <20191116071754.GB18194@ming.t460p>
From: Sagi Grimberg
Message-ID: <016afdbc-9c63-4193-e64b-aad91ba5fcc1@grimberg.me>
Date: Mon, 18 Nov 2019 16:05:56 -0800
In-Reply-To: <20191116071754.GB18194@ming.t460p>

> Hi Sagi,
>
> On Fri, Nov 15, 2019 at 02:38:44PM -0800, Sagi Grimberg wrote:
>>
>>> Hi,
>>
>> Hey Ming,
>>
>>> Use blk_mq_alloc_request() for allocating NVMe loop, fc, rdma and
>>> tcp's connect request, and selecting the transport queue at runtime
>>> for the connect request.
>>>
>>> Then kill blk_mq_alloc_request_hctx().
>>
>> Is it really so wrong to have an API to allocate a tag that belongs to
>> a specific queue? Why must the tag allocation always correlate to the
>> running cpu? It's true that NVMe is the only consumer of this at the
>> moment, but does that mean the interface should be removed because it
>> has one (or rather four) consumer(s)?
>
> Now blk-mq takes a static queue mapping between CPUs and hw queues;
> given that CPU hotplug may happen at any time, the specified hw queue
> may become inactive at any time.
>
> Queue mapping from CPUs to hw queues is a core model of blk-mq, which
> relies heavily on the fact that whether a hw queue is active depends on
> the request's submission CPU. And we can always retrieve an active hw
> queue as long as there is at least one online CPU.
>
> An IO request is always mapped to the proper hw queue via its
> submission CPU and queue type.
>
> So blk_mq_alloc_request_hctx() is really weird given the above blk-mq
> model.
>
> Actually the 4 consumers are just a single usage pattern for submitting
> the connect command; it seems no one has explained this special usage
> before. And the driver can handle it well enough without this
> interface, just like this patch does, can't it?

Does removing the cpumask_and with cpu_online_mask fix your test? That
check is racy to begin with, because nothing guarantees it still holds
right after it is made. It is a much simpler fix that does not create
this churn local to every driver.

Also, I don't like the assumptions about tag reservations that the
drivers are making locally (that the connect request will get tag 0,
for example). All this makes this look like a hack. There is also the
question of what happens when we want to make connects parallel, which
is not the case at the moment...
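To make the race concrete, the relevant part of
blk_mq_alloc_request_hctx() looks roughly like this (paraphrased from
memory, so treat it as a sketch rather than the exact source):

struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
	unsigned int op, blk_mq_req_flags_t flags, unsigned int hctx_idx)
{
	struct blk_mq_alloc_data alloc_data = { .flags = flags, .cmd_flags = op };
	unsigned int cpu;

	/* ... flag/index validation and blk_queue_enter() elided ... */

	alloc_data.hctx = q->queue_hw_ctx[hctx_idx];
	if (!blk_mq_hw_queue_mapped(alloc_data.hctx)) {
		blk_queue_exit(q);
		return ERR_PTR(-EXDEV);
	}

	/*
	 * The check in question: even if an online CPU is found here,
	 * nothing prevents it from going offline immediately after,
	 * before the request is allocated, let alone dispatched.
	 */
	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
	alloc_data.ctx = __blk_mq_get_ctx(q, cpu);

	/* ... tag allocation and blk_queue_exit() elided ... */
}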
>> I would instead suggest to simply remove the constraint that
>> blk_mq_alloc_request_hctx() will fail if the first cpu in the mask
>> is not in the cpu_online_mask. The caller of this would know and be
>> able to handle it.
>
> Of course, this usage is very special, which is different from normal
> IO or passthrough requests.
>
> The caller of this still needs to rely on blk-mq for dispatching the
> request:
>
> 1) blk-mq needs to run the hw queue in round-robin style, with a
> different CPU selected from the active CPU mask each time the hw queue
> is run.
>
> 2) Most blk-mq drivers have switched to managed IRQs, which may be shut
> down while there are still in-flight requests not yet completed on the
> hw queue. We need to fix this issue. One solution is to free the
> request and remap its bios onto a proper active hw queue according to
> the new submission CPU.
>
> 3) A warning is triggered when dispatching a request on an inactive hw
> queue.
>
> If we are going to support this special usage, a lot of blk-mq needs to
> change to cover such a special case.

I'm starting to think we may need to take the connect out of block
layer execution if it's such a big problem... It's a real shame if that
is the case...

>> To me it feels like this approach is fundamentally wrong. IMO, having
>> the driver select a different queue than the one the tag naturally
>> belongs to feels like a backwards design.
>
> The root cause is that the connect command needs to be submitted via
> one specific queue, which is fundamentally incompatible with blk-mq's
> queue mapping model.
>
> Given the connect command is so special for nvme, I'd suggest letting
> the driver handle it completely, since blk-mq is supposed to serve
> generic IO.

It's one thing to handle it locally, and another to hack queue_rq to
queue on a different queue than the one the block layer determined...
If this model is fundamentally broken with how the block layer
dispatches requests, then we need a different solution.
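And for reference, the relaxation I suggested above is essentially this
(completely untested, just to show the direction):

	/*
	 * Untested sketch: take any CPU in the hctx mask instead of
	 * requiring an online one; the hotplug race exists either way,
	 * so let the caller see the failure and handle it (e.g. by
	 * retrying or tearing down the connect).
	 */
	cpu = cpumask_first(alloc_data.hctx->cpumask);
	alloc_data.ctx = __blk_mq_get_ctx(q, cpu);

The nvme transports would then handle a failed or timed-out connect
like any other error on that queue, instead of each driver growing its
own queue-selection hack.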