From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.4 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS,USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 85449CA9ED3 for ; Tue, 5 Nov 2019 03:34:57 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 5289D20818 for ; Tue, 5 Nov 2019 03:34:57 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="MR1SH1t6" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730188AbfKEDe4 (ORCPT ); Mon, 4 Nov 2019 22:34:56 -0500 Received: from us-smtp-delivery-1.mimecast.com ([207.211.31.120]:40416 "EHLO us-smtp-1.mimecast.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1728910AbfKEDe4 (ORCPT ); Mon, 4 Nov 2019 22:34:56 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1572924895; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Hsr0KNuZO/WePyH6FKz+kA+0NZ/Wshn8/p6mobwaFMg=; b=MR1SH1t6ikYCZ8aovwUaqngwGi1niv79MyzDclT63GQ1K/RBKDo5VuwR4EjHBXepPGWOt0 P5PPka9NHz0qSdM35eFwuJKULgtnMsDhjyEAk/6IeOYeDSXAc/A6MstHTnRuQPwmuJGr2P bhuUpAEV9VLN8OI/qIZV/l4uNU1fSPI= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-237-F8u0TWxlMOezz6E19BnV3A-1; Mon, 04 Nov 2019 22:34:49 -0500 Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.phx2.redhat.com [10.5.11.14]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id EF1EA1005500; Tue, 5 Nov 2019 03:34:47 +0000 (UTC) Received: from ming.t460p (ovpn-8-20.pek2.redhat.com [10.72.8.20]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 20C195D9CD; Tue, 5 Nov 2019 03:34:40 +0000 (UTC) Date: Tue, 5 Nov 2019 11:34:36 +0800 From: Ming Lei To: Jens Axboe Cc: Kent Overstreet , Christoph Hellwig , linux-block@vger.kernel.org, Coly Li , Keith Busch , linux-bcache@vger.kernel.org Subject: Re: [PATCH V4] block: optimize for small block size IO Message-ID: <20191105033436.GH11436@ming.t460p> References: <20191104181541.GA21116@infradead.org> <20191104181742.GC8984@kmo-pixel> <20191104184217.GD8984@kmo-pixel> <20191105011135.GD11436@ming.t460p> <20191105021130.GB18564@moria.home.lan> <20191105022046.GF11436@ming.t460p> <20191105023002.GC18564@moria.home.lan> <20191105024641.GG11436@ming.t460p> <06fc1a0c-8c8b-e7ab-f343-3861db737776@kernel.dk> MIME-Version: 1.0 In-Reply-To: <06fc1a0c-8c8b-e7ab-f343-3861db737776@kernel.dk> User-Agent: Mutt/1.12.1 (2019-06-15) X-Scanned-By: MIMEDefang 2.79 on 10.5.11.14 X-MC-Unique: F8u0TWxlMOezz6E19BnV3A-1 X-Mimecast-Spam-Score: 0 Content-Type: text/plain; charset=WINDOWS-1252 Content-Transfer-Encoding: quoted-printable Content-Disposition: inline Sender: linux-block-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org On Mon, Nov 04, 2019 at 07:49:49PM -0700, Jens Axboe wrote: > On 11/4/19 7:46 PM, Ming Lei wrote: > > On Mon, Nov 04, 2019 at 09:30:02PM -0500, Kent Overstreet wrote: > >> On Tue, Nov 05, 2019 at 10:20:46AM +0800, Ming Lei wrote: > >>> On Mon, Nov 04, 2019 at 09:11:30PM -0500, Kent Overstreet wrote: > >>>> On Tue, Nov 05, 2019 at 09:11:35AM +0800, Ming Lei wrote: > >>>>> On Mon, Nov 04, 2019 at 01:42:17PM -0500, Kent Overstreet wrote: > >>>>>> On Mon, Nov 04, 2019 at 11:23:42AM -0700, Jens Axboe wrote: > >>>>>>> On 11/4/19 11:17 AM, Kent Overstreet wrote: > >>>>>>>> On Mon, Nov 04, 2019 at 10:15:41AM -0800, Christoph Hellwig wrot= e: > >>>>>>>>> On Mon, Nov 04, 2019 at 01:14:03PM -0500, Kent Overstreet wrote= : > >>>>>>>>>> On Sat, Nov 02, 2019 at 03:29:11PM +0800, Ming Lei wrote: > >>>>>>>>>>> __blk_queue_split() may be a bit heavy for small block size(s= uch as > >>>>>>>>>>> 512B, or 4KB) IO, so introduce one flag to decide if this bio= includes > >>>>>>>>>>> multiple page. And only consider to try splitting this bio in= case > >>>>>>>>>>> that the multiple page flag is set. > >>>>>>>>>> > >>>>>>>>>> So, back in the day I had an alternative approach in mind: get= rid of > >>>>>>>>>> blk_queue_split entirely, by pushing splitting down to the req= uest layer - when > >>>>>>>>>> we map the bio/request to sgl, just have it map as much as wil= l fit in the sgl > >>>>>>>>>> and if it doesn't entirely fit bump bi_remaining and leave it = on the request > >>>>>>>>>> queue. > >>>>>>>>>> > >>>>>>>>>> This would mean there'd be no need for counting segments at al= l, and would cut a > >>>>>>>>>> fair amount of code out of the io path. > >>>>>>>>> > >>>>>>>>> I thought about that to, but it will take a lot more effort. M= ostly > >>>>>>>>> because md/dm heavily rely on splitting as well. I still think= it is > >>>>>>>>> worthwhile, it will just take a significant amount of time and = we > >>>>>>>>> should have the quick improvement now. > >>>>>>>> > >>>>>>>> We can do it one driver at a time - driver sets a flag to disabl= e > >>>>>>>> blk_queue_split(). Obvious one to do first would be nvme since t= hat's where it > >>>>>>>> shows up the most. > >>>>>>>> > >>>>>>>> And md/md do splitting internally, but I'm not so sure they need > >>>>>>>> blk_queue_split(). > >>>>>>> > >>>>>>> I'm a big proponent of doing something like that instead, but it = is a > >>>>>>> lot of work. I absolutely hate the splitting we're doing now, eve= n > >>>>>>> though the original "let's work as hard as we add add page time t= o get > >>>>>>> things right" was pretty abysmal as well. > >>>>>> > >>>>>> Last I looked I don't think it was going to be that bad, just need= ed a bit of > >>>>>> finesse. We just need to be able to partially process a request in= e.g. > >>>>>> nvme_map_data(), and blk_rq_map_sg() needs to be modified to only = map as much as > >>>>>> will fit instead of popping an assertion. > >>>>> > >>>>> I think it may not be doable. > >>>>> > >>>>> blk_rq_map_sg() is called by drivers and has to work on single requ= est, however > >>>>> more requests have to be involved if we delay the splitting to blk_= rq_map_sg(). > >>>>> Cause splitting means that two bios can't be submitted in single IO= request. > >>>> > >>>> Of course it's doable, do I have to show you how? > >>> > >>> No, you don't have to, could you just point out where my above words = is wrong? > >> > >> blk_rq_map_sg() _currently_ works on a single request, but as I said f= rom the > >> start that this would involve changing it to only process as much of a= request > >> as would fit on an sglist. > >=20 > >> Drivers will have to be modified, but the changes to driver code shoul= d be > >> pretty easy. What will be slightly trickier will be changing blk-mq to= handle > >> requests that are only partially completed; that will be harder than i= t would > >> have been before blk-mq, since the old request queue code used to hand= le > >> partially completed requests - not much work would have to be done tha= t code. > >=20 > > Looks you are suggesting partial request completion. > >=20 > > Then the biggest effect could be in performance, this change will cause= the > > whole FS bio is handled part by part serially, instead of submitting al= l > > splitted bios(part) concurrently. > >=20 > > So sounds you are suggesting to fix one performance issue by causing ne= w perf > > issue, is that doable? >=20 > It does seem like a rat hole of sorts. Because then you start adding code= to > guesstimate how big the request could roughly be, and if you miss a bit, > you get a request that's tiny in between the normal sized ones. >=20 > Or you'd clone, and then you could still have them inflight in parallel. > But then you're paying the cost of that... However, IO latency is often much longer than latency of bio fast clone, which is just one fix-sized slab allocation, mostly should hit cache. I understand the current issue should be the cost of blk_queue_split() in case of no splitting is required. blk_bio_segment_split() has been simple enough, looks we need to investigate why it is so slow since it shouldn't be such in-efficient. Thanks, Ming