From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.3 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_SANE_1 autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id CD203C432C3 for ; Thu, 21 Nov 2019 02:29:32 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id A45BB20885 for ; Thu, 21 Nov 2019 02:29:32 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="EJjzn0GN" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726532AbfKUC3b (ORCPT ); Wed, 20 Nov 2019 21:29:31 -0500 Received: from us-smtp-2.mimecast.com ([205.139.110.61]:57307 "EHLO us-smtp-delivery-1.mimecast.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1725936AbfKUC3b (ORCPT ); Wed, 20 Nov 2019 21:29:31 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1574303369; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=00xDEnOM/AF21OETC8ONOHPp+zFLm0s9kS6DPKrSxHY=; b=EJjzn0GNFS5LzbAjlbT8aJogeZoT8ETKd0Q43emeOhW+FoROvSUb5scMgxk/W3BSS4a/t2 OMMzYW+1bky27EXuxZKgn90/eAsdh9DN+4955liEmQNHZtTF0ERCeLcioXCF6gmk5icmO9 3JCHLNNczXRtlSyDEGha0Bp9ccLOxLk= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-60-FB6Jo5RXM2yZs72atnFaZw-1; Wed, 20 Nov 2019 21:29:26 -0500 Received: from smtp.corp.redhat.com (int-mx01.intmail.prod.int.phx2.redhat.com [10.5.11.11]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 05C6E477; Thu, 21 Nov 2019 02:29:24 +0000 (UTC) Received: from ming.t460p (ovpn-8-21.pek2.redhat.com [10.72.8.21]) by smtp.corp.redhat.com (Postfix) with ESMTPS id E9A4960BC4; Thu, 21 Nov 2019 02:29:13 +0000 (UTC) Date: Thu, 21 Nov 2019 10:29:09 +0800 From: Ming Lei To: Sumanesh Samanta Cc: axboe@kernel.dk, linux-block@vger.kernel.org, jejb@linux.ibm.com, martin.petersen@oracle.com, linux-scsi@vger.kernel.org, Sathya Prakash , chaitra.basappa@broadcom.com, Suganath Prabu Subramani , Kashyap Desai , Sumit Saxena , Shivasharan Srikanteshwara , Ewan Milne , hch@lst.de, hare@suse.de, bart.vanassche@wdc.com Subject: Re: [PATCH 1/1] scsi core: limit overhead of device_busy counter for SSDs Message-ID: <20191121022909.GI24548@ming.t460p> References: <1574194079-27363-1-git-send-email-sumanesh.samanta@broadcom.com> <1574194079-27363-2-git-send-email-sumanesh.samanta@broadcom.com> <20191120072919.GB3664@ming.t460p> <20191121013419.GF24548@ming.t460p> MIME-Version: 1.0 In-Reply-To: User-Agent: Mutt/1.12.1 (2019-06-15) X-Scanned-By: MIMEDefang 2.79 on 10.5.11.11 X-MC-Unique: FB6Jo5RXM2yZs72atnFaZw-1 X-Mimecast-Spam-Score: 0 Content-Type: text/plain; charset=WINDOWS-1252 Content-Transfer-Encoding: quoted-printable Content-Disposition: inline Sender: linux-scsi-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-scsi@vger.kernel.org On Wed, Nov 20, 2019 at 06:59:49PM -0700, Sumanesh Samanta wrote: > >>IO size usually depends on workloads, instead of controller. > >>You can't assume that big size IO is always submitted from userspace > >>because HBA is high end. >=20 > Absolutely, and I did not try to say otherwise. >=20 > What I tried to say was that high end controllers normally have an IO > size limit that they can handle, and block layer cannot merge beyond > that anyway. And most sequential workload to HDD is large IO, because > and application reading/writing sequentially have no reason/incentive > to give small IO ( with the exception of DB REDO logs that go to SSD). > So, Sequential IO is generally large anyway, and they cannot be merged > too much by block layer, since controllers can support IO to only > specific size. > On top of that, many high end controllers have their own capability of > merging IO, and does not depend on upper layer. We shouldn't make assumptions on workloads, which is in another world, not kernel side. As I mentioned, if the controller exposes proper size limit, block layer will follow it and make a proper size IO to driver/device. >=20 > >>Frankly speaking, we should check NONROT flag instead of controller, > >>which may connect to slow HDD too, right? >=20 > If you can find a way of checking NONROT without the problems being > discussed in this and the other thread, I am completely fine and happy > with that. > If, however, no acceptable solution is found then I would request > giving the flag to the controller so that the ones that are not > dependent on IO merge are not burdened with it. Of course, this patch can do that since queue is frozen when changing NONROT, please take a close look at the patch 4. Thanks, Ming >=20 >=20 >=20 > On Wed, Nov 20, 2019 at 6:34 PM Ming Lei wrote: > > > > On Wed, Nov 20, 2019 at 12:59:56PM -0700, Sumanesh Samanta wrote: > > > On Wed, Nov 20, 2019 at 12:29 AM Ming Lei wrote= : > > > > > > > > On Tue, Nov 19, 2019 at 12:07:59PM -0800, Sumanesh Samanta wrote: > > > > > From: root > > > > > > > > > > Recently a patch was delivered to remove host_busy counter from S= CSI mid layer. That was a major bottleneck, and helped improve SCSI stack p= erformance. > > > > > With that patch, bottle neck moved to the scsi_device device_busy= counter. The performance issue with this counter is seen more in cases whe= re a single device can produce very high IOPs, for example h/w RAID devices= where OS sees one device, but there are many drives behind it, thus being = capable of very high IOPs. The effect is also visible when cores from multi= ple NUMA nodes send IO to the same device or same controller. > > > > > The device_busy counter is not needed by controllers which can ma= nage as many IO as submitted to it. Rotating media still uses it for mergin= g IO, but for non-rotating SSD drives it becomes a major bottleneck as desc= ribed above. > > > > > > > > > > A few weeks back, a patch was provided to address the device_busy= counter also but unfortunately that had some issues: > > > > > 1. There was a functional issue discovered: > > > > > https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/VFKD= TG4XC4VHWX5KKDJJI7P36EIGK526/ > > > > > 2. There was some concern about existing drivers using the device= _busy counter. > > > > > > > > There are only two drivers(mpt3sas and megaraid_sas) which uses thi= s > > > > counter. And there are two types of usage: > > > > > > > > 1) both use .device_busy to balance interrupt load among LUNs in > > > > fast path > > > > > > > > 2) mpt3sas uses .device_busy in its device reset handler(slow path)= , and > > > > this kind of usage can be replaced by blk_mq_queue_tag_busy_iter() > > > > easily. > > > > > > > > IMO, blk-mq has already considered IO load balance, see > > > > hctx_may_queue(), meantime managed IRQ can balance IO completion lo= ad > > > > among each IRQ vectors, not see obvious reason for driver to do tha= t > > > > any more. > > > > > > > > However, if the two drivers still want to do that, I'd suggest to i= mplement > > > > it inside the driver, and no reason to re-invent generic wheels jus= t for > > > > two drivers. > > > > > > > > That is why I replace .device_busy uses in the two drivers with pri= vate > > > > counters in the patches posted days ago: > > > > > > > > https://lore.kernel.org/linux-scsi/20191118103117.978-1-ming.lei@re= dhat.com/T/#t > > > > > > > > > > Agreed, a private counter should be good enough. > > > > > > > And if drivers thought the private counter isn't good enough, they = can > > > > improve it in any way, such as this percpu approach, or even kill t= hem. > > > > > > > > > > I was more concerned about the functional issue discovered in the > > > earlier patch and provided mine as an alternative without any side > > > effect or functional issue, since it does not modify any core logic. > > > Having said that, if your latest patch goes through and is accepted, > > > then agree that my patch is not needed. If however, some issue is > > > discovered in your latest patch, then I would request my patch to be > > > considered as an alternative, so that the device_busy counter overhea= d > > > can be avoided > > > > > > > > > > > > > This patch is an attempt to address both the above issues. > > > > > For this patch to be effective, LLDs need to set a specific flag = use_per_cpu_device_busy in the scsi_host_template. For other drivers ( who = does not set the flag), this patch would be a no-op, and should not affect = their performance or functionality at all. > > > > > > > > > > Also, this patch does not fundamentally change any logic or funct= ionality of the code. All it does is replace device_busy with a per CPU cou= nter. In fast path, all cpu increment/decrement their own counter. In relat= ively slow path. they call scsi_device_busy function to get the total no of= IO outstanding on a device. Only functional aspect it changes is that for = non-rotating media, the number of IO to a device is not restricted. Control= lers which can handle that, can set the use_per_cpu_device_busy flag in scs= i_host_template to take advantage of this patch. Other controllers need not= modify any code and would work as usual. > > > > > Since the patch does not modify any other functional aspects, it = should not have any side effects even for drivers that do set the use_per_c= pu_device_busy flag. > > > > > --- > > > > > drivers/scsi/scsi_lib.c | 151 +++++++++++++++++++++++++++++++= +++--- > > > > > drivers/scsi/scsi_scan.c | 16 ++++ > > > > > drivers/scsi/scsi_sysfs.c | 9 ++- > > > > > drivers/scsi/sg.c | 2 +- > > > > > include/scsi/scsi_device.h | 15 ++++ > > > > > include/scsi/scsi_host.h | 16 ++++ > > > > > 6 files changed, 197 insertions(+), 12 deletions(-) > > > > > > > > > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c > > > > > index 2563b061f56b..5dc392914f9e 100644 > > > > > --- a/drivers/scsi/scsi_lib.c > > > > > +++ b/drivers/scsi/scsi_lib.c > > > > > @@ -52,6 +52,12 @@ > > > > > #define SCSI_INLINE_SG_CNT 2 > > > > > #endif > > > > > > > > > > +#define MAX_PER_CPU_COUNTER_ABSOLUTE_VAL (0xFFFFFFFFFFF) > > > > > +#define PER_CPU_COUNTER_OK_VAL (MAX_PER_CPU_COUNTER_ABSOLUTE_VAL= >>16) > > > > > +#define USE_DEVICE_BUSY(sdev) (!(sdev)->host->hostt->use_= per_cpu_device_busy \ > > > > > + || !blk_queue_nonrot((sdev)->reques= t_queue)) > > > > > + > > > > > + > > > > > static struct kmem_cache *scsi_sdb_cache; > > > > > static struct kmem_cache *scsi_sense_cache; > > > > > static struct kmem_cache *scsi_sense_isadma_cache; > > > > > @@ -65,6 +71,111 @@ scsi_select_sense_cache(bool unchecked_isa_dm= a) > > > > > return unchecked_isa_dma ? scsi_sense_isadma_cache : scsi_s= ense_cache; > > > > > } > > > > > > > > > > +/* > > > > > + *Generic helper function to decrement per cpu io counter. > > > > > + *@per_cpu_counter: The per cpu counter array. Current cpu count= er will be > > > > > + * decremented > > > > > + */ > > > > > + > > > > > +static inline void dec_per_cpu_io_counter(atomic64_t __percpu *p= er_cpu_counter) > > > > > +{ > > > > > + atomic64_t __percpu *io_count =3D get_cpu_ptr(per_cpu_count= er); > > > > > + > > > > > + if (unlikely(abs(atomic64_dec_return(io_count)) > > > > > > + MAX_PER_CPU_COUNTER_ABSOLUTE_VAL)) > > > > > + scsi_rebalance_per_cpu_io_counters(per_cpu_counter,= io_count); > > > > > + put_cpu_ptr(per_cpu_counter); > > > > > +} > > > > > +/* > > > > > + *Generic helper function to increment per cpu io counter. > > > > > + *@per_cpu_counter: The per cpu counter array. Current cpu count= er will be > > > > > + * incremented > > > > > + */ > > > > > +static inline void inc_per_cpu_io_counter(atomic64_t __percpu *p= er_cpu_counter) > > > > > +{ > > > > > + atomic64_t __percpu *io_count =3D get_cpu_ptr(per_cpu_count= er); > > > > > + > > > > > + if (unlikely(abs(atomic64_inc_return(io_count)) > > > > > > + MAX_PER_CPU_COUNTER_ABSOLUTE_VAL)) > > > > > + scsi_rebalance_per_cpu_io_counters(per_cpu_counter,= io_count); > > > > > + put_cpu_ptr(per_cpu_counter); > > > > > +} > > > > > + > > > > > + > > > > > +/** > > > > > + * scsi_device_busy - Return the device_busy counter > > > > > + * @sdev: Pointer to scsi_device to get busy counter. > > > > > + **/ > > > > > +int scsi_device_busy(struct scsi_device *sdev) > > > > > +{ > > > > > + long long total =3D 0; > > > > > + int i; > > > > > + > > > > > + if (USE_DEVICE_BUSY(sdev)) > > > > > > > > As Ewan and Bart commented, you can't use the NONROT queue flag sim= ply > > > > in IO path, given it may be changed somewhere. > > > > > > > > > > I added the NONROT check just as an afterthought. This patch is > > > designed for high end controllers, and most of them have some storage > > > IO size limit. > > > > IO size usually depends on workloads, instead of controller. > > > > You can't assume that big size IO is always submitted from userspace > > because HBA is high end. > > > > For sequential IO workloads, block layer tries best to run IO merge, > > which is usually big win for HDD. > > > > > Also, for HDD sequential IO is almost always large and > > > touch the controller max IO size limit. > > > > No, there are workloads, single IO size isn't big, but > > they are sequential, it may not be hit by readahead, such as > > write. That is exactly what block layer has to run IO merge > > for these small IOs. > > > > > Thus, I am not sure merge > > > matters for these kind of controllers. Database use REDO log and smal= l > > > sequential IO, but those are targeted to SSDs, where latency and IOPs > > > are far more important than IO merging. > > > Anyway, this patch is opt-in for drivers, so any LLD that cannot take > > > advantage of the flag need not set it, and would work as-is. > > > I can provide a new version of the patch with this check removed > > > > Frankly speaking, we should check NONROT flag instead of controller, > > which may connect to slow HDD too, right? > > > > Thanks, > > Ming > > >=20 --=20 Ming