From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-13.3 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, MENTIONS_GIT_HOSTING,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED, USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A78A7CA9EAF for ; Thu, 24 Oct 2019 21:24:53 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 6474121BE5 for ; Thu, 24 Oct 2019 21:24:53 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="e+/fPu2/" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729088AbfJXVYx (ORCPT ); Thu, 24 Oct 2019 17:24:53 -0400 Received: from us-smtp-2.mimecast.com ([205.139.110.61]:26050 "EHLO us-smtp-delivery-1.mimecast.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1729074AbfJXVYw (ORCPT ); Thu, 24 Oct 2019 17:24:52 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1571952291; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=8i8bOiz7H+lGIEzyBV740toibEL/95OmwhdMVstbw04=; b=e+/fPu2/xAUBUvbExmmVHzgRuxB7bFkNMFkuD+CipqvAuo59saX7TmMPKsnpG3j0FrXNO1 /Ernqd2IfTD+NJEilvASfeENapV7nA0M+lK/fVzweQ8FC+SorqS4AeyLOO4Qwv9dwYcoos sLWqTopRSWpWAUp7ZIoj9+1IHiU1Qf0= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-318-bR9J9BNpP7icFJ18KFylEw-1; Thu, 24 Oct 2019 17:24:47 -0400 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.15]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id D64BE107AD31; Thu, 24 Oct 2019 21:24:45 +0000 (UTC) Received: from ming.t460p (ovpn-8-16.pek2.redhat.com [10.72.8.16]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 93A5B5D713; Thu, 24 Oct 2019 21:24:33 +0000 (UTC) Date: Fri, 25 Oct 2019 05:24:28 +0800 From: Ming Lei To: John Garry Cc: "linux-scsi@vger.kernel.org" , "Martin K . Petersen" , James Bottomley , Jens Axboe , "Ewan D . Milne" , Omar Sandoval , Christoph Hellwig , Kashyap Desai , Hannes Reinecke , Laurence Oberman , Bart Van Assche Subject: Re: [PATCH V4 1/2] scsi: core: avoid host-wide host_busy counter for scsi_mq Message-ID: <20191024212427.GA26168@ming.t460p> References: <20191009093241.21481-1-ming.lei@redhat.com> <20191009093241.21481-2-ming.lei@redhat.com> <7d95de12-6114-c0d7-8b21-d36b2ea020fc@huawei.com> <20191024005828.GB15426@ming.t460p> <19e73b4d-77c7-e776-fee4-8b9f078c2be5@huawei.com> MIME-Version: 1.0 In-Reply-To: <19e73b4d-77c7-e776-fee4-8b9f078c2be5@huawei.com> User-Agent: Mutt/1.11.3 (2019-02-01) X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15 X-MC-Unique: bR9J9BNpP7icFJ18KFylEw-1 X-Mimecast-Spam-Score: 0 Content-Type: text/plain; charset=WINDOWS-1252 Content-Transfer-Encoding: quoted-printable Content-Disposition: inline Sender: linux-scsi-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-scsi@vger.kernel.org On Thu, Oct 24, 2019 at 10:19:21AM +0100, John Garry wrote: > On 24/10/2019 01:58, Ming Lei wrote: > > On Wed, Oct 23, 2019 at 09:52:41AM +0100, John Garry wrote: > > > On 09/10/2019 10:32, Ming Lei wrote: > > > > It isn't necessary to check the host depth in scsi_queue_rq() any m= ore > > > > since it has been respected by blk-mq before calling scsi_queue_rq(= ) via > > > > getting driver tag. > > > >=20 > > > > Lots of LUNs may attach to same host and per-host IOPS may reach mi= llions, > > > > so we should avoid expensive atomic operations on the host-wide cou= nter in > > > > the IO path. > > > >=20 > > > > This patch implements scsi_host_busy() via blk_mq_tagset_busy_iter(= ) > > > > with one scsi command state for reading the count of busy IOs for s= csi_mq. > > > >=20 > > > > It is observed that IOPS is increased by 15% in IO test on scsi_deb= ug (32 > > > > LUNs, 32 submit queues, 1024 can_queue, libaio/dio) in a dual-socke= t > > > > system. > > > >=20 > > > > V2: > > > > =09- introduce SCMD_STATE_INFLIGHT for getting accurate host busy > > > > =09via blk_mq_tagset_busy_iter() > > > > =09- verified that original Jens's report[1] is fixed > > > > =09- verified that SCSI timeout/abort works fine > > > >=20 > > > > [1] https://www.spinics.net/lists/linux-scsi/msg122867.html > > > > [2] V1 & its revert: > > > >=20 > > > > d772a65d8a6c Revert "scsi: core: avoid host-wide host_busy counter = for scsi_mq" > > > > 23aa8e69f2c6 Revert "scsi: core: fix scsi_host_queue_ready" > > > > 265d59aacbce scsi: core: fix scsi_host_queue_ready > > > > 328728630d9f scsi: core: avoid host-wide host_busy counter for scsi= _mq > > > >=20 >=20 > [not trimming] >=20 > > > > Cc: Jens Axboe > > > > Cc: Ewan D. Milne > > > > Cc: Omar Sandoval , > > > > Cc: "Martin K. Petersen" , > > > > Cc: James Bottomley , > > > > Cc: Christoph Hellwig , > > > > Cc: Kashyap Desai > > > > Cc: Hannes Reinecke > > > > Cc: Laurence Oberman > > > > Cc: Bart Van Assche > > > > Signed-off-by: Ming Lei > > > > --- > > > > drivers/scsi/hosts.c | 19 ++++++++++++++++++- > > > > drivers/scsi/scsi.c | 2 +- > > > > drivers/scsi/scsi_lib.c | 35 +++++++++++++++++-----------------= - > > > > drivers/scsi/scsi_priv.h | 2 +- > > > > include/scsi/scsi_cmnd.h | 1 + > > > > include/scsi/scsi_host.h | 1 - > > > > 6 files changed, 38 insertions(+), 22 deletions(-) > > > >=20 > > > > diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c > > > > index 55522b7162d3..1d669e47b692 100644 > > > > --- a/drivers/scsi/hosts.c > > > > +++ b/drivers/scsi/hosts.c > > > > @@ -38,6 +38,7 @@ > > > > #include > > > > #include > > > > #include > > > > +#include > > >=20 > > > alphabetic ordering could be maintained > > >=20 > > > > #include "scsi_priv.h" > > > > #include "scsi_logging.h" > > > > @@ -554,13 +555,29 @@ struct Scsi_Host *scsi_host_get(struct Scsi_H= ost *shost) > > > > } > > > > EXPORT_SYMBOL(scsi_host_get); > > > > +static bool scsi_host_check_in_flight(struct request *rq, void *da= ta, > > > > +=09=09=09=09 bool reserved) > > > > +{ > > > > +=09int *count =3D data; > > > > +=09struct scsi_cmnd *cmd =3D blk_mq_rq_to_pdu(rq); > > > > + > > > > +=09if (test_bit(SCMD_STATE_INFLIGHT, &cmd->state)) > > > > +=09=09(*count)++; > > > > + > > > > +=09return true; > > > > +} > > > > + > > > > /** > > > > * scsi_host_busy - Return the host busy counter > > >=20 > > > is this comment accurate now? > >=20 > > Nothing changed wrt. the above comment, I will explain below. >=20 > I mean that the host does not maintain a dedicated "busy counter" any lon= ger > (that was Scsi_Host.host_busy), so could be reworded accordingly. It is still a counter in logical. >=20 > >=20 > > >=20 > > > > * @shost:=09Pointer to Scsi_Host to inc. > > > > **/ > > > > int scsi_host_busy(struct Scsi_Host *shost) > > > > { > > > > -=09return atomic_read(&shost->host_busy); > > > > +=09int cnt =3D 0; > > > > + > > > > +=09blk_mq_tagset_busy_iter(&shost->tag_set, > > > > +=09=09=09=09scsi_host_check_in_flight, &cnt); > > >=20 > > > When I check blk_mq_tagset_busy_iter(), it does not seem to lock the = tags > > > for all hw queues against preemption for the counting, so how can we > > > guarantee that this count will be always accurate? > > >=20 > > > Or it never can be and you just don't care? > >=20 > > When the atomic variable of .host_busy is used, there isn't any host lo= ck > > to sync updating the variable and reading the variable, so nothing > > changed wrt. its usage given atomic variable can be updated when readin= g the > > variable. > >=20 > > That means it depends on its users. If the user doesn't care it, no loc= k > > is needed, otherwise user will deal with that, such as > > scsi_eh_scmd_add() and scsi_eh_inc_host_failed(), the host is put into > > SHOST_RECOVERY state first to prevent new requests from being queued, > > then read the counter in RCU callback via call_rcu(). > >=20 > > >=20 > > > > +=09return cnt; > > > > } > > > > EXPORT_SYMBOL(scsi_host_busy); > > > > diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c > > > > index 1f5b5c8a7f72..ddc4ec6ea2a1 100644 > > > > --- a/drivers/scsi/scsi.c > > > > +++ b/drivers/scsi/scsi.c > > > > @@ -186,7 +186,7 @@ void scsi_finish_command(struct scsi_cmnd *cmd) > > > > =09struct scsi_driver *drv; > > > > =09unsigned int good_bytes; > > > > -=09scsi_device_unbusy(sdev); > > > > +=09scsi_device_unbusy(sdev, cmd); > > > > =09/* > > > > =09 * Clear the flags that say that the device/target/host is no= longer > > > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c > > > > index dc210b9d4896..b6f66dcb15a5 100644 > > > > --- a/drivers/scsi/scsi_lib.c > > > > +++ b/drivers/scsi/scsi_lib.c > > > > @@ -189,7 +189,7 @@ static void __scsi_queue_insert(struct scsi_cmn= d *cmd, int reason, bool unbusy) > > > > =09 * active on the host/device. > > > > =09 */ > > > > =09if (unbusy) > > > > -=09=09scsi_device_unbusy(device); > > > > +=09=09scsi_device_unbusy(device, cmd); > > > > =09/* > > > > =09 * Requeue this command. It will go before all other command= s > @@ > > > > -329,12 +329,12 @@ static void scsi_init_cmd_errh(struct scsi_cmnd > > > *cmd) > > >=20 > > >=20 > > > The comment for scsi_dec_host_busy() is "Decrement the host_busy coun= ter and > > > ", so does this need to be fixed up? > >=20 > > Yeah, will fix it in next version. > >=20 > > >=20 > > > I'm guessing that there's lots more places like this... > >=20 > > I just run a grep, looks not found others. > >=20 > > >=20 > > > > * host_failed counter or that it notices the shost state change= made by > > > > * scsi_eh_scmd_add(). > > > > */ > > >=20 > > >=20 > > > > -static void scsi_dec_host_busy(struct Scsi_Host *shost) > > > > +static void scsi_dec_host_busy(struct Scsi_Host *shost, struct scs= i_cmnd *cmd) > > > > { > > > > =09unsigned long flags; > > > > =09rcu_read_lock(); > > > > -=09atomic_dec(&shost->host_busy); > > > > +=09__clear_bit(SCMD_STATE_INFLIGHT, &cmd->state); > > > > =09if (unlikely(scsi_host_in_recovery(shost))) { > > > > =09=09spin_lock_irqsave(shost->host_lock, flags); > > > > =09=09if (shost->host_failed || shost->host_eh_scheduled) > > > > @@ -344,12 +344,12 @@ static void scsi_dec_host_busy(struct Scsi_Ho= st *shost) > > > > =09rcu_read_unlock(); > > > > } > > > > -void scsi_device_unbusy(struct scsi_device *sdev) > > > > +void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd= *cmd) > > > > { > > > > =09struct Scsi_Host *shost =3D sdev->host; > > > > =09struct scsi_target *starget =3D scsi_target(sdev); > > > > -=09scsi_dec_host_busy(shost); > > > > +=09scsi_dec_host_busy(shost, cmd); > > > > =09if (starget->can_queue > 0) > > > > =09=09atomic_dec(&starget->target_busy); > > > > @@ -430,9 +430,6 @@ static inline bool scsi_target_is_busy(struct s= csi_target *starget) > > > > static inline bool scsi_host_is_busy(struct Scsi_Host *shost) > > > > { > > > > -=09if (shost->can_queue > 0 && > > > > -=09 atomic_read(&shost->host_busy) >=3D shost->can_queue) > > > > -=09=09return true; > > >=20 > > > Do we still honour "do not send more than can_queue commands to the > > > adapter", regardless of how many queues the LLDD exposes? > >=20 > > What we should honour is that 'do not send more than can_queue commands > > to each hw queue', that is exactly guaranteed by blk-mq. >=20 > In scsi_host.h, we have for scsi_host_template.can_queue: "It is set to t= he > maximum number of simultaneous commands a given host adapter will accept.= ", > so that should be honoured. That words should have been changed to: "It is set to the maximum number of simultaneous commands a given host adap= ter's hw queue will accept." >=20 > And Scsi_host.nr_hw_queues: "it is assumed that each hardware queue has a > queue depth of can_queue. In other words, the total queue depth per host = is > nr_hw_queues * can_queue." The above is correct. >=20 > I don't read "total queue depth per host" same as "maximum number of > simultaneous commands a given host adapter will accept". If anything, the > nr_hw_queues comment is ambiguous. >=20 > >=20 > > The point is simple, because each hw queue has its own independent tags= , > > that is why I mentioned your Hisilicon SAS can't be converted to MQ > > easily cause this hardware has only single shared tags. >=20 > Please be aware that HiSilicon SAS HW would not be unique for SCSI HBAs i= n > this regard, in that the unique hostwide tag is not just for HBA HW IO > management, but also is used as the tag for SCSI TMFs. Right. >=20 > Just checking mpt3sas seems similar: >=20 > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/d= rivers/scsi/mpt3sas/mpt3sas_scsih.c#n2918 >=20 > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/d= rivers/scsi/mpt3sas/mpt3sas_base.c#n3546 Not only mpt3sas, there are also HPSA and more. And these drivers have to support single hw queue of blk-mq, instead of real MQ. And the reason is th= at these HBA has single tags. Thanks, Ming