From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.2 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED,USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1CEC6C432C0 for ; Thu, 21 Nov 2019 15:45:55 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id E69742068E for ; Thu, 21 Nov 2019 15:45:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726658AbfKUPpy (ORCPT ); Thu, 21 Nov 2019 10:45:54 -0500 Received: from mx2.suse.de ([195.135.220.15]:37960 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726279AbfKUPpy (ORCPT ); Thu, 21 Nov 2019 10:45:54 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 29534AF79; Thu, 21 Nov 2019 15:45:50 +0000 (UTC) Subject: Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD To: Ming Lei Cc: Jens Axboe , linux-block@vger.kernel.org, "James E . J . Bottomley" , "Martin K . Petersen" , linux-scsi@vger.kernel.org, Sathya Prakash , Chaitra P B , Suganath Prabu Subramani , Kashyap Desai , Sumit Saxena , Shivasharan S , "Ewan D . Milne" , Christoph Hellwig , Bart Van Assche References: <20191118103117.978-1-ming.lei@redhat.com> <20191118103117.978-5-ming.lei@redhat.com> <1081145f-3e17-9bc1-2332-50a4b5621ef7@suse.de> <20191121005323.GB24548@ming.t460p> From: Hannes Reinecke Openpgp: preference=signencrypt Autocrypt: addr=hare@suse.de; prefer-encrypt=mutual; keydata= mQINBE6KyREBEACwRN6XKClPtxPiABx5GW+Yr1snfhjzExxkTYaINHsWHlsLg13kiemsS6o7 qrc+XP8FmhcnCOts9e2jxZxtmpB652lxRB9jZE40mcSLvYLM7S6aH0WXKn8bOqpqOGJiY2bc 6qz6rJuqkOx3YNuUgiAxjuoYauEl8dg4bzex3KGkGRuxzRlC8APjHlwmsr+ETxOLBfUoRNuE b4nUtaseMPkNDwM4L9+n9cxpGbdwX0XwKFhlQMbG3rWA3YqQYWj1erKIPpgpfM64hwsdk9zZ QO1krgfULH4poPQFpl2+yVeEMXtsSou915jn/51rBelXeLq+cjuK5+B/JZUXPnNDoxOG3j3V VSZxkxLJ8RO1YamqZZbVP6jhDQ/bLcAI3EfjVbxhw9KWrh8MxTcmyJPn3QMMEp3wpVX9nSOQ tzG72Up/Py67VQe0x8fqmu7R4MmddSbyqgHrab/Nu+ak6g2RRn3QHXAQ7PQUq55BDtj85hd9 W2iBiROhkZ/R+Q14cJkWhzaThN1sZ1zsfBNW0Im8OVn/J8bQUaS0a/NhpXJWv6J1ttkX3S0c QUratRfX4D1viAwNgoS0Joq7xIQD+CfJTax7pPn9rT////hSqJYUoMXkEz5IcO+hptCH1HF3 qz77aA5njEBQrDRlslUBkCZ5P+QvZgJDy0C3xRGdg6ZVXEXJOQARAQABtCpIYW5uZXMgUmVp bmVja2UgKFN1U0UgTGFicykgPGhhcmVAc3VzZS5kZT6JAkEEEwECACsCGwMFCRLMAwAGCwkI BwMCBhUIAgkKCwQWAgMBAh4BAheABQJOisquAhkBAAoJEGz4yi9OyKjPOHoQAJLeLvr6JNHx GPcHXaJLHQiinz2QP0/wtsT8+hE26dLzxb7hgxLafj9XlAXOG3FhGd+ySlQ5wSbbjdxNjgsq FIjqQ88/Lk1NfnqG5aUTPmhEF+PzkPogEV7Pm5Q17ap22VK623MPaltEba+ly6/pGOODbKBH ak3gqa7Gro5YCQzNU0QVtMpWyeGF7xQK76DY/atvAtuVPBJHER+RPIF7iv5J3/GFIfdrM+wS BubFVDOibgM7UBnpa7aohZ9RgPkzJpzECsbmbttxYaiv8+EOwark4VjvOne8dRaj50qeyJH6 HLpBXZDJH5ZcYJPMgunghSqghgfuUsd5fHmjFr3hDb5EoqAfgiRMSDom7wLZ9TGtT6viDldv hfWaIOD5UhpNYxfNgH6Y102gtMmN4o2P6g3UbZK1diH13s9DA5vI2mO2krGz2c5BOBmcctE5 iS+JWiCizOqia5Op+B/tUNye/YIXSC4oMR++Fgt30OEafB8twxydMAE3HmY+foawCpGq06yM vAguLzvm7f6wAPesDAO9vxRNC5y7JeN4Kytl561ciTICmBR80Pdgs/Obj2DwM6dvHquQbQrU Op4XtD3eGUW4qgD99DrMXqCcSXX/uay9kOG+fQBfK39jkPKZEuEV2QdpE4Pry36SUGfohSNq xXW+bMc6P+irTT39VWFUJMcSuQINBE6KyREBEACvEJggkGC42huFAqJcOcLqnjK83t4TVwEn JRisbY/VdeZIHTGtcGLqsALDzk+bEAcZapguzfp7cySzvuR6Hyq7hKEjEHAZmI/3IDc9nbdh EgdCiFatah0XZ/p4vp7KAelYqbv8YF/ORLylAdLh9rzLR6yHFqVaR4WL4pl4kEWwFhNSHLxe 55G56/dxBuoj4RrFoX3ynerXfbp4dH2KArPc0NfoamqebuGNfEQmDbtnCGE5zKcR0zvmXsRp qU7+caufueZyLwjTU+y5p34U4PlOO2Q7/bdaPEdXfpgvSpWk1o3H36LvkPV/PGGDCLzaNn04 BdiiiPEHwoIjCXOAcR+4+eqM4TSwVpTn6SNgbHLjAhCwCDyggK+3qEGJph+WNtNU7uFfscSP k4jqlxc8P+hn9IqaMWaeX9nBEaiKffR7OKjMdtFFnBRSXiW/kOKuuRdeDjL5gWJjY+IpdafP KhjvUFtfSwGdrDUh3SvB5knSixE3qbxbhbNxmqDVzyzMwunFANujyyVizS31DnWC6tKzANkC k15CyeFC6sFFu+WpRxvC6fzQTLI5CRGAB6FAxz8Hu5rpNNZHsbYs9Vfr/BJuSUfRI/12eOCL IvxRPpmMOlcI4WDW3EDkzqNAXn5Onx/b0rFGFpM4GmSPriEJdBb4M4pSD6fN6Y/Jrng/Bdwk SQARAQABiQIlBBgBAgAPBQJOiskRAhsMBQkSzAMAAAoJEGz4yi9OyKjPgEwQAIP/gy/Xqc1q OpzfFScswk3CEoZWSqHxn/fZasa4IzkwhTUmukuIvRew+BzwvrTxhHcz9qQ8hX7iDPTZBcUt ovWPxz+3XfbGqE+q0JunlIsP4N+K/I10nyoGdoFpMFMfDnAiMUiUatHRf9Wsif/nT6oRiPNJ T0EbbeSyIYe+ZOMFfZBVGPqBCbe8YMI+JiZeez8L9JtegxQ6O3EMQ//1eoPJ5mv5lWXLFQfx f4rAcKseM8DE6xs1+1AIsSIG6H+EE3tVm+GdCkBaVAZo2VMVapx9k8RMSlW7vlGEQsHtI0FT c1XNOCGjaP4ITYUiOpfkh+N0nUZVRTxWnJqVPGZ2Nt7xCk7eoJWTSMWmodFlsKSgfblXVfdM 9qoNScM3u0b9iYYuw/ijZ7VtYXFuQdh0XMM/V6zFrLnnhNmg0pnK6hO1LUgZlrxHwLZk5X8F uD/0MCbPmsYUMHPuJd5dSLUFTlejVXIbKTSAMd0tDSP5Ms8Ds84z5eHreiy1ijatqRFWFJRp ZtWlhGRERnDH17PUXDglsOA08HCls0PHx8itYsjYCAyETlxlLApXWdVl9YVwbQpQ+i693t/Y PGu8jotn0++P19d3JwXW8t6TVvBIQ1dRZHx1IxGLMn+CkDJMOmHAUMWTAXX2rf5tUjas8/v2 azzYF4VRJsdl+d0MCaSy8mUh Message-ID: <336f35fc-2e22-c615-9405-50297b9737ea@suse.de> Date: Thu, 21 Nov 2019 16:45:48 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.7.2 MIME-Version: 1.0 In-Reply-To: <20191121005323.GB24548@ming.t460p> Content-Type: text/plain; charset=windows-1252 Content-Language: en-US Content-Transfer-Encoding: 8bit Sender: linux-scsi-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-scsi@vger.kernel.org On 11/21/19 1:53 AM, Ming Lei wrote: > On Wed, Nov 20, 2019 at 11:05:24AM +0100, Hannes Reinecke wrote: >> On 11/18/19 11:31 AM, Ming Lei wrote: >>> SCSI core uses the atomic variable of sdev->device_busy to track >>> in-flight IO requests dispatched to this scsi device. IO request may be >>> submitted from any CPU, so the cost for maintaining the shared atomic >>> counter can be very big on big NUMA machine with lots of CPU cores. >>> >>> sdev->queue_depth is usually used for two purposes: 1) improve IO merge; >>> 2) fair IO request scattered among all LUNs. >>> >>> blk-mq already provides fair request allocation among all active shared >>> request queues(LUNs), see hctx_may_queue(). >>> >>> NVMe doesn't have such per-request-queue(namespace) queue depth, so it >>> is reasonable to ignore the limit for SCSI SSD too. Also IO merge won't >>> play big role for reaching top SSD performance. >>> >>> With this patch, big cost for tracking in-flight per-LUN requests via >>> atomic variable can be saved. >>> >>> Given QUEUE_FLAG_NONROT is read in IO path, we have to freeze queue >>> before changing this flag. >>> >>> Cc: Sathya Prakash >>> Cc: Chaitra P B >>> Cc: Suganath Prabu Subramani >>> Cc: Kashyap Desai >>> Cc: Sumit Saxena >>> Cc: Shivasharan S >>> Cc: Ewan D. Milne >>> Cc: Christoph Hellwig , >>> Cc: Hannes Reinecke >>> Cc: Bart Van Assche >>> Signed-off-by: Ming Lei >>> --- >>> block/blk-sysfs.c | 14 +++++++++++++- >>> drivers/scsi/scsi_lib.c | 24 ++++++++++++++++++------ >>> drivers/scsi/sd.c | 4 ++++ >>> 3 files changed, 35 insertions(+), 7 deletions(-) >>> >>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c >>> index fca9b158f4a0..9cc0dd5f88a1 100644 >>> --- a/block/blk-sysfs.c >>> +++ b/block/blk-sysfs.c >>> @@ -281,6 +281,18 @@ QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0); >>> QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0); >>> #undef QUEUE_SYSFS_BIT_FNS >>> >>> +static ssize_t queue_store_nonrot_freeze(struct request_queue *q, >>> + const char *page, size_t count) >>> +{ >>> + size_t ret; >>> + >>> + blk_mq_freeze_queue(q); >>> + ret = queue_store_nonrot(q, page, count); >>> + blk_mq_unfreeze_queue(q); >>> + >>> + return ret; >>> +} >>> + >>> static ssize_t queue_zoned_show(struct request_queue *q, char *page) >>> { >>> switch (blk_queue_zoned_model(q)) { >>> @@ -642,7 +654,7 @@ static struct queue_sysfs_entry queue_write_zeroes_max_entry = { >>> static struct queue_sysfs_entry queue_nonrot_entry = { >>> .attr = {.name = "rotational", .mode = 0644 }, >>> .show = queue_show_nonrot, >>> - .store = queue_store_nonrot, >>> + .store = queue_store_nonrot_freeze, >>> }; >>> >>> static struct queue_sysfs_entry queue_zoned_entry = { >>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c >>> index 62a86a82c38d..72655a049efd 100644 >>> --- a/drivers/scsi/scsi_lib.c >>> +++ b/drivers/scsi/scsi_lib.c >>> @@ -354,7 +354,8 @@ void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd) >>> if (starget->can_queue > 0) >>> atomic_dec(&starget->target_busy); >>> >>> - atomic_dec(&sdev->device_busy); >>> + if (!blk_queue_nonrot(sdev->request_queue)) >>> + atomic_dec(&sdev->device_busy); >>> } >>> >>> static void scsi_kick_queue(struct request_queue *q) >>> @@ -410,7 +411,8 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev) >>> >>> static inline bool scsi_device_is_busy(struct scsi_device *sdev) >>> { >>> - if (atomic_read(&sdev->device_busy) >= sdev->queue_depth) >>> + if (!blk_queue_nonrot(sdev->request_queue) && >>> + atomic_read(&sdev->device_busy) >= sdev->queue_depth) >>> return true; >>> if (atomic_read(&sdev->device_blocked) > 0) >>> return true; >>> @@ -1283,8 +1285,12 @@ static inline int scsi_dev_queue_ready(struct request_queue *q, >>> struct scsi_device *sdev) >>> { >>> unsigned int busy; >>> + bool bypass = blk_queue_nonrot(sdev->request_queue); >>> >>> - busy = atomic_inc_return(&sdev->device_busy) - 1; >>> + if (!bypass) >>> + busy = atomic_inc_return(&sdev->device_busy) - 1; >>> + else >>> + busy = 0; >>> if (atomic_read(&sdev->device_blocked)) { >>> if (busy) >>> goto out_dec; >>> @@ -1298,12 +1304,16 @@ static inline int scsi_dev_queue_ready(struct request_queue *q, >>> "unblocking device at zero depth\n")); >>> } >>> >>> + if (bypass) >>> + return 1; >>> + >>> if (busy >= sdev->queue_depth) >>> goto out_dec; >>> >>> return 1; >>> out_dec: >>> - atomic_dec(&sdev->device_busy); >>> + if (!bypass) >>> + atomic_dec(&sdev->device_busy); >>> return 0; >>> } >>> >>> @@ -1624,7 +1634,8 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx) >>> struct request_queue *q = hctx->queue; >>> struct scsi_device *sdev = q->queuedata; >>> >>> - atomic_dec(&sdev->device_busy); >>> + if (!blk_queue_nonrot(sdev->request_queue)) >>> + atomic_dec(&sdev->device_busy); >>> } >>> >>> static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx) >>> @@ -1731,7 +1742,8 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx, >>> case BLK_STS_OK: >>> break; >>> case BLK_STS_RESOURCE: >>> - if (atomic_read(&sdev->device_busy) || >>> + if ((!blk_queue_nonrot(sdev->request_queue) && >>> + atomic_read(&sdev->device_busy)) || >>> scsi_device_blocked(sdev)) >>> ret = BLK_STS_DEV_RESOURCE; >>> break; >>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c >>> index 0744c34468e1..c3d47117700d 100644 >>> --- a/drivers/scsi/sd.c >>> +++ b/drivers/scsi/sd.c >>> @@ -2927,7 +2927,9 @@ static void sd_read_block_characteristics(struct scsi_disk *sdkp) >>> rot = get_unaligned_be16(&buffer[4]); >>> >>> if (rot == 1) { >>> + blk_mq_freeze_queue(q); >>> blk_queue_flag_set(QUEUE_FLAG_NONROT, q); >>> + blk_mq_unfreeze_queue(q); >>> blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q); >>> } >>> >>> @@ -3123,7 +3125,9 @@ static int sd_revalidate_disk(struct gendisk *disk) >>> * cause this to be updated correctly and any device which >>> * doesn't support it should be treated as rotational. >>> */ >>> + blk_mq_freeze_queue(q); >>> blk_queue_flag_clear(QUEUE_FLAG_NONROT, q); >>> + blk_mq_unfreeze_queue(q); >>> blk_queue_flag_set(QUEUE_FLAG_ADD_RANDOM, q); >>> >>> if (scsi_device_supports_vpd(sdp)) { >>> >> Hmm. >> >> I must admit I patently don't like this explicit dependency on >> blk_nonrot(). Having a conditional counter is just an open invitation to >> getting things wrong... > > That won't be an issue given this patch freezes queue when the > NONROT queue flag is changed. > That's not what I'm saying. My issue is that we're turning the device_busy counter into a conditional counter, which only must be accessed when that condition is true. This not only puts a burden on the developer (as he has to remember to always check the condition), but also has possible issues when switching between modes. The latter you've addressed with ensuring that the queue must be frozen when modifying that flag, but the former is a conceptual problem. And that was what I had been referring to. >> >> I would far prefer if we could delegate any queueing decision to the >> elevators, and completely drop the device_busy flag for all devices. > > If you drop it, you may create big sequential IO performance drop > on HDD., that is why this patch only bypasses sdev->queue_depth on > SSD. NVMe bypasses it because no one uses HDD. via NVMe. > I still wonder how much performance drop we actually see; what seems to happen is that device_busy just arbitrary pushes back to the block layer, giving it more time to do merging. I do think we can do better then that... Cheers, Hannes -- Dr. Hannes Reinecke Teamlead Storage & Networking hare@suse.de +49 911 74053 688 SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 247165 (AG München), GF: Felix Imendörffer