From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.3 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,NICE_REPLY_A,SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 439ABC433DB for ; Tue, 9 Mar 2021 16:38:25 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 15CB1651BB for ; Tue, 9 Mar 2021 16:38:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231146AbhCIQhw (ORCPT ); Tue, 9 Mar 2021 11:37:52 -0500 Received: from frasgout.his.huawei.com ([185.176.79.56]:2672 "EHLO frasgout.his.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229799AbhCIQho (ORCPT ); Tue, 9 Mar 2021 11:37:44 -0500 Received: from fraeml734-chm.china.huawei.com (unknown [172.18.147.200]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4Dw11t0Hb8z67x38; Wed, 10 Mar 2021 00:31:46 +0800 (CST) Received: from lhreml724-chm.china.huawei.com (10.201.108.75) by fraeml734-chm.china.huawei.com (10.206.15.215) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2106.2; Tue, 9 Mar 2021 17:37:42 +0100 Received: from [10.210.172.22] (10.210.172.22) by lhreml724-chm.china.huawei.com (10.201.108.75) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2106.2; Tue, 9 Mar 2021 16:37:41 +0000 Subject: Re: [PATCH v1] scsi: storvsc: Cap cmd_per_lun at can_queue To: Michael Kelley , "linux-scsi@vger.kernel.org" , "andres@anarazel.de" , Haiyang Zhang , "jejb@linux.ibm.com" , KY Srinivasan , "linux-hyperv@vger.kernel.org" , "linux-kernel@vger.kernel.org" , "martin.petersen@oracle.com" , Stephen Hemminger , "wei.liu@kernel.org" References: <20210305232151.1531-1-melanieplageman@gmail.com> <20210308175618.GA2376@goldwasser> <01aa44d0-f0a5-6de6-6778-a1658a3d8a8f@huawei.com> From: John Garry Message-ID: <343cce4d-58f6-a1f2-ca4f-e32ff1eddf65@huawei.com> Date: Tue, 9 Mar 2021 16:35:39 +0000 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) Gecko/20100101 Thunderbird/68.1.2 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset="utf-8"; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit X-Originating-IP: [10.210.172.22] X-ClientProxiedBy: lhreml748-chm.china.huawei.com (10.201.108.198) To lhreml724-chm.china.huawei.com (10.201.108.75) X-CFilter-Loop: Reflected Precedence: bulk List-ID: X-Mailing-List: linux-scsi@vger.kernel.org On 09/03/2021 15:57, Michael Kelley wrote: > From: John Garry Sent: Tuesday, March 9, 2021 2:10 AM >> >> On 08/03/2021 17:56, Melanie Plageman wrote: >>> On Mon, Mar 08, 2021 at 02:37:40PM +0000, Michael Kelley wrote: >>>> From: Melanie Plageman (Microsoft) Sent: Friday, >> March 5, 2021 3:22 PM >>>>> >>>>> The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during >>>>> allocation. >>>>> >>>>> Cap cmd_per_lun at can_queue to avoid dispatch errors. >>>>> >>>>> Signed-off-by: Melanie Plageman (Microsoft) >>>>> --- >>>>> drivers/scsi/storvsc_drv.c | 2 ++ >>>>> 1 file changed, 2 insertions(+) >>>>> >>>>> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c >>>>> index 6bc5453cea8a..d7953a6e00e6 100644 >>>>> --- a/drivers/scsi/storvsc_drv.c >>>>> +++ b/drivers/scsi/storvsc_drv.c >>>>> @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device, >>>>> (max_sub_channels + 1) * >>>>> (100 - ring_avail_percent_lowater) / 100; >>>>> >>>>> + scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, >> scsi_driver.can_queue); >>>>> + >>>> >>>> I'm not sure what you mean by "avoid dispatch errors". Can you elaborate? >>> >>> The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set >>> Scsi_Host->cmd_per_lun in storvsc_probe(). >>> >>> In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is >>> called and sets the scsi_device->queue_depth to the Scsi_Host's >>> cmd_per_lun with this code: >>> >>> scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ? >>> sdev->host->cmd_per_lun : 1); >>> >>> During dispatch, the scsi_device->queue_depth is used in >>> scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine >>> whether or not the device can queue another command. >>> >>> On some machines, with the 2048 value of cmd_per_lun that was used to >>> set the initial scsi_device->queue_depth, commands can be queued that >>> are later not able to be dispatched after running out of space in the >>> ringbuffer. >>> >>> On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD >>> (running an fio workload that I can provide if needed), storvsc_do_io() >>> ends up often returning SCSI_MLQUEUE_DEVICE_BUSY. >>> >>> This is the call stack: >>> >>> hv_get_bytes_to_write >>> hv_ringbuffer_write >>> vmbus_send_packet >>> storvsc_dio_io >>> storvsc_queuecommand >>> scsi_dispatch_cmd >>> scsi_queue_rq >>> dispatch_rq_list >>> >>>> Be aware that the calculation of "can_queue" in this driver is somewhat >>>> flawed -- it should not be based on the size of the ring buffer, but instead on >>>> the maximum number of requests Hyper-V will queue. And even then, >>>> can_queue doesn't provide the cap you might expect because the blk-mq layer >>>> allocates can_queue tags for each HW queue, not as a total. >>> >>> >>> The docs for scsi_mid_low_api document Scsi_Host can_queue this way: >>> >>> can_queue >>> - must be greater than 0; do not send more than can_queue >>> commands to the adapter. >>> >>> I did notice that in scsi_host.h, the comment for can_queue does say >>> can_queue is the "maximum number of simultaneous commands a single hw >>> queue in HBA will accept." However, I don't see it being used this way >>> in the code. >>> >> >> JFYI, the block layer ensures that no more than can_queue requests are >> sent to the host. See scsi_mq_setup_tags(), and how the tagset queue >> depth is set to shost->can_queue. >> >> Thanks, >> John > > Agree on what's in scsi_mq_setup_tags(). But scsi_mq_setup_tags() calls > blk_mq_alloc_tag_set(), which in turn calls blk_mq_alloc_map_and_requests(), > which calls __blk_mq_alloc_rq_maps() repeatedly, reducing the tag > set queue_depth as needed until it succeeds. > > The key thing is that __blk_mq_alloc_rq_maps() iterates over the > number of HW queues calling __blk_mq_alloc_map_and_request(). > The latter function allocates the map and the requests with a count > of the tag set's queue_depth. There's no logic to apportion the > can_queue value across multiple HW queues. So each HW queue gets > can_queue tags allocated, and the SCSI host driver may see up to > (can_queue * # HW queues) simultaneous requests. > > I'm certainly not an expert in this area, but that's what I see in the > code. We've run live experiments, and can see the number > simultaneous requests sent to the storvsc driver be greater than > can_queue when the # of HW queues is greater than 1, which seems > to be consistent with the code. ah, ok. I assumed that # of HW queues = 1 here. So you're describing a problem similar to https://lore.kernel.org/linux-scsi/b3e4e597-779b-7c1e-0d3c-07bc3dab1bb5@huawei.com/ So if you check nr_hw_queues comment in include/scsi/scsi_host.h, it reads: the total queue depth per host is nr_hw_queues * can_queue. However, for when host_tagset is set, the total queue depth is can_queue. Setting .host_tagset will ensure at most can_queue requests will be sent over all HW queues at any given time. A few SCSI MQ drivers set this now. Thanks, John > > Michael > >> >> >>> During dispatch, In scsi_target_queue_ready(), there is this code: >>> >>> if (busy >= starget->can_queue) >>> goto starved; >>> >>> And the scsi_target->can_queue value should be coming from Scsi_host as >>> mentioned in the scsi_target definition in scsi_device.h >>> /* >>> * LLDs should set this in the slave_alloc host template callout. >>> * If set to zero then there is not limit. >>> */ >>> unsigned int can_queue; >>> >>> So, I don't really see how this would be per hardware queue. >>> >>>> >>>> I agree that the cmd_per_lun setting is also too big, but we should fix that in >>>> the context of getting all of these different settings working together correctly, >>>> and not piecemeal. >>>> >>> >>> Capping Scsi_Host->cmd_per_lun to scsi_driver.can_queue during probe >>> will also prevent the LUN queue_depth from being set to a value that is >>> higher than it can ever be set to again by the user when >>> storvsc_change_queue_depth() is invoked. >>> >>> Also in scsi_sysfs sdev_store_queue_depth() there is this check: >>> >>> if (depth < 1 || depth > sdev->host->can_queue) >>> return -EINVAL; >>> >>> I would also note that VirtIO SCSI in virtscsi_probe(), Scsi_Host->cmd_per_lun >>> is set to the min of the configured cmd_per_lun and >>> Scsi_Host->can_queue: >>> >>> shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue); >>> >>> Best, >>> Melanie >>> . >>> >