Subject: Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
To: Ming Lei, Jens Axboe, linux-block@vger.kernel.org, Christoph Hellwig
Cc: Bart Van Assche
References: <20170805065705.12989-1-ming.lei@redhat.com>
From: Laurence Oberman
Message-ID:
Date: Mon, 7 Aug 2017 08:48:14 -0400
In-Reply-To: <20170805065705.12989-1-ming.lei@redhat.com>

On 08/05/2017 02:56 AM, Ming Lei wrote:
> In Red Hat internal storage tests of the blk-mq scheduler, we
> found that I/O performance is much worse with mq-deadline, especially
> for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
> SRP...).
>
> It turns out that one big issue causes the performance regression:
> requests are still dequeued from the sw queue/scheduler queue even
> when the lld's queue is busy, so I/O merging becomes quite difficult
> and sequential I/O degrades a lot.
>
> The first five patches improve this situation and bring back some of
> the lost performance.
>
> But they still do not look sufficient. The remaining regression is
> caused by the queue depth shared among all hw queues. For SCSI
> devices, .cmd_per_lun defines the max number of pending I/Os on one
> request queue, i.e. it is a per-request_queue depth. So during
> dispatch, if one hctx is too busy to move on, none of the hctxs can
> dispatch either, because of the per-request_queue depth.
>
> Patches 6 ~ 14 use a per-request_queue dispatch list to avoid
> dequeuing requests from the sw/scheduler queue when the lld queue
> is busy.
>
> Patches 15 ~ 20 improve bio merging via a hash table in the sw queue,
> which makes bio merging more efficient than the current approach, in
> which only the last 8 requests are checked. Since patches 6 ~ 14
> convert SCSI devices to the scheduler way of dequeuing one request
> from the sw queue at a time, the number of times ctx->lock is
> acquired increases; merging bios via the hash table decreases the
> hold time of ctx->lock and should offset the effect of patch 14.
>
> With these changes, SCSI-MQ sequential I/O performance is much
> improved. For lpfc it is basically brought back to the level of the
> block legacy path [1]; in particular, mq-deadline is improved by more
> than 10X [1] on lpfc and by more than 3X on SCSI SRP. For mq-none it
> is improved by 10% on lpfc, and write is improved by more than 10% on
> SRP too.
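To illustrate the idea behind patches 6 ~ 14 above, here is a minimal
user-space sketch of the "do not dequeue while the driver is busy"
policy; every name in it (sw_queue, driver_queue_rq, DRIVER_DEPTH, ...)
is an illustrative stand-in, not an actual kernel symbol:

/*
 * Minimal model of the dispatch policy in patches 6~14: stop pulling
 * requests off the sw queue as soon as the driver (lld) reports busy,
 * so pending requests stay queued and remain mergeable.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct request {
	int sector;
	struct request *next;
};

struct sw_queue {
	struct request *head;	/* per-ctx software queue */
	bool dispatch_busy;	/* models the per-queue BUSY state */
};

/* pretend driver queue with a small depth, like .cmd_per_lun */
#define DRIVER_DEPTH 2
static int driver_inflight;

static bool driver_queue_rq(struct request *rq)
{
	if (driver_inflight >= DRIVER_DEPTH)
		return false;	/* stands in for BLK_STS_RESOURCE */
	driver_inflight++;
	printf("dispatched sector %d\n", rq->sector);
	return true;
}

static void dispatch_sw_queue(struct sw_queue *q)
{
	/* key point: do not dequeue at all while the driver is busy,
	 * so new bios can still merge into the queued requests */
	while (!q->dispatch_busy && q->head) {
		struct request *rq = q->head;

		if (!driver_queue_rq(rq)) {
			q->dispatch_busy = true;	/* leave rq queued */
			break;
		}
		q->head = rq->next;
		free(rq);
	}
}

int main(void)
{
	struct sw_queue q = { .head = NULL, .dispatch_busy = false };

	for (int i = 4; i >= 0; i--) {
		struct request *rq = malloc(sizeof(*rq));

		rq->sector = i * 8;
		rq->next = q.head;
		q.head = rq;	/* queue sectors 0, 8, 16, 24, 32 */
	}
	dispatch_sw_queue(&q);	/* dispatches two, then marks busy */
	return 0;
}

The point of the policy is that a failed dispatch flips the busy flag
and leaves everything on the sw queue, where later bios can still be
merged into the queued requests.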
>
> Also, Bart worried that this patchset may affect SRP, so I provide
> test data on SCSI SRP this time:
>
> - fio (libaio, bs: 4k, dio, queue_depth: 64, 64 jobs)
> - system (16 cores, dual sockets, mem: 96G)
>
>               |v4.13-rc3     |v4.13-rc3   | v4.13-rc3+patches |
>               |blk-legacy dd |blk-mq none | blk-mq none       |
> -------------------------------------------------------------ated
> read     :iops|    587K      |    526K    |     537K          |
> randread :iops|    115K      |    140K    |     139K          |
> write    :iops|    596K      |    519K    |     602K          |
> randwrite:iops|    103K      |    122K    |     120K          |
>
>               |v4.13-rc3     |v4.13-rc3   | v4.13-rc3+patches |
>               |blk-legacy dd |blk-mq dd   | blk-mq dd         |
> --------------------------------------------------------------
> read     :iops|    587K      |    155K    |     522K          |
> randread :iops|    115K      |    140K    |     141K          |
> write    :iops|    596K      |    135K    |     587K          |
> randwrite:iops|    103K      |    120K    |     118K          |
>
> V2:
> - dequeue requests from the sw queues in round-robin style, as
>   suggested by Bart, and introduce one helper in sbitmap for this
>   purpose
> - improve bio merging via a hash table in the sw queue
> - add comments about using the DISPATCH_BUSY state in a lockless way,
>   simplifying handling of the busy state
> - hold ctx->lock when clearing the ctx busy bit, as suggested by Bart
>
> [1] http://marc.info/?l=linux-block&m=150151989915776&w=2
>
> Ming Lei (20):
>   blk-mq-sched: fix scheduler bad performance
>   sbitmap: introduce __sbitmap_for_each_set()
>   blk-mq: introduce blk_mq_dispatch_rq_from_ctx()
>   blk-mq-sched: move actual dispatching into one helper
>   blk-mq-sched: improve dispatching from sw queue
>   blk-mq-sched: don't dequeue request until all in ->dispatch are
>     flushed
>   blk-mq-sched: introduce blk_mq_sched_queue_depth()
>   blk-mq-sched: use q->queue_depth as hint for q->nr_requests
>   blk-mq: introduce BLK_MQ_F_SHARED_DEPTH
>   blk-mq-sched: introduce helpers for query, change busy state
>   blk-mq: introduce helpers for operating ->dispatch list
>   blk-mq: introduce pointers to dispatch lock & list
>   blk-mq: pass 'request_queue *' to several helpers of operating BUSY
>   blk-mq-sched: improve IO scheduling on SCSI devcie
>   block: introduce rqhash helpers
>   block: move actual bio merge code into __elv_merge
>   block: add check on elevator for supporting bio merge via hashtable
>     from blk-mq sw queue
>   block: introduce .last_merge and .hash to blk_mq_ctx
>   blk-mq-sched: refactor blk_mq_sched_try_merge()
>   blk-mq: improve bio merge from blk-mq sw queue
>
>  block/blk-mq-debugfs.c  |  12 ++--
>  block/blk-mq-sched.c    | 187 +++++++++++++++++++++++++++++-------------------
>  block/blk-mq-sched.h    |  23 ++++++
>  block/blk-mq.c          | 133 +++++++++++++++++++++++++++++++---
>  block/blk-mq.h          |  73 +++++++++++++++++++
>  block/blk-settings.c    |   2 +
>  block/blk.h             |  55 ++++++++++++++
>  block/elevator.c        |  93 ++++++++++++++----------
>  include/linux/blk-mq.h  |   5 ++
>  include/linux/blkdev.h  |   5 ++
>  include/linux/sbitmap.h |  54 ++++++++++----
>  11 files changed, 504 insertions(+), 138 deletions(-)

Hello

I tested this series using Ming's tests as well as my own set of tests
typically run against changes to upstream code in my SRP test bed. My
tests also include very large sequential buffered and unbuffered I/O.

This series seems to be fine for me. I did uncover another issue that
is unrelated to these patches and also exists in 4.13-RC3 generic that
I am still debugging.

For what it's worth:
Tested-by: Laurence Oberman
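As a reference for the hash-based merging of patches 15 ~ 20, here is
a minimal user-space sketch of finding a back-merge candidate by
hashing queued requests on the sector at which they end; the names
(rq_hash_add, rq_hash_find_back_merge, RQ_HASH_BITS) are illustrative,
not the actual kernel symbols:

/*
 * Minimal model of the hash-based merge lookup in patches 15~20:
 * queued requests are hashed by the sector just past their end, so a
 * new bio can find a back-merge candidate in O(1) instead of scanning
 * only the last few queued requests.
 */
#include <stdio.h>

#define RQ_HASH_BITS	6
#define RQ_HASH_SIZE	(1 << RQ_HASH_BITS)

struct request {
	unsigned long sector;		/* start sector */
	unsigned long nr_sectors;
	struct request *hash_next;
};

static struct request *rq_hash[RQ_HASH_SIZE];

static unsigned int rq_hash_key(const struct request *rq)
{
	/* hash by the sector just past the request's end */
	return (rq->sector + rq->nr_sectors) & (RQ_HASH_SIZE - 1);
}

static void rq_hash_add(struct request *rq)
{
	unsigned int h = rq_hash_key(rq);

	rq->hash_next = rq_hash[h];
	rq_hash[h] = rq;
}

/* find a queued request that ends exactly where a new bio starts */
static struct request *rq_hash_find_back_merge(unsigned long bio_sector)
{
	struct request *rq;

	rq = rq_hash[bio_sector & (RQ_HASH_SIZE - 1)];
	for (; rq; rq = rq->hash_next)
		if (rq->sector + rq->nr_sectors == bio_sector)
			return rq;
	return NULL;
}

int main(void)
{
	struct request a = { .sector = 0, .nr_sectors = 8 };
	struct request b = { .sector = 64, .nr_sectors = 8 };

	rq_hash_add(&a);
	rq_hash_add(&b);

	/* a bio starting at sector 8 back-merges into request 'a' */
	struct request *rq = rq_hash_find_back_merge(8);

	if (rq)
		printf("back-merge into request at sector %lu\n",
		       rq->sector);
	return 0;
}

Keyed this way, a merge lookup costs one short bucket walk regardless
of how deep the sw queue is, instead of a fixed scan of the last 8
requests.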