From: Ming Lei <ming.lei@redhat.com>
To: Hannes Reinecke <hare@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>,
linux-block@vger.kernel.org,
"James E . J . Bottomley" <jejb@linux.ibm.com>,
"Martin K . Petersen" <martin.petersen@oracle.com>,
linux-scsi@vger.kernel.org,
Sathya Prakash <sathya.prakash@broadcom.com>,
Chaitra P B <chaitra.basappa@broadcom.com>,
Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>,
Kashyap Desai <kashyap.desai@broadcom.com>,
Sumit Saxena <sumit.saxena@broadcom.com>,
Shivasharan S <shivasharan.srikanteshwara@broadcom.com>,
"Ewan D . Milne" <emilne@redhat.com>,
Christoph Hellwig <hch@lst.de>,
Bart Van Assche <bart.vanassche@wdc.com>
Subject: Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
Date: Fri, 22 Nov 2019 16:09:59 +0800 [thread overview]
Message-ID: <20191122080959.GC903@ming.t460p> (raw)
In-Reply-To: <336f35fc-2e22-c615-9405-50297b9737ea@suse.de>
On Thu, Nov 21, 2019 at 04:45:48PM +0100, Hannes Reinecke wrote:
> On 11/21/19 1:53 AM, Ming Lei wrote:
> > On Wed, Nov 20, 2019 at 11:05:24AM +0100, Hannes Reinecke wrote:
> >> On 11/18/19 11:31 AM, Ming Lei wrote:
> >>> SCSI core uses the atomic variable sdev->device_busy to track
> >>> in-flight IO requests dispatched to this SCSI device. IO requests may
> >>> be submitted from any CPU, so the cost of maintaining this shared
> >>> atomic counter can be very high on large NUMA machines with many CPU
> >>> cores.
> >>>
> >>> sdev->queue_depth is usually used for two purposes: 1) improving IO
> >>> merging; 2) fairly distributing IO requests among all LUNs.
> >>>
> >>> blk-mq already provides fair request allocation among all active
> >>> shared request queues (LUNs), see hctx_may_queue().
> >>>
> >>> NVMe doesn't have such a per-request-queue (namespace) queue depth, so
> >>> it is reasonable to ignore the limit for SCSI SSDs too. Also, IO
> >>> merging doesn't play a big role in reaching top SSD performance.
> >>>
> >>> With this patch, the big cost of tracking in-flight per-LUN requests
> >>> via an atomic variable can be saved.
> >>>
> >>> Given that QUEUE_FLAG_NONROT is read in the IO path, we have to freeze
> >>> the queue before changing this flag.
> >>>
> >>> Cc: Sathya Prakash <sathya.prakash@broadcom.com>
> >>> Cc: Chaitra P B <chaitra.basappa@broadcom.com>
> >>> Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
> >>> Cc: Kashyap Desai <kashyap.desai@broadcom.com>
> >>> Cc: Sumit Saxena <sumit.saxena@broadcom.com>
> >>> Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
> >>> Cc: Ewan D. Milne <emilne@redhat.com>
> >>> Cc: Christoph Hellwig <hch@lst.de>
> >>> Cc: Hannes Reinecke <hare@suse.de>
> >>> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> >>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>> ---
> >>> block/blk-sysfs.c | 14 +++++++++++++-
> >>> drivers/scsi/scsi_lib.c | 24 ++++++++++++++++++------
> >>> drivers/scsi/sd.c | 4 ++++
> >>> 3 files changed, 35 insertions(+), 7 deletions(-)
> >>>
> >>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> >>> index fca9b158f4a0..9cc0dd5f88a1 100644
> >>> --- a/block/blk-sysfs.c
> >>> +++ b/block/blk-sysfs.c
> >>> @@ -281,6 +281,18 @@ QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0);
> >>> QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0);
> >>> #undef QUEUE_SYSFS_BIT_FNS
> >>>
> >>> +static ssize_t queue_store_nonrot_freeze(struct request_queue *q,
> >>> + const char *page, size_t count)
> >>> +{
> >>> +	ssize_t ret;
> >>> +
> >>> + blk_mq_freeze_queue(q);
> >>> + ret = queue_store_nonrot(q, page, count);
> >>> + blk_mq_unfreeze_queue(q);
> >>> +
> >>> + return ret;
> >>> +}
> >>> +
> >>> static ssize_t queue_zoned_show(struct request_queue *q, char *page)
> >>> {
> >>> switch (blk_queue_zoned_model(q)) {
> >>> @@ -642,7 +654,7 @@ static struct queue_sysfs_entry queue_write_zeroes_max_entry = {
> >>> static struct queue_sysfs_entry queue_nonrot_entry = {
> >>> .attr = {.name = "rotational", .mode = 0644 },
> >>> .show = queue_show_nonrot,
> >>> - .store = queue_store_nonrot,
> >>> + .store = queue_store_nonrot_freeze,
> >>> };
> >>>
> >>> static struct queue_sysfs_entry queue_zoned_entry = {
> >>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> >>> index 62a86a82c38d..72655a049efd 100644
> >>> --- a/drivers/scsi/scsi_lib.c
> >>> +++ b/drivers/scsi/scsi_lib.c
> >>> @@ -354,7 +354,8 @@ void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd)
> >>> if (starget->can_queue > 0)
> >>> atomic_dec(&starget->target_busy);
> >>>
> >>> - atomic_dec(&sdev->device_busy);
> >>> + if (!blk_queue_nonrot(sdev->request_queue))
> >>> + atomic_dec(&sdev->device_busy);
> >>> }
> >>>
> >>> static void scsi_kick_queue(struct request_queue *q)
> >>> @@ -410,7 +411,8 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
> >>>
> >>> static inline bool scsi_device_is_busy(struct scsi_device *sdev)
> >>> {
> >>> - if (atomic_read(&sdev->device_busy) >= sdev->queue_depth)
> >>> + if (!blk_queue_nonrot(sdev->request_queue) &&
> >>> + atomic_read(&sdev->device_busy) >= sdev->queue_depth)
> >>> return true;
> >>> if (atomic_read(&sdev->device_blocked) > 0)
> >>> return true;
> >>> @@ -1283,8 +1285,12 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
> >>> struct scsi_device *sdev)
> >>> {
> >>> unsigned int busy;
> >>> + bool bypass = blk_queue_nonrot(sdev->request_queue);
> >>>
> >>> - busy = atomic_inc_return(&sdev->device_busy) - 1;
> >>> + if (!bypass)
> >>> + busy = atomic_inc_return(&sdev->device_busy) - 1;
> >>> + else
> >>> + busy = 0;
> >>> if (atomic_read(&sdev->device_blocked)) {
> >>> if (busy)
> >>> goto out_dec;
> >>> @@ -1298,12 +1304,16 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
> >>> "unblocking device at zero depth\n"));
> >>> }
> >>>
> >>> + if (bypass)
> >>> + return 1;
> >>> +
> >>> if (busy >= sdev->queue_depth)
> >>> goto out_dec;
> >>>
> >>> return 1;
> >>> out_dec:
> >>> - atomic_dec(&sdev->device_busy);
> >>> + if (!bypass)
> >>> + atomic_dec(&sdev->device_busy);
> >>> return 0;
> >>> }
> >>>
> >>> @@ -1624,7 +1634,8 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
> >>> struct request_queue *q = hctx->queue;
> >>> struct scsi_device *sdev = q->queuedata;
> >>>
> >>> - atomic_dec(&sdev->device_busy);
> >>> + if (!blk_queue_nonrot(sdev->request_queue))
> >>> + atomic_dec(&sdev->device_busy);
> >>> }
> >>>
> >>> static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
> >>> @@ -1731,7 +1742,8 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
> >>> case BLK_STS_OK:
> >>> break;
> >>> case BLK_STS_RESOURCE:
> >>> - if (atomic_read(&sdev->device_busy) ||
> >>> + if ((!blk_queue_nonrot(sdev->request_queue) &&
> >>> + atomic_read(&sdev->device_busy)) ||
> >>> scsi_device_blocked(sdev))
> >>> ret = BLK_STS_DEV_RESOURCE;
> >>> break;
> >>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> >>> index 0744c34468e1..c3d47117700d 100644
> >>> --- a/drivers/scsi/sd.c
> >>> +++ b/drivers/scsi/sd.c
> >>> @@ -2927,7 +2927,9 @@ static void sd_read_block_characteristics(struct scsi_disk *sdkp)
> >>> rot = get_unaligned_be16(&buffer[4]);
> >>>
> >>> if (rot == 1) {
> >>> + blk_mq_freeze_queue(q);
> >>> blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
> >>> + blk_mq_unfreeze_queue(q);
> >>> blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q);
> >>> }
> >>>
> >>> @@ -3123,7 +3125,9 @@ static int sd_revalidate_disk(struct gendisk *disk)
> >>> * cause this to be updated correctly and any device which
> >>> * doesn't support it should be treated as rotational.
> >>> */
> >>> + blk_mq_freeze_queue(q);
> >>> blk_queue_flag_clear(QUEUE_FLAG_NONROT, q);
> >>> + blk_mq_unfreeze_queue(q);
> >>> blk_queue_flag_set(QUEUE_FLAG_ADD_RANDOM, q);
> >>>
> >>> if (scsi_device_supports_vpd(sdp)) {
> >>>
> >> Hmm.
> >>
> >> I must admit I patently don't like this explicit dependency on
> >> blk_queue_nonrot(). Having a conditional counter is just an open
> >> invitation to getting things wrong...
> >
> > That won't be an issue given this patch freezes queue when the
> > NONROT queue flag is changed.
> >
> That's not what I'm saying.
>
> My issue is that we're turning the device_busy counter into a
> conditional counter, which must only be accessed when that condition is
> true.
> This not only puts a burden on the developer (who has to remember to
> always check the condition), but also has possible issues when switching
> between modes.
The conditional check can be avoided only if we can kill device_busy
completely. However, as Martin commented, that isn't realistic so far.
> The latter you've addressed with ensuring that the queue must be frozen
> when modifying that flag, but the former is a conceptual problem.
It can't be perfect, but the latter makes it good enough.
As someone said, 'perfect is the enemy of the good'. :-)
>
> And that was what I had been referring to.
>
> >>
> >> I would far prefer if we could delegate any queueing decision to the
> >> elevators, and completely drop the device_busy flag for all devices.
> >
> > If you drop it, you may cause a big sequential IO performance drop
> > on HDD; that is why this patch only bypasses sdev->queue_depth on
> > SSD. NVMe bypasses it because no one uses HDDs via NVMe.
> >
> I still wonder how much performance drop we actually see; what seems to
> happen is that device_busy just arbitrarily pushes back to the block
> layer, giving it more time to do merging.
> I do think we can do better than that...
For example, running the following script[1] on a 4-core VM:
---------------------------------------------
                    | QD: 255  | QD: 32   |
---------------------------------------------
fio read throughput | 825MB/s  | 1432MB/s |
---------------------------------------------
note:
QD: the scsi_device queue depth, which can be passed to test.sh.
scsi_debug's device queue depth is 255 by default, which is >= .can_queue.
[1] test.sh
#!/bin/bash
# Usage: test.sh <queue_depth>
QD=$1
modprobe scsi_debug ndelay=200000
sleep 2
# Find the block device created by scsi_debug
DEVN=`ls -d /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/* | head -1 | xargs basename`
DEV="/dev/"$DEVN
echo $QD > /sys/block/$DEVN/device/queue_depth
# Read back the depth actually applied by the driver
SQD=`cat /sys/block/$DEVN/device/queue_depth`
echo "scsi device queue depth: $SQD"
fio --readwrite=read --filename=$DEV --runtime=20s --time_based --group_reporting=1 \
    --ioengine=libaio --direct=1 --iodepth=64 --bs=4k --numjobs=4 --name=seq_perf
Thanks,
Ming