From: Ming Lei <ming.lei@redhat.com>
To: Hannes Reinecke <hare@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>,
linux-block@vger.kernel.org,
"James E . J . Bottomley" <jejb@linux.ibm.com>,
"Martin K . Petersen" <martin.petersen@oracle.com>,
linux-scsi@vger.kernel.org,
Sathya Prakash <sathya.prakash@broadcom.com>,
Chaitra P B <chaitra.basappa@broadcom.com>,
Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>,
Kashyap Desai <kashyap.desai@broadcom.com>,
Sumit Saxena <sumit.saxena@broadcom.com>,
Shivasharan S <shivasharan.srikanteshwara@broadcom.com>,
"Ewan D . Milne" <emilne@redhat.com>,
Christoph Hellwig <hch@lst.de>,
Bart Van Assche <bart.vanassche@wdc.com>
Subject: Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
Date: Thu, 21 Nov 2019 08:53:23 +0800 [thread overview]
Message-ID: <20191121005323.GB24548@ming.t460p> (raw)
In-Reply-To: <1081145f-3e17-9bc1-2332-50a4b5621ef7@suse.de>
On Wed, Nov 20, 2019 at 11:05:24AM +0100, Hannes Reinecke wrote:
> On 11/18/19 11:31 AM, Ming Lei wrote:
> > SCSI core uses the atomic variable of sdev->device_busy to track
> > in-flight IO requests dispatched to this scsi device. IO request may be
> > submitted from any CPU, so the cost for maintaining the shared atomic
> > counter can be very big on big NUMA machine with lots of CPU cores.
> >
> > sdev->queue_depth is usually used for two purposes: 1) improve IO merge;
> > 2) fair IO request scattered among all LUNs.
> >
> > blk-mq already provides fair request allocation among all active shared
> > request queues(LUNs), see hctx_may_queue().
> >
> > NVMe doesn't have such per-request-queue(namespace) queue depth, so it
> > is reasonable to ignore the limit for SCSI SSD too. Also IO merge won't
> > play big role for reaching top SSD performance.
> >
> > With this patch, big cost for tracking in-flight per-LUN requests via
> > atomic variable can be saved.
> >
> > Given QUEUE_FLAG_NONROT is read in IO path, we have to freeze queue
> > before changing this flag.
> >
> > Cc: Sathya Prakash <sathya.prakash@broadcom.com>
> > Cc: Chaitra P B <chaitra.basappa@broadcom.com>
> > Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
> > Cc: Kashyap Desai <kashyap.desai@broadcom.com>
> > Cc: Sumit Saxena <sumit.saxena@broadcom.com>
> > Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
> > Cc: Ewan D. Milne <emilne@redhat.com>
> > Cc: Christoph Hellwig <hch@lst.de>,
> > Cc: Hannes Reinecke <hare@suse.de>
> > Cc: Bart Van Assche <bart.vanassche@wdc.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> > block/blk-sysfs.c | 14 +++++++++++++-
> > drivers/scsi/scsi_lib.c | 24 ++++++++++++++++++------
> > drivers/scsi/sd.c | 4 ++++
> > 3 files changed, 35 insertions(+), 7 deletions(-)
> >
> > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > index fca9b158f4a0..9cc0dd5f88a1 100644
> > --- a/block/blk-sysfs.c
> > +++ b/block/blk-sysfs.c
> > @@ -281,6 +281,18 @@ QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0);
> > QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0);
> > #undef QUEUE_SYSFS_BIT_FNS
> >
> > +static ssize_t queue_store_nonrot_freeze(struct request_queue *q,
> > + const char *page, size_t count)
> > +{
> > + size_t ret;
> > +
> > + blk_mq_freeze_queue(q);
> > + ret = queue_store_nonrot(q, page, count);
> > + blk_mq_unfreeze_queue(q);
> > +
> > + return ret;
> > +}
> > +
> > static ssize_t queue_zoned_show(struct request_queue *q, char *page)
> > {
> > switch (blk_queue_zoned_model(q)) {
> > @@ -642,7 +654,7 @@ static struct queue_sysfs_entry queue_write_zeroes_max_entry = {
> > static struct queue_sysfs_entry queue_nonrot_entry = {
> > .attr = {.name = "rotational", .mode = 0644 },
> > .show = queue_show_nonrot,
> > - .store = queue_store_nonrot,
> > + .store = queue_store_nonrot_freeze,
> > };
> >
> > static struct queue_sysfs_entry queue_zoned_entry = {
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index 62a86a82c38d..72655a049efd 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -354,7 +354,8 @@ void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd)
> > if (starget->can_queue > 0)
> > atomic_dec(&starget->target_busy);
> >
> > - atomic_dec(&sdev->device_busy);
> > + if (!blk_queue_nonrot(sdev->request_queue))
> > + atomic_dec(&sdev->device_busy);
> > }
> >
> > static void scsi_kick_queue(struct request_queue *q)
> > @@ -410,7 +411,8 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
> >
> > static inline bool scsi_device_is_busy(struct scsi_device *sdev)
> > {
> > - if (atomic_read(&sdev->device_busy) >= sdev->queue_depth)
> > + if (!blk_queue_nonrot(sdev->request_queue) &&
> > + atomic_read(&sdev->device_busy) >= sdev->queue_depth)
> > return true;
> > if (atomic_read(&sdev->device_blocked) > 0)
> > return true;
> > @@ -1283,8 +1285,12 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
> > struct scsi_device *sdev)
> > {
> > unsigned int busy;
> > + bool bypass = blk_queue_nonrot(sdev->request_queue);
> >
> > - busy = atomic_inc_return(&sdev->device_busy) - 1;
> > + if (!bypass)
> > + busy = atomic_inc_return(&sdev->device_busy) - 1;
> > + else
> > + busy = 0;
> > if (atomic_read(&sdev->device_blocked)) {
> > if (busy)
> > goto out_dec;
> > @@ -1298,12 +1304,16 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
> > "unblocking device at zero depth\n"));
> > }
> >
> > + if (bypass)
> > + return 1;
> > +
> > if (busy >= sdev->queue_depth)
> > goto out_dec;
> >
> > return 1;
> > out_dec:
> > - atomic_dec(&sdev->device_busy);
> > + if (!bypass)
> > + atomic_dec(&sdev->device_busy);
> > return 0;
> > }
> >
> > @@ -1624,7 +1634,8 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
> > struct request_queue *q = hctx->queue;
> > struct scsi_device *sdev = q->queuedata;
> >
> > - atomic_dec(&sdev->device_busy);
> > + if (!blk_queue_nonrot(sdev->request_queue))
> > + atomic_dec(&sdev->device_busy);
> > }
> >
> > static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
> > @@ -1731,7 +1742,8 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
> > case BLK_STS_OK:
> > break;
> > case BLK_STS_RESOURCE:
> > - if (atomic_read(&sdev->device_busy) ||
> > + if ((!blk_queue_nonrot(sdev->request_queue) &&
> > + atomic_read(&sdev->device_busy)) ||
> > scsi_device_blocked(sdev))
> > ret = BLK_STS_DEV_RESOURCE;
> > break;
> > diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> > index 0744c34468e1..c3d47117700d 100644
> > --- a/drivers/scsi/sd.c
> > +++ b/drivers/scsi/sd.c
> > @@ -2927,7 +2927,9 @@ static void sd_read_block_characteristics(struct scsi_disk *sdkp)
> > rot = get_unaligned_be16(&buffer[4]);
> >
> > if (rot == 1) {
> > + blk_mq_freeze_queue(q);
> > blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
> > + blk_mq_unfreeze_queue(q);
> > blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q);
> > }
> >
> > @@ -3123,7 +3125,9 @@ static int sd_revalidate_disk(struct gendisk *disk)
> > * cause this to be updated correctly and any device which
> > * doesn't support it should be treated as rotational.
> > */
> > + blk_mq_freeze_queue(q);
> > blk_queue_flag_clear(QUEUE_FLAG_NONROT, q);
> > + blk_mq_unfreeze_queue(q);
> > blk_queue_flag_set(QUEUE_FLAG_ADD_RANDOM, q);
> >
> > if (scsi_device_supports_vpd(sdp)) {
> >
> Hmm.
>
> I must admit I patently don't like this explicit dependency on
> blk_nonrot(). Having a conditional counter is just an open invitation to
> getting things wrong...
That won't be an issue given this patch freezes queue when the
NONROT queue flag is changed.
>
> I would far prefer if we could delegate any queueing decision to the
> elevators, and completely drop the device_busy flag for all devices.
If you drop it, you may create big sequential IO performance drop
on HDD., that is why this patch only bypasses sdev->queue_depth on
SSD. NVMe bypasses it because no one uses HDD. via NVMe.
> But I do see that this might not be easy to do, nor necessarily
> something which others agree on.
> So alternatively, can't we make this a per-host flag, instead of having
> it per device characteristic?
Are you sure each LUN shares the same NONROT flag? I think it can be
true for each LUN to have different NONROT flag.
>
> I'm just thinking of a storage array with a massive cache; it's anyone's
> guess if 'nonrot' is set there, but at the same time these things can be
> fast, so they might be profiting from dropping the device_busy counter
> there, too.
If you are worrying about slow queue freeze, that won't be an issue,
given we don't switch to percpu mode until the flag is updated during
sd probe, and blk_mq_freeze_queue_wait() is pretty quick in ATOMIC
mode.
Thanks,
Ming
next prev parent reply other threads:[~2019-11-21 0:53 UTC|newest]
Thread overview: 38+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-11-18 10:31 [PATCH 0/4] scis: don't apply per-LUN queue depth for SSD Ming Lei
2019-11-18 10:31 ` [PATCH 1/4] scsi: megaraid_sas: use private counter for tracking inflight per-LUN commands Ming Lei
2019-11-20 9:54 ` Hannes Reinecke
2019-11-26 3:12 ` Kashyap Desai
2019-11-26 3:37 ` Ming Lei
2019-12-05 10:32 ` Kashyap Desai
2019-11-18 10:31 ` [PATCH 2/4] scsi: mpt3sas: " Ming Lei
2019-11-20 9:55 ` Hannes Reinecke
2019-11-18 10:31 ` [PATCH 3/4] scsi: sd: register request queue after sd_revalidate_disk is done Ming Lei
2019-11-20 9:59 ` Hannes Reinecke
2019-11-18 10:31 ` [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD Ming Lei
2019-11-20 10:05 ` Hannes Reinecke
2019-11-20 17:00 ` Ewan D. Milne
2019-11-20 20:56 ` Bart Van Assche
2019-11-20 21:36 ` Ewan D. Milne
2019-11-22 2:25 ` Martin K. Petersen
2019-11-21 1:07 ` Ming Lei
2019-11-22 2:59 ` Martin K. Petersen
2019-11-22 3:24 ` Ming Lei
2019-11-22 16:38 ` Sumanesh Samanta
2019-11-21 0:08 ` Sumanesh Samanta
2019-11-21 0:54 ` Ming Lei
2019-11-21 19:19 ` Ewan D. Milne
2019-11-21 0:53 ` Ming Lei [this message]
2019-11-21 15:45 ` Hannes Reinecke
2019-11-22 8:09 ` Ming Lei
2019-11-22 18:14 ` Bart Van Assche
2019-11-22 18:26 ` James Smart
2019-11-22 20:46 ` Bart Van Assche
2019-11-22 22:04 ` Ming Lei
2019-11-22 22:00 ` Ming Lei
2019-11-25 18:28 ` Ewan D. Milne
2019-11-25 22:14 ` James Smart
2019-11-22 2:18 ` Martin K. Petersen
-- strict thread matches above, loose matches on Subject: below --
2019-11-20 21:58 Sumanesh Samanta
2019-11-21 1:21 ` Ming Lei
2019-11-21 1:50 ` Sumanesh Samanta
2019-11-21 2:23 ` Ming Lei
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20191121005323.GB24548@ming.t460p \
--to=ming.lei@redhat.com \
--cc=axboe@kernel.dk \
--cc=bart.vanassche@wdc.com \
--cc=chaitra.basappa@broadcom.com \
--cc=emilne@redhat.com \
--cc=hare@suse.de \
--cc=hch@lst.de \
--cc=jejb@linux.ibm.com \
--cc=kashyap.desai@broadcom.com \
--cc=linux-block@vger.kernel.org \
--cc=linux-scsi@vger.kernel.org \
--cc=martin.petersen@oracle.com \
--cc=sathya.prakash@broadcom.com \
--cc=shivasharan.srikanteshwara@broadcom.com \
--cc=suganath-prabu.subramani@broadcom.com \
--cc=sumit.saxena@broadcom.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox