linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Ming Lei <ming.lei@redhat.com>
To: Hou Tao <houtao1@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>, Omar Sandoval <osandov@fb.com>,
	linux-block@vger.kernel.org,
	Kate Stewart <kstewart@linuxfoundation.org>,
	John Garry <john.garry@huawei.com>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [RFC PATCH] blk-mq: provide more tags for woken-up process when tag allocation is busy
Date: Thu, 4 Jun 2020 18:01:21 +0800	[thread overview]
Message-ID: <20200604100121.GA2234582@T590> (raw)
In-Reply-To: <20200603073931.94435-1-houtao1@huawei.com>

Hi Hou Tao,

On Wed, Jun 03, 2020 at 03:39:31PM +0800, Hou Tao wrote:
> When there are many free-bit waiters, current batch wakeup method will
> wake up at most wake_batch processes when wake_batch bits are freed.
> The perfect result is each process will get a free bit, however the
> real result is that a waken-up process may being unable to get
> a free bit and will call io_schedule() multiple times. That's because
> other processes (e.g. wake-up before) in the same wake-up batch
> may have already allocated multiple free bits.
> 
> And the race leads to two problems. The first one is the unnecessary
> context switch, because multiple processes are waken up and then
> go to sleep afterwards. And the second one is the performance
> degradation when there is spatial locality between requests from
> one process (e.g. split IO for HDD), because one process can not
> allocated requests continuously for the split IOs, and
> the sequential IOs will be dispatched separatedly.

I guess this way is a bit worse for HDD since sequential IO may be
interrupted by other context.

> 
> To fix the problem, we mimic the way how SQ handles this situation:

Do you mean the SQ way is the congestion control code in __get_request()?
If not, could you provide more background of SQ's way for this issue?
Cause it isn't easy for me to associate your approach with SQ's code.

> 1) stash a bulk of free bits
> 2) wake up a process when a new bit is freed
> 3) woken-up process consumes the stashed free bits
> 4) when stashed free bits are exhausted, goto step 1)
> 
> Because the tag allocation path or io submit path is much faster than
> the tag free path, so when the race for free tags is intensive,

Indeed, I guess you mean bio_endio is slow.

> we can ensure:
> 1) only few processes will be waken up and will exhaust the stashed
>    free bits quickly.
> 2) these processes will be able to allocate multiple requests
>    continuously.
> 
> An alternative fix is to dynamically adjust the number of woken-up
> process according to the number of waiters and busy bits, instead of
> using wake_batch each time in __sbq_wake_up(). However it will need
> to record the number of busy bits all the time, so use the
> stash-wake-use method instead.
> 
> The following is the result of a simple fio test:
> 
> 1. fio (random read, 1MB, libaio, iodepth=1024)
> 
> (1) 4TB HDD (max_sectors_kb=256)
> 
> IOPS (bs=1MB)
> jobs | 4.18-sq  | 5.6.15 | 5.6.15-patched |
> 1    | 120      | 120    | 119
> 24   | 120      | 105    | 121
> 48   | 122      | 102    | 121
> 72   | 120      | 100    | 119
> 
> context switch per second
> jobs | 4.18-sq  | 5.6.15 | 5.6.15-patched |
> 1    | 1058     | 1162   | 1188
> 24   | 1047     | 1715   | 1105
> 48   | 1109     | 1967   | 1105
> 72   | 1084     | 1908   | 1106
> 
> (2) 1.8TB SSD (set max_sectors_kb=256)
> 
> IOPS (bs=1MB)
> jobs | 4.18-sq  | 5.6.15 | 5.6.15-patched |
> 1    | 1077     | 1075   | 1076
> 24   | 1079     | 1075   | 1076
> 48   | 1077     | 1076   | 1076
> 72   | 1077     | 1076   | 1077
> 
> context switch per second
> jobs | 4.18-sq  | 5.6.15 | 5.6.15-patched |
> 1    | 1833     | 5123   | 5264
> 24   | 2143     | 15238  | 3859
> 48   | 2182     | 19015  | 3617
> 72   | 2268     | 19050  | 3662
> 
> (3) 1.5TB nvme (set max_sectors_kb=256)
> 
> 4 read queue, 72 CPU
> 
> IOPS (bs=1MB)
> jobs | 5.6.15 | 5.6.15-patched |
> 1    | 3018   | 3018
> 18   | 3015   | 3016
> 36   | 3001   | 3005
> 54   | 2993   | 2997
> 72   | 2984   | 2990
> 
> context switch per second
> jobs | 5.6.15 | 5.6.15-patched |
> 1    | 6292   | 6469
> 18   | 19428  | 4253
> 36   | 21290  | 3928
> 54   | 23060  | 3957
> 72   | 24221  | 4054
> 
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
> Hi,
> 
> We found the problems (excessive context switch and few performance
> degradation) during the performance comparison between blk-sq (4.18)
> and blk-mq (5.16) on HDD, but we can not find a better way to fix it.
> 
> It seems that in order to implement batched request allocation for
> single process, we need to use an atomic variable to track
> the number of busy bits. It's suitable for HDD or SDD, because the
> IO latency is greater than 1ms, but no sure whether or not it's OK
> for NVMe device.

Do you have benchmark on NVMe/SSD with 4k BS?

> 
> Suggestions and comments are welcome.
> 
> Regards,
> Tao
> ---
>  block/blk-mq-tag.c      |  4 ++++
>  include/linux/sbitmap.h |  7 ++++++
>  lib/sbitmap.c           | 49 +++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 60 insertions(+)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 586c9d6e904a..fd601fa6c684 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -180,6 +180,10 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>  	sbitmap_finish_wait(bt, ws, &wait);
>  
>  found_tag:
> +	if (READ_ONCE(bt->stash_ready) &&
> +	    !atomic_dec_if_positive(&bt->stashed_bits))
> +		WRITE_ONCE(bt->stash_ready, false);
> +

Is it doable to move the above code into sbitmap_queue_do_stash_and_wake_up(),
after wake_up(&ws->wait)?

Or at least it should be done for successful allocation after context
switch?


Thanks, 
Ming


  reply	other threads:[~2020-06-04 10:01 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-03  7:39 [RFC PATCH] blk-mq: provide more tags for woken-up process when tag allocation is busy Hou Tao
2020-06-04 10:01 ` Ming Lei [this message]
2020-06-05 14:21   ` Hou Tao
2020-06-06  3:25     ` Ming Lei

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200604100121.GA2234582@T590 \
    --to=ming.lei@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=houtao1@huawei.com \
    --cc=john.garry@huawei.com \
    --cc=kstewart@linuxfoundation.org \
    --cc=linux-block@vger.kernel.org \
    --cc=osandov@fb.com \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).