Linux block layer
 help / color / mirror / Atom feed
* [PATCHSET v2 block/for-linus] IO cost model based work-conserving porportional controller
From: Tejun Heo @ 2019-07-10 20:51 UTC (permalink / raw)
  To: axboe, newella, clm, josef, dennisz, lizefan, hannes
  Cc: linux-kernel, kernel-team, linux-block, cgroups

Hello,

Changes from v1[1]:

* Prerequisite patchsets had cosmetic changes and merged.  Refreshed
  on top.

* Renamed from ioweight to iocost.  All source code and tools are
  updated accordingly.  Control knobs io.weight.qos and
  io.weight.cost_model are renamed to io.cost.qos and io.cost.model
  respectively.  This is a more fitting name which won't become a
  misnomer when, for example, cost based io.max is added.

* Various bug fixes and improvements.  A few bugs were discovered
  while testing against high-iops nvme device.  Auto parameter
  selection improved and verified across different classes of SSDs.

* Dropped bpf iocost support for now.

* Added coef generation script.

* Verified on high-iops nvme device.  Result is included below.

One challenge of controlling IO resources is the lack of trivially
observable cost metric.  This is distinguished from CPU and memory
where wallclock time and the number of bytes can serve as accurate
enough approximations.

Bandwidth and iops are the most commonly used metrics for IO devices
but depending on the type and specifics of the device, different IO
patterns easily lead to multiple orders of magnitude variations
rendering them useless for the purpose of IO capacity distribution.
While on-device time, with a lot of clutches, could serve as a useful
approximation for non-queued rotational devices, this is no longer
viable with modern devices, even the rotational ones.

While there is no cost metric we can trivially observe, it isn't a
complete mystery.  For example, on a rotational device, seek cost
dominates while a contiguous transfer contributes a smaller amount
proportional to the size.  If we can characterize at least the
relative costs of these different types of IOs, it should be possible
to implement a reasonable work-conserving proportional IO resource
distribution.

This patchset implements IO cost model based work-conserving
proportional controller.  It currently has a simple linear cost model
builtin where each IO is classified as sequential or random and given
a base cost accordingly and additional size-proportional cost is added
on top.  Each IO is given a cost based on the model and the controller
issues IOs for each cgroup according to their hierarchical weight.

By default, the controller adapts its overall IO rate so that it
doesn't build up buffer bloat in the request_queue layer, which
guarantees that the controller doesn't lose significant amount of
total work.  However, this may not provide sufficient differentiation
as the underlying device may have a deep queue and not be fair in how
the queued IOs are serviced.  The controller provides extra QoS
control knobs which allow tightening control feedback loop as
necessary.

For more details on the control mechanism, implementation and
interface, please refer to the comment at the top of
block/blk-iocost.c and Documentation/admin-guide/cgroup-v2.rst changes
in the "blkcg: implement blk-iocost" patch.

Here are some test results.  Each test run goes through the following
combinations with each combination running for a minute.  All tests
are performed against regular files on btrfs w/ deadline as the IO
scheduler.  Random IOs are direct w/ queue depth of 64.  Sequential
are normal buffered IOs.

	high priority (weight=500)	low priority (weight=100)

	Rand read			None
	ditto				Rand read
	ditto				Seq  read
	ditto				Rand write
	ditto				Seq  write
	Seq  read			None
	ditto				Rand read
	ditto				Seq  read
	ditto				Rand write
	ditto				Seq  write
	Rand write			None
	ditto				Rand read
	ditto				Seq  read
	ditto				Rand write
	ditto				Seq  write
	Seq  write			None
	ditto				Rand read
	ditto				Seq  read
	ditto				Rand write
	ditto				Seq  write

* 7200RPM SATA hard disk
  * No IO control
    https://photos.app.goo.gl/1KBHn7ykpC1LXRkB8
  * iocost, QoS: None
    https://photos.app.goo.gl/MLNQGxCtBQ8wAmjm7
  * iocost, QoS: rpct=95.00 rlat=40000 wpct=95.00 wlat=40000 min=25.00 max=200.00
    https://photos.app.goo.gl/XqXHm3Mkbm9w6Db46
* NCQ-blacklisted SATA SSD (QD==1)
  * No IO control
    https://photos.app.goo.gl/wCTXeu2uJ6LYL4pk8
  * iocost, QoS: None
    https://photos.app.goo.gl/T2HedKD2sywQgj7R9
  * iocost, QoS: rpct=95.00 rlat=20000 wpct=95.00 wlat=20000 min=50.00 max=200.00
    https://photos.app.goo.gl/urBTV8XQc1UqPJJw7
* SATA SSD (QD==32)
  * No IO control
    https://photos.app.goo.gl/TjEVykuVudSQcryh6
  * iocost, QoS: None
    https://photos.app.goo.gl/iyQBsky7bmM54Xiq7
  * iocost, QoS: rpct=95.00 rlat=10000 wpct=95.00 wlat=20000 min=50.00 max=400.00
    https://photos.app.goo.gl/q1a6URLDxPLMrnHy5
* NVME SSD (ran with 8 concurrent fio jobs to achieve saturation)
  * No IO control
    https://photos.app.goo.gl/S6xjEVTJzcfb3w1j7
  * iocost, QoS: None
    https://photos.app.goo.gl/SjQUUotJBAGr7vqz7
  * iocost, QoS: rpct=95.00 rlat=5000 wpct=95.00 wlat=5000 min=1.00 max=10000.00
    https://photos.app.goo.gl/RsaYBd2muX7CegoN7

Even without explicit QoS configuration, read-heavy scenarios can
obtain acceptable differentiation.  However, when write-heavy, the
deep buffering on the device side makes it difficult to maintain
control.  With QoS parameters set, the differentiation is acceptable
across all combinations.

The implementation comes with default cost model parameters which are
selected automatically which should provide acceptable behavior across
most common devices.  The parameters for hdd and consumer-grade SSDs
seem pretty robust.  The default parameter set and selection criteria
for highend SSDs might need further adjustments.

It is fairly easy to configure the QoS parameters and, if needed, cost
model coefficients.  We'll follow up with tooling and further
documentation.  Also, the last RFC patch in the series implements
support for bpf-based custom cost function.  Originally we thought
that we'd need per-device-type cost functions but the simple linear
model now seem good enough to cover all common device classes.  In
case custom cost functions become necessary, we can fully develop the
bpf based extension and also easily add different builtin cost models.

Andy Newell did the heavy lifting of analyzing IO workloads and device
characteristics, exploring various cost models, determining the
default model and parameters to use.

Josef Bacik implemented a prototype which explored the use of
different types of cost metrics including on-device time and Andy's
linear model.

This patchset is on top of block/for-linus 3a10f999ffd4
("blk-throttle: fix zero wait time for iops throttled group")
and contains the following ten patches.

 0001-blkcg-pass-q-and-blkcg-into-blkcg_pol_alloc_pd_fn.patch
 0002-blkcg-make-cpd_init_fn-optional.patch
 0003-blkcg-separate-blkcg_conf_get_disk-out-of-blkg_conf_.patch
 0004-block-rq_qos-add-rq_qos_merge.patch
 0005-block-rq_qos-implement-rq_qos_ops-queue_depth_change.patch
 0006-blkcg-s-RQ_QOS_CGROUP-RQ_QOS_LATENCY.patch
 0007-blk-mq-add-optional-request-pre_start_time_ns.patch
 0008-blkcg-implement-blk-iocost.patch
 0009-blkcg-add-tools-cgroup-iocost_monitor.py.patch
 0010-blkcg-add-tools-cgroup-iocost_coef_gen.py.patch

0001-0007 are prep patches.  0008 implements blk-iocost.  0009 adds
monitoring script.  0010 adds linear model coef generation script.

The patchset is also available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-iocost

diffstat follows, Thanks.

 Documentation/admin-guide/cgroup-v2.rst |   97 +
 block/Kconfig                           |    9 
 block/Makefile                          |    1 
 block/bfq-cgroup.c                      |    5 
 block/blk-cgroup.c                      |   71 
 block/blk-core.c                        |    4 
 block/blk-iocost.c                      | 2394 ++++++++++++++++++++++++++++++++
 block/blk-iolatency.c                   |    8 
 block/blk-mq.c                          |   11 
 block/blk-rq-qos.c                      |   18 
 block/blk-rq-qos.h                      |   28 
 block/blk-settings.c                    |    2 
 block/blk-throttle.c                    |    6 
 block/blk-wbt.c                         |   18 
 block/blk-wbt.h                         |    4 
 include/linux/blk-cgroup.h              |    4 
 include/linux/blk_types.h               |    3 
 include/linux/blkdev.h                  |    7 
 include/trace/events/iocost.h           |  174 ++
 tools/cgroup/iocost_coef_gen.py         |  178 ++
 tools/cgroup/iocost_monitor.py          |  270 +++
 21 files changed, 3259 insertions(+), 53 deletions(-)

--
tejun

[1] http://lkml.kernel.org/r/20190614015620.1587672-1-tj@kernel.org


^ permalink raw reply

* Re: [PATCH 1/2] wait: add wq_has_multiple_sleepers helper
From: Jens Axboe @ 2019-07-10 20:39 UTC (permalink / raw)
  To: Peter Zijlstra, Jens Axboe
  Cc: Josef Bacik, Kernel Team, linux-block@vger.kernel.org,
	Ingo Molnar, Oleg Nesterov
In-Reply-To: <20190710203516.GL3419@hirez.programming.kicks-ass.net>

On 7/10/19 2:35 PM, Peter Zijlstra wrote:
> On Wed, Jul 10, 2019 at 02:23:23PM -0600, Jens Axboe wrote:
>> On 7/10/19 1:52 PM, Josef Bacik wrote:
>>> rq-qos sits in the io path so we want to take locks as sparingly as
>>> possible.  To accomplish this we try not to take the waitqueue head lock
>>> unless we are sure we need to go to sleep, and we have an optimization
>>> to make sure that we don't starve out existing waiters.  Since we check
>>> if there are existing waiters locklessly we need to be able to update
>>> our view of the waitqueue list after we've added ourselves to the
>>> waitqueue.  Accomplish this by adding this helper to see if there are
>>> more than two waiters on the waitqueue.
>>>
>>> Suggested-by: Jens Axboe <axboe@kernel.dk>
>>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>>> ---
>>>    include/linux/wait.h | 21 +++++++++++++++++++++
>>>    1 file changed, 21 insertions(+)
>>>
>>> diff --git a/include/linux/wait.h b/include/linux/wait.h
>>> index b6f77cf60dd7..89c41a7b3046 100644
>>> --- a/include/linux/wait.h
>>> +++ b/include/linux/wait.h
>>> @@ -126,6 +126,27 @@ static inline int waitqueue_active(struct wait_queue_head *wq_head)
>>>    	return !list_empty(&wq_head->head);
>>>    }
>>>    
>>> +/**
>>> + * wq_has_multiple_sleepers - check if there are multiple waiting prcesses
>>> + * @wq_head: wait queue head
>>> + *
>>> + * Returns true of wq_head has multiple waiting processes.
>>> + *
>>> + * Please refer to the comment for waitqueue_active.
>>> + */
>>> +static inline bool wq_has_multiple_sleepers(struct wait_queue_head *wq_head)
>>> +{
>>> +	/*
>>> +	 * We need to be sure we are in sync with the
>>> +	 * add_wait_queue modifications to the wait queue.
>>> +	 *
>>> +	 * This memory barrier should be paired with one on the
>>> +	 * waiting side.
>>> +	 */
>>> +	smp_mb();
>>> +	return !list_is_singular(&wq_head->head);
>>> +}
>>> +
>>>    /**
>>>     * wq_has_sleeper - check if there are any waiting processes
>>>     * @wq_head: wait queue head
>>
>> This (and 2/2) looks good to me, better than v1 for sure. Peter/Ingo,
>> are you OK with adding this new helper? For reference, this (and the
>> next patch) replace the alternative, which is an open-coding of
>> prepare_to_wait():
>>
>> https://lore.kernel.org/linux-block/20190710190514.86911-1-josef@toxicpanda.com/
> 
> Yet another approach would be to have prepare_to_wait*() return this
> state, but I think this is ok.

We did discuss that case, but it seems somewhat random to have it
return that specific piece of info. But it'd work for this case.

> The smp_mb() is superfluous -- in your specific case -- since
> preprare_to_wait*() already does one through set_current_state().
> 
> So you could do without it, I think.

But that's specific to this use case. Maybe it's the only one we'll
have, and then it's fine, but as a generic helper it seems safer to
include the same ordering protection as wq_has_sleeper().

-- 
Jens Axboe


^ permalink raw reply

* Re: [PATCH 1/2] wait: add wq_has_multiple_sleepers helper
From: Peter Zijlstra @ 2019-07-10 20:35 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Josef Bacik, kernel-team, linux-block, Ingo Molnar, Oleg Nesterov
In-Reply-To: <bbe73e4e-9270-46ac-16d7-39a40485fe53@kernel.dk>

On Wed, Jul 10, 2019 at 02:23:23PM -0600, Jens Axboe wrote:
> On 7/10/19 1:52 PM, Josef Bacik wrote:
> > rq-qos sits in the io path so we want to take locks as sparingly as
> > possible.  To accomplish this we try not to take the waitqueue head lock
> > unless we are sure we need to go to sleep, and we have an optimization
> > to make sure that we don't starve out existing waiters.  Since we check
> > if there are existing waiters locklessly we need to be able to update
> > our view of the waitqueue list after we've added ourselves to the
> > waitqueue.  Accomplish this by adding this helper to see if there are
> > more than two waiters on the waitqueue.
> > 
> > Suggested-by: Jens Axboe <axboe@kernel.dk>
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> >   include/linux/wait.h | 21 +++++++++++++++++++++
> >   1 file changed, 21 insertions(+)
> > 
> > diff --git a/include/linux/wait.h b/include/linux/wait.h
> > index b6f77cf60dd7..89c41a7b3046 100644
> > --- a/include/linux/wait.h
> > +++ b/include/linux/wait.h
> > @@ -126,6 +126,27 @@ static inline int waitqueue_active(struct wait_queue_head *wq_head)
> >   	return !list_empty(&wq_head->head);
> >   }
> >   
> > +/**
> > + * wq_has_multiple_sleepers - check if there are multiple waiting prcesses
> > + * @wq_head: wait queue head
> > + *
> > + * Returns true of wq_head has multiple waiting processes.
> > + *
> > + * Please refer to the comment for waitqueue_active.
> > + */
> > +static inline bool wq_has_multiple_sleepers(struct wait_queue_head *wq_head)
> > +{
> > +	/*
> > +	 * We need to be sure we are in sync with the
> > +	 * add_wait_queue modifications to the wait queue.
> > +	 *
> > +	 * This memory barrier should be paired with one on the
> > +	 * waiting side.
> > +	 */
> > +	smp_mb();
> > +	return !list_is_singular(&wq_head->head);
> > +}
> > +
> >   /**
> >    * wq_has_sleeper - check if there are any waiting processes
> >    * @wq_head: wait queue head
> 
> This (and 2/2) looks good to me, better than v1 for sure. Peter/Ingo,
> are you OK with adding this new helper? For reference, this (and the
> next patch) replace the alternative, which is an open-coding of
> prepare_to_wait():
> 
> https://lore.kernel.org/linux-block/20190710190514.86911-1-josef@toxicpanda.com/

Yet another approach would be to have prepare_to_wait*() return this
state, but I think this is ok.

The smp_mb() is superfluous -- in your specific case -- since
preprare_to_wait*() already does one through set_current_state().

So you could do without it, I think.


^ permalink raw reply

* Re: [PATCH 1/2] wait: add wq_has_multiple_sleepers helper
From: Jens Axboe @ 2019-07-10 20:23 UTC (permalink / raw)
  To: Josef Bacik, kernel-team, linux-block; +Cc: Peter Zijlstra, Ingo Molnar
In-Reply-To: <20190710195227.92322-1-josef@toxicpanda.com>

On 7/10/19 1:52 PM, Josef Bacik wrote:
> rq-qos sits in the io path so we want to take locks as sparingly as
> possible.  To accomplish this we try not to take the waitqueue head lock
> unless we are sure we need to go to sleep, and we have an optimization
> to make sure that we don't starve out existing waiters.  Since we check
> if there are existing waiters locklessly we need to be able to update
> our view of the waitqueue list after we've added ourselves to the
> waitqueue.  Accomplish this by adding this helper to see if there are
> more than two waiters on the waitqueue.
> 
> Suggested-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>   include/linux/wait.h | 21 +++++++++++++++++++++
>   1 file changed, 21 insertions(+)
> 
> diff --git a/include/linux/wait.h b/include/linux/wait.h
> index b6f77cf60dd7..89c41a7b3046 100644
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -126,6 +126,27 @@ static inline int waitqueue_active(struct wait_queue_head *wq_head)
>   	return !list_empty(&wq_head->head);
>   }
>   
> +/**
> + * wq_has_multiple_sleepers - check if there are multiple waiting prcesses
> + * @wq_head: wait queue head
> + *
> + * Returns true of wq_head has multiple waiting processes.
> + *
> + * Please refer to the comment for waitqueue_active.
> + */
> +static inline bool wq_has_multiple_sleepers(struct wait_queue_head *wq_head)
> +{
> +	/*
> +	 * We need to be sure we are in sync with the
> +	 * add_wait_queue modifications to the wait queue.
> +	 *
> +	 * This memory barrier should be paired with one on the
> +	 * waiting side.
> +	 */
> +	smp_mb();
> +	return !list_is_singular(&wq_head->head);
> +}
> +
>   /**
>    * wq_has_sleeper - check if there are any waiting processes
>    * @wq_head: wait queue head

This (and 2/2) looks good to me, better than v1 for sure. Peter/Ingo,
are you OK with adding this new helper? For reference, this (and the
next patch) replace the alternative, which is an open-coding of
prepare_to_wait():

https://lore.kernel.org/linux-block/20190710190514.86911-1-josef@toxicpanda.com/

-- 
Jens Axboe


^ permalink raw reply

* blktests: Nvme Target Generation Counter Issue
From: Logan Gunthorpe @ 2019-07-10 20:18 UTC (permalink / raw)
  To: Chaitanya Kulkarni; +Cc: osandov, Michael Moese, linux-block, Theodore Ts'o

Hey,

[I posted the following on the relevant github PR[1], though I'm
resending here to include the mailing lists and anyone who maybe doesn't
follow github]

The generation counter issue with NVMe target tests still seems to be an
issue even after a few months. I've found people complaining about it on
the list (@tytso) but I'm still seeing these failures in test
nvme/002,016 and 017 with no obvious resolution. The change in PR 34[1]
seems like the only sensible solution I've found so far (I've tested it
to ensure it works).

The generation counter is a static variable in the NVMe Target code that
gets incremented any time the discovery information changes. My best
guess is that the kernel started incrementing this for more reasons
since the tests were written and now the tests are broken.

IMO, masking these changes out as suggested in this PR is the best
solution. Then, if we want to test the generation counter, create
another test that ensures the counter changes after performing specific
actions.

If we do that, we can also remove the "modprobe -r" calls (which are
necessary to reset the counter) and be able to support running the tests
with the modules built-in (which would be very nice for people trying to
test the kernel in VMs).

One way or another, we need a solution that's less fragile than the
current method.

Thanks,

Logan



[1] https://github.com/osandov/blktests/pull/34

^ permalink raw reply

* Re: [PATCH V3] block: Disable write plugging for zoned block devices
From: Jens Axboe @ 2019-07-10 20:18 UTC (permalink / raw)
  To: Damien Le Moal, linux-block
In-Reply-To: <c9b5b4cd-68d5-7c29-8f76-ad5b04007594@kernel.dk>

On 7/10/19 2:05 PM, Jens Axboe wrote:
> Looks fine to me now, but doesn't apply cleanly to master/for-linus.

I fixed it up manually, fwiw.

-- 
Jens Axboe


^ permalink raw reply

* Re: [PATCH] block: Fix elevator name declaration
From: Jens Axboe @ 2019-07-10 20:06 UTC (permalink / raw)
  To: Damien Le Moal, linux-block
In-Reply-To: <20190710155741.11292-1-damien.lemoal@wdc.com>

On 7/10/19 9:57 AM, Damien Le Moal wrote:
> The elevator_name field in struct elevator_type is declared as an array
> of characters (ELV_NAME_MAX size) but in practice used as a string
> pointer with its initialization done statically within each
> elevator elevator_type structure declaration.
> 
> Change the declaration of elevator_name to the more appropriate
> "const char *" type.

Applied, thanks.

-- 
Jens Axboe


^ permalink raw reply

* Re: [PATCH] block: Remove unused definitions
From: Jens Axboe @ 2019-07-10 20:05 UTC (permalink / raw)
  To: Damien Le Moal, linux-block
In-Reply-To: <20190710155608.11227-1-damien.lemoal@wdc.com>

On 7/10/19 9:56 AM, Damien Le Moal wrote:
> The ELV_MQUEUE_XXX definitions in include/linux/elevator.h are unused
> since the removal of elevator_may_queue_fn in kernel 5.0. Remove these
> definitions and also remove the documentation of elevator_may_queue_fn
> in Documentiation/block/biodoc.txt.

Applied, thanks.

-- 
Jens Axboe


^ permalink raw reply

* Re: [PATCH V3] block: Disable write plugging for zoned block devices
From: Jens Axboe @ 2019-07-10 20:05 UTC (permalink / raw)
  To: Damien Le Moal, linux-block
In-Reply-To: <20190710161831.25250-1-damien.lemoal@wdc.com>

Looks fine to me now, but doesn't apply cleanly to master/for-linus.

-- 
Jens Axboe


^ permalink raw reply

* [PATCH 1/2] wait: add wq_has_multiple_sleepers helper
From: Josef Bacik @ 2019-07-10 19:52 UTC (permalink / raw)
  To: kernel-team, axboe, linux-block

rq-qos sits in the io path so we want to take locks as sparingly as
possible.  To accomplish this we try not to take the waitqueue head lock
unless we are sure we need to go to sleep, and we have an optimization
to make sure that we don't starve out existing waiters.  Since we check
if there are existing waiters locklessly we need to be able to update
our view of the waitqueue list after we've added ourselves to the
waitqueue.  Accomplish this by adding this helper to see if there are
more than two waiters on the waitqueue.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 include/linux/wait.h | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index b6f77cf60dd7..89c41a7b3046 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -126,6 +126,27 @@ static inline int waitqueue_active(struct wait_queue_head *wq_head)
 	return !list_empty(&wq_head->head);
 }
 
+/**
+ * wq_has_multiple_sleepers - check if there are multiple waiting prcesses
+ * @wq_head: wait queue head
+ *
+ * Returns true of wq_head has multiple waiting processes.
+ *
+ * Please refer to the comment for waitqueue_active.
+ */
+static inline bool wq_has_multiple_sleepers(struct wait_queue_head *wq_head)
+{
+	/*
+	 * We need to be sure we are in sync with the
+	 * add_wait_queue modifications to the wait queue.
+	 *
+	 * This memory barrier should be paired with one on the
+	 * waiting side.
+	 */
+	smp_mb();
+	return !list_is_singular(&wq_head->head);
+}
+
 /**
  * wq_has_sleeper - check if there are any waiting processes
  * @wq_head: wait queue head
-- 
2.17.1


^ permalink raw reply related

* [PATCH 2/2] rq-qos: fix missed wake-ups in rq_qos_throttle
From: Josef Bacik @ 2019-07-10 19:52 UTC (permalink / raw)
  To: kernel-team, axboe, linux-block
In-Reply-To: <20190710195227.92322-1-josef@toxicpanda.com>

We saw a hang in production with WBT where there was only one waiter in
the throttle path and no outstanding IO.  This is because of the
has_sleepers optimization that is used to make sure we don't steal an
inflight counter for new submitters when there are people already on the
list.

We can race with our check to see if the waitqueue has any waiters (this
is done locklessly) and the time we actually add ourselves to the
waitqueue.  If this happens we'll go to sleep and never be woken up
because nobody is doing IO to wake us up.

Fix this by checking if the waitqueue has multiple sleepers after we add
ourselves to the list, that way we have an uptodate view of the list.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 block/blk-rq-qos.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
index 659ccb8b693f..b39b5f3fb01b 100644
--- a/block/blk-rq-qos.c
+++ b/block/blk-rq-qos.c
@@ -244,6 +244,7 @@ void rq_qos_wait(struct rq_wait *rqw, void *private_data,
 		return;
 
 	prepare_to_wait_exclusive(&rqw->wait, &data.wq, TASK_UNINTERRUPTIBLE);
+	has_sleeper = wq_has_multiple_sleepers(&rqw->wait);
 	do {
 		if (data.got_token)
 			break;
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
From: Sagi Grimberg @ 2019-07-10 19:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Danil Kipnis, Jack Wang, linux-block, linux-rdma, axboe,
	Christoph Hellwig, bvanassche, dledford, Roman Pen, gregkh
In-Reply-To: <20190710172505.GD4051@ziepe.ca>


>> I still do not understand why this should give any notice-able
>> performance advantage.
> 
> Usually omitting invalidations gives a healthy bump.
> 
> Also, RDMA WRITE is generally faster than READ at the HW level in
> various ways.

Yes, but this should be essentially identical to running nvme-rdma
with 512KB of immediate-data (the nvme term is in-capsule data).

In the upstream nvme target we have inline_data_size port attribute
that is tunable for that (defaults to PAGE_SIZE).

^ permalink raw reply

* [PATCH] rq-qos: fix missed wake-ups in rq_qos_throttle
From: Josef Bacik @ 2019-07-10 19:05 UTC (permalink / raw)
  To: axboe, linux-block, kernel-team

We saw a hang in production with WBT where there was only one waiter in
the throttle path and no outstanding IO.  This is because of the
has_sleepers optimization that is used to make sure we don't steal an
inflight counter for new submitters when there are people already on the
list.

We can race with our check to see if the waitqueue has any waiters (this
is done locklessly) and the time we actually add ourselves to the
waitqueue.  If this happens we'll go to sleep and never be woken up
because nobody is doing IO to wake us up.

Fix this by open coding prepare_to_wait_exclusive (yes, yes, I know) in
order to get a real value for has_sleepers.  This way we keep our
optimization in place and avoid hanging forever if there are no longer
any waiters on the list.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 block/blk-rq-qos.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
index 659ccb8b693f..04590666f7c4 100644
--- a/block/blk-rq-qos.c
+++ b/block/blk-rq-qos.c
@@ -237,13 +237,18 @@ void rq_qos_wait(struct rq_wait *rqw, void *private_data,
 		.cb = acquire_inflight_cb,
 		.private_data = private_data,
 	};
+	unsigned long flags;
 	bool has_sleeper;
 
 	has_sleeper = wq_has_sleeper(&rqw->wait);
 	if (!has_sleeper && acquire_inflight_cb(rqw, private_data))
 		return;
 
-	prepare_to_wait_exclusive(&rqw->wait, &data.wq, TASK_UNINTERRUPTIBLE);
+	spin_lock_irqsave(&rqw->wait.lock, flags);
+	has_sleeper = !list_empty(&rqw->wait.head);
+	__add_wait_queue_entry_tail_exclusive(&rqw->wait, &data.wq);
+	set_current_state(TASK_UNINTERRUPTIBLE);
+	spin_unlock_irqrestore(&rqw->wait.lock, flags);
 	do {
 		if (data.got_token)
 			break;
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
From: Jason Gunthorpe @ 2019-07-10 17:25 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Danil Kipnis, Jack Wang, linux-block, linux-rdma, axboe,
	Christoph Hellwig, bvanassche, dledford, Roman Pen, gregkh
In-Reply-To: <c49bf227-5274-9d13-deba-a405c75d1358@grimberg.me>

On Wed, Jul 10, 2019 at 09:25:05AM -0700, Sagi Grimberg wrote:
> 
> > > Another question, from what I understand from the code, the client
> > > always rdma_writes data on writes (with imm) from a remote pool of
> > > server buffers dedicated to it. Essentially all writes are immediate (no
> > > rdma reads ever). How is that different than using send wrs to a set of
> > > pre-posted recv buffers (like all others are doing)? Is it faster?
> > 
> > RDMA WRITE only is generally a bit faster, and if you use a buffer
> > pool in a smart way it is possible to get very good data packing.
> 
> There is no packing, its used exactly as send/recv, but with a remote
> buffer pool (pool of 512K buffers) and the client selects one and rdma
> write with imm to it.

Well that makes little sense then:)

> > Maybe this is fine, but it needs to be made very clear that it uses
> > this insecure operating model to get higher performance..
> 
> I still do not understand why this should give any notice-able
> performance advantage.

Usually omitting invalidations gives a healthy bump.

Also, RDMA WRITE is generally faster than READ at the HW level in
various ways.

Jason

^ permalink raw reply

* Re: [PATCH V2] block: Disable write plugging for zoned block devices
From: Christoph Hellwig @ 2019-07-10 16:37 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Damien Le Moal, linux-block, Bart Van Assche
In-Reply-To: <71be2c3f-2e26-dfd4-72de-2eabdddd311f@kernel.dk>

On Wed, Jul 10, 2019 at 09:57:05AM -0600, Jens Axboe wrote:
> On 7/10/19 9:54 AM, Damien Le Moal wrote:
> > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > index 633a5a77ee8b..c9195a2cd670 100644
> > --- a/block/blk-mq.h
> > +++ b/block/blk-mq.h
> > @@ -238,4 +238,14 @@ static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
> >   		qmap->mq_map[cpu] = 0;
> >   }
> >   
> > +static inline struct blk_plug *blk_mq_plug(struct request_queue *q,
> > +					   struct bio *bio)
> > +{
> > +	if (!blk_queue_is_zoned(q) || !op_is_write(bio_op(bio)))
> > +		return current->plug;
> > +
> > +	/* Zoned block device write case: do not plug the BIO */
> > +	return NULL;
> > +}
> > +
> >   #endif
> 
> Folks are going to look at that and be puzzled, I think that function
> deserves a comment.

Agreed.  Also I'd reformat the conditionals to make the default
case more obvious:

	if (blk_queue_is_zoned(q) && op_is_write(bio_op(bio)))
		return NULL;
	return current->plug;

^ permalink raw reply

* Re: [PATCH] block: Fix elevator name declaration
From: Marcos Paulo de Souza @ 2019-07-10 16:34 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: linux-block, Jens Axboe
In-Reply-To: <20190710155741.11292-1-damien.lemoal@wdc.com>

On Thu, Jul 11, 2019 at 12:57:41AM +0900, Damien Le Moal wrote:
> The elevator_name field in struct elevator_type is declared as an array
> of characters (ELV_NAME_MAX size) but in practice used as a string
> pointer with its initialization done statically within each
> elevator elevator_type structure declaration.
> 
> Change the declaration of elevator_name to the more appropriate
> "const char *" type.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> ---
>  include/linux/elevator.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/elevator.h b/include/linux/elevator.h
> index 9842e53623f3..b73e6f35fbad 100644
> --- a/include/linux/elevator.h
> +++ b/include/linux/elevator.h
> @@ -75,7 +75,7 @@ struct elevator_type
>  	size_t icq_size;	/* see iocontext.h */
>  	size_t icq_align;	/* ditto */
>  	struct elv_fs_entry *elevator_attrs;
> -	char elevator_name[ELV_NAME_MAX];
> +	const char *elevator_name;
>  	const char *elevator_alias;
>  	struct module *elevator_owner;
>  #ifdef CONFIG_BLK_DEBUG_FS

Acked-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>

> -- 
> 2.21.0
> 

^ permalink raw reply

* Re: [PATCH] block: Remove unused definitions
From: Marcos Paulo de Souza @ 2019-07-10 16:33 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: linux-block, Jens Axboe
In-Reply-To: <20190710155608.11227-1-damien.lemoal@wdc.com>

On Thu, Jul 11, 2019 at 12:56:08AM +0900, Damien Le Moal wrote:
> The ELV_MQUEUE_XXX definitions in include/linux/elevator.h are unused
> since the removal of elevator_may_queue_fn in kernel 5.0. Remove these
> definitions and also remove the documentation of elevator_may_queue_fn
> in Documentiation/block/biodoc.txt.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> ---
>  Documentation/block/biodoc.txt | 5 -----
>  include/linux/elevator.h       | 9 ---------
>  2 files changed, 14 deletions(-)
> 
> diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt
> index ac18b488cb5e..f0d15b0cb3c0 100644
> --- a/Documentation/block/biodoc.txt
> +++ b/Documentation/block/biodoc.txt
> @@ -844,11 +844,6 @@ elevator_latter_req_fn		These return the request before or after the
>  
>  elevator_completed_req_fn	called when a request is completed.
>  
> -elevator_may_queue_fn		returns true if the scheduler wants to allow the
> -				current context to queue a new request even if
> -				it is over the queue limit. This must be used
> -				very carefully!!
> -
>  elevator_set_req_fn
>  elevator_put_req_fn		Must be used to allocate and free any elevator
>  				specific storage for a request.
> diff --git a/include/linux/elevator.h b/include/linux/elevator.h
> index 6e8bc53740f0..9842e53623f3 100644
> --- a/include/linux/elevator.h
> +++ b/include/linux/elevator.h
> @@ -160,15 +160,6 @@ extern struct request *elv_rb_find(struct rb_root *, sector_t);
>  #define ELEVATOR_INSERT_FLUSH	5
>  #define ELEVATOR_INSERT_SORT_MERGE	6
>  
> -/*
> - * return values from elevator_may_queue_fn
> - */
> -enum {
> -	ELV_MQUEUE_MAY,
> -	ELV_MQUEUE_NO,
> -	ELV_MQUEUE_MUST,
> -};

Acked-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>

> -
>  #define rq_end_sector(rq)	(blk_rq_pos(rq) + blk_rq_sectors(rq))
>  #define rb_entry_rq(node)	rb_entry((node), struct request, rb_node)
>  
> -- 
> 2.21.0
> 

^ permalink raw reply

* Re: [PATCH v4 00/25] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)
From: Sagi Grimberg @ 2019-07-10 16:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Danil Kipnis, Jack Wang, linux-block, linux-rdma, axboe,
	Christoph Hellwig, bvanassche, dledford, Roman Pen, gregkh
In-Reply-To: <20190710135519.GA4051@ziepe.ca>


>> Another question, from what I understand from the code, the client
>> always rdma_writes data on writes (with imm) from a remote pool of
>> server buffers dedicated to it. Essentially all writes are immediate (no
>> rdma reads ever). How is that different than using send wrs to a set of
>> pre-posted recv buffers (like all others are doing)? Is it faster?
> 
> RDMA WRITE only is generally a bit faster, and if you use a buffer
> pool in a smart way it is possible to get very good data packing.

There is no packing, its used exactly as send/recv, but with a remote
buffer pool (pool of 512K buffers) and the client selects one and rdma
write with imm to it.

> With
> SEND the number of recvq entries dictates how big the rx buffer can
> be, or you waste even more memory by using partial send buffers..

This is exactly how it used here.

> A scheme like this seems like a high performance idea, but on the
> other side, I have no idea how you could possibly manage invalidations
> efficiently with a shared RX buffer pool...

There are no invalidations, this remote server pool is registered once
and long lived with the session.

> The RXer has to push out an invalidation for the shared buffer pool
> MR, but we don't have protocols for partial MR invalidation.
> 
> Which is back to my earlier thought that the main reason this perfoms
> better is because it doesn't have synchronous MR invalidation.

This issue only exists on the client side. The server never
invalidates any of its buffers.

> Maybe this is fine, but it needs to be made very clear that it uses
> this insecure operating model to get higher performance..

I still do not understand why this should give any notice-able
performance advantage.

^ permalink raw reply

* [PATCH V3] block: Disable write plugging for zoned block devices
From: Damien Le Moal @ 2019-07-10 16:18 UTC (permalink / raw)
  To: linux-block, Jens Axboe

Simultaneously writing to a sequential zone of a zoned block device
from multiple contexts requires mutual exclusion for BIO issuing to
ensure that writes happen sequentially. However, even for a well
behaved user correctly implementing such synchronization, BIO plugging
may interfere and result in BIOs from the different contextx to be
reordered if plugging is done outside of the mutual exclusion section,
e.g. the plug was started by a function higher in the call chain than
the function issuing BIOs.

         Context A                     Context B

   | blk_start_plug()
   | ...
   | seq_write_zone()
     | mutex_lock(zone)
     | bio-0->bi_iter.bi_sector = zone->wp
     | zone->wp += bio_sectors(bio-0)
     | submit_bio(bio-0)
     | bio-1->bi_iter.bi_sector = zone->wp
     | zone->wp += bio_sectors(bio-1)
     | submit_bio(bio-1)
     | mutex_unlock(zone)
     | return
   | -----------------------> | seq_write_zone()
  				| mutex_lock(zone)
     				| bio-2->bi_iter.bi_sector = zone->wp
     				| zone->wp += bio_sectors(bio-2)
				| submit_bio(bio-2)
				| mutex_unlock(zone)
   | <------------------------- |
   | blk_finish_plug()

In the above example, despite the mutex synchronization ensuring the
correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being
issued after BIO 2 of context B, when the plug is released with
blk_finish_plug().

While this problem can be addressed using the blk_flush_plug_list()
function (in the above example, the call must be inserted before the
zone mutex lock is released), a simple generic solution in the block
layer avoid this additional code in all zoned block device user code.
The simple generic solution implemented with this patch is to introduce
the internal helper function blk_mq_plug() to access the current
context plug on BIO submission. This helper returns the current plug
only if the target device is not a zoned block device or if the BIO to
be plugged is not a write operation. Otherwise, the caller context plug
is ignored and NULL returned, resulting is all writes to zoned block
device to never be plugged.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 block/blk-core.c |  2 +-
 block/blk-mq.c   |  2 +-
 block/blk-mq.h   | 32 ++++++++++++++++++++++++++++++++
 3 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 8340f69670d8..3957ea6811c3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -645,7 +645,7 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
 	struct request *rq;
 	struct list_head *plug_list;
 
-	plug = current->plug;
+	plug = blk_mq_plug(q, bio);
 	if (!plug)
 		return false;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ce0f5f4ede70..90be5bb6fa1b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1969,7 +1969,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 
 	cookie = request_to_qc_t(data.hctx, rq);
 
-	plug = current->plug;
+	plug = blk_mq_plug(q, bio);
 	if (unlikely(is_flush_fua)) {
 		blk_mq_put_ctx(data.ctx);
 		blk_mq_bio_to_request(rq, bio);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 633a5a77ee8b..4a63c96188c5 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -238,4 +238,36 @@ static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
 		qmap->mq_map[cpu] = 0;
 }
 
+/*
+ * blk_mq_plug() - Get caller context plug
+ * @q: request queue
+ * @bio : the bio being submitted by the caller context
+ *
+ * Plugging, by design, may delay the insertion of BIOs into the elevator in
+ * order to increase BIO merging opportunities. This however can cause BIO
+ * insertion order to change from the order in which submit_bio() is being
+ * executed in the case of multiple contexts concurrently issuing BIOs to a
+ * device, even if these context are synchronized to tightly control BIO issuing
+ * order. While this is not a problem with regular block devices, this ordering
+ * change can cause write BIO failures with zoned block devices as these
+ * require sequential write patterns to zones. Prevent this from happening by
+ * ignoring the plug state of a BIO issuing context if the target request queue
+ * is for a zoned block device and the BIO to plug is a write operation.
+ *
+ * Return current->plug if the bio can be plugged and NULL otherwise
+ */
+static inline struct blk_plug *blk_mq_plug(struct request_queue *q,
+					   struct bio *bio)
+{
+	/*
+	 * For regular block devices or read operations, use the context plug
+	 * which may be NULL if blk_start_plug() was not executed.
+	 */
+	if (!blk_queue_is_zoned(q) || !op_is_write(bio_op(bio)))
+		return current->plug;
+
+	/* Zoned block device write operation case: do not plug the BIO */
+	return NULL;
+}
+
 #endif
-- 
2.21.0


^ permalink raw reply related

* [PATCH V2] block: Disable write plugging for zoned block devices
From: Damien Le Moal @ 2019-07-10 15:54 UTC (permalink / raw)
  To: linux-block, Jens Axboe; +Cc: Bart Van Assche

Simultaneously writing to a sequential zone of a zoned block device
from multiple contexts requires mutual exclusion for BIO issuing to
ensure that writes happen sequentially. However, even for a well
behaved user correctly implementing such synchronization, BIO plugging
may interfere and result in BIOs from the different contextx to be
reordered if plugging is done outside of the mutual exclusion section,
e.g. the plug was started by a function higher in the call chain than
the function issuing BIOs.

         Context A                     Context B

   | blk_start_plug()
   | ...
   | seq_write_zone()
     | mutex_lock(zone)
     | bio-0->bi_iter.bi_sector = zone->wp
     | zone->wp += bio_sectors(bio-0)
     | submit_bio(bio-0)
     | bio-1->bi_iter.bi_sector = zone->wp
     | zone->wp += bio_sectors(bio-1)
     | submit_bio(bio-1)
     | mutex_unlock(zone)
     | return
   | -----------------------> | seq_write_zone()
  				| mutex_lock(zone)
     				| bio-2->bi_iter.bi_sector = zone->wp
     				| zone->wp += bio_sectors(bio-2)
				| submit_bio(bio-2)
				| mutex_unlock(zone)
   | <------------------------- |
   | blk_finish_plug()

In the above example, despite the mutex synchronization ensuring the
correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being
issued after BIO 2 of context B, when the plug is released with
blk_finish_plug().

While this problem can be addressed using the blk_flush_plug_list()
function (in the above example, the call must be inserted before the
zone mutex lock is released), a simple generic solution in the block
layer avoid this additional code in all zoned block device user code.
The simple generic solution implemented with this patch is to introduce
the internal helper function blk_mq_plug() to access the current
context plug on BIO submission. This helper returns the current plug
only if the target device is not a zoned block device or if the BIO to
be plugged is not a write operation. Otherwise, the caller context plug
is ignored and NULL returned, resulting is all writes to zoned block
device to never be plugged.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 block/blk-core.c |  2 +-
 block/blk-mq.c   |  2 +-
 block/blk-mq.h   | 10 ++++++++++
 3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 8340f69670d8..3957ea6811c3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -645,7 +645,7 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
 	struct request *rq;
 	struct list_head *plug_list;
 
-	plug = current->plug;
+	plug = blk_mq_plug(q, bio);
 	if (!plug)
 		return false;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ce0f5f4ede70..90be5bb6fa1b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1969,7 +1969,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 
 	cookie = request_to_qc_t(data.hctx, rq);
 
-	plug = current->plug;
+	plug = blk_mq_plug(q, bio);
 	if (unlikely(is_flush_fua)) {
 		blk_mq_put_ctx(data.ctx);
 		blk_mq_bio_to_request(rq, bio);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 633a5a77ee8b..c9195a2cd670 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -238,4 +238,14 @@ static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
 		qmap->mq_map[cpu] = 0;
 }
 
+static inline struct blk_plug *blk_mq_plug(struct request_queue *q,
+					   struct bio *bio)
+{
+	if (!blk_queue_is_zoned(q) || !op_is_write(bio_op(bio)))
+		return current->plug;
+
+	/* Zoned block device write case: do not plug the BIO */
+	return NULL;
+}
+
 #endif
-- 
2.21.0


^ permalink raw reply related

* [PATCH] block: Fix elevator name declaration
From: Damien Le Moal @ 2019-07-10 15:57 UTC (permalink / raw)
  To: linux-block, Jens Axboe

The elevator_name field in struct elevator_type is declared as an array
of characters (ELV_NAME_MAX size) but in practice used as a string
pointer with its initialization done statically within each
elevator elevator_type structure declaration.

Change the declaration of elevator_name to the more appropriate
"const char *" type.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 include/linux/elevator.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 9842e53623f3..b73e6f35fbad 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -75,7 +75,7 @@ struct elevator_type
 	size_t icq_size;	/* see iocontext.h */
 	size_t icq_align;	/* ditto */
 	struct elv_fs_entry *elevator_attrs;
-	char elevator_name[ELV_NAME_MAX];
+	const char *elevator_name;
 	const char *elevator_alias;
 	struct module *elevator_owner;
 #ifdef CONFIG_BLK_DEBUG_FS
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH V2] block: Disable write plugging for zoned block devices
From: Jens Axboe @ 2019-07-10 15:57 UTC (permalink / raw)
  To: Damien Le Moal, linux-block; +Cc: Bart Van Assche
In-Reply-To: <20190710155447.11112-1-damien.lemoal@wdc.com>

On 7/10/19 9:54 AM, Damien Le Moal wrote:
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 633a5a77ee8b..c9195a2cd670 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -238,4 +238,14 @@ static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
>   		qmap->mq_map[cpu] = 0;
>   }
>   
> +static inline struct blk_plug *blk_mq_plug(struct request_queue *q,
> +					   struct bio *bio)
> +{
> +	if (!blk_queue_is_zoned(q) || !op_is_write(bio_op(bio)))
> +		return current->plug;
> +
> +	/* Zoned block device write case: do not plug the BIO */
> +	return NULL;
> +}
> +
>   #endif

Folks are going to look at that and be puzzled, I think that function
deserves a comment.

-- 
Jens Axboe


^ permalink raw reply

* [PATCH] block: Remove unused definitions
From: Damien Le Moal @ 2019-07-10 15:56 UTC (permalink / raw)
  To: linux-block, Jens Axboe

The ELV_MQUEUE_XXX definitions in include/linux/elevator.h are unused
since the removal of elevator_may_queue_fn in kernel 5.0. Remove these
definitions and also remove the documentation of elevator_may_queue_fn
in Documentiation/block/biodoc.txt.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 Documentation/block/biodoc.txt | 5 -----
 include/linux/elevator.h       | 9 ---------
 2 files changed, 14 deletions(-)

diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt
index ac18b488cb5e..f0d15b0cb3c0 100644
--- a/Documentation/block/biodoc.txt
+++ b/Documentation/block/biodoc.txt
@@ -844,11 +844,6 @@ elevator_latter_req_fn		These return the request before or after the
 
 elevator_completed_req_fn	called when a request is completed.
 
-elevator_may_queue_fn		returns true if the scheduler wants to allow the
-				current context to queue a new request even if
-				it is over the queue limit. This must be used
-				very carefully!!
-
 elevator_set_req_fn
 elevator_put_req_fn		Must be used to allocate and free any elevator
 				specific storage for a request.
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 6e8bc53740f0..9842e53623f3 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -160,15 +160,6 @@ extern struct request *elv_rb_find(struct rb_root *, sector_t);
 #define ELEVATOR_INSERT_FLUSH	5
 #define ELEVATOR_INSERT_SORT_MERGE	6
 
-/*
- * return values from elevator_may_queue_fn
- */
-enum {
-	ELV_MQUEUE_MAY,
-	ELV_MQUEUE_NO,
-	ELV_MQUEUE_MUST,
-};
-
 #define rq_end_sector(rq)	(blk_rq_pos(rq) + blk_rq_sectors(rq))
 #define rb_entry_rq(node)	rb_entry((node), struct request, rb_node)
 
-- 
2.21.0


^ permalink raw reply related

* [PATCH v2] io_uring: fix stuck problem caused by potential memory alloc failure
From: Jackie Liu @ 2019-07-10 15:39 UTC (permalink / raw)
  To: axboe; +Cc: Jackie Liu, linux-block

When io_req_defer alloc memory fails, it will be -EAGAIN. But
io_submit_sqe cannot return immediately because the reference count for
req is still held. Ensure that we free it.

[axboe@kernel.dk: reword commit message]
Fixes: de0617e46717 ("io_uring: add support for marking commands as draining")
Cc: linux-block@vger.kernel.org
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
---
 fs/io_uring.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 4ef62a45045d..1c388533cdc8 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1843,8 +1843,8 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s,
 	ret = io_req_defer(ctx, req, s->sqe);
 	if (ret) {
 		if (ret == -EIOCBQUEUED)
-			ret = 0;
-		return ret;
+			return 0;
+		goto out;
 	}
 
 	ret = __io_submit_sqe(ctx, req, s, true);
-- 
2.22.0




^ permalink raw reply related

* Re: [PATCH] io_uring: fix stuck problem caused by potential memory alloc failure
From: JackieLiu @ 2019-07-10 15:33 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block
In-Reply-To: <25691b9f-8c0f-0d19-d1db-4c4ce6dfb5a9@kernel.dk>

Thank you for pointing out. I haven't had time to pay attention to the new 
features of sqe links. When I was debugging another stuck problem, I found 
that there was no free the reference count, so I sent the patch about this.

Anyway, if necessary, I will resend the patch again, with your rewritten
comments.
 

> 在 2019年7月10日,21:57,Jens Axboe <axboe@kernel.dk> 写道:
> 
> On 7/10/19 2:35 AM, Jackie Liu wrote:
>> when io_req_defer alloc memory failed, it will be return -EAGAIN.
>> But io_submit_sqe cannot be returned immediately because the reference
>> count for req is still hold. we should be clean it.
> 
> Looks like this actually got fixed in my for-5.3/io_uring branch which
> I haven't pushed out yet. Once that's in, I'd suggest we send yours to
> stable separately. Probably change the wording to:
> 
> When io_req_defer alloc memory fails, it will be -EAGAIN. But
> io_submit_sqe cannot return immediately because the reference count for
> req is still held. Ensure that we free it.
> 
> -- 
> Jens Axboe
> 
> 




^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox