* Re: [PATCH RFC 00/14] Add the BFQ I/O Scheduler to blk-mq
From: Jens Axboe @ 2017-03-20 18:40 UTC (permalink / raw)
To: Bart Van Assche, paolo.valente@linaro.org,
linus.walleij@linaro.org
Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
fchecconi@gmail.com, broonie@kernel.org,
avanzini.arianna@gmail.com, tj@kernel.org, ulf.hansson@linaro.org
In-Reply-To: <1489859192.2339.12.camel@sandisk.com>
On 03/18/2017 01:46 PM, Bart Van Assche wrote:
> On Sat, 2017-03-18 at 18:09 +0100, Linus Walleij wrote:
>> On Sat, Mar 18, 2017 at 11:52 AM, Paolo Valente
>> <paolo.valente@linaro.org> wrote:
>>>> Il giorno 14 mar 2017, alle ore 16:32, Bart Van Assche <bart.vanassche@sandisk.com> ha scritto:
>>>> (...) what should
>>>> a developer do who only has access to a small subset of all the storage
>>>> devices that are supported by the Linux kernel and hence who can not run the
>>>> benchmark against every supported storage device?
>>
>> Don't we use the community for that? We are dependent on people
>> downloading and testing our code eventually, I mean sure it's good if
>> we make some reasonable effort to test changes we do, but we are
>> only humans, and we get corrected by the experience of other humans.
>
> Hello Linus,
>
> Do you mean relying on the community to test other storage devices
> before or after a patch is upstream? Relying on the community to file
> bug reports after a patch is upstream would be wrong. The Linux kernel
> should not be used for experiments. As you know patches that are sent
> upstream should not introduce regressions.
I think there are two main aspects to this:
1) Stability issues
2) Performance issues
For stability issues, obviously we expect BFQ to be bug free when
merged. In practical matters, this means that it doesn't have any known
pending issues, since we obviously cannot guarantee that the code is Bug
Free in general.
>From a performance perspective, using BFQ is absolutely known to
introduce regressions when used on certain types of storage. It works
well on single queue rotating devices. It'll tank your NVMe device
performance. I don't think think this is necessarily a problem. By
default, BFQ will not be enabled anywhere. It's a scheduler that is
available in the system, and users can opt in if they desire to use BFQ.
I'm expecting distros to do the right thing with udev rules here.
> My primary concern about BFQ is that it is a very complicated I/O
> scheduler and also that the concepts used internally in that I/O
> scheduler are far away from the concepts we are used to when reasoning
> about I/O devices. I'm concerned that this will make the BFQ I/O
> scheduler hard to maintain.
That is also my main concern, which is why I'm trying to go through the
code and suggest areas where it can be improved. It'd be great if it was
more modular, for instance, it's somewhat cumbersome to wade through
nine thousand lines of code. It's my hope that we can improve this
aspect of it.
Understanding the actual algorithms is a separate issue. But in that
regard I do think that BFQ is more forgiving than CFQ, since there are
actual papers detailing how it works. If implemented as cleanly as
possible, we can't really make it any easier to understand. It's not a
trivial topic.
--
Jens Axboe
^ permalink raw reply
* Re: [PATCH 5/8] nowait aio: return on congested block device
From: Jan Kara @ 2017-03-20 17:33 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: Dave Chinner, Goldwyn Rodrigues, linux-fsdevel, jack, hch,
linux-block, linux-btrfs, linux-ext4, linux-xfs, sagi, avi, axboe,
linux-api, willy
In-Reply-To: <045e06ce-ac19-6293-a62b-8ac937f753ac@suse.de>
On Fri 17-03-17 07:23:24, Goldwyn Rodrigues wrote:
> On 03/16/2017 04:31 PM, Dave Chinner wrote:
> > On Wed, Mar 15, 2017 at 04:51:04PM -0500, Goldwyn Rodrigues wrote:
> >> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> >>
> >> A new flag BIO_NOWAIT is introduced to identify bio's
> >> orignating from iocb with IOCB_NOWAIT. This flag indicates
> >> to return immediately if a request cannot be made instead
> >> of retrying.
> >
> > So this makes a congested block device run the bio IO completion
> > callback with an -EAGAIN error present? Are all the filesystem
> > direct IO submission and completion routines OK with that? i.e. does
> > such a congestion case cause filesystems to temporarily expose stale
> > data to unprivileged users when the IO is requeued in this way?
> >
> > e.g. filesystem does allocation without blocking, submits bio,
> > device is congested, runs IO completion with error, so nothing
> > written to allocated blocks, write gets queued, so other read
> > comes in while the write is queued, reads data from uninitialised
> > blocks that were allocated during the write....
> >
> > Seems kinda problematic to me to have a undocumented design
> > constraint (i.e a landmine) where we submit the AIO only to have it
> > error out and then expect the filesystem to do something special and
> > different /without blocking/ on EAGAIN.
>
> If the filesystems has to perform block allocation, we would return
> -EAGAIN early enough. However, I agree there is a problem, since not all
> filesystems know this. I worked on only three of them.
So Dave has a good point about the missing documentation explaining the
expectation from the filesystem. Other than that the filesystem is
responsible for determining in its direct IO submission routine whether it
can deal with NOWAIT direct IO without blocking (including backing out of
block layer refuses to do the IO) and if not, just return with EAGAIN right
away.
Probably we should also have a feature flag in filesystem_type telling
whether a particular filesystem knows about NOWAIT direct IO at all so that
we don't have to add stub to each and every filesystem supporting direct IO
that returns EAGAIN when NOWAIT is set (OTOH adding that stub won't be that
hard either since there are not *that* many filesystems supporting direct
IO). But one or the other has to be done.
> > Why isn't the congestion check at a higher layer like we do for page
> > cache readahead? i.e. using the bdi*congested() API at the time we
> > are doing all the other filesystem blocking checks.
> >
>
> Yes, that may work better. We will have to call bdi_read_congested() on
> a write path. (will have to comment that part of the code). Would it
> encompass all possible waits in the block layer?
Congestion mechanism is not really reliable. The bdi can show the device is
not congested but block layer will still block you waiting for some
resource. And for some of the stuff like tag allocation you cannot really
do a reliable check in advance so I don't think congestion checks are
really a viable alternative.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply
* [PATCH] blk-mq: Add NULL pointer check for HW dispatch queue
From: Jitendra Bhivare @ 2017-03-20 9:40 UTC (permalink / raw)
To: linux-block; +Cc: Jitendra Bhivare, Somnath Kotur
As part of blk_mq_realloc_hw_ctx(), if the init_hctx() ops is
failed by the underyling transport, the hctx pointer is freed and
initialized to NULL.
However, functions down the line, access this hwctx pointer without
a NULL pointer check, which could lead to a kernel crash.
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Jitendra Bhivare <jitendra.bhivare@broadcom.com>
---
block/blk-mq.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a4546f0..9cb2d2e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2048,6 +2048,8 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
continue;
hctx = blk_mq_map_queue(q, i);
+ if (!hctx)
+ continue;
/*
* Set local node, IFF we have more than one hw queue. If
@@ -2128,6 +2130,8 @@ static void blk_mq_map_swqueue(struct request_queue *q,
ctx = per_cpu_ptr(q->queue_ctx, i);
hctx = blk_mq_map_queue(q, i);
+ if (!hctx)
+ continue;
cpumask_set_cpu(i, hctx->cpumask);
ctx->index_hw = hctx->nr_ctx;
--
1.8.3.1
^ permalink raw reply related
* Re: [PATCH RFC 00/14] Add the BFQ I/O Scheduler to blk-mq
From: Paolo Valente @ 2017-03-19 12:14 UTC (permalink / raw)
To: Bart Van Assche
Cc: linus.walleij@linaro.org, linux-kernel@vger.kernel.org,
linux-block@vger.kernel.org, fchecconi@gmail.com, axboe@kernel.dk,
broonie@kernel.org, Arianna Avanzini, tj@kernel.org,
ulf.hansson@linaro.org
In-Reply-To: <1489859192.2339.12.camel@sandisk.com>
> Il giorno 18 mar 2017, alle ore 13:46, Bart Van Assche =
<bart.vanassche@sandisk.com> ha scritto:
>=20
> On Sat, 2017-03-18 at 18:09 +0100, Linus Walleij wrote:
>> On Sat, Mar 18, 2017 at 11:52 AM, Paolo Valente
>> <paolo.valente@linaro.org> wrote:
>>>> Il giorno 14 mar 2017, alle ore 16:32, Bart Van Assche =
<bart.vanassche@sandisk.com> ha scritto:
>>>> (...) what should
>>>> a developer do who only has access to a small subset of all the =
storage
>>>> devices that are supported by the Linux kernel and hence who can =
not run the
>>>> benchmark against every supported storage device?
>>=20
>> Don't we use the community for that? We are dependent on people
>> downloading and testing our code eventually, I mean sure it's good if
>> we make some reasonable effort to test changes we do, but we are
>> only humans, and we get corrected by the experience of other humans.
>=20
> Hello Linus,
>=20
> Do you mean relying on the community to test other storage devices =
before
> or after a patch is upstream? Relying on the community to file bug =
reports
> after a patch is upstream would be wrong. The Linux kernel should not =
be
> used for experiments. As you know patches that are sent upstream =
should
> not introduce regressions.
>=20
> My primary concern about BFQ is that it is a very complicated I/O =
scheduler
> and also that the concepts used internally in that I/O scheduler are =
far
> away from the concepts we are used to when reasoning about I/O =
devices.
Hi Bart,
could you elaborate a little bit more on this? To hopefully help you
highlight where the problem is, here is a summary of what the patches
introduce.
1. BFQ engine. This initial piece of code has been obtained mainly
by copying (verbatim) CFQ, replacing all cfq_ prefixes with bfq_,
replacing the round-robin algorithm at the hearth of BFQ with wf2q+, a
well-know and widely studied variant of the classical wfq algorithm,
and, finally, by adapting the code around the new engine to accomodate
the latter. In particular, budgets, measured in number of sectors, are
used instead of time slices, to achieve bandwidth fairness.
2. Support for cgroups and hierarchical scheduling.
3. Heuristics to improve service quality and boost throughput. These
additional pieces are introduced and documented one by one. The most
complex are: improving responsiveness by privileging the I/O of
interactive applications, improving audio/video playback/streaming by
privileging their I/O, boosting throughput with interleaved I/O (such
as KVM I/O) by merging the queues associated with the processes doing
such an I/O, boosting throughput for applications that span several
processes.
Which of these contributions contain deviations from the I/O concepts
you are used to, and what are these deviations?
Thanks,
Paolo
> I'm
> concerned that this will make the BFQ I/O scheduler hard to maintain.
>=20
> Bart.
^ permalink raw reply
* Re: [PATCH RFC 01/14] block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler
From: Paolo Valente @ 2017-03-19 11:45 UTC (permalink / raw)
To: Bart Van Assche
Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
fchecconi@gmail.com, linus.walleij@linaro.org, axboe@kernel.dk,
Arianna Avanzini, broonie@kernel.org, tj@kernel.org,
ulf.hansson@linaro.org
In-Reply-To: <1489850658.2339.1.camel@sandisk.com>
> Il giorno 18 mar 2017, alle ore 11:24, Bart Van Assche =
<bart.vanassche@sandisk.com> ha scritto:
>=20
> On Sat, 2017-03-18 at 08:08 -0400, Paolo Valente wrote:
>>> Il giorno 06 mar 2017, alle ore 14:40, Bart Van Assche =
<bart.vanassche@sandisk.com> ha scritto:
>>>> +#define BFQ_BFQQ_FNS(name) =
\
>>>> +static void bfq_mark_bfqq_##name(struct bfq_queue *bfqq) =
\
>>>> +{ =
\
>>>> + (bfqq)->flags |=3D (1 << BFQ_BFQQ_FLAG_##name); =
\
>>>> +} =
\
>>>> +static void bfq_clear_bfqq_##name(struct bfq_queue *bfqq) =
\
>>>> +{ =
\
>>>> + (bfqq)->flags &=3D ~(1 << BFQ_BFQQ_FLAG_##name); =
\
>>>> +} =
\
>>>> +static int bfq_bfqq_##name(const struct bfq_queue *bfqq) =
\
>>>> +{ =
\
>>>> + return ((bfqq)->flags & (1 << BFQ_BFQQ_FLAG_##name)) !=3D 0; =
\
>>>> +}
>>>=20
>>> Are the bodies of the above functions duplicates of __set_bit(),
>>> __clear_bit() and test_bit()?
>>=20
>> Yes. We wrapped them into functions, because writing mark_flag_name
>> seemed more readable than writing the implementation of the marking =
of the
>> flag.
>=20
> Please do not open-code __set_bit(), __clear_bit() and test_bit() but =
use
> these macros instead.
>=20
ok, as usual, I misunderstood, and thought you wanted me to remove
those macros altogether. I'll fix their bodies, sorry.
>>>> + } else
>>>> + /*
>>>> + * Async queues get always the maximum possible
>>>> + * budget, as for them we do not care about latency
>>>> + * (in addition, their ability to dispatch is limited
>>>> + * by the charging factor).
>>>> + */
>>>> + budget =3D bfqd->bfq_max_budget;
>>>> +
>>>=20
>>> Please balance braces. Checkpatch should have warned about the use =
of "}
>>> else" instead of "} else {".
>>=20
>> No warning, I guess because the body of the else contains only a
>> simple instruction. Just to learn for the future: what's the
>> rationale for adding braces here, but not imposing braces everywhere
>> for single-instruction bodies?
>=20
> It's a general style recommendation for all kernel code: if braces are =
used
> for one side of an if-statement, also use braces for the other side, =
and
> definitely if that other side consists of multiple lines due to a =
comment.
>=20
Ok, thanks for repeating this rule for me.
Thanks,
Paolo
> Bart.
^ permalink raw reply
* Re: [PATCH rfc 10/10] target: Use non-selective polling
From: Nicholas A. Bellinger @ 2017-03-18 23:58 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: linux-block, linux-rdma, target-devel, linux-nvme
In-Reply-To: <1489065402-14757-11-git-send-email-sagi@grimberg.me>
Hey Sagi,
On Thu, 2017-03-09 at 15:16 +0200, Sagi Grimberg wrote:
> It doesn't really make sense to do selective polling
> because we never care about specific IOs. Non selective
> polling can actually help by doing some useful work
> while we're submitting a command.
>
> We ask for a batch of (magic) 4 completions which looks
> like a decent network<->backend proportion, if less are
> available we'll see less.
>
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
> drivers/target/target_core_iblock.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
> index d316ed537d59..00726b6e51c4 100644
> --- a/drivers/target/target_core_iblock.c
> +++ b/drivers/target/target_core_iblock.c
> @@ -757,6 +757,7 @@ iblock_execute_rw(struct se_cmd *cmd, struct scatterlist *sgl, u32 sgl_nents,
> }
>
> iblock_submit_bios(&list);
> + blk_mq_poll_batch(bdev_get_queue(IBLOCK_DEV(dev)->ibd_bd), 4);
> iblock_complete_cmd(cmd);
> return 0;
>
Let's make 'batch' into a backend specific attribute so it can be
changed on-the-fly per device, instead of a hard-coded value.
Here's a quick patch to that end. Feel free to fold it into your
series.
Beyond that:
Acked-by: Nicholas Bellinger <nab@linux-iscsi.org>
>From c2c2a6185fc92132f61beb1708adbef9ea3995e9 Mon Sep 17 00:00:00 2001
From: Nicholas Bellinger <nab@linux-iscsi.org>
Date: Sat, 18 Mar 2017 20:29:05 +0000
Subject: [PATCH] target/iblock: Add poll_batch attribute
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
---
drivers/target/target_core_iblock.c | 65 ++++++++++++++++++++++++++++++++---
drivers/target/target_core_iblock.h | 2 ++
2 files changed, 63 insertions(+), 4 deletions(-)
diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
index 00726b6..9396a60 100644
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c
@@ -35,6 +35,7 @@
#include <linux/genhd.h>
#include <linux/file.h>
#include <linux/module.h>
+#include <linux/configfs.h>
#include <scsi/scsi_proto.h>
#include <asm/unaligned.h>
@@ -114,6 +115,7 @@ static int iblock_configure_device(struct se_device *dev)
goto out_free_bioset;
}
ib_dev->ibd_bd = bd;
+ ib_dev->poll_batch = IBLOCK_POLL_BATCH;
q = bdev_get_queue(bd);
@@ -757,7 +759,8 @@ static ssize_t iblock_show_configfs_dev_params(struct se_device *dev, char *b)
}
iblock_submit_bios(&list);
- blk_mq_poll_batch(bdev_get_queue(IBLOCK_DEV(dev)->ibd_bd), 4);
+ blk_mq_poll_batch(bdev_get_queue(IBLOCK_DEV(dev)->ibd_bd),
+ IBLOCK_DEV(dev)->poll_batch);
iblock_complete_cmd(cmd);
return 0;
@@ -840,7 +843,38 @@ static bool iblock_get_write_cache(struct se_device *dev)
return test_bit(QUEUE_FLAG_WC, &q->queue_flags);
}
-static const struct target_backend_ops iblock_ops = {
+static ssize_t iblock_poll_batch_show(struct config_item *item, char *page)
+{
+ struct se_dev_attrib *da = container_of(to_config_group(item),
+ struct se_dev_attrib, da_group);
+ struct iblock_dev *ibd = container_of(da->da_dev,
+ struct iblock_dev, dev);
+
+ return snprintf(page, PAGE_SIZE, "%u\n", ibd->poll_batch);
+}
+
+static ssize_t iblock_poll_batch_store(struct config_item *item, const char *page,
+ size_t count)
+{
+ struct se_dev_attrib *da = container_of(to_config_group(item),
+ struct se_dev_attrib, da_group);
+ struct iblock_dev *ibd = container_of(da->da_dev,
+ struct iblock_dev, dev);
+ u32 val;
+ int ret;
+
+ ret = kstrtou32(page, 0, &val);
+ if (ret < 0)
+ return ret;
+
+ ibd->poll_batch = val;
+ return count;
+}
+CONFIGFS_ATTR(iblock_, poll_batch);
+
+static struct configfs_attribute **iblock_attrs;
+
+static struct target_backend_ops iblock_ops = {
.name = "iblock",
.inquiry_prod = "IBLOCK",
.inquiry_rev = IBLOCK_VERSION,
@@ -860,17 +894,40 @@ static bool iblock_get_write_cache(struct se_device *dev)
.get_io_min = iblock_get_io_min,
.get_io_opt = iblock_get_io_opt,
.get_write_cache = iblock_get_write_cache,
- .tb_dev_attrib_attrs = sbc_attrib_attrs,
+ .tb_dev_attrib_attrs = NULL,
};
static int __init iblock_module_init(void)
{
- return transport_backend_register(&iblock_ops);
+ int rc, i, len = 0;
+
+ for (i = 0; sbc_attrib_attrs[i] != NULL; i++) {
+ len += sizeof(struct configfs_attribute *);
+ }
+ len += sizeof(struct configfs_attribute *) * 2;
+
+ iblock_attrs = kzalloc(len, GFP_KERNEL);
+ if (!iblock_attrs)
+ return -ENOMEM;
+
+ for (i = 0; sbc_attrib_attrs[i] != NULL; i++) {
+ iblock_attrs[i] = sbc_attrib_attrs[i];
+ }
+ iblock_attrs[i] = &iblock_attr_poll_batch;
+ iblock_ops.tb_dev_attrib_attrs = iblock_attrs;
+
+ rc = transport_backend_register(&iblock_ops);
+ if (rc) {
+ kfree(iblock_attrs);
+ return rc;
+ }
+ return 0;
}
static void __exit iblock_module_exit(void)
{
target_backend_unregister(&iblock_ops);
+ kfree(iblock_attrs);
}
MODULE_DESCRIPTION("TCM IBLOCK subsystem plugin");
diff --git a/drivers/target/target_core_iblock.h b/drivers/target/target_core_iblock.h
index 718d3fc..308b5c2 100644
--- a/drivers/target/target_core_iblock.h
+++ b/drivers/target/target_core_iblock.h
@@ -8,6 +8,7 @@
#define IBLOCK_MAX_CDBS 16
#define IBLOCK_LBA_SHIFT 9
+#define IBLOCK_POLL_BATCH 4
struct iblock_req {
atomic_t pending;
@@ -23,6 +24,7 @@ struct iblock_dev {
struct bio_set *ibd_bio_set;
struct block_device *ibd_bd;
bool ibd_readonly;
+ unsigned int poll_batch;
} ____cacheline_aligned;
#endif /* TARGET_CORE_IBLOCK_H */
--
1.7.10.4
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply related
* Re: [PATCH RFC 00/14] Add the BFQ I/O Scheduler to blk-mq
From: Linus Walleij @ 2017-03-18 20:46 UTC (permalink / raw)
To: Bart Van Assche
Cc: paolo.valente@linaro.org, linux-kernel@vger.kernel.org,
linux-block@vger.kernel.org, fchecconi@gmail.com, axboe@kernel.dk,
broonie@kernel.org, avanzini.arianna@gmail.com, tj@kernel.org,
ulf.hansson@linaro.org
In-Reply-To: <1489859192.2339.12.camel@sandisk.com>
On Sat, Mar 18, 2017 at 6:46 PM, Bart Van Assche
<Bart.VanAssche@sandisk.com> wrote:
> On Sat, 2017-03-18 at 18:09 +0100, Linus Walleij wrote:
>> On Sat, Mar 18, 2017 at 11:52 AM, Paolo Valente
>> <paolo.valente@linaro.org> wrote:
>> > > Il giorno 14 mar 2017, alle ore 16:32, Bart Van Assche <bart.vanassche@sandisk.com> ha scritto:
>> > > (...) what should
>> > > a developer do who only has access to a small subset of all the storage
>> > > devices that are supported by the Linux kernel and hence who can not run the
>> > > benchmark against every supported storage device?
>>
>> Don't we use the community for that? We are dependent on people
>> downloading and testing our code eventually, I mean sure it's good if
>> we make some reasonable effort to test changes we do, but we are
>> only humans, and we get corrected by the experience of other humans.
>
> Hello Linus,
>
> Do you mean relying on the community to test other storage devices before
> or after a patch is upstream?
I guess they should test it when it is in linux-next?
> Relying on the community to file bug reports
> after a patch is upstream would be wrong. The Linux kernel should not be
> used for experiments. As you know patches that are sent upstream should
> not introduce regressions.
Yeah still people introduce regressions and we have 7-8 release
candidates for each kernel because of this. Humans have flaws
I guess.
> My primary concern about BFQ is that it is a very complicated I/O scheduler
> and also that the concepts used internally in that I/O scheduler are far
> away from the concepts we are used to when reasoning about I/O devices. I'm
> concerned that this will make the BFQ I/O scheduler hard to maintain.
I understand that. Let's follow all rules of thumb that make code
easy to maintain.
We have pretty broad agreement on what makes code easy to
maintain on the syntactic level, checkpatch and some manual inspection
easily gives that.
I think where we need the most brainshare is in how to make semantics
maintainable. It's no fun reading terse code and try to figure out how
the developer writing it was thinking, so let's focus on anything we do not
understand and make it understandable, it seems Paolo is onto this task
for what I can tell.
Yours,
Linus Walleij
^ permalink raw reply
* Re: [Lsf] [RFC PATCH 0/4] blk-mq: multiqueue I/O scheduler
From: Martin K. Petersen @ 2017-03-18 20:18 UTC (permalink / raw)
To: Paolo Valente
Cc: linux-block, lsf, Omar Sandoval, kernel-team, Hannes Reinecke
In-Reply-To: <C191DA3B-3852-4304-A534-008088F97BA7@linaro.org>
Paolo Valente <paolo.valente@linaro.org> writes:
>> And actually, I have been looking at properly supporting host-wide tag
>> maps, so the concept of 'shallow' queue depth might be the way to go here.
>>
>> And as I have some other questions, too (assigning a new tag on
>> requeue?) it might be an idea to spread this out into a formal session
>> at LSF. If there are some slots left, that is.
>>
>
> I would be very interested in such a discussion too.
OK. Added.
--
Martin K. Petersen Oracle Linux Engineering
_______________________________________________
Lsf mailing list
Lsf@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/lsf
^ permalink raw reply
* Re: [PATCH RFC 00/14] Add the BFQ I/O Scheduler to blk-mq
From: Bart Van Assche @ 2017-03-18 17:46 UTC (permalink / raw)
To: paolo.valente@linaro.org, linus.walleij@linaro.org
Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
fchecconi@gmail.com, axboe@kernel.dk, broonie@kernel.org,
avanzini.arianna@gmail.com, tj@kernel.org, ulf.hansson@linaro.org
In-Reply-To: <CACRpkdZhA8FY2FH_fnTsnPotZoZwX5qUfhdavWLwfzcnLkZUMQ@mail.gmail.com>
T24gU2F0LCAyMDE3LTAzLTE4IGF0IDE4OjA5ICswMTAwLCBMaW51cyBXYWxsZWlqIHdyb3RlOg0K
PiBPbiBTYXQsIE1hciAxOCwgMjAxNyBhdCAxMTo1MiBBTSwgUGFvbG8gVmFsZW50ZQ0KPiA8cGFv
bG8udmFsZW50ZUBsaW5hcm8ub3JnPiB3cm90ZToNCj4gPiA+IElsIGdpb3JubyAxNCBtYXIgMjAx
NywgYWxsZSBvcmUgMTY6MzIsIEJhcnQgVmFuIEFzc2NoZSA8YmFydC52YW5hc3NjaGVAc2FuZGlz
ay5jb20+IGhhIHNjcml0dG86DQo+ID4gPiAoLi4uKSB3aGF0IHNob3VsZA0KPiA+ID4gYSBkZXZl
bG9wZXIgZG8gd2hvIG9ubHkgaGFzIGFjY2VzcyB0byBhIHNtYWxsIHN1YnNldCBvZiBhbGwgdGhl
IHN0b3JhZ2UNCj4gPiA+IGRldmljZXMgdGhhdCBhcmUgc3VwcG9ydGVkIGJ5IHRoZSBMaW51eCBr
ZXJuZWwgYW5kIGhlbmNlIHdobyBjYW4gbm90IHJ1biB0aGUNCj4gPiA+IGJlbmNobWFyayBhZ2Fp
bnN0IGV2ZXJ5IHN1cHBvcnRlZCBzdG9yYWdlIGRldmljZT8NCj4gDQo+IERvbid0IHdlIHVzZSB0
aGUgY29tbXVuaXR5IGZvciB0aGF0PyBXZSBhcmUgZGVwZW5kZW50IG9uIHBlb3BsZQ0KPiBkb3du
bG9hZGluZyBhbmQgdGVzdGluZyBvdXIgY29kZSBldmVudHVhbGx5LCBJIG1lYW4gc3VyZSBpdCdz
IGdvb2QgaWYNCj4gd2UgbWFrZSBzb21lIHJlYXNvbmFibGUgZWZmb3J0IHRvIHRlc3QgY2hhbmdl
cyB3ZSBkbywgYnV0IHdlIGFyZQ0KPiBvbmx5IGh1bWFucywgYW5kIHdlIGdldCBjb3JyZWN0ZWQg
YnkgdGhlIGV4cGVyaWVuY2Ugb2Ygb3RoZXIgaHVtYW5zLg0KDQpIZWxsbyBMaW51cywNCg0KRG8g
eW91IG1lYW4gcmVseWluZyBvbiB0aGUgY29tbXVuaXR5IHRvIHRlc3Qgb3RoZXIgc3RvcmFnZSBk
ZXZpY2VzIGJlZm9yZQ0Kb3IgYWZ0ZXIgYSBwYXRjaCBpcyB1cHN0cmVhbT8gUmVseWluZyBvbiB0
aGUgY29tbXVuaXR5IHRvIGZpbGUgYnVnIHJlcG9ydHMNCmFmdGVyIGEgcGF0Y2ggaXMgdXBzdHJl
YW0gd291bGQgYmUgd3JvbmcuIFRoZSBMaW51eCBrZXJuZWwgc2hvdWxkIG5vdCBiZQ0KdXNlZCBm
b3IgZXhwZXJpbWVudHMuIEFzIHlvdSBrbm93IHBhdGNoZXMgdGhhdCBhcmUgc2VudCB1cHN0cmVh
bSBzaG91bGQNCm5vdCBpbnRyb2R1Y2UgcmVncmVzc2lvbnMuDQoNCk15IHByaW1hcnkgY29uY2Vy
biBhYm91dCBCRlEgaXMgdGhhdCBpdCBpcyBhIHZlcnkgY29tcGxpY2F0ZWQgSS9PIHNjaGVkdWxl
cg0KYW5kIGFsc28gdGhhdCB0aGUgY29uY2VwdHMgdXNlZCBpbnRlcm5hbGx5IGluIHRoYXQgSS9P
IHNjaGVkdWxlciBhcmUgZmFyDQphd2F5IGZyb20gdGhlIGNvbmNlcHRzIHdlIGFyZSB1c2VkIHRv
IHdoZW4gcmVhc29uaW5nIGFib3V0IEkvTyBkZXZpY2VzLiBJJ20NCmNvbmNlcm5lZCB0aGF0IHRo
aXMgd2lsbCBtYWtlIHRoZSBCRlEgSS9PIHNjaGVkdWxlciBoYXJkIHRvIG1haW50YWluLg0KDQpC
YXJ0Lg==
^ permalink raw reply
* Re: [PATCH RFC 00/14] Add the BFQ I/O Scheduler to blk-mq
From: Linus Walleij @ 2017-03-18 17:09 UTC (permalink / raw)
To: Paolo Valente
Cc: Bart Van Assche, linux-kernel@vger.kernel.org,
linux-block@vger.kernel.org, fchecconi@gmail.com, axboe@kernel.dk,
Arianna Avanzini, broonie@kernel.org, tj@kernel.org,
ulf.hansson@linaro.org
In-Reply-To: <A8A05E9B-2FDF-4986-833E-166D92DFEF4F@linaro.org>
On Sat, Mar 18, 2017 at 11:52 AM, Paolo Valente
<paolo.valente@linaro.org> wrote:
>> Il giorno 14 mar 2017, alle ore 16:32, Bart Van Assche <bart.vanassche@sandisk.com> ha scritto:
>> (...) what should
>> a developer do who only has access to a small subset of all the storage
>> devices that are supported by the Linux kernel and hence who can not run the
>> benchmark against every supported storage device?
Don't we use the community for that? We are dependent on people
downloading and testing our code eventually, I mean sure it's good if
we make some reasonable effort to test changes we do, but we are
only humans, and we get corrected by the experience of other humans.
>> Do developers who do not
>> fully understand the BFQ algorithms and who run into a performance problem
>> have any other option than trial and error for fixing such performance issues?
>
> Hi Bart,
> maybe I got your point even before, but I did not reply consistently.
> You are highlighting an important problem, which, I think, can be
> stated in more general terms: if one makes a change in any complex
> component, which, in its turn, interacts with complex I/O devices,
> then it is hard, if ever possible, to prove, that that change will
> cause no regression with any possible device, just by speculation.
> Actually, facts show that this often holds even for simple components,
> given the complexity of the environment in which they work. Of
> course, if not only the component is complex, but who modifies it does
> not even fully understand how that component works, then regressions
> on untested devices are certainly more probable.
You are running a host of benchmarks on a host of devices, using
the fio tool that Jens devised for this kind of tests. What more can be
asked? More tests, more devices?
If you increase the amount of proof that is requested for any change
to any computer program not to cause unintended side effects or
regressions, you will eventually end up with the brick wall
"solve the halting problem".
Alternatively "test it forever on all systems in the world".
It eventually becomes absurd.
This actually occurred to me .. in a certain mission-critical algorithm
my department was requested to "prove that this will run to completion".
I was baffled and said that what they were requesting was that I
solve the halting problem. It turned out they just wanted something like
a comprehensible test suite.
Yours,
Linus Walleij
^ permalink raw reply
* Re: [Lsf] [RFC PATCH 0/4] blk-mq: multiqueue I/O scheduler
From: Bart Van Assche @ 2017-03-18 15:30 UTC (permalink / raw)
To: osandov@osandov.com, hare@suse.de, linux-block@vger.kernel.org
Cc: lsf@lists.linux-foundation.org, kernel-team@fb.com
In-Reply-To: <46f16799-86d9-7bf5-6346-be08801dd48e@suse.de>
T24gU2F0LCAyMDE3LTAzLTE4IGF0IDEyOjE0ICswMTAwLCBIYW5uZXMgUmVpbmVja2Ugd3JvdGU6
DQo+IE9uIDAzLzE3LzIwMTcgMTE6MDMgUE0sIE9tYXIgU2FuZG92YWwgd3JvdGU6DQo+ID4gRnJv
bTogT21hciBTYW5kb3ZhbCA8b3NhbmRvdkBmYi5jb20+DQo+ID4gDQo+ID4gVGhpcyBwYXRjaCBz
ZXJpZXMgaW1wbGVtZW50cyBhbiBJL08gc2NoZWR1bGVyIGZvciBtdWx0aXF1ZXVlIGRldmljZXMN
Cj4gPiBjb21iaW5pbmcgc2V2ZXJhbCB0ZWNobmlxdWVzOiB0aGUgc2NhbGFibGUgYml0bWFwIGxp
YnJhcnksIHRoZSBuZXcNCj4gPiBibGstc3RhdHMgQVBJLCBhbmQgcXVldWUgZGVwdGggdGhyb3R0
bGluZyBzaW1pbGFyIHRvIGJsay13YnQuDQo+ID4gDQo+ID4gVGhlc2UgcGF0Y2hlcyBhcmUgb24g
dG9wIG9mIG15IGVhcmxpZXIgYmxrLXN0YXRzIHNlcmllcyBbMV0uIFRoZXkgYWxzbw0KPiA+IG5l
ZWQgYSBmaXggaW4gSmVucycgZm9yLWxpbnVzIGJyYW5jaCBpbiBvcmRlciB0byB3b3JrIHByb3Bl
cmx5IFsyXS4NCj4gPiANCj4gPiBQYXRjaGVzIDEgYW5kIDIgaW1wbGVtZW50IGEgbmV3IHNiaXRt
YXAgb3BlcmF0aW9uIGFuZCBwYXRjaCAzIGV4cG9ydHMgYQ0KPiA+IHJlcXVpcmVkIGZ1bmN0aW9u
LiBQYXRjaCA0IGltcGxlbWVudHMgdGhlIG5ldyBzY2hlZHVsZXIsIG5hbWVkIEt5YmVyLg0KPiA+
IA0KPiA+IFRoZSBjb21taXQgbWVzc2FnZSBpbiBwYXRjaCA0IGRlc2NyaWJlcyBLeWJlciBpbiBk
ZXRhaWwuIFRoZSBzY2hlZHVsZXINCj4gPiBlbXBsb3lzIHNvbWUgaGV1cmlzdGljcyB0aGF0IEkn
dmUgZXhwZXJpbWVudGVkIHdpdGgsIGJ1dCB0aGF0J3MgcHJvYmFibHkNCj4gPiB0aGUgYXJlYSB0
aGF0IG5lZWRzIHRoZSBtb3N0IHdvcmsuIEknbGwgYmUgYXQgTFNGL01NIG5leHQgd2VlaywgYW5k
DQo+ID4gdGhlcmUncyBjdXJyZW50bHkgYSBsaWdodG5pbmcgdGFsayBvbiB0aGUgc2NoZWR1bGUg
dG8gZGlzY3Vzcw0KPiA+IGltcHJvdmVtZW50cyBhbmQgZXh0ZW5zaW9ucy4NCj4gPiANCj4gDQo+
IFsgLi4gXQ0KPiANCj4gQSBzaWxlbnQgcm91bmQgb2YgYXBwbGF1c2UgZm9yIHlvdS4NCj4gVmVy
eSBtdWNoIGFwcHJlY2lhdGVkLg0KPiANCj4gQW5kIGFjdHVhbGx5LCBJIGhhdmUgYmVlbiBsb29r
aW5nIGF0IHByb3Blcmx5IHN1cHBvcnRpbmcgaG9zdC13aWRlIHRhZw0KPiBtYXBzLCBzbyB0aGUg
Y29uY2VwdCBvZiAnc2hhbGxvdycgcXVldWUgZGVwdGggbWlnaHQgYmUgdGhlIHdheSB0byBnbyBo
ZXJlLg0KPiANCj4gQW5kIGFzIEkgaGF2ZSBzb21lIG90aGVyIHF1ZXN0aW9ucywgdG9vIChhc3Np
Z25pbmcgYSBuZXcgdGFnIG9uDQo+IHJlcXVldWU/KSBpdCBtaWdodCBiZSBhbiBpZGVhIHRvIHNw
cmVhZCB0aGlzIG91dCBpbnRvIGEgZm9ybWFsIHNlc3Npb24NCj4gYXQgTFNGLiBJZiB0aGVyZSBh
cmUgc29tZSBzbG90cyBsZWZ0LCB0aGF0IGlzLg0KDQpJZiBubyBzbG90IGNhbiBiZSBmb3VuZCwg
aG93IGFib3V0IG9yZ2FuaXppbmcgYW4gZXZlbmluZyBzZXNzaW9uIGFib3V0IHRoaXMNCm5ldyBz
Y2hlZHVsZXI/DQoNCkJhcnQu
^ permalink raw reply
* Re: [PATCH RFC 01/14] block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler
From: Bart Van Assche @ 2017-03-18 15:24 UTC (permalink / raw)
To: paolo.valente@linaro.org
Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
fchecconi@gmail.com, linus.walleij@linaro.org, axboe@kernel.dk,
avanzini.arianna@gmail.com, broonie@kernel.org, tj@kernel.org,
ulf.hansson@linaro.org
In-Reply-To: <0B1F1CFD-C8AB-442D-B5C6-426D19B1FBA3@linaro.org>
T24gU2F0LCAyMDE3LTAzLTE4IGF0IDA4OjA4IC0wNDAwLCBQYW9sbyBWYWxlbnRlIHdyb3RlOg0K
PiA+IElsIGdpb3JubyAwNiBtYXIgMjAxNywgYWxsZSBvcmUgMTQ6NDAsIEJhcnQgVmFuIEFzc2No
ZSA8YmFydC52YW5hc3NjaGVAc2FuZGlzay5jb20+IGhhIHNjcml0dG86DQo+ID4gPiArI2RlZmlu
ZSBCRlFfQkZRUV9GTlMobmFtZSkJCQkJCQlcDQo+ID4gPiArc3RhdGljIHZvaWQgYmZxX21hcmtf
YmZxcV8jI25hbWUoc3RydWN0IGJmcV9xdWV1ZSAqYmZxcSkJCVwNCj4gPiA+ICt7CQkJCQkJCQkJ
XA0KPiA+ID4gKwkoYmZxcSktPmZsYWdzIHw9ICgxIDw8IEJGUV9CRlFRX0ZMQUdfIyNuYW1lKTsJ
CQlcDQo+ID4gPiArfQkJCQkJCQkJCVwNCj4gPiA+ICtzdGF0aWMgdm9pZCBiZnFfY2xlYXJfYmZx
cV8jI25hbWUoc3RydWN0IGJmcV9xdWV1ZSAqYmZxcSkJCVwNCj4gPiA+ICt7CQkJCQkJCQkJXA0K
PiA+ID4gKwkoYmZxcSktPmZsYWdzICY9IH4oMSA8PCBCRlFfQkZRUV9GTEFHXyMjbmFtZSk7CQkJ
XA0KPiA+ID4gK30JCQkJCQkJCQlcDQo+ID4gPiArc3RhdGljIGludCBiZnFfYmZxcV8jI25hbWUo
Y29uc3Qgc3RydWN0IGJmcV9xdWV1ZSAqYmZxcSkJCVwNCj4gPiA+ICt7CQkJCQkJCQkJXA0KPiA+
ID4gKwlyZXR1cm4gKChiZnFxKS0+ZmxhZ3MgJiAoMSA8PCBCRlFfQkZRUV9GTEFHXyMjbmFtZSkp
ICE9IDA7CVwNCj4gPiA+ICt9DQo+ID4gDQo+ID4gQXJlIHRoZSBib2RpZXMgb2YgdGhlIGFib3Zl
IGZ1bmN0aW9ucyBkdXBsaWNhdGVzIG9mIF9fc2V0X2JpdCgpLA0KPiA+IF9fY2xlYXJfYml0KCkg
YW5kIHRlc3RfYml0KCk/DQo+IA0KPiBZZXMuICBXZSB3cmFwcGVkIHRoZW0gaW50byBmdW5jdGlv
bnMsIGJlY2F1c2Ugd3JpdGluZyBtYXJrX2ZsYWdfbmFtZQ0KPiBzZWVtZWQgbW9yZSByZWFkYWJs
ZSB0aGFuIHdyaXRpbmcgdGhlIGltcGxlbWVudGF0aW9uIG9mIHRoZSBtYXJraW5nIG9mIHRoZQ0K
PiBmbGFnLg0KDQpQbGVhc2UgZG8gbm90IG9wZW4tY29kZSBfX3NldF9iaXQoKSwgX19jbGVhcl9i
aXQoKSBhbmQgdGVzdF9iaXQoKSBidXQgdXNlDQp0aGVzZSBtYWNyb3MgaW5zdGVhZC4NCg0KPiA+
ID4gKwl9IGVsc2UNCj4gPiA+ICsJCS8qDQo+ID4gPiArCQkgKiBBc3luYyBxdWV1ZXMgZ2V0IGFs
d2F5cyB0aGUgbWF4aW11bSBwb3NzaWJsZQ0KPiA+ID4gKwkJICogYnVkZ2V0LCBhcyBmb3IgdGhl
bSB3ZSBkbyBub3QgY2FyZSBhYm91dCBsYXRlbmN5DQo+ID4gPiArCQkgKiAoaW4gYWRkaXRpb24s
IHRoZWlyIGFiaWxpdHkgdG8gZGlzcGF0Y2ggaXMgbGltaXRlZA0KPiA+ID4gKwkJICogYnkgdGhl
IGNoYXJnaW5nIGZhY3RvcikuDQo+ID4gPiArCQkgKi8NCj4gPiA+ICsJCWJ1ZGdldCA9IGJmcWQt
PmJmcV9tYXhfYnVkZ2V0Ow0KPiA+ID4gKw0KPiA+IA0KPiA+IFBsZWFzZSBiYWxhbmNlIGJyYWNl
cy4gQ2hlY2twYXRjaCBzaG91bGQgaGF2ZSB3YXJuZWQgYWJvdXQgdGhlIHVzZSBvZiAifQ0KPiA+
IGVsc2UiIGluc3RlYWQgb2YgIn0gZWxzZSB7Ii4NCj4gDQo+IE5vIHdhcm5pbmcsIEkgZ3Vlc3Mg
YmVjYXVzZSB0aGUgYm9keSBvZiB0aGUgZWxzZSBjb250YWlucyBvbmx5IGENCj4gc2ltcGxlIGlu
c3RydWN0aW9uLiAgSnVzdCB0byBsZWFybiBmb3IgdGhlIGZ1dHVyZTogd2hhdCdzIHRoZQ0KPiBy
YXRpb25hbGUgZm9yIGFkZGluZyBicmFjZXMgaGVyZSwgYnV0IG5vdCBpbXBvc2luZyBicmFjZXMg
ZXZlcnl3aGVyZQ0KPiBmb3Igc2luZ2xlLWluc3RydWN0aW9uIGJvZGllcz8NCg0KSXQncyBhIGdl
bmVyYWwgc3R5bGUgcmVjb21tZW5kYXRpb24gZm9yIGFsbCBrZXJuZWwgY29kZTogaWYgYnJhY2Vz
IGFyZSB1c2VkDQpmb3Igb25lIHNpZGUgb2YgYW4gaWYtc3RhdGVtZW50LCBhbHNvIHVzZSBicmFj
ZXMgZm9yIHRoZSBvdGhlciBzaWRlLCBhbmQNCmRlZmluaXRlbHkgaWYgdGhhdCBvdGhlciBzaWRl
IGNvbnNpc3RzIG9mIG11bHRpcGxlIGxpbmVzIGR1ZSB0byBhIGNvbW1lbnQuDQoNCkJhcnQu
^ permalink raw reply
* Re: [PATCH RFC 01/14] block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler
From: Paolo Valente @ 2017-03-18 12:41 UTC (permalink / raw)
To: Jens Axboe
Cc: Tejun Heo, Fabio Checconi, Arianna Avanzini, linux-block,
Linux-Kernal, Ulf Hansson, Linus Walleij, broonie
In-Reply-To: <9ddb9586-d97e-e7b2-0081-216521f7be76@kernel.dk>
> Il giorno 07 mar 2017, alle ore 18:22, Jens Axboe <axboe@kernel.dk> ha =
scritto:
>=20
>> +/**
>> + * bfq_entity_of - get an entity from a node.
>> + * @node: the node field of the entity.
>> + *
>> + * Convert a node pointer to the relative entity. This is used only
>> + * to simplify the logic of some functions and not as the generic
>> + * conversion mechanism because, e.g., in the tree walking =
functions,
>> + * the check for a %NULL value would be redundant.
>> + */
>> +static struct bfq_entity *bfq_entity_of(struct rb_node *node)
>> +{
>> + struct bfq_entity *entity =3D NULL;
>> +
>> + if (node)
>> + entity =3D rb_entry(node, struct bfq_entity, rb_node);
>> +
>> + return entity;
>> +}
>=20
> Get rid of pointless wrappers like this, just use rb_entry() in the
> caller. It's harmful to the readability of the code to have to lookup
> things like this, if it's a list_entry or rb_entry() in the caller you
> know exactly what is going on immediately.
>=20
Ok. Just a quick request for help about this. The code seems to become
quite ugly with a verbatim replacement of this function with its body
(and there are several occurrences of it), so there is probably =
something I'm
missing. For example, given the following loop:
for (; entity ; entity =3D bfq_entity_of(rb_first(active)))
bfq_reparent_leaf_entity(bfqd, entity);
the only non-cryptic, but heavier solution I see is:
for (; entity ; ) {
bfq_reparent_leaf_entity(bfqd, entity);
if (rb_first(active))
entity =3D rb_entry(rb_first(active), struct =
bfq_entity, rb_node);
else
entity =3D NULL;
}
Am I missing something? Or are these extra lines of code a reasonable
price to pay for increased transparency?
Thanks,
Paolo
> --=20
> Jens Axboe
>=20
^ permalink raw reply
* Re: [PATCH RFC 01/14] block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler
From: Paolo Valente @ 2017-03-18 12:08 UTC (permalink / raw)
To: Bart Van Assche
Cc: Jens Axboe, Tejun Heo, Fabio Checconi, Arianna Avanzini,
linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
ulf.hansson@linaro.org, linus.walleij@linaro.org,
broonie@kernel.org
In-Reply-To: <1D08B61A9CF0974AA09887BE32D889DA0DA145@ULS-OP-MBXIP03.sdcorp.global.sandisk.com>
> Il giorno 06 mar 2017, alle ore 14:40, Bart Van Assche =
<bart.vanassche@sandisk.com> ha scritto:
>=20
> On 03/04/2017 08:01 AM, Paolo Valente wrote:
>> BFQ is a proportional-share I/O scheduler, whose general structure,
>> plus a lot of code, are borrowed from CFQ.
>> [ ... ]
>=20
Hi Bart,
I'll reply below only to the recommendations for which I have some
doubt/confusion, while I'll just apply all the others.
> This description is very useful. However, since it is identical to the
> description this patch adds to Documentation/block/bfq-iosched.txt I
> propose to leave it out from the patch description.
>=20
Actually I repeated this information just because avoidable
indirections are usually annoying for relatively small amount of
information of immediate interest. I mean, isn't it annoying, for who
reads the log, to have to look for a documentation file, instead of
having all information ready? Of course, you are the reader in this
case, so if you believe it is better to remove this information, I'll
just do it.
> What seems missing to me is an overview of the limitations of BFQ. =
Does
> BFQ e.g. support multiple hardware queues?
>=20
>> +3. What are BFQ's tunable?
>> +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D
>> +[ ... ]
>=20
> A thorough knowledge of BFQ is required to tune it properly. Users =
don't
> want to tune I/O schedulers. Has it been considered to invent =
algorithms
> to tune these parameters automatically?
>=20
Here you are certainly referring to the tuning/debugging knobs that I
forgot to remove. As for the other parameters, they are the same
long-standing parameters of cfq, plus one called strict_guarantees.
You set the latter if you do care only about latency and fairness, =
despite
any throughput loss.
>=20
>> +
>> +/**
>> + * struct bfq_data - per-device data structure.
>> + *
>> + * All the fields are protected by @lock.
>> + */
>> +struct bfq_data {
>> + /* device request queue */
>> + struct request_queue *queue;
>> + [ ... ]
>> +
>> + /* on-disk position of the last served request */
>> + sector_t last_position;
>=20
> What is the relevance of last_position if there are multiple hardware
> queues? Will the BFQ algorithm fail to realize its guarantees in that =
case?
>=20
> What is the relevance of this structure member for block devices that
> have multiple spindles, e.g. arrays of hard disks?
>=20
Thanks for highlighting this interesting point. The logical position
of the last request served is used only to boost throughput. It has
worked with all the single-queue devices I have tested so far, for
evident locality. I've never used BFQ with a multi-queue device, so
honestly I don't know how this would perform with such devices. I
will check it when/if we'll consider BFQ for multi-queue devices too.
However, with those devices I expect many other important issues too.
>=20
>> +#define BFQ_BFQQ_FNS(name) =
\
>> +static void bfq_mark_bfqq_##name(struct bfq_queue *bfqq) =
\
>> +{ =
\
>> + (bfqq)->flags |=3D (1 << BFQ_BFQQ_FLAG_##name); =
\
>> +} =
\
>> +static void bfq_clear_bfqq_##name(struct bfq_queue *bfqq) =
\
>> +{ =
\
>> + (bfqq)->flags &=3D ~(1 << BFQ_BFQQ_FLAG_##name); =
\
>> +} =
\
>> +static int bfq_bfqq_##name(const struct bfq_queue *bfqq) =
\
>> +{ =
\
>> + return ((bfqq)->flags & (1 << BFQ_BFQQ_FLAG_##name)) !=3D 0; =
\
>> +}
>=20
> Are the bodies of the above functions duplicates of __set_bit(),
> __clear_bit() and test_bit()?
Yes. We wrapped them into functions, because writing mark_flag_name
seemed more readable than writing the implementation of the marking of =
the
flag.
>> +/* Maximum backwards seek, in KiB. */
>> +static const int bfq_back_max =3D 16 * 1024;
>=20
> Where does this constant come from? Should it depend on geometry data
> like e.g. the number of sectors in a cylinder?
>=20
Yes. These constants are just taken from long-standing
(non-documented, if this is your main, right concern) code. All I can
say is that, according to all tests done so far, they work, that is,
BFQ achieves peak throughput where these constants are concerned.
>> +#define for_each_entity(entity) \
>> + for (; entity ; entity =3D NULL)
>=20
> Why has this confusing #define been introduced? Shouldn't all
> occurrences of this macro be changed into the equivalent "if =
(entity)"?
> We don't want silly macros like this in the Linux kernel.
>=20
This sort of empty variant of the macro is for the case where
CONFIG_BFQ_GROUP_IOSCHED is not defined. In that case, the loop does
perform just one iteration. I will add a comment to make this clearer.
>> +#define for_each_entity_safe(entity, parent) \
>> + for (parent =3D NULL; entity ; entity =3D parent)
>=20
> Same question here - why has this macro been introduced and how has =
its
> name been chosen? Since this macro is used only once and since no =
value
> is assigned to 'parent' in the code controlled by this construct, =
please
> remove this macro and use something that is less confusing than a =
"for"
> loop for something that is not a loop.
>=20
Same as above. I'll add a comment on that.
>=20
>> +
>> +/**
>> + * bfq_active_extract - remove an entity from the active tree.
>> + * @st: the service_tree containing the tree.
>> + * @entity: the entity being removed.
>> + */
>> +static void bfq_active_extract(struct bfq_service_tree *st,
>> + struct bfq_entity *entity)
>> +{
>> + struct bfq_queue *bfqq =3D bfq_entity_to_bfqq(entity);
>> + struct rb_node *node;
>> +
>> + node =3D bfq_find_deepest(&entity->rb_node);
>> + bfq_extract(&st->active, entity);
>> +
>> + if (node)
>> + bfq_update_active_tree(node);
>> +
>> + if (bfqq)
>> + list_del(&bfqq->bfqq_list);
>> +}
>=20
> Which locks protect the data structures manipulated by this and other
> functions? Have you considered to use lockdep_assert_held() to =
document
> these assumptions?
>=20
It's the scheduler lock. Actually, probably 99% of the functions are
protected by that lock. I imagine that adding lockdep_assert_held()
everywhere would be absurd. Is there a particular reason why you
would expect it used in some specific functions?
>> + case (BFQ_RQ1_WRAP|BFQ_RQ2_WRAP): /* both rqs wrapped */
>=20
> Please don't use parentheses if no confusion is possible. =
Additionally,
> checkpatch should have requested you to insert a space before and =
after
> the logical or operator.
>=20
No complain from checkpatch actually, and also in this case
long-standing code, copied verbatim from CFQ. I'll remove the
parentheses.
>> +static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
>> + struct bfq_queue *bfqq)
>> +{
>> + if (bfqq) {
>> + bfq_mark_bfqq_budget_new(bfqq);
>> + bfq_clear_bfqq_fifo_expire(bfqq);
>> +
>> + bfqd->budgets_assigned =3D (bfqd->budgets_assigned*7 + =
256) / 8;
>=20
> Checkpatch should have asked you to insert spaces around the
> multiplication operator.
>=20
It has not, we've used no space for the multiplication, to make the
order of operations clearer. I'll add that space.
>=20
>> + } else
>> + /*
>> + * Async queues get always the maximum possible
>> + * budget, as for them we do not care about latency
>> + * (in addition, their ability to dispatch is limited
>> + * by the charging factor).
>> + */
>> + budget =3D bfqd->bfq_max_budget;
>> +
>=20
> Please balance braces. Checkpatch should have warned about the use of =
"}
> else" instead of "} else {".
>=20
No warning, I guess because the body of the else contains only a
simple instruction. Just to learn for the future: what's the
rationale for adding braces here, but not imposing braces everywhere
for single-instruction bodies?
>=20
>> +static int __init bfq_init(void)
>> +{
>> + int ret;
>> + char msg[50] =3D "BFQ I/O-scheduler: v0";
>=20
> Please leave out "[50]" and use "static const char" instead of "char".
>=20
msg is not constant, and is larger that the string literal used to
initialize it, because another piece is appended to msg if cgroups
support is enabled. Any better solution is of course welcome.
Thanks again for your thorough review,
Paolo=
^ permalink raw reply
* Re: [PATCH v1 3/3] blk-mq: start to freeze queue just after setting dying
From: Hannes Reinecke @ 2017-03-18 11:27 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-kernel, linux-block,
Christoph Hellwig
Cc: Yi Zhang, Bart Van Assche, Tejun Heo
In-Reply-To: <20170317095711.5819-4-tom.leiming@gmail.com>
On 03/17/2017 10:57 AM, Ming Lei wrote:
> Before commit 780db2071a(blk-mq: decouble blk-mq freezing
> from generic bypassing), the dying flag is checked before
> entering queue, and Tejun converts the checking into .mq_freeze_depth,
> and assumes the counter is increased just after dying flag
> is set. Unfortunately we doesn't do that in blk_set_queue_dying().
>
> This patch calls blk_mq_freeze_queue_start() for blk-mq in
> blk_set_queue_dying(), so that we can block new I/O coming
> once the queue is set as dying.
>
> Given blk_set_queue_dying() is always called in remove path
> of block device, and queue will be cleaned up later, we don't
> need to worry about undoing the counter.
>
> Cc: Bart Van Assche <bart.vanassche@sandisk.com>
> Cc: Tejun Heo <tj@kernel.org>
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> ---
> block/blk-core.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
hare@suse.de +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 N�rnberg
GF: J. Hawn, J. Guild, F. Imend�rffer, HRB 16746 (AG N�rnberg)
^ permalink raw reply
* Re: [PATCH v1 2/3] blk-mq: comment on races related with timeout handler
From: Hannes Reinecke @ 2017-03-18 11:25 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-kernel, linux-block,
Christoph Hellwig
Cc: Yi Zhang, Bart Van Assche
In-Reply-To: <20170317095711.5819-3-tom.leiming@gmail.com>
On 03/17/2017 10:57 AM, Ming Lei wrote:
> This patch adds comment on two races related with
> timeout handler:
>
> - requeue from queue busy vs. timeout
> - rq free & reallocation vs. timeout
>
> Both the races themselves and current solution aren't
> explicit enough, so add comments on them.
>
> Cc: Bart Van Assche <bart.vanassche@sandisk.com>
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> ---
> block/blk-mq.c | 22 ++++++++++++++++++++++
> 1 file changed, 22 insertions(+)
>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
hare@suse.de +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 N�rnberg
GF: J. Hawn, J. Guild, F. Imend�rffer, HRB 16746 (AG N�rnberg)
^ permalink raw reply
* Re: [PATCH v1 1/3] blk-mq: don't complete un-started request in timeout handler
From: Hannes Reinecke @ 2017-03-18 11:24 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-kernel, linux-block,
Christoph Hellwig
Cc: Yi Zhang, Bart Van Assche, stable
In-Reply-To: <20170317095711.5819-2-tom.leiming@gmail.com>
On 03/17/2017 10:57 AM, Ming Lei wrote:
> When iterating busy requests in timeout handler,
> if the STARTED flag of one request isn't set, that means
> the request is being processed in block layer or driver, and
> isn't dispatch to hardware yet.
>
> In current implementation of blk_mq_check_expired(),
> if the request queue becomes dying, un-started requests are
> handled as being completed/freed immediately. This way is
> wrong, and can cause use-after-free issue easily[1][2], when
> doing I/O and removing&resetting NVMe device at the sametime.
>
> This patch fixes several issues reported by Yi Zhang.
>
> [1]. oops log 1
> [ 581.789754] ------------[ cut here ]------------
> [ 581.789758] kernel BUG at block/blk-mq.c:374!
> [ 581.789760] invalid opcode: 0000 [#1] SMP
> [ 581.789761] Modules linked in: vfat fat ipmi_ssif intel_rapl sb_edac
> edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvme
> irqbypass crct10dif_pclmul nvme_core crc32_pclmul ghash_clmulni_intel
> intel_cstate ipmi_si mei_me ipmi_devintf intel_uncore sg ipmi_msghandler
> intel_rapl_perf iTCO_wdt mei iTCO_vendor_support mxm_wmi lpc_ich dcdbas shpchp
> pcspkr acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd dm_multipath grace
> sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper
> syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci
> crc32c_intel tg3 libata megaraid_sas i2c_core ptp fjes pps_core dm_mirror
> dm_region_hash dm_log dm_mod
> [ 581.789796] CPU: 1 PID: 1617 Comm: kworker/1:1H Not tainted 4.10.0.bz1420297+ #4
> [ 581.789797] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
> [ 581.789804] Workqueue: kblockd blk_mq_timeout_work
> [ 581.789806] task: ffff8804721c8000 task.stack: ffffc90006ee4000
> [ 581.789809] RIP: 0010:blk_mq_end_request+0x58/0x70
> [ 581.789810] RSP: 0018:ffffc90006ee7d50 EFLAGS: 00010202
> [ 581.789811] RAX: 0000000000000001 RBX: ffff8802e4195340 RCX: ffff88028e2f4b88
> [ 581.789812] RDX: 0000000000001000 RSI: 0000000000001000 RDI: 0000000000000000
> [ 581.789813] RBP: ffffc90006ee7d60 R08: 0000000000000003 R09: ffff88028e2f4b00
> [ 581.789814] R10: 0000000000001000 R11: 0000000000000001 R12: 00000000fffffffb
> [ 581.789815] R13: ffff88042abe5780 R14: 000000000000002d R15: ffff88046fbdff80
> [ 581.789817] FS: 0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
> [ 581.789818] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 581.789819] CR2: 00007f64f403a008 CR3: 000000014d078000 CR4: 00000000001406e0
> [ 581.789820] Call Trace:
> [ 581.789825] blk_mq_check_expired+0x76/0x80
> [ 581.789828] bt_iter+0x45/0x50
> [ 581.789830] blk_mq_queue_tag_busy_iter+0xdd/0x1f0
> [ 581.789832] ? blk_mq_rq_timed_out+0x70/0x70
> [ 581.789833] ? blk_mq_rq_timed_out+0x70/0x70
> [ 581.789840] ? __switch_to+0x140/0x450
> [ 581.789841] blk_mq_timeout_work+0x88/0x170
> [ 581.789845] process_one_work+0x165/0x410
> [ 581.789847] worker_thread+0x137/0x4c0
> [ 581.789851] kthread+0x101/0x140
> [ 581.789853] ? rescuer_thread+0x3b0/0x3b0
> [ 581.789855] ? kthread_park+0x90/0x90
> [ 581.789860] ret_from_fork+0x2c/0x40
> [ 581.789861] Code: 48 85 c0 74 0d 44 89 e6 48 89 df ff d0 5b 41 5c 5d c3 48
> 8b bb 70 01 00 00 48 85 ff 75 0f 48 89 df e8 7d f0 ff ff 5b 41 5c 5d c3 <0f>
> 0b e8 71 f0 ff ff 90 eb e9 0f 1f 40 00 66 2e 0f 1f 84 00 00
> [ 581.789882] RIP: blk_mq_end_request+0x58/0x70 RSP: ffffc90006ee7d50
> [ 581.789889] ---[ end trace bcaf03d9a14a0a70 ]---
>
> [2]. oops log2
> [ 6984.857362] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
> [ 6984.857372] IP: nvme_queue_rq+0x6e6/0x8cd [nvme]
> [ 6984.857373] PGD 0
> [ 6984.857374]
> [ 6984.857376] Oops: 0000 [#1] SMP
> [ 6984.857379] Modules linked in: ipmi_ssif vfat fat intel_rapl sb_edac
> edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
> irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si iTCO_wdt
> iTCO_vendor_support mxm_wmi ipmi_devintf intel_cstate sg dcdbas intel_uncore
> mei_me intel_rapl_perf mei pcspkr lpc_ich ipmi_msghandler shpchp
> acpi_power_meter wmi nfsd auth_rpcgss dm_multipath nfs_acl lockd grace sunrpc
> ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea
> sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm nvme drm nvme_core ahci
> libahci i2c_core tg3 libata ptp megaraid_sas pps_core fjes dm_mirror
> dm_region_hash dm_log dm_mod
> [ 6984.857416] CPU: 7 PID: 1635 Comm: kworker/7:1H Not tainted
> 4.10.0-2.el7.bz1420297.x86_64 #1
> [ 6984.857417] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
> [ 6984.857427] Workqueue: kblockd blk_mq_run_work_fn
> [ 6984.857429] task: ffff880476e3da00 task.stack: ffffc90002e90000
> [ 6984.857432] RIP: 0010:nvme_queue_rq+0x6e6/0x8cd [nvme]
> [ 6984.857433] RSP: 0018:ffffc90002e93c50 EFLAGS: 00010246
> [ 6984.857434] RAX: 0000000000000000 RBX: ffff880275646600 RCX: 0000000000001000
> [ 6984.857435] RDX: 0000000000000fff RSI: 00000002fba2a000 RDI: ffff8804734e6950
> [ 6984.857436] RBP: ffffc90002e93d30 R08: 0000000000002000 R09: 0000000000001000
> [ 6984.857437] R10: 0000000000001000 R11: 0000000000000000 R12: ffff8804741d8000
> [ 6984.857438] R13: 0000000000000040 R14: ffff880475649f80 R15: ffff8804734e6780
> [ 6984.857439] FS: 0000000000000000(0000) GS:ffff88047fcc0000(0000) knlGS:0000000000000000
> [ 6984.857440] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 6984.857442] CR2: 0000000000000010 CR3: 0000000001c09000 CR4: 00000000001406e0
> [ 6984.857443] Call Trace:
> [ 6984.857451] ? mempool_free+0x2b/0x80
> [ 6984.857455] ? bio_free+0x4e/0x60
> [ 6984.857459] blk_mq_dispatch_rq_list+0xf5/0x230
> [ 6984.857462] blk_mq_process_rq_list+0x133/0x170
> [ 6984.857465] __blk_mq_run_hw_queue+0x8c/0xa0
> [ 6984.857467] blk_mq_run_work_fn+0x12/0x20
> [ 6984.857473] process_one_work+0x165/0x410
> [ 6984.857475] worker_thread+0x137/0x4c0
> [ 6984.857478] kthread+0x101/0x140
> [ 6984.857480] ? rescuer_thread+0x3b0/0x3b0
> [ 6984.857481] ? kthread_park+0x90/0x90
> [ 6984.857489] ret_from_fork+0x2c/0x40
> [ 6984.857490] Code: 8b bd 70 ff ff ff 89 95 50 ff ff ff 89 8d 58 ff ff ff 44
> 89 95 60 ff ff ff e8 b7 dd 12 e1 8b 95 50 ff ff ff 48 89 85 68 ff ff ff <4c>
> 8b 48 10 44 8b 58 18 8b 8d 58 ff ff ff 44 8b 95 60 ff ff ff
> [ 6984.857511] RIP: nvme_queue_rq+0x6e6/0x8cd [nvme] RSP: ffffc90002e93c50
> [ 6984.857512] CR2: 0000000000000010
> [ 6984.895359] ---[ end trace 2d7ceb528432bf83 ]---
>
> Cc: stable@vger.kernel.org
> Reported-by: Yi Zhang <yizhan@redhat.com>
> Tested-by: Yi Zhang <yizhan@redhat.com>
> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> ---
> block/blk-mq.c | 11 +----------
> 1 file changed, 1 insertion(+), 10 deletions(-)
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
hare@suse.de +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 N�rnberg
GF: J. Hawn, J. Guild, F. Imend�rffer, HRB 16746 (AG N�rnberg)
^ permalink raw reply
* Re: [Lsf] [RFC PATCH 0/4] blk-mq: multiqueue I/O scheduler
From: Paolo Valente @ 2017-03-18 11:15 UTC (permalink / raw)
To: Hannes Reinecke; +Cc: linux-block, lsf, Omar Sandoval, kernel-team
In-Reply-To: <46f16799-86d9-7bf5-6346-be08801dd48e@suse.de>
> Il giorno 18 mar 2017, alle ore 07:14, Hannes Reinecke <hare@suse.de> ha =
scritto:
> =
> On 03/17/2017 11:03 PM, Omar Sandoval wrote:
>> From: Omar Sandoval <osandov@fb.com>
>> =
>> This patch series implements an I/O scheduler for multiqueue devices
>> combining several techniques: the scalable bitmap library, the new
>> blk-stats API, and queue depth throttling similar to blk-wbt.
>> =
>> These patches are on top of my earlier blk-stats series [1]. They also
>> need a fix in Jens' for-linus branch in order to work properly [2].
>> =
>> Patches 1 and 2 implement a new sbitmap operation and patch 3 exports a
>> required function. Patch 4 implements the new scheduler, named Kyber.
>> =
>> The commit message in patch 4 describes Kyber in detail. The scheduler
>> employs some heuristics that I've experimented with, but that's probably
>> the area that needs the most work. I'll be at LSF/MM next week, and
>> there's currently a lightning talk on the schedule to discuss
>> improvements and extensions.
>> =
> [ .. ]
> =
> A silent round of applause for you.
> Very much appreciated.
> =
> And actually, I have been looking at properly supporting host-wide tag
> maps, so the concept of 'shallow' queue depth might be the way to go here.
> =
> And as I have some other questions, too (assigning a new tag on
> requeue?) it might be an idea to spread this out into a formal session
> at LSF. If there are some slots left, that is.
> =
I would be very interested in such a discussion too.
Thanks,
Paolo
> Cheers,
> =
> Hannes
> -- =
> Dr. Hannes Reinecke zSeries & Storage
> hare@suse.de +49 911 74053 688
> SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 N=FCrnberg
> GF: J. Hawn, J. Guild, F. Imend=F6rffer, HRB 16746 (AG N=FCrnberg)
> _______________________________________________
> Lsf mailing list
> Lsf@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/lsf
_______________________________________________
Lsf mailing list
Lsf@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/lsf
^ permalink raw reply
* Re: [Lsf] [RFC PATCH 0/4] blk-mq: multiqueue I/O scheduler
From: Hannes Reinecke @ 2017-03-18 11:14 UTC (permalink / raw)
To: Omar Sandoval, linux-block; +Cc: lsf, kernel-team
In-Reply-To: <cover.1489787289.git.osandov@fb.com>
On 03/17/2017 11:03 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> =
> This patch series implements an I/O scheduler for multiqueue devices
> combining several techniques: the scalable bitmap library, the new
> blk-stats API, and queue depth throttling similar to blk-wbt.
> =
> These patches are on top of my earlier blk-stats series [1]. They also
> need a fix in Jens' for-linus branch in order to work properly [2].
> =
> Patches 1 and 2 implement a new sbitmap operation and patch 3 exports a
> required function. Patch 4 implements the new scheduler, named Kyber.
> =
> The commit message in patch 4 describes Kyber in detail. The scheduler
> employs some heuristics that I've experimented with, but that's probably
> the area that needs the most work. I'll be at LSF/MM next week, and
> there's currently a lightning talk on the schedule to discuss
> improvements and extensions.
> =
[ .. ]
A silent round of applause for you.
Very much appreciated.
And actually, I have been looking at properly supporting host-wide tag
maps, so the concept of 'shallow' queue depth might be the way to go here.
And as I have some other questions, too (assigning a new tag on
requeue?) it might be an idea to spread this out into a formal session
at LSF. If there are some slots left, that is.
Cheers,
Hannes
-- =
Dr. Hannes Reinecke zSeries & Storage
hare@suse.de +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 N=FCrnberg
GF: J. Hawn, J. Guild, F. Imend=F6rffer, HRB 16746 (AG N=FCrnberg)
_______________________________________________
Lsf mailing list
Lsf@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/lsf
^ permalink raw reply
* Re: [PATCH RFC 00/14] Add the BFQ I/O Scheduler to blk-mq
From: Paolo Valente @ 2017-03-18 10:52 UTC (permalink / raw)
To: Bart Van Assche
Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
fchecconi@gmail.com, linus.walleij@linaro.org, axboe@kernel.dk,
Arianna Avanzini, broonie@kernel.org, tj@kernel.org,
ulf.hansson@linaro.org
In-Reply-To: <1489509154.2676.6.camel@sandisk.com>
> Il giorno 14 mar 2017, alle ore 16:32, Bart Van Assche =
<bart.vanassche@sandisk.com> ha scritto:
>=20
> On Tue, 2017-03-14 at 16:35 +0100, Paolo Valente wrote:
>>> Il giorno 07 mar 2017, alle ore 02:00, Bart Van Assche =
<bart.vanassche@sandisk.com> ha scritto:
>>>=20
>>> Additionally, the complexity of the code is huge. Just like for CFQ,
>>> sooner or later someone will run into a bug or a performance issue
>>> and will post a patch to fix it. However, the complexity of BFQ is
>>> such that a source code review alone won't be sufficient to verify
>>> whether or not such a patch negatively affects a workload or device
>>> that has not been tested by the author of the patch. This makes me
>>> wonder what process should be followed to verify future BFQ patches?
>>=20
>> Third and last, a proposal: why don't we discuss this issue at LSF
>> too? In particular, we could talk about the parts of BFQ that seem
>> more complex to understand, until they become clearer to you. Then I
>> could try to understand what helped make them clearer, and translate
>> it into extra comments in the code or into other, more radical
>> changes.
>=20
> Hello Paolo,
>=20
> Sorry if my comment was not clear enough. Suppose that e.g. someone =
would
> like to modify the following code:
>=20
> static int bfq_min_budget(struct bfq_data *bfqd)
> {
> if (bfqd->budgets_assigned < bfq_stats_min_budgets)
> return bfq_default_max_budget / 32;
> else
> return bfqd->bfq_max_budget / 32;
> }
>=20
> How to predict the performance impact of any changes in e.g. this =
function?
> It is really great that a performance benchmark is available. But what =
should
> a developer do who only has access to a small subset of all the =
storage
> devices that are supported by the Linux kernel and hence who can not =
run the
> benchmark against every supported storage device? Do developers who do =
not
> fully understand the BFQ algorithms and who run into a performance =
problem
> have any other option than trial and error for fixing such performance =
issues?
>=20
Hi Bart,
maybe I got your point even before, but I did not reply consistently.
You are highlighting an important problem, which, I think, can be
stated in more general terms: if one makes a change in any complex
component, which, in its turn, interacts with complex I/O devices,
then it is hard, if ever possible, to prove, that that change will
cause no regression with any possible device, just by speculation.
Actually, facts show that this often holds even for simple components,
given the complexity of the environment in which they work. Of
course, if not only the component is complex, but who modifies it does
not even fully understand how that component works, then regressions
on untested devices are certainly more probable.
These general considerations are the motivation for my previous
proposals: reduce complexity by breaking into simpler, independent
pieces; fix or improve documentation where needed or useful (why don't
we discuss the most obscure parts at lsfmm?); use a fixed set of
benchmarks to find regressions. Any other proposal is more than
welcome.
Thanks,
Paolo
> Thanks,
>=20
> Bart.
^ permalink raw reply
* Re: [PATCH RFC 10/14] block, bfq: add Early Queue Merge (EQM)
From: Paolo Valente @ 2017-03-18 10:33 UTC (permalink / raw)
To: Jens Axboe
Cc: Tejun Heo, Fabio Checconi, Arianna Avanzini, linux-block,
Linux-Kernal, Ulf Hansson, Linus Walleij, broonie,
Mauro Andreolini
In-Reply-To: <d41a304a-b863-5ea7-29ec-8c5a2ea5bbef@kernel.dk>
> Il giorno 15 mar 2017, alle ore 21:00, Jens Axboe <axboe@kernel.dk> ha =
scritto:
>=20
> On 03/15/2017 10:59 AM, Paolo Valente wrote:
>>=20
>>> Il giorno 15 mar 2017, alle ore 17:30, Jens Axboe <axboe@kernel.dk> =
ha scritto:
>>>=20
>>> On 03/15/2017 09:47 AM, Jens Axboe wrote:
>>>> I think you understood me correctly. Currently I think the putting =
of
>>>> the io context is somewhat of a mess. You have seemingly random =
places
>>>> where you have to use special unlock functions, to ensure that you
>>>> notice that some caller deeper down has set ->ioc_to_put. I took a =
quick
>>>> look at it, and by far most of the cases can return an io_context =
to
>>>> free quite easily. You can mark these functions __must_check to =
ensure
>>>> that we don't drop an io_context, inadvertently. That's already a =
win
>>>> over the random ->ioc_to_put store. And you can then get rid of
>>>> bfq_unlock_put_ioc and it's irq variant as well.
>>>>=20
>>>> The places where you are already returning a value, like off =
dispatch
>>>> for instance, you can just pass in a pointer to an io_context =
pointer.
>>>>=20
>>>> If you get this right, it'll be a lot less fragile and hacky than =
your
>>>> current approach.
>>>=20
>>> Even just looking a little closer, you also find cases where you
>>> potentially twice store ->ioc_to_put. That kind of mixup can't =
happen if
>>> you return it properly.
>>>=20
>>> In __bfq_dispatch_request(), for instance. You call =
bfq_select_queue(),
>>> and that in turn calls bfq_bfqq_expire(), which calls
>>> __bfq_bfqq_expire() which can set ->ioc_to_put. But later on,
>>> __bfq_dispatch_request() calls bfq_dispatch_rq_from_bfqq(), which in
>>> turn calls bfq_bfqq_expire() that can also set ->ioc_to_put. There's =
no
>>> "magic" bfq_unlock_and_put_ioc() in-between those. Maybe the former =
call
>>> never sets ->ioc_to_put if it returns with bfqq =3D=3D NULL? Hard to =
tell.
>>>=20
>>> Or __bfq_insert_request(), it calls bfq_add_request(), which may set
>>> ->ioc_to_put through bfq_bfqq_handle_idle_busy_switch() ->
>>> bfq_bfqq_expire(). And then from calling bfq_rq_enqueued() ->
>>> bfq_bfqq_expire().
>>>=20
>>=20
>> I have checked that. Basically, since a queue can't be expired =
twice,
>> then it should never happen that ioc_to_put is set twice before being
>> used. Yet, I do agree that using a shared field and exploiting
>> collateral effects makes code very complex and fragile (maybe even
>> buggy if my speculative check is wrong). Just, it has been the best
>> solution I found, to avoid deferred work as you asked. In fact, I
>> still find quite heavy the alternative of passing a pointer to an ioc
>> forth and back across seven or eight nested functions.
>=20
> It's not heavy at all, I went through all of it this morning.
Yes, sorry. I meant heavy in terms of code complexity.
> It's not
> super pretty either, since you end up passing back an io_context which
> is seemingly unrelated to what the functions otherwise do.
Exactly.
> But that's
> mostly a reflection of the implementation, not that it's a bad way to =
go
> about this in general. The worst bits are the places where you want to
> add a
>=20
> WARN_ON(ret !=3D NULL);
>=20
> between two calls that potentially both drop the ioc. In terms of
> overhead, it's not heavy. Punting to a workqueue would be orders of
> magnitude more expensive.
>=20
>>> There might be more, but I think the above is plenty of evidence =
that
>>> the current ->ioc_to_put solution is a bad hack, fragile, and =
already
>>> has bugs.
>>>=20
>>> How often do you expect this putting of the io_context to happen?
>>=20
>> Unfortunately often, as it must be done also every time the =
in-service
>> queue is reset. But, in this respect, are we sure that we do need to
>> grab a reference to the ioc when we set a queue in service (as done =
in
>> cfq, and copied into bfq)? I mean, we have the hook exit_ioc for
>> controlling the disappearing of an ioc. Am I missing something here
>> too?
>=20
> No, in fact that'd be perfectly fine. It's easier for CFQ to just =
retain
> the reference so we know it's not going away, but for your case, it
> might in fact make more sense to simply be able to de-service a queue =
if
> the process exits. And if you do that, we can drop all this passing =
back
> of ioc (or ->ioc_to_put) craziness, without having to punt to a
> workqueue either.
>=20
Done, and ... well, it seems to work :)
I'm striving to have a new patch series ready before Monday, but I'm
not confident I'll make it.
Thanks,
Paolo
> This will be more efficient too, since it'll be a much more rare
> occurence.
>=20
> --=20
> Jens Axboe
^ permalink raw reply
* Re: [PATCH v1 3/3] blk-mq: start to freeze queue just after setting dying
From: Ming Lei @ 2017-03-18 1:01 UTC (permalink / raw)
To: Bart Van Assche
Cc: linux-kernel@vger.kernel.org, hch@infradead.org,
linux-block@vger.kernel.org, axboe@fb.com, yizhan@redhat.com,
tj@kernel.org
In-Reply-To: <1489793173.2826.23.camel@sandisk.com>
On Fri, Mar 17, 2017 at 11:26:26PM +0000, Bart Van Assche wrote:
> On Fri, 2017-03-17 at 17:57 +0800, Ming Lei wrote:
> > Given blk_set_queue_dying() is always called in remove path
> > of block device, and queue will be cleaned up later, we don't
> > need to worry about undoing the counter.
> >
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index d772c221cc17..62d4967c369f 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -500,9 +500,12 @@ void blk_set_queue_dying(struct request_queue *q)
> > queue_flag_set(QUEUE_FLAG_DYING, q);
> > spin_unlock_irq(q->queue_lock);
> >
> > - if (q->mq_ops)
> > + if (q->mq_ops) {
> > blk_mq_wake_waiters(q);
> > - else {
> > +
> > + /* block new I/O coming */
> > + blk_mq_freeze_queue_start(q);
> > + } else {
> > struct request_list *rl;
> >
> > spin_lock_irq(q->queue_lock);
>
> Hello Ming,
>
> The blk_freeze_queue() call in blk_cleanup_queue() waits until q_usage_counter
> drops to zero. Since the above blk_mq_freeze_queue_start() call increases that
> counter by one, how is blk_freeze_queue() expected to finish ever?
It is q->mq_freeze_depth which is increased by blk_mq_freeze_queue_start(), not
q->q_usage_counter, otherwise blk_freeze_queue() would never return, :-)
Thanks,
Ming
^ permalink raw reply
* Re: [RFC PATCH 0/4] blk-mq: multiqueue I/O scheduler
From: Omar Sandoval @ 2017-03-18 0:55 UTC (permalink / raw)
To: linux-block; +Cc: kernel-team
In-Reply-To: <cover.1489787289.git.osandov@fb.com>
On Fri, Mar 17, 2017 at 03:03:29PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> With Kyber set to a target latency of 1 ms, the reader sees latencies of 8 ms
> on average. Kyber brings this down to just over 1 ms.
Of course, this is supposed to say, without Kyber, the reader sees
latencies of 8 ms. With Kyber set to a target latency of 1 ms, the
actual average latency is just over 1 ms.
^ permalink raw reply
* [RFC PATCH 2/4] blk-mq: add shallow depth option for blk_mq_get_tag()
From: Omar Sandoval @ 2017-03-17 22:03 UTC (permalink / raw)
To: linux-block; +Cc: kernel-team
In-Reply-To: <cover.1489787289.git.osandov@fb.com>
From: Omar Sandoval <osandov@fb.com>
Wire up the sbitmap_get_shallow() operation to the tag code so that a
caller can limit the number of tags available to it.
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
block/blk-mq-tag.c | 5 ++++-
block/blk-mq.h | 1 +
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index e48bc2c72615..1f25d466ebdc 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -96,7 +96,10 @@ static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
!hctx_may_queue(data->hctx, bt))
return -1;
- return __sbitmap_queue_get(bt);
+ if (data->shallow_depth)
+ return __sbitmap_queue_get_shallow(bt, data->shallow_depth);
+ else
+ return __sbitmap_queue_get(bt);
}
unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 8d49c06fc520..77ec66369f21 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -141,6 +141,7 @@ struct blk_mq_alloc_data {
/* input parameter */
struct request_queue *q;
unsigned int flags;
+ unsigned int shallow_depth;
/* input & output parameter */
struct blk_mq_ctx *ctx;
--
2.12.0
^ permalink raw reply related
* [RFC PATCH 4/4] blk-mq: introduce Kyber multiqueue I/O scheduler
From: Omar Sandoval @ 2017-03-17 22:03 UTC (permalink / raw)
To: linux-block; +Cc: kernel-team
In-Reply-To: <cover.1489787289.git.osandov@fb.com>
From: Omar Sandoval <osandov@fb.com>
The Kyber I/O scheduler is an I/O scheduler for fast devices designed to
scale to multiple queues. Users configure a single knob, the target read
latency, and the scheduler tunes itself to achieve that latency goal.
The implementation is based on "tokens", built on top of the scalable
bitmap library. Tokens serve as a mechanism for limiting requests. There
are two tiers of tokens: queueing tokens and dispatch tokens.
A queueing token is required to allocate a request. In fact, these
tokens are actually the blk-mq internal scheduler tags, but the
scheduler manages the allocation directly in order to implement its
policy.
Dispatch tokens are device-wide and split up into two scheduling
domains: reads vs. writes. Each hardware queue dispatches batches
round-robin between the scheduling domains as long as tokens are
available for that domain.
These tokens can be used as the mechanism to enable various policies.
The policy Kyber uses is inspired by active queue management techniques
for network routing, similar to blk-wbt. The scheduler monitors read
latencies and scales the number of available write dispatch tokens
accordingly. Queueing tokens are used to prevent starvation of
synchronous requests by asynchronous requests.
Various extensions are possible, including better heuristics and ionice
support.
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
block/Kconfig.iosched | 8 +
block/Makefile | 1 +
block/elevator.c | 9 +-
block/kyber-iosched.c | 586 ++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 600 insertions(+), 4 deletions(-)
create mode 100644 block/kyber-iosched.c
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 58fc8684788d..ba6c9be67fa4 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -69,6 +69,14 @@ config MQ_IOSCHED_DEADLINE
---help---
MQ version of the deadline IO scheduler.
+config MQ_IOSCHED_KYBER
+ tristate "Kyber I/O scheduler"
+ default y
+ ---help---
+ The Kyber I/O scheduler is a low-overhead scheduler suitable for
+ multiqueue and other fast devices. Given a target latency, it will
+ self-tune queue depths to achieve that goal.
+
endmenu
endif
diff --git a/block/Makefile b/block/Makefile
index 081bb680789b..6146d2eaaeaa 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o
obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o
+obj-$(CONFIG_MQ_IOSCHED_KYBER) += kyber-iosched.o
obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
obj-$(CONFIG_BLK_CMDLINE_PARSER) += cmdline-parser.o
diff --git a/block/elevator.c b/block/elevator.c
index 01139f549b5b..44a6e42ffc1a 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -221,14 +221,15 @@ int elevator_init(struct request_queue *q, char *name)
if (!e) {
/*
- * For blk-mq devices, we default to using mq-deadline,
- * if available, for single queue devices. If deadline
- * isn't available OR we have multiple queues, default
- * to "none".
+ * For blk-mq, we default to using mq-deadline for single-queue
+ * devices and kyber for multi-queue devices. We fall back to
+ * "none" if the preferred scheduler isn't available.
*/
if (q->mq_ops) {
if (q->nr_hw_queues == 1)
e = elevator_get("mq-deadline", false);
+ else
+ e = elevator_get("kyber", false);
if (!e)
return 0;
} else
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
new file mode 100644
index 000000000000..e29cea785408
--- /dev/null
+++ b/block/kyber-iosched.c
@@ -0,0 +1,586 @@
+/*
+ * The Kyber I/O scheduler. Controls latency by throttling queue depths using
+ * scalable techniques.
+ *
+ * Copyright (C) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <https://www.gnu.org/licenses/>.
+ */
+
+#include <linux/kernel.h>
+#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
+#include <linux/elevator.h>
+#include <linux/module.h>
+#include <linux/sbitmap.h>
+
+#include "blk.h"
+#include "blk-mq.h"
+#include "blk-mq-sched.h"
+#include "blk-mq-tag.h"
+#include "blk-stat.h"
+
+/* Scheduling domains. */
+enum {
+ KYBER_READ,
+ KYBER_WRITE,
+ KYBER_NUM_DOMAINS,
+};
+
+enum {
+ KYBER_MIN_DEPTH = 256,
+
+ /*
+ * Initial device-wide depths for each scheduling domain.
+ *
+ * Even for fast devices with lots of tags like NVMe, you can saturate
+ * the device with only a fraction of the maximum possible queue depth.
+ * So, we cap these to a reasonable value.
+ */
+ KYBER_READ_DEPTH = 256,
+ KYBER_WRITE_DEPTH = KYBER_READ_DEPTH / 4,
+
+ /*
+ * Scheduling domain batch sizes. We favor reads over writes.
+ */
+ KYBER_READ_BATCH = 16,
+ KYBER_WRITE_BATCH = 8,
+
+ /*
+ * In order to prevent starvation of synchronous requests by a flood of
+ * asynchronous requests, we reserve 25% of requests for synchronous
+ * operations.
+ */
+ KYBER_ASYNC_PERCENT = 75,
+};
+
+struct kyber_queue_data {
+ struct request_queue *q;
+
+ struct blk_stat_callback *cb;
+
+ /*
+ * The device is divided into multiple scheduling domains based on the
+ * request type. Each domain has a fixed number of in-flight requests of
+ * that type device-wide, limited by these tokens.
+ */
+ struct sbitmap_queue domain_tokens[KYBER_NUM_DOMAINS];
+
+ /*
+ * The maximum depth that the domain tokens can be resized to.
+ */
+ unsigned int max_domain_tokens[KYBER_NUM_DOMAINS];
+
+ /* Batch size for each scheduling domain. */
+ unsigned int domain_batch[KYBER_NUM_DOMAINS];
+
+ /*
+ * Async request percentage, converted to per-word depth for
+ * sbitmap_get_shallow().
+ */
+ unsigned int async_depth;
+
+ /* Target read latency in nanoseconds. */
+ u64 read_lat_nsec;
+};
+
+struct kyber_hctx_data {
+ spinlock_t lock;
+ struct list_head rqs[KYBER_NUM_DOMAINS];
+ int cur_domain;
+ unsigned int batching;
+};
+
+/*
+ * Heuristics for limiting queue depths based on latency. Similar to AQM
+ * techniques for network routing.
+ */
+static void kyber_stats_fn(struct blk_stat_callback *cb,
+ struct blk_stats *stats)
+{
+ struct kyber_queue_data *kqd = cb->data;
+ unsigned int orig_write_depth, write_depth;
+ u64 latency, target;
+
+ orig_write_depth = write_depth =
+ READ_ONCE(kqd->domain_tokens[KYBER_WRITE].sb.depth);
+
+ if (!stats->read.nr_samples) {
+ write_depth += 1;
+ goto resize;
+ }
+
+ latency = stats->read.mean;
+ target = kqd->read_lat_nsec;
+
+ if (latency >= 4 * target)
+ write_depth /= 2;
+ else if (latency >= 2 * target)
+ write_depth -= max(write_depth / 4, 1U);
+ else if (latency > target)
+ write_depth -= max(write_depth / 8, 1U);
+ else if (latency <= target / 2)
+ write_depth += 2;
+ else if (latency <= 3 * target / 4)
+ write_depth += 1;
+
+resize:
+ write_depth = clamp_t(unsigned int, write_depth, 1, KYBER_WRITE_DEPTH);
+ if (write_depth != orig_write_depth)
+ sbitmap_queue_resize(&kqd->domain_tokens[KYBER_WRITE], write_depth);
+
+ /* Continue monitoring latencies as long as we are throttling. */
+ if (write_depth < KYBER_WRITE_DEPTH && !timer_pending(&kqd->cb->timer))
+ blk_stat_arm_callback(kqd->cb, jiffies + msecs_to_jiffies(100));
+}
+
+/*
+ * Check if this request met our latency goal. If not, quickly gather some
+ * statistics and start throttling.
+ */
+static void kyber_check_latency(struct kyber_queue_data *kqd,
+ struct request *rq)
+{
+ u64 now, latency;
+ unsigned long expires;
+
+ if (req_op(rq) != REQ_OP_READ)
+ return;
+
+ /* If we are already managing the write depth, don't check again. */
+ if (kqd->domain_tokens[KYBER_WRITE].sb.depth < KYBER_WRITE_DEPTH)
+ return;
+
+ now = __blk_stat_time(ktime_to_ns(ktime_get()));
+ if (now < blk_stat_time(&rq->issue_stat))
+ return;
+
+ latency = now - blk_stat_time(&rq->issue_stat);
+
+ if (latency <= kqd->read_lat_nsec)
+ return;
+
+ if (!timer_pending(&kqd->cb->timer)) {
+ expires = jiffies + msecs_to_jiffies(10);
+ blk_stat_arm_callback(kqd->cb, expires);
+ }
+}
+
+static unsigned int kyber_sched_tags_shift(struct kyber_queue_data *kqd)
+{
+ /*
+ * All of the hardware queues have the same depth, so we can just grab
+ * the shift of the first one.
+ */
+ return kqd->q->queue_hw_ctx[0]->sched_tags->bitmap_tags.sb.shift;
+}
+
+static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
+{
+ struct kyber_queue_data *kqd;
+ unsigned int max_tokens;
+ unsigned int shift;
+ int ret = -ENOMEM;
+ int i;
+
+ kqd = kmalloc_node(sizeof(*kqd), GFP_KERNEL, q->node);
+ if (!kqd)
+ goto err;
+ kqd->q = q;
+
+ kqd->cb = blk_stat_alloc_callback(kyber_stats_fn, kqd);
+ if (!kqd->cb)
+ goto err_kqd;
+
+ /*
+ * The maximum number of tokens for any scheduling domain is at least
+ * the queue depth of a single hardware queue. If the hardware doesn't
+ * have many tags, still provide a reasonable number.
+ */
+ max_tokens = max_t(unsigned int, q->tag_set->queue_depth,
+ KYBER_MIN_DEPTH);
+ for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
+ kqd->max_domain_tokens[i] = max_tokens;
+ ret = sbitmap_queue_init_node(&kqd->domain_tokens[i],
+ max_tokens, -1, false, GFP_KERNEL,
+ q->node);
+ if (ret) {
+ while (--i >= 0)
+ sbitmap_queue_free(&kqd->domain_tokens[i]);
+ goto err_cb;
+ }
+ }
+
+ sbitmap_queue_resize(&kqd->domain_tokens[KYBER_READ], KYBER_READ_DEPTH);
+ sbitmap_queue_resize(&kqd->domain_tokens[KYBER_WRITE], KYBER_WRITE_DEPTH);
+
+ kqd->domain_batch[KYBER_READ] = KYBER_READ_BATCH;
+ kqd->domain_batch[KYBER_WRITE] = KYBER_WRITE_BATCH;
+
+ shift = kyber_sched_tags_shift(kqd);
+ kqd->async_depth = (1U << shift) * KYBER_ASYNC_PERCENT / 100U;
+
+ kqd->read_lat_nsec = 2000000ULL;
+
+ return kqd;
+
+err_cb:
+ blk_stat_free_callback(kqd->cb);
+err_kqd:
+ kfree(kqd);
+err:
+ return ERR_PTR(ret);
+}
+
+static void kyber_queue_data_free(struct kyber_queue_data *kqd)
+{
+ int i;
+
+ if (!kqd)
+ return;
+
+ for (i = 0; i < KYBER_NUM_DOMAINS; i++)
+ sbitmap_queue_free(&kqd->domain_tokens[i]);
+ blk_stat_free_callback(kqd->cb);
+ kfree(kqd);
+}
+
+static int kyber_hctx_data_init(struct blk_mq_hw_ctx *hctx)
+{
+ struct kyber_hctx_data *khd = hctx->sched_data;
+ int i;
+
+ spin_lock_init(&khd->lock);
+
+ for (i = 0; i < KYBER_NUM_DOMAINS; i++)
+ INIT_LIST_HEAD(&khd->rqs[i]);
+
+ khd->cur_domain = 0;
+ khd->batching = 0;
+
+ return 0;
+}
+
+static int kyber_init_sched(struct request_queue *q, struct elevator_type *e)
+{
+ struct kyber_queue_data *kqd;
+ struct elevator_queue *eq;
+ int ret;
+
+ eq = elevator_alloc(q, e);
+ if (!eq)
+ return -ENOMEM;
+
+ kqd = kyber_queue_data_alloc(q);
+ if (IS_ERR(kqd)) {
+ ret = PTR_ERR(kqd);
+ goto err_kobj;
+ }
+
+ ret = blk_mq_sched_init_hctx_data(q, sizeof(struct kyber_hctx_data),
+ kyber_hctx_data_init, NULL);
+ if (ret)
+ goto err_kqd;
+
+ eq->elevator_data = kqd;
+ q->elevator = eq;
+
+ blk_stat_add_callback(q, kqd->cb);
+
+ return 0;
+
+err_kqd:
+ kyber_queue_data_free(kqd);
+err_kobj:
+ kobject_put(&eq->kobj);
+ return ret;
+}
+
+static void kyber_exit_sched(struct elevator_queue *e)
+{
+ struct kyber_queue_data *kqd = e->elevator_data;
+ struct request_queue *q = kqd->q;
+
+ blk_stat_remove_callback(q, kqd->cb);
+ blk_mq_sched_free_hctx_data(q, NULL);
+ kyber_queue_data_free(e->elevator_data);
+}
+
+static int op_to_sched_domain(int op)
+{
+ if (op_is_write(op))
+ return KYBER_WRITE;
+ else
+ return KYBER_READ;
+}
+
+static int kyber_get_domain_token(struct kyber_queue_data *kqd,
+ int sched_domain)
+{
+ struct sbitmap_queue *domain_tokens;
+
+ domain_tokens = &kqd->domain_tokens[sched_domain];
+ return __sbitmap_queue_get(domain_tokens);
+}
+
+static int rq_get_domain_token(struct request *rq)
+{
+ return (long)rq->elv.priv[0];
+}
+
+static void rq_set_domain_token(struct request *rq, int token)
+{
+ rq->elv.priv[0] = (void *)(long)token;
+}
+
+static void rq_clear_domain_token(struct kyber_queue_data *kqd,
+ struct request *rq)
+{
+ int sched_domain, nr;
+
+ nr = rq_get_domain_token(rq);
+ if (nr != -1) {
+ sched_domain = op_to_sched_domain(req_op(rq));
+ sbitmap_queue_clear(&kqd->domain_tokens[sched_domain], nr,
+ rq->mq_ctx->cpu);
+ }
+}
+
+static struct request *kyber_get_request(struct request_queue *q,
+ unsigned int op,
+ struct blk_mq_alloc_data *data)
+{
+ struct kyber_queue_data *kqd = q->elevator->elevator_data;
+ struct request *rq;
+
+ /*
+ * We use the scheduler tags as per-hardware queue queueing tokens.
+ * Async requests can be limited at this stage.
+ */
+ if (!op_is_sync(op))
+ data->shallow_depth = READ_ONCE(kqd->async_depth);
+
+ rq = __blk_mq_alloc_request(data, op);
+ if (rq)
+ rq_set_domain_token(rq, -1);
+ return rq;
+}
+
+static void kyber_put_request(struct request *rq)
+{
+ struct request_queue *q = rq->q;
+ struct kyber_queue_data *kqd = q->elevator->elevator_data;
+
+ kyber_check_latency(kqd, rq);
+ rq_clear_domain_token(kqd, rq);
+ blk_mq_finish_request(rq);
+}
+
+static void kyber_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx,
+ struct kyber_hctx_data *khd)
+{
+ LIST_HEAD(rq_list);
+ struct request *rq, *next;
+
+ blk_mq_flush_busy_ctxs(hctx, &rq_list);
+ list_for_each_entry_safe(rq, next, &rq_list, queuelist) {
+ int sched_domain;
+
+ sched_domain = op_to_sched_domain(req_op(rq));
+ list_move_tail(&rq->queuelist, &khd->rqs[sched_domain]);
+ }
+}
+
+static struct request *
+kyber_dispatch_cur_domain(struct blk_mq_hw_ctx *hctx,
+ struct kyber_queue_data *kqd,
+ struct kyber_hctx_data *khd,
+ bool *flushed, bool *no_tokens)
+{
+ struct list_head *rqs;
+ struct request *rq;
+ int nr;
+
+ rqs = &khd->rqs[khd->cur_domain];
+ rq = list_first_entry_or_null(rqs, struct request, queuelist);
+
+ /*
+ * If there wasn't already a pending request and we haven't flushed the
+ * software queues yet, flush the software queues and check again.
+ */
+ if (!rq && !*flushed) {
+ kyber_flush_busy_ctxs(hctx, khd);
+ *flushed = true;
+ rq = list_first_entry_or_null(rqs, struct request, queuelist);
+ }
+
+ if (rq) {
+ nr = kyber_get_domain_token(kqd, khd->cur_domain);
+ if (nr == -1) {
+ *no_tokens = true;
+ } else {
+ khd->batching++;
+ rq_set_domain_token(rq, nr);
+ list_del_init(&rq->queuelist);
+ return rq;
+ }
+ }
+
+ /* There were either no pending requests or no tokens. */
+ return NULL;
+}
+
+/*
+ * Returns a request on success, NULL if there were no requests to dispatch, and
+ * ERR_PTR(-EBUSY) if there were requests to dispatch but no domain tokens for
+ * them.
+ */
+static struct request *__kyber_dispatch_request(struct kyber_queue_data *kqd,
+ struct kyber_hctx_data *khd,
+ struct blk_mq_hw_ctx *hctx)
+{
+ bool flushed = false, no_tokens = false;
+ struct request *rq;
+ int i;
+
+ /*
+ * First, if we are still entitled to batch, try to dispatch a request
+ * from the batch.
+ */
+ if (khd->batching < READ_ONCE(kqd->domain_batch[khd->cur_domain])) {
+ rq = kyber_dispatch_cur_domain(hctx, kqd, khd, &flushed,
+ &no_tokens);
+ if (rq)
+ return rq;
+ }
+
+ /*
+ * Either,
+ * 1. We were no longer entitled to a batch.
+ * 2. The domain we were batching didn't have any requests.
+ * 3. The domain we were batching was out of tokens.
+ *
+ * Start another batch. Note that this wraps back around to the original
+ * domain if no other domains have requests or tokens.
+ */
+ khd->batching = 0;
+ for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
+ if (++khd->cur_domain >= KYBER_NUM_DOMAINS)
+ khd->cur_domain = 0;
+
+ rq = kyber_dispatch_cur_domain(hctx, kqd, khd, &flushed,
+ &no_tokens);
+ if (rq)
+ return rq;
+ }
+
+ return no_tokens ? ERR_PTR(-EBUSY) : NULL;
+}
+
+static struct request *kyber_dispatch_request(struct blk_mq_hw_ctx *hctx)
+{
+ struct kyber_queue_data *kqd = hctx->queue->elevator->elevator_data;
+ struct kyber_hctx_data *khd = hctx->sched_data;
+ struct request *rq;
+
+ spin_lock(&khd->lock);
+
+ rq = __kyber_dispatch_request(kqd, khd, hctx);
+ if (IS_ERR(rq)) {
+ /*
+ * We failed to get a domain token. Mark the queue as needing a
+ * restart and try again in case a token was freed before we set
+ * the restart bit.
+ */
+ blk_mq_sched_mark_restart_queue(hctx);
+ rq = __kyber_dispatch_request(kqd, khd, hctx);
+ if (IS_ERR(rq))
+ rq = NULL;
+ }
+
+ spin_unlock(&khd->lock);
+
+ return rq;
+}
+
+static bool kyber_has_work(struct blk_mq_hw_ctx *hctx)
+{
+ struct kyber_hctx_data *khd = hctx->sched_data;
+ int i;
+
+ for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
+ if (!list_empty_careful(&khd->rqs[i]))
+ return true;
+ }
+ return false;
+}
+
+static ssize_t kyber_read_lat_show(struct elevator_queue *e, char *page)
+{
+ struct kyber_queue_data *kqd = e->elevator_data;
+
+ return sprintf(page, "%llu\n", kqd->read_lat_nsec);
+}
+
+static ssize_t kyber_read_lat_store(struct elevator_queue *e, const char *page,
+ size_t count)
+{
+ struct kyber_queue_data *kqd = e->elevator_data;
+ unsigned long long nsec;
+ int ret;
+
+ ret = kstrtoull(page, 10, &nsec);
+ if (ret)
+ return ret;
+
+ WRITE_ONCE(kqd->read_lat_nsec, nsec);
+
+ return count;
+}
+
+static struct elv_fs_entry kyber_sched_attrs[] = {
+ __ATTR(read_lat_nsec, 0644, kyber_read_lat_show, kyber_read_lat_store),
+ __ATTR_NULL
+};
+
+static struct elevator_type kyber_sched = {
+ .ops.mq = {
+ .init_sched = kyber_init_sched,
+ .exit_sched = kyber_exit_sched,
+ .get_request = kyber_get_request,
+ .put_request = kyber_put_request,
+ .dispatch_request = kyber_dispatch_request,
+ .has_work = kyber_has_work,
+ },
+ .uses_mq = true,
+ .elevator_attrs = kyber_sched_attrs,
+ .elevator_name = "kyber",
+ .elevator_owner = THIS_MODULE,
+};
+
+static int __init kyber_init(void)
+{
+ return elv_register(&kyber_sched);
+}
+
+static void __exit kyber_exit(void)
+{
+ elv_unregister(&kyber_sched);
+}
+
+module_init(kyber_init);
+module_exit(kyber_exit);
+
+MODULE_AUTHOR("Omar Sandoval");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Kyber I/O scheduler");
--
2.12.0
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox