From: Shaohua Li <shli@fb.com>
To: Tejun Heo <tj@kernel.org>
Cc: <linux-kernel@vger.kernel.org>, <axboe@kernel.dk>,
<vgoyal@redhat.com>, <jmoyer@redhat.com>, <Kernel-team@fb.com>
Subject: Re: [RFC 0/3] block: proportional based blk-throttling
Date: Thu, 21 Jan 2016 16:00:16 -0800
Message-ID: <20160122000015.GA4066045@devbig084.prn1.facebook.com>
In-Reply-To: <20160121224157.GL5157@mtj.duckdns.org>
Hi,
On Thu, Jan 21, 2016 at 05:41:57PM -0500, Tejun Heo wrote:
> Hello, Shaohua.
>
> On Thu, Jan 21, 2016 at 02:24:51PM -0800, Shaohua Li wrote:
> > > Have you tried with some level, say 5, of nesting? IIRC, how it
> > > implements hierarchical control is rather braindead (and yeah I'm
> > > responsible for the damage).
> >
> > Not yet. Agree nesting increases the locking time. But my test is
> > already an extreme case. I had 32 threads in 2 nodes running IO and the
> > IOPS is 1M/s. Don't think real workload will act like this. The locking
> > issue definitely should be revisited in the future though.
>
> The thing is that most of the possible contentions can be removed by
> implementing per-cpu cache which shouldn't be too difficult. 10%
> extra cost on current gen hardware is already pretty high.
I did think about this. A per-cpu cache does sound straightforward, but it
could severely impact fairness. For example, say we give each CPU a budget
of 1MB. As long as a cgroup hasn't used up that 1MB, we don't take the lock.
But with 128 CPUs, the cgroup can use up to 128 * 1MB of extra budget, which
badly breaks fairness. I have no idea how this can be fixed.
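To put a number on that, here is a minimal sketch. It is not
blk-throttle code; it assumes a hypothetical scheme where each CPU may
dispatch up to a local budget for a cgroup before touching the shared
state, and the function name and parameters are made up for
illustration.

```python
MB = 1 << 20


def worst_case_overshoot(num_cpus: int, per_cpu_budget_bytes: int) -> int:
    """Bytes a cgroup could dispatch without any CPU taking the shared lock.

    Assumes (hypothetically) that every CPU holds an independent local
    budget for the cgroup and only synchronizes once it is exhausted.
    """
    if num_cpus < 0 or per_cpu_budget_bytes < 0:
        raise ValueError("arguments must be non-negative")
    return num_cpus * per_cpu_budget_bytes


# The example from the text: 128 CPUs, 1MB budget each.
overshoot = worst_case_overshoot(128, 1 * MB)
print(f"unaccounted budget: {overshoot // MB} MB")
```

The bound grows linearly with the CPU count, so shrinking the per-cpu
budget only trades the overshoot for more frequent locking.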
> > Disagree io time is a better choice. Actually I think IO time will be
>
> If IO time isn't the right term, let's call it IO cost. Whatever the
> term, the actual fraction of cost that each IO is incurring.
>
> > the least we should consider for SSD. Ideally if we know each IO cost and
> > total disk capability, things will be easy. Unfortunately there is no
> > way to know IO cost. Bandwidth isn't perfect, but might be the best.
> >
> > I don't know why you think devices are predictable. SSD is never
> > predictable. I'm not sure how you will measure IO time. Modern SSD has
> > large queue depth (blk-mq supports 10k queue depth). That means we can
> > send 10k IO in several ns. Measuring IO start/finish time doesn't help
> > either. A 4k IO with 1 io depth might use 10us. A 4k IO with 100 io depth
> > might use more than 100us. The IO time will increase with higher io
> > depth. The fundamental problem is disk with large queue depth can buffer
> > infinite IO request. I think IO time only works for queue depth 1 disk.
>
> They're way more predictable than rotational devices when measured
> over a period. I don't think we'll be able to measure anything
> meaningful at individual command level but aggregate numbers should be
> fairly stable. A simple approximation of IO cost such as fixed cost
> per IO + cost proportional to IO size would do a far better job than
> just depending on bandwidth or iops and that requires approximating
> two variables over time. I'm not sure how easy / feasible that
> actually would be tho.
It still sounds like IO time; otherwise I can't imagine how we would measure
the cost. If we use some sort of aggregate number, it looks like a variation
of bandwidth, e.g. cost = bandwidth/ios.
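For reference, the "fixed cost per IO + cost proportional to IO size"
approximation quoted above can be sketched as below. This is only an
illustration: the two coefficients are the "two variables" that would
have to be estimated over time, the values used here are placeholders
rather than measurements, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Aggregate counters for one cgroup over a measurement window."""
    ios: int      # IOs completed in the window
    nbytes: int   # bytes transferred in the window


def window_cost(stats, fixed_cost_per_io, cost_per_byte):
    """Linear cost model: a fixed cost per IO plus a cost per byte."""
    return fixed_cost_per_io * stats.ios + cost_per_byte * stats.nbytes


def cost_shares(groups, fixed_cost_per_io, cost_per_byte):
    """Fraction of the total modelled cost attributed to each cgroup."""
    costs = {name: window_cost(s, fixed_cost_per_io, cost_per_byte)
             for name, s in groups.items()}
    total = sum(costs.values())
    if total == 0:
        return {name: 0.0 for name in costs}
    return {name: c / total for name, c in costs.items()}


groups = {
    "small_random": WindowStats(ios=1000, nbytes=1000 * 4096),
    "large_seq": WindowStats(ios=10, nbytes=10 * 1024 * 1024),
}

print(cost_shares(groups, 0.0, 1.0))     # degenerates to pure bandwidth
print(cost_shares(groups, 1.0, 0.0))     # degenerates to pure IOPS
print(cost_shares(groups, 4096.0, 1.0))  # placeholder mix of the two
```

With either coefficient set to zero the model collapses to plain
bandwidth or plain IOPS accounting, and in every case it is computed
from the same aggregate counters.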
I understand you probably want something like: get the disk's total resource,
predict the resource each IO consumes, and then use that info to arbitrate
between cgroups. I don't see how that's possible. A disk that is already using
all its resources can still accept new IO for queuing. Maybe someday a fancy
device will export the info.
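The queuing point can be illustrated with Little's law (mean latency =
IOs in flight / throughput). The sketch below assumes, purely for
illustration, a device that is already saturated at a fixed 100k IOPS,
which matches 10us per 4k IO at queue depth 1. Real SSDs have internal
parallelism, so the numbers are not a model of any particular device.

```python
def mean_latency_us(queue_depth: int, throughput_iops: int) -> float:
    """Mean per-IO latency in microseconds from Little's law.

    latency = IOs in flight / throughput, with throughput assumed
    constant because the device is taken to be saturated.
    """
    if queue_depth < 1 or throughput_iops <= 0:
        raise ValueError("queue_depth must be >= 1 and throughput_iops > 0")
    return queue_depth * 1_000_000 / throughput_iops


SATURATED_IOPS = 100_000  # assumption: 10us per 4k IO at queue depth 1

for depth in (1, 10, 100):
    latency = mean_latency_us(depth, SATURATED_IOPS)
    print(f"depth {depth:3d}: {latency:.0f} us per IO")
```

Under that assumption the measured start/finish time scales with the
queue depth while the device does the same amount of work per second,
so the per-IO time mostly reflects queuing rather than cost.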
Thanks,
Shaohua