linux-kernel.vger.kernel.org archive mirror
From: Tejun Heo <tj@kernel.org>
To: Shaohua Li <shli@fb.com>
Cc: linux-kernel@vger.kernel.org, axboe@kernel.dk, vgoyal@redhat.com,
	jmoyer@redhat.com, Kernel-team@fb.com
Subject: Re: [RFC 0/3] block: proportional based blk-throttling
Date: Fri, 22 Jan 2016 09:48:22 -0500	[thread overview]
Message-ID: <20160122144822.GA32380@htj.duckdns.org> (raw)
In-Reply-To: <20160122000015.GA4066045@devbig084.prn1.facebook.com>

Hello, Shaohua.

On Thu, Jan 21, 2016 at 04:00:16PM -0800, Shaohua Li wrote:
> > The thing is that most of the possible contentions can be removed by
> > implementing per-cpu cache which shouldn't be too difficult.  10%
> > extra cost on current gen hardware is already pretty high.
> 
> I did think about this. A per-cpu cache does sound straightforward, but it
> could severely impact fairness. For example, we give each cpu a budget,
> say 1MB. If a cgroup doesn't use up the 1MB budget, we don't hold the lock.
> But if we have 128 CPUs, the cgroup can use 128 * 1MB more budget, which
> breaks fairness very much. I have no idea how this can be fixed.

Let's say the per-cgroup buffer budget B is calculated as, say, 100ms
worth of IO cost (or bandwidth or iops) available to the cgroup.  In
practice, this may have to be adjusted down depending on the number of
cgroups performing active IOs.  For a given cgroup, B can be
distributed among the CPUs that are actively issuing IOs in that
cgroup.  It will degenerate to round-robin of small budgets if there
are too many active CPUs for the available budget, but in most cases
this will eliminate most of the cross-CPU traffic.
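A minimal sketch of what such a per-CPU budget cache could look like
(all names, the slice policy, and the constants are hypothetical; real
code would take a lock around the shared pool and replenish it
periodically):

```c
/* Hypothetical per-CPU budget cache for one cgroup.  The shared pool
 * b->global is the per-cgroup budget B; each CPU pulls slices from it
 * into a local cache so the common path touches no shared state. */
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

struct cgroup_budget {
	long global;		/* remaining shared budget (bytes) */
	long percpu[NR_CPUS];	/* per-CPU cached budget */
	long slice;		/* granularity of grabs from the pool */
};

/*
 * Charge an IO of @bytes issued on @cpu.  Only when the local cache
 * runs dry do we grab a slice from the shared pool -- in real code
 * that would be the only step needing the lock.  Returns false when
 * the cgroup's budget is exhausted, i.e. the IO should be throttled.
 */
static bool charge_io(struct cgroup_budget *b, int cpu, long bytes)
{
	if (b->percpu[cpu] < bytes) {
		long need = bytes - b->percpu[cpu];
		long grab = b->slice > need ? b->slice : need;

		if (grab > b->global)
			grab = b->global;
		if (b->percpu[cpu] + grab < bytes)
			return false;	/* out of budget: throttle */
		b->global -= grab;	/* lock would be taken here */
		b->percpu[cpu] += grab;
	}
	b->percpu[cpu] -= bytes;
	return true;
}
```

The fairness concern above maps to the slice size: a CPU can strand at
most one slice in its cache, so the worst-case overshoot is bounded by
slice * nr_active_cpus rather than by B per CPU.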

> > They're way more predictable than rotational devices when measured
> > over a period.  I don't think we'll be able to measure anything
> > meaningful at individual command level but aggregate numbers should be
> > fairly stable.  A simple approximation of IO cost such as fixed cost
> > per IO + cost proportional to IO size would do a far better job than
> > just depending on bandwidth or iops and that requires approximating
> > two variables over time.  I'm not sure how easy / feasible that
> > actually would be tho.
> 
> It still sounds like IO time, otherwise I can't imagine we can measure
> the cost. If we use some sort of aggregate number, it's like a variation
> of bandwidth, e.g. cost = bandwidth / ios.

I think the cost of an IO can be approximated by a fixed per-IO cost
plus a cost proportional to the size, so

 cost = F + R * size
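One hypothetical way to approximate the two variables over time: take
two aggregate operating points measured over a period (one iops-heavy,
one bandwidth-heavy) and solve the resulting 2x2 linear system for F
and R.  All names and constants below are made up for illustration:

```c
/* Sketch: recover F (fixed cost per IO, seconds) and R (cost per
 * byte, seconds) from two aggregate observations of the device.
 * Each sample is (ios, bytes, total device time) over a period. */
#include <math.h>

struct io_sample {
	double ios, bytes, time;
};

/* Solve  F*ios + R*bytes = time  for both samples (2x2 system). */
static void fit_cost_model(struct io_sample a, struct io_sample b,
			   double *F, double *R)
{
	double det = a.ios * b.bytes - b.ios * a.bytes;

	*F = (a.time * b.bytes - b.time * a.bytes) / det;
	*R = (a.ios * b.time - b.ios * a.time) / det;
}

/* Approximated cost of a single IO of @size bytes. */
static double io_cost(double F, double R, double size)
{
	return F + R * size;
}
```

In practice the samples would be noisy and the fit would have to be a
running estimate (e.g. exponentially weighted least squares) rather
than an exact solve, but the shape of the problem is the same.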

> I understand you probably want something like: get disk total resource,
> predicate resource of each IO, and then use the info to arbitrate
> cgroups. I don't know how it's possible. A disk which uses all its
> resources can still accept new IO queuing. Maybe someday a fancy device
> can export the info.

I don't know exactly how either; however, I don't want a situation
where we implement something just because it's easy, regardless of
whether it's actually useful.  We've done that multiple times in
cgroup and those features tend to become useless baggage which gets
in the way of proper solutions.  Things don't have to be perfect from
the beginning, but at least the abstractions and interfaces we expose
must be relevant to the capability that userland wants.

It isn't uncommon for devices to show close to or over an order of
magnitude difference in bandwidth between 4k random and sequential IO
patterns.  What userland wants is proportional distribution of IO
resources.  I can't see how lumping together numbers that differ by
an order of magnitude could represent that, or anything, really.
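A back-of-the-envelope illustration of why bandwidth alone
misrepresents the device.  The constants are made up but plausible
for a flash device (100us fixed cost per IO, 1ns per byte):

```c
/* Hypothetical cost model constants (assumed, not measured). */
#define FIXED_COST	1e-4	/* F: seconds per IO */
#define BYTE_COST	1e-9	/* R: seconds per byte */

/* Device time consumed per byte for IOs of a given size. */
static double cost_per_byte(double io_size)
{
	return (FIXED_COST + BYTE_COST * io_size) / io_size;
}
```

With these numbers, a byte of 4k random IO costs roughly 23x the
device time of a byte of 1MB sequential IO, so splitting bandwidth
equally between a random-IO cgroup and a sequential-IO cgroup would
hand the former ~23x the actual device resource.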

I understand that it is a difficult and nasty problem but we'll just
have to solve it.  I'll think more about it too.

Thanks.

-- 
tejun

Thread overview: 30+ messages
2016-01-20 17:49 [RFC 0/3] block: proportional based blk-throttling Shaohua Li
2016-01-20 17:49 ` [RFC 1/3] block: estimate disk bandwidth Shaohua Li
2016-01-20 17:49 ` [RFC 2/3] blk-throttling: weight based throttling Shaohua Li
2016-01-21 20:33   ` Vivek Goyal
2016-01-21 21:00     ` Shaohua Li
2016-01-20 17:49 ` [RFC 3/3] blk-throttling: detect inactive cgroup Shaohua Li
2016-01-21 20:44   ` Vivek Goyal
2016-01-21 21:05     ` Shaohua Li
2016-01-21 21:09       ` Vivek Goyal
2016-01-20 19:05 ` [RFC 0/3] block: proportional based blk-throttling Vivek Goyal
2016-01-20 19:34   ` Shaohua Li
2016-01-20 19:40     ` Vivek Goyal
2016-01-20 19:43       ` Shaohua Li
2016-01-20 19:54         ` Vivek Goyal
2016-01-20 21:11         ` Vivek Goyal
2016-01-20 21:34           ` Shaohua Li
2016-01-21 21:10 ` Tejun Heo
2016-01-21 22:24   ` Shaohua Li
2016-01-21 22:41     ` Tejun Heo
2016-01-22  0:00       ` Shaohua Li
2016-01-22 14:48         ` Tejun Heo [this message]
2016-01-22 15:52           ` Vivek Goyal
2016-01-22 18:00             ` Shaohua Li
2016-01-22 19:09               ` Vivek Goyal
2016-01-22 19:45                 ` Shaohua Li
2016-01-22 20:04                   ` Vivek Goyal
2016-01-22 17:57           ` Shaohua Li
2016-01-22 18:08             ` Tejun Heo
2016-01-22 19:11               ` Shaohua Li
2016-01-22 14:43       ` Vivek Goyal
