From: Vivek Goyal <vgoyal@redhat.com>
To: "tj@kernel.org" <tj@kernel.org>
Cc: Chris Mason <clm@fb.com>, "jack@suse.cz" <jack@suse.cz>,
	"lizefan@huawei.com" <lizefan@huawei.com>,
	"gnehzuil.liu@gmail.com" <gnehzuil.liu@gmail.com>,
	"tm@tao.ma" <tm@tao.ma>,
	"lsf-pc@lists.linux-foundation.org"
	<lsf-pc@lists.linux-foundation.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
Subject: Re: [Lsf-pc] [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook
Date: Thu, 2 Jan 2014 12:06:37 -0500	[thread overview]
Message-ID: <20140102170637.GA13276@redhat.com> (raw)
In-Reply-To: <20140102160102.GH11501@htj.dyndns.org>

On Thu, Jan 02, 2014 at 11:01:02AM -0500, tj@kernel.org wrote:
> Hello, Chris, Jan.
> 
> On Thu, Jan 02, 2014 at 03:21:15PM +0000, Chris Mason wrote:
> > On Thu, 2014-01-02 at 07:46 +0100, Jan Kara wrote:
> > > In an ideal world you could compute writeback throughput for each memcg
> > > (and writeback from a memcg would be accounted in a proper blkcg - we would
> > > need unified memcg & blkcg hierarchy for that), take into account the number of
> > > dirty pages in each memcg, and compute dirty rate according to these two
> > > numbers. But whether this can work in practice heavily depends on the memcg
> > > size and how smooth / fair the writeback from different memcgs can be, so
> > > that we don't have excessive stalls and throughput estimation errors...
> > 
> > [ Adding Tejun, Vivek and Li from another thread ]
> > 
> > I do agree that a basket of knobs is confusing and it doesn't really
> > help the admin.
> > 
> > My first idea was a complex system where the controller in the block
> > layer and the BDI flushers all communicated about current usage and
> > cooperated on a single set of reader/writer rates.  I think it could
> > work, but it'll be fragile.
> 
> One thing I do agree with is that bdi would have to play some role.

So is this a separate configuration which can be done per bdi, as
opposed to per device? IOW, throttling offered per cgroup per bdi.
That would help with the case of throttling over NFS too, which some
people have been asking for.
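
To make the distinction concrete, here is roughly how today's
per-device limit is set via the v1 blkio controller, next to what a
per-bdi knob might look like. This is a hedged sketch in C: the first
knob (blkio.throttle.write_bps_device) exists; the second knob and its
format are made up purely to illustrate the idea.

/* Sketch: the existing per-device write-bps throttle vs. a
 * hypothetical per-bdi one. Only the first knob is real. */
#include <stdio.h>
#include <stdlib.h>

static void write_knob(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                exit(1);
        }
        fprintf(f, "%s\n", val);
        fclose(f);
}

int main(void)
{
        /* Existing interface: limit writes to device 8:16 (sdb) to
         * 1 MB/s for every task in cgroup "grp1". */
        write_knob("/sys/fs/cgroup/blkio/grp1/blkio.throttle.write_bps_device",
                   "8:16 1048576");

        /* Hypothetical per-bdi variant: keyed by bdi rather than
         * major:minor, so it could also cover bdis without a block
         * device (e.g. NFS). Made-up knob name and format. */
        write_knob("/sys/fs/cgroup/blkio/grp1/blkio.throttle.write_bps_bdi",
                   "0:23 1048576");

        return 0;
}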

> 
> > But there are a limited number of non-pagecache methods to do IO.  Why
> > not just push the accounting and throttling for O_DIRECT into a new BDI
> > controller idea?  Tejun was just telling me how he'd rather fix the
> > existing controllers than add a new one, but I think we can have a much
> > better admin experience by having a single entry point based on
> > BDIs.
> 
> But if we'll have to make bdis blkcg-aware, I think the better way to
> do it is to split it per cgroup.  That's what's being done in the lower
> layer anyway.  We split request queues to multiple queues according to
> cgroup configuration.  Things which can affect request issue and
> completion, such as request allocation, are also split and each such
> split queue is used for resource provisioning.
> 

So it sounds like re-implementing the throttling infrastructure at the
bdi level (similar to what has been done at the device level)? Of
course we would reuse as much code as possible. But IIUC, the proposal
is that effectively there can be two throttling controllers: one
operating at the bdi level and one operating below it at the device
level?
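
To illustrate what two stacked controllers would mean, a toy token
bucket model in C (purely illustrative; this is not how blk-throttle
is implemented, and all names are made up):

/* An IO must get tokens from the per-cgroup bdi-level bucket first,
 * then from the device-level bucket below it. */
#include <stdbool.h>
#include <stdint.h>

struct bucket {
        uint64_t rate_bps;      /* configured limit */
        uint64_t tokens;        /* bytes currently allowed */
        uint64_t last_fill_ns;  /* last refill timestamp */
};

static void refill(struct bucket *b, uint64_t now_ns)
{
        uint64_t delta_ns = now_ns - b->last_fill_ns;

        b->tokens += b->rate_bps * delta_ns / 1000000000ULL;
        if (b->tokens > b->rate_bps)    /* cap the burst at ~1s worth */
                b->tokens = b->rate_bps;
        b->last_fill_ns = now_ns;
}

/* Returns true if @bytes may be dispatched now. The bdi-level bucket
 * sits above the device-level one, so back pressure can show up at
 * the bdi even while the device itself is idle. */
static bool may_dispatch(struct bucket *bdi_bkt, struct bucket *dev_bkt,
                         uint64_t bytes, uint64_t now_ns)
{
        refill(bdi_bkt, now_ns);
        refill(dev_bkt, now_ns);

        if (bdi_bkt->tokens < bytes || dev_bkt->tokens < bytes)
                return false;   /* caller waits and retries */

        bdi_bkt->tokens -= bytes;
        dev_bkt->tokens -= bytes;
        return true;
}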

And then the writeback logic needs to be modified so that it can
calculate per-bdi, per-cgroup throughput (as opposed to per-bdi
throughput only) and throttle writeback accordingly.
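
Roughly, that bookkeeping would look something like the sketch below:
estimate completed writeback bandwidth for each (bdi, cgroup) pair
instead of for the bdi as a whole. Names are illustrative and the
smoothing is a plain running average (the kernel's per-bdi estimator
is fancier):

#include <stdint.h>

/* Per-(bdi, cgroup) writeback statistics. */
struct wb_cgroup_stats {
        uint64_t written_bytes;    /* writeback completed this period */
        uint64_t period_start_ns;  /* when this period began */
        uint64_t bandwidth_bps;    /* smoothed throughput estimate */
};

/* Called periodically: fold the bytes completed this period into a
 * smoothed per-cgroup bandwidth estimate, which balance_dirty_pages()
 * could then use to pace dirtiers in that cgroup. */
static void update_cgroup_bandwidth(struct wb_cgroup_stats *s,
                                    uint64_t now_ns)
{
        uint64_t elapsed_ns = now_ns - s->period_start_ns;
        uint64_t bw;

        if (!elapsed_ns)
                return;
        bw = s->written_bytes * 1000000000ULL / elapsed_ns;
        s->bandwidth_bps = (7 * s->bandwidth_bps + bw) / 8;
        s->written_bytes = 0;
        s->period_start_ns = now_ns;
}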

Chris, in the past multiple implementations were proposed for a
separate writeback knob. A separate knob was frowned upon, so I also
proposed an implementation which throttles tasks in
balance_dirty_pages() based on the write limit configured on the
device.

https://lkml.org/lkml/2011/6/28/243

This approach assumed, though, that there is an associated block device.
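
Schematically, that proposal throttled a dirtying task like the
fragment below: sleep long enough that the task's dirtying rate stays
under the limit configured on the backing device. This is a
simplification for illustration, not the code from the patch series:

#include <stdint.h>
#include <unistd.h>

#define PAGE_SIZE       4096ULL
#define MAX_PAUSE_US    200000ULL  /* cap each pause; pay long debts in installments */

static void throttle_dirtier(uint64_t limit_bps, unsigned int nr_dirtied)
{
        uint64_t bytes = (uint64_t)nr_dirtied * PAGE_SIZE;
        uint64_t pause_us;

        if (!limit_bps)
                return;                 /* no limit configured */
        pause_us = bytes * 1000000ULL / limit_bps;
        if (pause_us > MAX_PAUSE_US)
                pause_us = MAX_PAUSE_US;
        if (pause_us)
                usleep((useconds_t)pause_us);
}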

Building back pressure from the device/bdi and adjusting writeback
accordingly (by making it cgroup aware) is complicated, but once it
works, I guess it might turn out to be a reasonable approach in the
long term.

Thanks
Vivek

Thread overview: 21+ messages
2013-12-30 21:36 [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook Chris Mason
2013-12-31  8:49 ` Zheng Liu
2013-12-31  9:36   ` Jeff Liu
2013-12-31 12:45   ` [Lsf-pc] " Jan Kara
2013-12-31 13:19     ` Chris Mason
2013-12-31 14:22       ` Tao Ma
2013-12-31 15:34         ` Chris Mason
2014-01-02  6:46           ` Jan Kara
2014-01-02 15:21             ` Chris Mason
2014-01-02 16:01               ` tj
2014-01-02 16:14                 ` tj
2014-01-03  6:03                   ` Jan Kara
2014-01-02 17:06                 ` Vivek Goyal [this message]
2014-01-02 17:10                   ` tj
2014-01-02 19:11                     ` Chris Mason
2014-01-03  6:39                       ` Jan Kara
2014-01-02 18:27                 ` James Bottomley
2014-01-02 18:36                   ` tj
2014-01-03  7:44                     ` James Bottomley
2014-01-08 15:04       ` Mel Gorman
2014-01-08 16:14         ` Chris Mason
