Re: [Lsf-pc] [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook

From: Chris Mason <clm@fb.com>
To: "jack@suse.cz" <jack@suse.cz>
Cc: "vgoyal@redhat.com" <vgoyal@redhat.com>,
	"tj@kernel.org" <tj@kernel.org>,
	"lizefan@huawei.com" <lizefan@huawei.com>,
	"gnehzuil.liu@gmail.com" <gnehzuil.liu@gmail.com>,
	"tm@tao.ma" <tm@tao.ma>,
	"lsf-pc@lists.linux-foundation.org"
	<lsf-pc@lists.linux-foundation.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
Subject: Re: [Lsf-pc] [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook
Date: Thu, 2 Jan 2014 15:21:15 +0000
Message-ID: <1388676106.24668.14.camel@ret.masoncoding.com>
In-Reply-To: <20140102064659.GF11920@quack.suse.cz>

On Thu, 2014-01-02 at 07:46 +0100, Jan Kara wrote:
> On Tue 31-12-13 15:34:40, Chris Mason wrote:
> > On Tue, 2013-12-31 at 22:22 +0800, Tao Ma wrote:
> > > Hi Chris,
> > > On 12/31/2013 09:19 PM, Chris Mason wrote:
> > >  
> > > > So I'd like to throttle the rate at which dirty pages are created,
> > > > preferably based on the rates currently calculated in the BDI of how
> > > > quickly the device is doing IO.  This way we can limit dirty creation to
> > > > a percentage of the disk capacity during the current workload
> > > > (regardless of random vs buffered).
> > > Fengguang had already done some work on this, but it seems that the
> > > > community doesn't have a consensus on where this control file should go.
> > >  You can look at this link: https://lkml.org/lkml/2011/4/4/205
> > 
> > I had forgotten Wu's patches here, it's very close to the starting point
> > I was hoping for.
>   I specifically don't like those patches because throttling pagecache
> dirty rate is IMHO rather poor interface. What people want to do is to
> limit IO from a container. That means reads & writes, buffered & direct IO.
> So dirty rate is just one of several things which contribute to total IO
> rate. When you have both direct IO & buffered IO happening in the container
> they influence each other, so a dirty rate of 50 MB/s may be fine when nothing
> else is going on in the container but may be far too much for the system if
> there are heavy direct IO reads happening as well.
> 
> So you really need to tune the limit on the dirty rate depending on how
> fast the writeback can happen (which is what current IO-less throttling
> does), not based on some hard throughput number like
> 50 MB/s (which is what Fengguang's patches did if I remember right).
> 
> What could work a tad bit better (and that seems to be something you are
> proposing) is to have a weight for each memcg and each memcg would be
> allowed to dirty at a rate proportional to its weight * writeback
> throughput. But this still has a couple of problems:
> 1) This doesn't take into account local situation in a memcg - for memcg
>    full of dirty pages you want to throttle dirtying much more than for a
>    memcg which has no dirty pages.
> 2) Flusher thread (or workqueue these days) doesn't know anything about
>    memcgs. So it can happily flush a memcg which is relatively OK for a
>    rather long time while some other memcg is full of dirty pages and
>    struggling to make any progress.
> 3) This will be somewhat unfair since the total IO allowed to happen from a
>    container will depend on whether you are doing only reads (or DIO), only
>    writes or both reads & writes.
> 
> In an ideal world you could compute writeback throughput for each memcg
> (and writeback from a memcg would be accounted in a proper blkcg - we would
> need unified memcg & blkcg hierarchy for that), take into account number of
> dirty pages in each memcg, and compute dirty rate according to these two
> numbers. But whether this can work in practice heavily depends on the memcg
> size and how smooth / fair can the writeback from different memcgs be so
> that we don't have excessive stalls and throughput estimation errors...

[ Adding Tejun, Vivek and Li from another thread ]

I do agree that a basket of knobs is confusing and it doesn't really
help the admin.

My first idea was a complex system where the controller in the block
layer and the BDI flushers all communicated about current usage and
cooperated on a single set of reader/writer rates.  I think it could
work, but it'll be fragile.

But there are a limited number of non-pagecache methods to do IO.  Why
not just push the accounting and throttling for O_DIRECT into a new BDI
controller idea?  Tejun was just telling me how he'd rather fix the
existing controllers than add a new one, but I think we can have a much
better admin experience by having a single entry point based on
BDIs.

-chris
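
For readers following the quoted proposal, the weight-proportional idea
("weight * writeback throughput") can be sketched roughly as below. This is
only an illustration and not code from Fengguang's patches or from the
kernel: the function names, the normalization by total weight, and the
linear "headroom" scaling for a memcg that is already full of dirty pages
are all assumptions made for the example.

```python
def weighted_dirty_rates(writeback_bps, weights):
    """Split a measured writeback throughput among memcgs by weight.

    writeback_bps: estimated device writeback throughput (bytes/sec).
    weights: dict mapping a (hypothetical) memcg name to its weight.
    Normalizing by the sum of weights is an assumption of this sketch.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return {cg: writeback_bps * w / total_weight for cg, w in weights.items()}


def scale_by_dirty_headroom(rate_bps, dirty_pages, dirty_limit):
    """Throttle harder as a memcg fills up with dirty pages.

    Hypothetical linear scaling: a memcg at or above its dirty limit
    gets a rate of zero, an empty one keeps its full share.
    """
    if dirty_limit <= 0:
        raise ValueError("dirty_limit must be positive")
    headroom = max(0.0, 1.0 - dirty_pages / dirty_limit)
    return rate_bps * headroom


# Example: 100 MiB/s of writeback shared between two memcgs, 3:1.
rates = weighted_dirty_rates(100 * 2**20, {"a": 300, "b": 100})
```

The second helper only addresses the first problem in the quoted list (the
local situation in a memcg); it does nothing about flusher fairness or about
reads and direct IO competing for the same device.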

Thread overview: 21+ messages
2013-12-30 21:36 [LSF/MM ATTEND] Filesystems -- Btrfs, cgroups, Storage topics from Facebook Chris Mason
2013-12-31  8:49 ` Zheng Liu
2013-12-31  9:36   ` Jeff Liu
2013-12-31 12:45   ` [Lsf-pc] " Jan Kara
2013-12-31 13:19     ` Chris Mason
2013-12-31 14:22       ` Tao Ma
2013-12-31 15:34         ` Chris Mason
2014-01-02  6:46           ` Jan Kara
2014-01-02 15:21             ` Chris Mason [this message]
2014-01-02 16:01               ` tj
2014-01-02 16:14                 ` tj
2014-01-03  6:03                   ` Jan Kara
2014-01-02 17:06                 ` Vivek Goyal
2014-01-02 17:10                   ` tj
2014-01-02 19:11                     ` Chris Mason
2014-01-03  6:39                       ` Jan Kara
2014-01-02 18:27                 ` James Bottomley
2014-01-02 18:36                   ` tj
2014-01-03  7:44                     ` James Bottomley
2014-01-08 15:04       ` Mel Gorman
2014-01-08 16:14         ` Chris Mason
