linux-fsdevel.vger.kernel.org archive mirror
From: Chris Mason <chris.mason@oracle.com>
To: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>,
	Chad Talbott <ctalbott@google.com>,
	James Bottomley <james.bottomley@hansenpartnership.com>,
	lsf <lsf@lists.linux-foundation.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF)
Date: Thu, 31 Mar 2011 10:34:03 -0400
Message-ID: <1301581251-sup-987@think>
In-Reply-To: <20110331141637.GA11139@redhat.com>

Excerpts from Vivek Goyal's message of 2011-03-31 10:16:37 -0400:
> On Thu, Mar 31, 2011 at 09:20:02AM +1100, Dave Chinner wrote:
> 
> [..]
> > > It should not happen that flusher
> > > thread gets blocked somewhere (trying to get request descriptors on
> > > request queue)
> > 
> > A major design principle of the bdi-flusher threads is that they
> > are supposed to block when the request queue gets full - that's how
> > we got rid of all the congestion garbage from the writeback
> > stack.
> 
> Instead of blocking flusher threads, can they voluntarily stop submitting
> more IO when they realize too much IO is in progress? We already keep
> stats of how much IO is under writeback on the bdi (BDI_WRITEBACK), and
> the flusher thread can use that?

We could, but the difficult part is keeping the hardware saturated as
requests complete.  The voluntarily stopping part is pretty much the
same thing the congestion code was trying to do.
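
A toy sketch of why the voluntary approach struggles to keep the device
busy. This is not kernel code: the function name, the one-completion-per-tick
device, and the periodic check are all assumptions made up for illustration.
A submitter that only looks at the in-flight count every so often leaves the
device idle whenever the queue drains between checks.

```python
def utilisation(limit, poll_interval, ticks=1000):
    """Fraction of ticks the toy device spends busy.

    limit         -- in-flight requests the submitter tops up to (hypothetical)
    poll_interval -- ticks between the submitter's checks (hypothetical)
    The device completes one request per tick when it has work.
    """
    in_flight = 0
    busy = 0
    for t in range(ticks):
        # The submitter only reacts at its polling points.
        if t % poll_interval == 0:
            in_flight = max(in_flight, limit)
        # The device retires one request per tick, or sits idle.
        if in_flight > 0:
            in_flight -= 1
            busy += 1
    return busy / ticks


# The queue drains halfway through each polling period.
half_idle = utilisation(limit=4, poll_interval=8)
# The queue is deep enough to cover the gap between checks.
saturated = utilisation(limit=8, poll_interval=8)
print(half_idle, saturated)
```

In this model the device only stays saturated when the limit covers the gap
between checks, which is the tuning problem a blocking submitter avoids by
being woken as soon as a request slot frees up.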

> 
> Jens mentioned the idea of getting rid of this request accounting
> at the request queue level and moving it somewhere higher up, say to the bdi level.
> 
> > 
> > There are plans to move the bdi-flusher threads to work queues, and
> > once that is done all your concerns about blocking and parallelism
> > are pretty much gone because it's trivial to have multiple writeback
> > works in progress at once on the same bdi with that infrastructure.
> 
> Won't this essentially nullify the advantage of IO-less throttling?
> I thought that we did not want to have multiple threads doing writeback
> at the same time, to reduce the number of seeks and achieve better throughput.

Work queues alone are probably not appropriate, at least for spinning
storage.  It will introduce seeks into what would have been
sequential writes.  I had to make the btrfs worker thread pools after
having a lot of trouble cramming writeback into work queues.

> 
> Now with this I am assuming that multiple works can be in progress doing
> writeback. Maybe we can limit writeback work to one per group so in global
> context only one work will be active.
> 
> > 
> > > or it tries to dispatch too much IO from an inode which
> > > primarily contains pages from low prio cgroup and high prio cgroup
> > > task does not get enough pages dispatched to device hence not getting
> > > any prio over low prio group.
> > 
> > That's a writeback scheduling issue independent of how we throttle,
> > and something we don't do at all right now. Our only decision on
> > what to write back is based on how long ago the inode was dirtied.
> > You need to completely rework the dirty inode tracking if you want
> > to efficiently prioritise writeback between different groups.
> > 
> > Given that filesystems don't all use the VFS dirty inode tracking
> > infrastructure and specific filesystems have different ideas of the
> > order of writeback, you've got a really difficult problem there.
> > e.g. ext3/4 and btrfs use ordered writeback for filesystem integrity
> > purposes which will completely screw any sort of prioritised
> > writeback. Remember the ext3 "fsync = global sync" latency problems?
> 
> Ok, so if one issues a fsync when filesystem is mounted in "data=ordered"
> mode we will flush all the writes to disk before committing meta data.
> 
> I have no knowledge of filesystem code so here comes a stupid question.
> Do multiple fsyncs get completely serialized or they can progress in
> parallel? IOW, if a fsync is in progress and we slow down the writeback
> of that inode's pages, can other fsync still make progress without
> getting stuck behind the previous fsync?

An fsync has two basic parts:

1) write the file data pages
2a) flush data=ordered in reiserfs/ext3/4
2b) do the real transaction commit


We can do part one in parallel across any number of writers.  For part
two, there is only one running transaction.  If the FS is smart, the
commit will only force down the transaction that last modified the
file. 50 procs running fsync may only need to trigger one commit.
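
A minimal sketch of that last point, assuming a toy journal with a single
running transaction. The class and its fields are invented for illustration
and do not correspond to any real filesystem's code; callers run one after
another here, and part one (writing the data pages) is not modelled at all.

```python
class ToyJournal:
    """Toy model: one running transaction at a time; commits are counted."""

    def __init__(self):
        self.running_tid = 1    # id of the single running transaction
        self.committed_tid = 0  # highest transaction id already committed
        self.commits = 0        # number of commits actually performed
        self.last_tid = {}      # file name -> transaction that last modified it

    def modify(self, name):
        # Record which transaction last touched this file.
        self.last_tid[name] = self.running_tid

    def fsync(self, name):
        """Return True if this call had to trigger a commit."""
        tid = self.last_tid.get(name, 0)
        if tid <= self.committed_tid:
            # The transaction that last modified the file is already
            # committed, so there is nothing left to force down.
            return False
        # Commit the running transaction and open a new one.
        self.committed_tid = self.running_tid
        self.running_tid += 1
        self.commits += 1
        return True


journal = ToyJournal()
names = [f"file{i}" for i in range(50)]
for name in names:
    journal.modify(name)
results = [journal.fsync(name) for name in names]
print(journal.commits)
```

All 50 files were last modified in the same transaction, so only the first
fsync in this model pays for a commit and the rest find their transaction
already committed.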

btrfs and xfs do data=ordered differently.  They still avoid exposing
stale data but we don't pull the plug on the whole bathtub for every
commit.  In the btrfs case, we don't update metadata until the data is
written, so commits never have to force data writes.  xfs does something
lighter weight but with similar benefits.

ext4 with delayed allocation on and data=ordered will only end up
forcing down writes that are not under delayed allocation.  This is a
much smaller subset of the IO than ext3/reiserfs will do.

> 
> For me knowing this is also important in another context of absolute IO
> throttling.
> 
> - If a fsync is in progress and gets throttled at the device, what impact
>   does it have on other file system operations? What gets serialized behind it?

It depends.  atime updates log inodes, logging needs a transaction, and
transactions sometimes need to wait for the last transaction to
finish.  So it's very possible you'll make anything using the FS appear
to stop.

-chris

