linux-fsdevel.vger.kernel.org archive mirror
From: Chris Mason <chris.mason@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Vivek Goyal <vgoyal@redhat.com>,
	Chad Talbott <ctalbott@google.com>,
	James Bottomley <james.bottomley@hansenpartnership.com>,
	lsf <lsf@lists.linux-foundation.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF)
Date: Thu, 31 Mar 2011 19:43:27 -0400	[thread overview]
Message-ID: <1301614603-sup-6349@think> (raw)
In-Reply-To: <20110331221425.GB2904@dastard>

Excerpts from Dave Chinner's message of 2011-03-31 18:14:25 -0400:
> On Thu, Mar 31, 2011 at 10:34:03AM -0400, Chris Mason wrote:
> > Excerpts from Vivek Goyal's message of 2011-03-31 10:16:37 -0400:
> > > On Thu, Mar 31, 2011 at 09:20:02AM +1100, Dave Chinner wrote:
> > > 
> > > [..]
> > > > > It should not happen that flusher
> > > > > thread gets blocked somewhere (trying to get request descriptors on
> > > > > request queue)
> > > > 
> > > > A major design principle of the bdi-flusher threads is that they
> > > > are supposed to block when the request queue gets full - that's how
> > > > we got rid of all the congestion garbage from the writeback
> > > > stack.
> > > 
> > > Instead of blocking the flusher threads, can they voluntarily stop
> > > submitting more IO when they realize too much IO is already in
> > > progress? We already keep stats of how much IO is under writeback
> > > on the bdi (BDI_WRITEBACK), and the flusher thread could use that.
> > 
> > We could, but the difficult part is keeping the hardware saturated as
> > requests complete.  The voluntary-stop part is pretty much the same
> > thing the congestion code was trying to do.
> 
> And it was the bit that was causing most problems. IMO, we don't want to
> go back to that single threaded mechanism, especially as we have
> no shortage of cores and threads available...

Getting rid of the congestion code was my favorite part of the per-bdi
work.

> 
> > > > There are plans to move the bdi-flusher threads to work queues, and
> > > > once that is done all your concerns about blocking and parallelism
> > > > are pretty much gone because it's trivial to have multiple writeback
> > > > works in progress at once on the same bdi with that infrastructure.
> > > 
> > > Will this essentially not nullify the advantage of IO less throttling?
> > > I thought that we did not want to have multiple threads doing writeback
> > > at the same time, to reduce the number of seeks and achieve better
> > > throughput.
> > 
> > Work queues alone are probably not appropriate, at least for spinning
> > storage.  It will introduce seeks into what would have been
> > sequential writes.  I had to make the btrfs worker thread pools after
> > having a lot of trouble cramming writeback into work queues.
> 
> That was before the cmwq infrastructure, right? cmwq changes the
> behaviour of workqueues in such a way that they can simply be
> thought of as having a thread pool of a specific size....
> 
> As a strict translation of the existing one-flusher-thread-per-bdi
> model, allowing only one work at a time to be issued (i.e. a workqueue
> concurrency of 1) would give the same behaviour without having all
> the thread management issues. i.e. regardless of the writeback
> parallelism mechanism we have the same issue of managing writeback
> to minimise seeking. cmwq just makes the implementation far simpler,
> IMO.
> 
> As to whether that causes seeks or not, that depends on how we are
> driving the concurrent works/threads. If we drive a concurrent work
> per dirty cgroup that needs writing back, then we achieve the
> concurrency needed to make the IO scheduler appropriately throttle
> the IO. For the case of no cgroups, then we still only have a single
> writeback work in progress at a time and behaviour is no different
> to the current setup. Hence I don't see any particular problem with
> using workqueues to achieve the necessary writeback parallelism that
> cgroup aware throttling requires....

Yes, as long as we aren't trying to shotgun-style spread the
inodes across a bunch of threads, it should work well enough.  The trick
will just be making sure we don't end up with a lot of inode
interleaving in the delalloc allocations.

> 
> > > > > or it tries to dispatch too much IO from an inode which
> > > > > primarily contains pages from low prio cgroup and high prio cgroup
> > > > > task does not get enough pages dispatched to device hence not getting
> > > > > any prio over low prio group.
> > > > 
> > > > That's a writeback scheduling issue independent of how we throttle,
> > > > and something we don't do at all right now. Our only decision on
> > > > what to write back is based on how long ago the inode was dirtied.
> > > > You need to completely rework the dirty inode tracking if you want
> > > > to efficiently prioritise writeback between different groups.
> > > > 
> > > > Given that filesystems don't all use the VFS dirty inode tracking
> > > > infrastructure and specific filesystems have different ideas of the
> > > > order of writeback, you've got a really difficult problem there.
> > > > e.g. ext3/4 and btrfs use ordered writeback for filesystem integrity
> > > > purposes which will completely screw any sort of prioritised
> > > > writeback. Remember the ext3 "fsync = global sync" latency problems?
> > > 
> > > Ok, so if one issues an fsync when the filesystem is mounted in
> > > "data=ordered" mode, we will flush all the data writes to disk before
> > > committing metadata.
> > > 
> > > I have no knowledge of filesystem code so here comes a stupid question.
> > > Do multiple fsyncs get completely serialized, or can they progress in
> > > parallel? IOW, if an fsync is in progress and we slow down the
> > > writeback of that inode's pages, can other fsyncs still make progress
> > > without getting stuck behind the previous one?
> > 
> > An fsync has two basic parts
> > 
> > 1) write the file data pages
> > 2a) flush the data=ordered writes in reiserfs/ext3/4
> > 2b) do the real transaction commit
> > 
> > 
> > We can do part one in parallel across any number of writers.  For part
> > two, there is only one running transaction.  If the FS is smart, the
> > commit will only force down the transaction that last modified the
> > file. 50 procs running fsync may only need to trigger one commit.
> 
> Right. However the real issue here, I think, is that the IO comes
> from a thread not associated with writeback nor is in any way cgroup
> aware. IOWs, getting the right context to each block being written
> back will be complex and filesystem specific.

The ext3-style data=ordered mode requires that we give the same amount
of bandwidth to all of the data=ordered IO during the commit.  Otherwise
we end up making the commit wait on some poor page in the data=ordered
list, and that slows everyone down.  ick.

> 
> The other thing that concerns me is how metadata IO is accounted and
> throttled. Doing stuff like creating lots of small files will
> generate as much or more metadata IO than data IO, and none of that
> will be associated with a cgroup. Indeed, in XFS metadata doesn't
> even use the pagecache anymore, and it's written back by a thread
> (soon to be a workqueue) deep inside XFS's journalling subsystem, so
> it's pretty much impossible to associate that IO with any specific
> cgroup.
> 
> What happens to that IO? Blocking it arbitrarily can have the same
> effect as blocking transaction completion - it can cause the
> filesystem to completely stop....

ick again, it's the same problem as the data=ordered stuff exactly.

-chris
