Re: regression in page writeback

public inbox for linux-kernel@vger.kernel.org

From: Wu Fengguang <fengguang.wu@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Theodore Tso <tytso@mit.edu>,
	Christoph Hellwig <hch@infradead.org>,
	Dave Chinner <david@fromorbit.com>,
	Chris Mason <chris.mason@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	"Li, Shaohua" <shaohua.li@intel.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"richard@rsk.demon.co.uk" <richard@rsk.demon.co.uk>,
	"jens.axboe@oracle.com" <jens.axboe@oracle.com>
Subject: Re: regression in page writeback
Date: Tue, 6 Oct 2009 21:18:40 +0800
Message-ID: <20091006131840.GA14111@localhost>
In-Reply-To: <20091006125519.GB22781@duck.suse.cz>

On Tue, Oct 06, 2009 at 08:55:19PM +0800, Jan Kara wrote:
> On Fri 02-10-09 11:27:14, Wu Fengguang wrote:
> > On Fri, Oct 02, 2009 at 06:17:39AM +0800, Jan Kara wrote:
> > > On Wed 30-09-09 13:32:23, Wu Fengguang wrote:
> > > > writeback: bump up writeback chunk size to 128MB
> > > > 
> > > > Adjust the writeback call stack to support a larger writeback chunk size.
> > > > 
> > > > - make wbc.nr_to_write a per-file parameter
> > > > - init wbc.nr_to_write with MAX_WRITEBACK_PAGES=128MB
> > > >   (proposed by Ted)
> > > > - add wbc.nr_segments to limit seeks inside a sparsely dirtied file
> > > >   (proposed by Chris)
> > > > - add wbc.timeout which will be used to control IO submission time
> > > >   either per-file or globally.
> > > >   
> > > > The wbc.nr_segments is now determined purely by logical page index
> > > > distance: if two pages are 1MB apart, a new segment is started.
> > > > 
> > > > Filesystems could do this better with real extent knowledge.
> > > > One possible scheme is to record the previous page index in
> > > > wbc.writeback_index, and let ->writepage check whether the current and
> > > > previous pages lie in the same extent, and decrease wbc.nr_segments
> > > > accordingly. Care should be taken to avoid double decreases in writepage
> > > > and write_cache_pages.
> > > > 
> > > > The wbc.timeout (when used per-file) is mainly a safeguard against slow
> > > > devices, which may take too long to sync 128MB of data.
> > > > 
> > > > The wbc.timeout (when used globally) could be useful when we decide to
> > > > do two sync scans on dirty pages and dirty metadata. XFS could say:
> > > > please return to sync dirty metadata after 10s. This would need another
> > > > b_io_metadata queue, but that's possible.
> > > > 
> > > > This work depends on the balance_dirty_pages() wait queue patch.
> > >   I don't know, I think it gets too complicated... I'd either use the
> > > segments idea or the timeout idea but not both (unless you can find real
> > > world tests in which both help).
>   I'm sorry for the delayed reply, but I had to work on something else.
> 
> > Maybe complicated, but nr_segments and timeout each have their own target
> > application. nr_segments serves two major purposes:
> > - fairness between two large files, one continuously dirtied and
> >   the other sparsely dirtied. Given the same amount of dirty pages,
> >   they could take vastly different amounts of time to sync to the _same_
> >   device. The nr_segments check helps to favor continuous data.
> > - avoid seeks/fragmentation. To give each file a fair chance of
> >   writeback, we have to abort a file when some nr_to_write or timeout
> >   is reached. However, neither is a good abort condition.
> >   The best is for the filesystem to abort earlier, at seek boundaries,
> >   and treat nr_to_write/timeout only as generous backstop limits.
> > timeout is mainly a safeguard in case nr_to_write is too large for
> > slow devices. It is not necessary if nr_to_write is auto-computed;
> > however, timeout by itself serves as a simple throughput-adapting
> > scheme.
>   I understand why you have introduced both segments and timeout values
> and I completely agree with your reasons for introducing them. I just think
> that when the system gets too complex (there will be several independent
> methods of determining when writeback should be terminated, and even
> though each method is simple on its own, their interactions needn't be
> simple...) it will be hard to debug all the corner cases - all the more so
> because they will manifest "just" as slow or unfair writeback. So I'd

I definitely agree about the complications. There are some known issues,
as well as possibly some corner cases yet to be discovered. One problem I
noticed just now: what if all the files are sparsely dirtied? Then
a small nr_segments can only hurt. Another problem is that the block
device file tends to have sparsely dirtied pages (with metadata on
them). I'm not sure how to detect/handle such conditions.
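
To make the sparse-file concern concrete, here is a minimal user-space sketch of the segment rule from the quoted patch description (a new segment whenever two dirty pages are 1MB apart). It assumes 4KB pages, treats "1MB apart" as a gap of 256 or more page indices, and takes a sorted array of dirty page indices. count_segments(), pages_within_budget() and SEGMENT_GAP_PAGES are hypothetical names for illustration, not code from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4KB pages, so the 1MB segment gap is 256 page indices. */
#define SKETCH_PAGE_SHIFT  12
#define SEGMENT_GAP_PAGES  ((1UL << 20) >> SKETCH_PAGE_SHIFT)

/*
 * Count segments in a sorted array of dirty page indices: a new segment
 * starts whenever the next dirty page lies SEGMENT_GAP_PAGES or more
 * beyond the previous one.
 */
static unsigned long count_segments(const unsigned long *idx, size_t n)
{
	unsigned long segments = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		assert(i == 0 || idx[i] > idx[i - 1]);  /* sorted, no duplicates */
		if (i == 0 || idx[i] - idx[i - 1] >= SEGMENT_GAP_PAGES)
			segments++;
	}
	return segments;
}

/*
 * Number of dirty pages written before a budget of nr_segments runs out:
 * the page that would open one segment too many is not written.
 */
static size_t pages_within_budget(const unsigned long *idx, size_t n,
				  long nr_segments)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (i == 0 || idx[i] - idx[i - 1] >= SEGMENT_GAP_PAGES) {
			if (nr_segments <= 0)
				return i;
			nr_segments--;
		}
	}
	return n;
}
```

With a budget of 2 segments, a file dirtied at pages 0..9 is written completely, while one dirtied at pages 0, 256, 512 and 768 stops after two pages. If every file looks like the second one, the budget just cuts each pass short, which is one way to read the concern above.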

> prefer a single metric to determine when to stop writeback of an inode
> even though it might be a bit more complicated.
>   For example, terminating on writeout does not really give a file a fair
> chance of writeback, because it might have been blocked just because we were
> writing some heavily fragmented file just before. And your nr_segments

You mean timeout? I've dropped that idea in favor of an nr_to_write
that adapts to the bdi write speed :)
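
A minimal sketch of an nr_to_write that adapts to the bdi write speed, assuming a bandwidth estimate in bytes per second is already available. Only the 128MB ceiling comes from the quoted patch; the 500ms per-file time budget and the 4MB floor are hypothetical tuning values, and adaptive_nr_to_write() is an illustrative name, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE           4096UL
/* Ceiling from the quoted patch: 128MB, expressed in 4KB pages. */
#define SKETCH_MAX_WRITEBACK_PAGES ((128UL << 20) / SKETCH_PAGE_SIZE)
/* Hypothetical floor so very slow devices still get a useful chunk: 4MB. */
#define SKETCH_MIN_WRITEBACK_PAGES ((4UL << 20) / SKETCH_PAGE_SIZE)
/* Hypothetical time budget for writing one file in one pass. */
#define SKETCH_TARGET_MSECS        500UL

/* Pages the device can write in about SKETCH_TARGET_MSECS, clamped. */
static unsigned long adaptive_nr_to_write(uint64_t bw_bytes_per_sec)
{
	uint64_t pages = bw_bytes_per_sec / SKETCH_PAGE_SIZE *
			 SKETCH_TARGET_MSECS / 1000;

	assert(SKETCH_MIN_WRITEBACK_PAGES <= SKETCH_MAX_WRITEBACK_PAGES);
	if (pages < SKETCH_MIN_WRITEBACK_PAGES)
		return SKETCH_MIN_WRITEBACK_PAGES;
	if (pages > SKETCH_MAX_WRITEBACK_PAGES)
		return SKETCH_MAX_WRITEBACK_PAGES;
	return (unsigned long)pages;
}
```

For example, a disk measured at 100MB/s gets 12800 pages (50MB) per pass, while anything faster than about 256MB/s hits the 128MB cap. Because slow devices automatically get smaller chunks, a budget like this covers what the per-file timeout safeguard was for.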

> check is just a rough guess of whether a writeback is going to be
> fragmented or not.

It could be made accurate if btrfs decreased it in its own ->writepages,
based on the extent info. That should also be possible for ext4.
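
The extent-aware variant could look roughly like this, following the scheme in the quoted patch description: remember the previous page index, and decrease nr_segments only when the current page falls into a different extent. Everything here is hypothetical: struct sketch_wbc stands in for writeback_control (prev_index plays the role the patch assigns to wbc.writeback_index), and extents are a plain array rather than a real btrfs/ext4 extent lookup. Only one layer (the filesystem, in this sketch) does the decrement, to avoid the double decreases the patch warns about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical extent: file pages [start, start + len) are contiguous on disk. */
struct sketch_extent {
	unsigned long start;
	unsigned long len;
};

/* Hypothetical stand-in for the relevant writeback_control fields. */
struct sketch_wbc {
	long nr_segments;         /* remaining segment budget */
	unsigned long prev_index; /* last page accounted */
	bool have_prev;           /* prev_index is valid */
};

/* Index of the extent containing page idx, or -1 if it is unmapped. */
static long extent_of(const struct sketch_extent *ext, size_t n,
		      unsigned long idx)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (idx >= ext[i].start && idx - ext[i].start < ext[i].len)
			return (long)i;
	return -1;
}

/*
 * Per-page accounting a filesystem ->writepage could do. Returns false
 * when the page would open a new segment but the budget is exhausted,
 * i.e. writeback of this file should stop here. All unmapped pages map
 * to extent -1, a simplification for the sketch.
 */
static bool fs_account_page(struct sketch_wbc *wbc,
			    const struct sketch_extent *ext, size_t n,
			    unsigned long idx)
{
	bool new_segment;

	assert(wbc != NULL);
	new_segment = !wbc->have_prev ||
		      extent_of(ext, n, idx) != extent_of(ext, n, wbc->prev_index);
	if (new_segment) {
		if (wbc->nr_segments <= 0)
			return false;
		wbc->nr_segments--;
	}
	wbc->prev_index = idx;
	wbc->have_prev = true;
	return true;
}
```

With extents {0,100}, {100,50} and {1000,10}, pages 99 and 100 are logically adjacent, so the page-distance rule would treat them as one segment, whereas the extent check sees a physical discontinuity and charges a second segment.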

>   So I'd rather implement in mpage_ functions a proper detection of how
> fragmented the writeback is, and give each inode a limit on the number of
> fragments which the mpage_ functions would obey. We could even use a queue's
> NONROT flag (set for solid state disks) to detect whether we should expect
> higher or lower seek times.

Yes, mpage_* can also utilize nr_segments.
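
Along the lines of the mpage_ suggestion quoted above, a per-inode fragment limit could count physical discontinuities and pick the limit from the queue's non-rotational flag. This sketch assumes one block per page and takes a plain nonrot boolean instead of testing the real queue flag; the limits 16 and 1024, and all names, are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical budgets: seeks are cheap on non-rotational devices. */
#define ROT_FRAGMENT_LIMIT     16
#define NONROT_FRAGMENT_LIMIT  1024

static long fragment_budget(bool nonrot)
{
	return nonrot ? NONROT_FRAGMENT_LIMIT : ROT_FRAGMENT_LIMIT;
}

/*
 * Given the on-disk block of each page in writeback order (one block per
 * page assumed), return how many pages can be submitted before the
 * fragment budget runs out. A fragment starts at every physical
 * discontinuity.
 */
static size_t pages_before_fragment_limit(const unsigned long long *blocks,
					  size_t n, long budget)
{
	size_t i;

	assert(n == 0 || blocks != NULL);
	for (i = 0; i < n; i++) {
		if (i == 0 || blocks[i] != blocks[i - 1] + 1) {
			if (budget <= 0)
				return i;
			budget--;
		}
	}
	return n;
}
```

Counting physical discontinuities rather than page-index gaps roughly matches the point where mpage code has to submit the current bio and start a new one; the threshold values themselves would need real measurement.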

Anyway, nr_segments is not perfect. I'll post a patch and let fs
developers decide whether it is convenient/useful :)

Thanks,
Fengguang

Thread overview: 78+ messages
2009-09-22  5:49 regression in page writeback Shaohua Li
2009-09-22  6:40 ` Peter Zijlstra
2009-09-22  8:05   ` Wu Fengguang
2009-09-22  8:09     ` Peter Zijlstra
2009-09-22  8:24       ` Wu Fengguang
2009-09-22  8:32         ` Peter Zijlstra
2009-09-22  8:51           ` Wu Fengguang
2009-09-22  8:52           ` Richard Kennedy
2009-09-22  9:05             ` Wu Fengguang
2009-09-22 11:41               ` Shaohua Li
2009-09-22 15:52           ` Chris Mason
2009-09-23  0:22             ` Wu Fengguang
2009-09-23  0:54               ` Andrew Morton
2009-09-23  1:17                 ` Wu Fengguang
2009-09-23  1:27                   ` Wu Fengguang
2009-09-23  1:28                   ` Andrew Morton
2009-09-23  1:32                     ` Wu Fengguang
2009-09-23  1:47                       ` Andrew Morton
2009-09-23  2:01                         ` Wu Fengguang
2009-09-23  2:09                           ` Andrew Morton
2009-09-23  3:07                             ` Wu Fengguang
2009-09-23  1:45                     ` Wu Fengguang
2009-09-23  1:59                       ` Andrew Morton
2009-09-23  2:26                         ` Wu Fengguang
2009-09-23  2:36                           ` Andrew Morton
2009-09-23  2:49                             ` Wu Fengguang
2009-09-23  2:56                               ` Andrew Morton
2009-09-23  3:11                                 ` Wu Fengguang
2009-09-23  3:10                               ` Shaohua Li
2009-09-23  3:14                                 ` Wu Fengguang
2009-09-23  3:25                                   ` Wu Fengguang
2009-09-23 14:00                             ` Chris Mason
2009-09-24  3:15                               ` Wu Fengguang
2009-09-24 12:10                                 ` Chris Mason
2009-09-25  3:26                                   ` Wu Fengguang
2009-09-25  0:11                                 ` Dave Chinner
2009-09-25  0:38                                   ` Chris Mason
2009-09-25  5:04                                     ` Dave Chinner
2009-09-25  6:45                                       ` Wu Fengguang
2009-09-28  1:07                                         ` Dave Chinner
2009-09-28  7:15                                           ` Wu Fengguang
2009-09-28 13:08                                             ` Christoph Hellwig
2009-09-28 14:07                                               ` Theodore Tso
2009-09-30  5:26                                                 ` Wu Fengguang
2009-09-30  5:32                                                   ` Wu Fengguang
2009-10-01 22:17                                                     ` Jan Kara
2009-10-02  3:27                                                       ` Wu Fengguang
2009-10-06 12:55                                                         ` Jan Kara
2009-10-06 13:18                                                           ` Wu Fengguang [this message]
2009-09-30 14:11                                                   ` Theodore Tso
2009-10-01 15:14                                                     ` Wu Fengguang
2009-10-01 21:54                                                       ` Theodore Tso
2009-10-02  2:55                                                         ` Wu Fengguang
2009-10-02  8:19                                                           ` Wu Fengguang
2009-10-02 17:26                                                             ` Theodore Tso
2009-10-03  6:10                                                               ` Wu Fengguang
2009-09-29  2:32                                               ` Wu Fengguang
2009-09-29 14:00                                                 ` Chris Mason
2009-09-29 14:21                                                 ` Christoph Hellwig
2009-09-29  0:15                                             ` Wu Fengguang
2009-09-28 14:25                                           ` Chris Mason
2009-09-29 23:39                                             ` Dave Chinner
2009-09-30  1:30                                               ` Wu Fengguang
2009-09-25 12:06                                       ` Chris Mason
2009-09-25  3:19                                   ` Wu Fengguang
2009-09-26  1:47                                     ` Dave Chinner
2009-09-26  3:02                                       ` Wu Fengguang
2009-09-23  9:19                         ` Richard Kennedy
2009-09-23  9:23                           ` Peter Zijlstra
2009-09-23  9:37                             ` Wu Fengguang
2009-09-23 10:30                               ` Wu Fengguang
2009-09-23  6:41             ` Shaohua Li
2009-09-22 10:49 ` Wu Fengguang
2009-09-22 11:50   ` Shaohua Li
2009-09-22 13:39     ` Wu Fengguang
2009-09-23  1:52       ` Shaohua Li
2009-09-23  4:00         ` Wu Fengguang
2009-09-25  6:14           ` Wu Fengguang
