Re: [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages
       [not found] ` <20090909150601.159061863@intel.com>
@ 2009-09-09 15:44   ` Jan Kara
  2009-09-10  1:42     ` Wu Fengguang
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2009-09-09 15:44 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Dave Chinner, Chris Mason,
	Peter Zijlstra, Christoph Hellwig, jack, Artem Bityutskiy, LKML,
	linux-fsdevel

On Wed 09-09-09 22:51:48, Wu Fengguang wrote:
> Some filesystem may choose to write much more than ratelimit_pages
> before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
> determine number to write based on real number of dirtied pages.
> 
> The increased write_chunk may make the dirtier more bumpy.  This is
> filesystem writers' duty not to dirty too much at a time without
> checking the ratelimit.
  I don't get this. balance_dirty_pages_ratelimited_nr() is called when we
dirty the page, not when we write it out. So a problem would only happen if
filesystem dirties pages by set_page_dirty() and won't call
balance_dirty_pages_ratelimited_nr(). But e.g. generic_perform_write()
and do_wp_page() takes care of that. So where's the problem?

								Honza
> ---
>  mm/page-writeback.c |   13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> --- linux.orig/mm/page-writeback.c	2009-09-09 21:19:21.000000000 +0800
> +++ linux/mm/page-writeback.c	2009-09-09 21:25:45.000000000 +0800
> @@ -44,12 +44,12 @@ static long ratelimit_pages = 32;
>  /*
>   * When balance_dirty_pages decides that the caller needs to perform some
>   * non-background writeback, this is how many pages it will attempt to write.
> - * It should be somewhat larger than RATELIMIT_PAGES to ensure that reasonably
> + * It should be somewhat larger than dirtied pages to ensure that reasonably
>   * large amounts of I/O are submitted.
>   */
> -static inline long sync_writeback_pages(void)
> +static inline long sync_writeback_pages(unsigned long dirtied)
>  {
> -	return ratelimit_pages + ratelimit_pages / 2;
> +	return dirtied + dirtied / 2;
>  }
>  
>  /* The following parameters are exported via /proc/sys/vm */
> @@ -476,7 +476,8 @@ get_dirty_limits(unsigned long *pbackgro
>   * If we're over `background_thresh' then pdflush is woken to perform some
>   * writeout.
>   */
> -static void balance_dirty_pages(struct address_space *mapping)
> +static void balance_dirty_pages(struct address_space *mapping,
> +				unsigned long write_chunk)
>  {
>  	long nr_reclaimable, bdi_nr_reclaimable;
>  	long nr_writeback, bdi_nr_writeback;
> @@ -484,7 +485,6 @@ static void balance_dirty_pages(struct a
>  	unsigned long dirty_thresh;
>  	unsigned long bdi_thresh;
>  	unsigned long pages_written = 0;
> -	unsigned long write_chunk = sync_writeback_pages();
>  
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
>  
> @@ -638,9 +638,10 @@ void balance_dirty_pages_ratelimited_nr(
>  	p =  &__get_cpu_var(bdp_ratelimits);
>  	*p += nr_pages_dirtied;
>  	if (unlikely(*p >= ratelimit)) {
> +		ratelimit = sync_writeback_pages(*p);
>  		*p = 0;
>  		preempt_enable();
> -		balance_dirty_pages(mapping);
> +		balance_dirty_pages(mapping, ratelimit);
>  		return;
>  	}
>  	preempt_enable();
> 
> -- 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages
  2009-09-09 15:44   ` [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages Jan Kara
@ 2009-09-10  1:42     ` Wu Fengguang
  2009-09-10 12:57       ` Chris Mason
  0 siblings, 1 reply; 19+ messages in thread
From: Wu Fengguang @ 2009-09-10  1:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Jens Axboe, Dave Chinner, Chris Mason,
	Peter Zijlstra, Christoph Hellwig, Artem Bityutskiy, LKML,
	linux-fsdevel@vger.kernel.org

On Wed, Sep 09, 2009 at 11:44:13PM +0800, Jan Kara wrote:
> On Wed 09-09-09 22:51:48, Wu Fengguang wrote:
> > Some filesystem may choose to write much more than ratelimit_pages
> > before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
> > determine number to write based on real number of dirtied pages.
> > 
> > The increased write_chunk may make the dirtier more bumpy.  This is
> > filesystem writers' duty not to dirty too much at a time without
> > checking the ratelimit.
>   I don't get this. balance_dirty_pages_ratelimited_nr() is called when we
> dirty the page, not when we write it out. So a problem would only happen if
> filesystem dirties pages by set_page_dirty() and won't call
> balance_dirty_pages_ratelimited_nr(). But e.g. generic_perform_write()
> and do_wp_page() takes care of that. So where's the problem?

It seems that btrfs_file_write() is writing in chunks of up to 1024-pages
(1024 is the computed nrptrs value in a 32bit kernel). And it calls
balance_dirty_pages_ratelimited_nr() each time it dirtied such a chunk.

Thanks,
Fengguang

> > ---
> >  mm/page-writeback.c |   13 +++++++------
> >  1 file changed, 7 insertions(+), 6 deletions(-)
> > 
> > --- linux.orig/mm/page-writeback.c	2009-09-09 21:19:21.000000000 +0800
> > +++ linux/mm/page-writeback.c	2009-09-09 21:25:45.000000000 +0800
> > @@ -44,12 +44,12 @@ static long ratelimit_pages = 32;
> >  /*
> >   * When balance_dirty_pages decides that the caller needs to perform some
> >   * non-background writeback, this is how many pages it will attempt to write.
> > - * It should be somewhat larger than RATELIMIT_PAGES to ensure that reasonably
> > + * It should be somewhat larger than dirtied pages to ensure that reasonably
> >   * large amounts of I/O are submitted.
> >   */
> > -static inline long sync_writeback_pages(void)
> > +static inline long sync_writeback_pages(unsigned long dirtied)
> >  {
> > -	return ratelimit_pages + ratelimit_pages / 2;
> > +	return dirtied + dirtied / 2;
> >  }
> >  
> >  /* The following parameters are exported via /proc/sys/vm */
> > @@ -476,7 +476,8 @@ get_dirty_limits(unsigned long *pbackgro
> >   * If we're over `background_thresh' then pdflush is woken to perform some
> >   * writeout.
> >   */
> > -static void balance_dirty_pages(struct address_space *mapping)
> > +static void balance_dirty_pages(struct address_space *mapping,
> > +				unsigned long write_chunk)
> >  {
> >  	long nr_reclaimable, bdi_nr_reclaimable;
> >  	long nr_writeback, bdi_nr_writeback;
> > @@ -484,7 +485,6 @@ static void balance_dirty_pages(struct a
> >  	unsigned long dirty_thresh;
> >  	unsigned long bdi_thresh;
> >  	unsigned long pages_written = 0;
> > -	unsigned long write_chunk = sync_writeback_pages();
> >  
> >  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> >  
> > @@ -638,9 +638,10 @@ void balance_dirty_pages_ratelimited_nr(
> >  	p =  &__get_cpu_var(bdp_ratelimits);
> >  	*p += nr_pages_dirtied;
> >  	if (unlikely(*p >= ratelimit)) {
> > +		ratelimit = sync_writeback_pages(*p);
> >  		*p = 0;
> >  		preempt_enable();
> > -		balance_dirty_pages(mapping);
> > +		balance_dirty_pages(mapping, ratelimit);
> >  		return;
> >  	}
> >  	preempt_enable();
> > 
> > -- 
> > 
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages
  2009-09-10  1:42     ` Wu Fengguang
@ 2009-09-10 12:57       ` Chris Mason
  2009-09-10 13:21         ` Wu Fengguang
  0 siblings, 1 reply; 19+ messages in thread
From: Chris Mason @ 2009-09-10 12:57 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Jens Axboe, Dave Chinner, Peter Zijlstra,
	Christoph Hellwig, Artem Bityutskiy, LKML,
	linux-fsdevel@vger.kernel.org

On Thu, Sep 10, 2009 at 09:42:01AM +0800, Wu Fengguang wrote:
> On Wed, Sep 09, 2009 at 11:44:13PM +0800, Jan Kara wrote:
> > On Wed 09-09-09 22:51:48, Wu Fengguang wrote:
> > > Some filesystem may choose to write much more than ratelimit_pages
> > > before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
> > > determine number to write based on real number of dirtied pages.
> > > 
> > > The increased write_chunk may make the dirtier more bumpy.  This is
> > > filesystem writers' duty not to dirty too much at a time without
> > > checking the ratelimit.
> >   I don't get this. balance_dirty_pages_ratelimited_nr() is called when we
> > dirty the page, not when we write it out. So a problem would only happen if
> > filesystem dirties pages by set_page_dirty() and won't call
> > balance_dirty_pages_ratelimited_nr(). But e.g. generic_perform_write()
> > and do_wp_page() takes care of that. So where's the problem?
> 
> It seems that btrfs_file_write() is writing in chunks of up to 1024-pages
> (1024 is the computed nrptrs value in a 32bit kernel). And it calls
> balance_dirty_pages_ratelimited_nr() each time it dirtied such a chunk.

I can easily change this to call more often, but we do always call
balance_dirty_pages to reflect how much ram we've really sent down.

-chris


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages
  2009-09-10 12:57       ` Chris Mason
@ 2009-09-10 13:21         ` Wu Fengguang
  2009-09-10 14:56           ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Wu Fengguang @ 2009-09-10 13:21 UTC (permalink / raw)
  To: Chris Mason, Jan Kara, Andrew Morton, Jens Axboe, Dave Chinner,
	Pete

On Thu, Sep 10, 2009 at 08:57:42PM +0800, Chris Mason wrote:
> On Thu, Sep 10, 2009 at 09:42:01AM +0800, Wu Fengguang wrote:
> > On Wed, Sep 09, 2009 at 11:44:13PM +0800, Jan Kara wrote:
> > > On Wed 09-09-09 22:51:48, Wu Fengguang wrote:
> > > > Some filesystem may choose to write much more than ratelimit_pages
> > > > before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
> > > > determine number to write based on real number of dirtied pages.
> > > > 
> > > > The increased write_chunk may make the dirtier more bumpy.  This is
> > > > filesystem writers' duty not to dirty too much at a time without
> > > > checking the ratelimit.
> > >   I don't get this. balance_dirty_pages_ratelimited_nr() is called when we
> > > dirty the page, not when we write it out. So a problem would only happen if
> > > filesystem dirties pages by set_page_dirty() and won't call
> > > balance_dirty_pages_ratelimited_nr(). But e.g. generic_perform_write()
> > > and do_wp_page() takes care of that. So where's the problem?
> > 
> > It seems that btrfs_file_write() is writing in chunks of up to 1024-pages
> > (1024 is the computed nrptrs value in a 32bit kernel). And it calls
> > balance_dirty_pages_ratelimited_nr() each time it dirtied such a chunk.
> 
> I can easily change this to call more often, but we do always call
> balance_dirty_pages to reflect how much ram we've really sent down.

Btrfs is doing OK. 2MB/4MB looks like reasonable chunk sizes. The
need-change part is balance_dirty_pages_ratelimited_nr(), hence this
patch :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages
  2009-09-10 13:21         ` Wu Fengguang
@ 2009-09-10 14:56           ` Peter Zijlstra
  2009-09-10 15:14             ` Wu Fengguang
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2009-09-10 14:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Chris Mason, Jan Kara, Andrew Morton, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Artem Bityutskiy, LKML,
	linux-fsdevel@vger.kernel.org

On Thu, 2009-09-10 at 21:21 +0800, Wu Fengguang wrote:
> On Thu, Sep 10, 2009 at 08:57:42PM +0800, Chris Mason wrote:
> > On Thu, Sep 10, 2009 at 09:42:01AM +0800, Wu Fengguang wrote:
> > > On Wed, Sep 09, 2009 at 11:44:13PM +0800, Jan Kara wrote:
> > > > On Wed 09-09-09 22:51:48, Wu Fengguang wrote:
> > > > > Some filesystem may choose to write much more than ratelimit_pages
> > > > > before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
> > > > > determine number to write based on real number of dirtied pages.
> > > > > 
> > > > > The increased write_chunk may make the dirtier more bumpy.  This is
> > > > > filesystem writers' duty not to dirty too much at a time without
> > > > > checking the ratelimit.
> > > >   I don't get this. balance_dirty_pages_ratelimited_nr() is called when we
> > > > dirty the page, not when we write it out. So a problem would only happen if
> > > > filesystem dirties pages by set_page_dirty() and won't call
> > > > balance_dirty_pages_ratelimited_nr(). But e.g. generic_perform_write()
> > > > and do_wp_page() takes care of that. So where's the problem?
> > > 
> > > It seems that btrfs_file_write() is writing in chunks of up to 1024-pages
> > > (1024 is the computed nrptrs value in a 32bit kernel). And it calls
> > > balance_dirty_pages_ratelimited_nr() each time it dirtied such a chunk.
> > 
> > I can easily change this to call more often, but we do always call
> > balance_dirty_pages to reflect how much ram we've really sent down.
> 
> Btrfs is doing OK. 2MB/4MB looks like reasonable chunk sizes. The
> need-change part is balance_dirty_pages_ratelimited_nr(), hence this
> patch :)

I'm not getting it, it calls set_page_dirty() for each page, right? and
then it calls into balance_dirty_pages_ratelimited_nr(), that sounds
right. What is the problem with that?


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages
  2009-09-10 14:56           ` Peter Zijlstra
@ 2009-09-10 15:14             ` Wu Fengguang
  2009-09-10 15:31               ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Wu Fengguang @ 2009-09-10 15:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Jan Kara, Andrew Morton, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Artem Bityutskiy, LKML,
	linux-fsdevel@vger.kernel.org

On Thu, Sep 10, 2009 at 10:56:04PM +0800, Peter Zijlstra wrote:
> On Thu, 2009-09-10 at 21:21 +0800, Wu Fengguang wrote:
> > On Thu, Sep 10, 2009 at 08:57:42PM +0800, Chris Mason wrote:
> > > On Thu, Sep 10, 2009 at 09:42:01AM +0800, Wu Fengguang wrote:
> > > > On Wed, Sep 09, 2009 at 11:44:13PM +0800, Jan Kara wrote:
> > > > > On Wed 09-09-09 22:51:48, Wu Fengguang wrote:
> > > > > > Some filesystem may choose to write much more than ratelimit_pages
> > > > > > before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
> > > > > > determine number to write based on real number of dirtied pages.
> > > > > > 
> > > > > > The increased write_chunk may make the dirtier more bumpy.  This is
> > > > > > filesystem writers' duty not to dirty too much at a time without
> > > > > > checking the ratelimit.
> > > > >   I don't get this. balance_dirty_pages_ratelimited_nr() is called when we
> > > > > dirty the page, not when we write it out. So a problem would only happen if
> > > > > filesystem dirties pages by set_page_dirty() and won't call
> > > > > balance_dirty_pages_ratelimited_nr(). But e.g. generic_perform_write()
> > > > > and do_wp_page() takes care of that. So where's the problem?
> > > > 
> > > > It seems that btrfs_file_write() is writing in chunks of up to 1024-pages
> > > > (1024 is the computed nrptrs value in a 32bit kernel). And it calls
> > > > balance_dirty_pages_ratelimited_nr() each time it dirtied such a chunk.
> > > 
> > > I can easily change this to call more often, but we do always call
> > > balance_dirty_pages to reflect how much ram we've really sent down.
> > 
> > Btrfs is doing OK. 2MB/4MB looks like reasonable chunk sizes. The
> > need-change part is balance_dirty_pages_ratelimited_nr(), hence this
> > patch :)
> 
> I'm not getting it, it calls set_page_dirty() for each page, right? and
> then it calls into balance_dirty_pages_ratelimited_nr(), that sounds
> right. What is the problem with that?

It looks like btrfs_file_write() eventually calls
__set_page_dirty_buffers() which in turn won't call
balance_dirty_pages*(). This is why do_wp_page() calls
set_page_dirty_balance() to do balance_dirty_pages*().

So btrfs_file_write() explicitly calls
balance_dirty_pages_ratelimited_nr() to get throttled.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages
  2009-09-10 15:14             ` Wu Fengguang
@ 2009-09-10 15:31               ` Peter Zijlstra
  2009-09-10 15:41                 ` Wu Fengguang
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2009-09-10 15:31 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Chris Mason, Jan Kara, Andrew Morton, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Artem Bityutskiy, LKML,
	linux-fsdevel@vger.kernel.org

On Thu, 2009-09-10 at 23:14 +0800, Wu Fengguang wrote:
> On Thu, Sep 10, 2009 at 10:56:04PM +0800, Peter Zijlstra wrote:
> > On Thu, 2009-09-10 at 21:21 +0800, Wu Fengguang wrote:
> > > On Thu, Sep 10, 2009 at 08:57:42PM +0800, Chris Mason wrote:
> > > > On Thu, Sep 10, 2009 at 09:42:01AM +0800, Wu Fengguang wrote:
> > > > > On Wed, Sep 09, 2009 at 11:44:13PM +0800, Jan Kara wrote:
> > > > > > On Wed 09-09-09 22:51:48, Wu Fengguang wrote:
> > > > > > > Some filesystem may choose to write much more than ratelimit_pages
> > > > > > > before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
> > > > > > > determine number to write based on real number of dirtied pages.
> > > > > > > 
> > > > > > > The increased write_chunk may make the dirtier more bumpy.  This is
> > > > > > > filesystem writers' duty not to dirty too much at a time without
> > > > > > > checking the ratelimit.
> > > > > >   I don't get this. balance_dirty_pages_ratelimited_nr() is called when we
> > > > > > dirty the page, not when we write it out. So a problem would only happen if
> > > > > > filesystem dirties pages by set_page_dirty() and won't call
> > > > > > balance_dirty_pages_ratelimited_nr(). But e.g. generic_perform_write()
> > > > > > and do_wp_page() takes care of that. So where's the problem?
> > > > > 
> > > > > It seems that btrfs_file_write() is writing in chunks of up to 1024-pages
> > > > > (1024 is the computed nrptrs value in a 32bit kernel). And it calls
> > > > > balance_dirty_pages_ratelimited_nr() each time it dirtied such a chunk.
> > > > 
> > > > I can easily change this to call more often, but we do always call
> > > > balance_dirty_pages to reflect how much ram we've really sent down.
> > > 
> > > Btrfs is doing OK. 2MB/4MB looks like reasonable chunk sizes. The
> > > need-change part is balance_dirty_pages_ratelimited_nr(), hence this
> > > patch :)
> > 
> > I'm not getting it, it calls set_page_dirty() for each page, right? and
> > then it calls into balance_dirty_pages_ratelimited_nr(), that sounds
> > right. What is the problem with that?
> 
> It looks like btrfs_file_write() eventually calls
> __set_page_dirty_buffers() which in turn won't call
> balance_dirty_pages*(). This is why do_wp_page() calls
> set_page_dirty_balance() to do balance_dirty_pages*().
> 
> So btrfs_file_write() explicitly calls
> balance_dirty_pages_ratelimited_nr() to get throttled.

Right, so what is wrong with than, and how does this patch fix that?

[ the only thing you have to be careful with is that you don't
excessively grow the error bound on the dirty limit ]


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages
  2009-09-10 15:31               ` Peter Zijlstra
@ 2009-09-10 15:41                 ` Wu Fengguang
  2009-09-10 15:54                   ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Wu Fengguang @ 2009-09-10 15:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Jan Kara, Andrew Morton, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Artem Bityutskiy, LKML,
	linux-fsdevel@vger.kernel.org

On Thu, Sep 10, 2009 at 11:31:38PM +0800, Peter Zijlstra wrote:
> On Thu, 2009-09-10 at 23:14 +0800, Wu Fengguang wrote:
> > On Thu, Sep 10, 2009 at 10:56:04PM +0800, Peter Zijlstra wrote:
> > > On Thu, 2009-09-10 at 21:21 +0800, Wu Fengguang wrote:
> > > > On Thu, Sep 10, 2009 at 08:57:42PM +0800, Chris Mason wrote:
> > > > > On Thu, Sep 10, 2009 at 09:42:01AM +0800, Wu Fengguang wrote:
> > > > > > On Wed, Sep 09, 2009 at 11:44:13PM +0800, Jan Kara wrote:
> > > > > > > On Wed 09-09-09 22:51:48, Wu Fengguang wrote:
> > > > > > > > Some filesystem may choose to write much more than ratelimit_pages
> > > > > > > > before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
> > > > > > > > determine number to write based on real number of dirtied pages.
> > > > > > > > 
> > > > > > > > The increased write_chunk may make the dirtier more bumpy.  This is
> > > > > > > > filesystem writers' duty not to dirty too much at a time without
> > > > > > > > checking the ratelimit.
> > > > > > >   I don't get this. balance_dirty_pages_ratelimited_nr() is called when we
> > > > > > > dirty the page, not when we write it out. So a problem would only happen if
> > > > > > > filesystem dirties pages by set_page_dirty() and won't call
> > > > > > > balance_dirty_pages_ratelimited_nr(). But e.g. generic_perform_write()
> > > > > > > and do_wp_page() takes care of that. So where's the problem?
> > > > > > 
> > > > > > It seems that btrfs_file_write() is writing in chunks of up to 1024-pages
> > > > > > (1024 is the computed nrptrs value in a 32bit kernel). And it calls
> > > > > > balance_dirty_pages_ratelimited_nr() each time it dirtied such a chunk.
> > > > > 
> > > > > I can easily change this to call more often, but we do always call
> > > > > balance_dirty_pages to reflect how much ram we've really sent down.
> > > > 
> > > > Btrfs is doing OK. 2MB/4MB looks like reasonable chunk sizes. The
> > > > need-change part is balance_dirty_pages_ratelimited_nr(), hence this
> > > > patch :)
> > > 
> > > I'm not getting it, it calls set_page_dirty() for each page, right? and
> > > then it calls into balance_dirty_pages_ratelimited_nr(), that sounds
> > > right. What is the problem with that?
> > 
> > It looks like btrfs_file_write() eventually calls
> > __set_page_dirty_buffers() which in turn won't call
> > balance_dirty_pages*(). This is why do_wp_page() calls
> > set_page_dirty_balance() to do balance_dirty_pages*().
> > 
> > So btrfs_file_write() explicitly calls
> > balance_dirty_pages_ratelimited_nr() to get throttled.
> 
> Right, so what is wrong with than, and how does this patch fix that?
> 
> [ the only thing you have to be careful with is that you don't
> excessively grow the error bound on the dirty limit ]

Then we could form a loop:

        btrfs_file_write():     dirty 1024 pages
        balance_dirty_pages():  write up to 12 pages (= ratelimit_pages * 1.5)

in which the writeback rate cannot keep up with dirty rate,
and the dirty pages go all the way beyond dirty_thresh.

Sorry for writing such a vague changelog!

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages
  2009-09-10 15:41                 ` Wu Fengguang
@ 2009-09-10 15:54                   ` Peter Zijlstra
  2009-09-10 16:08                     ` Wu Fengguang
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2009-09-10 15:54 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Chris Mason, Jan Kara, Andrew Morton, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Artem Bityutskiy, LKML,
	linux-fsdevel@vger.kernel.org

On Thu, 2009-09-10 at 23:41 +0800, Wu Fengguang wrote:
> > > So btrfs_file_write() explicitly calls
> > > balance_dirty_pages_ratelimited_nr() to get throttled.
> > 
> > Right, so what is wrong with than, and how does this patch fix that?
> > 
> > [ the only thing you have to be careful with is that you don't
> > excessively grow the error bound on the dirty limit ]
> 
> Then we could form a loop:
> 
>         btrfs_file_write():     dirty 1024 pages
>         balance_dirty_pages():  write up to 12 pages (= ratelimit_pages * 1.5)
> 
> in which the writeback rate cannot keep up with dirty rate,
> and the dirty pages go all the way beyond dirty_thresh.

Ah, ok so this is to keep the error bound on the dirty limit bounded,
because we can break out of balance_dirty_pages() early, the /* We've
done our duty */ break.

Which unbalances the duty vs the dirty ratio.

I figure that with the task dirty limit stuff we could maybe try to get
rid of this break.. worth a try.

> Sorry for writing such a vague changelog!

np, as long as we get there :-)

Change makes sense now, thanks!


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages
  2009-09-10 15:54                   ` Peter Zijlstra
@ 2009-09-10 16:08                     ` Wu Fengguang
  0 siblings, 0 replies; 19+ messages in thread
From: Wu Fengguang @ 2009-09-10 16:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Jan Kara, Andrew Morton, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Artem Bityutskiy, LKML,
	linux-fsdevel@vger.kernel.org

On Thu, Sep 10, 2009 at 11:54:29PM +0800, Peter Zijlstra wrote:
> On Thu, 2009-09-10 at 23:41 +0800, Wu Fengguang wrote:
> > > > So btrfs_file_write() explicitly calls
> > > > balance_dirty_pages_ratelimited_nr() to get throttled.
> > > 
> > > Right, so what is wrong with than, and how does this patch fix that?
> > > 
> > > [ the only thing you have to be careful with is that you don't
> > > excessively grow the error bound on the dirty limit ]
> > 
> > Then we could form a loop:
> > 
> >         btrfs_file_write():     dirty 1024 pages
> >         balance_dirty_pages():  write up to 12 pages (= ratelimit_pages * 1.5)
> > 
> > in which the writeback rate cannot keep up with dirty rate,
> > and the dirty pages go all the way beyond dirty_thresh.
> 
> Ah, ok so this is to keep the error bound on the dirty limit bounded,
> because we can break out of balance_dirty_pages() early, the /* We've
> done our duty */ break.
> 
> Which unbalances the duty vs the dirty ratio.

Right!

> I figure that with the task dirty limit stuff we could maybe try to get
> rid of this break.. worth a try.

Be careful. Without that break, the time a task get throttled in a
single trip may go out of control. For example, task B get blocked
for 1000 seconds because there is a task A keep dirtying pages, in
the mean time task A's dirty thresh going down slowly, but still
larger than B's.

> > Sorry for writing such a vague changelog!
> 
> np, as long as we get there :-)
> 
> Change makes sense now, thanks!

May I add you ack?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 19+ messages in thread

[parent not found: <20090909150600.330539880@intel.com>]

* Re: [RFC][PATCH 1/7] writeback: cleanup writeback_single_inode()
       [not found] ` <20090909150600.330539880@intel.com>
@ 2009-09-09 15:45   ` Jan Kara
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Kara @ 2009-09-09 15:45 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Dave Chinner, Chris Mason,
	Peter Zijlstra, Christoph Hellwig, jack, Artem Bityutskiy, LKML,
	linux-fsdevel

On Wed 09-09-09 22:51:42, Wu Fengguang wrote:
> Make the if-else straight in writeback_single_inode().
> No behavior change.
> 
> Cc: Michael Rubin <mrubin@google.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
  The patch looks good.
Acked-by: Jan Kara <jack@suse.cz>

> ---
>  fs/fs-writeback.c |   15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> --- linux.orig/fs/fs-writeback.c	2009-09-09 21:40:41.000000000 +0800
> +++ linux/fs/fs-writeback.c	2009-09-09 21:41:14.000000000 +0800
> @@ -417,8 +417,13 @@ writeback_single_inode(struct inode *ino
>  	spin_lock(&inode_lock);
>  	inode->i_state &= ~I_SYNC;
>  	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> -		if (!(inode->i_state & I_DIRTY) &&
> -		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> +		if (inode->i_state & I_DIRTY) {
> +			/*
> +			 * Someone redirtied the inode while were writing back
> +			 * the pages.
> +			 */
> +			redirty_tail(inode);
> +		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>  			/*
>  			 * We didn't write back all the pages.  nfs_writepages()
>  			 * sometimes bales out without doing anything. Redirty
> @@ -462,12 +467,6 @@ writeback_single_inode(struct inode *ino
>  				inode->i_state |= I_DIRTY_PAGES;
>  				redirty_tail(inode);
>  			}
> -		} else if (inode->i_state & I_DIRTY) {
> -			/*
> -			 * Someone redirtied the inode while were writing back
> -			 * the pages.
> -			 */
> -			redirty_tail(inode);
>  		} else if (atomic_read(&inode->i_count)) {
>  			/*
>  			 * The inode is clean, inuse
> 
> -- 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

[parent not found: <20090909150600.451373732@intel.com>]

* Re: [RFC][PATCH 2/7] writeback: fix queue_io() ordering
       [not found] ` <20090909150600.451373732@intel.com>
@ 2009-09-09 15:53   ` Jan Kara
  2009-09-10  1:26     ` Wu Fengguang
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2009-09-09 15:53 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Dave Chinner, Chris Mason,
	Peter Zijlstra, Christoph Hellwig, jack, Artem Bityutskiy, LKML,
	linux-fsdevel

On Wed 09-09-09 22:51:43, Wu Fengguang wrote:
> This was not a bug, since b_io is empty for kupdate writeback.
> The next patch will do requeue_io() for non-kupdate writeback,
> so let's fix it.
  But doesn't this patch also need your "anti-starvation" patch?
Looking into the code, we put inode to b_more_io when nr_to_write
drops to zero and this way we'd just start writing it again
in the next round...

							Honza
> 
> CC: Dave Chinner <david@fromorbit.com>
> Cc: Martin Bligh <mbligh@google.com>
> Cc: Michael Rubin <mrubin@google.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
> ---
>  fs/fs-writeback.c |    7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> --- linux.orig/fs/fs-writeback.c	2009-09-09 21:41:14.000000000 +0800
> +++ linux/fs/fs-writeback.c	2009-09-09 21:45:15.000000000 +0800
> @@ -313,11 +313,14 @@ static void move_expired_inodes(struct l
>  }
>  
>  /*
> - * Queue all expired dirty inodes for io, eldest first.
> + * Queue all expired dirty inodes for io, eldest first:
> + * (newly dirtied) => b_dirty inodes
> + *                 => b_more_io inodes
> + *                 => remaining inodes in b_io => (dequeue for sync)
>   */
>  static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
>  {
> -	list_splice_init(&wb->b_more_io, wb->b_io.prev);
> +	list_splice_init(&wb->b_more_io, &wb->b_io);
>  	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
>  }
>  
> 
> -- 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH 2/7] writeback: fix queue_io() ordering
  2009-09-09 15:53   ` [RFC][PATCH 2/7] writeback: fix queue_io() ordering Jan Kara
@ 2009-09-10  1:26     ` Wu Fengguang
  2009-09-10 14:14       ` Jan Kara
  0 siblings, 1 reply; 19+ messages in thread
From: Wu Fengguang @ 2009-09-10  1:26 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Jens Axboe, Dave Chinner, Chris Mason,
	Peter Zijlstra, Christoph Hellwig, jack@suse.cz, Artem Bityutskiy,
	LKML, linux-fsdevel@vger.kernel.org

On Wed, Sep 09, 2009 at 11:53:30PM +0800, Jan Kara wrote:
> On Wed 09-09-09 22:51:43, Wu Fengguang wrote:
> > This was not a bug, since b_io is empty for kupdate writeback.
> > The next patch will do requeue_io() for non-kupdate writeback,
> > so let's fix it.
>   But doesn't this patch also need your "anti-starvation" patch?

Honza, can you show me that patch?

> Looking into the code, we put inode to b_more_io when nr_to_write
> drops to zero and this way we'd just start writing it again
> in the next round...

I'm confused. It's OK to start it in next round. Starvation can
occur if we start it immediately in the next writeback_inodes()
invocation. How can that happen with this patch?

Thanks,
Fengguang

> 							Honza
> > 
> > CC: Dave Chinner <david@fromorbit.com>
> > Cc: Martin Bligh <mbligh@google.com>
> > Cc: Michael Rubin <mrubin@google.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
> > ---
> >  fs/fs-writeback.c |    7 +++++--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> > 
> > --- linux.orig/fs/fs-writeback.c	2009-09-09 21:41:14.000000000 +0800
> > +++ linux/fs/fs-writeback.c	2009-09-09 21:45:15.000000000 +0800
> > @@ -313,11 +313,14 @@ static void move_expired_inodes(struct l
> >  }
> >  
> >  /*
> > - * Queue all expired dirty inodes for io, eldest first.
> > + * Queue all expired dirty inodes for io, eldest first:
> > + * (newly dirtied) => b_dirty inodes
> > + *                 => b_more_io inodes
> > + *                 => remaining inodes in b_io => (dequeue for sync)
> >   */
> >  static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
> >  {
> > -	list_splice_init(&wb->b_more_io, wb->b_io.prev);
> > +	list_splice_init(&wb->b_more_io, &wb->b_io);
> >  	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
> >  }
> >  
> > 
> > -- 
> > 
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH 2/7] writeback: fix queue_io() ordering
  2009-09-10  1:26     ` Wu Fengguang
@ 2009-09-10 14:14       ` Jan Kara
  2009-09-10 14:17         ` Wu Fengguang
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2009-09-10 14:14 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Jens Axboe, Dave Chinner, Chris Mason,
	Peter Zijlstra, Christoph Hellwig, Artem Bityutskiy, LKML,
	linux-fsdevel@vger.kernel.org

On Thu 10-09-09 09:26:24, Wu Fengguang wrote:
> On Wed, Sep 09, 2009 at 11:53:30PM +0800, Jan Kara wrote:
> > On Wed 09-09-09 22:51:43, Wu Fengguang wrote:
> > > This was not a bug, since b_io is empty for kupdate writeback.
> > > The next patch will do requeue_io() for non-kupdate writeback,
> > > so let's fix it.
> >   But doesn't this patch also need your "anti-starvation" patch?
> 
> Honza, can you show me that patch?
> 
> > Looking into the code, we put inode to b_more_io when nr_to_write
> > drops to zero and this way we'd just start writing it again
> > in the next round...
> 
> I'm confused. It's OK to start it in next round. Starvation can
> occur if we start it immediately in the next writeback_inodes()
> invocation. How can that happen with this patch?
  Sorry, my fault. For kupdate, we splice the list only once s_io is empty
so that's not an issue. So the patch looks good.
  Acked-by: Jan Kara <jack@suse.cz>

> > > CC: Dave Chinner <david@fromorbit.com>
> > > Cc: Martin Bligh <mbligh@google.com>
> > > Cc: Michael Rubin <mrubin@google.com>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
> > > ---
> > >  fs/fs-writeback.c |    7 +++++--
> > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > 
> > > --- linux.orig/fs/fs-writeback.c	2009-09-09 21:41:14.000000000 +0800
> > > +++ linux/fs/fs-writeback.c	2009-09-09 21:45:15.000000000 +0800
> > > @@ -313,11 +313,14 @@ static void move_expired_inodes(struct l
> > >  }
> > >  
> > >  /*
> > > - * Queue all expired dirty inodes for io, eldest first.
> > > + * Queue all expired dirty inodes for io, eldest first:
> > > + * (newly dirtied) => b_dirty inodes
> > > + *                 => b_more_io inodes
> > > + *                 => remaining inodes in b_io => (dequeue for sync)
> > >   */
> > >  static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
> > >  {
> > > -	list_splice_init(&wb->b_more_io, wb->b_io.prev);
> > > +	list_splice_init(&wb->b_more_io, &wb->b_io);
> > >  	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
> > >  }
> > >  
> > > 
> > > -- 
> > > 
> > -- 
> > Jan Kara <jack@suse.cz>
> > SUSE Labs, CR
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH 2/7] writeback: fix queue_io() ordering
  2009-09-10 14:14       ` Jan Kara
@ 2009-09-10 14:17         ` Wu Fengguang
  0 siblings, 0 replies; 19+ messages in thread
From: Wu Fengguang @ 2009-09-10 14:17 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Jens Axboe, Dave Chinner, Chris Mason,
	Peter Zijlstra, Christoph Hellwig, Artem Bityutskiy, LKML,
	linux-fsdevel@vger.kernel.org

On Thu, Sep 10, 2009 at 10:14:15PM +0800, Jan Kara wrote:
> On Thu 10-09-09 09:26:24, Wu Fengguang wrote:
> > On Wed, Sep 09, 2009 at 11:53:30PM +0800, Jan Kara wrote:
> > > On Wed 09-09-09 22:51:43, Wu Fengguang wrote:
> > > > This was not a bug, since b_io is empty for kupdate writeback.
> > > > The next patch will do requeue_io() for non-kupdate writeback,
> > > > so let's fix it.
> > >   But doesn't this patch also need your "anti-starvation" patch?
> > 
> > Honza, can you show me that patch?
> > 
> > > Looking into the code, we put inode to b_more_io when nr_to_write
> > > drops to zero and this way we'd just start writing it again
> > > in the next round...
> > 
> > I'm confused. It's OK to start it in next round. Starvation can
> > occur if we start it immediately in the next writeback_inodes()
> > invocation. How can that happen with this patch?
>   Sorry, my fault. For kupdate, we splice the list only once s_io is empty
> so that's not an issue. So the patch looks good.
>   Acked-by: Jan Kara <jack@suse.cz>

Thank you :)

Regards,
Fengguang

> > > > CC: Dave Chinner <david@fromorbit.com>
> > > > Cc: Martin Bligh <mbligh@google.com>
> > > > Cc: Michael Rubin <mrubin@google.com>
> > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
> > > > ---
> > > >  fs/fs-writeback.c |    7 +++++--
> > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > > 
> > > > --- linux.orig/fs/fs-writeback.c	2009-09-09 21:41:14.000000000 +0800
> > > > +++ linux/fs/fs-writeback.c	2009-09-09 21:45:15.000000000 +0800
> > > > @@ -313,11 +313,14 @@ static void move_expired_inodes(struct l
> > > >  }
> > > >  
> > > >  /*
> > > > - * Queue all expired dirty inodes for io, eldest first.
> > > > + * Queue all expired dirty inodes for io, eldest first:
> > > > + * (newly dirtied) => b_dirty inodes
> > > > + *                 => b_more_io inodes
> > > > + *                 => remaining inodes in b_io => (dequeue for sync)
> > > >   */
> > > >  static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
> > > >  {
> > > > -	list_splice_init(&wb->b_more_io, wb->b_io.prev);
> > > > +	list_splice_init(&wb->b_more_io, &wb->b_io);
> > > >  	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
> > > >  }
> > > >  
> > > > 
> > > > -- 
> > > > 
> > > -- 
> > > Jan Kara <jack@suse.cz>
> > > SUSE Labs, CR
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

[parent not found: <20090909150600.874037375@intel.com>]

* Re: [RFC][PATCH 5/7] writeback: use 64MB MAX_WRITEBACK_PAGES
       [not found] ` <20090909150600.874037375@intel.com>
@ 2009-09-09 23:29   ` Theodore Tso
  2009-09-10  0:13     ` Wu Fengguang
  2009-09-10  4:53     ` Peter Zijlstra
  0 siblings, 2 replies; 19+ messages in thread
From: Theodore Tso @ 2009-09-09 23:29 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Dave Chinner, Chris Mason,
	Peter Zijlstra, Christoph Hellwig, jack, Artem Bityutskiy, LKML,
	linux-fsdevel

On Wed, Sep 09, 2009 at 10:51:46PM +0800, Wu Fengguang wrote:
> + * The maximum number of pages to writeout in a single periodic/background
> + * writeback operation. 64MB means I_SYNC may be hold for up to 1 second.
> + * This is not a big problem since we normally do kind of trylock on I_SYNC
> + * for non-data-integrity writes.  Userspace tasks doing throttled writeback
> + * do not use this value.

What's your justification for using 64MB?  Where are you getting 1
second from?  On a fast RAID array 64MB can be written in much less
than 1 second.

More generally, I assume your patches conflict with Jens' per-bdi patches?

     		  	      	      	       - Ted

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH 5/7] writeback: use 64MB MAX_WRITEBACK_PAGES
  2009-09-09 23:29   ` [RFC][PATCH 5/7] writeback: use 64MB MAX_WRITEBACK_PAGES Theodore Tso
@ 2009-09-10  0:13     ` Wu Fengguang
  2009-09-10  4:53     ` Peter Zijlstra
  1 sibling, 0 replies; 19+ messages in thread
From: Wu Fengguang @ 2009-09-10  0:13 UTC (permalink / raw)
  To: Theodore Tso, Andrew Morton, Jens Axboe, Dave Chinner,
	Chris Mason

On Thu, Sep 10, 2009 at 07:29:38AM +0800, Theodore Tso wrote:
> On Wed, Sep 09, 2009 at 10:51:46PM +0800, Wu Fengguang wrote:
> > + * The maximum number of pages to writeout in a single periodic/background
> > + * writeback operation. 64MB means I_SYNC may be hold for up to 1 second.
> > + * This is not a big problem since we normally do kind of trylock on I_SYNC
> > + * for non-data-integrity writes.  Userspace tasks doing throttled writeback
> > + * do not use this value.
> 
> What's your justification for using 64MB?  Where are you getting 1
> second from?  On a fast RAID array 64MB can be written in much less
> than 1 second.

Because this value will be used for desktop and server alike, we have
to consider the worst case - for a laptop, it takes about 1 second to
write 64MB. It's not accurate to say I_SYNC may be hold for up to 1
second, but the same I_SYNC will be taken, dropped because of
congestion, retaken, and this loop could go on for 1 second until a
large file have been written 64MB. In the mean while, in the normal
case of a single kupdate thread per bdi, that file blocks writeout of
other files for 1 second.

> More generally, I assume your patches conflict with Jens' per-bdi patches?

It is based on latest linux-next tree with the vm.max_writeback_mb
patch reverted. The changelog only states the need to increase the
size, but not mentioned why it should be made tunable. So I tend to
just bump up MAX_WRITEBACK_PAGES. It seems that a large value won't
hurt anyone. Or are there demand to further increase it a lot for some
server configurations?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH 5/7] writeback: use 64MB MAX_WRITEBACK_PAGES
  2009-09-09 23:29   ` [RFC][PATCH 5/7] writeback: use 64MB MAX_WRITEBACK_PAGES Theodore Tso
  2009-09-10  0:13     ` Wu Fengguang
@ 2009-09-10  4:53     ` Peter Zijlstra
  2009-09-10  7:35       ` Wu Fengguang
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2009-09-10  4:53 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Wu Fengguang, Andrew Morton, Jens Axboe, Dave Chinner,
	Chris Mason, Christoph Hellwig, jack, Artem Bityutskiy, LKML,
	linux-fsdevel

On Wed, 2009-09-09 at 19:29 -0400, Theodore Tso wrote:
> On Wed, Sep 09, 2009 at 10:51:46PM +0800, Wu Fengguang wrote:
> > + * The maximum number of pages to writeout in a single periodic/background
> > + * writeback operation. 64MB means I_SYNC may be hold for up to 1 second.
> > + * This is not a big problem since we normally do kind of trylock on I_SYNC
> > + * for non-data-integrity writes.  Userspace tasks doing throttled writeback
> > + * do not use this value.
> 
> What's your justification for using 64MB?  Where are you getting 1
> second from?  On a fast RAID array 64MB can be written in much less
> than 1 second.

Worse, on my 5mb/s usb stick writing out 64m will take forever.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC][PATCH 5/7] writeback: use 64MB MAX_WRITEBACK_PAGES
  2009-09-10  4:53     ` Peter Zijlstra
@ 2009-09-10  7:35       ` Wu Fengguang
  0 siblings, 0 replies; 19+ messages in thread
From: Wu Fengguang @ 2009-09-10  7:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Theodore Tso, Andrew Morton, Jens Axboe, Dave Chinner,
	Chris Mason, Christoph Hellwig, jack@suse.cz, Artem Bityutskiy,
	LKML, linux-fsdevel@vger.kernel.org

On Thu, Sep 10, 2009 at 12:53:18PM +0800, Peter Zijlstra wrote:
> On Wed, 2009-09-09 at 19:29 -0400, Theodore Tso wrote:
> > On Wed, Sep 09, 2009 at 10:51:46PM +0800, Wu Fengguang wrote:
> > > + * The maximum number of pages to writeout in a single periodic/background
> > > + * writeback operation. 64MB means I_SYNC may be hold for up to 1 second.
> > > + * This is not a big problem since we normally do kind of trylock on I_SYNC
> > > + * for non-data-integrity writes.  Userspace tasks doing throttled writeback
> > > + * do not use this value.
> > 
> > What's your justification for using 64MB?  Where are you getting 1
> > second from?  On a fast RAID array 64MB can be written in much less
> > than 1 second.
> 
> Worse, on my 5mb/s usb stick writing out 64m will take forever.

        cp bigfile1 bigfile2 /mnt/usb/
        sync

In that case the user would notice that kernel keeps writing to one
file for up to 13 seconds before switching to another file.

A simple fix would look like this. It stops io continuation on one
file after 1 second. It will work because when io is congested, it
relies on the io continuation logic (based on last_file*) to retry
the same file until MAX_WRITEBACK_PAGES is reached. The queue-able
requests between congested <=> uncongested states are not very large.
For slow devices, the queue-able pages between empty <=> congested
states are also not very large. For example, my USB stick has
nr_requests=128 and max_sectors_kb=120. It would take less than 12MB
to congest this queue.

With this patch and my usb stick, the kernel may first sync 12MB for
bigfile1 (which takes 1-3 seconds), then sync bigfile2 for 1 second,
and then bigfile1 for 1 second, and so on.

It seems that we could now safely bump MAX_WRITEBACK_PAGES to even
larger values beyond 128MB :)

Thanks,
Fengguang
---

--- linux.orig/fs/fs-writeback.c	2009-09-10 15:02:48.000000000 +0800
+++ linux/fs/fs-writeback.c	2009-09-10 15:07:23.000000000 +0800
@@ -277,7 +277,8 @@ static void requeue_io(struct inode *ino
  */
 static void requeue_partial_io(struct writeback_control *wbc, struct inode *inode)
 {
-	if (wbc->last_file_written == 0 ||
+	if (time_before(wbc->last_file_time + HZ, jiffies) ||
+	    wbc->last_file_written == 0 ||
 	    wbc->last_file_written >= MAX_WRITEBACK_PAGES)
 		return requeue_io(inode);
 
@@ -428,6 +429,7 @@ writeback_single_inode(struct inode *ino
 
 	if (wbc->last_file != inode->i_ino) {
 		wbc->last_file = inode->i_ino;
+		wbc->last_file_time = jiffies;
 		wbc->last_file_written = nr_to_write - wbc->nr_to_write;
 	} else
 		wbc->last_file_written += nr_to_write - wbc->nr_to_write;
--- linux.orig/include/linux/writeback.h	2009-09-10 15:07:28.000000000 +0800
+++ linux/include/linux/writeback.h	2009-09-10 15:08:46.000000000 +0800
@@ -46,6 +46,7 @@ struct writeback_control {
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	unsigned long last_file;	/* Inode number of last written file */
+	unsigned long last_file_time;	/* First sync time for last file */
 	long last_file_written;		/* Total pages written for last file */
 	long pages_skipped;		/* Pages which were not written */
 

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2009-09-10 16:08 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20090909145141.293229693@intel.com>
     [not found] ` <20090909150601.159061863@intel.com>
2009-09-09 15:44   ` [RFC][PATCH 7/7] writeback: balance_dirty_pages() shall write more than dirtied pages Jan Kara
2009-09-10  1:42     ` Wu Fengguang
2009-09-10 12:57       ` Chris Mason
2009-09-10 13:21         ` Wu Fengguang
2009-09-10 14:56           ` Peter Zijlstra
2009-09-10 15:14             ` Wu Fengguang
2009-09-10 15:31               ` Peter Zijlstra
2009-09-10 15:41                 ` Wu Fengguang
2009-09-10 15:54                   ` Peter Zijlstra
2009-09-10 16:08                     ` Wu Fengguang
     [not found] ` <20090909150600.330539880@intel.com>
2009-09-09 15:45   ` [RFC][PATCH 1/7] writeback: cleanup writeback_single_inode() Jan Kara
     [not found] ` <20090909150600.451373732@intel.com>
2009-09-09 15:53   ` [RFC][PATCH 2/7] writeback: fix queue_io() ordering Jan Kara
2009-09-10  1:26     ` Wu Fengguang
2009-09-10 14:14       ` Jan Kara
2009-09-10 14:17         ` Wu Fengguang
     [not found] ` <20090909150600.874037375@intel.com>
2009-09-09 23:29   ` [RFC][PATCH 5/7] writeback: use 64MB MAX_WRITEBACK_PAGES Theodore Tso
2009-09-10  0:13     ` Wu Fengguang
2009-09-10  4:53     ` Peter Zijlstra
2009-09-10  7:35       ` Wu Fengguang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).