* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Nikita Danilov @ 2009-10-07 10:38 UTC
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
    Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
    jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel

Hello,

> When paging out a dirty page, try to piggyback more consecutive dirty
> pages (up to 512KB) to improve IO efficiency.
>
> Only ext3/reiserfs, which don't have their own aops->writepages, are
> supported in this initial version.
>

[...]

>         /*
> +        * only write_cache_pages() supports for_reclaim for now
> +        */
> +       if (!mapping->a_ops->writepages) {
> +               wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
> +               wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
> +               generic_writepages(mapping, &wbc);
> +       }

This might end up calling ->writepage() on a page_mapped() page, which
used to be a problem, at least for shmem (see BUG_ON(page_mapped(page))
in shmem_writepage()).

I tried to do a similar thing in the past, but eventually decided to
explicitly call ->writepage() in a loop from vmscan.c. This way one can
also filter out referenced and active pages. See
http://linuxhacker.ru/~nikita/patches/2.6.14-rc5/05-cluster-pageout.patch,
it also contains a comment explaining the shmem problem.

Thank you,
Nikita.
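[For readers who don't follow the link, that explicit loop amounts to
roughly the following. This is a simplified, hypothetical sketch against
the 2.6.3x page-cache API: cluster_pageout() is a made-up name, error
handling and the shmem case are omitted, the caller is assumed to pass a
wbc with .for_reclaim = 1, and the inode-lifetime issues discussed later
in this thread apply here just as much.]

        static void cluster_pageout(struct page *page,
                                    struct address_space *mapping,
                                    struct writeback_control *wbc)
        {
                pgoff_t index = page->index + 1;
                pgoff_t end = page->index + LUMPY_PAGEOUT_PAGES;

                for (; index < end; index++) {
                        struct page *p = find_get_page(mapping, index);

                        if (!p)
                                break;  /* hole: stop clustering */
                        if (!trylock_page(p)) {
                                page_cache_release(p);
                                break;  /* never wait on a page lock */
                        }
                        /* filter out pages the VM would rather keep */
                        if (PageActive(p) || PageReferenced(p) ||
                            !PageDirty(p) || PageWriteback(p)) {
                                unlock_page(p);
                                page_cache_release(p);
                                break;
                        }
                        if (clear_page_dirty_for_io(p))
                                /* ->writepage() unlocks the page */
                                mapping->a_ops->writepage(p, wbc);
                        else
                                unlock_page(p);
                        page_cache_release(p);
                }
        }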
* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Wu Fengguang @ 2009-10-07 11:14 UTC
To: Nikita Danilov
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
    Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
    jens.axboe@oracle.com, Jan Kara, Nick Piggin,
    linux-fsdevel@vger.kernel.org

On Wed, Oct 07, 2009 at 06:38:37PM +0800, Nikita Danilov wrote:
> Hello,
>
> > When paging out a dirty page, try to piggyback more consecutive dirty
> > pages (up to 512KB) to improve IO efficiency.
> >
> > Only ext3/reiserfs, which don't have their own aops->writepages, are
> > supported in this initial version.
> >
>
> [...]
>
> >         /*
> > +        * only write_cache_pages() supports for_reclaim for now
> > +        */
> > +       if (!mapping->a_ops->writepages) {
> > +               wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
> > +               wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
> > +               generic_writepages(mapping, &wbc);
> > +       }
>
> This might end up calling ->writepage() on a page_mapped() page,
> which used to be a problem, at least for shmem (see
> BUG_ON(page_mapped(page)) in shmem_writepage()).

Good catch, thanks a lot!

> I tried to do a similar thing in the past, but eventually decided to
> explicitly call ->writepage() in a loop from vmscan.c. This way one
> can also filter out referenced and active pages. See
> http://linuxhacker.ru/~nikita/patches/2.6.14-rc5/05-cluster-pageout.patch,
> it also contains a comment explaining the shmem problem.

Glad to know about your experiences :) Interestingly I started with
->writepage() and then switched to ->writepages(), because filesystems
behave better with the latter (i.e. less file fragmentation).

I'd like to just ignore the shmem case, by adding a
bdi_cap_writeback_dirty() check, because clustered writing to the swap
device may be less of a gain.

Page filtering should also be possible in write_cache_pages(). But
what do you mean by "hard-to-fix races against inode reclamation"?

Thanks,
Fengguang
* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Nick Piggin @ 2009-10-07 11:32 UTC
To: Wu Fengguang
Cc: Nikita Danilov, Andrew Morton, Theodore Tso, Christoph Hellwig,
    Dave Chinner, Chris Mason, Peter Zijlstra, Li Shaohua,
    Myklebust Trond, jens.axboe@oracle.com, Jan Kara,
    linux-fsdevel@vger.kernel.org

On Wed, Oct 07, 2009 at 07:14:54PM +0800, Wu Fengguang wrote:
> On Wed, Oct 07, 2009 at 06:38:37PM +0800, Nikita Danilov wrote:
> > Hello,
> >
> > > When paging out a dirty page, try to piggyback more consecutive dirty
> > > pages (up to 512KB) to improve IO efficiency.
> > >
> > > Only ext3/reiserfs, which don't have their own aops->writepages, are
> > > supported in this initial version.
> > >
> >
> > [...]
> >
> > >         /*
> > > +        * only write_cache_pages() supports for_reclaim for now
> > > +        */
> > > +       if (!mapping->a_ops->writepages) {
> > > +               wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
> > > +               wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
> > > +               generic_writepages(mapping, &wbc);
> > > +       }
> >
> > This might end up calling ->writepage() on a page_mapped() page,
> > which used to be a problem, at least for shmem (see
> > BUG_ON(page_mapped(page)) in shmem_writepage()).
>
> Good catch, thanks a lot!

You also have a problem in that you aren't allowed to touch the mapping
after the page is unlocked. So this patch looks buggy.

I didn't see where Nikita mentions the inode reclaim race, but this is
it. It seems tempting to do this lumpy writeout here, but... hmm, I
don't know. I would like to see workloads where it matters now, and
then I would like to know why normal writeout doesn't keep up.

Then... maybe it's enough simply to allow filesystems to do their own
clustering, because they can tell when writepage is called for reclaim.
But if you have real cases where this helps, it would be very
interesting.
* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Nikita Danilov @ 2009-10-07 11:37 UTC
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
    Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
    jens.axboe@oracle.com, Jan Kara, Nick Piggin,
    linux-fsdevel@vger.kernel.org

2009/10/7 Wu Fengguang <fengguang.wu@intel.com>:
> On Wed, Oct 07, 2009 at 06:38:37PM +0800, Nikita Danilov wrote:
>> Hello,
>>

[...]

>
> Glad to know about your experiences :) Interestingly I started with
> ->writepage() and then switched to ->writepages(), because filesystems
> behave better with the latter (i.e. less file fragmentation).

By the way, why is your patch doing

    ->writepage(page->index);
    generic_writepages(page->index + 1, LUMPY_PAGEOUT_PAGES - 1);

instead of

    generic_writepages(page->index, LUMPY_PAGEOUT_PAGES);

? Is this because of the difficulties with passing back page-specific
errors from generic_writepages()?

> I'd like to just ignore the shmem case, by adding a
> bdi_cap_writeback_dirty() check, because clustered writing to the swap
> device may be less of a gain.

Or you can just call try_to_unmap() from shmem_writepage() when
wbc->for_reclaim is true.

> Page filtering should also be possible in write_cache_pages(). But
> what do you mean by "hard-to-fix races against inode reclamation"?

The vmscan.c pageout path doesn't take a reference on the inode, so
the instant ->writepage() releases the lock on the page, the inode
can be freed.

> Thanks,
> Fengguang

Thank you,
Nikita.
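[To make the race concrete, an illustrative timeline -- comments only,
not code from any posted patch; prune_icache() is the inode-cache
shrinker in fs/inode.c:]

        /*
         *   pageout()                        inode reclaim
         *   ---------                        -------------
         *   res = ->writepage(page, &wbc);
         *     ...
         *     unlock_page(page);
         *                                    shrink_slab()
         *                                      prune_icache() frees the
         *                                      inode, and with it the
         *                                      address_space (i_data)
         *   touch mapping->...               <-- use after free
         */

[The standard way to close such a window is to pin the inode with
igrab() before the writeout and drop it with iput() afterwards, which
is where the thread goes next.]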
* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Wu Fengguang @ 2009-10-07 13:29 UTC
To: Nikita Danilov
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
    Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
    jens.axboe@oracle.com, Jan Kara, Nick Piggin,
    linux-fsdevel@vger.kernel.org

On Wed, Oct 07, 2009 at 07:37:35PM +0800, Nikita Danilov wrote:
> 2009/10/7 Wu Fengguang <fengguang.wu@intel.com>:
> > On Wed, Oct 07, 2009 at 06:38:37PM +0800, Nikita Danilov wrote:
> >> Hello,
> >>
>
> [...]
>
> >
> > Glad to know about your experiences :) Interestingly I started with
> > ->writepage() and then switched to ->writepages(), because filesystems
> > behave better with the latter (i.e. less file fragmentation).
>
> By the way, why is your patch doing
>
>     ->writepage(page->index);
>     generic_writepages(page->index + 1, LUMPY_PAGEOUT_PAGES - 1);
>
> instead of
>
>     generic_writepages(page->index, LUMPY_PAGEOUT_PAGES);
>
> ? Is this because of the difficulties with passing back page-specific
> errors from generic_writepages()?

Yes. It's possible to tell write_cache_pages() to return
AOP_WRITEPAGE_ACTIVATE. Other ->writepages() don't have to deal with
this because their ->writepage() won't return AOP_WRITEPAGE_ACTIVATE
at all.

But it is going to be ugly to specialize the first locked page in
every ->writepages() function..

> >
> > I'd like to just ignore the shmem case, by adding a
> > bdi_cap_writeback_dirty() check, because clustered writing to the swap
> > device may be less of a gain.
>
> Or you can just call try_to_unmap() from shmem_writepage() when
> wbc->for_reclaim is true.

Hmm, it's more comfortable to stay away from shmem for the initial
version. But feel free to submit a patch for it in the future :)

> > Page filtering should also be possible in write_cache_pages(). But
> > what do you mean by "hard-to-fix races against inode reclamation"?
>
> The vmscan.c pageout path doesn't take a reference on the inode, so
> the instant ->writepage() releases the lock on the page, the inode
> can be freed.

Ah, then we could just do igrab() if inode_lock is not locked?
A bit ugly though.

Thanks,
Fengguang
* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Wu Fengguang @ 2009-10-07 13:42 UTC
To: Nikita Danilov
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
    Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
    jens.axboe@oracle.com, Jan Kara, Nick Piggin,
    linux-fsdevel@vger.kernel.org

On Wed, Oct 07, 2009 at 09:29:24PM +0800, Wu Fengguang wrote:
> On Wed, Oct 07, 2009 at 07:37:35PM +0800, Nikita Danilov wrote:
> > 2009/10/7 Wu Fengguang <fengguang.wu@intel.com>:
> > > On Wed, Oct 07, 2009 at 06:38:37PM +0800, Nikita Danilov wrote:
> > >> Hello,
> > >>
> >
> > [...]
> >
> > >
> > > Glad to know about your experiences :) Interestingly I started with
> > > ->writepage() and then switched to ->writepages(), because filesystems
> > > behave better with the latter (i.e. less file fragmentation).
> >
> > By the way, why is your patch doing
> >
> >     ->writepage(page->index);
> >     generic_writepages(page->index + 1, LUMPY_PAGEOUT_PAGES - 1);
> >
> > instead of
> >
> >     generic_writepages(page->index, LUMPY_PAGEOUT_PAGES);
> >
> > ? Is this because of the difficulties with passing back page-specific
> > errors from generic_writepages()?
>
> Yes. It's possible to tell write_cache_pages() to return
> AOP_WRITEPAGE_ACTIVATE. Other ->writepages() don't have to deal with
> this because their ->writepage() won't return AOP_WRITEPAGE_ACTIVATE
> at all.
>
> But it is going to be ugly to specialize the first locked page in
> every ->writepages() function..
>
> > >
> > > I'd like to just ignore the shmem case, by adding a
> > > bdi_cap_writeback_dirty() check, because clustered writing to the swap
> > > device may be less of a gain.
> >
> > Or you can just call try_to_unmap() from shmem_writepage() when
> > wbc->for_reclaim is true.
>
> Hmm, it's more comfortable to stay away from shmem for the initial
> version. But feel free to submit a patch for it in the future :)
>
> > > Page filtering should also be possible in write_cache_pages(). But
> > > what do you mean by "hard-to-fix races against inode reclamation"?
> >
> > The vmscan.c pageout path doesn't take a reference on the inode, so
> > the instant ->writepage() releases the lock on the page, the inode
> > can be freed.
>
> Ah, then we could just do igrab() if inode_lock is not locked?
> A bit ugly though.

Since writepage() can sleep, we don't need to worry about inode_lock.

Here is the updated patch.

Thanks,
Fengguang
---
vmscan: lumpy pageout

When paging out a dirty page, try to piggyback more consecutive dirty
pages (up to 512KB) to improve IO efficiency.

Only ext3/reiserfs, which don't have their own aops->writepages, are
supported in this initial version.

CC: Dave Chinner <david@fromorbit.com>
CC: Nikita Danilov <danilov@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   12 ++++++++++++
 mm/vmscan.c         |   16 ++++++++++++++++
 2 files changed, 28 insertions(+)

--- linux.orig/mm/vmscan.c	2009-10-07 21:39:13.000000000 +0800
+++ linux/mm/vmscan.c	2009-10-07 21:39:57.000000000 +0800
@@ -344,6 +344,8 @@ typedef enum {
 	PAGE_CLEAN,
 } pageout_t;
 
+#define LUMPY_PAGEOUT_PAGES	(512 * 1024 / PAGE_CACHE_SIZE)
+
 /*
  * pageout is called by shrink_page_list() for each dirty page.
  * Calls ->writepage().
@@ -399,16 +401,30 @@ static pageout_t pageout(struct page *pa
 		.for_reclaim = 1,
 	};
 
+	igrab(mapping->host);
 	SetPageReclaim(page);
 	res = mapping->a_ops->writepage(page, &wbc);
 	if (res < 0)
 		handle_write_error(mapping, page, res);
 	if (res == AOP_WRITEPAGE_ACTIVATE) {
 		ClearPageReclaim(page);
+		iput(mapping->host);
 		return PAGE_ACTIVATE;
 	}
 
 	/*
+	 * only write_cache_pages() supports for_reclaim for now
+	 * ignore shmem for now, thanks to Nikita.
+	 */
+	if (bdi_cap_writeback_dirty(mapping->backing_dev_info) &&
+	    !mapping->a_ops->writepages) {
+		wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
+		wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
+		generic_writepages(mapping, &wbc);
+		iput(mapping->host);
+	}
+
+	/*
 	 * Wait on writeback if requested to. This happens when
 	 * direct reclaiming a large contiguous area and the
 	 * first attempt to free a range of pages fails.
--- linux.orig/mm/page-writeback.c	2009-10-07 21:39:13.000000000 +0800
+++ linux/mm/page-writeback.c	2009-10-07 21:39:14.000000000 +0800
@@ -805,6 +805,11 @@ int write_cache_pages(struct address_spa
 			break;
 		}
 
+		if (wbc->for_reclaim && done_index != page->index) {
+			done = 1;
+			break;
+		}
+
 		if (nr_to_write != wbc->nr_to_write &&
 		    done_index + WB_SEGMENT_DIST < page->index &&
 		    --wbc->nr_segments <= 0) {
@@ -846,6 +851,13 @@ continue_unlock:
 			if (!clear_page_dirty_for_io(page))
 				goto continue_unlock;
 
+			/*
+			 * active and unevictable pages will be checked at
+			 * rotate time
+			 */
+			if (wbc->for_reclaim)
+				SetPageReclaim(page);
+
 			ret = (*writepage)(page, wbc, data);
 			if (unlikely(ret)) {
 				if (ret == AOP_WRITEPAGE_ACTIVATE) {
* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Wu Fengguang @ 2009-10-07 14:20 UTC
To: Nikita Danilov
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
    Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
    jens.axboe@oracle.com, Jan Kara, Nick Piggin,
    linux-fsdevel@vger.kernel.org

On Wed, Oct 07, 2009 at 09:42:54PM +0800, Wu Fengguang wrote:
> On Wed, Oct 07, 2009 at 09:29:24PM +0800, Wu Fengguang wrote:
> > On Wed, Oct 07, 2009 at 07:37:35PM +0800, Nikita Danilov wrote:
> > > 2009/10/7 Wu Fengguang <fengguang.wu@intel.com>:
> > > > On Wed, Oct 07, 2009 at 06:38:37PM +0800, Nikita Danilov wrote:
> > > >> Hello,
> > > >>
> > >
> > > [...]
> > >
> > > >
> > > > Glad to know about your experiences :) Interestingly I started with
> > > > ->writepage() and then switched to ->writepages(), because
> > > > filesystems behave better with the latter (i.e. less file
> > > > fragmentation).
> > >
> > > By the way, why is your patch doing
> > >
> > >     ->writepage(page->index);
> > >     generic_writepages(page->index + 1, LUMPY_PAGEOUT_PAGES - 1);
> > >
> > > instead of
> > >
> > >     generic_writepages(page->index, LUMPY_PAGEOUT_PAGES);
> > >
> > > ? Is this because of the difficulties with passing back page-specific
> > > errors from generic_writepages()?
> >
> > Yes. It's possible to tell write_cache_pages() to return
> > AOP_WRITEPAGE_ACTIVATE. Other ->writepages() don't have to deal with
> > this because their ->writepage() won't return AOP_WRITEPAGE_ACTIVATE
> > at all.
> >
> > But it is going to be ugly to specialize the first locked page in
> > every ->writepages() function..
> >
> > > >
> > > > I'd like to just ignore the shmem case, by adding a
> > > > bdi_cap_writeback_dirty() check, because clustered writing to the
> > > > swap device may be less of a gain.
> > >
> > > Or you can just call try_to_unmap() from shmem_writepage() when
> > > wbc->for_reclaim is true.
> >
> > Hmm, it's more comfortable to stay away from shmem for the initial
> > version. But feel free to submit a patch for it in the future :)
> >
> > > > Page filtering should also be possible in write_cache_pages(). But
> > > > what do you mean by "hard-to-fix races against inode reclamation"?
> > >
> > > The vmscan.c pageout path doesn't take a reference on the inode, so
> > > the instant ->writepage() releases the lock on the page, the inode
> > > can be freed.
> >
> > Ah, then we could just do igrab() if inode_lock is not locked?
> > A bit ugly though.
>
> Since writepage() can sleep, we don't need to worry about inode_lock.
>
> Here is the updated patch.

Sorry, forgot to check the return value of igrab()..

---
vmscan: lumpy pageout

When paging out a dirty page, try to piggyback more consecutive dirty
pages (up to 512KB) to improve IO efficiency.

Only ext3/reiserfs, which don't have their own aops->writepages, are
supported in this initial version.

CC: Dave Chinner <david@fromorbit.com>
CC: Nikita Danilov <danilov@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   12 ++++++++++++
 mm/vmscan.c         |   16 ++++++++++++++++
 2 files changed, 28 insertions(+)

--- linux.orig/mm/vmscan.c	2009-10-07 21:39:13.000000000 +0800
+++ linux/mm/vmscan.c	2009-10-07 22:18:55.000000000 +0800
@@ -344,6 +344,8 @@ typedef enum {
 	PAGE_CLEAN,
 } pageout_t;
 
+#define LUMPY_PAGEOUT_PAGES	(512 * 1024 / PAGE_CACHE_SIZE)
+
 /*
  * pageout is called by shrink_page_list() for each dirty page.
  * Calls ->writepage().
@@ -398,6 +400,7 @@ static pageout_t pageout(struct page *pa
 		.nonblocking = 1,
 		.for_reclaim = 1,
 	};
+	struct inode *inode = igrab(mapping->host);
 
 	SetPageReclaim(page);
 	res = mapping->a_ops->writepage(page, &wbc);
@@ -405,10 +408,23 @@ static pageout_t pageout(struct page *pa
 		handle_write_error(mapping, page, res);
 	if (res == AOP_WRITEPAGE_ACTIVATE) {
 		ClearPageReclaim(page);
+		iput(inode);
 		return PAGE_ACTIVATE;
 	}
 
 	/*
+	 * only write_cache_pages() supports for_reclaim for now
+	 * ignore shmem for now, thanks to Nikita.
+	 */
+	if (bdi_cap_writeback_dirty(mapping->backing_dev_info) &&
+	    !mapping->a_ops->writepages) {
+		wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
+		wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
+		generic_writepages(mapping, &wbc);
+		iput(inode);
+	}
+
+	/*
 	 * Wait on writeback if requested to. This happens when
 	 * direct reclaiming a large contiguous area and the
 	 * first attempt to free a range of pages fails.
--- linux.orig/mm/page-writeback.c	2009-10-07 21:39:13.000000000 +0800
+++ linux/mm/page-writeback.c	2009-10-07 21:39:14.000000000 +0800
@@ -805,6 +805,11 @@ int write_cache_pages(struct address_spa
 			break;
 		}
 
+		if (wbc->for_reclaim && done_index != page->index) {
+			done = 1;
+			break;
+		}
+
 		if (nr_to_write != wbc->nr_to_write &&
 		    done_index + WB_SEGMENT_DIST < page->index &&
 		    --wbc->nr_segments <= 0) {
@@ -846,6 +851,13 @@ continue_unlock:
 			if (!clear_page_dirty_for_io(page))
 				goto continue_unlock;
 
+			/*
+			 * active and unevictable pages will be checked at
+			 * rotate time
+			 */
+			if (wbc->for_reclaim)
+				SetPageReclaim(page);
+
 			ret = (*writepage)(page, wbc, data);
 			if (unlikely(ret)) {
 				if (ret == AOP_WRITEPAGE_ACTIVATE) {
* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Nikita Danilov @ 2009-10-07 14:50 UTC
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
    Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
    jens.axboe@oracle.com, Jan Kara, Nick Piggin,
    linux-fsdevel@vger.kernel.org

2009/10/7 Wu Fengguang <fengguang.wu@intel.com>:

[...]

> +	if (bdi_cap_writeback_dirty(mapping->backing_dev_info) &&
> +	    !mapping->a_ops->writepages) {
> +		wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
> +		wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
> +		generic_writepages(mapping, &wbc);
> +		iput(inode);

I am afraid calling iput() from within pageout code is not generally
safe: it can trigger a lot of file system activity (think open-unlinked
file) that is not supposed to happen in the direct-reclaim context.
Limiting pageout clustering to kswapd might be a better idea---this
would improve direct-reclaim latency too, but still, file systems are
not designed for re-entrant calls to iput().

Thank you,
Nikita.
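[The danger being referred to: the final iput() on an open-unlinked
file is not a cheap reference drop -- it ends up doing real filesystem
work. Roughly, as a simplified 2.6.3x call chain from fs/inode.c (exact
steps vary per filesystem):]

	iput(inode)
	  iput_final(inode)
	    generic_delete_inode(inode)	/* the i_nlink == 0 case */
	      ->delete_inode(inode)	/* truncates pages, frees blocks,
					   may issue I/O and take fs locks */

[Doing that from direct reclaim means re-entering arbitrary filesystem
code from whatever context happened to trigger reclaim.]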
* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Wu Fengguang @ 2009-10-07 15:00 UTC
To: Nikita Danilov
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
    Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
    jens.axboe@oracle.com, Jan Kara, Nick Piggin,
    linux-fsdevel@vger.kernel.org

On Wed, Oct 07, 2009 at 10:50:17PM +0800, Nikita Danilov wrote:
> 2009/10/7 Wu Fengguang <fengguang.wu@intel.com>:
>
> [...]
>
> > +	if (bdi_cap_writeback_dirty(mapping->backing_dev_info) &&
> > +	    !mapping->a_ops->writepages) {
> > +		wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
> > +		wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
> > +		generic_writepages(mapping, &wbc);
> > +		iput(inode);
>
> I am afraid calling iput() from within pageout code is not generally
> safe: it can trigger a lot of file system activity (think
> open-unlinked file) that is not supposed to happen in the
> direct-reclaim context. Limiting pageout clustering to kswapd might be
> a better idea---this would improve direct-reclaim latency too, but
> still, file systems are not designed for re-entrant calls to iput().

Good point, thanks!

---
vmscan: lumpy pageout

When paging out a dirty page, try to piggyback more consecutive dirty
pages (up to 512KB) to improve IO efficiency.

Only ext3/reiserfs, which don't have their own aops->writepages, are
supported in this initial version.

CC: Dave Chinner <david@fromorbit.com>
CC: Nikita Danilov <danilov@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   12 ++++++++++++
 mm/vmscan.c         |   20 ++++++++++++++++++++
 2 files changed, 32 insertions(+)

--- linux.orig/mm/vmscan.c	2009-10-07 21:39:13.000000000 +0800
+++ linux/mm/vmscan.c	2009-10-07 22:59:57.000000000 +0800
@@ -344,6 +344,8 @@ typedef enum {
 	PAGE_CLEAN,
 } pageout_t;
 
+#define LUMPY_PAGEOUT_PAGES	(512 * 1024 / PAGE_CACHE_SIZE)
+
 /*
  * pageout is called by shrink_page_list() for each dirty page.
  * Calls ->writepage().
@@ -398,6 +400,10 @@ static pageout_t pageout(struct page *pa
 		.nonblocking = 1,
 		.for_reclaim = 1,
 	};
+	struct inode *inode = NULL;
+
+	if (current_is_kswapd())
+		inode = igrab(mapping->host);
 
 	SetPageReclaim(page);
 	res = mapping->a_ops->writepage(page, &wbc);
@@ -405,10 +411,24 @@ static pageout_t pageout(struct page *pa
 		handle_write_error(mapping, page, res);
 	if (res == AOP_WRITEPAGE_ACTIVATE) {
 		ClearPageReclaim(page);
+		iput(inode);
 		return PAGE_ACTIVATE;
 	}
 
 	/*
+	 * only write_cache_pages() supports for_reclaim for now
+	 * ignore shmem for now, thanks to Nikita.
+	 */
+	if (current_is_kswapd() &&
+	    bdi_cap_writeback_dirty(mapping->backing_dev_info) &&
+	    !mapping->a_ops->writepages) {
+		wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
+		wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
+		generic_writepages(mapping, &wbc);
+		iput(inode);
+	}
+
+	/*
	 * Wait on writeback if requested to. This happens when
	 * direct reclaiming a large contiguous area and the
	 * first attempt to free a range of pages fails.
--- linux.orig/mm/page-writeback.c	2009-10-07 21:39:13.000000000 +0800
+++ linux/mm/page-writeback.c	2009-10-07 22:57:55.000000000 +0800
@@ -805,6 +805,11 @@ int write_cache_pages(struct address_spa
 			break;
 		}
 
+		if (wbc->for_reclaim && done_index != page->index) {
+			done = 1;
+			break;
+		}
+
 		if (nr_to_write != wbc->nr_to_write &&
 		    done_index + WB_SEGMENT_DIST < page->index &&
 		    --wbc->nr_segments <= 0) {
@@ -846,6 +851,13 @@ continue_unlock:
 			if (!clear_page_dirty_for_io(page))
 				goto continue_unlock;
 
+			/*
+			 * active and unevictable pages will be checked at
+			 * rotate time
+			 */
+			if (wbc->for_reclaim)
+				SetPageReclaim(page);
+
 			ret = (*writepage)(page, wbc, data);
 			if (unlikely(ret)) {
 				if (ret == AOP_WRITEPAGE_ACTIVATE) {
* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Nikita Danilov @ 2009-10-07 15:50 UTC
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
    Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
    jens.axboe@oracle.com, Jan Kara, Nick Piggin,
    linux-fsdevel@vger.kernel.org

2009/10/7 Wu Fengguang <fengguang.wu@intel.com>:

[...]

> +	 */
> +	if (current_is_kswapd() &&
> +	    bdi_cap_writeback_dirty(mapping->backing_dev_info) &&
> +	    !mapping->a_ops->writepages) {
> +		wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
> +		wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
> +		generic_writepages(mapping, &wbc);
> +		iput(inode);
> +	}
> +
> +	/*

One potential problem with this is that generic_writepages() waits on
page locks and this can stall kswapd (always bad). This can be worked
around by replacing lock_page() with trylock_page() conditionally on
wbc->for_reclaim (or wbc->nonblocking?), but then, this almost looks
like a separate function would be better.

On the good side, it seems I was wrong and pageout calls iput() already:
shrink_slab()->prune_icache()->iput().

Thank you,
Nikita.
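[Concretely, the suggested conditional locking inside the
write_cache_pages() loop would look something like this -- a
hypothetical sketch only; write_cache_pages() at this point does a
plain lock_page():]

	if (wbc->for_reclaim) {
		/* never let reclaim stall on a contended page lock */
		if (!trylock_page(page))
			continue;
	} else
		lock_page(page);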
* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Wu Fengguang @ 2009-10-08 2:37 UTC
To: Nikita Danilov
Cc: Hugh Dickins, Andrew Morton, Theodore Tso, Christoph Hellwig,
    Dave Chinner, Chris Mason, Peter Zijlstra, Li Shaohua,
    Myklebust Trond, jens.axboe@oracle.com, Jan Kara, Nick Piggin,
    linux-fsdevel@vger.kernel.org

On Wed, Oct 07, 2009 at 11:50:58PM +0800, Nikita Danilov wrote:
> 2009/10/7 Wu Fengguang <fengguang.wu@intel.com>:
>
> [...]
>
> > +	 */
> > +	if (current_is_kswapd() &&
> > +	    bdi_cap_writeback_dirty(mapping->backing_dev_info) &&
> > +	    !mapping->a_ops->writepages) {
> > +		wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
> > +		wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
> > +		generic_writepages(mapping, &wbc);
> > +		iput(inode);
> > +	}
> > +
> > +	/*
>
> One potential problem with this is that generic_writepages() waits on
> page locks and this can stall kswapd (always bad). This can be worked
> around by replacing lock_page() with trylock_page() conditionally on
> wbc->for_reclaim (or wbc->nonblocking?), but then, this almost looks
> like a separate function would be better.

IMHO trylock_page() is not necessary. Locked pages are rare in normal
states. kswapd already does lock_page() on every page whose state it
tries to examine for reclaim. So it makes sense for lumpy pageout to
follow the (simple) convention.

> On the good side, it seems I was wrong and pageout calls iput() already:
> shrink_slab()->prune_icache()->iput().

Not totally wrong ;) iput() will be called if __GFP_FS is on. However
pageout may be called with either __GFP_FS or (__GFP_IO &&
PageSwapCache). So I updated the patch to do lumpy pageout only for
__GFP_FS.

In the long term, it would be good to remove AOP_WRITEPAGE_ACTIVATE
and ->writepage() totally, and to support shmem as well :)

Thanks,
Fengguang

---
vmscan: lumpy pageout

When paging out a dirty page, try to piggyback more consecutive dirty
pages (up to 512KB) to improve IO efficiency.

Only ext3/reiserfs, which don't have their own aops->writepages, are
supported in this initial version.

CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>
CC: Dave Chinner <david@fromorbit.com>
CC: Nikita Danilov <danilov@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   12 ++++++++++++
 mm/vmscan.c         |   23 ++++++++++++++++++++++-
 2 files changed, 34 insertions(+), 1 deletion(-)

--- linux.orig/mm/vmscan.c	2009-10-08 07:35:06.000000000 +0800
+++ linux/mm/vmscan.c	2009-10-08 10:20:51.000000000 +0800
@@ -344,11 +344,14 @@ typedef enum {
 	PAGE_CLEAN,
 } pageout_t;
 
+#define LUMPY_PAGEOUT_PAGES	(512 * 1024 / PAGE_CACHE_SIZE)
+
 /*
  * pageout is called by shrink_page_list() for each dirty page.
  * Calls ->writepage().
  */
 static pageout_t pageout(struct page *page, struct address_space *mapping,
+			 struct scan_control *sc,
 			 enum pageout_io sync_writeback)
 {
 	/*
@@ -398,6 +401,10 @@ static pageout_t pageout(struct page *pa
 		.nonblocking = 1,
 		.for_reclaim = 1,
 	};
+	struct inode *inode = NULL;
+
+	if (sc->gfp_mask & __GFP_FS)
+		inode = igrab(mapping->host);
 
 	SetPageReclaim(page);
 	res = mapping->a_ops->writepage(page, &wbc);
@@ -405,10 +412,24 @@ static pageout_t pageout(struct page *pa
 		handle_write_error(mapping, page, res);
 	if (res == AOP_WRITEPAGE_ACTIVATE) {
 		ClearPageReclaim(page);
+		iput(inode);
 		return PAGE_ACTIVATE;
 	}
 
 	/*
+	 * only write_cache_pages() supports for_reclaim for now
+	 * ignore shmem for now, thanks to Nikita.
+	 */
+	if (current_is_kswapd() &&
+	    bdi_cap_writeback_dirty(mapping->backing_dev_info) &&
+	    !mapping->a_ops->writepages) {
+		wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
+		wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
+		generic_writepages(mapping, &wbc);
+		iput(inode);
+	}
+
+	/*
 	 * Wait on writeback if requested to. This happens when
 	 * direct reclaiming a large contiguous area and the
 	 * first attempt to free a range of pages fails.
@@ -684,7 +705,7 @@ static unsigned long shrink_page_list(st
 			goto keep_locked;
 
 		/* Page is dirty, try to write it out here */
-		switch (pageout(page, mapping, sync_writeback)) {
+		switch (pageout(page, mapping, sc, sync_writeback)) {
 		case PAGE_KEEP:
 			goto keep_locked;
 		case PAGE_ACTIVATE:
--- linux.orig/mm/page-writeback.c	2009-10-08 07:38:57.000000000 +0800
+++ linux/mm/page-writeback.c	2009-10-08 10:06:33.000000000 +0800
@@ -805,6 +805,11 @@ int write_cache_pages(struct address_spa
 			break;
 		}
 
+		if (wbc->for_reclaim && done_index != page->index) {
+			done = 1;
+			break;
+		}
+
 		if (nr_to_write != wbc->nr_to_write &&
 		    done_index + WB_SEGMENT_DIST < page->index &&
 		    --wbc->nr_segments <= 0) {
@@ -846,6 +851,13 @@ continue_unlock:
 			if (!clear_page_dirty_for_io(page))
 				goto continue_unlock;
 
+			/*
+			 * active and unevictable pages will be checked at
+			 * rotate time
+			 */
+			if (wbc->for_reclaim)
+				SetPageReclaim(page);
+
 			ret = (*writepage)(page, wbc, data);
 			if (unlikely(ret)) {
 				if (ret == AOP_WRITEPAGE_ACTIVATE) {
* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Hugh Dickins @ 2009-10-08 8:20 UTC
To: Wu Fengguang
Cc: Nikita Danilov, Andrew Morton, Theodore Tso, Christoph Hellwig,
    Dave Chinner, Chris Mason, Peter Zijlstra, Li Shaohua,
    Myklebust Trond, jens.axboe@oracle.com, Jan Kara, Nick Piggin,
    linux-fsdevel@vger.kernel.org

On Thu, 8 Oct 2009, Wu Fengguang wrote:
> On Wed, Oct 07, 2009 at 11:50:58PM +0800, Nikita Danilov wrote:
> >
> > One potential problem with this is that generic_writepages() waits on
> > page locks and this can stall kswapd (always bad). This can be worked
> > around by replacing lock_page() with trylock_page() conditionally on
> > wbc->for_reclaim (or wbc->nonblocking?), but then, this almost looks
> > like a separate function would be better.
>
> IMHO trylock_page() is not necessary. Locked pages are rare in normal
> states. kswapd already does lock_page() on every page whose state it
> tries to examine for reclaim. So it makes sense for lumpy pageout to
> follow the (simple) convention.

You're mistaken? The only lock_page() I see in vmscan.c is in
handle_write_error(), the important ones are all trylock_page().
And I agree with Nikita that they need to be trylock_page().

Hugh
* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Wu Fengguang @ 2009-10-08 10:12 UTC
To: Hugh Dickins
Cc: Nikita Danilov, Andrew Morton, Theodore Tso, Christoph Hellwig,
    Dave Chinner, Chris Mason, Peter Zijlstra, Li Shaohua,
    Myklebust Trond, jens.axboe@oracle.com, Jan Kara, Nick Piggin,
    linux-fsdevel@vger.kernel.org

On Thu, Oct 08, 2009 at 04:20:03PM +0800, Hugh Dickins wrote:
> On Thu, 8 Oct 2009, Wu Fengguang wrote:
> > On Wed, Oct 07, 2009 at 11:50:58PM +0800, Nikita Danilov wrote:
> > >
> > > One potential problem with this is that generic_writepages() waits
> > > on page locks and this can stall kswapd (always bad). This can be
> > > worked around by replacing lock_page() with trylock_page()
> > > conditionally on wbc->for_reclaim (or wbc->nonblocking?), but then,
> > > this almost looks like a separate function would be better.
> >
> > IMHO trylock_page() is not necessary. Locked pages are rare in normal
> > states. kswapd already does lock_page() on every page whose state it
> > tries to examine for reclaim. So it makes sense for lumpy pageout to
> > follow the (simple) convention.
>
> You're mistaken? The only lock_page() I see in vmscan.c is in
> handle_write_error(), the important ones are all trylock_page().
> And I agree with Nikita that they need to be trylock_page().

Ah, big sorry! I should really have double-checked the code..
OK, I'll do trylock_page().

Thanks,
Fengguang
* [PATCH 00/45] some writeback experiments
From: Wu Fengguang @ 2009-10-07 7:38 UTC
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
    Peter Zijlstra, Li Shaohua, Myklebust Trond, jens.axboe@oracle.com,
    Jan Kara, Nick Piggin, linux-fsdevel, Wu Fengguang, LKML

Hi all,

Here is a collection of writeback patches on

- larger writeback chunk sizes
- a single per-bdi flush thread (killing the foreground throttling
  writeouts)
- lumpy pageout
- sync livelock prevention
- writeback scheduling
- random fixes

Sorry for posting such a big series - there are many direct or implicit
dependencies, and one patch led to another before I could stop..

The lumpy pageout and nr_segments support are not complete and do not
cover all filesystems for now. It may be better to first convert some
of the ->writepages to the generic routines to avoid duplicate work.

I managed to address many issues in the past week, however there are
still known problems. Hints from filesystem developers are highly
appreciated. Thanks!

The estimated writeback bandwidth is about 1/2 the real throughput for
ext2/3/4 and btrfs; noticeably bigger than the real throughput for NFS;
and cannot be estimated at all for XFS. Very interesting..

NFS writeback is very bumpy. The page numbers and network throughput
"freeze" together from time to time:

# vmmon -d 1 nr_writeback nr_dirty nr_unstable   # (per 1-second samples)

     nr_writeback         nr_dirty      nr_unstable
            11227            41463            38044
            11227            41463            38044
            11227            41463            38044
            11227            41463            38044
            11045            53987             6490
            11033            53120             8145
            11195            52143            10886
            11211            52144            10913
            11211            52144            10913
            11211            52144            10913

btrfs seems to maintain a private pool of writeback pages, which can
go out of control:

     nr_writeback         nr_dirty
           261075              132
           252891              195
           244795              187
           236851              187
           228830              187
           221040              218
           212674              237
           204981              237

XFS has a very interesting "bumpy writeback" behavior: it tends to
wait to collect enough pages and then write the whole world:

     nr_writeback         nr_dirty
            80781                0
            37117            37703
            37117            43933
            81044                6
            81050                0
            43943            10199
            43930            36355
            43930            36355
            80293                0
            80285                0
            80285                0

Thanks,
Fengguang
* [PATCH 30/45] vmscan: lumpy pageout
From: Wu Fengguang @ 2009-10-07 7:38 UTC
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
    Peter Zijlstra, Li Shaohua, Myklebust Trond, jens.axboe@oracle.com,
    Jan Kara, Nick Piggin, linux-fsdevel, Wu Fengguang, LKML

[-- Attachment #1: writeback-lumpy-pageout.patch --]
[-- Type: text/plain, Size: 2082 bytes --]

When paging out a dirty page, try to piggyback more consecutive dirty
pages (up to 512KB) to improve IO efficiency.

Only ext3/reiserfs, which don't have their own aops->writepages, are
supported in this initial version.

CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   12 ++++++++++++
 mm/vmscan.c         |   11 +++++++++++
 2 files changed, 23 insertions(+)

--- linux.orig/mm/vmscan.c	2009-10-06 23:37:39.000000000 +0800
+++ linux/mm/vmscan.c	2009-10-06 23:39:30.000000000 +0800
@@ -344,6 +344,8 @@ typedef enum {
 	PAGE_CLEAN,
 } pageout_t;
 
+#define LUMPY_PAGEOUT_PAGES	(512 * 1024 / PAGE_CACHE_SIZE)
+
 /*
  * pageout is called by shrink_page_list() for each dirty page.
  * Calls ->writepage().
@@ -409,6 +411,15 @@ static pageout_t pageout(struct page *pa
 	}
 
 	/*
+	 * only write_cache_pages() supports for_reclaim for now
+	 */
+	if (!mapping->a_ops->writepages) {
+		wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
+		wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
+		generic_writepages(mapping, &wbc);
+	}
+
+	/*
 	 * Wait on writeback if requested to. This happens when
 	 * direct reclaiming a large contiguous area and the
 	 * first attempt to free a range of pages fails.
--- linux.orig/mm/page-writeback.c	2009-10-06 23:39:29.000000000 +0800
+++ linux/mm/page-writeback.c	2009-10-06 23:39:30.000000000 +0800
@@ -805,6 +805,11 @@ int write_cache_pages(struct address_spa
 			break;
 		}
 
+		if (wbc->for_reclaim && done_index != page->index) {
+			done = 1;
+			break;
+		}
+
 		if (nr_to_write != wbc->nr_to_write &&
 		    done_index + WB_SEGMENT_DIST < page->index &&
 		    --wbc->nr_segments <= 0) {
@@ -846,6 +851,13 @@ continue_unlock:
 			if (!clear_page_dirty_for_io(page))
 				goto continue_unlock;
 
+			/*
+			 * active and unevictable pages will be checked at
+			 * rotate time
+			 */
+			if (wbc->for_reclaim)
+				SetPageReclaim(page);
+
 			ret = (*writepage)(page, wbc, data);
 			if (unlikely(ret)) {
 				if (ret == AOP_WRITEPAGE_ACTIVATE) {
end of thread, other threads: [~2009-10-08 10:13 UTC | newest]

Thread overview: 14+ messages
2009-10-07 10:38 [PATCH 30/45] vmscan: lumpy pageout Nikita Danilov
2009-10-07 11:14 ` Wu Fengguang
2009-10-07 11:32   ` Nick Piggin
2009-10-07 11:37   ` Nikita Danilov
2009-10-07 13:29     ` Wu Fengguang
2009-10-07 13:42       ` Wu Fengguang
2009-10-07 14:20         ` Wu Fengguang
2009-10-07 14:50           ` Nikita Danilov
2009-10-07 15:00             ` Wu Fengguang
2009-10-07 15:50               ` Nikita Danilov
2009-10-08  2:37                 ` Wu Fengguang
2009-10-08  8:20                   ` Hugh Dickins
2009-10-08 10:12                     ` Wu Fengguang

-- strict thread matches above, loose matches on Subject: below --
2009-10-07  7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
2009-10-07  7:38 ` [PATCH 30/45] vmscan: lumpy pageout Wu Fengguang