* [PATCH 30/45] vmscan: lumpy pageout
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
0 siblings, 0 replies; 14+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-lumpy-pageout.patch --]
[-- Type: text/plain, Size: 2082 bytes --]
When paging out a dirty page, try to piggyback more consecutive dirty
pages (up to 512KB) to improve I/O efficiency.
Only ext3/reiserfs, which don't have their own aops->writepages, are
supported in this initial version.
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 12 ++++++++++++
mm/vmscan.c | 11 +++++++++++
2 files changed, 23 insertions(+)
--- linux.orig/mm/vmscan.c 2009-10-06 23:37:39.000000000 +0800
+++ linux/mm/vmscan.c 2009-10-06 23:39:30.000000000 +0800
@@ -344,6 +344,8 @@ typedef enum {
PAGE_CLEAN,
} pageout_t;
+#define LUMPY_PAGEOUT_PAGES (512 * 1024 / PAGE_CACHE_SIZE)
+
/*
* pageout is called by shrink_page_list() for each dirty page.
* Calls ->writepage().
@@ -409,6 +411,15 @@ static pageout_t pageout(struct page *pa
}
/*
+ * only write_cache_pages() supports for_reclaim for now
+ */
+ if (!mapping->a_ops->writepages) {
+ wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
+ wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
+ generic_writepages(mapping, &wbc);
+ }
+
+ /*
* Wait on writeback if requested to. This happens when
* direct reclaiming a large contiguous area and the
* first attempt to free a range of pages fails.
--- linux.orig/mm/page-writeback.c 2009-10-06 23:39:29.000000000 +0800
+++ linux/mm/page-writeback.c 2009-10-06 23:39:30.000000000 +0800
@@ -805,6 +805,11 @@ int write_cache_pages(struct address_spa
break;
}
+ if (wbc->for_reclaim && done_index != page->index) {
+ done = 1;
+ break;
+ }
+
if (nr_to_write != wbc->nr_to_write &&
done_index + WB_SEGMENT_DIST < page->index &&
--wbc->nr_segments <= 0) {
@@ -846,6 +851,13 @@ continue_unlock:
if (!clear_page_dirty_for_io(page))
goto continue_unlock;
+ /*
+ * active and unevictable pages will be checked at
+ * rotate time
+ */
+ if (wbc->for_reclaim)
+ SetPageReclaim(page);
+
ret = (*writepage)(page, wbc, data);
if (unlikely(ret)) {
if (ret == AOP_WRITEPAGE_ACTIVATE) {
* Re: [PATCH 30/45] vmscan: lumpy pageout
@ 2009-10-07 10:38 Nikita Danilov
2009-10-07 11:14 ` Wu Fengguang
0 siblings, 1 reply; 14+ messages in thread
From: Nikita Danilov @ 2009-10-07 10:38 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel
Hello,
> When pageout a dirty page, try to piggy back more consecutive dirty
> pages (up to 512KB) to improve IO efficiency.
>
> Only ext3/reiserfs which don't have its own aops->writepages are
> supported in this initial version.
>
[...]
> /*
> + * only write_cache_pages() supports for_reclaim for now
> + */
> + if (!mapping->a_ops->writepages) {
> + wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
> + wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
> + generic_writepages(mapping, &wbc);
> + }
This might end up calling ->writepage() on a page_mapped() page,
which used to be a problem, at least for shmem (see
BUG_ON(page_mapped(page)) in shmem_writepage()).
I tried to do a similar thing in the past, but eventually decided
to explicitly call ->writepage() in a loop from vmscan.c. This
way one can also filter out referenced and active pages. See
http://linuxhacker.ru/~nikita/patches/2.6.14-rc5/05-cluster-pageout.patch,
which also contains a comment explaining the shmem problem.
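For illustration, a rough sketch of such an explicit clustering loop
against the page cache API of this era; cluster_pageout() and its exact
filtering policy are assumptions (this is not the linked 2.6.14 patch),
and the caller would still have to keep the inode alive, as discussed
later in this thread:

static void cluster_pageout(struct address_space *mapping, pgoff_t start,
			    struct writeback_control *wbc)
{
	struct page *pages[16];
	pgoff_t index = start;
	unsigned int i, nr;

	/* a real version would also stop at the first hole / end of cluster */
	nr = find_get_pages_tag(mapping, &index, PAGECACHE_TAG_DIRTY,
				ARRAY_SIZE(pages), pages);
	for (i = 0; i < nr; i++) {
		struct page *page = pages[i];

		if (!trylock_page(page))
			goto next;
		/* filter out pages that reclaim would rather keep in memory */
		if (page->mapping != mapping || page_mapped(page) ||
		    PageActive(page) || PageWriteback(page) ||
		    !clear_page_dirty_for_io(page)) {
			unlock_page(page);
			goto next;
		}
		/* on success ->writepage() unlocks the page by itself */
		if (mapping->a_ops->writepage(page, wbc) ==
		    AOP_WRITEPAGE_ACTIVATE)
			unlock_page(page);
next:
		page_cache_release(page);
	}
}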
Thank you,
Nikita.
* Re: [PATCH 30/45] vmscan: lumpy pageout
2009-10-07 10:38 [PATCH 30/45] vmscan: lumpy pageout Nikita Danilov
@ 2009-10-07 11:14 ` Wu Fengguang
2009-10-07 11:32 ` Nick Piggin
2009-10-07 11:37 ` Nikita Danilov
0 siblings, 2 replies; 14+ messages in thread
From: Wu Fengguang @ 2009-10-07 11:14 UTC (permalink / raw)
To: Nikita Danilov
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org
On Wed, Oct 07, 2009 at 06:38:37PM +0800, Nikita Danilov wrote:
> Hello,
>
> > When pageout a dirty page, try to piggy back more consecutive dirty
> > pages (up to 512KB) to improve IO efficiency.
> >
> > Only ext3/reiserfs which don't have its own aops->writepages are
> > supported in this initial version.
> >
>
> [...]
>
> > /*
> > + * only write_cache_pages() supports for_reclaim for now
> > + */
> > + if (!mapping->a_ops->writepages) {
> > + wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
> > + wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
> > + generic_writepages(mapping, &wbc);
> > + }
>
> This might end up calling ->writepage() on a page_mapped() page,
> which used to be a problem, at least for shmem (see
> BUG_ON(page_mapped(page)) in shmem_writepage()).
Good catch, thanks a lot!
> I tried to do a similar thing in the past, but eventually decided
> do explicitly call ->writepage() in a loop from vmscan.c. This
> way one can also filter out referenced and active pages. See
> http://linuxhacker.ru/~nikita/patches/2.6.14-rc5/05-cluster-pageout.patch,
> it also contains a comment explaining shmem problem.
Glad to know about your experiences :) Interestingly, I started with
->writepage() and then switched to ->writepages() because filesystems
behave better with the latter (i.e. less file fragmentation).
I'd like to just ignore the shmem case by adding a
bdi_cap_writeback_dirty() check, because clustered writing to the
swap device may be less of a gain.
Page filtering should also be possible in write_cache_pages(). But
what do you mean by "hard-to-fix races against inode reclamation"?
Thanks,
Fengguang
* Re: [PATCH 30/45] vmscan: lumpy pageout
2009-10-07 11:14 ` Wu Fengguang
@ 2009-10-07 11:32 ` Nick Piggin
2009-10-07 11:37 ` Nikita Danilov
1 sibling, 0 replies; 14+ messages in thread
From: Nick Piggin @ 2009-10-07 11:32 UTC (permalink / raw)
To: Wu Fengguang
Cc: Nikita Danilov, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Peter Zijlstra, Li, Shaohua,
Myklebust Trond, jens.axboe@oracle.com, Jan Kara,
linux-fsdevel@vger.kernel.org
On Wed, Oct 07, 2009 at 07:14:54PM +0800, Wu Fengguang wrote:
> On Wed, Oct 07, 2009 at 06:38:37PM +0800, Nikita Danilov wrote:
> > Hello,
> >
> > > When pageout a dirty page, try to piggy back more consecutive dirty
> > > pages (up to 512KB) to improve IO efficiency.
> > >
> > > Only ext3/reiserfs which don't have its own aops->writepages are
> > > supported in this initial version.
> > >
> >
> > [...]
> >
> > > /*
> > > + * only write_cache_pages() supports for_reclaim for now
> > > + */
> > > + if (!mapping->a_ops->writepages) {
> > > + wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
> > > + wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
> > > + generic_writepages(mapping, &wbc);
> > > + }
> >
> > This might end up calling ->writepage() on a page_mapped() page,
> > which used to be a problem, at least for shmem (see
> > BUG_ON(page_mapped(page)) in shmem_writepage()).
>
> Good catch, thanks a lot!
You also have a problem in that you aren't allowed to touch mapping
after the page is unlocked. So this patch looks buggy. I didn't see
where Nikita mentions the inode reclaim race, but this is it.
It seems tempting to do this lumpy writeout here, but... hmm, I
don't know. I would like to see workloads where it matters now,
and then I would like to know why normal writeout doesn't keep
up.
Then... maybe it's enough simply to allow filesystems to do
their own clustering, because they can tell when writepage is called
for reclaim.
But if you have real cases where this helps, it would be very
interesting.
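For illustration, a hypothetical filesystem could recognise
reclaim-initiated calls via wbc->for_reclaim and cluster on its own;
the myfs_* helpers below are made up and this is not code from any
posted patch:

static int myfs_writepage(struct page *page, struct writeback_control *wbc)
{
	/*
	 * wbc->for_reclaim tells us this call comes from page reclaim
	 * rather than regular writeback, so this is where the filesystem
	 * can cluster neighbouring dirty pages under its own locking and
	 * lifetime rules (hypothetical helper).
	 */
	if (wbc->for_reclaim)
		myfs_cluster_dirty_pages(page);

	/* ordinary single-page writeout (hypothetical helper) */
	return myfs_write_one_page(page, wbc);
}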
* Re: [PATCH 30/45] vmscan: lumpy pageout
2009-10-07 11:14 ` Wu Fengguang
2009-10-07 11:32 ` Nick Piggin
@ 2009-10-07 11:37 ` Nikita Danilov
2009-10-07 13:29 ` Wu Fengguang
1 sibling, 1 reply; 14+ messages in thread
From: Nikita Danilov @ 2009-10-07 11:37 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org
2009/10/7 Wu Fengguang <fengguang.wu@intel.com>:
> On Wed, Oct 07, 2009 at 06:38:37PM +0800, Nikita Danilov wrote:
>> Hello,
>>
[...]
>
> Glad to know about your experiences :) Interestingly I started with
> ->writepage() and then switch to ->writepages() because filesystems
> behave better with the latter (i.e. less file fragmentation).
By the way, why is your patch doing
->writepage(page->index);
generic_writepages(page->index + 1, LUMPY_PAGEOUT_PAGES - 1);
instead of
generic_writepages(page->index, LUMPY_PAGEOUT_PAGES);
? Is this because of the difficulty of passing back page-specific
errors from generic_writepages()?
>
> I'd like to just ignore the shmem case, by adding a
> bdi_cap_writeback_dirty() check. Because clustered writing to swap
> device may be a less gain.
Or you can just call try_to_unmap() from shmem_writepage() when
wbc->for_reclaim is true.
>
> Page filtering should also be possible in write_cache_pages(). But
> what do you mean by "hard-to-fix races against inode reclamation"?
The vmscan.c pageout path doesn't take a reference on the inode, so
the instant ->writepage() releases the lock on the page, the inode can
be freed.
>
> Thanks,
> Fengguang
>
Thank you,
Nikita.
* Re: [PATCH 30/45] vmscan: lumpy pageout
2009-10-07 11:37 ` Nikita Danilov
@ 2009-10-07 13:29 ` Wu Fengguang
2009-10-07 13:42 ` Wu Fengguang
0 siblings, 1 reply; 14+ messages in thread
From: Wu Fengguang @ 2009-10-07 13:29 UTC (permalink / raw)
To: Nikita Danilov
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org
On Wed, Oct 07, 2009 at 07:37:35PM +0800, Nikita Danilov wrote:
> 2009/10/7 Wu Fengguang <fengguang.wu@intel.com>:
> > On Wed, Oct 07, 2009 at 06:38:37PM +0800, Nikita Danilov wrote:
> >> Hello,
> >>
>
> [...]
>
> >
> > Glad to know about your experiences :) Interestingly I started with
> > ->writepage() and then switch to ->writepages() because filesystems
> > behave better with the latter (i.e. less file fragmentation).
>
> By the way, why is your patch doing
>
> ->writepage(page->index);
> generic_writepages(page->index + 1, LUMPY_PAGEOUT_PAGES - 1);
>
> instead of
>
> generic_writepages(page->index, LUMPY_PAGEOUT_PAGES);
>
> ? Is this because of the difficulties with passing back page specific
> errors from generic_writepages()?
Yes. It's possible to tell write_cache_pages() to return
AOP_WRITEPAGE_ACTIVATE. Other ->writepages() implementations don't
have to deal with this because their ->writepage() won't return
AOP_WRITEPAGE_ACTIVATE at all.
But it is going to be ugly to special-case the first (already locked)
page in every ->writepages() function..
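For reference, the variant being asked about would look roughly like
this inside pageout(); it is hypothetical, and it would require
write_cache_pages() both to report the first page's verdict and to cope
with that page already being locked by the caller:

	/* hypothetical: hand the whole cluster to generic_writepages() */
	wbc.range_start = (loff_t)page->index << PAGE_CACHE_SHIFT;
	wbc.nr_to_write = LUMPY_PAGEOUT_PAGES;
	res = generic_writepages(mapping, &wbc);
	if (res == AOP_WRITEPAGE_ACTIVATE) {
		/*
		 * would need write_cache_pages() to hand back the verdict
		 * of the first page, which pageout() already holds locked
		 */
		ClearPageReclaim(page);
		return PAGE_ACTIVATE;
	}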
> >
> > I'd like to just ignore the shmem case, by adding a
> > bdi_cap_writeback_dirty() check. Because clustered writing to swap
> > device may be a less gain.
>
> Or you can just call try_to_unmap() from shmem_writepage() when
> wbc->for_reclaim is true.
Hmm, it's more comfortable to stay away from shmem for the initial version.
But feel free to submit a patch for it in the future :)
> > Page filtering should also be possible in write_cache_pages(). But
> > what do you mean by "hard-to-fix races against inode reclamation"?
>
> vmscan.c pageout path doesn't take a reference on inode, so the
> instant ->writepage() releases lock on the page, the inode can be
> freed.
Ah, then we could just do igrab() if inode_lock is not locked?
A bit ugly though.
Thanks,
Fengguang
* Re: [PATCH 30/45] vmscan: lumpy pageout
2009-10-07 13:29 ` Wu Fengguang
@ 2009-10-07 13:42 ` Wu Fengguang
2009-10-07 14:20 ` Wu Fengguang
0 siblings, 1 reply; 14+ messages in thread
From: Wu Fengguang @ 2009-10-07 13:42 UTC (permalink / raw)
To: Nikita Danilov
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org
On Wed, Oct 07, 2009 at 09:29:24PM +0800, Wu Fengguang wrote:
> On Wed, Oct 07, 2009 at 07:37:35PM +0800, Nikita Danilov wrote:
> > 2009/10/7 Wu Fengguang <fengguang.wu@intel.com>:
> > > On Wed, Oct 07, 2009 at 06:38:37PM +0800, Nikita Danilov wrote:
> > >> Hello,
> > >>
> >
> > [...]
> >
> > >
> > > Glad to know about your experiences :) Interestingly I started with
> > > ->writepage() and then switch to ->writepages() because filesystems
> > > behave better with the latter (i.e. less file fragmentation).
> >
> > By the way, why is your patch doing
> >
> > ->writepage(page->index);
> > generic_writepages(page->index + 1, LUMPY_PAGEOUT_PAGES - 1);
> >
> > instead of
> >
> > generic_writepages(page->index, LUMPY_PAGEOUT_PAGES);
> >
> > ? Is this because of the difficulties with passing back page specific
> > errors from generic_writepages()?
>
> Yes. It's possible to tell write_cache_pages() to return
> AOP_WRITEPAGE_ACTIVATE. Other ->writepages() don't have to deal with
> this because their ->writepage() won't return AOP_WRITEPAGE_ACTIVATE
> at all.
>
> But it is going to be ugly to specialize the first locked page in
> every ->writepages() functions..
>
> > >
> > > I'd like to just ignore the shmem case, by adding a
> > > bdi_cap_writeback_dirty() check. Because clustered writing to swap
> > > device may be a less gain.
> >
> > Or you can just call try_to_unmap() from shmem_writepage() when
> > wbc->for_reclaim is true.
>
> Hmm, it's more comfortable to stay away from shmem for the initial version.
> But feel free to submit a patch for it in future :)
>
> > > Page filtering should also be possible in write_cache_pages(). But
> > > what do you mean by "hard-to-fix races against inode reclamation"?
> >
> > vmscan.c pageout path doesn't take a reference on inode, so the
> > instant ->writepage() releases lock on the page, the inode can be
> > freed.
>
> Ah, then we could just do igrab() if inode_lock is not locked?
> A bit ugly though.
Since writepage() can sleep, we don't need to worry about inode_lock.
Here is the updated patch.
Thanks,
Fengguang
---
vmscan: lumpy pageout
When paging out a dirty page, try to piggyback more consecutive dirty
pages (up to 512KB) to improve I/O efficiency.
Only ext3/reiserfs, which don't have their own aops->writepages, are
supported in this initial version.
CC: Dave Chinner <david@fromorbit.com>
CC: Nikita Danilov <danilov@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 12 ++++++++++++
mm/vmscan.c | 16 ++++++++++++++++
2 files changed, 28 insertions(+)
--- linux.orig/mm/vmscan.c 2009-10-07 21:39:13.000000000 +0800
+++ linux/mm/vmscan.c 2009-10-07 21:39:57.000000000 +0800
@@ -344,6 +344,8 @@ typedef enum {
PAGE_CLEAN,
} pageout_t;
+#define LUMPY_PAGEOUT_PAGES (512 * 1024 / PAGE_CACHE_SIZE)
+
/*
* pageout is called by shrink_page_list() for each dirty page.
* Calls ->writepage().
@@ -399,16 +401,30 @@ static pageout_t pageout(struct page *pa
.for_reclaim = 1,
};
+ igrab(mapping->host);
SetPageReclaim(page);
res = mapping->a_ops->writepage(page, &wbc);
if (res < 0)
handle_write_error(mapping, page, res);
if (res == AOP_WRITEPAGE_ACTIVATE) {
ClearPageReclaim(page);
+ iput(mapping->host);
return PAGE_ACTIVATE;
}
/*
+ * only write_cache_pages() supports for_reclaim for now
+ * ignore shmem for now, thanks to Nikita.
+ */
+ if (bdi_cap_writeback_dirty(mapping->backing_dev_info) &&
+ !mapping->a_ops->writepages) {
+ wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
+ wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
+ generic_writepages(mapping, &wbc);
+ iput(mapping->host);
+ }
+
+ /*
* Wait on writeback if requested to. This happens when
* direct reclaiming a large contiguous area and the
* first attempt to free a range of pages fails.
--- linux.orig/mm/page-writeback.c 2009-10-07 21:39:13.000000000 +0800
+++ linux/mm/page-writeback.c 2009-10-07 21:39:14.000000000 +0800
@@ -805,6 +805,11 @@ int write_cache_pages(struct address_spa
break;
}
+ if (wbc->for_reclaim && done_index != page->index) {
+ done = 1;
+ break;
+ }
+
if (nr_to_write != wbc->nr_to_write &&
done_index + WB_SEGMENT_DIST < page->index &&
--wbc->nr_segments <= 0) {
@@ -846,6 +851,13 @@ continue_unlock:
if (!clear_page_dirty_for_io(page))
goto continue_unlock;
+ /*
+ * active and unevictable pages will be checked at
+ * rotate time
+ */
+ if (wbc->for_reclaim)
+ SetPageReclaim(page);
+
ret = (*writepage)(page, wbc, data);
if (unlikely(ret)) {
if (ret == AOP_WRITEPAGE_ACTIVATE) {
* Re: [PATCH 30/45] vmscan: lumpy pageout
2009-10-07 13:42 ` Wu Fengguang
@ 2009-10-07 14:20 ` Wu Fengguang
2009-10-07 14:50 ` Nikita Danilov
0 siblings, 1 reply; 14+ messages in thread
From: Wu Fengguang @ 2009-10-07 14:20 UTC (permalink / raw)
To: Nikita Danilov
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org
On Wed, Oct 07, 2009 at 09:42:54PM +0800, Wu Fengguang wrote:
> On Wed, Oct 07, 2009 at 09:29:24PM +0800, Wu Fengguang wrote:
> > On Wed, Oct 07, 2009 at 07:37:35PM +0800, Nikita Danilov wrote:
> > > 2009/10/7 Wu Fengguang <fengguang.wu@intel.com>:
> > > > On Wed, Oct 07, 2009 at 06:38:37PM +0800, Nikita Danilov wrote:
> > > >> Hello,
> > > >>
> > >
> > > [...]
> > >
> > > >
> > > > Glad to know about your experiences :) Interestingly I started with
> > > > ->writepage() and then switch to ->writepages() because filesystems
> > > > behave better with the latter (i.e. less file fragmentation).
> > >
> > > By the way, why is your patch doing
> > >
> > > ->writepage(page->index);
> > > generic_writepages(page->index + 1, LUMPY_PAGEOUT_PAGES - 1);
> > >
> > > instead of
> > >
> > > generic_writepages(page->index, LUMPY_PAGEOUT_PAGES);
> > >
> > > ? Is this because of the difficulties with passing back page specific
> > > errors from generic_writepages()?
> >
> > Yes. It's possible to tell write_cache_pages() to return
> > AOP_WRITEPAGE_ACTIVATE. Other ->writepages() don't have to deal with
> > this because their ->writepage() won't return AOP_WRITEPAGE_ACTIVATE
> > at all.
> >
> > But it is going to be ugly to specialize the first locked page in
> > every ->writepages() functions..
> >
> > > >
> > > > I'd like to just ignore the shmem case, by adding a
> > > > bdi_cap_writeback_dirty() check. Because clustered writing to swap
> > > > device may be a less gain.
> > >
> > > Or you can just call try_to_unmap() from shmem_writepage() when
> > > wbc->for_reclaim is true.
> >
> > Hmm, it's more comfortable to stay away from shmem for the initial version.
> > But feel free to submit a patch for it in future :)
> >
> > > > Page filtering should also be possible in write_cache_pages(). But
> > > > what do you mean by "hard-to-fix races against inode reclamation"?
> > >
> > > vmscan.c pageout path doesn't take a reference on inode, so the
> > > instant ->writepage() releases lock on the page, the inode can be
> > > freed.
> >
> > Ah, then we could just do igrab() if inode_lock is not locked?
> > A bit ugly though.
>
> Since writepage() can sleep, we don't need to worry about inode_lock.
>
> Here is the updated patch.
Sorry, I forgot to check the return value of igrab()..
---
vmscan: lumpy pageout
When paging out a dirty page, try to piggyback more consecutive dirty
pages (up to 512KB) to improve I/O efficiency.
Only ext3/reiserfs, which don't have their own aops->writepages, are
supported in this initial version.
CC: Dave Chinner <david@fromorbit.com>
CC: Nikita Danilov <danilov@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 12 ++++++++++++
mm/vmscan.c | 16 ++++++++++++++++
2 files changed, 28 insertions(+)
--- linux.orig/mm/vmscan.c 2009-10-07 21:39:13.000000000 +0800
+++ linux/mm/vmscan.c 2009-10-07 22:18:55.000000000 +0800
@@ -344,6 +344,8 @@ typedef enum {
PAGE_CLEAN,
} pageout_t;
+#define LUMPY_PAGEOUT_PAGES (512 * 1024 / PAGE_CACHE_SIZE)
+
/*
* pageout is called by shrink_page_list() for each dirty page.
* Calls ->writepage().
@@ -398,6 +400,7 @@ static pageout_t pageout(struct page *pa
.nonblocking = 1,
.for_reclaim = 1,
};
+ struct inode *inode = igrab(mapping->host);
SetPageReclaim(page);
res = mapping->a_ops->writepage(page, &wbc);
@@ -405,10 +408,23 @@ static pageout_t pageout(struct page *pa
handle_write_error(mapping, page, res);
if (res == AOP_WRITEPAGE_ACTIVATE) {
ClearPageReclaim(page);
+ iput(inode);
return PAGE_ACTIVATE;
}
/*
+ * only write_cache_pages() supports for_reclaim for now
+ * ignore shmem for now, thanks to Nikita.
+ */
+ if (bdi_cap_writeback_dirty(mapping->backing_dev_info) &&
+ !mapping->a_ops->writepages) {
+ wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
+ wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
+ generic_writepages(mapping, &wbc);
+ iput(inode);
+ }
+
+ /*
* Wait on writeback if requested to. This happens when
* direct reclaiming a large contiguous area and the
* first attempt to free a range of pages fails.
--- linux.orig/mm/page-writeback.c 2009-10-07 21:39:13.000000000 +0800
+++ linux/mm/page-writeback.c 2009-10-07 21:39:14.000000000 +0800
@@ -805,6 +805,11 @@ int write_cache_pages(struct address_spa
break;
}
+ if (wbc->for_reclaim && done_index != page->index) {
+ done = 1;
+ break;
+ }
+
if (nr_to_write != wbc->nr_to_write &&
done_index + WB_SEGMENT_DIST < page->index &&
--wbc->nr_segments <= 0) {
@@ -846,6 +851,13 @@ continue_unlock:
if (!clear_page_dirty_for_io(page))
goto continue_unlock;
+ /*
+ * active and unevictable pages will be checked at
+ * rotate time
+ */
+ if (wbc->for_reclaim)
+ SetPageReclaim(page);
+
ret = (*writepage)(page, wbc, data);
if (unlikely(ret)) {
if (ret == AOP_WRITEPAGE_ACTIVATE) {
* Re: [PATCH 30/45] vmscan: lumpy pageout
2009-10-07 14:20 ` Wu Fengguang
@ 2009-10-07 14:50 ` Nikita Danilov
2009-10-07 15:00 ` Wu Fengguang
0 siblings, 1 reply; 14+ messages in thread
From: Nikita Danilov @ 2009-10-07 14:50 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org
2009/10/7 Wu Fengguang <fengguang.wu@intel.com>:
[...]
> + if (bdi_cap_writeback_dirty(mapping->backing_dev_info) &&
> + !mapping->a_ops->writepages) {
> + wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
> + wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
> + generic_writepages(mapping, &wbc);
> + iput(inode);
I am afraid calling iput() from within pageout code is not generally
safe: it can trigger a lot of file system activity (think
open-unlinked file) that is not supposed to happen in the
direct-reclaim context. Limiting pageout clustering to kswapd might be
a better idea---this would improve direct-reclaim latency too, but
still, file systems are not designed for re-entrant calls to iput().
Thank you,
Nikita.
* Re: [PATCH 30/45] vmscan: lumpy pageout
2009-10-07 14:50 ` Nikita Danilov
@ 2009-10-07 15:00 ` Wu Fengguang
2009-10-07 15:50 ` Nikita Danilov
0 siblings, 1 reply; 14+ messages in thread
From: Wu Fengguang @ 2009-10-07 15:00 UTC (permalink / raw)
To: Nikita Danilov
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org
On Wed, Oct 07, 2009 at 10:50:17PM +0800, Nikita Danilov wrote:
> 2009/10/7 Wu Fengguang <fengguang.wu@intel.com>:
>
> [...]
>
> > + if (bdi_cap_writeback_dirty(mapping->backing_dev_info) &&
> > + !mapping->a_ops->writepages) {
> > + wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
> > + wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
> > + generic_writepages(mapping, &wbc);
> > + iput(inode);
>
> I am afraid calling iput() from within pageout code is not generally
> safe: it can trigger a lot of file system activity (think
> open-unlinked file) that is not supposed to happen in the
> direct-reclaim context. Limiting pageout clustering to kswapd might be
> a better idea---this would improve direct-reclaim latency too, but
> still, file systems are not designed for re-entrant calls to iput().
Good point, thanks!
---
vmscan: lumpy pageout
When paging out a dirty page, try to piggyback more consecutive dirty
pages (up to 512KB) to improve I/O efficiency.
Only ext3/reiserfs, which don't have their own aops->writepages, are
supported in this initial version.
CC: Dave Chinner <david@fromorbit.com>
CC: Nikita Danilov <danilov@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 12 ++++++++++++
mm/vmscan.c | 20 ++++++++++++++++++++
2 files changed, 32 insertions(+)
--- linux.orig/mm/vmscan.c 2009-10-07 21:39:13.000000000 +0800
+++ linux/mm/vmscan.c 2009-10-07 22:59:57.000000000 +0800
@@ -344,6 +344,8 @@ typedef enum {
PAGE_CLEAN,
} pageout_t;
+#define LUMPY_PAGEOUT_PAGES (512 * 1024 / PAGE_CACHE_SIZE)
+
/*
* pageout is called by shrink_page_list() for each dirty page.
* Calls ->writepage().
@@ -398,6 +400,10 @@ static pageout_t pageout(struct page *pa
.nonblocking = 1,
.for_reclaim = 1,
};
+ struct inode *inode = NULL;
+
+ if (current_is_kswapd())
+ inode = igrab(mapping->host);
SetPageReclaim(page);
res = mapping->a_ops->writepage(page, &wbc);
@@ -405,10 +411,24 @@ static pageout_t pageout(struct page *pa
handle_write_error(mapping, page, res);
if (res == AOP_WRITEPAGE_ACTIVATE) {
ClearPageReclaim(page);
+ iput(inode);
return PAGE_ACTIVATE;
}
/*
+ * only write_cache_pages() supports for_reclaim for now
+ * ignore shmem for now, thanks to Nikita.
+ */
+ if (current_is_kswapd() &&
+ bdi_cap_writeback_dirty(mapping->backing_dev_info) &&
+ !mapping->a_ops->writepages) {
+ wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
+ wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
+ generic_writepages(mapping, &wbc);
+ iput(inode);
+ }
+
+ /*
* Wait on writeback if requested to. This happens when
* direct reclaiming a large contiguous area and the
* first attempt to free a range of pages fails.
--- linux.orig/mm/page-writeback.c 2009-10-07 21:39:13.000000000 +0800
+++ linux/mm/page-writeback.c 2009-10-07 22:57:55.000000000 +0800
@@ -805,6 +805,11 @@ int write_cache_pages(struct address_spa
break;
}
+ if (wbc->for_reclaim && done_index != page->index) {
+ done = 1;
+ break;
+ }
+
if (nr_to_write != wbc->nr_to_write &&
done_index + WB_SEGMENT_DIST < page->index &&
--wbc->nr_segments <= 0) {
@@ -846,6 +851,13 @@ continue_unlock:
if (!clear_page_dirty_for_io(page))
goto continue_unlock;
+ /*
+ * active and unevictable pages will be checked at
+ * rotate time
+ */
+ if (wbc->for_reclaim)
+ SetPageReclaim(page);
+
ret = (*writepage)(page, wbc, data);
if (unlikely(ret)) {
if (ret == AOP_WRITEPAGE_ACTIVATE) {
* Re: [PATCH 30/45] vmscan: lumpy pageout
2009-10-07 15:00 ` Wu Fengguang
@ 2009-10-07 15:50 ` Nikita Danilov
2009-10-08 2:37 ` Wu Fengguang
0 siblings, 1 reply; 14+ messages in thread
From: Nikita Danilov @ 2009-10-07 15:50 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org
2009/10/7 Wu Fengguang <fengguang.wu@intel.com>:
[...]
> + */
> + if (current_is_kswapd() &&
> + bdi_cap_writeback_dirty(mapping->backing_dev_info) &&
> + !mapping->a_ops->writepages) {
> + wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
> + wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
> + generic_writepages(mapping, &wbc);
> + iput(inode);
> + }
> +
> + /*
One potential problem with this is that generic_writepages() waits on
page locks and this can stall kswapd (always bad). This can be worked
around by replacing lock_page() with trylock_page() conditionally on
wbc->for_reclaim (or wbc->nonblocking?), but then it almost looks
like a separate function would be better.
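A minimal sketch of that locking change inside write_cache_pages()'s
scanning loop (hypothetical, not a posted patch):

		if (wbc->for_reclaim) {
			/* never let kswapd block on a page lock */
			if (!trylock_page(page))
				continue;
		} else
			lock_page(page);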
On the plus side, it seems I was wrong and the reclaim path already
calls iput(): shrink_slab()->prune_icache()->iput().
Thank you,
Nikita.
* Re: [PATCH 30/45] vmscan: lumpy pageout
2009-10-07 15:50 ` Nikita Danilov
@ 2009-10-08 2:37 ` Wu Fengguang
2009-10-08 8:20 ` Hugh Dickins
0 siblings, 1 reply; 14+ messages in thread
From: Wu Fengguang @ 2009-10-08 2:37 UTC (permalink / raw)
To: Nikita Danilov
Cc: Hugh Dickins, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Peter Zijlstra, Li, Shaohua,
Myklebust Trond, jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org
On Wed, Oct 07, 2009 at 11:50:58PM +0800, Nikita Danilov wrote:
> 2009/10/7 Wu Fengguang <fengguang.wu@intel.com>:
>
> [...]
>
> > + */
> > + if (current_is_kswapd() &&
> > + bdi_cap_writeback_dirty(mapping->backing_dev_info) &&
> > + !mapping->a_ops->writepages) {
> > + wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
> > + wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
> > + generic_writepages(mapping, &wbc);
> > + iput(inode);
> > + }
> > +
> > + /*
>
> One potential problem with this is that generic_writepages() waits on
> page locks and this can stall kswapd (always bad). This can be worked
> around by replacing lock_page() with trylock_page() conditionally on
> wbc->for_reclaim (or wbc->nonblocking?), but then, this almost look
> like a separate function would be better.
IMHO trylock_page() is not necessary. Locked pages are rare in normal
states. kswapd already does lock_page() for all the pages whose state
it tries to examine for reclaim. So it makes sense for lumpy pageout
to follow the (simple) convention.
> On a good side, it seems I was wrong and pageout calls iput() already:
> shrink_slab()->prune_icache()->iput().
Not totally wrong ;) iput() will be called if __GFP_FS is on. However,
pageout may be called with either __GFP_FS or (__GFP_IO && PageSwapCache).
So I updated the patch to do lumpy pageout only for __GFP_FS. In the
long term, it would be good to remove AOP_WRITEPAGE_ACTIVATE and
->writepage() entirely, and to support shmem as well :)
Thanks,
Fengguang
---
vmscan: lumpy pageout
When paging out a dirty page, try to piggyback more consecutive dirty
pages (up to 512KB) to improve I/O efficiency.
Only ext3/reiserfs, which don't have their own aops->writepages, are
supported in this initial version.
CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>
CC: Dave Chinner <david@fromorbit.com>
CC: Nikita Danilov <danilov@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 12 ++++++++++++
mm/vmscan.c | 23 ++++++++++++++++++++++-
2 files changed, 34 insertions(+), 1 deletion(-)
--- linux.orig/mm/vmscan.c 2009-10-08 07:35:06.000000000 +0800
+++ linux/mm/vmscan.c 2009-10-08 10:20:51.000000000 +0800
@@ -344,11 +344,14 @@ typedef enum {
PAGE_CLEAN,
} pageout_t;
+#define LUMPY_PAGEOUT_PAGES (512 * 1024 / PAGE_CACHE_SIZE)
+
/*
* pageout is called by shrink_page_list() for each dirty page.
* Calls ->writepage().
*/
static pageout_t pageout(struct page *page, struct address_space *mapping,
+ struct scan_control *sc,
enum pageout_io sync_writeback)
{
/*
@@ -398,6 +401,10 @@ static pageout_t pageout(struct page *pa
.nonblocking = 1,
.for_reclaim = 1,
};
+ struct inode *inode = NULL;
+
+ if (sc->gfp_mask & __GFP_FS)
+ inode = igrab(mapping->host);
SetPageReclaim(page);
res = mapping->a_ops->writepage(page, &wbc);
@@ -405,10 +412,24 @@ static pageout_t pageout(struct page *pa
handle_write_error(mapping, page, res);
if (res == AOP_WRITEPAGE_ACTIVATE) {
ClearPageReclaim(page);
+ iput(inode);
return PAGE_ACTIVATE;
}
/*
+ * only write_cache_pages() supports for_reclaim for now
+ * ignore shmem for now, thanks to Nikita.
+ */
+ if (current_is_kswapd() &&
+ bdi_cap_writeback_dirty(mapping->backing_dev_info) &&
+ !mapping->a_ops->writepages) {
+ wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
+ wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
+ generic_writepages(mapping, &wbc);
+ iput(inode);
+ }
+
+ /*
* Wait on writeback if requested to. This happens when
* direct reclaiming a large contiguous area and the
* first attempt to free a range of pages fails.
@@ -684,7 +705,7 @@ static unsigned long shrink_page_list(st
goto keep_locked;
/* Page is dirty, try to write it out here */
- switch (pageout(page, mapping, sync_writeback)) {
+ switch (pageout(page, mapping, sc, sync_writeback)) {
case PAGE_KEEP:
goto keep_locked;
case PAGE_ACTIVATE:
--- linux.orig/mm/page-writeback.c 2009-10-08 07:38:57.000000000 +0800
+++ linux/mm/page-writeback.c 2009-10-08 10:06:33.000000000 +0800
@@ -805,6 +805,11 @@ int write_cache_pages(struct address_spa
break;
}
+ if (wbc->for_reclaim && done_index != page->index) {
+ done = 1;
+ break;
+ }
+
if (nr_to_write != wbc->nr_to_write &&
done_index + WB_SEGMENT_DIST < page->index &&
--wbc->nr_segments <= 0) {
@@ -846,6 +851,13 @@ continue_unlock:
if (!clear_page_dirty_for_io(page))
goto continue_unlock;
+ /*
+ * active and unevictable pages will be checked at
+ * rotate time
+ */
+ if (wbc->for_reclaim)
+ SetPageReclaim(page);
+
ret = (*writepage)(page, wbc, data);
if (unlikely(ret)) {
if (ret == AOP_WRITEPAGE_ACTIVATE) {
* Re: [PATCH 30/45] vmscan: lumpy pageout
2009-10-08 2:37 ` Wu Fengguang
@ 2009-10-08 8:20 ` Hugh Dickins
2009-10-08 10:12 ` Wu Fengguang
0 siblings, 1 reply; 14+ messages in thread
From: Hugh Dickins @ 2009-10-08 8:20 UTC (permalink / raw)
To: Wu Fengguang
Cc: Nikita Danilov, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Peter Zijlstra, Li, Shaohua,
Myklebust Trond, jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org
On Thu, 8 Oct 2009, Wu Fengguang wrote:
> On Wed, Oct 07, 2009 at 11:50:58PM +0800, Nikita Danilov wrote:
> >
> > One potential problem with this is that generic_writepages() waits on
> > page locks and this can stall kswapd (always bad). This can be worked
> > around by replacing lock_page() with trylock_page() conditionally on
> > wbc->for_reclaim (or wbc->nonblocking?), but then, this almost look
> > like a separate function would be better.
>
> IMHO trylock_page() is not necessary. Locked pages are rare in normal
> states. kswapd already do lock_page() for all pages it try to examine
> state for reclaim. So it makes sense for lumpy pageout to follow the
> (simple) convention.
You're mistaken? The only lock_page() I see in vmscan.c is in
handle_write_error(); the important ones are all trylock_page().
And I agree with Nikita that they need to be trylock_page().
Hugh
* Re: [PATCH 30/45] vmscan: lumpy pageout
2009-10-08 8:20 ` Hugh Dickins
@ 2009-10-08 10:12 ` Wu Fengguang
0 siblings, 0 replies; 14+ messages in thread
From: Wu Fengguang @ 2009-10-08 10:12 UTC (permalink / raw)
To: Hugh Dickins
Cc: Nikita Danilov, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Peter Zijlstra, Li, Shaohua,
Myklebust Trond, jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org
On Thu, Oct 08, 2009 at 04:20:03PM +0800, Hugh Dickins wrote:
> On Thu, 8 Oct 2009, Wu Fengguang wrote:
> > On Wed, Oct 07, 2009 at 11:50:58PM +0800, Nikita Danilov wrote:
> > >
> > > One potential problem with this is that generic_writepages() waits on
> > > page locks and this can stall kswapd (always bad). This can be worked
> > > around by replacing lock_page() with trylock_page() conditionally on
> > > wbc->for_reclaim (or wbc->nonblocking?), but then, this almost look
> > > like a separate function would be better.
> >
> > IMHO trylock_page() is not necessary. Locked pages are rare in normal
> > states. kswapd already do lock_page() for all pages it try to examine
> > state for reclaim. So it makes sense for lumpy pageout to follow the
> > (simple) convention.
>
> You're mistaken? The only lock_page() I see in vmscan.c is in
> handle_write_error(), the important ones are all trylock_page().
> And I agree with Nikita that they need to be trylock_page().
Ah, big sorry! I should really double-check the code..
OK, I'll do trylock_page().
Thanks,
Fengguang