* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Nikita Danilov @ 2009-10-07 10:38 UTC
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
    Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
    jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel

Hello,

> When paging out a dirty page, try to piggyback more consecutive dirty
> pages (up to 512KB) to improve IO efficiency.
>
> Only ext3/reiserfs, which don't have their own aops->writepages, are
> supported in this initial version.
>

[...]

>         /*
> +        * only write_cache_pages() supports for_reclaim for now
> +        */
> +       if (!mapping->a_ops->writepages) {
> +               wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
> +               wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
> +               generic_writepages(mapping, &wbc);
> +       }

This might end up calling ->writepage() on a page_mapped() page, which
used to be a problem, at least for shmem (see BUG_ON(page_mapped(page))
in shmem_writepage()).

I tried to do a similar thing in the past, but eventually decided to
explicitly call ->writepage() in a loop from vmscan.c. This way one can
also filter out referenced and active pages. See
http://linuxhacker.ru/~nikita/patches/2.6.14-rc5/05-cluster-pageout.patch,
it also contains a comment explaining the shmem problem.

Thank you,
Nikita.
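[For readers who don't follow the link, that explicit loop amounts to
roughly the following. This is a simplified, hypothetical sketch against
the 2.6.3x page-cache API: cluster_pageout() is a made-up name, error
handling and the shmem case are omitted, the caller is assumed to pass a
wbc with .for_reclaim = 1, and the inode-lifetime issues discussed later
in this thread apply here just as much.]

        static void cluster_pageout(struct page *page,
                                    struct address_space *mapping,
                                    struct writeback_control *wbc)
        {
                pgoff_t index = page->index + 1;
                pgoff_t end = page->index + LUMPY_PAGEOUT_PAGES;

                for (; index < end; index++) {
                        struct page *p = find_get_page(mapping, index);

                        if (!p)
                                break;  /* hole: stop clustering */
                        if (!trylock_page(p)) {
                                page_cache_release(p);
                                break;  /* never wait on a page lock */
                        }
                        /* filter out pages the VM would rather keep */
                        if (PageActive(p) || PageReferenced(p) ||
                            !PageDirty(p) || PageWriteback(p)) {
                                unlock_page(p);
                                page_cache_release(p);
                                break;
                        }
                        if (clear_page_dirty_for_io(p))
                                /* ->writepage() unlocks the page */
                                mapping->a_ops->writepage(p, wbc);
                        else
                                unlock_page(p);
                        page_cache_release(p);
                }
        }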
* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Wu Fengguang @ 2009-10-07 11:14 UTC
To: Nikita Danilov
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
    Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
    jens.axboe@oracle.com, Jan Kara, Nick Piggin,
    linux-fsdevel@vger.kernel.org

On Wed, Oct 07, 2009 at 06:38:37PM +0800, Nikita Danilov wrote:
> Hello,
>
> > When paging out a dirty page, try to piggyback more consecutive dirty
> > pages (up to 512KB) to improve IO efficiency.
> >
> > Only ext3/reiserfs, which don't have their own aops->writepages, are
> > supported in this initial version.
> >
>
> [...]
>
> >         /*
> > +        * only write_cache_pages() supports for_reclaim for now
> > +        */
> > +       if (!mapping->a_ops->writepages) {
> > +               wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
> > +               wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
> > +               generic_writepages(mapping, &wbc);
> > +       }
>
> This might end up calling ->writepage() on a page_mapped() page,
> which used to be a problem, at least for shmem (see
> BUG_ON(page_mapped(page)) in shmem_writepage()).

Good catch, thanks a lot!

> I tried to do a similar thing in the past, but eventually decided to
> explicitly call ->writepage() in a loop from vmscan.c. This way one
> can also filter out referenced and active pages. See
> http://linuxhacker.ru/~nikita/patches/2.6.14-rc5/05-cluster-pageout.patch,
> it also contains a comment explaining the shmem problem.

Glad to know about your experiences :) Interestingly I started with
->writepage() and then switched to ->writepages(), because filesystems
behave better with the latter (i.e. less file fragmentation).

I'd like to just ignore the shmem case, by adding a
bdi_cap_writeback_dirty() check, because clustered writing to the swap
device may be less of a gain.

Page filtering should also be possible in write_cache_pages(). But
what do you mean by "hard-to-fix races against inode reclamation"?

Thanks,
Fengguang
* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Nick Piggin @ 2009-10-07 11:32 UTC
To: Wu Fengguang
Cc: Nikita Danilov, Andrew Morton, Theodore Tso, Christoph Hellwig,
    Dave Chinner, Chris Mason, Peter Zijlstra, Li Shaohua,
    Myklebust Trond, jens.axboe@oracle.com, Jan Kara,
    linux-fsdevel@vger.kernel.org

On Wed, Oct 07, 2009 at 07:14:54PM +0800, Wu Fengguang wrote:
> On Wed, Oct 07, 2009 at 06:38:37PM +0800, Nikita Danilov wrote:
> > Hello,
> >
> > > When paging out a dirty page, try to piggyback more consecutive dirty
> > > pages (up to 512KB) to improve IO efficiency.
> > >
> > > Only ext3/reiserfs, which don't have their own aops->writepages, are
> > > supported in this initial version.
> > >
> >
> > [...]
> >
> > >         /*
> > > +        * only write_cache_pages() supports for_reclaim for now
> > > +        */
> > > +       if (!mapping->a_ops->writepages) {
> > > +               wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
> > > +               wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
> > > +               generic_writepages(mapping, &wbc);
> > > +       }
> >
> > This might end up calling ->writepage() on a page_mapped() page,
> > which used to be a problem, at least for shmem (see
> > BUG_ON(page_mapped(page)) in shmem_writepage()).
>
> Good catch, thanks a lot!

You also have a problem in that you aren't allowed to touch the mapping
after the page is unlocked. So this patch looks buggy.

I didn't see where Nikita mentions the inode reclaim race, but this is
it. It seems tempting to do this lumpy writeout here, but... hmm, I
don't know. I would like to see workloads where it matters now, and
then I would like to know why normal writeout doesn't keep up.

Then... maybe it's enough simply to allow filesystems to do their own
clustering, because they can tell when writepage is called for reclaim.
But if you have real cases where this helps, it would be very
interesting.
* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Nikita Danilov @ 2009-10-07 11:37 UTC
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
    Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
    jens.axboe@oracle.com, Jan Kara, Nick Piggin,
    linux-fsdevel@vger.kernel.org

2009/10/7 Wu Fengguang <fengguang.wu@intel.com>:
> On Wed, Oct 07, 2009 at 06:38:37PM +0800, Nikita Danilov wrote:
>> Hello,
>>

[...]

>
> Glad to know about your experiences :) Interestingly I started with
> ->writepage() and then switched to ->writepages(), because filesystems
> behave better with the latter (i.e. less file fragmentation).

By the way, why is your patch doing

    ->writepage(page->index);
    generic_writepages(page->index + 1, LUMPY_PAGEOUT_PAGES - 1);

instead of

    generic_writepages(page->index, LUMPY_PAGEOUT_PAGES);

? Is this because of the difficulties with passing back page-specific
errors from generic_writepages()?

> I'd like to just ignore the shmem case, by adding a
> bdi_cap_writeback_dirty() check, because clustered writing to the swap
> device may be less of a gain.

Or you can just call try_to_unmap() from shmem_writepage() when
wbc->for_reclaim is true.

> Page filtering should also be possible in write_cache_pages(). But
> what do you mean by "hard-to-fix races against inode reclamation"?

The vmscan.c pageout path doesn't take a reference on the inode, so
the instant ->writepage() releases the lock on the page, the inode
can be freed.

> Thanks,
> Fengguang

Thank you,
Nikita.
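[To make the race concrete, an illustrative timeline -- comments only,
not code from any posted patch; prune_icache() is the inode-cache
shrinker in fs/inode.c:]

        /*
         *   pageout()                        inode reclaim
         *   ---------                        -------------
         *   res = ->writepage(page, &wbc);
         *     ...
         *     unlock_page(page);
         *                                    shrink_slab()
         *                                      prune_icache() frees the
         *                                      inode, and with it the
         *                                      address_space (i_data)
         *   touch mapping->...               <-- use after free
         */

[The standard way to close such a window is to pin the inode with
igrab() before the writeout and drop it with iput() afterwards, which
is where the thread goes next.]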
* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Wu Fengguang @ 2009-10-07 13:29 UTC
To: Nikita Danilov
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
    Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
    jens.axboe@oracle.com, Jan Kara, Nick Piggin,
    linux-fsdevel@vger.kernel.org

On Wed, Oct 07, 2009 at 07:37:35PM +0800, Nikita Danilov wrote:
> 2009/10/7 Wu Fengguang <fengguang.wu@intel.com>:
> > On Wed, Oct 07, 2009 at 06:38:37PM +0800, Nikita Danilov wrote:
> >> Hello,
> >>
>
> [...]
>
> >
> > Glad to know about your experiences :) Interestingly I started with
> > ->writepage() and then switched to ->writepages(), because filesystems
> > behave better with the latter (i.e. less file fragmentation).
>
> By the way, why is your patch doing
>
>     ->writepage(page->index);
>     generic_writepages(page->index + 1, LUMPY_PAGEOUT_PAGES - 1);
>
> instead of
>
>     generic_writepages(page->index, LUMPY_PAGEOUT_PAGES);
>
> ? Is this because of the difficulties with passing back page-specific
> errors from generic_writepages()?

Yes. It's possible to tell write_cache_pages() to return
AOP_WRITEPAGE_ACTIVATE. Other ->writepages() don't have to deal with
this because their ->writepage() won't return AOP_WRITEPAGE_ACTIVATE
at all.

But it is going to be ugly to specialize the first locked page in
every ->writepages() function..

> >
> > I'd like to just ignore the shmem case, by adding a
> > bdi_cap_writeback_dirty() check, because clustered writing to the swap
> > device may be less of a gain.
>
> Or you can just call try_to_unmap() from shmem_writepage() when
> wbc->for_reclaim is true.

Hmm, it's more comfortable to stay away from shmem for the initial
version. But feel free to submit a patch for it in the future :)

> > Page filtering should also be possible in write_cache_pages(). But
> > what do you mean by "hard-to-fix races against inode reclamation"?
>
> The vmscan.c pageout path doesn't take a reference on the inode, so
> the instant ->writepage() releases the lock on the page, the inode
> can be freed.

Ah, then we could just do igrab() if inode_lock is not locked?
A bit ugly though.

Thanks,
Fengguang
* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Wu Fengguang @ 2009-10-07 13:42 UTC
To: Nikita Danilov
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
    Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
    jens.axboe@oracle.com, Jan Kara, Nick Piggin,
    linux-fsdevel@vger.kernel.org

On Wed, Oct 07, 2009 at 09:29:24PM +0800, Wu Fengguang wrote:
> On Wed, Oct 07, 2009 at 07:37:35PM +0800, Nikita Danilov wrote:
> > 2009/10/7 Wu Fengguang <fengguang.wu@intel.com>:
> > > On Wed, Oct 07, 2009 at 06:38:37PM +0800, Nikita Danilov wrote:
> > >> Hello,
> > >>
> >
> > [...]
> >
> > >
> > > Glad to know about your experiences :) Interestingly I started with
> > > ->writepage() and then switched to ->writepages(), because filesystems
> > > behave better with the latter (i.e. less file fragmentation).
> >
> > By the way, why is your patch doing
> >
> >     ->writepage(page->index);
> >     generic_writepages(page->index + 1, LUMPY_PAGEOUT_PAGES - 1);
> >
> > instead of
> >
> >     generic_writepages(page->index, LUMPY_PAGEOUT_PAGES);
> >
> > ? Is this because of the difficulties with passing back page-specific
> > errors from generic_writepages()?
>
> Yes. It's possible to tell write_cache_pages() to return
> AOP_WRITEPAGE_ACTIVATE. Other ->writepages() don't have to deal with
> this because their ->writepage() won't return AOP_WRITEPAGE_ACTIVATE
> at all.
>
> But it is going to be ugly to specialize the first locked page in
> every ->writepages() function..
>
> > >
> > > I'd like to just ignore the shmem case, by adding a
> > > bdi_cap_writeback_dirty() check, because clustered writing to the swap
> > > device may be less of a gain.
> >
> > Or you can just call try_to_unmap() from shmem_writepage() when
> > wbc->for_reclaim is true.
>
> Hmm, it's more comfortable to stay away from shmem for the initial
> version. But feel free to submit a patch for it in the future :)
>
> > > Page filtering should also be possible in write_cache_pages(). But
> > > what do you mean by "hard-to-fix races against inode reclamation"?
> >
> > The vmscan.c pageout path doesn't take a reference on the inode, so
> > the instant ->writepage() releases the lock on the page, the inode
> > can be freed.
>
> Ah, then we could just do igrab() if inode_lock is not locked?
> A bit ugly though.

Since writepage() can sleep, we don't need to worry about inode_lock.

Here is the updated patch.

Thanks,
Fengguang
---
vmscan: lumpy pageout

When paging out a dirty page, try to piggyback more consecutive dirty
pages (up to 512KB) to improve IO efficiency.

Only ext3/reiserfs, which don't have their own aops->writepages, are
supported in this initial version.

CC: Dave Chinner <david@fromorbit.com>
CC: Nikita Danilov <danilov@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   12 ++++++++++++
 mm/vmscan.c         |   16 ++++++++++++++++
 2 files changed, 28 insertions(+)

--- linux.orig/mm/vmscan.c	2009-10-07 21:39:13.000000000 +0800
+++ linux/mm/vmscan.c	2009-10-07 21:39:57.000000000 +0800
@@ -344,6 +344,8 @@ typedef enum {
 	PAGE_CLEAN,
 } pageout_t;
 
+#define LUMPY_PAGEOUT_PAGES	(512 * 1024 / PAGE_CACHE_SIZE)
+
 /*
  * pageout is called by shrink_page_list() for each dirty page.
  * Calls ->writepage().
@@ -399,16 +401,30 @@ static pageout_t pageout(struct page *pa
 		.for_reclaim = 1,
 	};
 
+	igrab(mapping->host);
 	SetPageReclaim(page);
 	res = mapping->a_ops->writepage(page, &wbc);
 	if (res < 0)
 		handle_write_error(mapping, page, res);
 	if (res == AOP_WRITEPAGE_ACTIVATE) {
 		ClearPageReclaim(page);
+		iput(mapping->host);
 		return PAGE_ACTIVATE;
 	}
 
 	/*
+	 * only write_cache_pages() supports for_reclaim for now
+	 * ignore shmem for now, thanks to Nikita.
+	 */
+	if (bdi_cap_writeback_dirty(mapping->backing_dev_info) &&
+	    !mapping->a_ops->writepages) {
+		wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
+		wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
+		generic_writepages(mapping, &wbc);
+		iput(mapping->host);
+	}
+
+	/*
 	 * Wait on writeback if requested to. This happens when
 	 * direct reclaiming a large contiguous area and the
 	 * first attempt to free a range of pages fails.
--- linux.orig/mm/page-writeback.c	2009-10-07 21:39:13.000000000 +0800
+++ linux/mm/page-writeback.c	2009-10-07 21:39:14.000000000 +0800
@@ -805,6 +805,11 @@ int write_cache_pages(struct address_spa
 			break;
 		}
 
+		if (wbc->for_reclaim && done_index != page->index) {
+			done = 1;
+			break;
+		}
+
 		if (nr_to_write != wbc->nr_to_write &&
 		    done_index + WB_SEGMENT_DIST < page->index &&
 		    --wbc->nr_segments <= 0) {
@@ -846,6 +851,13 @@ continue_unlock:
 			if (!clear_page_dirty_for_io(page))
 				goto continue_unlock;
 
+			/*
+			 * active and unevictable pages will be checked at
+			 * rotate time
+			 */
+			if (wbc->for_reclaim)
+				SetPageReclaim(page);
+
 			ret = (*writepage)(page, wbc, data);
 			if (unlikely(ret)) {
 				if (ret == AOP_WRITEPAGE_ACTIVATE) {
* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Wu Fengguang @ 2009-10-07 14:20 UTC
To: Nikita Danilov
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
    Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
    jens.axboe@oracle.com, Jan Kara, Nick Piggin,
    linux-fsdevel@vger.kernel.org

On Wed, Oct 07, 2009 at 09:42:54PM +0800, Wu Fengguang wrote:
> On Wed, Oct 07, 2009 at 09:29:24PM +0800, Wu Fengguang wrote:
> > On Wed, Oct 07, 2009 at 07:37:35PM +0800, Nikita Danilov wrote:
> > > 2009/10/7 Wu Fengguang <fengguang.wu@intel.com>:
> > > > On Wed, Oct 07, 2009 at 06:38:37PM +0800, Nikita Danilov wrote:
> > > >> Hello,
> > > >>
> > >
> > > [...]
> > >
> > > >
> > > > Glad to know about your experiences :) Interestingly I started with
> > > > ->writepage() and then switched to ->writepages(), because
> > > > filesystems behave better with the latter (i.e. less file
> > > > fragmentation).
> > >
> > > By the way, why is your patch doing
> > >
> > >     ->writepage(page->index);
> > >     generic_writepages(page->index + 1, LUMPY_PAGEOUT_PAGES - 1);
> > >
> > > instead of
> > >
> > >     generic_writepages(page->index, LUMPY_PAGEOUT_PAGES);
> > >
> > > ? Is this because of the difficulties with passing back page-specific
> > > errors from generic_writepages()?
> >
> > Yes. It's possible to tell write_cache_pages() to return
> > AOP_WRITEPAGE_ACTIVATE. Other ->writepages() don't have to deal with
> > this because their ->writepage() won't return AOP_WRITEPAGE_ACTIVATE
> > at all.
> >
> > But it is going to be ugly to specialize the first locked page in
> > every ->writepages() function..
> >
> > > >
> > > > I'd like to just ignore the shmem case, by adding a
> > > > bdi_cap_writeback_dirty() check, because clustered writing to the
> > > > swap device may be less of a gain.
> > >
> > > Or you can just call try_to_unmap() from shmem_writepage() when
> > > wbc->for_reclaim is true.
> >
> > Hmm, it's more comfortable to stay away from shmem for the initial
> > version. But feel free to submit a patch for it in the future :)
> >
> > > > Page filtering should also be possible in write_cache_pages(). But
> > > > what do you mean by "hard-to-fix races against inode reclamation"?
> > >
> > > The vmscan.c pageout path doesn't take a reference on the inode, so
> > > the instant ->writepage() releases the lock on the page, the inode
> > > can be freed.
> >
> > Ah, then we could just do igrab() if inode_lock is not locked?
> > A bit ugly though.
>
> Since writepage() can sleep, we don't need to worry about inode_lock.
>
> Here is the updated patch.

Sorry, forgot to check the return value of igrab()..

---
vmscan: lumpy pageout

When paging out a dirty page, try to piggyback more consecutive dirty
pages (up to 512KB) to improve IO efficiency.

Only ext3/reiserfs, which don't have their own aops->writepages, are
supported in this initial version.

CC: Dave Chinner <david@fromorbit.com>
CC: Nikita Danilov <danilov@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   12 ++++++++++++
 mm/vmscan.c         |   16 ++++++++++++++++
 2 files changed, 28 insertions(+)

--- linux.orig/mm/vmscan.c	2009-10-07 21:39:13.000000000 +0800
+++ linux/mm/vmscan.c	2009-10-07 22:18:55.000000000 +0800
@@ -344,6 +344,8 @@ typedef enum {
 	PAGE_CLEAN,
 } pageout_t;
 
+#define LUMPY_PAGEOUT_PAGES	(512 * 1024 / PAGE_CACHE_SIZE)
+
 /*
  * pageout is called by shrink_page_list() for each dirty page.
  * Calls ->writepage().
@@ -398,6 +400,7 @@ static pageout_t pageout(struct page *pa
 		.nonblocking = 1,
 		.for_reclaim = 1,
 	};
+	struct inode *inode = igrab(mapping->host);
 
 	SetPageReclaim(page);
 	res = mapping->a_ops->writepage(page, &wbc);
@@ -405,10 +408,23 @@ static pageout_t pageout(struct page *pa
 		handle_write_error(mapping, page, res);
 	if (res == AOP_WRITEPAGE_ACTIVATE) {
 		ClearPageReclaim(page);
+		iput(inode);
 		return PAGE_ACTIVATE;
 	}
 
 	/*
+	 * only write_cache_pages() supports for_reclaim for now
+	 * ignore shmem for now, thanks to Nikita.
+	 */
+	if (bdi_cap_writeback_dirty(mapping->backing_dev_info) &&
+	    !mapping->a_ops->writepages) {
+		wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
+		wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
+		generic_writepages(mapping, &wbc);
+		iput(inode);
+	}
+
+	/*
 	 * Wait on writeback if requested to. This happens when
 	 * direct reclaiming a large contiguous area and the
 	 * first attempt to free a range of pages fails.
--- linux.orig/mm/page-writeback.c	2009-10-07 21:39:13.000000000 +0800
+++ linux/mm/page-writeback.c	2009-10-07 21:39:14.000000000 +0800
@@ -805,6 +805,11 @@ int write_cache_pages(struct address_spa
 			break;
 		}
 
+		if (wbc->for_reclaim && done_index != page->index) {
+			done = 1;
+			break;
+		}
+
 		if (nr_to_write != wbc->nr_to_write &&
 		    done_index + WB_SEGMENT_DIST < page->index &&
 		    --wbc->nr_segments <= 0) {
@@ -846,6 +851,13 @@ continue_unlock:
 			if (!clear_page_dirty_for_io(page))
 				goto continue_unlock;
 
+			/*
+			 * active and unevictable pages will be checked at
+			 * rotate time
+			 */
+			if (wbc->for_reclaim)
+				SetPageReclaim(page);
+
 			ret = (*writepage)(page, wbc, data);
 			if (unlikely(ret)) {
 				if (ret == AOP_WRITEPAGE_ACTIVATE) {
* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Nikita Danilov @ 2009-10-07 14:50 UTC
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
    Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
    jens.axboe@oracle.com, Jan Kara, Nick Piggin,
    linux-fsdevel@vger.kernel.org

2009/10/7 Wu Fengguang <fengguang.wu@intel.com>:

[...]

> +	if (bdi_cap_writeback_dirty(mapping->backing_dev_info) &&
> +	    !mapping->a_ops->writepages) {
> +		wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
> +		wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
> +		generic_writepages(mapping, &wbc);
> +		iput(inode);

I am afraid calling iput() from within pageout code is not generally
safe: it can trigger a lot of file system activity (think open-unlinked
file) that is not supposed to happen in the direct-reclaim context.
Limiting pageout clustering to kswapd might be a better idea---this
would improve direct-reclaim latency too, but still, file systems are
not designed for re-entrant calls to iput().

Thank you,
Nikita.
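[The danger being referred to: the final iput() on an open-unlinked
file is not a cheap reference drop -- it ends up doing real filesystem
work. Roughly, as a simplified 2.6.3x call chain from fs/inode.c (exact
steps vary per filesystem):]

	iput(inode)
	  iput_final(inode)
	    generic_delete_inode(inode)	/* the i_nlink == 0 case */
	      ->delete_inode(inode)	/* truncates pages, frees blocks,
					   may issue I/O and take fs locks */

[Doing that from direct reclaim means re-entering arbitrary filesystem
code from whatever context happened to trigger reclaim.]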
* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Wu Fengguang @ 2009-10-07 15:00 UTC
To: Nikita Danilov
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
    Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
    jens.axboe@oracle.com, Jan Kara, Nick Piggin,
    linux-fsdevel@vger.kernel.org

On Wed, Oct 07, 2009 at 10:50:17PM +0800, Nikita Danilov wrote:
> 2009/10/7 Wu Fengguang <fengguang.wu@intel.com>:
>
> [...]
>
> > +	if (bdi_cap_writeback_dirty(mapping->backing_dev_info) &&
> > +	    !mapping->a_ops->writepages) {
> > +		wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
> > +		wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
> > +		generic_writepages(mapping, &wbc);
> > +		iput(inode);
>
> I am afraid calling iput() from within pageout code is not generally
> safe: it can trigger a lot of file system activity (think
> open-unlinked file) that is not supposed to happen in the
> direct-reclaim context. Limiting pageout clustering to kswapd might be
> a better idea---this would improve direct-reclaim latency too, but
> still, file systems are not designed for re-entrant calls to iput().

Good point, thanks!

---
vmscan: lumpy pageout

When paging out a dirty page, try to piggyback more consecutive dirty
pages (up to 512KB) to improve IO efficiency.

Only ext3/reiserfs, which don't have their own aops->writepages, are
supported in this initial version.

CC: Dave Chinner <david@fromorbit.com>
CC: Nikita Danilov <danilov@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   12 ++++++++++++
 mm/vmscan.c         |   20 ++++++++++++++++++++
 2 files changed, 32 insertions(+)

--- linux.orig/mm/vmscan.c	2009-10-07 21:39:13.000000000 +0800
+++ linux/mm/vmscan.c	2009-10-07 22:59:57.000000000 +0800
@@ -344,6 +344,8 @@ typedef enum {
 	PAGE_CLEAN,
 } pageout_t;
 
+#define LUMPY_PAGEOUT_PAGES	(512 * 1024 / PAGE_CACHE_SIZE)
+
 /*
  * pageout is called by shrink_page_list() for each dirty page.
  * Calls ->writepage().
@@ -398,6 +400,10 @@ static pageout_t pageout(struct page *pa
 		.nonblocking = 1,
 		.for_reclaim = 1,
 	};
+	struct inode *inode = NULL;
+
+	if (current_is_kswapd())
+		inode = igrab(mapping->host);
 
 	SetPageReclaim(page);
 	res = mapping->a_ops->writepage(page, &wbc);
@@ -405,10 +411,24 @@ static pageout_t pageout(struct page *pa
 		handle_write_error(mapping, page, res);
 	if (res == AOP_WRITEPAGE_ACTIVATE) {
 		ClearPageReclaim(page);
+		iput(inode);
 		return PAGE_ACTIVATE;
 	}
 
 	/*
+	 * only write_cache_pages() supports for_reclaim for now
+	 * ignore shmem for now, thanks to Nikita.
+	 */
+	if (current_is_kswapd() &&
+	    bdi_cap_writeback_dirty(mapping->backing_dev_info) &&
+	    !mapping->a_ops->writepages) {
+		wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
+		wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
+		generic_writepages(mapping, &wbc);
+		iput(inode);
+	}
+
+	/*
	 * Wait on writeback if requested to. This happens when
	 * direct reclaiming a large contiguous area and the
	 * first attempt to free a range of pages fails.
--- linux.orig/mm/page-writeback.c	2009-10-07 21:39:13.000000000 +0800
+++ linux/mm/page-writeback.c	2009-10-07 22:57:55.000000000 +0800
@@ -805,6 +805,11 @@ int write_cache_pages(struct address_spa
 			break;
 		}
 
+		if (wbc->for_reclaim && done_index != page->index) {
+			done = 1;
+			break;
+		}
+
 		if (nr_to_write != wbc->nr_to_write &&
 		    done_index + WB_SEGMENT_DIST < page->index &&
 		    --wbc->nr_segments <= 0) {
@@ -846,6 +851,13 @@ continue_unlock:
 			if (!clear_page_dirty_for_io(page))
 				goto continue_unlock;
 
+			/*
+			 * active and unevictable pages will be checked at
+			 * rotate time
+			 */
+			if (wbc->for_reclaim)
+				SetPageReclaim(page);
+
 			ret = (*writepage)(page, wbc, data);
 			if (unlikely(ret)) {
 				if (ret == AOP_WRITEPAGE_ACTIVATE) {
* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Nikita Danilov @ 2009-10-07 15:50 UTC
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
    Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
    jens.axboe@oracle.com, Jan Kara, Nick Piggin,
    linux-fsdevel@vger.kernel.org

2009/10/7 Wu Fengguang <fengguang.wu@intel.com>:

[...]

> +	 */
> +	if (current_is_kswapd() &&
> +	    bdi_cap_writeback_dirty(mapping->backing_dev_info) &&
> +	    !mapping->a_ops->writepages) {
> +		wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
> +		wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
> +		generic_writepages(mapping, &wbc);
> +		iput(inode);
> +	}
> +
> +	/*

One potential problem with this is that generic_writepages() waits on
page locks and this can stall kswapd (always bad). This can be worked
around by replacing lock_page() with trylock_page() conditionally on
wbc->for_reclaim (or wbc->nonblocking?), but then, this almost looks
like a separate function would be better.

On the good side, it seems I was wrong and pageout calls iput() already:
shrink_slab()->prune_icache()->iput().

Thank you,
Nikita.
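[Concretely, the suggested conditional locking inside the
write_cache_pages() loop would look something like this -- a
hypothetical sketch only; write_cache_pages() at this point does a
plain lock_page():]

	if (wbc->for_reclaim) {
		/* never let reclaim stall on a contended page lock */
		if (!trylock_page(page))
			continue;
	} else
		lock_page(page);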
* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Wu Fengguang @ 2009-10-08 2:37 UTC
To: Nikita Danilov
Cc: Hugh Dickins, Andrew Morton, Theodore Tso, Christoph Hellwig,
    Dave Chinner, Chris Mason, Peter Zijlstra, Li Shaohua,
    Myklebust Trond, jens.axboe@oracle.com, Jan Kara, Nick Piggin,
    linux-fsdevel@vger.kernel.org

On Wed, Oct 07, 2009 at 11:50:58PM +0800, Nikita Danilov wrote:
> 2009/10/7 Wu Fengguang <fengguang.wu@intel.com>:
>
> [...]
>
> > +	 */
> > +	if (current_is_kswapd() &&
> > +	    bdi_cap_writeback_dirty(mapping->backing_dev_info) &&
> > +	    !mapping->a_ops->writepages) {
> > +		wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
> > +		wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
> > +		generic_writepages(mapping, &wbc);
> > +		iput(inode);
> > +	}
> > +
> > +	/*
>
> One potential problem with this is that generic_writepages() waits on
> page locks and this can stall kswapd (always bad). This can be worked
> around by replacing lock_page() with trylock_page() conditionally on
> wbc->for_reclaim (or wbc->nonblocking?), but then, this almost looks
> like a separate function would be better.

IMHO trylock_page() is not necessary. Locked pages are rare in normal
states. kswapd already does lock_page() on every page whose state it
tries to examine for reclaim. So it makes sense for lumpy pageout to
follow the (simple) convention.

> On the good side, it seems I was wrong and pageout calls iput() already:
> shrink_slab()->prune_icache()->iput().

Not totally wrong ;) iput() will be called if __GFP_FS is on. However
pageout may be called with either __GFP_FS or (__GFP_IO &&
PageSwapCache). So I updated the patch to do lumpy pageout only for
__GFP_FS.

In the long term, it would be good to remove AOP_WRITEPAGE_ACTIVATE
and ->writepage() totally, and to support shmem as well :)

Thanks,
Fengguang

---
vmscan: lumpy pageout

When paging out a dirty page, try to piggyback more consecutive dirty
pages (up to 512KB) to improve IO efficiency.

Only ext3/reiserfs, which don't have their own aops->writepages, are
supported in this initial version.

CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>
CC: Dave Chinner <david@fromorbit.com>
CC: Nikita Danilov <danilov@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   12 ++++++++++++
 mm/vmscan.c         |   23 ++++++++++++++++++++++-
 2 files changed, 34 insertions(+), 1 deletion(-)

--- linux.orig/mm/vmscan.c	2009-10-08 07:35:06.000000000 +0800
+++ linux/mm/vmscan.c	2009-10-08 10:20:51.000000000 +0800
@@ -344,11 +344,14 @@ typedef enum {
 	PAGE_CLEAN,
 } pageout_t;
 
+#define LUMPY_PAGEOUT_PAGES	(512 * 1024 / PAGE_CACHE_SIZE)
+
 /*
  * pageout is called by shrink_page_list() for each dirty page.
  * Calls ->writepage().
  */
 static pageout_t pageout(struct page *page, struct address_space *mapping,
+			 struct scan_control *sc,
 			 enum pageout_io sync_writeback)
 {
 	/*
@@ -398,6 +401,10 @@ static pageout_t pageout(struct page *pa
 		.nonblocking = 1,
 		.for_reclaim = 1,
 	};
+	struct inode *inode = NULL;
+
+	if (sc->gfp_mask & __GFP_FS)
+		inode = igrab(mapping->host);
 
 	SetPageReclaim(page);
 	res = mapping->a_ops->writepage(page, &wbc);
@@ -405,10 +412,24 @@ static pageout_t pageout(struct page *pa
 		handle_write_error(mapping, page, res);
 	if (res == AOP_WRITEPAGE_ACTIVATE) {
 		ClearPageReclaim(page);
+		iput(inode);
 		return PAGE_ACTIVATE;
 	}
 
 	/*
+	 * only write_cache_pages() supports for_reclaim for now
+	 * ignore shmem for now, thanks to Nikita.
+	 */
+	if (current_is_kswapd() &&
+	    bdi_cap_writeback_dirty(mapping->backing_dev_info) &&
+	    !mapping->a_ops->writepages) {
+		wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
+		wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
+		generic_writepages(mapping, &wbc);
+		iput(inode);
+	}
+
+	/*
 	 * Wait on writeback if requested to. This happens when
 	 * direct reclaiming a large contiguous area and the
 	 * first attempt to free a range of pages fails.
@@ -684,7 +705,7 @@ static unsigned long shrink_page_list(st
 			goto keep_locked;
 
 		/* Page is dirty, try to write it out here */
-		switch (pageout(page, mapping, sync_writeback)) {
+		switch (pageout(page, mapping, sc, sync_writeback)) {
 		case PAGE_KEEP:
 			goto keep_locked;
 		case PAGE_ACTIVATE:
--- linux.orig/mm/page-writeback.c	2009-10-08 07:38:57.000000000 +0800
+++ linux/mm/page-writeback.c	2009-10-08 10:06:33.000000000 +0800
@@ -805,6 +805,11 @@ int write_cache_pages(struct address_spa
 			break;
 		}
 
+		if (wbc->for_reclaim && done_index != page->index) {
+			done = 1;
+			break;
+		}
+
 		if (nr_to_write != wbc->nr_to_write &&
 		    done_index + WB_SEGMENT_DIST < page->index &&
 		    --wbc->nr_segments <= 0) {
@@ -846,6 +851,13 @@ continue_unlock:
 			if (!clear_page_dirty_for_io(page))
 				goto continue_unlock;
 
+			/*
+			 * active and unevictable pages will be checked at
+			 * rotate time
+			 */
+			if (wbc->for_reclaim)
+				SetPageReclaim(page);
+
 			ret = (*writepage)(page, wbc, data);
 			if (unlikely(ret)) {
 				if (ret == AOP_WRITEPAGE_ACTIVATE) {
* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Hugh Dickins @ 2009-10-08 8:20 UTC
To: Wu Fengguang
Cc: Nikita Danilov, Andrew Morton, Theodore Tso, Christoph Hellwig,
    Dave Chinner, Chris Mason, Peter Zijlstra, Li Shaohua,
    Myklebust Trond, jens.axboe@oracle.com, Jan Kara, Nick Piggin,
    linux-fsdevel@vger.kernel.org

On Thu, 8 Oct 2009, Wu Fengguang wrote:
> On Wed, Oct 07, 2009 at 11:50:58PM +0800, Nikita Danilov wrote:
> >
> > One potential problem with this is that generic_writepages() waits on
> > page locks and this can stall kswapd (always bad). This can be worked
> > around by replacing lock_page() with trylock_page() conditionally on
> > wbc->for_reclaim (or wbc->nonblocking?), but then, this almost looks
> > like a separate function would be better.
>
> IMHO trylock_page() is not necessary. Locked pages are rare in normal
> states. kswapd already does lock_page() on every page whose state it
> tries to examine for reclaim. So it makes sense for lumpy pageout to
> follow the (simple) convention.

You're mistaken? The only lock_page() I see in vmscan.c is in
handle_write_error(), the important ones are all trylock_page().
And I agree with Nikita that they need to be trylock_page().

Hugh
* Re: [PATCH 30/45] vmscan: lumpy pageout
From: Wu Fengguang @ 2009-10-08 10:12 UTC
To: Hugh Dickins
Cc: Nikita Danilov, Andrew Morton, Theodore Tso, Christoph Hellwig,
    Dave Chinner, Chris Mason, Peter Zijlstra, Li Shaohua,
    Myklebust Trond, jens.axboe@oracle.com, Jan Kara, Nick Piggin,
    linux-fsdevel@vger.kernel.org

On Thu, Oct 08, 2009 at 04:20:03PM +0800, Hugh Dickins wrote:
> On Thu, 8 Oct 2009, Wu Fengguang wrote:
> > On Wed, Oct 07, 2009 at 11:50:58PM +0800, Nikita Danilov wrote:
> > >
> > > One potential problem with this is that generic_writepages() waits
> > > on page locks and this can stall kswapd (always bad). This can be
> > > worked around by replacing lock_page() with trylock_page()
> > > conditionally on wbc->for_reclaim (or wbc->nonblocking?), but then,
> > > this almost looks like a separate function would be better.
> >
> > IMHO trylock_page() is not necessary. Locked pages are rare in normal
> > states. kswapd already does lock_page() on every page whose state it
> > tries to examine for reclaim. So it makes sense for lumpy pageout to
> > follow the (simple) convention.
>
> You're mistaken? The only lock_page() I see in vmscan.c is in
> handle_write_error(), the important ones are all trylock_page().
> And I agree with Nikita that they need to be trylock_page().

Ah, big sorry! I should really have double-checked the code..
OK, I'll do trylock_page().

Thanks,
Fengguang
* [PATCH 00/45] some writeback experiments
From: Wu Fengguang @ 2009-10-07 7:38 UTC
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
    Peter Zijlstra, Li Shaohua, Myklebust Trond, jens.axboe@oracle.com,
    Jan Kara, Nick Piggin, linux-fsdevel, Wu Fengguang, LKML

Hi all,

Here is a collection of writeback patches on

- larger writeback chunk sizes
- a single per-bdi flush thread (killing the foreground throttling
  writeouts)
- lumpy pageout
- sync livelock prevention
- writeback scheduling
- random fixes

Sorry for posting such a big series - there are many direct or implicit
dependencies, and one patch led to another before I could stop..

The lumpy pageout and nr_segments support are not complete and do not
cover all filesystems for now. It may be better to first convert some
of the ->writepages to the generic routines to avoid duplicate work.

I managed to address many issues in the past week, however there are
still known problems. Hints from filesystem developers are highly
appreciated. Thanks!

The estimated writeback bandwidth is about 1/2 the real throughput for
ext2/3/4 and btrfs; noticeably bigger than the real throughput for NFS;
and cannot be estimated at all for XFS. Very interesting..

NFS writeback is very bumpy. The page numbers and network throughput
"freeze" together from time to time:

# vmmon -d 1 nr_writeback nr_dirty nr_unstable   # (per 1-second samples)

     nr_writeback         nr_dirty      nr_unstable
            11227            41463            38044
            11227            41463            38044
            11227            41463            38044
            11227            41463            38044
            11045            53987             6490
            11033            53120             8145
            11195            52143            10886
            11211            52144            10913
            11211            52144            10913
            11211            52144            10913

btrfs seems to maintain a private pool of writeback pages, which can
go out of control:

     nr_writeback         nr_dirty
           261075              132
           252891              195
           244795              187
           236851              187
           228830              187
           221040              218
           212674              237
           204981              237

XFS has a very interesting "bumpy writeback" behavior: it tends to
wait to collect enough pages and then write the whole world:

     nr_writeback         nr_dirty
            80781                0
            37117            37703
            37117            43933
            81044                6
            81050                0
            43943            10199
            43930            36355
            43930            36355
            80293                0
            80285                0
            80285                0

Thanks,
Fengguang
* [PATCH 30/45] vmscan: lumpy pageout
From: Wu Fengguang @ 2009-10-07 7:38 UTC
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
    Peter Zijlstra, Li Shaohua, Myklebust Trond, jens.axboe@oracle.com,
    Jan Kara, Nick Piggin, linux-fsdevel, Wu Fengguang, LKML

[-- Attachment #1: writeback-lumpy-pageout.patch --]
[-- Type: text/plain, Size: 2082 bytes --]

When paging out a dirty page, try to piggyback more consecutive dirty
pages (up to 512KB) to improve IO efficiency.

Only ext3/reiserfs, which don't have their own aops->writepages, are
supported in this initial version.

CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   12 ++++++++++++
 mm/vmscan.c         |   11 +++++++++++
 2 files changed, 23 insertions(+)

--- linux.orig/mm/vmscan.c	2009-10-06 23:37:39.000000000 +0800
+++ linux/mm/vmscan.c	2009-10-06 23:39:30.000000000 +0800
@@ -344,6 +344,8 @@ typedef enum {
 	PAGE_CLEAN,
 } pageout_t;
 
+#define LUMPY_PAGEOUT_PAGES	(512 * 1024 / PAGE_CACHE_SIZE)
+
 /*
  * pageout is called by shrink_page_list() for each dirty page.
  * Calls ->writepage().
@@ -409,6 +411,15 @@ static pageout_t pageout(struct page *pa
 	}
 
 	/*
+	 * only write_cache_pages() supports for_reclaim for now
+	 */
+	if (!mapping->a_ops->writepages) {
+		wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
+		wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
+		generic_writepages(mapping, &wbc);
+	}
+
+	/*
 	 * Wait on writeback if requested to. This happens when
 	 * direct reclaiming a large contiguous area and the
 	 * first attempt to free a range of pages fails.
--- linux.orig/mm/page-writeback.c	2009-10-06 23:39:29.000000000 +0800
+++ linux/mm/page-writeback.c	2009-10-06 23:39:30.000000000 +0800
@@ -805,6 +805,11 @@ int write_cache_pages(struct address_spa
 			break;
 		}
 
+		if (wbc->for_reclaim && done_index != page->index) {
+			done = 1;
+			break;
+		}
+
 		if (nr_to_write != wbc->nr_to_write &&
 		    done_index + WB_SEGMENT_DIST < page->index &&
 		    --wbc->nr_segments <= 0) {
@@ -846,6 +851,13 @@ continue_unlock:
 			if (!clear_page_dirty_for_io(page))
 				goto continue_unlock;
 
+			/*
+			 * active and unevictable pages will be checked at
+			 * rotate time
+			 */
+			if (wbc->for_reclaim)
+				SetPageReclaim(page);
+
 			ret = (*writepage)(page, wbc, data);
 			if (unlikely(ret)) {
 				if (ret == AOP_WRITEPAGE_ACTIVATE) {
end of thread, other threads: [~2009-10-08 10:13 UTC | newest]

Thread overview: 14+ messages
2009-10-07 10:38 [PATCH 30/45] vmscan: lumpy pageout Nikita Danilov
2009-10-07 11:14 ` Wu Fengguang
2009-10-07 11:32   ` Nick Piggin
2009-10-07 11:37   ` Nikita Danilov
2009-10-07 13:29     ` Wu Fengguang
2009-10-07 13:42       ` Wu Fengguang
2009-10-07 14:20         ` Wu Fengguang
2009-10-07 14:50           ` Nikita Danilov
2009-10-07 15:00             ` Wu Fengguang
2009-10-07 15:50               ` Nikita Danilov
2009-10-08  2:37                 ` Wu Fengguang
2009-10-08  8:20                   ` Hugh Dickins
2009-10-08 10:12                     ` Wu Fengguang

-- strict thread matches above, loose matches on Subject: below --
2009-10-07  7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
2009-10-07  7:38 ` [PATCH 30/45] vmscan: lumpy pageout Wu Fengguang