From: Mel Gorman <mel@csn.ul.ie>
To: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>,
Andrew Morton <akpm@linux-foundation.org>,
stable@kernel.org, Rik van Riel <riel@redhat.com>,
Christoph Hellwig <hch@infradead.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
Dave Chinner <david@fromorbit.com>,
Chris Mason <chris.mason@oracle.com>,
Nick Piggin <npiggin@suse.de>,
Johannes Weiner <hannes@cmpxchg.org>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
Andrea Arcangeli <aarcange@redhat.com>,
Minchan Kim <minchan.kim@gmail.com>, Andreas Mohr <andi@lisas.de>,
Bill Davidsen <davidsen@tmr.com>,
Ben Gamari <bgamari.foss@gmail.com>
Subject: Re: Why PAGEOUT_IO_SYNC stalls for a long time
Date: Wed, 28 Jul 2010 14:10:17 +0100
Message-ID: <20100728131017.GI5300@csn.ul.ie>
In-Reply-To: <20100728191322.4A85.A69D9226@jp.fujitsu.com>
On Wed, Jul 28, 2010 at 08:40:21PM +0900, KOSAKI Motohiro wrote:
> This week I've been testing some IO-congested workloads, and I have probably
> reproduced Andreas's issue.
>
> So, I would like to explain how the current lumpy reclaim works and why it
> behaves so badly.
>
>
> 1. isolate_lru_pages() currently has the following pfn-neighbour grabbing logic.
>
> 	for (; pfn < end_pfn; pfn++) {
> 		(snip)
> 		if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> 			list_move(&cursor_page->lru, dst);
> 			mem_cgroup_del_lru(cursor_page);
> 			nr_taken++;
> 			nr_lumpy_taken++;
> 			if (PageDirty(cursor_page))
> 				nr_lumpy_dirty++;
> 			scan++;
> 		} else {
> 			if (mode == ISOLATE_BOTH &&
> 			    page_count(cursor_page))
> 				nr_lumpy_failed++;
> 		}
> 	}
>
> __isolate_lru_page() failure is mainly caused by the following reasons:
> (1) the page has already been freed and is in the buddy allocator
> (2) the page is used for a non-user-process purpose
> (3) the page is unevictable (e.g. mlocked)
>
> (2) and (3) have a very different characteristic from (1). Lumpy reclaim
> means 'contiguous physical memory reclaiming'. That is, if we are attempting
> an order-9 reclaim, reclaiming 512 pages successfully and reclaiming only 511
> pages are completely different outcomes.
Yep, and this can occur quite regularly. Judging from the ftrace
results, contig_failed is frequently positive, although whether that is
because the page was about to be freed or because of (2), I don't
know.
> The former means lumpy reclaim succeeded, the latter means it failed. So, if
> (2) or (3) occurs, that pfn range has lost any possibility of a successful
> lumpy reclaim. We should then stop the pfn-neighbour search immediately and
> move on to the next LRU page (i.e. we should use a 'break' statement instead
> of 'continue').
>
Easy enough to do.
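Roughly something like this in isolate_lru_pages(), I think (an untested
sketch only, assuming pages already freed to the buddy allocator are the
ones showing page_count() == 0):

	for (; pfn < end_pfn; pfn++) {
		(snip)
		if (__isolate_lru_page(cursor_page, mode, file) == 0) {
			(snip)
		} else {
			/*
			 * Case (1): the neighbour was already freed to the
			 * buddy allocator. A free page does not block a
			 * contiguous allocation, so keep scanning the block.
			 */
			if (!page_count(cursor_page))
				continue;

			/*
			 * Cases (2) and (3): the page is pinned for some
			 * other purpose or is unevictable, so this block can
			 * no longer be reclaimed contiguously. Abandon the
			 * pfn scan and fall back to the next LRU page.
			 */
			if (mode == ISOLATE_BOTH)
				nr_lumpy_failed++;
			break;
		}
	}

Whether case (1) should also give up is debatable; the sketch keeps scanning
because a page that is already free does not need reclaiming for the block to
be usable.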
> 2. The synchronous lumpy reclaim condition is insane.
>
> Currently, synchronous lumpy reclaim is invoked under the following condition:
>
> 	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
> 			sc->lumpy_reclaim_mode) {
>
> But "nr_reclaimed < nr_taken" is a pretty stupid test. If the isolated pages
> include many dirty pages, pageout() only issues the first 113 IOs
> (if the IO queue has >113 requests, bdi_write_congested() returns true and
> may_write_to_queue() returns false).
>
> So when we haven't even called ->writepage(), the congestion_wait() and
> wait_on_page_writeback() are surely pointless.
>
This is somewhat intentional though. See the comment:

	/*
	 * Synchronous reclaim is performed in two passes,
	 * first an asynchronous pass over the list to
	 * start parallel writeback, and a second synchronous
	 * pass to wait for the IO to complete......

If not all of the pages taken off the list were reclaimed, it means that some
of them were dirty but most should now be queued for writeback (possibly not
all if congested). The intention is to loop a second time, waiting for that
writeback to complete before continuing on.
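For reference, the second pass is driven by something like the following in
shrink_inactive_list() (quoted from memory, so treat the details as
approximate):

	/*
	 * If a direct reclaimer doing lumpy reclaim failed to reclaim
	 * everything it isolated, throttle briefly and then make a
	 * second, synchronous pass that waits on page writeback.
	 */
	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
			sc->lumpy_reclaim_mode) {
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/*
		 * The attempt at page out may have made some
		 * of the pages active, mark them inactive again.
		 */
		nr_active = clear_active_flags(&page_list, NULL);
		count_vm_events(PGDEACTIVATE, nr_active);

		nr_reclaimed += shrink_page_list(&page_list, sc,
						PAGEOUT_IO_SYNC);
	}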
> 3. pageout() is intended to be an asynchronous API, but it doesn't work that way.
>
> pageout() calls ->writepage with wbc->nonblocking=1, because if the system has
> the default vm.dirty_ratio (i.e. 20), we have 80% clean memory. Getting stuck
> on one page is stupid; we should scan as many pages as possible, as soon as
> possible.
>
> HOWEVER, the block layer ignores this argument. If a slow USB memory device
> is connected to the system, ->writepage() can sleep for a long time, because
> submit_bio() calls get_request_wait() unconditionally and there is no bonus
> for PF_MEMALLOC tasks.
>
Is this not a problem in the writeback layer rather than pageout()
specifically?
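For reference, pageout() only ever expresses this as a hint through the
writeback_control it hands to ->writepage(), roughly as below (reproduced
from memory, so treat the field values as approximate); nothing below the
filesystem is obliged to honour .nonblocking:

	struct writeback_control wbc = {
		.sync_mode	= WB_SYNC_NONE,
		.nr_to_write	= SWAP_CLUSTER_MAX,
		.range_start	= 0,
		.range_end	= LLONG_MAX,
		.nonblocking	= 1,	/* a hint; get_request_wait()
					   can still sleep below this */
		.for_reclaim	= 1,
	};

	SetPageReclaim(page);
	res = mapping->a_ops->writepage(page, &wbc);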
>
> 4. Synchronous lumpy reclaim calls clear_active_flags(), but that is also silly.
>
> page_check_references() now ignores the pte young bit when we are doing lumpy
> reclaim. So in almost every case, PageActive() means "the swap device is full".
> Therefore, waiting for IO and retrying pageout() is just silly.
>
try_to_unmap() also obeys the reference bits. If you remove the call to
clear_active_flags(), then the pageout path should pass TTU_IGNORE_ACCESS to
try_to_unmap(). I had a patch to do this but it didn't improve
high-order allocation success rates any, so I dropped it.
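For what it's worth, the dropped patch amounted to something like the
following in shrink_page_list() (reconstructed from memory, so don't read it
as the exact diff):

	if (page_mapped(page) && mapping) {
		enum ttu_flags ttu = TTU_UNMAP;

		/*
		 * During the synchronous lumpy pass the reference bits
		 * were deliberately ignored when deciding to reclaim,
		 * so tell try_to_unmap() to ignore them as well rather
		 * than failing on recently referenced pages.
		 */
		if (sync_writeback == PAGEOUT_IO_SYNC)
			ttu |= TTU_IGNORE_ACCESS;

		switch (try_to_unmap(page, ttu)) {
		case SWAP_FAIL:
			goto activate_locked;
		case SWAP_AGAIN:
			goto keep_locked;
		case SWAP_MLOCK:
			goto cull_mlocked;
		case SWAP_SUCCESS:
			; /* try to free the page below */
		}
	}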
> In Andreas's case, congestion_wait() and get_request_wait() are the root
> cause. The other issues become problematic with higher-order lumpy reclaim.
>
> Now I'm preparing some patches and can probably send them tomorrow.
>
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab