All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mel Gorman <mel@csn.ul.ie>
To: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org,
	KOSAKI Motohiro <kosaki.motohiro@gmail.com>,
	Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [PATCH -mm] make swapin readahead skip over holes
Date: Thu, 12 Jan 2012 14:33:01 +0000	[thread overview]
Message-ID: <20120112143301.GG3910@csn.ul.ie> (raw)
In-Reply-To: <4F0DE1CE.5000006@redhat.com>

On Wed, Jan 11, 2012 at 02:23:58PM -0500, Rik van Riel wrote:
> On 01/11/2012 11:51 AM, Mel Gorman wrote:
> >On Mon, Jan 09, 2012 at 06:10:23PM -0500, Rik van Riel wrote:
> 
> >>+	get_swap_cluster(entry,&offset,&end_offset);
> >>+
> >>+	for (; offset<= end_offset ; offset++) {
> >>  		/* Ok, do the async read-ahead now */
> >>  		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> >>  						gfp_mask, vma, addr);
> >>  		if (!page)
> >>-			break;
> >>+			continue;
> >>  		page_cache_release(page);
> >>  	}
> >
> >For a heavily fragmented swap file, this will result in more IO and
> >the gamble is that pages nearby are needed soon. You say your virtual
> >machines swapin faster and that does not surprise me. I also expect
> >they need the data so it's a net win.
> 
> More IO operations, yes.  However, IO operations from nearby
> blocks can often be done without incurring extra disk seeks,
> because the disk head is already in the right place.
> 

I understand that, the result still has to go somewhere in memory
though. The cost in terms of IO speed might be effectively 0 but it
still consumes memory resources.

> This seems to be born out by the fact that I saw swapin
> rates increase from maybe 200-300kB/s to 5-15MB/s...
> 
> Even on some SSDs it could avoid some bank switches, though
> of course there I would expect the effect to be much less
> pronounced.
> 
> >There is an possibility that under memory pressure that swapping in
> >more pages will cause more memory pressure (increased swapin causing
> >clean page cache discards and pageout) and be an overall loss. This may
> >be a net loss in some cases such as where the working set size is just
> >over physical memory and the increased swapin causes a problem. I doubt
> >this case is common but it is worth bearing in mind if future bug
> >reports complain about increased swap activity.
> 
> True, there may be workloads that benefit from a smaller
> page-cluster. The fact that the recently swapped in pages
> are all put on the inactive anon list should help protect
> the working set, too.
> 

True to some extent.

> Another alternative may be a time based decision. If we
> have swapped something out recently, go with a less
> aggressive swapin readahead.
> 

Maybe, but while I know I brought up the problem of swapin might
increase swapout, it would be best to have an example of how that can
happen before looking at heuristics.

> That would automatically give us fast swapin readahead
> when in "memory hog just exited, let the system recover"
> mode, and conservative swapin readahead in your situation.
> 
> However, that could still hurt badly if the system is just
> moving the working set from one part of a program to another.
> 
> I suspect we will be faster off by having faster swap IO,
> which this patch seems to provide.
> 

More than likely.

> >>-	si = swap_info[swp_type(entry)];
> >>-	target = swp_offset(entry);
> >>-	base = (target>>  our_page_cluster)<<  our_page_cluster;
> >>-	end = base + (1<<  our_page_cluster);
> >>-	if (!base)		/* first page is swap header */
> >>-		base++;
> 
> >>+	si = swap_info[swp_type(entry)];
> >>+	/* Round the begin down to a page_cluster boundary. */
> >>+	offset = (offset>>  page_cluster)<<  page_cluster;
> >
> >Minor nit but it would feel more natural to me to see
> >
> >offset&  ~((1<<  page_cluster) - 1)
> >
> >but I understand that you are reusing the existing code.
> 
> Sure, I can do that.
> 
> While I'm there, I can also add that if (!base) base++
> thing back in :)
> 

Whoops, I missed that!

> >>+	*begin = offset;
> >>+	/* Round the end up, but not beyond the end of the swap device. */
> >>+	offset = offset + (1<<  page_cluster);
> >>+	if (offset>  si->max)
> >>+		offset = si->max;
> >>+	*end = offset;
> >>  	spin_unlock(&swap_lock);
> >>-
> >>-	/*
> >>-	 * Indicate starting offset, and return number of pages to get:
> >>-	 * if only 1, say 0, since there's then no readahead to be done.
> >>-	 */
> >>-	*offset = ++toff;
> >>-	return nr_pages? ++nr_pages: 0;
> >>  }
> >
> >This section deletes code which is nice but there is a
> >problem. Your changelog says that this is duplicating the effort of
> >read_swap_cache_async() which is true but what it does is
> >
> >1. a swap cache lookup which will probably fail
> >2. alloc_page_vma()
> >3. radix_tree_preload()
> >4. swapcache_prepare
> >    - calls __swap_duplicate()
> >    - finds the hole, bails out
> >
> >That's a lot of work before the hole is found. Would it be worth
> >doing a racy check in swapin_readahead without swap lock held before
> >calling read_swap_cache_async()?
> 
> The problem is that without the swap_lock held, the swap_info
> struct may disappear completely because of the swapin_readahead
> happening concurrently with a swapoff.
> 

hmm, I considered that when I wrote the suggestion and was making
an assumption that the presense of swap pages on the area would
effectively pin the swap_info struct.

I did think of the race that swapoff is operating on the very last page
that swapin is reading and removes the map at the same time. We make a
racy check against effectively random memory and one of three results
might happen

1. DEBUG_PAGE_ALLOC was enabled and we blow up, ok, this one is bad if
   somewhat corner case

2. False positive count - we fall through and try to read the page,
   find the swap page no longer exists and bail gracefully

3. False zero count - we do not readin the swap page. It's not a big
   deal if swapin_readahead fails to read in a page it should have
   (swapoff has brought it in anyway).

Basically, I did not see any major downside with doing a racy check. If
1 above is a serious problem, then yes, we have to take the swap_lock
but that's still cheaper than allocating needlessly.

> I suspect that the CPU time spent doing 1-4 above will be
> negligible compared to the amount of time spent doing disk IO,

Negligible in comparison to the IO sure, but it still seems like we
could be doing a lot of allocation work (alloc_page_vma and
radix_tree_preload) for no good reason.

> but if there turns out to be a problem it should be possible
> to move the swap hole identification closer to the top of
> swap_cache_read_async().
> 

There is that, it'll be tricky to spot that we're needless allocating
pages though. Not the end of the world, what you have still improves
things.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)
From: Mel Gorman <mel@csn.ul.ie>
To: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org,
	KOSAKI Motohiro <kosaki.motohiro@gmail.com>,
	Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [PATCH -mm] make swapin readahead skip over holes
Date: Thu, 12 Jan 2012 14:33:01 +0000	[thread overview]
Message-ID: <20120112143301.GG3910@csn.ul.ie> (raw)
In-Reply-To: <4F0DE1CE.5000006@redhat.com>

On Wed, Jan 11, 2012 at 02:23:58PM -0500, Rik van Riel wrote:
> On 01/11/2012 11:51 AM, Mel Gorman wrote:
> >On Mon, Jan 09, 2012 at 06:10:23PM -0500, Rik van Riel wrote:
> 
> >>+	get_swap_cluster(entry,&offset,&end_offset);
> >>+
> >>+	for (; offset<= end_offset ; offset++) {
> >>  		/* Ok, do the async read-ahead now */
> >>  		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> >>  						gfp_mask, vma, addr);
> >>  		if (!page)
> >>-			break;
> >>+			continue;
> >>  		page_cache_release(page);
> >>  	}
> >
> >For a heavily fragmented swap file, this will result in more IO and
> >the gamble is that pages nearby are needed soon. You say your virtual
> >machines swapin faster and that does not surprise me. I also expect
> >they need the data so it's a net win.
> 
> More IO operations, yes.  However, IO operations from nearby
> blocks can often be done without incurring extra disk seeks,
> because the disk head is already in the right place.
> 

I understand that, the result still has to go somewhere in memory
though. The cost in terms of IO speed might be effectively 0 but it
still consumes memory resources.

> This seems to be born out by the fact that I saw swapin
> rates increase from maybe 200-300kB/s to 5-15MB/s...
> 
> Even on some SSDs it could avoid some bank switches, though
> of course there I would expect the effect to be much less
> pronounced.
> 
> >There is an possibility that under memory pressure that swapping in
> >more pages will cause more memory pressure (increased swapin causing
> >clean page cache discards and pageout) and be an overall loss. This may
> >be a net loss in some cases such as where the working set size is just
> >over physical memory and the increased swapin causes a problem. I doubt
> >this case is common but it is worth bearing in mind if future bug
> >reports complain about increased swap activity.
> 
> True, there may be workloads that benefit from a smaller
> page-cluster. The fact that the recently swapped in pages
> are all put on the inactive anon list should help protect
> the working set, too.
> 

True to some extent.

> Another alternative may be a time based decision. If we
> have swapped something out recently, go with a less
> aggressive swapin readahead.
> 

Maybe, but while I know I brought up the problem of swapin might
increase swapout, it would be best to have an example of how that can
happen before looking at heuristics.

> That would automatically give us fast swapin readahead
> when in "memory hog just exited, let the system recover"
> mode, and conservative swapin readahead in your situation.
> 
> However, that could still hurt badly if the system is just
> moving the working set from one part of a program to another.
> 
> I suspect we will be faster off by having faster swap IO,
> which this patch seems to provide.
> 

More than likely.

> >>-	si = swap_info[swp_type(entry)];
> >>-	target = swp_offset(entry);
> >>-	base = (target>>  our_page_cluster)<<  our_page_cluster;
> >>-	end = base + (1<<  our_page_cluster);
> >>-	if (!base)		/* first page is swap header */
> >>-		base++;
> 
> >>+	si = swap_info[swp_type(entry)];
> >>+	/* Round the begin down to a page_cluster boundary. */
> >>+	offset = (offset>>  page_cluster)<<  page_cluster;
> >
> >Minor nit but it would feel more natural to me to see
> >
> >offset&  ~((1<<  page_cluster) - 1)
> >
> >but I understand that you are reusing the existing code.
> 
> Sure, I can do that.
> 
> While I'm there, I can also add that if (!base) base++
> thing back in :)
> 

Whoops, I missed that!

> >>+	*begin = offset;
> >>+	/* Round the end up, but not beyond the end of the swap device. */
> >>+	offset = offset + (1<<  page_cluster);
> >>+	if (offset>  si->max)
> >>+		offset = si->max;
> >>+	*end = offset;
> >>  	spin_unlock(&swap_lock);
> >>-
> >>-	/*
> >>-	 * Indicate starting offset, and return number of pages to get:
> >>-	 * if only 1, say 0, since there's then no readahead to be done.
> >>-	 */
> >>-	*offset = ++toff;
> >>-	return nr_pages? ++nr_pages: 0;
> >>  }
> >
> >This section deletes code which is nice but there is a
> >problem. Your changelog says that this is duplicating the effort of
> >read_swap_cache_async() which is true but what it does is
> >
> >1. a swap cache lookup which will probably fail
> >2. alloc_page_vma()
> >3. radix_tree_preload()
> >4. swapcache_prepare
> >    - calls __swap_duplicate()
> >    - finds the hole, bails out
> >
> >That's a lot of work before the hole is found. Would it be worth
> >doing a racy check in swapin_readahead without swap lock held before
> >calling read_swap_cache_async()?
> 
> The problem is that without the swap_lock held, the swap_info
> struct may disappear completely because of the swapin_readahead
> happening concurrently with a swapoff.
> 

hmm, I considered that when I wrote the suggestion and was making
an assumption that the presense of swap pages on the area would
effectively pin the swap_info struct.

I did think of the race that swapoff is operating on the very last page
that swapin is reading and removes the map at the same time. We make a
racy check against effectively random memory and one of three results
might happen

1. DEBUG_PAGE_ALLOC was enabled and we blow up, ok, this one is bad if
   somewhat corner case

2. False positive count - we fall through and try to read the page,
   find the swap page no longer exists and bail gracefully

3. False zero count - we do not readin the swap page. It's not a big
   deal if swapin_readahead fails to read in a page it should have
   (swapoff has brought it in anyway).

Basically, I did not see any major downside with doing a racy check. If
1 above is a serious problem, then yes, we have to take the swap_lock
but that's still cheaper than allocating needlessly.

> I suspect that the CPU time spent doing 1-4 above will be
> negligible compared to the amount of time spent doing disk IO,

Negligible in comparison to the IO sure, but it still seems like we
could be doing a lot of allocation work (alloc_page_vma and
radix_tree_preload) for no good reason.

> but if there turns out to be a problem it should be possible
> to move the swap hole identification closer to the top of
> swap_cache_read_async().
> 

There is that, it'll be tricky to spot that we're needless allocating
pages though. Not the end of the world, what you have still improves
things.

-- 
Mel Gorman
SUSE Labs

  reply	other threads:[~2012-01-12 14:33 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-01-09 23:10 [PATCH -mm] make swapin readahead skip over holes Rik van Riel
2012-01-09 23:10 ` Rik van Riel
2012-01-09 23:49 ` KOSAKI Motohiro
2012-01-09 23:49   ` KOSAKI Motohiro
2012-01-10  3:09   ` Rik van Riel
2012-01-10  3:09     ` Rik van Riel
2012-01-11  7:14     ` KOSAKI Motohiro
2012-01-11  7:14       ` KOSAKI Motohiro
2012-01-11  8:01       ` KOSAKI Motohiro
2012-01-11  8:01         ` KOSAKI Motohiro
2012-01-11  8:05         ` KOSAKI Motohiro
2012-01-11  8:05           ` KOSAKI Motohiro
2012-01-11 14:11         ` John Stoffel
2012-01-11 14:11           ` John Stoffel
2012-01-11 19:17           ` Christoph Lameter
2012-01-11 19:17             ` Christoph Lameter
2012-01-11 21:03             ` John Stoffel
2012-01-11 21:03               ` John Stoffel
2012-01-11 22:25               ` KOSAKI Motohiro
2012-01-11 22:25                 ` KOSAKI Motohiro
2012-01-11 23:15       ` Andrew Morton
2012-01-11 23:15         ` Andrew Morton
2012-01-11 16:51 ` Mel Gorman
2012-01-11 16:51   ` Mel Gorman
2012-01-11 19:23   ` Rik van Riel
2012-01-11 19:23     ` Rik van Riel
2012-01-12 14:33     ` Mel Gorman [this message]
2012-01-12 14:33       ` Mel Gorman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120112143301.GG3910@csn.ul.ie \
    --to=mel@csn.ul.ie \
    --cc=akpm@linux-foundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=kosaki.motohiro@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=riel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.