Re: [PATCH 11/14] Direct compact when a high-order allocation fails

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Mel Gorman <mel@csn.ul.ie>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	Christoph Lameter <cl@linux-foundation.org>,
	Adam Litke <agl@us.ibm.com>, Avi Kivity <avi@redhat.com>,
	David Rientjes <rientjes@google.com>,
	Minchan Kim <minchan.kim@gmail.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Rik van Riel <riel@redhat.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH 11/14] Direct compact when a high-order allocation fails
Date: Wed, 7 Apr 2010 17:06:42 +0100	[thread overview]
Message-ID: <20100407160642.GT17882@csn.ul.ie> (raw)
In-Reply-To: <20100406170603.8a999dc2.akpm@linux-foundation.org>

On Tue, Apr 06, 2010 at 05:06:03PM -0700, Andrew Morton wrote:
> On Fri,  2 Apr 2010 17:02:45 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > Ordinarily when a high-order allocation fails, direct reclaim is entered to
> > free pages to satisfy the allocation.  With this patch, it is determined if
> > an allocation failed due to external fragmentation instead of low memory
> > and if so, the calling process will compact until a suitable page is
> > freed. Compaction by moving pages in memory is considerably cheaper than
> > paging out to disk and works where there are locked pages or no swap. If
> > compaction fails to free a page of a suitable size, then reclaim will
> > still occur.
> 
> Does this work?
> 

Well, yes or there wouldn't be a marked reduction in the latency to allocate
a huge page as linked to in the leader and the difference in allocation
success rates on ppc64 would not be so marked.

> > Direct compaction returns as soon as possible. As each block is compacted,
> > it is checked if a suitable page has been freed and if so, it returns.
> 
> So someone else can get in and steal it.  How is that resolved?
> 

It isn't, lumpy reclaim has a similar problem. They could be captured
of course but so far stealing has only been a problem when under very
heavy memory pressure.

> Please expound upon the relationship between the icky pageblock_order
> and the caller's desired allocation order here. 

Compaction works on the same units as anti-fragmentation does - the
pageblock_order. It could work on units smaller than that when selecting
pages to migrate from and to, but there would be little advantage for
some additional complexity.

The caller's desired allocation order determines if compaction has
finished or not after a pageblock of pages has been migrated.

> The compaction design
> seems fairly fixated upon pageblock_order - what happens if the caller
> wanted something larger than pageblock_order? 

Then it would get tricky. Selecting for migration stays simple but there would
be additional complexity in finding 2 or more adjacent naturally-aligned
MIGRATE_MOVABLE blocks to migrate to. As pageblock_order is related to the
default huge page size, I'd wonder what caller would be routinely allocating
larger pages?

> The
> less-than-pageblock_order case seems pretty obvious, although perhaps
> wasteful?
> 

compact_finished() could be called more regularly but the waste is minimal. At
worst, a few more pages get migrated that weren't necessary for the caller
to successfully allocate. This is not massively dissimilar to how direct
reclaim can reclaim slightly more pages than necessary.

> >
> > ...
> >
> > +static unsigned long compact_zone_order(struct zone *zone,
> > +						int order, gfp_t gfp_mask)
> > +{
> > +	struct compact_control cc = {
> > +		.nr_freepages = 0,
> > +		.nr_migratepages = 0,
> > +		.order = order,
> > +		.migratetype = allocflags_to_migratetype(gfp_mask),
> > +		.zone = zone,
> > +	};
> 
> yeah, like that.
> 
> > +	INIT_LIST_HEAD(&cc.freepages);
> > +	INIT_LIST_HEAD(&cc.migratepages);
> > +
> > +	return compact_zone(zone, &cc);
> > +}
> > +
> > +/**
> > + * try_to_compact_pages - Direct compact to satisfy a high-order allocation
> > + * @zonelist: The zonelist used for the current allocation
> > + * @order: The order of the current allocation
> > + * @gfp_mask: The GFP mask of the current allocation
> > + * @nodemask: The allowed nodes to allocate from
> > + *
> > + * This is the main entry point for direct page compaction.
> > + */
> > +unsigned long try_to_compact_pages(struct zonelist *zonelist,
> > +			int order, gfp_t gfp_mask, nodemask_t *nodemask)
> > +{
> > +	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > +	int may_enter_fs = gfp_mask & __GFP_FS;
> > +	int may_perform_io = gfp_mask & __GFP_IO;
> > +	unsigned long watermark;
> > +	struct zoneref *z;
> > +	struct zone *zone;
> > +	int rc = COMPACT_SKIPPED;
> > +
> > +	/*
> > +	 * Check whether it is worth even starting compaction. The order check is
> > +	 * made because an assumption is made that the page allocator can satisfy
> > +	 * the "cheaper" orders without taking special steps
> > +	 */
> > +	if (order <= PAGE_ALLOC_COSTLY_ORDER 
> 
> Was that a correct decision?  If we perform compaction when smaller
> allocation attemtps fail, will the kernel get better, or worse?
> 

I think better but there are concerns about LRU churn and it might encourage
increased use of high-order allocations. The desire is to try compaction out
first with huge pages and move towards lifting this restriction on order later.

> And how do we save my order-4-allocating wireless driver? 

Ultimately, it could perform a subset of compaction that doesn't go to
sleep but migration isn't up to that right now.

> That would
> require that kswapd perform the compaction for me, perhaps?
> 
> > || !may_enter_fs || !may_perform_io)
> 
> Would be nice to add some comments explaining this a bit more. 
> Compaction doesn't actually perform IO, nor enter filesystems, does it?
> 

Compaction doesn't, but migration can and you don't know in advance if
it will need to or not. Migration would itself need to take a GFP mask
of what was and wasn't allowed during the course of migration but these
checks to be moved.

Not impossible, just not done as of this time.

> > +		return rc;
> > +
> > +	count_vm_event(COMPACTSTALL);
> > +
> > +	/* Compact each zone in the list */
> > +	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
> > +								nodemask) {
> > +		int fragindex;
> > +		int status;
> > +
> > +		/*
> > +		 * Watermarks for order-0 must be met for compaction. Note
> > +		 * the 2UL. This is because during migration, copies of
> > +		 * pages need to be allocated and for a short time, the
> > +		 * footprint is higher
> > +		 */
> > +		watermark = low_wmark_pages(zone) + (2UL << order);
> > +		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
> > +			continue;
> 
> ooh, so that starts to explain split_free_page().  But
> split_free_page() didn't do the 2UL thing.
> 

No, but split_free_page() knows exactly how much it is removing at that
time. At this point, there is a worst-case expectation that the pages being
migrating from and to are both isolated. At no point should they be all
allocated at any given time but it's not checking against deadlocks.

> Surely these things are racy?  So we'll deadlock less often :(
> 

It won't deadlock, this is a heuristic only that guesses whether compaction
is likely to succeed or not. The watermarks are rechecked every time pages
are taken off free list.

> > +		/*
> > +		 * fragmentation index determines if allocation failures are
> > +		 * due to low memory or external fragmentation
> > +		 *
> > +		 * index of -1 implies allocations might succeed depending
> > +		 * 	on watermarks
> > +		 * index towards 0 implies failure is due to lack of memory
> > +		 * index towards 1000 implies failure is due to fragmentation
> > +		 *
> > +		 * Only compact if a failure would be due to fragmentation.
> > +		 */
> > +		fragindex = fragmentation_index(zone, order);
> > +		if (fragindex >= 0 && fragindex <= 500)
> > +			continue;
> > +
> > +		if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
> > +			rc = COMPACT_PARTIAL;
> > +			break;
> > +		}
> 
> Why are we doing all this handwavy stuff?  Why not just try a
> compaction run and see if it worked? 

Because if that index is not matched, it really is a waste of time to
try compacting. It just won't work but it'll do a full scan of the zone
figuring that out.

> That would be more robust/reliable, surely?
> 

We'll also eventually get a bug report on low-memory situations causing
large amounts of CPU to be consumed in compaction without the pages
being allocated. Granted, we wouldn't get them until compaction was also
working for the lower orders but we'd get the report eventually.

> > +		status = compact_zone_order(zone, order, gfp_mask);
> > +		rc = max(status, rc);
> > +
> > +		if (zone_watermark_ok(zone, order, watermark, 0, 0))
> > +			break;
> > +	}
> > +
> > +	return rc;
> > +}
> > +
> > +
> >  /* Compact all zones within a node */
> >  static int compact_node(int nid)
> >  {
> >
> > ...
> >
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -561,7 +561,7 @@ static int unusable_show(struct seq_file *m, void *arg)
> >   * The value can be used to determine if page reclaim or compaction
> >   * should be used
> >   */
> > -int fragmentation_index(unsigned int order, struct contig_page_info *info)
> > +int __fragmentation_index(unsigned int order, struct contig_page_info *info)
> >  {
> >  	unsigned long requested = 1UL << order;
> >  
> > @@ -581,6 +581,14 @@ int fragmentation_index(unsigned int order, struct contig_page_info *info)
> >  	return 1000 - div_u64( (1000+(div_u64(info->free_pages * 1000ULL, requested))), info->free_blocks_total);
> >  }
> >  
> > +/* Same as __fragmentation index but allocs contig_page_info on stack */
> > +int fragmentation_index(struct zone *zone, unsigned int order)
> > +{
> > +	struct contig_page_info info;
> > +
> > +	fill_contig_page_info(zone, order, &info);
> > +	return __fragmentation_index(order, &info);
> > +}
> >  
> >  static void extfrag_show_print(struct seq_file *m,
> >  					pg_data_t *pgdat, struct zone *zone)
> > @@ -596,7 +604,7 @@ static void extfrag_show_print(struct seq_file *m,
> >  				zone->name);
> >  	for (order = 0; order < MAX_ORDER; ++order) {
> >  		fill_contig_page_info(zone, order, &info);
> > -		index = fragmentation_index(order, &info);
> > +		index = __fragmentation_index(order, &info);
> >  		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
> >  	}
> >  
> > @@ -896,6 +904,9 @@ static const char * const vmstat_text[] = {
> >  	"compact_blocks_moved",
> >  	"compact_pages_moved",
> >  	"compact_pagemigrate_failed",
> > +	"compact_stall",
> > +	"compact_fail",
> > +	"compact_success",
> 
> CONFIG_COMPACTION=n?
> 

Yeah, it should be.

> >
> > ...
> >
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2010-04-07 16:07 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-04-02 16:02 [PATCH 0/14] Memory Compaction v7 Mel Gorman
2010-04-02 16:02 ` [PATCH 01/14] mm,migration: Take a reference to the anon_vma before migrating Mel Gorman
2010-04-07  0:05   ` Andrew Morton
2010-04-07  9:56     ` Mel Gorman
2010-04-02 16:02 ` [PATCH 02/14] mm,migration: Do not try to migrate unmapped anonymous pages Mel Gorman
2010-04-02 16:02 ` [PATCH 03/14] mm: Share the anon_vma ref counts between KSM and page migration Mel Gorman
2010-04-07  0:05   ` Andrew Morton
2010-04-07  0:10     ` Rik van Riel
2010-04-07 10:01     ` Mel Gorman
2010-04-02 16:02 ` [PATCH 04/14] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove Mel Gorman
2010-04-07  0:05   ` Andrew Morton
2010-04-07 10:22     ` Mel Gorman
2010-04-02 16:02 ` [PATCH 05/14] Export unusable free space index via /proc/unusable_index Mel Gorman
2010-04-07  0:05   ` Andrew Morton
2010-04-07 10:35     ` Mel Gorman
2010-04-13 12:42     ` Mel Gorman
2010-04-02 16:02 ` [PATCH 06/14] Export fragmentation index via /proc/extfrag_index Mel Gorman
2010-04-07  0:05   ` Andrew Morton
2010-04-07 10:46     ` Mel Gorman
2010-04-13 12:43     ` Mel Gorman
2010-04-02 16:02 ` [PATCH 07/14] Move definition for LRU isolation modes to a header Mel Gorman
2010-04-02 16:02 ` [PATCH 08/14] Memory compaction core Mel Gorman
2010-04-07  0:05   ` Andrew Morton
2010-04-07 15:21     ` Mel Gorman
2010-04-08 16:59   ` Mel Gorman
2010-04-08 17:06     ` Andrea Arcangeli
2010-04-02 16:02 ` [PATCH 09/14] Add /proc trigger for memory compaction Mel Gorman
2010-04-07  0:05   ` Andrew Morton
2010-04-07 15:39     ` Mel Gorman
2010-04-07 18:27       ` Mel Gorman
2010-04-02 16:02 ` [PATCH 10/14] Add /sys trigger for per-node " Mel Gorman
2010-04-07  0:05   ` Andrew Morton
2010-04-07  0:31     ` KAMEZAWA Hiroyuki
2010-04-06 21:56       ` Andrew Morton
2010-04-07  1:19         ` KAMEZAWA Hiroyuki
2010-04-07 15:42     ` Mel Gorman
2010-04-02 16:02 ` [PATCH 11/14] Direct compact when a high-order allocation fails Mel Gorman
2010-04-07  0:06   ` Andrew Morton
2010-04-07 16:06     ` Mel Gorman [this message]
2010-04-07 18:29     ` Mel Gorman
2010-04-02 16:02 ` [PATCH 12/14] Add a tunable that decides when memory should be compacted and when it should be reclaimed Mel Gorman
2010-04-07  0:06   ` Andrew Morton
2010-04-07 16:11     ` Mel Gorman
2010-04-02 16:02 ` [PATCH 13/14] Do not compact within a preferred zone after a compaction failure Mel Gorman
2010-04-07  0:06   ` Andrew Morton
2010-04-07  0:55     ` Andrea Arcangeli
2010-04-07 16:32     ` Mel Gorman
2010-04-02 16:02 ` [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages Mel Gorman
2010-04-06  6:54   ` KAMEZAWA Hiroyuki
2010-04-06 15:37   ` Minchan Kim
2010-04-07  0:06   ` Andrew Morton
2010-04-07 16:49     ` Mel Gorman
2010-04-06 14:47 ` [PATCH 0/14] Memory Compaction v7 Tarkan Erimer
2010-04-06 15:00   ` Mel Gorman
2010-04-06 15:03     ` Tarkan Erimer
  -- strict thread matches above, loose matches on Subject: below --
2010-03-30  9:14 [PATCH 0/14] Memory Compaction v6 Mel Gorman
2010-03-30  9:14 ` [PATCH 11/14] Direct compact when a high-order allocation fails Mel Gorman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100407160642.GT17882@csn.ul.ie \
    --to=mel@csn.ul.ie \
    --cc=aarcange@redhat.com \
    --cc=agl@us.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=avi@redhat.com \
    --cc=cl@linux-foundation.org \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=minchan.kim@gmail.com \
    --cc=riel@redhat.com \
    --cc=rientjes@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).