Re: [PATCH 02/22] Do not sanity check order in the fast path

From: Mel Gorman <mel@csn.ul.ie>
To: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Linux Memory Management List <linux-mm@kvack.org>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Christoph Lameter <cl@linux-foundation.org>,
	Nick Piggin <npiggin@suse.de>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Lin Ming <ming.m.lin@intel.com>,
	Zhang Yanmin <yanmin_zhang@linux.intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Pekka Enberg <penberg@cs.helsinki.fi>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH 02/22] Do not sanity check order in the fast path
Date: Thu, 23 Apr 2009 10:58:22 +0100
Message-ID: <20090423095821.GA25102@csn.ul.ie>
In-Reply-To: <1240450447.10627.119.camel@nimitz>

On Wed, Apr 22, 2009 at 06:34:07PM -0700, Dave Hansen wrote:
> On Thu, 2009-04-23 at 01:13 +0100, Mel Gorman wrote:
> > On Wed, Apr 22, 2009 at 10:30:15AM -0700, Dave Hansen wrote:
> > > On Wed, 2009-04-22 at 18:11 +0100, Mel Gorman wrote:
> > > > On Wed, Apr 22, 2009 at 09:13:11AM -0700, Dave Hansen wrote:
> > > > > On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> > > > > > No user of the allocator API should be passing in an order >= MAX_ORDER
> > > > > > but we check for it on each and every allocation. Delete this check and
> > > > > > make it a VM_BUG_ON check further down the call path.
> > > > > 
> > > > > Should we get the check re-added to some of the upper-level functions,
> > > > > then?  Perhaps __get_free_pages() or things like alloc_pages_exact()? 
> > > > 
> > > > I don't think so, no. It just moves the source of the text bloat, and
> > > > does so for the few callers that are asking for something that will
> > > > never succeed.
> > > 
> > > Well, it's a matter of figuring out when it can succeed.  Some of this
> > > stuff, we can figure out at compile-time.  Others are a bit harder.
> > > 
> > 
> > What do you suggest then? Some sort of constant that tells you the
> > maximum size you can call for callers that think they might ever request
> > too much?
> > 
> > Shuffling the check around to other top-level helpers seems pointless to
> > me because as I said, it just moves text bloat from one place to the
> > next.
> 
> Do any of the actual fast paths pass 'order' in as a real variable?  If
> not, the compiler should be able to just take care of it.  From a quick
> scan, it does appear that at least a third of the direct alloc_pages()
> users pass an explicit '0'.  That should get optimized away
> *immediately*.
> 

You'd think so, but here are the vmlinux figures:

3355718 0001-Replace-__alloc_pages_internal-with-__alloc_pages_.patch
3355622 0002-Do-not-sanity-check-order-in-the-fast-path.patch

That's a mostly-modules kernel configuration and it's still making a
difference to the text size. Enough to care about, and one of the
things I was trying to keep in mind was branches in the fast paths.

> > > > > I'm selfishly thinking of what I did in profile_init().  Can I slab
> > > > > alloc it?  Nope.  Page allocator?  Nope.  Oh, well, try vmalloc():
> > > > > 
> > > > >         prof_buffer = kzalloc(buffer_bytes, GFP_KERNEL);
> > > > >         if (prof_buffer)
> > > > >                 return 0;
> > > > > 
> > > > >         prof_buffer = alloc_pages_exact(buffer_bytes, GFP_KERNEL|__GFP_ZERO);
> > > > >         if (prof_buffer)
> > > > >                 return 0;
> > > > > 
> > > > >         prof_buffer = vmalloc(buffer_bytes);
> > > > >         if (prof_buffer)
> > > > >                 return 0;
> > > > > 
> > > > >         free_cpumask_var(prof_cpu_mask);
> > > > >         return -ENOMEM;
> > > > > 
> > > > 
> > > > Can this ever actually be asking for an order larger than MAX_ORDER
> > > > though? If so, you're condemning it to always behave poorly.
> > > 
> > > Yeah.  It is based on text size.  Smaller kernels with trimmed configs
> > > and no modules have no problem fitting under MAX_ORDER, as do kernels
> > > with larger base page sizes.  
> > > 
> > 
> > It would seem that the right thing to have done here in the first place
> > then was
> > 
> > if (buffer_bytes > PAGE_SIZE << (MAX_ORDER-1))
> > 	return vmalloc(...)
> > 
> > kzalloc attempt
> > 
> > alloc_pages_exact attempt
> 
> Yeah, but honestly, I don't expect most users to get that
> "(buffer_bytes > PAGE_SIZE << (MAX_ORDER-1)" right.  It seems like
> *exactly* the kind of thing we should be wrapping up in common code.
> 

No, I wouldn't expect them to get it right. I'd expect some sort of
helper to exist all right if there were enough of these callers to be
fixed up.
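
A helper of the kind discussed here might look roughly like the sketch
below. It is a userspace illustration only, not existing kernel code:
SKETCH_PAGE_SIZE, SKETCH_MAX_ORDER, sketch_max_alloc_bytes() and
fits_page_allocator() are hypothetical names, and the values 4096 and 11
are assumptions matching a 4K-page configuration with a maximum usable
order of 10.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed values for illustration; the real ones depend on arch and config. */
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_MAX_ORDER 11	/* largest usable order is SKETCH_MAX_ORDER - 1 */

/* Largest contiguous request the page allocator could ever satisfy. */
static size_t sketch_max_alloc_bytes(void)
{
	return SKETCH_PAGE_SIZE << (SKETCH_MAX_ORDER - 1);
}

/*
 * Hypothetical helper: true if a request of 'bytes' is small enough that
 * the page allocator has a chance of satisfying it, false if the caller
 * should go straight to vmalloc().
 */
static bool fits_page_allocator(size_t bytes)
{
	assert(bytes > 0);
	return bytes <= sketch_max_alloc_bytes();
}
```

With these assumed values the cut-off is 4MB, so a caller would try the
page allocator for anything up to and including 4MB and skip it for
anything larger.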

> Perhaps we do need an alloc_pages_nocheck() for the users that do have a
> true non-compile-time-constant 'order' and still know they don't need
> the check.
> 

I'd do the opposite if we were doing this - alloc_pages_checksize() because
there are far more callers that are allocating sensible sizes than the
opposite. That said, making sure the allocator can handle a busted order
(even if we call all the way down to __rmqueue_smallest) seems the best
compromise between not having checks high in the fast path and callers that
want to chance their arm calling in and falling back to vmalloc() if it
doesn't work out.
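
The "chance their arm" pattern can be sketched in userspace as below.
Everything here is a stand-in under stated assumptions:
sketch_alloc_pages() and sketch_vmalloc() are hypothetical stubs built
on malloc(), not the kernel functions, and the stub simply returns NULL
for an order at or above the assumed SKETCH_MAX_ORDER of 11 to model an
allocator that tolerates a busted order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed 4K pages */
#define SKETCH_MAX_ORDER 11	/* assumed; orders 0..10 are usable */

enum sketch_source { SRC_NONE, SRC_PAGES, SRC_VMALLOC };

/* Stand-in for the page allocator: an oversized order just fails. */
static void *sketch_alloc_pages(unsigned int order)
{
	if (order >= SKETCH_MAX_ORDER)
		return NULL;
	return malloc(SKETCH_PAGE_SIZE << order);
}

/* Stand-in for vmalloc(): no contiguity limit, modelled with malloc(). */
static void *sketch_vmalloc(size_t bytes)
{
	return malloc(bytes);
}

/* Try the page allocator first, fall back to the vmalloc stand-in. */
static void *sketch_alloc_with_fallback(unsigned int order,
					enum sketch_source *src)
{
	void *p;

	assert(src != NULL && order < 20);	/* keep the shift small */
	p = sketch_alloc_pages(order);
	if (p) {
		*src = SRC_PAGES;
		return p;
	}
	p = sketch_vmalloc(SKETCH_PAGE_SIZE << order);
	*src = p ? SRC_VMALLOC : SRC_NONE;
	return p;
}
```

The caller carries no size check of its own; the cost of an oversized
request is one failed call before the fallback.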

> > > > > Same thing in __kmalloc_section_memmap():
> > > > > 
> > > > >         page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
> > > > >         if (page)
> > > > >                 goto got_map_page;
> > > > > 
> > > > >         ret = vmalloc(memmap_size);
> > > > >         if (ret)
> > > > >                 goto got_map_ptr;
> > > > > 
> > > > 
> > > > If I'm reading that right, the order will never be a stupid order. It can fail
> > > > for higher orders in which case it falls back to vmalloc().  For example,
> > > > to hit that limit on a kernel with 4K pages and a maximum usable order
> > > > of 10, the section size would need to be 256MB (assuming struct page size
> > > > of 64 bytes). I don't think it's ever that size and if so, it'll always be
> > > > sub-optimal which is a poor choice to make.
> > > 
> > > I think the section size default used to be 512M on x86 because we
> > > concentrate on removing whole DIMMs.  
> > > 
> > 
> > It was a poor choice then as their sections always ended up in
> > vmalloc() or else it was using the bootmem allocator in which case it
> > doesn't matter what the core page allocator was doing.
> 
> True, but we tried to code that sucker to work anywhere and to be as
> optimal as possible (which vmalloc() is not) when we could.
> 

Ok, seems fair.
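
For reference, the section-size arithmetic quoted earlier can be checked
with a small userspace sketch. The names are hypothetical and the
constants are the assumptions from the discussion: 4K pages, a 64-byte
struct page and a maximum usable order of 10. sketch_get_order() is a
simple stand-in for get_order(), not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT  12				/* assumed 4K pages */
#define SKETCH_PAGE_SIZE   (1UL << SKETCH_PAGE_SHIFT)
#define SKETCH_MAX_ORDER   11				/* orders 0..10 usable */
#define SKETCH_STRUCT_PAGE 64UL				/* assumed sizeof(struct page) */

/* Smallest order whose block (PAGE_SIZE << order) holds 'size' bytes. */
static unsigned int sketch_get_order(unsigned long size)
{
	unsigned int order = 0;

	assert(size > 0);
	while ((SKETCH_PAGE_SIZE << order) < size)
		order++;
	return order;
}

/* Order needed for the memmap covering one section of 'section_bytes'. */
static unsigned int sketch_memmap_order(unsigned long section_bytes)
{
	unsigned long pages = section_bytes / SKETCH_PAGE_SIZE;

	return sketch_get_order(pages * SKETCH_STRUCT_PAGE);
}
```

Under these assumptions a 256MB section needs a 4MB memmap, which is
order 10 and the largest block the page allocator can supply. A 512MB
section needs 8MB, which is order 11 and can only ever be satisfied by
the vmalloc() fallback.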

> > > > > I depend on the allocator to tell me when I've fed it too high of an
> > > > > order.  If we really need this, perhaps we should do an audit and then
> > > > > add a WARN_ON() for a few releases to catch the stragglers.
> > > > 
> > > > I consider it buggy to ask for something so large that you always end up
> > > > with the worst option - vmalloc(). How about leaving it as a VM_BUG_ON
> > > > to get as many reports as possible on who is depending on this odd
> > > > behaviour?
> > > > 
> > > > If there are users with good reasons, then we could convert this to WARN_ON
> > > > to fix up the callers. I suspect that the allocator can already cope with
> > > > receiving a stupid order silently but slowly. It should go all the way to the
> > > > bottom and just never find anything useful and return NULL.  zone_watermark_ok
> > > > is the most dangerous looking part but even it should never get to MAX_ORDER
> > > > because it should always find there are not enough free pages and return
> > > > before it overruns.
> > > 
> > > Whatever we do, I'd agree that it's fine that this is a degenerate case
> > > that gets handled very slowly and as far out of hot paths as possible.
> > > Anybody who can fall back to a vmalloc is not doing these things very
> > > often.
> > 
> > If that's the case, the simplest course might be to just drop the VM_BUG_ON()
> > as a separate patch after asserting it's safe to call into the page
> > allocator with too large an order with the consequence of it being a
> > relatively expensive call considering it can never succeed.
> 
> __rmqueue_smallest() seems to do the right thing and it is awfully deep
> in the allocator.
> 

And I believe zone_watermark_ok() does the right thing as well by
deciding that there are not enough free pages before it overflows the
buffer.
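
The reason a bottom-level search copes with an oversized order can be
illustrated with a toy model. This is a userspace sketch loosely
modelled on a free-list search bounded by MAX_ORDER, not the actual
__rmqueue_smallest() code; struct sketch_zone, its nr_free[] array and
sketch_rmqueue_smallest() are hypothetical, and SKETCH_MAX_ORDER of 11
is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_ORDER 11	/* assumed; orders 0..10 are usable */

struct sketch_zone {
	/* Number of free blocks available at each order. */
	unsigned int nr_free[SKETCH_MAX_ORDER];
};

/*
 * Return the order of the first free block that is at least 'order' in
 * size, or -1 if there is none. The loop bound means the array is never
 * indexed when 'order' is already SKETCH_MAX_ORDER or larger.
 */
static int sketch_rmqueue_smallest(const struct sketch_zone *zone,
				   unsigned int order)
{
	unsigned int current_order;

	assert(zone != NULL);
	for (current_order = order; current_order < SKETCH_MAX_ORDER;
	     current_order++) {
		if (zone->nr_free[current_order] > 0)
			return (int)current_order;
	}
	return -1;
}
```

An oversized order falls straight through the loop and reports failure
without touching the free lists, which is the "silently but slowly"
behaviour described in the quoted text above.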

> How about this:  I'll go and audit the use of order in page_alloc.c to
> make sure that having an order>MAX_ORDER-1 floating around is OK and
> won't break anything. 

Great. Right now, I think it's ok but I haven't audited for this
explicitly and a second set of eyes never hurts.

> I'll also go and see what the actual .text size
> changes are from this patch both for alloc_pages() and
> alloc_pages_node() separately to make sure what we're dealing with here.

It's .config specific of course, but check the leader where I posted vmlinux
sizes for each patch. scripts/bloat-o-meter by Matt Mackall also rules most
mightily for identifying where text increased or decreased between patches
in case you're not aware of it already.

> Does this check even *exist* in the optimized code very often?  
> 

Enough that it shrunk text size on my .config anyway.

> Deal? :)
> 

Deal

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
