Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator

From: Ingo Molnar <mingo@elte.hu>
To: Tejun Heo <tj@kernel.org>
Cc: rusty@rustcorp.com.au, tglx@linutronix.de, x86@kernel.org,
	linux-kernel@vger.kernel.org, hpa@zytor.com, jeremy@goop.org,
	cpw@sgi.com
Subject: Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator
Date: Fri, 20 Feb 2009 10:32:34 +0100
Message-ID: <20090220093234.GF24555@elte.hu>
In-Reply-To: <499E20BC.4020408@kernel.org>


* Tejun Heo <tj@kernel.org> wrote:

> Hello, Ingo.
> 
> Ingo Molnar wrote:
> > * Tejun Heo <tj@kernel.org> wrote:
> > 
> >> Tejun Heo wrote:
> >>>   One trick we can do is to reserve the initial chunk in non-vmalloc
> >>>   area so that at least the static cpu ones and whatever gets
> >>>   allocated in the first chunk is served by regular large page
> >>>   mappings.  Given that those are the most frequently visited ones, this
> >>>   could be a nice compromise - no noticeable penalty for usual cases
> >>>   yet allowing scalability for unusual cases.  If this is something
> >>>   which can be agreed on, I'll pursue this.
> >> I've given more thought to this and it actually will solve 
> >> most of the issues for non-NUMA but it can't be done for NUMA.  
> >> Any better ideas?
> > 
> > It could be allocated via NUMA-aware bootmem allocations.
> 
> Hmmm... not really.  Here's what I was planning to do on non-NUMA.
> 
>   Allocate the first chunk using alloc_bootmem().  After setting up
>   each unit, give back extra space sans the initialized static area
>   and some amount of free space which should be enough for common
>   cases by calling free_bootmem().  Mark the returned space as used in
>   the chunk map.
> 
> This will allow sane chunk size and scalability without adding 
> TLB pressure, so it's actually pretty sweet.  Unfortunately, 
> this doesn't really work for NUMA because we don't have 
> control over how NUMA addresses are laid out so we can't 
> allocate a contiguous NUMA-correct chunk without remapping.  And 
> if we remap, we can't give back what's left to the allocator.  
> Giving back the original address doubles TLB usage and giving 
> back the remapped address breaks __pa/__va.  :-(

Where's the problem? Via bootmem we can allocate an arbitrarily 
large, properly NUMA-affine, well-aligned, linear, large-TLB 
piece of memory, for each CPU.

We should allocate a large enough chunk for the static percpu 
variables, and remap them using 2MB mapping[s].

I'm not sure where the desire for 'chunking' below 2MB comes 
from - there's no real benefit from it - a TLB entry maps either 
4K or 2MB, so going in between makes little sense.

So I think the best (and simplest) approach is to:

 - allocate the static percpu area using bootmem-alloc, but 
   using a 2MB alignment parameter and a 2MB aligned size. Then 
   we can remap it to some convenient and undisturbed virtual 
   memory area, using 2MB TLBs. [*]

 - The 'partial' bit of the 2MB page (the one that is outside 
   the 4K-uprounded portion of __per_cpu_end - __per_cpu_start) 
   can then be freed via bootmem and is available as regular 
   pages to the rest of the kernel.

 - Then we start dynamic allocations at the _next_ 2MB boundary 
   in the virtual remapped space, and use 4K mappings from that 
   point on. Since at least initially we don't want to waste a 
   full 2MB page on dynamic allocations, we've got no choice but 
   to use 4K pages.

 - This means that percpu_alloc() will not return a pointer to 
   an array of percpu pointers - but will return a standard 
   offset that is valid in each percpu area and points to 
   somewhere beyond the 2MB boundary that comes after the 
   initial static area. This means it needs some minimal memory 
   management - but it all looks relatively straightforward.
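
The size arithmetic in the steps above can be sketched in plain
user-space C. This is only an illustration under stated assumptions:
struct percpu_layout, percpu_layout_for() and round_up_pow2() are
hypothetical names, not kernel APIs, and the static area size is passed
in rather than taken from __per_cpu_end - __per_cpu_start.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_4K (4096UL)
#define PAGE_2M (2UL * 1024 * 1024)

/* Round x up to the next multiple of align; align must be a power of two. */
static unsigned long round_up_pow2(unsigned long x, unsigned long align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/* Hypothetical summary of the per-CPU layout; all values are in bytes. */
struct percpu_layout {
	unsigned long static_used;   /* static area, rounded up to 4K */
	unsigned long bootmem_size;  /* 2MB-aligned bootmem allocation size */
	unsigned long freeable;      /* tail of the last 2MB page to give back */
	unsigned long dynamic_start; /* offset of the 4K-mapped dynamic area */
};

static struct percpu_layout percpu_layout_for(unsigned long static_size)
{
	struct percpu_layout l;

	l.static_used = round_up_pow2(static_size, PAGE_4K);
	l.bootmem_size = round_up_pow2(l.static_used, PAGE_2M);
	/* The part past the 4K-uprounded static area is returned to bootmem. */
	l.freeable = l.bootmem_size - l.static_used;
	/* Dynamic allocations begin at the next 2MB boundary. */
	l.dynamic_start = l.bootmem_size;
	return l;
}
```

For a 100000-byte static area this yields 102400 bytes in use, a single
2MB allocation, 1994752 bytes that can be freed, and a dynamic area
starting at offset 2097152.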

So the virtual memory area will be contiguous, with a 'hole' in 
it that separates the static and dynamic portions, and dynamic 
percpu pointers will point straight into it [with a %gs offset] 
- without an intermediary array of pointers.
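
To make the "offset instead of an array of pointers" idea concrete,
here is a minimal user-space simulation. It is a sketch only: every
sketch_* name is hypothetical, the per-CPU areas are ordinary heap
buffers rather than remapped memory, the allocator is a trivial bump
allocator with no free path, and the base lookup is a plain array read
where x86 would use a %gs-relative access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_NR_CPUS   4
#define SKETCH_AREA_SIZE 4096
#define SKETCH_NO_OFFSET ((size_t)-1)

/* One base address per simulated CPU; stands in for the %gs base. */
static unsigned char *sketch_cpu_base[SKETCH_NR_CPUS];
/* Next unused offset, shared by all CPUs since offsets are global. */
static size_t sketch_next_free;

/* Allocate one zeroed area per simulated CPU; returns 0 on success. */
static int sketch_areas_init(void)
{
	for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		sketch_cpu_base[cpu] = calloc(1, SKETCH_AREA_SIZE);
		if (!sketch_cpu_base[cpu])
			return -1; /* earlier areas are leaked; fine for a sketch */
	}
	sketch_next_free = 0;
	return 0;
}

/* Returns an offset valid in every CPU's area, or SKETCH_NO_OFFSET. */
static size_t sketch_percpu_alloc(size_t size, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	size_t off = (sketch_next_free + align - 1) & ~(align - 1);

	if (off > SKETCH_AREA_SIZE || size > SKETCH_AREA_SIZE - off)
		return SKETCH_NO_OFFSET;
	sketch_next_free = off + size;
	return off;
}

/* Resolve an offset for one CPU: base plus offset, no pointer array. */
static void *sketch_per_cpu_ptr(size_t off, int cpu)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	return sketch_cpu_base[cpu] + off;
}
```

A caller would run sketch_areas_init() once, obtain an offset from
sketch_percpu_alloc(), and then resolve it per CPU with
sketch_per_cpu_ptr(); the same offset names a distinct object in each
CPU's area.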

No chunking, no fuss - just bootmem plus 4K allocations - the 
best of both worlds.

This also means we've essentially eliminated the boundary 
between static and dynamic APIs, and can probably use some of 
the same direct assembly optimizations (on x86) for local-CPU 
dynamic percpu accesses too. [maybe not all addressing modes are 
possible straight away, this needs a more precise check.]

	Ingo

[*] Note: the 2MB up-rounding bootmem trick above is needed to 
          make sure the partial 2MB page is still fully RAM - 
          it's not well-specified to have a PAT-incompatible 
          area (unmapped RAM, device memory, etc.) in that hole.
