From: Tejun Heo <tj@kernel.org>
To: Ingo Molnar <mingo@elte.hu>
Cc: rusty@rustcorp.com.au, tglx@linutronix.de, x86@kernel.org,
	linux-kernel@vger.kernel.org, hpa@zytor.com, jeremy@goop.org,
	cpw@sgi.com
Subject: Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator
Date: Sat, 21 Feb 2009 16:10:09 +0900
Message-ID: <499FA8D1.8030806@kernel.org>
In-Reply-To: <20090220093234.GF24555@elte.hu>

Hello, Ingo.

Ingo Molnar wrote:
> Where's the problem? Via bootmem we can allocate an arbitrarily 
> large, properly NUMA-affine, well-aligned, linear, large-TLB 
> piece of memory, for each CPU.

I wish it was that peachy.  The problem is the added TLB pressure.

> We should allocate a large enough chunk for the static percpu 
> variables, and remap them using 2MB mapping[s].
> 
> I'm not sure where the desire for 'chunking' below 2MB comes 
> from - there's no real benefit from it - the TLB will either be 
> 4K or 2MB, going inbetween makes little sense.

Making the 'chunk' size 2MB would be useful for non-NUMA.  For NUMA,
making the 'chunk' size 2MB doesn't help much.  For unit size, 4k is
the minimum and 2MB is a meaningful boundary once the percpu area gets
sufficiently large, since large page mapping can then be used for
NUMA.  For chunk size, 4k * num_possible_cpus() is the minimum, and the
meaningful boundary is 2MB for !NUMA and 2MB * num_possible_cpus() for
NUMA.

Anything between 4k and one of the meaningful boundaries doesn't make
much difference other than that the chunk size needs to be at least as
large as the maximum supported allocation.  If it's above a certain
limit, going larger doesn't provide much benefit.  Given the tight vm
situation on 32bit, there simply isn't a good reason to default to 2MB
unless large mapping is gonna be used.
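
To make the sizes above concrete, here is a minimal sketch of the
arithmetic.  The helper names and the SZ_* constants are hypothetical,
made up for illustration; they are not functions from the patchset.

```c
/* Illustrative only: these helpers are assumptions, not kernel APIs. */
#define SZ_4K (4UL * 1024)
#define SZ_2M (2UL * 1024 * 1024)

/* Smallest possible chunk: one 4k unit for each possible CPU. */
static unsigned long min_chunk_size(unsigned int nr_possible_cpus)
{
	return SZ_4K * nr_possible_cpus;
}

/*
 * Chunk size at which large page mapping becomes meaningful:
 * 2MB for !NUMA, 2MB per possible CPU for NUMA.
 */
static unsigned long large_map_chunk_boundary(unsigned int nr_possible_cpus,
					      int numa)
{
	return numa ? SZ_2M * nr_possible_cpus : SZ_2M;
}
```

With 8 possible cpus this gives a 32k minimum chunk, a 2MB boundary on
!NUMA and a 16MB boundary on NUMA.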

> So i think the best (and simplest) approach is to:
> 
>  - allocate the static percpu area using bootmem-alloc, but 
>    using a 2MB alignment parameter and a 2MB aligned size. Then 
>    we can remap it to some convenient and undisturbed virtual 
>    memory area, using 2MB TLBs. [*]
> 
>  - The 'partial' bit of the 2MB page (the one that is outside 
>    the 4K-uprounded portion of __per_cpu_end - __per_cpu_start) 
>    can then be freed via bootmem and is available as regular 
>    pages to the rest of the kernel.

Heh... that's exactly where the problem is.  If you remap and free
what's left, the remapped area and the freed area will use two
separate 2MB TLB entries instead of one.  It's probably worse than
simply using 4k mappings.

On !NUMA, we can get away with this because the static percpu area
doesn't need to be remapped, so the physical mapping can be used
unchanged and what's left can be returned to the system.  On NUMA, we
need to remap, so we can't easily return what's left.

>  - Then we start dynamic allocations at the _next_ 2MB boundary 
>    in the virtual remapped space, and use 4K mappings from that 
>    point on. Since at least initially we dont want to waste a 
>    full 2MB page on dynamic allocations, we've got no choice but 
>    to use 4K pages.

It will be better to reserve some area for dynamic allocation so that
usual percpu allocations can be served by the initial mapping, which
tends to be pretty small on usual configurations.

>  - This means that percpu_alloc() will not return a pointer to 
>    an array of percpu pointers - but will return a standard 
>    offset that is valid in each percpu area and points to 
>    somewhere beyond the 2MB boundary that comes after the 
>    initial static area. This means it needs some minimal memory 
>    management - but it all looks relatively straightforward.
>
> So the virtual memory area will be continous, with a 'hole' in 
> it that separates the static and dynamic portions, and dynamic 
> percpu pointers will point straight into it [with a %gs offset] 
> - without an intermediary array of pointers.
> 
> No chunking, no fuss - just bootmem plus 4K allocations - the 
> best of both worlds.

The new percpu_alloc() already does that.  Chunking or not makes no
difference in this regard.  The only difference is whether there are
more holes in the allocated percpu addresses or not, which basically
is irrelevant, and chunking makes things much more flexible and
scalable.  i.e. it can scale toward many many cpus or large large
percpu areas, whereas making the areas contiguous makes the
scalability determined by the product of the two.
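
A rough sketch of that scaling argument, as plain arithmetic.  The
function names and parameters are assumptions made up for
illustration, and the model is simplified: it only counts the virtual
address space each layout ties up.

```c
/* Illustrative only: names and parameters are assumptions. */

/*
 * Contiguous layout: every cpu's area has to be sized up front for
 * the largest percpu area that will ever be needed, so the reserved
 * address space is the product of the two.
 */
static unsigned long long contiguous_vm_bytes(unsigned int nr_cpus,
					      unsigned long long max_area_bytes)
{
	return (unsigned long long)nr_cpus * max_area_bytes;
}

/*
 * Chunked layout: address space grows one chunk at a time, and each
 * chunk holds one unit per cpu.
 */
static unsigned long long chunked_vm_bytes(unsigned int nr_cpus,
					   unsigned long long unit_bytes,
					   unsigned int nr_chunks)
{
	return (unsigned long long)nr_chunks * nr_cpus * unit_bytes;
}
```

For example, 4096 cpus with a 64MB maximum area reserve 256GB in the
contiguous model, while a single chunk of 64k units takes 256MB.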

Also, contiguous per-cpu areas might look simpler but it's actually
more complicated because it becomes much more arch dependent.  With
chunking, the complexity is in generic code as virtual address and
stuff are already in place.  If the cpu areas need to be made
contiguous, the generic code will get simpler but each arch needs to
come up with a new address space layout.

There simply isn't any measurable advantage to making the area
contiguous.

> This also means we've essentially eliminated the boundary 
> between static and dynamic APIs, and can probably use some of 
> the same direct assembly optimizations (on x86) for local-CPU 
> dynamic percpu accesses too. [maybe not all addressing modes are 
> possible straight away, this needs a more precise check.]

The posted patchset already does that.  Please take a look at the new
per_cpu_ptr().  It's basically &per_cpu().  Unifying accessors is the
next step and I'm planning to consolidate the local_t implementation
into it too, but I think all that depends on us agreeing on the
allocator.  I can remove the TLB problem from the non-NUMA case but
for NUMA I still don't have a good idea.  Maybe we need to accept the
overhead for NUMA?  I don't know.
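
To illustrate the "offset valid in each percpu area" idea, here is a
user-space sketch.  It is not the kernel's per_cpu_ptr(); the names,
the fixed 4k unit size and the flat backing array are all assumptions
made for the example.

```c
#include <stddef.h>

/* Illustrative only: a user-space stand-in, not the kernel macro. */
#define SKETCH_NR_CPUS   4
#define SKETCH_UNIT_SIZE 4096

/* One unit per cpu, laid out back to back. */
static unsigned char sketch_percpu_area[SKETCH_NR_CPUS * SKETCH_UNIT_SIZE];

/* Distance from cpu0's unit to the given cpu's unit. */
static size_t sketch_cpu_offset(unsigned int cpu)
{
	return (size_t)cpu * SKETCH_UNIT_SIZE;
}

/*
 * A percpu "pointer" is just an address inside cpu0's unit.  Adding
 * a cpu's offset yields that cpu's copy, with no intermediary array
 * of pointers.
 */
static void *sketch_per_cpu_ptr(void *ptr, unsigned int cpu)
{
	return (unsigned char *)ptr + sketch_cpu_offset(cpu);
}
```

The same address computation works for both static and dynamic percpu
variables, which is what lets the accessors be unified.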

Thanks.

-- 
tejun
