From: Ingo Molnar <mingo@elte.hu>
To: Tejun Heo <tj@kernel.org>
Cc: rusty@rustcorp.com.au, tglx@linutronix.de, x86@kernel.org,
linux-kernel@vger.kernel.org, hpa@zytor.com, jeremy@goop.org,
cpw@sgi.com, nickpiggin@yahoo.com.au, ink@jurassic.park.msu.ru
Subject: Re: [PATCHSET x86/core/percpu] improve the first percpu chunk allocation
Date: Tue, 24 Feb 2009 10:57:08 +0100
Message-ID: <20090224095708.GA20739@elte.hu>
In-Reply-To: <1235445101-7882-1-git-send-email-tj@kernel.org>
* Tejun Heo <tj@kernel.org> wrote:
> Hello, all.
>
> This patchset improves the first percpu chunk allocation. The
> problem is that the dynamic percpu area allocation maps the
> whole percpu area into vmalloc area using 4k mappings which
> adds considerable amount of TLB pressure.
>
> This patchset modularizes the first percpu chunk allocation
> and uses different allocation schemes to optimize TLB usage.
>
> * On !NUMA, the first chunk is allocated directly using
> alloc_bootmem() thus adding no TLB pressure whatsoever.
>
> * On NUMA, the first chunk is remapped using large pages and
> whatever is left in the large page is given back to the
> bootmem allocator. This makes each cpu use an additional
> large TLB entry for the first chunk but still is much better
> than using many 4k TLB entries.
Hm, i think there still must be some basic misunderstanding
somewhere here. Let me lay out the design i described in the
previous mail in more detail.
In one of your changelogs you state:
| On NUMA, embedding allocator can't be used as different
| units can't be made to fall in the correct NUMA nodes.
This is a direct consequence of the unit/chunk abstraction, and
i think that abstraction is wrong.
What i'm suggesting is to have a simple contiguous [non-chunked,
with a hole in the last bits of the first 2MB] virtual memory
range for each CPU.
This special virtual memory starts with a 2MB page (for the
static bits - perhaps also with a default starter dynamic area
appended to that - we can size this reasonably) and continues
with 4K mappings at the next 2MB boundary and goes on linearly
from that point on.
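A minimal sketch of that layout, assuming x86-64 page sizes. The names (percpu_map_kind(), percpu_nr_4k_ptes(), the PERCPU_MAP_* constants) are hypothetical and only illustrate which mapping size would back a given offset within one CPU's range:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page sizes; both macros are illustrative. */
#define LARGE_PAGE_SIZE	(2UL << 20)	/* 2MB */
#define SMALL_PAGE_SIZE	(4UL << 10)	/* 4K */

enum percpu_map_kind { PERCPU_MAP_2M, PERCPU_MAP_4K };

/* Which kind of mapping backs @offset within one CPU's range. */
static enum percpu_map_kind percpu_map_kind(size_t offset)
{
	return offset < LARGE_PAGE_SIZE ? PERCPU_MAP_2M : PERCPU_MAP_4K;
}

/* Number of 4K ptes needed once the area has grown to @size bytes. */
static size_t percpu_nr_4k_ptes(size_t size)
{
	assert(SMALL_PAGE_SIZE > 0);
	if (size <= LARGE_PAGE_SIZE)
		return 0;	/* still covered by the single 2MB page */
	return (size - LARGE_PAGE_SIZE + SMALL_PAGE_SIZE - 1) / SMALL_PAGE_SIZE;
}
```

Anything below the 2MB boundary costs no extra ptes in this sketch; growth past it adds one 4K pte per page, linearly.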
The variables within this singular 'percpu area' mirror each
other amongst CPUs. So if a dynamic (or static) percpu variable
is at offset 156100 in CPU#5's range - then it will be at offset
156100 in CPU#11's percpu area too. Each of these areas is
tightly packed with that CPU's allocations (and only that CPU's
allocations), there's no chunking, no units.
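The mirroring can be sketched as plain address arithmetic. This is an illustration under assumptions, not the proposed implementation: struct percpu_window, its fields and percpu_addr() are hypothetical names, and the per-CPU ranges are assumed to sit back to back in one virtual window:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of the whole percpu virtual window. */
struct percpu_window {
	uintptr_t base;		/* start of CPU#0's range */
	size_t cpu_span;	/* virtual bytes reserved per CPU */
	unsigned int nr_cpus;	/* number of possible CPUs */
};

/* Address of the variable that lives at @offset in @cpu's range. */
static uintptr_t percpu_addr(const struct percpu_window *w,
			     unsigned int cpu, size_t offset)
{
	assert(cpu < w->nr_cpus);
	assert(offset < w->cpu_span);
	return w->base + (uintptr_t)cpu * w->cpu_span + offset;
}
```

The same offset (156100 in the example above) is valid for every CPU; only the per-CPU base differs.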
As with your proposal this tears down the current artificial
distinction between static and dynamic percpu variables.
But with this approach we'd have the following additional advantages:
- No dynamic-alloc single-allocation size limits _at all_ in
practice. [up to the total size of the virtual memory window]
( With your current proposal the dynamic alloc is limited to
unit size - which looks a bit inflexible as unit size
impacts other characteristics so when we want to increase
the dynamic allocation size we'd also affect other areas of
the code. )
percpu_alloc() would become as limitless (on 64-bit) as
vmalloc().
- no NUMA complications and no NUMA asymmetry at all. When we
extend a CPU's percpu area we do NUMA-local allocations to
that CPU. The memory allocated is purely for that CPU's
purpose.
- We'd have a very 'compressed' pte presence in the pagetables:
the dynamic percpu area is as tightly packed as possible. With
a chunked design we 'scatter' the ptes a bit more broadly.
The only thing that gets a bit trickier is sizing - but not by
much. The best way we can size this without practical
complications on very small or very large systems would be by
setting the maximum _combined_ size for all percpu allocations.
Say we set this 'PERCPU_TOTAL' limit to 4 GB. That means that if
there are 8 possible CPUs, each CPU can have up to 512 MB of
RAM. That's plenty in practice.
We can do this splitup dynamically during bootup, because the
area is still fully linear, relative to the percpu offset.
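The splitup is simple arithmetic. A sketch, with PERCPU_TOTAL taken from the text above and everything else assumed: percpu_span_per_cpu() is a hypothetical helper, and rounding each CPU's share down to a 2MB multiple is an assumption made here so that every range can start on a large-page boundary:

```c
#include <assert.h>
#include <stdint.h>

#define PERCPU_TOTAL	(4ULL << 30)	/* 4 GB combined limit */
#define LARGE_PAGE_SIZE	(2ULL << 20)	/* 2MB, assumed alignment */

/*
 * Virtual bytes each possible CPU gets out of @total, rounded down
 * to a multiple of the large page size (an assumption of this sketch).
 */
static uint64_t percpu_span_per_cpu(uint64_t total,
				    unsigned int nr_possible_cpus)
{
	uint64_t span;

	assert(nr_possible_cpus > 0);
	span = total / nr_possible_cpus;
	return span - (span % LARGE_PAGE_SIZE);
}
```

With 8 possible CPUs this gives the 512 MB per CPU mentioned above.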
[ A system with 4k CPUs would want to have a larger PERCPU_TOTAL
- but obviously it cannot be really mind-blowingly large
because the total max has to be backed up with real RAM. So
realistically we won't have more than 1TB in the next 10 years
or so. Which is still well below the limitations of the 64-bit
address space. ]
In a non-chunked allocator the whole bitmap management becomes
much simpler and more straightforward as well. It's also much
easier to think about than an interleaved unit+chunk design.
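To illustrate how little state a non-chunked allocator needs, here is a userspace sketch of a first-fit bitmap allocator over a single offset space. It is not the kernel's allocator: SLOT_SIZE, NR_SLOTS, percpu_offset_alloc() and percpu_offset_free() are all hypothetical, and locking, alignment and backing-page population are left out. One bitmap serves all CPUs, because a returned offset is valid in every CPU's range:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SLOT_SIZE	16u	/* allocation granularity, assumed */
#define NR_SLOTS	1024u	/* size of the offset space in slots, assumed */
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

/* One bit per slot: set means the slot is in use on every CPU. */
static unsigned long slot_map[(NR_SLOTS + BITS_PER_WORD - 1) / BITS_PER_WORD];

static int slot_test(size_t i)
{
	return (slot_map[i / BITS_PER_WORD] >> (i % BITS_PER_WORD)) & 1UL;
}

static void slot_set(size_t i)
{
	slot_map[i / BITS_PER_WORD] |= 1UL << (i % BITS_PER_WORD);
}

static void slot_clear(size_t i)
{
	slot_map[i / BITS_PER_WORD] &= ~(1UL << (i % BITS_PER_WORD));
}

/* First-fit: returns the percpu offset of the allocation, or -1. */
static long percpu_offset_alloc(size_t size)
{
	size_t need, run = 0, i, j;

	if (size == 0 || size > (size_t)NR_SLOTS * SLOT_SIZE)
		return -1;
	need = (size + SLOT_SIZE - 1) / SLOT_SIZE;

	for (i = 0; i < NR_SLOTS; i++) {
		run = slot_test(i) ? 0 : run + 1;
		if (run == need) {
			size_t start = i + 1 - need;

			for (j = start; j <= i; j++)
				slot_set(j);
			return (long)(start * SLOT_SIZE);
		}
	}
	return -1;
}

/* Release an allocation made with the same @size. */
static void percpu_offset_free(long offset, size_t size)
{
	size_t start = (size_t)offset / SLOT_SIZE;
	size_t need = (size + SLOT_SIZE - 1) / SLOT_SIZE, j;

	assert(offset >= 0 && start + need <= NR_SLOTS);
	for (j = start; j < start + need; j++)
		slot_clear(j);
}
```

There is no chunk lookup step here: the offset returned by the allocator is the only thing a caller needs, and extending the area only means growing the bitmap.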
The only special complication is the setup of the initial 2MB
area - but that is tricky to bootstrap anyway because we need to
set it up before the page allocator gets initialized. It's also
worthwhile to put the most common percpu variables, and an
expected amount of dynamic area into a 2MB TLB.
Hm?
Ingo