From: Tejun Heo <tj@kernel.org>
To: rusty@rustcorp.com.au, tglx@linutronix.de, x86@kernel.org,
linux-kernel@vger.kernel.org, hpa@zytor.com, jeremy@goop.org,
cpw@sgi.com, mingo@elte.hu
Subject: [PATCHSET x86/core/percpu] implement dynamic percpu allocator
Date: Wed, 18 Feb 2009 21:04:26 +0900
Message-ID: <1234958676-27618-1-git-send-email-tj@kernel.org>
Hello, all.
This patchset implements a dynamic percpu allocator. As I wrote before,
the percpu areas are organized in chunks which in turn are composed of
num_possible_cpus() units. As the offsets of units from the first unit
stay the same regardless of where the chunk is, arch code can directly
access each percpu area by setting up percpu access such that each cpu
translates the same percpu address to locations one unit size apart.
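To make the addressing concrete, here is a minimal userspace sketch in C. It is not code from the patchset: the names chunk_model, chunk_size and cpu_addr are hypothetical, and any unit size plugged in (say 32k, which would give a 128k chunk on 4-way) is an assumption for illustration. It only shows that a cpu's copy of an object sits cpu * unit_size bytes past the cpu 0 copy, wherever the chunk happens to be.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one chunk; not a structure from the patchset. */
struct chunk_model {
	uintptr_t base;        /* address of the first unit (cpu 0) */
	size_t unit_size;      /* bytes per unit, identical for every chunk */
	unsigned int nr_units; /* stands in for num_possible_cpus() */
};

/* A chunk spans one unit per possible cpu. */
static size_t chunk_size(const struct chunk_model *c)
{
	return c->unit_size * c->nr_units;
}

/*
 * Given the address of the cpu 0 copy of an object (addr0), return the
 * address of the copy belonging to 'cpu'. The result depends only on
 * unit_size and cpu, never on where the chunk was placed.
 */
static uintptr_t cpu_addr(const struct chunk_model *c, uintptr_t addr0,
			  unsigned int cpu)
{
	assert(cpu < c->nr_units);
	/* addr0 must lie inside the first unit of this chunk */
	assert(addr0 >= c->base && addr0 - c->base < c->unit_size);
	return addr0 + (uintptr_t)cpu * c->unit_size;
}
```

Because the per-cpu displacement is the same in every chunk, an arch can keep a single per-cpu offset (cpu * unit_size in this model) and add it to any percpu address.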
The statically declared percpu area for the kernel, which is set up
early during boot, is also served by the same allocator, but it needs a
special init path as it has to be up and running way before regular
memory management is initialized.
Percpu areas are allocated from the vmalloc space and managed directly
by the percpu code. Chunks start empty and are populated with pages
as they're allocated. As there are many small allocations and
allocations often need much smaller alignment (no need for cacheline
alignment), the allocator tries to maximize chunk utilization and put
allocations in fuller chunks.
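One simple way to get the "prefer fuller chunks" behaviour is a best-fit scan over chunks, sketched below in userspace C. This is an assumption about how such a policy could look, not the data structure the patch uses; chunk_info and pick_fullest_chunk are hypothetical names, and the sketch ignores alignment and fragmentation inside a chunk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-chunk bookkeeping: only the free byte count matters here. */
struct chunk_info {
	size_t free_bytes;
};

/*
 * Return the index of the fullest chunk (least free space) that can
 * still hold 'bytes', or -1 if no chunk has room and a new one would
 * have to be created.
 */
static int pick_fullest_chunk(const struct chunk_info *chunks, int nr,
			      size_t bytes)
{
	int best = -1;

	for (int i = 0; i < nr; i++) {
		if (chunks[i].free_bytes < bytes)
			continue;	/* does not fit */
		if (best < 0 || chunks[i].free_bytes < chunks[best].free_bytes)
			best = i;	/* fits and is fuller than the current pick */
	}
	return best;
}
```

Packing small allocations into already populated chunks keeps emptier chunks free of pages for longer, which is the utilization goal described above.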
There have been several concerns regarding this approach.
* On 64bit, no need for chunks. We can just allocate contiguous
areas.
For 32bit, with the overcrowded address space, consolidating percpu
allocations into vmalloc (or other) area is a big win as no space
needs to be further set aside for percpu variables and, with a
relatively small number of possible cpus, the chunks can be kept at a
manageable size (e.g. 128k chunks for 4way smp wouldn't be too bad)
and it can achieve reasonable scalability.
So, I think the question becomes whether it makes sense to use
different allocation schemes for 32 and 64bits. The added overhead
of chunk handling itself isn't anything which can warrant separate
implementations. If there's a way to solve some other issues nicely
with larger address space, maybe, but I really think it would be
best to stick with one implementation.
* It adds to TLB pressure.
Yeah, unfortunately, it does. Currently it adds a number of kernel
4k pages into circulation (cold/high pages, so unlikely to affect
other large mappings). There are several different varieties of
this issue.
The unit size and thus the chunk size is pretty flexible (it
currently requires power of 2 but that restriction can be lifted
easily). With vm area allocation with larger alignment, using large
pages for chunks (non-NUMA) or units (large, large NUMA) shouldn't be
too difficult for highends but for mid range stuff, it looks like
there isn't much else to do than sticking with 4k mappings.
The TLB pressure problem would be there regardless of address layout
as long as we want to grow the percpu area dynamically.
Page-granular growth will add 4k pressure. Large-page granularity is
likely to waste lots of space.
One trick we can do is to reserve the initial chunk in non-vmalloc
area so that at least the static cpu ones and whatever gets
allocated in the first chunk is served by regular large page
mappings. Given that those are the most frequently visited ones, this
could be a nice compromise - no noticeable penalty for usual cases
yet allowing scalability for unusual cases. If this is something
which can be agreed on, I'll pursue this.
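The granularity trade-off can be put into numbers with a small userspace C sketch. The names growth_cost and cost_of_growth are hypothetical, and the page sizes a caller passes in (for example 4k, or 2M for an x86 large page) are assumptions; the point is only that finer pages mean more mappings per unit while coarser pages mean more populated but unused bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of what it costs to back one unit's used bytes. */
struct growth_cost {
	size_t mappings; /* pages mapped per unit, i.e. potential TLB entries */
	size_t wasted;   /* bytes populated but not used, per unit */
};

/* Round 'used' up to whole pages of 'page_size' and report the cost. */
static struct growth_cost cost_of_growth(size_t used, size_t page_size)
{
	struct growth_cost c;

	assert(page_size > 0);
	c.mappings = (used + page_size - 1) / page_size;
	c.wasted = c.mappings * page_size - used;
	return c;
}
```

For 20k of used percpu data per unit, 4k pages need five mappings and waste nothing, while a single 2M page needs one mapping and leaves almost all of it unused, and that waste is multiplied by the number of units.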
The percpu allocator is an optional feature which can be selected by
each arch by setting the HAVE_DYNAMIC_PER_CPU_AREA configuration
variable. Currently only x86_32 and x86_64 use it.
Ah.. I also left out cpu hotplugging stuff for now. This largely
isn't an issue on most machines where num_possible_cpus() doesn't
deviate much from num_online_cpus(). Are there cases where this is
critical? Currently, no user of percpu allocation, static or dynamic,
cares about this and it has been like this for a long time, so I'm a
little bit skeptical about it.
This patchset contains the following ten patches.
0001-vmalloc-call-flush_cache_vunmap-from-unmap_kernel.patch
0002-module-fix-out-of-range-memory-access.patch
0003-module-reorder-module-pcpu-related-functions.patch
0004-alloc_percpu-change-percpu_ptr-to-per_cpu_ptr.patch
0005-alloc_percpu-add-align-argument-to-__alloc_percpu.patch
0006-percpu-kill-percpu_alloc-and-friends.patch
0007-vmalloc-implement-vm_area_register_early.patch
0008-vmalloc-add-un-map_kernel_range_noflush.patch
0009-percpu-implement-new-dynamic-percpu-allocator.patch
0010-x86-convert-to-the-new-dynamic-percpu-allocator.patch
0001-0003 contain fixes and trivial prep. 0004-0006 clean up percpu.
0007-0008 add stuff to vmalloc which will be used by the new
allocator. 0009-0010 implement and use the new allocator.
This patchset is on top of the current x86/core/percpu[1] and can be
fetched from the following git vector.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git tj-percpu
diffstat follows.
 arch/alpha/mm/init.c                       |   20
 arch/x86/Kconfig                           |    3
 arch/x86/include/asm/percpu.h              |    8
 arch/x86/include/asm/pgtable.h             |    1
 arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c |    2
 arch/x86/kernel/setup_percpu.c             |   62 +-
 arch/x86/mm/init_32.c                      |   10
 arch/x86/mm/init_64.c                      |   19
 block/blktrace.c                           |    2
 drivers/acpi/processor_perflib.c           |    4
 include/linux/percpu.h                     |   65 +-
 include/linux/vmalloc.h                    |    4
 kernel/module.c                            |   78 +-
 kernel/sched.c                             |    6
 kernel/stop_machine.c                      |    2
 mm/Makefile                                |    4
 mm/allocpercpu.c                           |   32 -
 mm/percpu.c                                |  876 +++++++++++++++++++++++++++++
 mm/vmalloc.c                               |   84 ++
 net/ipv4/af_inet.c                         |    4
 20 files changed, 1183 insertions(+), 103 deletions(-)
Thanks.
--
tejun
[1] 58105ef1857112a186696c9b8957020090226a28