[PATCHSET x86/core/percpu] implement dynamic percpu allocator

From: Tejun Heo <tj@kernel.org>
To: rusty@rustcorp.com.au, tglx@linutronix.de, x86@kernel.org,
	linux-kernel@vger.kernel.org, hpa@zytor.com, jeremy@goop.org,
	cpw@sgi.com, mingo@elte.hu
Subject: [PATCHSET x86/core/percpu] implement dynamic percpu allocator
Date: Wed, 18 Feb 2009 21:04:26 +0900
Message-ID: <1234958676-27618-1-git-send-email-tj@kernel.org>

Hello, all.

This patchset implements a dynamic percpu allocator.  As I wrote
before, the percpu areas are organized in chunks, each of which is
composed of num_possible_cpus() units.  As the offsets of units from
the first unit stay the same regardless of where the chunk is, arch
code can directly access each percpu area by setting up percpu access
such that each cpu's translation of the same percpu address is one
unit size apart from the next cpu's.
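
To illustrate the address translation, here is a minimal userspace
sketch.  It is not code from the patchset: struct unit_layout,
unit_offset() and cpu_addr() are hypothetical names, and it assumes a
percpu address is simply the address of the object inside unit 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one chunk; illustrative only. */
struct unit_layout {
	uintptr_t base;		/* address of unit 0 of this chunk */
	size_t unit_size;	/* bytes per unit, identical for all chunks */
	unsigned int nr_units;	/* one unit per possible cpu */
};

/* Displacement of a cpu's unit from unit 0; independent of the chunk base. */
static size_t unit_offset(const struct unit_layout *c, unsigned int cpu)
{
	assert(cpu < c->nr_units);
	return (size_t)cpu * c->unit_size;
}

/* Translate a percpu address (inside unit 0) to the given cpu's copy. */
static uintptr_t cpu_addr(const struct unit_layout *c, uintptr_t percpu_addr,
			  unsigned int cpu)
{
	assert(percpu_addr >= c->base &&
	       percpu_addr - c->base < c->unit_size);
	return percpu_addr + unit_offset(c, cpu);
}
```

Because unit_offset() depends only on the cpu number and the unit
size, the same displacement works for every chunk, which is why arch
code only needs one per-cpu offset.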

The statically declared percpu area for the kernel, which is set up
early during boot, is also served by the same allocator, but it needs
a special init path as it has to be up and running well before regular
memory management is initialized.

Percpu areas are allocated from the vmalloc space and managed directly
by the percpu code.  Chunks start empty and are populated with pages
as they're allocated.  As there are many small allocations and
allocations often need much smaller alignment (no need for cacheline
alignment), the allocator tries to maximize chunk utilization and put
allocations in fuller chunks.
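
The "prefer fuller chunks" policy can be sketched roughly as follows.
This is a deliberately simplified model and not the allocator in
mm/percpu.c: it assumes each chunk is tracked by a single bump offset
(used) rather than a map of free areas, and struct bump_chunk,
align_up() and pick_chunk() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified chunk: bytes [0, used) of each unit are taken. */
struct bump_chunk {
	size_t unit_size;
	size_t used;
};

/* Round v up to a power-of-two alignment. */
static size_t align_up(size_t v, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (v + align - 1) & ~(align - 1);
}

/*
 * Return the index of the fullest chunk which can still hold @size
 * bytes at @align, or -1 if none fits and a new chunk is needed.
 */
static int pick_chunk(const struct bump_chunk *chunks, int nr,
		      size_t size, size_t align)
{
	int best = -1;

	for (int i = 0; i < nr; i++) {
		size_t start = align_up(chunks[i].used, align);

		if (start > chunks[i].unit_size ||
		    size > chunks[i].unit_size - start)
			continue;	/* doesn't fit in this chunk */
		if (best < 0 || chunks[i].used > chunks[best].used)
			best = i;	/* fuller candidate wins */
	}
	return best;
}
```

Small allocations with small alignment end up filling the gaps in
nearly full chunks, so emptier chunks stay available for larger
requests.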

There have been several concerns regarding this approach.

* On 64bit, no need for chunks.  We can just allocate contiguous
  areas.

  For 32bit, with the overcrowded address space, consolidating percpu
  allocations into the vmalloc (or other) area is a big win as no
  space needs to be further set aside for percpu variables and, with a
  relatively small number of possible cpus, the chunks can be kept at
  a manageable size (e.g. 128k chunks for 4way smp wouldn't be too
  bad) and it can achieve reasonable scalability.

  So, I think the question becomes whether it makes sense to use
  different allocation schemes for 32 and 64bit.  The added overhead
  of chunk handling itself isn't anything which can warrant separate
  implementations.  If there's a way to solve some other issues nicely
  with larger address space, maybe, but I really think it would be
  best to stick with one implementation.

* It adds to TLB pressure.

  Yeah, unfortunately, it does.  Currently it adds a number of kernel
  4k pages into circulation (cold/high pages, so unlikely to affect
  other large mappings).  There are several different varieties of
  this issue.

  The unit size and thus the chunk size is pretty flexible (it
  currently requires power of 2 but that restriction can be lifted
  easily).  With vm area allocation with larger alignment, using large
  page for chunk (non-NUMA) or unit (large, large NUMA) shouldn't be
  too difficult for highends but for mid range stuff, it looks like
  there isn't much else to do than sticking with 4k mappings.

  The TLB pressure problem would be there regardless of address layout
  as long as we want to grow the percpu area dynamically.
  Page-granular growth will add 4k pressures.  Large-page granularity
  is likely to waste lots of space.

  One trick we can do is to reserve the initial chunk in non-vmalloc
  area so that at least the static percpu ones and whatever gets
  allocated in the first chunk are served by regular large page
  mappings.  Given that those are the most frequently visited ones,
  this could be a nice compromise - no noticeable penalty for usual
  cases yet allowing scalability for unusual cases.  If this is
  something which can be agreed on, I'll pursue this.
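
To put rough numbers on the sizes discussed above, here is a small
arithmetic sketch.  The helper names are hypothetical, and the 2M
large page size and the 20k per-cpu requirement used in the note below
are assumptions for illustration only; the 128k / 4way figure is the
example from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Unit size when a chunk is split evenly among the possible cpus. */
static size_t unit_size_for(size_t chunk_size, unsigned int nr_cpus)
{
	assert(nr_cpus != 0 && chunk_size % nr_cpus == 0);
	return chunk_size / nr_cpus;
}

/* Number of mappings of size @granule needed to back @bytes. */
static size_t granules_needed(size_t bytes, size_t granule)
{
	assert(granule != 0);
	return (bytes + granule - 1) / granule;
}

/* Bytes populated but not needed when growing in @granule steps. */
static size_t wasted_bytes(size_t bytes, size_t granule)
{
	return granules_needed(bytes, granule) * granule - bytes;
}
```

With these, a 128k chunk on 4way smp gives 32k units.  A cpu which
needs 20k of percpu memory takes five 4k mappings with nothing wasted,
while a single 2M mapping would leave 2028k of it unused per cpu.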

The percpu allocator is an optional feature which can be selected by
each arch by setting the HAVE_DYNAMIC_PER_CPU_AREA configuration
variable.  Currently only x86_32 and x86_64 use it.

Ah.. I also left out cpu hotplugging stuff for now.  This largely
isn't an issue on most machines where num_possible_cpus() doesn't
deviate much from num_online_cpus().  Are there cases where this is
critical?  Currently, no user of percpu allocation, static or dynamic,
cares about this and it has been like this for a long time, so I'm a
little bit skeptical about it.

This patchset contains the following ten patches.

  0001-vmalloc-call-flush_cache_vunmap-from-unmap_kernel.patch
  0002-module-fix-out-of-range-memory-access.patch
  0003-module-reorder-module-pcpu-related-functions.patch
  0004-alloc_percpu-change-percpu_ptr-to-per_cpu_ptr.patch
  0005-alloc_percpu-add-align-argument-to-__alloc_percpu.patch
  0006-percpu-kill-percpu_alloc-and-friends.patch
  0007-vmalloc-implement-vm_area_register_early.patch
  0008-vmalloc-add-un-map_kernel_range_noflush.patch
  0009-percpu-implement-new-dynamic-percpu-allocator.patch
  0010-x86-convert-to-the-new-dynamic-percpu-allocator.patch

0001-0003 contain fixes and trivial prep.  0004-0006 clean up percpu.
0007-0008 add stuff to vmalloc which will be used by the new
allocator.  0009-0010 implement and use the new allocator.

This patchset is on top of the current x86/core/percpu[1] and can be
fetched from the following git tree.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git tj-percpu

diffstat follows.

 arch/alpha/mm/init.c                       |   20 
 arch/x86/Kconfig                           |    3 
 arch/x86/include/asm/percpu.h              |    8 
 arch/x86/include/asm/pgtable.h             |    1 
 arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c |    2 
 arch/x86/kernel/setup_percpu.c             |   62 +-
 arch/x86/mm/init_32.c                      |   10 
 arch/x86/mm/init_64.c                      |   19 
 block/blktrace.c                           |    2 
 drivers/acpi/processor_perflib.c           |    4 
 include/linux/percpu.h                     |   65 +-
 include/linux/vmalloc.h                    |    4 
 kernel/module.c                            |   78 +-
 kernel/sched.c                             |    6 
 kernel/stop_machine.c                      |    2 
 mm/Makefile                                |    4 
 mm/allocpercpu.c                           |   32 -
 mm/percpu.c                                |  876 +++++++++++++++++++++++++++++
 mm/vmalloc.c                               |   84 ++
 net/ipv4/af_inet.c                         |    4 
 20 files changed, 1183 insertions(+), 103 deletions(-)

Thanks.

--
tejun

[1] 58105ef1857112a186696c9b8957020090226a28
