From: Eric Dumazet <dada1@cosmosbay.com>
To: Mike Travis <travis@sgi.com>, Christoph Lameter <clameter@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org,
David Miller <davem@davemloft.net>,
Peter Zijlstra <peterz@infradead.org>,
Rusty Russell <rusty@rustcorp.com.au>
Subject: Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
Date: Fri, 06 Jun 2008 07:33:22 +0200
Message-ID: <4848CC22.6090109@cosmosbay.com>
In-Reply-To: <4846AFCF.30500@sgi.com>
Mike Travis wrote:
> Andrew Morton wrote:
>> On Thu, 29 May 2008 20:56:20 -0700 Christoph Lameter <clameter@sgi.com> wrote:
>>
>>> In various places the kernel maintains arrays of pointers indexed by
>>> processor numbers. These are used to locate objects that need to be used
>>> when executing on a specific processor. Both the slab allocator
>>> and the page allocator use these arrays and there the arrays are used in
>>> performance critical code. The allocpercpu functionality is a simple
>>> allocator to provide these arrays.
>> All seems reasonable to me. The obvious question is "how do we size
>> the arena". We either waste memory or, much worse, run out.
>>
>> And running out is a real possibility, I think. Most people will only
>> mount a handful of XFS filesystems. But some customer will come along
>> who wants to mount 5,000, and distributors will need to cater for that,
>> but how can they?
>>
>> I wonder if we can arrange for the default to be overridden via a
>> kernel boot option?
>>
>>
>> Another obvious question is "how much of a problem will we have with
>> internal fragmentation"? This might be a drop-dead showstopper.
Christoph & Mike,
Please forgive me if I beat a dead horse, but this percpu stuff
should find its way.
I wonder why you absolutely want to have only one chunk holding
all percpu variables, static(vmlinux) & static(modules)
& dynamically allocated.
It's *not* possible to put an arbitrary limit on this global zone.
You'll always find somebody to break this limit. This is the point
we must solve before coding anything.
Have you considered using a list of fixed-size chunks, each chunk
having its own bitmap?
We only want fixed offsets between CPU locations. For a given variable,
we MUST find the addresses for all CPUs by looking at the same offset
table. (Then we can optimize things on x86, using the %gs/%fs register
instead of a table lookup.)
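A minimal user-space sketch of one such chunk, assuming a 64 Kbyte chunk
and an 8 byte allocation granularity. The names `struct chunk`,
`chunk_alloc`, `CHUNK_SIZE` and `UNIT_SIZE` are hypothetical and only
illustrate the idea; this is not the code from the patch series. The
point is that the bitmap is shared by all CPUs, so an offset handed out
once is valid in every CPU's copy of the chunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_SIZE (64 * 1024)              /* bytes per CPU per chunk (assumed) */
#define UNIT_SIZE  8                        /* allocation granularity (assumed) */
#define NR_UNITS   (CHUNK_SIZE / UNIT_SIZE)

/* One chunk. The bitmap is shared by all CPUs: bit u covers bytes
 * [u * UNIT_SIZE, (u + 1) * UNIT_SIZE) in every CPU's copy. */
struct chunk {
	struct chunk *next;             /* list of fixed-size chunks */
	uint8_t used[NR_UNITS / 8];     /* 1 bit per unit */
};

static int unit_used(const struct chunk *c, size_t u)
{
	return (c->used[u / 8] >> (u % 8)) & 1;
}

static void unit_set(struct chunk *c, size_t u)
{
	c->used[u / 8] |= (uint8_t)(1u << (u % 8));
}

/* First-fit search for 'size' bytes. Returns the byte offset inside
 * the chunk, or -1 if no free run is large enough. */
static long chunk_alloc(struct chunk *c, size_t size)
{
	size_t need = (size + UNIT_SIZE - 1) / UNIT_SIZE;
	size_t run = 0;

	assert(c != NULL && size > 0);
	if (need > NR_UNITS)
		return -1;
	for (size_t u = 0; u < NR_UNITS; u++) {
		run = unit_used(c, u) ? 0 : run + 1;
		if (run == need) {
			size_t first = u + 1 - need;

			for (size_t i = first; i <= u; i++)
				unit_set(c, i);
			return (long)(first * UNIT_SIZE);
		}
	}
	return -1;
}
```

A real allocator would also need locking, a free path and alignment
handling, all of which are left out here.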
We could choose the chunk size at compile time, depending on various
parameters (32/64 bit arches, or hugepage sizes on NUMA),
and a minimum value (ABI guarantee).
On x86_64 && NUMA we could use 2 Mbyte chunks, while
on x86_32 or non NUMA we should probably use 64 Kbyte chunks.
At boot time, we set up the first chunk (chunk 0), copy
.data.percpu into this chunk for each possible cpu, and
build the bitmap for future dynamic/module percpu allocations.
So we still have the restriction that sizeofsection(.data.percpu)
must fit in chunk 0. Not a problem in practice.
Then if we need to expand the percpu zone for a heavy duty machine,
and chunk 0 is already filled, we can add as many 2 Mbyte / 64 Kbyte
chunks as we need.
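The growth step can be sketched in user space as follows. This is only
an illustration under stated assumptions: `struct chunk`, `struct
pcpu_ref` and `pcpu_alloc_sketch` are hypothetical names, and a simple
bump pointer stands in for the per-chunk bitmap to keep the example
short. An allocation is tried in each existing chunk in turn, and a new
chunk is appended only when none has room.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE (64 * 1024)  /* assumed small-configuration chunk size */

struct chunk {
	struct chunk *next;
	size_t used;            /* bump pointer standing in for the bitmap */
};

/* Where an allocation landed: chunk index plus offset in that chunk. */
struct pcpu_ref {
	int chunk;
	size_t offset;
};

static struct chunk *chunks;    /* list head is chunk 0 */

/* Returns 0 on success, -1 if the request cannot be satisfied. */
static int pcpu_alloc_sketch(size_t size, struct pcpu_ref *ref)
{
	struct chunk **pp = &chunks;
	int idx = 0;

	assert(ref != NULL);
	if (size == 0 || size > CHUNK_SIZE)
		return -1;      /* larger than a chunk: needs another allocator */
	size = (size + 7) & ~(size_t)7;         /* 8 byte granularity */

	/* Look for an existing chunk with enough room left. */
	for (; *pp; pp = &(*pp)->next, idx++) {
		if (CHUNK_SIZE - (*pp)->used >= size)
			break;
	}
	/* All chunks are full: append a new one to the list. */
	if (!*pp) {
		*pp = calloc(1, sizeof(**pp));
		if (!*pp)
			return -1;
	}
	ref->chunk = idx;
	ref->offset = (*pp)->used;
	(*pp)->used += size;
	return 0;
}
```

The total percpu area is then bounded only by available memory, while a
single allocation is still bounded by the chunk size.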
This would limit a dynamic percpu allocation for a given variable to
the chunk size (64 Kbytes in the small configuration), so huge users
should probably still use a different allocator (like the oprofile
alloc_cpu_buffers() function).
But at least we no longer limit the total size of the percpu area.
I understand you want to offset percpu data to 0, but that only
matters for static percpu data (the pda being included in it, to
share %gs).
For dynamically allocated percpu variables (including modules'
".data.percpu"), nothing forces you to have low offsets
relative to the %gs/%fs register. Access to these variables
will be register indirect based anyway (e.g. %gs:(%rax)).
1) NUMA case
For a 64 bit NUMA arch, use a chunk size of 2 Mbytes.
Allocate 2 Mbytes for each possible processor (on its preferred memory
node), and compute the values to set up the offset_of_cpu[NR_CPUS] array.
Chunk 0
CPU 0 : virtual address XXXXXX
CPU 1 : virtual address XXXXXX + offset_of_cpu[1]
...
CPU n : virtual address XXXXXX + offset_of_cpu[n]
+ a shared bitmap
For the next chunks, we could use the vmalloc() zone to find
nr_possible_cpus virtual address ranges where we can map
a 2 Mbyte page per possible cpu, as long as we respect the relative
delta between each cpu block that was computed when
chunk 0 was set up.
Chunk 1..n
CPU 0 : virtual address YYYYYYYYYYYYYY
CPU 1 : virtual address YYYYYYYYYYYYYY + offset_of_cpu[1]
...
CPU n : virtual address YYYYYYYYYYYYYY + offset_of_cpu[n]
+ a shared bitmap (32Kbytes if 8 bytes granularity in allocator)
For a variable located in chunk 0, its 'address' relative to the current
cpu's %gs will be some number in [0, 2^21 - 1] (a 2 Mbyte chunk).
For a variable located in chunk 1, its 'address' relative to the current
cpu's %gs will be some number in
[YYYYYYYYYYYYYY - XXXXXX, YYYYYYYYYYYYYY - XXXXXX + 2^21 - 1],
not necessarily [2^21, 2^22 - 1].
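The address arithmetic above can be sketched like this. Only
offset_of_cpu[] comes from the text; `chunk0_base`, `pcpu_handle` and
`pcpu_addr` are hypothetical names, and the code is a user-space
illustration rather than kernel code. Unsigned wrap-around arithmetic
is used on purpose, so a later chunk mapped below chunk 0 still works.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 4       /* assumed small value for the example */

/* Delta of each CPU's block relative to CPU 0's block, computed once
 * when chunk 0 is set up and reused unchanged for every later chunk. */
static uintptr_t offset_of_cpu[NR_CPUS];

/* Virtual address of CPU 0's block in chunk 0 (XXXXXX in the text). */
static uintptr_t chunk0_base;

/* The 'address' of a variable relative to the per-cpu base register:
 * distance from chunk 0's CPU 0 block to the variable's CPU 0 copy.
 * chunk_base_cpu0 is YYYYYYYYYYYYYY for a later chunk. */
static uintptr_t pcpu_handle(uintptr_t chunk_base_cpu0, uintptr_t off_in_chunk)
{
	return (chunk_base_cpu0 - chunk0_base) + off_in_chunk;
}

/* Table-lookup form of the access; on x86 the sum
 * chunk0_base + offset_of_cpu[cpu] would live in %gs/%fs. */
static uintptr_t pcpu_addr(uintptr_t handle, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return chunk0_base + offset_of_cpu[cpu] + handle;
}
```

Because the same offset_of_cpu[] deltas apply to every chunk, one handle
per variable is enough to reach the copy of any CPU.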
Chunk 0 would use normal memory (no vmap TLB cost); only the next ones
need vmalloc(). So the extra TLB cost would only be paid on very special
NUMA setups (only if using a lot of percpu allocations).
Also, using a 2Mb page granularity probably wastes about 2Mb per cpu, but
this is nothing for NUMA machines :)
2) SMP && !NUMA
On non NUMA machines, we don't need vmalloc games, since we can allocate
chunk space using contiguous memory (size = nr_possible_cpus * 64 Kbytes):
offset_of_cpu[N] = N * CHUNK_SIZE
(On a 4 CPU x86_32 machine, allocate a 256 Kbyte block, then divide it
into 64 Kbyte blocks.)
If this order-6 allocation fails, then fall back to vmalloc(), but most
percpu allocations happen at boot time, when memory is not yet fragmented...
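The arithmetic for the non NUMA layout, as a small sketch. It assumes
4 Kbyte pages and 64 Kbyte chunks; `PAGE_SIZE_ASSUMED`,
`offset_of_cpu_flat` and `order_for` are hypothetical names used only
for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_ASSUMED 4096u            /* assumed x86 page size */
#define CHUNK_SIZE        (64u * 1024u)    /* assumed non NUMA chunk size */

/* With one contiguous block, CPU N's copy starts N chunks in. */
static size_t offset_of_cpu_flat(unsigned int cpu)
{
	return (size_t)cpu * CHUNK_SIZE;
}

/* Smallest order such that (page size << order) covers 'bytes'. */
static unsigned int order_for(size_t bytes)
{
	unsigned int order = 0;

	assert(bytes > 0);
	while (((size_t)PAGE_SIZE_ASSUMED << order) < bytes)
		order++;
	return order;
}
```

For 4 possible CPUs the block is 4 * 64 Kbytes = 256 Kbytes, that is 64
pages of 4 Kbytes, which is where the order-6 figure comes from.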
3) UP case: fall back to standard allocators. No need for bitmaps.
NUMA special casing can be implemented later of course...
Thanks for reading