From: Manfred Spraul <manfred@colorfullife.com>
To: Christoph Lameter <christoph@lameter.com>
Cc: Andrew Morton <akpm@osdl.org>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: Fw: [PATCH] NUMA Slab Allocator
Date: Wed, 16 Mar 2005 19:34:22 +0100
Message-ID: <42387C2E.4040106@colorfullife.com>
In-Reply-To: <20050315204110.6664771d.akpm@osdl.org>
Hi Christoph,
Do you have profile data from your modification? Which percentage of the
allocations is node-local, which percentage is from foreign nodes?
Preferably per-cache. It shouldn't be difficult to add statistics
counters to your patch.
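
A minimal user-space sketch of what such per-cache counters could look like. The struct and function names here (numa_cache_stats, stats_count_alloc, stats_local_percent) are hypothetical and are not part of the patch under review; in slab.c the counters would presumably hang off the existing STATS_* machinery instead.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical per-cache counters; names are illustrative only. */
struct numa_cache_stats {
	const char *name;            /* cache name, e.g. "dentry_cache" */
	unsigned long alloc_local;   /* object lives on the allocating cpu's node */
	unsigned long alloc_foreign; /* object lives on some other node */
};

/* Record one allocation, given the object's node and the caller's node. */
static inline void stats_count_alloc(struct numa_cache_stats *st,
				     int obj_node, int cpu_node)
{
	assert(st != NULL);
	if (obj_node == cpu_node)
		st->alloc_local++;
	else
		st->alloc_foreign++;
}

/* Percentage of node-local allocations; 0 if nothing was recorded. */
static inline unsigned int stats_local_percent(const struct numa_cache_stats *st)
{
	unsigned long total = st->alloc_local + st->alloc_foreign;

	if (total == 0)
		return 0;
	return (unsigned int)(st->alloc_local * 100 / total);
}

/* Print one line per cache, similar in spirit to /proc/slabinfo. */
static inline void stats_print(const struct numa_cache_stats *st)
{
	printf("%s: local=%lu foreign=%lu (%u%% local)\n",
	       st->name, st->alloc_local, st->alloc_foreign,
	       stats_local_percent(st));
}
```

Calling stats_count_alloc() once per allocation and stats_print() once per cache would give the per-cache local/foreign split asked for above.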
And: can you estimate which percentage is really accessed node-local
and which percentage are long-living structures that are accessed from
all cpus in the system?
I had discussions with guys from IBM and SGI regarding a numa allocator,
and we decided that we need profile data before we can decide if we need
one:
- A node-local allocator reduces the inter-node traffic, because the
callers get node-local memory
- A node-local allocator increases the inter-node traffic, because
objects that are kfree'd on the wrong node must be returned to their
home node.
> static inline void __cache_free (kmem_cache_t *cachep, void* objp)
> {
> struct array_cache *ac = ac_data(cachep);
>+ struct slab *slabp;
>
> check_irq_off();
> objp = cache_free_debugcheck(cachep, objp, __builtin_return_address(0));
>
>- if (likely(ac->avail < ac->limit)) {
>+ /* Make sure we are not freeing a object from another
>+ * node to the array cache on this cpu.
>+ */
>+ slabp = GET_PAGE_SLAB(virt_to_page(objp));
>
>
This line is quite slow and should be executed only in NUMA builds,
not in non-NUMA builds. Some kind of wrapper is required.
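
One possible shape for such a wrapper, as a user-space sketch. CONFIG_NUMA_SKETCH, slab_sketch, obj_sketch and obj_is_off_node are hypothetical stand-ins; the real code would key off CONFIG_NUMA and use GET_PAGE_SLAB(virt_to_page(objp)) for the lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical switch; build with -DCONFIG_NUMA_SKETCH=1 to model NUMA. */
#ifndef CONFIG_NUMA_SKETCH
#define CONFIG_NUMA_SKETCH 0
#endif

struct slab_sketch {
	int nodeid;                 /* home node of the slab */
};

/* Stand-in for an object whose slab must be looked up via its page. */
struct obj_sketch {
	struct slab_sketch *slab;
};

/* Counts how often the (slow) slab lookup was actually performed. */
unsigned long slab_lookups;

/*
 * Returns nonzero if the object belongs to another node.
 * In a non-NUMA build this compiles down to a constant 0, so the
 * free fast path never touches the slab descriptor.
 */
static inline int obj_is_off_node(const struct obj_sketch *obj, int this_node)
{
#if CONFIG_NUMA_SKETCH
	assert(obj != NULL && obj->slab != NULL);
	slab_lookups++;
	return obj->slab->nodeid != this_node;
#else
	(void)obj;
	(void)this_node;
	return 0;
#endif
}
```

With the default setting (non-NUMA), the check costs nothing and the lookup counter stays at zero.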
>+ if(unlikely(slabp->nodeid != numa_node_id())) {
>+ STATS_INC_FREEMISS(cachep);
>+ int nodeid = slabp->nodeid;
>+ spin_lock(&(cachep->nodelists[nodeid])->list_lock);
>
>
This line is very dangerous: every free of a wrong-node object causes a
spin_lock operation. I fear that the cache line traffic for the spinlock
might kill the performance for some workloads. I personally think that
batching is required, i.e. each cpu stores wrong-node objects in a
separate per-cpu array, and then the objects are returned as a block to
their home node.
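
The batching idea can be sketched in user space as follows. All names (bobj, bnode, foreign_array, free_foreign, flush_foreign) and the limits are hypothetical; lock acquisitions are only counted, not performed, so the effect of batching is visible.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES    4   /* assumed node count for the sketch */
#define BATCH_LIMIT 8   /* assumed size of the per-cpu foreign array */

struct bobj {
	int home_node;                    /* node the object's slab lives on */
};

struct bnode {
	unsigned long lock_acquisitions;  /* stands in for spin_lock() calls */
	unsigned long nr_free;            /* objects returned to this node */
};

/* Per-cpu array that collects objects freed on the wrong node. */
struct foreign_array {
	unsigned int avail;
	struct bobj *entry[BATCH_LIMIT];
};

/* Return all batched objects, taking each home node's lock at most once. */
static inline void flush_foreign(struct foreign_array *fa,
				 struct bnode nodes[NR_NODES])
{
	for (int node = 0; node < NR_NODES; node++) {
		int locked = 0;

		for (unsigned int i = 0; i < fa->avail; i++) {
			if (fa->entry[i]->home_node != node)
				continue;
			if (!locked) {
				nodes[node].lock_acquisitions++;
				locked = 1;
			}
			nodes[node].nr_free++;
		}
		/* the matching spin_unlock() would go here if locked */
	}
	fa->avail = 0;
}

/* Free path for a wrong-node object: no lock unless the array is full. */
static inline void free_foreign(struct foreign_array *fa,
				struct bnode nodes[NR_NODES], struct bobj *o)
{
	assert(o->home_node >= 0 && o->home_node < NR_NODES);
	if (fa->avail == BATCH_LIMIT)
		flush_foreign(fa, nodes);
	fa->entry[fa->avail++] = o;
}
```

Freeing BATCH_LIMIT objects that all belong to one remote node then costs a single lock acquisition at flush time, instead of one per object.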
>-/*
>- * NUMA: different approach needed if the spinlock is moved into
>- * the l3 structure
>
>
You have moved the cache spinlock into the l3 structure. Have you
compared both approaches?
A global spinlock has the advantage that batching is possible in
free_block: Acquire global spinlock, return objects to all nodes in the
system, release spinlock. A node-local spinlock would mean less
contention [multiple spinlocks instead of one global lock], but far more
spin_lock/unlock calls.
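
To make the trade-off concrete, here is a small sketch that counts lock operations for one batch under each scheme. The helper names (lock_ops_global, lock_ops_per_node) are hypothetical, and the per-node count assumes free_block groups objects by home node so that each node's lock is taken only once per batch, which is the best case for per-node locks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 64   /* assumed upper bound for the sketch */

/* Global lock: one acquisition covers the whole batch. */
static inline unsigned int lock_ops_global(const int *home_node, size_t n)
{
	(void)home_node;
	return n ? 1u : 0u;
}

/* Per-node locks: one acquisition per distinct home node in the batch. */
static inline unsigned int lock_ops_per_node(const int *home_node, size_t n)
{
	bool seen[MAX_NODES] = { false };
	unsigned int ops = 0;

	for (size_t i = 0; i < n; i++) {
		int node = home_node[i];

		assert(node >= 0 && node < MAX_NODES);
		if (!seen[node]) {
			seen[node] = true;
			ops++;
		}
	}
	return ops;
}
```

The global lock needs fewer operations per batch but is shared by every cpu; the per-node locks spread contention at the price of more lock/unlock pairs.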
IIRC the conclusion from our discussion was that there are at least
four possible implementations:
- your version
- Add a second per-cpu array for off-node allocations. __cache_free
batches, free_block then returns. Global spinlock or per-node spinlock.
A patch with a global spinlock is in
http://www.colorfullife.com/~manfred/Linux-kernel/slab/patch-slab-numa-2.5.66
Per-node spinlocks would require a restructuring of free_block.
- Add a per-node array for each cpu for wrong-node allocations. Allows
very fast batch return: each array contains memory from just one node,
useful if per-node spinlocks are used.
- do nothing. Least overhead within slab.
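
The third option (one array per home node, per cpu) can be sketched like this. The names (pobj, pnode, per_node_arrays, free_off_node, flush_one_node) and limits are hypothetical, and locks are only counted. Because each array holds objects of a single node, a flush needs exactly one lock and no sorting.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES       4   /* assumed node count for the sketch */
#define PER_NODE_LIMIT 4   /* assumed capacity of each per-node array */

struct pobj {
	int home_node;
};

struct pnode {
	unsigned long lock_acquisitions;  /* stands in for spin_lock() calls */
	unsigned long nr_free;            /* objects returned to this node */
};

/* One small array per home node, all owned by a single cpu. */
struct per_node_arrays {
	unsigned int avail[NR_NODES];
	struct pobj *entry[NR_NODES][PER_NODE_LIMIT];
};

/* Return one node's batch under that node's lock only. */
static inline void flush_one_node(struct per_node_arrays *pa,
				  struct pnode nodes[NR_NODES], int node)
{
	if (pa->avail[node] == 0)
		return;
	nodes[node].lock_acquisitions++;
	nodes[node].nr_free += pa->avail[node];
	pa->avail[node] = 0;
}

/* Queue a wrong-node object; flush only the array that is full. */
static inline void free_off_node(struct per_node_arrays *pa,
				 struct pnode nodes[NR_NODES], struct pobj *o)
{
	int node = o->home_node;

	assert(node >= 0 && node < NR_NODES);
	if (pa->avail[node] == PER_NODE_LIMIT)
		flush_one_node(pa, nodes, node);
	pa->entry[node][pa->avail[node]++] = o;
}
```

The cost is memory: every cpu carries NR_NODES arrays per cache, which grows quickly on large machines.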
I'm fairly certain that "do nothing" is the right answer for some
caches. For example the dentry cache: the object lifetime is seconds to
minutes, and the objects are stored in a global hashtable. They will be
touched from all cpus in the system, so guaranteeing that
kmem_cache_alloc returns node-local memory won't help. But the added
overhead within slab.c will hurt.
--
Manfred