Re: Fw: [PATCH] NUMA Slab Allocator

From: Manfred Spraul <manfred@colorfullife.com>
To: Christoph Lameter <christoph@lameter.com>
Cc: Andrew Morton <akpm@osdl.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: Fw: [PATCH] NUMA Slab Allocator
Date: Wed, 16 Mar 2005 19:34:22 +0100
Message-ID: <42387C2E.4040106@colorfullife.com>
In-Reply-To: <20050315204110.6664771d.akpm@osdl.org>

Hi Christoph,

Do you have profile data from your modification? What percentage of the
allocations is node-local, and what percentage comes from foreign nodes?
Preferably per-cache. It shouldn't be difficult to add statistics
counters to your patch.
And: Can you estimate what percentage is really accessed node-local
and what percentage are long-lived structures that are accessed from
all cpus in the system?
I had discussions with guys from IBM and SGI regarding a NUMA allocator,
and we decided that we need profile data before we can decide if we need
one:
- A node-local allocator reduces the inter-node traffic, because the
callers get node-local memory.
- A node-local allocator increases the inter-node traffic, because
objects that are kfree'd on the wrong node must be returned to their
home node.
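
A rough user-space sketch of the kind of per-cache counters meant here.
All names (struct numa_stats, stats_alloc, stats_free,
stats_local_alloc_pct) are hypothetical and not taken from the patch; a
real version would hang off kmem_cache_t and be compiled in only with
the slab statistics option:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-cache counters; none of these names exist in the patch. */
struct numa_stats {
	unsigned long alloc_local;	/* object came from the caller's node */
	unsigned long alloc_foreign;	/* object came from another node */
	unsigned long free_local;	/* freed on the node that owns the slab */
	unsigned long free_foreign;	/* freed on a different node */
};

/* Record one allocation: obj_node owns the slab, cpu_node is the caller's node. */
static inline void stats_alloc(struct numa_stats *s, int obj_node, int cpu_node)
{
	assert(s != NULL);
	if (obj_node == cpu_node)
		s->alloc_local++;
	else
		s->alloc_foreign++;
}

/* Record one free, with the same meaning of the node arguments. */
static inline void stats_free(struct numa_stats *s, int obj_node, int cpu_node)
{
	assert(s != NULL);
	if (obj_node == cpu_node)
		s->free_local++;
	else
		s->free_foreign++;
}

/* Percentage of allocations that were node-local; 0 if none were recorded. */
static inline unsigned int stats_local_alloc_pct(const struct numa_stats *s)
{
	unsigned long total = s->alloc_local + s->alloc_foreign;

	return total ? (unsigned int)(s->alloc_local * 100 / total) : 0;
}
```

Counters like these only answer the first question (where the memory
came from). Whether the objects are later accessed node-local would need
separate profiling.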

> static inline void __cache_free (kmem_cache_t *cachep, void* objp)
> {
>  struct array_cache *ac = ac_data(cachep);
>+ struct slab *slabp;
>
>  check_irq_off();
>  objp = cache_free_debugcheck(cachep, objp, __builtin_return_address(0));
>
>- if (likely(ac->avail < ac->limit)) {
>+ /* Make sure we are not freeing a object from another
>+  * node to the array cache on this cpu.
>+  */
>+ slabp = GET_PAGE_SLAB(virt_to_page(objp));
>  
>
This line is quite slow and should be executed only in NUMA builds,
not in non-NUMA builds. Some kind of wrapper is required.
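
One possible shape for such a wrapper, as a user-space model. CONFIG_NUMA
is the real kernel config symbol; everything else (struct obj,
lookup_slab, obj_is_off_node, the slab_lookups counter) is a
hypothetical stand-in, and lookup_slab only models the cost of
GET_PAGE_SLAB(virt_to_page(objp)):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types, just enough to compile. */
struct slab { int nodeid; };
struct obj { struct slab *slabp; };	/* model: each object points at its slab */

static unsigned long slab_lookups;	/* counts how often the slow lookup ran */

/* Models the slow object-to-slab lookup. */
static inline struct slab *lookup_slab(const void *objp)
{
	assert(objp != NULL);
	slab_lookups++;
	return ((const struct obj *)objp)->slabp;
}

#ifdef CONFIG_NUMA
/* NUMA build: do the lookup and compare the slab's node with the current one. */
static inline int obj_is_off_node(const void *objp, int this_node)
{
	return lookup_slab(objp)->nodeid != this_node;
}
#else
/* Non-NUMA build: constant 0, so the lookup and the branch can be dropped. */
static inline int obj_is_off_node(const void *objp, int this_node)
{
	(void)objp;
	(void)this_node;
	return 0;
}
#endif
```

With the wrapper returning a compile-time constant, the fast path of a
non-NUMA build should look the same as before the patch.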

>+ if(unlikely(slabp->nodeid != numa_node_id())) {
>+  STATS_INC_FREEMISS(cachep);
>+  int nodeid = slabp->nodeid;
>+  spin_lock(&(cachep->nodelists[nodeid])->list_lock);
>  
>
This line is very dangerous: Every free of a wrong-node object causes a
spin_lock operation. I fear that the cache line traffic for the spinlock
might kill the performance for some workloads. I personally think that
batching is required, i.e. each cpu stores wrong-node objects in a
separate per-cpu array, and then the objects are returned as a block to
their home node.
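
The batching idea can be modelled in user space roughly as follows. This
is only a sketch under stated assumptions: offnode_cache, offnode_free,
offnode_flush and OFFNODE_LIMIT are hypothetical names, and the node
list lock is replaced by a counter so that the number of lock
acquisitions becomes visible:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES	4
#define OFFNODE_LIMIT	8	/* hypothetical batch size */

struct obj { int nodeid; };	/* model: object knows its home node */

/* Model of one node's lists; the lock is replaced by a counter. */
struct node_list {
	unsigned long lock_count;	/* how often list_lock would be taken */
	unsigned long returned;		/* objects handed back to this node */
};

/* Per-cpu array that collects objects freed on the wrong node. */
struct offnode_cache {
	size_t avail;
	struct obj *entry[OFFNODE_LIMIT];
};

/* Return the whole batch: at most one lock acquisition per home node. */
static inline void offnode_flush(struct offnode_cache *oc,
				 struct node_list nodes[MAX_NODES])
{
	for (int node = 0; node < MAX_NODES; node++) {
		int locked = 0;

		for (size_t i = 0; i < oc->avail; i++) {
			if (oc->entry[i]->nodeid != node)
				continue;
			if (!locked) {
				nodes[node].lock_count++;
				locked = 1;
			}
			nodes[node].returned++;
		}
	}
	oc->avail = 0;
}

/* Free path for a wrong-node object: no lock unless the array is full. */
static inline void offnode_free(struct offnode_cache *oc,
				struct node_list nodes[MAX_NODES],
				struct obj *objp)
{
	assert(objp->nodeid >= 0 && objp->nodeid < MAX_NODES);
	if (oc->avail == OFFNODE_LIMIT)
		offnode_flush(oc, nodes);
	oc->entry[oc->avail++] = objp;
}
```

In this model, freeing nine objects that all belong to one foreign node
takes that node's lock once instead of nine times.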

>-/*
>- * NUMA: different approach needed if the spinlock is moved into
>- * the l3 structure
>  
>
You have moved the cache spinlock into the l3 structure. Have you 
compared both approaches?
A global spinlock has the advantage that batching is possible in 
free_block: Acquire global spinlock, return objects to all nodes in the 
system, release spinlock. A node-local spinlock would mean less 
contention [multiple spinlocks instead of one global lock], but far more 
spin_lock/unlock calls.
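
To make the trade-off concrete, here is a small model that counts lock
acquisitions for one batch handed to free_block. Both functions are
hypothetical helpers, not kernel code, and the per-node variant assumes
the batch is processed in arrival order without sorting by node:

```c
#include <assert.h>
#include <stddef.h>

/* Global lock: one acquire/release covers the whole batch. */
static inline unsigned long locks_global(const int *obj_node, size_t n)
{
	(void)obj_node;
	return n ? 1 : 0;
}

/*
 * Per-node locks, objects handled in arrival order: the lock must be
 * switched every time the home node differs from the one currently held.
 */
static inline unsigned long locks_per_node(const int *obj_node, size_t n)
{
	unsigned long locks = 0;
	int held = -1;	/* no node lock held yet */

	assert(obj_node != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (obj_node[i] != held) {
			locks++;
			held = obj_node[i];
		}
	}
	return locks;
}
```

For a batch whose home nodes are 0, 1, 0, 1, 2, 2 the global variant
takes one lock and the per-node variant takes five. The per-node locks
are each less contended, which this count does not capture.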

IIRC the conclusion from our discussion was that there are at least
four possible implementations:
- your version
- Add a second per-cpu array for off-node allocations. __cache_free
batches the objects, free_block then returns them. This works with
either a global spinlock or per-node spinlocks.
A patch with a global spinlock is in
http://www.colorfullife.com/~manfred/Linux-kernel/slab/patch-slab-numa-2.5.66
Per-node spinlocks would require a restructuring of free_block.
- Add a per-node array for each cpu for wrong-node allocations. Allows
very fast batch return: each array contains memory from just one node,
useful if per-node spinlocks are used.
- do nothing. Least overhead within slab.
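
The third option (one array per node for each cpu) could be laid out
roughly like this. Again a user-space sketch with hypothetical names
(pernode_cache, pernode_free, BATCH), with the per-node lock modelled as
a counter:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES	4
#define BATCH		4	/* hypothetical per-node array size */

/* Model of one node's lists; the lock is replaced by a counter. */
struct node_list {
	unsigned long lock_count;
	unsigned long returned;
};

/* Per-cpu structure: one small array for each possible home node. */
struct pernode_cache {
	size_t avail[MAX_NODES];
	void *entry[MAX_NODES][BATCH];
};

/*
 * Queue a wrong-node object. When the array for its home node is full,
 * the whole array goes back under a single acquisition of that node's lock.
 */
static inline void pernode_free(struct pernode_cache *pc,
				struct node_list nodes[MAX_NODES],
				int nodeid, void *objp)
{
	assert(nodeid >= 0 && nodeid < MAX_NODES);
	if (pc->avail[nodeid] == BATCH) {
		nodes[nodeid].lock_count++;	/* one lock for the whole array */
		nodes[nodeid].returned += BATCH;
		pc->avail[nodeid] = 0;
	}
	pc->entry[nodeid][pc->avail[nodeid]++] = objp;
}
```

The cost is memory: every cpu carries MAX_NODES arrays per cache, most
of which may stay empty.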

I'm fairly certain that "do nothing" is the right answer for some
caches. For example the dentry-cache: The object lifetime is seconds to
minutes, and the objects are stored in a global hashtable. They will be
touched from all cpus in the system, so guaranteeing that
kmem_cache_alloc returns node-local memory won't help. But the added
overhead within slab.c will hurt.

--
    Manfred
