linux-mm.kvack.org archive mirror
* Re: VMA lookup with RCU
       [not found]   ` <1191436672.7103.38.camel@alexis>
@ 2007-10-03 19:40     ` Peter Zijlstra
  2007-10-03 19:54       ` Peter Zijlstra
  2007-10-04 15:42       ` Vaidyanathan Srinivasan
  0 siblings, 2 replies; 12+ messages in thread
From: Peter Zijlstra @ 2007-10-03 19:40 UTC (permalink / raw)
  To: Alexis Bruemmer
  Cc: Vaidyanathan Srinivasan, Balbir Singh, Badari Pulavarty,
	Max Asbock, linux-mm

On Wed, 2007-10-03 at 11:37 -0700, Alexis Bruemmer wrote:
> Hi Peter,
> 
> Some colleagues and I have been looking at how best to implement
> your latest suggestions for the VMA lookup with RCU patches, however we
> have some questions:
> 
> <snip>
> > My current plan for these patches is:
> > 
> > Fine grain lock the update side of the B+tree
> > do a per-node vma reference, so that vma lookup and lock is:

> How did you envision the implementation here?  More specifically, what
> is the best way to get local copies of B+trees on each node?

Don't copy the whole trees, just keep references to vmas locally. That saves
a lot of traffic:

struct vm_area_struct {

	...

	struct rw_semaphore lock;
	atomic_t refs;
};

struct vma_ref {
	struct vm_area_struct *vma;
	struct rw_semaphore lock;
	int dead;
};

struct vma_ref *
find_get_vma(struct mm_struct *mm, unsigned long addr)
{
	struct vm_area_struct *vma;
	struct vma_ref *ref = NULL;
	int ret;

again:
	rcu_read_lock();
	vma = btree_find(&mm->btree, addr);
	if (!vma)
		goto out_unlock;

	ref = btree_find(node_local_tree(), (unsigned long)vma);
	if (!ref) {
		BTREE_LOCK_CONTEXT(ctx, node_local_tree()); /* see fine grain locked RADIX tree */

		down_read(&vma->lock);	
		if (atomic_read(&vma->refs) < 0) {
			/* vma got destroyed */
			up_read(&vma->lock);
			goto out_unlock;
		}
		atomic_inc(&vma->refs);
		rcu_read_unlock(); /* we got vma->lock, can't escape */

		ref = kmalloc(sizeof(*ref), GFP_KERNEL);
		/* XXX: what to do when fails */

		ref->vma = vma;
	init_rwsem(&ref->lock);
		ref->dead = 0;

		ret = btree_insert(ctx, (unsigned long)vma, ref);

	if (ret == -EEXIST) {
			kfree(ref);
			goto again;
		}
	}
	
	down_read(&ref->lock);
	if (ref->dead) {
		up_read(&ref->lock);
		rcu_read_unlock();
		goto again;
	}

out_unlock:
	rcu_read_unlock();
out:
	return ref;
}

Something like that; it has big holes in it, but it should illustrate the
idea. We index the local reference by the vma address.

> >     lookup in node local tree
> >     if found, take read lock on local reference
> >     if not-found, do global lookup, lock vma, take reference, 
> >                   insert reference into local tree,
> >                   take read lock on it, drop vma lock
> >
> > write lock on the vma would:
> >     find the vma in the global tree, lock it
> >     enqueue work items in a waitqueue that,
> >       find the local ref, lock it (might sleep)
> >       release the reference, unlock and clear from local tree
> >       signal completion
> >     once all nodes have completed we have no outstanding refs
> >     and since we have the lock, we're exclusive.

void invalidate_vma_refs(void *addr)
{
	struct vma_ref *ref;

	BTREE_LOCK_CONTEXT(ctx, node_local_tree());

	rcu_read_lock();
	ref = btree_find(node_local_tree(), (unsigned long)addr);
	if (!ref)
		goto out_unlock;

	down_write(&ref->lock); /* no more local refs */
	ref->dead = 1;
	atomic_dec(&ref->vma->refs); /* release */
	btree_delete(ctx, (unsigned long)addr); /* unhook */
	rcu_call(free_vma_ref, ref); /* destroy */
	up_write(&ref->lock);

out_unlock:
	rcu_read_unlock();
}

struct vm_area_struct *
write_lock_vma(struct mm_struct *mm, unsigned long addr)
{
	struct vm_area_struct *vma;

	rcu_read_lock();
	vma = btree_find(&mm->btree, addr);
	if (!vma)
		goto out_unlock;

	down_write(&vma->lock); /* no new refs */
	rcu_read_unlock();

	schedule_on_each_cpu(invalidate_vma_refs, vma, 0, 1);

	return vma;

out_unlock:
	rcu_read_unlock();
	return NULL;
}


> Some general questions:
> How does a process and vma know what node it's on?

numa_node_id() ?

just return the ref object, and unlock against that.
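
Roughly, the unlock side against the ref would then just be (a sketch to
go with the code above; put_vma_ref() is a name made up here):

void put_vma_ref(struct vma_ref *ref)
{
	/*
	 * find_get_vma() returned the node-local ref with its
	 * rw_semaphore held for read, so releasing it is a purely
	 * node-local operation. The vma->refs count stays with the
	 * vma_ref object and is only dropped by invalidate_vma_refs().
	 */
	if (ref)
		up_read(&ref->lock);
}

So a fault path would do ref = find_get_vma(mm, addr), work on ref->vma,
and end with put_vma_ref(ref).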

> Also, as we discussed before the initial test results on the first set
> of these patches showed really no improvement in performance over the
> vanilla kernel.  What kind of performance gains to you predict with this
> new fine grain lock approach?

Really no idea; the write side is rather heavy, it would need to
invalidate the (unsigned long)vma entry on all cpus (perhaps track a
cpumask of cpus that did take a reference). The advantage is that it's
per vma, not per mm.
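
Tracking that cpumask might look roughly like this (sketch only; ref_cpus
and queue_invalidate_work_on() are names invented here, and clearing the
mask again on the write side is left out):

struct vm_area_struct {
	...
	struct rw_semaphore lock;
	atomic_t refs;
	cpumask_t ref_cpus;	/* cpus that ever took a node-local ref */
};

	/* in find_get_vma(), when a new node-local ref is created: */
	cpu_set(smp_processor_id(), vma->ref_cpus);

	/* in write_lock_vma(), instead of schedule_on_each_cpu(): */
	int cpu;

	for_each_cpu_mask(cpu, vma->ref_cpus)
		/* made-up helper: queue invalidate_vma_refs() as a
		 * work item on that cpu */
		queue_invalidate_work_on(cpu, vma);

That limits the invalidation broadcast to the cpus that actually took a
reference on the vma.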

> Finally many local kernel hackers that I have spoke with about this idea
> have expressed great interest in your patch set.  Do you have any
> objections to opening this discussion up to linux-mm? 

done :-)

> Thank you again for your time and help,

no problem at all.



* Re: VMA lookup with RCU
  2007-10-03 19:40     ` VMA lookup with RCU Peter Zijlstra
@ 2007-10-03 19:54       ` Peter Zijlstra
  2007-10-04 15:42       ` Vaidyanathan Srinivasan
  1 sibling, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2007-10-03 19:54 UTC (permalink / raw)
  To: Alexis Bruemmer
  Cc: Vaidyanathan Srinivasan, Balbir Singh, Badari Pulavarty,
	Max Asbock, linux-mm

On Wed, 2007-10-03 at 21:40 +0200, Peter Zijlstra wrote:
> On Wed, 2007-10-03 at 11:37 -0700, Alexis Bruemmer wrote:
> > Hi Peter,
> > 
> > Some colleagues and I have been looking at how best to implement
> > your latest suggestions for the VMA lookup with RCU patches, however we
> > have some questions:
> > 
> > <snip>
> > > My current plan for these patches is:
> > > 
> > > Fine grain lock the update side of the B+tree
> > > do a per-node vma reference, so that vma lookup and lock is:
> 
> > How did you envision the implementation here?  More specifically, what
> > is the best way to get local copies of B+trees on each node?
> 
> Don't copy the whole trees, just keep references to vma's locally. Saves
> a lot of traffic:
> 
> struct vm_area_struct {
> 
> 	...
> 
> 	struct rw_semaphore lock;
> 	atomic_t refs;
> };
> 
> struct vma_ref {
> 	struct vm_area_struct *vma;
> 	struct rw_semaphore lock;
> 	int dead;
> };
> 
> struct vma_ref *
> find_get_vma(struct mm_struct *mm, unsigned long addr)
> {
> 	struct vm_area_struct *vma;
> 	struct vma_ref *ref = NULL;
> 
> again:
> 	rcu_read_lock();
> 	vma = btree_find(&mm->btree, addr);
> 	if (!vma)
> 		goto out_unlock;
> 
> 	ref = btree_find(node_local_tree(), (unsigned long)vma);
> 	if (!ref) {
> 		BTREE_LOCK_CONTEXT(ctx, node_local_tree()); /* see fine grain locked RADIX tree */
> 
> 		down_read(&vma->lock);	
> 		if (atomic_read(&vma->refs) < 0) {
> 			/* vma got destroyed */
> 			up_read(&vma->lock);
> 			goto out_unlock;
> 		}
> 		atomic_inc(&vma->refs);
> 		rcu_read_unlock(); /* we got vma->lock, can't escape */
> 
> 		ref = kmalloc(sizeof(*ref), GFP_KERNEL);
> 		/* XXX: what to do when fails */
> 
> 		ref->vma = vma;
> 		init_rwsem(&ref->mutex);
> 		ref->dead = 0;
> 
> 		ret = btree_insert(ctx, (unsigned long)vma, ref);
> 
> 		if (ret = -EEXIST) {
> 			kfree(ref);
> 			goto again;
> 		}
> 	}
> 	
> 	down_read(&ref->lock);
> 	if (ref->dead) {
> 		up_read(&ref->lock);
> 		rcu_read_unlock();
> 		goto again;
> 	}
> 
> out_unlock:
> 	rcu_read_unlock();
> out:
> 	return ref;
> }


Hmm, the biggest problem I now realize is that we can freely schedule
around, and the up_read() on ref->lock can happen from anywhere :-/

That will pretty much destroy all the good we got from the lookup
locality.



* Re: VMA lookup with RCU
  2007-10-03 19:40     ` VMA lookup with RCU Peter Zijlstra
  2007-10-03 19:54       ` Peter Zijlstra
@ 2007-10-04 15:42       ` Vaidyanathan Srinivasan
  2007-10-04 17:21         ` Peter Zijlstra
  1 sibling, 1 reply; 12+ messages in thread
From: Vaidyanathan Srinivasan @ 2007-10-04 15:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexis Bruemmer, Balbir Singh, Badari Pulavarty, Max Asbock,
	linux-mm, Bharata B Rao

Peter Zijlstra wrote:
>>>     lookup in node local tree
>>>     if found, take read lock on local reference
>>>     if not-found, do global lookup, lock vma, take reference, 
>>>                   insert reference into local tree,
>>>                   take read lock on it, drop vma lock
>>>
>>> write lock on the vma would:
>>>     find the vma in the global tree, lock it
>>>     enqueue work items in a waitqueue that,
>>>       find the local ref, lock it (might sleep)
>>>       release the reference, unlock and clear from local tree
>>>       signal completion
>>>     once all nodes have completed we have no outstanding refs
>>>     and since we have the lock, we're exclusive.
> 
> void invalidate_vma_refs(void *addr)
> {
> 	BTREE_LOCK_CONTEXT(ctx, node_local_tree());
> 
> 	rcu_read_lock();
> 	ref = btree_find(node_local_tree, (unsigned long)addr);
> 	if (!ref)
> 		goto out_unlock;
> 
> 	down_write(&ref->lock); /* no more local refs */
> 	ref->dead = 1;
> 	atomic_dec(&ref->vma->refs); /* release */
> 	btree_delete(ctx, (unsigned long)addr); /* unhook */
> 	rcu_call(free_vma_ref, ref); /* destroy */
> 	up_write(&ref->lock);
> 
> out_unlock:
> 	rcu_read_unlock();
> }
> 
> struct vm_area_struct *
> write_lock_vma(struct mm *mm, unsigned long addr)
> {
> 	rcu_read_lock();
> 	vma = btree_find(&mm->btree, addr);
> 	if (!vma)
> 		goto out_unlock;
> 
> 	down_write(&vma->lock); /* no new refs */
> 	rcu_read_unlock();
> 
> 	schedule_on_each_cpu(invalidate_vma_refs, vma, 0, 1);
> 
> 	return vma;
> 
> out_unlock:
> 	rcu_read_unlock();
> 	return NULL;
> }
> 
> 

Hi Peter,

Making node-local copies of the VMA is a good idea to reduce inter-node
traffic, but the cost of search and delete is very high.  Also, as you have
pointed out, if the atomic operations happen on a remote node because the
scheduler migrated our thread, then all the cycles saved may be lost.

In find_get_vma(), is the cross-node traffic due to the btree traversal or
to the actual VMA object reference?  Can we look at duplicating the btree
structure per node, keeping just one copy of the VMA structures, and making
the btrees on each node point to the same vma object?  That would make
write and delete operations on the btree entries on all nodes a little
simpler.  All VMA lists would be unique and not duplicated.

Another related idea is to move the VMA object to node-local memory.  Can
we migrate the VMA object to the node where it is referenced the most?  We
would still maintain only _one_ copy of the VMA object.  No data
duplication, but we can move the memory around to make it node local.

Some more thoughts:

The pagefault handler does most of the find_get_vma() calls to validate
the user address and then create page table entries (allocate page
frames)... can we make the page fault handler run on the node where the
VMAs have been allocated?  The CPU that took the page fault need not
necessarily do all the find_vma() calls and update the page table.  The
process can sleep while another CPU _near_ the memory containing the VMAs
and the pagetable does the job with local memory references.

I don't know whether the page tables for the faulting process are
allocated in node-local memory.

Per-CPU last vma cache: currently we keep the last vma referenced in a
one-entry cache in mm_struct.  Can we have this cache per CPU or per node,
so that a multi-threaded application gets a node/cpu-local cache of the
last vma referenced?  This may reduce btree/rbtree traversal.  Let the
hardware cache maintain the corresponding VMA object and its coherency.
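
Very roughly, something like this (illustration only; the percpu field is
made up and the invalidation on unmap/exit is hand-waved):

struct mm_struct {
	...
	struct vm_area_struct **pcpu_vma_cache;	/* from alloc_percpu() */
};

struct vm_area_struct *
find_vma_percpu(struct mm_struct *mm, unsigned long addr)
{
	/* caller holds mmap_sem for read, as for find_vma() */
	struct vm_area_struct **slot = per_cpu_ptr(mm->pcpu_vma_cache,
						   smp_processor_id());
	struct vm_area_struct *vma = *slot;

	if (vma && vma->vm_start <= addr && addr < vma->vm_end)
		return vma;		/* cpu-local hit, no tree walk */

	vma = find_vma(mm, addr);	/* fall back to the tree walk */
	if (vma)
		*slot = vma;		/* remember it for this cpu */
	return vma;
}

Every cpu-local slot would have to be cleared when a vma goes away
(munmap/exit), and that invalidation is where the cost may come back.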

Please let me know your comment and thoughts.

Thanks,
Vaidy



* Re: VMA lookup with RCU
  2007-10-04 15:42       ` Vaidyanathan Srinivasan
@ 2007-10-04 17:21         ` Peter Zijlstra
  2007-10-07  7:47           ` Nick Piggin
  2007-10-08 17:11           ` Vaidyanathan Srinivasan
  0 siblings, 2 replies; 12+ messages in thread
From: Peter Zijlstra @ 2007-10-04 17:21 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: Alexis Bruemmer, Balbir Singh, Badari Pulavarty, Max Asbock,
	linux-mm, Bharata B Rao, Nick Piggin

On Thu, 2007-10-04 at 21:12 +0530, Vaidyanathan Srinivasan wrote:
> Peter Zijlstra wrote:
> >>>     lookup in node local tree
> >>>     if found, take read lock on local reference
> >>>     if not-found, do global lookup, lock vma, take reference, 
> >>>                   insert reference into local tree,
> >>>                   take read lock on it, drop vma lock
> >>>
> >>> write lock on the vma would:
> >>>     find the vma in the global tree, lock it
> >>>     enqueue work items in a waitqueue that,
> >>>       find the local ref, lock it (might sleep)
> >>>       release the reference, unlock and clear from local tree
> >>>       signal completion
> >>>     once all nodes have completed we have no outstanding refs
> >>>     and since we have the lock, we're exclusive.
> > 
> > void invalidate_vma_refs(void *addr)
> > {
> > 	BTREE_LOCK_CONTEXT(ctx, node_local_tree());
> > 
> > 	rcu_read_lock();
> > 	ref = btree_find(node_local_tree, (unsigned long)addr);
> > 	if (!ref)
> > 		goto out_unlock;
> > 
> > 	down_write(&ref->lock); /* no more local refs */
> > 	ref->dead = 1;
> > 	atomic_dec(&ref->vma->refs); /* release */
> > 	btree_delete(ctx, (unsigned long)addr); /* unhook */
> > 	rcu_call(free_vma_ref, ref); /* destroy */
> > 	up_write(&ref->lock);
> > 
> > out_unlock:
> > 	rcu_read_unlock();
> > }
> > 
> > struct vm_area_struct *
> > write_lock_vma(struct mm *mm, unsigned long addr)
> > {
> > 	rcu_read_lock();
> > 	vma = btree_find(&mm->btree, addr);
> > 	if (!vma)
> > 		goto out_unlock;
> > 
> > 	down_write(&vma->lock); /* no new refs */
> > 	rcu_read_unlock();
> > 
> > 	schedule_on_each_cpu(invalidate_vma_refs, vma, 0, 1);
> > 
> > 	return vma;
> > 
> > out_unlock:
> > 	rcu_read_unlock();
> > 	return NULL;
> > }
> > 
> > 
> 
> Hi Peter,
> 
> Making node local copies of VMA is a good idea to reduce inter-node
> traffic, but the cost of search and delete is very high.  Also, as you have
> pointed out, if the atomic operations happen on remote node due to
> scheduler migrating our thread, then all the cycles saved may be lost.
> 
> In find_get_vma() cross node traffic is due to btree traversal or the
> actual VMA object reference? 

Not sure; I don't know how to profile cacheline transfers.

The outlined approach would try to keep all accesses read-only, so that
the cacheline can be shared. But yeah, once it gets evicted it needs to
be re-transferred.

>  Can we look at duplicating the btree
> structure per node and have VMA structures just one copy and make all
> btrees in each node point to the same vma object.  This will make write
> operation and deletion of btree entries on all nodes little simple.  All
> VMA lists will be unique and not duplicated.

But that would end up with a 2-d tree, (mm, vma), in which you try to
find an exact match for a given (mm, address) key.

The trouble with multi-dimensional trees is the balancing; afaik it's an
NP-hard problem.

> Another related idea is to move the VMA object to node local memory.  Can
> we migrate the VMA object to the node where it is referenced the most?  We
> still maintain only _one_ copy of VMA object.  No data duplication, but we
> can move the memory around to make it node local.

I guess we can do that: if you take the vma lock in exclusive mode, you
can make a copy of the object, replace the tree pointer, mark the old
one dead (so that rcu lookups will re-try) and rcu_free the old one.
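
Something like this, perhaps (sketch only; btree_replace(), free_old_vma(),
the dead flag and the rcu_head on the vma are assumptions here, and the vma
must already be write-locked via write_lock_vma() so that there are no
outstanding node-local refs):

struct vm_area_struct *
migrate_vma_to_node(struct mm_struct *mm, struct vm_area_struct *vma, int nid)
{
	struct vm_area_struct *new;

	new = kmalloc_node(sizeof(*new), GFP_KERNEL, nid);
	if (!new)
		return vma;			/* keep the old copy */

	*new = *vma;				/* copy to node-local memory */
	init_rwsem(&new->lock);
	atomic_set(&new->refs, 0);

	btree_replace(&mm->btree, new->vm_start, new);	/* swap tree pointer */
	vma->dead = 1;			/* concurrent rcu lookups will re-try */

	call_rcu(&vma->rcu_head, free_old_vma);	/* free after a grace period */
	return new;
}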

> Some more thoughts:
> 
> Pagefault handler does most of the find_get_vma() to validate user address
> and then create page table entries (allocate page frames)... can we make
> the page fault handler run on the node where the VMAs have been allocated?

Explicit migration - like migrate_disable() - makes load balancing a very
hard problem.

>  The CPU that has page-faulted need not necessarily do all the find_vma()
> calls and update the page table.  The process can sleep while another CPU
> _near_ to the memory containing VMAs and pagetable can do the job with
> local memory references.

would we not end up with remote page tables?

> I don't know if the page tables for the faulting process is allocated in
> node local memory.
> 
> Per CPU last vma cache:  Currently we have the last vma referenced in a one
> entry cache in mm_struct.  Can we have this cache per CPU or per node so
> that a multi threaded application can have node/cpu local cache of last vma
> referenced.  This may reduce btree/rbtree traversal.  Let the hardware
> cache maintain the corresponding VMA object and its coherency.
> 
> Please let me know your comment and thoughts.

Nick Piggin (and I think Eric Dumazet) had nice patches for this. I
think they were posted in the private futex thread.



* Re: VMA lookup with RCU
  2007-10-04 17:21         ` Peter Zijlstra
@ 2007-10-07  7:47           ` Nick Piggin
  2007-10-08  7:51             ` Peter Zijlstra
  2007-10-08 17:02             ` Vaidyanathan Srinivasan
  2007-10-08 17:11           ` Vaidyanathan Srinivasan
  1 sibling, 2 replies; 12+ messages in thread
From: Nick Piggin @ 2007-10-07  7:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vaidyanathan Srinivasan, Alexis Bruemmer, Balbir Singh,
	Badari Pulavarty, Max Asbock, linux-mm, Bharata B Rao

On Friday 05 October 2007 03:21, Peter Zijlstra wrote:
> On Thu, 2007-10-04 at 21:12 +0530, Vaidyanathan Srinivasan wrote:

> > Per CPU last vma cache:  Currently we have the last vma referenced in a
> > one entry cache in mm_struct.  Can we have this cache per CPU or per node
> > so that a multi threaded application can have node/cpu local cache of
> > last vma referenced.  This may reduce btree/rbtree traversal.  Let the
> > hardware cache maintain the corresponding VMA object and its coherency.
> >
> > Please let me know your comment and thoughts.
>
> Nick Piggin (and I think Eric Dumazet) had nice patches for this. I
> think they were posted in the private futex thread.

All they need is testing and some results to show they help. I actually
don't really have a realistic workload where vma lookup contention is
a problem, since the malloc fixes and private futexes went in.

Actually -- there is one thing, apparently oprofile does lots of find_vmas,
which trashes the vma cache. Either it should have its own cache, or at
least use a "nontemporal" lookup.
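
A "nontemporal" lookup would basically be the usual find_vma() rbtree walk
minus the mm->mmap_cache update, roughly like this (the name is made up;
the caller holds mmap_sem like any find_vma() caller):

struct vm_area_struct *
find_vma_nontemporal(struct mm_struct *mm, unsigned long addr)
{
	struct rb_node *rb_node = mm->mm_rb.rb_node;
	struct vm_area_struct *vma = NULL;

	while (rb_node) {
		struct vm_area_struct *tmp =
			rb_entry(rb_node, struct vm_area_struct, vm_rb);

		if (tmp->vm_end > addr) {
			vma = tmp;
			if (tmp->vm_start <= addr)
				break;
			rb_node = rb_node->rb_left;
		} else
			rb_node = rb_node->rb_right;
	}
	/* unlike find_vma(), deliberately leave mm->mmap_cache alone */
	return vma;
}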

What I implemented was a per-thread cache. Per-CPU I guess would be
equally possible and might be preferable in some cases (although worse
in others). Still, the per-thread cache should be fine for basic performance
testing.



* Re: VMA lookup with RCU
  2007-10-07  7:47           ` Nick Piggin
@ 2007-10-08  7:51             ` Peter Zijlstra
  2007-10-08  9:32               ` Balbir Singh
  2007-10-08 17:02             ` Vaidyanathan Srinivasan
  1 sibling, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2007-10-08  7:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Vaidyanathan Srinivasan, Alexis Bruemmer, Balbir Singh,
	Badari Pulavarty, Max Asbock, linux-mm, Bharata B Rao

On Sun, 2007-10-07 at 17:47 +1000, Nick Piggin wrote:
> On Friday 05 October 2007 03:21, Peter Zijlstra wrote:
> > On Thu, 2007-10-04 at 21:12 +0530, Vaidyanathan Srinivasan wrote:
> 
> > > Per CPU last vma cache:  Currently we have the last vma referenced in a
> > > one entry cache in mm_struct.  Can we have this cache per CPU or per node
> > > so that a multi threaded application can have node/cpu local cache of
> > > last vma referenced.  This may reduce btree/rbtree traversal.  Let the
> > > hardware cache maintain the corresponding VMA object and its coherency.
> > >
> > > Please let me know your comment and thoughts.
> >
> > Nick Piggin (and I think Eric Dumazet) had nice patches for this. I
> > think they were posted in the private futex thread.
> 
> All they need is testing and some results to show they help. I actually
> don't really have a realistic workload where vma lookup contention is
> a problem, since the malloc fixes and private futexes went in.
> 
> Actually -- there is one thing, apparently oprofile does lots of find_vmas,
> which trashes the vma cache. Either it should have its own cache, or at
> least use a "nontemporal" lookup.
> 
> What I implemented was a per-thread cache. Per-CPU I guess would be
> equally possible and might be preferable in some cases (although worse
> in others). Still, the per-thread cache should be fine for basic performance
> testing.

Apparently our IBM friends on this thread have a workload where mmap_sem
does hurt, and I suspect it's a massively threaded Java app on a somewhat
larger box (8-16 cpus), which does a bit of faulting around.

But I'll let them tell about it :-)



* Re: VMA lookup with RCU
  2007-10-08 16:51                 ` Vaidyanathan Srinivasan
@ 2007-10-08  8:17                   ` Nick Piggin
  2007-10-22  9:54                   ` Vaidyanathan Srinivasan
  1 sibling, 0 replies; 12+ messages in thread
From: Nick Piggin @ 2007-10-08  8:17 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan, Eric Dumazet
  Cc: balbir, Peter Zijlstra, Alexis Bruemmer, Balbir Singh,
	Badari Pulavarty, Max Asbock, linux-mm, Bharata B Rao

On Tuesday 09 October 2007 02:51, Vaidyanathan Srinivasan wrote:
> >> Apparently our IBM friends on this thread have a workload where mmap_sem
> >> does hurt, and I suspect its a massively threaded Java app on a somewhat
> >> larger box (8-16 cpus), which does a bit of faulting around.
> >>
> >> But I'll let them tell about it :-)
> >
> > Nick,
> >
> > We used the latest glibc (with the private futexes fix) and the latest
> > kernel. We see improvements in scalability, but at 12-16 CPU's, we see
> > a slowdown. Vaidy has been using ebizzy for testing mmap_sem
> > scalability.
>
> Hi Peter and Nick,
>
> We have been doing some tests with ebizzy 0.2 workload.
> Here are some of the test results...

Cool graphs!

Looks like private futexes blew your mmap_sem contention away. Not too
surprising: I wouldn't have expected a high-performance app like this to be
doing a huge amount of mmap()ing and page faulting...

They almost tripled your peak performance! Of course that's with ebizzy...
what sort of correlation does this have to your real server app? (ie. does
it also see a 3x speedup?)

I don't see any synchronisation in ebizzy 2 -- I guess the gain is all due to
improved libc heap management scalability?


> ebizzy-futex.png plots the performance impact of private futex while
> ebizzy-rcu-vma.png plots the performance of Peter's RCU VMA look patch
> against base kernel with and without private futex.
>
> We can observe in both the plots that private futex improved scaling
> significantly from 4 CPUs to 8 CPUs but we still have scaling issues beyond
> 12 CPUs.
>
> Peter's RCU based b+tree vma lookup approach gives marginal performance
> improvement till 4 to 8 CPUs but does not help beyond that.
>
> Perhaps the scaling problem area shifts beyond 8-12 cpus and it is not just
> the mmap_sem and vma lookup.
>
> The significant oprofile output for various configurations are listed
> below:
>
> 12 CPUs 2.6.23-rc6 No private futex:
>
> samples  %        symbol name
> 6908330  23.7520  __down_read
> 4990895  17.1595  __up_read
> 2165162   7.4442  find_vma
> 2069868   7.1166  futex_wait
> 2063447   7.0945  futex_wake
> 1557829   5.3561  drop_futex_key_refs
> 741268    2.5486  task_rq_lock
> 638947    2.1968  schedule
> 600493    2.0646  system_call
> 515924    1.7738  copy_user_generic_unrolled
> 399672    1.3741  mwait_idle
>
> 12 CPUs 2.6.23-rc6 with private futex:
>
> samples  %        symbol name
> 2095531  15.5092  task_rq_lock
> 1094271   8.0988  schedule
> 1068093   7.9050  futex_wake
> 516250    3.8208  futex_wait
> 469220    3.4727  mwait_idle
> 468979    3.4710  system_call
> 443208    3.2802  idle_cpu
> 438301    3.2439  update_curr
> 397231    2.9399  try_to_wake_up
> 364424    2.6971  apic_timer_interrupt
> 362633    2.6839  scheduler_tick

There is basically no more mmap_sem contention or any vma lookups to
be seen. So I think it would be a waste of time to test my vma cache patches
really :P

It looks like most of the contention is on the runqueue locks and on futex
locks now. Both those paths are now pretty optimised... probably some
improvements could be made, but fundamentally if you are doing a lot of
sleeping on a single futex, and are doing a lot of cross-CPU wakeups, then
you are going to have scalability limits.

So improving glibc allocator to be more scalable, or changing the application
is likely to be the best course of action from here...

*If* you have a huge number of futexes, or a lot of processes (each with
their own threads and private futexes), then there are some possible things
we could try to improve in the private futex lookup code... but that doesn't
seem to be the case for you?


> All the above test results has the impact of oprofile included.  Running
> oprofile also may significantly increase mmap_sem contention.
>
> I Will run the tests again without oprofile to understand the impact of
> oprofile itself.
>
> Please let me know your comments and suggestions.

Getting confirmation of what is so costly in futex_wait and futex_wake
might be useful if you have time. I'll bet it is the hash lock, but could be
wrong.

Playing with the sched-domains parameters and possibly trying to reduce
the number of cross-CPU wakeups might help. However you have to be pretty
careful with this that you don't just tune the system to work well with ebizzy
and not your real workload.



* Re: VMA lookup with RCU
  2007-10-08  7:51             ` Peter Zijlstra
@ 2007-10-08  9:32               ` Balbir Singh
  2007-10-08 16:51                 ` Vaidyanathan Srinivasan
  0 siblings, 1 reply; 12+ messages in thread
From: Balbir Singh @ 2007-10-08  9:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nick Piggin, Vaidyanathan Srinivasan, Alexis Bruemmer,
	Balbir Singh, Badari Pulavarty, Max Asbock, linux-mm,
	Bharata B Rao

Peter Zijlstra wrote:
> On Sun, 2007-10-07 at 17:47 +1000, Nick Piggin wrote:
>> On Friday 05 October 2007 03:21, Peter Zijlstra wrote:
>>> On Thu, 2007-10-04 at 21:12 +0530, Vaidyanathan Srinivasan wrote:
>>>> Per CPU last vma cache:  Currently we have the last vma referenced in a
>>>> one entry cache in mm_struct.  Can we have this cache per CPU or per node
>>>> so that a multi threaded application can have node/cpu local cache of
>>>> last vma referenced.  This may reduce btree/rbtree traversal.  Let the
>>>> hardware cache maintain the corresponding VMA object and its coherency.
>>>>
>>>> Please let me know your comment and thoughts.
>>> Nick Piggin (and I think Eric Dumazet) had nice patches for this. I
>>> think they were posted in the private futex thread.
>> All they need is testing and some results to show they help. I actually
>> don't really have a realistic workload where vma lookup contention is
>> a problem, since the malloc fixes and private futexes went in.
>>
>> Actually -- there is one thing, apparently oprofile does lots of find_vmas,
>> which trashes the vma cache. Either it should have its own cache, or at
>> least use a "nontemporal" lookup.
>>
>> What I implemented was a per-thread cache. Per-CPU I guess would be
>> equally possible and might be preferable in some cases (although worse
>> in others). Still, the per-thread cache should be fine for basic performance
>> testing.
> 
> Apparently our IBM friends on this thread have a workload where mmap_sem
> does hurt, and I suspect its a massively threaded Java app on a somewhat
> larger box (8-16 cpus), which does a bit of faulting around.
> 
> But I'll let them tell about it :-)
> 

Nick,

We used the latest glibc (with the private futexes fix) and the latest
kernel. We see improvements in scalability, but at 12-16 CPUs, we see
a slowdown. Vaidy has been using ebizzy for testing mmap_sem
scalability.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL



* Re: VMA lookup with RCU
  2007-10-08  9:32               ` Balbir Singh
@ 2007-10-08 16:51                 ` Vaidyanathan Srinivasan
  2007-10-08  8:17                   ` Nick Piggin
  2007-10-22  9:54                   ` Vaidyanathan Srinivasan
  0 siblings, 2 replies; 12+ messages in thread
From: Vaidyanathan Srinivasan @ 2007-10-08 16:51 UTC (permalink / raw)
  To: balbir
  Cc: Peter Zijlstra, Nick Piggin, Alexis Bruemmer, Balbir Singh,
	Badari Pulavarty, Max Asbock, linux-mm, Bharata B Rao

[-- Attachment #1: Type: text/plain, Size: 3271 bytes --]



>> Apparently our IBM friends on this thread have a workload where mmap_sem
>> does hurt, and I suspect its a massively threaded Java app on a somewhat
>> larger box (8-16 cpus), which does a bit of faulting around.
>>
>> But I'll let them tell about it :-)
>>
> 
> Nick,
> 
> We used the latest glibc (with the private futexes fix) and the latest
> kernel. We see improvements in scalability, but at 12-16 CPU's, we see
> a slowdown. Vaidy has been using ebizzy for testing mmap_sem
> scalability.
> 

Hi Peter and Nick,

We have been doing some tests with the ebizzy 0.2 workload.
Here are some of the test results...

ebizzy-futex.png plots the performance impact of private futexes, while
ebizzy-rcu-vma.png plots the performance of Peter's RCU VMA lookup patch
against the base kernel with and without private futexes.

We can observe in both plots that private futexes improved scaling
significantly from 4 CPUs to 8 CPUs, but we still have scaling issues
beyond 12 CPUs.

Peter's RCU-based b+tree vma lookup approach gives a marginal performance
improvement up to 4-8 CPUs but does not help beyond that.

Perhaps the scaling problem shifts beyond 8-12 CPUs and is no longer just
mmap_sem and the vma lookup.

The significant oprofile output for the various configurations is listed
below:

12 CPUs 2.6.23-rc6 No private futex:

samples  %        symbol name
6908330  23.7520  __down_read
4990895  17.1595  __up_read
2165162   7.4442  find_vma
2069868   7.1166  futex_wait
2063447   7.0945  futex_wake
1557829   5.3561  drop_futex_key_refs
741268    2.5486  task_rq_lock
638947    2.1968  schedule
600493    2.0646  system_call
515924    1.7738  copy_user_generic_unrolled
399672    1.3741  mwait_idle

12 CPUs 2.6.23-rc6 with private futex:

samples  %        symbol name
2095531  15.5092  task_rq_lock
1094271   8.0988  schedule
1068093   7.9050  futex_wake
516250    3.8208  futex_wait
469220    3.4727  mwait_idle
468979    3.4710  system_call
443208    3.2802  idle_cpu
438301    3.2439  update_curr
397231    2.9399  try_to_wake_up
364424    2.6971  apic_timer_interrupt
362633    2.6839  scheduler_tick

8 CPUs 2.6.23-rc9 + RCU VMA + private futex:

samples  %        symbol name
386111   10.4460  mwait_idle
367289    9.9368  __btree_search
286286    7.7453  apic_timer_interrupt
272863    7.3821  find_busiest_group
230268    6.2298  scheduler_tick
224902    6.0846  copy_user_generic_unrolled
188991    5.1130  memset
123692    3.3464  __exit_idle
91930     2.4871  hrtimer_run_queues
87796     2.3753  task_rq_lock
85968     2.3258  run_rebalance_domains

12 CPUs 2.6.23-rc9 + RCU VMA + private futex:

samples  %        symbol name
14505427 18.6891  task_rq_lock
10323001 13.3004  futex_wake
6980246   8.9935  schedule
4768884   6.1443  futex_wait
2981997   3.8421  idle_cpu
2939439   3.7872  system_call
2655022   3.4208  update_curr
2433540   3.1354  try_to_wake_up
1786021   2.3011  check_preempt_curr_fair
1692711   2.1809  syscall_trace_enter
1486381   1.9151  thread_return

All the above test results have the impact of oprofile included.  Running
oprofile may also significantly increase mmap_sem contention.

I will run the tests again without oprofile to understand the impact of
oprofile itself.

Please let me know your comments and suggestions.

--Vaidy

[-- Attachment #2: ebizzy-futex.png --]
[-- Type: image/png, Size: 4324 bytes --]

[-- Attachment #3: ebizzy-rcu-vma.png --]
[-- Type: image/png, Size: 4917 bytes --]


* Re: VMA lookup with RCU
  2007-10-07  7:47           ` Nick Piggin
  2007-10-08  7:51             ` Peter Zijlstra
@ 2007-10-08 17:02             ` Vaidyanathan Srinivasan
  1 sibling, 0 replies; 12+ messages in thread
From: Vaidyanathan Srinivasan @ 2007-10-08 17:02 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Alexis Bruemmer, Balbir Singh, Badari Pulavarty,
	Max Asbock, linux-mm, Bharata B Rao


Nick Piggin wrote:
> On Friday 05 October 2007 03:21, Peter Zijlstra wrote:
>> On Thu, 2007-10-04 at 21:12 +0530, Vaidyanathan Srinivasan wrote:
> 
>>> Per CPU last vma cache:  Currently we have the last vma referenced in a
>>> one entry cache in mm_struct.  Can we have this cache per CPU or per node
>>> so that a multi threaded application can have node/cpu local cache of
>>> last vma referenced.  This may reduce btree/rbtree traversal.  Let the
>>> hardware cache maintain the corresponding VMA object and its coherency.
>>>
>>> Please let me know your comment and thoughts.
>> Nick Piggin (and I think Eric Dumazet) had nice patches for this. I
>> think they were posted in the private futex thread.
> 
> All they need is testing and some results to show they help. I actually
> don't really have a realistic workload where vma lookup contention is
> a problem, since the malloc fixes and private futexes went in.

Hi Nick,

Just point me to the patches.  I will run them through ebizzy with and
without oprofile on a large system and post the data.

> 
> Actually -- there is one thing, apparently oprofile does lots of find_vmas,
> which trashes the vma cache. Either it should have its own cache, or at
> least use a "nontemporal" lookup.
> 
> What I implemented was a per-thread cache. Per-CPU I guess would be
> equally possible and might be preferable in some cases (although worse
> in others). Still, the per-thread cache should be fine for basic performance
> testing.

Per-thread last vma cache is a good idea... much simpler to implement than
a per-CPU or per-node cache, I guess.  But invalidating the caches may
still be a slow path.  Let's check it out.
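
One way to keep the invalidation cheap might be a generation counter
(just a sketch, not taken from the posted patches; the field names are
invented): unmap only bumps a counter in the mm, and each thread lazily
drops its cached vma when the counter no longer matches.

struct mm_struct {
	...
	unsigned long vma_cache_seq;	/* bumped on every unmap */
};

	/* per thread, e.g. in task_struct: */
	unsigned long vma_cache_seq;
	struct vm_area_struct *vma_cache;

struct vm_area_struct *
find_vma_cached(struct mm_struct *mm, unsigned long addr)
{
	struct vm_area_struct *vma = current->vma_cache;

	/* caller holds mmap_sem for read, as for find_vma() */
	if (current->vma_cache_seq == mm->vma_cache_seq &&
	    vma && vma->vm_start <= addr && addr < vma->vm_end)
		return vma;

	vma = find_vma(mm, addr);
	current->vma_cache = vma;
	current->vma_cache_seq = mm->vma_cache_seq;
	return vma;
}

	/* unmap side: */
	mm->vma_cache_seq++;	/* invalidates every thread's cache at once */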

--Vaidy



* Re: VMA lookup with RCU
  2007-10-04 17:21         ` Peter Zijlstra
  2007-10-07  7:47           ` Nick Piggin
@ 2007-10-08 17:11           ` Vaidyanathan Srinivasan
  1 sibling, 0 replies; 12+ messages in thread
From: Vaidyanathan Srinivasan @ 2007-10-08 17:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexis Bruemmer, Balbir Singh, Badari Pulavarty, Max Asbock,
	linux-mm, Bharata B Rao, Nick Piggin

Peter Zijlstra wrote:
> On Thu, 2007-10-04 at 21:12 +0530, Vaidyanathan Srinivasan wrote:
>> Peter Zijlstra wrote:
>> Hi Peter,
>>
>> Making node local copies of VMA is a good idea to reduce inter-node
>> traffic, but the cost of search and delete is very high.  Also, as you have
>> pointed out, if the atomic operations happen on remote node due to
>> scheduler migrating our thread, then all the cycles saved may be lost.
>>
>> In find_get_vma() cross node traffic is due to btree traversal or the
>> actual VMA object reference? 
> 
> Not sure, I'm not sure how to profile cacheline transfers.

I asked around and found that oprofile can report cache misses, but
associating them with find_get_vma() is the problem.

> The outlined approach would try to keep all accesses read-only, so that
> the cacheline can be shared. But yeah, once it get evicted it needs to
> be re-transfered.

>>  Can we look at duplicating the btree
>> structure per node and have VMA structures just one copy and make all
>> btrees in each node point to the same vma object.  This will make write
>> operation and deletion of btree entries on all nodes little simple.  All
>> VMA lists will be unique and not duplicated.
> 
> But that would end up with a 2d tree, (mm, vma) in which you can try to
> find an exact match for a given (mm, address) key.
> 
> Trouble with multi-dimensional trees is the balancing thing, afaik its
> an np-hard problem.

Not a good idea then :)

>> Another related idea is to move the VMA object to node local memory.  Can
>> we migrate the VMA object to the node where it is referenced the most?  We
>> still maintain only _one_ copy of VMA object.  No data duplication, but we
>> can move the memory around to make it node local.
> 
> I guess we can do that, is you take the vma lock in exclusive mode, you
> can make a copy of the object, replace the tree pointer, mark the old
> one dead (so that rcu lookups with re-try) and rcu_free the old one.

So worth a try. I will pick this up.

>> Some more thoughts:
>>
>> Pagefault handler does most of the find_get_vma() to validate user address
>> and then create page table entries (allocate page frames)... can we make
>> the page fault handler run on the node where the VMAs have been allocated?
> 
> explicit migration - like migrate_disable() - make load balancing a very
> hard problem.

>>  The CPU that has page-faulted need not necessarily do all the find_vma()
>> calls and update the page table.  The process can sleep while another CPU
>> _near_ to the memory containing VMAs and pagetable can do the job with
>> local memory references.
> 
> would we not end up with remote page tables?

Remote pagetable update is less costly than remote vma lookup.  We will
have to balance between the two.

>> I don't know if the page tables for the faulting process is allocated in
>> node local memory.
>>
>> Per CPU last vma cache:  Currently we have the last vma referenced in a one
>> entry cache in mm_struct.  Can we have this cache per CPU or per node so
>> that a multi threaded application can have node/cpu local cache of last vma
>> referenced.  This may reduce btree/rbtree traversal.  Let the hardware
>> cache maintain the corresponding VMA object and its coherency.
>>
>> Please let me know your comment and thoughts.
> 
> Nick Piggin (and I think Eric Dumazet) had nice patches for this. I
> think they were posted in the private futex thread.

Good.  I would like to try them out.




* Re: VMA lookup with RCU
  2007-10-08 16:51                 ` Vaidyanathan Srinivasan
  2007-10-08  8:17                   ` Nick Piggin
@ 2007-10-22  9:54                   ` Vaidyanathan Srinivasan
  1 sibling, 0 replies; 12+ messages in thread
From: Vaidyanathan Srinivasan @ 2007-10-22  9:54 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: balbir, Peter Zijlstra, Nick Piggin, Alexis Bruemmer,
	Balbir Singh, Badari Pulavarty, Max Asbock, linux-mm,
	Bharata B Rao

[-- Attachment #1: Type: text/plain, Size: 710 bytes --]

[snip]

> 
> All the above test results has the impact of oprofile included.  Running
> oprofile also may significantly increase mmap_sem contention.
> 
> I Will run the tests again without oprofile to understand the impact of
> oprofile itself.
> 

Hi,

I have run the same tests without oprofile and also tried to plot the
impact of oprofile.

There are two attached graphs.

rcu-vma-scaling.png contains the plot of Peter's RCU-VMA lookup patch
without oprofile running.

There are some variations beyond 10 CPUs.  I wonder whether those were
due to errors in the test/measurement!

oprofile-scaling.png plots similar tests with and without oprofile enabled.
 We can see more impact beyond 12 CPUs.

--Vaidy


[-- Attachment #2: oprofile-scaling.png --]
[-- Type: image/png, Size: 4028 bytes --]

[-- Attachment #3: rcu-vma-scaling.png --]
[-- Type: image/png, Size: 4808 bytes --]


end of thread, other threads:[~2007-10-22  9:47 UTC | newest]

Thread overview: 12+ messages:
     [not found] <46F01289.7040106@linux.vnet.ibm.com>
     [not found] ` <20070918205419.60d24da7@lappy>
     [not found]   ` <1191436672.7103.38.camel@alexis>
2007-10-03 19:40     ` VMA lookup with RCU Peter Zijlstra
2007-10-03 19:54       ` Peter Zijlstra
2007-10-04 15:42       ` Vaidyanathan Srinivasan
2007-10-04 17:21         ` Peter Zijlstra
2007-10-07  7:47           ` Nick Piggin
2007-10-08  7:51             ` Peter Zijlstra
2007-10-08  9:32               ` Balbir Singh
2007-10-08 16:51                 ` Vaidyanathan Srinivasan
2007-10-08  8:17                   ` Nick Piggin
2007-10-22  9:54                   ` Vaidyanathan Srinivasan
2007-10-08 17:02             ` Vaidyanathan Srinivasan
2007-10-08 17:11           ` Vaidyanathan Srinivasan
