Re: VMA lookup with RCU - Peter Zijlstra

From: Peter Zijlstra <peterz@infradead.org>
To: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Cc: Alexis Bruemmer <alexisb@us.ibm.com>,
	Balbir Singh <balbir@in.ibm.com>,
	Badari Pulavarty <pbadari@us.ibm.com>,
	Max Asbock <amax@us.ibm.com>, linux-mm <linux-mm@kvack.org>,
	Bharata B Rao <bharata@in.ibm.com>,
	Nick Piggin <nickpiggin@yahoo.com.au>
Subject: Re: VMA lookup with RCU
Date: Thu, 04 Oct 2007 19:21:26 +0200
Message-ID: <1191518486.5574.24.camel@lappy>
In-Reply-To: <470509F5.4010902@linux.vnet.ibm.com>

On Thu, 2007-10-04 at 21:12 +0530, Vaidyanathan Srinivasan wrote:
> Peter Zijlstra wrote:
> >>>     lookup in node local tree
> >>>     if found, take read lock on local reference
> >>>     if not-found, do global lookup, lock vma, take reference, 
> >>>                   insert reference into local tree,
> >>>                   take read lock on it, drop vma lock
> >>>
> >>> write lock on the vma would:
> >>>     find the vma in the global tree, lock it
> >>>     enqueue work items in a waitqueue that,
> >>>       find the local ref, lock it (might sleep)
> >>>       release the reference, unlock and clear from local tree
> >>>       signal completion
> >>>     once all nodes have completed we have no outstanding refs
> >>>     and since we have the lock, we're exclusive.
> > 
> > void invalidate_vma_refs(void *addr)
> > {
> > 	BTREE_LOCK_CONTEXT(ctx, node_local_tree());
> > 	struct vma_ref *ref; /* per-node reference object; type name assumed */
> > 
> > 	rcu_read_lock();
> > 	ref = btree_find(node_local_tree(), (unsigned long)addr);
> > 	if (!ref)
> > 		goto out_unlock;
> > 
> > 	down_write(&ref->lock); /* no more local refs */
> > 	ref->dead = 1;
> > 	atomic_dec(&ref->vma->refs); /* release */
> > 	btree_delete(ctx, (unsigned long)addr); /* unhook */
> > 	call_rcu(&ref->rcu, free_vma_ref); /* destroy; assumes an rcu_head member 'rcu' */
> > 	up_write(&ref->lock);
> > 
> > out_unlock:
> > 	rcu_read_unlock();
> > }
> > 
> > struct vm_area_struct *
> > write_lock_vma(struct mm_struct *mm, unsigned long addr)
> > {
> > 	struct vm_area_struct *vma;
> > 
> > 	rcu_read_lock();
> > 	vma = btree_find(&mm->btree, addr);
> > 	if (!vma)
> > 		goto out_unlock;
> > 
> > 	down_write(&vma->lock); /* no new refs */
> > 	rcu_read_unlock();
> > 
> > 	schedule_on_each_cpu(invalidate_vma_refs, vma, 0, 1);
> > 
> > 	return vma;
> > 
> > out_unlock:
> > 	rcu_read_unlock();
> > 	return NULL;
> > }
> > 
> > 
> 
> Hi Peter,
> 
> Making node-local copies of the VMA is a good idea to reduce inter-node
> traffic, but the cost of search and delete is very high.  Also, as you have
> pointed out, if the atomic operations happen on a remote node due to the
> scheduler migrating our thread, then all the cycles saved may be lost.
> 
> In find_get_vma(), is the cross-node traffic due to the btree traversal
> or the actual VMA object reference?

Not sure; I don't know how to profile cacheline transfers.

The outlined approach would try to keep all accesses read-only, so that
the cacheline can be shared. But yeah, once it gets evicted it needs to
be re-transferred.

>  Can we look at duplicating the btree structure per node while keeping
> just one copy of each VMA structure, so that the btrees on all nodes
> point to the same vma object?  This would make write operations and
> deletion of btree entries on all nodes a little simpler.  All VMA
> objects would be unique and not duplicated.

But that would end up with a 2d tree, (mm, vma) in which you can try to
find an exact match for a given (mm, address) key.

Trouble with multi-dimensional trees is the balancing thing; afaik it's
an NP-hard problem.

> Another related idea is to move the VMA object to node local memory.  Can
> we migrate the VMA object to the node where it is referenced the most?  We
> still maintain only _one_ copy of VMA object.  No data duplication, but we
> can move the memory around to make it node local.

I guess we can do that: if you take the vma lock in exclusive mode, you
can make a copy of the object, replace the tree pointer, mark the old
one dead (so that rcu lookups will re-try) and rcu_free the old one.
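
A rough userspace sketch of that sequence, to make the ordering concrete.
This is not kernel code: struct vma_obj, tree_slot, migrate_vma(),
lookup_vma() and grace_period_elapsed() are all made-up names, the
exclusive vma lock is assumed to be held by the caller, and a parked
pointer stands in for the real RCU-deferred free.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a vma; not the kernel's vm_area_struct. */
struct vma_obj {
	unsigned long start, end;
	atomic_int dead;	/* set once the object has been replaced */
	int node;		/* node the copy is meant to live on */
};

/* Stand-in for the tree pointer that lockless lookups dereference. */
static _Atomic(struct vma_obj *) tree_slot;

/* Stand-in for rcu_free: the old object is parked until a "grace period". */
static struct vma_obj *pending_free;

/* Caller is assumed to hold the vma lock in exclusive mode. */
static struct vma_obj *migrate_vma(struct vma_obj *old, int node)
{
	struct vma_obj *copy;

	assert(old && !atomic_load(&old->dead));

	copy = malloc(sizeof(*copy));	/* would be a node-local allocation */
	if (!copy)
		return NULL;

	copy->start = old->start;
	copy->end = old->end;
	copy->node = node;
	atomic_init(&copy->dead, 0);

	/* Publish the copy first, so a retrying lookup finds it. */
	atomic_store_explicit(&tree_slot, copy, memory_order_release);
	/* Then mark the old one dead; lookups holding it will re-try. */
	atomic_store_explicit(&old->dead, 1, memory_order_release);

	pending_free = old;	/* freed only after the grace period */
	return copy;
}

/* Lockless lookup: re-try if we raced with a migration. */
static struct vma_obj *lookup_vma(unsigned long addr)
{
	struct vma_obj *vma;

	do {
		vma = atomic_load_explicit(&tree_slot, memory_order_acquire);
		if (!vma || addr < vma->start || addr >= vma->end)
			return NULL;
	} while (atomic_load_explicit(&vma->dead, memory_order_acquire));

	return vma;
}

/* Pretend a grace period has passed: no lookup still holds the old object. */
static void grace_period_elapsed(void)
{
	free(pending_free);
	pending_free = NULL;
}
```

Publishing the copy before setting the dead flag means a lookup that
sees a dead object finds the replacement on its next pass instead of
spinning until it appears.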

> Some more thoughts:
> 
> Pagefault handler does most of the find_get_vma() to validate user address
> and then create page table entries (allocate page frames)... can we make
> the page fault handler run on the node where the VMAs have been allocated?

Explicit migration - like migrate_disable() - makes load balancing a very
hard problem.

>  The CPU that has page-faulted need not necessarily do all the find_vma()
> calls and update the page table.  The process can sleep while another CPU
> _near_ to the memory containing VMAs and pagetable can do the job with
> local memory references.

Would we not end up with remote page tables?

> I don't know if the page tables for the faulting process are allocated in
> node local memory.
> 
> Per CPU last vma cache:  Currently we have the last vma referenced in a
> one-entry cache in mm_struct.  Can we have this cache per CPU or per node,
> so that a multi-threaded application can have a node/cpu local cache of
> the last vma referenced?  This may reduce btree/rbtree traversal.  Let the
> hardware cache maintain the corresponding VMA object and its coherency.
> 
> Please let me know your comments and thoughts.

Nick Piggin (and I think Eric Dumazet) had nice patches for this. I
think they were posted in the private futex thread.
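
For reference, the general shape of a per-cpu one-entry cache looks
something like the userspace sketch below. This is only an illustration
of the idea, not a reconstruction of those patches: struct mm_sketch,
slow_find(), find_vma_cached() and NR_CPUS_SKETCH are invented names, a
sorted array stands in for the btree/rbtree, and the lookup only returns
a vma that actually contains the address.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS_SKETCH 4	/* arbitrary, for illustration only */

/* Hypothetical stand-in for a vma: covers [start, end). */
struct vma_obj {
	unsigned long start, end;
};

struct mm_sketch {
	struct vma_obj *vmas;	/* sorted, non-overlapping; stands in for the tree */
	size_t nr_vmas;
	struct vma_obj *last[NR_CPUS_SKETCH];	/* one-entry cache per cpu */
	unsigned long slow_lookups;	/* counts full lookups, for demonstration */
};

/* Full lookup: binary search standing in for the tree traversal. */
static struct vma_obj *slow_find(struct mm_sketch *mm, unsigned long addr)
{
	size_t lo = 0, hi = mm->nr_vmas;

	mm->slow_lookups++;
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		struct vma_obj *v = &mm->vmas[mid];

		if (addr < v->start)
			hi = mid;
		else if (addr >= v->end)
			lo = mid + 1;
		else
			return v;
	}
	return NULL;
}

/* Try this cpu's cached vma first; fall back to the full lookup on a miss. */
static struct vma_obj *find_vma_cached(struct mm_sketch *mm, int cpu,
				       unsigned long addr)
{
	struct vma_obj *v;

	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);

	v = mm->last[cpu];
	if (v && addr >= v->start && addr < v->end)
		return v;	/* hit: no traversal, only cpu-local cache state read */

	v = slow_find(mm, addr);
	if (v)
		mm->last[cpu] = v;	/* remember it for this cpu only */
	return v;
}
```

The sketch leaves out invalidation: anything that removes or resizes a
vma would have to clear the matching entry in every cpu's slot, which is
where the cost moves to.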


