All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Zijlstra <peterz@infradead.org>
To: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Cc: Alexis Bruemmer <alexisb@us.ibm.com>,
	Balbir Singh <balbir@in.ibm.com>,
	Badari Pulavarty <pbadari@us.ibm.com>,
	Max Asbock <amax@us.ibm.com>, linux-mm <linux-mm@kvack.org>,
	Bharata B Rao <bharata@in.ibm.com>,
	Nick Piggin <nickpiggin@yahoo.com.au>
Subject: Re: VMA lookup with RCU
Date: Thu, 04 Oct 2007 19:21:26 +0200	[thread overview]
Message-ID: <1191518486.5574.24.camel@lappy> (raw)
In-Reply-To: <470509F5.4010902@linux.vnet.ibm.com>

On Thu, 2007-10-04 at 21:12 +0530, Vaidyanathan Srinivasan wrote:
> Peter Zijlstra wrote:
> >>>     lookup in node local tree
> >>>     if found, take read lock on local reference
> >>>     if not-found, do global lookup, lock vma, take reference, 
> >>>                   insert reference into local tree,
> >>>                   take read lock on it, drop vma lock
> >>>
> >>> write lock on the vma would:
> >>>     find the vma in the global tree, lock it
> >>>     enqueue work items in a waitqueue that,
> >>>       find the local ref, lock it (might sleep)
> >>>       release the reference, unlock and clear from local tree
> >>>       signal completion
> >>>     once all nodes have completed we have no outstanding refs
> >>>     and since we have the lock, we're exclusive.
> > 
> > void invalidate_vma_refs(void *addr)
> > {
> > 	BTREE_LOCK_CONTEXT(ctx, node_local_tree());
> > 
> > 	rcu_read_lock();
> > 	ref = btree_find(node_local_tree, (unsigned long)addr);
> > 	if (!ref)
> > 		goto out_unlock;
> > 
> > 	down_write(&ref->lock); /* no more local refs */
> > 	ref->dead = 1;
> > 	atomic_dec(&ref->vma->refs); /* release */
> > 	btree_delete(ctx, (unsigned long)addr); /* unhook */
> > 	rcu_call(free_vma_ref, ref); /* destroy */
> > 	up_write(&ref->lock);
> > 
> > out_unlock:
> > 	rcu_read_unlock();
> > }
> > 
> > struct vm_area_struct *
> > write_lock_vma(struct mm *mm, unsigned long addr)
> > {
> > 	rcu_read_lock();
> > 	vma = btree_find(&mm->btree, addr);
> > 	if (!vma)
> > 		goto out_unlock;
> > 
> > 	down_write(&vma->lock); /* no new refs */
> > 	rcu_read_unlock();
> > 
> > 	schedule_on_each_cpu(invalidate_vma_refs, vma, 0, 1);
> > 
> > 	return vma;
> > 
> > out_unlock:
> > 	rcu_read_unlock();
> > 	return NULL;
> > }
> > 
> > 
> 
> Hi Peter,
> 
> Making node local copies of VMA is a good idea to reduce inter-node
> traffic, but the cost of search and delete is very high.  Also, as you have
> pointed out, if the atomic operations happen on remote node due to
> scheduler migrating our thread, then all the cycles saved may be lost.
> 
> In find_get_vma() cross node traffic is due to btree traversal or the
> actual VMA object reference? 

Not sure, I'm not sure how to profile cacheline transfers.

The outlined approach would try to keep all accesses read-only, so that
the cacheline can be shared. But yeah, once it get evicted it needs to
be re-transfered.

>  Can we look at duplicating the btree
> structure per node and have VMA structures just one copy and make all
> btrees in each node point to the same vma object.  This will make write
> operation and deletion of btree entries on all nodes little simple.  All
> VMA lists will be unique and not duplicated.

But that would end up with a 2d tree, (mm, vma) in which you can try to
find an exact match for a given (mm, address) key.

Trouble with multi-dimensional trees is the balancing thing, afaik its
an np-hard problem.

> Another related idea is to move the VMA object to node local memory.  Can
> we migrate the VMA object to the node where it is referenced the most?  We
> still maintain only _one_ copy of VMA object.  No data duplication, but we
> can move the memory around to make it node local.

I guess we can do that, is you take the vma lock in exclusive mode, you
can make a copy of the object, replace the tree pointer, mark the old
one dead (so that rcu lookups with re-try) and rcu_free the old one.

> Some more thoughts:
> 
> Pagefault handler does most of the find_get_vma() to validate user address
> and then create page table entries (allocate page frames)... can we make
> the page fault handler run on the node where the VMAs have been allocated?

explicit migration - like migrate_disable() - make load balancing a very
hard problem.

>  The CPU that has page-faulted need not necessarily do all the find_vma()
> calls and update the page table.  The process can sleep while another CPU
> _near_ to the memory containing VMAs and pagetable can do the job with
> local memory references.

would we not end up with remote page tables?

> I don't know if the page tables for the faulting process is allocated in
> node local memory.
> 
> Per CPU last vma cache:  Currently we have the last vma referenced in a one
> entry cache in mm_struct.  Can we have this cache per CPU or per node so
> that a multi threaded application can have node/cpu local cache of last vma
> referenced.  This may reduce btree/rbtree traversal.  Let the hardware
> cache maintain the corresponding VMA object and its coherency.
> 
> Please let me know your comment and thoughts.

Nick Piggin (and I think Eric Dumazet) had nice patches for this. I
think they were posted in the private futex thread.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2007-10-04 17:21 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <46F01289.7040106@linux.vnet.ibm.com>
     [not found] ` <20070918205419.60d24da7@lappy>
     [not found]   ` <1191436672.7103.38.camel@alexis>
2007-10-03 19:40     ` VMA lookup with RCU Peter Zijlstra
2007-10-03 19:54       ` Peter Zijlstra
2007-10-04 15:42       ` Vaidyanathan Srinivasan
2007-10-04 17:21         ` Peter Zijlstra [this message]
2007-10-07  7:47           ` Nick Piggin
2007-10-08  7:51             ` Peter Zijlstra
2007-10-08  9:32               ` Balbir Singh
2007-10-08 16:51                 ` Vaidyanathan Srinivasan
2007-10-08  8:17                   ` Nick Piggin
2007-10-22  9:54                   ` Vaidyanathan Srinivasan
2007-10-08 17:02             ` Vaidyanathan Srinivasan
2007-10-08 17:11           ` Vaidyanathan Srinivasan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1191518486.5574.24.camel@lappy \
    --to=peterz@infradead.org \
    --cc=alexisb@us.ibm.com \
    --cc=amax@us.ibm.com \
    --cc=balbir@in.ibm.com \
    --cc=bharata@in.ibm.com \
    --cc=linux-mm@kvack.org \
    --cc=nickpiggin@yahoo.com.au \
    --cc=pbadari@us.ibm.com \
    --cc=svaidy@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.