Message-ID: <470509F5.4010902@linux.vnet.ibm.com>
Date: Thu, 04 Oct 2007 21:12:45 +0530
From: Vaidyanathan Srinivasan
Subject: Re: VMA lookup with RCU
References: <46F01289.7040106@linux.vnet.ibm.com> <20070918205419.60d24da7@lappy>
 <1191436672.7103.38.camel@alexis> <1191440429.5599.72.camel@lappy>
In-Reply-To: <1191440429.5599.72.camel@lappy>
To: Peter Zijlstra
Cc: Alexis Bruemmer, Balbir Singh, Badari Pulavarty, Max Asbock,
 linux-mm, Bharata B Rao

Peter Zijlstra wrote:
>>> lookup in node local tree
>>> if found, take read lock on local reference
>>> if not-found, do global lookup, lock vma, take reference,
>>>   insert reference into local tree,
>>>   take read lock on it, drop vma lock
>>>
>>> write lock on the vma would:
>>>   find the vma in the global tree, lock it
>>>   enqueue work items in a waitqueue that,
>>>     find the local ref, lock it (might sleep)
>>>     release the reference, unlock and clear from local tree
>>>     signal completion
>>>   once all nodes have completed we have no outstanding refs
>>>   and since we have the lock, we're exclusive.
>
> void invalidate_vma_refs(void *addr)
> {
> 	BTREE_LOCK_CONTEXT(ctx, node_local_tree());
>
> 	rcu_read_lock();
> 	ref = btree_find(node_local_tree, (unsigned long)addr);
> 	if (!ref)
> 		goto out_unlock;
>
> 	down_write(&ref->lock);			/* no more local refs */
> 	ref->dead = 1;
> 	atomic_dec(&ref->vma->refs);		/* release */
> 	btree_delete(ctx, (unsigned long)addr);	/* unhook */
> 	rcu_call(free_vma_ref, ref);		/* destroy */
> 	up_write(&ref->lock);
>
> out_unlock:
> 	rcu_read_unlock();
> }
>
> struct vm_area_struct *
> write_lock_vma(struct mm *mm, unsigned long addr)
> {
> 	rcu_read_lock();
> 	vma = btree_find(&mm->btree, addr);
> 	if (!vma)
> 		goto out_unlock;
>
> 	down_write(&vma->lock);			/* no new refs */
> 	rcu_read_unlock();
>
> 	schedule_on_each_cpu(invalidate_vma_refs, vma, 0, 1);
>
> 	return vma;
>
> out_unlock:
> 	rcu_read_unlock();
> 	return NULL;
> }

Hi Peter,

Making node-local copies of VMAs is a good idea to reduce inter-node
traffic, but the cost of search and delete is very high.  Also, as you
have pointed out, if the atomic operations happen on a remote node
because the scheduler has migrated our thread, then all the cycles
saved may be lost.

In find_get_vma(), is the cross-node traffic due to the btree
traversal or to the actual VMA object reference?

Can we look at duplicating only the btree structure per node, keeping
just one copy of each VMA, and have the btree on each node point to
the same vma object?  This would make write operations and deletion of
btree entries on all nodes a little simpler.  All VMA lists would be
unique and not duplicated.
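To make the idea concrete, here is a rough sketch in the same spirit
as your pseudo-code.  node_btree(), btree_find(), btree_insert() and
vma->refs are made-up names for this sketch (your code already assumes
a btree), not existing interfaces, and the locking around the insert
is left out:

struct vm_area_struct *find_get_vma(struct mm_struct *mm, unsigned long addr)
{
	/* per-node index, keyed by address; entries are only pointers */
	struct btree *local = node_btree(mm, numa_node_id());
	struct vm_area_struct *vma;

	rcu_read_lock();
	vma = btree_find(local, addr);			/* node-local lookup */
	if (!vma) {
		vma = btree_find(&mm->btree, addr);	/* global fallback */
		if (vma)
			btree_insert(local, addr, vma);	/* cache the pointer,
							   locking omitted */
	}
	if (vma)
		atomic_inc(&vma->refs);			/* pin the shared VMA */
	rcu_read_unlock();

	return vma;
}

The point is that only the index is duplicated per node; the VMA stays
a single object, so the write/unmap side only has to delete the address
from each node's btree and wait for vma->refs to drain, instead of
chasing per-node copies of the VMA itself.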
Another related idea is to move the VMA object itself to node-local
memory.  Can we migrate the VMA object to the node where it is
referenced the most?  We would still maintain only _one_ copy of the
VMA object: no data duplication, but we can move the memory around to
make it node local.

Some more thoughts:

The page fault handler does most of the find_get_vma() calls to
validate the user address and then create page table entries
(allocating page frames)... can we make the page fault handler run on
the node where the VMAs have been allocated?  The CPU that has
page-faulted need not necessarily do all the find_vma() calls and
update the page table itself.  The process can sleep while another CPU
_near_ the memory containing the VMAs and page tables does the job
with local memory references.  I don't know whether the page tables
for the faulting process are allocated in node-local memory.

Per-CPU last-VMA cache: currently we keep the last VMA referenced in a
one-entry cache in the mm_struct.  Can we make this cache per CPU or
per node, so that a multi-threaded application has a node/CPU-local
cache of the last VMA referenced?  This may reduce btree/rbtree
traversal.  Let the hardware cache maintain the corresponding VMA
object and its coherency.

Please let me know your comments and thoughts.

Thanks,
Vaidy

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org