From: William Lee Irwin III
Date: Sat, 7 Aug 2004 01:07:51 -0700
Subject: Re: copy_page_range()
Message-ID: <20040807080751.GX17188@holomorphy.com>
In-Reply-To: <20040807000529.5ca6e8fe.davem@redhat.com>
References: <20040807000529.5ca6e8fe.davem@redhat.com>
To: "David S. Miller"
Cc: torvalds@osdl.org, linux-arch@vger.kernel.org

On Sat, Aug 07, 2004 at 12:05:29AM -0700, David S. Miller wrote:
> Every couple months I look at this thing.
> The main issue is that it's very cache unfriendly,
> especially with how sparsely populated the page tables
> are for 64-bit processes.
> As a simple example, it's at the top of the kernel
> profile for 64-bit lat_proc {fork,exec,shell} on
> sparc64.
> And it's in fact the pmd array scans that take all
> of the cache misses, and thus most of the run time.
> An idea I've always been entertaining is to associate
> a bitmask with each pmd table. For example, a possible
> current implementation could be to abuse page_struct->index
> for this bitmask, and use virt_to_page(pmdp)->index to get
> at it.

Sounds generally reasonable.

On Sat, Aug 07, 2004 at 12:05:29AM -0700, David S. Miller wrote:
> This divides the pmd table into BITS_PER_LONG sections.
> If the bit is set in ->index then we populated at least
> one of the pmd entries in that section. We never clear
> bits, except at pmd table allocation time.
> Then the pmd scan iterates over ->index, and only actually
> dereferences the pmd entries iff it finds a set bit, and
> it only dereferences the section of pmd entries represented
> by that bit.
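To make the quoted ->index scheme concrete, here is a hypothetical userspace model of it; none of these names are real kernel interfaces, PTRS_PER_PMD of 512 is assumed, and the struct simply stands in for a pmd page plus its virt_to_page(pmdp)->index word:

```c
#include <limits.h>
#include <stddef.h>

#define PTRS_PER_PMD	512
#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
/* Each bit of ->index covers this many pmd slots (8 with 64-bit longs). */
#define PMD_SECTION	(PTRS_PER_PMD / BITS_PER_LONG)

/* Stand-in for a pmd page; ->index models virt_to_page(pmdp)->index. */
struct pmd_table {
	unsigned long index;
	void *slots[PTRS_PER_PMD];
};

/* Populate a pmd slot and mark its section; bits are never cleared. */
static void pmd_set(struct pmd_table *t, unsigned int i, void *entry)
{
	t->slots[i] = entry;
	t->index |= 1UL << (i / PMD_SECTION);
}

/*
 * Scan only the sections whose bit is set in ->index, skipping whole
 * BITS_PER_LONG-wide holes.  Returns the number of slots actually
 * dereferenced, to make the cache saving visible.
 */
static unsigned int pmd_scan(struct pmd_table *t, void (*fn)(void *))
{
	unsigned int b, i, touched = 0;

	for (b = 0; b < BITS_PER_LONG; b++) {
		if (!(t->index & (1UL << b)))
			continue;
		for (i = b * PMD_SECTION; i < (b + 1) * PMD_SECTION; i++) {
			touched++;
			if (t->slots[i])
				fn(t->slots[i]);
		}
	}
	return touched;
}
```

With a single populated entry, the scan dereferences PMD_SECTION slots instead of PTRS_PER_PMD, and the ->index word itself stays hot in cache, which is where the lat_proc win would come from.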
> Another idea I've also considered is to implement the
> pgd/pmd levels as a more compact tree, based upon virtual
> address, such as a radix tree.
> I think all of this could be experimented with if we
> abstracted out the pmd/pgd/pte iteration. So much stuff
> in the kernel mm code is of the form:
> 	for_each_pgd(pgdp)
> 		for_each_pmd(pgdp, pmdp)
> 			for_each_pte(pmdp, ptep)
> 				do_something(ptep)
> At 2-levels, as on most of the 32-bit platforms, things
> aren't so bad.
> Comments?

The number of levels can be abstracted easily. Something to give an idea of how might look like this:

struct pte_walk_state {
	pgd_t *pgd;
	pmd_t *pmd;
	pte_t *pte;
	unsigned long vaddr;
};

int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
			struct vm_area_struct *vma)
{
	int cow, ret = 0;
	struct pte_walk_state walk_parent, walk_child;

	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
	spin_lock(&dst->page_table_lock);
	pte_walk_descend_and_create(dst, &walk_child, vma->vm_start);
	for_each_inuse_pte(src, &walk_parent, vma->vm_start, vma->vm_end) {
		if (pte_walk_move_and_create(&walk_child, walk_parent.vaddr)) {
			ret = -ENOMEM;
			break;
		}
		/*
		 * do stuff to child and parent ptes
		 */
		...
	}
	spin_unlock(&dst->page_table_lock);
	return ret;
}

void zap_page_range(struct vm_area_struct *vma, unsigned long start,
			unsigned long len, struct zap_details *details)
{
	struct pte_walk_state walk;

	spin_lock(&vma->vm_mm->page_table_lock);
	for_each_inuse_pte(vma->vm_mm, &walk, start, start + len) {
		/*
		 * wipe pte and do stuff
		 */
		...
	}
	spin_unlock(&vma->vm_mm->page_table_lock);
}

where

#define for_each_inuse_pte(mm, walk, start, end)			\
	for (pte_walk_descend(mm, walk, start); (walk)->vaddr < (end);	\
		next_inuse_pte(walk))

etc.

-- wli