Message-ID: <41205BAA.9030800@sgi.com>
Date: Mon, 16 Aug 2004 02:00:58 -0500
From: Ray Bryant
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030624 Netscape/7.1
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Christoph Lameter
CC: "David S. Miller", linux-ia64@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
References: <20040815130919.44769735.davem@redhat.com> <20040815165827.0c0c8844.davem@redhat.com> <20040815185644.24ecb247.davem@redhat.com>
In-Reply-To:
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

Christoph,

Something else to worry about here is mm->rss. Previously it was updated
only with the page_table_lock held, so concurrent increments were not a
problem. rss may need to be converted to an atomic_t if you use pte
locks. It may be that an approximate value for rss is good enough, but
I'm not sure how to bound the error that could be introduced by a couple
of hundred processors handling page faults in parallel and updating rss
without locking it or making it an atomic_t. (A rough sketch of what I
have in mind is at the end of this mail.)

Christoph Lameter wrote:
> On Sun, 15 Aug 2004, David S. Miller wrote:
>
>>On Sun, 15 Aug 2004 17:11:53 -0700 (PDT)
>>Christoph Lameter wrote:
>>
>>>pgd/pmd walking should always be possible, even without the vma
>>>semaphore, since the CPU can potentially walk the chain at any time.
>>
>>munmap() can destroy pmd and pte tables. Somehow we have
>>to protect against that, and currently that is having the
>>VMA semaphore held for reading, see free_pgtables().
>
> It looks to me like the code takes care to provide the correct
> sequencing so that the integrity of the pgd, pmd and pte links is
> guaranteed from the viewpoint of the MMU in the CPUs. munmap is there
> to protect against one kernel thread messing with the addresses of
> these entities that might be stored in another thread's registers.
>
> Therefore it is safe to walk the chain while holding only the
> semaphore read lock?
>
> If the mmap lock already guarantees the integrity of the pgd, pmd, pte
> system, then pte locking would be okay as long as the integrity of the
> pgd, pmd and pte's is always guaranteed. Then adding a lock bit would
> also work.
>
> So then there are two ways of modifying the pgd, pmd and pte's:
>
> A) A processor obtains the vma semaphore write lock and does large
> scale modifications to the pgd, pmd and pte's.
>
> B) A processor obtains the vma semaphore read lock but is still free
> to do modifications on individual pte's while holding that vma lock.
> There is no need to acquire the page_table_lock. These changes must be
> atomic.
>
> The role of the page_table_lock is then restricted *only* to the
> "struct page" stuff? The comments on handle_mm_fault say that the lock
> is taken for synchronization with kswapd with regard to the pte
> entries. It seems that this use of the page_table_lock is wrong; A or
> B should have been used.
>
> We could simply remove the page_table_lock from handle_mm_fault and
> provide the synchronization with kswapd with pte locks, right? Both
> processes are essentially doing modifications on pte's while holding
> the vma read lock, and I would be changing the way these two processes
> synchronize with each other.
>
> F.e. something along these lines, removing the page_table_lock from
> handle_mm_fault and friends. Surprisingly, this will also avoid many
> rereads of the pte's, since the pte's are really locked. This is just
> for illustrative purposes and unfinished...
>
> Index: linux-2.6.8.1/mm/memory.c
> ===================================================================
> --- linux-2.6.8.1.orig/mm/memory.c      2004-08-15 06:03:04.000000000 -0700
> +++ linux-2.6.8.1/mm/memory.c   2004-08-15 20:26:29.000000000 -0700
> @@ -1035,8 +1035,7 @@
>   * change only once the write actually happens. This avoids a few races,
>   * and potentially makes it more efficient.
>   *
> - * We hold the mm semaphore and the page_table_lock on entry and exit
> - * with the page_table_lock released.
> + * We hold the mm semaphore.
>   */
>  static int do_wp_page(struct mm_struct *mm, struct vm_area_struct * vma,
>          unsigned long address, pte_t *page_table, pmd_t *pmd, pte_t pte)
> @@ -1051,10 +1050,10 @@
>                   * at least the kernel stops what it's doing before it corrupts
>                   * data, but for the moment just pretend this is OOM.
>                   */
> +                ptep_unlock(page_table);
>                  pte_unmap(page_table);
>                  printk(KERN_ERR "do_wp_page: bogus page at address %08lx\n",
>                          address);
> -                spin_unlock(&mm->page_table_lock);
>                  return VM_FAULT_OOM;
>          }
>          old_page = pfn_to_page(pfn);
> @@ -1069,7 +1068,7 @@
>                          ptep_set_access_flags(vma, address, page_table, entry, 1);
>                          update_mmu_cache(vma, address, entry);
>                          pte_unmap(page_table);
> -                        spin_unlock(&mm->page_table_lock);
> +                        /* pte lock unlocked by ptep_set_access_flags */
>                          return VM_FAULT_MINOR;
>                  }
>          }
> @@ -1080,7 +1079,7 @@
>           */
>          if (!PageReserved(old_page))
>                  page_cache_get(old_page);
> -        spin_unlock(&mm->page_table_lock);
> +        ptep_unlock(page_table);
>
>          if (unlikely(anon_vma_prepare(vma)))
>                  goto no_new_page;
> @@ -1090,26 +1089,21 @@
>          copy_cow_page(old_page,new_page,address);
>
>          /*
> -         * Re-check the pte - we dropped the lock
> +         * There is no need to recheck. The pte was locked
>           */
> -        spin_lock(&mm->page_table_lock);
> -        page_table = pte_offset_map(pmd, address);
> -        if (likely(pte_same(*page_table, pte))) {
> -                if (PageReserved(old_page))
> -                        ++mm->rss;
> -                else
> -                        page_remove_rmap(old_page);
> -                break_cow(vma, new_page, address, page_table);
> -                lru_cache_add_active(new_page);
> -                page_add_anon_rmap(new_page, vma, address);
> +        if (PageReserved(old_page))
> +                ++mm->rss;
> +        else
> +                page_remove_rmap(old_page);
> +        break_cow(vma, new_page, address, page_table);
> +        lru_cache_add_active(new_page);
> +        page_add_anon_rmap(new_page, vma, address);
>
> -                /* Free the old page.. */
> -                new_page = old_page;
> -        }
> +        /* Free the old page.. */
> +        new_page = old_page;
>          pte_unmap(page_table);
>          page_cache_release(new_page);
>          page_cache_release(old_page);
> -        spin_unlock(&mm->page_table_lock);
>          return VM_FAULT_MINOR;
>
>  no_new_page:
> @@ -1314,8 +1308,8 @@
>  }
>
>  /*
> - * We hold the mm semaphore and the page_table_lock on entry and
> - * should release the pagetable lock on exit..
> + * We hold the mm semaphore and a pte lock on entry and
> + * should release the pte lock on exit..
>   */
>  static int do_swap_page(struct mm_struct * mm,
>          struct vm_area_struct * vma, unsigned long address,
> @@ -1327,27 +1321,10 @@
>          int ret = VM_FAULT_MINOR;
>
>          pte_unmap(page_table);
> -        spin_unlock(&mm->page_table_lock);
>          page = lookup_swap_cache(entry);
>          if (!page) {
>                  swapin_readahead(entry, address, vma);
>                  page = read_swap_cache_async(entry, vma, address);
> -                if (!page) {
> -                        /*
> -                         * Back out if somebody else faulted in this pte while
> -                         * we released the page table lock.
> -                         */
> -                        spin_lock(&mm->page_table_lock);
> -                        page_table = pte_offset_map(pmd, address);
> -                        if (likely(pte_same(*page_table, orig_pte)))
> -                                ret = VM_FAULT_OOM;
> -                        else
> -                                ret = VM_FAULT_MINOR;
> -                        pte_unmap(page_table);
> -                        spin_unlock(&mm->page_table_lock);
> -                        goto out;
> -                }
> -
>                  /* Had to read the page from swap area: Major fault */
>                  ret = VM_FAULT_MAJOR;
>                  inc_page_state(pgmajfault);
> @@ -1356,21 +1333,6 @@
>          mark_page_accessed(page);
>          lock_page(page);
>
> -        /*
> -         * Back out if somebody else faulted in this pte while we
> -         * released the page table lock.
> -         */
> -        spin_lock(&mm->page_table_lock);
> -        page_table = pte_offset_map(pmd, address);
> -        if (unlikely(!pte_same(*page_table, orig_pte))) {
> -                pte_unmap(page_table);
> -                spin_unlock(&mm->page_table_lock);
> -                unlock_page(page);
> -                page_cache_release(page);
> -                ret = VM_FAULT_MINOR;
> -                goto out;
> -        }
> -
>          /* The page isn't present yet, go ahead with the fault. */
>
>          swap_free(entry);
> @@ -1398,8 +1360,8 @@
>
>          /* No need to invalidate - it was non-present before */
>          update_mmu_cache(vma, address, pte);
> +        ptep_unlock(page_table);
>          pte_unmap(page_table);
> -        spin_unlock(&mm->page_table_lock);
>  out:
>          return ret;
>  }
> @@ -1424,7 +1386,6 @@
>          if (write_access) {
>                  /* Allocate our own private page. */
>                  pte_unmap(page_table);
> -                spin_unlock(&mm->page_table_lock);
>
>                  if (unlikely(anon_vma_prepare(vma)))
>                          goto no_mem;
> @@ -1433,13 +1394,12 @@
>                          goto no_mem;
>                  clear_user_highpage(page, addr);
>
> -                spin_lock(&mm->page_table_lock);
>                  page_table = pte_offset_map(pmd, addr);
>
>                  if (!pte_none(*page_table)) {
> +                        ptep_unlock(page_table);
>                          pte_unmap(page_table);
>                          page_cache_release(page);
> -                        spin_unlock(&mm->page_table_lock);
>                          goto out;
>                  }
>                  mm->rss++;
> @@ -1456,7 +1416,6 @@
>
>          /* No need to invalidate - it was non-present before */
>          update_mmu_cache(vma, addr, entry);
> -        spin_unlock(&mm->page_table_lock);
>  out:
>          return VM_FAULT_MINOR;
>  no_mem:
> @@ -1472,8 +1431,8 @@
>   * As this is called only for pages that do not currently exist, we
>   * do not need to flush old virtual caches or the TLB.
>   *
> - * This is called with the MM semaphore held and the page table
> - * spinlock held. Exit with the spinlock released.
> + * This is called with the MM semaphore held and the pte lock
> + * held. Exit with the pte lock released.
>   */
>  static int
>  do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
> @@ -1489,9 +1448,9 @@
>          if (!vma->vm_ops || !vma->vm_ops->nopage)
>                  return do_anonymous_page(mm, vma, page_table,
>                                          pmd, write_access, address);
> +        ptep_unlock(page_table);
>          pte_unmap(page_table);
> -        spin_unlock(&mm->page_table_lock);
> -
> +
>          if (vma->vm_file) {
>                  mapping = vma->vm_file->f_mapping;
>                  sequence = atomic_read(&mapping->truncate_count);
> @@ -1523,7 +1482,7 @@
>                  anon = 1;
>          }
>
> -        spin_lock(&mm->page_table_lock);
> +        while (ptep_lock(page_table)) ;
>          /*
>           * For a file-backed vma, someone could have truncated or otherwise
>           * invalidated this page.  If unmap_mapping_range got called,
> @@ -1532,7 +1491,7 @@
>          if (mapping &&
>                (unlikely(sequence != atomic_read(&mapping->truncate_count)))) {
>                  sequence = atomic_read(&mapping->truncate_count);
> -                spin_unlock(&mm->page_table_lock);
> +                ptep_unlock(page_table);
>                  page_cache_release(new_page);
>                  goto retry;
>          }
> @@ -1565,15 +1524,15 @@
>                  pte_unmap(page_table);
>          } else {
>                  /* One of our sibling threads was faster, back out. */
> +                ptep_unlock(page_table);
>                  pte_unmap(page_table);
>                  page_cache_release(new_page);
> -                spin_unlock(&mm->page_table_lock);
>                  goto out;
>          }
>
>          /* no need to invalidate: a not-present page shouldn't be cached */
>          update_mmu_cache(vma, address, entry);
> -        spin_unlock(&mm->page_table_lock);
> +        ptep_unlock(page_table);
>  out:
>          return ret;
>  oom:
> @@ -1606,8 +1565,8 @@
>
>          pgoff = pte_to_pgoff(*pte);
>
> +        ptep_unlock(pte);
>          pte_unmap(pte);
> -        spin_unlock(&mm->page_table_lock);
>
>          err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
>          if (err == -ENOMEM)
> @@ -1644,13 +1603,11 @@
>  {
>          pte_t entry;
>
> -        entry = *pte;
> +        entry = *pte;   /* get the unlocked value so that we do not write the lock bit back */
> +
> +        if (ptep_lock(pte)) return VM_FAULT_MINOR;
> +
>          if (!pte_present(entry)) {
> -                /*
> -                 * If it truly wasn't present, we know that kswapd
> -                 * and the PTE updates will not touch it later. So
> -                 * drop the lock.
> -                 */
>                  if (pte_none(entry))
>                          return do_no_page(mm, vma, address, write_access, pte, pmd);
>                  if (pte_file(entry))
> @@ -1668,7 +1625,6 @@
>          ptep_set_access_flags(vma, address, pte, entry, write_access);
>          update_mmu_cache(vma, address, entry);
>          pte_unmap(pte);
> -        spin_unlock(&mm->page_table_lock);
>          return VM_FAULT_MINOR;
>  }
>
> @@ -1688,12 +1644,6 @@
>
>          if (is_vm_hugetlb_page(vma))
>                  return VM_FAULT_SIGBUS;        /* mapping truncation does this. */
> -
> -        /*
> -         * We need the page table lock to synchronize with kswapd
> -         * and the SMP-safe atomic PTE updates.
> -         */
> -        spin_lock(&mm->page_table_lock);
>          pmd = pmd_alloc(mm, pgd, address);
>
>          if (pmd) {
> @@ -1701,7 +1651,6 @@
>                  if (pte)
>                          return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
>          }
> -        spin_unlock(&mm->page_table_lock);
>          return VM_FAULT_OOM;
>  }
>
> Index: linux-2.6.8.1/mm/rmap.c
> ===================================================================
> --- linux-2.6.8.1.orig/mm/rmap.c        2004-08-14 03:56:22.000000000 -0700
> +++ linux-2.6.8.1/mm/rmap.c     2004-08-15 19:59:32.000000000 -0700
> @@ -494,8 +494,14 @@
>
>          /* Nuke the page table entry. */
>          flush_cache_page(vma, address);
> +#ifdef __HAVE_ARCH_PTE_LOCK
> +        /* If we would simply zero the pte then handle_mm_fault might
> +         * race against this code and reinstate an anonymous mapping
> +         */
> +        pteval = ptep_clear_and_lock_flush(vma, address, pte);
> +#else
>          pteval = ptep_clear_flush(vma, address, pte);
> -
> +#endif
>          /* Move the dirty bit to the physical page now the pte is gone. */
>          if (pte_dirty(pteval))
>                  set_page_dirty(page);
> @@ -508,9 +514,13 @@
>                   */
>                  BUG_ON(!PageSwapCache(page));
>                  swap_duplicate(entry);
> +                /* This is going to clear the lock that may have been set on the pte */
>                  set_pte(pte, swp_entry_to_pte(entry));
>                  BUG_ON(pte_file(*pte));
>          }
> +#ifdef __HAVE_ARCH_PTE_LOCK
> +        else ptep_unlock(pte);
> +#endif
>
>          mm->rss--;
>          BUG_ON(!page->mapcount);
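
To make the rss concern above concrete, here is roughly the conversion I
have in mind. This is only a sketch against 2.6.8.1, not a real patch:
the type change would have to touch every reader and writer of mm->rss
in the tree, and "task_statm() and friends" below stands in for all of
the /proc readers:

        /* In struct mm_struct, declare rss as an atomic counter:
         *
         *         atomic_t rss;        (was: unsigned long rss;)
         *
         * Fault-path increments then no longer depend on the
         * page_table_lock, e.g. in do_anonymous_page():
         */
        atomic_inc(&mm->rss);           /* was: mm->rss++; */

        /* Unmap paths such as rmap.c:try_to_unmap_one(): */
        atomic_dec(&mm->rss);           /* was: mm->rss--; */

        /* Readers, e.g. fs/proc via task_statm() and friends: */
        rss = atomic_read(&mm->rss);

atomic_t is only 32 bits, but since rss counts pages that should still
be plenty, even on our large ia64 boxes.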
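And since the patch uses ptep_lock()/ptep_unlock() without showing an
implementation, here is how I imagine they could work on an architecture
that has a spare software pte bit. The names and the bit number are made
up for illustration; the real lock bit would have to be one the hardware
page table walker ignores:

        /* Hypothetical pte lock built on one software-available pte bit. */
        #define _PAGE_LOCK_BIT  62      /* must be ignored by the MMU */

        /*
         * Try to lock the pte. Returns 0 if the lock was acquired and
         * nonzero if somebody else already holds it, matching the uses
         * above: "if (ptep_lock(pte)) return VM_FAULT_MINOR;" and
         * "while (ptep_lock(page_table)) ;".
         */
        static inline int ptep_lock(pte_t *ptep)
        {
                return test_and_set_bit(_PAGE_LOCK_BIT,
                                        (unsigned long *)ptep);
        }

        static inline void ptep_unlock(pte_t *ptep)
        {
                /* order the pte modifications before the release */
                smp_mb__before_clear_bit();
                clear_bit(_PAGE_LOCK_BIT, (unsigned long *)ptep);
        }

ptep_clear_and_lock_flush() would presumably be the same idea folded
into ptep_clear_flush(). In any case, the rss accounting is the piece I
would sort out first.

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com              raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------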