public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [patch 0/7] lockless pagecache 2
@ 2005-08-11 12:18 Nick Piggin
  2005-08-11 12:21 ` [patch 1/7] mm: remove PageReserved rollup Nick Piggin
  0 siblings, 1 reply; 16+ messages in thread
From: Nick Piggin @ 2005-08-11 12:18 UTC (permalink / raw)
  To: Paul McKenney, Dipankar Sarma, Ingo Molnar, linux-kernel

This is my second attempt at a lockless pagecache.

Patches are against 2.6.13-rc6, and have had reasonable
stressing (albeit on small SMPs).

Main changes since last seen:
* Code clarity and commenting improvement.

* Fix race where multiple concurrent failed speculative
   reference takers could be confused into thinking a free
   page wasn't free, due to the elevated refcounts.

* Convert radix tree node freeing over to RCU. I completely
   missed this problem in my first attempt. (My first real
   RCU attempt - completely wrong?).

* page_cache_get_speculative previously used only preempt_
   disable to stop the current CPU from entering the page
   allocator. Needs to turn off interrupts too.

   Paul picked this bug up without seeing the code, just a
   vague description of what I was trying to do. All I
   picked up was my jaw from the ground ;)

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [patch 1/7] mm: remove PageReserved rollup
  2005-08-11 12:18 [patch 0/7] lockless pagecache 2 Nick Piggin
@ 2005-08-11 12:21 ` Nick Piggin
  2005-08-11 12:22   ` [patch 2/7] mm: PG_free flag Nick Piggin
  0 siblings, 1 reply; 16+ messages in thread
From: Nick Piggin @ 2005-08-11 12:21 UTC (permalink / raw)
  To: Paul McKenney; +Cc: Dipankar Sarma, Ingo Molnar, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 202 bytes --]

1/7

This rollup is a patchset all on its own. There is
a recent thread on linux-kernel if it interests you.

Required by lockless pagecache for consistent page
refcounting

-- 
SUSE Labs, Novell Inc.


[-- Attachment #2: mm-remove-PageReserved-rollup.patch --]
[-- Type: text/plain, Size: 27403 bytes --]

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -442,22 +442,17 @@ int page_referenced(struct page *page, i
 void page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
 {
-	struct anon_vma *anon_vma = vma->anon_vma;
-	pgoff_t index;
-
-	BUG_ON(PageReserved(page));
-	BUG_ON(!anon_vma);
-
 	inc_mm_counter(vma->vm_mm, anon_rss);
 
-	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
-	index = (address - vma->vm_start) >> PAGE_SHIFT;
-	index += vma->vm_pgoff;
-	index >>= PAGE_CACHE_SHIFT - PAGE_SHIFT;
-
 	if (atomic_inc_and_test(&page->_mapcount)) {
-		page->index = index;
+		struct anon_vma *anon_vma = vma->anon_vma;
+
+		BUG_ON(!anon_vma);
+		anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
 		page->mapping = (struct address_space *) anon_vma;
+
+		page->index = linear_page_index(vma, address);
+
 		inc_page_state(nr_mapped);
 	}
 	/* else checking page index and mapping is racy */
@@ -472,8 +467,7 @@ void page_add_anon_rmap(struct page *pag
 void page_add_file_rmap(struct page *page)
 {
 	BUG_ON(PageAnon(page));
-	if (!pfn_valid(page_to_pfn(page)) || PageReserved(page))
-		return;
+	BUG_ON(!pfn_valid(page_to_pfn(page)));
 
 	if (atomic_inc_and_test(&page->_mapcount))
 		inc_page_state(nr_mapped);
@@ -487,8 +481,6 @@ void page_add_file_rmap(struct page *pag
  */
 void page_remove_rmap(struct page *page)
 {
-	BUG_ON(PageReserved(page));
-
 	if (atomic_add_negative(-1, &page->_mapcount)) {
 		BUG_ON(page_mapcount(page) < 0);
 		/*
@@ -532,6 +524,8 @@ static int try_to_unmap_one(struct page 
 	 * If the page is mlock()d, we cannot swap it out.
 	 * If it's recently referenced (perhaps page_referenced
 	 * skipped over this mm) then we should reactivate it.
+	 *
+	 * Pages belonging to VM_RESERVED regions should not happen here.
 	 */
 	if ((vma->vm_flags & (VM_LOCKED|VM_RESERVED)) ||
 			ptep_clear_flush_young(vma, address, pte)) {
@@ -644,13 +638,13 @@ static void try_to_unmap_cluster(unsigne
 			continue;
 
 		pfn = pte_pfn(*pte);
-		if (!pfn_valid(pfn))
+		if (unlikely(!pfn_valid(pfn))) {
+			invalid_pfn(*pte, vma->vm_flags, address);
 			continue;
+		}
 
 		page = pfn_to_page(pfn);
 		BUG_ON(PageAnon(page));
-		if (PageReserved(page))
-			continue;
 
 		if (ptep_clear_flush_young(vma, address, pte))
 			continue;
@@ -813,7 +807,6 @@ int try_to_unmap(struct page *page)
 {
 	int ret;
 
-	BUG_ON(PageReserved(page));
 	BUG_ON(!PageLocked(page));
 
 	if (PageAnon(page))
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -113,7 +113,8 @@ static void bad_page(const char *functio
 			1 << PG_reclaim |
 			1 << PG_slab    |
 			1 << PG_swapcache |
-			1 << PG_writeback);
+			1 << PG_writeback |
+			1 << PG_reserved );
 	set_page_count(page, 0);
 	reset_page_mapcount(page);
 	page->mapping = NULL;
@@ -243,7 +244,6 @@ static inline int page_is_buddy(struct p
 {
        if (PagePrivate(page)           &&
            (page_order(page) == order) &&
-           !PageReserved(page)         &&
             page_count(page) == 0)
                return 1;
        return 0;
@@ -326,10 +326,11 @@ static inline void free_pages_check(cons
 			1 << PG_reclaim	|
 			1 << PG_slab	|
 			1 << PG_swapcache |
-			1 << PG_writeback )))
+			1 << PG_writeback |
+			1 << PG_reserved )))
 		bad_page(function, page);
 	if (PageDirty(page))
-		ClearPageDirty(page);
+		__ClearPageDirty(page);
 }
 
 /*
@@ -454,7 +455,8 @@ static void prep_new_page(struct page *p
 			1 << PG_reclaim	|
 			1 << PG_slab    |
 			1 << PG_swapcache |
-			1 << PG_writeback )))
+			1 << PG_writeback |
+			1 << PG_reserved )))
 		bad_page(__FUNCTION__, page);
 
 	page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
@@ -1011,7 +1013,7 @@ void __pagevec_free(struct pagevec *pvec
 
 fastcall void __free_pages(struct page *page, unsigned int order)
 {
-	if (!PageReserved(page) && put_page_testzero(page)) {
+	if (put_page_testzero(page)) {
 		if (order == 0)
 			free_hot_page(page);
 		else
@@ -1653,7 +1655,7 @@ void __init memmap_init_zone(unsigned lo
 			continue;
 		page = pfn_to_page(pfn);
 		set_page_links(page, zone, nid, pfn);
-		set_page_count(page, 0);
+		set_page_count(page, 1);
 		reset_page_mapcount(page);
 		SetPageReserved(page);
 		INIT_LIST_HEAD(&page->lru);
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -194,6 +194,7 @@ extern void __mod_page_state(unsigned lo
 #define SetPageDirty(page)	set_bit(PG_dirty, &(page)->flags)
 #define TestSetPageDirty(page)	test_and_set_bit(PG_dirty, &(page)->flags)
 #define ClearPageDirty(page)	clear_bit(PG_dirty, &(page)->flags)
+#define __ClearPageDirty(page)	__clear_bit(PG_dirty, &(page)->flags)
 #define TestClearPageDirty(page) test_and_clear_bit(PG_dirty, &(page)->flags)
 
 #define SetPageLRU(page)	set_bit(PG_lru, &(page)->flags)
Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c
+++ linux-2.6/mm/mremap.c
@@ -141,6 +141,10 @@ move_one_page(struct vm_area_struct *vma
 			if (dst) {
 				pte_t pte;
 				pte = ptep_clear_flush(vma, old_addr, src);
+				/* ZERO_PAGE can be dependant on virtual addr */
+				if (pfn_valid(pte_pfn(pte)) &&
+					pte_page(pte) == ZERO_PAGE(old_addr))
+					pte = pte_wrprotect(mk_pte(ZERO_PAGE(new_addr), new_vma->vm_page_prot));
 				set_pte_at(mm, new_addr, dst, pte);
 			} else
 				error = -ENOMEM;
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -156,7 +156,8 @@ extern unsigned int kobjsize(const void 
 
 #define VM_DONTCOPY	0x00020000      /* Do not copy this vma on fork */
 #define VM_DONTEXPAND	0x00040000	/* Cannot expand with mremap() */
-#define VM_RESERVED	0x00080000	/* Don't unmap it from swap_out */
+#define VM_RESERVED	0x00080000	/* Pages and ptes in region aren't managed with regular pagecache or rmap routines */
+
 #define VM_ACCOUNT	0x00100000	/* Is a VM accounted object */
 #define VM_HUGETLB	0x00400000	/* Huge TLB Page VM */
 #define VM_NONLINEAR	0x00800000	/* Is non-linear (remap_file_pages) */
@@ -337,7 +338,7 @@ static inline void get_page(struct page 
 
 static inline void put_page(struct page *page)
 {
-	if (!PageReserved(page) && put_page_testzero(page))
+	if (put_page_testzero(page))
 		__page_cache_release(page);
 }
 
@@ -723,6 +724,9 @@ void install_arg_page(struct vm_area_str
 
 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
 		int len, int write, int force, struct page **pages, struct vm_area_struct **vmas);
+#define invalid_pfn(pte, vm_flags, vaddr)		\
+		__invalid_pfn(__FUNCTION__, pte, vm_flags, vaddr)
+void __invalid_pfn(const char *, pte_t, unsigned long, unsigned long);
 
 int __set_page_dirty_buffers(struct page *page);
 int __set_page_dirty_nobuffers(struct page *page);
Index: linux-2.6/mm/madvise.c
===================================================================
--- linux-2.6.orig/mm/madvise.c
+++ linux-2.6/mm/madvise.c
@@ -123,7 +123,7 @@ static long madvise_dontneed(struct vm_a
 			     unsigned long start, unsigned long end)
 {
 	*prev = vma;
-	if ((vma->vm_flags & VM_LOCKED) || is_vm_hugetlb_page(vma))
+	if ((vma->vm_flags & (VM_LOCKED|VM_RESERVED)) || is_vm_hugetlb_page(vma))
 		return -EINVAL;
 
 	if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -333,6 +333,21 @@ out:
 }
 
 /*
+ * This function is called to print an error when a pte in a
+ * !VM_RESERVED region is found pointing to an invalid pfn (which
+ * is an error.
+ *
+ * The calling function must still handle the error.
+ */
+void __invalid_pfn(const char *errfunc, pte_t pte,
+				unsigned long vm_flags, unsigned long vaddr)
+{
+	printk(KERN_ERR "%s: pte does not point to valid memory. "
+		"process = %s, pte = %08lx, vm_flags = %lx, vaddr = %lx\n",
+		errfunc, current->comm, (long)pte_val(pte), vm_flags, vaddr);
+}
+
+/*
  * copy one vm_area from one task to the other. Assumes the page tables
  * already present in the new task to be cleared in the whole range
  * covered by this vma.
@@ -361,25 +376,29 @@ copy_one_pte(struct mm_struct *dst_mm, s
 				spin_unlock(&mmlist_lock);
 			}
 		}
-		set_pte_at(dst_mm, addr, dst_pte, pte);
-		return;
+		goto out_set_pte;
 	}
 
+	/* If the region is VM_RESERVED, the mapping is not
+	 * mapped via rmap - duplicate the pte as is.
+	 */
+	if (vm_flags & VM_RESERVED)
+		goto out_set_pte;
+
+	/* If the pte points outside of valid memory but
+	 * the region is not VM_RESERVED, we have a problem.
+	 */
 	pfn = pte_pfn(pte);
-	/* the pte points outside of valid memory, the
-	 * mapping is assumed to be good, meaningful
-	 * and not mapped via rmap - duplicate the
-	 * mapping as is.
-	 */
-	page = NULL;
-	if (pfn_valid(pfn))
-		page = pfn_to_page(pfn);
-
-	if (!page || PageReserved(page)) {
-		set_pte_at(dst_mm, addr, dst_pte, pte);
-		return;
+	if (unlikely(!pfn_valid(pfn))) {
+		invalid_pfn(pte, vm_flags, addr);
+		goto out_set_pte; /* try to do something sane */
 	}
 
+	page = pfn_to_page(pfn);
+	/* Mappings to zero pages aren't covered by rmap either. */
+	if (page == ZERO_PAGE(addr))
+		goto out_set_pte;
+
 	/*
 	 * If it's a COW mapping, write protect it both
 	 * in the parent and the child
@@ -400,8 +419,9 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	inc_mm_counter(dst_mm, rss);
 	if (PageAnon(page))
 		inc_mm_counter(dst_mm, anon_rss);
-	set_pte_at(dst_mm, addr, dst_pte, pte);
 	page_dup_rmap(page);
+out_set_pte:
+	set_pte_at(dst_mm, addr, dst_pte, pte);
 }
 
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -514,7 +534,8 @@ int copy_page_range(struct mm_struct *ds
 	return 0;
 }
 
-static void zap_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
+static void zap_pte_range(struct mmu_gather *tlb,
+				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
 				struct zap_details *details)
 {
@@ -527,11 +548,15 @@ static void zap_pte_range(struct mmu_gat
 			continue;
 		if (pte_present(ptent)) {
 			struct page *page = NULL;
-			unsigned long pfn = pte_pfn(ptent);
-			if (pfn_valid(pfn)) {
-				page = pfn_to_page(pfn);
-				if (PageReserved(page))
-					page = NULL;
+			if (!(vma->vm_flags & VM_RESERVED)) {
+				unsigned long pfn = pte_pfn(ptent);
+				if (unlikely(!pfn_valid(pfn))) {
+					invalid_pfn(ptent, vma->vm_flags, addr);
+				} else {
+					page = pfn_to_page(pfn);
+					if (page == ZERO_PAGE(addr))
+						page = NULL;
+				}
 			}
 			if (unlikely(details) && page) {
 				/*
@@ -584,7 +609,8 @@ static void zap_pte_range(struct mmu_gat
 	pte_unmap(pte - 1);
 }
 
-static inline void zap_pmd_range(struct mmu_gather *tlb, pud_t *pud,
+static inline void zap_pmd_range(struct mmu_gather *tlb,
+				struct vm_area_struct *vma, pud_t *pud,
 				unsigned long addr, unsigned long end,
 				struct zap_details *details)
 {
@@ -596,11 +622,12 @@ static inline void zap_pmd_range(struct 
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		zap_pte_range(tlb, pmd, addr, next, details);
+		zap_pte_range(tlb, vma, pmd, addr, next, details);
 	} while (pmd++, addr = next, addr != end);
 }
 
-static inline void zap_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
+static inline void zap_pud_range(struct mmu_gather *tlb,
+				struct vm_area_struct *vma, pgd_t *pgd,
 				unsigned long addr, unsigned long end,
 				struct zap_details *details)
 {
@@ -612,7 +639,7 @@ static inline void zap_pud_range(struct 
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
-		zap_pmd_range(tlb, pud, addr, next, details);
+		zap_pmd_range(tlb, vma, pud, addr, next, details);
 	} while (pud++, addr = next, addr != end);
 }
 
@@ -633,7 +660,7 @@ static void unmap_page_range(struct mmu_
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		zap_pud_range(tlb, pgd, addr, next, details);
+		zap_pud_range(tlb, vma, pgd, addr, next, details);
 	} while (pgd++, addr = next, addr != end);
 	tlb_end_vma(tlb, vma);
 }
@@ -933,7 +960,7 @@ int get_user_pages(struct task_struct *t
 			continue;
 		}
 
-		if (!vma || (vma->vm_flags & VM_IO)
+		if (!vma || (vma->vm_flags & (VM_IO | VM_RESERVED))
 				|| !(flags & vma->vm_flags))
 			return i ? : -EFAULT;
 
@@ -993,8 +1020,7 @@ int get_user_pages(struct task_struct *t
 			if (pages) {
 				pages[i] = page;
 				flush_dcache_page(page);
-				if (!PageReserved(page))
-					page_cache_get(page);
+				page_cache_get(page);
 			}
 			if (vmas)
 				vmas[i] = vma;
@@ -1098,8 +1124,7 @@ static int remap_pte_range(struct mm_str
 		return -ENOMEM;
 	do {
 		BUG_ON(!pte_none(*pte));
-		if (!pfn_valid(pfn) || PageReserved(pfn_to_page(pfn)))
-			set_pte_at(mm, addr, pte, pfn_pte(pfn, prot));
+		set_pte_at(mm, addr, pte, pfn_pte(pfn, prot));
 		pfn++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	pte_unmap(pte - 1);
@@ -1239,6 +1264,8 @@ static int do_wp_page(struct mm_struct *
 	pte_t entry;
 	int ret;
 
+	BUG_ON(vma->vm_flags & VM_RESERVED);
+
 	if (unlikely(!pfn_valid(pfn))) {
 		/*
 		 * This should really halt the system so it can be debugged or
@@ -1246,9 +1273,8 @@ static int do_wp_page(struct mm_struct *
 		 * data, but for the moment just pretend this is OOM.
 		 */
 		pte_unmap(page_table);
-		printk(KERN_ERR "do_wp_page: bogus page at address %08lx\n",
-				address);
 		spin_unlock(&mm->page_table_lock);
+		invalid_pfn(pte, vma->vm_flags, address);
 		return VM_FAULT_OOM;
 	}
 	old_page = pfn_to_page(pfn);
@@ -1273,13 +1299,16 @@ static int do_wp_page(struct mm_struct *
 	/*
 	 * Ok, we need to copy. Oh, well..
 	 */
-	if (!PageReserved(old_page))
+	if (old_page == ZERO_PAGE(address))
+		old_page = NULL;
+	else
 		page_cache_get(old_page);
+
 	spin_unlock(&mm->page_table_lock);
 
 	if (unlikely(anon_vma_prepare(vma)))
 		goto no_new_page;
-	if (old_page == ZERO_PAGE(address)) {
+	if (old_page == NULL) {
 		new_page = alloc_zeroed_user_highpage(vma, address);
 		if (!new_page)
 			goto no_new_page;
@@ -1296,12 +1325,13 @@ static int do_wp_page(struct mm_struct *
 	spin_lock(&mm->page_table_lock);
 	page_table = pte_offset_map(pmd, address);
 	if (likely(pte_same(*page_table, pte))) {
-		if (PageAnon(old_page))
-			dec_mm_counter(mm, anon_rss);
-		if (PageReserved(old_page))
+		if (old_page == NULL)
 			inc_mm_counter(mm, rss);
-		else
+		else {
 			page_remove_rmap(old_page);
+			if (PageAnon(old_page))
+				dec_mm_counter(mm, anon_rss);
+		}
 		flush_cache_page(vma, address, pfn);
 		break_cow(vma, new_page, address, page_table);
 		lru_cache_add_active(new_page);
@@ -1312,13 +1342,16 @@ static int do_wp_page(struct mm_struct *
 		ret |= VM_FAULT_WRITE;
 	}
 	pte_unmap(page_table);
-	page_cache_release(new_page);
-	page_cache_release(old_page);
+	if (old_page) {
+		page_cache_release(new_page);
+		page_cache_release(old_page);
+	}
 	spin_unlock(&mm->page_table_lock);
 	return ret;
 
 no_new_page:
-	page_cache_release(old_page);
+	if (old_page)
+		page_cache_release(old_page);
 	return VM_FAULT_OOM;
 }
 
@@ -1755,7 +1788,7 @@ do_anonymous_page(struct mm_struct *mm, 
 	struct page * page = ZERO_PAGE(addr);
 
 	/* Read-only mapping of ZERO_PAGE. */
-	entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
+	entry = pte_wrprotect(mk_pte(page, vma->vm_page_prot));
 
 	/* ..except if it's a write access */
 	if (write_access) {
@@ -1894,9 +1927,6 @@ retry:
 	 */
 	/* Only go through if we didn't race with anybody else... */
 	if (pte_none(*page_table)) {
-		if (!PageReserved(new_page))
-			inc_mm_counter(mm, rss);
-
 		flush_icache_page(vma, new_page);
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		if (write_access)
@@ -1905,8 +1935,10 @@ retry:
 		if (anon) {
 			lru_cache_add_active(new_page);
 			page_add_anon_rmap(new_page, vma, address);
-		} else
+		} else if (!(vma->vm_flags & VM_RESERVED)) {
 			page_add_file_rmap(new_page);
+			inc_mm_counter(mm, rss);
+		}
 		pte_unmap(page_table);
 	} else {
 		/* One of our sibling threads was faster, back out. */
Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c
+++ linux-2.6/mm/swap.c
@@ -48,7 +48,7 @@ void put_page(struct page *page)
 		}
 		return;
 	}
-	if (!PageReserved(page) && put_page_testzero(page))
+	if (put_page_testzero(page))
 		__page_cache_release(page);
 }
 EXPORT_SYMBOL(put_page);
@@ -215,7 +215,7 @@ void release_pages(struct page **pages, 
 		struct page *page = pages[i];
 		struct zone *pagezone;
 
-		if (PageReserved(page) || !put_page_testzero(page))
+		if (!put_page_testzero(page))
 			continue;
 
 		pagezone = page_zone(page);
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c
+++ linux-2.6/mm/fremap.c
@@ -29,18 +29,21 @@ static inline void zap_pte(struct mm_str
 		return;
 	if (pte_present(pte)) {
 		unsigned long pfn = pte_pfn(pte);
+		struct page *page;
 
 		flush_cache_page(vma, addr, pfn);
 		pte = ptep_clear_flush(vma, addr, ptep);
-		if (pfn_valid(pfn)) {
-			struct page *page = pfn_to_page(pfn);
-			if (!PageReserved(page)) {
-				if (pte_dirty(pte))
-					set_page_dirty(page);
-				page_remove_rmap(page);
-				page_cache_release(page);
-				dec_mm_counter(mm, rss);
-			}
+		if (unlikely(!pfn_valid(pfn))) {
+			invalid_pfn(pte, vma->vm_flags, addr);
+			return;
+		}
+		page = pfn_to_page(pfn);
+		if (page != ZERO_PAGE(addr)) {
+			if (pte_dirty(pte))
+				set_page_dirty(page);
+			page_remove_rmap(page);
+			dec_mm_counter(mm, rss);
+			page_cache_release(page);
 		}
 	} else {
 		if (!pte_file(pte))
@@ -65,6 +68,8 @@ int install_page(struct mm_struct *mm, s
 	pgd_t *pgd;
 	pte_t pte_val;
 
+	BUG_ON(vma->vm_flags & VM_RESERVED);
+
 	pgd = pgd_offset(mm, addr);
 	spin_lock(&mm->page_table_lock);
 	
@@ -122,6 +127,8 @@ int install_file_pte(struct mm_struct *m
 	pgd_t *pgd;
 	pte_t pte_val;
 
+	BUG_ON(vma->vm_flags & VM_RESERVED);
+
 	pgd = pgd_offset(mm, addr);
 	spin_lock(&mm->page_table_lock);
 	
Index: linux-2.6/mm/msync.c
===================================================================
--- linux-2.6.orig/mm/msync.c
+++ linux-2.6/mm/msync.c
@@ -37,11 +37,11 @@ static void sync_pte_range(struct vm_are
 		if (!pte_maybe_dirty(*pte))
 			continue;
 		pfn = pte_pfn(*pte);
-		if (!pfn_valid(pfn))
+		if (unlikely(!pfn_valid(pfn))) {
+			invalid_pfn(*pte, vma->vm_flags, addr);
 			continue;
+		}
 		page = pfn_to_page(pfn);
-		if (PageReserved(page))
-			continue;
 
 		if (ptep_clear_flush_dirty(vma, addr, pte) ||
 		    page_test_and_clear_dirty(page))
@@ -149,6 +149,9 @@ static int msync_interval(struct vm_area
 	if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED))
 		return -EBUSY;
 
+	if (vma->vm_flags & VM_RESERVED)
+		return -EINVAL;
+
 	if (file && (vma->vm_flags & VM_SHARED)) {
 		filemap_sync(vma, addr, end);
 
Index: linux-2.6/drivers/scsi/sg.c
===================================================================
--- linux-2.6.orig/drivers/scsi/sg.c
+++ linux-2.6/drivers/scsi/sg.c
@@ -1887,13 +1887,17 @@ st_unmap_user_pages(struct scatterlist *
 	int i;
 
 	for (i=0; i < nr_pages; i++) {
-		if (dirtied && !PageReserved(sgl[i].page))
-			SetPageDirty(sgl[i].page);
-		/* unlock_page(sgl[i].page); */
+		struct page *page = sgl[i].page;
+
+		/* XXX: just for debug. Remove when PageReserved is removed */
+		BUG_ON(PageReserved(page));
+		if (dirtied)
+			SetPageDirty(page);
+		/* unlock_page(page); */
 		/* FIXME: cache flush missing for rw==READ
 		 * FIXME: call the correct reference counting function
 		 */
-		page_cache_release(sgl[i].page);
+		page_cache_release(page);
 	}
 
 	return 0;
Index: linux-2.6/drivers/scsi/st.c
===================================================================
--- linux-2.6.orig/drivers/scsi/st.c
+++ linux-2.6/drivers/scsi/st.c
@@ -4431,12 +4431,16 @@ static int sgl_unmap_user_pages(struct s
 	int i;
 
 	for (i=0; i < nr_pages; i++) {
-		if (dirtied && !PageReserved(sgl[i].page))
-			SetPageDirty(sgl[i].page);
+		struct page *page = sgl[i].page;
+
+		/* XXX: just for debug. Remove when PageReserved is removed */
+		BUG_ON(PageReserved(page));
+		if (dirtied)
+			SetPageDirty(page);
 		/* FIXME: cache flush missing for rw==READ
 		 * FIXME: call the correct reference counting function
 		 */
-		page_cache_release(sgl[i].page);
+		page_cache_release(page);
 	}
 
 	return 0;
Index: linux-2.6/sound/core/pcm_native.c
===================================================================
--- linux-2.6.orig/sound/core/pcm_native.c
+++ linux-2.6/sound/core/pcm_native.c
@@ -2944,8 +2944,7 @@ static struct page * snd_pcm_mmap_status
 		return NOPAGE_OOM;
 	runtime = substream->runtime;
 	page = virt_to_page(runtime->status);
-	if (!PageReserved(page))
-		get_page(page);
+	get_page(page);
 	if (type)
 		*type = VM_FAULT_MINOR;
 	return page;
@@ -2987,8 +2986,7 @@ static struct page * snd_pcm_mmap_contro
 		return NOPAGE_OOM;
 	runtime = substream->runtime;
 	page = virt_to_page(runtime->control);
-	if (!PageReserved(page))
-		get_page(page);
+	get_page(page);
 	if (type)
 		*type = VM_FAULT_MINOR;
 	return page;
@@ -3061,8 +3059,7 @@ static struct page *snd_pcm_mmap_data_no
 		vaddr = runtime->dma_area + offset;
 		page = virt_to_page(vaddr);
 	}
-	if (!PageReserved(page))
-		get_page(page);
+	get_page(page);
 	if (type)
 		*type = VM_FAULT_MINOR;
 	return page;
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -1077,6 +1077,17 @@ munmap_back:
 		error = file->f_op->mmap(file, vma);
 		if (error)
 			goto unmap_and_free_vma;
+		if ((vma->vm_flags & (VM_SHARED | VM_WRITE | VM_RESERVED))
+						== (VM_WRITE | VM_RESERVED)) {
+			printk(KERN_WARNING "program %s is using MAP_PRIVATE, "
+				"PROT_WRITE mmap of VM_RESERVED memory, which "
+				"is deprecated. Please report this to "
+				"linux-kernel@vger.kernel.org\n",current->comm);
+			if (vma->vm_ops && vma->vm_ops->close)
+				vma->vm_ops->close(vma);
+			error = -EACCES;
+			goto unmap_and_free_vma;
+		}
 	} else if (vm_flags & VM_SHARED) {
 		error = shmem_zero_setup(vma);
 		if (error)
Index: linux-2.6/mm/mprotect.c
===================================================================
--- linux-2.6.orig/mm/mprotect.c
+++ linux-2.6/mm/mprotect.c
@@ -131,6 +131,14 @@ mprotect_fixup(struct vm_area_struct *vm
 				return -ENOMEM;
 			newflags |= VM_ACCOUNT;
 		}
+		if (oldflags & VM_RESERVED) {
+			BUG_ON(oldflags & VM_WRITE);
+			printk(KERN_WARNING "program %s is using MAP_PRIVATE, "
+				"PROT_WRITE mprotect of VM_RESERVED memory, "
+				"which is deprecated. Please report this to "
+				"linux-kernel@vger.kernel.org\n",current->comm);
+			return -EACCES;
+		}
 	}
 
 	newprot = protection_map[newflags & 0xf];
Index: linux-2.6/mm/bootmem.c
===================================================================
--- linux-2.6.orig/mm/bootmem.c
+++ linux-2.6/mm/bootmem.c
@@ -297,6 +297,7 @@ static unsigned long __init free_all_boo
 				if (j + 16 < BITS_PER_LONG)
 					prefetchw(page + j + 16);
 				__ClearPageReserved(page + j);
+				set_page_count(page + j, 0);
 			}
 			__free_pages(page, order);
 			i += BITS_PER_LONG;
Index: linux-2.6/mm/mempolicy.c
===================================================================
--- linux-2.6.orig/mm/mempolicy.c
+++ linux-2.6/mm/mempolicy.c
@@ -253,8 +253,10 @@ static int check_pte_range(struct mm_str
 		if (!pte_present(*pte))
 			continue;
 		pfn = pte_pfn(*pte);
-		if (!pfn_valid(pfn))
+		if (unlikely(!pfn_valid(pfn))) {
+			invalid_pfn(*pte, -1UL, addr);
 			continue;
+		}
 		nid = pfn_to_nid(pfn);
 		if (!test_bit(nid, nodes))
 			break;
@@ -326,6 +328,8 @@ check_range(struct mm_struct *mm, unsign
 	first = find_vma(mm, start);
 	if (!first)
 		return ERR_PTR(-EFAULT);
+	if (first->vm_flags & VM_RESERVED)
+		return ERR_PTR(-EACCES);
 	prev = NULL;
 	for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
 		if (!vma->vm_next && vma->vm_end < end)
Index: linux-2.6/arch/ppc64/kernel/vdso.c
===================================================================
--- linux-2.6.orig/arch/ppc64/kernel/vdso.c
+++ linux-2.6/arch/ppc64/kernel/vdso.c
@@ -176,13 +176,13 @@ static struct page * vdso_vma_nopage(str
 		return NOPAGE_SIGBUS;
 
 	/*
-	 * Last page is systemcfg, special handling here, no get_page() a
-	 * this is a reserved page
+	 * Last page is systemcfg.
 	 */
 	if ((vma->vm_end - address) <= PAGE_SIZE)
-		return virt_to_page(systemcfg);
+		pg = virt_to_page(systemcfg);
+	else
+		pg = virt_to_page(vbase + offset);
 
-	pg = virt_to_page(vbase + offset);
 	get_page(pg);
 	DBG(" ->page count: %d\n", page_count(pg));
 
@@ -600,6 +600,8 @@ void __init vdso_init(void)
 		ClearPageReserved(pg);
 		get_page(pg);
 	}
+
+	get_page(virt_to_page(systemcfg));
 }
 
 int in_gate_area_no_task(unsigned long addr)
Index: linux-2.6/kernel/power/swsusp.c
===================================================================
--- linux-2.6.orig/kernel/power/swsusp.c
+++ linux-2.6/kernel/power/swsusp.c
@@ -434,15 +434,23 @@ static int save_highmem_zone(struct zone
 			continue;
 		page = pfn_to_page(pfn);
 		/*
-		 * This condition results from rvmalloc() sans vmalloc_32()
-		 * and architectural memory reservations. This should be
-		 * corrected eventually when the cases giving rise to this
-		 * are better understood.
+		 * PageReserved results from rvmalloc() sans vmalloc_32()
+		 * and architectural memory reservations.
+		 *
+		 * rvmalloc should not cause this, because all implementations
+		 * appear to always be using vmalloc_32 on architectures with
+		 * highmem. This is a good thing, because we would like to save
+		 * rvmalloc pages.
+		 *
+		 * It appears to be triggered by pages which do not point to
+		 * valid memory (see arch/i386/mm/init.c:one_highpage_init(),
+		 * which sets PageReserved if the page does not point to valid
+		 * RAM.
+		 *
+		 * XXX: must remove usage of PageReserved!
 		 */
-		if (PageReserved(page)) {
-			printk("highmem reserved page?!\n");
+		if (PageReserved(page))
 			continue;
-		}
 		BUG_ON(PageNosave(page));
 		if (PageNosaveFree(page))
 			continue;
@@ -528,10 +536,9 @@ static int saveable(struct zone * zone, 
 		return 0;
 
 	page = pfn_to_page(pfn);
-	BUG_ON(PageReserved(page) && PageNosave(page));
 	if (PageNosave(page))
 		return 0;
-	if (PageReserved(page) && pfn_is_nosave(pfn)) {
+	if (pfn_is_nosave(pfn)) {
 		pr_debug("[nosave pfn 0x%lx]", pfn);
 		return 0;
 	}
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -1523,7 +1523,8 @@ static void do_shmem_file_read(struct fi
 		index += offset >> PAGE_CACHE_SHIFT;
 		offset &= ~PAGE_CACHE_MASK;
 
-		page_cache_release(page);
+		if (page != ZERO_PAGE(0))
+			page_cache_release(page);
 		if (ret != nr || !desc->count)
 			break;
 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [patch 2/7] mm: PG_free flag
  2005-08-11 12:21 ` [patch 1/7] mm: remove PageReserved rollup Nick Piggin
@ 2005-08-11 12:22   ` Nick Piggin
  2005-08-11 12:22     ` [patch 3/7] mm: speculative get_page Nick Piggin
  0 siblings, 1 reply; 16+ messages in thread
From: Nick Piggin @ 2005-08-11 12:22 UTC (permalink / raw)
  To: Paul McKenney; +Cc: Dipankar Sarma, Ingo Molnar, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 33 bytes --]

2/7

-- 
SUSE Labs, Novell Inc.


[-- Attachment #2: mm-PG_free-flag.patch --]
[-- Type: text/plain, Size: 2886 bytes --]

In a future patch we can no longer rely on page_count being stable at any
time, so we can no longer overload PagePrivate && page_count == 0 to mean
the page is free and on the buddy lists.

Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -76,6 +76,8 @@
 #define PG_nosave_free		18	/* Free, should not be written */
 #define PG_uncached		19	/* Page has been mapped as uncached */
 
+#define PG_free			20	/* Page is on the free lists */
+
 /*
  * Global page accounting.  One instance per CPU.  Only unsigned longs are
  * allowed.
@@ -306,6 +308,10 @@ extern void __mod_page_state(unsigned lo
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
 
+#define PageFree(page)		test_bit(PG_free, &(page)->flags)
+#define __SetPageFree(page)	__set_bit(PG_free, &(page)->flags)
+#define __ClearPageFree(page)	__clear_bit(PG_free, &(page)->flags)
+
 struct page;	/* forward declaration */
 
 int test_clear_page_dirty(struct page *page);
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -114,7 +114,8 @@ static void bad_page(const char *functio
 			1 << PG_slab    |
 			1 << PG_swapcache |
 			1 << PG_writeback |
-			1 << PG_reserved );
+			1 << PG_reserved |
+			1 << PG_free );
 	set_page_count(page, 0);
 	reset_page_mapcount(page);
 	page->mapping = NULL;
@@ -191,12 +192,12 @@ static inline unsigned long page_order(s
 
 static inline void set_page_order(struct page *page, int order) {
 	page->private = order;
-	__SetPagePrivate(page);
+	__SetPageFree(page);
 }
 
 static inline void rmv_page_order(struct page *page)
 {
-	__ClearPagePrivate(page);
+	__ClearPageFree(page);
 	page->private = 0;
 }
 
@@ -242,9 +243,7 @@ __find_combined_index(unsigned long page
  */
 static inline int page_is_buddy(struct page *page, int order)
 {
-       if (PagePrivate(page)           &&
-           (page_order(page) == order) &&
-            page_count(page) == 0)
+       if (PageFree(page) && (page_order(page) == order))
                return 1;
        return 0;
 }
@@ -327,7 +326,8 @@ static inline void free_pages_check(cons
 			1 << PG_slab	|
 			1 << PG_swapcache |
 			1 << PG_writeback |
-			1 << PG_reserved )))
+			1 << PG_reserved |
+			1 << PG_free )))
 		bad_page(function, page);
 	if (PageDirty(page))
 		__ClearPageDirty(page);
@@ -456,7 +456,8 @@ static void prep_new_page(struct page *p
 			1 << PG_slab    |
 			1 << PG_swapcache |
 			1 << PG_writeback |
-			1 << PG_reserved )))
+			1 << PG_reserved |
+			1 << PG_free )))
 		bad_page(__FUNCTION__, page);
 
 	page->flags &= ~(1 << PG_uptodate | 1 << PG_error |

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [patch 3/7] mm: speculative get_page
  2005-08-11 12:22   ` [patch 2/7] mm: PG_free flag Nick Piggin
@ 2005-08-11 12:22     ` Nick Piggin
  2005-08-11 12:25       ` [patch 4/7] radix-tree: lookup_slot Nick Piggin
  0 siblings, 1 reply; 16+ messages in thread
From: Nick Piggin @ 2005-08-11 12:22 UTC (permalink / raw)
  To: Paul McKenney; +Cc: Dipankar Sarma, Ingo Molnar, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 33 bytes --]

3/7

-- 
SUSE Labs, Novell Inc.


[-- Attachment #2: mm-speculative-get_page.patch --]
[-- Type: text/plain, Size: 8139 bytes --]

If we can be sure that elevating the page_count on a pagecache
page will pin it, we can speculatively run this operation, and
subsequently check to see if we hit the right page rather than
relying on holding a lock or otherwise pinning a reference to
the page.

This can be done if get_page/put_page behaves in the same manner
throughout the whole tree (ie. if we "get" the page after it has
been used for something else, we must be able to free it with a
put_page).

There needs to be some careful logic for freed pages so they aren't
freed again, and also some careful logic for pages in the process
of being removed from pagecache.

Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -77,6 +77,7 @@
 #define PG_uncached		19	/* Page has been mapped as uncached */
 
 #define PG_free			20	/* Page is on the free lists */
+#define PG_freeing		21	/* PG_refcount about to be freed */
 
 /*
  * Global page accounting.  One instance per CPU.  Only unsigned longs are
@@ -312,6 +313,11 @@ extern void __mod_page_state(unsigned lo
 #define __SetPageFree(page)	__set_bit(PG_free, &(page)->flags)
 #define __ClearPageFree(page)	__clear_bit(PG_free, &(page)->flags)
 
+#define PageFreeing(page)	test_bit(PG_freeing, &(page)->flags)
+#define SetPageFreeing(page)	set_bit(PG_freeing, &(page)->flags)
+#define ClearPageFreeing(page)	clear_bit(PG_freeing, &(page)->flags)
+#define __ClearPageFreeing(page) __clear_bit(PG_freeing, &(page)->flags)
+
 struct page;	/* forward declaration */
 
 int test_clear_page_dirty(struct page *page);
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -50,6 +50,64 @@ static inline void mapping_set_gfp_mask(
 #define page_cache_release(page)	put_page(page)
 void release_pages(struct page **pages, int nr, int cold);
 
+static inline struct page *page_cache_get_speculative(struct page **pagep)
+{
+	unsigned long flags;
+	struct page *page;
+
+	/*
+	 * Disable IRQs (and preempt) so we don't deadlock with the page
+	 * allocator who might be waiting for us to drop the speculative
+	 * reference.
+	 *
+	 * Interrupts could be disabled _after_ loading *pagep, however
+	 * we want to really minimise the window between taking a spec
+	 * ref on the page and retesting the page.
+	 */
+	local_irq_save(flags);
+
+	page = *pagep;
+	if (!page)
+		goto out_failed;
+
+	/* Note that get_page_testone provides a memory barrier */
+	if (unlikely(get_page_testone(page) || PageFree(page))) {
+		/*
+		 * Picked up a freed page. Note order of evaluation of the
+		 * above is important. If we are not the first speculative
+		 * getter on a free page, then we'll fall through to the
+		 * PageFree test, which is stable because the previous getters
+		 * are keeping page allocation spinning on this page.
+		 */
+		__put_page(page);
+		page = NULL;
+		goto out_failed;
+	}
+
+	/*
+	 * interrupts and preempt could be enabled here (only need to be
+	 * disabled because page allocation can spin on the elevated refcount,
+	 * but we don't want to hold a reference on an unrelated page for too
+	 * long, so keep preempt off until we know we have the right page
+	 */
+
+	if (unlikely(PageFreeing(page) || page != *pagep)) {
+		/*
+		 * Picked up a page being freed, or one that is no longer
+		 * being pointed to by pagep. Note that we do the complete
+		 * put_page, because unlike the above case, this page is
+		 * not free and our reference is pinning it.
+		 */
+		put_page(page);
+		page = NULL;
+		goto out_failed;
+	}
+
+out_failed:
+	local_irq_restore(flags);
+	return page;
+}
+
 static inline struct page *page_cache_alloc(struct address_space *x)
 {
 	return alloc_pages(mapping_gfp_mask(x)|__GFP_NORECLAIM, 0);
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -116,7 +116,6 @@ static void bad_page(const char *functio
 			1 << PG_writeback |
 			1 << PG_reserved |
 			1 << PG_free );
-	set_page_count(page, 0);
 	reset_page_mapcount(page);
 	page->mapping = NULL;
 	tainted |= TAINT_BAD_PAGE;
@@ -316,7 +315,6 @@ static inline void free_pages_check(cons
 {
 	if (	page_mapcount(page) ||
 		page->mapping != NULL ||
-		page_count(page) != 0 ||
 		(page->flags & (
 			1 << PG_lru	|
 			1 << PG_private |
@@ -424,7 +422,7 @@ expand(struct zone *zone, struct page *p
 void set_page_refs(struct page *page, int order)
 {
 #ifdef CONFIG_MMU
-	set_page_count(page, 1);
+	get_page(page);
 #else
 	int i;
 
@@ -434,7 +432,7 @@ void set_page_refs(struct page *page, in
 	 * - eg: access_process_vm()
 	 */
 	for (i = 0; i < (1 << order); i++)
-		set_page_count(page + i, 1);
+		get_page(page + i);
 #endif /* CONFIG_MMU */
 }
 
@@ -445,7 +443,6 @@ static void prep_new_page(struct page *p
 {
 	if (	page_mapcount(page) ||
 		page->mapping != NULL ||
-		page_count(page) != 0 ||
 		(page->flags & (
 			1 << PG_lru	|
 			1 << PG_private	|
@@ -464,7 +461,13 @@ static void prep_new_page(struct page *p
 			1 << PG_referenced | 1 << PG_arch_1 |
 			1 << PG_checked | 1 << PG_mappedtodisk);
 	page->private = 0;
+
 	set_page_refs(page, order);
+	smp_mb();
+	/* Wait for speculative get_page after count has been elevated. */
+	while (unlikely(page_count(page) > 1))
+		cpu_relax();
+
 	kernel_map_pages(page, 1 << order, 1);
 }
 
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -504,6 +504,7 @@ static int shrink_list(struct list_head 
 		if (!mapping)
 			goto keep_locked;	/* truncate got there first */
 
+		SetPageFreeing(page);
 		write_lock_irq(&mapping->tree_lock);
 
 		/*
@@ -513,6 +514,7 @@ static int shrink_list(struct list_head 
 		 */
 		if (page_count(page) != 2 || PageDirty(page)) {
 			write_unlock_irq(&mapping->tree_lock);
+			ClearPageFreeing(page);
 			goto keep_locked;
 		}
 
@@ -533,6 +535,7 @@ static int shrink_list(struct list_head 
 
 free_it:
 		unlock_page(page);
+		__ClearPageFreeing(page);
 		reclaimed++;
 		if (!pagevec_add(&freed_pvec, page))
 			__pagevec_release_nonlru(&freed_pvec);
Index: linux-2.6/mm/bootmem.c
===================================================================
--- linux-2.6.orig/mm/bootmem.c
+++ linux-2.6/mm/bootmem.c
@@ -289,19 +289,20 @@ static unsigned long __init free_all_boo
 			int j, order;
 
 			page = pfn_to_page(pfn);
+			prefetchw(page);
+
 			count += BITS_PER_LONG;
-			__ClearPageReserved(page);
 			order = ffs(BITS_PER_LONG) - 1;
-			set_page_refs(page, order);
-			for (j = 1; j < BITS_PER_LONG; j++) {
-				if (j + 16 < BITS_PER_LONG)
-					prefetchw(page + j + 16);
+			for (j = 0; j < BITS_PER_LONG; j++) {
+				if (j + 1 < BITS_PER_LONG)
+					prefetchw(page + j + 1);
 				__ClearPageReserved(page + j);
 				set_page_count(page + j, 0);
 			}
+			set_page_refs(page, order);
 			__free_pages(page, order);
+
 			i += BITS_PER_LONG;
-			page += BITS_PER_LONG;
 		} else if (v) {
 			unsigned long m;
 
@@ -310,6 +311,7 @@ static unsigned long __init free_all_boo
 				if (v & m) {
 					count++;
 					__ClearPageReserved(page);
+					set_page_count(page, 0);
 					set_page_refs(page, 0);
 					__free_page(page);
 				}
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -338,6 +338,7 @@ int remove_exclusive_swap_page(struct pa
 	retval = 0;
 	if (p->swap_map[swp_offset(entry)] == 1) {
 		/* Recheck the page count with the swapcache lock held.. */
+		SetPageFreeing(page);
 		write_lock_irq(&swapper_space.tree_lock);
 		if ((page_count(page) == 2) && !PageWriteback(page)) {
 			__delete_from_swap_cache(page);
@@ -345,6 +346,7 @@ int remove_exclusive_swap_page(struct pa
 			retval = 1;
 		}
 		write_unlock_irq(&swapper_space.tree_lock);
+		ClearPageFreeing(page);
 	}
 	swap_info_put(p);
 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [patch 4/7] radix-tree: lookup_slot
  2005-08-11 12:22     ` [patch 3/7] mm: speculative get_page Nick Piggin
@ 2005-08-11 12:25       ` Nick Piggin
  2005-08-11 12:25         ` [patch 5/7] radix-tree: lockless readside Nick Piggin
  0 siblings, 1 reply; 16+ messages in thread
From: Nick Piggin @ 2005-08-11 12:25 UTC (permalink / raw)
  To: Paul McKenney; +Cc: Dipankar Sarma, Ingo Molnar, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 119 bytes --]

4/7

Required by lockless pagecache in order to get a pointer
to a pagecache struct page.

-- 
SUSE Labs, Novell Inc.


[-- Attachment #2: radix-tree-lookup_slot.patch --]
[-- Type: text/plain, Size: 2627 bytes --]

From: Hans Reiser <reiser@namesys.com>

Reiser4 uses radix trees to solve a trouble reiser4_readdir has serving nfs
requests.

Unfortunately, radix tree api lacks an operation suitable for modifying
existing entry.  This patch adds radix_tree_lookup_slot which returns pointer
to found item within the tree.  That location can be then updated.

Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -46,6 +46,7 @@ do {									\
 
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
+void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -276,14 +276,8 @@ int radix_tree_insert(struct radix_tree_
 }
 EXPORT_SYMBOL(radix_tree_insert);
 
-/**
- *	radix_tree_lookup    -    perform lookup operation on a radix tree
- *	@root:		radix tree root
- *	@index:		index key
- *
- *	Lookup the item at the position @index in the radix tree @root.
- */
-void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
+static inline void **__lookup_slot(struct radix_tree_root *root,
+				   unsigned long index)
 {
 	unsigned int height, shift;
 	struct radix_tree_node **slot;
@@ -306,7 +300,36 @@ void *radix_tree_lookup(struct radix_tre
 		height--;
 	}
 
-	return *slot;
+	return (void **)slot;
+}
+
+/**
+ *	radix_tree_lookup_slot    -    lookup a slot in a radix tree
+ *	@root:		radix tree root
+ *	@index:		index key
+ *
+ *	Lookup the slot corresponding to the position @index in the radix tree
+ *	@root. This is useful for update-if-exists operations.
+ */
+void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
+{
+	return __lookup_slot(root, index);
+}
+EXPORT_SYMBOL(radix_tree_lookup_slot);
+
+/**
+ *	radix_tree_lookup    -    perform lookup operation on a radix tree
+ *	@root:		radix tree root
+ *	@index:		index key
+ *
+ *	Lookup the item at the position @index in the radix tree @root.
+ */
+void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
+{
+	void **slot;
+
+	slot = __lookup_slot(root, index);
+	return slot != NULL ? *slot : NULL;
 }
 EXPORT_SYMBOL(radix_tree_lookup);
 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [patch 5/7] radix-tree: lockless readside
  2005-08-11 12:25       ` [patch 4/7] radix-tree: lookup_slot Nick Piggin
@ 2005-08-11 12:25         ` Nick Piggin
  2005-08-11 12:28           ` [patch 6/7] mm: lockless pagecache Nick Piggin
  2005-08-12  1:37           ` [patch 5/7] radix-tree: lockless readside Paul E. McKenney
  0 siblings, 2 replies; 16+ messages in thread
From: Nick Piggin @ 2005-08-11 12:25 UTC (permalink / raw)
  To: Paul McKenney; +Cc: Dipankar Sarma, Ingo Molnar, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 33 bytes --]

5/7

-- 
SUSE Labs, Novell Inc.


[-- Attachment #2: radix-tree-lockless-readside.patch --]
[-- Type: text/plain, Size: 6908 bytes --]

Make radix tree lookups safe to be performed without locks.
Readers are protected against nodes being deleted by using RCU
based freeing. Readers are protected against new node insertion
by using memory barriers to ensure the node itself will be
properly written before it is visible in the radix tree.

Also introduce a lockfree gang_lookup_slot which will be used
by a future patch.

Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -29,6 +29,7 @@
 #include <linux/gfp.h>
 #include <linux/string.h>
 #include <linux/bitops.h>
+#include <linux/rcupdate.h>
 
 
 #ifdef __KERNEL__
@@ -45,7 +46,9 @@
 	((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
 
 struct radix_tree_node {
+	unsigned int	height;		/* Height from the bottom */
 	unsigned int	count;
+	struct rcu_head	rcu_head;
 	void		*slots[RADIX_TREE_MAP_SIZE];
 	unsigned long	tags[RADIX_TREE_TAGS][RADIX_TREE_TAG_LONGS];
 };
@@ -97,10 +100,17 @@ radix_tree_node_alloc(struct radix_tree_
 	return ret;
 }
 
+static void radix_tree_node_rcu_free(struct rcu_head *head)
+{
+	struct radix_tree_node *node =
+			container_of(head, struct radix_tree_node, rcu_head);
+	kmem_cache_free(radix_tree_node_cachep, node);
+}
+
 static inline void
 radix_tree_node_free(struct radix_tree_node *node)
 {
-	kmem_cache_free(radix_tree_node_cachep, node);
+	call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
 }
 
 /*
@@ -196,6 +206,7 @@ static int radix_tree_extend(struct radi
 	}
 
 	do {
+		unsigned int newheight;
 		if (!(node = radix_tree_node_alloc(root)))
 			return -ENOMEM;
 
@@ -208,9 +219,13 @@ static int radix_tree_extend(struct radi
 				tag_set(node, tag, 0);
 		}
 
+		newheight = root->height+1;
+		node->height = newheight;
 		node->count = 1;
+		/* Make ->height visible before node visible via ->rnode */
+		smp_wmb();
 		root->rnode = node;
-		root->height++;
+		root->height = newheight;
 	} while (height > root->height);
 out:
 	return 0;
@@ -250,9 +265,12 @@ int radix_tree_insert(struct radix_tree_
 			/* Have to add a child node.  */
 			if (!(tmp = radix_tree_node_alloc(root)))
 				return -ENOMEM;
-			*slot = tmp;
+			tmp->height = height;
 			if (node)
 				node->count++;
+			/* Make ->height visible before node visible via slot */
+			smp_wmb();
+			*slot = tmp;
 		}
 
 		/* Go a level down */
@@ -282,12 +300,14 @@ static inline void **__lookup_slot(struc
 	unsigned int height, shift;
 	struct radix_tree_node **slot;
 
-	height = root->height;
+	if (root->rnode == NULL)
+		return NULL;
+	slot = &root->rnode;
+	height = (*slot)->height;
 	if (index > radix_tree_maxindex(height))
 		return NULL;
 
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
-	slot = &root->rnode;
 
 	while (height > 0) {
 		if (*slot == NULL)
@@ -491,21 +511,24 @@ EXPORT_SYMBOL(radix_tree_tag_get);
 #endif
 
 static unsigned int
-__lookup(struct radix_tree_root *root, void **results, unsigned long index,
+__lookup(struct radix_tree_root *root, void ***results, unsigned long index,
 	unsigned int max_items, unsigned long *next_index)
 {
+	unsigned long i;
 	unsigned int nr_found = 0;
 	unsigned int shift;
-	unsigned int height = root->height;
+	unsigned int height;
 	struct radix_tree_node *slot;
 
-	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
 	slot = root->rnode;
+	if (!slot)
+		goto out;
+	height = slot->height;
+	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
 
-	while (height > 0) {
-		unsigned long i = (index >> shift) & RADIX_TREE_MAP_MASK;
-
-		for ( ; i < RADIX_TREE_MAP_SIZE; i++) {
+	for (;;) {
+		for (i = (index >> shift) & RADIX_TREE_MAP_MASK;
+						i < RADIX_TREE_MAP_SIZE; i++) {
 			if (slot->slots[i] != NULL)
 				break;
 			index &= ~((1UL << shift) - 1);
@@ -516,21 +539,23 @@ __lookup(struct radix_tree_root *root, v
 		if (i == RADIX_TREE_MAP_SIZE)
 			goto out;
 		height--;
-		if (height == 0) {	/* Bottom level: grab some items */
-			unsigned long j = index & RADIX_TREE_MAP_MASK;
-
-			for ( ; j < RADIX_TREE_MAP_SIZE; j++) {
-				index++;
-				if (slot->slots[j]) {
-					results[nr_found++] = slot->slots[j];
-					if (nr_found == max_items)
-						goto out;
-				}
-			}
+		if (height == 0) {
+			/* Bottom level: grab some items */
+			break;
 		}
 		shift -= RADIX_TREE_MAP_SHIFT;
 		slot = slot->slots[i];
 	}
+
+	for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
+		index++;
+		if (slot->slots[i]) {
+			results[nr_found++] = &(slot->slots[i]);
+			if (nr_found == max_items)
+				goto out;
+		}
+	}
+
 out:
 	*next_index = index;
 	return nr_found;
@@ -558,6 +583,43 @@ radix_tree_gang_lookup(struct radix_tree
 	unsigned int ret = 0;
 
 	while (ret < max_items) {
+		unsigned int nr_found, i;
+		unsigned long next_index;	/* Index of next search */
+
+		if (cur_index > max_index)
+			break;
+		nr_found = __lookup(root, (void ***)results + ret, cur_index,
+					max_items - ret, &next_index);
+		for (i = 0; i < nr_found; i++)
+			results[ret + i] = *(((void ***)results)[ret + i]);
+		ret += nr_found;
+		if (next_index == 0)
+			break;
+		cur_index = next_index;
+	}
+	return ret;
+}
+EXPORT_SYMBOL(radix_tree_gang_lookup);
+
+/**
+ *	radix_tree_gang_lookup_slot - perform multiple lookup on a radix tree
+ *	@root:		radix tree root
+ *	@results:	where the results of the lookup are placed
+ *	@first_index:	start the lookup from this key
+ *	@max_items:	place up to this many items at *results
+ *
+ *	Same as radix_tree_gang_lookup, but returns an array of pointers
+ *	(slots) to the stored items instead of the items themselves.
+ */
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+			unsigned long first_index, unsigned int max_items)
+{
+	const unsigned long max_index = radix_tree_maxindex(root->height);
+	unsigned long cur_index = first_index;
+	unsigned int ret = 0;
+
+	while (ret < max_items) {
 		unsigned int nr_found;
 		unsigned long next_index;	/* Index of next search */
 
@@ -572,7 +634,8 @@ radix_tree_gang_lookup(struct radix_tree
 	}
 	return ret;
 }
-EXPORT_SYMBOL(radix_tree_gang_lookup);
+EXPORT_SYMBOL(radix_tree_gang_lookup_slot);
+
 
 /*
  * FIXME: the two tag_get()s here should use find_next_bit() instead of
Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -51,6 +51,9 @@ void *radix_tree_delete(struct radix_tre
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 			unsigned long first_index, unsigned int max_items);
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+			unsigned long first_index, unsigned int max_items);
 int radix_tree_preload(int gfp_mask);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [patch 6/7] mm: lockless pagecache
  2005-08-11 12:25         ` [patch 5/7] radix-tree: lockless readside Nick Piggin
@ 2005-08-11 12:28           ` Nick Piggin
  2005-08-11 12:28             ` [patch 7/7] mm: spinlock tree_lock Nick Piggin
                               ` (2 more replies)
  2005-08-12  1:37           ` [patch 5/7] radix-tree: lockless readside Paul E. McKenney
  1 sibling, 3 replies; 16+ messages in thread
From: Nick Piggin @ 2005-08-11 12:28 UTC (permalink / raw)
  To: Paul McKenney; +Cc: Dipankar Sarma, Ingo Molnar, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 33 bytes --]

6/7

-- 
SUSE Labs, Novell Inc.


[-- Attachment #2: mm-lockless-pagecache-lookups.patch --]
[-- Type: text/plain, Size: 11410 bytes --]

Use the speculative get_page and the lockless radix tree lookups
to introduce lockless page cache lookups (ie. no mapping->tree_lock).

The only atomicity changes this should introduce is the use of a
non atomic pagevec lookup for truncate, however what atomicity
guarantees there were are probably not too useful anyway.

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -379,18 +379,25 @@ int add_to_page_cache(struct page *page,
 	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 
 	if (error == 0) {
+		page_cache_get(page);
+		__SetPageLocked(page);
+		page->mapping = mapping;
+		page->index = offset;
+
 		write_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
 		if (!error) {
-			page_cache_get(page);
-			SetPageLocked(page);
-			page->mapping = mapping;
-			page->index = offset;
 			mapping->nrpages++;
 			pagecache_acct(1);
 		}
 		write_unlock_irq(&mapping->tree_lock);
 		radix_tree_preload_end();
+
+		if (error) {
+			page->mapping = NULL;
+			__put_page(page);
+			__ClearPageLocked(page);
+		}
 	}
 	return error;
 }
@@ -500,13 +507,15 @@ EXPORT_SYMBOL(__lock_page);
  */
 struct page * find_get_page(struct address_space *mapping, unsigned long offset)
 {
-	struct page *page;
+	struct page **pagep;
+	struct page *page = NULL;
 
-	read_lock_irq(&mapping->tree_lock);
-	page = radix_tree_lookup(&mapping->page_tree, offset);
-	if (page)
-		page_cache_get(page);
-	read_unlock_irq(&mapping->tree_lock);
+	rcu_read_lock();
+	pagep = (struct page **)radix_tree_lookup_slot(&mapping->page_tree,
+									offset);
+	if (pagep)
+		page = page_cache_get_speculative(pagep);
+	rcu_read_unlock();
 	return page;
 }
 
@@ -519,12 +528,24 @@ struct page *find_trylock_page(struct ad
 {
 	struct page *page;
 
-	read_lock_irq(&mapping->tree_lock);
-	page = radix_tree_lookup(&mapping->page_tree, offset);
-	if (page && TestSetPageLocked(page))
-		page = NULL;
-	read_unlock_irq(&mapping->tree_lock);
-	return page;
+	page = find_get_page(mapping, offset);
+	if (page) {
+		if (TestSetPageLocked(page))
+			goto out_failed;
+		/* Has the page been truncated before being locked? */
+		if (page->mapping != mapping || page->index != offset) {
+			unlock_page(page);
+			goto out_failed;
+		}
+
+		/* Silly interface requires us to drop the refcount */
+		__put_page(page);
+		return page;
+
+out_failed:
+		page_cache_release(page);
+	}
+	return NULL;
 }
 
 EXPORT_SYMBOL(find_trylock_page);
@@ -545,25 +566,17 @@ struct page *find_lock_page(struct addre
 {
 	struct page *page;
 
-	read_lock_irq(&mapping->tree_lock);
 repeat:
-	page = radix_tree_lookup(&mapping->page_tree, offset);
+	page = find_get_page(mapping, offset);
 	if (page) {
-		page_cache_get(page);
-		if (TestSetPageLocked(page)) {
-			read_unlock_irq(&mapping->tree_lock);
-			lock_page(page);
-			read_lock_irq(&mapping->tree_lock);
-
-			/* Has the page been truncated while we slept? */
-			if (page->mapping != mapping || page->index != offset) {
-				unlock_page(page);
-				page_cache_release(page);
-				goto repeat;
-			}
+		lock_page(page);
+		/* Has the page been truncated before being locked? */
+		if (page->mapping != mapping || page->index != offset) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto repeat;
 		}
 	}
-	read_unlock_irq(&mapping->tree_lock);
 	return page;
 }
 
@@ -646,6 +659,32 @@ unsigned find_get_pages(struct address_s
 	return ret;
 }
 
+unsigned find_get_pages_nonatomic(struct address_space *mapping, pgoff_t start,
+			    unsigned int nr_pages, struct page **pages)
+{
+	unsigned int i;
+	unsigned int ret;
+	unsigned int ret2;
+
+	/*
+	 * We do some unsightly casting to use the array first for storing
+	 * pointers to the page pointers, and then for the pointers to
+	 * the pages themselves that the caller wants.
+	 */
+	rcu_read_lock();
+	ret = radix_tree_gang_lookup_slot(&mapping->page_tree,
+				(void ***)pages, start, nr_pages);
+	ret2 = 0;
+	for (i = 0; i < ret; i++) {
+		struct page *page;
+		page = page_cache_get_speculative(((struct page ***)pages)[i]);
+		if (page)
+			pages[ret2++] = page;
+	}
+	rcu_read_unlock();
+	return ret2;
+}
+
 /*
  * Like find_get_pages, except we only return pages which are tagged with
  * `tag'.   We update *index to index the next page for the traversal.
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c
+++ linux-2.6/mm/readahead.c
@@ -272,27 +272,26 @@ __do_page_cache_readahead(struct address
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
-	read_lock_irq(&mapping->tree_lock);
 	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
 		unsigned long page_offset = offset + page_idx;
 		
 		if (page_offset > end_index)
 			break;
 
+		/* Don't need mapping->tree_lock - lookup can be racy */
+		rcu_read_lock();
 		page = radix_tree_lookup(&mapping->page_tree, page_offset);
+		rcu_read_unlock();
 		if (page)
 			continue;
 
-		read_unlock_irq(&mapping->tree_lock);
 		page = page_cache_alloc_cold(mapping);
-		read_lock_irq(&mapping->tree_lock);
 		if (!page)
 			break;
 		page->index = page_offset;
 		list_add(&page->lru, &page_pool);
 		ret++;
 	}
-	read_unlock_irq(&mapping->tree_lock);
 
 	/*
 	 * Now start the IO.  We ignore I/O errors - if the page is not
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -76,19 +76,26 @@ static int __add_to_swap_cache(struct pa
 	BUG_ON(PagePrivate(page));
 	error = radix_tree_preload(gfp_mask);
 	if (!error) {
+		page_cache_get(page);
+		SetPageLocked(page);
+		SetPageSwapCache(page);
+		page->private = entry.val;
+
 		write_lock_irq(&swapper_space.tree_lock);
 		error = radix_tree_insert(&swapper_space.page_tree,
 						entry.val, page);
 		if (!error) {
-			page_cache_get(page);
-			SetPageLocked(page);
-			SetPageSwapCache(page);
-			page->private = entry.val;
 			total_swapcache_pages++;
 			pagecache_acct(1);
 		}
 		write_unlock_irq(&swapper_space.tree_lock);
 		radix_tree_preload_end();
+
+		if (error) {
+			__put_page(page);
+			ClearPageLocked(page);
+			ClearPageSwapCache(page);
+		}
 	}
 	return error;
 }
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h
+++ linux-2.6/include/linux/page-flags.h
@@ -167,16 +167,13 @@ extern void __mod_page_state(unsigned lo
 /*
  * Manipulation of page state flags
  */
-#define PageLocked(page)		\
-		test_bit(PG_locked, &(page)->flags)
-#define SetPageLocked(page)		\
-		set_bit(PG_locked, &(page)->flags)
-#define TestSetPageLocked(page)		\
-		test_and_set_bit(PG_locked, &(page)->flags)
-#define ClearPageLocked(page)		\
-		clear_bit(PG_locked, &(page)->flags)
-#define TestClearPageLocked(page)	\
-		test_and_clear_bit(PG_locked, &(page)->flags)
+#define PageLocked(page)	test_bit(PG_locked, &(page)->flags)
+#define SetPageLocked(page)	set_bit(PG_locked, &(page)->flags)
+#define __SetPageLocked(page)	__set_bit(PG_locked, &(page)->flags)
+#define TestSetPageLocked(page)	test_and_set_bit(PG_locked, &(page)->flags)
+#define ClearPageLocked(page)	clear_bit(PG_locked, &(page)->flags)
+#define __ClearPageLocked(page)	__clear_bit(PG_locked, &(page)->flags)
+#define TestClearPageLocked(page) test_and_clear_bit(PG_locked, &(page)->flags)
 
 #define PageError(page)		test_bit(PG_error, &(page)->flags)
 #define SetPageError(page)	set_bit(PG_error, &(page)->flags)
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -130,6 +130,8 @@ extern struct page * find_or_create_page
 				unsigned long index, unsigned int gfp_mask);
 unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
 			unsigned int nr_pages, struct page **pages);
+unsigned find_get_pages_nonatomic(struct address_space *mapping, pgoff_t start,
+			unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages);
 
Index: linux-2.6/include/linux/pagevec.h
===================================================================
--- linux-2.6.orig/include/linux/pagevec.h
+++ linux-2.6/include/linux/pagevec.h
@@ -25,6 +25,8 @@ void __pagevec_lru_add_active(struct pag
 void pagevec_strip(struct pagevec *pvec);
 unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
 		pgoff_t start, unsigned nr_pages);
+unsigned pagevec_lookup_nonatomic(struct pagevec *pvec,
+	struct address_space *mapping, pgoff_t start, unsigned nr_pages);
 unsigned pagevec_lookup_tag(struct pagevec *pvec,
 		struct address_space *mapping, pgoff_t *index, int tag,
 		unsigned nr_pages);
Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c
+++ linux-2.6/mm/swap.c
@@ -380,6 +380,19 @@ unsigned pagevec_lookup(struct pagevec *
 	return pagevec_count(pvec);
 }
 
+/**
+ * pagevec_lookup_nonatomic - non atomic pagevec_lookup
+ *
+ * This routine is non-atomic in that it may return blah.
+ */
+unsigned pagevec_lookup_nonatomic(struct pagevec *pvec,
+		struct address_space *mapping, pgoff_t start, unsigned nr_pages)
+{
+	pvec->nr = find_get_pages_nonatomic(mapping, start,
+					nr_pages, pvec->pages);
+	return pagevec_count(pvec);
+}
+
 unsigned pagevec_lookup_tag(struct pagevec *pvec, struct address_space *mapping,
 		pgoff_t *index, int tag, unsigned nr_pages)
 {
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -126,7 +126,7 @@ void truncate_inode_pages(struct address
 
 	pagevec_init(&pvec, 0);
 	next = start;
-	while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+	while (pagevec_lookup_nonatomic(&pvec, mapping, next, PAGEVEC_SIZE)) {
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 			pgoff_t page_index = page->index;
@@ -160,7 +160,7 @@ void truncate_inode_pages(struct address
 	next = start;
 	for ( ; ; ) {
 		cond_resched();
-		if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+		if (!pagevec_lookup_nonatomic(&pvec, mapping, next, PAGEVEC_SIZE)) {
 			if (next == start)
 				break;
 			next = start;
@@ -206,7 +206,7 @@ unsigned long invalidate_mapping_pages(s
 
 	pagevec_init(&pvec, 0);
 	while (next <= end &&
-			pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+		pagevec_lookup_nonatomic(&pvec, mapping, next, PAGEVEC_SIZE)) {
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -811,6 +811,7 @@ int mapping_tagged(struct address_space 
 	unsigned long flags;
 	int ret;
 
+	/* XXX: radix_tree_tagged is safe to run without the lock? */
 	read_lock_irqsave(&mapping->tree_lock, flags);
 	ret = radix_tree_tagged(&mapping->page_tree, tag);
 	read_unlock_irqrestore(&mapping->tree_lock, flags);

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [patch 7/7] mm: spinlock tree_lock
  2005-08-11 12:28           ` [patch 6/7] mm: lockless pagecache Nick Piggin
@ 2005-08-11 12:28             ` Nick Piggin
  2005-08-11 13:58             ` [patch 6/7] mm: lockless pagecache Pekka Enberg
  2005-08-12  1:49             ` Paul E. McKenney
  2 siblings, 0 replies; 16+ messages in thread
From: Nick Piggin @ 2005-08-11 12:28 UTC (permalink / raw)
  To: Paul McKenney; +Cc: Dipankar Sarma, Ingo Molnar, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 33 bytes --]

7/7

-- 
SUSE Labs, Novell Inc.


[-- Attachment #2: mm-spinlock-tree_lock.patch --]
[-- Type: text/plain, Size: 13919 bytes --]

With practially all the read locks gone from mapping->tree_lock,
convert the lock from an rwlock back to a spinlock.

The remaining locks including the read locks mainly deal with IO
submission and not the lookup fastpaths.

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -859,7 +859,7 @@ int __set_page_dirty_buffers(struct page
 	spin_unlock(&mapping->private_lock);
 
 	if (!TestSetPageDirty(page)) {
-		write_lock_irq(&mapping->tree_lock);
+		spin_lock_irq(&mapping->tree_lock);
 		if (page->mapping) {	/* Race with truncate? */
 			if (mapping_cap_account_dirty(mapping))
 				inc_page_state(nr_dirty);
@@ -867,7 +867,7 @@ int __set_page_dirty_buffers(struct page
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
 		}
-		write_unlock_irq(&mapping->tree_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 	}
 	
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -195,7 +195,7 @@ void inode_init_once(struct inode *inode
 	sema_init(&inode->i_sem, 1);
 	init_rwsem(&inode->i_alloc_sem);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
-	rwlock_init(&inode->i_data.tree_lock);
+	spin_lock_init(&inode->i_data.tree_lock);
 	spin_lock_init(&inode->i_data.i_mmap_lock);
 	INIT_LIST_HEAD(&inode->i_data.private_list);
 	spin_lock_init(&inode->i_data.private_lock);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -339,7 +339,7 @@ struct backing_dev_info;
 struct address_space {
 	struct inode		*host;		/* owner: inode, block_device */
 	struct radix_tree_root	page_tree;	/* radix tree of all pages */
-	rwlock_t		tree_lock;	/* and rwlock protecting it */
+	spinlock_t		tree_lock;	/* and lock protecting it */
 	unsigned int		i_mmap_writable;/* count VM_SHARED mappings */
 	struct prio_tree_root	i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -121,9 +121,9 @@ void remove_from_page_cache(struct page 
 
 	BUG_ON(!PageLocked(page));
 
-	write_lock_irq(&mapping->tree_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	__remove_from_page_cache(page);
-	write_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 }
 
 static int sync_page(void *word)
@@ -384,13 +384,13 @@ int add_to_page_cache(struct page *page,
 		page->mapping = mapping;
 		page->index = offset;
 
-		write_lock_irq(&mapping->tree_lock);
+		spin_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
 		if (!error) {
 			mapping->nrpages++;
 			pagecache_acct(1);
 		}
-		write_unlock_irq(&mapping->tree_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		radix_tree_preload_end();
 
 		if (error) {
@@ -650,12 +650,12 @@ unsigned find_get_pages(struct address_s
 	unsigned int i;
 	unsigned int ret;
 
-	read_lock_irq(&mapping->tree_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup(&mapping->page_tree,
 				(void **)pages, start, nr_pages);
 	for (i = 0; i < ret; i++)
 		page_cache_get(pages[i]);
-	read_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	return ret;
 }
 
@@ -695,14 +695,14 @@ unsigned find_get_pages_tag(struct addre
 	unsigned int i;
 	unsigned int ret;
 
-	read_lock_irq(&mapping->tree_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
 				(void **)pages, *index, nr_pages, tag);
 	for (i = 0; i < ret; i++)
 		page_cache_get(pages[i]);
 	if (ret)
 		*index = pages[ret - 1]->index + 1;
-	read_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	return ret;
 }
 
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -35,7 +35,7 @@ static struct backing_dev_info swap_back
 
 struct address_space swapper_space = {
 	.page_tree	= RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
-	.tree_lock	= RW_LOCK_UNLOCKED,
+	.tree_lock	= SPIN_LOCK_UNLOCKED,
 	.a_ops		= &swap_aops,
 	.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
 	.backing_dev_info = &swap_backing_dev_info,
@@ -81,14 +81,14 @@ static int __add_to_swap_cache(struct pa
 		SetPageSwapCache(page);
 		page->private = entry.val;
 
-		write_lock_irq(&swapper_space.tree_lock);
+		spin_lock_irq(&swapper_space.tree_lock);
 		error = radix_tree_insert(&swapper_space.page_tree,
 						entry.val, page);
 		if (!error) {
 			total_swapcache_pages++;
 			pagecache_acct(1);
 		}
-		write_unlock_irq(&swapper_space.tree_lock);
+		spin_unlock_irq(&swapper_space.tree_lock);
 		radix_tree_preload_end();
 
 		if (error) {
@@ -210,9 +210,9 @@ void delete_from_swap_cache(struct page 
   
 	entry.val = page->private;
 
-	write_lock_irq(&swapper_space.tree_lock);
+	spin_lock_irq(&swapper_space.tree_lock);
 	__delete_from_swap_cache(page);
-	write_unlock_irq(&swapper_space.tree_lock);
+	spin_unlock_irq(&swapper_space.tree_lock);
 
 	swap_free(entry);
 	page_cache_release(page);
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -339,13 +339,13 @@ int remove_exclusive_swap_page(struct pa
 	if (p->swap_map[swp_offset(entry)] == 1) {
 		/* Recheck the page count with the swapcache lock held.. */
 		SetPageFreeing(page);
-		write_lock_irq(&swapper_space.tree_lock);
+		spin_lock_irq(&swapper_space.tree_lock);
 		if ((page_count(page) == 2) && !PageWriteback(page)) {
 			__delete_from_swap_cache(page);
 			SetPageDirty(page);
 			retval = 1;
 		}
-		write_unlock_irq(&swapper_space.tree_lock);
+		spin_unlock_irq(&swapper_space.tree_lock);
 		ClearPageFreeing(page);
 	}
 	swap_info_put(p);
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -76,15 +76,15 @@ invalidate_complete_page(struct address_
 	if (PagePrivate(page) && !try_to_release_page(page, 0))
 		return 0;
 
-	write_lock_irq(&mapping->tree_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	if (PageDirty(page)) {
-		write_unlock_irq(&mapping->tree_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		return 0;
 	}
 
 	BUG_ON(PagePrivate(page));
 	__remove_from_page_cache(page);
-	write_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	ClearPageUptodate(page);
 	page_cache_release(page);	/* pagecache ref */
 	return 1;
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -505,7 +505,7 @@ static int shrink_list(struct list_head 
 			goto keep_locked;	/* truncate got there first */
 
 		SetPageFreeing(page);
-		write_lock_irq(&mapping->tree_lock);
+		spin_lock_irq(&mapping->tree_lock);
 
 		/*
 		 * The non-racy check for busy page.  It is critical to check
@@ -513,7 +513,7 @@ static int shrink_list(struct list_head 
 		 * not in use by anybody. 	(pagecache + us == 2)
 		 */
 		if (page_count(page) != 2 || PageDirty(page)) {
-			write_unlock_irq(&mapping->tree_lock);
+			spin_unlock_irq(&mapping->tree_lock);
 			ClearPageFreeing(page);
 			goto keep_locked;
 		}
@@ -522,7 +522,7 @@ static int shrink_list(struct list_head 
 		if (PageSwapCache(page)) {
 			swp_entry_t swap = { .val = page->private };
 			__delete_from_swap_cache(page);
-			write_unlock_irq(&mapping->tree_lock);
+			spin_unlock_irq(&mapping->tree_lock);
 			swap_free(swap);
 			__put_page(page);	/* The pagecache ref */
 			goto free_it;
@@ -530,7 +530,7 @@ static int shrink_list(struct list_head 
 #endif /* CONFIG_SWAP */
 
 		__remove_from_page_cache(page);
-		write_unlock_irq(&mapping->tree_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		__put_page(page);
 
 free_it:
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -623,7 +623,7 @@ int __set_page_dirty_nobuffers(struct pa
 		struct address_space *mapping2;
 
 		if (mapping) {
-			write_lock_irq(&mapping->tree_lock);
+			spin_lock_irq(&mapping->tree_lock);
 			mapping2 = page_mapping(page);
 			if (mapping2) { /* Race with truncate? */
 				BUG_ON(mapping2 != mapping);
@@ -632,7 +632,7 @@ int __set_page_dirty_nobuffers(struct pa
 				radix_tree_tag_set(&mapping->page_tree,
 					page_index(page), PAGECACHE_TAG_DIRTY);
 			}
-			write_unlock_irq(&mapping->tree_lock);
+			spin_unlock_irq(&mapping->tree_lock);
 			if (mapping->host) {
 				/* !PageAnon && !swapper_space */
 				__mark_inode_dirty(mapping->host,
@@ -707,17 +707,17 @@ int test_clear_page_dirty(struct page *p
 	unsigned long flags;
 
 	if (mapping) {
-		write_lock_irqsave(&mapping->tree_lock, flags);
+		spin_lock_irqsave(&mapping->tree_lock, flags);
 		if (TestClearPageDirty(page)) {
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
-			write_unlock_irqrestore(&mapping->tree_lock, flags);
+			spin_unlock_irqrestore(&mapping->tree_lock, flags);
 			if (mapping_cap_account_dirty(mapping))
 				dec_page_state(nr_dirty);
 			return 1;
 		}
-		write_unlock_irqrestore(&mapping->tree_lock, flags);
+		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 		return 0;
 	}
 	return TestClearPageDirty(page);
@@ -762,13 +762,13 @@ int test_clear_page_writeback(struct pag
 	if (mapping) {
 		unsigned long flags;
 
-		write_lock_irqsave(&mapping->tree_lock, flags);
+		spin_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestClearPageWriteback(page);
 		if (ret)
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
-		write_unlock_irqrestore(&mapping->tree_lock, flags);
+		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestClearPageWriteback(page);
 	}
@@ -783,7 +783,7 @@ int test_set_page_writeback(struct page 
 	if (mapping) {
 		unsigned long flags;
 
-		write_lock_irqsave(&mapping->tree_lock, flags);
+		spin_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestSetPageWriteback(page);
 		if (!ret)
 			radix_tree_tag_set(&mapping->page_tree,
@@ -793,7 +793,7 @@ int test_set_page_writeback(struct page 
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
-		write_unlock_irqrestore(&mapping->tree_lock, flags);
+		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestSetPageWriteback(page);
 	}
@@ -811,10 +811,10 @@ int mapping_tagged(struct address_space 
 	unsigned long flags;
 	int ret;
 
-	/* XXX: radix_tree_tagged is safe to run without the lock? */
-	read_lock_irqsave(&mapping->tree_lock, flags);
+	/* XXX: radix_tree_tagged is safe to run without the lock */
+	spin_lock_irqsave(&mapping->tree_lock, flags);
 	ret = radix_tree_tagged(&mapping->page_tree, tag);
-	read_unlock_irqrestore(&mapping->tree_lock, flags);
+	spin_unlock_irqrestore(&mapping->tree_lock, flags);
 	return ret;
 }
 EXPORT_SYMBOL(mapping_tagged);
Index: linux-2.6/drivers/mtd/devices/block2mtd.c
===================================================================
--- linux-2.6.orig/drivers/mtd/devices/block2mtd.c
+++ linux-2.6/drivers/mtd/devices/block2mtd.c
@@ -58,7 +58,7 @@ void cache_readahead(struct address_spac
 
 	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
 
-	read_lock_irq(&mapping->tree_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	for (i = 0; i < PAGE_READAHEAD; i++) {
 		pagei = index + i;
 		if (pagei > end_index) {
@@ -70,16 +70,16 @@ void cache_readahead(struct address_spac
 			break;
 		if (page)
 			continue;
-		read_unlock_irq(&mapping->tree_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		page = page_cache_alloc_cold(mapping);
-		read_lock_irq(&mapping->tree_lock);
+		spin_lock_irq(&mapping->tree_lock);
 		if (!page)
 			break;
 		page->index = pagei;
 		list_add(&page->lru, &page_pool);
 		ret++;
 	}
-	read_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	if (ret)
 		read_cache_pages(mapping, &page_pool, filler, NULL);
 }
Index: linux-2.6/include/asm-arm/cacheflush.h
===================================================================
--- linux-2.6.orig/include/asm-arm/cacheflush.h
+++ linux-2.6/include/asm-arm/cacheflush.h
@@ -315,9 +315,9 @@ flush_cache_page(struct vm_area_struct *
 extern void flush_dcache_page(struct page *);
 
 #define flush_dcache_mmap_lock(mapping) \
-	write_lock_irq(&(mapping)->tree_lock)
+	spin_lock_irq(&(mapping)->tree_lock)
 #define flush_dcache_mmap_unlock(mapping) \
-	write_unlock_irq(&(mapping)->tree_lock)
+	spin_unlock_irq(&(mapping)->tree_lock)
 
 #define flush_icache_user_range(vma,page,addr,len) \
 	flush_dcache_page(page)
Index: linux-2.6/include/asm-parisc/cacheflush.h
===================================================================
--- linux-2.6.orig/include/asm-parisc/cacheflush.h
+++ linux-2.6/include/asm-parisc/cacheflush.h
@@ -57,9 +57,9 @@ flush_user_icache_range(unsigned long st
 extern void flush_dcache_page(struct page *page);
 
 #define flush_dcache_mmap_lock(mapping) \
-	write_lock_irq(&(mapping)->tree_lock)
+	spin_lock_irq(&(mapping)->tree_lock)
 #define flush_dcache_mmap_unlock(mapping) \
-	write_unlock_irq(&(mapping)->tree_lock)
+	spin_unlock_irq(&(mapping)->tree_lock)
 
 #define flush_icache_page(vma,page)	do { flush_kernel_dcache_page(page_address(page)); flush_kernel_icache_page(page_address(page)); } while (0)
 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 6/7] mm: lockless pagecache
  2005-08-11 12:28           ` [patch 6/7] mm: lockless pagecache Nick Piggin
  2005-08-11 12:28             ` [patch 7/7] mm: spinlock tree_lock Nick Piggin
@ 2005-08-11 13:58             ` Pekka Enberg
  2005-08-11 14:06               ` Nick Piggin
  2005-08-12  1:49             ` Paul E. McKenney
  2 siblings, 1 reply; 16+ messages in thread
From: Pekka Enberg @ 2005-08-11 13:58 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Paul McKenney, Dipankar Sarma, Ingo Molnar, linux-kernel

Hi Nick,

On 8/11/05, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> +unsigned find_get_pages_nonatomic(struct address_space *mapping, pgoff_t start,
> +                           unsigned int nr_pages, struct page **pages)
> +{
> +       unsigned int i;
> +       unsigned int ret;

Rename to nr_pages?

> +       unsigned int ret2;

Rename to ret?

> +
> +       /*
> +        * We do some unsightly casting to use the array first for storing
> +        * pointers to the page pointers, and then for the pointers to
> +        * the pages themselves that the caller wants.
> +        */
> +       rcu_read_lock();
> +       ret = radix_tree_gang_lookup_slot(&mapping->page_tree,
> +                               (void ***)pages, start, nr_pages);
> +       ret2 = 0;
> +       for (i = 0; i < ret; i++) {

Pretty please? :-)

                                  Pekka

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 6/7] mm: lockless pagecache
  2005-08-11 13:58             ` [patch 6/7] mm: lockless pagecache Pekka Enberg
@ 2005-08-11 14:06               ` Nick Piggin
  0 siblings, 0 replies; 16+ messages in thread
From: Nick Piggin @ 2005-08-11 14:06 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Paul McKenney, Dipankar Sarma, Ingo Molnar, linux-kernel

Hi Pekka,

Pekka Enberg wrote:
> Hi Nick,
> 
> On 8/11/05, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
>>+unsigned find_get_pages_nonatomic(struct address_space *mapping, pgoff_t start,
>>+                           unsigned int nr_pages, struct page **pages)
>>+{
>>+       unsigned int i;
>>+       unsigned int ret;
> 
> 
> Rename to nr_pages?
> 
> 
>>+       unsigned int ret2;
> 
> 
> Rename to ret?
> 
> 
>>+
>>+       /*
>>+        * We do some unsightly casting to use the array first for storing
>>+        * pointers to the page pointers, and then for the pointers to
>>+        * the pages themselves that the caller wants.
>>+        */
>>+       rcu_read_lock();
>>+       ret = radix_tree_gang_lookup_slot(&mapping->page_tree,
>>+                               (void ***)pages, start, nr_pages);
>>+       ret2 = 0;
>>+       for (i = 0; i < ret; i++) {
> 
> 
> Pretty please? :-)
> 

Hey that's not actually a bad idea.

I was thinking about how to make that function nicer,
but the voices in my head just kept telling me to rename
'i' to 'ret3' ;)

So, thanks!

Nick

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 5/7] radix-tree: lockless readside
  2005-08-11 12:25         ` [patch 5/7] radix-tree: lockless readside Nick Piggin
  2005-08-11 12:28           ` [patch 6/7] mm: lockless pagecache Nick Piggin
@ 2005-08-12  1:37           ` Paul E. McKenney
  2005-08-12  3:59             ` Nick Piggin
  2005-08-12  4:38             ` Nick Piggin
  1 sibling, 2 replies; 16+ messages in thread
From: Paul E. McKenney @ 2005-08-12  1:37 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Paul McKenney, Dipankar Sarma, Ingo Molnar, linux-kernel

On Thu, Aug 11, 2005 at 10:25:47PM +1000, Nick Piggin wrote:
> 5/7
> 
> -- 
> SUSE Labs, Novell Inc.
> 

> Make radix tree lookups safe to be performed without locks.
> Readers are protected against nodes being deleted by using RCU
> based freeing. Readers are protected against new node insertion
> by using memory barriers to ensure the node itself will be
> properly written before it is visible in the radix tree.
> 
> Also introduce a lockfree gang_lookup_slot which will be used
> by a future patch.

Interesting approach!  Don't claim to fully understand it, but
see below (search for empty lines).  In the meantime, some questions:

o	What exactly is RCU protecting?  My first guess is that
	it protects the pointers and internal nodes of the
	radix tree, but not the objects in the leaves of the
	trees (in other words, the things pointed to by the
	return value from things like radix_tree_lookup_slot()).

	But if this really is the case, then the rcu_read_lock() &
	rcu_read_unlock() pairs can be pushed down into
	radix_tree_lookup_slot() and friends.

o	The current code structure would lead me to believe that
	page_cache_get_speculative() is protected by RCU, but
	I don't see the corresponding call_rcu() or synchronize_rcu()
	that would cause this protection to be required.

	I will expand on this in a reply to the relevant patch...

						Thanx, Paul

> Index: linux-2.6/lib/radix-tree.c
> ===================================================================
> --- linux-2.6.orig/lib/radix-tree.c
> +++ linux-2.6/lib/radix-tree.c
> @@ -29,6 +29,7 @@
>  #include <linux/gfp.h>
>  #include <linux/string.h>
>  #include <linux/bitops.h>
> +#include <linux/rcupdate.h>
>  
>  
>  #ifdef __KERNEL__
> @@ -45,7 +46,9 @@
>  	((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
>  
>  struct radix_tree_node {
> +	unsigned int	height;		/* Height from the bottom */
>  	unsigned int	count;
> +	struct rcu_head	rcu_head;
>  	void		*slots[RADIX_TREE_MAP_SIZE];
>  	unsigned long	tags[RADIX_TREE_TAGS][RADIX_TREE_TAG_LONGS];
>  };
> @@ -97,10 +100,17 @@ radix_tree_node_alloc(struct radix_tree_
>  	return ret;
>  }
>  
> +static void radix_tree_node_rcu_free(struct rcu_head *head)
> +{
> +	struct radix_tree_node *node =
> +			container_of(head, struct radix_tree_node, rcu_head);
> +	kmem_cache_free(radix_tree_node_cachep, node);
> +}
> +
>  static inline void
>  radix_tree_node_free(struct radix_tree_node *node)
>  {
> -	kmem_cache_free(radix_tree_node_cachep, node);
> +	call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
>  }
>  
>  /*
> @@ -196,6 +206,7 @@ static int radix_tree_extend(struct radi
>  	}
>  
>  	do {
> +		unsigned int newheight;
>  		if (!(node = radix_tree_node_alloc(root)))
>  			return -ENOMEM;
>  
> @@ -208,9 +219,13 @@ static int radix_tree_extend(struct radi
>  				tag_set(node, tag, 0);
>  		}
>  
> +		newheight = root->height+1;
> +		node->height = newheight;
>  		node->count = 1;
> +		/* Make ->height visible before node visible via ->rnode */

> +		smp_wmb();
>  		root->rnode = node;

The prior two lines should instead be:

		rcu_assign_pointer(root->rnode, node);

> -		root->height++;
> +		root->height = newheight;
>  	} while (height > root->height);
>  out:
>  	return 0;
> @@ -250,9 +265,12 @@ int radix_tree_insert(struct radix_tree_
>  			/* Have to add a child node.  */
>  			if (!(tmp = radix_tree_node_alloc(root)))
>  				return -ENOMEM;
> -			*slot = tmp;
> +			tmp->height = height;
>  			if (node)
>  				node->count++;
> +			/* Make ->height visible before node visible via slot */
> +			smp_wmb();
> +			*slot = tmp;
>  		}
>  
>  		/* Go a level down */
> @@ -282,12 +300,14 @@ static inline void **__lookup_slot(struc
>  	unsigned int height, shift;
>  	struct radix_tree_node **slot;
>  
> -	height = root->height;
> +	if (root->rnode == NULL)
> +		return NULL;
> +	slot = &root->rnode;
> +	height = (*slot)->height;
>  	if (index > radix_tree_maxindex(height))
>  		return NULL;
>  
>  	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
> -	slot = &root->rnode;
>  
>  	while (height > 0) {
>  		if (*slot == NULL)
> @@ -491,21 +511,24 @@ EXPORT_SYMBOL(radix_tree_tag_get);
>  #endif
>  
>  static unsigned int
> -__lookup(struct radix_tree_root *root, void **results, unsigned long index,
> +__lookup(struct radix_tree_root *root, void ***results, unsigned long index,
>  	unsigned int max_items, unsigned long *next_index)
>  {
> +	unsigned long i;
>  	unsigned int nr_found = 0;
>  	unsigned int shift;
> -	unsigned int height = root->height;
> +	unsigned int height;
>  	struct radix_tree_node *slot;
>  
> -	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
>  	slot = root->rnode;
> +	if (!slot)
> +		goto out;
> +	height = slot->height;
> +	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
>  
> -	while (height > 0) {
> -		unsigned long i = (index >> shift) & RADIX_TREE_MAP_MASK;
> -
> -		for ( ; i < RADIX_TREE_MAP_SIZE; i++) {
> +	for (;;) {
> +		for (i = (index >> shift) & RADIX_TREE_MAP_MASK;
> +						i < RADIX_TREE_MAP_SIZE; i++) {
>  			if (slot->slots[i] != NULL)
>  				break;
>  			index &= ~((1UL << shift) - 1);
> @@ -516,21 +539,23 @@ __lookup(struct radix_tree_root *root, v
>  		if (i == RADIX_TREE_MAP_SIZE)
>  			goto out;
>  		height--;
> -		if (height == 0) {	/* Bottom level: grab some items */
> -			unsigned long j = index & RADIX_TREE_MAP_MASK;
> -
> -			for ( ; j < RADIX_TREE_MAP_SIZE; j++) {
> -				index++;
> -				if (slot->slots[j]) {
> -					results[nr_found++] = slot->slots[j];
> -					if (nr_found == max_items)
> -						goto out;
> -				}
> -			}
> +		if (height == 0) {
> +			/* Bottom level: grab some items */
> +			break;
>  		}
>  		shift -= RADIX_TREE_MAP_SHIFT;
>  		slot = slot->slots[i];
>  	}
> +
> +	for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) {
> +		index++;
> +		if (slot->slots[i]) {
> +			results[nr_found++] = &(slot->slots[i]);
> +			if (nr_found == max_items)
> +				goto out;
> +		}
> +	}
> +
>  out:
>  	*next_index = index;
>  	return nr_found;
> @@ -558,6 +583,43 @@ radix_tree_gang_lookup(struct radix_tree
>  	unsigned int ret = 0;
>  
>  	while (ret < max_items) {
> +		unsigned int nr_found, i;
> +		unsigned long next_index;	/* Index of next search */
> +
> +		if (cur_index > max_index)
> +			break;
> +		nr_found = __lookup(root, (void ***)results + ret, cur_index,
> +					max_items - ret, &next_index);
> +		for (i = 0; i < nr_found; i++)
> +			results[ret + i] = *(((void ***)results)[ret + i]);
> +		ret += nr_found;
> +		if (next_index == 0)
> +			break;
> +		cur_index = next_index;
> +	}
> +	return ret;
> +}
> +EXPORT_SYMBOL(radix_tree_gang_lookup);
> +
> +/**
> + *	radix_tree_gang_lookup_slot - perform multiple lookup on a radix tree
> + *	@root:		radix tree root
> + *	@results:	where the results of the lookup are placed
> + *	@first_index:	start the lookup from this key
> + *	@max_items:	place up to this many items at *results
> + *
> + *	Same as radix_tree_gang_lookup, but returns an array of pointers
> + *	(slots) to the stored items instead of the items themselves.
> + */
> +unsigned int
> +radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
> +			unsigned long first_index, unsigned int max_items)
> +{
> +	const unsigned long max_index = radix_tree_maxindex(root->height);
> +	unsigned long cur_index = first_index;
> +	unsigned int ret = 0;
> +
> +	while (ret < max_items) {
>  		unsigned int nr_found;
>  		unsigned long next_index;	/* Index of next search */
>  
> @@ -572,7 +634,8 @@ radix_tree_gang_lookup(struct radix_tree
>  	}
>  	return ret;
>  }
> -EXPORT_SYMBOL(radix_tree_gang_lookup);
> +EXPORT_SYMBOL(radix_tree_gang_lookup_slot);
> +
>  
>  /*
>   * FIXME: the two tag_get()s here should use find_next_bit() instead of
> Index: linux-2.6/include/linux/radix-tree.h
> ===================================================================
> --- linux-2.6.orig/include/linux/radix-tree.h
> +++ linux-2.6/include/linux/radix-tree.h
> @@ -51,6 +51,9 @@ void *radix_tree_delete(struct radix_tre
>  unsigned int
>  radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
>  			unsigned long first_index, unsigned int max_items);
> +unsigned int
> +radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
> +			unsigned long first_index, unsigned int max_items);
>  int radix_tree_preload(int gfp_mask);
>  void radix_tree_init(void);
>  void *radix_tree_tag_set(struct radix_tree_root *root,


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 6/7] mm: lockless pagecache
  2005-08-11 12:28           ` [patch 6/7] mm: lockless pagecache Nick Piggin
  2005-08-11 12:28             ` [patch 7/7] mm: spinlock tree_lock Nick Piggin
  2005-08-11 13:58             ` [patch 6/7] mm: lockless pagecache Pekka Enberg
@ 2005-08-12  1:49             ` Paul E. McKenney
  2005-08-12  4:04               ` Nick Piggin
  2 siblings, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2005-08-12  1:49 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Paul McKenney, Dipankar Sarma, Ingo Molnar, linux-kernel

On Thu, Aug 11, 2005 at 10:28:04PM +1000, Nick Piggin wrote:
> 6/7
> 
> -- 
> SUSE Labs, Novell Inc.
> 

> Use the speculative get_page and the lockless radix tree lookups
> to introduce lockless page cache lookups (ie. no mapping->tree_lock).
> 
> The only atomicity changes this should introduce is the use of a
> non atomic pagevec lookup for truncate, however what atomicity
> guarantees there were are probably not too useful anyway.

I don't understand the placement of the rcu_read_lock() and
rcu_read_unlock() calls.  Again, possibly because I don't understand
the overall algorithm yet.  And again, search for blank lines.

> Index: linux-2.6/mm/filemap.c
> ===================================================================
> --- linux-2.6.orig/mm/filemap.c
> +++ linux-2.6/mm/filemap.c
> @@ -379,18 +379,25 @@ int add_to_page_cache(struct page *page,
>  	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
>  
>  	if (error == 0) {
> +		page_cache_get(page);
> +		__SetPageLocked(page);
> +		page->mapping = mapping;
> +		page->index = offset;
> +
>  		write_lock_irq(&mapping->tree_lock);
>  		error = radix_tree_insert(&mapping->page_tree, offset, page);
>  		if (!error) {
> -			page_cache_get(page);
> -			SetPageLocked(page);
> -			page->mapping = mapping;
> -			page->index = offset;
>  			mapping->nrpages++;
>  			pagecache_acct(1);
>  		}
>  		write_unlock_irq(&mapping->tree_lock);
>  		radix_tree_preload_end();
> +
> +		if (error) {
> +			page->mapping = NULL;
> +			__put_page(page);
> +			__ClearPageLocked(page);
> +		}
>  	}
>  	return error;
>  }
> @@ -500,13 +507,15 @@ EXPORT_SYMBOL(__lock_page);
>   */
>  struct page * find_get_page(struct address_space *mapping, unsigned long offset)
>  {
> -	struct page *page;
> +	struct page **pagep;
> +	struct page *page = NULL;
>  
> -	read_lock_irq(&mapping->tree_lock);
> -	page = radix_tree_lookup(&mapping->page_tree, offset);
> -	if (page)
> -		page_cache_get(page);
> -	read_unlock_irq(&mapping->tree_lock);
> +	rcu_read_lock();
> +	pagep = (struct page **)radix_tree_lookup_slot(&mapping->page_tree,
> +									offset);
> +	if (pagep)
> +		page = page_cache_get_speculative(pagep);

The data structures accessed by page_cache_get_speculative() don't
seem to be freed via RCU.  My guess is that this is because these
data structures (struct page) never really go away.  However, they
do get re-used, and this re-use must be protected against.  It looks
to me that this protection takes the form of atomic instructions
in get_page_testone() and the like.  If this is the case, then it
should be possible to push the rcu_read_lock() and rcu_read_unlock()
down into the radix_tree_lookup_slot().

> +	rcu_read_unlock();
>  	return page;
>  }
>  
> @@ -519,12 +528,24 @@ struct page *find_trylock_page(struct ad
>  {
>  	struct page *page;
>  
> -	read_lock_irq(&mapping->tree_lock);
> -	page = radix_tree_lookup(&mapping->page_tree, offset);
> -	if (page && TestSetPageLocked(page))
> -		page = NULL;
> -	read_unlock_irq(&mapping->tree_lock);
> -	return page;
> +	page = find_get_page(mapping, offset);
> +	if (page) {
> +		if (TestSetPageLocked(page))
> +			goto out_failed;
> +		/* Has the page been truncated before being locked? */
> +		if (page->mapping != mapping || page->index != offset) {
> +			unlock_page(page);
> +			goto out_failed;
> +		}
> +
> +		/* Silly interface requires us to drop the refcount */
> +		__put_page(page);
> +		return page;
> +
> +out_failed:
> +		page_cache_release(page);
> +	}
> +	return NULL;
>  }
>  
>  EXPORT_SYMBOL(find_trylock_page);
> @@ -545,25 +566,17 @@ struct page *find_lock_page(struct addre
>  {
>  	struct page *page;
>  
> -	read_lock_irq(&mapping->tree_lock);
>  repeat:
> -	page = radix_tree_lookup(&mapping->page_tree, offset);
> +	page = find_get_page(mapping, offset);
>  	if (page) {
> -		page_cache_get(page);
> -		if (TestSetPageLocked(page)) {
> -			read_unlock_irq(&mapping->tree_lock);
> -			lock_page(page);
> -			read_lock_irq(&mapping->tree_lock);
> -
> -			/* Has the page been truncated while we slept? */
> -			if (page->mapping != mapping || page->index != offset) {
> -				unlock_page(page);
> -				page_cache_release(page);
> -				goto repeat;
> -			}
> +		lock_page(page);
> +		/* Has the page been truncated before being locked? */
> +		if (page->mapping != mapping || page->index != offset) {
> +			unlock_page(page);
> +			page_cache_release(page);
> +			goto repeat;
>  		}
>  	}
> -	read_unlock_irq(&mapping->tree_lock);
>  	return page;
>  }
>  
> @@ -646,6 +659,32 @@ unsigned find_get_pages(struct address_s
>  	return ret;
>  }
>  
> +unsigned find_get_pages_nonatomic(struct address_space *mapping, pgoff_t start,
> +			    unsigned int nr_pages, struct page **pages)
> +{
> +	unsigned int i;
> +	unsigned int ret;
> +	unsigned int ret2;
> +
> +	/*
> +	 * We do some unsightly casting to use the array first for storing
> +	 * pointers to the page pointers, and then for the pointers to
> +	 * the pages themselves that the caller wants.
> +	 */
> +	rcu_read_lock();
> +	ret = radix_tree_gang_lookup_slot(&mapping->page_tree,
> +				(void ***)pages, start, nr_pages);
> +	ret2 = 0;
> +	for (i = 0; i < ret; i++) {
> +		struct page *page;
> +		page = page_cache_get_speculative(((struct page ***)pages)[i]);
> +		if (page)
> +			pages[ret2++] = page;
> +	}

Same thing here -- I don't see why the rcu_read_lock() and rcu_read_unlock()
should not be pushed down into radix_tree_gang_lookup_slot().

> +	rcu_read_unlock();
> +	return ret2;
> +}
> +
>  /*
>   * Like find_get_pages, except we only return pages which are tagged with
>   * `tag'.   We update *index to index the next page for the traversal.
> Index: linux-2.6/mm/readahead.c
> ===================================================================
> --- linux-2.6.orig/mm/readahead.c
> +++ linux-2.6/mm/readahead.c
> @@ -272,27 +272,26 @@ __do_page_cache_readahead(struct address
>  	/*
>  	 * Preallocate as many pages as we will need.
>  	 */
> -	read_lock_irq(&mapping->tree_lock);
>  	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
>  		unsigned long page_offset = offset + page_idx;
>  		
>  		if (page_offset > end_index)
>  			break;
>  
> +		/* Don't need mapping->tree_lock - lookup can be racy */
> +		rcu_read_lock();
>  		page = radix_tree_lookup(&mapping->page_tree, page_offset);
> +		rcu_read_unlock();

Here, they could clearly be pushed down.  ;-)

>  		if (page)
>  			continue;
>  
> -		read_unlock_irq(&mapping->tree_lock);
>  		page = page_cache_alloc_cold(mapping);
> -		read_lock_irq(&mapping->tree_lock);
>  		if (!page)
>  			break;
>  		page->index = page_offset;
>  		list_add(&page->lru, &page_pool);
>  		ret++;
>  	}
> -	read_unlock_irq(&mapping->tree_lock);
>  
>  	/*
>  	 * Now start the IO.  We ignore I/O errors - if the page is not
> Index: linux-2.6/mm/swap_state.c
> ===================================================================
> --- linux-2.6.orig/mm/swap_state.c
> +++ linux-2.6/mm/swap_state.c
> @@ -76,19 +76,26 @@ static int __add_to_swap_cache(struct pa
>  	BUG_ON(PagePrivate(page));
>  	error = radix_tree_preload(gfp_mask);
>  	if (!error) {
> +		page_cache_get(page);
> +		SetPageLocked(page);
> +		SetPageSwapCache(page);
> +		page->private = entry.val;
> +
>  		write_lock_irq(&swapper_space.tree_lock);
>  		error = radix_tree_insert(&swapper_space.page_tree,
>  						entry.val, page);
>  		if (!error) {
> -			page_cache_get(page);
> -			SetPageLocked(page);
> -			SetPageSwapCache(page);
> -			page->private = entry.val;
>  			total_swapcache_pages++;
>  			pagecache_acct(1);
>  		}
>  		write_unlock_irq(&swapper_space.tree_lock);
>  		radix_tree_preload_end();
> +
> +		if (error) {
> +			__put_page(page);
> +			ClearPageLocked(page);
> +			ClearPageSwapCache(page);
> +		}
>  	}
>  	return error;
>  }
> Index: linux-2.6/include/linux/page-flags.h
> ===================================================================
> --- linux-2.6.orig/include/linux/page-flags.h
> +++ linux-2.6/include/linux/page-flags.h
> @@ -167,16 +167,13 @@ extern void __mod_page_state(unsigned lo
>  /*
>   * Manipulation of page state flags
>   */
> -#define PageLocked(page)		\
> -		test_bit(PG_locked, &(page)->flags)
> -#define SetPageLocked(page)		\
> -		set_bit(PG_locked, &(page)->flags)
> -#define TestSetPageLocked(page)		\
> -		test_and_set_bit(PG_locked, &(page)->flags)
> -#define ClearPageLocked(page)		\
> -		clear_bit(PG_locked, &(page)->flags)
> -#define TestClearPageLocked(page)	\
> -		test_and_clear_bit(PG_locked, &(page)->flags)
> +#define PageLocked(page)	test_bit(PG_locked, &(page)->flags)
> +#define SetPageLocked(page)	set_bit(PG_locked, &(page)->flags)
> +#define __SetPageLocked(page)	__set_bit(PG_locked, &(page)->flags)
> +#define TestSetPageLocked(page)	test_and_set_bit(PG_locked, &(page)->flags)
> +#define ClearPageLocked(page)	clear_bit(PG_locked, &(page)->flags)
> +#define __ClearPageLocked(page)	__clear_bit(PG_locked, &(page)->flags)
> +#define TestClearPageLocked(page) test_and_clear_bit(PG_locked, &(page)->flags)
>  
>  #define PageError(page)		test_bit(PG_error, &(page)->flags)
>  #define SetPageError(page)	set_bit(PG_error, &(page)->flags)
> Index: linux-2.6/include/linux/pagemap.h
> ===================================================================
> --- linux-2.6.orig/include/linux/pagemap.h
> +++ linux-2.6/include/linux/pagemap.h
> @@ -130,6 +130,8 @@ extern struct page * find_or_create_page
>  				unsigned long index, unsigned int gfp_mask);
>  unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
>  			unsigned int nr_pages, struct page **pages);
> +unsigned find_get_pages_nonatomic(struct address_space *mapping, pgoff_t start,
> +			unsigned int nr_pages, struct page **pages);
>  unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
>  			int tag, unsigned int nr_pages, struct page **pages);
>  
> Index: linux-2.6/include/linux/pagevec.h
> ===================================================================
> --- linux-2.6.orig/include/linux/pagevec.h
> +++ linux-2.6/include/linux/pagevec.h
> @@ -25,6 +25,8 @@ void __pagevec_lru_add_active(struct pag
>  void pagevec_strip(struct pagevec *pvec);
>  unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
>  		pgoff_t start, unsigned nr_pages);
> +unsigned pagevec_lookup_nonatomic(struct pagevec *pvec,
> +	struct address_space *mapping, pgoff_t start, unsigned nr_pages);
>  unsigned pagevec_lookup_tag(struct pagevec *pvec,
>  		struct address_space *mapping, pgoff_t *index, int tag,
>  		unsigned nr_pages);
> Index: linux-2.6/mm/swap.c
> ===================================================================
> --- linux-2.6.orig/mm/swap.c
> +++ linux-2.6/mm/swap.c
> @@ -380,6 +380,19 @@ unsigned pagevec_lookup(struct pagevec *
>  	return pagevec_count(pvec);
>  }
>  
> +/**
> + * pagevec_lookup_nonatomic - non atomic pagevec_lookup
> + *
> + * This routine is non-atomic in that it may return blah.
> + */
> +unsigned pagevec_lookup_nonatomic(struct pagevec *pvec,
> +		struct address_space *mapping, pgoff_t start, unsigned nr_pages)
> +{
> +	pvec->nr = find_get_pages_nonatomic(mapping, start,
> +					nr_pages, pvec->pages);
> +	return pagevec_count(pvec);
> +}
> +
>  unsigned pagevec_lookup_tag(struct pagevec *pvec, struct address_space *mapping,
>  		pgoff_t *index, int tag, unsigned nr_pages)
>  {
> Index: linux-2.6/mm/truncate.c
> ===================================================================
> --- linux-2.6.orig/mm/truncate.c
> +++ linux-2.6/mm/truncate.c
> @@ -126,7 +126,7 @@ void truncate_inode_pages(struct address
>  
>  	pagevec_init(&pvec, 0);
>  	next = start;
> -	while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> +	while (pagevec_lookup_nonatomic(&pvec, mapping, next, PAGEVEC_SIZE)) {
>  		for (i = 0; i < pagevec_count(&pvec); i++) {
>  			struct page *page = pvec.pages[i];
>  			pgoff_t page_index = page->index;
> @@ -160,7 +160,7 @@ void truncate_inode_pages(struct address
>  	next = start;
>  	for ( ; ; ) {
>  		cond_resched();
> -		if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> +		if (!pagevec_lookup_nonatomic(&pvec, mapping, next, PAGEVEC_SIZE)) {
>  			if (next == start)
>  				break;
>  			next = start;
> @@ -206,7 +206,7 @@ unsigned long invalidate_mapping_pages(s
>  
>  	pagevec_init(&pvec, 0);
>  	while (next <= end &&
> -			pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> +		pagevec_lookup_nonatomic(&pvec, mapping, next, PAGEVEC_SIZE)) {
>  		for (i = 0; i < pagevec_count(&pvec); i++) {
>  			struct page *page = pvec.pages[i];
>  
> Index: linux-2.6/mm/page-writeback.c
> ===================================================================
> --- linux-2.6.orig/mm/page-writeback.c
> +++ linux-2.6/mm/page-writeback.c
> @@ -811,6 +811,7 @@ int mapping_tagged(struct address_space 
>  	unsigned long flags;
>  	int ret;
>  
> +	/* XXX: radix_tree_tagged is safe to run without the lock? */
>  	read_lock_irqsave(&mapping->tree_lock, flags);
>  	ret = radix_tree_tagged(&mapping->page_tree, tag);
>  	read_unlock_irqrestore(&mapping->tree_lock, flags);


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 5/7] radix-tree: lockless readside
  2005-08-12  1:37           ` [patch 5/7] radix-tree: lockless readside Paul E. McKenney
@ 2005-08-12  3:59             ` Nick Piggin
  2005-08-12  4:38             ` Nick Piggin
  1 sibling, 0 replies; 16+ messages in thread
From: Nick Piggin @ 2005-08-12  3:59 UTC (permalink / raw)
  To: paulmck; +Cc: Paul McKenney, Dipankar Sarma, Ingo Molnar, lkml

On Thu, 2005-08-11 at 18:37 -0700, Paul E. McKenney wrote:
> On Thu, Aug 11, 2005 at 10:25:47PM +1000, Nick Piggin wrote:
> > 5/7
> > 
> > -- 
> > SUSE Labs, Novell Inc.
> > 
> 
> > Make radix tree lookups safe to be performed without locks.
> > Readers are protected against nodes being deleted by using RCU
> > based freeing. Readers are protected against new node insertion
> > by using memory barriers to ensure the node itself will be
> > properly written before it is visible in the radix tree.
> > 
> > Also introduce a lockfree gang_lookup_slot which will be used
> > by a future patch.
> 
> Interesting approach!  Don't claim to fully understand it, but
> see below (search for empty lines).  In the meantime, some questions:
> 

Thanks for comments and review - very helpful.

> o	What exactly is RCU protecting?  My first guess is that
> 	it protects the pointers and internal nodes of the
> 	radix tree, but not the objects in the leaves of the
> 	trees (in other words, the things pointed to by the
> 	return value from things like radix_tree_lookup_slot()).
> 

Ah OK, this wasn't initially apparent to me either, however I
believe it is needed.

Indeed we are protecting the internal nodes of the radix tree,
however the speculative get operation needs to dereference a
*pointer* to the page. The `struct page` itself doesn't go away,
and thus doesn't need any protection. However the *pointer* can
go away - it is contained within a radix tree node.

> 	But if this really is the case, then the rcu_read_lock() &
> 	rcu_read_unlock() pairs can be pushed down into
> 	radix_tree_lookup_slot() and friends.
> 
> o	The current code structure would lead me to believe that
> 	page_cache_get_speculative() is protected by RCU, but
> 	I don't see the corresponding call_rcu() or synchronize_rcu()
> 	that would cause this protection to be required.
> 

I believe this should answer your concern about the RCU protection
placement.

One thing we could do differently is this: instead of receiving
a pointer to the struct page from the radix tree lookup, we could
actually do as you say and push the rcu locking into the radix
tree lookup.

Then we would do a speculative get_page on that struct page,
finally we would lookup the radix tree again to see whether the
same page is still in the pagecache.

I discounted this idea not just for efficiency, but also because
it would require either a 2 phase speculative get (or a speculative
undo), or would require teaching speculative get about the pagecache
radix tree.

> > @@ -208,9 +219,13 @@ static int radix_tree_extend(struct radi
> >  				tag_set(node, tag, 0);
> >  		}
> >  
> > +		newheight = root->height+1;
> > +		node->height = newheight;
> >  		node->count = 1;
> > +		/* Make ->height visible before node visible via ->rnode */
> 
> > +		smp_wmb();
> >  		root->rnode = node;
> 
> The prior two lines should instead be:
> 
> 		rcu_assign_pointer(root->rnode, node);
> 

Great, thanks very much.

-- 
SUSE Labs, Novell Inc.



Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 6/7] mm: lockless pagecache
  2005-08-12  1:49             ` Paul E. McKenney
@ 2005-08-12  4:04               ` Nick Piggin
  0 siblings, 0 replies; 16+ messages in thread
From: Nick Piggin @ 2005-08-12  4:04 UTC (permalink / raw)
  To: paulmck; +Cc: Paul McKenney, Dipankar Sarma, Ingo Molnar, lkml

On Thu, 2005-08-11 at 18:49 -0700, Paul E. McKenney wrote:
> On Thu, Aug 11, 2005 at 10:28:04PM +1000, Nick Piggin wrote:
> > 6/7
> > 
> > -- 
> > SUSE Labs, Novell Inc.
> > 
> 
> > Use the speculative get_page and the lockless radix tree lookups
> > to introduce lockless page cache lookups (ie. no mapping->tree_lock).
> > 
> > The only atomicity changes this should introduce is the use of a
> > non atomic pagevec lookup for truncate, however what atomicity
> > guarantees there were are probably not too useful anyway.
> 
> I don't understand the placement of the rcu_read_lock() and
> rcu_read_unlock() calls.  Again, possibly because I don't understand
> the overall algorithm yet.  And again, search for blank lines.
> 

I hope I gave a satisfactory answer to this in the last email. If
not, let me know what I've missed...

Indeed you are right that we could push all the RCU locking down
into the radix tree lookups in the places you mention _if_ we
didn't rely on lookups returning radix tree *slots*.

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.



Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 5/7] radix-tree: lockless readside
  2005-08-12  1:37           ` [patch 5/7] radix-tree: lockless readside Paul E. McKenney
  2005-08-12  3:59             ` Nick Piggin
@ 2005-08-12  4:38             ` Nick Piggin
  2005-08-12  7:53               ` Nick Piggin
  1 sibling, 1 reply; 16+ messages in thread
From: Nick Piggin @ 2005-08-12  4:38 UTC (permalink / raw)
  To: paulmck; +Cc: Paul McKenney, Dipankar Sarma, Ingo Molnar, lkml

On Thu, 2005-08-11 at 18:37 -0700, Paul E. McKenney wrote:
> On Thu, Aug 11, 2005 at 10:25:47PM +1000, Nick Piggin wrote:
> > 5/7
> > 
> > -- 
> > SUSE Labs, Novell Inc.
> > 
> 
> > Make radix tree lookups safe to be performed without locks.
> > Readers are protected against nodes being deleted by using RCU
> > based freeing. Readers are protected against new node insertion
> > by using memory barriers to ensure the node itself will be
> > properly written before it is visible in the radix tree.
> > 
> > Also introduce a lockfree gang_lookup_slot which will be used
> > by a future patch.
> 
> Interesting approach!  Don't claim to fully understand it, but
> see below (search for empty lines).  In the meantime, some questions:
> 

Basically, the main idea is this:

find_get_page (the current, locked version) will take the tree_lock,
elevate the refcount of the page currently in pagecache, and drop
the tree_lock.

tree_lock is used to a) ensure consistency of the radix tree, and
b) "pin" the page until we are able to take a reference to it.

After find_get_page returns the page with elevated refcount, it
has released the tree_lock, so there is no other atomicity /
serialisation provided other than the above 2 functions.

* Assuming my RCU'ed radix-tree is correct, that takes care of a.

* Now, we look up a 'struct page' that has existed in the pagecache
  at *some* point in time (ie. dereference the pagep pointer).

* Then, take a reference on that struct page, regardless of whether
  or not it is *now* in pagecache. This act of taking a reference
  is also a memory barrier.

* Now we can again check if the page existed in pagecache _after_
  taking that reference. If yes, we have a reference to the page
  we want.

* If it is no longer in pagecache, we assume it is the wrong page.
  Even if there is a new page in that part of the pagecache we can
  return NULL, because the pagecache would have had to be NULL at
  *some* point between the 2 times we load the pagep pointer.

With the above, we can meet the same requirements of the current
find_get_page. Which basically are:

x) If the page was ever[1] in pagecache, it may be returned
y) If the pagecache was ever[2] empty, NULL may be returned

[1], [2] - in accordance with the high level serialisation
schema, of course. The main point is that the tree_lock in
find_get_page can be taken at any arbitrary time and thus
doesn't provide anything past x and y.

Similar arguments for find_lock_page and find_get_pages, etc.


Anyway, that's the high level idea. All the rest of the machinery
is geared to handle the case where we have taken a reference to
the page after it has been removed from pagecache or used for
something else. This is admittedly fairly complex, but don't get
too hung up about it when getting an overview of it is supposed
to work.

-- 
SUSE Labs, Novell Inc.



Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 5/7] radix-tree: lockless readside
  2005-08-12  4:38             ` Nick Piggin
@ 2005-08-12  7:53               ` Nick Piggin
  0 siblings, 0 replies; 16+ messages in thread
From: Nick Piggin @ 2005-08-12  7:53 UTC (permalink / raw)
  To: paulmck; +Cc: Paul McKenney, Dipankar Sarma, Ingo Molnar, lkml

Nick Piggin wrote:

> With the above, we can meet the same requirements of the current
> find_get_page. Which basically are:
> 
> x) If the page was ever[1] in pagecache, it may be returned
> y) If the pagecache was ever[2] empty, NULL may be returned
> 

Oh, I missed a couple of "obvious" ones.

More correctly:

x1) If a page was ever in pagecache, it may be returned.
x1) If not, then NULL will be returned.

y1) If the pagecache was ever empty, NULL may be returned.
y2) If not, then the page will be returned.


-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2005-08-12  7:53 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-08-11 12:18 [patch 0/7] lockless pagecache 2 Nick Piggin
2005-08-11 12:21 ` [patch 1/7] mm: remove PageReserved rollup Nick Piggin
2005-08-11 12:22   ` [patch 2/7] mm: PG_free flag Nick Piggin
2005-08-11 12:22     ` [patch 3/7] mm: speculative get_page Nick Piggin
2005-08-11 12:25       ` [patch 4/7] radix-tree: lookup_slot Nick Piggin
2005-08-11 12:25         ` [patch 5/7] radix-tree: lockless readside Nick Piggin
2005-08-11 12:28           ` [patch 6/7] mm: lockless pagecache Nick Piggin
2005-08-11 12:28             ` [patch 7/7] mm: spinlock tree_lock Nick Piggin
2005-08-11 13:58             ` [patch 6/7] mm: lockless pagecache Pekka Enberg
2005-08-11 14:06               ` Nick Piggin
2005-08-12  1:49             ` Paul E. McKenney
2005-08-12  4:04               ` Nick Piggin
2005-08-12  1:37           ` [patch 5/7] radix-tree: lockless readside Paul E. McKenney
2005-08-12  3:59             ` Nick Piggin
2005-08-12  4:38             ` Nick Piggin
2005-08-12  7:53               ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox