* VMA_MERGING_FIXUP and patch
@ 2004-03-22 17:05 Hugh Dickins
2004-03-22 17:52 ` Andrea Arcangeli
2004-03-22 19:57 ` VMA_MERGING_FIXUP and patch Andrew Morton
0 siblings, 2 replies; 15+ messages in thread
From: Hugh Dickins @ 2004-03-22 17:05 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, linux-kernel
Just a reminder that you still have several #if VMA_MERGING_FIXUPs
in your 2.6.5-rc2-aa1 tree, so mprotects and mremaps are not merging
vmas at all.
I can understand if you'd prefer to leave the mremaps that way,
at least for now. (I do have code to decide whether any page is
shared, which I'll post later in anobjrmap 7/6; you could use the
same to allow mremap vma merging where unproblematic.) But I think
you ought to get to merging the mprotects: aren't there apps which
will give you a frightening number of vmas unless merged?
Here are some minor updates (no hurry) to objrmap.c,
mirroring recentish s390 mods to mainline rmap.c:
the page_test_and_clear_dirty I mentioned before.
Hmm, I wonder: is it safe to call set_page_dirty
from inside the page rmap lock? Andrew?
Hugh
--- 2.6.5-rc2-aa1/mm/objrmap.c 2004-03-22 11:38:55.000000000 +0000
+++ linux/mm/objrmap.c 2004-03-22 16:34:29.421216936 +0000
@@ -212,7 +212,7 @@ int fastcall page_referenced(struct page
BUG_ON(!page->mapping);
if (page_test_and_clear_young(page))
- mark_page_accessed(page);
+ referenced++;
if (TestClearPageReferenced(page))
referenced++;
@@ -327,8 +327,11 @@ void fastcall page_remove_rmap(struct pa
if (!page_mapped(page))
goto out_unlock;
- if (!--page->mapcount)
+ if (!--page->mapcount) {
+ if (page_test_and_clear_dirty(page))
+ set_page_dirty(page);
dec_page_state(nr_mapped);
+ }
if (PageAnon(page))
anon_vma_page_unlink(page);
@@ -520,6 +523,8 @@ int fastcall try_to_unmap(struct page *
ret = try_to_unmap_anon(page);
if (!page_mapped(page)) {
+ if (page_test_and_clear_dirty(page))
+ set_page_dirty(page);
dec_page_state(nr_mapped);
ret = SWAP_SUCCESS;
}
* Re: VMA_MERGING_FIXUP and patch
2004-03-22 17:05 VMA_MERGING_FIXUP and patch Hugh Dickins
@ 2004-03-22 17:52 ` Andrea Arcangeli
2004-03-22 19:02 ` Hugh Dickins
2004-03-22 19:57 ` VMA_MERGING_FIXUP and patch Andrew Morton
1 sibling, 1 reply; 15+ messages in thread
From: Andrea Arcangeli @ 2004-03-22 17:52 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel
On Mon, Mar 22, 2004 at 05:05:33PM +0000, Hugh Dickins wrote:
> Just a reminder that you still have several #if VMA_MERGING_FIXUPs
> in your 2.6.5-rc2-aa1 tree, so mprotects and mremaps are not merging
> vmas at all.
>
> I can understand if you'd prefer to leave the mremaps that way,
> at least for now. (I do have code to decide whether any page is
> shared, will post later in anobjrmap 7/6, you could use the same
> to allow mremap vma merging if unproblematic.) But I think you
> ought to get to merging the mprotects, aren't there apps which
> will give you a frightening number of vmas unless merged?
agreed, re-enabling the merging in mprotect is quite high prio,
but really some of those apps have never been able to merge in mprotect,
because we were never able to merge the file mappings after a
vma_split. The pro is that after the fixup, the file mappings will merge
fine too (the con is that the merging code will have to be more
complicated than it was before ;).
> Here's some minor updates (no hurry) to objrmap.c,
> mirroring recentish s390 mods to mainline rmap.c:
> the page_test_and_clear_dirty I mentioned before.
this is greatly appreciated, I didn't notice this change (I got
uninteresting rejects in mm/rmap.c.rej too often and at some point
mistakenly stopped looking at them).
> Hmm, I wonder, is that safe to be calling set_page_dirty
> from inside the page rmap lock? Andrew?
I agree, indeed I don't feel safe running it there (it will not hang
immediately, but it enforces an ordering on other spinlocks, and the
less enforced ordering the better); I'd prefer to set a local
variable and run it after the page_map_unlock. Actually, re-reading
page->mapcount locklessly should be ok too: if we don't read zero due
to a race, it means somebody else will run the unmap or remove_page_rmap
again. All that matters is that the very last one unmapping the page
reads zero, and that's guaranteed even without the spinlock.
> --- 2.6.5-rc2-aa1/mm/objrmap.c 2004-03-22 11:38:55.000000000 +0000
> +++ linux/mm/objrmap.c 2004-03-22 16:34:29.421216936 +0000
> @@ -212,7 +212,7 @@ int fastcall page_referenced(struct page
> BUG_ON(!page->mapping);
>
> if (page_test_and_clear_young(page))
> - mark_page_accessed(page);
> + referenced++;
>
> if (TestClearPageReferenced(page))
> referenced++;
I definitely agree with this one.
what about this?
thanks for spotting these s390 bugs!
--- x/mm/vmscan.c.~1~ 2004-03-21 15:21:42.000000000 +0100
+++ x/mm/vmscan.c 2004-03-22 18:39:05.551793088 +0100
@@ -313,16 +313,14 @@ shrink_list(struct list_head *page_list,
if (page_mapped(page) && mapping) {
switch (try_to_unmap(page)) {
case SWAP_FAIL:
- page_map_unlock(page);
goto activate_locked;
case SWAP_AGAIN:
- page_map_unlock(page);
goto keep_locked;
case SWAP_SUCCESS:
; /* try to free the page below */
}
- }
- page_map_unlock(page);
+ } else
+ page_map_unlock(page);
/*
* If the page is dirty, only perform writeback if that write
--- x/mm/memory.c.~1~ 2004-03-21 15:21:42.000000000 +0100
+++ x/mm/memory.c 2004-03-22 13:40:26.852849384 +0100
@@ -324,9 +324,11 @@ skip_copy_pte_range:
* Device driver pages must not be
* tracked by the VM for unmapping.
*/
- BUG_ON(!page_mapped(page));
- BUG_ON(!page->mapping);
- page_add_rmap(page, vma, address, PageAnon(page));
+ if (likely(page_mapped(page) && page->mapping))
+ page_add_rmap(page, vma, address, PageAnon(page));
+ else
+ printk("Badness in %s at %s:%d\n",
+ __FUNCTION__, __FILE__, __LINE__);
} else {
BUG_ON(page_mapped(page));
BUG_ON(page->mapping);
@@ -1429,7 +1431,9 @@ retry:
* real anonymous pages, they're "device" reserved pages instead.
*/
reserved = !!(vma->vm_flags & VM_RESERVED);
- WARN_ON(reserved == pageable);
+ if (unlikely(reserved == pageable))
+ printk("Badness in %s at %s:%d\n",
+ __FUNCTION__, __FILE__, __LINE__);
/*
* Should we do an early C-O-W break?
--- x/mm/objrmap.c.~1~ 2004-03-21 15:21:42.000000000 +0100
+++ x/mm/objrmap.c 2004-03-22 18:41:56.434814920 +0100
@@ -48,6 +48,12 @@ static inline void validate_anon_vma_fin
#endif
}
+static pte_t *
+find_pte_nonlinear(struct mm_struct * mm, struct vm_area_struct *vma, struct page *page)
+{
+
+}
+
/**
* find_pte - Find a pte pointer given a vma and a struct page.
* @vma: the vma to search
@@ -62,7 +68,7 @@ static inline void validate_anon_vma_fin
*
* It is the caller's responsibility to unmap the pte if it is returned.
*/
-static inline pte_t *
+static pte_t *
find_pte(struct vm_area_struct *vma, struct page *page, unsigned long *addr)
{
struct mm_struct *mm = vma->vm_mm;
@@ -72,6 +78,9 @@ find_pte(struct vm_area_struct *vma, str
unsigned long loffset;
unsigned long address;
+ if (vma->vm_flags & VM_NONLINEAR)
+ return find_pte_nonlinear(mm, vma, page);
+
loffset = (page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT));
address = vma->vm_start + ((loffset - vma->vm_pgoff) << PAGE_SHIFT);
if (address < vma->vm_start || address >= vma->vm_end)
@@ -120,6 +129,13 @@ page_referenced_one(struct vm_area_struc
pte_t *pte;
int referenced = 0;
+ /*
+ * Tracking the referenced info is too expensive
+ * for nonlinear mappings.
+ */
+ if (vma->vm_flags & VM_NONLINEAR)
+ goto out;
+
if (!spin_trylock(&mm->page_table_lock))
goto out;
@@ -212,7 +228,7 @@ int fastcall page_referenced(struct page
BUG_ON(!page->mapping);
if (page_test_and_clear_young(page))
- mark_page_accessed(page);
+ referenced++;
if (TestClearPageReferenced(page))
referenced++;
@@ -344,6 +360,10 @@ void fastcall page_remove_rmap(struct pa
out_unlock:
page_map_unlock(page);
+
+ if (page_test_and_clear_dirty(page) && !page_mapped(page))
+ set_page_dirty(page);
+
return;
}
@@ -523,6 +543,11 @@ int fastcall try_to_unmap(struct page *
dec_page_state(nr_mapped);
ret = SWAP_SUCCESS;
}
+ page_map_unlock(page);
+
+ if (page_test_and_clear_dirty(page) && ret == SWAP_SUCCESS)
+ set_page_dirty(page);
+
return ret;
}
* Re: VMA_MERGING_FIXUP and patch
2004-03-22 17:52 ` Andrea Arcangeli
@ 2004-03-22 19:02 ` Hugh Dickins
2004-03-22 19:58 ` Andrea Arcangeli
0 siblings, 1 reply; 15+ messages in thread
From: Hugh Dickins @ 2004-03-22 19:02 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, linux-kernel
On Mon, 22 Mar 2004, Andrea Arcangeli wrote:
>
> what about this?
>
> @@ -344,6 +360,10 @@ void fastcall page_remove_rmap(struct pa
>
> out_unlock:
> page_map_unlock(page);
> +
> + if (page_test_and_clear_dirty(page) && !page_mapped(page))
> + set_page_dirty(page);
> +
> return;
> }
No, it has to be
if (!page_mapped(page) && page_test_and_clear_dirty(page))
set_page_dirty(page);
but the positioning is fine.
> @@ -523,6 +543,11 @@ int fastcall try_to_unmap(struct page *
> dec_page_state(nr_mapped);
> ret = SWAP_SUCCESS;
> }
> + page_map_unlock(page);
> +
> + if (page_test_and_clear_dirty(page) && ret == SWAP_SUCCESS)
> + set_page_dirty(page);
> +
> return ret;
> }
No, it has to be
if (ret == SWAP_SUCCESS && page_test_and_clear_dirty(page))
set_page_dirty(page);
Personally, I'd prefer we leave try_to_unmap with the lock we
had on entry, and do this at the shrink_list end - though I
can see that the way you've done it actually reduces the code.
(The s390 header file comments that the page_test_and_clear_dirty
needs to be done while not mapped, because of race with referenced
bit, and we are opening up to a race now; but unless s390 is very
different, I see nothing wrong with a rare race on referenced -
whereas everything wrong with any race that might lose dirty.)
Excited by that glimpse of find_pte_nonlinear you just gave us;
disappointed to find it empty ;-)
Hugh
* Re: VMA_MERGING_FIXUP and patch
2004-03-22 17:05 VMA_MERGING_FIXUP and patch Hugh Dickins
2004-03-22 17:52 ` Andrea Arcangeli
@ 2004-03-22 19:57 ` Andrew Morton
2004-03-22 20:05 ` Andrea Arcangeli
1 sibling, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2004-03-22 19:57 UTC (permalink / raw)
To: Hugh Dickins; +Cc: andrea, linux-kernel
Hugh Dickins <hugh@veritas.com> wrote:
>
> Hmm, I wonder, is that safe to be calling set_page_dirty
> from inside the page rmap lock? Andrew?
set_page_dirty() takes ->tree_lock and inode_lock. tree_lock surely is OK,
and while I cannot think of any deadlocks which could occur from taking
inode_lock inside the rmap lock, it doesn't sound very nice.
It would of course be best if we could avoid adding a new ranking
relationship between these locks.
* Re: VMA_MERGING_FIXUP and patch
2004-03-22 19:02 ` Hugh Dickins
@ 2004-03-22 19:58 ` Andrea Arcangeli
2004-03-23 21:44 ` nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch] Andrea Arcangeli
0 siblings, 1 reply; 15+ messages in thread
From: Andrea Arcangeli @ 2004-03-22 19:58 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel
On Mon, Mar 22, 2004 at 07:02:01PM +0000, Hugh Dickins wrote:
> On Mon, 22 Mar 2004, Andrea Arcangeli wrote:
> >
> > what about this?
> >
> > @@ -344,6 +360,10 @@ void fastcall page_remove_rmap(struct pa
> >
> > out_unlock:
> > page_map_unlock(page);
> > +
> > + if (page_test_and_clear_dirty(page) && !page_mapped(page))
> > + set_page_dirty(page);
> > +
> > return;
> > }
>
> No, it has to be
> if (!page_mapped(page) && page_test_and_clear_dirty(page))
> set_page_dirty(page);
> but the positioning is fine.
you're right, I'll fix it up. You saved s390 for the second time in a row ;)
> > @@ -523,6 +543,11 @@ int fastcall try_to_unmap(struct page *
> > dec_page_state(nr_mapped);
> > ret = SWAP_SUCCESS;
> > }
> > + page_map_unlock(page);
> > +
> > + if (page_test_and_clear_dirty(page) && ret == SWAP_SUCCESS)
> > + set_page_dirty(page);
> > +
> > return ret;
> > }
>
> No, it has to be
> if (ret == SWAP_SUCCESS && page_test_and_clear_dirty(page))
> set_page_dirty(page);
agreed.
> Personally, I'd prefer we leave try_to_unmap with the lock we
> had on entry, and do this at the shrink_list end - though I
> can see that the way you've done it actually reduces the code.
I agree with you here too; I also would have preferred not to drop the
lock before returning, but I did it to reduce the code (in the asm
especially).
> (The s390 header file comments that the page_test_and_clear_dirty
> needs to be done while not mapped, because of race with referenced
> bit, and we are opening up to a race now; but unless s390 is very
> different, I see nothing wrong with a rare race on referenced -
> whereas everything wrong with any race that might lose dirty.)
Agreed. I don't see how we could ever lose the dirty bit; however, the
above comment is scary and needs clarification from the hw experts. If
it's not safe we can put it back under the page_map_lock. I don't recall
stuff taking the page_map_lock while holding the mapping locks, so it
should be safe (though if we can do it outside, that's preferable).
> Excited by that glimpse of find_pte_nonlinear you just gave us;
> disappointed to find it empty ;-)
8)8)
* Re: VMA_MERGING_FIXUP and patch
2004-03-22 19:57 ` VMA_MERGING_FIXUP and patch Andrew Morton
@ 2004-03-22 20:05 ` Andrea Arcangeli
2004-03-22 20:33 ` Andrew Morton
0 siblings, 1 reply; 15+ messages in thread
From: Andrea Arcangeli @ 2004-03-22 20:05 UTC (permalink / raw)
To: Andrew Morton; +Cc: Hugh Dickins, linux-kernel
On Mon, Mar 22, 2004 at 11:57:34AM -0800, Andrew Morton wrote:
> set_page_dirty() takes ->tree_lock and inode_lock. tree_lock surely is OK
> and while I cannot think of any deadlocks which could occur with taking
> inode_lock inside the rmap lock, it doesn't sound very nice.
>
> It would of course be best if we could avoid adding a new ranking
> relationship between these locks.
agreed. the inode_lock especially is more a vfs thing than a mm thing,
so it lives quite far away. (btw, in my last email I only mentioned
the mapping tree_lock because I was thinking of the swapcache; with
file-backed pagecache and the inode_lock it gets worse, though probably
still ok today)
* Re: VMA_MERGING_FIXUP and patch
2004-03-22 20:05 ` Andrea Arcangeli
@ 2004-03-22 20:33 ` Andrew Morton
0 siblings, 0 replies; 15+ messages in thread
From: Andrew Morton @ 2004-03-22 20:33 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: hugh, linux-kernel
Andrea Arcangeli <andrea@suse.de> wrote:
>
> On Mon, Mar 22, 2004 at 11:57:34AM -0800, Andrew Morton wrote:
> > set_page_dirty() takes ->tree_lock and inode_lock. tree_lock surely is OK
> > and while I cannot think of any deadlocks which could occur with taking
> > inode_lock inside the rmap lock, it doesn't sound very nice.
> >
> > It would of course be best if we could avoid adding a new ranking
> > relationship between these locks.
>
> agreed. the inode_lock especially is more a vfs thing than a mm thing,
> so it lives quite far away.
Alas, inode_lock can be taken inside page_table_lock, in zap_pte_range().
That set_page_dirty() in there is the nastiest part of the MM locking.
* nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch]
2004-03-22 19:58 ` Andrea Arcangeli
@ 2004-03-23 21:44 ` Andrea Arcangeli
2004-03-24 2:35 ` Andrea Arcangeli
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Andrea Arcangeli @ 2004-03-23 21:44 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel
This is incremental with 2.6.5-rc2-aa1 and it should allow swapping of
nonlinear mappings too. Does anybody have a testcase that I can use to
test the nonlinear mapping swapping, or should I write one from scratch?
This code compiles fine but it's untested at this time (I tested the
regular swapping but not the nonlinear one).

I don't think I can use the tlb gather because I have to set the pte back
immediately, or can I? The IPI flood and huge pagetable walk, with total
destruction of the address space of huge mappings, will be very bad in
terms of usability during swapping of huge nonlinear vmas, but hey, if
you want to swap smoothly, you should use the vmas.

If you give up using remap_file_pages you will get good swapping and,
as long as there are a dozen pages per vma, the cost of
remap_file_pages w/ pte_chains will be the same as the cost of the vmas
w/o pte_chains, plus the vmas can be heavily merged. So remap_file_pages
really makes sense only if you would generate fewer than a dozen pages
per vma and can save significant ram compared with the vmas, accepting
the risk of the worse swapping; or alternatively if you give up swapping
(i.e. you use mlock + remap_file_pages, which saves the memory for the
vmas). But in general (especially w/o pte_chains) the usage of
remap_file_pages should be strongly discouraged, unless it's a special
app using mlock or an emulator with small amounts of nonlinear address
space. If you use it, don't hope to ever swap efficiently. On 64bit
archs nothing but the emulators should use it.
--- x/include/linux/objrmap.h.~1~ 2004-03-21 15:21:42.000000000 +0100
+++ x/include/linux/objrmap.h 2004-03-23 22:29:53.699758112 +0100
@@ -1,5 +1,5 @@
-#ifndef _LINUX_RMAP_H
-#define _LINUX_RMAP_H
+#ifndef _LINUX_OBJRMAP_H
+#define _LINUX_OBJRMAP_H
/*
* Declarations for Object Reverse Mapping functions in mm/objrmap.c
*/
@@ -75,4 +75,4 @@ int FASTCALL(page_referenced(struct page
#endif /* CONFIG_MMU */
-#endif /* _LINUX_RMAP_H */
+#endif /* _LINUX_OBJRMAP_H */
--- x/include/linux/page-flags.h.~1~ 2004-03-21 15:21:42.000000000 +0100
+++ x/include/linux/page-flags.h 2004-03-23 22:30:16.940225024 +0100
@@ -71,11 +71,11 @@
#define PG_nosave 14 /* Used for system suspend/resume */
#define PG_maplock 15 /* lock bit for ->as.anon_vma and ->mapcount */
+#define PG_swapcache 16 /* SwapCache page */
#define PG_mappedtodisk 17 /* Has blocks allocated on-disk */
#define PG_reclaim 18 /* To be reclaimed asap */
#define PG_compound 19 /* Part of a compound page */
#define PG_anon 20 /* Anonymous page */
-#define PG_swapcache 21 /* SwapCache page */
/*
--- x/mm/vmscan.c.~1~ 2004-03-21 15:21:42.000000000 +0100
+++ x/mm/vmscan.c 2004-03-22 18:39:05.000000000 +0100
@@ -313,16 +313,14 @@ shrink_list(struct list_head *page_list,
if (page_mapped(page) && mapping) {
switch (try_to_unmap(page)) {
case SWAP_FAIL:
- page_map_unlock(page);
goto activate_locked;
case SWAP_AGAIN:
- page_map_unlock(page);
goto keep_locked;
case SWAP_SUCCESS:
; /* try to free the page below */
}
- }
- page_map_unlock(page);
+ } else
+ page_map_unlock(page);
/*
* If the page is dirty, only perform writeback if that write
--- x/mm/memory.c.~1~ 2004-03-21 15:21:42.000000000 +0100
+++ x/mm/memory.c 2004-03-23 22:25:28.758035376 +0100
@@ -97,7 +97,7 @@ static inline void free_one_pmd(struct m
if (pmd_none(*dir))
return;
- if (pmd_bad(*dir)) {
+ if (unlikely(pmd_bad(*dir))) {
pmd_ERROR(*dir);
pmd_clear(dir);
return;
@@ -115,7 +115,7 @@ static inline void free_one_pgd(struct m
if (pgd_none(*dir))
return;
- if (pgd_bad(*dir)) {
+ if (unlikely(pgd_bad(*dir))) {
pgd_ERROR(*dir);
pgd_clear(dir);
return;
@@ -232,7 +232,7 @@ int copy_page_range(struct mm_struct *ds
if (pgd_none(*src_pgd))
goto skip_copy_pmd_range;
- if (pgd_bad(*src_pgd)) {
+ if (unlikely(pgd_bad(*src_pgd))) {
pgd_ERROR(*src_pgd);
pgd_clear(src_pgd);
skip_copy_pmd_range: address = (address + PGDIR_SIZE) & PGDIR_MASK;
@@ -253,7 +253,7 @@ skip_copy_pmd_range: address = (address
if (pmd_none(*src_pmd))
goto skip_copy_pte_range;
- if (pmd_bad(*src_pmd)) {
+ if (unlikely(pmd_bad(*src_pmd))) {
pmd_ERROR(*src_pmd);
pmd_clear(src_pmd);
skip_copy_pte_range:
@@ -324,9 +324,11 @@ skip_copy_pte_range:
* Device driver pages must not be
* tracked by the VM for unmapping.
*/
- BUG_ON(!page_mapped(page));
- BUG_ON(!page->mapping);
- page_add_rmap(page, vma, address, PageAnon(page));
+ if (likely(page_mapped(page) && page->mapping))
+ page_add_rmap(page, vma, address, PageAnon(page));
+ else
+ printk("Badness in %s at %s:%d\n",
+ __FUNCTION__, __FILE__, __LINE__);
} else {
BUG_ON(page_mapped(page));
BUG_ON(page->mapping);
@@ -381,7 +383,7 @@ zap_pte_range(struct mmu_gather *tlb, pm
if (pmd_none(*pmd))
return;
- if (pmd_bad(*pmd)) {
+ if (unlikely(pmd_bad(*pmd))) {
pmd_ERROR(*pmd);
pmd_clear(pmd);
return;
@@ -424,27 +426,25 @@ zap_pte_range(struct mmu_gather *tlb, pm
static void
zap_pmd_range(struct mmu_gather *tlb, pgd_t * dir,
- unsigned long address, unsigned long size)
+ unsigned long address, unsigned long end)
{
pmd_t * pmd;
- unsigned long end;
if (pgd_none(*dir))
return;
- if (pgd_bad(*dir)) {
+ if (unlikely(pgd_bad(*dir))) {
pgd_ERROR(*dir);
pgd_clear(dir);
return;
}
pmd = pmd_offset(dir, address);
- end = address + size;
if (end > ((address + PGDIR_SIZE) & PGDIR_MASK))
end = ((address + PGDIR_SIZE) & PGDIR_MASK);
do {
zap_pte_range(tlb, pmd, address, end - address);
address = (address + PMD_SIZE) & PMD_MASK;
pmd++;
- } while (address < end);
+ } while (address && (address < end));
}
void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
@@ -462,7 +462,7 @@ void unmap_page_range(struct mmu_gather
dir = pgd_offset(vma->vm_mm, address);
tlb_start_vma(tlb, vma);
do {
- zap_pmd_range(tlb, dir, address, end - address);
+ zap_pmd_range(tlb, dir, address, end);
address = (address + PGDIR_SIZE) & PGDIR_MASK;
dir++;
} while (address && (address < end));
@@ -632,7 +632,7 @@ follow_page(struct mm_struct *mm, unsign
goto out;
if (pmd_huge(*pmd))
return follow_huge_pmd(mm, address, pmd, write);
- if (pmd_bad(*pmd))
+ if (unlikely(pmd_bad(*pmd)))
goto out;
ptep = pte_offset_map(pmd, address);
@@ -1429,7 +1429,9 @@ retry:
* real anonymous pages, they're "device" reserved pages instead.
*/
reserved = !!(vma->vm_flags & VM_RESERVED);
- WARN_ON(reserved == pageable);
+ if (unlikely(reserved == pageable))
+ printk("Badness in %s at %s:%d\n",
+ __FUNCTION__, __FILE__, __LINE__);
/*
* Should we do an early C-O-W break?
--- x/mm/objrmap.c.~1~ 2004-03-21 15:21:42.000000000 +0100
+++ x/mm/objrmap.c 2004-03-23 22:23:27.393485592 +0100
@@ -19,6 +19,11 @@
* Released under the General Public License (GPL).
*/
+/*
+ * nonlinear pagetable walking elaborated from mm/memory.c under
+ * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds
+ */
+
#include <linux/pagemap.h>
#include <linux/swap.h>
#include <linux/swapops.h>
@@ -62,7 +67,7 @@ static inline void validate_anon_vma_fin
*
* It is the caller's responsibility to unmap the pte if it is returned.
*/
-static inline pte_t *
+static pte_t *
find_pte(struct vm_area_struct *vma, struct page *page, unsigned long *addr)
{
struct mm_struct *mm = vma->vm_mm;
@@ -120,7 +125,14 @@ page_referenced_one(struct vm_area_struc
pte_t *pte;
int referenced = 0;
- if (!spin_trylock(&mm->page_table_lock))
+ /*
+ * Tracking the referenced info is too expensive
+ * for nonlinear mappings.
+ */
+ if (vma->vm_flags & VM_NONLINEAR)
+ goto out;
+
+ if (unlikely(!spin_trylock(&mm->page_table_lock)))
goto out;
pte = find_pte(vma, page, NULL);
@@ -158,7 +170,7 @@ page_referenced_inode(struct page *page)
BUG_ON(PageSwapCache(page));
- if (down_trylock(&mapping->i_shared_sem))
+ if (unlikely(down_trylock(&mapping->i_shared_sem)))
goto out;
list_for_each_entry(vma, &mapping->i_mmap, shared)
@@ -212,7 +224,7 @@ int fastcall page_referenced(struct page
BUG_ON(!page->mapping);
if (page_test_and_clear_young(page))
- mark_page_accessed(page);
+ referenced++;
if (TestClearPageReferenced(page))
referenced++;
@@ -344,38 +356,18 @@ void fastcall page_remove_rmap(struct pa
out_unlock:
page_map_unlock(page);
+
+ if (!page_mapped(page) && page_test_and_clear_dirty(page))
+ set_page_dirty(page);
+
return;
}
-/**
- * try_to_unmap_one - unmap a page using the object-based rmap method
- * @page: the page to unmap
- *
- * Determine whether a page is mapped in a given vma and unmap it if it's found.
- *
- * This function is strictly a helper function for try_to_unmap_inode.
- */
-static int
-try_to_unmap_one(struct vm_area_struct *vma, struct page *page)
+static void
+unmap_pte_page(struct page * page, struct vm_area_struct * vma,
+ unsigned long address, pte_t * pte)
{
- struct mm_struct *mm = vma->vm_mm;
- unsigned long address;
- pte_t *pte;
pte_t pteval;
- int ret = SWAP_AGAIN;
-
- if (!spin_trylock(&mm->page_table_lock))
- return ret;
-
- pte = find_pte(vma, page, &address);
- if (!pte)
- goto out;
-
- BUG_ON(vma->vm_flags & VM_RESERVED);
- if (vma->vm_flags & VM_LOCKED) {
- ret = SWAP_FAIL;
- goto out_unmap;
- }
flush_cache_page(vma, address);
pteval = ptep_clear_flush(vma, address, pte);
@@ -389,8 +381,11 @@ try_to_unmap_one(struct vm_area_struct *
swap_duplicate(entry);
set_pte(pte, swp_entry_to_pte(entry));
BUG_ON(pte_file(*pte));
+ BUG_ON(!PageAnon(page));
} else {
unsigned long pgidx;
+
+ BUG_ON(PageAnon(page));
/*
* If a nonlinear mapping then store the file page offset
* in the pte.
@@ -409,14 +404,129 @@ try_to_unmap_one(struct vm_area_struct *
BUG_ON(!page->mapcount);
- mm->rss--;
+ vma->vm_mm->rss--;
if (!--page->mapcount && PageAnon(page))
anon_vma_page_unlink(page);
page_cache_release(page);
+}
-out_unmap:
- pte_unmap(pte);
+static void
+try_to_unmap_nonlinear_pte(struct vm_area_struct * vma,
+ pmd_t * pmd, unsigned long address, unsigned long size)
+{
+ unsigned long offset;
+ pte_t *ptep;
+
+ if (pmd_none(*pmd))
+ return;
+ if (unlikely(pmd_bad(*pmd))) {
+ pmd_ERROR(*pmd);
+ pmd_clear(pmd);
+ return;
+ }
+ ptep = pte_offset_map(pmd, address);
+ offset = address & ~PMD_MASK;
+ if (offset + size > PMD_SIZE)
+ size = PMD_SIZE - offset;
+ size &= PAGE_MASK;
+ for (offset=0; offset < size; ptep++, offset += PAGE_SIZE) {
+ pte_t pte = *ptep;
+ if (pte_none(pte))
+ continue;
+ if (pte_present(pte)) {
+ unsigned long pfn = pte_pfn(pte);
+ struct page * page;
+
+ BUG_ON(!pfn_valid(pfn));
+ page = pfn_to_page(pfn);
+#ifdef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
+ get_page(page);
+#endif
+ unmap_pte_page(page, vma, address, ptep);
+ if (!page_mapped(page) && page_test_and_clear_dirty(page))
+ /* ugly locking */
+ set_page_dirty(page);
+#ifdef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
+ put_page(page);
+#endif
+ }
+ }
+ pte_unmap(ptep-1);
+}
+
+static void
+try_to_unmap_nonlinear_pmd(struct vm_area_struct * vma,
+ pgd_t * dir, unsigned long address, unsigned long end)
+{
+ pmd_t * pmd;
+
+ if (pgd_none(*dir))
+ return;
+ if (unlikely(pgd_bad(*dir))) {
+ pgd_ERROR(*dir);
+ pgd_clear(dir);
+ return;
+ }
+ pmd = pmd_offset(dir, address);
+ if (end > ((address + PGDIR_SIZE) & PGDIR_MASK))
+ end = ((address + PGDIR_SIZE) & PGDIR_MASK);
+ do {
+ try_to_unmap_nonlinear_pte(vma, pmd, address, end - address);
+ address = (address + PMD_SIZE) & PMD_MASK;
+ pmd++;
+ } while (address && (address < end));
+}
+
+static void
+try_to_unmap_nonlinear(struct vm_area_struct *vma)
+{
+ pgd_t * dir;
+ unsigned long address = vma->vm_start, end = vma->vm_end;
+
+ dir = pgd_offset(vma->vm_mm, address);
+ do {
+ try_to_unmap_nonlinear_pmd(vma, dir, address, end);
+ address = (address + PGDIR_SIZE) & PGDIR_MASK;
+ dir++;
+ } while (address && (address < end));
+}
+
+/**
+ * try_to_unmap_one - unmap a page using the object-based rmap method
+ * @page: the page to unmap
+ *
+ * Determine whether a page is mapped in a given vma and unmap it if it's found.
+ *
+ * This function is strictly a helper function for try_to_unmap_inode.
+ */
+static int
+try_to_unmap_one(struct vm_area_struct *vma, struct page *page)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long address;
+ pte_t *pte;
+ int ret;
+
+ BUG_ON(vma->vm_flags & VM_RESERVED);
+ if (unlikely(vma->vm_flags & VM_LOCKED))
+ return SWAP_FAIL;
+
+ ret = SWAP_AGAIN;
+ if (unlikely(!spin_trylock(&mm->page_table_lock)))
+ return ret;
+
+ if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
+ try_to_unmap_nonlinear(vma);
+ goto out;
+ }
+
+ pte = find_pte(vma, page, &address);
+ if (!pte)
+ goto out;
+ unmap_pte_page(page, vma, address, pte);
+
+ pte_unmap(pte);
out:
spin_unlock(&mm->page_table_lock);
return ret;
@@ -443,7 +553,7 @@ try_to_unmap_inode(struct page *page)
BUG_ON(PageSwapCache(page));
- if (down_trylock(&mapping->i_shared_sem))
+ if (unlikely(down_trylock(&mapping->i_shared_sem)))
return ret;
list_for_each_entry(vma, &mapping->i_mmap, shared) {
@@ -523,6 +633,11 @@ int fastcall try_to_unmap(struct page *
dec_page_state(nr_mapped);
ret = SWAP_SUCCESS;
}
+ page_map_unlock(page);
+
+ if (ret == SWAP_SUCCESS && page_test_and_clear_dirty(page))
+ set_page_dirty(page);
+
return ret;
}
* Re: nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch]
2004-03-23 21:44 ` nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch] Andrea Arcangeli
@ 2004-03-24 2:35 ` Andrea Arcangeli
2004-03-24 4:38 ` Andrea Arcangeli
2004-03-24 10:12 ` Hugh Dickins
2 siblings, 0 replies; 15+ messages in thread
From: Andrea Arcangeli @ 2004-03-24 2:35 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel
On Tue, Mar 23, 2004 at 10:44:59PM +0100, Andrea Arcangeli wrote:
> This is incremental with 2.6.5-rc2-aa1 and it should allow swapping of
> nonlinear mappings too. Has anybody a testcase that I can use to test
I wrote a test app swapping around 1G, with 200 tasks mapping 1.7G each
after marking the vmas nonlinear. It's not exactly smooth swapping
(expected), but it seems to work just fine (I admit for the first few
seconds I wondered if it was deadlocking, until I pressed SYSRQ+P and
saw idle cpus and then the disk running too, so I waited some more ;)
r b swpd free buff cache si so bi bo in cs us sy id wa
12 1 0 912564 26824 45188 0 0 18 5 270 18 0 0 99 0
19 1 0 684436 26824 269928 0 0 0 0 1087 27560 11 89 0 0
33 0 0 542148 26824 408172 0 0 0 0 1086 104 13 87 0 0
47 2 0 432956 26824 513164 0 0 0 0 1084 104 16 84 0 0
54 0 0 357364 26824 584768 0 0 0 0 1085 86 15 85 0 0
61 0 0 287980 26832 650176 0 0 0 44 1090 95 16 84 0 0
68 0 0 238244 26832 695872 0 0 0 0 1086 81 16 84 0 0
75 0 0 191132 26832 738984 0 0 0 0 1085 82 16 84 0 0
75 0 0 139676 26832 786584 0 0 0 0 1086 75 16 84 0 0
82 0 0 100868 26832 821468 0 0 0 0 1083 82 17 83 0 0
82 0 0 57028 26832 861452 0 0 0 0 1085 72 16 85 0 0
89 0 0 6920 23248 910936 0 0 0 0 1103 85 15 85 0 0
89 2 0 5140 11440 919208 0 0 0 240 2177 738 11 89 0 0
90 71 2160 2048 332 934084 0 284 0 284 1823 2255 5 91 0 4
95 67 11564 6440 332 929572 44 11236 44 11236 1384 582 6 94 0 0
0 98 751260 2276 192 710688 88296 628492 90828 628640 104070 47927 1 18 0 81
0 98 751464 6052 200 711364 64 2924 712 2924 1580 161 0 0 0 100
1 96 752124 2312 200 717264 256 5720 380 5748 1736 170 0 3 0 97
0 97 750392 2384 204 718800 2432 4608 2432 4612 1319 373 0 1 0 99
0 97 750968 2516 204 720672 128 3796 128 3796 1612 119 0 2 0 98
0 97 750260 2640 204 724312 472 3968 472 3968 1401 146 0 1 0 99
0 97 751816 2384 212 726712 792 4124 912 4124 1701 256 0 2 0 98
0 97 772464 2240 216 714092 176 7320 176 7348 1875 292 2 22 0 77
0 97 776016 2248 216 735556 100 37828 100 37828 1457 228 1 4 0 95
0 97 773044 6024 216 738528 3984 3124 3984 3124 1223 451 0 1 0 100
0 97 769832 2504 216 741740 3792 0 3792 0 1253 491 0 0 0 100
0 97 773096 2448 216 738688 3296 10048 3296 10048 1330 512 0 2 0 98
0 209 771488 2192 216 750164 960 25636 960 25636 1424 7683 1 7 0 92
0 209 772984 2272 216 769216 2120 7880 2120 7880 1331 8610 1 12 0 87
0 209 776216 2292 216 772664 2744 2584 2744 2584 1439 1522 1 13 0 86
0 208 775828 2740 220 776292 556 6912 688 6912 1628 651 1 6 0 93
0 208 778384 3280 220 778492 544 15796 544 15796 1719 401 0 2 0 98
0 208 784516 23548 220 771880 440 17764 440 17764 1564 2062 0 4 0 97
it's like going back to the 2.4 swapping behaviour. If those nonlinear mappings
are big and they must be swapped efficiently, remap_file_pages should not be
used. Of course mlock on top of nonlinear will fix it completely.
I doubt we can do much better than this w/o throwing lots of memory at tracking
ptes (like rmap did with the pte_chains), but IMO the pte_chains would
invalidate the point of the nonlinear vmas (if one has to pay for the
pte_chains then one may as well use the vmas, since the cost is the same as
long as there are at least a dozen pages per vma). I think this way is
optimal: remap_file_pages wants to be even faster and more scalable, with
less memory footprint for the fast path, at the expense of less
efficient and less scalable swapping.
To make a comparison, this is the swapping of the linear vmas instead
(i.e. the only one we care about); it's quite cpu dominated too, and
mostly spent in the trylocks.
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
11 4 5660 959856 2428 29728 220 359 241 362 301 183 0 3 69 28
27 4 5660 747060 2440 239292 0 0 0 180 1089 22479 11 89 0 0
34 1 5660 627052 2448 355496 0 0 0 84 1092 103 13 87 0 0
41 0 5660 523868 2448 454708 0 0 0 0 1084 91 15 85 0 0
49 2 5660 431772 2448 542904 0 0 0 0 1085 88 15 85 0 0
62 1 5660 362204 2448 608320 0 0 0 0 1084 98 14 86 0 0
62 0 5660 302492 2448 664420 0 0 0 48 1093 77 16 84 0 0
69 0 5660 241628 2448 721336 0 0 0 0 1085 79 16 84 0 0
76 0 5660 171020 2448 787976 0 0 0 0 1086 85 15 85 0 0
83 0 5660 126212 2448 828844 0 0 0 0 1085 81 17 83 0 0
83 1 5660 60036 2448 891336 0 0 0 0 1086 70 15 85 0 0
90 1 5660 35764 2448 911464 0 0 0 0 1084 89 16 84 0 0
92 1 3704 2332 704 946516 0 196 0 384 1129 80 12 88 0 0
91 0 9148 3400 332 947688 0 5660 0 5660 1130 129 10 90 0 0
69 28 13252 2064 300 949800 224 3208 344 3208 1161 397 9 91 0 0
80 38 30192 4112 300 942908 584 15468 676 15468 1512 3754 6 94 0 0
89 45 43508 3640 308 940204 932 15304 1052 15316 1469 4564 4 96 0 0
66 74 56072 5156 316 939460 2684 13432 2832 13432 1392 13575 5 95 0 0
69 75 68248 5236 184 937780 3008 12592 3412 12600 1414 17344 2 98 0 0
62 81 79076 7576 132 932404 3964 10836 3964 10836 1526 28044 1 99 0 0
106 95 84068 4444 132 934540 4052 6912 4052 6912 1466 24908 0 100 0 0
107 92 87864 6072 140 930740 3068 4468 3164 4468 1509 4045 0 100 0 0
12 100 92592 6068 148 926096 5156 5424 5156 5440 1508 25626 1 100 0 0
99 93 96872 5572 148 926840 3972 5548 4024 5548 1449 20534 0 100 0 0
107 92 101304 7984 148 922492 3608 5304 3656 5304 1491 14817 0 100 0 0
100 90 105340 8416 148 922012 5592 5400 5592 5400 1439 19803 0 100 0 0
107 89 108688 7376 148 918572 4600 4496 4600 4496 1383 12725 0 100 0 0
108 93 113168 4876 148 918512 5440 5780 5440 5780 1383 23260 0 100 0 0
96 90 116820 4816 148 914956 5192 5008 5192 5008 1342 11427 1 100 0 0
89 99 121152 3288 148 913044 5580 5772 5580 5772 1382 29699 0 100 0 0
107 96 124948 4048 164 912236 4220 4832 4432 4832 1407 7537 0 100 0 0
90 94 128348 4284 164 909940 5056 4656 5056 4656 1398 14028 0 100 0 0
108 90 132028 2472 144 913728 3688 4640 3688 4640 1413 3582 0 100 0 0
89 91 135484 3256 144 911400 4516 4556 4516 4556 1401 11283 0 100 0 0
108 90 139644 2952 144 910092 4268 5084 4268 5084 1500 19229 1 100 0 0
107 89 143772 2636 144 907696 4492 5084 4492 5084 1467 14234 0 100 0 0
82 85 147720 6904 140 903788 3800 4820 3840 4820 1413 11291 0 100 0 0
106 96 152536 2808 140 904904 4520 5940 4520 5940 1417 29255 0 100 0 0
I also found a race condition during testing:
this bug triggered in unmap_pte_page:
BUG_ON(!page->mapcount);
That was because I was running page_add_rmap outside the page_table_lock in
do_anonymous_page. We must do lru_cache_add + set_pte + page_add_rmap
atomically in the same critical section, protected by the same page_table_lock,
to prevent the swap side from trying to unmap pages that have not yet been
tracked by page_add_rmap (and in turn by the anon_vma). The fix is easy:
--- x/mm/memory.c.~1~ 2004-03-23 23:29:19.000000000 +0100
+++ x/mm/memory.c 2004-03-24 03:30:27.850181344 +0100
@@ -1354,13 +1354,14 @@ do_anonymous_page(struct mm_struct *mm,
set_pte(page_table, entry);
pte_unmap(page_table);
+ /* ignores ZERO_PAGE */
+ page_add_rmap(page, vma, addr, anon);
+
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, addr, entry);
spin_unlock(&mm->page_table_lock);
ret = VM_FAULT_MINOR;
- /* ignores ZERO_PAGE */
- page_add_rmap(page, vma, addr, anon);
return ret;
}
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch]
2004-03-23 21:44 ` nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch] Andrea Arcangeli
2004-03-24 2:35 ` Andrea Arcangeli
@ 2004-03-24 4:38 ` Andrea Arcangeli
2004-03-24 10:12 ` Hugh Dickins
2 siblings, 0 replies; 15+ messages in thread
From: Andrea Arcangeli @ 2004-03-24 4:38 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel
On Tue, Mar 23, 2004 at 10:44:59PM +0100, Andrea Arcangeli wrote:
> +#ifdef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
> + get_page(page);
> +#endif
> + unmap_pte_page(page, vma, address, ptep);
^^^^^^^ + offset
I added some checks to the shm threaded test program to verify
correctness after swapin, and they showed userspace memory corruption on the
nonlinear pages; the above "+ offset" is the fix.
I also merged anobjrmap-6 from Hugh so ppc and ppc64 compiles fine (plus
some other arch).
Also, the race in do_anonymous_page couldn't trigger: the callers of
try_to_unmap always check first whether the page is mapped after taking the
page_map_lock, so I'm backing that change out.
The real race triggering this bug is that I'm not taking the
page_map_lock for all the pages before unmapping them, so the
--page->mapcount decrements aren't atomic while the page is unmapped by
different address spaces at the same time from multiple cpus.
It'll all be fixed in the next update.
* Re: nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch]
2004-03-23 21:44 ` nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch] Andrea Arcangeli
2004-03-24 2:35 ` Andrea Arcangeli
2004-03-24 4:38 ` Andrea Arcangeli
@ 2004-03-24 10:12 ` Hugh Dickins
2004-03-24 12:18 ` Hugh Dickins
2004-03-24 14:37 ` Andrea Arcangeli
2 siblings, 2 replies; 15+ messages in thread
From: Hugh Dickins @ 2004-03-24 10:12 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, linux-kernel
On Tue, 23 Mar 2004, Andrea Arcangeli wrote:
>
> I don't think I can use the tlb gather because I've to set the pte back
> immediately, or can I? The IPI flood and huge pagetable walk with total
> destruction of the address space with huge mappings will be very bad in
> terms of usability during swapping of huge nonlinear vmas, but hey, if
> you want to swap smoothly, you should use the vmas.
Thanks a lot for the preview (or would have been a preview if I'd been
awake - and now I've found it easiest to look at 2.6.5-rc1 patched with
the 2.6.5-rc1-aa2 objrmap and anon_vma you pointed Martin to in other
mail, which includes your latest fixes).
I think you're being too harsh on the nonlinear vmas! I know you're
not keen on them, but punishing them this hard! If I read it right,
page_referenced will never (unless PageReferenced, or mapped into
a nonlinear also) report a page from a nonlinear vma as referenced
(I do agree with that part). So they'll soon reach try_to_unmap,
and each one which gets there will cause every page in every nonlinear
vma of that inode to be unmapped from the nonlinears right then?
Yes, that'll teach 'em to use sys_remap_file_pages without VM_LOCKED.
For mine I'll try to carry on with the less draconian approach I
started yesterday, scanning just a range each time (rather 2.4 style).
At the very least, I think your unmap (and mine) needs to
ptep_test_and_clear_young just before unmap_pte_page, and back out if
the page is young (referenced). I was going to recommend that anyway:
at last got around to considering that issue of whether the failed
trylocks should report referenced or not (return 1 or 0). Looking at
how shrink_list goes, even before 2.6.5-rc1, I'd expect it to behave
better your way (proceed to try_to_unmap, which will rightly say
SWAP_AGAIN if it fails the same trylock) than how it was before in
objrmap; but that will behave better with a ptep_test_and_clear_young
check first too.
Sorry to see the #if VMA_MERGING_FIXUPs are still there. I've a
growing feeling that it won't make enough difference when they're
gone. But maybe you have a cunning plan to merge all the anon_vmas
which would result from an mmap next page, write data in, mprotect ro,
mmap next page, write data in, mprotect ro, ..... workload.
Hugh
* Re: nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch]
2004-03-24 10:12 ` Hugh Dickins
@ 2004-03-24 12:18 ` Hugh Dickins
2004-03-24 14:47 ` Andrea Arcangeli
2004-03-24 14:37 ` Andrea Arcangeli
1 sibling, 1 reply; 15+ messages in thread
From: Hugh Dickins @ 2004-03-24 12:18 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, linux-kernel
This subtlety in try_to_unmap_nonlinear_pte:
page_map_lock(page);
/* check that we're not in between set_pte and page_add_rmap */
if (page_mapped(page)) {
unmap_pte_page(page, vma, address + offset, ptep);
Harmless, but isn't our acquisition of the page_table_lock guaranteeing
that it cannot be in between set_pte and page_add_rmap?
Hugh
* Re: nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch]
2004-03-24 10:12 ` Hugh Dickins
2004-03-24 12:18 ` Hugh Dickins
@ 2004-03-24 14:37 ` Andrea Arcangeli
2004-03-24 18:42 ` Andrea Arcangeli
1 sibling, 1 reply; 15+ messages in thread
From: Andrea Arcangeli @ 2004-03-24 14:37 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel
On Wed, Mar 24, 2004 at 10:12:58AM +0000, Hugh Dickins wrote:
> On Tue, 23 Mar 2004, Andrea Arcangeli wrote:
> >
> > I don't think I can use the tlb gather because I've to set the pte back
> > immediately, or can I? The IPI flood and huge pagetable walk with total
> > destruction of the address space with huge mappings will be very bad in
> > terms of usability during swapping of huge nonlinear vmas, but hey, if
> > you want to swap smoothly, you should use the vmas.
>
> Thanks a lot for the preview (or would have been a preview if I'd been
> awake - and now I've found it easiest to look at 2.6.5-rc1 patched with
> the 2.6.5-rc1-aa2 objrmap and anon_vma you pointed Martin to in other
> mail, which includes your latest fixes).
>
> I think you're being too harsh on the nonlinear vmas! I know you're
> not keen on them, but punishing them this hard! If I read it right,
> page_referenced will never (unless PageReferenced, or mapped into
> a nonlinear also) report a page from a nonlinear vma as referenced
> (I do agree with that part). So they'll soon reach try_to_unmap,
> and each one which gets there will cause every page in every nonlinear
> vma of that inode to be unmapped from the nonlinears right then?
> Yes, that'll teach 'em to use sys_remap_file_pages without VM_LOCKED.
Yep ;)
> For mine I'll try to carry on with the less draconian approach I
> started yesterday, scanning just a range each time (rather 2.4 style).
That would DoS real life workloads; that's why I had to be draconian. After
you've finished I'll send a testcase, and it's a real life testcase, not
an exploit. The only way to dominate the complexity with a pagetable scan is
to do what 2.4 does, that is to drop all the ptes we find in our way
so the vm stops calling try_to_unmap; we must avoid walking the vma
more than once to swap it out. This will cause a minor fault flood, but
that's ok, it doesn't need to be fast at swapping.
> At the very least, I think your unmap (and mine) needs to
> ptep_test_and_clear_young just before unmap_pte_page, and back out if
> the page is young (referenced). I was going to recommend that anyway:
> at last got around to considering that issue of whether the failed
> trylocks should report referenced or not (return 1 or 0). Looking at
> how shrink_list goes, even before 2.6.5-rc1, I'd expect it to behave
> better your way (proceed to try_to_unmap, which will rightly say
> SWAP_AGAIN if it fails the same trylock) than how it was before in
> objrmap; but that will behave better with a ptep_test_and_clear_young
> check first too.
cute, I agree we should recheck the young bit inside.
> Sorry to see the #if VMA_MERGING_FIXUPs are still there. I've a
> growing feeling that it won't make enough difference when they're
> gone. But maybe you have a cunning plan to merge all the anon_vmas
> which would result from an mmap next page, write data in, mprotect ro,
> mmap next page, write data in, mprotect ro, ..... workload.
Problem is that mprotect (and mremap) merging is low priority compared to the
nonlinear==mlock and i_mmap{_shared} complexity, so I'll address it only
after I have scalable swapping for huge i_mmap{_shared} lists too, which
is a prerequisite for merging; mprotect merging doesn't sound like a
prerequisite, though I certainly agree we should fix it up soon (and
once we fix it, it'll work for files too, something that never worked
to date; I feel it'll be as important for files as it has been so far for
anon ram, and nobody has complained yet that it's not enabled for files ;).
* Re: nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch]
2004-03-24 12:18 ` Hugh Dickins
@ 2004-03-24 14:47 ` Andrea Arcangeli
0 siblings, 0 replies; 15+ messages in thread
From: Andrea Arcangeli @ 2004-03-24 14:47 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel
On Wed, Mar 24, 2004 at 12:18:12PM +0000, Hugh Dickins wrote:
> This subtlety in try_to_unmap_nonlinear_pte:
>
> page_map_lock(page);
> /* check that we're not in between set_pte and page_add_rmap */
> if (page_mapped(page)) {
> unmap_pte_page(page, vma, address + offset, ptep);
>
> Harmless, but isn't our acquisition of the page_table_lock guaranteeing
> that it cannot be in between set_pte and page_add_rmap?
I find that fragile; see the way I implemented do_anonymous_page. Other
places always do page_add_rmap under the page_table_lock, but there's no
reason to require that: the swapout code already checks explicitly for
page_mapped after taking the page_map_lock, and it has to do that anyway,
so I find it nicer to do it like the above and in do_anonymous_page.
* Re: nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch]
2004-03-24 14:37 ` Andrea Arcangeli
@ 2004-03-24 18:42 ` Andrea Arcangeli
0 siblings, 0 replies; 15+ messages in thread
From: Andrea Arcangeli @ 2004-03-24 18:42 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel
On Wed, Mar 24, 2004 at 03:37:29PM +0100, Andrea Arcangeli wrote:
> cute, I agree we should recheck the young bit inside.
I did it (the below is incremental with 2.6.5-rc2-aa2); it looks ok, but I
cannot notice any difference at all in practice. Anyway it makes
sense, so I'm keeping it. Now re-running the regressions for both linear and
non-linear heavy shm swap.
So I'm now going to merge the prio_tree, then I can take care of the
filter for s/anon_vma_t/struct anon_vma/ and the mprotect/mremap
vma merging (for file mappings too).
--- x/mm/objrmap.c.~1~ 2004-03-24 07:16:18.000000000 +0100
+++ x/mm/objrmap.c 2004-03-24 19:27:31.213416608 +0100
@@ -137,7 +137,7 @@ page_referenced_one(struct vm_area_struc
pte = find_pte(vma, page, NULL);
if (pte) {
- if (ptep_test_and_clear_young(pte))
+ if (pte_young(*pte) && ptep_test_and_clear_young(pte))
referenced++;
pte_unmap(pte);
}
@@ -442,6 +442,8 @@ try_to_unmap_nonlinear_pte(struct vm_are
page = pfn_to_page(pfn);
if (PageReserved(page))
continue;
+ if (pte_young(pte) && ptep_test_and_clear_young(ptep))
+ continue;
/*
* any other page in the nonlinear mapping will not wait
* on us since only one cpu can take the i_shared_sem
@@ -506,7 +508,7 @@ try_to_unmap_nonlinear(struct vm_area_st
* This function is strictly a helper function for try_to_unmap_inode.
*/
static int
-try_to_unmap_one(struct vm_area_struct *vma, struct page *page)
+try_to_unmap_one(struct vm_area_struct *vma, struct page *page, int * young)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
@@ -523,12 +525,21 @@ try_to_unmap_one(struct vm_area_struct *
if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
/*
- * All it matters is that the page won't go
- * away under us after we unlock.
+ * If this was a false positive generated by a
+ * failed trylock in the referenced pass, let's
+ * avoid paying the big cost of the nonlinear
+ * swap; we'd better be sure we have to pay that
+ * cost before running it.
*/
- page_map_unlock(page);
- try_to_unmap_nonlinear(vma);
- page_map_lock(page);
+ if (!*young) {
+ /*
+ * All it matters is that the page won't go
+ * away under us after we unlock.
+ */
+ page_map_unlock(page);
+ try_to_unmap_nonlinear(vma);
+ page_map_lock(page);
+ }
goto out;
}
@@ -536,10 +547,21 @@ try_to_unmap_one(struct vm_area_struct *
if (!pte)
goto out;
- unmap_pte_page(page, vma, address, pte);
+ /*
+ * We use trylocks in the "reference" methods; if they fail
+ * we let the VM go ahead unmapping to avoid locking
+ * congestion, so here we may be trying to unmap young
+ * ptes. If that happens we give up unmapping this page
+ * and we clear all the other reference bits instead (basically
+ * downgrading to a page_referenced pass).
+ */
+ if ((!pte_young(*pte) || !ptep_test_and_clear_young(pte)) && !*young)
+ unmap_pte_page(page, vma, address, pte);
+ else
+ *young = 1;
pte_unmap(pte);
-out:
+ out:
spin_unlock(&mm->page_table_lock);
return ret;
}
@@ -561,7 +583,7 @@ try_to_unmap_inode(struct page *page)
{
struct address_space *mapping = page->mapping;
struct vm_area_struct *vma;
- int ret = SWAP_AGAIN;
+ int ret = SWAP_AGAIN, young = 0;
BUG_ON(PageSwapCache(page));
@@ -569,13 +591,13 @@ try_to_unmap_inode(struct page *page)
return ret;
list_for_each_entry(vma, &mapping->i_mmap, shared) {
- ret = try_to_unmap_one(vma, page);
+ ret = try_to_unmap_one(vma, page, &young);
if (ret == SWAP_FAIL || !page->mapcount)
goto out;
}
list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
- ret = try_to_unmap_one(vma, page);
+ ret = try_to_unmap_one(vma, page, &young);
if (ret == SWAP_FAIL || !page->mapcount)
goto out;
}
@@ -588,7 +610,7 @@ out:
static int
try_to_unmap_anon(struct page * page)
{
- int ret = SWAP_AGAIN;
+ int ret = SWAP_AGAIN, young = 0;
struct vm_area_struct * vma;
anon_vma_t * anon_vma = (anon_vma_t *) page->mapping;
@@ -598,7 +620,7 @@ try_to_unmap_anon(struct page * page)
spin_lock(&anon_vma->anon_vma_lock);
BUG_ON(list_empty(&anon_vma->anon_vma_head));
list_for_each_entry(vma, &anon_vma->anon_vma_head, anon_vma_node) {
- ret = try_to_unmap_one(vma, page);
+ ret = try_to_unmap_one(vma, page, &young);
if (ret == SWAP_FAIL || !page->mapcount)
break;
}
end of thread, other threads:[~2004-03-24 18:41 UTC | newest]
Thread overview: 15+ messages
2004-03-22 17:05 VMA_MERGING_FIXUP and patch Hugh Dickins
2004-03-22 17:52 ` Andrea Arcangeli
2004-03-22 19:02 ` Hugh Dickins
2004-03-22 19:58 ` Andrea Arcangeli
2004-03-23 21:44 ` nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch] Andrea Arcangeli
2004-03-24 2:35 ` Andrea Arcangeli
2004-03-24 4:38 ` Andrea Arcangeli
2004-03-24 10:12 ` Hugh Dickins
2004-03-24 12:18 ` Hugh Dickins
2004-03-24 14:47 ` Andrea Arcangeli
2004-03-24 14:37 ` Andrea Arcangeli
2004-03-24 18:42 ` Andrea Arcangeli
2004-03-22 19:57 ` VMA_MERGING_FIXUP and patch Andrew Morton
2004-03-22 20:05 ` Andrea Arcangeli
2004-03-22 20:33 ` Andrew Morton