* VMA_MERGING_FIXUP and patch
@ 2004-03-22 17:05 Hugh Dickins
2004-03-22 17:52 ` Andrea Arcangeli
2004-03-22 19:57 ` VMA_MERGING_FIXUP and patch Andrew Morton
0 siblings, 2 replies; 15+ messages in thread
From: Hugh Dickins @ 2004-03-22 17:05 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, linux-kernel
Just a reminder that you still have several #if VMA_MERGING_FIXUPs
in your 2.6.5-rc2-aa1 tree, so mprotects and mremaps are not merging
vmas at all.
I can understand if you'd prefer to leave the mremaps that way,
at least for now. (I do have code to decide whether any page is
shared, which I'll post later in anobjrmap 7/6; you could use the
same to allow mremap vma merging where unproblematic.) But I think
you ought to get to merging the mprotects: aren't there apps which
will give you a frightening number of vmas unless merged?
Here are some minor updates (no hurry) to objrmap.c,
mirroring recentish s390 mods to mainline rmap.c:
the page_test_and_clear_dirty I mentioned before.
Hmm, I wonder: is it safe to call set_page_dirty
from inside the page rmap lock? Andrew?
Hugh
--- 2.6.5-rc2-aa1/mm/objrmap.c 2004-03-22 11:38:55.000000000 +0000
+++ linux/mm/objrmap.c 2004-03-22 16:34:29.421216936 +0000
@@ -212,7 +212,7 @@ int fastcall page_referenced(struct page
BUG_ON(!page->mapping);
if (page_test_and_clear_young(page))
- mark_page_accessed(page);
+ referenced++;
if (TestClearPageReferenced(page))
referenced++;
@@ -327,8 +327,11 @@ void fastcall page_remove_rmap(struct pa
if (!page_mapped(page))
goto out_unlock;
- if (!--page->mapcount)
+ if (!--page->mapcount) {
+ if (page_test_and_clear_dirty(page))
+ set_page_dirty(page);
dec_page_state(nr_mapped);
+ }
if (PageAnon(page))
anon_vma_page_unlink(page);
@@ -520,6 +523,8 @@ int fastcall try_to_unmap(struct page *
ret = try_to_unmap_anon(page);
if (!page_mapped(page)) {
+ if (page_test_and_clear_dirty(page))
+ set_page_dirty(page);
dec_page_state(nr_mapped);
ret = SWAP_SUCCESS;
}
* Re: VMA_MERGING_FIXUP and patch
2004-03-22 17:05 VMA_MERGING_FIXUP and patch Hugh Dickins
@ 2004-03-22 17:52 ` Andrea Arcangeli
2004-03-22 19:02 ` Hugh Dickins
2004-03-22 19:57 ` VMA_MERGING_FIXUP and patch Andrew Morton
1 sibling, 1 reply; 15+ messages in thread
From: Andrea Arcangeli @ 2004-03-22 17:52 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel
On Mon, Mar 22, 2004 at 05:05:33PM +0000, Hugh Dickins wrote:
> Just a reminder that you still have several #if VMA_MERGING_FIXUPs
> in your 2.6.5-rc2-aa1 tree, so mprotects and mremaps are not merging
> vmas at all.
>
> I can understand if you'd prefer to leave the mremaps that way,
> at least for now. (I do have code to decide whether any page is
> shared, will post later in anobjrmap 7/6, you could use the same
> to allow mremap vma merging if unproblematic.) But I think you
> ought to get to merging the mprotects, aren't there apps which
> will give you a frightening number of vmas unless merged?
agreed, re-enabling the merging in mprotect is quite high prio,
but really some of those apps have never been able to merge in mprotect,
because we were never able to merge the file mappings after a
vma_split. The pro is that after the fixup, the file mappings will merge
fine too (the con is that the merging code will have to be more
complicated than it was before ;).
> Here's some minor updates (no hurry) to objrmap.c,
> mirroring recentish s390 mods to mainline rmap.c:
> the page_test_and_clear_dirty I mentioned before.
this is greatly appreciated, I didn't notice this change (I got
uninteresting rejects in mm/rmap.c.rej too often and at some point
mistakenly stopped looking at them).
> Hmm, I wonder, is that safe to be calling set_page_dirty
> from inside the page rmap lock? Andrew?
I agree, indeed I don't feel safe running it there (it will not hang
immediately, but it enforces an ordering on other spinlocks, and the
less enforced ordering the better); I'd prefer to set a local
variable and run it after the page_map_unlock. Actually, re-reading
page->mapcount locklessly should be ok too: if we don't read zero due
to a race, it means somebody else will run the unmap or remove_page_rmap
again. All that matters is that the very last one unmapping the page
reads zero, and that's guaranteed even without the spinlock.
> --- 2.6.5-rc2-aa1/mm/objrmap.c 2004-03-22 11:38:55.000000000 +0000
> +++ linux/mm/objrmap.c 2004-03-22 16:34:29.421216936 +0000
> @@ -212,7 +212,7 @@ int fastcall page_referenced(struct page
> BUG_ON(!page->mapping);
>
> if (page_test_and_clear_young(page))
> - mark_page_accessed(page);
> + referenced++;
>
> if (TestClearPageReferenced(page))
> referenced++;
I definitely agree with this one.
what about this?
thanks for spotting these s390 bugs!
--- x/mm/vmscan.c.~1~ 2004-03-21 15:21:42.000000000 +0100
+++ x/mm/vmscan.c 2004-03-22 18:39:05.551793088 +0100
@@ -313,16 +313,14 @@ shrink_list(struct list_head *page_list,
if (page_mapped(page) && mapping) {
switch (try_to_unmap(page)) {
case SWAP_FAIL:
- page_map_unlock(page);
goto activate_locked;
case SWAP_AGAIN:
- page_map_unlock(page);
goto keep_locked;
case SWAP_SUCCESS:
; /* try to free the page below */
}
- }
- page_map_unlock(page);
+ } else
+ page_map_unlock(page);
/*
* If the page is dirty, only perform writeback if that write
--- x/mm/memory.c.~1~ 2004-03-21 15:21:42.000000000 +0100
+++ x/mm/memory.c 2004-03-22 13:40:26.852849384 +0100
@@ -324,9 +324,11 @@ skip_copy_pte_range:
* Device driver pages must not be
* tracked by the VM for unmapping.
*/
- BUG_ON(!page_mapped(page));
- BUG_ON(!page->mapping);
- page_add_rmap(page, vma, address, PageAnon(page));
+ if (likely(page_mapped(page) && page->mapping))
+ page_add_rmap(page, vma, address, PageAnon(page));
+ else
+ printk("Badness in %s at %s:%d\n",
+ __FUNCTION__, __FILE__, __LINE__);
} else {
BUG_ON(page_mapped(page));
BUG_ON(page->mapping);
@@ -1429,7 +1431,9 @@ retry:
* real anonymous pages, they're "device" reserved pages instead.
*/
reserved = !!(vma->vm_flags & VM_RESERVED);
- WARN_ON(reserved == pageable);
+ if (unlikely(reserved == pageable))
+ printk("Badness in %s at %s:%d\n",
+ __FUNCTION__, __FILE__, __LINE__);
/*
* Should we do an early C-O-W break?
--- x/mm/objrmap.c.~1~ 2004-03-21 15:21:42.000000000 +0100
+++ x/mm/objrmap.c 2004-03-22 18:41:56.434814920 +0100
@@ -48,6 +48,12 @@ static inline void validate_anon_vma_fin
#endif
}
+static pte_t *
+find_pte_nonlinear(struct mm_struct * mm, struct vm_area_struct *vma, struct page *page)
+{
+
+}
+
/**
* find_pte - Find a pte pointer given a vma and a struct page.
* @vma: the vma to search
@@ -62,7 +68,7 @@ static inline void validate_anon_vma_fin
*
* It is the caller's responsibility to unmap the pte if it is returned.
*/
-static inline pte_t *
+static pte_t *
find_pte(struct vm_area_struct *vma, struct page *page, unsigned long *addr)
{
struct mm_struct *mm = vma->vm_mm;
@@ -72,6 +78,9 @@ find_pte(struct vm_area_struct *vma, str
unsigned long loffset;
unsigned long address;
+ if (vma->vm_flags & VM_NONLINEAR)
+ return find_pte_nonlinear(mm, vma, page);
+
loffset = (page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT));
address = vma->vm_start + ((loffset - vma->vm_pgoff) << PAGE_SHIFT);
if (address < vma->vm_start || address >= vma->vm_end)
@@ -120,6 +129,13 @@ page_referenced_one(struct vm_area_struc
pte_t *pte;
int referenced = 0;
+ /*
+ * Tracking the referenced info is too expensive
+ * for nonlinear mappings.
+ */
+ if (vma->vm_flags & VM_NONLINEAR)
+ goto out;
+
if (!spin_trylock(&mm->page_table_lock))
goto out;
@@ -212,7 +228,7 @@ int fastcall page_referenced(struct page
BUG_ON(!page->mapping);
if (page_test_and_clear_young(page))
- mark_page_accessed(page);
+ referenced++;
if (TestClearPageReferenced(page))
referenced++;
@@ -344,6 +360,10 @@ void fastcall page_remove_rmap(struct pa
out_unlock:
page_map_unlock(page);
+
+ if (page_test_and_clear_dirty(page) && !page_mapped(page))
+ set_page_dirty(page);
+
return;
}
@@ -523,6 +543,11 @@ int fastcall try_to_unmap(struct page *
dec_page_state(nr_mapped);
ret = SWAP_SUCCESS;
}
+ page_map_unlock(page);
+
+ if (page_test_and_clear_dirty(page) && ret == SWAP_SUCCESS)
+ set_page_dirty(page);
+
return ret;
}
* Re: VMA_MERGING_FIXUP and patch
2004-03-22 17:52 ` Andrea Arcangeli
@ 2004-03-22 19:02 ` Hugh Dickins
2004-03-22 19:58 ` Andrea Arcangeli
0 siblings, 1 reply; 15+ messages in thread
From: Hugh Dickins @ 2004-03-22 19:02 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, linux-kernel
On Mon, 22 Mar 2004, Andrea Arcangeli wrote:
>
> what about this?
>
> @@ -344,6 +360,10 @@ void fastcall page_remove_rmap(struct pa
>
> out_unlock:
> page_map_unlock(page);
> +
> + if (page_test_and_clear_dirty(page) && !page_mapped(page))
> + set_page_dirty(page);
> +
> return;
> }
No, it has to be
if (!page_mapped(page) && page_test_and_clear_dirty(page))
set_page_dirty(page);
but the positioning is fine.
> @@ -523,6 +543,11 @@ int fastcall try_to_unmap(struct page *
> dec_page_state(nr_mapped);
> ret = SWAP_SUCCESS;
> }
> + page_map_unlock(page);
> +
> + if (page_test_and_clear_dirty(page) && ret == SWAP_SUCCESS)
> + set_page_dirty(page);
> +
> return ret;
> }
No, it has to be
if (ret == SWAP_SUCCESS && page_test_and_clear_dirty(page))
set_page_dirty(page);
Personally, I'd prefer we leave try_to_unmap with the lock we
had on entry, and do this at the shrink_list end - though I
can see that the way you've done it actually reduces the code.
(The s390 header file comments that the page_test_and_clear_dirty
needs to be done while not mapped, because of race with referenced
bit, and we are opening up to a race now; but unless s390 is very
different, I see nothing wrong with a rare race on referenced -
whereas everything wrong with any race that might lose dirty.)
Excited by that glimpse of find_pte_nonlinear you just gave us;
disappointed to find it empty ;-)
Hugh
* Re: VMA_MERGING_FIXUP and patch
2004-03-22 17:05 VMA_MERGING_FIXUP and patch Hugh Dickins
2004-03-22 17:52 ` Andrea Arcangeli
@ 2004-03-22 19:57 ` Andrew Morton
2004-03-22 20:05 ` Andrea Arcangeli
1 sibling, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2004-03-22 19:57 UTC (permalink / raw)
To: Hugh Dickins; +Cc: andrea, linux-kernel
Hugh Dickins <hugh@veritas.com> wrote:
>
> Hmm, I wonder, is that safe to be calling set_page_dirty
> from inside the page rmap lock? Andrew?
set_page_dirty() takes ->tree_lock and inode_lock. tree_lock surely is OK,
and while I cannot think of any deadlocks which could occur from taking
inode_lock inside the rmap lock, it doesn't sound very nice.
It would of course be best if we could avoid adding a new ranking
relationship between these locks.
* Re: VMA_MERGING_FIXUP and patch
2004-03-22 19:02 ` Hugh Dickins
@ 2004-03-22 19:58 ` Andrea Arcangeli
2004-03-23 21:44 ` nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch] Andrea Arcangeli
0 siblings, 1 reply; 15+ messages in thread
From: Andrea Arcangeli @ 2004-03-22 19:58 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel
On Mon, Mar 22, 2004 at 07:02:01PM +0000, Hugh Dickins wrote:
> On Mon, 22 Mar 2004, Andrea Arcangeli wrote:
> >
> > what about this?
> >
> > @@ -344,6 +360,10 @@ void fastcall page_remove_rmap(struct pa
> >
> > out_unlock:
> > page_map_unlock(page);
> > +
> > + if (page_test_and_clear_dirty(page) && !page_mapped(page))
> > + set_page_dirty(page);
> > +
> > return;
> > }
>
> No, it has to be
> if (!page_mapped(page) && page_test_and_clear_dirty(page))
> set_page_dirty(page);
> but the positioning is fine.
you're right, I'll fix it up. You saved s390 for the second time in a row ;)
> > @@ -523,6 +543,11 @@ int fastcall try_to_unmap(struct page *
> > dec_page_state(nr_mapped);
> > ret = SWAP_SUCCESS;
> > }
> > + page_map_unlock(page);
> > +
> > + if (page_test_and_clear_dirty(page) && ret == SWAP_SUCCESS)
> > + set_page_dirty(page);
> > +
> > return ret;
> > }
>
> No, it has to be
> if (ret == SWAP_SUCCESS && page_test_and_clear_dirty(page))
> set_page_dirty(page);
agreed.
> Personally, I'd prefer we leave try_to_unmap with the lock we
> had on entry, and do this at the shrink_list end - though I
> can see that the way you've done it actually reduces the code.
I agree with you here too; I also would have preferred not to drop the
lock before returning, but I did it to reduce the code (in the asm
especially).
> (The s390 header file comments that the page_test_and_clear_dirty
> needs to be done while not mapped, because of race with referenced
> bit, and we are opening up to a race now; but unless s390 is very
> different, I see nothing wrong with a rare race on referenced -
> whereas everything wrong with any race that might lose dirty.)
Agreed. I don't see how we could ever lose the dirty bit; however, the
above comment is scary and needs clarification from the hw experts. If
it's not safe we can put it back under the page_map_lock. I don't recall
stuff taking the page_map_lock while holding the mapping locks, so it
should be safe (though if we can do it outside, that's preferable).
> Excited by that glimpse of find_pte_nonlinear you just gave us;
> disappointed to find it empty ;-)
8)8)
* Re: VMA_MERGING_FIXUP and patch
2004-03-22 19:57 ` VMA_MERGING_FIXUP and patch Andrew Morton
@ 2004-03-22 20:05 ` Andrea Arcangeli
2004-03-22 20:33 ` Andrew Morton
0 siblings, 1 reply; 15+ messages in thread
From: Andrea Arcangeli @ 2004-03-22 20:05 UTC (permalink / raw)
To: Andrew Morton; +Cc: Hugh Dickins, linux-kernel
On Mon, Mar 22, 2004 at 11:57:34AM -0800, Andrew Morton wrote:
> set_page_dirty() takes ->tree_lock and inode_lock. tree_lock surely is OK
> and while I cannot think of any deadlocks which could occur with taking
> inode_lock inside the rmap lock, it doesn't sound very nice.
>
> It would of course be best if we could avoid adding a new ranking
> relationship between these locks.
agreed. the inode_lock especially is more a vfs thing than a mm thing,
so it lives quite far away. (btw, in my last email I only mentioned
the mapping tree_lock because I was thinking of the swapcache; with
file-backed pagecache and the inode_lock it gets worse, though probably
still ok today)
* Re: VMA_MERGING_FIXUP and patch
2004-03-22 20:05 ` Andrea Arcangeli
@ 2004-03-22 20:33 ` Andrew Morton
0 siblings, 0 replies; 15+ messages in thread
From: Andrew Morton @ 2004-03-22 20:33 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: hugh, linux-kernel
Andrea Arcangeli <andrea@suse.de> wrote:
>
> On Mon, Mar 22, 2004 at 11:57:34AM -0800, Andrew Morton wrote:
> > set_page_dirty() takes ->tree_lock and inode_lock. tree_lock surely is OK
> > and while I cannot think of any deadlocks which could occur with taking
> > inode_lock inside the rmap lock, it doesn't sound very nice.
> >
> > It would of course be best if we could avoid adding a new ranking
> > relationship between these locks.
>
> agreed. the inode_lock especially is more a vfs thing than a mm thing,
> so it lives quite far away.
Alas, inode_lock can be taken inside page_table_lock, in zap_pte_range().
That set_page_dirty() in there is the nastiest part of the MM locking.
* nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch]
2004-03-22 19:58 ` Andrea Arcangeli
@ 2004-03-23 21:44 ` Andrea Arcangeli
2004-03-24 2:35 ` Andrea Arcangeli
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Andrea Arcangeli @ 2004-03-23 21:44 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel
This is incremental with 2.6.5-rc2-aa1 and it should allow swapping of
nonlinear mappings too. Does anybody have a testcase that I can use to
test the nonlinear mapping swapping, or should I write one from scratch?
This code compiles fine but it's untested at this time (I tested the
regular swapping but not the nonlinear one).

I don't think I can use the tlb gather because I have to set the pte back
immediately, or can I? The IPI flood and huge pagetable walk, with total
destruction of the address space of huge mappings, will be very bad in
terms of usability during swapping of huge nonlinear vmas, but hey, if
you want to swap smoothly, you should use the vmas.

If you give up using remap_file_pages you will get good swapping and,
as long as there are a dozen pages per vma, the cost of
remap_file_pages w/ pte_chains will be the same as the cost of the vmas
w/o pte_chains, plus the vmas can be heavily merged. So remap_file_pages
really makes sense only if you would generate fewer than a dozen pages
per vma and can save significant ram compared with the vmas, accepting
the risk of the worse swapping; or alternatively if you give up swapping
(i.e. you use mlock + remap_file_pages, which saves the memory for the
vmas). But in general (especially w/o pte_chains) the usage of
remap_file_pages should be strongly discouraged, unless it's a special
app using mlock or an emulator with small amounts of nonlinear address
space. If you use it, don't hope to ever swap efficiently. On 64bit
archs nothing but the emulators should use it.
--- x/include/linux/objrmap.h.~1~ 2004-03-21 15:21:42.000000000 +0100
+++ x/include/linux/objrmap.h 2004-03-23 22:29:53.699758112 +0100
@@ -1,5 +1,5 @@
-#ifndef _LINUX_RMAP_H
-#define _LINUX_RMAP_H
+#ifndef _LINUX_OBJRMAP_H
+#define _LINUX_OBJRMAP_H
/*
* Declarations for Object Reverse Mapping functions in mm/objrmap.c
*/
@@ -75,4 +75,4 @@ int FASTCALL(page_referenced(struct page
#endif /* CONFIG_MMU */
-#endif /* _LINUX_RMAP_H */
+#endif /* _LINUX_OBJRMAP_H */
--- x/include/linux/page-flags.h.~1~ 2004-03-21 15:21:42.000000000 +0100
+++ x/include/linux/page-flags.h 2004-03-23 22:30:16.940225024 +0100
@@ -71,11 +71,11 @@
#define PG_nosave 14 /* Used for system suspend/resume */
#define PG_maplock 15 /* lock bit for ->as.anon_vma and ->mapcount */
+#define PG_swapcache 16 /* SwapCache page */
#define PG_mappedtodisk 17 /* Has blocks allocated on-disk */
#define PG_reclaim 18 /* To be reclaimed asap */
#define PG_compound 19 /* Part of a compound page */
#define PG_anon 20 /* Anonymous page */
-#define PG_swapcache 21 /* SwapCache page */
/*
--- x/mm/vmscan.c.~1~ 2004-03-21 15:21:42.000000000 +0100
+++ x/mm/vmscan.c 2004-03-22 18:39:05.000000000 +0100
@@ -313,16 +313,14 @@ shrink_list(struct list_head *page_list,
if (page_mapped(page) && mapping) {
switch (try_to_unmap(page)) {
case SWAP_FAIL:
- page_map_unlock(page);
goto activate_locked;
case SWAP_AGAIN:
- page_map_unlock(page);
goto keep_locked;
case SWAP_SUCCESS:
; /* try to free the page below */
}
- }
- page_map_unlock(page);
+ } else
+ page_map_unlock(page);
/*
* If the page is dirty, only perform writeback if that write
--- x/mm/memory.c.~1~ 2004-03-21 15:21:42.000000000 +0100
+++ x/mm/memory.c 2004-03-23 22:25:28.758035376 +0100
@@ -97,7 +97,7 @@ static inline void free_one_pmd(struct m
if (pmd_none(*dir))
return;
- if (pmd_bad(*dir)) {
+ if (unlikely(pmd_bad(*dir))) {
pmd_ERROR(*dir);
pmd_clear(dir);
return;
@@ -115,7 +115,7 @@ static inline void free_one_pgd(struct m
if (pgd_none(*dir))
return;
- if (pgd_bad(*dir)) {
+ if (unlikely(pgd_bad(*dir))) {
pgd_ERROR(*dir);
pgd_clear(dir);
return;
@@ -232,7 +232,7 @@ int copy_page_range(struct mm_struct *ds
if (pgd_none(*src_pgd))
goto skip_copy_pmd_range;
- if (pgd_bad(*src_pgd)) {
+ if (unlikely(pgd_bad(*src_pgd))) {
pgd_ERROR(*src_pgd);
pgd_clear(src_pgd);
skip_copy_pmd_range: address = (address + PGDIR_SIZE) & PGDIR_MASK;
@@ -253,7 +253,7 @@ skip_copy_pmd_range: address = (address
if (pmd_none(*src_pmd))
goto skip_copy_pte_range;
- if (pmd_bad(*src_pmd)) {
+ if (unlikely(pmd_bad(*src_pmd))) {
pmd_ERROR(*src_pmd);
pmd_clear(src_pmd);
skip_copy_pte_range:
@@ -324,9 +324,11 @@ skip_copy_pte_range:
* Device driver pages must not be
* tracked by the VM for unmapping.
*/
- BUG_ON(!page_mapped(page));
- BUG_ON(!page->mapping);
- page_add_rmap(page, vma, address, PageAnon(page));
+ if (likely(page_mapped(page) && page->mapping))
+ page_add_rmap(page, vma, address, PageAnon(page));
+ else
+ printk("Badness in %s at %s:%d\n",
+ __FUNCTION__, __FILE__, __LINE__);
} else {
BUG_ON(page_mapped(page));
BUG_ON(page->mapping);
@@ -381,7 +383,7 @@ zap_pte_range(struct mmu_gather *tlb, pm
if (pmd_none(*pmd))
return;
- if (pmd_bad(*pmd)) {
+ if (unlikely(pmd_bad(*pmd))) {
pmd_ERROR(*pmd);
pmd_clear(pmd);
return;
@@ -424,27 +426,25 @@ zap_pte_range(struct mmu_gather *tlb, pm
static void
zap_pmd_range(struct mmu_gather *tlb, pgd_t * dir,
- unsigned long address, unsigned long size)
+ unsigned long address, unsigned long end)
{
pmd_t * pmd;
- unsigned long end;
if (pgd_none(*dir))
return;
- if (pgd_bad(*dir)) {
+ if (unlikely(pgd_bad(*dir))) {
pgd_ERROR(*dir);
pgd_clear(dir);
return;
}
pmd = pmd_offset(dir, address);
- end = address + size;
if (end > ((address + PGDIR_SIZE) & PGDIR_MASK))
end = ((address + PGDIR_SIZE) & PGDIR_MASK);
do {
zap_pte_range(tlb, pmd, address, end - address);
address = (address + PMD_SIZE) & PMD_MASK;
pmd++;
- } while (address < end);
+ } while (address && (address < end));
}
void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
@@ -462,7 +462,7 @@ void unmap_page_range(struct mmu_gather
dir = pgd_offset(vma->vm_mm, address);
tlb_start_vma(tlb, vma);
do {
- zap_pmd_range(tlb, dir, address, end - address);
+ zap_pmd_range(tlb, dir, address, end);
address = (address + PGDIR_SIZE) & PGDIR_MASK;
dir++;
} while (address && (address < end));
@@ -632,7 +632,7 @@ follow_page(struct mm_struct *mm, unsign
goto out;
if (pmd_huge(*pmd))
return follow_huge_pmd(mm, address, pmd, write);
- if (pmd_bad(*pmd))
+ if (unlikely(pmd_bad(*pmd)))
goto out;
ptep = pte_offset_map(pmd, address);
@@ -1429,7 +1429,9 @@ retry:
* real anonymous pages, they're "device" reserved pages instead.
*/
reserved = !!(vma->vm_flags & VM_RESERVED);
- WARN_ON(reserved == pageable);
+ if (unlikely(reserved == pageable))
+ printk("Badness in %s at %s:%d\n",
+ __FUNCTION__, __FILE__, __LINE__);
/*
* Should we do an early C-O-W break?
--- x/mm/objrmap.c.~1~ 2004-03-21 15:21:42.000000000 +0100
+++ x/mm/objrmap.c 2004-03-23 22:23:27.393485592 +0100
@@ -19,6 +19,11 @@
* Released under the General Public License (GPL).
*/
+/*
+ * nonlinear pagetable walking elaborated from mm/memory.c under
+ * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds
+ */
+
#include <linux/pagemap.h>
#include <linux/swap.h>
#include <linux/swapops.h>
@@ -62,7 +67,7 @@ static inline void validate_anon_vma_fin
*
* It is the caller's responsibility to unmap the pte if it is returned.
*/
-static inline pte_t *
+static pte_t *
find_pte(struct vm_area_struct *vma, struct page *page, unsigned long *addr)
{
struct mm_struct *mm = vma->vm_mm;
@@ -120,7 +125,14 @@ page_referenced_one(struct vm_area_struc
pte_t *pte;
int referenced = 0;
- if (!spin_trylock(&mm->page_table_lock))
+ /*
+ * Tracking the referenced info is too expensive
+ * for nonlinear mappings.
+ */
+ if (vma->vm_flags & VM_NONLINEAR)
+ goto out;
+
+ if (unlikely(!spin_trylock(&mm->page_table_lock)))
goto out;
pte = find_pte(vma, page, NULL);
@@ -158,7 +170,7 @@ page_referenced_inode(struct page *page)
BUG_ON(PageSwapCache(page));
- if (down_trylock(&mapping->i_shared_sem))
+ if (unlikely(down_trylock(&mapping->i_shared_sem)))
goto out;
list_for_each_entry(vma, &mapping->i_mmap, shared)
@@ -212,7 +224,7 @@ int fastcall page_referenced(struct page
BUG_ON(!page->mapping);
if (page_test_and_clear_young(page))
- mark_page_accessed(page);
+ referenced++;
if (TestClearPageReferenced(page))
referenced++;
@@ -344,38 +356,18 @@ void fastcall page_remove_rmap(struct pa
out_unlock:
page_map_unlock(page);
+
+ if (!page_mapped(page) && page_test_and_clear_dirty(page))
+ set_page_dirty(page);
+
return;
}
-/**
- * try_to_unmap_one - unmap a page using the object-based rmap method
- * @page: the page to unmap
- *
- * Determine whether a page is mapped in a given vma and unmap it if it's found.
- *
- * This function is strictly a helper function for try_to_unmap_inode.
- */
-static int
-try_to_unmap_one(struct vm_area_struct *vma, struct page *page)
+static void
+unmap_pte_page(struct page * page, struct vm_area_struct * vma,
+ unsigned long address, pte_t * pte)
{
- struct mm_struct *mm = vma->vm_mm;
- unsigned long address;
- pte_t *pte;
pte_t pteval;
- int ret = SWAP_AGAIN;
-
- if (!spin_trylock(&mm->page_table_lock))
- return ret;
-
- pte = find_pte(vma, page, &address);
- if (!pte)
- goto out;
-
- BUG_ON(vma->vm_flags & VM_RESERVED);
- if (vma->vm_flags & VM_LOCKED) {
- ret = SWAP_FAIL;
- goto out_unmap;
- }
flush_cache_page(vma, address);
pteval = ptep_clear_flush(vma, address, pte);
@@ -389,8 +381,11 @@ try_to_unmap_one(struct vm_area_struct *
swap_duplicate(entry);
set_pte(pte, swp_entry_to_pte(entry));
BUG_ON(pte_file(*pte));
+ BUG_ON(!PageAnon(page));
} else {
unsigned long pgidx;
+
+ BUG_ON(PageAnon(page));
/*
* If a nonlinear mapping then store the file page offset
* in the pte.
@@ -409,14 +404,129 @@ try_to_unmap_one(struct vm_area_struct *
BUG_ON(!page->mapcount);
- mm->rss--;
+ vma->vm_mm->rss--;
if (!--page->mapcount && PageAnon(page))
anon_vma_page_unlink(page);
page_cache_release(page);
+}
-out_unmap:
- pte_unmap(pte);
+static void
+try_to_unmap_nonlinear_pte(struct vm_area_struct * vma,
+ pmd_t * pmd, unsigned long address, unsigned long size)
+{
+ unsigned long offset;
+ pte_t *ptep;
+
+ if (pmd_none(*pmd))
+ return;
+ if (unlikely(pmd_bad(*pmd))) {
+ pmd_ERROR(*pmd);
+ pmd_clear(pmd);
+ return;
+ }
+ ptep = pte_offset_map(pmd, address);
+ offset = address & ~PMD_MASK;
+ if (offset + size > PMD_SIZE)
+ size = PMD_SIZE - offset;
+ size &= PAGE_MASK;
+ for (offset=0; offset < size; ptep++, offset += PAGE_SIZE) {
+ pte_t pte = *ptep;
+ if (pte_none(pte))
+ continue;
+ if (pte_present(pte)) {
+ unsigned long pfn = pte_pfn(pte);
+ struct page * page;
+
+ BUG_ON(!pfn_valid(pfn));
+ page = pfn_to_page(pfn);
+#ifdef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
+ get_page(page);
+#endif
+ unmap_pte_page(page, vma, address, ptep);
+ if (!page_mapped(page) && page_test_and_clear_dirty(page))
+ /* ugly locking */
+ set_page_dirty(page);
+#ifdef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
+ put_page(page);
+#endif
+ }
+ }
+ pte_unmap(ptep-1);
+}
+
+static void
+try_to_unmap_nonlinear_pmd(struct vm_area_struct * vma,
+ pgd_t * dir, unsigned long address, unsigned long end)
+{
+ pmd_t * pmd;
+
+ if (pgd_none(*dir))
+ return;
+ if (unlikely(pgd_bad(*dir))) {
+ pgd_ERROR(*dir);
+ pgd_clear(dir);
+ return;
+ }
+ pmd = pmd_offset(dir, address);
+ if (end > ((address + PGDIR_SIZE) & PGDIR_MASK))
+ end = ((address + PGDIR_SIZE) & PGDIR_MASK);
+ do {
+ try_to_unmap_nonlinear_pte(vma, pmd, address, end - address);
+ address = (address + PMD_SIZE) & PMD_MASK;
+ pmd++;
+ } while (address && (address < end));
+}
+
+static void
+try_to_unmap_nonlinear(struct vm_area_struct *vma)
+{
+ pgd_t * dir;
+ unsigned long address = vma->vm_start, end = vma->vm_end;
+
+ dir = pgd_offset(vma->vm_mm, address);
+ do {
+ try_to_unmap_nonlinear_pmd(vma, dir, address, end);
+ address = (address + PGDIR_SIZE) & PGDIR_MASK;
+ dir++;
+ } while (address && (address < end));
+}
+
+/**
+ * try_to_unmap_one - unmap a page using the object-based rmap method
+ * @page: the page to unmap
+ *
+ * Determine whether a page is mapped in a given vma and unmap it if it's found.
+ *
+ * This function is strictly a helper function for try_to_unmap_inode.
+ */
+static int
+try_to_unmap_one(struct vm_area_struct *vma, struct page *page)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long address;
+ pte_t *pte;
+ int ret;
+
+ BUG_ON(vma->vm_flags & VM_RESERVED);
+ if (unlikely(vma->vm_flags & VM_LOCKED))
+ return SWAP_FAIL;
+
+ ret = SWAP_AGAIN;
+ if (unlikely(!spin_trylock(&mm->page_table_lock)))
+ return ret;
+
+ if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
+ try_to_unmap_nonlinear(vma);
+ goto out;
+ }
+
+ pte = find_pte(vma, page, &address);
+ if (!pte)
+ goto out;
+ unmap_pte_page(page, vma, address, pte);
+
+ pte_unmap(pte);
out:
spin_unlock(&mm->page_table_lock);
return ret;
@@ -443,7 +553,7 @@ try_to_unmap_inode(struct page *page)
BUG_ON(PageSwapCache(page));
- if (down_trylock(&mapping->i_shared_sem))
+ if (unlikely(down_trylock(&mapping->i_shared_sem)))
return ret;
list_for_each_entry(vma, &mapping->i_mmap, shared) {
@@ -523,6 +633,11 @@ int fastcall try_to_unmap(struct page *
dec_page_state(nr_mapped);
ret = SWAP_SUCCESS;
}
+ page_map_unlock(page);
+
+ if (ret == SWAP_SUCCESS && page_test_and_clear_dirty(page))
+ set_page_dirty(page);
+
return ret;
}
* Re: nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch]
2004-03-23 21:44 ` nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch] Andrea Arcangeli
@ 2004-03-24 2:35 ` Andrea Arcangeli
2004-03-24 4:38 ` Andrea Arcangeli
2004-03-24 10:12 ` Hugh Dickins
2 siblings, 0 replies; 15+ messages in thread
From: Andrea Arcangeli @ 2004-03-24 2:35 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel
On Tue, Mar 23, 2004 at 10:44:59PM +0100, Andrea Arcangeli wrote:
> This is incremental with 2.6.5-rc2-aa1 and it should allow swapping of
> nonlinear mappings too. Has anybody a testcase that I can use to test
I wrote a test app swapping around 1G, with 200 tasks mapping 1.7G each
after marking the vmas nonlinear. It's not exactly smooth swapping
(expected), but it seems to work just fine (I admit for the first few
seconds I wondered if it was deadlocking, until I pressed SYSRQ+P and
saw idle cpus and then the disk running too, so I waited some more ;)
r b swpd free buff cache si so bi bo in cs us sy id wa
12 1 0 912564 26824 45188 0 0 18 5 270 18 0 0 99 0
19 1 0 684436 26824 269928 0 0 0 0 1087 27560 11 89 0 0
33 0 0 542148 26824 408172 0 0 0 0 1086 104 13 87 0 0
47 2 0 432956 26824 513164 0 0 0 0 1084 104 16 84 0 0
54 0 0 357364 26824 584768 0 0 0 0 1085 86 15 85 0 0
61 0 0 287980 26832 650176 0 0 0 44 1090 95 16 84 0 0
68 0 0 238244 26832 695872 0 0 0 0 1086 81 16 84 0 0
75 0 0 191132 26832 738984 0 0 0 0 1085 82 16 84 0 0
75 0 0 139676 26832 786584 0 0 0 0 1086 75 16 84 0 0
82 0 0 100868 26832 821468 0 0 0 0 1083 82 17 83 0 0
82 0 0 57028 26832 861452 0 0 0 0 1085 72 16 85 0 0
89 0 0 6920 23248 910936 0 0 0 0 1103 85 15 85 0 0
89 2 0 5140 11440 919208 0 0 0 240 2177 738 11 89 0 0
90 71 2160 2048 332 934084 0 284 0 284 1823 2255 5 91 0 4
95 67 11564 6440 332 929572 44 11236 44 11236 1384 582 6 94 0 0
0 98 751260 2276 192 710688 88296 628492 90828 628640 104070 47927 1 18 0 81
0 98 751464 6052 200 711364 64 2924 712 2924 1580 161 0 0 0 100
1 96 752124 2312 200 717264 256 5720 380 5748 1736 170 0 3 0 97
0 97 750392 2384 204 718800 2432 4608 2432 4612 1319 373 0 1 0 99
0 97 750968 2516 204 720672 128 3796 128 3796 1612 119 0 2 0 98
0 97 750260 2640 204 724312 472 3968 472 3968 1401 146 0 1 0 99
0 97 751816 2384 212 726712 792 4124 912 4124 1701 256 0 2 0 98
0 97 772464 2240 216 714092 176 7320 176 7348 1875 292 2 22 0 77
0 97 776016 2248 216 735556 100 37828 100 37828 1457 228 1 4 0 95
0 97 773044 6024 216 738528 3984 3124 3984 3124 1223 451 0 1 0 100
0 97 769832 2504 216 741740 3792 0 3792 0 1253 491 0 0 0 100
0 97 773096 2448 216 738688 3296 10048 3296 10048 1330 512 0 2 0 98
0 209 771488 2192 216 750164 960 25636 960 25636 1424 7683 1 7 0 92
0 209 772984 2272 216 769216 2120 7880 2120 7880 1331 8610 1 12 0 87
0 209 776216 2292 216 772664 2744 2584 2744 2584 1439 1522 1 13 0 86
0 208 775828 2740 220 776292 556 6912 688 6912 1628 651 1 6 0 93
0 208 778384 3280 220 778492 544 15796 544 15796 1719 401 0 2 0 98
0 208 784516 23548 220 771880 440 17764 440 17764 1564 2062 0 4 0 97
it's like going back to the 2.4 swapping behaviour. If those nonlinear mappings
are big and they must be swapped efficiently, remap_file_pages should not be
used. Of course mlock on top of nonlinear will fix it completely.
I doubt we can do much better than this w/o throwing lots of memory at tracking
ptes (like rmap did with the pte_chains), but IMO the pte_chains would
invalidate the point of the nonlinear vmas (if one has to pay for the
pte_chains then one may as well use the vmas, since the cost is the same as
long as there are at least a dozen pages per vma). I think this way is
optimal: remap_file_pages wants to be even faster and more scalable, with
less memory footprint for the fast path, at the expense of less
efficient and less scalable swapping.
To make a comparison, this is the swapping of the linear vmas instead
(i.e. the only one we care about); it's quite cpu dominated too, and
mostly spent in the trylocks.
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
11 4 5660 959856 2428 29728 220 359 241 362 301 183 0 3 69 28
27 4 5660 747060 2440 239292 0 0 0 180 1089 22479 11 89 0 0
34 1 5660 627052 2448 355496 0 0 0 84 1092 103 13 87 0 0
41 0 5660 523868 2448 454708 0 0 0 0 1084 91 15 85 0 0
49 2 5660 431772 2448 542904 0 0 0 0 1085 88 15 85 0 0
62 1 5660 362204 2448 608320 0 0 0 0 1084 98 14 86 0 0
62 0 5660 302492 2448 664420 0 0 0 48 1093 77 16 84 0 0
69 0 5660 241628 2448 721336 0 0 0 0 1085 79 16 84 0 0
76 0 5660 171020 2448 787976 0 0 0 0 1086 85 15 85 0 0
83 0 5660 126212 2448 828844 0 0 0 0 1085 81 17 83 0 0
83 1 5660 60036 2448 891336 0 0 0 0 1086 70 15 85 0 0
90 1 5660 35764 2448 911464 0 0 0 0 1084 89 16 84 0 0
92 1 3704 2332 704 946516 0 196 0 384 1129 80 12 88 0 0
91 0 9148 3400 332 947688 0 5660 0 5660 1130 129 10 90 0 0
69 28 13252 2064 300 949800 224 3208 344 3208 1161 397 9 91 0 0
80 38 30192 4112 300 942908 584 15468 676 15468 1512 3754 6 94 0 0
89 45 43508 3640 308 940204 932 15304 1052 15316 1469 4564 4 96 0 0
66 74 56072 5156 316 939460 2684 13432 2832 13432 1392 13575 5 95 0 0
69 75 68248 5236 184 937780 3008 12592 3412 12600 1414 17344 2 98 0 0
62 81 79076 7576 132 932404 3964 10836 3964 10836 1526 28044 1 99 0 0
106 95 84068 4444 132 934540 4052 6912 4052 6912 1466 24908 0 100 0 0
107 92 87864 6072 140 930740 3068 4468 3164 4468 1509 4045 0 100 0 0
12 100 92592 6068 148 926096 5156 5424 5156 5440 1508 25626 1 100 0 0
99 93 96872 5572 148 926840 3972 5548 4024 5548 1449 20534 0 100 0 0
107 92 101304 7984 148 922492 3608 5304 3656 5304 1491 14817 0 100 0 0
100 90 105340 8416 148 922012 5592 5400 5592 5400 1439 19803 0 100 0 0
107 89 108688 7376 148 918572 4600 4496 4600 4496 1383 12725 0 100 0 0
108 93 113168 4876 148 918512 5440 5780 5440 5780 1383 23260 0 100 0 0
96 90 116820 4816 148 914956 5192 5008 5192 5008 1342 11427 1 100 0 0
89 99 121152 3288 148 913044 5580 5772 5580 5772 1382 29699 0 100 0 0
107 96 124948 4048 164 912236 4220 4832 4432 4832 1407 7537 0 100 0 0
90 94 128348 4284 164 909940 5056 4656 5056 4656 1398 14028 0 100 0 0
108 90 132028 2472 144 913728 3688 4640 3688 4640 1413 3582 0 100 0 0
89 91 135484 3256 144 911400 4516 4556 4516 4556 1401 11283 0 100 0 0
108 90 139644 2952 144 910092 4268 5084 4268 5084 1500 19229 1 100 0 0
107 89 143772 2636 144 907696 4492 5084 4492 5084 1467 14234 0 100 0 0
82 85 147720 6904 140 903788 3800 4820 3840 4820 1413 11291 0 100 0 0
106 96 152536 2808 140 904904 4520 5940 4520 5940 1417 29255 0 100 0 0
I also found a race condition during testing:
this bug triggered in unmap_pte_page:
BUG_ON(!page->mapcount);
That was because I was running page_add_rmap outside the page_table_lock in
do_anonymous_page. We must do lru_cache_add + set_pte + page_add_rmap
atomically in the same critical section, protected by the same page_table_lock,
to prevent the swap side from trying to unmap pages that have not yet been
tracked by page_add_rmap (and in turn by the anon_vma). The fix is easy:
--- x/mm/memory.c.~1~ 2004-03-23 23:29:19.000000000 +0100
+++ x/mm/memory.c 2004-03-24 03:30:27.850181344 +0100
@@ -1354,13 +1354,14 @@ do_anonymous_page(struct mm_struct *mm,
set_pte(page_table, entry);
pte_unmap(page_table);
+ /* ignores ZERO_PAGE */
+ page_add_rmap(page, vma, addr, anon);
+
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, addr, entry);
spin_unlock(&mm->page_table_lock);
ret = VM_FAULT_MINOR;
- /* ignores ZERO_PAGE */
- page_add_rmap(page, vma, addr, anon);
return ret;
}
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch]
2004-03-23 21:44 ` nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch] Andrea Arcangeli
2004-03-24 2:35 ` Andrea Arcangeli
@ 2004-03-24 4:38 ` Andrea Arcangeli
2004-03-24 10:12 ` Hugh Dickins
2 siblings, 0 replies; 15+ messages in thread
From: Andrea Arcangeli @ 2004-03-24 4:38 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel
On Tue, Mar 23, 2004 at 10:44:59PM +0100, Andrea Arcangeli wrote:
> +#ifdef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
> + get_page(page);
> +#endif
> + unmap_pte_page(page, vma, address, ptep);
^^^^^^^ + offset
I added some checks to the shm threaded test program to verify
correctness after swapin, and they showed userspace memory corruption on the
nonlinear pages; the above "+ offset" is the fix.
I also merged anobjrmap-6 from Hugh so ppc and ppc64 compiles fine (plus
some other arch).
Also, the race in do_anonymous_page couldn't trigger: the callers of
try_to_unmap always check first whether the page is mapped after taking the
page_map_lock, so I'm backing that change out.
The real race triggering this bug is that I'm not taking the
page_map_lock for all the pages before unmapping them, so the
--page->mapcount decrements aren't atomic while the page is unmapped by
different address spaces at the same time from multiple cpus.
It'll all be fixed in the next update.
* Re: nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch]
2004-03-23 21:44 ` nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch] Andrea Arcangeli
2004-03-24 2:35 ` Andrea Arcangeli
2004-03-24 4:38 ` Andrea Arcangeli
@ 2004-03-24 10:12 ` Hugh Dickins
2004-03-24 12:18 ` Hugh Dickins
2004-03-24 14:37 ` Andrea Arcangeli
2 siblings, 2 replies; 15+ messages in thread
From: Hugh Dickins @ 2004-03-24 10:12 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, linux-kernel
On Tue, 23 Mar 2004, Andrea Arcangeli wrote:
>
> I don't think I can use the tlb gather because I've to set the pte back
> immediately, or can I? The IPI flood and huge pagetable walk with total
> destruction of the address space with huge mappings will be very bad in
> terms of usability during swapping of huge nonlinear vmas, but hey, if
> you want to swap smoothly, you should use the vmas.
Thanks a lot for the preview (or would have been a preview if I'd been
awake - and now I've found it easiest to look at 2.6.5-rc1 patched with
the 2.6.5-rc1-aa2 objrmap and anon_vma you pointed Martin to in other
mail, which includes your latest fixes).
I think you're being too harsh on the nonlinear vmas! I know you're
not keen on them, but punishing them this hard! If I read it right,
page_referenced will never (unless PageReferenced, or mapped into
a nonlinear also) report a page from a nonlinear vma as referenced
(I do agree with that part). So they'll soon reach try_to_unmap,
and each one which gets there will cause every page in every nonlinear
vma of that inode to be unmapped from the nonlinears right then?
Yes, that'll teach 'em to use sys_remap_file_pages without VM_LOCKED.
For mine I'll try to carry on with the less draconian approach I
started yesterday, scanning just a range each time (rather 2.4 style).
At the very least, I think your unmap (and mine) needs to
ptep_test_and_clear_young just before unmap_pte_page, and back out if
the page is young (referenced). I was going to recommend that anyway:
at last got around to considering that issue of whether the failed
trylocks should report referenced or not (return 1 or 0). Looking at
how shrink_list goes, even before 2.6.5-rc1, I'd expect it to behave
better your way (proceed to try_to_unmap, which will rightly say
SWAP_AGAIN if it fails the same trylock) than how it was before in
objrmap; but that will behave better with a ptep_test_and_clear_young
check first too.
Sorry to see the #if VMA_MERGING_FIXUPs are still there. I've a
growing feeling that it won't make enough difference when they're
gone. But maybe you have a cunning plan to merge all the anon_vmas
which would result from an mmap next page, write data in, mprotect ro,
mmap next page, write data in, mprotect ro, ..... workload.
Hugh
* Re: nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch]
2004-03-24 10:12 ` Hugh Dickins
@ 2004-03-24 12:18 ` Hugh Dickins
2004-03-24 14:47 ` Andrea Arcangeli
2004-03-24 14:37 ` Andrea Arcangeli
1 sibling, 1 reply; 15+ messages in thread
From: Hugh Dickins @ 2004-03-24 12:18 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, linux-kernel
This subtlety in try_to_unmap_nonlinear_pte:
page_map_lock(page);
/* check that we're not in between set_pte and page_add_rmap */
if (page_mapped(page)) {
unmap_pte_page(page, vma, address + offset, ptep);
Harmless, but isn't our acquisition of the page_table_lock guaranteeing
that it cannot be in between set_pte and page_add_rmap?
Hugh
* Re: nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch]
2004-03-24 10:12 ` Hugh Dickins
2004-03-24 12:18 ` Hugh Dickins
@ 2004-03-24 14:37 ` Andrea Arcangeli
2004-03-24 18:42 ` Andrea Arcangeli
1 sibling, 1 reply; 15+ messages in thread
From: Andrea Arcangeli @ 2004-03-24 14:37 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel
On Wed, Mar 24, 2004 at 10:12:58AM +0000, Hugh Dickins wrote:
> On Tue, 23 Mar 2004, Andrea Arcangeli wrote:
> >
> > I don't think I can use the tlb gather because I've to set the pte back
> > immediately, or can I? The IPI flood and huge pagetable walk with total
> > destruction of the address space with huge mappings will be very bad in
> > terms of usability during swapping of huge nonlinear vmas, but hey, if
> > you want to swap smoothly, you should use the vmas.
>
> Thanks a lot for the preview (or would have been a preview if I'd been
> awake - and now I've found it easiest to look at 2.6.5-rc1 patched with
> the 2.6.5-rc1-aa2 objrmap and anon_vma you pointed Martin to in other
> mail, which includes your latest fixes).
>
> I think you're being too harsh on the nonlinear vmas! I know you're
> not keen on them, but punishing them this hard! If I read it right,
> page_referenced will never (unless PageReferenced, or mapped into
> a nonlinear also) report a page from a nonlinear vma as referenced
> (I do agree with that part). So they'll soon reach try_to_unmap,
> and each one which gets there will cause every page in every nonlinear
> vma of that inode to be unmapped from the nonlinears right then?
> Yes, that'll teach 'em to use sys_remap_file_pages without VM_LOCKED.
Yep ;)
> For mine I'll try to carry on with the less draconian approach I
> started yesterday, scanning just a range each time (rather 2.4 style).
That would DoS real life workloads; that's why I had to be draconian. After
you've finished I'll send a testcase, and it's a real life testcase, not
an exploit. The only way to dominate the complexity with a pagetable scan is
to do what 2.4 does, that is to drop all the ptes we find in our way
so the vm stops calling try_to_unmap; we must avoid walking the vma
more than once to swap it out. This will cause a minor fault flood, but
that's ok, it doesn't need to be fast at swapping.
> At the very least, I think your unmap (and mine) needs to
> ptep_test_and_clear_young just before unmap_pte_page, and back out if
> the page is young (referenced). I was going to recommend that anyway:
> at last got around to considering that issue of whether the failed
> trylocks should report referenced or not (return 1 or 0). Looking at
> how shrink_list goes, even before 2.6.5-rc1, I'd expect it to behave
> better your way (proceed to try_to_unmap, which will rightly say
> SWAP_AGAIN if it fails the same trylock) than how it was before in
> objrmap; but that will behave better with a ptep_test_and_clear_young
> check first too.
cute, I agree we should recheck the young bit inside.
> Sorry to see the #if VMA_MERGING_FIXUPs are still there. I've a
> growing feeling that it won't make enough difference when they're
> gone. But maybe you have a cunning plan to merge all the anon_vmas
> which would result from an mmap next page, write data in, mprotect ro,
> mmap next page, write data in, mprotect ro, ..... workload.
Problem is that mprotect (and mremap) merging is low priority compared to the
nonlinear==mlock and i_mmap{_shared} complexity, so I'll address it only
after I have scalable swapping for huge i_mmap{_shared} lists too, which
is a prerequisite for merging; mprotect merging doesn't sound like a
prerequisite, though I certainly agree we should fix it up soon (and
once we fix it, it'll work for files too, something that never worked
to date; I feel it'll be as important for files as it has been so far for
anon ram, and nobody has complained yet that it's not enabled for files ;).
* Re: nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch]
2004-03-24 12:18 ` Hugh Dickins
@ 2004-03-24 14:47 ` Andrea Arcangeli
0 siblings, 0 replies; 15+ messages in thread
From: Andrea Arcangeli @ 2004-03-24 14:47 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel
On Wed, Mar 24, 2004 at 12:18:12PM +0000, Hugh Dickins wrote:
> This subtlety in try_to_unmap_nonlinear_pte:
>
> page_map_lock(page);
> /* check that we're not in between set_pte and page_add_rmap */
> if (page_mapped(page)) {
> unmap_pte_page(page, vma, address + offset, ptep);
>
> Harmless, but isn't our acquisition of the page_table_lock guaranteeing
> that it cannot be in between set_pte and page_add_rmap?
I find that fragile; see the way I implemented do_anonymous_page. Other
places always do page_add_rmap under the page_table_lock, but there's no
reason to require that: the swapout code already checks explicitly for
page_mapped after taking the page_map_lock, and it has to do that anyway,
so I find it nicer to do it like the above and in do_anonymous_page.
* Re: nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch]
2004-03-24 14:37 ` Andrea Arcangeli
@ 2004-03-24 18:42 ` Andrea Arcangeli
0 siblings, 0 replies; 15+ messages in thread
From: Andrea Arcangeli @ 2004-03-24 18:42 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel
On Wed, Mar 24, 2004 at 03:37:29PM +0100, Andrea Arcangeli wrote:
> cute, I agree we should recheck the young bit inside.
I did it (the below is incremental with 2.6.5-rc2-aa2); it looks ok, but I
cannot notice any difference at all in practice. Anyway it makes
sense, so I'm keeping it. Now re-running the regressions for both linear and
non-linear heavy shm swap.
So I'm now going to merge the prio_tree, then I can take care of the
filter for s/anon_vma_t/struct anon_vma/ and the mprotect/mremap
vma merging (for file mappings too).
--- x/mm/objrmap.c.~1~ 2004-03-24 07:16:18.000000000 +0100
+++ x/mm/objrmap.c 2004-03-24 19:27:31.213416608 +0100
@@ -137,7 +137,7 @@ page_referenced_one(struct vm_area_struc
pte = find_pte(vma, page, NULL);
if (pte) {
- if (ptep_test_and_clear_young(pte))
+ if (pte_young(*pte) && ptep_test_and_clear_young(pte))
referenced++;
pte_unmap(pte);
}
@@ -442,6 +442,8 @@ try_to_unmap_nonlinear_pte(struct vm_are
page = pfn_to_page(pfn);
if (PageReserved(page))
continue;
+ if (pte_young(pte) && ptep_test_and_clear_young(ptep))
+ continue;
/*
* any other page in the nonlinear mapping will not wait
* on us since only one cpu can take the i_shared_sem
@@ -506,7 +508,7 @@ try_to_unmap_nonlinear(struct vm_area_st
* This function is strictly a helper function for try_to_unmap_inode.
*/
static int
-try_to_unmap_one(struct vm_area_struct *vma, struct page *page)
+try_to_unmap_one(struct vm_area_struct *vma, struct page *page, int * young)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
@@ -523,12 +525,21 @@ try_to_unmap_one(struct vm_area_struct *
if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
/*
- * All it matters is that the page won't go
- * away under us after we unlock.
+ * If this was a false positive generated by a
+ * failed trylock in the referenced pass, let's
+ * avoid paying the big cost of the nonlinear
+ * swap; we'd better be sure we have to pay that
+ * cost before running it.
*/
- page_map_unlock(page);
- try_to_unmap_nonlinear(vma);
- page_map_lock(page);
+ if (!*young) {
+ /*
+ * All it matters is that the page won't go
+ * away under us after we unlock.
+ */
+ page_map_unlock(page);
+ try_to_unmap_nonlinear(vma);
+ page_map_lock(page);
+ }
goto out;
}
@@ -536,10 +547,21 @@ try_to_unmap_one(struct vm_area_struct *
if (!pte)
goto out;
- unmap_pte_page(page, vma, address, pte);
+ /*
+ * We use trylocks in the "reference" methods; if they fail
+ * we let the VM go ahead unmapping to avoid locking
+ * congestion, so here we may be trying to unmap young
+ * ptes. If that happens we give up unmapping this page
+ * and we clear all the other reference bits instead (basically
+ * downgrading to a page_referenced pass).
+ */
+ if ((!pte_young(*pte) || !ptep_test_and_clear_young(pte)) && !*young)
+ unmap_pte_page(page, vma, address, pte);
+ else
+ *young = 1;
pte_unmap(pte);
-out:
+ out:
spin_unlock(&mm->page_table_lock);
return ret;
}
@@ -561,7 +583,7 @@ try_to_unmap_inode(struct page *page)
{
struct address_space *mapping = page->mapping;
struct vm_area_struct *vma;
- int ret = SWAP_AGAIN;
+ int ret = SWAP_AGAIN, young = 0;
BUG_ON(PageSwapCache(page));
@@ -569,13 +591,13 @@ try_to_unmap_inode(struct page *page)
return ret;
list_for_each_entry(vma, &mapping->i_mmap, shared) {
- ret = try_to_unmap_one(vma, page);
+ ret = try_to_unmap_one(vma, page, &young);
if (ret == SWAP_FAIL || !page->mapcount)
goto out;
}
list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
- ret = try_to_unmap_one(vma, page);
+ ret = try_to_unmap_one(vma, page, &young);
if (ret == SWAP_FAIL || !page->mapcount)
goto out;
}
@@ -588,7 +610,7 @@ out:
static int
try_to_unmap_anon(struct page * page)
{
- int ret = SWAP_AGAIN;
+ int ret = SWAP_AGAIN, young = 0;
struct vm_area_struct * vma;
anon_vma_t * anon_vma = (anon_vma_t *) page->mapping;
@@ -598,7 +620,7 @@ try_to_unmap_anon(struct page * page)
spin_lock(&anon_vma->anon_vma_lock);
BUG_ON(list_empty(&anon_vma->anon_vma_head));
list_for_each_entry(vma, &anon_vma->anon_vma_head, anon_vma_node) {
- ret = try_to_unmap_one(vma, page);
+ ret = try_to_unmap_one(vma, page, &young);
if (ret == SWAP_FAIL || !page->mapcount)
break;
}
end of thread, other threads:[~2004-03-24 18:41 UTC | newest]
Thread overview: 15+ messages
2004-03-22 17:05 VMA_MERGING_FIXUP and patch Hugh Dickins
2004-03-22 17:52 ` Andrea Arcangeli
2004-03-22 19:02 ` Hugh Dickins
2004-03-22 19:58 ` Andrea Arcangeli
2004-03-23 21:44 ` nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch] Andrea Arcangeli
2004-03-24 2:35 ` Andrea Arcangeli
2004-03-24 4:38 ` Andrea Arcangeli
2004-03-24 10:12 ` Hugh Dickins
2004-03-24 12:18 ` Hugh Dickins
2004-03-24 14:47 ` Andrea Arcangeli
2004-03-24 14:37 ` Andrea Arcangeli
2004-03-24 18:42 ` Andrea Arcangeli
2004-03-22 19:57 ` VMA_MERGING_FIXUP and patch Andrew Morton
2004-03-22 20:05 ` Andrea Arcangeli
2004-03-22 20:33 ` Andrew Morton