* Re: [lockup] Re: objrmap-core-1 (rmap removal for file mappings to avoid 4:4 in <=16G machines)
  [not found] <20040310080000.GA30940@dualathlon.random>
@ 2004-03-10 13:01 ` Rik van Riel
  2004-03-10 13:50 ` Andrea Arcangeli
  0 siblings, 1 reply; 74+ messages in thread
From: Rik van Riel @ 2004-03-10 13:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Andrew Morton, torvalds, linux-kernel, Rajesh Venkatasubramanian

On Wed, 10 Mar 2004, Andrea Arcangeli wrote:
> On Tue, Mar 09, 2004 at 06:56:50PM +0100, Andrea Arcangeli wrote:
> > We've got a lot of room for improvements.
>
> Rajesh has a smart idea on how to fix the complexity issue (for both
> truncate and vm) and it involves a new non-trivial data structure.
>
> I trust he will make it, but if there will be any trouble with his
> approach for safety I'm currently planning on a simpler fallback solution
> that I can manage without having to design a new tree data structure.
>
> Sharing his "tree and sorting" idea, the fallback I propose is to simply
> index the vmas in an rbtree too.

That simply results in looking up fewer VMAs for low file
indexes, but still needing to check all of them for high
file indexes.

You really want to sort on both the start and end offset
of the VMA, as can be done with a kd-tree or kdb-tree.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: [lockup] Re: objrmap-core-1 (rmap removal for file mappings to avoid 4:4 in <=16G machines)
  2004-03-10 13:01 ` [lockup] Re: objrmap-core-1 (rmap removal for file mappings to avoid 4:4 in <=16G machines) Rik van Riel
@ 2004-03-10 13:50 ` Andrea Arcangeli
  2004-03-12 17:05 ` anon_vma RFC2 Rajesh Venkatasubramanian
  0 siblings, 1 reply; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-10 13:50 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, Andrew Morton, torvalds, linux-kernel, Rajesh Venkatasubramanian

On Wed, Mar 10, 2004 at 08:01:15AM -0500, Rik van Riel wrote:
> On Wed, 10 Mar 2004, Andrea Arcangeli wrote:
> > On Tue, Mar 09, 2004 at 06:56:50PM +0100, Andrea Arcangeli wrote:
> > > We've got a lot of room for improvements.
> >
> > Rajesh has a smart idea on how to fix the complexity issue (for both
> > truncate and vm) and it involves a new non-trivial data structure.
> >
> > I trust he will make it, but if there will be any trouble with his
> > approach for safety I'm currently planning on a simpler fallback solution
> > that I can manage without having to design a new tree data structure.
> >
> > Sharing his "tree and sorting" idea, the fallback I propose is to simply
> > index the vmas in an rbtree too.
>
> That simply results in looking up fewer VMAs for low file
> indexes, but still needing to check all of them for high
> file indexes.
>
> You really want to sort on both the start and end offset
> of the VMA, as can be done with a kd-tree or kdb-tree.

Yes. But the only reason for me to even consider using the rbtree was
to avoid having to introduce another data structure and to feel very
safe in terms of risks of memory corruption in the short term ;). The
rbtree is extremely well exercised; that's the only reason I
suggested it.
Rajesh is currently working on another data structure that is efficient
at finding a "range" (not sure if it is what you're suggesting; he
called it a prio_tree, a mix between hashes and radix trees). That's
optimal, though in practice the rbtree would work too (perhaps one
could still work out an exploit ;), but the real-life apps would
definitely be covered by the rbtree too (since all vmas are of the same
size and they're all naturally aligned).

^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-10 13:50 ` Andrea Arcangeli
@ 2004-03-12 17:05 ` Rajesh Venkatasubramanian
  2004-03-12 17:26 ` Andrea Arcangeli
  0 siblings, 1 reply; 74+ messages in thread
From: Rajesh Venkatasubramanian @ 2004-03-12 17:05 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

>> have a devastating effect on vma usage, yes) issue of vma merging, but
>> what about the (mandatory) vma splitting? ...[snip]

> you're right about vma_split, the way I implemented it is wrong,
> basically the as.vma/PageDirect idea is falling apart with vma_split.

Why do you have to fix up all page structs' PageDirect and as.vma
fields when a vma_split or vma_merge occurs? Can't you do it lazily on
the next page_referenced or page_add_rmap, etc.? Anyway, we can get to
the anon_vma using as.vma->anon_vma.

I understand that currently your code assumes that if PageDirect is
set, then there cannot be an anon_vma corresponding to the page.

Rajesh

^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-12 17:05 ` anon_vma RFC2 Rajesh Venkatasubramanian
@ 2004-03-12 17:26 ` Andrea Arcangeli
  2004-03-12 21:16 ` Rajesh Venkatasubramanian
  0 siblings, 1 reply; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-12 17:26 UTC (permalink / raw)
  To: Rajesh Venkatasubramanian; +Cc: linux-kernel

On Fri, Mar 12, 2004 at 12:05:27PM -0500, Rajesh Venkatasubramanian wrote:
> >> have a devastating effect on vma usage, yes) issue of vma merging, but
> >> what about the (mandatory) vma splitting? ...[snip]
>
> > you're right about vma_split, the way I implemented it is wrong,
> > basically the as.vma/PageDirect idea is falling apart with vma_split.
>
> Why do you have to fix up all page structs' PageDirect and as.vma
> fields when a vma_split or vma_merge occurs?
>
> Can't you do it lazily on the next page_referenced or page_add_rmap,

I cannot do it lazily, unfortunately, because the paging routine will
start from the page, so if the page is not uptodate it will go read
into nirvana.

> etc. Anyway we can get to the anon_vma using as.vma->anon_vma.
>
> I understand that currently your code assumes that if PageDirect is
> set, then there cannot be an anon_vma corresponding to the page.

correct, though I will have to change that for the above problem ;(

Well, another way is to just do the pagetable walk and fix up the
page->as.vma to be a page->as.anon_vma during split/merge (actually
merge is already taken care of by forbidding merging in the interesting
cases; what I missed was the split, oh well ;). But preallocating the
anon_vma is such a small cost that it should be a lot better than
slowing down the split.

^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-12 17:26 ` Andrea Arcangeli
@ 2004-03-12 21:16 ` Rajesh Venkatasubramanian
  2004-03-13 17:55 ` Rajesh Venkatasubramanian
  0 siblings, 1 reply; 74+ messages in thread
From: Rajesh Venkatasubramanian @ 2004-03-12 21:16 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, torvalds

>> I think your approach could work (reverse map by having separate address
>> spaces for unrelated processes), but I don't see any good "page->index"
>> allocation scheme that is implementable.
>> Or did I totally mis-understand what you were proposing?

> You're absolutely right. I am still trying to come up with
> a way to do this.
> [snip]
> I just can't think of any now ...

At least one solution exists. It may be just an academic solution,
though. Add a new prio_tree root "remap_address" to the anonmm
address_space structure.

struct anon_remap_address {
        unsigned long old_page_index_start;
        unsigned long old_page_index_end;
        unsigned long new_page_index;
        struct prio_tree_node prio_tree_node;
}

For each mremap that expands the area and moves the page tables,
allocate a new anon_remap_address struct and add it to the
remap_address tree. The page->index does not change ever.

Take the page->index and walk the remap_address tree to find all
remapped addresses. Once a list of all remapped addresses is found,
it's easy to find the interesting vmas (again using a different
prio_tree). Finding all remapped addresses may involve recursion,
which is bad.

Rajesh

^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 21:16 ` Rajesh Venkatasubramanian @ 2004-03-13 17:55 ` Rajesh Venkatasubramanian 2004-03-13 18:16 ` Andrea Arcangeli 0 siblings, 1 reply; 74+ messages in thread From: Rajesh Venkatasubramanian @ 2004-03-13 17:55 UTC (permalink / raw) To: riel; +Cc: linux-kernel, torvalds, andrea > The only problem is mremap() after a fork(), and hell, we know that's a > special case anyway, and let's just add a few lines to copy_one_pte(), > which basically does: > > if (PageAnonymous(page) && page->count > 1) { > newpage = alloc_page(); > copy_page(page, newpage); > page = newpage; > } > /* Move the page to the new address */ > page->index = address >> PAGE_SHIFT; > > and now we have zero special cases. This part makes the problem so simple. If this is acceptable, then we have many choices. Since we won't have many mms in the anonmm list, I don't think we will have any search complexity problems. If we really worry again about search complexity, we can consider using prio_tree (adds 16 bytes per vma - we cannot share vma.shared.prio_tree_node). The prio_tree easily fits for anonmm after linus-mremap-simplification. Rajesh ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-13 17:55 ` Rajesh Venkatasubramanian
@ 2004-03-13 18:16 ` Andrea Arcangeli
  2004-03-13 19:40 ` Rajesh Venkatasubramanian
  0 siblings, 1 reply; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-13 18:16 UTC (permalink / raw)
  To: Rajesh Venkatasubramanian; +Cc: riel, linux-kernel, torvalds

On Sat, Mar 13, 2004 at 12:55:09PM -0500, Rajesh Venkatasubramanian wrote:
>
> > The only problem is mremap() after a fork(), and hell, we know that's a
> > special case anyway, and let's just add a few lines to copy_one_pte(),
> > which basically does:
> >
> > 	if (PageAnonymous(page) && page->count > 1) {
> > 		newpage = alloc_page();
> > 		copy_page(page, newpage);
> > 		page = newpage;
> > 	}
> > 	/* Move the page to the new address */
> > 	page->index = address >> PAGE_SHIFT;
> >
> > and now we have zero special cases.
>
> This part makes the problem so simple. If this is acceptable, then we
> have many choices. Since we won't have many mms in the anonmm list,
> I don't think we will have any search complexity problems. If we really
> worry again about search complexity, we can consider using prio_tree
> (adds 16 bytes per vma - we cannot share vma.shared.prio_tree_node).
> The prio_tree easily fits for anonmm after linus-mremap-simplification.

prio_tree with linus-mremap-simplification makes no sense to me. You
cannot avoid checking all the mms with the prio_tree, and that is the
only complexity issue introduced by anonmm vs anon_vma.

prio_tree can only sit on top of anon_vma, not on top of
anonmm+linus-unshare-mremap (and yes, I cannot share
vma.shared.prio_tree_node), but practically it's not needed for the
anon_vmas.

^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-13 18:16 ` Andrea Arcangeli
@ 2004-03-13 19:40 ` Rajesh Venkatasubramanian
  2004-03-14 0:23 ` Andrea Arcangeli
  0 siblings, 1 reply; 74+ messages in thread
From: Rajesh Venkatasubramanian @ 2004-03-13 19:40 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: riel, linux-kernel, torvalds

> prio_tree can only sit on top of anon_vma, not on top of
> anonmm+linus-unshare-mremap (and yes, I cannot share
> vma.shared.prio_tree_node) but practically it's not needed for the
> anon_vmas.

Agreed. prio_tree is only useful for anon_vma. But, after
linus-unshare-mremap, the anon_vma patch can be modified (simplified?)
a lot. You don't need any as.anon_vma, as.vma pointers in the page
struct. You just need the already existing page->mapping and
page->index, and a prio_tree of all anon vmas. The prio_tree can be
used to get to the "interesting vmas" without walking all mms.

However, the new prio_tree node adds 16 bytes per vma. Considering
there may not be much sharing of anon vmas in the common case, I am not
sure whether that is worthwhile. Maybe we can wait for someone to write
a program that locks the machine :)

Rajesh

^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-13 19:40 ` Rajesh Venkatasubramanian
@ 2004-03-14 0:23 ` Andrea Arcangeli
  2004-03-14 0:52 ` Linus Torvalds
  0 siblings, 1 reply; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-14 0:23 UTC (permalink / raw)
  To: Rajesh Venkatasubramanian; +Cc: riel, linux-kernel, torvalds

On Sat, Mar 13, 2004 at 02:40:09PM -0500, Rajesh Venkatasubramanian wrote:
> Agreed. prio_tree is only useful for anon_vma. But, after
> linus-unshare-mremap, the anon_vma patch can be modified
> (simplified ?) a lot. You don't need any as.anon_vma, as.vma
> pointers in the page struct. You just need the already existing
> page->mapping and page->index, and a prio_tree of all anon vmas.

What you are missing is that we don't need a prio_tree at all with
anonmm+linus-unshare-mremap; a prio_tree can make sense only with
anon_vma, not with anonmm. The vm_pgoff is meaningless with anonmm.
find_vma (and the rbtree) already does the trick with anonmm.

linus-unshare-mremap guarantees that a certain physical page will be
only at a certain virtual address in every mm, so a prio_tree taking
pgoff into account isn't needed there; find_vma is more than enough.

No prio_tree can fix anyway the problem that anonmm will force the vm
to scan all mms at the page->index address, even for a newly allocated
malloc region. That is optimized away by anon_vma, and anon_vma also
avoids the early-COW in mremap. The relevant downside of anon_vma is
that it takes a few more bytes in the vma to provide those features.

^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-14 0:23 ` Andrea Arcangeli @ 2004-03-14 0:52 ` Linus Torvalds 2004-03-14 1:01 ` William Lee Irwin III 0 siblings, 1 reply; 74+ messages in thread From: Linus Torvalds @ 2004-03-14 0:52 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Rajesh Venkatasubramanian, riel, linux-kernel On Sun, 14 Mar 2004, Andrea Arcangeli wrote: > > linus-unshare-mremap guarantees that a certain physical page will be > only at a certain virtual address in every mm, so prio_tree taking pgoff > into account isn't needed there, find_vma is more than enough. Yes. However, I'd at least personally hope that we don't even need the find_vma() all the time. When removing a page using the reverse mapping, there really is very little reason to even look up the vma, although right now the "flush_tlb_page()" interface is done for vma only so we'd need to change that or at least add a "flush_tlb_page_mm(mm, virt)" flusher (and if any architecture wants to look up the vma, they could do so). It would be silly to look up the vma if we don't actually need it, and I don't think we do. It's likely faster to just look up the page tables directly than to even worry about anything else. But find_vma() certainly would be sufficient. Linus ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-14 0:52 ` Linus Torvalds @ 2004-03-14 1:01 ` William Lee Irwin III 2004-03-14 1:07 ` Rik van Riel 2004-03-14 1:15 ` Linus Torvalds 0 siblings, 2 replies; 74+ messages in thread From: William Lee Irwin III @ 2004-03-14 1:01 UTC (permalink / raw) To: Linus Torvalds Cc: Andrea Arcangeli, Rajesh Venkatasubramanian, riel, linux-kernel On Sat, Mar 13, 2004 at 04:52:00PM -0800, Linus Torvalds wrote: > Yes. However, I'd at least personally hope that we don't even need the > find_vma() all the time. > When removing a page using the reverse mapping, there really is very > little reason to even look up the vma, although right now the > "flush_tlb_page()" interface is done for vma only so we'd need to change > that or at least add a "flush_tlb_page_mm(mm, virt)" flusher (and if any > architecture wants to look up the vma, they could do so). > It would be silly to look up the vma if we don't actually need it, and I > don't think we do. It's likely faster to just look up the page tables > directly than to even worry about anything else. > But find_vma() certainly would be sufficient. find_vma() is often necessary to determine whether the page is mlock()'d. In schemes where mm's that may not map the page appear in searches, it may also be necessary to determine if there's even a vma covering the area at all or otherwise a normal vma, since pagetables outside normal vmas may very well not be understood by the core (e.g. hugetlb). -- wli ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-14 1:01 ` William Lee Irwin III @ 2004-03-14 1:07 ` Rik van Riel 2004-03-14 1:19 ` William Lee Irwin III 2004-03-14 1:15 ` Linus Torvalds 1 sibling, 1 reply; 74+ messages in thread From: Rik van Riel @ 2004-03-14 1:07 UTC (permalink / raw) To: William Lee Irwin III Cc: Linus Torvalds, Andrea Arcangeli, Rajesh Venkatasubramanian, linux-kernel On Sat, 13 Mar 2004, William Lee Irwin III wrote: > On Sat, Mar 13, 2004 at 04:52:00PM -0800, Linus Torvalds wrote: > > Yes. However, I'd at least personally hope that we don't even need the > > find_vma() all the time. > > find_vma() is often necessary to determine whether the page is mlock()'d. Alternatively, the mlock()d pages shouldn't appear on the LRU at all, reusing one of the variables inside page->lru as a counter to keep track of exactly how many times this page is mlock()d. > In schemes where mm's that may not map the page appear in searches, > it may also be necessary to determine if there's even a vma covering the > area at all or otherwise a normal vma, since pagetables outside normal > vmas may very well not be understood by the core (e.g. hugetlb). If the page is a normal page on the LRU, I suspect we don't need to find the VMA, with the exception of mlock()d pages... Good thing Christoph was already looking at the mlock()d page counter idea. -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-14 1:07 ` Rik van Riel @ 2004-03-14 1:19 ` William Lee Irwin III 2004-03-14 1:41 ` Rik van Riel 0 siblings, 1 reply; 74+ messages in thread From: William Lee Irwin III @ 2004-03-14 1:19 UTC (permalink / raw) To: Rik van Riel Cc: Linus Torvalds, Andrea Arcangeli, Rajesh Venkatasubramanian, linux-kernel On Sat, 13 Mar 2004, William Lee Irwin III wrote: >> find_vma() is often necessary to determine whether the page is mlock()'d. On Sat, Mar 13, 2004 at 08:07:52PM -0500, Rik van Riel wrote: > Alternatively, the mlock()d pages shouldn't appear on the LRU > at all, reusing one of the variables inside page->lru as a > counter to keep track of exactly how many times this page is > mlock()d. That would be the rare case where it's not necessary. =) On Sat, 13 Mar 2004, William Lee Irwin III wrote: >> In schemes where mm's that may not map the page appear in searches, >> it may also be necessary to determine if there's even a vma covering the >> area at all or otherwise a normal vma, since pagetables outside normal >> vmas may very well not be understood by the core (e.g. hugetlb). On Sat, Mar 13, 2004 at 08:07:52PM -0500, Rik van Riel wrote: > If the page is a normal page on the LRU, I suspect we don't > need to find the VMA, with the exception of mlock()d pages... > Good thing Christoph was already looking at the mlock()d page > counter idea. That's not quite where the issue happens. Suppose you have a COW sharing group (called variously struct anonmm, struct anon, and so on by various codebases) where a page you're trying to unmap occurs at some virtual address in several of them, but others may have hugetlb vmas where that page is otherwise expected. On i386 and potentially others, the core may not understand present pmd's that are not mere pointers to ptes and other machine-dependent hugetlb constructs, so there is trouble. 
Searching the COW sharing group isn't how everything works, but in those cases where additionally you can find mm's that don't map the page at that virtual address and may have different vmas cover it, this can arise. -- wli ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-14 1:19 ` William Lee Irwin III @ 2004-03-14 1:41 ` Rik van Riel 2004-03-14 2:27 ` William Lee Irwin III 0 siblings, 1 reply; 74+ messages in thread From: Rik van Riel @ 2004-03-14 1:41 UTC (permalink / raw) To: William Lee Irwin III Cc: Linus Torvalds, Andrea Arcangeli, Rajesh Venkatasubramanian, linux-kernel On Sat, 13 Mar 2004, William Lee Irwin III wrote: > [hugetlb at same address] Well, we can find this merely by looking at the page tables themselves, so that shouldn't be a problem. > Searching the COW sharing group isn't how everything works, but in those > cases where additionally you can find mm's that don't map the page at > that virtual address and may have different vmas cover it, this can > arise. This could only happen when you truncate a file that's been mapped by various nonlinear VMAs, so truncate can't get rid of the pages... I suspect there are two ways to fix that: 1) on truncate, scan ALL the ptes inside nonlinear VMAs and remove the pages 2) don't allow truncate on a file that's mapped with nonlinear VMAs Either would work. -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-14 1:41 ` Rik van Riel @ 2004-03-14 2:27 ` William Lee Irwin III 0 siblings, 0 replies; 74+ messages in thread From: William Lee Irwin III @ 2004-03-14 2:27 UTC (permalink / raw) To: Rik van Riel Cc: Linus Torvalds, Andrea Arcangeli, Rajesh Venkatasubramanian, linux-kernel On Sat, 13 Mar 2004, William Lee Irwin III wrote: >> [hugetlb at same address] On Sat, Mar 13, 2004 at 08:41:42PM -0500, Rik van Riel wrote: > Well, we can find this merely by looking at the page tables > themselves, so that shouldn't be a problem. Pagetables of a kind the core understands may not be present there. On ia32 one could in theory have a pmd_huge() check, which would in turn not suffice for ia64 and sparc64 hugetlb. These were only examples. Other unusual forms of mappings, e.g. VM_RESERVED and VM_IO, may also be bad ideas to trip over by accident. On Sat, 13 Mar 2004, William Lee Irwin III wrote: >> Searching the COW sharing group isn't how everything works, but in those >> cases where additionally you can find mm's that don't map the page at >> that virtual address and may have different vmas cover it, this can >> arise. On Sat, Mar 13, 2004 at 08:41:42PM -0500, Rik van Riel wrote: > This could only happen when you truncate a file that's > been mapped by various nonlinear VMAs, so truncate can't > get rid of the pages... > I suspect there are two ways to fix that: > 1) on truncate, scan ALL the ptes inside nonlinear VMAs > and remove the pages > 2) don't allow truncate on a file that's mapped with > nonlinear VMAs > Either would work. I'm not sure how that came in. The issue I had in mind was strictly a matter of tripping over things one can't make sense of from pagetables alone in try_to_unmap(). COW-shared anonymous pages not unmappable via anonymous COW sharing groups arising from truncate() vs. 
remap_file_pages() interactions and failures to check for nonlinearly-mapped pages in pagetable walkers are an issue in general of course, but they just aren't this issue. -- wli ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-14 1:01 ` William Lee Irwin III 2004-03-14 1:07 ` Rik van Riel @ 2004-03-14 1:15 ` Linus Torvalds 1 sibling, 0 replies; 74+ messages in thread From: Linus Torvalds @ 2004-03-14 1:15 UTC (permalink / raw) To: William Lee Irwin III Cc: Andrea Arcangeli, Rajesh Venkatasubramanian, riel, linux-kernel On Sat, 13 Mar 2004, William Lee Irwin III wrote: > > find_vma() is often necessary to determine whether the page is mlock()'d. > In schemes where mm's that may not map the page appear in searches, it > may also be necessary to determine if there's even a vma covering the > area at all or otherwise a normal vma, since pagetables outside normal > vmas may very well not be understood by the core (e.g. hugetlb). Both excellent points. I guess we'll need the extra few cache misses. Dang. Linus ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
@ 2004-03-11 20:09 Manfred Spraul
0 siblings, 0 replies; 74+ messages in thread
From: Manfred Spraul @ 2004-03-11 20:09 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-kernel
>
>
>at the previous try, with slab debugging enabled, it was spawning tons
>of errors but I suspect it's a bug in the slab debugging, it was
>complaining with red zone memory corruption, could be due the tiny size
>of this object (only 8 bytes).
>
>andrea@xeon:~> grep anon_vma /proc/slabinfo
>anon_vma 1230 1500 12 250 1 : tunables 120 60 8 : slabdata 6 6 0
>
According to the slabinfo line, 12 bytes. The revoke_table is 12 bytes,
too, and I'm not aware of any problems with slab debugging enabled.
Could you send me the first few errors?
--
Manfred
^ permalink raw reply [flat|nested] 74+ messages in thread

* objrmap-core-1 (rmap removal for file mappings to avoid 4:4 in <=16G machines)
@ 2004-03-08 20:24 Andrea Arcangeli
2004-03-09 10:52 ` [lockup] " Ingo Molnar
0 siblings, 1 reply; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-08 20:24 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton; +Cc: linux-kernel
Hello,
This patch avoids the allocation of rmap for shared memory and it uses
the objrmap framework to find the mapping-ptes starting from a
page_t, at zero memory cost (and zero cpu cost for the fast paths).
The patch applies cleanly to linux-2.5 CVS. I suggest it for merging into
mainline.
Without this patch not even the 4:4 tlb overhead would allow intensive
shm (shmfs+IPC) workloads to survive on 32bit archs. Basically without
this fix it's like 2.6 is running w/o pte-highmem. 700 tasks with 2.7G
of shm mapped each would run the box out of zone-normal even with 4:4.
With 3:1 100 tasks would be enough. Math is easy:
2.7*1024*1024*1024/4096*8*100/1024/1024/1024
2.7*1024*1024*1024/4096*8*700/1024/1024/1024
But the real reason for this work is huge 64bit archs, where we speed up
and avoid wasting tons of ram. On 32-ways the scalability is hurt
very badly by rmap, so it has to be removed (Martin can provide the
numbers, I think).
Even with this fix removing rmap for the file mappings, the anonymous
memory will still pay for the rmap slowdown (still very relevant for
various critical apps), so I just finished designing a new method for
unmapping ptes of anonymous mappings too. It's not Hugh's anobjrmap
patch because (despite being very useful to get the right mindset) its
design was flawed since it was tracking mm not vmas and the page->index
as an absolute address not an offset, so it was breaking with mremap
(forcing him to reinstantiate rmap during mremap in the anobjrmap-5
patch), and it had several other implementation issues. But all my
further work will be against the below objrmap-core. The below patch
just fixes the most serious bottlenecks. So I recommend it for
inclusion, the rest of the work for anonymous memory and non linear
vmas, is orthogonal with this.
Credit for this patch goes entirely to Dave McCracken (the original idea
of using the i_mmap lists for the vm instead of only using them for
truncate is, as usual, from David Miller); I only fixed two bugs in his
version before submitting it to you.
I speculate that because of rmap some people have been forced to use 4:4,
generating >30% slowdowns in critical common server linux workloads even
on boxes with as little as 8G of ram.
I'm very convinced that it would be a huge mistake to force the
userbase with <=16G of ram to the 4:4 slowdown, but to avoid that we have
to drop rmap.
As part of my current anon_vma_chain vm work I'm also shrinking the
page_t to 40 bytes, and eventually it will be 32 bytes with further
patches, that combined with the usage of remap_file_pages (avoiding tons
of vmas) and the bio work not requiring flood of bh anymore (more
powerful than the 2.4 varyio), should reduce even further the needs of
normal-zone during high end workloads, allowing at least 16G boxes to
run perfectly fine with 3:1 design, like today with 2.4 we already run
huge shm workloads on 16G boxes with plenty of zone-normal margin in
production, even 32G seems to work fine (though the margin is not huge
there). With 2.6 I expect to raise the margin significantly (for
safety) in 32G boxes too with the most efficient 3:1 kernel split. Only
64G boxes will require either 2.5:1.5 or 4:4, and I think it's ok to
either use 4:4 or 2.5:1.5 there since they're less than 1% of the
userbase and with AMD64 hitting the market already I doubt the x86 64G
userbase will increase anytime soon.
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/fs/exec.c sles-objrmap/fs/exec.c
--- sles-ref/fs/exec.c 2004-02-29 17:47:21.000000000 +0100
+++ sles-objrmap/fs/exec.c 2004-03-03 06:45:38.716636864 +0100
@@ -323,6 +323,7 @@ void put_dirty_page(struct task_struct *
}
lru_cache_add_active(page);
flush_dcache_page(page);
+ SetPageAnon(page);
set_pte(pte, pte_mkdirty(pte_mkwrite(mk_pte(page, prot))));
pte_chain = page_add_rmap(page, pte, pte_chain);
pte_unmap(pte);
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/include/linux/mm.h sles-objrmap/include/linux/mm.h
--- sles-ref/include/linux/mm.h 2004-02-29 17:47:30.000000000 +0100
+++ sles-objrmap/include/linux/mm.h 2004-03-03 06:45:38.000000000 +0100
@@ -180,6 +180,7 @@ struct page {
struct pte_chain *chain;/* Reverse pte mapping pointer.
* protected by PG_chainlock */
pte_addr_t direct;
+ int mapcount;
} pte;
unsigned long private; /* mapping-private opaque data */
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/include/linux/page-flags.h sles-objrmap/include/linux/page-flags.h
--- sles-ref/include/linux/page-flags.h 2004-01-15 18:36:24.000000000 +0100
+++ sles-objrmap/include/linux/page-flags.h 2004-03-03 06:45:38.808622880 +0100
@@ -75,6 +75,7 @@
#define PG_mappedtodisk 17 /* Has blocks allocated on-disk */
#define PG_reclaim 18 /* To be reclaimed asap */
#define PG_compound 19 /* Part of a compound page */
+#define PG_anon 20 /* Anonymous page */
/*
@@ -270,6 +271,10 @@ extern void get_full_page_state(struct p
#define SetPageCompound(page) set_bit(PG_compound, &(page)->flags)
#define ClearPageCompound(page) clear_bit(PG_compound, &(page)->flags)
+#define PageAnon(page) test_bit(PG_anon, &(page)->flags)
+#define SetPageAnon(page) set_bit(PG_anon, &(page)->flags)
+#define ClearPageAnon(page) clear_bit(PG_anon, &(page)->flags)
+
/*
* The PageSwapCache predicate doesn't use a PG_flag at this time,
* but it may again do so one day.
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/include/linux/swap.h sles-objrmap/include/linux/swap.h
--- sles-ref/include/linux/swap.h 2004-02-04 16:07:05.000000000 +0100
+++ sles-objrmap/include/linux/swap.h 2004-03-03 06:45:38.830619536 +0100
@@ -185,6 +185,8 @@ struct pte_chain *FASTCALL(page_add_rmap
void FASTCALL(page_remove_rmap(struct page *, pte_t *));
int FASTCALL(try_to_unmap(struct page *));
+int page_convert_anon(struct page *);
+
/* linux/mm/shmem.c */
extern int shmem_unuse(swp_entry_t entry, struct page *page);
#else
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/mm/filemap.c sles-objrmap/mm/filemap.c
--- sles-ref/mm/filemap.c 2004-02-29 17:47:33.000000000 +0100
+++ sles-objrmap/mm/filemap.c 2004-03-03 06:45:38.915606616 +0100
@@ -73,6 +73,9 @@
* ->mmap_sem
* ->i_sem (msync)
*
+ * ->lock_page
+ * ->i_shared_sem (page_convert_anon)
+ *
* ->inode_lock
* ->sb_lock (fs/fs-writeback.c)
* ->mapping->page_lock (__sync_single_inode)
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/mm/fremap.c sles-objrmap/mm/fremap.c
--- sles-ref/mm/fremap.c 2004-02-29 17:47:26.000000000 +0100
+++ sles-objrmap/mm/fremap.c 2004-03-03 06:45:38.936603424 +0100
@@ -61,10 +61,26 @@ int install_page(struct mm_struct *mm, s
pmd_t *pmd;
pte_t pte_val;
struct pte_chain *pte_chain;
+ unsigned long pgidx;
pte_chain = pte_chain_alloc(GFP_KERNEL);
if (!pte_chain)
goto err;
+
+ /*
+ * Convert this page to anon for objrmap if it's nonlinear
+ */
+ pgidx = (addr - vma->vm_start) >> PAGE_SHIFT;
+ pgidx += vma->vm_pgoff;
+ pgidx >>= PAGE_CACHE_SHIFT - PAGE_SHIFT;
+ if (!PageAnon(page) && (page->index != pgidx)) {
+ lock_page(page);
+ err = page_convert_anon(page);
+ unlock_page(page);
+ if (err < 0)
+ goto err_free;
+ }
+
pgd = pgd_offset(mm, addr);
spin_lock(&mm->page_table_lock);
@@ -85,12 +101,11 @@ int install_page(struct mm_struct *mm, s
pte_val = *pte;
pte_unmap(pte);
update_mmu_cache(vma, addr, pte_val);
- spin_unlock(&mm->page_table_lock);
- pte_chain_free(pte_chain);
- return 0;
+ err = 0;
err_unlock:
spin_unlock(&mm->page_table_lock);
+err_free:
pte_chain_free(pte_chain);
err:
return err;
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/mm/memory.c sles-objrmap/mm/memory.c
--- sles-ref/mm/memory.c 2004-02-29 17:47:33.000000000 +0100
+++ sles-objrmap/mm/memory.c 2004-03-03 06:45:38.965599016 +0100
@@ -1071,6 +1071,7 @@ static int do_wp_page(struct mm_struct *
++mm->rss;
page_remove_rmap(old_page, page_table);
break_cow(vma, new_page, address, page_table);
+ SetPageAnon(new_page);
pte_chain = page_add_rmap(new_page, page_table, pte_chain);
lru_cache_add_active(new_page);
@@ -1310,6 +1311,7 @@ static int do_swap_page(struct mm_struct
flush_icache_page(vma, page);
set_pte(page_table, pte);
+ SetPageAnon(page);
pte_chain = page_add_rmap(page, page_table, pte_chain);
/* No need to invalidate - it was non-present before */
@@ -1377,6 +1379,7 @@ do_anonymous_page(struct mm_struct *mm,
vma);
lru_cache_add_active(page);
mark_page_accessed(page);
+ SetPageAnon(page);
}
set_pte(page_table, entry);
@@ -1444,6 +1447,10 @@ retry:
if (!pte_chain)
goto oom;
+ /* See if nopage returned an anon page */
+ if (!new_page->mapping || PageSwapCache(new_page))
+ SetPageAnon(new_page);
+
/*
* Should we do an early C-O-W break?
*/
@@ -1454,6 +1461,7 @@ retry:
copy_user_highpage(page, new_page, address);
page_cache_release(new_page);
lru_cache_add_active(page);
+ SetPageAnon(page);
new_page = page;
}
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/mm/mmap.c sles-objrmap/mm/mmap.c
--- sles-ref/mm/mmap.c 2004-02-29 17:47:30.000000000 +0100
+++ sles-objrmap/mm/mmap.c 2004-03-03 06:53:46.000000000 +0100
@@ -267,9 +267,7 @@ static void vma_link(struct mm_struct *m
if (mapping)
down(&mapping->i_shared_sem);
- spin_lock(&mm->page_table_lock);
__vma_link(mm, vma, prev, rb_link, rb_parent);
- spin_unlock(&mm->page_table_lock);
if (mapping)
up(&mapping->i_shared_sem);
@@ -318,6 +316,22 @@ static inline int is_mergeable_vma(struc
return 1;
}
+/* requires that the relevant i_shared_sem be held by the caller */
+static void move_vma_start(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct inode *inode = NULL;
+
+ if (vma->vm_file)
+ inode = vma->vm_file->f_dentry->d_inode;
+ if (inode)
+ __remove_shared_vm_struct(vma, inode);
+ /* If no vm_file, perhaps we should always keep vm_pgoff at 0?? */
+ vma->vm_pgoff += (long)(addr - vma->vm_start) >> PAGE_SHIFT;
+ vma->vm_start = addr;
+ if (inode)
+ __vma_link_file(vma);
+}
+
/*
* Return true if we can merge this (vm_flags,file,vm_pgoff,size)
* in front of (at a lower virtual address and file offset than) the vma.
@@ -370,7 +384,6 @@ static int vma_merge(struct mm_struct *m
unsigned long end, unsigned long vm_flags,
struct file *file, unsigned long pgoff)
{
- spinlock_t *lock = &mm->page_table_lock;
struct inode *inode = file ? file->f_dentry->d_inode : NULL;
struct semaphore *i_shared_sem;
@@ -402,7 +415,6 @@ static int vma_merge(struct mm_struct *m
down(i_shared_sem);
need_up = 1;
}
- spin_lock(lock);
prev->vm_end = end;
/*
@@ -415,7 +427,6 @@ static int vma_merge(struct mm_struct *m
prev->vm_end = next->vm_end;
__vma_unlink(mm, next, prev);
__remove_shared_vm_struct(next, inode);
- spin_unlock(lock);
if (need_up)
up(i_shared_sem);
if (file)
@@ -425,7 +436,6 @@ static int vma_merge(struct mm_struct *m
kmem_cache_free(vm_area_cachep, next);
return 1;
}
- spin_unlock(lock);
if (need_up)
up(i_shared_sem);
return 1;
@@ -443,10 +453,7 @@ static int vma_merge(struct mm_struct *m
if (end == prev->vm_start) {
if (file)
down(i_shared_sem);
- spin_lock(lock);
- prev->vm_start = addr;
- prev->vm_pgoff -= (end - addr) >> PAGE_SHIFT;
- spin_unlock(lock);
+ move_vma_start(prev, addr);
if (file)
up(i_shared_sem);
return 1;
@@ -905,19 +912,16 @@ int expand_stack(struct vm_area_struct *
*/
address += 4 + PAGE_SIZE - 1;
address &= PAGE_MASK;
- spin_lock(&vma->vm_mm->page_table_lock);
grow = (address - vma->vm_end) >> PAGE_SHIFT;
/* Overcommit.. */
if (security_vm_enough_memory(grow)) {
- spin_unlock(&vma->vm_mm->page_table_lock);
return -ENOMEM;
}
if (address - vma->vm_start > current->rlim[RLIMIT_STACK].rlim_cur ||
((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) >
current->rlim[RLIMIT_AS].rlim_cur) {
- spin_unlock(&vma->vm_mm->page_table_lock);
vm_unacct_memory(grow);
return -ENOMEM;
}
@@ -925,7 +929,6 @@ int expand_stack(struct vm_area_struct *
vma->vm_mm->total_vm += grow;
if (vma->vm_flags & VM_LOCKED)
vma->vm_mm->locked_vm += grow;
- spin_unlock(&vma->vm_mm->page_table_lock);
return 0;
}
@@ -959,19 +962,16 @@ int expand_stack(struct vm_area_struct *
* the spinlock only before relocating the vma range ourself.
*/
address &= PAGE_MASK;
- spin_lock(&vma->vm_mm->page_table_lock);
grow = (vma->vm_start - address) >> PAGE_SHIFT;
/* Overcommit.. */
if (security_vm_enough_memory(grow)) {
- spin_unlock(&vma->vm_mm->page_table_lock);
return -ENOMEM;
}
if (vma->vm_end - address > current->rlim[RLIMIT_STACK].rlim_cur ||
((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) >
current->rlim[RLIMIT_AS].rlim_cur) {
- spin_unlock(&vma->vm_mm->page_table_lock);
vm_unacct_memory(grow);
return -ENOMEM;
}
@@ -980,7 +980,6 @@ int expand_stack(struct vm_area_struct *
vma->vm_mm->total_vm += grow;
if (vma->vm_flags & VM_LOCKED)
vma->vm_mm->locked_vm += grow;
- spin_unlock(&vma->vm_mm->page_table_lock);
return 0;
}
@@ -1147,8 +1146,6 @@ static void unmap_region(struct mm_struc
/*
* Create a list of vma's touched by the unmap, removing them from the mm's
* vma list as we go..
- *
- * Called with the page_table_lock held.
*/
static void
detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -1211,10 +1208,9 @@ int split_vma(struct mm_struct * mm, str
down(&mapping->i_shared_sem);
spin_lock(&mm->page_table_lock);
- if (new_below) {
- vma->vm_start = addr;
- vma->vm_pgoff += ((addr - new->vm_start) >> PAGE_SHIFT);
- } else
+ if (new_below)
+ move_vma_start(vma, addr);
+ else
vma->vm_end = addr;
__insert_vm_struct(mm, new);
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/mm/page_alloc.c sles-objrmap/mm/page_alloc.c
--- sles-ref/mm/page_alloc.c 2004-02-29 17:47:36.000000000 +0100
+++ sles-objrmap/mm/page_alloc.c 2004-03-03 06:45:38.992594912 +0100
@@ -230,6 +230,8 @@ static inline void free_pages_check(cons
bad_page(function, page);
if (PageDirty(page))
ClearPageDirty(page);
+ if (PageAnon(page))
+ ClearPageAnon(page);
}
/*
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/mm/rmap.c sles-objrmap/mm/rmap.c
--- sles-ref/mm/rmap.c 2004-02-29 17:47:33.000000000 +0100
+++ sles-objrmap/mm/rmap.c 2004-03-03 07:01:39.200621104 +0100
@@ -102,6 +102,136 @@ pte_chain_encode(struct pte_chain *pte_c
**/
/**
+ * find_pte - Find a pte pointer given a vma and a struct page.
+ * @vma: the vma to search
+ * @page: the page to find
+ *
+ * Determine if this page is mapped in this vma. If it is, map and return
+ * the pte pointer associated with it. Return NULL if the page is not
+ * mapped in this vma for any reason.
+ *
+ * This is strictly an internal helper function for the object-based rmap
+ * functions.
+ *
+ * It is the caller's responsibility to unmap the pte if it is returned.
+ */
+static inline pte_t *
+find_pte(struct vm_area_struct *vma, struct page *page, unsigned long *addr)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pgd_t *pgd;
+ pmd_t *pmd;
+ pte_t *pte;
+ unsigned long loffset;
+ unsigned long address;
+
+ loffset = (page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT));
+ address = vma->vm_start + ((loffset - vma->vm_pgoff) << PAGE_SHIFT);
+ if (address < vma->vm_start || address >= vma->vm_end)
+ goto out;
+
+ pgd = pgd_offset(mm, address);
+ if (!pgd_present(*pgd))
+ goto out;
+
+ pmd = pmd_offset(pgd, address);
+ if (!pmd_present(*pmd))
+ goto out;
+
+ pte = pte_offset_map(pmd, address);
+ if (!pte_present(*pte))
+ goto out_unmap;
+
+ if (page_to_pfn(page) != pte_pfn(*pte))
+ goto out_unmap;
+
+ if (addr)
+ *addr = address;
+
+ return pte;
+
+out_unmap:
+ pte_unmap(pte);
+out:
+ return NULL;
+}
+
+/**
+ * page_referenced_obj_one - referenced check for object-based rmap
+ * @vma: the vma to look in.
+ * @page: the page we're working on.
+ *
+ * Find a pte entry for a page/vma pair, then check and clear the referenced
+ * bit.
+ *
+ * This is strictly a helper function for page_referenced_obj.
+ */
+static int
+page_referenced_obj_one(struct vm_area_struct *vma, struct page *page)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pte_t *pte;
+ int referenced = 0;
+
+ if (!spin_trylock(&mm->page_table_lock))
+ return 1;
+
+ pte = find_pte(vma, page, NULL);
+ if (pte) {
+ if (ptep_test_and_clear_young(pte))
+ referenced++;
+ pte_unmap(pte);
+ }
+
+ spin_unlock(&mm->page_table_lock);
+ return referenced;
+}
+
+/**
+ * page_referenced_obj - referenced check for object-based rmap
+ * @page: the page we're checking references on.
+ *
+ * For an object-based mapped page, find all the places it is mapped and
+ * check/clear the referenced flag. This is done by following the page->mapping
+ * pointer, then walking the chain of vmas it holds. It returns the number
+ * of references it found.
+ *
+ * This function is only called from page_referenced for object-based pages.
+ *
+ * The semaphore address_space->i_shared_sem is tried. If it can't be gotten,
+ * assume a reference count of 1.
+ */
+static int
+page_referenced_obj(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ struct vm_area_struct *vma;
+ int referenced = 0;
+
+ if (!page->pte.mapcount)
+ return 0;
+
+ if (!mapping)
+ BUG();
+
+ if (PageSwapCache(page))
+ BUG();
+
+ if (down_trylock(&mapping->i_shared_sem))
+ return 1;
+
+ list_for_each_entry(vma, &mapping->i_mmap, shared)
+ referenced += page_referenced_obj_one(vma, page);
+
+ list_for_each_entry(vma, &mapping->i_mmap_shared, shared)
+ referenced += page_referenced_obj_one(vma, page);
+
+ up(&mapping->i_shared_sem);
+
+ return referenced;
+}
+
+/**
* page_referenced - test if the page was referenced
* @page: the page to test
*
@@ -123,6 +253,10 @@ int fastcall page_referenced(struct page
if (TestClearPageReferenced(page))
referenced++;
+ if (!PageAnon(page)) {
+ referenced += page_referenced_obj(page);
+ goto out;
+ }
if (PageDirect(page)) {
pte_t *pte = rmap_ptep_map(page->pte.direct);
if (ptep_test_and_clear_young(pte))
@@ -154,6 +288,7 @@ int fastcall page_referenced(struct page
__pte_chain_free(pc);
}
}
+out:
return referenced;
}
@@ -176,6 +311,21 @@ page_add_rmap(struct page *page, pte_t *
pte_chain_lock(page);
+ /*
+ * If this is an object-based page, just count it. We can
+ * find the mappings by walking the object vma chain for that object.
+ */
+ if (!PageAnon(page)) {
+ if (!page->mapping)
+ BUG();
+ if (PageSwapCache(page))
+ BUG();
+ if (!page->pte.mapcount)
+ inc_page_state(nr_mapped);
+ page->pte.mapcount++;
+ goto out;
+ }
+
if (page->pte.direct == 0) {
page->pte.direct = pte_paddr;
SetPageDirect(page);
@@ -232,8 +382,25 @@ void fastcall page_remove_rmap(struct pa
pte_chain_lock(page);
if (!page_mapped(page))
- goto out_unlock; /* remap_page_range() from a driver? */
+ goto out_unlock;
+ /*
+ * If this is an object-based page, just uncount it. We can
+ * find the mappings by walking the object vma chain for that object.
+ */
+ if (!PageAnon(page)) {
+ if (!page->mapping)
+ BUG();
+ if (PageSwapCache(page))
+ BUG();
+ if (!page->pte.mapcount)
+ BUG();
+ page->pte.mapcount--;
+ if (!page->pte.mapcount)
+ dec_page_state(nr_mapped);
+ goto out_unlock;
+ }
+
if (PageDirect(page)) {
if (page->pte.direct == pte_paddr) {
page->pte.direct = 0;
@@ -280,6 +447,102 @@ out_unlock:
}
/**
+ * try_to_unmap_obj_one - unmap a page in one vma, object-based rmap method
+ * @page: the page to unmap
+ *
+ * Determine whether a page is mapped in a given vma and unmap it if it's found.
+ *
+ * This function is strictly a helper function for try_to_unmap_obj.
+ */
+static inline int
+try_to_unmap_obj_one(struct vm_area_struct *vma, struct page *page)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long address;
+ pte_t *pte;
+ pte_t pteval;
+ int ret = SWAP_AGAIN;
+
+ if (!spin_trylock(&mm->page_table_lock))
+ return ret;
+
+ pte = find_pte(vma, page, &address);
+ if (!pte)
+ goto out;
+
+ if (vma->vm_flags & (VM_LOCKED|VM_RESERVED)) {
+ ret = SWAP_FAIL;
+ goto out_unmap;
+ }
+
+ flush_cache_page(vma, address);
+ pteval = ptep_get_and_clear(pte);
+ flush_tlb_page(vma, address);
+
+ if (pte_dirty(pteval))
+ set_page_dirty(page);
+
+ if (!page->pte.mapcount)
+ BUG();
+
+ mm->rss--;
+ page->pte.mapcount--;
+ page_cache_release(page);
+
+out_unmap:
+ pte_unmap(pte);
+
+out:
+ spin_unlock(&mm->page_table_lock);
+ return ret;
+}
+
+/**
+ * try_to_unmap_obj - unmap a page using the object-based rmap method
+ * @page: the page to unmap
+ *
+ * Find all the mappings of a page using the mapping pointer and the vma chains
+ * contained in the address_space struct it points to.
+ *
+ * This function is only called from try_to_unmap for object-based pages.
+ *
+ * The semaphore address_space->i_shared_sem is tried. If it can't be gotten,
+ * return a temporary error.
+ */
+static int
+try_to_unmap_obj(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ struct vm_area_struct *vma;
+ int ret = SWAP_AGAIN;
+
+ if (!mapping)
+ BUG();
+
+ if (PageSwapCache(page))
+ BUG();
+
+ if (down_trylock(&mapping->i_shared_sem))
+ return ret;
+
+ list_for_each_entry(vma, &mapping->i_mmap, shared) {
+ ret = try_to_unmap_obj_one(vma, page);
+ if (ret == SWAP_FAIL || !page->pte.mapcount)
+ goto out;
+ }
+
+ list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
+ ret = try_to_unmap_obj_one(vma, page);
+ if (ret == SWAP_FAIL || !page->pte.mapcount)
+ goto out;
+ }
+
+out:
+ up(&mapping->i_shared_sem);
+ return ret;
+}
+
+/**
* try_to_unmap_one - worker function for try_to_unmap
* @page: page to unmap
* @ptep: page table entry to unmap from page
@@ -397,6 +660,15 @@ int fastcall try_to_unmap(struct page *
if (!page->mapping)
BUG();
+ /*
+ * If it's an object-based page, use the object vma chain to find all
+ * the mappings.
+ */
+ if (!PageAnon(page)) {
+ ret = try_to_unmap_obj(page);
+ goto out;
+ }
+
if (PageDirect(page)) {
ret = try_to_unmap_one(page, page->pte.direct);
if (ret == SWAP_SUCCESS) {
@@ -453,12 +725,115 @@ int fastcall try_to_unmap(struct page *
}
}
out:
- if (!page_mapped(page))
+ if (!page_mapped(page)) {
dec_page_state(nr_mapped);
+ ret = SWAP_SUCCESS;
+ }
return ret;
}
/**
+ * page_convert_anon - Convert an object-based mapped page to pte_chain-based.
+ * @page: the page to convert
+ *
+ * Find all the mappings for an object-based page and convert them
+ * to 'anonymous', ie create a pte_chain and store all the pte pointers there.
+ *
+ * This function takes the address_space->i_shared_sem, sets the PageAnon flag,
+ * then sets the mm->page_table_lock for each vma and calls page_add_rmap. This
+ * means there is a period when PageAnon is set, but still has some mappings
+ * with no pte_chain entry. This is in fact safe, since page_remove_rmap will
+ * simply not find it. try_to_unmap might erroneously return success, but it
+ * will never be called because the page_convert_anon() caller has locked the
+ * page.
+ *
+ * page_referenced() may fail to scan all the appropriate pte's and may return
+ * an inaccurate result. This is so rare that it does not matter.
+ */
+int page_convert_anon(struct page *page)
+{
+ struct address_space *mapping;
+ struct vm_area_struct *vma;
+ struct pte_chain *pte_chain = NULL;
+ pte_t *pte;
+ int err = 0;
+
+ mapping = page->mapping;
+ if (mapping == NULL)
+ goto out; /* truncate won the lock_page() race */
+
+ down(&mapping->i_shared_sem);
+ pte_chain_lock(page);
+
+ /*
+ * Has someone else done it for us before we got the lock?
+ * If so, pte.direct or pte.chain has replaced pte.mapcount.
+ */
+ if (PageAnon(page)) {
+ pte_chain_unlock(page);
+ goto out_unlock;
+ }
+
+ SetPageAnon(page);
+ if (page->pte.mapcount == 0) {
+ pte_chain_unlock(page);
+ goto out_unlock;
+ }
+ /* This is gonna get incremented by page_add_rmap */
+ dec_page_state(nr_mapped);
+ page->pte.mapcount = 0;
+
+ /*
+ * Now that the page is marked as anon, unlock it. page_add_rmap will
+ * lock it as necessary.
+ */
+ pte_chain_unlock(page);
+
+ list_for_each_entry(vma, &mapping->i_mmap, shared) {
+ if (!pte_chain) {
+ pte_chain = pte_chain_alloc(GFP_KERNEL);
+ if (!pte_chain) {
+ err = -ENOMEM;
+ goto out_unlock;
+ }
+ }
+ spin_lock(&vma->vm_mm->page_table_lock);
+ pte = find_pte(vma, page, NULL);
+ if (pte) {
+ /* Make sure this isn't a duplicate */
+ page_remove_rmap(page, pte);
+ pte_chain = page_add_rmap(page, pte, pte_chain);
+ pte_unmap(pte);
+ }
+ spin_unlock(&vma->vm_mm->page_table_lock);
+ }
+ list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
+ if (!pte_chain) {
+ pte_chain = pte_chain_alloc(GFP_KERNEL);
+ if (!pte_chain) {
+ err = -ENOMEM;
+ goto out_unlock;
+ }
+ }
+ spin_lock(&vma->vm_mm->page_table_lock);
+ pte = find_pte(vma, page, NULL);
+ if (pte) {
+ /* Make sure this isn't a duplicate */
+ page_remove_rmap(page, pte);
+ pte_chain = page_add_rmap(page, pte, pte_chain);
+ pte_unmap(pte);
+ }
+ spin_unlock(&vma->vm_mm->page_table_lock);
+ }
+
+out_unlock:
+ pte_chain_free(pte_chain);
+ up(&mapping->i_shared_sem);
+out:
+ return err;
+}
+
+/**
** No more VM stuff below this comment, only pte_chain helper
** functions.
**/
diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids sles-ref/mm/swapfile.c sles-objrmap/mm/swapfile.c
--- sles-ref/mm/swapfile.c 2004-02-20 17:26:54.000000000 +0100
+++ sles-objrmap/mm/swapfile.c 2004-03-03 07:03:33.128301464 +0100
@@ -390,6 +390,7 @@ unuse_pte(struct vm_area_struct *vma, un
vma->vm_mm->rss++;
get_page(page);
set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
+ SetPageAnon(page);
*pte_chainp = page_add_rmap(page, dir, *pte_chainp);
swap_free(entry);
}
^ permalink raw reply	[flat|nested] 74+ messages in thread

* [lockup] Re: objrmap-core-1 (rmap removal for file mappings to avoid 4:4 in <=16G machines)
  2004-03-08 20:24 objrmap-core-1 (rmap removal for file mappings to avoid 4:4 in <=16G machines) Andrea Arcangeli
@ 2004-03-09 10:52 ` Ingo Molnar
  2004-03-09 11:02   ` Ingo Molnar
  0 siblings, 1 reply; 74+ messages in thread
From: Ingo Molnar @ 2004-03-09 10:52 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Linus Torvalds, Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2140 bytes --]

* Andrea Arcangeli <andrea@suse.de> wrote:

> This patch avoids the allocation of rmap for shared memory and it uses
> the objrmap framework to find the mapping ptes starting from a
> page_t, which is zero memory cost (and zero cpu cost for the fast
> paths)

this patch locks up the VM. To reproduce, run the attached, very simple
test-mmap.c code (as unprivileged user) which maps 80MB worth of shared
memory in a finegrained way, creating ~19K vmas, and sleeps. Keep this
process around. Then try to create any sort of VM swap pressure. (start
a few desktop apps or generate pagecache pressure.) [the 500 MHz P3
system i tried this on has 256 MB of RAM and 300 MB of swap.]

stock 2.6.4-rc2-mm1 handles it just fine - it starts swapping and
recovers. The system is responsive and behaves just fine.

with 2.6.4-rc2-mm1 + your objrmap patch the box in essence locks up and
it's not possible to do anything. The VM is looping within the objrmap
functions. (a sample trace attached.)

Note that the test-mmap.c app does nothing that a normal user cannot
do. In fact it's not even hostile - it only has lots of vmas but is
otherwise not actively pushing the VM, it's just sleeping. (Also, the
test is a very far cry from Oracle's workload of gigabytes of shm
mapped in a finegrained way to hundreds of processes.)

All in one, currently i believe the patch is pretty unacceptable in its
present form.
	Ingo

Pid: 7, comm: kswapd0
EIP: 0060:[<c013ee6d>] CPU: 0
EIP is at page_referenced_obj+0xdd/0x120
EFLAGS: 00000246 Not tainted
EAX: cb311808 EBX: cb311820 ECX: 40a2d000 EDX: cb311848
ESI: cfe202fc EDI: cfe2033c EBP: cfdf9dc4
DS: 007b ES: 007b
CR0: 8005003b CR2: 40507000 CR3: 0b11e000 CR4: 00000290
Call Trace:
 [<c013ef71>] page_referenced+0xc1/0xd0
 [<c0137bad>] refill_inactive_zone+0x3fd/0x4c0
 [<c01376bc>] shrink_cache+0x26c/0x360
 [<c0137d11>] shrink_zone+0xa1/0xb0
 [<c01380d7>] balance_pgdat+0x1a7/0x200
 [<c013820b>] kswapd+0xdb/0xe0
 [<c01180b0>] autoremove_wake_function+0x0/0x50
 [<c01180b0>] autoremove_wake_function+0x0/0x50
 [<c0138130>] kswapd+0x0/0xe0
 [<c01050f9>] kernel_thread_helper+0x5/0xc

[-- Attachment #2: test-mmap.c --]
[-- Type: text/plain, Size: 1095 bytes --]

/*
 * Copyright (C) Ingo Molnar, 2004
 *
 * Create 80 MB worth of finegrained mappings to a shmfs file.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

/* 80 MB of mappings */
#define CACHE_PAGES	20000
#define PAGE_SIZE	4096
#define CACHE_SIZE	(CACHE_PAGES*PAGE_SIZE)

#define WINDOW_PAGES	(CACHE_PAGES*9/10)
#define WINDOW_SIZE	(WINDOW_PAGES*PAGE_SIZE)
#define WINDOW_START	0x48000000

int main(void)
{
	char *data, *ptr, filename[100];
	char empty_page[PAGE_SIZE];
	int i, fd;

	sprintf(filename, "/dev/shm/cache%d", getpid());
	fd = open(filename, O_RDWR|O_CREAT|O_TRUNC, S_IRWXU);
	unlink(filename);
	for (i = 0; i < CACHE_PAGES; i++)
		write(fd, empty_page, PAGE_SIZE);

	data = mmap(0, WINDOW_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	for (i = 0; i < WINDOW_PAGES; i++) {
		ptr = (char *) mmap(data + i*PAGE_SIZE, PAGE_SIZE,
				PROT_READ|PROT_WRITE, MAP_SHARED | MAP_FIXED,
				fd, (WINDOW_PAGES-i)*PAGE_SIZE);
		(*ptr)++;
	}
	printf("%d pages mapped - sleeping until Ctrl-C.\n", WINDOW_PAGES);
	pause();

	return 0;
}

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [lockup] Re: objrmap-core-1 (rmap removal for file mappings to avoid 4:4 in <=16G machines)
  2004-03-09 10:52 ` [lockup] " Ingo Molnar
@ 2004-03-09 11:02   ` Ingo Molnar
  2004-03-09 11:09     ` Andrew Morton
  0 siblings, 1 reply; 74+ messages in thread
From: Ingo Molnar @ 2004-03-09 11:02 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Linus Torvalds, Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 477 bytes --]

* Ingo Molnar <mingo@elte.hu> wrote:

> To reproduce, run the attached, very simple test-mmap.c code (as
> unprivileged user) which maps 80MB worth of shared memory in a
> finegrained way, creating ~19K vmas, and sleeps. Keep this process
> around.

or run the attached test-mmap2.c code, which simulates a very small DB
app using only 1800 vmas per process: it only maps 8 MB of shm and
spawns 32 processes. This has an even more lethal effect than the
previous code.

	Ingo

[-- Attachment #2: test-mmap2.c --]
[-- Type: text/plain, Size: 1160 bytes --]

/*
 * Copyright (C) Ingo Molnar, 2004
 *
 * Create 8 MB worth of finegrained mappings to a shmfs file,
 * and spawn 32 processes.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

/* 8 MB of mappings */
#define CACHE_PAGES	2000
#define PAGE_SIZE	4096
#define CACHE_SIZE	(CACHE_PAGES*PAGE_SIZE)

#define WINDOW_PAGES	(CACHE_PAGES*9/10)
#define WINDOW_SIZE	(WINDOW_PAGES*PAGE_SIZE)
#define WINDOW_START	0x48000000

int main(void)
{
	char *data, *ptr, filename[100];
	char empty_page[PAGE_SIZE];
	int i, fd;

	sprintf(filename, "/dev/shm/cache%d", getpid());
	fd = open(filename, O_RDWR|O_CREAT|O_TRUNC, S_IRWXU);
	unlink(filename);
	for (i = 0; i < CACHE_PAGES; i++)
		write(fd, empty_page, PAGE_SIZE);

	data = mmap(0, WINDOW_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	for (i = 0; i < WINDOW_PAGES; i++) {
		ptr = (char *) mmap(data + i*PAGE_SIZE, PAGE_SIZE,
				PROT_READ|PROT_WRITE, MAP_SHARED | MAP_FIXED,
				fd, (WINDOW_PAGES-i)*PAGE_SIZE);
		(*ptr)++;
	}
	printf("%d pages mapped - sleeping until Ctrl-C.\n", WINDOW_PAGES);
	fork(); fork(); fork(); fork(); fork();
	pause();

	return 0;
}

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [lockup] Re: objrmap-core-1 (rmap removal for file mappings to avoid 4:4 in <=16G machines)
  2004-03-09 11:02   ` Ingo Molnar
@ 2004-03-09 11:09     ` Andrew Morton
  2004-03-09 11:49       ` Ingo Molnar
  0 siblings, 1 reply; 74+ messages in thread
From: Andrew Morton @ 2004-03-09 11:09 UTC (permalink / raw)
To: Ingo Molnar; +Cc: andrea, torvalds, linux-kernel

Ingo Molnar <mingo@elte.hu> wrote:
>
> or run the attached test-mmap2.c code, which simulates a very small DB
> app using only 1800 vmas per process: it only maps 8 MB of shm and
> spawns 32 processes. This has an even more lethal effect than the
> previous code.

Do these tests actually make any forward progress at all, or is it some
bug which has sent the kernel into a loop?

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [lockup] Re: objrmap-core-1 (rmap removal for file mappings to avoid 4:4 in <=16G machines)
  2004-03-09 11:09     ` Andrew Morton
@ 2004-03-09 11:49       ` Ingo Molnar
  2004-03-09 16:03         ` Andrea Arcangeli
  0 siblings, 1 reply; 74+ messages in thread
From: Ingo Molnar @ 2004-03-09 11:49 UTC (permalink / raw)
To: Andrew Morton; +Cc: andrea, torvalds, linux-kernel

* Andrew Morton <akpm@osdl.org> wrote:

> > or run the attached test-mmap2.c code, which simulates a very small DB
> > app using only 1800 vmas per process: it only maps 8 MB of shm and
> > spawns 32 processes. This has an even more lethal effect than the
> > previous code.
>
> Do these tests actually make any forward progress at all, or is it some
> bug which has sent the kernel into a loop?

i think they make forward progress, so it's more of a DoS - but a very
effective one, especially considering that i didn't even try hard ...

what worries me is that there are apps that generate such vma patterns
(for various reasons). I do believe that scanning ->i_mmap &
->i_mmap_shared is fundamentally flawed.

	Ingo

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: [lockup] Re: objrmap-core-1 (rmap removal for file mappings to avoid 4:4 in <=16G machines)
  2004-03-09 11:49       ` Ingo Molnar
@ 2004-03-09 16:03         ` Andrea Arcangeli
  2004-03-10 10:36           ` RFC anon_vma previous (i.e. full objrmap) Andrea Arcangeli
  0 siblings, 1 reply; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-09 16:03 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Andrew Morton, torvalds, linux-kernel

On Tue, Mar 09, 2004 at 12:49:24PM +0100, Ingo Molnar wrote:
>
> * Andrew Morton <akpm@osdl.org> wrote:
>
> > > or run the attached test-mmap2.c code, which simulates a very small DB
> > > app using only 1800 vmas per process: it only maps 8 MB of shm and
> > > spawns 32 processes. This has an even more lethal effect than the
> > > previous code.
> >
> > Do these tests actually make any forward progress at all, or is it some
> > bug which has sent the kernel into a loop?
>
> i think they make forward progress, so it's more of a DoS - but a very
> effective one, especially considering that i didn't even try hard ...
>
> what worries me is that there are apps that generate such vma patterns
> (for various reasons).

those vmas in those apps are forced to be mlocked with the rmap VM, so
it's hard for me to buy that rmap is any better. You can't even allow
those vmas to be non-mlocked or you'll exhaust zone-normal even with
4:4.

on 64bit those apps will work absolutely best with objrmap, and they
waste tons of ram (and some amount of cpu too) with rmap. objrmap is
the best model for those apps on any 64bit arch. The arguments you're
making about those apps are all in favour of objrmap IMO.

> I do believe that scanning ->i_mmap & ->i_mmap_shared is fundamentally
> flawed.

If it's the DoS that you worry about, vmtruncate will do the trick too.
The overall machine remains usable for me, despite the increased cpu
load.

^ permalink raw reply	[flat|nested] 74+ messages in thread
* RFC anon_vma previous (i.e. full objrmap)
  2004-03-09 16:03         ` Andrea Arcangeli
@ 2004-03-10 10:36           ` Andrea Arcangeli
  2004-03-11  6:52             ` anon_vma RFC2 Andrea Arcangeli
  0 siblings, 1 reply; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-10 10:36 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Andrew Morton, torvalds, linux-kernel

On Tue, Mar 09, 2004 at 05:03:07PM +0100, Andrea Arcangeli wrote:
> those vmas in those apps are forced to be mlocked with the rmap VM, so
> it's hard for me to buy that rmap is any better. You can't even allow

btw, try your exploit while keeping the stuff mlocked: you'll see we
stop following the i_mmap list the first time we run into a VM_LOCKED
vma. We could be even more efficient by removing mlocked pages from the
lru, but that's definitely not required to get that workload right, and
that workload needs mlock with rmap anyway to remove the pte_chains! So
even now objrmap seems a lot better than rmap for that workload: it
doesn't require mlock at all, only if you want to pageout heavily (rmap
requires it regardless of whether you pageout or not). And at worst it
can be fixed with an rbtree, while the rmap overhead is not fixable
(other than by removing rmap entirely, like I'm doing).

BTW, my current anon_vma work is going really well, the code is so much
nicer, and it's quite a bit smaller too:

 include/linux/mm.h         |   76 +++
 include/linux/objrmap.h    |   74 +++
 include/linux/page-flags.h |    4
 include/linux/rmap.h       |   53 --
 init/main.c                |    4
 mm/memory.c                |   15
 mm/mmap.c                  |    4
 mm/nommu.c                 |    2
 mm/objrmap.c               |  480 +++++++++++++++++++++++
 mm/page_alloc.c            |    6
 mm/rmap.c                  |  908 ---------------------------------------------
 12 files changed, 636 insertions(+), 990 deletions(-)

and this doesn't remove all the pte_chains everywhere yet.
objrmap.c seems already fully complete; what's missing now is the
removal of all the pte_chains from memory.c and friends, and later the
anon_vma tracking with fork and munmap (I've only covered
do_anonymous_page so far). See how clean it looks now:

static int
do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
		  pte_t *page_table, pmd_t *pmd, int write_access,
		  unsigned long addr)
{
	pte_t entry;
	struct page *page = ZERO_PAGE(addr);
	int ret;

	/* Read-only mapping of ZERO_PAGE. */
	entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));

	/* ..except if it's a write access */
	if (write_access) {
		/* Allocate our own private page. */
		pte_unmap(page_table);
		spin_unlock(&mm->page_table_lock);

		page = alloc_page(GFP_HIGHUSER);
		if (!page)
			goto no_mem;
		clear_user_highpage(page, addr);

		spin_lock(&mm->page_table_lock);
		page_table = pte_offset_map(pmd, addr);

		if (!pte_none(*page_table)) {
			pte_unmap(page_table);
			page_cache_release(page);
			spin_unlock(&mm->page_table_lock);
			ret = VM_FAULT_MINOR;
			goto out;
		}
		mm->rss++;
		entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)),
				      vma);
		lru_cache_add_active(page);
		mark_page_accessed(page);
		SetPageAnon(page);
	}

	set_pte(page_table, entry);
	/* ignores ZERO_PAGE */
	page_add_rmap(page, vma);
	pte_unmap(page_table);

	/* No need to invalidate - it was non-present before */
	update_mmu_cache(vma, addr, entry);
	spin_unlock(&mm->page_table_lock);
	ret = VM_FAULT_MINOR;
	goto out;

 no_mem:
	ret = VM_FAULT_OOM;
 out:
	return ret;
}

no pte_chains anywhere. And here is page_add_rmap from objrmap.c:

/* this needs the page->flags PG_map_lock held */
static inline void
anon_vma_page_link(struct page *page, struct vm_area_struct *vma)
{
	SetPageDirect(page);
	page->as.vma = vma;
}

/**
 * page_add_rmap - add reverse mapping entry to a page
 * @page: the page to add the mapping to
 * @vma: the vma that is covering the page
 *
 * Add a new pte reverse mapping to a page.
 * The caller needs to hold the mm->page_table_lock.
*/ void fastcall page_add_rmap(struct page *page, struct vm_area_struct * vma) { if (!pfn_valid(page_to_pfn(page)) || PageReserved(page)) return; page_map_lock(page); if (!page->mapcount++) inc_page_state(nr_mapped); if (PageAnon(page)) anon_vma_page_link(page, vma); else { /* * If this is an object-based page, just count it. * We can find the mappings by walking the object * vma chain for that object. */ BUG_ON(!page->as.mapping); BUG_ON(PageSwapCache(page)); } page_map_unlock(page); } Here is page_remove_rmap: /* this needs the page->flags PG_maplock held */ static inline void anon_vma_page_unlink(struct page * page) { /* * Cleanup if this anon page is gone * as far as the vm is concerned. */ if (!page->mapcount) { page->as.vma = NULL; #if 0 /* * The above clears page->as.anon_vma too * if the page wasn't direct. */ page->as.anon_vma = NULL; #endif ClearPageDirect(page); } } /** * page_remove_rmap - take down reverse mapping to a page * @page: page to remove mapping from * * Removes the reverse mapping of the page; * after that the caller can clear the page table entry and free * the page. * Caller needs to hold the mm->page_table_lock. */ void fastcall page_remove_rmap(struct page *page) { if (!pfn_valid(page_to_pfn(page)) || PageReserved(page)) return; page_map_lock(page); if (!page_mapped(page)) goto out_unlock; if (!--page->mapcount) dec_page_state(nr_mapped); if (PageAnon(page)) anon_vma_page_unlink(page); else { /* * If this is an object-based page, just uncount it. * We can find the mappings by walking the object vma * chain for that object. 
*/ BUG_ON(!page->as.mapping); BUG_ON(PageSwapCache(page)); } out_unlock: page_map_unlock(page); return; } Here is the paging code that unmaps the ptes: static int try_to_unmap_anon(struct page * page) { int ret = SWAP_AGAIN; page_map_lock(page); if (PageDirect(page)) { ret = try_to_unmap_inode_one(page->as.vma, page); } else { struct vm_area_struct * vma; anon_vma_t * anon_vma = page->as.anon_vma; list_for_each_entry(vma, &anon_vma->anon_vma_head, anon_vma_node) { ret = try_to_unmap_inode_one(vma, page); if (ret == SWAP_FAIL || !page->mapcount) goto out; } } out: page_map_unlock(page); return ret; } /** * try_to_unmap - try to remove all page table mappings to a page * @page: the page to get unmapped * * Tries to remove all the page table entries which are mapping this * page, used in the pageout path. Caller must hold the page lock. * Return values are: * * SWAP_SUCCESS - we succeeded in removing all mappings * SWAP_AGAIN - we missed a trylock, try again later * SWAP_FAIL - the page is unswappable */ int fastcall try_to_unmap(struct page * page) { int ret = SWAP_SUCCESS; /* This page should not be on the pageout lists. */ BUG_ON(PageReserved(page)); BUG_ON(!PageLocked(page)); /* * We need backing store to swap out a page. * Subtle: this checks for page->as.anon_vma too ;). 
*/ BUG_ON(!page->as.mapping); if (!PageAnon(page)) ret = try_to_unmap_inode(page); else ret = try_to_unmap_anon(page); if (!page_mapped(page)) { dec_page_state(nr_mapped); ret = SWAP_SUCCESS; } return ret; } In my first attempt I was nuking the page->mapcount++ (it's pure locking overhead for the file mappings and it wastes 4 bytes per page_t), but then I backed off since the nr_mapped changes were spreading everywhere in the vm and the modifications were growing too fast at the same time, so I'll think about it later. For now I will do anon_vma only, plus the nonlinear pagetable walk, so the patch is as self-contained as possible and it'll drop all pte_chains from the kernel. The only reason I need page->mapcount is that if the page is an inode mapping, page->as.mapping won't be enough to tell whether it was already mapped or not. So my current anon_vma patch (incremental with objrmap) only shrinks the page_t by 4 bytes compared to mainline 2.4 and mainline 2.6. With PageDirect and the page->as.vma field I'm deferring _all_ anon_vma object allocations to fork(); even when a MAP_PRIVATE vma is already tracked by an inode and by an anon_vma (generated by an old fork), newly allocated anonymous pages are still "direct". So the same vma can have direct anon pages, anon_vma indirect cow pages, and finally inode pages too (readonly, write-protected). I plan to teach the cow fault to convert anon_vma indirect pages back to direct pages if page->mapcount == 1 (I don't strictly need page->mapcount for that, I could use page_count, but since I have page->mapcount I use it, so even the unlikely races get converted to direct mode too). However a vma can never revert to "direct"; only a page can. The reason is that I've no way to reach _only_ the pages pointing to an anon_vma starting from the vma (the only way would be a pagetable walk, but I don't want to do that, and leaving the anon_vma around is perfectly fine: I will garbage collect it when the vma goes away too). 
Overall this means anonymous page faults will be blazing fast, with no allocation ever in the fast paths; just fork will have to allocate 12 more bytes per anonymous vma to track the cows (not a big deal compared to 8 bytes per pte of rmap ;). Here below (most important of all for understanding my proposed anon_vma design) is a preview of the data structure layout. I think this is close to DaveM's original approach to handling anonymous memory, though the last time I read his patch was a few years ago so I don't remember exactly; the only thing I remember (because I disliked it) was that he was doing slab allocations from page faults, something I definitely want to avoid completely, with highest priority. Hugh's approach wasn't usable either, since it was tracking the mm and it unfortunately broke with mremap. The way I designed the garbage collection of the transient anon_vma objects is, I think, extremely optimized too: I don't need a list of pages or a counter of the pages, I simply garbage collect the anon_vma during vma destruction, checking vma->anon_vma && list_empty(&vma->anon_vma->anon_vma_head). I use the invariant that for a page to point to an anon_vma there must still be a vma queued in the anon_vma. That should work reliably, and it allows me to only point to anon_vmas from pages; I never know from an anon_vma (or a vma) whether any page is pointing to it (I only need to know that no page is pointing to it when no vma is queued in the anon_vma). It took me a while to design this thing, but now I'm quite happy. I hope not to find some huge design flaw at the last minute ;). This is why I'm showing you all this right now, before it's finished: if you see any design flaw please let me know ASAP, I need this thing working quickly! thanks. --- sles-anobjrmap-2/include/linux/mm.h.~1~ 2004-03-03 06:45:38.000000000 +0100 +++ sles-anobjrmap-2/include/linux/mm.h 2004-03-10 10:25:55.955735680 +0100 @@ -39,6 +39,22 @@ extern int page_cluster; * mmap() functions). 
*/ +typedef struct anon_vma_s { + /* This serializes the accesses to the vma list. */ + spinlock_t anon_vma_lock; + + /* + * This is a list of anonymous "related" vmas, + * to scan if one of the pages pointing to this + * anon_vma needs to be unmapped. + * After we unlink the last vma we must garbage collect + * the object if the list is empty because we're + * guaranteed no page can be pointing to this anon_vma + * if there's no vma anymore. + */ + struct list_head anon_vma_head; +} anon_vma_t; + /* * This struct defines a memory VMM memory area. There is one of these * per VM-area/task. A VM area is any part of the process virtual memory @@ -69,6 +85,19 @@ struct vm_area_struct { */ struct list_head shared; + /* + * The same vma can be both queued into the i_mmap and in a + * anon_vma too, for example after a cow in + * a MAP_PRIVATE file mapping. However only the MAP_PRIVATE + * will go both in the i_mmap and anon_vma. A MAP_SHARED + * will only be in the i_mmap_shared and a MAP_ANONYMOUS (file = 0) + * will only be queued only in the anon_vma. + * The list is serialized by the anon_vma->lock. + */ + struct list_head anon_vma_node; + /* Serialized by the vma->vm_mm->page_table_lock */ + anon_vma_t * anon_vma; + /* Function pointers to deal with this struct. */ struct vm_operations_struct * vm_ops; @@ -172,16 +201,51 @@ struct page { updated asynchronously */ atomic_t count; /* Usage count, see below. */ struct list_head list; /* ->mapping has some page lists. */ - struct address_space *mapping; /* The inode (or ...) we belong to. */ unsigned long index; /* Our offset within mapping. */ struct list_head lru; /* Pageout list, eg. active_list; protected by zone->lru_lock !! */ + + /* + * Address space of this page. + * A page can be either mapped to a file or to be anonymous + * memory, so using the union is optimal here. The PG_anon + * bitflag tells if this is anonymous or a file-mapping. 
+ * If PG_anon is clear we use the as.mapping, if PG_anon is + * set and PG_direct is not set we use the as.anon_vma, + * if PG_anon is set and PG_direct is set we use the as.vma. + */ union { - struct pte_chain *chain;/* Reverse pte mapping pointer. - * protected by PG_chainlock */ - pte_addr_t direct; - int mapcount; - } pte; + /* The inode address space if it's a file mapping. */ + struct address_space * mapping; + + /* + * This points to an anon_vma object. + * The anon_vma can't go away under us if + * we hold the PG_maplock. + */ + anon_vma_t * anon_vma; + + /* + * Before the first fork we avoid anon_vma object allocation + * and we set PG_direct. anon_vma objects are only created + * via fork(), and the vm then stop using the page->as.vma + * and it starts using the as.anon_vma object instead. + * After the first fork(), even if the child exit, the pages + * cannot be downgraded to PG_direct anymore (even if we + * wanted to) because there's no way to reach pages starting + * from an anon_vma object. + */ + struct vm_struct * vma; + } as; + + /* + * Number of ptes mapping this page. + * It's serialized by PG_maplock. + * This is needed only to maintain the nr_mapped global info + * so it would be nice to drop it. 
+ */ + unsigned long mapcount; + unsigned long private; /* mapping-private opaque data */ /* --- sles-anobjrmap-2/include/linux/page-flags.h.~1~ 2004-03-03 06:45:38.000000000 +0100 +++ sles-anobjrmap-2/include/linux/page-flags.h 2004-03-10 10:20:59.324830432 +0100 @@ -69,9 +69,9 @@ #define PG_private 12 /* Has something at ->private */ #define PG_writeback 13 /* Page is under writeback */ #define PG_nosave 14 /* Used for system suspend/resume */ -#define PG_chainlock 15 /* lock bit for ->pte_chain */ +#define PG_maplock 15 /* lock bit for ->as.anon_vma and ->mapcount */ -#define PG_direct 16 /* ->pte_chain points directly at pte */ +#define PG_direct 16 /* if set it must use page->as.vma */ #define PG_mappedtodisk 17 /* Has blocks allocated on-disk */ #define PG_reclaim 18 /* To be reclaimed asap */ #define PG_compound 19 /* Part of a compound page */ ^ permalink raw reply [flat|nested] 74+ messages in thread
* anon_vma RFC2 2004-03-10 10:36 ` RFC anon_vma previous (i.e. full objrmap) Andrea Arcangeli @ 2004-03-11 6:52 ` Andrea Arcangeli 2004-03-11 13:23 ` Hugh Dickins 0 siblings, 1 reply; 74+ messages in thread From: Andrea Arcangeli @ 2004-03-11 6:52 UTC (permalink / raw) To: Ingo Molnar Cc: Andrew Morton, torvalds, linux-kernel, William Lee Irwin III, Hugh Dickins Hello, this is the full current status of my anon_vma work. Now fork() and all the other page_add/remove_rmap calls in memory.c plus the paging routines seem fully covered, and I'm now dealing with the vma merging and the anon_vma garbage collection (the latter is easy, but I need to track down all the kmem_cache_free calls). There is just one minor limitation with the vma merging of anonymous memory that I didn't consider during the design phase (I figured it out while coding). In short this is only an issue with the mremap syscall (and sometimes with mmap too, while filling a hole). The vma merging happening during mmap/brk (when not filling a hole) is always going to work fine, since the newly created vma has vma->anon_vma == NULL and I have the guarantee from the caller that no page is yet mapped to this vma, so I can merge it just fine and it'll become part of whatever pre-existing anon_vma object (after possibly fixing up the vma->pg_off of the newly created vma). Only when I fill a hole (with mmap or brk) may I be unable to merge the three anon vmas together, if their pg_off disagrees. However their pg_off can disagree only if somebody previously used mremap on those vmas, since I set up the pg_off of anonymous memory in such a way that if you only use mmap/brk, even filling the holes is guaranteed to do full merging. The problem with mremap is not only the pgoff; the problem is that I can merge anonymous vmas only if (!vma1->anon_vma || !vma2->anon_vma) is true. 
If vma1 and vma2 have two different anon_vmas I cannot merge them together (even if the pg_off agrees), because the pages under vma2 may point to vma2->anon_vma and the pages under vma1 to vma1->anon_vma in their page->as.anon_vma, and there is no way to efficiently reach the pages pointing to a given anon_vma. As said yesterday, the invariant I use to garbage collect the anon_vma is to wait for all vmas to be unlinked from it; as long as there are vmas queued in an anon_vma object I cannot release it, and in turn I cannot do the merging either. The only way to allow 100% merging through mremap would be to have a list with its head in the anon_vma and its nodes in the page_t. That would be very easy, but it would waste 4 bytes per page_t for an hlist_node (the 4-byte waste in the anon_vma itself is not a problem). And the merging would be very expensive too, since I would need to run a for_each_page_in_the_list loop to first fix up all the page->index values according to the spread between vma1->pg_off and vma2->pg_off, and second to reset page->as.anon_vma (or page->as.vma for direct pages) to point to the other anon_vma (or the other vma for direct pages, respectively). So I think I will go ahead with the current data structures despite the small regression in vma merging. I doubt it's an issue, but please let me know if you think it is and that I should add an hlist_node to the page_t and an hlist_head to the anon_vma_t. Btw, it's something I can always do later if it turns out to be really necessary. Even with the additional 4 bytes per page_t, the page_t size would not be bigger than in mainline 2.4 and mainline 2.6. 
include/linux/mm.h | 79 +++ include/linux/objrmap.h | 66 +++ include/linux/page-flags.h | 4 include/linux/rmap.h | 53 -- init/main.c | 4 kernel/fork.c | 10 mm/Makefile | 2 mm/memory.c | 129 +----- mm/mmap.c | 9 mm/nommu.c | 2 mm/objrmap.c | 575 ++++++++++++++++++++++++++++ mm/page_alloc.c | 6 mm/rmap.c | 908 --------------------------------------------- 14 files changed, 772 insertions(+), 1075 deletions(-) --- sles-anobjrmap-2/include/linux/mm.h.~1~ 2004-03-03 06:45:38.000000000 +0100 +++ sles-anobjrmap-2/include/linux/mm.h 2004-03-10 18:59:14.000000000 +0100 @@ -39,6 +39,22 @@ extern int page_cluster; * mmap() functions). */ +typedef struct anon_vma_s { + /* This serializes the accesses to the vma list. */ + spinlock_t anon_vma_lock; + + /* + * This is a list of anonymous "related" vmas, + * to scan if one of the pages pointing to this + * anon_vma needs to be unmapped. + * After we unlink the last vma we must garbage collect + * the object if the list is empty because we're + * guaranteed no page can be pointing to this anon_vma + * if there's no vma anymore. + */ + struct list_head anon_vma_head; +} anon_vma_t; + /* * This struct defines a memory VMM memory area. There is one of these * per VM-area/task. A VM area is any part of the process virtual memory @@ -69,6 +85,19 @@ struct vm_area_struct { */ struct list_head shared; + /* + * The same vma can be both queued into the i_mmap and in a + * anon_vma too, for example after a cow in + * a MAP_PRIVATE file mapping. However only the MAP_PRIVATE + * will go both in the i_mmap and anon_vma. A MAP_SHARED + * will only be in the i_mmap_shared and a MAP_ANONYMOUS (file = 0) + * will only be queued only in the anon_vma. + * The list is serialized by the anon_vma->lock. + */ + struct list_head anon_vma_node; + /* Serialized by the vma->vm_mm->page_table_lock */ + anon_vma_t * anon_vma; + /* Function pointers to deal with this struct. 
*/ struct vm_operations_struct * vm_ops; @@ -172,16 +201,51 @@ struct page { updated asynchronously */ atomic_t count; /* Usage count, see below. */ struct list_head list; /* ->mapping has some page lists. */ - struct address_space *mapping; /* The inode (or ...) we belong to. */ unsigned long index; /* Our offset within mapping. */ struct list_head lru; /* Pageout list, eg. active_list; protected by zone->lru_lock !! */ + + /* + * Address space of this page. + * A page can be either mapped to a file or to be anonymous + * memory, so using the union is optimal here. The PG_anon + * bitflag tells if this is anonymous or a file-mapping. + * If PG_anon is clear we use the as.mapping, if PG_anon is + * set and PG_direct is not set we use the as.anon_vma, + * if PG_anon is set and PG_direct is set we use the as.vma. + */ union { - struct pte_chain *chain;/* Reverse pte mapping pointer. - * protected by PG_chainlock */ - pte_addr_t direct; - int mapcount; - } pte; + /* The inode address space if it's a file mapping. */ + struct address_space * mapping; + + /* + * This points to an anon_vma object. + * The anon_vma can't go away under us if + * we hold the PG_maplock. + */ + anon_vma_t * anon_vma; + + /* + * Before the first fork we avoid anon_vma object allocation + * and we set PG_direct. anon_vma objects are only created + * via fork(), and the vm then stop using the page->as.vma + * and it starts using the as.anon_vma object instead. + * After the first fork(), even if the child exit, the pages + * cannot be downgraded to PG_direct anymore (even if we + * wanted to) because there's no way to reach pages starting + * from an anon_vma object. + */ + struct vm_struct * vma; + } as; + + /* + * Number of ptes mapping this page. + * It's serialized by PG_maplock. + * This is needed only to maintain the nr_mapped global info + * so it would be nice to drop it. 
+ */ + unsigned long mapcount; + unsigned long private; /* mapping-private opaque data */ /* @@ -440,7 +504,8 @@ void unmap_page_range(struct mmu_gather unsigned long address, unsigned long size); void clear_page_tables(struct mmu_gather *tlb, unsigned long first, int nr); int copy_page_range(struct mm_struct *dst, struct mm_struct *src, - struct vm_area_struct *vma); + struct vm_area_struct *vma, struct vm_area_struct *orig_vma, + anon_vma_t ** anon_vma); int zeromap_page_range(struct vm_area_struct *vma, unsigned long from, unsigned long size, pgprot_t prot); --- sles-anobjrmap-2/include/linux/page-flags.h.~1~ 2004-03-03 06:45:38.000000000 +0100 +++ sles-anobjrmap-2/include/linux/page-flags.h 2004-03-10 10:20:59.000000000 +0100 @@ -69,9 +69,9 @@ #define PG_private 12 /* Has something at ->private */ #define PG_writeback 13 /* Page is under writeback */ #define PG_nosave 14 /* Used for system suspend/resume */ -#define PG_chainlock 15 /* lock bit for ->pte_chain */ +#define PG_maplock 15 /* lock bit for ->as.anon_vma and ->mapcount */ -#define PG_direct 16 /* ->pte_chain points directly at pte */ +#define PG_direct 16 /* if set it must use page->as.vma */ #define PG_mappedtodisk 17 /* Has blocks allocated on-disk */ #define PG_reclaim 18 /* To be reclaimed asap */ #define PG_compound 19 /* Part of a compound page */ --- sles-anobjrmap-2/include/linux/objrmap.h.~1~ 2004-03-05 05:27:41.000000000 +0100 +++ sles-anobjrmap-2/include/linux/objrmap.h 2004-03-10 20:48:57.000000000 +0100 @@ -1,8 +1,7 @@ #ifndef _LINUX_RMAP_H #define _LINUX_RMAP_H /* - * Declarations for Reverse Mapping functions in mm/rmap.c - * Its structures are declared within that file. 
+ * Declarations for Object Reverse Mapping functions in mm/objrmap.c */ #include <linux/config.h> @@ -10,32 +9,46 @@ #include <linux/linkage.h> #include <linux/slab.h> +#include <linux/kernel.h> -struct pte_chain; -extern kmem_cache_t *pte_chain_cache; +extern kmem_cache_t * anon_vma_cachep; -#define pte_chain_lock(page) bit_spin_lock(PG_chainlock, &page->flags) -#define pte_chain_unlock(page) bit_spin_unlock(PG_chainlock, &page->flags) +#define page_map_lock(page) bit_spin_lock(PG_maplock, &page->flags) +#define page_map_unlock(page) bit_spin_unlock(PG_maplock, &page->flags) -struct pte_chain *pte_chain_alloc(int gfp_flags); -void __pte_chain_free(struct pte_chain *pte_chain); +static inline void anon_vma_free(anon_vma_t * anon_vma) +{ + kmem_cache_free(anon_vma); +} -static inline void pte_chain_free(struct pte_chain *pte_chain) +static inline anon_vma_t * anon_vma_alloc(void) { - if (pte_chain) - __pte_chain_free(pte_chain); + might_sleep(); + + return kmem_cache_alloc(anon_vma_cachep, SLAB_KERNEL); } -int FASTCALL(page_referenced(struct page *)); -struct pte_chain *FASTCALL(page_add_rmap(struct page *, pte_t *, - struct pte_chain *)); -void FASTCALL(page_remove_rmap(struct page *, pte_t *)); -int page_convert_anon(struct page *); +static inline void anon_vma_unlink(struct vm_area_struct * vma) +{ + anon_vma_t * anon_vma = vma->anon_vma; + + if (anon_vma) { + spin_lock(&anon_vma->anon_vma_lock); + list_del(&vma->anon_vm_node); + spin_unlock(&anon_vma->anon_vma_lock); + } +} + +void FASTCALL(page_add_rmap(struct page *, struct vm_struct *)); +void FASTCALL(page_add_rmap_fork(struct page *, struct vm_area_struct *, + struct vm_area_struct *, anon_vma_t **)); +void FASTCALL(page_remove_rmap(struct page *)); /* * Called from mm/vmscan.c to handle paging out */ int FASTCALL(try_to_unmap(struct page *)); +int FASTCALL(page_referenced(struct page *)); /* * Return values of try_to_unmap --- sles-anobjrmap-2/init/main.c.~1~ 2004-02-29 17:47:36.000000000 +0100 +++ 
sles-anobjrmap-2/init/main.c 2004-03-09 05:32:34.000000000 +0100 @@ -85,7 +85,7 @@ extern void signals_init(void); extern void buffer_init(void); extern void pidhash_init(void); extern void pidmap_init(void); -extern void pte_chain_init(void); +extern void anon_vma_init(void); extern void radix_tree_init(void); extern void free_initmem(void); extern void populate_rootfs(void); @@ -495,7 +495,7 @@ asmlinkage void __init start_kernel(void calibrate_delay(); pidmap_init(); pgtable_cache_init(); - pte_chain_init(); + anon_vma_init(); #ifdef CONFIG_KDB kdb_init(); --- sles-anobjrmap-2/kernel/fork.c.~1~ 2004-02-29 17:47:33.000000000 +0100 +++ sles-anobjrmap-2/kernel/fork.c 2004-03-10 18:58:29.000000000 +0100 @@ -276,6 +276,7 @@ static inline int dup_mmap(struct mm_str struct vm_area_struct * mpnt, *tmp, **pprev; int retval; unsigned long charge = 0; + anon_vma_t * anon_vma = NULL; down_write(&oldmm->mmap_sem); flush_cache_mm(current->mm); @@ -310,6 +311,11 @@ static inline int dup_mmap(struct mm_str goto fail_nomem; charge += len; } + if (!anon_vma) { + anon_vma = anon_vma_alloc(); + if (!anon_vma) + goto fail_nomem; + } tmp = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL); if (!tmp) goto fail_nomem; @@ -339,7 +345,7 @@ static inline int dup_mmap(struct mm_str *pprev = tmp; pprev = &tmp->vm_next; mm->map_count++; - retval = copy_page_range(mm, current->mm, tmp); + retval = copy_page_range(mm, current->mm, tmp, mpnt, &anon_vma); spin_unlock(&mm->page_table_lock); if (tmp->vm_ops && tmp->vm_ops->open) @@ -354,6 +360,8 @@ static inline int dup_mmap(struct mm_str out: flush_tlb_mm(current->mm); up_write(&oldmm->mmap_sem); + if (anon_vma) + anon_vma_free(anon_vma); return retval; fail_nomem: retval = -ENOMEM; --- sles-anobjrmap-2/mm/mmap.c.~1~ 2004-03-03 06:53:46.000000000 +0100 +++ sles-anobjrmap-2/mm/mmap.c 2004-03-11 07:43:32.158221568 +0100 @@ -325,7 +325,7 @@ static void move_vma_start(struct vm_are inode = vma->vm_file->f_dentry->d_inode; if (inode) 
__remove_shared_vm_struct(vma, inode); - /* If no vm_file, perhaps we should always keep vm_pgoff at 0?? */ + /* we must update pgoff even if no vm_file for the anon_vma_chain */ vma->vm_pgoff += (long)(addr - vma->vm_start) >> PAGE_SHIFT; vma->vm_start = addr; if (inode) @@ -576,6 +576,7 @@ unsigned long __do_mmap_pgoff(struct mm_ case MAP_SHARED: break; } + pgoff = addr << PAGE_SHIFT; } error = security_file_mmap(file, prot, flags); @@ -639,6 +640,8 @@ munmap_back: vma->vm_private_data = NULL; vma->vm_next = NULL; INIT_LIST_HEAD(&vma->shared); + INIT_LIST_HEAD(&vma->anon_vma_node); + vma->anon_vma = NULL; if (file) { error = -EINVAL; @@ -1381,10 +1384,12 @@ unsigned long do_brk(unsigned long addr, vma->vm_flags = flags; vma->vm_page_prot = protection_map[flags & 0x0f]; vma->vm_ops = NULL; - vma->vm_pgoff = 0; + vma->vm_pgoff = addr << PAGE_SHIFT; vma->vm_file = NULL; vma->vm_private_data = NULL; INIT_LIST_HEAD(&vma->shared); + INIT_LIST_HEAD(&vma->anon_vma_node); + vma->anon_vma = NULL; vma_link(mm, vma, prev, rb_link, rb_parent); --- sles-anobjrmap-2/mm/page_alloc.c.~1~ 2004-03-03 06:45:38.000000000 +0100 +++ sles-anobjrmap-2/mm/page_alloc.c 2004-03-10 10:28:26.000000000 +0100 @@ -91,6 +91,7 @@ static void bad_page(const char *functio 1 << PG_writeback); set_page_count(page, 0); page->mapping = NULL; + page->mapcount = 0; } #if !defined(CONFIG_HUGETLB_PAGE) && !defined(CONFIG_CRASH_DUMP) \ @@ -216,8 +217,7 @@ static inline void __free_pages_bulk (st static inline void free_pages_check(const char *function, struct page *page) { - if ( page_mapped(page) || - page->mapping != NULL || + if ( page->as.mapping != NULL || page_count(page) != 0 || (page->flags & ( 1 << PG_lru | @@ -329,7 +329,7 @@ static inline void set_page_refs(struct */ static void prep_new_page(struct page *page, int order) { - if (page->mapping || page_mapped(page) || + if (page->as.mapping || (page->flags & ( 1 << PG_private | 1 << PG_locked | --- sles-anobjrmap-2/mm/nommu.c.~1~ 2004-02-04 
16:07:06.000000000 +0100 +++ sles-anobjrmap-2/mm/nommu.c 2004-03-09 05:32:41.000000000 +0100 @@ -568,6 +568,6 @@ unsigned long get_unmapped_area(struct f return -ENOMEM; } -void pte_chain_init(void) +void anon_vma_init(void) { } --- sles-anobjrmap-2/mm/memory.c.~1~ 2004-03-05 05:24:35.000000000 +0100 +++ sles-anobjrmap-2/mm/memory.c 2004-03-10 19:25:27.000000000 +0100 @@ -43,12 +43,11 @@ #include <linux/swap.h> #include <linux/highmem.h> #include <linux/pagemap.h> -#include <linux/rmap.h> +#include <linux/objrmap.h> #include <linux/module.h> #include <linux/init.h> #include <asm/pgalloc.h> -#include <asm/rmap.h> #include <asm/uaccess.h> #include <asm/tlb.h> #include <asm/tlbflush.h> @@ -105,7 +104,6 @@ static inline void free_one_pmd(struct m } page = pmd_page(*dir); pmd_clear(dir); - pgtable_remove_rmap(page); pte_free_tlb(tlb, page); } @@ -164,7 +162,6 @@ pte_t fastcall * pte_alloc_map(struct mm pte_free(new); goto out; } - pgtable_add_rmap(new, mm, address); pmd_populate(mm, pmd, new); } out: @@ -190,7 +187,6 @@ pte_t fastcall * pte_alloc_kernel(struct pte_free_kernel(new); goto out; } - pgtable_add_rmap(virt_to_page(new), mm, address); pmd_populate_kernel(mm, pmd, new); } out: @@ -211,26 +207,17 @@ out: * but may be dropped within pmd_alloc() and pte_alloc_map(). 
*/ int copy_page_range(struct mm_struct *dst, struct mm_struct *src, - struct vm_area_struct *vma) + struct vm_area_struct *vma, struct vm_area_struct *orig_vma, + anon_vma_t ** anon_vma) { pgd_t * src_pgd, * dst_pgd; unsigned long address = vma->vm_start; unsigned long end = vma->vm_end; unsigned long cow; - struct pte_chain *pte_chain = NULL; if (is_vm_hugetlb_page(vma)) return copy_hugetlb_page_range(dst, src, vma); - pte_chain = pte_chain_alloc(GFP_ATOMIC); - if (!pte_chain) { - spin_unlock(&dst->page_table_lock); - pte_chain = pte_chain_alloc(GFP_KERNEL); - spin_lock(&dst->page_table_lock); - if (!pte_chain) - goto nomem; - } - cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE; src_pgd = pgd_offset(src, address)-1; dst_pgd = pgd_offset(dst, address)-1; @@ -299,7 +286,7 @@ skip_copy_pte_range: pfn = pte_pfn(pte); /* the pte points outside of valid memory, the * mapping is assumed to be good, meaningful - * and not mapped via rmap - duplicate the + * and not mapped via objrmap - duplicate the * mapping as is. */ page = NULL; @@ -331,30 +318,20 @@ skip_copy_pte_range: dst->rss++; set_pte(dst_pte, pte); - pte_chain = page_add_rmap(page, dst_pte, - pte_chain); - if (pte_chain) - goto cont_copy_pte_range_noset; - pte_chain = pte_chain_alloc(GFP_ATOMIC); - if (pte_chain) - goto cont_copy_pte_range_noset; + page_add_rmap_fork(page, vma, orig_vma, anon_vma); + + if (need_resched()) { + pte_unmap_nested(src_pte); + pte_unmap(dst_pte); + spin_unlock(&src->page_table_lock); + spin_unlock(&dst->page_table_lock); + __cond_resched(); + spin_lock(&dst->page_table_lock); + spin_lock(&src->page_table_lock); + dst_pte = pte_offset_map(dst_pmd, address); + src_pte = pte_offset_map_nested(src_pmd, address); + } - /* - * pte_chain allocation failed, and we need to - * run page reclaim. 
- */ - pte_unmap_nested(src_pte); - pte_unmap(dst_pte); - spin_unlock(&src->page_table_lock); - spin_unlock(&dst->page_table_lock); - pte_chain = pte_chain_alloc(GFP_KERNEL); - spin_lock(&dst->page_table_lock); - if (!pte_chain) - goto nomem; - spin_lock(&src->page_table_lock); - dst_pte = pte_offset_map(dst_pmd, address); - src_pte = pte_offset_map_nested(src_pmd, - address); cont_copy_pte_range_noset: address += PAGE_SIZE; if (address >= end) { @@ -377,10 +354,9 @@ cont_copy_pmd_range: out_unlock: spin_unlock(&src->page_table_lock); out: - pte_chain_free(pte_chain); return 0; + nomem: - pte_chain_free(pte_chain); return -ENOMEM; } @@ -421,7 +397,7 @@ zap_pte_range(struct mmu_gather *tlb, pm !PageSwapCache(page)) mark_page_accessed(page); tlb->freed++; - page_remove_rmap(page, ptep); + page_remove_rmap(page); tlb_remove_page(tlb, page); } } @@ -1014,7 +990,6 @@ static int do_wp_page(struct mm_struct * { struct page *old_page, *new_page; unsigned long pfn = pte_pfn(pte); - struct pte_chain *pte_chain; pte_t entry; if (unlikely(!pfn_valid(pfn))) { @@ -1053,9 +1028,6 @@ static int do_wp_page(struct mm_struct * page_cache_get(old_page); spin_unlock(&mm->page_table_lock); - pte_chain = pte_chain_alloc(GFP_KERNEL); - if (!pte_chain) - goto no_pte_chain; new_page = alloc_page(GFP_HIGHUSER); if (!new_page) goto no_new_page; @@ -1069,10 +1041,10 @@ static int do_wp_page(struct mm_struct * if (pte_same(*page_table, pte)) { if (PageReserved(old_page)) ++mm->rss; - page_remove_rmap(old_page, page_table); + page_remove_rmap(old_page); break_cow(vma, new_page, address, page_table); SetPageAnon(new_page); - pte_chain = page_add_rmap(new_page, page_table, pte_chain); + page_add_rmap(new_page, vma); lru_cache_add_active(new_page); /* Free the old page.. 
*/ @@ -1082,12 +1054,9 @@ static int do_wp_page(struct mm_struct * page_cache_release(new_page); page_cache_release(old_page); spin_unlock(&mm->page_table_lock); - pte_chain_free(pte_chain); return VM_FAULT_MINOR; no_new_page: - pte_chain_free(pte_chain); -no_pte_chain: page_cache_release(old_page); return VM_FAULT_OOM; } @@ -1245,7 +1214,6 @@ static int do_swap_page(struct mm_struct swp_entry_t entry = pte_to_swp_entry(orig_pte); pte_t pte; int ret = VM_FAULT_MINOR; - struct pte_chain *pte_chain = NULL; pte_unmap(page_table); spin_unlock(&mm->page_table_lock); @@ -1275,11 +1243,6 @@ static int do_swap_page(struct mm_struct } mark_page_accessed(page); - pte_chain = pte_chain_alloc(GFP_KERNEL); - if (!pte_chain) { - ret = VM_FAULT_OOM; - goto out; - } lock_page(page); /* @@ -1312,14 +1275,13 @@ static int do_swap_page(struct mm_struct flush_icache_page(vma, page); set_pte(page_table, pte); SetPageAnon(page); - pte_chain = page_add_rmap(page, page_table, pte_chain); + page_add_rmap(page, vma); /* No need to invalidate - it was non-present before */ update_mmu_cache(vma, address, pte); pte_unmap(page_table); spin_unlock(&mm->page_table_lock); out: - pte_chain_free(pte_chain); return ret; } @@ -1335,20 +1297,8 @@ do_anonymous_page(struct mm_struct *mm, { pte_t entry; struct page * page = ZERO_PAGE(addr); - struct pte_chain *pte_chain; int ret; - pte_chain = pte_chain_alloc(GFP_ATOMIC); - if (!pte_chain) { - pte_unmap(page_table); - spin_unlock(&mm->page_table_lock); - pte_chain = pte_chain_alloc(GFP_KERNEL); - if (!pte_chain) - goto no_mem; - spin_lock(&mm->page_table_lock); - page_table = pte_offset_map(pmd, addr); - } - /* Read-only mapping of ZERO_PAGE. 
*/ entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); @@ -1359,8 +1309,8 @@ do_anonymous_page(struct mm_struct *mm, spin_unlock(&mm->page_table_lock); page = alloc_page(GFP_HIGHUSER); - if (!page) - goto no_mem; + if (unlikely(!page)) + return VM_FAULT_OOM; clear_user_highpage(page, addr); spin_lock(&mm->page_table_lock); @@ -1370,8 +1320,7 @@ do_anonymous_page(struct mm_struct *mm, pte_unmap(page_table); page_cache_release(page); spin_unlock(&mm->page_table_lock); - ret = VM_FAULT_MINOR; - goto out; + return VM_FAULT_MINOR; } mm->rss++; entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, @@ -1383,20 +1332,16 @@ do_anonymous_page(struct mm_struct *mm, } set_pte(page_table, entry); - /* ignores ZERO_PAGE */ - pte_chain = page_add_rmap(page, page_table, pte_chain); pte_unmap(page_table); /* No need to invalidate - it was non-present before */ update_mmu_cache(vma, addr, entry); spin_unlock(&mm->page_table_lock); ret = VM_FAULT_MINOR; - goto out; -no_mem: - ret = VM_FAULT_OOM; -out: - pte_chain_free(pte_chain); + /* ignores ZERO_PAGE */ + page_add_rmap(page, vma); + return ret; } @@ -1419,7 +1364,6 @@ do_no_page(struct mm_struct *mm, struct struct page * new_page; struct address_space *mapping = NULL; pte_t entry; - struct pte_chain *pte_chain; int sequence = 0; int ret = VM_FAULT_MINOR; @@ -1443,10 +1387,6 @@ retry: if (new_page == NOPAGE_OOM) return VM_FAULT_OOM; - pte_chain = pte_chain_alloc(GFP_KERNEL); - if (!pte_chain) - goto oom; - /* See if nopage returned an anon page */ if (!new_page->mapping || PageSwapCache(new_page)) SetPageAnon(new_page); @@ -1476,7 +1416,6 @@ retry: sequence = atomic_read(&mapping->truncate_count); spin_unlock(&mm->page_table_lock); page_cache_release(new_page); - pte_chain_free(pte_chain); goto retry; } page_table = pte_offset_map(pmd, address); @@ -1500,7 +1439,7 @@ retry: if (write_access) entry = maybe_mkwrite(pte_mkdirty(entry), vma); set_pte(page_table, entry); - pte_chain = page_add_rmap(new_page, page_table, 
pte_chain); + page_add_rmap(new_page, vma); pte_unmap(page_table); } else { /* One of our sibling threads was faster, back out. */ @@ -1513,13 +1452,13 @@ retry: /* no need to invalidate: a not-present page shouldn't be cached */ update_mmu_cache(vma, address, entry); spin_unlock(&mm->page_table_lock); - goto out; -oom: + out: + return ret; + + oom: page_cache_release(new_page); ret = VM_FAULT_OOM; -out: - pte_chain_free(pte_chain); - return ret; + goto out; } /* --- sles-anobjrmap-2/mm/objrmap.c.~1~ 2004-03-05 05:40:21.000000000 +0100 +++ sles-anobjrmap-2/mm/objrmap.c 2004-03-10 20:29:20.000000000 +0100 @@ -1,105 +1,27 @@ /* - * mm/rmap.c - physical to virtual reverse mappings - * - * Copyright 2001, Rik van Riel <riel@conectiva.com.br> - * Released under the General Public License (GPL). + * mm/objrmap.c * + * Provides methods for unmapping all sort of mapped pages + * using the vma objects, the brainer part of objrmap is the + * tracking of the vma to analyze for every given mapped page. + * The anon_vma methods are tracking anonymous pages, + * and the inode methods are tracking pages belonging + * to an inode. * - * Simple, low overhead pte-based reverse mapping scheme. - * This is kept modular because we may want to experiment - * with object-based reverse mapping schemes. Please try - * to keep this thing as modular as possible. + * anonymous methods by Andrea Arcangeli <andrea@suse.de> 2004 + * inode methods by Dave McCracken <dmccr@us.ibm.com> 2003, 2004 */ /* - * Locking: - * - the page->pte.chain is protected by the PG_chainlock bit, - * which nests within the the mm->page_table_lock, - * which nests within the page lock. 
- * - because swapout locking is opposite to the locking order - * in the page fault path, the swapout path uses trylocks - * on the mm->page_table_lock - */ -#include <linux/mm.h> -#include <linux/pagemap.h> -#include <linux/swap.h> -#include <linux/swapops.h> -#include <linux/slab.h> -#include <linux/init.h> -#include <linux/rmap.h> -#include <linux/cache.h> -#include <linux/percpu.h> - -#include <asm/pgalloc.h> -#include <asm/rmap.h> -#include <asm/tlb.h> -#include <asm/tlbflush.h> - -/* #define DEBUG_RMAP */ - -/* - * Shared pages have a chain of pte_chain structures, used to locate - * all the mappings to this page. We only need a pointer to the pte - * here, the page struct for the page table page contains the process - * it belongs to and the offset within that process. - * - * We use an array of pte pointers in this structure to minimise cache misses - * while traversing reverse maps. - */ -#define NRPTE ((L1_CACHE_BYTES - sizeof(unsigned long))/sizeof(pte_addr_t)) - -/* - * next_and_idx encodes both the address of the next pte_chain and the - * offset of the highest-index used pte in ptes[]. + * try_to_unmap/page_referenced/page_add_rmap/page_remove_rmap + * inherit from the rmap design mm/rmap.c under + * Copyright 2001, Rik van Riel <riel@conectiva.com.br> + * Released under the General Public License (GPL). 
*/ -struct pte_chain { - unsigned long next_and_idx; - pte_addr_t ptes[NRPTE]; -} ____cacheline_aligned; - -kmem_cache_t *pte_chain_cache; -static inline struct pte_chain *pte_chain_next(struct pte_chain *pte_chain) -{ - return (struct pte_chain *)(pte_chain->next_and_idx & ~NRPTE); -} - -static inline struct pte_chain *pte_chain_ptr(unsigned long pte_chain_addr) -{ - return (struct pte_chain *)(pte_chain_addr & ~NRPTE); -} - -static inline int pte_chain_idx(struct pte_chain *pte_chain) -{ - return pte_chain->next_and_idx & NRPTE; -} - -static inline unsigned long -pte_chain_encode(struct pte_chain *pte_chain, int idx) -{ - return (unsigned long)pte_chain | idx; -} - -/* - * pte_chain list management policy: - * - * - If a page has a pte_chain list then it is shared by at least two processes, - * because a single sharing uses PageDirect. (Well, this isn't true yet, - * coz this code doesn't collapse singletons back to PageDirect on the remove - * path). - * - A pte_chain list has free space only in the head member - all succeeding - * members are 100% full. - * - If the head element has free space, it occurs in its leading slots. - * - All free space in the pte_chain is at the start of the head member. - * - Insertion into the pte_chain puts a pte pointer in the last free slot of - * the head member. - * - Removal from a pte chain moves the head pte of the head member onto the - * victim pte and frees the head member if it became empty. - */ +#include <linux/mm.h> -/** - ** VM stuff below this comment - **/ +kmem_cache_t * anon_vma_cachep; /** * find_pte - Find a pte pointer given a vma and a struct page. @@ -157,17 +79,17 @@ out: } /** - * page_referenced_obj_one - referenced check for object-based rmap + * page_referenced_inode_one - referenced check for object-based rmap * @vma: the vma to look in. * @page: the page we're working on. * * Find a pte entry for a page/vma pair, then check and clear the referenced * bit. 
* - * This is strictly a helper function for page_referenced_obj. + * This is strictly a helper function for page_referenced_inode. */ static int -page_referenced_obj_one(struct vm_area_struct *vma, struct page *page) +page_referenced_inode_one(struct vm_area_struct *vma, struct page *page) { struct mm_struct *mm = vma->vm_mm; pte_t *pte; @@ -188,11 +110,11 @@ page_referenced_obj_one(struct vm_area_s } /** - * page_referenced_obj_one - referenced check for object-based rmap + * page_referenced_inode_one - referenced check for object-based rmap * @page: the page we're checking references on. * * For an object-based mapped page, find all the places it is mapped and - * check/clear the referenced flag. This is done by following the page->mapping + * check/clear the referenced flag. This is done by following the page->as.mapping * pointer, then walking the chain of vmas it holds. It returns the number * of references it found. * @@ -202,29 +124,54 @@ page_referenced_obj_one(struct vm_area_s * assume a reference count of 1. 
*/ static int -page_referenced_obj(struct page *page) +page_referenced_inode(struct page *page) { - struct address_space *mapping = page->mapping; + struct address_space *mapping = page->as.mapping; struct vm_area_struct *vma; - int referenced = 0; + int referenced; - if (!page->pte.mapcount) + if (!page->mapcount) return 0; - if (!mapping) - BUG(); + BUG_ON(!mapping); + BUG_ON(PageSwapCache(page)); - if (PageSwapCache(page)) - BUG(); + if (down_trylock(&mapping->i_shared_sem)) + return 1; + + referenced = 0; + + list_for_each_entry(vma, &mapping->i_mmap, shared) + referenced += page_referenced_inode_one(vma, page); + + list_for_each_entry(vma, &mapping->i_mmap_shared, shared) + referenced += page_referenced_inode_one(vma, page); + + up(&mapping->i_shared_sem); + + return referenced; +} + +static int page_referenced_anon(struct page *page) +{ + int referenced; + + if (!page->mapcount) + return 0; + + BUG_ON(!mapping); + BUG_ON(PageSwapCache(page)); if (down_trylock(&mapping->i_shared_sem)) return 1; - + + referenced = 0; + list_for_each_entry(vma, &mapping->i_mmap, shared) - referenced += page_referenced_obj_one(vma, page); + referenced += page_referenced_inode_one(vma, page); list_for_each_entry(vma, &mapping->i_mmap_shared, shared) - referenced += page_referenced_obj_one(vma, page); + referenced += page_referenced_inode_one(vma, page); up(&mapping->i_shared_sem); @@ -244,7 +191,6 @@ page_referenced_obj(struct page *page) */ int fastcall page_referenced(struct page * page) { - struct pte_chain *pc; int referenced = 0; if (page_test_and_clear_young(page)) @@ -253,209 +199,179 @@ int fastcall page_referenced(struct page if (TestClearPageReferenced(page)) referenced++; - if (!PageAnon(page)) { - referenced += page_referenced_obj(page); - goto out; - } - if (PageDirect(page)) { - pte_t *pte = rmap_ptep_map(page->pte.direct); - if (ptep_test_and_clear_young(pte)) - referenced++; - rmap_ptep_unmap(pte); - } else { - int nr_chains = 0; + if (!PageAnon(page)) + referenced 
+= page_referenced_inode(page); + else + referenced += page_referenced_anon(page); - /* Check all the page tables mapping this page. */ - for (pc = page->pte.chain; pc; pc = pte_chain_next(pc)) { - int i; - - for (i = pte_chain_idx(pc); i < NRPTE; i++) { - pte_addr_t pte_paddr = pc->ptes[i]; - pte_t *p; - - p = rmap_ptep_map(pte_paddr); - if (ptep_test_and_clear_young(p)) - referenced++; - rmap_ptep_unmap(p); - nr_chains++; - } - } - if (nr_chains == 1) { - pc = page->pte.chain; - page->pte.direct = pc->ptes[NRPTE-1]; - SetPageDirect(page); - pc->ptes[NRPTE-1] = 0; - __pte_chain_free(pc); - } - } -out: return referenced; } +/* this needs the page->flags PG_map_lock held */ +static void inline anon_vma_page_link(struct page * page, struct vm_area_struct * vma) +{ + BUG_ON(page->mapcount != 1); + BUG_ON(PageDirect(page)); + + SetPageDirect(page); + page->as.vma = vma; +} + +/* this needs the page->flags PG_map_lock held */ +static void inline anon_vma_page_link_fork(struct page * page, struct vm_area_struct * vma, + struct vm_area_struct * orig_vma, anon_vma_t ** anon_vma) +{ + anon_vma_t * anon_vma = orig_vma->anon_vma; + + BUG_ON(page->mapcount <= 1); + BUG_ON(!PageDirect(page)); + + if (!anon_vma) { + anon_vma = *anon_vma; + *anon_vma = NULL; + + /* it's single threaded here, avoid the anon_vma->anon_vma_lock */ + list_add(&vma->anon_vma_node, &anon_vma->anon_vma_head); + list_add(&orig_vma->anon_vma_node, &anon_vma->anon_vma_head); + + orig_vma->anon_vma = vma->anon_vma = anon_vma; + } else { + /* multithreaded here, anon_vma existed already in other mm */ + spin_lock(&anon_vma->anon_vma_lock); + list_add(&vma->anon_vma_node, &anon_vma->anon_vma_head); + spin_unlock(&anon_vma->anon_vma_lock); + } + + ClearPageDirect(page); + page->as.anon_vma = anon_vma; +} + /** * page_add_rmap - add reverse mapping entry to a page * @page: the page to add the mapping to - * @ptep: the page table entry mapping this page + * @vma: the vma that is covering the page * * Add a new 
pte reverse mapping to a page. - * The caller needs to hold the mm->page_table_lock. */ -struct pte_chain * fastcall -page_add_rmap(struct page *page, pte_t *ptep, struct pte_chain *pte_chain) +void fastcall page_add_rmap(struct page *page, struct vm_area_struct * vma) { - pte_addr_t pte_paddr = ptep_to_paddr(ptep); - struct pte_chain *cur_pte_chain; + if (!pfn_valid(page_to_pfn(page)) || PageReserved(page)) + return; - if (PageReserved(page)) - return pte_chain; + page_map_lock(page); - pte_chain_lock(page); + if (!page->mapcount++) + inc_page_state(nr_mapped); - /* - * If this is an object-based page, just count it. We can - * find the mappings by walking the object vma chain for that object. - */ - if (!PageAnon(page)) { - if (!page->mapping) - BUG(); - if (PageSwapCache(page)) - BUG(); - if (!page->pte.mapcount) - inc_page_state(nr_mapped); - page->pte.mapcount++; - goto out; + if (PageAnon(page)) + anon_vma_page_link(page, vma); + else { + /* + * If this is an object-based page, just count it. + * We can find the mappings by walking the object + * vma chain for that object. 
+ */ + BUG_ON(!page->as.mapping); + BUG_ON(PageSwapCache(page)); } - if (page->pte.direct == 0) { - page->pte.direct = pte_paddr; - SetPageDirect(page); + page_map_unlock(page); +} + +/* called from fork() */ +void fastcall page_add_rmap_fork(struct page *page, struct vm_area_struct * vma, + struct vm_area_struct * orig_vma, anon_vma_t ** anon_vma) +{ + if (!pfn_valid(page_to_pfn(page)) || PageReserved(page)) + return; + + page_map_lock(page); + + if (!page->mapcount++) inc_page_state(nr_mapped); - goto out; - } - if (PageDirect(page)) { - /* Convert a direct pointer into a pte_chain */ - ClearPageDirect(page); - pte_chain->ptes[NRPTE-1] = page->pte.direct; - pte_chain->ptes[NRPTE-2] = pte_paddr; - pte_chain->next_and_idx = pte_chain_encode(NULL, NRPTE-2); - page->pte.direct = 0; - page->pte.chain = pte_chain; - pte_chain = NULL; /* We consumed it */ - goto out; + if (PageAnon(page)) + anon_vma_page_link_fork(page, vma, orig_vma, anon_vma); + else { + /* + * If this is an object-based page, just count it. + * We can find the mappings by walking the object + * vma chain for that object. + */ + BUG_ON(!page->as.mapping); + BUG_ON(PageSwapCache(page)); } - cur_pte_chain = page->pte.chain; - if (cur_pte_chain->ptes[0]) { /* It's full */ - pte_chain->next_and_idx = pte_chain_encode(cur_pte_chain, - NRPTE - 1); - page->pte.chain = pte_chain; - pte_chain->ptes[NRPTE-1] = pte_paddr; - pte_chain = NULL; /* We consumed it */ - goto out; + page_map_unlock(page); +} + +/* this needs the page->flags PG_map_lock held */ +static void inline anon_vma_page_unlink(struct page * page) +{ + /* + * Cleanup if this anon page is gone + * as far as the vm is concerned. + */ + if (!page->mapcount) { + page->as.vma = 0; +#if 0 + /* + * The above clears page->as.anon_vma too + * if the page wasn't direct. 
+ */ + page->as.anon_vma = 0; +#endif + ClearPageDirect(page); } - cur_pte_chain->ptes[pte_chain_idx(cur_pte_chain) - 1] = pte_paddr; - cur_pte_chain->next_and_idx--; -out: - pte_chain_unlock(page); - return pte_chain; } /** * page_remove_rmap - take down reverse mapping to a page * @page: page to remove mapping from - * @ptep: page table entry to remove * * Removes the reverse mapping from the pte_chain of the page, * after that the caller can clear the page table entry and free * the page. - * Caller needs to hold the mm->page_table_lock. */ -void fastcall page_remove_rmap(struct page *page, pte_t *ptep) +void fastcall page_remove_rmap(struct page *page) { - pte_addr_t pte_paddr = ptep_to_paddr(ptep); - struct pte_chain *pc; - if (!pfn_valid(page_to_pfn(page)) || PageReserved(page)) return; - pte_chain_lock(page); + page_map_lock(page); if (!page_mapped(page)) goto out_unlock; - /* - * If this is an object-based page, just uncount it. We can - * find the mappings by walking the object vma chain for that object. - */ - if (!PageAnon(page)) { - if (!page->mapping) - BUG(); - if (PageSwapCache(page)) - BUG(); - if (!page->pte.mapcount) - BUG(); - page->pte.mapcount--; - if (!page->pte.mapcount) - dec_page_state(nr_mapped); - goto out_unlock; + if (!--page->mapcount) + dec_page_state(nr_mapped); + + if (PageAnon(page)) + anon_vma_page_unlink(page, vma); + else { + /* + * If this is an object-based page, just uncount it. + * We can find the mappings by walking the object vma + * chain for that object. 
+ */ + BUG_ON(!page->as.mapping); + BUG_ON(PageSwapCache(page)); } - if (PageDirect(page)) { - if (page->pte.direct == pte_paddr) { - page->pte.direct = 0; - ClearPageDirect(page); - goto out; - } - } else { - struct pte_chain *start = page->pte.chain; - struct pte_chain *next; - int victim_i = pte_chain_idx(start); - - for (pc = start; pc; pc = next) { - int i; - - next = pte_chain_next(pc); - if (next) - prefetch(next); - for (i = pte_chain_idx(pc); i < NRPTE; i++) { - pte_addr_t pa = pc->ptes[i]; - - if (pa != pte_paddr) - continue; - pc->ptes[i] = start->ptes[victim_i]; - start->ptes[victim_i] = 0; - if (victim_i == NRPTE-1) { - /* Emptied a pte_chain */ - page->pte.chain = pte_chain_next(start); - __pte_chain_free(start); - } else { - start->next_and_idx++; - } - goto out; - } - } - } -out: - if (page->pte.direct == 0 && page_test_and_clear_dirty(page)) - set_page_dirty(page); - if (!page_mapped(page)) - dec_page_state(nr_mapped); -out_unlock: - pte_chain_unlock(page); + page_map_unlock(page); return; } /** - * try_to_unmap_obj - unmap a page using the object-based rmap method + * try_to_unmap_one - unmap a page using the object-based rmap method * @page: the page to unmap * * Determine whether a page is mapped in a given vma and unmap it if it's found. * - * This function is strictly a helper function for try_to_unmap_obj. + * This function is strictly a helper function for try_to_unmap_inode. */ -static inline int -try_to_unmap_obj_one(struct vm_area_struct *vma, struct page *page) +static int +try_to_unmap_one(struct vm_area_struct *vma, struct page *page) { struct mm_struct *mm = vma->vm_mm; unsigned long address; @@ -477,17 +393,39 @@ try_to_unmap_obj_one(struct vm_area_stru } flush_cache_page(vma, address); - pteval = ptep_get_and_clear(pte); - flush_tlb_page(vma, address); + pteval = ptep_clear_flush(vma, address, pte); + + if (PageSwapCache(page)) { + /* + * Store the swap location in the pte. + * See handle_pte_fault() ... 
+ */ + swp_entry_t entry = { .val = page->index }; + swap_duplicate(entry); + set_pte(pte, swp_entry_to_pte(entry)); + BUG_ON(pte_file(*pte)); + } else { + unsigned long pgidx; + /* + * If a nonlinear mapping then store the file page offset + * in the pte. + */ + pgidx = (address - vma->vm_start) >> PAGE_SHIFT; + pgidx += vma->vm_pgoff; + pgidx >>= PAGE_CACHE_SHIFT - PAGE_SHIFT; + if (page->index != pgidx) { + set_pte(pte, pgoff_to_pte(page->index)); + BUG_ON(!pte_file(*pte)); + } + } if (pte_dirty(pteval)) set_page_dirty(page); - if (!page->pte.mapcount) - BUG(); + BUG_ON(!page->mapcount); mm->rss--; - page->pte.mapcount--; + page->mapcount--; page_cache_release(page); out_unmap: @@ -499,7 +437,7 @@ out: } /** - * try_to_unmap_obj - unmap a page using the object-based rmap method + * try_to_unmap_inode - unmap a page using the object-based rmap method * @page: the page to unmap * * Find all the mappings of a page using the mapping pointer and the vma chains @@ -511,30 +449,26 @@ out: * return a temporary error. 
*/ static int -try_to_unmap_obj(struct page *page) +try_to_unmap_inode(struct page *page) { - struct address_space *mapping = page->mapping; + struct address_space *mapping = page->as.mapping; struct vm_area_struct *vma; int ret = SWAP_AGAIN; - if (!mapping) - BUG(); - - if (PageSwapCache(page)) - BUG(); + BUG_ON(PageSwapCache(page)); if (down_trylock(&mapping->i_shared_sem)) return ret; list_for_each_entry(vma, &mapping->i_mmap, shared) { - ret = try_to_unmap_obj_one(vma, page); - if (ret == SWAP_FAIL || !page->pte.mapcount) + ret = try_to_unmap_one(vma, page); + if (ret == SWAP_FAIL || !page->mapcount) goto out; } list_for_each_entry(vma, &mapping->i_mmap_shared, shared) { - ret = try_to_unmap_obj_one(vma, page); - if (ret == SWAP_FAIL || !page->pte.mapcount) + ret = try_to_unmap_one(vma, page); + if (ret == SWAP_FAIL || !page->mapcount) goto out; } @@ -543,94 +477,33 @@ out: return ret; } -/** - * try_to_unmap_one - worker function for try_to_unmap - * @page: page to unmap - * @ptep: page table entry to unmap from page - * - * Internal helper function for try_to_unmap, called for each page - * table entry mapping a page. Because locking order here is opposite - * to the locking order used by the page fault path, we use trylocks. - * Locking: - * page lock shrink_list(), trylock - * pte_chain_lock shrink_list() - * mm->page_table_lock try_to_unmap_one(), trylock - */ -static int FASTCALL(try_to_unmap_one(struct page *, pte_addr_t)); -static int fastcall try_to_unmap_one(struct page * page, pte_addr_t paddr) -{ - pte_t *ptep = rmap_ptep_map(paddr); - unsigned long address = ptep_to_address(ptep); - struct mm_struct * mm = ptep_to_mm(ptep); - struct vm_area_struct * vma; - pte_t pte; - int ret; - - if (!mm) - BUG(); - - /* - * We need the page_table_lock to protect us from page faults, - * munmap, fork, etc... 
- */ - if (!spin_trylock(&mm->page_table_lock)) { - rmap_ptep_unmap(ptep); - return SWAP_AGAIN; - } - - - /* During mremap, it's possible pages are not in a VMA. */ - vma = find_vma(mm, address); - if (!vma) { - ret = SWAP_FAIL; - goto out_unlock; - } - - /* The page is mlock()d, we cannot swap it out. */ - if (vma->vm_flags & VM_LOCKED) { - ret = SWAP_FAIL; - goto out_unlock; - } +static int +try_to_unmap_anon(struct page * page) +{ + int ret = SWAP_AGAIN; - /* Nuke the page table entry. */ - flush_cache_page(vma, address); - pte = ptep_clear_flush(vma, address, ptep); + page_map_lock(page); - if (PageSwapCache(page)) { - /* - * Store the swap location in the pte. - * See handle_pte_fault() ... - */ - swp_entry_t entry = { .val = page->index }; - swap_duplicate(entry); - set_pte(ptep, swp_entry_to_pte(entry)); - BUG_ON(pte_file(*ptep)); + if (PageDirect(page)) { + vma = page->as.vma; + ret = try_to_unmap_one(page->as.vma, page); } else { - unsigned long pgidx; - /* - * If a nonlinear mapping then store the file page offset - * in the pte. - */ - pgidx = (address - vma->vm_start) >> PAGE_SHIFT; - pgidx += vma->vm_pgoff; - pgidx >>= PAGE_CACHE_SHIFT - PAGE_SHIFT; - if (page->index != pgidx) { - set_pte(ptep, pgoff_to_pte(page->index)); - BUG_ON(!pte_file(*ptep)); + struct vm_area_struct * vma; + anon_vma_t * anon_vma = page->as.anon_vma; + + spin_lock(&anon_vma->anon_vma_lock); + list_for_each_entry(vma, &anon_vma->anon_vma_head, anon_vma_node) { + ret = try_to_unmap_one(vma, page); + if (ret == SWAP_FAIL || !page->mapcount) { + spin_unlock(&anon_vma->anon_vma_lock); + goto out; + } } + spin_unlock(&anon_vma->anon_vma_lock); } - /* Move the dirty bit to the physical page now the pte is gone. 
*/ - if (pte_dirty(pte)) - set_page_dirty(page); - - mm->rss--; - page_cache_release(page); - ret = SWAP_SUCCESS; - -out_unlock: - rmap_ptep_unmap(ptep); - spin_unlock(&mm->page_table_lock); +out: + page_map_unlock(page); return ret; } @@ -650,82 +523,22 @@ int fastcall try_to_unmap(struct page * { struct pte_chain *pc, *next_pc, *start; int ret = SWAP_SUCCESS; - int victim_i; /* This page should not be on the pageout lists. */ - if (PageReserved(page)) - BUG(); - if (!PageLocked(page)) - BUG(); - /* We need backing store to swap out a page. */ - if (!page->mapping) - BUG(); + BUG_ON(PageReserved(page)); + BUG_ON(!PageLocked(page)); /* - * If it's an object-based page, use the object vma chain to find all - * the mappings. + * We need backing store to swap out a page. + * Subtle: this checks for page->as.anon_vma too ;). */ - if (!PageAnon(page)) { - ret = try_to_unmap_obj(page); - goto out; - } + BUG_ON(!page->as.mapping); - if (PageDirect(page)) { - ret = try_to_unmap_one(page, page->pte.direct); - if (ret == SWAP_SUCCESS) { - if (page_test_and_clear_dirty(page)) - set_page_dirty(page); - page->pte.direct = 0; - ClearPageDirect(page); - } - goto out; - } + if (!PageAnon(page)) + ret = try_to_unmap_inode(page); + else + ret = try_to_unmap_anon(page); - start = page->pte.chain; - victim_i = pte_chain_idx(start); - for (pc = start; pc; pc = next_pc) { - int i; - - next_pc = pte_chain_next(pc); - if (next_pc) - prefetch(next_pc); - for (i = pte_chain_idx(pc); i < NRPTE; i++) { - pte_addr_t pte_paddr = pc->ptes[i]; - - switch (try_to_unmap_one(page, pte_paddr)) { - case SWAP_SUCCESS: - /* - * Release a slot. If we're releasing the - * first pte in the first pte_chain then - * pc->ptes[i] and start->ptes[victim_i] both - * refer to the same thing. It works out. 
- */ - pc->ptes[i] = start->ptes[victim_i]; - start->ptes[victim_i] = 0; - victim_i++; - if (victim_i == NRPTE) { - page->pte.chain = pte_chain_next(start); - __pte_chain_free(start); - start = page->pte.chain; - victim_i = 0; - } else { - start->next_and_idx++; - } - if (page->pte.direct == 0 && - page_test_and_clear_dirty(page)) - set_page_dirty(page); - break; - case SWAP_AGAIN: - /* Skip this pte, remembering status. */ - ret = SWAP_AGAIN; - continue; - case SWAP_FAIL: - ret = SWAP_FAIL; - goto out; - } - } - } -out: if (!page_mapped(page)) { dec_page_state(nr_mapped); ret = SWAP_SUCCESS; @@ -733,176 +546,30 @@ out: return ret; } -/** - * page_convert_anon - Convert an object-based mapped page to pte_chain-based. - * @page: the page to convert - * - * Find all the mappings for an object-based page and convert them - * to 'anonymous', ie create a pte_chain and store all the pte pointers there. - * - * This function takes the address_space->i_shared_sem, sets the PageAnon flag, - * then sets the mm->page_table_lock for each vma and calls page_add_rmap. This - * means there is a period when PageAnon is set, but still has some mappings - * with no pte_chain entry. This is in fact safe, since page_remove_rmap will - * simply not find it. try_to_unmap might erroneously return success, but it - * will never be called because the page_convert_anon() caller has locked the - * page. - * - * page_referenced() may fail to scan all the appropriate pte's and may return - * an inaccurate result. This is so rare that it does not matter. +/* + * No more VM stuff below this comment, only anon_vma helper + * functions. 
*/ -int page_convert_anon(struct page *page) -{ - struct address_space *mapping; - struct vm_area_struct *vma; - struct pte_chain *pte_chain = NULL; - pte_t *pte; - int err = 0; - - mapping = page->mapping; - if (mapping == NULL) - goto out; /* truncate won the lock_page() race */ - - down(&mapping->i_shared_sem); - pte_chain_lock(page); - - /* - * Has someone else done it for us before we got the lock? - * If so, pte.direct or pte.chain has replaced pte.mapcount. - */ - if (PageAnon(page)) { - pte_chain_unlock(page); - goto out_unlock; - } - - SetPageAnon(page); - if (page->pte.mapcount == 0) { - pte_chain_unlock(page); - goto out_unlock; - } - /* This is gonna get incremented by page_add_rmap */ - dec_page_state(nr_mapped); - page->pte.mapcount = 0; - - /* - * Now that the page is marked as anon, unlock it. page_add_rmap will - * lock it as necessary. - */ - pte_chain_unlock(page); - - list_for_each_entry(vma, &mapping->i_mmap, shared) { - if (!pte_chain) { - pte_chain = pte_chain_alloc(GFP_KERNEL); - if (!pte_chain) { - err = -ENOMEM; - goto out_unlock; - } - } - spin_lock(&vma->vm_mm->page_table_lock); - pte = find_pte(vma, page, NULL); - if (pte) { - /* Make sure this isn't a duplicate */ - page_remove_rmap(page, pte); - pte_chain = page_add_rmap(page, pte, pte_chain); - pte_unmap(pte); - } - spin_unlock(&vma->vm_mm->page_table_lock); - } - list_for_each_entry(vma, &mapping->i_mmap_shared, shared) { - if (!pte_chain) { - pte_chain = pte_chain_alloc(GFP_KERNEL); - if (!pte_chain) { - err = -ENOMEM; - goto out_unlock; - } - } - spin_lock(&vma->vm_mm->page_table_lock); - pte = find_pte(vma, page, NULL); - if (pte) { - /* Make sure this isn't a duplicate */ - page_remove_rmap(page, pte); - pte_chain = page_add_rmap(page, pte, pte_chain); - pte_unmap(pte); - } - spin_unlock(&vma->vm_mm->page_table_lock); - } - -out_unlock: - pte_chain_free(pte_chain); - up(&mapping->i_shared_sem); -out: - return err; -} - -/** - ** No more VM stuff below this comment, only 
pte_chain helper - ** functions. - **/ - -static void pte_chain_ctor(void *p, kmem_cache_t *cachep, unsigned long flags) -{ - struct pte_chain *pc = p; - - memset(pc, 0, sizeof(*pc)); -} - -DEFINE_PER_CPU(struct pte_chain *, local_pte_chain) = 0; -/** - * __pte_chain_free - free pte_chain structure - * @pte_chain: pte_chain struct to free - */ -void __pte_chain_free(struct pte_chain *pte_chain) +static void +anon_vma_ctor(void *data, kmem_cache_t *cachep, unsigned long flags) { - struct pte_chain **pte_chainp; - - pte_chainp = &get_cpu_var(local_pte_chain); - if (pte_chain->next_and_idx) - pte_chain->next_and_idx = 0; - if (*pte_chainp) - kmem_cache_free(pte_chain_cache, *pte_chainp); - *pte_chainp = pte_chain; - put_cpu_var(local_pte_chain); -} + if ((flags & (SLAB_CTOR_VERIFY|SLAB_CTOR_CONSTRUCTOR)) == + SLAB_CTOR_CONSTRUCTOR) { + anon_vma_t * anon_vma = (anon_vma_t *) data; -/* - * pte_chain_alloc(): allocate a pte_chain structure for use by page_add_rmap(). - * - * The caller of page_add_rmap() must perform the allocation because - * page_add_rmap() is invariably called under spinlock. Often, page_add_rmap() - * will not actually use the pte_chain, because there is space available in one - * of the existing pte_chains which are attached to the page. So the case of - * allocating and then freeing a single pte_chain is specially optimised here, - * with a one-deep per-cpu cache. 
 */
-struct pte_chain *pte_chain_alloc(int gfp_flags)
-{
-	struct pte_chain *ret;
-	struct pte_chain **pte_chainp;
-
-	might_sleep_if(gfp_flags & __GFP_WAIT);
-
-	pte_chainp = &get_cpu_var(local_pte_chain);
-	if (*pte_chainp) {
-		ret = *pte_chainp;
-		*pte_chainp = NULL;
-		put_cpu_var(local_pte_chain);
-	} else {
-		put_cpu_var(local_pte_chain);
-		ret = kmem_cache_alloc(pte_chain_cache, gfp_flags);
+		spin_lock_init(&anon_vma->anon_vma_lock);
+		INIT_LIST_HEAD(&anon_vma->anon_vma_head);
 	}
-	return ret;
 }

-void __init pte_chain_init(void)
+void __init anon_vma_init(void)
 {
-	pte_chain_cache = kmem_cache_create( "pte_chain",
-						sizeof(struct pte_chain),
-						0,
-						SLAB_MUST_HWCACHE_ALIGN,
-						pte_chain_ctor,
-						NULL);
+	/* this is intentonally not hw aligned to avoid wasting ram */
+	anon_vma_cachep = kmem_cache_create("anon_vma",
+					    sizeof(anon_vma_t), 0, 0,
+					    anon_vma_ctor, NULL);

-	if (!pte_chain_cache)
-		panic("failed to create pte_chain cache!\n");
+	if(!anon_vma_cachep)
+		panic("Cannot create anon_vma SLAB cache");
 }
--- sles-anobjrmap-2/mm/Makefile.~1~	2004-02-29 17:47:30.000000000 +0100
+++ sles-anobjrmap-2/mm/Makefile	2004-03-10 20:26:16.000000000 +0100
@@ -4,7 +4,7 @@

 mmu-y		:= nommu.o
 mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
-			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
+			   mlock.o mmap.o mprotect.o mremap.o msync.o objrmap.o \
 			   shmem.o vmalloc.o

 obj-y	:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-11  6:52 ` anon_vma RFC2 Andrea Arcangeli
@ 2004-03-11 13:23   ` Hugh Dickins
  2004-03-11 13:56     ` Andrea Arcangeli
  ` (2 more replies)
  0 siblings, 3 replies; 74+ messages in thread
From: Hugh Dickins @ 2004-03-11 13:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Andrew Morton, torvalds, linux-kernel,
	William Lee Irwin III

Hi Andrea,

On Thu, 11 Mar 2004, Andrea Arcangeli wrote:
>
> this is the full current status of my anon_vma work. Now fork() and all
> the other page_add/remove_rmap in memory.c plus the paging routines
> seems fully covered and I'm now dealing with the vma merging and the
> anon_vma garbage collection (the latter is easy but I need to track all
> the kmem_cache_free).

I'm still making my way through all the relevant mails, and have not even
glanced at your code yet: I hope to later today.  But to judge by the
length of your essay on vma merging, it strikes me that you've taken a
wrong direction in switching from my anon mm to your anon vma.

Go by vmas and you have tiresome problems as they are split and merged,
very commonly.  Plus you have the overhead of a new data structure per
vma.  If your design magicked those problems away somehow, okay, but it
seems you're finding issues with it: I think you should go back to anon
mms.

Go by mms, and there's only the exceedingly rare (does it ever occur
outside our testing?) awkward case of tracking pages in a private anon
vma inherited from the parent, when parent or child mremaps it with
MAYMOVE.  I reused the pte_chain code for that, but it's probably better
done by conjuring up an imaginary tmpfs object as backing at that point
(that has its own little cost, since the object lives on at full size
until all its mappers unmap it, however small the portion they have
mapped).  And the overhead of the new data structure is per mm only.

I'll get back to reading through the mails now: sorry if I'm about to
find the arguments against anonmm in my reading.
(By the way, several times you mention the size of a 2.6 struct page as larger than a 2.4 struct page: no, thanks to wli and others it's the 2.6 that's smaller.)

Hugh
* Re: anon_vma RFC2
  2004-03-11 13:23             ` Hugh Dickins
@ 2004-03-11 13:56               ` Andrea Arcangeli
  2004-03-11 21:54                 ` Hugh Dickins
  2004-03-12  3:28                 ` Rik van Riel
  1 sibling, 2 replies; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-11 13:56 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Ingo Molnar, Andrew Morton, torvalds, linux-kernel,
	William Lee Irwin III

Hi Hugh,

On Thu, Mar 11, 2004 at 01:23:24PM +0000, Hugh Dickins wrote:
> Hi Andrea,
>
> On Thu, 11 Mar 2004, Andrea Arcangeli wrote:
> >
> > this is the full current status of my anon_vma work. Now fork() and all
> > the other page_add/remove_rmap in memory.c plus the paging routines
> > seems fully covered and I'm now dealing with the vma merging and the
> > anon_vma garbage collection (the latter is easy but I need to track all
> > the kmem_cache_free).
>
> I'm still making my way through all the relevant mails, and not even
> glanced at your code yet: I hope later today. But to judge by the
> length of your essay on vma merging, it strikes me that you've taken
> a wrong direction in switching from my anon mm to your anon vma.
>
> Go by vmas and you have tiresome problems as they are split and merged,
> very commonly. Plus you have the overhead of a new data structure per vma.

it's more complicated because it's more fine-grained and it can handle mremap too. I mean, the additional cost of tracking the vmas pays off because then we've a tiny list of vmas to search for every page; otherwise, with the mm-wide model, we'd need to search all of the vmas in an mm. This is quite important during swapping with tons of vmas. Note that in my common case the page will point directly to the vma (PageDirect(page) == 1), no find_vma or whatever needed in between.

the per-vma overhead is 12 bytes: 2 pointers for the list node and 1 pointer to the anon-vma. As said above it provides several advantages, but you're certainly right that the mm approach had no per-vma overhead.
I'm quite convinced the anon_vma is the optimal design, though it's not running yet ;). However it's close to compiling. The whole vma and page layer is finished (including the vma merging). I'm now dealing with the swapcache stuff and I'm doing it slightly differently from your anobjrmap-2 patch (obviously I also reinstantiate the PG_swapcache bitflag, but the fundamental difference is that I don't drop the swapper_space):

	static inline struct address_space * page_mapping(struct page * page)
	{
		extern struct address_space swapper_space;
		struct address_space * mapping = NULL;

		if (PageSwapCache(page))
			mapping = &swapper_space;
		else if (!PageAnon(page))
			mapping = page->as.mapping;

		return mapping;
	}

I want the same pagecache/swapcache code to work transparently, but I free up the page->index and the page->mapping for the swapcache, so that I can reuse them to track the anon_vma. I think the above is simpler than killing the swapper_space completely as you did. My solution avoids hacks like this:

 	if (mapping && mapping->a_ops && mapping->a_ops->sync_page)
 		return mapping->a_ops->sync_page(page);
+	if (PageSwapCache(page))
+		blk_run_queues();
 	return 0;
 }

it also saves me from reworking set_page_dirty to call __set_page_dirty_buffers by hand. I mean, it's less intrusive. The cpu cost is similar, since I pay for an additional compare in page_mapping, but the code looks cleaner. Could be my opinion only though ;).

> If your design magicked those problems away somehow, okay, but it seems
> you're finding issues with it: I think you should go back to anon mms.

the only issue I found so far is that to track the stuff in a fine-granular way I have to forbid merging sometimes.
note that forbidding merging is a feature too: if I went down the path of a pagetable scan on the vma to fix up all page->as.vma/anon_vma and page->index, I would then lose some historic information on the origin of certain vmas, and I would eventually fall back to the mm-wide information if I did total merging.

I think the probability of forbidden merging is low enough that it doesn't matter. Also it doesn't impact in any way the file merging. It basically merges as well as the file merging does. Right now I'm also not overriding the initial vm_pgoff given to brand new anonymous vmas, but I could, to boost the merging with mremapped segments. Though I don't think it's necessary.

Overall the main reason for keeping track of vmas and not of the mm is to be able to handle mremap as efficiently as with 2.4; I mean, your anobjrmap-5 simply reinstantiates the pte_chains, so the vm then has to deal with both pte_chains and anonmm too.

> Go by mms, and there's only the exceedingly rare (does it ever occur
> outside our testing?) awkward case of tracking pages in a private anon
> vma inherited from parent, when parent or child mremaps it with MAYMOVE.
>
> Which I reused the pte_chain code for, but it's probably better done
> by conjuring up an imaginary tmpfs object as backing at that point
> (that has its own little cost, since the object lives on at full size
> until all its mappers unmap it, however small the portion they have
> mapped). And the overhead of the new data structure is per mm only.
>
> I'll get back to reading through the mails now: sorry if I'm about to
> find the arguments against anonmm in my reading. (By the way, several
> times you mention the size of a 2.6 struct page as larger than a 2.4
> struct page: no, thanks to wli and others it's the 2.6 that's smaller.)

really? mainline 2.6 has the same size as mainline 2.4 (48 bytes), or am I counting wrong?
(at least my 2.4-aa tree is 48 bytes too, but I think 2.4 mainline is as well.) objrmap adds 4 bytes (goes to 52 bytes), my patch removes 8 bytes (i.e. the pte_chain), and the result of my patch is 4 bytes less than 2.4 and 2.6 (44 bytes instead of 48 bytes). I wanted to nuke the mapcount too but that destroys the nr_mapped info, and that spreads all over, so for now I keep the page->mapcount ;)
* Re: anon_vma RFC2
  2004-03-11 13:56               ` Andrea Arcangeli
@ 2004-03-11 21:54                 ` Hugh Dickins
  2004-03-12  1:47                   ` Andrea Arcangeli
  1 sibling, 1 reply; 74+ messages in thread
From: Hugh Dickins @ 2004-03-11 21:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds,
	William Lee Irwin III, linux-kernel

On Thu, 11 Mar 2004, Andrea Arcangeli wrote:
> On Thu, Mar 11, 2004 at 01:23:24PM +0000, Hugh Dickins wrote:
> >
> > Go by vmas and you have tiresome problems as they are split and merged,
> > very commonly. Plus you have the overhead of a new data structure per vma.
>
> it's more complicated because it's more fine-grained and it can handle
> mremap too. I mean, the additional cost of tracking the vmas pays off
> because then we've a tiny list of vmas to search for every page,
> otherwise with the mm-wide model we'd need to search all of the vmas in
> an mm. This is quite important during swapping with tons of vmas. Note
> that in my common case the page will point directly to the vma
> (PageDirect(page) == 1), no find_vma or whatever needed in between.

Nice if you can avoid the find_vma, but it is (or was) used in the objrmap case, so I was happy to have it in the anobj case also.

Could you post a patch against 2.6.3 or 2.6.4? Your objrmap patch applies with offsets, no problem, but your anobjrmap patch doesn't apply cleanly on top of that, partly because you've renamed files in between (revert that?), but there seem to be other untracked changes too. I may not be seeing the whole story right.

Great to see the pte_chains gone, but I find what you have for anon vmas strangely complicated: the continued existence of PageDirect etc. I guess, having elected to go by vmas, you're trying to avoid some of the overhead until fork. But that does make it messy to my eyes, the anonmm way much cleaner and simpler in that regard.
> I want the same pagecache/swapcache code to work transparently, but I
> free up the page->index and the page->mapping for the swapcache, so that
> I can reuse it to track the anon_vma. I think the above is simpler than
> killing the swapper_space completely as you did. My solution avoids me
> hacks like this:
>
> 	if (mapping && mapping->a_ops && mapping->a_ops->sync_page)
> 		return mapping->a_ops->sync_page(page);
> +	if (PageSwapCache(page))
> +		blk_run_queues();
> 	return 0;
> }
>
> it also avoids me rework set_page_dirty to call __set_page_dirty_buffers
> by hand too. I mean, it's less intrusive.

There may well be better ways of reassigning the page struct fields than I had, making for less extensive changes, yes. Best to go with the least intrusive for now (so long as not too ugly) and reappraise later.

> Overall the main reason for keeping track of vmas and not of
> the mm, is to be able to handle mremap as efficiently as with 2.4, I mean
> your anobjrmap-5 simply reinstantiates the pte_chains, so the vm then has
> to deal with both pte_chains and anonmm too.

Yes, I used pte_chains for that because we hadn't worked out how to do remap_file_pages without them (I've not yet looked into how you're handling those), so might as well put them to use here too. But if nonlinear is now relieved of pte_chains, great, and as I said below, the anonmm mremap case should be able to conjure a tmpfs backing object - which probably amounts to your anon_vma, but only needed in that one odd case, anon mm sufficient for all the rest, less overhead all round.

> > Go by mms, and there's only the exceedingly rare (does it ever occur
> > outside our testing?) awkward case of tracking pages in a private anon
> > vma inherited from parent, when parent or child mremaps it with MAYMOVE.
> >
> > Which I reused the pte_chain code for, but it's probably better done
> > by conjuring up an imaginary tmpfs object as backing at that point
> > (that has its own little cost, since the object lives on at full size
> > until all its mappers unmap it, however small the portion they have
> > mapped). And the overhead of the new data structure is per mm only.
> >
> > I'll get back to reading through the mails now: sorry if I'm about to
> > find the arguments against anonmm in my reading. (By the way, several
> > times you mention the size of a 2.6 struct page as larger than a 2.4
> > struct page: no, thanks to wli and others it's the 2.6 that's smaller.)
>
> really? mainline 2.6 has the same size as mainline 2.4 (48 bytes), or
> am I counting wrong? (at least my 2.4-aa tree is 48 bytes too, but I
> think 2.4 mainline is as well.) objrmap adds 4 bytes (goes to 52 bytes), my patch
> removes 8 bytes (i.e. the pte_chain) and the result of my patch is 4
> bytes less than 2.4 and 2.6 (44 bytes instead of 48 bytes). I wanted to
> nuke the mapcount too but that destroys the nr_mapped info, and that
> spreads all over so for now I keep the page->mapcount ;)

I think you were counting wrong. Mainline 2.4 i386 48 bytes, agreed. Mainline 2.6 i386 40 bytes, or 44 bytes if PAE & HIGHPTE. And today, 2.6.4-mm1 i386 32 bytes, or 36 bytes if PAE & HIGHPTE. Though of course the vanished fields will often be countered by memory usage elsewhere.

Yes, keep mapcount for now: I went around that same loop, it surely has the feel of something that can be disposed of in the end, but there's no need to attempt that while doing this objrmap job, it's better done after since it needs a different kind of care.

(Be aware that shmem_writepage will do the wrong thing, COWing what should be a shared page, if it is ever given a still-mapped page: but no need to worry about that now, and it may be easy to work it differently once the rmap changes settle down.
As to shmem_writepage going directly to swap, by the way: I'm perfectly happy for you to make that change, but I don't believe the old way was mistaken - it intentionally gave tmpfs pages which should remain in memory another go around. I was never convinced one way or the other: but the current code works very badly for some loads, as you found, I doubt there are any that will suffer so greatly from the change, so go ahead.)

Hugh
* Re: anon_vma RFC2
  2004-03-11 21:54                 ` Hugh Dickins
@ 2004-03-12  1:47                   ` Andrea Arcangeli
  2004-03-12  2:20                     ` Andrea Arcangeli
  0 siblings, 1 reply; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-12 1:47 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds,
	William Lee Irwin III, linux-kernel

On Thu, Mar 11, 2004 at 09:54:01PM +0000, Hugh Dickins wrote:
> Could you post a patch against 2.6.3 or 2.6.4? Your objrmap patch

I uploaded my latest status. There are three patches: the first is Dave's objrmap, the second is your anobjrmap-1, the third is my anon_vma work that removes the pte_chains all over the kernel.

my patch is not stable yet, it crashes during swapping and the debugging code catches bugs even before swapping (which is good):

 0  0      0 404468  11900  41276    0    0     0     0 1095    61  0  0 100  0
 0  0      0 404468  11900  41276    0    0     0     0 1108    71  0  0 100  0
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0      0 404468  11908  41268    0    0     0   136 1102    59  0  0 100  0
 1  0      0 310972  11908  41268    0    0     0     0 1100    50  2  7 91  0
 1  0      0  66748  11908  41268    0    0     0     0 1085    30  6 19 75  0
 1  1    128   2648    216  14132    0  128     0   256 1118   139  3 16 73  8
 1  2  77084   1332    232   2188    0 76952   308 76952 1162   255  1 10 54 35

I hope to make it work tomorrow, then the next two things to do are the pagetable walk in the nonlinear case (currently it's pinned) and the rbtree (or prio_tree) for the i_mmap{,shared}. Then it will be complete and mergeable.

http://www.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.3/objrmap
* Re: anon_vma RFC2
  2004-03-12  1:47                   ` Andrea Arcangeli
@ 2004-03-12  2:20                     ` Andrea Arcangeli
  0 siblings, 0 replies; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-12 2:20 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds,
	William Lee Irwin III, linux-kernel

On Fri, Mar 12, 2004 at 02:47:10AM +0100, Andrea Arcangeli wrote:
> my patch is not stable yet, it crashes during swapping and the debugging
> code catches bugs even before swapping (which is good):

I fixed some more bugs (s/index/private), it's not stable yet but some basic swapping works now (there is probably some issue with shared swapcache still, since ps just oopsed, and ps may be sharing COW swapcache through fork).

 0  0      0 408712   7800  41160    0    0     0     0 1131    46  0  0 95  5
 0  0      0 408712   7800  41160    0    0     0     0 1102    64  0  0 100  0
 0  0      0 408712   7800  41160    0    0     0     0 1090    40  0  0 100  0
 0  0      0 408712   7800  41160    0    0     0     0 1107    84  0  0 100  0
 0  0      0 408712   7808  41152    0    0     0    84 1101    66  0  0 100  0
 0  0      0 408712   7808  41152    0    0     0     0 1096    52  0  0 100  0
 1  0      0 264808   7808  41152    0    0     0     0 1093    49  5 16 79  0
 1  0      0  51636   7808  41152    0    0     0     0 1083    34  5 20 75  0
 1  1    128   2384    212  14068    0  128     0   204 1106   178  1  7 73 19
 1  2  82824   2332    200   2136   32 82668    40 82668 1221  1955  1 12 49 38
 1  2 130000   2448    208   1868   32 47048   312 47048 1184   782  0  5 60 35
 0  3 178700   1676    208   2428 10388 48700 11000 48700 1536  1291  0  4 55 40
 0  3 205996   1780    216   1992 4264 27224  4424 27224 1312   549  1  4 41 55
 2  2 238900   4148    240   2388   88 32980   684 32984 1190  1380  1  6 23 69
 0  3 295124   1996    244   2392   92 56148   232 56148 1223   149  1  6 38 54
 0  2 315204   2036    244   2356    0 19972     0 19972 1172    55  1  2 52 45
 1  0 334052   3924    264   2592  192 18720   372 18720 1205   154  0  1 35 63
 0  3 377208   2324    264   1928   64 42984    64 42984 1249   208  2  6 39 53
 0  1 389856   3408    264   2032  128 12680   224 12680 1187   159  0  1 60 38
 0  0 374032 263036    316   3504  920    0  2464     0 1258   224  0  2 76 23
 0  0 374032 263036    316   3504    0    0     0     0 1087    27  0  0 100  0
 0  0 374032 263036    316   3504    0    0     0     0 1083    25  0  0 100  0
 0  0 374032 263040    316   3504    0    0     0     0 1086    25  0  0 100  0
 0  0 374032 263040    316   3504    0    0     0     0 1084    27  0  0 100  0
 0  0 374032 263128    316   3504    0    0     0     0 1086    23  0  0 100  0
 0  0 374032 263164    316   3472   32    0    32     0 1086    23  0  0 100  0
 0  0 374032 263212    316   3508   32    0    32     0 1086    25  0  0 100  0

I uploaded a new anon_vma patch in the same directory with the fixes to make the basic swapping work. Tomorrow I'll look into the ps oops and into heavy COW loads.
* Re: anon_vma RFC2
  2004-03-11 13:56               ` Andrea Arcangeli
  2004-03-11 21:54                 ` Hugh Dickins
@ 2004-03-12  3:28                 ` Rik van Riel
  2004-03-12 12:21                   ` Andrea Arcangeli
  1 sibling, 1 reply; 74+ messages in thread
From: Rik van Riel @ 2004-03-12 3:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel,
	William Lee Irwin III

On Thu, 11 Mar 2004, Andrea Arcangeli wrote:

> it's more complicated because it's more fine-grained and it can handle
> mremap too. I mean, the additional cost of tracking the vmas pays off
> because then we've a tiny list of vmas to search for every page,
> otherwise with the mm-wide model we'd need to search all of the vmas in
> an mm.

Actually, with the code Rajesh is working on there's no search problem with Hugh's idea.

Considering the fact that we'll need Rajesh's code anyway, to deal with Ingo's test program and the real world programs that do similar things, I don't see how your objection to Hugh's code is still valid.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
* Re: anon_vma RFC2
  2004-03-12  3:28                 ` Rik van Riel
@ 2004-03-12 12:21                   ` Andrea Arcangeli
  2004-03-12 12:40                     ` Rik van Riel
  2004-03-12 12:42                     ` Andrea Arcangeli
                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-12 12:21 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel,
	William Lee Irwin III

On Thu, Mar 11, 2004 at 10:28:42PM -0500, Rik van Riel wrote:
> On Thu, 11 Mar 2004, Andrea Arcangeli wrote:
>
> > it's more complicated because it's more fine-grained and it can handle
> > mremap too. I mean, the additional cost of tracking the vmas pays off
> > because then we've a tiny list of vmas to search for every page,
> > otherwise with the mm-wide model we'd need to search all of the vmas in
> > an mm.
>
> Actually, with the code Rajesh is working on there's
> no search problem with Hugh's idea.

you missed the fact that mremap doesn't work; that's the fundamental reason for the vma tracking, so you can use vm_pgoff.

if you take Hugh's anonmm, mremap will be attaching a persistent dynamic overhead to the vma it touches. Currently it does so in the form of pte_chains; that can be converted to other means of overhead, but I simply don't like it.

I like all vmas to be symmetric to each other, without special hacks to handle mremap right. We have the vm_pgoff to handle mremap and I simply use that.

> Considering the fact that we'll need Rajesh's code
> anyway, to deal with Ingo's test program and the real

Rajesh's code has nothing to do with the mremap breakage; Rajesh's code can only boost the search of the interesting vmas in an anonmm, it doesn't solve mremap.

> world programs that do similar things, I don't see how
> your objection to Hugh's code is still valid.
This was my objection; maybe you didn't read all my emails, so I quote again:

"Overall the main reason for keeping track of vmas and not of the mm is to be able to handle mremap as efficiently as with 2.4; I mean, your anobjrmap-5 simply reinstantiates the pte_chains, so the vm then has to deal with both pte_chains and anonmm too."

As said, one can convert the pte_chains to other means of overhead, but still it's a hack and you'll need transient objects to track those if you don't track fine-grained by vma as I'm doing.

It's not that I didn't read the anonmm patches from Hugh, I spent lots of time on those; they just were flawed and they couldn't handle mremap, as he very well knows, see anobjrmap-5 for instance.

the vma merging isn't a problem, we need to rework the code anyways to allow the file merging in both mprotect and mremap (currently only mmap is capable of merging files, and in turn it's also the only one capable of merging anon_vmas). Any merging code that is currently capable of merging files is easy to teach about anon_vmas too, it's basically the same problem as merging.
* Re: anon_vma RFC2
  2004-03-12 12:21                   ` Andrea Arcangeli
@ 2004-03-12 12:40                     ` Rik van Riel
  2004-03-12 13:11                       ` Andrea Arcangeli
  2004-03-12 12:42                     ` Andrea Arcangeli
                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 74+ messages in thread
From: Rik van Riel @ 2004-03-12 12:40 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel,
	William Lee Irwin III

On Fri, 12 Mar 2004, Andrea Arcangeli wrote:
> On Thu, Mar 11, 2004 at 10:28:42PM -0500, Rik van Riel wrote:

> > Actually, with the code Rajesh is working on there's
> > no search problem with Hugh's idea.
>
> you missed the fact that mremap doesn't work; that's the fundamental reason
> for the vma tracking, so you can use vm_pgoff.
>
> if you take Hugh's anonmm, mremap will be attaching a persistent dynamic
> overhead to the vma it touches. Currently it does so in the form of pte_chains;
> that can be converted to other means of overhead, but I simply don't
> like it.
>
> I like all vmas to be symmetric to each other, without special hacks to
> handle mremap right.
>
> We have the vm_pgoff to handle mremap and I simply use that.

Would it be possible to get rid of that if we attached a struct address_space to each mm_struct after exec(), sharing the address_space between parent and child processes after a fork()?

Note that the page cache can handle up to 2^42 bytes in one address_space on a 32 bit system, so there's more than enough space to be shared between parent and child processes.

Then the vmas can track vm_pgoff inside the address space attached to the mm.

> > Considering the fact that we'll need Rajesh's code
> > anyway, to deal with Ingo's test program and the real
>
> Rajesh's code has nothing to do with the mremap breakage; Rajesh's code
> can only boost the search of the interesting vmas in an anonmm, it
> doesn't solve mremap.

If you mmap a file, then mremap part of that mmap, where's the special case?
> "Overall the main reason for keeping track of vmas and not of
> the mm is to be able to handle mremap as efficiently as with 2.4; I mean,
> your anobjrmap-5 simply reinstantiates the pte_chains, so the vm then has
> to deal with both pte_chains and anonmm too."

Yes, that's a problem indeed. I'm not sure it's fundamental or just an implementation artifact, though...

> the vma merging isn't a problem, we need to rework the code anyways to
> allow the file merging in both mprotect and mremap

Agreed.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
* Re: anon_vma RFC2
  2004-03-12 12:40                     ` Rik van Riel
@ 2004-03-12 13:11                       ` Andrea Arcangeli
  2004-03-12 16:25                         ` Rik van Riel
  0 siblings, 1 reply; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-12 13:11 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel,
	William Lee Irwin III

On Fri, Mar 12, 2004 at 07:40:51AM -0500, Rik van Riel wrote:
> On Fri, 12 Mar 2004, Andrea Arcangeli wrote:
> > On Thu, Mar 11, 2004 at 10:28:42PM -0500, Rik van Riel wrote:
> >
> > > Actually, with the code Rajesh is working on there's
> > > no search problem with Hugh's idea.
> >
> > you missed the fact that mremap doesn't work; that's the fundamental reason
> > for the vma tracking, so you can use vm_pgoff.
> >
> > if you take Hugh's anonmm, mremap will be attaching a persistent dynamic
> > overhead to the vma it touches. Currently it does so in the form of pte_chains;
> > that can be converted to other means of overhead, but I simply don't
> > like it.
> >
> > I like all vmas to be symmetric to each other, without special hacks to
> > handle mremap right.
> >
> > We have the vm_pgoff to handle mremap and I simply use that.
>
> Would it be possible to get rid of that if we attached
> a struct address_space to each mm_struct after exec(),
> sharing the address_space between parent and child
> processes after a fork()?
> Note that the page cache can handle up to 2^42 bytes
> in one address_space on a 32 bit system, so there's
> more than enough space to be shared between parent and
> child processes.
>
> Then the vmas can track vm_pgoff inside the address
> space attached to the mm.

I can't understand, sorry. I don't see what you mean by sharing the same address space between parent and child; whatever _global_ mm-wide address space is screwed by mremap: if you don't use the pg_off to offset the page->index, the vm_start/vm_end means nothing.
I think the anonmm design is flawed and it has no way to handle mremap reasonably well, though feel free to keep doing research on that; I would be happy to use a simpler and more efficient design. I just tried to reuse the anonmm, but it was overly complex in design and inefficient too at dealing with mremap, so I had few doubts I had to change that, and the anon_vma idea solved all the issues with anonmm, so I started coding that.

If you don't track by vmas (like I'm doing), and you allow merging of two different vmas, one touched by mremap and the other not, you'll end up mixing the vm_pgoff and the whole anonmm falls apart, and the tree search falls apart too after you lose the vm_pgoff of the vma that got merged.

Hugh solved this by simply saying that anonmm isn't capable of dealing with mremap, and he used the pte_chains as if it were the rmap vm after the first mremap. That's bad, but whatever solution more efficient than the pte_chains (for example metadata tracking a range, not wasting bytes for every single page in the range like rmap does) will still be a mess in terms of vma merging, tracking and rbtree/prio_tree search too, and it won't obviously be more efficient at all, since you'll still have to use the tree, and in all common cases my design will beat the tree performance (even ignoring the mremap overhead with anonmm).

the way I defer the anon_vma allocation and instantiate direct pages is likewise extremely efficient compared to the anonmm. The only thing I disallow is the merging of two vmas with different anon_vma or different vm_pgoff, but that's a feature: if you don't do that in the anonmm design, you'll have to allocate dynamic structures on top of the vma, tracking partial ranges within each vma, which can be a lot slower and is so messy to deal with that I didn't even remotely consider writing anything like that, when I can use the pgoff with the anon_vma_t.
> > > Considering the fact that we'll need Rajesh's code
> > > anyway, to deal with Ingo's test program and the real
> >
> > Rajesh's code has nothing to do with the mremap breakage; Rajesh's code
> > can only boost the search of the interesting vmas in an anonmm, it
> > doesn't solve mremap.
>
> If you mmap a file, then mremap part of that mmap, where's
> the special case?

you miss that we disallow the merging of vmas with different vm_pgoff if they belong to a file (vma->vm_file != NULL). In fact what my code does is treat the anon vmas similarly to the file vmas, and that's why the merging probability is reduced a little bit. The single fact that anonmm allows merging of all anon vmas as if they were not vma-tracked tells you anonmm is flawed w.r.t. mremap.

Something has to be changed anyways in the vma handling code too (like the vma merging code) even with anonmm, if your objective is to always pass through the vma to reach the pagetables. Hugh solved this by not passing through the vma after the first mremap; that works too of course, but I think my design is more efficient: my whole effort is to avoid allocating per-page overhead and to have a single metadata object (the vma) serving a range of pages. That's a lot more efficient than the pte_chains and it saves a load of ram on both 64bit and 32bit.

to put it another way, the problem you have with anonmm is that after an mremap the page->index becomes invalid, and no, you can't fix up the page->index by looping over all the pages pointed to by the vma, because those page->index will be meaningful to other vmas in other address spaces, where their address is still the original one (the one before fork()).

> > "Overall the main reason for keeping track of vmas and not of
> > the mm is to be able to handle mremap as efficiently as with 2.4; I mean,
> > your anobjrmap-5 simply reinstantiates the pte_chains, so the vm then has
> > to deal with both pte_chains and anonmm too."
>
> Yes, that's a problem indeed.
> I'm not sure it's fundamental
> or just an implementation artifact, though...

I think it's fundamental, but again, if you can find a solution to that it's more than welcome. I just don't see how you can ever handle mremap if you treat all the vmas the same, before and after mremap: if you treat all the vmas the same you lose vm_pgoff, and in turn you break in mremap, and you can forget using the vmas for reaching the pagetables, since you will do nothing with just the vm_start/vm_end and page->index then.

You can still treat all of them the same by allocating dynamic stuff on top of the vma, but that will complicate everything, including the tree search and the vma merging too. So the few lines I had to add to the vma merging to teach the vma layer about the anon_vma should be a whole lot simpler and a whole lot more efficient than the ones you'd have to add to allocate those dynamic objects sitting on top of the vmas and telling you the right pg_off per-range (not to mention the handling of the oom conditions while allocating those dynamic objects in super-spinlocked paths; even the GFP_ATOMIC abuses from the pte_chains were nasty, GFP_ATOMIC should be reserved to irqs and bhs since they've no way to unlock and sleep!...).
* Re: anon_vma RFC2
  2004-03-12 13:11                       ` Andrea Arcangeli
@ 2004-03-12 16:25                         ` Rik van Riel
  2004-03-12 17:13                           ` Andrea Arcangeli
  0 siblings, 1 reply; 74+ messages in thread
From: Rik van Riel @ 2004-03-12 16:25 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel,
	William Lee Irwin III

On Fri, 12 Mar 2004, Andrea Arcangeli wrote:

> I don't see what you mean by sharing the same address space between
> parent and child; whatever _global_ mm-wide address space is screwed by
> mremap: if you don't use the pg_off to offset the page->index, the
> vm_start/vm_end means nothing.

At mremap time, you don't change the page->index at all, but only the vm_start/vm_end.

Think of it as an mm_struct pointing to a struct address_space with its anonymous memory. On exec() the mm_struct gets a new address_space, on fork parent and child share them.

Sharing is good enough, because there is PAGE_SIZE times more space in a struct address_space than there's available virtual memory in one single process. That means that for a daemon like apache every child can simply get its own 4GB subset of the address space for any new VMAs, while mapping the inherited VMAs in the same way any other file is mapped.

> I think the anonmm design is flawed and it has no way to handle
> mremap reasonably well,

There's no difference between mremap() of anonymous memory and mremap() of part of an mmap() range of a file...

At least, there doesn't need to be.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
* Re: anon_vma RFC2 2004-03-12 16:25 ` Rik van Riel @ 2004-03-12 17:13 ` Andrea Arcangeli 2004-03-12 17:23 ` Rik van Riel 0 siblings, 1 reply; 74+ messages in thread From: Andrea Arcangeli @ 2004-03-12 17:13 UTC (permalink / raw) To: Rik van Riel Cc: Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel, William Lee Irwin III On Fri, Mar 12, 2004 at 11:25:27AM -0500, Rik van Riel wrote: > pointing to a struct address_space with its anonymous > memory. On exec() the mm_struct gets a new address_space, > on fork parent and child share them. isn't this what anonmm is already doing? are you suggesting something different? > There's no difference between mremap() of anonymous memory > and mremap() of part of an mmap() range of a file... > > At least, there doesn't need to be. the anonmm simply cannot work because it's not reaching vmas, it only reaches the mm, and with an mm and a virtual address you cannot reach the right vma if it was moved around by mremap; you don't even see any vm_pgoff during the lookup, so there's no way to fix anonmm with a prio_tree. Something in between anon_vma and anonmm that could handle mremap too would have been possible, but it has downsides not fixable with a prio_tree: it consists of queuing all the _vmas_ (not the mm!) into an anon_vma object, and then you have to fix up the vma merging code to forbid merging with different vm_pgoff. That would be like anon_vma but it would not be fine-grained like anon_vma is: you'll end up scanning very old vma segments in other address spaces even though you're working with direct memory now. Such a model (let's call it anon_vma_global) would save 8 bytes per vma of anon_vma objects. Maybe that's the model that DaveM implemented originally?
I think my anon_vma is superior because it's more fine-grained (it also avoids the need of a prio_tree, even if in theory we could stack a prio_tree on top of every anon_vma, but it's really not needed) and the memory usage is minimal anyway (the per-vma memory cost is the same for anon_vma and anon_vma_global, only the total number of anon_vma objects varies). The prio_tree wouldn't fix the intermediate model because the vma ranges could match fine in all address spaces, so you would need the prio_tree adding another 12 bytes to each vma (on top of the 12 bytes added by the anon_vma_global), but the pages would be different because the vma->vm_mm is different and there can be copy-on-writes. This cannot happen with an inode, so the prio_tree fixes the inode case completely while it doesn't fix the anon_vma_global design with only one anon_vma allocated at fork for all children. anon_vma gets that optimally instead (with an 8-byte cost). So overall I think anon_vma is a much better utilization of the 12 bytes; rather than having a prio_tree stacked on top of an anon_vma_global, I prefer to be fine-grained and to track the stuff that not even a prio_tree can track, when the vma->vm_mm has different pages for every vma in the same range. ^ permalink raw reply [flat|nested] 74+ messages in thread
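The shape of the anon_vma design being argued for here can be sketched as a toy model. These are hypothetical, heavily simplified structures: the field names echo the patches under discussion, but the real code differs in locking, list types, and detail. The point illustrated is the object-based reverse lookup, and the N == 1 common case (one vma per anon_vma) discussed below:

```c
#include <stddef.h>

#define PAGE_SHIFT 12

struct vm_area_struct;

/* One anon_vma object shared by all vmas that can map the same
 * anonymous pages (normally just one vma; more only after fork). */
struct anon_vma {
    struct vm_area_struct *head;      /* list of vmas sharing these pages */
};

struct vm_area_struct {
    unsigned long vm_start, vm_end;
    unsigned long vm_pgoff;
    struct vm_area_struct *anon_next; /* link in the anon_vma list */
};

struct page {
    unsigned long index;              /* virtual page offset, per vm_pgoff */
    struct anon_vma *anon;            /* one pointer per page, no chains */
};

/* Object-based reverse lookup: recover the mapping address of `page`
 * inside `vma`, or 0 if the vma no longer covers it (e.g. after a
 * partial munmap or an mremap that moved this range elsewhere). */
static unsigned long vma_address(const struct page *page,
                                 const struct vm_area_struct *vma)
{
    unsigned long addr = vma->vm_start +
        ((page->index - vma->vm_pgoff) << PAGE_SHIFT);

    if (addr < vma->vm_start || addr >= vma->vm_end)
        return 0;
    return addr;
}

/* A page_referenced()-style walk: visit every vma in the anon_vma.
 * In the common case the list has one entry, so the walk is O(1). */
static int count_mappings(const struct page *page)
{
    int n = 0;
    const struct vm_area_struct *vma;

    for (vma = page->anon->head; vma; vma = vma->anon_next)
        if (vma_address(page, vma))
            n++;
    return n;
}
```

The per-vma cost is one list link plus the anon pointer, which is the "at most 8 bytes per vma of anon_vma objects" accounting in the mail above, as opposed to per-page pte_chains.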
* Re: anon_vma RFC2 2004-03-12 17:13 ` Andrea Arcangeli @ 2004-03-12 17:23 ` Rik van Riel 2004-03-12 17:44 ` Andrea Arcangeli 2004-03-12 18:25 ` Linus Torvalds 0 siblings, 2 replies; 74+ messages in thread From: Rik van Riel @ 2004-03-12 17:23 UTC (permalink / raw) To: Andrea Arcangeli Cc: Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel, William Lee Irwin III On Fri, 12 Mar 2004, Andrea Arcangeli wrote: > On Fri, Mar 12, 2004 at 11:25:27AM -0500, Rik van Riel wrote: > > pointing to a struct address_space with its anonymous > > memory. On exec() the mm_struct gets a new address_space, > > on fork parent and child share them. > > isn't this what anonmm is already doing? are you suggesting something > different? I am suggesting a pointer from the mm_struct to a struct address_space ... > > There's no difference between mremap() of anonymous memory > > and mremap() of part of an mmap() range of a file... > > > > At least, there doesn't need to be. > > the anonmm simply cannot work because it's not reaching vmas, it only > reaches mm, and with an mm and a virtual address you cannot reach the > right vma if it was moved around by mremap, ... and use the offset into the struct address_space as the page->index, NOT the virtual address inside the mm. On first creation of anonymous memory these addresses could be the same, but on mremap inside a forked process (with multiple processes sharing part of anonymous memory) a page could have a different offset inside the struct address_space than its virtual address.... Then on mremap you only need to adjust the start and end offsets inside the VMAs, not the page->index ... > That would be like anon_vma but it would not be fine-grained like anon_vma > is, you'll end up scanning very old vma segments in other address spaces Not really. On exec you can start with a new address space entirely, so the sharing is limited only to processes that really do share anonymous memory with each other...
> I think my anon_vma is superior because it's more fine-grained Isn't being LESS fine-grained the whole reason for moving from pte-based to object-based reverse mapping? ;)) > (it also avoids the need of a prio_tree even if in theory we could stack > a prio_tree on top of every anon_vma, but it's really not needed) We need the prio_tree anyway for files. I don't see why we couldn't reuse that code for anonymous memory, rather than reimplementing something new... Having the same code everywhere will definitely help simplify things. cheers, Rik -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 17:23 ` Rik van Riel @ 2004-03-12 17:44 ` Andrea Arcangeli 2004-03-12 18:18 ` Rik van Riel 2004-03-12 18:25 ` Linus Torvalds 1 sibling, 1 reply; 74+ messages in thread From: Andrea Arcangeli @ 2004-03-12 17:44 UTC (permalink / raw) To: Rik van Riel Cc: Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel, William Lee Irwin III On Fri, Mar 12, 2004 at 12:23:22PM -0500, Rik van Riel wrote: > On Fri, 12 Mar 2004, Andrea Arcangeli wrote: > > On Fri, Mar 12, 2004 at 11:25:27AM -0500, Rik van Riel wrote: > > > pointing to a struct address_space with its anonymous > > > memory. On exec() the mm_struct gets a new address_space, > > > on fork parent and child share them. > > > > isn't this what anonmm is already doing? are you suggesting something > > different? > > I am suggesting a pointer from the mm_struct to a > struct address_space ... that's the anonmm: + mm->anonmm = anonmm; > > > There's no difference between mremap() of anonymous memory > > > and mremap() of part of an mmap() range of a file... > > > > > > At least, there doesn't need to be. > > > > the anonmm simply cannot work because it's not reaching vmas, it only > > reaches mm, and with an mm and a virtual address you cannot reach the > > right vma if it was moved around by mremap, > > ... and use the offset into the struct address_space as > the page->index, NOT the virtual address inside the mm. > > On first creation of anonymous memory these addresses > could be the same, but on mremap inside a forked process > (with multiple processes sharing part of anonymous memory) > a page could have a different offset inside the struct > address space than its virtual address.... > > Then on mremap you only need to adjust the start and > end offsets inside the VMAs, not the page->index ... I don't see how this can work: each vma needs its own vm_pgoff, or a single address space can't handle them all.
Also, the page->index is the virtual address (or the virtual offset with anon_vma); it cannot be replaced with something global, it has to be per-page. > Isn't being LESS fine-grained the whole reason for moving > from pte-based to object-based reverse mapping? ;)) The object is there to cover ranges, instead of forcing a per-page overhead. Being fine-grained at the vma level is fine; being finer-grained than a vma is desirable only if there's no downside. > > (it also avoids the need of a prio_tree even if in theory we could stack > > a prio_tree on top of every anon_vma, but it's really not needed) > > We need the prio_tree anyway for files. I don't see As I said in the last email, the prio_tree will not work for the anon_vmas, because every vma in the same range will map to different pages. So you'll find more vmas than the ones you're interested in. This doesn't happen with inodes: with inodes every vma queued into the i_mmap will be mapping to the right page _if_ it's pte_present == 1. With your anonymous address space shared by children, the prio_tree will find lots of vmas in different vma->vm_mm, each one pointing to different pages. So to unmap a direct page after a malloc, you may end up scanning all the address spaces by mistake. This cannot happen with anon_vma. Furthermore the prio_tree will waste 12 bytes per vma, while the anon_vma design will waste _at_most_ 8 bytes per vma (actually less if the anon_vmas are shared). And with anon_vma in practice you won't need a prio_tree stacked on top of anon_vma. You could put one there if you want, paying another 12 bytes per vma, but it isn't worth it. So anon_vma takes less memory and it's more efficient as far as I can tell. > Having the same code everywhere will definitely help > simplify things.
Reusing the same code would be good, I agree, but I don't think it would work as well as with the inodes, and with the inodes it's really needed only for a special 32-bit case, so normally the lookup would be immediate, while here we'd need it for really expensive lookups if one has many anonymous vmas in the children, even in 64-bit apps. So I prefer a design where, prio_tree or not, the cost for well-behaved apps on 64-bit archs is the same. The prio_tree is not free, it's still O(log(N)), and I prefer a design where the common case is N == 1, as with anon_vma (with your address-space design N would normally be >1 in a server app). ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 17:44 ` Andrea Arcangeli @ 2004-03-12 18:18 ` Rik van Riel 0 siblings, 0 replies; 74+ messages in thread From: Rik van Riel @ 2004-03-12 18:18 UTC (permalink / raw) To: Andrea Arcangeli Cc: Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel, William Lee Irwin III On Fri, 12 Mar 2004, Andrea Arcangeli wrote: > On Fri, Mar 12, 2004 at 12:23:22PM -0500, Rik van Riel wrote: > > ... and use the offset into the struct address_space as > > the page->index, NOT the virtual address inside the mm. > As I said in the last email the prio_tree will not work for the > anon_vmas, because every vma in the same range will map to different > pages. So you'll find more vmas than the ones you're interested in. > This doesn't happen with inodes. With inodes every vma queued into the > i_mmap will be mapping to the right page _if_ it's pte_present == 1. You don't have multiple VMAs mapping to the same pages, but to the same range in the address_space. Note that the per-process virtual memory != the per-"fork-group" backing address_space ... > with your anonymous address space shared by children the prio_tree will > find lots of vmas in different vma->vm_mm, each one pointing to > different pages. Nope. I wish I was better with graphical programs, or I'd draw you a picture. ;) > > Having the same code everywhere will definitely help > > simplify things. > > Reusing the same code would be good, I agree, but I don't think it would > work as well as with the inodes, > prio_tree is not free, it's still O(log(N)) and I prefer a design where > the common case is N == 1 like with anon_vma (with your address-space > design N would be >1 normally in a server app). It's all a space-time trade-off. Do you want more structures allocated and a more complex mremap, or do you eat the O(log(N)) lookup? -- "Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 17:23 ` Rik van Riel 2004-03-12 17:44 ` Andrea Arcangeli @ 2004-03-12 18:25 ` Linus Torvalds 2004-03-12 18:48 ` Rik van Riel 2004-03-12 21:08 ` Jamie Lokier 1 sibling, 2 replies; 74+ messages in thread From: Linus Torvalds @ 2004-03-12 18:25 UTC (permalink / raw) To: Rik van Riel Cc: Andrea Arcangeli, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel, William Lee Irwin III On Fri, 12 Mar 2004, Rik van Riel wrote: > > I am suggesting a pointer from the mm_struct to a > struct address_space ... [ deleted ] > Then on mremap you only need to adjust the start and > end offsets inside the VMAs, not the page->index ... One fundamental problem I see, maybe you can explain it to me... - You need a _unique_ page->index start for each VMA, since each anonymous page needs to have a unique index. Right? - You can use the virtual address as that unique page index start - when you mremap() an area, you leave the start indexes the same, so that you can find the original pages (and create new ones in the old mapping) by just searching the vma's, not by actually looking at the page tables. - HOWEVER, after a mremap(), when you now create a new vma (or expand an old one) into the previously used page index area, you're now screwed. How are you going to generate unique page indexes in this new area without re-using the indexes that you allocated in the old (moved) area? I think your approach could work (reverse map by having separate address spaces for unrelated processes), but I don't see any good "page->index" allocation scheme that is implementable. The "unique" page->index thing wouldn't need to have to have anything to do with the virtual address (indeed, after a mremap it clearly cannot have anything to do with that), but the thing is, you'd need to be able to cover the virtual address space with whatever numbers you choose. 
You'd want to allocate contiguous indexes within one "vma", since the whole point would be to be able to try to quickly find the vma (and thus the page) that contains one particular page, but there are no range allocators that I can think of that allow growing the VMA after allocation (needed for vma merging on mmap and brk()) and still keep the range of indexes down to reasonable numbers. Or did I totally mis-understand what you were proposing? Linus ^ permalink raw reply [flat|nested] 74+ messages in thread
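The collision Linus describes can be made concrete with a toy model. This is purely illustrative (invented struct and function names) and assumes, as in the proposal above, that page indexes are seeded from virtual addresses at first fault and preserved across mremap():

```c
#include <stddef.h>

#define PAGE_SHIFT 12

/* Toy vma: vm_pgoff is the index of the first page in the shared
 * address_space, kept unchanged across mremap() so moved pages stay
 * findable. */
struct toy_vma {
    unsigned long vm_start, vm_end;
    unsigned long vm_pgoff;
};

static unsigned long index_of(const struct toy_vma *v, unsigned long addr)
{
    return ((addr - v->vm_start) >> PAGE_SHIFT) + v->vm_pgoff;
}
```

The failure mode: a vma first mapped at 0x10000 gets indexes starting at 0x10 (its virtual page number). After mremap() moves it to 0x80000 it keeps vm_pgoff 0x10. A later, unrelated mapping placed into the vacated range at 0x10000 would naively be seeded with vm_pgoff 0x10 as well, so two distinct anonymous pages collide on the same index in the shared address_space; this is why a real index allocator, not the virtual address, is needed.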
* Re: anon_vma RFC2 2004-03-12 18:25 ` Linus Torvalds @ 2004-03-12 18:48 ` Rik van Riel 2004-03-12 19:02 ` Chris Friesen 2004-03-12 21:08 ` Jamie Lokier 1 sibling, 1 reply; 74+ messages in thread From: Rik van Riel @ 2004-03-12 18:48 UTC (permalink / raw) To: Linus Torvalds Cc: Andrea Arcangeli, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel, William Lee Irwin III On Fri, 12 Mar 2004, Linus Torvalds wrote: > I think your approach could work (reverse map by having separate address > spaces for unrelated processes), but I don't see any good "page->index" > allocation scheme that is implementable. > Or did I totally mis-understand what you were proposing? You're absolutely right. I am still trying to come up with a way to do this. Note that since we count page->index in PAGE_SIZE units we have PAGE_SIZE times as much space as a process can take, so we definitely have enough address space to come up with a creative allocation scheme. I just can't think of one now ... -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 18:48 ` Rik van Riel @ 2004-03-12 19:02 ` Chris Friesen 2004-03-12 19:06 ` Rik van Riel 0 siblings, 1 reply; 74+ messages in thread From: Chris Friesen @ 2004-03-12 19:02 UTC (permalink / raw) To: Rik van Riel Cc: Linus Torvalds, Andrea Arcangeli, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel, William Lee Irwin III Rik van Riel wrote: > On Fri, 12 Mar 2004, Linus Torvalds wrote: > > >>I think your approach could work (reverse map by having separate address >>spaces for unrelated processes), but I don't see any good "page->index" >>allocation scheme that is implementable. > Note that since we count page->index in PAGE_SIZE unit we > have PAGE_SIZE times as much space as a process can take, > so we definately have enough address space to come up with > a creative allocation scheme. What happens when you have more than PAGE_SIZE processes running? Chris -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfriesen@nortelnetworks.com ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 19:02 ` Chris Friesen @ 2004-03-12 19:06 ` Rik van Riel 2004-03-12 19:10 ` Chris Friesen 2004-03-12 20:27 ` Andrea Arcangeli 0 siblings, 2 replies; 74+ messages in thread From: Rik van Riel @ 2004-03-12 19:06 UTC (permalink / raw) To: Chris Friesen Cc: Linus Torvalds, Andrea Arcangeli, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel, William Lee Irwin III On Fri, 12 Mar 2004, Chris Friesen wrote: > What happens when you have more than PAGE_SIZE processes running? Forked off the same process ? Without doing an exec ? On a 32 bit system ? You'd probably run out of space to put the VMAs, mm_structs and pgds long before reaching this point ... -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 19:06 ` Rik van Riel @ 2004-03-12 19:10 ` Chris Friesen 2004-03-12 19:14 ` Rik van Riel 2004-03-12 20:27 ` Andrea Arcangeli 1 sibling, 1 reply; 74+ messages in thread From: Chris Friesen @ 2004-03-12 19:10 UTC (permalink / raw) To: Rik van Riel Cc: Linus Torvalds, Andrea Arcangeli, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel, William Lee Irwin III Rik van Riel wrote: > On Fri, 12 Mar 2004, Chris Friesen wrote: > > >>What happens when you have more than PAGE_SIZE processes running? > > > Forked off the same process ? > Without doing an exec ? > On a 32 bit system ? > > You'd probably run out of space to put the VMAs, > mm_structs and pgds long before reaching this point ... I'm just thinking of the "fork 100000 kids to test 32-bit pids" sort of test cases. Chris -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfriesen@nortelnetworks.com ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 19:10 ` Chris Friesen @ 2004-03-12 19:14 ` Rik van Riel 0 siblings, 0 replies; 74+ messages in thread From: Rik van Riel @ 2004-03-12 19:14 UTC (permalink / raw) To: Chris Friesen Cc: Linus Torvalds, Andrea Arcangeli, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel, William Lee Irwin III On Fri, 12 Mar 2004, Chris Friesen wrote: > I'm just thinking of the "fork 100000 kids to test 32-bit pids" sort of > test cases. Try that with a process that takes up 2GB of address space ;) It won't work now and it'll fail for the same reasons with the scheme I proposed. Probably before the 2^44 bytes of space run out, too. -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 19:06 ` Rik van Riel 2004-03-12 19:10 ` Chris Friesen @ 2004-03-12 20:27 ` Andrea Arcangeli 2004-03-12 20:32 ` Rik van Riel 1 sibling, 1 reply; 74+ messages in thread From: Andrea Arcangeli @ 2004-03-12 20:27 UTC (permalink / raw) To: Rik van Riel Cc: Chris Friesen, Linus Torvalds, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel, William Lee Irwin III On Fri, Mar 12, 2004 at 02:06:17PM -0500, Rik van Riel wrote: > On Fri, 12 Mar 2004, Chris Friesen wrote: > > > What happens when you have more than PAGE_SIZE processes running? > > Forked off the same process ? > Without doing an exec ? > On a 32 bit system ? > > You'd probably run out of space to put the VMAs, > mm_structs and pgds long before reaching this point ... 7.5k users are being reached in a real workload with around 2gigs mapped per process and with tons of vma per process. with 2.6 and faster cpus I hope to go even further. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 20:27 ` Andrea Arcangeli @ 2004-03-12 20:32 ` Rik van Riel 2004-03-12 20:49 ` Andrea Arcangeli 0 siblings, 1 reply; 74+ messages in thread From: Rik van Riel @ 2004-03-12 20:32 UTC (permalink / raw) To: Andrea Arcangeli Cc: Chris Friesen, Linus Torvalds, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel, William Lee Irwin III On Fri, 12 Mar 2004, Andrea Arcangeli wrote: > 7.5k users are being reached in a real workload with around 2gigs mapped > per process and with tons of vma per process. with 2.6 and faster cpus > I hope to go even further. That's not all anonymous memory, though ;) -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 20:32 ` Rik van Riel @ 2004-03-12 20:49 ` Andrea Arcangeli 0 siblings, 0 replies; 74+ messages in thread From: Andrea Arcangeli @ 2004-03-12 20:49 UTC (permalink / raw) To: Rik van Riel Cc: Chris Friesen, Linus Torvalds, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel, William Lee Irwin III On Fri, Mar 12, 2004 at 03:32:20PM -0500, Rik van Riel wrote: > That's not all anonymous memory, though ;) true, my point is it's feasible (cow or shared is the same from a memory footprint standpoint, actually less since anon_vmas are a lot cheaper than dummy shmfs inodes) ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 18:25 ` Linus Torvalds 2004-03-12 18:48 ` Rik van Riel @ 2004-03-12 21:08 ` Jamie Lokier 1 sibling, 0 replies; 74+ messages in thread From: Jamie Lokier @ 2004-03-12 21:08 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Andrea Arcangeli, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel, William Lee Irwin III Linus Torvalds wrote: > You'd want to allocate contiguous indexes within one "vma", since the > whole point would be to be able to try to quickly find the vma (and thus > the page) that contains one particular page, but there are no range > allocators that I can think of that allow growing the VMA after allocation > (needed for vma merging on mmap and brk()) and still keep the range of > indexes down to reasonable numbers. For growing, they don't have to be contiguous - it's just desirable. When a vma is grown and the page->offset space it would like to occupy is already taken, it can be split into two vmas. Of course that alters mremap() semantics, which depend on vma boundaries. (mmap, munmap and mprotect don't care). So add a vma flag which indicates that it and the following vma(s) are a single unit for the purpose of remapping. Call it the mremap-group flag. Groups always have the same flags etc.; only the vm_offset varies. In effect, I'm suggesting that instead of having vmas be the user-visible unit, and some other finer-grained structures track page mappings, let _vmas_ be the finer-grained structure, and make the user-visible unit be whatever multiple consecutive vmas occur with that flag set. (This is a good balance if the number of splits is small; not if there are many). It shouldn't lead to a proliferation of vmas, provided the page->offset allocation algorithm is sufficiently sparse. To keep the number of potential splits small, always allocate some extra page->offset space so that a vma can grow into it. Only when it cannot grow in page->offset space, do you create a new vma. 
The new vma has extra page->offset space allocated too. That extra space should be proportional to the size of the entire new mremap() region (multiple vmas), not the new vma size. In that way, I think it bounds the number of splits to O(log (n/m)) where n is the total mremap() region size, and m is the original size. The constant in that expression is determined by the proportion that is used for reserving extra space. This has some consequences. If each vma's page->offset allocation reserves space around it to grow, then adjacent anonymous vmas won't be mergeable. If they aren't mergeable, it begs the question of why not have an address_space per vma, instead of per-mm, other than to save memory on address_space structures? Well, we like them to be mergeable. Lots of reasons. So make initial mmap() allocations not reserve page->offset space exclusively, but make allocations done by mremap() reserve the extra space, to get that O(log (n/m)) property. Using the mremap-group flag, we are also able to give the appearance of merged vmas when it would be difficult. If we want certain anonymous vmas to appear merged despite them having incompatible vm_offset values, we can do that. So going back to the question of address_space per-mm: you don't need one, due to the mremap-group flag. It's good to use as few as possible, but it's ok to use more than one per process or per fork-group, when absolutely necessary. That fixes the address_space limitation of 2^32 pages and makes page->offset allocation _very_ simple: 1. Allocate by simply incrementing an address counter. 2. When it's about to wrap, allocate a new address_space. 3. When allocating, reserve extra space for growing. The extra space should be proportional to the allocation, or the total size of the region after mremap(), and clamped to a sane maximum such as 4G minus size, and a sane minimum such as 2^22 (room for a million reservations per address_space). 4. When allocating, look at the nearby preceding or following vma in the virtual address space. If the amount of page->offset space reserved by those vmas is large enough, we can claim some of that reservation for the new allocation. If our good neighbour is adjacent to the new vma, that means the neighbour vma is simply grown. Otherwise, it means we create a new vma which is vm_offset-compatible with its neighbour, allowing them to merge if the hole between is filled. 5. By using large reservations, large regions of the virtual address space become covered with vm_offset-compatible vmas that are mergeable when the holes are filled. 6. When trying to merge adjacent anon vmas during ordinary mmap/munmap/mprotect/mremap operations, if they are not vm_offset-compatible (or their address_spaces aren't equal) just use the mremap-group flag to make them appear merged. The user-visible result is a single vma. The effect on the kernel is a rare non-mergeable boundary, which will slow vma searching marginally. The benefit is this simple allocation scheme. This is like what we have today, with some occasional non-mergeable vma boundaries (but only very few compared with the total number of vmas in an mm). These boundaries are not user-visible, and only affect the kernel algorithms - and in a simple way. Data structure changes required: one flag, VM_GROUP or something; each vma needs a pointer to _its_ address_space (can share space with vm_file or such); each vma needs to record how much page->offset space it has reserved beyond its own size. VM_GROWSDOWN vmas might want to record a reservation down rather than up. -- Jamie ^ permalink raw reply [flat|nested] 74+ messages in thread
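Jamie's bump-allocator-with-reservations idea can be sketched as follows. This is one reading of the scheme, with invented names and arbitrary constants, not actual kernel code: indexes come from a monotonically increasing counter, each allocation reserves proportional headroom for growth, and when the 2^32-page space would wrap, a fresh address_space is opened:

```c
#include <stdint.h>

#define RESERVE_MIN (1UL << 22)   /* 2^22 pages, the sane minimum above */

/* A toy address_space: just a bump pointer and an identity. */
struct toy_space {
    uint64_t next;                /* next free page index */
    int id;                       /* which address_space this is */
};

/* An allocated range of page->offset space, all counted in pages. */
struct toy_range {
    int space_id;
    uint64_t start, len, reserved;
};

static int nspaces = 1;

/* Step 1-3: bump-allocate `len` pages, reserving extra headroom; on
 * the verge of wrapping the 2^32-page space, open a new address_space. */
static struct toy_range alloc_index(struct toy_space *sp, uint64_t len)
{
    uint64_t reserve = len > RESERVE_MIN ? len : RESERVE_MIN;

    if (sp->next + len + reserve >= ((uint64_t)1 << 32)) {
        sp->next = 0;             /* about to wrap: fresh address_space */
        sp->id = nspaces++;
    }

    struct toy_range r = { sp->id, sp->next, len, reserve };
    sp->next += len + reserve;
    return r;
}

/* Grow in place while the reservation lasts; once it is exhausted the
 * caller must split off a new vma (the mremap-group case). */
static int grow_range(struct toy_range *r, uint64_t extra)
{
    if (extra > r->reserved)
        return 0;                 /* needs a split */
    r->len += extra;
    r->reserved -= extra;
    return 1;
}
```

Growth within the reservation is free; only growth past it forces a new, possibly non-contiguous range, which is what bounds the number of splits per region.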
* Re: anon_vma RFC2 2004-03-12 12:21 ` Andrea Arcangeli 2004-03-12 12:40 ` Rik van Riel @ 2004-03-12 12:42 ` Andrea Arcangeli 2004-03-12 12:46 ` William Lee Irwin III 2004-03-12 13:43 ` Hugh Dickins 3 siblings, 0 replies; 74+ messages in thread From: Andrea Arcangeli @ 2004-03-12 12:42 UTC (permalink / raw) To: Rik van Riel Cc: Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel, William Lee Irwin III On Fri, Mar 12, 2004 at 01:21:27PM +0100, Andrea Arcangeli wrote: > Rajesh's code has nothing to do with the mremap breakage, Rajesh's code > can only boost the search of the interesting vmas in an anonmm, it > doesn't solve mremap. btw, one more detail: Rajesh's code will fall apart while dealing with the dynamic metadata attached to vmas relocated by mremap. His code is usable out of the box only on top of anon_vma (where vm_pgoff/vm_start/vm_end retain the same semantics as the file mappings in the i_mmap list), not on top of anonmm, where you'll have to stack some other dynamic structure (like the pte_chains today in anobjrmap-5). I'm not sure how well his code could be modified to take into account the dynamic data structure generated by mremap. Also don't forget Rajesh's code doesn't come for free, it also adds overhead to the vma, so if you need the tree in the anonmm too (not only in the inode), you'll grow the vma size too (I grow it by 12 bytes with anon_vma, but then I don't need complex metadata dynamically allocated later in mremap, and I don't need the rbtree search either since it's already fine-grained enough). I also expect you'll still have significant problems merging two vmas, one touched by mremap and the other not, since then the dynamic objects would need to be "partial" for only a part of the vma, complicating even further the "tree search" with ranges in the sub-metadata attached to the vma. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 12:21 ` Andrea Arcangeli 2004-03-12 12:40 ` Rik van Riel 2004-03-12 12:42 ` Andrea Arcangeli @ 2004-03-12 12:46 ` William Lee Irwin III 2004-03-12 13:24 ` Andrea Arcangeli 2004-03-12 16:17 ` Linus Torvalds 2004-03-12 13:43 ` Hugh Dickins 3 siblings, 2 replies; 74+ messages in thread From: William Lee Irwin III @ 2004-03-12 12:46 UTC (permalink / raw) To: Andrea Arcangeli Cc: Rik van Riel, Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel On Fri, Mar 12, 2004 at 01:21:27PM +0100, Andrea Arcangeli wrote: > you missed the fact mremap doesn't work, that's the fundamental reason > for the vma tracking, so you can use vm_pgoff. > if you take Hugh's anonmm, mremap will be attaching a persistent dynamic > overhead to the vma it touches. Currently it does in form of pte_chains, > that can be converted to other means of overhead, but I simply don't > like it. > I like all vmas to be symmetric to each other, without special hacks to > handle mremap right. > We have the vm_pgoff to handle mremap and I simply use that. Absolute guarantees are nice but this characterization is too extreme. The case where mremap() creates rmap_chains is so rare I never ever saw it happen in 6 months of regular practical use and testing. Their creation could be triggered only by remap_file_pages(). -- wli ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 12:46 ` William Lee Irwin III @ 2004-03-12 13:24 ` Andrea Arcangeli 2004-03-12 13:40 ` William Lee Irwin III 2004-03-12 13:55 ` Hugh Dickins 1 sibling, 2 replies; 74+ messages in thread From: Andrea Arcangeli @ 2004-03-12 13:24 UTC (permalink / raw) To: William Lee Irwin III, Rik van Riel, Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel On Fri, Mar 12, 2004 at 04:46:38AM -0800, William Lee Irwin III wrote: > On Fri, Mar 12, 2004 at 01:21:27PM +0100, Andrea Arcangeli wrote: > > you missed the fact mremap doesn't work, that's the fundamental reason > > for the vma tracking, so you can use vm_pgoff. > > if you take Hugh's anonmm, mremap will be attaching a persistent dynamic > > overhead to the vma it touches. Currently it does in form of pte_chains, > > that can be converted to other means of overhead, but I simply don't > > like it. > > I like all vmas to be symmetric to each other, without special hacks to > > handle mremap right. > > We have the vm_pgoff to handle mremap and I simply use that. > > Absolute guarantees are nice but this characterization is too extreme. > The case where mremap() creates rmap_chains is so rare I never ever saw > it happen in 6 months of regular practical use and testing. Their > creation could be triggered only by remap_file_pages(). did you try specweb with apache? that's super heavy mremap as far as I know (and it may be using anon memory, and if not I certainly can't exclude that other apps are using mremap on significant amounts of anonymous ram). To the point that the kmap_lock for the persistent kmaps I used originally in mremap (at least it has never been racy) was a showstopper bottleneck, spending most of the system time there (the profiles looked horrible around the kmap_lock), and I had to fix it up the 2.6 way with the per-cpu atomic kmaps to avoid being an order of magnitude slower than in the small boxes w/o highmem.
the single reason I'm doing this work is to avoid allocating the pte_chains and to always use the vma instead. If I have to use the pte_chains again for mremap (hoping that no application is using mremap) then I'm not at all happy since people could still fall into the pte_chain trap with some app. Admittedly the pte_chains make perfect sense only for nonlinear vmas, since the vma is meaningless for the nonlinear vmas and really a per-page cost makes sense there, but I'm not going to add 8 bytes per-page to swapout the nonlinear vmas efficiently, and I'll let the cpu pay for that if you really need to swap the nonlinear mappings (i.e. the pagetable walk). An alternate way would have been to dynamically allocate the per-pte pointer, but that would throw a whole lot of memory at the problem too, and one of the main points for using nonlinear maps is to avoid the allocation of the vmas, so I doubt people really want to allocate lots of ram to handle nonlinear efficiently, so I believe saving all ram at the expense of cpu cost during swapping will be ok. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 13:24 ` Andrea Arcangeli @ 2004-03-12 13:40 ` William Lee Irwin III 2004-03-12 13:55 ` Hugh Dickins 1 sibling, 0 replies; 74+ messages in thread From: William Lee Irwin III @ 2004-03-12 13:40 UTC (permalink / raw) To: Andrea Arcangeli Cc: Rik van Riel, Hugh Dickins, Ingo Molnar, Andrew Morton, torvalds, linux-kernel On Fri, Mar 12, 2004 at 02:24:36PM +0100, Andrea Arcangeli wrote: > did you try specweb with apache? that's super heavy mremap as far as I > know (and it maybe using anon memory, and if not I certainly cannot > exclude other apps are using mremap on significant amounts of anymous > ram). To a point that the kmap_lock for the persistent kmaps I used > originally in mremap (at least it has never been racy) was a showstopper > bottleneck spending most of system time there (profiling was horrible in > the kmap_lock) and I had to fixup the 2.6 way with the per-cpu atomic > kmaps to avoid being an order of magnitude slower than in the small > boxes w/o highmem. No. I have never had access to systems set up for specweb. -- wli ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 13:24 ` Andrea Arcangeli 2004-03-12 13:40 ` William Lee Irwin III @ 2004-03-12 13:55 ` Hugh Dickins 2004-03-12 16:01 ` Andrea Arcangeli 1 sibling, 1 reply; 74+ messages in thread From: Hugh Dickins @ 2004-03-12 13:55 UTC (permalink / raw) To: Andrea Arcangeli Cc: William Lee Irwin III, Rik van Riel, Ingo Molnar, Andrew Morton, torvalds, linux-kernel On Fri, 12 Mar 2004, Andrea Arcangeli wrote: > On Fri, Mar 12, 2004 at 04:46:38AM -0800, William Lee Irwin III wrote: > > > > The case where mremap() creates rmap_chains is so rare I never ever saw > > it happen in 6 months of regular practical use and testing. Their > > creation could be triggered only by remap_file_pages(). > > did you try specweb with apache? that's super heavy mremap as far as I > know (and it maybe using anon memory, and if not I certainly cannot > exclude other apps are using mremap on significant amounts of anymous > ram). anonmm has no problem with most mremaps: the special case is for mremap MAYMOVE of anon vmas _inherited from parent_ (same page at different addresses in the different mms). As I said before, it's quite conceivable that this case never arises outside our testing (but I'd be glad to be shown wrong, would make effort worthwhile). Hugh ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 13:55 ` Hugh Dickins @ 2004-03-12 16:01 ` Andrea Arcangeli 0 siblings, 0 replies; 74+ messages in thread From: Andrea Arcangeli @ 2004-03-12 16:01 UTC (permalink / raw) To: Hugh Dickins Cc: William Lee Irwin III, Rik van Riel, Ingo Molnar, Andrew Morton, torvalds, linux-kernel On Fri, Mar 12, 2004 at 01:55:30PM +0000, Hugh Dickins wrote: > On Fri, 12 Mar 2004, Andrea Arcangeli wrote: > > On Fri, Mar 12, 2004 at 04:46:38AM -0800, William Lee Irwin III wrote: > > > > > > The case where mremap() creates rmap_chains is so rare I never ever saw > > > it happen in 6 months of regular practical use and testing. Their > > > creation could be triggered only by remap_file_pages(). > > > > did you try specweb with apache? that's super heavy mremap as far as I > > know (and it maybe using anon memory, and if not I certainly cannot > > exclude other apps are using mremap on significant amounts of anymous > > ram). > > anonmm has no problem with most mremaps: the special case is for > mremap MAYMOVE of anon vmas _inherited from parent_ (same page at > different addresses in the different mms). As I said before, it's > quite conceivable that this case never arises outside our testing > (but I'd be glad to be shown wrong, would make effort worthwhile). the problem is that it _can_ arise, and fixing that is a huge mess without using the pte_chains IMHO (no hope to use the vma->shared). I also don't see how you can know if a vma is pointing all to "direct" pages and in turn you can move it somewhere else without the pte_chains. sure you can move all anon vmas freely after an execve, but after the first fork (and in turn with cow pages going on) all mremaps will be non-trackable with anonmm, right? 
lots of server processes use the fork() model for their children, and they can run mremap inside the child on memory malloced inside the child, and I don't think you can easily track whether the malloc happened inside the child or inside the parent, though I may be wrong on this. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 12:46 ` William Lee Irwin III 2004-03-12 13:24 ` Andrea Arcangeli @ 2004-03-12 16:17 ` Linus Torvalds 2004-03-13 0:28 ` William Lee Irwin III 2004-03-13 14:43 ` Rik van Riel 1 sibling, 2 replies; 74+ messages in thread From: Linus Torvalds @ 2004-03-12 16:17 UTC (permalink / raw) To: William Lee Irwin III Cc: Andrea Arcangeli, Rik van Riel, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel On Fri, 12 Mar 2004, William Lee Irwin III wrote: > > Absolute guarantees are nice but this characterization is too extreme. > The case where mremap() creates rmap_chains is so rare I never ever saw > it happen in 6 months of regular practical use and testing. Their > creation could be triggered only by remap_file_pages(). I have to _violently_ agree with Andrea on this one. The absolute _LAST_ thing we want to have is a "remnant" rmap infrastructure that only gets very occasional use. That's a GUARANTEED way to get bugs, and really subtle behaviour. I think Andrea is 100% right. Either do rmap for everything (like we do now, modulo IO/mlock), or do it for _nothing_. No half measures with "most of the time". Quite frankly, the stuff I've seen suggested sounds absolutely _horrible_. Special cases are not just a pain to work with, they definitely will cause bugs. It's not a matter of "if", it's a matter of "when". So let's make it clear: if we have an object-based reverse mapping, it should cover all reasonable cases, and in particular, it should NOT have rare fallbacks to code that thus never gets any real testing. And if we have per-page rmap like now, it should _always_ be there. You do have to realize that maintainability is a HELL of a lot more important than scalability of performance can be. Please keep that in mind. Linus ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 16:17 ` Linus Torvalds @ 2004-03-13 0:28 ` William Lee Irwin III 0 siblings, 0 replies; 74+ messages in thread From: William Lee Irwin III @ 2004-03-13 0:28 UTC (permalink / raw) To: Linus Torvalds Cc: Andrea Arcangeli, Rik van Riel, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel On Fri, Mar 12, 2004 at 08:17:49AM -0800, Linus Torvalds wrote: > I have to _violently_ agree with Andrea on this one. > The absolute _LAST_ thing we want to have is a "remnant" rmap > infrastructure that only gets very occasional use. That's a GUARANTEED way > to get bugs, and really subtle behaviour. > I think Andrea is 100% right. Either do rmap for everything (like we do > now, modulo IO/mlock), or do it for _nothing_. No half measures with > "most of the time". > Quite frankly, the stuff I've seen suggested sounds absolutely _horrible_. > Special cases are not just a pain to work with, they definitely will cause > bugs. It's not a matter of "if", it's a matter of "when". > So let's make it clear: if we have an object-based reverse mapping, it > should cover all reasonable cases, and in particular, it should NOT have > rare fallbacks to code that thus never gets any real testing. > And if we have per-page rmap like now, it should _always_ be there. > You do have to realize that maintainability is a HELL of a lot more > important than scalability of performance can be. Please keep that in > mind. The sole point I had to make was against a performance/resource scalability argument; the soft issues weren't part of that, though they may ultimately be the deciding factor. -- wli ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-12 16:17 ` Linus Torvalds 2004-03-13 0:28 ` William Lee Irwin III @ 2004-03-13 14:43 ` Rik van Riel 2004-03-13 16:18 ` Linus Torvalds 1 sibling, 1 reply; 74+ messages in thread From: Rik van Riel @ 2004-03-13 14:43 UTC (permalink / raw) To: Linus Torvalds Cc: William Lee Irwin III, Andrea Arcangeli, Hugh Dickins, Ingo Molnar, Andrew Morton, linux-kernel On Fri, 12 Mar 2004, Linus Torvalds wrote: > So let's make it clear: if we have an object-based reverse mapping, it > should cover all reasonable cases, and in particular, it should NOT have > rare fallbacks to code that thus never gets any real testing. Absolutely agreed. And with Rajesh's code it should be possible to get object-based rmap right, not vulnerable to the scalability issues demonstrated by Ingo's test programs. Whether we go with mm-based or vma-based, I don't particularly care either. As long as the code is nice... -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-13 14:43 ` Rik van Riel @ 2004-03-13 16:18 ` Linus Torvalds 2004-03-13 17:24 ` Hugh Dickins 2004-03-13 17:33 ` Andrea Arcangeli 0 siblings, 2 replies; 74+ messages in thread From: Linus Torvalds @ 2004-03-13 16:18 UTC (permalink / raw) To: Rik van Riel, Andrea Arcangeli Cc: William Lee Irwin III, Hugh Dickins, Ingo Molnar, Andrew Morton, Kernel Mailing List Ok, guys, how about this anon-page suggestion? I'm a bit nervous about the complexity issues in Andrea's current setup, so I've been thinking about Rik's per-mm thing. And I think that there is one very simple approach, which should work fine, and should have minimal impact on the existing setup exactly because it is so simple. Basic setup: - each anonymous page is associated with exactly _one_ virtual address, in an "anon memory group". We put the virtual address (shifted down by PAGE_SHIFT) into "page->index". We put the "anon memory group" pointer into "page->mapping". We have a PAGE_ANONYMOUS flag to tell the rest of the world about this. - the anon memory group has a list of all mm's that it is associated with. - an "execve()" creates a new "anon memory group" and drops the old one. - an mm copy operation just increments the reference count and adds the new mm to the mm list for that anon memory group. So now to do reverse mapping, we can take a page, and do if (PageAnonymous(page)) { struct anongroup *mmlist = (struct anongroup *)page->mapping; unsigned long address = page->index << PAGE_SHIFT; struct mm_struct *mm; for_each_entry(mm, mmlist->anon_mms, anon_mm) { .. look up page in page tables in "mm, address" .. .. most of the time we may not even need to look .. .. up the "vma" at all, just walk the page tables .. } } else { /* Shared page */ .. look up page using the inode vma list .. } The above all works 99% of the time. 
The only problem is mremap() after a fork(), and hell, we know that's a special case anyway, and let's just add a few lines to copy_one_pte(), which basically does: if (PageAnonymous(page) && page->count > 1) { newpage = alloc_page(); copy_page(page, newpage); page = newpage; } /* Move the page to the new address */ page->index = address >> PAGE_SHIFT; and now we have zero special cases. The above should work very well. In most cases the "anongroup" will be very small, and even when it's large (if somebody does a ton of forks without any execve's), we only have _one_ address to check, and that is pretty fast. A high-performance server would use threads, anyway. (And quite frankly, _any_ algorithm will have this issue. Even rmap will have exactly the same loop, although rmap skips any vm's where the page might have been COW'ed or removed). The extra COW in mremap() seems benign. Again, it should usually not even trigger. What do you think? To me, this seems to be a really simple approach.. Linus ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-13 16:18 ` Linus Torvalds @ 2004-03-13 17:24 ` Hugh Dickins 2004-03-13 17:28 ` Rik van Riel 2004-03-13 17:48 ` Andrea Arcangeli 2004-03-13 17:33 ` Andrea Arcangeli 1 sibling, 2 replies; 74+ messages in thread From: Hugh Dickins @ 2004-03-13 17:24 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Andrea Arcangeli, William Lee Irwin III, Ingo Molnar, Andrew Morton, Kernel Mailing List On Sat, 13 Mar 2004, Linus Torvalds wrote: > > Ok, guys, > how about this anon-page suggestion? What you describe is pretty much exactly what my anobjrmap patch from a year ago did. I'm currently looking through that again to bring it up to date. > I'm a bit nervous about the complexity issues in Andrea's current setup, > so I've been thinking about Rik's per-mm thing. And I think that there is > one very simple approach, which should work fine, and should have minimal > impact on the existing setup exactly because it is so simple. > > Basic setup: > - each anonymous page is associated with exactly _one_ virtual address, > in a "anon memory group". > > We put the virtual address (shifted down by PAGE_SHIFT) into > "page->index". We put the "anon memory group" pointer into > "page->mapping". We have a PAGE_ANONYMOUS flag to tell the > rest of the world about this. It's a bit more complicated because page->mapping currently contains &swapper_space if PageSwapCache(page) - indeed, at present that's exactly what PageSwapCache(page) tests. So I reintroduced a PageSwapCache(page) flagbit, avoid the very few places where mapping pointing to swapper_space was actually useful, and use page->private instead of page->index for the swp_entry_t. (Andrew did point out that we could reduce the scale of the mods by reusing page->list fields instead of mapping/index; but mapping/index are the natural fields to use, and Andrew now has other changes in -mm which remove page->list: so the original choice looks right again.) 
> for_each_entry(mm, mmlist->anon_mms, anon_mm) { > .. look up page in page tables in "mm, address" .. > .. most of the time we may not even need to look .. > .. up the "vma" at all, just walk the page tables .. > } I believe page_referenced() can just walk the page tables, but try_to_unmap() needs vma to check VM_LOCKED (we're thinking of other ways to avoid that, but they needn't get mixed into this) and for flushing cache and tlb (perhaps avoidable on some arches? I've not checked, and again that would be an optimization to consider later, not mix in at this stage). > The only problem is mremap() after a fork(), and hell, we know that's a > special case anyway, and let's just add a few lines to copy_one_pte(), > which basically does: > > if (PageAnonymous(page) && page->count > 1) { > newpage = alloc_page(); > copy_page(page, newpage); > page = newpage; > } > /* Move the page to the new address */ > page->index = address >> PAGE_SHIFT; > > and now we have zero special cases. That's always been a fallback solution, I was just a little too ashamed to propose it originally - seems a little wrong to waste whole pages rather than wasting a few bytes of data structure trying to track them: though the pages are pageable unlike any data structure we come up with. I think we have page_table_lock in copy_one_pte, so won't want to do it quite like that. It won't matter at all if pages are transiently untrackable. Might want to do something like make_pages_present afterwards (but it should only be COWing instantiated pages; and does need to COW pages currently on swap too). There's probably an issue with Alan's strict commit memory accounting, if the mapping is readonly; but so long as we get that counting right, I don't think it's really going to matter at all if we sometimes fail an mremap for that reason - but probably need to avoid mistaking the common case (mremap of own area) for the rare case which needs this copying (mremap of inherited area). 
Hugh ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-13 17:24 ` Hugh Dickins @ 2004-03-13 17:28 ` Rik van Riel 2004-03-13 17:41 ` Hugh Dickins ` (2 more replies) 2004-03-13 17:48 ` Andrea Arcangeli 1 sibling, 3 replies; 74+ messages in thread From: Rik van Riel @ 2004-03-13 17:28 UTC (permalink / raw) To: Hugh Dickins Cc: Linus Torvalds, Andrea Arcangeli, William Lee Irwin III, Ingo Molnar, Andrew Morton, Kernel Mailing List On Sat, 13 Mar 2004, Hugh Dickins wrote: > On Sat, 13 Mar 2004, Linus Torvalds wrote: > > if (PageAnonymous(page) && page->count > 1) { > > newpage = alloc_page(); > > copy_page(page, newpage); > > page = newpage; > > } > > /* Move the page to the new address */ > > page->index = address >> PAGE_SHIFT; > > > > and now we have zero special cases. > > That's always been a fallback solution, I was just a little too ashamed > to propose it originally - seems a little wrong to waste whole pages > rather than wasting a few bytes of data structure trying to track them: > though the pages are pageable unlike any data structure we come up with. No, Linus is right. If a child process uses mremap(), it stands to reason that it's about to use those pages for something. Think of it as taking the COW faults early, because chances are you'd be taking them anyway, just a little bit later... -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-13 17:28 ` Rik van Riel @ 2004-03-13 17:41 ` Hugh Dickins 2004-03-13 18:08 ` Andrea Arcangeli 2004-03-13 17:54 ` Andrea Arcangeli 2004-03-13 18:57 ` Linus Torvalds 2 siblings, 1 reply; 74+ messages in thread From: Hugh Dickins @ 2004-03-13 17:41 UTC (permalink / raw) To: Rik van Riel Cc: Linus Torvalds, Andrea Arcangeli, William Lee Irwin III, Ingo Molnar, Andrew Morton, Kernel Mailing List On Sat, 13 Mar 2004, Rik van Riel wrote: > > No, Linus is right. > > If a child process uses mremap(), it stands to reason that > it's about to use those pages for something. > > Think of it as taking the COW faults early, because chances > are you'd be taking them anyway, just a little bit later... Makes perfect sense in the read-write case. The read-only case is less satisfactory, but those will be even rarer. Hugh ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-13 17:41 ` Hugh Dickins @ 2004-03-13 18:08 ` Andrea Arcangeli 0 siblings, 0 replies; 74+ messages in thread From: Andrea Arcangeli @ 2004-03-13 18:08 UTC (permalink / raw) To: Hugh Dickins Cc: Rik van Riel, Linus Torvalds, William Lee Irwin III, Ingo Molnar, Andrew Morton, Kernel Mailing List On Sat, Mar 13, 2004 at 05:41:37PM +0000, Hugh Dickins wrote: > On Sat, 13 Mar 2004, Rik van Riel wrote: > > > > No, Linus is right. > > > > If a child process uses mremap(), it stands to reason that > > it's about to use those pages for something. > > > > Think of it as taking the COW faults early, because chances > > are you'd be taking them anyway, just a little bit later... > > Makes perfect sense in the read-write case. The read-only > case is less satisfactory, but those will be even rarer. overall it's not obvious to me that those will be even rarer. see the last email about kde-like usages to share data like threads but with memory protection, those won't write to the data. I mean, it may be the way to go, but I think we should get some ok from the major linux projects that we're not going to invalidate their smart optimizations first, and we should get this "misfeature" documented somehow. I have to admit the simplicity is appealing, but besides its coding-simplicity in practice I believe the only other appealing thing will be the fact it's not exploitable by people doing a flood of vma_splits; to solve that with anon_vma I'd need a prio tree on top of every anon_vma, and that means even more memory wasted both in the anon_vma and vma, though practically a prio_tree there wouldn't be necessary. The anonmm solves the complexity issue using find_vma, so sharing the rbtree which already works. that's probably the part of anonmm I find most appealing. One can still exploit the complexity with anonmm too, but not from the same address space, so it's easier to limit with ulimit -u. 
I'm really not sure what's best, which is not good since I hoped to get the anon_vma implementation working on Monday evening (heck, it was already swapping my test app fine despite the huge vma_split/PageDirect bug that you noticed, which probably caused `ps` to oops; I bet `ps` is doing a vma_split ;) but now I've returned to wondering about the design issues instead. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-13 17:28 ` Rik van Riel 2004-03-13 17:41 ` Hugh Dickins @ 2004-03-13 17:54 ` Andrea Arcangeli 2004-03-13 17:55 ` Andrea Arcangeli 2004-03-13 18:57 ` Linus Torvalds 1 sibling, 1 reply; 74+ messages in thread From: Andrea Arcangeli @ 2004-03-13 17:54 UTC (permalink / raw) To: Rik van Riel Cc: Hugh Dickins, Linus Torvalds, William Lee Irwin III, Ingo Molnar, Andrew Morton, Kernel Mailing List On Sat, Mar 13, 2004 at 12:28:31PM -0500, Rik van Riel wrote: > On Sat, 13 Mar 2004, Hugh Dickins wrote: > > On Sat, 13 Mar 2004, Linus Torvalds wrote: > > > > if (PageAnonymous(page) && page->count > 1) { > > > newpage = alloc_page(); > > > copy_page(page, newpage); > > > page = newpage; > > > } > > > /* Move the page to the new address */ > > > page->index = address >> PAGE_SHIFT; > > > > > > and now we have zero special cases. > > > > That's always been a fallback solution, I was just a little too ashamed > > to propose it originally - seems a little wrong to waste whole pages > > rather than wasting a few bytes of data structure trying to track them: > > though the pages are pageable unlike any data structure we come up with. > > No, Linus is right. > > If a child process uses mremap(), it stands to reason that > it's about to use those pages for something. > > Think of it as taking the COW faults early, because chances > are you'd be taking them anyway, just a little bit later... using mremap to _move_ anonymous maps is simply not frequent. It's so infrequent that it's hard to tell if the child is going to _read_ or to _write_. Using those pages means nothing, all that matters is whether it will use those pages for reading or for writing, and I don't see how you can assume it's going to write to them and how you can assume this is an early-COW in the common case. the only interesting point to me is that it's infrequent, with that I certainly agree, but I don't see this as an early-COW. 
What worries me most are things like kde: they used the library design with the sole object of sharing readonly anonymous pages, which is very smart since it still prevents a bug in one app from taking down the whole GUI, but if they happen to use mremap to move those readonly pages around after the for we'll screw them completely. I've no indication that this is the case or that they ever call mremap, but I cannot tell the opposite either. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-13 17:54 ` Andrea Arcangeli @ 2004-03-13 17:55 ` Andrea Arcangeli 0 siblings, 0 replies; 74+ messages in thread From: Andrea Arcangeli @ 2004-03-13 17:55 UTC (permalink / raw) To: Rik van Riel Cc: Hugh Dickins, Linus Torvalds, William Lee Irwin III, Ingo Molnar, Andrew Morton, Kernel Mailing List On Sat, Mar 13, 2004 at 06:54:06PM +0100, Andrea Arcangeli wrote: > after the for we'll screw them completely. I've no indication that this ^k ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-13 17:28 ` Rik van Riel 2004-03-13 17:41 ` Hugh Dickins 2004-03-13 17:54 ` Andrea Arcangeli @ 2004-03-13 18:57 ` Linus Torvalds 2004-03-13 19:14 ` Hugh Dickins 2 siblings, 1 reply; 74+ messages in thread From: Linus Torvalds @ 2004-03-13 18:57 UTC (permalink / raw) To: Rik van Riel Cc: Hugh Dickins, Andrea Arcangeli, William Lee Irwin III, Ingo Molnar, Andrew Morton, Kernel Mailing List On Sat, 13 Mar 2004, Rik van Riel wrote: > > No, Linus is right. > > If a child process uses mremap(), it stands to reason that > it's about to use those pages for something. That's not necessarily true, since it's entirely possible that it's just a realloc(), and the old part of the allocation would have been left alone. That said, I suspect that - mremap() isn't all _that_ common in the first place - it's even more rare to do a fork() and then a mremap() (ie most of the time I suspect the page count will be 1, and no COW is necessary). Most apps tend to exec() after a fork. - I agree that in at least part of the remaining cases we _would_ COW the pages anyway. I suspect that the only common "no execve after fork" usage is for a few servers, especially the traditional UNIX kind (ie using processes as fairly heavy-weight threads). It could be interesting to see numbers. But basically I'm inclined to believe that the "unnecessary COW" case is _so_ rare, that if it allows us to make other things simpler (and thus more stable and likely faster) it is worth it. Especially the simplicity just appeals to me. I just think that if mremap() causes so many problems for reverse mapping, we should make _that_ the expensive operation, instead of making everything else more complicated. After all, if it turns out that the "early COW" behaviour I suggest can be a performance problem for some (rare) circumstances, then the fix for that is likely to just let applications know that mremap() can be expensive. 
(It's still likely to be a lot cheaper than actually doing a new mmap+memcpy+munmap, so it's not like mremap would become pointless). Linus ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-13 18:57 ` Linus Torvalds @ 2004-03-13 19:14 ` Hugh Dickins 0 siblings, 0 replies; 74+ messages in thread From: Hugh Dickins @ 2004-03-13 19:14 UTC (permalink / raw) To: Linus Torvalds Cc: Rik van Riel, Andrea Arcangeli, William Lee Irwin III, Ingo Molnar, Andrew Morton, Kernel Mailing List On Fri, 12 Mar 2004, Linus Torvalds wrote: > > The absolute _LAST_ thing we want to have is a "remnant" rmap > infrastructure that only gets very occasional use. That's a GUARANTEED way > to get bugs, and really subtle behaviour. On Sat, 13 Mar 2004, Linus Torvalds wrote: > > I just think that if mremap() causes so many problems for reverse mapping, > we should make _that_ the expensive operation, instead of making > everything else more complicated. Friday's Linus has a good point, but I agree more with Saturday's: mremap MAYMOVE is a very special case, and I believe it would hurt the whole to put it at the centre of the design. But all power to Andrea to achieve that. Hugh ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2 2004-03-13 17:24 ` Hugh Dickins 2004-03-13 17:28 ` Rik van Riel @ 2004-03-13 17:48 ` Andrea Arcangeli 1 sibling, 0 replies; 74+ messages in thread From: Andrea Arcangeli @ 2004-03-13 17:48 UTC (permalink / raw) To: Hugh Dickins Cc: Linus Torvalds, Rik van Riel, William Lee Irwin III, Ingo Molnar, Andrew Morton, Kernel Mailing List On Sat, Mar 13, 2004 at 05:24:12PM +0000, Hugh Dickins wrote: > On Sat, 13 Mar 2004, Linus Torvalds wrote: > > > > Ok, guys, > > how about this anon-page suggestion? > > What you describe is pretty much exactly what my anobjrmap patch > from a year ago did. I'm currently looking through that again it is. Linus simply provided a solution to the mremap issue, that is to make it impossible to share anonymous pages through an mremap, that solves the problem indeed at some cpu and memory cost after an mremap. I realized you could solve it also by walking the whole list of vmas in every mm->mmap list but that complexity would be way too high. > > The only problem is mremap() after a fork(), and hell, we know that's a > > special case anyway, and let's just add a few lines to copy_one_pte(), > > which basically does: > > > > if (PageAnonymous(page) && page->count > 1) { > > newpage = alloc_page(); > > copy_page(page, newpage); > > page = newpage; > > } > > /* Move the page to the new address */ > > page->index = address >> PAGE_SHIFT; > > > > and now we have zero special cases. > > That's always been a fallback solution, I was just a little too ashamed > to propose it originally - seems a little wrong to waste whole pages > rather than wasting a few bytes of data structure trying to track them: > though the pages are pageable unlike any data structure we come up with. > > I think we have page_table_lock in copy_one_pte, so won't want to do > it quite like that. It won't matter at all if pages are transiently > untrackable. 
Might want to do something like make_pages_present > afterwards (but it should only be COWing instantiated pages; and > does need to COW pages currently on swap too). > > There's probably an issue with Alan's strict commit memory accounting, > if the mapping is readonly; but so long as we get that counting right, > I don't think it's really going to matter at all if we sometimes fail > an mremap for that reason - but probably need to avoid mistaking the > common case (mremap of own area) for the rare case which needs this > copying (mremap of inherited area). It still looks like quite a hack to me, though I must agree that in a desktop scenario with swapoff -a, it will save around 24 bytes per anonymous vma and 12 bytes per file vma plus it doesn't restrict the vma merging in any way, compared to my anon_vma, and it saves me from worrying about people doing a flood of vma_splits that will generate a long list of vmas for every anon_vma. I still feel anon_vma is preferable to anonmm+linus-unshare-mremap if one needs to swap, and while the prio_tree on i_mmap{shared} in practice is needed only for 32bit apps, I know some apps with hundreds of processes allocating huge chunks of direct anon memory each and swapping a lot at the same time. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-13 16:18 ` Linus Torvalds
  2004-03-13 17:24   ` Hugh Dickins
@ 2004-03-13 17:33   ` Andrea Arcangeli
  2004-03-13 17:53     ` Hugh Dickins
  2004-03-13 17:57     ` Rik van Riel
  1 sibling, 2 replies; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-13 17:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, William Lee Irwin III, Hugh Dickins, Ingo Molnar,
      Andrew Morton, Kernel Mailing List

On Sat, Mar 13, 2004 at 08:18:48AM -0800, Linus Torvalds wrote:
>
> Ok, guys,
> how about this anon-page suggestion?
>
> I'm a bit nervous about the complexity issues in Andrea's current setup,
> so I've been thinking about Rik's per-mm thing. And I think that there is
> one very simple approach, which should work fine, and should have minimal
> impact on the existing setup exactly because it is so simple.
>
> Basic setup:
>  - each anonymous page is associated with exactly _one_ virtual address,
>    in a "anon memory group".
>
>    We put the virtual address (shifted down by PAGE_SHIFT) into
>    "page->index". We put the "anon memory group" pointer into
>    "page->mapping". We have a PAGE_ANONYMOUS flag to tell the
>    rest of the world about this.
>
>  - the anon memory group has a list of all mm's that it is associated
>    with.
>
>  - an "execve()" creates a new "anon memory group" and drops the old one.
>
>  - a mm copy operation just increments the reference count and adds the
>    new mm to the mm list for that anon memory group.

This is the anonmm from Hugh.

> So now to do reverse mapping, we can take a page, and do
>
>	if (PageAnonymous(page)) {
>		struct anongroup *mmlist = (struct anongroup *)page->mapping;
>		unsigned long address = page->index << PAGE_SHIFT;
>		struct mm_struct *mm;
>
>		for_each_entry(mm, mmlist->anon_mms, anon_mm) {
>			.. look up page in page tables in "mm, address" ..
>			.. most of the time we may not even need to look ..
>			.. up the "vma" at all, just walk the page tables ..
>		}
>	} else {
>		/* Shared page */
>		.. look up page using the inode vma list ..
>	}
>
> The above all works 99% of the time.

This is again exactly the anonmm from Hugh.

BTW (for completeness), I was thinking last night that the anonmm could
in theory handle mremap correctly too, without changes like the one
below, if it walked the whole list of vmas reachable from mm->mmap for
every mm in the anonmm (your anongroup; Hugh called it struct anonmm
instead of struct anongroup). The problem is that checking all the vmas
is expensive and a single find_vma is a lot faster, but find_vma has no
way to take vm_pgoff into the equation, and in turn it breaks with
mremap.

> The only problem is mremap() after a fork(), and hell, we know that's a
> special case anyway, and let's just add a few lines to copy_one_pte(),
> which basically does:
>
>	if (PageAnonymous(page) && page->count > 1) {
>		newpage = alloc_page();
>		copy_page(page, newpage);
>		page = newpage;
>	}
>	/* Move the page to the new address */
>	page->index = address >> PAGE_SHIFT;
>
> and now we have zero special cases.

Here you're basically saying that you agree with Hugh that anonmm is
the way to go, and you're providing one of the possible ways to handle
mremap correctly with anonmm (without using pte_chains). Above I also
provided another alternate way to handle mremap correctly with anonmm
(that is, to inefficiently walk all of mm->mmap and to try unmapping
from all vmas with vma->vm_file == NULL).

What I called anon_vma_global in an older email is the more efficient
version of checking all the vmas in mm->mmap: a prio_tree could index
all the anon vmas in each mm, taking vm_pgoff into consideration,
unlike find_vma(page->index). That still takes memory for each vma
though, and it also still forces us to check all unrelated mm address
spaces too (see later in the email for details on this).
But returning to your proposed solution to the mremap problem with the
anonmm design: that will certainly work. Rather than trying to handle
that case correctly, we just make it impossible for the condition to
happen. I don't like unsharing pages very much, but it may save more
memory than it actually wastes; the problem is that it depends on the
workload.

The remaining downside of all the global anonmm designs vs my
fine-grained anon_vma design is that if you execute a malloc in a child
(that will be direct memory with page->count == 1), you'll still have
to try all the mms in the anongroup (which can be on the order of
thousands), while the anon_vma design would immediately reach only the
right vma in the right mm, and would not try the wrong vmas in the
other mms (i.e. no find_vma). That isn't fixable with the anonmm
design.

I think the only important thing is to avoid the _per-page_ overhead of
the pte_chains; a _per-vma_ 12-byte cost for the anon_vma doesn't sound
like an issue to me if it can save significant cpu in a setup with
thousands of tasks, each one executing a malloc. A single vma can cover
plenty of memory.

Note that even the i_mmap{,shared} methods (even with a prio_tree!) may
actually check vmas (and in turn mm_structs too) where the page has
been substituted with an anonymous copy during a COW fault, if the vma
has been mapped MAP_PRIVATE. We cannot avoid checking unrelated
mm_structs with MAP_PRIVATE usages (since the only place holding that
information is the pte itself, so by the time we find the answer it's
too late to avoid asking the question), but I can avoid that for the
anonymous memory with my anon_vma design. And my anon_vma gets mremap
right too, without the need of prio trees like the anon_vma_global
design I proposed requires, and while still allowing sharing of pages
through mremap.

The downsides of anon_vma vs anonmm+linus-unshare-during-mremap are
that anon_vma requires a 12-byte object per anonymous vma, and secondly
it requires 12 bytes per vma for the anon_vma_node list_head and the
anon_vma pointer. So it's a worst-case 24-byte overhead per anonymous
vma (on average it will be slightly less, since the anon_vmas can be
shared). Secondly, anon_vma forbids merging of vmas with different
anon_vmas or with different vm_pgoff, though for all appends there will
be no problem at all: appends with mmap are guaranteed to work. A
munmap+mmap gap creation and gap fill is also guaranteed to work (since
split_vma will make both the prev and next vma share the same
anon_vma).

The advantage of anon_vma is that it will track all vmas in the most
fine-grained way possible, avoiding having the unmapping code walk mms
that for sure have nothing to do with the page we want to unmap, plus
it handles mremap (allowing sharing and avoiding copies). It avoids the
find_vma cost too. I'm not sure whether the pros are worth the
additional 24 bytes per anonymous vma; the complexity doesn't worry me
though. Also, when the cost is truly 24 bytes we'll see the biggest
advantage; if the advantage is low it means the cost is less than 24
bytes, since the anon_vma is shared.

> What do you think? To me, this seems to be a really simple approach..

I certainly agree it's simpler. I'm quite undecided at the moment
whether to give up on the anon_vma and use anonmm plus your unshare
during mremap. While it's simpler, it's also a definitely inferior
solution, since it uses the mremap hack to work safely and it will
check all mms in the group with find_pte no matter whether they are
worth checking; but at the same time, if one is never swapping and
never using mremap, it will save some memory from the anon_vma overhead
(and it will also be non-exploitable without the need of a prio_tree).

With anon_vma and without a prio_tree on top of it, one could try
executing a flood of vma_splits, which could cause memory waste during
swapping, but all real applications would definitely swap better with
anon_vma than with anonmm. I mean, I would expect the pte_chain
advocates to agree anon_vma is a lot better than anonmm: they were
going to throw 8 bytes per pte at the problem to save cpu during
swapping; now I throw only 24 bytes per vma at it (with each vma still
extendable through merging) and I still provide optimal swapping with
minimal complexity. So they should like the fine-grained way more than
unsharing with mremap and not scaling during swapping by checking all
unrelated mms too. anon_vma basically sits in between anonmm and
pte_chains.

It was more than enough for me to save all the memory wasted in the
pte_chains on the 64bit archs with huge anonymous vma blocks, but I
didn't want to give up the swap scalability either with many processes
(with i_mmap{,shared} we already have enough trouble with scalability
during swapping, and I didn't want to think about those issues for the
anonymous memory too, with some thousands of tasks as it will run in
practice). If I go straight ahead with anon_vma I'm basically
guaranteed that I can forget about the anonymous vma swapping and that
all real life apps will scale _as_well_ as with the pte_chains, and I'm
guaranteed not to run into issues with mremap (though I don't expect
trouble there).

^ permalink raw reply  [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-13 17:33 ` Andrea Arcangeli
@ 2004-03-13 17:53   ` Hugh Dickins
  2004-03-13 18:13     ` Andrea Arcangeli
  2004-03-13 17:57   ` Rik van Riel
  1 sibling, 1 reply; 74+ messages in thread
From: Hugh Dickins @ 2004-03-13 17:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Rik van Riel, William Lee Irwin III, Ingo Molnar,
      Andrew Morton, Kernel Mailing List

On Sat, 13 Mar 2004, Andrea Arcangeli wrote:
>
> I certainly agree it's simpler. I'm quite undecided if to giveup on the
> anon_vma and to use anonmm plus your unshared during mremap at the
> moment, while it's simpler it's also a definitely inferior solution

I think you should persist with anon_vma and I should resurrect anonmm,
and let others decide between those two and pte_chains.

But while in this trial phase, can we both do it in such a way as to
avoid too much trivial change all over the tree? For example, I'm
thinking I need to junk my irrelevant renaming of put_dirty_page to
put_stack_page, and for the moment it would help if you cut out your
mapping -> as.mapping changes (when I came to build yours, I had to go
through various filesystems I had in my config updating them
accordingly). It's a correct change (which I was too lazy to do, used
evil casting instead) but better left as a tidyup for later?

Hugh

^ permalink raw reply  [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-13 17:53 ` Hugh Dickins
@ 2004-03-13 18:13   ` Andrea Arcangeli
  2004-03-13 19:35     ` Hugh Dickins
  0 siblings, 1 reply; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-13 18:13 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Rik van Riel, William Lee Irwin III, Ingo Molnar,
      Andrew Morton, Kernel Mailing List

On Sat, Mar 13, 2004 at 05:53:36PM +0000, Hugh Dickins wrote:
> On Sat, 13 Mar 2004, Andrea Arcangeli wrote:
> >
> > I certainly agree it's simpler. I'm quite undecided if to giveup on the
> > anon_vma and to use anonmm plus your unshared during mremap at the
> > moment, while it's simpler it's also a definitely inferior solution
>
> I think you should persist with anon_vma and I should resurrect
> anonmm, and let others decide between those two and pte_chains.
>
> But while in this trial phase, can we both do it in such a way as to
> avoid too much trivial change all over the tree? For example, I'm
> thinking I need to junk my irrelevant renaming of put_dirty_page to
> put_stack_page, and for the moment it would help if you cut out your
> mapping -> as.mapping changes (when I came to build yours, I had to
> go through various filesystems I had in my config updating them
> accordingly). It's a correct change (which I was too lazy to do,
> used evil casting instead) but better left as a tidyup for later?

Yes, we should split it in two patches; one is the "preparation" for a
reused page->as.mapping. You know I did it differently to retain the
swapper_space and to avoid hooking explicit "if (PageSwapCache)" checks
into things like sync_page.

About using the union: I still prefer it. I've seen that Linus used an
explicit cast in the pseudocode too, but I don't feel safe with
explicit casts; I prefer more breakage to risking forgetting to
convert some page->mapping into page_mapping, or similar issues with
the casts ;)

I'll return to working on this after the weekend. You can find my
latest status on the ftp; if you extract any interesting "common" bit
from there, just send it to me too. Thanks.

^ permalink raw reply  [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-13 18:13 ` Andrea Arcangeli
@ 2004-03-13 19:35   ` Hugh Dickins
  0 siblings, 0 replies; 74+ messages in thread
From: Hugh Dickins @ 2004-03-13 19:35 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Rik van Riel, William Lee Irwin III, Ingo Molnar,
      Andrew Morton, Kernel Mailing List

On Sat, 13 Mar 2004, Andrea Arcangeli wrote:
>
> yes, we should split in two patches, one is the "peparation" for a
> reused page->as.mapping, you know I did it differently to retain the
> swapper_space and avoiding to hook explicit "if (PageSwapCache)" checks
> into things like sync_page.
>
> About using the union, I still prefer it, I've seen Linus in the
> pseudocode used an explicit cast too, but I don't feel safe with
> explicit casts, I prefer more breakage, than risking to forget
> converting any page->mapping into page_maping or similar issues with the
> casts ;)

Your union is right, and my casting lazy, no question of that. It's
just that we'd need to do a whole lot of cosmetic edits to get fully
building trees, distracting from the guts of it. In my case, anyway,
the number of places that actually use the casting are very few (just
rmap.c?), suspect it's same for you. I'm certainly not arguing against
sanity checks where needed, just against treewide edits (or broken
builds) for now.

> I'll return working on this after the weekend. You can find my latest
> status on the ftp, if you extract any interesting "common" bit from
> there just send it to me too. thanks.

Thanks a lot. I don't imagine you've done the nonlinear vma case yet,
but when you or Rajesh do, please may I just steal it, okay?

Hugh

^ permalink raw reply  [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-13 17:33 ` Andrea Arcangeli
  2004-03-13 17:53   ` Hugh Dickins
@ 2004-03-13 17:57   ` Rik van Riel
  1 sibling, 0 replies; 74+ messages in thread
From: Rik van Riel @ 2004-03-13 17:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, William Lee Irwin III, Hugh Dickins, Ingo Molnar,
      Andrew Morton, Kernel Mailing List

On Sat, 13 Mar 2004, Andrea Arcangeli wrote:

> The remaining downside of all the global anonmm designs vs my finegrined
> anon_vma design, is that if you execute a malloc in a child (that will
> be direct memory with page->count == 1), you'll still have to try all
> the mm in the anongroup (that can be on the order of the thousands),

That's ok, you have a similar issue with very commonly mmap()d files,
where some pages haven't been faulted in by most processes, or have
been replaced by private pages after a COW fault due to a MAP_PRIVATE
mapping.

You just increase the number of pages for which this search is done,
but I suspect that shouldn't be a big worry...

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply  [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-12 12:21 ` Andrea Arcangeli
    ` (2 preceding siblings ...)
  2004-03-12 12:46 ` William Lee Irwin III
@ 2004-03-12 13:43 ` Hugh Dickins
  2004-03-12 15:56   ` Andrea Arcangeli
  3 siblings, 1 reply; 74+ messages in thread
From: Hugh Dickins @ 2004-03-12 13:43 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Ingo Molnar, Andrew Morton, torvalds, linux-kernel,
      William Lee Irwin III

Thanks a lot for pointing us to your (last night's) patches, Andrea.

On Fri, 12 Mar 2004, Andrea Arcangeli wrote:
> On Thu, Mar 11, 2004 at 10:28:42PM -0500, Rik van Riel wrote:
>
> It's not that I didn't read anonmm patches from Hugh, I spent lots of
> time on those, they just were flawed and they couldn't handle mremap,
> he very well knows, see anobjrmap-5 for instance.

Flawed in what way? They handled mremap fine, but yes, used pte_chains
for that extraordinary case, just as pte_chains were used for
nonlinear. With pte_chains gone (hurrah! though nonlinear handling yet
to come), as you know, I've already suggested a better way to handle
that case (use a tmpfs-style backing object).

> the vma merging isn't a problem, we need to rework the code anyways to
> allow the file merging in both mprotect and mremap (currently only mmap
> is capable of merging files, and in turn it's also the only one capable
> of merging anon_vmas). Any merging code that is currently capable of
> merging files is easy to teach about anon_vmas too, it's basically the
> same problem at merging.

You're paying too much attention to the (almost optional, though it
can have a devastating effect on vma usage, yes) issue of vma merging;
but what about the (mandatory) vma splitting? I see no sign of the
tiresome code I said you'd need for anonvma rather than anonmm,
walking the pages updating as.vma whenever a vma changes, e.g. when
mprotecting or munmapping some pages in the middle of a vma. Surely
move_vma_start is not enough?

That's what led me to choose anonmm, which seems a lot simpler: the
real argument for anonvma is that it saves a find_vma per pte in
try_to_unmap (page_referenced doesn't need it): a good saving, but is
it worth the complication of the faster paths?

Hugh

^ permalink raw reply  [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-12 13:43 ` Hugh Dickins
@ 2004-03-12 15:56   ` Andrea Arcangeli
  2004-03-12 16:12     ` Hugh Dickins
  0 siblings, 1 reply; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-12 15:56 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Rik van Riel, Ingo Molnar, Andrew Morton, torvalds, linux-kernel,
      William Lee Irwin III

On Fri, Mar 12, 2004 at 01:43:23PM +0000, Hugh Dickins wrote:
> Thanks a lot for pointing us to your (last night's) patches, Andrea.
>
> On Fri, 12 Mar 2004, Andrea Arcangeli wrote:
> > On Thu, Mar 11, 2004 at 10:28:42PM -0500, Rik van Riel wrote:
> >
> > It's not that I didn't read anonmm patches from Hugh, I spent lots of
> > time on those, they just were flawed and they couldn't handle mremap,
> > he very well knows, see anobjrmap-5 for instance.
>
> Flawed in what way? They handled mremap fine, but yes, used pte_chains
> for that extraordinary case, just as pte_chains were used for nonlinear.

"Using pte_chains for the extraordinary case" (which is a common case
for some apps) means it doesn't handle it, and you have to use rmap to
handle that case.

> With pte_chains gone (hurrah! though nonlinear handling yet to come),
> as you know, I've already suggested a better way to handle that case
> (use tmpfs-style backing object).

Do you realize the complexity of creating a tmpfs inode and attaching
all vmas to it, stacked on top of anonmm? And after you fix mremap you
get the same disadvantages for merging of vmas (remember my
disadvantage of not merging after an mremap: you won't merge either),
plus it wastes a lot more ram since you need a fake inode for every
anonymous vma, and it's ugly to create those objects inside mremap. My
transient object is 8 bytes per group of vmas. And you need even the
prio_tree search on top of the anonmm. Don't forget you can't re-use
the vma->shared for doing the tmpfs-style thing, that's already used by
a true inode. So what you're suggesting would become a huge mess to
implement IMHO.

The anon_vma sounds like a much cleaner and more efficient design to me
than stacking inode-like objects on top of a vma already queued in an
i_mmap.

> > the vma merging isn't a problem, we need to rework the code anyways to
> > allow the file merging in both mprotect and mremap (currently only mmap
> > is capable of merging files, and in turn it's also the only one capable
> > of merging anon_vmas). Any merging code that is currently capable of
> > merging files is easy to teach about anon_vmas too, it's basically the
> > same problem at merging.
>
> You're paying too much attention to the (almost optional, though it can
> have a devastating effect on vma usage, yes) issue of vma merging, but
> what about the (mandatory) vma splitting? I see no sign of the tiresome
> code I said you'd need for anonvma rather than anonmm, walking the pages
> updating as.vma whenever vma changes e.g. when mprotecting or munmapping
> some pages in the middle of a vma. Surely move_vma_start is not enough?

You're right about vma_split: the way I implemented it is wrong;
basically the as.vma/PageDirect idea falls apart with vma_split. I
should simply allocate the anon_vma without passing through the direct
mode. That will fix it, though it'll be a bit less efficient for the
first page fault in an anonymous vma (only the first one; for all the
other page faults it'll be as fast as the direct mode).

This is probably why the code was not stable yet, btw ;) so I greatly
appreciate your comments about it; it's just the optimization I did
that was invalid. I could retain the optimization with a list of pages
attached to the vma, but it isn't worth it: allocating the anon_vma is
way too cheap compared to that. The PageDirect path was a
micro-optimization only; any additional complexity to retain it is
worthless.

> That's what led me to choose anonmm, which seems a lot simpler: the real
> argument for anonvma is that it saves a find_vma per pte in try_to_unmap
> (page_referenced doesn't need it): a good saving, but is it worth the
> complication of the faster paths?

The only real argument is mremap; your tmpfs-like thing is overkill
compared to anon_vma, and secondly I don't need the prio_tree to scale.

^ permalink raw reply  [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-12 15:56 ` Andrea Arcangeli
@ 2004-03-12 16:12   ` Hugh Dickins
  2004-03-12 16:39     ` Andrea Arcangeli
  0 siblings, 1 reply; 74+ messages in thread
From: Hugh Dickins @ 2004-03-12 16:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Ingo Molnar, Andrew Morton, torvalds, linux-kernel,
      William Lee Irwin III

On Fri, 12 Mar 2004, Andrea Arcangeli wrote:
> On Fri, Mar 12, 2004 at 01:43:23PM +0000, Hugh Dickins wrote:
>
> Don't forget you can't re-use the vma->shared for doing the tmpfs-style
> thing, that's already in a true inode.

Good point, I was overlooking that. I'll see if I can come up with
something, but that may well prove a killer.

> you're right about vma_split, the way I implemented it is wrong,
> basically the as.vma/PageDirect idea is falling apart with vma_split.
> I should simply allocate the anon_vma without passing through the direct

Yes, that'll take a lot of the branching out, all much simpler.

> mode, that will fix it though it'll be a bit less efficient for the
> first page fault in an anonymous vma (only the first one, for all the
> other page faults it'll be as fast as the direct mode).

Simpler still to allocate it earlier? Perhaps too wasteful.

Hugh

^ permalink raw reply  [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-12 16:12 ` Hugh Dickins
@ 2004-03-12 16:39   ` Andrea Arcangeli
  0 siblings, 0 replies; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-12 16:39 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Rik van Riel, Ingo Molnar, Andrew Morton, torvalds, linux-kernel,
      William Lee Irwin III

On Fri, Mar 12, 2004 at 04:12:10PM +0000, Hugh Dickins wrote:
> > you're right about vma_split, the way I implemented it is wrong,
> > basically the as.vma/PageDirect idea is falling apart with vma_split.
> > I should simply allocate the anon_vma without passing through the direct
>
> Yes, that'll take a lot of the branching out, all much simpler.

indeed.

> Simpler still to allocate it earlier? Perhaps too wasteful.

One trouble with allocating it earlier is that insert_vm_struct would
need to return an -ENOMEM retval, plus things like MAP_PRIVATE don't
necessarily need an anon_vma ever (true anon mappings tend to always
need it instead ;). So I will have to add an anon_vma_prepare(vma)
near all SetPageAnon; that's easy. In fact I may want to coalesce the
two things together; it will look like:

	int anon_vma_prepare_page(vma, page)
	{
		if (!vma->anon_vma) {
			vma->anon_vma = anon_vma_alloc();
			if (!vma->anon_vma)
				return -ENOMEM;
			/* single threaded, no locks here */
			list_add(&vma->anon_vma_node,
				 &vma->anon_vma->anon_vma_head);
		}
		SetPageAnon(page);
		return 0;
	}

I will have to handle a retval failure from there; that's the only
annoyance of removing the PageDirect optimization. I really did the
PageDirect mostly to leave all the anon_vma allocations to fork(). Now
it's the exact opposite: fork will never need to allocate any anon_vma
anymore, it will only boost the page->mapcount.

^ permalink raw reply  [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-11 13:23 ` Hugh Dickins
  2004-03-11 13:56   ` Andrea Arcangeli
@ 2004-03-11 17:33   ` Andrea Arcangeli
  2004-03-11 22:20 ` Rik van Riel
  2 siblings, 0 replies; 74+ messages in thread
From: Andrea Arcangeli @ 2004-03-11 17:33 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Ingo Molnar, Andrew Morton, torvalds, linux-kernel,
      William Lee Irwin III

ok, it links and boots ;)

At the previous try, with slab debugging enabled, it was spawning tons
of errors, but I suspect it's a bug in the slab debugging: it was
complaining about red zone memory corruption, which could be due to
the tiny size of this object (only 8 bytes).

andrea@xeon:~> grep anon_vma /proc/slabinfo
anon_vma            1230   1500     12  250    1 : tunables  120   60    8 : slabdata      6      6      0
andrea@xeon:~>

now I need to try swapping... (I guess it won't work at the first try,
I'd be surprised if I didn't miss any s/index/private/)

^ permalink raw reply  [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-11 13:23 ` Hugh Dickins
  2004-03-11 13:56   ` Andrea Arcangeli
  2004-03-11 17:33   ` Andrea Arcangeli
@ 2004-03-11 22:20 ` Rik van Riel
  2004-03-11 23:43   ` Hugh Dickins
  2 siblings, 1 reply; 74+ messages in thread
From: Rik van Riel @ 2004-03-11 22:20 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Ingo Molnar, Andrew Morton, torvalds,
      linux-kernel, William Lee Irwin III

On Thu, 11 Mar 2004, Hugh Dickins wrote:

> length of your essay on vma merging, it strikes me that you've taken
> a wrong direction in switching from my anon mm to your anon vma.
>
> Go by vmas and you have tiresome problems as they are split and merged,
> very commonly. Plus you have the overhead of new data structure per vma.

There's of course a blindingly simple alternative.

Add every anonymous page to an "anon_memory" inode. Then everything is
in effect file backed. Using the same page refcounting we already do,
holes get shot into that "file".

The swap cache code provides a filesystem-like mapping from the
anon_memory "files" to the on-disk stuff, or the anon_memory file
pages are resident in memory.

As a side effect, it also makes it possible to get rid of the swapoff
code: simply move the anon_memory file pages from disk into memory...

We can avoid BSD memory object style code by simply having multiple
processes share the same anon_memory inode, allocating extents of
virtual space at once to reduce VMA count.

Not sure to what extent this is similar to what Hugh's stuff already
does, though, or if it's just a different way of saying how it's done
... I need to re-read the code ;)

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply  [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-11 22:20 ` Rik van Riel
@ 2004-03-11 23:43   ` Hugh Dickins
  2004-03-12  3:20     ` Rik van Riel
  0 siblings, 1 reply; 74+ messages in thread
From: Hugh Dickins @ 2004-03-11 23:43 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, Ingo Molnar, Andrew Morton, Linus Torvalds,
      William Lee Irwin III, linux-kernel

On Thu, 11 Mar 2004, Rik van Riel wrote:
> On Thu, 11 Mar 2004, Hugh Dickins wrote:
>
> > length of your essay on vma merging, it strikes me that you've taken
> > a wrong direction in switching from my anon mm to your anon vma.
> >
> > Go by vmas and you have tiresome problems as they are split and merged,
> > very commonly. Plus you have the overhead of new data structure per vma.
>
> There's of course a blindingly simple alternative.
>
> Add every anonymous page to an "anon_memory" inode. Then
> everything is in effect file backed. Using the same page
> refcounting we already do, holes get shot into that "file".

Okay, Rik, the two extremes belong to you: one anon memory object in
total (above), and one per page (your original rmap); whereas Andrea
is betting on one per vma, and I go for one per mm. Each way has its
merits, I'm sure - and you've placed two bets!

> The swap cache code provides a filesystem like mapping
> from the anon_memory "files" to the on-disk stuff, or the
> anon_memory file pages are resident in memory.

For 2.7 something like that may well be reasonable. But let's beware
the fancy bloat of extra levels.

> As a side effect, it also makes it possible to get rid
> of the swapoff code, simply move the anon_memory file
> pages from disk into memory...

Wonderful if that code could disappear: but I somehow doubt it'll fall
out quite so easily - swapoff is inevitably backwards from sanity,
isn't it?

Hugh

^ permalink raw reply  [flat|nested] 74+ messages in thread
* Re: anon_vma RFC2
  2004-03-11 23:43 ` Hugh Dickins
@ 2004-03-12  3:20   ` Rik van Riel
  0 siblings, 0 replies; 74+ messages in thread
From: Rik van Riel @ 2004-03-12 3:20 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Ingo Molnar, Andrew Morton, Linus Torvalds,
      William Lee Irwin III, linux-kernel

On Thu, 11 Mar 2004, Hugh Dickins wrote:

> Okay, Rik, the two extremes belong to you: one anon memory
> object in total (above), and one per page (your original rmap);
> whereas Andrea is betting on one per vma, and I go for one per mm.
> Each way has its merits, I'm sure - and you've placed two bets!

I suspect yours is the best mix.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply  [flat|nested] 74+ messages in thread
end of thread, other threads:[~2004-03-14 2:27 UTC | newest]
Thread overview: 74+ messages
-- links below jump to the message on this page --
[not found] <20040310080000.GA30940@dualathlon.random>
2004-03-10 13:01 ` [lockup] Re: objrmap-core-1 (rmap removal for file mappings to avoid 4:4 in <=16G machines) Rik van Riel
2004-03-10 13:50 ` Andrea Arcangeli
2004-03-12 17:05 ` anon_vma RFC2 Rajesh Venkatasubramanian
2004-03-12 17:26 ` Andrea Arcangeli
2004-03-12 21:16 ` Rajesh Venkatasubramanian
2004-03-13 17:55 ` Rajesh Venkatasubramanian
2004-03-13 18:16 ` Andrea Arcangeli
2004-03-13 19:40 ` Rajesh Venkatasubramanian
2004-03-14 0:23 ` Andrea Arcangeli
2004-03-14 0:52 ` Linus Torvalds
2004-03-14 1:01 ` William Lee Irwin III
2004-03-14 1:07 ` Rik van Riel
2004-03-14 1:19 ` William Lee Irwin III
2004-03-14 1:41 ` Rik van Riel
2004-03-14 2:27 ` William Lee Irwin III
2004-03-14 1:15 ` Linus Torvalds
2004-03-11 20:09 Manfred Spraul
-- strict thread matches above, loose matches on Subject: below --
2004-03-08 20:24 objrmap-core-1 (rmap removal for file mappings to avoid 4:4 in <=16G machines) Andrea Arcangeli
2004-03-09 10:52 ` [lockup] " Ingo Molnar
2004-03-09 11:02 ` Ingo Molnar
2004-03-09 11:09 ` Andrew Morton
2004-03-09 11:49 ` Ingo Molnar
2004-03-09 16:03 ` Andrea Arcangeli
2004-03-10 10:36 ` RFC anon_vma previous (i.e. full objrmap) Andrea Arcangeli
2004-03-11 6:52 ` anon_vma RFC2 Andrea Arcangeli
2004-03-11 13:23 ` Hugh Dickins
2004-03-11 13:56 ` Andrea Arcangeli
2004-03-11 21:54 ` Hugh Dickins
2004-03-12 1:47 ` Andrea Arcangeli
2004-03-12 2:20 ` Andrea Arcangeli
2004-03-12 3:28 ` Rik van Riel
2004-03-12 12:21 ` Andrea Arcangeli
2004-03-12 12:40 ` Rik van Riel
2004-03-12 13:11 ` Andrea Arcangeli
2004-03-12 16:25 ` Rik van Riel
2004-03-12 17:13 ` Andrea Arcangeli
2004-03-12 17:23 ` Rik van Riel
2004-03-12 17:44 ` Andrea Arcangeli
2004-03-12 18:18 ` Rik van Riel
2004-03-12 18:25 ` Linus Torvalds
2004-03-12 18:48 ` Rik van Riel
2004-03-12 19:02 ` Chris Friesen
2004-03-12 19:06 ` Rik van Riel
2004-03-12 19:10 ` Chris Friesen
2004-03-12 19:14 ` Rik van Riel
2004-03-12 20:27 ` Andrea Arcangeli
2004-03-12 20:32 ` Rik van Riel
2004-03-12 20:49 ` Andrea Arcangeli
2004-03-12 21:08 ` Jamie Lokier
2004-03-12 12:42 ` Andrea Arcangeli
2004-03-12 12:46 ` William Lee Irwin III
2004-03-12 13:24 ` Andrea Arcangeli
2004-03-12 13:40 ` William Lee Irwin III
2004-03-12 13:55 ` Hugh Dickins
2004-03-12 16:01 ` Andrea Arcangeli
2004-03-12 16:17 ` Linus Torvalds
2004-03-13 0:28 ` William Lee Irwin III
2004-03-13 14:43 ` Rik van Riel
2004-03-13 16:18 ` Linus Torvalds
2004-03-13 17:24 ` Hugh Dickins
2004-03-13 17:28 ` Rik van Riel
2004-03-13 17:41 ` Hugh Dickins
2004-03-13 18:08 ` Andrea Arcangeli
2004-03-13 17:54 ` Andrea Arcangeli
2004-03-13 17:55 ` Andrea Arcangeli
2004-03-13 18:57 ` Linus Torvalds
2004-03-13 19:14 ` Hugh Dickins
2004-03-13 17:48 ` Andrea Arcangeli
2004-03-13 17:33 ` Andrea Arcangeli
2004-03-13 17:53 ` Hugh Dickins
2004-03-13 18:13 ` Andrea Arcangeli
2004-03-13 19:35 ` Hugh Dickins
2004-03-13 17:57 ` Rik van Riel
2004-03-12 13:43 ` Hugh Dickins
2004-03-12 15:56 ` Andrea Arcangeli
2004-03-12 16:12 ` Hugh Dickins
2004-03-12 16:39 ` Andrea Arcangeli
2004-03-11 17:33 ` Andrea Arcangeli
2004-03-11 22:20 ` Rik van Riel
2004-03-11 23:43 ` Hugh Dickins
2004-03-12 3:20 ` Rik van Riel