* Re: [patch 00/14] remap_file_pages protection support
       [not found] <20060430172953.409399000@zion.home.lan>
@ 2006-05-02  3:45 ` Nick Piggin
  2006-05-02  3:56   ` Nick Piggin
                     ` (3 more replies)
       [not found] ` <20060430173025.752423000@zion.home.lan>
  1 sibling, 4 replies; 27+ messages in thread
From: Nick Piggin @ 2006-05-02  3:45 UTC (permalink / raw)
  To: blaisorblade; +Cc: Andrew Morton, linux-kernel, Linux Memory Management
blaisorblade@yahoo.it wrote:
> The first idea is to use this for UML - it must create a lot of single page
> mappings, and managing them through separate VMAs is slow.
I don't know about this. The patches add some complexity, I guess because
we now have vmas which cannot communicate the protectedness of the pages.
Still, nobody was too concerned about nonlinear mappings doing the same
for addressing. But this does seem to add more overhead to the common cases
in the VM :(
Now I didn't follow the earlier discussions on this much, but let me try
making a few silly comments to get things going again (cc'ed linux-mm).
I think I'd rather see this all just folded under VM_NONLINEAR than have
this extra MANYPROTS thing, no? (You're already doing that in one
direction.)
> 
> Additional note: this idea, with some further refinements (which I'll code after
> this chunk is accepted), will make it possible to reduce the number of VMAs
> used by most userspace programs - in particular, to avoid creating one VMA
> per guard page (which has PROT_NONE) - forcing PROT_NONE on that page will be
> enough.
I think that's silly. Your VM_MANYPROTS|VM_NONLINEAR vmas will cause more
overhead in faulting and reclaim.
It looks like it would take an hour or two just to code up a patch which
puts a VM_GUARDPAGES flag into the vma, and tells the free area allocator
to skip vm_start-1 .. vm_end+1. What kind of trouble has prevented
something simple and easy like that from going in?
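Something along these lines, against the generic arch_get_unmapped_area()
walk (a rough sketch only; the two helpers are made-up names, not from any
posted patch):

	/* hypothetical helpers: a VM_GUARDPAGES vma claims one extra page
	 * on each side as far as the free-area search is concerned */
	static inline unsigned long vma_guarded_start(struct vm_area_struct *vma)
	{
		return vma->vm_start -
			((vma->vm_flags & VM_GUARDPAGES) ? PAGE_SIZE : 0);
	}

	static inline unsigned long vma_guarded_end(struct vm_area_struct *vma)
	{
		return vma->vm_end +
			((vma->vm_flags & VM_GUARDPAGES) ? PAGE_SIZE : 0);
	}

	/* in the arch_get_unmapped_area()-style search loop: */
	for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
		if (TASK_SIZE - len < addr)
			return -ENOMEM;
		if (!vma || addr + len <= vma_guarded_start(vma))
			return addr;	/* hole big enough, guards included */
		addr = vma_guarded_end(vma);
	}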
> 
> This will be useful since the VMA lookup at fault time can be a bottleneck for
> some programs (I've received a report about this from Ulrich Drepper, and I've
> been told that Val Henson from Intel is also interested in this). I guess
> that since we use RB-trees, the slowness is also due to the poor cache locality
> of RB-trees (the RB nodes live inside the VMAs but aren't accessed together
> with their content), compared for instance with radix trees, where the lookup
> has high cache locality (though they have potentially worse space-usage
> problems on 64-bit machines).
Let's try to get back to the good old days when people actually reported
their bugs (together with *real* numbers) to the mailing lists. That way,
everybody gets to think about and discuss the problem.
-- 
SUSE Labs, Novell Inc.
* Re: [patch 11/14] remap_file_pages protection support: pte_present should not trigger on PTE_FILE PROTNONE ptes
       [not found] ` <20060430173025.752423000@zion.home.lan>
@ 2006-05-02  3:53   ` Nick Piggin
  2006-05-03  1:29     ` Blaisorblade
  0 siblings, 1 reply; 27+ messages in thread
From: Nick Piggin @ 2006-05-02  3:53 UTC (permalink / raw)
  To: blaisorblade; +Cc: Andrew Morton, linux-kernel, Linux Memory Management
blaisorblade@yahoo.it wrote:
> From: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
> 
> pte_present(pte) implies that pte_pfn(pte) is valid. Normally this holds even
> with a _PAGE_PROTNONE pte, but not when such a PTE is installed by
> the new install_file_pte; previously it stored only file offsets, no
> protections; with the patches it also stores protections, and can set
> _PAGE_PROTNONE|_PAGE_FILE.
Why is this combination useful? Can't you just drop the _PAGE_FILE from
_PAGE_PROTNONE ptes?
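For context: the 2.6-era i386 definition is

	#define pte_present(x)	((x).pte_low & (_PAGE_PRESENT | _PAGE_PROTNONE))

so a _PAGE_PROTNONE|_PAGE_FILE pte installed by the new install_file_pte
tests as present even though its pfn bits actually hold a file offset. The
fix the changelog implies is presumably something like this sketch (not the
verbatim hunk; _PAGE_FILE shares its bit with _PAGE_DIRTY, so it may only
be examined when _PAGE_PRESENT is clear):

	#define pte_present(x)						\
		(((x).pte_low & _PAGE_PRESENT) ||			\
		 (((x).pte_low & (_PAGE_PROTNONE | _PAGE_FILE)) ==	\
		  _PAGE_PROTNONE))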
> 
> zap_pte_range, when acting on such a pte, calls vm_normal_page and gets
> &mem_map[0], does page_remove_rmap, and we're easily in trouble, because it
> happens to find a page with mapcount == 0. And it BUGs on this!
> 
> I've seen this trigger easily and repeatably on UML on 2.6.16-rc3. This was
> likely avoided in the past by the PageReserved test - page 0 *had* to be
> reserved on i386 (dunno on UML).
> 
> Implementation follows for UML and i386.
> 
> To avoid additional overhead, I also considered adding likely() for
> _PAGE_PRESENT and unlikely() for the rest, but I'm uncertain about validity of
> possible [un]likely(pte_present()) occurrences.
Not present pages are likely to be pretty common when unmapping.
I don't like this patch much.
-- 
SUSE Labs, Novell Inc.
* Re: [patch 00/14] remap_file_pages protection support
  2006-05-02  3:45 ` [patch 00/14] remap_file_pages protection support Nick Piggin
@ 2006-05-02  3:56   ` Nick Piggin
  2006-05-02 11:24     ` Ingo Molnar
  2006-05-02 17:16   ` Lee Schermerhorn
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Nick Piggin @ 2006-05-02  3:56 UTC (permalink / raw)
  To: blaisorblade; +Cc: Andrew Morton, linux-kernel, Linux Memory Management
Nick Piggin wrote:
> blaisorblade@yahoo.it wrote:
> 
>> The first idea is to use this for UML - it must create a lot of single page
>> mappings, and managing them through separate VMAs is slow.
[...]
> Let's try to get back to the good old days when people actually reported
> their bugs (together with *real* numbers) to the mailing lists. That way,
> everybody gets to think about and discuss the problem.
Speaking of which, let's see some numbers for UML -- performance
and memory. I don't doubt your claims, but I (and others) would be
interested to see.
Thanks
PS. I'll be away for the next few days.
-- 
SUSE Labs, Novell Inc.
* Re: [patch 00/14] remap_file_pages protection support
  2006-05-02  3:56   ` Nick Piggin
@ 2006-05-02 11:24     ` Ingo Molnar
  2006-05-02 12:19       ` Nick Piggin
  0 siblings, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2006-05-02 11:24 UTC (permalink / raw)
  To: Nick Piggin
  Cc: blaisorblade, Andrew Morton, linux-kernel,
	Linux Memory Management
* Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >Let's try to get back to the good old days when people actually reported
> >their bugs (together with *real* numbers) to the mailing lists. That way,
> >everybody gets to think about and discuss the problem.
> 
> Speaking of which, let's see some numbers for UML -- performance and 
> memory. I don't doubt your claims, but I (and others) would be 
> interested to see.
firstly, thanks for the review feedback!
originally i tested this feature with some minimal amount of RAM 
simulated by UML - 128MB or so. That's just 32 thousand pages, but still 
the improvement was massive: context-switch times in UML were cut in 
half or more. Process-creation times improved 10-fold. With this feature 
included I accidentally (for the first time ever!) confused a UML shell 
prompt with a real shell prompt. (before that, UML was so slow [even in 
"skas mode"] that you'd immediately notice it by the shell's behavior)
the 'have 1 vma instead of 32,000 vmas' thing is a really, really big 
plus. It makes UML comparable to Xen, in rough terms of basic VM design.
Now imagine a somewhat larger setup - 16 GB RAM UML instance with 4 
million vmas per UML process ... Frankly, without 
sys_remap_file_pages_prot() the UML design is still somewhat of a toy.
	Ingo
* Re: [patch 00/14] remap_file_pages protection support
  2006-05-02 11:24     ` Ingo Molnar
@ 2006-05-02 12:19       ` Nick Piggin
  0 siblings, 0 replies; 27+ messages in thread
From: Nick Piggin @ 2006-05-02 12:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: blaisorblade, Andrew Morton, linux-kernel,
	Linux Memory Management
Ingo Molnar wrote:
> originally i tested this feature with some minimal amount of RAM 
> simulated by UML - 128MB or so. That's just 32 thousand pages, but still 
> the improvement was massive: context-switch times in UML were cut in 
> half or more. Process-creation times improved 10-fold. With this feature 
> included I accidentally (for the first time ever!) confused a UML shell 
> prompt with a real shell prompt. (before that, UML was so slow [even in 
> "skas mode"] that you'd immediately notice it by the shell's behavior)
Cool, thanks for the numbers.
> 
> the 'have 1 vma instead of 32,000 vmas' thing is a really, really big 
> plus. It makes UML comparable to Xen, in rough terms of basic VM design.
> 
> Now imagine a somewhat larger setup - 16 GB RAM UML instance with 4 
> million vmas per UML process ... Frankly, without 
> sys_remap_file_pages_prot() the UML design is still somewhat of a toy.
Yes, I guess I imagined the common case might have been slightly better;
however, with reasonable RAM utilisation, fragmentation means I wouldn't
be surprised if it easily gets close to that theoretical worst case.
My request for numbers was more about the Intel/glibc people than Paolo:
I do realise it is a problem for UML. I just like to see nice numbers :)
I think UML's really neat, so I'd love to see this get in. I don't see
any fundamental sticking point, given a few iterations, and some more
discussion.
-- 
SUSE Labs, Novell Inc.
* Re: [patch 00/14] remap_file_pages protection support
  2006-05-02  3:45 ` [patch 00/14] remap_file_pages protection support Nick Piggin
  2006-05-02  3:56   ` Nick Piggin
@ 2006-05-02 17:16   ` Lee Schermerhorn
  2006-05-03  1:20     ` Blaisorblade
  2006-05-03  0:25   ` Blaisorblade
  2006-05-03  0:44   ` Blaisorblade
  3 siblings, 1 reply; 27+ messages in thread
From: Lee Schermerhorn @ 2006-05-02 17:16 UTC (permalink / raw)
  To: Nick Piggin
  Cc: blaisorblade, Andrew Morton, linux-kernel,
	Linux Memory Management
On Tue, 2006-05-02 at 13:45 +1000, Nick Piggin wrote:
> blaisorblade@yahoo.it wrote:
> 
> > The first idea is to use this for UML - it must create a lot of single page
> > mappings, and managing them through separate VMAs is slow.
> 
> I don't know about this. The patches add some complexity, I guess because
> we now have vmas which cannot communicate the protectedness of the pages.
> Still, nobody was too concerned about nonlinear mappings doing the same
> for addressing. But this does seem to add more overhead to the common cases
> in the VM :(
> 
> Now I didn't follow the earlier discussions on this much, but let me try
> making a few silly comments to get things going again (cc'ed linux-mm).
> 
> I think I'd rather see this all just folded under VM_NONLINEAR than have
> this extra MANYPROTS thing, no? (You're already doing that in one
> direction.)
<snip>
One way I've seen this done on other systems is to use something like a
prio tree [e.g., see the shared policy support for shmem] for sub-vma
protection ranges.  Most vmas [I'm guessing here] will have only the
original protections or will be reprotected in toto.  So, one need only
allocate/populate the protection tree when sub-vma protections are
requested.   Then, one can test protections via the vma, perhaps with
access/check macros to hide the existence of the protection tree.  Of
course, adding a tree-like structure could introduce locking
complications/overhead in some paths where we'd rather not [just
guessing again].  Might be more overhead than just mucking with the ptes
[for UML], but would keep the ptes in sync with the vma's view of
"protectedness".
Lee
* Re: [patch 00/14] remap_file_pages protection support
  2006-05-02  3:45 ` [patch 00/14] remap_file_pages protection support Nick Piggin
  2006-05-02  3:56   ` Nick Piggin
  2006-05-02 17:16   ` Lee Schermerhorn
@ 2006-05-03  0:25   ` Blaisorblade
  2006-05-06 16:05     ` Ulrich Drepper
  2006-05-03  0:44   ` Blaisorblade
  3 siblings, 1 reply; 27+ messages in thread
From: Blaisorblade @ 2006-05-03  0:25 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, linux-kernel, Linux Memory Management,
	Ulrich Drepper, Val Henson
On Tuesday 02 May 2006 05:45, Nick Piggin wrote:
> blaisorblade@yahoo.it wrote:
> > This will be useful since the VMA lookup at fault time can be a
> > bottleneck for some programs (I've received a report about this from
> > Ulrich Drepper, and I've been told that Val Henson from Intel is also
> > interested in this). I guess that since we use RB-trees, the slowness
> > is also due to the poor cache locality of RB-trees (the RB nodes live
> > inside the VMAs but aren't accessed together with their content), compared
> > for instance with radix trees, where the lookup has high cache locality
> > (though they have potentially worse space-usage problems on 64-bit
> > machines).
> Let's try to get back to the good old days when people actually reported
> their bugs (together with *real* numbers) to the mailing lists. That way,
> everybody gets to think about and discuss the problem.
Indeed, I've not seen the numbers; I've been told of a problem with a "customer 
program", and Ingo connected my work with this problem. Frankly, I've always 
been astonished that looking up a 10-level tree can be slow. Poor cache 
locality is the only explanation I could think of.
That said, it was an add-on, not the original motivation of the work.
-- 
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
* Re: [patch 00/14] remap_file_pages protection support
  2006-05-02  3:45 ` [patch 00/14] remap_file_pages protection support Nick Piggin
                     ` (2 preceding siblings ...)
  2006-05-03  0:25   ` Blaisorblade
@ 2006-05-03  0:44   ` Blaisorblade
  2006-05-06  9:06     ` Nick Piggin
  3 siblings, 1 reply; 27+ messages in thread
From: Blaisorblade @ 2006-05-03  0:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, linux-kernel, Linux Memory Management,
	Ulrich Drepper, Val Henson
On Tuesday 02 May 2006 05:45, Nick Piggin wrote:
> blaisorblade@yahoo.it wrote:
> > The first idea is to use this for UML - it must create a lot of single
> > page mappings, and managing them through separate VMAs is slow.
> I think I'd rather see this all just folded under VM_NONLINEAR than have
> this extra MANYPROTS thing, no? (You're already doing that in one
> direction.)
That step is _temporary_ if the extra usages are accepted.
Also, I reported (changelog of patch 03/14) a definite API bug you get if you 
don't distinguish VM_MANYPROTS from VM_NONLINEAR. I'm pasting it here because 
that changelog is rather long:
"In fact, without this flag, we'd have indeed a regression with
remap_file_pages VS mprotect, on uniform nonlinear VMAs.
mprotect alters the VMA prots and walks each present PTE, ignoring installed
ones, even when pte_file() is on; their saved prots will be restored on 
faults,
ignoring VMA ones and losing the mprotect() on them. So, in do_file_page(), we
must restore anyway VMA prots when the VMA is uniform, as we used to do before
this trail of patches."
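In code, the distinction amounts to roughly this in do_file_page() (a
sketch; VM_MANYPROTS is the series' flag, pte_to_pgprot() a hypothetical
helper for the protections the series saves in file ptes):

	/* faulting on a pte_file() pte: */
	if (vma->vm_flags & VM_MANYPROTS)
		prot = pte_to_pgprot(orig_pte);	/* per-page prots win */
	else
		prot = vma->vm_page_prot;	/* uniform vma: mprotect() wins */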
> > Additional note: this idea, with some further refinements (which I'll
> > code after this chunk is accepted), will allow to reduce the number of
> > used VMAs for most userspace programs - in particular, it will allow to
> > avoid creating one VMA for one guard pages (which has PROT_NONE) -
> > forcing PROT_NONE on that page will be enough.
> I think that's silly. Your VM_MANYPROTS|VM_NONLINEAR vmas will cause more
> overhead in faulting and reclaim.
I know about that problem. In fact, that's exactly why we want VM_MANYPROTS 
without VM_NONLINEAR.
> It looks like it would take an hour or two just to code up a patch which
> puts a VM_GUARDPAGES flag into the vma, and tells the free area allocator
> to skip vm_start-1 .. vm_end+1
We must refine which pages to skip (the example I saw has only one guard
page, if I'm not mistaken), but...
> . What kind of trouble has prevented
> something simple and easy like that from going in?
That's a rather better idea... It's just that the original proposal was wider, 
and that we looked at the problem in the wrong way (plus we wanted to get the 
present work merged anyway, so that wasn't a problem).
Ulrich wanted to have code+data (+guard on 64-bit) in the same VMA, but I 
left the code+data VMA joining out, to think more about it, since currently 
it's too slow on swapout.
The other part is avoiding guard VMAs for thread stacks, and that could also 
be accomplished by your proposal - but only if this work is kept out.
-- 
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
* Re: [patch 00/14] remap_file_pages protection support
  2006-05-02 17:16   ` Lee Schermerhorn
@ 2006-05-03  1:20     ` Blaisorblade
  2006-05-03 14:35       ` Lee Schermerhorn
  0 siblings, 1 reply; 27+ messages in thread
From: Blaisorblade @ 2006-05-03  1:20 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Nick Piggin, Andrew Morton, linux-kernel, Linux Memory Management,
	Ulrich Drepper, Val Henson
On Tuesday 02 May 2006 19:16, Lee Schermerhorn wrote:
> On Tue, 2006-05-02 at 13:45 +1000, Nick Piggin wrote:
> > blaisorblade@yahoo.it wrote:
> > I think I'd rather see this all just folded under VM_NONLINEAR than have
> > this extra MANYPROTS thing, no? (You're already doing that in
> > one direction.)
> One way I've seen this done on other systems
I'm curious, which ones?
> is to use something like a 
> prio tree [e.g., see the shared policy support for shmem] for sub-vma
> protection ranges.
Which sub-vma ranges? The ones created with mprotect?
I'm curious what the difference is between this sub-tree and the main 
tree... you have a point, but I'm missing what it is :-) Actually, when doing a 
lookup in the main tree the extra nodes in the subtree are not searched, so 
you do get an advantage.
One possible point is that a VMA maps to one mmap() call (today with splits 
from mremap(), mprotect() and partial munmap()s), and then one would use 
sub-VMAs instead of VMA splits.
> Most vmas [I'm guessing here] will have only the 
> original protections or will be reprotected in toto.
> So, one need only 
> allocate/populate the protection tree when sub-vma protections are
> requested.   Then, one can test protections via the vma, perhaps with
> access/check macros to hide the existence of the protection tree.  Of
> course, adding a tree-like structure could introduce locking
> complications/overhead in some paths where we'd rather not [just
> guessing again].  Might be more overhead than just mucking with the ptes
> [for UML], but would keep the ptes in sync with the vma's view of
> "protectedness".
>
> Lee
OK, there are two different situations; overall I'm unconvinced until I 
understand the usefulness of a separate sub-tree.
a) UML. The answer is _no_ to all guesses, since we must implement page tables 
of a guest virtual machine via mmap() or remap_file_pages. And they're as 
fragmented as they get (we get one-page-wide VMAs currently).
b) the proposed glibc usage. Ulrich's original request (which I cut down 
because of problems with objrmap) was to have one mapping per DSO, including 
code, data and guard page. So you have three protections in one VMA.
However, this is doable via this remap_file_pages, adding something for 
handling private VMAs (handling the movement of the anonymous memory you get on 
writes); but it's slow on swapout, since it stops using objrmap. So I've not 
tried to do it.
-- 
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
* Re: [patch 11/14] remap_file_pages protection support: pte_present should not trigger on PTE_FILE PROTNONE ptes
  2006-05-02  3:53   ` [patch 11/14] remap_file_pages protection support: pte_present should not trigger on PTE_FILE PROTNONE ptes Nick Piggin
@ 2006-05-03  1:29     ` Blaisorblade
  2006-05-06 10:03       ` Nick Piggin
  0 siblings, 1 reply; 27+ messages in thread
From: Blaisorblade @ 2006-05-03  1:29 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, Linux Memory Management
On Tuesday 02 May 2006 05:53, Nick Piggin wrote:
> blaisorblade@yahoo.it wrote:
> > From: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
> >
> > pte_present(pte) implies that pte_pfn(pte) is valid. Normally this holds
> > even with a _PAGE_PROTNONE pte, but not when such a PTE is installed by
> > the new install_file_pte; previously it stored only file offsets, no
> > protections; with the patches it also stores protections, and can set
> > _PAGE_PROTNONE|_PAGE_FILE.
What could be done is to set a PTE with "no protection" using another bit 
rather than _PAGE_PROTNONE. This wastes one more bit, but it's doable.
> Why is this combination useful? Can't you just drop the _PAGE_FILE from
> _PAGE_PROTNONE ptes?
I must think about this, but the semantics are not entirely the same in the 
two cases. No page is attached when _PAGE_FILE is there, but a page is 
attached to a PTE with only _PAGE_PROTNONE. Testing that via VM_MANYPROTS 
is just as slow as the current check (that could be changed with code 
duplication for the linear and non-linear cases).
The application semantics can also be different when you remap that page 
as read/write - the app could have stored an offset there (this is less 
definite, since currently you can't remap & keep the offset).
Also, this wouldn't solve the problem - it would make the solution harder: 
how would I know that there's no page to call page_remove_rmap() on, without 
_PAGE_FILE?
I thought about changing _PAGE_PROTNONE: it is used to keep a page present 
and referenced but inaccessible. It seems the page could be released when 
_PAGE_PROTNONE is set, but for anonymous memory that's impossible. When I 
asked Hugh about this, he imagined the case where an application faults in a 
page in a VMA and then mprotect()s it PROT_NONE; the PTE is set as PROT_NONE. 
We can avoid that in the VM_MAYSHARE case (MAP_SHARED was requested but the 
file is read-only), but not when anonymous memory is present - the 
application could want it back.
> > To avoid additional overhead, I also considered adding likely() for
> > _PAGE_PRESENT and unlikely() for the rest, but I'm uncertain about
> > validity of possible [un]likely(pte_present()) occurrences.
>
> Not present pages are likely to be pretty common when unmapping.
OK - then only unlikely() for the test on _PAGE_PROTNONE && !_PAGE_FILE.
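I.e., something like this on i386 (sketching the idea, not pasting the
actual hunk):

	#define pte_present(x)						   \
		(((x).pte_low & _PAGE_PRESENT) ||			   \
		 unlikely(((x).pte_low & (_PAGE_PROTNONE | _PAGE_FILE)) == \
			  _PAGE_PROTNONE))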
-- 
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
* Re: [patch 00/14] remap_file_pages protection support
  2006-05-03  1:20     ` Blaisorblade
@ 2006-05-03 14:35       ` Lee Schermerhorn
  0 siblings, 0 replies; 27+ messages in thread
From: Lee Schermerhorn @ 2006-05-03 14:35 UTC (permalink / raw)
  To: Blaisorblade
  Cc: Nick Piggin, Andrew Morton, linux-kernel, Linux Memory Management,
	Ulrich Drepper, Val Henson, Bob Picco
On Wed, 2006-05-03 at 03:20 +0200, Blaisorblade wrote:
> On Tuesday 02 May 2006 19:16, Lee Schermerhorn wrote:
> > On Tue, 2006-05-02 at 13:45 +1000, Nick Piggin wrote:
> > > blaisorblade@yahoo.it wrote:
> 
> > > I think I'd rather see this all just folded under VM_NONLINEAR than have
> > > this extra MANYPROTS thing, no? (You're already doing that in
> > > one direction.)
> 
> > One way I've seen this done on other systems
> 
> I'm curious, which ones?
Let's see:  System V [4.2MP] back in the days of USL [Unix Systems Labs]
did this, as did [does] Digital/Compaq/HP Tru64 on alpha.  I'm not sure
if the latter came from the original Mach/OSF code or from Bob Picco's
rewrite thereof back in the 90's.
> 
> > is to use something like a 
> > prio tree [e.g., see the shared policy support for shmem] for sub-vma
> > protection ranges.
> Which sub-vma ranges? The ones created with mprotect?
> 
> I'm curious what the difference is between this sub-tree and the main 
> tree... you have a point, but I'm missing what it is :-) Actually, when doing a 
> lookup in the main tree the extra nodes in the subtree are not searched, so 
> you do get an advantage.
True, you still have to locate the protection range in the subtree and then
still walk the page tables to install the new protections.  Of course,
in the bad old days, the only time I saw thousands or 10s of thousands
of different mappings/vmas/regions/whatever was in vm stress tests.
Until squid came along, that is.  Then we encountered, on large-memory
alpha systems, 100s of thousands of mmapped files.  Had to replace the
linear [honest] vma/mapping list at that point ;-).
> 
> One possible point is that a VMA maps to one mmap() call (today with splits 
> from mremap(), mprotect() and partial munmap()s), and then one would use 
> sub-VMAs instead of VMA splits.
Yeah.  That was the point - in response to Nick's comment about the
disconnect between the protections as reported by the vma and the actual
pte protections.
> 
> > Most vmas [I'm guessing here] will have only the 
> > original protections or will be reprotected in toto.
> 
> > So, one need only 
> > allocate/populate the protection tree when sub-vma protections are
> > requested.   Then, one can test protections via the vma, perhaps with
> > access/check macros to hide the existence of the protection tree.  Of
> > course, adding a tree-like structure could introduce locking
> > complications/overhead in some paths where we'd rather not [just
> > guessing again].  Might be more overhead than just mucking with the ptes
> > [for UML], but would keep the ptes in sync with the vma's view of
> > "protectedness".
> >
> > Lee
> 
> OK, there are two different situations; overall I'm unconvinced until I 
> understand the usefulness of a separate sub-tree.
> 
> a) UML. The answer is _no_ to all guesses, since we must implement page tables 
> of a guest virtual machine via mmap() or remap_file_pages. And they're as 
> fragmented as they get (we get one-page-wide VMAs currently).
> 
> b) the proposed glibc usage. Ulrich's original request (which I cut down 
> because of problems with objrmap) was to have one mapping per DSO, including 
> code, data and guard page. So you have three protections in one VMA.
> 
> However, this is doable via this remap_file_pages, adding something for 
> handling private VMAs (handling the movement of the anonymous memory you get on 
> writes); but it's slow on swapout, since it stops using objrmap. So I've not 
> tried to do it.
* Re: [patch 00/14] remap_file_pages protection support
  2006-05-03  0:44   ` Blaisorblade
@ 2006-05-06  9:06     ` Nick Piggin
  2006-05-06 15:26       ` Ulrich Drepper
  0 siblings, 1 reply; 27+ messages in thread
From: Nick Piggin @ 2006-05-06  9:06 UTC (permalink / raw)
  To: Blaisorblade
  Cc: Andrew Morton, linux-kernel, Linux Memory Management,
	Ulrich Drepper, Val Henson
Blaisorblade wrote:
> On Tuesday 02 May 2006 05:45, Nick Piggin wrote:
> 
>>blaisorblade@yahoo.it wrote:
>>
>>>The first idea is to use this for UML - it must create a lot of single
>>>page mappings, and managing them through separate VMAs is slow.
> 
> 
>>I think I'd rather see this all just folded under VM_NONLINEAR than have
>>this extra MANYPROTS thing, no? (You're already doing that in one
>>direction.)
> 
> 
> That step is _temporary_ if the extra usages are accepted.
Can we try to get the whole design right from the start?
> 
> Also, I reported (changelog of patch 03/14) a definite API bug you get if you 
> don't distinguish VM_MANYPROTS from VM_NONLINEAR. I'm pasting it here because 
> that changelog is rather long:
> 
> "In fact, without this flag, we'd have indeed a regression with
> remap_file_pages VS mprotect, on uniform nonlinear VMAs.
> 
> mprotect alters the VMA prots and walks each present PTE, ignoring installed
> ones, even when pte_file() is on; their saved prots will be restored on 
> faults,
> ignoring VMA ones and losing the mprotect() on them. So, in do_file_page(), we
> must restore anyway VMA prots when the VMA is uniform, as we used to do before
> this trail of patches."
It is only a bug because you hadn't plugged the hole -- make it fix up pte_file
ones as well.
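Roughly, in change_pte_range() (a sketch against the 2.6-era mm/mprotect.c
loop; pte_file_set_prot() is a hypothetical helper for the protections the
series stores in file ptes):

	if (pte_present(*pte)) {
		/* existing code: pte_modify() + set_pte_at() */
	} else if (!pte_none(*pte) && pte_file(*pte)) {
		/* also rewrite the prots saved in a nonlinear file pte, so
		 * a later fault can't resurrect the pre-mprotect() prots */
		set_pte_at(mm, addr, pte,
			   pte_file_set_prot(*pte, newprot));
	}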
> Ulrich wanted to have code+data (+guard on 64-bit) in the same VMA, but I 
> left the code+data VMA joining out, to think more about it, since currently 
> it's too slow on swapout.
Yes, and it would be ridiculous to do this with nonlinear protections anyway.
If the vma code is so slow that glibc wants to merge code and data vmas together,
then we obviously need to fix the data structure (which will help everyone)
rather than hacking around it.
> 
> The other part is avoiding guard VMAs for thread stacks, and that could also 
> be accomplished by your proposal - but only if this work is kept out.
I see no reason why they couldn't both go in. In fact, having an mmap flag for
adding guard regions around vmas (and perhaps eg. a system-wide / per-process
option for stack) could almost go in tomorrow.
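E.g. (MAP_GUARD is a hypothetical flag name, just to illustrate the
interface):

	/* one vma; the kernel keeps a page unmappable on each side of it */
	void *stk = mmap(NULL, stacksize, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_GUARD, -1, 0);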
-- 
SUSE Labs, Novell Inc.
* Re: [patch 11/14] remap_file_pages protection support: pte_present should not trigger on PTE_FILE PROTNONE ptes
  2006-05-03  1:29     ` Blaisorblade
@ 2006-05-06 10:03       ` Nick Piggin
  2006-05-07 17:50         ` Blaisorblade
  0 siblings, 1 reply; 27+ messages in thread
From: Nick Piggin @ 2006-05-06 10:03 UTC (permalink / raw)
  To: Blaisorblade; +Cc: Andrew Morton, linux-kernel, Linux Memory Management
Blaisorblade wrote:
> On Tuesday 02 May 2006 05:53, Nick Piggin wrote:
> 
>>blaisorblade@yahoo.it wrote:
>>
>>>From: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
>>>
>>>pte_present(pte) implies that pte_pfn(pte) is valid. Normally this holds
>>>even with a _PAGE_PROTNONE pte, but not when such a PTE is installed by
>>>the new install_file_pte; previously it stored only file offsets, no
>>>protections; with the patches it also stores protections, and can set
>>>_PAGE_PROTNONE|_PAGE_FILE.
> 
> 
> What could be done is to set a PTE with "no protection" using another bit 
> rather than _PAGE_PROTNONE. This wastes one more bit, but it's doable.
I see.
> 
> 
>>Why is this combination useful? Can't you just drop the _PAGE_FILE from
>>_PAGE_PROTNONE ptes?
> 
> 
> I must think about this, but the semantics are not entirely the same in the 
> two cases.
And yes, this won't work. I was misunderstanding what was happening.
I guess your problem is that you're overloading the pte protection bits
for present ptes as protection bits for not present (file) ptes. I'd rather
you just used a different encoding for file pte protections then.
"Wasting" a bit seems much more preferable for this very uncommon case (for
most people) rather than bloating pte_present check, which is called in
practically every performance critical inner loop).
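I.e., give file ptes their own protection field; something like this sketch
(the layout and helper names are invented to show the shape, not taken from
the series):

	/* prots live in bits that can never read as _PAGE_PRESENT or
	 * _PAGE_PROTNONE, so pte_present() stays a single masked test */
	#define PTE_FILE_PROT_SHIFT	1
	#define PTE_FILE_PROT_MASK	(0x7 << PTE_FILE_PROT_SHIFT)

	static inline pte_t pgoff_prot_to_pte(unsigned long pgoff, pgprot_t prot)
	{
		return __pte(_PAGE_FILE |
			     (file_prot_encode(prot) << PTE_FILE_PROT_SHIFT) |
			     (pgoff << PTE_FILE_OFF_SHIFT));
	}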
That said, if the patch is i386/uml specific then I don't have much say in
it. If Ingo/Linus and Jeff/Yourself, respectively, accept the patch, then
fine.
But I think you should drop the comment from the core code. It seems wrong.
-- 
SUSE Labs, Novell Inc.
* Re: [patch 00/14] remap_file_pages protection support
  2006-05-06  9:06     ` Nick Piggin
@ 2006-05-06 15:26       ` Ulrich Drepper
  0 siblings, 0 replies; 27+ messages in thread
From: Ulrich Drepper @ 2006-05-06 15:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Blaisorblade, Andrew Morton, linux-kernel,
	Linux Memory Management, Val Henson
Nick Piggin wrote:
> I see no reason why they couldn't both go in. In fact, having an mmap
> flag for adding guard regions around vmas (and perhaps eg. a system-wide /
> per-process option for stack) could almost go in tomorrow.
This would have to be flexible, though.  For thread stacks, at least,
the programmer is able to specify the size of the guard area.  It can be
arbitrarily large.
Also, consider IA-64.  Here we have two stacks.  We allocate them with
one mmap call, put the guard somewhere in the middle (the optimal
ratio of CPU and register stack size is yet to be determined) and have
the stacks grow toward each other.  This results in three VMAs at the
moment.  Anything which results in more VMAs obviously isn't good.
-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
* Re: [patch 00/14] remap_file_pages protection support
  2006-05-03  0:25   ` Blaisorblade
@ 2006-05-06 16:05     ` Ulrich Drepper
  2006-05-07  4:22       ` Nick Piggin
  0 siblings, 1 reply; 27+ messages in thread
From: Ulrich Drepper @ 2006-05-06 16:05 UTC (permalink / raw)
  To: Blaisorblade
  Cc: Nick Piggin, Andrew Morton, linux-kernel, Linux Memory Management,
	Val Henson
Blaisorblade wrote:
> Indeed, I've not seen the numbers; I've been told of a problem with a "customer 
> program", and Ingo connected my work with this problem. Frankly, I've always 
> been astonished that looking up a 10-level tree can be slow. Poor cache 
> locality is the only explanation I could think of.
It might be good if I explain a bit how much we use mmap in libc.  The
numbers can really add up quickly.
- for each loaded DSO we might have up to 5 VMAs today:
   1. text segment, protection r-xp, normally never written to
   2. a gap, protection ---p (i.e., no access)
   3. a relro data segment, protection r--p
   4. a data segment, protection rw-p
   5. a bss segment, anonymous memory
The first four are mapped from the file.  In fact, the first segment
"allocates" the entire address space of all segments, even if that's longer
than the file.
The gap is done using mprotect(PROT_NONE).  Then the area for segments 3
and 4 is mapped in one mmap() call.  It's in the same file, but the
offset used in that mmap is not the offset which is naturally already
established through the first mmap.  I.e., if the first mmap() starts at
offset 0 and continues for 1000 pages, the gap might start at, say, an
offset of 4 pages and continue for 500 pages.  Then the "natural" offset
of the first data page would be 504 pages, but the second mmap() call would
in fact use the offset 4, because the text and data segments are contiguous
in the _file_ (although not in memory).
Anyway, once relocations are done, the protection of the relro segment is
changed, splitting the data segment in two.
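Schematically, with the example numbers above (a user-space sketch, not our
actual ld.so code; page, fd and relro_pages are illustrative):

	/* one mapping reserves the whole 1000-page range; text at file off 0 */
	char *base = mmap(NULL, 1000 * page, PROT_READ | PROT_EXEC,
			  MAP_PRIVATE, fd, 0);
	/* the 500-page gap, 4 pages in: second vma */
	mprotect(base + 4 * page, 500 * page, PROT_NONE);
	/* data at its "natural" address (504 pages in) but at file offset 4
	 * pages, since text and data are contiguous in the file */
	mmap(base + 504 * page, 496 * page, PROT_READ | PROT_WRITE,
	     MAP_PRIVATE | MAP_FIXED, fd, 4 * page);
	/* ... relocations ... then the relro mprotect() splits the data vma */
	mprotect(base + 504 * page, relro_pages * page, PROT_READ);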
So, for DSO loading there would be two steps of improvement:
1. if an mprotect() call didn't split the VMA, we would have 3 VMAs in
the end instead of 5.  A 40% gain.
2. if I could use remap_file_pages() for the data segment mapping, and the
call allowed changing the protection without splitting the
VMAs, then we'd be down to 2 mappings.  60% down.
A second big VMA user is thread stacks.  I think the application which
was mentioned briefly in this thread used literally thousands of
threads.  Leaving aside the insanity of this (it's unfortunately how
many programmers work), this can create problems because we get at least
two (on IA-64 three) VMAs per thread.  I.e., thread stack allocation
works like this:
1. allocate an area big enough for stack and guard (we don't use automatic
growing; that cannot work here)
2. change the protection of the guard end of the stack to PROT_NONE.
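In code (a sketch; the guard at the low end, for a downward-growing stack):

	/* step 1: one vma covering guard + stack */
	void *mem = mmap(NULL, guardsize + stacksize,
			 PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	/* step 2: the guard; this splits off a second, PROT_NONE vma */
	mprotect(mem, guardsize, PROT_NONE);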
So, for say 1000 threads we'll end up with 2000 VMAs.  Threads are also
important to mention here because
- often they are short-lived and we have to recreate them often.  We
usually reuse stacks but only keep so much allocated memory around.
So more often than we'd like we actually free and later re-allocate stacks.
- these thousands of stack VMAs are really used concurrently.  All
threads are woken over a period of time.
A third source of VMAs is anonymous memory allocation.  mmap is used in
the malloc implementation and directly in various places.  For
randomization reasons there isn't really much we can do here; we
shouldn't lump all these allocations together.
A fourth source of VMAs is the programs themselves, which mmap files -
often read-only mappings of many small files.
Put all this together and non-trivial apps as written today (I don't say
they are high-quality apps) can easily have a few thousand, maybe even
10,000 to 20,000 VMAs.  Firefox on my machine uses ~560 VMAs at the moment,
and this is with only a handful of threads.  Are these the numbers
the VM system is optimized for?  I think what our people running the
experiments at the customer site saw is that they're not: the VMA
traversal showed up on the profile lists.
-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
* Re: [patch 00/14] remap_file_pages protection support
  2006-05-06 16:05     ` Ulrich Drepper
@ 2006-05-07  4:22       ` Nick Piggin
  2006-05-13 14:13         ` Nick Piggin
  0 siblings, 1 reply; 27+ messages in thread
From: Nick Piggin @ 2006-05-07  4:22 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Blaisorblade, Andrew Morton, linux-kernel,
	Linux Memory Management, Val Henson
Ulrich Drepper wrote:
> Blaisorblade wrote:
> 
>>Indeed, I've not seen the numbers; I've been told of a problem with a "customer 
>>program", and Ingo connected my work with this problem. Frankly, I've always 
>>been astonished that looking up a 10-level tree can be slow. Poor cache 
>>locality is the only explanation I could think of.
> 
> 
> It might be good if I explain a bit how much we use mmap in libc.  The
> numbers can really add up quickly.
[...]
Thanks. Very informative.
> Put all this together and non-trivial apps as written today (I don't say
> they are high-quality apps) can easily have a few thousand, maybe even
> 10,000 to 20,000 VMAs.  Firefox on my machine uses ~560 VMAs at the moment,
> and this is with only a handful of threads.  Are these the numbers
> the VM system is optimized for?  I think what our people running the
> experiments at the customer site saw is that they're not: the VMA
> traversal showed up on the profile lists.
Your % improvement numbers are of course only talking about memory
usage improvements. Time complexity increases with the log of the
number of VMAs, so while a search within 100,000 vmas might have a CPU
cost of 16 arbitrary units, that is only about 300% of the cost with 40
vmas (and not the 250,000% that the raw ratio of vma counts suggests).
Definitely reducing vmas would be good. If guard ranges around vmas
can be implemented easily and reduce vmas by even 20%, it would come
at an almost zero complexity cost to the kernel.
However, I think another consideration is the vma lookup cache. I need
to get around to looking at this again, but IMO it is inadequate for
threaded applications. Currently we have one last-lookup cached vma
for each mm. You get cacheline bouncing when updating the cache, and
the locality becomes almost useless.
I think possibly each thread should have a private vma cache, with
room for at least its stack vma(s), (and several others, eg. code,
data). Perhaps the per-mm cache could be dispensed with completely,
although it might be useful eg. for the heap. And it might be helped
with increased entries as well.
I've got patches lying around to implement this stuff -- I'd be
interested to have more detail about this problem, or distilled test
cases.
Nick
-- 
SUSE Labs, Novell Inc.
* Re: [patch 11/14] remap_file_pages protection support: pte_present should not trigger on PTE_FILE PROTNONE ptes
  2006-05-06 10:03       ` Nick Piggin
@ 2006-05-07 17:50         ` Blaisorblade
  0 siblings, 0 replies; 27+ messages in thread
From: Blaisorblade @ 2006-05-07 17:50 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, Linux Memory Management
On Saturday 06 May 2006 12:03, Nick Piggin wrote:
> Blaisorblade wrote:
> > On Tuesday 02 May 2006 05:53, Nick Piggin wrote:
> >>blaisorblade@yahoo.it wrote:
> >>>From: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
> >>>
> >>>pte_present(pte) implies that pte_pfn(pte) is valid. Normally this holds
> >>> even with a _PAGE_PROTNONE pte, but not when such a PTE is installed
> >>> by the new install_file_pte; previously it stored only file offsets, no
> >>> protections; with the patches it also stores protections, and can
> >>> set _PAGE_PROTNONE|_PAGE_FILE.
> >
> > What could be done is to set a PTE with "no protection" using another bit
> > rather than _PAGE_PROTNONE. This wastes one more bit, but it's doable.
> I see.
> I guess your problem is that you're overloading the pte protection bits
> for present ptes as protection bits for not present (file) ptes. I'd rather
> you just used a different encoding for file pte protections then.
Yes, this is what I said above, so we agree; and indeed this overloading was 
decided back when the present problem didn't trigger, so it can change now. As 
detailed in the patch description, the previous PageReserved handling 
prevented freeing page 0 and hid this.
> "Wasting" a bit seems much more preferable for this very uncommon case (for
> most people) rather than bloating pte_present check, which is called in
> practically every performance critical inner loop).
Yes, I thought about this problem; I wasn't sure how hard it was.
> That said, if the patch is i386/uml specific then I don't have much say in
> it.
It's i386/UML-specific at present, but will probably be extended. 
Implementations for some other archs were already sent and I've collected 
them (I'll send them afterwards; I've avoided excess bloat).
> If Ingo/Linus and Jeff/Yourself, respectively, accept the patch, then 
> fine.
> But I think you should drop the comment from the core code. It seems wrong.
Yep, forgot there, thanks for reminding, I've now removed it.
-- 
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
* Re: [patch 00/14] remap_file_pages protection support
  2006-05-07  4:22       ` Nick Piggin
@ 2006-05-13 14:13         ` Nick Piggin
  2006-05-13 18:19           ` Valerie Henson
  0 siblings, 1 reply; 27+ messages in thread
From: Nick Piggin @ 2006-05-13 14:13 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Blaisorblade, Andrew Morton, linux-kernel,
	Linux Memory Management, Val Henson
Nick Piggin wrote:
> I think possibly each thread should have a private vma cache, with
> room for at least its stack vma(s), (and several others, eg. code,
> data). Perhaps the per-mm cache could be dispensed with completely,
> although it might be useful eg. for the heap. And it might be helped
> with increased entries as well.
> 
> I've got patches lying around to implement this stuff -- I'd be
> interested to have more detail about this problem, or distilled test
> cases.
OK, I got interested again, but can't get Val's ebizzy to give me
a find_vma-constrained workload yet (though the numbers back up
my assertion that the vma cache is crap for threaded apps).
Without the patch, after bootup, the vma cache gets 208 364 hits out
of 438 535 lookups (47.5%)
./ebizzy -t16: 384.29user 754.61system 5:31.87elapsed 343%CPU
And ebizzy gets 7 373 078 hits out of 82 255 863 lookups (8.9%)
With mm + 4 slot LRU per-thread cache (this patch):
After boot, 303 767 / 439 918 = 69.0%
./ebizzy -t16: 388.73user 750.29system 5:30.24elapsed 344%CPU
ebizzy hits: 53 024 083 / 82 055 195 = 64.6%
So on a non-threaded workload, the hit rate is increased by about 50%;
on a threaded workload it is increased more than sevenfold. In
rbtree-walk-constrained workloads, the total find_vma speedup should be
linear in the hit-ratio improvement.
I don't think my ebizzy numbers can justify the patch though...
Nick
-- 
SUSE Labs, Novell Inc.
[-- Attachment #2: vma.patch --]
[-- Type: text/plain, Size: 6766 bytes --]
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2006-05-13 23:31:13.000000000 +1000
+++ linux-2.6/mm/mmap.c	2006-05-13 23:48:53.000000000 +1000
@@ -30,6 +30,99 @@
 #include <asm/cacheflush.h>
 #include <asm/tlb.h>
 
+static void vma_cache_touch(struct mm_struct *mm, struct vm_area_struct *vma)
+{
+	struct task_struct *curr = current;
+	if (mm == curr->mm) {
+		int i;
+		if (curr->vma_cache_sequence != mm->vma_sequence) {
+			curr->vma_cache_sequence = mm->vma_sequence;
+			curr->vma_cache[0] = vma;
+			for (i = 1; i < 4; i++)
+				curr->vma_cache[i] = NULL;
+		} else {
+			int update_mm;
+
+			if (curr->vma_cache[0] == vma)
+				return;
+
+			for (i = 1; i < 4; i++) {
+				if (curr->vma_cache[i] == vma)
+					break;
+			}
+			update_mm = 0;
+			if (i == 4) {
+				update_mm = 1;
+				i = 3;
+			}
+			while (i != 0) {
+				curr->vma_cache[i] = curr->vma_cache[i-1];
+				i--;
+			}
+			curr->vma_cache[0] = vma;
+
+			if (!update_mm)
+				return;
+		}
+	}
+
+	if (mm->vma_cache != vma) /* prevent cacheline bouncing */
+		mm->vma_cache = vma;
+}
+
+static void vma_cache_replace(struct mm_struct *mm, struct vm_area_struct *vma,
+						struct vm_area_struct *repl)
+{
+	mm->vma_sequence++;
+	if (unlikely(mm->vma_sequence == 0)) {
+		struct task_struct *curr = current, *t;
+		t = curr;
+		rcu_read_lock();
+		do {
+			t->vma_cache_sequence = -1;
+			t = next_thread(t);
+		} while (t != curr);
+		rcu_read_unlock();
+	}
+
+	if (mm->vma_cache == vma)
+		mm->vma_cache = repl;
+}
+
+static inline void vma_cache_invalidate(struct mm_struct *mm, struct vm_area_struct *vma)
+{
+	vma_cache_replace(mm, vma, NULL);
+}
+
+static struct vm_area_struct *vma_cache_find(struct mm_struct *mm,
+						unsigned long addr)
+{
+	struct task_struct *curr;
+	struct vm_area_struct *vma;
+
+	inc_page_state(vma_cache_query);
+
+	curr = current;
+	if (mm == curr->mm && mm->vma_sequence == curr->vma_cache_sequence) {
+		int i;
+		for (i = 0; i < 4; i++) {
+			vma = curr->vma_cache[i];
+			if (vma && vma->vm_end > addr && vma->vm_start <= addr){
+				inc_page_state(vma_cache_hit);
+				return vma;
+			}
+		}
+	}
+
+	vma = mm->vma_cache;
+	if (vma && vma->vm_end > addr && vma->vm_start <= addr) {
+		inc_page_state(vma_cache_hit);
+		return vma;
+	}
+
+	return NULL;
+}
+
 static void unmap_region(struct mm_struct *mm,
 		struct vm_area_struct *vma, struct vm_area_struct *prev,
 		unsigned long start, unsigned long end);
@@ -460,8 +553,6 @@
 {
 	prev->vm_next = vma->vm_next;
 	rb_erase(&vma->vm_rb, &mm->mm_rb);
-	if (mm->mmap_cache == vma)
-		mm->mmap_cache = prev;
 }
 
 /*
@@ -586,6 +677,7 @@
 		 * us to remove next before dropping the locks.
 		 */
 		__vma_unlink(mm, next, vma);
+		vma_cache_replace(mm, next, vma);
 		if (file)
 			__remove_shared_vm_struct(next, file, mapping);
 		if (next->anon_vma)
@@ -1384,8 +1476,8 @@
 	if (mm) {
 		/* Check the cache first. */
 		/* (Cache hit rate is typically around 35%.) */
-		vma = mm->mmap_cache;
-		if (!(vma && vma->vm_end > addr && vma->vm_start <= addr)) {
+		vma = vma_cache_find(mm, addr);
+		if (!vma) {
 			struct rb_node * rb_node;
 
 			rb_node = mm->mm_rb.rb_node;
@@ -1405,9 +1497,9 @@
 				} else
 					rb_node = rb_node->rb_right;
 			}
-			if (vma)
-				mm->mmap_cache = vma;
 		}
+		if (vma)
+			vma_cache_touch(mm, vma);
 	}
 	return vma;
 }
@@ -1424,6 +1516,14 @@
 	if (!mm)
 		goto out;
 
+	vma = vma_cache_find(mm, addr);
+	if (vma) {
+		rb_node = rb_prev(&vma->vm_rb);
+		if (rb_node)
+			prev = rb_entry(rb_node, struct vm_area_struct, vm_rb);
+		goto out;
+	}
+
 	/* Guard against addr being lower than the first VMA */
 	vma = mm->mmap;
 
@@ -1445,6 +1545,9 @@
 	}
 
 out:
+	if (vma)
+		vma_cache_touch(mm, vma);
+
 	*pprev = prev;
 	return prev ? prev->vm_next : vma;
 }
@@ -1686,6 +1789,7 @@
 
 	insertion_point = (prev ? &prev->vm_next : &mm->mmap);
 	do {
+		vma_cache_invalidate(mm, vma);
 		rb_erase(&vma->vm_rb, &mm->mm_rb);
 		mm->map_count--;
 		tail_vma = vma;
@@ -1698,7 +1802,6 @@
 	else
 		addr = vma ?  vma->vm_start : mm->mmap_base;
 	mm->unmap_area(mm, addr);
-	mm->mmap_cache = NULL;		/* Kill the cache. */
 }
 
 /*
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2006-05-13 23:31:05.000000000 +1000
+++ linux-2.6/include/linux/page-flags.h	2006-05-13 23:31:44.000000000 +1000
@@ -164,6 +164,9 @@
 
 	unsigned long pgrotated;	/* pages rotated to tail of the LRU */
 	unsigned long nr_bounce;	/* pages for bounce buffers */
+
+	unsigned long vma_cache_hit;
+	unsigned long vma_cache_query;
 };
 
 extern void get_page_state(struct page_state *ret);
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2006-05-13 23:31:05.000000000 +1000
+++ linux-2.6/mm/page_alloc.c	2006-05-13 23:31:44.000000000 +1000
@@ -2389,6 +2389,9 @@
 
 	"pgrotated",
 	"nr_bounce",
+
+	"vma_cache_hit",
+	"vma_cache_query",
 };
 
 static void *vmstat_start(struct seq_file *m, loff_t *pos)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h	2006-05-13 23:31:03.000000000 +1000
+++ linux-2.6/include/linux/sched.h	2006-05-13 23:33:01.000000000 +1000
@@ -293,9 +293,11 @@
 } while (0)
 
 struct mm_struct {
-	struct vm_area_struct * mmap;		/* list of VMAs */
+	struct vm_area_struct *mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
-	struct vm_area_struct * mmap_cache;	/* last find_vma result */
+	struct vm_area_struct *vma_cache;
+	unsigned long vma_sequence;
+
 	unsigned long (*get_unmapped_area) (struct file *filp,
 				unsigned long addr, unsigned long len,
 				unsigned long pgoff, unsigned long flags);
@@ -734,6 +736,8 @@
 	struct list_head ptrace_list;
 
 	struct mm_struct *mm, *active_mm;
+	struct vm_area_struct *vma_cache[4];
+	unsigned long vma_cache_sequence;
 
 /* task state */
 	struct linux_binfmt *binfmt;
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2006-05-13 23:31:03.000000000 +1000
+++ linux-2.6/kernel/fork.c	2006-05-13 23:32:59.000000000 +1000
@@ -197,7 +197,7 @@
 
 	mm->locked_vm = 0;
 	mm->mmap = NULL;
-	mm->mmap_cache = NULL;
+	mm->vma_sequence = oldmm->vma_sequence+1;
 	mm->free_area_cache = oldmm->mmap_base;
 	mm->cached_hole_size = ~0UL;
 	mm->map_count = 0;
@@ -238,6 +238,10 @@
 		tmp->vm_next = NULL;
 		anon_vma_link(tmp);
 		file = tmp->vm_file;
+
+		if (oldmm->vma_cache == mpnt)
+			mm->vma_cache = tmp;
+
 		if (file) {
 			struct inode *inode = file->f_dentry->d_inode;
 			get_file(file);
* Re: [patch 00/14] remap_file_pages protection support
  2006-05-13 14:13         ` Nick Piggin
@ 2006-05-13 18:19           ` Valerie Henson
  2006-05-13 22:54             ` Valerie Henson
  2006-05-16 13:30             ` Nick Piggin
  0 siblings, 2 replies; 27+ messages in thread
From: Valerie Henson @ 2006-05-13 18:19 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ulrich Drepper, Blaisorblade, Andrew Morton, linux-kernel,
	Linux Memory Management, Val Henson
On Sun, May 14, 2006 at 12:13:21AM +1000, Nick Piggin wrote:
> 
> OK, I got interested again, but can't get Val's ebizzy to give me
> a find_vma-constrained workload yet (though the numbers back up
> my assertion that the vma cache is crap for threaded apps).
Hey Nick,
Glad to see you're using it!  There are (at least) two ways to do what
you want:
1. Increase the number of threads - this gives you two vma's per
   thread, one for stack, one for guard page:
   $ ./ebizzy -t 100
2. Apply the patch at the end of this email and use -p "prevent
   coalescing", -m "always mmap" and appropriate number of chunks,
   size, and records to search - this works for me:
   $ ./ebizzy -p -m -n 10000 -s 4096 -r 100000
The original program mmapped everything with the same permissions and
no alignment restrictions, so all the mmaps were coalesced into one.
This version alternates PROT_WRITE permissions on the mmap'd areas
after they are written, so you get lots of vma's:
val@goober:~/ebizzy$ ./ebizzy -p -m -n 10000 -s 4096 -r 100000
[2]+  Stopped                 ./ebizzy -p -m -n 10000 -s 4096 -r 100000
val@goober:~/ebizzy$ wc -l /proc/`pgrep ebizzy`/maps
10019 /proc/10917/maps
I haven't profiled to see if this brings find_vma to the top, though.
(The patch also moves around some other stuff so that options are in
alphabetical order; apparently I thought 's' came after 'r' and before
'R'...)
-VAL
--- ebizzy.c.old	2006-05-13 10:18:58.000000000 -0700
+++ ebizzy.c	2006-05-13 11:01:42.000000000 -0700
@@ -52,9 +52,10 @@
 static unsigned int always_mmap;
 static unsigned int never_mmap;
 static unsigned int chunks;
+static unsigned int prevent_coalescing;
 static unsigned int records;
-static unsigned int chunk_size;
 static unsigned int random_size;
+static unsigned int chunk_size;
 static unsigned int threads;
 static unsigned int verbose;
 static unsigned int linear;
@@ -76,9 +77,10 @@
 		"-m\t\t Always use mmap instead of malloc\n"
 		"-M\t\t Never use mmap\n"
 		"-n <num>\t Number of memory chunks to allocate\n"
+		"-p \t\t Prevent mmap coalescing\n"
 		"-r <num>\t Total number of records to search for\n"
-		"-s <size>\t Size of memory chunks, in bytes\n"
 		"-R\t\t Randomize size of memory to copy and search\n"
+		"-s <size>\t Size of memory chunks, in bytes\n"
 		"-t <num>\t Number of threads\n"
 		"-v[v[v]]\t Be verbose (more v's for more verbose)\n"
 		"-z\t\t Linear search instead of binary search\n",
@@ -98,7 +100,7 @@
 	cmd = argv[0];
 	opterr = 1;
 
-	while ((c = getopt(argc, argv, "mMn:r:s:Rt:vz")) != -1) {
+	while ((c = getopt(argc, argv, "mMn:pr:Rs:t:vz")) != -1) {
 		switch (c) {
 		case 'm':
 			always_mmap = 1;
@@ -111,19 +113,22 @@
 			if (chunks == 0)
 				usage();
 			break;
+		case 'p':
+			prevent_coalescing = 1;
+			break;
 		case 'r':
 			records = atoi(optarg);
 			if (records == 0)
 				usage();
 			break;
+		case 'R':
+			random_size = 1;
+			break;
 		case 's':
 			chunk_size = atoi(optarg);
 			if (chunk_size == 0)
 				usage();
 			break;
-		case 'R':
-			random_size = 1;
-			break;
 		case 't':
 			threads = atoi(optarg);
 			if (threads == 0)
@@ -141,7 +146,7 @@
 	}
 
 	if (verbose)
-		printf("ebizzy 0.1, Copyright 2006 Intel Corporation\n"
+		printf("ebizzy 0.2, Copyright 2006 Intel Corporation\n"
 		       "Written by Val Henson <val_henson@linux.intel.com\n");
 
 	/*
@@ -173,10 +178,11 @@
 		printf("always_mmap %u\n", always_mmap);
 		printf("never_mmap %u\n", never_mmap);
 		printf("chunks %u\n", chunks);
+		printf("prevent coalescing %u\n", prevent_coalescing);
 		printf("records %u\n", records);
 		printf("records per thread %u\n", records_per_thread);
-		printf("chunk_size %u\n", chunk_size);
 		printf("random_size %u\n", random_size);
+		printf("chunk_size %u\n", chunk_size);
 		printf("threads %u\n", threads);
 		printf("verbose %u\n", verbose);
 		printf("linear %u\n", linear);
@@ -251,9 +257,13 @@
 {
 	int i, j;
 
-	for (i = 0; i < chunks; i++)
+	for (i = 0; i < chunks; i++) {
 		for(j = 0; j < chunk_size / record_size; j++)
 			mem[i][j] = (record_t) j;
+		/* Prevent coalescing by alternating permissions */
+		if (prevent_coalescing && (i % 2) == 0)
+			mprotect(mem[i], chunk_size, PROT_READ);
+	}
 	if (verbose)
 		printf("Wrote memory\n");
 }
* Re: [patch 00/14] remap_file_pages protection support
  2006-05-13 18:19           ` Valerie Henson
@ 2006-05-13 22:54             ` Valerie Henson
  2006-05-16 13:30             ` Nick Piggin
  1 sibling, 0 replies; 27+ messages in thread
From: Valerie Henson @ 2006-05-13 22:54 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ulrich Drepper, Blaisorblade, Andrew Morton, linux-kernel,
	Linux Memory Management, Arjan van de Ven
On Sat, May 13, 2006 at 11:19:46AM -0700, Valerie Henson wrote:
> The original program mmapped everything with the same permissions and
> no alignment restrictions, so all the mmaps were coalesced into one.
> This version alternates PROT_WRITE permissions on the mmap'd areas
> after they are written, so you get lots of vma's:
... Which is of course exactly the case that Blaisorblade's patches
should coalesce into one vma.  So I wrote another option which uses
holes instead - takes more memory initially, unfortunately.  Grab it
from:
http://www.nmt.edu/~val/patches/ebizzy.tar.gz
-p for preventing coalescing via protections, -P for preventing via holes.
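A rough sketch of what the hole-based variant could look like (the helper
below is a guess at the approach, not code from the tarball; chunk_size is
assumed page-aligned):

	/* Sketch (assumed, not taken from the ebizzy tarball): reserve one
	 * extra page per chunk and unmap it again, so no two chunks are ever
	 * adjacent and the kernel can never merge them into a single vma.
	 * Costs extra address space up front. */
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/mman.h>

	static void *alloc_unmergeable_chunk(size_t chunk_size)
	{
		long page = sysconf(_SC_PAGESIZE);
		char *p = mmap(NULL, chunk_size + page, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return NULL;
		/* Punch the trailing hole: adjacent vmas separated by a gap
		 * cannot be coalesced. */
		munmap(p + chunk_size, page);
		return p;
	}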
-VAL
* Re: [patch 00/14] remap_file_pages protection support
  2006-05-13 18:19           ` Valerie Henson
  2006-05-13 22:54             ` Valerie Henson
@ 2006-05-16 13:30             ` Nick Piggin
  2006-05-16 13:51               ` Andreas Mohr
  2006-05-16 16:33               ` Valerie Henson
  1 sibling, 2 replies; 27+ messages in thread
From: Nick Piggin @ 2006-05-16 13:30 UTC (permalink / raw)
  To: Valerie Henson
  Cc: Ulrich Drepper, Blaisorblade, Andrew Morton, linux-kernel,
	Linux Memory Management, Val Henson
Valerie Henson wrote:
> On Sun, May 14, 2006 at 12:13:21AM +1000, Nick Piggin wrote:
> 
>>OK, I got interested again, but can't get Val's ebizzy to give me
>>a find_vma constrained workload yet (though the numbers back up
>>my assertion that the vma cache is crap for threaded apps).
> 
> 
> Hey Nick,
> 
> Glad to see you're using it!  There are (at least) two ways to do what
> you want:
> 
> 1. Increase the number of threads - this gives you two vma's per
>    thread, one for stack, one for guard page:
> 
>    $ ./ebizzy -t 100
> 
> 2. Apply the patch at the end of this email and use -p "prevent
>    coalescing", -m "always mmap" and appropriate number of chunks,
>    size, and records to search - this works for me:
> 
>    $ ./ebizzy -p -m -n 10000 -s 4096 -r 100000
> 
> The original program mmapped everything with the same permissions and
> no alignment restrictions, so all the mmaps were coalesced into one.
> This version alternates PROT_WRITE permissions on the mmap'd areas
> after they are written, so you get lots of vma's:
> 
> val@goober:~/ebizzy$ ./ebizzy -p -m -n 10000 -s 4096 -r 100000
> 
> [2]+  Stopped                 ./ebizzy -p -m -n 10000 -s 4096 -r 100000
> val@goober:~/ebizzy$ wc -l /proc/`pgrep ebizzy`/maps
> 10019 /proc/10917/maps
> 
> I haven't profiled to see if this brings find_vma to the top, though.
> 
Hi Val,
Thanks. I've tried your most recent ebizzy with 256 threads and
50,000 vmas (which gives a really poor mmap_cache hit rate), and I'm still
unable to get find_vma above a few percent of kernel time.
With 50,000 vmas, my per-thread vma cache is much less effective, I guess
because access is pretty random (hopefully more realistic patterns would
get a bigger improvement).
I also tried running kbuild under UML, and could not make find_vma take
much time either [in this case, the per-thread vma cache patch roughly
doubles the hit rate, from about 15% to 30% (in the host)].
So I guess it's time to go back into my hole. If anyone does come across
a find_vma constrained workload (especially with threads), I'd be very
interested.
Thanks,
Nick
-- 
SUSE Labs, Novell Inc.
* Re: [patch 00/14] remap_file_pages protection support
  2006-05-16 13:30             ` Nick Piggin
@ 2006-05-16 13:51               ` Andreas Mohr
  2006-05-16 16:31                 ` Valerie Henson
  2006-05-16 16:33               ` Valerie Henson
  1 sibling, 1 reply; 27+ messages in thread
From: Andreas Mohr @ 2006-05-16 13:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Valerie Henson, Ulrich Drepper, Blaisorblade, Andrew Morton,
	linux-kernel, Linux Memory Management, Val Henson
Hi,
On Tue, May 16, 2006 at 11:30:32PM +1000, Nick Piggin wrote:
> I also tried running kbuild under UML, and could not make find_vma take
> much time either [in this case, the per-thread vma cache patch roughly
> doubles the number of hits, from about 15%->30% (in the host)].
> 
> So I guess it's time to go back into my hole. If anyone does come across
> a find_vma constrained workload (especially with threads), I'd be very
> interested.
I cannot offer much other than some random confirmation that from my own
oprofiling, whatever I did (often running a load test script consisting of
launching 30 big apps at the same time), find_vma basically always showed up
very prominently in the list of vmlinux-based code (always ranking within the
top 4 or 5 kernel hotspots, such as timer interrupts, ACPI idle I/O etc.pp.).
call-tracing showed it originating from mmap syscalls etc., and AFAIR quite
some find_vma activity from oprofile itself.
Profiling done on 512MB UP Athlon and P3/700, 2.6.16ish, current Debian.
Sorry for the foggy report, I don't have those logs here right now.
So yes, improving that part should help in general, but I cannot quite
say that my machines are "constrained" by it.
But you probably knew that already, otherwise you wouldn't have poked
in there... ;)
Andreas Mohr
* Re: [patch 00/14] remap_file_pages protection support
  2006-05-16 13:51               ` Andreas Mohr
@ 2006-05-16 16:31                 ` Valerie Henson
  2006-05-16 16:47                   ` Andreas Mohr
  0 siblings, 1 reply; 27+ messages in thread
From: Valerie Henson @ 2006-05-16 16:31 UTC (permalink / raw)
  To: Andreas Mohr
  Cc: Nick Piggin, Ulrich Drepper, Blaisorblade, Andrew Morton,
	linux-kernel, Linux Memory Management, Val Henson
On Tue, May 16, 2006 at 03:51:35PM +0200, Andreas Mohr wrote:
> 
> I cannot offer much other than some random confirmation that from my own
> oprofiling, whatever I did (often running a load test script consisting of
> launching 30 big apps at the same time), find_vma basically always showed up
> very prominently in the list of vmlinux-based code (always ranking within the
> top 4 or 5 kernel hotspots, such as timer interrupts, ACPI idle I/O etc.pp.).
> call-tracing showed it originating from mmap syscalls etc., and AFAIR quite
> some find_vma activity from oprofile itself.
This is important: Which kernel?
The cases I saw it in were in a (now old) SuSE kernel which, as it
turns out, had old/different vma lookup code.
Thanks,
-VAL
* Re: [patch 00/14] remap_file_pages protection support
  2006-05-16 13:30             ` Nick Piggin
  2006-05-16 13:51               ` Andreas Mohr
@ 2006-05-16 16:33               ` Valerie Henson
  1 sibling, 0 replies; 27+ messages in thread
From: Valerie Henson @ 2006-05-16 16:33 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ulrich Drepper, Blaisorblade, Andrew Morton, linux-kernel,
	Linux Memory Management
On Tue, May 16, 2006 at 11:30:32PM +1000, Nick Piggin wrote:
> 
> Hi Val,
> 
> Thanks, I've tried with your most recent ebizzy and with 256 threads and
> 50,000 vmas (which gives really poor mmap_cache hits), I'm still unable
> to get find_vma above a few % of kernel time.
How excellent!  Sometimes negative results are worth publishing. :)
-VAL
* Re: [patch 00/14] remap_file_pages protection support
  2006-05-16 16:31                 ` Valerie Henson
@ 2006-05-16 16:47                   ` Andreas Mohr
  2006-05-17  3:25                     ` Nick Piggin
  2006-05-17  6:10                     ` Blaisorblade
  0 siblings, 2 replies; 27+ messages in thread
From: Andreas Mohr @ 2006-05-16 16:47 UTC (permalink / raw)
  To: Valerie Henson
  Cc: Nick Piggin, Ulrich Drepper, Blaisorblade, Andrew Morton,
	linux-kernel, Linux Memory Management, Val Henson
Hi,
On Tue, May 16, 2006 at 09:31:12AM -0700, Valerie Henson wrote:
> On Tue, May 16, 2006 at 03:51:35PM +0200, Andreas Mohr wrote:
> > 
> > I cannot offer much other than some random confirmation that from my own
> > oprofiling, whatever I did (often running a load test script consisting of
> > launching 30 big apps at the same time), find_vma basically always showed up
> > very prominently in the list of vmlinux-based code (always ranking within the
> > top 4 or 5 kernel hotspots, such as timer interrupts, ACPI idle I/O etc.pp.).
> > call-tracing showed it originating from mmap syscalls etc., and AFAIR quite
> > some find_vma activity from oprofile itself.
> 
> This is important: Which kernel?
I had some traces still showing find_vma prominently during a profiling run
just yesterday, with a very fresh 2.6.17-rc4-ck1 (IOW, basically 2.6.17-rc4).
I added some cache prefetching in the list traversal a while ago, and IIRC
that improved profiling times there, but cache prefetching is very often a
bandaid in search of a real solution: a better data-handling algorithm.
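For reference, a guess at the kind of hint meant here — prefetching both
children while the current node's keys are being compared. This is
illustrative only, not Andreas's actual patch:

	#include <linux/mm.h>
	#include <linux/prefetch.h>
	#include <linux/rbtree.h>

	/* Illustrative sketch: a find_vma()-style rb-tree descent, with
	 * prefetches issued for both possible next nodes so the memory
	 * loads for the next level overlap with this level's compares. */
	static struct vm_area_struct *find_vma_prefetching(struct mm_struct *mm,
							   unsigned long addr)
	{
		struct rb_node *rb_node = mm->mm_rb.rb_node;
		struct vm_area_struct *vma = NULL;

		while (rb_node) {
			struct vm_area_struct *tmp =
				rb_entry(rb_node, struct vm_area_struct, vm_rb);

			/* Start pulling in both children now; note this
			 * fetches the embedded rb_node, not necessarily the
			 * child's vm_start/vm_end, which live in a different
			 * part of the vma. */
			prefetch(rb_node->rb_left);
			prefetch(rb_node->rb_right);

			if (tmp->vm_end > addr) {
				vma = tmp;
				if (tmp->vm_start <= addr)
					break;
				rb_node = rb_node->rb_left;
			} else
				rb_node = rb_node->rb_right;
		}
		return vma;
	}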
Andreas Mohr
* Re: [patch 00/14] remap_file_pages protection support
  2006-05-16 16:47                   ` Andreas Mohr
@ 2006-05-17  3:25                     ` Nick Piggin
  2006-05-17  6:10                     ` Blaisorblade
  1 sibling, 0 replies; 27+ messages in thread
From: Nick Piggin @ 2006-05-17  3:25 UTC (permalink / raw)
  To: Andreas Mohr
  Cc: Valerie Henson, Ulrich Drepper, Blaisorblade, Andrew Morton,
	linux-kernel, Linux Memory Management, Val Henson
Andreas Mohr wrote:
>Hi,
>
>On Tue, May 16, 2006 at 09:31:12AM -0700, Valerie Henson wrote:
>
>>On Tue, May 16, 2006 at 03:51:35PM +0200, Andreas Mohr wrote:
>>
>>>I cannot offer much other than some random confirmation that from my own
>>>oprofiling, whatever I did (often running a load test script consisting of
>>>launching 30 big apps at the same time), find_vma basically always showed up
>>>very prominently in the list of vmlinux-based code (always ranking within the
>>>top 4 or 5 kernel hotspots, such as timer interrupts, ACPI idle I/O etc.pp.).
>>>call-tracing showed it originating from mmap syscalls etc., and AFAIR quite
>>>some find_vma activity from oprofile itself.
>>>
>>This is important: Which kernel?
>>
>
>I had some traces still showing find_vma prominently during a profiling run
>just yesterday, with a very fresh 2.6.17-rc4-ck1 (IOW, basically 2.6.17-rc4).
>I added some cache prefetching in the list traversal a while ago, and IIRC
>that improved profiling times there, but cache prefetching is very often
>a bandaid in search of a real solution: a better data-handling algorithm.
>
If you want to try out the patch and see what it does for you, that would be
interesting. I'll repost a slightly cleaned up version in a couple of hours.
Nick
* Re: [patch 00/14] remap_file_pages protection support
  2006-05-16 16:47                   ` Andreas Mohr
  2006-05-17  3:25                     ` Nick Piggin
@ 2006-05-17  6:10                     ` Blaisorblade
  1 sibling, 0 replies; 27+ messages in thread
From: Blaisorblade @ 2006-05-17  6:10 UTC (permalink / raw)
  To: Andreas Mohr
  Cc: Valerie Henson, Nick Piggin, Ulrich Drepper, Andrew Morton,
	linux-kernel, Linux Memory Management, Val Henson
On Tuesday 16 May 2006 18:47, Andreas Mohr wrote:
> Hi,
>
> On Tue, May 16, 2006 at 09:31:12AM -0700, Valerie Henson wrote:
> > On Tue, May 16, 2006 at 03:51:35PM +0200, Andreas Mohr wrote:
> > > I cannot offer much other than some random confirmation that from my
> > > own oprofiling, whatever I did (often running a load test script
> > > consisting of launching 30 big apps at the same time), find_vma
> > > basically always showed up very prominently in the list of
> > > vmlinux-based code (always ranking within the top 4 or 5 kernel
> > > hotspots, such as timer interrupts, ACPI idle I/O etc.pp.).
> > > call-tracing showed it originating from mmap syscalls etc., and AFAIR
> > > quite some find_vma activity from oprofile itself.
> >
> > This is important: Which kernel?
I'd also add (for everyone): on which processors? L2 cache size probably
plays an important role if, as I'm convinced, the problem is cache misses
during rb-tree traversal.
> I had some traces still showing find_vma prominently during a profiling run
> just yesterday, with a very fresh 2.6.17-rc4-ck1 (IOW, basically
> 2.6.17-rc4). I added some cache prefetching in the list traversal a while
> ago, 
You mean the rb-tree traversal, I guess! Or was the base kernel so old?
> and IIRC that improved profiling times there, but cache prefetching is 
> very often a bandaid in search of a real solution: a better data-handling
> algorithm.
OK, finally I've found the time to kick in and ask a couple of questions.
The current algorithm is good, but (IMHO) it has poor cache locality.
First, since you can get find_vma to show up in the profile: I've read that
oprofile can trace L2 cache misses (the article talked about userspace apps,
but I think it applies to kernel space too).
I think such profiling, if possible, would be particularly interesting:
there's no reason whatsoever why that lookup, even on a 32-level tree (the
theoretical maximum, since we have at most 64K vmas and height_rbtree <=
2 log2(N), i.e. 2 * 16 = 32 for N = 65536), should be so slow unless you add
cache misses to the picture. The fact that cache prefetching helps points
the same way.
The lookup has very poor cache locality: the rb-node (3 pointers, i.e. 12
bytes on 32-bit, of which a search uses only 2) is surrounded by unrelated
data that gets pulled into the cache anyway (for the nodes we merely pass
through, we touch only vm_start/vm_end, not the rest of the VMA).
For cache-locality the best data structure I know of are radix trees; but 
changing the implementation is absolutely non-trivial (the find_vma_prev() 
and friends API is tightly coupled with the rb-tree), and the size of the 
tree grows with the virtual address space (which is a problem on 64-bit 
archs); finally, you have locality when you do multiple searches, especially 
for the root nodes, but not across different levels inside a single search.
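To make the locality argument concrete, an abridged 2.6-era vm_area_struct
layout with the fields a find_vma() descent actually reads called out (the
field order matches mainline; the trailing byte count is approximate):

	/* Abridged excerpt of the mainline definition; annotations are
	 * mine. Each node visited on the descent drags in a cache line
	 * from the middle of a structure whose other fields the search
	 * never reads. */
	struct vm_area_struct {
		struct mm_struct *vm_mm;
		unsigned long vm_start;		/* read: direction decision */
		unsigned long vm_end;		/* read: direction decision */
		struct vm_area_struct *vm_next;
		pgprot_t vm_page_prot;
		unsigned long vm_flags;
		struct rb_node vm_rb;		/* read: 2 of its 3 words */
		/* ... roughly another 50 bytes of file, anon_vma and
		 * vm_ops pointers that the lookup never touches ... */
	};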
-- 
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
Thread overview: 27+ messages
     [not found] <20060430172953.409399000@zion.home.lan>
2006-05-02  3:45 ` [patch 00/14] remap_file_pages protection support Nick Piggin
2006-05-02  3:56   ` Nick Piggin
2006-05-02 11:24     ` Ingo Molnar
2006-05-02 12:19       ` Nick Piggin
2006-05-02 17:16   ` Lee Schermerhorn
2006-05-03  1:20     ` Blaisorblade
2006-05-03 14:35       ` Lee Schermerhorn
2006-05-03  0:25   ` Blaisorblade
2006-05-06 16:05     ` Ulrich Drepper
2006-05-07  4:22       ` Nick Piggin
2006-05-13 14:13         ` Nick Piggin
2006-05-13 18:19           ` Valerie Henson
2006-05-13 22:54             ` Valerie Henson
2006-05-16 13:30             ` Nick Piggin
2006-05-16 13:51               ` Andreas Mohr
2006-05-16 16:31                 ` Valerie Henson
2006-05-16 16:47                   ` Andreas Mohr
2006-05-17  3:25                     ` Nick Piggin
2006-05-17  6:10                     ` Blaisorblade
2006-05-16 16:33               ` Valerie Henson
2006-05-03  0:44   ` Blaisorblade
2006-05-06  9:06     ` Nick Piggin
2006-05-06 15:26       ` Ulrich Drepper
     [not found] ` <20060430173025.752423000@zion.home.lan>
2006-05-02  3:53   ` [patch 11/14] remap_file_pages protection support: pte_present should not trigger on PTE_FILE PROTNONE ptes Nick Piggin
2006-05-03  1:29     ` Blaisorblade
2006-05-06 10:03       ` Nick Piggin
2006-05-07 17:50         ` Blaisorblade