public inbox for linux-kernel@vger.kernel.org
* Page aging broken in 2.6
@ 2003-12-26  7:28 Benjamin Herrenschmidt
  2003-12-26  7:40 ` Andrew Morton
  2003-12-26 17:59 ` Linus Torvalds
  0 siblings, 2 replies; 31+ messages in thread
From: Benjamin Herrenschmidt @ 2003-12-26  7:28 UTC (permalink / raw)
  To: Linux Kernel list; +Cc: Rik van Riel, Andrew Morton

HI !

I don't know if x86 is affected (I suspect not) but ppc and ppc64
definitely are.

in mm/rmap.c, in page_referenced(), we do that twice:

                if (ptep_test_and_clear_young(pte))
                        referenced++;

And we never flush the TLB entry. 

I don't know if x86 (or other archs really using page tables) will
actually set the referenced bit again in the PTE if it's already set
in the TLB, if not, then x86 needs a flush too.

ppc and ppc64 need a flush to evict the entry from the hash table or
we'll never set the _PAGE_ACCESSED bit anymore.
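
(A minimal user-space sketch of the problem described above, not kernel code.
It assumes a simplified MMU model in which the accessed bit is written to the
PTE only when a translation is loaded on a miss. The names sim_mmu, sim_access,
sim_flush and sim_count_referenced, and the bit value, are hypothetical and
exist only for this illustration.)

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_PTE_ACCESSED 0x1u   /* hypothetical bit value for this sketch */

struct sim_mmu {
    unsigned pte;      /* the page-table entry as stored in memory */
    bool cached;       /* true while a translation sits in the TLB/hash */
};

/* Model one CPU access: the accessed bit is set only when the
 * translation has to be (re)loaded, i.e. on a miss. */
static void sim_access(struct sim_mmu *m)
{
    assert(m != NULL);
    if (!m->cached) {
        m->pte |= SIM_PTE_ACCESSED;
        m->cached = true;
    }
}

/* Same contract as the generic helper: report and clear the young bit. */
static bool sim_test_and_clear_young(struct sim_mmu *m)
{
    bool young = (m->pte & SIM_PTE_ACCESSED) != 0;
    m->pte &= ~SIM_PTE_ACCESSED;
    return young;
}

/* Drop the cached translation so the next access misses again. */
static void sim_flush(struct sim_mmu *m)
{
    m->cached = false;
}

/* Touch the page before each of `rounds` aging scans and return how
 * many scans saw it as referenced. */
int sim_count_referenced(int rounds, bool flush_after_clear)
{
    struct sim_mmu m = { 0u, false };
    int seen = 0;

    for (int i = 0; i < rounds; i++) {
        sim_access(&m);
        if (sim_test_and_clear_young(&m))
            seen++;
        if (flush_after_clear)
            sim_flush(&m);
    }
    return seen;
}
```

In this model a page touched before each of five scans is seen as referenced
once without the flush and five times with it.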

On the other hand, I'd like to propose a semantic change here, by
changing ptep_test_and_clear_dirty() as well so that the flush is done
by the arch function and not explicitly by the generic code in both
cases. (I'm not sure if it's worth adding an mm parameter to the call
or if the arch will figure it out, we don't have it at hand in
page_referenced()).

That way, archs that don't need the flush (if any) can avoid it, and
in the case of ptep_test_and_clear_dirty, I may have a better way of
implementing it without a flush in mind.

Comments ?

Ben.




* Re: Page aging broken in 2.6
  2003-12-26  7:28 Benjamin Herrenschmidt
@ 2003-12-26  7:40 ` Andrew Morton
  2003-12-26  9:21   ` Arjan van de Ven
  2003-12-26  9:33   ` Russell King
  2003-12-26 17:59 ` Linus Torvalds
  1 sibling, 2 replies; 31+ messages in thread
From: Andrew Morton @ 2003-12-26  7:40 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linux-kernel, riel

Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>
> HI !
> 
> I don't know if x86 is affected (I suspect not) but ppc and ppc64
> definitely are.
> 
> in mm/rmap.c, in page_referenced(), we do that twice:
> 
>                 if (ptep_test_and_clear_young(pte))
>                         referenced++;
> 
> And we never flush the TLB entry. 
> 
> I don't know if x86 (or other archs really using page tables) will
> actually set the referenced bit again in the PTE if it's already set
> in the TLB, if not, then x86 needs a flush too.

x86 needs a flush_tlb_page(), yes.

> ppc and ppc64 need a flush to evict the entry from the hash table or
> we'll never set the _PAGE_ACCESSED bit anymore.
> 
> On the other hand, I'd like to propose a semantic change here, by
> changing ptep_test_and_clear_dirty() as well so that the flush is done
> by the arch function and not explicitly by the generic code in both
> cases. (I'm not sure if it's worth adding an mm parameter to the call
> or if the arch will figure it out, we don't have it at hand in
> page_referenced()).
> 
> That way, archs that don't need the flush (if any) can avoid it, and
> in the case of ptep_test_and_clear_dirty, I may have a better way of
> implementing it without a flush in mind.

I don't feel particularly strongly either way, but the core mm code is
sprinkled with flushes anyway; it would probably be more consistent to
open-code it in rmap.c now.



* Re: Page aging broken in 2.6
  2003-12-26  7:40 ` Andrew Morton
@ 2003-12-26  9:21   ` Arjan van de Ven
  2003-12-26  9:58     ` Benjamin Herrenschmidt
  2003-12-26 19:44     ` Davide Libenzi
  2003-12-26  9:33   ` Russell King
  1 sibling, 2 replies; 31+ messages in thread
From: Arjan van de Ven @ 2003-12-26  9:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Benjamin Herrenschmidt, linux-kernel, riel



> > And we never flush the TLB entry. 
> > 
> > I don't know if x86 (or other archs really using page tables) will
> > actually set the referenced bit again in the PTE if it's already set
> > in the TLB, if not, then x86 needs a flush too.
> 
> x86 needs a flush_tlb_page(), yes.

it does? Are you 100% sure ?

Afaik x86 is very very slow in setting the A and D bits (like 2000 to
3000 cycles) *because* it doesn't need a TLB flush....






* Re: Page aging broken in 2.6
  2003-12-26  7:40 ` Andrew Morton
  2003-12-26  9:21   ` Arjan van de Ven
@ 2003-12-26  9:33   ` Russell King
  2003-12-26 10:07     ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 31+ messages in thread
From: Russell King @ 2003-12-26  9:33 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Benjamin Herrenschmidt, linux-kernel, riel

On Thu, Dec 25, 2003 at 11:40:23PM -0800, Andrew Morton wrote:
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > And we never flush the TLB entry. 
> > 
> > I don't know if x86 (or other archs really using page tables) will
> > actually set the referenced bit again in the PTE if it's already set
> > in the TLB, if not, then x86 needs a flush too.
> 
> x86 needs a flush_tlb_page(), yes.
> 
> > ppc and ppc64 need a flush to evict the entry from the hash table or
> > we'll never set the _PAGE_ACCESSED bit anymore.

ARM would strictly need the flush as well.  I seem to vaguely remember,
however, that when this code went in there was some discussion about
this very topic, and it was decided that the flush was not critical.

Indeed, 2.4 seems to have the same logic concerning not flushing the
PTE:

        /* Don't look at this pte if it's been accessed recently. */
        if ((vma->vm_flags & VM_LOCKED) || ptep_test_and_clear_young(page_table)) {
                mark_page_accessed(page);
                return 0;
        }


-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 PCMCIA      - http://pcmcia.arm.linux.org.uk/
                 2.6 Serial core


* Re: Page aging broken in 2.6
  2003-12-26  9:21   ` Arjan van de Ven
@ 2003-12-26  9:58     ` Benjamin Herrenschmidt
  2003-12-26 19:44     ` Davide Libenzi
  1 sibling, 0 replies; 31+ messages in thread
From: Benjamin Herrenschmidt @ 2003-12-26  9:58 UTC (permalink / raw)
  To: arjanv; +Cc: Andrew Morton, Linux Kernel list, Rik van Riel

On Fri, 2003-12-26 at 20:21, Arjan van de Ven wrote:
> > > And we never flush the TLB entry. 
> > > 
> > > I don't know if x86 (or other archs really using page tables) will
> > > actually set the referenced bit again in the PTE if it's already set
> > > in the TLB, if not, then x86 needs a flush too.
> > 
> > x86 needs a flush_tlb_page(), yes.
> 
> it does? Are you 100% sure ?
> 
> Afaik x86 is very very slow in setting the A and D bits (like 2000 to
> 3000 cycles) *because* it doesn't need a TLB flush....

How does this work? If x86 always updates those bits even when the
TLB copy has them already set, then it will keep writing to the PTEs
on every access... which I doubt it does ;) Or does it snoop accesses
to the PTE to "catch" somebody clearing the bits ?

Ben.




* Re: Page aging broken in 2.6
  2003-12-26  9:33   ` Russell King
@ 2003-12-26 10:07     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 31+ messages in thread
From: Benjamin Herrenschmidt @ 2003-12-26 10:07 UTC (permalink / raw)
  To: Russell King; +Cc: Andrew Morton, Linux Kernel list, Rik van Riel


> ARM would strictly need the flush as well.  I seem to vaguely remember,
> however, that when this code went in there was some discussion about
> this very topic, and it was decided that the flush was not critical.
> 
> Indeed, 2.4 seems to have the same logic concerning not flushing the
> PTE:
> 
>         /* Don't look at this pte if it's been accessed recently. */
>         if ((vma->vm_flags & VM_LOCKED) || ptep_test_and_clear_young(page_table)) {
>                 mark_page_accessed(page);
>                 return 0;
>         }

I can imagine that an architecture with TLBs will usually evict
the entry from the TLB sooner or later and the accessed bit will end
up being set again. On PPC, that isn't the case, the entry can well
stay a loooong time in the hash and if not evicted, _PAGE_ACCESSED
will never be set again.

Ben.




* Re: Page aging broken in 2.6
@ 2003-12-26 10:45 Manfred Spraul
  0 siblings, 0 replies; 31+ messages in thread
From: Manfred Spraul @ 2003-12-26 10:45 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linux-kernel

Ben wrote:

>I can imagine that an architecture with TLBs will usually evict
>the entry from the TLB sooner or later and the accessed bit will end
>up being set again. On PPC, that isn't the case, the entry can well
>stay a loooong time in the hash and if not evicted, _PAGE_ACCESSED
>will never be set again.
>
One risk for i386 is the huge TLBs that AMD uses (512 entries?) - hot 
pages might stay in the TLB forever.

>Or does it snoop accesses
>to the PTE to "catch" somebody clearing the bits ?
>
No. AMD K8 CPUs partially snoop PDE/PTE accesses and ignore TLB flush 
instructions if they are certain that the TLB is valid, but I'm not 
aware that anyone snoops the complete TLB.

--
    Manfred



* Re: Page aging broken in 2.6
  2003-12-26  7:28 Benjamin Herrenschmidt
  2003-12-26  7:40 ` Andrew Morton
@ 2003-12-26 17:59 ` Linus Torvalds
  2003-12-26 23:55   ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2003-12-26 17:59 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Linux Kernel list, Rik van Riel, Andrew Morton



On Fri, 26 Dec 2003, Benjamin Herrenschmidt wrote:
> 
> in mm/rmap.c, in page_referenced(), we do that twice:
> 
>                 if (ptep_test_and_clear_young(pte))
>                         referenced++;
> 
> And we never flush the TLB entry. 
> 
> I don't know if x86 (or other archs really using page tables) will
> actually set the referenced bit again in the PTE if it's already set
> in the TLB, if not, then x86 needs a flush too.

This was very much done on purpose. The theory is that if you're low on
memory and have a lot of pages mapped, you will see enough TLB thrashing
for this to not matter.

And if you aren't low on memory, or don't have a lot of pages mapped, it 
_also_ doesn't matter.

> ppc and ppc64 need a flush to evict the entry from the hash table or
> we'll never set the _PAGE_ACCESSED bit anymore.

Yeah, all hail bad MMU's.

Hash tables may need some kind of "not very urgent TLB flush" thing, so 
that it doesn't penalize sane architectures.

		Linus


* Re: Page aging broken in 2.6
  2003-12-26  9:21   ` Arjan van de Ven
  2003-12-26  9:58     ` Benjamin Herrenschmidt
@ 2003-12-26 19:44     ` Davide Libenzi
  1 sibling, 0 replies; 31+ messages in thread
From: Davide Libenzi @ 2003-12-26 19:44 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrew Morton, Benjamin Herrenschmidt, Linux Kernel Mailing List,
	riel

On Fri, 26 Dec 2003, Arjan van de Ven wrote:

> 
> > > And we never flush the TLB entry. 
> > > 
> > > I don't know if x86 (or other archs really using page tables) will
> > > actually set the referenced bit again in the PTE if it's already set
> > > in the TLB, if not, then x86 needs a flush too.
> > 
> > x86 needs a flush_tlb_page(), yes.
> 
> it does? Are you 100% sure ?

According to the Intel doc #24319202, section 3.7, it is the OS's
responsibility to invalidate the TLB entry.



- Davide




* Re: Page aging broken in 2.6
  2003-12-26 17:59 ` Linus Torvalds
@ 2003-12-26 23:55   ` Benjamin Herrenschmidt
  2003-12-27  0:35     ` Linus Torvalds
  0 siblings, 1 reply; 31+ messages in thread
From: Benjamin Herrenschmidt @ 2003-12-26 23:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel list, Rik van Riel, Andrew Morton


> Yeah, all hail bad MMU's.

Bad MMUs, or our architecture being tied to one MMU type? :)

(Note that I'm no special fan of our PPC hash table, it seems
to be fairly bad with the cache).

Note also that the need for a flush isn't tied to the fact that we
have a hash table but to how we use it in linux. If we used the
real HW A and D bits and had ptep_test_and_clear* actually walk
the hash and use them, we could avoid the flush the same way in
this case.

But we do not, we use the hash as a big TLB cache and consider
any page in there as accessed and any writeable page in there as
dirty, so clearing those bits requires evicting from the hash
(hash misses, at least on ppc32, are fairly cheap though).

> Hash tables may need some kind of "not very urgent TLB flush" thing, so 
> that it doesn't penalize sane architectures.

Or do what I propose here, that is have ptep_test_and_clear_* be
responsible for the flush on archs where it is necessary, but then
it would be nice to have more than the ptep as an argument...

Ben.




* Re: Page aging broken in 2.6
  2003-12-26 23:55   ` Benjamin Herrenschmidt
@ 2003-12-27  0:35     ` Linus Torvalds
  2003-12-27  0:44       ` Benjamin Herrenschmidt
  2003-12-27  1:41       ` Andrea Arcangeli
  0 siblings, 2 replies; 31+ messages in thread
From: Linus Torvalds @ 2003-12-27  0:35 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linux Kernel list, Rik van Riel, Andrew Morton, Andrea Arcangeli



On Sat, 27 Dec 2003, Benjamin Herrenschmidt wrote:
> 
> Or do what I propose here, that is have ptep_test_and_clear_* be
> responsible for the flush on archs where it is necessary, but then
> it would be nice to have more than the ptep as an argument...

The dirty handling already does the TLB flush (in that case it's a 
correctness issue, not a hint). So it's only ptep_test_and_clear_young() 
that matters.

I don't know whether that ever ends up being performance-critical, and I
don't see what else could be passed into it. We literally don't _have_
anything else than the pte.

But the ppc architecture could easily decide to walk the hash tables and
invalidate in ptep_test_and_clear_young(). And if it ends up being a
performance issue, it _appears_ that all users of "page_referenced()" 
(which is the only thing that does this) are actually using the return 
value as just a boolean. And it's entirely possible that we should break 
out of "page_referenced()" on the _first_ hit of "yes, this has been 
referenced".

That would make it much less CPU-intensive to make
"ptep_test_and_clear_young()" slightly heavier to execute. It would also 
cause "page_referenced()" to not clear _all_ mapped reference bits at the 
same time - which might unfairly cause multi-used pages to stay in memory. 
On the other hand, that might be the _right_ behaviour.
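
(A minimal user-space sketch of the two behaviours being compared, not the
real mm/rmap.c code. Each array element stands for one PTE mapping the same
page; SIM_PTE_YOUNG and all sim_* names are hypothetical and exist only for
this illustration.)

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_PTE_YOUNG 0x1u   /* hypothetical bit value for this sketch */

/* Report and clear the young bit of one simulated PTE. */
static bool sim_test_and_clear_young(unsigned *pte)
{
    bool young = (*pte & SIM_PTE_YOUNG) != 0;
    *pte &= ~SIM_PTE_YOUNG;
    return young;
}

/* Visit every mapping, clear every young bit, and return the count. */
int sim_page_referenced_all(unsigned *ptes, size_t n)
{
    int referenced = 0;

    assert(ptes != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (sim_test_and_clear_young(&ptes[i]))
            referenced++;
    return referenced;
}

/* Stop at the first young mapping; later mappings keep their bits. */
bool sim_page_referenced_first(unsigned *ptes, size_t n)
{
    assert(ptes != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (sim_test_and_clear_young(&ptes[i]))
            return true;
    return false;
}

/* Helper for inspecting the result: how many young bits are left. */
size_t sim_count_young(const unsigned *ptes, size_t n)
{
    size_t young = 0;

    assert(ptes != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (ptes[i] & SIM_PTE_YOUNG)
            young++;
    return young;
}
```

With the early exit, a page mapped several times keeps young bits in the
mappings that were not visited, so later scans will report it as referenced
again; that is the "multi-used pages stay in memory" effect discussed above.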

Rik? Andrea? 

Worth testing, perhaps.

		Linus


* Re: Page aging broken in 2.6
  2003-12-27  0:35     ` Linus Torvalds
@ 2003-12-27  0:44       ` Benjamin Herrenschmidt
  2003-12-27  0:53         ` Linus Torvalds
  2003-12-27  1:41       ` Andrea Arcangeli
  1 sibling, 1 reply; 31+ messages in thread
From: Benjamin Herrenschmidt @ 2003-12-27  0:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel list, Rik van Riel, Andrew Morton, Andrea Arcangeli

On Sat, 2003-12-27 at 11:35, Linus Torvalds wrote:
> On Sat, 27 Dec 2003, Benjamin Herrenschmidt wrote:
> > 
> > Or do what I propose here, that is have ptep_test_and_clear_* be
> > responsible for the flush on archs where it is necessary, but then
> > it would be nice to have more than the ptep as an argument...
> 
> The dirty handling already does the TLB flush (in that case it's a 
> correctness issue, not a hint). So it's only ptep_test_and_clear_young() 
> that matters.

Yes, but it would be possible to optimize it some way on our
beloved hash tables ;) (By marking the entry read-only in the
hash instead of evicting it). Maybe not worth the pain though...

> I don't know whether that ever ends up being performance-critical, and I
> don't see what else could be passed into it. We literally don't _have_
> anything else than the pte.

Ok, figured that out.

> But the ppc architecture could easily decide to walk the hash tables and
> invalidate in ptep_test_and_clear_young(). And if it ends up being a
> performance issue, it _appears_ that all users of "page_referenced()" 
> (which is the only thing that does this) are actually using the return 
> value as just a boolean. And it's entirely possible that we should break 
> out of "page_referenced()" on the _first_ hit of "yes, this has been 
> referenced".

Except that we may expect all "referencing" PTEs to have the accessed
bit cleared, no ? Or if we have lots of users we'll end up getting lots
of positive results while after the page was actually referenced... I
don't know if this would be a real problem though.

> That would make it much less CPU-intensive to make
> "ptep_test_and_clear_young()" slightly heavier to execute. It would also 
> cause "page_referenced()" to not clear _all_ mapped reference bits at the 
> same time - which might unfairly cause multi-used pages to stay in memory. 
> On the other hand, that might be the _right_ behaviour.
> 
> Rik? Andrea? 
> 
> Worth testing, perhaps.

Ok, right now, Anton is testing a patch from paulus where we do our
own flush batching and do the flush inside ptep_test_and_clear_* That
will at least fix the problem for us now.

Ben.




* Re: Page aging broken in 2.6
  2003-12-27  0:44       ` Benjamin Herrenschmidt
@ 2003-12-27  0:53         ` Linus Torvalds
  2003-12-27  0:59           ` Linus Torvalds
                             ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Linus Torvalds @ 2003-12-27  0:53 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linux Kernel list, Rik van Riel, Andrew Morton, Andrea Arcangeli



On Sat, 27 Dec 2003, Benjamin Herrenschmidt wrote:
> > 
> > The dirty handling already does the TLB flush (in that case it's a 
> > correctness issue, not a hint). So it's only ptep_test_and_clear_young() 
> > that matters.
> 
> Yes, but it would be possible to optimize it some way on our
> beloved hash tables ;) (By marking the entry read-only in the
> hash instead of evicting it). Maybe not worth the pain though...

I don't think you should evict it, since
 - you know the value it should have
 - if you do the hash lookup anyway, you might as well just update the 
   entry.

And it's not "read-only" - it's the "A" bit, not the "W" bit you should be 
clearing in "ptep_test_and_clear_young()".

> Except that we may expect all "referencing" PTEs to have the accessed
> bit cleared, no ? Or if we have lots of users we'll end up getting lots
> of positive results while after the page was actually referenced... I
> don't know if this would be a real problem though.

I'll let Rik and Andrea argue that part - it's entirely possible that 
getting lots of positive results is a _good_ thing, if the same page is 
mapped multiple times. That would just make us less eager to unmap it, 
which sounds like potentially the right thing to do (it's also how the old 
non-rmap code worked, and I know Rik thought it was "unfair", but 
whatever).

> Ok, right now, Anton is testing a patch from paulus where we do our
> own flush batching and do the flush inside ptep_test_and_clear_* That
> will at least fix the problem for us now.

Yeah, and it's unlikely to be a performance problem anyway. That function 
should be called only when we're low on memory..

		Linus


* Re: Page aging broken in 2.6
  2003-12-27  0:53         ` Linus Torvalds
@ 2003-12-27  0:59           ` Linus Torvalds
  2003-12-27  1:03           ` Benjamin Herrenschmidt
  2003-12-27  2:47           ` Rik van Riel
  2 siblings, 0 replies; 31+ messages in thread
From: Linus Torvalds @ 2003-12-27  0:59 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linux Kernel list, Rik van Riel, Andrew Morton, Andrea Arcangeli



On Fri, 26 Dec 2003, Linus Torvalds wrote:
> 
> And it's not "read-only" - it's the "A" bit, not the "W" bit you should be 
> clearing in "ptep_test_and_clear_young()".

Oh, I see, I misparsed your comment.. you were talking about changing
"ptep_test_and_clear_dirty()"  too..

That's certainly possible.

		Linus


* Re: Page aging broken in 2.6
  2003-12-27  0:53         ` Linus Torvalds
  2003-12-27  0:59           ` Linus Torvalds
@ 2003-12-27  1:03           ` Benjamin Herrenschmidt
  2003-12-27  2:37             ` Andrea Arcangeli
  2003-12-27  2:47           ` Rik van Riel
  2 siblings, 1 reply; 31+ messages in thread
From: Benjamin Herrenschmidt @ 2003-12-27  1:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel list, Rik van Riel, Andrew Morton, Andrea Arcangeli

On Sat, 2003-12-27 at 11:53, Linus Torvalds wrote:
> On Sat, 27 Dec 2003, Benjamin Herrenschmidt wrote:
> > > 
> > > The dirty handling already does the TLB flush (in that case it's a 
> > > correctness issue, not a hint). So it's only ptep_test_and_clear_young() 
> > > that matters.
> > 
> > Yes, but it would be possible to optimize it some way on our
> > beloved hash tables ;) (By marking the entry read-only in the
> > hash instead of evicting it). Maybe not worth the pain though...
> 
> I don't think you should evict it, since
>  - you know the value it should have
>  - if you do the hash lookup anyway, you might as well just update the 
>    entry.

Yup, that is my point.

> And it's not "read-only" - it's the "A" bit, not the "W" bit you should be 
> clearing in "ptep_test_and_clear_young()".

In the above I was talking about dirty.

For accessed, we currently do not use the HW bit either. Accessed = in
the hash, not accessed = not in the hash. A bit basic, but the cost of
faulting them back in isn't that bad. Still, I always found it a bit
stupid that we end up having the harvesting of accessed bits actually
evict pages that _are_ accessed, and thus potentially here to be
accessed again ;)

Paul did some experiments using the HW bits and didn't see a great
perf increase (or was it even a decrease?), but I should try that
again on ppc64 since there, we can much more quickly hit the proper
hash slot (we store its index in one group within the PTE).

Another problem with using real A & D hash bits is that we may evict
entries from the hash table (because both groups are full for a given
hash value). In this case, we need to go back to the linux PTE to
update the bits in there before we lose the A/D information from the
hash. But I don't think the overhead here matters much, we only rarely
do evicts.
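
(A minimal sketch of the write-back step described above, assuming the hash
entry carries hardware referenced/changed bits that must be folded into the
Linux PTE before the entry is thrown away. The struct, the bit values and the
function name are hypothetical; this is not the ppc hash code.)

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define SIM_HPTE_R        0x1u   /* hardware "referenced" bit in the hash entry */
#define SIM_HPTE_C        0x2u   /* hardware "changed" bit in the hash entry */
#define SIM_PAGE_ACCESSED 0x1u   /* accessed bit in the simulated Linux PTE */
#define SIM_PAGE_DIRTY    0x2u   /* dirty bit in the simulated Linux PTE */

struct sim_hpte {
    bool valid;     /* entry currently present in the hash */
    unsigned rc;    /* hardware R/C bits collected while present */
};

/* Evict one hash entry and return the Linux PTE with the hardware
 * A/D information merged in, so nothing is lost on eviction. */
unsigned sim_evict_hpte(struct sim_hpte *h, unsigned linux_pte)
{
    assert(h != NULL);
    if (!h->valid)
        return linux_pte;          /* nothing cached, nothing to merge */

    if (h->rc & SIM_HPTE_R)
        linux_pte |= SIM_PAGE_ACCESSED;
    if (h->rc & SIM_HPTE_C)
        linux_pte |= SIM_PAGE_DIRTY;

    h->valid = false;              /* the slot can now be reused */
    h->rc = 0u;
    return linux_pte;
}
```

The extra work happens only on the eviction path, which matches the point
above that its overhead should matter little if evictions are rare.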

> I'll let Rik and Andrea argue that part - it's entirely possible that 
> getting lots of positive results is a _good_ thing, if the same page is 
> mapped multiple times. That would just make us less eager to unmap it, 
> which sounds like potentially the right thing to do (it's also how the old 
> non-rmap code worked, and I know Rik thought it was "unfair", but 
> whatever).
> 
> > Ok, right now, Anton is testing a patch from paulus where we do our
> > own flush batching and do the flush inside ptep_test_and_clear_* That
> > will at least fix the problem for us now.
> 
> Yeah, and it's unlikely to be a performance problem anyway. That function 
> should be called only when we're low on memory..
> 
> 		Linus
-- 
Benjamin Herrenschmidt <benh@kernel.crashing.org>



* Re: Page aging broken in 2.6
  2003-12-27  0:35     ` Linus Torvalds
  2003-12-27  0:44       ` Benjamin Herrenschmidt
@ 2003-12-27  1:41       ` Andrea Arcangeli
  1 sibling, 0 replies; 31+ messages in thread
From: Andrea Arcangeli @ 2003-12-27  1:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin Herrenschmidt, Linux Kernel list, Rik van Riel,
	Andrew Morton

On Fri, Dec 26, 2003 at 04:35:56PM -0800, Linus Torvalds wrote:
> 
> 
> On Sat, 27 Dec 2003, Benjamin Herrenschmidt wrote:
> > 
> > Or do what I propose here, that is have ptep_test_and_clear_* be
> > responsible for the flush on archs where it is necessary, but then
> > it would be nice to have more than the ptep as an argument...
> 
> The dirty handling already does the TLB flush (in that case it's a 
> correctness issue, not a hint). So it's only ptep_test_and_clear_young() 
> that matters.
> 
> I don't know whether that ever ends up being performance-critical, and I
> don't see what else could be passed into it. We literally don't _have_
> anything else than the pte.
> 
> But the ppc architecture could easily decide to walk the hash tables and
> invalidate in ptep_test_and_clear_young(). And if it ends up being a
> performance issue, it _appears_ that all users of "page_referenced()" 
> (which is the only thing that does this) are actually using the return 
> value as just a boolean. And it's entirely possible that we should break 
> out of "page_referenced()" on the _first_ hit of "yes, this has been 
> referenced".
> 
> That would make it much less CPU-intensive to make
> "ptep_test_and_clear_young()" slightly heavier to execute. It would also 
> cause "page_referenced()" to not clear _all_ mapped reference bits at the 
> same time - which might unfairly cause multi-used pages to stay in memory. 
> On the other hand, that might be the _right_ behaviour.
> 
> Rik? Andrea? 

I agree with you about the current code being optimal for x86 even though
it's not accurate; as you said, it doesn't need to be accurate since it's
not a correctness matter. I'm not very concerned about the size of the
tlb that Manfred mentioned, the snoops can't get past a cr3 overwrite,
so the first mm flush will make sure the young bit will be marked again
next time the page is accessed, no matter the size of the tlb and no
matter what snooping it does. Snoops can't obviate the lack of ASN in
the x86* arch.

In my opinion flushing the tlb for every page is definitely overkill in
SMP due to the flood of broadcast IPIs it generates (I actually did that
and got bitten by the IPI flood in practice some years ago ;).

I believe flushing the tlb for every pte is reasonable only under
#ifndef CONFIG_SMP.

Something that can work for both cases, SMP and UP, is to keep track of
which mm have to be flushed, and to flush the _whole_ mm only after the
pagetable walk is complete; this reduces the IPI broadcast to 1 per mm,
not 1 per pte. This should improve it by an order of magnitude with
common workloads. Flushing per-pte is probably overkill on UP too.

This is what I have in 2.4:

#ifndef CONFIG_SMP
	/* on SMP it is too costly to send further IPIs */
	if (tlb_flush)
		flush_tlb_mm(mm);
#endif

Originally I was flushing the mm even in UP, but then somebody
complained on some 4-way with some 8gigs and so I let it go completely
lazy now in SMP by adding the ifndef around it.

However I've no proof that the #ifndef made any difference, so it could
be the flush_tlb_mm is fine for SMP too. The above is just to go 100%
safe in terms of scalability in the high end, while providing the most
efficient and most accurate behaviour to UP (on UP all but one of those
flush_tlb_mm are noops in terms of tlb cost).

The above flush_tlb_mm executes for every mm scan only if the inner code
had to mark a pte as "old". In 2.4 that's trivial because we scan the mm
in order, but a dumb algorithm to do it in 2.6 could be to add a bitflag
to every mm, walk the mmlist, and flush each mm with the bit on
(though this dumb logic is O(N) in the number of mm).
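
(A minimal user-space sketch of the bitflag idea above, not kernel code. It
assumes each simulated pte knows its owning mm, and it counts flushes as a
stand-in for IPI broadcasts. The sim_* names and both structs are hypothetical
and exist only for this illustration.)

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sim_mm {
    bool needs_flush;   /* the per-mm bitflag proposed in the text */
    unsigned flushes;   /* flushes issued for this mm (stand-in for IPIs) */
};

struct sim_pte {
    struct sim_mm *mm;  /* owning address space */
    bool young;         /* simulated accessed bit */
};

/* Aging pass: mark young ptes old and only note that their mm needs a
 * flush. Returns how many ptes were aged, which is also the number of
 * flushes a per-pte scheme would have issued. */
size_t sim_age_ptes(struct sim_pte *ptes, size_t n)
{
    size_t aged = 0;

    assert(ptes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (ptes[i].young) {
            ptes[i].young = false;
            ptes[i].mm->needs_flush = true;
            aged++;
        }
    }
    return aged;
}

/* After the walk: one flush per marked mm. The walk over all mm is the
 * O(N) cost mentioned above. Returns the number of flushes issued. */
size_t sim_flush_marked(struct sim_mm *mms, size_t n)
{
    size_t issued = 0;

    assert(mms != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (mms[i].needs_flush) {
            mms[i].needs_flush = false;
            mms[i].flushes++;
            issued++;
        }
    }
    return issued;
}
```

For example, ageing four young ptes spread over two address spaces costs two
flushes here instead of four.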

I'm unsure if the tlb flushing (if any) should be in the architectural
code or in the common code (there are pros and cons). For this specific
case exactly because the flush should be per-mm (a "should" for UP and a
"must" for SMP) I think doing it in the architectural code isn't
preferable.

As a generic matter (not necessarily specific to
ptep_test_and_clear_young) I dislike any form of cleverness, smartness
and hiding in the architectural pte/tlb lib calls unless it's strictly
necessary. To make an example s390 needs some additional complexity to
exploit their per-physical-page dirty bit, that incidentally broke in
subtle ways with the zero page with get_user_pages (that's fixed by now,
but it took a while to figure out and it sure couldn't be trapped by
reading the code, it bites you when it's too late). The problem is most
people would only read the x86* implementation anyways (that is not a
noop, it is just different), and in turn avoiding smartness and
differences in the archs code increases the probability that the common
code will work correctly everywhere. And you can think simpler by just
reading the common code. For istance I would prefer an additional
"flush_tlb_after_test_and_clear_young" than to hide it in
ptep_test_and_clear_young. That would be self documenting, the hiding
isn't and it could be forgotten by x86* programmers over time.


* Re: Page aging broken in 2.6
  2003-12-27  1:03           ` Benjamin Herrenschmidt
@ 2003-12-27  2:37             ` Andrea Arcangeli
  2003-12-27  5:02               ` Benjamin Herrenschmidt
  2003-12-27 10:16               ` William Lee Irwin III
  0 siblings, 2 replies; 31+ messages in thread
From: Andrea Arcangeli @ 2003-12-27  2:37 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linus Torvalds, Linux Kernel list, Rik van Riel, Andrew Morton

On Sat, Dec 27, 2003 at 12:03:48PM +1100, Benjamin Herrenschmidt wrote:
> For accessed, we currently do not use the HW bit either. Accessed = in
> the hash, not accessed = not in the hash. A bit basic, but the cost of
> faulting them back in isn't that bad. Still, I always found it a bit
> stupid that we end up having the harvesting of accessed bits actually
> evict pages that _are_ accessed, and thus potentially here to be
> accessed again ;)

It's hard for me to evaluate how much the young bit matters by only
thinking about it,  I know for sure the heavily swapping behaviour on
the alpha was noticeably less smooth than on x86 (alpha has^Hd no way to
implement the young bit, not even like you do in software through hash
faults). So I guess it's worthwhile for you to account for it even if in
software (i.e. ppc not ppc64).

> Paul did some experiments using the HW bits and didn't see a great
> perf increase (or was it even a decrease?), but I should try that

It should be an I/O dominated workload anyways and it sounds like the
hardware way involves hash manipulation too (it only avoids the fault to
set it back on).

> > I'll let Rik and Andrea argue that part - it's entirely possible that 
> > getting lots of positive results is a _good_ thing, if the same page is 
> > mapped multiple times. That would just make us less eager to unmap it, 

That sounds like correct behaviour to me: if a page is mapped multiple times
we should be less eager in unmapping it. More precisely we should give every
user the opportunity to increase the youngness of the page, so a page
with multiple users will go away after a page with just a single user,
assuming all users access their pages at the same frequency.

Returning to the "how to flush the tlb after clearing the young bit", at
least on the x86 I find it more desirable to flush based on mm (in UP
that's the most efficient and it provides accurate behaviour; in SMP
it may still be too costly, but surely a lot less costly than a broadcast
per pte).  In 2.4 with the pagetable scan the flush per mm is
straightforward and it provides a very high probability of optimizing
away a huge number of spurious IPI broadcasts. But even in 2.6 the vm is
unmapping stuff with an aggressive clustering algorithm, so that when
it starts unmapping it drops quite a lot, and there's still a
relevant probability that only a few mms have to be flushed, which in SMP
can greatly decrease the need for IPIs.  Not sure how these flush_tlb_mm
ideas translate to ppc though.
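
A minimal sketch of why flushing per mm can save broadcasts, under
stated assumptions: `struct cleared_pte`, `flushes_per_pte()` and
`flushes_per_mm()` are hypothetical names for a toy model, not kernel
interfaces. Each record stands for one pte whose young bit was cleared
during a scan; the per-mm variant issues one flush per distinct mm in
the batch.

```c
#include <stddef.h>

#define MAX_MMS 64 /* toy limit on distinct mms tracked in one batch */

/* Hypothetical record of one cleared young bit: which mm owned the pte. */
struct cleared_pte {
    int mm_id;
};

/* Flushing per pte costs one flush (and broadcast) per cleared entry. */
static size_t flushes_per_pte(const struct cleared_pte *ptes, size_t n)
{
    (void)ptes;
    return n;
}

/*
 * Flushing per mm costs one flush per distinct mm seen in the batch.
 * Distinct mms beyond MAX_MMS are not counted in this toy model.
 */
static size_t flushes_per_mm(const struct cleared_pte *ptes, size_t n)
{
    int seen[MAX_MMS];
    size_t nseen = 0;

    for (size_t i = 0; i < n; i++) {
        size_t j;

        for (j = 0; j < nseen; j++)
            if (seen[j] == ptes[i].mm_id)
                break;
        if (j == nseen && nseen < MAX_MMS)
            seen[nseen++] = ptes[i].mm_id;
    }
    return nseen;
}
```

With clustered unmapping, most ptes in a batch tend to belong to a few
mms, which is the case where the per-mm count is much smaller.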

The dirty and accessed bitflags instead are quite a different matter
w.r.t. tlb flushing: we can't defer the tlb flush after atomically
clearing the pte in smp while we clear the dirty bit. The tlb shootdown
is the clustered version of that. The shootdown runs a broadcast IPI
at most once per 508 ptes freed per mm. For the same reason we can try
to coalesce the tlb flush post-clear-young with an mm flush: we can
achieve a similar coalescing without the need for an exact tlb
shootdown like in the pte freeing.
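
The batching idea can be sketched as a counter that triggers one
broadcast per full batch. This is only an illustration: the 508 figure
is the one quoted above, while `struct flush_batch`, `batch_add()` and
`batch_finish()` are hypothetical names, not the kernel's shootdown
code.

```c
#include <stddef.h>

#define BATCH_LIMIT 508 /* ptes per broadcast, figure quoted above */

/* Toy per-mm batch state: ptes waiting for a flush, and IPIs sent so far. */
struct flush_batch {
    size_t pending;
    size_t ipis;
};

/* Record one freed pte; broadcast once the batch is full. */
static void batch_add(struct flush_batch *b)
{
    if (++b->pending == BATCH_LIMIT) {
        b->ipis++;
        b->pending = 0;
    }
}

/* Flush whatever is left at the end of the operation. */
static void batch_finish(struct flush_batch *b)
{
    if (b->pending) {
        b->ipis++;
        b->pending = 0;
    }
}
```

Freeing 1000 ptes this way costs two broadcasts instead of 1000.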


* Re: Page aging broken in 2.6
  2003-12-27  0:53         ` Linus Torvalds
  2003-12-27  0:59           ` Linus Torvalds
  2003-12-27  1:03           ` Benjamin Herrenschmidt
@ 2003-12-27  2:47           ` Rik van Riel
  2003-12-27  3:00             ` Andrew Morton
  2 siblings, 1 reply; 31+ messages in thread
From: Rik van Riel @ 2003-12-27  2:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin Herrenschmidt, Linux Kernel list, Andrew Morton,
	Andrea Arcangeli

On Fri, 26 Dec 2003, Linus Torvalds wrote:

> I'll let Rik and Andrea argue that part - it's entirely possible that
> getting lots of positive results is a _good_ thing, if the same page is
> mapped multiple times. That would just make us less eager to unmap it,
> which sounds like potentially the right thing to do (it's also how the
> old non-rmap code worked, and I know Rik thought it was "unfair", but
> whatever).

I'm really not sure which of the two behaviours would
perform better.  Chances are both behaviours will show
some performance improvement over the other, depending
on the workload...

Rik
-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


* Re: Page aging broken in 2.6
  2003-12-27  2:47           ` Rik van Riel
@ 2003-12-27  3:00             ` Andrew Morton
  2003-12-27  3:31               ` Rik van Riel
                                 ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Andrew Morton @ 2003-12-27  3:00 UTC (permalink / raw)
  To: Rik van Riel; +Cc: torvalds, benh, linux-kernel, andrea

Rik van Riel <riel@surriel.com> wrote:
>
> On Fri, 26 Dec 2003, Linus Torvalds wrote:
> 
> > I'll let Rik and Andrea argue that part - it's entirely possible that
> > getting lots of positive results is a _good_ thing, if the same page is
> > mapped multiple times. That would just make us less eager to unmap it,
> > which sounds like potentially the right thing to do (it's also how the
> > old non-rmap code worked, and I know Rik thought it was "unfair", but
> > whatever).
> 
> I'm really not sure which of the two behaviours would
> perform better.  Chances are both behaviours will show
> some performance improvement over the other, depending
> on the workload...
> 

The current behaviour seems better from a theoretical point of view.  All
we want to know is the reference pattern - whether it is one process
referencing the page frequently or 100 processes referencing it
infrequently shouldn't matter.  And if we want to give mapped pages more
preference over unmapped ones (they already have some preference, by the
default value of /proc/sys/vm/swappiness), we have less radical ways of
doing this.

But yes, it probably makes damn-all difference across a mix of workloads.



* Re: Page aging broken in 2.6
  2003-12-27  3:00             ` Andrew Morton
@ 2003-12-27  3:31               ` Rik van Riel
  2003-12-27  3:54               ` Linus Torvalds
  2003-12-27 23:07               ` Roger Luethi
  2 siblings, 0 replies; 31+ messages in thread
From: Rik van Riel @ 2003-12-27  3:31 UTC (permalink / raw)
  To: Andrew Morton; +Cc: torvalds, benh, linux-kernel, andrea

On Fri, 26 Dec 2003, Andrew Morton wrote:

> The current behaviour seems better from a theoretical point of view.

It's certainly easier to understand when tuning the VM,
indeed.

> And if we want to give mapped pages more preference over unmapped ones
> (they already have some preference, by the default value of
> /proc/sys/vm/swappiness), we have less radical ways of doing this.

Agreed, the current swappiness is probably a better measure.
More flexible and easier to tune.

cheers,

Rik
-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


* Re: Page aging broken in 2.6
  2003-12-27  3:00             ` Andrew Morton
  2003-12-27  3:31               ` Rik van Riel
@ 2003-12-27  3:54               ` Linus Torvalds
  2003-12-27 16:34                 ` Martin J. Bligh
  2003-12-27 23:07               ` Roger Luethi
  2 siblings, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2003-12-27  3:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rik van Riel, benh, linux-kernel, andrea



On Fri, 26 Dec 2003, Andrew Morton wrote:
> 
> The current behaviour seems better from a theoretical point of view. 

I disagree. It's at least not obvious.

>							 All
> we want to know is the reference pattern - whether it is one process
> referencing the page frequently or 100 processes referencing it
> infrequently shouldn't matter.

I agree that those two cases should be the same. And in fact, those two
cases _will_ be the same by my suggested change ("break out of
'page_referenced()' early")

However, you ignore the third case: a page that is frequently used by 100 
processes.

Such a page behaves differently with the 'break early' behaviour, by 
pinning the page more tightly. 

And I think that's the right behaviour. At least that's not "obviously 
wrong".

It's not something to do in 2.6.x, but I disagree that it's clear-cut.
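
A toy model of the difference, as a sketch only: one young bit per
mapping of a single page, with hypothetical helper names. It assumes
that breaking out early leaves the remaining young bits set, which is
what would make a page that is busy in many processes survive more
scans. The real page_referenced() is not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Counting policy: test and clear every mapping, return how many were young. */
static int referenced_count(bool *young, size_t nmaps)
{
    int referenced = 0;

    for (size_t i = 0; i < nmaps; i++) {
        if (young[i]) {
            young[i] = false;
            referenced++;
        }
    }
    return referenced;
}

/* Break-early policy: stop at the first young mapping, leave the rest set. */
static int referenced_break_early(bool *young, size_t nmaps)
{
    for (size_t i = 0; i < nmaps; i++) {
        if (young[i]) {
            young[i] = false;
            return 1;
        }
    }
    return 0;
}

/*
 * Number of consecutive scans, with no new accesses in between, that
 * still report the page as referenced.
 */
static int scans_survived(int (*policy)(bool *, size_t),
                          bool *young, size_t nmaps)
{
    int scans = 0;

    while (policy(young, nmaps) > 0)
        scans++;
    return scans;
}
```

In this model a page with 100 young mappings survives one scan under the
counting policy and 100 scans under break-early, while a page with a
single young mapping survives one scan either way.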

		Linus


* Re: Page aging broken in 2.6
  2003-12-27  2:37             ` Andrea Arcangeli
@ 2003-12-27  5:02               ` Benjamin Herrenschmidt
  2003-12-27 10:16               ` William Lee Irwin III
  1 sibling, 0 replies; 31+ messages in thread
From: Benjamin Herrenschmidt @ 2003-12-27  5:02 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Linux Kernel list, Rik van Riel, Andrew Morton


> Returning to the "how to flush the tlb after clearing the young bit", at
> least on the x86 I find it more desirable to flush based on mm (in UP
> that's the most efficient and it provides accurate behaviour; in SMP
> it may still be too costly, but surely a lot less costly than a broadcast
> per pte).  In 2.4 with the pagetable scan the flush per mm is
> straightforward and it provides a very high probability of optimizing
> away a huge number of spurious IPI broadcasts. But even in 2.6 the vm is
> unmapping stuff with an aggressive clustering algorithm, so that when
> it starts unmapping it drops quite a lot, and there's still a
> relevant probability that only a few mms have to be flushed, which in SMP
> can greatly decrease the need for IPIs.  Not sure how these flush_tlb_mm
> ideas translate to ppc though.

Since we use the hash as a TLB cache, we need to evict things from
it where you would do a flush_tlb. A flush_tlb_mm (or a range) is
fairly expensive: we have to calculate the hash value for each page
and evict them all. Also, the "nice" thing with this hash is that since
we have the vsids (a kind of address space number), we can hold
many processes' translations in there for a long time.

On the other hand, we don't need IPIs for any kind of flush (the
actual TLB flushes that we perform after evicting the hash entries
do broadcast in HW).

> The dirty and accessed bitflags instead are quite a different matter
> w.r.t. tlb flushing: we can't defer the tlb flush after atomically
> clearing the pte in smp while we clear the dirty bit. The tlb shootdown
> is the clustered version of that. The shootdown runs a broadcast IPI
> at most once per 508 ptes freed per mm. For the same reason we can try
> to coalesce the tlb flush post-clear-young with an mm flush: we can
> achieve a similar coalescing without the need for an exact tlb
> shootdown like in the pte freeing.





* Re: Page aging broken in 2.6
  2003-12-27  2:37             ` Andrea Arcangeli
  2003-12-27  5:02               ` Benjamin Herrenschmidt
@ 2003-12-27 10:16               ` William Lee Irwin III
  1 sibling, 0 replies; 31+ messages in thread
From: William Lee Irwin III @ 2003-12-27 10:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Benjamin Herrenschmidt, Linus Torvalds, Linux Kernel list,
	Rik van Riel, Andrew Morton

On Sat, Dec 27, 2003 at 03:37:53AM +0100, Andrea Arcangeli wrote:
> It's hard for me to evaluate how much the young bit matters by only
> thinking about it,  I know for sure the heavily swapping behaviour on
> the alpha was noticeably less smooth than on x86 (alpha has^Hd no way to
> implement the young bit, not even like you do in software through hash
> faults). So I guess it's worthwhile for you to account for it even if in
> software (i.e. ppc not ppc64).

I have a vague notion it should be possible to turn off the PAL
pagetable emulation and do these things yourself, though I'm not
entirely clear on how practical this is to do (e.g. whether the real
MMU's docs are public, whether the PAL code can be turned off at all,
etc.). It would be a relatively large amount of arch code to bang out,
and for a largely (and rather unfortunately) dead architecture at that.

Probably a moot point depending on the level of resistance to the idea
(which seems to dominate technical concerns for a number of things) and
how thin I'm spread. I've got a multia, but (a) multias suck and (b)
it's stripped, so unless I feel like debugging swapping over the network,
that's useless apart from making sure it boots and runs luserspace.

I guess the point of this was really a roundabout way to ask if there
was enough information to either do that with or rule it out just to
satisfy my own curiosity, since it's not likely I'll ever get around
to actually trying to implement it.


-- wli


* Re: Page aging broken in 2.6
  2003-12-27  3:54               ` Linus Torvalds
@ 2003-12-27 16:34                 ` Martin J. Bligh
  0 siblings, 0 replies; 31+ messages in thread
From: Martin J. Bligh @ 2003-12-27 16:34 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton; +Cc: Rik van Riel, benh, linux-kernel, andrea

>> The current behaviour seems better from a theoretical point of view. 
> 
> I disagree. It's at least not obvious.
> 
>>							 All
>> we want to know is the reference pattern - whether it is one process
>> referencing the page frequently or 100 processes referencing it
>> infrequently shouldn't matter.
> 
> I agree that those two cases should be the same. And in fact, those two
> cases _will_ be the same by my suggested change ("break out of
> 'page_referenced()' early")
> 
> However, you ignore the third case: a page that is frequently used by 100 
> processes.
> 
> Such a page behaves differently with the 'break early' behaviour, by 
> pinning the page more tightly. 
> 
> And I think that's the right behaviour. At least that's not "obviously 
> wrong".
> 
> It's not something to do in 2.6.x, but I disagree that it's clear-cut.

Could we at least stick a big fat comment explaining the current behaviour
in there? The current behaviour is not at all obvious from reading the code.
I'll try to write something if you like, but no doubt someone could do a
better job than I.

M.



* Re: Page aging broken in 2.6
  2003-12-27  3:00             ` Andrew Morton
  2003-12-27  3:31               ` Rik van Riel
  2003-12-27  3:54               ` Linus Torvalds
@ 2003-12-27 23:07               ` Roger Luethi
  2003-12-27 23:55                 ` William Lee Irwin III
  2003-12-28  0:04                 ` Andrew Morton
  2 siblings, 2 replies; 31+ messages in thread
From: Roger Luethi @ 2003-12-27 23:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rik van Riel, torvalds, benh, linux-kernel, andrea

On Fri, 26 Dec 2003 19:00:45 -0800, Andrew Morton wrote:
> The current behaviour seems better from a theoretical point of view.  All
> we want to know is the reference pattern - whether it is one process
> referencing the page frequently or 100 processes referencing it
> infrequently shouldn't matter.  And if we want to give mapped pages more

It can matter. Evicting a page that is infrequently referenced by many
processes increases the chance that all runnable processes block waiting
for that same page later. The likelihood of that happening grows under
memory pressure, when "infrequently" may actually be "quite often" and
when disk I/O is congested (resulting in higher disk access times).

You won't have the same effect when evicting a page that is referenced
by one process only, no matter how frequently.

Having all processes blocked is indeed one problem of 2.6 under memory
pressure. I don't know what the cause is, though.

Roger


* Re: Page aging broken in 2.6
  2003-12-27 23:07               ` Roger Luethi
@ 2003-12-27 23:55                 ` William Lee Irwin III
  2003-12-28 11:23                   ` Roger Luethi
  2003-12-28  0:04                 ` Andrew Morton
  1 sibling, 1 reply; 31+ messages in thread
From: William Lee Irwin III @ 2003-12-27 23:55 UTC (permalink / raw)
  To: Andrew Morton, Rik van Riel, torvalds, benh, linux-kernel, andrea

On Sun, Dec 28, 2003 at 12:07:58AM +0100, Roger Luethi wrote:
> It can matter. Evicting a page that is infrequently referenced by many
> processes increases the chance that all runnable processes block waiting
> for that same page later. The likelihood of that happening grows under
> memory pressure, when "infrequently" may actually be "quite often" and
> when disk I/O is congested (resulting in higher disk access times).
> You won't have the same effect when evicting a page that is referenced
> by one process only, no matter how frequently.

Part of this is unrealistic; paging I/O being congested must be due to
paging itself causing seeks without additional I/O load. Reading a
single page once and then faulting that one page back into numerous
process address spaces is only one I/O request, and so cannot seek in
and of itself. So in this scenario, a convoy of processes on a single
page is plausible; aggravated paging I/O seekiness is not. Did you have
in mind some additional I/O load? Or do affected processes actually all
fault before the one I/O completes, and so all block temporarily?

On Sun, Dec 28, 2003 at 12:07:58AM +0100, Roger Luethi wrote:
> Having all processes blocked is indeed one problem of 2.6 under memory
> pressure. I don't know what the cause is, though.

Can you capture sysrq t while a situation like this is in progress?

-- wli


* Re: Page aging broken in 2.6
  2003-12-27 23:07               ` Roger Luethi
  2003-12-27 23:55                 ` William Lee Irwin III
@ 2003-12-28  0:04                 ` Andrew Morton
  2003-12-28 11:58                   ` Roger Luethi
  1 sibling, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2003-12-28  0:04 UTC (permalink / raw)
  To: Roger Luethi; +Cc: riel, torvalds, benh, linux-kernel, andrea

Roger Luethi <rl@hellgate.ch> wrote:
>
> On Fri, 26 Dec 2003 19:00:45 -0800, Andrew Morton wrote:
> > The current behaviour seems better from a theoretical point of view.  All
> > we want to know is the reference pattern - whether it is one process
> > referencing the page frequently or 100 processes referencing it
> > infrequently shouldn't matter.  And if we want to give mapped pages more
> 
> It can matter. Evicting a page that is infrequently referenced by many
> processes increases the chance that all runnable processes block waiting
> for that same page later. The likelihood of that happening grows under
> memory pressure, when "infrequently" may actually be "quite often" and
> when disk I/O is congested (resulting in higher disk access times).
> 
> You won't have the same effect when evicting a page that is referenced
> by one process only, no matter how frequently.
> 
> Having all processes blocked is indeed one problem of 2.6 under memory
> pressure. I don't know what the cause is, though.
> 

I usually work this sort of thing out by "random sampling".  When
everything is in steady state, break into kgdb and start looking at task
backtraces, see where they are all sleeping.

If it's in the pagefault handler, go up to do_page_fault() and work out the
faulting address.  Compare that with /proc/pid/maps to see if it's libc or
whatever.

Repeat the above N times until you have a decent feel for what's happening
in there.  It doesn't take long.
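
The /proc/pid/maps comparison step could be scripted along these lines.
This is a rough sketch assuming the usual "start-end perms offset dev
inode path" line layout; `find_mapping()` is a hypothetical helper that
works on text already read from the file, and any sample addresses fed
to it are made up.

```c
#include <stdio.h>
#include <string.h>

/*
 * Find the mapping that contains addr in text formatted like
 * /proc/<pid>/maps. Copies the backing path ("" for anonymous
 * mappings) into path. Returns 1 if a mapping was found, 0 otherwise.
 */
static int find_mapping(const char *maps, unsigned long addr,
                        char *path, size_t pathlen)
{
    const char *line = maps;

    while (*line) {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);
        unsigned long start, end;
        char name[256] = "";
        char buf[512];
        int n;

        /* Work on a private copy so sscanf cannot run into the next line. */
        if (len >= sizeof(buf))
            len = sizeof(buf) - 1;
        memcpy(buf, line, len);
        buf[len] = '\0';

        /* Fields: start-end perms offset dev inode [path] */
        n = sscanf(buf, "%lx-%lx %*s %*s %*s %*s %255[^\n]",
                   &start, &end, name);
        if (n >= 2 && addr >= start && addr < end) {
            snprintf(path, pathlen, "%s", name);
            return 1;
        }
        if (!eol)
            break;
        line = eol + 1;
    }
    return 0;
}
```

Feeding it the faulting address from each sampled backtrace gives a
quick tally of which files the blocked tasks are waiting on.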



* Re: Page aging broken in 2.6
  2003-12-27 23:55                 ` William Lee Irwin III
@ 2003-12-28 11:23                   ` Roger Luethi
  2003-12-28 16:35                     ` William Lee Irwin III
  0 siblings, 1 reply; 31+ messages in thread
From: Roger Luethi @ 2003-12-28 11:23 UTC (permalink / raw)
  To: William Lee Irwin III, Andrew Morton, Rik van Riel, torvalds,
	benh, linux-kernel, andrea

On Sat, 27 Dec 2003 15:55:38 -0800, William Lee Irwin III wrote:
> On Sun, Dec 28, 2003 at 12:07:58AM +0100, Roger Luethi wrote:
> > It can matter. Evicting a page that is infrequently referenced by many
> > processes increases the chance that all runnable processes block waiting
> > for that same page later. The likelihood of that happening grows under
> > memory pressure, when "infrequently" may actually be "quite often" and
> > when disk I/O is congested (resulting in higher disk access times).
> > You won't have the same effect when evicting a page that is referenced
> > by one process only, no matter how frequently.
> 
> Part of this is unrealistic; paging I/O being congested must be due to
> paging itself causing seeks without additional I/O load. Reading a
> single page once and then faulting that one page back into numerous
> process address spaces is only one I/O request, and so cannot seek in
> and of itself. So in this scenario, a convoy of processes on a single
> page is plausible; aggravated paging I/O seekiness is not. Did you have
> in mind some additional I/O load? Or do affected processes actually all
> fault before the one I/O completes, and so all block temporarily?

My previous message was meant as a warning against the assumption that
the aggregated reference frequency is all that matters. I was merely
pointing out how the number of processes referencing a page could affect
performance as well. Reference frequency is used as an estimator for
the _likelihood_ of a fault in the future, but the potential _impact_
of a fault grows with the number of processes that may block on it.
It is one possible (though not necessarily the most likely) explanation
for the symptoms I see with 2.6.

vmstat finds all processes blocked a lot more often in 2.6 than in
2.4, often for several seconds in a row. That only means something in
comparison, of course, because it is anything but a precise measurement
-- not only because of the 1 second snapshot granularity but also due
to the fact that bookkeeping of running and blocked processes in the
kernel is not accurate (processes may count as both blocked and running).

Typical log snippet for a kernel build under some 2.6.0-test release:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 9  3   6268 851814   1500   8992  440    0   996   348 1141   294 87 13  0  0
 9  3   6164 852816   1540   9088  352    0   456     0 1045   145 91  9  0  0
 9  6   6164   4044 853818   8112   60    0   100    28 1016    71 92  8  0  0
 4  6   6604 854820    924   7432  532  472   784   488 1096   626 57 43  0  0
 2  9   9248   3556 855921   6968 1044 2748  1640  2752 1283   412 74 13  0 13
 3  7   9248 857071    924   6864 1208    0  1720   108 1326   524 60 34  0  6
10  8  11164   2080 858438   5952 1068 1944  2040  2064 1623  1655 74 26  0  0
 0 11  13000 859563    356   5824  796 2032  1572  2036 1330   656 66 24  0 10
 0 10  16608   4064 861037   5868  832 3960  1836  3964 1755   725 42  9  0 49
 0 11  16604 862284    420   5920 1420    0  2216     4 1471   485 39  4  0 57
 7  4   9772  10656 863286   6644  552    0  1344    12 1112   250 56  5  0 39
 9  2   8228 864687    732   6960  296    0   632   108 1484   257 96  4  0  0
 8  3   8212  10656 865689   7176   80    0   320     0 1050   146 95  5  0  0

The trace above is not for the benchmark I referred to as kbuild in the
past few weeks (it was taken under lighter load). Even so 2.6 exhibits
significantly more periods with I/O wait and consequently takes longer
than 2.4 to complete.

> > Having all processes blocked is indeed one problem of 2.6 under memory
> > pressure. I don't know what the cause is, though.
> 
> Can you capture sysrq t while a situation like this is in progress?

What are you getting at? This may be easier for you to do because you
know what you are looking for.

Roger


* Re: Page aging broken in 2.6
  2003-12-28  0:04                 ` Andrew Morton
@ 2003-12-28 11:58                   ` Roger Luethi
  0 siblings, 0 replies; 31+ messages in thread
From: Roger Luethi @ 2003-12-28 11:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: riel, torvalds, benh, linux-kernel, andrea

On Sat, 27 Dec 2003 16:04:10 -0800, Andrew Morton wrote:
> > Having all processes blocked is indeed one problem of 2.6 under memory
> > pressure. I don't know what the cause is, though.
> 
> I usually work this sort of thing out by "random sampling".  When
> everything is in steady state, break into kgdb and start looking at task
> backtraces, see where they are all sleeping.

Well, there isn't really a steady state as such. On a loaded system
there are periods during compile benchmarks where the system spends
half the time and more in I/O wait, so some processes do get to run
and do some minimal amount of work.

> If it's in the pagefault handler, go up to do_page_fault() and work out the
> faulting address.  Compare that with /proc/pid/maps to see if it's libc or
> whatever.
> 
> Repeat the above N times until you have a decent feel for what's happening
> in there.  It doesn't take long.

I instrumented the kernel a while ago to log page fault handling
(address, backing file if available) when the system became idle with
all processes blocked. I can resurrect that code which would allow for
larger samples. I'll post results if/when I get around to doing it.

Roger


* Re: Page aging broken in 2.6
  2003-12-28 11:23                   ` Roger Luethi
@ 2003-12-28 16:35                     ` William Lee Irwin III
  2003-12-28 17:15                       ` Roger Luethi
  0 siblings, 1 reply; 31+ messages in thread
From: William Lee Irwin III @ 2003-12-28 16:35 UTC (permalink / raw)
  To: rl, Andrew Morton, Rik van Riel, torvalds, benh, linux-kernel,
	andrea

At some point in the past, I wrote:
>> Part of this is unrealistic; paging I/O being congested must be due to
>> paging itself causing seeks without additional I/O load. Reading a
>> single page once and then faulting that one page back into numerous
>> process address spaces is only one I/O request, and so cannot seek in
>> and of itself. So in this scenario, a convoy of processes on a single
>> page is plausible; aggravated paging I/O seekiness is not. Did you have
>> in mind some additional I/O load? Or do affected processes actually all
>> fault before the one I/O completes, and so all block temporarily?

On Sun, Dec 28, 2003 at 12:23:40PM +0100, Roger Luethi wrote:
> My previous message was meant as a warning against the assumption that
> the aggregated reference frequency is all that matters. I was merely
> pointing out how the number of processes referencing a page could affect
> performance as well. Reference frequency is used as an estimator for
> the _likelihood_ of a fault in the future, but the potential _impact_
> of a fault grows with the number of processes that may block on it.
> It is one possible (though not necessarily the most likely) explanation
> for the symptoms I see with 2.6.

I guess caution against LFU is uncontroversial.


On Sun, Dec 28, 2003 at 12:23:40PM +0100, Roger Luethi wrote:
> vmstat finds all processes blocked a lot more often in 2.6 than in
> 2.4, often for several seconds in a row. That only means something in
> comparison, of course, because it is anything but a precise measurement
> -- not only because of the 1 second snapshot granularity but also due
> to the fact that bookkeeping of running and blocked processes in the
> kernel is not accurate (processes may count as both blocked and running).
> Typical log snippet for a kernel build under some 2.6.0-test release:

I'm not convinced what vmstat gets out of 2.4 is entirely comparable to
what it gets out of 2.6. "blocked" and "running" are collected very
differently in 2.6. iowait shouldn't be collected on 2.4 at all.

This could probably be addressed by backporting 2.6's reporting methods
to 2.4 so the two kernels use similar reporting mechanisms.


On Sun, Dec 28, 2003 at 12:23:40PM +0100, Roger Luethi wrote:
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>  9  3   6268 851814   1500   8992  440    0   996   348 1141   294 87 13  0  0
>  9  3   6164 852816   1540   9088  352    0   456     0 1045   145 91  9  0  0
>  9  6   6164   4044 853818   8112   60    0   100    28 1016    71 92  8  0  0
>  4  6   6604 854820    924   7432  532  472   784   488 1096   626 57 43  0  0
>  2  9   9248   3556 855921   6968 1044 2748  1640  2752 1283   412 74 13  0 13
>  3  7   9248 857071    924   6864 1208    0  1720   108 1326   524 60 34  0  6
> 10  8  11164   2080 858438   5952 1068 1944  2040  2064 1623  1655 74 26  0  0
>  0 11  13000 859563    356   5824  796 2032  1572  2036 1330   656 66 24  0 10
>  0 10  16608   4064 861037   5868  832 3960  1836  3964 1755   725 42  9  0 49
>  0 11  16604 862284    420   5920 1420    0  2216     4 1471   485 39  4  0 57
>  7  4   9772  10656 863286   6644  552    0  1344    12 1112   250 56  5  0 39
>  9  2   8228 864687    732   6960  296    0   632   108 1484   257 96  4  0  0
>  8  3   8212  10656 865689   7176   80    0   320     0 1050   146 95  5  0  0
> The trace above is not for the benchmark I referred to as kbuild in the
> past few weeks (it was taken under lighter load). Even so 2.6 exhibits
> significantly more periods with I/O wait and consequently takes longer
> than 2.4 to complete.

The oscillation in "free" and "buff" is very unusual. What is this
box doing?


On Sat, 27 Dec 2003 15:55:38 -0800, William Lee Irwin III wrote:
>> Can you capture sysrq t while a situation like this is in progress?

On Sun, Dec 28, 2003 at 12:23:40PM +0100, Roger Luethi wrote:
> What are you getting at? This may be easier for you to do because you
> know what you are looking for.

I'm not looking for anything per se. It will say what codepaths tasks
are blocked in and give an idea of what's going on around the system.
kgdb can do something similar that might be more useful, since data can
be examined also.


-- wli


* Re: Page aging broken in 2.6
  2003-12-28 16:35                     ` William Lee Irwin III
@ 2003-12-28 17:15                       ` Roger Luethi
  0 siblings, 0 replies; 31+ messages in thread
From: Roger Luethi @ 2003-12-28 17:15 UTC (permalink / raw)
  To: William Lee Irwin III, Andrew Morton, Rik van Riel, torvalds,
	benh, linux-kernel, andrea

On Sun, 28 Dec 2003 08:35:28 -0800, William Lee Irwin III wrote:
> > the aggregated reference frequency is all that matters. I was merely
> > pointing out how the number of processes referencing a page could affect
> > performance as well. Reference frequency is used as an estimator for
> > the _likelihood_ of a fault in the future, but the potential _impact_
> > of a fault grows with the number of processes that may block on it.
> > It is one possible (though not necessarily the most likely) explanation
> > for the symptoms I see with 2.6.
> 
> I guess caution against LFU is uncontroversial.

My bad. What I said is true for both LRU and LFU (they try to predict
the probability of future references), but I wrote "frequency" because
that happened to be on my mind (for unrelated reasons). The point was
basically: risk = probability * damage
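
As a worked illustration of that formula, assuming damage is simply the
number of processes that would block on a refault (the helper name and
the probabilities used with it are made up for the example):

```c
/*
 * Expected cost of evicting a page: chance of a later fault on it
 * times the number of processes that would block on that fault.
 */
static double eviction_risk(double refault_probability,
                            unsigned int blocked_processes)
{
    return refault_probability * (double)blocked_processes;
}
```

A page with a 0.5 chance of being faulted back by one process scores
0.5, while a page with only a 0.1 chance shared by 100 processes scores
about 10, so the less frequently referenced page is the riskier one to
evict in this model.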

> I'm not convinced what vmstat gets out of 2.4 is entirely comparable to
> what it gets out of 2.6. "blocked" and "running" are collected very

Agreed. OTOH those readings are consistent with other observations I
made. It should even be possible to add up the reported idle times and
receive a ballpark figure for the slowdown compared to a system with
more than enough memory.

> differently in 2.6. iowait shouldn't be collected on 2.4 at all.

True. If 2.4 reports idle time during a compile benchmark, though, it
seems plausible to assume it is IO wait. And if 2.6 takes much longer
than 2.4 to complete, it is due to time spent waiting for I/O (minus
some difference in system overhead) -- the work done in user space is
equal, after all.

> This could probably be addressed by backporting 2.6's reporting methods
> to 2.4 so the two kernels use similar reporting mechanisms.

I don't think it's worth it. It wouldn't tell us anything we don't
already know.

> The oscillation in "free" and "buff" is very unusual. What is this
> box doing?

Oops, sorry. That trace is a few months old and I forgot I had used a
hack to have timestamps in vmstat. The large numbers that are alternating
are jiffies, the smaller numbers are the actual readings.

Roger


end of thread, other threads:[~2003-12-28 17:16 UTC | newest]

Thread overview: 31+ messages
2003-12-26 10:45 Page aging broken in 2.6 Manfred Spraul
  -- strict thread matches above, loose matches on Subject: below --
2003-12-26  7:28 Benjamin Herrenschmidt
2003-12-26  7:40 ` Andrew Morton
2003-12-26  9:21   ` Arjan van de Ven
2003-12-26  9:58     ` Benjamin Herrenschmidt
2003-12-26 19:44     ` Davide Libenzi
2003-12-26  9:33   ` Russell King
2003-12-26 10:07     ` Benjamin Herrenschmidt
2003-12-26 17:59 ` Linus Torvalds
2003-12-26 23:55   ` Benjamin Herrenschmidt
2003-12-27  0:35     ` Linus Torvalds
2003-12-27  0:44       ` Benjamin Herrenschmidt
2003-12-27  0:53         ` Linus Torvalds
2003-12-27  0:59           ` Linus Torvalds
2003-12-27  1:03           ` Benjamin Herrenschmidt
2003-12-27  2:37             ` Andrea Arcangeli
2003-12-27  5:02               ` Benjamin Herrenschmidt
2003-12-27 10:16               ` William Lee Irwin III
2003-12-27  2:47           ` Rik van Riel
2003-12-27  3:00             ` Andrew Morton
2003-12-27  3:31               ` Rik van Riel
2003-12-27  3:54               ` Linus Torvalds
2003-12-27 16:34                 ` Martin J. Bligh
2003-12-27 23:07               ` Roger Luethi
2003-12-27 23:55                 ` William Lee Irwin III
2003-12-28 11:23                   ` Roger Luethi
2003-12-28 16:35                     ` William Lee Irwin III
2003-12-28 17:15                       ` Roger Luethi
2003-12-28  0:04                 ` Andrew Morton
2003-12-28 11:58                   ` Roger Luethi
2003-12-27  1:41       ` Andrea Arcangeli
