* context overflow
@ 2001-01-20 2:27 Dan Malek
2001-01-22 4:28 ` Troy Benjegerdes
2001-01-23 1:12 ` Frank Rowand
0 siblings, 2 replies; 48+ messages in thread
From: Dan Malek @ 2001-01-20 2:27 UTC (permalink / raw)
To: linuxppc-dev
I just heard about the bug Tom Gall fixed in "context_overflow"
by testing for current->mm == NULL.
I believe the proper solution is to use 'current->active_mm'
instead of 'current->mm' (and you never get a null pointer).
This way, the proper 'active' context is updated with a new
context even though a kernel thread has stolen it from somewhere
else to use. I think skipping the selection of a new context
in this case could be logically incorrect for some PowerPC cores.
-- Dan
--
I like MMUs because I don't have a real life.
** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: context overflow
2001-01-20 2:27 context overflow Dan Malek
@ 2001-01-22 4:28 ` Troy Benjegerdes
2001-01-22 4:39 ` Tom Gall
2001-01-22 4:55 ` Larry McVoy
2001-01-23 1:12 ` Frank Rowand
1 sibling, 2 replies; 48+ messages in thread
From: Troy Benjegerdes @ 2001-01-22 4:28 UTC (permalink / raw)
To: Dan Malek; +Cc: linuxppc_commit, linuxppc-dev
On Fri, Jan 19, 2001 at 09:27:44PM -0500, Dan Malek wrote:
>
> I just heard about the bug Tom Gall fixed in "context_overflow"
> by testing for current->mm == NULL.
>
> I believe the proper solution is to use 'current->active_mm'
> instead of 'current->mm' (and you never get a null pointer).
> This way, the proper 'active' context is updated with a new
> context even though a kernel thread has stolen it from somewhere
> else to use. I think skipping the selection of a new context
> in this case could be logically incorrect for some PowerPC cores.
Since this got no response, I'm cross-posting to linuxppc-commit.
So, does anyone else have any comments on this? This appears to be something
that's going to be quite hard to track down if we don't get it right..
So what's the right answer before we forget about it again and go on?
And on that note... what do we have for bug-tracking systems for ppc kernel
related stuff? The collective memory of people on these two mailing lists?
I know someone mentioned a sourceforge page at one point... is anyone using that?
--
--------------------------------------------------------------------------
Troy Benjegerdes 'da hozer' hozer@drgw.net
* Re: context overflow
2001-01-22 4:28 ` Troy Benjegerdes
@ 2001-01-22 4:39 ` Tom Gall
2001-01-22 18:10 ` Dan Malek
2001-01-22 18:55 ` tom_gall
2001-01-22 4:55 ` Larry McVoy
1 sibling, 2 replies; 48+ messages in thread
From: Tom Gall @ 2001-01-22 4:39 UTC (permalink / raw)
To: Troy Benjegerdes; +Cc: Dan Malek, linuxppc_commit, linuxppc-dev
Troy Benjegerdes wrote:
> On Fri, Jan 19, 2001 at 09:27:44PM -0500, Dan Malek wrote:
> >
> > I just heard about the bug Tom Gall fixed in "context_overflow"
> > by testing for current->mm == NULL.
> >
> > I believe the proper solution is to use 'current->active_mm'
> > instead of 'current->mm' (and you never get a null pointer).
> > This way, the proper 'active' context is updated with a new
> > context even though a kernel thread has stolen it from somewhere
> > else to use. I think skipping the selection of a new context
> > in this case could be logically incorrect for some PowerPC cores.
>
> Since this got no response, I'm cross-posting to linuxppc-commit.
I will look into this tomorrow. It's an important fix, I don't want to rush.
> So, does anyone else have any comments on this? This appears to be something
> that's going to be quite hard to track down if we don't get it right..
> So what's the right answer before we forget about it again and go on?
>
> And on that note... what do we have for bug-tracking systems for ppc kernel
> related stuff? The collective memory of people on these two mailing lists?
> I know someone mentioned a sourceforge page at one point... is anyone using that?
There is a sourceforge page. It hasn't been active, which probably isn't a good
thing. I'm keeping a paper list of the bugs I know about. It's not a long
list. Happy to put it out on sourceforge... just don't want it all to be a waste
of time tho. I'm sure Olaf might have a bug or two to contribute 8-)
https://sourceforge.net/projects/ppclinux/
--
Hakuna Matata,
Tom
-----------------------------------------------------------
PPC Linux Guy "My heart is human, my blood is boiling,
my brain IBM" -- Mr Roboto, Styx
tgall@rochcivictheatre.org
* Re: context overflow
2001-01-22 4:28 ` Troy Benjegerdes
2001-01-22 4:39 ` Tom Gall
@ 2001-01-22 4:55 ` Larry McVoy
2001-01-22 6:15 ` Troy Benjegerdes
1 sibling, 1 reply; 48+ messages in thread
From: Larry McVoy @ 2001-01-22 4:55 UTC (permalink / raw)
To: Troy Benjegerdes; +Cc: Dan Malek, linuxppc_commit, linuxppc-dev
On Sun, Jan 21, 2001 at 10:28:42PM -0600, Troy Benjegerdes wrote:
> And on that note... what do we have for bug-tracking systems for ppc kernel
> related stuff? The collective memory of people on these two mailing lists?
We have a fairly simple bug system we use for BitKeeper; if there was interest
I could polish up a version for the PPC team.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
* Re: context overflow
2001-01-22 4:55 ` Larry McVoy
@ 2001-01-22 6:15 ` Troy Benjegerdes
0 siblings, 0 replies; 48+ messages in thread
From: Troy Benjegerdes @ 2001-01-22 6:15 UTC (permalink / raw)
To: Larry McVoy; +Cc: Dan Malek, linuxppc_commit, linuxppc-dev
On Sun, Jan 21, 2001 at 08:55:57PM -0800, Larry McVoy wrote:
> On Sun, Jan 21, 2001 at 10:28:42PM -0600, Troy Benjegerdes wrote:
> > And on that note... what do we have for bug-tracking systems for ppc kernel
> > related stuff? The collective memory of people on these two mailing lists?
>
> We have a fairly simple bug system we use for BitKeeper; if there was interest
> I could polish up a version for the PPC team.
YES!
I was really annoyed when I tried to use bloatzilla (oh, I meant bugzilla)
and had to get a stupid password to even report a bug. Not to mention
it was way more info than I needed.
I'm assuming it's a relatively small amount of code, and could be extended
without too much of a learning curve? (Say, to work with this
nebulous regression tester I keep talking about?)
* Re: context overflow
2001-01-22 4:39 ` Tom Gall
@ 2001-01-22 18:10 ` Dan Malek
2001-01-22 18:55 ` tom_gall
1 sibling, 0 replies; 48+ messages in thread
From: Dan Malek @ 2001-01-22 18:10 UTC (permalink / raw)
To: Tom Gall; +Cc: Troy Benjegerdes, linuxppc_commit, linuxppc-dev
Tom Gall wrote:
> I will look into this tomorrow. It's an important fix, I don't want to rush.
This was a bug that appeared during the 2.3 development, and (the
collective :-) we have been running systems like this for a long
time. I'm running all of mine with my proposed (proper :-) bug
fix, but since they never crashed before......I'll at least check
it into the 2_5 baseline along with some other stuff.
-- Dan
* Re: context overflow
2001-01-22 4:39 ` Tom Gall
2001-01-22 18:10 ` Dan Malek
@ 2001-01-22 18:55 ` tom_gall
2001-01-22 19:59 ` Dan Malek
1 sibling, 1 reply; 48+ messages in thread
From: tom_gall @ 2001-01-22 18:55 UTC (permalink / raw)
To: Tom Gall; +Cc: Troy Benjegerdes, Dan Malek, linuxppc_commit, linuxppc-dev
Tom Gall wrote:
>
> Troy Benjegerdes wrote:
>
> > On Fri, Jan 19, 2001 at 09:27:44PM -0500, Dan Malek wrote:
> > >
> > > I just heard about the bug Tom Gall fixed in "context_overflow"
> > > by testing for current->mm == NULL.
> > >
> > > I believe the proper solution is to use 'current->active_mm'
> > > instead of 'current->mm' (and you never get a null pointer).
> > > This way, the proper 'active' context is updated with a new
> > > context even though a kernel thread has stolen it from somewhere
> > > else to use. I think skipping the selection of a new context
> > > in this case could be logically incorrect for some PowerPC cores.
> >
> > Since this got no response, I'm cross-posting to linuxppc-commit.
>
> I will look into this tomorrow. It's an important fix, I don't want to rush.
Hi All,
OK, here's the explanation, and I beg forgiveness if this isn't clear or needs
more filled in.
current->mm I believe is correct. active_mm for tasks in user space just points
back to mm. Kernel-space tasks will have an mm of NULL, yet their active_mm will
point back to the last user-space task that ran.
The reason for this patch is the case where the idle task comes in on one
processor while another processor has encountered a context overflow. The
idle task on processor 0 detects the overflow as well, and that's when things get
interesting, and why the change.
So anyway that's the situation from my neck of the woods. These other
processor cores make me worried. What cores are you referring to? Are they SMP?
How and in what ways are they different?
Many thanks to Pat McCarthy who is my local guru on this topic and also the
author of the fix in question.
Regards,
Tom
--
Tom Gall - PowerPC Linux Team "Where's the ka-boom? There was
Linux Technology Center supposed to be an earth
(w) tom_gall@vnet.ibm.com shattering ka-boom!"
(w) 507-253-4558 -- Marvin Martian
(h) tgall@rochcivictheatre.org
http://oss.software.ibm.com/developerworks/opensource/linux
* Re: context overflow
2001-01-22 18:55 ` tom_gall
@ 2001-01-22 19:59 ` Dan Malek
2001-01-22 22:08 ` tom_gall
0 siblings, 1 reply; 48+ messages in thread
From: Dan Malek @ 2001-01-22 19:59 UTC (permalink / raw)
To: tom_gall; +Cc: Tom Gall, Troy Benjegerdes, linuxppc_commit, linuxppc-dev
tom_gall@vnet.ibm.com wrote:
> current->mm I believe is correct. active_mm for tasks in user space just points
> back to mm. Kernel-space tasks will have an mm of NULL, yet their active_mm will
> point back to the last user-space task that ran.
Not exactly. Every task running on a CPU must have an active_mm, and
it represents the current context for the MMU. This active_mm comes
from a single threaded application's 'mm', or in the case of a
thread without an 'mm' from the previous application that ran, or
from somewhere else depending upon VM_CLONE games.
The point you are missing is 'active_mm' represents the current
context for the MMU. If you get a context overflow, you can't skip
getting and setting a context for an active task just because it
doesn't have a 'current->mm'. Your modification to do this
results in a task running on a CPU with a "NO CONTEXT" mm and, worse,
an incorrect VSID/ASID/PID/whatever for the task running on that MMU.
> The reason for this patch is the case where the idle task comes in on one
> processor while another processor has encountered a context overflow.
It's not just the idle task. It could be any task that is supposed
to get an active_mm from someone else.
The patch is just logically incorrect. There should be no
'if current->mm' and it should get/set context on current->active_mm.
-- Dan
* Re: context overflow
2001-01-22 19:59 ` Dan Malek
@ 2001-01-22 22:08 ` tom_gall
2001-01-23 0:10 ` Dan Malek
0 siblings, 1 reply; 48+ messages in thread
From: tom_gall @ 2001-01-22 22:08 UTC (permalink / raw)
To: Dan Malek; +Cc: Tom Gall, Troy Benjegerdes, linuxppc-commit, linuxppc-dev
Dan Malek wrote:
>
> tom_gall@vnet.ibm.com wrote:
>
> > current->mm I believe is correct. active_mm for tasks in user space just points
> > back to mm. Kernel-space tasks will have an mm of NULL, yet their active_mm will
> > point back to the last user-space task that ran.
>
> Not exactly. Every task running on a CPU must have an active_mm, and
> it represents the current context for the MMU. This active_mm comes
> from a single threaded application's 'mm', or in the case of a
> thread without an 'mm' from the previous application that ran, or
> from somewhere else depending upon VM_CLONE games.
Hi Dan,
Pat and I huddled around your note and gave this some more thought. Still not
convinced it's wrong tho.
> The point you are missing is 'active_mm' represents the current
> context for the MMU. If you get a context overflow, you can't skip
> getting and setting a context for an active task just because it
> doesn't have a 'current->mm'. Your modification to do this
> results in a task running on a CPU with a "NO CONTEXT" mm and, worse,
> an incorrect VSID/ASID/PID/whatever for the task running on that MMU.
active_mm represents the current context of USER space, not kernel space. I
think that's the important point here. If a task doesn't have a current->mm it's
a kernel task. It shouldn't be using the Segment Registers in the context.
Right? A kernel task should only be concerned with addresses in the range of
C0000000-FFFFFFFF which aren't in the context.
If what you say is true that incorrect VSID/ASID etc could be handed out, I'm
wondering how my box has been up running and stable since last week. It's not
proof that it's right but I would think something would have melted down by
now....
> > The reason for this patch is the case where the idle task comes in on one
> > processor while another processor has encountered a context overflow.
>
> It's not just the idle task. It could be any task that is supposed
> to get an active_mm from someone else.
Since current->mm is NULL, it's a kernel task... granted it doesn't have to be
the idle task but it shouldn't matter. Or when you say any task, are you saying
that user tasks as well?
Correct me if I'm wrong, but from the code we were looking at in sched.c, when
you pass through switch_mm from a kernel task to a user task, it catches it and
you go from the state of NO CONTEXT to the correct context.
Beat me over the head with a crowbar please if I'm missing something.
Regards,
Tom
--
Tom Gall - PowerPC Linux Team "Where's the ka-boom? There was
Linux Technology Center supposed to be an earth
(w) tom_gall@vnet.ibm.com shattering ka-boom!"
(w) 507-253-4558 -- Marvin Martian
(h) tgall@rochcivictheatre.org
http://oss.software.ibm.com/developerworks/opensource/linux
* Re: context overflow
2001-01-22 22:08 ` tom_gall
@ 2001-01-23 0:10 ` Dan Malek
2001-01-23 10:00 ` Gabriel Paubert
0 siblings, 1 reply; 48+ messages in thread
From: Dan Malek @ 2001-01-23 0:10 UTC (permalink / raw)
To: tom_gall; +Cc: Tom Gall, Troy Benjegerdes, linuxppc-commit, linuxppc-dev
tom_gall@vnet.ibm.com wrote:
> active_mm represents the current context of USER space,
Not exactly. The mm object may only contain page tables for a
user thread, but it also contains information about the MMU context
in general for any thread running on the processor.
> ..........If a task doesn't have a current->mm it's
> a kernel task. It shouldn't be using the Segment Registers in the context.
> Right? A kernel task should only be concerned with addresses in the range of
> C0000000-FFFFFFFF which aren't in the context.
No, you are confusing MMU context with kernel memory mapping and
our (mostly incorrect) use of VSIDs on the 7xx processors.
> If what you say is true that incorrect VSID/ASID etc could be handed out, I'm
> wondering how my box has been up running and stable since last week.
Because you are not running something like an MPC8xx or IBM4xx that
cares whether it is correct in kernel space. Some processors do care.
> Since current->mm is NULL, it's a kernel task... granted it doesn't have to be
> the idle task but it shouldn't matter. Or when you say any task, are you saying
> that user tasks as well?
The MMU context switching logic doesn't make any assumptions about
the meaning of current->mm. If there is a current->mm, it switches
to that as the active_mm for the thread. If there isn't a current->mm,
it locates something to use as the active_mm. The rules for selecting
an active_mm can be whatever makes sense for reducing MMU management
or implementing features.
> Correct me if I'm wrong, but from the code we were looking at in sched.c, when
> you pass through switch_mm from a kernel task to a user task, it catches it and
> you go from the state of NO CONTEXT to the correct context.
Yes, but the problem is that when the context overflowed you did not select a
new one. You allowed the thread to run on the processor (regardless of
what it was) with an expired context that doesn't match the context
of active_mm. Then, later, you find yet another context to switch to
for the same thread that was using the wrong one.
> Beat me over the head with a crowbar please if I'm missing something.
What's the big deal? I'm going to say it for the third time. The
active_mm is supposed to represent the mmu context for the thread
currently running on the processor. When the context overflows, we
should get (pick a number) and set (in the MMU) the new context for
the active_mm running on each processor. It is logically incorrect
to test current->mm and skip the get/set. By doing so, you have a
stale MMU hardware context and an mm object that shouldn't be running
on a processor.
-- Dan
* Re: context overflow
2001-01-20 2:27 context overflow Dan Malek
2001-01-22 4:28 ` Troy Benjegerdes
@ 2001-01-23 1:12 ` Frank Rowand
2001-01-23 1:20 ` Dan Malek
1 sibling, 1 reply; 48+ messages in thread
From: Frank Rowand @ 2001-01-23 1:12 UTC (permalink / raw)
To: Dan Malek; +Cc: linuxppc-dev
Dan Malek wrote:
>
> I just heard about the bug Tom Gall fixed in "context_overflow"
> by testing for current->mm == NULL.
>
> I believe the proper solution is to use 'current->active_mm'
> instead of 'current->mm' (and you never get a null pointer).
OK, I've stayed on the side-lines, waiting till I had time to actually
read through the current linuxppc_2_5 source so I would understand
the way things are today. So as I'm going through the messages, from
the beginning, I'm already confused. The only "context_overflow" I
find is mmu_context_overflow() in arch/ppc/mm/init.c. You obviously
aren't talking about the 8xx version of this function. The other
version of the function contains:
	for_each_task(tsk) {
		if (tsk->mm)
			tsk->mm->context = NO_CONTEXT;
	}
This is the code we are talking about, correct? The reason I'm
confused is that I was being literal in my reading of "current->mm"
as opposed to "tsk->mm".
Thanks,
Frank
--
Frank Rowand <frank_rowand@mvista.com>
MontaVista Software, Inc
* Re: context overflow
2001-01-23 1:12 ` Frank Rowand
@ 2001-01-23 1:20 ` Dan Malek
2001-01-23 2:12 ` Frank Rowand
0 siblings, 1 reply; 48+ messages in thread
From: Dan Malek @ 2001-01-23 1:20 UTC (permalink / raw)
To: frowand; +Cc: linuxppc-dev
Frank Rowand wrote:
> ....The only "context_overflow" I
> find is mmu_context_overflow() in arch/ppc/mm/init.c. You obviously
> aren't talking about the 8xx version of this function.
That's the one. As part of the 4xx merge I made the 8xx work as
well, so there is now only this one function for all PowerPC ports.
> This is the code we are talking about, correct? The reason I'm
> confused is that I was being literal in my reading of "current->mm"
> as opposed to "tsk->mm".
The code you posted looks pretty old. This function has more guts
to it now, including a 'set_context' of 'current->mm', the source
of the bug we are discussing.
-- Dan
* Re: context overflow
2001-01-23 1:20 ` Dan Malek
@ 2001-01-23 2:12 ` Frank Rowand
0 siblings, 0 replies; 48+ messages in thread
From: Frank Rowand @ 2001-01-23 2:12 UTC (permalink / raw)
To: Dan Malek; +Cc: frowand, linuxppc-dev
Dan Malek wrote:
>
> Frank Rowand wrote:
>
> > ....The only "context_overflow" I
> > find is mmu_context_overflow() in arch/ppc/mm/init.c. You obviously
> > aren't talking about the 8xx version of this function.
>
> That's the one. As part of the 4xx merge I made the 8xx work as
> well, so there is now only this one function for all PowerPC ports.
>
> > This is the code we are talking about, correct? The reason I'm
> > confused is that I was being literal in my reading of "current->mm"
> > as opposed to "tsk->mm".
>
> The code you posted looks pretty old. This function has more guts
> to it now, including a 'set_context' of 'current->mm', the source
> of the bug we are discussing.
>
> -- Dan
Thanks! I'm back on the right page now...
The code I posted was from a pull of linuxppc_2_5 as of last night. I
was just a little confused because I didn't see any "if (current->mm != NULL)"
in it. The problem was that it was gone already.
However, as of last night there is still an 8xx version (paraphrasing quite
a bit):
#ifndef CONFIG_PPC_CPU_PPC_8xx
local_flush_tlb_all()
{
}
local_flush_tlb_mm()
{
}
local_flush_tlb_page()
{
}
local_flush_tlb_range()
{
}
mmu_context_overflow()
{
}
#else /* CONFIG_PPC_CPU_PPC_8xx */
void
mmu_context_overflow(void)
{
	atomic_set(&next_mmu_context, -1);
}
#endif /* CONFIG_PPC_CPU_PPC_8xx */
-Frank
--
Frank Rowand <frank_rowand@mvista.com>
MontaVista Software, Inc
* Re: context overflow
2001-01-23 0:10 ` Dan Malek
@ 2001-01-23 10:00 ` Gabriel Paubert
2001-01-23 18:21 ` Dan Malek
2001-02-06 10:50 ` Paul Mackerras
0 siblings, 2 replies; 48+ messages in thread
From: Gabriel Paubert @ 2001-01-23 10:00 UTC (permalink / raw)
To: Dan Malek
Cc: tom_gall, Tom Gall, Troy Benjegerdes, linuxppc-commit,
linuxppc-dev
On Mon, 22 Jan 2001, Dan Malek wrote:
>
> tom_gall@vnet.ibm.com wrote:
>
> > active_mm represents the current context of USER space,
>
> Not exactly. The mm object may only contain page tables for a
> user thread, but it also contains information about the MMU context
> in general for any thread running on the processor.
>
> > ..........If a task doesn't have a current->mm it's
> > a kernel task. It shouldn't be using the Segment Registers in the context.
> > Right? A kernel task should only be concerned with addresses in the range of
> > C0000000-FFFFFFFF which aren't in the context.
>
> No, you are confusing MMU context with kernel memory mapping and
> our (mostly incorrect) use of VSIDs on the 7xx processors.
Finally, somebody had to say it. But saying mostly incorrect is the
understatement of the week.
> > If what you say is true that incorrect VSID/ASID etc could be handed out, I'm
> > wondering how my box has been up running and stable since last week.
>
> Because you are not running something like an MPC8xx or IBM4xx that
> cares whether it is correct in kernel space. Some processors do care.
Yes they do care, but let us not add unnecessary baggage to processors that
do not need it. All these games with current->mm and active_mm were
introduced for x86 because of their stupid MMU. 8xx/4xx are unfortunately
almost as braindead while 6xx/7xx do get it right.
The MMU code is so different for 4xx/8xx and 6xx/7xx that adding a few
conditionals for this mm management won't hurt. No kernel will ever run
without a recompile on both kinds of MMUs anyway.
> > Since current->mm is NULL, it's a kernel task... granted it doesn't have to be
> > the idle task but it shouldn't matter. Or when you say any task, are you saying
> > that user tasks as well?
>
> The MMU context switching logic doesn't make any assumptions about
> the meaning of current->mm. If there is a current->mm, it switches
> to that as the active_mm for the thread. If there isn't a current->mm,
> it locates something to use as the active_mm. The rules for selecting
> an active_mm can be whatever makes sense for reducing MMU management
> or implementing features.
For 6xx/7xx, active_mm doesn't even have any reason to exist in the first place:
there are no TLB flush issues when switching tasks, since TLBs directly
map through an intermediate 52-bit address space. Changing segment
registers, if necessary, is _cheap_: the operation is local to a processor
and only needs a couple of isyncs.
> Yes, but the problem is that when the context overflowed you did not select a
> new one. You allowed the thread to run on the processor (regardless of
> what it was) with an expired context that doesn't match the context
> of active_mm. Then, later, you find yet another context to switch to
> for the same thread that was using the wrong one.
A proper implementation on 6xx/7xx would not even know what an expired
context is. It could not even happen in the first place.
Regards,
Gabriel.
* Re: context overflow
2001-01-23 10:00 ` Gabriel Paubert
@ 2001-01-23 18:21 ` Dan Malek
2001-02-06 10:55 ` Paul Mackerras
2001-02-06 10:50 ` Paul Mackerras
1 sibling, 1 reply; 48+ messages in thread
From: Dan Malek @ 2001-01-23 18:21 UTC (permalink / raw)
To: Gabriel Paubert
Cc: tom_gall, Tom Gall, Troy Benjegerdes, linuxppc-commit,
linuxppc-dev
Gabriel Paubert wrote:
> Finally, somebody had to say it. But saying mostly incorrect is the
> understatement of the week.
It's been mumbled about for a while, and now that I am merging the
latest 4xx code and updating 8xx (and getting some performance
results on the 82xx and 74xx) I want to create more PowerPC generic
functions for all of the variants. Since I am doing the work here,
I can also try some alternatives better suited to PowerPC.
> Yes they do care, but let us not add unnecessary baggage to processors that
> do not need it.
I could make them all not care (I think), but I just found it interesting
that the generic Linux memory management now has the concept of context
management, and we seem to be the only ones using it (and not very
effectively).
> ..... All these games with current->mm and active_mm were
> introduced for x86 because of their stupid MMU.
I guess I am not familiar enough with that stupid MMU to see why
it had to be done that way. To me, it looked like a neat solution
to sharing mm contexts when it was useful. I would like to use
the lazy TLB management, but its implementation doesn't seem quite
right (or more likely I don't fully understand it :-).
> ..... 8xx/4xx are unfortunately
> almost as braindead while 6xx/7xx do get it right.
I'll take issue with that. The 8xx/4xx is actually quite nice for
32-bit only embedded processors. It is quite flexible and with some
large page size enhancements I am considering, it should be efficient
as well. The 6xx/7xx have to solve the very large virtual address
space challenge, and unfortunately require lots of software management
because the Linux page tables don't map into their hardware assist
very well. The 74xx (or at least 7450) allows some additional
flexibility that we may want to consider as well. Anyway, there
are other solutions that should work better and I want to work
toward a more generic VM interface for all of PowerPC.
> The MMU code is so different for 4xx/8xx and 6xx/7xx that adding a few
> conditionals for this mm management won't hurt. No kernel will ever run
> without a recompile on both kind of MMUs anyawy.
Right, the MMUs are different, and they share common exception
vectors (a good thing....hint to IBM). Except for the TLB miss
handler and the hash table management, all of the PowerPC code
should be identical (except for the IBM 4xx, which seems to think
it is necessary to invent new registers to do the same thing...).
> For 6xx/7xx, active_mm doesn't even have any reason to exist....
Perhaps, but it does and we should make sure it is used correctly
(hence, my complaining about not updating it correctly during a
context overflow). I'll be using it to properly manage contexts/VSIDs
in the future.
> A proper implementation on 6xx/7xx would not even know what an expired
> context is. It could not even happen in the first place.
Heh :-)....Right, but it will require a context, and active_mm is
the proper place to store/retrieve it. Doesn't the HIGHMEM option
require we properly use a kernel context? Or, do we just get lucky
there, too :-).
What you see today won't be there much longer......
-- Dan
* Re: context overflow
2001-01-23 10:00 ` Gabriel Paubert
2001-01-23 18:21 ` Dan Malek
@ 2001-02-06 10:50 ` Paul Mackerras
2001-02-06 21:32 ` Dan Malek
1 sibling, 1 reply; 48+ messages in thread
From: Paul Mackerras @ 2001-02-06 10:50 UTC (permalink / raw)
To: Gabriel Paubert; +Cc: Dan Malek, tom_gall, linuxppc-commit, linuxppc-dev
Gabriel Paubert writes:
> On Mon, 22 Jan 2001, Dan Malek wrote:
[snip]
> > No, you are confusing MMU context with kernel memory mapping and
> > our (mostly incorrect) use of VSIDs on the 7xx processors.
>
> Finally, somebody had to say it. But saying mostly incorrect is the
> understatement of the week.
What's incorrect about it?
Paul.
--
Paul Mackerras, Open Source Research Fellow, Linuxcare, Inc.
+61 2 6262 8990 tel, +61 2 6262 8991 fax
paulus@linuxcare.com.au, http://www.linuxcare.com.au/
Linuxcare. Putting Open Source to work.
* Re: context overflow
2001-01-23 18:21 ` Dan Malek
@ 2001-02-06 10:55 ` Paul Mackerras
2001-02-06 21:11 ` Dan Malek
0 siblings, 1 reply; 48+ messages in thread
From: Paul Mackerras @ 2001-02-06 10:55 UTC (permalink / raw)
To: Dan Malek
Cc: Gabriel Paubert, tom_gall, Tom Gall, Troy Benjegerdes,
linuxppc-commit, linuxppc-dev
Dan Malek writes:
> Heh :-)....Right, but it will require a context, and active_mm is
> the proper place to store/retrieve it. Doesn't the HIGHMEM option
> require we properly use a kernel context? Or, do we just get lucky
> there, too :-).
The way we do things on 6xx/7xx, kernel accesses to addresses >=
0xc0000000 don't depend on active_mm or on the currently selected
context at all. We can effectively choose a different context for
each 256MB segment of the address space, and we always choose context
0 for segments 0xc to 0xf. Further, the hash_page routine looks in
swapper_pg_dir for linux PTEs for addresses >= 0xc0000000, rather than
in the page tables associated with current->active_mm or current->mm.
--
Paul Mackerras, Open Source Research Fellow, Linuxcare, Inc.
+61 2 6262 8990 tel, +61 2 6262 8991 fax
paulus@linuxcare.com.au, http://www.linuxcare.com.au/
Linuxcare. Putting Open Source to work.
* Re: context overflow
2001-02-06 10:55 ` Paul Mackerras
@ 2001-02-06 21:11 ` Dan Malek
2001-02-06 21:50 ` Paul Mackerras
0 siblings, 1 reply; 48+ messages in thread
From: Dan Malek @ 2001-02-06 21:11 UTC (permalink / raw)
To: paulus
Cc: Gabriel Paubert, tom_gall, Tom Gall, Troy Benjegerdes,
linuxppc-commit, linuxppc-dev
Paul Mackerras wrote:
> The way we do things on 6xx/7xx, .....
That's the way all PowerPCs currently do it.
> .... We can effectively choose a different context for
> each 256MB segment of the address space, and we always choose context
> 0 for segments 0xc to 0xf.
That's not what MMU context means, well at least the way I have
learned to use it in the past. An MMU context is supposed to represent
the virtual mapping of memory objects. Linux has memory objects
and the ability to map these through VM areas, which is interesting
considering (IMHO) the TLB management and the terms (like context)
bandied about are such a big hack. Normally, it is the other way around.
You have some legacy hunk of code designed around arcane two level
page tables that tries to represent VM areas and memory objects
with TLB management doing its best to implement real MMU context.
-- Dan
* Re: context overflow
2001-02-06 10:50 ` Paul Mackerras
@ 2001-02-06 21:32 ` Dan Malek
2001-02-06 22:08 ` Paul Mackerras
2001-02-07 9:18 ` Roman Zippel
0 siblings, 2 replies; 48+ messages in thread
From: Dan Malek @ 2001-02-06 21:32 UTC (permalink / raw)
To: paulus; +Cc: Gabriel Paubert, tom_gall, linuxppc-commit, linuxppc-dev
Paul Mackerras wrote:
> What's incorrect about it?
The PowerPC gives us the ability to utilize a huge virtual address
space, and with newer processors a larger physical address space.
Although on 32-bit processors we can't see all of this at once, it
is there for us to use. The only difference between a 32- and 64-bit
PowerPC kernel implementation is that on the 64-bit you get to see
all of the VM space all of the time, while the 32-bit system is
scurrying around behind the scenes ensuring what you need to see
is currently mapped.
We are currently using VSIDs as a lazy TLB reloader, not for
a large VM space context manager. There shouldn't be any need
for CONFIG_HIGHMEM on PowerPC, and we should be able to map huge
PCI spaces we see on the multiple bridge CPCI systems without
any worry of running out of VM space.
The reason this is hard to realize today is we have built up a
system with the assumption there is only a 32-bit virtual space,
half of it consumed by a user task. When I saw the VM changes
happening during 2.3, I was hoping we could take the opportunity
to improve PowerPC performance with better MMU management. Maybe
someday I can work on it.
-- Dan
* Re: context overflow
2001-02-06 21:11 ` Dan Malek
@ 2001-02-06 21:50 ` Paul Mackerras
2001-02-06 22:29 ` Dan Malek
0 siblings, 1 reply; 48+ messages in thread
From: Paul Mackerras @ 2001-02-06 21:50 UTC (permalink / raw)
To: Dan Malek
Cc: Gabriel Paubert, tom_gall, Tom Gall, Troy Benjegerdes,
linuxppc-commit, linuxppc-dev
Dan Malek writes:
> That's not what MMU context means, well at least the way I have
> learned to use it in the past. An MMU context is supposed to represent
> the virtual mapping of memory objects. Linux has memory objects
No, an MMU context represents an address space, or more precisely the
set of virtual to physical mappings in an address space, which will
typically include mappings of many objects. That's the way the term
is used in the Linux kernel, that's why it's the mm_struct (which
represents an address space) which has the MMU context in it.
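[Editor's note: in C terms the relationship Paul describes is roughly the following; the structs are a simplified sketch with field names abbreviated from the 2.4-era kernel, not the real definitions.]

```c
/* Illustrative only: the MMU context lives in the mm_struct, i.e. it
 * belongs to the address space, not the task.  Tasks sharing an address
 * space (CLONE_VM threads) therefore share one MMU context. */
struct mm_context { unsigned long id; };

struct mm_struct {
    struct mm_context context;  /* per-address-space MMU context */
    /* ... page tables, vm_area list, ... */
};

struct task {
    struct mm_struct *mm;       /* NULL for kernel threads */
};
```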
> and the ability to map these through VM areas, which is interesting
> considering (IMHO) the TLB management and the terms (like context)
> bandied about are such a big hack. Normally, it is the other way around.
> You have some legacy hunk of code designed around arcane two level
> page tables that tries to represent VM areas and memory objects
> with TLB management doing its best to implement real MMU context.
On machines like the x86 where the MMU doesn't know about MMU contexts
you have to basically context-switch the whole MMU including the TLB.
Fortunately we don't have to do that. :)
Paul.
--
Paul Mackerras, Open Source Research Fellow, Linuxcare, Inc.
+61 2 6262 8990 tel, +61 2 6262 8991 fax
paulus@linuxcare.com.au, http://www.linuxcare.com.au/
Linuxcare. Putting Open Source to work.
* Re: context overflow
2001-02-06 21:32 ` Dan Malek
@ 2001-02-06 22:08 ` Paul Mackerras
2001-02-06 23:14 ` Dan Malek
2001-02-07 9:18 ` Roman Zippel
1 sibling, 1 reply; 48+ messages in thread
From: Paul Mackerras @ 2001-02-06 22:08 UTC (permalink / raw)
To: Dan Malek; +Cc: Gabriel Paubert, tom_gall, linuxppc-commit, linuxppc-dev
Dan Malek writes:
> The PowerPC gives us the ability to utilize a huge virtual address
> space, and with newer processors a larger physical address space.
> Although on 32-bit processors we can't see all of this at once, it
> is there for us to use. The only difference between a 32- and 64-bit
> PowerPC kernel implementation is that on the 64-bit you get to see
> all of the VM space all of the time, while the 32-bit system is
> scurrying around behind the scenes ensuring what you need to see
> is currently mapped.
I'm not sure what "scurrying around behind the scenes" you're talking
about.
Are you saying that when two tasks both have an object mapped, we
should be using the same virtual address in PPC terms (i.e. the 52 or
80-bit intermediate VA), and set up the segment registers for the 2
tasks with the same VSIDs? I think that would be a really really bad
idea (for a start it limits the number of objects you can map to
around 10 or so). Doing that would reduce the number of hash table
entries slightly, but believe me, hash table space is not a problem.
> We are currently using VSIDs as a lazy TLB reloader, not for
> a large VM space context manager. There shouldn't be any need
You're intent on solving a problem that doesn't exist, just because
the hardware can solve it. On a 32-bit system, linux processes don't
want and indeed have no notion of a large VM space, if by large you
mean more than a few gigabytes.
And saying "lazy TLB reloader" is misleading. Lazy TLB reloading is
what you do when your MMU doesn't have any support for MMU contexts,
in order to try to minimize TLB flushes. Fortunately Linux knows
about MMUs that support MMU contexts.
> for CONFIG_HIGHMEM on PowerPC, and we should be able to map huge
> PCI spaces we see on the multiple bridge CPCI systems without
> any worry of running out of VM space.
On a 32-bit PPC you can't ever see more than 4GB of virtual address
space at once without changing mappings. We really don't want to go
down the path of trying to use the segment registers like the bank
registers you used to get on 16-bit systems 25 years ago. For
example, if we mapped in the highmem pages like this, it would
drastically limit the number of highmem pages that we could have
mapped in at once.
> The reason this is hard to realize today is we have built up a
> system with the assumption there is only a 32-bit virtual space,
> half of it consumed by a user task. When I saw the VM changes
> happening during 2.3, I saw hoping we could take the opportunity
> to improve PowerPC performance with better MMU management. Maybe
> someday I can work on it.
Let me know when you've rewritten the whole of the Linux generic MM
layer. :-)
The reason for having the user task space visible in the kernel is so
that copy_to_user/copy_from_user can go fast, without having to set up
any mappings or do any more than a range-check on the address.
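[Editor's note: a hypothetical model of the point above, with the user task mapped into the kernel's address space, copy_to/from_user needs no remapping, just a range check before an ordinary copy. TASK_SIZE here is an assumed 2GB user/kernel split and the helper name is invented; this is not the real kernel code.]

```c
#include <stdint.h>
#include <stddef.h>

#define TASK_SIZE 0x80000000UL   /* assumed user/kernel split */

static int user_range_ok(uintptr_t uaddr, size_t len)
{
    /* reject wrap-around and anything reaching past the user area;
     * after this check the copy can be a plain memcpy() */
    return uaddr + len >= uaddr && uaddr + len <= TASK_SIZE;
}
```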
--
Paul Mackerras, Open Source Research Fellow, Linuxcare, Inc.
+61 2 6262 8990 tel, +61 2 6262 8991 fax
paulus@linuxcare.com.au, http://www.linuxcare.com.au/
Linuxcare. Putting Open Source to work.
* Re: context overflow
2001-02-06 21:50 ` Paul Mackerras
@ 2001-02-06 22:29 ` Dan Malek
2001-02-06 22:45 ` Paul Mackerras
0 siblings, 1 reply; 48+ messages in thread
From: Dan Malek @ 2001-02-06 22:29 UTC (permalink / raw)
To: paulus
Cc: Gabriel Paubert, tom_gall, Tom Gall, Troy Benjegerdes,
linuxppc-commit, linuxppc-dev
Paul Mackerras wrote:
> > That's not what MMU context means, well at least the way I have
> > learned to use it in the past. An MMU context is supposed to represent
> > the virtual mapping of memory objects. Linux has memory objects
>
> No, an MMU context represents an address space, or more precisely the
> set of virtual to physical mappings in an address space,...
Isn't that what I just said above :-)? Your original message said
you want to map some context to just a few of the VSIDs; that is what
I said isn't correct.
> On machines like the x86 where the MMU doesn't know about MMU contexts
> you have to basically context-switch the whole MMU including the TLB.
> Fortunately we don't have to do that. :)
Well, an MMU doesn't have to know about contexts (or have something
called a 'context register') for you to implement MMU context management.
-- Dan
* Re: context overflow
2001-02-06 22:29 ` Dan Malek
@ 2001-02-06 22:45 ` Paul Mackerras
0 siblings, 0 replies; 48+ messages in thread
From: Paul Mackerras @ 2001-02-06 22:45 UTC (permalink / raw)
To: Dan Malek
Cc: Gabriel Paubert, tom_gall, Troy Benjegerdes, linuxppc-commit,
linuxppc-dev
Dan Malek writes:
> > > That's not what MMU context means, well at least the way I have
> > > learned to use it in the past. An MMU context is supposed to represent
> > > the virtual mapping of memory objects. Linux has memory objects
> >
> > No, an MMU context represents an address space, or more precisely the
> > set of virtual to physical mappings in an address space,...
>
> Isn't that what I just said above :-)? Your original message said
That is one possible interpretation of what you said above. :) You
went on to talk about memory objects which is why I thought you were
saying that a context represented a mapping of a single object. I
think we're actually in violent agreement.
> you want to map some context to just a few of the VSIDs, that is what
> I said isn't correct.
What does "correct" mean in this context (excuse the pun :) ?
We can set up the VSIDs however we darn well please. We set up the
VSIDs so that the kernel portion of each context is the same. Where's
the problem?
In saying "we can effectively choose a different context for each
256MB segment" I was trying to say "if you think about MMUs in terms
of MMU contexts, then you can think of the PPC MMU as being able to
select a different context, not just per task, but per 256MB segment
of the address space of that task as well".
> Well, an MMU doesn't have to know about contexts (or have something
> called a 'context register') for you to implement MMU context management.
Sure. But it needs to know about more address bits (in some sense)
than the 32 bits of VA that the CPU core gives you. The x86 MMU
doesn't.
--
Paul Mackerras, Open Source Research Fellow, Linuxcare, Inc.
+61 2 6262 8990 tel, +61 2 6262 8991 fax
paulus@linuxcare.com.au, http://www.linuxcare.com.au/
Linuxcare. Putting Open Source to work.
* Re: context overflow
2001-02-06 22:08 ` Paul Mackerras
@ 2001-02-06 23:14 ` Dan Malek
2001-02-07 0:23 ` Paul Mackerras
0 siblings, 1 reply; 48+ messages in thread
From: Dan Malek @ 2001-02-06 23:14 UTC (permalink / raw)
To: paulus; +Cc: Gabriel Paubert, tom_gall, linuxppc-commit, linuxppc-dev
Paul Mackerras wrote:
> Are you saying that when two tasks both have an object mapped, we
> should be using the same virtual address in PPC terms
No. Simply that the PowerPC allows us to utilize a greater than
32-bit address and an MMU context should represent that and allow
us to manage it.
> You're intent on solving a problem that doesn't exist, just because
> the hardware can solve it.
I think CONFIG_HIGHMEM is clearly a solution that wasn't
implemented very well. We are seeing embedded PowerPCs with 1G or
larger real memory, along with lots of PCI busses with devices
that require mapping. This simply doesn't work today without some
kernel modifications. Clearly problems that need solutions.
> And saying "lazy TLB reloader" is misleading. Lazy TLB reloading is
> what you do when your MMU doesn't have any support for MMU contexts,
> in order to try to minimize TLB flushes.
That's exactly what we are doing on PowerPC. We use the VSIDs as a
kind of context register, not as an extension of virtual space addressing.
We minimize TLB flushes by using different VSIDs for Linux MMU contexts.
When switching MMU contexts, we simply allow the new VSIDs to miss
and cause a TLB reload. It's not misleading, it is exactly what
we are doing. We aren't actually minimizing anything, just postponing
the TLB reloads. Actively switching MMU contexts gives us the ability
to preload TLB entries, minimizing misses upon the context switch,
and also provides a more predictable context switch behavior. Do it
right, get more features and flexibility.
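[Editor's note: a sketch of the context/VSID recycling under discussion; the sizes are invented (real 6xx kernels have far more contexts) and the code is a toy model of the scheme, not the kernel source. Each mm holds a context number plus a generation tag; when the counter wraps, i.e. the "context overflow" this thread is named after, the generation bumps, implicitly invalidating every live context, and the fresh VSIDs simply fault and reload on demand.]

```c
#define LAST_CONTEXT 15UL   /* assumed small for the example */

static unsigned long next_ctx = 1, generation = 1;

struct mm { unsigned long ctx, gen; };

static void get_mmu_context(struct mm *mm)
{
    if (mm->gen == generation)
        return;                     /* context still valid */
    if (next_ctx > LAST_CONTEXT) {  /* context overflow */
        generation++;               /* invalidate all live contexts */
        next_ctx = 1;
        /* the real code would flush stale hash-table/TLB entries here */
    }
    mm->ctx = next_ctx++;
    mm->gen = generation;
}
```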
> On a 32-bit PPC you can't ever see more than 4GB of virtual address
> space at once without changing mappings.
Yep.
> .... We really don't want to go
> down the path of trying to use the segment registers like the bank
> registers you used to get on 16-bit systems 25 years ago.
We already have. That is basically what CONFIG_HIGHMEM does...just
a horse of a different color.
> ....... For
> example, if we mapped in the highmem pages like this, it would
> drastically limit the number of highmem pages that we could have
> mapped in at once.
But that is effectively what we are (Linus is :-) doing. It is
competition for a finite resource. My only suggestion is we take
advantage of the PowerPC hardware to do it for us, and then have
a cleaner path to 64-bit (or simply larger physical address space)
implementations (like the 7450).
> Let me know when you've rewritten the whole of the Linux generic MM
> layer. :-)
Linus is learning :-). We don't have to rewrite much of it anymore.
We just need to use it more wisely in our processor specific porting.
> The reason for having the user task space visible....
I've known about that trade off for about 25 years, and I am sure it
was discussed long before that. From experience, I tend to favor
the other direction. The user/kernel spaces are separate and you
should design a system that treats them as such. If you want to
share the space, you end up with empty functions/macros for mapping.
Much easier than trying to go the other way. You also uncover lots
of programming errors, especially code that assumes pointers are
integers, which makes it much easier to scale up to larger systems. There are just
way too many advantages to separate user/kernel spaces and the trivial
data copy argument should just never prevail.
-- Dan
* Re: context overflow
2001-02-06 23:14 ` Dan Malek
@ 2001-02-07 0:23 ` Paul Mackerras
2001-02-07 18:02 ` Dan Malek
0 siblings, 1 reply; 48+ messages in thread
From: Paul Mackerras @ 2001-02-07 0:23 UTC (permalink / raw)
To: Dan Malek; +Cc: Gabriel Paubert, tom_gall, linuxppc-commit, linuxppc-dev
Dan Malek writes:
> No. Simply that the PowerPC allows us to utilize a greater than
> 32-bit address and an MMU context should represent that and allow
> us to manage it.
I think I would need to see a more concrete proposal to really
understand what you're getting at.
> > And saying "lazy TLB reloader" is misleading. Lazy TLB reloading is
> > what you do when your MMU doesn't have any support for MMU contexts,
> > in order to try to minimize TLB flushes.
>
> That's exactly what we are doing on PowerPC. We use the VSIDs as a
> kind of context register, not as an extension of virtual space addressing.
No, let's be clear, we are *not* doing lazy TLB switching as the term
is used in the linux kernel. As I understand it, on x86 the kernel
avoids switching the page tables on a context switch if possible, so
in fact a process running in the kernel can in some cases be running
using the page tables of another task. This is a win in the cases
where you switch to a task which does a small amount of processing
(and doesn't access its userspace area) then goes back to sleep.
> We minimize TLB flushes by using different VSIDs for Linux MMU contexts.
> When switching MMU contexts, we simply allow the new VSIDs to miss
> and cause a TLB reload. It's not misleading, it is exactly what
> we are doing. We aren't actually minimizing anything, just postponing
> the TLB reloads. Actively switching MMU contexts allows the ability
> to preload TLB entries, minimizing misses upon the context switch,
Preloading TLB entries just means you take the hit now instead of
later, how does that help? In general it's better to take the hit
later because you can't predict with 100% accuracy which TLB entries
will be needed. If you preload an entry which isn't then used, you've
wasted time.
> > .... We really don't want to go
> > down the path of trying to use the segment registers like the bank
> > registers you used to get on 16-bit systems 25 years ago.
>
> We already have. That is basically what CONFIG_HIGHMEM does...just
> a horse of a different color.
No, the point about bank registers, or segment registers, is that
there are only a few of them, whereas there are lots of pages. That's
why it is better to map in high memory pages using PTEs instead of
segment registers.
You talk about "utilizing a greater than 32 bit address". But on a
32-bit platform you have to do some kind of mapping in order to access
that. You can do that mapping with PTEs or with segment registers,
but the trouble with segment registers is that there are far too few
of them to be useful.
> But that is effectively what we are (Linus is :-) doing. It is
> competition for a finite resource. My only suggestion is we take
> advantage of the PowerPC hardware to do it for us, and then have
> a cleaner path to a 64-bit (or simply larger physical address space)
> implementations (like the 7450).
Sounds nice, got a concrete proposal? It's easy to criticise.
> I've known about that trade off for about 25 years, and I am sure it
> was discussed long before that. From experience, I tend to favor
> the other direction. The user/kernel spaces are separate and you
> should design a system that treats them as such. If you want to
Linux does, that's why there is copy_to/from_user, get/put_user etc.
The sparc64 port has separate user and kernel spaces, but there at
least you have the lda and sta instructions (load/store to/from
alternate address space) which can access the user space with no
overhead.
--
Paul Mackerras, Open Source Research Fellow, Linuxcare, Inc.
+61 2 6262 8990 tel, +61 2 6262 8991 fax
paulus@linuxcare.com.au, http://www.linuxcare.com.au/
Linuxcare. Support for the revolution.
* Re: context overflow
2001-02-06 21:32 ` Dan Malek
2001-02-06 22:08 ` Paul Mackerras
@ 2001-02-07 9:18 ` Roman Zippel
2001-02-07 17:46 ` Dan Malek
1 sibling, 1 reply; 48+ messages in thread
From: Roman Zippel @ 2001-02-07 9:18 UTC (permalink / raw)
To: Dan Malek
Cc: paulus, Gabriel Paubert, tom_gall, linuxppc-commit, linuxppc-dev
Hi,
On Tue, 6 Feb 2001, Dan Malek wrote:
> We are currently using VSIDs as a lazy TLB reloader, not for
> a large VM space context manager. There shouldn't be any need
> for CONFIG_HIGHMEM on PowerPC, and we should be able to map huge
> PCI spaces we see on the multiple bridge CPCI systems without
> any worry of running out of VM space.
Somewhere I lost you; could you explain it a bit more (maybe with an
example)? What does CONFIG_HIGHMEM have to do with the PCI mapping?
bye, Roman
* Re: context overflow
2001-02-07 9:18 ` Roman Zippel
@ 2001-02-07 17:46 ` Dan Malek
2001-02-07 18:39 ` Roman Zippel
0 siblings, 1 reply; 48+ messages in thread
From: Dan Malek @ 2001-02-07 17:46 UTC (permalink / raw)
To: Roman Zippel
Cc: paulus, Gabriel Paubert, tom_gall, linuxppc-commit, linuxppc-dev
Roman Zippel wrote:
> Somewhere I lost you, could you explain it a bit more (maybe with an
> example)? What has CONFIG_HIGHMEM to do with the pci mapping???
The only thing they have in common is the desire to consume kernel
virtual space. They were just two examples of the reason I am trying
to find methods of increasing the kernel virtual space.
-- Dan
* Re: context overflow
2001-02-07 0:23 ` Paul Mackerras
@ 2001-02-07 18:02 ` Dan Malek
2001-02-08 0:48 ` Paul Mackerras
0 siblings, 1 reply; 48+ messages in thread
From: Dan Malek @ 2001-02-07 18:02 UTC (permalink / raw)
To: paulus; +Cc: Gabriel Paubert, tom_gall, linuxppc-commit, linuxppc-dev
Paul Mackerras wrote:
> Sounds nice, got a concrete proposal? It's easy to criticise.
Sorry, I don't mean to criticize (or criticise :-). Well, maybe
just a little......in general some of the Linux design seems to
be just a little short sighted (not picking on you Paul, I even
do this :-). We tend to fix an immediate problem today without
stepping back for 30 seconds to apply some knowledge gained over
the past 50 years of computer development.
Over the past month or so, I have been part of some new processor
ports and trying to find solutions for large memory systems. It
has pushed me deep into the VM design again, I am trying to clean
up some of the redundant functions we have and find some longer term
solutions instead of providing a "hack of the day" that will just
need to be thrown away in favor of something else tomorrow.
I don't yet have a proposal, and in retrospect it does seem the
discussion started on a negative tone. I'm thinking about flying
out to some Caribbean island for a week or so to work on this (among
other things :-).
I have a couple of things to update for the other processor work I
am doing, and after that I will experiment with a few things. I'll
let you know if I can find something useful.
-- Dan
* Re: context overflow
2001-02-07 17:46 ` Dan Malek
@ 2001-02-07 18:39 ` Roman Zippel
2001-02-07 21:16 ` Gabriel Paubert
0 siblings, 1 reply; 48+ messages in thread
From: Roman Zippel @ 2001-02-07 18:39 UTC (permalink / raw)
To: Dan Malek
Cc: paulus, Gabriel Paubert, tom_gall, linuxppc-commit, linuxppc-dev
Hi,
On Wed, 7 Feb 2001, Dan Malek wrote:
> > Somewhere I lost you, could you explain it a bit more (maybe with an
> > example)? What has CONFIG_HIGHMEM to do with the pci mapping???
>
> The only thing they have in common is the desire to consume kernel
> virtual space. They were just two examples of the reason I am trying
> to find methods of increasing the kernel virtual space.
CONFIG_HIGHMEM has little to do with the kernel virtual space. The main
purpose is to get the memory into user space, where user programs can use
it. I really don't understand what problem you're trying to solve. The
kernel only wants to manage high memory and rarely accesses it (using
kmap when it does); most of the time these pages should be in the user
virtual address space only.
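[Editor's note: a toy model of the kmap window Roman mentions; the slot count, base address, and function names are invented (the real API is kmap()/kunmap()). A small, fixed pool of kernel virtual slots is recycled to reach highmem pages that have no permanent kernel mapping.]

```c
#define NR_SLOTS  4
#define SLOT_BASE 0xf0000000UL

static int slot_page[NR_SLOTS];          /* 0 = slot free */

static unsigned long toy_kmap(int page)  /* page number, > 0 */
{
    for (int i = 0; i < NR_SLOTS; i++)
        if (slot_page[i] == 0) {
            slot_page[i] = page;         /* real code would set a PTE */
            return SLOT_BASE + (unsigned long)i * 4096;
        }
    return 0;                            /* window exhausted */
}

static void toy_kunmap(unsigned long vaddr)
{
    slot_page[(vaddr - SLOT_BASE) / 4096] = 0;
}
```

The point of the model is the finite window: only NR_SLOTS highmem pages can be visible to the kernel at once, which is why highmem pages live mainly in user address spaces.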
bye, Roman
* Re: context overflow
2001-02-07 18:39 ` Roman Zippel
@ 2001-02-07 21:16 ` Gabriel Paubert
2001-02-08 0:34 ` Paul Mackerras
0 siblings, 1 reply; 48+ messages in thread
From: Gabriel Paubert @ 2001-02-07 21:16 UTC (permalink / raw)
To: Roman Zippel; +Cc: Dan Malek, paulus, tom_gall, linuxppc-commit, linuxppc-dev
Hi,
> CONFIG_HIGHMEM has little to do with the kernel virtual space. The main
> purpose is to get the memory into user space, where user programs can use
> them. I really don't understand, what problem you're trying to solve. The
> kernel only wants to manage high memory and rarely access it (and uses
> kmap for this), most of the time these pages should be in the user virtual
> address space only.
Confused. If you have up to 64 GB of RAM (the 7450 has 36 address lines)
and cannot even use one gigabyte for cache, you end up with a
seriously unbalanced system. Depending on what you do, of course.
Hey, 64 GB of RAM is not even that expensive today (the only problem is
buffering and the additional latency caused by the extra layers
you need to access 100+ memory modules).
Gabriel.
* Re: context overflow
2001-02-07 21:16 ` Gabriel Paubert
@ 2001-02-08 0:34 ` Paul Mackerras
0 siblings, 0 replies; 48+ messages in thread
From: Paul Mackerras @ 2001-02-08 0:34 UTC (permalink / raw)
To: Gabriel Paubert
Cc: Roman Zippel, Dan Malek, tom_gall, linuxppc-commit, linuxppc-dev
Gabriel Paubert writes:
> > CONFIG_HIGHMEM has little to do with the kernel virtual space. The main
> > purpose is to get the memory into user space, where user programs can use
> > them. I really don't understand, what problem you're trying to solve. The
> > kernel only wants to manage high memory and rarely access it (and uses
> > kmap for this), most of the time these pages should be in the user virtual
> > address space only.
>
> Confused. If you have up to 64 Gb or RAM (the 7450 has 36 address lines)
> and can use not even use one gigabyte for cache, you end up with a
> seriously unbalanced system. Depending on what you do of course.
The highmem pages are used mainly as page-cache pages and anonymous
user pages. At the moment the kernel uses bounce buffers for doing
I/O to/from highmem pages but that is on the list of things to be
fixed. :)
Paul.
--
Paul Mackerras, Open Source Research Fellow, Linuxcare, Inc.
+61 2 6262 8990 tel, +61 2 6262 8991 fax
paulus@linuxcare.com.au, http://www.linuxcare.com.au/
Linuxcare. Putting open source to work.
* Re: context overflow
2001-02-07 18:02 ` Dan Malek
@ 2001-02-08 0:48 ` Paul Mackerras
2001-02-08 1:39 ` Frank Rowand
2001-02-08 19:00 ` David Edelsohn
0 siblings, 2 replies; 48+ messages in thread
From: Paul Mackerras @ 2001-02-08 0:48 UTC (permalink / raw)
To: Dan Malek; +Cc: Gabriel Paubert, tom_gall, linuxppc-commit, linuxppc-dev
Dan Malek writes:
> I don't yet have a proposal, and in retrospect it does seem the
> discussion started on a negative tone. I'm thinking about flying
Well, it did start with a remark of yours about our "mostly incorrect
use of the MMU". :)
I think that it is seductive but misleading when the PPC docs say that
the PPC supports a "large virtual address space". Yes it does, in a
sense, but it's not one that you can access directly on a 32-bit PPC;
you have to go changing segment registers to get at all of it - i.e.,
there's no such thing as a pointer into this space that the CPU
understands natively. You could have such a pointer as a 64-bit
value, but to use it you have to split it apart and put one part in a
segment register and make a pointer that the CPU understands out of
the other part. And the other thing is that there just aren't enough
segment registers on 32-bit PPCs to be really useful (except as a way
of coarsely partitioning the address space).
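[Editor's note: the pointer-splitting Paul describes can be sketched as follows; the struct and field widths are simplified for illustration and are not a real kernel type. A "pointer" into the large virtual space must be carried as a 64-bit value and split by software, the high part going into a segment register as a VSID and only the low 28 bits being usable as a native CPU address within that segment.]

```c
#include <stdint.h>

struct wide_ptr {
    uint32_t vsid;     /* would be loaded into a segment register */
    uint32_t offset;   /* low 28 bits: the part the CPU sees natively */
};

static struct wide_ptr split_va(uint64_t va)
{
    struct wide_ptr p;
    p.vsid   = (uint32_t)(va >> 28);
    p.offset = (uint32_t)(va & 0x0fffffffUL);
    return p;
}
```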
And of course when you go to a 64-bit PPC, all of those sorts of
problems go away; you just have 64-bit pointers and set up your
logical -> physical translation however you want, and you can then
basically ignore the MMU's "virtual" address space except as something
internal to the translation process that only really affects what
values you have to put in the segment and hash tables.
Paul.
--
Paul Mackerras, Open Source Research Fellow, Linuxcare, Inc.
+61 2 6262 8990 tel, +61 2 6262 8991 fax
paulus@linuxcare.com.au, http://www.linuxcare.com.au/
Linuxcare. Putting open source to work.
* Re: context overflow
2001-02-08 0:48 ` Paul Mackerras
@ 2001-02-08 1:39 ` Frank Rowand
2001-02-08 19:00 ` David Edelsohn
1 sibling, 0 replies; 48+ messages in thread
From: Frank Rowand @ 2001-02-08 1:39 UTC (permalink / raw)
To: paulus; +Cc: Dan Malek, Gabriel Paubert, tom_gall, linuxppc-commit,
linuxppc-dev
Paul Mackerras wrote:
> I think that it is seductive but misleading when the PPC docs say that
> the PPC supports a "large virtual address space". Yes it does, in a
> sense, but it's not one that you can access directly on a 32-bit PPC;
> you have to go changing segment registers to get at all of it - i.e.,
> there's no such thing as a pointer into this space that the CPU
> understands natively. You could have such a pointer as a 64-bit
> value, but to use it you have to split it apart and put one part in a
> segment register and make a pointer that the CPU understands out of
> the other part. And the other thing is that there just aren't enough
> segment registers on 32-bit PPCs to be really useful (except as a way
> of coarsely partitioning the address space).
Then there are the PPC processors that don't even have segment registers....
> Paul.
-Frank
--
Frank Rowand <frank_rowand@mvista.com>
MontaVista Software, Inc
* Re: context overflow
2001-02-08 0:48 ` Paul Mackerras
2001-02-08 1:39 ` Frank Rowand
@ 2001-02-08 19:00 ` David Edelsohn
2001-02-08 20:53 ` Roman Zippel
` (2 more replies)
1 sibling, 3 replies; 48+ messages in thread
From: David Edelsohn @ 2001-02-08 19:00 UTC (permalink / raw)
To: paulus; +Cc: Dan Malek, Gabriel Paubert, tom_gall, linuxppc-commit,
linuxppc-dev
Paul,
The POWER and PowerPC architectures specifically were designed
with the larger "virtual" address space in mind. Yes, a single context
cannot see more than a 32-bit address space at a time, but an operating
system can utilize that for more efficient access to a larger address
space. Maybe the current HIGHMEM implementation is an anomaly, but it is
important to be open to these alternate perspectives.
Also, as Dan said, the current PowerPC port is avoiding many of
the PowerPC architecture's design features for VMM. While "mostly
incorrect use of the MMU" may be a little strong, I would agree with Dan
that the port is not using the PowerPC architecture as intended. By not
utilizing the hardware assists, the port is not performing at its optimal
level.
For instance, the rotating VSIDs are blowing away internally
cached information about mappings and forcing the processor to recreate
translations more often than necessary. That causes a performance
degradation. Pre-heating the TLB can be good under certain circumstances.
As I have mentioned before, the current design appears to be
generating many hash table misses because it allocates a new VSID rather
than unmapping multiple pages from the page table. This also means that
it cannot be exploiting the dirty bit in the page/hash table entry and
presumably encounters double misses on write faults.
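[Editor's note: a toy model of the double miss David describes; the struct, flags, and fault counting are invented for illustration. If a hash-table entry is first installed read-only because the hardware dirty bit is not exploited, the first write to a freshly-mapped page faults twice, once for the missing translation and once for write permission.]

```c
struct hpte { int valid, writable; };

static int faults;

static void touch(struct hpte *h, int is_write)
{
    if (!h->valid) {
        faults++;               /* hash/TLB miss: install read-only */
        h->valid = 1;
        h->writable = 0;
    }
    if (is_write && !h->writable) {
        faults++;               /* protection fault: mark dirty+writable */
        h->writable = 1;
    }
}
```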
One really needs to consider the design model for the PowerPC
architecture and some of the microarchitecture optimizations utilizing the
greater chip area in newer PowerPC processor implementations to know how
to structure the PowerPC Linux VMM for best performance. One needs to
consider these issues when arguing for a design to defer work (like TLB
entries) as well as considering the details of *how* the deferral is
implemented (VSID shuffling) relative to the perceived benefit.
We really need to get more IBM PowerPC and kernel VMM experts
involved to explore the best design given the Linux kernel's abstractions
and limitations in the VMM space.
David
* Re: context overflow
2001-02-08 19:00 ` David Edelsohn
@ 2001-02-08 20:53 ` Roman Zippel
2001-02-08 21:14 ` David Edelsohn
2001-02-08 21:28 ` Cort Dougan
2001-02-09 10:49 ` Paul Mackerras
2 siblings, 1 reply; 48+ messages in thread
From: Roman Zippel @ 2001-02-08 20:53 UTC (permalink / raw)
To: David Edelsohn
Cc: paulus, Dan Malek, Gabriel Paubert, tom_gall, linuxppc-commit,
linuxppc-dev
Hi,
On Thu, 8 Feb 2001, David Edelsohn wrote:
> We really need to get more IBM PowerPC and kernel VMM experts
> involved to explore the best design given the Linux kernel's abstractions
> and limitations in the VMM space.
There is certainly room for improvement, but by now I'd really like to
know why the Linux VMM should be/is the limiting part.
bye, Roman
* Re: context overflow
2001-02-08 20:53 ` Roman Zippel
@ 2001-02-08 21:14 ` David Edelsohn
2001-02-08 23:23 ` Roman Zippel
0 siblings, 1 reply; 48+ messages in thread
From: David Edelsohn @ 2001-02-08 21:14 UTC (permalink / raw)
To: Roman Zippel
Cc: paulus, Dan Malek, Gabriel Paubert, tom_gall, linuxppc-commit,
linuxppc-dev
>>>>> Roman Zippel writes:
> On Thu, 8 Feb 2001, David Edelsohn wrote:
>> We really need to get more IBM PowerPC and kernel VMM experts
>> involved to explore the best design given the Linux kernel's abstractions
>> and limitations in the VMM space.
Roman> There is certainly room for improvement, but by now I'd really like to
Roman> know why the Linux VMM should be/is the limiting part.
I did not say that the Linux VMM is the limiting part; I said that the
design needs to work within the Linux VMM limitations. The Linux VMM
still uses the x86 page table design as a basis for its abstraction
layer. Not all processors match that configuration.
David
* Re: context overflow
2001-02-08 19:00 ` David Edelsohn
2001-02-08 20:53 ` Roman Zippel
@ 2001-02-08 21:28 ` Cort Dougan
2001-02-08 22:08 ` David Edelsohn
2001-02-09 10:49 ` Paul Mackerras
2 siblings, 1 reply; 48+ messages in thread
From: Cort Dougan @ 2001-02-08 21:28 UTC (permalink / raw)
To: David Edelsohn
Cc: paulus, Dan Malek, Gabriel Paubert, tom_gall, linuxppc-commit,
linuxppc-dev
} that the port is not using the PowerPC architecture as intended. By not
} utilizing the hardware assists, the port is not performing at its optimal
} level.
I have data, and have written a paper with Victor and Paul, showing that we
get performance _increases_ by not using the PowerPC MMU architecture as
intended. I think the PPC architecture's intentions for the hash table and
TLB are very, very poor and restrictive. The 603 was a good step forward,
but the 750, 7400 and follow-ons have been steps backwards from this good
start.
* Re: context overflow
2001-02-08 21:28 ` Cort Dougan
@ 2001-02-08 22:08 ` David Edelsohn
2001-02-08 22:26 ` Cort Dougan
2001-02-08 23:28 ` Gabriel Paubert
0 siblings, 2 replies; 48+ messages in thread
From: David Edelsohn @ 2001-02-08 22:08 UTC (permalink / raw)
To: Cort Dougan
Cc: paulus, Dan Malek, Gabriel Paubert, tom_gall, linuxppc-commit,
linuxppc-dev
>>>>> Cort Dougan writes:
} that the port is not using the PowerPC architecture as intended. By not
} utilizing the hardware assists, the port is not performing at its optimal
} level.
Cort> I have data, and have written a paper with Victor and Paul, showing that we
Cort> get performance _increases_ by not using the PowerPC MMU architecture as
Cort> intended. I think the PPC architecture's intentions for the hash table and
Cort> TLB are very, very poor and restrictive. The 603 was a good step forward,
Cort> but the 750, 7400 and follow-ons have been steps backwards from this good
Cort> start.
Your paper was a very novel and good solution to a performance
problem that you detected in the VMM design. I already mentioned one of
the problems with the VMM design causing double misses on write faults.
Let me reference the reasoning that Orran Krieger, Marc Auslander, and I
wrote to you about in March 1999 after Orran attended your talk at OSDI:
"Your paper discusses an approach to handling hash table misses
quickly, but that begs the question of why your design has so many hash
table misses that it is important to handle them quickly. In the Research
OS that I am working on (targeting the PowerPC architecture, among others),
we assume that hash table misses are so infrequent that we handle them as
in-core page faults. With a hash table 4 times the size of physical
memory, and a good spread of entries across them, this seems reasonable. I
got the impression that misses in your system are more frequent because
you allocate new VSIDs rather than unmap multiple pages from the page
table. If so, I guess that you can't be exploiting the dirty bit in the
page/hash table entry, and hence get double misses on write faults.
"We also disagree with one of your main conclusions: that
processors should not handle TLB misses in HW. I think that software
handling of TLB misses is an idea whose time has come ... and gone :-)
Hardware made sense in the past when you wanted to look at a whole pile of
entries at the same time with specialized HW. Then, for a while it was
more efficient to do things in SW and avoid the HW complexity. Now, with
speculative execution and super-scalar, highly pipelined processors,
handling them in SW means that you suffer a huge performance penalty
because you introduce a barrier/bubble on every TLB miss. With HW you can
freeze the pipeline and handle the miss with much reduced cost."
You and Paul and Victor did some excellent work, but you need to
keep in mind what implicit assumptions about processor design determined
whether the VMM design was an overall win. We can have a discussion about
whether the hardware improvements which make the VMM design less
advantageous are themselves the right strategy, but many commercial
processors are following that path after careful study of all options.
Your VMM design was correct for a specific, narrow class of
processors. We do not agree with your premise that the criterion for a
good processor design is whether it can utilize your VMM design.
As I said before, one needs to consider the microarchitecture
design and implementation of new processors before one can make sweeping
statements about which VMM design is best. You can create a wonderful
engineering solution, but are you solving the problem or simply masking a
symptom?
Cheers, David
===============================================================================
David Edelsohn T.J. Watson Research Center
dje@watson.ibm.com P.O. Box 218
+1 914 945 4364 (TL 862) Yorktown Heights, NY 10598
URL: http://www.research.ibm.com/people/d/dje/
* Re: context overflow
2001-02-08 22:08 ` David Edelsohn
@ 2001-02-08 22:26 ` Cort Dougan
2001-02-08 23:17 ` David Edelsohn
2001-02-08 23:28 ` Gabriel Paubert
1 sibling, 1 reply; 48+ messages in thread
From: Cort Dougan @ 2001-02-08 22:26 UTC (permalink / raw)
To: David Edelsohn
Cc: paulus, Dan Malek, Gabriel Paubert, tom_gall, linuxppc-commit,
linuxppc-dev
} statements about which VMM design is best. You can create a wonderful
} engineering solution, but are you solving the problem or simply masking a
} symptom?
Would you characterize AIX as an OS that takes advantage of the VMM design
of the PowerPC? From the benchmarks I did it appeared to suffer the same
problem that Linux/PPC did at that time, namely a straightforward
implementation of the software side of the VM design of the
PowerPC. After some general improvements in the MM system of Linux/PPC we
sped things up by a large amount by doing what you claim is a mistake. We
widened the gap between Linux/PPC and what was probably the best example of
what VM design can come from knowledge of the PPC architecture (in AIX).
AIX can't claim to be doing it better. I was unable to look under the hood
of AIX, of course, but my benchmarks did show that the _ONLY_ thing that
mattered - wall clock time of user apps - was improved. Doing things the
non-PPC way is not a flaw if it results in better performance.
I do stand by our design, and the choices we made, but I'm open to
suggestions for better ways to do it. I can certainly believe there's a lot
of room for improvement. This is linux, though - actual working code speaks
more loudly and clearly than anything else.
Do you have an example of a better way of doing a VM system in Linux/PPC?
We can't change the page table layout. That's something we're stuck with,
in one form or another, in Linux no matter what (not our decision). What
does Kitchewan do for a VM subsystem? Can you give me an overview of the
design?
* Re: context overflow
2001-02-08 22:26 ` Cort Dougan
@ 2001-02-08 23:17 ` David Edelsohn
2001-02-08 23:27 ` Cort Dougan
0 siblings, 1 reply; 48+ messages in thread
From: David Edelsohn @ 2001-02-08 23:17 UTC (permalink / raw)
To: Cort Dougan
Cc: paulus, Dan Malek, Gabriel Paubert, tom_gall, linuxppc-commit,
linuxppc-dev
I do not know which version of AIX you were testing and exactly
how your benchmarks were stressing the system. AIX continually is
improving, just as is the Linux kernel. As I said, your design was
very good for the particular class of PowerPC implementation that you were
targeting. AIX normally is tuned for a different class of processors and
system configurations than Linux has targeted, so I would not expect them
to have the same sweet spot when running benchmarks.
We currently are in the process of writing some papers about the
system which should explain some of the design decisions. We very much
want to help improve the PowerPC Linux kernel VMM design and interact with
you and other developers. After the current papers are done, we will have
more time to engage in discussions about this.
David
* Re: context overflow
2001-02-08 21:14 ` David Edelsohn
@ 2001-02-08 23:23 ` Roman Zippel
2001-02-08 23:48 ` Cort Dougan
0 siblings, 1 reply; 48+ messages in thread
From: Roman Zippel @ 2001-02-08 23:23 UTC (permalink / raw)
To: David Edelsohn
Cc: paulus, Dan Malek, Gabriel Paubert, tom_gall, linuxppc-commit,
linuxppc-dev
Hi,
On Thu, 8 Feb 2001, David Edelsohn wrote:
> I did not say that the Linux VMM is the limiting part; I said that the
> design needs to work within the Linux VMM limitations. The Linux VMM
> still uses the x86 page table design as a basis for its abstraction
> layer. Not all processors match that configuration.
The page table abstraction is only used for user space vm management. The
important point is: it's only an abstraction as far as it concerns the
general vm code. What happens underneath is a completely different story and
the general code should provide enough hooks to allow the implementation
to do whatever it wishes. If something is missing it can certainly be
added, but always remember not to mix abstraction with implementation.
Anyway, an idea to improve current context handling: define another bit
_PAGE_HASHED for the pte. If it's set, the pte points to the hash table
entry; otherwise it's a normal Linux pte entry. That makes pte handling a bit
more complicated, but if we can dump the current tlb/context handling, it
should be really worth it.
bye, Roman
* Re: context overflow
2001-02-08 23:17 ` David Edelsohn
@ 2001-02-08 23:27 ` Cort Dougan
0 siblings, 0 replies; 48+ messages in thread
From: Cort Dougan @ 2001-02-08 23:27 UTC (permalink / raw)
To: David Edelsohn
Cc: paulus, Dan Malek, Gabriel Paubert, tom_gall, linuxppc-commit,
linuxppc-dev
We anxiously await your assistance...
* Re: context overflow
2001-02-08 22:08 ` David Edelsohn
2001-02-08 22:26 ` Cort Dougan
@ 2001-02-08 23:28 ` Gabriel Paubert
2001-02-09 9:58 ` Paul Mackerras
1 sibling, 1 reply; 48+ messages in thread
From: Gabriel Paubert @ 2001-02-08 23:28 UTC (permalink / raw)
To: David Edelsohn
Cc: Cort Dougan, paulus, Dan Malek, tom_gall, linuxppc-commit,
linuxppc-dev
On Thu, 8 Feb 2001, David Edelsohn wrote:
> "Your paper discusses an approach to handling hash table misses
> quickly, but that begs the question of why your design has so many hash
> table misses that it is important to handle them quickly. In the Research
> OS that I am working on (targeting the PowerPC architecture, among others),
> we assume that hash table misses are so infrequent that we handle them as
> in-core page faults. With a hash table 4 times the size of physical
> memory, and a good spread of entries across them, this seems reasonable. I
> got the impression that misses in your system are more frequent because
> you allocate new VSIDs rather than unmap multiple pages from the page
> table. If so, I guess that you can't be exploiting the dirty bit in the
> page/hash table entry, and hence get double misses on write faults.
>
> "We also disagree with one of your main conclusions: that
> processors should not handle TLB misses in HW. I think that software
> handling of TLB misses is an idea whose time has come ... and gone :-)
> Hardware made sense in the past when you wanted to look at a whole pile of
> entries at the same time with specialized HW. Then, for a while it was
> more efficient to do things in SW and avoid the HW complexity. Now, with
> speculative execution and super-scalar, highly pipelined processors,
> handling them in SW means that you suffer a huge performance penalty
> because you introduce a barrier/bubble on every TLB miss. With HW you can
> freeze the pipeline and handle the miss with much reduced cost."
>
> You and Paul and Victor did some excellent work, but you need to
> keep in mind what implicit assumptions about processor design determined
> whether the VMM design was an overall win. We can have a discussion about
> whether the hardware improvements which make the VMM design less
> advantageous are themselves the right strategy, but many commercial
> processors are following that path after careful study of all options.
>
> Your VMM design was correct for a specific, narrow class of
> processors. We do not agree with your premise that the criterion for a
> good processor design is whether it can utilize your VMM design.
>
> As I said before, one needs to consider the microarchitecture
> design and implementation of new processors before one can make sweeping
> statements about which VMM design is best. You can create a wonderful
> engineering solution, but are you solving the problem or simply masking a
> symptom?
I agree with you, but that's only a gut feeling. Did you also notice that
Linux/PPC only uses half the recommended hash table size, unless I'm
grossly mistaken?
My feeling is that the hash table should be rather over- than under-sized,
especially with the amount of sharing there is between all the
applications running on "modern" desktops (large shared libraries for
X/KDE/GNOME among other things).
Regards,
Gabriel.
* Re: context overflow
2001-02-08 23:23 ` Roman Zippel
@ 2001-02-08 23:48 ` Cort Dougan
0 siblings, 0 replies; 48+ messages in thread
From: Cort Dougan @ 2001-02-08 23:48 UTC (permalink / raw)
To: Roman Zippel
Cc: David Edelsohn, paulus, Dan Malek, Gabriel Paubert, tom_gall,
linuxppc-commit, linuxppc-dev
} The page table abstraction is only used for user space vm management. The
} important point is: it's only an abstraction as far as it concerns the
} general vm code. What happens underneath is a completely different story and
} the general code should provide enough hooks to allow the implementation
} to do whatever it wishes. If something is missing it can certainly be
} added, but always remember not to mix abstraction with implementation.
}
} Anyway, an idea to improve current context handling: define another bit
} _PAGE_HASHED for the pte. If it's set, the pte points to the hash table
} entry; otherwise it's a normal Linux pte entry. That makes pte handling a bit
} more complicated, but if we can dump the current tlb/context handling, it
} should be really worth it.
I disagree. I think the overhead would slow it down quite a bit. That
being said, implement it and prove me wrong. I'll merge it in.
* Re: context overflow
2001-02-08 23:28 ` Gabriel Paubert
@ 2001-02-09 9:58 ` Paul Mackerras
2001-02-09 10:57 ` Gabriel Paubert
0 siblings, 1 reply; 48+ messages in thread
From: Paul Mackerras @ 2001-02-09 9:58 UTC (permalink / raw)
To: Gabriel Paubert
Cc: David Edelsohn, Cort Dougan, Dan Malek, tom_gall, linuxppc-commit,
linuxppc-dev
Gabriel Paubert writes:
> I agree with you, but that's only a gut feeling. Did you also notice that
> Linux/PPC only uses half the recommended hash table size, unless I'm
> grossly mistaken?
>
> My feeling is that the hash table should be rather over- than under-sized,
> especially with the amount of sharing there is between all the
> applications running on "modern" desktops (large shared libraries for
> X/KDE/GNOME among other things).
We use a smaller than recommended hash table size for a couple of reasons:
- The hash table occupancy rates measured by Cort were very small,
typically less than 10% IIRC.
- A bigger hash table takes longer to clear when you have to do a
flush_tlb_all(). Fortunately there are only a couple of places
where flush_tlb_all is called, and they can both easily be changed
to flush_tlb_range. However, it is still necessary to clear the
hash table on a MMU context overflow.
- The recommended sizes are based on the idea that you have to try
quite hard to keep all the active PTEs in the hash table. We don't,
we can quickly fault PTEs into the hash table on demand so it is
less important for us to try to keep all the active PTEs in the hash
table.
Paul.
--
Paul Mackerras, Open Source Research Fellow, Linuxcare, Inc.
+61 2 6262 8990 tel, +61 2 6262 8991 fax
paulus@linuxcare.com.au, http://www.linuxcare.com.au/
Linuxcare. Support for the revolution.
* Re: context overflow
2001-02-08 19:00 ` David Edelsohn
2001-02-08 20:53 ` Roman Zippel
2001-02-08 21:28 ` Cort Dougan
@ 2001-02-09 10:49 ` Paul Mackerras
2 siblings, 0 replies; 48+ messages in thread
From: Paul Mackerras @ 2001-02-09 10:49 UTC (permalink / raw)
To: David Edelsohn
Cc: Dan Malek, Gabriel Paubert, tom_gall, linuxppc-commit,
linuxppc-dev
David,
> The POWER and PowerPC architectures specifically were designed
> with the larger "virtual" address space in mind. Yes, a single context
> cannot see more than 32-bit address space at a time, but an operating
> system can utilize that for more efficient access to a larger address
> space.
I'm pretty partisan towards the PowerPC architecture and my preference
would always be to say that the PowerPC way is the best way. But I
don't feel that I can say that the 32-bit PowerPC architecture
achieves this goal effectively.
The 64-bit PPC architecture is another story; there the "logical"
address space is big enough that you can have pointers for all your
data objects. And the PPC MMU supports a full 64-bit logical address
with hardware TLB reloads, unlike alpha etc. which only give you a
virtual address space of 44 bits or so. So in the rest of this I am
talking about the 32-bit PPC architecture only.
Anyway, the only way you have to access different parts of this large
"virtual" address space is to change segment registers. And there are
only 16 of them - fewer in practice because you need some fixed ones
for kernel code and data, I/O, etc. Which means that they are a
scarce resource which needs to be managed; you then need routines to
allocate and free segment registers, you probably need to refcount
them, and you have a problem tracking the lifetime of pointers that
you construct, you need to check for crossings over a segment
boundary, etc.
Maybe I'm unduly pessimistic - maybe there is a way for an operating
system to "utilize that for more efficient access to a larger address
space" as you say. But I don't see it.
An interesting experiment for someone to try would be to somehow use a
set of segment registers (maybe the 4 from 0x80000000 to 0xb0000000)
to implement the HIGHMEM stuff. It may be that that is a simple
enough situation that the software overhead is manageable. One of the
questions to answer will be whether it is OK to limit each task to
having at most 3 highmem pages mapped in at any one time (I am
thinking we would need to reserve 1 segment register for kmap_atomic).
And then of course we would need to measure the performance to see how
much difference it makes.
> For instance, the rotating VSIDs are blowing away internally
> cached information about mappings and forcing the processor to recreate
> translations more often than necessary. That causes a performance
> degradation. Pre-heating the TLB can be good under certain circumstances.
How is "blowing away internally cached information" worse than doing
tlbie's? We only rotate VSIDs when we have to flush mappings from the
MMU/hashtable. And searching for and invalidating HPTEs takes
significant time itself.
For a flush_tlb_mm, where we have to invalidate all the mappings for
an entire address space, there is no question; changing the VSIDs is
faster than searching through the hash table, invalidating all the
relevant HPTEs, and doing tlbia (or the equivalent). For a
flush_tlb_range, it depends on the size of the range; we can argue
about the threshold we use but I don't think there could be any
argument that for a very large range it is faster to change VSIDs.
> As I have mentioned before, the current design appears to be
> generating many hash table misses because it allocates a new VSID rather
> than unmapping multiple pages from the page table. This also means that
> it cannot be exploiting the dirty bit in the page/hash table entry and
> presumably encounters double misses on write faults.
On a write access after a read access to a clean page, yes. There is
only one fault taken if the first access is a write, or if the page is
already marked dirty when the first read access happens.
> One really needs to consider the design model for the PowerPC
> architecture and some of the microarchitecture optimizations utilizing the
> greater chip area in newer PowerPC processor implementations to know how
> to structure the PowerPC Linux VMM for best performance. One needs to
> consider these issues when arguing for a design to defer work (like TLB
> entries) as well as considering the details of *how* the deferral is
> implemented (VSID shuffling) relative to the perceived benefit.
Well you clearly know more than me in this area, and we would
appreciate hearing whatever you are allowed to tell us :). It sounds
like recent PPCs are being optimized for the way that AIX or similar
OS's use the MMU. (Anyway, aren't all IBM's recent PPCs 64-bit?)
But in the end it's only the benchmarks that can tell us which
approach is the fastest. And I suspect that sometimes the hardware
engineers don't take full account of the software overhead involved in
using the hardware features they provide. :)
I guess my response here boils down to two questions:
- how can an OS effectively make use of the segment registers to
access different parts of the "virtual" address space when there are
so few of them?
- how can it be faster to do a lengthy HPTE search-and-destroy
operation plus a lot of tlbie's, instead of just changing the
segment registers?
Paul.
--
Paul Mackerras, Open Source Research Fellow, Linuxcare, Inc.
+61 2 6262 8990 tel, +61 2 6262 8991 fax
paulus@linuxcare.com.au, http://www.linuxcare.com.au/
Linuxcare. Support for the revolution.
* Re: context overflow
2001-02-09 9:58 ` Paul Mackerras
@ 2001-02-09 10:57 ` Gabriel Paubert
2001-02-09 11:26 ` Paul Mackerras
0 siblings, 1 reply; 48+ messages in thread
From: Gabriel Paubert @ 2001-02-09 10:57 UTC (permalink / raw)
To: Paul Mackerras
Cc: David Edelsohn, Cort Dougan, Dan Malek, tom_gall, linuxppc-commit,
linuxppc-dev
On Fri, 9 Feb 2001, Paul Mackerras wrote:
> Gabriel Paubert writes:
>
> > I agree with you, but that's only a gut feeling. Did you also notice that
> > Linux/PPC only uses half the recommended hash table size, unless I'm
> > grossly mistaken?
> >
> > My feeling is that the hash table should be rather over- than under-sized,
> > especially with the amount of sharing there is between all the
> > applications running on "modern" desktops (large shared libraries for
> > X/KDE/GNOME among other things).
>
> We use a smaller than recommended hash table size for a couple of reasons:
>
> - The hash table occupancy rates measured by Cort were very small,
> typically less than 10% IIRC.
Then it means that the mm system is completely screwed up, even more than
I thought. I have to study first how VSIDs are handled, but this smells
definitely wrong.
>
> - A bigger hash table takes longer to clear when you have to do a
> flush_tlb_all(). Fortunately there are only a couple of places
> where flush_tlb_all is called, and they can both easily be changed
> to flush_tlb_range. However, it is still necessary to clear the
> hash table on a MMU context overflow.
Yes, I was always worried by the added latency when a hash table clear
comes in. But the question is why do we have to do it? Actually the
question is whether flush_tlb_all is even necessary.
> - The recommended sizes are based on the idea that you have to try
> quite hard to keep all the active PTEs in the hash table. We don't,
> we can quickly fault PTEs into the hash table on demand so it is
> less important for us to try to keep all the active PTEs in the hash
> table.
I believe that this is a big mistake: faulting a PTE into the hash table is
an exception and will never be as fast as having the entry already in the
PTE. And on SMP, you acquire hash_table_lock (an unnecessary variable, BTW,
but let us leave that for a later discussion), which will be very contended
and ping-pong like mad between processors.
In short, in all the previous discussion you had with Dan, I stand with
him and against you on each and every single aspect, except for the TLB
preloading thing.
Regards,
Gabriel.
* Re: context overflow
2001-02-09 10:57 ` Gabriel Paubert
@ 2001-02-09 11:26 ` Paul Mackerras
0 siblings, 0 replies; 48+ messages in thread
From: Paul Mackerras @ 2001-02-09 11:26 UTC (permalink / raw)
To: Gabriel Paubert
Cc: David Edelsohn, Cort Dougan, Dan Malek, tom_gall, linuxppc-commit,
linuxppc-dev
Gabriel Paubert writes:
> > - The hash table occupancy rates measured by Cort were very small,
> > typically less than 10% IIRC.
>
> Then it means that the mm system is completely screwed up, even more than
> I thought. I have to study first how VSIDs are handled, but this smells
> definitely wrong.
Gabriel: one word: Measure. Then criticize, if you like.
Cort's measurements were done with lmbench and kernel compiles IIRC.
Don't forget that not all pages of physical memory are used via HPTEs;
many pages are used for page-cache pages of files which are read and
written rather than being mmap'd. For example, in the case of a
kernel compile you would hopefully have all of the relevant kernel
source and object files in the page cache but never mmap'd, and those
pages would typically be accessed through the kernel BAT mapping.
And the other point is that the recommended hash table sizes are large
enough to map the whole of physical memory 4 times over.
> Yes, I was always worried by the added latency when a hash table clear
> comes in. But the question is why do we have to do it ? Actually the
> question is whether flush_tlb_all is even necessary.
As I have already said, flush_tlb_all can be avoided completely, but
the flush on MMU context overflow is necessary (but fortunately very
rare).
> I believe that this is a big mistake, faulting a PTE in the hash table is
> an exception and will never be as fast as having the entry already in the
> PTE. And on SMP, you acquire hash_table_lock, an unnecessary variable BTW
> but let us leave it for a later discussion, which will be very contended
> and ping pong like mad between processors.
Measurements?
(David, perhaps you could comment on the need for hash_table_lock and
the possible hardware deadlocks if you do tlbie/tlbsync on different
CPUs at the same time?)
Like everything, the hash table size is a tradeoff. A big hashtable
has disadvantages, one of which is that less of it will fit in L2
cache and thus TLB misses will take longer on average.
> In short in all the previous discussion you had with Dan, I stand with
> him and against you in all and every single aspect, except for the TLB
> preloading thing.
I look forward to seeing your benchmark results for the different
alternatives.
Paul.
--
Paul Mackerras, Open Source Research Fellow, Linuxcare, Inc.
+61 2 6262 8990 tel, +61 2 6262 8991 fax
paulus@linuxcare.com.au, http://www.linuxcare.com.au/
Linuxcare. Support for the revolution.
end of thread, other threads:[~2001-02-09 11:26 UTC | newest]
Thread overview: 48+ messages
2001-01-20 2:27 context overflow Dan Malek
2001-01-22 4:28 ` Troy Benjegerdes
2001-01-22 4:39 ` Tom Gall
2001-01-22 18:10 ` Dan Malek
2001-01-22 18:55 ` tom_gall
2001-01-22 19:59 ` Dan Malek
2001-01-22 22:08 ` tom_gall
2001-01-23 0:10 ` Dan Malek
2001-01-23 10:00 ` Gabriel Paubert
2001-01-23 18:21 ` Dan Malek
2001-02-06 10:55 ` Paul Mackerras
2001-02-06 21:11 ` Dan Malek
2001-02-06 21:50 ` Paul Mackerras
2001-02-06 22:29 ` Dan Malek
2001-02-06 22:45 ` Paul Mackerras
2001-02-06 10:50 ` Paul Mackerras
2001-02-06 21:32 ` Dan Malek
2001-02-06 22:08 ` Paul Mackerras
2001-02-06 23:14 ` Dan Malek
2001-02-07 0:23 ` Paul Mackerras
2001-02-07 18:02 ` Dan Malek
2001-02-08 0:48 ` Paul Mackerras
2001-02-08 1:39 ` Frank Rowand
2001-02-08 19:00 ` David Edelsohn
2001-02-08 20:53 ` Roman Zippel
2001-02-08 21:14 ` David Edelsohn
2001-02-08 23:23 ` Roman Zippel
2001-02-08 23:48 ` Cort Dougan
2001-02-08 21:28 ` Cort Dougan
2001-02-08 22:08 ` David Edelsohn
2001-02-08 22:26 ` Cort Dougan
2001-02-08 23:17 ` David Edelsohn
2001-02-08 23:27 ` Cort Dougan
2001-02-08 23:28 ` Gabriel Paubert
2001-02-09 9:58 ` Paul Mackerras
2001-02-09 10:57 ` Gabriel Paubert
2001-02-09 11:26 ` Paul Mackerras
2001-02-09 10:49 ` Paul Mackerras
2001-02-07 9:18 ` Roman Zippel
2001-02-07 17:46 ` Dan Malek
2001-02-07 18:39 ` Roman Zippel
2001-02-07 21:16 ` Gabriel Paubert
2001-02-08 0:34 ` Paul Mackerras
2001-01-22 4:55 ` Larry McVoy
2001-01-22 6:15 ` Troy Benjegerdes
2001-01-23 1:12 ` Frank Rowand
2001-01-23 1:20 ` Dan Malek
2001-01-23 2:12 ` Frank Rowand