* fsl booke MM vs. SMP questions
From: Benjamin Herrenschmidt @ 2007-05-21  7:06 UTC
To: ppc-dev; Cc: Kumar Gala, Paul Mackerras

Hi Folks !

I see that the fsl booke code has some #ifdef CONFIG_SMP bits here and
there, so I suppose there are SMP implementations of these, right ?

I'm having serious trouble figuring out how the TLB management is made
SMP safe, however.

There are at least two main issues I've spotted at this point (there's
at least one more if there is HW threading, that is, if the TLB is
shared between logical processors, but I'll ignore that for now since
I don't think such a thing exists ... yet).

 - How do you guys shield PTE flushing vs. TLB misses on another CPU ?
That is, how do you prevent (if you do) the following scenario:

    cpu 0                                 cpu 1
    tlb miss                              pte_clear (or similar)
    load PTE value
                                          write 0 to PTE (or replace)
                                          tlbivax (tlbie)
    tlbwe

That scenario, as you can see, will leave you with stale entries in
the TLB, which will ultimately lead to all sorts of unpleasant/random
behaviours.

If the answer is "oops ... we don't", then let's try to find ways out
of that, since I may have a similar issue in a not too distant
future :-) And I'm trying to find a -fast- way to deal with it without
bloating the fast path. My main problem is that I want to avoid taking
a spinlock or an equivalent atomic operation in the fast TLB reload
path (which would solve the problem), since lwarx/stwcx. are generally
really slow (hundreds of cycles on some processors).

 - I see that your TLB miss handler uses a non-atomic store to write
the _PAGE_ACCESSED bit back to the PTE. Don't you have a similar race
where something does:

    cpu 0                                 cpu 1
    tlb miss                              pte_clear (or similar)
    load PTE value
                                          write 0 to PTE (or replace)
    write back PTE with _PAGE_ACCESSED
    tlbwe

This is an extension of the previous race, but it's a different
problem, so I've listed it separately. In this case, the problem is
worse: not only do you have a stale TLB entry, you have -also-
corrupted the linux PTE by writing the old value back into it.

At this point, I'm afraid you may have no choice but to go atomic,
which means paying the cost of lwarx/stwcx. on TLB misses, though if
you have a solution for the first problem, you can avoid the atomic
operation in the second case when _PAGE_ACCESSED is already set.

If not, you might have to use a _PAGE_BUSY bit as a per-PTE lock,
similar to what the 64-bit code uses, or use mmu_hash_lock... unless
you come up with a great idea or some HW black magic that makes the
problem go away...

In any case, I'm curious how you have solved, or intend to solve,
this, since as I said above I might be in a similar situation soon,
and I'm trying to keep the TLB miss handler as fast as humanly
possible.

Cheers,
Ben.
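[Editorial note: the second race can be made concrete with a small
standalone C sketch. This is an illustration, not code from the
thread; the flag values are made up, and the two-CPU interleaving is
serialized here so the lost update is deterministic.]

    #include <assert.h>

    #define _PAGE_PRESENT  0x001    /* hypothetical flag values */
    #define _PAGE_ACCESSED 0x100

    int main(void)
    {
        unsigned long pte = _PAGE_PRESENT;   /* a valid linux PTE */

        /* cpu 0: tlb miss, loads the PTE value */
        unsigned long miss_copy = pte;

        /* cpu 1: pte_clear() (and tlbivax) run in between */
        pte = 0;

        /* cpu 0: non-atomic write-back of _PAGE_ACCESSED */
        pte = miss_copy | _PAGE_ACCESSED;

        /* the clear is lost: the PTE looks valid again */
        assert(pte & _PAGE_PRESENT);
        return 0;
    }

On real hardware the same lost update happens whenever the two paths
overlap in time.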
* Re: fsl booke MM vs. SMP questions
From: Dave Liu @ 2007-05-21 11:37 UTC
To: Benjamin Herrenschmidt; Cc: ppc-dev, Paul Mackerras, Kumar Gala

On Mon, 2007-05-21 at 20:08 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2007-05-21 at 17:57 +0800, Dave Liu wrote:
> >
> > > If not, you might have to use a _PAGE_BUSY bit as a per-PTE
> > > lock, similar to what the 64-bit code uses, or use
> > > mmu_hash_lock... unless you come up with a great idea or some HW
> > > black magic that makes the problem go away...
> >
> > I would like the _PAGE_BUSY bit for a per-PTE lock; it will have a
> > better performance benefit than a global lock. The BookE
> > architecture doesn't use the hardware hash table, so it cannot use
> > mmu_hash_lock, which is a global lock for the hash table.
>
> (BTW. Did you remove the list CC on purpose ? If not, then please
> add it back on your reply and make sure my reply is fully
> visible :-)

Sorry about that, it was a mis-click on my part.

> Still.. having to use a lwarx/stwcx. loop in the TLB refill handler
> is a sad story, don't you think ? I don't know about you guys, but
> on the CPUs I know, those take hundreds of cycles....

It is true, I know that.

> I've come up with an idea (thanks wli for tipping me off) that's
> inspired by RCU instead:
>
> We have a per-cpu flag called tlbbusy
>
> The tlb miss handler does:
>
>  - tlbbusy = 1
>  - barrier (make sure the following read is ordered vs. the previous
>    store to tlbbusy)
>  - read linux PTE value
>  - write it to the HW TLB

and write the linux PTE with the referenced bit?

>  - appropriate sync
>  - tlbbusy = 0
>
> Now, the tlb invalidation code (which can use a batch to be even
> more efficient; see how the 64-bit or x86 code uses batching for TLB
> invalidations) can then use the fact that the mm carries a cpu
> bitmask of all CPUs that ever touched that mm, and thus can do the
> following after a PTE has changed and before broadcasting an
> invalidation:

How do you interlock this PTE change with the PTE change done by the
tlb miss?

>  - make a local copy "mask" of mm->cpu_vm_mask
>  - clear the bit for the current cpu from the mask
>  - while there is still a bit in the mask:
>     - for each bit in the mask, check if tlbbusy for that cpu is 0
>       -> if 0, clear the bit in the mask
>  - loop until there are no more bits in the mask
>  - perform the tlbivax

It looks like a good idea, but what are the bad things with batch
invalidation?

> In addition, if you have a "local" version of tlbivax (no
> broadcast), you can do a nice optimisation: if after step 2 (clear
> the bit for the current cpu) the mask is already 0 (meaning the mm
> only ever existed on the local cpu), you can do a local tlbivax and
> return.

BookE has the "local" version of tlbivax via the tlbwe instruction.
Yes, it can actually reduce bus traffic.
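[Editorial sketch: the tlbbusy scheme quoted above, in standalone C.
The names, the fixed CPU count, and the barrier primitive (a GCC
builtin) are assumptions for illustration; the real thing would live
in the asm miss handler and the flush path.]

    #define NR_CPUS 4

    /* set while a cpu is inside its TLB miss handler */
    static volatile int tlbbusy[NR_CPUS];

    /* miss handler side, running on cpu 'me' */
    void tlb_miss_refill(int me, volatile unsigned long *ptep)
    {
        tlbbusy[me] = 1;
        __sync_synchronize();        /* flag store before PTE read */
        unsigned long pte = *ptep;   /* read linux PTE value */
        /* ... tlbwe: write pte into the HW TLB here ... */
        (void)pte;
        __sync_synchronize();        /* TLB write before flag clear */
        tlbbusy[me] = 0;
    }

    /* invalidation side: after the PTE change is globally visible,
     * wait until each cpu in the mm's mask has been observed outside
     * its miss handler at least once */
    void wait_out_tlb_misses(unsigned long cpu_vm_mask, int me)
    {
        unsigned long mask = cpu_vm_mask & ~(1UL << me);
        while (mask) {
            for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                if ((mask & (1UL << cpu)) && tlbbusy[cpu] == 0)
                    mask &= ~(1UL << cpu);
            }
        }
        /* now safe to perform the tlbivax broadcast */
    }

Any cpu observed with tlbbusy == 0 after the PTE change either has no
miss in flight or will re-read the updated PTE in its next miss; this
is the RCU-like property the scheme relies on.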
* Re: fsl booke MM vs. SMP questions
From: Benjamin Herrenschmidt @ 2007-05-21 22:07 UTC
To: Dave Liu; Cc: ppc-dev, Paul Mackerras, Kumar Gala

> > The tlb miss handler does:
> >
> >  - tlbbusy = 1
> >  - barrier (make sure the following read is ordered vs. the
> >    previous store to tlbbusy)
> >  - read linux PTE value
> >  - write it to the HW TLB
>
> and write the linux PTE with the referenced bit?

I kept the referenced-bit rewrite out of that pseudo-code because I
was addressing a different issue, but yes. The idea I have there is to
break down the linux PTE operation this way:

     1 - rX = read PTE value (normal load)
     2 - if (!_PAGE_PRESENT) -> out
     3 - rY = rX | _PAGE_ACCESSED
     4 - if (rX != rY)
     5 -    rZ = lwarx PTE value
     6 -    if (rZ != rX)
     7 -       stdcx. PTE, rZ (rewrite the just-read value to clear
              the reservation)
     8 -       goto 1 (try again)
     9 -    stdcx. PTE, rY
    10 -    if failed -> goto 1 (try again)
    11 - that's it !

In addition, I suppose performance can be improved by also dealing
with the dirty bit right in the TLB refill when the access is a write
and the page is writeable, rather than taking a double fault.

> >  - appropriate sync
> >  - tlbbusy = 0
> >
> > Now, the tlb invalidation code (which can use a batch to be even
> > more efficient; see how the 64-bit or x86 code uses batching for
> > TLB invalidations) can then use the fact that the mm carries a cpu
> > bitmask of all CPUs that ever touched that mm, and thus can do the
> > following after a PTE has changed and before broadcasting an
> > invalidation:
>
> How do you interlock this PTE change with the PTE change done by the
> tlb miss?

Look at pgtable-ppc32.h. PTE changes done by linux are atomic. If you
use the procedure I outlined above, PTE modifications done by the TLB
miss handler will also be atomic, though you can skip the atomic
operation when it isn't necessary (when _PAGE_ACCESSED is already set,
for example).

Thus, the situation is basically that linux PTE changes need to:

 - update the PTE
 - barrier
 - make sure the change is visible to all other CPUs and that they
   have all been out of a TLB miss handler at least once, which is
   what my proposed algorithm does
 - broadcast the invalidation

> >  - make a local copy "mask" of mm->cpu_vm_mask
> >  - clear the bit for the current cpu from the mask
> >  - while there is still a bit in the mask:
> >     - for each bit in the mask, check if tlbbusy for that cpu is 0
> >       -> if 0, clear the bit in the mask
> >  - loop until there are no more bits in the mask
> >  - perform the tlbivax
>
> It looks like a good idea, but what are the bad things with batch
> invalidation?

Why bad ? Batch invalidations let you do the whole operation of
syncing with the other CPUs only once for a whole lot of
invalidations:

 - clear lots of PTEs
 - sync once
 - send lots of tlbivax

You don't have to implement batch invalidates, but doing so will
improve performance.

> > In addition, if you have a "local" version of tlbivax (no
> > broadcast), you can do a nice optimisation: if after step 2 (clear
> > the bit for the current cpu) the mask is already 0 (meaning the mm
> > only ever existed on the local cpu), you can do a local tlbivax
> > and return.
>
> BookE has the "local" version of tlbivax via the tlbwe instruction.
> Yes, it can actually reduce bus traffic.
And it is probably faster too :-)

The above method also needs to be looked at carefully for the TLB
storage interrupt (that is, a TLB entry present but with the wrong
permissions).

Ben.
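[Editorial sketch: the eleven steps above expressed as a
compare-and-swap loop in standalone C, with
__sync_val_compare_and_swap standing in for the lwarx/stwcx. pair.
The flag values are made up.]

    #define _PAGE_PRESENT  0x001    /* hypothetical flag values */
    #define _PAGE_ACCESSED 0x100

    /* returns the value to load into the TLB, or 0 if not present */
    unsigned long tlb_miss_update_pte(unsigned long *ptep)
    {
        for (;;) {
            unsigned long rx = *ptep;                /* 1: normal load */
            if (!(rx & _PAGE_PRESENT))
                return 0;                            /* 2: out */
            unsigned long ry = rx | _PAGE_ACCESSED;  /* 3 */
            if (rx == ry)                            /* 4: already set, */
                return rx;                           /*    skip atomics */
            /* 5-10: the lwarx/stwcx. retry loop as a CAS */
            if (__sync_val_compare_and_swap(ptep, rx, ry) == rx)
                return ry;                           /* 11: done */
            /* PTE changed under us (steps 6-8): try again */
        }
    }

Note that the CAS abstraction hides the detail debated later in the
thread: on real hardware, the failure path (steps 6-8) must still
execute a stwcx. so that no reservation is left dangling when the
handler returns.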
* Re: fsl booke MM vs. SMP questions
From: Benjamin Herrenschmidt @ 2007-05-22  3:09 UTC
To: Dave Liu; Cc: ppc-dev, Paul Mackerras, Kumar Gala

> > > Now, the tlb invalidation code (which can use a batch to be even
> > > more efficient; see how the 64-bit or x86 code uses batching for
> > > TLB invalidations) can then use the fact that the mm carries a
> > > cpu bitmask of all CPUs that ever touched that mm, and thus can
> > > do the following after a PTE has changed and before broadcasting
> > > an invalidation:
> >
> > How do you interlock this PTE change with the PTE change done by
> > the tlb miss?
>
> Look at pgtable-ppc32.h. PTE changes done by linux are atomic. If
> you use the procedure I outlined above, PTE modifications done by
> the TLB miss handler will also be atomic, though you can skip the
> atomic operation when it isn't necessary (when _PAGE_ACCESSED is
> already set, for example).
>
> Thus, the situation is basically that linux PTE changes need to

Note that overall, my method requires at least these barriers:

 - setting the flag to 1 vs. reading the PTE
 - writing the TLB entry vs. setting the flag to 0

That means two barriers in the TLB refill handler. I'm not 100%
familiar with the barriers you have on fsl BookE, nor with their exact
semantics and performance, so you may need to look closely at the
impact of taking those.

In the end, the best solution might still be to not do any of this and
instead send an IPI on invalidations. That's the method used by most
architectures in linux (if not all) that do software TLB loading on
SMP. Basically, the invalidate code path then does:

 - update the linux PTE
 - write barrier
 - send an IPI to all CPUs in mm->cpu_vm_mask
 - local TLB flush

And the IPI handler does a local TLB flush on all affected CPUs.

Ben.
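[Editorial sketch of the IPI-based invalidation path in kernel-style
C. smp_call_function() with a wait flag was the synchronous-IPI
primitive of that era; the flush helper and argument struct are
hypothetical, and restricting the IPI to mm->cpu_vm_mask would need a
targeted variant.]

    struct flush_args {
        struct mm_struct *mm;
        unsigned long addr;
    };

    /* IPI handler: runs on every CPU the sender targets */
    static void do_flush_tlb_ipi(void *info)
    {
        struct flush_args *args = info;
        local_tlb_flush(args->mm, args->addr);  /* hypothetical helper */
    }

    void flush_tlb_page_smp(struct mm_struct *mm, unsigned long addr)
    {
        struct flush_args args = { .mm = mm, .addr = addr };

        /* the caller has already updated the linux PTE */
        smp_wmb();                              /* write barrier */
        /* wait=1 makes the IPI synchronous: this returns only after
         * every CPU has run the handler, so no flush can be missed
         * even with edge-triggered IPIs */
        smp_call_function(do_flush_tlb_ipi, &args, 0, 1);
        local_tlb_flush(mm, addr);              /* local TLB flush */
    }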
* Re: fsl booke MM vs. SMP questions
From: Dave Liu @ 2007-05-22 10:56 UTC
To: Benjamin Herrenschmidt; Cc: ppc-dev, Paul Mackerras, Kumar Gala

On Tue, 2007-05-22 at 13:09 +1000, Benjamin Herrenschmidt wrote:
> In the end, the best solution might still be to not do any of this
> and instead send an IPI on invalidations. That's the method used by
> most architectures in linux (if not all) that do software TLB
> loading on SMP. Basically, the invalidate code path then does:
>
>  - update the linux PTE
>  - write barrier
>  - send an IPI to all CPUs in mm->cpu_vm_mask
>  - local TLB flush
>
> And the IPI handler does a local TLB flush on all affected CPUs.

How do you avoid missing an IPI if the IPI interrupt is
edge-triggered? Or, put differently, how do you make sure the TLB has
been flushed on all the other affected CPUs?

-d
* Re: fsl booke MM vs. SMP questions
From: Benjamin Herrenschmidt @ 2007-05-22 22:42 UTC
To: Dave Liu; Cc: ppc-dev, Paul Mackerras, Kumar Gala

On Tue, 2007-05-22 at 18:56 +0800, Dave Liu wrote:
> How do you avoid missing an IPI if the IPI interrupt is
> edge-triggered? Or, put differently, how do you make sure the TLB
> has been flushed on all the other affected CPUs?

The IPIs should be buffered by the PIC ... delivered only once, but
still delivered. Also, IPI handling in linux is synchronous: there is
an ack, and the sender waits for the remote function to complete.

Ben.
* Re: fsl booke MM vs. SMP questions
From: Dave Liu @ 2007-05-23  2:38 UTC
To: Benjamin Herrenschmidt; Cc: ppc-dev, Paul Mackerras, Kumar Gala

On Wed, 2007-05-23 at 08:42 +1000, Benjamin Herrenschmidt wrote:
> The IPIs should be buffered by the PIC ... delivered only once, but

But what is the buffer depth for IPIs in the PIC?

> still delivered. Also, IPI handling in linux is synchronous: there
> is an ack, and the sender waits for the remote function to complete.

Yeah, I get it.
* Re: fsl booke MM vs. SMP questions
From: Benjamin Herrenschmidt @ 2007-05-23  3:08 UTC
To: Dave Liu; Cc: ppc-dev, Paul Mackerras, Kumar Gala

On Wed, 2007-05-23 at 10:38 +0800, Dave Liu wrote:
> On Wed, 2007-05-23 at 08:42 +1000, Benjamin Herrenschmidt wrote:
> > The IPIs should be buffered by the PIC ... delivered only once,
> > but
>
> But what is the buffer depth for IPIs in the PIC?

One :-)

You never "lose" IPIs in the sense that you always get at least one
interrupt for N IPIs, and it's up to software to make sure not to lose
any event. The linux kernel arch code usually handles that with a
synchronous IPI mechanism.

Ben.
* RE: fsl booke MM vs. SMP questions
From: Liu Dave-r63238 @ 2007-05-28  9:05 UTC
To: Benjamin Herrenschmidt; Cc: ppc-dev, Paul Mackerras, Kumar Gala

Ben,

> You never "lose" IPIs in the sense that you always get at least one
> interrupt for N IPIs, and it's up to software to make sure not to
> lose any event. The linux kernel arch code usually handles that with
> a synchronous IPI mechanism.

Because the IPI mechanism for TLB invalidation is synchronous, it is
quite time-consuming: there is interrupt overhead plus the wait for
the sync.

I also noticed that TLB invalidation on PowerPC 750 SMP systems uses
the IPI mechanism; that is because the 750 cannot broadcast TLB
invalidation ops.

Is the broadcast tlbivax instruction more efficient than the IPI
mechanism? Did you evaluate the performance of the two different
approaches?

-d
* RE: fsl booke MM vs. SMP questions
From: Benjamin Herrenschmidt @ 2007-05-28  9:24 UTC
To: Liu Dave-r63238; Cc: ppc-dev, Paul Mackerras, Kumar Gala

On Mon, 2007-05-28 at 17:05 +0800, Liu Dave-r63238 wrote:
> Because the IPI mechanism for TLB invalidation is synchronous, it is
> quite time-consuming: there is interrupt overhead plus the wait for
> the sync.

Yup, there is, though you can try to optimize it such that you only
sync the CPUs involved with the IPIs, which are often only a few.

> I also noticed that TLB invalidation on PowerPC 750 SMP systems uses
> the IPI mechanism; that is because the 750 cannot broadcast TLB
> invalidation ops.

Do we support that in linux ?

> Is the broadcast tlbivax instruction more efficient than the IPI
> mechanism? Did you evaluate the performance of the two different
> approaches?

Not really... it depends on bus traffic, plus the need to spinlock the
broadcast tlbivax as well, etc.

I'm not working on real HW at the moment, and I don't know what the
exact characteristics of your target HW are...

Ben.
* RE: fsl booke MM vs. SMP questions
From: Liu Dave-r63238 @ 2007-05-28  9:37 UTC
To: Benjamin Herrenschmidt; Cc: ppc-dev, Paul Mackerras, Kumar Gala

> > I also noticed that TLB invalidation on PowerPC 750 SMP systems
> > uses the IPI mechanism; that is because the 750 cannot broadcast
> > TLB invalidation ops.
>
> Do we support that in linux ?

Yes, it is supported in the ppc arch, but not in powerpc. It may have
been missed when porting to powerpc.

> Not really... it depends on bus traffic, plus the need to spinlock
> the broadcast tlbivax as well, etc.
>
> I'm not working on real HW at the moment, and I don't know what the
> exact characteristics of your target HW are...

BTW, does the x86 processor support broadcasting TLB operations to the
system? If it can, why do we adopt the IPI mechanism for x86? What is
the concern?

-d
* RE: fsl booke MM vs. SMP questions
From: Benjamin Herrenschmidt @ 2007-05-28 10:00 UTC
To: Liu Dave-r63238; Cc: ppc-dev, Paul Mackerras, Kumar Gala

On Mon, 2007-05-28 at 17:37 +0800, Liu Dave-r63238 wrote:
> BTW, does the x86 processor support broadcasting TLB operations to
> the system? If it can, why do we adopt the IPI mechanism for x86?
> What is the concern?

I don't think it supports them, but I don't know for sure.

Part of the problem is what your workload is. If you have a lot of
small, short-lived processes, such as CGIs on a web server, they are
fairly unlikely to exist on more than one processor, maybe two, during
their lifetime (there is a strong optimisation to only do a local
invalidate when the process only ever existed on one processor).

If you have a massively threaded workload, that is, a given process is
likely to exist on all processors, then it's also fairly unlikely that
you do a lot of fork()s or that those processes are short-lived... so
it's less of an issue, unless you start abusing mmap/munmap or
mprotect.

Also, when you have a large number of processors, having broadcast TLB
invalidations on the bus might become a bottleneck if, at the end of
the day, you really only want to invalidate one or two siblings. In
that case, targeted IPIs are probably a better option.

In the end, it's very difficult to "guess" what is better. If you add
up all of the above, plus the race between TLB invalidations and SW
TLB reloads, it makes sense to start with IPIs and optimize that code
path as much as you can (to avoid hitting more CPUs than necessary,
for example).

Ben.
* Re: fsl booke MM vs. SMP questions
From: Gabriel Paubert @ 2007-05-28 10:23 UTC
To: Benjamin Herrenschmidt
Cc: ppc-dev, Liu Dave-r63238, Paul Mackerras, Kumar Gala

On Mon, May 28, 2007 at 08:00:21PM +1000, Benjamin Herrenschmidt
wrote:
> On Mon, 2007-05-28 at 17:37 +0800, Liu Dave-r63238 wrote:
> > BTW, does the x86 processor support broadcasting TLB operations to
> > the system? If it can, why do we adopt the IPI mechanism for x86?
> > What is the concern?
>
> I don't think it supports them, but I don't know for sure.

It does not. However, IA64 (aka Itanic) does.

Of course, on x86 until recently, the TLBs were completely flushed (at
least the entries mapping user space) on task switches to a different
mm, which automatically avoids the races for single-threaded apps.

> Also, when you have a large number of processors, having broadcast
> TLB invalidations on the bus might become a bottleneck if, at the
> end of the day, you really only want to invalidate one or two
> siblings. In that case, targeted IPIs are probably a better option.

On SMPs with a single die and integrated memory controllers (PASemi),
I'd bet that TLB invalidation broadcast is typically much cheaper,
since no external signals are involved (from a hardware point of view,
it's not very different from a store to a shared cache line that has
to be invalidated in the caches of the other processors).

Gabriel
* Re: fsl booke MM vs. SMP questions
From: Benjamin Herrenschmidt @ 2007-05-28 10:28 UTC
To: Gabriel Paubert
Cc: ppc-dev, Liu Dave-r63238, Paul Mackerras, Kumar Gala

On Mon, 2007-05-28 at 12:23 +0200, Gabriel Paubert wrote:
> On SMPs with a single die and integrated memory controllers
> (PASemi), I'd bet that TLB invalidation broadcast is typically much
> cheaper, since no external signals are involved (from a hardware
> point of view, it's not very different from a store to a shared
> cache line that has to be invalidated in the caches of the other
> processors).

Except that it often has strong locking requirements, along with a
race or two to deal with when the TLB isn't reloaded by HW. So in the
case of Freescale BookE, it is really something that should be
measured.

Ben.
* Re: fsl booke MM vs. SMP questions
From: Gabriel Paubert @ 2007-05-22  8:46 UTC
To: Benjamin Herrenschmidt
Cc: ppc-dev, Dave Liu, Paul Mackerras, Kumar Gala

On Tue, May 22, 2007 at 08:07:52AM +1000, Benjamin Herrenschmidt
wrote:
> I kept the referenced-bit rewrite out of that pseudo-code because I
> was addressing a different issue, but yes. The idea I have there is
> to break down the linux PTE operation this way:
>
>      1 - rX = read PTE value (normal load)
>      2 - if (!_PAGE_PRESENT) -> out
>      3 - rY = rX | _PAGE_ACCESSED
>      4 - if (rX != rY)
>      5 -    rZ = lwarx PTE value
>      6 -    if (rZ != rX)
>      7 -       stdcx. PTE, rZ (rewrite the just-read value to clear
>               the reservation)

Why do you want to clear the reservation here?

Coming out of some code path with the reservation still held can only
affect buggy code (someone doing st[dw]cx. before l[dw]arx) AFAIK.

>      8 -       goto 1 (try again)
>      9 -    stdcx. PTE, rY
>     10 -    if failed -> goto 1 (try again)
>     11 - that's it !
>
> In addition, I suppose performance can be improved by also dealing
> with the dirty bit right in the TLB refill when the access is a
> write and the page is writeable, rather than taking a double fault.

Regards,
Gabriel
* Re: fsl booke MM vs. SMP questions
From: Benjamin Herrenschmidt @ 2007-05-22  9:14 UTC
To: Gabriel Paubert; Cc: ppc-dev, Dave Liu, Paul Mackerras, Kumar Gala

> Why do you want to clear the reservation here?
>
> Coming out of some code path with the reservation still held can
> only affect buggy code (someone doing st[dw]cx. before l[dw]arx)
> AFAIK.

And buggy CPUs :-)

Seriously, lots of CPU implementations don't check the address when
matching a local lwarx/stwcx. pair, so if your kernel code "replaces"
a reservation with another one that is left set, a userland stwcx.
might well succeed, which is bogus.

Ben.
* Re: fsl booke MM vs. SMP questions
From: Gabriel Paubert @ 2007-05-22 10:02 UTC
To: Benjamin Herrenschmidt
Cc: ppc-dev, Dave Liu, Paul Mackerras, Kumar Gala

On Tue, May 22, 2007 at 07:14:38PM +1000, Benjamin Herrenschmidt
wrote:
> > Why do you want to clear the reservation here?
> >
> > Coming out of some code path with the reservation still held can
> > only affect buggy code (someone doing st[dw]cx. before l[dw]arx)
> > AFAIK.
>
> And buggy CPUs :-)
>
> Seriously, lots of CPU implementations don't check the address when
> matching a local lwarx/stwcx. pair, so if your kernel code
> "replaces" a reservation with another one that is left set, a
> userland stwcx. might well succeed, which is bogus.

Well, there should always be an stwcx. to clear the reservation before
any interrupt return; otherwise you'll be able to cause
hard-to-reproduce bugs in the interrupted code. Whether or not the
reservation address is checked against the stwcx. address is
irrelevant.

Gabriel
* Re: fsl booke MM vs. SMP questions
From: Benjamin Herrenschmidt @ 2007-05-22 10:05 UTC
To: Gabriel Paubert; Cc: ppc-dev, Dave Liu, Paul Mackerras, Kumar Gala

On Tue, 2007-05-22 at 12:02 +0200, Gabriel Paubert wrote:
> Well, there should always be an stwcx. to clear the reservation
> before any interrupt return; otherwise you'll be able to cause
> hard-to-reproduce bugs in the interrupted code.

Well, that's the point. The BookE TLB refill exception is a very fast
path that doesn't use the normal interrupt return code path. It thus
needs to be careful about not leaving dangling reservations.

On some CPUs, there are also performance issues with leaving a
dangling lwarx, IIRC, but I don't have the details off the top of my
head.

Cheers,
Ben.
* Re: fsl booke MM vs. SMP questions
From: Gabriel Paubert @ 2007-05-23  9:12 UTC
To: Benjamin Herrenschmidt
Cc: ppc-dev, Dave Liu, Paul Mackerras, Kumar Gala

On Tue, May 22, 2007 at 08:05:42PM +1000, Benjamin Herrenschmidt
wrote:
> Well, that's the point. The BookE TLB refill exception is a very
> fast path that doesn't use the normal interrupt return code path. It
> thus needs to be careful about not leaving dangling reservations.

Ok, thanks. I had missed that critical piece of information from the
context. In this case it makes sense, although I wonder whether a
different order of instructions could shave some latency off the
critical path:

     1 - rX = read PTE value (normal load)
     2 - if (!_PAGE_PRESENT) -> out
     3 - rY = rX | _PAGE_ACCESSED
     4 - if (rX != rY)

Specifically here, I wonder whether instead of the sequence:

    ori   ry, rx, PAGE_ACCESSED
    cmpw  rx, ry
    beq   11f    ; needs non-default static prediction?

it might be better to write it as:

    andi. rz, rx, PAGE_ACCESSED
    ori   ry, rx, PAGE_ACCESSED
    bne   11f

since on some processors the branch might be resolved one cycle
earlier. But I don't know the processors with these MMUs very well.

     5 -    rZ = lwarx PTE value
     6 -    if (rZ != rX)
     7 -       stdcx. PTE, rZ (rewrite the just-read value to clear
              the reservation)

Hmm, lWarx paired with stDcx. looks like a typo ?

     8 -       goto 1 (try again)
     9 -    stdcx. PTE, rY

Ditto.

    10 -    if failed -> goto 1 (try again)
    11 - that's it !

I suspect that in the TLB handler you've got something like 4
registers and one CR field to play with, so more clever solutions may
be impossible to implement.

> On some CPUs, there are also performance issues with leaving a
> dangling lwarx, IIRC, but I don't have the details off the top of my
> head.

I don't know of any, but I almost exclusively use the 603e and 750.

Gabriel
* Re: fsl booke MM vs. SMP questions
From: Kumar Gala @ 2007-05-22  3:03 UTC
To: Benjamin Herrenschmidt; Cc: ppc-dev, Paul Mackerras, Kumar Gala

On May 21, 2007, at 2:06 AM, Benjamin Herrenschmidt wrote:
> Hi Folks !
>
> I see that the fsl booke code has some #ifdef CONFIG_SMP bits here
> and there, so I suppose there are SMP implementations of these,
> right ?

There will be. The SMP code that exists was just some stuff I put in
without going through each case. The TLB mgmt code does need some
fixup for SMP.

- k