* fsl booke MM vs. SMP questions
From: Benjamin Herrenschmidt @ 2007-05-21  7:06 UTC
To: ppc-dev; Cc: Kumar Gala, Paul Mackerras

Hi Folks !

I see that the fsl booke code has some #ifdef CONFIG_SMP bits here and
there, so I suppose there are SMP implementations of these, right ?

I'm having serious trouble figuring out how the TLB management is made
SMP safe, however.

There are at least two main issues I've spotted at this point (there's
at least one more if there is HW threading, that is, if the TLB is
shared between logical processors, but I'll ignore that for now since
I don't think such a thing exists ... yet).

 - How do you guys shield PTE flushing vs. TLB misses on another CPU ?
That is, how do you prevent (if you do) the following scenario:

    cpu 0                                 cpu 1
    tlb miss                              pte_clear (or similar)
    load PTE value
                                          write 0 to PTE (or replace)
                                          tlbivax (tlbie)
    tlbwe

That scenario, as you can see, will leave you with stale entries in
the TLB, which will ultimately lead to all sorts of unpleasant/random
behaviours.

If the answer is "oops ... we don't", then let's try to find ways out
of that, since I may have a similar issue in a not too distant
future :-) And I'm trying to find a -fast- way to deal with it without
bloating the fast path. My main problem is that I want to avoid taking
a spinlock or an equivalent atomic operation in the fast TLB reload
path (which would solve the problem), since lwarx/stwcx. are generally
really slow (hundreds of cycles on some processors).

 - I see that your TLB miss handler uses a non-atomic store to write
the _PAGE_ACCESSED bit back to the PTE. Don't you have a similar race
where something does:

    cpu 0                                 cpu 1
    tlb miss                              pte_clear (or similar)
    load PTE value
                                          write 0 to PTE (or replace)
    write back PTE with _PAGE_ACCESSED
    tlbwe

This is an extension of the previous race, but it's a different
problem, so I've listed it separately. In this case, the problem is
worse: not only do you have a stale TLB entry, you have -also-
corrupted the linux PTE by writing the old value back into it.

At this point, I'm afraid you may have no choice but to go atomic,
which means paying the cost of lwarx/stwcx. on TLB misses, though if
you have a solution for the first problem, you can avoid the atomic
operation in the second case when _PAGE_ACCESSED is already set.

If not, you might have to use a _PAGE_BUSY bit as a per-PTE lock,
similar to what the 64-bit code uses, or use mmu_hash_lock... unless
you come up with a great idea or some HW black magic that makes the
problem go away...

In any case, I'm curious how you have solved, or intend to solve,
this, since as I said above I might be in a similar situation soon,
and I'm trying to keep the TLB miss handler as fast as humanly
possible.

Cheers,
Ben.
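[Editorial note: the second race can be made concrete with a small
standalone C sketch. This is an illustration, not code from the
thread; the flag values are made up, and the two-CPU interleaving is
serialized here so the lost update is deterministic.]

    #include <assert.h>

    #define _PAGE_PRESENT  0x001    /* hypothetical flag values */
    #define _PAGE_ACCESSED 0x100

    int main(void)
    {
        unsigned long pte = _PAGE_PRESENT;   /* a valid linux PTE */

        /* cpu 0: tlb miss, loads the PTE value */
        unsigned long miss_copy = pte;

        /* cpu 1: pte_clear() (and tlbivax) run in between */
        pte = 0;

        /* cpu 0: non-atomic write-back of _PAGE_ACCESSED */
        pte = miss_copy | _PAGE_ACCESSED;

        /* the clear is lost: the PTE looks valid again */
        assert(pte & _PAGE_PRESENT);
        return 0;
    }

On real hardware the same lost update happens whenever the two paths
overlap in time.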
* Re: fsl booke MM vs. SMP questions
From: Dave Liu @ 2007-05-21 11:37 UTC
To: Benjamin Herrenschmidt; Cc: ppc-dev, Paul Mackerras, Kumar Gala

On Mon, 2007-05-21 at 20:08 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2007-05-21 at 17:57 +0800, Dave Liu wrote:
> >
> > > If not, you might have to use a _PAGE_BUSY bit as a per-PTE
> > > lock, similar to what the 64-bit code uses, or use
> > > mmu_hash_lock... unless you come up with a great idea or some HW
> > > black magic that makes the problem go away...
> >
> > I would like the _PAGE_BUSY bit for a per-PTE lock; it will have a
> > better performance benefit than a global lock. The BookE
> > architecture doesn't use the hardware hash table, so it cannot use
> > mmu_hash_lock, which is a global lock for the hash table.
>
> (BTW. Did you remove the list CC on purpose ? If not, then please
> add it back on your reply and make sure my reply is fully
> visible :-)

Sorry about that, it was a mis-click on my part.

> Still.. having to use a lwarx/stwcx. loop in the TLB refill handler
> is a sad story, don't you think ? I don't know about you guys, but
> on the CPUs I know, those take hundreds of cycles....

It is true, I know that.

> I've come up with an idea (thanks wli for tipping me off) that's
> inspired by RCU instead:
>
> We have a per-cpu flag called tlbbusy
>
> The tlb miss handler does:
>
>  - tlbbusy = 1
>  - barrier (make sure the following read is ordered vs. the previous
>    store to tlbbusy)
>  - read linux PTE value
>  - write it to the HW TLB

and write the linux PTE with the referenced bit?

>  - appropriate sync
>  - tlbbusy = 0
>
> Now, the tlb invalidation code (which can use a batch to be even
> more efficient; see how the 64-bit or x86 code uses batching for TLB
> invalidations) can then use the fact that the mm carries a cpu
> bitmask of all CPUs that ever touched that mm, and thus can do the
> following after a PTE has changed and before broadcasting an
> invalidation:

How do you interlock this PTE change with the PTE change done by the
tlb miss?

>  - make a local copy "mask" of mm->cpu_vm_mask
>  - clear the bit for the current cpu from the mask
>  - while there is still a bit in the mask:
>     - for each bit in the mask, check if tlbbusy for that cpu is 0
>       -> if 0, clear the bit in the mask
>  - loop until there are no more bits in the mask
>  - perform the tlbivax

It looks like a good idea, but what are the bad things with batch
invalidation?

> In addition, if you have a "local" version of tlbivax (no
> broadcast), you can do a nice optimisation: if after step 2 (clear
> the bit for the current cpu) the mask is already 0 (meaning the mm
> only ever existed on the local cpu), you can do a local tlbivax and
> return.

BookE has the "local" version of tlbivax via the tlbwe instruction.
Yes, it can actually reduce bus traffic.
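[Editorial sketch: the tlbbusy scheme quoted above, in standalone C.
The names, the fixed CPU count, and the barrier primitive (a GCC
builtin) are assumptions for illustration; the real thing would live
in the asm miss handler and the flush path.]

    #define NR_CPUS 4

    /* set while a cpu is inside its TLB miss handler */
    static volatile int tlbbusy[NR_CPUS];

    /* miss handler side, running on cpu 'me' */
    void tlb_miss_refill(int me, volatile unsigned long *ptep)
    {
        tlbbusy[me] = 1;
        __sync_synchronize();        /* flag store before PTE read */
        unsigned long pte = *ptep;   /* read linux PTE value */
        /* ... tlbwe: write pte into the HW TLB here ... */
        (void)pte;
        __sync_synchronize();        /* TLB write before flag clear */
        tlbbusy[me] = 0;
    }

    /* invalidation side: after the PTE change is globally visible,
     * wait until each cpu in the mm's mask has been observed outside
     * its miss handler at least once */
    void wait_out_tlb_misses(unsigned long cpu_vm_mask, int me)
    {
        unsigned long mask = cpu_vm_mask & ~(1UL << me);
        while (mask) {
            for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                if ((mask & (1UL << cpu)) && tlbbusy[cpu] == 0)
                    mask &= ~(1UL << cpu);
            }
        }
        /* now safe to perform the tlbivax broadcast */
    }

Any cpu observed with tlbbusy == 0 after the PTE change either has no
miss in flight or will re-read the updated PTE in its next miss; this
is the RCU-like property the scheme relies on.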
* Re: fsl booke MM vs. SMP questions
From: Benjamin Herrenschmidt @ 2007-05-21 22:07 UTC
To: Dave Liu; Cc: ppc-dev, Paul Mackerras, Kumar Gala

> > The tlb miss handler does:
> >
> >  - tlbbusy = 1
> >  - barrier (make sure the following read is ordered vs. the
> >    previous store to tlbbusy)
> >  - read linux PTE value
> >  - write it to the HW TLB
>
> and write the linux PTE with the referenced bit?

I kept the referenced-bit rewrite out of that pseudo-code because I
was addressing a different issue, but yes. The idea I have there is to
break down the linux PTE operation this way:

     1 - rX = read PTE value (normal load)
     2 - if (!_PAGE_PRESENT) -> out
     3 - rY = rX | _PAGE_ACCESSED
     4 - if (rX != rY)
     5 -    rZ = lwarx PTE value
     6 -    if (rZ != rX)
     7 -       stdcx. PTE, rZ (rewrite the just-read value to clear
              the reservation)
     8 -       goto 1 (try again)
     9 -    stdcx. PTE, rY
    10 -    if failed -> goto 1 (try again)
    11 - that's it !

In addition, I suppose performance can be improved by also dealing
with the dirty bit right in the TLB refill when the access is a write
and the page is writeable, rather than taking a double fault.

> >  - appropriate sync
> >  - tlbbusy = 0
> >
> > Now, the tlb invalidation code (which can use a batch to be even
> > more efficient; see how the 64-bit or x86 code uses batching for
> > TLB invalidations) can then use the fact that the mm carries a cpu
> > bitmask of all CPUs that ever touched that mm, and thus can do the
> > following after a PTE has changed and before broadcasting an
> > invalidation:
>
> How do you interlock this PTE change with the PTE change done by the
> tlb miss?

Look at pgtable-ppc32.h. PTE changes done by linux are atomic. If you
use the procedure I outlined above, PTE modifications done by the TLB
miss handler will also be atomic, though you can skip the atomic
operation when it isn't necessary (when _PAGE_ACCESSED is already set,
for example).

Thus, the situation is basically that linux PTE changes need to:

 - update the PTE
 - barrier
 - make sure the change is visible to all other CPUs and that they
   have all been out of a TLB miss handler at least once, which is
   what my proposed algorithm does
 - broadcast the invalidation

> >  - make a local copy "mask" of mm->cpu_vm_mask
> >  - clear the bit for the current cpu from the mask
> >  - while there is still a bit in the mask:
> >     - for each bit in the mask, check if tlbbusy for that cpu is 0
> >       -> if 0, clear the bit in the mask
> >  - loop until there are no more bits in the mask
> >  - perform the tlbivax
>
> It looks like a good idea, but what are the bad things with batch
> invalidation?

Why bad ? Batch invalidations let you do the whole operation of
syncing with the other CPUs only once for a whole lot of
invalidations:

 - clear lots of PTEs
 - sync once
 - send lots of tlbivax

You don't have to implement batch invalidates, but doing so will
improve performance.

> > In addition, if you have a "local" version of tlbivax (no
> > broadcast), you can do a nice optimisation: if after step 2 (clear
> > the bit for the current cpu) the mask is already 0 (meaning the mm
> > only ever existed on the local cpu), you can do a local tlbivax
> > and return.
>
> BookE has the "local" version of tlbivax via the tlbwe instruction.
> Yes, it can actually reduce bus traffic.
And it is probably faster too :-)

The above method also needs to be looked at carefully for the TLB
storage interrupt (that is, a TLB entry present but with the wrong
permissions).

Ben.
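[Editorial sketch: the eleven steps above expressed as a
compare-and-swap loop in standalone C, with
__sync_val_compare_and_swap standing in for the lwarx/stwcx. pair.
The flag values are made up.]

    #define _PAGE_PRESENT  0x001    /* hypothetical flag values */
    #define _PAGE_ACCESSED 0x100

    /* returns the value to load into the TLB, or 0 if not present */
    unsigned long tlb_miss_update_pte(unsigned long *ptep)
    {
        for (;;) {
            unsigned long rx = *ptep;                /* 1: normal load */
            if (!(rx & _PAGE_PRESENT))
                return 0;                            /* 2: out */
            unsigned long ry = rx | _PAGE_ACCESSED;  /* 3 */
            if (rx == ry)                            /* 4: already set, */
                return rx;                           /*    skip atomics */
            /* 5-10: the lwarx/stwcx. retry loop as a CAS */
            if (__sync_val_compare_and_swap(ptep, rx, ry) == rx)
                return ry;                           /* 11: done */
            /* PTE changed under us (steps 6-8): try again */
        }
    }

Note that the CAS abstraction hides the detail debated later in the
thread: on real hardware, the failure path (steps 6-8) must still
execute a stwcx. so that no reservation is left dangling when the
handler returns.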
* Re: fsl booke MM vs. SMP questions
From: Benjamin Herrenschmidt @ 2007-05-22  3:09 UTC
To: Dave Liu; Cc: ppc-dev, Paul Mackerras, Kumar Gala

> > > Now, the tlb invalidation code (which can use a batch to be even
> > > more efficient; see how the 64-bit or x86 code uses batching for
> > > TLB invalidations) can then use the fact that the mm carries a
> > > cpu bitmask of all CPUs that ever touched that mm, and thus can
> > > do the following after a PTE has changed and before broadcasting
> > > an invalidation:
> >
> > How do you interlock this PTE change with the PTE change done by
> > the tlb miss?
>
> Look at pgtable-ppc32.h. PTE changes done by linux are atomic. If
> you use the procedure I outlined above, PTE modifications done by
> the TLB miss handler will also be atomic, though you can skip the
> atomic operation when it isn't necessary (when _PAGE_ACCESSED is
> already set, for example).
>
> Thus, the situation is basically that linux PTE changes need to

Note that overall, my method requires at least these barriers:

 - setting the flag to 1 vs. reading the PTE
 - writing the TLB entry vs. setting the flag to 0

That means two barriers in the TLB refill handler. I'm not 100%
familiar with the barriers you have on fsl BookE, nor with their exact
semantics and performance, so you may need to look closely at the
impact of taking those.

In the end, the best solution might still be to not do any of this and
instead send an IPI on invalidations. That's the method used by most
architectures in linux (if not all) that do software TLB loading on
SMP. Basically, the invalidate code path then does:

 - update the linux PTE
 - write barrier
 - send an IPI to all CPUs in mm->cpu_vm_mask
 - local TLB flush

And the IPI handler does a local TLB flush on all affected CPUs.

Ben.
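[Editorial sketch of the IPI-based invalidation path in kernel-style
C. smp_call_function() with a wait flag was the synchronous-IPI
primitive of that era; the flush helper and argument struct are
hypothetical, and restricting the IPI to mm->cpu_vm_mask would need a
targeted variant.]

    struct flush_args {
        struct mm_struct *mm;
        unsigned long addr;
    };

    /* IPI handler: runs on every CPU the sender targets */
    static void do_flush_tlb_ipi(void *info)
    {
        struct flush_args *args = info;
        local_tlb_flush(args->mm, args->addr);  /* hypothetical helper */
    }

    void flush_tlb_page_smp(struct mm_struct *mm, unsigned long addr)
    {
        struct flush_args args = { .mm = mm, .addr = addr };

        /* the caller has already updated the linux PTE */
        smp_wmb();                              /* write barrier */
        /* wait=1 makes the IPI synchronous: this returns only after
         * every CPU has run the handler, so no flush can be missed
         * even with edge-triggered IPIs */
        smp_call_function(do_flush_tlb_ipi, &args, 0, 1);
        local_tlb_flush(mm, addr);              /* local TLB flush */
    }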
* Re: fsl booke MM vs. SMP questions
From: Dave Liu @ 2007-05-22 10:56 UTC
To: Benjamin Herrenschmidt; Cc: ppc-dev, Paul Mackerras, Kumar Gala

On Tue, 2007-05-22 at 13:09 +1000, Benjamin Herrenschmidt wrote:
> In the end, the best solution might still be to not do any of this
> and instead send an IPI on invalidations. That's the method used by
> most architectures in linux (if not all) that do software TLB
> loading on SMP. Basically, the invalidate code path then does:
>
>  - update the linux PTE
>  - write barrier
>  - send an IPI to all CPUs in mm->cpu_vm_mask
>  - local TLB flush
>
> And the IPI handler does a local TLB flush on all affected CPUs.

How do you avoid missing an IPI if the IPI interrupt is
edge-triggered? Or, put differently, how do you make sure the TLB has
been flushed on all the other affected CPUs?

-d
* Re: fsl booke MM vs. SMP questions
From: Benjamin Herrenschmidt @ 2007-05-22 22:42 UTC
To: Dave Liu; Cc: ppc-dev, Paul Mackerras, Kumar Gala

On Tue, 2007-05-22 at 18:56 +0800, Dave Liu wrote:
> How do you avoid missing an IPI if the IPI interrupt is
> edge-triggered? Or, put differently, how do you make sure the TLB
> has been flushed on all the other affected CPUs?

The IPIs should be buffered by the PIC ... delivered only once, but
still delivered. Also, IPI handling in linux is synchronous: there is
an ack, and the sender waits for the remote function to complete.

Ben.
* Re: fsl booke MM vs. SMP questions
From: Dave Liu @ 2007-05-23  2:38 UTC
To: Benjamin Herrenschmidt; Cc: ppc-dev, Paul Mackerras, Kumar Gala

On Wed, 2007-05-23 at 08:42 +1000, Benjamin Herrenschmidt wrote:
> The IPIs should be buffered by the PIC ... delivered only once, but

But what is the buffer depth for IPIs in the PIC?

> still delivered. Also, IPI handling in linux is synchronous: there
> is an ack, and the sender waits for the remote function to complete.

Yeah, I get it.
* Re: fsl booke MM vs. SMP questions
From: Benjamin Herrenschmidt @ 2007-05-23  3:08 UTC
To: Dave Liu; Cc: ppc-dev, Paul Mackerras, Kumar Gala

On Wed, 2007-05-23 at 10:38 +0800, Dave Liu wrote:
> On Wed, 2007-05-23 at 08:42 +1000, Benjamin Herrenschmidt wrote:
> > The IPIs should be buffered by the PIC ... delivered only once,
> > but
>
> But what is the buffer depth for IPIs in the PIC?

One :-)

You never "lose" IPIs in the sense that you always get at least one
interrupt for N IPIs, and it's up to software to make sure not to lose
any event. The linux kernel arch code usually handles that with a
synchronous IPI mechanism.

Ben.
* RE: fsl booke MM vs. SMP questions
From: Liu Dave-r63238 @ 2007-05-28  9:05 UTC
To: Benjamin Herrenschmidt; Cc: ppc-dev, Paul Mackerras, Kumar Gala

Ben,

> You never "lose" IPIs in the sense that you always get at least one
> interrupt for N IPIs, and it's up to software to make sure not to
> lose any event. The linux kernel arch code usually handles that with
> a synchronous IPI mechanism.

Because the IPI mechanism for TLB invalidation is synchronous, it is
quite time-consuming: there is interrupt overhead plus the wait for
the sync.

I also noticed that TLB invalidation on PowerPC 750 SMP systems uses
the IPI mechanism; that is because the 750 cannot broadcast TLB
invalidation ops.

Is the broadcast tlbivax instruction more efficient than the IPI
mechanism? Did you evaluate the performance of the two different
approaches?

-d
* RE: fsl booke MM vs. SMP questions
From: Benjamin Herrenschmidt @ 2007-05-28  9:24 UTC
To: Liu Dave-r63238; Cc: ppc-dev, Paul Mackerras, Kumar Gala

On Mon, 2007-05-28 at 17:05 +0800, Liu Dave-r63238 wrote:
> Because the IPI mechanism for TLB invalidation is synchronous, it is
> quite time-consuming: there is interrupt overhead plus the wait for
> the sync.

Yup, there is, though you can try to optimize it such that you only
sync the CPUs involved with the IPIs, which are often only a few.

> I also noticed that TLB invalidation on PowerPC 750 SMP systems uses
> the IPI mechanism; that is because the 750 cannot broadcast TLB
> invalidation ops.

Do we support that in linux ?

> Is the broadcast tlbivax instruction more efficient than the IPI
> mechanism? Did you evaluate the performance of the two different
> approaches?

Not really... it depends on bus traffic, plus the need to spinlock the
broadcast tlbivax as well, etc.

I'm not working on real HW at the moment, and I don't know what the
exact characteristics of your target HW are...

Ben.
* RE: fsl booke MM vs. SMP questions
From: Liu Dave-r63238 @ 2007-05-28  9:37 UTC
To: Benjamin Herrenschmidt; Cc: ppc-dev, Paul Mackerras, Kumar Gala

> > I also noticed that TLB invalidation on PowerPC 750 SMP systems
> > uses the IPI mechanism; that is because the 750 cannot broadcast
> > TLB invalidation ops.
>
> Do we support that in linux ?

Yes, it is supported in the ppc arch, but not in powerpc. It may have
been missed when porting to powerpc.

> Not really... it depends on bus traffic, plus the need to spinlock
> the broadcast tlbivax as well, etc.
>
> I'm not working on real HW at the moment, and I don't know what the
> exact characteristics of your target HW are...

BTW, does the x86 processor support broadcasting TLB operations to the
system? If it can, why do we adopt the IPI mechanism for x86? What is
the concern?

-d
* RE: fsl booke MM vs. SMP questions
From: Benjamin Herrenschmidt @ 2007-05-28 10:00 UTC
To: Liu Dave-r63238; Cc: ppc-dev, Paul Mackerras, Kumar Gala

On Mon, 2007-05-28 at 17:37 +0800, Liu Dave-r63238 wrote:
> BTW, does the x86 processor support broadcasting TLB operations to
> the system? If it can, why do we adopt the IPI mechanism for x86?
> What is the concern?

I don't think it supports them, but I don't know for sure.

Part of the problem is what your workload is. If you have a lot of
small, short-lived processes, such as CGIs on a web server, they are
fairly unlikely to exist on more than one processor, maybe two, during
their lifetime (there is a strong optimisation to only do a local
invalidate when the process only ever existed on one processor).

If you have a massively threaded workload, that is, a given process is
likely to exist on all processors, then it's also fairly unlikely that
you do a lot of fork()s or that those processes are short-lived... so
it's less of an issue, unless you start abusing mmap/munmap or
mprotect.

Also, when you have a large number of processors, having broadcast TLB
invalidations on the bus might become a bottleneck if, at the end of
the day, you really only want to invalidate one or two siblings. In
that case, targeted IPIs are probably a better option.

In the end, it's very difficult to "guess" what is better. If you add
up all of the above, plus the race between TLB invalidations and SW
TLB reloads, it makes sense to start with IPIs and optimize that code
path as much as you can (to avoid hitting more CPUs than necessary,
for example).

Ben.
* Re: fsl booke MM vs. SMP questions
From: Gabriel Paubert @ 2007-05-28 10:23 UTC
To: Benjamin Herrenschmidt
Cc: ppc-dev, Liu Dave-r63238, Paul Mackerras, Kumar Gala

On Mon, May 28, 2007 at 08:00:21PM +1000, Benjamin Herrenschmidt
wrote:
> On Mon, 2007-05-28 at 17:37 +0800, Liu Dave-r63238 wrote:
> > BTW, does the x86 processor support broadcasting TLB operations to
> > the system? If it can, why do we adopt the IPI mechanism for x86?
> > What is the concern?
>
> I don't think it supports them, but I don't know for sure.

It does not. However, IA64 (aka Itanic) does.

Of course, on x86 until recently, the TLBs were completely flushed (at
least the entries mapping user space) on task switches to a different
mm, which automatically avoids the races for single-threaded apps.

> Also, when you have a large number of processors, having broadcast
> TLB invalidations on the bus might become a bottleneck if, at the
> end of the day, you really only want to invalidate one or two
> siblings. In that case, targeted IPIs are probably a better option.

On SMPs with a single die and integrated memory controllers (PASemi),
I'd bet that TLB invalidation broadcast is typically much cheaper,
since no external signals are involved (from a hardware point of view,
it's not very different from a store to a shared cache line that has
to be invalidated in the caches of the other processors).

Gabriel
* Re: fsl booke MM vs. SMP questions
From: Benjamin Herrenschmidt @ 2007-05-28 10:28 UTC
To: Gabriel Paubert
Cc: ppc-dev, Liu Dave-r63238, Paul Mackerras, Kumar Gala

On Mon, 2007-05-28 at 12:23 +0200, Gabriel Paubert wrote:
> On SMPs with a single die and integrated memory controllers
> (PASemi), I'd bet that TLB invalidation broadcast is typically much
> cheaper, since no external signals are involved (from a hardware
> point of view, it's not very different from a store to a shared
> cache line that has to be invalidated in the caches of the other
> processors).

Except that it often has strong locking requirements, along with a
race or two to deal with when the TLB isn't reloaded by HW. So in the
case of Freescale BookE, it is really something that should be
measured.

Ben.
* Re: fsl booke MM vs. SMP questions
From: Gabriel Paubert @ 2007-05-22  8:46 UTC
To: Benjamin Herrenschmidt
Cc: ppc-dev, Dave Liu, Paul Mackerras, Kumar Gala

On Tue, May 22, 2007 at 08:07:52AM +1000, Benjamin Herrenschmidt
wrote:
> I kept the referenced-bit rewrite out of that pseudo-code because I
> was addressing a different issue, but yes. The idea I have there is
> to break down the linux PTE operation this way:
>
>      1 - rX = read PTE value (normal load)
>      2 - if (!_PAGE_PRESENT) -> out
>      3 - rY = rX | _PAGE_ACCESSED
>      4 - if (rX != rY)
>      5 -    rZ = lwarx PTE value
>      6 -    if (rZ != rX)
>      7 -       stdcx. PTE, rZ (rewrite the just-read value to clear
>               the reservation)

Why do you want to clear the reservation here?

Coming out of some code path with the reservation still held can only
affect buggy code (someone doing st[dw]cx. before l[dw]arx) AFAIK.

>      8 -       goto 1 (try again)
>      9 -    stdcx. PTE, rY
>     10 -    if failed -> goto 1 (try again)
>     11 - that's it !
>
> In addition, I suppose performance can be improved by also dealing
> with the dirty bit right in the TLB refill when the access is a
> write and the page is writeable, rather than taking a double fault.

Regards,
Gabriel
* Re: fsl booke MM vs. SMP questions
From: Benjamin Herrenschmidt @ 2007-05-22  9:14 UTC
To: Gabriel Paubert; Cc: ppc-dev, Dave Liu, Paul Mackerras, Kumar Gala

> Why do you want to clear the reservation here?
>
> Coming out of some code path with the reservation still held can
> only affect buggy code (someone doing st[dw]cx. before l[dw]arx)
> AFAIK.

And buggy CPUs :-)

Seriously, lots of CPU implementations don't check the address when
matching a local lwarx/stwcx. pair, so if your kernel code "replaces"
a reservation with another one that is left set, a userland stwcx.
might well succeed, which is bogus.

Ben.
* Re: fsl booke MM vs. SMP questions
From: Gabriel Paubert @ 2007-05-22 10:02 UTC
To: Benjamin Herrenschmidt
Cc: ppc-dev, Dave Liu, Paul Mackerras, Kumar Gala

On Tue, May 22, 2007 at 07:14:38PM +1000, Benjamin Herrenschmidt
wrote:
> > Why do you want to clear the reservation here?
> >
> > Coming out of some code path with the reservation still held can
> > only affect buggy code (someone doing st[dw]cx. before l[dw]arx)
> > AFAIK.
>
> And buggy CPUs :-)
>
> Seriously, lots of CPU implementations don't check the address when
> matching a local lwarx/stwcx. pair, so if your kernel code
> "replaces" a reservation with another one that is left set, a
> userland stwcx. might well succeed, which is bogus.

Well, there should always be an stwcx. to clear the reservation before
any interrupt return; otherwise you'll be able to cause
hard-to-reproduce bugs in the interrupted code. Whether or not the
reservation address is checked against the stwcx. address is
irrelevant.

Gabriel
* Re: fsl booke MM vs. SMP questions
From: Benjamin Herrenschmidt @ 2007-05-22 10:05 UTC
To: Gabriel Paubert; Cc: ppc-dev, Dave Liu, Paul Mackerras, Kumar Gala

On Tue, 2007-05-22 at 12:02 +0200, Gabriel Paubert wrote:
> Well, there should always be an stwcx. to clear the reservation
> before any interrupt return; otherwise you'll be able to cause
> hard-to-reproduce bugs in the interrupted code.

Well, that's the point. The BookE TLB refill exception is a very fast
path that doesn't use the normal interrupt return code path. It thus
needs to be careful about not leaving dangling reservations.

On some CPUs, there are also performance issues with leaving a
dangling lwarx, IIRC, but I don't have the details off the top of my
head.

Cheers,
Ben.
* Re: fsl booke MM vs. SMP questions
From: Gabriel Paubert @ 2007-05-23  9:12 UTC
To: Benjamin Herrenschmidt
Cc: ppc-dev, Dave Liu, Paul Mackerras, Kumar Gala

On Tue, May 22, 2007 at 08:05:42PM +1000, Benjamin Herrenschmidt
wrote:
> Well, that's the point. The BookE TLB refill exception is a very
> fast path that doesn't use the normal interrupt return code path. It
> thus needs to be careful about not leaving dangling reservations.

Ok, thanks. I had missed that critical piece of information from the
context. In this case it makes sense, although I wonder whether a
different order of instructions could shave some latency off the
critical path:

     1 - rX = read PTE value (normal load)
     2 - if (!_PAGE_PRESENT) -> out
     3 - rY = rX | _PAGE_ACCESSED
     4 - if (rX != rY)

Specifically here, I wonder whether instead of the sequence:

    ori   ry, rx, PAGE_ACCESSED
    cmpw  rx, ry
    beq   11f    ; needs non-default static prediction?

it might be better to write it as:

    andi. rz, rx, PAGE_ACCESSED
    ori   ry, rx, PAGE_ACCESSED
    bne   11f

since on some processors the branch might be resolved one cycle
earlier. But I don't know the processors with these MMUs very well.

     5 -    rZ = lwarx PTE value
     6 -    if (rZ != rX)
     7 -       stdcx. PTE, rZ (rewrite the just-read value to clear
              the reservation)

Hmm, lWarx paired with stDcx. looks like a typo ?

     8 -       goto 1 (try again)
     9 -    stdcx. PTE, rY

Ditto.

    10 -    if failed -> goto 1 (try again)
    11 - that's it !

I suspect that in the TLB handler you've got something like 4
registers and one CR field to play with, so more clever solutions may
be impossible to implement.

> On some CPUs, there are also performance issues with leaving a
> dangling lwarx, IIRC, but I don't have the details off the top of my
> head.

I don't know of any, but I almost exclusively use the 603e and 750.

Gabriel
* Re: fsl booke MM vs. SMP questions
From: Kumar Gala @ 2007-05-22  3:03 UTC
To: Benjamin Herrenschmidt; Cc: ppc-dev, Paul Mackerras, Kumar Gala

On May 21, 2007, at 2:06 AM, Benjamin Herrenschmidt wrote:
> Hi Folks !
>
> I see that the fsl booke code has some #ifdef CONFIG_SMP bits here
> and there, so I suppose there are SMP implementations of these,
> right ?

There will be. The SMP code that exists was just some stuff I put in
without going through each case. The TLB mgmt code does need some
fixup for SMP.

- k