* Re: x86 ptep_get_and_clear question [not found] ` <200102151723.JAA43255@google.engr.sgi.com> @ 2001-02-15 17:47 ` Jamie Lokier 2001-02-15 18:05 ` Kanoj Sarcar 2001-02-15 18:23 ` Kanoj Sarcar 0 siblings, 2 replies; 41+ messages in thread From: Jamie Lokier @ 2001-02-15 17:47 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Ben LaHaise, linux-mm, mingo, alan, linux-kernel [Added Linus and linux-kernel as I think it's of general interest] Kanoj Sarcar wrote: > Whether Jamie was trying to illustrate a different problem, I am not > sure. Yes, I was talking about pte_test_and_clear_dirty in the earlier post. > Look in mm/mprotect.c. Look at the call sequence change_protection() -> ... > change_pte_range(). Specifically at the sequence: > > entry = ptep_get_and_clear(pte); > set_pte(pte, pte_modify(entry, newprot)); > > Go ahead and pull your x86 specs, and prove to me that between the > ptep_get_and_clear(), which zeroes out the pte (specifically, when the > dirty bit is not set), processor 2 can not come in and set the dirty > bit on the in-memory pte. Which immediately gets overwritten by the > set_pte(). For an example of how this can happen, look at my previous > postings. Let's see. We'll assume processor 2 does a write between the ptep_get_and_clear and the set_pte, which are done on processor 1. Now, ptep_get_and_clear is atomic, so we can talk about "before" and "after". Before it, either processor 2 has a TLB entry with the dirty bit set, or it does not (it has either a clean TLB entry or no TLB entry at all). After ptep_get_and_clear, processor 2 does a write. If it already has a dirty TLB entry, then `entry' will also be dirty so the dirty bit is preserved. If processor 2 does not have a dirty TLB entry, then it will look up the pte. Processor 2 finds the pte is clear, so raises a page fault. Spinlocks etc. sort everything out in the page fault. 
Here's the important part: when processor 2 wants to set the pte's dirty bit, it *rereads* the pte and *rechecks* the permission bits again. Even though it has a non-dirty TLB entry for that pte. That is how I read Ben LaHaise's description, and his test program tests exactly this. If the processor worked by atomically setting the dirty bit in the pte without rechecking the permissions when it reads that pte bit, then this scheme would fail and you'd be right about the lost dirty bits. I would have thought it would be simpler to implement a CPU this way, but clearly it is not as efficient for SMP OS design so perhaps CPU designers thought about this. The only remaining question is: is the observed behaviour defined for x86 CPUs in general, or are we depending on the results of testing a few particular CPUs? -- Jamie ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-15 17:47 ` x86 ptep_get_and_clear question Jamie Lokier @ 2001-02-15 18:05 ` Kanoj Sarcar 2001-02-15 18:23 ` Kanoj Sarcar 1 sibling, 0 replies; 41+ messages in thread From: Kanoj Sarcar @ 2001-02-15 18:05 UTC (permalink / raw) To: Jamie Lokier; +Cc: Ben LaHaise, linux-mm, mingo, alan, linux-kernel > > [Added Linus and linux-kernel as I think it's of general interest] > > Kanoj Sarcar wrote: > > Whether Jamie was trying to illustrate a different problem, I am not > > sure. > > Yes, I was talking about pte_test_and_clear_dirty in the earlier post. > > > Look in mm/mprotect.c. Look at the call sequence change_protection() -> ... > > change_pte_range(). Specifically at the sequence: > > > > entry = ptep_get_and_clear(pte); > > set_pte(pte, pte_modify(entry, newprot)); > > > > Go ahead and pull your x86 specs, and prove to me that between the > > ptep_get_and_clear(), which zeroes out the pte (specifically, when the > > dirty bit is not set), processor 2 can not come in and set the dirty > > bit on the in-memory pte. Which immediately gets overwritten by the > > set_pte(). For an example of how this can happen, look at my previous > > postings. > Now you are talking my language! > Let's see. We'll assume processor 2 does a write between the > ptep_get_and_clear and the set_pte, which are done on processor 1. > > Now, ptep_get_and_clear is atomic, so we can talk about "before" and > "after". Before it, either processor 2 has a TLB entry with the dirty > bit set, or it does not (it has either a clean TLB entry or no TLB entry > at all). > > After ptep_get_and_clear, processor 2 does a write. If it already has a > dirty TLB entry, then `entry' will also be dirty so the dirty bit is > preserved. If processor 2 does not have a dirty TLB entry, then it will > look up the pte. Processor 2 finds the pte is clear, so raises a page fault. > Spinlocks etc. sort everything out in the page fault. 
> > Here's the important part: when processor 2 wants to set the pte's dirty > bit, it *rereads* the pte and *rechecks* the permission bits again. > Even though it has a non-dirty TLB entry for that pte. > > That is how I read Ben LaHaise's description, and his test program tests > exactly this. > Okay, I asked Ben, he couldn't point me at specs and shut me up. > If the processor worked by atomically setting the dirty bit in the pte > without rechecking the permissions when it reads that pte bit, then this > scheme would fail and you'd be right about the lost dirty bits. I would Exactly. This is why I did not implement this scheme earlier when Alan and I talked about this scenario, almost a couple of years back. > have thought it would be simpler to implement a CPU this way, but > clearly it is not as efficient for SMP OS design so perhaps CPU > designers thought about this. > > The only remaining question is: is the observed behaviour defined for > x86 CPUs in general, or are we depending on the results of testing a few > particular CPUs? Exactly! So my claim still stands: ptep_get_and_clear() doesn't do what it claims to do. I would be more than happy if someone can give me logic to break this claim ... which would mean one longstanding data integrity problem on Linux has been fixed satisfactorily. Kanoj > > -- Jamie > ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-15 17:47 ` x86 ptep_get_and_clear question Jamie Lokier 2001-02-15 18:05 ` Kanoj Sarcar @ 2001-02-15 18:23 ` Kanoj Sarcar 2001-02-15 18:42 ` Jamie Lokier 2001-02-15 18:51 ` Manfred Spraul 1 sibling, 2 replies; 41+ messages in thread From: Kanoj Sarcar @ 2001-02-15 18:23 UTC (permalink / raw) To: Jamie Lokier; +Cc: Ben LaHaise, linux-mm, mingo, alan, linux-kernel > > [Added Linus and linux-kernel as I think it's of general interest] > > Kanoj Sarcar wrote: > > Whether Jamie was trying to illustrate a different problem, I am not > > sure. > > Yes, I was talking about pte_test_and_clear_dirty in the earlier post. > > > Look in mm/mprotect.c. Look at the call sequence change_protection() -> ... > > change_pte_range(). Specifically at the sequence: > > > > entry = ptep_get_and_clear(pte); > > set_pte(pte, pte_modify(entry, newprot)); > > > > Go ahead and pull your x86 specs, and prove to me that between the > > ptep_get_and_clear(), which zeroes out the pte (specifically, when the > > dirty bit is not set), processor 2 can not come in and set the dirty > > bit on the in-memory pte. Which immediately gets overwritten by the > > set_pte(). For an example of how this can happen, look at my previous > > postings. > > Let's see. We'll assume processor 2 does a write between the > ptep_get_and_clear and the set_pte, which are done on processor 1. > > Now, ptep_get_and_clear is atomic, so we can talk about "before" and > "after". Before it, either processor 2 has a TLB entry with the dirty > bit set, or it does not (it has either a clean TLB entry or no TLB entry > at all). > > After ptep_get_and_clear, processor 2 does a write. If it already has a > dirty TLB entry, then `entry' will also be dirty so the dirty bit is > preserved. If processor 2 does not have a dirty TLB entry, then it will > look up the pte. Processor 2 finds the pte is clear, so raises a page fault. > Spinlocks etc. sort everything out in the page fault. 
> > Here's the important part: when processor 2 wants to set the pte's dirty > bit, it *rereads* the pte and *rechecks* the permission bits again. > Even though it has a non-dirty TLB entry for that pte. > > That is how I read Ben LaHaise's description, and his test program tests > exactly this. Okay, I will quote from Intel Architecture Software Developer's Manual Volume 3: System Programming Guide (1997 print), section 3.7, page 3-27: "Bus cycles to the page directory and page tables in memory are performed only when the TLBs do not contain the translation information for a requested page." And on the same page: "Whenever a page directory or page table entry is changed (including when the present flag is set to zero), the operating system must immediately invalidate the corresponding entry in the TLB so that it can be updated the next time the entry is referenced." So, it looks highly unlikely to me that the basic assumption about how x86 works wrt tlb/ptes in the ptep_get_and_clear() solution is correct. Kanoj > > If the processor worked by atomically setting the dirty bit in the pte > without rechecking the permissions when it reads that pte bit, then this > scheme would fail and you'd be right about the lost dirty bits. I would > have thought it would be simpler to implement a CPU this way, but > clearly it is not as efficient for SMP OS design so perhaps CPU > designers thought about this. > > The only remaining question is: is the observed behaviour defined for > x86 CPUs in general, or are we depending on the results of testing a few > particular CPUs? > > -- Jamie > ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-15 18:23 ` Kanoj Sarcar @ 2001-02-15 18:42 ` Jamie Lokier 2001-02-15 18:57 ` Kanoj Sarcar 2001-02-15 18:51 ` Manfred Spraul 1 sibling, 1 reply; 41+ messages in thread From: Jamie Lokier @ 2001-02-15 18:42 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Ben LaHaise, linux-mm, mingo, alan, linux-kernel Kanoj Sarcar wrote: > > Here's the important part: when processor 2 wants to set the pte's dirty > > bit, it *rereads* the pte and *rechecks* the permission bits again. > > Even though it has a non-dirty TLB entry for that pte. > > > > That is how I read Ben LaHaise's description, and his test program tests > > exactly this. > > Okay, I will quote from Intel Architecture Software Developer's Manual > Volume 3: System Programming Guide (1997 print), section 3.7, page 3-27: > > "Bus cycles to the page directory and page tables in memory are performed > only when the TLBs do not contain the translation information for a > requested page." > > And on the same page: > > "Whenever a page directory or page table entry is changed (including when > the present flag is set to zero), the operating system must immediately > invalidate the corresponding entry in the TLB so that it can be updated > the next time the entry is referenced." > > So, it looks highly unlikely to me that the basic assumption about how > x86 works wrt tlb/ptes in the ptep_get_and_clear() solution is correct. To me those quotes don't address the question we're asking. We know that bus cycles _do_ occur when a TLB entry is switched from clean to dirty, and furthermore they are locked cycles. (Don't ask me how I know this though). Does that mean, in jargon, the TLB does not "contain the translation information" for a write? The second quote: sure, if we want the TLB updated we have to flush it. And eventually in mm/mprotect.c we do. But what before, it keeps on using the old TLB entry? That's ok. 
If the entry was already dirty then we don't mind if processor 2 continues with the old TLB entry for a while, until we do the big TLB range flush. In other words I don't think those two quotes address our question at all. What worries me more is that this is quite a subtle requirement, and the code in mm/mprotect.c is not specific to one architecture. Do all SMP CPUs supported by Linux do the same thing on converting TLB entries from clean to dirty, or do they have a subtle, easily missed data integrity problem? -- Jamie ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-15 18:42 ` Jamie Lokier @ 2001-02-15 18:57 ` Kanoj Sarcar 2001-02-15 19:06 ` Ben LaHaise 0 siblings, 1 reply; 41+ messages in thread From: Kanoj Sarcar @ 2001-02-15 18:57 UTC (permalink / raw) To: Jamie Lokier; +Cc: Ben LaHaise, linux-mm, mingo, alan, linux-kernel > > Kanoj Sarcar wrote: > > > Here's the important part: when processor 2 wants to set the pte's dirty > > > bit, it *rereads* the pte and *rechecks* the permission bits again. > > > Even though it has a non-dirty TLB entry for that pte. > > > > > > That is how I read Ben LaHaise's description, and his test program tests > > > exactly this. > > > > Okay, I will quote from Intel Architecture Software Developer's Manual > > Volume 3: System Programming Guide (1997 print), section 3.7, page 3-27: > > > > "Bus cycles to the page directory and page tables in memory are performed > > only when the TLBs do not contain the translation information for a > > requested page." > > > > And on the same page: > > > > "Whenever a page directory or page table entry is changed (including when > > the present flag is set to zero), the operating system must immediately > > invalidate the corresponding entry in the TLB so that it can be updated > > the next time the entry is referenced." > > > > So, it looks highly unlikely to me that the basic assumption about how > > x86 works wrt tlb/ptes in the ptep_get_and_clear() solution is correct. > > To me those quotes don't address the question we're asking. We know > that bus cycles _do_ occur when a TLB entry is switched from clean to > dirty, and furthermore they are locked cycles. (Don't ask me how I know > this though). > > Does that mean, in jargon, the TLB does not "contain > the translation information" for a write? > > The second quote: sure, if we want the TLB updated we have to flush it. > And eventually in mm/mprotect.c we do. But what before, it keeps on > using the old TLB entry? That's ok. 
If the entry was already dirty > then we don't mind if processor 2 continues with the old TLB entry for a > while, until we do the big TLB range flush. > > In other words I don't think those two quotes address our question at > all. Agreed. But these are the only relevant quotes I could come up with. And to me, these quotes make the ptep_get_and_clear() assumption look risky at best ... even though they do not give clear answers either way. > > What worries me more is that this is quite a subtle requirement, and the > code in mm/mprotect.c is not specific to one architecture. Do all SMP > CPUs supported by Linux do the same thing on converting TLB entries from > clean to dirty, or do they have a subtle, easily missed data integrity > problem? No, not all architectures have this problem. For example, if the Linux "dirty" (not the pte dirty) bit is managed by software, a fault will actually be taken when processor 2 tries to do the write. The fault is solely to make sure that the Linux "dirty" bit can be tracked. As long as the fault handler grabs the right locks before updating the Linux "dirty" bit, things should be okay. This is the case with mips, for example. The problem with x86 is that we depend on automatic x86 dirty bit update to manage the Linux "dirty" bit (they are the same!). So appropriate locks are not grabbed. Kanoj > > -- Jamie > ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-15 18:57 ` Kanoj Sarcar @ 2001-02-15 19:06 ` Ben LaHaise 2001-02-15 19:19 ` Kanoj Sarcar 0 siblings, 1 reply; 41+ messages in thread From: Ben LaHaise @ 2001-02-15 19:06 UTC (permalink / raw) To: Kanoj Sarcar; +Cc: Jamie Lokier, linux-mm, mingo, alan, linux-kernel On Thu, 15 Feb 2001, Kanoj Sarcar wrote: > No. All architectures do not have this problem. For example, if the > Linux "dirty" (not the pte dirty) bit is managed by software, a fault > will actually be taken when processor 2 tries to do the write. The fault > is solely to make sure that the Linux "dirty" bit can be tracked. As long > as the fault handler grabs the right locks before updating the Linux "dirty" > bit, things should be okay. This is the case with mips, for example. > > The problem with x86 is that we depend on automatic x86 dirty bit > update to manage the Linux "dirty" bit (they are the same!). So appropriate > locks are not grabbed. Will you please go off and prove that this "problem" exists on some x86 processor before continuing this rant? None of the PII, PIII, Athlon, K6-2 or 486s I checked exhibited the worrisome behaviour you're speculating about, plus it is logically consistent with the statements the manual does make about updating ptes; otherwise how could an smp os perform a reliable shootdown by doing an atomic bit clear on the present bit of a pte? -ben ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-15 19:06 ` Ben LaHaise @ 2001-02-15 19:19 ` Kanoj Sarcar 2001-02-15 20:16 ` Linus Torvalds 0 siblings, 1 reply; 41+ messages in thread From: Kanoj Sarcar @ 2001-02-15 19:19 UTC (permalink / raw) To: Ben LaHaise; +Cc: Jamie Lokier, linux-mm, mingo, alan, linux-kernel > > On Thu, 15 Feb 2001, Kanoj Sarcar wrote: > > > No. All architectures do not have this problem. For example, if the > > Linux "dirty" (not the pte dirty) bit is managed by software, a fault > > will actually be taken when processor 2 tries to do the write. The fault > > is solely to make sure that the Linux "dirty" bit can be tracked. As long > > as the fault handler grabs the right locks before updating the Linux "dirty" > > bit, things should be okay. This is the case with mips, for example. > > > > The problem with x86 is that we depend on automatic x86 dirty bit > > update to manage the Linux "dirty" bit (they are the same!). So appropriate > > locks are not grabbed. > > Will you please go off and prove that this "problem" exists on some x86 > processor before continuing this rant? None of the PII, PIII, Athlon, And will you please stop behaving like this is not an issue? > K6-2 or 486s I checked exhibited the worrisome behaviour you're And I maintain that this kind of race condition can not be tickled deterministically. There might be some piece of logic (or absence of it), that can show that your finding of a thousand runs is not relevant. > speculating about, plus it is logically consistent with the statements the > manual does make about updating ptes; otherwise how could an smp os Don't say this anymore, specially if you can not point me to the specs. > perform a reliable shootdown by doing an atomic bit clear on the present > bit of a pte? OS clears present bit, processors can keep using their TLBs and access the page, no problems at all. 
That is why after clearing the present bit, the processor must flush all tlbs before it can assume no one is using the page. Hardware updated access bit could also be a problem, but an error there does not destroy data, it just leads the os to choosing the wrong page to evict during memory pressure. Kanoj > > -ben > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux.eu.org/Linux-MM/ > ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-15 19:19 ` Kanoj Sarcar @ 2001-02-15 20:16 ` Linus Torvalds 0 siblings, 0 replies; 41+ messages in thread From: Linus Torvalds @ 2001-02-15 20:16 UTC (permalink / raw) To: linux-kernel In article <200102151919.LAA74131@google.engr.sgi.com>, Kanoj Sarcar <kanoj@google.engr.sgi.com> wrote: >> >> Will you please go off and prove that this "problem" exists on some x86 >> processor before continuing this rant? None of the PII, PIII, Athlon, > >And will you please stop behaving like this is not an issue? This is documented in at least Programming the 80386 John Crawford & Patrick Gelsinger which is still the best book I've ever seen on the x86 architecture. See page 477, "Memory management, Protection, and Tasks", under "Multiple-Processor Considerations". And I quote: "Before changing a page table entry that may be used on another procesor <sic>, software should use a locked AND instruction to clear the P bit to 0 in an indivisible operation. Then the entry can be changed as required, and made available by later setting the P bit to 1. At some point in the modification of a page table entry, all processors in the system that may have had the entry cached must be notified (usually with an interrupt) to flush their page translation caches to remove any old copies of the entry. Until these old copies are flushed, the processors can continue to access the old page, and may also set the D bit in the entry being modified. If this may cause the modification of the entry to fail, the paging caches should be flushed after the entry is marked not present, but before the entry is otherwise modified". Note the last sentence - that's the one that really matters to this discussion. 
And it does imply that the read-and-clear thing is not the right thing to do and is not guaranteed to fix the race (even if I personally suspect that all current x86 implementations will just re-walk the page tables and set the D bit the same way they set the A bit, and basically making the usage an "argument" to the page table walker logic). However, I suspect that we could extend it to just re-read the entry (which _should_ be zero, but could have the D bit set) after having invalidated the TLB on the other CPU's. But Gelsinger suggests just clearing the P bit - which is easily enough done, as the following modification would be needed anyway in mm/vmscan.c: pte = ptep_get_and_clear(page_table); flush_tlb_page(vma, address); + pte = ptep_update_after_flush(page_table, pte); where "ptep_update_after_flush()" would be a no-op on UP, and on SMP it would just or in the D bit (which should be practically always zero) from the page table entry into the pte. Just clearing the P bit actually makes "out_unlock_restore:" simpler: it becomes a simple lock ; orl $1, page_table which makes the worry about overwriting the D bit at that point go away (although, considering where we invalidate the TLB's and that we should now have had the correct D bits anyway, the non-locked simple store should also work reliably). The case of munmap() is more worrisome, and has much worse performance issues. Ben's experimental shootdown patch would appear to not be good enough. The only simple solution is the "gather" operation that I've already suggested because of its obvious correctness and simplicity. A potential alternative would be to walk the page tables twice, and make the page table zapping be a two-phase process. We'd only need to do this when the "mm->cpu_vm_mask" bits implied that other CPU's might have TLB entries, so we could avoid the double work for the normal case. 
HOWEVER, this is also the only case where a CPU "gather" operation would be necessary, so the thing basically boils down to the question of whether "gather" or "double walk" is the more expensive operation. The "gather" operation could possibly be improved to make the other CPU's do useful work while being shot down (ie schedule away to another mm), but that has its own pitfalls too. >OS clears present bit, processors can keep using their TLBs and access >the page, no problems at all. That is why after clearing the present bit, >the processor must flush all tlbs before it can assume no one is using >the page. Hardware updated access bit could also be a problem, but an >error there does not destroy data, it just leads the os to choosing the >wrong page to evict during memory pressure. The bible (see above) explicitly mentions only the D bit here - the A bit is set at page table walk time, and is explicitly NOT set if the P bit is clear at walk time, so there is no apparent race on that. But as the A bit isn't very important anyway, the apparent lack of a race is not all that interesting. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-15 18:23 ` Kanoj Sarcar 2001-02-15 18:42 ` Jamie Lokier @ 2001-02-15 18:51 ` Manfred Spraul 2001-02-15 19:05 ` Kanoj Sarcar 2001-02-15 19:07 ` Jamie Lokier 1 sibling, 2 replies; 41+ messages in thread From: Manfred Spraul @ 2001-02-15 18:51 UTC (permalink / raw) To: Kanoj Sarcar Cc: Jamie Lokier, Ben LaHaise, linux-mm, mingo, alan, linux-kernel Kanoj Sarcar wrote: > > Okay, I will quote from Intel Architecture Software Developer's Manual > Volume 3: System Programming Guide (1997 print), section 3.7, page 3-27: > > "Bus cycles to the page directory and page tables in memory are performed > only when the TLBs do not contain the translation information for a > requested page." > > And on the same page: > > "Whenever a page directory or page table entry is changed (including when > the present flag is set to zero), the operating system must immediately > invalidate the corresponding entry in the TLB so that it can be updated > the next time the entry is referenced." > But there is another paragraph that mentions that an OS may use lazy tlb shootdowns. [search for shootdown] You check the far too obvious chapters, remember that Intel wrote the documentation ;-) I searched for 'dirty' though Vol 3 and found Chapter 7.1.2.1 Automatic locking. .. the processor uses locked cycles to set the accessed and dirty flag in the page-directory and page-table entries. But that obviously doesn't answer your question. Is the sequence << lock; read pte pte |= dirty write pte >> end lock; or << lock; read pte if (!present(pte)) do_page_fault(); pte |= dirty write pte. >> end lock; -- Manfred ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-15 18:51 ` Manfred Spraul @ 2001-02-15 19:05 ` Kanoj Sarcar 2001-02-15 19:19 ` Jamie Lokier 2001-02-15 19:07 ` Jamie Lokier 1 sibling, 1 reply; 41+ messages in thread From: Kanoj Sarcar @ 2001-02-15 19:05 UTC (permalink / raw) To: Manfred Spraul Cc: Jamie Lokier, Ben LaHaise, linux-mm, mingo, alan, linux-kernel > > Kanoj Sarcar wrote: > > > > Okay, I will quote from Intel Architecture Software Developer's Manual > > Volume 3: System Programming Guide (1997 print), section 3.7, page 3-27: > > > > "Bus cycles to the page directory and page tables in memory are performed > > only when the TLBs do not contain the translation information for a > > requested page." > > > > And on the same page: > > > > "Whenever a page directory or page table entry is changed (including when > > the present flag is set to zero), the operating system must immediately > > invalidate the corresponding entry in the TLB so that it can be updated > > the next time the entry is referenced." > > > > But there is another paragraph that mentions that an OS may use lazy tlb > shootdowns. > [search for shootdown] > > You check the far too obvious chapters, remember that Intel wrote the > documentation ;-) :-) :-) The good part is, there are a lot of Intel folks now active on Linux, I can go off and ask one of them, if we are sufficiently confused. I am trying to see whether we are. > I searched for 'dirty' though Vol 3 and found > > Chapter 7.1.2.1 Automatic locking. > > .. the processor uses locked cycles to set the accessed and dirty flag > in the page-directory and page-table entries. > > But that obviously doesn't answer your question. > > Is the sequence > << lock; > read pte > pte |= dirty > write pte > >> end lock; > or > << lock; > read pte > if (!present(pte)) > do_page_fault(); > pte |= dirty > write pte. > >> end lock; No, it is a little more complicated. You also have to include in the tlb state into this algorithm. 
Since that is what we are talking about. Specifically, what does the processor do when it has a tlb entry allowing RW, the processor has only done reads using the translation, and the in-memory pte is clear? Kanoj > > -- > Manfred > ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-15 19:05 ` Kanoj Sarcar @ 2001-02-15 19:19 ` Jamie Lokier 2001-02-15 20:31 ` Linus Torvalds 0 siblings, 1 reply; 41+ messages in thread From: Jamie Lokier @ 2001-02-15 19:19 UTC (permalink / raw) To: Kanoj Sarcar Cc: Manfred Spraul, Ben LaHaise, linux-mm, mingo, alan, linux-kernel Kanoj Sarcar wrote: > > Is the sequence > > << lock; > > read pte > > pte |= dirty > > write pte > > >> end lock; > > or > > << lock; > > read pte > > if (!present(pte)) > > do_page_fault(); > > pte |= dirty > > write pte. > > >> end lock; > > No, it is a little more complicated. You also have to include in the > tlb state into this algorithm. Since that is what we are talking about. > Specifically, what does the processor do when it has a tlb entry allowing > RW, the processor has only done reads using the translation, and the > in-memory pte is clear? Yes (no to the no): Manfred's pseudo-code is exactly the question you're asking. Because when the TLB entry is non-dirty and you do a write, we _know_ the processor will do a locked memory cycle to update the dirty bit. A locked memory cycle implies read-modify-write, not "write TLB entry + dirty" (which would be a plain write) or anything like that. Given you know it's a locked cycle, the only sensible design from Intel is going to be one of Manfred's scenarios. An interesting thought experiment though is this: << lock; read pte pte |= dirty write pte >> end lock; if (!present(pte)) do_page_fault(); It would have a mighty odd effect wouldn't it? -- Jamie ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-15 19:19 ` Jamie Lokier @ 2001-02-15 20:31 ` Linus Torvalds 2001-02-15 21:26 ` Manfred Spraul 2001-02-15 23:57 ` Jamie Lokier 0 siblings, 2 replies; 41+ messages in thread From: Linus Torvalds @ 2001-02-15 20:31 UTC (permalink / raw) To: linux-kernel In article <20010215201945.A2505@pcep-jamie.cern.ch>, Jamie Lokier <lk@tantalophile.demon.co.uk> wrote: >> > << lock; >> > read pte >> > if (!present(pte)) >> > do_page_fault(); >> > pte |= dirty >> > write pte. >> > >> end lock; >> >> No, it is a little more complicated. You also have to include in the >> tlb state into this algorithm. Since that is what we are talking about. >> Specifically, what does the processor do when it has a tlb entry allowing >> RW, the processor has only done reads using the translation, and the >> in-memory pte is clear? > >Yes (no to the no): Manfred's pseudo-code is exactly the question you're >asking. Because when the TLB entry is non-dirty and you do a write, we >_know_ the processor will do a locked memory cycle to update the dirty >bit. A locked memory cycle implies read-modify-write, not "write TLB >entry + dirty" (which would be a plain write) or anything like that. > >Given you know it's a locked cycle, the only sensible design from Intel >is going to be one of Manfred's scenarios. Not necessarily, and this is NOT guaranteed by the docs I've seen. It _could_ be that the TLB data actually also contains the pointer to the place where it was fetched, and a "mark dirty" becomes read *ptr locked val |= D write *ptr unlock Now, I will agree that I suspect most x86 _implementations_ will not do this. TLB's are too timing-critical, and nobody tends to want to make them bigger than necessary - so saving off the source address is unlikely. 
Also, setting the D bit is not a very common operation, so it's easy enough to say that an internal D-bit-fault will just cause a TLB re-load, where the TLB re-load just sets the A and D bits as it fetches the entry (and then page fault handling is an automatic result of the reload). However, the _implementation_ detail is not, as far as I can tell, explicitly defined by the architecture. And in another post I quote a book by the designers of the original 80386 that implies strongly that the "re-walk the page tables on D miss" assumption is not what they _meant_ for the architecture design, even if they probably happened to implement it that way. >An interesting thought experiment though is this: > ><< lock; >read pte >pte |= dirty >write pte >>> end lock; >if (!present(pte)) > do_page_fault(); > >It would have a mighty odd effect wouldn't it? Why do you insist on the !present() check at all? It's not implied by the architecture - a correctly functioning OS is not supposed to ever be able to cause it according to specs. I think Kanoj is right to be worried. I _do_ believe that the current Linux code works on "all current hardware". But I think Kanoj has a valid point in that it's not guaranteed to work in the future. That said, I think Intel tends to be fairly pragmatic in their design (that's the nice way of saying that Intel CPU's tend to dismiss the notion of "beautiful architecture" completely over the notion of "let's make it work"). And I would be extremely surprised indeed if especially MS Windows didn't do some really bad things with the TLB. In fact, I think I can say from personal experience that I pretty much _know_ windows has big bugs in TLB invalidation. And because of that, it may be that nobody can ever create a x86-compatible CPU that does anything but "re-walk the TLB tables on _anything_ fishy going on with the TLB". 
(Basically, it seems to be pretty much a fact of life that the x86 architecture will NOT raise a page protection fault directly from the TLB content - it will re-walk the page tables before it actually raises the fault, and only the act of walking the page tables and finding that it really _should_ fault will raise an x86-level fault. It all boils down to "never trust the TLB more than you absolutely have to"). Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-15 20:31 ` Linus Torvalds @ 2001-02-15 21:26 ` Manfred Spraul 2001-02-15 21:29 ` Manfred Spraul 2001-02-16 1:21 ` Linus Torvalds 2001-02-15 23:57 ` Jamie Lokier 1 sibling, 2 replies; 41+ messages in thread From: Manfred Spraul @ 2001-02-15 21:26 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel Linus Torvalds wrote: > > In article <20010215201945.A2505@pcep-jamie.cern.ch>, > Jamie Lokier <lk@tantalophile.demon.co.uk> wrote: > >> > << lock; > >> > read pte > >> > if (!present(pte)) > >> > do_page_fault(); > >> > pte |= dirty > >> > write pte. > >> > >> end lock; > >> > >> No, it is a little more complicated. You also have to include in the > >> tlb state into this algorithm. Since that is what we are talking about. > >> Specifically, what does the processor do when it has a tlb entry allowing > >> RW, the processor has only done reads using the translation, and the > >> in-memory pte is clear? > > > >Yes (no to the no): Manfred's pseudo-code is exactly the question you're > >asking. Because when the TLB entry is non-dirty and you do a write, we > >_know_ the processor will do a locked memory cycle to update the dirty > >bit. A locked memory cycle implies read-modify-write, not "write TLB > >entry + dirty" (which would be a plain write) or anything like that. > > > >Given you know it's a locked cycle, the only sensible design from Intel > >is going to be one of Manfred's scenarios. > > Not necessarily, and this is NOT guaranteed by the docs I've seen. > > It _could_ be that the TLB data actually also contains the pointer to > the place where it was fetched, and a "mark dirty" becomes > > read *ptr locked > val |= D > write *ptr unlock > Jamie wrote "one of my scenarios", that's the other option ;-) > Now, I will agree that I suspect most x86 _implementations_ will not do > this. 
TLB's are too timing-critical, and nobody tends to want to make > them bigger than necessary - so saving off the source address is > unlikely. Also, setting the D bit is not a very common operation, so > it's easy enough to say that an internal D-bit-fault will just cause a > TLB re-load, where the TLB re-load just sets the A and D bits as it > fetches the entry (and then page fault handling is an automatic result > of the reload). > But then the cpu would support setting the D bit in the page directory, but it doesn't. Probably Kanoj is right, the current code is not guaranteed by the specs. But if we change the interface, could we think about the poor s390 developers? s390 only has a "clear the present bit in the pte and flush the tlb" instruction. From your other post: > pte = ptep_get_and_clear(page_table); > flush_tlb_page(vma, address); >+ pte = ptep_update_after_flush(page_table, pte); What about one arch specific pte = ptep_get_and_invalidate(vma, address, page_table); On i386 SMP it would { pte = *page_table_entry; if(!present(pte)) return pte; lock; andl 0xfffffffe, *page_table_entry; flush_tlb_page(); return *page_table_entry | 1; } > > The "gather" operation could possibly be improved to make the other > CPU's do useful work while being shot down (ie schedule away to another > mm), but that has its own pitfalls too. > IMHO scheduling away is the best long term solution. Perhaps try to schedule away, just to improve the probability that mm->cpu_vm_mask is clear. I just benchmarked a single flush_tlb_page(). Pentium II 350: ~ 2000 cpu ticks. Pentium III 850: ~ 3000 cpu ticks. -- Manfred ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-15 21:26 ` Manfred Spraul @ 2001-02-15 21:29 ` Manfred Spraul 2001-02-16 1:21 ` Linus Torvalds 1 sibling, 0 replies; 41+ messages in thread From: Manfred Spraul @ 2001-02-15 21:29 UTC (permalink / raw) To: Linus Torvalds, linux-kernel Manfred Spraul wrote: > > I just benchmarked a single flush_tlb_page(). > > Pentium II 350: ~ 2000 cpu ticks. > Pentium III 850: ~ 3000 cpu ticks. > I forgot the important part: SMP, including a smp_call_function() IPI. IIRC Ingo wrote that a local 'invlpg' is around 100 ticks. -- Manfred ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-15 21:26 ` Manfred Spraul 2001-02-15 21:29 ` Manfred Spraul @ 2001-02-16 1:21 ` Linus Torvalds 2001-02-16 14:18 ` Jamie Lokier 1 sibling, 1 reply; 41+ messages in thread From: Linus Torvalds @ 2001-02-16 1:21 UTC (permalink / raw) To: Manfred Spraul; +Cc: linux-kernel On Thu, 15 Feb 2001, Manfred Spraul wrote: > > > Now, I will agree that I suspect most x86 _implementations_ will not do > > this. TLB's are too timing-critical, and nobody tends to want to make > > them bigger than necessary - so saving off the source address is > > unlikely. Also, setting the D bit is not a very common operation, so > > it's easy enough to say that an internal D-bit-fault will just cause a > > TLB re-load, where the TLB re-load just sets the A and D bits as it > > fetches the entry (and then page fault handling is an automatic result > > of the reload). > > But then the cpu would support setting the D bit in the page directory, > but it doesn't. Not necessarily. The TLB walker is a nasty piece of business, and simplifying it as much as possible is important for hardware. Not setting the D bit in the page directory is likely to be because it is unnecessary, and not because it couldn't be done. > But if we change the interface, could we think about the poor s390 > developers? > > s390 only has a "clear the present bit in the pte and flush the tlb" > instruction. Now, that ends up being fairly close to what it seems mm/vmscan.c needs to do, so yes, it would not necessarily be a bad idea to join the "ptep_get_and_clear()" and "flush_tlb_page()" operations into one. However, the mm/memory.c use (ie region unmapping with zap_page_range()) really needs to do something different, because it inherently works with a range of entries, and abstracting it to be a per-entry thing would be really bad for performance anywhere else (S/390 might be ok with it, assuming that their special instruction is really fast - I don't know. 
But I do know that everybody else wants to do it with one single flush for the whole region, especially for SMP). > Perhaps try to schedule away, just to improve the probability that > mm->cpu_vm_mask is clear. > > I just benchmarked a single flush_tlb_page(). > > Pentium II 350: ~ 2000 cpu ticks. > Pentium III 850: ~ 3000 cpu ticks. Note that there is some room for concurrency here - we can fire off the IPI, and continue to do "local" work until we actually need the "results" in the form of stable D bits etc. So we _might_ want to take this into account in the interfaces: allow for a "prepare_to_gather()" which just sends the IPI but doesn't wait for it to necessarily get accepted, and then only by the time we actually start checking the dirty bits (ie the second phase, after we've invalidated the page tables) do we need to wait and make sure that nobody else is using the TLB any more. Done right, this _might_ be of the type - prepare_to_gather(): sends IPI to all CPU's indicated in mm->cpu_vm_mask - go on, invalidating all VM entries - busy-wait until "mm->cpu_vm_mask" only contains the local CPU (where the busy-wait is hopefully not a wait at all - the other CPU's would have exited the mm while we were cleaning up the page tables) - go back, gather up any potential dirty bits and free the pages - release the mm Note that there are tons of optimizations for the common case: for example, if we're talking about private read-only mappings, we can possibly skip some or all of this, because we know that we simply won't care about whether the pages were dirty or not as they're going to be thrown away in any case. So we can have several layers of optimizations: for UP or the SMP case where we have "mm->cpu_vm_mask & ~(1 << current_cpu) == 0" we don't need the IPI or the careful multi-CPU case at all. And for private stuff, we need the careful invalidation, but we don't need to go back and gather the dirty bits. 
So the only case that ends up being fairly heavy may be a case that is very uncommon in practice (only for unmapping shared mappings in threaded programs or the lazy TLB case). I suspect getting a good interface for this, so that zap_page_range() doesn't end up being the function from hell, is the most important thing. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-16 1:21 ` Linus Torvalds @ 2001-02-16 14:18 ` Jamie Lokier 2001-02-16 14:59 ` Manfred Spraul 0 siblings, 1 reply; 41+ messages in thread From: Jamie Lokier @ 2001-02-16 14:18 UTC (permalink / raw) To: Linus Torvalds; +Cc: Manfred Spraul, linux-kernel Linus Torvalds wrote: > So the only case that ends up being fairly heavy may be a case that is > very uncommon in practice (only for unmapping shared mappings in > threaded programs or the lazy TLB case). I can think of one case where performance is considered quite important: mprotect() is used by several garbage collectors, including threaded ones. Maybe mprotect() isn't the best primitive for those anyway, but it's what they have to work with atm. -- Jamie ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-16 14:18 ` Jamie Lokier @ 2001-02-16 14:59 ` Manfred Spraul 2001-02-16 15:27 ` Jamie Lokier 2001-02-16 17:59 ` Linus Torvalds 0 siblings, 2 replies; 41+ messages in thread From: Manfred Spraul @ 2001-02-16 14:59 UTC (permalink / raw) To: Jamie Lokier; +Cc: Linus Torvalds, linux-kernel Jamie Lokier wrote: > > Linus Torvalds wrote: > > So the only case that ends up being fairly heavy may be a case that is > > very uncommon in practice (only for unmapping shared mappings in > > threaded programs or the lazy TLB case). > The lazy tlb case is quite fast: lazy tlb threads never write to user space pages, we don't need to protect the dirty bits. And the first ipi clears mm->cpu_vm_mask, only one ipi. > > I can think of one case where performance is considered quite important: > mprotect() is used by several garbage collectors, including threaded > ones. Maybe mprotect() isn't the best primitive for those anyway, but > it's what they have to work with atm. > Does mprotect() actually care for wrong dirty bits? The race should be invisible to user space apps. >>>>>>> mprotect() for_all_affected_ptes() { lock andl ~PERMISSION_MASK, *pte; lock orl new_permission, *pte; } < now another cpu could still write to the write protected pages < and set the dirty bit, but who cares? Shouldn't be a problem. flush_tlb_range(). < tlb flush before ending the syscall, user space can't notice < the delay. <<<< -- Manfred ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-16 14:59 ` Manfred Spraul @ 2001-02-16 15:27 ` Jamie Lokier 2001-02-16 15:54 ` Manfred Spraul 0 siblings, 1 reply; 41+ messages in thread From: Jamie Lokier @ 2001-02-16 15:27 UTC (permalink / raw) To: Manfred Spraul; +Cc: Linus Torvalds, linux-kernel Manfred Spraul wrote: > > I can think of one case where performance is considered quite important: > > mprotect() is used by several garbage collectors, including threaded > > ones. Maybe mprotect() isn't the best primitive for those anyway, but > > it's what they have to work with atm. > > Does mprotect() actually care for wrong dirty bits? > The race should be invisible to user space apps. > > >>>>>>> mprotect() > for_all_affected_ptes() { > lock andl ~PERMISSION_MASK, *pte; > lock orl new_permission, *pte; > } > < now another cpu could still write to the write protected pages > < and set the dirty bit, but who cares? Shouldn't be a problem. > flush_tlb_range(). > < tlb flush before ending the syscall, user space can't notice > < the delay. > <<<< The user-space app doesn't even _know_ about dirty bits. I don't think there's even the possibility of losing dirty bits with mprotect(), so long as pte_modify doesn't clear the dirty bit, which it doesn't, in this code: /* mprotect.c */ entry = ptep_get_and_clear(pte); set_pte(pte, pte_modify(entry, newprot)); I.e. the only code with the race condition is code which explicitly clears the dirty bit, in vmscan.c. Do you see any possibility of losing a dirty bit here? If not, there's no need for the intricate "gather" or "double scan" schemes for mprotect() and it can stay as fast as possible. Btw, a possible mprotect optimisation: there is no need for flush_tlb_range() when increasing permissions. -- Jamie ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-16 15:27 ` Jamie Lokier @ 2001-02-16 15:54 ` Manfred Spraul 2001-02-16 16:00 ` Jamie Lokier 0 siblings, 1 reply; 41+ messages in thread From: Manfred Spraul @ 2001-02-16 15:54 UTC (permalink / raw) To: Jamie Lokier; +Cc: Linus Torvalds, linux-kernel Jamie Lokier wrote: > > /* mprotect.c */ > entry = ptep_get_and_clear(pte); > set_pte(pte, pte_modify(entry, newprot)); > > I.e. the only code with the race condition is code which explicitly > clears the dirty bit, in vmscan.c. > > Do you see any possibility of losing a dirty bit here? > Of course. Just check the output after preprocessing. It's int entry; entry = *pte; entry &= ~_PAGE_CHG_MASK; entry |= pgprot_val(newprot); *pte = entry; We need atomic_clear_mask (_PAGE_CHG_MASK, pte); atomic_set_mask (pgprot_val(newprot), *pte); for multi threaded apps. > If not, there's no need for the intricate "gather" or "double scan" > schemes for mprotect() and it can stay as fast as possible. > Correct, but we need a platform specific "update_pte", and perhaps update_begin, update_end hooks (empty on i386) for other archs. -- Manfred ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-16 15:54 ` Manfred Spraul @ 2001-02-16 16:00 ` Jamie Lokier 2001-02-16 16:23 ` Manfred Spraul 0 siblings, 1 reply; 41+ messages in thread From: Jamie Lokier @ 2001-02-16 16:00 UTC (permalink / raw) To: Manfred Spraul; +Cc: Linus Torvalds, linux-kernel Manfred Spraul wrote: > > entry = ptep_get_and_clear(pte); > > set_pte(pte, pte_modify(entry, newprot)); > > > > I.e. the only code with the race condition is code which explicitly > > clears the dirty bit, in vmscan.c. > > > > Do you see any possibility of losing a dirty bit here? > > > Of course. > Just check the output after preprocessing. > It's > int entry; > entry = *pte; > entry &= ~_PAGE_CHG_MASK; > entry |= pgprot_val(newprot) > *pte = entry; And how does that lose a dirty bit? For the other processor to not write a dirty bit, it must have a dirty TLB entry already which, along with the locked cycle in ptep_get_and_clear, means that `entry' will have _PAGE_DIRTY set. The dirty bit is not lost. > We need > atomic_clear_mask (_PAGE_CHG_MASK, pte); > atomic_set_mask (pgprot_val(newprot), *pte); > > for multi threaded apps. cmpxchg is probably faster. -- Jamie ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-16 16:00 ` Jamie Lokier @ 2001-02-16 16:23 ` Manfred Spraul 2001-02-16 16:43 ` Jamie Lokier 0 siblings, 1 reply; 41+ messages in thread From: Manfred Spraul @ 2001-02-16 16:23 UTC (permalink / raw) To: Jamie Lokier, Linus Torvalds, linux-kernel Jamie Lokier wrote: > > And how does that lose a dirty bit? > > For the other processor to not write a dirty bit, it must have a dirty ^^^^^^^^^^^ > TLB entry already which, along with the locked cycle in > ptep_get_and_clear, means that `entry' will have _PAGE_DIRTY set. The > dirty bit is not lost. > The other cpu writes the dirty bit - we just overwrite it ;-) After the ptep_get_and_clear(), before the set_pte(). The current assumption about the page dirty logic is: A cpu that has a writable, non-dirty pte cached in its tlb may unconditionally set the dirty bit - without honoring present or write protected bits. --> set_pte() can either lose a dirty bit or a 'pte_none() entry' could suddenly become a swap entry unless it's guaranteed that no cpu has a cached valid tlb entry. Linus, does the proposed pte gather code handle the second part? pte_none() suddenly becomes 0x0040. Back to the current mprotect.c code: pte is writable, not-dirty. cpu1: has a writable, non-dirty pte in its tlb. cpu 2: in mprotect.c entry = ptep_get_and_clear(pte); * pte now clear. * entry contains the pte value without the dirty bit cpu decodes a write instruction, and dirties the pte. lock; orl DIRTY_BIT, *pte set_pte(pte, pte_modify(entry, newprot)); * pte overwritten with entry. --> dirty bit lost. -- Manfred ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-16 16:23 ` Manfred Spraul @ 2001-02-16 16:43 ` Jamie Lokier 2001-02-16 17:12 ` Manfred Spraul 2001-02-16 17:29 ` Ben LaHaise 0 siblings, 2 replies; 41+ messages in thread From: Jamie Lokier @ 2001-02-16 16:43 UTC (permalink / raw) To: Manfred Spraul; +Cc: Linus Torvalds, linux-kernel, bcrl Manfred Spraul wrote: > The other cpu writes the dirty bit - we just overwrite it ;-) > After the ptep_get_and_clear(), before the set_pte(). Ah, I see. The other CPU does an atomic *pte |= _PAGE_DIRTY, without checking the present bit. ('scuse me for temporary brain failure). How about a pragmatic solution. Given that Ben's found that "checks pte_present on dirtying" works in practice, and it is _much_ simpler to do things that way, perhaps we could write a boot time test for this? If the boot time test fails, we (a) printk("Sorry we've never seen a CPU like this, please report"); (b) Put this in ptep_get_and_clear: if (tlb_dirty_doesnt_sync) flush_tlb_page(page) It should be fast on known CPUs, correct on unknown ones, and much simpler than "gather" code which may be completely unnecessary and rather difficult to test. If anyone reports the message, _then_ we think about the problem some more. Ben, fancy writing a boot-time test? -- Jamie ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-16 16:43 ` Jamie Lokier @ 2001-02-16 17:12 ` Manfred Spraul 2001-02-16 17:20 ` Jamie Lokier 2001-02-16 17:37 ` Jamie Lokier 1 sibling, 2 replies; 41+ messages in thread From: Manfred Spraul @ 2001-02-16 17:12 UTC (permalink / raw) To: Jamie Lokier; +Cc: Linus Torvalds, linux-kernel, bcrl Jamie Lokier wrote: > > Manfred Spraul wrote: > > The other cpu writes the dirty bit - we just overwrite it ;-) > > After the ptep_get_and_clear(), before the set_pte(). > > Ah, I see. The other CPU does an atomic *pte |= _PAGE_DIRTY, without > checking the present bit. ('scuse me for temporary brain failure). > > How about a pragmatic solution. > Ok, is there one case where your pragmatic solution is vastly faster? * mprotect: No. The difference is at most one additional locked instruction for each pte. * munmap(anon): No. We must handle delayed accesses anyway (don't call free_pages_ok() until flush_tlb_ipi returned). The difference is that we might have to perform a second pass to clear any spurious 0x40 bits. * munmap(file): No. Second pass required for correct msync behaviour. * try_to_swap_out(): No. another memory read. Any other cases? > > Ben, fancy writing a boot-time test? > I'd never rely on such a test - what if the cpu checks in 99% of the cases, but doesn't handle some cases ('rep movd', everything unaligned, ...). And check the Pentium III erratas. There is one with the tlb that's only triggered if 4 instructions lie in a certain window and all access memory in the same way of the tlb (EFLAGS incorrect if 'andl mask,<memory_addr>' causes page fault)). -- Manfred ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-16 17:12 ` Manfred Spraul @ 2001-02-16 17:20 ` Jamie Lokier 2001-02-16 17:36 ` Linus Torvalds 1 sibling, 1 reply; 41+ messages in thread From: Jamie Lokier @ 2001-02-16 17:20 UTC (permalink / raw) To: Manfred Spraul; +Cc: Linus Torvalds, linux-kernel, bcrl Manfred Spraul wrote: > Ok, is there one case where your pragmatic solution is vastly faster? > * mprotect: No. The difference is at most one additional locked > instruction for each pte. Oh, what instruction is that? > * munmap(anon): No. We must handle delayed accesses anyway (don't call > free_pages_ok() until flush_tlb_ipi returned). The difference is that we > might have to perform a second pass to clear any spurious 0x40 bits. That second pass is what I had in mind. > * munmap(file): No. Second pass required for correct msync behaviour. It is? -- Jamie ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-16 17:20 ` Jamie Lokier @ 2001-02-16 17:36 ` Linus Torvalds 2001-02-16 18:49 ` Manfred Spraul 0 siblings, 1 reply; 41+ messages in thread From: Linus Torvalds @ 2001-02-16 17:36 UTC (permalink / raw) To: Jamie Lokier; +Cc: Manfred Spraul, linux-kernel, bcrl On Fri, 16 Feb 2001, Jamie Lokier wrote: > Manfred Spraul wrote: > > Ok, is there one case where your pragmatic solution is vastly faster? > > > * mprotect: No. The difference is at most one additional locked > > instruction for each pte. > > Oh, what instruction is that? The "set_pte()" thing could easily be changed into lock ; orl pte,(ptepointer) which actually should work as-is. We do not allow "set_pte()" on anything but "pte_none()" entries anyway, so in the trivial case the "orl" is exactly equivalent to a "movl". And in the (so far theoretical) case where another CPU might have set the dirty bit, the locked "or" will again do the right thing, and preserve it. So that would basically be a one-liner that removes the set_pte() race for mprotect() (and the vmscan.c case of re-establishing the pte, but as vmscan needs to do something more anyway that part is probably not interesting). > > * munmap(anon): No. We must handle delayed accesses anyway (don't call > > free_pages_ok() until flush_tlb_ipi returned). The difference is that we > > might have to perform a second pass to clear any spurious 0x40 bits. > > That second pass is what I had in mind. > > > * munmap(file): No. Second pass required for correct msync behaviour. > > It is? Not now it isn't. We just do a msync() + fsync() for msync(MS_SYNC). Which is admittedly not optimal, but it works. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-16 17:36 ` Linus Torvalds @ 2001-02-16 18:49 ` Manfred Spraul 2001-02-16 19:00 ` Linus Torvalds 2001-02-16 19:02 ` Ben LaHaise 0 siblings, 2 replies; 41+ messages in thread From: Manfred Spraul @ 2001-02-16 18:49 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jamie Lokier, linux-kernel, bcrl Linus wrote: > > > > > That second pass is what I had in mind. > > > > > * munmap(file): No. Second pass required for correct msync behaviour. > > > > It is? > > Not now it isn't. We just do a msync() + fsync() for msync(MS_SYNC). Which > is admittedly not optimal, but it works. > Ok, munmap() will be fixed by the tlb shootdown changes - it also uses zap_page_range(). That leaves msync() - it currently does a flush_tlb_page() for every single dirty page. Is it possible to integrate that into the mmu gather code? tlb_transfer_dirty() in addition to tlb_clear_page()? -- Manfred ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-16 18:49 ` Manfred Spraul @ 2001-02-16 19:00 ` Linus Torvalds 0 siblings, 0 replies; 41+ messages in thread From: Linus Torvalds @ 2001-02-16 19:00 UTC (permalink / raw) To: Manfred Spraul; +Cc: Jamie Lokier, linux-kernel, bcrl On Fri, 16 Feb 2001, Manfred Spraul wrote: > > That leaves msync() - it currently does a flush_tlb_page() for every > single dirty page. > Is it possible to integrate that into the mmu gather code? Not even necessary. The D bit does not have to be coherent. We need to make sure that we flush the TLB before we start the IO on the pages which clears the per-physical D bit (so that no CPU will have done any modifications that didn't show up in one of the D bits - whether virtual in the page tables or physical in the memory map), but there are no other real requirements. So you don't strictly need to gather them at all, although right now with the type of setup we have I suspect it's hard to actually implement any other way (because msync doesn't necessarily know when the IO has been physically started and has no good way of hooking into it..). Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-16 18:49 ` Manfred Spraul 2001-02-16 19:00 ` Linus Torvalds @ 2001-02-16 19:02 ` Ben LaHaise 2001-02-16 19:32 ` Linus Torvalds 1 sibling, 1 reply; 41+ messages in thread From: Ben LaHaise @ 2001-02-16 19:02 UTC (permalink / raw) To: Manfred Spraul; +Cc: Linus Torvalds, Jamie Lokier, linux-kernel On Fri, 16 Feb 2001, Manfred Spraul wrote: > That leaves msync() - it currently does a flush_tlb_page() for every > single dirty page. > Is it possible to integrate that into the mmu gather code? > > tlb_transfer_dirty() in addition to tlb_clear_page()? Actually, in the filemap_sync case, the flush_tlb_page is redundant -- there's already a call to flush_tlb_range in filemap_sync after the dirty bits are cleared. None of the cpus we support document having a writeback tlb, and intel's docs explicitly state that they do not as they state that the dirty bit is updated on the first write to dirty the pte. -ben ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-16 19:02 ` Ben LaHaise @ 2001-02-16 19:32 ` Linus Torvalds 2001-02-16 19:42 ` Ben LaHaise 0 siblings, 1 reply; 41+ messages in thread From: Linus Torvalds @ 2001-02-16 19:32 UTC (permalink / raw) To: Ben LaHaise; +Cc: Manfred Spraul, Jamie Lokier, linux-kernel On Fri, 16 Feb 2001, Ben LaHaise wrote: > > Actually, in the filemap_sync case, the flush_tlb_page is redundant -- > there's already a call to flush_tlb_range in filemap_sync after the dirty > bits are cleared. This is not enough. If another CPU has started write-out of one of the dirty pages (which, as far as I can tell, is certainly unlikely but not impossible) while we were still handling other dirty pages, that other CPU might clear the physical dirty bit of that page while a third CPU (or the same writer, but that makes the timing even _more_ unlikely) is still using a stale "dirty" TLB entry and writing to the page (and not updating the virtual dirty bit because it doesn't know that it has already been cleared). So you have to somehow guarantee that you invalidate the TLB's before the dirty bit from the "struct page" can be cleared (which in turn has to happen before the writeout). That can obviously be done with the tlb range flushing, but it needs more locking. This is, actually, a problem that I suspect ends up being _very_ similar to the zap_page_range() case. zap_page_range() needs to make sure that everything has been updated by the time the page is actually free'd. While filemap_sync() needs to make sure that everything has been updated before the page is written out (or marked dirty - which obviously also guarantees the ordering, and makes the problems look even more similar). Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-16 19:32 ` Linus Torvalds @ 2001-02-16 19:42 ` Ben LaHaise 0 siblings, 0 replies; 41+ messages in thread From: Ben LaHaise @ 2001-02-16 19:42 UTC (permalink / raw) To: Linus Torvalds; +Cc: Manfred Spraul, Jamie Lokier, linux-kernel On Fri, 16 Feb 2001, Linus Torvalds wrote: > This is, actually, a problem that I suspect ends up being _very_ similar > to the zap_page_range() case. zap_page_range() needs to make sure that > everything has been updated by the time the page is actually free'd. While > filemap_sync() needs to make sure that everything has been updated before > the page is written out (or marked dirty - which obviously also guarantees > the ordering, and makes the problems look even more similar). Ah, I see what I was missing. So long as the tlb flush is in between the ptep_test_and_clear_dirty and the set_page_dirty, we're fine (ie the current code is good). If we really want to reduce the number of tlb flushes, yes, we can use the gather code and then just do the set_page_dirty after a tlb_flush_range. -ben ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-16 17:12 ` Manfred Spraul 2001-02-16 17:20 ` Jamie Lokier @ 2001-02-16 17:37 ` Jamie Lokier 2001-02-16 18:04 ` Manfred Spraul 2001-02-16 18:36 ` Hugh Dickins 1 sibling, 2 replies; 41+ messages in thread From: Jamie Lokier @ 2001-02-16 17:37 UTC (permalink / raw) To: Manfred Spraul; +Cc: Linus Torvalds, linux-kernel, bcrl > > Ben, fancy writing a boot-time test? > > > I'd never rely on such a test - what if the cpu checks in 99% of the > cases, but doesn't handle some cases ('rep movd', everything unaligned, > ...). A good point. The test results are inconclusive. > And check the Pentium III erratas. There is one with the tlb > that's only triggered if 4 instructions lie in a certain window and all > access memory in the same way of the tlb (EFLAGS incorrect if 'andl > mask,<memory_addr>' causes page fault)). Nasty, but I don't see what an obscure and impossible to work around processor bug has to do with this thread. It doesn't actually change page fault handling, does it? -- Jamie ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question 2001-02-16 17:37 ` Jamie Lokier @ 2001-02-16 18:04 ` Manfred Spraul 2001-02-16 18:09 ` Jamie Lokier 1 sibling, 1 reply; 41+ messages in thread From: Manfred Spraul @ 2001-02-16 18:04 UTC (permalink / raw) To: Jamie Lokier; +Cc: Linus Torvalds, linux-kernel, bcrl Jamie Lokier wrote: > > > > Ben, fancy writing a boot-time test? > > > > > I'd never rely on such a test - what if the cpu checks in 99% of the > > cases, but doesn't handle some cases ('rep movd', everything unaligned, > > ...). > > A good point. The test results are inconclusive. > > > And check the Pentium III erratas. There is one with the tlb > > that's only triggered if 4 instructions lie in a certain window and all > > access memory in the same way of the tlb (EFLAGS incorrect if 'andl > > mask,<memory_addr>' causes page fault)). > > Nasty, but I don't see what an obscure and impossible to work around > processor bug has to do with this thread. It doesn't actually change > page fault handling, does it? > Page fault handling is unchanged, but perhaps there are other races. And note that these races wouldn't be processor bugs - the spec nowhere guarantees the behaviour you assume. Ben tries to prove that the current cpu _never_ sets the dirty bit without checking the present bit. A very simple test might be cpu 1: cli(); a = 0; b = 0; m = 0; flush_local_tlb_page(a); flush_local_tlb_page(b); flush_local_tlb_page(a); while(!m); while (!a && !b); a = 1; cpu 2: <wait> cli(); mark both ptes for a and b as writable, not dirty. m = 1; udelay(100); change the pte of a to not present. wmb(); b = 1; Now start with variants: change to read only instead of not present a and b in the same way of the tlb, in a different way. change pte with write, change with lock; . . . But you'll never prove that you tested every combination. -- Manfred ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question
  2001-02-16 18:04               ` Manfred Spraul
@ 2001-02-16 18:09                 ` Jamie Lokier
  0 siblings, 0 replies; 41+ messages in thread
From: Jamie Lokier @ 2001-02-16 18:09 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Linus Torvalds, linux-kernel, bcrl

Manfred Spraul wrote:
> A very simple test might be
>
> cpu 1:
> cpu 2:

Ben's test uses only one CPU.

> Now start with variants:
>         change to read only instead of not present;
>         a and b in the same way of the tlb, in a different way;
>         change pte with write, change with lock;
>         .
>         .
>         .
>
> But you'll never prove that you tested every combination.

Indeed.

-- Jamie

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question
  2001-02-16 17:37             ` Jamie Lokier
  2001-02-16 18:04               ` Manfred Spraul
@ 2001-02-16 18:36               ` Hugh Dickins
  1 sibling, 0 replies; 41+ messages in thread
From: Hugh Dickins @ 2001-02-16 18:36 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Manfred Spraul, Linus Torvalds, linux-kernel, bcrl

On Fri, 16 Feb 2001, Jamie Lokier wrote:
> > And check the Pentium III errata.  There is one with the tlb that's
> > only triggered if 4 instructions lie in a certain window and all
> > access memory in the same way of the tlb (EFLAGS incorrect if 'andl
> > mask,<memory_addr>' causes a page fault).
>
> Nasty, but I don't see what an obscure and impossible-to-work-around
> processor bug has to do with this thread.  It doesn't actually change
> page fault handling, does it?

Obscure but not nasty: the copy of EFLAGS pushed onto the stack when
taking the fault is wrong, but once the instruction is restarted it
all sorts itself out (as I understand from the Spec Update).  Possible
to work around, but just not worth the effort.

Nastier was its precursor, Pentium Pro Erratum #63, generated under
similar conditions: there the wrong (carry bit of) EFLAGS when
faulting in the middle of ADC, SBB, RCR or RCL could cause a wrong
arithmetic result when restarted.  Perfectly possible to work around
(only lower the permissions of a pte visible on another CPU while that
CPU is pulled into the kernel with an IPI), and it was necessary to
work around back then (4 years ago) when the Pentium Pro was at the
leading edge; but I doubt it's worth redesigning now to suit an old
erratum.

These errata do make the point that, whatever the x86 specs say should
happen, Intel sometimes fails to match them; and the SMP TLB area was
certainly prone to errata at the time of the Pentium Pro - but
hopefully that means Intel exercise greater care there now.

Hugh

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question
  2001-02-16 16:43           ` Jamie Lokier
  2001-02-16 17:12             ` Manfred Spraul
@ 2001-02-16 17:29             ` Ben LaHaise
  2001-02-16 17:38               ` Linus Torvalds
  1 sibling, 1 reply; 41+ messages in thread
From: Ben LaHaise @ 2001-02-16 17:29 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Manfred Spraul, Linus Torvalds, linux-kernel

On Fri, 16 Feb 2001, Jamie Lokier wrote:
> It should be fast on known CPUs, correct on unknown ones, and much
> simpler than "gather" code which may be completely unnecessary and
> rather difficult to test.
>
> If anyone reports the message, _then_ we think about the problem some
> more.
>
> Ben, fancy writing a boot-time test?

Sure, I'll whip one up this afternoon.

		-ben

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question
  2001-02-16 17:29             ` Ben LaHaise
@ 2001-02-16 17:38               ` Linus Torvalds
  2001-02-16 17:44                 ` Ben LaHaise
  0 siblings, 1 reply; 41+ messages in thread
From: Linus Torvalds @ 2001-02-16 17:38 UTC (permalink / raw)
  To: Ben LaHaise; +Cc: Jamie Lokier, Manfred Spraul, linux-kernel

On Fri, 16 Feb 2001, Ben LaHaise wrote:
> On Fri, 16 Feb 2001, Jamie Lokier wrote:
> > It should be fast on known CPUs, correct on unknown ones, and much
> > simpler than "gather" code which may be completely unnecessary and
> > rather difficult to test.
> >
> > If anyone reports the message, _then_ we think about the problem
> > some more.
> >
> > Ben, fancy writing a boot-time test?
>
> Sure, I'll whip one up this afternoon.

How do you expect to ever see this in practice?  Sounds basically
impossible to test for this hardware race.  The obvious "try to dirty
as fast as possible on one CPU while doing an atomic get-and-clear on
the other" thing is not valid - it's in fact quite likely to get into
lock-step because of page table cache movement synchronization.  And
as such it could hide any race.

		Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question
  2001-02-16 17:38               ` Linus Torvalds
@ 2001-02-16 17:44                 ` Ben LaHaise
  0 siblings, 0 replies; 41+ messages in thread
From: Ben LaHaise @ 2001-02-16 17:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jamie Lokier, Manfred Spraul, linux-kernel

On Fri, 16 Feb 2001, Linus Torvalds wrote:
> How do you expect to ever see this in practice?  Sounds basically
> impossible to test for this hardware race.  The obvious "try to dirty
> as fast as possible on one CPU while doing an atomic get-and-clear on
> the other" thing is not valid - it's in fact quite likely to get into
> lock-step because of page table cache movement synchronization.  And
> as such it could hide any race.

That's not the behaviour I'm testing, but whether the CPU is doing

	lock
	pte = *ptep
	if (present && writable)
		pte |= dirty
	*ptep = pte
	unlock

versus

	lock
	pte = *ptep
	pte |= dirty
	*ptep = pte
	unlock

This can be tested by getting the pte into the tlb, then changing the
pte without flushing and observing the result (page fault vs changed
pte).  I'm willing to bet that all cpus are doing the first version.

		-ben

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question
  2001-02-16 14:59         ` Manfred Spraul
  2001-02-16 15:27           ` Jamie Lokier
@ 2001-02-16 17:59           ` Linus Torvalds
  1 sibling, 0 replies; 41+ messages in thread
From: Linus Torvalds @ 2001-02-16 17:59 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Jamie Lokier, linux-kernel

On Fri, 16 Feb 2001, Manfred Spraul wrote:
> Jamie Lokier wrote:
> >
> > Linus Torvalds wrote:
> > > So the only case that ends up being fairly heavy may be a case
> > > that is very uncommon in practice (only for unmapping shared
> > > mappings in threaded programs or the lazy TLB case).
>
> The lazy tlb case is quite fast: a lazy tlb thread never writes to
> user space pages, so we don't need to protect the dirty bits.  And
> the first ipi clears mm->cpu_vm_mask, so there is only one ipi.

This is NOT necessarily true in the generic case.  The lazy TLB thread
itself may not write to the address space, but I can in theory see a
hardware implementation that delays writing out the dirty bit from the
TLB until it is invalidated.

I agree that it is unlikely, especially on an x86, but I think it's a
case we should at least think about for the generic kernel
architecture.  Think of the TLB as a cache, and think of the dirty
state as being either write-through or write-back.

Now, I will bet you that all current x86's

 (a) _do_ actually check the P bit when writing D (ie current Linux
     code is probably fine as-is, even if incorrect in theory)

and

 (b) the D bit is write-through.

But even so, I want people to at least consider the case of a
write-back TLB dirty bit, in which case the real state of the D bit
might not be known until a TLB flush has been done (even on a UP
machine - which is why I'm certain that no current x86 actually does
this optimization).

(And because of (a), I don't think I'll necessarily fix this during
2.4.x anyway, unless it gets fixed as a result of the generic TLB
shootdown issue, which has nothing at all to do with the D bit.)

Don't get too hung up on implementation details when designing a good
architecture for this thing.

		Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question
  2001-02-15 20:31       ` Linus Torvalds
  2001-02-15 21:26         ` Manfred Spraul
@ 2001-02-15 23:57         ` Jamie Lokier
  2001-02-16  0:55           ` Linus Torvalds
  1 sibling, 1 reply; 41+ messages in thread
From: Jamie Lokier @ 2001-02-15 23:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds wrote:
> It _could_ be that the TLB data actually also contains the pointer to
> the place where it was fetched, and a "mark dirty" becomes
>
>	read *ptr locked
>	val |= D
>	write *ptr unlock

If you want to take it really far, it _could_ be that the TLB data
contains both the pointer and the original pte contents.  Then "mark
dirty" becomes

	val |= D
	write *ptr

> Now, I will agree that I suspect most x86 _implementations_ will not
> do this.  TLB's are too timing-critical, and nobody tends to want to
> make them bigger than necessary - so saving off the source address is
> unlikely.

Then again, these hypothetical addresses etc. aren't part of the
associative lookup, so could be located in something like an ordinary
cache ram, with just an index in the TLB itself.

-- Jamie

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question
  2001-02-15 23:57         ` Jamie Lokier
@ 2001-02-16  0:55           ` Linus Torvalds
  0 siblings, 0 replies; 41+ messages in thread
From: Linus Torvalds @ 2001-02-16  0:55 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

On Fri, 16 Feb 2001, Jamie Lokier wrote:
>
> If you want to take it really far, it _could_ be that the TLB data
> contains both the pointer and the original pte contents.  Then "mark
> dirty" becomes
>
>	val |= D
>	write *ptr

No.  This is forbidden by the intel documentation.

First off, the documentation clearly states that it's a locked r-m-w
cycle.  Secondly, the documentation also makes it clear that the CPU
page table accesses work correctly in SMP environments, which the
above simply would not do.  It doesn't allow for people marking the
entry invalid, which is documented to work (see the very part I
quoted).

So while the above could be a valid TLB writeback strategy in general
for some hypothetical architecture, it would _not_ be an x86 CPU any
more if it acted that way.  So a plain "just write out our cached
value" is definitely not legal.

> > Now, I will agree that I suspect most x86 _implementations_ will
> > not do this.  TLB's are too timing-critical, and nobody tends to
> > want to make them bigger than necessary - so saving off the source
> > address is unlikely.
>
> Then again, these hypothetical addresses etc. aren't part of the
> associative lookup, so could be located in something like an ordinary
> cache ram, with just an index in the TLB itself.

True.  I'd still consider it unlikely for the other reasons (ie this
is not a timing-critical part of the normal CPU behaviour), but you're
right - it could be done without making the actual TLB any bigger or
different, by just having the TLB fill routine keep a separate "source
cache" that the dirty-marking can use.

		Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: x86 ptep_get_and_clear question
  2001-02-15 18:51   ` Manfred Spraul
  2001-02-15 19:05     ` Kanoj Sarcar
@ 2001-02-15 19:07     ` Jamie Lokier
  1 sibling, 0 replies; 41+ messages in thread
From: Jamie Lokier @ 2001-02-15 19:07 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Kanoj Sarcar, Ben LaHaise, linux-mm, mingo, alan, linux-kernel

Manfred Spraul wrote:
> Is the sequence
>
>	<< lock;
>	read pte
>	pte |= dirty
>	write pte
>	>> end lock;
>
> or
>
>	<< lock;
>	read pte
>	if (!present(pte))
>		do_page_fault();
>	pte |= dirty
>	write pte
>	>> end lock;

or, more generally,

	<< lock;
	read pte
	if (!present(pte) || !writable(pte))
		do_page_fault();
	pte |= dirty
	write pte
	>> end lock;

Not to mention: does it guarantee to use the newly read physical
address, does it check the supervisor permission again, does it use
the new PAT/CD/WT attributes?

I can vaguely imagine some COW optimisation where the pte is updated
to be writable with the new page's address, and there is no need to
flush other processors' TLBs because they will do so when they first
write to the page.  (But of course you have to be careful
synchronising with other uses of the shared page prior to the eventual
TLB flush.)

-- Jamie

^ permalink raw reply	[flat|nested] 41+ messages in thread
end of thread, other threads:[~2001-02-16 19:44 UTC | newest]
Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20010215173547.A2079@pcep-jamie.cern.ch>
[not found] ` <200102151723.JAA43255@google.engr.sgi.com>
2001-02-15 17:47 ` x86 ptep_get_and_clear question Jamie Lokier
2001-02-15 18:05 ` Kanoj Sarcar
2001-02-15 18:23 ` Kanoj Sarcar
2001-02-15 18:42 ` Jamie Lokier
2001-02-15 18:57 ` Kanoj Sarcar
2001-02-15 19:06 ` Ben LaHaise
2001-02-15 19:19 ` Kanoj Sarcar
2001-02-15 20:16 ` Linus Torvalds
2001-02-15 18:51 ` Manfred Spraul
2001-02-15 19:05 ` Kanoj Sarcar
2001-02-15 19:19 ` Jamie Lokier
2001-02-15 20:31 ` Linus Torvalds
2001-02-15 21:26 ` Manfred Spraul
2001-02-15 21:29 ` Manfred Spraul
2001-02-16 1:21 ` Linus Torvalds
2001-02-16 14:18 ` Jamie Lokier
2001-02-16 14:59 ` Manfred Spraul
2001-02-16 15:27 ` Jamie Lokier
2001-02-16 15:54 ` Manfred Spraul
2001-02-16 16:00 ` Jamie Lokier
2001-02-16 16:23 ` Manfred Spraul
2001-02-16 16:43 ` Jamie Lokier
2001-02-16 17:12 ` Manfred Spraul
2001-02-16 17:20 ` Jamie Lokier
2001-02-16 17:36 ` Linus Torvalds
2001-02-16 18:49 ` Manfred Spraul
2001-02-16 19:00 ` Linus Torvalds
2001-02-16 19:02 ` Ben LaHaise
2001-02-16 19:32 ` Linus Torvalds
2001-02-16 19:42 ` Ben LaHaise
2001-02-16 17:37 ` Jamie Lokier
2001-02-16 18:04 ` Manfred Spraul
2001-02-16 18:09 ` Jamie Lokier
2001-02-16 18:36 ` Hugh Dickins
2001-02-16 17:29 ` Ben LaHaise
2001-02-16 17:38 ` Linus Torvalds
2001-02-16 17:44 ` Ben LaHaise
2001-02-16 17:59 ` Linus Torvalds
2001-02-15 23:57 ` Jamie Lokier
2001-02-16 0:55 ` Linus Torvalds
2001-02-15 19:07 ` Jamie Lokier
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox