* ptep_establish/establish_pte needs set_pte_atomic and all set_pte must be written in asm
@ 2004-09-25 15:54 Andrea Arcangeli
2004-09-25 23:33 ` Benjamin Herrenschmidt
2004-09-25 23:44 ` Rik van Riel
0 siblings, 2 replies; 20+ messages in thread
From: Andrea Arcangeli @ 2004-09-25 15:54 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Andrew Morton, linux-kernel
I was discussing an obscure bug with Martin in the COW handling.
After some thoughts on the issue the only thing that I found potentially
buggy in that path is that we're not safe not writing the 8bytes
atomically before flushing the tlb. I'm afraid without set_pte_atomic
the other cpu in another userspace thread could touch the memory with an
empty tlb entry, and load the half-written pte, that may contain the new
high bits for the high part of the pfn number, but the old low bits for
the low part of the pfn number plus the old ro protection. Potentially
giving access to a random physical page to a task (in readonly mode, so
no kernel crash or userspace malfunction except the security
compromise). I don't expect anybody to be able to exploit it though.
Worthy to note is that we're buggy in all set_pte implementations since,
all archs would need to also implement the set_pte in assembler to make
sure the C language doesn't write it byte-by-byte which would break the
SMP in the other thread. On ppc64 where a problem triggered (possibily
unrelated to this) the pte is an unsigned long and it's being updated by
set_pte with this:
*ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS
(note pte_clear would be fine to be still in C, pte clear is guaranteed
to run on not present ptes, so we don't race with other threads, it's
only set_pte that should always be written in assembler in the last
opcode that writes in the pte)
We don't need an SMP lock, we only need to write 4 or 8 bytes at once (a
plain movl in x86 would do the trick). That's all we need. (and in
theory only for SMP, but UP will not get any slowdown since no lock on
the bus will be necessary, we're the only writer thanks to the
page_table_lock, but there are other readers running in userspace in
SMP, hence the need of atomicity)
pte_clear would not be safe to call inside ptep_establish for the same
reason (pte_clear is not atomic and it doesn't need to be atomic unlike
set_pte), while something like this should be fine:
ptep_get_and_clear
set_pte
flush_tlb
but it doesn't worht it since all other archs supporting SMP in their
architecture will have to change set_pte to an assembly version, so for
them set_pte_atomic will be defined to set_pte.
The x86 set_pte itself should be changed to:
static inline void set_pte(pte_t *ptep, pte_t pte)
{
ptep->pte_high = pte.pte_high;
smp_wmb();
do this in a single not locked movl -> (ptep->pte_low = pte.pte_low);
}
and for x86 the set_pte_atomic will remain the same as today, so the
below patch seems the right long term fix (even if it breaks all archs
but x86).
Comments?
Index: linux-2.5/include/asm-generic/pgtable.h
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-generic/pgtable.h,v
retrieving revision 1.8
diff -u -p -r1.8 pgtable.h
--- linux-2.5/include/asm-generic/pgtable.h 29 Jul 2004 06:01:30 -0000 1.8
+++ linux-2.5/include/asm-generic/pgtable.h 25 Sep 2004 15:16:50 -0000
@@ -15,7 +15,7 @@
*/
#define ptep_establish(__vma, __address, __ptep, __entry) \
do { \
- set_pte(__ptep, __entry); \
+ set_pte_atomic(__ptep, __entry); \
flush_tlb_page(__vma, __address); \
} while (0)
#endif
^ permalink raw reply [flat|nested] 20+ messages in thread* Re: ptep_establish/establish_pte needs set_pte_atomic and all set_pte must be written in asm 2004-09-25 15:54 ptep_establish/establish_pte needs set_pte_atomic and all set_pte must be written in asm Andrea Arcangeli @ 2004-09-25 23:33 ` Benjamin Herrenschmidt 2004-09-26 0:20 ` Andrea Arcangeli 2004-09-25 23:44 ` Rik van Riel 1 sibling, 1 reply; 20+ messages in thread From: Benjamin Herrenschmidt @ 2004-09-25 23:33 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Martin J. Bligh, Andrew Morton, Linux Kernel list On Sun, 2004-09-26 at 01:54, Andrea Arcangeli wrote: > Worthy to note is that we're buggy in all set_pte implementations since, > all archs would need to also implement the set_pte in assembler to make > sure the C language doesn't write it byte-by-byte which would break the > SMP in the other thread. On ppc64 where a problem triggered (possibily > unrelated to this) the pte is an unsigned long and it's being updated by > set_pte with this: Bye by byte ? Ahem ... That would be a really broken C compiler ;) I don't see how it could be broken on archs where the PTE size is a single long for example, ppc64 is not. ppc32 is already atomic for different reasons > *ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS This is not broken, how can somebody else race on modifying this PTE since we hold the page table lock and the PTE was previously cleared & flushed from the hash table ? The last store updating the PTE before we leave that code path will be a single 8 bytes store (this is a 64 bits architecture !) > (note pte_clear would be fine to be still in C, pte clear is guaranteed > to run on not present ptes, That isn't the case of the pte_clear call issued by set_pte itself on ppc64. I haven't looked at othe cases in the generic code, but I suppose they indeed use get_and_clear instead. > so we don't race with other threads, it's > only set_pte that should always be written in assembler in the last > opcode that writes in the pte) Why ? I mean, why _always_ ? The above is perfectly correct on ppc64 > We don't need an SMP lock, we only need to write 4 or 8 bytes at once (a > plain movl in x86 would do the trick). That's all we need. No, we need the page table lock on ppc64 because we must make sure the PTE has been cleared & flushed from the hash table, before we set it to the new value and all of that without another thread trying to modify it. The only way we could get rid of the locks around set_pte or other calls modifying the PTE validity/address on ppc64 would be to use the PAGE_BUSY bit, which we defined as a per-PTE lock, in such a way that it's taken around the whole flush+write operation (that is hold it accross the hash flush). > (and in > theory only for SMP, but UP will not get any slowdown since no lock on > the bus will be necessary, we're the only writer thanks to the > page_table_lock, but there are other readers running in userspace in > SMP, hence the need of atomicity) > > pte_clear would not be safe to call inside ptep_establish for the same > reason (pte_clear is not atomic and it doesn't need to be atomic unlike > set_pte), while something like this should be fine: > > ptep_get_and_clear > set_pte > flush_tlb > > but it doesn't worht it since all other archs supporting SMP in their > architecture will have to change set_pte to an assembly version, so for > them set_pte_atomic will be defined to set_pte. I don't understand your point... PTE's are usually the native long size of the arch and usually set_pte is a single aligned store, which mean it's pretty much always "atomic"... > The x86 set_pte itself should be changed to: > > static inline void set_pte(pte_t *ptep, pte_t pte) > { > ptep->pte_high = pte.pte_high; > smp_wmb(); > do this in a single not locked movl -> (ptep->pte_low = pte.pte_low); > } > > and for x86 the set_pte_atomic will remain the same as today, so the > below patch seems the right long term fix (even if it breaks all archs > but x86). > > Comments? If I understand your explanation, all you need is make sure that x86 set_pte sets the HW present bit last when writing the 2 halves, no ? Ben. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: ptep_establish/establish_pte needs set_pte_atomic and all set_pte must be written in asm 2004-09-25 23:33 ` Benjamin Herrenschmidt @ 2004-09-26 0:20 ` Andrea Arcangeli 2004-09-26 0:31 ` Rik van Riel 2004-09-26 0:44 ` Benjamin Herrenschmidt 0 siblings, 2 replies; 20+ messages in thread From: Andrea Arcangeli @ 2004-09-26 0:20 UTC (permalink / raw) To: Benjamin Herrenschmidt; +Cc: Martin J. Bligh, Andrew Morton, Linux Kernel list On Sun, Sep 26, 2004 at 09:33:27AM +1000, Benjamin Herrenschmidt wrote: > On Sun, 2004-09-26 at 01:54, Andrea Arcangeli wrote: > > > Worthy to note is that we're buggy in all set_pte implementations since, > > all archs would need to also implement the set_pte in assembler to make > > sure the C language doesn't write it byte-by-byte which would break the > > SMP in the other thread. On ppc64 where a problem triggered (possibily > > unrelated to this) the pte is an unsigned long and it's being updated by > > set_pte with this: > > Bye by byte ? Ahem ... That would be a really broken C compiler ;) I that would be an overoptimized or underoptimized C compiler, sure not a really broken one. The C compiler is perfectly allowed to do that, check the specs or ask your C compiler friends to get confirmation. anyways on x86 the bug is real in practice, regardless of the C compiler, heck we even put a smp_wmb() in between the two writes. The fact all other archs are buggy in theory too is just a corollary. I thought it worth to fix the theoretical bug in all other archs too, instead of keeping playing russian roulette. > don't see how it could be broken on archs where the PTE size is a single > long for example, ppc64 is not. ppc32 is already atomic for different > reasons of course in practice it's expectedly working correctly for all archs, except x86 where there is a smp_wmb() in between, and even if x86 was using an unsigned long, the C compiler would still not be writing it atomically. > > *ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS > > This is not broken, how can somebody else race on modifying this It is broken as far as C language is concerned. You're just hoping to have an efficient compiler under you, and you're hoping to have an architecture where doing a single write is more efficient. This happens to work in practice, but it's still wrong in theory, check C language. What you are assuming in the above code are semantics the C language _cannot_ provide you. the C language has no knowledge of the "somebody else", the C language thinks it's single threaded and it can do whatever it wants as far as it writes that data to that place eventually. > That isn't the case of the pte_clear call issued by set_pte itself on > ppc64. I haven't looked at othe cases in the generic code, but I > suppose they indeed use get_and_clear instead. generic code should really use get_and_clear for that. pte_clear in common code can only be called on non present ptes. Again, ppc64 would be ok in practice (still not in theory), but x86 would break even in practice (not only in theory) if you use pte_clear on a present pte. But even ppc64 is wrong as far as C is concerned, your = 0 can be implemented in 8 byte writes, and the C compiler would be right, and you would be wrong. You never know if they could ever choose to do a memset instead of an atomic write, or whatever new assembler instruction in future implementations of the cpu. I perfectly know it works fine in practice and that the only definitive bug is in the x86 arch, but this is a theoretical bug for _all_ archs. > > so we don't race with other threads, it's > > only set_pte that should always be written in assembler in the last > > opcode that writes in the pte) > > Why ? I mean, why _always_ ? The above is perfectly correct on ppc64 it's not correct even on ppc64. > > > We don't need an SMP lock, we only need to write 4 or 8 bytes at once (a > > plain movl in x86 would do the trick). That's all we need. > > No, we need the page table lock on ppc64 because we must make sure the you misunderstood, obviously everybody is required to hold the page_table_lock while writing to any pagetable. What I meant with lock above, was the "lock prefix to the movl" instruction, not the lock as in page_table_lock. The write to the pte doesn't need to be executed with an atomic opcode, a movl would work, it doesn't need to be a "lock movl", because thanks to the page_table_lock, there's only one writer, and all readers are in userspace racing with us (hence the need for an assembler write, the only thing that can provide an atomicity guarantee, C can't, check the language specifications). It works by luck, now if it crashed you could still blamed the compiler for being suboptimal, but you definitely cannot blame the compiler for being wrong. The only wrong thing is your implementation of set_pte that you keep advocating, not the compiler. > I don't understand your point... PTE's are usually the native long size > of the arch and usually set_pte is a single aligned store, which mean > it's pretty much always "atomic"... same question all over, already answered why C cannot provide any atomic operation many times above. And you even seem to partially agree that it's buggy when you say "pretty much always", instead of a plain "always" ;). > If I understand your explanation, all you need is make sure that x86 > set_pte sets the HW present bit last when writing the 2 halves, no ? x86 already does that. But that's not enough. It must be a set_pte_atomic, writing all 8 bytes at once. Because the pte is already established, so the first write of the first halve will make any racing thread load into the tlb a mapping to a wrong random page in the system. This is a security compromise (note: seccomp not involved as usual, since I obviously disallow clone on bytecode running under seccomp mode). If every other kernel guy agrees with you to keep depending on semantics the C language cannot provide us, I can live with that (in such case I'm only going to fix x86 and x86-64 respective set_pte in asm and I will save this email if in the future a MM corruption bug triggers due subtle compiler behaviour ;). returning to the pratical bug (ignoring the thoeretical bug) we will have to at least move the ptep_enstablish modified as I posted (with set_pte_atomic instead of set_pte) from asm-generic to asm-i386. That will fix the security issue I found on x86 PAE with >4G of ram. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: ptep_establish/establish_pte needs set_pte_atomic and all set_pte must be written in asm 2004-09-26 0:20 ` Andrea Arcangeli @ 2004-09-26 0:31 ` Rik van Riel 2004-09-26 0:46 ` Andrea Arcangeli 2004-09-26 0:44 ` Benjamin Herrenschmidt 1 sibling, 1 reply; 20+ messages in thread From: Rik van Riel @ 2004-09-26 0:31 UTC (permalink / raw) To: Andrea Arcangeli Cc: Benjamin Herrenschmidt, Martin J. Bligh, Andrew Morton, Linux Kernel list On Sun, 26 Sep 2004, Andrea Arcangeli wrote: > But even ppc64 is wrong as far as C is concerned, Looks fine to me. From include/asm-ppc64/pgtable.h static inline void set_pte(pte_t *ptep, pte_t pte) { if (pte_present(*ptep)) { pte_clear(ptep); flush_tlb_pending(); } *ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS; } -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: ptep_establish/establish_pte needs set_pte_atomic and all set_pte must be written in asm 2004-09-26 0:31 ` Rik van Riel @ 2004-09-26 0:46 ` Andrea Arcangeli 2004-09-26 0:59 ` Benjamin Herrenschmidt 2004-09-28 9:12 ` Pavel Machek 0 siblings, 2 replies; 20+ messages in thread From: Andrea Arcangeli @ 2004-09-26 0:46 UTC (permalink / raw) To: Rik van Riel Cc: Benjamin Herrenschmidt, Martin J. Bligh, Andrew Morton, Linux Kernel list On Sat, Sep 25, 2004 at 08:31:13PM -0400, Rik van Riel wrote: > On Sun, 26 Sep 2004, Andrea Arcangeli wrote: > > > But even ppc64 is wrong as far as C is concerned, > > Looks fine to me. From include/asm-ppc64/pgtable.h > > static inline void set_pte(pte_t *ptep, pte_t pte) > { > if (pte_present(*ptep)) { > pte_clear(ptep); > flush_tlb_pending(); > } > *ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS; > } As far as the C language is concerned that *ptep = something can be implemented with 8 writes of 1 byte each (or alternatively with an assembler instruction that may make the written data visible not atomically to other cpus, despite it was written with a single opcode, similarly to what happens if you use incl without the lock prefix). I'm not saying such instruction exists in ppc64, but the compiler is definitely allowed to break the above. You can blame on the compiler to be inefficient, but you can't blame on the compiler for the security hazard it would generate. Only the kernel would be to blame if for whatever reason a gcc version would be underoptimized. I perfectly know in practice the above is "almost guaranteed" to work, it's just the "almost" that's not good enough for me ;) anyways this is just a corollary to the x86 true bug, where a smp_wmb() sits in between the two separate writes, which makes the x86 set_pte obviously not atomic even in practice (not just in theory like for ppc64 and all other archs). I thought it was better to fix the theoretical bugs too, and they are true bugs as far as the C language is concerned, even if they're not triggering with the current gcc implementation of the language (and with any other decent compiler ;). ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: ptep_establish/establish_pte needs set_pte_atomic and all set_pte must be written in asm 2004-09-26 0:46 ` Andrea Arcangeli @ 2004-09-26 0:59 ` Benjamin Herrenschmidt 2004-09-26 1:36 ` Andrea Arcangeli 2004-09-26 20:30 ` Paul Mackerras 2004-09-28 9:12 ` Pavel Machek 1 sibling, 2 replies; 20+ messages in thread From: Benjamin Herrenschmidt @ 2004-09-26 0:59 UTC (permalink / raw) To: Andrea Arcangeli Cc: Rik van Riel, Martin J. Bligh, Andrew Morton, Linux Kernel list On Sun, 2004-09-26 at 10:46, Andrea Arcangeli wrote: > As far as the C language is concerned that *ptep = something can be > implemented with 8 writes of 1 byte each (or alternatively with an > assembler instruction that may make the written data visible not > atomically to other cpus, despite it was written with a single opcode, > similarly to what happens if you use incl without the lock prefix). I'm > not saying such instruction exists in ppc64, but the compiler is > definitely allowed to break the above. You can blame on the compiler to > be inefficient, but you can't blame on the compiler for the security > hazard it would generate. Only the kernel would be to blame if for > whatever reason a gcc version would be underoptimized. BTW, for your reading pleasure :) #define atomic_set(v,i) (((v)->counter) = (i)) (asm-i386/atomic.h) And that's really far from beeing the 2 only cases where the kernel _relies_ on a write of a simple type like int or long to an aligned location to be atomic. Almost all drivers manipulating DMA descriptors do that, jiffies is a good example too afaik, and more and more and more ... so if the compiler is breaking that up, I think the set_pte race is the least of our problems :) Ben. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: ptep_establish/establish_pte needs set_pte_atomic and all set_pte must be written in asm 2004-09-26 0:59 ` Benjamin Herrenschmidt @ 2004-09-26 1:36 ` Andrea Arcangeli 2004-09-26 5:31 ` Benjamin Herrenschmidt 2004-09-26 20:30 ` Paul Mackerras 1 sibling, 1 reply; 20+ messages in thread From: Andrea Arcangeli @ 2004-09-26 1:36 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Rik van Riel, Martin J. Bligh, Andrew Morton, Linux Kernel list On Sun, Sep 26, 2004 at 10:59:43AM +1000, Benjamin Herrenschmidt wrote: > On Sun, 2004-09-26 at 10:46, Andrea Arcangeli wrote: > > > As far as the C language is concerned that *ptep = something can be > > implemented with 8 writes of 1 byte each (or alternatively with an > > assembler instruction that may make the written data visible not > > atomically to other cpus, despite it was written with a single opcode, > > similarly to what happens if you use incl without the lock prefix). I'm > > not saying such instruction exists in ppc64, but the compiler is > > definitely allowed to break the above. You can blame on the compiler to > > be inefficient, but you can't blame on the compiler for the security > > hazard it would generate. Only the kernel would be to blame if for > > whatever reason a gcc version would be underoptimized. > > BTW, for your reading pleasure :) > > #define atomic_set(v,i) (((v)->counter) = (i)) > > (asm-i386/atomic.h) and then check this: typedef struct { volatile int counter; } atomic_t; ^^^^^^^^ if the pte was at least volatile it would be a lot safer. C knows it must not mess with volatile. > And that's really far from beeing the 2 only cases where the kernel _relies_ > on a write of a simple type like int or long to an aligned location to be > atomic. Almost all drivers manipulating DMA descriptors do that, jiffies > is a good example too afaik, and more and more and more ... so if the jiffies, writel/readl all volatile incidentally. > compiler is breaking that up, I think the set_pte race is the least of > our problems :) the only one not volatile, or can you find more? ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: ptep_establish/establish_pte needs set_pte_atomic and all set_pte must be written in asm 2004-09-26 1:36 ` Andrea Arcangeli @ 2004-09-26 5:31 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 20+ messages in thread From: Benjamin Herrenschmidt @ 2004-09-26 5:31 UTC (permalink / raw) To: Andrea Arcangeli Cc: Rik van Riel, Martin J. Bligh, Andrew Morton, Linux Kernel list On Sun, 2004-09-26 at 11:36, Andrea Arcangeli wrote: > the only one not volatile, or can you find more? I'm not sure all DMA descriptor buffers are marked volatile but yes, I tend to agree with you on the fact that it may be a good idea Ben. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: ptep_establish/establish_pte needs set_pte_atomic and all set_pte must be written in asm 2004-09-26 0:59 ` Benjamin Herrenschmidt 2004-09-26 1:36 ` Andrea Arcangeli @ 2004-09-26 20:30 ` Paul Mackerras [not found] ` <20040926203640.GR2499@dualathlon.random> 1 sibling, 1 reply; 20+ messages in thread From: Paul Mackerras @ 2004-09-26 20:30 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Andrea Arcangeli, Rik van Riel, Martin J. Bligh, Andrew Morton, Linux Kernel list Benjamin Herrenschmidt writes: > On Sun, 2004-09-26 at 10:46, Andrea Arcangeli wrote: > > > As far as the C language is concerned that *ptep = something can be > > implemented with 8 writes of 1 byte each (or alternatively with an > > assembler instruction that may make the written data visible not > > atomically to other cpus, despite it was written with a single opcode, > > similarly to what happens if you use incl without the lock prefix). I'm > > not saying such instruction exists in ppc64, but the compiler is > > definitely allowed to break the above. You can blame on the compiler to > > be inefficient, but you can't blame on the compiler for the security > > hazard it would generate. Only the kernel would be to blame if for > > whatever reason a gcc version would be underoptimized. > > BTW, for your reading pleasure :) > > #define atomic_set(v,i) (((v)->counter) = (i)) FWIW, we also rely on several other things that are not guaranteed by the C standard, for instance that integer arithmetic is 2's complement, that bytes are individually addressable, and that pointers are represented by an address that is no bigger than a long. Paul. ^ permalink raw reply [flat|nested] 20+ messages in thread
[parent not found: <20040926203640.GR2499@dualathlon.random>]
* Re: ptep_establish/establish_pte needs set_pte_atomic and all set_pte must be written in asm [not found] ` <20040926203640.GR2499@dualathlon.random> @ 2004-09-27 16:41 ` Andrea Arcangeli 0 siblings, 0 replies; 20+ messages in thread From: Andrea Arcangeli @ 2004-09-27 16:41 UTC (permalink / raw) To: Paul Mackerras Cc: Benjamin Herrenschmidt, Rik van Riel, Martin J. Bligh, Andrew Morton, Linux Kernel list On Sun, Sep 26, 2004 at 10:36:40PM +0200, Andrea Arcangeli wrote: > On Mon, Sep 27, 2004 at 06:30:25AM +1000, Paul Mackerras wrote: > > FWIW, we also rely on several other things that are not guaranteed by > > the C standard, for instance that integer arithmetic is 2's > > complement, that bytes are individually addressable, and that pointers > > are represented by an address that is no bigger than a long. > > I wouldn't compare these with atomic writes on non volatile variables. I > mean, sizeof(char *) being different than sizeof(long) is something I'm > very confortable will not break anytime. If you want to add up to the > list, even the gcc inline assembly itself isn't part of the language... > (infact that was the major trouble for icc to add it AFIK) Just to make an example of why you definitely cannot compare the assumption of sizeof(char *) == sizeof(long), and complement 2 aritmetic, is that gcc is allowed to implement something like this: *64bit_ptr = 32bit_integer_var << 32; as two instructions: xorl and a movl, that is likely to be faster than a shiftleft + movq. Or maybe it's not faster on the x86-64 and gcc may prefer to use shiftleft + movq, but you get the idea of what I'm talking about, maybe for other archs the performance point is different etc... (by instinct I believe it'd be faster even on x86, especially on the nocona based on p4 core) with the ptes on x86 the shiftleft of 12 probably avoid screwups but that's just because we're lucky and I don't want to think what else gcc could optimize by evaluating constants... ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: ptep_establish/establish_pte needs set_pte_atomic and all set_pte must be written in asm 2004-09-26 0:46 ` Andrea Arcangeli 2004-09-26 0:59 ` Benjamin Herrenschmidt @ 2004-09-28 9:12 ` Pavel Machek 1 sibling, 0 replies; 20+ messages in thread From: Pavel Machek @ 2004-09-28 9:12 UTC (permalink / raw) To: Andrea Arcangeli Cc: Rik van Riel, Benjamin Herrenschmidt, Martin J. Bligh, Andrew Morton, Linux Kernel list Hi! > > > But even ppc64 is wrong as far as C is concerned, > > > > Looks fine to me. From include/asm-ppc64/pgtable.h > > > > static inline void set_pte(pte_t *ptep, pte_t pte) > > { > > if (pte_present(*ptep)) { > > pte_clear(ptep); > > flush_tlb_pending(); > > } > > *ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS; > > } > > As far as the C language is concerned that *ptep = something can be > implemented with 8 writes of 1 byte each (or alternatively with an > assembler instruction that may make the written data visible not I'd say that it is okay to do this in arch-dependend code. Architecture controls list of compilers allowed. Pavel -- People were complaining that M$ turns users into beta-testers... ...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl! ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: ptep_establish/establish_pte needs set_pte_atomic and all set_pte must be written in asm 2004-09-26 0:20 ` Andrea Arcangeli 2004-09-26 0:31 ` Rik van Riel @ 2004-09-26 0:44 ` Benjamin Herrenschmidt 2004-09-26 1:32 ` Andrea Arcangeli 2004-09-26 14:41 ` Martin J. Bligh 1 sibling, 2 replies; 20+ messages in thread From: Benjamin Herrenschmidt @ 2004-09-26 0:44 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Martin J. Bligh, Andrew Morton, Linux Kernel list On Sun, 2004-09-26 at 10:20, Andrea Arcangeli wrote: > that would be an overoptimized or underoptimized C compiler, sure not a > really broken one. The C compiler is perfectly allowed to do that, check > the specs or ask your C compiler friends to get confirmation. > > anyways on x86 the bug is real in practice, regardless of the C > compiler, heck we even put a smp_wmb() in between the two writes. The > fact all other archs are buggy in theory too is just a corollary. I > thought it worth to fix the theoretical bug in all other archs too, > instead of keeping playing russian roulette. How so ? A bunch of archs have the pte beeing a simple long, on these set_pte is perfectly atomic as it is... I'd say in this regard that x86 is the exception ;) > > don't see how it could be broken on archs where the PTE size is a single > > long for example, ppc64 is not. ppc32 is already atomic for different > > reasons > > of course in practice it's expectedly working correctly for all archs, > except x86 where there is a smp_wmb() in between, and even if x86 was > using an unsigned long, the C compiler would still not be writing it > atomically. Ugh ? Writing an unsigned long with a single instruction is always atomic, whatever the C compiler does. If the compiler turns a single long aligned write into something else, then it's very broken imho > > > *ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS > > > > This is not broken, how can somebody else race on modifying this > > It is broken as far as C language is concerned. You're just hoping to > have an efficient compiler under you, and you're hoping to have an > architecture where doing a single write is more efficient. I'm not convinced :) > This happens to work in practice, but it's still wrong in theory, check > C language. What you are assuming in the above code are semantics the C > language _cannot_ provide you. Oh yes, I suppose we could imagine a compiler writing the above bit-by-bit, but let's stay serious on that one, there are a bunch of other places in the kernel that would be broken if the compiler started breaking up an int or long write in pieces :) > the C language has no knowledge of the "somebody else", the C language > thinks it's single threaded and it can do whatever it wants as far as it > writes that data to that place eventually. And ? The above example is clearly a read/modify/write case, all the compiler might ever do is move the last write around, but it can't move it before the if () since there's a clear data dependency and the flush done in there has an implicit barrier, so all that can happen is the write to the PTE to happen a bit later ... I don't see where is the issue here. > > That isn't the case of the pte_clear call issued by set_pte itself on > > ppc64. I haven't looked at othe cases in the generic code, but I > > suppose they indeed use get_and_clear instead. > > generic code should really use get_and_clear for that. pte_clear in > common code can only be called on non present ptes. Again, ppc64 would > be ok in practice (still not in theory), but x86 would break even in > practice (not only in theory) if you use pte_clear on a present pte. > > But even ppc64 is wrong as far as C is concerned, your = 0 can be > implemented in 8 byte writes, and the C compiler would be right, and you > would be wrong. And I would change to another compiler. Any compiler trafficking a write of an aligned native sized type like that is good for the trashcan. > You never know if they could ever choose to do a memset > instead of an atomic write, or whatever new assembler instruction in > future implementations of the cpu. > > I perfectly know it works fine in practice and that the only definitive > bug is in the x86 arch, but this is a theoretical bug for _all_ archs. > > > > so we don't race with other threads, it's > > > only set_pte that should always be written in assembler in the last > > > opcode that writes in the pte) > > > > Why ? I mean, why _always_ ? The above is perfectly correct on ppc64 > > it's not correct even on ppc64. Find me a single case of a compiler not generating that correctly then. Doing an atomic instruction there would have a cost I don't want to pay. > > > We don't need an SMP lock, we only need to write 4 or 8 bytes at once (a > > > plain movl in x86 would do the trick). That's all we need. > > > > No, we need the page table lock on ppc64 because we must make sure the > > you misunderstood, obviously everybody is required to hold the > page_table_lock while writing to any pagetable. Except in the fault path when setting accessed or dirty on a rw page... But then, I refer you to the patches that have been floating around for implementing page table lock-less do_page_fault(), this is what I was talking about. > What I meant with lock above, was the "lock prefix to the movl" > instruction, not the lock as in page_table_lock. I understood what you were saying > The write to the pte doesn't need to be executed with an atomic opcode, > a movl would work, it doesn't need to be a "lock movl", because thanks > to the page_table_lock, there's only one writer, and all readers are in > userspace racing with us (hence the need for an assembler write, the > only thing that can provide an atomicity guarantee, C can't, check the > language specifications). It works by luck, now if it crashed you could > still blamed the compiler for being suboptimal, but you definitely > cannot blame the compiler for being wrong. The only wrong thing is your > implementation of set_pte that you keep advocating, not the compiler. Again, find me a single case where the compiler will generate anything but an "std" instruction for the above on ppc64 and you'll get a free case of champagne :) > > I don't understand your point... PTE's are usually the native long size > > of the arch and usually set_pte is a single aligned store, which mean > > it's pretty much always "atomic"... > > same question all over, already answered why C cannot provide any atomic > operation many times above. And you even seem to partially agree that it's > buggy when you say "pretty much always", instead of a plain "always" ;). > > > If I understand your explanation, all you need is make sure that x86 > > set_pte sets the HW present bit last when writing the 2 halves, no ? > > x86 already does that. But that's not enough. It must be a > set_pte_atomic, writing all 8 bytes at once. Because the pte is already > established, so the first write of the first halve will make any racing > thread load into the tlb a mapping to a wrong random page in the system. Unless you do like ppc64 and clear it first :) But I agree that if your architecture lets you do 64 bits atomic writes on the 32 bits arch, then it's definitely the better solution. But then, you don't need a special set_pte_atomic, just make the normal set_pte do that and be done with it... Or is there a perf. issue there ? > This is a security compromise (note: seccomp not involved as usual, > since I obviously disallow clone on bytecode running under seccomp mode). > > If every other kernel guy agrees with you to keep depending on semantics > the C language cannot provide us, I can live with that (in such case I'm > only going to fix x86 and x86-64 respective set_pte in asm and I will > save this email if in the future a MM corruption bug triggers due subtle > compiler behaviour ;). > > returning to the pratical bug (ignoring the thoeretical bug) we will > have to at least move the ptep_enstablish modified as I posted (with > set_pte_atomic instead of set_pte) from asm-generic to asm-i386. That > will fix the security issue I found on x86 PAE with >4G of ram. -- Benjamin Herrenschmidt <benh@kernel.crashing.org> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: ptep_establish/establish_pte needs set_pte_atomic and all set_pte must be written in asm 2004-09-26 0:44 ` Benjamin Herrenschmidt @ 2004-09-26 1:32 ` Andrea Arcangeli 2004-09-26 5:29 ` Benjamin Herrenschmidt 2004-09-26 14:41 ` Martin J. Bligh 1 sibling, 1 reply; 20+ messages in thread From: Andrea Arcangeli @ 2004-09-26 1:32 UTC (permalink / raw) To: Benjamin Herrenschmidt; +Cc: Martin J. Bligh, Andrew Morton, Linux Kernel list On Sun, Sep 26, 2004 at 10:44:48AM +1000, Benjamin Herrenschmidt wrote: > How so ? A bunch of archs have the pte beeing a simple long, on these > set_pte is perfectly atomic as it is... I'd say in this regard that > x86 is the exception ;) this is a C language issue, not a gcc issue. I know gcc does things optimially and it will always write it atomically. But the word "atomic" doesn't exist in C, so you cannot write C code and expect anything to be written atomically. The single fact you agree set_pte has to be done with a perfect atomic-SMP operation, means you also agree it cannot be fully implemented in C. which is the only thing I'm advocating. > Ugh ? Writing an unsigned long with a single instruction is always > atomic, whatever the C compiler does. If the compiler turns a single > long aligned write into something else, then it's very broken imho It doesn't need to be. It's always atomic with your current cpu revision, future cpus may introduce not-SMP atomic writes and gcc would be perfectly allowed to use those new instructions by default (though I guess in the kernel other things would break doing that, this may not be the only place where we do this, but it would be our problem, not a gcc issue). > I'm not convinced :) I know it perfectly works in practice today, and you most certainly don't risk anything with current gcc (I doubt nobody except a gcc hacker keeping most of gcc in mind at once could give you a true guarantee of correctness though, I sure can't). > > This happens to work in practice, but it's still wrong in theory, check > > C language. What you are assuming in the above code are semantics the C > > language _cannot_ provide you. > > Oh yes, I suppose we could imagine a compiler writing the above > bit-by-bit, but let's stay serious on that one, there are a bunch of > other places in the kernel that would be broken if the compiler started > breaking up an int or long write in pieces :) I hope not ;). Peraphs you mean readl/writel? at least those are volatile, which means gcc will not be allowed to mess with those as much as it can with a plain unsigned long pointer (i.e. the pte). There's also another kind of bugs that is that we often abuse, that is to sometime touch memory that can change from under us, without marking the variable as volatile. That can confuse gcc, in some cases, Linus preferred to allow that by thinking when GCC could ever get confused. So for example if you use non-volatile variables that can change under you in SMP, for a switch statement, your kernel can crash freely, and it's up to you not to do that and remeber gcc can generate kernel crashing code if you do that (switch can be implemented with an hash or similar techniques that may generate overflows and jump into a random location in memory if the variable can change under gcc). If you only test them with an if you're probably safe instead (and that's what most kernel code does, to avoid taking spinlocks in the fast paths). This case with set_pte is a similar, and I'm ready to lose on this one too, no problem with me, but at least I tried to do the right thing as far as I can tell. so I'm really fine to stick to bugs that are real like the x86 one I just found. > Find me a single case of a compiler not generating that correctly then. I really hope there is none... but I don't read ppc64 asm anyways, so you'd have to look into it yourself if you want to be sure. > Doing an atomic instruction there would have a cost I don't want to pay. Not really, it would be definitely zerocost at runtime, you misunderstood if you think the asm generated wouldn't be the same. The asm produced would be exactly the same as today, the hardware couldn't notice the difference, only the C programmer could. I'm not talking about locked instructions. I'm talking about a single atomic _assembly_ instruction instead of a C line of code (which we would then depend on the compiler to generate as atomic). > > you misunderstood, obviously everybody is required to hold the > > page_table_lock while writing to any pagetable. > > Except in the fault path when setting accessed or dirty on a rw page... maybe I'm biased because I'm reading x86-64 code, but where? the software mkdirty and mkyoung seem to all be inside the page_table_lock. > But then, I refer you to the patches that have been floating around for > implementing page table lock-less do_page_fault(), this is what I was > talking about. Not sure how could I get you were talking about those floating around patches. I still don't get any connetion with those patches and the above discussion. > Again, find me a single case where the compiler will generate anything > but an "std" instruction for the above on ppc64 and you'll get a free > case of champagne :) If something I can check x86-64 which has the same issue, not ppc64. If you prefer to ignore those theoretical smp races, then I will save this email and I'll forward it to you when it triggers in production because gcc did something strange, and then you will send me the free case of champagne :). I'm also waiting the other bug for the lack of volatile variables where we access memory that can change under us to trigger anywhere in the kernel, only after it does I will have a good argument to convince people not to depend on subtle behaviour of gcc, and to write C language instead and to leave the atomic guarantees to asm statements that the C compiler isn't allowed to mess up. Oh maybe it already triggers on Martin's machine... ;), this is another reason why I would like to see this can of warms closed, so I don't have to worry every time that gcc doesn't something silly that could never be catched by the gcc regression test suite, since gcc would be still C complaint despite the apparently silly thing (silly from the point of view of a kernel developer at least, not necessairly silly from the point of view of a gcc developer). I recommend to talk to the IBM compiler guys too, not just with me, I'm probably not the right person to advocate for not depending on subtles of the current gcc implementation, last time it wasn't me to ask to change the kernel to stop touching memory that can change under gcc, but it was Honza, a gcc developer recommending me not to depend on that (but we gave up eventually, like I will very easily giveup on this too). > Unless you do like ppc64 and clear it first :) But I agree that if your > architecture lets you do 64 bits atomic writes on the 32 bits arch, then > it's definitely the better solution. But then, you don't need a special > set_pte_atomic, just make the normal set_pte do that and be done with > it... Or is there a perf. issue there ? there is a perf issue, cmpxchg8b is a lot more costly than two movl and a smp_wmb in between. We only need atomic writes (not locked writes) in all set_pte, except ptep_establish which is the only overwriting a pte that is already present. so let's remove the C-language compliant corollary and let's stick to the userspace crashing x86 bug that is a tangible pratical bug. This fixes an smp race generating userspace mm corruption + potential security compromise sniffing random memory with threads on x86 with PAE with > 4G of ram. Signed-Off-By: Andrea Arcangeli <andrea@novell.com> Index: linux-2.5/include/asm-i386/pgtable-3level.h =================================================================== RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-i386/pgtable-3level.h,v retrieving revision 1.20 diff -u -p -r1.20 pgtable-3level.h --- linux-2.5/include/asm-i386/pgtable-3level.h 24 Aug 2004 18:28:07 -0000 1.20 +++ linux-2.5/include/asm-i386/pgtable-3level.h 26 Sep 2004 01:13:10 -0000 @@ -88,6 +88,13 @@ static inline pte_t ptep_get_and_clear(p return res; } +#define __HAVE_ARCH_PTEP_ESTABLISH +#define ptep_establish(__vma, __address, __ptep, __entry) \ +do { \ + set_pte_atomic(__ptep, __entry); \ + flush_tlb_page(__vma, __address); \ +} while (0) + static inline int pte_same(pte_t a, pte_t b) { return a.pte_low == b.pte_low && a.pte_high == b.pte_high; ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: ptep_establish/establish_pte needs set_pte_atomic and all set_pte must be written in asm 2004-09-26 1:32 ` Andrea Arcangeli @ 2004-09-26 5:29 ` Benjamin Herrenschmidt 2004-09-26 15:39 ` Andrea Arcangeli 0 siblings, 1 reply; 20+ messages in thread From: Benjamin Herrenschmidt @ 2004-09-26 5:29 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Martin J. Bligh, Andrew Morton, Linux Kernel list On Sun, 2004-09-26 at 11:32, Andrea Arcangeli wrote: > maybe I'm biased because I'm reading x86-64 code, but where? the > software mkdirty and mkyoung seem to all be inside the page_table_lock. ppc and ppc64 who treat their hash table as a kind of big tlb cache, and embedded ppc's with software loaded TLBs all have the TLB or hash refill mecanism "mimmic" a HW TLB load, that is it is assembly code that will set the DIRTY or ACCESSED bits without taking the page table lock > Not sure how could I get you were talking about those floating around > patches. I still don't get any connetion with those patches and the > above discussion. Oh, I side-tracked a bit on the need to make the PTE update & hash flush atomic on ppc64 using the per-PTE lock _PAGE_BUSY bit we have there if we ever implement that lockless do_page_fault(), but that was a side discussion, sorry for confusion. > > Again, find me a single case where the compiler will generate anything > > but an "std" instruction for the above on ppc64 and you'll get a free > > case of champagne :) > > If something I can check x86-64 which has the same issue, not ppc64. > > If you prefer to ignore those theoretical smp races, then I will save > this email and I'll forward it to you when it triggers in production > because gcc did something strange, and then you will send me the free > case of champagne :) We have a deal :) > I'm also waiting the other bug for the lack of volatile variables where > we access memory that can change under us to > trigger anywhere in the kernel, only after it does I will have a good > argument to convince people not to depend on subtle behaviour of gcc, > and to write C language instead and to leave the atomic guarantees to > asm statements that the C compiler isn't allowed to mess up. > > Oh maybe it already triggers on Martin's machine... ;), this is another > reason why I would like to see this can of warms closed, so I don't have > to worry every time that gcc doesn't something silly that could never be > catched by the gcc regression test suite, since gcc would be still C > complaint despite the apparently silly thing (silly from the point of > view of a kernel developer at least, not necessairly silly from the > point of view of a gcc developer). Oh, I agree the lack of volatile on a switch/case may be an issue, I've seen really esoteric ways of generating switch/case... > there is a perf issue, cmpxchg8b is a lot more costly than two movl and > a smp_wmb in between. We only need atomic writes (not locked writes) in > all set_pte, except ptep_establish which is the only overwriting a pte > that is already present. Right, in your hypotetical scenario, I'd just have to make sure an std instruction is generated on ppc64 Ben. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: ptep_establish/establish_pte needs set_pte_atomic and all set_pte must be written in asm 2004-09-26 5:29 ` Benjamin Herrenschmidt @ 2004-09-26 15:39 ` Andrea Arcangeli 0 siblings, 0 replies; 20+ messages in thread From: Andrea Arcangeli @ 2004-09-26 15:39 UTC (permalink / raw) To: Benjamin Herrenschmidt; +Cc: Martin J. Bligh, Andrew Morton, Linux Kernel list On Sun, Sep 26, 2004 at 03:29:48PM +1000, Benjamin Herrenschmidt wrote: > On Sun, 2004-09-26 at 11:32, Andrea Arcangeli wrote: > > > maybe I'm biased because I'm reading x86-64 code, but where? the > > software mkdirty and mkyoung seem to all be inside the page_table_lock. > > ppc and ppc64 who treat their hash table as a kind of big tlb cache, and > embedded ppc's with software loaded TLBs all have the TLB or hash refill > mecanism "mimmic" a HW TLB load, that is it is assembly code that will > set the DIRTY or ACCESSED bits without taking the page table lock ok, I thought you were talking about common code setting the dirty and accessed bit. The x86 architecture in hardware as well sets dirty and accessed bit, without the page table lock. > Oh, I side-tracked a bit on the need to make the PTE update & hash flush > atomic on ppc64 using the per-PTE lock _PAGE_BUSY bit we have there if > we ever implement that lockless do_page_fault(), but that was a side agreed. > discussion, sorry for confusion. No problem. > Right, in your hypotetical scenario, I'd just have to make sure an std > instruction is generated on ppc64 exactly. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: ptep_establish/establish_pte needs set_pte_atomic and all set_pte must be written in asm 2004-09-26 0:44 ` Benjamin Herrenschmidt 2004-09-26 1:32 ` Andrea Arcangeli @ 2004-09-26 14:41 ` Martin J. Bligh 2004-09-26 15:41 ` Andrea Arcangeli 1 sibling, 1 reply; 20+ messages in thread From: Martin J. Bligh @ 2004-09-26 14:41 UTC (permalink / raw) To: Benjamin Herrenschmidt, Andrea Arcangeli; +Cc: Andrew Morton, Linux Kernel list >> anyways on x86 the bug is real in practice, regardless of the C >> compiler, heck we even put a smp_wmb() in between the two writes. The >> fact all other archs are buggy in theory too is just a corollary. I >> thought it worth to fix the theoretical bug in all other archs too, >> instead of keeping playing russian roulette. > > How so ? A bunch of archs have the pte beeing a simple long, on these > set_pte is perfectly atomic as it is... I'd say in this regard that > x86 is the exception ;) Wouldn't it make sense to call set_pte_atomic, and just have that resolve to set_pte on 90% of arches? (I'm ignoring the wierdo compiler issue here, this is just for arches with pte > long). M. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: ptep_establish/establish_pte needs set_pte_atomic and all set_pte must be written in asm 2004-09-26 14:41 ` Martin J. Bligh @ 2004-09-26 15:41 ` Andrea Arcangeli 0 siblings, 0 replies; 20+ messages in thread From: Andrea Arcangeli @ 2004-09-26 15:41 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list On Sun, Sep 26, 2004 at 07:41:52AM -0700, Martin J. Bligh wrote: > Wouldn't it make sense to call set_pte_atomic, and just have that resolve > to set_pte on 90% of arches? (I'm ignoring the wierdo compiler issue here, > this is just for arches with pte > long). I also like this more, since it betters defines that set_pte really cannot be called on an established pte in common code [i.e. the bug triggering on x86] (and asm-generic is common code). ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: ptep_establish/establish_pte needs set_pte_atomic and all set_pte must be written in asm 2004-09-25 15:54 ptep_establish/establish_pte needs set_pte_atomic and all set_pte must be written in asm Andrea Arcangeli 2004-09-25 23:33 ` Benjamin Herrenschmidt @ 2004-09-25 23:44 ` Rik van Riel 2004-09-26 0:31 ` Andrea Arcangeli 1 sibling, 1 reply; 20+ messages in thread From: Rik van Riel @ 2004-09-25 23:44 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Martin J. Bligh, Andrew Morton, linux-kernel On Sat, 25 Sep 2004, Andrea Arcangeli wrote: > set_pte), while something like this should be fine: > > ptep_get_and_clear > set_pte > flush_tlb Almost. Think of software TLB refills, especially HPTE. The order needs to be: ptep_get_and_clear flush_tlb set_pte Any page faults happening "in the middle" will end up as virtual no-ops once they grab the page_table_lock. Luckily the PPC64 code in 2.6 seems to have a fix for the race already. Martin's race is a 2.4 thing only and needs a 2-line change to establish_pte(), to make the code do what I described above. -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: ptep_establish/establish_pte needs set_pte_atomic and all set_pte must be written in asm 2004-09-25 23:44 ` Rik van Riel @ 2004-09-26 0:31 ` Andrea Arcangeli 2004-09-26 0:37 ` Rik van Riel 0 siblings, 1 reply; 20+ messages in thread From: Andrea Arcangeli @ 2004-09-26 0:31 UTC (permalink / raw) To: Rik van Riel; +Cc: Martin J. Bligh, Andrew Morton, linux-kernel On Sat, Sep 25, 2004 at 07:44:05PM -0400, Rik van Riel wrote: > On Sat, 25 Sep 2004, Andrea Arcangeli wrote: > > > set_pte), while something like this should be fine: > > > > ptep_get_and_clear > > set_pte > > flush_tlb > > Almost. Think of software TLB refills, especially HPTE. > The order needs to be: > > ptep_get_and_clear > flush_tlb > set_pte Interesting point. I sure agree it's saner to have the flush_tlb in between ptep_get_and_clear and set_pte, I said the other version just because I'm thinking hardware TLB and it shouldn't make any difference on hardware TLB anyways, does it? > Any page faults happening "in the middle" will end up as > virtual no-ops once they grab the page_table_lock. I'm not very fond on software TLBs and their internal locking, but exactly because of what you said ("they grab the page_table_lock."), how can the software TLB ever care about the flush_tlb in between ptep_get_and_clear and set_pte? ptep_establish is obviously always called with the page_table_lock hold. Nobody is allowed to call ptep_establish without it. So a larger code sequence of my version expands to: spin_lock(&page_table_lock) ptep_establish() { ptep_get_and_clear set_pte flush_tlb } spin_unlock(&page_table_lock) How can a software TLB care about a tlb flush in between two pieces of code that are anyways under the page_table_lock? ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: ptep_establish/establish_pte needs set_pte_atomic and all set_pte must be written in asm 2004-09-26 0:31 ` Andrea Arcangeli @ 2004-09-26 0:37 ` Rik van Riel 0 siblings, 0 replies; 20+ messages in thread From: Rik van Riel @ 2004-09-26 0:37 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Martin J. Bligh, Andrew Morton, linux-kernel On Sun, 26 Sep 2004, Andrea Arcangeli wrote: > I'm not very fond on software TLBs and their internal locking, but > exactly because of what you said ("they grab the page_table_lock."), how > can the software TLB ever care about the flush_tlb in between > ptep_get_and_clear and set_pte? It only grabs the page_table_lock if it finds pte_none(pte), otherwise it'll run without any of the generic VM locks. > How can a software TLB care about a tlb flush in between two pieces of > code that are anyways under the page_table_lock? Thing is, it's not under page_table_lock ;) -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2004-09-28 9:15 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-09-25 15:54 ptep_establish/establish_pte needs set_pte_atomic and all set_pte must be written in asm Andrea Arcangeli
2004-09-25 23:33 ` Benjamin Herrenschmidt
2004-09-26 0:20 ` Andrea Arcangeli
2004-09-26 0:31 ` Rik van Riel
2004-09-26 0:46 ` Andrea Arcangeli
2004-09-26 0:59 ` Benjamin Herrenschmidt
2004-09-26 1:36 ` Andrea Arcangeli
2004-09-26 5:31 ` Benjamin Herrenschmidt
2004-09-26 20:30 ` Paul Mackerras
[not found] ` <20040926203640.GR2499@dualathlon.random>
2004-09-27 16:41 ` Andrea Arcangeli
2004-09-28 9:12 ` Pavel Machek
2004-09-26 0:44 ` Benjamin Herrenschmidt
2004-09-26 1:32 ` Andrea Arcangeli
2004-09-26 5:29 ` Benjamin Herrenschmidt
2004-09-26 15:39 ` Andrea Arcangeli
2004-09-26 14:41 ` Martin J. Bligh
2004-09-26 15:41 ` Andrea Arcangeli
2004-09-25 23:44 ` Rik van Riel
2004-09-26 0:31 ` Andrea Arcangeli
2004-09-26 0:37 ` Rik van Riel
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox