* accessed/dirty bit handler tuning
@ 2006-03-13 14:08 Zoltan Menyhart
2006-03-13 16:31 ` Christoph Lameter
` (34 more replies)
0 siblings, 35 replies; 36+ messages in thread
From: Zoltan Menyhart @ 2006-03-13 14:08 UTC (permalink / raw)
To: linux-ia64
[-- Attachment #1: Type: text/plain, Size: 532 bytes --]
I think we can do some accessed/dirty bit handler tuning. E.g.
in my patch (based on Christoph's one entitled "Fix race in the
accessed/dirty bit handlers"), I think we gain a bit by:
- using the "nta" hint in order not to "pollute" the caches L1D / L3
- using the "bias" hint in order to obtain the "E" cache state at the
beginning (the additional snoop bus cycle for the "S" => "E" state
transition is eliminated)
- not testing the result of "cmpxchg" (we'll re-read the PTE and
compare it anyway)
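For readers less fluent in ia64 assembly, the overall flow of the handler can be sketched in C. This is only an illustrative model: the `install_translation` / `purge_translation` helpers standing in for `itc.d` / `ptc.l` are hypothetical, and the bit values are made up for the sketch.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define _PAGE_P 0x1UL   /* illustrative bit values, not the real ivt.S ones */
#define _PAGE_A 0x20UL
#define _PAGE_D 0x40UL

static uint64_t installed_pte;  /* stands in for the TLB entry */
static void install_translation(uint64_t pte) { installed_pte = pte; } /* ~ itc.d */
static void purge_translation(void) { installed_pte = 0; }             /* ~ ptc.l */

/* Mirror of the dirty_bit handler's flow: cmpxchg the new value in
 * (ignoring the result), install the translation speculatively, then
 * re-read the PTE and purge if it is not what we wanted to write. */
static bool mark_dirty(_Atomic uint64_t *ptep)
{
    uint64_t old = atomic_load(ptep);            /* ld8              */
    if (!(old & _PAGE_P))                        /* tbit.z: present? */
        return false;
    uint64_t want = old | _PAGE_D | _PAGE_A;     /* or               */
    uint64_t expected = old;
    /* Result deliberately not tested, as described above: the re-read
     * below catches both a failed cmpxchg and a later concurrent update. */
    atomic_compare_exchange_strong(ptep, &expected, want); /* cmpxchg8.acq */
    install_translation(want);                   /* itc.d            */
    if (atomic_load(ptep) != want) {             /* ld8 again        */
        purge_translation();                     /* ptc.l            */
        return false;
    }
    return true;
}
```

Note that the "same dirty bit" exception falls out naturally: if a concurrent writer stored exactly the bits we wanted, the re-read still matches and the translation is kept.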
Thanks,
Zoltan
[-- Attachment #2: srlz.d.diff2 --]
[-- Type: text/plain, Size: 3111 bytes --]
--- old/arch/ia64/kernel/ivt.S 2006-03-09 16:56:18.000000000 +0100
+++ new/arch/ia64/kernel/ivt.S 2006-03-13 14:34:40.000000000 +0100
@@ -557,29 +557,59 @@ ENTRY(dirty_bit)
#ifdef CONFIG_SMP
mov r28=ar.ccv // save ar.ccv
;;
-1: ld8 r18=[r17]
+ /*
+ * The atomic instructions are handled exclusively by the L2 (L2D) cache.
+ * "bias" is a hint to acquire exclusive ownership.
+ * "nta": allocate the cache line only in L2 and to bias it to be replaced.
+ */
+1: ld8.bias.nta r18 = [r17]
;; // avoid RAW on r18
mov ar.ccv=r18 // set compare value for cmpxchg
or r25=_PAGE_D|_PAGE_A,r18 // set the dirty and accessed bits
tbit.z p7,p6 = r18,_PAGE_P_BIT // Check present bit
- ;;
-(p6) cmpxchg8.acq r26=[r17],r25,ar.ccv // Only update if page is present
- mov r24=PAGE_SHIFT<<2
- ;;
-(p6) cmp.eq p6,p7=r26,r18 // Only compare if page is present
- ;;
-(p6) itc.d r25 // install updated PTE
+ mov r24 = PAGE_SHIFT << 2
;;
/*
- * Tell the assemblers dependency-violation checker that the above "itc" instructions
- * cannot possibly affect the following loads:
+ * "nta" is a hint not to allocate the cache line elsewhere than in L2,
+ * to bias it to be replaced and not to write it back into L3.
+ *
+ * We do not care about the result of "cmpxchg". It only makes sure we do not
+ * overwrite a PTE that has been modified by someone else in the meantime.
+ * We'll read back the in-memory PTE later.
*/
- dv_serialize_data
-
- ld8 r18=[r17] // read PTE again
+(p6) cmpxchg8.acq.nta r26 = [r17],r25,ar.ccv // Only update if page is present
+ /*
+ * We load the new translation independently of the success of "cmpxchg".
+ * Should "cmpxchg" have failed, we'll purge the new translation later.
+ */
+(p6) itc.d r25 // Install updated PTE if page is present
+ ;; // "itc" must be the last in the group
+ /*
+ * We make sure "itc" is visible to generated purges (like "ptc.ga")
+ * before we re-read the PTE.
+ * (No, we are not going to use the freshly inserted translation for the next
+ * "ld".)
+ * A simple ";;" does not make sure that the purges / invalidations go all the
+ * way down. E.g. in case of page size of 64 K, up to 16 L1 DTLB entries may be
+ * purged and all the L1D cache lines brought in via these translations need to
+ * be invalidated.
+ */
+(p6) srlz.d
+ /*
+ * No need for ";;", the following "ld" can be in the same group as "srlz.d" is.
+ */
+(p6) ld8.nta r18 = [r17] // Read PTE again
;;
- cmp.eq p6,p7=r18,r25 // is it same as the newly installed
+(p6) cmp.eq p0, p7 = r18, r25 // Is it same as we wanted to install?
;;
+ /*
+ * The new translation (or the old one if "p6" is off) gets purged if:
+ * - the page is not present
+ * - the in memory PTE is not what we wanted to write out because:
+ * + someone else has modified it after our successful "cmpxchg"
+ * + "cmpxchg" has failed (with the exception when someone else has set the
+ * very same dirty bit as we wanted to => our new translation is correct)
+ */
(p7) ptc.l r16,r24
mov b0=r29 // restore b0
mov ar.ccv=r28
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: accessed/dirty bit handler tuning
2006-03-13 14:08 accessed/dirty bit handler tuning Zoltan Menyhart
@ 2006-03-13 16:31 ` Christoph Lameter
2006-03-13 16:55 ` Zoltan Menyhart
` (33 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Christoph Lameter @ 2006-03-13 16:31 UTC (permalink / raw)
To: linux-ia64
On Mon, 13 Mar 2006, Zoltan Menyhart wrote:
> I think we can do some accessed/dirty bit handler tuning. E.g.
> in my patch (based on Christoph's one entitled "Fix race in the
> accessed/dirty bit handlers"), I think we gain a bit by:
Could you measure the effect that this has? We seem to be getting into
some special processor behavior here.
> - using the "nta" hint in order not to "pollute" the caches L1D / L3
The last that I heard about nta was that it just skips the marking of a
cacheline as recent. Thus the cacheline will be a more likely candidate to
be evicted from the caches. Are you sure that the processors can bypass
the L1D and L3?
* Re: accessed/dirty bit handler tuning
2006-03-13 14:08 accessed/dirty bit handler tuning Zoltan Menyhart
2006-03-13 16:31 ` Christoph Lameter
@ 2006-03-13 16:55 ` Zoltan Menyhart
2006-03-13 19:46 ` Chen, Kenneth W
` (32 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Zoltan Menyhart @ 2006-03-13 16:55 UTC (permalink / raw)
To: linux-ia64
Christoph Lameter wrote:
> Could you measure the effect that this has? We seem to be getting into
> some special processor behavior here.
To tell the truth, I cannot measure it.
The ia64 architecture defines some hints which *may* increase performance.
If you have a sequence that does not cost a penny and may run faster...
If you have a shorter and somewhat faster sequence because of the elimination
of the "cmp" that had to wait for the completion of the "cmpxchg"...
... why not?
> The last that I heard about nta was that it just skips the marking of a
> cacheline as recent. Thus the cacheline will be a more likely candidate to
> be evicted from the caches. Are you sure that the processors can bypass
> the L1D and L3?
Please refer to e.g. the Itanium 2 Processor Reference Manual for Software
Development and Optimization, May 2004, Table 5-4 on page 41.
Thanks,
Zoltan
* RE: accessed/dirty bit handler tuning
2006-03-13 14:08 accessed/dirty bit handler tuning Zoltan Menyhart
2006-03-13 16:31 ` Christoph Lameter
2006-03-13 16:55 ` Zoltan Menyhart
@ 2006-03-13 19:46 ` Chen, Kenneth W
2006-03-13 20:05 ` Luck, Tony
` (31 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Chen, Kenneth W @ 2006-03-13 19:46 UTC (permalink / raw)
To: linux-ia64
Zoltan Menyhart wrote on Monday, March 13, 2006 6:09 AM
> I think we can do some accessed/dirty bit handler tuning. E.g.
> in my patch (based on Christoph's one entitled "Fix race in the
> accessed/dirty bit handlers"), I think we gain a bit by:
>
> - using the "nta" hint in order not to "pollute" the caches L1D / L3
>
> - using the "bias" hint in order to obtain the "E" cache state at the
> beginning (the additional snoop bus cycle for the "S" => "E" state
> transition is eliminated)
>
> - not testing the result of "cmpxchg" (we'll re-read the PTE and
> compare it anyway)
Hmm, I think another alternative is to rip out all the itc insertion
code and let the hardware page walker do the "dirty" job. Because it
is known and architected to be atomic-read-and-insert and is also
known to honor ptc.g while atomic-read-and-insert is in-flight (i.e.,
won't insert tlb entry).
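A hedged C sketch of this alternative (illustrative model only; the `purge_translation` helper standing in for `ptc.l` is hypothetical, and the bit values are made up): the handler merely updates the PTE and purges the stale translation, relying on the VHPT walker to re-insert the fresh one atomically when the faulting access is retried.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define _PAGE_P 0x1UL   /* illustrative bit values */
#define _PAGE_A 0x20UL
#define _PAGE_D 0x40UL

static void purge_translation(void) { /* stands in for ptc.l */ }

/* No itc at all: set the bits with cmpxchg, purge the stale local
 * translation, and rely on the hardware page walker's atomic
 * read-and-insert when the access is retried. */
static bool mark_dirty_hpw(_Atomic uint64_t *ptep)
{
    uint64_t old = atomic_load(ptep);
    if (!(old & _PAGE_P))
        return false;
    uint64_t expected = old;
    atomic_compare_exchange_strong(ptep, &expected,
                                   old | _PAGE_D | _PAGE_A);
    purge_translation();   /* unconditional, as in the patch below */
    return true;
}
```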
- Ken
ivt.S | 19 ++-----------------
1 files changed, 2 insertions(+), 17 deletions(-)
--- ./arch/ia64/kernel/ivt.S.orig 2006-03-13 12:40:25.245145301 -0800
+++ ./arch/ia64/kernel/ivt.S 2006-03-13 12:41:53.923855152 -0800
@@ -558,29 +558,14 @@ ENTRY(dirty_bit)
mov r28=ar.ccv // save ar.ccv
;;
1: ld8 r18=[r17]
+ mov r24=PAGE_SHIFT<<2
;; // avoid RAW on r18
mov ar.ccv=r18 // set compare value for cmpxchg
or r25=_PAGE_D|_PAGE_A,r18 // set the dirty and accessed bits
tbit.z p7,p6 = r18,_PAGE_P_BIT // Check present bit
;;
(p6) cmpxchg8.acq r26=[r17],r25,ar.ccv // Only update if page is present
- mov r24=PAGE_SHIFT<<2
- ;;
-(p6) cmp.eq p6,p7=r26,r18 // Only compare if page is present
- ;;
-(p6) itc.d r25 // install updated PTE
- ;;
- /*
- * Tell the assemblers dependency-violation checker that the above "itc" instructions
- * cannot possibly affect the following loads:
- */
- dv_serialize_data
-
- ld8 r18=[r17] // read PTE again
- ;;
- cmp.eq p6,p7=r18,r25 // is it same as the newly installed
- ;;
-(p7) ptc.l r16,r24
+ ptc.l r16,r24
mov b0=r29 // restore b0
mov ar.ccv=r28
#else
* RE: accessed/dirty bit handler tuning
2006-03-13 14:08 accessed/dirty bit handler tuning Zoltan Menyhart
` (2 preceding siblings ...)
2006-03-13 19:46 ` Chen, Kenneth W
@ 2006-03-13 20:05 ` Luck, Tony
2006-03-13 20:14 ` Chen, Kenneth W
` (30 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Luck, Tony @ 2006-03-13 20:05 UTC (permalink / raw)
To: linux-ia64
>Hmm, I think another alternative is to rip out all the itc insertion
>code and let the hardware page walker do the "dirty" job. Because it
>is known and architected to be atomic-read-and-insert and is also
>known to honor ptc.g while atomic-read-and-insert is in-flight (i.e.,
>won't insert tlb entry).
Can we get some perf. numbers ... this will take each dirty fault twice
(though the second should be fast if VHPT does its job). This might
be slower than putting in the srlz.d that Zoltan wants.
-Tony
* RE: accessed/dirty bit handler tuning
2006-03-13 14:08 accessed/dirty bit handler tuning Zoltan Menyhart
` (3 preceding siblings ...)
2006-03-13 20:05 ` Luck, Tony
@ 2006-03-13 20:14 ` Chen, Kenneth W
2006-03-13 22:53 ` Chen, Kenneth W
` (29 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Chen, Kenneth W @ 2006-03-13 20:14 UTC (permalink / raw)
To: linux-ia64
Luck, Tony wrote on Monday, March 13, 2006 12:05 PM
> >Hmm, I think another alternative is to rip out all the itc insertion
> >code and let the hardware page walker do the "dirty" job. Because it
> >is known and architected to be atomic-read-and-insert and is also
> >known to honor ptc.g while atomic-read-and-insert is in-flight (i.e.,
> >won't insert tlb entry).
>
> Can we get some perf. numbers ... this will take each dirty fault twice
> (though the second should be fast if VHPT does its job). This might
> be slower than putting in the srlz.d that Zoltan wants.
I don't have any numbers ... though I've measured a 5-cycle HPW insert
latency. It ought to be faster than srlz.d.
On the other hand, the behavior of itc with respect to ptc.g is still up
in the air, pending a query to the ia64 hardware architects. But the more
I read the SDM, the more it looks like a statement of actual processor
behavior instead of a statement of a software requirement.
So going either way is a premature decision, I guess.
- Ken
* RE: accessed/dirty bit handler tuning
2006-03-13 14:08 accessed/dirty bit handler tuning Zoltan Menyhart
` (4 preceding siblings ...)
2006-03-13 20:14 ` Chen, Kenneth W
@ 2006-03-13 22:53 ` Chen, Kenneth W
2006-03-14 10:12 ` Zoltan Menyhart
` (28 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Chen, Kenneth W @ 2006-03-13 22:53 UTC (permalink / raw)
To: linux-ia64
Zoltan Menyhart wrote on Monday, March 13, 2006 6:09 AM
> I think we can do some accessed/dirty bit handler tuning. E.g.
> in my patch (based on Christoph's one entitled "Fix race in the
> accessed/dirty bit handlers"), I think we gain a bit by:
>
> ...
> - not testing the result of "cmpxchg" (we'll re-read the PTE and
> compare it anyway)
It occurs to me that you can do even more: you don't even need the
2nd load. Move itc opportunistically before cmpxchg, then use the data
returned from cmpxchg and compare it to the first read.
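In C11 terms (again only an illustrative model with hypothetical helper names and made-up bit values), this works because `atomic_compare_exchange_strong` hands back the old memory value in its `expected` argument, so the re-read can be replaced by comparing that value with the first load:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define _PAGE_P 0x1UL   /* illustrative bit values */
#define _PAGE_A 0x20UL
#define _PAGE_D 0x40UL

static uint64_t installed_pte;
static void install_translation(uint64_t pte) { installed_pte = pte; } /* ~ itc.d */
static void purge_translation(void) { installed_pte = 0; }             /* ~ ptc.l */

/* Variant without the second load: install the translation
 * opportunistically, then compare the value cmpxchg returns (r26 in the
 * assembly) with the first read (r18); purge on mismatch. */
static bool mark_dirty_noreload(_Atomic uint64_t *ptep)
{
    uint64_t old = atomic_load(ptep);
    if (!(old & _PAGE_P))
        return false;
    uint64_t want = old | _PAGE_D | _PAGE_A;
    install_translation(want);                         /* itc before cmpxchg */
    uint64_t seen = old;
    atomic_compare_exchange_strong(ptep, &seen, want); /* seen <- old value  */
    if (seen != old) {                                 /* cmp.eq r26, r18    */
        purge_translation();
        return false;
    }
    return true;
}
```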
Oh, well, I suppose Tony has enough versions to jog around ;-)
- Ken
--- ./arch/ia64/kernel/ivt.S.orig 2006-03-13 15:39:36.745990157 -0800
+++ ./arch/ia64/kernel/ivt.S 2006-03-13 15:43:56.757705722 -0800
@@ -563,23 +563,12 @@ ENTRY(dirty_bit)
or r25=_PAGE_D|_PAGE_A,r18 // set the dirty and accessed bits
tbit.z p7,p6 = r18,_PAGE_P_BIT // Check present bit
;;
+(p6) itc.d r25 // install updated PTE
(p6) cmpxchg8.acq r26=[r17],r25,ar.ccv // Only update if page is present
mov r24=PAGE_SHIFT<<2
;;
(p6) cmp.eq p6,p7=r26,r18 // Only compare if page is present
;;
-(p6) itc.d r25 // install updated PTE
- ;;
- /*
- * Tell the assemblers dependency-violation checker that the above "itc" instructions
- * cannot possibly affect the following loads:
- */
- dv_serialize_data
-
- ld8 r18=[r17] // read PTE again
- ;;
- cmp.eq p6,p7=r18,r25 // is it same as the newly installed
- ;;
(p7) ptc.l r16,r24
mov b0=r29 // restore b0
mov ar.ccv=r28
* Re: accessed/dirty bit handler tuning
2006-03-13 14:08 accessed/dirty bit handler tuning Zoltan Menyhart
` (5 preceding siblings ...)
2006-03-13 22:53 ` Chen, Kenneth W
@ 2006-03-14 10:12 ` Zoltan Menyhart
2006-03-14 19:33 ` Chen, Kenneth W
` (27 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Zoltan Menyhart @ 2006-03-14 10:12 UTC (permalink / raw)
To: linux-ia64
Chen, Kenneth W wrote:
> Hmm, I think another alternative is to rip out all the itc insertion
> code and let the hardware page walker do the "dirty" job. Because it
> is known and architected to be atomic-read-and-insert and is also
> known to honor ptc.g while atomic-read-and-insert is in-flight (i.e.,
> won't insert tlb entry).
From the "semantic point of view", I can agree with you.
Yet in my sequence:
(p6) cmpxchg8.acq.nta r26 = [r17],r25,ar.ccv
(p6) itc.d r25
;;
(p6) srlz.d
the execution of "cmpxchg" (which is not a quick & simple instruction)
partially overlaps that of "itc" (the latter has acquire semantics and
does not depend on the completion of the former).
If it is the page walker that inserts the new translation, then it has
to observe the purge requirements, too:
E.g. in case of page size of 64 K, up to 16 L1 DTLB entries may be
purged and all the L1D cache lines brought in via these translations
need to be invalidated.
It does take time.
> I don't have any numbers ... Though I've measured 5 cycles hpw insert
> latency. It ought to be faster than srlz.d.
How did you measure it?
I'd expect (sure, not knowing exactly how the HW works :-)) up to:
16 max. number of L1 DTLB entries used for a page
* 32 L1D cache is indexed as 0...31
----
512
cycles only for purging and invalidating the old stuff.
I think the CPU refuses the external purge request while the hardware
page walker is busy with this clean up activity
(retry response on the system bus).
In my sequence, it is "srlz.d" that stalls the exec. pipeline during
this clean up activity.
> It occurs to me that you can do even more: you don't even need the
> 2nd load. Move itc opportunistically before cmpxchg, then use the data
> returned from cmpxchg and compare it to the first read.
You will have to have a slightly more complicated sequence:
(p6) itc.d r25
;; // "itc" must be the last in the group
(p6) srlz.d // This is what I think is necessary
(p6) cmpxchg8.acq r26=[r17],r25,ar.ccv
You avoid an L2 cache access by eliminating the "ld", but you do not
take advantage of the partial overlap of "cmpxchg" and "itc".
Regards,
Zoltan
* RE: accessed/dirty bit handler tuning
2006-03-13 14:08 accessed/dirty bit handler tuning Zoltan Menyhart
` (6 preceding siblings ...)
2006-03-14 10:12 ` Zoltan Menyhart
@ 2006-03-14 19:33 ` Chen, Kenneth W
2006-03-15 13:29 ` Zoltan Menyhart
` (26 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Chen, Kenneth W @ 2006-03-14 19:33 UTC (permalink / raw)
To: linux-ia64
Zoltan Menyhart wrote on Tuesday, March 14, 2006 2:13 AM
> Yet in my sequence:
>
> (p6) cmpxchg8.acq.nta r26 = [r17],r25,ar.ccv
> (p6) itc.d r25
> ;;
> (p6) srlz.d
>
> the execution of "cmpxchg" (which is not a quick & simple instruction)
> partially overlaps that of "itc" (the latter has acquire semantics and
> does not depend on the completion of the former).
This is indeed a very fine work of art in micro-optimization. Thank you
for pointing this out. I think this is going to save us a lot of cycles.
> If it is the page walker that inserts the new translation, then it has
> to observe the purge requirements, too:
> E.g. in case of page size of 64 K, up to 16 L1 DTLB entries may be
> purged and all the L1D cache lines brought in via these translations
> need to be invalidated.
There is no need to worry about performance in the slow path. The slow path
is meant to take whatever effort is needed to fix up a detected race
condition. So let it be a couple of cycles longer.
> I'd expect (sure, not knowing exactly how the HW works :-)) up to:
>
> 16 max. number of L1 DTLB entries used for a page
> * 32 L1D cache is indexed as 0...31
> ----
> 512
>
> cycles only for purging and invalidating the old stuff.
The hardware is a lot smarter than you think :-) Come on, we are
talking about an Itanium processor here. Please have some faith in the
hardware designers.
- Ken
* accessed/dirty bit handler tuning
2006-03-13 14:08 accessed/dirty bit handler tuning Zoltan Menyhart
` (7 preceding siblings ...)
2006-03-14 19:33 ` Chen, Kenneth W
@ 2006-03-15 13:29 ` Zoltan Menyhart
2006-03-15 17:37 ` Chen, Kenneth W
` (25 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Zoltan Menyhart @ 2006-03-15 13:29 UTC (permalink / raw)
To: linux-ia64
[-- Attachment #1: Type: text/plain, Size: 273 bytes --]
This patch is based on Christoph's one entitled "Fix race in the
accessed/dirty bit handlers".
- It adds the missing "srlz.d"
- It uses some "nta" and "bias" cache hints
- It slightly reorganizes the routines for some minor performance improvements
Thanks,
Zoltan
[-- Attachment #2: srlz.d.diff3 --]
[-- Type: text/plain, Size: 6512 bytes --]
Signed-off-by: Zoltan Menyhart <Zoltan.Menyhart@bull.net>
Index: linux-2.6.16-rc5-mm3/arch/ia64/kernel/ivt.S
===================================================================
--- old/arch/ia64/kernel/ivt.S 2006-03-15 12:01:23.000000000 +0100
+++ new/arch/ia64/kernel/ivt.S 2006-03-15 14:11:46.000000000 +0100
@@ -557,29 +557,59 @@ ENTRY(dirty_bit)
#ifdef CONFIG_SMP
mov r28=ar.ccv // save ar.ccv
;;
-1: ld8 r18=[r17]
- ;; // avoid RAW on r18
+ /*
+ * The atomic instructions are handled exclusively by the L2 (L2D) cache.
+ * "bias" is a hint to acquire exclusive ownership.
+ * "nta": allocate the cache line only in L2 and to bias it to be replaced.
+ */
+1: ld8.bias.nta r18 = [r17]
+ ;;
mov ar.ccv=r18 // set compare value for cmpxchg
or r25=_PAGE_D|_PAGE_A,r18 // set the dirty and accessed bits
tbit.z p7,p6 = r18,_PAGE_P_BIT // Check present bit
- ;;
-(p6) cmpxchg8.acq r26=[r17],r25,ar.ccv // Only update if page is present
- mov r24=PAGE_SHIFT<<2
- ;;
-(p6) cmp.eq p6,p7=r26,r18 // Only compare if page is present
- ;;
-(p6) itc.d r25 // install updated PTE
+ mov r24 = PAGE_SHIFT << 2
;;
/*
- * Tell the assemblers dependency-violation checker that the above "itc" instructions
- * cannot possibly affect the following loads:
+ * "nta" is a hint not to allocate the cache line elsewhere than in L2,
+ * to bias it to be replaced and not to write it back into L3.
+ *
+ * We do not care about the result of "cmpxchg". It only makes sure we do not
+ * overwrite a PTE that has been modified by someone else in the meantime.
+ * We'll read back the in-memory PTE later.
*/
- dv_serialize_data
-
- ld8 r18=[r17] // read PTE again
+(p6) cmpxchg8.acq.nta r26 = [r17],r25,ar.ccv // Only update if page is present
+ /*
+ * We load the new translation independently of the success of "cmpxchg".
+ * Should "cmpxchg" have failed, we'll purge the new translation later.
+ */
+(p6) itc.d r25 // Install updated PTE if page is present
+ ;; // "itc" must be the last in the group
+ /*
+ * We make sure "itc" is visible to generated purges (like "ptc.ga")
+ * before we re-read the PTE.
+ * (No, we are not going to use the freshly inserted translation for the next
+ * "ld".)
+ * A simple ";;" does not make sure that the purges / invalidations go all the
+ * way down. E.g. in case of page size of 64 K, up to 16 L1 DTLB entries may be
+ * purged and all the L1D cache lines brought in via these translations need to
+ * be invalidated.
+ */
+(p6) srlz.d
+ /*
+ * No need for ";;", the following "ld" can be in the same group as "srlz.d" is.
+ */
+(p6) ld8.nta r18 = [r17] // Read PTE again
;;
- cmp.eq p6,p7=r18,r25 // is it same as the newly installed
+(p6) cmp.eq p0, p7 = r18, r25 // Is it same as we wanted to install?
;;
+ /*
+ * The new translation (or the old one if "p6" is off) gets purged if:
+ * - the page is not present
+ * - the in memory PTE is not what we wanted to write out because:
+ * + someone else has modified it after our successful "cmpxchg"
+ * + "cmpxchg" has failed (with the exception when someone else has set the
+ * very same dirty bit as we wanted to => our new translation is correct)
+ */
(p7) ptc.l r16,r24
mov b0=r29 // restore b0
mov ar.ccv=r28
@@ -602,7 +632,10 @@ END(dirty_bit)
// 0x2400 Entry 9 (size 64 bundles) Instruction Access-bit (27)
ENTRY(iaccess_bit)
DBG_FAULT(9)
- // Like Entry 8, except for instruction access
+ /*
+ * Like Entry 8, except for instruction access.
+ * For the remarks on cache hints and synchronization issues see there.
+ */
mov r16=cr.ifa // get the address that caused the fault
movl r30=1f // load continuation point in case of nested fault
mov r31=pr // save predicates
@@ -623,28 +656,20 @@ ENTRY(iaccess_bit)
#ifdef CONFIG_SMP
mov r28=ar.ccv // save ar.ccv
;;
-1: ld8 r18=[r17]
+1: ld8.bias.nta r18 = [r17]
;;
mov ar.ccv=r18 // set compare value for cmpxchg
or r25=_PAGE_A,r18 // set the accessed bit
tbit.z p7,p6 = r18,_PAGE_P_BIT // Check present bit
+ mov r24 = PAGE_SHIFT << 2
;;
-(p6) cmpxchg8.acq r26=[r17],r25,ar.ccv // Only if page present
- mov r24=PAGE_SHIFT<<2
- ;;
-(p6) cmp.eq p6,p7=r26,r18 // Only if page present
- ;;
-(p6) itc.i r25 // install updated PTE
+(p6) cmpxchg8.acq.nta r26 = [r17],r25,ar.ccv // Only update if page is present
+(p6) itc.i r25 // Install updated PTE if page is present
;;
- /*
- * Tell the assemblers dependency-violation checker that the above "itc" instructions
- * cannot possibly affect the following loads:
- */
- dv_serialize_data
-
- ld8 r18=[r17] // read PTE again
+(p6) srlz.d
+(p6) ld8.nta r18 = [r17] // Read PTE again
;;
- cmp.eq p6,p7=r18,r25 // is it same as the newly installed
+(p6) cmp.eq p0, p7 = r18, r25 // Is it same as we wanted to install?
;;
(p7) ptc.l r16,r24
mov b0=r29 // restore b0
@@ -668,7 +693,10 @@ END(iaccess_bit)
// 0x2800 Entry 10 (size 64 bundles) Data Access-bit (15,55)
ENTRY(daccess_bit)
DBG_FAULT(10)
- // Like Entry 8, except for data access
+ /*
+ * Like Entry 8, except for data access.
+ * For the remarks on cache hints and synchronization issues see there.
+ */
mov r16=cr.ifa // get the address that caused the fault
movl r30=1f // load continuation point in case of nested fault
;;
@@ -678,27 +706,20 @@ ENTRY(daccess_bit)
#ifdef CONFIG_SMP
mov r28=ar.ccv // save ar.ccv
;;
-1: ld8 r18=[r17]
- ;; // avoid RAW on r18
+1: ld8.bias.nta r18 = [r17]
+ ;;
mov ar.ccv=r18 // set compare value for cmpxchg
or r25=_PAGE_A,r18 // set the dirty bit
tbit.z p7,p6 = r18,_PAGE_P_BIT // Check present bit
+ mov r24 = PAGE_SHIFT << 2
;;
-(p6) cmpxchg8.acq r26=[r17],r25,ar.ccv // Only if page is present
- mov r24=PAGE_SHIFT<<2
- ;;
-(p6) cmp.eq p6,p7=r26,r18 // Only if page is present
- ;;
-(p6) itc.d r25 // install updated PTE
- /*
- * Tell the assemblers dependency-violation checker that the above "itc" instructions
- * cannot possibly affect the following loads:
- */
- dv_serialize_data
+(p6) cmpxchg8.acq.nta r26 = [r17],r25,ar.ccv // Only update if page is present
+(p6) itc.d r25 // Install updated PTE if page is present
;;
- ld8 r18=[r17] // read PTE again
+(p6) srlz.d
+(p6) ld8.nta r18 = [r17] // Read PTE again
;;
- cmp.eq p6,p7=r18,r25 // is it same as the newly installed
+(p6) cmp.eq p0, p7 = r18, r25 // Is it same as we wanted to install?
;;
(p7) ptc.l r16,r24
mov ar.ccv=r28
* RE: accessed/dirty bit handler tuning
2006-03-13 14:08 accessed/dirty bit handler tuning Zoltan Menyhart
` (8 preceding siblings ...)
2006-03-15 13:29 ` Zoltan Menyhart
@ 2006-03-15 17:37 ` Chen, Kenneth W
2006-03-16 9:57 ` Zoltan Menyhart
` (24 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Chen, Kenneth W @ 2006-03-15 17:37 UTC (permalink / raw)
To: linux-ia64
Zoltan Menyhart wrote on Wednesday, March 15, 2006 5:30 AM
> This patch is based on Christoph's one entitled "Fix race in the
> accessed/dirty bit handlers".
>
> - It adds the missing "srlz.d"
It is still not clear whether srlz.d is required or not, right? The wording
in the SDM is vague. Through experiment, I've verified that the itc
instruction observes full instruction latency with respect to the memory
operation that immediately follows it. This is pretty much in line with what
I think the SDM is trying to say: it has implicit semi-serialization (the
next memory operation won't proceed until itc.d finishes).
> - It uses some "nta"
Do you have any performance data showing that nta is a win?
- Ken
* Re: accessed/dirty bit handler tuning
2006-03-13 14:08 accessed/dirty bit handler tuning Zoltan Menyhart
` (9 preceding siblings ...)
2006-03-15 17:37 ` Chen, Kenneth W
@ 2006-03-16 9:57 ` Zoltan Menyhart
2006-03-16 10:19 ` Luck, Tony
` (23 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Zoltan Menyhart @ 2006-03-16 9:57 UTC (permalink / raw)
To: linux-ia64
Chen, Kenneth W wrote:
> It is still not clear whether srlz.d is required or not, right? The wording
> in the SDM is vague.
We have quoted several times the SDM:
"The visibility of the itc instruction to generated purges (ptc.g, ptc.ga)
must occur before subsequent memory operations. From a software perspective,
this is similar to acquire semantics. Serialization is still required to
observe the side-effects of the translation being present."
What do you think the statement "Serialization is still required..." means
if not a "srlz.d" (or "rfi") ?
> Through experiment, I've verified that the itc instruction observes full
> instruction latency with respect to the memory operation that immediately
> follows it.
Have you got a test to check it?
Could you please give us the test program?
Assuming you are right, do you think Intel guarantees that all the CPU
models (incl. the forthcoming ones) behave like that?
> This is pretty much in line with what I think the SDM is trying to say: it
> has implicit semi-serialization (the next memory operation won't proceed
> until itc.d finishes).
Can you please indicate where it states that?
> Do you have any performance data showing that nta is a win?
I have already admitted that I cannot measure the difference.
(We do not hit these trap routines very frequently.)
Let us put the question in another way:
There is a sequence with "nta"-s.
This sequence is not longer than the one w/o "nta"-s.
According to the doc. it *may* run faster.
Why shouldn't we use it?
Thanks,
Zoltan
* RE: accessed/dirty bit handler tuning
2006-03-13 14:08 accessed/dirty bit handler tuning Zoltan Menyhart
` (10 preceding siblings ...)
2006-03-16 9:57 ` Zoltan Menyhart
@ 2006-03-16 10:19 ` Luck, Tony
2006-03-16 19:12 ` Chen, Kenneth W
` (22 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Luck, Tony @ 2006-03-16 10:19 UTC (permalink / raw)
To: linux-ia64
> We have quoted several times the SDM:
>
> "The visibility of the itc instruction to generated purges (ptc.g, ptc.ga)
> must occur before subsequent memory operations. From a software perspective,
> this is similar to acquire semantics. Serialization is still required to
> observe the side-effects of the translation being present."
>
> What do you think the statement "Serialization is still required..." means
> if not a "srlz.d" (or "rfi") ?
I'm trying to get some clarifications on this ... my personal interpretation
of this section is that it only refers to memory references that will use
the translation that is being inserted by the itc.d instruction. But I agree
that there are lots of ways to read these three sentences.
-Tony
* RE: accessed/dirty bit handler tuning
2006-03-13 14:08 accessed/dirty bit handler tuning Zoltan Menyhart
` (11 preceding siblings ...)
2006-03-16 10:19 ` Luck, Tony
@ 2006-03-16 19:12 ` Chen, Kenneth W
2006-03-29 8:11 ` Zoltan Menyhart
` (21 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Chen, Kenneth W @ 2006-03-16 19:12 UTC (permalink / raw)
To: linux-ia64
Zoltan Menyhart wrote on Thursday, March 16, 2006 1:57 AM
> > It is still not clear whether srlz.d is required or not, right? Wording
> > in SDM is vague.
>
> We have quoted several times the SDM:
>
> "The visibility of the itc instruction to generated purges (ptc.g, ptc.ga)
> must occur before subsequent memory operations. From a software perspective,
> this is similar to acquire semantics. Serialization is still required to
> observe the side-effects of the translation being present."
>
> What do you think the statement "Serialization is still required..." means
> if not a "srlz.d" (or "rfi") ?
This is a rat hole: until we get some clarification from the ia64 architects,
nobody should take a strong position one way or the other. You persistently
post patches that add srlz and I'm countering that with a persistent reminder
of the ambiguity. For the dirty/access fault handlers, the srlz probably costs
nothing because the itc latency is already hidden behind the cmpxchg. However,
I don't want that to become a de facto standard and then subsequently leak
into the vhpt_miss handler.
> > Do you have any performance data showing that nta is a win?
>
> I have already admitted that I cannot measure the difference.
> (We do not hit very frequently these trap routines.)
>
> Let us put the question in another way:
>
> There is a sequence with "nta"-s.
> This sequence is not longer than the one w/o "nta"-s.
> According to the doc. it *may* run faster.
> Why should not we use it?
This is another rat hole. The whole reason you want to use nta is that you
know the data is going to be used only once. In the dirty handler, it's not a
"use once" scenario: the PTE is accessed 3 times! The 2nd and 3rd accesses
will see full L3 latency if the data is indeed a cache miss on the 1st load.
The original question is: prove that nta is an overall win given the penalty
of the 2nd and 3rd PTE accesses. It *may not* run faster than you think.
- Ken
* Re: accessed/dirty bit handler tuning
2006-03-13 14:08 accessed/dirty bit handler tuning Zoltan Menyhart
` (12 preceding siblings ...)
2006-03-16 19:12 ` Chen, Kenneth W
@ 2006-03-29 8:11 ` Zoltan Menyhart
2006-03-29 8:28 ` Chen, Kenneth W
` (20 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Zoltan Menyhart @ 2006-03-29 8:11 UTC (permalink / raw)
To: linux-ia64
Tony Luck wrote:
> Zoltan,
>
> The Itanium architects agree with you ... the architecture would allow
> for an implementation where the itc becomes visible after the ld8 that
> is checking the pte hasn't changed.
>
> Ken and I messed with your patch a bit (to match the style of the rest
> of ivt.S, and to drop some pointless differences between the trap 8, 9
> and 10 handlers). Here's what I plan to checkin:
Well, it looks correct.
We'll have to have a look at the other places like "vhpt_miss",...
Apparently most of the comments have been stripped off :-(
Let me explain why I think it is important to "over-comment" this
low-level stuff, which is far from being self-commenting.
There are a couple of non-trivial pieces of information hidden in these
machine dependent code fragments. We should describe how these
code fragments are meant to work. We should make it as easy as possible
for the code readers to understand - and criticize - the algorithm.
(It replaces the documentation :-).)
There is a second aspect, too: once the algorithm is understood and
agreed upon, the reader can check whether the actual implementation is
correct and conforms to what is said in the comments.
Otherwise how can someone reading the code know whether a "trick" hides
a clever idea or is just a silly bug?
A typical example is the story of our "srlz.d".
I think summarizing what the Itanium architects said about it
could be very useful.
(And the cache hints...)
Thanks,
Zoltan
* RE: accessed/dirty bit handler tuning
From: Chen, Kenneth W @ 2006-03-29 8:28 UTC (permalink / raw)
To: linux-ia64
Zoltan Menyhart wrote on Wednesday, March 29, 2006 12:12 AM
> > The Itanium architects agree with you ... the architecture would allow
> > for an implementation where the itc becomes visible after the ld8 that
> > is checking the pte hasn't changed.
> >
> > Ken and I messed with your patch a bit (to match the style of the rest
> > of ivt.S, and to drop some pointless differences between the trap 8, 9
> > and 10 handlers). Here's what I plan to checkin:
>
> Well, it looks correct.
>
> We'll have to have a look at the other places like "vhpt_miss",...
Oh my gosh, my worst nightmare has become reality :-( It is unacceptable
to have srlz.d in vhpt_miss. A couple of alternatives:
(1) strip off all ptc.g related instructions in vhpt_miss and just let the
hardware page walker do the job. The kernel can take double faults, but after
all, with what people do to the ia64 kernel, this might be the best solution.
(2) add 20 cycles of delay in front of ptc.g
(3) dynamically patch out the srlz.d for McK/Madison/Montecito processors.
(4) .....
- Ken
* Re: accessed/dirty bit handler tuning
From: Zoltan Menyhart @ 2006-03-29 13:37 UTC (permalink / raw)
To: linux-ia64
Chen, Kenneth W wrote:
> Oh my gosh, my worst nightmare becomes the reality, :-( It is unacceptable
> to have srlz.d in vhpt_miss.
Can you please explain why it is a nightmare?
How much time do you think will be wasted?
> Couple of alternatives:
>
> (1) strip off all ptc.g related instructions in vhpt and just let the hpw
> walker do the job. Kernel can take double faults, but after all, with
> what people do to ia64 kernel, this might be the best solution.
I do not really see why it would be more efficient to let the walker do
the insertion job once we are in the "vhpt_miss" handler.
If we suffer from the delay, how could the walker avoid this overhead?
Have you got a prototype for this reduced "vhpt_miss" handler?
> (2) add 20 cycles of delay in front of ptc.g
I do not think any delay like this could be safe.
> (3) dynamically patch out srlz.d for McK/Madison/Montecito processor.
Is it specified anywhere what CPU models do / do not require this "srlz.d"?
Thanks,
Zoltan
* Re: accessed/dirty bit handler tuning
From: Zoltan Menyhart @ 2006-03-29 17:01 UTC (permalink / raw)
To: linux-ia64
Ken,
Can you please tell me why the pud / pmd pointers are re-checked
in "vhpt_miss"?
ld8 r26=[r17] // read *pmd again
#ifdef CONFIG_PGTABLE_4
ld8 r19=[r28] // read *pud again
#endif
...
cmp.ne.or.andcm p6,p7=r26,r20 // did *pmd change
#ifdef CONFIG_PGTABLE_4
cmp.ne.or.andcm p6,p7=r19,r29 // did *pud change
#endif
As far as I know, pud, pmd, pte pages can go away only via:
free_pgtables()
free_pgd_range()
free_pud_range()
free_pmd_range()
free_pte_range()
If a 0xa000... stuff goes away => BUG.
For the user mode pages:
"free_pgtables()" is called only from:
- "exit_mmap()": we simply cannot have the chance to have a vhpt miss
- "unmap_region()" calls "unmap_vmas()" before "free_pgtables()":
again, we cannot fault in that region
Have I missed something?
Thanks,
Zoltan
* Re: accessed/dirty bit handler tuning
From: Luck, Tony @ 2006-03-29 22:57 UTC (permalink / raw)
To: linux-ia64
Ian,
Yes ... I think I goofed when mailing to Zoltan and the list ... the copy
never showed up on the list. Here is the version of the patch:
-Tony
---
diff --git a/arch/ia64/kernel/ivt.S b/arch/ia64/kernel/ivt.S
index 829a43c..86123c1 100644
--- a/arch/ia64/kernel/ivt.S
+++ b/arch/ia64/kernel/ivt.S
@@ -552,48 +552,56 @@ ENTRY(dirty_bit)
movl r30=1f // load continuation point in case of nested fault
;;
thash r17=r16 // compute virtual address of L3 PTE
+ mov r31=pr
mov r29=b0 // save b0 in case of nested fault
- mov r31=pr // save pr
#ifdef CONFIG_SMP
mov r28=ar.ccv // save ar.ccv
;;
-1: ld8 r18=[r17]
- ;; // avoid RAW on r18
+1: ld8.bias.nta r18=[r17]
+ ;;
mov ar.ccv=r18 // set compare value for cmpxchg
or r25=_PAGE_D|_PAGE_A,r18 // set the dirty and accessed bits
tbit.z p7,p6 = r18,_PAGE_P_BIT // Check present bit
;;
-(p6) cmpxchg8.acq r26=[r17],r25,ar.ccv // Only update if page is present
- mov r24=PAGE_SHIFT<<2
- ;;
-(p6) cmp.eq p6,p7=r26,r18 // Only compare if page is present
- ;;
+ /*
+ * We do not test for the result of "cmpxchg". It only makes sure we do not
+ * overwrite a PTE that has been modified by someone else in the meantime.
+ * We'll read back the in memory PTE later.
+ */
+(p6) cmpxchg8.acq.nta r26=[r17],r25,ar.ccv // Only update if page is present
(p6) itc.d r25 // install updated PTE
;;
/*
- * Tell the assemblers dependency-violation checker that the above "itc" instructions
- * cannot possibly affect the following loads:
+ * We make sure itc.d completes before we re-read the PTE.
*/
- dv_serialize_data
-
- ld8 r18=[r17] // read PTE again
+(p6) srlz.d
+(p6) ld8.nta r18=[r17] // Read PTE again
;;
- cmp.eq p6,p7=r18,r25 // is it same as the newly installed
+(p6) cmp.eq p0,p7=r18,r25 // Is it same as we wanted to install?
+ mov r24=PAGE_SHIFT << 2
;;
+ /*
+ * The new translation (or the old one if "p6" is off) gets purged if:
+ * - the page is not present
+ * - the in memory PTE is not what we wanted to write out because:
+ * + someone else has modified it after our successful "cmpxchg"
+ * + "cmpxchg" has failed (with the exception when someone else has set the
+ * very same dirty bit as we wanted to => our new translation is correct)
+ */
(p7) ptc.l r16,r24
mov b0=r29 // restore b0
mov ar.ccv=r28
#else
;;
1: ld8 r18=[r17]
- ;; // avoid RAW on r18
+ ;;
or r18=_PAGE_D|_PAGE_A,r18 // set the dirty and accessed bits
mov b0=r29 // restore b0
;;
st8 [r17]=r18 // store back updated PTE
itc.d r18 // install updated PTE
#endif
- mov pr=r31,-1 // restore pr
+ mov pr=r31,-1
rfi
END(dirty_bit)
@@ -602,7 +610,10 @@ END(dirty_bit)
// 0x2400 Entry 9 (size 64 bundles) Instruction Access-bit (27)
ENTRY(iaccess_bit)
DBG_FAULT(9)
- // Like Entry 8, except for instruction access
+ /*
+ * Like Entry 8, except for instruction access.
+ * For the remarks on cache hints and synchronization issues see there.
+ */
mov r16=cr.ifa // get the address that caused the fault
movl r30=1f // load continuation point in case of nested fault
mov r31=pr // save predicates
@@ -623,33 +634,25 @@ #endif /* CONFIG_ITANIUM */
#ifdef CONFIG_SMP
mov r28=ar.ccv // save ar.ccv
;;
-1: ld8 r18=[r17]
+1: ld8.bias.nta r18=[r17]
;;
mov ar.ccv=r18 // set compare value for cmpxchg
or r25=_PAGE_A,r18 // set the accessed bit
- tbit.z p7,p6 = r18,_PAGE_P_BIT // Check present bit
+ tbit.z p7,p6=r18,_PAGE_P_BIT // Check present bit
;;
-(p6) cmpxchg8.acq r26=[r17],r25,ar.ccv // Only if page present
- mov r24=PAGE_SHIFT<<2
+(p6) cmpxchg8.acq.nta r26=[r17],r25,ar.ccv // Only update if page is present
+(p6) itc.i r25 // Install updated PTE if page is present
;;
-(p6) cmp.eq p6,p7=r26,r18 // Only if page present
- ;;
-(p6) itc.i r25 // install updated PTE
- ;;
- /*
- * Tell the assemblers dependency-violation checker that the above "itc" instructions
- * cannot possibly affect the following loads:
- */
- dv_serialize_data
-
- ld8 r18=[r17] // read PTE again
+(p6) srlz.d
+(p6) ld8.nta r18=[r17] // Read PTE again
;;
- cmp.eq p6,p7=r18,r25 // is it same as the newly installed
+(p6) cmp.eq p0,p7=r18,r25 // Is it same as we wanted to install?
+ mov r24=PAGE_SHIFT << 2
;;
(p7) ptc.l r16,r24
mov b0=r29 // restore b0
mov ar.ccv=r28
-#else /* !CONFIG_SMP */
+#else
;;
1: ld8 r18=[r17]
;;
@@ -658,7 +661,7 @@ #else /* !CONFIG_SMP */
;;
st8 [r17]=r18 // store back updated PTE
itc.i r18 // install updated PTE
-#endif /* !CONFIG_SMP */
+#endif
mov pr=r31,-1
rfi
END(iaccess_bit)
@@ -668,50 +671,47 @@ END(iaccess_bit)
// 0x2800 Entry 10 (size 64 bundles) Data Access-bit (15,55)
ENTRY(daccess_bit)
DBG_FAULT(10)
- // Like Entry 8, except for data access
+ /*
+ * Like Entry 8, except for data access.
+ * For the remarks on cache hints and synchronization issues see there.
+ */
mov r16=cr.ifa // get the address that caused the fault
movl r30=1f // load continuation point in case of nested fault
;;
thash r17=r16 // compute virtual address of L3 PTE
mov r31=pr
- mov r29=b0 // save b0 in case of nested fault)
+ mov r29=b0 // save b0 in case of nested fault
#ifdef CONFIG_SMP
mov r28=ar.ccv // save ar.ccv
- ;;
-1: ld8 r18=[r17]
- ;; // avoid RAW on r18
- mov ar.ccv=r18 // set compare value for cmpxchg
- or r25=_PAGE_A,r18 // set the dirty bit
- tbit.z p7,p6 = r18,_PAGE_P_BIT // Check present bit
;;
-(p6) cmpxchg8.acq r26=[r17],r25,ar.ccv // Only if page is present
- mov r24=PAGE_SHIFT<<2
+1: ld8.bias.nta r18=[r17]
;;
-(p6) cmp.eq p6,p7=r26,r18 // Only if page is present
+ mov ar.ccv=r18 // set compare value for cmpxchg
+ or r25=_PAGE_A,r18 // set the accessed bit
+ tbit.z p7,p6=r18,_PAGE_P_BIT // Check present bit
;;
-(p6) itc.d r25 // install updated PTE
- /*
- * Tell the assemblers dependency-violation checker that the above "itc" instructions
- * cannot possibly affect the following loads:
- */
- dv_serialize_data
+(p6) cmpxchg8.acq.nta r26=[r17],r25,ar.ccv // Only update if page is present
+(p6) itc.d r25 // Install updated PTE if page is present
;;
- ld8 r18=[r17] // read PTE again
+(p6) srlz.d
+(p6) ld8.nta r18=[r17] // Read PTE again
;;
- cmp.eq p6,p7=r18,r25 // is it same as the newly installed
+(p6) cmp.eq p0,p7=r18,r25 // Is it same as we wanted to install?
+ mov r24=PAGE_SHIFT << 2
;;
(p7) ptc.l r16,r24
+ mov b0=r29 // restore b0
mov ar.ccv=r28
#else
;;
1: ld8 r18=[r17]
- ;; // avoid RAW on r18
+ ;;
or r18=_PAGE_A,r18 // set the accessed bit
+ mov b0=r29 // restore b0
;;
st8 [r17]=r18 // store back updated PTE
itc.d r18 // install updated PTE
#endif
- mov b0=r29 // restore b0
mov pr=r31,-1
rfi
END(daccess_bit)
* RE: accessed/dirty bit handler tuning
From: Chen, Kenneth W @ 2006-03-29 22:59 UTC (permalink / raw)
To: linux-ia64
Zoltan Menyhart wrote on Wednesday, March 29, 2006 9:02 AM
> Can you please tell me why are the pud - pmd pointers are re-checked
> in "vhpt_miss" ?
This code has been there for over five years, and was written by a
prominent ia64 pioneer. I trusted, with my whole heart, that it is
there for a good reason.
> As far as I know, pud, pmd, pte pages can go away only via:
>
> free_pgtables()
> free_pgd_range()
> free_pud_range()
> free_pmd_range()
> free_pte_range()
>
> If a 0xa000... stuff goes away => BUG.
> For the user mode pages:
>
> "free_pgtables()" is called only from:
>
> - "exit_mmap()": we simply cannot have the chance to have a vhpt miss
> - "unmap_region()" calls "unmap_vmas()" before "free_pgtables()":
> again, we cannot fault in that region
>
> Have I missed something?
OK, let's see what happens without re-reading pud/pmd:

    cpu0                       cpu1                     cpu2
    vhpt miss:
      walk page table
                               free_pgtables
                               ptc.g fault address
                               ptc.g hash address
                                                        pud_alloc/pmd_alloc
                                                        new page instantiation
      itc.d faulting address
      itc.d hash address
      read pte
      kill tlb for fault addr
      rfi
    touch fault addr
      walker installs the tlb
      using the stale vhpt tlb
        -> using someone else's page
        -> data corruption
        -> poor kernel engineer scratches his head
           through seven straight days of debugging
           with no clue what the heck is going on ....

It's a far-fetched scenario, but ....
I suppose we can close the race by killing the hash address tlb along with
the faulting address tlb.
- Ken
* Re: accessed/dirty bit handler tuning
From: Zoltan Menyhart @ 2006-03-30 15:13 UTC (permalink / raw)
To: linux-ia64
[-- Attachment #1: Type: text/plain, Size: 314 bytes --]
Please have a look at this patch, which applies on top of yours.
I hope I have not forgotten anything.
Some notes:
I do not add ".nta"-s because:
- there is no atomic operation as in the dirty bit handler
- we re-read the probably unmodified data, so the L1D cache can help
".bias" is out of the question.
Thanks,
Zoltan
[-- Attachment #2: ivt.diff --]
[-- Type: text/plain, Size: 1779 bytes --]
--- save/arch/ia64/kernel/ivt.S 2006-03-30 16:19:18.000000000 +0200
+++ linux-2.6.16/arch/ia64/kernel/ivt.S 2006-03-30 17:03:56.000000000 +0200
@@ -197,11 +197,12 @@ ENTRY(vhpt_miss)
;;
#ifdef CONFIG_SMP
/*
- * Tell the assemblers dependency-violation checker that the above "itc" instructions
- * cannot possibly affect the following loads:
+ * We make sure the itc.* is visible to the generated purges (like ptc.ga)
+ * before we re-read the *pgd ... PTE chain.
+ * Having itc.i-d a new translation, there is no need for srlz.i: the rfi below
+ * will do the serialization.
*/
- dv_serialize_data
-
+(p7) srlz.d
/*
* Re-check pagetable entry. If they changed, we may have received a ptc.g
* between reading the pagetable and the "itc". If so, flush the entry we
@@ -266,11 +267,11 @@ ENTRY(itlb_miss)
;;
#ifdef CONFIG_SMP
/*
- * Tell the assemblers dependency-violation checker that the above "itc" instructions
- * cannot possibly affect the following loads:
+ * We make sure the itc.i is visible to the generated purges (like ptc.ga)
+ * before we re-read the PTE.
+ * There is no need for srlz.i: the rfi below will do the serialization.
*/
- dv_serialize_data
-
+ srlz.d
ld8 r19=[r17] // read *pte again and see if same
mov r20=PAGE_SHIFT<<2 // setup page size for purge
;;
@@ -310,11 +311,10 @@ dtlb_fault:
;;
#ifdef CONFIG_SMP
/*
- * Tell the assemblers dependency-violation checker that the above "itc" instructions
- * cannot possibly affect the following loads:
+ * We make sure the itc.d is visible to the generated purges (like ptc.ga)
+ * before we re-read the PTE.
*/
- dv_serialize_data
-
+ srlz.d
ld8 r19=[r17] // read *pte again and see if same
mov r20=PAGE_SHIFT<<2 // setup page size for purge
;;
* RE: accessed/dirty bit handler tuning
From: Zoltan Menyhart @ 2006-03-31 16:23 UTC (permalink / raw)
To: linux-ia64
Ken wrote:
> cpu0 cpu1 cpu2
> Vhpt miss:
> walk page table
> free_pgtables
> ptc.g fault address
> ptc.g hash address
> pud_alloc/pmd_alloc
> new page instantiation
> itc.d faulting address
> itc.d hash address
> read pte
> kill tlb for fault addr
> rfi
Let's apply the same logic to the dirty bit handler.
Assume a nested TLB miss, i.e. we dig up the PTE entry in the same way as
we do in "vhpt_miss" (in physical addressing mode):
rx = ... -> pgd[i] -> pud[j] -> pmd[k] -> pte[l]
(and some NULL pointer verifications)
Having inserted the new PTE (and the srlz.d is done), we re-read the
PTE value only.
What makes sure that the PTE address is still valid when we re-read the
PTE value (we are still in physical addressing mode)?
Should we not re-read the complete pgd ... pte chain as we do in "vhpt_miss"?
Should we not insert the TLB entry for the relevant virtual page table page
as we do in "vhpt_miss" (it's an efficiency issue only)?
Thanks,
Zoltan
* RE: accessed/dirty bit handler tuning
From: Chen, Kenneth W @ 2006-03-31 19:08 UTC (permalink / raw)
To: linux-ia64
Zoltan Menyhart wrote on Friday, March 31, 2006 8:23 AM
> Ken wrote:
>
> > cpu0 cpu1 cpu2
> > Vhpt miss:
> > walk page table
> > free_pgtables
> > ptc.g fault address
> > ptc.g hash address
> > pud_alloc/pmd_alloc
> > new page instantiation
> > itc.d faulting address
> > itc.d hash address
> > read pte
> > kill tlb for fault addr
> > rfi
>
> Let's apply the same logic to the dirty bit handler.
>
> Assume a nested TLB miss, i.e. we dig up the PTE entry in the same way as
> we do in "vhpt_miss" (in physical addressing mode):
>
> rx = ... -> pgd[i] -> pud[j] -> pmd[k] -> pte[l]
>
> (and some NULL pointer verifications)
>
> Having inserted the new PTE (and the srlz.d is done), we re-read the
> PTE value only.
> What makes it sure that the PTE address is still valid when we re-read the
> PTE value (we are still in physical addressing mode)?
Because the nested DTLB miss will ensure consistency. If another CPU is
tearing down the address space, a separate purge will occur.
> Should not we re-read the complete pgd ... pte chain as we do in "vhpt_miss"?
>
> Should not we insert the TLB entry for the relevant virtual page table page
> as we do in "vhpt_miss" (it's an efficiency issue only)?
I think both are really bad ideas. The fast path should already have the
TLB entry for the hash address in the CPU; why bother looking it up again?
* Re: accessed/dirty bit handler tuning
From: Zoltan Menyhart @ 2006-03-31 21:18 UTC (permalink / raw)
To: linux-ia64
Chen, Kenneth W wrote:
> Zoltan Menyhart wrote on Friday, March 31, 2006 8:23 AM
>
>>Ken wrote:
>>
>> > cpu0 cpu1 cpu2
>> > Vhpt miss:
>> > walk page table
>> > free_pgtables
>> > ptc.g fault address
>> > ptc.g hash address
>> > pud_alloc/pmd_alloc
>> > new page instantiation
>> > itc.d faulting address
>> > itc.d hash address
>> > read pte
>> > kill tlb for fault addr
>> > rfi
>>
>>Let's apply the same logic to the dirty bit handler.
>>
>>Assume a nested TLB miss, i.e. we dig up the PTE entry in the same way as
>>we do in "vhpt_miss" (in physical addressing mode):
>>
>> rx = ... -> pgd[i] -> pud[j] -> pmd[k] -> pte[l]
>>
>>(and some NULL pointer verifications)
>>
>>Having inserted the new PTE (and the srlz.d is done), we re-read the
>>PTE value only.
>>What makes it sure that the PTE address is still valid when we re-read the
>>PTE value (we are still in physical addressing mode)?
>
>
> Because nested DTLB miss will ensure the consistency. If another CPU is
> tearing down the address space, a separate purge will occur.
Let's assume the following for cpu0:
- it owns a copy of a shared cache line
- this cache line is on a data page that has never been modified
- it has got a valid TLB entry for mapping the data page
- it has NOT got a valid TLB entry for mapping the corresponding PTE page
- it tries to modify the cache line

    cpu0:                              cpu1:                          cpu2:
    dirty bit fault:
      attempts to read the PTE
    nested DTLB fault:
      walks page table
    back to dirty bit handler:
      reads the PTE using phys. addr.
      itc.d new PTE
                                       free_pgtables:
                                         ptc.g dirty bit fault address
                                         free the data page
                                         ptc.g PTE page address
                                         free the PTE page
                                                                      pte_alloc:
                                                                        re-uses the old PTE page
                                                                      new page instantiation:
                                                                        re-uses the old data page
      srlz.d completes
      re-reads the PTE using phys. addr.
      PTE value matches
Problem #1:
cpu0 keeps (see r17) the physical address of a PTE whose page has gone.
cpu0 is not sensitive to ptc.g-ing the PTE page address, because it accesses
the PTE page by use of this (potentially invalid) physical address, not as the
virtually mapped linear page table.
cpu0 has not got the right to touch a PTE page unless it makes sure
that the PTE page is still anchored by its current->mm->pgd...
Problem #2:
cpu2 may install the old data page freed by cpu1 at the same PTE offset as it
was before.
The new PTE may be numerically the same as the one just inserted by cpu0
(and it is at the same physical address), but it belongs to another process.
cpu0 cannot catch the ptc.g for the dirty bit fault address because
itc.d + srlz.d have not completed by that moment.
The compare may result in a false positive.
cpu0 may be granted the write access right to a data page of someone else.
Thanks,
Zoltan
* RE: accessed/dirty bit handler tuning
From: Chen, Kenneth W @ 2006-03-31 21:51 UTC (permalink / raw)
To: linux-ia64
Zoltan Menyhart wrote on Friday, March 31, 2006 1:18 PM
> Problem #1:
>
> cpu0 keeps (see r17) the physical address of a PTE whose page has gone.
> cpu0 is not sensitive to ptc.g-ing the PTE page address, because it accesses
> the PTE page by use of this (potentially invalid) physical address, not as the
> virtually mapped linear page table.
>
> cpu0 has not got the right to touch a PTE page unless it makes sure
> that the PTE page is still anchored by its current->mm->pgd...
>
> Problem #2:
>
> cpu2 may install the old data page freed by cpu1 at the same PTE offset as it
> was before.
> The new PTE may be numerically the same as the one just inserted by cpu0
> (and it is at the same physical address), but it belongs to another process.
> cpu0 cannot catch the ptc.g for the dirty bit fault address because
> itc.d + srlz.d have not completed by that moment.
> The compare may result in a false positive.
> cpu0 may be granted the write access right to a data page of someone else.
You are correct. I forgot that nested_dtlb_miss doesn't actually do the check.
I would rather not add anything to the fast path to detect an exceedingly
rare race event (it requires the ia64 architects to have screwed up so badly
that itc.d has a 10,000 cycle latency, while at the same time doing a splendid
job at ptc.g, which completes in zero cycles along with thousands of other
instructions).
In that event, as I said, it's actually better to simply purge the entry, write
the dirty bit into the in-memory page table entry and let the hardware page
walker insert the new entry.
* RE: accessed/dirty bit handler tuning
From: Chen, Kenneth W @ 2006-03-31 22:14 UTC (permalink / raw)
To: linux-ia64
Zoltan,
Can you do some stress test experiments and let us know how many times ptc.l
was actually executed in the vhpt_miss/tlb_miss/dirty/access handlers? Thanks.
- Ken
* Re: accessed/dirty bit handler tuning
From: Zoltan Menyhart @ 2006-03-31 22:57 UTC (permalink / raw)
To: linux-ia64
Chen, Kenneth W wrote:
> You are correct. I forgot that nested_dtlb_miss doesn't actually do the check.
> I rather prefer not to add anything in the fast path to detect an exceedingly
> rare race event (only if ia64 architect screwed up so bad that made itc.d have
> 10,000 cycle latency and at the same time does a splendid job at job at ptc.g
> which completes in zero cycle along with other thousands of other instructions).
>
> In that event, as I said, it's actually better to simple purge the entry, write
> the dirty bit into in-memory page table entry and let the hardware page walker
> insert the new entry.
My first guess is:
- keep the fast path as it is (we are in virtual mode)
- after a nested DTLB fault, we do not return to the fast path in physical
mode but to the "completed" dirty bit fault handler
I guess it is more efficient than letting the hardware page walker
insert the new entry (we already have it in a register).
I'll have to think it over. I'm not sure we can write anything after the
nested DTLB fault. Consider the next example:

    cpu0:                               cpu1:                          cpu2:
    dirty bit fault:
      attempts to read the PTE
    nested DTLB fault:
      walks page table
    back to dirty bit handler:
      (keeps the physical address
      of the PTE in r17)
                                        free_pgtables:
                                          ptc.g dirty bit fault address
                                          free the data page
                                          ptc.g PTE page address
                                          free the PTE page
                                                                       page_alloc:
                                                                         re-uses the old PTE page
      (still keeps the physical address
      of the PTE whose page has gone)
      ld
      or
      cmpxchg
Probably, there is no way to make sure the physical address of the PTE
remains valid => we have to switch back to virtual mode for the "completed"
dirty bit fault handler.
> Can you do some stress test experiments and let us know how many time ptc.l
> was actually executed in vhpt_miss/tlb_miss/dirty/access
> handler? Thanks.
Well, instrumenting the kernel may take some time...
Which stress test program do you have in mind?
Zoltan
* Re: accessed/dirty bit handler tuning
From: Zoltan Menyhart @ 2006-04-03 8:46 UTC (permalink / raw)
To: linux-ia64
Chen, Kenneth W wrote:
> You are correct. I forgot that nested_dtlb_miss doesn't actually do the check.
> I rather prefer not to add anything in the fast path to detect an exceedingly
> rare race event (only if ia64 architect screwed up so bad that made itc.d have
> 10,000 cycle latency and at the same time does a splendid job at job at ptc.g
> which completes in zero cycle along with other thousands of other instructions).
>
> In that event, as I said, it's actually better to simple purge the entry, write
> the dirty bit into in-memory page table entry and let the hardware page walker
> insert the new entry.
The problem common to both the VHPT miss and the nested DTLB handler is
that we have to walk the
rx = IA64_KR_PT_BASE -> pgd[i] -> pud[j] -> pmd[k] -> pte[l]
chain without any locking.
IA64_KR_PT_BASE remains valid, the PGD page remains in its place until exit
(of the last thread in a multi-threaded application).
Assume we have picked up a valid PUD pointer from pgd[i].
Today, nothing makes sure that the PUD page remains valid by the time when
we dereference the PUD pointer.
The same can be said about the other steps in the chain.
I agree, the probability that it happens is very, very low.
Yet we program not for statistics but for correctness.
Someone wants more chance to hit this bug? Here it is:
Assume we have picked up a valid PUD / PMD / PTE pointer.
A local MCA happens that is corrected by the PAL / SAL => CMCI (later).
As the recovery can take an arbitrarily long time, another CPU has got
plenty of time to unmap a region and free a PUD / PMD / PTE page
whose physical address is in a register of our CPU.
We are insensitive to the ptc.g of the hash address, issued by the CPU
tearing down the mapping.
We may be obliged to pronounce the 4 letter dirty word: lock.
> Can you do some stress test experiments and let us know how many times ptc.l
> was actually executed in the vhpt_miss/tlb_miss/dirty/access
> handlers? Thanks.
Did you think of locking, too? Do you want to estimate the performance
loss?
Using the page-table-lock is out of the question:
- it can be split (looking up "struct page"-s is not a good idea)
- it scales badly
- we do not want to exclude page faults (which only add pages)
- we do not want to exclude the swapper (which takes away "leaf" pages only)
I can think of taking the mm semaphore for read:
- it can be taken for read almost all the time
- it scales well
- it requires 2 atomic operations, say 4 memory accesses:
  it doubles the original number of memory accesses needed to walk
  the PGD ... PTE chain
Unless ... the P*D[] pointers become virtual addresses.
(Either the virtual addresses themselves are stored in the tables, or we
keep the physical ones and OR 0xe000... into them.)
This idea is based on the following:
If we have got a virtual address and we manage to access the memory via
this virtual address => the virtual address was valid during the access.
Whoever tears down the mapping first clears the pointer, then purges the
translation that enables access to the pointer. We can catch it by the
usual technique.
Here is my first guess for the VHPT miss handler (sanity checks, e.g.
"presence", and other minor calculations are left for the reader):
// This is the fast path
- Do not switch off the data translation
- IA64_KR_PT_BASE holds the virtual address of the PGD page
- Set the return address for all nested faults
(always re-run the complete sequence in case of fault)
- Set a predicate to indicate that we try to read pgd[i]
(or bx = @dedicated-nested-fault-handler)
- Read pgd[i] - may fault, see below
- Set a predicate to indicate that we try to read pud[j]
- Read pud[j] - may fault, see below
- Set a predicate to indicate that we try to read pmd[k]
- Read pmd[k] - may fault, see below
- Insert the translation for the PTE page + srlz.d
- Re-read pmd[k] - may fault...
- If it does not match => purge the translation + re-start
- Set a predicate to indicate that we try to read pte[l]
- Read pte[l] - may fault despite the fact that the PTE page just
has been mapped "by hand"
- Insert the new translation + srlz.*
- Re-read pte[l] - may fault...
- If it does not match => purge the translation + re-start
As we use virtual addresses => the nested fault handler has to insert
an identity-mapped translation for the PGD, PUD or PMD page, or the
translation - as today - for the PTE page (it will be used by the HW
walker, too).
The fast path of this proposal includes the same number of loads, itc-s
and srlz-s as the current version (with the missing srlz.d added).
The drawback of this latter approach is that it requires 5 TLB entries
to be able to "progress forward". The architecture does not guarantee
that many. However, the ia64 CPUs in production (and the foreseen ones)
have got 128 TLB entries...
(We may check at boot time if they are available.
If not we may fall back...)
Zoltan
* Re: accessed/dirty bit handler tuning
2006-03-13 14:08 accessed/dirty bit handler tuning Zoltan Menyhart
` (26 preceding siblings ...)
2006-04-03 8:46 ` Zoltan Menyhart
@ 2006-04-03 13:45 ` Zoltan Menyhart
2006-04-03 15:49 ` Luck, Tony
` (6 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Zoltan Menyhart @ 2006-04-03 13:45 UTC (permalink / raw)
To: linux-ia64
[-- Attachment #1: Type: text/plain, Size: 1458 bytes --]
Chen, Kenneth W wrote:
> Can you do some stress test experiments and let us know how many times ptc.l
> was actually executed in the vhpt_miss/tlb_miss/dirty/access
> handlers? Thanks.
Here is a patch that adds a small syscall to display or clear (./stat -clear)
the statistics.
Please verify if this is what you wanted (and the potential bugs...).
The 1st version I tried (the code under "#if 0") should have worked in
virtual mode. Unfortunately, I could not make it work (having this short
deadline). Could you have a look at why it fails to work?
I ran a "make -j 16" of the kernel on an 8-processor machine.
(It is actually 2 Tiger boxes connected via a Scalability Port Switch.)
Unfortunately, the I/O subsystem is weak: a single SCSI disk.
Here is what I got:
VHPT miss counter: 1674978
VHPT miss - hash purged: 0
VHPT miss - PTE purged: 0
ITLB miss counter: 293
ITLB miss - purged: 0
DTLB miss counter: 3806
DTLB miss - purged: 0
DIRTY trap counter: 224
DIRTY - purged: 0
I-ACCESS trap counter: 2
I-ACCESS - purged: 0
D-ACCESS trap counter: 173227
D-ACCESS - purged: 0
Unless I am mistaken, no purge was observed.
It is very curious to have so few dirty and i-access traps...
Have you got some good & stressing tests?
Zoltan
[-- Attachment #2: stat.diff --]
[-- Type: text/plain, Size: 9633 bytes --]
--- save/arch/ia64/kernel/entry.S 2006-03-15 11:07:38.000000000 +0100
+++ linux-2.6.16/arch/ia64/kernel/entry.S 2006-04-03 12:48:03.000000000 +0200
@@ -1619,5 +1619,6 @@ sys_call_table:
data8 sys_ni_syscall // reserved for pselect
data8 sys_ni_syscall // 1295 reserved for ppoll
data8 sys_unshare
+ data8 sys_trap_statistics // 1297
.org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls
--- save/arch/ia64/kernel/ivt.S 2006-03-30 16:19:18.000000000 +0200
+++ linux-2.6.16/arch/ia64/kernel/ivt.S 2006-04-03 14:58:20.000000000 +0200
@@ -108,6 +108,36 @@ ENTRY(vhpt_miss)
movl r18=PAGE_SHIFT
mov r25=cr.itir
#endif
+#if 0
+ /*
+ * Increment the VHPT miss counter - must be identity mapped.
+ */
+ LOAD_PHYSICAL(p0, r17, fault_statistics + VHPT_idx * _ENTRY_SIZE_)
+ movl r21=(((1 << IA64_MAX_PHYS_BITS) - 1) & ~0xfff)
+ movl r19=PAGE_OFFSET
+ movl r20=PAGE_SHIFT << 2 // ... and protection key == 0
+ movl r22=PAGE_KERNEL
+ ;;
+ or r19=r19,r17 // __va(&fault_statistics[VHPT_idx])
+ and r21=r21,r17 // Clear ed, reserved bits, and PTE control bits
+ ;;
+ mov cr.ifa=r19
+ mov cr.itir=r20
+ or r21=r21,r22 // Insert control bits
+ ;;
+ itc.d r21
+ ;;
+ srlz.d
+ // Unsafe: the translation can be killed in the mean time
+ ld8.bias.nta r17=[r19] // = fault_statistics[VHPT_idx]
+ ;;
+ add r17=1,r17
+ mov cr.ifa=r16 // Restore
+ mov cr.itir=r25
+ ;;
+ // Unsafe: the translation can be killed in the mean time
+ st8 [r19]=r17 // Not atomic increment - who cares?
+#endif
;;
rsm psr.dt // use physical addressing for data
mov r31=pr // save the predicate registers
@@ -132,6 +162,17 @@ ENTRY(vhpt_miss)
(p7) dep r17=r17,r19,(PAGE_SHIFT-3),3 // put region number bits in place
srlz.d
+ /*
+ * Increment the VHPT miss counter.
+ */
+ LOAD_PHYSICAL(p0, r19, fault_statistics + VHPT_idx * _ENTRY_SIZE_)
+ ;;
+ ld8.bias.nta r28=[r19]
+ ;;
+ add r28=1,r28
+ ;;
+ st8 [r19]=r28 // Not atomic increment - who cares?
+
LOAD_PHYSICAL(p6, r19, swapper_pg_dir) // region 5 is rooted at swapper_pg_dir
.pred.rel "mutex", p6, p7
@@ -197,11 +238,12 @@ ENTRY(vhpt_miss)
;;
#ifdef CONFIG_SMP
/*
- * Tell the assemblers dependency-violation checker that the above "itc" instructions
- * cannot possibly affect the following loads:
+ * Make the itc.* visible to purges generated elsewhere (e.g. ptc.ga)
+ * before we re-read the *pgd ... PTE.
+ * Having itc.i-d a new translation, there is no need for srlz.i: the rfi
+ * below will do the serialization.
*/
- dv_serialize_data
-
+(p7) srlz.d
/*
* Re-check pagetable entry. If they changed, we may have received a ptc.g
* between reading the pagetable and the "itc". If so, flush the entry we
@@ -229,9 +271,34 @@ ENTRY(vhpt_miss)
mov r27=PAGE_SHIFT<<2
;;
(p6) ptc.l r22,r27 // purge PTE page translation
+ /*
+ * Increment the VHPT miss - purge PTE page counter.
+ * It is the next long. Accessed via its physical address.
+ */
+ dv_serialize_data
+ LOAD_PHYSICAL(p6, r19, fault_statistics + (VHPT_idx + 1) * _ENTRY_SIZE_)
+ ;;
+(p6) ld8.bias.nta r17=[r19]
+ ;;
+(p6) add r17=1,r17
+ ;;
+(p6) st8 [r19]=r17 // Not atomic increment - who cares?
+
(p7) cmp.ne.or.andcm p6,p7=r25,r18 // did *pte change
;;
(p6) ptc.l r16,r27 // purge translation
+ /*
+ * Increment the VHPT miss - purge translation counter.
+ * It is in the 2nd next long. Accessed via its physical address.
+ */
+ dv_serialize_data
+ LOAD_PHYSICAL(p6, r19, fault_statistics + (VHPT_idx + 2) * _ENTRY_SIZE_)
+ ;;
+(p6) ld8.bias.nta r17=[r19]
+ ;;
+(p6) add r17=1,r17
+ ;;
+(p6) st8 [r19]=r17 // Not atomic increment - who cares?
#endif
mov pr=r31,-1 // restore predicate registers
@@ -266,16 +333,36 @@ ENTRY(itlb_miss)
;;
#ifdef CONFIG_SMP
/*
- * Tell the assemblers dependency-violation checker that the above "itc" instructions
- * cannot possibly affect the following loads:
+ * Make the itc.i visible to purges generated elsewhere (e.g. ptc.ga)
+ * before we re-read the PTE.
+ * There is no need for srlz.i: the rfi below will do the serialization.
*/
- dv_serialize_data
-
+ srlz.d
ld8 r19=[r17] // read *pte again and see if same
mov r20=PAGE_SHIFT<<2 // setup page size for purge
;;
cmp.ne p7,p0=r18,r19
;;
+ /*
+ * Increment the ITLB miss counters.
+ */
+ rsm psr.dt // use physical addressing for data
+ ;;
+ srlz.d
+ LOAD_PHYSICAL(p0, r19, fault_statistics + ITLB_idx * _ENTRY_SIZE_)
+ ;;
+ ld8.bias.nta r28=[r19]
+ ;;
+ add r28=1,r28
+ ;;
+ st8 [r19]=r28,8 // Not atomic increment - who cares?
+ ;;
+(p7) ld8.bias.nta r28=[r19] // Next long: ITLB miss - purged
+ ;;
+(p7) add r28=1,r28
+ ;;
+(p7) st8 [r19]=r28 // Not atomic increment - who cares?
+
(p7) ptc.l r16,r20
#endif
mov pr=r31,-1
@@ -310,16 +397,35 @@ dtlb_fault:
;;
#ifdef CONFIG_SMP
/*
- * Tell the assemblers dependency-violation checker that the above "itc" instructions
- * cannot possibly affect the following loads:
+ * Make the itc.d visible to purges generated elsewhere (e.g. ptc.ga)
+ * before we re-read the PTE.
*/
- dv_serialize_data
-
+ srlz.d
ld8 r19=[r17] // read *pte again and see if same
mov r20=PAGE_SHIFT<<2 // setup page size for purge
;;
cmp.ne p7,p0=r18,r19
;;
+ /*
+ * Increment the DTLB miss counters.
+ */
+ rsm psr.dt // use physical addressing for data
+ ;;
+ srlz.d
+ LOAD_PHYSICAL(p0, r19, fault_statistics + DTLB_idx * _ENTRY_SIZE_)
+ ;;
+ ld8.bias.nta r28=[r19]
+ ;;
+ add r28=1,r28
+ ;;
+ st8 [r19]=r28,8 // Not atomic increment - who cares?
+ ;;
+(p7) ld8.bias.nta r28=[r19] // Next long: DTLB miss - purged
+ ;;
+(p7) add r28=1,r28
+ ;;
+(p7) st8 [r19]=r28 // Not atomic increment - who cares?
+
(p7) ptc.l r16,r20
#endif
mov pr=r31,-1
@@ -589,6 +695,26 @@ ENTRY(dirty_bit)
* very same dirty bit as we wanted to => our new translation is correct)
*/
(p7) ptc.l r16,r24
+ /*
+ * Increment the DIRTY trap counters.
+ */
+ rsm psr.dt // use physical addressing for data
+ ;;
+ srlz.d
+ LOAD_PHYSICAL(p0, r19, fault_statistics + DIRTY_idx * _ENTRY_SIZE_)
+ ;;
+ ld8.bias.nta r27=[r19]
+ ;;
+ add r27=1,r27
+ ;;
+ st8 [r19]=r27,8 // Not atomic increment - who cares?
+ ;;
+(p7) ld8.bias.nta r27=[r19] // Next long: DIRTY - purged
+ ;;
+(p7) add r27=1,r27
+ ;;
+(p7) st8 [r19]=r27 // Not atomic increment - who cares?
+
mov b0=r29 // restore b0
mov ar.ccv=r28
#else
@@ -650,6 +776,26 @@ ENTRY(iaccess_bit)
mov r24=PAGE_SHIFT << 2
;;
(p7) ptc.l r16,r24
+ /*
+ * Increment the I-ACCESS trap counters.
+ */
+ rsm psr.dt // use physical addressing for data
+ ;;
+ srlz.d
+ LOAD_PHYSICAL(p0, r19, fault_statistics + IACC_idx * _ENTRY_SIZE_)
+ ;;
+ ld8.bias.nta r27=[r19]
+ ;;
+ add r27=1,r27
+ ;;
+ st8 [r19]=r27,8 // Not atomic increment - who cares?
+ ;;
+(p7) ld8.bias.nta r27=[r19] // Next long: I-ACCESS - purged
+ ;;
+(p7) add r27=1,r27
+ ;;
+(p7) st8 [r19]=r27 // Not atomic increment - who cares?
+
mov b0=r29 // restore b0
mov ar.ccv=r28
#else
@@ -700,6 +846,26 @@ ENTRY(daccess_bit)
mov r24=PAGE_SHIFT << 2
;;
(p7) ptc.l r16,r24
+ /*
+ * Increment the D-ACCESS trap counters.
+ */
+ rsm psr.dt // use physical addressing for data
+ ;;
+ srlz.d
+ LOAD_PHYSICAL(p0, r19, fault_statistics + DACC_idx * _ENTRY_SIZE_)
+ ;;
+ ld8.bias.nta r27=[r19]
+ ;;
+ add r27=1,r27
+ ;;
+ st8 [r19]=r27,8 // Not atomic increment - who cares?
+ ;;
+(p7) ld8.bias.nta r27=[r19] // Next long: D-ACCESS - purged
+ ;;
+(p7) add r27=1,r27
+ ;;
+(p7) st8 [r19]=r27 // Not atomic increment - who cares?
+
mov b0=r29 // restore b0
mov ar.ccv=r28
#else
--- save/kernel/sched.c 2006-03-15 11:09:03.000000000 +0100
+++ linux-2.6.16/kernel/sched.c 2006-04-03 15:06:57.000000000 +0200
@@ -199,6 +192,26 @@ struct prio_array {
struct list_head queue[MAX_PRIO];
};
+
+long fault_statistics[MAX_TRAP_idx];
+
+
+/*
+ * Read / clear trap statistics
+ */
+asmlinkage long sys_trap_statistics(int index)
+{
+ if (index >= 0 && index < MAX_TRAP_idx)
+ return fault_statistics[index];
+ if (index == -1){
+ for (index = 0; index < MAX_TRAP_idx; index++)
+ fault_statistics[index] = 0;
+ return 0;
+ }
+ return -EINVAL;
+}
+
+
/*
* This is the main, per-CPU runqueue data structure.
*
--- save/include/asm-ia64/system.h 2006-03-15 11:08:53.000000000 +0100
+++ linux-2.6.16/include/asm-ia64/system.h 2006-04-03 14:26:19.000000000 +0200
@@ -262,4 +262,28 @@ void sched_cacheflush(void);
#endif /* __ASSEMBLY__ */
+
+/*
+ * For trap statistics
+ */
+#define VHPT_idx 0 // VHPT miss counter
+#define VHPT_HASH_PTC_idx 1 // VHPT miss - hash purged
+#define VHPT_PTE_PTC_idx 2 // VHPT miss - PTE purged
+#define ITLB_idx 3 // ITLB miss counter
+#define ITLB_PTC_idx 4 // ITLB miss - purged
+#define DTLB_idx 5 // DTLB miss counter
+#define DTLB_PTC_idx 6 // DTLB miss - purged
+#define DIRTY_idx 7 // DIRTY trap counter
+#define DIRTY_PTC_idx 8 // DIRTY - purged
+#define IACC_idx 9 // I-ACCESS trap counter
+#define IACC_PTC_idx 10 // I-ACCESS - purged
+#define DACC_idx 11 // D-ACCESS trap counter
+#define DACC_PTC_idx 12 // D-ACCESS - purged
+
+
+#define MAX_TRAP_idx 13
+
+#define _ENTRY_SIZE_ 8
+
+
#endif /* _ASM_IA64_SYSTEM_H */
--- save/include/asm-ia64/unistd.h 2006-03-15 11:08:53.000000000 +0100
+++ linux-2.6.16/include/asm-ia64/unistd.h 2006-04-03 12:46:40.000000000 +0200
@@ -290,7 +290,7 @@
#include <linux/config.h>
-#define NR_syscalls 273 /* length of syscall table */
+#define NR_syscalls 274 /* length of syscall table */
#define __ARCH_WANT_SYS_RT_SIGACTION
[-- Attachment #3: stat.c --]
[-- Type: text/plain, Size: 1578 bytes --]
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/*
* For trap statistics
* Use "-1" to clear the counters
*/
#define VHPT_idx 0 // VHPT miss counter
#define VHPT_HASH_PTC_idx 1 // VHPT miss - hash purged
#define VHPT_PTE_PTC_idx 2 // VHPT miss - PTE purged
#define ITLB_idx 3 // ITLB miss counter
#define ITLB_PTC_idx 4 // ITLB miss - purged
#define DTLB_idx 5 // DTLB miss counter
#define DTLB_PTC_idx 6 // DTLB miss - purged
#define DIRTY_idx 7 // DIRTY trap counter
#define DIRTY_PTC_idx 8 // DIRTY - purged
#define IACC_idx 9 // I-ACCESS trap counter
#define IACC_PTC_idx 10 // I-ACCESS - purged
#define DACC_idx 11 // D-ACCESS trap counter
#define DACC_PTC_idx 12 // D-ACCESS - purged
#define MAX_TRAP_idx 13
#define _ENTRY_SIZE_ 8
#define sys_trap_statistics 1297
char *names[] = {
"VHPT miss counter",
"VHPT miss - hash purged",
"VHPT miss - PTE purged",
"ITLB miss counter",
"ITLB miss - purged",
"DTLB miss counter",
"DTLB miss - purged",
"DIRTY trap counter",
"DIRTY - purged",
"I-ACCESS trap counter",
"I-ACCESS - purged",
"D-ACCESS trap counter",
"D-ACCESS - purged",
};
int main(int cnt, char *args[])
{
int index;
long count;
if (cnt == 2 && strcmp(args[1], "-clear") == 0){
if (syscall(sys_trap_statistics, -1) == -1){
perror("sys_trap_statistics");
exit(1);
}
exit(0);
}
for (index = 0; index < MAX_TRAP_idx; index++){
count = syscall(sys_trap_statistics, index);
if (count == -1){
perror("sys_trap_statistics");
exit(1);
}
printf("%30s: %7ld\n", names[index], count);
}
exit(0);
}
* RE: accessed/dirty bit handler tuning
2006-03-13 14:08 accessed/dirty bit handler tuning Zoltan Menyhart
` (27 preceding siblings ...)
2006-04-03 13:45 ` Zoltan Menyhart
@ 2006-04-03 15:49 ` Luck, Tony
2006-04-03 15:57 ` Luck, Tony
` (5 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Luck, Tony @ 2006-04-03 15:49 UTC (permalink / raw)
To: linux-ia64
>Unless I am mistaken, there is no purge observed.
The purge cases will be very, very rare. You either need to be
swapping (so that pages are being stolen out from the process
in the middle of the trap handler), or to construct some special
multi-threaded test (where some threads are re-mapping pieces
of the shared virtual space while other threads are trying to
use those addresses!). Even in this case the race window is
tiny ... so I'd be surprised to see any benchmark generate more
than a few such purges per hour.
>It is very much curious having so few dirty and i-access traps...
Agreed. The numbers for dirty/i-access traps are very low (less
than one per invocation of gcc during a kernel build).
>Have you got some good & stressing tests?
Nothing comes to mind that would really stress this code. Either
you need to make the system swap heavily (boot with mem=128m, and
then run make -j32, or some such thing ... but some tuning would
be needed to get enough swapping but still make enough progress),
or create some custom multi-threaded test that does mapping &
unmapping (I'm not aware of any real application that does anything
like this ... threads would have to cope with occasional SIGSEGV
if they happened to make an access while the addresses were being
re-mapped).
-Tony
* RE: accessed/dirty bit handler tuning
2006-03-13 14:08 accessed/dirty bit handler tuning Zoltan Menyhart
` (28 preceding siblings ...)
2006-04-03 15:49 ` Luck, Tony
@ 2006-04-03 15:57 ` Luck, Tony
2006-04-03 16:33 ` Zoltan Menyhart
` (4 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Luck, Tony @ 2006-04-03 15:57 UTC (permalink / raw)
To: linux-ia64
> It is very much curious having so few dirty and i-access traps...
Your data collection code has races (ld8, add, st8 on one cpu can
race with another cpu doing the same). So you'll undercount whenever
a race happens.
Perhaps you should use per-cpu counters to collect the values, and
then sum for each cpu in the syscall() before reporting to the user?
-Tony
* Re: accessed/dirty bit handler tuning
2006-03-13 14:08 accessed/dirty bit handler tuning Zoltan Menyhart
` (29 preceding siblings ...)
2006-04-03 15:57 ` Luck, Tony
@ 2006-04-03 16:33 ` Zoltan Menyhart
2006-04-03 16:42 ` David Mosberger-Tang
` (3 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Zoltan Menyhart @ 2006-04-03 16:33 UTC (permalink / raw)
To: linux-ia64
Luck, Tony wrote:
>>It is very much curious having so few dirty and i-access traps...
>
>
> Your data collection code has races (ld8, add, st8 on one cpu can
> race with another cpu doing the same). So you'll undercount whenever
> a race happens.
>
> Perhaps you should use per-cpu counters to collect the values, and
> then sum for each cpu in the syscall() before reporting to the user?
It was not very important to count the events precisely.
The lesson I learnt is that the VHPT miss handler is the most important
one (maybe also the D-ACCESS trap handler); the others are negligible.
(I have not yet counted the handlers that do not do a purge.)
The problem is that the most frequently used trap handler contains
the unsafe walk of the
rx = IA64_KR_PT_BASE -> pgd[i] -> pud[j] -> pmd[k] -> pte[l]
chain...
Zoltan
* Re: accessed/dirty bit handler tuning
2006-03-13 14:08 accessed/dirty bit handler tuning Zoltan Menyhart
` (30 preceding siblings ...)
2006-04-03 16:33 ` Zoltan Menyhart
@ 2006-04-03 16:42 ` David Mosberger-Tang
2006-04-03 17:23 ` Zoltan Menyhart
` (2 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: David Mosberger-Tang @ 2006-04-03 16:42 UTC (permalink / raw)
To: linux-ia64
On 4/3/06, Zoltan Menyhart <Zoltan.Menyhart@bull.net> wrote:
> The problem is that the most frequently used trap handler contains
> the unsafe walk of the
>
> rx = IA64_KR_PT_BASE -> pgd[i] -> pud[j] -> pmd[k] -> pte[l]
>
> chain...
Please, everybody step back a minute. Hint: consider that x86 does
the page-table walk in hardware...
--david
* Re: accessed/dirty bit handler tuning
2006-03-13 14:08 accessed/dirty bit handler tuning Zoltan Menyhart
` (31 preceding siblings ...)
2006-04-03 16:42 ` David Mosberger-Tang
@ 2006-04-03 17:23 ` Zoltan Menyhart
2006-04-03 17:50 ` Luck, Tony
2006-04-03 18:27 ` Christoph Lameter
34 siblings, 0 replies; 36+ messages in thread
From: Zoltan Menyhart @ 2006-04-03 17:23 UTC (permalink / raw)
To: linux-ia64
David Mosberger-Tang wrote:
> On 4/3/06, Zoltan Menyhart <Zoltan.Menyhart@bull.net> wrote:
>
>
>>The problem is that the most frequently used trap handler contains
>>the unsafe walk of the
>>
>> rx = IA64_KR_PT_BASE -> pgd[i] -> pud[j] -> pmd[k] -> pte[l]
>>
>>chain...
>
>
> Please, everybody step back a minute. Hint: consider that x86 does
> the page-table walk in hardware...
Telling the truth: I'm not an x86 expert :-)
What I could dig up in 5 minutes is:
IA-32 Intel® Architecture Software Developer’s Manual Volume 3A:
7.1.2.1 Automatic Locking
"When updating page-directory and page-table entries, the processor uses
locked cycles to set the accessed and dirty flag in the page-directory
and page-table entries."
I guess the TLB load is auto-locked, too.
Anyway, what can we conclude from this for the ia64 architecture?
Can you _prove_ that walking that chain of pointers is safe?
Thanks,
Zoltan
* RE: accessed/dirty bit handler tuning
2006-03-13 14:08 accessed/dirty bit handler tuning Zoltan Menyhart
` (32 preceding siblings ...)
2006-04-03 17:23 ` Zoltan Menyhart
@ 2006-04-03 17:50 ` Luck, Tony
2006-04-03 18:27 ` Christoph Lameter
34 siblings, 0 replies; 36+ messages in thread
From: Luck, Tony @ 2006-04-03 17:50 UTC (permalink / raw)
To: linux-ia64
> I guess the TLB load is auto-locked, too.
No, the whole multi-level TLB walk is not locked.
> Anyway, what can we conclude from this for the ia64 architecture?
We can conclude that the generic Linux code that makes changes
to the page table tree has been written to take some care to
make changes in a "safe" way given that other processors may be
simultaneously walking the page tables. Thus the pointer from the
higher level table to the lower level table is cleared, and we issue
the purge before we free the lower level table.
However, there have been changes to that code, and in particular
ia64 did re-instate quicklists ... so a re-audit to make sure that
none of the assumptions have been broken wouldn't be a bad idea.
-Tony
* Re: accessed/dirty bit handler tuning
2006-03-13 14:08 accessed/dirty bit handler tuning Zoltan Menyhart
` (33 preceding siblings ...)
2006-04-03 17:50 ` Luck, Tony
@ 2006-04-03 18:27 ` Christoph Lameter
34 siblings, 0 replies; 36+ messages in thread
From: Christoph Lameter @ 2006-04-03 18:27 UTC (permalink / raw)
To: linux-ia64
On Mon, 3 Apr 2006, Zoltan Menyhart wrote:
> It is very much curious having so few dirty and i-access traps...
The dirty bit handler is only used for shared, writable mappings. There
are only very few pieces of software using such mappings. The kernel
usually attempts to prepopulate ptes with the bits set so that the dirty
fault is avoided. E.g. a nopage fault for a write access will install a
dirty pte.
end of thread, other threads:[~2006-04-03 18:27 UTC | newest]
Thread overview: 36+ messages
2006-03-13 14:08 accessed/dirty bit handler tuning Zoltan Menyhart
2006-03-13 16:31 ` Christoph Lameter
2006-03-13 16:55 ` Zoltan Menyhart
2006-03-13 19:46 ` Chen, Kenneth W
2006-03-13 20:05 ` Luck, Tony
2006-03-13 20:14 ` Chen, Kenneth W
2006-03-13 22:53 ` Chen, Kenneth W
2006-03-14 10:12 ` Zoltan Menyhart
2006-03-14 19:33 ` Chen, Kenneth W
2006-03-15 13:29 ` Zoltan Menyhart
2006-03-15 17:37 ` Chen, Kenneth W
2006-03-16 9:57 ` Zoltan Menyhart
2006-03-16 10:19 ` Luck, Tony
2006-03-16 19:12 ` Chen, Kenneth W
2006-03-29 8:11 ` Zoltan Menyhart
2006-03-29 8:28 ` Chen, Kenneth W
2006-03-29 13:37 ` Zoltan Menyhart
2006-03-29 17:01 ` Zoltan Menyhart
2006-03-29 22:57 ` Luck, Tony
2006-03-29 22:59 ` Chen, Kenneth W
2006-03-30 15:13 ` Zoltan Menyhart
2006-03-31 16:23 ` Zoltan Menyhart
2006-03-31 19:08 ` Chen, Kenneth W
2006-03-31 21:18 ` Zoltan Menyhart
2006-03-31 21:51 ` Chen, Kenneth W
2006-03-31 22:14 ` Chen, Kenneth W
2006-03-31 22:57 ` Zoltan Menyhart
2006-04-03 8:46 ` Zoltan Menyhart
2006-04-03 13:45 ` Zoltan Menyhart
2006-04-03 15:49 ` Luck, Tony
2006-04-03 15:57 ` Luck, Tony
2006-04-03 16:33 ` Zoltan Menyhart
2006-04-03 16:42 ` David Mosberger-Tang
2006-04-03 17:23 ` Zoltan Menyhart
2006-04-03 17:50 ` Luck, Tony
2006-04-03 18:27 ` Christoph Lameter