* [PATCH 1/3] powerpc/e6500: remove the stale TCD_LOCK macro
@ 2015-08-13 11:51 Kevin Hao
  2015-08-13 11:51 ` [PATCH 2/3] powerpc/e6500: hw tablewalk: optimize a bit for tcd lock acquiring codes Kevin Hao
  2015-08-13 11:51 ` [PATCH 3/3] powerpc/e6500: hw tablewalk: order the memory access when acquire/release tcd lock Kevin Hao
  0 siblings, 2 replies; 15+ messages in thread
From: Kevin Hao @ 2015-08-13 11:51 UTC (permalink / raw)
  To: Scott Wood; +Cc: linuxppc-dev

Since we moved the "lock" to be the first element of struct
tlb_core_data in commit 82d86de25b9c ("powerpc/e6500: Make TLB lock
recursive"), this macro is no longer used by any code. Just delete it.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
---
 arch/powerpc/kernel/asm-offsets.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 98230579d99c..810f433731dc 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -213,7 +213,6 @@ int main(void)
 		 offsetof(struct tlb_core_data, esel_max));
 	DEFINE(TCD_ESEL_FIRST,
 		 offsetof(struct tlb_core_data, esel_first));
-	DEFINE(TCD_LOCK, offsetof(struct tlb_core_data, lock));
 #endif /* CONFIG_PPC_BOOK3E */
 
 #ifdef CONFIG_PPC_STD_MMU_64
-- 
2.1.0

^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 2/3] powerpc/e6500: hw tablewalk: optimize a bit for tcd lock acquiring codes
  2015-08-13 11:51 [PATCH 1/3] powerpc/e6500: remove the stale TCD_LOCK macro Kevin Hao
@ 2015-08-13 11:51 ` Kevin Hao
  2015-08-13 18:44   ` Scott Wood
  2015-08-13 11:51 ` [PATCH 3/3] powerpc/e6500: hw tablewalk: order the memory access when acquire/release tcd lock Kevin Hao
  1 sibling, 1 reply; 15+ messages in thread
From: Kevin Hao @ 2015-08-13 11:51 UTC (permalink / raw)
  To: Scott Wood; +Cc: linuxppc-dev

It makes no sense to put the instructions for calculating the lock
value (cpu number + 1) and for clearing the eq bit of cr1 inside the
lbarx/stbcx loop. Moreover, while the lock is held by the other thread,
its value can never equal the lock value used by the current cpu, so we
can also skip comparing these two lock values in the lbz/bne spin loop.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
---
 arch/powerpc/mm/tlb_low_64e.S | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/tlb_low_64e.S b/arch/powerpc/mm/tlb_low_64e.S
index 765b419883f2..e4185581c5a7 100644
--- a/arch/powerpc/mm/tlb_low_64e.S
+++ b/arch/powerpc/mm/tlb_low_64e.S
@@ -308,11 +308,11 @@ BEGIN_FTR_SECTION		/* CPU_FTR_SMT */
 	 *
 	 * MAS6:IND should be already set based on MAS4
 	 */
-1:	lbarx	r15,0,r11
 	lhz	r10,PACAPACAINDEX(r13)
-	cmpdi	r15,0
-	cmpdi	cr1,r15,1	/* set cr1.eq = 0 for non-recursive */
 	addi	r10,r10,1
+	crclr	cr1*4+eq	/* set cr1.eq = 0 for non-recursive */
+1:	lbarx	r15,0,r11
+	cmpdi	r15,0
 	bne	2f
 	stbcx.	r10,0,r11
 	bne	1b
@@ -320,9 +320,9 @@ BEGIN_FTR_SECTION		/* CPU_FTR_SMT */
 	.subsection 1
 2:	cmpd	cr1,r15,r10	/* recursive lock due to mcheck/crit/etc? */
 	beq	cr1,3b		/* unlock will happen if cr1.eq = 0 */
-	lbz	r15,0(r11)
+10:	lbz	r15,0(r11)
 	cmpdi	r15,0
-	bne	2b
+	bne	10b
 	b	1b
 	.previous
-- 
2.1.0

^ permalink raw reply related [flat|nested] 15+ messages in thread
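As an aside for readers less familiar with the pattern: the change above
is an instance of the generic "hoist loop-invariant work out of an
atomic retry loop" optimization, plus a test-and-test-and-set style spin
on contention. A rough C sketch of the same idea, using GCC's __atomic
builtins in place of the raw lbarx/stbcx sequence (the function names,
"token", and the recursion-free shape are illustrative assumptions, not
kernel code):

	/* Before: the value to store is recomputed on every pass of
	 * the reservation loop. */
	static void lock_acquire_before(unsigned char *lock, int cpu)
	{
		for (;;) {
			unsigned char token = cpu + 1;	/* redone each retry */
			unsigned char expected = 0;

			if (__atomic_compare_exchange_n(lock, &expected, token,
							0, __ATOMIC_ACQUIRE,
							__ATOMIC_RELAXED))
				return;
		}
	}

	/* After: the invariant work is done once up front, and while the
	 * lock is held by the other thread we spin with plain loads (the
	 * lbz/bne loop) without re-checking our own token, since the
	 * owner's value can never equal ours. */
	static void lock_acquire_after(unsigned char *lock, int cpu)
	{
		unsigned char token = cpu + 1;	/* hoisted out of the loop */

		for (;;) {
			unsigned char expected = 0;

			if (__atomic_compare_exchange_n(lock, &expected, token,
							0, __ATOMIC_ACQUIRE,
							__ATOMIC_RELAXED))
				return;
			while (__atomic_load_n(lock, __ATOMIC_RELAXED) != 0)
				;	/* wait until the lock looks free */
		}
	}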
* Re: [PATCH 2/3] powerpc/e6500: hw tablewalk: optimize a bit for tcd lock acquiring codes
  2015-08-13 11:51 ` [PATCH 2/3] powerpc/e6500: hw tablewalk: optimize a bit for tcd lock acquiring codes Kevin Hao
@ 2015-08-13 18:44   ` Scott Wood
  2015-08-14  7:13     ` Kevin Hao
  0 siblings, 1 reply; 15+ messages in thread
From: Scott Wood @ 2015-08-13 18:44 UTC (permalink / raw)
  To: Kevin Hao; +Cc: linuxppc-dev

On Thu, 2015-08-13 at 19:51 +0800, Kevin Hao wrote:
> It makes no sense to put the instructions for calculating the lock
> value (cpu number + 1) and the clearing of eq bit of cr1 in lbarx/stbcx
> loop. And when the lock is acquired by the other thread, the current
> lock value has no chance to equal with the lock value used by current
> cpu. So we can skip the comparing for these two lock values in the
> lbz/bne loop.
> 
> Signed-off-by: Kevin Hao <haokexin@gmail.com>
> ---
>  arch/powerpc/mm/tlb_low_64e.S | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/mm/tlb_low_64e.S b/arch/powerpc/mm/tlb_low_64e.S
> index 765b419883f2..e4185581c5a7 100644
> --- a/arch/powerpc/mm/tlb_low_64e.S
> +++ b/arch/powerpc/mm/tlb_low_64e.S
> @@ -308,11 +308,11 @@ BEGIN_FTR_SECTION		/* CPU_FTR_SMT */
>  	 *
>  	 * MAS6:IND should be already set based on MAS4
>  	 */
> -1:	lbarx	r15,0,r11
>  	lhz	r10,PACAPACAINDEX(r13)
> -	cmpdi	r15,0
> -	cmpdi	cr1,r15,1	/* set cr1.eq = 0 for non-recursive */
>  	addi	r10,r10,1
> +	crclr	cr1*4+eq	/* set cr1.eq = 0 for non-recursive */
> +1:	lbarx	r15,0,r11
> +	cmpdi	r15,0
>  	bne	2f

You're optimizing the contended case at the expense of introducing
stalls in the uncontended case. Does it really matter if there are more
instructions in the loop? This change just means that you'll spin in
the loop for more iterations (if it even does that -- I think the
cycles per loop iteration might be the same before and after, due to
load latency and pairing) while waiting for the other thread to release
the lock.

Do you have any benchmark results for this patch?

-Scott

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/3] powerpc/e6500: hw tablewalk: optimize a bit for tcd lock acquiring codes
  2015-08-13 18:44   ` Scott Wood
@ 2015-08-14  7:13     ` Kevin Hao
  2015-08-15  2:44       ` Scott Wood
  0 siblings, 1 reply; 15+ messages in thread
From: Kevin Hao @ 2015-08-14 7:13 UTC (permalink / raw)
  To: Scott Wood; +Cc: linuxppc-dev

On Thu, Aug 13, 2015 at 01:44:43PM -0500, Scott Wood wrote:
> On Thu, 2015-08-13 at 19:51 +0800, Kevin Hao wrote:
> > It makes no sense to put the instructions for calculating the lock
> > value (cpu number + 1) and the clearing of eq bit of cr1 in lbarx/stbcx
> > loop. And when the lock is acquired by the other thread, the current
> > lock value has no chance to equal with the lock value used by current
> > cpu. So we can skip the comparing for these two lock values in the
> > lbz/bne loop.
> > 
> > Signed-off-by: Kevin Hao <haokexin@gmail.com>
> > ---
> >  arch/powerpc/mm/tlb_low_64e.S | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/powerpc/mm/tlb_low_64e.S b/arch/powerpc/mm/tlb_low_64e.S
> > index 765b419883f2..e4185581c5a7 100644
> > --- a/arch/powerpc/mm/tlb_low_64e.S
> > +++ b/arch/powerpc/mm/tlb_low_64e.S
> > @@ -308,11 +308,11 @@ BEGIN_FTR_SECTION		/* CPU_FTR_SMT */
> >  	 *
> >  	 * MAS6:IND should be already set based on MAS4
> >  	 */
> > -1:	lbarx	r15,0,r11
> >  	lhz	r10,PACAPACAINDEX(r13)
> > -	cmpdi	r15,0
> > -	cmpdi	cr1,r15,1	/* set cr1.eq = 0 for non-recursive */
> >  	addi	r10,r10,1
> > +	crclr	cr1*4+eq	/* set cr1.eq = 0 for non-recursive */
> > +1:	lbarx	r15,0,r11
> > +	cmpdi	r15,0
> >  	bne	2f
> 
> You're optimizing the contended case at the expense of introducing stalls
> in the uncontended case.

Before the patch, the uncontended case code sequence is:

1:	lbarx	r15,0,r11
	lhz	r10,PACAPACAINDEX(r13)
	cmpdi	r15,0
	cmpdi	cr1,r15,1	/* set cr1.eq = 0 for non-recursive */
	addi	r10,r10,1
	bne	2f
	stbcx.	r10,0,r11
	bne	1b

After the patch:

	lhz	r10,PACAPACAINDEX(r13)
	addi	r10,r10,1
	crclr	cr1*4+eq	/* set cr1.eq = 0 for non-recursive */
1:	lbarx	r15,0,r11
	cmpdi	r15,0
	bne	2f
	stbcx.	r10,0,r11
	bne	1b

As you know, lbarx is a Presync instruction and stbcx. is a Presync and
Postsync instruction. Putting the unnecessary instructions in the
lbarx/stbcx loop also serializes their execution. The execution latency
of lbarx is only 3 cycles and there are still two instructions after it.
Considering the out-of-order execution optimization after this patch, do
you really think the new uncontended path will become slower due to this
additional stall?

> Does it really matter if there are more instructions in the loop?

I really think we should minimize the lbarx/stbcx window, for the
following two reasons:
  - The bigger this window is, the more likely the two threads conflict
    when they run into this loop simultaneously.
  - The reservation used by lbarx may be cleared by another thread due
    to a store to the same reservation granule. The smaller the
    lbarx/stbcx window, the less likely it is to be affected by this.

> This change just means that you'll spin in the loop for more
> iterations (if it even does that -- I think the cycles per loop iteration
> might be the same before and after, due to load latency and pairing) while
> waiting for the other thread to release the lock.

Besides optimizing the contended case, it also makes the code more
readable with these changes:
  - It always seems a bit weird to calculate the lock value for the
    current cpu inside the lbarx/stbcx loop.
  - The "cmpdi cr1,r15,1" seems pretty confusing. It doesn't always do
    what the comment says (set cr1.eq = 0). In some cases it actually
    sets cr1.eq = 1, such as when running on cpu 1 while the lock is
    held by cpu 0. All we need to do is clear cr1.eq unconditionally.

> Do you have any benchmark results for this patch?

I doubt it will show any visible difference. Anyway, I will give it a
try.

Thanks,
Kevin

> 
> -Scott
> 

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/3] powerpc/e6500: hw tablewalk: optimize a bit for tcd lock acquiring codes
  2015-08-14  7:13     ` Kevin Hao
@ 2015-08-15  2:44       ` Scott Wood
  2015-08-17 11:16         ` Kevin Hao
  0 siblings, 1 reply; 15+ messages in thread
From: Scott Wood @ 2015-08-15 2:44 UTC (permalink / raw)
  To: Kevin Hao; +Cc: linuxppc-dev

On Fri, 2015-08-14 at 15:13 +0800, Kevin Hao wrote:
> On Thu, Aug 13, 2015 at 01:44:43PM -0500, Scott Wood wrote:
> > On Thu, 2015-08-13 at 19:51 +0800, Kevin Hao wrote:
> > > It makes no sense to put the instructions for calculating the lock
> > > value (cpu number + 1) and the clearing of eq bit of cr1 in lbarx/stbcx
> > > loop. And when the lock is acquired by the other thread, the current
> > > lock value has no chance to equal with the lock value used by current
> > > cpu. So we can skip the comparing for these two lock values in the
> > > lbz/bne loop.
> > > 
> > > Signed-off-by: Kevin Hao <haokexin@gmail.com>
> > > ---
> > >  arch/powerpc/mm/tlb_low_64e.S | 10 +++++-----
> > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/arch/powerpc/mm/tlb_low_64e.S
> > > b/arch/powerpc/mm/tlb_low_64e.S
> > > index 765b419883f2..e4185581c5a7 100644
> > > --- a/arch/powerpc/mm/tlb_low_64e.S
> > > +++ b/arch/powerpc/mm/tlb_low_64e.S
> > > @@ -308,11 +308,11 @@ BEGIN_FTR_SECTION		/* CPU_FTR_SMT */
> > >  	 *
> > >  	 * MAS6:IND should be already set based on MAS4
> > >  	 */
> > > -1:	lbarx	r15,0,r11
> > >  	lhz	r10,PACAPACAINDEX(r13)
> > > -	cmpdi	r15,0
> > > -	cmpdi	cr1,r15,1	/* set cr1.eq = 0 for non-recursive */
> > >  	addi	r10,r10,1
> > > +	crclr	cr1*4+eq	/* set cr1.eq = 0 for non-recursive */
> > > +1:	lbarx	r15,0,r11
> > > +	cmpdi	r15,0
> > >  	bne	2f
> > 
> > You're optimizing the contended case at the expense of introducing stalls
> > in the uncontended case.
> 
> Before the patch, the uncontended case code sequence are:
> 1:	lbarx	r15,0,r11
> 	lhz	r10,PACAPACAINDEX(r13)
> 	cmpdi	r15,0
> 	cmpdi	cr1,r15,1	/* set cr1.eq = 0 for non-recursive */
> 	addi	r10,r10,1
> 	bne	2f
> 	stbcx.	r10,0,r11
> 	bne	1b
> 
> After the patch:
> 	lhz	r10,PACAPACAINDEX(r13)
> 	addi	r10,r10,1
> 	crclr	cr1*4+eq	/* set cr1.eq = 0 for non-recursive */
> 1:	lbarx	r15,0,r11
> 	cmpdi	r15,0
> 	bne	2f
> 	stbcx.	r10,0,r11
> 	bne	1b
> 
> As you know, the lbarx is a Presync instruction and stbcx is a Presync and
> Postsync instruction.

Yes, so don't we want to move instructions after the lbarx if possible,
so that the presync condition is achieved sooner?

> Putting the unnecessary instructions in the lbarx/stbcx
> loop also serialize these instructions execution.

Again, the common case should be that the loop executes only once. The
two cmpdi instructions should pair, the addi should pair with the bne,
and the lhz should happen while waiting for the lbarx result. My
understanding of how to model this stuff is certainly
imperfect/incomplete, so I generally try to confirm by testing, but I
think both loops take the same number of cycles per iteration.

> The execution latency of
> lbarx is only 3 cycles and there are still two instructions after it.
> Considering the out of order execution optimization after this patch, do you
> really think that new uncontended path will become slower due to this
> additional stall?

The (theoretical) additional time is before the loop, not during it.

> > Does it really matter if there are more instructions in the loop?
> 
> I really think we should minimize the window of lbarx/stbcx, for following
> two reasons:
> - The bigger of this window, the more possible conflicts between the two
>   threads run into this loop simultaneously.

That's more true of the total time the lock is held, not the
lbarx/stbcx section.

> - The reservation used by lbarx may be cleared by another thread due to
>   store to the same reservation granule. The smaller the window of
>   lbarx/stbcx, the less possibility to be affected by this.

There's only one other thread that should be touching that reservation
granule, and it's the one we're waiting for. In any case, if there is a
difference in loop iteration execution time, it's small.

> > This change just means that you'll spin in the loop for more
> > iterations (if it even does that -- I think the cycles per loop iteration
> > might be the same before and after, due to load latency and pairing)
> > while waiting for the other thread to release the lock.
> 
> Besides the optimization for the contended case, it also make the code more
> readable with these changes:
> - It always seem a bit weird to calculate the lock value for the current
>   cpu in the lbarx/stbcx loop.
> - The "cmpdi cr1,r15,1" seems pretty confusion. It doesn't always do what
>   the comment said (set cr1.eq = 0). In some cases, it does set the
>   crq.eq = 1, such as when running on cpu 1 with lock is acquired by cpu0.
>   All we need to do just clear the cr1.eq unconditionally.

We only care about cr1.eq when we break out of the loop, in which case
r15 will have been zero. But yes, crclr is better.

> > Do you have any benchmark results for this patch?
> 
> I doubt it will get any visible difference. Anyway I will gave it a try.

I tried a couple different benchmarks and didn't find a significant
difference, relative to the variability of the results running on the
same kernel. A patch that claims to "optimize a bit" as its main
purpose ought to show some results. :-)

-Scott

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/3] powerpc/e6500: hw tablewalk: optimize a bit for tcd lock acquiring codes
  2015-08-15  2:44       ` Scott Wood
@ 2015-08-17 11:16         ` Kevin Hao
  2015-08-17 21:08           ` Scott Wood
  0 siblings, 1 reply; 15+ messages in thread
From: Kevin Hao @ 2015-08-17 11:16 UTC (permalink / raw)
  To: Scott Wood; +Cc: linuxppc-dev

On Fri, Aug 14, 2015 at 09:44:28PM -0500, Scott Wood wrote:
> I tried a couple different benchmarks and didn't find a significant
> difference, relative to the variability of the results running on the same
> kernel. A patch that claims to "optimize a bit" as its main purpose ought to
> show some results. :-)

I tried to compare the execution time of these two code sequences with
the following test module:

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/printk.h>

static void test1(void)
{
	int i;
	unsigned char lock, c;
	unsigned short cpu, s;

	for (i = 0; i < 100000; i++) {
		lock = 0;
		cpu = 1;

		asm volatile (
		"1:	lbarx	%0,0,%2\n\
			lhz	%1,0(%3)\n\
			cmpdi	%0,0\n\
			cmpdi	cr1,%1,1\n\
			addi	%1,%1,1\n\
			bne	2f\n\
			stbcx.	%1,0,%2\n\
			bne	1b\n\
		2:"
		: "=&r" (c), "=&r" (s)
		: "r" (&lock), "r" (&cpu)
		: "cr0", "cr1", "memory");
	}
}

static void test2(void)
{
	int i;
	unsigned char lock, c;
	unsigned short cpu, s;

	for (i = 0; i < 100000; i++) {
		lock = 0;
		cpu = 1;

		asm volatile (
		"	lhz	%1,0(%3)\n\
			addi	%1,%1,1\n\
			crclr	cr1*4+eq\n\
		1:	lbarx	%0,0,%2\n\
			cmpdi	%0,0\n\
			bne	2f\n\
			stbcx.	%1,0,%2\n\
			bne	1b\n\
		2:"
		: "=&r" (c), "=&r" (s)
		: "r" (&lock), "r" (&cpu)
		: "cr0", "cr1", "memory");
	}
}

static int test_init(void)
{
	unsigned long s, e, tm1, tm2;

	__hard_irq_disable();

	/* Just for prefetch */
	test1();
	s = mftb();
	test1();
	e = mftb();
	tm1 = e - s;

	/* Just for prefetch */
	test2();
	s = mftb();
	test2();
	e = mftb();
	tm2 = e - s;

	__hard_irq_enable();

	pr_err("test1: %ld, test2: %ld, %%%ld\n", tm1, tm2,
			(tm1 - tm2) * 100 / tm1);

	return 0;
}

static void test_exit(void)
{
	return;
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");

The results:
	test1: 156568, test2: 151675, %3
	test1: 156604, test2: 151670, %3
	test1: 156567, test2: 151684, %3
	test1: 156567, test2: 151678, %3
	test1: 156567, test2: 151688, %3
	test1: 156570, test2: 151683, %3
	test1: 156565, test2: 151675, %3
	test1: 156565, test2: 151673, %3

It seems there is indeed a 3% gain in performance
((156568 - 151675) / 156568 ≈ 3.1%; note the module's format string
prints the '%' before the number, hence the "%3" in the output) from
moving the three instructions out of the lbarx/stbcx loop.

Thanks,
Kevin

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/3] powerpc/e6500: hw tablewalk: optimize a bit for tcd lock acquiring codes
  2015-08-17 11:16         ` Kevin Hao
@ 2015-08-17 21:08           ` Scott Wood
  0 siblings, 0 replies; 15+ messages in thread
From: Scott Wood @ 2015-08-17 21:08 UTC (permalink / raw)
  To: Kevin Hao; +Cc: linuxppc-dev

On Mon, 2015-08-17 at 19:16 +0800, Kevin Hao wrote:
> On Fri, Aug 14, 2015 at 09:44:28PM -0500, Scott Wood wrote:
> > I tried a couple different benchmarks and didn't find a significant
> > difference, relative to the variability of the results running on the
> > same kernel. A patch that claims to "optimize a bit" as its main purpose
> > ought to show some results. :-)
> 
> I tried to compare the execution time of these two code sequences with the
> following test module:
> 
> #include <linux/module.h>
> #include <linux/kernel.h>
> #include <linux/printk.h>
> 
> static void test1(void)
> {
> 	int i;
> 	unsigned char lock, c;
> 	unsigned short cpu, s;
> 
> 	for (i = 0; i < 100000; i++) {
> 		lock = 0;
> 		cpu = 1;
> 
> 		asm volatile (
> 		"1:	lbarx	%0,0,%2\n\
> 			lhz	%1,0(%3)\n\
> 			cmpdi	%0,0\n\
> 			cmpdi	cr1,%1,1\n\

This should be either "cmpdi cr1,%0,1" or crclr, not that it made much
difference.

The test seemed to be rather sensitive to additional instructions
inserted at the beginning of the asm statement (especially isync), so
the initial instructions before the loop are probably pairing with
something outside the asm.

That said, it looks like this patch at least doesn't make things worse,
and does convert cmpdi to a more readable crclr, so I guess I'll apply
it even though it doesn't show any measurable benefit when testing
entire TLB misses (much less actual applications).

I suspect the point where I misunderstood the core manual was where it
listed lbarx as having a repeat-rate of 3 cycles. I probably assumed
that that was because of the presync, and thus a subsequent unrelated
load could execute partially in parallel, but it looks like the repeat
rate is specifically talking about how long it is until the execution
unit can accept any other instruction.

-Scott

^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 3/3] powerpc/e6500: hw tablewalk: order the memory access when acquire/release tcd lock
  2015-08-13 11:51 [PATCH 1/3] powerpc/e6500: remove the stale TCD_LOCK macro Kevin Hao
  2015-08-13 11:51 ` [PATCH 2/3] powerpc/e6500: hw tablewalk: optimize a bit for tcd lock acquiring codes Kevin Hao
@ 2015-08-13 11:51 ` Kevin Hao
  2015-08-14  3:39   ` Scott Wood
  1 sibling, 1 reply; 15+ messages in thread
From: Kevin Hao @ 2015-08-13 11:51 UTC (permalink / raw)
  To: Scott Wood; +Cc: linuxppc-dev

I didn't find anything unusual, but I think we do need to order the
load/store of esel_next when acquiring/releasing the tcd lock. For
acquire, add a data dependency to order the loads of lock and
esel_next. For release, even though there is already an "isync" here,
it doesn't guarantee any memory access ordering, so we still need an
"lwsync" to order the two stores to lock and esel_next.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
---
 arch/powerpc/mm/tlb_low_64e.S | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/mm/tlb_low_64e.S b/arch/powerpc/mm/tlb_low_64e.S
index e4185581c5a7..964754911987 100644
--- a/arch/powerpc/mm/tlb_low_64e.S
+++ b/arch/powerpc/mm/tlb_low_64e.S
@@ -334,6 +334,8 @@ BEGIN_FTR_SECTION		/* CPU_FTR_SMT */
 	 * with tlbilx before overwriting.
 	 */
 
+	andi.	r15,r15,0	/* add a data dependency to order the loads */
+	add	r11,r11,r15	/* between the lock and esel_next */
 	lbz	r15,TCD_ESEL_NEXT(r11)
 	rlwinm	r10,r15,16,0xff0000
 	oris	r10,r10,MAS0_TLBSEL(1)@h
@@ -447,6 +449,7 @@ BEGIN_FTR_SECTION
 	beq	cr1,1f		/* no unlock if lock was recursively grabbed */
 	li	r15,0
 	isync
+	lwsync
 	stb	r15,0(r11)
 1:
 END_FTR_SECTION_IFSET(CPU_FTR_SMT)
-- 
2.1.0

^ permalink raw reply related [flat|nested] 15+ messages in thread
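As a rough analogy only (the kernel code is the hand-written assembly
above, not this), the ordering this patch asks for is the usual
acquire/release pairing. In C11-atomics terms, with struct tcd and the
token parameter as illustrative stand-ins:

	#include <stdatomic.h>

	struct tcd {
		atomic_uchar lock;
		unsigned char esel_next;	/* protected by lock */
	};

	static void tcd_lock(struct tcd *t, unsigned char token)
	{
		unsigned char expected;

		do {
			expected = 0;
			/* acquire plays the role of the added data
			 * dependency: later loads of esel_next must not
			 * be satisfied before the load of lock */
		} while (!atomic_compare_exchange_weak_explicit(&t->lock,
				&expected, token, memory_order_acquire,
				memory_order_relaxed));
	}

	static void tcd_unlock(struct tcd *t)
	{
		/* release plays the role of the added lwsync: the store
		 * that updated esel_next must be visible before the
		 * store that releases the lock */
		atomic_store_explicit(&t->lock, 0, memory_order_release);
	}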
* Re: [PATCH 3/3] powerpc/e6500: hw tablewalk: order the memory access when acquire/release tcd lock
  2015-08-13 11:51 ` [PATCH 3/3] powerpc/e6500: hw tablewalk: order the memory access when acquire/release tcd lock Kevin Hao
@ 2015-08-14  3:39   ` Scott Wood
  2015-08-14  7:13     ` Kevin Hao
  0 siblings, 1 reply; 15+ messages in thread
From: Scott Wood @ 2015-08-14 3:39 UTC (permalink / raw)
  To: Kevin Hao; +Cc: linuxppc-dev

On Thu, 2015-08-13 at 19:51 +0800, Kevin Hao wrote:
> I didn't find anything unusual. But I think we do need to order the
> load/store of esel_next when acquire/release tcd lock. For acquire,
> add a data dependency to order the loads of lock and esel_next.
> For release, even there already have a "isync" here, but it doesn't
> guarantee any memory access order. So we still need "lwsync" for
> the two stores for lock and esel_next.

I was going to say that esel_next is just a hint and it doesn't really
matter if we occasionally get the wrong value, unless it happens often
enough to cause more performance degradation than the lwsync causes.
However, with the A-008139 workaround we do need to read the same value
from esel_next both times. It might be less costly to save/restore an
additional register instead of lwsync, though.

-Scott

^ permalink raw reply [flat|nested] 15+ messages in thread
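To spell out the hazard being discussed: the handler reads esel_next
twice -- once to pick the entry for the tlbilx invalidate and once for
the tlbwe -- and the A-008139 workaround is only correct if both reads
see the same value. A hypothetical C rendering, reusing the struct tcd
sketch above (invalidate_entry/write_entry are made-up stand-ins for
tlbilx/tlbwe; the real code has no such helpers):

	static void invalidate_entry(unsigned char esel) { (void)esel; }
	static void write_entry(unsigned char esel)      { (void)esel; }

	static void tlb_refill(struct tcd *t)
	{
		unsigned char e1, e2;

		tcd_lock(t, 1);

		/* In the unpatched code there is no acquire ordering or
		 * data dependency here, so this load may be satisfied
		 * with a stale value fetched before the lock was seen
		 * free. */
		e1 = t->esel_next;
		invalidate_entry(e1);	/* A-008139: invalidate first */

		e2 = t->esel_next;	/* second read in the pre-v2 code */
		write_entry(e2);	/* must hit the same entry as e1 */

		/* If e1 != e2, one entry is invalidated but a different
		 * one is overwritten, breaking the workaround.  The
		 * eventual v2 patch below takes option b: read esel_next
		 * once and carry it in a register (r7) so the two uses
		 * cannot diverge. */
		t->esel_next = e2 + 1;
		tcd_unlock(t);
	}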
* Re: [PATCH 3/3] powerpc/e6500: hw tablewalk: order the memory access when acquire/release tcd lock
  2015-08-14  3:39   ` Scott Wood
@ 2015-08-14  7:13     ` Kevin Hao
  2015-08-15  0:44       ` Scott Wood
  0 siblings, 1 reply; 15+ messages in thread
From: Kevin Hao @ 2015-08-14 7:13 UTC (permalink / raw)
  To: Scott Wood; +Cc: linuxppc-dev

On Thu, Aug 13, 2015 at 10:39:19PM -0500, Scott Wood wrote:
> On Thu, 2015-08-13 at 19:51 +0800, Kevin Hao wrote:
> > I didn't find anything unusual. But I think we do need to order the
> > load/store of esel_next when acquire/release tcd lock. For acquire,
> > add a data dependency to order the loads of lock and esel_next.
> > For release, even there already have a "isync" here, but it doesn't
> > guarantee any memory access order. So we still need "lwsync" for
> > the two stores for lock and esel_next.
> 
> I was going to say that esel_next is just a hint and it doesn't really matter
> if we occasionally get the wrong value, unless it happens often enough to
> cause more performance degradation than the lwsync causes. However, with the
> A-008139 workaround we do need to read the same value from esel_next both
> times. It might be less costly to save/restore an additional register
> instead of lwsync, though.

I will try to get some benchmark numbers to compare which method is
better. Do you have any recommended benchmark for a case like this?

Thanks,
Kevin

> 
> -Scott
> 

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 3/3] powerpc/e6500: hw tablewalk: order the memory access when acquire/release tcd lock
  2015-08-14  7:13     ` Kevin Hao
@ 2015-08-15  0:44       ` Scott Wood
  2015-08-17 11:19         ` Kevin Hao
  0 siblings, 1 reply; 15+ messages in thread
From: Scott Wood @ 2015-08-15 0:44 UTC (permalink / raw)
  To: Kevin Hao; +Cc: linuxppc-dev

On Fri, 2015-08-14 at 15:13 +0800, Kevin Hao wrote:
> On Thu, Aug 13, 2015 at 10:39:19PM -0500, Scott Wood wrote:
> > On Thu, 2015-08-13 at 19:51 +0800, Kevin Hao wrote:
> > > I didn't find anything unusual. But I think we do need to order the
> > > load/store of esel_next when acquire/release tcd lock. For acquire,
> > > add a data dependency to order the loads of lock and esel_next.
> > > For release, even there already have a "isync" here, but it doesn't
> > > guarantee any memory access order. So we still need "lwsync" for
> > > the two stores for lock and esel_next.
> > 
> > I was going to say that esel_next is just a hint and it doesn't really
> > matter if we occasionally get the wrong value, unless it happens often
> > enough to cause more performance degradation than the lwsync causes.
> > However, with the A-008139 workaround we do need to read the same value
> > from esel_next both times. It might be less costly to save/restore an
> > additional register instead of lwsync, though.
> 
> I will try to get some benchmark number to compare which method is a bit
> better.
> Do you have any recommended benchmark for a case this is?

lmbench lat_mem_rd with a stride chosen to maximize TLB misses. For the
uncontended case, one instance; for the contended case, two instances,
one pinned to each thread of a core.

-Scott

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 3/3] powerpc/e6500: hw tablewalk: order the memory access when acquire/release tcd lock
  2015-08-15  0:44       ` Scott Wood
@ 2015-08-17 11:19         ` Kevin Hao
  2015-08-18  7:55           ` [PATCH v2] powerpc/e6500: hw tablewalk: make sure we invalidate and write to the same tlb entry Kevin Hao
  0 siblings, 1 reply; 15+ messages in thread
From: Kevin Hao @ 2015-08-17 11:19 UTC (permalink / raw)
  To: Scott Wood; +Cc: linuxppc-dev

On Fri, Aug 14, 2015 at 07:44:23PM -0500, Scott Wood wrote:
> On Fri, 2015-08-14 at 15:13 +0800, Kevin Hao wrote:
> > On Thu, Aug 13, 2015 at 10:39:19PM -0500, Scott Wood wrote:
> > > On Thu, 2015-08-13 at 19:51 +0800, Kevin Hao wrote:
> > > > I didn't find anything unusual. But I think we do need to order the
> > > > load/store of esel_next when acquire/release tcd lock. For acquire,
> > > > add a data dependency to order the loads of lock and esel_next.
> > > > For release, even there already have a "isync" here, but it doesn't
> > > > guarantee any memory access order. So we still need "lwsync" for
> > > > the two stores for lock and esel_next.
> > > 
> > > I was going to say that esel_next is just a hint and it doesn't really
> > > matter if we occasionally get the wrong value, unless it happens often
> > > enough to cause more performance degradation than the lwsync causes.
> > > However, with the A-008139 workaround we do need to read the same value
> > > from esel_next both times. It might be less costly to save/restore an
> > > additional register instead of lwsync, though.
> > 
> > I will try to get some benchmark number to compare which method is a bit
> > better.
> > Do you have any recommended benchmark for a case this is?
> 
> lmbench lat_mem_rd with a stride chosen to maximize TLB misses. For the
> uncontended case, one instance; for the contended case, two instances, one
> pinned to each thread of a core.

I have tried the method of saving/restoring an additional register for
the esel value, with the following change:

diff --git a/arch/powerpc/include/asm/exception-64e.h b/arch/powerpc/include/asm/exception-64e.h
index a8b52b61043f..8267c1404050 100644
--- a/arch/powerpc/include/asm/exception-64e.h
+++ b/arch/powerpc/include/asm/exception-64e.h
@@ -69,9 +69,9 @@
 #define EX_TLB_ESR	( 9 * 8) /* Level 0 and 2 only */
 #define EX_TLB_SRR0	(10 * 8)
 #define EX_TLB_SRR1	(11 * 8)
+#define EX_TLB_R9	(12 * 8)
 #ifdef CONFIG_BOOK3E_MMU_TLB_STATS
-#define EX_TLB_R8	(12 * 8)
-#define EX_TLB_R9	(13 * 8)
+#define EX_TLB_R8	(13 * 8)
 #define EX_TLB_LR	(14 * 8)
 #define EX_TLB_SIZE	(15 * 8)
 #else
diff --git a/arch/powerpc/mm/tlb_low_64e.S b/arch/powerpc/mm/tlb_low_64e.S
index e4185581c5a7..8d184dd530c4 100644
--- a/arch/powerpc/mm/tlb_low_64e.S
+++ b/arch/powerpc/mm/tlb_low_64e.S
@@ -68,11 +68,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_EMB_HV)
 	ld	r14,PACAPGD(r13)
 	std	r15,EX_TLB_R15(r12)
 	std	r10,EX_TLB_CR(r12)
+	std	r9,EX_TLB_R9(r12)
 	TLB_MISS_PROLOG_STATS
 .endm
 
 .macro tlb_epilog_bolted
 	ld	r14,EX_TLB_CR(r12)
+	ld	r9,EX_TLB_R9(r12)
 	ld	r10,EX_TLB_R10(r12)
 	ld	r11,EX_TLB_R11(r12)
 	ld	r13,EX_TLB_R13(r12)
@@ -297,6 +299,7 @@ itlb_miss_fault_bolted:
 	 * r13 = PACA
 	 * r11 = tlb_per_core ptr
 	 * r10 = crap (free to use)
+	 * r9  = esel entry
 	 */
 tlb_miss_common_e6500:
 	crmove	cr2*4+2,cr0*4+2	/* cr2.eq != 0 if kernel address */
@@ -334,8 +337,8 @@ BEGIN_FTR_SECTION		/* CPU_FTR_SMT */
 	 * with tlbilx before overwriting.
 	 */
 
-	lbz	r15,TCD_ESEL_NEXT(r11)
-	rlwinm	r10,r15,16,0xff0000
+	lbz	r9,TCD_ESEL_NEXT(r11)
+	rlwinm	r10,r9,16,0xff0000
 	oris	r10,r10,MAS0_TLBSEL(1)@h
 	mtspr	SPRN_MAS0,r10
 	isync
@@ -429,15 +432,14 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_SMT)
 	mtspr	SPRN_MAS2,r15
 
 tlb_miss_huge_done_e6500:
-	lbz	r15,TCD_ESEL_NEXT(r11)
 	lbz	r16,TCD_ESEL_MAX(r11)
 	lbz	r14,TCD_ESEL_FIRST(r11)
-	rlwimi	r10,r15,16,0x00ff0000	/* insert esel_next into MAS0 */
-	addi	r15,r15,1		/* increment esel_next */
+	rlwimi	r10,r9,16,0x00ff0000	/* insert esel_next into MAS0 */
+	addi	r9,r9,1			/* increment esel_next */
 	mtspr	SPRN_MAS0,r10
-	cmpw	r15,r16
-	iseleq	r15,r14,r15		/* if next == last use first */
-	stb	r15,TCD_ESEL_NEXT(r11)
+	cmpw	r9,r16
+	iseleq	r9,r14,r9		/* if next == last use first */
+	stb	r9,TCD_ESEL_NEXT(r11)
 
 	tlbwe

The following are the benchmark numbers on a t2080rdb board:

For the uncontended case (taskset -c 0 lat_mem_rd 2048 2097152):

	MB		origin		lwsync		r9
	2.00000		1.958		1.958		1.958
	3.00000		1.958		1.958		1.958
	4.00000		1.958		1.958		1.958
	6.00000		1.958		1.958		1.958
	8.00000		1.958		1.958		1.958
	12.00000	6.528		6.528		6.528
	16.00000	6.528		6.528		6.528
	24.00000	37.862		37.862		37.861
	32.00000	37.862		37.862		37.862
	48.00000	37.862		37.862		37.862
	64.00000	37.862		37.862		37.862
	96.00000	37.862		37.863		37.862
	128.00000	221.621		232.067		222.927
	192.00000	221.874		232.333		222.925
	256.00000	221.623		232.066		222.927
	384.00000	221.758		232.331		222.927
	512.00000	221.621		232.165		222.926
	768.00000	221.776		236.870		226.598
	1024.00000	264.199		271.351		259.281
	1536.00000	370.748		380.910		372.544
	2048.00000	392.185		404.696		395.881

For the contended case (taskset -c 0 lat_mem_rd 2048 2097152 &
taskset -c 1 lat_mem_rd 2048 2097152 >/dev/null 2>&1):

	MB		origin		lwsync		r9
	2.00000		3.366		2.944		3.086
	3.00000		2.915		3.256		3.095
	4.00000		3.043		2.443		2.617
	6.00000		2.742		3.367		2.629
	8.00000		3.145		3.365		2.443
	12.00000	18.267		11.885		13.736
	16.00000	15.607		13.312		18.048
	24.00000	37.856		37.208		37.855
	32.00000	37.856		37.861		37.855
	48.00000	37.856		37.861		37.855
	64.00000	57.487		37.861		52.505
	96.00000	270.445		229.641		143.241
	128.00000	284.535		279.907		305.540
	192.00000	275.491		298.282		295.592
	256.00000	265.155		331.212		259.096
	384.00000	276.084		291.023		251.282
	512.00000	275.852		335.602		267.074
	768.00000	289.682		354.521		312.180
	1024.00000	344.968		355.977		343.725
	1536.00000	410.961		454.381		412.790
	2048.00000	392.984		461.818		397.185

It seems that using an additional register does give better performance
in both cases.

Thanks,
Kevin

^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2] powerpc/e6500: hw tablewalk: make sure we invalidate and write to the same tlb entry
  2015-08-17 11:19         ` Kevin Hao
@ 2015-08-18  7:55           ` Kevin Hao
  2015-10-17  0:01             ` [v2] " Scott Wood
  0 siblings, 1 reply; 15+ messages in thread
From: Kevin Hao @ 2015-08-18 7:55 UTC (permalink / raw)
  To: Scott Wood; +Cc: linuxppc-dev

In order to work around Erratum A-008139, we have to invalidate the
tlb entry with tlbilx before overwriting it. For performance reasons,
we don't add any memory barrier when acquiring/releasing the tcd lock.
This means the two load instructions for esel_next may return different
values, which is definitely not acceptable given Erratum A-008139. We
have two options to fix this issue:
  a) Add a memory barrier when acquiring/releasing the tcd lock to
     order the load/store to esel_next.
  b) Just make sure we invalidate and write to the same tlb entry, and
     tolerate the race where we may get the wrong value and overwrite
     the tlb entry just updated by the other thread.

We observed better performance using option b, so reserve an additional
register to save the value of esel_next.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
---
v2: Use an additional register for saving the value of esel_next
    instead of lwsync.

 arch/powerpc/include/asm/exception-64e.h | 11 ++++++-----
 arch/powerpc/mm/tlb_low_64e.S            | 26 ++++++++++++++++++--------
 2 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/exception-64e.h b/arch/powerpc/include/asm/exception-64e.h
index a8b52b61043f..d53575becbed 100644
--- a/arch/powerpc/include/asm/exception-64e.h
+++ b/arch/powerpc/include/asm/exception-64e.h
@@ -69,13 +69,14 @@
 #define EX_TLB_ESR	( 9 * 8) /* Level 0 and 2 only */
 #define EX_TLB_SRR0	(10 * 8)
 #define EX_TLB_SRR1	(11 * 8)
+#define EX_TLB_R7	(12 * 8)
 #ifdef CONFIG_BOOK3E_MMU_TLB_STATS
-#define EX_TLB_R8	(12 * 8)
-#define EX_TLB_R9	(13 * 8)
-#define EX_TLB_LR	(14 * 8)
-#define EX_TLB_SIZE	(15 * 8)
+#define EX_TLB_R8	(13 * 8)
+#define EX_TLB_R9	(14 * 8)
+#define EX_TLB_LR	(15 * 8)
+#define EX_TLB_SIZE	(16 * 8)
 #else
-#define EX_TLB_SIZE	(12 * 8)
+#define EX_TLB_SIZE	(13 * 8)
 #endif
 
 #define START_EXCEPTION(label) \
diff --git a/arch/powerpc/mm/tlb_low_64e.S b/arch/powerpc/mm/tlb_low_64e.S
index e4185581c5a7..3a5b89dfb5a1 100644
--- a/arch/powerpc/mm/tlb_low_64e.S
+++ b/arch/powerpc/mm/tlb_low_64e.S
@@ -68,11 +68,21 @@ END_FTR_SECTION_IFSET(CPU_FTR_EMB_HV)
 	ld	r14,PACAPGD(r13)
 	std	r15,EX_TLB_R15(r12)
 	std	r10,EX_TLB_CR(r12)
+#ifdef CONFIG_PPC_FSL_BOOK3E
+BEGIN_FTR_SECTION
+	std	r7,EX_TLB_R7(r12)
+END_FTR_SECTION_IFSET(CPU_FTR_SMT)
+#endif
 	TLB_MISS_PROLOG_STATS
 .endm
 
 .macro tlb_epilog_bolted
 	ld	r14,EX_TLB_CR(r12)
+#ifdef CONFIG_PPC_FSL_BOOK3E
+BEGIN_FTR_SECTION
+	ld	r7,EX_TLB_R7(r12)
+END_FTR_SECTION_IFSET(CPU_FTR_SMT)
+#endif
 	ld	r10,EX_TLB_R10(r12)
 	ld	r11,EX_TLB_R11(r12)
 	ld	r13,EX_TLB_R13(r12)
@@ -297,6 +307,7 @@ itlb_miss_fault_bolted:
 	 * r13 = PACA
 	 * r11 = tlb_per_core ptr
 	 * r10 = crap (free to use)
+	 * r7  = esel_next
 	 */
 tlb_miss_common_e6500:
 	crmove	cr2*4+2,cr0*4+2	/* cr2.eq != 0 if kernel address */
@@ -334,8 +345,8 @@ BEGIN_FTR_SECTION		/* CPU_FTR_SMT */
 	 * with tlbilx before overwriting.
 	 */
 
-	lbz	r15,TCD_ESEL_NEXT(r11)
-	rlwinm	r10,r15,16,0xff0000
+	lbz	r7,TCD_ESEL_NEXT(r11)
+	rlwinm	r10,r7,16,0xff0000
 	oris	r10,r10,MAS0_TLBSEL(1)@h
 	mtspr	SPRN_MAS0,r10
 	isync
@@ -429,15 +440,14 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_SMT)
 	mtspr	SPRN_MAS2,r15
 
 tlb_miss_huge_done_e6500:
-	lbz	r15,TCD_ESEL_NEXT(r11)
 	lbz	r16,TCD_ESEL_MAX(r11)
 	lbz	r14,TCD_ESEL_FIRST(r11)
-	rlwimi	r10,r15,16,0x00ff0000	/* insert esel_next into MAS0 */
-	addi	r15,r15,1		/* increment esel_next */
+	rlwimi	r10,r7,16,0x00ff0000	/* insert esel_next into MAS0 */
+	addi	r7,r7,1			/* increment esel_next */
 	mtspr	SPRN_MAS0,r10
-	cmpw	r15,r16
-	iseleq	r15,r14,r15		/* if next == last use first */
-	stb	r15,TCD_ESEL_NEXT(r11)
+	cmpw	r7,r16
+	iseleq	r7,r14,r7		/* if next == last use first */
+	stb	r7,TCD_ESEL_NEXT(r11)
 
 	tlbwe
-- 
2.1.0

^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [v2] powerpc/e6500: hw tablewalk: make sure we invalidate and write to the same tlb entry
  2015-08-18  7:55           ` [PATCH v2] powerpc/e6500: hw tablewalk: make sure we invalidate and write to the same tlb entry Kevin Hao
@ 2015-10-17  0:01             ` Scott Wood
  2015-10-22 12:19               ` Kevin Hao
  0 siblings, 1 reply; 15+ messages in thread
From: Scott Wood @ 2015-10-17 0:01 UTC (permalink / raw)
  To: Kevin Hao; +Cc: linuxppc-dev

On Tue, Aug 18, 2015 at 03:55:56PM +0800, Kevin Hao wrote:
> diff --git a/arch/powerpc/mm/tlb_low_64e.S b/arch/powerpc/mm/tlb_low_64e.S
> index e4185581c5a7..3a5b89dfb5a1 100644
> --- a/arch/powerpc/mm/tlb_low_64e.S
> +++ b/arch/powerpc/mm/tlb_low_64e.S
> @@ -68,11 +68,21 @@ END_FTR_SECTION_IFSET(CPU_FTR_EMB_HV)
>  	ld	r14,PACAPGD(r13)
>  	std	r15,EX_TLB_R15(r12)
>  	std	r10,EX_TLB_CR(r12)
> +#ifdef CONFIG_PPC_FSL_BOOK3E
> +BEGIN_FTR_SECTION
> +	std	r7,EX_TLB_R7(r12)
> +END_FTR_SECTION_IFSET(CPU_FTR_SMT)
> +#endif
>  	TLB_MISS_PROLOG_STATS
>  .endm
> 
>  .macro tlb_epilog_bolted
>  	ld	r14,EX_TLB_CR(r12)
> +#ifdef CONFIG_PPC_FSL_BOOK3E
> +BEGIN_FTR_SECTION
> +	ld	r7,EX_TLB_R7(r12)
> +END_FTR_SECTION_IFSET(CPU_FTR_SMT)
> +#endif

r7 is used outside the CPU_FTR_SMT section of the e6500 TLB handler.

-Scott

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [v2] powerpc/e6500: hw tablewalk: make sure we invalidate and write to the same tlb entry
  2015-10-17  0:01             ` [v2] " Scott Wood
@ 2015-10-22 12:19               ` Kevin Hao
  0 siblings, 0 replies; 15+ messages in thread
From: Kevin Hao @ 2015-10-22 12:19 UTC (permalink / raw)
  To: Scott Wood; +Cc: linuxppc-dev

On Fri, Oct 16, 2015 at 07:01:55PM -0500, Scott Wood wrote:
> On Tue, Aug 18, 2015 at 03:55:56PM +0800, Kevin Hao wrote:
> > diff --git a/arch/powerpc/mm/tlb_low_64e.S b/arch/powerpc/mm/tlb_low_64e.S
> > index e4185581c5a7..3a5b89dfb5a1 100644
> > --- a/arch/powerpc/mm/tlb_low_64e.S
> > +++ b/arch/powerpc/mm/tlb_low_64e.S
> > @@ -68,11 +68,21 @@ END_FTR_SECTION_IFSET(CPU_FTR_EMB_HV)
> >  	ld	r14,PACAPGD(r13)
> >  	std	r15,EX_TLB_R15(r12)
> >  	std	r10,EX_TLB_CR(r12)
> > +#ifdef CONFIG_PPC_FSL_BOOK3E
> > +BEGIN_FTR_SECTION
> > +	std	r7,EX_TLB_R7(r12)
> > +END_FTR_SECTION_IFSET(CPU_FTR_SMT)
> > +#endif
> >  	TLB_MISS_PROLOG_STATS
> >  .endm
> > 
> >  .macro tlb_epilog_bolted
> >  	ld	r14,EX_TLB_CR(r12)
> > +#ifdef CONFIG_PPC_FSL_BOOK3E
> > +BEGIN_FTR_SECTION
> > +	ld	r7,EX_TLB_R7(r12)
> > +END_FTR_SECTION_IFSET(CPU_FTR_SMT)
> > +#endif
> 
> r7 is used outside the CPU_FTR_SMT section of the e6500 TLB handler.

Sorry for the delayed response, I am just back from vacation. I will
move the load of TCD_ESEL_NEXT out of the CPU_FTR_SMT wrap.

Thanks,
Kevin

^ permalink raw reply [flat|nested] 15+ messages in thread