linuxppc-dev.lists.ozlabs.org archive mirror
* RFC: Reducing the number of non volatile GPRs in the ppc64 kernel
@ 2015-08-05  4:03 Anton Blanchard
  2015-08-05  4:19 ` Segher Boessenkool
  2015-08-14  2:01 ` Michael Ellerman
  0 siblings, 2 replies; 8+ messages in thread
From: Anton Blanchard @ 2015-08-05  4:03 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alan Modra, benh, Michael Ellerman, paulus, Ulrich Weigand,
	Michael Gschwind, Bill Schmidt

[-- Attachment #1: Type: text/plain, Size: 656 bytes --]

Hi,

While looking at traces of kernel workloads, I noticed places where gcc
used a large number of non volatiles. Some of these functions
did very little work, and we spent most of our time saving the
non volatiles to the stack and reading them back.

It made me wonder if we have the right ratio of volatile to non
volatile GPRs. Since the kernel is completely self contained, we could
potentially change that ratio.

Attached is a quick hack to gcc and the kernel to decrease the number
of non volatile GPRs to 8. I'm not sure if this is a good idea (and if
the volatile to non volatile ratio is right), but this gives us
something to play with.

Anton 

[-- Attachment #2: linux-volatiles.patch --]
[-- Type: text/x-patch, Size: 5092 bytes --]

powerpc: Reduce the number of non-volatile GPRs to 8

This requires a hacked gcc.

Signed-off-by: Anton Blanchard <anton@samba.org>
--

Index: linux.junk/arch/powerpc/include/asm/exception-64s.h
===================================================================
--- linux.junk.orig/arch/powerpc/include/asm/exception-64s.h
+++ linux.junk/arch/powerpc/include/asm/exception-64s.h
@@ -336,6 +336,7 @@ do_kvm_##n:								\
 	std	r2,GPR2(r1);		/* save r2 in stackframe	*/ \
 	SAVE_4GPRS(3, r1);		/* save r3 - r6 in stackframe   */ \
 	SAVE_2GPRS(7, r1);		/* save r7, r8 in stackframe	*/ \
+	SAVE_10GPRS(14, r1);						   \
 	mflr	r9;			/* Get LR, later save to stack	*/ \
 	ld	r2,PACATOC(r13);	/* get kernel TOC into r2	*/ \
 	std	r9,_LINK(r1);						   \
Index: linux.junk/arch/powerpc/include/asm/ppc_asm.h
===================================================================
--- linux.junk.orig/arch/powerpc/include/asm/ppc_asm.h
+++ linux.junk/arch/powerpc/include/asm/ppc_asm.h
@@ -77,8 +77,8 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLP
 #ifdef __powerpc64__
 #define SAVE_GPR(n, base)	std	n,GPR0+8*(n)(base)
 #define REST_GPR(n, base)	ld	n,GPR0+8*(n)(base)
-#define SAVE_NVGPRS(base)	SAVE_8GPRS(14, base); SAVE_10GPRS(22, base)
-#define REST_NVGPRS(base)	REST_8GPRS(14, base); REST_10GPRS(22, base)
+#define SAVE_NVGPRS(base)	SAVE_8GPRS(24, base)
+#define REST_NVGPRS(base)	REST_8GPRS(24, base)
 #else
 #define SAVE_GPR(n, base)	stw	n,GPR0+4*(n)(base)
 #define REST_GPR(n, base)	lwz	n,GPR0+4*(n)(base)
Index: linux.junk/arch/powerpc/kernel/asm-offsets.c
===================================================================
--- linux.junk.orig/arch/powerpc/kernel/asm-offsets.c
+++ linux.junk/arch/powerpc/kernel/asm-offsets.c
@@ -289,7 +289,6 @@ int main(void)
 	DEFINE(GPR11, STACK_FRAME_OVERHEAD+offsetof(struct pt_regs, gpr[11]));
 	DEFINE(GPR12, STACK_FRAME_OVERHEAD+offsetof(struct pt_regs, gpr[12]));
 	DEFINE(GPR13, STACK_FRAME_OVERHEAD+offsetof(struct pt_regs, gpr[13]));
-#ifndef CONFIG_PPC64
 	DEFINE(GPR14, STACK_FRAME_OVERHEAD+offsetof(struct pt_regs, gpr[14]));
 	DEFINE(GPR15, STACK_FRAME_OVERHEAD+offsetof(struct pt_regs, gpr[15]));
 	DEFINE(GPR16, STACK_FRAME_OVERHEAD+offsetof(struct pt_regs, gpr[16]));
@@ -308,7 +307,6 @@ int main(void)
 	DEFINE(GPR29, STACK_FRAME_OVERHEAD+offsetof(struct pt_regs, gpr[29]));
 	DEFINE(GPR30, STACK_FRAME_OVERHEAD+offsetof(struct pt_regs, gpr[30]));
 	DEFINE(GPR31, STACK_FRAME_OVERHEAD+offsetof(struct pt_regs, gpr[31]));
-#endif /* CONFIG_PPC64 */
 	/*
 	 * Note: these symbols include _ because they overlap with special
 	 * register names
Index: linux.junk/arch/powerpc/kernel/entry_64.S
===================================================================
--- linux.junk.orig/arch/powerpc/kernel/entry_64.S
+++ linux.junk/arch/powerpc/kernel/entry_64.S
@@ -86,6 +86,18 @@ END_FTR_SECTION_IFSET(CPU_FTR_TM)
 	std	r11,_XER(r1)
 	std	r11,_CTR(r1)
 	std	r9,GPR13(r1)
+
+	std	r14,GPR14(r1)
+	std	r15,GPR15(r1)
+	std	r16,GPR16(r1)
+	std	r17,GPR17(r1)
+	std	r18,GPR18(r1)
+	std	r19,GPR19(r1)
+	std	r20,GPR20(r1)
+	std	r21,GPR21(r1)
+	std	r22,GPR22(r1)
+	std	r23,GPR23(r1)
+
 	mflr	r10
 	/*
 	 * This clears CR0.SO (bit 28), which is the error indication on
@@ -112,6 +124,7 @@ BEGIN_FW_FTR_SECTION
 	cmpd	cr1,r11,r10
 	beq+	cr1,33f
 	bl	accumulate_stolen_time
+	trap
 	REST_GPR(0,r1)
 	REST_4GPRS(3,r1)
 	REST_2GPRS(7,r1)
@@ -225,7 +238,9 @@ END_FTR_SECTION_IFCLR(CPU_FTR_STCX_CHECK
 	ACCOUNT_CPU_USER_EXIT(r11, r12)
 	HMT_MEDIUM_LOW_HAS_PPR
 	ld	r13,GPR13(r1)	/* only restore r13 if returning to usermode */
-1:	ld	r2,GPR2(r1)
+1:
+	REST_10GPRS(14, r1)
+	ld	r2,GPR2(r1)
 	ld	r1,GPR1(r1)
 	mtlr	r4
 	mtcr	r5
@@ -405,10 +420,10 @@ _GLOBAL(ret_from_fork)
 _GLOBAL(ret_from_kernel_thread)
 	bl	schedule_tail
 	REST_NVGPRS(r1)
-	mtlr	r14
-	mr	r3,r15
+	mtlr	r24
+	mr	r3,r25
 #if defined(_CALL_ELF) && _CALL_ELF == 2
-	mr	r12,r14
+	mr	r12,r24
 #endif
 	blrl
 	li	r3,0
@@ -540,8 +555,7 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_1T_SEG
 	mtcrf	0xFF,r6
 
 	/* r3-r13 are destroyed -- Cort */
-	REST_8GPRS(14, r1)
-	REST_10GPRS(22, r1)
+	REST_8GPRS(24, r1)
 
 	/* convert old thread to its task_struct for return value */
 	addi	r3,r3,-THREAD
@@ -771,6 +785,7 @@ fast_exception_return:
 	mtspr	SPRN_XER,r4
 
 	REST_8GPRS(5, r1)
+	REST_10GPRS(14, r1)
 
 	andi.	r0,r3,MSR_RI
 	beq-	unrecov_restore
Index: linux.junk/arch/powerpc/kernel/process.c
===================================================================
--- linux.junk.orig/arch/powerpc/kernel/process.c
+++ linux.junk/arch/powerpc/kernel/process.c
@@ -1207,12 +1207,12 @@ int copy_thread(unsigned long clone_flag
 		childregs->gpr[1] = sp + sizeof(struct pt_regs);
 		/* function */
 		if (usp)
-			childregs->gpr[14] = ppc_function_entry((void *)usp);
+			childregs->gpr[24] = ppc_function_entry((void *)usp);
 #ifdef CONFIG_PPC64
 		clear_tsk_thread_flag(p, TIF_32BIT);
 		childregs->softe = 1;
 #endif
-		childregs->gpr[15] = kthread_arg;
+		childregs->gpr[25] = kthread_arg;
 		p->thread.regs = NULL;	/* no user register state */
 		ti->flags |= _TIF_RESTOREALL;
 		f = ret_from_kernel_thread;

[-- Attachment #3: gcc-volatiles.patch --]
[-- Type: text/x-patch, Size: 2258 bytes --]

powerpc: Reduce the number of non-volatile GPRs to 8

A quick hack to test this change on the Linux kernel.

Signed-off-by: Anton Blanchard <anton@samba.org>
--

Index: gcc/gcc/config/rs6000/rs6000.h
===================================================================
--- gcc.orig/gcc/config/rs6000/rs6000.h
+++ gcc/gcc/config/rs6000/rs6000.h
@@ -1017,8 +1017,8 @@ enum data_align { align_abi, align_opt,
    Aside from that, you can include as many other registers as you like.  */
 
 #define CALL_USED_REGISTERS  \
-  {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, FIXED_R13, 0, 0, \
-   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+  {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, FIXED_R13, 1, 1, \
+   1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, \
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, \
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
    1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1,	   \
@@ -1039,8 +1039,8 @@ enum data_align { align_abi, align_opt,
    of `CALL_USED_REGISTERS'.  */
 
 #define CALL_REALLY_USED_REGISTERS  \
-  {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, FIXED_R13, 0, 0, \
-   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+  {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, FIXED_R13, 1, 1, \
+   1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, \
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, \
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
    1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1,	   \
@@ -1058,7 +1058,7 @@ enum data_align { align_abi, align_opt,
 
 #define FIRST_SAVED_ALTIVEC_REGNO (FIRST_ALTIVEC_REGNO+20)
 #define FIRST_SAVED_FP_REGNO	  (14+32)
-#define FIRST_SAVED_GP_REGNO	  (FIXED_R13 ? 14 : 13)
+#define FIRST_SAVED_GP_REGNO	  24
 
 /* List the order in which to allocate registers.  Each register must be
    listed once, even those in FIXED_REGISTERS.
@@ -1124,8 +1124,8 @@ enum data_align { align_abi, align_opt,
    MAYBE_R2_AVAILABLE						\
    9, 10, 8, 7, 6, 5, 4,					\
    3, EARLY_R12 11, 0,						\
-   31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19,		\
-   18, 17, 16, 15, 14, 13, LATE_R12				\
+   23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 			\
+   31, 30, 29, 28, 27, 26, 25, 24, 13, LATE_R12			\
    66, 65,							\
    1, MAYBE_R2_FIXED 67, 76,					\
    /* AltiVec registers.  */					\

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: RFC: Reducing the number of non volatile GPRs in the ppc64 kernel
  2015-08-05  4:03 RFC: Reducing the number of non volatile GPRs in the ppc64 kernel Anton Blanchard
@ 2015-08-05  4:19 ` Segher Boessenkool
  2015-08-07  5:55   ` Bill Schmidt
  2015-08-14  2:01 ` Michael Ellerman
  1 sibling, 1 reply; 8+ messages in thread
From: Segher Boessenkool @ 2015-08-05  4:19 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: linuxppc-dev, Michael Gschwind, Alan Modra, Bill Schmidt,
	Ulrich Weigand, paulus

Hi Anton,

On Wed, Aug 05, 2015 at 02:03:00PM +1000, Anton Blanchard wrote:
> While looking at traces of kernel workloads, I noticed places where gcc
> used a large number of non volatiles. Some of these functions
> did very little work, and we spent most of our time saving the
> non volatiles to the stack and reading them back.

That is something that should be fixed in GCC -- do you have an example
of such a function?

> It made me wonder if we have the right ratio of volatile to non
> volatile GPRs. Since the kernel is completely self contained, we could
> potentially change that ratio.
> 
> Attached is a quick hack to gcc and the kernel to decrease the number
> of non volatile GPRs to 8. I'm not sure if this is a good idea (and if
> the volatile to non volatile ratio is right), but this gives us
> something to play with.

Instead of the GCC hack you can add a bunch of -fcall-used-r14 etc.
options; does that not work for you?
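The flags Segher names are GCC's documented -fcall-used-<reg> code-generation options, which can run the same experiment per translation unit without a patched compiler. A minimal sketch (the function name and body are illustrative, not from the thread):

```c
/* Sketch: compiling a file like this with
 *
 *   gcc -O2 -fcall-used-r14 -fcall-used-r15 ... -fcall-used-r23 -S demo.c
 *
 * tells GCC that r14..r23 are clobbered by calls, so a callee may use
 * them without saving them, and a caller must not keep values live in
 * them across calls -- the same effect as the attached GCC hack, but
 * selectable from the command line. */
long demo(long a, long b)
{
	long t = a * 3 + b;	/* short-lived temporaries need no
				   non-volatile registers either way */
	return t ^ (t >> 7);
}
```

The catch is that every object linked into the kernel (and any out-of-tree module) must be built with the same set of flags, or the calling convention silently disagrees across the call boundary.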


Segher

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: RFC: Reducing the number of non volatile GPRs in the ppc64 kernel
  2015-08-05  4:19 ` Segher Boessenkool
@ 2015-08-07  5:55   ` Bill Schmidt
  2015-08-10  4:52     ` Anton Blanchard
  0 siblings, 1 reply; 8+ messages in thread
From: Bill Schmidt @ 2015-08-07  5:55 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Alan Modra, Anton Blanchard, linuxppc-dev, Michael Gschwind,
	paulus, Ulrich Weigand


[-- Attachment #1.1: Type: text/plain, Size: 1811 bytes --]


I agree with Segher.  We already know we have opportunities to do a better
job with shrink-wrapping (pushing this kind of useless activity down past
early exits), so having examples of code to look at to improve this would
be useful.

-- Bill

Bill Schmidt, Ph.D.
Linux on Power Toolchain
IBM Linux Technology Center
wschmidt@us.ibm.com   (507) 319-6873






* Re: RFC: Reducing the number of non volatile GPRs in the ppc64 kernel
  2015-08-07  5:55   ` Bill Schmidt
@ 2015-08-10  4:52     ` Anton Blanchard
  2015-08-11 20:08       ` Segher Boessenkool
  2015-08-13 21:04       ` Anton Blanchard
  0 siblings, 2 replies; 8+ messages in thread
From: Anton Blanchard @ 2015-08-10  4:52 UTC (permalink / raw)
  To: Bill Schmidt
  Cc: Segher Boessenkool, Alan Modra, linuxppc-dev, Michael Gschwind,
	paulus, Ulrich Weigand

Hi Bill, Segher,

> I agree with Segher.  We already know we have opportunities to do a
> better job with shrink-wrapping (pushing this kind of useless
> activity down past early exits), so having examples of code to look
> at to improve this would be useful.

I'll look out for specific examples. I noticed this one today when
analysing malloc(8). It is an instruction trace of _int_malloc().

The overall function is pretty huge, which I assume leads to gcc using
so many non volatiles. Perhaps in this case we should separate out the
slow path into another function marked noinline.
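A minimal C sketch of that split, with hypothetical names (this is not glibc's actual structure): the hot path stays call-free with little live state, so its prologue needs no non-volatile saves, while the register-hungry body sits behind a noinline boundary and pays the save/restore cost only when entered.

```c
#include <stdlib.h>

/* Slow path in its own frame: the large bin-searching/consolidation
 * work (and its many callee-saved registers) would live here. */
__attribute__((noinline))
static void *alloc_slow(size_t bytes)
{
	return malloc(bytes);
}

static __thread void *cached;	/* e.g. a one-entry per-thread cache */

static void *alloc_fast(size_t bytes)
{
	if (cached) {			/* fast path: no calls, few live regs,
					   so no non-volatile saves needed */
		void *p = cached;
		cached = NULL;
		return p;
	}
	return alloc_slow(bytes);	/* rare path, separate prologue */
}
```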

This is just an upstream glibc build, but I'll send the preprocessed
source off list.

Anton
--

0x410d538       mflr    r0
0x410d53c       li      r9,-65
0x410d540       std     r14,-144(r1)     # 0x0000000fff00efe0
0x410d544       std     r15,-136(r1)     # 0x0000000fff00efe8
0x410d548       cmpld   cr7,r4,r9
0x410d54c       std     r16,-128(r1)     # 0x0000000fff00eff0
0x410d550       std     r17,-120(r1)     # 0x0000000fff00eff8
0x410d554       std     r18,-112(r1)     # 0x0000000fff00f000
0x410d558       std     r19,-104(r1)     # 0x0000000fff00f008
0x410d55c       std     r20,-96(r1)      # 0x0000000fff00f010
0x410d560       std     r21,-88(r1)      # 0x0000000fff00f018
0x410d564       std     r22,-80(r1)      # 0x0000000fff00f020
0x410d568       std     r23,-72(r1)      # 0x0000000fff00f028
0x410d56c       std     r0,16(r1)        # 0x0000000fff00f080
0x410d570       std     r24,-64(r1)      # 0x0000000fff00f030
0x410d574       std     r25,-56(r1)      # 0x0000000fff00f038
0x410d578       std     r26,-48(r1)      # 0x0000000fff00f040
0x410d57c       std     r27,-40(r1)      # 0x0000000fff00f048
0x410d580       std     r28,-32(r1)      # 0x0000000fff00f050
0x410d584       std     r29,-24(r1)      # 0x0000000fff00f058
0x410d588       std     r30,-16(r1)      # 0x0000000fff00f060
0x410d58c       std     r31,-8(r1)       # 0x0000000fff00f068
0x410d590       stdu    r1,-224(r1)      # 0x0000000fff00ef90
0x410d594       bgt     cr7,0x410dda4
0x410d598       addi    r9,r4,23
0x410d59c       li      r16,32
0x410d5a0       cmpldi  cr7,r9,31
0x410d5a4       bgt     cr7,0x410d700
0x410d5a8       cmpdi   cr7,r3,0
0x410d5ac       mr      r14,r3
0x410d5b0       mr      r30,r4
0x410d5b4       beq     cr7,0x410ddc0
0x410d5b8       nop
0x410d5bc       ld      r9,-19136(r2)    # 0x0000000004222840
0x410d5c0       rlwinm  r29,r16,28,4,31
0x410d5c4       cmpld   cr7,r16,r9
0x410d5c8       bgt     cr7,0x410d650
0x410d5cc       addi    r6,r29,-2
0x410d5d0       clrldi  r9,r6,32
0x410d5d4       rldicr  r10,r9,3,60
0x410d5d8       addi    r7,r9,1
0x410d5dc       add     r10,r3,r10
0x410d5e0       rldicr  r7,r7,3,60
0x410d5e4       add     r7,r3,r7
0x410d5e8       ld      r9,8(r10)        # 0x0000000004220ce0
0x410d5ec       cmpdi   cr7,r9,0
0x410d5f0       beq     cr7,0x410d650
0x410d5f4       ld      r10,16(r9)       # 0x0000000010030010
0x410d5f8       ldarx   r15,0,r7,1       # 0x0000000004220ce0
0x410d5fc       cmpd    r15,r9
0x410d600       bne     0x410d60c
0x410d604       stdcx.  r10,0,r7         # 0x0000000004220ce0
0x410d608       bne-    0x410d5f8
0x410d60c       isync
0x410d610       cmpld   cr7,r15,r9
0x410d614       bne     cr7,0x410d648
0x410d618       b       0x410da40
0x410da40       ld      r9,8(r15)        # 0x0000000010030008
0x410da44       rlwinm  r9,r9,28,4,31
0x410da48       addi    r9,r9,-2
0x410da4c       cmplw   cr7,r9,r6
0x410da50       bne     cr7,0x410de08
0x410da54       nop
0x410da58       addi    r31,r15,16
0x410da5c       lwa     r9,-19080(r2)    # 0x0000000004222878
0x410da60       cmpdi   cr7,r9,0
0x410da64       bne     cr7,0x410d6e4
0x410da68       addi    r1,r1,224
0x410da6c       mr      r3,r31
0x410da70       ld      r0,16(r1)        # 0x0000000fff00f080
0x410da74       ld      r14,-144(r1)     # 0x0000000fff00efe0
0x410da78       ld      r15,-136(r1)     # 0x0000000fff00efe8
0x410da7c       ld      r16,-128(r1)     # 0x0000000fff00eff0
0x410da80       ld      r17,-120(r1)     # 0x0000000fff00eff8
0x410da84       ld      r18,-112(r1)     # 0x0000000fff00f000
0x410da88       ld      r19,-104(r1)     # 0x0000000fff00f008
0x410da8c       ld      r20,-96(r1)      # 0x0000000fff00f010
0x410da90       ld      r21,-88(r1)      # 0x0000000fff00f018
0x410da94       ld      r22,-80(r1)      # 0x0000000fff00f020
0x410da98       ld      r23,-72(r1)      # 0x0000000fff00f028
0x410da9c       ld      r24,-64(r1)      # 0x0000000fff00f030
0x410daa0       mtlr    r0
0x410daa4       ld      r25,-56(r1)      # 0x0000000fff00f038
0x410daa8       ld      r26,-48(r1)      # 0x0000000fff00f040
0x410daac       ld      r27,-40(r1)      # 0x0000000fff00f048
0x410dab0       ld      r28,-32(r1)      # 0x0000000fff00f050
0x410dab4       ld      r29,-24(r1)      # 0x0000000fff00f058
0x410dab8       ld      r30,-16(r1)      # 0x0000000fff00f060
0x410dabc       ld      r31,-8(r1)       # 0x0000000fff00f068
0x410dac0       blr


* Re: RFC: Reducing the number of non volatile GPRs in the ppc64 kernel
  2015-08-10  4:52     ` Anton Blanchard
@ 2015-08-11 20:08       ` Segher Boessenkool
  2015-08-11 22:18         ` Segher Boessenkool
  2015-08-13 21:04       ` Anton Blanchard
  1 sibling, 1 reply; 8+ messages in thread
From: Segher Boessenkool @ 2015-08-11 20:08 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Bill Schmidt, Alan Modra, linuxppc-dev, Michael Gschwind, paulus,
	Ulrich Weigand

On Mon, Aug 10, 2015 at 02:52:28PM +1000, Anton Blanchard wrote:
> Hi Bill, Segher,
> 
> > I agree with Segher.  We already know we have opportunities to do a
> > better job with shrink-wrapping (pushing this kind of useless
> > activity down past early exits), so having examples of code to look
> > at to improve this would be useful.
> 
> I'll look out for specific examples. I noticed this one today when
> analysing malloc(8). It is an instruction trace of _int_malloc().
> 
> The overall function is pretty huge, which I assume leads to gcc using
> so many non volatiles.

That is one part of it; also GCC deals out volatiles too generously.

> Perhaps in this case we should separate out the
> slow path into another function marked noinline.

Or GCC could do that, effectively at least.

> This is just an upstream glibc build, but I'll send the preprocessed
> source off list.

Thanks :-)

[snip code]

After the prologue there are 46 insns executed before the epilogue.
Many of those are conditional branches (that are not executed); it is
all fall-through until it jumps to the "tail" (the few insns before
the epilogue).  GCC knows how to duplicate a tail so that it can do
shrink-wrapping (the original tail needs to be followed by an epilogue,
the duplicated one does not want one); but it can only do it in very
simple cases (one basic block or at least no control flow), and that
is not the case here.  We need to handle more generic tails.

This seems related to (if not the same as!) <http://gcc.gnu.org/PR51982>.
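The shape Segher describes can be illustrated with a small (hypothetical) C function: the early exit needs no saved registers, but without shrink-wrapping the prologue still saves everything the slow body uses before the test ever runs.

```c
/* The `!table` exit would like to run before any register saves, but
 * the loop below keeps `best`, `i`, `table` and `key` live across its
 * iterations, so GCC allocates callee-saved registers for the body and
 * (absent shrink-wrapping) saves them up front for every call. */
long lookup(const long *table, long key)
{
	if (!table)
		return -1;	/* early exit: pays the full prologue anyway */
	long best = -1;
	for (int i = 0; i < 64; i++)
		if (table[i] == key)
			best = i;
	return best;
}
```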


Segher


* Re: RFC: Reducing the number of non volatile GPRs in the ppc64 kernel
  2015-08-11 20:08       ` Segher Boessenkool
@ 2015-08-11 22:18         ` Segher Boessenkool
  0 siblings, 0 replies; 8+ messages in thread
From: Segher Boessenkool @ 2015-08-11 22:18 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Michael Gschwind, Alan Modra, Bill Schmidt, Ulrich Weigand,
	paulus, linuxppc-dev

On Tue, Aug 11, 2015 at 03:08:29PM -0500, Segher Boessenkool wrote:
> [snip code]
> 
> After the prologue there are 46 insns executed before the epilogue.
> Many of those are conditional branches (that are not executed); it is
> all fall-through until it jumps to the "tail" (the few insns before
> the epilogue).  GCC knows how to duplicate a tail so that it can do
> shrink-wrapping (the original tail needs to be followed by an epilogue,
> the duplicated one does not want one); but it can only do it in very
> simple cases (one basic block or at least no control flow), and that
> is not the case here.  We need to handle more generic tails.

And never mind the elephant in the room: the "fastpath" instructions
already use a few non-volatile registers, and the shrink-wrap pass
(which runs after register allocation) cannot fix that.  Ugh.

> This seems related to (if not the same as!) <http://gcc.gnu.org/PR51982>.

This has that same problem, too.


Segher


* Re: RFC: Reducing the number of non volatile GPRs in the ppc64 kernel
  2015-08-10  4:52     ` Anton Blanchard
  2015-08-11 20:08       ` Segher Boessenkool
@ 2015-08-13 21:04       ` Anton Blanchard
  1 sibling, 0 replies; 8+ messages in thread
From: Anton Blanchard @ 2015-08-13 21:04 UTC (permalink / raw)
  To: Bill Schmidt
  Cc: Segher Boessenkool, Alan Modra, linuxppc-dev, Michael Gschwind,
	paulus, Ulrich Weigand, David Edelsohn

Hi,

Here is another instruction trace from a kernel context switch trace.
Quite a lot of register and CR save/restore code.

Regards,
Anton

c0000000002943d8 <fsnotify+0x8> mfcr    r12
c0000000002943dc <fsnotify+0xc> std     r20,-96(r1)
c0000000002943e0 <fsnotify+0x10> std     r21,-88(r1)
c0000000002943e4 <fsnotify+0x14> rldicl. r9,r4,63,63
c0000000002943e8 <fsnotify+0x18> std     r22,-80(r1)
c0000000002943ec <fsnotify+0x1c> mflr    r0
c0000000002943f0 <fsnotify+0x20> std     r24,-64(r1)
c0000000002943f4 <fsnotify+0x24> std     r25,-56(r1)
c0000000002943f8 <fsnotify+0x28> std     r26,-48(r1)
c0000000002943fc <fsnotify+0x2c> std     r27,-40(r1)
c000000000294400 <fsnotify+0x30> std     r31,-8(r1)
c000000000294404 <fsnotify+0x34> std     r15,-136(r1)
c000000000294408 <fsnotify+0x38> stw     r12,8(r1)
c00000000029440c <fsnotify+0x3c> std     r16,-128(r1)
c000000000294410 <fsnotify+0x40> mcrf    cr4,cr0
c000000000294414 <fsnotify+0x44> std     r0,16(r1)
c000000000294418 <fsnotify+0x48> std     r17,-120(r1)
c00000000029441c <fsnotify+0x4c> std     r18,-112(r1)
c000000000294420 <fsnotify+0x50> std     r19,-104(r1)
c000000000294424 <fsnotify+0x54> std     r23,-72(r1)
c000000000294428 <fsnotify+0x58> std     r28,-32(r1)
c00000000029442c <fsnotify+0x5c> std     r29,-24(r1)
c000000000294430 <fsnotify+0x60> std     r30,-16(r1)
c000000000294434 <fsnotify+0x64> stdu    r1,-272(r1)
c000000000294438 <fsnotify+0x68> cmpwi   cr7,r6,1
c00000000029443c <fsnotify+0x6c> rlwinm  r31,r4,4,1,31
c000000000294440 <fsnotify+0x70> li      r9,0
c000000000294444 <fsnotify+0x74> rotlwi  r31,r31,28
c000000000294448 <fsnotify+0x78> mr      r24,r6
c00000000029444c <fsnotify+0x7c> mr      r26,r4
c000000000294450 <fsnotify+0x80> mr      r25,r3
c000000000294454 <fsnotify+0x84> mr      r22,r5
c000000000294458 <fsnotify+0x88> mr      r21,r7
c00000000029445c <fsnotify+0x8c> mr      r20,r8
c000000000294460 <fsnotify+0x90> std     r9,120(r1)
c000000000294464 <fsnotify+0x94> std     r9,112(r1)
c000000000294468 <fsnotify+0x98> clrldi  r27,r31,32
c00000000029446c <fsnotify+0x9c> beq     cr7,c000000000294888 <fsnotify+0x4b8> 
c000000000294888 <fsnotify+0x4b8> ld      r29,0(r5)
c00000000029488c <fsnotify+0x4bc> addi    r29,r29,-32
c000000000294890 <fsnotify+0x4c0> beq     c000000000294478 <fsnotify+0xa8> 
c000000000294478 <fsnotify+0xa8> lwz     r9,516(r25)
c00000000029447c <fsnotify+0xac> and     r10,r9,r31
c000000000294480 <fsnotify+0xb0> cmpwi   r10,0
c000000000294484 <fsnotify+0xb4> bne     c0000000002945d0 <fsnotify+0x200> 
c000000000294488 <fsnotify+0xb8> cmpdi   cr7,r29,0
c00000000029448c <fsnotify+0xbc> beq     cr7,c0000000002948c4 <fsnotify+0x4f4> 
c000000000294490 <fsnotify+0xc0> lwz     r9,264(r29)
c000000000294494 <fsnotify+0xc4> and     r10,r9,r31
c000000000294498 <fsnotify+0xc8> cmpwi   r10,0
c00000000029449c <fsnotify+0xcc> beq     c0000000002948c4 <fsnotify+0x4f4> 
c0000000002948c4 <fsnotify+0x4f4> li      r3,0
c0000000002948c8 <fsnotify+0x4f8> b       c0000000002947cc <fsnotify+0x3fc> 
c0000000002947cc <fsnotify+0x3fc> addi    r1,r1,272
c0000000002947d0 <fsnotify+0x400> ld      r0,16(r1)
c0000000002947d4 <fsnotify+0x404> lwz     r12,8(r1)
c0000000002947d8 <fsnotify+0x408> ld      r15,-136(r1)
c0000000002947dc <fsnotify+0x40c> ld      r16,-128(r1)
c0000000002947e0 <fsnotify+0x410> mtlr    r0
c0000000002947e4 <fsnotify+0x414> ld      r17,-120(r1)
c0000000002947e8 <fsnotify+0x418> ld      r18,-112(r1)
c0000000002947ec <fsnotify+0x41c> mtocrf  32,r12
c0000000002947f0 <fsnotify+0x420> mtocrf  16,r12
c0000000002947f4 <fsnotify+0x424> mtocrf  8,r12
c0000000002947f8 <fsnotify+0x428> ld      r19,-104(r1)
c0000000002947fc <fsnotify+0x42c> ld      r20,-96(r1)
c000000000294800 <fsnotify+0x430> ld      r21,-88(r1)
c000000000294804 <fsnotify+0x434> ld      r22,-80(r1)
c000000000294808 <fsnotify+0x438> ld      r23,-72(r1)
c00000000029480c <fsnotify+0x43c> ld      r24,-64(r1)
c000000000294810 <fsnotify+0x440> ld      r25,-56(r1)
c000000000294814 <fsnotify+0x444> ld      r26,-48(r1)
c000000000294818 <fsnotify+0x448> ld      r27,-40(r1)
c00000000029481c <fsnotify+0x44c> ld      r28,-32(r1)
c000000000294820 <fsnotify+0x450> ld      r29,-24(r1)
c000000000294824 <fsnotify+0x454> ld      r30,-16(r1)
c000000000294828 <fsnotify+0x458> ld      r31,-8(r1)
c00000000029482c <fsnotify+0x45c> blr


* Re: RFC: Reducing the number of non volatile GPRs in the ppc64 kernel
  2015-08-05  4:03 RFC: Reducing the number of non volatile GPRs in the ppc64 kernel Anton Blanchard
  2015-08-05  4:19 ` Segher Boessenkool
@ 2015-08-14  2:01 ` Michael Ellerman
  1 sibling, 0 replies; 8+ messages in thread
From: Michael Ellerman @ 2015-08-14  2:01 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: linuxppc-dev, Alan Modra, benh, paulus, Ulrich Weigand,
	Michael Gschwind, Bill Schmidt

On Wed, 2015-08-05 at 14:03 +1000, Anton Blanchard wrote:
> Hi,
> 
> While looking at traces of kernel workloads, I noticed places where gcc
> used a large number of non volatiles. Some of these functions
> did very little work, and we spent most of our time saving the
> non volatiles to the stack and reading them back.
> 
> It made me wonder if we have the right ratio of volatile to non
> volatile GPRs. Since the kernel is completely self contained, we could
> potentially change that ratio.
> 
> Attached is a quick hack to gcc and the kernel to decrease the number
> of non volatile GPRs to 8. I'm not sure if this is a good idea (and if
> the volatile to non volatile ratio is right), but this gives us
> something to play with.

OK, interesting idea. Can't say I'd ever thought of that.

I'm thinking we'd want some pretty solid analysis of the resulting code-gen and
real world perf before we made a switch like that.

Presumably it's going to hurt our null syscall, due to the added save/restores,
but hopefully help with paths that do actual work.

If the caller is actually using the non-volatiles then presumably it will be a
wash, because the caller will have to do the save anyway. Though maybe it would
still be a win because the caller can do the saves & restores when it needs to
rather than all in a block.

I'm also not clear on how it would affect folks who build modules separate from
the kernel. We'd have to make sure they had the right GCC, or things would go
badly wrong, unless it can be done with command line flags? I don't know how
much we care about that but distros presumably do.

cheers


Thread overview: 8 messages
2015-08-05  4:03 RFC: Reducing the number of non volatile GPRs in the ppc64 kernel Anton Blanchard
2015-08-05  4:19 ` Segher Boessenkool
2015-08-07  5:55   ` Bill Schmidt
2015-08-10  4:52     ` Anton Blanchard
2015-08-11 20:08       ` Segher Boessenkool
2015-08-11 22:18         ` Segher Boessenkool
2015-08-13 21:04       ` Anton Blanchard
2015-08-14  2:01 ` Michael Ellerman
